2005-04-09 19:44:22

by Linus Torvalds

Subject: more git updates..


Sorry guys,
several of you have sent me small fixes and scripts to "git", but I've
been busy on breaking/changing the core infrastructure, so I didn't get
around to looking at the scripts yet.

The good news is, the data structures/indexes haven't changed, but many of
the tools to interface with them have new (and improved!) semantics:

In particular, I changed how "read-tree" works, so that it now mirrors
"write-tree", in that instead of actually changing the working directory,
it only updates the index file (aka the "current directory cache" file) from
the tree.

To actually change the working directory, you'd first get the index file
set up, and then you do a "checkout-cache -a" to update the files in your
working directory with the files from the sha1 database.

Also, I wrote the "diff-tree" thing I talked about:

torvalds@ppc970:~/git> ./diff-tree 8fd07d4b7778cd0233ea0a17acd3fe9d710af035 8c6d29d6a496d12f1c224db945c0c56fd60ce941 | tr '\0' '\n'
<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile
<100664 9e1bee21e17c134a2fb008db62679048fc819528 cache.h
>100664 56ef561e590fd99e938bd47fd1f2c7ed46126ff0 cache.h
<100664 fd690acc02ef9c06d7c4c3541f69b10ca4b4f8c9 cat-file.c
>100664 6e6d89291ced17a406e64b97fe8bb96a22eefc9d cat-file.c
+100664 fd00e5603dcc4a93acceda0b8cb914fabc8645d5 checkout-cache.c
<100664 a4a8c3d9ef0c4cc6c82b96b5d1a91ac6d3bed466 commit-tree.c
>100664 236ceb7646e3f5d110fd83f815b82e94cc5b2927 commit-tree.c
+100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
<100664 0eaa053919e0cc400ab9bc40d9272360117e6978 init-db.c
>100664 815743e92dad7e451c65bab01448ee8ae9deeb56 init-db.c
<100664 e7bfaadd5d2331123663a8f14a26604a3cdcb678 read-cache.c
>100664 71d0cb6fe9b7ff79e3b2c5a61e288ac9f62b39dc read-cache.c
<100664 ec0f167a6a505659e5af6911c97f465506534c34 read-tree.c
>100664 f5c50ba79d02f002b9675fd8f129fa388e3282c6 read-tree.c
<100664 00a29c403e751c2a2a61eb24fa2249c8956d1c80 show-diff.c
>100664 b963dd738989bc92bf02352bbedad13a74e66a7d show-diff.c
<100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
>100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
<100664 7abeeba116b2b251c12ae32c7b38cb048199b574 write-tree.c
>100664 9525c6fc975888a394477339db86216cd5bd5d7c write-tree.c

(ie the output of "diff-tree" has the same NUL-termination, but if you
insist on getting ASCII output, you can just use "tr" to change the NUL
into a NL).

The format of the "diff-tree" output is that the first character is "-"
for "remove file", "+" for "add file" and "<"/">" for "change file" (where
the "<" shows the old state, and ">" shows the new state).

Btw, the NUL-termination makes this really easy to use even in shell
scripts, ie you can do

diff-tree <sha1> <sha1> | xargs -0 do_something

and you'll get each line as one nice argument to your "do_something"
script. So a do_diff could be based on something like

#!/bin/sh
# Each argument is one diff-tree record: "<marker><mode> <sha1> <filename>"
while [ "$1" != "" ]; do
    filename="$(printf '%s\n' "$1" | cut -d' ' -f3-)"
    first_sha="$(printf '%s\n' "$1" | cut -d' ' -f2)"
    second_sha="$(printf '%s\n' "$2" | cut -d' ' -f2)"
    c="$(printf '%s\n' "$1" | cut -c1)"
    case "$c" in
    "+")
        echo diff -u /dev/null "$filename($first_sha)";;
    "-")
        echo diff -u "$filename($first_sha)" /dev/null;;
    "<")
        # the matching ">" record is the next argument; consume it too
        echo diff -u "$filename($first_sha)" "$filename($second_sha)"
        shift;;
    *)
        echo "WHAT?"
        exit 1;;
    esac
    shift
done

which really shows what a horrid shell-person I am (I still use the old
tools I learnt to use fifteen years ago. I bet you can do it trivially in
perl or something sane, and I'm just stuck in the stone age of UNIX).
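
For what it's worth, the same idea can be sketched in Python. This is only an illustration and not part of git: the function name diff_commands is made up, and it assumes the record layout shown above (marker character, mode, sha1, then the filename, each record NUL-terminated). It reads the whole stream in one go, so a "<" record always stays next to its ">" partner.

```python
from typing import List


def diff_commands(raw: bytes) -> List[List[str]]:
    """Turn NUL-terminated diff-tree output into diff argument lists.

    Hypothetical helper: like the shell version, it only builds the
    commands and does not run them.
    """
    records = [r.decode() for r in raw.split(b"\0") if r]
    commands = []
    i = 0
    while i < len(records):
        marker = records[i][0]
        # The filename is everything after the second space, so it may
        # itself contain spaces.
        _mode, sha, name = records[i][1:].split(" ", 2)
        if marker == "+":
            commands.append(["diff", "-u", "/dev/null", f"{name}({sha})"])
        elif marker == "-":
            commands.append(["diff", "-u", f"{name}({sha})", "/dev/null"])
        elif marker == "<":
            if i + 1 >= len(records) or records[i + 1][0] != ">":
                raise ValueError("'<' record without a following '>' record")
            new_sha = records[i + 1][1:].split(" ", 2)[1]
            commands.append(["diff", "-u", f"{name}({sha})", f"{name}({new_sha})"])
            i += 1  # skip the '>' record we just consumed
        else:
            raise ValueError(f"unexpected record: {records[i]!r}")
        i += 1
    return commands
```

Feeding it the bytes read from a "diff-tree <sha1> <sha1>" pipe gives one argument list per changed file.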

That makes it _very_ easy to parse. The example above is the diff between
the initial commit and one of the more recent trees, so it has changes to
everything, but a more normal thing would be

torvalds@ppc970:~/git> diff-tree 787763499dc4f8cc345bc6ed8ee1e0ae31adedd6 5b0c2695634b5bab2f5d63c7bb30f7e5815af470 | tr '\0' '\n'
<100664 01c92f2620a8e13e7cb7fd98ee644c6b65eeccb7 fsck-cache.c
>100664 81aa7bee003264ea302db835158e725eefa4012d fsck-cache.c

which tells you that the last commit changed just one file (it's from this
one:

torvalds@ppc970:~/git> cat-file commit `cat .dircache/HEAD`
tree 5b0c2695634b5bab2f5d63c7bb30f7e5815af470
parent 81c53a1d3551f358860731481bb2d87179d221e6
author Linus Torvalds <[email protected]> Sat Apr 9 12:02:30 2005
committer Linus Torvalds <[email protected]> Sat Apr 9 12:02:30 2005

Make "fsck-cache" print out all the root commits it finds.

Once I do the reference tracking, I'll also make it print out all the
HEAD commits it finds, which is even more interesting.

in case you care).

I've rsync'ed the new git repository to kernel.org, it should all be there
in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
mirror scripts already picked it up on the public side too).

Can you guys re-send the scripts you wrote? They probably need some
updating for the new semantics. Sorry about that ;(

Linus


2005-04-09 19:54:33

by Linus Torvalds

Subject: Re: more git updates..



On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

Btw, this will not overwrite any old files, so if you have an old version
of something, you'd need to do "checkout-cache -f -a" (and order matters:
the "-f" must come first). This time I actually have a big comment at the
top of the checkout-cache.c file trying to explain the logic.

Linus

2005-04-09 20:07:21

by Petr Baudis

Subject: Re: more git updates..

Hello,

Dear diary, on Sat, Apr 09, 2005 at 09:45:52PM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
> The good news is, the data structures/indexes haven't changed, but many of
> the tools to interface with them have new (and improved!) semantics:
>
> In particular, I changed how "read-tree" works, so that it now mirrors
> "write-tree", in that instead of actually changing the working directory,
> it only updates the index file (aka "current directory cache" file from
> the tree).
>
> To actually change the working directory, you'd first get the index file
> setup, and then you do a "checkout-cache -a" to update the files in your
> working directory with the files from the sha1 database.

that's great. I was planning to do something about this, since it
currently really annoys me. I think I will like this, even though I haven't
looked at the code itself yet (I'm just about to).

> Also, I wrote the "diff-tree" thing I talked about:
..snip..

Hmm, I wonder, is this better done in C instead of a simple shell
script, like my gitdiff.sh? I'd say it is more flexible and probably
hardly performance-critical to have this scripted, and not difficult at
all provided you have ls-tree. But maybe I'm just too fond of my
script... ;-) (Ok, there's some trouble when you want to have newlines
and spaces in file names, and join appears to be awfully ignorant about
this... :[ )

BTW, do we care about changed modes? If so, they should probably have
their place in the diff-tree output.

BTW#2, I hope you will merge my ls-tree anyway, even though there is no
user for it currently... I should quickly figure out some. :-)

> Can you guys re-send the scripts you wrote? They probably need some
> updating for the new semantics. Sorry about that ;(

I'll try to merge ASAP.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-09 20:58:28

by Linus Torvalds

Subject: Re: more git updates..



On Sat, 9 Apr 2005, Petr Baudis wrote:
>
> > Also, I wrote the "diff-tree" thing I talked about:
> ..snip..
>
> Hmm, I wonder, is this better done in C instead of a simple shell
> script, like my gitdiff.sh?

With 17,000 files in the kernel, and most commits just changing a small
number of them, I actually think "diff-tree" matters. You use "join"
(which is quite reasonable), but let's put it this way: just the list of
files in the current kernel is about half a megabyte of data. Ie your
temporary files that you use in the "ls-tree + ls-tree + join" is actually
going to be quite sizeable.

My goal here is that the speed of "git" really should be almost totally
independent of the size of the project. You clearly cannot avoid _some_
size-dependency: my "diff-tree" clearly also has to work through the same
1MB of data, but I think it's worth making the constant factor be as small
as humanly possible.

I just tried checking in a kernel tree tar-file, and the initial checkin
(which is all the compression and the sha1 calculations for every single
file) took about 1:35 (minutes, not hours ;).

Doing a commit (trivial change to the top-level Makefile) and then doing a
"diff-tree" between those two things took 0.05 seconds using my C thing. Ie
we're talking so fast that we really don't care.

Doing a "show-diff" takes 0.15 secs or so (that's all the "stat" calls),
and now that I test it out I realize that the most expensive operation is
actually _writing_ the "index" file out. These are the two most expensive
steps:

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> time update-cache Makefile

real 0m0.283s
user 0m0.171s
sys 0m0.113s


torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> time write-tree
5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a

real 0m0.441s
user 0m0.354s
sys 0m0.087s

ie with the current infrastructure it looks like I can do a "patch +
commit" in less than one second on the kernel, and 0.75 secs of that is
because the "tree" file actually grows pretty large:

cat-file tree 5ca21c9d808fa4bee1eb6948a59dfb9c7d73f36a | wc -c

says that the uncompressed tree-file is 950,874 bytes. Compressing it
means that the archival version of it is "just" 462,546 bytes, but this is
really the part that is going to eat _tons_ of disk-space.

In other words, each "commit" file is very small and cheap, but since
almost every commit will also imply a totally new tree-file, "git" is
going to have an overhead of half a megabyte per commit. Oops.

Damn, that's painful. I suspect I will have to change the format somehow.

One option (which I haven't tested yet) is that since the tree-file is
already sorted, I could always write it out with the common subdirectory
part "collapsed", ie instead of writing

...
include/asm-i386/mach-default/bios_ebda.h
include/asm-i386/mach-default/do_timer.h
...

I'd write just

...
///bios_ebda.h
///do_timer.h
...

since the directory names are implied by the predecessor.
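
A rough Python sketch of that collapsing idea, assuming each leading "/" stands for one directory component shared with the previous path and that all paths are relative (never starting with "/"). The helper names collapse and expand are hypothetical; nothing like this exists in git.

```python
from typing import Iterable, List


def collapse(paths: Iterable[str]) -> List[str]:
    """Replace directory components shared with the previous path by '/'."""
    out = []
    prev_dirs: List[str] = []
    for path in paths:
        parts = path.split("/")
        dirs = parts[:-1]
        shared = 0
        while (shared < len(dirs) and shared < len(prev_dirs)
               and dirs[shared] == prev_dirs[shared]):
            shared += 1
        out.append("/" * shared + "/".join(parts[shared:]))
        prev_dirs = dirs
    return out


def expand(lines: Iterable[str]) -> List[str]:
    """Inverse of collapse(): rebuild the full paths."""
    out = []
    prev_dirs: List[str] = []
    for line in lines:
        rest = line.lstrip("/")
        shared = len(line) - len(rest)  # number of leading slashes
        parts = prev_dirs[:shared] + rest.split("/")
        out.append("/".join(parts))
        prev_dirs = parts[:-1]
    return out
```

Because the input is sorted, neighbouring entries tend to share most of their directory prefix, which is where the saving would come from.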

However, that doesn't help with the 20-byte sha1 associated with each
file, which is also obviously incompressible, so with 17,000+ files, we
have a minimum overhead of about 350kB per tree-file.

So even if I did the pathname compression, it wouldn't help all that much.
I'd only be removing the only part of the file that _is_ very
compressible, and I'd probably end up with something that isn't all that
far away from the 450kB+ it is now.

I suspect that I have to change the file format. Maybe make the "tree"
object a two-level thing, and have a "directory" object.

Then a "tree" object would point to a "directory" object, which would in
turn point to the individual files (and other "directory" objects, of
course). That way a commit that only changes a few files will only need to
create a few new "directory" objects, instead of creating one huge "tree"
object.
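
To make the saving concrete, here is a small Python sketch of per-directory tree objects. It is only a toy model: the text layout of the tree entries, the in-memory dict used as the object store, and the names put and write_tree are assumptions for illustration, not git's actual object format.

```python
import hashlib
from typing import Dict


def put(store: Dict[str, bytes], data: bytes) -> str:
    """Store data under the hex SHA-1 of its content and return that name."""
    name = hashlib.sha1(data).hexdigest()
    store[name] = data
    return name


def write_tree(store: Dict[str, bytes], tree: dict) -> str:
    """Write one object per directory and return the name of the top one.

    `tree` maps a name to bytes (file content) or to a nested dict
    (subdirectory).
    """
    lines = []
    for name in sorted(tree):
        node = tree[name]
        if isinstance(node, dict):
            lines.append("40000 %s %s" % (write_tree(store, node), name))
        else:
            lines.append("100644 %s %s" % (put(store, node), name))
    return put(store, "\n".join(lines).encode())
```

Writing the same tree twice with one file changed adds only the new file object plus one directory object per level on the path to that file; every untouched directory hashes to the name it already had.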

Sadly, that will make "diff-tree" potentially more expensive. On the other
hand, maybe not: it will also speed it _up_, since directories that are
totally shared will be trivially seen as such and need no further
operation.

Thoughts? That would break the current repository formats, and I'd have to
create a converter thing (which shouldn't be that bad, of course).

I don't have to do it right now. In fact, I'd almost prefer for the
current thing to become good enough that it's not painful to work with,
since right now I'm using it to develop itself. Then I can convert the
format with an automated script later, before I actually start working on
the kernel...

> BTW, do we care about changed modes? If so, they should probably have
> their place in the diff-tree output.

They're there. If you want to ignore them, you can just notice that the
sha1 matches between two lines, and then you don't even have to diff them.

Linus

2005-04-09 21:06:27

by Linus Torvalds

Subject: Re: more git updates..



On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> I suspect that I have to change the file format. Maybe make the "tree"
> object a two-level thing, and have a "directory" object.
>
> Then a "tree" object would point to a "directory" object, which would in
> turn point to the individual files (and other "directory" objects, of
> course). That way a commit that only changes a few files will only need to
> create a few new "directory" objects, instead of creating one huge "tree"
> object.

Actually, I guess I wouldn't have to change the format. I could just
extend the existing "tree" object to be able to point to other trees, and
that's it.

The downside of that is that then a tree wouldn't have a canonical format
any more: you could have two trees that have the exact same content, but
they'd have different names. They should obviously merge very easily (and
thus you could create a new merge that _does_ have a common name), but
it's ugly.

I'll have to think about it. It's good to notice these issues early, this
was the first time I had actually tried to check in a kernel-sized tree
for real.

Linus

2005-04-09 22:02:03

by Paul Jackson

Subject: Re: more git updates..

Linus wrote:
> the NUL-termination makes this really easy to use even in shell

grumble ...

> I still use the old tools I learnt to use fifteen years ago

newcomer ;)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-09 23:25:28

by Ralph Corderoy

Subject: Re: more git updates..


Hi Linus,

> Btw, the NUL-termination makes this really easy to use even in shell
> scripts, ie you can do
>
> diff-tree <sha1> <sha1> | xargs -0 do_something
>
> and you'll get each line as one nice argument to your "do_something"
> script. So a do_diff could be based on something like
>
> #!/bin/sh

Watch out for when xargs invokes do_something more than once and the `<'
is parsed by a different one than the `>'. A `while read ...; do ...
done' would avoid that, but wouldn't like the NULs instead of LFs.

Cheers,


Ralph.

2005-04-09 23:30:03

by Linus Torvalds

Subject: Re: more git updates..



On Sat, 9 Apr 2005, Linus Torvalds wrote:
>
> Actually, I guess I wouldn't have to change the format. I could just
> extend the existing "tree" object to be able to point to other trees, and
> that's it.

Done, and pushed out. The current git.git repository seems to do all of
this correctly.

NOTE! This means that each "tree" file basically tracks just a single
directory. The old style of "every file in one tree file" still works, but
fsck-cache will warn about it. Happily, the git archive itself doesn't
have any subdirectories, so git itself is not impacted by it.

Now, this means that I should add a "recursive" option to "diff-tree", but
I haven't done so yet. So right now if I change the top-level Makefile,
_and_ change kernel/exit.c, then the "tree diff" between the two commit
trees ends up looking like:

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> diff-tree 7bec1223736d7e02c755e9a365984b3cbfa1e6e9 d64817f809a60cd960d3078ae91b4d19cb649501 | tr '\0' '\n'
<100644 e1e7f7430c0297f22042cff58da5ca73ef121b95 Makefile
>100644 8ee21134577e98fb642dffc5b797a0121645c543 Makefile
<40000 2239383d00ae746f5e79ceccf8ac3fbca62f949d kernel
>40000 a8fad219cb78a6b6a05a10f8643d615fefc8160f kernel

ie it shows that the Makefile blob has changed, and the kernel directory
has changed. You then need to recurse into the kernel tree to see what the
changes were there:

torvalds@ppc970:~/lx-test/linux-2.6.12-rc2> diff-tree 2239383d00ae746f5e79ceccf8ac3fbca62f949d a8fad219cb78a6b6a05a10f8643d615fefc8160f | tr '\0' '\n'
<100644 1a50b58453679b6fee8de4f744f4befc39397bb1 exit.c
>100644 e8df1325bf25816827a1a64404ad533a97bfdae2 exit.c

but it clearly all seems to work. And it means that a subdirectory that
didn't change at all (the common case) will be able to re-use the old sha1
file when you create a tree (this may in fact make "diff-tree" much less
important, since now it tends to handle objects that are just a few kB in
size, rather than almost a megabyte).
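
The recursion described above could look roughly like this in Python. It is a sketch under assumptions, not the real diff-tree: trees are modelled as an in-memory dict from tree name to {entry name: (mode, sha1)}, and the function name diff_tree_recursive is made up. Added or removed directories are reported as a single entry rather than expanded.

```python
from typing import Dict, List, Tuple

DIR_MODE = "40000"
Tree = Dict[str, Tuple[str, str]]  # entry name -> (mode, sha1)


def diff_tree_recursive(trees: Dict[str, Tree], old: str, new: str,
                        prefix: str = "") -> List[Tuple[str, str, str]]:
    """Return (marker, path, sha1) tuples for two tree names."""
    if old == new:
        return []  # identical name means identical content: nothing to read
    a, b = trees[old], trees[new]
    out: List[Tuple[str, str, str]] = []
    for name in sorted(set(a) | set(b)):
        path = prefix + name
        ea, eb = a.get(name), b.get(name)
        if ea == eb:
            continue  # unchanged file, or a shared subtree we never open
        if ea is None:
            out.append(("+", path, eb[1]))
        elif eb is None:
            out.append(("-", path, ea[1]))
        elif ea[0] == DIR_MODE and eb[0] == DIR_MODE:
            out.extend(diff_tree_recursive(trees, ea[1], eb[1], path + "/"))
        else:
            out.append(("<", path, ea[1]))
            out.append((">", path, eb[1]))
    return out
```

The point of the early "continue" is that an unchanged subdirectory is skipped by comparing two names, without ever loading its tree object.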

So in this case, the "commit cost" of changing two files was two small
tree files (1468 and 679 bytes respectively for the kernel/ and top-level
directory) and the commit file itself (251 bytes). In addition to the
actual data files that were changed, of course.

Goodie. Big difference between that and the 460kB of the old monolithic
tree file.

Linus

2005-04-10 00:40:25

by Paul Jackson

Subject: Re: more git updates..

Ralph wrote:
> Watch out for when xargs invokes do_something more than once and the `<'
> is parsed by a different one than the `>'.

It will take a pretty long list to do that. It seems that
GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.

In the old days, with 4 KByte ARG_MAX limits, this would have
bitten us pretty quickly.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 01:14:11

by Bernd Eckenfels

Subject: Re: more git updates..

In article <[email protected]> you wrote:
> Ralph wrote:
>> Watch out for when xargs invokes do_something more than once and the `<'
>> is parsed by a different one than the `>'.
> It will take a pretty long list to do that. It seems that
> GNU xargs on top of a Linux kernel has a 128 KByte ARG_MAX.
> In the old days, with 4 KByte ARG_MAX limits, this would have
> bitten us pretty quickly.

Nevertheless I think it is more parser friendly to have single records for
diffs.

Greetings
Bernd

2005-04-10 01:34:34

by Paul Jackson

Subject: Re: more git updates..

Bernd wrote:
> more parser friendly to have single records for diffs.

good point

[looks like you trimmed the cc list - folks around here don't like that ;)]

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 02:07:40

by Paul Jackson

Subject: Re: more git updates..

Linus wrote:
> Damn, that's painful. I suspect I will have to change the format somehow.

The sha1 (ascii) digests for 16817 files take:

689497 bytes before compression
397475 bytes after minigzip

The pathnames, relative to top of tree, for these 16817
files take:

503983 bytes before compression
85786 bytes after minigzip compression

I doubt any fancifying up of the pathname storage will gain much.

However going from binary to ascii sha1 digest might help (compresses
better, I suspect - I'll have to write a few lines of code to see).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 02:09:24

by Paul Jackson

Subject: Re: more git updates..

> Then a "tree" object would point to a "directory" object,

Ah - light bulb flickers - in _separate_ files.

Yes, that obviously makes a difference.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 02:21:00

by Paul Jackson

Subject: Re: more git updates..

From before:

The sha1 (ascii) digests for 16817 files take:

689497 bytes before compression
397475 bytes after minigzip

New numbers:

The sha1 (binary) digests for 16817 files take:

336340 bytes before compression
334943 bytes after minigzip

So compressing binary digests isn't worth a darn, and compressing ascii
digests gets them down to within 18% of binary digests in size.
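
A measurement along these lines can be reproduced with a few lines of Python. This is a sketch, not the code used for the numbers above: it uses zlib rather than minigzip, hashes made-up inputs instead of real kernel files, and the function name digest_sizes is hypothetical, so the compressed sizes will differ somewhat from the figures quoted.

```python
import hashlib
import zlib
from typing import Dict


def digest_sizes(n_files: int) -> Dict[str, int]:
    """Compare raw and zlib-compressed sizes of binary vs. hex SHA-1 digests."""
    # Stand-in digests: SHA-1 of the decimal file index.
    digests = [hashlib.sha1(str(i).encode()).digest() for i in range(n_files)]
    binary = b"".join(digests)                              # 20 bytes each
    ascii_hex = b"".join(d.hex().encode() + b"\n" for d in digests)  # 41 bytes each
    return {
        "binary_raw": len(binary),
        "binary_zlib": len(zlib.compress(binary, 9)),
        "ascii_raw": len(ascii_hex),
        "ascii_zlib": len(zlib.compress(ascii_hex, 9)),
    }
```

Calling digest_sizes(16817) gives raw sizes of 20 and 41 bytes per file, matching the uncompressed totals above; the compressed binary size should stay close to the raw size, since hash output looks like random data to the compressor.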

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 02:42:05

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Sun, Apr 10, 2005 at 01:31:10AM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
> On Sat, 9 Apr 2005, Linus Torvalds wrote:
> >
> > Actually, I guess I wouldn't have to change the format. I could just
> > extend the existing "tree" object to be able to point to other trees, and
> > that's it.
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
..snip..

Ok, so now I can dare announce it, I hope. I hacked my branch of git
somewhat, kept in sync with Linus, and now I have something to show.
Please see it at

http://pasky.or.cz/~pasky/dev/git/

It is basically a set of (still rather crude) shell scripts upon Linus'
git, which make it sanely usable by mere humans for actual version
tracking. Its usage _is_ going to change, so don't get too used to it
(that'd be hard anyway, I suspect), but it should be working nicely.

I have described most of the interesting parts and some basic usage in
the README at that page. It wraps commits, supports log retrieval and
comfortable diffing between any two trees. And on top of that, it can do
some basic remote repositories - it will pull (rsync) from them and it
can make the local copy track them - on pull, it will be updated
accordingly (and your local commits on the tracked branch will get
orphaned).

I didn't attach a patch against Linus since I think it's pretty much
useless now. It's available as against-linus.patch on the web, and
you can apply it to the latest git tree (NOT 0.03). But it's probably
a better idea to wget my tree. You can then watch us making progress by

gitpull.sh linus
gitpull.sh pasky

and see where we differ by:

gitdiff.sh linus pasky

(This is how the against-linus.patch was generated. I could easily generate
even a 0.03 patch this way, but I forgot to merge the fsck at that time,
so it would suck.)

(Note that the tree you wget is set up to track my branch. If you want
to stop tracking it (basically necessary now if you want to do local
commits), do:

cp .dircache/HEAD .dircache/HEAD.local
gittrack.sh

The cp says something like "I want to pick up where the tracked
branch left off". Otherwise, untracking would return you to your "local"
branch, which is just some ancient predecessor of the pasky branch here
anyway.)

Note that I didn't really test it on anything but git itself yet, so I'm
not sure how it will cope, especially with directories - I tried to make
it aware of them though. I will do some more practical testing tomorrow.

Otherwise, I will probably try to consolidate the usage and
documentation now, and beautify the scripts. I might start pondering
some merging too. Oh, and gitpatch.sh. :-)

Have fun and please share your opinions,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 07:52:03

by Junio C Hamano

Subject: Re: more git updates..

Listing the file paths and their sigs included in a tree to make
a snapshot of a tree state sounds fine, and diffing two trees by
looking at the sigs between two such files sounds fine as well.

But I am wondering what your plans are to handle renames---or
does git already represent them?

2005-04-10 09:02:56

by Chris Li

Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?
>

Rename should just work. It will create a new tree object and you
will notice that in the entry that changed, the hash for the blob
object is the same.
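
As a sketch of what noticing that could mean in a tool on top of git (an assumption about how one might use the hashes, not something git does itself), removed and added entries with the same blob hash can be paired up. The function name find_pure_renames and the dict-based input are hypothetical; a file that is renamed and edited in the same commit gets a new hash and is not matched.

```python
from typing import Dict, List, Tuple


def find_pure_renames(removed: Dict[str, str],
                      added: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair removed and added paths whose blob sha1 is identical.

    Both arguments map path -> sha1. Returns (old_path, new_path) pairs.
    """
    by_sha: Dict[str, List[str]] = {}
    for path, sha in sorted(removed.items()):
        by_sha.setdefault(sha, []).append(path)
    renames = []
    for path, sha in sorted(added.items()):
        candidates = by_sha.get(sha)
        if candidates:
            # Each removed path is used at most once.
            renames.append((candidates.pop(0), path))
    return renames
```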

Chris

2005-04-10 09:29:04

by Junio C Hamano

Subject: Re: more git updates..

>>>>> "CL" == Christopher Li <[email protected]> writes:

CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
>>
>> But I am wondering what your plans are to handle renames---or
>> does git already represent them?
>>

CL> Rename should just work. It will create a new tree object and you
CL> will notice that in the entry that changed, the hash for the blob
CL> object is the same.

Sorry, I was unclear. But doesn't that imply that an SCM built
on top of git storage needs to read all the commit and tree
records up to the common ancestor to show tree diffs between two
forked trees?

I suspect that another problem is that noticing the move of the
same SHA1 hash from one pathname to another and recognizing that
as a rename would not always work in the real world, because
sometimes people move files *and* make small changes at the same
time. If git is meant to be an intermediate format to suck
existing kernel history out of BK so that the history can be
converted for the next SCM chosen for the kernel work, I would
imagine that there needs to be a way to represent such a case.
Maybe convert a file rename into two git trees (one tree for the pure
move, immediately followed by another tree for the edit) if it
is not a pure move?


2005-04-10 09:40:20

by Wichert Akkerman

Subject: Re: more git updates..

Previously Christopher Li wrote:
> Rename should just work. It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

What if you rename and change a file within a changeset?

Wichert.

--
Wichert Akkerman <[email protected]> It is simple to make things.
http://www.wiggy.net/ It is hard to make things simple.

2005-04-10 09:43:21

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
where Christopher Li <[email protected]> told me that...
> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
> >
>
> Rename should just work. It will create a new tree object and you
> will notice that in the entry that changed, the hash for the blob
> object is the same.

Which is of course wrong when you want to do proper merging, examine
per-file history, etc. One solution which springs to my mind is to have
a UUID accompany each blob and tree; that will take a relatively large
amount of space though, and I'm not sure it is really worth it.

How many renames were there in the 64k commits so far anyway?

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 09:48:20

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Sun, Apr 10, 2005 at 11:28:54AM CEST, I got a letter
where Junio C Hamano <[email protected]> told me that...
> >>>>> "CL" == Christopher Li <[email protected]> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work. It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear. But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?

No. See diff-tree output and
http://pasky.or.cz/~pasky/dev/git/gitdiff-do for how it's done.
Basically, you just take the two trees and compare them linearly (do a
normal diff on them, essentially). The differences you spot this way
are everything that needs to appear in the patch.

> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time. If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?

Actually, this could be possible too I think. We will have to make
diff-tree two-pass, but it is already so blindingly fast that I guess that
doesn't hurt too much. I might try to get my hands on that.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 10:02:25

by Chris Li

Subject: Re: more git updates..

On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
>
> Done, and pushed out. The current git.git repository seems to do all of
> this correctly.
>
> NOTE! This means that each "tree" file basically tracks just a single
> directory. The old style of "every file in one tree file" still works, but
> fsck-cache will warn about it. Happily, the git archive itself doesn't
> have any subdirectories, so git itself is not impacted by it.

That is really cool stuff. The way I read it (correct me if I am wrong),
git is a user-space versioned file system: "tree" <--> directory and
"blob" <--> file, with "commit" describing the version history.

Git always writes out a full new version of a blob when there is any
update to it. At first I thought that wastes a lot of space, especially
when there is only a tiny change. But the more I think about it, the
more sense it makes. Kernel source files are usually small objects, and
files are stored compressed anyway. A very useful thing to gain from it
is that we can truncate the older history, e.g. we can have an option not
to sync the pre-2.4 change sets and only grab them if we need to. Most of
the time we are only interested in the recent change sets.

There is one problem though. What about SHA1 hash collisions?
Even though the chance is very remote, you don't want to lose data due
to a "software" error. I think it is OK not to handle that
case right now. On the other hand, it would be nice to detect it
and give out a big error message if it really happens.

Something like the following patch, maybe with a way to turn it off.

Chris

Index: git-0.03/read-cache.c
===================================================================
--- git-0.03.orig/read-cache.c 2005-04-09 18:42:16.000000000 -0400
+++ git-0.03/read-cache.c 2005-04-10 02:48:36.000000000 -0400
@@ -210,8 +210,22 @@
 	int fd;
 
 	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0666);
-	if (fd < 0)
-		return (errno == EEXIST) ? 0 : -1;
+	if (fd < 0) {
+		void *map;
+		static int error(const char * string);
+
+		if (errno != EEXIST)
+			return -1;
+		fd = open(filename, O_RDONLY);
+		if (fd < 0)
+			return -1;
+		map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+		if (map == MAP_FAILED)
+			return -1;
+		if (memcmp(buf, map, size))
+			return error("Ouch, Strike by lighting!\n");
+		return 0;
+	}
 	write(fd, buf, size);
 	close(fd);
 	return 0;
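
The check in the patch can be restated as a short Python sketch. This is an illustration of the idea only: the function name write_object is made up, and unlike the patch it compares the whole existing file (not just the first size bytes) and closes the file afterwards.

```python
import os


def write_object(path: str, buf: bytes) -> bool:
    """Create `path` with content `buf` unless it already exists.

    Returns True if the file was newly written and False if an identical
    file was already there. Raises RuntimeError if a file with the same
    name but different content exists, which for a content-addressed
    store would mean a hash collision (or a corrupted object).
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
    except FileExistsError:
        with open(path, "rb") as existing:
            if existing.read() != buf:
                raise RuntimeError("same name, different content: " + path)
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(buf)
    return True
```

The extra read only happens when the object already exists, so the common case of writing a new object costs the same as before.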

2005-04-10 10:16:03

by Chris Li

Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 02:28:54AM -0700, Junio C Hamano wrote:
> >>>>> "CL" == Christopher Li <[email protected]> writes:
>
> CL> On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> >>
> >> But I am wondering what your plans are to handle renames---or
> >> does git already represent them?
> >>
>
> CL> Rename should just work. It will create a new tree object and you
> CL> will notice that in the entry that changed, the hash for the blob
> CL> object is the same.
>
> Sorry, I was unclear. But doesn't that imply that a SCM built
> on top of git storage needs to read all the commit and tree
> records up to the common ancestor to show tree diffs between two
> forked tree?
>
> I suspect that another problem is that noticing the move of the
> same SHA1 hash from one pathname to another and recognizing that
> as a rename would not always work in the real world, because
> sometimes people move files *and* make small changes at the same
> time. If git is meant to be an intermediate format to suck
> existing kernel history out of BK so that the history can be
> converted for the next SCM chosen for the kernel work, I would
> imagine that there needs to be a way to represent such a case.
> Maybe convert a file rename as two git trees (one tree for pure
> move which immediately followed by another tree for edit) if it
> is not a pure move?
>

Git is not an SCM yet. A rename + change should be handled internally
as a pure rename plus the extra delta. The current git doesn't
have per-file change history. From git's point of view, one file was
deleted and another file appeared with the same content.

It is up to the top-level SCM to handle that correctly.
Renaming a directory will be even more fun.

Chris

2005-04-10 10:18:23

by Chris Li

[permalink] [raw]
Subject: Re: Re: more git updates..

On Sun, Apr 10, 2005 at 11:41:53AM +0200, Petr Baudis wrote:
> Dear diary, on Sun, Apr 10, 2005 at 07:53:40AM CEST, I got a letter
> where Christopher Li <[email protected]> told me that...
> > On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> > >
> > > But I am wondering what your plans are to handle renames---or
> > > does git already represent them?
> > >
> >
> > Rename should just work. It will create a new tree object and you
> > will notice that in the entry that changed, the hash for the blob
> > object is the same.
>
> Which is of course wrong when you want to do proper merging, examine
> per-file history, etc. One solution which springs to my mind is to have
> a UUID accompany each blob and tree; that will take relatively lot of
> space though, and I'm not sure it is really worth it.

It should just use the two-step rename + change; then it is tractable
with git now.

Chris

2005-04-10 10:23:11

by Ralph Corderoy

[permalink] [raw]
Subject: Re: more git updates..


Hi Paul,

> Ralph wrote:
> > Watch out for when xargs invokes do_something more than once and the
> > `<' is parsed by a different one than the `>'.
>
> It will take a pretty long list to do that. It seems that GNU xargs
> on top of a Linux kernel has a 128 KByte ARG_MAX.

I didn't realise it was that long, but one pair of files to diff takes
128 bytes of that.

$ wc -c <<\E
> <100664 aff074c63ac827801a7d02ff92781365957f1430 update-cache.c
> >100664 3a672397164d5ff27a19a6888b578af96824ede7 update-cache.c
> E
128

So that's space for 1024 pairs. (Doesn't envp take some up too?) That
doesn't seem enough for diffs between revisions, but good enough for
most uses that people will get caught out when it fails.

$ bzip2 -dc patch-2.6.10.bz2 | grep -c '^diff '
5384

Cheers,


Ralph.

2005-04-10 11:21:55

by Rutger Nijlunsing

[permalink] [raw]
Subject: Proposal for shell-patch-format [was: Re: more git updates..]

On Sun, Apr 10, 2005 at 12:51:59AM -0700, Junio C Hamano wrote:
> Listing the file paths and their sigs included in a tree to make
> a snapshot of a tree state sounds fine, and diffing two trees by
> looking at the sigs between two such files sounds fine as well.
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

git doesn't represent transitions (or deltas), but only state. So it's
not (much) more than a .tar file from a version-management perspective;
the only difference being that a git-tree has a comment field and a
predecessor-reference, which are currently not used in determining the
'patch' between two trees.

Deltas are derived by comparing different versions and determining
the difference by reverse-engineering the differences which got us
from version A to version B.

Deltas are currently described as patch(1)es. Patches don't have the
concept of 'renaming', so even after determining that file X has been
renamed to Y, we have no container for this fact. A patch(1) only
contains local-file-edits: substitute lines by other lines.

Deltas are not needed to follow a tree; deltas are useful for merging
branches of versions, and for reviewing purposes. This is comparable
to using tar for version-management: it is very common to tar up the
current version of your project every week as a poor man's version
management for a one-person project.

So what is needed is a way to represent deltas which can contain more
than only traditional patches. I would propose a simple format:
the shell-script in a fixed-format.

Shell-patch format in EBNF:
  <shellpatch>  ::= ( <comment>? <command>* )*
  <comment>     ::= <commentline>+
      The comment contains the text describing the function of the
      patch following it.
  <commentline> ::= "# " <text>
  <command>     ::=
      "mv " <pathname> " " <pathname> "\n" |
      "cp " <filename> " " <filename> "\n" |
      "chmod " <mode> " " <pathname> "\n" |
      "patch <<__UNIQUE_STRING__\n" <patch> "__UNIQUE_STRING__\n"
      (where UNIQUE_STRING must not be contained in patch)
  <filename>    ::= <pathname>
      (but pointing to a file)
  <pathname>    ::= a pathname relative to '.';
      escaping special characters the shell-way;
      may not contain '..'.

Example:
# Rename file b to a1, and change a line.
mv b a1
patch <<__END__
*** a1 Sun Apr 10 11:43:37 2005
--- a2 Sun Apr 10 11:43:41 2005
***************
*** 1,4 ****
1
2
! from
3
--- 1,4 ----
1
2
! to
3
__END__

Advantages:
- ASCII!
- a shell-patch is executable without extra tooling
- a shell-patch is readable and therefore reviewable
- a shell-patch is forward-compatible: a shell-patch acts
like a patch (since patch(1) ignores garbage around patch :),
but not backwards-compatible.
- extensible
- the heavy-lifting is done by 'patch'
Disadvantages:
- no deltas for binary files

Open issues:
- <comment> could be made more structured; maybe containing fields
like Subject:, Author:, Signed-By:, certificates, ...
(BitKeeper seems to be using "# " <field> ":" <value> "\n" lines)
- patch(1) doesn't know any directories. Should shell-patch
know directories? This implies commands working on directories too
(like directory renaming, mode changing, ...). Otherwise directories
are implicit (a file in a directory implies the existence of that
directory). Also implies mkdir and rmdir as shell-patch commands.
- extra commands might be useful to conserve more state(changes):
ln -s -- for symbolic links;
ln -- for hard links;
chown -- for permissions;
chattr -- for storing extended attributes
touch -- for setting timestamps (probably creation time only,
since mtime is something git relies on)
...and for the really adventurous:
sed 's,<fromstring>,<tostring>,' -- for substitutions
(this is something darcs supports, but which I think is too
bothersome to use since it is difficult to reverse-engineer
from two random trees)
Why a fixed format at all?
- This way, the executable shell-patch can be proven to be
harmless to the machine: 'rm -rf /' is a valid shell-script,
but not a valid shell-patch (since 'rm' is not a valid command,
random flags like '-rf' are not supported, and '/' is an absolute
pathname).
- A fixed format enables tooling to support such a patch format;
for example creating the reverse-patch, merging patches (yeah,
'cat' also merges patches...).
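
The "provably harmless" argument can be illustrated with a small checker.
This is a sketch under assumptions, not part of the proposal: the function
names are hypothetical, <mode> is assumed to be octal digits (the grammar
leaves it undefined), shell escaping in pathnames is not handled, and the
"patch" here-document form is not covered.

```c
#include <string.h>

/*
 * Hypothetical check for the <pathname> rule: relative to '.', and no
 * ".." component. Shell escaping is not handled in this sketch.
 */
static int shellpatch_path_ok(const char *path)
{
    if (path[0] == '\0' || path[0] == '/')
        return 0;                       /* empty or absolute */
    while (*path) {
        size_t len = strcspn(path, "/");
        if (len == 2 && path[0] == '.' && path[1] == '.')
            return 0;                   /* ".." component */
        path += len;
        while (*path == '/')
            path++;
    }
    return 1;
}

/*
 * Hypothetical check of a single "mv", "cp" or "chmod" line. Anything
 * else, including unknown verbs and flags, is rejected.
 */
static int shellpatch_line_ok(const char *line)
{
    char buf[1024];
    char *argv[3];
    int argc = 0;

    if (strlen(line) >= sizeof(buf))
        return 0;
    strcpy(buf, line);
    for (char *tok = strtok(buf, " \n"); tok; tok = strtok(NULL, " \n")) {
        if (argc == 3)
            return 0;                   /* too many words */
        argv[argc++] = tok;
    }
    if (argc != 3)
        return 0;
    if (strcmp(argv[0], "mv") == 0 || strcmp(argv[0], "cp") == 0)
        return shellpatch_path_ok(argv[1]) && shellpatch_path_ok(argv[2]);
    if (strcmp(argv[0], "chmod") == 0)
        return strspn(argv[1], "01234567") == strlen(argv[1]) &&
               shellpatch_path_ok(argv[2]);
    return 0;                           /* unknown verb, e.g. "rm" */
}
```

With this, "mv b a1" is accepted, while "rm -rf /", "cp /etc/passwd x" and
"mv b ../a1" are all rejected.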

...what has this to do with git? Not much and everything, depending
on how you look at it. 'git' is 'tar', and 'shell-patch' is 'patch';
both orthogonal concepts but very usable in combination. We'll look at
getting from two git trees to a shell-patch.

Diffing the trees would look not only at each file and its hash, but
also the other way around: which hash values are used more
than once. For files with the same hash value, compare the contents
(and the rest of the attributes); this is needed since the mapping from
file contents to sha1 is one-way. When the contents are the same, the
shell-patch-command to generate is obviously a 'cp'.

For example, we have got two trees in git (pathname -> hash value):
tree1/file1 -> 1234
tree1/file2 -> 4567
and
tree2/file1 -> 3456
tree2/file3 -> 4567
tree2/file4 -> 4567

..this could generate shell-patch:

# Comments-go-here
mv tree2/file2 tree2/file3
cp tree2/file3 tree2/file4
patch tree1/file1 <<__FILE_PATCH__
(patch-goes-here)
__FILE_PATCH__

...by an algorithm which starts by determining all renames, then all
copies, and finally all patches.
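
The three-pass algorithm (renames, then copies, then patches) can be
sketched as follows. Everything here is an assumption made for
illustration: the struct, the function names and the 64-entry limit are
hypothetical, equal hashes are trusted without comparing contents, the
patch body is omitted, and created or deleted files are skipped because
the proposed grammar has no command for them.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 64          /* arbitrary limit for this sketch */

struct entry {
    const char *path;           /* pathname inside the tree */
    const char *hash;           /* content hash as a string */
};

/* Return the index of `path` in `tree`, or -1 if it is absent. */
static int find_path(const struct entry *tree, int n, const char *path)
{
    for (int i = 0; i < n; i++)
        if (strcmp(tree[i].path, path) == 0)
            return i;
    return -1;
}

/* Append "verb a [b]\n" to the NUL-terminated string in `out`. */
static void emit(char *out, size_t outsz, const char *verb,
                 const char *a, const char *b)
{
    size_t len = strlen(out);
    if (b)
        snprintf(out + len, outsz - len, "%s %s %s\n", verb, a, b);
    else
        snprintf(out + len, outsz - len, "%s %s\n", verb, a);
}

/* Write commands turning old_tree into new_tree to `out`; -1 on bad input. */
static int derive_shellpatch(const struct entry *old_tree, int nold,
                             const struct entry *new_tree, int nnew,
                             char *out, size_t outsz)
{
    int ready[MAX_ENTRIES];     /* new_tree[j] already has its final content */

    if (nnew > MAX_ENTRIES || outsz == 0)
        return -1;
    out[0] = '\0';

    /* A path with the same hash in both trees needs no command. */
    for (int j = 0; j < nnew; j++) {
        int i = find_path(old_tree, nold, new_tree[j].path);
        ready[j] = i >= 0 && strcmp(old_tree[i].hash, new_tree[j].hash) == 0;
    }

    /* 1. Renames: an old path vanished and its hash shows up at a new path. */
    for (int i = 0; i < nold; i++) {
        if (find_path(new_tree, nnew, old_tree[i].path) >= 0)
            continue;
        for (int j = 0; j < nnew; j++) {
            if (!ready[j] &&
                find_path(old_tree, nold, new_tree[j].path) < 0 &&
                strcmp(old_tree[i].hash, new_tree[j].hash) == 0) {
                emit(out, outsz, "mv", old_tree[i].path, new_tree[j].path);
                ready[j] = 1;
                break;
            }
        }
    }

    /* 2. Copies: a new path whose hash is already available elsewhere. */
    for (int j = 0; j < nnew; j++) {
        if (ready[j] || find_path(old_tree, nold, new_tree[j].path) >= 0)
            continue;
        for (int k = 0; k < nnew; k++) {
            if (ready[k] && strcmp(new_tree[k].hash, new_tree[j].hash) == 0) {
                emit(out, outsz, "cp", new_tree[k].path, new_tree[j].path);
                ready[j] = 1;
                break;
            }
        }
    }

    /* 3. Patches: same path, different hash (patch body omitted here). */
    for (int j = 0; j < nnew; j++)
        if (!ready[j] && find_path(old_tree, nold, new_tree[j].path) >= 0)
            emit(out, outsz, "patch", new_tree[j].path, NULL);
    return 0;
}
```

Fed the two example trees with the tree1/ and tree2/ prefixes stripped,
this sketch emits "mv file2 file3", "cp file3 file4" and "patch file1",
in that order.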

Comments?


--
Rutger Nijlunsing ---------------------- linux-kernel at tux.tmfweb.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

2005-04-10 11:38:49

by Luck, Tony

[permalink] [raw]
Subject: Re: more git updates..

>handle by pure rename only plus the extra delta. The current git don't
>have per file change history. From git's point of view some file deleted
>and the other file appeared with same content.
>
>It is the top level SCM to handle that correctly.
>Rename a directory will be even more fun.

But from a git perspective it will be very efficient. Imagine that
Linus decides to rename arch/i386 as arch/x86 ... at the git repository
level this just requires a changeset, a new top level tree, and a new
tree for the arch directory showing that i386 changed to x86. That's
all ... every file below that is unchanged, so the blobs for the files
are all the same.

-Tony

2005-04-10 11:49:01

by Ralph Corderoy

[permalink] [raw]
Subject: Re: more git updates..


Hi,

Christopher Li wrote:
> On Sat, Apr 09, 2005 at 04:31:10PM -0700, Linus Torvalds wrote:
> > NOTE! This means that each "tree" file basically tracks just a
> > single directory. The old style of "every file in one tree file"
> > still works, but fsck-cache will warn about it. Happily, the git
> > archive itself doesn't have any subdirectories, so git itself is not
> > impacted by it.
>
> That is really cool stuff. My way to read it, correct me if I am
> wrong, git is a user space version file system. "tree" <--> directory
> and "blob" <--> file. "commit" to describe the version history.

See the Venti filesystem in Bell Labs's Plan 9 OS. It too uses SHA-1.

http://www.cs.bell-labs.com/sys/doc/venti/venti.pdf

Abstract

This paper describes a network storage system, called Venti,
intended for archival data. In this system, a unique hash of a
block's contents acts as the block identifier for read and write
operations. This approach enforces a write-once policy, preventing
accidental or malicious destruction of data. In addition, duplicate
copies of a block can be coalesced, reducing the consumption of
storage and simplifying the implementation of clients. Venti is a
building block for constructing a variety of storage applications
such as logical backup, physical backup, and snapshot file systems.

We have built a prototype of the system and present some preliminary
performance results. The system uses magnetic disks as the storage
technology, resulting in an access time for archival data that is
comparable to non-archival data. The feasibility of the write-once
model for storage is demonstrated using data from over a decade's
use of two Plan 9 file systems.

Cheers,


Ralph.

2005-04-10 12:00:47

by Luck, Tony

[permalink] [raw]
Subject: Re: more git updates..

>In other words, each "commit" file is very small and cheap, but since
>almost every commit will also imply a totally new tree-file, "git" is
>going to have an overhead of half a megabyte per commit. Oops.
>
>Damn, that's painful. I suspect I will have to change the format somehow.

Having dodged that bullet with the change to make tree files point at
other tree files ... here's another (potential) issue.

A changeset that touches just one file a few levels down from the top
of the tree (say arch/i386/kernel/setup.c) will make six new files in
the git repository (one for the changeset, four tree files, and a new
blob for the new version of the file). More complex changes make more
files ... but say the average is ten new files per changeset since most
changes touch few files. With 60,000 changesets in the current tree, we
will start out our git repository with about 600,000 files. Assuming
the first byte of the SHA1 hash is random, that means an average of 2343
files in each of the objects/xx directories. Give it a few more years at
the current pace, and we'll have over 10,000 files per directory. This
sounds like a lot to me ... but perhaps filesystems now handle large
directories well enough for this not to be a problem?
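
The estimate is plain division and can be checked with a few lines of C.
The helper name files_per_dir() is hypothetical; it assumes object names
are spread evenly over the directories.

```c
/*
 * Average number of object files per leaf directory when the first
 * `hex_digits` hex digits of the hash select the directory
 * (hex_digits = 2 is the objects/xx layout discussed here).
 */
static unsigned long files_per_dir(unsigned long objects, unsigned hex_digits)
{
    unsigned long dirs = 1;

    while (hex_digits-- > 0)
        dirs *= 16;
    return objects / dirs;      /* integer division, rounds down */
}
```

files_per_dir(600000, 2) gives 2343, matching the figure above; spending
four hex digits on directories instead (65536 of them) would bring that
down to 9.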

Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

-Tony

2005-04-10 15:43:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: more git updates..



On Sun, 10 Apr 2005, Junio C Hamano wrote:
>
> But I am wondering what your plans are to handle renames---or
> does git already represent them?

You can represent renames on top of git - git itself really doesn't care.
In many ways you can just see git as a filesystem - it's content-
addressable, and it has a notion of versioning, but I really really
designed it coming at the problem from the viewpoint of a _filesystem_
person (hey, kernels is what I do), and I actually have absolutely _zero_
interest in creating a traditional SCM system.

So to take renaming a file as an example - why do you actually want to
track renames? In traditional SCM's, you do it for two reasons:

- space efficiency. Most SCM's are based on describing changes to a file,
and compress the data by doing revisions on the same file. In order to
continue that process past a rename, such an SCM _has_ to track
renames, or lose the delta-based approach.

The most trivial example of this is "diff", ie a rename ends up
generating a _huge_ diff unless you track the rename explicitly.

GIT doesn't care. There is _zero_ space efficiency in trying to track
renames. In fact, it would add overhead to the system, not lessen it.
That's because GIT fundamentally doesn't do the "delta-within-a-file"
model.

- annotate/blame. This is a valid concern, but the fact is, I never use
it. It may be a deficiency of mine, but I simply don't do the per-line
thing when I debug or try to find who was responsible. I do "blame" on
a much bigger-picture level, and I personally believe (pretty strongly)
that per-line annotations are not actually a good thing - they come not
because people _want_ to do things at that low level, but because
historically, you didn't _have_ the bigger-picture thing.

In other words, pretty much every SCM out there is based on SCCS
"mentally", even if not in any other model. That's why people think
per-line blame is important - you have that mental model.

So consider me deficient, or consider me radical. It boils down to the
same thing. Renames don't matter.

That said, if somebody wants to create a _real_ SCM (rather than my notion
of a pure content tracker) on top of GIT, you probably could fairly easily
do so by imposing a few limitations on a higher level. For example, most
SCM's that track renames require that the user _tell_ them about the
renames: you do a "bk mv" or a "svn rename" or something.

If you want to do the same on top of GIT, then you should think of GIT as
what it is: GIT just tracks contents. It's a filesystem - although a
fairly strange one. How would you track renames on top of that? Easy: add
your own fields to the GIT revision messages: GIT enforces the header, but
you can add anything you want to the "free-form" part that follows it.
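
As an illustration of that layering, a tool on top of GIT could scan the
free-form part for its own fields. This is a sketch under assumptions: the
"rename: <old> <new>" field is invented for the example, and the commit
text is assumed to be header lines, one blank line, then free text.

```c
#include <string.h>

/*
 * Count hypothetical "rename: <old> <new>" lines in the free-form part
 * of a commit, i.e. after the first blank line. Header lines are ignored.
 */
static int count_rename_fields(const char *commit_text)
{
    const char *line = strstr(commit_text, "\n\n");
    int count = 0;

    if (!line)
        return 0;               /* no free-form part */
    line += 2;
    while (*line) {
        if (strncmp(line, "rename: ", 8) == 0)
            count++;
        line = strchr(line, '\n');
        if (!line)
            break;
        line++;
    }
    return count;
}
```

GIT itself would neither know nor care about such lines; only the
higher-level tool gives them meaning.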

Same goes for any other information where you care about what happens
"within" a file. GIT simply doesn't track it. You can build things on top
of GIT if you want to, though. They may not be as efficient as they would
be if they were built _into_ GIT, but on the other hand GIT does a lot of
other things a hell of a lot faster thanks to its design.

So whether you agree with the things that _I_ consider important probably
depends on how you work. The real downside of GIT may be that _my_ way of
doing things is quite possibly very rare.

But it clearly is the only right way. The fact that everybody else does it
some other way only means that they are wrong.

Linus

2005-04-10 15:59:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: more git updates..



On Sat, 9 Apr 2005 [email protected] wrote:
>
> With 60,000 changesets in the current tree, we will start out our git
> repository with about 600,000 files. Assuming the first byte of the
> SHA1 hash is random, that means an average of 2343 files in each of the
> objects/xx directories. Give it a few more years at the current pace,
> and we'll have over 10,000 files per directory. This sounds like a lot
> to me ... but perhaps filesystems now handle large directories enough
> better than they used to for this to not be a problem?

The good news is that git itself doesn't really care. I think it's
literally _one_ function ("get_sha1_filename()") that you need to change,
and then you need to write a small script that moves files around, and
you're pretty much done.

Also, I did actually debate that issue with myself, and decided that even
if we do have tons of files per directory, git doesn't much care. The
reason? Git never _searches_ for them. Assuming you have enough memory to
cache the tree, you just end up doing a "lookup", and inside the kernel
that's done using an efficient hash, which doesn't actually care _at_all_
about how many files there are per directory.

So I was for a while debating having a totally flat directory space, but
since there are _some_ downsides (linear lookup for cold-cache, and just
that "ls -l" ends up being O(n**2) and things), I decided that a single
fan-out is probably a good idea.

> Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

Hey, I may end up being wrong, and yes, maybe I should have done a
two-level one. The good news is that we can trivially fix it later (even
dynamically - we can make the "sha1 object tree layout" be a per-tree
config option, and there would be no real issue, so you could make small
projects use a flat version and big projects use a very deep structure
etc). You'd just have to script some renames to move the files around..
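
A configurable layout could look roughly like this. It is a hypothetical
stand-in, not the real get_sha1_filename(): the function name, the
"objects/" prefix and the `levels` parameter are assumptions, and the hex
string is not validated beyond its length.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the path of an object from its 40-character hex name, using
 * `levels` directory levels of two hex digits each
 * (0 = flat, 1 = objects/xx/..., 2 = objects/xx/yy/...).
 * Returns 0 on success, -1 if the arguments or the buffer are unusable.
 */
static int object_path(const char *hex, int levels, char *out, size_t outsz)
{
    if (strlen(hex) != 40 || levels < 0 || levels > 19)
        return -1;

    /* prefix + 40 hex digits + one '/' per level + terminating NUL */
    size_t need = strlen("objects/") + 40 + (size_t)levels + 1;
    if (outsz < need)
        return -1;

    char *p = out;
    p += sprintf(p, "objects/");
    for (int i = 0; i < levels; i++) {
        *p++ = hex[0];
        *p++ = hex[1];
        *p++ = '/';
        hex += 2;
    }
    strcpy(p, hex);             /* remaining digits form the file name */
    return 0;
}
```

Converting a repository from one layout to another would then be a matter
of computing both paths for each object and renaming the file.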

Linus

2005-04-10 16:27:32

by Petr Baudis

[permalink] [raw]
Subject: [ANNOUNCE] git-pasky-0.1

Hello,

so I "released" git-pasky-0.1, my set of patches and scripts upon
Linus' git, aimed at human usability and to an extent a SCM-like usage.

You can get it at

http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2

and after unpacking and building (make) do

git pull pasky

to get the latest changes from my branch. If you already have some git
from my branch which can do pulling, you can bring yourself up to date
by doing just

gitpull.sh pasky

(but this style of usage is deprecated now). Please see the README for
some details regarding usage etc. You can find the changes from the last
announcement in the ChangeLog (the previous announcement corresponds to
commit id 5125d089ad862f16a306b4942155092e1dce1c2d). The most important
change is probably recursive diff addition, and making git ignore the
nsec of ctime and mtime, since it is totally unreliable and likes to
taint random files as modified.

My near future plans include especially some merge support; I think it
should be rather easy, actually. I'll also add some simple tagging
mechanism. I've decided to postpone the file moving detection, since
there's no big demand for it now. ;-)

I will also need to do more testing on the linux kernel tree.
Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in

$ time gitdiff.sh `parent-id` `tree-id` >p
real 5m37.434s
user 1m27.113s
sys 2m41.036s

which is pretty horrible, it seems to me. Any benchmarking help is of
course welcome, as well as any other feedback.

BTW, what would be the best (most complete) source for the BK tree
metadata? Should I dig it from the BKCVS gateway, or is there a better
source? Where did you get the sparse git database from, Linus? (BTW, it
would be nice to get sparse.git with the directories as separate.)

Have fun,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 16:53:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.1



On Sun, 10 Apr 2005, Petr Baudis wrote:
>
> Where did you get the sparse git database from, Linus? (BTW, it
> would be nice to get sparse.git with the directories as separate.)

When we were trying to figure out how to avert the BK disaster, one of
Tridge's concerns (and, in my opinion, the only really valid one) was that
you couldn't get the BK data in some SCM-independent way.

So I wrote some very preliminary scripts (on top of BK itself) to extract
the data, to show that BK could generate a SCM-neutral file format (a very
stupid one and horribly useless for anything but interoperability, but
still...). I was hoping that that would convince Tridge that trying to
muck around with the internal BK file format was not worth it, and avert
the BK trainwreck.

Larry was ok with the idea to make my export format actually be natively
supported by BK (ie the same way you have "bk export -tpatch"), but Tridge
wanted to instead get at the native data and be difficult about it. As a
result, not only can I no longer use BK, but we also don't have a nice
export format from BK.

Yeah, I'm a bit bitter about it.

Anyway, the sparse data came out of my hack. It's very inefficient, and I
estimated that doing the same for the kernel would have taken ten solid
days of conversion, mainly because my hack was really just that: a quick
hack to show that BK could do it. Larry could have done it a lot better.

I'll re-generate the sparse git-database at some point (and I'll probably
do so from the old GIT database itself, rather than re-generating it from
my old BK data).

Linus

2005-04-10 17:00:35

by Rutger Nijlunsing

[permalink] [raw]
Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 08:44:56AM -0700, Linus Torvalds wrote:
>
>
> On Sun, 10 Apr 2005, Junio C Hamano wrote:
> >
> > But I am wondering what your plans are to handle renames---or
> > does git already represent them?
>
> You can represent renames on top of git - git itself really doesn't care.
> In many ways you can just see git as a filesystem - it's content-
> addressable, and it has a notion of versioning, but I really really
> designed it coming at the problem from the viewpoint of a _filesystem_
> person (hey, kernels is what I do), and I actually have absolutely _zero_
> interest in creating a traditional SCM system.
>
> So to take renaming a file as an example - why do you actually want to
> track renames? In traditional SCM's, you do it for two reasons:
>
> - space efficiency. Most SCM's are based on describing changes to a file,
[snip]
> - annotate/blame. This is a valid concern, but the fact is, I never use
[snip]

- merging.
When the parent tree renames a file, it's easier for an out-of-tree
patch to get up-to-date.

- reviewing.
A huge patch with 2000 added lines and 1990 removed lines is more
difficult to review than a rename + 10 lines patch.

> So consider me deficient, or consider me radical. It boils down to the
> same thing. Renames don't matter.

When you've got no out-of-tree patches since you've got the
parent-of-all-trees, then they don't matter, that's true :)

> So whether you agree with the things that _I_ consider important probably
> depends on how you work. The real downside of GIT may be that _my_ way of
> doing things is quite possibly very rare.


--
Rutger Nijlunsing ---------------------------------- eludias ed dse.nl
never attribute to a conspiracy which can be explained by incompetence
----------------------------------------------------------------------

2005-04-10 17:31:50

by Rik van Riel

[permalink] [raw]
Subject: Re: more git updates..

On Sat, 9 Apr 2005, Linus Torvalds wrote:

> I've rsync'ed the new git repository to kernel.org, it should all be there
> in /pub/linux/kernel/people/torvalds/git.git/ (and it looks like the
> mirror scripts already picked it up on the public side too).

GCC 4 isn't very happy. Mostly sign changes, but also something
that looks like a real error:

gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c
fsck-cache.c: In function 'main':
fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined

I assume that fsck_tree and fsck_commit should complain loudly
if they ever get to that point - but since I'm not quite sure
there's no patch, sorry.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-04-10 17:34:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.1


* Petr Baudis <[email protected]> wrote:

> I will also need to do more testing on the linux kernel tree.
> Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
>
> $ time gitdiff.sh `parent-id` `tree-id` >p
> real 5m37.434s
> user 1m27.113s
> sys 2m41.036s
>
> which is pretty horrible, it seems to me. Any benchmarking help is of
> course welcomed, as well as any other feedback.

it seems from the numbers that your system doesn't have enough RAM for
this and is getting IO-bound?

Ingo

2005-04-10 17:33:40

by Paul Jackson

[permalink] [raw]
Subject: Re: more git updates..

Ralph wrote:
> but good enough for
> most uses that people will get caught out when it fails.

Exactly.

If Linus persists in this diff-tree output format, using two lines for
changed files, then I will have to add the following sed script to my
arsenal:

sed '/^</ { N; s/\n>/ / }'

It collapses pairs of lines:

<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile
>100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile

to the single line:

<100664 4870bcf91f8666fc788b07578fb7473eda795587 Makefile 100664 5493a649bb33b9264e8ed26cc1f832989a307d3b Makefile

However, more people will get bit by this git glitch than know sed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 17:36:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: more git updates..


* Rik van Riel <[email protected]> wrote:

> GCC 4 isn't very happy. Mostly sign changes, but also something that
> looks like a real error:
>
> gcc -g -O3 -Wall -c -o fsck-cache.o fsck-cache.c
> fsck-cache.c: In function 'main':
> fsck-cache.c:59: warning: control may reach end of non-void function 'fsck_tree' being inlined
> fsck-cache.c:62: warning: control may reach end of non-void function 'fsck_commit' being inlined
>
> I assume that fsck_tree and fsck_commit should complain loudly if they
> ever get to that point - but since I'm not quite sure there's no
> patch, sorry.

i sent a patch for most of the sign errors, but the above is a case of
gcc not noticing that the function can never ever exit the loop, and thus
cannot get to the 'return' point.
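
The situation can be reproduced in miniature. The function below is a
made-up example, not the fsck-cache.c code: the loop always returns if the
caller keeps its promise, but the compiler cannot know that, so without
the final lines gcc -Wall would warn about control reaching the end of a
non-void function. Ending with a loud failure addresses both concerns.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical example: the caller promises that buf[0..max] contains a
 * NUL byte, so the loop always returns before running out.
 */
static int bounded_length(const char *buf, int max)
{
    for (int i = 0; i <= max; i++)
        if (buf[i] == '\0')
            return i;

    /* Only reached if the caller broke its promise: complain loudly. */
    fprintf(stderr, "bounded_length: no terminator within %d bytes\n", max);
    abort();
}
```

Because abort() never returns, the compiler no longer sees a path that
falls off the end of the function.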

Ingo

2005-04-10 17:43:01

by Willy Tarreau

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.1

On Sun, Apr 10, 2005 at 07:33:49PM +0200, Ingo Molnar wrote:
>
> * Petr Baudis <[email protected]> wrote:
>
> > I will also need to do more testing on the linux kernel tree.
> > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> >
> > $ time gitdiff.sh `parent-id` `tree-id` >p
> > real 5m37.434s
> > user 1m27.113s
> > sys 2m41.036s
> >
> > which is pretty horrible, it seems to me. Any benchmarking help is of
> > course welcomed, as well as any other feedback.
>
> it seems from the numbers that your system doesnt have enough RAM for
> this and is getting IO-bound?

Not the only problem, without I/O, he will go down to 4m8s (u+s) which
is still in the same order of magnitude.

willy

2005-04-10 17:45:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.1


* Willy Tarreau <[email protected]> wrote:

> > > I will also need to do more testing on the linux kernel tree.
> > > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> > >
> > > $ time gitdiff.sh `parent-id` `tree-id` >p
> > > real 5m37.434s
> > > user 1m27.113s
> > > sys 2m41.036s
> > >
> > > which is pretty horrible, it seems to me. Any benchmarking help is of
> > > course welcomed, as well as any other feedback.
> >
> > it seems from the numbers that your system doesnt have enough RAM for
> > this and is getting IO-bound?
>
> Not the only problem, without I/O, he will go down to 4m8s (u+s) which
> is still in the same order of magnitude.

probably not the only problem - but if we are lucky then his system was
just thrashing within the kernel repository and then most of the overhead
is the _unnecessary_ IO that happened due to that (which causes CPU
overhead just as much). The dominant system time suggests so, to a
certain degree. Maybe this is wishful thinking.

Ingo

2005-04-10 18:22:43

by Paul Jackson

[permalink] [raw]
Subject: Re: more git updates..

Tony wrote:
> Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?

I tend to size these things with the square root of the number of
leaf nodes. If I have 2,560,000 leaves (your 10,000 files in each
of 16*16 directories), then I will aim for 1600 directories of
1600 leaves each.

My backup is sized for about this number of leaves, and it uses:

xxx/xxxzzzzzzzzzzzzzzzz

(I repeat the xxx in the leaf name - easier to code.)

I don't think there is any need for two levels. There are 4096
different values of three digit hex numbers. That's ok in one
directory.

The only question would be 'xx' or 'xxx' - two or three digits.

This one is on the cusp in my view - either works.
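
The square-root rule of thumb can be checked numerically. The helper
isqrt() is a hypothetical name for this sketch; it is a deliberately
simple loop, fine for numbers of this size.

```c
/* Integer square root: the largest r with r * r <= n. */
static unsigned long isqrt(unsigned long n)
{
    unsigned long r = 0;

    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}
```

isqrt(2560000) is exactly 1600. Two hex digits give 256 directories of
10,000 leaves each and three give 4096 directories of 625 each, so the
target of 1600 sits between the two choices.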

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 18:49:50

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Sun, Apr 10, 2005 at 07:45:12PM CEST, I got a letter
where Ingo Molnar <[email protected]> told me that...
>
> * Willy Tarreau <[email protected]> wrote:
>
> > > > I will also need to do more testing on the linux kernel tree.
> > > > Committing patch-2.6.7 on 2.6.6 kernel and then diffing results in
> > > >
> > > > $ time gitdiff.sh `parent-id` `tree-id` >p
> > > > real 5m37.434s
> > > > user 1m27.113s
> > > > sys 2m41.036s
> > > >
> > > > which is pretty horrible, it seems to me. Any benchmarking help is of
> > > > course welcomed, as well as any other feedback.
> > >
> > > it seems from the numbers that your system doesnt have enough RAM for
> > > this and is getting IO-bound?
> >
> > Not the only problem, without I/O, he will go down to 4m8s (u+s) which
> > is still in the same order of magnitude.
>
> probably not the only problem - but if we are lucky then his system was
> just trashing within the kernel repository and then most of the overhead
> is the _unnecessary_ IO that happened due to that (which causes CPU
> overhead just as much). The dominant system time suggests so, to a
> certain degree. Maybe this is wishful thinking.

It turns out that the forks for doing all the cuts and such are what is
bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
15 forks per change, I guess, and for some reason cut takes a lot of
time on its own.

I've rewritten the cuts with the use of bash arrays and other smart
stuff. I somehow don't feel comfortable using this and prefer the
old-fashioned ways, but it would be plain unusable without this.

Now I'm down to

real 1m21.440s
user 0m32.374s
sys 0m42.200s

and I kinda doubt if it is possible to cut this much down. Almost no
disk activity, I have almost everything cached by now, apparently.

Anyway, you can git pull to get the optimized version.

Thanks for the help,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 18:55:50

by Paul Jackson

Subject: Re: more git updates..

Linus wrote:
> It's a filesystem - although a
> fairly strange one.

Ah ha - that explains the read-tree and write-tree names.

The read-tree pulls stuff out of this file system into
your working files, clobbering local edits. This is like
the read(2) system call, which clobbers stuff in your
read buffer.

The write-tree pushes stuff down into the file system,
just like write(2) pushes data into the kernel.

I was getting all kind of frustrated yesterday trying
to use Linus's git commands, coming at these names with my
SCM hat on.

That way of thinking really doesn't work well here.

I will have to look more closely at pasky's GIT toolkit
if I want to see an SCM style interface.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 19:13:56

by Willy Tarreau

Subject: Re: Re: [ANNOUNCE] git-pasky-0.1

On Sun, Apr 10, 2005 at 08:45:22PM +0200, Petr Baudis wrote:

> It turns out to be the forks for doing all the cuts and such what is
> bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> 15 forks per change, I guess, and for some reason cut takes a lot of
> time on its own.
>
> I've rewritten the cuts with the use of bash arrays and other smart
> stuff. I somehow don't feel comfortable using this and prefer the
> old-fashioned ways, but it would be plain unusable without this.

I've encountered the same problem in a config-generation script a while
ago. Fortunately, bash provides enough ways to remove most of the forks,
but the result is less portable.

I've downloaded your code, but it does not compile here because of the
tv_nsec fields in struct stat (2.4, glibc 2.2), so I cannot use it to
get the most up to date version to take a look at the script. Basically,
all the 'cut' and 'sed' can be removed, as well as the 'dirname'. You
can also call mkdir only if the dirs don't exist. I really think you
should end up with only one fork in the loop to call 'diff'.

> Now I'm down to
>
> real 1m21.440s
> user 0m32.374s
> sys 0m42.200s
>
> and I kinda doubt if it is possible to cut this much down. Almost no
> disk activity, I have almost everything cached by now, apparently.

It is very common to cut times by a factor of 10 or more when replacing
common unix tools by pure shell. Dynamic library initialization also
takes a lot of time nowadays, and probably you have localisation which
is big too. Sometimes, just wiping a few variables at the top of the
shell might remove some useless overhead.

> Anyway, you can git pull to get the optimized version.
>
> Thanks for the help,

Willy

2005-04-10 19:25:50

by Paul Jackson

Subject: Re: more git updates..

> Some thing like the following patch, may be turn off able.

Take out an old envelope and compute on it the odds of this
happening.

Say we have 10,000 kernel hackers, each producing one
new file every minute, for 100 hours a week. And we've
cloned a small army of Andrew Mortons to integrate
the resulting tsunami of patches. And Linus is well
cared for in the state funny farm.

What is the probability that this check will fire even
once, between now and 10 billion years from now, when
the Sun has become a red giant destroying all life on
planet Earth?
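
The envelope arithmetic can be sketched as follows. This is a rough bound, assuming SHA-1 behaves like a uniformly random 160-bit function and that the hackers work 52 weeks a year (neither assumption is from the mail above); the function names are illustrative. Under those assumptions the bound comes out at roughly 3 in 10 billion.

```python
from fractions import Fraction

HACKERS = 10_000
FILES_PER_HOUR = 60      # one new file every minute
HOURS_PER_WEEK = 100
WEEKS_PER_YEAR = 52      # assumption: no holidays
YEARS = 10**10
HASH_BITS = 160          # SHA-1 output size


def total_objects() -> int:
    """Number of files produced over the whole period."""
    return HACKERS * FILES_PER_HOUR * HOURS_PER_WEEK * WEEKS_PER_YEAR * YEARS


def collision_upper_bound(n: int, bits: int) -> Fraction:
    """Union bound on any two of n uniformly random hashes colliding: C(n, 2) / 2**bits."""
    return Fraction(n * (n - 1), 2 * 2**bits)


if __name__ == "__main__":
    n = total_objects()
    print(f"objects produced: {n:.3e}")
    print(f"P(any collision) <= {float(collision_upper_bound(n, HASH_BITS)):.2e}")
```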

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 19:53:15

by Sean

Subject: Re: [ANNOUNCE] git-pasky-0.1

On Sun, April 10, 2005 12:55 pm, Linus Torvalds said:

> Larry was ok with the idea to make my export format actually be natively
> supported by BK (ie the same way you have "bk export -tpatch"), but
> Tridge wanted to instead get at the native data and be difficult about
> it. As a result, not only can I not use BK any more, but we also don't
> have a nice export format from BK.
>
> Yeah, I'm a bit bitter about it.
>

Linus,

With all due respect, Larry could have dealt with this years ago and
removed the motivation for Tridge and others to pursue reverse
engineering. Instead he chose to insult and question the motives of
everyone that wanted open-source access to the Linux history data. The
blame for the current situation falls firmly on the choice to use a
closed-source SCM for Linux and the actions of the company that owned it.

Sean


2005-04-10 20:37:32

by Linus Torvalds

Subject: Re: Re: [ANNOUNCE] git-pasky-0.1



On Sun, 10 Apr 2005, Petr Baudis wrote:
>
> It turns out to be the forks for doing all the cuts and such what is
> bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> 15 forks per change, I guess, and for some reason cut takes a lot of
> time on its own.

Heh.

Can you pull my current repo, which has "diff-tree -R" that does what the
name suggests, and which should be faster than the 0.48 sec you see..

It may not matter a lot, since actually generating the diff from the file
contents is what is expensive, but remember my goal: I want the expense of
a diff-tree to be relative to the size of the diff, so that implies that
small diffs have to be basically instantaneous. So I care.

So I just tried the 2.6.7->2.6.8 diff, and for me the new recursive
"diff-tree" can generate the _list_ of files changed in zero time:

real 0m0.079s
user 0m0.067s
sys 0m0.024s

but then _doing_ the diff is pretty expensive (in this case 3800+ files
changed, so you have to unpack 7600+ objects - and even unpacking isn't
the expensive part, the expense is literally in the diff operation
itself).

Me, the stuff I automate is the small steps. Doing a single checkin. So
that's the case I care about going fast, when a "diff-tree" will likely
have maybe five files or something. That's why I want the small
incremental cases to go fast - if it takes me a minute to generate a diff
for a _release_, that's not a big deal. I make one release every other
month, but I work with lots of small patches all the time.

Anyway, with a fast diff-tree, you should be able to generate the list of
objects for a fast "merge". That's next.

(And by "merge", I of course mean "suck". I'm talking about the old CVS
three-way merge, and you have to specify the common parent explicitly and
it won't handle any renames or any other crud. But it would get us to
something that might actually be useful for simple things. Which is why
"diff-tree" is important - it gives the information about what to tell
merge).

Linus

2005-04-10 20:46:57

by Paul Jackson

Subject: Re: [ANNOUNCE] git-pasky-0.1

Good lord - you don't need to use arrays for this.

The old-fashioned ways have their ways. Both the 'set'
command and the 'read' command can split args and assign
to distinct variable names.

Try something like the following:

    diff-tree -r $id1 $id2 |
        sed -e '/^</ { N; s/\n>/ / }' -e 's/./& /' |
        while read op mode1 sha1 name1 mode2 sha2 name2
        do
            ... various common stuff ...
            case "$op" in
            "+")
                ...
                ;;
            "-")
                ...
                ;;
            "<")
                # quote the names so an empty or odd name cannot break test
                test "$name1" = "$name2" || die mismatched names
                label1=$(mkbanner "$loc1" $id1 "$name1" $mode1 $sha1)
                label2=$(mkbanner "$loc2" $id2 "$name1" $mode2 $sha2)
                diff -L "$label1" -L "$label2" -u "$loc1" "$loc2"
                ;;
            esac
        done

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 20:55:57

by Linus Torvalds

Subject: Re: more git updates..



On Sun, 10 Apr 2005, Paul Jackson wrote:
>
> Ah ha - that explains the read-tree and write-tree names.
>
> The read-tree pulls stuff out of this file system into
> your working files, clobbering local edits. This is like
> the read(2) system call, which clobbers stuff in your
> read buffer.

Yes. Except it's a two-stage thing, where the staging area is always the
"current directory cache".

So a "read-tree" always reads the tree information into the directory
cache, but does not actually _update_ any of the files it "caches". To do
that, you need to do a "checkout-cache" phase.

Similarly, "write-tree" writes the current directory cache contents into a
set of tree files. But in order to have that match what is actually in
your directory right now, you need to have done a "update-cache" phase
before you did the "write-tree".

So there is always a staging area between the "real contents" and the
"written tree".

> That way of thinking really doesn't work well here.
>
> I will have to look more closely at pasky's GIT toolkit
> if I want to see an SCM style interface.

Yes. You really should think of GIT as a filesystem, and of me as a
_systems_ person, not an SCM person. In fact, I tend to detest SCM's. I
think the reason I worked so well with BitKeeper is that Larry used to do
operating systems. He's also a systems person, not really an SCM person.
Or at least he's in between the two.

My operations are like the "system calls". Useless on their own: they're
not real applications, they're just how you read and write files in this
really strange filesystem. You need to wrap them up to make them do
anything sane.

For example, take "commit-tree" - it really just says that "this is the
new tree, and these other trees were its parents". It doesn't do any of
the actual work to _get_ those trees written.

So to actually do the high-level operation of a real commit, you need to
first update the current directory cache to match what you want to commit
(the "update-cache" phase).

Then, when your directory cache matches what you want to commit (which is
NOT necessarily the same thing as your actual current working area - if
you don't want to commit some of the changes you have in your tree, you
should avoid updating the cache with those changes), you do stage 2, ie
"write-tree". That writes a tree node that describes what you want to
commit.

Only THEN, as phase three, do you do the "commit-tree". Now you give it
the tree you want to commit (remember - that may not even match your
current directory contents), and the history of how you got here (ie you
tell commit what the previous commit(s) were), and the changelog.

So a "commit" in SCM-speak is actually three totally separate phases in my
filesystem thing, and each of the phases (except for the last
"commit-tree" which is the thing that brings it all together) is actually
in turn many smaller parts (ie "update-cache" may have been called
hundreds of times, and "write-tree" will write several tree objects that
point to each other).
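
The three phases can be modelled with a toy in-memory store. This is only a sketch of the idea, not the actual tools: the object store and cache are plain dicts, the tree is flat rather than a set of tree objects pointing to each other, and the tree and commit payload layouts are invented for illustration.

```python
import hashlib

objects: dict[str, bytes] = {}   # toy object store: sha1 hex -> raw object
index: dict[str, str] = {}       # toy "directory cache": path -> blob sha1


def write_object(kind: str, payload: bytes) -> str:
    """Store one typed object and return the sha1 it is named by."""
    raw = f"{kind} {len(payload)}".encode() + b"\0" + payload
    sha = hashlib.sha1(raw).hexdigest()
    objects[sha] = raw
    return sha


def update_cache(path: str, content: bytes) -> None:
    """Phase 1: stage one file's current content in the cache."""
    index[path] = write_object("blob", content)


def write_tree() -> str:
    """Phase 2: snapshot the cache (not the working directory) as a tree object."""
    listing = "".join(f"{sha} {path}\n" for path, sha in sorted(index.items()))
    return write_object("tree", listing.encode())


def commit_tree(tree: str, parents: list[str], changelog: str) -> str:
    """Phase 3: tie a tree to its history and a changelog."""
    lines = [f"tree {tree}"] + [f"parent {p}" for p in parents] + ["", changelog]
    return write_object("commit", "\n".join(lines).encode())


if __name__ == "__main__":
    update_cache("a.c", b"int a;\n")
    first = commit_tree(write_tree(), [], "initial")
    # b.c may also be edited in the working directory, but it is never staged,
    # so it does not become part of the next tree.
    update_cache("a.c", b"int a = 1;\n")
    second = commit_tree(write_tree(), [first], "set a")
    print(first, second)
```

A file that never goes through `update_cache` never reaches the tree, which is the point about the cache not having to match the working area.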

Similarly, a "checkout" really is about first finding the tree ID you want
to check out, and then bringing it into the "directory cache" by doing a
"read-tree" on it. You can then actually update the directory cache
further: you might "read-tree" _another_ project, or you could decide that
you want to keep one of the files you already had.

So in that scenario, after doing the read-tree you'd do an "update-cache"
on the file you want to keep in your current directory structure, which
updates your directory cache to be a _mix_ of the original tree you now
want to check out _and_ of the file you want to use from your current
directory. Then doing a "checkout-cache -a" will actually do the actual
checkout, and only at that point does your working directory really get
changed.

Btw, you don't even have to have any working directory files at all. Let's
say that you have two independent trees, and you want to create a new
commit that is the join of those two trees (where one of the trees takes
precedence). You'd do a "read-tree <a> <b>", which will create a directory
cache (but not check out) that is the union of the <a> and <b> trees (<b>
will override). And then you can do a "write-tree" and commit the
resulting tree - without ever having _any_ of those files checked out.
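
The union-with-override behaviour described for "read-tree <a> <b>" amounts to a keyed merge. A minimal sketch, assuming each tree is flattened to a path -> sha1 mapping; the short ids below are placeholders, not real object names:

```python
def read_tree(*trees: dict[str, str]) -> dict[str, str]:
    """Build a cache from several trees; entries from later trees override earlier ones."""
    cache: dict[str, str] = {}
    for tree in trees:
        cache.update(tree)
    return cache


if __name__ == "__main__":
    a = {"Makefile": "aaa", "cache.h": "bbb"}   # placeholder ids
    b = {"Makefile": "ccc", "README": "ddd"}
    # The union of both trees, with b's Makefile winning; no file is checked out.
    print(read_tree(a, b))
```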

Linus

2005-04-10 21:27:17

by Petr Baudis

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Sun, Apr 10, 2005 at 09:13:19PM CEST, I got a letter
where Willy Tarreau <[email protected]> told me that...
> On Sun, Apr 10, 2005 at 08:45:22PM +0200, Petr Baudis wrote:
>
> > It turns out to be the forks for doing all the cuts and such what is
> > bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> > 15 forks per change, I guess, and for some reason cut takes a lot of
> > time on its own.
> >
> > I've rewritten the cuts with the use of bash arrays and other smart
> > stuff. I somehow don't feel comfortable using this and prefer the
> > old-fashioned ways, but it would be plain unusable without this.
>
> I've encountered the same problem in a config-generation script a while
> ago. Fortunately, bash provides enough ways to remove most of the forks,
> but the result is less portable.
>
> I've downloaded your code, but it does not compile here because of the
> tv_nsec fields in struct stat (2.4, glibc 2.2), so I cannot use it to
> get the most up to date version to take a look at the script. Basically,

Ok, I decided to stop this nsec madness (since it broke show-diff
anyway at least on my ext3), and you get it only if you pass -DNSEC
to CFLAGS now. Hope this fixes things for you. :-)

BTW, I regularly update the public copy as accessible on the web.

> all the 'cut' and 'sed' can be removed, as well as the 'dirname'. You
> can also call mkdir only if the dirs don't exist. I really think you
> should end up with only one fork in the loop to call 'diff'.

You still need to extract the file by cat-file too. ;-) And rm the files
after it compares them (so that we don't fill /tmp with crap like
certain awful programs like to do). But I will conditionalize the mkdir
calls, thanks for the suggestion - I think that's the last bit to be
squeezed from this loop (I'll yet check on the read proposal - I
considered it before and turned down for some reason, can't remember why
anymore, though).

Thanks,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 21:37:30

by Linus Torvalds

Subject: Re: Re: [ANNOUNCE] git-pasky-0.1



On Sun, 10 Apr 2005, Linus Torvalds wrote:
>
> Can you pull my current repo, which has "diff-tree -R" that does what the
> name suggests, and which should be faster than the 0.48 sec you see..

Actually, I changed things around. Everybody hated the "<" ">" lines, so I
put a changed thing on a line of its own with a "*" instead.

So you'd now see lines like

*100644->100644 1874e031abf6631ea51cf6177b82a1e662f6183e->e8181df8499f165cacc6a0d8783be7143013d410 CREDITS

which means that the CREDITS file has changed, and it shows you the mode
-> mode transition (that didn't change in this case) and the sha1 -> sha1
transition.

So now it's always just one line per change. Furthermore, the filename is
always field 3, if you use spaces as delimiters, regardless of whether
it's a +/-/* field.
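
A small parser shows how the one-line-per-change format can be consumed. This is a sketch: it assumes the "+" and "-" records keep the "<op><mode> <sha1> <name>" shape shown earlier in the thread, and the `Change` type and function names are illustrative.

```python
from typing import NamedTuple, Optional


class Change(NamedTuple):
    op: str                      # "+", "-" or "*"
    old_mode: Optional[str]
    new_mode: Optional[str]
    old_sha: Optional[str]
    new_sha: Optional[str]
    name: str


def parse_record(record: str) -> Change:
    """Parse one diff-tree record; the name is field 3 and may itself contain spaces."""
    head, shas, name = record.split(" ", 2)
    op, modes = head[0], head[1:]
    if op == "*":
        old_mode, new_mode = modes.split("->")
        old_sha, new_sha = shas.split("->")
    elif op == "+":
        old_mode, old_sha, new_mode, new_sha = None, None, modes, shas
    elif op == "-":
        old_mode, old_sha, new_mode, new_sha = modes, shas, None, None
    else:
        raise ValueError(f"unknown record type: {op!r}")
    return Change(op, old_mode, new_mode, old_sha, new_sha, name)


def parse_output(data: str) -> list[Change]:
    """Records are NUL-terminated, which is why the examples pipe through tr."""
    return [parse_record(r) for r in data.split("\0") if r]
```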

So let's say you want to merge two trees (dst1 and dst2) from a common
parent (src), what you would do is:

- get the list of files to merge:

diff-tree -R <dst1> <dst2> | tr '\0' '\n' > merge-files

- Which of those were changed by <src> -> <dstX>?

diff-tree -R <src> <dst1> | tr '\0' '\n' | join -j 3 - merge-files > dst1-change
diff-tree -R <src> <dst2> | tr '\0' '\n' | join -j 3 - merge-files > dst2-change

- Which of those are common to both? Let's see what the merge list is:

join dst1-change dst2-change > merge-list

and hopefully you'd usually be working on a very small list of files by
then (everything else you'd just pick from one of the destination trees
directly - you've got the name, the sha-file, everything: no need to even
look at the data).
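
The three diff-tree passes and the joins boil down to set operations on file names. A sketch under the assumption that each tree is flattened to a path -> sha1 mapping; `changed` and `plan_merge` are hypothetical helpers standing in for "diff-tree -R" and the join pipeline:

```python
Tree = dict[str, str]   # path -> sha1, flattened as a recursive diff-tree would see it


def changed(a: Tree, b: Tree) -> set[str]:
    """Names whose entry differs between two trees (added, removed or modified)."""
    return {name for name in a.keys() | b.keys() if a.get(name) != b.get(name)}


def plan_merge(src: Tree, dst1: Tree, dst2: Tree) -> tuple[set[str], set[str], set[str]]:
    """Return (take from dst1, take from dst2, needs a real three-way merge)."""
    differing = changed(dst1, dst2)                 # merge-files
    dst1_change = changed(src, dst1) & differing    # dst1-change
    dst2_change = changed(src, dst2) & differing    # dst2-change
    merge_list = dst1_change & dst2_change          # touched on both sides
    return dst1_change - merge_list, dst2_change - merge_list, merge_list


if __name__ == "__main__":
    src = {"Makefile": "m0", "cache.h": "c0", "read-tree.c": "r0"}   # placeholder ids
    dst1 = {"Makefile": "m1", "cache.h": "c0", "read-tree.c": "r1"}
    dst2 = {"Makefile": "m0", "cache.h": "c2", "read-tree.c": "r2"}
    print(plan_merge(src, dst1, dst2))
```

Only names in the third set need their contents looked at; the other two sets can be resolved from the sha1 entries alone.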

Does this sound sane? Pasky? Wanna try a "git merge" thing? Starting off
with the user having to tell what the common parent tree is - we can try
to do the "automatically find best common parent" crud later. THAT may be
expensive.

(Btw, this is why I think "diff-tree" is more important than actually
generating the real diff itself - the above uses diff-tree three times
just to cut down to the point where _hopefully_ you don't actually need to
generate very much diffs at all. So I want "diff-tree" to be really fast,
even if it then can take a minute to actually generate a big diff between
releases etc).

Linus

2005-04-10 21:52:14

by Chris Li

Subject: Re: more git updates..

I totally agree that the odds are really, really small.
That is why it is not worth handling the case. People who hit it
can just add a newline or something to avoid it, if
it happens at all.

It is a little peace of mind to know for sure that it did
not happen. I am just paranoid.

Chris

On Sun, Apr 10, 2005 at 12:23:52PM -0700, Paul Jackson wrote:
> > Some thing like the following patch, may be turn off able.
>
> Take out an old envelope and compute on it the odds of this
> happening.
>
> Say we have 10,000 kernel hackers, each producing one
> new file every minute, for 100 hours a week. And we've
> cloned a small army of Andrew Mortons to integrate
> the resulting tsunami of patches. And Linus is well
> cared for in the state funny farm.
>
> What is the probability that this check will fire even
> once, between now and 10 billion years from now, when
> the Sun has become a red giant destroying all life on
> planet Earth?
>
> --
> I won't rest till it's the best ...
> Programmer, Linux Scalability
> Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 22:07:51

by Luck, Tony

Subject: RE: more git updates..

>Also, I did actually debate that issue with myself, and decided that even
>if we do have tons of files per directory, git doesn't much care. The
>reason? Git never _searches_ for them. Assuming you have enough memory to
>cache the tree, you just end up doing a "lookup", and inside the kernel
>that's done using an efficient hash, which doesn't actually care _at_all_
>about how many files there are per directory.

So long as the hash *is* efficient when the directory is packed full of
38 character filenames made only of [0-9a-f] ... which might not match
the test cases under which the hash was picked :-) When there are some
full-sized kernel git images, someone should do a sanity check.
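
For reference, the layout under discussion maps an object name to a path by peeling off leading hex digits as a directory. A minimal sketch: the `fanout` parameter is hypothetical, and the `.dircache/objects` root is taken from the diagram later in this thread.

```python
def object_path(sha1_hex: str, fanout: int = 2, root: str = ".dircache/objects") -> str:
    """Map a 40-digit object name to <root>/<first fanout digits>/<remaining digits>."""
    if len(sha1_hex) != 40 or any(c not in "0123456789abcdef" for c in sha1_hex):
        raise ValueError("expected 40 lowercase hex digits")
    return f"{root}/{sha1_hex[:fanout]}/{sha1_hex[fanout:]}"


def bucket_count(fanout: int) -> int:
    """Number of first-level directories for a given number of leading hex digits."""
    return 16 ** fanout


if __name__ == "__main__":
    sha = "8fd07d4b7778cd0233ea0a17acd3fe9d710af035"
    print(object_path(sha), bucket_count(2))             # two digits: 38-character leaf names
    print(object_path(sha, fanout=3), bucket_count(3))   # three digits: 37-character leaf names
```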

>Hey, I may end up being wrong, and yes, maybe I should have done a
>two-level one. The good news is that we can trivially fix it later (even
>dynamically - we can make the "sha1 object tree layout" be a per-tree
>config option, and there would be no real issue, so you could make small
>projects use a flat version and big projects use a very deep structure
>etc). You'd just have to script some renames to move the files around.

It depends on how many eco-system shell scripts get built that need to
know about the layout ... if some shell/perl "libraries" encode this
filename layout (and people use them) ... then switching later would
indeed be painless.

-Tony

2005-04-10 22:11:23

by Petr Baudis

Subject: Re: RE: more git updates..

Dear diary, on Mon, Apr 11, 2005 at 12:07:37AM CEST, I got a letter
where "Luck, Tony" <[email protected]> told me that...
..snip..
> >Hey, I may end up being wrong, and yes, maybe I should have done a
> >two-level one. The good news is that we can trivially fix it later (even
> >dynamically - we can make the "sha1 object tree layout" be a per-tree
> >config option, and there would be no real issue, so you could make small
> >projects use a flat version and big projects use a very deep structure
> >etc). You'd just have to script some renames to move the files around.
>
> It depends on how many eco-system shell scripts get built that need to
> know about the layout ... if some shell/perl "libraries" encode this
> filename layout (and people use them) ... then switching later would
> indeed be painless.

FWIW, my short-term plans include support for monotone-like hash ID
shortening - it's enough to use the shortest leading unique part of the
ID to identify the revision. I will poke to the object repository for
that. I also already have Randy Dunlap's git lsobj, which will list all
objects of a specified type (very useful especially when looking for
orphaned commits and such rather lowlevel work).
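
The ID shortening idea can be sketched as a prefix search over the known object names. This is an illustration only, not the git-pasky code; a real tool would list the object directory instead of taking an in-memory list, and both function names are made up.

```python
def resolve(prefix: str, ids: list[str]) -> str:
    """Return the single ID starting with prefix, or raise if none or several match."""
    matches = [i for i in ids if i.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} objects")
    return matches[0]


def shortest_unique_prefix(full_id: str, ids: list[str]) -> str:
    """Shortest leading part of full_id that no other ID in ids shares."""
    others = [i for i in ids if i != full_id]
    for length in range(1, len(full_id) + 1):
        prefix = full_id[:length]
        if not any(o.startswith(prefix) for o in others):
            return prefix
    return full_id
```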

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 22:13:08

by Chris Li

Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 01:57:33PM -0700, Linus Torvalds wrote:
>
> > That way of thinking really doesn't work well here.
> >
> > I will have to look more closely at pasky's GIT toolkit
> > if I want to see an SCM style interface.
>
> Yes. You really should think of GIT as a filesystem, and of me as a
> _systems_ person, not an SCM person. In fact, I tend to detest SCM's. I
> think the reason I worked so well with BitKeeper is that Larry used to do
> operating systems. He's also a systems person, not really an SCM person.
> Or at least he's in between the two.
>

Yes, I was puzzled for a while about how to use git, until I realized that it is
a versioned file system.

BTW, one thing I learned from ext3 is that it is very useful to have some
compatibility flags for future development. I think if we want to reserve some
room in the file format for further development of git, now is the right time
to do it, before it gets big, e.g. an optional variable-size header in "tree"
including format version, capabilities etc. I can see the counterargument
that it is not as important as in a real file system, because it is a lot easier
to take it offline to upgrade the whole tree.

On the other hand, it costs almost nothing in terms of space and
CPU time: most directories do not reach a file system block boundary, so a few extra bytes
are almost free. If carefully planned, it will make future upgrades of git
a lot smoother.

What do you think?

Chris

2005-04-10 22:27:43

by Petr Baudis

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Sun, Apr 10, 2005 at 10:38:11PM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
> On Sun, 10 Apr 2005, Petr Baudis wrote:
> >
> > It turns out to be the forks for doing all the cuts and such what is
> > bogging it down so awfully (doing diff-tree takes 0.48s ;-). I do about
> > 15 forks per change, I guess, and for some reason cut takes a lot of
> > time on its own.
>
> Heh.
>
> Can you pull my current repo, which has "diff-tree -R" that does what the
> name suggests, and which should be faster than the 0.48 sec you see..

Funnily enough, now after some more cache teasing it's ~0.185. Yours is
still ~0.17, though. :/ (That might be because of the format changes,
though, since you do less printing now.) (BTW, all those measurements
are done on my AMD K6 walking on 1600MHz, 512M RAM, about 200M available
for caches.)

Just out of interest, did you have a look at my diff-tree -r
implementation and decide that you don't like it, or were you not aware
of it?

I will probably take most of your diff-tree change, but I'd prefer to do
the sha1->tree mapping directly in diff_tree().

> It may not matter a lot, since actually generating the diff from the file
> contents is what is expensive, but remember my goal: I want the expense of
> a diff-tree to be relative to the size of the diff, so that implies that
> small diffs have to be basically instantaneous. So I care.

Me too, of course.

> So I just tried the 2.6.7->2.6.8 diff, and for me the new recursive
> "diff-tree" can generate the _list_ of files changed in zero time:
>
> real 0m0.079s
> user 0m0.067s
> sys 0m0.024s
>
> but then _doing_ the diff is pretty expensive (in this case 3800+ files
> changed, so you have to unpack 7600+ objects - and even unpacking isn't
> the expensive part, the expense is literally in the diff operation
> itself).
>
> Me, the stuff I automate is the small steps. Doing a single checkin. So
> that's the case I care about going fast, when a "diff-tree" will likely
> have maybe five files or something. That's why I want the small
> incremental cases to go fast - if it takes me a minute to generate a diff
> for a _release_, that's not a big deal. I make one release every other
> month, but I work with lots of small patches all the time.

I see.

> Anyway, with a fast diff-tree, you should be able to generate the list of
> objects for a fast "merge". That's next.
>
> (And by "merge", I of course mean "suck". I'm talking about the old CVS
> three-way merge, and you have to specify the common parent explicitly and
> it won't handle any renames or any other crud. But it would get us to
> something that might actually be useful for simple things. Which is why
> "diff-tree" is important - it gives the information about what to tell
> merge).

I currently already do a merge when you track someone's source - it will
throw away your previous HEAD record though, so if you committed some
local changes after the previous pull, you will get orphaned commits and
the changes will turn to uncommitted ones. I have some ideas regarding
how to do it properly (and do any arbitrary merging, for that matter), I
hope to get to it as soon as I catch up with you. :-)

BTW, the three-way merge comes from RCS. That reminds me, is there any
tool which will take .rej files and throw them into the file to create
rcsmerge-like conflicts? Perhaps it's the fault of my bad tools, but I
prefer to work with the inline rejects much more to .rej files (except
to actually notice the rejects).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 22:30:15

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Sun, Apr 10, 2005 at 08:42:53PM CEST, I got a letter
where Christopher Li <[email protected]> told me that...
> I totally agree that the odds are really, really small.
> That is why it is not worth handling the case. People who hit it
> can just add a newline or something to avoid it, if
> it happens at all.
>
> It is a little peace of mind to know for sure that it did
> not happen. I am just paranoid.

BTW, I've merged the check to git-pasky some time ago, you can disable
it in the Makefile. It is by default on now, until someone convinces me
it actually affects performance measurably.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 22:36:59

by Linus Torvalds

Subject: Re: more git updates..



On Sun, 10 Apr 2005, Christopher Li wrote:
>
> BTW, one thing I learned from ext3 is that it is very useful to have some
> compatibility flags for future development. I think if we want to reserve some
> room in the file format for further development of git

Way ahead of you.

This is (one reason) why all git objects have the type embedded inside of
them. The format of all objects is totally regular: they are all
compressed with zlib, they are all named by the sha1 file, and they all
start out with a magic header of "<typename> <typesize><nul byte>".

So if I want to create a new kind of tree object that does the same thing
as the old one but has some other layout, I'd just call it something else.
Like "dir". That was what I initially planned to do about the change to
recursive tree objects, but it turned out to actually be a lot easier to
just encode it in the old type (that way the routines that read it don't
even have to care about old/new types - it's all the same to them).
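
The regular object layout described above is easy to reproduce. A minimal sketch: the object name here is the SHA-1 of the uncompressed header plus payload, which is how current git names objects; whether the tool hashed exactly those bytes at the time of this mail is not stated here, so treat that as an assumption. The function names are illustrative.

```python
import hashlib
import zlib


def encode_object(kind: str, payload: bytes) -> tuple[str, bytes]:
    """Return (object name, stored bytes) for a "<typename> <size><nul>" object."""
    raw = f"{kind} {len(payload)}".encode("ascii") + b"\0" + payload
    name = hashlib.sha1(raw).hexdigest()   # assumption: hash of the uncompressed bytes
    return name, zlib.compress(raw)


def decode_object(stored: bytes) -> tuple[str, bytes]:
    """Inflate a stored object and split it back into (typename, payload)."""
    raw = zlib.decompress(stored)
    header, _, payload = raw.partition(b"\0")   # split at the first NUL only
    kind, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return kind, payload
```

Because a reader only needs the header to know what it is looking at, a new type name such as "dir" round-trips through the same two functions without any change.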

Linus

2005-04-10 23:03:18

by Chris Li

Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
>
>
> On Sun, 10 Apr 2005, Christopher Li wrote:
> >
> > > BTW, one thing I learned from ext3 is that it is very useful to have some
> > > compatibility flags for future development. I think if we want to reserve some
> > > room in the file format for further development of git
>
> Way ahead of you.
>
> This is (one reason) why all git objects have the type embedded inside of
> them. The format of all objects is totally regular: they are all
> compressed with zlib, they are all named by the sha1 file, and they all
> start out with a magic header of "<typename> <typesize><nul byte>".
>
> So if I want to create a new kind of tree object that does the same thing
> as the old one but has some other layout, I'd just call it something else.
> Like "dir". That was what I initially planned to do about the change to
> recursive tree objects, but it turned out to actually be a lot easier to
> just encode it in the old type (that way the routines that read it don't
> even have to care about old/new types - it's all the same to them).

Ha, that is right. You putting the new type into the same object tricked me into
thinking I had to do it the same way. I totally forgot that I can introduce new types
of objects. That is even cleaner. Cool.

How about deleting trees from the cache? I don't need to delete stuff from
the official tree. It is more for my local version control.
Here is the use case:
- I check out git.git.
- I use quilt to build my series of patches, git-hack1, git-hack2 .. git-hack6.
Let's say those are stored in the git cache as well.
- I pick some of them and come up with a clean one, "submit.patch".
- submit.patch gets merged into the official git tree.
- Now I want to get rid of hack1 to hack6, but how?

One way to do it is to never commit hack1 to hack6 into git or the cache. They stay as quilt
patches only. But it is very tempting to let quilt use git instead of the
.pc/ directory; quilt could then be simplified to a particular use of patch and git.

Chris

2005-04-10 23:05:09

by Bernd Eckenfels

Subject: Re: more git updates..

In article <[email protected]> you wrote:
> (I repeat the xxx in the leaf name - easier to code.)

It is a bit OT, but just a note: there are file systems (hash functions) out
there that don't like a lot of files named the same way. For example NTFS with
the 8.3 short names.

Greetings
Bernd

2005-04-10 23:09:20

by Linus Torvalds

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.1



On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> I currently already do a merge when you track someone's source - it will
> throw away your previous HEAD record though

Not only that, it doesn't do what I consider a "merge".

A real merge should have two or more parents. The "commit-tree" command
already allows that: just add any arbitrary number of "-p xxxxxxxxx"
switches (well, I think I limited it to 16 parents, but that's just a
totally random number, there's nothing in the file format or anything
else that limits it).

So while you've merged my "data", you've not actually merged my
revision history in your tree.

And the reason a real merge _has_ to show both parents properly is that
unless you do that, you can never merge sanely another time without
getting lots of clashes from the previous merge. So it's important that a
merge really shows both trees it got data from.

This is, btw, also the reason I haven't merged with your tree - I want to
get to the point where I really _can_ merge without throwing away the
information. In fact, at this point I'd rather not merge with your tree at
all, because I consider your tree to be "corrupt" thanks to lacking the
merge history.

So you've done the data merge, but not the history merge.

And because you didn't do the history merge, there's no way to
automatically find out what point of my tree you merged _with_. See?

And since I have no way to see what point in time you merged with me, now
I can't generate a nice 3-way diff against the last common ancestor of
both of our trees.

So now I can't do a three-way merge with you based on any sane ancestor,
unless I start guessing which ancestor of mine you merged with. Now, that
"guess" is easy enough to do with a project like "git" which currently has
just a few tens of commits and effectively only two parallel development
trees, but the whole point is to get to a system where that isn't true..
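
To make the point about recorded parents concrete, here is a minimal Python sketch (not git code) of finding the "last common ancestor" in a commit graph where a commit may have several parents. The `parents` mapping, the commit names L1..L3 and P1..P2, and the function names are all hypothetical; the only thing taken from the mail is the idea that a merge commit lists every tree it merged from.

```python
from collections import deque

def ancestors(parents, commit):
    """Return every commit reachable from `commit` via parent links (inclusive)."""
    seen = {commit}
    queue = deque([commit])
    while queue:
        cur = queue.popleft()
        for p in parents.get(cur, ()):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common one."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {c for c in common
            if not any(c != d and c in ancestors(parents, d) for d in common)}

# Hypothetical history: L1 <- L2 <- L3 is one tree, P1 branched off at L1.
with_history_merge = {
    "L1": [], "L2": ["L1"], "L3": ["L2"],
    "P1": ["L1"],
    "P2": ["P1", "L2"],      # a real merge: both parents recorded
}
# Same data in P2, but only one parent recorded (a "data merge" only).
data_merge_only = dict(with_history_merge, P2=["P1"])

print(merge_bases(with_history_merge, "L3", "P2"))
print(merge_bases(data_merge_only, "L3", "P2"))
```

With both parents recorded, the base for the next merge is the commit that was actually merged (L2). With only one parent, the search falls back to the original branch point (L1), so everything already merged shows up again as a potential clash.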

Linus

2005-04-10 23:16:18

by Paul Jackson

Subject: Re: more git updates..

Useful explanation - thanks, Linus.

Is this picture and description accurate:

==============================================================


< working directory files (foo.c) >
^
^ |
| upward ops | downward ops |
| ---------- | ------------ |
| checkout-cache | update-cache |
| show-diff | v
v
< current directory cache (".dircache/index") >
^
^ |
| upward ops | downward ops |
| ---------- | ------------ |
| read-tree | write-tree |
| | commit-tree |
| v
v
< git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >


==============================================================


The checkout-cache and show-diff ops read their meta-data from
the cache, and the actual file contents from the git filesystem.
Similarly, the update-cache op writes meta-data into the cache,
and may create new files in the git filesystem.

The cache (but not the git filesystem) stores transient
information (ctime, mtime, dev, ino, uid, gid, and size)
about each working file update-cache has copied into the git
filesystem so that checkout-cache and show-diff can detect
changes in the contents of working files just from a stat,
without actually rereading the file.
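
A rough Python sketch of that stat-only check, under the assumption that a cache entry is simply the tuple of fields listed above; the helper names `stat_signature` and `looks_unchanged` are made up for illustration and this is not how the real index file is laid out.

```python
import os
import tempfile

# The transient fields named above, as exposed by os.stat().
STAT_FIELDS = ("st_ctime", "st_mtime", "st_dev", "st_ino",
               "st_uid", "st_gid", "st_size")

def stat_signature(path):
    """Collect the stat fields that a cache entry would remember for `path`."""
    st = os.stat(path)
    return tuple(getattr(st, field) for field in STAT_FIELDS)

def looks_unchanged(cache_entry, path):
    """True if the file's current stat data still matches the cached entry."""
    return cache_entry == stat_signature(path)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "foo.c")
    with open(path, "w") as f:
        f.write("int main(void) { return 0; }\n")

    entry = stat_signature(path)          # what update-cache would record
    print(looks_unchanged(entry, path))   # nothing touched the file yet

    with open(path, "a") as f:
        f.write("/* edited */\n")
    print(looks_unchanged(entry, path))   # the size (at least) differs now
```

The file contents are never read in the comparison; only when the stat data differs would a tool need to look at the data itself.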

In some sense, the cache holds the git filesystem inodes,
and the git filesystem holds the data blocks. Except that:
(1) the cache just holds the current "view" into the git
filesystem,
(2) objects in the filesystem have an "inode" number (their
<sha1> value) that is persistent whether in view or not,
(3) objects in the filesystem are not removed just because
nothing in the cache references them,
(4) objects in the filesystem can reference other objects,
that are typically also in the filesystem, but that can
still be reliably self-identified even if found in the
wild of say one's email inbox, and
(5) the view in the directory cache can itself be made into
a filesystem object - using commit-tree.


==============================================================

Minor question:

I must have an old version - I got 'git-0.03', but
it doesn't have 'checkout-cache', and its 'read-tree'
directly writes my working files.

How do I get a current version? Well, one way I see,
and that's to pick up Pasky's:

http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2

Perhaps that's the best way?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 23:22:27

by Linus Torvalds

Subject: Re: more git updates..



On Sun, 10 Apr 2005, Christopher Li wrote:
>
> How about deleting trees from the caches? I don't need to delete stuff from
> the official tree. It is more for my local version control.

I have a plan. Namely to have a "list-needed" command, which you give one
commit, and a flag implying how much "history" you want (*), and then it
spits out all the sha1 files it needs for that history.

Then you delete all the other ones from your SHA1 archive (easy enough to
do efficiently by just sorting the two lists: the list of "needed" files
and the list of "available" files).

Script that, and call the command "prune-tree" or something like that, and
you're all done.

(*) The amount of history you want might be "none", which is to say that
you don't want to go back in time, so you want _just_ the list of tree and
blob objects associated with that commit.

Or you might want a "linear" history, which would be the longest path
through the parent changesets to the root.

Or you might want "all", which would follow all parents and all trees.

Or you might want to prune the history tree by date - "give me all
history, but cut it off when you hit a parent that was done more than 6
months ago".

This "list-needed" thing is not just for pruning history either. If you
have a local tree "x", and you want to figure out how much of it you need
to send to somebody else who has an older tree "y", then what you'd do is
basically "list-needed x" and remove the set of "list-needed y". That
gives you the answer to the question "what's the minimum set of sha1 files
I need to send to the other guy so that he can re-create my top-of-tree".
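
As a sketch of the idea, assuming the object database can be modeled as a plain dictionary from object name to (type, references): "list-needed" is a reachability walk, and the set to send is a set difference. The `objects` layout, the short names like "c1" and "t1", and the `history` flag values are assumptions for illustration; only the "none" and "all" modes from the footnote are sketched.

```python
def list_needed(objects, commit, history="all"):
    """Return the set of object names needed for `commit`.

    history="none": the commit plus its tree and blobs only.
    history="all":  additionally follow every parent commit.
    """
    needed = set()
    stack = [commit]
    while stack:
        oid = stack.pop()
        if oid in needed:
            continue
        needed.add(oid)
        kind, refs = objects[oid]
        if kind == "commit":
            stack.append(refs["tree"])
            if history == "all":
                stack.extend(refs["parents"])
        elif kind == "tree":
            stack.extend(refs)
        # blobs reference nothing further
    return needed

# Hypothetical object database: two commits sharing one blob.
objects = {
    "c1": ("commit", {"tree": "t1", "parents": []}),
    "c2": ("commit", {"tree": "t2", "parents": ["c1"]}),
    "t1": ("tree", ["b1", "b2"]),
    "t2": ("tree", ["b1", "b3"]),
    "b1": ("blob", None),
    "b2": ("blob", None),
    "b3": ("blob", None),
}

# x = "c2" is the local tree, y = "c1" is what the other side already has.
to_send = list_needed(objects, "c2") - list_needed(objects, "c1")
print(sorted(to_send))

# Pruning: everything available that is not needed for a history-less checkout.
prunable = set(objects) - list_needed(objects, "c2", history="none")
print(sorted(prunable))
```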

My second plan is to make somebody else so fired up about the problem that
I can just sit back and take patches. That's really what I'm best at.
Sitting here, in the (rain) on the patio, drinking a foofy tropical drink,
and pressing the "apply" button. Then I take all the credit for my
incredible work.

Hint, hint.

Linus

2005-04-10 23:30:27

by Petr Baudis

Subject: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Mon, Apr 11, 2005 at 01:10:58AM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
>
>
> On Mon, 11 Apr 2005, Petr Baudis wrote:
> >
> > I currently already do a merge when you track someone's source - it will
> > throw away your previous HEAD record though
>
> Not only that, it doesn't do what I consider a "merge".
>
> A real merge should have two or more parents. The "commit-tree" command
> already allows that: just add any arbitrary number of "-p xxxxxxxxx"
> switches (well, I think I limited it to 16 parents, but that's just a
> totally random number, there's nothing in the file format or anything
> else that limits it).
>
> So while you've merged my "data", you've not actually merged my
> revision history in your tree.

Well, that's exactly what I was (am) going to do. :-) That's also why I
said that I (virtually) throw the local commits away now. Instead, if
there were any local commits, I will do git merge:

commit-tree $(write-tree) -p $local_head -p $tracked_tree

Note that I will need to make this two-phase - first applying the
changes, then doing the commit; between those two phases, the user
should resolve potential conflicts and check if the merge went right.

I think I will name the first phase git merge and the second phase will
be just git commit, and I will store the merge information in
.dircache/. (BTW, I think the directory name is pretty awful; what about
.git/ ?)

> And the reason a real merge _has_ to show both parents properly is that
> unless you do that, you can never merge sanely another time without
> getting lots of clashes from the previous merge. So it's important that a
> merge really shows both trees it got data from.
>
> This is, btw, also the reason I haven't merged with your tree - I want to
> get to the point where I really _can_ merge without throwing away the
> information. In fact, at this point I'd rather not merge with your tree at
> all, because I consider your tree to be "corrupt" thanks to lacking the
> merge history.
>
> So you've done the data merge, but not the history merge.
>
> And because you didn't do the history merge, there's no way to
> automatically find out what point of my tree you merged _with_. See?
>
> And since I have no way to see what point in time you merged with me, now
> I can't generate a nice 3-way diff against the last common ancestor of
> both of our trees.
>
> So now I can't do a three-way merge with you based on any sane ancestor,
> unless I start guessing which ancestor of mine you merged with. Now, that
> "guess" is easy enough to do with a project like "git" which currently has
> just a few tens of commits and effectively only two parallel development
> trees, but the whole point is to get to a system where that isn't true..

Well, I've wanted to get the basic things working first before doing git
merge. (Especially since until recently, diff-tree was PITA to work
with, and before that it didn't even exist.) If you want, I can rebuild
my tree with doing the merging properly, after I have git merge working.

(BTW, it would be useful to have a tool which just blindly takes what
you give it on input and throws it to an object of given type; I will
need to construct arbitrary commits during the rebuild if I'm to keep
the correct dates.)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 23:27:41

by Paul Jackson

Subject: Re: [ANNOUNCE] git-pasky-0.1

Petr wrote:
> That reminds me, is there any
> tool which will take .rej files and throw them into the file to create
> rcsmerge-like conflicts?

Check out 'wiggle'
http://www.cse.unsw.edu.au/~neilb/source/wiggle/

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-10 23:38:54

by Linus Torvalds

Subject: Re: more git updates..



On Sun, 10 Apr 2005, Paul Jackson wrote:
>
> Useful explanation - thanks, Linus.

Hey. You're welcome. Especially when you create good documentation for
this thing.

Because:

> Is this picture and description accurate:

[ deleted, but I'll probably try to put it in an explanation file
somewhere ]

Yes. Excellent.

> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.

Yes. Crappy old tree, but it can still read my git.git directory, so you
can use it to update to my current source base.

However, from a usability angle, my source-base really has been
concentrating _entirely_ on just the plumbing, and if you actually want a
faucet or a toilet _connected_ to the plumbing, you're better off with
Pasky's tree, methinks:

> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

Indeed. He's got a number of shell scripts etc to automate the boring
parts.

Linus

2005-04-10 23:45:25

by Linus Torvalds

Subject: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1



On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> (BTW, it would be useful to have a tool which just blindly takes what
> you give it on input and throws it to an object of given type; I will
> need to construct arbitrary commits during the rebuild if I'm to keep
> the correct dates.)

Hah. That's what "COMMITTER_NAME" "COMMITTER_EMAIL" and "COMMITTER_DATE"
are there for.

There's two things to commits: when (and by whom) it was committed to a
tree, and when the changes were really done.

So set the COMMITTER_xxx things to the person/time you want to consider
the _original_ one, and let "commit-tree" author you as the creator of the
commit itself. The regular "ChangeLog" thing should only show the author
and original time, but it's nice to see who created the commit itself.

I did this very much on purpose: see how I always try to attribute
authorship in BK to the person who actually wrote the code. At the same
time, I think it's interesting from a tracking standpoint to also see
when/where that change got introduced into a tree.

I _tried_ to get this right in the sparse tree conversion. I won't
guarantee that it's all correct, but the top commit in the sparse tree
looks like this:

tree 67607f05a66e36b2f038c77cfb61350d2110f7e8
parent 9c59995fef9b52386e5f7242f44720a7aca287d7
author Christopher Li <[email protected]> Sat Apr 2 09:30:09 PST 2005
committer Linus Torvalds <[email protected]> Thu Apr 7 20:06:31 2005

...

exactly because I tracked when I committed it to the sparse tree
_separately_ from tracking when it was created.

So when I re-create the sparse-tree, I'll also end up re-writing the
"committer" information. And that's proper. That's really saying "this
sha1 object was created by Xxxx at time Xxxx".
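
For illustration, a small Python sketch that splits a commit object's text into its header fields, keeping "author" (who wrote the change, and when) apart from "committer" (who created this sha1 object, and when). The tree and parent values are the ones shown above; the e-mail addresses and the log message are placeholders, and `parse_commit` is a hypothetical helper, not part of git.

```python
def parse_commit(text):
    """Split commit text into header fields and the free-form message."""
    header, _, message = text.partition("\n\n")
    info = {"parents": []}
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            info["parents"].append(value)   # a merge has several parent lines
        else:
            info[key] = value
    info["message"] = message
    return info

# Placeholder addresses and message; tree/parent copied from the example above.
sample = (
    "tree 67607f05a66e36b2f038c77cfb61350d2110f7e8\n"
    "parent 9c59995fef9b52386e5f7242f44720a7aca287d7\n"
    "author Christopher Li <author@example.invalid> Sat Apr 2 09:30:09 PST 2005\n"
    "committer Linus Torvalds <committer@example.invalid> Thu Apr 7 20:06:31 2005\n"
    "\n"
    "placeholder log message\n"
)

commit = parse_commit(sample)
print(commit["author"])      # what a ChangeLog would show
print(commit["committer"])   # rewritten whenever the tree is re-created
```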

Btw, the "COMMITTER_xxxx" environment variables are very confusingly
named. They actually go into the _author_ line in the commit object. I'm a
total retard, and I really don't know why I called it "COMMITTER_xxx"
instead of "AUTHOR_xxx".

Linus "retard" Torvalds

2005-04-10 23:49:52

by Petr Baudis

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Sun, Apr 10, 2005 at 11:39:02PM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
> On Sun, 10 Apr 2005, Linus Torvalds wrote:
> >
> > Can you pull my current repo, which has "diff-tree -R" that does what the
> > name suggests, and which should be faster than the 0.48 sec you see..
>
> Actually, I changed things around. Everybody hated the "<" ">" lines, so I
> put a changed thing on a line of its own with a "*" instead.
>
> So you'd now see lines like
>
> *100644->100644 1874e031abf6631ea51cf6177b82a1e662f6183e->e8181df8499f165cacc6a0d8783be7143013d410 CREDITS
>
> which means that the CREDITS file has changed, and it shows you the mode
> -> mode transition (that didn't change in this case) and the sha1 -> sha1
> transition.
>
> So now it's always just one line per change. Furthermore, the filename is
> always field 3, if you use spaces as delimiters, regardless of whether
> it's a +/-/* field.

That's great, just when I finally managed to properly fix the xargs
boundary case in gitdiff-do (without throwing away the NUL-termination).
You know how to please people! ;-)

(Not that I'd have *anything* against the change. The logic is simpler
and you'll be actually able to work with diff-tree a little sanely.)

BTW, it is quite handy to have the entry type in the listing (guessing
that from mode in the script just doesn't feel right and doing explicit
cat-file kills the performance). I would also really prefer the fields
separated by tabs. It looks nicer on the screen (aligned, e.g. modes and
type are varsized), and is also easier to parse (cut defaults to tabs as
delimiters, for example).
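
A minimal Python sketch of parsing the one-line format quoted above. It assumes the "+" and "-" lines keep the earlier `<marker><mode> <sha1> <name>` layout and that only "*" lines carry the "old->new" pairs; the function name and the returned dictionary keys are made up for illustration.

```python
def parse_diff_tree_line(line):
    """Split one diff-tree output line into marker, modes, sha1s and filename."""
    kind, rest = line[0], line[1:]
    # The filename is field 3; split only twice so names with spaces survive.
    mode, sha, name = rest.split(" ", 2)
    if kind == "*":
        old_mode, new_mode = mode.split("->")
        old_sha, new_sha = sha.split("->")
    elif kind == "+":
        old_mode, old_sha, new_mode, new_sha = None, None, mode, sha
    elif kind == "-":
        old_mode, old_sha, new_mode, new_sha = mode, sha, None, None
    else:
        raise ValueError(f"unknown change marker {kind!r}")
    return {
        "kind": kind, "name": name,
        "old_mode": old_mode, "new_mode": new_mode,
        "old_sha": old_sha, "new_sha": new_sha,
    }

line = ("*100644->100644 "
        "1874e031abf6631ea51cf6177b82a1e662f6183e->"
        "e8181df8499f165cacc6a0d8783be7143013d410 CREDITS")
entry = parse_diff_tree_line(line)
print(entry["name"], entry["old_sha"][:8], entry["new_sha"][:8])
```

With tab-separated fields, as suggested above, only the `split(" ", 2)` call would need to change.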

> So let's say you want to merge two trees (dst1 and dst2) from a common
> parent (src), what you would do is:
>
> - get the list of files to merge:
>
> diff-tree -R <dst1> <dst2> | tr '\0' '\n' > merge-files

...oh, I probably forgot to ask - why did you choose -R instead of -r?
It looks rather alien to me; if it starts by 'diff', my hand writes -r
without thinking.

> - Which of those were changed by <src> -> <dstX>?
>
> diff-tree -R <src> <dst1> | tr '\0' '\n' | join -j 3 - merge-files > dst1-change
> diff-tree -R <src> <dst2> | tr '\0' '\n' | join -j 3 - merge-files > dst2-change
>
> - Which of those are common to both? Let's see what the merge list is:
>
> join dst1-change dst2-change > merge-list
>
> and hopefully you'd usually be working on a very small list of files by
> then (everything else you'd just pick from one of the destination trees
> directly - you've got the name, the sha-file, everything: no need to even
> look at the data).

Ok, this looks reasonable. (Provided that I DWYM regarding the joins.)
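
One way to read the joins, sketched in Python with sets instead of join(1): take the filenames that differ between the two destination trees, keep those that each side changed relative to the common parent, and intersect. The helper names and the shortened "a1"/"b1" ids are hypothetical; each input is a list of lines in the one-line diff-tree format with the filename in field 3.

```python
def changed_names(diff_lines):
    """Return the set of filenames (field 3) from diff-tree output lines."""
    return {line.split(" ", 2)[2] for line in diff_lines}

def merge_list(src_dst1, src_dst2, dst1_dst2):
    """Files that differ between dst1 and dst2 AND were changed on both sides."""
    merge_files = changed_names(dst1_dst2)                  # diff-tree dst1 dst2
    dst1_change = changed_names(src_dst1) & merge_files     # join with src->dst1
    dst2_change = changed_names(src_dst2) & merge_files     # join with src->dst2
    return dst1_change & dst2_change

# Hypothetical diff-tree outputs with made-up short ids.
src_dst1 = ["*100644->100644 a0->a1 Makefile",
            "*100644->100644 b0->b1 cache.h"]
src_dst2 = ["*100644->100644 a0->a2 Makefile",
            "+100644 c2 merge.c"]
dst1_dst2 = ["*100644->100644 a1->a2 Makefile",
             "*100644->100644 b1->b0 cache.h",
             "+100644 c2 merge.c"]

print(sorted(merge_list(src_dst1, src_dst2, dst1_dst2)))
```

Here only Makefile was changed on both sides and needs a content merge; cache.h and merge.c can be picked directly from whichever tree changed them.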

> Does this sound sane? Pasky? Wanna try a "git merge" thing? Starting off
> with the user having to tell what the common parent tree is - we can try
> to do the "automatically find best common parent" crud later. THAT may be
> expensive.

I will definitely try "git merge", but maybe not this night anymore
(it's already 1:32 here now).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-10 23:56:35

by Petr Baudis

Subject: Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Mon, Apr 11, 2005 at 01:46:50AM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
>
>
> On Mon, 11 Apr 2005, Petr Baudis wrote:
> >
> > (BTW, it would be useful to have a tool which just blindly takes what
> > you give it on input and throws it to an object of given type; I will
> > need to construct arbitrary commits during the rebuild if I'm to keep
> > the correct dates.)
>
> Hah. That's what "COMMITTER_NAME" "COMMITTER_EMAIL" and "COMMITTER_DATE"
> are there for.
>
> There's two things to commits: when (and by whom) it was committed to a
> tree, and when the changes were really done.
>
> So set the COMMITTER_xxx things to the person/time you want to consider
> the _original_ one, and let "commit-tree" author you as the creator of the
> commit itself. The regular "ChangeLog" thing should only show the author
> and original time, but it's nice to see who created the commit itself.

I already use those - look at my ChangeLog. (That's because for certain
reasons I'm working on git in a half-broken chrooted environment.)

When rebuilding the tree from scratch, I wanted to do it
transparently - that is, so that no one could notice that I rebuilt it,
since it effectively still _is_ the original tree from the data
standpoint, just the history flow is actually correct this time.

> Btw, the "COMMITTER_xxxx" environment variables are very confusingly
> named. They actually go into the _author_ line in the commit object. I'm a
> total retard, and I really don't know why I called it "COMMITTER_xxx"
> instead of "AUTHOR_xxx".

So, who will fix it in his tree first! ;-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 00:10:59

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Mon, Apr 11, 2005 at 01:14:57AM CEST, I got a letter
where Paul Jackson <[email protected]> told me that...
> Useful explanation - thanks, Linus.
>
> Is this picture and description accurate:
>
> ==============================================================
>
>
> < working directory files (foo.c) >
> ^
> ^ |
> | upward ops | downward ops |
> | ---------- | ------------ |
> | checkout-cache | update-cache |
> | show-diff | v
> v
> < current directory cache (".dircache/index") >
> ^
> ^ |
> | upward ops | downward ops |
> | ---------- | ------------ |
> | read-tree | write-tree |
> | | commit-tree |
> | v
> v
> < git filesystem (blobs, trees, commits: .dircache/{HEAD,objects}) >

Well, except that from purely technical standpoint commit-tree has
nothing to do in this picture - it creates new object in the git
filesystem based on its input data, but regardless to the directory
cache or current tree. It probably still belongs where it is from the
workflow standpoint, though.

..snip..
> Minor question:
>
> I must have an old version - I got 'git-0.03', but
> it doesn't have 'checkout-cache', and its 'read-tree'
> directly writes my working files.
>
> How do I get a current version? Well, one way I see,
> and that's to pick up Pasky's:
>
> http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
>
> Perhaps that's the best way?

You can take mine, and do:

git pull pasky
git pull linus
cp .dircache/HEAD .dircache/HEAD.local

Now, your tree and git filesystem are up to date.

git track local

Now, when you do git pull pasky, your working tree will not be updated
automatically anymore.

git track linus

Now, you start tracking Linus' tree instead. Note that the initial
update will blow away the scripts in your current tree, so before you do
the last two steps you will probably want to clone the tree and set PATH
to the one still tracking me, so you get all the comfort. ;-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 00:16:11

by Randy.Dunlap

Subject: Re: [ANNOUNCE] git-pasky-0.1

On Sun, 10 Apr 2005 16:23:11 -0700 Paul Jackson wrote:

| Petr wrote:
| > That reminds me, is there any
| > tool which will take .rej files and throw them into the file to create
| > rcsmerge-like conflicts?
|
| Check out 'wiggle'
| http://www.cse.unsw.edu.au/~neilb/source/wiggle/

or Chris Mason's 'rej' program:
ftp://ftp.suse.com/pub/people/mason/rej/


---
~Randy

2005-04-11 00:19:00

by Linus Torvalds

Subject: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)



Btw, does anybody have strong opinions on the license? I didn't put in a
COPYING file exactly because I was torn between GPLv2 and OSL2.1.

I'm inclined to go with GPLv2 just because it's the most common one, but I
was wondering if anybody really had strong opinions. For example, I'd
really make it "v2 by default" like the kernel, since I'm sure v3 will be
fine, but regardless of how sure I am, I'm _not_ a gambling man.

Linus

2005-04-11 00:19:43

by Paul Jackson

Subject: Re: more git updates..

Linus writes:
> Hey. You're welcome. Especially when you create good documentation for
> this thing.

Glad to be of service. Sounds like the umbrella in your foofy
drink will come in handy - keeping off the rain.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-11 00:27:13

by Petr Baudis

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Dear diary, on Mon, Apr 11, 2005 at 02:20:52AM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
> Btw, does anybody have strong opinions on the license? I didn't put in a
> COPYING file exactly because I was torn between GPLv2 and OSL2.1.
>
> I'm inclined to go with GPLv2 just because it's the most common one, but I
> was wondering if anybody really had strong opinions. For example, I'd
> really make it "v2 by default" like the kernel, since I'm sure v3 will be
> fine, but regardless of how sure I am, I'm _not_ a gambling man.

Oh, I wanted to ask about this too. I'd mostly prefer GPLv2 (I have no
problem with the version restriction, I usually do it too), it's the one
I'm mostly familiar with and OSL appears to be incompatible with GPL (at
least FSF says so about OSL1.0), which might create various annoying
issues. I hate when licenses get in my way and prevent me from possibly
including some useful code.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 00:30:13

by Petr Baudis

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.1

Dear diary, on Sun, Apr 10, 2005 at 10:38:11PM CEST, I got a letter
where Linus Torvalds <[email protected]> told me that...
..snip..
> Can you pull my current repo, which has "diff-tree -R" that does what the
> name suggests, and which should be faster than the 0.48 sec you see..

Am I just missing something, or does your diff-tree not handle
added/removed directories?

(Mine does! *hint* *hint* It also doesn't bother with dynamic
allocation, but someone might consider the static path buffer ugly.
Anyway, I hacked it with a plan to do a massive cleanup of the file
later.)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 00:38:23

by Chris Li

Subject: Re: more git updates..

I see. It just needs some basic set operations (+, -, and)
and some way to select a set:


                      sha5--->
                     /
                    /
sha1-->sha2-->sha3--
        \       /
         \     /
          >sha4


list sha1           # the whole file list in changeset sha1
                    # {sha1}
list sha1,sha1      # same as above
list sha1,sha2      # the whole file list between changeset sha1
                    # and changeset sha2
                    # {sha1, sha2} in the example
list sha1,sha3      # {sha1, sha2, sha3, sha4}

list sha1,any       # all the changesets reachable from sha1
                    # {sha1, ... sha5, ...}

new sha1,sha2       # all the new files added between sha1 and sha2 (+)
changed sha1,sha2   # all the changed files between sha1 and sha2 (>) (<)
deleted sha1,sha2   # all the deleted files between sha1 and sha2 (-)

before time         # all the files before time
after time          # all the files after time


So in my example, the files I want to delete are:

{list hack1, base} + {list hack2, base} ... {list hack6, base} \
- {list official_merge, base}



On Sun, Apr 10, 2005 at 04:21:08PM -0700, Linus Torvalds wrote:
>
>
> > the official tree. It is more for my local version control.
>
> I have a plan. Namely to have a "list-needed" command, which you give one
> commit, and a flag implying how much "history" you want (*), and then it
> spits out all the sha1 files it needs for that history.
>
> Then you delete all the other ones from your SHA1 archive (easy enough to
> do efficiently by just sorting the two lists: the list of "needed" files
> and the list of "available" files).
>
> Script that, and call the command "prune-tree" or something like that, and
> you're all done.
>
> (*) The amount of history you want might be "none", which is to say that
> you don't want to go back in time, so you want _just_ the list of tree and
> blob objects associated with that commit.

That will be {list head}

>
> Or you might want a "linear" history, which would be the longest path
> through the parent changesets to the root.

That will be {list head,root}

>
> Or you might want "all", which would follow all parents and all trees.

That will be {list any, root}

>
> Or you might want to prune the history tree by date - "give me all
> history, but cut it off when you hit a parent that was done more than 6
> months ago".

That is {after -6month }

>
> This "list-needed" thing is not just for pruning history either. If you
> have a local tree "x", and you want to figure out how much of it you need
> to send to somebody else who has an older tree "y", then what you'd do is
> basically "list-needed x" and remove the set of "list-needed y". That
> gives you the answer to the question "what's the minimum set of sha1 files
> I need to send to the other guy so that he can re-create my top-of-tree".
>

That is {list x, any} - {list y, any}


> My second plan is to make somebody else so fired up about the problem that
> I can just sit back and take patches. That's really what I'm best at.
> Sitting here, in the (rain) on the patio, drinking a foofy tropical drink,
> and pressing the "apply" button. Then I take all the credit for my
> incredible work.

Sounds like a good plan.

Chris

2005-04-11 01:09:22

by Linus Torvalds

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.1



On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> Dear diary, on Sun, Apr 10, 2005 at 10:38:11PM CEST, I got a letter
> where Linus Torvalds <[email protected]> told me that...
> ..snip..
> > Can you pull my current repo, which has "diff-tree -R" that does what the
> > name suggests, and which should be faster than the 0.48 sec you see..
>
> Am I just missing something, or does your diff-tree not handle
> added/removed directories?

You're not missing anything, I did it that way on purpose. I thought it
would be easier to do the expansion in the caller (who knows what it is
they want to do with the end result).

But now that I look at merging, I realize that was actually the wrong
thing to do. A merge algorithm definitely wants to see the expanded tree,
since it will compare/join several of the diff-tree output things.

So I'll either fix it or decide to just go with your version instead. I'm
not overly proud.

Linus

2005-04-11 01:58:56

by Petr Baudis

Subject: [ANNOUNCE] git-pasky-0.2

Hello,

here goes git-pasky-0.2, my set of patches and scripts upon
Linus' git, aimed at human usability and to an extent a SCM-like usage.

If you already have a previous git-pasky version, just git pull pasky
to get it. Otherwise, you can get it from:

http://pasky.or.cz/~pasky/dev/git/

Please see the README there and/or the parent post for detailed
instructions. You can find the changes from the last announcement
in the ChangeLog (releases have separate commits so you can find them
easily; they are also tagged for purpose of diffing etc).

This release contains mostly bugfixes, performance enhancements
(especially w.r.t. git diff), and some merges with Linus (except for
diff-tree, where I merged only the new output format). New features
are trivial - support for tagging and short SHA1 ids; you can use
only the start of the SHA1 hash long enough to be unambiguous.
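
The short-id lookup can be sketched in a few lines of Python: a prefix is accepted only if exactly one known object name starts with it. This is an illustration of the idea, not the git-pasky script; `resolve_short_id` is a hypothetical name, and the ids are two blob names from the diff-tree output earlier in the thread that happen to share the prefix "fd".

```python
def resolve_short_id(prefix, known_ids):
    """Expand an unambiguous SHA1 prefix to the full id, or raise."""
    matches = [full for full in known_ids if full.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object id starts with {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"{prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]

known_ids = [
    "fd690acc02ef9c06d7c4c3541f69b10ca4b4f8c9",
    "fd00e5603dcc4a93acceda0b8cb914fabc8645d5",
    "9c59995fef9b52386e5f7242f44720a7aca287d7",
]

print(resolve_short_id("9c59", known_ids))   # unique, expands fine
print(resolve_short_id("fd6", known_ids))    # three characters are enough here
try:
    resolve_short_id("fd", known_ids)        # two ids share this prefix
except ValueError as err:
    print("error:", err)
```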

My immediate plan is implementing git merge, which I will do tomorrow,
if no one does it before then, that is. ;-)

Any feedback/opinions/suggestions/patches (especially patches) are
welcome.

Have fun,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 02:46:22

by Daniel Barkalow

Subject: Re: [ANNOUNCE] git-pasky-0.2

On Mon, 11 Apr 2005, Petr Baudis wrote:

> Hello,
>
> here goes git-pasky-0.2, my set of patches and scripts upon
> Linus' git, aimed at human usability and to an extent a SCM-like usage.

Incidentally, the git-pasky-base tarball you have up has its checked-out
tree partway between 0.1 and 0.2, and doesn't compile. (The included HEAD
version in .dircache is fine, if the user has some way to bootstrap)

-Daniel
*This .sig left intentionally blank*

2005-04-11 05:18:49

by Nur Hussein

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

> Btw, does anybody have strong opinions on the license? I didn't put in a
> COPYING file exactly because I was torn between GPLv2 and OSL2.1.

I think GPLv2 would create the least amount of objection in the
community, so I'd probably want to go with that.

Nur Hussein

2005-04-11 06:57:49

by bert hubert

Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:

> compressed with zlib, they are all named by the sha1 file, and they all

Now I know this is a conscious decision, but recent zlib allows you to write
out gzip content, at a cost of 14 bytes I think per file, by adding 16 to
the window size. This in turn would allow users to zcat your objects at
ease.

You get confirmation of completeness of the file for free, as gzip encodes
the length of the file at the end.
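
A small sketch of the gzip-framing option using Python's zlib binding rather than the C API. As far as I know, on the compression side it is 16 added to the window bits that selects a gzip header and trailer, while 32 added on the decompression side asks for automatic zlib/gzip detection; the helper names and the sample payload below are made up for illustration.

```python
import gzip
import zlib

def compress_as_gzip(data, level=9):
    """Deflate `data` with gzip framing (window bits 15, plus 16 for gzip)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, 15 + 16)
    return comp.compress(data) + comp.flush()

def decompress_any(data):
    """Inflate either zlib- or gzip-framed data (window bits 15, plus 32)."""
    return zlib.decompress(data, 15 + 32)

payload = b"blob contents\n" * 4          # stand-in for an object's contents

gz_object = compress_as_gzip(payload)     # what zcat could read
zlib_object = zlib.compress(payload, 9)   # plain zlib framing, for comparison

# The gzip-framed object is readable by ordinary gzip tooling.
print(gzip.decompress(gz_object) == payload)
# One reader can accept both framings.
print(decompress_any(gz_object) == decompress_any(zlib_object) == payload)
# Per-object size overhead of gzip framing over zlib framing, in bytes.
print(len(gz_object) - len(zlib_object))
```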

Perhaps something to consider.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2005-04-11 07:20:12

by Christer Weinigel

Subject: Re: more git updates..

bert hubert <[email protected]> writes:

> On Sun, Apr 10, 2005 at 03:38:39PM -0700, Linus Torvalds wrote:
>
> > compressed with zlib, they are all named by the sha1 file, and they all
>
> Now I know this is a conscious decision, but recent zlib allows you to write
> out gzip content, at a cost of 14 bytes I think per file, by adding 16 to
> the window size. This in turn would allow users to zcat your objects at
> ease.
>
> You get confirmation of completeness of the file for free, as gzip encodes
> the length of the file at the end.

I would very much like it if git used normal gzip files with a .gz
extension. Doing it this way means that the compression methods can
be extended in the future. I.e:

ab/1234567890.gz gzip compressed
ab/1234567890.xd xdelta compressed

I find the xdelta encoding very attractive since it can probably
reduce the size of the repository drastically. A compression script
could run nightly and xdelta compress everything that's older than
a few months (to figure out what files to create the delta from, just
look at the commit files and compare the parent tree to the current
tree).

Of course, this means that a dumb wget won't work all that well to
synchronize two trees, but it might be worthwhile anyway.

/Christer

--
"Just how much can I get away with and still go to heaven?"

Freelance consultant specializing in device driver programming for Linux
Christer Weinigel <[email protected]> http://www.weinigel.se

2005-04-11 07:45:51

by Ingo Molnar

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)


* Linus Torvalds <[email protected]> wrote:

> Btw, does anybody have strong opinions on the license? I didn't put in
> a COPYING file exactly because I was torn between GPLv2 and OSL2.1.
>
> I'm inclined to go with GPLv2 just because it's the most common one,
> but I was wondering if anybody really had strong opinions. For
> example, I'd really make it "v2 by default" like the kernel, since I'm
> sure v3 will be fine, but regardless of how sure I am, I'm _not_ a
> gambling man.

is there any fundamental problem with going with v2 right now, and then
once v3 is out and assuming it looks ok, all newly copyrightable bits
(new files, rewrites, substantial contributions, etc.) get a v3
copyright? (and the collection itself would be v3 too) That method
wouldnt make it fully v3 automatically once v3 is out, but with time
there would be enough v3 bits in it to make it essentially v3. This way
we wouldnt have to blanket trust v3 before having seen it, and wouldnt
be stuck at v2 either.

Ingo

2005-04-11 08:43:32

by Florian Weimer

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

* Ingo Molnar:

> is there any fundamental problem with going with v2 right now, and then
> once v3 is out and assuming it looks ok, all newly copyrightable bits
> (new files, rewrites, substantial contributions, etc.) get a v3
> copyright? (and the collection itself would be v3 too) That method
> wouldnt make it fully v3 automatically once v3 is out, but with time
> there would be enough v3 bits in it to make it essentially v3.

Almost certainly, v3 will be incompatible with v2 because it adds
further restrictions. This means that your proposal would result in
software which is not redistributable by third parties.

2005-04-11 08:51:05

by Ingo Molnar

Subject: Re: [ANNOUNCE] git-pasky-0.2


* Petr Baudis <[email protected]> wrote:

> Hello,
>
> here goes git-pasky-0.2, my set of patches and scripts upon Linus'
> git, aimed at human usability and to an extent a SCM-like usage.

works fine on FC4, only minor issues: 'git' in the tarball didnt have
the x permission. Also, your scripts assume they are in $PATH. When
trying out a tarball one doesnt usually do a 'make install' but tries
stuff locally. Also, 'make install' doesnt seem to install the git
script itself, is that intentional?

Ingo

2005-04-11 09:27:35

by Anton Altaparmakov

Subject: Re: more git updates..

On Mon, 2005-04-11 at 01:04 +0200, Bernd Eckenfels wrote:
> In article <[email protected]> you wrote:
> > (I repeat the xxx in the leaf name - easier to code.)
>
> It is a bit OT, but just a note: there are file systems (hash functions) out
> there who dont like a lot of files named the same way. For example NTFS with
> the 8.3 short names.

Since you mention NTFS, there is no need to worry about that for Linux.
Certainly the Linux kernel NTFS driver is never going to create 8.3
short names. (It doesn't create names at all at the moment but my grand
plan is that it will only ever create file names in the Win32 and/or
POSIX name spaces. The DOS name space is a thing of the past IMO.)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2005-04-11 10:20:09

by Petr Baudis

Subject: Re: Re: [ANNOUNCE] git-pasky-0.2

Dear diary, on Mon, Apr 11, 2005 at 04:46:42AM CEST, I got a letter
where Daniel Barkalow <[email protected]> told me that...
> On Mon, 11 Apr 2005, Petr Baudis wrote:
>
> > Hello,
> >
> > here goes git-pasky-0.2, my set of patches and scripts upon
> > Linus' git, aimed at human usability and to an extent a SCM-like usage.
>
> Incidentally, the git-pasky-base tarball you have up has its checked-out
> tree partway between 0.1 and 0.2, and doesn't compile. (The included HEAD
> version in .dircache is fine, if the user has some way to bootstrap)

Oops, I'm sorry. It appears some diffs just slipped out from the tracked
tree, perhaps I was pulling once when git diff was broken and I didn't
notice it. Now there is a newer tarball there, it is not a pure 0.2
anymore though - if you use the COMMITTER_* env variables, they are now
AUTHOR_*.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 10:22:45

by Petr Baudis

Subject: Re: Re: [ANNOUNCE] git-pasky-0.2

Dear diary, on Mon, Apr 11, 2005 at 10:50:51AM CEST, I got a letter
where Ingo Molnar <[email protected]> told me that...
>
> * Petr Baudis <[email protected]> wrote:
>
> > Hello,
> >
> > here goes git-pasky-0.2, my set of patches and scripts upon Linus'
> > git, aimed at human usability and to an extent a SCM-like usage.
>
> works fine on FC4, only minor issues: 'git' in the tarball didnt have
> the x permission.

Sorry, fixed in the tarball. It is in the diffs but I have no git patch
yet to apply the mode changes.

> Also, your scripts assume they are in $PATH. When
> trying out a tarball one doesnt usually do a 'make install' but tries
> stuff locally.

Hmm, I think I will need to make something like

exedir=$(dirname "$0")

on the top of each script and then do all the git calls with ${exedir}
prepended. That should fix the issue, right?

> Also, 'make install' doesnt seem to install the git script itself, is
> that intentional?

Oops, I actually didn't even notice that there _is_ any install target
in the Makefile already. ;-) I will add the relevant stuff to it.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 10:52:17

by Petr Baudis

Subject: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Dear diary, on Mon, Apr 11, 2005 at 10:40:00AM CEST, I got a letter
where Florian Weimer <[email protected]> told me that...
> * Ingo Molnar:
>
> > is there any fundamental problem with going with v2 right now, and then
> > once v3 is out and assuming it looks ok, all newly copyrightable bits
> > (new files, rewrites, substantial contributions, etc.) get a v3
> > copyright? (and the collection itself would be v3 too) That method
> > wouldnt make it fully v3 automatically once v3 is out, but with time
> > there would be enough v3 bits in it to make it essentially v3.
>
> Almost certainly, v3 will be incompatible with v2 because it adds
> further restrictions. This means that your proposal would result in
> software which is not redistributable by third parties.

Hmm, what would be actually the point in introducing further
restrictions? Anyone who then wants to get around them will just
distribute the software with the "any later version" provision under
GPLv2, and GPLv3 will have no impact except for new software with "v3 or
any later version" provision. What am I missing?

I've been doing a lot of LKML catching up, and I remember someone
suggesting using GPLv2 (for kernel, but should apply to git too), with a
provision to let someone trusted (Linus) decide when GPLv3 is out
whether you can use GPLv3 for the kernel too. Does it make sense? And is
it even legally doable without sending signed written documents to
Linus' tropical hacienda?

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 11:36:16

by Ingo Molnar

Subject: [rfc] git: combo-blobs


i think all of the 'repository size' and 'bandwidth' concerns could be
solved via a new (and pretty much simple and transparent) object type:
the 'combo-blob'.

Summary:
--------

This is a space/bandwidth-efficient blob that 'includes' arbitrary
portions of (one, two, or more) simple blobs by reference [1], with byte
granularity, plus an optional followup portion that includes the full
constructed state, uncompressed. [2] It can also conserve more RAM
compared to the current repository format.

Representation:
---------------

A combo-blob would have the 'simplest possible' and thus most obvious
representation: a list (the 'include-table') of "include X bytes at
offset Y from parent Z" operations:

<parent-blob-ID> <offset> <size>
[optional full constructed state]

e.g.:

6d11b2dd7f169c29664ac0553090865b7b020973 0 64444
6d374c972c04a0b1894cc6898dffa8ab0b273fcb 0 100
6d11b2dd7f169c29664ac0553090865b7b020973 64544 163656

'punches' 100 bytes out of blob 6d1* at offset 64444, and replaces it
with blob 6d3*'s 100 bytes. [offset/size would be stored in a binary
form to have constant record sizes.]

in OS terms it's similar to an iovec representation. [3]
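
As a rough illustration of the include-table idea, constructing the contents is just a matter of concatenating byte ranges in table order. The sketch below is an assumption-laden toy: the `construct` helper, the in-memory `blobs` dictionary and the short blob names are all invented for the example, and nothing here is an existing git interface.

```python
from typing import Dict, List, Tuple

# One include-table record: (parent blob ID, offset into parent, byte count).
Include = Tuple[str, int, int]

def construct(table: List[Include], blobs: Dict[str, bytes]) -> bytes:
    """Concatenate the referenced byte ranges in table order."""
    parts = []
    for parent_id, offset, size in table:
        parent = blobs[parent_id]
        if offset + size > len(parent):
            raise ValueError(f"range past end of blob {parent_id}")
        parts.append(parent[offset:offset + size])
    return b"".join(parts)

# Hypothetical store: short names stand in for 40-hex-digit blob IDs.
blobs = {
    "old": b"int a = 1;\nint b = 2;\nint c = 3;\n",
    "new": b"int b = 20;\n",
}

# Keep the first line of "old", take the replacement line from "new",
# then resume "old" right after the line that was punched out.
table = [("old", 0, 11), ("new", 0, 12), ("old", 22, 11)]
result = construct(table, blobs)
print(result.decode())
```

Each record maps naturally onto one (base, length) pair, which is what makes the iovec comparison apt.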

The hash of a combo-blob is calculated off the include-table alone: i.e.
it's _not_ equivalent to the hash of the included contents. I.e. you
cannot 'collapse' a combo-blob after the fact, it's an immutable part of
the history of the repository, similar to other stored objects. You can
freely cache/uncache (blow-up/collapse) it on the other hand.

[ NOTE: further below you can find a 'Notes' section as well, which
might address some of the issues/ideas you might have at this point. ]

Cons:
-----

there are a number of disadvantages:

- performance hit. Linus is perfectly right, in terms of performance,
nothing beats having full objects.

Hence i kept the option to include the full constructed blob [4]
(uncompressed) as well in the combo-blob. When all combo-blobs are
'blown up' then they can be better in terms of performance than the
current repository format. [they still carry the small slice & dice
information as well]

the performance hit can be reduced in a finegrained way by introducing
occasional full objects in the history. E.g. after every 8 steps one
would include a full blob, to limit the number of blobs necessary to
construct a previously unconstructed combo-blob. This would still cut
the overhead of the current format substantially.

clearly, the most important cache is the current directory cache,
which this abstraction does not hurt.

- complexity. It's all pretty straightforward, but checking the
consistency of a combo object is not as simple as checking the
consistency of a simple object, as it would have to recursively check
all parent IDs as well. I think it's worth the price though.

- repository has optional components: the 'blown up' (cached) portion of
a combo-blob can be freely destructed. This means that two
repositories can now not only differ in their directory-cache, but
also in their objects/ hierarchy. I dont think this is a big issue,
BYMMV.

Pros:
-----

- the main advantage is space/bandwidth: it's pretty much as efficient as
it gets: it can be used to represent compressed binary deltas. A fully
trimmed (uncached) repository is very efficient.

- the optional 'fully constructed' portion is not compressed, so once a
repository is 'cached', it is faster to process (in areas outside the
current directory cache) than the current repository format. (In fact,
when a previously unused portion of a repository is accessed _first_,
it is IO-bound by nature - so we can very well spend the extra CPU
cycles on uncompressing things.)

- a 'combo' blob will be more memory-efficient as well. So with given
amount of RAM one could access more history, with a small CPU cost -
as long as the level of 'history recursion' is kept in check (e.g. via
the previously mentioned 'at most 8-deep combinations').
Straightforward iovecs could be passed to Linux system-calls, when
constructing a 'view' of a file, without having to cache every step of
the file's history.

- a combo-blob directly represents the way humans code: combining
pre-existing pieces of information and adding relatively low amount of
new stuff. Having a natural representation for the type of activity
that a tool supports cannot hurt.

( - combo-blobs enable a per-chunk (or per-line) edit history. It's not
an important feature though. )

Notes:
------

[1] the combo-blob is not a 'delta' thing. It combines pre-existing
parents. One of the parents may of course be a 'delta' that acts upon
the other parent - but the combo-blob does not know and does not care.
(A combo-blob might as well represent an act of someone consolidating
multiple small files into a big file, or splitting up a big file into
smaller files. Or a combo-blob might represent the trimming of a
preexisting file.)

[2]: a combo-blob is conceptually still a simple object with blob data
in it, nothing more. It can be referenced in other object types
equivalently to other blobs. It just happens to be a combination of
existing blobs, and hence the 'git filesystem' has to work harder (but
still quite efficiently) to get to the contents.

[3]: a combo-blob might reference any parent blob, including combo
blobs. This means that e.g. multiple small deltas can be represented
via:

<blob-#1>
|
|-----<blob-#2>
|
<combo-blob-#1>
|
|-----<blob-#3>
|
<combo-blob-#2>

where combo-blob-#2 is thus a combination of blob-#1,blob-#2,blob-#3.
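
The nesting in the diagram can be sketched as a recursive lookup. This is a toy model under stated assumptions: the `store` dictionary, the object names and the `resolve` function are hypothetical, and the depth limit of 8 simply mirrors the 'at most 8-deep combinations' figure mentioned earlier in this message.

```python
from typing import Dict, List, Tuple, Union

Include = Tuple[str, int, int]          # (parent ID, offset, size)
Obj = Union[bytes, List[Include]]       # plain blob, or a combo-blob's table

def resolve(obj_id: str, store: Dict[str, Obj], depth: int = 0,
            max_depth: int = 8) -> bytes:
    """Return the full contents of obj_id, following combo parents."""
    if depth > max_depth:
        raise RecursionError(f"combo chain deeper than {max_depth}")
    obj = store[obj_id]
    if isinstance(obj, bytes):          # simple blob: nothing to construct
        return obj
    out = []
    for parent_id, offset, size in obj:
        parent = resolve(parent_id, store, depth + 1, max_depth)
        out.append(parent[offset:offset + size])
    return b"".join(out)

store: Dict[str, Obj] = {
    "blob1": b"AAAA----CCCC",
    "blob2": b"BBBB",
    "blob3": b"!",
    # combo1 = blob1 with its middle four bytes replaced by blob2
    "combo1": [("blob1", 0, 4), ("blob2", 0, 4), ("blob1", 8, 4)],
    # combo2 = combo1 with blob3 appended
    "combo2": [("combo1", 0, 12), ("blob3", 0, 1)],
}

full = resolve("combo2", store)
print(full)
```

The sketch re-resolves a parent once per record that references it; a real implementation would presumably cache constructed parents, which is where the 'blown up' cached state would come in.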

[4] alternatively, it might also make sense to extend the simple
combo-blob concept with the concept of a 'cache-blob': a cache-blob
'blows up' combo blobs in that it fully constructs the blob contents,
but it is otherwise identical to the blob it caches. Simple (non-combo)
blob types are a cache of themselves.

Ingo

2005-04-11 13:58:07

by Petr Baudis

Subject: [ANNOUNCE] git-pasky-0.3

Hello,

here goes git-pasky-0.3, my set of patches and scripts upon
Linus' git, aimed at human usability and to an extent a SCM-like usage.

If you already have a previous git-pasky version, just git pull pasky
to get it (but see below!!!). Otherwise, you can get it from:

http://pasky.or.cz/~pasky/dev/git/

Please see the README there and/or the parent post for detailed
instructions. You can find the changes from the last announcement
in the ChangeLog (releases have separate commits so you can find them
easily; they are also tagged for purpose of diffing etc).

This release is mainly focused on bugfixes. Especially, it fixes git
diff, which was totally broken in the previous release and would only
diff every other file (forgot to remove one shift from the times when
changes were reported two-line from diff-tree). Very sorry about that.

This implies that git pull was broken too, though - if you pulled
tracked branch, git diff wouldn't produce the complete diff for patch to
apply. If you didn't do any local changes, it is fortunately easy to
repair:

git diff | patch -p0 -R

(The unapplied changes appear as reverted in your local tree when
compared with the cache.) You will need to edit the diff if you did
some local changes.

Other change breaking some compatibility is regarding commit
environment variables - s/COMMITTER_*/AUTHOR_*/. Otherwise it is usual
bunch of merges with Linus and some really minor stuff. Oh, and make
install works.

One annoying thing is rsync error when pulling from Linus - it tries
to sync the tags/ directory and I don't know how to safely silence it
except throwing away all stderr. I will probably make it fetch the list
of .dircache and rsync only things which are really there.

Any feedback/opinions/suggestions/patches (especially patches) are
welcome. You can also stop by at #git either on FreeNode or on OFTC (I
will be around only from 20:00 CET on, though).

Have fun,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 13:58:46

by H. Peter Anvin

Subject: Re: more git updates..

Followup to: <[email protected]>
By author: Christopher Li <[email protected]>
In newsgroup: linux.dev.kernel
>
> There is one problem though. How about the SHA1 hash collision?
> Even if the chance is very remote, you don't want to lose some data due
> to "software" error. I think it is OK not to handle that
> case right now. On the other hand, it will be nice to detect that
> and give out a big error message if it really happens.
>

If you're actually worried about it, it'd be better to just use a
different hash, like one of the SHA-2's (probably a better choice
anyway), instead of SHA-1.
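
Both ideas in this exchange (detecting a collision loudly, and making the hash swappable) can be sketched in a few lines. Everything below is hypothetical: the `ObjectStore` class is an in-memory toy rather than git's on-disk format, and `weak_name` is a deliberately terrible hash that exists only so the collision path can be exercised.

```python
import hashlib
from typing import Callable, Dict

class CollisionError(Exception):
    pass

class ObjectStore:
    """Toy in-memory content-addressed store (not git's on-disk format)."""

    def __init__(self, hash_fn: Callable[[bytes], str]):
        self._hash_fn = hash_fn
        self._objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        name = self._hash_fn(data)
        existing = self._objects.get(name)
        if existing is not None and existing != data:
            # Same name, different bytes: refuse loudly instead of losing data.
            raise CollisionError(f"hash collision on object {name}")
        self._objects[name] = data
        return name

def sha1_name(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def sha256_name(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def weak_name(data: bytes) -> str:
    # One-byte "hash", used only to provoke a collision in this example.
    return format(sum(data) % 256, "02x")

store = ObjectStore(sha1_name)
first = store.put(b"hello\n")
again = store.put(b"hello\n")      # identical content: same name, no error

weak = ObjectStore(weak_name)
weak.put(b"ab")
try:
    weak.put(b"ba")                # same byte sum, different content
    collided = False
except CollisionError:
    collided = True
```

The check costs one comparison against the existing object, and only when an object of that name is already present.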

-hpa

2005-04-11 14:46:42

by Paul Jackson

Subject: Re: [rfc] git: combo-blobs

Hmmm ... I have this strong sense that I am about 2 hours away from
smacking my forehead and groaning "Duh - so that's what Ingo meant!"

However, one must play out one's destiny.

Could you provide an example scenario, which results in the creation of
a combo-blob?

The best I can come up with is the following.

Let's say Nick changes one line in the middle of kernel/sched.c
(yeah - I know - unlikely scenario - he usually changes more
than that - nevermind that detail.)

In the days Before Combo Blobs (BCB), git would have been told that
kernel/sched.c was to be picked up, and would have wrapped it up in a
zlib'd blob, sha1summed it, seen it was a new sum, and added that blob
to its objects (or something like this -- I'm still a little fuzzy on
these git details.)

But Nick just downloaded the latest git 1.5.11.1 which has added support
for combo blobs, so now, guessing here, instead of wrapping up the new
sched.c, git instead unwraps the old one, diff's with the new, notices a
couple of long sequences that are unchanged, wraps up both of those
sequences as a couple of relatively large blobs, and wraps up the new
lines that Nick just coded in the middle as a small blob, and puts all
three in the object store, along with another small combo-blob, tying
them all together.

So far, not too bad. Haven't gained anything, and required the
unpacking of a zlib blob we didn't require before, and the running and
analyzing of a diff we didn't require before, but the end result is only
moderately worse - four object blobs instead of one, but of total size
not much larger (well, total size typically 3 disk blocks worse, due to
a slight increase in fragmentation from using 4 blocks to store what
used to be in one.)

But now I get stuck. Unless I throw in something like the interleaved
delta compression that's at the heart of Marc Rochkind's old SCCS code
(and Larry's rewrite thereof), I don't see how we ever come to the
practical realization that any of these four new blobs can ever be
reused.

So explain to me again how we ever gain anything with these combo blobs,
while I take a prophylactic aspirin, so the forehead whack won't hurt as
much.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-11 15:12:28

by Ingo Molnar

Subject: Re: [rfc] git: combo-blobs


* Paul Jackson <[email protected]> wrote:

> Hmmm ... I have this strong sense that I am about 2 hours away from
> smacking my forehead and groaning "Duh - so that's what Ingo meant!"
>
> However, one must play out one's destiny.
>
> Could you provide an example scenario, which results in the creation
> of a combo-blob?
>
> The best I can come up with is the following.
>
> Let's say Nick changes one line in the middle of kernel/sched.c (yeah
> - I know - unlikely scenario - he usually changes more than that -
> nevermind that detail.)
>
> In the days Before Combo Blobs (BCB), git would have been told that
> kernel/sched.c was to be picked up, and would have wrapped it up in a
> zlib'd blob, sha1summed it, seen it was a new sum, and added that blob
> to its objects (or something like this -- I'm still a little fuzzy on
> these git details.)
>
> But Nick just downloaded the latest git 1.5.11.1 which has added
> support for combo blobs, so now, guessing here, instead of wrapping up
> the new sched.c, git instead unwraps the old one, diff's with the new,
> notices a couple of long sequences that are unchanged, wraps up both
> of those sequences as a couple of relatively large blobs, and wraps up
> the new lines that Nick just coded in the middle as a small blob, and
> puts all three in the object store, along with another small
> combo-blob, tying them all together.

actually, git would just include by reference the previous blob.

lets say we had the previous version of sched.c in a blob, ID
cc4ee6107d19f89898a8c89d45810f01710f2ff4. We have the new edit (which is
small, lets say 20 bytes) in blob e010fab710092b19be6e26de1721e249dff2d141.
We'd create the combo-blob representing the new version of sched.c, the
following way:

include cc4ee6107d19f89898a8c89d45810f01710f2ff4 0 54010
include e010fab710092b19be6e26de1721e249dff2d141 0 20
include cc4ee6107d19f89898a8c89d45810f01710f2ff4 54030 73061

so we'd include (by reference) most of the previous version, with a
small blob for the extras. Since sched.c compresses down to 36K, we
saved ~32K of bandwidth, and somewhere on the order of 20K of storage.

to construct the combo blob later on, we do have to unpack sched.c (and
if it's already a combo-blob that is not cached then we'd have to unpack
all parents until we arrive at some full blob).
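
For a single in-place edit, the include table above can be derived mechanically. The sketch below assumes the third column is a byte count and that the 20 new bytes replace 20 old bytes at offset 54010, which is what the offsets in the table imply; `table_for_edit` is a hypothetical helper, not part of git.

```python
from typing import List, Tuple

Include = Tuple[str, int, int]   # (blob ID, offset, size)

def table_for_edit(old_id: str, old_size: int, edit_id: str, edit_size: int,
                   at: int, replaced: int) -> List[Include]:
    """Include table for replacing `replaced` bytes at offset `at` of the
    old blob with the whole of the edit blob."""
    if at < 0 or replaced < 0 or at + replaced > old_size:
        raise ValueError("edit range lies outside the old blob")
    resume = at + replaced
    table = [
        (old_id, 0, at),                      # untouched head
        (edit_id, 0, edit_size),              # new bytes
        (old_id, resume, old_size - resume),  # untouched tail
    ]
    # Drop zero-length records (edit at the very start or end of the file).
    return [rec for rec in table if rec[2] > 0]

OLD = "cc4ee6107d19f89898a8c89d45810f01710f2ff4"
EDIT = "e010fab710092b19be6e26de1721e249dff2d141"

# Old file size implied by the table above: 54010 + 20 + 73061 bytes.
old_size = 54010 + 20 + 73061
table = table_for_edit(OLD, old_size, EDIT, 20, at=54010, replaced=20)
for record in table:
    print("include", *record)
```

Finding `at` and `replaced` in the first place is the expensive part (it needs the old contents and a diff), which is the cost being debated in this thread.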

> So far, not too bad. Haven't gained anything, and required the
> unpacking of a zlib blob we didn't require before, and the running and
> analyzing of a diff we didn't require before, but the end result is
> only moderately worse - four object blobs instead of one, but of total
> size not much larger (well, total size typically 3 disk blocks worse,
> due to a slight increase in fragmentation from using 4 blocks to store
> what used to be in one.)

we'd have 2 new objects (the 'delta' and the 'combo' blob).

(if # of objects is an issue then we could include new data in the combo
blob itself too, but that's getting too complex i think.)

Ingo

2005-04-11 15:28:29

by Ingo Molnar

Subject: Re: [rfc] git: combo-blobs


here are some stats: of the last 34160 files modified in the Linux
kernel tree in the past 1 year, the file sizes total to 1 GB, and the
average file-size per file committed is 31220 bytes. The changes
themselves amount to:

22404 files changed, 1996494 insertions(+), 1396644 deletions(-)

(the # of files changed is lower because one file can be modified
multiple times)

the Linux kernel has an average line-length of 36 bytes, so even without
analyzing the commits themselves, the actual size of changes is around
70 MB content added, 50 MB content removed. The patches (plus commit
comments, and email headers) add up to 250 MB.

So the combo-blob representation would have an uncompressed content
somewhere between 130MB and 250MB: 200 MB would be a good guess i think.
That's 20% of the 1+ GB the full-blob representation would give, and it
would be nearly as compressible.
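
The back-of-the-envelope arithmetic above can be reproduced directly from the quoted figures. Only the numbers stated in this message are used; the 200 MB value is the guess given above, not a measurement.

```python
insertions = 1_996_494      # lines added, from the diffstat above
deletions = 1_396_644       # lines removed
avg_line_bytes = 36         # average kernel line length quoted above
files_committed = 34_160
avg_file_bytes = 31_220

added_mb = insertions * avg_line_bytes / 1e6
removed_mb = deletions * avg_line_bytes / 1e6
full_mb = files_committed * avg_file_bytes / 1e6   # full-blob representation

guess_mb = 200              # the guess stated in the message
ratio = guess_mb / full_mb

print(f"content added:   ~{added_mb:.0f} MB")
print(f"content removed: ~{removed_mb:.0f} MB")
print(f"full blobs:      ~{full_mb:.0f} MB")
print(f"combo/full:      ~{ratio:.0%}")
```

This comes out at roughly 72 MB added, 50 MB removed and a little over 1 GB of full blobs, in line with the rounded figures in the text.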

Ingo

2005-04-11 15:30:59

by Linus Torvalds

Subject: Re: [rfc] git: combo-blobs



On Mon, 11 Apr 2005, Ingo Molnar wrote:
>
> to construct the combo blob later on, we do have to unpack sched.c (and
> if it's already a combo-blob that is not cached then we'd have to unpack
> all parents until we arrive at some full blob).

I really don't want to have this. Having chains of dependencies is really
painful, and now if _any_ of them gets corrupted, you're screwed.

Yes, GIT already has chains, but they are the minimal possible (ie we have
the path-name-dependent tree chain, which I tried to avoid but really
couldn't). The "commit" chain can grow to arbitrary sizes, but losing any
entry but the top one really doesn't lose any data - you lost your place
in history, but at least you're not totally screwed. You still have your
data, you just can't find your way to the root (but you can, for example,
effectively re-create the whole commit chain if you want to without having
to touch any of the data blobs).

So I would very strongly suggest that we do not have dependent combo
blobs, but that if you want to, a better "network protocol" might be quite
possible. Ie send diffs over the network, and re-create the blobs on the
other side. You can trivially check that you got it right, because if you
didn't, the name of the result won't match ;)
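
A minimal sketch of that "diffs on the wire, full blobs on disk" idea follows. The delta encoding (`copy`/`data` operations) is invented for the example, and naming an object by the plain SHA-1 of its contents is a simplification assumed here; the point is only that the receiver can verify the rebuilt blob against the name it was told to expect.

```python
import hashlib
from typing import List, Tuple

# A delta op is either ("copy", offset, size), taken from a base blob the
# receiver already has, or ("data", bytes), carrying new bytes literally.
Op = Tuple

def object_name(content: bytes) -> str:
    # Simplifying assumption: name = SHA-1 of the raw content.
    return hashlib.sha1(content).hexdigest()

def apply_delta(base: bytes, ops: List[Op]) -> bytes:
    out = []
    for op in ops:
        if op[0] == "copy":
            _, offset, size = op
            out.append(base[offset:offset + size])
        elif op[0] == "data":
            out.append(op[1])
        else:
            raise ValueError(f"unknown delta op {op[0]!r}")
    return b"".join(out)

def receive(base: bytes, ops: List[Op], expected_name: str) -> bytes:
    """Rebuild the full blob and accept it only if its name matches."""
    blob = apply_delta(base, ops)
    if object_name(blob) != expected_name:
        raise ValueError("reconstructed blob does not match its name")
    return blob   # would be stored as an ordinary, self-contained blob

# Sender side (hypothetical contents).
base = b"line one\nline two\nline three\n"
new = b"line one\nline 2\nline three\n"
ops = [("copy", 0, 9), ("data", b"line 2\n"), ("copy", 18, 11)]
wanted = object_name(new)

rebuilt = receive(base, ops, wanted)
```

Because the stored result is a full blob, a corrupted or missing base only affects the transfer, never the repository's ability to read what it already holds.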

Please?

Linus

2005-04-11 15:32:33

by Ingo Molnar

Subject: Re: [rfc] git: combo-blobs


* Ingo Molnar <[email protected]> wrote:

> here are some stats: of the last 34160 files modified in the Linux
> kernel tree in the past 1 year, the file sizes total to 1 GB, and the
> average file-size per file committed is 31220 bytes. The changes
> themselves amount to:
>
> 22404 files changed, 1996494 insertions(+), 1396644 deletions(-)
>
> (the # of files changed is lower because one file can be modified
> multiple times)

one more number: thus the average commit size is 3575 bytes, i.e. less
than a block.

Ingo

2005-04-11 15:39:57

by Ingo Molnar

Subject: Re: [rfc] git: combo-blobs


* Linus Torvalds <[email protected]> wrote:

> > to construct the combo blob later on, we do have to unpack sched.c (and
> > if it's already a combo-blob that is not cached then we'd have to unpack
> > all parents until we arrive at some full blob).
>
> I really don't want to have this. Having chains of dependencies is
> really painful, and now if _any_ of them gets corrupted, you're
> screwed.

if a repository is corrupted then it pretty much needs to be dropped
anyway. Also, with a 'replicate the full object on every 8th commit'
rule the risk would be somewhat mitigated as well.

but yeah, i can very much see the point of trying to avoid that
complexity. (Also, it's not like delta blobs couldnt be introduced later
on, if there's enough (if any) pressure to reduce storage overhead.)

Ingo

2005-04-11 15:49:42

by Randy.Dunlap

Subject: Re: more git updates..

On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:

|
|
| On Sun, 10 Apr 2005, Paul Jackson wrote:
| >
| > Useful explanation - thanks, Linus.
|
| Hey. You're welcome. Especially when you create good documentation for
| this thing.
|
| Because:
|
| > Is this picture and description accurate:
|
| [ deleted, but I'll probably try to put it in an explanation file
| somewhere ]
|
| Yes. Excellent.
|
| > Minor question:
| >
| > I must have an old version - I got 'git-0.03', but
| > it doesn't have 'checkout-cache', and its 'read-tree'
| > directly writes my working files.
|
| Yes. Crappy old tree, but it can still read my git.git directory, so you
| can use it to update to my current source base.

Please go into a little more detail about how to do this step...
that seems to be the most basic concept that I am missing.
i.e., how to find the "latest/current" tree (version/commit)
and check it out (read-tree, checkout-cache, etc.).

Even if I use Pasky's tools, I'd like to understand this step.

| However, from a usability angle, my source-base really has been
| concentrating _entirely_ on just the plumbing, and if you actually want a
| faucet or a toilet _connected_ to the plumbing, you're better off with
| Pasky's tree, methinks:
|
| > How do I get a current version? Well, one way I see,
| > and that's to pick up Pasky's:
| >
| > http://pasky.or.cz/~pasky/dev/git/git-pasky-base.tar.bz2
| >
| > Perhaps that's the best way?
|
| Indeed. He's got a number of shell scripts etc to automate the boring
| parts.


---
~Randy

2005-04-11 15:57:17

by Ingo Molnar

Subject: Re: [rfc] git: combo-blobs


* Ingo Molnar <[email protected]> wrote:

>
> * Linus Torvalds <[email protected]> wrote:
>
> > > to construct the combo blob later on, we do have to unpack sched.c (and
> > > if it's already a combo-blob that is not cached then we'd have to unpack
> > > all parents until we arrive at some full blob).
> >
> > I really don't want to have this. Having chains of dependencies is
> > really painful, and now if _any_ of them gets corrupted, you're
> > screwed.
>
> if a repository is corrupted then it pretty much needs to be dropped
> anyway. Also, with a 'replicate the full object on every 8th commit'
> rule the risk would be somewhat mitigated as well.

another thing is that if the repository is 'cached' (which would
normally be the case for work files), then it would be more resilient
against corruption as the full uncompressed file would be included at
the end of the combo-blob.

Ingo

2005-04-11 16:00:30

by Adam J. Richter

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

On 2005-04-11, Linus Torvalds wrote:
>I'm inclined to go with GPLv2 just because it's the most common one [...]

You may want to use a file from GPL'ed monotone that
implements a substantial diff optimization described in the August
1989 paper by Sun Wu, Udi Manber and Gene Myers ("An O(NP) Sequence
Comparison Algorithm"). According to the file, that implementation
was a port of some Scheme code written by Aubrey Jaffer to C++ by
Graydon Hoare. (By the way, I would prefer that git just punt to
user level programs for diff and merge when all of the versions
involved are different or at least have a very thin interface
for extending the facility, because I would like to do some character
based merge stuff.)

It looks to me like the anti-patent provisions of OSLv2.1
could be circumvented by an offender creating a separate company
to do patent litigation. So, I think you'll find that the software
reuse benefits (both to GIT and to other software projects) of the
more widely used GPL outweigh the anti-patent benefits of OSLv2.1.

Although I like the idea of anti-patent provisions, such
as those in OSLv2.1, I think mutual compatibility of free
software is probably more consequential, even from a purely
political perspective.

Perhaps you might want to consider offering the code
under the distributor's choice of either license if you want
to offer the very minor benefits of slightly easier compliance
to those who do not litigate software patents, or, perhaps more
importantly, the ability of the software to be copied into
OSLv2.1 projects (if there are any).

__ ______________
Adam J. Richter \ /
[email protected] | g g d r a s i l

2005-04-11 16:01:50

by Linus Torvalds

Subject: Re: [rfc] git: combo-blobs



On Mon, 11 Apr 2005, Ingo Molnar wrote:
>
> if a repository is corrupted then it pretty much needs to be dropped
> anyway.

I disagree. Yes, the thing is designed to be replicated, so most of the
time the easiest thing to do is to just rsync with another copy.

But dammit, I don't want to just depend on that. I wrote "fsck" for a
reason. Right now it only finds errors, which is sufficient if you do the
rsync thing, but I think it's _wrong_ to

- be slower
- be more complex
- be less safe

to save some diskspace.

If you want to save disk-space, the current setup has a great way of doing
that: just drop old history. Exactly because a GIT repo doesn't do the
dependency chain thing, you can do that, and have a minimal GIT
repository that is still perfectly valid (and is basically the size of a
single checked-out tree, except it's also compressed).

I don't think many people will do that, considering how cheap disk is, but
the fact is, GIT allows it just fine. "fsck" will complain right now, but
I'm actually going to make the "commit->commit" link be a "weaker" thing,
and have fsck not complain about missing history unless you do the "-v"
thing.
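
The "weaker" commit->commit link can be pictured with a toy checker. This is a sketch under loud assumptions: the dictionary-based `store`, the object names and this `fsck` function are invented for illustration and say nothing about how the real fsck is written. A missing tree or blob is always an error; a missing parent commit is reported only in verbose mode.

```python
from typing import Dict, List, Set

# Toy objects: a commit names one tree and zero or more parent commits;
# a tree names blobs. IDs are short made-up strings.
store: Dict[str, dict] = {
    "commit2": {"type": "commit", "tree": "tree2", "parents": ["commit1"]},
    "tree2": {"type": "tree", "entries": ["blobA", "blobB"]},
    "blobA": {"type": "blob"},
    "blobB": {"type": "blob"},
    # "commit1" and everything only it referenced have been pruned.
}

def fsck(head: str, store: Dict[str, dict], verbose: bool = False) -> List[str]:
    """Walk from head and return a list of problems found."""
    problems: List[str] = []
    seen: Set[str] = set()
    todo = [head]
    while todo:
        obj_id = todo.pop()
        if obj_id in seen:
            continue
        seen.add(obj_id)
        obj = store[obj_id]
        if obj["type"] == "commit":
            if obj["tree"] in store:
                todo.append(obj["tree"])
            else:
                problems.append(f"missing tree {obj['tree']}")
            for parent in obj["parents"]:
                if parent in store:
                    todo.append(parent)
                elif verbose:
                    # Weak link: lost history, but no lost data.
                    problems.append(f"missing parent commit {parent}")
        elif obj["type"] == "tree":
            for entry in obj["entries"]:
                if entry in store:
                    todo.append(entry)
                else:
                    problems.append(f"missing blob {entry}")
    return problems

quiet = fsck("commit2", store)
loud = fsck("commit2", store, verbose=True)
```

With history pruned, the quiet run reports nothing because every tree and blob reachable from the head is present, while the verbose run still points out the missing parent.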

(Right now, for development, I _do_ want fsck to complain about missing
history, but that's a different thing. Right now it's there to make sure I
don't do stupid things, not for "users").

> Also, with a 'replicate the full object on every 8th commit'
> rule the risk would be somewhat mitigated as well.

..but not the complexity.

The fact is, I want to trust this thing. Dammit, one reason I like GIT is
that I can mentally visualize the whole damn tree, and each step is so
_simple_. That's extra important when the object database itself is so
inscrutable - unlike CVS or SCCS or formats like that, it's damn hard to
visualize from looking at a directory listing.

So this really is a very important point for me: I want a demented
chimpanzee to be able to understand the GIT linkages, and I do not want
_any_ partial results anywhere. The recursive tree is already more
complexity than I wanted, but at least that seemed inescapable.

Linus

2005-04-11 16:10:11

by Florian Weimer

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

* Petr Baudis:

>> Almost certainly, v3 will be incompatible with v2 because it adds
>> further restrictions. This means that your proposal would result in
>> software which is not redistributable by third parties.
>
> Hmm, what would be actually the point in introducing further
> restrictions? Anyone who then wants to get around them will just
> distribute the software with the "any later version" provision under
> GPLv2, and GPLv3 will have no impact except for new software with "v3 or
> any later version" provision. What am I missing?

Software continues to evolve. The copyright owners can relicense the
code base under v3, and use v3 for all subsequent changes to the
software. The trouble with relicensing is that you have to contact
all copyright holders (or remove their code). This tends to be
impossible in long-running projects without contractual agreements
between the developers.

2005-04-11 16:50:57

by ross

Subject: Re: more git updates..

On Sat, Apr 09, 2005 at 12:45:52PM -0700, Linus Torvalds wrote:
> Can you guys re-send the scripts you wrote? They probably need some
> updating for the new semantics. Sorry about that ;(

I've been off email this weekend, so have fallen a bit behind here.
I'll forgo updating my stuff, since it looks like there's superior
work. Looks cool!

I must say, the git as a filesystem thing is really neat. This has
been one of the more fun projects I've toyed around with.

--
Ross Vandegrift
[email protected]

"The good Christian should beware of mathematicians, and all those who
make empty prophecies. The danger already exists that the mathematicians
have made a covenant with the devil to darken the spirit and to confine
man in the bonds of Hell."
--St. Augustine, De Genesi ad Litteram, Book II, xviii, 37

2005-04-11 16:39:03

by Ingo Molnar

Subject: Re: [rfc] git: combo-blobs


* Linus Torvalds <[email protected]> wrote:

> > Also, with a 'replicate the full object on every 8th commit'
> > rule the risk would be somewhat mitigated as well.
>
> ..but not the complexity.
>
> The fact is, I want to trust this thing. Dammit, one reason I like GIT
> is that I can mentally visualize the whole damn tree, and each step is
> so _simple_. That's extra important when the object database itself is
> so inscrutable - unlike CVS or SCCS or formats like that, it's damn
> hard to visualize from looking at a directory listing.

ok. Meanwhile i found another counter-argument: the average committed
file size is 36K, which with gzip -9 would compress down to roughly 8K,
with the commit message being another block. That's 2+1 blocks used per
commit, while with deltas one could at most cut this down to 1+1+1
blocks - just as much space! So we would be almost even with the more
complex delta approach, just by increasing the default compression ratio
from 6 to 9. (but even with the default we are not that bad.)
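
The block arithmetic above can be checked with a small sketch. It assumes a 4K filesystem block size (implied by "8K" counting as two blocks); `blocks_needed` and the 200-byte commit size are made up for illustration, and the 1+1+1 figure for deltas is taken directly from the message rather than derived.

```python
import math

BLOCK_SIZE = 4096  # assumed filesystem block size in bytes


def blocks_needed(size_bytes: int) -> int:
    """Whole blocks occupied by a file of the given size (at least one)."""
    return max(1, math.ceil(size_bytes / BLOCK_SIZE))


# Full-object approach: one compressed blob of roughly 8K plus one commit object.
full_blob = blocks_needed(8 * 1024)
commit_obj = blocks_needed(200)  # hypothetical size; any small commit costs one block
full_total = full_blob + commit_obj

# Delta approach, best case as stated in the message: 1 + 1 + 1 blocks.
delta_total = 1 + 1 + 1

print(full_total, delta_total)
```

Under these assumptions both totals come out the same, which is the point being made: small files round up to whole blocks, so a delta cannot save less than a block.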

case closed i guess. (The network bandwidth issue can/could indeed be
solved independently, without any impact to the fundamentals, as you
suggested.)

Ingo

2005-04-11 17:52:13

by Paul Jackson

Subject: Re: [rfc] git: combo-blobs

Ingo wrote:
> actually, git would just include by reference the previous blob.

Ok - kind of like a patch blob. I can see now where under some
conditions this saves space.

I agree with the conclusion this thread has already reached. Keep it
simple.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-11 18:13:42

by Chris Wedgwood

Subject: Re: [rfc] git: combo-blobs

On Mon, Apr 11, 2005 at 09:01:51AM -0700, Linus Torvalds wrote:

> I disagree. Yes, the thing is designed to be replicated, so most of
> the time the easiest thing to do is to just rsync with another copy.

It's not clear how any of this is going to give me something like

bk changes -R

or
bk changes -L

functionality. I'm guessing I will have to sync locally and check
between two trees in those cases? Or at least sync enough metadata as
to make this possible... but not the entire tree right?

2005-04-11 18:29:01

by Linus Torvalds

Subject: Re: [rfc] git: combo-blobs



On Mon, 11 Apr 2005, Chris Wedgwood wrote:
>
> On Mon, Apr 11, 2005 at 09:01:51AM -0700, Linus Torvalds wrote:
>
> > I disagree. Yes, the thing is designed to be replicated, so most of
> > the time the easiest thing to do is to just rsync with another copy.
>
> It's not clear how any of this is going to give me something like
>
> bk changes -R
>
> or
> bk changes -L
>
> functionality.

You'd download all the sha1 objects (they don't actually do anything to
_your_ state - they only show the possible other states), and then it's a
"simple thing" to generate a full tree of your local HEAD commit and
compare it to a full tree of the remote HEAD commit.

If you then want to merge, you already have all the data. If you don't,
you can then prune your object tree from the stuff you don't use (fsck
already effectively does all the connectivity work, it just never removes
unreferenced files).
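
The pruning step described here is a reachability walk followed by a set difference. A minimal sketch, assuming a toy in-memory object store where each id maps to the ids it references; the ids and the `reachable` helper are hypothetical and not part of fsck-cache:

```python
def reachable(objects: dict[str, list[str]], roots: list[str]) -> set[str]:
    """Return every object id reachable from the given root ids."""
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


# Toy store: commits reference a tree and a parent, trees reference blobs.
store = {
    "commit-local": ["tree-local", "commit-base"],
    "commit-remote": ["tree-remote", "commit-base"],
    "commit-base": ["tree-base"],
    "tree-local": ["blob-a", "blob-b"],
    "tree-remote": ["blob-a", "blob-c"],
    "tree-base": ["blob-a"],
    "blob-a": [],
    "blob-b": [],
    "blob-c": [],
}

# Keep what the local HEAD needs; everything else downloaded is prunable.
keep = reachable(store, ["commit-local"])
prunable = sorted(set(store) - keep)
print(prunable)
```

Downloading the remote objects never touches anything in `keep`, which is why fetching first and deciding later is safe.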

Linus

2005-04-11 18:31:45

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Mon, Apr 11, 2005 at 05:49:31PM CEST, I got a letter
where "Randy.Dunlap" <[email protected]> told me that...
> On Sun, 10 Apr 2005 16:38:00 -0700 (PDT) Linus Torvalds wrote:
..snip..
> | Yes. Crappy old tree, but it can still read my git.git directory, so you
> | can use it to update to my current source base.
>
> Please go into a little more detail about how to do this step...
> that seems to be the most basic concept that I am missing.
> i.e., how to find the "latest/current" tree (version/commit)
> and check it out (read-tree, checkout-cache, etc.).

Well, its ID is by convention kept in .dircache/HEAD. But that is really
only a convention, no "core git" tool reads it directly, and you need to
update it manually after you do commit-tree.

First, you need to get the accompanying tree's id. git-pasky's shortcut
is $(tree-id), but manually you can do it by

cat-file commit $(cat .dircache/HEAD) | egrep '^tree'

Note that if you ever forget to update HEAD or if you have multiple
branches in your repository, you can list all "head commits" (that is,
commits which have no other commits referencing them as parents) by
doing fsck-cache.

Now, you need to populate the directory cache by the tree (see Paul
Jackson's diagram):

read-tree $tree_id

And now you want to update your working tree from the cache:

checkout-cache -a -f

This will bring your tree in sync with the cache (it won't remove any
stale files, though). That means it will overwrite your local changes
too - turn that off by omitting the "-f". If you want to update only
some files, omit the "-a" and list them.
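
Finding the tree id comes down to reading the header of the commit object. A minimal sketch of that lookup, assuming the commit text has header lines (tree, parent, author, committer), then a blank line, then the message; the sample commit, its ids, and `tree_id_of` are made up for illustration:

```python
def tree_id_of(commit_text: str) -> str:
    """Extract the tree id from the header of a commit object's text."""
    for line in commit_text.splitlines():
        if line == "":
            break  # a blank line ends the header; the rest is the message
        if line.startswith("tree "):
            return line.split(" ", 1)[1]
    raise ValueError("commit has no tree line")


# Hypothetical commit text; the ids are placeholders, not real objects.
sample_commit = (
    "tree 0123456789abcdef0123456789abcdef01234567\n"
    "parent 89abcdef0123456789abcdef0123456789abcdef\n"
    "author A U Thor <author@example.com> 1113200000 +0200\n"
    "committer A U Thor <author@example.com> 1113200000 +0200\n"
    "\n"
    "tree lines inside the message body are ignored\n"
)

print(tree_id_of(sample_commit))
```

The resulting id is what would then be handed to read-tree, followed by checkout-cache to update the working files.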

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 18:40:59

by Petr Baudis

Subject: Re: Re: [rfc] git: combo-blobs

Dear diary, on Mon, Apr 11, 2005 at 08:13:19PM CEST, I got a letter
where Chris Wedgwood <[email protected]> told me that...
> On Mon, Apr 11, 2005 at 09:01:51AM -0700, Linus Torvalds wrote:
>
> > I disagree. Yes, the thing is designed to be replicated, so most of
> > the time the easiest thing to do is to just rsync with another copy.
>
> It's not clear how any of this is going to give me something like
>
> bk changes -R
>
> or
> bk changes -L
>
> functionality. I'm guessing I will have to sync locally and check
> between two trees in those cases? Or at least sync enough metadata as
> to make this possible... but not the entire tree right?

Checking "what will be transferred when I push" doesn't sound hard - the
push itself is not too trivial, but solvable. Perhaps even by pure
rsync, if you won't support updating tracked trees (does not sound
overwhelmingly useful anyway).

Checking "what will be transferred if I pull" is much worse. Perhaps you
could make a parallel objects repository, fetch all the newer commit and
tree metadata there, and then do diff-tree. I think you need something
smarter than rsync for that, though.

[git-pasky] As long as you are not pulling from a tracked branch, the
worst that can happen is that the enemy will trick you into pulling some
terabytes of data. Or overwrite existing objects with garbage, but
--ignore-existing would solve that trivially.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 18:45:45

by Petr Baudis

Subject: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Hello,

please do not trim the cc list so aggressively.

Dear diary, on Mon, Apr 11, 2005 at 05:46:38PM CEST, I got a letter
where "Adam J. Richter" <[email protected]> told me that...
..snip..
> Graydon Hoare. (By the way, I would prefer that git just punt to
> user level programs for diff and merge when all of the versions
> involved are different or at least have a very thin interface
> for extending the facility, because I would like to do some character
> based merge stuff.)
..snip..

But this is what git already does. I agree it could do it even better,
by checking environment variables for the appropriate tools (then you
could use that to pass diff e.g. -p etc.).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 20:27:51

by Linus Torvalds

Subject: Re: [rfc] git: combo-blobs



On Mon, 11 Apr 2005, Linus Torvalds wrote:
> > bk changes -R
> >
> > bk changes -L
>
> You'd download all the sha1 objects (they don't actually do anything to
> _your_ state - they only show the possible other states), and then it's a
> "simple thing" to generate a full tree of your local HEAD commit and
> compare it to a full tree of the remote HEAD commit.

Ok, there's a "rev-tree" program there now to generate these things.

If you control both ends, or have some other means of a "smart"
communications protocol, you don't actually have to download the blobs
themselves. Just download the "rev-tree" from the other side, and you can
generate the differences by comparing your rev-tree against theirs.

(And since they are sorted, the compare is very cheap).

The downside? A revtree can get quite large. My "rev-tree" program allows
you to cache previous state so that you don't have to follow the whole
thing down, though, so it's possible to just send incrementals (since a
"commit" _uniquely_ generates the whole rev-tree, you really can do
reasonably smart things and create "superset revtrees" etc).

So the change difference between two commits is literally

rev-tree [commit-id1] > commit1-revtree
rev-tree [commit-id2] > commit2-revtree
join -t : commit1-revtree commit2-revtree > common-revisions

(this is also how to find the most recent common parent - you'd look at just the
head revisions - the ones that aren't referred to by other revisions - in
"common-revisions", and figure out the best one. I think.)
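
Because both revision lists are sorted, the join is a single linear pass. A minimal sketch of that merge-join, with each rev-tree line reduced to a bare id; the actual rev-tree output format is not assumed here, and the ids and `sorted_common` are hypothetical:

```python
def sorted_common(a: list[str], b: list[str]) -> list[str]:
    """Intersect two sorted lists in one pass, as join(1) does on sorted input."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


# Hypothetical revision ids seen from two different commits.
revs1 = sorted(["c1", "c2", "c3", "c5"])
revs2 = sorted(["c1", "c2", "c4", "c6"])

print(sorted_common(revs1, revs2))
```

Each list is walked once, so the cost grows with the combined length of the two revtrees rather than with their product.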

Linus

2005-04-12 01:30:13

by Adam J. Richter

Subject: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

On Mon, 11 Apr 2005 20:45:38 +0200, Petr Baudis wrote:
> Hello,

> please do not trim the cc list so aggressively.

Sorry. I read the list from a web site that does not show the
cc lists. I'll try to cc more people from the relevant discussions
though. On the other hand, I've dropped Linus from this message,
as it just points to something he previously said.

>Dear diary, on Mon, Apr 11, 2005 at 05:46:38PM CEST, I got a letter
>where "Adam J. Richter" <[email protected]> told me that...
>..snip..
>> Graydon Hoare. (By the way, I would prefer that git just punt to
>> user level programs for diff and merge when all of the versions
>> involved are different or at least have a very thin interface
>> for extending the facility, because I would like to do some character
>> based merge stuff.)
>..snip..

>But this is what git already does. I agree it could do it even better,
>by checking environment variables for the appropriate tools (then you
>could use that to pass diff e.g. -p etc.).

This message from Linus seemed to imply that git was going to get
its own 3-way merge code:

| Then the bad news: the merge algorithm is going to suck. It's going to be
| just plain 3-way merge, the same RCS/CVS thing you've seen before. With no
| understanding of renames etc. I'll try to find the best parent to base the
| merge off of, although early testers may have to tell the piece of crud
| what the most recent common parent was.

( from http://marc.theaimsgroup.com/?l=linux-kernel&m=111320013100822&w=2 )


__ ______________
Adam J. Richter \ /
[email protected] | g g d r a s i l

2005-04-12 01:42:10

by Petr Baudis

Subject: Re: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Dear diary, on Tue, Apr 12, 2005 at 03:20:18AM CEST, I got a letter
where "Adam J. Richter" <[email protected]> told me that...
> >Dear diary, on Mon, Apr 11, 2005 at 05:46:38PM CEST, I got a letter
> >where "Adam J. Richter" <[email protected]> told me that...
> >..snip..
> >> Graydon Hoare. (By the way, I would prefer that git just punt to
> >> user level programs for diff and merge when all of the versions
> >> involved are different or at least have a very thin interface
> >> for extending the facility, because I would like to do some character
> >> based merge stuff.)
> >..snip..
>
> >But this is what git already does. I agree it could do it even better,
> >by checking environment variables for the appropriate tools (then you
> >could use that to pass diff e.g. -p etc.).
>
> This message from Linus seemed to imply that git was going to get
> its own 3-way merge code:
>
> | Then the bad news: the merge algorithm is going to suck. It's going to be
> | just plain 3-way merge, the same RCS/CVS thing you've seen before. With no
> | understanding of renames etc. I'll try to find the best parent to base the
> | merge off of, although early testers may have to tell the piece of crud
> | what the most recent common parent was.

Well, from what I can read it says "just plain 3-way merge, the same
RCS/CVS thing you've seen before". :-)

Basically, when you look at merge(1) :

SYNOPSIS
merge [ options ] file1 file2 file3
DESCRIPTION
merge incorporates all changes that lead from file2 to file3
into file1.

The only big problem is how to guess the best file2 when you give it
file3 and file1.
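
The roles of the three arguments can be shown with a deliberately simplified sketch. This is not what merge(1) does internally (it works line by line); it applies the same three-way rule at whole-file granularity, with file2 as the common base. `three_way` is a hypothetical helper:

```python
def three_way(file1: str, file2: str, file3: str) -> str:
    """Whole-file three-way resolution; file2 is the common ancestor."""
    if file1 == file3:
        return file1  # both sides ended up identical
    if file1 == file2:
        return file3  # only the other side changed since the base
    if file3 == file2:
        return file1  # only our side changed since the base
    raise ValueError("both sides changed: needs a line-level merge")


base = "int x;\n"
ours = "int x;\n"
theirs = "long x;\n"
print(three_way(ours, base, theirs), end="")
```

A badly chosen file2 makes unchanged content look changed on both sides, which is why guessing the base well matters so much.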

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-12 05:34:12

by David Lang

Subject: Re: more git updates..

I've been reading this and have another thought for you guys to keep in
mind for this tool.

version control of system config files on linux systems.

it sounds like you could put the / filesystem under the control of git
(after teaching it to not cross filesystem boundaries so you can have
another filesystem to work with) and version control your entire system.
if this was done it would be nice to add an item type that would reference
a file in a distro package to save space. it sounds like you could run a
git checkin daily (as part of the updatedb run for example) at very little
cost.

for that matter by comparing the git data between servers (or between a
server and an archive) you could easily use it to detect tampering.

sounds very interesting, but I'm going to let things settle down a bit
before I try to tackle this (but you guys who are working on it should feel
free to add the couple options necessary to implement this if you want ;-)

David Lang

On Sun, 10 Apr 2005, Christopher Li wrote:

> Date: Sun, 10 Apr 2005 17:28:50 -0400
> From: Christopher Li <[email protected]>
> To: Linus Torvalds <[email protected]>
> Cc: Paul Jackson <[email protected]>, [email protected], [email protected],
> [email protected], [email protected]
> Subject: Re: more git updates..
>
> I see. It just needs some basic set operations (+, -, and)
> and some way to select a set:
>
>
> sha5--->
> /
> /
> sha1-->sha2-->sha3--
> \ /
> \ /
> >sha4
>
>
> list sha1 # all the file list in changeset sha1
> # {sha1}
> list sha1,sha1 # same as above
> list sha1,sha2 # all the file list in between changeset sha1
> # and changeset sha2
> # {sha1, sha2} in example
> list sha1,sha3 # {sha1, sha2, sha3, sha4}
>
> list sha1,any # all the changesets reachable from sha1.
> # {sha1, ... sha5, ...}
>
> new sha1,sha2 # all the files added between sha1 and sha2 (+)
> changed sha1,sha2 # all the files changed between sha1 and sha2 (>) (<)
> deleted sha1,sha2 # all the files deleted between sha1 and sha2 (-)
>
> before time # all the file before time
> after time # all the file after time
>
>
> So in my example, the file I want to delete is :
>
> {list hack1, base}+ {list hack2, base} ... {list hack6, base} \
> - [list official_merge, base ]
>
>
>
> On Sun, Apr 10, 2005 at 04:21:08PM -0700, Linus Torvalds wrote:
>>
>>
>>> the official tree. It is more for my local version control.
>>
>> I have a plan. Namely to have a "list-needed" command, which you give one
>> commit, and a flag implying how much "history" you want (*), and then it
>> spits out all the sha1 files it needs for that history.
>>
>> Then you delete all the other ones from your SHA1 archive (easy enough to
>> do efficiently by just sorting the two lists: the list of "needed" files
>> and the list of "available" files).
>>
>> Script that, and call the command "prune-tree" or something like that, and
>> you're all done.
>>
>> (*) The amount of history you want might be "none", which is to say that
>> you don't want to go back in time, so you want _just_ the list of tree and
>> blob objects associated with that commit.
>
> That will be {list head}
>
>>
>> Or you might want a "linear" history, which would be the longest path
>> through the parent changesets to the root.
>
> That will be {list head,root}
>
>>
>> Or you might want "all", which would follow all parents and all trees.
>
> That will be {list any, root}
>
>>
>> Or you might want to prune the history tree by date - "give me all
>> history, but cut it off when you hit a parent that was done more than 6
>> months ago".
>
> That is {after -6month }
>
>>
>> This "list-needed" thing is not just for pruning history either. If you
>> have a local tree "x", and you want to figure out how much of it you need
>> to send to somebody else who has an older tree "y", then what you'd do is
>> basically "list-needed x" and remove the set of "list-needed y". That
>> gives you the answer to the question "what's the minimum set of sha1 files
>> I need to send to the other guy so that he can re-create my top-of-tree".
>>
>
> That is {list x, any} - {list y, any}
>
>
>> My second plan is to make somebody else so fired up about the problem that
>> I can just sit back and take patches. That's really what I'm best at.
>> Sitting here, in the (rain) on the patio, drinking a foofy tropical drink,
>> and pressing the "apply" button. Then I take all the credit for my
>> incredible work.
>
> Sounds like a good plan.
>
> Chris
>

--
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
-- C.A.R. Hoare

2005-04-12 05:47:45

by Barry K. Nathan

Subject: Re: [rfc] git: combo-blobs

On Mon, Apr 11, 2005 at 06:33:58PM +0200, Ingo Molnar wrote:
> ok. Meanwhile i found another counter-argument: the average committed
> file size is 36K, which with gzip -9 would compress down to roughly 8K,
> with the commit message being another block. That's 2+1 blocks used per
> commit, while with deltas one could at most cut this down to 1+1+1
> blocks - just as much space! So we would be almost even with the more
> complex delta approach, just by increasing the default compression ratio
> from 6 to 9. (but even with the default we are not that bad.)

I think you forgot about reiserfs/reiser4 tails. (At least, I *think*
reiser4 has tails. I know reiserfs 3.x does.)

BTW, I happen to agree completely with Linus on this issue, but I still
figured I'd mention this for the sake of completeness.

-Barry K. Nathan <[email protected]>

2005-04-12 06:04:13

by Paul Jackson

Subject: Re: more git updates..

David wrote:
> and version control your entire system

Yeah - that works. That's how I back up my system. Not
git actually, but a similar sort of store (no compression,
a line oriented ascii 'index' file).

See my post on "Kernel SCM saga..", Sat, 9 Apr 2005 08:15:53 -0700,
Message-Id: <[email protected]>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-12 05:30:27

by David Eger

Subject: Re: more git updates..

So with git, *every* changeset is an entire (compressed) copy of the
kernel. Really? Every patch you accept adds 37 MB to your hard disk?

Am I missing something here?

-dte

2005-04-12 07:10:58

by Barry K. Nathan

Subject: Re: more git updates..

On Mon, Apr 11, 2005 at 10:14:13PM -0700, David Lang wrote:
> I've been reading this and have another thought for you guys to keep in
> mind for this tool.
>
> version control of system config files on linux systems.

I've been thinking about this too. (I won't have time to implement this
however. If I do have time in the near future to do anything involving
git, it probably won't have anything to do with version control of
config files.)

> it sounds like you could put the / filesystem under the control of git
> (after teaching it to not cross filesystem boundaries so you can have
> another filesystem to work with) and version control your entire system.
> if this was done it would be nice to add an item type that would reference
> a file in a distro package to save space. it sounds like you could run a
> git checkin daily (as part of the updatedb run for example) at very little
> cost.

I was thinking that the GIT checkin should actually be done by the
distro configuration tools, and not as a cronjob. And maybe the config
tools could do two checkins if there were any manual changes since the
last checkin, or something. (That is, one checkin to check in the manual
changes since the last checkin, and another to check in whatever the
config tool just did.)

Now that I think about it, it would be really good to have a simple tool
for doing a manual checkin after manual editing of config files, but I
think something like the dual-checkin scheme would be needed as a safety
net in case root forgets to do the checkin.

-Barry K. Nathan <[email protected]>

2005-04-12 08:16:21

by Petr Baudis

Subject: Re: Re: more git updates..

Dear diary, on Tue, Apr 12, 2005 at 06:05:19AM CEST, I got a letter
where David Eger <[email protected]> told me that...
> So with git, *every* changeset is an entire (compressed) copy of the
> kernel. Really? Every patch you accept adds 37 MB to your hard disk?
>
> Am I missing something here?

Yes. Only changed files re-appear. The unchanged files keep the same
SHA1 hash, therefore they don't re-appear in the repository.

So, if Linus gets a patch which sanitizes drivers/char/selection.c,
only these new objects appear in the repository:

drivers/char/selection.c
drivers/char
drivers
. (project root)
commit message
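
The object count above can be reproduced with a toy content-addressed store. This is a sketch only: the encodings for trees and commits below are simplified stand-ins, not git's on-disk format, and `put`, `write_tree`, `commit` and the file contents are hypothetical.

```python
import hashlib


def put(store: dict[str, bytes], data: bytes) -> str:
    """Store data under its SHA-1; identical content always maps to one key."""
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key


def write_tree(store: dict[str, bytes], node) -> str:
    """Recursively store a nested dict (directory) or bytes (file contents)."""
    if isinstance(node, bytes):
        return put(store, node)
    entries = [f"{name} {write_tree(store, child)}"
               for name, child in sorted(node.items())]
    return put(store, "\n".join(entries).encode())


def commit(store: dict[str, bytes], tree, message: str) -> str:
    """Store the tree, then a commit object that points at its root."""
    root = write_tree(store, tree)
    return put(store, f"tree {root}\n\n{message}\n".encode())


tree = {
    "Makefile": b"all:\n",
    "drivers": {
        "char": {"selection.c": b"old\n", "tty_io.c": b"tty\n"},
        "net": {"dummy.c": b"net\n"},
    },
}
store: dict[str, bytes] = {}
commit(store, tree, "initial import")
before = set(store)

# Change a single file and commit again.
tree["drivers"]["char"]["selection.c"] = b"sanitized\n"
commit(store, tree, "sanitize selection.c")
new_objects = set(store) - before
print(len(new_objects))
```

The new objects are the changed blob, the three directories on the path up to the root, and the commit; `drivers/net` and every untouched file hash to the same keys as before and are stored only once.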

Kind regards,

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-12 08:41:48

by Geert Uytterhoeven

Subject: Re: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

On Tue, 12 Apr 2005, Petr Baudis wrote:
> Dear diary, on Tue, Apr 12, 2005 at 03:20:18AM CEST, I got a letter
> where "Adam J. Richter" <[email protected]> told me that...
> > >Dear diary, on Mon, Apr 11, 2005 at 05:46:38PM CEST, I got a letter
> > >where "Adam J. Richter" <[email protected]> told me that...
> > >..snip..
> > >> Graydon Hoare. (By the way, I would prefer that git just punt to
> > >> user level programs for diff and merge when all of the versions
> > >> involved are different or at least have a very thin interface
> > >> for extending the facility, because I would like to do some character
> > >> based merge stuff.)
> > >..snip..
> >
> > >But this is what git already does. I agree it could do it even better,
> > >by checking environment variables for the appropriate tools (then you
> > >could use that to pass diff e.g. -p etc.).
> >
> > This message from Linus seemed to imply that git was going to get
> > its own 3-way merge code:
> >
> > | Then the bad news: the merge algorithm is going to suck. It's going to be
> > | just plain 3-way merge, the same RCS/CVS thing you've seen before. With no
> > | understanding of renames etc. I'll try to find the best parent to base the
> > | merge off of, although early testers may have to tell the piece of crud
> > | what the most recent common parent was.
>
> Well, from what I can read it says "just plain 3-way merge, the same
> RCS/CVS thing you've seen before". :-)
>
> Basically, when you look at merge(1) :
>
> SYNOPSIS
> merge [ options ] file1 file2 file3
> DESCRIPTION
> merge incorporates all changes that lead from file2 to file3
> into file1.
>
> The only big problem is how to guess the best file2 when you give it
> file3 and file1.

That's either the point just before you started modifying the file, or your
last merge point. Sounds simple, but if your SCM system doesn't track merges,
you're SOL...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2005-04-12 09:50:57

by Petr Baudis

Subject: Re: Re: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Dear diary, on Tue, Apr 12, 2005 at 10:39:40AM CEST, I got a letter
where Geert Uytterhoeven <[email protected]> told me that...
> On Tue, 12 Apr 2005, Petr Baudis wrote:
..snip..
> > Basically, when you look at merge(1) :
> >
> > SYNOPSIS
> > merge [ options ] file1 file2 file3
> > DESCRIPTION
> > merge incorporates all changes that lead from file2 to file3
> > into file1.
> >
> > The only big problem is how to guess the best file2 when you give it
> > file3 and file1.
>
> That's either the point just before you started modifying the file, or your
> last merge point. Sounds simple, but if your SCM system doesn't track merges,
> you're SOL...

Well, yes, but the last merge point search may not be so simple:

A --1---2----6---7
B \ `-4-. /
C `-3-----5'

Now, when at 7, your last merge point is not 1, but 2.
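
The diagram can be checked mechanically. The sketch below encodes one reading of it (3 branches off 1, 4 branches off 2, 5 merges 3 and 4, and 7 would merge 6 with 5); that reading, the single-character ids, and the helper names are assumptions made for illustration.

```python
def ancestors(parents: dict[str, list[str]], start: str) -> set[str]:
    """All commits reachable from start through parent links, start included."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def merge_bases(parents: dict[str, list[str]], a: str, b: str) -> set[str]:
    """Common ancestors of a and b that no other common ancestor descends from."""
    common = ancestors(parents, a) & ancestors(parents, b)
    shadowed: set[str] = set()
    for c in common:
        shadowed |= ancestors(parents, c) - {c}
    return common - shadowed


# Parent links as read from the diagram (an assumption, see above).
parents = {
    "1": [],
    "2": ["1"],
    "3": ["1"],
    "4": ["2"],
    "5": ["3", "4"],
    "6": ["2"],
}

# Commit 7 merges 6 (branch A) with 5 (branch C).
print(merge_bases(parents, "6", "5"))
```

Both 1 and 2 are common ancestors of the two heads, but 1 is itself an ancestor of 2, so 2 is the only one left as the merge point.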

What I have proposed at the git mailing list was to have simple merge
tracking - a merges/branch1/branch2 directory structure which would store
merges from branch2 to branch1. Then, when merging say to branch3, you
traverse all of them and if any of the branch1/* commits is newer than
branch3/*, you update it.

The disadvantage is that you now need to strictly use gitmerge.sh to do
the merges - Linus' revtree solution is nicer in that it
works without any explicit bookkeeping, and tracks any merges properly
recorded with commit-tree; it is more complex and more expensive,
though.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-12 13:06:57

by Petr Baudis

Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Tue, Apr 12, 2005 at 02:47:25PM CEST, I got a letter
where Martin Schlemmer <[email protected]> told me that...
> On Mon, 2005-04-11 at 15:57 +0200, Petr Baudis wrote:
> > Hello,
> >
> > here goes git-pasky-0.3, my set of patches and scripts upon
> > Linus' git, aimed at human usability and to an extent a SCM-like usage.
> >
>
> It's pretty dependent on where VERSION is located. This patch fixes
> that. (PS, I left the output of 'git diff' as is to ask about the
> following stuff after the proper diff ...)

Thanks, applied. I don't understand your PS, though. :-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-12 13:12:22

by Martin Schlemmer

Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

On Tue, 2005-04-12 at 15:02 +0200, Petr Baudis wrote:
> Dear diary, on Tue, Apr 12, 2005 at 02:47:25PM CEST, I got a letter
> where Martin Schlemmer <[email protected]> told me that...
> > On Mon, 2005-04-11 at 15:57 +0200, Petr Baudis wrote:
> > > Hello,
> > >
> > > here goes git-pasky-0.3, my set of patches and scripts upon
> > > Linus' git, aimed at human usability and to an extent a SCM-like usage.
> > >
> >
> > It's pretty dependent on where VERSION is located. This patch fixes
> > that. (PS, I left the output of 'git diff' as is to ask about the
> > following stuff after the proper diff ...)
>
> Thanks, applied. I don't understand your PS, though. :-)
>

Heh, yeah, I do that sometimes. Basically: should 'git diff' output
anything (besides maybe files that have not been added, like cvs does ...
sorry, I do not know what you are modeling it on) the way it does now?

--
Martin Schlemmer



2005-04-12 13:14:54

by David Woodhouse

Subject: Re: [ANNOUNCE] git-pasky-0.3

On Mon, 2005-04-11 at 15:57 +0200, Petr Baudis wrote:
> Hello,
>
> here goes git-pasky-0.3, my set of patches and scripts upon
> Linus' git, aimed at human usability and to an extent a SCM-like
> usage.

Untar, make, add to path, pull, 'git diff' fails on PPC:

peach /home/dwmw2/git-pasky-base $ git diff
error: bad signature
error: verify header failed
read_cache: Invalid argument

A little extra debugging shows the problem:

error: bad signature 0x43524944 should be 0x44495243

The cache.h header file suggests that the cache is host-endian on
purpose, because it's local-only. So why am I seeing a cache from
another host? Is that comment no longer true?

Either way, the original decision is probably bogus -- with trees as
large as the kernel tree it makes a lot of sense to keep them somewhere
NFS-accessible and use them from different hosts, and byteswapping
really isn't going to slow it down that much. We should just pick an
endianness and stick to it.

I'd suggest making it big-endian to make sure the LE weenies don't
forget to byteswap properly.
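
The two values in the error message are the same four bytes read in opposite orders. A minimal sketch using Python's `struct` module; the constant name `CACHE_SIGNATURE` is made up here, while the two hex values are the ones from the error output:

```python
import struct

CACHE_SIGNATURE = 0x44495243  # the value the error message says it should be

# A little-endian host writes the 32-bit signature in its native order...
on_disk = struct.pack("<I", CACHE_SIGNATURE)

# ...and a big-endian host reading those bytes natively sees them swapped.
seen_by_big_endian = struct.unpack(">I", on_disk)[0]
print(on_disk, hex(seen_by_big_endian))

# With one fixed byte order for the file, every host reads the same value.
fixed = struct.pack(">I", CACHE_SIGNATURE)
print(fixed, hex(struct.unpack(">I", fixed)[0]))
```

Written big-endian, the signature bytes spell "DIRC" on disk regardless of which machine produced the file.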

--
dwmw2

2005-04-12 13:27:39

by Petr Baudis

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Tue, Apr 12, 2005 at 03:13:15PM CEST, I got a letter
where Martin Schlemmer <[email protected]> told me that...
> On Tue, 2005-04-12 at 15:02 +0200, Petr Baudis wrote:
> > Dear diary, on Tue, Apr 12, 2005 at 02:47:25PM CEST, I got a letter
> > where Martin Schlemmer <[email protected]> told me that...
> > > On Mon, 2005-04-11 at 15:57 +0200, Petr Baudis wrote:
> > > > Hello,
> > > >
> > > > here goes git-pasky-0.3, my set of patches and scripts upon
> > > > Linus' git, aimed at human usability and to an extent a SCM-like usage.
> > > >
> > >
> > > It's pretty dependent on where VERSION is located. This patch fixes
> > > that. (PS, I left the output of 'git diff' as is to ask about the
> > > following stuff after the proper diff ...)
> >
> > Thanks, applied. I don't understand your PS, though. :-)
> >
>
> Heh, yeah I do that sometimes. Basically should 'git diff' output
> anything (besides maybe not added files like cvs ... sorry, do not know
> after what you fashion it) like it does now?

Huh. Well, git diff without any arguments should just call show-diff.
That is, it shows your local uncommitted changes. It doesn't show the locally
added/removed files yet for several reasons, but it's being worked on.
:-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-12 12:48:16

by Martin Schlemmer

Subject: Re: [ANNOUNCE] git-pasky-0.3

On Mon, 2005-04-11 at 15:57 +0200, Petr Baudis wrote:
> Hello,
>
> here goes git-pasky-0.3, my set of patches and scripts upon
> Linus' git, aimed at human usability and to an extent a SCM-like usage.
>

It's pretty dependent on where VERSION is located. This patch fixes
that. (PS, I left the output of 'git diff' as is to ask about the
following stuff after the proper diff ...)


Regards,

--
Martin Schlemmer


Attachments:
add_version.patch (3.22 kB)
signature.asc (189.00 B)
This is a digitally signed message part

2005-04-12 17:42:49

by Helge Hafting

Subject: Re: more git updates..

On Sun, Apr 10, 2005 at 09:01:22AM -0700, Linus Torvalds wrote:
>
> So I was for a while debating having a totally flat directory space, but
> since there are _some_ downsides (linear lookup for cold-cache, and just
> that "ls -l" ends up being O(n**2) and things), I decided that a single
> fan-out is probably a good idea.
>
Isn't that fixed even in ext2/ext3 these days?

man mke2fs:
dir_index
Use hashed b-trees to speed up lookups in large
directories.

Also, the popular reiserfs was designed with this in mind from the start.


> > Or maybe the files should be named objects/xx/yy/zzzzzzzzzzzzzzzz?
>
> Hey, I may end up being wrong, and yes, maybe I should have done a
> two-level one.

Unless there are still performance issues, please don't. A directory
structure with extra levels is necessarily harder to use if one
ever has to use it manually somehow.

Helge Hafting

2005-04-12 20:41:05

by Petr Baudis

Subject: Re: Re: Re: Re: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Dear diary, on Tue, Apr 12, 2005 at 11:50:48AM CEST, I got a letter
where Petr Baudis <[email protected]> told me that...
> Dear diary, on Tue, Apr 12, 2005 at 10:39:40AM CEST, I got a letter
> where Geert Uytterhoeven <[email protected]> told me that...
> > On Tue, 12 Apr 2005, Petr Baudis wrote:
> ..snip..
> > > Basically, when you look at merge(1) :
> > >
> > > SYNOPSIS
> > > merge [ options ] file1 file2 file3
> > > DESCRIPTION
> > > merge incorporates all changes that lead from file2 to file3
> > > into file1.
> > >
> > > The only big problem is how to guess the best file2 when you give it
> > > file3 and file1.
> >
> > That's either the point just before you started modifying the file, or your
> > last merge point. Sounds simple, but if your SCM system doesn't track merges,
> > you're SOL...
>
> Well, yes, but the last merge point search may not be so simple:
>
> A --1---2----6---7
> B    \   `-4-.  /
> C     `-3-----5'
>
> Now, when at 7, your last merge point is not 1, but 2.

...and this is obviously wrong, sorry. You would lose 3 this way.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-12 20:57:45

by David Eger

Subject: Re: Re: more git updates..


The reason I am questioning this point is the GIT README file.

Linus makes explicit that a "blob" is just the "file contents," and that
really, a "blob" is not just the SHA1 of the "blob":

> In particular, the "current directory cache" certainly does not need to
> be consistent with the current directory contents, but it has two very
> important attributes:
>
> (a) it can re-generate the full state it caches (not just the directory
> structure: through the "blob" object it can regenerate the data too)

And he defines "TREE" with the same name: blob

> TREE: The next hierarchical object type is the "tree" object. A tree
> object is a list of permission/name/blob data, sorted by name.

Therefore, "TREE" must be the *full* data, and since we have the following
definition for CHANGESET:

> A "changeset" is defined by the tree-object that it results in, the
> parent changesets (zero, one or more) that led up to that point, and a
> comment on what happened.

That each changeset remembers *everything* for *each point in the tree*.

Linus, if you actually mean to differentiate between the full data
and a SHA1 of the data, *please please please* say "blob" in one place
and "SHA1 of the blob" elsewhere. It's quite confusing, to me at least.

Also, the details of just what data constitutes a 'changeset' would be
lovely... i.e. a precise spec of what Pat is describing below...

-dte

> where David Eger <[email protected]> told me that...
> > So with git, *every* changeset is an entire (compressed) copy of the
> > kernel. Really? Every patch you accept adds 37 MB to your hard disk?
> >
> > Am I missing something here?
>
> Yes. Only changed files re-appear. The unchanged files keep the same
> SHA1 hash, therefore they don't re-appear in the repository.
>
> So, if Linus gets a patch which sanitizes drivers/char/selection.c,
> only these new objects appear in the repository:
>
> drivers/char/selection.c
> drivers/char
> drivers
> . (project root)
> commit message
>
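
The object count in that list can be checked with a toy content-addressed store. This is only a sketch under loud assumptions: `store`, `put`, `put_tree` and `snapshot` are made-up helpers, the extra file names (`tty_io.c`, `tg3.c`, `README`) are hypothetical, the tree encoding is invented, and the commit object is not modelled at all.

```python
import hashlib

store = {}  # object name (hex SHA1) -> object bytes


def put(data: bytes) -> str:
    """Store data under the SHA1 of its contents and return that name."""
    name = hashlib.sha1(data).hexdigest()
    store[name] = data
    return name


def put_tree(entries: dict) -> str:
    """Store a tree: a name-sorted list of (entry name, object name) pairs."""
    lines = [f"{name} {ref}" for name, ref in sorted(entries.items())]
    return put("\n".join(lines).encode())


def snapshot(selection_c: bytes) -> str:
    """Build a tiny hypothetical source tree and return the root tree's name."""
    char = put_tree({"selection.c": put(selection_c),
                     "tty_io.c": put(b"tty code\n")})
    net = put_tree({"tg3.c": put(b"net code\n")})
    drivers = put_tree({"char": char, "net": net})
    return put_tree({"drivers": drivers, "README": put(b"readme\n")})


root1 = snapshot(b"old selection code\n")
before = set(store)
root2 = snapshot(b"sanitized selection code\n")
new_objects = set(store) - before
print(len(new_objects))  # 4: the new blob plus drivers/char, drivers and the root
```

Everything that did not change hashes to a name that is already in the store, so the second snapshot only adds objects along the path from the changed file up to the root.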

2005-04-12 21:14:14

by Chris Friesen

Subject: Re: GIT license (Re: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.1)

Petr Baudis wrote:
> Dear diary, on Tue, Apr 12, 2005 at 11:50:48AM CEST, I got a letter

>>Well, yes, but the last merge point search may not be so simple:
>>
>>A --1---2----6---7
>>B    \   `-4-.  /
>>C     `-3-----5'
>>
>>Now, when at 7, your last merge point is not 1, but 2.
>
>
> ...and this is obviously wrong, sorry. You would lose 3 this way.

Wouldn't the delta between 2 and 5 include any contribution from 3?
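
One way to check this mechanically is to write the graph down as parent links and compute the common ancestors. The sketch below assumes one particular reading of the diagram (3 forks from 1, 4 forks from 2, 5 merges 3 and 4, and 7 would merge 6 with 5); if the picture was meant differently, the result changes accordingly.

```python
# Parent links under the assumed reading of the diagram.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3, 4], 6: [2]}


def ancestors(node):
    """Return the set of all proper ancestors of node."""
    seen, stack = set(), list(parents[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen


# Merging 6 with 5 (to produce 7): which commits do both sides share?
common = ancestors(6) & ancestors(5)

# Keep only common ancestors that are not ancestors of another common one.
best = {c for c in common if not any(c in ancestors(o) for o in common)}
print(sorted(common), sorted(best))  # [1, 2] [2]

# Commits reachable from 5 but not from the chosen base 2:
delta_2_to_5 = (ancestors(5) | {5}) - (ancestors(2) | {2})
print(sorted(delta_2_to_5))  # [3, 4, 5]
```

Under that reading, 3 is among the commits that separate 2 from 5, so a three-way merge with 2 as the base would still see its changes.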

Chris

2005-04-12 21:26:04

by Linus Torvalds

Subject: Re: Re: more git updates..



On Tue, 12 Apr 2005, David Eger wrote:
>
> The reason I am questioning this point is the GIT README file.
>
> Linus makes explicit that a "blob" is just the "file contents," and that
> really, a "blob" is not just the SHA1 of the "blob":
>
> > In particular, the "current directory cache" certainly does not need to
> > be consistent with the current directory contents, but it has two very
> > important attributes:
> >
> > (a) it can re-generate the full state it caches (not just the directory
> > structure: through the "blob" object it can regenerate the data too)
>
> And he defines "TREE" with the same name: blob

Yes. A tree is defined by the blobs it references (and the subtrees) but
it doesn't _contain_ them. It just contains a pointer to them.

> Therefore, "TREE" must be the *full* data, and since we have the following
> definition for CHANGESET:

No. A tree is not the full data. A tree contains enough information to
_recreate_ the full data, but the tree itself just tells you _how_ to do
that. It doesn't contain very much of the data itself at all.

> That each changeset remembers *everything* for *each point in the tree*.

But only BY REFERENCE. A "commit" is usually very small. For example, the
top-of-tree commit-file for my current kernel test is literally 401
_bytes_ in size. Because it just references a tree (20 bytes of
_reference_).

> Linus, if you actually mean to differentiate between the full data
> and a SHA1 of the data

There is no differentiation. The sha1 _is_ the data as far as git is
concerned.

It's only confusing if you think they are different.

> Also, the details of just what data constitutes a 'changeset' would be
> lovely... i.e. a precise spec of what Pat is describing below...

torvalds@ppc970:~/test-tools/linux-2.6.12-rc2> cat-file commit `cat .git/HEAD `
tree cf9fd295d3048cd84c65d5e1a5a6b606bf4fddc6
parent c7a1a189dd0fe2c6ecd0aa33f2bd2f414c7892a0
author NeilBrown <[email protected]> Tue Apr 12 08:27:08 2005
committer Linus Torvalds <[email protected]> Tue Apr 12 08:27:08 2005

[PATCH] md: remove a number of misleading calls to MD_BUG

The conditions that cause these calls to MD_BUG are not kernel bugs, just
oddities in what userspace is asking for.

Also convert analyze_sbs to return void, and the value it returned was
always 0.

Signed-off-by: Neil Brown <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

That's it. In all its glory. Compressed and tagged it's 401 bytes.

The tree it references is 677 bytes in size. That in turn references a
number of subtrees, but almost all of the sub-trees are shared with
_other_ tree commits, so their size is spread out over all the commits.

The full archive of the 2.6.12-rc2 kernel that I used for testing (only
_one_ version) is 102MB in size. That's about half of what the kernel is
uncompressed.

The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
and a test-run of 198 patches from Andrew) is 111MB. In other words,
adding 198 "full" new kernels only grew the archive by 9MB (that's all
"actual disk usage" btw - the files themselves are smaller, but since they
all end up taking up a full disk block..)
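
The "full disk block" effect is easy to put numbers on. The sketch below rounds the two object sizes quoted above up to whole blocks; the 4 KiB block size is an assumption (it depends on the filesystem, and a filesystem with tail packing would behave differently), and `on_disk_size` is a made-up helper.

```python
def on_disk_size(nbytes: int, block: int = 4096) -> int:
    """Round a file size up to whole blocks (block size is an assumed 4 KiB)."""
    return -(-nbytes // block) * block  # ceiling division, then scale back up


sizes = [401, 677]  # the commit and tree object sizes quoted above
allocated = [on_disk_size(s) for s in sizes]
print(allocated)                           # [4096, 4096]
print(round(sum(allocated) / sum(sizes), 1))  # 7.6 times the payload
```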

Basically, the whole point of git is that objects are equated with their
sha1 name, and that you can thus "include" an object by just referring to
its name. The two are equivalent.

Linus

2005-04-12 22:38:42

by David Eger

Subject: Re: Re: more git updates..

On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
>
> Yes. A tree is defined by the blobs it references (and the subtrees) but
> it doesn't _contain_ them. It just contains a pointer to them.

A pointer to them? You mean a SHA1 hash of them? or what?
Where is the *real* data stored? The real files, the real patches?
Are these somewhere completely outside of git?

> > Therefore, "TREE" must be the *full* data, and since we have the following
> > definition for CHANGESET:
>
> No. A tree is not the full data. A tree contains enough information to
> _recreate_ the full data, but the tree itself just tells you _how_ to do
> that. It doesn't contain very much of the data itself at all.

Perhaps I'd understand this if you tell me what "recreate" means.
If I have a SHA1 hash of a file, and I have the file, I can verify that said
file has the SHA1 hash it's supposed to have, but I can't generate the file
from its hash...

Sorry for being stubbornly dumb, but you'll have a couple of us puzzling
at the README ;-)

-dte

2005-04-12 22:34:04

by Krzysztof Halasa

Subject: Re: more git updates..

Linus Torvalds <[email protected]> writes:

> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

Does that mean that the 64 K changes imported from bk would take ~ 3 GB?
Is that real?
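
For reference, the back-of-the-envelope extrapolation behind that figure looks roughly like this. It is a sketch, not a measurement: it takes "64 K" as 64,000 and assumes growth stays linear at the rate seen over the 198 test patches.

```python
growth_mb = 111 - 102   # archive growth over the test run, in MB
patches = 198           # patches applied in that run
changes = 64_000        # "64 K changes", taken as 64,000 (an assumption)

per_change_kb = growth_mb * 1024 / patches
total_gb = growth_mb / patches * changes / 1024
print(round(per_change_kb, 1), round(total_gb, 2))  # 46.5 2.84
```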

Have you tried to import it?
I'm going to import the CVS data (with cvsps) - as the CVS "misses" half
the changes, the resulting archive should be half in size too?

I don't know how much space did bk use, but 3 GB for the full history
is reasonable for most people, isn't it? Especially that one can purge
older data.
--
Krzysztof Halasa

2005-04-12 22:53:58

by Linus Torvalds

Subject: Re: more git updates..



On Wed, 13 Apr 2005, Krzysztof Halasa wrote:
>
> Does that mean that the 64 K changes imported from bk would take ~ 3 GB?
> Is that real?

That's a _guess_.

> Have you tried to import it?

It would take days.

> I'm going to import the CVS data (with cvsps) - as the CVS "misses" half
> the changes, the resulting archive should be half in size too?

No. The CVS archive is going to be almost the same size. BKCVS gets about
98% of all the data. It just doesn't show the complex merge graphs, but
those are "small" in comparison.

> I don't know how much space did bk use, but 3 GB for the full history
> is reasonable for most people, isn't it? Especially that one can purge
> older data.

I think it's entirely reasonable, yes. But I may be off by an order of
magnitude. I based the 3GB on estimating from the sparse tree, but I
wasn't being too careful. Andrew estimated 2GB per year (at our current
historical rate of changes) based on my merge with him. So it's in that
general range of 3-6GB, I think.

Linus

2005-04-12 23:49:05

by Linus Torvalds

Subject: Re: Re: more git updates..



On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> At the rate of 9M for every 198 changeset checkins, that means I'll have
> to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> per-file ratio due to the too-small files) for a whole pack including all
> changesets without accounting the original 111MB of the original tree,
> with rsync -z of git. That compares with 514M _compressible_ with CVS
> format on-disk, and with ~79M of the CVS-network download with rsync -z of
> the CVS repository (assuming default gzip compression level).

Yes. CVS is much denser.

CVS is also total crap. So your point is?

Linus

2005-04-12 23:53:13

by Panagiotis Issaris

Subject: Re: Re: more git updates..

Hi David,

On Tue, Apr 12, 2005 at 06:36:23PM -0400, David Eger wrote:
> > No. A tree is not the full data. A tree contains enough information to
> > _recreate_ the full data, but the tree itself just tells you _how_ to do
> > that. It doesn't contain very much of the data itself at all.
>
> Perhaps I'd understand this if you tell me what "recreate" means.
> If I have a SHA1 hash of a file, and I have the file, I can verify that said
> file has the SHA1 hash it's supposed to have, but I can't generate the file
> from its hash...

But, but if you have that hexified SHA1 hash of a particular file you
want to access, there would be a file with a filename equal to that
hexified SHA1 hash which contained the compressed contents of the file
you're looking for.

At least, that's how I understood it...
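
As a rough illustration of that reading, here is a toy store that files compressed contents under their hex SHA1, with the single level of directory fan-out discussed earlier in the thread. It is a sketch only: `write_object` and `read_object` are made-up names, hashing the uncompressed bytes is an assumption, and git's real object format is not reproduced here.

```python
import hashlib
import os
import tempfile
import zlib


def write_object(root: str, data: bytes) -> str:
    """Compress data and file it under its hex SHA1; return that name."""
    name = hashlib.sha1(data).hexdigest()
    path = os.path.join(root, name[:2], name[2:])  # single-level fan-out
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return name


def read_object(root: str, name: str) -> bytes:
    """Look the name up on disk, decompress, and verify it against the name."""
    with open(os.path.join(root, name[:2], name[2:]), "rb") as f:
        data = zlib.decompress(f.read())
    if hashlib.sha1(data).hexdigest() != name:
        raise ValueError("object does not match its name")
    return data


root = tempfile.mkdtemp()
contents = b"int main(void) { return 0; }\n"
name = write_object(root, contents)
print(read_object(root, name) == contents)  # True
```

So the hash is never "inverted"; it is used as a lookup key for a file that holds the real data.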

With friendly regards,
Takis

--
OpenPGP key: http://lumumba.luc.ac.be/takis/takis_public_key.txt
fingerprint: 6571 13A3 33D9 3726 F728 AA98 F643 B12E ECF3 E029

2005-04-12 23:44:56

by Andrea Arcangeli

Subject: Re: Re: more git updates..

On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, plus the disk block is meaningless when
fetching the data from the network which is the real cost to worry about
when synchronizing and downloading (disk cost isn't a big deal).

The pagecache cost sounds like a very minor one too, since you don't need
the whole data in ram, not even all dentries need to be in cache. This
is one of the reasons why you don't need to run readdir, and why you can
discard the old trees anytime.

At the rate of 9M for every 198 changeset checkins, that means I'll have
to download 2.7G _uncompressible_ (i.e. already compressed with a bad
per-file ratio due to the too-small files) for a whole pack including all
changesets without accounting the original 111MB of the original tree,
with rsync -z of git. That compares with 514M _compressible_ with CVS
format on-disk, and with ~79M of the CVS-network download with rsync -z of
the CVS repository (assuming default gzip compression level).

What BKCVS provided with 79M of rsync -z, now is provided with 2.8G of
rsync -z, with a network-bound slowdown of -97.2%. Similar slowdowns
should be expected for synchronizations over time while fetching new
blobs etc...

OK, BKCVS has fewer than 60000 checkins due to the linearization and
coalescing of pulls that couldn't be represented losslessly in CVS, so
the network-bound slowdown is less than -97.2%. My math is
approximate, but the order of magnitude should remain the same.
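
For anyone checking the figures, the comparison works out roughly as below. This sketch takes 1G as 1000M and uses the 79M and 2.8G estimates exactly as stated above; it says nothing about how accurate those estimates are.

```python
cvs_mb = 79      # rsync -z download of the CVS repository, as estimated above
git_mb = 2800    # estimated git download (2.8G, taking 1G = 1000M)

ratio = git_mb / cvs_mb
cvs_share = 100 * cvs_mb / git_mb
print(round(ratio, 1))            # 35.4 times more data to transfer
print(round(cvs_share, 1))        # 2.8 (percent of the git size)
print(round(100 - cvs_share, 1))  # 97.2, the figure quoted above
```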

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but the server will need quite a bit of cpu and ram to do a
checkout/update/sync efficiently to unpack all data and create all
changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of immutable hashes certainly
make it an ideal interim format (and it may even be a very practical local
live on-disk format, except for backups). I'm only unsure if it's a
wise idea to build an SCM on top of the current git format or if it's
better to use something like SCCS or CVS to coalesce all diffs of a
single file together and to save space and make rsync -z very efficient
too (or an approach like arch and darcs that stores changesets per file,
i.e. patches).

2005-04-13 00:13:50

by Andrea Arcangeli

Subject: Re: Re: more git updates..

On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
the CVS/SCCS format as storage may be more appealing than the current
git format. I guess I should have said RCS instead of CVS, sorry if that
created any confusion. The arch/darcs approach of practically storing
patches would also be much denser but it has no efficient way of doing
"rcs up -p 1.x" on a file, that doesn't involve potentially unpacking
tons of unrelated changesets.

2005-04-13 01:14:01

by Linus Torvalds

Subject: Re: Re: more git updates..



On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
>
> I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> the CVS/SCCS format as storage may be more appealing than the current
> git format.

Go wild. I did mine in six days, and you've been whining about other
people's SCMs for three years.

In other words - go and _do_ something instead of whining. I'm not
interested.

Linus

2005-04-13 04:45:20

by Matthias Urlichs

Subject: Re: more git updates..

Hi, Linus Torvalds wrote on Tue, 12 Apr 2005 15:49:07 -0700:

>> Have you tried to import it?
>
> It would take days.

You can always import it later and then graft it into the commit tree.

That would of course change *every* commit node, but so what? They're
small, and you can delete the old ones when you're done.
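
Why every commit node changes can be seen in a small sketch: a commit's name is a hash over text that includes its parent's name, so giving the first commit a new parent renames everything after it, while the trees it points at keep their names. The commit text layout and the `commit_id`/`build_chain` helpers below are simplified assumptions, not git's exact format.

```python
import hashlib


def commit_id(tree, parent, message):
    """Name a commit by hashing its text, which includes the parent's name."""
    text = f"tree {tree}\nparent {parent}\n\n{message}\n"
    return hashlib.sha1(text.encode()).hexdigest()


def build_chain(root_parent, n=3):
    """Build n commits on top of root_parent and return their names in order."""
    ids, parent = [], root_parent
    for i in range(n):
        parent = commit_id(f"tree-{i}", parent, f"change {i}")
        ids.append(parent)
    return ids


standalone = build_chain(None)             # history that starts from nothing
grafted = build_chain("imported-history")  # same changes on top of old history
print(all(a != b for a, b in zip(standalone, grafted)))  # True
```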

--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | [email protected]


2005-04-13 08:47:17

by Russell King

Subject: Re: [ANNOUNCE] git-pasky-0.3

On Tue, Apr 12, 2005 at 02:07:36PM +0100, David Woodhouse wrote:
> I'd suggest making it big-endian to make sure the LE weenies don't
> forget to byteswap properly.

That's not a bad argument actually - especially as networking uses BE.
(and git is about networking, right?) 8)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 08:59:59

by Petr Baudis

Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Wed, Apr 13, 2005 at 10:47:05AM CEST, I got a letter
where Russell King <[email protected]> told me that...
> On Tue, Apr 12, 2005 at 02:07:36PM +0100, David Woodhouse wrote:
> > I'd suggest making it [index] big-endian to make sure the LE weenies don't
> > forget to byteswap properly.
>
> That's not a bad argument actually - especially as networking uses BE.
> (and git is about networking, right?) 8)

Theoretically, you are never supposed to share your index if you work in
a fully git environment. However, I offer some "base tarballs" which have
the unpacked source as well as the .git directory, and I think you want
the index there. Of course you can always regenerate it by

read-tree $(tree-id)

but I really don't want to (hey, dwmw got away with that too! ;-). It
forces an additional out-of-order step you need to do before making use
of your git for the first time.

The NFS argument obviously seems perfectly valid to me too. So, FWIW,
I'm personally all for it, if someone gives me a patch.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-13 09:06:35

by H. Peter Anvin

Subject: Re: [ANNOUNCE] git-pasky-0.3

Petr Baudis wrote:
> Dear diary, on Wed, Apr 13, 2005 at 10:47:05AM CEST, I got a letter
> where Russell King <[email protected]> told me that...
>
>>On Tue, Apr 12, 2005 at 02:07:36PM +0100, David Woodhouse wrote:
>>
>>>I'd suggest making it [index] big-endian to make sure the LE weenies don't
>>>forget to byteswap properly.
>>
>>That's not a bad argument actually - especially as networking uses BE.
>>(and git is about networking, right?) 8)
>
> Theoretically, you are never supposed to share your index if you work in
> a fully git environment. However, I offer some "base tarballs" which have
> the unpacked source as well as the .git directory, and I think you want
> the index there. Of course you can always regenerate it by
>
> read-tree $(tree-id)
>
> but I really don't want to (hey, dwmw got away with that too! ;-). It
> forces an additional out-of-order step you need to do before making use
> of your git for the first time.
>
> The NFS argument obviously seems perfectly valid to me too. So, FWIW,
> I'm personally all for it, if someone gives me a patch.
>

In userspace, it's definitely easier to stick with BE for a standard
byte order, simply because it's the one byte order for which one can rely
on macros being available on all platforms.

However, then I would also like to suggest replacing "unsigned int" and
"unsigned short" with uint32_t and uint16_t, even though they're
consistent on all *current* Linux platforms.
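
The combination of a fixed byte order and fixed-width fields can be sketched in a few lines. The example below packs three 32-bit big-endian words; the field names (signature, version, entries) follow the patch posted elsewhere in this thread and the signature value comes from the error message quoted earlier, but the layout is otherwise an assumption and the SHA1 that follows the real header is left out.

```python
import struct

# signature, version, entries: three unsigned 32-bit fields, big-endian.
HEADER = struct.Struct(">III")
SIGNATURE = 0x44495243


def pack_header(entries: int) -> bytes:
    """Serialize a version-1 header with the given entry count."""
    return HEADER.pack(SIGNATURE, 1, entries)


def unpack_header(raw: bytes) -> int:
    """Validate the header and return the entry count."""
    signature, version, entries = HEADER.unpack(raw[:HEADER.size])
    if signature != SIGNATURE:
        raise ValueError("bad signature")
    if version != 1:
        raise ValueError("bad version")
    return entries


raw = pack_header(1234)
print(raw[:4], unpack_header(raw))  # b'DIRC' 1234
```

Because the format string spells out both width and byte order, the bytes are the same no matter which host writes them.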

-hpa

2005-04-13 09:09:39

by David Woodhouse

Subject: Re: [ANNOUNCE] git-pasky-0.3

On Wed, 2005-04-13 at 02:06 -0700, H. Peter Anvin wrote:
> However, then I would also like to suggest replacing "unsigned int"
> and "unsigned short" with uint32_t and uint16_t, even though they're
> consistent on all *current* Linux platforms.

Agreed.

--
dwmw2


2005-04-13 09:25:12

by David Woodhouse

Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, 2005-04-13 at 10:59 +0200, Petr Baudis wrote:
> Theoretically, you are never supposed to share your index if you work
> in a fully git environment.

Maybe -- if we are prepared to propagate the BK myth that network
bandwidth and disk space are free.

Meanwhile, in the real world, it'd be really useful to support sharing.

I'd even like to see support for using multiple branches checked out of
the same .git/ repository. We already cope with having multiple branches
_in_ the repository -- all we need to do is cope with multiple indices
too, so we can have different versions checked out.

--
dwmw2


2005-04-13 09:31:09

by Russell King

Subject: Re: Re: more git updates..

On Tue, Apr 12, 2005 at 04:45:07PM -0700, Linus Torvalds wrote:
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> > At the rate of 9M for every 198 changeset checkins, that means I'll have
> > to download 2.7G _uncompressible_ (i.e. already compressed with a bad
> > per-file ratio due to the too-small files) for a whole pack including all
> > changesets without accounting the original 111MB of the original tree,
> > with rsync -z of git. That compares with 514M _compressible_ with CVS
> > format on-disk, and with ~79M of the CVS-network download with rsync -z of
> > the CVS repository (assuming default gzip compression level).
>
> Yes. CVS is much denser.
>
> CVS is also total crap. So your point is?

And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
is more dense than CVS.

BK is also a lot better than CVS. So _your_ point is?

8)

Note: I'm _not_ arguing with your sentiments towards CVS. However, I
think the space usage point still stands.

What is the space usage behaviour when you have multiple git trees?
Do we need a git relink command in git-pasky? 8)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 09:35:53

by Russell King

Subject: Re: [ANNOUNCE] git-pasky-0.3

On Mon, Apr 11, 2005 at 03:57:58PM +0200, Petr Baudis wrote:
> here goes git-pasky-0.3, my set of patches and scripts upon
> Linus' git, aimed at human usability and to an extent a SCM-like usage.

I tried this today, applied my patch for BE<->LE conversions and
glibc-2.2 compatibility (attached, still requires cleaning though),
and then tried git pull. Umm, whoops.

Here's just a small sample of what happened:

diff: /9a30ec42a6c4860d3f11ad90c1052823a020de32/show-files.c: No such file or directory
diff: /85bf824bd24f034896f5820a2628148a246f8fd1/show-files.c: No such file or directory
mkdir: cannot create directory `/9a30ec42a6c4860d3f11ad90c1052823a020de32': Permission denied
mkdir: cannot create directory `/85bf824bd24f034896f5820a2628148a246f8fd1': Permission denied
./gitdiff-do: /9a30ec42a6c4860d3f11ad90c1052823a020de32/update-cache.c: No such file or directory
./gitdiff-do: /85bf824bd24f034896f5820a2628148a246f8fd1/update-cache.c: No such file or directory
diff: /9a30ec42a6c4860d3f11ad90c1052823a020de32/update-cache.c: No such file or directory
diff: /85bf824bd24f034896f5820a2628148a246f8fd1/update-cache.c: No such file or directory
patch: **** Only garbage was found in the patch input.

--- - Wed Apr 13 09:49:43 2005
+++ cache.h Fri Apr 8 11:15:08 2005
@@ -14,6 +14,12 @@
#include <openssl/sha.h>
#include <zlib.h>

+#include <netinet/in.h>
+#define cpu_to_beuint(x) (htonl(x))
+#define beuint_to_cpu(x) (ntohl(x))
+#define cpu_to_beushort(x) (htons(x))
+#define beushort_to_cpu(x) (ntohs(x))
+
/*
* Basic data structures for the directory cache
*
@@ -67,7 +73,7 @@
#define DEFAULT_DB_ENVIRONMENT ".git/objects"

#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7)
-#define ce_size(ce) cache_entry_size((ce)->namelen)
+#define ce_size(ce) cache_entry_size(beushort_to_cpu((ce)->namelen))

#define alloc_nr(x) (((x)+16)*3/2)

--- - Wed Apr 13 09:49:43 2005
+++ read-cache.c Fri Apr 8 11:05:34 2005
@@ -271,6 +271,7 @@
/* nsec seems unreliable - not all filesystems support it, so
* as long as it is in the inode cache you get right nsec
* but after it gets flushed, you get zero nsec. */
+#if 0
if (ce->mtime.sec != (unsigned int)st->st_mtim.tv_sec
#ifdef NSEC
|| ce->mtime.nsec != (unsigned int)st->st_mtim.tv_nsec
@@ -283,15 +284,21 @@
#endif
)
changed |= CTIME_CHANGED;
- if (ce->st_uid != (unsigned int)st->st_uid ||
- ce->st_gid != (unsigned int)st->st_gid)
+#else
+ if (beuint_to_cpu(ce->mtime.sec) != (unsigned int)st->st_mtime)
+ changed |= MTIME_CHANGED;
+ if (beuint_to_cpu(ce->ctime.sec) != (unsigned int)st->st_ctime)
+ changed |= CTIME_CHANGED;
+#endif
+ if (beuint_to_cpu(ce->st_uid) != (unsigned int)st->st_uid ||
+ beuint_to_cpu(ce->st_gid) != (unsigned int)st->st_gid)
changed |= OWNER_CHANGED;
- if (ce->st_mode != (unsigned int)st->st_mode)
+ if (beuint_to_cpu(ce->st_mode) != (unsigned int)st->st_mode)
changed |= MODE_CHANGED;
- if (ce->st_dev != (unsigned int)st->st_dev ||
- ce->st_ino != (unsigned int)st->st_ino)
+ if (beuint_to_cpu(ce->st_dev) != (unsigned int)st->st_dev ||
+ beuint_to_cpu(ce->st_ino) != (unsigned int)st->st_ino)
changed |= INODE_CHANGED;
- if (ce->st_size != (unsigned int)st->st_size)
+ if (beuint_to_cpu(ce->st_size) != (unsigned int)st->st_size)
changed |= DATA_CHANGED;
return changed;
}
@@ -378,9 +378,9 @@
SHA_CTX c;
unsigned char sha1[20];

- if (hdr->signature != CACHE_SIGNATURE)
+ if (hdr->signature != cpu_to_beuint(CACHE_SIGNATURE))
return error("bad signature");
- if (hdr->version != 1)
+ if (hdr->version != cpu_to_beuint(1))
return error("bad version");
SHA1_Init(&c);
SHA1_Update(&c, hdr, offsetof(struct cache_header, sha1));
@@ -428,12 +428,12 @@
if (verify_hdr(hdr, size) < 0)
goto unmap;

- active_nr = hdr->entries;
+ active_nr = beuint_to_cpu(hdr->entries);
active_alloc = alloc_nr(active_nr);
active_cache = calloc(active_alloc, sizeof(struct cache_entry *));

offset = sizeof(*hdr);
- for (i = 0; i < hdr->entries; i++) {
+ for (i = 0; i < beuint_to_cpu(hdr->entries); i++) {
struct cache_entry *ce = map + offset;
offset = offset + ce_size(ce);
active_cache[i] = ce;
@@ -452,9 +452,9 @@
struct cache_header hdr;
int i;

- hdr.signature = CACHE_SIGNATURE;
- hdr.version = 1;
- hdr.entries = entries;
+ hdr.signature = cpu_to_beuint(CACHE_SIGNATURE);
+ hdr.version = cpu_to_beuint(1);
+ hdr.entries = cpu_to_beuint(entries);

SHA1_Init(&c);
SHA1_Update(&c, &hdr, offsetof(struct cache_header, sha1));
--- - Wed Apr 13 09:49:43 2005
+++ show-diff.c Wed Apr 13 09:49:43 2005
@@ -51,7 +51,7 @@
printf("%s: ok\n", ce->name);
continue;
}
- printf("%.*s: ", ce->namelen, ce->name);
+ printf("%.*s: ", beushort_to_cpu(ce->namelen), ce->name);
for (n = 0; n < 20; n++)
printf("%02x", ce->sha1[n]);
printf(" %02x\n", changed);
--- - Wed Apr 13 09:49:43 2005
+++ update-cache.c Fri Apr 8 11:06:17 2005
@@ -108,11 +108,11 @@
memcpy(ce->name, path, namelen);
ce->ctime.sec = st.st_ctime;
#ifdef NSEC
- ce->ctime.nsec = st.st_ctim.tv_nsec;
+ ce->ctime.nsec = 0; //st.st_ctim.tv_nsec;
#endif
ce->mtime.sec = st.st_mtime;
#ifdef NSEC
- ce->mtime.nsec = st.st_mtim.tv_nsec;
+ ce->mtime.nsec = 0; //st.st_mtim.tv_nsec;
#endif
ce->st_dev = st.st_dev;
ce->st_ino = st.st_ino;

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 09:39:05

by Russell King

Subject: Re: [ANNOUNCE] git-pasky-0.3

On Wed, Apr 13, 2005 at 10:35:21AM +0100, Russell King wrote:
> On Mon, Apr 11, 2005 at 03:57:58PM +0200, Petr Baudis wrote:
> > here goes git-pasky-0.3, my set of patches and scripts upon
> > Linus' git, aimed at human usability and to an extent a SCM-like usage.
>
> I tried this today, applied my patch for BE<->LE conversions and
> glibc-2.2 compatibility (attached, still requires cleaning though),
> and then tried git pull. Umm, whoops.

Oh, and the other thing is:

$ git pull

GNU Interactive Tools 4.3.20 (armv4l-rmk-linux-gnu), 20:02:38 Mar 7 2001
GIT is free software; you can redistribute it and/or modify it under the
terms of the GNU General Public License as published by the Free Software
Foundation; either version 2, or (at your option) any later version.
Copyright (C) 1993-1999 Free Software Foundation, Inc.
Written by Tudor Hulubei and Andrei Pitis, Bucharest, Romania

git: fatal error: `chdir' failed: permission denied.

"git" already exists as a command from about 4 years ago. Can we have
fewer TLAs for commands please? That namespace is rather over-used and
collision-prone.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 09:42:43

by Petr Baudis

Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Wed, Apr 13, 2005 at 11:25:04AM CEST, I got a letter
where David Woodhouse <[email protected]> told me that...
> On Wed, 2005-04-13 at 10:59 +0200, Petr Baudis wrote:
> > Theoretically, you are never supposed to share your index if you work
> > in a fully git environment.
>
> Maybe -- if we are prepared to propagate the BK myth that network
> bandwidth and disk space are free.
>
> Meanwhile, in the real world, it'd be really useful to support sharing.

It's fine to share the objects database. If you want to share the
directory cache, you are doing something wrong, though. What do you need
it for?

> I'd even like to see support for using multiple branches checked out of
> the same .git/ repository. We already cope with having multiple branches
> _in_ the repository -- all we need to do is cope with multiple indices
> too, so we can have different versions checked out.

I'm working on that right now. (Well, I wish I were, if other things
didn't keep distracting me.)

The idea is to have a command which will do something like:

mkdir .git
ln -s $origtree/heads $origtree/objects $origtree/tags .git
cp $origtree/HEAD .git
cd ..
read-tree $(tree-id)

Voila. Now you have a new tree with almost no current or future
overhead.

This will be used to do the out-tree merges. Command for user to do this
will likely also make it a regular branch, doing

ln -s $(realpath git/HEAD) .git/heads/branchname

so that you can refer to it easily from your other branches.

Would this do what you want?

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-13 09:46:42

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Wed, Apr 13, 2005 at 11:35:21AM CEST, I got a letter
where Russell King <[email protected]> told me that...
> On Mon, Apr 11, 2005 at 03:57:58PM +0200, Petr Baudis wrote:
> > here goes git-pasky-0.3, my set of patches and scripts upon
> > Linus' git, aimed at human usability and to an extent a SCM-like usage.
>
> I tried this today, applied my patch for BE<->LE conversions and
> glibc-2.2 compatibility (attached, still requires cleaning though),
> and then tried git pull. Umm, whoops.
>
> Here's just a small sample of what happened:
>
> diff: /9a30ec42a6c4860d3f11ad90c1052823a020de32/show-files.c: No such file or directory
> diff: /85bf824bd24f034896f5820a2628148a246f8fd1/show-files.c: No such file or directory
> mkdir: cannot create directory `/9a30ec42a6c4860d3f11ad90c1052823a020de32': Permission denied
> mkdir: cannot create directory `/85bf824bd24f034896f5820a2628148a246f8fd1': Permission denied
> ./gitdiff-do: /9a30ec42a6c4860d3f11ad90c1052823a020de32/update-cache.c: No such
> file or directory
> ./gitdiff-do: /85bf824bd24f034896f5820a2628148a246f8fd1/update-cache.c: No such
> file or directory
> diff: /9a30ec42a6c4860d3f11ad90c1052823a020de32/update-cache.c: No such file or
> directory
> diff: /85bf824bd24f034896f5820a2628148a246f8fd1/update-cache.c: No such file or
> directory
> patch: **** Only garbage was found in the patch input.

I'll bet at the top of this you have a mktemp error.

mktemp turns out to be a PITA to use - on some older systems (e.g.
Mandrake 10 stock install) it has incompatible usage to the rest of the
world. Once I have a convenient infrastructure for making a shell
library, I will probably add a test for this to it.

Try to upgrade your mktemp. That Mandrake 10 user said that urpmi should
have a newer (correct) version.

I will make a patch which will refer to ?time instead of
?tim.sec for seconds. That should fix your problem.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-13 09:49:26

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Wed, Apr 13, 2005 at 11:38:52AM CEST, I got a letter
where Russell King <[email protected]> told me that...
> On Wed, Apr 13, 2005 at 10:35:21AM +0100, Russell King wrote:
> > On Mon, Apr 11, 2005 at 03:57:58PM +0200, Petr Baudis wrote:
> > > here goes git-pasky-0.3, my set of patches and scripts upon
> > > Linus' git, aimed at human usability and to an extent a SCM-like usage.
> >
> > I tried this today, applied my patch for BE<->LE conversions and
> > glibc-2.2 compatibility (attached, still requires cleaning though),
> > and then tried git pull. Umm, whoops.
>
> Oh, and the other thing is:
>
> $ git pull
>
> GNU Interactive Tools 4.3.20 (armv4l-rmk-linux-gnu), 20:02:38 Mar 7 2001
> GIT is free software; you can redistribute it and/or modify it under the
> terms of the GNU General Public License as published by the Free Software
> Foundation; either version 2, or (at your option) any later version.
> Copyright (C) 1993-1999 Free Software Foundation, Inc.
> Written by Tudor Hulubei and Andrei Pitis, Bucharest, Romania
>
> git: fatal error: `chdir' failed: permission denied.
>
> "git" already exists as a command from about 4 years ago. Can we have
> less TLAs for commands please? That namespace is rather over-used and
> collision-prone.

I've already noticed GNU interactive tools (googling for git), but it's
Linus' choice of name. Alternative suggestions welcomed. What about
'gt'? ;-)

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-13 10:19:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Re: more git updates..

On Wed, Apr 13, 2005 at 10:30:52AM +0100, Russell King wrote:
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.

Yep, this is why I mentioned SCCS format too, I didn't know it was even
smaller, but I expected a similar density from SCCS.

> Note: I'm _not_ arguing with your sentiments towards CVS. However, I
> think the space usage point still stands.

If it wasn't for network synchronization it almost wouldn't matter, but
fetching 2.8G uncompressible when I could simply fetch 220MB
compressible (that will compress with zlib at little cost during rsync
to less than 78M), sounds a bit overkill.

> What is the space usage behaviour when you have multiple git trees?

Multiple trees in the sense of pulls from multiple developers aren't
more costly than a normal checkin, due to the "soft hardlink" property of
the hashes. It's just every checkin taking lots of space, and generating
a new uncompressible blob every time a changeset touches one file.

2005-04-13 10:24:18

by David Woodhouse

[permalink] [raw]
Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, 2005-04-13 at 11:42 +0200, Petr Baudis wrote:
> It's fine to share the objects database. If you want to share the
> directory cache, you are doing something wrong, though. What do you
> need it for?

I want to _not_ care which machine I happen to be on when I use git
repositories which live in my home directory. I want all operations to
just work, regardless of whether the shell I'm looking at happens to be
on a BE or a LE box.

> <...> Would this do what you want?

Sounds ideal.

--
dwmw2

2005-04-13 10:29:06

by Russell King

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, Apr 13, 2005 at 11:46:19AM +0200, Petr Baudis wrote:
> I'll bet at the top of this you have a mktemp error.

Indeed, thanks.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 10:58:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Re: more git updates..

On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
> Go wild. I did mine in six days, and you've been whining about other
> peoples SCM's for three years.

Even if I spend 6 days doing git, you'd never have thrown away BK in
exchange for git.

> In other words - go and _do_ something instead of whining. I'm not
> interested.

CVS and SVN are already an order of magnitude more efficient than git at
storing and exporting the data and they shouldn't annoy you during the
checkins either, they have a backend much more efficient than git too,
and yet you seem not to care about them.

My suggestion was simply to at least change git to coalesce the diffs
like CVS/SCCS, I'm only making a suggestion to give git a chance to have
a backend at least as efficient as the one that CVS uses and to avoid
running rsync on a 2.8G uncompressible blob. I don't have enough spare
time to do something myself, my spare time would be too short anyway to
make a difference in SCM space, so I'd rather spend it all in more
innovative space where it might have a slight chance to make a
difference.

2005-04-13 11:03:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3


* Petr Baudis <[email protected]> wrote:

> > Oh, and the other thing is:
> >
> > $ git pull
> >
> > GNU Interactive Tools 4.3.20 (armv4l-rmk-linux-gnu), 20:02:38 Mar 7 2001
> > GIT is free software; you can redistribute it and/or modify it under the
> > terms of the GNU General Public License as published by the Free Software
> > Foundation; either version 2, or (at your option) any later version.
> > Copyright (C) 1993-1999 Free Software Foundation, Inc.
> > Written by Tudor Hulubei and Andrei Pitis, Bucharest, Romania
> >
> > git: fatal error: `chdir' failed: permission denied.
> >
> > "git" already exists as a command from about 4 years ago. Can we have
> > less TLAs for commands please? That namespace is rather over-used and
> > collision-prone.
>
> I've already noticed GNU interactive tools (googling for git), but
> it's Linus' choice of name. Alternative suggestions welcomed. What
> about 'gt'? ;-)

'gt' or 'gi' both sound fine - 'gi' being a bit faster to type ;-).
(Even 'get' seems to be unused in the command namespace.)

Ingo

2005-04-13 12:44:26

by Xavier Bestel

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wednesday, April 13, 2005 at 10:25 +0100, David Woodhouse wrote:
> On Wed, 2005-04-13 at 10:59 +0200, Petr Baudis wrote:
> > Theoretically, you are never supposed to share your index if you work
> > in fully git environment.
>
> Maybe -- if we are prepared to propagate the BK myth that network
> bandwidth and disk space are free.

On a related note, maybe kernel.org should host .torrent files (and
serve them) for the kernel git repository. That would ease the pain.

Xav


2005-04-13 14:37:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3



On Wed, 13 Apr 2005, David Woodhouse wrote:
>
> I'd even like to see support for using multiple branches checked out of
> the same .git/ repository.

David, we already can. The objects are _designed_ to be shared.

However, that is the ".git/objects" subdirectory. Not the per-view stuff.
For each _view_ you do need to have view-specific data, and the view index
very much is that. That's ".git/index".

The index file isn't small - it's about 1.6MB for a kernel tree, because
it needs to list every single file we know about, its "stat" information,
and it's sha1 backing store. So multiply 17,000 by ~40, and add in the
size of the name of each file, and avoid compression because this is read
and written _all_ the time, and you end up with 1.6MB.
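
The arithmetic above can be checked with a rough sketch. The struct layout
below is an assumption inferred from the field names that appear in the
patches in this thread (it is not copied from cache.h), and the 30-byte
average path length is likewise a guess. The rounding mirrors the
cache_entry_size() macro. Under those assumptions the stat fields come to
40 bytes per entry and 17,000 entries land at roughly 1.6MB:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, inferred from field names used in this thread;
 * the real struct in cache.h may differ. */
struct cache_time {
	uint32_t sec;
	uint32_t nsec;
};

struct cache_entry {
	struct cache_time ctime;   /* 8 bytes */
	struct cache_time mtime;   /* 8 bytes */
	uint32_t st_dev;
	uint32_t st_ino;
	uint32_t st_mode;
	uint32_t st_uid;
	uint32_t st_gid;
	uint32_t st_size;          /* stat fields end at byte 40 */
	unsigned char sha1[20];    /* object name of the backing blob */
	uint16_t namelen;
	char name[];               /* path, NUL-terminated */
};

/* Same rounding as cache_entry_size(): header, name, NUL, padded to 8 bytes. */
static size_t entry_size(size_t namelen)
{
	return (offsetof(struct cache_entry, name) + namelen + 8) & ~(size_t)7;
}

/* Estimated index size in bytes, ignoring the small file header. */
static size_t index_size_estimate(size_t nr_files, size_t avg_namelen)
{
	return nr_files * entry_size(avg_namelen);
}
```

Calling index_size_estimate(17000, 30) gives 1,632,000 bytes with this
layout, which is in line with the 1.6MB figure quoted above.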

But you _need_ one per checked-out tree. And it really _is_ private. It's
not supposed to be shared. In fact, it _cannot_ be shared, because it
doesn't have sufficient locking (it has some, but that's just to catch
_errors_ when somebody tries to do two operations that update the index
file at the same time in the same view). But even ignoring the locking
issues, it just isn't appropriate, it's not how that file works.

In other words, that index file simply _cannot_ be shared. Don't even
think about it. Only madness will ensue.

Linus

2005-04-13 14:41:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Re: more git updates..



On Wed, 13 Apr 2005, Russell King wrote:
>
> And my entire 2.6.12-rc2 BK tree, unchecked out, is about 220MB, which
> is more dense than CVS.
>
> BK is also a lot better than CVS. So _your_ point is?

Hey, anybody who wants to argue that BK is better than GIT won't be
getting any counter-arguments from me.

The fact is, I have constraints. Like needing something to work within a
few days. If somebody comes up with an ultra-fast, replicatable, space
efficient SCM in three days, I'm all over it.

In the meantime, I'd suggest people who worry about network bandwidth try
to work out a synchronization protocol that allows you to send "diff
updates" between git repositories. The git model doesn't preclude looking
at the objects and sending diffs instead (and re-creating the objects on
the other side). But my time-constraints _do_.

Linus

2005-04-13 14:47:57

by David Woodhouse

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, 2005-04-13 at 07:38 -0700, Linus Torvalds wrote:
> David, we already can. The objects are _designed_ to be shared.
>
> However, that is the ".git/objects" subdirectory. Not the per-view stuff.
> For each _view_ you do need to have view-specific data, and the view index
> very much is that. That's ".git/index".

Yep, it takes very little to achieve that -- to allow multiple checked-
out trees from a single object database. Petr's already outlined what it
takes.

> In other words, that index file simply _cannot_ be shared. Don't even
> think about it. Only madness will ensue.

If I use git in my home directory I cannot _help_ but share it.
Sometimes I'm using it from a BE box, sometimes from a LE box. Should I
really be forced to use separate checkouts for each type of machine?
It's bad enough having to do that for ~/bin :)

Seriously, it shouldn't have a significantly detrimental effect on the
performance if we just use explicitly sized types and fixed byte-order.
It's just not worth the pain of being gratuitously non-portable.
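
A minimal sketch of what "explicitly sized types and fixed byte-order" could
look like for a three-field header. The helper names (put_be32, get_be32,
header_to_disk, header_from_disk) and the 12-byte layout are hypothetical,
chosen only for illustration; only htonl()/ntohl() are real library calls.
The point is that the bytes on disk come out the same on a BE and a LE box:

```c
#include <arpa/inet.h>   /* htonl(), ntohl() */
#include <stdint.h>
#include <string.h>

/* Host-order view of a hypothetical index header. */
struct host_header {
	uint32_t signature;
	uint32_t version;
	uint32_t entries;
};

/* Store a 32-bit value at p in big-endian order. */
static void put_be32(unsigned char *p, uint32_t v)
{
	uint32_t be = htonl(v);
	memcpy(p, &be, sizeof(be));
}

/* Load a big-endian 32-bit value from p into host order. */
static uint32_t get_be32(const unsigned char *p)
{
	uint32_t be;
	memcpy(&be, p, sizeof(be));
	return ntohl(be);
}

/* Serialize the header into 12 bytes with a fixed byte order. */
static void header_to_disk(const struct host_header *in, unsigned char out[12])
{
	put_be32(out + 0, in->signature);
	put_be32(out + 4, in->version);
	put_be32(out + 8, in->entries);
}

/* Parse 12 big-endian bytes back into a host-order header. */
static void header_from_disk(const unsigned char in[12], struct host_header *out)
{
	out->signature = get_be32(in + 0);
	out->version   = get_be32(in + 4);
	out->entries   = get_be32(in + 8);
}
```

On a BE machine htonl() is a no-op, and on a LE machine it is a single byte
swap per field, so the cost should be small next to the I/O itself.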

--
dwmw2

2005-04-13 14:48:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3



On Wed, 13 Apr 2005, Ingo Molnar wrote:
> >
> > I've already noticed GNU interactive tools (googling for git), but
> > it's Linus' choice of name. Alternative suggestions welcomed. What
> > about 'gt'? ;-)
>
> 'gt' or 'gi' both sound fine - 'gi' being a bit faster to type ;-).
> (Even 'get' seems to be unused in the command namespace.)

Let's be realistic here. "git" as in "gnu interactive tools" was last
actively developed in 1996, and had even its last maintenance release
over five years ago.

Let it go, people.

Linus

2005-04-13 14:58:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3



On Wed, 13 Apr 2005, David Woodhouse wrote:
>
> > In other words, that index file simply _cannot_ be shared. Don't even
> > think about it. Only madness will ensue.
>
> If I use git in my home directory I cannot _help_ but share it.
> Sometimes I'm using it from a BE box, sometimes from a LE box. Should I
> really be forced to use separate checkouts for each type of machine?

Now, that kind of "private sharing" should certainly be ok. In fact, the
only locking there is (doing the ".git/index.lock" thing around any
updates and erroring out if it already exists) was somewhat designed for
that. So making it use BE data (preferred just because then you can use
the existing htonl() etc helpers in user space) would work.
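
For readers unfamiliar with the ".git/index.lock" idea, here is a rough
sketch of the general lock-file pattern, not the actual git code. The
function names are made up for illustration, and the rename-into-place step
is an assumption about how such a scheme is usually completed. The key part
is O_EXCL: creating the lock file fails if it already exists, which is the
"erroring out" described above:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Try to take the lock. Returns a writable fd for the lock file, or -1
 * with errno == EEXIST if another update is already in progress. */
static int lock_index(const char *lockpath)
{
	return open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/* Finish an update: the new contents were written to the lock file, so
 * renaming it over the index both publishes them and drops the lock. */
static int commit_index(int fd, const char *lockpath, const char *indexpath)
{
	if (close(fd) < 0) {
		unlink(lockpath);
		return -1;
	}
	return rename(lockpath, indexpath);
}

/* Abandon an update: drop the lock and leave the old index untouched. */
static void rollback_index(int fd, const char *lockpath)
{
	close(fd);
	unlink(lockpath);
}
```

A caller would take the lock, write the new index through the returned fd,
and then call commit_index() on success or rollback_index() on failure. This
catches two concurrent updates in one view; it does nothing to make the file
meaningful to a second view.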

As long as people don't think this means anything else... It really is a
private file.

Linus

2005-04-13 16:47:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.3

Xavier Bestel wrote:
> On Wednesday, April 13, 2005 at 10:25 +0100, David Woodhouse wrote:
>
>>On Wed, 2005-04-13 at 10:59 +0200, Petr Baudis wrote:
>>
>>>Theoretically, you are never supposed to share your index if you work
>>>in fully git environment.
>>
>>Maybe -- if we are prepared to propagate the BK myth that network
>>bandwidth and disk space are free.
>
>
> On a related note, maybe kernel.org should host .torrent files (and
> serve them) for the kernel git repository. That would ease the pain.
>

/me inflicts major bodily harm on Xav.

There is a reason we (kernel.org) don't touch Bittorrent: for a
variety of reasons, Bittorrent doesn't lend itself very well to
automation. Jeff Garzik and I have been sketching on a sane replacement
for Bittorrent with the working name "Software Distribution Protocol",
but it's not even vaporware so far.

-hpa

2005-04-13 17:01:37

by Daniel Barkalow

[permalink] [raw]
Subject: Re: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, 13 Apr 2005, Petr Baudis wrote:

> Dear diary, on Wed, Apr 13, 2005 at 11:25:04AM CEST, I got a letter
> where David Woodhouse <[email protected]> told me that...
> > On Wed, 2005-04-13 at 10:59 +0200, Petr Baudis wrote:
> > > Theoretically, you are never supposed to share your index if you work
> > > in fully git environment.
> >
> > Maybe -- if we are prepared to propagate the BK myth that network
> > bandwidth and disk space are free.
> >
> > Meanwhile, in the real world, it'd be really useful to support sharing.
>
> It's fine to share the objects database. If you want to share the
> directory cache, you are doing something wrong, though. What do you need
> it for?
>
> > I'd even like to see support for using multiple branches checked out of
> > the same .git/ repository. We already cope with having multiple branches
> > _in_ the repository -- all we need to do is cope with multiple indices
> > too, so we can have different versions checked out.
>
> I'm working on that right now. (Well, I wish I would, if other things
> didn't keep distracting me.)
>
> The idea is to have a command which will do something like:
>
> mkdir .git
> ln -s $origtree/heads $origtree/objects $origtree/tags .git
> cp $origtree/HEAD .git
> cd ..
> read-tree $(tree-id)
>
> Voila. Now you have a new tree with almost no current or future
> overhead.

For future reference, git is unhappy if you actually do this, because your
HEAD won't match the (empty) contents of the new directory. The easiest
thing is to cp -r your original, replace the shared stuff with links, and
go from there.

-Daniel
*This .sig left intentionally blank*

2005-04-13 18:07:28

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Wed, Apr 13, 2005 at 07:01:34PM CEST, I got a letter
where Daniel Barkalow <[email protected]> told me that...
> For future reference, git is unhappy if you actually do this, because your
> HEAD won't match the (empty) contents of the new directory. The easiest
> thing is to cp -r your original, replace the shared stuff with links, and
> go from there.

How is it unhappy? That would likely be a bug, unless you do something
which really *needs* the tree populated and doesn't make sense otherwise
(show-diff aka git diff w/o arguments, for example).

Given that what you would copy with cp -r and wipe shortly after
(objects db) is likely to be significantly larger than the working tree
itself, checkout-cache would be wiser anyway.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-13 18:16:04

by Xavier Bestel

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.3

On Wednesday, April 13, 2005 at 09:48 -0700, H. Peter Anvin wrote:
> Xavier Bestel wrote:
> > On a related note, maybe kernel.org should host .torrent files (and
> > serve them) for the kernel git repository. That would ease the pain.
> >
>
> /me inflicts major bodily harm on Xav.
>
> There is a reason we (kernel.org) don't touch Bittorrent: for a
> variety of reasons, Bittorrent doesn't lend itself very well to
> automation. Jeff Garzik and I have been sketching on a sane replacement
> for Bittorrent with the working name "Software Distribution Protocol",
> but it's not even vaporware so far.

Aah, technical details ... glad to hear that, though.

Xav


2005-04-13 18:20:24

by Linus Torvalds

[permalink] [raw]
Subject: git mailing list (Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3)



On Wed, 13 Apr 2005, Petr Baudis wrote:
>
> Dear diary, on Wed, Apr 13, 2005 at 07:01:34PM CEST, I got a letter
> where Daniel Barkalow <[email protected]> told me that...
> > For future reference, git is unhappy if you actually do this, because your
> > HEAD won't match the (empty) contents of the new directory. The easiest
> > thing is to cp -r your original, replace the shared stuff with links, and
> > go from there.
>
> How is it unhappy?

I think it's just Daniel being unhappy because he didn't do the read-tree
+ checkout-cache + update-cache steps ;)

Btw, I'm going to stop cc'ing linux-kernel on git issues (after this
email, which also acts as an announcement for people who haven't noticed
already), since anybody who is interested in git can just use the
"[email protected]" mailing list:

echo 'subscribe git' | mail [email protected]

to get you subscribed (and you'll get a message back asking you to
authorize it to avoid spam - if you don't get anything back, it failed).

Linus

2005-04-13 18:38:36

by Daniel Barkalow

[permalink] [raw]
Subject: Re: Re: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, 13 Apr 2005, Petr Baudis wrote:

> Dear diary, on Wed, Apr 13, 2005 at 07:01:34PM CEST, I got a letter
> where Daniel Barkalow <[email protected]> told me that...
> > For future reference, git is unhappy if you actually do this, because your
> > HEAD won't match the (empty) contents of the new directory. The easiest
> > thing is to cp -r your original, replace the shared stuff with links, and
> > go from there.
>
> How is it unhappy? That would likely be a bug, unless you do something
> which really *needs* the tree populated and doesn't make sense otherwise
> (show-diff aka git diff w/o arguments, for example).

If you copy HEAD without copying the files, it will then try to apply the
patches which would apply to your previous directory to the empty
directory, which will just give a lot of errors about missing files. If
you don't copy HEAD, it tries to compare against nothing.

Upon further consideration, a "checkout-cache -a" at the end of your list
makes things generally happy.

The next problem is that rsync is replacing the .git/objects symlink with
the remote system's directory, which makes this not actually helpful.

-Daniel
*This .sig left intentionally blank*

2005-04-13 19:04:07

by Russell King

[permalink] [raw]
Subject: Re: [ANNOUNCE] git-pasky-0.3

On Wed, Apr 13, 2005 at 10:35:21AM +0100, Russell King wrote:
> I tried this today, applied my patch for BE<->LE conversions and
> glibc-2.2 compatibility (attached, still requires cleaning though),
> and then tried git pull. Umm, whoops.

Here's an updated patch which allows me to work with a BE-based
cache. I've just used this to grab and checkout sparse.git.

Note: it also fixes my glibc-2.2 build problem with the nsec
stat64 structures (see read-cache.c).

--- cache.h
+++ cache.h Wed Apr 13 11:23:39 2005
@@ -14,6 +14,12 @@
#include <openssl/sha.h>
#include <zlib.h>

+#include <netinet/in.h>
+#define cpu_to_beuint(x) (htonl(x))
+#define beuint_to_cpu(x) (ntohl(x))
+#define cpu_to_beushort(x) (htons(x))
+#define beushort_to_cpu(x) (ntohs(x))
+
/*
* Basic data structures for the directory cache
*
@@ -67,7 +73,7 @@
#define DEFAULT_DB_ENVIRONMENT ".git/objects"

#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7)
-#define ce_size(ce) cache_entry_size((ce)->namelen)
+#define ce_size(ce) cache_entry_size(beushort_to_cpu((ce)->namelen))

#define alloc_nr(x) (((x)+16)*3/2)

--- checkout-cache.c
+++ checkout-cache.c Wed Apr 13 19:52:08 2005
@@ -77,7 +77,7 @@
return error("checkout-cache: unable to read sha1 file of %s (%s)",
ce->name, sha1_to_hex(ce->sha1));
}
- fd = create_file(ce->name, ce->st_mode);
+ fd = create_file(ce->name, beuint_to_cpu(ce->st_mode));
if (fd < 0) {
free(new);
return error("checkout-cache: unable to create %s (%s)",
--- read-cache.c
+++ read-cache.c Wed Apr 13 19:37:00 2005
@@ -271,27 +271,34 @@
/* nsec seems unreliable - not all filesystems support it, so
* as long as it is in the inode cache you get right nsec
* but after it gets flushed, you get zero nsec. */
- if (ce->mtime.sec != (unsigned int)st->st_mtim.tv_sec
+#if 0
+ if (beuint_to_cpu(ce->mtime.sec) != (unsigned int)st->st_mtim.tv_sec
#ifdef NSEC
- || ce->mtime.nsec != (unsigned int)st->st_mtim.tv_nsec
+ || beuint_to_cpu(ce->mtime.nsec) != (unsigned int)st->st_mtim.tv_nsec
#endif
)
changed |= MTIME_CHANGED;
- if (ce->ctime.sec != (unsigned int)st->st_ctim.tv_sec
+ if (beuint_to_cpu(ce->ctime.sec) != (unsigned int)st->st_ctim.tv_sec
#ifdef NSEC
- || ce->ctime.nsec != (unsigned int)st->st_ctim.tv_nsec
+ || beuint_to_cpu(ce->ctime.nsec) != (unsigned int)st->st_ctim.tv_nsec
#endif
)
changed |= CTIME_CHANGED;
- if (ce->st_uid != (unsigned int)st->st_uid ||
- ce->st_gid != (unsigned int)st->st_gid)
+#else
+ if (beuint_to_cpu(ce->mtime.sec) != (unsigned int)st->st_mtime)
+ changed |= MTIME_CHANGED;
+ if (beuint_to_cpu(ce->ctime.sec) != (unsigned int)st->st_ctime)
+ changed |= CTIME_CHANGED;
+#endif
+ if (beuint_to_cpu(ce->st_uid) != (unsigned int)st->st_uid ||
+ beuint_to_cpu(ce->st_gid) != (unsigned int)st->st_gid)
changed |= OWNER_CHANGED;
- if (ce->st_mode != (unsigned int)st->st_mode)
+ if (beuint_to_cpu(ce->st_mode) != (unsigned int)st->st_mode)
changed |= MODE_CHANGED;
- if (ce->st_dev != (unsigned int)st->st_dev ||
- ce->st_ino != (unsigned int)st->st_ino)
+ if (beuint_to_cpu(ce->st_dev) != (unsigned int)st->st_dev ||
+ beuint_to_cpu(ce->st_ino) != (unsigned int)st->st_ino)
changed |= INODE_CHANGED;
- if (ce->st_size != (unsigned int)st->st_size)
+ if (beuint_to_cpu(ce->st_size) != (unsigned int)st->st_size)
changed |= DATA_CHANGED;
return changed;
}
@@ -320,7 +327,7 @@
while (last > first) {
int next = (last + first) >> 1;
struct cache_entry *ce = active_cache[next];
- int cmp = cache_name_compare(name, namelen, ce->name, ce->namelen);
+ int cmp = cache_name_compare(name, namelen, ce->name, beushort_to_cpu(ce->namelen));
if (!cmp)
return next;
if (cmp < 0) {
@@ -347,7 +354,7 @@
{
int pos;

- pos = cache_name_pos(ce->name, ce->namelen);
+ pos = cache_name_pos(ce->name, beushort_to_cpu(ce->namelen));

/* existing match? Just replace it */
if (pos >= 0) {
@@ -378,9 +385,9 @@
SHA_CTX c;
unsigned char sha1[20];

- if (hdr->signature != CACHE_SIGNATURE)
+ if (hdr->signature != cpu_to_beuint(CACHE_SIGNATURE))
return error("bad signature");
- if (hdr->version != 1)
+ if (hdr->version != cpu_to_beuint(1))
return error("bad version");
SHA1_Init(&c);
SHA1_Update(&c, hdr, offsetof(struct cache_header, sha1));
@@ -428,12 +435,12 @@
if (verify_hdr(hdr, size) < 0)
goto unmap;

- active_nr = hdr->entries;
+ active_nr = beuint_to_cpu(hdr->entries);
active_alloc = alloc_nr(active_nr);
active_cache = calloc(active_alloc, sizeof(struct cache_entry *));

offset = sizeof(*hdr);
- for (i = 0; i < hdr->entries; i++) {
+ for (i = 0; i < beuint_to_cpu(hdr->entries); i++) {
struct cache_entry *ce = map + offset;
offset = offset + ce_size(ce);
active_cache[i] = ce;
@@ -452,9 +459,9 @@
struct cache_header hdr;
int i;

- hdr.signature = CACHE_SIGNATURE;
- hdr.version = 1;
- hdr.entries = entries;
+ hdr.signature = cpu_to_beuint(CACHE_SIGNATURE);
+ hdr.version = cpu_to_beuint(1);
+ hdr.entries = cpu_to_beuint(entries);

SHA1_Init(&c);
SHA1_Update(&c, &hdr, offsetof(struct cache_header, sha1));
--- read-tree.c
+++ read-tree.c Wed Apr 13 19:56:52 2005
@@ -13,8 +13,8 @@

memset(ce, 0, size);

- ce->st_mode = mode;
- ce->namelen = baselen + len;
+ ce->st_mode = cpu_to_beuint(mode);
+ ce->namelen = cpu_to_beushort(baselen + len);
memcpy(ce->name, base, baselen);
memcpy(ce->name + baselen, pathname, len+1);
memcpy(ce->sha1, sha1, 20);
--- show-diff.c
+++ show-diff.c Wed Apr 13 11:27:34 2005
@@ -89,7 +89,7 @@
changed = cache_match_stat(ce, &st);
if (!changed)
continue;
- printf("%.*s: ", ce->namelen, ce->name);
+ printf("%.*s: ", beushort_to_cpu(ce->namelen), ce->name);
for (n = 0; n < 20; n++)
printf("%02x", ce->sha1[n]);
printf(" %02x\n", changed);
--- update-cache.c
+++ update-cache.c Wed Apr 13 19:55:16 2005
@@ -68,18 +68,18 @@
*/
static void fill_stat_cache_info(struct cache_entry *ce, struct stat *st)
{
- ce->ctime.sec = st->st_ctime;
+ ce->ctime.sec = cpu_to_beuint(st->st_ctime);
#ifdef NSEC
- ce->ctime.nsec = st->st_ctim.tv_nsec;
+ ce->ctime.nsec = cpu_to_beuint(st->st_ctim.tv_nsec);
#endif
- ce->mtime.sec = st->st_mtime;
+ ce->mtime.sec = cpu_to_beuint(st->st_mtime);
#ifdef NSEC
- ce->mtime.nsec = st->st_mtim.tv_nsec;
+ ce->mtime.nsec = cpu_to_beuint(st->st_mtim.tv_nsec);
#endif
- ce->st_dev = st->st_dev;
- ce->st_ino = st->st_ino;
- ce->st_uid = st->st_uid;
- ce->st_gid = st->st_gid;
+ ce->st_dev = cpu_to_beuint(st->st_dev);
+ ce->st_ino = cpu_to_beuint(st->st_ino);
+ ce->st_uid = cpu_to_beuint(st->st_uid);
+ ce->st_gid = cpu_to_beuint(st->st_gid);
}

static int add_file_to_cache(char *path)
@@ -106,21 +106,21 @@
ce = malloc(size);
memset(ce, 0, size);
memcpy(ce->name, path, namelen);
- ce->ctime.sec = st.st_ctime;
+ ce->ctime.sec = cpu_to_beuint(st.st_ctime);
#ifdef NSEC
- ce->ctime.nsec = st.st_ctim.tv_nsec;
+ ce->ctime.nsec = cpu_to_beuint(st.st_ctim.tv_nsec);
#endif
- ce->mtime.sec = st.st_mtime;
+ ce->mtime.sec = cpu_to_beuint(st.st_mtime);
#ifdef NSEC
- ce->mtime.nsec = st.st_mtim.tv_nsec;
+ ce->mtime.nsec = cpu_to_beuint(st.st_mtim.tv_nsec);
#endif
- ce->st_dev = st.st_dev;
- ce->st_ino = st.st_ino;
- ce->st_mode = st.st_mode;
- ce->st_uid = st.st_uid;
- ce->st_gid = st.st_gid;
- ce->st_size = st.st_size;
- ce->namelen = namelen;
+ ce->st_dev = cpu_to_beuint(st.st_dev);
+ ce->st_ino = cpu_to_beuint(st.st_ino);
+ ce->st_mode = cpu_to_beuint(st.st_mode);
+ ce->st_uid = cpu_to_beuint(st.st_uid);
+ ce->st_gid = cpu_to_beuint(st.st_gid);
+ ce->st_size = cpu_to_beuint(st.st_size);
+ ce->namelen = cpu_to_beushort(namelen);

if (index_fd(path, namelen, ce, fd, &st) < 0)
return -1;
@@ -201,7 +201,7 @@
updated = malloc(size);
memcpy(updated, ce, size);
fill_stat_cache_info(updated, &st);
- updated->st_size = st.st_size;
+ updated->st_size = cpu_to_beuint(st.st_size);
return updated;
}

--- write-tree.c
+++ write-tree.c Wed Apr 13 19:30:45 2005
@@ -45,7 +45,7 @@
do {
struct cache_entry *ce = cachep[nr];
const char *pathname = ce->name, *filename, *dirname;
- int pathlen = ce->namelen, entrylen;
+ int pathlen = beushort_to_cpu(ce->namelen), entrylen;
unsigned char *sha1;
unsigned int mode;

@@ -54,7 +54,7 @@
break;

sha1 = ce->sha1;
- mode = ce->st_mode;
+ mode = beuint_to_cpu(ce->st_mode);

/* Do we have _further_ subdirectories? */
filename = pathname + baselen;


--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 19:15:15

by Petr Baudis

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

Dear diary, on Wed, Apr 13, 2005 at 09:03:07PM CEST, I got a letter
where Russell King <[email protected]> told me that...
> On Wed, Apr 13, 2005 at 10:35:21AM +0100, Russell King wrote:
> > I tried this today, applied my patch for BE<->LE conversions and
> > glibc-2.2 compatibility (attached, still requires cleaning though),
> > and then tried git pull. Umm, whoops.
>
> Here's an updated patch which allows me to work with a BE-based
> cache. I've just used this to grab and checkout sparse.git.
>
> Note: it also fixes my glibc-2.2 build problem with the nsec
> stat64 structures (see read-cache.c).
>
> --- cache.h
> +++ cache.h Wed Apr 13 11:23:39 2005
> @@ -14,6 +14,12 @@
> #include <openssl/sha.h>
> #include <zlib.h>
>
> +#include <netinet/in.h>
> +#define cpu_to_beuint(x) (htonl(x))
> +#define beuint_to_cpu(x) (ntohl(x))
> +#define cpu_to_beushort(x) (htons(x))
> +#define beushort_to_cpu(x) (ntohs(x))
> +
> /*
> * Basic data structures for the directory cache
> *

What do the wrapper macros gain us?

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-13 19:22:10

by Russell King

[permalink] [raw]
Subject: Re: Re: [ANNOUNCE] git-pasky-0.3

On Wed, Apr 13, 2005 at 09:13:39PM +0200, Petr Baudis wrote:
> Dear diary, on Wed, Apr 13, 2005 at 09:03:07PM CEST, I got a letter
> where Russell King <[email protected]> told me that...
> > On Wed, Apr 13, 2005 at 10:35:21AM +0100, Russell King wrote:
> > > I tried this today, applied my patch for BE<->LE conversions and
> > > glibc-2.2 compatibility (attached, still requires cleaning though),
> > > and then tried git pull. Umm, whoops.
> >
> > Here's an updated patch which allows me to work with a BE-based
> > cache. I've just used this to grab and checkout sparse.git.
> >
> > Note: it also fixes my glibc-2.2 build problem with the nsec
> > stat64 structures (see read-cache.c).
> >
> > --- cache.h
> > +++ cache.h Wed Apr 13 11:23:39 2005
> > @@ -14,6 +14,12 @@
> > #include <openssl/sha.h>
> > #include <zlib.h>
> >
> > +#include <netinet/in.h>
> > +#define cpu_to_beuint(x) (htonl(x))
> > +#define beuint_to_cpu(x) (ntohl(x))
> > +#define cpu_to_beushort(x) (htons(x))
> > +#define beushort_to_cpu(x) (ntohs(x))
> > +
> > /*
> > * Basic data structures for the directory cache
> > *
>
> What do the wrapper macros gain us?

Nothing much - I don't particularly care about them. I thought someone
might object to using htonl/ntohl directly.
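
For illustration only, here is a minimal sketch of what calling htonl/ntohl directly might look like for an on-disk header kept in big-endian order. The struct and function names are hypothetical and are not taken from the patch or from git's cache.h.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk header; the field names are illustrative only. */
struct disk_header {
    uint32_t signature;
    uint32_t entries;
};

/* Convert host-order values to big-endian before they are written out. */
static void header_to_disk(struct disk_header *out,
                           uint32_t signature, uint32_t entries)
{
    assert(out != NULL);
    out->signature = htonl(signature);
    out->entries = htonl(entries);
}

/* Convert the big-endian on-disk entry count back to host order. */
static uint32_t header_entries(const struct disk_header *hdr)
{
    assert(hdr != NULL);
    return ntohl(hdr->entries);
}
```

On a big-endian host both calls are no-ops, so the same source should work unchanged on either byte order.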

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-13 19:24:13

by H. Peter Anvin

Subject: Re: [ANNOUNCE] git-pasky-0.3

Russell King wrote:
>
> Nothing much - I don't particularly care about them. I thought someone
> might object to using htonl/ntohl directly.
>

Why would they?

-hpa

2005-04-13 20:45:18

by Matt Mackall

Subject: Re: Re: more git updates..

On Tue, Apr 12, 2005 at 06:10:27PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 13 Apr 2005, Andrea Arcangeli wrote:
> >
> > I wasn't suggesting to use CVS. I meant that for a newly developed SCM,
> > the CVS/SCCS format as storage may be more appealing than the current
> > git format.
>
> Go wild. I did mine in six days, and you've been whining about other
> peoples SCM's for three years.

I wrote a hack to do efficient delta storage with O(1) seeks for
lookup and append last week; I believe it's been integrated into the
latest Bazaar-NG. I expect it'll give better compression and
performance than BK. Of course it ends up being O(revisions) for
modifications or insertions (but that is probably a non-issue for the
SCM models we're looking at).

The git model is obviously very different, but I worry about the slop
space implied. With 200k file revisions and an average of 2k slop per
file, that's 400MB of slop, or almost the size of an equivalent delta
compressed kernel repo.

Now if you can assume that blobs never change and are never deleted,
you can simply append them all onto a log, and then index them with a
separate file containing an htree of (sha1, offset, length) or the
like. Since the key is already a strong hash, this is an excellent
match and avoids rehashing in the kernel's directory lookup. And it'll
save an inode, a directory entry, and about half a data block per
entry. "Open" will also be cheaper as there's no per-revision inode to
grab.
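
A rough sketch of the idea, under stated assumptions: the log lives in memory rather than in a file, and a sorted array with binary search stands in for the htree. All names here (blob_store, store_append, store_lookup) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SHA1_LEN 20

/* One index record: where a blob lives inside the append-only log. */
struct index_entry {
    unsigned char sha1[SHA1_LEN];
    uint64_t offset;
    uint64_t length;
};

struct blob_store {
    unsigned char *log;         /* append-only log (in memory here) */
    uint64_t log_size;
    struct index_entry *index;  /* kept sorted by sha1; stand-in for an htree */
    size_t count;
};

/* Binary search: returns the slot where sha1 is, or where it would go. */
static size_t index_find(const struct blob_store *s,
                         const unsigned char *sha1, int *found)
{
    size_t lo = 0, hi = s->count;

    *found = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(s->index[mid].sha1, sha1, SHA1_LEN);
        if (cmp == 0) {
            *found = 1;
            return mid;
        }
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Append a blob; returns 0 on success (or if already present), -1 on OOM. */
static int store_append(struct blob_store *s, const unsigned char *sha1,
                        const void *data, uint64_t len)
{
    int found;
    size_t pos = index_find(s, sha1, &found);

    assert(data != NULL || len == 0);
    if (found)
        return 0;  /* blobs never change, so nothing to do */

    size_t need = (size_t)(s->log_size + len);
    unsigned char *log = realloc(s->log, need ? need : 1);
    if (!log)
        return -1;
    s->log = log;

    struct index_entry *idx = realloc(s->index, (s->count + 1) * sizeof *idx);
    if (!idx)
        return -1;
    s->index = idx;

    /* Open a gap at pos so the index stays sorted by sha1. */
    memmove(&idx[pos + 1], &idx[pos], (s->count - pos) * sizeof *idx);
    memcpy(idx[pos].sha1, sha1, SHA1_LEN);
    idx[pos].offset = s->log_size;
    idx[pos].length = len;
    s->count++;

    if (len)
        memcpy(s->log + s->log_size, data, (size_t)len);
    s->log_size += len;
    return 0;
}

/* Look a blob up by name; returns NULL if it is not in the store. */
static const unsigned char *store_lookup(const struct blob_store *s,
                                         const unsigned char *sha1,
                                         uint64_t *len)
{
    int found;
    size_t pos = index_find(s, sha1, &found);

    if (!found)
        return NULL;
    *len = s->index[pos].length;
    return s->log + s->index[pos].offset;
}
```

A real version would write the log and index to disk and would need to handle a partially written tail after a crash; this only shows the (sha1, offset, length) bookkeeping.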

I could hack on this if you think it fits with the git model,
otherwise I'll go back to my other experiments..

--
Mathematics is the supreme nostalgia of our time.

2005-04-13 23:42:19

by Krzysztof Halasa

Subject: Re: more git updates..

Matt Mackall <[email protected]> writes:

> Now if you can assume that blobs never change and are never deleted,
> you can simply append them all onto a log, and then index them with a
> separate file containing an htree of (sha1, offset, length) or the
> like.

That means a problem with rsync, though.

BTW: I think the bandwidth increase compared to bkcvs isn't that obvious.
After a file is modified with git, it has to be transmitted (plus
small additional things).
If a file is modified with bkcvs, it has to be transmitted (the whole
RCS file) as well.

Only the initial rsync would be much smaller with bkcvs.
--
Krzysztof Halasa

2005-04-14 00:16:29

by Matt Mackall

Subject: Re: more git updates..

On Thu, Apr 14, 2005 at 01:42:11AM +0200, Krzysztof Halasa wrote:
> Matt Mackall <[email protected]> writes:
>
> > Now if you can assume that blobs never change and are never deleted,
> > you can simply append them all onto a log, and then index them with a
> > separate file containing an htree of (sha1, offset, length) or the
> > like.
>
> That means a problem with rsync, though.

I believe 200k inodes is a problem for rsync too. But we can simply
grab the remote htree, do a tree compare, find the ranges of the
remote file we need, sort and merge the ranges, and then pull them.
That will surely trounce rsync.
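
The "sort and merge the ranges" step might look roughly like this. It is a sketch only; struct range and merge_ranges are made-up names, and the ranges are assumed to be (offset, length) pairs into the remote log file.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A byte range of the remote log file that we still need to fetch. */
struct range {
    uint64_t offset;
    uint64_t length;
};

static int range_cmp(const void *a, const void *b)
{
    const struct range *x = a, *y = b;

    if (x->offset < y->offset)
        return -1;
    if (x->offset > y->offset)
        return 1;
    return 0;
}

/*
 * Sort the ranges by offset and coalesce any that overlap or touch,
 * in place. Returns the number of ranges left after merging.
 */
static size_t merge_ranges(struct range *r, size_t n)
{
    size_t out = 0;

    assert(r != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(r, n, sizeof *r, range_cmp);
    for (size_t i = 1; i < n; i++) {
        uint64_t end = r[out].offset + r[out].length;

        if (r[i].offset <= end) {
            /* Overlapping or adjacent: extend the current range. */
            uint64_t iend = r[i].offset + r[i].length;
            if (iend > end)
                r[out].length = iend - r[out].offset;
        } else {
            r[++out] = r[i];
        }
    }
    return out + 1;
}
```

Each merged range can then be pulled with a single request, so blobs that sit next to each other in the log cost one round trip instead of several.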

--
Mathematics is the supreme nostalgia of our time.

2005-04-14 02:53:54

by bd

Subject: Re: [ANNOUNCE] git-pasky-0.3

Xavier Bestel wrote:
> On Wednesday 13 April 2005 at 10:25 +0100, David Woodhouse wrote:
>
>>On Wed, 2005-04-13 at 10:59 +0200, Petr Baudis wrote:
>>
>>>Theoretically, you are never supposed to share your index if you work
>>>in a fully git environment.
>>
>>Maybe -- if we are prepared to propagate the BK myth that network
>>bandwidth and disk space are free.
>
>
> On a related note, maybe kernel.org should host .torrent files (and
> serve them) for the kernel git repository. That would ease the pain.

Bittorrent does not lend itself well to frequently-changing files or
collections thereof - each time the git repository is updated, a new
metadata file would need to be created and distributed, and you'd lose
all the seeds who don't bother to get the new one every time it changes.
Moreover, I imagine some clients would have problems with more than 900
or so files due to open file limits.

2005-04-20 21:08:30

by kaih

Subject: Re: more git updates..

[email protected] (H. Peter Anvin) wrote on 11.04.05 in <[email protected]>:

> Followup to: <[email protected]>
> By author: Christopher Li <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > There is one problem though. What about SHA1 hash collisions?
> > Even though the chance is very remote, you don't want to lose data due
> > to a "software" error. I think it is OK not to handle that
> > case right now. On the other hand, it would be nice to detect it
> > and give out a big error message if it really happens.
> >
>
> If you're actually worried about it, it'd be better to just use a
> different hash, like one of the SHA-2's (probably a better choice
> anyway), instead of SHA-1.

How could that help? *Every* hash has hash collisions. It's an unavoidable
result of using fewer bits than the original data has.

MfG Kai

2005-04-24 00:42:47

by Paul Jackson

Subject: Re: more git updates..

> It's an unavoidable
> result of using fewer bits than the original data has.

Even _not_ using a hash will have collisions - copy different globs of
data around enough, and sooner or later, two globs that started out
different will end up the same, due to errors in our computers. Even
ECC on all the buses, channels, and memory will just reduce this chance.

There is no mathematical perfection obtainable here. Deal with it.

Computers are about engineering, not philosophical perfection.

If something is likely to happen less than once in a billion years,
then for all practical purposes, it won't happen.
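
As a back-of-the-envelope check, the usual birthday approximation puts the chance of any collision among n digests of b bits at roughly n^2 / 2^(b+1). The sketch below works in log2 to avoid underflow. It assumes the digests behave like uniformly random values, which is an assumption and not something the hash guarantees for real source files; the function name is made up.

```c
#include <assert.h>

/*
 * log2 of the approximate probability that at least two out of
 * 2^log2_objects uniformly random hash_bits-bit digests collide:
 *
 *     p ~= n^2 / 2^(bits + 1)   =>   log2(p) ~= 2*log2(n) - 1 - bits
 *
 * Only meaningful while the result is well below zero.
 */
static double log2_collision_probability(double log2_objects,
                                         unsigned hash_bits)
{
    assert(hash_bits > 0);
    return 2.0 * log2_objects - 1.0 - (double)hash_bits;
}
```

For example, about 2^30 (roughly a billion) objects with a 160-bit digest gives log2(p) of about -101 under that assumption.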

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-24 01:29:50

by Bernd Eckenfels

Subject: Re: more git updates..

In article <[email protected]> you wrote:
> If something is likely to happen less than once in a billion years,
> then for all practical purposes, it won't happen.

Of course there are colliding files already available and easily
generated. So a malicious attack is already possible.

Which is especially nasty because one can prove that the git object file
system is broken. However, I don't think it will ever be a problem for
Linux source control purposes.

However, using a combined hash might be a good idea here. That way you
silence the critics, since they have no exploit samples handy. :) Or at
least go with FIPS 180-2.

Greetings
Bernd

2005-04-24 04:13:40

by Paul Jackson

Subject: Re: more git updates..

Bernd wrote:
> Of course there are colliding files already available and easily
> generated. So a malicious attack is already possible.

I don't believe you. Reference?

> Or at least go with FIPS 180-2.

FIPS 180-2 specifies four secure hash algorithms - SHA-1, SHA-256,
SHA-384, and SHA-512. We're using SHA-1.

I think you meant go with SHA-256, which is new in FIPS 180-2. FIPS
180-1 only had SHA-1. FIPS 180-2 superseded FIPS 180-1, adding the
three algorithms SHA-256, SHA-384, and SHA-512.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-24 04:38:20

by Bernd Eckenfels

Subject: Re: more git updates..

On Sat, Apr 23, 2005 at 09:13:26PM -0700, Paul Jackson wrote:
> I don't believe you. Reference?

I had MD5 in mind, sorry. I haven't seen the SHA-1 collision samples yet.
However, they are likely to be available soon. (A simple pair of two files
will be enough to cause "theoretical" problems.) However, I think it would be
possible to detect collisions on add and append sequence numbers... ugly.
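
Detecting a collision on add could be as simple as comparing bytes when an object with the same name already exists. This is a hypothetical sketch, not what git does; classify_add and the enum are invented for the example, and the caller is assumed to have already read the existing object (if any) into memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum add_result {
    ADD_NEW,        /* no object stored under this name yet */
    ADD_DUPLICATE,  /* same name, same bytes: nothing to do */
    ADD_COLLISION   /* same name, different bytes: refuse loudly */
};

/*
 * existing is the content already stored under the incoming object's
 * hash, or NULL if nothing is stored under that name.
 */
static enum add_result classify_add(const void *existing, size_t existing_len,
                                    const void *incoming, size_t incoming_len)
{
    assert(incoming != NULL || incoming_len == 0);

    if (existing == NULL)
        return ADD_NEW;
    if (existing_len == incoming_len &&
        (incoming_len == 0 || memcmp(existing, incoming, incoming_len) == 0))
        return ADD_DUPLICATE;
    return ADD_COLLISION;
}
```

The cost is one extra read and compare whenever an object is added again, which only happens for names that are already present.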

> > Or at least go with FIPS 180-2.
>
> FIPS 180-2 specifies four secure hash algorithms - SHA-1, SHA-256,
> SHA-384, and SHA-512. We're using SHA-1.

Yes, I was referring to the longer versions (aka SHA-2), since FIPS tries to
phase out the 160-bit version (by 2010).

Anyway, I know we don't need to discuss this. I just wanted to point out that
in practical usage as a source repository we might not see problems, but that
does not mean that there aren't some that can already be provoked. Personally,
I see the advantage of the "stateless" hash approach over more correct
stateful approaches like BK.

Greetings
Bernd

2005-04-24 04:54:36

by Paul Jackson

Subject: Re: more git updates..

> I had MD5 in mind, sorry.

That's what I suspected.

> Anyway, I know we don't need to discuss this,

Agreed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-04-24 08:38:32

by kaih

Subject: Re: more git updates..

[email protected] (Paul Jackson) wrote on 23.04.05 in <[email protected]>:

> > It's an unavoidable
> > result of using fewer bits than the original data has.
>
> Even _not_ using a hash will have collisions - copy different globs of
> data around enough, and sooner or later, two globs that started out
> different will end up the same, due to errors in our computers. Even
> ECC on all the buses, channels, and memory will just reduce this chance.

Umm, the whole point of using a digest for the name is to catch these
things as they happen. So if you'd use the whole original bit sequence as
a name, you'd need to have exactly the same bit errors in the data, in the
name, and in the reference to the object, to miss noticing the problem
early. And it *still* isn't a collision - the data behind name X is
exactly X, always, or it's easily recognizable as broken.

Whereas a hash collision means that both X and Y should be behind name Z.
Both are *correct* behind name Z.

Entirely different situations.

> There is no mathematical perfection obtainable here. Deal with it.

Actually, there is, and your non-hashed name system achieves it.

> If something is likely to happen less than once in a billion years,
> then for all practical purposes, it won't happen.

If that was a truly random thing, then you might have been right. But it
isn't. All possible blobs for a given digest are NOT equally probable (or
of a probability depending only on their size).

We really, really don't know how likely a collision is for the data we
want to store there - just for truly random data.

MfG Kai

2005-04-24 08:38:23

by kaih

Subject: Re: more git updates..

[email protected] (Felipe Alfaro Solana) wrote on 21.04.05 in <[email protected]>:

> On 20 Apr 2005 22:29:00 +0200, Kai Henningsen <[email protected]> wrote:
> > > If you're actually worried about it, it'd be better to just use a
> > > different hash, like one of the SHA-2's (probably a better choice
> > > anyway), instead of SHA-1.
> >
> > How could that help? *Every* hash has hash collisions. It's an unavoidable
> > result of using fewer bits than the original data has.
>
> SHA-2 allows for 256, 384 and 512-bit hashes, which provides greater
> resistance to collisions.

So? It's still finite.

MfG Kai

2005-04-25 02:44:49

by Horst H. von Brand

Subject: Re: more git updates..

Bernd Eckenfels <[email protected]> said:
> In article <[email protected]> you wrote:
> > If something is likely to happen less than once in a billion years,
> > then for all practical purposes, it won't happen.

> Of course there are colliding files already available and easily
> generated. So a malicious attack is already possible.

Care to share some? Of what you are smoking, that is... pretty potent stuff.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2005-04-25 12:26:54

by Theodore Ts'o

Subject: Re: more git updates..

On Sun, Apr 24, 2005 at 06:38:13AM +0200, Bernd Eckenfels wrote:
> On Sat, Apr 23, 2005 at 09:13:26PM -0700, Paul Jackson wrote:
> > I don't believe you. Reference?
>
> I had MD5 in mind, sorry. I haven't seen the SHA-1 collision samples yet.
> However, they are likely to be available soon. (A simple pair of two files
> will be enough to cause "theoretical" problems.) However, I think it would be
> possible to detect collisions on add and append sequence numbers... ugly.

The MD5 collision samples are for two 16 byte inputs which when run
through the MD5 algorithm, result in the same 128-bit hash. The SHA-1
collision samples are for two 20 byte inputs which when run through
the SHA algorithm create the same 160-bit hash. In neither case will
the inputs be valid git objects, nor anything approaching ASCII text,
let alone valid C files.

So what theoretical problems will be caused by this? Sure, an
attacker can check a garbage file containing (apparently) random bytes
into git, and then produce another garbage file containing some
completely other (apparently) random bytes which will collide with the
first garbage file.

You want to explain how this is going to cause problems in the git
systems? And even if you can describe any problems, you want to
explain why any such theoretical problems couldn't be trivially
detected and fixed?

- Ted

2005-04-25 16:47:09

by daw

Subject: Re: more git updates..

Theodore Ts'o wrote:
>The MD5 collision samples are for two 16 byte inputs which when run
>through the MD5 algorithm, result in the same 128-bit hash. The SHA-1
>collision samples are for two 20 byte inputs which when run through
>the SHA algorithm create the same 160-bit hash.

There are no known SHA-1 collision samples.
(There are collision samples for MD5, and for SHA-0, but not for SHA-1.)

2005-04-25 20:35:56

by Bernd Eckenfels

Subject: Re: more git updates..

On Mon, Apr 25, 2005 at 07:57:50AM -0400, Theodore Ts'o wrote:
> You want to explain how this is going to cause problems in the git
> systems?

No, because I explained that it does not cause problems.

Greetings
Bernd

BTW: do you have a link to the SHA-1 collisions?
--
(OO) -- Bernd_Eckenfels@M?rscher_Strasse_8.76185Karlsruhe.de --
( .. ) ecki@{inka.de,linux.de,debian.org} http://www.eckes.org/
o--o 1024D/E383CD7E eckes@IRCNet v:+497211603874 f:+497211606754
(O____O) When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl!