2005-04-11 22:41:09

by piotr

[permalink] [raw]
Subject: Call to atention about using hash functions as content indexers (SCM saga)

Hi

I had a quick look at the source of GIT tonight, I'd like to warn you
about the use of hash functions as content indexers.

As probably you are aware, hash functions such as SHA-1 are surjective not
bijective (1-to-1 map), so they have collisions. Here one can argue
about the low probability of such a collision, I won't get into
subjetive valorations of what probability of collision is tolerable for
me and what is not.

I my humble opinion, choosing deliberately, as a design decision, a
method such as this one, that in some cases could corrupt data in a
silent and very hard to detect way, is not very good. One can also argue
that the probability of data getting corrupted in the disk, or whatever
could be higher than that of the collision, again this is not valid
comparison, since the fact that indexing by hash functions without
additional checking could make data corruption legal between the
reasonable working parameters of the program is very dangerous.

And by the way, I've read
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node15.html

and I don't agree with it. Asides from the "avalanche effect" the
properties of a good hash function is that distributes well the entropy
between the output bits, so probably, although the inputs are not
random, the pdf change in the outputs would be small, otherwise it would
be easier to find collision by differential or statistical methods.

Last, hash functions should be only used to check data integrity, and
that's the reason of the "avalanche effect", so even single bit errors
are propagated and it's effect is very noticeable at the output.

Regards.

--
Pedro Larroy Tovar | Linux & Network consultant | pedro%larroy.com
Make debian mirrors with debian-multimirror: http://pedro.larroy.com/deb_mm/
* Las patentes de programaci?n son nocivas para la innovaci?n *
http://proinnova.hispalinux.es/


2005-04-11 22:51:54

by Petr Baudis

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
where Pedro Larroy <[email protected]> told me that...
> Hi

Hello,

> I had a quick look at the source of GIT tonight, I'd like to warn you
> about the use of hash functions as content indexers.
>
> As probably you are aware, hash functions such as SHA-1 are surjective not
> bijective (1-to-1 map), so they have collisions. Here one can argue
> about the low probability of such a collision, I won't get into
> subjetive valorations of what probability of collision is tolerable for
> me and what is not.
>
> I my humble opinion, choosing deliberately, as a design decision, a
> method such as this one, that in some cases could corrupt data in a
> silent and very hard to detect way, is not very good. One can also argue
> that the probability of data getting corrupted in the disk, or whatever
> could be higher than that of the collision, again this is not valid
> comparison, since the fact that indexing by hash functions without
> additional checking could make data corruption legal between the
> reasonable working parameters of the program is very dangerous.

(i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.

(ii) In git-pasky, there's (turnable off) detection of collisions.

(iii) Your argument against comparing with the probability of a hardware
error does not make sense to me.

(iv) You fail to propose a better solution.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
98% of the time I am right. Why worry about the other 3%.

2005-04-11 23:23:05

by Magnus Damm

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

On 4/12/05, Petr Baudis <[email protected]> wrote:

> (iv) You fail to propose a better solution.

I would feel safer with back end storage filenames based on email and
mtime together with an optional hash lookup that turns collisions into
worse performance. But that's just me.

/ magnus

2005-04-12 10:01:51

by Barry K. Nathan

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

On Tue, Apr 12, 2005 at 12:51:39AM +0200, Petr Baudis wrote:
> Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
> where Pedro Larroy <[email protected]> told me that...
[snip...]
> (iii) Your argument against comparing with the probability of a hardware
> error does not make sense to me.

I didn't understand it either, at first, but then I read this:

http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/node9.html

Whether this is serious enough to actually worry about is another
question...

-Barry K. Nathan <[email protected]>

2005-04-12 10:03:53

by Catalin Marinas

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

Magnus Damm <[email protected]> wrote:
> On 4/12/05, Petr Baudis <[email protected]> wrote:
>
>> (iv) You fail to propose a better solution.
>
> I would feel safer with back end storage filenames based on email and
> mtime together with an optional hash lookup that turns collisions into
> worse performance. But that's just me.

Have a look at bazaar-ng (http://www.bazaar-ng.org/), they seem to do
this. I had a quick look at it and they also seem to store the full
files when they change (similar to git). It is also a bit ahead of git
(started earlier) and looks quite promising.

--
Catalin

2005-04-12 12:12:25

by linux-os (Dick Johnson)

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

On Mon, 11 Apr 2005, Petr Baudis wrote:

> Dear diary, on Tue, Apr 12, 2005 at 12:40:21AM CEST, I got a letter
> where Pedro Larroy <[email protected]> told me that...
>> Hi
>
> Hello,
>
>> I had a quick look at the source of GIT tonight, I'd like to warn you
>> about the use of hash functions as content indexers.
>>
>> As probably you are aware, hash functions such as SHA-1 are surjective not
>> bijective (1-to-1 map), so they have collisions. Here one can argue
>> about the low probability of such a collision, I won't get into
>> subjetive valorations of what probability of collision is tolerable for
>> me and what is not.
>>
>> I my humble opinion, choosing deliberately, as a design decision, a
>> method such as this one, that in some cases could corrupt data in a
>> silent and very hard to detect way, is not very good. One can also argue
>> that the probability of data getting corrupted in the disk, or whatever
>> could be higher than that of the collision, again this is not valid
>> comparison, since the fact that indexing by hash functions without
>> additional checking could make data corruption legal between the
>> reasonable working parameters of the program is very dangerous.
>
> (i) 1461501637330902918203684832716283019655932542976 possible SHA1 hashes.
>
> (ii) In git-pasky, there's (turnable off) detection of collisions.
>
> (iii) Your argument against comparing with the probability of a hardware
> error does not make sense to me.
>
> (iv) You fail to propose a better solution.
>
> --
> Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> 98% of the time I am right. Why worry about the other 3%.

This is a standard, free (no patents) hash-function that works.
The fact that somebody can write a program, designed to create
collisions, and demonstrate that after many weeks of processing,
iteratively building upon the previous result, and finally
cause a collision, really isn't relevant for this application.

There is a good possibility that there are no hash-functions
that can't be broken.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.11 on an i686 machine (5537.79 BogoMips).
Notice : All mail here is now cached for review by Dictator Bush.
98.36% of all statistics are fiction.

2005-04-12 15:32:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

On Tue, Apr 12, 2005 at 12:40:21AM +0200, Pedro Larroy wrote:
>
> I had a quick look at the source of GIT tonight, I'd like to warn you
> about the use of hash functions as content indexers.
>
> As probably you are aware, hash functions such as SHA-1 are surjective not
> bijective (1-to-1 map), so they have collisions. Here one can argue
> about the low probability of such a collision, I won't get into
> subjetive valorations of what probability of collision is tolerable for
> me and what is not.
>
> I my humble opinion, choosing deliberately, as a design decision, a
> method such as this one, that in some cases could corrupt data in a
> silent and very hard to detect way, is not very good.

Actually, it will very likely be very, very easy to detect. What
happens if there is a hash collision? Well, in the case of a
non-malicious collision, instead of retrieving the correct version of
a file, we will get some random version of another file. The moment
you try to compile with that incorrect file, you will with an
extremely high probability, get a failed compile, which will be
blantently obvious.

In the case of a malicous attacker trying to introduce a collision, it
should be noted first of all that the recent SHA-1 breakage was a
collision attack, not a pre-image attack. So it's not useful for
trying to find another message that hashes to the same value as a one
already in use by git. So the work factor is still 2**80. Secondly,
even if an attacker could generate another file which has the same
hash as a pre-existing file, it still has to look like a valid git
object, and it still has to be a valid C file or it will again be
blatently obvious when you try to compile the sucker.

> One can also argue
> that the probability of data getting corrupted in the disk, or whatever
> could be higher than that of the collision, again this is not valid
> comparison

That's a matter of some religious dispute. You can always reduce the
probability of a collsion down to an arbitrarily small value simply by
using a larger hash --- and switch hashes in git is quite simple since
it would just be a matter of running a program to calculate the hash
using a different algorithm, and renaming the files. You can even use
hardlinks if you want to support two different hash algorithms
simultaneously during some transition period.

So past a certain point, there is a probability that all of molecules
of oxygen in the room will suddenly migrate outdoors, and you could
suffocate. Is it rational to spend time worrying about that
possibility? I'll leave that for you to decide.

- Ted

2005-04-12 16:42:39

by Eric Rannaud

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)


Simply put, the best known attack of SHA-1 takes 2^69 hash operations.
( http://www.schneier.com/blog/archives/2005/02/sha1_broken.html )
The attack is still only an unpublished paper and has not yet been
implemented. An attack is: you try as hard as you can to find a collision
between two arbitrary messages (i.e. two arbitrary --and nonsensical--
source files).
In the context of git, a better estimation would be the number of hash
operations needed to find a message that has the same hash than a given
fixed message (e.g. mm/memory.c). This is more like 2^100 hash
operations. And if a collision is found, this is very likely using a
message that *doesn't* look like a C source file...

Moreover, no example of collision is known, AFAIK.

In other words: this won't happen.

Best,
/er.

2005-04-14 08:30:45

by Andy Isaacson

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

On Tue, Apr 12, 2005 at 06:35:49PM +0200, Eric Rannaud wrote:
> Simply put, the best known attack of SHA-1 takes 2^69 hash operations.
> ( http://www.schneier.com/blog/archives/2005/02/sha1_broken.html )
> The attack is still only an unpublished paper and has not yet been
> implemented. An attack is: you try as hard as you can to find a collision
> between two arbitrary messages (i.e. two arbitrary --and nonsensical--
> source files).

Well, don't be quite so confident. Yes, the attacks generally require a
significant uncontrollable chunk of data, but it's easy enough to encode
that in (for example) a comment.

And remember, attacks always get better, they never get worse.

Remaining cautious about the strength of your hash functions is a good
idea.

Especially because *nobody* has a real theory of operation behind online
hash functions. The stories I've heard imply that even NSA doesn't
really "do" hash functions; they prefer HMAC-style keyed verifiers
(though of course, we on the outside can never be sure what's happening
in the puzzle palace.)

Every (practical) hash function in existence today is of the form "Well,
I mixed stuff up real good, and my friends and I tried and couldn't
break it". And 160 bits still looks *fairly* secure, but so did 64-bit
symmetric key back in '93.

> In the context of git, a better estimation would be the number of hash
> operations needed to find a message that has the same hash than a given
> fixed message (e.g. mm/memory.c). This is more like 2^100 hash
> operations. And if a collision is found, this is very likely using a
> message that *doesn't* look like a C source file...

In particular, your defense here is specious. I agree that second
preimage is an unmanagably large problem for SHA1 for the forseeable
future (say, 8 years out), but collision results almost always result in
partially-controlled attack text. So I can choose my nefarious content,
and encode the collision entropy in non-operative text (a C comment, for
example).

When your tool breaks, replace it with a new one. Don't say "well, the
jagged bits haven't hurt me *yet*."

-andy

2005-04-14 13:35:30

by Eric Rannaud

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

On Thu, 2005-04-14 at 01:30 -0700, Andy Isaacson wrote:
> In particular, your defense here is specious. I agree that second
> preimage is an unmanagably large problem for SHA1 for the forseeable
> future (say, 8 years out), but collision results almost always result in
> partially-controlled attack text. So I can choose my nefarious content,
> and encode the collision entropy in non-operative text (a C comment, for
> example).

Everything you said is fair enough, and I do agree with this argument in
the context of authentication: that is if you expect an attack (even if
a chunk of non-operative bytes won't go unnoticed for long in a the
kernel tree, but that's not the point).

But the original post of Pedro was, I think, about collisions occurring
'randomly' between two genuine modifications of a source file. In
particular, the paper that was linked to by Pedro is concerned with this
kind of unexpected collisions, i.e. about integrity in normal operation
and not about security (btw, this paper seems kind of bogus to me
because it never specifies the hash function in use).

I took the example of this attack against SHA-1 only to illustrate that
*even* if you try to find a collision, you just can't find one (at least
these days).
>From which I think it is fair to say that such a collision won't happen
between two different versions of the same file, during the normal
process of revision.

I mean, if in the normal revision process of the kernel, a SHA-1
collision was to be found, by chance, who gets to publish the paper at
the next Crypto conference?

However, if we consider the case of an attack, well, that's different.
And you're right.

/er.
--
http://www.eleves.ens.fr/home/rannaud/

2005-04-14 15:55:06

by Eric D. Mudama

[permalink] [raw]
Subject: Re: Call to atention about using hash functions as content indexers (SCM saga)

I hold my breath for weeks at a time, just incase something like this
happens! I thought I was the only one!

On 4/12/05, Theodore Ts'o <[email protected]> wrote:
> So past a certain point, there is a probability that all of molecules
> of oxygen in the room will suddenly migrate outdoors, and you could
> suffocate. Is it rational to spend time worrying about that
> possibility? I'll leave that for you to decide.
>
> - Ted