Recently, I saw a slashdot article that pointed to a site dedicated to
breaking MD5. That is, so far, no one has found any two differing
string which have the same MD5 cksum. Logically, however, there WILL be
collisions for any strings longer than the MD5 cksum itself -- we just
haven't found any. Well, there's some sort of contest where you can win
money for breaking MD5 (I think).
Even further back, there was an LKML discussion about various sorts of
compressing file systems. One of the subthreads discussed identifying
identical blocks (using MD5) and pointing them at the same physical
block on disk. Naturally, if there WERE two blocks with the same MD5,
we'd want to check the raw data, just to be sure that there wasn't a
false positive.
Think about it! If we had a filesystem that actually DID this, and it
was in the Linux kernel, it would spread far and wide. It's bound to
happen that someone will identify a collision. We then report that to
the committee offering the reward and then donate it to OSDL to help
Linux development.
Yeah, I know... I'm a dork.
On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <[email protected]> said:
> Think about it! If we had a filesystem that actually DID this, and it
> was in the Linux kernel, it would spread far and wide. It's bound to
> happen that someone will identify a collision. We then report that to
> the committee offering the reward and then donate it to OSDL to help
> Linux development.
Actually, it's *not* "bound to happen". Figure out the number of blocks you'd
need to have even a 1% chance of a birthday collision in a 2**128 space.
And you'd need that many disk blocks on *a single system*.
Then figure out the chances of a collision on a small machine that only has 20
or 30 terabytes (yes, in this case terabytes is small).
The whole reason "use MD5 as a check for identical blocks" is useful is because
the chances of *that* going wrong are vanishingly small compared to the chances
that a memory stick will throw an undetected multiple-bit error, or that a RAID
controller will write blocks to the wrong disks, or any number of other things
that *do* happen in real life, but rarely enough that we don't bother writing
code to defend against them.
[email protected] wrote:
> On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <[email protected]> said:
>
>
>>Think about it! If we had a filesystem that actually DID this, and it
>>was in the Linux kernel, it would spread far and wide. It's bound to
>>happen that someone will identify a collision. We then report that to
>>the committee offering the reward and then donate it to OSDL to help
>>Linux development.
>
>
> Actually, it's *not* "bound to happen". Figure out the number of blocks you'd
> need to have even a 1% chance of a birthday collision in a 2**128 space.
>
> And you'd need that many disk blocks on *a single system*.
>
> Then figure out the chances of a collision on a small machine that only has 20
> or 30 terabytes (yes, in this case terabytes is small).
Certainly. No one machine is going to find it in a reasonable period.
OTOH, if a million machines were doing it, it increases the chances by
just that much.
On Friday 16 January 2004 21:59, Timothy Miller wrote:
> [email protected] wrote:
> > On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <[email protected]>
said:
> >>Think about it! If we had a filesystem that actually DID this, and it
> >>was in the Linux kernel, it would spread far and wide. It's bound to
> >>happen that someone will identify a collision. We then report that to
> >>the committee offering the reward and then donate it to OSDL to help
> >>Linux development.
> >
> > Actually, it's *not* "bound to happen". Figure out the number of blocks
> > you'd need to have even a 1% chance of a birthday collision in a 2**128
> > space.
> >
> > And you'd need that many disk blocks on *a single system*.
> >
> > Then figure out the chances of a collision on a small machine that only
> > has 20 or 30 terabytes (yes, in this case terabytes is small).
>
> Certainly. No one machine is going to find it in a reasonable period.
> OTOH, if a million machines were doing it, it increases the chances by
> just that much.
Let's take a look at the chances. 30 terabytes is, in a best-case scenario
(with 512-byte blocks) about 6e10 blocks. That would be roughly
6e10*6e10*(2**(-128)), or about 1e-17. With a hundred million machines, the
chances of a collision would be about 1e-9, disregarding the fact that all
these machines have a large chance of containing similar blocks -- their data
isn't truly random, so some blocks have a larger chance of occurring than
others. The data sets on the machines are probably reasonably static, so if
the collision isn't found *at once* the chances of it occurring later are
much smaller. So, even under the most positive assumptions, with a hundred
million machines with 30 terabytes of storage each, it's extremely probable
that you won't find a collision. (A 96-bit hash could have been broken with
this setup however. :) )
-- Bart
(re: md5 weakness)
The only document I've seen with a
rigorous demonstration of the
possibility of an md5 collision
created it by adding 0 (zero) bytes
to an input (so the colliding inputs
were not the same size in bytes).
Good luck finding a collision with
blocks that are all the same size.
Anyway, hash matching algorithms for
variable sized inputs (hashed extents,
etc) can probably get an additional several
orders of magnitude of safety by using
two hashes (md5 and sha1, for example).
What are the chances that the same two
different inputs that hash to the same
value using one of them collides in the
other, too? ("Left as an exercise for the ...")
Regards,
Clayton Weaver
<mailto: [email protected]>
--
___________________________________________________________
Sign-up for Ads Free at Mail.com
http://promo.mail.com/adsfreejump.htm
On Sat, Jan 17, 2004 at 02:15:31PM +0100, Bart Samwel wrote:
> On Friday 16 January 2004 21:59, Timothy Miller wrote:
> > [email protected] wrote:
> > > On Fri, 16 Jan 2004 15:22:39 EST, Timothy Miller <[email protected]>
> said:
> > >>Think about it! If we had a filesystem that actually DID this, and it
> > >>was in the Linux kernel, it would spread far and wide. It's bound to
> > >>happen that someone will identify a collision. We then report that to
> > >>the committee offering the reward and then donate it to OSDL to help
> > >>Linux development.
> > >
> > > Actually, it's *not* "bound to happen". Figure out the number of blocks
> > > you'd need to have even a 1% chance of a birthday collision in a 2**128
> > > space.
> > >
> > > And you'd need that many disk blocks on *a single system*.
> > >
> > > Then figure out the chances of a collision on a small machine that only
> > > has 20 or 30 terabytes (yes, in this case terabytes is small).
> >
> > Certainly. No one machine is going to find it in a reasonable period.
> > OTOH, if a million machines were doing it, it increases the chances by
> > just that much.
>
> Let's take a look at the chances. 30 terabytes is, in a best-case scenario
> (with 512-byte blocks) about 6e10 blocks. That would be roughly
> 6e10*6e10*(2**(-128)), or about 1e-17. With a hundred million machines, the
> chances of a collision would be about 1e-9, disregarding the fact that all
> these machines have a large chance of containing similar blocks -- their data
> isn't truly random, so some blocks have a larger chance of occurring than
> others. The data sets on the machines are probably reasonably static, so if
> the collision isn't found *at once* the chances of it occurring later are
> much smaller. So, even under the most positive assumptions, with a hundred
> million machines with 30 terabytes of storage each, it's extremely probable
> that you won't find a collision. (A 96-bit hash could have been broken with
> this setup however. :) )
There is one fundemental braino in the discussion.
Only HALF the bits are for preventing "accidental" collisions. (The
"birthday" thing). The rest is for preventing to "brute force" an input
that produces the same MD5.(*)
So MD5 has only 2**64 Bits against accidental collsions
Btw. I already had (a/the) MD5 collision(*2) in my life.
So you'd need SHA256 or SHA512 to be "really sure(tm)".
*: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this is
also true for MD5.
*2: I had a direcory of about 1,5 Million images and "md5sum"med them to
eliminate doubles. The Log-file, at one point, had the same md5sum as
one of the pictures.
Bis denn
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
Matthias Schniedermeyer wrote:
> On Sat, Jan 17, 2004 at 02:15:31PM +0100, Bart Samwel wrote:
[...]
>>Let's take a look at the chances. 30 terabytes is, in a best-case scenario
>>(with 512-byte blocks) about 6e10 blocks. That would be roughly
>>6e10*6e10*(2**(-128)), or about 1e-17. With a hundred million machines, the
>>chances of a collision would be about 1e-9, disregarding the fact that all
>>these machines have a large chance of containing similar blocks -- their data
>>isn't truly random, so some blocks have a larger chance of occurring than
>>others. The data sets on the machines are probably reasonably static, so if
>>the collision isn't found *at once* the chances of it occurring later are
>>much smaller. So, even under the most positive assumptions, with a hundred
>>million machines with 30 terabytes of storage each, it's extremely probable
>>that you won't find a collision. (A 96-bit hash could have been broken with
>>this setup however. :) )
>
> There is one fundemental braino in the discussion.
>
> Only HALF the bits are for preventing "accidental" collisions. (The
> "birthday" thing). The rest is for preventing to "brute force" an input
> that produces the same MD5.(*)
From RFC-1321:
"It is conjectured that the difficulty of coming up with two messages
having the same message digest is on the order of 2^64 operations,
and that the difficulty of coming up with any message having a given
message digest is on the order of 2^128 operations."
So, they say it's on the order of 2^64 operations, whatever their
definition of "operation" is. It probably means that they think you need
to take the MD5 of something in the order of 2^64 random messages in
order to have a reasonable chance of finding a duplicate. The RFC says
nothing about half of the bits being for one purpose and the other for
another; in the algorithm all bits seem to be processed in a similar way.
Let's see where they got their 2^64. If you've got 2^64 messages, you've
got (with the same logic as above) approximately 2^64 * 2^64 * (1 /
2^128) = 100% chance of a birthday collision. This seems to support the
idea that the kind of analysis that I just did corresponds to the way
they look at it.
> *: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this
> is also true for MD5.
It might be that you took the "2^64 operations" as meaning "64 bits" (or
2^80 as 80 bits, in the SHA1 case), while it now seems to be the result
of a birthday-collision calculation. Not a surprising misinterpretation
BTW, because the RFC doesn't specify at all where they got this number.
> Btw. I already had (a/the) MD5 collision(*2) in my life.
> *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> eliminate doubles. The Log-file, at one point, had the same md5sum as
> one of the pictures.
Wow! It might be that MD5 is not such a good hash function after all
then. The security is of course purely based on the hashes of messages
being very randomly distributed. If they really were, the chances of
this happening to you would have been extremely slim (less than 1e-20).
I think the more probable explanation is that MD5 hashes aren't truly
randomly distributed after all. :)
-- Bart
Hi!
> There is one fundemental braino in the discussion.
>
> Only HALF the bits are for preventing "accidental" collisions. (The
> "birthday" thing). The rest is for preventing to "brute force" an input
> that produces the same MD5.(*)
>
> So MD5 has only 2**64 Bits against accidental collsions
> Btw. I already had (a/the) MD5 collision(*2) in my life.
>
> So you'd need SHA256 or SHA512 to be "really sure(tm)".
>
>
>
> *: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this is
> also true for MD5.
>
> *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> eliminate doubles. The Log-file, at one point, had the same md5sum as
> one of the pictures.
Do you have a copy? I believe *many* people would like to see that
one.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
Matthias Schniedermeyer wrote:
> *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> eliminate doubles. The Log-file, at one point, had the same md5sum as
> one of the pictures.
Something similar happened to me once. Two different files with the
same result from md5sum.
When I ran md5sum again, it still reported the same results.
Then when I flushed the page cache and ran it again, it reported
different results.
I concluded it was a rare page cache corruption heisenbug. Scary.
-- Jamie
On Thu, Jan 22, 2004 at 01:12:59AM +0100, Pavel Machek wrote:
> Hi!
>
> > There is one fundemental braino in the discussion.
> >
> > Only HALF the bits are for preventing "accidental" collisions. (The
> > "birthday" thing). The rest is for preventing to "brute force" an input
> > that produces the same MD5.(*)
> >
> > So MD5 has only 2**64 Bits against accidental collsions
> > Btw. I already had (a/the) MD5 collision(*2) in my life.
> >
> > So you'd need SHA256 or SHA512 to be "really sure(tm)".
> >
> >
> >
> > *: AFAIR i read this in the specs of SHA1 (160 bits). So i guess this is
> > also true for MD5.
> >
> > *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> > eliminate doubles. The Log-file, at one point, had the same md5sum as
> > one of the pictures.
>
> Do you have a copy? I believe *many* people would like to see that
> one.
Unfortunatly not, and reconstruction is impossibel(tm). "Back then(more
than half a year ago)" i didn't see that as important.
Bis denn
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.
On Thu, Jan 22, 2004 at 02:36:09AM +0000, Jamie Lokier wrote:
> Matthias Schniedermeyer wrote:
> > *2: I had a direcory of about 1,5 Million images and "md5sum"med them to
> > eliminate doubles. The Log-file, at one point, had the same md5sum as
> > one of the pictures.
>
> Something similar happened to me once. Two different files with the
> same result from md5sum.
>
> When I ran md5sum again, it still reported the same results.
>
> Then when I flushed the page cache and ran it again, it reported
> different results.
>
> I concluded it was a rare page cache corruption heisenbug. Scary.
I can 100% exclude Linux-Errors. The machine (still) runs with Solaris 8.
Bis denn
--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.