LinuxLists.cc - Where exactly will arch_fast

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

Hi Herbert,

On Do, 2014-12-04 at 16:11 +0800, Herbert Xu wrote:
> While working on rhashtable it came to me that this whole concept
> of arch_fast_hash is flawed. CRCs are linear functions so it's
> fairly easy for an attacker to identify collisions or at least
> eliminate a large amount of search space (e.g., controlling the
> last bit of the hash result is almost trivial, even when you add
> a random seed).
>
> So what exactly are we going to use arch_fast_hash for? Presumably
> it's places where security is never goint to be an issue, right?
>
> Even if security wasn't an issue, straight CRC32 has really poor
> lower-order bit distribution, which makes it a terrible choice for
> a hash table that simply uses the lower-order bits.

I wondered the same while trying to use arch_fast_hash in a lot more
places (I did a new implementation in assembler I'll send later on, it
is mostly optimized to deal with ovs flow keys).

While the uniformity of crc32 does actually look good and IMHO this even
holds for the lower bits of the hash, I totally agree on the linearity
matters.

The easiest way to make arch_fast_hash non-linear would be to build up
on the crc32 instruction like e.g. the cityhash function family does and
it seems not too hard to do that by combining two crc32c outputs of the
original and cyclic shifted input data. I have doubts if this is faster
than jhash in the end. There are proposals from Intel to do so, but they
are patent encumbered. :/

For most consumers in the networking stack, security and DoS resistence
is an issue. OVS, for which this was designed at first does do rehashing
from time to time, but still there is a possible DoS attack vector with
this hashing algorithm.

Bye,
Hannes

2014-12-04 13:15:13

by Daniel Borkmann

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On 12/04/2014 01:34 PM, Hannes Frederic Sowa wrote:
> On Do, 2014-12-04 at 16:11 +0800, Herbert Xu wrote:
>> While working on rhashtable it came to me that this whole concept
>> of arch_fast_hash is flawed. CRCs are linear functions so it's
>> fairly easy for an attacker to identify collisions or at least
>> eliminate a large amount of search space (e.g., controlling the
>> last bit of the hash result is almost trivial, even when you add
>> a random seed).
>>
>> So what exactly are we going to use arch_fast_hash for? Presumably
>> it's places where security is never goint to be an issue, right?

The original proposal [1] targeted ovs-only as a closed-door user in
order to speed up the worst case of calculating a hash over the extracted
flow key, that is, struct sw_flow_key (which nowadays consumes up to
7 cachelines on x86_64 ...).

[1] http://thread.gmane.org/gmane.linux.network/293981/

>> Even if security wasn't an issue, straight CRC32 has really poor
>> lower-order bit distribution, which makes it a terrible choice for
>> a hash table that simply uses the lower-order bits.
>
> I wondered the same while trying to use arch_fast_hash in a lot more
> places (I did a new implementation in assembler I'll send later on, it
> is mostly optimized to deal with ovs flow keys).
>
> While the uniformity of crc32 does actually look good and IMHO this even
> holds for the lower bits of the hash, I totally agree on the linearity
> matters.
>
> The easiest way to make arch_fast_hash non-linear would be to build up
> on the crc32 instruction like e.g. the cityhash function family does and
> it seems not too hard to do that by combining two crc32c outputs of the
> original and cyclic shifted input data. I have doubts if this is faster
> than jhash in the end. There are proposals from Intel to do so, but they
> are patent encumbered. :/
>
> For most consumers in the networking stack, security and DoS resistence
> is an issue. OVS, for which this was designed at first does do rehashing
> from time to time, but still there is a possible DoS attack vector with
> this hashing algorithm.

2014-12-04 15:26:41

by Thomas Graf

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On 12/04/14 at 04:11pm, Herbert Xu wrote:
> Hi:
>
> While working on rhashtable it came to me that this whole concept
> of arch_fast_hash is flawed. CRCs are linear functions so it's
> fairly easy for an attacker to identify collisions or at least
> eliminate a large amount of search space (e.g., controlling the
> last bit of the hash result is almost trivial, even when you add
> a random seed).
>
> So what exactly are we going to use arch_fast_hash for? Presumably
> it's places where security is never goint to be an issue, right?
>
> Even if security wasn't an issue, straight CRC32 has really poor
> lower-order bit distribution, which makes it a terrible choice for
> a hash table that simply uses the lower-order bits.

As Daniel pointed out, this work originated for the OVS edge use
case where security is of less concern and the rehashing is
sufficient. Identifying collisions is less of interest as the user
space fall back provides a greater surface for an attack.

2014-12-04 15:29:47

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Thu, Dec 04, 2014 at 03:26:37PM +0000, Thomas Graf wrote:
>
> As Daniel pointed out, this work originated for the OVS edge use
> case where security is of less concern and the rehashing is
> sufficient. Identifying collisions is less of interest as the user
> space fall back provides a greater surface for an attack.

Well in that case the current setup I think is very misleading.
It's inviting unsuspecting kernel developers to use it as a hash
function for general hash tables, which AFAICS is something that
it fails at miserably.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2014-12-04 15:39:42

by Thomas Graf

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On 12/04/14 at 11:29pm, Herbert Xu wrote:
> On Thu, Dec 04, 2014 at 03:26:37PM +0000, Thomas Graf wrote:
> >
> > As Daniel pointed out, this work originated for the OVS edge use
> > case where security is of less concern and the rehashing is
> > sufficient. Identifying collisions is less of interest as the user
> > space fall back provides a greater surface for an attack.
>
> Well in that case the current setup I think is very misleading.
> It's inviting unsuspecting kernel developers to use it as a hash
> function for general hash tables, which AFAICS is something that
> it fails at miserably.

Well, it's called fast hash and not secure hash ;-) but a clear hint
definitely wouldn't hurt.

I wasn't aware of the distribution weakness of the lower bits. Do
you have a reference to more information?

2014-12-04 15:44:09

by Daniel Borkmann

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On 12/04/2014 04:39 PM, Thomas Graf wrote:
> On 12/04/14 at 11:29pm, Herbert Xu wrote:
>> On Thu, Dec 04, 2014 at 03:26:37PM +0000, Thomas Graf wrote:
>>>
>>> As Daniel pointed out, this work originated for the OVS edge use
>>> case where security is of less concern and the rehashing is
>>> sufficient. Identifying collisions is less of interest as the user
>>> space fall back provides a greater surface for an attack.
>>
>> Well in that case the current setup I think is very misleading.
>> It's inviting unsuspecting kernel developers to use it as a hash
>> function for general hash tables, which AFAICS is something that
>> it fails at miserably.
>
> Well, it's called fast hash and not secure hash ;-) but a clear hint
> definitely wouldn't hurt.

Hm, I thought the kernel doc on arch_fast_hash() in include/linux/hash.h
would give enough of a hint ...

2014-12-04 15:47:23

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Thu, Dec 04, 2014 at 04:43:31PM +0100, Daniel Borkmann wrote:
>
> Hm, I thought the kernel doc on arch_fast_hash() in include/linux/hash.h
> would give enough of a hint ...

I think something more explicit like "do not use this in a hash
table unless you know what you're doing" might be needed.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2014-12-04 15:52:11

by Daniel Borkmann

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On 12/04/2014 04:47 PM, Herbert Xu wrote:
> On Thu, Dec 04, 2014 at 04:43:31PM +0100, Daniel Borkmann wrote:
>>
>> Hm, I thought the kernel doc on arch_fast_hash() in include/linux/hash.h
>> would give enough of a hint ...
>
> I think something more explicit like "do not use this in a hash
> table unless you know what you're doing" might be needed.

Ok, I'm fine with that, do you want to submit a patch on top of, or
after Hannes' improvements [1] got in? Thanks a lot, Herbert.

[1] http://patchwork.ozlabs.org/patch/417761/

2014-12-04 15:57:14

by David Laight

[permalink] [raw]

Subject: RE: Where exactly will arch_fast_hash be used

From: Herbert Xu
> On Thu, Dec 04, 2014 at 04:43:31PM +0100, Daniel Borkmann wrote:
> >
> > Hm, I thought the kernel doc on arch_fast_hash() in include/linux/hash.h
> > would give enough of a hint ...
>
> I think something more explicit like "do not use this in a hash
> table unless you know what you're doing" might be needed.

That isn't really helpful. Maybe something more like:

"This CRC algorithm used by this hash is 'linear', ie hash(a xor b) ==
hash(a) xor hash(b). This means that it is relatively easy for a remote
attacker to generate multiple items with the same hash."

David

2014-12-04 16:11:19

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Thu, Dec 04, 2014 at 03:56:30PM +0000, David Laight wrote:
>
> "This CRC algorithm used by this hash is 'linear', ie hash(a xor b) ==
> hash(a) xor hash(b). This means that it is relatively easy for a remote
> attacker to generate multiple items with the same hash."

The attacker could be local too, e.g., opening a netlink socket
can be done by any user and gets hashed in net/netlink/af_netlink.c.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2014-12-07 05:20:43

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

If you want DoS-resistant hash tables, I'm working on adding SipHash
to the kernel.

This is a keyed pseudo-random function designed specifically for that
application. I am starting with ext4 directory hashes, and then intended
to expand to secure sequence numbers (since it's far faster than MD5).

(I'm trying to figure out a good interface, since the crypto API
is a bit heavy for something to heavily optimized.)

But one comment caught my eye:
> Even if security wasn't an issue, straight CRC32 has really poor
> lower-order bit distribution, which makes it a terrible choice for
> a hash table that simply uses the lower-order bits.

Er... huh? That's the first time I've heard that claim, and while I'm not
Philip Koopman or Guy Castagnoli, I thought I understood CRCs pretty well.

CRCs generally mix bits pretty well. The sparse 16-bit CRCs chosen
for implementation simplicity had some limitations, but the Castagnoli
polynomial is quite dense.

And their mathematical symmetry means that the low bits really shouldn't
be any different from any other bits. But if it is an issue, it's just
as easy work to shift down the correct number of high bits rather than
using the low.

Can you point me to a source for that statement?

2014-12-07 09:28:42

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Sun, Dec 07, 2014 at 12:20:41AM -0500, George Spelvin wrote:
>
> Can you point me to a source for that statement?

For a start why don't you print out the hashes of 1-255 and then
find out how easy it is to deduce the last bit of the hash result.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2014-12-07 10:02:58

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

> For a start why don't you print out the hashes of 1-255 and then
> find out how easy it is to deduce the last bit of the hash result.

They're available in lib/crc32table.h, as crc32ctable_le[0].

As a CRC is a linear function, every bit is the XOR of some
selected bits of the input, i.e. the parity of the input and
some bit-specific mask sequence.

Furthermore, CRCs are cyclic, so the mask sequences for adjacent bits are
shifts of each other.

The lsbit of the CRC32c of x is the parity of x & 0x1f.

This is because the LFSR sequence generated by the polynomial
starts 0001111110010001110010101111011000111000011011110010110000100101...

The first bit corresponds to the msbit of the last byte.

How does this implicate the low bits specifically?

2014-12-07 12:52:11

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Sun, Dec 07, 2014 at 05:02:52AM -0500, George Spelvin wrote:
>
> How does this implicate the low bits specifically?

If you can easily deduce the pre-images that make the last bit
of the hash even or odd, then you've just cut your search space
for collisions by half. The real killer is that you can do this
without knowing what the secret is.

Our entire scheme is dependent on using the secret to defeat
would-be attackers. If CRC does not make effective use of the
secret, then we're essentially naked against attackers.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2014-12-07 13:14:46

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

Hi,

On So, 2014-12-07 at 00:20 -0500, George Spelvin wrote:
> If you want DoS-resistant hash tables, I'm working on adding SipHash
> to the kernel.
>
> This is a keyed pseudo-random function designed specifically for that
> application. I am starting with ext4 directory hashes, and then intended
> to expand to secure sequence numbers (since it's far faster than MD5).

Please consider xfs, too.
AFAIK xfs doesn't seed their hashing so far and the hashing function is
pretty weak. One example:
http://marc.info/?l=linux-xfs&m=139590613002926&w=2

> (I'm trying to figure out a good interface, since the crypto API
> is a bit heavy for something to heavily optimized.)

Ack. If we want to use it in the networking stack we should be able to
use it without a dependency to the crypto framework.

Bye,
Hannes

2014-12-07 13:23:12

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

>> How does this implicate the low bits specifically?

> If you can easily deduce the pre-images that make the last bit
> of the hash even or odd, then you've just cut your search space
> for collisions by half. The real killer is that you can do this
> without knowing what the secret is.

Um, yes, if you're in a situation where a hash collsion DoS is possible,
a CRC is disastrous choice. You can trivially find collisions for *all*
bits of a CRC. Low or high, they're all equally easy.

When you said
>>>>> Even if security wasn't an issue, straight CRC32 has really poor
>>>>> lower-order bit distribution, which makes it a terrible choice for
>>>>> a hash table that simply uses the lower-order bits.

This is talking about:
- Non-malicious inputs,where security isn't an issue, and
- Low-order bits specifically, implying that the high-order bits are different.

*That's* the claim I'm curious about. I know perfectly well that if
security *is* an issue, a fixed-polynomial CRC is a disaser.

But for non-malicious inputs, like normal software identifiers, a CRC
actually works very well.

If you want to do secure hashing with a CRC, you need to have a secret
*polynomial*. That *is* provably secure (it's a universal family of hash
functions), but isn't provided by x86 unless you use SSE and PCLMUL.

That's why it's a non-cryptographic hash, suitable for non-malicious
inputs only. That's the same security claim as many other common hash
functions.

> Our entire scheme is dependent on using the secret to defeat
> would-be attackers. If CRC does not make effective use of the
> secret, then we're essentially naked against attackers.

Okay, I'm confused. *What* scheme? The arch_fast_hash interface doesn't
have any provision for a secret. Because there's no point to having one;
you can't change the polynomial, and anything additive has just moves
collisions around without reducing them.

So there are plenty of hash tables in Linux that you don't dare use this
with. In fact, so many that, as you rightly point out, it's not clear
if it's worth providing this special optimization for the few remaining.

2014-12-07 13:30:59

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

Thanks for the encouragement!

> Please consider xfs, too.
> AFAIK xfs doesn't seed their hashing so far and the hashing function is
> pretty weak. One example:
> http://marc.info/?l=linux-xfs&m=139590613002926&w=2

Is that something that *can* be changed without breaking the
disk format? SipHash is explicitly *not* designed to be secure as
an unkeyed hash in the way that SHA-type algorithms are.

What it's designed to do is provide second preimage resistance
of its output, or a function (like modular reduction) of its output,
against an attacker who doesn't know the secret seed.

> Ack. If we want to use it in the networking stack we should be able to
> use it without a dependency to the crypto framework.

Already understood. My big question is whether a single function call
is okay or we need something inlinable.

2014-12-07 13:41:41

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On So, 2014-12-07 at 08:30 -0500, George Spelvin wrote:
> Thanks for the encouragement!
>
> > Please consider xfs, too.
> > AFAIK xfs doesn't seed their hashing so far and the hashing function is
> > pretty weak. One example:
> > http://marc.info/?l=linux-xfs&m=139590613002926&w=2
>
> Is that something that *can* be changed without breaking the
> disk format? SipHash is explicitly *not* designed to be secure as
> an unkeyed hash in the way that SHA-type algorithms are.

I did some research and it looked like it would need a change to the
disk format but it should be doable by incrementing the super block
version, so at least newly created filesystem would benefit from it.

> What it's designed to do is provide second preimage resistance
> of its output, or a function (like modular reduction) of its output,
> against an attacker who doesn't know the secret seed.
>
> > Ack. If we want to use it in the networking stack we should be able to
> > use it without a dependency to the crypto framework.
>
> Already understood. My big question is whether a single function call
> is okay or we need something inlinable.

Like md5_transfrom, I think a non-inline function would be just fine.
Otherwise kernel code size would increase. Most hash users in the
network stack mostly deal with less bytes of input than one round needs.

Bye,
Hannes

2014-12-07 13:52:37

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On So, 2014-12-07 at 14:41 +0100, Hannes Frederic Sowa wrote:
> On So, 2014-12-07 at 08:30 -0500, George Spelvin wrote:
> > Thanks for the encouragement!
> >
> > > Please consider xfs, too.
> > > AFAIK xfs doesn't seed their hashing so far and the hashing function is
> > > pretty weak. One example:
> > > http://marc.info/?l=linux-xfs&m=139590613002926&w=2
> >
> > Is that something that *can* be changed without breaking the
> > disk format? SipHash is explicitly *not* designed to be secure as
> > an unkeyed hash in the way that SHA-type algorithms are.
>
> I did some research and it looked like it would need a change to the
> disk format but it should be doable by incrementing the super block
> version, so at least newly created filesystem would benefit from it.
>
> > What it's designed to do is provide second preimage resistance
> > of its output, or a function (like modular reduction) of its output,
> > against an attacker who doesn't know the secret seed.
> >
> > > Ack. If we want to use it in the networking stack we should be able to
> > > use it without a dependency to the crypto framework.
> >
> > Already understood. My big question is whether a single function call
> > is okay or we need something inlinable.
>
> Like md5_transfrom, I think a non-inline function would be just fine.
> Otherwise kernel code size would increase. Most hash users in the
> network stack mostly deal with less bytes of input than one round needs.

Of course, if it looks feasable (from a performance PoV, but I doubt
that) to migrate the current jhash users to siphash, it might be worth
dealing with larger input sizes and maybe also doing it inline. But that
very much depends on the code size it would add. Currently we use jhash
as the non-linear "secure" hashing functions at most places.

Also rhashtable takes a pointer to the hasing function, thus causing gcc
to generate a function in each compilation unit if it would be static
inline.

2014-12-07 14:06:50

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On So, 2014-12-07 at 08:23 -0500, George Spelvin wrote:
> So there are plenty of hash tables in Linux that you don't dare use this
> with. In fact, so many that, as you rightly point out, it's not clear
> if it's worth providing this special optimization for the few remaining.

In case of openvswitch it shows a performance improvment. The seed
parameter could be used as an initial biasing of the crc32 function, but
in case of openvswitch it is only set to 0.

Bye,
Hannes

2014-12-07 21:33:58

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Sun, 2014-12-07 at 15:06 +0100, Hannes Frederic Sowa wrote:
> In case of openvswitch it shows a performance improvment. The seed
> parameter could be used as an initial biasing of the crc32 function, but
> in case of openvswitch it is only set to 0.

NACK.

This is the Fatal Error in thinking that Herbert was warning about.
The seed parameter doesn't affect CRC32 collisions *at all* if the inputs
are the same size.

For fixed-size inputs, a non-zero seed is equivalent to XORing a
constant into the output of the CRC computation.

for *different* sized inputs, a non-zero seed detects zero-padding
better than a zero one, but *which* non-zero value is also irrelevant;
all-ones is the traditional choice because it's simplest in hardware.

A CRC is inherently linear. CRC(a^b) = CRC(a) ^ CRC(b). This makes
them easy to analyze mathematically and gives them a number of nice
properties for detecting hardware corruption.

But that same simplicity makes it *ridiculously* easy to generate
collisions if you try.

One way of looking at a CRC is to say that each bit in the input
has a CRC. The CRC of a message string is just the XOR of the CRCs
of the individual bits that are set in the message.

Now, a CRC polynomial is chosen so that all of the bits of a
message have different CRCs. Obviously, there's a limit: when the
message is 2^n bits long, it's not possible for all the bits to
have different, non-zero n-bit CRCs.

But a CRC is a really efficient way of assigning different bit patterns
to different input bits up to that limit.

(Something like CRC32c is also chosen so that, for messages up to a
reasonable length, no 3-bit, 4-bit, etc. combinations have CRCs that
XOR to zero.)

But, and this might be what Herbert was trying to say and I was
misunderstanding, if you then *truncate* that CRC, the CRCs of the
message bits lose that uniqueness guarantee. They're just pseudorandom
numbers, and a CRC loses its special collision-resistance properties.

It's just an ordinary random hash, and thanks to the birthday paradox,
you're likely to find two bits whose CRCs agree in any particular 8 bits
within roughly sqrt(2*256) or 22 bits.

Here are a few such collisions for the least significant 8 bits of CRC32c:

Msg1 CRC32c Msg2 CRC32c Match
1<<11 3fc5f181 1<<30 bf672381 81
1<<12 9d14c3b8 1<<31 dd45aab8 b8
1<<5 c79a971f 1<<44 6006181f 1f
1<<15 13a29877 1<<45 b2f53777 77

There's nothing special about the lsbits of the CRC.
Within 64 bits, the most significant 8 bits have it worse:

1<<5 c79a971f 1<<17 c76580d9 c7
1<<6 e13b70f7 1<<18 e144fb14 e1
1<<19 70a27d8a 1<<38 7022df58 70
1<<20 38513ec5 1<<39 38116fac 38
1<<13 4e8a61dc 1<<52 4e2dfd53 4e
1<<23 a541927e 1<<53 a5e0c5d1 a5

Now, I'd like to stress that this collision rate is no worse than any
*other* hash function. A truncated CRC loses its special resistance to
the birthday paradox (you'd have been much smarter to use 8-bit CRC),
but it doesn't become especially bad. A truncated SHA-1 will have
coillisions just as often.

The concern with a CRC is that, once you've found one collision, you've
found a huge number of them. Just XOR the bit pattern of your choice
into both of the colliding messages, and you have a new collision.

For another example, if you consider the CRC32c of all possible 1-byte
messages *and then take only the low byte*, there are only 128 possible
values.

It turns out that the byte 0x5d has a CRC32c of 0xee0d9600. This ends
in 00, so if I XOR 0x5d into anything, the low 8 bits of the CRC
don't change.

Likewise, the message "23 00" has a CRC32c of 0x00ee0d96. So you can
XOR 0x23 into the second-last byte of anything, and the high 8 bits of
the CRC don't change.

2014-12-08 11:25:14

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

Hi,

On Sun, Dec 7, 2014, at 22:33, George Spelvin wrote:
> On Sun, 2014-12-07 at 15:06 +0100, Hannes Frederic Sowa wrote:
> > In case of openvswitch it shows a performance improvment. The seed
> > parameter could be used as an initial biasing of the crc32 function, but
> > in case of openvswitch it is only set to 0.
>
> NACK.
>
> This is the Fatal Error in thinking that Herbert was warning about.
> The seed parameter doesn't affect CRC32 collisions *at all* if the inputs
> are the same size.
>
> For fixed-size inputs, a non-zero seed is equivalent to XORing a
> constant into the output of the CRC computation.

Sorry for being unclear, I understood that and didn't bother patching
that '0' with a random seed exactly because of this.

> for *different* sized inputs, a non-zero seed detects zero-padding
> better than a zero one, but *which* non-zero value is also irrelevant;
> all-ones is the traditional choice because it's simplest in hardware.
>
>
> A CRC is inherently linear. CRC(a^b) = CRC(a) ^ CRC(b). This makes
> them easy to analyze mathematically and gives them a number of nice
> properties for detecting hardware corruption.
>
> But that same simplicity makes it *ridiculously* easy to generate
> collisions if you try.

Yes, understood and I totally agree we shouldn't use crc32 hashing in a
lot of places where unsafe data is going to be hashed and inserted into
hash tables.

> One way of looking at a CRC is to say that each bit in the input
> has a CRC. The CRC of a message string is just the XOR of the CRCs
> of the individual bits that are set in the message.
>
> Now, a CRC polynomial is chosen so that all of the bits of a
> message have different CRCs. Obviously, there's a limit: when the
> message is 2^n bits long, it's not possible for all the bits to
> have different, non-zero n-bit CRCs.
>
> But a CRC is a really efficient way of assigning different bit patterns
> to different input bits up to that limit.
>
> (Something like CRC32c is also chosen so that, for messages up to a
> reasonable length, no 3-bit, 4-bit, etc. combinations have CRCs that
> XOR to zero.)
>
>
> But, and this might be what Herbert was trying to say and I was
> misunderstanding, if you then *truncate* that CRC, the CRCs of the
> message bits lose that uniqueness guarantee. They're just pseudorandom
> numbers, and a CRC loses its special collision-resistance properties.
>
> It's just an ordinary random hash, and thanks to the birthday paradox,
> you're likely to find two bits whose CRCs agree in any particular 8 bits
> within roughly sqrt(2*256) or 22 bits.
>
> Here are a few such collisions for the least significant 8 bits of
> CRC32c:
>
> Msg1 CRC32c Msg2 CRC32c Match
> 1<<11 3fc5f181 1<<30 bf672381 81
> 1<<12 9d14c3b8 1<<31 dd45aab8 b8
> 1<<5 c79a971f 1<<44 6006181f 1f
> 1<<15 13a29877 1<<45 b2f53777 77
>
> There's nothing special about the lsbits of the CRC.
> Within 64 bits, the most significant 8 bits have it worse:
>
> 1<<5 c79a971f 1<<17 c76580d9 c7
> 1<<6 e13b70f7 1<<18 e144fb14 e1
> 1<<19 70a27d8a 1<<38 7022df58 70
> 1<<20 38513ec5 1<<39 38116fac 38
> 1<<13 4e8a61dc 1<<52 4e2dfd53 4e
> 1<<23 a541927e 1<<53 a5e0c5d1 a5
>
>
> Now, I'd like to stress that this collision rate is no worse than any
> *other* hash function. A truncated CRC loses its special resistance to
> the birthday paradox (you'd have been much smarter to use 8-bit CRC),
> but it doesn't become especially bad. A truncated SHA-1 will have
> coillisions just as often.
>
> The concern with a CRC is that, once you've found one collision, you've
> found a huge number of them. Just XOR the bit pattern of your choice
> into both of the colliding messages, and you have a new collision.

Ack.

> For another example, if you consider the CRC32c of all possible 1-byte
> messages *and then take only the low byte*, there are only 128 possible
> values.
>
> It turns out that the byte 0x5d has a CRC32c of 0xee0d9600. This ends
> in 00, so if I XOR 0x5d into anything, the low 8 bits of the CRC
> don't change.
>
> Likewise, the message "23 00" has a CRC32c of 0x00ee0d96. So you can
> XOR 0x23 into the second-last byte of anything, and the high 8 bits of
> the CRC don't change.

A very interesting read, thanks for your mail!

Bye,
Hannes

2014-12-08 16:20:03

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

>>> In case of openvswitch it shows a performance improvment. The seed
>>> parameter could be used as an initial biasing of the crc32 function, but
>>> in case of openvswitch it is only set to 0.

>> NACK. [...]

> Sorry for being unclear, I understood that and didn't bother patching
> that '0' with a random seed exactly because of this.

And I'm sorry for delivering a long lecture on a subject you already
understood perfectly well.

I'd just been thinking about it because of Herbert's comments, so it was
conveniently at hand. :-)

Out of curiousity, what *were* you referring to when you talked
about biasing the crc32 function? "Biasing" is a good term becuase
it just applies an offset, but what do you gain from doing that?

There are nifty things one can do with the CRC32 instruction, however.
A lot of ciphers these days use an ARX (add, rotate, XOR) kernel.
A crc32 instruction, although linear, does some very powerful rotate &
xor operations, and could replace the XOR and rotate.

2014-12-08 16:32:12

[permalink] [raw]

Subject: Re: Where exactly will arch_fast_hash be used

On Mon, Dec 8, 2014, at 17:19, George Spelvin wrote:
> >>> In case of openvswitch it shows a performance improvment. The seed
> >>> parameter could be used as an initial biasing of the crc32 function, but
> >>> in case of openvswitch it is only set to 0.
>
> >> NACK. [...]
>
> > Sorry for being unclear, I understood that and didn't bother patching
> > that '0' with a random seed exactly because of this.
>
> And I'm sorry for delivering a long lecture on a subject you already
> understood perfectly well.

I learned something, so your time wasn't completely wasted. ;)

> I'd just been thinking about it because of Herbert's comments, so it was
> conveniently at hand. :-)
>
> Out of curiousity, what *were* you referring to when you talked
> about biasing the crc32 function? "Biasing" is a good term becuase
> it just applies an offset, but what do you gain from doing that?

Actually, I don't know why the seed parameter was added. I just wanted
to mention that there is a way to bias the crc32 function which fits
into the style of the other hashing functions, like jhash with its
initval parameter.

I just kept it around during the rewrite.

The only use case I can imagine would be if one would like to calculate
a crc32c over a non-contiguous array, thus feeding the result of one crc
operation into the next one.

> There are nifty things one can do with the CRC32 instruction, however.
> A lot of ciphers these days use an ARX (add, rotate, XOR) kernel.
> A crc32 instruction, although linear, does some very powerful rotate &
> xor operations, and could replace the XOR and rotate.

Yes, I have seen it being used in cityhash.

There is also a proposal by Intel, but the hash seems too weak, too:
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/hash-method-performance-paper.pdf

Bye,
Hannes

2014-12-09 14:25:50