2003-05-29 20:30:02

by Scott A Crosby

[permalink] [raw]
Subject: Algoritmic Complexity Attacks and 2.4.20 the dcache code

Hello. We have analyzed this software to determine its vulnerability
to a new class of DoS attacks that related to a recent paper. ''Denial
of Service via Algorithmic Complexity Attacks.''

This paper discusses a new class of denial of service attacks that
work by exploiting the difference between average case performance and
worst-case performance. In an adversarial environment, the data
structures used by an application may be forced to experience their
worst case performance. For instance, hash tables are usually thought
of as being constant time operations, but with large numbers of
collisions will degrade to a linked list and may lead to a 100-10,000
times performance degradation. Because of the widespread use of hash
tables, the potential for attack is extremely widespread. Fortunately,
in many cases, other limits on the system limit the impact of these
attacks.

To be attackable, an application must have a deterministic or
predictable hash function and accept untrusted input. In general, for
the attack to be signifigant, the applications must be willing and
able to accept hundreds to tens of thousands of 'attack
inputs'. Because of that requirement, it is difficult to judge the
impact of these attack without knowing the source code extremely well,
and knowing all ways in which a program is used.

As part of this project, one of the applications examined was the
linux kernel 2.4.20, both the networking stack (subject of another
email) and in this email, the dcache.

I have confirmed via an actual attack that it is possible to force the
dcache to experience a 200x performance degradation if the attacker
can control filenames. On a P4-1.8ghz, the time to list a directory of
10,000 files is 18 seconds instead of .1 seconds.


The solution for these attacks on hash tables is to make the hash
function unpredictable via a technique known as universal
hashing. Universal hashing is a keyed hash function where, based on
the key, one of a large set hash functions is chosen. When
benchmarking, we observe that for short or medium length inputs, it is
comparable in performance to simple predictable hash functions such as
the ones in Python or Perl. Our paper has graphs and charts of our
benchmarked performance.

I highly advise using a universal hashing library, either our own or
someone elses. As is historically seen, it is very easy to make silly
mistakes when attempting to implement your own 'secure' algorithm.

The abstract, paper, and a library implementing universal hashing is
available at http://www.cs.rice.edu/~scrosby/hash/.

Scott


2003-05-30 03:44:31

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Thu, 2003-05-29 at 13:42, Scott A Crosby wrote:
> I highly advise using a universal hashing library, either our own or
> someone elses. As is historically seen, it is very easy to make silly
> mistakes when attempting to implement your own 'secure' algorithm.

Why are you recommending this when after 2 days of going back
and forth in emails with me you came to the conclusion that for
performance critical paths such as the hashes in the kernel the Jenkins
hash was an acceptable choice?

It is unacceptably costly to use a universal hash, it makes a multiply
operation for every byte of key input plus a modulo operation at the
end of the hash computation. All of which can be extremely expensive
on some architectures.

I showed and backed this up for you with benchmarks comparing your
universal hashing code and Jenkins.

Some embedded folks will have your head on a platter if we end up using
a universal hash function for the DCACHE solely based upon your advice.
:-)

--
David S. Miller <[email protected]>

2003-05-30 03:50:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code


On 29 May 2003, Scott A Crosby wrote:

> I have confirmed via an actual attack that it is possible to force the
> dcache to experience a 200x performance degradation if the attacker can
> control filenames. On a P4-1.8ghz, the time to list a directory of
> 10,000 files is 18 seconds instead of .1 seconds.

are you sure this is a big issue? Kernel 2.0 (maybe even 2.2) lists 10,000
files at roughly the same speed (18 seconds) without any attack pattern
used for filenames - still it's a kernel being used.

the network hash collision was a much more serious issue, because it could
be triggered externally, could be maintained with a relatively low input
packet flow, and affected all users of the network stack.

also, directories with 10,000 files are not quite common on systems where
there is a trust problem between users. So a typical directory with say
100 files will be listed in 0.18 seconds - it's slower, but does not make
the system unusable.

also, dcache flushes happen quite frequently under VM pressure - and
especially when using many files you get VM pressure. So it would take a
really specialized attack to keep the dcache size at the critical level
and trigger the slowdown.

also, any local user who can create thousands of files can cause much more
havoc by simply overloading the system. You might as well use that CPU
time to really bog down the system by making it swap heavily - this will
cause a _much_ heavier slowdown.

Ingo

2003-05-30 04:16:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code


On 29 May 2003, David S. Miller wrote:

> > I highly advise using a universal hashing library, either our own or
> > someone elses. As is historically seen, it is very easy to make silly
> > mistakes when attempting to implement your own 'secure' algorithm.
>
> Why are you recommending this when after 2 days of going back
> and forth in emails with me you came to the conclusion that for
> performance critical paths such as the hashes in the kernel the Jenkins
> hash was an acceptable choice?
>
> It is unacceptably costly to use a universal hash, it makes a multiply
> operation for every byte of key input plus a modulo operation at the end
> of the hash computation. All of which can be extremely expensive on
> some architectures.
>
> I showed and backed this up for you with benchmarks comparing your
> universal hashing code and Jenkins.

i'd suggest to use the jhash for the following additional kernel entities:
pagecache hash, inode hash, vcache hash.

the buffer-cache hash and the pidhash should be hard (impossible?) to
attack locally.

Ingo

2003-05-30 04:29:28

by Scott A Crosby

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Fri, 30 May 2003 06:02:18 +0200 (CEST), Ingo Molnar <[email protected]> writes:

> On 29 May 2003, Scott A Crosby wrote:
>
> > I have confirmed via an actual attack that it is possible to force the
> > dcache to experience a 200x performance degradation if the attacker can
> > control filenames. On a P4-1.8ghz, the time to list a directory of
> > 10,000 files is 18 seconds instead of .1 seconds.
>
> are you sure this is a big issue? Kernel 2.0 (maybe even 2.2) lists 10,000
> files at roughly the same speed (18 seconds) without any attack pattern
> used for filenames - still it's a kernel being used.

No. Its not that severe, but it does exist, and it is noticable even
with a quarter that number of files. I did it because it was an
interesting illustrative example, and it only took 30 seconds or so of
coding to put the hash function into generator generating program.

> So it would take a really specialized attack to keep the dcache size
> at the critical level and trigger the slowdown.

Yup. It is probably a very unusual configuration, but that doesn't
mean that somebody won't experience it. :)

Scott

2003-05-30 04:48:18

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Thu, 2003-05-29 at 21:42, Scott A Crosby wrote:
> It is probably a very unusual configuration,

It is worth noting that it might even be possible to go after
this remotely. Consider a mailserver that you can someone influence
the queued mail file names for.

--
David S. Miller <[email protected]>

2003-05-30 04:51:55

by Scott A Crosby

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On 29 May 2003 20:57:47 -0700, "David S. Miller" <[email protected]> writes:

> On Thu, 2003-05-29 at 13:42, Scott A Crosby wrote:
> > I highly advise using a universal hashing library, either our own or
> > someone elses. As is historically seen, it is very easy to make silly
> > mistakes when attempting to implement your own 'secure' algorithm.
>
> Why are you recommending this when after 2 days of going back
> and forth in emails with me you came to the conclusion that for
> performance critical paths such as the hashes in the kernel the Jenkins
> hash was an acceptable choice?

That was a boilerplate paragraph in all the other vulnerability
reports I shipped out today. Perhaps I should have stripped out out or
replaced it.

> It is unacceptably costly to use a universal hash,

Yup the Jenkin's is about 5 times faster than our CW construction on
SPARC, and thus likely at least as much faster on a wide variety of
other CPU's. I do not know if the dcache hash is performance critical,
nor do I know if there exists other more efficient universal hash
functions.

In any case, the attacks I describe are strong in relative terms, but
rather weak in terems of actual impact, so fixing it may not matter
much.

> Some embedded folks will have your head on a platter if we end up using
> a universal hash function for the DCACHE solely based upon your advice.
> :-)

Have you seen the current dcache function?

/* Linux dcache */
#define HASH_3(hi,ho,c) ho=(hi + (c << 4) + (c >> 4)) * 11


Scott

2003-05-30 06:12:35

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Scott A Crosby <[email protected]>
Date: 30 May 2003 00:04:24 -0500

Have you seen the current dcache function?

/* Linux dcache */
#define HASH_3(hi,ho,c) ho=(hi + (c << 4) + (c >> 4)) * 11

Awesome, moving the Jenkins will actually save us some
cycles :-)

2003-05-30 06:33:36

by Scott A Crosby

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Thu, 29 May 2003 23:24:40 -0700 (PDT), "David S. Miller" <[email protected]> writes:

> From: Scott A Crosby <[email protected]>
> Date: 30 May 2003 00:04:24 -0500
>
> Have you seen the current dcache function?
>
> /* Linux dcache */
> #define HASH_3(hi,ho,c) ho=(hi + (c << 4) + (c >> 4)) * 11
>
> Awesome, moving the Jenkins will actually save us some
> cycles :-)

Tricky though, because what if you want to hash more than 64 bits? You
have to somehow chain Jenkin's together.

Let H(a,b,c) be a jenkin's hash that does '' mix(a,b,c) ; return a ''

Let a,b,c,d,e be inputs to be hashed, and R,S,T,U be random keys.


Its not safe to do anything like

H(H(a,b,c),H(d,e,f),R)

Because an attacker can brute-force to find tuples (a,b,c),
(a',b',c'), ... so that H(a,b,c) == H(a',b',c') == ....

A better approach (which I make with no formal analysis of its
quality) might be to construct this:

H(a,b,R) ^ H(c,d,S) ^ H(e,f,T)

Perhaps the best approach is to visit Usenix Security 2003 this August
and ask the experts there.

Scott

2003-05-30 06:43:55

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Scott A Crosby <[email protected]>
Date: 30 May 2003 01:46:12 -0500

Its not safe to do anything like

One thing that helps here is that we don't need to provide
protection outside the realm of a single name.

This is because the hash function takes the pointer of the dentry of
the directory it is in (the parent), and contributes this into
the hash.

Back to the basic problem, using jenkins for hashing names. You could
simply shuffle the bytes into a set of 3 32-bit words, every time
you've contributed 12 bytes (the 3 words are full) or you've finished
the string, you run a __jhash_mix(a,b,c) pass. And you can make
init_name_hash() insert the initial random value (choosen using
get_random_bytes() right before we mount root).

After all, a string is just a variable number of u32 words.

Actually, since we can say something about the alignment of the string
pointer in the dentry case, we can simply feed this as a u32 pointer
straight into the jenkins hash. We know the length of the string so
this is pretty easy. Actually, the most generic version "jhash()"
handles arbitrary byte lengths and alignments.

2003-05-30 08:45:51

by Alex Riesen

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

David S. Miller, Fri, May 30, 2003 08:24:40 +0200:
> From: Scott A Crosby <[email protected]>
> Date: 30 May 2003 00:04:24 -0500
>
> Have you seen the current dcache function?
>
> /* Linux dcache */
> #define HASH_3(hi,ho,c) ho=(hi + (c << 4) + (c >> 4)) * 11
>
> Awesome, moving the Jenkins will actually save us some
> cycles :-)

static
int hash_3(int hi, int c)
{
return (hi + (c << 4) + (c >> 4)) * 11;
}

gcc-3.2.1 -O2 -march=pentium

hash_3:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %eax
movl 8(%ebp), %ecx
movl %eax, %edx
popl %ebp
sall $4, %edx
sarl $4, %eax
addl %ecx, %edx
addl %eax, %edx
leal (%edx,%edx,4), %eax
leal (%edx,%eax,2), %eax
ret

It is not guaranteed to be this way on all architectures, of course.
But still - no multiplications.

2003-05-30 08:48:39

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Alex Riesen <[email protected]>
Date: Fri, 30 May 2003 10:59:01 +0200

static
int hash_3(int hi, int c)
{
return (hi + (c << 4) + (c >> 4)) * 11;
}

gcc-3.2.1 -O2 -march=pentium
...
It is not guaranteed to be this way on all architectures, of course.
But still - no multiplications.

Indeed, I'd missed this. GCC will emit the constant multiply
expansion unless the multiply cost is set VERY low.

2003-05-30 13:34:52

by Nikita Danilov

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

Scott A Crosby writes:
> Hello. We have analyzed this software to determine its vulnerability
> to a new class of DoS attacks that related to a recent paper. ''Denial
> of Service via Algorithmic Complexity Attacks.''
>
> This paper discusses a new class of denial of service attacks that
> work by exploiting the difference between average case performance and
> worst-case performance. In an adversarial environment, the data
> structures used by an application may be forced to experience their
> worst case performance. For instance, hash tables are usually thought
> of as being constant time operations, but with large numbers of
> collisions will degrade to a linked list and may lead to a 100-10,000
> times performance degradation.

Another nice way to experience "worst case performance", is to create
deeply nested directory structure, like

0/1/2/3/4/.../99999/100000

try to unmount and see how shrink_dcache_parent/prune_dcache consume
100% of CPU without allowing preemption. Not recommended on a single
processor machine.

> Because of the widespread use of hash
> tables, the potential for attack is extremely widespread. Fortunately,
> in many cases, other limits on the system limit the impact of these
> attacks.
>

[...]

>
> Scott

Nikita.

2003-05-30 14:53:12

by Scott A Crosby

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Fri, 30 May 2003 02:00:40 -0700 (PDT), "David S. Miller" <[email protected]> writes:

> From: Alex Riesen <[email protected]>
> Date: Fri, 30 May 2003 10:59:01 +0200
>
> static
> int hash_3(int hi, int c)
> {
> return (hi + (c << 4) + (c >> 4)) * 11;
> }
>
> gcc-3.2.1 -O2 -march=pentium
> ...
> It is not guaranteed to be this way on all architectures, of course.
> But still - no multiplications.
>
> Indeed, I'd missed this. GCC will emit the constant multiply
> expansion unless the multiply cost is set VERY low.

It may still be a win. This does a bit under a dozen instructions per
byte. However, jenkin's does many bytes at a time.

Scott

2003-05-30 17:52:57

by Timothy Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code



Ingo Molnar wrote:

> i'd suggest to use the jhash for the following additional kernel entities:
> pagecache hash, inode hash, vcache hash.
>
> the buffer-cache hash and the pidhash should be hard (impossible?) to
> attack locally.
>


Maybe this is what someone is saying, and I just missed it, but...

Coming late into this discussion (once again), I am making some
assumptions here. A while back, someone came up with a method for
breaking encryption (narrowing the brute force computations) by
determining how long it took a given host to compute encryption keys or
hashes or something.

If you have a situation where a lot of hashes are to be computed, and
you want to decouple the computation time from the response time, then
do this:

1) A kernel thread requests a hash to be computed. That hash is
computed and put into a queue. The kernel thread is put into the task
queue.
2) Other kernel threads do the same.
3) Threads get woken up only when their hash is pulled off the queue.


This way, the only added overhead is kernel context switch from one
thread to another. The algorithm doesn't waste CPU time trying to hide
the complexity of the computation, because the time between request and
receipt of a hash is dependent on whatever other random hashed are also
being computed. That is, receipt is much later than request, but you
haven't wasted time.

The only case where this doesn't deal with the problem in a low-cost way
is when the queue starts out empty when you make a request, and then
it's the only one on the queue when it's pulled off. In that case, you
do have to waste time. However, given that the kernel thread will go to
sleep anyhow, the time between sleeping and waking up is somewhat random
due to whatever other kernel and user threads might get executed in the
mean time.

In fact, one solution to this problem would be for the hash computing
function to just yield the CPU before or after the computation of the
hash. Even in a system with absolutely no other load, the fact that we
have to go into and out of the scheduler will add some randomness to the
computation time, perhaps.

Have I just completely left the topic here? :)

2003-05-30 18:41:23

by Scott A Crosby

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Fri, 30 May 2003 14:16:09 -0400, Timothy Miller <[email protected]> writes:

> Ingo Molnar wrote:
>
> > i'd suggest to use the jhash for the following additional kernel
> > entities: pagecache hash, inode hash, vcache hash.
>
> > the buffer-cache hash and the pidhash should be hard (impossible?) to
>
> > attack locally.
> >
>
>
>
> Maybe this is what someone is saying, and I just missed it, but...
>
> Coming late into this discussion (once again), I am making some
> assumptions here. A while back, someone came up with a method for
> breaking encryption (narrowing the brute force computations) by
> determining how long it took a given host to compute encryption keys
> or hashes or something.

Yes. Thats also being presented at Usenix Security 2003:

Remote Timing Attacks Are Practical
David Brumley and Dan Boneh, Stanford University

Available at:
http://crypto.stanford.edu/~dabo/papers/ssl-timing.pdf

However, that paper was dealing with public key operations, not
hash-collisions. Thus, the temporal dependence on the key, the 'hidden
channel' will probably be orders of magnitude smaller. At that, the
paper only works well on a local network, and the result of the attack
is far more interesting, the server's private key.


In *theory* such an attack is possible and could be used to determine
the secret key used in the Jenkin's hash in the networking stack,
dcache, or other places in the kernel. Assuming that collision versus
non-collision could be uniquely distinguished with $k$ trials, and the
hash table has $n$ buckets, the attack would require about 5*k*n or so
queries. The idea is to determine if inputs $i$ and $j$ collide. This
has a chance of $1/n$ of occuring. If I can uniquely distinguish this,
which requires $k$ trials. then on average after $k*n$ trials I'll
have one collision. I repeat this $32/log_2 (n)$, say, 5 times, and I
have 5 such pairs of known collisions $i,j$. I then enumerate all 2^32
keys, and see which one of them is consistent with the collision
evidence observed. This will usually result in only a few possible
keys.

All of this is in theory of course. In practice, for the bits of the
kernel we're discussing, this sort of attack is completely irrelevant.
I also note that universal hashing is also not immune to these
attacks. To defend, we'd have to go to cryptographic techniques.

Scott

2003-05-31 06:06:17

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Scott A Crosby <[email protected]>
Date: 30 May 2003 10:05:51 -0500

On Fri, 30 May 2003 02:00:40 -0700 (PDT), "David S. Miller" <[email protected]> writes:

> Indeed, I'd missed this. GCC will emit the constant multiply
> expansion unless the multiply cost is set VERY low.

It may still be a win. This does a bit under a dozen instructions per
byte. However, jenkin's does many bytes at a time.

It turns out to not be the case at all. There is too much work
involved in the main loop just maintaining the 3-word + curbyte
state.

It needs to be optimized a bit.

2003-05-31 06:17:32

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Alex Riesen <[email protected]>
Date: Fri, 30 May 2003 10:59:01 +0200
> static
> int hash_3(int hi, int c)
> {
> return (hi + (c << 4) + (c >> 4)) * 11;
> }
> gcc-3.2.1 -O2 -march=pentium
> ...
> It is not guaranteed to be this way on all architectures, of course.
> But still - no multiplications.

On Fri, May 30, 2003 at 02:00:40AM -0700, David S. Miller wrote:
> Indeed, I'd missed this. GCC will emit the constant multiply
> expansion unless the multiply cost is set VERY low.

If the strength reduction situation changes to being properly handled
by gcc for most/all 64-bit arches, include/linux/hash.h can lose a #ifdef.


-- wli

2003-05-31 06:22:06

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: William Lee Irwin III <[email protected]>
Date: Fri, 30 May 2003 23:30:40 -0700

If the strength reduction situation changes to being properly handled
by gcc for most/all 64-bit arches, include/linux/hash.h can lose a #ifdef.

It's not a strength reduction issue, it's about not setting the
multiply cost properly in the machine description.

2003-05-31 06:28:28

by William Lee Irwin III

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: William Lee Irwin III <[email protected]>
Date: Fri, 30 May 2003 23:30:40 -0700
> If the strength reduction situation changes to being properly handled
> by gcc for most/all 64-bit arches, include/linux/hash.h can lose a #ifdef.

On Fri, May 30, 2003 at 11:33:53PM -0700, David S. Miller wrote:
> It's not a strength reduction issue, it's about not setting the
> multiply cost properly in the machine description.

If it's literally that trivial I'll put digging around the machine
descriptions on my TODO list.


-- wli

2003-05-31 06:33:34

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: William Lee Irwin III <[email protected]>
Date: Fri, 30 May 2003 23:41:38 -0700

If it's literally that trivial I'll put digging around the machine
descriptions on my TODO list.

Look at TARGET_RTX_COSTS, thats where all of this happens.

2003-05-31 07:49:17

by Willy Tarreau

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

Hi David,

On Fri, May 30, 2003 at 11:18:13PM -0700, David S. Miller wrote:
> It turns out to not be the case at all. There is too much work
> involved in the main loop just maintaining the 3-word + curbyte
> state.
>
> It needs to be optimized a bit.

With this simple change, jhash_mix is *exactly* three times faster for me
on athlon-xp, whatever gcc I use (2.95.3 or 3.2.3), on the following do_hash()
function, and about 40% faster when used on local variables. I think that
gcc is not always smart enough to avoid assignments, and consumes memory
bandwidth and perhaps prevents better optimizations.

void do_hash(unsigned *a, unsigned *b, unsigned *c) {
__jhash_mix(*a, *b, *c);
}

This function is 189 bytes long, and takes about 72 cycles to complete with the
original macro, and is now 130 bytes long for about 24 cycles, which means about
1.5 operation/cycle... not bad :-)

I've just tested on other architectures, it's even more interesting :

- On a sparc ultra2/400, 100 million hashes drop from 38.8 seconds to 8.28 s (4.68x)
- And the best win : on Alpha EV6/466, it goes from 51.5 secs to 5.75 s (8.96x) !!!

I've checked that the results are the same on every arch, before and after the
modification.

I believe it's trivial enough to go into 2.4.21, don't you think ?

Regards,
Willy

--- linux-2.4/include/linux/jhash.h Sat May 10 11:36:02 2003
+++ /tmp/jhash.h Sat May 31 09:38:17 2003
@@ -23,15 +23,15 @@
/* NOTE: Arguments are modified. */
#define __jhash_mix(a, b, c) \
{ \
- a -= b; a -= c; a ^= (c>>13); \
- b -= c; b -= a; b ^= (a<<8); \
- c -= a; c -= b; c ^= (b>>13); \
- a -= b; a -= c; a ^= (c>>12); \
- b -= c; b -= a; b ^= (a<<16); \
- c -= a; c -= b; c ^= (b>>5); \
- a -= b; a -= c; a ^= (c>>3); \
- b -= c; b -= a; b ^= (a<<10); \
- c -= a; c -= b; c ^= (b>>15); \
+ a = (a - b - c) ^ (c>>13); \
+ b = (b - c - a) ^ (a<<8); \
+ c = (c - a - b) ^ (b>>13); \
+ a = (a - b - c) ^ (c>>12); \
+ b = (b - c - a) ^ (a<<16); \
+ c = (c - a - b) ^ (b>>5); \
+ a = (a - b - c) ^ (c>>3); \
+ b = (b - c - a) ^ (a<<10); \
+ c = (c - a - b) ^ (b>>15); \
}

/* The golden ration: an arbitrary value */

2003-05-31 08:01:46

by David Miller

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Willy TARREAU <[email protected]>
Date: Sat, 31 May 2003 10:02:05 +0200

With this simple change, jhash_mix is *exactly* three times faster
for me on athlon-xp, whatever gcc I use (2.95.3 or 3.2.3), on the
following do_hash() function, and about 40% faster when used on
local variables.

Interesting :-)

This function is 189 bytes long, and takes about 72 cycles to
complete with the original macro, and is now 130 bytes long for
about 24 cycles, which means about 1.5 operation/cycle... not bad :-)

__jhash_mix takes ~23 cycles on sparc64 in the original version for
me. I get the same measurement for your version. Maybe your gcc
version just stinks :-(

Oh wait, yes, it's the memory operations it can't eliminate.
It can't do that because it has no idea whether certain pointers
alias or not. (ie. it doesn't know whether 'a' and 'b' point
to the same value)

Since all the networking versions work on local variables, in
2.4.x it shouldn't matter performance wise.

You'll note that my updated dcache jenkins patch for 2.5.x
brought the hash->words[] variables into locals before running
__jhash_mix() on it. So it shouldn't matter there either.

2003-05-31 08:43:14

by Willy Tarreau

[permalink] [raw]
Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Sat, May 31, 2003 at 01:12:10AM -0700, David S. Miller wrote:
> From: Willy TARREAU <[email protected]>
> Date: Sat, 31 May 2003 10:02:05 +0200
>
> With this simple change, jhash_mix is *exactly* three times faster
> for me on athlon-xp, whatever gcc I use (2.95.3 or 3.2.3), on the
> following do_hash() function, and about 40% faster when used on
> local variables.
>
> Interesting :-)

Not that much finally, because I cannot reproduce the slowdown I initially
observed with the original function on local variables. I think I may have
had a wrong type somewhere or a particular operation which prevented gcc
from fully optimizing the macro :-/

So at the moment, both the original and my version are the same speed on
local variables on athlon-xp.

BTW, I don't understand why, but I've just noticed that the Alpha is 25%
slower on my version when used on local variables !

So my version is only interesting when working on indirect variables, which
is never the case in the kernel... Never mind, that was just a try.

For info, here's what I measure here on the original version :

- athlon-xp : 24 cycles, whatever -march/-mcpu
- alpha ev6 : -mcpu=ev5 : 22 cycles
-mcpu=ev6 : 24 cycles
- sparc64 : default : 26 cycles
-mcpu=v9 : 24 cycles
-mcpu=ultrasparc: 19 cycles (18 cycles for my version here)

Considering that the Alpha and the UltraSparc can issue up to 4 instruction
per cycle, I wonder whether it would be worse trying to implement such a hash
in assembler, optimized for each processor, with a default fall-back to the
C function for archs that have not been implemented.

Cheers,
Willy

2003-05-31 08:44:53

by David Schwartz

[permalink] [raw]
Subject: RE: Algoritmic Complexity Attacks and 2.4.20 the dcache code


> __jhash_mix takes ~23 cycles on sparc64 in the original version for
> me. I get the same measurement for your version. Maybe your gcc
> version just stinks :-(

> Oh wait, yes, it's the memory operations it can't eliminate.
> It can't do that because it has no idea whether certain pointers
> alias or not. (ie. it doesn't know whether 'a' and 'b' point
> to the same value)

It's a macro, and the way it inlines, it should be obvious in most cases
that 'a', 'b', and 'c' can't be in the same place in memory.

I've been messing with optimizing the Jenkins hash lately. I've found two
interesting optimizations.

One is to check if the input pointer happens to be aligned on a double-word
boundary, and if so optimize the inner loop to use dword fetches. Even most
strings, assuming you hash from the beginning, will be aligned on a dword
boundary. For very short strings, this has no effect. The longer the string,
the more of an affect this has. Basically, you change this:

// handle most of the key
while(len >= 12)
{
a += ((unsigned)ptr[0] + ((unsigned)ptr[1]<<8) +
((unsigned)ptr[2]<<16) + ((unsigned)ptr[3]<<24));
b += ((unsigned)ptr[4] + ((unsigned)ptr[5]<<8) +
((unsigned)ptr[6]<<16) + ((unsigned)ptr[7]<<24));
c += ((unsigned)ptr[8] + ((unsigned)ptr[9]<<8) +
((unsigned)ptr[10]<<16) + ((unsigned)ptr[11]<<24));
J_MIX(a, b, c);
ptr += 12;
len -= 12;
}

to:

// handle most of the key
if( (((unsigned)data)&3) == 0)
{
while(len >= 12)
{
a += * (unsigned *) ptr;
b += *(((unsigned *) ptr)+1);
c += *(((unsigned *) ptr)+2);
J_MIX(a, b, c);
ptr += 12;
len -= 12;
}
}
else
{
while(len >= 12)
{
a += ((unsigned)ptr[0] + ((unsigned)ptr[1]<<8) +
((unsigned)ptr[2]<<16) + ((unsigned)ptr[3]<<24));
b += ((unsigned)ptr[4] + ((unsigned)ptr[5]<<8) +
((unsigned)ptr[6]<<16) + ((unsigned)ptr[7]<<24));
c += ((unsigned)ptr[8] + ((unsigned)ptr[9]<<8) +
((unsigned)ptr[10]<<16) + ((unsigned)ptr[11]<<24));
J_MIX(a, b, c);
ptr += 12;
len -= 12;
}
}

How much this helps depends upon how often your inputs are dword aligned.
If you're hashing strings from the beginning, they almost always are.

The other is to rework the switch/case statement into a tree of 'if(len>0)
{ len--; ...'. You have to reverse the order of the additions, basically
changing:

switch(len) // all the case statements fall through
{
    case 11: c += ((unsigned)ptr[10]<<24);
    case 10: c += ((unsigned)ptr[9]<<16);
    case 9 : c += ((unsigned)ptr[8]<<8);
        /* the first byte of c is reserved for the length */
    case 8 : b += ((unsigned)ptr[7]<<24);
    case 7 : b += ((unsigned)ptr[6]<<16);
    case 6 : b += ((unsigned)ptr[5]<<8);
    case 5 : b += ptr[4];
    case 4 : a += ((unsigned)ptr[3]<<24);
    case 3 : a += ((unsigned)ptr[2]<<16);
    case 2 : a += ((unsigned)ptr[1]<<8);
    case 1 : a += ptr[0];
    // case 0 : // nothing left to add
}

to

if(len!=0) { len--; a+=ptr[0];
if(len!=0) { len--; a+=((unsigned)ptr[1]<<8);
if(len!=0) { len--; a+=((unsigned)ptr[2]<<16);
if(len!=0) { len--; a+=((unsigned)ptr[3]<<24);
if(len!=0) { len--; b+=ptr[4];
if(len!=0) { len--; b+=((unsigned)ptr[5]<<8);
if(len!=0) { len--; b+=((unsigned)ptr[6]<<16);
if(len!=0) { len--; b+=((unsigned)ptr[7]<<24);
if(len!=0) { len--; c+=((unsigned)ptr[8]<<8);
if(len!=0) { len--; c+=((unsigned)ptr[9]<<16);
if(len!=0) { len--; c+=((unsigned)ptr[10]<<24);
} } } } } } } } } } }

Yeah, it's uglier, and you'd think the extra conditionals would slow things
down, but actually the branches predict better and the comparisons are
cheaper. The original switch/case statement translates into a
subtract/compare that buys you nothing (the len >= 12 range check was
already settled by the loop above) followed by a totally unpredictable
indirect jump. With my version, the most frequently executed comparisons
are the easiest to predict (assuming a random distribution of lengths).

I've confirmed by benchmark that both of these are in fact optimizations
(on my processor/compiler/options/etcetera). YMMV.

DS

2003-05-31 08:47:58

by David Miller

Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: Willy Tarreau <[email protected]>
Date: Sat, 31 May 2003 10:56:12 +0200

Considering that the Alpha and the UltraSparc can issue up to 4
instructions per cycle,

Ultrasparc only has 2 integer units. So it really can only do 2
integer operations per cycle. GCC is giving it an optimal schedule
for -mtune=ultrasparc, I know because I wrote that instruction
scheduler :-)

You can get 4 issue if you're doing floating point stuff.

I believe the current generation Alpha has 3 integer units.
GCC should be doing a good job there too.

2003-05-31 08:50:31

by David Miller

Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

From: "David Schwartz" <[email protected]>
Date: Sat, 31 May 2003 01:58:09 -0700


It's a macro, and the way it inlines, it should be obvious in
most cases that 'a', 'b', and 'c' can't be in the same place in memory.

Not true at all in Willy's test case, which was:

void test(u32 *a, u32 *b, u32 *c)
{
    __jhash_mix(*a, *b, *c);
}

And here you certainly have absolutely NO idea where a, b, and
c might point to.

One is to check if the input pointer happens to be aligned on
a double-word boundary,

The most generic jhash() frankly isn't very interesting for
kernel applications; 99% of the uses there can use the u32-optimized
version.

Even for dcache strings we can't use it because for each character
we have to test it against some terminating value or translate
it somehow.

I wouldn't waste any time at all on this thing.

2003-05-31 18:27:40

by Aaron Lehmann

Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Fri, May 30, 2003 at 11:45:29PM -0700, David S. Miller wrote:
> From: William Lee Irwin III <[email protected]>
> Date: Fri, 30 May 2003 23:41:38 -0700
>
> If it's literally that trivial I'll put digging around the machine
> descriptions on my TODO list.
>
> Look at TARGET_RTX_COSTS, thats where all of this happens.

Reading the code that handles this stuff (expmed.c) always cracks me up.


/* We might want to refine this now that we have division-by-constant
optimization. Since expand_mult_highpart tries so many variants, it is
not straightforward to generalize this. Maybe we should make an array
of possible modes in init_expmed? Save this for GCC 2.7. */

/* We could just as easily deal with negative constants here,
but it does not seem worth the trouble for GCC 2.6. */

/* This is extremely similar to the code for the unsigned case
above. For 2.7 we should merge these variants, but for
2.6.1 I don't want to touch the code for unsigned since that
get used in C. The signed case will only be used by other
languages (Ada). */

Sometimes I wish the gcc code was tame enough for me to work on.

2003-06-01 01:01:40

by Daniel Phillips

Subject: Re: Algoritmic Complexity Attacks and 2.4.20 the dcache code

On Thursday 29 May 2003 22:42, Scott A Crosby wrote:
> The solution for these attacks on hash tables is to make the hash
> function unpredictable via a technique known as universal
> hashing. Universal hashing is a keyed hash function where, based on
> the key, one of a large set of hash functions is chosen. When
> benchmarking, we observe that for short or medium length inputs, it is
> comparable in performance to simple predictable hash functions such as
> the ones in Python or Perl. Our paper has graphs and charts of our
> benchmarked performance.

You should have said "a solution", not "the solution", above. For the Ext2/3
HTree index, we found a rather nice solution that varies the hash dispersion
pattern on a per-volume basis, in a way that's difficult for a DOSer to
reconstruct (please feel free to analyze this approach and find a hole, if
there is one).

This is much simpler than what you propose. As we are talking core kernel
here, adding an extra spiderweb of complexity in the form of multiple hash
algorithms really isn't appealing, if it can be avoided. Not to mention the
overhead of indexing into the correct hash algorithm on each lookup.

> I highly advise using a universal hashing library, either our own or
> someone elses. As is historically seen, it is very easy to make silly
> mistakes when attempting to implement your own 'secure' algorithm.

Translation: adding bloat is often the easy way out for the lazy. Anyway,
thanks for your analysis, but I disagree with your recommendation.

Regards,

Daniel