2012-01-11 00:00:46

by Alexey Dobriyan

[permalink] [raw]
Subject: sha512: make it work, undo percpu message schedule

commit f9e2bca6c22d75a289a349f869701214d63b5060
aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
created global message schedule area.

If sha512_update will ever be entered twice, hilarity ensures.

Probably the easiest way to notice incorrect hashes being calculated is
to run 2 ping floods over AH with hmac(sha512):

#!/usr/sbin/setkey -f
flush;
spdflush;
add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
spdadd IP1 IP2 any -P out ipsec ah/transport//require;
spdadd IP2 IP1 any -P in ipsec ah/transport//require;

XfrmInStateProtoError will start ticking with -EBADMSG being returned
from ah_input(). This never happens with, say, hmac(sha1).

I personally don't understand this changelog entry:

"The message schedule W (u64[80]) is too big for the stack."

Hash context is dynamically allocated.

W[80] stack usage can be trimmed like in SHA-1, this is for separate patch.

P. S.: If only one ping flood runs, XfrmInStateProtoError will also tick
but very slowly. I personally can't explain this.

With patch applied (on BOTH sides), XfrmInStateProtoError does not tick
with multiple bidirectional ping flood streams like it doesn't tick
with SHA-1.

Signed-off-by: Alexey Dobriyan <[email protected]>
---

crypto/sha512_generic.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
#include <linux/percpu.h>
#include <asm/byteorder.h>

-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
static inline u64 Ch(u64 x, u64 y, u64 z)
{
return z ^ (x & (y ^ z));
@@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;

int i;
- u64 *W = get_cpu_var(msg_schedule);
+ u64 W[80];

/* load the input */
for (i = 0; i < 16; i++)
@@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)

/* erase our data */
a = b = c = d = e = f = g = h = t1 = t2 = 0;
- memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
- put_cpu_var(msg_schedule);
}

static int


2012-01-11 00:12:36

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> - memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));

And, yes, this is intentional -- modern gcc pisses on stone age data clearing.

2012-01-11 00:36:16

by Herbert Xu

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> commit f9e2bca6c22d75a289a349f869701214d63b5060
> aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> created global message schedule area.
>
> If sha512_update will ever be entered twice, hilarity ensures.

Hmm, do you know why this happens? On the face of it this shouldn't
be possible as preemption is disabled.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-11 01:18:33

by Adrian-Ken Rueegsegger

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On 01/11/2012 01:00 AM, Alexey Dobriyan wrote:
> commit f9e2bca6c22d75a289a349f869701214d63b5060
> aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> created global message schedule area.

[snip]

> I personally don't understand this changelog entry:
>
> "The message schedule W (u64[80]) is too big for the stack."
>
> Hash context is dynamically allocated.

My original patch did the same thing as yours and put the message
schedule on the stack in sha512_transform. Herbert argued [1], that it
was too big and suggested to store it in a static per-cpu area.

[1] - http://www.mail-archive.com/[email protected]/msg02527.html

2012-01-12 23:55:21

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote:
> On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> > commit f9e2bca6c22d75a289a349f869701214d63b5060
> > aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> > created global message schedule area.
> >
> > If sha512_update will ever be entered twice, hilarity ensures.
>
> Hmm, do you know why this happens? On the face of it this shouldn't
> be possible as preemption is disabled.

Herbert, I couldn't come up with a single scenario. :-(
But the bug is easy to reproduce.

2012-01-13 00:19:46

by Herbert Xu

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
>
> Herbert, I couldn't come up with a single scenario. :-(
> But the bug is easy to reproduce.

OK, I'll try to reproduce it here.

Thanks!
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-13 06:22:56

by Steffen Klassert

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote:
> On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> > commit f9e2bca6c22d75a289a349f869701214d63b5060
> > aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> > created global message schedule area.
> >
> > If sha512_update will ever be entered twice, hilarity ensures.
>
> Hmm, do you know why this happens? On the face of it this shouldn't
> be possible as preemption is disabled.
>

I did not try to reproduce, but this looks like a race of the 'local out'
and the receive packet path. On 'lokal out' bottom halves are enabled,
so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update.
The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the
hash value. My guess could be checked easily by disabling the bottom halves
before the percpu value is fetched.

2012-01-13 06:46:12

by Herbert Xu

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 07:22:56AM +0100, Steffen Klassert wrote:
>
> I did not try to reproduce, but this looks like a race of the 'local out'
> and the receive packet path. On 'lokal out' bottom halves are enabled,
> so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update.
> The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the
> hash value. My guess could be checked easily by disabling the bottom halves
> before the percpu value is fetched.

Good point. I'll send out a patch for Alexey to test.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-13 06:48:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

Le vendredi 13 janvier 2012 à 07:22 +0100, Steffen Klassert a écrit :
> On Wed, Jan 11, 2012 at 11:36:11AM +1100, Herbert Xu wrote:
> > On Wed, Jan 11, 2012 at 03:00:40AM +0300, Alexey Dobriyan wrote:
> > > commit f9e2bca6c22d75a289a349f869701214d63b5060
> > > aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> > > created global message schedule area.
> > >
> > > If sha512_update will ever be entered twice, hilarity ensures.
> >
> > Hmm, do you know why this happens? On the face of it this shouldn't
> > be possible as preemption is disabled.
> >
>
> I did not try to reproduce, but this looks like a race of the 'local out'
> and the receive packet path. On 'lokal out' bottom halves are enabled,
> so could be interrupted by the NET_RX_SOFTIRQ while doing a sha512_update.
> The NET_RX_SOFTIRQ could invoke sha512_update too, that would corrupt the
> hash value. My guess could be checked easily by disabling the bottom halves
> before the percpu value is fetched.

Good catch. It can be generalized to any interrupts (soft and hard)

Another solution is using two blocks, one used from interrupt context.

static DEFINE_PER_CPU(u64[80], msg_schedule);
static DEFINE_PER_CPU(u64[80], msg_schedule_irq);

(Like we do for SNMP mibs on !x86 arches)

2012-01-13 06:50:48

by Herbert Xu

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 07:48:38AM +0100, Eric Dumazet wrote:
>
> Another solution is using two blocks, one used from interrupt context.
>
> static DEFINE_PER_CPU(u64[80], msg_schedule);
> static DEFINE_PER_CPU(u64[80], msg_schedule_irq);
>
> (Like we do for SNMP mibs on !x86 arches)

Yes that sounds good. Thanks Eric!
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-13 07:08:16

by Herbert Xu

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
>
> Herbert, I couldn't come up with a single scenario. :-(
> But the bug is easy to reproduce.

OK, does this patch work for you?

commit 31f4e55c09c1170f8b813c14b1299b70f50db414
Author: Herbert Xu <[email protected]>
Date: Fri Jan 13 18:06:50 2012 +1100

crypto: sha512 - Fix msg_schedule race

The percpu msg_schedule setup was unsafe as a user in a process
context can be interrupted by a softirq user which would then
scribble over the exact same work area. This was discovered by
Steffen Klassert.

This patch based on ideas from Eric Dumazet fixes this by using
two independent work areas.

Reported-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Herbert Xu <[email protected]>

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..a49c457 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -11,6 +11,7 @@
*
*/
#include <crypto/internal/hash.h>
+#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/mm.h>
@@ -21,7 +22,9 @@
#include <linux/percpu.h>
#include <asm/byteorder.h>

-static DEFINE_PER_CPU(u64[80], msg_schedule);
+#define SHA512_SCHEDULE_SIZE 80
+
+static DEFINE_PER_CPU(u64[SHA512_SCHEDULE_SIZE * 2], msg_schedule);

static inline u64 Ch(u64 x, u64 y, u64 z)
{
@@ -89,7 +92,8 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;

int i;
- u64 *W = get_cpu_var(msg_schedule);
+ u64 *W = get_cpu_var(msg_schedule) +
+ SHA512_SCHEDULE_SIZE * !!in_interrupt();

/* load the input */
for (i = 0; i < 16; i++)
@@ -128,7 +132,7 @@ sha512_transform(u64 *state, const u8 *input)

/* erase our data */
a = b = c = d = e = f = g = h = t1 = t2 = 0;
- memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
+ memset(W, 0, SHA512_SCHEDULE_SIZE * sizeof(*W));
put_cpu_var(msg_schedule);
}

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-13 09:45:25

by David Laight

[permalink] [raw]
Subject: RE: sha512: make it work, undo percpu message schedule


> Good catch. It can be generalized to any interrupts (soft and hard)
>
> Another solution is using two blocks, one used from interrupt context.
>
> static DEFINE_PER_CPU(u64[80], msg_schedule);
> static DEFINE_PER_CPU(u64[80], msg_schedule_irq);
>
> (Like we do for SNMP mibs on !x86 arches)

Don't you need one block per interrupt level ?

It also means that the functions need to know where they
are being called from - which may not be true.

A thought is that if you are going to reserve this memory
for each cpu, it might as well be reserved as part of the
interrupt stack (assuming interrupts switch stacks).

Process level code would still need to use a reserved buffer.

David

2012-01-13 10:35:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

Le vendredi 13 janvier 2012 à 18:08 +1100, Herbert Xu a écrit :
> On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
> >
> > Herbert, I couldn't come up with a single scenario. :-(
> > But the bug is easy to reproduce.
>
> OK, does this patch work for you?
>
> commit 31f4e55c09c1170f8b813c14b1299b70f50db414
> Author: Herbert Xu <[email protected]>
> Date: Fri Jan 13 18:06:50 2012 +1100
>
> crypto: sha512 - Fix msg_schedule race
>
> The percpu msg_schedule setup was unsafe as a user in a process
> context can be interrupted by a softirq user which would then
> scribble over the exact same work area. This was discovered by
> Steffen Klassert.
>
> This patch based on ideas from Eric Dumazet fixes this by using
> two independent work areas.
>
> Reported-by: Alexey Dobriyan <[email protected]>
> Signed-off-by: Herbert Xu <[email protected]>
>

I wonder ...

With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
this function ?

What about following patch instead ?

(Trying a dynamic memory allocation, and fallback on a single
pre-allocated bloc of memory, shared by all cpus, protected by a
spinlock)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..5c80a76 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,7 +21,6 @@
#include <linux/percpu.h>
#include <asm/byteorder.h>

-static DEFINE_PER_CPU(u64[80], msg_schedule);

static inline u64 Ch(u64 x, u64 y, u64 z)
{
@@ -87,10 +86,16 @@ static void
sha512_transform(u64 *state, const u8 *input)
{
u64 a, b, c, d, e, f, g, h, t1, t2;
-
+ static u64 msg_schedule[80];
+ static DEFINE_SPINLOCK(msg_schedule_lock);
int i;
- u64 *W = get_cpu_var(msg_schedule);
+ u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);

+
+ if (!W) {
+ spin_lock_bh(&msg_schedule_lock);
+ W = msg_schedule;
+ }
/* load the input */
for (i = 0; i < 16; i++)
LOAD_OP(i, W, input);
@@ -128,8 +133,11 @@ sha512_transform(u64 *state, const u8 *input)

/* erase our data */
a = b = c = d = e = f = g = h = t1 = t2 = 0;
- memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
- put_cpu_var(msg_schedule);
+ memset(W, 0, sizeof(msg_schedule));
+ if (W == msg_schedule)
+ spin_unlock_bh(&msg_schedule_lock);
+ else
+ kfree(W);
}

static int

2012-01-13 10:41:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

Le vendredi 13 janvier 2012 à 11:35 +0100, Eric Dumazet a écrit :

> I wonder ...
>
> With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
> this function ?
>
> What about following patch instead ?
>
> (Trying a dynamic memory allocation, and fallback on a single
> pre-allocated bloc of memory, shared by all cpus, protected by a
> spinlock)
>
> diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
> index 9ed9f60..5c80a76 100644
> --- a/crypto/sha512_generic.c
> +++ b/crypto/sha512_generic.c
> @@ -21,7 +21,6 @@
> #include <linux/percpu.h>
> #include <asm/byteorder.h>
>
> -static DEFINE_PER_CPU(u64[80], msg_schedule);
>
> static inline u64 Ch(u64 x, u64 y, u64 z)
> {
> @@ -87,10 +86,16 @@ static void
> sha512_transform(u64 *state, const u8 *input)
> {
> u64 a, b, c, d, e, f, g, h, t1, t2;
> -
> + static u64 msg_schedule[80];
> + static DEFINE_SPINLOCK(msg_schedule_lock);
> int i;
> - u64 *W = get_cpu_var(msg_schedule);
> + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
>

And a plain kmalloc() is enough, since we fully initialize the array a
bit later.

for (i = 0; i < 16; i++)
LOAD_OP(i, W, input);
for (i = 16; i < 80; i++) {
BLEND_OP(i, W);
}

2012-01-13 10:57:31

by Eric Dumazet

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

Le vendredi 13 janvier 2012 à 11:41 +0100, Eric Dumazet a écrit :

> And a plain kmalloc() is enough, since we fully initialize the array a
> bit later.
>
> for (i = 0; i < 16; i++)
> LOAD_OP(i, W, input);
> for (i = 16; i < 80; i++) {
> BLEND_OP(i, W);
> }


Here is an official patch ;)

[PATCH v3] crypto: sha512 - Fix msg_schedule race

The percpu msg_schedule setup was unsafe as a user in a process
context can be interrupted by a softirq user which would then
scribble over the exact same work area.

Bug reported by Alexey Dobriyan, and diagnosed by Steffen Klassert.

Bug added in commit f9e2bca6c22 (crypto: sha512 - Move message schedule
W[80] to static percpu area)

Try a dynamic memory allocation, and fallback using a single static
area, guarded by a spinlock.

This removes use of percpu memory.

Reported-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
CC: Herbert Xu <[email protected]>
CC: Adrian-Ken Rueegsegger <[email protected]>
CC: Steffen Klassert <[email protected]>
---
crypto/sha512_generic.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 9ed9f60..b52ef9b 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -18,10 +18,8 @@
#include <linux/crypto.h>
#include <linux/types.h>
#include <crypto/sha.h>
-#include <linux/percpu.h>
#include <asm/byteorder.h>

-static DEFINE_PER_CPU(u64[80], msg_schedule);

static inline u64 Ch(u64 x, u64 y, u64 z)
{
@@ -87,10 +85,15 @@ static void
sha512_transform(u64 *state, const u8 *input)
{
u64 a, b, c, d, e, f, g, h, t1, t2;
-
+ static u64 msg_schedule[80];
+ static DEFINE_SPINLOCK(msg_schedule_lock);
int i;
- u64 *W = get_cpu_var(msg_schedule);
+ u64 *W = kmalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);

+ if (!W) {
+ spin_lock_bh(&msg_schedule_lock);
+ W = msg_schedule;
+ }
/* load the input */
for (i = 0; i < 16; i++)
LOAD_OP(i, W, input);
@@ -128,8 +131,11 @@ sha512_transform(u64 *state, const u8 *input)

/* erase our data */
a = b = c = d = e = f = g = h = t1 = t2 = 0;
- memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
- put_cpu_var(msg_schedule);
+ memset(W, 0, sizeof(msg_schedule));
+ if (W == msg_schedule)
+ spin_unlock_bh(&msg_schedule_lock);
+ else
+ kfree(W);
}

static int

2012-01-13 11:02:55

by Steffen Klassert

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 11:35:42AM +0100, Eric Dumazet wrote:
> Le vendredi 13 janvier 2012 ? 18:08 +1100, Herbert Xu a ?crit :
> > On Fri, Jan 13, 2012 at 02:55:14AM +0300, Alexey Dobriyan wrote:
> > >
> > > Herbert, I couldn't come up with a single scenario. :-(
> > > But the bug is easy to reproduce.
> >
> > OK, does this patch work for you?
> >
> > commit 31f4e55c09c1170f8b813c14b1299b70f50db414
> > Author: Herbert Xu <[email protected]>
> > Date: Fri Jan 13 18:06:50 2012 +1100
> >
> > crypto: sha512 - Fix msg_schedule race
> >
> > The percpu msg_schedule setup was unsafe as a user in a process
> > context can be interrupted by a softirq user which would then
> > scribble over the exact same work area. This was discovered by
> > Steffen Klassert.
> >
> > This patch based on ideas from Eric Dumazet fixes this by using
> > two independent work areas.
> >
> > Reported-by: Alexey Dobriyan <[email protected]>
> > Signed-off-by: Herbert Xu <[email protected]>
> >
>
> I wonder ...
>
> With 4096 cpus, do we really want to reserve 5242880 bytes of memory for
> this function ?
>
> What about following patch instead ?
>
> (Trying a dynamic memory allocation, and fallback on a single
> pre-allocated bloc of memory, shared by all cpus, protected by a
> spinlock)

If we want to do dynamic memory allocation, we could place it to the
shash_desc context where we already store the intermediate digest value.
This is preallocated anyway, so we don't need to do another allocation.

2012-01-13 11:33:44

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On 1/13/12, Eric Dumazet <[email protected]> wrote:

> + static u64 msg_schedule[80];
> + static DEFINE_SPINLOCK(msg_schedule_lock);

No guys, no.

SHA-512 only needs u64[16] running window for message scheduling.

I'm sending whitespace mangled patch which is only tested
with selfcryptotest passed, so you won't apply something complex.

Stackspace usage drops down to like this:

-139: 48 81 ec c8 02 00 00 sub $0x2c8,%rsp
+136: 48 81 ec 18 01 00 00 sub $0x118,%rsp

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
#include <linux/percpu.h>
#include <asm/byteorder.h>

-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
static inline u64 Ch(u64 x, u64 y, u64 z)
{
return z ^ (x & (y ^ z));
@@ -80,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)

static inline void BLEND_OP(int I, u64 *W)
{
- W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+ W[I%16] = s1(W[(I-2)%16]) + W[(I-7)%16] + s0(W[(I-15)%16]) + W[I%16];
}

static void
@@ -89,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;

int i;
- u64 *W = get_cpu_var(msg_schedule);
+ u64 W[16];

/* load the input */
for (i = 0; i < 16; i++)
LOAD_OP(i, W, input);

- for (i = 16; i < 80; i++) {
- BLEND_OP(i, W);
- }
-
/* load the state into our registers */
a=state[0]; b=state[1]; c=state[2]; d=state[3];
e=state[4]; f=state[5]; g=state[6]; h=state[7];

- /* now iterate */
- for (i=0; i<80; i+=8) {
- t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ];
- t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
- t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
- t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
- t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
- t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
- t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
- t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
- t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
- t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
- t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
- t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
- t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
- t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
- t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
- t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
+ t2 = e0(a) + Maj(a,b,c); \
+ d += t1; \
+ h = t1 + t2
+
+#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
+ BLEND_OP(i, W); \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16]; \
+ t2 = e0(a) + Maj(a,b,c); \
+ d += t1; \
+ h = t1 + t2
+
+ for (i = 0; i < 16; i += 8) {
+ SHA512_0_15(i, a, b, c, d, e, f, g, h);
+ SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
+ SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
+ SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
+ SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
+ SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
+ SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
+ SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
+ }
+ for (i = 16; i < 80; i += 8) {
+ SHA512_16_79(i, a, b, c, d, e, f, g, h);
+ SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
+ SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
+ SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
+ SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
+ SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
+ SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
+ SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
}

state[0] += a; state[1] += b; state[2] += c; state[3] += d;
@@ -128,8 +136,6 @@ sha512_transform(u64 *state, const u8 *input)

/* erase our data */
a = b = c = d = e = f = g = h = t1 = t2 = 0;
- memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
- put_cpu_var(msg_schedule);
}

static int

2012-01-13 11:45:53

by David Laight

[permalink] [raw]
Subject: RE: sha512: make it work, undo percpu message schedule


> Trying a dynamic memory allocation, and fallback on a single
> pre-allocated bloc of memory, shared by all cpus, protected by a
> spinlock
...
> -
> + static u64 msg_schedule[80];
> + static DEFINE_SPINLOCK(msg_schedule_lock);
> int i;
> - u64 *W = get_cpu_var(msg_schedule);
> + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC |
__GFP_NOWARN);
>
> +
> + if (!W) {
> + spin_lock_bh(&msg_schedule_lock);
> + W = msg_schedule;
> + }

If this code can be called from an ISR is the kmalloc()
call safe?

If the above is safe, wouldn't it be better to:
1) try to use the static buffer
2) try to kalloc() a buffer
3) spinwait for the static buffer to be free

David

2012-01-13 12:34:17

by Eric Dumazet

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

Le vendredi 13 janvier 2012 à 13:33 +0200, Alexey Dobriyan a écrit :
> On 1/13/12, Eric Dumazet <[email protected]> wrote:
>
> > + static u64 msg_schedule[80];
> > + static DEFINE_SPINLOCK(msg_schedule_lock);
>
> No guys, no.
>
> SHA-512 only needs u64[16] running window for message scheduling.
>
> I'm sending whitespace mangled patch which is only tested
> with selfcryptotest passed, so you won't apply something complex.
>
> Stackspace usage drops down to like this:
>
> -139: 48 81 ec c8 02 00 00 sub $0x2c8,%rsp
> +136: 48 81 ec 18 01 00 00 sub $0x118,%rsp
>
> --- a/crypto/sha512_generic.c
> +++ b/crypto/sha512_generic.c
> @@ -21,8 +21,6 @@
> #include <linux/percpu.h>
> #include <asm/byteorder.h>
>
> -static DEFINE_PER_CPU(u64[80], msg_schedule);
> -
> static inline u64 Ch(u64 x, u64 y, u64 z)
> {
> return z ^ (x & (y ^ z));
> @@ -80,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)
>
> static inline void BLEND_OP(int I, u64 *W)
> {
> - W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
> + W[I%16] = s1(W[(I-2)%16]) + W[(I-7)%16] + s0(W[(I-15)%16]) + W[I%16];
> }
>
> static void
> @@ -89,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
> u64 a, b, c, d, e, f, g, h, t1, t2;
>
> int i;
> - u64 *W = get_cpu_var(msg_schedule);
> + u64 W[16];
>
> /* load the input */
> for (i = 0; i < 16; i++)
> LOAD_OP(i, W, input);
>
> - for (i = 16; i < 80; i++) {
> - BLEND_OP(i, W);
> - }
> -
> /* load the state into our registers */
> a=state[0]; b=state[1]; c=state[2]; d=state[3];
> e=state[4]; f=state[5]; g=state[6]; h=state[7];
>
> - /* now iterate */
> - for (i=0; i<80; i+=8) {
> - t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ];
> - t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
> - t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
> - t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
> - t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
> - t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
> - t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
> - t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
> - t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
> - t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
> - t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
> - t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
> - t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
> - t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
> - t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
> - t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
> +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
> + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
> + t2 = e0(a) + Maj(a,b,c); \
> + d += t1; \
> + h = t1 + t2
> +
> +#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
> + BLEND_OP(i, W); \
> + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16]; \
> + t2 = e0(a) + Maj(a,b,c); \
> + d += t1; \
> + h = t1 + t2
> +
> + for (i = 0; i < 16; i += 8) {
> + SHA512_0_15(i, a, b, c, d, e, f, g, h);
> + SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
> + SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
> + SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
> + SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
> + SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
> + SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
> + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
> + }
> + for (i = 16; i < 80; i += 8) {
> + SHA512_16_79(i, a, b, c, d, e, f, g, h);
> + SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
> + SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
> + SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
> + SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
> + SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
> + SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
> + SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
> }
>
> state[0] += a; state[1] += b; state[2] += c; state[3] += d;
> @@ -128,8 +136,6 @@ sha512_transform(u64 *state, const u8 *input)
>
> /* erase our data */
> a = b = c = d = e = f = g = h = t1 = t2 = 0;
> - memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
> - put_cpu_var(msg_schedule);
> }
>
> static int


Even if its true, its not stable material.

stable teams want obvious patches.

2012-01-13 12:35:28

by Eric Dumazet

[permalink] [raw]
Subject: RE: sha512: make it work, undo percpu message schedule

Le vendredi 13 janvier 2012 à 11:45 +0000, David Laight a écrit :
> > Trying a dynamic memory allocation, and fallback on a single
> > pre-allocated bloc of memory, shared by all cpus, protected by a
> > spinlock
> ...
> > -
> > + static u64 msg_schedule[80];
> > + static DEFINE_SPINLOCK(msg_schedule_lock);
> > int i;
> > - u64 *W = get_cpu_var(msg_schedule);
> > + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC |
> __GFP_NOWARN);
> >
> > +
> > + if (!W) {
> > + spin_lock_bh(&msg_schedule_lock);
> > + W = msg_schedule;
> > + }
>
> If this code can be called from an ISR is the kmalloc()
> call safe?
>

Yes, obviously, kmalloc() is IRQ safe.

> If the above is safe, wouldn't it be better to:
> 1) try to use the static buffer
> 2) try to kalloc() a buffer
> 3) spinwait for the static buffer to be free
>

No idea of what you mean, and why you think its better.

kmalloc() propably can give us a block already hot in cpu cache.

2012-01-14 18:20:31

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 01:34:13PM +0100, Eric Dumazet wrote:
> Le vendredi 13 janvier 2012 ? 13:33 +0200, Alexey Dobriyan a ?crit :
> > On 1/13/12, Eric Dumazet <[email protected]> wrote:
> >
> > > + static u64 msg_schedule[80];
> > > + static DEFINE_SPINLOCK(msg_schedule_lock);
> >
> > No guys, no.
> >
> > SHA-512 only needs u64[16] running window for message scheduling.
> >
> > I'm sending whitespace mangled patch which is only tested
> > with selfcryptotest passed, so you won't apply something complex.
> >
> > Stackspace usage drops down to like this:
> >
> > -139: 48 81 ec c8 02 00 00 sub $0x2c8,%rsp
> > +136: 48 81 ec 18 01 00 00 sub $0x118,%rsp
> >
> > --- a/crypto/sha512_generic.c
> > +++ b/crypto/sha512_generic.c
> > @@ -21,8 +21,6 @@
> > #include <linux/percpu.h>
> > #include <asm/byteorder.h>
> >
> > -static DEFINE_PER_CPU(u64[80], msg_schedule);
> > -
> > static inline u64 Ch(u64 x, u64 y, u64 z)
> > {
> > return z ^ (x & (y ^ z));
> > @@ -80,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)
> >
> > static inline void BLEND_OP(int I, u64 *W)
> > {
> > - W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
> > + W[I%16] = s1(W[(I-2)%16]) + W[(I-7)%16] + s0(W[(I-15)%16]) + W[I%16];
> > }
> >
> > static void
> > @@ -89,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
> > u64 a, b, c, d, e, f, g, h, t1, t2;
> >
> > int i;
> > - u64 *W = get_cpu_var(msg_schedule);
> > + u64 W[16];
> >
> > /* load the input */
> > for (i = 0; i < 16; i++)
> > LOAD_OP(i, W, input);
> >
> > - for (i = 16; i < 80; i++) {
> > - BLEND_OP(i, W);
> > - }
> > -
> > /* load the state into our registers */
> > a=state[0]; b=state[1]; c=state[2]; d=state[3];
> > e=state[4]; f=state[5]; g=state[6]; h=state[7];
> >
> > - /* now iterate */
> > - for (i=0; i<80; i+=8) {
> > - t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ];
> > - t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
> > - t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
> > - t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
> > - t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
> > - t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
> > - t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
> > - t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
> > - t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
> > - t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
> > - t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
> > - t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
> > - t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
> > - t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
> > - t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
> > - t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
> > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
> > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
> > + t2 = e0(a) + Maj(a,b,c); \
> > + d += t1; \
> > + h = t1 + t2
> > +
> > +#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
> > + BLEND_OP(i, W); \
> > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16]; \
> > + t2 = e0(a) + Maj(a,b,c); \
> > + d += t1; \
> > + h = t1 + t2
> > +
> > + for (i = 0; i < 16; i += 8) {
> > + SHA512_0_15(i, a, b, c, d, e, f, g, h);
> > + SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
> > + SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
> > + SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
> > + SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
> > + SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
> > + SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
> > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
> > + }
> > + for (i = 16; i < 80; i += 8) {
> > + SHA512_16_79(i, a, b, c, d, e, f, g, h);
> > + SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
> > + SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
> > + SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
> > + SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
> > + SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
> > + SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
> > + SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
> > }
> >
> > state[0] += a; state[1] += b; state[2] += c; state[3] += d;
> > @@ -128,8 +136,6 @@ sha512_transform(u64 *state, const u8 *input)
> >
> > /* erase our data */
> > a = b = c = d = e = f = g = h = t1 = t2 = 0;
> > - memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
> > - put_cpu_var(msg_schedule);
> > }
> >
> > static int
>
>
> Even if its true, its not stable material.
>
> stable teams want obvious patches.

I understand that.

But it _is_ obvious if you see what macro does and original code and
read FIPS 180-2 6.3.2 "SHA-512 Hash Computation" from where
it is obvious that W[i] depends on W[i - 16] and doesn't depend on
earlier values.

This stack reduction patch completely fixes original bug and
doesn't show any AH errors. Given the nature of crypto code
where one bit mistake ruins everything, it should be enough.

2012-01-14 18:27:44

by Alexey Dobriyan

[permalink] [raw]
Subject: [PATCH 1/3] sha512: make it work, undo percpu message schedule

commit f9e2bca6c22d75a289a349f869701214d63b5060
aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
created global message schedule area.

If sha512_update will ever be entered twice, hash will be silently
calculated incorrectly.

Probably the easiest way to notice incorrect hashes being calculated is
to run 2 ping floods over AH with hmac(sha512):

#!/usr/sbin/setkey -f
flush;
spdflush;
add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
spdadd IP1 IP2 any -P out ipsec ah/transport//require;
spdadd IP2 IP1 any -P in ipsec ah/transport//require;

XfrmInStateProtoError will start ticking with -EBADMSG being returned
from ah_input(). This never happens with, say, hmac(sha1).

With patch applied (on BOTH sides), XfrmInStateProtoError does not tick
with multiple bidirectional ping flood streams like it doesn't tick
with SHA-1.

After this patch sha512_transform() will start using ~750 bytes of stack on x86_64.
This is OK for simple loads, for something more heavy, stack reduction will be done
separatedly.

Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: [email protected]
---

crypto/sha512_generic.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -21,8 +21,6 @@
#include <linux/percpu.h>
#include <asm/byteorder.h>

-static DEFINE_PER_CPU(u64[80], msg_schedule);
-
static inline u64 Ch(u64 x, u64 y, u64 z)
{
return z ^ (x & (y ^ z));
@@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;

int i;
- u64 *W = get_cpu_var(msg_schedule);
+ u64 W[80];

/* load the input */
for (i = 0; i < 16; i++)
@@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)

/* erase our data */
a = b = c = d = e = f = g = h = t1 = t2 = 0;
- memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
- put_cpu_var(msg_schedule);
}

static int

2012-01-14 18:44:54

by Alexey Dobriyan

[permalink] [raw]
Subject: [PATCH 3/3] sha512: use standard ror64()

Use standard ror64() instead of hand-written.
There is no standard ror64, so create it.

The difference is shift value being "unsigned int" instead of uint64_t
(for which there is no reason). gcc starts to emit native ROR instructions
which it doesn't do for some reason currently. This should make the code
faster.

Patch survives in-tree crypto test and ping flood with hmac(sha512) on.

Signed-off-by: Alexey Dobriyan <[email protected]>
---

crypto/sha512_generic.c | 13 ++++---------
include/linux/bitops.h | 20 ++++++++++++++++++++
2 files changed, 24 insertions(+), 9 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -31,11 +31,6 @@ static inline u64 Maj(u64 x, u64 y, u64 z)
return (x & y) | (z & (x | y));
}

-static inline u64 RORu64(u64 x, u64 y)
-{
- return (x >> y) | (x << (64 - y));
-}
-
static const u64 sha512_K[80] = {
0x428a2f98d728ae22ULL, 0x7137449123ef65cdULL, 0xb5c0fbcfec4d3b2fULL,
0xe9b5dba58189dbbcULL, 0x3956c25bf348b538ULL, 0x59f111f1b605d019ULL,
@@ -66,10 +61,10 @@ static const u64 sha512_K[80] = {
0x5fcb6fab3ad6faecULL, 0x6c44198c4a475817ULL,
};

-#define e0(x) (RORu64(x,28) ^ RORu64(x,34) ^ RORu64(x,39))
-#define e1(x) (RORu64(x,14) ^ RORu64(x,18) ^ RORu64(x,41))
-#define s0(x) (RORu64(x, 1) ^ RORu64(x, 8) ^ (x >> 7))
-#define s1(x) (RORu64(x,19) ^ RORu64(x,61) ^ (x >> 6))
+#define e0(x) (ror64(x,28) ^ ror64(x,34) ^ ror64(x,39))
+#define e1(x) (ror64(x,14) ^ ror64(x,18) ^ ror64(x,41))
+#define s0(x) (ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7))
+#define s1(x) (ror64(x,19) ^ ror64(x,61) ^ (x >> 6))

static inline void LOAD_OP(int I, u64 *W, const u8 *input)
{
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -56,6 +56,26 @@ static inline unsigned long hweight_long(unsigned long w)
}

/**
+ * rol64 - rotate a 64-bit value left
+ * @word: value to rotate
+ * @shift: bits to roll
+ */
+static inline __u64 rol64(__u64 word, unsigned int shift)
+{
+ return (word << shift) | (word >> (64 - shift));
+}
+
+/**
+ * ror64 - rotate a 64-bit value right
+ * @word: value to rotate
+ * @shift: bits to roll
+ */
+static inline __u64 ror64(__u64 word, unsigned int shift)
+{
+ return (word >> shift) | (word << (64 - shift));
+}
+
+/**
* rol32 - rotate a 32-bit value left
* @word: value to rotate
* @shift: bits to roll

2012-01-14 18:40:57

by Alexey Dobriyan

[permalink] [raw]
Subject: [PATCH 2/3] sha512: reduce stack usage to safe number

For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15] and W[i - 16].
Consequently, keeping all W[80] array on stack is unnecessary,
only 16 values are really needed.

Using W[16] instead of W[80] greatly reduces stack usage
(~750 bytes to ~340 bytes on x86_64).

Line by line explanation:
* BLEND_OP
array is "circular" now, all indexes have to be modulo 16.
Round number is positive, so remainder operation should be
without surprises.

* initial full message scheduling is trimmed to first 16 values which
come from data block, the rest is calculated before it's needed.

* original loop body is unrolled version of new SHA512_0_15 and
SHA512_16_79 macros, unrolling was done to not do explicit variable
renaming. Otherwise it's the very same code after preprocessing.
See sha1_transform() code which does the same trick.

Patch survives in-tree crypto test and original bugreport test
(ping flood with hmac(sha512).

See FIPS 180-2 for SHA-512 definition
http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf

Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: [email protected]
---

This is patch is for stable if 750 byte stack usage is not
considered safe.

crypto/sha512_generic.c | 58 ++++++++++++++++++++++++++++--------------------
1 file changed, 34 insertions(+), 24 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)

static inline void BLEND_OP(int I, u64 *W)
{
- W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+ W[I % 16] += s1(W[(I-2) % 16]) + W[(I-7) % 16] + s0(W[(I-15) % 16]);
}

static void
@@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;

int i;
- u64 W[80];
+ u64 W[16];

/* load the input */
for (i = 0; i < 16; i++)
LOAD_OP(i, W, input);

- for (i = 16; i < 80; i++) {
- BLEND_OP(i, W);
- }
-
/* load the state into our registers */
a=state[0]; b=state[1]; c=state[2]; d=state[3];
e=state[4]; f=state[5]; g=state[6]; h=state[7];

- /* now iterate */
- for (i=0; i<80; i+=8) {
- t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ];
- t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
- t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
- t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
- t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
- t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
- t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
- t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
- t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
- t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
- t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
- t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
- t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
- t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
- t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
- t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
+ t2 = e0(a) + Maj(a, b, c); \
+ d += t1; \
+ h = t1 + t2
+
+#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
+ BLEND_OP(i, W); \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)%16]; \
+ t2 = e0(a) + Maj(a, b, c); \
+ d += t1; \
+ h = t1 + t2
+
+ for (i = 0; i < 16; i += 8) {
+ SHA512_0_15(i, a, b, c, d, e, f, g, h);
+ SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
+ SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
+ SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
+ SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
+ SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
+ SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
+ SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
+ }
+ for (i = 16; i < 80; i += 8) {
+ SHA512_16_79(i, a, b, c, d, e, f, g, h);
+ SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
+ SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
+ SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
+ SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
+ SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
+ SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
+ SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
}

state[0] += a; state[1] += b; state[2] += c; state[3] += d;

2012-01-14 19:09:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/3] sha512: reduce stack usage to safe number

On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan <[email protected]> wrote:
>
> Line by line explanation:
> * BLEND_OP
> ?array is "circular" now, all indexes have to be modulo 16.
> ?Round number is positive, so remainder operation should be
> ?without surprises.

Don't use "%" except on unsigned values. Even if it's positive, if
it's a signed number and the compiler doesn't *see* that it is
absolutely positive, division is nontrivial. Even when you divide by a
constant.

For example, "% 16" on an 'int' on x86-64 will generate

movl %edi, %edx
sarl $31, %edx
shrl $28, %edx
leal (%rdi,%rdx), %eax
andl $15, %eax
subl %edx, %eax

in order to get the signed case right. The fact that the end result is
correct for unsigned numbers is irrelevant: it's still stupid and
slow.

With an unsigned int, '% 16' will generate the obvious

andl $15, %eax

instead.

Quite frankly, stop using division in the first place. Dividing by
powers-of-two and expecting the compiler to fix things up is just
stupid, *exactly* because of issues like these: you either have to
think about it carefully, or the compiler may end up creating crap
code.

So just use "& 15" instead. That doesn't have these kinds of issues.
It is a *good* thing when the C code is close to the end result you
want to generate. It is *not* a good thing to write code that looks
nothing like the end result and just expect the compiler to do the
right thing. Even if the compiler does do the right thing, what was
the advantage?

Linus

2012-01-14 20:41:34

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 2/3] sha512: reduce stack usage to safe number

On Sat, Jan 14, 2012 at 11:08:45AM -0800, Linus Torvalds wrote:
> On Sat, Jan 14, 2012 at 10:40 AM, Alexey Dobriyan <[email protected]> wrote:
> >
> > Line by line explanation:
> > * BLEND_OP
> > ?array is "circular" now, all indexes have to be modulo 16.
> > ?Round number is positive, so remainder operation should be
> > ?without surprises.
>
> Don't use "%" except on unsigned values. Even if it's positive, if
> it's a signed number and the compiler doesn't *see* that it is
> absolutely positive, division is nontrivial. Even when you divide by a
> constant.
>
> For example, "% 16" on an 'int' on x86-64 will generate
>
> movl %edi, %edx
> sarl $31, %edx
> shrl $28, %edx
> leal (%rdi,%rdx), %eax
> andl $15, %eax
> subl %edx, %eax
>
> in order to get the signed case right. The fact that the end result is
> correct for unsigned numbers is irrelevant: it's still stupid and
> slow.
>
> With an unsigned int, '% 16' will generate the obvious
>
> andl $15, %eax
>
> instead.
>
> Quite frankly, stop using division in the first place. Dividing by
> powers-of-two and expecting the compiler to fix things up is just
> stupid, *exactly* because of issues like these: you either have to
> think about it carefully, or the compiler may end up creating crap
> code.

For the record, it generates "andl $15" here.

> So just use "& 15" instead. That doesn't have these kinds of issues.
> It is a *good* thing when the C code is close to the end result you
> want to generate. It is *not* a good thing to write code that looks
> nothing like the end result and just expect the compiler to do the
> right thing. Even if the compiler does do the right thing, what was
> the advantage?

Here is updated patch which explicitly uses & (equally tested):
---------------------------------------------------------------

For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15]
and W[i - 16]. Consequently, keeping all W[80] array on stack is
unnecessary, only 16 values are really needed.

Using W[16] instead of W[80] greatly reduces stack usage
(~750 bytes to ~340 bytes on x86_64).

Line by line explanation:
* BLEND_OP
array is "circular" now, all indexes have to be modulo 16.

* initial full message scheduling is trimmed to first 16 values which
come from data block, the rest is calculated right before it's needed.

* original loop body is unrolled version of new SHA512_0_15 and
SHA512_16_79 macros, unrolling was done to not do explicit variable
renaming. Otherwise it's the very same code after preprocessing.
See sha1_transform() code which does the same trick.

Patch survives in-tree crypto test and original bugreport test
(ping flood with hmac(sha512).

See FIPS 180-2 for SHA-512 definition
http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf

Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: [email protected]
---

This is patch is for stable if 750 byte stack usage is not
considered safe.

crypto/sha512_generic.c | 58 ++++++++++++++++++++++++++++--------------------
1 file changed, 34 insertions(+), 24 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -78,7 +78,7 @@ static inline void LOAD_OP(int I, u64 *W, const u8 *input)

static inline void BLEND_OP(int I, u64 *W)
{
- W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+ W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]);
}

static void
@@ -87,38 +87,48 @@ sha512_transform(u64 *state, const u8 *input)
u64 a, b, c, d, e, f, g, h, t1, t2;

int i;
- u64 W[80];
+ u64 W[16];

/* load the input */
for (i = 0; i < 16; i++)
LOAD_OP(i, W, input);

- for (i = 16; i < 80; i++) {
- BLEND_OP(i, W);
- }
-
/* load the state into our registers */
a=state[0]; b=state[1]; c=state[2]; d=state[3];
e=state[4]; f=state[5]; g=state[6]; h=state[7];

- /* now iterate */
- for (i=0; i<80; i+=8) {
- t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[i ];
- t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
- t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[i+1];
- t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
- t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[i+2];
- t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
- t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[i+3];
- t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
- t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[i+4];
- t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
- t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[i+5];
- t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
- t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[i+6];
- t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
- t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[i+7];
- t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
+#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
+ t2 = e0(a) + Maj(a, b, c); \
+ d += t1; \
+ h = t1 + t2
+
+#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
+ BLEND_OP(i, W); \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \
+ t2 = e0(a) + Maj(a, b, c); \
+ d += t1; \
+ h = t1 + t2
+
+ for (i = 0; i < 16; i += 8) {
+ SHA512_0_15(i, a, b, c, d, e, f, g, h);
+ SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
+ SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
+ SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
+ SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
+ SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
+ SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
+ SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
+ }
+ for (i = 16; i < 80; i += 8) {
+ SHA512_16_79(i, a, b, c, d, e, f, g, h);
+ SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
+ SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
+ SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
+ SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
+ SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
+ SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
+ SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
}

state[0] += a; state[1] += b; state[2] += c; state[3] += d;

2012-01-14 21:14:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/3] sha512: reduce stack usage to safe number

On Sat, Jan 14, 2012 at 12:41 PM, Alexey Dobriyan <[email protected]> wrote:
>
> For the record, it generates "andl $15" here.

Ok. That means that gcc was able to prove that it never had any signed
values (which is certainly reasonable when you do things like "for
(i=0; i<X;i++)"). But it's better to simply not rely on gcc always
getting details like this right.

It's also better to use a model that simply doesn't even require you
as a programmer to have to even *think* about signed values.

It's easy to get "%" wrong by mistake (signed integer modulus didn't
even use to have well-defined semantics in traditional C), and there
is almost never any excuse for using it for powers-of-two.

> Here is updated patch which explicitly uses & (equally tested):

Thanks,

Linus

2012-01-14 21:46:54

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule

Le samedi 14 janvier 2012 à 21:27 +0300, Alexey Dobriyan a écrit :
> commit f9e2bca6c22d75a289a349f869701214d63b5060
> aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> created global message schedule area.


> Signed-off-by: Alexey Dobriyan <[email protected]>
> Cc: [email protected]
> ---
>
> crypto/sha512_generic.c | 6 +-----
> 1 file changed, 1 insertion(+), 5 deletions(-)
>
> --- a/crypto/sha512_generic.c
> +++ b/crypto/sha512_generic.c
> @@ -21,8 +21,6 @@
> #include <linux/percpu.h>
> #include <asm/byteorder.h>
>
> -static DEFINE_PER_CPU(u64[80], msg_schedule);
> -
> static inline u64 Ch(u64 x, u64 y, u64 z)
> {
> return z ^ (x & (y ^ z));
> @@ -89,7 +87,7 @@ sha512_transform(u64 *state, const u8 *input)
> u64 a, b, c, d, e, f, g, h, t1, t2;
>
> int i;
> - u64 *W = get_cpu_var(msg_schedule);
> + u64 W[80];
>
> /* load the input */
> for (i = 0; i < 16; i++)
> @@ -128,8 +126,6 @@ sha512_transform(u64 *state, const u8 *input)
>
> /* erase our data */
> a = b = c = d = e = f = g = h = t1 = t2 = 0;
> - memset(W, 0, sizeof(__get_cpu_var(msg_schedule)));
> - put_cpu_var(msg_schedule);
> }
>
> static int

Is it just me or are you ignoring what crypto maintainer and others
thought of your patch ?

You are re-introducing a 640 bytes stack array, how comes it can be
really safe ?

This is too risky, and we provided an alternate patch, not just for fun.

2012-01-14 21:52:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule

On Sat, Jan 14, 2012 at 1:46 PM, Eric Dumazet <[email protected]> wrote:
>
> This is too risky, and we provided an alternate patch, not just for fun.

Did you see the second patch?

The one that got rid of the *stupid* 80-entry array?

I don't know why so many sha implementations do that idiotic full
array, when the circular one is much better.

In fact, the 16-entry circular array allows machines with lots of
registers to keep all the state in registers and the C implementation
can often be as good as hand-tuned assembly. At least that's true for
sha1, I'm not sure you can do the same with sha512.

But that actually *requires* that the 16-entry array be done on the
stack as an automatic array. Anything else, and the compiler won't be
able to do it.

Linus

2012-01-14 22:00:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule

Le samedi 14 janvier 2012 à 13:52 -0800, Linus Torvalds a écrit :
> On Sat, Jan 14, 2012 at 1:46 PM, Eric Dumazet <[email protected]> wrote:
> >
> > This is too risky, and we provided an alternate patch, not just for fun.
>
> Did you see the second patch?
>

I saw it and felt it was not a stable material.

And Herbert sent an alternate patch, then I sent a V3 patch.

> The one that got rid of the *stupid* 80-entry array?
>

Maybe, I didnt wrote this code, and I am very glad this code can be
smarter and all.

2012-01-15 01:43:14

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 1/3] sha512: make it work, undo percpu message schedule

On Sat, Jan 14, 2012 at 09:27:37PM +0300, Alexey Dobriyan wrote:
> commit f9e2bca6c22d75a289a349f869701214d63b5060
> aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
> created global message schedule area.
>
> If sha512_update will ever be entered twice, hash will be silently
> calculated incorrectly.
>
> Probably the easiest way to notice incorrect hashes being calculated is
> to run 2 ping floods over AH with hmac(sha512):
>
> #!/usr/sbin/setkey -f
> flush;
> spdflush;
> add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
> add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
> spdadd IP1 IP2 any -P out ipsec ah/transport//require;
> spdadd IP2 IP1 any -P in ipsec ah/transport//require;
>
> XfrmInStateProtoError will start ticking with -EBADMSG being returned
> from ah_input(). This never happens with, say, hmac(sha1).
>
> With patch applied (on BOTH sides), XfrmInStateProtoError does not tick
> with multiple bidirectional ping flood streams like it doesn't tick
> with SHA-1.
>
> After this patch sha512_transform() will start using ~750 bytes of stack on x86_64.
> This is OK for simple loads, for something more heavy, stack reduction will be done
> separatedly.
>
> Signed-off-by: Alexey Dobriyan <[email protected]>
> Cc: [email protected]

OK, I've applied patches 1-2 to crypto and patch 3 to cryptodev.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-15 01:43:46

by Herbert Xu

[permalink] [raw]
Subject: Re: sha512: make it work, undo percpu message schedule

On Fri, Jan 13, 2012 at 12:02:55PM +0100, Steffen Klassert wrote:
>
> If we want to do dynamic memory allocation, we could place it to the
> shash_desc context where we already store the intermediate digest value.
> This is preallocated anyway, so we don't need to do another allocation.

The problem with that is that some call paths place shash_desc on
the stack too.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-16 09:56:42

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 2/3] sha512: reduce stack usage to safe number

Doesn't this badly overflow W[] ..

> +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
> + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
...
> + for (i = 0; i < 16; i += 8) {
...
> + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
> + }

David

2012-01-16 10:20:47

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 2/3] sha512: reduce stack usage to safe number

On 1/16/12, David Laight <[email protected]> wrote:
> Doesn't this badly overflow W[] ..
>
>> +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
>> + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
> ...
>> + for (i = 0; i < 16; i += 8) {
> ...
>> + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
>> + }

No, why should it?
i can be only 0 and 8.

2012-01-16 10:23:07

by Eric Dumazet

[permalink] [raw]
Subject: RE: [PATCH 2/3] sha512: reduce stack usage to safe number

Le lundi 16 janvier 2012 à 09:56 +0000, David Laight a écrit :
> Doesn't this badly overflow W[] ..
>
> > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
> > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
> ...
> > + for (i = 0; i < 16; i += 8) {
> ...
> > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
> > + }
>
> David
>
>

No overflow since loop is done for only i=0 and i=8

By the way, I suspect previous code was chosen years ago because this
version uses less stack but adds much more code bloat.


size crypto/sha512_generic.o crypto/sha512_generic_old.o
text data bss dec hex filename
17369 704 0 18073 4699 crypto/sha512_generic.o
8249 704 0 8953 22f9 crypto/sha512_generic_old.o

2012-01-16 11:37:35

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 2/3] sha512: reduce stack usage to safe number


> By the way, I suspect previous code was chosen years ago because this
> version uses less stack but adds much more code bloat.
>
> size crypto/sha512_generic.o crypto/sha512_generic_old.o
> text data bss dec hex filename
> 17369 704 0 18073 4699 crypto/sha512_generic.o
> 8249 704 0 8953 22f9
> crypto/sha512_generic_old.o

I wonder what the difference in execution times is?
I know this is very dependant in the actual cpu
and the cold/warm cache state...

But some measurements might be useful.

David

2012-01-17 12:03:10

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 2/3] sha512: reduce stack usage to safe number

On 1/16/12, Eric Dumazet <[email protected]> wrote:
> Le lundi 16 janvier 2012 ? 09:56 +0000, David Laight a ?crit :
>> Doesn't this badly overflow W[] ..
>>
>> > +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
>> > + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
>> ...
>> > + for (i = 0; i < 16; i += 8) {
>> ...
>> > + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
>> > + }
>>
>> David
>>
>>
>
> No overflow since loop is done for only i=0 and i=8
>
> By the way, I suspect previous code was chosen years ago because this
> version uses less stack but adds much more code bloat.

I think W[80] was use because it's the most straightforward way to
write this code
by following spec.

All SHA definitions have full message schedule pseudocoded
before hash computation.

> size crypto/sha512_generic.o crypto/sha512_generic_old.o
> text data bss dec hex filename
> 17369 704 0 18073 4699 crypto/sha512_generic.o
> 8249 704 0 8953 22f9 crypto/sha512_generic_old.o

This is because SHA-512 is fundamentally 64-bit algorithm multiplied by
excessive unrolling.

Surprisingly, doing variable renaming by hand like in spec:
t1 = ...
t2 = ...
h = g;
g = f;
f = e;
e = d + T1;
d = c;
c = b;
b = a;
a = t1 + t2;

bring stack space on i386 under control too.

2012-01-18 18:02:17

by Alexey Dobriyan

[permalink] [raw]
Subject: [PATCH 4/3] sha512: reduce stack usage even on i386

Fix still excessive stack usage on i386.

There is too much loop unrolling going on, despite W[16] being used,
gcc screws up this for some reason. So, don't be smart, use simple code
from SHA-512 definition, this keeps code size _and_ stack usage back
under control even on i386:

-14b: 81 ec 9c 03 00 00 sub $0x39c,%esp
+149: 81 ec 64 01 00 00 sub $0x164,%esp

$ size ../sha512_generic-i386-00*
text data bss dec hex filename
15521 712 0 16233 3f69 ../sha512_generic-i386-000.o
4225 712 0 4937 1349 ../sha512_generic-i386-001.o

Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: [email protected]
---

crypto/sha512_generic.c | 42 ++++++++++++++++++++----------------------
1 file changed, 20 insertions(+), 22 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -95,35 +95,33 @@ sha512_transform(u64 *state, const u8 *input)
#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
t2 = e0(a) + Maj(a, b, c); \
- d += t1; \
- h = t1 + t2
+ h = g; \
+ g = f; \
+ f = e; \
+ e = d + t1; \
+ d = c; \
+ c = b; \
+ b = a; \
+ a = t1 + t2

#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
BLEND_OP(i, W); \
- t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i & 15]; \
t2 = e0(a) + Maj(a, b, c); \
- d += t1; \
- h = t1 + t2
-
- for (i = 0; i < 16; i += 8) {
+ h = g; \
+ g = f; \
+ f = e; \
+ e = d + t1; \
+ d = c; \
+ c = b; \
+ b = a; \
+ a = t1 + t2
+
+ for (i = 0; i < 16; i++) {
SHA512_0_15(i, a, b, c, d, e, f, g, h);
- SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
- SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
- SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
- SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
- SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
- SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
- SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
}
- for (i = 16; i < 80; i += 8) {
+ for (i = 16; i < 80; i++) {
SHA512_16_79(i, a, b, c, d, e, f, g, h);
- SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
- SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
- SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
- SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
- SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
- SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
- SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
}

state[0] += a; state[1] += b; state[2] += c; state[3] += d;

2012-01-26 02:35:02

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 4/3] sha512: reduce stack usage even on i386

On Wed, Jan 18, 2012 at 09:02:10PM +0300, Alexey Dobriyan wrote:
> Fix still excessive stack usage on i386.
>
> There is too much loop unrolling going on, despite W[16] being used,
> gcc screws up this for some reason. So, don't be smart, use simple code
> from SHA-512 definition, this keeps code size _and_ stack usage back
> under control even on i386:
>
> -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp
> +149: 81 ec 64 01 00 00 sub $0x164,%esp
>
> $ size ../sha512_generic-i386-00*
> text data bss dec hex filename
> 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o
> 4225 712 0 4937 1349 ../sha512_generic-i386-001.o
>
> Signed-off-by: Alexey Dobriyan <[email protected]>
> Cc: [email protected]

Hmm, your patch doesn't apply against my crypto tree. Please
regenerate.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-27 17:51:30

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 4/3] sha512: reduce stack usage even on i386

On Thu, Jan 26, 2012 at 01:35:02PM +1100, Herbert Xu wrote:
> On Wed, Jan 18, 2012 at 09:02:10PM +0300, Alexey Dobriyan wrote:
> > Fix still excessive stack usage on i386.
> >
> > There is too much loop unrolling going on, despite W[16] being used,
> > gcc screws up this for some reason. So, don't be smart, use simple code
> > from SHA-512 definition, this keeps code size _and_ stack usage back
> > under control even on i386:
> >
> > -14b: 81 ec 9c 03 00 00 sub $0x39c,%esp
> > +149: 81 ec 64 01 00 00 sub $0x164,%esp
> >
> > $ size ../sha512_generic-i386-00*
> > text data bss dec hex filename
> > 15521 712 0 16233 3f69 ../sha512_generic-i386-000.o
> > 4225 712 0 4937 1349 ../sha512_generic-i386-001.o
> >
> > Signed-off-by: Alexey Dobriyan <[email protected]>
> > Cc: [email protected]
>
> Hmm, your patch doesn't apply against my crypto tree. Please
> regenerate.

I think this is because your tree contained "%16" code instead if "&15".
Now that it contains "&15" it should become applicable.

Anyway.
--------------------------------------------------------------------------
[PATCH] sha512: reduce stack usage even on i386

Fix still excessive stack usage on i386.

There is too much loop unrolling going on, despite W[16] being used,
gcc screws up this for some reason. So, don't be smart, use simple code
from SHA-512 definition, this keeps code size _and_ stack usage back
under control even on i386:

-14b: 81 ec 9c 03 00 00 sub $0x39c,%esp
+149: 81 ec 64 01 00 00 sub $0x164,%esp

$ size ../sha512_generic-i386-00*
text data bss dec hex filename
15521 712 0 16233 3f69 ../sha512_generic-i386-000.o
4225 712 0 4937 1349 ../sha512_generic-i386-001.o

Signed-off-by: Alexey Dobriyan <[email protected]>
Cc: [email protected]
---

crypto/sha512_generic.c | 42 ++++++++++++++++++++----------------------
1 file changed, 20 insertions(+), 22 deletions(-)

--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -100,35 +100,33 @@ sha512_transform(u64 *state, const u8 *input)
#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
t2 = e0(a) + Maj(a, b, c); \
- d += t1; \
- h = t1 + t2
+ h = g; \
+ g = f; \
+ f = e; \
+ e = d + t1; \
+ d = c; \
+ c = b; \
+ b = a; \
+ a = t1 + t2

#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
BLEND_OP(i, W); \
- t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i & 15]; \
t2 = e0(a) + Maj(a, b, c); \
- d += t1; \
- h = t1 + t2
-
- for (i = 0; i < 16; i += 8) {
+ h = g; \
+ g = f; \
+ f = e; \
+ e = d + t1; \
+ d = c; \
+ c = b; \
+ b = a; \
+ a = t1 + t2
+
+ for (i = 0; i < 16; i++) {
SHA512_0_15(i, a, b, c, d, e, f, g, h);
- SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
- SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
- SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
- SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
- SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
- SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
- SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
}
- for (i = 16; i < 80; i += 8) {
+ for (i = 16; i < 80; i++) {
SHA512_16_79(i, a, b, c, d, e, f, g, h);
- SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
- SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
- SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
- SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
- SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
- SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
- SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
}

state[0] += a; state[1] += b; state[2] += c; state[3] += d;

2012-01-27 22:33:04

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 4/3] sha512: reduce stack usage even on i386

On Fri, Jan 27, 2012 at 08:51:30PM +0300, Alexey Dobriyan wrote:
>
> I think this is because your tree contained "%16" code instead if "&15".
> Now that it contains "&15" it should become applicable.

OK.

> --------------------------------------------------------------------------
> [PATCH] sha512: reduce stack usage even on i386

Can you try the approach that git takes with using asm to read
and write W (see previous email from Linus in respone to my push
request)? As it stands your patch is simply relying on gcc's
ability to optimise. At least with asm volatile we know that
gcc will leave it alone.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2012-01-30 11:10:49

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 4/3] sha512: reduce stack usage even on i386

On 1/28/12, Herbert Xu <[email protected]> wrote:
> On Fri, Jan 27, 2012 at 08:51:30PM +0300, Alexey Dobriyan wrote:
>>
>> I think this is because your tree contained "%16" code instead if "&15".
>> Now that it contains "&15" it should become applicable.
>
> OK.
>
>> --------------------------------------------------------------------------
>> [PATCH] sha512: reduce stack usage even on i386
>
> Can you try the approach that git takes with using asm to read
> and write W (see previous email from Linus in respone to my push
> request)? As it stands your patch is simply relying on gcc's
> ability to optimise. At least with asm volatile we know that
> gcc will leave it alone.

For some reason it doesn't. :-( I've also tried full barriers.

With this patch, stack usage is still ~900 bytes.

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index dd0439d..35e7ae7 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -66,16 +66,6 @@ static const u64 sha512_K[80] = {
#define s0(x) (ror64(x, 1) ^ ror64(x, 8) ^ (x >> 7))
#define s1(x) (ror64(x,19) ^ ror64(x,61) ^ (x >> 6))

-static inline void LOAD_OP(int I, u64 *W, const u8 *input)
-{
- W[I] = __be64_to_cpu( ((__be64*)(input))[I] );
-}
-
-static inline void BLEND_OP(int I, u64 *W)
-{
- W[I & 15] += s1(W[(I-2) & 15]) + W[(I-7) & 15] + s0(W[(I-15) & 15]);
-}
-
static void
sha512_transform(u64 *state, const u8 *input)
{
@@ -84,26 +74,29 @@ sha512_transform(u64 *state, const u8 *input)
int i;
u64 W[16];

- /* load the input */
- for (i = 0; i < 16; i++)
- LOAD_OP(i, W, input);
-
/* load the state into our registers */
a=state[0]; b=state[1]; c=state[2]; d=state[3];
e=state[4]; f=state[5]; g=state[6]; h=state[7];

#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
- t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
+ { \
+ u64 tmp = be64_to_cpu(*((__be64 *)input + (i))); \
+ *(volatile u64 *)&W[i] = tmp; \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + tmp; \
t2 = e0(a) + Maj(a, b, c); \
d += t1; \
- h = t1 + t2
+ h = t1 + t2; \
+ }

#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
- BLEND_OP(i, W); \
- t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15];\
+ { \
+ u64 tmp = W[(i) & 15] + s1(W[(i-2) & 15]) + W[(i-7) & 15] +
s0(W[(i-15) & 15]);\
+ *(volatile u64 *)&W[(i) & 15] = tmp; \
+ t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + tmp; \
t2 = e0(a) + Maj(a, b, c); \
d += t1; \
- h = t1 + t2
+ h = t1 + t2; \
+ }

for (i = 0; i < 16; i += 8) {
SHA512_0_15(i, a, b, c, d, e, f, g, h);

2012-02-03 03:35:05

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 4/3] sha512: reduce stack usage even on i386

On Mon, Jan 30, 2012 at 01:10:48PM +0200, Alexey Dobriyan wrote:
>
> For some reason it doesn't. :-( I've also tried full barriers.

OK, I don't think it worked in the sha1 case either, as on my
i386 here lib/sha1.c ends up using nearly 500 bytes of stack
space.

It appears that the crux of the issue here is the use of 64-bit
ops. gcc just doesn't like that at all on i386 and ends up
putting every other intermediate result on the stack in its own
unique place.

While you de-unrolling patch does reduce the stack use, it also
incurs a speed penalty of around 15%, and that's on top of the
4% we've already added by getting rid of per_cpu.

However, this patch seems to work for me, and reduces the stack
usage to a level about the same as lib/sha1.c + crypto/sha1_generic.c
without adding any new overhead.

So let me know if you have any better ideas.

diff --git a/crypto/sha512_generic.c b/crypto/sha512_generic.c
index 3edebfd..f04af93 100644
--- a/crypto/sha512_generic.c
+++ b/crypto/sha512_generic.c
@@ -89,46 +89,42 @@ sha512_transform(u64 *state, const u8 *input)
int i;
u64 W[16];

- /* load the input */
- for (i = 0; i < 16; i++)
- LOAD_OP(i, W, input);
-
/* load the state into our registers */
a=state[0]; b=state[1]; c=state[2]; d=state[3];
e=state[4]; f=state[5]; g=state[6]; h=state[7];

-#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
- t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i]; \
- t2 = e0(a) + Maj(a, b, c); \
- d += t1; \
- h = t1 + t2
-
-#define SHA512_16_79(i, a, b, c, d, e, f, g, h) \
- BLEND_OP(i, W); \
- t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[(i)&15]; \
- t2 = e0(a) + Maj(a, b, c); \
- d += t1; \
- h = t1 + t2
-
- for (i = 0; i < 16; i += 8) {
- SHA512_0_15(i, a, b, c, d, e, f, g, h);
- SHA512_0_15(i + 1, h, a, b, c, d, e, f, g);
- SHA512_0_15(i + 2, g, h, a, b, c, d, e, f);
- SHA512_0_15(i + 3, f, g, h, a, b, c, d, e);
- SHA512_0_15(i + 4, e, f, g, h, a, b, c, d);
- SHA512_0_15(i + 5, d, e, f, g, h, a, b, c);
- SHA512_0_15(i + 6, c, d, e, f, g, h, a, b);
- SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
- }
- for (i = 16; i < 80; i += 8) {
- SHA512_16_79(i, a, b, c, d, e, f, g, h);
- SHA512_16_79(i + 1, h, a, b, c, d, e, f, g);
- SHA512_16_79(i + 2, g, h, a, b, c, d, e, f);
- SHA512_16_79(i + 3, f, g, h, a, b, c, d, e);
- SHA512_16_79(i + 4, e, f, g, h, a, b, c, d);
- SHA512_16_79(i + 5, d, e, f, g, h, a, b, c);
- SHA512_16_79(i + 6, c, d, e, f, g, h, a, b);
- SHA512_16_79(i + 7, b, c, d, e, f, g, h, a);
+ /* now iterate */
+ for (i=0; i<80; i+=8) {
+ if (!(i & 8)) {
+ int j;
+
+ if (i < 16) {
+ /* load the input */
+ for (j = 0; j < 16; j++)
+ LOAD_OP(i + j, W, input);
+ } else {
+ for (j = 0; j < 16; j++) {
+ BLEND_OP(i + j, W);
+ }
+ }
+ }
+
+ t1 = h + e1(e) + Ch(e,f,g) + sha512_K[i ] + W[(i & 15)];
+ t2 = e0(a) + Maj(a,b,c); d+=t1; h=t1+t2;
+ t1 = g + e1(d) + Ch(d,e,f) + sha512_K[i+1] + W[(i & 15) + 1];
+ t2 = e0(h) + Maj(h,a,b); c+=t1; g=t1+t2;
+ t1 = f + e1(c) + Ch(c,d,e) + sha512_K[i+2] + W[(i & 15) + 2];
+ t2 = e0(g) + Maj(g,h,a); b+=t1; f=t1+t2;
+ t1 = e + e1(b) + Ch(b,c,d) + sha512_K[i+3] + W[(i & 15) + 3];
+ t2 = e0(f) + Maj(f,g,h); a+=t1; e=t1+t2;
+ t1 = d + e1(a) + Ch(a,b,c) + sha512_K[i+4] + W[(i & 15) + 4];
+ t2 = e0(e) + Maj(e,f,g); h+=t1; d=t1+t2;
+ t1 = c + e1(h) + Ch(h,a,b) + sha512_K[i+5] + W[(i & 15) + 5];
+ t2 = e0(d) + Maj(d,e,f); g+=t1; c=t1+t2;
+ t1 = b + e1(g) + Ch(g,h,a) + sha512_K[i+6] + W[(i & 15) + 6];
+ t2 = e0(c) + Maj(c,d,e); f+=t1; b=t1+t2;
+ t1 = a + e1(f) + Ch(f,g,h) + sha512_K[i+7] + W[(i & 15) + 7];
+ t2 = e0(b) + Maj(b,c,d); e+=t1; a=t1+t2;
}

state[0] += a; state[1] += b; state[2] += c; state[3] += d;

Thanks!
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt