LinuxLists.cc - [PATCH] lib/sha1: use the git implementation of SHA-1

2011-08-06 01:47:39

Subject: [PATCH] lib/sha1: use the git implementation of SHA-1

For ChromiumOS, we use SHA-1 to verify the integrity of the root
filesystem. The speed of the kernel sha-1 implementation has a major
impact on our boot performance.

To improve boot performance, we investigated using the heavily
optimized sha-1 implementation used in git. With the git
sha-1 implementation, we see a 11.7% improvement in boot time.

10 reboots, remove slowest/fastest.

Before:

Mean: 6.58 seconds Stdev: 0.14

After (with git sha-1, this patch):

Mean: 5.89 seconds Stdev: 0.07

The other cool thing about the git SHA-1 implementation is that
it only needs 64 bytes of stack for the workspace while the original
kernel implementation needed 320 bytes.

Signed-off-by: Mandeep Singh Baines <[email protected]>
CC: Ramsay Jones <[email protected]>
CC: Nicolas Pitre <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Herbert Xu <[email protected]>
CC: David S. Miller <[email protected]>
CC: [email protected]
---
include/linux/cryptohash.h | 2 +-
lib/sha1.c | 212 ++++++++++++++++++++++++++++++++-----------
2 files changed, 159 insertions(+), 55 deletions(-)

diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index ec78a4b..f945218 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -3,7 +3,7 @@

#define SHA_DIGEST_WORDS 5
#define SHA_MESSAGE_BYTES (512 /*bits*/ / 8)
-#define SHA_WORKSPACE_WORDS 80
+#define SHA_WORKSPACE_WORDS 16

void sha_init(__u32 *buf);
void sha_transform(__u32 *digest, const char *data, __u32 *W);
diff --git a/lib/sha1.c b/lib/sha1.c
index 4c45fd5..f33271d 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -1,31 +1,72 @@
/*
- * SHA transform algorithm, originally taken from code written by
- * Peter Gutmann, and placed in the public domain.
+ * SHA1 routine optimized to do word accesses rather than byte accesses,
+ * and to avoid unnecessary copies into the context array.
+ *
+ * This was based on the git SHA1 implementation.
*/

#include <linux/kernel.h>
#include <linux/module.h>
-#include <linux/cryptohash.h>
+#include <linux/bitops.h>
+#include <asm/unaligned.h>

-/* The SHA f()-functions. */
+/*
+ * If you have 32 registers or more, the compiler can (and should)
+ * try to change the array[] accesses into registers. However, on
+ * machines with less than ~25 registers, that won't really work,
+ * and at least gcc will make an unholy mess of it.
+ *
+ * So to avoid that mess which just slows things down, we force
+ * the stores to memory to actually happen (we might be better off
+ * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
+ * suggested by Artur Skawina - that will also make gcc unable to
+ * try to do the silly "optimize away loads" part because it won't
+ * see what the value will be).
+ *
+ * Ben Herrenschmidt reports that on PPC, the C version comes close
+ * to the optimized asm with this (ie on PPC you don't want that
+ * 'volatile', since there are lots of registers).
+ *
+ * On ARM we get the best code generation by forcing a full memory barrier
+ * between each SHA_ROUND, otherwise gcc happily get wild with spilling and
+ * the stack frame size simply explode and performance goes down the drain.
+ */

-#define f1(x,y,z) (z ^ (x & (y ^ z))) /* x ? y : z */
-#define f2(x,y,z) (x ^ y ^ z) /* XOR */
-#define f3(x,y,z) ((x & y) + (z & (x ^ y))) /* majority */
+#ifdef CONFIG_X86
+ #define setW(x, val) (*(volatile __u32 *)&W(x) = (val))
+#elif defined(CONFIG_ARM)
+ #define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)
+#else
+ #define setW(x, val) (W(x) = (val))
+#endif

-/* The SHA Mysterious Constants */
+/* This "rolls" over the 512-bit array */
+#define W(x) (array[(x)&15])

-#define K1 0x5A827999L /* Rounds 0-19: sqrt(2) * 2^30 */
-#define K2 0x6ED9EBA1L /* Rounds 20-39: sqrt(3) * 2^30 */
-#define K3 0x8F1BBCDCL /* Rounds 40-59: sqrt(5) * 2^30 */
-#define K4 0xCA62C1D6L /* Rounds 60-79: sqrt(10) * 2^30 */
+/*
+ * Where do we get the source from? The first 16 iterations get it from
+ * the input data, the next mix it from the 512-bit array.
+ */
+#define SHA_SRC(t) get_unaligned_be32((__u32 *)data + t)
+#define SHA_MIX(t) rol32(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
+
+#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
+ __u32 TEMP = input(t); setW(t, TEMP); \
+ E += TEMP + rol32(A,5) + (fn) + (constant); \
+ B = ror32(B, 2); } while (0)
+
+#define T_0_15(t, A, B, C, D, E) SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
+#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
+#define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E )
+#define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E )
+#define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0xca62c1d6, A, B, C, D, E )

/**
* sha_transform - single block SHA1 transform
*
* @digest: 160 bit digest to update
* @data: 512 bits of data to hash
- * @W: 80 words of workspace (see note)
+ * @array: 16 words of workspace (see note)
*
* This function generates a SHA1 digest for a single 512-bit block.
* Be warned, it does not handle padding and message digest, do not
@@ -36,47 +77,111 @@
* to clear the workspace. This is left to the caller to avoid
* unnecessary clears between chained hashing operations.
*/
-void sha_transform(__u32 *digest, const char *in, __u32 *W)
+void sha_transform(__u32 *digest, const char *data, __u32 *array)
{
- __u32 a, b, c, d, e, t, i;
-
- for (i = 0; i < 16; i++)
- W[i] = be32_to_cpu(((const __be32 *)in)[i]);
-
- for (i = 0; i < 64; i++)
- W[i+16] = rol32(W[i+13] ^ W[i+8] ^ W[i+2] ^ W[i], 1);
-
- a = digest[0];
- b = digest[1];
- c = digest[2];
- d = digest[3];
- e = digest[4];
-
- for (i = 0; i < 20; i++) {
- t = f1(b, c, d) + K1 + rol32(a, 5) + e + W[i];
- e = d; d = c; c = rol32(b, 30); b = a; a = t;
- }
-
- for (; i < 40; i ++) {
- t = f2(b, c, d) + K2 + rol32(a, 5) + e + W[i];
- e = d; d = c; c = rol32(b, 30); b = a; a = t;
- }
-
- for (; i < 60; i ++) {
- t = f3(b, c, d) + K3 + rol32(a, 5) + e + W[i];
- e = d; d = c; c = rol32(b, 30); b = a; a = t;
- }
-
- for (; i < 80; i ++) {
- t = f2(b, c, d) + K4 + rol32(a, 5) + e + W[i];
- e = d; d = c; c = rol32(b, 30); b = a; a = t;
- }
-
- digest[0] += a;
- digest[1] += b;
- digest[2] += c;
- digest[3] += d;
- digest[4] += e;
+ __u32 A, B, C, D, E;
+
+ A = digest[0];
+ B = digest[1];
+ C = digest[2];
+ D = digest[3];
+ E = digest[4];
+
+ /* Round 1 - iterations 0-16 take their input from 'data' */
+ T_0_15( 0, A, B, C, D, E);
+ T_0_15( 1, E, A, B, C, D);
+ T_0_15( 2, D, E, A, B, C);
+ T_0_15( 3, C, D, E, A, B);
+ T_0_15( 4, B, C, D, E, A);
+ T_0_15( 5, A, B, C, D, E);
+ T_0_15( 6, E, A, B, C, D);
+ T_0_15( 7, D, E, A, B, C);
+ T_0_15( 8, C, D, E, A, B);
+ T_0_15( 9, B, C, D, E, A);
+ T_0_15(10, A, B, C, D, E);
+ T_0_15(11, E, A, B, C, D);
+ T_0_15(12, D, E, A, B, C);
+ T_0_15(13, C, D, E, A, B);
+ T_0_15(14, B, C, D, E, A);
+ T_0_15(15, A, B, C, D, E);
+
+ /* Round 1 - tail. Input from 512-bit mixing array */
+ T_16_19(16, E, A, B, C, D);
+ T_16_19(17, D, E, A, B, C);
+ T_16_19(18, C, D, E, A, B);
+ T_16_19(19, B, C, D, E, A);
+
+ /* Round 2 */
+ T_20_39(20, A, B, C, D, E);
+ T_20_39(21, E, A, B, C, D);
+ T_20_39(22, D, E, A, B, C);
+ T_20_39(23, C, D, E, A, B);
+ T_20_39(24, B, C, D, E, A);
+ T_20_39(25, A, B, C, D, E);
+ T_20_39(26, E, A, B, C, D);
+ T_20_39(27, D, E, A, B, C);
+ T_20_39(28, C, D, E, A, B);
+ T_20_39(29, B, C, D, E, A);
+ T_20_39(30, A, B, C, D, E);
+ T_20_39(31, E, A, B, C, D);
+ T_20_39(32, D, E, A, B, C);
+ T_20_39(33, C, D, E, A, B);
+ T_20_39(34, B, C, D, E, A);
+ T_20_39(35, A, B, C, D, E);
+ T_20_39(36, E, A, B, C, D);
+ T_20_39(37, D, E, A, B, C);
+ T_20_39(38, C, D, E, A, B);
+ T_20_39(39, B, C, D, E, A);
+
+ /* Round 3 */
+ T_40_59(40, A, B, C, D, E);
+ T_40_59(41, E, A, B, C, D);
+ T_40_59(42, D, E, A, B, C);
+ T_40_59(43, C, D, E, A, B);
+ T_40_59(44, B, C, D, E, A);
+ T_40_59(45, A, B, C, D, E);
+ T_40_59(46, E, A, B, C, D);
+ T_40_59(47, D, E, A, B, C);
+ T_40_59(48, C, D, E, A, B);
+ T_40_59(49, B, C, D, E, A);
+ T_40_59(50, A, B, C, D, E);
+ T_40_59(51, E, A, B, C, D);
+ T_40_59(52, D, E, A, B, C);
+ T_40_59(53, C, D, E, A, B);
+ T_40_59(54, B, C, D, E, A);
+ T_40_59(55, A, B, C, D, E);
+ T_40_59(56, E, A, B, C, D);
+ T_40_59(57, D, E, A, B, C);
+ T_40_59(58, C, D, E, A, B);
+ T_40_59(59, B, C, D, E, A);
+
+ /* Round 4 */
+ T_60_79(60, A, B, C, D, E);
+ T_60_79(61, E, A, B, C, D);
+ T_60_79(62, D, E, A, B, C);
+ T_60_79(63, C, D, E, A, B);
+ T_60_79(64, B, C, D, E, A);
+ T_60_79(65, A, B, C, D, E);
+ T_60_79(66, E, A, B, C, D);
+ T_60_79(67, D, E, A, B, C);
+ T_60_79(68, C, D, E, A, B);
+ T_60_79(69, B, C, D, E, A);
+ T_60_79(70, A, B, C, D, E);
+ T_60_79(71, E, A, B, C, D);
+ T_60_79(72, D, E, A, B, C);
+ T_60_79(73, C, D, E, A, B);
+ T_60_79(74, B, C, D, E, A);
+ T_60_79(75, A, B, C, D, E);
+ T_60_79(76, E, A, B, C, D);
+ T_60_79(77, D, E, A, B, C);
+ T_60_79(78, C, D, E, A, B);
+ T_60_79(79, B, C, D, E, A);
+
+ digest[0] += A;
+ digest[1] += B;
+ digest[2] += C;
+ digest[3] += D;
+ digest[4] += E;
}
EXPORT_SYMBOL(sha_transform);

@@ -92,4 +197,3 @@ void sha_init(__u32 *buf)
buf[3] = 0x10325476;
buf[4] = 0xc3d2e1f0;
}
-
--
1.7.3.1

2011-08-07 11:54:12

by Joachim Eastwood

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sat, Aug 6, 2011 at 3:46 AM, Mandeep Singh Baines <[email protected]> wrote:
> For ChromiumOS, we use SHA-1 to verify the integrity of the root
> filesystem. The speed of the kernel sha-1 implementation has a major
> impact on our boot performance.
>
> To improve boot performance, we investigated using the heavily
> optimized sha-1 implementation used in git. With the git
> sha-1 implementation, we see a 11.7% improvement in boot time.
>
> 10 reboots, remove slowest/fastest.
>
> Before:
>
> Mean: 6.58 seconds Stdev: 0.14
>
> After (with git sha-1, this patch):
>
> Mean: 5.89 seconds Stdev: 0.07
>
> The other cool thing about the git SHA-1 implementation is that
> it only needs 64 bytes of stack for the workspace while the original
> kernel implementation needed 320 bytes.
>
> Signed-off-by: Mandeep Singh Baines <[email protected]>
> CC: Ramsay Jones <[email protected]>
> CC: Nicolas Pitre <[email protected]>
> CC: Linus Torvalds <[email protected]>
> CC: Herbert Xu <[email protected]>
> CC: David S. Miller <[email protected]>
> CC: [email protected]
> ---
> ?include/linux/cryptohash.h | ? ?2 +-
> ?lib/sha1.c ? ? ? ? ? ? ? ? | ?212 ++++++++++++++++++++++++++++++++-----------
> ?2 files changed, 159 insertions(+), 55 deletions(-)
>
> diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
> index ec78a4b..f945218 100644
> --- a/include/linux/cryptohash.h
> +++ b/include/linux/cryptohash.h
> @@ -3,7 +3,7 @@
>
> ?#define SHA_DIGEST_WORDS 5
> ?#define SHA_MESSAGE_BYTES (512 /*bits*/ / 8)
> -#define SHA_WORKSPACE_WORDS 80
> +#define SHA_WORKSPACE_WORDS 16
>
> ?void sha_init(__u32 *buf);
> ?void sha_transform(__u32 *digest, const char *data, __u32 *W);
> diff --git a/lib/sha1.c b/lib/sha1.c
> index 4c45fd5..f33271d 100644
> --- a/lib/sha1.c
> +++ b/lib/sha1.c
> @@ -1,31 +1,72 @@
> ?/*
> - * SHA transform algorithm, originally taken from code written by
> - * Peter Gutmann, and placed in the public domain.
> + * SHA1 routine optimized to do word accesses rather than byte accesses,
> + * and to avoid unnecessary copies into the context array.
> + *
> + * This was based on the git SHA1 implementation.
> ?*/
>
> ?#include <linux/kernel.h>
> ?#include <linux/module.h>
> -#include <linux/cryptohash.h>
> +#include <linux/bitops.h>
> +#include <asm/unaligned.h>
>
> -/* The SHA f()-functions. ?*/
> +/*
> + * If you have 32 registers or more, the compiler can (and should)
> + * try to change the array[] accesses into registers. However, on
> + * machines with less than ~25 registers, that won't really work,
> + * and at least gcc will make an unholy mess of it.
> + *
> + * So to avoid that mess which just slows things down, we force
> + * the stores to memory to actually happen (we might be better off
> + * with a 'W(t)=(val);asm("":"+m" (W(t))' there instead, as
> + * suggested by Artur Skawina - that will also make gcc unable to
> + * try to do the silly "optimize away loads" part because it won't
> + * see what the value will be).
> + *
> + * Ben Herrenschmidt reports that on PPC, the C version comes close
> + * to the optimized asm with this (ie on PPC you don't want that
> + * 'volatile', since there are lots of registers).
> + *
> + * On ARM we get the best code generation by forcing a full memory barrier
> + * between each SHA_ROUND, otherwise gcc happily get wild with spilling and
> + * the stack frame size simply explode and performance goes down the drain.
> + */
>
> -#define f1(x,y,z) ? (z ^ (x & (y ^ z))) ? ? ? ? ? ? ? ?/* x ? y : z */
> -#define f2(x,y,z) ? (x ^ y ^ z) ? ? ? ? ? ? ? ? ? ? ? ?/* XOR */
> -#define f3(x,y,z) ? ((x & y) + (z & (x ^ y))) ?/* majority */
> +#ifdef CONFIG_X86
> + ?#define setW(x, val) (*(volatile __u32 *)&W(x) = (val))
> +#elif defined(CONFIG_ARM)
> + ?#define setW(x, val) do { W(x) = (val); __asm__("":::"memory"); } while (0)
> +#else
> + ?#define setW(x, val) (W(x) = (val))
> +#endif
>
> -/* The SHA Mysterious Constants */
> +/* This "rolls" over the 512-bit array */
> +#define W(x) (array[(x)&15])
>
> -#define K1 ?0x5A827999L ? ? ? ? ? ? ? ? ? ? ? ?/* Rounds ?0-19: sqrt(2) * 2^30 */
> -#define K2 ?0x6ED9EBA1L ? ? ? ? ? ? ? ? ? ? ? ?/* Rounds 20-39: sqrt(3) * 2^30 */
> -#define K3 ?0x8F1BBCDCL ? ? ? ? ? ? ? ? ? ? ? ?/* Rounds 40-59: sqrt(5) * 2^30 */
> -#define K4 ?0xCA62C1D6L ? ? ? ? ? ? ? ? ? ? ? ?/* Rounds 60-79: sqrt(10) * 2^30 */
> +/*
> + * Where do we get the source from? The first 16 iterations get it from
> + * the input data, the next mix it from the 512-bit array.
> + */
> +#define SHA_SRC(t) get_unaligned_be32((__u32 *)data + t)
> +#define SHA_MIX(t) rol32(W(t+13) ^ W(t+8) ^ W(t+2) ^ W(t), 1)
> +
> +#define SHA_ROUND(t, input, fn, constant, A, B, C, D, E) do { \
> + ? ? ? __u32 TEMP = input(t); setW(t, TEMP); \
> + ? ? ? E += TEMP + rol32(A,5) + (fn) + (constant); \
> + ? ? ? B = ror32(B, 2); } while (0)
> +
> +#define T_0_15(t, A, B, C, D, E) ?SHA_ROUND(t, SHA_SRC, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
> +#define T_16_19(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (((C^D)&B)^D) , 0x5a827999, A, B, C, D, E )
> +#define T_20_39(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , 0x6ed9eba1, A, B, C, D, E )
> +#define T_40_59(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, ((B&C)+(D&(B^C))) , 0x8f1bbcdc, A, B, C, D, E )
> +#define T_60_79(t, A, B, C, D, E) SHA_ROUND(t, SHA_MIX, (B^C^D) , ?0xca62c1d6, A, B, C, D, E )
>
> ?/**
> ?* sha_transform - single block SHA1 transform
> ?*
> ?* @digest: 160 bit digest to update
> ?* @data: ? 512 bits of data to hash
> - * @W: ? ? ?80 words of workspace (see note)
> + * @array: ?16 words of workspace (see note)
> ?*
> ?* This function generates a SHA1 digest for a single 512-bit block.
> ?* Be warned, it does not handle padding and message digest, do not
> @@ -36,47 +77,111 @@
> ?* to clear the workspace. This is left to the caller to avoid
> ?* unnecessary clears between chained hashing operations.
> ?*/
> -void sha_transform(__u32 *digest, const char *in, __u32 *W)
> +void sha_transform(__u32 *digest, const char *data, __u32 *array)
> ?{
> - ? ? ? __u32 a, b, c, d, e, t, i;
> -
> - ? ? ? for (i = 0; i < 16; i++)
> - ? ? ? ? ? ? ? W[i] = be32_to_cpu(((const __be32 *)in)[i]);
> -
> - ? ? ? for (i = 0; i < 64; i++)
> - ? ? ? ? ? ? ? W[i+16] = rol32(W[i+13] ^ W[i+8] ^ W[i+2] ^ W[i], 1);
> -
> - ? ? ? a = digest[0];
> - ? ? ? b = digest[1];
> - ? ? ? c = digest[2];
> - ? ? ? d = digest[3];
> - ? ? ? e = digest[4];
> -
> - ? ? ? for (i = 0; i < 20; i++) {
> - ? ? ? ? ? ? ? t = f1(b, c, d) + K1 + rol32(a, 5) + e + W[i];
> - ? ? ? ? ? ? ? e = d; d = c; c = rol32(b, 30); b = a; a = t;
> - ? ? ? }
> -
> - ? ? ? for (; i < 40; i ++) {
> - ? ? ? ? ? ? ? t = f2(b, c, d) + K2 + rol32(a, 5) + e + W[i];
> - ? ? ? ? ? ? ? e = d; d = c; c = rol32(b, 30); b = a; a = t;
> - ? ? ? }
> -
> - ? ? ? for (; i < 60; i ++) {
> - ? ? ? ? ? ? ? t = f3(b, c, d) + K3 + rol32(a, 5) + e + W[i];
> - ? ? ? ? ? ? ? e = d; d = c; c = rol32(b, 30); b = a; a = t;
> - ? ? ? }
> -
> - ? ? ? for (; i < 80; i ++) {
> - ? ? ? ? ? ? ? t = f2(b, c, d) + K4 + rol32(a, 5) + e + W[i];
> - ? ? ? ? ? ? ? e = d; d = c; c = rol32(b, 30); b = a; a = t;
> - ? ? ? }
> -
> - ? ? ? digest[0] += a;
> - ? ? ? digest[1] += b;
> - ? ? ? digest[2] += c;
> - ? ? ? digest[3] += d;
> - ? ? ? digest[4] += e;
> + ? ? ? __u32 A, B, C, D, E;
> +
> + ? ? ? A = digest[0];
> + ? ? ? B = digest[1];
> + ? ? ? C = digest[2];
> + ? ? ? D = digest[3];
> + ? ? ? E = digest[4];
> +
> + ? ? ? /* Round 1 - iterations 0-16 take their input from 'data' */
> + ? ? ? T_0_15( 0, A, B, C, D, E);
> + ? ? ? T_0_15( 1, E, A, B, C, D);
> + ? ? ? T_0_15( 2, D, E, A, B, C);
> + ? ? ? T_0_15( 3, C, D, E, A, B);
> + ? ? ? T_0_15( 4, B, C, D, E, A);
> + ? ? ? T_0_15( 5, A, B, C, D, E);
> + ? ? ? T_0_15( 6, E, A, B, C, D);
> + ? ? ? T_0_15( 7, D, E, A, B, C);
> + ? ? ? T_0_15( 8, C, D, E, A, B);
> + ? ? ? T_0_15( 9, B, C, D, E, A);
> + ? ? ? T_0_15(10, A, B, C, D, E);
> + ? ? ? T_0_15(11, E, A, B, C, D);
> + ? ? ? T_0_15(12, D, E, A, B, C);
> + ? ? ? T_0_15(13, C, D, E, A, B);
> + ? ? ? T_0_15(14, B, C, D, E, A);
> + ? ? ? T_0_15(15, A, B, C, D, E);
> +
> + ? ? ? /* Round 1 - tail. Input from 512-bit mixing array */
> + ? ? ? T_16_19(16, E, A, B, C, D);
> + ? ? ? T_16_19(17, D, E, A, B, C);
> + ? ? ? T_16_19(18, C, D, E, A, B);
> + ? ? ? T_16_19(19, B, C, D, E, A);
> +
> + ? ? ? /* Round 2 */
> + ? ? ? T_20_39(20, A, B, C, D, E);
> + ? ? ? T_20_39(21, E, A, B, C, D);
> + ? ? ? T_20_39(22, D, E, A, B, C);
> + ? ? ? T_20_39(23, C, D, E, A, B);
> + ? ? ? T_20_39(24, B, C, D, E, A);
> + ? ? ? T_20_39(25, A, B, C, D, E);
> + ? ? ? T_20_39(26, E, A, B, C, D);
> + ? ? ? T_20_39(27, D, E, A, B, C);
> + ? ? ? T_20_39(28, C, D, E, A, B);
> + ? ? ? T_20_39(29, B, C, D, E, A);
> + ? ? ? T_20_39(30, A, B, C, D, E);
> + ? ? ? T_20_39(31, E, A, B, C, D);
> + ? ? ? T_20_39(32, D, E, A, B, C);
> + ? ? ? T_20_39(33, C, D, E, A, B);
> + ? ? ? T_20_39(34, B, C, D, E, A);
> + ? ? ? T_20_39(35, A, B, C, D, E);
> + ? ? ? T_20_39(36, E, A, B, C, D);
> + ? ? ? T_20_39(37, D, E, A, B, C);
> + ? ? ? T_20_39(38, C, D, E, A, B);
> + ? ? ? T_20_39(39, B, C, D, E, A);
> +
> + ? ? ? /* Round 3 */
> + ? ? ? T_40_59(40, A, B, C, D, E);
> + ? ? ? T_40_59(41, E, A, B, C, D);
> + ? ? ? T_40_59(42, D, E, A, B, C);
> + ? ? ? T_40_59(43, C, D, E, A, B);
> + ? ? ? T_40_59(44, B, C, D, E, A);
> + ? ? ? T_40_59(45, A, B, C, D, E);
> + ? ? ? T_40_59(46, E, A, B, C, D);
> + ? ? ? T_40_59(47, D, E, A, B, C);
> + ? ? ? T_40_59(48, C, D, E, A, B);
> + ? ? ? T_40_59(49, B, C, D, E, A);
> + ? ? ? T_40_59(50, A, B, C, D, E);
> + ? ? ? T_40_59(51, E, A, B, C, D);
> + ? ? ? T_40_59(52, D, E, A, B, C);
> + ? ? ? T_40_59(53, C, D, E, A, B);
> + ? ? ? T_40_59(54, B, C, D, E, A);
> + ? ? ? T_40_59(55, A, B, C, D, E);
> + ? ? ? T_40_59(56, E, A, B, C, D);
> + ? ? ? T_40_59(57, D, E, A, B, C);
> + ? ? ? T_40_59(58, C, D, E, A, B);
> + ? ? ? T_40_59(59, B, C, D, E, A);
> +
> + ? ? ? /* Round 4 */
> + ? ? ? T_60_79(60, A, B, C, D, E);
> + ? ? ? T_60_79(61, E, A, B, C, D);
> + ? ? ? T_60_79(62, D, E, A, B, C);
> + ? ? ? T_60_79(63, C, D, E, A, B);
> + ? ? ? T_60_79(64, B, C, D, E, A);
> + ? ? ? T_60_79(65, A, B, C, D, E);
> + ? ? ? T_60_79(66, E, A, B, C, D);
> + ? ? ? T_60_79(67, D, E, A, B, C);
> + ? ? ? T_60_79(68, C, D, E, A, B);
> + ? ? ? T_60_79(69, B, C, D, E, A);
> + ? ? ? T_60_79(70, A, B, C, D, E);
> + ? ? ? T_60_79(71, E, A, B, C, D);
> + ? ? ? T_60_79(72, D, E, A, B, C);
> + ? ? ? T_60_79(73, C, D, E, A, B);
> + ? ? ? T_60_79(74, B, C, D, E, A);
> + ? ? ? T_60_79(75, A, B, C, D, E);
> + ? ? ? T_60_79(76, E, A, B, C, D);
> + ? ? ? T_60_79(77, D, E, A, B, C);
> + ? ? ? T_60_79(78, C, D, E, A, B);
> + ? ? ? T_60_79(79, B, C, D, E, A);
> +
> + ? ? ? digest[0] += A;
> + ? ? ? digest[1] += B;
> + ? ? ? digest[2] += C;
> + ? ? ? digest[3] += D;
> + ? ? ? digest[4] += E;
> ?}
> ?EXPORT_SYMBOL(sha_transform);
>
> @@ -92,4 +197,3 @@ void sha_init(__u32 *buf)
> ? ? ? ?buf[3] = 0x10325476;
> ? ? ? ?buf[4] = 0xc3d2e1f0;
> ?}
> -
> --
> 1.7.3.1
>
> --

Hello Mandeep,

This patch cause a hang on my custom AT91RM9200 ARM board.
See logs below.

Linux version 3.0.0-mpa-07617-gce20efc (subcon@archspace) (gcc version
4.3.3 (GCC) ) #101 PREEMPT Sun Aug 7 15:30:13 CEST 2011
CPU: ARM920T [41129200] revision 0 (ARMv4T), cr=c0007177
CPU: VIVT data cache, VIVT instruction cache
Machine: Phontech MPA 1600
Memory policy: ECC disabled, Data cache writeback
AT91: Detected soc type: at91rm9200
AT91: Detected soc subtype: Unknown
AT91: sram at 0x200000 of 0x4000 mapped at 0xfef74000
Clocks: CPU 179 MHz, master 59 MHz, main 18.432 MHz
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 16256
Kernel command line: mem=64M console=ttyAT0,115200 root=/dev/nfs
nfsroot=172.16.10.1:/array/devel/mpa_fs,v3
ip=172.16.10.100:::255.255.255.0::eth0:off
PID hash table entries: 256 (order: -2, 1024 bytes)
Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
Memory: 64MB = 64MB total
Memory: 62096k/62096k available, 3440k reserved, 0K highmem
Virtual kernel memory layout:
vector : 0xffff0000 - 0xffff1000 ( 4 kB)
fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB)
DMA : 0xffc00000 - 0xffe00000 ( 2 MB)
vmalloc : 0xc4800000 - 0xfee00000 ( 934 MB)
lowmem : 0xc0000000 - 0xc4000000 ( 64 MB)
modules : 0xbf000000 - 0xc0000000 ( 16 MB)
.text : 0xc0008000 - 0xc027f000 (2524 kB)
.init : 0xc027f000 - 0xc029a000 ( 108 kB)
.data : 0xc029a000 - 0xc02b46a0 ( 106 kB)
.bss : 0xc02b46c4 - 0xc02c4470 ( 64 kB)
SLUB: Genslabs=13, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS:192
AT91: 128 gpio irqs in 4 banks
console [ttyAT0] enabled
Calibrating delay loop... 78.64 BogoMIPS (lpj=393216)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
NET: Registered protocol family 16
bio: create slab <bio-0> at 0
SCSI subsystem initialized
i2c-gpio i2c-gpio: using pins 57 (SDA) and 58 (SCL)
Switching to clocksource 32k_counter
NET: Registered protocol family 2

And here its stuck.

A good boot should looks like this;
Linux version 3.0.0-mpa-07617-gce20efc-dirty (subcon@archspace) (gcc
version 4.3.3 (GCC) ) #100 PREEMPT Sun Aug 7 15:16:59 CEST 2011
CPU: ARM920T [41129200] revision 0 (ARMv4T), cr=c0007177
CPU: VIVT data cache, VIVT instruction cache
Machine: Phontech MPA 1600
Memory policy: ECC disabled, Data cache writeback
AT91: Detected soc type: at91rm9200
AT91: Detected soc subtype: Unknown
AT91: sram at 0x200000 of 0x4000 mapped at 0xfef74000
On node 0 totalpages: 16384
free_area_init_node: node 0, pgdat c02b3ee8, node_mem_map c02c5000
Normal zone: 128 pages used for memmap
Normal zone: 0 pages reserved
Normal zone: 16256 pages, LIFO batch:3
Clocks: CPU 179 MHz, master 59 MHz, main 18.432 MHz
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0
Built 1 zonelists in Zone order, mobility grouping on. Total pages: 16256
Kernel command line: mem=64M console=ttyAT0,115200 root=/dev/nfs
nfsroot=172.16.10.1:/array/devel/mpa_fs,v3
ip=172.16.10.100:::255.255.255.0::eth0:off
PID hash table entries: 256 (order: -2, 1024 bytes)
Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
Memory: 64MB = 64MB total
Memory: 62096k/62096k available, 3440k reserved, 0K highmem
Virtual kernel memory layout:
vector : 0xffff0000 - 0xffff1000 ( 4 kB)
fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB)
DMA : 0xffc00000 - 0xffe00000 ( 2 MB)
vmalloc : 0xc4800000 - 0xfee00000 ( 934 MB)
lowmem : 0xc0000000 - 0xc4000000 ( 64 MB)
modules : 0xbf000000 - 0xc0000000 ( 16 MB)
.text : 0xc0008000 - 0xc027f000 (2524 kB)
.init : 0xc027f000 - 0xc029a000 ( 108 kB)
.data : 0xc029a000 - 0xc02b46a0 ( 106 kB)
.bss : 0xc02b46c4 - 0xc02c4470 ( 64 kB)
SLUB: Genslabs=13, HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
NR_IRQS:192
AT91: 128 gpio irqs in 4 banks
console [ttyAT0] enabled
Calibrating delay loop... 78.23 BogoMIPS (lpj=391168)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
NET: Registered protocol family 16
bio: create slab <bio-0> at 0
SCSI subsystem initialized
i2c-gpio i2c-gpio: using pins 57 (SDA) and 58 (SCL)
Switching to clocksource 32k_counter
NET: Registered protocol family 2
IP route cache hash table entries: 1024 (order: 0, 4096 bytes)
TCP established hash table entries: 2048 (order: 2, 16384 bytes)
TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
TCP: Hash tables configured (established 2048 bind 2048)
TCP reno registered
UDP hash table entries: 256 (order: 0, 4096 bytes)
UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
NET: Registered protocol family 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
...

Reverting this patch makes my board boot normal again.

I see some ARM asm in your patch, maybe this is the cause?
My ARM core is a ARM920T.

And btw, I am NFS booting this board.

regards
Joachim Eastwood

2011-08-07 16:53:03

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 4:54 AM, Joachim Eastwood <[email protected]> wrote:
>
> I see some ARM asm in your patch, maybe this is the cause?

No, it's just a barrier to make sure the compiler doesn't do crazy
things, no actual asm instructions involved.

That code is quite well tested in git, so I'm surprised it has any
problems on arm. It also has zero loops, a hang sounds odd. Can you
get some more debug information out of it (for example, where it hangs
- maybe "initcall_debug=1" on the kernel command line?

The biggest difference with the git sources is the slightly different
calling conventions (passing the workspace array as an argument is bad
for code generation, btw - since now gcc doesn't see that the
workspace accesses are dead) and the fact that the kernel version uses
kernel macros like "get_unaligned_be32()" rather than it's own
implementation.

Linus

2011-08-07 17:36:04

by Joachim Eastwood

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 6:52 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Aug 7, 2011 at 4:54 AM, Joachim ?Eastwood <[email protected]> wrote:
>>
>> I see some ARM asm in your patch, maybe this is the cause?
>
> No, it's just a barrier to make sure the compiler doesn't do crazy
> things, no actual asm instructions involved.
>
> That code is quite well tested in git, so I'm surprised it has any
> problems on arm. It also has zero loops, a hang sounds odd. Can you
> get some more debug information out of it (for example, where it hangs
> - maybe "initcall_debug=1" on the kernel command line?

initcall_debug=1 didn't do anything to the boot log.

But I add some printk's around the calls to sha_init and sha_transform.
...
NET: Registered protocol family 2
extract_buf: call sha_init
extract_buf: call sha_init done
extract_buf: call sha_transform
extract_buf: call sha_transform done
extract_buf: call sha_transform
extract_buf: call sha_transform done
extract_buf: function exit

These printk's come from drivers/char/random.c
So it doesn't seem like it hangs in any of the sha_* funtions.

But I never see any of the printk's I added to net/ipv4/syncookies.c
or net/ipv4/tcp_output.c.

btw, my compiler is: arm-angstrom-linux-gnueabi-gcc (GCC) 4.3.3

regards
Joachim Eastwood

> The biggest difference with the git sources is the slightly different
> calling conventions (passing the workspace array as an argument is bad
> for code generation, btw - since now gcc doesn't see that the
> workspace accesses are dead) and the fact that the kernel version uses
> kernel macros like "get_unaligned_be32()" rather than it's own
> implementation.
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Linus
>

2011-08-07 17:45:09

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 10:36 AM, Joachim Eastwood <[email protected]> wrote:
>
> These printk's come from drivers/char/random.c
> So it doesn't seem like it hangs in any of the sha_* funtions.

The only other change is to SHA_WORKSPACE_WORDS - I wonder if some
code depends on the old (much bigger) workspace for some reason?

The git SHA1 routines are way smarter than the old SHA1, and will
re-use the workspace area, so they need only a fraction of the old
area.

Try changing SHA_WORKSPACE_WORDS back to 80 (in
include/linux/cryptohash.h). The git sha1 only needs 16 words, but ..

If that fixes it for you, then it's almost certainly some buggy user
that uses the SHA1 workspace array for its own odd case, and
incorrectly "knows" that it's that old wasteful 320 bytes. There's a
few places in networking that uses SHA_WORKSPACE_WORDS.

Linus

2011-08-07 18:08:46

by Joachim Eastwood

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 7:44 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Aug 7, 2011 at 10:36 AM, Joachim ?Eastwood <[email protected]> wrote:
>>
>> These printk's come from drivers/char/random.c
>> So it doesn't seem like it hangs in any of the sha_* funtions.
>
> The only other change is to SHA_WORKSPACE_WORDS - I wonder if some
> code depends on the old (much bigger) workspace for some reason?
>
> The git SHA1 routines are way smarter than the old SHA1, and will
> re-use the workspace area, so they need only a fraction of the old
> area.
>
> Try changing SHA_WORKSPACE_WORDS back to 80 (in
> include/linux/cryptohash.h). The git sha1 only needs 16 words, but ..

yup, setting it to 80 makes my kernel boot again :-)

> If that fixes it for you, then it's almost certainly some buggy user
> that uses the SHA1 workspace array for its own odd case, and
> incorrectly "knows" that it's that old wasteful 320 bytes. There's a
> few places in networking that uses SHA_WORKSPACE_WORDS.

Guess more architectures than ARM are affected by this then.

regards
Joachim Eastwood

> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Linus
>

2011-08-07 18:38:12

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

There aren't many users of that define, could you just turn it back to the proper 16, and then try changing it to 80 in each place that uses it?

That way we'd see exactly *which* use is the buggy one..

Linus

Joachim Eastwood <[email protected]> wrote:

>On Sun, Aug 7, 2011 at 7:44 PM, Linus Torvalds
><[email protected]> wrote:
>> On Sun, Aug 7, 2011 at 10:36 AM, Joachim Eastwood
><[email protected]> wrote:
>>>
>>> These printk's come from drivers/char/random.c
>>> So it doesn't seem like it hangs in any of the sha_* funtions.
>>
>> The only other change is to SHA_WORKSPACE_WORDS - I wonder if some
>> code depends on the old (much bigger) workspace for some reason?
>>
>> The git SHA1 routines are way smarter than the old SHA1, and will
>> re-use the workspace area, so they need only a fraction of the old
>> area.
>>
>> Try changing SHA_WORKSPACE_WORDS back to 80 (in
>> include/linux/cryptohash.h). The git sha1 only needs 16 words, but ..
>
>yup, setting it to 80 makes my kernel boot again :-)
>
>> If that fixes it for you, then it's almost certainly some buggy user
>> that uses the SHA1 workspace array for its own odd case, and
>> incorrectly "knows" that it's that old wasteful 320 bytes. There's a
>> few places in networking that uses SHA_WORKSPACE_WORDS.
>
>Guess more architectures than ARM are affected by this then.
>
>regards
>Joachim Eastwood
>
>> Linus
>>

2011-08-07 18:57:36

by Joachim Eastwood

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 8:38 PM, Linus Torvalds
<[email protected]> wrote:
> There aren't many users of that define, could you just turn it back to the proper 16, and then try changing it to 80 in each place that uses it?
>
> That way we'd see exactly *which* use is the buggy one..

Its drivers/char/random.c.

Boots fine when forcing workspace array in extract_buf to 80.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index c35a785..0584bb0 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -816,7 +816,7 @@ static size_t account(struct entropy_store *r,
size_t nbytes, int min,
static void extract_buf(struct entropy_store *r, __u8 *out)
{
int i;
- __u32 hash[5], workspace[SHA_WORKSPACE_WORDS];
+ __u32 hash[5], workspace[80 /*SHA_WORKSPACE_WORDS*/];
__u8 extract[64];

/* Generate a hash across the pool, 16 words (512 bits) at a time */

regards
Joachim Eastwood

> ? ? ? ?Linus
>
> Joachim ?Eastwood <[email protected]> wrote:
>
>>On Sun, Aug 7, 2011 at 7:44 PM, Linus Torvalds
>><[email protected]> wrote:
>>> On Sun, Aug 7, 2011 at 10:36 AM, Joachim ?Eastwood
>><[email protected]> wrote:
>>>>
>>>> These printk's come from drivers/char/random.c
>>>> So it doesn't seem like it hangs in any of the sha_* funtions.
>>>
>>> The only other change is to SHA_WORKSPACE_WORDS - I wonder if some
>>> code depends on the old (much bigger) workspace for some reason?
>>>
>>> The git SHA1 routines are way smarter than the old SHA1, and will
>>> re-use the workspace area, so they need only a fraction of the old
>>> area.
>>>
>>> Try changing SHA_WORKSPACE_WORDS back to 80 (in
>>> include/linux/cryptohash.h). The git sha1 only needs 16 words, but ..
>>
>>yup, setting it to 80 makes my kernel boot again :-)
>>
>>> If that fixes it for you, then it's almost certainly some buggy user
>>> that uses the SHA1 workspace array for its own odd case, and
>>> incorrectly "knows" that it's that old wasteful 320 bytes. There's a
>>> few places in networking that uses SHA_WORKSPACE_WORDS.
>>
>>Guess more architectures than ARM are affected by this then.
>>
>>regards
>>Joachim Eastwood
>>
>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? Linus
>>>
>
>

2011-08-07 19:48:04

by Andreas Schwab

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

Joachim Eastwood <[email protected]> writes:

> On Sun, Aug 7, 2011 at 8:38 PM, Linus Torvalds
> <[email protected]> wrote:
>> There aren't many users of that define, could you just turn it back to the proper 16, and then try changing it to 80 in each place that uses it?
>>
>> That way we'd see exactly *which* use is the buggy one..
>
> Its drivers/char/random.c.

ARM has its own implementation of sha_transform in arch/arm/lib/sha1.S,
which assumes SHA_WORKSPACE_WORDS is 80.

Andreas.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2011-08-07 20:18:36

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 12:47 PM, Andreas Schwab <[email protected]> wrote:
>
> ARM has its own implementation of sha_transform in arch/arm/lib/sha1.S,
> which assumes SHA_WORKSPACE_WORDS is 80.

Well, that certainly explains it.

I wonder if that thing is worth it. It seems to be based on the bad
slow version of sha1, so I suspect that the biggest advantage of it
may the byte-swapping being done more efficiently. The ARM version of
"get_unaligned_be32()" is potentially pretty bad.

Joachim, does it all work for you if you just remove 'sha1.o' from
lib-y in arch/arm/lib/Makefile?

Nico (now with corrected email address): is that ARM-optimized asm
really worth it? Compared to the git C implementation?

Linus

2011-08-07 20:48:04

by Joachim Eastwood

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 10:17 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Aug 7, 2011 at 12:47 PM, Andreas Schwab <[email protected]> wrote:
>>
>> ARM has its own implementation of sha_transform in arch/arm/lib/sha1.S,
>> which assumes SHA_WORKSPACE_WORDS is 80.
>
> Well, that certainly explains it.
>
> I wonder if that thing is worth it. It seems to be based on the bad
> slow version of sha1, so I suspect that the biggest advantage of it
> may the byte-swapping being done more efficiently. The ARM version of
> "get_unaligned_be32()" is potentially pretty bad.
>
> Joachim, does it all work for you if you just remove 'sha1.o' from
> lib-y in arch/arm/lib/Makefile?

yes, this works. At least my board boots as normal.

regards
Joachim Eastwood

> Nico (now with corrected email address): is that ARM-optimized asm
> really worth it? Compared to the git C implementation?
>
> ? ? ? ? ? ? ? ? ? ? ? ?Linus
>

2011-08-07 21:07:15

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, Aug 7, 2011 at 1:48 PM, Joachim Eastwood <[email protected]> wrote:
>
> yes, this works. At least my board boots as normal.

Ok, I'll remove it for -rc1, just to have a working ARM setup. Maybe
we can re-introduce it later (either together with some arm-specific
hack for SHA_WORKSPACE_WORDS or by having an arm-optimized version of
the *good* sha1 routine).

But I doubt it: there used to be an ARM-optimized thing in git too. It
was removed two years ago with the commit message:

remove ARM and Mozilla SHA1 implementations

They are both slower than the new BLK_SHA1 implementation, so it is
pointless to keep them around.

and quite frankly, that removed code seems to be the same as the
in-kernel one. So I bet the ARM "optimized" SHA1 is simply not worth
keeping around.

Linus

2011-08-07 21:29:37

by Joe Perches

[permalink] [raw]

Subject: [PATCH] treewide: Remove direct uses of SHA_WORKSPACE_WORDS

While not connected to ARM's implementation of sha_transform,
maybe this might make code a bit clearer.

Remove need to know the size and type of SHA_WORKSPACE_WORDS.
Introduce and use opaque struct sha_workspace instead.

Add #include <linux/cryptohash.h> to lib/sha1.c

Signed-off-by: Joe Perches <[email protected]>
---
crypto/sha1_generic.c | 6 +++---
drivers/char/random.c | 9 +++++----
include/linux/cryptohash.h | 7 ++++++-
lib/sha1.c | 7 +++++--
net/ipv4/syncookies.c | 7 ++++---
net/ipv4/tcp_output.c | 4 ++--
net/ipv6/syncookies.c | 7 ++++---
7 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/crypto/sha1_generic.c b/crypto/sha1_generic.c
index 00ae60e..639b507 100644
--- a/crypto/sha1_generic.c
+++ b/crypto/sha1_generic.c
@@ -49,7 +49,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
src = data;

if ((partial + len) >= SHA1_BLOCK_SIZE) {
- u32 temp[SHA_WORKSPACE_WORDS];
+ struct sha_workspace temp;

if (partial) {
done = -partial;
@@ -59,12 +59,12 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
}

do {
- sha_transform(sctx->state, src, temp);
+ sha_transform(sctx->state, src, &temp);
done += SHA1_BLOCK_SIZE;
src = data + done;
} while (done + SHA1_BLOCK_SIZE <= len);

- memset(temp, 0, sizeof(temp));
+ memset(&temp, 0, sizeof(temp));
partial = 0;
}
memcpy(sctx->buffer + partial, src, len - done);
diff --git a/drivers/char/random.c b/drivers/char/random.c
index c35a785..bd0fd99 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -816,13 +816,14 @@ static size_t account(struct entropy_store *r, size_t nbytes, int min,
static void extract_buf(struct entropy_store *r, __u8 *out)
{
int i;
- __u32 hash[5], workspace[SHA_WORKSPACE_WORDS];
+ __u32 hash[5];
+ struct sha_workspace workspace;
__u8 extract[64];

/* Generate a hash across the pool, 16 words (512 bits) at a time */
sha_init(hash);
for (i = 0; i < r->poolinfo->poolwords; i += 16)
- sha_transform(hash, (__u8 *)(r->pool + i), workspace);
+ sha_transform(hash, (__u8 *)(r->pool + i), &workspace);

/*
* We mix the hash back into the pool to prevent backtracking
@@ -839,9 +840,9 @@ static void extract_buf(struct entropy_store *r, __u8 *out)
* To avoid duplicates, we atomically extract a portion of the
* pool while mixing, and hash one final time.
*/
- sha_transform(hash, extract, workspace);
+ sha_transform(hash, extract, &workspace);
memset(extract, 0, sizeof(extract));
- memset(workspace, 0, sizeof(workspace));
+ memset(&workspace, 0, sizeof(workspace));

/*
* In case the hash function has some recognizable output
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index 2cd9f1c..18b3a27 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -5,8 +5,13 @@
#define SHA_MESSAGE_BYTES (512 /*bits*/ / 8)
#define SHA_WORKSPACE_WORDS 16

+struct sha_workspace {
+ __u32 words[SHA_WORKSPACE_WORDS];
+};
+
void sha_init(__u32 *buf);
-void sha_transform(__u32 *digest, const char *data, __u32 *W);
+void sha_transform(__u32 *digest, const char *data,
+ struct sha_workspace *workspace);

#define MD5_DIGEST_WORDS 4
#define MD5_MESSAGE_BYTES 64
diff --git a/lib/sha1.c b/lib/sha1.c
index f33271d..990612a 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -8,6 +8,7 @@
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bitops.h>
+#include <linux/cryptohash.h>
#include <asm/unaligned.h>

/*
@@ -66,7 +67,7 @@
*
* @digest: 160 bit digest to update
* @data: 512 bits of data to hash
- * @array: 16 words of workspace (see note)
+ * @workspace: struct sha_workspace * (see note)
*
* This function generates a SHA1 digest for a single 512-bit block.
* Be warned, it does not handle padding and message digest, do not
@@ -77,9 +78,11 @@
* to clear the workspace. This is left to the caller to avoid
* unnecessary clears between chained hashing operations.
*/
-void sha_transform(__u32 *digest, const char *data, __u32 *array)
+void sha_transform(__u32 *digest, const char *data,
+ struct sha_workspace *workspace)
{
__u32 A, B, C, D, E;
+ __u32 *array = workspace->words;

A = digest[0];
B = digest[1];
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 92bb943..77e8069 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -37,20 +37,21 @@ __initcall(init_syncookies);
#define COOKIEBITS 24 /* Upper bits store count */
#define COOKIEMASK (((__u32)1 << COOKIEBITS) - 1)

-static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
- ipv4_cookie_scratch);
+static DEFINE_PER_CPU(__u32 [16 + 5], ipv4_cookie_scratch);
+static DEFINE_PER_CPU(struct sha_workspace, ipv4_sha_workspace);

static u32 cookie_hash(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport,
u32 count, int c)
{
__u32 *tmp = __get_cpu_var(ipv4_cookie_scratch);
+ struct sha_workspace workspace = __get_cpu_var(ipv4_sha_workspace);

memcpy(tmp + 4, syncookie_secret[c], sizeof(syncookie_secret[c]));
tmp[0] = (__force u32)saddr;
tmp[1] = (__force u32)daddr;
tmp[2] = ((__force u32)sport << 16) + (__force u32)dport;
tmp[3] = count;
- sha_transform(tmp + 16, (__u8 *)tmp, tmp + 16 + 5);
+ sha_transform(tmp + 16, (__u8 *)tmp, &workspace);

return tmp[17];
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..d9388c8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2494,7 +2494,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
}

if (opts.hash_size > 0) {
- __u32 workspace[SHA_WORKSPACE_WORDS];
+ struct sha_workspace workspace;
u32 *mess = &xvp->cookie_bakery[COOKIE_DIGEST_WORDS];
u32 *tail = &mess[COOKIE_MESSAGE_WORDS-1];

@@ -2512,7 +2512,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,

sha_transform((__u32 *)&xvp->cookie_bakery[0],
(char *)mess,
- &workspace[0]);
+ &workspace);
opts.hash_location =
(__u8 *)&xvp->cookie_bakery[0];
}
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 89d5bf8..7781ef2 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -63,13 +63,14 @@ static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
return child;
}

-static DEFINE_PER_CPU(__u32 [16 + 5 + SHA_WORKSPACE_WORDS],
- ipv6_cookie_scratch);
+static DEFINE_PER_CPU(__u32 [16 + 5], ipv6_cookie_scratch);
+static DEFINE_PER_CPU(struct sha_workspace, ipv6_sha_workspace);

static u32 cookie_hash(const struct in6_addr *saddr, const struct in6_addr *daddr,
__be16 sport, __be16 dport, u32 count, int c)
{
__u32 *tmp = __get_cpu_var(ipv6_cookie_scratch);
+ struct sha_workspace workspace = __get_cpu_var(ipv6_sha_workspace);

/*
* we have 320 bits of information to hash, copy in the remaining
@@ -81,7 +82,7 @@ static u32 cookie_hash(const struct in6_addr *saddr, const struct in6_addr *dadd
memcpy(tmp + 4, daddr, 16);
tmp[8] = ((__force u32)sport << 16) + (__force u32)dport;
tmp[9] = count;
- sha_transform(tmp + 16, (__u8 *)tmp, tmp + 16 + 5);
+ sha_transform(tmp + 16, (__u8 *)tmp, &workspace);

return tmp[17];
}
--
1.7.6.405.gc1be0

2011-08-07 21:42:14

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] treewide: Remove direct uses of SHA_WORKSPACE_WORDS

On Sun, Aug 7, 2011 at 2:29 PM, Joe Perches <[email protected]> wrote:
> While not connected to ARM's implementation of sha_transform,
> maybe this might make code a bit clearer.
>
> Remove need to know the size and type of SHA_WORKSPACE_WORDS.
> Introduce and use opaque struct sha_workspace instead.

The thing is, that workspace thing is likely just broken, and should be removed.

On machines with sufficient registers, we can keep the workspace in
the register set, and setting it up on stack and passing it in as a
buffer is just bad.

On x86 and arm it doesn't matter, but on ppc it likely pessimizes the
code by resulting in pointless writebacks to that stack buffer.

So the right thing to do is probably to just remove it entirely, and
just make the sha_transform() thing allocate it on the stack as
needed. The only issue then is whether you really want the stackspace
cleared afterwards. My personal guess is that it really doesn't
matter: if you leak stack space anywhere, you are seriously screwed
from a security standpoint anyway, so..

But in the meantime, I don't want to add this kind of abstraction for
something that I suspect is just fundamentally broken.

Linus

2011-08-08 01:25:00

by Nicolas Pitre

[permalink] [raw]

Subject: Re: [PATCH] lib/sha1: use the git implementation of SHA-1

On Sun, 7 Aug 2011, Linus Torvalds wrote:

> On Sun, Aug 7, 2011 at 1:48 PM, Joachim Eastwood <[email protected]> wrote:
> >
> > yes, this works. At least my board boots as normal.
>
> Ok, I'll remove it for -rc1, just to have a working ARM setup. Maybe
> we can re-introduce it later (either together with some arm-specific
> hack for SHA_WORKSPACE_WORDS or by having an arm-optimized version of
> the *good* sha1 routine).
>
> But I doubt it: there used to be an ARM-optimized thing in git too. It
> was removed two years ago with the commit message:
>
> remove ARM and Mozilla SHA1 implementations
>
> They are both slower than the new BLK_SHA1 implementation, so it is
> pointless to keep them around.
>
> and quite frankly, that removed code seems to be the same as the
> in-kernel one. So I bet the ARM "optimized" SHA1 is simply not worth
> keeping around.

Indeed. The ARM code in the kernel is mine:

|commit c09f98271f685af349d3f0199360f1c0e85550e0
|Author: Nicolas Pitre <[email protected]>
|Date: Fri Oct 28 15:26:40 2005 +0100
|
| [ARM] 2930/1: optimized sha1 implementation for ARM
|
| Patch from Nicolas Pitre
|
| Here's an ARM assembly SHA1 implementation to replace the default C
| version. It is approximately 50% faster than the generic C version. On
| an XScale processor running at 400MHz:
| generic C version: 9.8 MB/s
| my version: 14.5 MB/s
[...]

I eventually added it to Git as well. After a while I worked on the C
version in Git:

|$ git log --reverse --oneline --author="Nicolas Pitre" git/block-sha1/
|30ba0de block-sha1: move code around
|dc52fd2 block-sha1: split the different "hacks" to be individually selected
|660231a block-sha1: support for architectures with memory alignment restrictions
|ee7dc31 block-sha1: more good unaligned memory access candidates
|d5f6a96 block-sha1: make the size member first in the context struct
|51ea551 make sure byte swapping is optimal for git
|e9c5dcd block-sha1: guard gcc extensions with __GNUC__
|30ae47b remove ARM and Mozilla SHA1 implementations

I therefore think that the ARM version in the kernel should also be
removed if the Git C version goes in.

Nicolas