2007-06-08 22:28:05

by Benjamin Gilbert

Subject: [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64

The following 3-part series adds assembly implementations of the SHA-1
transform for x86 and x86_64. For x86_64 the optimized code is always
selected; on x86 it is selected if the kernel is compiled for i486 or above
(since the code needs BSWAP). These changes primarily improve the
performance of the CryptoAPI SHA-1 module and of /dev/urandom. I've
included some performance data from my test boxes below.

This version incorporates feedback from Herbert Xu. Andrew, I'm sending
this to you because of the (admittedly tiny) intersection with arm and s390
in part 1.

-

tcrypt performance tests:

=== Pentium IV in 32-bit mode, average of 5 trials ===
Test#  Bytes/block  Bytes/update  Cyc/B (C)  Cyc/B (asm)  Change
0 16 16 229 114 50%
1 64 16 142 76 46%
2 64 64 79 35 56%
3 256 16 59 34 42%
4 256 64 44 24 45%
5 256 256 43 17 60%
6 1024 16 51 36 29%
7 1024 256 30 13 57%
8 1024 1024 28 12 57%
9 2048 16 66 30 55%
10 2048 256 31 12 61%
11 2048 1024 27 13 52%
12 2048 2048 26 13 50%
13 4096 16 49 30 39%
14 4096 256 28 12 57%
15 4096 1024 28 11 61%
16 4096 4096 26 13 50%
17 8192 16 49 29 41%
18 8192 256 27 11 59%
19 8192 1024 26 11 58%
20 8192 4096 25 10 60%
21 8192 8192 25 10 60%

=== Intel Core 2 in 64-bit mode, average of 5 trials ===
Test#  Bytes/block  Bytes/update  Cyc/B (C)  Cyc/B (asm)  Change
0 16 16 112 81 28%
1 64 16 55 39 29%
2 64 64 42 27 36%
3 256 16 35 25 29%
4 256 64 24 14 42%
5 256 256 22 12 45%
6 1024 16 31 22 29%
7 1024 256 17 9 47%
8 1024 1024 16 9 44%
9 2048 16 30 22 27%
10 2048 256 16 8 50%
11 2048 1024 16 8 50%
12 2048 2048 16 8 50%
13 4096 16 29 21 28%
14 4096 256 16 8 50%
15 4096 1024 15 8 47%
16 4096 4096 15 7 53%
17 8192 16 29 22 24%
18 8192 256 16 8 50%
19 8192 1024 15 7 53%
20 8192 4096 15 7 53%
21 8192 8192 15 7 53%
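
(In these tables, "Change" is the relative reduction in cycles per byte,
i.e. (C - asm) / C, rounded to the nearest percent.)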

I've also done informal tests on other boxes, and the performance
improvement has been in the same ballpark.

On the aforementioned Pentium IV, /dev/urandom throughput goes from 3.7 MB/s
to 5.6 MB/s with the patches; on the Core 2, it increases from 5.5 MB/s to
8.1 MB/s.

Signed-off-by: Benjamin Gilbert <[email protected]>


2007-06-08 22:09:19

by Benjamin Gilbert

Subject: [PATCH 1/3] [CRYPTO] Move sha_init() into cryptohash.h

Make sha_init() a static inline in cryptohash.h rather than an (unexported)
function in lib/sha1.c, in preparation for making sha1.c optional. This
also allows some cleanups:

- Modular code can now use sha_init() rather than reimplementing it (see
the sketch below)

- The optimized implementation of SHA-1 for ARM no longer needs to
reimplement sha_init() in assembly
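
As a rough illustration (not part of this patch; the function name is made
up), a caller can now do something like:

	#include <linux/cryptohash.h>

	static void hash_one_block(const char *block,
				   __u32 digest[SHA_DIGEST_WORDS])
	{
		__u32 W[SHA_WORKSPACE_WORDS];	/* scratch for sha_transform() */

		sha_init(digest);			/* standard initial vector */
		sha_transform(digest, block, W);	/* hash one 64-byte block */
	}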

Signed-off-by: Benjamin Gilbert <[email protected]>
---

arch/arm/lib/sha1.S | 16 ----------------
arch/s390/crypto/sha1_s390.c | 6 +-----
drivers/crypto/padlock-sha.c | 8 ++------
include/linux/cryptohash.h | 14 +++++++++++++-
lib/sha1.c | 14 --------------
5 files changed, 16 insertions(+), 42 deletions(-)

diff --git a/arch/arm/lib/sha1.S b/arch/arm/lib/sha1.S
index ff6ece4..5be800c 100644
--- a/arch/arm/lib/sha1.S
+++ b/arch/arm/lib/sha1.S
@@ -188,19 +188,3 @@ ENTRY(sha_transform)
.L_sha_K:
.word 0x5a827999, 0x6ed9eba1, 0x8f1bbcdc, 0xca62c1d6

-
-/*
- * void sha_init(__u32 *buf)
- */
-
-.L_sha_initial_digest:
- .word 0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0
-
-ENTRY(sha_init)
-
- str lr, [sp, #-4]!
- adr r1, .L_sha_initial_digest
- ldmia r1, {r1, r2, r3, ip, lr}
- stmia r0, {r1, r2, r3, ip, lr}
- ldr pc, [sp], #4
-
diff --git a/arch/s390/crypto/sha1_s390.c b/arch/s390/crypto/sha1_s390.c
index af4460e..fed9a2e 100644
--- a/arch/s390/crypto/sha1_s390.c
+++ b/arch/s390/crypto/sha1_s390.c
@@ -42,11 +42,7 @@ static void sha1_init(struct crypto_tfm *tfm)
{
struct s390_sha1_ctx *sctx = crypto_tfm_ctx(tfm);

- sctx->state[0] = 0x67452301;
- sctx->state[1] = 0xEFCDAB89;
- sctx->state[2] = 0x98BADCFE;
- sctx->state[3] = 0x10325476;
- sctx->state[4] = 0xC3D2E1F0;
+ sha_init(sctx->state);
sctx->count = 0;
}

diff --git a/drivers/crypto/padlock-sha.c b/drivers/crypto/padlock-sha.c
index a781fd2..b47d708 100644
--- a/drivers/crypto/padlock-sha.c
+++ b/drivers/crypto/padlock-sha.c
@@ -107,12 +107,8 @@ static void padlock_do_sha1(const char *in, char *out, int count)
char buf[128+16];
char *result = NEAREST_ALIGNED(buf);

- ((uint32_t *)result)[0] = 0x67452301;
- ((uint32_t *)result)[1] = 0xEFCDAB89;
- ((uint32_t *)result)[2] = 0x98BADCFE;
- ((uint32_t *)result)[3] = 0x10325476;
- ((uint32_t *)result)[4] = 0xC3D2E1F0;
-
+ sha_init((uint32_t *)result);
+
asm volatile (".byte 0xf3,0x0f,0xa6,0xc8" /* rep xsha1 */
: "+S"(in), "+D"(result)
: "c"(count), "a"(0));
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index c118b2a..a172401 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -4,7 +4,19 @@
#define SHA_DIGEST_WORDS 5
#define SHA_WORKSPACE_WORDS 80

-void sha_init(__u32 *buf);
+/**
+ * sha_init - initialize the vectors for a SHA1 digest
+ * @buf: vector to initialize
+ */
+static inline void sha_init(__u32 *buf)
+{
+ buf[0] = 0x67452301;
+ buf[1] = 0xefcdab89;
+ buf[2] = 0x98badcfe;
+ buf[3] = 0x10325476;
+ buf[4] = 0xc3d2e1f0;
+}
+
void sha_transform(__u32 *digest, const char *data, __u32 *W);

__u32 half_md4_transform(__u32 buf[4], __u32 const in[8]);
diff --git a/lib/sha1.c b/lib/sha1.c
index 4c45fd5..815816f 100644
--- a/lib/sha1.c
+++ b/lib/sha1.c
@@ -79,17 +79,3 @@ void sha_transform(__u32 *digest, const char *in, __u32 *W)
digest[4] += e;
}
EXPORT_SYMBOL(sha_transform);
-
-/**
- * sha_init - initialize the vectors for a SHA1 digest
- * @buf: vector to initialize
- */
-void sha_init(__u32 *buf)
-{
- buf[0] = 0x67452301;
- buf[1] = 0xefcdab89;
- buf[2] = 0x98badcfe;
- buf[3] = 0x10325476;
- buf[4] = 0xc3d2e1f0;
-}
-

2007-06-08 22:28:26

by Benjamin Gilbert

Subject: [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64

Add optimized implementation of the SHA-1 hash function for x86_64, ported
from the x86 implementation in Nettle (which is LGPLed).

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <[email protected]>
---

arch/x86_64/kernel/x8664_ksyms.c | 3
arch/x86_64/lib/Makefile | 2
arch/x86_64/lib/sha1.S | 281 ++++++++++++++++++++++++++++++++++++++
include/linux/cryptohash.h | 2
lib/Kconfig | 7 +
5 files changed, 293 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c
index 77c25b3..bc641ab 100644
--- a/arch/x86_64/kernel/x8664_ksyms.c
+++ b/arch/x86_64/kernel/x8664_ksyms.c
@@ -3,6 +3,7 @@

#include <linux/module.h>
#include <linux/smp.h>
+#include <linux/cryptohash.h>

#include <asm/semaphore.h>
#include <asm/processor.h>
@@ -60,3 +61,5 @@ EXPORT_SYMBOL(init_level4_pgt);
EXPORT_SYMBOL(load_gs_index);

EXPORT_SYMBOL(_proxy_pda);
+
+EXPORT_SYMBOL(sha_transform);
diff --git a/arch/x86_64/lib/Makefile b/arch/x86_64/lib/Makefile
index c943271..6c8110b 100644
--- a/arch/x86_64/lib/Makefile
+++ b/arch/x86_64/lib/Makefile
@@ -9,5 +9,5 @@ obj-$(CONFIG_SMP) += msr-on-cpu.o

lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
usercopy.o getuser.o putuser.o \
- thunk.o clear_page.o copy_page.o bitstr.o bitops.o
+ thunk.o clear_page.o copy_page.o bitstr.o bitops.o sha1.o
lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o copy_user_nocache.o
diff --git a/arch/x86_64/lib/sha1.S b/arch/x86_64/lib/sha1.S
new file mode 100644
index 0000000..48f4fde
--- /dev/null
+++ b/arch/x86_64/lib/sha1.S
@@ -0,0 +1,281 @@
+/*
+ * sha1-x86_64 - x86_64-optimized SHA1 hash algorithm
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <[email protected]>
+ * Ported from x86 to x86_64 by Benjamin Gilbert
+ *
+ * Copyright (C) 2004, Niels M?ller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation. A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage. r12-15 must be saved if they will be used. Accessing
+ r8-r15 takes an extra instruction byte. */
+#define P_STATE %rdi /* Pointer parameter */
+#define P_DATA %rsi /* Pointer parameter */
+#define DATA %rdx /* Pointer parameter */
+#define SA %edi /* Reuses P_STATE */
+#define SB %esi /* Reuses P_DATA */
+#define SC %eax
+#define SD %ebx /* Callee-saved */
+#define SE %ebp /* Callee-saved */
+#define TMP %ecx
+#define TMP2 %r8d /* Used by F3 */
+#define CONST %r9d
+#define STATE %r10
+
+/* Constants */
+#define K1VALUE $0x5A827999 /* Rounds 0-19 */
+#define K2VALUE $0x6ED9EBA1 /* Rounds 20-39 */
+#define K3VALUE $0x8F1BBCDC /* Rounds 40-59 */
+#define K4VALUE $0xCA62C1D6 /* Rounds 60-79 */
+
+/* Convert stack offsets in 32-bit words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via P_DATA into register, byteswaps it, and stores it in
+ the DATA array. */
+#define SWAP(index, register) \
+ movl OFFSET(index)(P_DATA), register; \
+ bswap register; \
+ movl register, OFFSET(index)(DATA)
+
+/* push/pop wrappers that update the DWARF unwind table */
+#define PUSH(regname) \
+ push %regname; \
+ CFI_ADJUST_CFA_OFFSET 8; \
+ CFI_REL_OFFSET regname, 0
+
+#define POP(regname) \
+ pop %regname; \
+ CFI_ADJUST_CFA_OFFSET -8; \
+ CFI_RESTORE regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ * W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i) \
+ movl OFFSET(i % 16)(DATA), TMP; \
+ xorl OFFSET((i + 2) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 8) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 13) % 16)(DATA), TMP; \
+ roll $1, TMP; \
+ movl TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ * f1(x,y,z) = z ^ (x & (y ^ z))
+ * f2(x,y,z) = x ^ y ^ z
+ * f3(x,y,z) = (x & y) | (z & (x | y))
+ * f4 = f2
+ *
+ * The macro Fk(x,y,z) computes = fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z) \
+ movl z, TMP; \
+ xorl y, TMP; \
+ andl x, TMP; \
+ xorl z, TMP
+
+#define F2(x,y,z) \
+ movl x, TMP; \
+ xorl y, TMP; \
+ xorl z, TMP
+
+#define F3(x,y,z) \
+ movl x, TMP2; \
+ andl y, TMP2; \
+ movl x, TMP; \
+ orl y, TMP; \
+ andl z, TMP; \
+ orl TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ * a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ * b' = a;
+ * c' = b <<< 30;
+ * d' = c;
+ * e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ * e += a <<< 5 + f( b, c, d ) + k + w;
+ * b <<<= 30
+ *
+ * k is not an explicit parameter; it's always stored in CONST.
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,w) \
+ addl CONST, e; \
+ addl w, e; \
+ f(b,c,d); \
+ addl TMP, e; \
+ movl a, TMP; \
+ roll $5, TMP; \
+ addl TMP, e; \
+ roll $30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* We expect a 64-byte workspace. */
+.text
+ENTRY(sha_transform)
+ CFI_STARTPROC
+ PUSH(rbx)
+ PUSH(rbp)
+
+ /* Load and byteswap data */
+ SWAP( 0, %eax); SWAP( 1, %ebx); SWAP( 2, %ecx); SWAP( 3, %ebp)
+ SWAP( 4, %r8d); SWAP( 5, %r9d); SWAP( 6, %r10d); SWAP( 7, %r11d)
+ SWAP( 8, %eax); SWAP( 9, %ebx); SWAP(10, %ecx); SWAP(11, %ebp)
+ SWAP(12, %r8d); SWAP(13, %r9d); SWAP(14, %r10d); SWAP(15, %r11d)
+
+ /* P_DATA now dead; free up P_STATE for other uses */
+ movq P_STATE, STATE
+
+ /* load the state vector */
+ movl (STATE), SA
+ movl 4(STATE), SB
+ movl 8(STATE), SC
+ movl 12(STATE), SD
+ movl 16(STATE), SE
+
+ movl K1VALUE, CONST
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 0)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 1)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 2)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 3)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 4)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 5)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 6)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 7)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 8)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 9)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET(10)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, OFFSET(11)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, OFFSET(12)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, OFFSET(13)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, OFFSET(14)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET(15)(DATA))
+ EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP)
+ EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP)
+ EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP)
+ EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP)
+
+ movl K2VALUE, CONST
+ EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ movl K3VALUE, CONST
+ EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ movl K4VALUE, CONST
+ EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ /* Update the state vector */
+ addl SA, (STATE)
+ addl SB, 4(STATE)
+ addl SC, 8(STATE)
+ addl SD, 12(STATE)
+ addl SE, 16(STATE)
+
+ POP(rbp)
+ POP(rbx)
+ ret
+ CFI_ENDPROC
+ENDPROC(sha_transform)
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index 0da331c..b3b14a3 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -7,6 +7,8 @@

#if defined(CONFIG_SHA1_X86)
#define SHA_WORKSPACE_WORDS 0
+#elif defined(CONFIG_SHA1_X86_64)
+#define SHA_WORKSPACE_WORDS 16
#else
#define SHA_WORKSPACE_WORDS 80
#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 69fdb64..23a84ed 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -132,9 +132,14 @@ config SHA1_X86
depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
default y

+config SHA1_X86_64
+ bool
+ depends on (X86 || UML_X86) && 64BIT
+ default y
+
config SHA1_GENERIC
bool
- depends on !SHA1_X86
+ depends on !SHA1_X86 && !SHA1_X86_64
default y

endmenu

2007-06-08 22:28:36

by Benjamin Gilbert

Subject: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Add x86-optimized implementation of the SHA-1 hash function, taken from
Nettle under the LGPL. This code will be enabled on kernels compiled for
486es or better; kernels which support 386es will use the generic
implementation (since we need BSWAP).

We disable building lib/sha1.o when an optimized implementation is
available, as the library link order for x86 (and x86_64) would otherwise
ignore the optimized version. The existing optimized implementation for ARM
does not do this; the library link order for that architecture appears to
favor the arch/arm/ version automatically. I've left this situation alone
since I'm not familiar with the ARM code, but a !ARM condition could be
added to CONFIG_SHA1_GENERIC if it makes sense.

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <[email protected]>
---

arch/i386/kernel/i386_ksyms.c | 5 +
arch/i386/lib/Makefile | 1
arch/i386/lib/sha1.S | 299 +++++++++++++++++++++++++++++++++++++++++
include/linux/cryptohash.h | 9 +
lib/Kconfig | 13 ++
lib/Makefile | 3
6 files changed, 328 insertions(+), 2 deletions(-)

diff --git a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
index e3d4b73..812bc4e 100644
--- a/arch/i386/kernel/i386_ksyms.c
+++ b/arch/i386/kernel/i386_ksyms.c
@@ -1,4 +1,5 @@
#include <linux/module.h>
+#include <linux/cryptohash.h>
#include <asm/checksum.h>
#include <asm/desc.h>

@@ -28,3 +29,7 @@ EXPORT_SYMBOL(__read_lock_failed);
#endif

EXPORT_SYMBOL(csum_partial);
+
+#ifdef CONFIG_SHA1_X86
+EXPORT_SYMBOL(sha_transform);
+#endif
diff --git a/arch/i386/lib/Makefile b/arch/i386/lib/Makefile
index 22d8ac5..69f4845 100644
--- a/arch/i386/lib/Makefile
+++ b/arch/i386/lib/Makefile
@@ -6,6 +6,7 @@
lib-y = checksum.o delay.o usercopy.o getuser.o putuser.o memcpy.o strstr.o \
bitops.o semaphore.o

+lib-$(CONFIG_SHA1_X86) += sha1.o
lib-$(CONFIG_X86_USE_3DNOW) += mmx.o

obj-$(CONFIG_SMP) += msr-on-cpu.o
diff --git a/arch/i386/lib/sha1.S b/arch/i386/lib/sha1.S
new file mode 100644
index 0000000..28aa4b7
--- /dev/null
+++ b/arch/i386/lib/sha1.S
@@ -0,0 +1,299 @@
+/*
+ * x86-optimized SHA1 hash algorithm (i486 and above)
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <[email protected]>
+ *
+ * Copyright (C) 2004, Niels M?ller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation. A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage */
+#define SA %eax
+#define SB %ebx
+#define SC %ecx
+#define SD %edx
+#define SE %ebp
+#define DATA %esp
+#define TMP %edi
+#define TMP2 %esi /* Used by SWAP and F3 */
+#define TMP3 64(%esp)
+
+/* Constants */
+#define K1VALUE $0x5A827999 /* Rounds 0-19 */
+#define K2VALUE $0x6ED9EBA1 /* Rounds 20-39 */
+#define K3VALUE $0x8F1BBCDC /* Rounds 40-59 */
+#define K4VALUE $0xCA62C1D6 /* Rounds 60-79 */
+
+/* Convert stack offsets in words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via TMP2 into register, byteswaps it, and stores it in
+ the DATA array. */
+#define SWAP(index, register) \
+ movl OFFSET(index)(TMP2), register; \
+ bswap register; \
+ movl register, OFFSET(index)(DATA)
+
+/* Sets the workspace word at the given index to TMP. */
+#define CLEAR(index) \
+ movl TMP, OFFSET(index)(DATA)
+
+/* pushl/popl wrappers that update the DWARF unwind table */
+#define PUSH(regname) \
+ pushl %regname; \
+ CFI_ADJUST_CFA_OFFSET 4; \
+ CFI_REL_OFFSET regname, 0
+
+#define POP(regname) \
+ popl %regname; \
+ CFI_ADJUST_CFA_OFFSET -4; \
+ CFI_RESTORE regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ * W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i) \
+ movl OFFSET(i % 16)(DATA), TMP; \
+ xorl OFFSET((i + 2) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 8) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 13) % 16)(DATA), TMP; \
+ roll $1, TMP; \
+ movl TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ * f1(x,y,z) = z ^ (x & (y ^ z))
+ * f2(x,y,z) = x ^ y ^ z
+ * f3(x,y,z) = (x & y) | (z & (x | y))
+ * f4 = f2
+ *
+ * The macro Fk(x,y,z) computes = fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z) \
+ movl z, TMP; \
+ xorl y, TMP; \
+ andl x, TMP; \
+ xorl z, TMP
+
+#define F2(x,y,z) \
+ movl x, TMP; \
+ xorl y, TMP; \
+ xorl z, TMP
+
+#define F3(x,y,z) \
+ movl x, TMP2; \
+ andl y, TMP2; \
+ movl x, TMP; \
+ orl y, TMP; \
+ andl z, TMP; \
+ orl TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ * a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ * b' = a;
+ * c' = b <<< 30;
+ * d' = c;
+ * e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ * e += a <<< 5 + f( b, c, d ) + k + w;
+ * b <<<= 30
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,k,w) \
+ addl k, e; \
+ addl w, e; \
+ f(b,c,d); \
+ addl TMP, e; \
+ movl a, TMP; \
+ roll $5, TMP; \
+ addl TMP, e; \
+ roll $30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* The workspace argument is ignored; we don't have enough registers to handle
+ a workspace that isn't on our own stack. */
+.text
+ENTRY(sha_transform)
+ CFI_STARTPROC
+ /* save all registers that need to be saved */
+ PUSH(ebx) /* 80(%esp) */
+ PUSH(ebp) /* 76(%esp) */
+ PUSH(esi) /* 72(%esp) */
+ PUSH(edi) /* 68(%esp) */
+
+ subl $68, %esp /* %esp = W */
+ CFI_ADJUST_CFA_OFFSET 68
+
+ /* Load and byteswap data */
+ movl 92(%esp), TMP2
+
+ SWAP( 0, %eax); SWAP( 1, %ebx); SWAP( 2, %ecx); SWAP( 3, %edx)
+ SWAP( 4, %eax); SWAP( 5, %ebx); SWAP( 6, %ecx); SWAP( 7, %edx)
+ SWAP( 8, %eax); SWAP( 9, %ebx); SWAP(10, %ecx); SWAP(11, %edx)
+ SWAP(12, %eax); SWAP(13, %ebx); SWAP(14, %ecx); SWAP(15, %edx)
+
+ /* load the state vector */
+ movl 88(%esp),TMP
+ movl (TMP), SA
+ movl 4(TMP), SB
+ movl 8(TMP), SC
+ movl 12(TMP), SD
+ movl 16(TMP), SE
+
+ movl K1VALUE, TMP2
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 0)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 1)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 2)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 3)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 4)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 5)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 6)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 7)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 8)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 9)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(10)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET(11)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET(12)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET(13)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET(14)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(15)(DATA))
+ EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP2, TMP)
+ EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP2, TMP)
+ EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP2, TMP)
+ EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP2, TMP)
+
+ /* TMP2 is free to use in these rounds */
+ movl K2VALUE, TMP2
+ EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ /* We have to put this constant on the stack */
+ movl K3VALUE, TMP3
+ EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ movl K4VALUE, TMP2
+ EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ /* Update the state vector */
+ movl 88(%esp),TMP
+ addl SA, (TMP)
+ addl SB, 4(TMP)
+ addl SC, 8(TMP)
+ addl SD, 12(TMP)
+ addl SE, 16(TMP)
+
+ /* Clear the workspace for security */
+ xorl TMP, TMP
+ CLEAR( 0); CLEAR( 1); CLEAR( 2); CLEAR( 3);
+ CLEAR( 4); CLEAR( 5); CLEAR( 6); CLEAR( 7);
+ CLEAR( 8); CLEAR( 9); CLEAR(10); CLEAR(11);
+ CLEAR(12); CLEAR(13); CLEAR(14); CLEAR(15);
+
+ addl $68, %esp
+ CFI_ADJUST_CFA_OFFSET -68
+ POP(edi)
+ POP(esi)
+ POP(ebp)
+ POP(ebx)
+ ret
+ CFI_ENDPROC
+ENDPROC(sha_transform)
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index a172401..0da331c 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -1,8 +1,15 @@
#ifndef __CRYPTOHASH_H
#define __CRYPTOHASH_H

+#include <linux/linkage.h>
+
#define SHA_DIGEST_WORDS 5
+
+#if defined(CONFIG_SHA1_X86)
+#define SHA_WORKSPACE_WORDS 0
+#else
#define SHA_WORKSPACE_WORDS 80
+#endif

/**
* sha_init - initialize the vectors for a SHA1 digest
@@ -17,7 +24,7 @@ static inline void sha_init(__u32 *buf)
buf[4] = 0xc3d2e1f0;
}

-void sha_transform(__u32 *digest, const char *data, __u32 *W);
+asmlinkage void sha_transform(__u32 *digest, const char *data, __u32 *W);

__u32 half_md4_transform(__u32 buf[4], __u32 const in[8]);

diff --git a/lib/Kconfig b/lib/Kconfig
index 2e7ae6b..69fdb64 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -124,4 +124,17 @@ config HAS_DMA
depends on !NO_DMA
default y

+#
+# Optimized SHA-1 support is autoconfigured
+#
+config SHA1_X86
+ bool
+ depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
+ default y
+
+config SHA1_GENERIC
+ bool
+ depends on !SHA1_X86
+ default y
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index c8c8e20..f67be29 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -5,10 +5,11 @@
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o \
idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
- sha1.o irq_regs.o reciprocal_div.o
+ irq_regs.o reciprocal_div.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SHA1_GENERIC) += sha1.o

lib-y += kobject.o kref.o kobject_uevent.o klist.o


2007-06-09 07:32:59

by Jan Engelhardt

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+


On Jun 8 2007 17:42, Benjamin Gilbert wrote:

>@@ -0,0 +1,299 @@
>+/*
>+ * x86-optimized SHA1 hash algorithm (i486 and above)
>+ *
>+ * Originally from Nettle
>+ * Ported from M4 to cpp by Benjamin Gilbert <[email protected]>
>+ *
>+ * Copyright (C) 2004, Niels M?ller
>+ * Copyright (C) 2006-2007 Carnegie Mellon University

UTF-8 please. Hint: it should most likely be an ö.


Jan
--

2007-06-09 20:12:28

by Matt Mackall

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

On Fri, Jun 08, 2007 at 05:42:53PM -0400, Benjamin Gilbert wrote:
> Add x86-optimized implementation of the SHA-1 hash function, taken from
> Nettle under the LGPL. This code will be enabled on kernels compiled for
> 486es or better; kernels which support 386es will use the generic
> implementation (since we need BSWAP).
>
> We disable building lib/sha1.o when an optimized implementation is
> available, as the library link order for x86 (and x86_64) would otherwise
> ignore the optimized version. The existing optimized implementation for ARM
> does not do this; the library link order for that architecture appears to
> favor the arch/arm/ version automatically. I've left this situation alone
> since I'm not familiar with the ARM code, but a !ARM condition could be
> added to CONFIG_SHA1_GENERIC if it makes sense.
>
> The code has been tested with tcrypt and the NIST test vectors.

Have you benchmarked this against lib/sha1.c? Please post the results.
Until then, I'm frankly skeptical that your unrolled version is faster
because when I introduced lib/sha1.c the rolled version therein won by
a significant margin and had 1/10th the cache footprint.

--
Mathematics is the supreme nostalgia of our time.

2007-06-09 20:23:40

by Jeff Garzik

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Matt Mackall wrote:
> On Fri, Jun 08, 2007 at 05:42:53PM -0400, Benjamin Gilbert wrote:
>> Add x86-optimized implementation of the SHA-1 hash function, taken from
>> Nettle under the LGPL. This code will be enabled on kernels compiled for
>> 486es or better; kernels which support 386es will use the generic
>> implementation (since we need BSWAP).
>>
>> We disable building lib/sha1.o when an optimized implementation is
>> available, as the library link order for x86 (and x86_64) would otherwise
>> ignore the optimized version. The existing optimized implementation for ARM
>> does not do this; the library link order for that architecture appears to
>> favor the arch/arm/ version automatically. I've left this situation alone
>> since I'm not familiar with the ARM code, but a !ARM condition could be
>> added to CONFIG_SHA1_GENERIC if it makes sense.
>>
>> The code has been tested with tcrypt and the NIST test vectors.
>
> Have you benchmarked this against lib/sha1.c? Please post the results.
> Until then, I'm frankly skeptical that your unrolled version is faster
> because when I introduced lib/sha1.c the rolled version therein won by
> a significant margin and had 1/10th the cache footprint.

Yes. And it also depends on the CPU. Testing on a server-class
x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
different results than popular but less-capable "value" CPUs.

Jeff

2007-06-09 21:35:09

by Matt Mackall

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

On Sat, Jun 09, 2007 at 04:23:27PM -0400, Jeff Garzik wrote:
> Matt Mackall wrote:
> >On Fri, Jun 08, 2007 at 05:42:53PM -0400, Benjamin Gilbert wrote:
> >>Add x86-optimized implementation of the SHA-1 hash function, taken from
> >>Nettle under the LGPL. This code will be enabled on kernels compiled for
> >>486es or better; kernels which support 386es will use the generic
> >>implementation (since we need BSWAP).
> >>
> >>We disable building lib/sha1.o when an optimized implementation is
> >>available, as the library link order for x86 (and x86_64) would otherwise
> >>ignore the optimized version. The existing optimized implementation for
> >>ARM
> >>does not do this; the library link order for that architecture appears to
> >>favor the arch/arm/ version automatically. I've left this situation alone
> >>since I'm not familiar with the ARM code, but a !ARM condition could be
> >>added to CONFIG_SHA1_GENERIC if it makes sense.
> >>
> >>The code has been tested with tcrypt and the NIST test vectors.
> >
> >Have you benchmarked this against lib/sha1.c? Please post the results.
> >Until then, I'm frankly skeptical that your unrolled version is faster
> >because when I introduced lib/sha1.c the rolled version therein won by
> >a significant margin and had 1/10th the cache footprint.
>
> Yes. And it also depends on the CPU. Testing on a server-class
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
> different results than popular but less-capable "value" CPUs.

In particular, any optimization made for "486+" CPUs is highly suspect
on any machine which runs the core at >1x the memory bus speed.

--
Mathematics is the supreme nostalgia of our time.

2007-06-10 00:34:08

by Benjamin Gilbert

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Jeff Garzik wrote:
> Matt Mackall wrote:
>> Have you benchmarked this against lib/sha1.c? Please post the results.
>> Until then, I'm frankly skeptical that your unrolled version is faster
>> because when I introduced lib/sha1.c the rolled version therein won by
>> a significant margin and had 1/10th the cache footprint.

See the benchmark tables in patch 0 at the head of this thread.
Performance improved by at least 25% in every test, and 40-60% was more
common for the 32-bit version (on a Pentium IV).

It's not just the loop unrolling; it's the register allocation and
spilling. For comparison, I built SHATransform() from the
drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and
SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty
close to what you tested back then. The resulting code is 49% MOV
instructions, and 80% of *those* involve memory. gcc4 is somewhat
better, but it still spills a whole lot, both for the 2.6.11 unrolled
code and for the current lib/sha1.c.

In contrast, the assembly implementation in this patch only has to go to
memory for data and workspace (with one small exception in the F3
rounds), and the workspace has a fifth of the cache footprint of the
default implementation.
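(That is, a 16-word, 64-byte schedule versus the 80-word, 320-byte
workspace that lib/sha1.c uses.)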

> Yes. And it also depends on the CPU. Testing on a server-class
> x86 CPU (often with bigger L2, and perhaps even L1, cache) will produce
> different results than popular but less-capable "value" CPUs.

Good point. I benchmarked the 32-bit assembly code on a couple more boxes:

=== AMD Duron, average of 5 trials ===
Test#  Bytes/block  Bytes/update  Cyc/B (C)  Cyc/B (asm)  Change
0 16 16 104 72 31%
1 64 16 52 36 31%
2 64 64 45 29 36%
3 256 16 33 23 30%
4 256 64 27 17 37%
5 256 256 24 14 42%
6 1024 16 29 20 31%
7 1024 256 20 11 45%
8 1024 1024 19 11 42%
9 2048 16 28 20 29%
10 2048 256 19 11 42%
11 2048 1024 18 10 44%
12 2048 2048 18 10 44%
13 4096 16 28 19 32%
14 4096 256 18 10 44%
15 4096 1024 18 10 44%
16 4096 4096 18 10 44%
17 8192 16 27 19 30%
18 8192 256 18 10 44%
19 8192 1024 18 10 44%
20 8192 4096 17 10 41%
21 8192 8192 17 10 41%

=== Classic Pentium, average of 5 trials ===
Test#  Bytes/block  Bytes/update  Cyc/B (C)  Cyc/B (asm)  Change
0 16 16 145 144 1%
1 64 16 72 61 15%
2 64 64 65 52 20%
3 256 16 46 39 15%
4 256 64 39 32 18%
5 256 256 36 29 19%
6 1024 16 40 33 18%
7 1024 256 30 23 23%
8 1024 1024 29 23 21%
9 2048 16 39 32 18%
10 2048 256 29 22 24%
11 2048 1024 28 22 21%
12 2048 2048 28 22 21%
13 4096 16 38 32 16%
14 4096 256 28 22 21%
15 4096 1024 28 21 25%
16 4096 4096 27 21 22%
17 8192 16 38 32 16%
18 8192 256 28 22 21%
19 8192 1024 28 21 25%
20 8192 4096 27 21 22%
21 8192 8192 27 21 22%

The improvement isn't as good, but it's still noticeable.

--Benjamin Gilbert

2007-06-10 01:16:10

by Benjamin Gilbert

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Jan Engelhardt wrote:
> On Jun 8 2007 17:42, Benjamin Gilbert wrote:
>> @@ -0,0 +1,299 @@
>> +/*
>> + * x86-optimized SHA1 hash algorithm (i486 and above)
>> + *
>> + * Originally from Nettle
>> + * Ported from M4 to cpp by Benjamin Gilbert <[email protected]>
>> + *
>> + * Copyright (C) 2004, Niels M?ller
>> + * Copyright (C) 2006-2007 Carnegie Mellon University
>
> UTF-8 please. Hint: it should most likely be an ö.

Whoops, I had thought I had gotten that right. I'll get updates for
parts 2 and 3 sent out on Monday.

Thanks
--Benjamin Gilbert

2007-06-10 14:00:50

by Matt Mackall

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

On Sat, Jun 09, 2007 at 08:33:25PM -0400, Benjamin Gilbert wrote:
> Jeff Garzik wrote:
> >Matt Mackall wrote:
> >>Have you benchmarked this against lib/sha1.c? Please post the results.
> >>Until then, I'm frankly skeptical that your unrolled version is faster
> >>because when I introduced lib/sha1.c the rolled version therein won by
> >>a significant margin and had 1/10th the cache footprint.
>
> See the benchmark tables in patch 0 at the head of this thread.
> Performance improved by at least 25% in every test, and 40-60% was more
> common for the 32-bit version (on a Pentium IV).
>
> It's not just the loop unrolling; it's the register allocation and
> spilling. For comparison, I built SHATransform() from the
> drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and
> SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty
> close to what you tested back then. The resulting code is 49% MOV
> instructions, and 80% of *those* involve memory. gcc4 is somewhat
> better, but it still spills a whole lot, both for the 2.6.11 unrolled
> code and for the current lib/sha1.c.

Wait, your benchmark is comparing against the unrolled code?

> In contrast, the assembly implementation in this patch only has to go to
> memory for data and workspace (with one small exception in the F3
> rounds), and the workspace has a fifth of the cache footprint of the
> default implementation.

How big is the -code- footprint?

Earlier you wrote:

> On the aforementioned Pentium IV, /dev/urandom throughput goes from
> 3.7 MB/s to 5.6 MB/s with the patches; on the Core 2, it increases
> from 5.5 MB/s to 8.1 MB/s.

Whoa. We've regressed something horrible here:

http://groups.google.com/group/linux.kernel/msg/fba056363c99d4f9?dmode=source&hl=en

In 2003, I was getting 17MB/s out of my Athlon. Now I'm getting 2.7MB/s.
Were your tests with or without the latest /dev/urandom fixes? This
one in particular:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=374f167dfb97c1785515a0c41e32a66b414859a8

--
Mathematics is the supreme nostalgia of our time.

2007-06-10 16:48:40

by Benjamin Gilbert

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Matt Mackall wrote:
> On Sat, Jun 09, 2007 at 08:33:25PM -0400, Benjamin Gilbert wrote:
>> It's not just the loop unrolling; it's the register allocation and
>> spilling. For comparison, I built SHATransform() from the
>> drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and
>> SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty
>> close to what you tested back then. The resulting code is 49% MOV
>> instructions, and 80% of *those* involve memory. gcc4 is somewhat
>> better, but it still spills a whole lot, both for the 2.6.11 unrolled
>> code and for the current lib/sha1.c.
>
> Wait, your benchmark is comparing against the unrolled code?

No, it's comparing the current lib/sha1.c to the optimized code in the
patch. I was just pointing out that the unrolled code you were likely
testing against, back then, may not have been very good. (Though I
assumed that you were talking about the unrolled code in random.c, not
the code in CryptoAPI, so that might change the numbers some. It
appears from the post you linked below that the unrolled CryptoAPI code
still beat the rolled version?)

> How big is the -code- footprint?

About 3700 bytes for the 32-bit version of sha_transform().

> Whoa. We've regressed something horrible here:
>
> http://groups.google.com/group/linux.kernel/msg/fba056363c99d4f9?dmode=source&hl=en
>
> In 2003, I was getting 17MB/s out of my Athlon. Now I'm getting 2.7MB/s.
> Were your tests with or without the latest /dev/urandom fixes? This
> one in particular:
>
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=374f167dfb97c1785515a0c41e32a66b414859a8

I'm not in front of that machine right now; I can check tomorrow. For
what it's worth, I've seen equivalent performance (a few MB/s) on a
range of fairly-recent kernels.

--Benjamin Gilbert

2007-06-10 17:34:19

by Matt Mackall

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

On Sun, Jun 10, 2007 at 12:47:19PM -0400, Benjamin Gilbert wrote:
> Matt Mackall wrote:
> >On Sat, Jun 09, 2007 at 08:33:25PM -0400, Benjamin Gilbert wrote:
> >>It's not just the loop unrolling; it's the register allocation and
> >>spilling. For comparison, I built SHATransform() from the
> >>drivers/char/random.c in 2.6.11, using gcc 3.3.5 with -O2 and
> >>SHA_CODE_SIZE == 3 (i.e., fully unrolled); I'm guessing this is pretty
> >>close to what you tested back then. The resulting code is 49% MOV
> >>instructions, and 80% of *those* involve memory. gcc4 is somewhat
> >>better, but it still spills a whole lot, both for the 2.6.11 unrolled
> >>code and for the current lib/sha1.c.
> >
> >Wait, your benchmark is comparing against the unrolled code?
>
> No, it's comparing the current lib/sha1.c to the optimized code in the
> patch. I was just pointing out that the unrolled code you were likely
> testing against, back then, may not have been very good. (Though I
> assumed that you were talking about the unrolled code in random.c, not
> the code in CryptoAPI, so that might change the numbers some. It
> appears from the post you linked below that the unrolled CryptoAPI code
> still beat the rolled version?)

That predates lib/sha1.c by a while.

> >How big is the -code- footprint?
>
> About 3700 bytes for the 32-bit version of sha_transform().

lib/sha1.c's footprint is... 621 bytes today. Huh. That's up from 466
bytes when it was introduced and no one's touched it:

http://search.luky.org/ML/linux-kernel.2005/msg06648.html

Stupid compilers.

But anyway. Cache footprint matters. The two big users of SHA1 in the
kernel are /dev/random and IPSec, both of which typically operate on
small chunks of data.

--
Mathematics is the supreme nostalgia of our time.

2007-06-11 11:06:05

by Andi Kleen

Subject: Re: [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64

Benjamin Gilbert <[email protected]> writes:

> +/* push/pop wrappers that update the DWARF unwind table */
> +#define PUSH(regname) \
> + push %regname; \
> + CFI_ADJUST_CFA_OFFSET 8; \
> + CFI_REL_OFFSET regname, 0
> +
> +#define POP(regname) \
> + pop %regname; \
> + CFI_ADJUST_CFA_OFFSET -8; \
> + CFI_RESTORE regname

Please don't do these kinds of wrappers. They just obfuscate the code.

And BTW plain gas macros (.macro) are much nicer to read too
than cpp macros.


> +#define EXPAND(i) \
> + movl OFFSET(i % 16)(DATA), TMP; \
> + xorl OFFSET((i + 2) % 16)(DATA), TMP; \

Such overlapping memory accesses are somewhat dangerous as they tend
to stall some CPUs. Better probably to do a quad load and then extract.

If you care about the last cycle I would suggest you run
it at least once through the Pipeline simulator in the Linux
version of AMD CodeAnalyst or through vtune.

I haven't checked in detail if it's possible but it's suspicious you
never use quad operations for anything. You keep at least half
the CPU's bits idle all the time.

> + EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP)
> + EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP)
> + EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP)
> + EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP)
> + EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP)

Gut feeling is that the unroll factor is far too large.
Have you tried a smaller one? That would save icache,
which is very important in the kernel. Unlike in your micro benchmark,
when kernel code runs normally the caches are cold. Smaller is faster then.
And most kernel SHA applications don't process very much data anyway,
so startup costs are important.

> diff --git a/lib/Kconfig b/lib/Kconfig
> index 69fdb64..23a84ed 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -132,9 +132,14 @@ config SHA1_X86
> depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
> default y
>
> +config SHA1_X86_64
> + bool
> + depends on (X86 || UML_X86) && 64BIT
> + default y
> +
> config SHA1_GENERIC
> bool
> - depends on !SHA1_X86
> + depends on !SHA1_X86 && !SHA1_X86_64

Better define a SHA_ARCH_OPTIMIZED helper symbol, otherwise
this will get messy as more architectures add optimized versions.

-Andi

2007-06-11 11:07:55

by Andi Kleen

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Matt Mackall <[email protected]> writes:
>
> Have you benchmarked this against lib/sha1.c? Please post the results.
> Until then, I'm frankly skeptical that your unrolled version is faster
> because when I introduced lib/sha1.c the rolled version therein won by
> a significant margin and had 1/10th the cache footprint.

I would always suggest benchmarking such functions with forced cold i/d caches

(memset(x, 0, 5*1024*1024) and running some very large generated
function every few iterations of the benchmark)
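
Something along these lines (untested sketch only; the names, buffer size,
and flush interval are made up, and flushing the i-cache by calling a big
generated function is left out):

	#include <linux/string.h>
	#include <linux/cryptohash.h>

	static char cache_flusher[5 * 1024 * 1024];

	static void force_cold_dcache(void)
	{
		/* touch 5 MB of data so the SHA-1 working set gets evicted */
		memset(cache_flusher, 0, sizeof(cache_flusher));
	}

	static void bench_sha_transform(const char *block)
	{
		__u32 digest[SHA_DIGEST_WORDS];
		__u32 W[80];	/* big enough for any of the implementations */
		int i;

		for (i = 0; i < 1000; i++) {
			if (i % 4 == 0)
				force_cold_dcache();
			sha_init(digest);
			sha_transform(digest, block, W);
			/* read the cycle counter here and record per-iteration cost */
		}
	}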

-Andi

2007-06-11 17:40:06

by Benjamin Gilbert

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Matt Mackall wrote:
> In 2003, I was getting 17MB/s out of my Athlon. Now I'm getting 2.7MB/s.
> Were your tests with or without the latest /dev/urandom fixes? This
> one in particular:
>
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.21.y.git;a=commitdiff;h=374f167dfb97c1785515a0c41e32a66b414859a8

With. I just tried 2.6.11 (the oldest that will boot) on the Pentium IV
box and got 3.7 MB/s, so if it's a regression it's been around for a while.

--Benjamin Gilbert

2007-06-11 19:45:36

by Benjamin Gilbert

Subject: Re: [PATCH 3/3] [CRYPTO] Add optimized SHA-1 implementation for x86_64

Andi Kleen wrote:
> Benjamin Gilbert <[email protected]> writes:
>> +#define EXPAND(i) \
>> + movl OFFSET(i % 16)(DATA), TMP; \
>> + xorl OFFSET((i + 2) % 16)(DATA), TMP; \
>
> Such overlapping memory accesses are somewhat dangerous as they tend
> to stall some CPUs. Better probably to do a quad load and then extract.

OFFSET(i) is defined as 4*(i), so they don't actually overlap.
(Arguably that macro should go away.)
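
For reference, a rough C equivalent of what EXPAND(i) computes (just an
illustration, not code from the patch):

	/* w[] is the 16-word schedule kept in DATA; the four loads hit four
	   distinct 32-bit words, e.g. w[0], w[2], w[8] and w[13] for i = 16. */
	static inline __u32 expand(__u32 w[16], int i)
	{
		__u32 t = w[i % 16] ^ w[(i + 2) % 16] ^
			  w[(i + 8) % 16] ^ w[(i + 13) % 16];

		t = (t << 1) | (t >> 31);	/* roll $1, TMP */
		w[i % 16] = t;
		return t;
	}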

> I haven't checked in detail if it's possible but it's suspicious you
> never use quad operations for anything. You keep at least half
> the CPU's bits idle all the time.

SHA-1 fundamentally wants to work with 32-bit quantities. It might be
possible to use quad operations for some things, with sufficient
cleverness, but I doubt it'd be worth the effort.

> Gut feeling is that the unroll factor is far too large.
> Have you tried a smaller one? That would save icache
> which is very important in the kernel.

That seems to be the consensus. I'll see if I can find some time to try
[email protected]'s suggestion and report back.

I don't think, though, that cache footprint is the *only* thing that
matters. Leaving aside /dev/urandom, there are cases where throughput
matters a lot. This patch set came out of some work on a hashing block
device driver in which SHA is, by far, the biggest CPU user. One could
imagine content-addressable filesystems, or even IPsec under the right
workloads, being in a similar situation.

Would it be more palatable to roll the patch as an optimized CryptoAPI
module rather than as a lib/sha1.c replacement? That wouldn't help
/dev/urandom, of course, but for other cases it would allow the user to
ask for the optimized version if needed, and not pay the footprint costs
otherwise.

--Benjamin Gilbert

2007-06-11 19:47:57

by Benjamin Gilbert

Subject: Re: [PATCH 2/3] [CRYPTO] Add optimized SHA-1 implementation for i486+

Benjamin Gilbert wrote:
> Jan Engelhardt wrote:
>> UTF-8 please. Hint: it should most likely be an ö.
>
> Whoops, I had thought I had gotten that right. I'll get updates for
> parts 2 and 3 sent out on Monday.

I'm sending the corrected parts 2 and 3 as replies to this email. The
UTF-8 fix is the *only* thing that has changed. The patches themselves
are moot in their current form, but I wanted to make sure they were
archived with the correct attribution.

--Benjamin Gilbert

2007-06-11 19:51:00

by Benjamin Gilbert

Subject: [PATCH] [CRYPTO] Add optimized SHA-1 implementation for i486+

Add x86-optimized implementation of the SHA-1 hash function, taken from
Nettle under the LGPL. This code will be enabled on kernels compiled for
486es or better; kernels which support 386es will use the generic
implementation (since we need BSWAP).

We disable building lib/sha1.o when an optimized implementation is
available, as the library link order for x86 (and x86_64) would otherwise
ignore the optimized version. The existing optimized implementation for ARM
does not do this; the library link order for that architecture appears to
favor the arch/arm/ version automatically. I've left this situation alone
since I'm not familiar with the ARM code, but a !ARM condition could be
added to CONFIG_SHA1_GENERIC if it makes sense.

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <[email protected]>
---

arch/i386/kernel/i386_ksyms.c | 5 +
arch/i386/lib/Makefile | 1
arch/i386/lib/sha1.S | 299 +++++++++++++++++++++++++++++++++++++++++
include/linux/cryptohash.h | 9 +
lib/Kconfig | 13 ++
lib/Makefile | 3
6 files changed, 328 insertions(+), 2 deletions(-)

diff --git a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
index e3d4b73..812bc4e 100644
--- a/arch/i386/kernel/i386_ksyms.c
+++ b/arch/i386/kernel/i386_ksyms.c
@@ -1,4 +1,5 @@
#include <linux/module.h>
+#include <linux/cryptohash.h>
#include <asm/checksum.h>
#include <asm/desc.h>

@@ -28,3 +29,7 @@ EXPORT_SYMBOL(__read_lock_failed);
#endif

EXPORT_SYMBOL(csum_partial);
+
+#ifdef CONFIG_SHA1_X86
+EXPORT_SYMBOL(sha_transform);
+#endif
diff --git a/arch/i386/lib/Makefile b/arch/i386/lib/Makefile
index 22d8ac5..69f4845 100644
--- a/arch/i386/lib/Makefile
+++ b/arch/i386/lib/Makefile
@@ -6,6 +6,7 @@
lib-y = checksum.o delay.o usercopy.o getuser.o putuser.o memcpy.o strstr.o \
bitops.o semaphore.o

+lib-$(CONFIG_SHA1_X86) += sha1.o
lib-$(CONFIG_X86_USE_3DNOW) += mmx.o

obj-$(CONFIG_SMP) += msr-on-cpu.o
diff --git a/arch/i386/lib/sha1.S b/arch/i386/lib/sha1.S
new file mode 100644
index 0000000..a84d829
--- /dev/null
+++ b/arch/i386/lib/sha1.S
@@ -0,0 +1,299 @@
+/*
+ * x86-optimized SHA1 hash algorithm (i486 and above)
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <[email protected]>
+ *
+ * Copyright (C) 2004, Niels Möller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation. A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage */
+#define SA %eax
+#define SB %ebx
+#define SC %ecx
+#define SD %edx
+#define SE %ebp
+#define DATA %esp
+#define TMP %edi
+#define TMP2 %esi /* Used by SWAP and F3 */
+#define TMP3 64(%esp)
+
+/* Constants */
+#define K1VALUE $0x5A827999 /* Rounds 0-19 */
+#define K2VALUE $0x6ED9EBA1 /* Rounds 20-39 */
+#define K3VALUE $0x8F1BBCDC /* Rounds 40-59 */
+#define K4VALUE $0xCA62C1D6 /* Rounds 60-79 */
+
+/* Convert stack offsets in words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via TMP2 into register, byteswaps it, and stores it in
+ the DATA array. */
+#define SWAP(index, register) \
+ movl OFFSET(index)(TMP2), register; \
+ bswap register; \
+ movl register, OFFSET(index)(DATA)
+
+/* Sets the workspace word at the given index to TMP. */
+#define CLEAR(index) \
+ movl TMP, OFFSET(index)(DATA)
+
+/* pushl/popl wrappers that update the DWARF unwind table */
+#define PUSH(regname) \
+ pushl %regname; \
+ CFI_ADJUST_CFA_OFFSET 4; \
+ CFI_REL_OFFSET regname, 0
+
+#define POP(regname) \
+ popl %regname; \
+ CFI_ADJUST_CFA_OFFSET -4; \
+ CFI_RESTORE regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ * W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i) \
+ movl OFFSET(i % 16)(DATA), TMP; \
+ xorl OFFSET((i + 2) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 8) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 13) % 16)(DATA), TMP; \
+ roll $1, TMP; \
+ movl TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ * f1(x,y,z) = z ^ (x & (y ^ z))
+ * f2(x,y,z) = x ^ y ^ z
+ * f3(x,y,z) = (x & y) | (z & (x | y))
+ * f4 = f2
+ *
+ * The macro Fk(x,y,z) computes = fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z) \
+ movl z, TMP; \
+ xorl y, TMP; \
+ andl x, TMP; \
+ xorl z, TMP
+
+#define F2(x,y,z) \
+ movl x, TMP; \
+ xorl y, TMP; \
+ xorl z, TMP
+
+#define F3(x,y,z) \
+ movl x, TMP2; \
+ andl y, TMP2; \
+ movl x, TMP; \
+ orl y, TMP; \
+ andl z, TMP; \
+ orl TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ * a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ * b' = a;
+ * c' = b <<< 30;
+ * d' = c;
+ * e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ * e += a <<< 5 + f( b, c, d ) + k + w;
+ * b <<<= 30
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,k,w) \
+ addl k, e; \
+ addl w, e; \
+ f(b,c,d); \
+ addl TMP, e; \
+ movl a, TMP; \
+ roll $5, TMP; \
+ addl TMP, e; \
+ roll $30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* The workspace argument is ignored; we don't have enough registers to handle
+ a workspace that isn't on our own stack. */
+.text
+ENTRY(sha_transform)
+ CFI_STARTPROC
+ /* save all registers that need to be saved */
+ PUSH(ebx) /* 80(%esp) */
+ PUSH(ebp) /* 76(%esp) */
+ PUSH(esi) /* 72(%esp) */
+ PUSH(edi) /* 68(%esp) */
+
+ subl $68, %esp /* %esp = W */
+ CFI_ADJUST_CFA_OFFSET 68
+
+ /* Load and byteswap data */
+ movl 92(%esp), TMP2
+
+ SWAP( 0, %eax); SWAP( 1, %ebx); SWAP( 2, %ecx); SWAP( 3, %edx)
+ SWAP( 4, %eax); SWAP( 5, %ebx); SWAP( 6, %ecx); SWAP( 7, %edx)
+ SWAP( 8, %eax); SWAP( 9, %ebx); SWAP(10, %ecx); SWAP(11, %edx)
+ SWAP(12, %eax); SWAP(13, %ebx); SWAP(14, %ecx); SWAP(15, %edx)
+
+ /* load the state vector */
+ movl 88(%esp),TMP
+ movl (TMP), SA
+ movl 4(TMP), SB
+ movl 8(TMP), SC
+ movl 12(TMP), SD
+ movl 16(TMP), SE
+
+ movl K1VALUE, TMP2
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 0)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 1)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 2)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 3)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 4)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET( 5)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET( 6)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET( 7)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET( 8)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET( 9)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(10)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, TMP2, OFFSET(11)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, TMP2, OFFSET(12)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, TMP2, OFFSET(13)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, TMP2, OFFSET(14)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, TMP2, OFFSET(15)(DATA))
+ EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP2, TMP)
+ EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP2, TMP)
+ EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP2, TMP)
+ EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP2, TMP)
+
+ /* TMP2 is free to use in these rounds */
+ movl K2VALUE, TMP2
+ EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ /* We have to put this constant on the stack */
+ movl K3VALUE, TMP3
+ EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP3, TMP)
+ EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP3, TMP)
+ EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP3, TMP)
+ EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP3, TMP)
+ EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP3, TMP)
+
+ movl K4VALUE, TMP2
+ EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP2, TMP)
+ EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP2, TMP)
+ EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP2, TMP)
+ EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP2, TMP)
+ EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP2, TMP)
+
+ /* Update the state vector */
+ movl 88(%esp),TMP
+ addl SA, (TMP)
+ addl SB, 4(TMP)
+ addl SC, 8(TMP)
+ addl SD, 12(TMP)
+ addl SE, 16(TMP)
+
+ /* Clear the workspace for security */
+ xorl TMP, TMP
+ CLEAR( 0); CLEAR( 1); CLEAR( 2); CLEAR( 3);
+ CLEAR( 4); CLEAR( 5); CLEAR( 6); CLEAR( 7);
+ CLEAR( 8); CLEAR( 9); CLEAR(10); CLEAR(11);
+ CLEAR(12); CLEAR(13); CLEAR(14); CLEAR(15);
+
+ addl $68, %esp
+ CFI_ADJUST_CFA_OFFSET -68
+ POP(edi)
+ POP(esi)
+ POP(ebp)
+ POP(ebx)
+ ret
+ CFI_ENDPROC
+ENDPROC(sha_transform)
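
The least obvious part of the code above is probably the ROUND macro's
argument shuffling.  Rather than copying the five state variables around
after every round, the code rotates which register plays the role of
a/b/c/d/e from one round to the next, so each round only rewrites e and
rotates b.  A rough C rendering of the same trick (helper names invented
for illustration; this is not part of the patch):

#include <linux/types.h>
#include <linux/bitops.h>	/* rol32() */

/* f1, the "choose" function used in rounds 0-19 (same form as the F1
 * macro above).
 */
static inline __u32 sha1_f1(__u32 x, __u32 y, __u32 z)
{
	return z ^ (x & (y ^ z));
}

/* One round in the permuted form: only e is rewritten, and b is
 * rotated in place; nothing is copied between the five variables.
 */
#define SHA1_ROUND(a, b, c, d, e, f, k, w) do {			\
		(e) += rol32((a), 5) + f((b), (c), (d)) + (k) + (w);	\
		(b) = rol32((b), 30);					\
	} while (0)

/* Five consecutive rounds.  The roles of a..e rotate one step per
 * round, and after five rounds they are back where they started,
 * which is why the assembly unrolls in groups of five.
 */
static void sha1_five_rounds(__u32 s[5], const __u32 w[5], __u32 k)
{
	SHA1_ROUND(s[0], s[1], s[2], s[3], s[4], sha1_f1, k, w[0]);
	SHA1_ROUND(s[4], s[0], s[1], s[2], s[3], sha1_f1, k, w[1]);
	SHA1_ROUND(s[3], s[4], s[0], s[1], s[2], sha1_f1, k, w[2]);
	SHA1_ROUND(s[2], s[3], s[4], s[0], s[1], sha1_f1, k, w[3]);
	SHA1_ROUND(s[1], s[2], s[3], s[4], s[0], sha1_f1, k, w[4]);
}
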
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index a172401..0da331c 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -1,8 +1,15 @@
#ifndef __CRYPTOHASH_H
#define __CRYPTOHASH_H

+#include <linux/linkage.h>
+
#define SHA_DIGEST_WORDS 5
+
+#if defined(CONFIG_SHA1_X86)
+#define SHA_WORKSPACE_WORDS 0
+#else
#define SHA_WORKSPACE_WORDS 80
+#endif

/**
* sha_init - initialize the vectors for a SHA1 digest
@@ -17,7 +24,7 @@ static inline void sha_init(__u32 *buf)
buf[4] = 0xc3d2e1f0;
}

-void sha_transform(__u32 *digest, const char *data, __u32 *W);
+asmlinkage void sha_transform(__u32 *digest, const char *data, __u32 *W);

__u32 half_md4_transform(__u32 buf[4], __u32 const in[8]);
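
For reference, the interface seen by callers of sha_transform() is
unchanged; hashing a single 64-byte block looks roughly like this
(illustrative sketch with a made-up helper name -- real callers also
handle buffering, padding and the final length block):

#include <linux/cryptohash.h>
#include <linux/string.h>

/* Hash exactly one 64-byte block into digest[0..4].  With this patch
 * SHA_WORKSPACE_WORDS is 0 on x86, so the workspace array is
 * zero-length there (a GCC extension) and sha_transform ignores the
 * pointer; other configurations still get a real scratch buffer.
 */
static void sha1_one_block(__u32 digest[SHA_DIGEST_WORDS],
			   const char block[64])
{
	__u32 workspace[SHA_WORKSPACE_WORDS];

	sha_init(digest);
	sha_transform(digest, block, workspace);
	memset(workspace, 0, sizeof(workspace));
}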

diff --git a/lib/Kconfig b/lib/Kconfig
index 2e7ae6b..69fdb64 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -124,4 +124,17 @@ config HAS_DMA
depends on !NO_DMA
default y

+#
+# Optimized SHA-1 support is autoconfigured
+#
+config SHA1_X86
+ bool
+ depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
+ default y
+
+config SHA1_GENERIC
+ bool
+ depends on !SHA1_X86
+ default y
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index c8c8e20..f67be29 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -5,10 +5,11 @@
lib-y := ctype.o string.o vsprintf.o cmdline.o \
rbtree.o radix-tree.o dump_stack.o \
idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
- sha1.o irq_regs.o reciprocal_div.o
+ irq_regs.o reciprocal_div.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
+lib-$(CONFIG_SHA1_GENERIC) += sha1.o

lib-y += kobject.o kref.o kobject_uevent.o klist.o


2007-06-11 19:52:43

by Benjamin Gilbert

[permalink] [raw]
Subject: [PATCH] [CRYPTO] Add optimized SHA-1 implementation for x86_64

Add optimized implementation of the SHA-1 hash function for x86_64, ported
from the x86 implementation in Nettle (which is LGPLed).

The code has been tested with tcrypt and the NIST test vectors.

Signed-off-by: Benjamin Gilbert <[email protected]>
---

arch/x86_64/kernel/x8664_ksyms.c | 3
arch/x86_64/lib/Makefile | 2
arch/x86_64/lib/sha1.S | 281 ++++++++++++++++++++++++++++++++++++++
include/linux/cryptohash.h | 2
lib/Kconfig | 7 +
5 files changed, 293 insertions(+), 2 deletions(-)

diff --git a/arch/x86_64/kernel/x8664_ksyms.c b/arch/x86_64/kernel/x8664_ksyms.c
index 77c25b3..bc641ab 100644
--- a/arch/x86_64/kernel/x8664_ksyms.c
+++ b/arch/x86_64/kernel/x8664_ksyms.c
@@ -3,6 +3,7 @@

#include <linux/module.h>
#include <linux/smp.h>
+#include <linux/cryptohash.h>

#include <asm/semaphore.h>
#include <asm/processor.h>
@@ -60,3 +61,5 @@ EXPORT_SYMBOL(init_level4_pgt);
EXPORT_SYMBOL(load_gs_index);

EXPORT_SYMBOL(_proxy_pda);
+
+EXPORT_SYMBOL(sha_transform);
diff --git a/arch/x86_64/lib/Makefile b/arch/x86_64/lib/Makefile
index c943271..6c8110b 100644
--- a/arch/x86_64/lib/Makefile
+++ b/arch/x86_64/lib/Makefile
@@ -9,5 +9,5 @@ obj-$(CONFIG_SMP) += msr-on-cpu.o

lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
usercopy.o getuser.o putuser.o \
- thunk.o clear_page.o copy_page.o bitstr.o bitops.o
+ thunk.o clear_page.o copy_page.o bitstr.o bitops.o sha1.o
lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o copy_user_nocache.o
diff --git a/arch/x86_64/lib/sha1.S b/arch/x86_64/lib/sha1.S
new file mode 100644
index 0000000..f928ac3
--- /dev/null
+++ b/arch/x86_64/lib/sha1.S
@@ -0,0 +1,281 @@
+/*
+ * sha1-x86_64 - x86_64-optimized SHA1 hash algorithm
+ *
+ * Originally from Nettle
+ * Ported from M4 to cpp by Benjamin Gilbert <[email protected]>
+ * Ported from x86 to x86_64 by Benjamin Gilbert
+ *
+ * Copyright (C) 2004, Niels Möller
+ * Copyright (C) 2006-2007 Carnegie Mellon University
+ *
+ * This library is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License as
+ * published by the Free Software Foundation. A copy of the GNU Lesser General
+ * Public License should have been distributed along with this library in the
+ * file LICENSE.LGPL.
+ *
+ * This library is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
+ * for more details.
+ */
+
+#include <linux/linkage.h>
+#include <asm/dwarf2.h>
+
+/* Register usage. r12-15 must be saved if they will be used. Accessing
+ r8-r15 takes an extra instruction byte. */
+#define P_STATE %rdi /* Pointer parameter */
+#define P_DATA %rsi /* Pointer parameter */
+#define DATA %rdx /* Pointer parameter */
+#define SA %edi /* Reuses P_STATE */
+#define SB %esi /* Reuses P_DATA */
+#define SC %eax
+#define SD %ebx /* Callee-saved */
+#define SE %ebp /* Callee-saved */
+#define TMP %ecx
+#define TMP2 %r8d /* Used by F3 */
+#define CONST %r9d
+#define STATE %r10
+
+/* Constants */
+#define K1VALUE $0x5A827999 /* Rounds 0-19 */
+#define K2VALUE $0x6ED9EBA1 /* Rounds 20-39 */
+#define K3VALUE $0x8F1BBCDC /* Rounds 40-59 */
+#define K4VALUE $0xCA62C1D6 /* Rounds 60-79 */
+
+/* Convert stack offsets in 32-bit words to offsets in bytes */
+#define OFFSET(i) 4*(i)
+
+/* Reads the input via P_DATA into register, byteswaps it, and stores it in
+ the DATA array. */
+#define SWAP(index, register) \
+ movl OFFSET(index)(P_DATA), register; \
+ bswap register; \
+ movl register, OFFSET(index)(DATA)
+
+/* push/pop wrappers that update the DWARF unwind table */
+#define PUSH(regname) \
+ push %regname; \
+ CFI_ADJUST_CFA_OFFSET 8; \
+ CFI_REL_OFFSET regname, 0
+
+#define POP(regname) \
+ pop %regname; \
+ CFI_ADJUST_CFA_OFFSET -8; \
+ CFI_RESTORE regname
+
+/*
+ * expand(i) is the expansion function
+ *
+ * W[i] = (W[i - 16] ^ W[i - 14] ^ W[i - 8] ^ W[i - 3]) <<< 1
+ *
+ * where W[i] is stored in DATA[i mod 16].
+ *
+ * Result is stored back in W[i], and also left in TMP, the only
+ * register that is used.
+ */
+#define EXPAND(i) \
+ movl OFFSET(i % 16)(DATA), TMP; \
+ xorl OFFSET((i + 2) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 8) % 16)(DATA), TMP; \
+ xorl OFFSET((i + 13) % 16)(DATA), TMP; \
+ roll $1, TMP; \
+ movl TMP, OFFSET(i % 16)(DATA)
+
+/*
+ * The f functions,
+ *
+ * f1(x,y,z) = z ^ (x & (y ^ z))
+ * f2(x,y,z) = x ^ y ^ z
+ * f3(x,y,z) = (x & y) | (z & (x | y))
+ * f4 = f2
+ *
+ * The macro Fk(x,y,z) computes TMP = fk(x,y,z).
+ * Result is left in TMP.
+ */
+#define F1(x,y,z) \
+ movl z, TMP; \
+ xorl y, TMP; \
+ andl x, TMP; \
+ xorl z, TMP
+
+#define F2(x,y,z) \
+ movl x, TMP; \
+ xorl y, TMP; \
+ xorl z, TMP
+
+#define F3(x,y,z) \
+ movl x, TMP2; \
+ andl y, TMP2; \
+ movl x, TMP; \
+ orl y, TMP; \
+ andl z, TMP; \
+ orl TMP2, TMP
+
+/*
+ * The form of one sha1 round is
+ *
+ * a' = e + a <<< 5 + f( b, c, d ) + k + w;
+ * b' = a;
+ * c' = b <<< 30;
+ * d' = c;
+ * e' = d;
+ *
+ * where <<< denotes rotation. We permute our variables, so that we
+ * instead get
+ *
+ * e += a <<< 5 + f( b, c, d ) + k + w;
+ * b <<<= 30
+ *
+ * k is not an explicit parameter; it's always stored in CONST.
+ *
+ * Using the TMP register for the rotate could be avoided, by rotating
+ * %a in place, adding, and then rotating back.
+ */
+#define ROUND(a,b,c,d,e,f,w) \
+ addl CONST, e; \
+ addl w, e; \
+ f(b,c,d); \
+ addl TMP, e; \
+ movl a, TMP; \
+ roll $5, TMP; \
+ addl TMP, e; \
+ roll $30, b;
+
+/* sha_transform(__u32 *digest, const char *in, __u32 *W) */
+/* We expect a 64-byte workspace. */
+.text
+ENTRY(sha_transform)
+ CFI_STARTPROC
+ PUSH(rbx)
+ PUSH(rbp)
+
+ /* Load and byteswap data */
+ SWAP( 0, %eax); SWAP( 1, %ebx); SWAP( 2, %ecx); SWAP( 3, %ebp)
+ SWAP( 4, %r8d); SWAP( 5, %r9d); SWAP( 6, %r10d); SWAP( 7, %r11d)
+ SWAP( 8, %eax); SWAP( 9, %ebx); SWAP(10, %ecx); SWAP(11, %ebp)
+ SWAP(12, %r8d); SWAP(13, %r9d); SWAP(14, %r10d); SWAP(15, %r11d)
+
+ /* P_DATA now dead; free up P_STATE for other uses */
+ movq P_STATE, STATE
+
+ /* load the state vector */
+ movl (STATE), SA
+ movl 4(STATE), SB
+ movl 8(STATE), SC
+ movl 12(STATE), SD
+ movl 16(STATE), SE
+
+ movl K1VALUE, CONST
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 0)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 1)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 2)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 3)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 4)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET( 5)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, OFFSET( 6)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, OFFSET( 7)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, OFFSET( 8)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, OFFSET( 9)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET(10)(DATA))
+ ROUND(SE, SA, SB, SC, SD, F1, OFFSET(11)(DATA))
+ ROUND(SD, SE, SA, SB, SC, F1, OFFSET(12)(DATA))
+ ROUND(SC, SD, SE, SA, SB, F1, OFFSET(13)(DATA))
+ ROUND(SB, SC, SD, SE, SA, F1, OFFSET(14)(DATA))
+
+ ROUND(SA, SB, SC, SD, SE, F1, OFFSET(15)(DATA))
+ EXPAND(16); ROUND(SE, SA, SB, SC, SD, F1, TMP)
+ EXPAND(17); ROUND(SD, SE, SA, SB, SC, F1, TMP)
+ EXPAND(18); ROUND(SC, SD, SE, SA, SB, F1, TMP)
+ EXPAND(19); ROUND(SB, SC, SD, SE, SA, F1, TMP)
+
+ movl K2VALUE, CONST
+ EXPAND(20); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(21); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(22); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(23); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(24); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(25); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(26); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(27); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(28); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(29); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(30); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(31); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(32); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(33); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(34); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(35); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(36); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(37); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(38); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(39); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ movl K3VALUE, CONST
+ EXPAND(40); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(41); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(42); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(43); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(44); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ EXPAND(45); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(46); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(47); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(48); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(49); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ EXPAND(50); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(51); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(52); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(53); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(54); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ EXPAND(55); ROUND(SA, SB, SC, SD, SE, F3, TMP)
+ EXPAND(56); ROUND(SE, SA, SB, SC, SD, F3, TMP)
+ EXPAND(57); ROUND(SD, SE, SA, SB, SC, F3, TMP)
+ EXPAND(58); ROUND(SC, SD, SE, SA, SB, F3, TMP)
+ EXPAND(59); ROUND(SB, SC, SD, SE, SA, F3, TMP)
+
+ movl K4VALUE, CONST
+ EXPAND(60); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(61); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(62); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(63); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(64); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(65); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(66); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(67); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(68); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(69); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(70); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(71); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(72); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(73); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(74); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ EXPAND(75); ROUND(SA, SB, SC, SD, SE, F2, TMP)
+ EXPAND(76); ROUND(SE, SA, SB, SC, SD, F2, TMP)
+ EXPAND(77); ROUND(SD, SE, SA, SB, SC, F2, TMP)
+ EXPAND(78); ROUND(SC, SD, SE, SA, SB, F2, TMP)
+ EXPAND(79); ROUND(SB, SC, SD, SE, SA, F2, TMP)
+
+ /* Update the state vector */
+ addl SA, (STATE)
+ addl SB, 4(STATE)
+ addl SC, 8(STATE)
+ addl SD, 12(STATE)
+ addl SE, 16(STATE)
+
+ POP(rbp)
+ POP(rbx)
+ ret
+ CFI_ENDPROC
+ENDPROC(sha_transform)
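
The EXPAND macro is also what lets the workspace shrink to 16 words
here: schedule word W[i] depends only on the previous sixteen words, so
the slot holding W[i-16] can be overwritten with W[i].  A rough C
rendering of that scheme (not the existing lib/sha1.c code, which still
materializes all 80 words):

#include <linux/types.h>
#include <linux/bitops.h>	/* rol32() */

/* Compute schedule word i (16 <= i < 80) in place.  The indices are
 * the i-3, i-8, i-14 and i-16 terms of the SHA-1 expansion reduced
 * mod 16, matching the EXPAND macro above.
 */
static __u32 sha1_expand(__u32 W[16], unsigned int i)
{
	__u32 tmp = W[i % 16] ^ W[(i + 2) % 16] ^
		    W[(i + 8) % 16] ^ W[(i + 13) % 16];

	W[i % 16] = rol32(tmp, 1);
	return W[i % 16];
}
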
diff --git a/include/linux/cryptohash.h b/include/linux/cryptohash.h
index 0da331c..b3b14a3 100644
--- a/include/linux/cryptohash.h
+++ b/include/linux/cryptohash.h
@@ -7,6 +7,8 @@

#if defined(CONFIG_SHA1_X86)
#define SHA_WORKSPACE_WORDS 0
+#elif defined(CONFIG_SHA1_X86_64)
+#define SHA_WORKSPACE_WORDS 16
#else
#define SHA_WORKSPACE_WORDS 80
#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 69fdb64..23a84ed 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -132,9 +132,14 @@ config SHA1_X86
depends on (X86 || UML_X86) && !64BIT && X86_BSWAP
default y

+config SHA1_X86_64
+ bool
+ depends on (X86 || UML_X86) && 64BIT
+ default y
+
config SHA1_GENERIC
bool
- depends on !SHA1_X86
+ depends on !SHA1_X86 && !SHA1_X86_64
default y

endmenu
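
Finally, the SWAP macros in both implementations are nothing more than
the big-endian word load that SHA-1 specifies for the input block, done
with BSWAP on little-endian x86.  In C terms (sketch only):

#include <linux/types.h>
#include <asm/byteorder.h>

/* SHA-1 interprets the 64-byte input block as sixteen big-endian
 * 32-bit words.  SWAP loads each word, byteswaps it, and stores it
 * into the schedule array.
 */
static void sha1_load_block(__u32 W[16], const char *data)
{
	const __be32 *in = (const __be32 *)data;
	int i;

	for (i = 0; i < 16; i++)
		W[i] = be32_to_cpu(in[i]);
}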

2007-06-11 20:30:23

by Adrian Bunk

[permalink] [raw]
Subject: Re: [PATCH 0/3] Add optimized SHA-1 implementations for x86 and x86_64

On Fri, Jun 08, 2007 at 05:42:42PM -0400, Benjamin Gilbert wrote:
> The following 3-part series adds assembly implementations of the SHA-1
> transform for x86 and x86_64. For x86_64 the optimized code is always
> selected; on x86 it is selected if the kernel is compiled for i486 or above
> (since the code needs BSWAP). These changes primarily improve the
> performance of the CryptoAPI SHA-1 module and of /dev/urandom. I've
> included some performance data from my test boxes below.
>
> This version incorporates feedback from Herbert Xu. Andrew, I'm sending
> this to you because of the (admittedly tiny) intersection with arm and s390
> in part 1.
>
> -
>
> tcrypt performance tests:
>
> === Pentium IV in 32-bit mode, average of 5 trials ===
> Test# Bytes/ Bytes/ Cyc/B Cyc/B Change
>...
> I've also done informal tests on other boxes, and the performance
> improvement has been in the same ballpark.
>
> On the aforementioned Pentium IV, /dev/urandom throughput goes from 3.7 MB/s
> to 5.6 MB/s with the patches; on the Core 2, it increases from 5.5 MB/s to
> 8.1 MB/s.
>...

With which gcc version and compiler flags?

And why is the C code slower?
Problems in the C code?
gcc problems?

Generally, I'd really prefer one C implementation that works well on all
platforms over getting dozens of different assembler implementations,
each potentially with different bugs.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed