2015-07-07 19:37:20

by Martin Willi

Subject: [PATCH 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

This patch series adds x86_64-specific ChaCha20 and Poly1305 ciphers using
SSE2/SSSE3 and AVX2 instructions. The goal is a drop-in replacement for
AESNI/CLMUL-accelerated AES-GCM with at least somewhat comparable
performance; refer to RFC7539 for details on the ChaCha20-Poly1305
construction. The series is based on cryptodev.

The first patch adds some speed tests to tcrypt. The second patch exports
some functionality from chacha20-generic for use as a fallback. Patch 3
adds a single-block SSSE3 driver for ChaCha20, while patches 4 and 5 extend
it with an optimized four-block SSSE3 and an eight-block AVX2 variant.
Patch 6 adds an additional test vector for ChaCha20 to actually exercise
the AVX2 eight-block variant, which processes 512 bytes at once.

Patch 7 exports some poly1305-generic functionality for use as a fallback.
Patch 8 introduces a single-block SSE2 driver for Poly1305, while patches 9
and 10 add an optimized two-block SSE2 and a four-block AVX2 variant.

Overall speedup for the ChaCha20/Poly1305 AEAD for typical IPsec payloads
is ~50-150% with SSE2/SSSE3 and ~100-200% with AVX2, or even more for larger
payloads:

poly1305-generic:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-generic,poly1305-generic)) encryption
test 0 (288 bit key, 16 byte blocks): 902007 operations in 1 seconds (14432112 bytes)
test 1 (288 bit key, 64 byte blocks): 945302 operations in 1 seconds (60499328 bytes)
test 2 (288 bit key, 256 byte blocks): 559910 operations in 1 seconds (143336960 bytes)
test 3 (288 bit key, 512 byte blocks): 365334 operations in 1 seconds (187051008 bytes)
test 4 (288 bit key, 1024 byte blocks): 213663 operations in 1 seconds (218790912 bytes)
test 5 (288 bit key, 2048 byte blocks): 117263 operations in 1 seconds (240154624 bytes)
test 6 (288 bit key, 4096 byte blocks): 61915 operations in 1 seconds (253603840 bytes)
test 7 (288 bit key, 8192 byte blocks): 31662 operations in 1 seconds (259375104 bytes)

SSE2/SSSE3:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 945909 operations in 1 seconds (15134544 bytes)
test 1 (288 bit key, 64 byte blocks): 945702 operations in 1 seconds (60524928 bytes)
test 2 (288 bit key, 256 byte blocks): 759759 operations in 1 seconds (194498304 bytes)
test 3 (288 bit key, 512 byte blocks): 609356 operations in 1 seconds (311990272 bytes)
test 4 (288 bit key, 1024 byte blocks): 445479 operations in 1 seconds (456170496 bytes)
test 5 (288 bit key, 2048 byte blocks): 289479 operations in 1 seconds (592852992 bytes)
test 6 (288 bit key, 4096 byte blocks): 170082 operations in 1 seconds (696655872 bytes)
test 7 (288 bit key, 8192 byte blocks): 91443 operations in 1 seconds (749101056 bytes)

AVX2:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 896305 operations in 1 seconds (14340880 bytes)
test 1 (288 bit key, 64 byte blocks): 929638 operations in 1 seconds (59496832 bytes)
test 2 (288 bit key, 256 byte blocks): 750673 operations in 1 seconds (192172288 bytes)
test 3 (288 bit key, 512 byte blocks): 687636 operations in 1 seconds (352069632 bytes)
test 4 (288 bit key, 1024 byte blocks): 555209 operations in 1 seconds (568534016 bytes)
test 5 (288 bit key, 2048 byte blocks): 402049 operations in 1 seconds (823396352 bytes)
test 6 (288 bit key, 4096 byte blocks): 259861 operations in 1 seconds (1064390656 bytes)
test 7 (288 bit key, 8192 byte blocks): 147283 operations in 1 seconds (1206542336 bytes)

All benchmark results were taken on a Core i5-4670T.

On Haswell with AVX2, the ChaCha20/Poly1305 AEAD reaches about half the raw
performance of AESNI/CLMUL-accelerated AES-GCM (rfc4106-gcm-aesni) for
typical IPsec MTUs. On Ivy Bridge with SSE2/SSSE3, the numbers are very
similar to AES-GCM, as the CLMUL instructions are less efficient there.

Martin Willi (10):
crypto: tcrypt - Add ChaCha20/Poly1305 speed tests
crypto: chacha20 - Export common ChaCha20 helpers
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
crypto: chacha20 - Add a four block SSSE3 variant for x86_64
crypto: chacha20 - Add an eight block AVX2 variant for x86_64
crypto: testmgr - Add a longer ChaCha20 test vector
crypto: poly1305 - Export common Poly1305 helpers
crypto: poly1305 - Add a SSE2 SIMD variant for x86_64
crypto: poly1305 - Add a two block SSE2 variant for x86_64
crypto: poly1305 - Add a four block AVX2 variant for x86_64

arch/x86/crypto/Makefile | 6 +
arch/x86/crypto/chacha20-avx2-x86_64.S | 443 ++++++++++++++++++++++
arch/x86/crypto/chacha20-ssse3-x86_64.S | 625 ++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 150 ++++++++
arch/x86/crypto/poly1305-avx2-x86_64.S | 386 ++++++++++++++++++++
arch/x86/crypto/poly1305-sse2-x86_64.S | 582 +++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 207 +++++++++++
crypto/Kconfig | 27 ++
crypto/chacha20_generic.c | 28 +-
crypto/chacha20poly1305.c | 7 +-
crypto/poly1305_generic.c | 73 ++--
crypto/tcrypt.c | 15 +
crypto/tcrypt.h | 20 +
crypto/testmgr.h | 334 ++++++++++++++++-
include/crypto/chacha20.h | 25 ++
include/crypto/poly1305.h | 41 +++
16 files changed, 2909 insertions(+), 60 deletions(-)
create mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S
create mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
create mode 100644 arch/x86/crypto/chacha20_glue.c
create mode 100644 arch/x86/crypto/poly1305-avx2-x86_64.S
create mode 100644 arch/x86/crypto/poly1305-sse2-x86_64.S
create mode 100644 arch/x86/crypto/poly1305_glue.c
create mode 100644 include/crypto/chacha20.h
create mode 100644 include/crypto/poly1305.h

--
1.9.1


2015-07-07 19:37:20

by Martin Willi

Subject: [PATCH 02/10] crypto: chacha20 - Export common ChaCha20 helpers

As architecture-specific drivers need a software fallback, export the
ChaCha20 encryption/decryption function together with some helpers in a
header file.
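
As a rough illustration (not part of this patch), an architecture-specific
driver can then route to the exported generic code whenever SIMD is not an
option; the function name and the small-request threshold below are
assumptions, loosely modelled on the SSSE3 glue code added in patch 3:

static int chacha20_simd_crypt(struct blkcipher_desc *desc,
                               struct scatterlist *dst,
                               struct scatterlist *src,
                               unsigned int nbytes)
{
        /* use the exported generic fallback for tiny requests or when
         * the FPU cannot be used in this context */
        if (nbytes <= CHACHA20_BLOCK_SIZE || !irq_fpu_usable())
                return crypto_chacha20_crypt(desc, dst, src, nbytes);

        /* ... otherwise run the SIMD code path, still using
         * crypto_chacha20_init() for the initial state setup ... */
        return 0;
}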

Signed-off-by: Martin Willi <[email protected]>
---
crypto/chacha20_generic.c | 28 ++++++++++++----------------
crypto/chacha20poly1305.c | 3 +--
include/crypto/chacha20.h | 25 +++++++++++++++++++++++++
3 files changed, 38 insertions(+), 18 deletions(-)
create mode 100644 include/crypto/chacha20.h

diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
index fa42e70..da9c899 100644
--- a/crypto/chacha20_generic.c
+++ b/crypto/chacha20_generic.c
@@ -13,14 +13,7 @@
#include <linux/crypto.h>
#include <linux/kernel.h>
#include <linux/module.h>
-
-#define CHACHA20_NONCE_SIZE 16
-#define CHACHA20_KEY_SIZE 32
-#define CHACHA20_BLOCK_SIZE 64
-
-struct chacha20_ctx {
- u32 key[8];
-};
+#include <crypto/chacha20.h>

static inline u32 rotl32(u32 v, u8 n)
{
@@ -108,7 +101,7 @@ static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
}
}

-static void chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
+void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
{
static const char constant[16] = "expand 32-byte k";

@@ -129,8 +122,9 @@ static void chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
state[14] = le32_to_cpuvp(iv + 8);
state[15] = le32_to_cpuvp(iv + 12);
}
+EXPORT_SYMBOL_GPL(crypto_chacha20_init);

-static int chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
+int crypto_chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
unsigned int keysize)
{
struct chacha20_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -144,8 +138,9 @@ static int chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_chacha20_setkey);

-static int chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+int crypto_chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
struct scatterlist *src, unsigned int nbytes)
{
struct blkcipher_walk walk;
@@ -155,7 +150,7 @@ static int chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt_block(desc, &walk, CHACHA20_BLOCK_SIZE);

- chacha20_init(state, crypto_blkcipher_ctx(desc->tfm), walk.iv);
+ crypto_chacha20_init(state, crypto_blkcipher_ctx(desc->tfm), walk.iv);

while (walk.nbytes >= CHACHA20_BLOCK_SIZE) {
chacha20_docrypt(state, walk.dst.virt.addr, walk.src.virt.addr,
@@ -172,6 +167,7 @@ static int chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,

return err;
}
+EXPORT_SYMBOL_GPL(crypto_chacha20_crypt);

static struct crypto_alg alg = {
.cra_name = "chacha20",
@@ -187,11 +183,11 @@ static struct crypto_alg alg = {
.blkcipher = {
.min_keysize = CHACHA20_KEY_SIZE,
.max_keysize = CHACHA20_KEY_SIZE,
- .ivsize = CHACHA20_NONCE_SIZE,
+ .ivsize = CHACHA20_IV_SIZE,
.geniv = "seqiv",
- .setkey = chacha20_setkey,
- .encrypt = chacha20_crypt,
- .decrypt = chacha20_crypt,
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = crypto_chacha20_crypt,
+ .decrypt = crypto_chacha20_crypt,
},
},
};
diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index 7b46ed7..c9a36a9 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -13,6 +13,7 @@
#include <crypto/internal/hash.h>
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
+#include <crypto/chacha20.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -23,8 +24,6 @@
#define POLY1305_BLOCK_SIZE 16
#define POLY1305_DIGEST_SIZE 16
#define POLY1305_KEY_SIZE 32
-#define CHACHA20_KEY_SIZE 32
-#define CHACHA20_IV_SIZE 16
#define CHACHAPOLY_IV_SIZE 12

struct chachapoly_instance_ctx {
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h
new file mode 100644
index 0000000..274bbae
--- /dev/null
+++ b/include/crypto/chacha20.h
@@ -0,0 +1,25 @@
+/*
+ * Common values for the ChaCha20 algorithm
+ */
+
+#ifndef _CRYPTO_CHACHA20_H
+#define _CRYPTO_CHACHA20_H
+
+#include <linux/types.h>
+#include <linux/crypto.h>
+
+#define CHACHA20_IV_SIZE 16
+#define CHACHA20_KEY_SIZE 32
+#define CHACHA20_BLOCK_SIZE 64
+
+struct chacha20_ctx {
+ u32 key[8];
+};
+
+void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv);
+int crypto_chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
+ unsigned int keysize);
+int crypto_chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes);
+
+#endif
--
1.9.1

2015-07-07 19:37:20

by Martin Willi

Subject: [PATCH 09/10] crypto: poly1305 - Add a two block SSE2 variant for x86_64

Extends the x86_64 SSE2 Poly1305 authenticator with a function that
processes two consecutive Poly1305 blocks in parallel using a derived key
r^2. The loop-unrolled block processing maps more effectively to SSE
instructions, further increasing throughput.
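
The identity used is that two sequential block updates
h = (h + m1) * r; h = (h + m2) * r collapse into
h = (h + m1) * r^2 + m2 * r, making the two multiplications independent.
A minimal standalone sketch of this equivalence, using a small toy modulus
in place of the real 2^130-5 arithmetic:

#include <assert.h>
#include <stdint.h>

#define P 1000003ULL                    /* toy prime instead of 2^130 - 5 */

int main(void)
{
        uint64_t r = 12345, u = r * r % P;      /* u = r^2, derived key */
        uint64_t h = 42, m1 = 777, m2 = 888;

        /* two sequential single-block steps: h = (h + m) * r */
        uint64_t seq = (h + m1) % P * r % P;
        seq = (seq + m2) % P * r % P;

        /* one parallel double-block step: h = (h + m1) * r^2 + m2 * r */
        uint64_t par = ((h + m1) % P * u % P + m2 * r % P) % P;

        assert(seq == par);
        return 0;
}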

For large messages, throughput increases by ~45-65% compared to the
single-block SSE2 variant:

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3654724 opers/sec, 350853504 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5939245 opers/sec, 570167520 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9480730 opers/sec, 910150080 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1387720 opers/sec, 399663360 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2031633 opers/sec, 585110304 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3697643 opers/sec, 1064921184 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 570658 opers/sec, 602614848 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1129287 opers/sec, 1192527072 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 289452 opers/sec, 602060160 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 595957 opers/sec, 1239590560 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 303426 opers/sec, 1252542528 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 153136 opers/sec, 1259390464 bytes/sec

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3623907 opers/sec, 347895072 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5915244 opers/sec, 567863424 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9489525 opers/sec, 910994400 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1369657 opers/sec, 394461216 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2016000 opers/sec, 580608000 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3716146 opers/sec, 1070250048 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 561913 opers/sec, 593380128 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1635554 opers/sec, 1727145024 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 286540 opers/sec, 596003200 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 935339 opers/sec, 1945505120 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 491810 opers/sec, 2030191680 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 252857 opers/sec, 2079495968 bytes/sec

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/poly1305-sse2-x86_64.S | 306 +++++++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 54 +++++-
2 files changed, 355 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/poly1305-sse2-x86_64.S b/arch/x86/crypto/poly1305-sse2-x86_64.S
index a3d2b5e..338c748 100644
--- a/arch/x86/crypto/poly1305-sse2-x86_64.S
+++ b/arch/x86/crypto/poly1305-sse2-x86_64.S
@@ -15,6 +15,7 @@
.align 16

ANMASK: .octa 0x0000000003ffffff0000000003ffffff
+ORMASK: .octa 0x00000000010000000000000001000000

.text

@@ -274,3 +275,308 @@ ENTRY(poly1305_block_sse2)
pop %rbx
ret
ENDPROC(poly1305_block_sse2)
+
+
+#define u0 0x00(%r8)
+#define u1 0x04(%r8)
+#define u2 0x08(%r8)
+#define u3 0x0c(%r8)
+#define u4 0x10(%r8)
+#define hc0 %xmm0
+#define hc1 %xmm1
+#define hc2 %xmm2
+#define hc3 %xmm5
+#define hc4 %xmm6
+#define ru0 %xmm7
+#define ru1 %xmm8
+#define ru2 %xmm9
+#define ru3 %xmm10
+#define ru4 %xmm11
+#define sv1 %xmm12
+#define sv2 %xmm13
+#define sv3 %xmm14
+#define sv4 %xmm15
+#undef d0
+#define d0 %r13
+
+ENTRY(poly1305_2block_sse2)
+ # %rdi: Accumulator h[5]
+ # %rsi: 16 byte input block m
+ # %rdx: Poly1305 key r[5]
+ # %rcx: Doubleblock count
+ # %r8: Poly1305 derived key r^2 u[5]
+
+ # This two-block variant further improves performance by using
+ # loop-unrolled block processing. This is more straightforward and does
+ # less byte shuffling, but requires a second Poly1305 key r^2:
+ # h = (h + m) * r => h = (h + m1) * r^2 + m2 * r
+
+ push %rbx
+ push %r12
+ push %r13
+
+ # combine r0,u0
+ movd u0,ru0
+ movd r0,t1
+ punpcklqdq t1,ru0
+
+ # combine r1,u1 and s1=r1*5,v1=u1*5
+ movd u1,ru1
+ movd r1,t1
+ punpcklqdq t1,ru1
+ movdqa ru1,sv1
+ pslld $2,sv1
+ paddd ru1,sv1
+
+ # combine r2,u2 and s2=r2*5,v2=u2*5
+ movd u2,ru2
+ movd r2,t1
+ punpcklqdq t1,ru2
+ movdqa ru2,sv2
+ pslld $2,sv2
+ paddd ru2,sv2
+
+ # combine r3,u3 and s3=r3*5,v3=u3*5
+ movd u3,ru3
+ movd r3,t1
+ punpcklqdq t1,ru3
+ movdqa ru3,sv3
+ pslld $2,sv3
+ paddd ru3,sv3
+
+ # combine r4,u4 and s4=r4*5,v4=u4*5
+ movd u4,ru4
+ movd r4,t1
+ punpcklqdq t1,ru4
+ movdqa ru4,sv4
+ pslld $2,sv4
+ paddd ru4,sv4
+
+.Ldoblock2:
+ # hc0 = [ m[16-19] & 0x3ffffff, h0 + m[0-3] & 0x3ffffff ]
+ movd 0x00(m),hc0
+ movd 0x10(m),t1
+ punpcklqdq t1,hc0
+ pand ANMASK(%rip),hc0
+ movd h0,t1
+ paddd t1,hc0
+ # hc1 = [ (m[19-22] >> 2) & 0x3ffffff, h1 + (m[3-6] >> 2) & 0x3ffffff ]
+ movd 0x03(m),hc1
+ movd 0x13(m),t1
+ punpcklqdq t1,hc1
+ psrld $2,hc1
+ pand ANMASK(%rip),hc1
+ movd h1,t1
+ paddd t1,hc1
+ # hc2 = [ (m[22-25] >> 4) & 0x3ffffff, h2 + (m[6-9] >> 4) & 0x3ffffff ]
+ movd 0x06(m),hc2
+ movd 0x16(m),t1
+ punpcklqdq t1,hc2
+ psrld $4,hc2
+ pand ANMASK(%rip),hc2
+ movd h2,t1
+ paddd t1,hc2
+ # hc3 = [ (m[25-28] >> 6) & 0x3ffffff, h3 + (m[9-12] >> 6) & 0x3ffffff ]
+ movd 0x09(m),hc3
+ movd 0x19(m),t1
+ punpcklqdq t1,hc3
+ psrld $6,hc3
+ pand ANMASK(%rip),hc3
+ movd h3,t1
+ paddd t1,hc3
+ # hc4 = [ (m[28-31] >> 8) | (1<<24), h4 + (m[12-15] >> 8) | (1<<24) ]
+ movd 0x0c(m),hc4
+ movd 0x1c(m),t1
+ punpcklqdq t1,hc4
+ psrld $8,hc4
+ por ORMASK(%rip),hc4
+ movd h4,t1
+ paddd t1,hc4
+
+ # t1 = [ hc0[1] * r0, hc0[0] * u0 ]
+ movdqa ru0,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * s4, hc1[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * s3, hc2[0] * v3 ]
+ movdqa sv3,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * s2, hc3[0] * v2 ]
+ movdqa sv2,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s1, hc4[0] * v1 ]
+ movdqa sv1,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d0 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d0
+
+ # t1 = [ hc0[1] * r1, hc0[0] * u1 ]
+ movdqa ru1,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r0, hc1[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * s4, hc2[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * s3, hc3[0] * v3 ]
+ movdqa sv3,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s2, hc4[0] * v2 ]
+ movdqa sv2,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d1 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d1
+
+ # t1 = [ hc0[1] * r2, hc0[0] * u2 ]
+ movdqa ru2,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r1, hc1[0] * u1 ]
+ movdqa ru1,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * r0, hc2[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * s4, hc3[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s3, hc4[0] * v3 ]
+ movdqa sv3,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d2 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d2
+
+ # t1 = [ hc0[1] * r3, hc0[0] * u3 ]
+ movdqa ru3,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r2, hc1[0] * u2 ]
+ movdqa ru2,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * r1, hc2[0] * u1 ]
+ movdqa ru1,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * r0, hc3[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s4, hc4[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d3 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d3
+
+ # t1 = [ hc0[1] * r4, hc0[0] * u4 ]
+ movdqa ru4,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r3, hc1[0] * u3 ]
+ movdqa ru3,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * r2, hc2[0] * u2 ]
+ movdqa ru2,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * r1, hc3[0] * u1 ]
+ movdqa ru1,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * r0, hc4[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d4 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d4
+
+ # d1 += d0 >> 26
+ mov d0,%rax
+ shr $26,%rax
+ add %rax,d1
+ # h0 = d0 & 0x3ffffff
+ mov d0,%rbx
+ and $0x3ffffff,%ebx
+
+ # d2 += d1 >> 26
+ mov d1,%rax
+ shr $26,%rax
+ add %rax,d2
+ # h1 = d1 & 0x3ffffff
+ mov d1,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h1
+
+ # d3 += d2 >> 26
+ mov d2,%rax
+ shr $26,%rax
+ add %rax,d3
+ # h2 = d2 & 0x3ffffff
+ mov d2,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h2
+
+ # d4 += d3 >> 26
+ mov d3,%rax
+ shr $26,%rax
+ add %rax,d4
+ # h3 = d3 & 0x3ffffff
+ mov d3,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h3
+
+ # h0 += (d4 >> 26) * 5
+ mov d4,%rax
+ shr $26,%rax
+ lea (%eax,%eax,4),%eax
+ add %eax,%ebx
+ # h4 = d4 & 0x3ffffff
+ mov d4,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h4
+
+ # h1 += h0 >> 26
+ mov %ebx,%eax
+ shr $26,%eax
+ add %eax,h1
+ # h0 = h0 & 0x3ffffff
+ andl $0x3ffffff,%ebx
+ mov %ebx,h0
+
+ add $0x20,m
+ dec %rcx
+ jnz .Ldoblock2
+
+ pop %r13
+ pop %r12
+ pop %rbx
+ ret
+ENDPROC(poly1305_2block_sse2)
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 1e59274..b7c33d0 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -18,24 +18,68 @@
#include <asm/fpu/api.h>
#include <asm/simd.h>

+struct poly1305_simd_desc_ctx {
+ struct poly1305_desc_ctx base;
+ /* derived key u set? */
+ bool uset;
+ /* derived Poly1305 key r^2 */
+ u32 u[5];
+};
+
asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
const u32 *r, unsigned int blocks);
+asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r,
+ unsigned int blocks, const u32 *u);
+
+static int poly1305_simd_init(struct shash_desc *desc)
+{
+ struct poly1305_simd_desc_ctx *sctx = shash_desc_ctx(desc);
+
+ sctx->uset = false;
+
+ return crypto_poly1305_init(desc);
+}
+
+static void poly1305_simd_mult(u32 *a, const u32 *b)
+{
+ u8 m[POLY1305_BLOCK_SIZE];
+
+ memset(m, 0, sizeof(m));
+ /* The poly1305 block function adds a hi-bit to the accumulator which
+ * we don't need for key multiplication; compensate for it. */
+ a[4] -= 1 << 24;
+ poly1305_block_sse2(a, m, b, 1);
+}

static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
const u8 *src, unsigned int srclen)
{
+ struct poly1305_simd_desc_ctx *sctx;
unsigned int blocks, datalen;

+ BUILD_BUG_ON(offsetof(struct poly1305_simd_desc_ctx, base));
+ sctx = container_of(dctx, struct poly1305_simd_desc_ctx, base);
+
if (unlikely(!dctx->sset)) {
datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
src += srclen - datalen;
srclen = datalen;
}

+ if (likely(srclen >= POLY1305_BLOCK_SIZE * 2)) {
+ if (unlikely(!sctx->uset)) {
+ memcpy(sctx->u, dctx->r, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u, dctx->r);
+ sctx->uset = true;
+ }
+ blocks = srclen / (POLY1305_BLOCK_SIZE * 2);
+ poly1305_2block_sse2(dctx->h, src, dctx->r, blocks, sctx->u);
+ src += POLY1305_BLOCK_SIZE * 2 * blocks;
+ srclen -= POLY1305_BLOCK_SIZE * 2 * blocks;
+ }
if (srclen >= POLY1305_BLOCK_SIZE) {
- blocks = srclen / POLY1305_BLOCK_SIZE;
- poly1305_block_sse2(dctx->h, src, dctx->r, blocks);
- srclen -= POLY1305_BLOCK_SIZE * blocks;
+ poly1305_block_sse2(dctx->h, src, dctx->r, 1);
+ srclen -= POLY1305_BLOCK_SIZE;
}
return srclen;
}
@@ -84,11 +128,11 @@ static int poly1305_simd_update(struct shash_desc *desc,

static struct shash_alg alg = {
.digestsize = POLY1305_DIGEST_SIZE,
- .init = crypto_poly1305_init,
+ .init = poly1305_simd_init,
.update = poly1305_simd_update,
.final = crypto_poly1305_final,
.setkey = crypto_poly1305_setkey,
- .descsize = sizeof(struct poly1305_desc_ctx),
+ .descsize = sizeof(struct poly1305_simd_desc_ctx),
.base = {
.cra_name = "poly1305",
.cra_driver_name = "poly1305-simd",
--
1.9.1

2015-07-07 19:37:20

by Martin Willi

Subject: [PATCH 07/10] crypto: poly1305 - Export common Poly1305 helpers

As architecture-specific drivers need a software fallback, export the
Poly1305 init/update/final functions together with some helpers in a header
file.

Signed-off-by: Martin Willi <[email protected]>
---
crypto/chacha20poly1305.c | 4 +--
crypto/poly1305_generic.c | 73 +++++++++++++++++++++++------------------------
include/crypto/poly1305.h | 41 ++++++++++++++++++++++++++
3 files changed, 77 insertions(+), 41 deletions(-)
create mode 100644 include/crypto/poly1305.h

diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index c9a36a9..1f2f8b4 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -14,6 +14,7 @@
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
#include <crypto/chacha20.h>
+#include <crypto/poly1305.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -21,9 +22,6 @@

#include "internal.h"

-#define POLY1305_BLOCK_SIZE 16
-#define POLY1305_DIGEST_SIZE 16
-#define POLY1305_KEY_SIZE 32
#define CHACHAPOLY_IV_SIZE 12

struct chachapoly_instance_ctx {
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 387b5c8..2df9835d 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -13,31 +13,11 @@

#include <crypto/algapi.h>
#include <crypto/internal/hash.h>
+#include <crypto/poly1305.h>
#include <linux/crypto.h>
#include <linux/kernel.h>
#include <linux/module.h>

-#define POLY1305_BLOCK_SIZE 16
-#define POLY1305_KEY_SIZE 32
-#define POLY1305_DIGEST_SIZE 16
-
-struct poly1305_desc_ctx {
- /* key */
- u32 r[5];
- /* finalize key */
- u32 s[4];
- /* accumulator */
- u32 h[5];
- /* partial buffer */
- u8 buf[POLY1305_BLOCK_SIZE];
- /* bytes used in partial buffer */
- unsigned int buflen;
- /* r key has been set */
- bool rset;
- /* s key has been set */
- bool sset;
-};
-
static inline u64 mlt(u64 a, u64 b)
{
return a * b;
@@ -58,7 +38,7 @@ static inline u32 le32_to_cpuvp(const void *p)
return le32_to_cpup(p);
}

-static int poly1305_init(struct shash_desc *desc)
+int crypto_poly1305_init(struct shash_desc *desc)
{
struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);

@@ -69,8 +49,9 @@ static int poly1305_init(struct shash_desc *desc)

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_init);

-static int poly1305_setkey(struct crypto_shash *tfm,
+int crypto_poly1305_setkey(struct crypto_shash *tfm,
const u8 *key, unsigned int keylen)
{
/* Poly1305 requires a unique key for each tag, which implies that
@@ -79,6 +60,7 @@ static int poly1305_setkey(struct crypto_shash *tfm,
* the update() call. */
return -ENOTSUPP;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);

static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
{
@@ -98,16 +80,10 @@ static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
dctx->s[3] = le32_to_cpuvp(key + 12);
}

-static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
- const u8 *src, unsigned int srclen,
- u32 hibit)
+unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen)
{
- u32 r0, r1, r2, r3, r4;
- u32 s1, s2, s3, s4;
- u32 h0, h1, h2, h3, h4;
- u64 d0, d1, d2, d3, d4;
-
- if (unlikely(!dctx->sset)) {
+ if (!dctx->sset) {
if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) {
poly1305_setrkey(dctx, src);
src += POLY1305_BLOCK_SIZE;
@@ -121,6 +97,25 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
dctx->sset = true;
}
}
+ return srclen;
+}
+EXPORT_SYMBOL_GPL(crypto_poly1305_setdesckey);
+
+static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen,
+ u32 hibit)
+{
+ u32 r0, r1, r2, r3, r4;
+ u32 s1, s2, s3, s4;
+ u32 h0, h1, h2, h3, h4;
+ u64 d0, d1, d2, d3, d4;
+ unsigned int datalen;
+
+ if (unlikely(!dctx->sset)) {
+ datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
+ src += srclen - datalen;
+ srclen = datalen;
+ }

r0 = dctx->r[0];
r1 = dctx->r[1];
@@ -181,7 +176,7 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
return srclen;
}

-static int poly1305_update(struct shash_desc *desc,
+int crypto_poly1305_update(struct shash_desc *desc,
const u8 *src, unsigned int srclen)
{
struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -214,8 +209,9 @@ static int poly1305_update(struct shash_desc *desc,

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_update);

-static int poly1305_final(struct shash_desc *desc, u8 *dst)
+int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
{
struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
__le32 *mac = (__le32 *)dst;
@@ -282,13 +278,14 @@ static int poly1305_final(struct shash_desc *desc, u8 *dst)

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_final);

static struct shash_alg poly1305_alg = {
.digestsize = POLY1305_DIGEST_SIZE,
- .init = poly1305_init,
- .update = poly1305_update,
- .final = poly1305_final,
- .setkey = poly1305_setkey,
+ .init = crypto_poly1305_init,
+ .update = crypto_poly1305_update,
+ .final = crypto_poly1305_final,
+ .setkey = crypto_poly1305_setkey,
.descsize = sizeof(struct poly1305_desc_ctx),
.base = {
.cra_name = "poly1305",
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
new file mode 100644
index 0000000..894df59
--- /dev/null
+++ b/include/crypto/poly1305.h
@@ -0,0 +1,41 @@
+/*
+ * Common values for the Poly1305 algorithm
+ */
+
+#ifndef _CRYPTO_POLY1305_H
+#define _CRYPTO_POLY1305_H
+
+#include <linux/types.h>
+#include <linux/crypto.h>
+
+#define POLY1305_BLOCK_SIZE 16
+#define POLY1305_KEY_SIZE 32
+#define POLY1305_DIGEST_SIZE 16
+
+struct poly1305_desc_ctx {
+ /* key */
+ u32 r[5];
+ /* finalize key */
+ u32 s[4];
+ /* accumulator */
+ u32 h[5];
+ /* partial buffer */
+ u8 buf[POLY1305_BLOCK_SIZE];
+ /* bytes used in partial buffer */
+ unsigned int buflen;
+ /* r key has been set */
+ bool rset;
+ /* s key has been set */
+ bool sset;
+};
+
+int crypto_poly1305_init(struct shash_desc *desc);
+int crypto_poly1305_setkey(struct crypto_shash *tfm,
+ const u8 *key, unsigned int keylen);
+unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen);
+int crypto_poly1305_update(struct shash_desc *desc,
+ const u8 *src, unsigned int srclen);
+int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
+
+#endif
--
1.9.1

2015-07-07 19:37:20

by Martin Willi

Subject: [PATCH 08/10] crypto: poly1305 - Add a SSE2 SIMD variant for x86_64

Implements an x86_64 assembler driver for the Poly1305 authenticator. This
single-block variant holds the 130-bit integer in five 32-bit words, but
uses SSE to do two multiplications/additions in parallel.
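
The accumulator and key are kept in radix 2^26, i.e. five 26-bit limbs
stored in 32-bit words; this is what the ANMASK constant (0x3ffffff) and
the 2/4/6/8-bit shifts in the assembly below implement. A rough C sketch
of how a 16-byte message block is split into these limbs (the le32()
helper is only for illustration):

#include <stdint.h>

static uint32_t le32(const uint8_t *p)
{
        /* load four little-endian bytes */
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static void poly1305_load_block(uint32_t limb[5], const uint8_t m[16])
{
        limb[0] =  le32(m +  0)       & 0x3ffffff;
        limb[1] = (le32(m +  3) >> 2) & 0x3ffffff;
        limb[2] = (le32(m +  6) >> 4) & 0x3ffffff;
        limb[3] = (le32(m +  9) >> 6) & 0x3ffffff;
        /* 1 << 24 sets bit 128 of the padded block value */
        limb[4] = (le32(m + 12) >> 8) | (1 << 24);
}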

When calling updates with small blocks, the overhead of kernel_fpu_begin()/
kernel_fpu_end() negates the performance gain. We therefore use the
poly1305-generic fallback for small updates.

For large messages, throughput increases by ~5-10% compared to
poly1305-generic:

testing speed of poly1305 (poly1305-generic)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3599215 opers/sec, 345524640 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 6116275 opers/sec, 587162400 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9608427 opers/sec, 922408992 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1453208 opers/sec, 418523904 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2103440 opers/sec, 605790720 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3691593 opers/sec, 1063178784 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 593364 opers/sec, 626592384 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1074975 opers/sec, 1135173600 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 303882 opers/sec, 632074560 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 548559 opers/sec, 1141002720 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 276084 opers/sec, 1139674752 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 138984 opers/sec, 1143004416 bytes/sec

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3654724 opers/sec, 350853504 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5939245 opers/sec, 570167520 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9480730 opers/sec, 910150080 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1387720 opers/sec, 399663360 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2031633 opers/sec, 585110304 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3697643 opers/sec, 1064921184 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 570658 opers/sec, 602614848 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1129287 opers/sec, 1192527072 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 289452 opers/sec, 602060160 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 595957 opers/sec, 1239590560 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 303426 opers/sec, 1252542528 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 153136 opers/sec, 1259390464 bytes/sec

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 2 +
arch/x86/crypto/poly1305-sse2-x86_64.S | 276 +++++++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 123 +++++++++++++++
crypto/Kconfig | 12 ++
4 files changed, 413 insertions(+)
create mode 100644 arch/x86/crypto/poly1305-sse2-x86_64.S
create mode 100644 arch/x86/crypto/poly1305_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index ce39b3c..5cf405c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
+obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o

# These modules require assembler to support AVX.
ifeq ($(avx_supported),yes)
@@ -85,6 +86,7 @@ aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
+poly1305-x86_64-y := poly1305-sse2-x86_64.o poly1305_glue.o
ifeq ($(avx2_supported),yes)
sha1-ssse3-y += sha1_avx2_x86_64_asm.o
endif
diff --git a/arch/x86/crypto/poly1305-sse2-x86_64.S b/arch/x86/crypto/poly1305-sse2-x86_64.S
new file mode 100644
index 0000000..a3d2b5e
--- /dev/null
+++ b/arch/x86/crypto/poly1305-sse2-x86_64.S
@@ -0,0 +1,276 @@
+/*
+ * Poly1305 authenticator algorithm, RFC7539, x64 SSE2 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 16
+
+ANMASK: .octa 0x0000000003ffffff0000000003ffffff
+
+.text
+
+#define h0 0x00(%rdi)
+#define h1 0x04(%rdi)
+#define h2 0x08(%rdi)
+#define h3 0x0c(%rdi)
+#define h4 0x10(%rdi)
+#define r0 0x00(%rdx)
+#define r1 0x04(%rdx)
+#define r2 0x08(%rdx)
+#define r3 0x0c(%rdx)
+#define r4 0x10(%rdx)
+#define s1 0x00(%rsp)
+#define s2 0x04(%rsp)
+#define s3 0x08(%rsp)
+#define s4 0x0c(%rsp)
+#define m %rsi
+#define h01 %xmm0
+#define h23 %xmm1
+#define h44 %xmm2
+#define t1 %xmm3
+#define t2 %xmm4
+#define t3 %xmm5
+#define t4 %xmm6
+#define mask %xmm7
+#define d0 %r8
+#define d1 %r9
+#define d2 %r10
+#define d3 %r11
+#define d4 %r12
+
+ENTRY(poly1305_block_sse2)
+ # %rdi: Accumulator h[5]
+ # %rsi: 16 byte input block m
+ # %rdx: Poly1305 key r[5]
+ # %rcx: Block count
+
+ # This single block variant tries to improve performance by doing two
+ # multiplications in parallel using SSE instructions. There is quite
+ # some quadword packing involved, hence the speedup is marginal.
+
+ push %rbx
+ push %r12
+ sub $0x10,%rsp
+
+ # s1..s4 = r1..r4 * 5
+ mov r1,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s1
+ mov r2,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s2
+ mov r3,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s3
+ mov r4,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s4
+
+ movdqa ANMASK(%rip),mask
+
+.Ldoblock:
+ # h01 = [0, h1, 0, h0]
+ # h23 = [0, h3, 0, h2]
+ # h44 = [0, h4, 0, h4]
+ movd h0,h01
+ movd h1,t1
+ movd h2,h23
+ movd h3,t2
+ movd h4,h44
+ punpcklqdq t1,h01
+ punpcklqdq t2,h23
+ punpcklqdq h44,h44
+
+ # h01 += [ (m[3-6] >> 2) & 0x3ffffff, m[0-3] & 0x3ffffff ]
+ movd 0x00(m),t1
+ movd 0x03(m),t2
+ psrld $2,t2
+ punpcklqdq t2,t1
+ pand mask,t1
+ paddd t1,h01
+ # h23 += [ (m[9-12] >> 6) & 0x3ffffff, (m[6-9] >> 4) & 0x3ffffff ]
+ movd 0x06(m),t1
+ movd 0x09(m),t2
+ psrld $4,t1
+ psrld $6,t2
+ punpcklqdq t2,t1
+ pand mask,t1
+ paddd t1,h23
+ # h44 += [ (m[12-15] >> 8) | (1 << 24), (m[12-15] >> 8) | (1 << 24) ]
+ mov 0x0c(m),%eax
+ shr $8,%eax
+ or $0x01000000,%eax
+ movd %eax,t1
+ pshufd $0xc4,t1,t1
+ paddd t1,h44
+
+ # t1[0] = h0 * r0 + h2 * s3
+ # t1[1] = h1 * s4 + h3 * s2
+ movd r0,t1
+ movd s4,t2
+ punpcklqdq t2,t1
+ pmuludq h01,t1
+ movd s3,t2
+ movd s2,t3
+ punpcklqdq t3,t2
+ pmuludq h23,t2
+ paddq t2,t1
+ # t2[0] = h0 * r1 + h2 * s4
+ # t2[1] = h1 * r0 + h3 * s3
+ movd r1,t2
+ movd r0,t3
+ punpcklqdq t3,t2
+ pmuludq h01,t2
+ movd s4,t3
+ movd s3,t4
+ punpcklqdq t4,t3
+ pmuludq h23,t3
+ paddq t3,t2
+ # t3[0] = h4 * s1
+ # t3[1] = h4 * s2
+ movd s1,t3
+ movd s2,t4
+ punpcklqdq t4,t3
+ pmuludq h44,t3
+ # d0 = t1[0] + t1[1] + t3[0]
+ # d1 = t2[0] + t2[1] + t3[1]
+ movdqa t1,t4
+ punpcklqdq t2,t4
+ punpckhqdq t2,t1
+ paddq t4,t1
+ paddq t3,t1
+ movq t1,d0
+ psrldq $8,t1
+ movq t1,d1
+
+ # t1[0] = h0 * r2 + h2 * r0
+ # t1[1] = h1 * r1 + h3 * s4
+ movd r2,t1
+ movd r1,t2
+ punpcklqdq t2,t1
+ pmuludq h01,t1
+ movd r0,t2
+ movd s4,t3
+ punpcklqdq t3,t2
+ pmuludq h23,t2
+ paddq t2,t1
+ # t2[0] = h0 * r3 + h2 * r1
+ # t2[1] = h1 * r2 + h3 * r0
+ movd r3,t2
+ movd r2,t3
+ punpcklqdq t3,t2
+ pmuludq h01,t2
+ movd r1,t3
+ movd r0,t4
+ punpcklqdq t4,t3
+ pmuludq h23,t3
+ paddq t3,t2
+ # t3[0] = h4 * s3
+ # t3[1] = h4 * s4
+ movd s3,t3
+ movd s4,t4
+ punpcklqdq t4,t3
+ pmuludq h44,t3
+ # d2 = t1[0] + t1[1] + t3[0]
+ # d3 = t2[0] + t2[1] + t3[1]
+ movdqa t1,t4
+ punpcklqdq t2,t4
+ punpckhqdq t2,t1
+ paddq t4,t1
+ paddq t3,t1
+ movq t1,d2
+ psrldq $8,t1
+ movq t1,d3
+
+ # t1[0] = h0 * r4 + h2 * r2
+ # t1[1] = h1 * r3 + h3 * r1
+ movd r4,t1
+ movd r3,t2
+ punpcklqdq t2,t1
+ pmuludq h01,t1
+ movd r2,t2
+ movd r1,t3
+ punpcklqdq t3,t2
+ pmuludq h23,t2
+ paddq t2,t1
+ # t3[0] = h4 * r0
+ movd r0,t3
+ pmuludq h44,t3
+ # d4 = t1[0] + t1[1] + t3[0]
+ movdqa t1,t4
+ psrldq $8,t4
+ paddq t4,t1
+ paddq t3,t1
+ movq t1,d4
+
+ # d1 += d0 >> 26
+ mov d0,%rax
+ shr $26,%rax
+ add %rax,d1
+ # h0 = d0 & 0x3ffffff
+ mov d0,%rbx
+ and $0x3ffffff,%ebx
+
+ # d2 += d1 >> 26
+ mov d1,%rax
+ shr $26,%rax
+ add %rax,d2
+ # h1 = d1 & 0x3ffffff
+ mov d1,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h1
+
+ # d3 += d2 >> 26
+ mov d2,%rax
+ shr $26,%rax
+ add %rax,d3
+ # h2 = d2 & 0x3ffffff
+ mov d2,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h2
+
+ # d4 += d3 >> 26
+ mov d3,%rax
+ shr $26,%rax
+ add %rax,d4
+ # h3 = d3 & 0x3ffffff
+ mov d3,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h3
+
+ # h0 += (d4 >> 26) * 5
+ mov d4,%rax
+ shr $26,%rax
+ lea (%eax,%eax,4),%eax
+ add %eax,%ebx
+ # h4 = d4 & 0x3ffffff
+ mov d4,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h4
+
+ # h1 += h0 >> 26
+ mov %ebx,%eax
+ shr $26,%eax
+ add %eax,h1
+ # h0 = h0 & 0x3ffffff
+ andl $0x3ffffff,%ebx
+ mov %ebx,h0
+
+ add $0x10,m
+ dec %rcx
+ jnz .Ldoblock
+
+ add $0x10,%rsp
+ pop %r12
+ pop %rbx
+ ret
+ENDPROC(poly1305_block_sse2)
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
new file mode 100644
index 0000000..1e59274
--- /dev/null
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -0,0 +1,123 @@
+/*
+ * Poly1305 authenticator algorithm, RFC7539, SIMD glue code
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/poly1305.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/fpu/api.h>
+#include <asm/simd.h>
+
+asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
+ const u32 *r, unsigned int blocks);
+
+static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen)
+{
+ unsigned int blocks, datalen;
+
+ if (unlikely(!dctx->sset)) {
+ datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
+ src += srclen - datalen;
+ srclen = datalen;
+ }
+
+ if (srclen >= POLY1305_BLOCK_SIZE) {
+ blocks = srclen / POLY1305_BLOCK_SIZE;
+ poly1305_block_sse2(dctx->h, src, dctx->r, blocks);
+ srclen -= POLY1305_BLOCK_SIZE * blocks;
+ }
+ return srclen;
+}
+
+static int poly1305_simd_update(struct shash_desc *desc,
+ const u8 *src, unsigned int srclen)
+{
+ struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+ unsigned int bytes;
+
+ /* kernel_fpu_begin/end is costly, use fallback for small updates */
+ if (srclen <= 288 || !may_use_simd())
+ return crypto_poly1305_update(desc, src, srclen);
+
+ kernel_fpu_begin();
+
+ if (unlikely(dctx->buflen)) {
+ bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen);
+ memcpy(dctx->buf + dctx->buflen, src, bytes);
+ src += bytes;
+ srclen -= bytes;
+ dctx->buflen += bytes;
+
+ if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+ poly1305_simd_blocks(dctx, dctx->buf,
+ POLY1305_BLOCK_SIZE);
+ dctx->buflen = 0;
+ }
+ }
+
+ if (likely(srclen >= POLY1305_BLOCK_SIZE)) {
+ bytes = poly1305_simd_blocks(dctx, src, srclen);
+ src += srclen - bytes;
+ srclen = bytes;
+ }
+
+ kernel_fpu_end();
+
+ if (unlikely(srclen)) {
+ dctx->buflen = srclen;
+ memcpy(dctx->buf, src, srclen);
+ }
+
+ return 0;
+}
+
+static struct shash_alg alg = {
+ .digestsize = POLY1305_DIGEST_SIZE,
+ .init = crypto_poly1305_init,
+ .update = poly1305_simd_update,
+ .final = crypto_poly1305_final,
+ .setkey = crypto_poly1305_setkey,
+ .descsize = sizeof(struct poly1305_desc_ctx),
+ .base = {
+ .cra_name = "poly1305",
+ .cra_driver_name = "poly1305-simd",
+ .cra_priority = 300,
+ .cra_flags = CRYPTO_ALG_TYPE_SHASH,
+ .cra_alignmask = sizeof(u32) - 1,
+ .cra_blocksize = POLY1305_BLOCK_SIZE,
+ .cra_module = THIS_MODULE,
+ },
+};
+
+static int __init poly1305_simd_mod_init(void)
+{
+ if (!cpu_has_xmm2)
+ return -ENODEV;
+
+ return crypto_register_shash(&alg);
+}
+
+static void __exit poly1305_simd_mod_exit(void)
+{
+ crypto_unregister_shash(&alg);
+}
+
+module_init(poly1305_simd_mod_init);
+module_exit(poly1305_simd_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Willi <[email protected]>");
+MODULE_DESCRIPTION("Poly1305 authenticator");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-simd");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 82caab0..c57478c 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -470,6 +470,18 @@ config CRYPTO_POLY1305
It is used for the ChaCha20-Poly1305 AEAD, specified in RFC7539 for use
in IETF protocols. This is the portable C implementation of Poly1305.

+config CRYPTO_POLY1305_X86_64
+ tristate "Poly1305 authenticator algorithm (x86_64/SSE2)"
+ depends on X86 && 64BIT
+ select CRYPTO_POLY1305
+ help
+ Poly1305 authenticator algorithm, RFC7539.
+
+ Poly1305 is an authenticator algorithm designed by Daniel J. Bernstein.
+ It is used for the ChaCha20-Poly1305 AEAD, specified in RFC7539 for use
+ in IETF protocols. This is the x86_64 assembler implementation using SIMD
+ instructions.
+
config CRYPTO_MD4
tristate "MD4 digest algorithm"
select CRYPTO_HASH
--
1.9.1

2015-07-07 19:37:21

by Martin Willi

Subject: [PATCH 04/10] crypto: chacha20 - Add a four block SSSE3 variant for x86_64

Extends the x86_64 SSSE3 ChaCha20 implementation with a function that
processes four ChaCha20 blocks in parallel. This avoids the word shuffling
needed in the single-block variant, further increasing throughput.
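
For reference, the scalar quarter round that the .Ldoubleround4 loop below
applies simultaneously to the corresponding words of all four state copies
(a sketch matching the structure of the generic implementation, not code
from this patch):

#include <stdint.h>

static inline uint32_t rotl32(uint32_t v, int n)
{
        return (v << n) | (v >> (32 - n));
}

/* one ChaCha20 quarter round; a double round runs it four times on
 * columns and four times on diagonals of the 4x4 state matrix */
static void chacha20_quarterround(uint32_t *a, uint32_t *b,
                                  uint32_t *c, uint32_t *d)
{
        *a += *b; *d = rotl32(*d ^ *a, 16);
        *c += *d; *b = rotl32(*b ^ *c, 12);
        *a += *b; *d = rotl32(*d ^ *a,  8);
        *c += *d; *b = rotl32(*b ^ *c,  7);
}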

For large messages, throughput increases by ~110% compared to the
single-block SSSE3 variant:

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 4154698 operations in 1 seconds (66475168 bytes)
test 1 (256 bit key, 64 byte blocks): 4593368 operations in 1 seconds (293975552 bytes)
test 2 (256 bit key, 256 byte blocks): 1796194 operations in 1 seconds (459825664 bytes)
test 3 (256 bit key, 1024 byte blocks): 519725 operations in 1 seconds (532198400 bytes)
test 4 (256 bit key, 8192 byte blocks): 67132 operations in 1 seconds (549945344 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 4164293 operations in 1 seconds (66628688 bytes)
test 1 (256 bit key, 64 byte blocks): 4545912 operations in 1 seconds (290938368 bytes)
test 2 (256 bit key, 256 byte blocks): 3238241 operations in 1 seconds (828989696 bytes)
test 3 (256 bit key, 1024 byte blocks): 1120664 operations in 1 seconds (1147559936 bytes)
test 4 (256 bit key, 8192 byte blocks): 140107 operations in 1 seconds (1147756544 bytes)

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/chacha20-ssse3-x86_64.S | 483 ++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 8 +
2 files changed, 491 insertions(+)

diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S
index 1b97ad0..712b130 100644
--- a/arch/x86/crypto/chacha20-ssse3-x86_64.S
+++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S
@@ -16,6 +16,7 @@

ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
+CTRINC: .octa 0x00000003000000020000000100000000

.text

@@ -140,3 +141,485 @@ ENTRY(chacha20_block_xor_ssse3)

ret
ENDPROC(chacha20_block_xor_ssse3)
+
+ENTRY(chacha20_4block_xor_ssse3)
+ # %rdi: Input state matrix, s
+ # %rsi: 4 data blocks output, o
+ # %rdx: 4 data blocks input, i
+
+ # This function encrypts four consecutive ChaCha20 blocks by loading
+ # the state matrix in SSE registers four times. As we need some scratch
+ # registers, we save the first four registers on the stack. The
+ # algorithm performs each operation on the corresponding word of each
+ # state matrix, hence requires no word shuffling. For the final XORing
+ # step we transpose the matrix by interleaving 32- and then 64-bit words,
+ # which allows us to do XOR in SSE registers. 8/16-bit word rotation is
+ # done with the slightly better performing SSSE3 byte shuffling,
+ # 7/12-bit word rotation uses traditional shift+OR.
+
+ sub $0x40,%rsp
+
+ # x0..15[0-3] = s0..3[0..3]
+ movq 0x00(%rdi),%xmm1
+ pshufd $0x00,%xmm1,%xmm0
+ pshufd $0x55,%xmm1,%xmm1
+ movq 0x08(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ movq 0x10(%rdi),%xmm5
+ pshufd $0x00,%xmm5,%xmm4
+ pshufd $0x55,%xmm5,%xmm5
+ movq 0x18(%rdi),%xmm7
+ pshufd $0x00,%xmm7,%xmm6
+ pshufd $0x55,%xmm7,%xmm7
+ movq 0x20(%rdi),%xmm9
+ pshufd $0x00,%xmm9,%xmm8
+ pshufd $0x55,%xmm9,%xmm9
+ movq 0x28(%rdi),%xmm11
+ pshufd $0x00,%xmm11,%xmm10
+ pshufd $0x55,%xmm11,%xmm11
+ movq 0x30(%rdi),%xmm13
+ pshufd $0x00,%xmm13,%xmm12
+ pshufd $0x55,%xmm13,%xmm13
+ movq 0x38(%rdi),%xmm15
+ pshufd $0x00,%xmm15,%xmm14
+ pshufd $0x55,%xmm15,%xmm15
+ # x0..3 on stack
+ movdqa %xmm0,0x00(%rsp)
+ movdqa %xmm1,0x10(%rsp)
+ movdqa %xmm2,0x20(%rsp)
+ movdqa %xmm3,0x30(%rsp)
+
+ movdqa CTRINC(%rip),%xmm1
+ movdqa ROT8(%rip),%xmm2
+ movdqa ROT16(%rip),%xmm3
+
+ # x12 += counter values 0-3
+ paddd %xmm1,%xmm12
+
+ mov $10,%ecx
+
+.Ldoubleround4:
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm3,%xmm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm3,%xmm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 16)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm3,%xmm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 16)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm3,%xmm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 12)
+ paddd %xmm12,%xmm8
+ pxor %xmm8,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm4
+ por %xmm0,%xmm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 12)
+ paddd %xmm13,%xmm9
+ pxor %xmm9,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm5
+ por %xmm0,%xmm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 12)
+ paddd %xmm14,%xmm10
+ pxor %xmm10,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm6
+ por %xmm0,%xmm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 12)
+ paddd %xmm15,%xmm11
+ pxor %xmm11,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm7
+ por %xmm0,%xmm7
+
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 8)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm2,%xmm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 8)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm2,%xmm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 8)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm2,%xmm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 8)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm2,%xmm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 7)
+ paddd %xmm12,%xmm8
+ pxor %xmm8,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm4
+ por %xmm0,%xmm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 7)
+ paddd %xmm13,%xmm9
+ pxor %xmm9,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm5
+ por %xmm0,%xmm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 7)
+ paddd %xmm14,%xmm10
+ pxor %xmm10,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm6
+ por %xmm0,%xmm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 7)
+ paddd %xmm15,%xmm11
+ pxor %xmm11,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm7
+ por %xmm0,%xmm7
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 16)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm3,%xmm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 16)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm3,%xmm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 16)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm3,%xmm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 16)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm3,%xmm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 12)
+ paddd %xmm15,%xmm10
+ pxor %xmm10,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm5
+ por %xmm0,%xmm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 12)
+ paddd %xmm12,%xmm11
+ pxor %xmm11,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm6
+ por %xmm0,%xmm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 12)
+ paddd %xmm13,%xmm8
+ pxor %xmm8,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm7
+ por %xmm0,%xmm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 12)
+ paddd %xmm14,%xmm9
+ pxor %xmm9,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm4
+ por %xmm0,%xmm4
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 8)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm2,%xmm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 8)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm2,%xmm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 8)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm2,%xmm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 8)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm2,%xmm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 7)
+ paddd %xmm15,%xmm10
+ pxor %xmm10,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm5
+ por %xmm0,%xmm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 7)
+ paddd %xmm12,%xmm11
+ pxor %xmm11,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm6
+ por %xmm0,%xmm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 7)
+ paddd %xmm13,%xmm8
+ pxor %xmm8,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm7
+ por %xmm0,%xmm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 7)
+ paddd %xmm14,%xmm9
+ pxor %xmm9,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm4
+ por %xmm0,%xmm4
+
+ dec %ecx
+ jnz .Ldoubleround4
+
+ # x0[0-3] += s0[0]
+ # x1[0-3] += s0[1]
+ movq 0x00(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd 0x00(%rsp),%xmm2
+ movdqa %xmm2,0x00(%rsp)
+ paddd 0x10(%rsp),%xmm3
+ movdqa %xmm3,0x10(%rsp)
+ # x2[0-3] += s0[2]
+ # x3[0-3] += s0[3]
+ movq 0x08(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd 0x20(%rsp),%xmm2
+ movdqa %xmm2,0x20(%rsp)
+ paddd 0x30(%rsp),%xmm3
+ movdqa %xmm3,0x30(%rsp)
+
+ # x4[0-3] += s1[0]
+ # x5[0-3] += s1[1]
+ movq 0x10(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm4
+ paddd %xmm3,%xmm5
+ # x6[0-3] += s1[2]
+ # x7[0-3] += s1[3]
+ movq 0x18(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm6
+ paddd %xmm3,%xmm7
+
+ # x8[0-3] += s2[0]
+ # x9[0-3] += s2[1]
+ movq 0x20(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm8
+ paddd %xmm3,%xmm9
+ # x10[0-3] += s2[2]
+ # x11[0-3] += s2[3]
+ movq 0x28(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm10
+ paddd %xmm3,%xmm11
+
+ # x12[0-3] += s3[0]
+ # x13[0-3] += s3[1]
+ movq 0x30(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm12
+ paddd %xmm3,%xmm13
+ # x14[0-3] += s3[2]
+ # x15[0-3] += s3[3]
+ movq 0x38(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm14
+ paddd %xmm3,%xmm15
+
+ # x12 += counter values 0-3
+ paddd %xmm1,%xmm12
+
+ # interleave 32-bit words in state n, n+1
+ movdqa 0x00(%rsp),%xmm0
+ movdqa 0x10(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpckldq %xmm1,%xmm2
+ punpckhdq %xmm1,%xmm0
+ movdqa %xmm2,0x00(%rsp)
+ movdqa %xmm0,0x10(%rsp)
+ movdqa 0x20(%rsp),%xmm0
+ movdqa 0x30(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpckldq %xmm1,%xmm2
+ punpckhdq %xmm1,%xmm0
+ movdqa %xmm2,0x20(%rsp)
+ movdqa %xmm0,0x30(%rsp)
+ movdqa %xmm4,%xmm0
+ punpckldq %xmm5,%xmm4
+ punpckhdq %xmm5,%xmm0
+ movdqa %xmm0,%xmm5
+ movdqa %xmm6,%xmm0
+ punpckldq %xmm7,%xmm6
+ punpckhdq %xmm7,%xmm0
+ movdqa %xmm0,%xmm7
+ movdqa %xmm8,%xmm0
+ punpckldq %xmm9,%xmm8
+ punpckhdq %xmm9,%xmm0
+ movdqa %xmm0,%xmm9
+ movdqa %xmm10,%xmm0
+ punpckldq %xmm11,%xmm10
+ punpckhdq %xmm11,%xmm0
+ movdqa %xmm0,%xmm11
+ movdqa %xmm12,%xmm0
+ punpckldq %xmm13,%xmm12
+ punpckhdq %xmm13,%xmm0
+ movdqa %xmm0,%xmm13
+ movdqa %xmm14,%xmm0
+ punpckldq %xmm15,%xmm14
+ punpckhdq %xmm15,%xmm0
+ movdqa %xmm0,%xmm15
+
+ # interleave 64-bit words in state n, n+2
+ movdqa 0x00(%rsp),%xmm0
+ movdqa 0x20(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpcklqdq %xmm1,%xmm2
+ punpckhqdq %xmm1,%xmm0
+ movdqa %xmm2,0x00(%rsp)
+ movdqa %xmm0,0x20(%rsp)
+ movdqa 0x10(%rsp),%xmm0
+ movdqa 0x30(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpcklqdq %xmm1,%xmm2
+ punpckhqdq %xmm1,%xmm0
+ movdqa %xmm2,0x10(%rsp)
+ movdqa %xmm0,0x30(%rsp)
+ movdqa %xmm4,%xmm0
+ punpcklqdq %xmm6,%xmm4
+ punpckhqdq %xmm6,%xmm0
+ movdqa %xmm0,%xmm6
+ movdqa %xmm5,%xmm0
+ punpcklqdq %xmm7,%xmm5
+ punpckhqdq %xmm7,%xmm0
+ movdqa %xmm0,%xmm7
+ movdqa %xmm8,%xmm0
+ punpcklqdq %xmm10,%xmm8
+ punpckhqdq %xmm10,%xmm0
+ movdqa %xmm0,%xmm10
+ movdqa %xmm9,%xmm0
+ punpcklqdq %xmm11,%xmm9
+ punpckhqdq %xmm11,%xmm0
+ movdqa %xmm0,%xmm11
+ movdqa %xmm12,%xmm0
+ punpcklqdq %xmm14,%xmm12
+ punpckhqdq %xmm14,%xmm0
+ movdqa %xmm0,%xmm14
+ movdqa %xmm13,%xmm0
+ punpcklqdq %xmm15,%xmm13
+ punpckhqdq %xmm15,%xmm0
+ movdqa %xmm0,%xmm15
+
+ # xor with corresponding input, write to output
+ movdqa 0x00(%rsp),%xmm0
+ movdqu 0x00(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0x00(%rsi)
+ movdqa 0x10(%rsp),%xmm0
+ movdqu 0x80(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0x80(%rsi)
+ movdqa 0x20(%rsp),%xmm0
+ movdqu 0x40(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0x40(%rsi)
+ movdqa 0x30(%rsp),%xmm0
+ movdqu 0xc0(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0xc0(%rsi)
+ movdqu 0x10(%rdx),%xmm1
+ pxor %xmm1,%xmm4
+ movdqu %xmm4,0x10(%rsi)
+ movdqu 0x90(%rdx),%xmm1
+ pxor %xmm1,%xmm5
+ movdqu %xmm5,0x90(%rsi)
+ movdqu 0x50(%rdx),%xmm1
+ pxor %xmm1,%xmm6
+ movdqu %xmm6,0x50(%rsi)
+ movdqu 0xd0(%rdx),%xmm1
+ pxor %xmm1,%xmm7
+ movdqu %xmm7,0xd0(%rsi)
+ movdqu 0x20(%rdx),%xmm1
+ pxor %xmm1,%xmm8
+ movdqu %xmm8,0x20(%rsi)
+ movdqu 0xa0(%rdx),%xmm1
+ pxor %xmm1,%xmm9
+ movdqu %xmm9,0xa0(%rsi)
+ movdqu 0x60(%rdx),%xmm1
+ pxor %xmm1,%xmm10
+ movdqu %xmm10,0x60(%rsi)
+ movdqu 0xe0(%rdx),%xmm1
+ pxor %xmm1,%xmm11
+ movdqu %xmm11,0xe0(%rsi)
+ movdqu 0x30(%rdx),%xmm1
+ pxor %xmm1,%xmm12
+ movdqu %xmm12,0x30(%rsi)
+ movdqu 0xb0(%rdx),%xmm1
+ pxor %xmm1,%xmm13
+ movdqu %xmm13,0xb0(%rsi)
+ movdqu 0x70(%rdx),%xmm1
+ pxor %xmm1,%xmm14
+ movdqu %xmm14,0x70(%rsi)
+ movdqu 0xf0(%rdx),%xmm1
+ pxor %xmm1,%xmm15
+ movdqu %xmm15,0xf0(%rsi)
+
+ add $0x40,%rsp
+ ret
+ENDPROC(chacha20_4block_xor_ssse3)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 250de40..4d677c3 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -20,12 +20,20 @@
#define CHACHA20_STATE_ALIGN 16

asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
+asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);

static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
u8 buf[CHACHA20_BLOCK_SIZE];

+ while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+ chacha20_4block_xor_ssse3(state, dst, src);
+ bytes -= CHACHA20_BLOCK_SIZE * 4;
+ src += CHACHA20_BLOCK_SIZE * 4;
+ dst += CHACHA20_BLOCK_SIZE * 4;
+ state[12] += 4;
+ }
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_ssse3(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
--
1.9.1

2015-07-07 19:37:21

by Martin Willi

[permalink] [raw]
Subject: [PATCH 01/10] crypto: tcrypt - Add ChaCha20/Poly1305 speed tests

Adds individual ChaCha20 and Poly1305 speed tests as well as a combined
rfc7539esp AEAD speed test, using mode numbers 214, 321 and 213. For
Poly1305 we add a specific speed template, as it expects the key prepended
to the input data.
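
The template block lengths are therefore the usual payloads plus the
32-byte key (96 = 64 + 32, 288 = 256 + 32, and so on). As a rough
illustration, not part of the patch, this is how a caller feeds the
one-time key through the shash interface (error handling and desc flags
omitted):

    struct crypto_shash *tfm = crypto_alloc_shash("poly1305", 0, 0);
    SHASH_DESC_ON_STACK(desc, tfm);
    u8 mac[16];

    desc->tfm = tfm;
    crypto_shash_init(desc);
    crypto_shash_update(desc, key, 32);      /* one-time r || s key */
    crypto_shash_update(desc, msg, msglen);  /* actual payload */
    crypto_shash_final(desc, mac);
    crypto_free_shash(tfm);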

Signed-off-by: Martin Willi <[email protected]>
---
crypto/tcrypt.c | 15 +++++++++++++++
crypto/tcrypt.h | 20 ++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 9f6f10b..0e2f651 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1767,6 +1767,17 @@ static int do_test(const char *alg, u32 type, u32 mask, int m)
NULL, 0, 16, 8, aead_speed_template_19);
break;

+ case 213:
+ test_aead_speed("rfc7539esp(chacha20,poly1305)", ENCRYPT, sec,
+ NULL, 0, 16, 8, aead_speed_template_36);
+ break;
+
+ case 214:
+ test_cipher_speed("chacha20", ENCRYPT, sec, NULL, 0,
+ speed_template_32);
+ break;
+
+
case 300:
if (alg) {
test_hash_speed(alg, sec, generic_hash_speed_template);
@@ -1855,6 +1866,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m)
test_hash_speed("crct10dif", sec, generic_hash_speed_template);
if (mode > 300 && mode < 400) break;

+ case 321:
+ test_hash_speed("poly1305", sec, poly1305_speed_template);
+ if (mode > 300 && mode < 400) break;
+
case 399:
break;

diff --git a/crypto/tcrypt.h b/crypto/tcrypt.h
index 6cc1b85..f0bfee1 100644
--- a/crypto/tcrypt.h
+++ b/crypto/tcrypt.h
@@ -61,12 +61,14 @@ static u8 speed_template_32_40_48[] = {32, 40, 48, 0};
static u8 speed_template_32_48[] = {32, 48, 0};
static u8 speed_template_32_48_64[] = {32, 48, 64, 0};
static u8 speed_template_32_64[] = {32, 64, 0};
+static u8 speed_template_32[] = {32, 0};

/*
* AEAD speed tests
*/
static u8 aead_speed_template_19[] = {19, 0};
static u8 aead_speed_template_20[] = {20, 0};
+static u8 aead_speed_template_36[] = {36, 0};

/*
* Digest speed tests
@@ -127,4 +129,22 @@ static struct hash_speed hash_speed_template_16[] = {
{ .blen = 0, .plen = 0, .klen = 0, }
};

+static struct hash_speed poly1305_speed_template[] = {
+ { .blen = 96, .plen = 16, },
+ { .blen = 96, .plen = 32, },
+ { .blen = 96, .plen = 96, },
+ { .blen = 288, .plen = 16, },
+ { .blen = 288, .plen = 32, },
+ { .blen = 288, .plen = 288, },
+ { .blen = 1056, .plen = 32, },
+ { .blen = 1056, .plen = 1056, },
+ { .blen = 2080, .plen = 32, },
+ { .blen = 2080, .plen = 2080, },
+ { .blen = 4128, .plen = 4128, },
+ { .blen = 8224, .plen = 8224, },
+
+ /* End marker */
+ { .blen = 0, .plen = 0, }
+};
+
#endif /* _CRYPTO_TCRYPT_H */
--
1.9.1

2015-07-07 19:37:21

by Martin Willi

[permalink] [raw]
Subject: [PATCH 06/10] crypto: testmgr - Add a longer ChaCha20 test vector

The AVX2 variant of ChaCha20 is used only for messages of at least 512
bytes. With the existing test vectors, that code path could not be
exercised. Due to the lack of an official test vector of that length, this
one is self-generated using chacha20-generic.
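
For reference, chacha20-generic consumes the 16-byte IV above as four
little-endian words filling state words 12..15, so this vector starts at
block counter 0x1c. A rough sketch of the resulting state setup (combined
effect of setkey and init; not the exact kernel code, helper names
approximate):

    static void chacha20_init_state(u32 state[16], const u8 key[32],
                                    const u8 iv[16])
    {
        int i;

        state[0] = 0x61707865;  /* "expa" */
        state[1] = 0x3320646e;  /* "nd 3" */
        state[2] = 0x79622d32;  /* "2-by" */
        state[3] = 0x6b206574;  /* "te k" */
        for (i = 0; i < 8; i++)
            state[4 + i] = get_unaligned_le32(key + 4 * i);
        state[12] = get_unaligned_le32(iv + 0);   /* block counter, 0x1c here */
        state[13] = get_unaligned_le32(iv + 4);
        state[14] = get_unaligned_le32(iv + 8);
        state[15] = get_unaligned_le32(iv + 12);  /* 0x01000000 here */
    }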

Signed-off-by: Martin Willi <[email protected]>
---
crypto/testmgr.h | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 333 insertions(+), 1 deletion(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index b052555..b77901d 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -30180,7 +30180,7 @@ static struct cipher_testvec salsa20_stream_enc_tv_template[] = {
},
};

-#define CHACHA20_ENC_TEST_VECTORS 3
+#define CHACHA20_ENC_TEST_VECTORS 4
static struct cipher_testvec chacha20_enc_tv_template[] = {
{ /* RFC7539 A.2. Test Vector #1 */
.key = "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -30354,6 +30354,338 @@ static struct cipher_testvec chacha20_enc_tv_template[] = {
"\x87\xb5\x8d\xfd\x72\x8a\xfa\x36"
"\x75\x7a\x79\x7a\xc1\x88\xd1",
.rlen = 127,
+ }, { /* Self-made test vector for long data */
+ .key = "\x1c\x92\x40\xa5\xeb\x55\xd3\x8a"
+ "\xf3\x33\x88\x86\x04\xf6\xb5\xf0"
+ "\x47\x39\x17\xc1\x40\x2b\x80\x09"
+ "\x9d\xca\x5c\xbc\x20\x70\x75\xc0",
+ .klen = 32,
+ .iv = "\x1c\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x01",
+ .input = "\x49\xee\xe0\xdc\x24\x90\x40\xcd"
+ "\xc5\x40\x8f\x47\x05\xbc\xdd\x81"
+ "\x47\xc6\x8d\xe6\xb1\x8f\xd7\xcb"
+ "\x09\x0e\x6e\x22\x48\x1f\xbf\xb8"
+ "\x5c\xf7\x1e\x8a\xc1\x23\xf2\xd4"
+ "\x19\x4b\x01\x0f\x4e\xa4\x43\xce"
+ "\x01\xc6\x67\xda\x03\x91\x18\x90"
+ "\xa5\xa4\x8e\x45\x03\xb3\x2d\xac"
+ "\x74\x92\xd3\x53\x47\xc8\xdd\x25"
+ "\x53\x6c\x02\x03\x87\x0d\x11\x0c"
+ "\x58\xe3\x12\x18\xfd\x2a\x5b\x40"
+ "\x0c\x30\xf0\xb8\x3f\x43\xce\xae"
+ "\x65\x3a\x7d\x7c\xf4\x54\xaa\xcc"
+ "\x33\x97\xc3\x77\xba\xc5\x70\xde"
+ "\xd7\xd5\x13\xa5\x65\xc4\x5f\x0f"
+ "\x46\x1a\x0d\x97\xb5\xf3\xbb\x3c"
+ "\x84\x0f\x2b\xc5\xaa\xea\xf2\x6c"
+ "\xc9\xb5\x0c\xee\x15\xf3\x7d\xbe"
+ "\x9f\x7b\x5a\xa6\xae\x4f\x83\xb6"
+ "\x79\x49\x41\xf4\x58\x18\xcb\x86"
+ "\x7f\x30\x0e\xf8\x7d\x44\x36\xea"
+ "\x75\xeb\x88\x84\x40\x3c\xad\x4f"
+ "\x6f\x31\x6b\xaa\x5d\xe5\xa5\xc5"
+ "\x21\x66\xe9\xa7\xe3\xb2\x15\x88"
+ "\x78\xf6\x79\xa1\x59\x47\x12\x4e"
+ "\x9f\x9f\x64\x1a\xa0\x22\x5b\x08"
+ "\xbe\x7c\x36\xc2\x2b\x66\x33\x1b"
+ "\xdd\x60\x71\xf7\x47\x8c\x61\xc3"
+ "\xda\x8a\x78\x1e\x16\xfa\x1e\x86"
+ "\x81\xa6\x17\x2a\xa7\xb5\xc2\xe7"
+ "\xa4\xc7\x42\xf1\xcf\x6a\xca\xb4"
+ "\x45\xcf\xf3\x93\xf0\xe7\xea\xf6"
+ "\xf4\xe6\x33\x43\x84\x93\xa5\x67"
+ "\x9b\x16\x58\x58\x80\x0f\x2b\x5c"
+ "\x24\x74\x75\x7f\x95\x81\xb7\x30"
+ "\x7a\x33\xa7\xf7\x94\x87\x32\x27"
+ "\x10\x5d\x14\x4c\x43\x29\xdd\x26"
+ "\xbd\x3e\x3c\x0e\xfe\x0e\xa5\x10"
+ "\xea\x6b\x64\xfd\x73\xc6\xed\xec"
+ "\xa8\xc9\xbf\xb3\xba\x0b\x4d\x07"
+ "\x70\xfc\x16\xfd\x79\x1e\xd7\xc5"
+ "\x49\x4e\x1c\x8b\x8d\x79\x1b\xb1"
+ "\xec\xca\x60\x09\x4c\x6a\xd5\x09"
+ "\x49\x46\x00\x88\x22\x8d\xce\xea"
+ "\xb1\x17\x11\xde\x42\xd2\x23\xc1"
+ "\x72\x11\xf5\x50\x73\x04\x40\x47"
+ "\xf9\x5d\xe7\xa7\x26\xb1\x7e\xb0"
+ "\x3f\x58\xc1\x52\xab\x12\x67\x9d"
+ "\x3f\x43\x4b\x68\xd4\x9c\x68\x38"
+ "\x07\x8a\x2d\x3e\xf3\xaf\x6a\x4b"
+ "\xf9\xe5\x31\x69\x22\xf9\xa6\x69"
+ "\xc6\x9c\x96\x9a\x12\x35\x95\x1d"
+ "\x95\xd5\xdd\xbe\xbf\x93\x53\x24"
+ "\xfd\xeb\xc2\x0a\x64\xb0\x77\x00"
+ "\x6f\x88\xc4\x37\x18\x69\x7c\xd7"
+ "\x41\x92\x55\x4c\x03\xa1\x9a\x4b"
+ "\x15\xe5\xdf\x7f\x37\x33\x72\xc1"
+ "\x8b\x10\x67\xa3\x01\x57\x94\x25"
+ "\x7b\x38\x71\x7e\xdd\x1e\xcc\x73"
+ "\x55\xd2\x8e\xeb\x07\xdd\xf1\xda"
+ "\x58\xb1\x47\x90\xfe\x42\x21\x72"
+ "\xa3\x54\x7a\xa0\x40\xec\x9f\xdd"
+ "\xc6\x84\x6e\xca\xae\xe3\x68\xb4"
+ "\x9d\xe4\x78\xff\x57\xf2\xf8\x1b"
+ "\x03\xa1\x31\xd9\xde\x8d\xf5\x22"
+ "\x9c\xdd\x20\xa4\x1e\x27\xb1\x76"
+ "\x4f\x44\x55\xe2\x9b\xa1\x9c\xfe"
+ "\x54\xf7\x27\x1b\xf4\xde\x02\xf5"
+ "\x1b\x55\x48\x5c\xdc\x21\x4b\x9e"
+ "\x4b\x6e\xed\x46\x23\xdc\x65\xb2"
+ "\xcf\x79\x5f\x28\xe0\x9e\x8b\xe7"
+ "\x4c\x9d\x8a\xff\xc1\xa6\x28\xb8"
+ "\x65\x69\x8a\x45\x29\xef\x74\x85"
+ "\xde\x79\xc7\x08\xae\x30\xb0\xf4"
+ "\xa3\x1d\x51\x41\xab\xce\xcb\xf6"
+ "\xb5\xd8\x6d\xe0\x85\xe1\x98\xb3"
+ "\x43\xbb\x86\x83\x0a\xa0\xf5\xb7"
+ "\x04\x0b\xfa\x71\x1f\xb0\xf6\xd9"
+ "\x13\x00\x15\xf0\xc7\xeb\x0d\x5a"
+ "\x9f\xd7\xb9\x6c\x65\x14\x22\x45"
+ "\x6e\x45\x32\x3e\x7e\x60\x1a\x12"
+ "\x97\x82\x14\xfb\xaa\x04\x22\xfa"
+ "\xa0\xe5\x7e\x8c\x78\x02\x48\x5d"
+ "\x78\x33\x5a\x7c\xad\xdb\x29\xce"
+ "\xbb\x8b\x61\xa4\xb7\x42\xe2\xac"
+ "\x8b\x1a\xd9\x2f\x0b\x8b\x62\x21"
+ "\x83\x35\x7e\xad\x73\xc2\xb5\x6c"
+ "\x10\x26\x38\x07\xe5\xc7\x36\x80"
+ "\xe2\x23\x12\x61\xf5\x48\x4b\x2b"
+ "\xc5\xdf\x15\xd9\x87\x01\xaa\xac"
+ "\x1e\x7c\xad\x73\x78\x18\x63\xe0"
+ "\x8b\x9f\x81\xd8\x12\x6a\x28\x10"
+ "\xbe\x04\x68\x8a\x09\x7c\x1b\x1c"
+ "\x83\x66\x80\x47\x80\xe8\xfd\x35"
+ "\x1c\x97\x6f\xae\x49\x10\x66\xcc"
+ "\xc6\xd8\xcc\x3a\x84\x91\x20\x77"
+ "\x72\xe4\x24\xd2\x37\x9f\xc5\xc9"
+ "\x25\x94\x10\x5f\x40\x00\x64\x99"
+ "\xdc\xae\xd7\x21\x09\x78\x50\x15"
+ "\xac\x5f\xc6\x2c\xa2\x0b\xa9\x39"
+ "\x87\x6e\x6d\xab\xde\x08\x51\x16"
+ "\xc7\x13\xe9\xea\xed\x06\x8e\x2c"
+ "\xf8\x37\x8c\xf0\xa6\x96\x8d\x43"
+ "\xb6\x98\x37\xb2\x43\xed\xde\xdf"
+ "\x89\x1a\xe7\xeb\x9d\xa1\x7b\x0b"
+ "\x77\xb0\xe2\x75\xc0\xf1\x98\xd9"
+ "\x80\x55\xc9\x34\x91\xd1\x59\xe8"
+ "\x4b\x0f\xc1\xa9\x4b\x7a\x84\x06"
+ "\x20\xa8\x5d\xfa\xd1\xde\x70\x56"
+ "\x2f\x9e\x91\x9c\x20\xb3\x24\xd8"
+ "\x84\x3d\xe1\x8c\x7e\x62\x52\xe5"
+ "\x44\x4b\x9f\xc2\x93\x03\xea\x2b"
+ "\x59\xc5\xfa\x3f\x91\x2b\xbb\x23"
+ "\xf5\xb2\x7b\xf5\x38\xaf\xb3\xee"
+ "\x63\xdc\x7b\xd1\xff\xaa\x8b\xab"
+ "\x82\x6b\x37\x04\xeb\x74\xbe\x79"
+ "\xb9\x83\x90\xef\x20\x59\x46\xff"
+ "\xe9\x97\x3e\x2f\xee\xb6\x64\x18"
+ "\x38\x4c\x7a\x4a\xf9\x61\xe8\x9a"
+ "\xa1\xb5\x01\xa6\x47\xd3\x11\xd4"
+ "\xce\xd3\x91\x49\x88\xc7\xb8\x4d"
+ "\xb1\xb9\x07\x6d\x16\x72\xae\x46"
+ "\x5e\x03\xa1\x4b\xb6\x02\x30\xa8"
+ "\x3d\xa9\x07\x2a\x7c\x19\xe7\x62"
+ "\x87\xe3\x82\x2f\x6f\xe1\x09\xd9"
+ "\x94\x97\xea\xdd\x58\x9e\xae\x76"
+ "\x7e\x35\xe5\xb4\xda\x7e\xf4\xde"
+ "\xf7\x32\x87\xcd\x93\xbf\x11\x56"
+ "\x11\xbe\x08\x74\xe1\x69\xad\xe2"
+ "\xd7\xf8\x86\x75\x8a\x3c\xa4\xbe"
+ "\x70\xa7\x1b\xfc\x0b\x44\x2a\x76"
+ "\x35\xea\x5d\x85\x81\xaf\x85\xeb"
+ "\xa0\x1c\x61\xc2\xf7\x4f\xa5\xdc"
+ "\x02\x7f\xf6\x95\x40\x6e\x8a\x9a"
+ "\xf3\x5d\x25\x6e\x14\x3a\x22\xc9"
+ "\x37\x1c\xeb\x46\x54\x3f\xa5\x91"
+ "\xc2\xb5\x8c\xfe\x53\x08\x97\x32"
+ "\x1b\xb2\x30\x27\xfe\x25\x5d\xdc"
+ "\x08\x87\xd0\xe5\x94\x1a\xd4\xf1"
+ "\xfe\xd6\xb4\xa3\xe6\x74\x81\x3c"
+ "\x1b\xb7\x31\xa7\x22\xfd\xd4\xdd"
+ "\x20\x4e\x7c\x51\xb0\x60\x73\xb8"
+ "\x9c\xac\x91\x90\x7e\x01\xb0\xe1"
+ "\x8a\x2f\x75\x1c\x53\x2a\x98\x2a"
+ "\x06\x52\x95\x52\xb2\xe9\x25\x2e"
+ "\x4c\xe2\x5a\x00\xb2\x13\x81\x03"
+ "\x77\x66\x0d\xa5\x99\xda\x4e\x8c"
+ "\xac\xf3\x13\x53\x27\x45\xaf\x64"
+ "\x46\xdc\xea\x23\xda\x97\xd1\xab"
+ "\x7d\x6c\x30\x96\x1f\xbc\x06\x34"
+ "\x18\x0b\x5e\x21\x35\x11\x8d\x4c"
+ "\xe0\x2d\xe9\x50\x16\x74\x81\xa8"
+ "\xb4\x34\xb9\x72\x42\xa6\xcc\xbc"
+ "\xca\x34\x83\x27\x10\x5b\x68\x45"
+ "\x8f\x52\x22\x0c\x55\x3d\x29\x7c"
+ "\xe3\xc0\x66\x05\x42\x91\x5f\x58"
+ "\xfe\x4a\x62\xd9\x8c\xa9\x04\x19"
+ "\x04\xa9\x08\x4b\x57\xfc\x67\x53"
+ "\x08\x7c\xbc\x66\x8a\xb0\xb6\x9f"
+ "\x92\xd6\x41\x7c\x5b\x2a\x00\x79"
+ "\x72",
+ .ilen = 1281,
+ .result = "\x45\xe8\xe0\xb6\x9c\xca\xfd\x87"
+ "\xe8\x1d\x37\x96\x8a\xe3\x40\x35"
+ "\xcf\x5e\x3a\x46\x3d\xfb\xd0\x69"
+ "\xde\xaf\x7a\xd5\x0d\xe9\x52\xec"
+ "\xc2\x82\xe5\x3e\x7d\xb2\x4a\xd9"
+ "\xbb\xc3\x9f\xc0\x5d\xac\x93\x8d"
+ "\x0e\x6f\xd3\xd7\xfb\x6a\x0d\xce"
+ "\x92\x2c\xf7\xbb\x93\x57\xcc\xee"
+ "\x42\x72\x6f\xc8\x4b\xd2\x76\xbf"
+ "\xa0\xe3\x7a\x39\xf9\x5c\x8e\xfd"
+ "\xa1\x1d\x41\xe5\x08\xc1\x1c\x11"
+ "\x92\xfd\x39\x5c\x51\xd0\x2f\x66"
+ "\x33\x4a\x71\x15\xfe\xee\x12\x54"
+ "\x8c\x8f\x34\xd8\x50\x3c\x18\xa6"
+ "\xc5\xe1\x46\x8a\xfb\x5f\x7e\x25"
+ "\x9b\xe2\xc3\x66\x41\x2b\xb3\xa5"
+ "\x57\x0e\x94\x17\x26\x39\xbb\x54"
+ "\xae\x2e\x6f\x42\xfb\x4d\x89\x6f"
+ "\x9d\xf1\x16\x2e\xe3\xe7\xfc\xe3"
+ "\xb2\x4b\x2b\xa6\x7c\x04\x69\x3a"
+ "\x70\x5a\xa7\xf1\x31\x64\x19\xca"
+ "\x45\x79\xd8\x58\x23\x61\xaf\xc2"
+ "\x52\x05\xc3\x0b\xc1\x64\x7c\x81"
+ "\xd9\x11\xcf\xff\x02\x3d\x51\x84"
+ "\x01\xac\xc6\x2e\x34\x2b\x09\x3a"
+ "\xa8\x5d\x98\x0e\x89\xd9\xef\x8f"
+ "\xd9\xd7\x7d\xdd\x63\x47\x46\x7d"
+ "\xa1\xda\x0b\x53\x7d\x79\xcd\xc9"
+ "\x86\xdd\x6b\x13\xa1\x9a\x70\xdd"
+ "\x5c\xa1\x69\x3c\xe4\x5d\xe3\x8c"
+ "\xe5\xf4\x87\x9c\x10\xcf\x0f\x0b"
+ "\xc8\x43\xdc\xf8\x1d\x62\x5e\x5b"
+ "\xe2\x03\x06\xc5\x71\xb6\x48\xa5"
+ "\xf0\x0f\x2d\xd5\xa2\x73\x55\x8f"
+ "\x01\xa7\x59\x80\x5f\x11\x6c\x40"
+ "\xff\xb1\xf2\xc6\x7e\x01\xbb\x1c"
+ "\x69\x9c\xc9\x3f\x71\x5f\x07\x7e"
+ "\xdf\x6f\x99\xca\x9c\xfd\xf9\xb9"
+ "\x49\xe7\xcc\x91\xd5\x9b\x8f\x03"
+ "\xae\xe7\x61\x32\xef\x41\x6c\x75"
+ "\x84\x9b\x8c\xce\x1d\x6b\x93\x21"
+ "\x41\xec\xc6\xad\x8e\x0c\x48\xa8"
+ "\xe2\xf5\x57\xde\xf7\x38\xfd\x4a"
+ "\x6f\xa7\x4a\xf9\xac\x7d\xb1\x85"
+ "\x7d\x6c\x95\x0a\x5a\xcf\x68\xd2"
+ "\xe0\x7a\x26\xd9\xc1\x6d\x3e\xc6"
+ "\x37\xbd\xbe\x24\x36\x77\x9f\x1b"
+ "\xc1\x22\xf3\x79\xae\x95\x78\x66"
+ "\x97\x11\xc0\x1a\xf1\xe8\x0d\x38"
+ "\x09\xc2\xee\xb7\xd3\x46\x7b\x59"
+ "\x77\x23\xe8\xb4\x92\x3d\x78\xbe"
+ "\xe2\x25\x63\xa5\x2a\x06\x70\x92"
+ "\x32\x63\xf9\x19\x21\x68\xe1\x0b"
+ "\x9a\xd0\xee\x21\xdb\x1f\xe0\xde"
+ "\x3e\x64\x02\x4d\x0e\xe0\x0a\xa9"
+ "\xed\x19\x8c\xa8\xbf\xe3\x2e\x75"
+ "\x24\x2b\xb0\xe5\x82\x6a\x1e\x6f"
+ "\x71\x2a\x3a\x60\xed\x06\x0d\x17"
+ "\xa2\xdb\x29\x1d\xae\xb2\xc4\xfb"
+ "\x94\x04\xd8\x58\xfc\xc4\x04\x4e"
+ "\xee\xc7\xc1\x0f\xe9\x9b\x63\x2d"
+ "\x02\x3e\x02\x67\xe5\xd8\xbb\x79"
+ "\xdf\xd2\xeb\x50\xe9\x0a\x02\x46"
+ "\xdf\x68\xcf\xe7\x2b\x0a\x56\xd6"
+ "\xf7\xbc\x44\xad\xb8\xb5\x5f\xeb"
+ "\xbc\x74\x6b\xe8\x7e\xb0\x60\xc6"
+ "\x0d\x96\x09\xbb\x19\xba\xe0\x3c"
+ "\xc4\x6c\xbf\x0f\x58\xc0\x55\x62"
+ "\x23\xa0\xff\xb5\x1c\xfd\x18\xe1"
+ "\xcf\x6d\xd3\x52\xb4\xce\xa6\xfa"
+ "\xaa\xfb\x1b\x0b\x42\x6d\x79\x42"
+ "\x48\x70\x5b\x0e\xdd\x3a\xc9\x69"
+ "\x8b\x73\x67\xf6\x95\xdb\x8c\xfb"
+ "\xfd\xb5\x08\x47\x42\x84\x9a\xfa"
+ "\xcc\x67\xb2\x3c\xb6\xfd\xd8\x32"
+ "\xd6\x04\xb6\x4a\xea\x53\x4b\xf5"
+ "\x94\x16\xad\xf0\x10\x2e\x2d\xb4"
+ "\x8b\xab\xe5\x89\xc7\x39\x12\xf3"
+ "\x8d\xb5\x96\x0b\x87\x5d\xa7\x7c"
+ "\xb0\xc2\xf6\x2e\x57\x97\x2c\xdc"
+ "\x54\x1c\x34\x72\xde\x0c\x68\x39"
+ "\x9d\x32\xa5\x75\x92\x13\x32\xea"
+ "\x90\x27\xbd\x5b\x1d\xb9\x21\x02"
+ "\x1c\xcc\xba\x97\x5e\x49\x58\xe8"
+ "\xac\x8b\xf3\xce\x3c\xf0\x00\xe9"
+ "\x6c\xae\xe9\x77\xdf\xf4\x02\xcd"
+ "\x55\x25\x89\x9e\x90\xf3\x6b\x8f"
+ "\xb7\xd6\x47\x98\x26\x2f\x31\x2f"
+ "\x8d\xbf\x54\xcd\x99\xeb\x80\xd7"
+ "\xac\xc3\x08\xc2\xa6\x32\xf1\x24"
+ "\x76\x7c\x4f\x78\x53\x55\xfb\x00"
+ "\x8a\xd6\x52\x53\x25\x45\xfb\x0a"
+ "\x6b\xb9\xbe\x3c\x5e\x11\xcc\x6a"
+ "\xdd\xfc\xa7\xc4\x79\x4d\xbd\xfb"
+ "\xce\x3a\xf1\x7a\xda\xeb\xfe\x64"
+ "\x28\x3d\x0f\xee\x80\xba\x0c\xf8"
+ "\xe9\x5b\x3a\xd4\xae\xc9\xf3\x0e"
+ "\xe8\x5d\xc5\x5c\x0b\x20\x20\xee"
+ "\x40\x0d\xde\x07\xa7\x14\xb4\x90"
+ "\xb6\xbd\x3b\xae\x7d\x2b\xa7\xc7"
+ "\xdc\x0b\x4c\x5d\x65\xb0\xd2\xc5"
+ "\x79\x61\x23\xe0\xa2\x99\x73\x55"
+ "\xad\xc6\xfb\xc7\x54\xb5\x98\x1f"
+ "\x8c\x86\xc2\x3f\xbe\x5e\xea\x64"
+ "\xa3\x60\x18\x9f\x80\xaf\x52\x74"
+ "\x1a\xfe\x22\xc2\x92\x67\x40\x02"
+ "\x08\xee\x67\x5b\x67\xe0\x3d\xde"
+ "\x7a\xaf\x8e\x28\xf3\x5e\x0e\xf4"
+ "\x48\x56\xaa\x85\x22\xd8\x36\xed"
+ "\x3b\x3d\x68\x69\x30\xbc\x71\x23"
+ "\xb1\x6e\x61\x03\x89\x44\x03\xf4"
+ "\x32\xaa\x4c\x40\x9f\x69\xfb\x70"
+ "\x91\xcc\x1f\x11\xbd\x76\x67\xe6"
+ "\x10\x8b\x29\x39\x68\xea\x4e\x6d"
+ "\xae\xfb\x40\xcf\xe2\xd0\x0d\x8d"
+ "\x6f\xed\x9b\x8d\x64\x7a\x94\x8e"
+ "\x32\x38\x78\xeb\x7d\x5f\xf9\x4d"
+ "\x13\xbe\x21\xea\x16\xe7\x5c\xee"
+ "\xcd\xf6\x5f\xc6\x45\xb2\x8f\x2b"
+ "\xb5\x93\x3e\x45\xdb\xfd\xa2\x6a"
+ "\xec\x83\x92\x99\x87\x47\xe0\x7c"
+ "\xa2\x7b\xc4\x2a\xcd\xc0\x81\x03"
+ "\x98\xb0\x87\xb6\x86\x13\x64\x33"
+ "\x4c\xd7\x99\xbf\xdb\x7b\x6e\xaa"
+ "\x76\xcc\xa0\x74\x1b\xa3\x6e\x83"
+ "\xd4\xba\x7a\x84\x9d\x91\x71\xcd"
+ "\x60\x2d\x56\xfd\x26\x35\xcb\xeb"
+ "\xac\xe9\xee\xa4\xfc\x18\x5b\x91"
+ "\xd5\xfe\x84\x45\xe0\xc7\xfd\x11"
+ "\xe9\x00\xb6\x54\xdf\xe1\x94\xde"
+ "\x2b\x70\x9f\x94\x7f\x15\x0e\x83"
+ "\x63\x10\xb3\xf5\xea\xd3\xe8\xd1"
+ "\xa5\xfc\x17\x19\x68\x9a\xbc\x17"
+ "\x30\x43\x0a\x1a\x33\x92\xd4\x2a"
+ "\x2e\x68\x99\xbc\x49\xf0\x68\xe3"
+ "\xf0\x1f\xcb\xcc\xfa\xbb\x05\x56"
+ "\x46\x84\x8b\x69\x83\x64\xc5\xe0"
+ "\xc5\x52\x99\x07\x3c\xa6\x5c\xaf"
+ "\xa3\xde\xd7\xdb\x43\xe6\xb7\x76"
+ "\x4e\x4d\xd6\x71\x60\x63\x4a\x0c"
+ "\x5f\xae\x25\x84\x22\x90\x5f\x26"
+ "\x61\x4d\x8f\xaf\xc9\x22\xf2\x05"
+ "\xcf\xc1\xdc\x68\xe5\x57\x8e\x24"
+ "\x1b\x30\x59\xca\xd7\x0d\xc3\xd3"
+ "\x52\x9e\x09\x3e\x0e\xaf\xdb\x5f"
+ "\xc7\x2b\xde\x3a\xfd\xad\x93\x04"
+ "\x74\x06\x89\x0e\x90\xeb\x85\xff"
+ "\xe6\x3c\x12\x42\xf4\xfa\x80\x75"
+ "\x5e\x4e\xd7\x2f\x93\x0b\x34\x41"
+ "\x02\x85\x68\xd0\x03\x12\xde\x92"
+ "\x54\x7a\x7e\xfb\x55\xe7\x88\xfb"
+ "\xa4\xa9\xf2\xd1\xc6\x70\x06\x37"
+ "\x25\xee\xa7\x6e\xd9\x89\x86\x50"
+ "\x2e\x07\xdb\xfb\x2a\x86\x45\x0e"
+ "\x91\xf4\x7c\xbb\x12\x60\xe8\x3f"
+ "\x71\xbe\x8f\x9d\x26\xef\xd9\x89"
+ "\xc4\x8f\xd8\xc5\x73\xd8\x84\xaa"
+ "\x2f\xad\x22\x1e\x7e\xcf\xa2\x08"
+ "\x23\x45\x89\x42\xa0\x30\xeb\xbf"
+ "\xa1\xed\xad\xd5\x76\xfa\x24\x8f"
+ "\x98",
+ .rlen = 1281,
},
};

--
1.9.1

2015-07-07 19:37:21

by Martin Willi

[permalink] [raw]
Subject: [PATCH 03/10] crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64

Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due to the use of pshufb for efficient 8/16-bit rotate
operations.
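
The rounds map directly onto the ChaCha20 quarter round; one xmm register
holds one row of the state, so a single quarter-round step covers all four
columns (and, after the pshufd row rotations, all four diagonals) at once.
A scalar sketch of the operation being vectorized (not part of the patch):

    static inline u32 rotl32(u32 v, int n)
    {
        return (v << n) | (v >> (32 - n));
    }

    /* One quarter round; the SSSE3 code performs this on whole rows,
     * with pshufb replacing the 16- and 8-bit rotates. */
    static void chacha20_quarterround(u32 *a, u32 *b, u32 *c, u32 *d)
    {
        *a += *b; *d = rotl32(*d ^ *a, 16);
        *c += *d; *b = rotl32(*b ^ *c, 12);
        *a += *b; *d = rotl32(*d ^ *a,  8);
        *c += *d; *b = rotl32(*b ^ *c,  7);
    }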

For large messages, throughput increases by ~65% compared to
chacha20-generic:

testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 4015926 operations in 1 seconds (64254816 bytes)
test 1 (256 bit key, 64 byte blocks): 4161758 operations in 1 seconds (266352512 bytes)
test 2 (256 bit key, 256 byte blocks): 1223686 operations in 1 seconds (313263616 bytes)
test 3 (256 bit key, 1024 byte blocks): 325200 operations in 1 seconds (333004800 bytes)
test 4 (256 bit key, 8192 byte blocks): 40725 operations in 1 seconds (333619200 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 4154698 operations in 1 seconds (66475168 bytes)
test 1 (256 bit key, 64 byte blocks): 4593368 operations in 1 seconds (293975552 bytes)
test 2 (256 bit key, 256 byte blocks): 1796194 operations in 1 seconds (459825664 bytes)
test 3 (256 bit key, 1024 byte blocks): 519725 operations in 1 seconds (532198400 bytes)
test 4 (256 bit key, 8192 byte blocks): 67132 operations in 1 seconds (549945344 bytes)

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 2 +
arch/x86/crypto/chacha20-ssse3-x86_64.S | 142 ++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 123 +++++++++++++++++++++++++++
crypto/Kconfig | 15 ++++
4 files changed, 282 insertions(+)
create mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
create mode 100644 arch/x86/crypto/chacha20_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 5a4a089..b09e9a4 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
obj-$(CONFIG_CRYPTO_SALSA20_X86_64) += salsa20-x86_64.o
+obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
@@ -60,6 +61,7 @@ blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
salsa20-x86_64-y := salsa20-x86_64-asm_64.o salsa20_glue.o
+chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o

ifeq ($(avx_supported),yes)
diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S
new file mode 100644
index 0000000..1b97ad0
--- /dev/null
+++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S
@@ -0,0 +1,142 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 16
+
+ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
+ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
+
+.text
+
+ENTRY(chacha20_block_xor_ssse3)
+ # %rdi: Input state matrix, s
+ # %rsi: 1 data block output, o
+ # %rdx: 1 data block input, i
+
+	# This function encrypts one ChaCha20 block by loading the state matrix
+	# in four SSE registers. It performs matrix operations on four words in
+	# parallel, but requires shuffling to rearrange the words after each
+	# round. 8/16-bit word rotation is done with the slightly better
+	# performing SSSE3 byte shuffling, 7/12-bit word rotation uses
+	# traditional shift+OR.
+
+ # x0..3 = s0..3
+ movdqa 0x00(%rdi),%xmm0
+ movdqa 0x10(%rdi),%xmm1
+ movdqa 0x20(%rdi),%xmm2
+ movdqa 0x30(%rdi),%xmm3
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+ movdqa %xmm2,%xmm10
+ movdqa %xmm3,%xmm11
+
+ movdqa ROT8(%rip),%xmm4
+ movdqa ROT16(%rip),%xmm5
+
+ mov $10,%ecx
+
+.Ldoubleround:
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm5,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm6
+ pslld $12,%xmm6
+ psrld $20,%xmm1
+ por %xmm6,%xmm1
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm4,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm7
+ pslld $7,%xmm7
+ psrld $25,%xmm1
+ por %xmm7,%xmm1
+
+ # x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+ pshufd $0x39,%xmm1,%xmm1
+ # x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+ pshufd $0x4e,%xmm2,%xmm2
+ # x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+ pshufd $0x93,%xmm3,%xmm3
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm5,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm6
+ pslld $12,%xmm6
+ psrld $20,%xmm1
+ por %xmm6,%xmm1
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm4,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm7
+ pslld $7,%xmm7
+ psrld $25,%xmm1
+ por %xmm7,%xmm1
+
+ # x1 = shuffle32(x1, MASK(2, 1, 0, 3))
+ pshufd $0x93,%xmm1,%xmm1
+ # x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+ pshufd $0x4e,%xmm2,%xmm2
+ # x3 = shuffle32(x3, MASK(0, 3, 2, 1))
+ pshufd $0x39,%xmm3,%xmm3
+
+ dec %ecx
+ jnz .Ldoubleround
+
+ # o0 = i0 ^ (x0 + s0)
+ movdqu 0x00(%rdx),%xmm4
+ paddd %xmm8,%xmm0
+ pxor %xmm4,%xmm0
+ movdqu %xmm0,0x00(%rsi)
+ # o1 = i1 ^ (x1 + s1)
+ movdqu 0x10(%rdx),%xmm5
+ paddd %xmm9,%xmm1
+ pxor %xmm5,%xmm1
+ movdqu %xmm1,0x10(%rsi)
+ # o2 = i2 ^ (x2 + s2)
+ movdqu 0x20(%rdx),%xmm6
+ paddd %xmm10,%xmm2
+ pxor %xmm6,%xmm2
+ movdqu %xmm2,0x20(%rsi)
+ # o3 = i3 ^ (x3 + s3)
+ movdqu 0x30(%rdx),%xmm7
+ paddd %xmm11,%xmm3
+ pxor %xmm7,%xmm3
+ movdqu %xmm3,0x30(%rsi)
+
+ ret
+ENDPROC(chacha20_block_xor_ssse3)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
new file mode 100644
index 0000000..250de40
--- /dev/null
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -0,0 +1,123 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/chacha20.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/fpu/api.h>
+#include <asm/simd.h>
+
+#define CHACHA20_STATE_ALIGN 16
+
+asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
+
+static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
+ unsigned int bytes)
+{
+ u8 buf[CHACHA20_BLOCK_SIZE];
+
+ while (bytes >= CHACHA20_BLOCK_SIZE) {
+ chacha20_block_xor_ssse3(state, dst, src);
+ bytes -= CHACHA20_BLOCK_SIZE;
+ src += CHACHA20_BLOCK_SIZE;
+ dst += CHACHA20_BLOCK_SIZE;
+ state[12]++;
+ }
+ if (bytes) {
+ memcpy(buf, src, bytes);
+ chacha20_block_xor_ssse3(state, buf, buf);
+ memcpy(dst, buf, bytes);
+ }
+}
+
+static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
+ struct blkcipher_walk walk;
+ int err;
+
+ if (!may_use_simd())
+ return crypto_chacha20_crypt(desc, dst, src, nbytes);
+
+ state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt_block(desc, &walk, CHACHA20_BLOCK_SIZE);
+
+ crypto_chacha20_init(state, crypto_blkcipher_ctx(desc->tfm), walk.iv);
+
+ kernel_fpu_begin();
+
+ while (walk.nbytes >= CHACHA20_BLOCK_SIZE) {
+ chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
+ rounddown(walk.nbytes, CHACHA20_BLOCK_SIZE));
+ err = blkcipher_walk_done(desc, &walk,
+ walk.nbytes % CHACHA20_BLOCK_SIZE);
+ }
+
+ if (walk.nbytes) {
+ chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
+ walk.nbytes);
+ err = blkcipher_walk_done(desc, &walk, 0);
+ }
+
+ kernel_fpu_end();
+
+ return err;
+}
+
+static struct crypto_alg alg = {
+ .cra_name = "chacha20",
+ .cra_driver_name = "chacha20-simd",
+ .cra_priority = 300,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = 1,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_ctxsize = sizeof(struct chacha20_ctx),
+ .cra_alignmask = sizeof(u32) - 1,
+ .cra_module = THIS_MODULE,
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = CHACHA20_KEY_SIZE,
+ .max_keysize = CHACHA20_KEY_SIZE,
+ .ivsize = CHACHA20_IV_SIZE,
+ .geniv = "seqiv",
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = chacha20_simd,
+ .decrypt = chacha20_simd,
+ },
+ },
+};
+
+static int __init chacha20_simd_mod_init(void)
+{
+ if (!cpu_has_ssse3)
+ return -ENODEV;
+
+ return crypto_register_alg(&alg);
+}
+
+static void __exit chacha20_simd_mod_fini(void)
+{
+ crypto_unregister_alg(&alg);
+}
+
+module_init(chacha20_simd_mod_init);
+module_exit(chacha20_simd_mod_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Willi <[email protected]>");
+MODULE_DESCRIPTION("chacha20 cipher algorithm, SIMD accelerated");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-simd");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b4cfc57..8f24185 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1213,6 +1213,21 @@ config CRYPTO_CHACHA20
See also:
<http://cr.yp.to/chacha/chacha-20080128.pdf>

+config CRYPTO_CHACHA20_X86_64
+ tristate "ChaCha20 cipher algorithm (x86_64/SSSE3)"
+ depends on X86 && 64BIT
+ select CRYPTO_BLKCIPHER
+ select CRYPTO_CHACHA20
+ help
+ ChaCha20 cipher algorithm, RFC7539.
+
+ ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J.
+ Bernstein and further specified in RFC7539 for use in IETF protocols.
+ This is the x86_64 assembler implementation using SIMD instructions.
+
+ See also:
+ <http://cr.yp.to/chacha/chacha-20080128.pdf>
+
config CRYPTO_SEED
tristate "SEED cipher algorithm"
select CRYPTO_ALGAPI
--
1.9.1

2015-07-07 19:37:21

by Martin Willi

[permalink] [raw]
Subject: [PATCH 05/10] crypto: chacha20 - Add an eight block AVX2 variant for x86_64

Extends the x86_64 ChaCha20 implementation by a function processing eight
ChaCha20 blocks in parallel using AVX2.
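
As in the four block SSSE3 variant, each of the sixteen state words is
broadcast into its own register; here a ymm register whose eight lanes hold
that word for eight consecutive blocks, so the double rounds need no word
shuffling and only the final XOR step transposes the result. A rough sketch
of the setup (not part of the patch):

    /* Eight-way state layout: x[i][j] holds word i of block j,
     * mirroring vpbroadcastd plus the CTRINC counter offsets. */
    static void chacha20_8block_setup(const u32 state[16], u32 x[16][8])
    {
        int i, j;

        for (i = 0; i < 16; i++)
            for (j = 0; j < 8; j++)
                x[i][j] = state[i];          /* vpbroadcastd */
        for (j = 0; j < 8; j++)
            x[12][j] += j;                   /* CTRINC: counters 0..7 */
        /* Ten double rounds then operate on each x[i][0..7] as one
         * vector; afterwards x[i][j] += state[i] (plus j for i == 12)
         * and the 16x8 word matrix is transposed before XORing with
         * the 512-byte input. */
    }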

For large messages, throughput increases by ~55-70% compared to four block
SSSE3:

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 4164293 operations in 1 seconds (66628688 bytes)
test 1 (256 bit key, 64 byte blocks): 4545912 operations in 1 seconds (290938368 bytes)
test 2 (256 bit key, 256 byte blocks): 3238241 operations in 1 seconds (828989696 bytes)
test 3 (256 bit key, 1024 byte blocks): 1120664 operations in 1 seconds (1147559936 bytes)
test 4 (256 bit key, 8192 byte blocks): 140107 operations in 1 seconds (1147756544 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 4166978 operations in 1 seconds (66671648 bytes)
test 1 (256 bit key, 64 byte blocks): 4557525 operations in 1 seconds (291681600 bytes)
test 2 (256 bit key, 256 byte blocks): 3231026 operations in 1 seconds (827142656 bytes)
test 3 (256 bit key, 1024 byte blocks): 1929946 operations in 1 seconds (1976264704 bytes)
test 4 (256 bit key, 8192 byte blocks): 218037 operations in 1 seconds (1786159104 bytes)

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 1 +
arch/x86/crypto/chacha20-avx2-x86_64.S | 443 +++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 19 ++
crypto/Kconfig | 2 +-
4 files changed, 464 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index b09e9a4..ce39b3c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -77,6 +77,7 @@ endif

ifeq ($(avx2_supported),yes)
camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o
+ chacha20-x86_64-y += chacha20-avx2-x86_64.o
serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
endif

diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S
new file mode 100644
index 0000000..16694e6
--- /dev/null
+++ b/arch/x86/crypto/chacha20-avx2-x86_64.S
@@ -0,0 +1,443 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX2 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 32
+
+ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
+ .octa 0x0e0d0c0f0a09080b0605040702010003
+ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
+ .octa 0x0d0c0f0e09080b0a0504070601000302
+CTRINC: .octa 0x00000003000000020000000100000000
+ .octa 0x00000007000000060000000500000004
+
+.text
+
+ENTRY(chacha20_8block_xor_avx2)
+ # %rdi: Input state matrix, s
+ # %rsi: 8 data blocks output, o
+ # %rdx: 8 data blocks input, i
+
+ # This function encrypts eight consecutive ChaCha20 blocks by loading
+ # the state matrix in AVX registers eight times. As we need some
+ # scratch registers, we save the first four registers on the stack. The
+ # algorithm performs each operation on the corresponding word of each
+	# state matrix, hence requires no word shuffling. For the final XORing
+ # we transpose the matrix by interleaving 32-, 64- and then 128-bit
+ # words, which allows us to do XOR in AVX registers. 8/16-bit word
+ # rotation is done with the slightly better performing byte shuffling,
+ # 7/12-bit word rotation uses traditional shift+OR.
+
+ vzeroupper
+ # 4 * 32 byte stack, 32-byte aligned
+ mov %rsp, %r8
+ and $~31, %rsp
+ sub $0x80, %rsp
+
+ # x0..15[0-7] = s[0..15]
+ vpbroadcastd 0x00(%rdi),%ymm0
+ vpbroadcastd 0x04(%rdi),%ymm1
+ vpbroadcastd 0x08(%rdi),%ymm2
+ vpbroadcastd 0x0c(%rdi),%ymm3
+ vpbroadcastd 0x10(%rdi),%ymm4
+ vpbroadcastd 0x14(%rdi),%ymm5
+ vpbroadcastd 0x18(%rdi),%ymm6
+ vpbroadcastd 0x1c(%rdi),%ymm7
+ vpbroadcastd 0x20(%rdi),%ymm8
+ vpbroadcastd 0x24(%rdi),%ymm9
+ vpbroadcastd 0x28(%rdi),%ymm10
+ vpbroadcastd 0x2c(%rdi),%ymm11
+ vpbroadcastd 0x30(%rdi),%ymm12
+ vpbroadcastd 0x34(%rdi),%ymm13
+ vpbroadcastd 0x38(%rdi),%ymm14
+ vpbroadcastd 0x3c(%rdi),%ymm15
+ # x0..3 on stack
+ vmovdqa %ymm0,0x00(%rsp)
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa %ymm2,0x40(%rsp)
+ vmovdqa %ymm3,0x60(%rsp)
+
+ vmovdqa CTRINC(%rip),%ymm1
+ vmovdqa ROT8(%rip),%ymm2
+ vmovdqa ROT16(%rip),%ymm3
+
+	# x12 += counter values 0-7
+ vpaddd %ymm1,%ymm12,%ymm12
+
+ mov $10,%ecx
+
+.Ldoubleround8:
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+ vpaddd 0x00(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm3,%ymm12,%ymm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+ vpaddd 0x20(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm3,%ymm13,%ymm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 16)
+ vpaddd 0x40(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm3,%ymm14,%ymm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 16)
+ vpaddd 0x60(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm3,%ymm15,%ymm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 12)
+ vpaddd %ymm12,%ymm8,%ymm8
+ vpxor %ymm8,%ymm4,%ymm4
+ vpslld $12,%ymm4,%ymm0
+ vpsrld $20,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 12)
+ vpaddd %ymm13,%ymm9,%ymm9
+ vpxor %ymm9,%ymm5,%ymm5
+ vpslld $12,%ymm5,%ymm0
+ vpsrld $20,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 12)
+ vpaddd %ymm14,%ymm10,%ymm10
+ vpxor %ymm10,%ymm6,%ymm6
+ vpslld $12,%ymm6,%ymm0
+ vpsrld $20,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 12)
+ vpaddd %ymm15,%ymm11,%ymm11
+ vpxor %ymm11,%ymm7,%ymm7
+ vpslld $12,%ymm7,%ymm0
+ vpsrld $20,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 8)
+ vpaddd 0x00(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm2,%ymm12,%ymm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 8)
+ vpaddd 0x20(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm2,%ymm13,%ymm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 8)
+ vpaddd 0x40(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm2,%ymm14,%ymm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 8)
+ vpaddd 0x60(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm2,%ymm15,%ymm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 7)
+ vpaddd %ymm12,%ymm8,%ymm8
+ vpxor %ymm8,%ymm4,%ymm4
+ vpslld $7,%ymm4,%ymm0
+ vpsrld $25,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 7)
+ vpaddd %ymm13,%ymm9,%ymm9
+ vpxor %ymm9,%ymm5,%ymm5
+ vpslld $7,%ymm5,%ymm0
+ vpsrld $25,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 7)
+ vpaddd %ymm14,%ymm10,%ymm10
+ vpxor %ymm10,%ymm6,%ymm6
+ vpslld $7,%ymm6,%ymm0
+ vpsrld $25,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 7)
+ vpaddd %ymm15,%ymm11,%ymm11
+ vpxor %ymm11,%ymm7,%ymm7
+ vpslld $7,%ymm7,%ymm0
+ vpsrld $25,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 16)
+ vpaddd 0x00(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm3,%ymm15,%ymm15
+	# x1 += x6, x12 = rotl32(x12 ^ x1, 16)
+ vpaddd 0x20(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm3,%ymm12,%ymm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 16)
+ vpaddd 0x40(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm3,%ymm13,%ymm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 16)
+ vpaddd 0x60(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm3,%ymm14,%ymm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 12)
+ vpaddd %ymm15,%ymm10,%ymm10
+ vpxor %ymm10,%ymm5,%ymm5
+ vpslld $12,%ymm5,%ymm0
+ vpsrld $20,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 12)
+ vpaddd %ymm12,%ymm11,%ymm11
+ vpxor %ymm11,%ymm6,%ymm6
+ vpslld $12,%ymm6,%ymm0
+ vpsrld $20,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 12)
+ vpaddd %ymm13,%ymm8,%ymm8
+ vpxor %ymm8,%ymm7,%ymm7
+ vpslld $12,%ymm7,%ymm0
+ vpsrld $20,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 12)
+ vpaddd %ymm14,%ymm9,%ymm9
+ vpxor %ymm9,%ymm4,%ymm4
+ vpslld $12,%ymm4,%ymm0
+ vpsrld $20,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 8)
+ vpaddd 0x00(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm2,%ymm15,%ymm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 8)
+ vpaddd 0x20(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm2,%ymm12,%ymm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 8)
+ vpaddd 0x40(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm2,%ymm13,%ymm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 8)
+ vpaddd 0x60(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm2,%ymm14,%ymm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 7)
+ vpaddd %ymm15,%ymm10,%ymm10
+ vpxor %ymm10,%ymm5,%ymm5
+ vpslld $7,%ymm5,%ymm0
+ vpsrld $25,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 7)
+ vpaddd %ymm12,%ymm11,%ymm11
+ vpxor %ymm11,%ymm6,%ymm6
+ vpslld $7,%ymm6,%ymm0
+ vpsrld $25,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 7)
+ vpaddd %ymm13,%ymm8,%ymm8
+ vpxor %ymm8,%ymm7,%ymm7
+ vpslld $7,%ymm7,%ymm0
+ vpsrld $25,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 7)
+ vpaddd %ymm14,%ymm9,%ymm9
+ vpxor %ymm9,%ymm4,%ymm4
+ vpslld $7,%ymm4,%ymm0
+ vpsrld $25,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+
+ dec %ecx
+ jnz .Ldoubleround8
+
+	# x0..15[0-7] += s[0..15]
+ vpbroadcastd 0x00(%rdi),%ymm0
+ vpaddd 0x00(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpbroadcastd 0x04(%rdi),%ymm0
+ vpaddd 0x20(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpbroadcastd 0x08(%rdi),%ymm0
+ vpaddd 0x40(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpbroadcastd 0x0c(%rdi),%ymm0
+ vpaddd 0x60(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpbroadcastd 0x10(%rdi),%ymm0
+ vpaddd %ymm0,%ymm4,%ymm4
+ vpbroadcastd 0x14(%rdi),%ymm0
+ vpaddd %ymm0,%ymm5,%ymm5
+ vpbroadcastd 0x18(%rdi),%ymm0
+ vpaddd %ymm0,%ymm6,%ymm6
+ vpbroadcastd 0x1c(%rdi),%ymm0
+ vpaddd %ymm0,%ymm7,%ymm7
+ vpbroadcastd 0x20(%rdi),%ymm0
+ vpaddd %ymm0,%ymm8,%ymm8
+ vpbroadcastd 0x24(%rdi),%ymm0
+ vpaddd %ymm0,%ymm9,%ymm9
+ vpbroadcastd 0x28(%rdi),%ymm0
+ vpaddd %ymm0,%ymm10,%ymm10
+ vpbroadcastd 0x2c(%rdi),%ymm0
+ vpaddd %ymm0,%ymm11,%ymm11
+ vpbroadcastd 0x30(%rdi),%ymm0
+ vpaddd %ymm0,%ymm12,%ymm12
+ vpbroadcastd 0x34(%rdi),%ymm0
+ vpaddd %ymm0,%ymm13,%ymm13
+ vpbroadcastd 0x38(%rdi),%ymm0
+ vpaddd %ymm0,%ymm14,%ymm14
+ vpbroadcastd 0x3c(%rdi),%ymm0
+ vpaddd %ymm0,%ymm15,%ymm15
+
+	# x12 += counter values 0-7
+ vpaddd %ymm1,%ymm12,%ymm12
+
+ # interleave 32-bit words in state n, n+1
+ vmovdqa 0x00(%rsp),%ymm0
+ vmovdqa 0x20(%rsp),%ymm1
+ vpunpckldq %ymm1,%ymm0,%ymm2
+ vpunpckhdq %ymm1,%ymm0,%ymm1
+ vmovdqa %ymm2,0x00(%rsp)
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa 0x40(%rsp),%ymm0
+ vmovdqa 0x60(%rsp),%ymm1
+ vpunpckldq %ymm1,%ymm0,%ymm2
+ vpunpckhdq %ymm1,%ymm0,%ymm1
+ vmovdqa %ymm2,0x40(%rsp)
+ vmovdqa %ymm1,0x60(%rsp)
+ vmovdqa %ymm4,%ymm0
+ vpunpckldq %ymm5,%ymm0,%ymm4
+ vpunpckhdq %ymm5,%ymm0,%ymm5
+ vmovdqa %ymm6,%ymm0
+ vpunpckldq %ymm7,%ymm0,%ymm6
+ vpunpckhdq %ymm7,%ymm0,%ymm7
+ vmovdqa %ymm8,%ymm0
+ vpunpckldq %ymm9,%ymm0,%ymm8
+ vpunpckhdq %ymm9,%ymm0,%ymm9
+ vmovdqa %ymm10,%ymm0
+ vpunpckldq %ymm11,%ymm0,%ymm10
+ vpunpckhdq %ymm11,%ymm0,%ymm11
+ vmovdqa %ymm12,%ymm0
+ vpunpckldq %ymm13,%ymm0,%ymm12
+ vpunpckhdq %ymm13,%ymm0,%ymm13
+ vmovdqa %ymm14,%ymm0
+ vpunpckldq %ymm15,%ymm0,%ymm14
+ vpunpckhdq %ymm15,%ymm0,%ymm15
+
+ # interleave 64-bit words in state n, n+2
+ vmovdqa 0x00(%rsp),%ymm0
+ vmovdqa 0x40(%rsp),%ymm2
+ vpunpcklqdq %ymm2,%ymm0,%ymm1
+ vpunpckhqdq %ymm2,%ymm0,%ymm2
+ vmovdqa %ymm1,0x00(%rsp)
+ vmovdqa %ymm2,0x40(%rsp)
+ vmovdqa 0x20(%rsp),%ymm0
+ vmovdqa 0x60(%rsp),%ymm2
+ vpunpcklqdq %ymm2,%ymm0,%ymm1
+ vpunpckhqdq %ymm2,%ymm0,%ymm2
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa %ymm2,0x60(%rsp)
+ vmovdqa %ymm4,%ymm0
+ vpunpcklqdq %ymm6,%ymm0,%ymm4
+ vpunpckhqdq %ymm6,%ymm0,%ymm6
+ vmovdqa %ymm5,%ymm0
+ vpunpcklqdq %ymm7,%ymm0,%ymm5
+ vpunpckhqdq %ymm7,%ymm0,%ymm7
+ vmovdqa %ymm8,%ymm0
+ vpunpcklqdq %ymm10,%ymm0,%ymm8
+ vpunpckhqdq %ymm10,%ymm0,%ymm10
+ vmovdqa %ymm9,%ymm0
+ vpunpcklqdq %ymm11,%ymm0,%ymm9
+ vpunpckhqdq %ymm11,%ymm0,%ymm11
+ vmovdqa %ymm12,%ymm0
+ vpunpcklqdq %ymm14,%ymm0,%ymm12
+ vpunpckhqdq %ymm14,%ymm0,%ymm14
+ vmovdqa %ymm13,%ymm0
+ vpunpcklqdq %ymm15,%ymm0,%ymm13
+ vpunpckhqdq %ymm15,%ymm0,%ymm15
+
+ # interleave 128-bit words in state n, n+4
+ vmovdqa 0x00(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm4,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm4,%ymm0,%ymm4
+ vmovdqa %ymm1,0x00(%rsp)
+ vmovdqa 0x20(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm5,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm5,%ymm0,%ymm5
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa 0x40(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm6,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm6,%ymm0,%ymm6
+ vmovdqa %ymm1,0x40(%rsp)
+ vmovdqa 0x60(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm7,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm7,%ymm0,%ymm7
+ vmovdqa %ymm1,0x60(%rsp)
+ vperm2i128 $0x20,%ymm12,%ymm8,%ymm0
+ vperm2i128 $0x31,%ymm12,%ymm8,%ymm12
+ vmovdqa %ymm0,%ymm8
+ vperm2i128 $0x20,%ymm13,%ymm9,%ymm0
+ vperm2i128 $0x31,%ymm13,%ymm9,%ymm13
+ vmovdqa %ymm0,%ymm9
+ vperm2i128 $0x20,%ymm14,%ymm10,%ymm0
+ vperm2i128 $0x31,%ymm14,%ymm10,%ymm14
+ vmovdqa %ymm0,%ymm10
+ vperm2i128 $0x20,%ymm15,%ymm11,%ymm0
+ vperm2i128 $0x31,%ymm15,%ymm11,%ymm15
+ vmovdqa %ymm0,%ymm11
+
+ # xor with corresponding input, write to output
+ vmovdqa 0x00(%rsp),%ymm0
+ vpxor 0x0000(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x0000(%rsi)
+ vmovdqa 0x20(%rsp),%ymm0
+ vpxor 0x0080(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x0080(%rsi)
+ vmovdqa 0x40(%rsp),%ymm0
+ vpxor 0x0040(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x0040(%rsi)
+ vmovdqa 0x60(%rsp),%ymm0
+ vpxor 0x00c0(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x00c0(%rsi)
+ vpxor 0x0100(%rdx),%ymm4,%ymm4
+ vmovdqu %ymm4,0x0100(%rsi)
+ vpxor 0x0180(%rdx),%ymm5,%ymm5
+	vmovdqu		%ymm5,0x0180(%rsi)
+ vpxor 0x0140(%rdx),%ymm6,%ymm6
+ vmovdqu %ymm6,0x0140(%rsi)
+ vpxor 0x01c0(%rdx),%ymm7,%ymm7
+ vmovdqu %ymm7,0x01c0(%rsi)
+ vpxor 0x0020(%rdx),%ymm8,%ymm8
+ vmovdqu %ymm8,0x0020(%rsi)
+ vpxor 0x00a0(%rdx),%ymm9,%ymm9
+ vmovdqu %ymm9,0x00a0(%rsi)
+ vpxor 0x0060(%rdx),%ymm10,%ymm10
+ vmovdqu %ymm10,0x0060(%rsi)
+ vpxor 0x00e0(%rdx),%ymm11,%ymm11
+ vmovdqu %ymm11,0x00e0(%rsi)
+ vpxor 0x0120(%rdx),%ymm12,%ymm12
+ vmovdqu %ymm12,0x0120(%rsi)
+ vpxor 0x01a0(%rdx),%ymm13,%ymm13
+ vmovdqu %ymm13,0x01a0(%rsi)
+ vpxor 0x0160(%rdx),%ymm14,%ymm14
+ vmovdqu %ymm14,0x0160(%rsi)
+ vpxor 0x01e0(%rdx),%ymm15,%ymm15
+ vmovdqu %ymm15,0x01e0(%rsi)
+
+ vzeroupper
+ mov %r8,%rsp
+ ret
+ENDPROC(chacha20_8block_xor_avx2)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 4d677c3..effe216 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -21,12 +21,27 @@

asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
+#ifdef CONFIG_AS_AVX2
+asmlinkage void chacha20_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src);
+static bool chacha20_use_avx2;
+#endif

static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
u8 buf[CHACHA20_BLOCK_SIZE];

+#ifdef CONFIG_AS_AVX2
+ if (chacha20_use_avx2) {
+ while (bytes >= CHACHA20_BLOCK_SIZE * 8) {
+ chacha20_8block_xor_avx2(state, dst, src);
+ bytes -= CHACHA20_BLOCK_SIZE * 8;
+ src += CHACHA20_BLOCK_SIZE * 8;
+ dst += CHACHA20_BLOCK_SIZE * 8;
+ state[12] += 8;
+ }
+ }
+#endif
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
chacha20_4block_xor_ssse3(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE * 4;
@@ -113,6 +128,10 @@ static int __init chacha20_simd_mod_init(void)
if (!cpu_has_ssse3)
return -ENODEV;

+#ifdef CONFIG_AS_AVX2
+ chacha20_use_avx2 = cpu_has_avx && cpu_has_avx2 &&
+ cpu_has_xfeatures(XSTATE_SSE | XSTATE_YMM, NULL);
+#endif
return crypto_register_alg(&alg);
}

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 8f24185..82caab0 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1214,7 +1214,7 @@ config CRYPTO_CHACHA20
<http://cr.yp.to/chacha/chacha-20080128.pdf>

config CRYPTO_CHACHA20_X86_64
- tristate "ChaCha20 cipher algorithm (x86_64/SSSE3)"
+ tristate "ChaCha20 cipher algorithm (x86_64/SSSE3/AVX2)"
depends on X86 && 64BIT
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
--
1.9.1

2015-07-07 19:37:24

by Martin Willi

[permalink] [raw]
Subject: [PATCH 10/10] crypto: poly1305 - Add a four block AVX2 variant for x86_64

Extends the x86_64 Poly1305 authenticator by a function processing four
consecutive Poly1305 blocks in parallel using AVX2 instructions.
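
As in the two block SSE2 variant, the unrolling relies on powers of the
key: applying h = (h + m) * r four times is equivalent to
h = (h + m1) * r^4 + m2 * r^3 + m3 * r^2 + m4 * r (mod 2^130 - 5), so the
glue code derives and caches r^2, r^3 and r^4 on first use. Roughly (a
sketch with separate arrays; the real code keeps them contiguous in
sctx->u, and r is the 5x26-bit key from the base context):

    u32 u[5], w[5], y[5];

    memcpy(u, r, sizeof(u));        /* u = r   */
    poly1305_simd_mult(u, r);       /* u = r^2 */
    memcpy(w, u, sizeof(w));
    poly1305_simd_mult(w, r);       /* w = r^3 */
    memcpy(y, w, sizeof(y));
    poly1305_simd_mult(y, r);       /* y = r^4 */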

For large messages, throughput increases by ~15-45% compared to two
block SSE2:

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3623907 opers/sec, 347895072 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5915244 opers/sec, 567863424 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9489525 opers/sec, 910994400 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1369657 opers/sec, 394461216 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2016000 opers/sec, 580608000 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3716146 opers/sec, 1070250048 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 561913 opers/sec, 593380128 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1635554 opers/sec, 1727145024 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 286540 opers/sec, 596003200 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 935339 opers/sec, 1945505120 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 491810 opers/sec, 2030191680 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 252857 opers/sec, 2079495968 bytes/sec

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3572538 opers/sec, 342963648 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5868291 opers/sec, 563355936 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9448599 opers/sec, 907065504 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1352271 opers/sec, 389454048 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 1993268 opers/sec, 574061184 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3693984 opers/sec, 1063867392 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 555095 opers/sec, 586180320 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1909861 opers/sec, 2016813216 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 282260 opers/sec, 587100800 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 1221476 opers/sec, 2540670080 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 677578 opers/sec, 2797041984 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 364094 opers/sec, 2994309056 bytes/sec

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 1 +
arch/x86/crypto/poly1305-avx2-x86_64.S | 386 +++++++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 40 ++++
crypto/Kconfig | 2 +-
4 files changed, 428 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/crypto/poly1305-avx2-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 5cf405c..9a2838c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -89,6 +89,7 @@ sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
poly1305-x86_64-y := poly1305-sse2-x86_64.o poly1305_glue.o
ifeq ($(avx2_supported),yes)
sha1-ssse3-y += sha1_avx2_x86_64_asm.o
+poly1305-x86_64-y += poly1305-avx2-x86_64.o
endif
crc32c-intel-y := crc32c-intel_glue.o
crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o
diff --git a/arch/x86/crypto/poly1305-avx2-x86_64.S b/arch/x86/crypto/poly1305-avx2-x86_64.S
new file mode 100644
index 0000000..eff2f41
--- /dev/null
+++ b/arch/x86/crypto/poly1305-avx2-x86_64.S
@@ -0,0 +1,386 @@
+/*
+ * Poly1305 authenticator algorithm, RFC7539, x64 AVX2 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 32
+
+ANMASK: .octa 0x0000000003ffffff0000000003ffffff
+ .octa 0x0000000003ffffff0000000003ffffff
+ORMASK: .octa 0x00000000010000000000000001000000
+ .octa 0x00000000010000000000000001000000
+
+.text
+
+#define h0 0x00(%rdi)
+#define h1 0x04(%rdi)
+#define h2 0x08(%rdi)
+#define h3 0x0c(%rdi)
+#define h4 0x10(%rdi)
+#define r0 0x00(%rdx)
+#define r1 0x04(%rdx)
+#define r2 0x08(%rdx)
+#define r3 0x0c(%rdx)
+#define r4 0x10(%rdx)
+#define u0 0x00(%r8)
+#define u1 0x04(%r8)
+#define u2 0x08(%r8)
+#define u3 0x0c(%r8)
+#define u4 0x10(%r8)
+#define w0 0x14(%r8)
+#define w1 0x18(%r8)
+#define w2 0x1c(%r8)
+#define w3 0x20(%r8)
+#define w4 0x24(%r8)
+#define y0 0x28(%r8)
+#define y1 0x2c(%r8)
+#define y2 0x30(%r8)
+#define y3 0x34(%r8)
+#define y4 0x38(%r8)
+#define m %rsi
+#define hc0 %ymm0
+#define hc1 %ymm1
+#define hc2 %ymm2
+#define hc3 %ymm3
+#define hc4 %ymm4
+#define hc0x %xmm0
+#define hc1x %xmm1
+#define hc2x %xmm2
+#define hc3x %xmm3
+#define hc4x %xmm4
+#define t1 %ymm5
+#define t2 %ymm6
+#define t1x %xmm5
+#define t2x %xmm6
+#define ruwy0 %ymm7
+#define ruwy1 %ymm8
+#define ruwy2 %ymm9
+#define ruwy3 %ymm10
+#define ruwy4 %ymm11
+#define ruwy0x %xmm7
+#define ruwy1x %xmm8
+#define ruwy2x %xmm9
+#define ruwy3x %xmm10
+#define ruwy4x %xmm11
+#define svxz1 %ymm12
+#define svxz2 %ymm13
+#define svxz3 %ymm14
+#define svxz4 %ymm15
+#define d0 %r9
+#define d1 %r10
+#define d2 %r11
+#define d3 %r12
+#define d4 %r13
+
+ENTRY(poly1305_4block_avx2)
+ # %rdi: Accumulator h[5]
+ # %rsi: 64 byte input block m
+ # %rdx: Poly1305 key r[5]
+ # %rcx: Quadblock count
+	# %r8: Poly1305 derived key r^2 u[5], r^3 w[5], r^4 y[5]
+
+ # This four-block variant uses loop unrolled block processing. It
+ # requires 4 Poly1305 keys: r, r^2, r^3 and r^4:
+ # h = (h + m) * r => h = (h + m1) * r^4 + m2 * r^3 + m3 * r^2 + m4 * r
+
+ vzeroupper
+ push %rbx
+ push %r12
+ push %r13
+
+ # combine r0,u0,w0,y0
+ vmovd y0,ruwy0x
+ vmovd w0,t1x
+ vpunpcklqdq t1,ruwy0,ruwy0
+ vmovd u0,t1x
+ vmovd r0,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy0,ruwy0
+
+ # combine r1,u1,w1,y1 and s1=r1*5,v1=u1*5,x1=w1*5,z1=y1*5
+ vmovd y1,ruwy1x
+ vmovd w1,t1x
+ vpunpcklqdq t1,ruwy1,ruwy1
+ vmovd u1,t1x
+ vmovd r1,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy1,ruwy1
+ vpslld $2,ruwy1,svxz1
+ vpaddd ruwy1,svxz1,svxz1
+
+ # combine r2,u2,w2,y2 and s2=r2*5,v2=u2*5,x2=w2*5,z2=y2*5
+ vmovd y2,ruwy2x
+ vmovd w2,t1x
+ vpunpcklqdq t1,ruwy2,ruwy2
+ vmovd u2,t1x
+ vmovd r2,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy2,ruwy2
+ vpslld $2,ruwy2,svxz2
+ vpaddd ruwy2,svxz2,svxz2
+
+ # combine r3,u3,w3,y3 and s3=r3*5,v3=u3*5,x3=w3*5,z3=y3*5
+ vmovd y3,ruwy3x
+ vmovd w3,t1x
+ vpunpcklqdq t1,ruwy3,ruwy3
+ vmovd u3,t1x
+ vmovd r3,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy3,ruwy3
+ vpslld $2,ruwy3,svxz3
+ vpaddd ruwy3,svxz3,svxz3
+
+ # combine r4,u4,w4,y4 and s4=r4*5,v4=u4*5,x4=w4*5,z4=y4*5
+ vmovd y4,ruwy4x
+ vmovd w4,t1x
+ vpunpcklqdq t1,ruwy4,ruwy4
+ vmovd u4,t1x
+ vmovd r4,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy4,ruwy4
+ vpslld $2,ruwy4,svxz4
+ vpaddd ruwy4,svxz4,svxz4
+
+.Ldoblock4:
+ # hc0 = [m[48-51] & 0x3ffffff, m[32-35] & 0x3ffffff,
+ # m[16-19] & 0x3ffffff, m[ 0- 3] & 0x3ffffff + h0]
+ vmovd 0x00(m),hc0x
+ vmovd 0x10(m),t1x
+ vpunpcklqdq t1,hc0,hc0
+ vmovd 0x20(m),t1x
+ vmovd 0x30(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc0,hc0
+ vpand ANMASK(%rip),hc0,hc0
+ vmovd h0,t1x
+ vpaddd t1,hc0,hc0
+ # hc1 = [(m[51-54] >> 2) & 0x3ffffff, (m[35-38] >> 2) & 0x3ffffff,
+ # (m[19-22] >> 2) & 0x3ffffff, (m[ 3- 6] >> 2) & 0x3ffffff + h1]
+ vmovd 0x03(m),hc1x
+ vmovd 0x13(m),t1x
+ vpunpcklqdq t1,hc1,hc1
+ vmovd 0x23(m),t1x
+ vmovd 0x33(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc1,hc1
+ vpsrld $2,hc1,hc1
+ vpand ANMASK(%rip),hc1,hc1
+ vmovd h1,t1x
+ vpaddd t1,hc1,hc1
+ # hc2 = [(m[54-57] >> 4) & 0x3ffffff, (m[38-41] >> 4) & 0x3ffffff,
+ # (m[22-25] >> 4) & 0x3ffffff, (m[ 6- 9] >> 4) & 0x3ffffff + h2]
+ vmovd 0x06(m),hc2x
+ vmovd 0x16(m),t1x
+ vpunpcklqdq t1,hc2,hc2
+ vmovd 0x26(m),t1x
+ vmovd 0x36(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc2,hc2
+ vpsrld $4,hc2,hc2
+ vpand ANMASK(%rip),hc2,hc2
+ vmovd h2,t1x
+ vpaddd t1,hc2,hc2
+ # hc3 = [(m[57-60] >> 6) & 0x3ffffff, (m[41-44] >> 6) & 0x3ffffff,
+ # (m[25-28] >> 6) & 0x3ffffff, (m[ 9-12] >> 6) & 0x3ffffff + h3]
+ vmovd 0x09(m),hc3x
+ vmovd 0x19(m),t1x
+ vpunpcklqdq t1,hc3,hc3
+ vmovd 0x29(m),t1x
+ vmovd 0x39(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc3,hc3
+ vpsrld $6,hc3,hc3
+ vpand ANMASK(%rip),hc3,hc3
+ vmovd h3,t1x
+ vpaddd t1,hc3,hc3
+ # hc4 = [(m[60-63] >> 8) | (1<<24), (m[44-47] >> 8) | (1<<24),
+ # (m[28-31] >> 8) | (1<<24), (m[12-15] >> 8) | (1<<24) + h4]
+ vmovd 0x0c(m),hc4x
+ vmovd 0x1c(m),t1x
+ vpunpcklqdq t1,hc4,hc4
+ vmovd 0x2c(m),t1x
+ vmovd 0x3c(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc4,hc4
+ vpsrld $8,hc4,hc4
+ vpor ORMASK(%rip),hc4,hc4
+ vmovd h4,t1x
+ vpaddd t1,hc4,hc4
+
+ # t1 = [ hc0[3] * r0, hc0[2] * u0, hc0[1] * w0, hc0[0] * y0 ]
+ vpmuludq hc0,ruwy0,t1
+ # t1 += [ hc1[3] * s4, hc1[2] * v4, hc1[1] * x4, hc1[0] * z4 ]
+ vpmuludq hc1,svxz4,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * s3, hc2[2] * v3, hc2[1] * x3, hc2[0] * z3 ]
+ vpmuludq hc2,svxz3,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * s2, hc3[2] * v2, hc3[1] * x2, hc3[0] * z2 ]
+ vpmuludq hc3,svxz2,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s1, hc4[2] * v1, hc4[1] * x1, hc4[0] * z1 ]
+ vpmuludq hc4,svxz1,t2
+ vpaddq t2,t1,t1
+	# d0 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d0
+
+ # t1 = [ hc0[3] * r1, hc0[2] * u1,hc0[1] * w1, hc0[0] * y1 ]
+ vpmuludq hc0,ruwy1,t1
+ # t1 += [ hc1[3] * r0, hc1[2] * u0, hc1[1] * w0, hc1[0] * y0 ]
+ vpmuludq hc1,ruwy0,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * s4, hc2[2] * v4, hc2[1] * x4, hc2[0] * z4 ]
+ vpmuludq hc2,svxz4,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * s3, hc3[2] * v3, hc3[1] * x3, hc3[0] * z3 ]
+ vpmuludq hc3,svxz3,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s2, hc4[2] * v2, hc4[1] * x2, hc4[0] * z2 ]
+ vpmuludq hc4,svxz2,t2
+ vpaddq t2,t1,t1
+	# d1 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d1
+
+ # t1 = [ hc0[3] * r2, hc0[2] * u2, hc0[1] * w2, hc0[0] * y2 ]
+ vpmuludq hc0,ruwy2,t1
+ # t1 += [ hc1[3] * r1, hc1[2] * u1, hc1[1] * w1, hc1[0] * y1 ]
+ vpmuludq hc1,ruwy1,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * r0, hc2[2] * u0, hc2[1] * w0, hc2[0] * y0 ]
+ vpmuludq hc2,ruwy0,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * s4, hc3[2] * v4, hc3[1] * x4, hc3[0] * z4 ]
+ vpmuludq hc3,svxz4,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s3, hc4[2] * v3, hc4[1] * x3, hc4[0] * z3 ]
+ vpmuludq hc4,svxz3,t2
+ vpaddq t2,t1,t1
+ # d2 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d2
+
+ # t1 = [ hc0[3] * r3, hc0[2] * u3, hc0[1] * w3, hc0[0] * y3 ]
+ vpmuludq hc0,ruwy3,t1
+ # t1 += [ hc1[3] * r2, hc1[2] * u2, hc1[1] * w2, hc1[0] * y2 ]
+ vpmuludq hc1,ruwy2,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * r1, hc2[2] * u1, hc2[1] * w1, hc2[0] * y1 ]
+ vpmuludq hc2,ruwy1,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * r0, hc3[2] * u0, hc3[1] * w0, hc3[0] * y0 ]
+ vpmuludq hc3,ruwy0,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s4, hc4[2] * v4, hc4[1] * x4, hc4[0] * z4 ]
+ vpmuludq hc4,svxz4,t2
+ vpaddq t2,t1,t1
+ # d3 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d3
+
+ # t1 = [ hc0[3] * r4, hc0[2] * u4, hc0[1] * w4, hc0[0] * y4 ]
+ vpmuludq hc0,ruwy4,t1
+ # t1 += [ hc1[3] * r3, hc1[2] * u3, hc1[1] * w3, hc1[0] * y3 ]
+ vpmuludq hc1,ruwy3,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * r2, hc2[2] * u2, hc2[1] * w2, hc2[0] * y2 ]
+ vpmuludq hc2,ruwy2,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * r1, hc3[2] * u1, hc3[1] * w1, hc3[0] * y1 ]
+ vpmuludq hc3,ruwy1,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * r0, hc4[2] * u0, hc4[1] * w0, hc4[0] * y0 ]
+ vpmuludq hc4,ruwy0,t2
+ vpaddq t2,t1,t1
+ # d4 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d4
+
+ # d1 += d0 >> 26
+ mov d0,%rax
+ shr $26,%rax
+ add %rax,d1
+ # h0 = d0 & 0x3ffffff
+ mov d0,%rbx
+ and $0x3ffffff,%ebx
+
+ # d2 += d1 >> 26
+ mov d1,%rax
+ shr $26,%rax
+ add %rax,d2
+ # h1 = d1 & 0x3ffffff
+ mov d1,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h1
+
+ # d3 += d2 >> 26
+ mov d2,%rax
+ shr $26,%rax
+ add %rax,d3
+ # h2 = d2 & 0x3ffffff
+ mov d2,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h2
+
+ # d4 += d3 >> 26
+ mov d3,%rax
+ shr $26,%rax
+ add %rax,d4
+ # h3 = d3 & 0x3ffffff
+ mov d3,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h3
+
+ # h0 += (d4 >> 26) * 5
+ mov d4,%rax
+ shr $26,%rax
+ lea (%eax,%eax,4),%eax
+ add %eax,%ebx
+ # h4 = d4 & 0x3ffffff
+ mov d4,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h4
+
+ # h1 += h0 >> 26
+ mov %ebx,%eax
+ shr $26,%eax
+ add %eax,h1
+ # h0 = h0 & 0x3ffffff
+ andl $0x3ffffff,%ebx
+ mov %ebx,h0
+
+ add $0x40,m
+ dec %rcx
+ jnz .Ldoblock4
+
+ vzeroupper
+ pop %r13
+ pop %r12
+ pop %rbx
+ ret
+ENDPROC(poly1305_4block_avx2)
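
The multiply-and-reduce above is the usual radix-2^26 Poly1305
arithmetic: each of the four lanes multiplies its message limbs (with
the accumulator h folded into lane 0) by one of the key powers -- r, u,
w, y, where u is r^2 and w, y are presumably r^3 and r^4 -- while the
svxz tables apparently hold the same values pre-multiplied by 5 for the
2^130 == 5 (mod 2^130 - 5) wrap-around; the per-lane products are then
summed horizontally into d0..d4 and carried exactly as in the scalar
tail of the loop. A single-lane C sketch of that multiply/carry step
(editorial names, not code from the patch):

#include <stdint.h>

/* h = h * r (mod 2^130 - 5), both in five 26-bit limbs */
static void poly1305_mul_mod(uint32_t h[5], const uint32_t r[5])
{
	uint32_t s[5];	/* s[k] = 5 * r[k], the wrap-around factors */
	uint64_t d[5];
	int i;

	for (i = 1; i < 5; i++)
		s[i] = r[i] * 5;

	/* schoolbook products; terms past 2^130 fold back in via s[] */
	d[0] = (uint64_t)h[0] * r[0] + (uint64_t)h[1] * s[4] +
	       (uint64_t)h[2] * s[3] + (uint64_t)h[3] * s[2] +
	       (uint64_t)h[4] * s[1];
	d[1] = (uint64_t)h[0] * r[1] + (uint64_t)h[1] * r[0] +
	       (uint64_t)h[2] * s[4] + (uint64_t)h[3] * s[3] +
	       (uint64_t)h[4] * s[2];
	d[2] = (uint64_t)h[0] * r[2] + (uint64_t)h[1] * r[1] +
	       (uint64_t)h[2] * r[0] + (uint64_t)h[3] * s[4] +
	       (uint64_t)h[4] * s[3];
	d[3] = (uint64_t)h[0] * r[3] + (uint64_t)h[1] * r[2] +
	       (uint64_t)h[2] * r[1] + (uint64_t)h[3] * r[0] +
	       (uint64_t)h[4] * s[4];
	d[4] = (uint64_t)h[0] * r[4] + (uint64_t)h[1] * r[3] +
	       (uint64_t)h[2] * r[2] + (uint64_t)h[3] * r[1] +
	       (uint64_t)h[4] * r[0];

	/* carry propagation, mirroring the scalar d0..d4 code above */
	d[1] += d[0] >> 26;  h[0] = d[0] & 0x3ffffff;
	d[2] += d[1] >> 26;  h[1] = d[1] & 0x3ffffff;
	d[3] += d[2] >> 26;  h[2] = d[2] & 0x3ffffff;
	d[4] += d[3] >> 26;  h[3] = d[3] & 0x3ffffff;
	h[0] += (uint32_t)(d[4] >> 26) * 5;	/* 2^130 folds back as *5 */
	h[4]  = d[4] & 0x3ffffff;
	h[1] += h[0] >> 26;
	h[0] &= 0x3ffffff;
}
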
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index b7c33d0..f7170d7 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -22,20 +22,33 @@ struct poly1305_simd_desc_ctx {
struct poly1305_desc_ctx base;
/* derived key u set? */
bool uset;
+#ifdef CONFIG_AS_AVX2
+ /* derived keys r^3, r^4 set? */
+ bool wset;
+#endif
/* derived Poly1305 key r^2 */
u32 u[5];
+ /* ... silently appended r^3 and r^4 when using AVX2 */
};

asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
const u32 *r, unsigned int blocks);
asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r,
unsigned int blocks, const u32 *u);
+#ifdef CONFIG_AS_AVX2
+asmlinkage void poly1305_4block_avx2(u32 *h, const u8 *src, const u32 *r,
+ unsigned int blocks, const u32 *u);
+static bool poly1305_use_avx2;
+#endif

static int poly1305_simd_init(struct shash_desc *desc)
{
struct poly1305_simd_desc_ctx *sctx = shash_desc_ctx(desc);

sctx->uset = false;
+#ifdef CONFIG_AS_AVX2
+ sctx->wset = false;
+#endif

return crypto_poly1305_init(desc);
}
@@ -66,6 +79,26 @@ static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
srclen = datalen;
}

+#ifdef CONFIG_AS_AVX2
+ if (poly1305_use_avx2 && srclen >= POLY1305_BLOCK_SIZE * 4) {
+ if (unlikely(!sctx->wset)) {
+ if (!sctx->uset) {
+ memcpy(sctx->u, dctx->r, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u, dctx->r);
+ sctx->uset = true;
+ }
+ memcpy(sctx->u + 5, sctx->u, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u + 5, dctx->r);
+ memcpy(sctx->u + 10, sctx->u + 5, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u + 10, dctx->r);
+ sctx->wset = true;
+ }
+ blocks = srclen / (POLY1305_BLOCK_SIZE * 4);
+ poly1305_4block_avx2(dctx->h, src, dctx->r, blocks, sctx->u);
+ src += POLY1305_BLOCK_SIZE * 4 * blocks;
+ srclen -= POLY1305_BLOCK_SIZE * 4 * blocks;
+ }
+#endif
if (likely(srclen >= POLY1305_BLOCK_SIZE * 2)) {
if (unlikely(!sctx->uset)) {
memcpy(sctx->u, dctx->r, sizeof(sctx->u));
@@ -149,6 +182,13 @@ static int __init poly1305_simd_mod_init(void)
if (!cpu_has_xmm2)
return -ENODEV;

+#ifdef CONFIG_AS_AVX2
+ poly1305_use_avx2 = cpu_has_avx && cpu_has_avx2 &&
+ cpu_has_xfeatures(XSTATE_SSE | XSTATE_YMM, NULL);
+ alg.descsize = sizeof(struct poly1305_simd_desc_ctx);
+ if (poly1305_use_avx2)
+ alg.descsize += 10 * sizeof(u32);
+#endif
return crypto_register_shash(&alg);
}
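
In terms of the single-lane sketch after the assembly, the lazy
derivation in poly1305_simd_blocks() above just keeps multiplying the
clamped r back in, so the 4-block path can hand the assembly one
contiguous table of powers. Roughly (reusing the editorial
poly1305_mul_mod() from that sketch in place of the patch's
poly1305_simd_mult(); add <string.h> for memcpy):

	/* u[0..4] = r^2, u[5..9] = r^3, u[10..14] = r^4,
	 * matching the layout of sctx->u with AVX2 enabled */
	uint32_t u[15];

	memcpy(u, r, 5 * sizeof(u[0]));
	poly1305_mul_mod(u, r);			/* r^2 = r * r   */
	memcpy(u + 5, u, 5 * sizeof(u[0]));
	poly1305_mul_mod(u + 5, r);		/* r^3 = r^2 * r */
	memcpy(u + 10, u + 5, 5 * sizeof(u[0]));
	poly1305_mul_mod(u + 10, r);		/* r^4 = r^3 * r */
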

diff --git a/crypto/Kconfig b/crypto/Kconfig
index c57478c..354bb69 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -471,7 +471,7 @@ config CRYPTO_POLY1305
in IETF protocols. This is the portable C implementation of Poly1305.

config CRYPTO_POLY1305_X86_64
- tristate "Poly1305 authenticator algorithm (x86_64/SSE2)"
+ tristate "Poly1305 authenticator algorithm (x86_64/SSE2/AVX2)"
depends on X86 && 64BIT
select CRYPTO_POLY1305
help
--
1.9.1

2015-07-07 22:13:05

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

On Tue, Jul 07, 2015 at 09:36:46PM +0200, Martin Willi wrote:
>
> poly1305-generic:
> testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-generic,poly1305-generic)) encryption
> test 0 (288 bit key, 16 byte blocks): 902007 operations in 1 seconds (14432112 bytes)
> test 1 (288 bit key, 64 byte blocks): 945302 operations in 1 seconds (60499328 bytes)
> test 2 (288 bit key, 256 byte blocks): 559910 operations in 1 seconds (143336960 bytes)
> test 3 (288 bit key, 512 byte blocks): 365334 operations in 1 seconds (187051008 bytes)
> test 4 (288 bit key, 1024 byte blocks): 213663 operations in 1 seconds (218790912 bytes)
> test 5 (288 bit key, 2048 byte blocks): 117263 operations in 1 seconds (240154624 bytes)
> test 6 (288 bit key, 4096 byte blocks): 61915 operations in 1 seconds (253603840 bytes)
> test 7 (288 bit key, 8192 byte blocks): 31662 operations in 1 seconds (259375104 bytes)

Running the speed test with sec=1 makes no sense because it's
too short. Please use sec=0 and count cycles instead.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-07-08 20:36:26

by Martin Willi

[permalink] [raw]
Subject: Re: [PATCH 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

Herbert,

> Running the speed test with sec=1 makes no sense because it's
> too short. Please use sec=0 and count cycles instead.

I get less consistent numbers between different runs when using sec=0,
hence I've used sec=1. Below are the cycle numbers of "average" runs
for the AEAD; I'll use cycles in the individual patch notes in a v2.

Kind regards
Martin

--

generic:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-generic,poly1305-generic)) encryption
test 0 (288 bit key, 16 byte blocks): 1 operation in 9444 cycles (16 bytes)
test 1 (288 bit key, 64 byte blocks): 1 operation in 10692 cycles (64 bytes)
test 2 (288 bit key, 256 byte blocks): 1 operation in 18299 cycles (256 bytes)
test 3 (288 bit key, 512 byte blocks): 1 operation in 26952 cycles (512 bytes)
test 4 (288 bit key, 1024 byte blocks): 1 operation in 48493 cycles (1024 bytes)
test 5 (288 bit key, 2048 byte blocks): 1 operation in 83766 cycles (2048 bytes)
test 6 (288 bit key, 4096 byte blocks): 1 operation in 150899 cycles (4096 bytes)
test 7 (288 bit key, 8192 byte blocks): 1 operation in 296779 cycles (8192 bytes)

SSE2/SSSE3:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 1 operation in 9814 cycles (16 bytes)
test 1 (288 bit key, 64 byte blocks): 1 operation in 9998 cycles (64 bytes)
test 2 (288 bit key, 256 byte blocks): 1 operation in 12442 cycles (256 bytes)
test 3 (288 bit key, 512 byte blocks): 1 operation in 20321 cycles (512 bytes)
test 4 (288 bit key, 1024 byte blocks): 1 operation in 21098 cycles (1024 bytes)
test 5 (288 bit key, 2048 byte blocks): 1 operation in 33423 cycles (2048 bytes)
test 6 (288 bit key, 4096 byte blocks): 1 operation in 55183 cycles (4096 bytes)
test 7 (288 bit key, 8192 byte blocks): 1 operation in 102514 cycles (8192 bytes)

AVX2:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 1 operation in 9883 cycles (16 bytes)
test 1 (288 bit key, 64 byte blocks): 1 operation in 10891 cycles (64 bytes)
test 2 (288 bit key, 256 byte blocks): 1 operation in 12467 cycles (256 bytes)
test 3 (288 bit key, 512 byte blocks): 1 operation in 13538 cycles (512 bytes)
test 4 (288 bit key, 1024 byte blocks): 1 operation in 16783 cycles (1024 bytes)
test 5 (288 bit key, 2048 byte blocks): 1 operation in 23161 cycles (2048 bytes)
test 6 (288 bit key, 4096 byte blocks): 1 operation in 37359 cycles (4096 bytes)
test 7 (288 bit key, 8192 byte blocks): 1 operation in 64670 cycles (8192 bytes)
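
Dividing the cycle counts by the block size gives a rough
cycles-per-byte figure for comparing the three runs; e.g. for the
8192-byte tests (a quick stand-alone helper fed with the numbers from
the tables above):

#include <stdio.h>

int main(void)
{
	/* 8192-byte results copied from the three runs above */
	const struct { const char *name; unsigned int cycles; } runs[] = {
		{ "generic",    296779 },
		{ "SSE2/SSSE3", 102514 },
		{ "AVX2",        64670 },
	};
	unsigned int i;

	for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%-10s %6.2f cycles/byte\n",
		       runs[i].name, runs[i].cycles / 8192.0);
	return 0;
}

For the runs above this works out to roughly 36, 12.5 and 8 cycles per
byte respectively.
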

2015-07-08 20:41:51

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

On Wed, Jul 08, 2015 at 10:36:23PM +0200, Martin Willi wrote:
>
> I get less constant numbers between different runs when using sec=0,
> hence I've used sec=1. Below are the numbers of "average" runs for the
> AEAD measuring cycles; I'll use cycles in the individual patch notes in
> a v2.

If you're going to use sec you need to use at least 10 in order
for it to be meaningful, as shorter values often result in bogus
numbers. Of course sec=10 takes a long time, which is why counting
cycles is the preferred method.

What sort of variance do you see with cycles? Do you get the
same variance for other algorithms, e.g., cbc/aes?

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2015-07-11 08:15:55

by Martin Willi

[permalink] [raw]
Subject: Re: [PATCH 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers


> If you're going to use sec you need to use at least 10 in order
> for it to be meaningful as shorter values often result in bogus
> numbers.

Ok, I'll use sec=10 in v2. There is no fundamental difference compared
to sec=1 (except for very short blocks):

testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 9498006 operations in 10 seconds (151968096 bytes)
test 1 (288 bit key, 64 byte blocks): 9423516 operations in 10 seconds (603105024 bytes)
test 2 (288 bit key, 256 byte blocks): 7597253 operations in 10 seconds (1944896768 bytes)
test 3 (288 bit key, 512 byte blocks): 6979753 operations in 10 seconds (3573633536 bytes)
test 4 (288 bit key, 1024 byte blocks): 5629328 operations in 10 seconds (5764431872 bytes)
test 5 (288 bit key, 2048 byte blocks): 4071284 operations in 10 seconds (8337989632 bytes)
test 6 (288 bit key, 4096 byte blocks): 2627325 operations in 10 seconds (10761523200 bytes)
test 7 (288 bit key, 8192 byte blocks): 1492531 operations in 10 seconds (12226813952 bytes)

testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 896305 operations in 1 seconds (14340880 bytes)
test 1 (288 bit key, 64 byte blocks): 929638 operations in 1 seconds (59496832 bytes)
test 2 (288 bit key, 256 byte blocks): 750673 operations in 1 seconds (192172288 bytes)
test 3 (288 bit key, 512 byte blocks): 687636 operations in 1 seconds (352069632 bytes)
test 4 (288 bit key, 1024 byte blocks): 555209 operations in 1 seconds (568534016 bytes)
test 5 (288 bit key, 2048 byte blocks): 402049 operations in 1 seconds (823396352 bytes)
test 6 (288 bit key, 4096 byte blocks): 259861 operations in 1 seconds (1064390656 bytes)
test 7 (288 bit key, 8192 byte blocks): 147283 operations in 1 seconds (1206542336 bytes)

> What sort of variance do you see with cycles?

Here are a very fast and a very slow run (these are extremes, though):

testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 1 operation in 3765 cycles (16 bytes)
test 1 (288 bit key, 64 byte blocks): 1 operation in 3823 cycles (64 bytes)
test 2 (288 bit key, 256 byte blocks): 1 operation in 4728 cycles (256 bytes)
test 3 (288 bit key, 512 byte blocks): 1 operation in 5135 cycles (512 bytes)
test 4 (288 bit key, 1024 byte blocks): 1 operation in 7026 cycles (1024 bytes)
test 5 (288 bit key, 2048 byte blocks): 1 operation in 8804 cycles (2048 bytes)
test 6 (288 bit key, 4096 byte blocks): 1 operation in 14674 cycles (4096 bytes)
test 7 (288 bit key, 8192 byte blocks): 1 operation in 24616 cycles (8192 bytes)

testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 1 operation in 15031 cycles (16 bytes)
test 1 (288 bit key, 64 byte blocks): 1 operation in 15670 cycles (64 bytes)
test 2 (288 bit key, 256 byte blocks): 1 operation in 13034 cycles (256 bytes)
test 3 (288 bit key, 512 byte blocks): 1 operation in 14045 cycles (512 bytes)
test 4 (288 bit key, 1024 byte blocks): 1 operation in 20944 cycles (1024 bytes)
test 5 (288 bit key, 2048 byte blocks): 1 operation in 26445 cycles (2048 bytes)
test 6 (288 bit key, 4096 byte blocks): 1 operation in 31912 cycles (4096 bytes)
test 7 (288 bit key, 8192 byte blocks): 1 operation in 61366 cycles (8192 bytes)

> Do you get the same variance for other algorithms, e.g., cbc/aes?

Yes, another extreme:

testing speed of cbc(aes) (cbc(aes-aesni)) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 593 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 1589 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 5311 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 20666 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 161483 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 593 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 1659 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 5609 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 21568 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 172484 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 612 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 1687 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 5836 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 22400 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 177799 cycles (8192 bytes)

testing speed of cbc(aes) (cbc(aes-aesni)) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 1130 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 3002 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 10135 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 39911 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 308130 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 1151 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 3096 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 11486 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 42155 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 328148 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 1186 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 3260 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 11909 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 44794 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 325349 cycles (8192 bytes)

I see that with all algorithms, both KVM-virtualized and on bare metal.

Regards
Martin

2015-07-13 07:01:54

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

On Sat, Jul 11, 2015 at 10:15:52AM +0200, Martin Willi wrote:
>
> > If you're going to use sec you need to use at least 10 in order
> > for it to be meaningful as shorter values often result in bogus
> > numbers.
>
> Ok, I'll use sec=10 in v2. There is no fundamental difference compared
> to sec=1 (except for very short blocks):

Thanks.

> Yes, another extreme:
>
> testing speed of cbc(aes) (cbc(aes-aesni)) decryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 593 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 1589 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 5311 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 20666 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 161483 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 593 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 1659 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 5609 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 21568 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 172484 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 612 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 1687 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 5836 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 22400 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 177799 cycles (8192 bytes)
>
> testing speed of cbc(aes) (cbc(aes-aesni)) decryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 1130 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 3002 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 10135 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 39911 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 308130 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 1151 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 3096 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 11486 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 42155 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 328148 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 1186 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 3260 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 11909 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 44794 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 325349 cycles (8192 bytes)

Weird. I don't see this variance with tcrypt mode=200 at all.
I do however see it with mode=500 but that's probably just telling
us that the async tcrypt code path is buggy.

It'd be good if we can figure out what's going on here.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt