2015-07-16 17:14:23

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

This patch series adds both ChaCha20 and Poly1305 specific ciphers for
x86_64 using SSE2/SSSE3 and AVX2 instructions. The idea is to have a drop-in
replacement for AESNI/CLMUL-accelerated AES-GCM providing at least somewhat
comparable performance, refer to RFC7539 for details. It is based on cryptodev,
including the ChaCha20/Poly1305 AEAD interface conversion patch.

The first patch adds some speed tests to tcrypt. The second patch exports
some functionality from chacha20-generic to use it as fallback. Patch 3
adds a single block SSSE3 driver for ChaCha20, while patch 4 and 5 extend it
by an optimized four block SSSE3 and an eight block AVX2 variant. Patch 6
adds an additional test vector for ChaCha20 to actually test the AVX2 eight
block variant processing 512-bytes at once.

Patch 7 exports some poly1305-generic functionality to use it as fallback.
Patch 8 introduces a single block SSE2 driver for Poly1305, while patch 9
and 10 add an optimized two block SSE2 and a four block AVX2 variant.

Overall speedup for the ChaCha20/Poly1305 AEAD for typical IPsec payloads
is ~50-150% with SSE2/SSSE3 and ~100-200% with AVX2, or even more for larger
payloads:

generic:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-generic,poly1305-generic)) encryption
test 0 (288 bit key, 16 byte blocks): 10456041 operations in 10 seconds (167296656 bytes)
test 1 (288 bit key, 64 byte blocks): 9999411 operations in 10 seconds (639962304 bytes)
test 2 (288 bit key, 256 byte blocks): 5793012 operations in 10 seconds (1483011072 bytes)
test 3 (288 bit key, 512 byte blocks): 3743676 operations in 10 seconds (1916762112 bytes)
test 4 (288 bit key, 1024 byte blocks): 2190023 operations in 10 seconds (2242583552 bytes)
test 5 (288 bit key, 2048 byte blocks): 1195864 operations in 10 seconds (2449129472 bytes)
test 6 (288 bit key, 4096 byte blocks): 627625 operations in 10 seconds (2570752000 bytes)
test 7 (288 bit key, 8192 byte blocks): 319844 operations in 10 seconds (2620162048 bytes)

SSE2/SSSE3:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 10077910 operations in 10 seconds (161246560 bytes)
test 1 (288 bit key, 64 byte blocks): 9990400 operations in 10 seconds (639385600 bytes)
test 2 (288 bit key, 256 byte blocks): 7953774 operations in 10 seconds (2036166144 bytes)
test 3 (288 bit key, 512 byte blocks): 6351059 operations in 10 seconds (3251742208 bytes)
test 4 (288 bit key, 1024 byte blocks): 4593059 operations in 10 seconds (4703292416 bytes)
test 5 (288 bit key, 2048 byte blocks): 2956300 operations in 10 seconds (6054502400 bytes)
test 6 (288 bit key, 4096 byte blocks): 1724958 operations in 10 seconds (7065427968 bytes)
test 7 (288 bit key, 8192 byte blocks): 925156 operations in 10 seconds (7578877952 bytes)

AVX2:
testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 10006774 operations in 10 seconds (160108384 bytes)
test 1 (288 bit key, 64 byte blocks): 9896498 operations in 10 seconds (633375872 bytes)
test 2 (288 bit key, 256 byte blocks): 7922198 operations in 10 seconds (2028082688 bytes)
test 3 (288 bit key, 512 byte blocks): 7261666 operations in 10 seconds (3717972992 bytes)
test 4 (288 bit key, 1024 byte blocks): 5835006 operations in 10 seconds (5975046144 bytes)
test 5 (288 bit key, 2048 byte blocks): 4172937 operations in 10 seconds (8546174976 bytes)
test 6 (288 bit key, 4096 byte blocks): 2670484 operations in 10 seconds (10938302464 bytes)
test 7 (288 bit key, 8192 byte blocks): 1504684 operations in 10 seconds (12326371328 bytes)

All benchmark results from a Core i5-4670T.

The ChaCha20/Poly1305 AEAD on Haswell with AVX2 has about half the raw
AESNI/CLMUL-accelerated AES-GCM (rfc4106-gcm-aesni) performance for typical
IPsec MTUs. On Ivy Bridge using SSE2/SSSE3 the numbers compared to AES-GCM
are very similar due to the less efficient CLMUL instructions.

Changes in v2:
- No code changes
- Use sec=10 for more reliable benchmark results

Martin Willi (10):
crypto: tcrypt - Add ChaCha20/Poly1305 speed tests
crypto: chacha20 - Export common ChaCha20 helpers
crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64
crypto: chacha20 - Add a four block SSSE3 variant for x86_64
crypto: chacha20 - Add an eight block AVX2 variant for x86_64
crypto: testmgr - Add a longer ChaCha20 test vector
crypto: poly1305 - Export common Poly1305 helpers
crypto: poly1305 - Add a SSE2 SIMD variant for x86_64
crypto: poly1305 - Add a two block SSE2 variant for x86_64
crypto: poly1305 - Add a four block AVX2 variant for x86_64

arch/x86/crypto/Makefile | 6 +
arch/x86/crypto/chacha20-avx2-x86_64.S | 443 ++++++++++++++++++++++
arch/x86/crypto/chacha20-ssse3-x86_64.S | 625 ++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 150 ++++++++
arch/x86/crypto/poly1305-avx2-x86_64.S | 386 ++++++++++++++++++++
arch/x86/crypto/poly1305-sse2-x86_64.S | 582 +++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 207 +++++++++++
crypto/Kconfig | 27 ++
crypto/chacha20_generic.c | 28 +-
crypto/chacha20poly1305.c | 7 +-
crypto/poly1305_generic.c | 73 ++--
crypto/tcrypt.c | 15 +
crypto/tcrypt.h | 20 +
crypto/testmgr.h | 334 ++++++++++++++++-
include/crypto/chacha20.h | 25 ++
include/crypto/poly1305.h | 41 +++
16 files changed, 2909 insertions(+), 60 deletions(-)
create mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S
create mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
create mode 100644 arch/x86/crypto/chacha20_glue.c
create mode 100644 arch/x86/crypto/poly1305-avx2-x86_64.S
create mode 100644 arch/x86/crypto/poly1305-sse2-x86_64.S
create mode 100644 arch/x86/crypto/poly1305_glue.c
create mode 100644 include/crypto/chacha20.h
create mode 100644 include/crypto/poly1305.h

--
1.9.1


2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 07/10] crypto: poly1305 - Export common Poly1305 helpers

As architecture specific drivers need a software fallback, export Poly1305
init/update/final functions together with some helpers in a header file.

Signed-off-by: Martin Willi <[email protected]>
---
crypto/chacha20poly1305.c | 4 +--
crypto/poly1305_generic.c | 73 +++++++++++++++++++++++------------------------
include/crypto/poly1305.h | 41 ++++++++++++++++++++++++++
3 files changed, 77 insertions(+), 41 deletions(-)
create mode 100644 include/crypto/poly1305.h

diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index 410554d..b71445f 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -14,6 +14,7 @@
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
#include <crypto/chacha20.h>
+#include <crypto/poly1305.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -21,9 +22,6 @@

#include "internal.h"

-#define POLY1305_BLOCK_SIZE 16
-#define POLY1305_DIGEST_SIZE 16
-#define POLY1305_KEY_SIZE 32
#define CHACHAPOLY_IV_SIZE 12

struct chachapoly_instance_ctx {
diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 387b5c8..2df9835d 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -13,31 +13,11 @@

#include <crypto/algapi.h>
#include <crypto/internal/hash.h>
+#include <crypto/poly1305.h>
#include <linux/crypto.h>
#include <linux/kernel.h>
#include <linux/module.h>

-#define POLY1305_BLOCK_SIZE 16
-#define POLY1305_KEY_SIZE 32
-#define POLY1305_DIGEST_SIZE 16
-
-struct poly1305_desc_ctx {
- /* key */
- u32 r[5];
- /* finalize key */
- u32 s[4];
- /* accumulator */
- u32 h[5];
- /* partial buffer */
- u8 buf[POLY1305_BLOCK_SIZE];
- /* bytes used in partial buffer */
- unsigned int buflen;
- /* r key has been set */
- bool rset;
- /* s key has been set */
- bool sset;
-};
-
static inline u64 mlt(u64 a, u64 b)
{
return a * b;
@@ -58,7 +38,7 @@ static inline u32 le32_to_cpuvp(const void *p)
return le32_to_cpup(p);
}

-static int poly1305_init(struct shash_desc *desc)
+int crypto_poly1305_init(struct shash_desc *desc)
{
struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);

@@ -69,8 +49,9 @@ static int poly1305_init(struct shash_desc *desc)

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_init);

-static int poly1305_setkey(struct crypto_shash *tfm,
+int crypto_poly1305_setkey(struct crypto_shash *tfm,
const u8 *key, unsigned int keylen)
{
/* Poly1305 requires a unique key for each tag, which implies that
@@ -79,6 +60,7 @@ static int poly1305_setkey(struct crypto_shash *tfm,
* the update() call. */
return -ENOTSUPP;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);

static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
{
@@ -98,16 +80,10 @@ static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
dctx->s[3] = le32_to_cpuvp(key + 12);
}

-static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
- const u8 *src, unsigned int srclen,
- u32 hibit)
+unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen)
{
- u32 r0, r1, r2, r3, r4;
- u32 s1, s2, s3, s4;
- u32 h0, h1, h2, h3, h4;
- u64 d0, d1, d2, d3, d4;
-
- if (unlikely(!dctx->sset)) {
+ if (!dctx->sset) {
if (!dctx->rset && srclen >= POLY1305_BLOCK_SIZE) {
poly1305_setrkey(dctx, src);
src += POLY1305_BLOCK_SIZE;
@@ -121,6 +97,25 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
dctx->sset = true;
}
}
+ return srclen;
+}
+EXPORT_SYMBOL_GPL(crypto_poly1305_setdesckey);
+
+static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen,
+ u32 hibit)
+{
+ u32 r0, r1, r2, r3, r4;
+ u32 s1, s2, s3, s4;
+ u32 h0, h1, h2, h3, h4;
+ u64 d0, d1, d2, d3, d4;
+ unsigned int datalen;
+
+ if (unlikely(!dctx->sset)) {
+ datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
+ src += srclen - datalen;
+ srclen = datalen;
+ }

r0 = dctx->r[0];
r1 = dctx->r[1];
@@ -181,7 +176,7 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
return srclen;
}

-static int poly1305_update(struct shash_desc *desc,
+int crypto_poly1305_update(struct shash_desc *desc,
const u8 *src, unsigned int srclen)
{
struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
@@ -214,8 +209,9 @@ static int poly1305_update(struct shash_desc *desc,

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_update);

-static int poly1305_final(struct shash_desc *desc, u8 *dst)
+int crypto_poly1305_final(struct shash_desc *desc, u8 *dst)
{
struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
__le32 *mac = (__le32 *)dst;
@@ -282,13 +278,14 @@ static int poly1305_final(struct shash_desc *desc, u8 *dst)

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_poly1305_final);

static struct shash_alg poly1305_alg = {
.digestsize = POLY1305_DIGEST_SIZE,
- .init = poly1305_init,
- .update = poly1305_update,
- .final = poly1305_final,
- .setkey = poly1305_setkey,
+ .init = crypto_poly1305_init,
+ .update = crypto_poly1305_update,
+ .final = crypto_poly1305_final,
+ .setkey = crypto_poly1305_setkey,
.descsize = sizeof(struct poly1305_desc_ctx),
.base = {
.cra_name = "poly1305",
diff --git a/include/crypto/poly1305.h b/include/crypto/poly1305.h
new file mode 100644
index 0000000..894df59
--- /dev/null
+++ b/include/crypto/poly1305.h
@@ -0,0 +1,41 @@
+/*
+ * Common values for the Poly1305 algorithm
+ */
+
+#ifndef _CRYPTO_POLY1305_H
+#define _CRYPTO_POLY1305_H
+
+#include <linux/types.h>
+#include <linux/crypto.h>
+
+#define POLY1305_BLOCK_SIZE 16
+#define POLY1305_KEY_SIZE 32
+#define POLY1305_DIGEST_SIZE 16
+
+struct poly1305_desc_ctx {
+ /* key */
+ u32 r[5];
+ /* finalize key */
+ u32 s[4];
+ /* accumulator */
+ u32 h[5];
+ /* partial buffer */
+ u8 buf[POLY1305_BLOCK_SIZE];
+ /* bytes used in partial buffer */
+ unsigned int buflen;
+ /* r key has been set */
+ bool rset;
+ /* s key has been set */
+ bool sset;
+};
+
+int crypto_poly1305_init(struct shash_desc *desc);
+int crypto_poly1305_setkey(struct crypto_shash *tfm,
+ const u8 *key, unsigned int keylen);
+unsigned int crypto_poly1305_setdesckey(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen);
+int crypto_poly1305_update(struct shash_desc *desc,
+ const u8 *src, unsigned int srclen);
+int crypto_poly1305_final(struct shash_desc *desc, u8 *dst);
+
+#endif
--
1.9.1

2015-07-16 17:14:23

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 04/10] crypto: chacha20 - Add a four block SSSE3 variant for x86_64

Extends the x86_64 SSSE3 ChaCha20 implementation by a function processing
four ChaCha20 blocks in parallel. This avoids the word shuffling needed
in the single block variant, further increasing throughput.

For large messages, throughput increases by ~110% compared to single block
SSSE3:

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/chacha20-ssse3-x86_64.S | 483 ++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 8 +
2 files changed, 491 insertions(+)

diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S
index 1b97ad0..712b130 100644
--- a/arch/x86/crypto/chacha20-ssse3-x86_64.S
+++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S
@@ -16,6 +16,7 @@

ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
+CTRINC: .octa 0x00000003000000020000000100000000

.text

@@ -140,3 +141,485 @@ ENTRY(chacha20_block_xor_ssse3)

ret
ENDPROC(chacha20_block_xor_ssse3)
+
+ENTRY(chacha20_4block_xor_ssse3)
+ # %rdi: Input state matrix, s
+ # %rsi: 4 data blocks output, o
+ # %rdx: 4 data blocks input, i
+
+ # This function encrypts four consecutive ChaCha20 blocks by loading the
+ # the state matrix in SSE registers four times. As we need some scratch
+ # registers, we save the first four registers on the stack. The
+ # algorithm performs each operation on the corresponding word of each
+ # state matrix, hence requires no word shuffling. For final XORing step
+ # we transpose the matrix by interleaving 32- and then 64-bit words,
+ # which allows us to do XOR in SSE registers. 8/16-bit word rotation is
+ # done with the slightly better performing SSSE3 byte shuffling,
+ # 7/12-bit word rotation uses traditional shift+OR.
+
+ sub $0x40,%rsp
+
+ # x0..15[0-3] = s0..3[0..3]
+ movq 0x00(%rdi),%xmm1
+ pshufd $0x00,%xmm1,%xmm0
+ pshufd $0x55,%xmm1,%xmm1
+ movq 0x08(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ movq 0x10(%rdi),%xmm5
+ pshufd $0x00,%xmm5,%xmm4
+ pshufd $0x55,%xmm5,%xmm5
+ movq 0x18(%rdi),%xmm7
+ pshufd $0x00,%xmm7,%xmm6
+ pshufd $0x55,%xmm7,%xmm7
+ movq 0x20(%rdi),%xmm9
+ pshufd $0x00,%xmm9,%xmm8
+ pshufd $0x55,%xmm9,%xmm9
+ movq 0x28(%rdi),%xmm11
+ pshufd $0x00,%xmm11,%xmm10
+ pshufd $0x55,%xmm11,%xmm11
+ movq 0x30(%rdi),%xmm13
+ pshufd $0x00,%xmm13,%xmm12
+ pshufd $0x55,%xmm13,%xmm13
+ movq 0x38(%rdi),%xmm15
+ pshufd $0x00,%xmm15,%xmm14
+ pshufd $0x55,%xmm15,%xmm15
+ # x0..3 on stack
+ movdqa %xmm0,0x00(%rsp)
+ movdqa %xmm1,0x10(%rsp)
+ movdqa %xmm2,0x20(%rsp)
+ movdqa %xmm3,0x30(%rsp)
+
+ movdqa CTRINC(%rip),%xmm1
+ movdqa ROT8(%rip),%xmm2
+ movdqa ROT16(%rip),%xmm3
+
+ # x12 += counter values 0-3
+ paddd %xmm1,%xmm12
+
+ mov $10,%ecx
+
+.Ldoubleround4:
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm3,%xmm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm3,%xmm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 16)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm3,%xmm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 16)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm3,%xmm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 12)
+ paddd %xmm12,%xmm8
+ pxor %xmm8,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm4
+ por %xmm0,%xmm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 12)
+ paddd %xmm13,%xmm9
+ pxor %xmm9,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm5
+ por %xmm0,%xmm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 12)
+ paddd %xmm14,%xmm10
+ pxor %xmm10,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm6
+ por %xmm0,%xmm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 12)
+ paddd %xmm15,%xmm11
+ pxor %xmm11,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm7
+ por %xmm0,%xmm7
+
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 8)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm2,%xmm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 8)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm2,%xmm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 8)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm2,%xmm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 8)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm2,%xmm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 7)
+ paddd %xmm12,%xmm8
+ pxor %xmm8,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm4
+ por %xmm0,%xmm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 7)
+ paddd %xmm13,%xmm9
+ pxor %xmm9,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm5
+ por %xmm0,%xmm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 7)
+ paddd %xmm14,%xmm10
+ pxor %xmm10,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm6
+ por %xmm0,%xmm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 7)
+ paddd %xmm15,%xmm11
+ pxor %xmm11,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm7
+ por %xmm0,%xmm7
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 16)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm3,%xmm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 16)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm3,%xmm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 16)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm3,%xmm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 16)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm3,%xmm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 12)
+ paddd %xmm15,%xmm10
+ pxor %xmm10,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm5
+ por %xmm0,%xmm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 12)
+ paddd %xmm12,%xmm11
+ pxor %xmm11,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm6
+ por %xmm0,%xmm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 12)
+ paddd %xmm13,%xmm8
+ pxor %xmm8,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm7
+ por %xmm0,%xmm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 12)
+ paddd %xmm14,%xmm9
+ pxor %xmm9,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $12,%xmm0
+ psrld $20,%xmm4
+ por %xmm0,%xmm4
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 8)
+ movdqa 0x00(%rsp),%xmm0
+ paddd %xmm5,%xmm0
+ movdqa %xmm0,0x00(%rsp)
+ pxor %xmm0,%xmm15
+ pshufb %xmm2,%xmm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 8)
+ movdqa 0x10(%rsp),%xmm0
+ paddd %xmm6,%xmm0
+ movdqa %xmm0,0x10(%rsp)
+ pxor %xmm0,%xmm12
+ pshufb %xmm2,%xmm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 8)
+ movdqa 0x20(%rsp),%xmm0
+ paddd %xmm7,%xmm0
+ movdqa %xmm0,0x20(%rsp)
+ pxor %xmm0,%xmm13
+ pshufb %xmm2,%xmm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 8)
+ movdqa 0x30(%rsp),%xmm0
+ paddd %xmm4,%xmm0
+ movdqa %xmm0,0x30(%rsp)
+ pxor %xmm0,%xmm14
+ pshufb %xmm2,%xmm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 7)
+ paddd %xmm15,%xmm10
+ pxor %xmm10,%xmm5
+ movdqa %xmm5,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm5
+ por %xmm0,%xmm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 7)
+ paddd %xmm12,%xmm11
+ pxor %xmm11,%xmm6
+ movdqa %xmm6,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm6
+ por %xmm0,%xmm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 7)
+ paddd %xmm13,%xmm8
+ pxor %xmm8,%xmm7
+ movdqa %xmm7,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm7
+ por %xmm0,%xmm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 7)
+ paddd %xmm14,%xmm9
+ pxor %xmm9,%xmm4
+ movdqa %xmm4,%xmm0
+ pslld $7,%xmm0
+ psrld $25,%xmm4
+ por %xmm0,%xmm4
+
+ dec %ecx
+ jnz .Ldoubleround4
+
+ # x0[0-3] += s0[0]
+ # x1[0-3] += s0[1]
+ movq 0x00(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd 0x00(%rsp),%xmm2
+ movdqa %xmm2,0x00(%rsp)
+ paddd 0x10(%rsp),%xmm3
+ movdqa %xmm3,0x10(%rsp)
+ # x2[0-3] += s0[2]
+ # x3[0-3] += s0[3]
+ movq 0x08(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd 0x20(%rsp),%xmm2
+ movdqa %xmm2,0x20(%rsp)
+ paddd 0x30(%rsp),%xmm3
+ movdqa %xmm3,0x30(%rsp)
+
+ # x4[0-3] += s1[0]
+ # x5[0-3] += s1[1]
+ movq 0x10(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm4
+ paddd %xmm3,%xmm5
+ # x6[0-3] += s1[2]
+ # x7[0-3] += s1[3]
+ movq 0x18(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm6
+ paddd %xmm3,%xmm7
+
+ # x8[0-3] += s2[0]
+ # x9[0-3] += s2[1]
+ movq 0x20(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm8
+ paddd %xmm3,%xmm9
+ # x10[0-3] += s2[2]
+ # x11[0-3] += s2[3]
+ movq 0x28(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm10
+ paddd %xmm3,%xmm11
+
+ # x12[0-3] += s3[0]
+ # x13[0-3] += s3[1]
+ movq 0x30(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm12
+ paddd %xmm3,%xmm13
+ # x14[0-3] += s3[2]
+ # x15[0-3] += s3[3]
+ movq 0x38(%rdi),%xmm3
+ pshufd $0x00,%xmm3,%xmm2
+ pshufd $0x55,%xmm3,%xmm3
+ paddd %xmm2,%xmm14
+ paddd %xmm3,%xmm15
+
+ # x12 += counter values 0-3
+ paddd %xmm1,%xmm12
+
+ # interleave 32-bit words in state n, n+1
+ movdqa 0x00(%rsp),%xmm0
+ movdqa 0x10(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpckldq %xmm1,%xmm2
+ punpckhdq %xmm1,%xmm0
+ movdqa %xmm2,0x00(%rsp)
+ movdqa %xmm0,0x10(%rsp)
+ movdqa 0x20(%rsp),%xmm0
+ movdqa 0x30(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpckldq %xmm1,%xmm2
+ punpckhdq %xmm1,%xmm0
+ movdqa %xmm2,0x20(%rsp)
+ movdqa %xmm0,0x30(%rsp)
+ movdqa %xmm4,%xmm0
+ punpckldq %xmm5,%xmm4
+ punpckhdq %xmm5,%xmm0
+ movdqa %xmm0,%xmm5
+ movdqa %xmm6,%xmm0
+ punpckldq %xmm7,%xmm6
+ punpckhdq %xmm7,%xmm0
+ movdqa %xmm0,%xmm7
+ movdqa %xmm8,%xmm0
+ punpckldq %xmm9,%xmm8
+ punpckhdq %xmm9,%xmm0
+ movdqa %xmm0,%xmm9
+ movdqa %xmm10,%xmm0
+ punpckldq %xmm11,%xmm10
+ punpckhdq %xmm11,%xmm0
+ movdqa %xmm0,%xmm11
+ movdqa %xmm12,%xmm0
+ punpckldq %xmm13,%xmm12
+ punpckhdq %xmm13,%xmm0
+ movdqa %xmm0,%xmm13
+ movdqa %xmm14,%xmm0
+ punpckldq %xmm15,%xmm14
+ punpckhdq %xmm15,%xmm0
+ movdqa %xmm0,%xmm15
+
+ # interleave 64-bit words in state n, n+2
+ movdqa 0x00(%rsp),%xmm0
+ movdqa 0x20(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpcklqdq %xmm1,%xmm2
+ punpckhqdq %xmm1,%xmm0
+ movdqa %xmm2,0x00(%rsp)
+ movdqa %xmm0,0x20(%rsp)
+ movdqa 0x10(%rsp),%xmm0
+ movdqa 0x30(%rsp),%xmm1
+ movdqa %xmm0,%xmm2
+ punpcklqdq %xmm1,%xmm2
+ punpckhqdq %xmm1,%xmm0
+ movdqa %xmm2,0x10(%rsp)
+ movdqa %xmm0,0x30(%rsp)
+ movdqa %xmm4,%xmm0
+ punpcklqdq %xmm6,%xmm4
+ punpckhqdq %xmm6,%xmm0
+ movdqa %xmm0,%xmm6
+ movdqa %xmm5,%xmm0
+ punpcklqdq %xmm7,%xmm5
+ punpckhqdq %xmm7,%xmm0
+ movdqa %xmm0,%xmm7
+ movdqa %xmm8,%xmm0
+ punpcklqdq %xmm10,%xmm8
+ punpckhqdq %xmm10,%xmm0
+ movdqa %xmm0,%xmm10
+ movdqa %xmm9,%xmm0
+ punpcklqdq %xmm11,%xmm9
+ punpckhqdq %xmm11,%xmm0
+ movdqa %xmm0,%xmm11
+ movdqa %xmm12,%xmm0
+ punpcklqdq %xmm14,%xmm12
+ punpckhqdq %xmm14,%xmm0
+ movdqa %xmm0,%xmm14
+ movdqa %xmm13,%xmm0
+ punpcklqdq %xmm15,%xmm13
+ punpckhqdq %xmm15,%xmm0
+ movdqa %xmm0,%xmm15
+
+ # xor with corresponding input, write to output
+ movdqa 0x00(%rsp),%xmm0
+ movdqu 0x00(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0x00(%rsi)
+ movdqa 0x10(%rsp),%xmm0
+ movdqu 0x80(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0x80(%rsi)
+ movdqa 0x20(%rsp),%xmm0
+ movdqu 0x40(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0x40(%rsi)
+ movdqa 0x30(%rsp),%xmm0
+ movdqu 0xc0(%rdx),%xmm1
+ pxor %xmm1,%xmm0
+ movdqu %xmm0,0xc0(%rsi)
+ movdqu 0x10(%rdx),%xmm1
+ pxor %xmm1,%xmm4
+ movdqu %xmm4,0x10(%rsi)
+ movdqu 0x90(%rdx),%xmm1
+ pxor %xmm1,%xmm5
+ movdqu %xmm5,0x90(%rsi)
+ movdqu 0x50(%rdx),%xmm1
+ pxor %xmm1,%xmm6
+ movdqu %xmm6,0x50(%rsi)
+ movdqu 0xd0(%rdx),%xmm1
+ pxor %xmm1,%xmm7
+ movdqu %xmm7,0xd0(%rsi)
+ movdqu 0x20(%rdx),%xmm1
+ pxor %xmm1,%xmm8
+ movdqu %xmm8,0x20(%rsi)
+ movdqu 0xa0(%rdx),%xmm1
+ pxor %xmm1,%xmm9
+ movdqu %xmm9,0xa0(%rsi)
+ movdqu 0x60(%rdx),%xmm1
+ pxor %xmm1,%xmm10
+ movdqu %xmm10,0x60(%rsi)
+ movdqu 0xe0(%rdx),%xmm1
+ pxor %xmm1,%xmm11
+ movdqu %xmm11,0xe0(%rsi)
+ movdqu 0x30(%rdx),%xmm1
+ pxor %xmm1,%xmm12
+ movdqu %xmm12,0x30(%rsi)
+ movdqu 0xb0(%rdx),%xmm1
+ pxor %xmm1,%xmm13
+ movdqu %xmm13,0xb0(%rsi)
+ movdqu 0x70(%rdx),%xmm1
+ pxor %xmm1,%xmm14
+ movdqu %xmm14,0x70(%rsi)
+ movdqu 0xf0(%rdx),%xmm1
+ pxor %xmm1,%xmm15
+ movdqu %xmm15,0xf0(%rsi)
+
+ add $0x40,%rsp
+ ret
+ENDPROC(chacha20_4block_xor_ssse3)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 250de40..4d677c3 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -20,12 +20,20 @@
#define CHACHA20_STATE_ALIGN 16

asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
+asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);

static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
u8 buf[CHACHA20_BLOCK_SIZE];

+ while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
+ chacha20_4block_xor_ssse3(state, dst, src);
+ bytes -= CHACHA20_BLOCK_SIZE * 4;
+ src += CHACHA20_BLOCK_SIZE * 4;
+ dst += CHACHA20_BLOCK_SIZE * 4;
+ state[12] += 4;
+ }
while (bytes >= CHACHA20_BLOCK_SIZE) {
chacha20_block_xor_ssse3(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE;
--
1.9.1

2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 03/10] crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64

Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
single block variant works on a single state matrix using SSE instructions.
It requires SSSE3 due the use of pshufb for efficient 8/16-bit rotate
operations.

For large messages, throughput increases by ~65% compared to
chacha20-generic:

testing speed of chacha20 (chacha20-generic) encryption
test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 2 +
arch/x86/crypto/chacha20-ssse3-x86_64.S | 142 ++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 123 +++++++++++++++++++++++++++
crypto/Kconfig | 15 ++++
4 files changed, 282 insertions(+)
create mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
create mode 100644 arch/x86/crypto/chacha20_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 5a4a089..b09e9a4 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
obj-$(CONFIG_CRYPTO_SALSA20_X86_64) += salsa20-x86_64.o
+obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha20-x86_64.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
@@ -60,6 +61,7 @@ blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
salsa20-x86_64-y := salsa20-x86_64-asm_64.o salsa20_glue.o
+chacha20-x86_64-y := chacha20-ssse3-x86_64.o chacha20_glue.o
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o

ifeq ($(avx_supported),yes)
diff --git a/arch/x86/crypto/chacha20-ssse3-x86_64.S b/arch/x86/crypto/chacha20-ssse3-x86_64.S
new file mode 100644
index 0000000..1b97ad0
--- /dev/null
+++ b/arch/x86/crypto/chacha20-ssse3-x86_64.S
@@ -0,0 +1,142 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 SSSE3 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 16
+
+ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
+ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
+
+.text
+
+ENTRY(chacha20_block_xor_ssse3)
+ # %rdi: Input state matrix, s
+ # %rsi: 1 data block output, o
+ # %rdx: 1 data block input, i
+
+ # This function encrypts one ChaCha20 block by loading the state matrix
+ # in four SSE registers. It performs matrix operation on four words in
+ # parallel, but requireds shuffling to rearrange the words after each
+ # round. 8/16-bit word rotation is done with the slightly better
+ # performing SSSE3 byte shuffling, 7/12-bit word rotation uses
+ # traditional shift+OR.
+
+ # x0..3 = s0..3
+ movdqa 0x00(%rdi),%xmm0
+ movdqa 0x10(%rdi),%xmm1
+ movdqa 0x20(%rdi),%xmm2
+ movdqa 0x30(%rdi),%xmm3
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+ movdqa %xmm2,%xmm10
+ movdqa %xmm3,%xmm11
+
+ movdqa ROT8(%rip),%xmm4
+ movdqa ROT16(%rip),%xmm5
+
+ mov $10,%ecx
+
+.Ldoubleround:
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm5,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm6
+ pslld $12,%xmm6
+ psrld $20,%xmm1
+ por %xmm6,%xmm1
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm4,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm7
+ pslld $7,%xmm7
+ psrld $25,%xmm1
+ por %xmm7,%xmm1
+
+ # x1 = shuffle32(x1, MASK(0, 3, 2, 1))
+ pshufd $0x39,%xmm1,%xmm1
+ # x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+ pshufd $0x4e,%xmm2,%xmm2
+ # x3 = shuffle32(x3, MASK(2, 1, 0, 3))
+ pshufd $0x93,%xmm3,%xmm3
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 16)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm5,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 12)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm6
+ pslld $12,%xmm6
+ psrld $20,%xmm1
+ por %xmm6,%xmm1
+
+ # x0 += x1, x3 = rotl32(x3 ^ x0, 8)
+ paddd %xmm1,%xmm0
+ pxor %xmm0,%xmm3
+ pshufb %xmm4,%xmm3
+
+ # x2 += x3, x1 = rotl32(x1 ^ x2, 7)
+ paddd %xmm3,%xmm2
+ pxor %xmm2,%xmm1
+ movdqa %xmm1,%xmm7
+ pslld $7,%xmm7
+ psrld $25,%xmm1
+ por %xmm7,%xmm1
+
+ # x1 = shuffle32(x1, MASK(2, 1, 0, 3))
+ pshufd $0x93,%xmm1,%xmm1
+ # x2 = shuffle32(x2, MASK(1, 0, 3, 2))
+ pshufd $0x4e,%xmm2,%xmm2
+ # x3 = shuffle32(x3, MASK(0, 3, 2, 1))
+ pshufd $0x39,%xmm3,%xmm3
+
+ dec %ecx
+ jnz .Ldoubleround
+
+ # o0 = i0 ^ (x0 + s0)
+ movdqu 0x00(%rdx),%xmm4
+ paddd %xmm8,%xmm0
+ pxor %xmm4,%xmm0
+ movdqu %xmm0,0x00(%rsi)
+ # o1 = i1 ^ (x1 + s1)
+ movdqu 0x10(%rdx),%xmm5
+ paddd %xmm9,%xmm1
+ pxor %xmm5,%xmm1
+ movdqu %xmm1,0x10(%rsi)
+ # o2 = i2 ^ (x2 + s2)
+ movdqu 0x20(%rdx),%xmm6
+ paddd %xmm10,%xmm2
+ pxor %xmm6,%xmm2
+ movdqu %xmm2,0x20(%rsi)
+ # o3 = i3 ^ (x3 + s3)
+ movdqu 0x30(%rdx),%xmm7
+ paddd %xmm11,%xmm3
+ pxor %xmm7,%xmm3
+ movdqu %xmm3,0x30(%rsi)
+
+ ret
+ENDPROC(chacha20_block_xor_ssse3)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
new file mode 100644
index 0000000..250de40
--- /dev/null
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -0,0 +1,123 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, SIMD glue code
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/chacha20.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/fpu/api.h>
+#include <asm/simd.h>
+
+#define CHACHA20_STATE_ALIGN 16
+
+asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
+
+static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
+ unsigned int bytes)
+{
+ u8 buf[CHACHA20_BLOCK_SIZE];
+
+ while (bytes >= CHACHA20_BLOCK_SIZE) {
+ chacha20_block_xor_ssse3(state, dst, src);
+ bytes -= CHACHA20_BLOCK_SIZE;
+ src += CHACHA20_BLOCK_SIZE;
+ dst += CHACHA20_BLOCK_SIZE;
+ state[12]++;
+ }
+ if (bytes) {
+ memcpy(buf, src, bytes);
+ chacha20_block_xor_ssse3(state, buf, buf);
+ memcpy(dst, buf, bytes);
+ }
+}
+
+static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
+ struct blkcipher_walk walk;
+ int err;
+
+ if (!may_use_simd())
+ return crypto_chacha20_crypt(desc, dst, src, nbytes);
+
+ state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt_block(desc, &walk, CHACHA20_BLOCK_SIZE);
+
+ crypto_chacha20_init(state, crypto_blkcipher_ctx(desc->tfm), walk.iv);
+
+ kernel_fpu_begin();
+
+ while (walk.nbytes >= CHACHA20_BLOCK_SIZE) {
+ chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
+ rounddown(walk.nbytes, CHACHA20_BLOCK_SIZE));
+ err = blkcipher_walk_done(desc, &walk,
+ walk.nbytes % CHACHA20_BLOCK_SIZE);
+ }
+
+ if (walk.nbytes) {
+ chacha20_dosimd(state, walk.dst.virt.addr, walk.src.virt.addr,
+ walk.nbytes);
+ err = blkcipher_walk_done(desc, &walk, 0);
+ }
+
+ kernel_fpu_end();
+
+ return err;
+}
+
+static struct crypto_alg alg = {
+ .cra_name = "chacha20",
+ .cra_driver_name = "chacha20-simd",
+ .cra_priority = 300,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = 1,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_ctxsize = sizeof(struct chacha20_ctx),
+ .cra_alignmask = sizeof(u32) - 1,
+ .cra_module = THIS_MODULE,
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = CHACHA20_KEY_SIZE,
+ .max_keysize = CHACHA20_KEY_SIZE,
+ .ivsize = CHACHA20_IV_SIZE,
+ .geniv = "seqiv",
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = chacha20_simd,
+ .decrypt = chacha20_simd,
+ },
+ },
+};
+
+static int __init chacha20_simd_mod_init(void)
+{
+ if (!cpu_has_ssse3)
+ return -ENODEV;
+
+ return crypto_register_alg(&alg);
+}
+
+static void __exit chacha20_simd_mod_fini(void)
+{
+ crypto_unregister_alg(&alg);
+}
+
+module_init(chacha20_simd_mod_init);
+module_exit(chacha20_simd_mod_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Willi <[email protected]>");
+MODULE_DESCRIPTION("chacha20 cipher algorithm, SIMD accelerated");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-simd");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index b4cfc57..8f24185 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1213,6 +1213,21 @@ config CRYPTO_CHACHA20
See also:
<http://cr.yp.to/chacha/chacha-20080128.pdf>

+config CRYPTO_CHACHA20_X86_64
+ tristate "ChaCha20 cipher algorithm (x86_64/SSSE3)"
+ depends on X86 && 64BIT
+ select CRYPTO_BLKCIPHER
+ select CRYPTO_CHACHA20
+ help
+ ChaCha20 cipher algorithm, RFC7539.
+
+ ChaCha20 is a 256-bit high-speed stream cipher designed by Daniel J.
+ Bernstein and further specified in RFC7539 for use in IETF protocols.
+ This is the x86_64 assembler implementation using SIMD instructions.
+
+ See also:
+ <http://cr.yp.to/chacha/chacha-20080128.pdf>
+
config CRYPTO_SEED
tristate "SEED cipher algorithm"
select CRYPTO_ALGAPI
--
1.9.1

2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 08/10] crypto: poly1305 - Add a SSE2 SIMD variant for x86_64

Implements an x86_64 assembler driver for the Poly1305 authenticator. This
single block variant holds the 130-bit integer in 5 32-bit words, but uses
SSE to do two multiplications/additions in parallel.

When calling updates with small blocks, the overhead for kernel_fpu_begin/
kernel_fpu_end() negates the perfmance gain. We therefore use the
poly1305-generic fallback for small updates.

For large messages, throughput increases by ~5-10% compared to
poly1305-generic:

testing speed of poly1305 (poly1305-generic)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 4080026 opers/sec, 391682496 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 6221094 opers/sec, 597225024 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9609750 opers/sec, 922536057 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1459379 opers/sec, 420301267 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2115179 opers/sec, 609171609 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3729874 opers/sec, 1074203856 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 593000 opers/sec, 626208000 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1081536 opers/sec, 1142102332 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 302077 opers/sec, 628320576 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 554384 opers/sec, 1153120176 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 278715 opers/sec, 1150536345 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 140202 opers/sec, 1153022070 bytes/sec

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3790063 opers/sec, 363846076 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5913378 opers/sec, 567684355 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9352574 opers/sec, 897847104 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1362145 opers/sec, 392297990 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2007075 opers/sec, 578037628 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3709811 opers/sec, 1068425798 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 566272 opers/sec, 597984182 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1111657 opers/sec, 1173910108 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 288857 opers/sec, 600823808 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 590746 opers/sec, 1228751888 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 301825 opers/sec, 1245936902 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 153075 opers/sec, 1258896201 bytes/sec

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 2 +
arch/x86/crypto/poly1305-sse2-x86_64.S | 276 +++++++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 123 +++++++++++++++
crypto/Kconfig | 12 ++
4 files changed, 413 insertions(+)
create mode 100644 arch/x86/crypto/poly1305-sse2-x86_64.S
create mode 100644 arch/x86/crypto/poly1305_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index ce39b3c..5cf405c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -31,6 +31,7 @@ obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
+obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o

# These modules require assembler to support AVX.
ifeq ($(avx_supported),yes)
@@ -85,6 +86,7 @@ aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
+poly1305-x86_64-y := poly1305-sse2-x86_64.o poly1305_glue.o
ifeq ($(avx2_supported),yes)
sha1-ssse3-y += sha1_avx2_x86_64_asm.o
endif
diff --git a/arch/x86/crypto/poly1305-sse2-x86_64.S b/arch/x86/crypto/poly1305-sse2-x86_64.S
new file mode 100644
index 0000000..a3d2b5e
--- /dev/null
+++ b/arch/x86/crypto/poly1305-sse2-x86_64.S
@@ -0,0 +1,276 @@
+/*
+ * Poly1305 authenticator algorithm, RFC7539, x64 SSE2 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 16
+
+ANMASK: .octa 0x0000000003ffffff0000000003ffffff
+
+.text
+
+#define h0 0x00(%rdi)
+#define h1 0x04(%rdi)
+#define h2 0x08(%rdi)
+#define h3 0x0c(%rdi)
+#define h4 0x10(%rdi)
+#define r0 0x00(%rdx)
+#define r1 0x04(%rdx)
+#define r2 0x08(%rdx)
+#define r3 0x0c(%rdx)
+#define r4 0x10(%rdx)
+#define s1 0x00(%rsp)
+#define s2 0x04(%rsp)
+#define s3 0x08(%rsp)
+#define s4 0x0c(%rsp)
+#define m %rsi
+#define h01 %xmm0
+#define h23 %xmm1
+#define h44 %xmm2
+#define t1 %xmm3
+#define t2 %xmm4
+#define t3 %xmm5
+#define t4 %xmm6
+#define mask %xmm7
+#define d0 %r8
+#define d1 %r9
+#define d2 %r10
+#define d3 %r11
+#define d4 %r12
+
+ENTRY(poly1305_block_sse2)
+ # %rdi: Accumulator h[5]
+ # %rsi: 16 byte input block m
+ # %rdx: Poly1305 key r[5]
+ # %rcx: Block count
+
+ # This single block variant tries to improve performance by doing two
+ # multiplications in parallel using SSE instructions. There is quite
+ # some quardword packing involved, hence the speedup is marginal.
+
+ push %rbx
+ push %r12
+ sub $0x10,%rsp
+
+ # s1..s4 = r1..r4 * 5
+ mov r1,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s1
+ mov r2,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s2
+ mov r3,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s3
+ mov r4,%eax
+ lea (%eax,%eax,4),%eax
+ mov %eax,s4
+
+ movdqa ANMASK(%rip),mask
+
+.Ldoblock:
+ # h01 = [0, h1, 0, h0]
+ # h23 = [0, h3, 0, h2]
+ # h44 = [0, h4, 0, h4]
+ movd h0,h01
+ movd h1,t1
+ movd h2,h23
+ movd h3,t2
+ movd h4,h44
+ punpcklqdq t1,h01
+ punpcklqdq t2,h23
+ punpcklqdq h44,h44
+
+ # h01 += [ (m[3-6] >> 2) & 0x3ffffff, m[0-3] & 0x3ffffff ]
+ movd 0x00(m),t1
+ movd 0x03(m),t2
+ psrld $2,t2
+ punpcklqdq t2,t1
+ pand mask,t1
+ paddd t1,h01
+ # h23 += [ (m[9-12] >> 6) & 0x3ffffff, (m[6-9] >> 4) & 0x3ffffff ]
+ movd 0x06(m),t1
+ movd 0x09(m),t2
+ psrld $4,t1
+ psrld $6,t2
+ punpcklqdq t2,t1
+ pand mask,t1
+ paddd t1,h23
+ # h44 += [ (m[12-15] >> 8) | (1 << 24), (m[12-15] >> 8) | (1 << 24) ]
+ mov 0x0c(m),%eax
+ shr $8,%eax
+ or $0x01000000,%eax
+ movd %eax,t1
+ pshufd $0xc4,t1,t1
+ paddd t1,h44
+
+ # t1[0] = h0 * r0 + h2 * s3
+ # t1[1] = h1 * s4 + h3 * s2
+ movd r0,t1
+ movd s4,t2
+ punpcklqdq t2,t1
+ pmuludq h01,t1
+ movd s3,t2
+ movd s2,t3
+ punpcklqdq t3,t2
+ pmuludq h23,t2
+ paddq t2,t1
+ # t2[0] = h0 * r1 + h2 * s4
+ # t2[1] = h1 * r0 + h3 * s3
+ movd r1,t2
+ movd r0,t3
+ punpcklqdq t3,t2
+ pmuludq h01,t2
+ movd s4,t3
+ movd s3,t4
+ punpcklqdq t4,t3
+ pmuludq h23,t3
+ paddq t3,t2
+ # t3[0] = h4 * s1
+ # t3[1] = h4 * s2
+ movd s1,t3
+ movd s2,t4
+ punpcklqdq t4,t3
+ pmuludq h44,t3
+ # d0 = t1[0] + t1[1] + t3[0]
+ # d1 = t2[0] + t2[1] + t3[1]
+ movdqa t1,t4
+ punpcklqdq t2,t4
+ punpckhqdq t2,t1
+ paddq t4,t1
+ paddq t3,t1
+ movq t1,d0
+ psrldq $8,t1
+ movq t1,d1
+
+ # t1[0] = h0 * r2 + h2 * r0
+ # t1[1] = h1 * r1 + h3 * s4
+ movd r2,t1
+ movd r1,t2
+ punpcklqdq t2,t1
+ pmuludq h01,t1
+ movd r0,t2
+ movd s4,t3
+ punpcklqdq t3,t2
+ pmuludq h23,t2
+ paddq t2,t1
+ # t2[0] = h0 * r3 + h2 * r1
+ # t2[1] = h1 * r2 + h3 * r0
+ movd r3,t2
+ movd r2,t3
+ punpcklqdq t3,t2
+ pmuludq h01,t2
+ movd r1,t3
+ movd r0,t4
+ punpcklqdq t4,t3
+ pmuludq h23,t3
+ paddq t3,t2
+ # t3[0] = h4 * s3
+ # t3[1] = h4 * s4
+ movd s3,t3
+ movd s4,t4
+ punpcklqdq t4,t3
+ pmuludq h44,t3
+ # d2 = t1[0] + t1[1] + t3[0]
+ # d3 = t2[0] + t2[1] + t3[1]
+ movdqa t1,t4
+ punpcklqdq t2,t4
+ punpckhqdq t2,t1
+ paddq t4,t1
+ paddq t3,t1
+ movq t1,d2
+ psrldq $8,t1
+ movq t1,d3
+
+ # t1[0] = h0 * r4 + h2 * r2
+ # t1[1] = h1 * r3 + h3 * r1
+ movd r4,t1
+ movd r3,t2
+ punpcklqdq t2,t1
+ pmuludq h01,t1
+ movd r2,t2
+ movd r1,t3
+ punpcklqdq t3,t2
+ pmuludq h23,t2
+ paddq t2,t1
+ # t3[0] = h4 * r0
+ movd r0,t3
+ pmuludq h44,t3
+ # d4 = t1[0] + t1[1] + t3[0]
+ movdqa t1,t4
+ psrldq $8,t4
+ paddq t4,t1
+ paddq t3,t1
+ movq t1,d4
+
+ # d1 += d0 >> 26
+ mov d0,%rax
+ shr $26,%rax
+ add %rax,d1
+ # h0 = d0 & 0x3ffffff
+ mov d0,%rbx
+ and $0x3ffffff,%ebx
+
+ # d2 += d1 >> 26
+ mov d1,%rax
+ shr $26,%rax
+ add %rax,d2
+ # h1 = d1 & 0x3ffffff
+ mov d1,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h1
+
+ # d3 += d2 >> 26
+ mov d2,%rax
+ shr $26,%rax
+ add %rax,d3
+ # h2 = d2 & 0x3ffffff
+ mov d2,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h2
+
+ # d4 += d3 >> 26
+ mov d3,%rax
+ shr $26,%rax
+ add %rax,d4
+ # h3 = d3 & 0x3ffffff
+ mov d3,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h3
+
+ # h0 += (d4 >> 26) * 5
+ mov d4,%rax
+ shr $26,%rax
+ lea (%eax,%eax,4),%eax
+ add %eax,%ebx
+ # h4 = d4 & 0x3ffffff
+ mov d4,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h4
+
+ # h1 += h0 >> 26
+ mov %ebx,%eax
+ shr $26,%eax
+ add %eax,h1
+ # h0 = h0 & 0x3ffffff
+ andl $0x3ffffff,%ebx
+ mov %ebx,h0
+
+ add $0x10,m
+ dec %rcx
+ jnz .Ldoblock
+
+ add $0x10,%rsp
+ pop %r12
+ pop %rbx
+ ret
+ENDPROC(poly1305_block_sse2)
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
new file mode 100644
index 0000000..1e59274
--- /dev/null
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -0,0 +1,123 @@
+/*
+ * Poly1305 authenticator algorithm, RFC7539, SIMD glue code
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/hash.h>
+#include <crypto/poly1305.h>
+#include <linux/crypto.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <asm/fpu/api.h>
+#include <asm/simd.h>
+
+asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
+ const u32 *r, unsigned int blocks);
+
+static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
+ const u8 *src, unsigned int srclen)
+{
+ unsigned int blocks, datalen;
+
+ if (unlikely(!dctx->sset)) {
+ datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
+ src += srclen - datalen;
+ srclen = datalen;
+ }
+
+ if (srclen >= POLY1305_BLOCK_SIZE) {
+ blocks = srclen / POLY1305_BLOCK_SIZE;
+ poly1305_block_sse2(dctx->h, src, dctx->r, blocks);
+ srclen -= POLY1305_BLOCK_SIZE * blocks;
+ }
+ return srclen;
+}
+
+static int poly1305_simd_update(struct shash_desc *desc,
+ const u8 *src, unsigned int srclen)
+{
+ struct poly1305_desc_ctx *dctx = shash_desc_ctx(desc);
+ unsigned int bytes;
+
+ /* kernel_fpu_begin/end is costly, use fallback for small updates */
+ if (srclen <= 288 || !may_use_simd())
+ return crypto_poly1305_update(desc, src, srclen);
+
+ kernel_fpu_begin();
+
+ if (unlikely(dctx->buflen)) {
+ bytes = min(srclen, POLY1305_BLOCK_SIZE - dctx->buflen);
+ memcpy(dctx->buf + dctx->buflen, src, bytes);
+ src += bytes;
+ srclen -= bytes;
+ dctx->buflen += bytes;
+
+ if (dctx->buflen == POLY1305_BLOCK_SIZE) {
+ poly1305_simd_blocks(dctx, dctx->buf,
+ POLY1305_BLOCK_SIZE);
+ dctx->buflen = 0;
+ }
+ }
+
+ if (likely(srclen >= POLY1305_BLOCK_SIZE)) {
+ bytes = poly1305_simd_blocks(dctx, src, srclen);
+ src += srclen - bytes;
+ srclen = bytes;
+ }
+
+ kernel_fpu_end();
+
+ if (unlikely(srclen)) {
+ dctx->buflen = srclen;
+ memcpy(dctx->buf, src, srclen);
+ }
+
+ return 0;
+}
+
+static struct shash_alg alg = {
+ .digestsize = POLY1305_DIGEST_SIZE,
+ .init = crypto_poly1305_init,
+ .update = poly1305_simd_update,
+ .final = crypto_poly1305_final,
+ .setkey = crypto_poly1305_setkey,
+ .descsize = sizeof(struct poly1305_desc_ctx),
+ .base = {
+ .cra_name = "poly1305",
+ .cra_driver_name = "poly1305-simd",
+ .cra_priority = 300,
+ .cra_flags = CRYPTO_ALG_TYPE_SHASH,
+ .cra_alignmask = sizeof(u32) - 1,
+ .cra_blocksize = POLY1305_BLOCK_SIZE,
+ .cra_module = THIS_MODULE,
+ },
+};
+
+static int __init poly1305_simd_mod_init(void)
+{
+ if (!cpu_has_xmm2)
+ return -ENODEV;
+
+ return crypto_register_shash(&alg);
+}
+
+static void __exit poly1305_simd_mod_exit(void)
+{
+ crypto_unregister_shash(&alg);
+}
+
+module_init(poly1305_simd_mod_init);
+module_exit(poly1305_simd_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Martin Willi <[email protected]>");
+MODULE_DESCRIPTION("Poly1305 authenticator");
+MODULE_ALIAS_CRYPTO("poly1305");
+MODULE_ALIAS_CRYPTO("poly1305-simd");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 82caab0..c57478c 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -470,6 +470,18 @@ config CRYPTO_POLY1305
It is used for the ChaCha20-Poly1305 AEAD, specified in RFC7539 for use
in IETF protocols. This is the portable C implementation of Poly1305.

+config CRYPTO_POLY1305_X86_64
+ tristate "Poly1305 authenticator algorithm (x86_64/SSE2)"
+ depends on X86 && 64BIT
+ select CRYPTO_POLY1305
+ help
+ Poly1305 authenticator algorithm, RFC7539.
+
+ Poly1305 is an authenticator algorithm designed by Daniel J. Bernstein.
+ It is used for the ChaCha20-Poly1305 AEAD, specified in RFC7539 for use
+ in IETF protocols. This is the x86_64 assembler implementation using SIMD
+ instructions.
+
config CRYPTO_MD4
tristate "MD4 digest algorithm"
select CRYPTO_HASH
--
1.9.1

2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 02/10] crypto: chacha20 - Export common ChaCha20 helpers

As architecture specific drivers need a software fallback, export a
ChaCha20 en-/decryption function together with some helpers in a header
file.

Signed-off-by: Martin Willi <[email protected]>
---
crypto/chacha20_generic.c | 28 ++++++++++++----------------
crypto/chacha20poly1305.c | 3 +--
include/crypto/chacha20.h | 25 +++++++++++++++++++++++++
3 files changed, 38 insertions(+), 18 deletions(-)
create mode 100644 include/crypto/chacha20.h

diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
index fa42e70..da9c899 100644
--- a/crypto/chacha20_generic.c
+++ b/crypto/chacha20_generic.c
@@ -13,14 +13,7 @@
#include <linux/crypto.h>
#include <linux/kernel.h>
#include <linux/module.h>
-
-#define CHACHA20_NONCE_SIZE 16
-#define CHACHA20_KEY_SIZE 32
-#define CHACHA20_BLOCK_SIZE 64
-
-struct chacha20_ctx {
- u32 key[8];
-};
+#include <crypto/chacha20.h>

static inline u32 rotl32(u32 v, u8 n)
{
@@ -108,7 +101,7 @@ static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
}
}

-static void chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
+void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
{
static const char constant[16] = "expand 32-byte k";

@@ -129,8 +122,9 @@ static void chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv)
state[14] = le32_to_cpuvp(iv + 8);
state[15] = le32_to_cpuvp(iv + 12);
}
+EXPORT_SYMBOL_GPL(crypto_chacha20_init);

-static int chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
+int crypto_chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
unsigned int keysize)
{
struct chacha20_ctx *ctx = crypto_tfm_ctx(tfm);
@@ -144,8 +138,9 @@ static int chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,

return 0;
}
+EXPORT_SYMBOL_GPL(crypto_chacha20_setkey);

-static int chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+int crypto_chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
struct scatterlist *src, unsigned int nbytes)
{
struct blkcipher_walk walk;
@@ -155,7 +150,7 @@ static int chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt_block(desc, &walk, CHACHA20_BLOCK_SIZE);

- chacha20_init(state, crypto_blkcipher_ctx(desc->tfm), walk.iv);
+ crypto_chacha20_init(state, crypto_blkcipher_ctx(desc->tfm), walk.iv);

while (walk.nbytes >= CHACHA20_BLOCK_SIZE) {
chacha20_docrypt(state, walk.dst.virt.addr, walk.src.virt.addr,
@@ -172,6 +167,7 @@ static int chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,

return err;
}
+EXPORT_SYMBOL_GPL(crypto_chacha20_crypt);

static struct crypto_alg alg = {
.cra_name = "chacha20",
@@ -187,11 +183,11 @@ static struct crypto_alg alg = {
.blkcipher = {
.min_keysize = CHACHA20_KEY_SIZE,
.max_keysize = CHACHA20_KEY_SIZE,
- .ivsize = CHACHA20_NONCE_SIZE,
+ .ivsize = CHACHA20_IV_SIZE,
.geniv = "seqiv",
- .setkey = chacha20_setkey,
- .encrypt = chacha20_crypt,
- .decrypt = chacha20_crypt,
+ .setkey = crypto_chacha20_setkey,
+ .encrypt = crypto_chacha20_crypt,
+ .decrypt = crypto_chacha20_crypt,
},
},
};
diff --git a/crypto/chacha20poly1305.c b/crypto/chacha20poly1305.c
index 8626093..410554d 100644
--- a/crypto/chacha20poly1305.c
+++ b/crypto/chacha20poly1305.c
@@ -13,6 +13,7 @@
#include <crypto/internal/hash.h>
#include <crypto/internal/skcipher.h>
#include <crypto/scatterwalk.h>
+#include <crypto/chacha20.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
@@ -23,8 +24,6 @@
#define POLY1305_BLOCK_SIZE 16
#define POLY1305_DIGEST_SIZE 16
#define POLY1305_KEY_SIZE 32
-#define CHACHA20_KEY_SIZE 32
-#define CHACHA20_IV_SIZE 16
#define CHACHAPOLY_IV_SIZE 12

struct chachapoly_instance_ctx {
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h
new file mode 100644
index 0000000..274bbae
--- /dev/null
+++ b/include/crypto/chacha20.h
@@ -0,0 +1,25 @@
+/*
+ * Common values for the ChaCha20 algorithm
+ */
+
+#ifndef _CRYPTO_CHACHA20_H
+#define _CRYPTO_CHACHA20_H
+
+#include <linux/types.h>
+#include <linux/crypto.h>
+
+#define CHACHA20_IV_SIZE 16
+#define CHACHA20_KEY_SIZE 32
+#define CHACHA20_BLOCK_SIZE 64
+
+struct chacha20_ctx {
+ u32 key[8];
+};
+
+void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv);
+int crypto_chacha20_setkey(struct crypto_tfm *tfm, const u8 *key,
+ unsigned int keysize);
+int crypto_chacha20_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes);
+
+#endif
--
1.9.1

2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 09/10] crypto: poly1305 - Add a two block SSE2 variant for x86_64

Extends the x86_64 SSE2 Poly1305 authenticator by a function processing two
consecutive Poly1305 blocks in parallel using a derived key r^2. Loop
unrolling can be more effectively mapped to SSE instructions, further
increasing throughput.

For large messages, throughput increases by ~45-65% compared to single
block SSE2:

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3790063 opers/sec, 363846076 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5913378 opers/sec, 567684355 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9352574 opers/sec, 897847104 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1362145 opers/sec, 392297990 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2007075 opers/sec, 578037628 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3709811 opers/sec, 1068425798 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 566272 opers/sec, 597984182 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1111657 opers/sec, 1173910108 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 288857 opers/sec, 600823808 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 590746 opers/sec, 1228751888 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 301825 opers/sec, 1245936902 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 153075 opers/sec, 1258896201 bytes/sec

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3809514 opers/sec, 365713411 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5973423 opers/sec, 573448627 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9446779 opers/sec, 906890803 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1364814 opers/sec, 393066691 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2045780 opers/sec, 589184697 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3711946 opers/sec, 1069040592 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 573686 opers/sec, 605812732 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1647802 opers/sec, 1740079440 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 292970 opers/sec, 609378224 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 943229 opers/sec, 1961916528 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 494623 opers/sec, 2041804569 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 254045 opers/sec, 2089271014 bytes/sec

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/poly1305-sse2-x86_64.S | 306 +++++++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 54 +++++-
2 files changed, 355 insertions(+), 5 deletions(-)

diff --git a/arch/x86/crypto/poly1305-sse2-x86_64.S b/arch/x86/crypto/poly1305-sse2-x86_64.S
index a3d2b5e..338c748 100644
--- a/arch/x86/crypto/poly1305-sse2-x86_64.S
+++ b/arch/x86/crypto/poly1305-sse2-x86_64.S
@@ -15,6 +15,7 @@
.align 16

ANMASK: .octa 0x0000000003ffffff0000000003ffffff
+ORMASK: .octa 0x00000000010000000000000001000000

.text

@@ -274,3 +275,308 @@ ENTRY(poly1305_block_sse2)
pop %rbx
ret
ENDPROC(poly1305_block_sse2)
+
+
+#define u0 0x00(%r8)
+#define u1 0x04(%r8)
+#define u2 0x08(%r8)
+#define u3 0x0c(%r8)
+#define u4 0x10(%r8)
+#define hc0 %xmm0
+#define hc1 %xmm1
+#define hc2 %xmm2
+#define hc3 %xmm5
+#define hc4 %xmm6
+#define ru0 %xmm7
+#define ru1 %xmm8
+#define ru2 %xmm9
+#define ru3 %xmm10
+#define ru4 %xmm11
+#define sv1 %xmm12
+#define sv2 %xmm13
+#define sv3 %xmm14
+#define sv4 %xmm15
+#undef d0
+#define d0 %r13
+
+ENTRY(poly1305_2block_sse2)
+ # %rdi: Accumulator h[5]
+ # %rsi: 16 byte input block m
+ # %rdx: Poly1305 key r[5]
+ # %rcx: Doubleblock count
+ # %r8: Poly1305 derived key r^2 u[5]
+
+ # This two-block variant further improves performance by using loop
+ # unrolled block processing. This is more straight forward and does
+ # less byte shuffling, but requires a second Poly1305 key r^2:
+ # h = (h + m) * r => h = (h + m1) * r^2 + m2 * r
+
+ push %rbx
+ push %r12
+ push %r13
+
+ # combine r0,u0
+ movd u0,ru0
+ movd r0,t1
+ punpcklqdq t1,ru0
+
+ # combine r1,u1 and s1=r1*5,v1=u1*5
+ movd u1,ru1
+ movd r1,t1
+ punpcklqdq t1,ru1
+ movdqa ru1,sv1
+ pslld $2,sv1
+ paddd ru1,sv1
+
+ # combine r2,u2 and s2=r2*5,v2=u2*5
+ movd u2,ru2
+ movd r2,t1
+ punpcklqdq t1,ru2
+ movdqa ru2,sv2
+ pslld $2,sv2
+ paddd ru2,sv2
+
+ # combine r3,u3 and s3=r3*5,v3=u3*5
+ movd u3,ru3
+ movd r3,t1
+ punpcklqdq t1,ru3
+ movdqa ru3,sv3
+ pslld $2,sv3
+ paddd ru3,sv3
+
+ # combine r4,u4 and s4=r4*5,v4=u4*5
+ movd u4,ru4
+ movd r4,t1
+ punpcklqdq t1,ru4
+ movdqa ru4,sv4
+ pslld $2,sv4
+ paddd ru4,sv4
+
+.Ldoblock2:
+ # hc0 = [ m[16-19] & 0x3ffffff, h0 + m[0-3] & 0x3ffffff ]
+ movd 0x00(m),hc0
+ movd 0x10(m),t1
+ punpcklqdq t1,hc0
+ pand ANMASK(%rip),hc0
+ movd h0,t1
+ paddd t1,hc0
+ # hc1 = [ (m[19-22] >> 2) & 0x3ffffff, h1 + (m[3-6] >> 2) & 0x3ffffff ]
+ movd 0x03(m),hc1
+ movd 0x13(m),t1
+ punpcklqdq t1,hc1
+ psrld $2,hc1
+ pand ANMASK(%rip),hc1
+ movd h1,t1
+ paddd t1,hc1
+ # hc2 = [ (m[22-25] >> 4) & 0x3ffffff, h2 + (m[6-9] >> 4) & 0x3ffffff ]
+ movd 0x06(m),hc2
+ movd 0x16(m),t1
+ punpcklqdq t1,hc2
+ psrld $4,hc2
+ pand ANMASK(%rip),hc2
+ movd h2,t1
+ paddd t1,hc2
+ # hc3 = [ (m[25-28] >> 6) & 0x3ffffff, h3 + (m[9-12] >> 6) & 0x3ffffff ]
+ movd 0x09(m),hc3
+ movd 0x19(m),t1
+ punpcklqdq t1,hc3
+ psrld $6,hc3
+ pand ANMASK(%rip),hc3
+ movd h3,t1
+ paddd t1,hc3
+ # hc4 = [ (m[28-31] >> 8) | (1<<24), h4 + (m[12-15] >> 8) | (1<<24) ]
+ movd 0x0c(m),hc4
+ movd 0x1c(m),t1
+ punpcklqdq t1,hc4
+ psrld $8,hc4
+ por ORMASK(%rip),hc4
+ movd h4,t1
+ paddd t1,hc4
+
+ # t1 = [ hc0[1] * r0, hc0[0] * u0 ]
+ movdqa ru0,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * s4, hc1[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * s3, hc2[0] * v3 ]
+ movdqa sv3,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * s2, hc3[0] * v2 ]
+ movdqa sv2,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s1, hc4[0] * v1 ]
+ movdqa sv1,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d0 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d0
+
+ # t1 = [ hc0[1] * r1, hc0[0] * u1 ]
+ movdqa ru1,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r0, hc1[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * s4, hc2[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * s3, hc3[0] * v3 ]
+ movdqa sv3,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s2, hc4[0] * v2 ]
+ movdqa sv2,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d1 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d1
+
+ # t1 = [ hc0[1] * r2, hc0[0] * u2 ]
+ movdqa ru2,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r1, hc1[0] * u1 ]
+ movdqa ru1,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * r0, hc2[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * s4, hc3[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s3, hc4[0] * v3 ]
+ movdqa sv3,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d2 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d2
+
+ # t1 = [ hc0[1] * r3, hc0[0] * u3 ]
+ movdqa ru3,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r2, hc1[0] * u2 ]
+ movdqa ru2,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * r1, hc2[0] * u1 ]
+ movdqa ru1,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * r0, hc3[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * s4, hc4[0] * v4 ]
+ movdqa sv4,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d3 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d3
+
+ # t1 = [ hc0[1] * r4, hc0[0] * u4 ]
+ movdqa ru4,t1
+ pmuludq hc0,t1
+ # t1 += [ hc1[1] * r3, hc1[0] * u3 ]
+ movdqa ru3,t2
+ pmuludq hc1,t2
+ paddq t2,t1
+ # t1 += [ hc2[1] * r2, hc2[0] * u2 ]
+ movdqa ru2,t2
+ pmuludq hc2,t2
+ paddq t2,t1
+ # t1 += [ hc3[1] * r1, hc3[0] * u1 ]
+ movdqa ru1,t2
+ pmuludq hc3,t2
+ paddq t2,t1
+ # t1 += [ hc4[1] * r0, hc4[0] * u0 ]
+ movdqa ru0,t2
+ pmuludq hc4,t2
+ paddq t2,t1
+ # d4 = t1[0] + t1[1]
+ movdqa t1,t2
+ psrldq $8,t2
+ paddq t2,t1
+ movq t1,d4
+
+ # d1 += d0 >> 26
+ mov d0,%rax
+ shr $26,%rax
+ add %rax,d1
+ # h0 = d0 & 0x3ffffff
+ mov d0,%rbx
+ and $0x3ffffff,%ebx
+
+ # d2 += d1 >> 26
+ mov d1,%rax
+ shr $26,%rax
+ add %rax,d2
+ # h1 = d1 & 0x3ffffff
+ mov d1,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h1
+
+ # d3 += d2 >> 26
+ mov d2,%rax
+ shr $26,%rax
+ add %rax,d3
+ # h2 = d2 & 0x3ffffff
+ mov d2,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h2
+
+ # d4 += d3 >> 26
+ mov d3,%rax
+ shr $26,%rax
+ add %rax,d4
+ # h3 = d3 & 0x3ffffff
+ mov d3,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h3
+
+ # h0 += (d4 >> 26) * 5
+ mov d4,%rax
+ shr $26,%rax
+ lea (%eax,%eax,4),%eax
+ add %eax,%ebx
+ # h4 = d4 & 0x3ffffff
+ mov d4,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h4
+
+ # h1 += h0 >> 26
+ mov %ebx,%eax
+ shr $26,%eax
+ add %eax,h1
+ # h0 = h0 & 0x3ffffff
+ andl $0x3ffffff,%ebx
+ mov %ebx,h0
+
+ add $0x20,m
+ dec %rcx
+ jnz .Ldoblock2
+
+ pop %r13
+ pop %r12
+ pop %rbx
+ ret
+ENDPROC(poly1305_2block_sse2)
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 1e59274..b7c33d0 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -18,24 +18,68 @@
#include <asm/fpu/api.h>
#include <asm/simd.h>

+struct poly1305_simd_desc_ctx {
+ struct poly1305_desc_ctx base;
+ /* derived key u set? */
+ bool uset;
+ /* derived Poly1305 key r^2 */
+ u32 u[5];
+};
+
asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
const u32 *r, unsigned int blocks);
+asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r,
+ unsigned int blocks, const u32 *u);
+
+static int poly1305_simd_init(struct shash_desc *desc)
+{
+ struct poly1305_simd_desc_ctx *sctx = shash_desc_ctx(desc);
+
+ sctx->uset = false;
+
+ return crypto_poly1305_init(desc);
+}
+
+static void poly1305_simd_mult(u32 *a, const u32 *b)
+{
+ u8 m[POLY1305_BLOCK_SIZE];
+
+ memset(m, 0, sizeof(m));
+ /* The poly1305 block function adds a hi-bit to the accumulator which
+ * we don't need for key multiplication; compensate for it. */
+ a[4] -= 1 << 24;
+ poly1305_block_sse2(a, m, b, 1);
+}

static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
const u8 *src, unsigned int srclen)
{
+ struct poly1305_simd_desc_ctx *sctx;
unsigned int blocks, datalen;

+ BUILD_BUG_ON(offsetof(struct poly1305_simd_desc_ctx, base));
+ sctx = container_of(dctx, struct poly1305_simd_desc_ctx, base);
+
if (unlikely(!dctx->sset)) {
datalen = crypto_poly1305_setdesckey(dctx, src, srclen);
src += srclen - datalen;
srclen = datalen;
}

+ if (likely(srclen >= POLY1305_BLOCK_SIZE * 2)) {
+ if (unlikely(!sctx->uset)) {
+ memcpy(sctx->u, dctx->r, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u, dctx->r);
+ sctx->uset = true;
+ }
+ blocks = srclen / (POLY1305_BLOCK_SIZE * 2);
+ poly1305_2block_sse2(dctx->h, src, dctx->r, blocks, sctx->u);
+ src += POLY1305_BLOCK_SIZE * 2 * blocks;
+ srclen -= POLY1305_BLOCK_SIZE * 2 * blocks;
+ }
if (srclen >= POLY1305_BLOCK_SIZE) {
- blocks = srclen / POLY1305_BLOCK_SIZE;
- poly1305_block_sse2(dctx->h, src, dctx->r, blocks);
- srclen -= POLY1305_BLOCK_SIZE * blocks;
+ poly1305_block_sse2(dctx->h, src, dctx->r, 1);
+ srclen -= POLY1305_BLOCK_SIZE;
}
return srclen;
}
@@ -84,11 +128,11 @@ static int poly1305_simd_update(struct shash_desc *desc,

static struct shash_alg alg = {
.digestsize = POLY1305_DIGEST_SIZE,
- .init = crypto_poly1305_init,
+ .init = poly1305_simd_init,
.update = poly1305_simd_update,
.final = crypto_poly1305_final,
.setkey = crypto_poly1305_setkey,
- .descsize = sizeof(struct poly1305_desc_ctx),
+ .descsize = sizeof(struct poly1305_simd_desc_ctx),
.base = {
.cra_name = "poly1305",
.cra_driver_name = "poly1305-simd",
--
1.9.1

2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 01/10] crypto: tcrypt - Add ChaCha20/Poly1305 speed tests

Adds individual ChaCha20 and Poly1305 and a combined rfc7539esp AEAD speed
test using mode numbers 214, 321 and 213. For Poly1305 we add a specific
speed template, as it expects the key prepended to the input data.

Signed-off-by: Martin Willi <[email protected]>
---
crypto/tcrypt.c | 15 +++++++++++++++
crypto/tcrypt.h | 20 ++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 73ed4f2..e9a05ba 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1793,6 +1793,17 @@ static int do_test(const char *alg, u32 type, u32 mask, int m)
NULL, 0, 16, 16, aead_speed_template_19);
break;

+ case 213:
+ test_aead_speed("rfc7539esp(chacha20,poly1305)", ENCRYPT, sec,
+ NULL, 0, 16, 8, aead_speed_template_36);
+ break;
+
+ case 214:
+ test_cipher_speed("chacha20", ENCRYPT, sec, NULL, 0,
+ speed_template_32);
+ break;
+
+
case 300:
if (alg) {
test_hash_speed(alg, sec, generic_hash_speed_template);
@@ -1881,6 +1892,10 @@ static int do_test(const char *alg, u32 type, u32 mask, int m)
test_hash_speed("crct10dif", sec, generic_hash_speed_template);
if (mode > 300 && mode < 400) break;

+ case 321:
+ test_hash_speed("poly1305", sec, poly1305_speed_template);
+ if (mode > 300 && mode < 400) break;
+
case 399:
break;

diff --git a/crypto/tcrypt.h b/crypto/tcrypt.h
index 6cc1b85..f0bfee1 100644
--- a/crypto/tcrypt.h
+++ b/crypto/tcrypt.h
@@ -61,12 +61,14 @@ static u8 speed_template_32_40_48[] = {32, 40, 48, 0};
static u8 speed_template_32_48[] = {32, 48, 0};
static u8 speed_template_32_48_64[] = {32, 48, 64, 0};
static u8 speed_template_32_64[] = {32, 64, 0};
+static u8 speed_template_32[] = {32, 0};

/*
* AEAD speed tests
*/
static u8 aead_speed_template_19[] = {19, 0};
static u8 aead_speed_template_20[] = {20, 0};
+static u8 aead_speed_template_36[] = {36, 0};

/*
* Digest speed tests
@@ -127,4 +129,22 @@ static struct hash_speed hash_speed_template_16[] = {
{ .blen = 0, .plen = 0, .klen = 0, }
};

+static struct hash_speed poly1305_speed_template[] = {
+ { .blen = 96, .plen = 16, },
+ { .blen = 96, .plen = 32, },
+ { .blen = 96, .plen = 96, },
+ { .blen = 288, .plen = 16, },
+ { .blen = 288, .plen = 32, },
+ { .blen = 288, .plen = 288, },
+ { .blen = 1056, .plen = 32, },
+ { .blen = 1056, .plen = 1056, },
+ { .blen = 2080, .plen = 32, },
+ { .blen = 2080, .plen = 2080, },
+ { .blen = 4128, .plen = 4128, },
+ { .blen = 8224, .plen = 8224, },
+
+ /* End marker */
+ { .blen = 0, .plen = 0, }
+};
+
#endif /* _CRYPTO_TCRYPT_H */
--
1.9.1

2015-07-16 17:14:23

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 06/10] crypto: testmgr - Add a longer ChaCha20 test vector

The AVX2 variant of ChaCha20 is used only for messages with >= 512 bytes
length. With the existing test vectors, the implementation could not be
tested. Due that lack of such a long official test vector, this one is
self-generated using chacha20-generic.

Signed-off-by: Martin Willi <[email protected]>
---
crypto/testmgr.h | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 333 insertions(+), 1 deletion(-)

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 754e47f..9ca350c 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -30216,7 +30216,7 @@ static struct cipher_testvec salsa20_stream_enc_tv_template[] = {
},
};

-#define CHACHA20_ENC_TEST_VECTORS 3
+#define CHACHA20_ENC_TEST_VECTORS 4
static struct cipher_testvec chacha20_enc_tv_template[] = {
{ /* RFC7539 A.2. Test Vector #1 */
.key = "\x00\x00\x00\x00\x00\x00\x00\x00"
@@ -30390,6 +30390,338 @@ static struct cipher_testvec chacha20_enc_tv_template[] = {
"\x87\xb5\x8d\xfd\x72\x8a\xfa\x36"
"\x75\x7a\x79\x7a\xc1\x88\xd1",
.rlen = 127,
+ }, { /* Self-made test vector for long data */
+ .key = "\x1c\x92\x40\xa5\xeb\x55\xd3\x8a"
+ "\xf3\x33\x88\x86\x04\xf6\xb5\xf0"
+ "\x47\x39\x17\xc1\x40\x2b\x80\x09"
+ "\x9d\xca\x5c\xbc\x20\x70\x75\xc0",
+ .klen = 32,
+ .iv = "\x1c\x00\x00\x00\x00\x00\x00\x00"
+ "\x00\x00\x00\x00\x00\x00\x00\x01",
+ .input = "\x49\xee\xe0\xdc\x24\x90\x40\xcd"
+ "\xc5\x40\x8f\x47\x05\xbc\xdd\x81"
+ "\x47\xc6\x8d\xe6\xb1\x8f\xd7\xcb"
+ "\x09\x0e\x6e\x22\x48\x1f\xbf\xb8"
+ "\x5c\xf7\x1e\x8a\xc1\x23\xf2\xd4"
+ "\x19\x4b\x01\x0f\x4e\xa4\x43\xce"
+ "\x01\xc6\x67\xda\x03\x91\x18\x90"
+ "\xa5\xa4\x8e\x45\x03\xb3\x2d\xac"
+ "\x74\x92\xd3\x53\x47\xc8\xdd\x25"
+ "\x53\x6c\x02\x03\x87\x0d\x11\x0c"
+ "\x58\xe3\x12\x18\xfd\x2a\x5b\x40"
+ "\x0c\x30\xf0\xb8\x3f\x43\xce\xae"
+ "\x65\x3a\x7d\x7c\xf4\x54\xaa\xcc"
+ "\x33\x97\xc3\x77\xba\xc5\x70\xde"
+ "\xd7\xd5\x13\xa5\x65\xc4\x5f\x0f"
+ "\x46\x1a\x0d\x97\xb5\xf3\xbb\x3c"
+ "\x84\x0f\x2b\xc5\xaa\xea\xf2\x6c"
+ "\xc9\xb5\x0c\xee\x15\xf3\x7d\xbe"
+ "\x9f\x7b\x5a\xa6\xae\x4f\x83\xb6"
+ "\x79\x49\x41\xf4\x58\x18\xcb\x86"
+ "\x7f\x30\x0e\xf8\x7d\x44\x36\xea"
+ "\x75\xeb\x88\x84\x40\x3c\xad\x4f"
+ "\x6f\x31\x6b\xaa\x5d\xe5\xa5\xc5"
+ "\x21\x66\xe9\xa7\xe3\xb2\x15\x88"
+ "\x78\xf6\x79\xa1\x59\x47\x12\x4e"
+ "\x9f\x9f\x64\x1a\xa0\x22\x5b\x08"
+ "\xbe\x7c\x36\xc2\x2b\x66\x33\x1b"
+ "\xdd\x60\x71\xf7\x47\x8c\x61\xc3"
+ "\xda\x8a\x78\x1e\x16\xfa\x1e\x86"
+ "\x81\xa6\x17\x2a\xa7\xb5\xc2\xe7"
+ "\xa4\xc7\x42\xf1\xcf\x6a\xca\xb4"
+ "\x45\xcf\xf3\x93\xf0\xe7\xea\xf6"
+ "\xf4\xe6\x33\x43\x84\x93\xa5\x67"
+ "\x9b\x16\x58\x58\x80\x0f\x2b\x5c"
+ "\x24\x74\x75\x7f\x95\x81\xb7\x30"
+ "\x7a\x33\xa7\xf7\x94\x87\x32\x27"
+ "\x10\x5d\x14\x4c\x43\x29\xdd\x26"
+ "\xbd\x3e\x3c\x0e\xfe\x0e\xa5\x10"
+ "\xea\x6b\x64\xfd\x73\xc6\xed\xec"
+ "\xa8\xc9\xbf\xb3\xba\x0b\x4d\x07"
+ "\x70\xfc\x16\xfd\x79\x1e\xd7\xc5"
+ "\x49\x4e\x1c\x8b\x8d\x79\x1b\xb1"
+ "\xec\xca\x60\x09\x4c\x6a\xd5\x09"
+ "\x49\x46\x00\x88\x22\x8d\xce\xea"
+ "\xb1\x17\x11\xde\x42\xd2\x23\xc1"
+ "\x72\x11\xf5\x50\x73\x04\x40\x47"
+ "\xf9\x5d\xe7\xa7\x26\xb1\x7e\xb0"
+ "\x3f\x58\xc1\x52\xab\x12\x67\x9d"
+ "\x3f\x43\x4b\x68\xd4\x9c\x68\x38"
+ "\x07\x8a\x2d\x3e\xf3\xaf\x6a\x4b"
+ "\xf9\xe5\x31\x69\x22\xf9\xa6\x69"
+ "\xc6\x9c\x96\x9a\x12\x35\x95\x1d"
+ "\x95\xd5\xdd\xbe\xbf\x93\x53\x24"
+ "\xfd\xeb\xc2\x0a\x64\xb0\x77\x00"
+ "\x6f\x88\xc4\x37\x18\x69\x7c\xd7"
+ "\x41\x92\x55\x4c\x03\xa1\x9a\x4b"
+ "\x15\xe5\xdf\x7f\x37\x33\x72\xc1"
+ "\x8b\x10\x67\xa3\x01\x57\x94\x25"
+ "\x7b\x38\x71\x7e\xdd\x1e\xcc\x73"
+ "\x55\xd2\x8e\xeb\x07\xdd\xf1\xda"
+ "\x58\xb1\x47\x90\xfe\x42\x21\x72"
+ "\xa3\x54\x7a\xa0\x40\xec\x9f\xdd"
+ "\xc6\x84\x6e\xca\xae\xe3\x68\xb4"
+ "\x9d\xe4\x78\xff\x57\xf2\xf8\x1b"
+ "\x03\xa1\x31\xd9\xde\x8d\xf5\x22"
+ "\x9c\xdd\x20\xa4\x1e\x27\xb1\x76"
+ "\x4f\x44\x55\xe2\x9b\xa1\x9c\xfe"
+ "\x54\xf7\x27\x1b\xf4\xde\x02\xf5"
+ "\x1b\x55\x48\x5c\xdc\x21\x4b\x9e"
+ "\x4b\x6e\xed\x46\x23\xdc\x65\xb2"
+ "\xcf\x79\x5f\x28\xe0\x9e\x8b\xe7"
+ "\x4c\x9d\x8a\xff\xc1\xa6\x28\xb8"
+ "\x65\x69\x8a\x45\x29\xef\x74\x85"
+ "\xde\x79\xc7\x08\xae\x30\xb0\xf4"
+ "\xa3\x1d\x51\x41\xab\xce\xcb\xf6"
+ "\xb5\xd8\x6d\xe0\x85\xe1\x98\xb3"
+ "\x43\xbb\x86\x83\x0a\xa0\xf5\xb7"
+ "\x04\x0b\xfa\x71\x1f\xb0\xf6\xd9"
+ "\x13\x00\x15\xf0\xc7\xeb\x0d\x5a"
+ "\x9f\xd7\xb9\x6c\x65\x14\x22\x45"
+ "\x6e\x45\x32\x3e\x7e\x60\x1a\x12"
+ "\x97\x82\x14\xfb\xaa\x04\x22\xfa"
+ "\xa0\xe5\x7e\x8c\x78\x02\x48\x5d"
+ "\x78\x33\x5a\x7c\xad\xdb\x29\xce"
+ "\xbb\x8b\x61\xa4\xb7\x42\xe2\xac"
+ "\x8b\x1a\xd9\x2f\x0b\x8b\x62\x21"
+ "\x83\x35\x7e\xad\x73\xc2\xb5\x6c"
+ "\x10\x26\x38\x07\xe5\xc7\x36\x80"
+ "\xe2\x23\x12\x61\xf5\x48\x4b\x2b"
+ "\xc5\xdf\x15\xd9\x87\x01\xaa\xac"
+ "\x1e\x7c\xad\x73\x78\x18\x63\xe0"
+ "\x8b\x9f\x81\xd8\x12\x6a\x28\x10"
+ "\xbe\x04\x68\x8a\x09\x7c\x1b\x1c"
+ "\x83\x66\x80\x47\x80\xe8\xfd\x35"
+ "\x1c\x97\x6f\xae\x49\x10\x66\xcc"
+ "\xc6\xd8\xcc\x3a\x84\x91\x20\x77"
+ "\x72\xe4\x24\xd2\x37\x9f\xc5\xc9"
+ "\x25\x94\x10\x5f\x40\x00\x64\x99"
+ "\xdc\xae\xd7\x21\x09\x78\x50\x15"
+ "\xac\x5f\xc6\x2c\xa2\x0b\xa9\x39"
+ "\x87\x6e\x6d\xab\xde\x08\x51\x16"
+ "\xc7\x13\xe9\xea\xed\x06\x8e\x2c"
+ "\xf8\x37\x8c\xf0\xa6\x96\x8d\x43"
+ "\xb6\x98\x37\xb2\x43\xed\xde\xdf"
+ "\x89\x1a\xe7\xeb\x9d\xa1\x7b\x0b"
+ "\x77\xb0\xe2\x75\xc0\xf1\x98\xd9"
+ "\x80\x55\xc9\x34\x91\xd1\x59\xe8"
+ "\x4b\x0f\xc1\xa9\x4b\x7a\x84\x06"
+ "\x20\xa8\x5d\xfa\xd1\xde\x70\x56"
+ "\x2f\x9e\x91\x9c\x20\xb3\x24\xd8"
+ "\x84\x3d\xe1\x8c\x7e\x62\x52\xe5"
+ "\x44\x4b\x9f\xc2\x93\x03\xea\x2b"
+ "\x59\xc5\xfa\x3f\x91\x2b\xbb\x23"
+ "\xf5\xb2\x7b\xf5\x38\xaf\xb3\xee"
+ "\x63\xdc\x7b\xd1\xff\xaa\x8b\xab"
+ "\x82\x6b\x37\x04\xeb\x74\xbe\x79"
+ "\xb9\x83\x90\xef\x20\x59\x46\xff"
+ "\xe9\x97\x3e\x2f\xee\xb6\x64\x18"
+ "\x38\x4c\x7a\x4a\xf9\x61\xe8\x9a"
+ "\xa1\xb5\x01\xa6\x47\xd3\x11\xd4"
+ "\xce\xd3\x91\x49\x88\xc7\xb8\x4d"
+ "\xb1\xb9\x07\x6d\x16\x72\xae\x46"
+ "\x5e\x03\xa1\x4b\xb6\x02\x30\xa8"
+ "\x3d\xa9\x07\x2a\x7c\x19\xe7\x62"
+ "\x87\xe3\x82\x2f\x6f\xe1\x09\xd9"
+ "\x94\x97\xea\xdd\x58\x9e\xae\x76"
+ "\x7e\x35\xe5\xb4\xda\x7e\xf4\xde"
+ "\xf7\x32\x87\xcd\x93\xbf\x11\x56"
+ "\x11\xbe\x08\x74\xe1\x69\xad\xe2"
+ "\xd7\xf8\x86\x75\x8a\x3c\xa4\xbe"
+ "\x70\xa7\x1b\xfc\x0b\x44\x2a\x76"
+ "\x35\xea\x5d\x85\x81\xaf\x85\xeb"
+ "\xa0\x1c\x61\xc2\xf7\x4f\xa5\xdc"
+ "\x02\x7f\xf6\x95\x40\x6e\x8a\x9a"
+ "\xf3\x5d\x25\x6e\x14\x3a\x22\xc9"
+ "\x37\x1c\xeb\x46\x54\x3f\xa5\x91"
+ "\xc2\xb5\x8c\xfe\x53\x08\x97\x32"
+ "\x1b\xb2\x30\x27\xfe\x25\x5d\xdc"
+ "\x08\x87\xd0\xe5\x94\x1a\xd4\xf1"
+ "\xfe\xd6\xb4\xa3\xe6\x74\x81\x3c"
+ "\x1b\xb7\x31\xa7\x22\xfd\xd4\xdd"
+ "\x20\x4e\x7c\x51\xb0\x60\x73\xb8"
+ "\x9c\xac\x91\x90\x7e\x01\xb0\xe1"
+ "\x8a\x2f\x75\x1c\x53\x2a\x98\x2a"
+ "\x06\x52\x95\x52\xb2\xe9\x25\x2e"
+ "\x4c\xe2\x5a\x00\xb2\x13\x81\x03"
+ "\x77\x66\x0d\xa5\x99\xda\x4e\x8c"
+ "\xac\xf3\x13\x53\x27\x45\xaf\x64"
+ "\x46\xdc\xea\x23\xda\x97\xd1\xab"
+ "\x7d\x6c\x30\x96\x1f\xbc\x06\x34"
+ "\x18\x0b\x5e\x21\x35\x11\x8d\x4c"
+ "\xe0\x2d\xe9\x50\x16\x74\x81\xa8"
+ "\xb4\x34\xb9\x72\x42\xa6\xcc\xbc"
+ "\xca\x34\x83\x27\x10\x5b\x68\x45"
+ "\x8f\x52\x22\x0c\x55\x3d\x29\x7c"
+ "\xe3\xc0\x66\x05\x42\x91\x5f\x58"
+ "\xfe\x4a\x62\xd9\x8c\xa9\x04\x19"
+ "\x04\xa9\x08\x4b\x57\xfc\x67\x53"
+ "\x08\x7c\xbc\x66\x8a\xb0\xb6\x9f"
+ "\x92\xd6\x41\x7c\x5b\x2a\x00\x79"
+ "\x72",
+ .ilen = 1281,
+ .result = "\x45\xe8\xe0\xb6\x9c\xca\xfd\x87"
+ "\xe8\x1d\x37\x96\x8a\xe3\x40\x35"
+ "\xcf\x5e\x3a\x46\x3d\xfb\xd0\x69"
+ "\xde\xaf\x7a\xd5\x0d\xe9\x52\xec"
+ "\xc2\x82\xe5\x3e\x7d\xb2\x4a\xd9"
+ "\xbb\xc3\x9f\xc0\x5d\xac\x93\x8d"
+ "\x0e\x6f\xd3\xd7\xfb\x6a\x0d\xce"
+ "\x92\x2c\xf7\xbb\x93\x57\xcc\xee"
+ "\x42\x72\x6f\xc8\x4b\xd2\x76\xbf"
+ "\xa0\xe3\x7a\x39\xf9\x5c\x8e\xfd"
+ "\xa1\x1d\x41\xe5\x08\xc1\x1c\x11"
+ "\x92\xfd\x39\x5c\x51\xd0\x2f\x66"
+ "\x33\x4a\x71\x15\xfe\xee\x12\x54"
+ "\x8c\x8f\x34\xd8\x50\x3c\x18\xa6"
+ "\xc5\xe1\x46\x8a\xfb\x5f\x7e\x25"
+ "\x9b\xe2\xc3\x66\x41\x2b\xb3\xa5"
+ "\x57\x0e\x94\x17\x26\x39\xbb\x54"
+ "\xae\x2e\x6f\x42\xfb\x4d\x89\x6f"
+ "\x9d\xf1\x16\x2e\xe3\xe7\xfc\xe3"
+ "\xb2\x4b\x2b\xa6\x7c\x04\x69\x3a"
+ "\x70\x5a\xa7\xf1\x31\x64\x19\xca"
+ "\x45\x79\xd8\x58\x23\x61\xaf\xc2"
+ "\x52\x05\xc3\x0b\xc1\x64\x7c\x81"
+ "\xd9\x11\xcf\xff\x02\x3d\x51\x84"
+ "\x01\xac\xc6\x2e\x34\x2b\x09\x3a"
+ "\xa8\x5d\x98\x0e\x89\xd9\xef\x8f"
+ "\xd9\xd7\x7d\xdd\x63\x47\x46\x7d"
+ "\xa1\xda\x0b\x53\x7d\x79\xcd\xc9"
+ "\x86\xdd\x6b\x13\xa1\x9a\x70\xdd"
+ "\x5c\xa1\x69\x3c\xe4\x5d\xe3\x8c"
+ "\xe5\xf4\x87\x9c\x10\xcf\x0f\x0b"
+ "\xc8\x43\xdc\xf8\x1d\x62\x5e\x5b"
+ "\xe2\x03\x06\xc5\x71\xb6\x48\xa5"
+ "\xf0\x0f\x2d\xd5\xa2\x73\x55\x8f"
+ "\x01\xa7\x59\x80\x5f\x11\x6c\x40"
+ "\xff\xb1\xf2\xc6\x7e\x01\xbb\x1c"
+ "\x69\x9c\xc9\x3f\x71\x5f\x07\x7e"
+ "\xdf\x6f\x99\xca\x9c\xfd\xf9\xb9"
+ "\x49\xe7\xcc\x91\xd5\x9b\x8f\x03"
+ "\xae\xe7\x61\x32\xef\x41\x6c\x75"
+ "\x84\x9b\x8c\xce\x1d\x6b\x93\x21"
+ "\x41\xec\xc6\xad\x8e\x0c\x48\xa8"
+ "\xe2\xf5\x57\xde\xf7\x38\xfd\x4a"
+ "\x6f\xa7\x4a\xf9\xac\x7d\xb1\x85"
+ "\x7d\x6c\x95\x0a\x5a\xcf\x68\xd2"
+ "\xe0\x7a\x26\xd9\xc1\x6d\x3e\xc6"
+ "\x37\xbd\xbe\x24\x36\x77\x9f\x1b"
+ "\xc1\x22\xf3\x79\xae\x95\x78\x66"
+ "\x97\x11\xc0\x1a\xf1\xe8\x0d\x38"
+ "\x09\xc2\xee\xb7\xd3\x46\x7b\x59"
+ "\x77\x23\xe8\xb4\x92\x3d\x78\xbe"
+ "\xe2\x25\x63\xa5\x2a\x06\x70\x92"
+ "\x32\x63\xf9\x19\x21\x68\xe1\x0b"
+ "\x9a\xd0\xee\x21\xdb\x1f\xe0\xde"
+ "\x3e\x64\x02\x4d\x0e\xe0\x0a\xa9"
+ "\xed\x19\x8c\xa8\xbf\xe3\x2e\x75"
+ "\x24\x2b\xb0\xe5\x82\x6a\x1e\x6f"
+ "\x71\x2a\x3a\x60\xed\x06\x0d\x17"
+ "\xa2\xdb\x29\x1d\xae\xb2\xc4\xfb"
+ "\x94\x04\xd8\x58\xfc\xc4\x04\x4e"
+ "\xee\xc7\xc1\x0f\xe9\x9b\x63\x2d"
+ "\x02\x3e\x02\x67\xe5\xd8\xbb\x79"
+ "\xdf\xd2\xeb\x50\xe9\x0a\x02\x46"
+ "\xdf\x68\xcf\xe7\x2b\x0a\x56\xd6"
+ "\xf7\xbc\x44\xad\xb8\xb5\x5f\xeb"
+ "\xbc\x74\x6b\xe8\x7e\xb0\x60\xc6"
+ "\x0d\x96\x09\xbb\x19\xba\xe0\x3c"
+ "\xc4\x6c\xbf\x0f\x58\xc0\x55\x62"
+ "\x23\xa0\xff\xb5\x1c\xfd\x18\xe1"
+ "\xcf\x6d\xd3\x52\xb4\xce\xa6\xfa"
+ "\xaa\xfb\x1b\x0b\x42\x6d\x79\x42"
+ "\x48\x70\x5b\x0e\xdd\x3a\xc9\x69"
+ "\x8b\x73\x67\xf6\x95\xdb\x8c\xfb"
+ "\xfd\xb5\x08\x47\x42\x84\x9a\xfa"
+ "\xcc\x67\xb2\x3c\xb6\xfd\xd8\x32"
+ "\xd6\x04\xb6\x4a\xea\x53\x4b\xf5"
+ "\x94\x16\xad\xf0\x10\x2e\x2d\xb4"
+ "\x8b\xab\xe5\x89\xc7\x39\x12\xf3"
+ "\x8d\xb5\x96\x0b\x87\x5d\xa7\x7c"
+ "\xb0\xc2\xf6\x2e\x57\x97\x2c\xdc"
+ "\x54\x1c\x34\x72\xde\x0c\x68\x39"
+ "\x9d\x32\xa5\x75\x92\x13\x32\xea"
+ "\x90\x27\xbd\x5b\x1d\xb9\x21\x02"
+ "\x1c\xcc\xba\x97\x5e\x49\x58\xe8"
+ "\xac\x8b\xf3\xce\x3c\xf0\x00\xe9"
+ "\x6c\xae\xe9\x77\xdf\xf4\x02\xcd"
+ "\x55\x25\x89\x9e\x90\xf3\x6b\x8f"
+ "\xb7\xd6\x47\x98\x26\x2f\x31\x2f"
+ "\x8d\xbf\x54\xcd\x99\xeb\x80\xd7"
+ "\xac\xc3\x08\xc2\xa6\x32\xf1\x24"
+ "\x76\x7c\x4f\x78\x53\x55\xfb\x00"
+ "\x8a\xd6\x52\x53\x25\x45\xfb\x0a"
+ "\x6b\xb9\xbe\x3c\x5e\x11\xcc\x6a"
+ "\xdd\xfc\xa7\xc4\x79\x4d\xbd\xfb"
+ "\xce\x3a\xf1\x7a\xda\xeb\xfe\x64"
+ "\x28\x3d\x0f\xee\x80\xba\x0c\xf8"
+ "\xe9\x5b\x3a\xd4\xae\xc9\xf3\x0e"
+ "\xe8\x5d\xc5\x5c\x0b\x20\x20\xee"
+ "\x40\x0d\xde\x07\xa7\x14\xb4\x90"
+ "\xb6\xbd\x3b\xae\x7d\x2b\xa7\xc7"
+ "\xdc\x0b\x4c\x5d\x65\xb0\xd2\xc5"
+ "\x79\x61\x23\xe0\xa2\x99\x73\x55"
+ "\xad\xc6\xfb\xc7\x54\xb5\x98\x1f"
+ "\x8c\x86\xc2\x3f\xbe\x5e\xea\x64"
+ "\xa3\x60\x18\x9f\x80\xaf\x52\x74"
+ "\x1a\xfe\x22\xc2\x92\x67\x40\x02"
+ "\x08\xee\x67\x5b\x67\xe0\x3d\xde"
+ "\x7a\xaf\x8e\x28\xf3\x5e\x0e\xf4"
+ "\x48\x56\xaa\x85\x22\xd8\x36\xed"
+ "\x3b\x3d\x68\x69\x30\xbc\x71\x23"
+ "\xb1\x6e\x61\x03\x89\x44\x03\xf4"
+ "\x32\xaa\x4c\x40\x9f\x69\xfb\x70"
+ "\x91\xcc\x1f\x11\xbd\x76\x67\xe6"
+ "\x10\x8b\x29\x39\x68\xea\x4e\x6d"
+ "\xae\xfb\x40\xcf\xe2\xd0\x0d\x8d"
+ "\x6f\xed\x9b\x8d\x64\x7a\x94\x8e"
+ "\x32\x38\x78\xeb\x7d\x5f\xf9\x4d"
+ "\x13\xbe\x21\xea\x16\xe7\x5c\xee"
+ "\xcd\xf6\x5f\xc6\x45\xb2\x8f\x2b"
+ "\xb5\x93\x3e\x45\xdb\xfd\xa2\x6a"
+ "\xec\x83\x92\x99\x87\x47\xe0\x7c"
+ "\xa2\x7b\xc4\x2a\xcd\xc0\x81\x03"
+ "\x98\xb0\x87\xb6\x86\x13\x64\x33"
+ "\x4c\xd7\x99\xbf\xdb\x7b\x6e\xaa"
+ "\x76\xcc\xa0\x74\x1b\xa3\x6e\x83"
+ "\xd4\xba\x7a\x84\x9d\x91\x71\xcd"
+ "\x60\x2d\x56\xfd\x26\x35\xcb\xeb"
+ "\xac\xe9\xee\xa4\xfc\x18\x5b\x91"
+ "\xd5\xfe\x84\x45\xe0\xc7\xfd\x11"
+ "\xe9\x00\xb6\x54\xdf\xe1\x94\xde"
+ "\x2b\x70\x9f\x94\x7f\x15\x0e\x83"
+ "\x63\x10\xb3\xf5\xea\xd3\xe8\xd1"
+ "\xa5\xfc\x17\x19\x68\x9a\xbc\x17"
+ "\x30\x43\x0a\x1a\x33\x92\xd4\x2a"
+ "\x2e\x68\x99\xbc\x49\xf0\x68\xe3"
+ "\xf0\x1f\xcb\xcc\xfa\xbb\x05\x56"
+ "\x46\x84\x8b\x69\x83\x64\xc5\xe0"
+ "\xc5\x52\x99\x07\x3c\xa6\x5c\xaf"
+ "\xa3\xde\xd7\xdb\x43\xe6\xb7\x76"
+ "\x4e\x4d\xd6\x71\x60\x63\x4a\x0c"
+ "\x5f\xae\x25\x84\x22\x90\x5f\x26"
+ "\x61\x4d\x8f\xaf\xc9\x22\xf2\x05"
+ "\xcf\xc1\xdc\x68\xe5\x57\x8e\x24"
+ "\x1b\x30\x59\xca\xd7\x0d\xc3\xd3"
+ "\x52\x9e\x09\x3e\x0e\xaf\xdb\x5f"
+ "\xc7\x2b\xde\x3a\xfd\xad\x93\x04"
+ "\x74\x06\x89\x0e\x90\xeb\x85\xff"
+ "\xe6\x3c\x12\x42\xf4\xfa\x80\x75"
+ "\x5e\x4e\xd7\x2f\x93\x0b\x34\x41"
+ "\x02\x85\x68\xd0\x03\x12\xde\x92"
+ "\x54\x7a\x7e\xfb\x55\xe7\x88\xfb"
+ "\xa4\xa9\xf2\xd1\xc6\x70\x06\x37"
+ "\x25\xee\xa7\x6e\xd9\x89\x86\x50"
+ "\x2e\x07\xdb\xfb\x2a\x86\x45\x0e"
+ "\x91\xf4\x7c\xbb\x12\x60\xe8\x3f"
+ "\x71\xbe\x8f\x9d\x26\xef\xd9\x89"
+ "\xc4\x8f\xd8\xc5\x73\xd8\x84\xaa"
+ "\x2f\xad\x22\x1e\x7e\xcf\xa2\x08"
+ "\x23\x45\x89\x42\xa0\x30\xeb\xbf"
+ "\xa1\xed\xad\xd5\x76\xfa\x24\x8f"
+ "\x98",
+ .rlen = 1281,
},
};

--
1.9.1

2015-07-16 17:14:22

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 05/10] crypto: chacha20 - Add an eight block AVX2 variant for x86_64

Extends the x86_64 ChaCha20 implementation by a function processing eight
ChaCha20 blocks in parallel using AVX2.

For large messages, throughput increases by ~55-70% compared to four block
SSSE3:

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 41999675 operations in 10 seconds (671994800 bytes)
test 1 (256 bit key, 64 byte blocks): 45805908 operations in 10 seconds (2931578112 bytes)
test 2 (256 bit key, 256 byte blocks): 32814947 operations in 10 seconds (8400626432 bytes)
test 3 (256 bit key, 1024 byte blocks): 19777167 operations in 10 seconds (20251819008 bytes)
test 4 (256 bit key, 8192 byte blocks): 2279321 operations in 10 seconds (18672197632 bytes)

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 1 +
arch/x86/crypto/chacha20-avx2-x86_64.S | 443 +++++++++++++++++++++++++++++++++
arch/x86/crypto/chacha20_glue.c | 19 ++
crypto/Kconfig | 2 +-
4 files changed, 464 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index b09e9a4..ce39b3c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -77,6 +77,7 @@ endif

ifeq ($(avx2_supported),yes)
camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o
+ chacha20-x86_64-y += chacha20-avx2-x86_64.o
serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
endif

diff --git a/arch/x86/crypto/chacha20-avx2-x86_64.S b/arch/x86/crypto/chacha20-avx2-x86_64.S
new file mode 100644
index 0000000..16694e6
--- /dev/null
+++ b/arch/x86/crypto/chacha20-avx2-x86_64.S
@@ -0,0 +1,443 @@
+/*
+ * ChaCha20 256-bit cipher algorithm, RFC7539, x64 AVX2 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 32
+
+ROT8: .octa 0x0e0d0c0f0a09080b0605040702010003
+ .octa 0x0e0d0c0f0a09080b0605040702010003
+ROT16: .octa 0x0d0c0f0e09080b0a0504070601000302
+ .octa 0x0d0c0f0e09080b0a0504070601000302
+CTRINC: .octa 0x00000003000000020000000100000000
+ .octa 0x00000007000000060000000500000004
+
+.text
+
+ENTRY(chacha20_8block_xor_avx2)
+ # %rdi: Input state matrix, s
+ # %rsi: 8 data blocks output, o
+ # %rdx: 8 data blocks input, i
+
+ # This function encrypts eight consecutive ChaCha20 blocks by loading
+ # the state matrix in AVX registers eight times. As we need some
+ # scratch registers, we save the first four registers on the stack. The
+ # algorithm performs each operation on the corresponding word of each
+ # state matrix, hence requires no word shuffling. For final XORing step
+ # we transpose the matrix by interleaving 32-, 64- and then 128-bit
+ # words, which allows us to do XOR in AVX registers. 8/16-bit word
+ # rotation is done with the slightly better performing byte shuffling,
+ # 7/12-bit word rotation uses traditional shift+OR.
+
+ vzeroupper
+ # 4 * 32 byte stack, 32-byte aligned
+ mov %rsp, %r8
+ and $~31, %rsp
+ sub $0x80, %rsp
+
+ # x0..15[0-7] = s[0..15]
+ vpbroadcastd 0x00(%rdi),%ymm0
+ vpbroadcastd 0x04(%rdi),%ymm1
+ vpbroadcastd 0x08(%rdi),%ymm2
+ vpbroadcastd 0x0c(%rdi),%ymm3
+ vpbroadcastd 0x10(%rdi),%ymm4
+ vpbroadcastd 0x14(%rdi),%ymm5
+ vpbroadcastd 0x18(%rdi),%ymm6
+ vpbroadcastd 0x1c(%rdi),%ymm7
+ vpbroadcastd 0x20(%rdi),%ymm8
+ vpbroadcastd 0x24(%rdi),%ymm9
+ vpbroadcastd 0x28(%rdi),%ymm10
+ vpbroadcastd 0x2c(%rdi),%ymm11
+ vpbroadcastd 0x30(%rdi),%ymm12
+ vpbroadcastd 0x34(%rdi),%ymm13
+ vpbroadcastd 0x38(%rdi),%ymm14
+ vpbroadcastd 0x3c(%rdi),%ymm15
+ # x0..3 on stack
+ vmovdqa %ymm0,0x00(%rsp)
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa %ymm2,0x40(%rsp)
+ vmovdqa %ymm3,0x60(%rsp)
+
+ vmovdqa CTRINC(%rip),%ymm1
+ vmovdqa ROT8(%rip),%ymm2
+ vmovdqa ROT16(%rip),%ymm3
+
+ # x12 += counter values 0-3
+ vpaddd %ymm1,%ymm12,%ymm12
+
+ mov $10,%ecx
+
+.Ldoubleround8:
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 16)
+ vpaddd 0x00(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm3,%ymm12,%ymm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 16)
+ vpaddd 0x20(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm3,%ymm13,%ymm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 16)
+ vpaddd 0x40(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm3,%ymm14,%ymm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 16)
+ vpaddd 0x60(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm3,%ymm15,%ymm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 12)
+ vpaddd %ymm12,%ymm8,%ymm8
+ vpxor %ymm8,%ymm4,%ymm4
+ vpslld $12,%ymm4,%ymm0
+ vpsrld $20,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 12)
+ vpaddd %ymm13,%ymm9,%ymm9
+ vpxor %ymm9,%ymm5,%ymm5
+ vpslld $12,%ymm5,%ymm0
+ vpsrld $20,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 12)
+ vpaddd %ymm14,%ymm10,%ymm10
+ vpxor %ymm10,%ymm6,%ymm6
+ vpslld $12,%ymm6,%ymm0
+ vpsrld $20,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 12)
+ vpaddd %ymm15,%ymm11,%ymm11
+ vpxor %ymm11,%ymm7,%ymm7
+ vpslld $12,%ymm7,%ymm0
+ vpsrld $20,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+
+ # x0 += x4, x12 = rotl32(x12 ^ x0, 8)
+ vpaddd 0x00(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm2,%ymm12,%ymm12
+ # x1 += x5, x13 = rotl32(x13 ^ x1, 8)
+ vpaddd 0x20(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm2,%ymm13,%ymm13
+ # x2 += x6, x14 = rotl32(x14 ^ x2, 8)
+ vpaddd 0x40(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm2,%ymm14,%ymm14
+ # x3 += x7, x15 = rotl32(x15 ^ x3, 8)
+ vpaddd 0x60(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm2,%ymm15,%ymm15
+
+ # x8 += x12, x4 = rotl32(x4 ^ x8, 7)
+ vpaddd %ymm12,%ymm8,%ymm8
+ vpxor %ymm8,%ymm4,%ymm4
+ vpslld $7,%ymm4,%ymm0
+ vpsrld $25,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+ # x9 += x13, x5 = rotl32(x5 ^ x9, 7)
+ vpaddd %ymm13,%ymm9,%ymm9
+ vpxor %ymm9,%ymm5,%ymm5
+ vpslld $7,%ymm5,%ymm0
+ vpsrld $25,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x10 += x14, x6 = rotl32(x6 ^ x10, 7)
+ vpaddd %ymm14,%ymm10,%ymm10
+ vpxor %ymm10,%ymm6,%ymm6
+ vpslld $7,%ymm6,%ymm0
+ vpsrld $25,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x11 += x15, x7 = rotl32(x7 ^ x11, 7)
+ vpaddd %ymm15,%ymm11,%ymm11
+ vpxor %ymm11,%ymm7,%ymm7
+ vpslld $7,%ymm7,%ymm0
+ vpsrld $25,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 16)
+ vpaddd 0x00(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm3,%ymm15,%ymm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 16)%ymm0
+ vpaddd 0x20(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm3,%ymm12,%ymm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 16)
+ vpaddd 0x40(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm3,%ymm13,%ymm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 16)
+ vpaddd 0x60(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm3,%ymm14,%ymm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 12)
+ vpaddd %ymm15,%ymm10,%ymm10
+ vpxor %ymm10,%ymm5,%ymm5
+ vpslld $12,%ymm5,%ymm0
+ vpsrld $20,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 12)
+ vpaddd %ymm12,%ymm11,%ymm11
+ vpxor %ymm11,%ymm6,%ymm6
+ vpslld $12,%ymm6,%ymm0
+ vpsrld $20,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 12)
+ vpaddd %ymm13,%ymm8,%ymm8
+ vpxor %ymm8,%ymm7,%ymm7
+ vpslld $12,%ymm7,%ymm0
+ vpsrld $20,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 12)
+ vpaddd %ymm14,%ymm9,%ymm9
+ vpxor %ymm9,%ymm4,%ymm4
+ vpslld $12,%ymm4,%ymm0
+ vpsrld $20,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+
+ # x0 += x5, x15 = rotl32(x15 ^ x0, 8)
+ vpaddd 0x00(%rsp),%ymm5,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpxor %ymm0,%ymm15,%ymm15
+ vpshufb %ymm2,%ymm15,%ymm15
+ # x1 += x6, x12 = rotl32(x12 ^ x1, 8)
+ vpaddd 0x20(%rsp),%ymm6,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpxor %ymm0,%ymm12,%ymm12
+ vpshufb %ymm2,%ymm12,%ymm12
+ # x2 += x7, x13 = rotl32(x13 ^ x2, 8)
+ vpaddd 0x40(%rsp),%ymm7,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpxor %ymm0,%ymm13,%ymm13
+ vpshufb %ymm2,%ymm13,%ymm13
+ # x3 += x4, x14 = rotl32(x14 ^ x3, 8)
+ vpaddd 0x60(%rsp),%ymm4,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpxor %ymm0,%ymm14,%ymm14
+ vpshufb %ymm2,%ymm14,%ymm14
+
+ # x10 += x15, x5 = rotl32(x5 ^ x10, 7)
+ vpaddd %ymm15,%ymm10,%ymm10
+ vpxor %ymm10,%ymm5,%ymm5
+ vpslld $7,%ymm5,%ymm0
+ vpsrld $25,%ymm5,%ymm5
+ vpor %ymm0,%ymm5,%ymm5
+ # x11 += x12, x6 = rotl32(x6 ^ x11, 7)
+ vpaddd %ymm12,%ymm11,%ymm11
+ vpxor %ymm11,%ymm6,%ymm6
+ vpslld $7,%ymm6,%ymm0
+ vpsrld $25,%ymm6,%ymm6
+ vpor %ymm0,%ymm6,%ymm6
+ # x8 += x13, x7 = rotl32(x7 ^ x8, 7)
+ vpaddd %ymm13,%ymm8,%ymm8
+ vpxor %ymm8,%ymm7,%ymm7
+ vpslld $7,%ymm7,%ymm0
+ vpsrld $25,%ymm7,%ymm7
+ vpor %ymm0,%ymm7,%ymm7
+ # x9 += x14, x4 = rotl32(x4 ^ x9, 7)
+ vpaddd %ymm14,%ymm9,%ymm9
+ vpxor %ymm9,%ymm4,%ymm4
+ vpslld $7,%ymm4,%ymm0
+ vpsrld $25,%ymm4,%ymm4
+ vpor %ymm0,%ymm4,%ymm4
+
+ dec %ecx
+ jnz .Ldoubleround8
+
+ # x0..15[0-3] += s[0..15]
+ vpbroadcastd 0x00(%rdi),%ymm0
+ vpaddd 0x00(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x00(%rsp)
+ vpbroadcastd 0x04(%rdi),%ymm0
+ vpaddd 0x20(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x20(%rsp)
+ vpbroadcastd 0x08(%rdi),%ymm0
+ vpaddd 0x40(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x40(%rsp)
+ vpbroadcastd 0x0c(%rdi),%ymm0
+ vpaddd 0x60(%rsp),%ymm0,%ymm0
+ vmovdqa %ymm0,0x60(%rsp)
+ vpbroadcastd 0x10(%rdi),%ymm0
+ vpaddd %ymm0,%ymm4,%ymm4
+ vpbroadcastd 0x14(%rdi),%ymm0
+ vpaddd %ymm0,%ymm5,%ymm5
+ vpbroadcastd 0x18(%rdi),%ymm0
+ vpaddd %ymm0,%ymm6,%ymm6
+ vpbroadcastd 0x1c(%rdi),%ymm0
+ vpaddd %ymm0,%ymm7,%ymm7
+ vpbroadcastd 0x20(%rdi),%ymm0
+ vpaddd %ymm0,%ymm8,%ymm8
+ vpbroadcastd 0x24(%rdi),%ymm0
+ vpaddd %ymm0,%ymm9,%ymm9
+ vpbroadcastd 0x28(%rdi),%ymm0
+ vpaddd %ymm0,%ymm10,%ymm10
+ vpbroadcastd 0x2c(%rdi),%ymm0
+ vpaddd %ymm0,%ymm11,%ymm11
+ vpbroadcastd 0x30(%rdi),%ymm0
+ vpaddd %ymm0,%ymm12,%ymm12
+ vpbroadcastd 0x34(%rdi),%ymm0
+ vpaddd %ymm0,%ymm13,%ymm13
+ vpbroadcastd 0x38(%rdi),%ymm0
+ vpaddd %ymm0,%ymm14,%ymm14
+ vpbroadcastd 0x3c(%rdi),%ymm0
+ vpaddd %ymm0,%ymm15,%ymm15
+
+ # x12 += counter values 0-3
+ vpaddd %ymm1,%ymm12,%ymm12
+
+ # interleave 32-bit words in state n, n+1
+ vmovdqa 0x00(%rsp),%ymm0
+ vmovdqa 0x20(%rsp),%ymm1
+ vpunpckldq %ymm1,%ymm0,%ymm2
+ vpunpckhdq %ymm1,%ymm0,%ymm1
+ vmovdqa %ymm2,0x00(%rsp)
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa 0x40(%rsp),%ymm0
+ vmovdqa 0x60(%rsp),%ymm1
+ vpunpckldq %ymm1,%ymm0,%ymm2
+ vpunpckhdq %ymm1,%ymm0,%ymm1
+ vmovdqa %ymm2,0x40(%rsp)
+ vmovdqa %ymm1,0x60(%rsp)
+ vmovdqa %ymm4,%ymm0
+ vpunpckldq %ymm5,%ymm0,%ymm4
+ vpunpckhdq %ymm5,%ymm0,%ymm5
+ vmovdqa %ymm6,%ymm0
+ vpunpckldq %ymm7,%ymm0,%ymm6
+ vpunpckhdq %ymm7,%ymm0,%ymm7
+ vmovdqa %ymm8,%ymm0
+ vpunpckldq %ymm9,%ymm0,%ymm8
+ vpunpckhdq %ymm9,%ymm0,%ymm9
+ vmovdqa %ymm10,%ymm0
+ vpunpckldq %ymm11,%ymm0,%ymm10
+ vpunpckhdq %ymm11,%ymm0,%ymm11
+ vmovdqa %ymm12,%ymm0
+ vpunpckldq %ymm13,%ymm0,%ymm12
+ vpunpckhdq %ymm13,%ymm0,%ymm13
+ vmovdqa %ymm14,%ymm0
+ vpunpckldq %ymm15,%ymm0,%ymm14
+ vpunpckhdq %ymm15,%ymm0,%ymm15
+
+ # interleave 64-bit words in state n, n+2
+ vmovdqa 0x00(%rsp),%ymm0
+ vmovdqa 0x40(%rsp),%ymm2
+ vpunpcklqdq %ymm2,%ymm0,%ymm1
+ vpunpckhqdq %ymm2,%ymm0,%ymm2
+ vmovdqa %ymm1,0x00(%rsp)
+ vmovdqa %ymm2,0x40(%rsp)
+ vmovdqa 0x20(%rsp),%ymm0
+ vmovdqa 0x60(%rsp),%ymm2
+ vpunpcklqdq %ymm2,%ymm0,%ymm1
+ vpunpckhqdq %ymm2,%ymm0,%ymm2
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa %ymm2,0x60(%rsp)
+ vmovdqa %ymm4,%ymm0
+ vpunpcklqdq %ymm6,%ymm0,%ymm4
+ vpunpckhqdq %ymm6,%ymm0,%ymm6
+ vmovdqa %ymm5,%ymm0
+ vpunpcklqdq %ymm7,%ymm0,%ymm5
+ vpunpckhqdq %ymm7,%ymm0,%ymm7
+ vmovdqa %ymm8,%ymm0
+ vpunpcklqdq %ymm10,%ymm0,%ymm8
+ vpunpckhqdq %ymm10,%ymm0,%ymm10
+ vmovdqa %ymm9,%ymm0
+ vpunpcklqdq %ymm11,%ymm0,%ymm9
+ vpunpckhqdq %ymm11,%ymm0,%ymm11
+ vmovdqa %ymm12,%ymm0
+ vpunpcklqdq %ymm14,%ymm0,%ymm12
+ vpunpckhqdq %ymm14,%ymm0,%ymm14
+ vmovdqa %ymm13,%ymm0
+ vpunpcklqdq %ymm15,%ymm0,%ymm13
+ vpunpckhqdq %ymm15,%ymm0,%ymm15
+
+ # interleave 128-bit words in state n, n+4
+ vmovdqa 0x00(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm4,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm4,%ymm0,%ymm4
+ vmovdqa %ymm1,0x00(%rsp)
+ vmovdqa 0x20(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm5,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm5,%ymm0,%ymm5
+ vmovdqa %ymm1,0x20(%rsp)
+ vmovdqa 0x40(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm6,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm6,%ymm0,%ymm6
+ vmovdqa %ymm1,0x40(%rsp)
+ vmovdqa 0x60(%rsp),%ymm0
+ vperm2i128 $0x20,%ymm7,%ymm0,%ymm1
+ vperm2i128 $0x31,%ymm7,%ymm0,%ymm7
+ vmovdqa %ymm1,0x60(%rsp)
+ vperm2i128 $0x20,%ymm12,%ymm8,%ymm0
+ vperm2i128 $0x31,%ymm12,%ymm8,%ymm12
+ vmovdqa %ymm0,%ymm8
+ vperm2i128 $0x20,%ymm13,%ymm9,%ymm0
+ vperm2i128 $0x31,%ymm13,%ymm9,%ymm13
+ vmovdqa %ymm0,%ymm9
+ vperm2i128 $0x20,%ymm14,%ymm10,%ymm0
+ vperm2i128 $0x31,%ymm14,%ymm10,%ymm14
+ vmovdqa %ymm0,%ymm10
+ vperm2i128 $0x20,%ymm15,%ymm11,%ymm0
+ vperm2i128 $0x31,%ymm15,%ymm11,%ymm15
+ vmovdqa %ymm0,%ymm11
+
+ # xor with corresponding input, write to output
+ vmovdqa 0x00(%rsp),%ymm0
+ vpxor 0x0000(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x0000(%rsi)
+ vmovdqa 0x20(%rsp),%ymm0
+ vpxor 0x0080(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x0080(%rsi)
+ vmovdqa 0x40(%rsp),%ymm0
+ vpxor 0x0040(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x0040(%rsi)
+ vmovdqa 0x60(%rsp),%ymm0
+ vpxor 0x00c0(%rdx),%ymm0,%ymm0
+ vmovdqu %ymm0,0x00c0(%rsi)
+ vpxor 0x0100(%rdx),%ymm4,%ymm4
+ vmovdqu %ymm4,0x0100(%rsi)
+ vpxor 0x0180(%rdx),%ymm5,%ymm5
+ vmovdqu %ymm5,0x00180(%rsi)
+ vpxor 0x0140(%rdx),%ymm6,%ymm6
+ vmovdqu %ymm6,0x0140(%rsi)
+ vpxor 0x01c0(%rdx),%ymm7,%ymm7
+ vmovdqu %ymm7,0x01c0(%rsi)
+ vpxor 0x0020(%rdx),%ymm8,%ymm8
+ vmovdqu %ymm8,0x0020(%rsi)
+ vpxor 0x00a0(%rdx),%ymm9,%ymm9
+ vmovdqu %ymm9,0x00a0(%rsi)
+ vpxor 0x0060(%rdx),%ymm10,%ymm10
+ vmovdqu %ymm10,0x0060(%rsi)
+ vpxor 0x00e0(%rdx),%ymm11,%ymm11
+ vmovdqu %ymm11,0x00e0(%rsi)
+ vpxor 0x0120(%rdx),%ymm12,%ymm12
+ vmovdqu %ymm12,0x0120(%rsi)
+ vpxor 0x01a0(%rdx),%ymm13,%ymm13
+ vmovdqu %ymm13,0x01a0(%rsi)
+ vpxor 0x0160(%rdx),%ymm14,%ymm14
+ vmovdqu %ymm14,0x0160(%rsi)
+ vpxor 0x01e0(%rdx),%ymm15,%ymm15
+ vmovdqu %ymm15,0x01e0(%rsi)
+
+ vzeroupper
+ mov %r8,%rsp
+ ret
+ENDPROC(chacha20_8block_xor_avx2)
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 4d677c3..effe216 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -21,12 +21,27 @@

asmlinkage void chacha20_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src);
+#ifdef CONFIG_AS_AVX2
+asmlinkage void chacha20_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src);
+static bool chacha20_use_avx2;
+#endif

static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
unsigned int bytes)
{
u8 buf[CHACHA20_BLOCK_SIZE];

+#ifdef CONFIG_AS_AVX2
+ if (chacha20_use_avx2) {
+ while (bytes >= CHACHA20_BLOCK_SIZE * 8) {
+ chacha20_8block_xor_avx2(state, dst, src);
+ bytes -= CHACHA20_BLOCK_SIZE * 8;
+ src += CHACHA20_BLOCK_SIZE * 8;
+ dst += CHACHA20_BLOCK_SIZE * 8;
+ state[12] += 8;
+ }
+ }
+#endif
while (bytes >= CHACHA20_BLOCK_SIZE * 4) {
chacha20_4block_xor_ssse3(state, dst, src);
bytes -= CHACHA20_BLOCK_SIZE * 4;
@@ -113,6 +128,10 @@ static int __init chacha20_simd_mod_init(void)
if (!cpu_has_ssse3)
return -ENODEV;

+#ifdef CONFIG_AS_AVX2
+ chacha20_use_avx2 = cpu_has_avx && cpu_has_avx2 &&
+ cpu_has_xfeatures(XSTATE_SSE | XSTATE_YMM, NULL);
+#endif
return crypto_register_alg(&alg);
}

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 8f24185..82caab0 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1214,7 +1214,7 @@ config CRYPTO_CHACHA20
<http://cr.yp.to/chacha/chacha-20080128.pdf>

config CRYPTO_CHACHA20_X86_64
- tristate "ChaCha20 cipher algorithm (x86_64/SSSE3)"
+ tristate "ChaCha20 cipher algorithm (x86_64/SSSE3/AVX2)"
depends on X86 && 64BIT
select CRYPTO_BLKCIPHER
select CRYPTO_CHACHA20
--
1.9.1

2015-07-16 17:14:51

by Martin Willi

[permalink] [raw]
Subject: [PATCH v2 10/10] crypto: poly1305 - Add a four block AVX2 variant for x86_64

Extends the x86_64 Poly1305 authenticator by a function processing four
consecutive Poly1305 blocks in parallel using AVX2 instructions.

For large messages, throughput increases by ~15-45% compared to two
block SSE2:

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3809514 opers/sec, 365713411 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5973423 opers/sec, 573448627 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9446779 opers/sec, 906890803 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1364814 opers/sec, 393066691 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2045780 opers/sec, 589184697 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3711946 opers/sec, 1069040592 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 573686 opers/sec, 605812732 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1647802 opers/sec, 1740079440 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 292970 opers/sec, 609378224 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 943229 opers/sec, 1961916528 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 494623 opers/sec, 2041804569 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 254045 opers/sec, 2089271014 bytes/sec

testing speed of poly1305 (poly1305-simd)
test 0 ( 96 byte blocks, 16 bytes per update, 6 updates): 3826224 opers/sec, 367317552 bytes/sec
test 1 ( 96 byte blocks, 32 bytes per update, 3 updates): 5948638 opers/sec, 571069267 bytes/sec
test 2 ( 96 byte blocks, 96 bytes per update, 1 updates): 9439110 opers/sec, 906154627 bytes/sec
test 3 ( 288 byte blocks, 16 bytes per update, 18 updates): 1367756 opers/sec, 393913872 bytes/sec
test 4 ( 288 byte blocks, 32 bytes per update, 9 updates): 2056881 opers/sec, 592381958 bytes/sec
test 5 ( 288 byte blocks, 288 bytes per update, 1 updates): 3711153 opers/sec, 1068812179 bytes/sec
test 6 ( 1056 byte blocks, 32 bytes per update, 33 updates): 574940 opers/sec, 607136745 bytes/sec
test 7 ( 1056 byte blocks, 1056 bytes per update, 1 updates): 1948830 opers/sec, 2057964585 bytes/sec
test 8 ( 2080 byte blocks, 32 bytes per update, 65 updates): 293308 opers/sec, 610082096 bytes/sec
test 9 ( 2080 byte blocks, 2080 bytes per update, 1 updates): 1235224 opers/sec, 2569267792 bytes/sec
test 10 ( 4128 byte blocks, 4128 bytes per update, 1 updates): 684405 opers/sec, 2825226316 bytes/sec
test 11 ( 8224 byte blocks, 8224 bytes per update, 1 updates): 367101 opers/sec, 3019039446 bytes/sec

Benchmark results from a Core i5-4670T.

Signed-off-by: Martin Willi <[email protected]>
---
arch/x86/crypto/Makefile | 1 +
arch/x86/crypto/poly1305-avx2-x86_64.S | 386 +++++++++++++++++++++++++++++++++
arch/x86/crypto/poly1305_glue.c | 40 ++++
crypto/Kconfig | 2 +-
4 files changed, 428 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/crypto/poly1305-avx2-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 5cf405c..9a2838c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -89,6 +89,7 @@ sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
poly1305-x86_64-y := poly1305-sse2-x86_64.o poly1305_glue.o
ifeq ($(avx2_supported),yes)
sha1-ssse3-y += sha1_avx2_x86_64_asm.o
+poly1305-x86_64-y += poly1305-avx2-x86_64.o
endif
crc32c-intel-y := crc32c-intel_glue.o
crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o
diff --git a/arch/x86/crypto/poly1305-avx2-x86_64.S b/arch/x86/crypto/poly1305-avx2-x86_64.S
new file mode 100644
index 0000000..eff2f41
--- /dev/null
+++ b/arch/x86/crypto/poly1305-avx2-x86_64.S
@@ -0,0 +1,386 @@
+/*
+ * Poly1305 authenticator algorithm, RFC7539, x64 AVX2 functions
+ *
+ * Copyright (C) 2015 Martin Willi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/linkage.h>
+
+.data
+.align 32
+
+ANMASK: .octa 0x0000000003ffffff0000000003ffffff
+ .octa 0x0000000003ffffff0000000003ffffff
+ORMASK: .octa 0x00000000010000000000000001000000
+ .octa 0x00000000010000000000000001000000
+
+.text
+
+#define h0 0x00(%rdi)
+#define h1 0x04(%rdi)
+#define h2 0x08(%rdi)
+#define h3 0x0c(%rdi)
+#define h4 0x10(%rdi)
+#define r0 0x00(%rdx)
+#define r1 0x04(%rdx)
+#define r2 0x08(%rdx)
+#define r3 0x0c(%rdx)
+#define r4 0x10(%rdx)
+#define u0 0x00(%r8)
+#define u1 0x04(%r8)
+#define u2 0x08(%r8)
+#define u3 0x0c(%r8)
+#define u4 0x10(%r8)
+#define w0 0x14(%r8)
+#define w1 0x18(%r8)
+#define w2 0x1c(%r8)
+#define w3 0x20(%r8)
+#define w4 0x24(%r8)
+#define y0 0x28(%r8)
+#define y1 0x2c(%r8)
+#define y2 0x30(%r8)
+#define y3 0x34(%r8)
+#define y4 0x38(%r8)
+#define m %rsi
+#define hc0 %ymm0
+#define hc1 %ymm1
+#define hc2 %ymm2
+#define hc3 %ymm3
+#define hc4 %ymm4
+#define hc0x %xmm0
+#define hc1x %xmm1
+#define hc2x %xmm2
+#define hc3x %xmm3
+#define hc4x %xmm4
+#define t1 %ymm5
+#define t2 %ymm6
+#define t1x %xmm5
+#define t2x %xmm6
+#define ruwy0 %ymm7
+#define ruwy1 %ymm8
+#define ruwy2 %ymm9
+#define ruwy3 %ymm10
+#define ruwy4 %ymm11
+#define ruwy0x %xmm7
+#define ruwy1x %xmm8
+#define ruwy2x %xmm9
+#define ruwy3x %xmm10
+#define ruwy4x %xmm11
+#define svxz1 %ymm12
+#define svxz2 %ymm13
+#define svxz3 %ymm14
+#define svxz4 %ymm15
+#define d0 %r9
+#define d1 %r10
+#define d2 %r11
+#define d3 %r12
+#define d4 %r13
+
+ENTRY(poly1305_4block_avx2)
+ # %rdi: Accumulator h[5]
+ # %rsi: 64 byte input block m
+ # %rdx: Poly1305 key r[5]
+ # %rcx: Quadblock count
+ # %r8: Poly1305 derived key r^2 u[5], r^3 w[5], r^4 y[5],
+
+ # This four-block variant uses loop unrolled block processing. It
+ # requires 4 Poly1305 keys: r, r^2, r^3 and r^4:
+ # h = (h + m) * r => h = (h + m1) * r^4 + m2 * r^3 + m3 * r^2 + m4 * r
+
+ vzeroupper
+ push %rbx
+ push %r12
+ push %r13
+
+ # combine r0,u0,w0,y0
+ vmovd y0,ruwy0x
+ vmovd w0,t1x
+ vpunpcklqdq t1,ruwy0,ruwy0
+ vmovd u0,t1x
+ vmovd r0,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy0,ruwy0
+
+ # combine r1,u1,w1,y1 and s1=r1*5,v1=u1*5,x1=w1*5,z1=y1*5
+ vmovd y1,ruwy1x
+ vmovd w1,t1x
+ vpunpcklqdq t1,ruwy1,ruwy1
+ vmovd u1,t1x
+ vmovd r1,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy1,ruwy1
+ vpslld $2,ruwy1,svxz1
+ vpaddd ruwy1,svxz1,svxz1
+
+ # combine r2,u2,w2,y2 and s2=r2*5,v2=u2*5,x2=w2*5,z2=y2*5
+ vmovd y2,ruwy2x
+ vmovd w2,t1x
+ vpunpcklqdq t1,ruwy2,ruwy2
+ vmovd u2,t1x
+ vmovd r2,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy2,ruwy2
+ vpslld $2,ruwy2,svxz2
+ vpaddd ruwy2,svxz2,svxz2
+
+ # combine r3,u3,w3,y3 and s3=r3*5,v3=u3*5,x3=w3*5,z3=y3*5
+ vmovd y3,ruwy3x
+ vmovd w3,t1x
+ vpunpcklqdq t1,ruwy3,ruwy3
+ vmovd u3,t1x
+ vmovd r3,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy3,ruwy3
+ vpslld $2,ruwy3,svxz3
+ vpaddd ruwy3,svxz3,svxz3
+
+ # combine r4,u4,w4,y4 and s4=r4*5,v4=u4*5,x4=w4*5,z4=y4*5
+ vmovd y4,ruwy4x
+ vmovd w4,t1x
+ vpunpcklqdq t1,ruwy4,ruwy4
+ vmovd u4,t1x
+ vmovd r4,t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,ruwy4,ruwy4
+ vpslld $2,ruwy4,svxz4
+ vpaddd ruwy4,svxz4,svxz4
+
+.Ldoblock4:
+ # hc0 = [m[48-51] & 0x3ffffff, m[32-35] & 0x3ffffff,
+ # m[16-19] & 0x3ffffff, m[ 0- 3] & 0x3ffffff + h0]
+ vmovd 0x00(m),hc0x
+ vmovd 0x10(m),t1x
+ vpunpcklqdq t1,hc0,hc0
+ vmovd 0x20(m),t1x
+ vmovd 0x30(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc0,hc0
+ vpand ANMASK(%rip),hc0,hc0
+ vmovd h0,t1x
+ vpaddd t1,hc0,hc0
+ # hc1 = [(m[51-54] >> 2) & 0x3ffffff, (m[35-38] >> 2) & 0x3ffffff,
+ # (m[19-22] >> 2) & 0x3ffffff, (m[ 3- 6] >> 2) & 0x3ffffff + h1]
+ vmovd 0x03(m),hc1x
+ vmovd 0x13(m),t1x
+ vpunpcklqdq t1,hc1,hc1
+ vmovd 0x23(m),t1x
+ vmovd 0x33(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc1,hc1
+ vpsrld $2,hc1,hc1
+ vpand ANMASK(%rip),hc1,hc1
+ vmovd h1,t1x
+ vpaddd t1,hc1,hc1
+ # hc2 = [(m[54-57] >> 4) & 0x3ffffff, (m[38-41] >> 4) & 0x3ffffff,
+ # (m[22-25] >> 4) & 0x3ffffff, (m[ 6- 9] >> 4) & 0x3ffffff + h2]
+ vmovd 0x06(m),hc2x
+ vmovd 0x16(m),t1x
+ vpunpcklqdq t1,hc2,hc2
+ vmovd 0x26(m),t1x
+ vmovd 0x36(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc2,hc2
+ vpsrld $4,hc2,hc2
+ vpand ANMASK(%rip),hc2,hc2
+ vmovd h2,t1x
+ vpaddd t1,hc2,hc2
+ # hc3 = [(m[57-60] >> 6) & 0x3ffffff, (m[41-44] >> 6) & 0x3ffffff,
+ # (m[25-28] >> 6) & 0x3ffffff, (m[ 9-12] >> 6) & 0x3ffffff + h3]
+ vmovd 0x09(m),hc3x
+ vmovd 0x19(m),t1x
+ vpunpcklqdq t1,hc3,hc3
+ vmovd 0x29(m),t1x
+ vmovd 0x39(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc3,hc3
+ vpsrld $6,hc3,hc3
+ vpand ANMASK(%rip),hc3,hc3
+ vmovd h3,t1x
+ vpaddd t1,hc3,hc3
+ # hc4 = [(m[60-63] >> 8) | (1<<24), (m[44-47] >> 8) | (1<<24),
+ # (m[28-31] >> 8) | (1<<24), (m[12-15] >> 8) | (1<<24) + h4]
+ vmovd 0x0c(m),hc4x
+ vmovd 0x1c(m),t1x
+ vpunpcklqdq t1,hc4,hc4
+ vmovd 0x2c(m),t1x
+ vmovd 0x3c(m),t2x
+ vpunpcklqdq t2,t1,t1
+ vperm2i128 $0x20,t1,hc4,hc4
+ vpsrld $8,hc4,hc4
+ vpor ORMASK(%rip),hc4,hc4
+ vmovd h4,t1x
+ vpaddd t1,hc4,hc4
+
+ # t1 = [ hc0[3] * r0, hc0[2] * u0, hc0[1] * w0, hc0[0] * y0 ]
+ vpmuludq hc0,ruwy0,t1
+ # t1 += [ hc1[3] * s4, hc1[2] * v4, hc1[1] * x4, hc1[0] * z4 ]
+ vpmuludq hc1,svxz4,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * s3, hc2[2] * v3, hc2[1] * x3, hc2[0] * z3 ]
+ vpmuludq hc2,svxz3,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * s2, hc3[2] * v2, hc3[1] * x2, hc3[0] * z2 ]
+ vpmuludq hc3,svxz2,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s1, hc4[2] * v1, hc4[1] * x1, hc4[0] * z1 ]
+ vpmuludq hc4,svxz1,t2
+ vpaddq t2,t1,t1
+ # d0 = t1[0] + t1[1] + t[2] + t[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d0
+
+ # t1 = [ hc0[3] * r1, hc0[2] * u1,hc0[1] * w1, hc0[0] * y1 ]
+ vpmuludq hc0,ruwy1,t1
+ # t1 += [ hc1[3] * r0, hc1[2] * u0, hc1[1] * w0, hc1[0] * y0 ]
+ vpmuludq hc1,ruwy0,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * s4, hc2[2] * v4, hc2[1] * x4, hc2[0] * z4 ]
+ vpmuludq hc2,svxz4,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * s3, hc3[2] * v3, hc3[1] * x3, hc3[0] * z3 ]
+ vpmuludq hc3,svxz3,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s2, hc4[2] * v2, hc4[1] * x2, hc4[0] * z2 ]
+ vpmuludq hc4,svxz2,t2
+ vpaddq t2,t1,t1
+ # d1 = t1[0] + t1[1] + t1[3] + t1[4]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d1
+
+ # t1 = [ hc0[3] * r2, hc0[2] * u2, hc0[1] * w2, hc0[0] * y2 ]
+ vpmuludq hc0,ruwy2,t1
+ # t1 += [ hc1[3] * r1, hc1[2] * u1, hc1[1] * w1, hc1[0] * y1 ]
+ vpmuludq hc1,ruwy1,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * r0, hc2[2] * u0, hc2[1] * w0, hc2[0] * y0 ]
+ vpmuludq hc2,ruwy0,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * s4, hc3[2] * v4, hc3[1] * x4, hc3[0] * z4 ]
+ vpmuludq hc3,svxz4,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s3, hc4[2] * v3, hc4[1] * x3, hc4[0] * z3 ]
+ vpmuludq hc4,svxz3,t2
+ vpaddq t2,t1,t1
+ # d2 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d2
+
+ # t1 = [ hc0[3] * r3, hc0[2] * u3, hc0[1] * w3, hc0[0] * y3 ]
+ vpmuludq hc0,ruwy3,t1
+ # t1 += [ hc1[3] * r2, hc1[2] * u2, hc1[1] * w2, hc1[0] * y2 ]
+ vpmuludq hc1,ruwy2,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * r1, hc2[2] * u1, hc2[1] * w1, hc2[0] * y1 ]
+ vpmuludq hc2,ruwy1,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * r0, hc3[2] * u0, hc3[1] * w0, hc3[0] * y0 ]
+ vpmuludq hc3,ruwy0,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * s4, hc4[2] * v4, hc4[1] * x4, hc4[0] * z4 ]
+ vpmuludq hc4,svxz4,t2
+ vpaddq t2,t1,t1
+ # d3 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d3
+
+ # t1 = [ hc0[3] * r4, hc0[2] * u4, hc0[1] * w4, hc0[0] * y4 ]
+ vpmuludq hc0,ruwy4,t1
+ # t1 += [ hc1[3] * r3, hc1[2] * u3, hc1[1] * w3, hc1[0] * y3 ]
+ vpmuludq hc1,ruwy3,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc2[3] * r2, hc2[2] * u2, hc2[1] * w2, hc2[0] * y2 ]
+ vpmuludq hc2,ruwy2,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc3[3] * r1, hc3[2] * u1, hc3[1] * w1, hc3[0] * y1 ]
+ vpmuludq hc3,ruwy1,t2
+ vpaddq t2,t1,t1
+ # t1 += [ hc4[3] * r0, hc4[2] * u0, hc4[1] * w0, hc4[0] * y0 ]
+ vpmuludq hc4,ruwy0,t2
+ vpaddq t2,t1,t1
+ # d4 = t1[0] + t1[1] + t1[2] + t1[3]
+ vpermq $0xee,t1,t2
+ vpaddq t2,t1,t1
+ vpsrldq $8,t1,t2
+ vpaddq t2,t1,t1
+ vmovq t1x,d4
+
+ # d1 += d0 >> 26
+ mov d0,%rax
+ shr $26,%rax
+ add %rax,d1
+ # h0 = d0 & 0x3ffffff
+ mov d0,%rbx
+ and $0x3ffffff,%ebx
+
+ # d2 += d1 >> 26
+ mov d1,%rax
+ shr $26,%rax
+ add %rax,d2
+ # h1 = d1 & 0x3ffffff
+ mov d1,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h1
+
+ # d3 += d2 >> 26
+ mov d2,%rax
+ shr $26,%rax
+ add %rax,d3
+ # h2 = d2 & 0x3ffffff
+ mov d2,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h2
+
+ # d4 += d3 >> 26
+ mov d3,%rax
+ shr $26,%rax
+ add %rax,d4
+ # h3 = d3 & 0x3ffffff
+ mov d3,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h3
+
+ # h0 += (d4 >> 26) * 5
+ mov d4,%rax
+ shr $26,%rax
+ lea (%eax,%eax,4),%eax
+ add %eax,%ebx
+ # h4 = d4 & 0x3ffffff
+ mov d4,%rax
+ and $0x3ffffff,%eax
+ mov %eax,h4
+
+ # h1 += h0 >> 26
+ mov %ebx,%eax
+ shr $26,%eax
+ add %eax,h1
+ # h0 = h0 & 0x3ffffff
+ andl $0x3ffffff,%ebx
+ mov %ebx,h0
+
+ add $0x40,m
+ dec %rcx
+ jnz .Ldoblock4
+
+ vzeroupper
+ pop %r13
+ pop %r12
+ pop %rbx
+ ret
+ENDPROC(poly1305_4block_avx2)
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index b7c33d0..f7170d7 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -22,20 +22,33 @@ struct poly1305_simd_desc_ctx {
struct poly1305_desc_ctx base;
/* derived key u set? */
bool uset;
+#ifdef CONFIG_AS_AVX2
+ /* derived keys r^3, r^4 set? */
+ bool wset;
+#endif
/* derived Poly1305 key r^2 */
u32 u[5];
+ /* ... silently appended r^3 and r^4 when using AVX2 */
};

asmlinkage void poly1305_block_sse2(u32 *h, const u8 *src,
const u32 *r, unsigned int blocks);
asmlinkage void poly1305_2block_sse2(u32 *h, const u8 *src, const u32 *r,
unsigned int blocks, const u32 *u);
+#ifdef CONFIG_AS_AVX2
+asmlinkage void poly1305_4block_avx2(u32 *h, const u8 *src, const u32 *r,
+ unsigned int blocks, const u32 *u);
+static bool poly1305_use_avx2;
+#endif

static int poly1305_simd_init(struct shash_desc *desc)
{
struct poly1305_simd_desc_ctx *sctx = shash_desc_ctx(desc);

sctx->uset = false;
+#ifdef CONFIG_AS_AVX2
+ sctx->wset = false;
+#endif

return crypto_poly1305_init(desc);
}
@@ -66,6 +79,26 @@ static unsigned int poly1305_simd_blocks(struct poly1305_desc_ctx *dctx,
srclen = datalen;
}

+#ifdef CONFIG_AS_AVX2
+ if (poly1305_use_avx2 && srclen >= POLY1305_BLOCK_SIZE * 4) {
+ if (unlikely(!sctx->wset)) {
+ if (!sctx->uset) {
+ memcpy(sctx->u, dctx->r, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u, dctx->r);
+ sctx->uset = true;
+ }
+ memcpy(sctx->u + 5, sctx->u, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u + 5, dctx->r);
+ memcpy(sctx->u + 10, sctx->u + 5, sizeof(sctx->u));
+ poly1305_simd_mult(sctx->u + 10, dctx->r);
+ sctx->wset = true;
+ }
+ blocks = srclen / (POLY1305_BLOCK_SIZE * 4);
+ poly1305_4block_avx2(dctx->h, src, dctx->r, blocks, sctx->u);
+ src += POLY1305_BLOCK_SIZE * 4 * blocks;
+ srclen -= POLY1305_BLOCK_SIZE * 4 * blocks;
+ }
+#endif
if (likely(srclen >= POLY1305_BLOCK_SIZE * 2)) {
if (unlikely(!sctx->uset)) {
memcpy(sctx->u, dctx->r, sizeof(sctx->u));
@@ -149,6 +182,13 @@ static int __init poly1305_simd_mod_init(void)
if (!cpu_has_xmm2)
return -ENODEV;

+#ifdef CONFIG_AS_AVX2
+ poly1305_use_avx2 = cpu_has_avx && cpu_has_avx2 &&
+ cpu_has_xfeatures(XSTATE_SSE | XSTATE_YMM, NULL);
+ alg.descsize = sizeof(struct poly1305_simd_desc_ctx);
+ if (poly1305_use_avx2)
+ alg.descsize += 10 * sizeof(u32);
+#endif
return crypto_register_shash(&alg);
}

diff --git a/crypto/Kconfig b/crypto/Kconfig
index c57478c..354bb69 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -471,7 +471,7 @@ config CRYPTO_POLY1305
in IETF protocols. This is the portable C implementation of Poly1305.

config CRYPTO_POLY1305_X86_64
- tristate "Poly1305 authenticator algorithm (x86_64/SSE2)"
+ tristate "Poly1305 authenticator algorithm (x86_64/SSE2/AVX2)"
depends on X86 && 64BIT
select CRYPTO_POLY1305
help
--
1.9.1

2015-07-17 13:31:57

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH v2 00/10] crypto: x86_64 - Add SSE/AVX2 ChaCha20/Poly1305 ciphers

On Thu, Jul 16, 2015 at 07:13:58PM +0200, Martin Willi wrote:
> This patch series adds both ChaCha20 and Poly1305 specific ciphers for
> x86_64 using SSE2/SSSE3 and AVX2 instructions. The idea is to have a drop-in
> replacement for AESNI/CLMUL-accelerated AES-GCM providing at least somewhat
> comparable performance, refer to RFC7539 for details. It is based on cryptodev,
> including the ChaCha20/Poly1305 AEAD interface conversion patch.
>
> The first patch adds some speed tests to tcrypt. The second patch exports
> some functionality from chacha20-generic to use it as fallback. Patch 3
> adds a single block SSSE3 driver for ChaCha20, while patch 4 and 5 extend it
> by an optimized four block SSSE3 and an eight block AVX2 variant. Patch 6
> adds an additional test vector for ChaCha20 to actually test the AVX2 eight
> block variant processing 512-bytes at once.
>
> Patch 7 exports some poly1305-generic functionality to use it as fallback.
> Patch 8 introduces a single block SSE2 driver for Poly1305, while patch 9
> and 10 add an optimized two block SSE2 and a four block AVX2 variant.
>
> Overall speedup for the ChaCha20/Poly1305 AEAD for typical IPsec payloads
> is ~50-150% with SSE2/SSSE3 and ~100-200% with AVX2, or even more for larger
> payloads:

All applied. Thanks Martin!
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt