2012-05-27 14:49:30

by Johannes Goetzfried

Subject: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

This patch adds an x86_64/avx assembler implementation of the Twofish block
cipher. The implementation processes eight blocks in parallel (two 4-block
chunks of AVX operations). The table lookups are done in general-purpose
registers. For small block sizes the 3-way parallel functions from the
twofish-x86_64-3way module are called. The AVX path is used for block sizes
greater than or equal to 128B (eight blocks); in the benchmarks below it gives
roughly 1.14x to 1.33x at 256B and above in every mode except CBC encryption,
which is processed one block at a time.
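
The fallback order described above (8-way AVX first, then 3-way, then single
blocks) can be sketched in plain C. This is only an illustration of how one
contiguous chunk is split, modelled on the ECB loop in the glue code; the names
plan_batches and struct batch_plan are hypothetical and do not appear in the
patch.

```c
#define BLOCK_SIZE      16	/* Twofish block size in bytes */
#define PARALLEL_BLOCKS 8	/* blocks handled by one AVX call */

/* Hypothetical bookkeeping type: how many calls of each kind are made. */
struct batch_plan {
	unsigned int avx8;	/* calls processing 8 blocks (AVX) */
	unsigned int way3;	/* calls processing 3 blocks (3-way) */
	unsigned int single;	/* calls processing 1 block */
};

/*
 * Hypothetical helper: split a chunk of nbytes into 8-block, 3-block and
 * single-block calls, in that order. Trailing bytes that do not fill a
 * whole block are not counted.
 */
static struct batch_plan plan_batches(unsigned int nbytes)
{
	struct batch_plan p = { 0, 0, 0 };

	p.avx8 = nbytes / (BLOCK_SIZE * PARALLEL_BLOCKS);
	nbytes %= BLOCK_SIZE * PARALLEL_BLOCKS;

	p.way3 = nbytes / (BLOCK_SIZE * 3);
	nbytes %= BLOCK_SIZE * 3;

	p.single = nbytes / BLOCK_SIZE;

	return p;
}
```

Under this split a 64B chunk never reaches the AVX code (one 3-way call plus
one single-block call), while a 128B chunk is handled by exactly one 8-way
call, which is consistent with the 128B threshold mentioned above.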

Patch has been tested with tcrypt and automated filesystem tests.

Tcrypt benchmark results:

Intel Core i5-2500 CPU (fam:6, model:42, step:7)

twofish-avx-x86_64 vs. twofish-x86_64-3way
128bit key:                                             (lrw:256bit)    (xts:256bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.96x   0.97x   1.00x   0.95x   0.97x   0.97x   0.96x   0.95x   0.95x   0.98x
64B     0.99x   0.99x   1.00x   0.99x   0.98x   0.98x   0.99x   0.98x   0.99x   0.98x
256B    1.20x   1.21x   1.00x   1.19x   1.15x   1.14x   1.19x   1.20x   1.18x   1.19x
1024B   1.29x   1.30x   1.00x   1.28x   1.23x   1.24x   1.26x   1.28x   1.26x   1.27x
8192B   1.31x   1.32x   1.00x   1.31x   1.25x   1.25x   1.28x   1.29x   1.28x   1.30x

256bit key:                                             (lrw:384bit)    (xts:512bit)
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
16B     0.96x   0.96x   1.00x   0.96x   0.97x   0.98x   0.95x   0.95x   0.95x   0.96x
64B     1.00x   0.99x   1.00x   0.98x   0.98x   1.01x   0.98x   0.98x   0.98x   0.98x
256B    1.20x   1.21x   1.00x   1.21x   1.15x   1.15x   1.19x   1.20x   1.18x   1.19x
1024B   1.29x   1.30x   1.00x   1.28x   1.23x   1.23x   1.26x   1.27x   1.26x   1.27x
8192B   1.31x   1.33x   1.00x   1.31x   1.26x   1.26x   1.29x   1.29x   1.28x   1.30x

twofish-avx-x86_64 vs aes-asm (8kB block):
         128bit  256bit
ecb-enc  1.19x   1.63x
ecb-dec  1.18x   1.62x
cbc-enc  0.75x   1.03x
cbc-dec  1.23x   1.67x
ctr-enc  1.24x   1.65x
ctr-dec  1.24x   1.65x
lrw-enc  1.15x   1.53x
lrw-dec  1.14x   1.52x
xts-enc  1.16x   1.56x
xts-dec  1.16x   1.56x

Signed-off-by: Johannes Goetzfried <[email protected]>
---
arch/x86/crypto/Makefile | 2 +
arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 302 ++++++++
arch/x86/crypto/twofish_avx_glue.c | 1086 +++++++++++++++++++++++++++
arch/x86/crypto/twofish_glue_3way.c | 2 +
crypto/Kconfig | 24 +
crypto/tcrypt.c | 23 +
crypto/testmgr.c | 60 ++
7 files changed, 1499 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/crypto/twofish-avx-x86_64-asm_64.S
create mode 100644 arch/x86/crypto/twofish_avx_glue.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index e191ac0..b02a73e 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
+obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o
obj-$(CONFIG_CRYPTO_SALSA20_X86_64) += salsa20-x86_64.o
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
@@ -30,6 +31,7 @@ camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o
blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
+twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o twofish_avx_glue.o
salsa20-x86_64-y := salsa20-x86_64-asm_64.o salsa20_glue.o
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
new file mode 100644
index 0000000..daf070a
--- /dev/null
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -0,0 +1,302 @@
+/*
+ * Twofish Cipher 8-way parallel algorithm (AVX/x86_64)
+ *
+ * Copyright (C) 2012 Johannes Goetzfried
+ * <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
+ * USA
+ *
+ */
+
+.file "twofish-avx-x86_64-asm_64.S"
+.text
+
+/* structure of crypto context */
+#define s0 0
+#define s1 1024
+#define s2 2048
+#define s3 3072
+#define w 4096
+#define k 4128
+
+/**********************************************************************
+ 8-way AVX twofish
+ **********************************************************************/
+#define CTX %rdi
+
+#define RA1 %xmm0
+#define RB1 %xmm1
+#define RC1 %xmm2
+#define RD1 %xmm3
+
+#define RA2 %xmm4
+#define RB2 %xmm5
+#define RC2 %xmm6
+#define RD2 %xmm7
+
+#define RX %xmm8
+#define RY %xmm9
+
+#define RK1 %xmm10
+#define RK2 %xmm11
+
+#define RID1 %rax
+#define RID1b %al
+#define RID2 %rbx
+#define RID2b %bl
+
+#define RGI1 %rdx
+#define RGI1bl %dl
+#define RGI1bh %dh
+#define RGI2 %rcx
+#define RGI2bl %cl
+#define RGI2bh %ch
+
+#define RGS1 %r8
+#define RGS1d %r8d
+#define RGS2 %r9
+#define RGS2d %r9d
+#define RGS3 %r10
+#define RGS3d %r10d
+
+
+#define lookup_32bit(t0, t1, t2, t3, src, dst) \
+ movb src ## bl, RID1b; \
+ movb src ## bh, RID2b; \
+ movl t0(CTX, RID1, 4), dst ## d; \
+ xorl t1(CTX, RID2, 4), dst ## d; \
+ shrq $16, src; \
+ movb src ## bl, RID1b; \
+ movb src ## bh, RID2b; \
+ xorl t2(CTX, RID1, 4), dst ## d; \
+ xorl t3(CTX, RID2, 4), dst ## d;
+
+#define G(a, x, t0, t1, t2, t3) \
+ vmovq a, RGI1; \
+ vpsrldq $8, a, x; \
+ vmovq x, RGI2; \
+ \
+ lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
+ shrq $16, RGI1; \
+ lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
+ shlq $32, RGS2; \
+ orq RGS1, RGS2; \
+ \
+ lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
+ shrq $16, RGI2; \
+ lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
+ shlq $32, RGS3; \
+ orq RGS1, RGS3; \
+ \
+ vmovq RGS2, x; \
+ vpinsrq $1, RGS3, x, x;
+
+#define encround(a, b, c, d, x, y) \
+ G(a, x, s0, s1, s2, s3); \
+ G(b, y, s1, s2, s3, s0); \
+ vpaddd x, y, x; \
+ vpaddd y, x, y; \
+ vpaddd x, RK1, x; \
+ vpaddd y, RK2, y; \
+ vpxor x, c, c; \
+ vpsrld $1, c, x; \
+ vpslld $(32 - 1), c, c; \
+ vpor c, x, c; \
+ vpslld $1, d, x; \
+ vpsrld $(32 - 1), d, d; \
+ vpor d, x, d; \
+ vpxor d, y, d;
+
+#define decround(a, b, c, d, x, y) \
+ G(a, x, s0, s1, s2, s3); \
+ G(b, y, s1, s2, s3, s0); \
+ vpaddd x, y, x; \
+ vpaddd y, x, y; \
+ vpaddd y, RK2, y; \
+ vpxor d, y, d; \
+ vpsrld $1, d, y; \
+ vpslld $(32 - 1), d, d; \
+ vpor d, y, d; \
+ vpslld $1, c, y; \
+ vpsrld $(32 - 1), c, c; \
+ vpor c, y, c; \
+ vpaddd x, RK1, x; \
+ vpxor x, c, c;
+
+#define encrypt_round(n, a, b, c, d) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ encround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
+ encround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+
+#define decrypt_round(n, a, b, c, d) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ decround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
+ decround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+
+#define encrypt_cycle(n) \
+ encrypt_round((2*n), RA, RB, RC, RD); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB);
+
+#define decrypt_cycle(n) \
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB); \
+ decrypt_round((2*n), RA, RB, RC, RD);
+
+
+#define transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
+ vpunpckldq x1, x0, t0; \
+ vpunpckhdq x1, x0, t2; \
+ vpunpckldq x3, x2, t1; \
+ vpunpckhdq x3, x2, x3; \
+ \
+ vpunpcklqdq t1, t0, x0; \
+ vpunpckhqdq t1, t0, x1; \
+ vpunpcklqdq x3, t2, x2; \
+ vpunpckhqdq x3, t2, x3;
+
+#define inpack_blocks(in, x0, x1, x2, x3, wkey, t0, t1, t2) \
+ vpxor (0*4*4)(in), wkey, x0; \
+ vpxor (1*4*4)(in), wkey, x1; \
+ vpxor (2*4*4)(in), wkey, x2; \
+ vpxor (3*4*4)(in), wkey, x3; \
+ \
+ transpose_4x4(x0, x1, x2, x3, t0, t1, t2)
+
+#define outunpack_blocks(out, x0, x1, x2, x3, wkey, t0, t1, t2) \
+ transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
+ \
+ vpxor x0, wkey, x0; \
+ vmovdqu x0, (0*4*4)(out); \
+ vpxor x1, wkey, x1; \
+ vmovdqu x1, (1*4*4)(out); \
+ vpxor x2, wkey, x2; \
+ vmovdqu x2, (2*4*4)(out); \
+ vpxor x3, wkey, x3; \
+ vmovdqu x3, (3*4*4)(out);
+
+#define outunpack_xor_blocks(out, x0, x1, x2, x3, wkey, t0, t1, t2) \
+ transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
+ \
+ vpxor x0, wkey, x0; \
+ vpxor (0*4*4)(out), x0, x0; \
+ vmovdqu x0, (0*4*4)(out); \
+ vpxor x1, wkey, x1; \
+ vpxor (1*4*4)(out), x1, x1; \
+ vmovdqu x1, (1*4*4)(out); \
+ vpxor x2, wkey, x2; \
+ vpxor (2*4*4)(out), x2, x2; \
+ vmovdqu x2, (2*4*4)(out); \
+ vpxor x3, wkey, x3; \
+ vpxor (3*4*4)(out), x3, x3; \
+ vmovdqu x3, (3*4*4)(out);
+
+.align 8
+.global __twofish_enc_blk_8way
+.type __twofish_enc_blk_8way,@function;
+
+__twofish_enc_blk_8way:
+ /* input:
+ * %rdi: ctx, CTX
+ * %rsi: dst
+ * %rdx: src
+ * %rcx: bool, if true: xor output
+ */
+
+ pushq %rbx;
+ pushq %rcx;
+
+ vmovdqu w(CTX), RK1;
+
+ leaq (4*4*4)(%rdx), %rax;
+ inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
+ inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+
+ xorq RID1, RID1;
+ xorq RID2, RID2;
+
+ encrypt_cycle(0);
+ encrypt_cycle(1);
+ encrypt_cycle(2);
+ encrypt_cycle(3);
+ encrypt_cycle(4);
+ encrypt_cycle(5);
+ encrypt_cycle(6);
+ encrypt_cycle(7);
+
+ vmovdqu (w+4*4)(CTX), RK1;
+
+ popq %rcx;
+ popq %rbx;
+
+ leaq (4*4*4)(%rsi), %rax;
+ leaq (4*4*4)(%rax), %rdx;
+
+ testb %cl, %cl;
+ jnz __enc_xor8;
+
+ outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
+ outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+
+ ret;
+
+__enc_xor8:
+ outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
+ outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+
+ ret;
+
+.align 8
+.global twofish_dec_blk_8way
+.type twofish_dec_blk_8way,@function;
+
+twofish_dec_blk_8way:
+ /* input:
+ * %rdi: ctx, CTX
+ * %rsi: dst
+ * %rdx: src
+ */
+
+ pushq %rbx;
+
+ vmovdqu (w+4*4)(CTX), RK1;
+
+ leaq (4*4*4)(%rdx), %rax;
+ inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
+ inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+
+ xorq RID1, RID1;
+ xorq RID2, RID2;
+
+ decrypt_cycle(7);
+ decrypt_cycle(6);
+ decrypt_cycle(5);
+ decrypt_cycle(4);
+ decrypt_cycle(3);
+ decrypt_cycle(2);
+ decrypt_cycle(1);
+ decrypt_cycle(0);
+
+ vmovdqu (w)(CTX), RK1;
+
+ popq %rbx;
+
+ leaq (4*4*4)(%rsi), %rax;
+ outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
+ outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+
+ ret;
+
diff --git a/arch/x86/crypto/twofish_avx_glue.c b/arch/x86/crypto/twofish_avx_glue.c
new file mode 100644
index 0000000..68a3a825
--- /dev/null
+++ b/arch/x86/crypto/twofish_avx_glue.c
@@ -0,0 +1,1086 @@
+/*
+ * Glue Code for AVX assembler version of Twofish Cipher
+ *
+ * Copyright (C) 2012 Johannes Goetzfried
+ * <[email protected]>
+ *
+ * Glue code based on serpent_sse2_glue.c by:
+ * Copyright (C) 2011 Jussi Kivilinna <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
+ * USA
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/hardirq.h>
+#include <linux/types.h>
+#include <linux/crypto.h>
+#include <linux/err.h>
+#include <crypto/algapi.h>
+#include <crypto/twofish.h>
+#include <crypto/cryptd.h>
+#include <crypto/b128ops.h>
+#include <crypto/ctr.h>
+#include <crypto/lrw.h>
+#include <crypto/xts.h>
+#include <asm/i387.h>
+#include <asm/xcr.h>
+#include <asm/xsave.h>
+#include <crypto/scatterwalk.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+
+
+#define TWOFISH_PARALLEL_BLOCKS 8
+
+/* regular block cipher functions from twofish_x86_64 module */
+asmlinkage void twofish_enc_blk(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src);
+asmlinkage void twofish_dec_blk(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src);
+
+/* 3-way parallel cipher functions from twofish_x86_64-3way module */
+asmlinkage void __twofish_enc_blk_3way(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src, bool xor);
+asmlinkage void twofish_dec_blk_3way(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src);
+
+static inline void twofish_enc_blk_3way(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src)
+{
+ __twofish_enc_blk_3way(ctx, dst, src, false);
+}
+
+static inline void twofish_enc_blk_3way_xor(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src)
+{
+ __twofish_enc_blk_3way(ctx, dst, src, true);
+}
+
+/* 8-way parallel cipher functions */
+asmlinkage void __twofish_enc_blk_8way(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src, bool xor);
+asmlinkage void twofish_dec_blk_8way(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src);
+
+static inline void twofish_enc_blk_xway(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src)
+{
+ __twofish_enc_blk_8way(ctx, dst, src, false);
+}
+
+static inline void twofish_enc_blk_xway_xor(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src)
+{
+ __twofish_enc_blk_8way(ctx, dst, src, true);
+}
+
+static inline void twofish_dec_blk_xway(struct twofish_ctx *ctx, u8 *dst,
+ const u8 *src)
+{
+ twofish_dec_blk_8way(ctx, dst, src);
+}
+
+
+
+struct async_twofish_ctx {
+ struct cryptd_ablkcipher *cryptd_tfm;
+};
+
+static inline bool twofish_fpu_begin(bool fpu_enabled, unsigned int nbytes)
+{
+ if (fpu_enabled)
+ return true;
+
+ /* AVX is only used when chunk to be processed is large enough, so
+ * do not enable FPU until it is necessary.
+ */
+ if (nbytes < TF_BLOCK_SIZE * TWOFISH_PARALLEL_BLOCKS)
+ return false;
+
+ kernel_fpu_begin();
+ return true;
+}
+
+static inline void twofish_fpu_end(bool fpu_enabled)
+{
+ if (fpu_enabled)
+ kernel_fpu_end();
+}
+
+static int ecb_crypt(struct blkcipher_desc *desc, struct blkcipher_walk *walk,
+ bool enc)
+{
+ bool fpu_enabled = false;
+ struct twofish_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ const unsigned int bsize = TF_BLOCK_SIZE;
+ unsigned int nbytes;
+ int err;
+
+ err = blkcipher_walk_virt(desc, walk);
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+
+ while ((nbytes = walk->nbytes)) {
+ u8 *wsrc = walk->src.virt.addr;
+ u8 *wdst = walk->dst.virt.addr;
+
+ fpu_enabled = twofish_fpu_begin(fpu_enabled, nbytes);
+
+ /* Process multi-block batch */
+ if (nbytes >= bsize * TWOFISH_PARALLEL_BLOCKS) {
+ do {
+ if (enc)
+ twofish_enc_blk_xway(ctx, wdst, wsrc);
+ else
+ twofish_dec_blk_xway(ctx, wdst, wsrc);
+
+ wsrc += bsize * TWOFISH_PARALLEL_BLOCKS;
+ wdst += bsize * TWOFISH_PARALLEL_BLOCKS;
+ nbytes -= bsize * TWOFISH_PARALLEL_BLOCKS;
+ } while (nbytes >= bsize * TWOFISH_PARALLEL_BLOCKS);
+
+ if (nbytes < bsize)
+ goto done;
+ }
+
+ /* Process three block batch */
+ if (nbytes >= bsize * 3) {
+ do {
+ if (enc)
+ twofish_enc_blk_3way(ctx, wdst, wsrc);
+ else
+ twofish_dec_blk_3way(ctx, wdst, wsrc);
+
+ wsrc += bsize * 3;
+ wdst += bsize * 3;
+ nbytes -= bsize * 3;
+ } while (nbytes >= bsize * 3);
+
+ if (nbytes < bsize)
+ goto done;
+ }
+
+ /* Handle leftovers */
+ do {
+ if (enc)
+ twofish_enc_blk(ctx, wdst, wsrc);
+ else
+ twofish_dec_blk(ctx, wdst, wsrc);
+
+ wsrc += bsize;
+ wdst += bsize;
+ nbytes -= bsize;
+ } while (nbytes >= bsize);
+
+done:
+ err = blkcipher_walk_done(desc, walk, nbytes);
+ }
+
+ twofish_fpu_end(fpu_enabled);
+ return err;
+}
+
+static int ecb_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct blkcipher_walk walk;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ return ecb_crypt(desc, &walk, true);
+}
+
+static int ecb_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct blkcipher_walk walk;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ return ecb_crypt(desc, &walk, false);
+}
+
+static unsigned int __cbc_encrypt(struct blkcipher_desc *desc,
+ struct blkcipher_walk *walk)
+{
+ struct twofish_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ const unsigned int bsize = TF_BLOCK_SIZE;
+ unsigned int nbytes = walk->nbytes;
+ u128 *src = (u128 *)walk->src.virt.addr;
+ u128 *dst = (u128 *)walk->dst.virt.addr;
+ u128 *iv = (u128 *)walk->iv;
+
+ do {
+ u128_xor(dst, src, iv);
+ twofish_enc_blk(ctx, (u8 *)dst, (u8 *)dst);
+ iv = dst;
+
+ src += 1;
+ dst += 1;
+ nbytes -= bsize;
+ } while (nbytes >= bsize);
+
+ *(u128 *)walk->iv = *iv;
+ return nbytes;
+}
+
+static int cbc_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct blkcipher_walk walk;
+ int err;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt(desc, &walk);
+
+ while ((nbytes = walk.nbytes)) {
+ nbytes = __cbc_encrypt(desc, &walk);
+ err = blkcipher_walk_done(desc, &walk, nbytes);
+ }
+
+ return err;
+}
+
+static unsigned int __cbc_decrypt(struct blkcipher_desc *desc,
+ struct blkcipher_walk *walk)
+{
+ struct twofish_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ const unsigned int bsize = TF_BLOCK_SIZE;
+ unsigned int nbytes = walk->nbytes;
+ u128 *src = (u128 *)walk->src.virt.addr;
+ u128 *dst = (u128 *)walk->dst.virt.addr;
+ u128 ivs[TWOFISH_PARALLEL_BLOCKS - 1];
+ u128 last_iv;
+ int i;
+
+ /* Start of the last block. */
+ src += nbytes / bsize - 1;
+ dst += nbytes / bsize - 1;
+
+ last_iv = *src;
+
+ /* Process multi-block batch */
+ if (nbytes >= bsize * TWOFISH_PARALLEL_BLOCKS) {
+ do {
+ nbytes -= bsize * (TWOFISH_PARALLEL_BLOCKS - 1);
+ src -= TWOFISH_PARALLEL_BLOCKS - 1;
+ dst -= TWOFISH_PARALLEL_BLOCKS - 1;
+
+ for (i = 0; i < TWOFISH_PARALLEL_BLOCKS - 1; i++)
+ ivs[i] = src[i];
+
+ twofish_dec_blk_xway(ctx, (u8 *)dst, (u8 *)src);
+
+ for (i = 0; i < TWOFISH_PARALLEL_BLOCKS - 1; i++)
+ u128_xor(dst + (i + 1), dst + (i + 1), ivs + i);
+
+ nbytes -= bsize;
+ if (nbytes < bsize)
+ goto done;
+
+ u128_xor(dst, dst, src - 1);
+ src -= 1;
+ dst -= 1;
+ } while (nbytes >= bsize * TWOFISH_PARALLEL_BLOCKS);
+
+ if (nbytes < bsize)
+ goto done;
+ }
+
+ /* Process three block batch */
+ if (nbytes >= bsize * 3) {
+ do {
+ nbytes -= bsize * (3 - 1);
+ src -= 3 - 1;
+ dst -= 3 - 1;
+
+ ivs[0] = src[0];
+ ivs[1] = src[1];
+
+ twofish_dec_blk_3way(ctx, (u8 *)dst, (u8 *)src);
+
+ u128_xor(dst + 1, dst + 1, ivs + 0);
+ u128_xor(dst + 2, dst + 2, ivs + 1);
+
+ nbytes -= bsize;
+ if (nbytes < bsize)
+ goto done;
+
+ u128_xor(dst, dst, src - 1);
+ src -= 1;
+ dst -= 1;
+ } while (nbytes >= bsize * 3);
+
+ if (nbytes < bsize)
+ goto done;
+ }
+
+ /* Handle leftovers */
+ for (;;) {
+ twofish_dec_blk(ctx, (u8 *)dst, (u8 *)src);
+
+ nbytes -= bsize;
+ if (nbytes < bsize)
+ break;
+
+ u128_xor(dst, dst, src - 1);
+ src -= 1;
+ dst -= 1;
+ }
+
+done:
+ u128_xor(dst, dst, (u128 *)walk->iv);
+ *(u128 *)walk->iv = last_iv;
+
+ return nbytes;
+}
+
+static int cbc_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ bool fpu_enabled = false;
+ struct blkcipher_walk walk;
+ int err;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt(desc, &walk);
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+
+ while ((nbytes = walk.nbytes)) {
+ fpu_enabled = twofish_fpu_begin(fpu_enabled, nbytes);
+ nbytes = __cbc_decrypt(desc, &walk);
+ err = blkcipher_walk_done(desc, &walk, nbytes);
+ }
+
+ twofish_fpu_end(fpu_enabled);
+ return err;
+}
+
+static inline void u128_to_be128(be128 *dst, const u128 *src)
+{
+ dst->a = cpu_to_be64(src->a);
+ dst->b = cpu_to_be64(src->b);
+}
+
+static inline void be128_to_u128(u128 *dst, const be128 *src)
+{
+ dst->a = be64_to_cpu(src->a);
+ dst->b = be64_to_cpu(src->b);
+}
+
+static inline void u128_inc(u128 *i)
+{
+ i->b++;
+ if (!i->b)
+ i->a++;
+}
+
+static void ctr_crypt_final(struct blkcipher_desc *desc,
+ struct blkcipher_walk *walk)
+{
+ struct twofish_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ u8 *ctrblk = walk->iv;
+ u8 keystream[TF_BLOCK_SIZE];
+ u8 *src = walk->src.virt.addr;
+ u8 *dst = walk->dst.virt.addr;
+ unsigned int nbytes = walk->nbytes;
+
+ twofish_enc_blk(ctx, keystream, ctrblk);
+ crypto_xor(keystream, src, nbytes);
+ memcpy(dst, keystream, nbytes);
+
+ crypto_inc(ctrblk, TF_BLOCK_SIZE);
+}
+
+static unsigned int __ctr_crypt(struct blkcipher_desc *desc,
+ struct blkcipher_walk *walk)
+{
+ struct twofish_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ const unsigned int bsize = TF_BLOCK_SIZE;
+ unsigned int nbytes = walk->nbytes;
+ u128 *src = (u128 *)walk->src.virt.addr;
+ u128 *dst = (u128 *)walk->dst.virt.addr;
+ u128 ctrblk;
+ be128 ctrblocks[TWOFISH_PARALLEL_BLOCKS];
+ int i;
+
+ be128_to_u128(&ctrblk, (be128 *)walk->iv);
+
+ /* Process multi-block batch */
+ if (nbytes >= bsize * TWOFISH_PARALLEL_BLOCKS) {
+ do {
+ /* create ctrblks for parallel encrypt */
+ for (i = 0; i < TWOFISH_PARALLEL_BLOCKS; i++) {
+ if (dst != src)
+ dst[i] = src[i];
+
+ u128_to_be128(&ctrblocks[i], &ctrblk);
+ u128_inc(&ctrblk);
+ }
+
+ twofish_enc_blk_xway_xor(ctx, (u8 *)dst,
+ (u8 *)ctrblocks);
+
+ src += TWOFISH_PARALLEL_BLOCKS;
+ dst += TWOFISH_PARALLEL_BLOCKS;
+ nbytes -= bsize * TWOFISH_PARALLEL_BLOCKS;
+ } while (nbytes >= bsize * TWOFISH_PARALLEL_BLOCKS);
+
+ if (nbytes < bsize)
+ goto done;
+ }
+
+ /* Process three block batch */
+ if (nbytes >= bsize * 3) {
+ do {
+ if (dst != src) {
+ dst[0] = src[0];
+ dst[1] = src[1];
+ dst[2] = src[2];
+ }
+
+ /* create ctrblks for parallel encrypt */
+ u128_to_be128(&ctrblocks[0], &ctrblk);
+ u128_inc(&ctrblk);
+ u128_to_be128(&ctrblocks[1], &ctrblk);
+ u128_inc(&ctrblk);
+ u128_to_be128(&ctrblocks[2], &ctrblk);
+ u128_inc(&ctrblk);
+
+ twofish_enc_blk_3way_xor(ctx, (u8 *)dst,
+ (u8 *)ctrblocks);
+
+ src += 3;
+ dst += 3;
+ nbytes -= bsize * 3;
+ } while (nbytes >= bsize * 3);
+
+ if (nbytes < bsize)
+ goto done;
+ }
+
+ /* Handle leftovers */
+ do {
+ if (dst != src)
+ *dst = *src;
+
+ u128_to_be128(&ctrblocks[0], &ctrblk);
+ u128_inc(&ctrblk);
+
+ twofish_enc_blk(ctx, (u8 *)ctrblocks, (u8 *)ctrblocks);
+ u128_xor(dst, dst, (u128 *)ctrblocks);
+
+ src += 1;
+ dst += 1;
+ nbytes -= bsize;
+ } while (nbytes >= bsize);
+
+done:
+ u128_to_be128((be128 *)walk->iv, &ctrblk);
+ return nbytes;
+}
+
+static int ctr_crypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ bool fpu_enabled = false;
+ struct blkcipher_walk walk;
+ int err;
+
+ blkcipher_walk_init(&walk, dst, src, nbytes);
+ err = blkcipher_walk_virt_block(desc, &walk, TF_BLOCK_SIZE);
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+
+ while ((nbytes = walk.nbytes) >= TF_BLOCK_SIZE) {
+ fpu_enabled = twofish_fpu_begin(fpu_enabled, nbytes);
+ nbytes = __ctr_crypt(desc, &walk);
+ err = blkcipher_walk_done(desc, &walk, nbytes);
+ }
+
+ twofish_fpu_end(fpu_enabled);
+
+ if (walk.nbytes) {
+ ctr_crypt_final(desc, &walk);
+ err = blkcipher_walk_done(desc, &walk, 0);
+ }
+
+ return err;
+}
+
+struct crypt_priv {
+ struct twofish_ctx *ctx;
+ bool fpu_enabled;
+};
+
+static void encrypt_callback(void *priv, u8 *srcdst, unsigned int nbytes)
+{
+ const unsigned int bsize = TF_BLOCK_SIZE;
+ struct crypt_priv *ctx = priv;
+ int i;
+
+ ctx->fpu_enabled = twofish_fpu_begin(ctx->fpu_enabled, nbytes);
+
+ if (nbytes == bsize * TWOFISH_PARALLEL_BLOCKS) {
+ twofish_enc_blk_xway(ctx->ctx, srcdst, srcdst);
+ return;
+ }
+
+ for (i = 0; i < nbytes / (bsize * 3); i++, srcdst += bsize * 3)
+ twofish_enc_blk_3way(ctx->ctx, srcdst, srcdst);
+
+ nbytes %= bsize * 3;
+
+ for (i = 0; i < nbytes / bsize; i++, srcdst += bsize)
+ twofish_enc_blk(ctx->ctx, srcdst, srcdst);
+}
+
+static void decrypt_callback(void *priv, u8 *srcdst, unsigned int nbytes)
+{
+ const unsigned int bsize = TF_BLOCK_SIZE;
+ struct crypt_priv *ctx = priv;
+ int i;
+
+ ctx->fpu_enabled = twofish_fpu_begin(ctx->fpu_enabled, nbytes);
+
+ if (nbytes == bsize * TWOFISH_PARALLEL_BLOCKS) {
+ twofish_dec_blk_xway(ctx->ctx, srcdst, srcdst);
+ return;
+ }
+
+ for (i = 0; i < nbytes / (bsize * 3); i++, srcdst += bsize * 3)
+ twofish_dec_blk_3way(ctx->ctx, srcdst, srcdst);
+
+ nbytes %= bsize * 3;
+
+ for (i = 0; i < nbytes / bsize; i++, srcdst += bsize)
+ twofish_dec_blk(ctx->ctx, srcdst, srcdst);
+}
+
+struct twofish_lrw_ctx {
+ struct lrw_table_ctx lrw_table;
+ struct twofish_ctx twofish_ctx;
+};
+
+static int lrw_twofish_setkey(struct crypto_tfm *tfm, const u8 *key,
+ unsigned int keylen)
+{
+ struct twofish_lrw_ctx *ctx = crypto_tfm_ctx(tfm);
+ int err;
+
+ err = __twofish_setkey(&ctx->twofish_ctx, key,
+ keylen - TF_BLOCK_SIZE, &tfm->crt_flags);
+ if (err)
+ return err;
+
+ return lrw_init_table(&ctx->lrw_table, key + keylen -
+ TF_BLOCK_SIZE);
+}
+
+static int lrw_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct twofish_lrw_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ be128 buf[TWOFISH_PARALLEL_BLOCKS];
+ struct crypt_priv crypt_ctx = {
+ .ctx = &ctx->twofish_ctx,
+ .fpu_enabled = false,
+ };
+ struct lrw_crypt_req req = {
+ .tbuf = buf,
+ .tbuflen = sizeof(buf),
+
+ .table_ctx = &ctx->lrw_table,
+ .crypt_ctx = &crypt_ctx,
+ .crypt_fn = encrypt_callback,
+ };
+ int ret;
+
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+ ret = lrw_crypt(desc, dst, src, nbytes, &req);
+ twofish_fpu_end(crypt_ctx.fpu_enabled);
+
+ return ret;
+}
+
+static int lrw_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct twofish_lrw_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ be128 buf[TWOFISH_PARALLEL_BLOCKS];
+ struct crypt_priv crypt_ctx = {
+ .ctx = &ctx->twofish_ctx,
+ .fpu_enabled = false,
+ };
+ struct lrw_crypt_req req = {
+ .tbuf = buf,
+ .tbuflen = sizeof(buf),
+
+ .table_ctx = &ctx->lrw_table,
+ .crypt_ctx = &crypt_ctx,
+ .crypt_fn = decrypt_callback,
+ };
+ int ret;
+
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+ ret = lrw_crypt(desc, dst, src, nbytes, &req);
+ twofish_fpu_end(crypt_ctx.fpu_enabled);
+
+ return ret;
+}
+
+static void lrw_exit_tfm(struct crypto_tfm *tfm)
+{
+ struct twofish_lrw_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ lrw_free_table(&ctx->lrw_table);
+}
+
+struct twofish_xts_ctx {
+ struct twofish_ctx tweak_ctx;
+ struct twofish_ctx crypt_ctx;
+};
+
+static int xts_twofish_setkey(struct crypto_tfm *tfm, const u8 *key,
+ unsigned int keylen)
+{
+ struct twofish_xts_ctx *ctx = crypto_tfm_ctx(tfm);
+ u32 *flags = &tfm->crt_flags;
+ int err;
+
+ /* key consists of keys of equal size concatenated, therefore
+ * the length must be even
+ */
+ if (keylen % 2) {
+ *flags |= CRYPTO_TFM_RES_BAD_KEY_LEN;
+ return -EINVAL;
+ }
+
+ /* first half of xts-key is for crypt */
+ err = __twofish_setkey(&ctx->crypt_ctx, key, keylen / 2, flags);
+ if (err)
+ return err;
+
+ /* second half of xts-key is for tweak */
+ return __twofish_setkey(&ctx->tweak_ctx,
+ key + keylen / 2, keylen / 2, flags);
+}
+
+static int xts_encrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct twofish_xts_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ be128 buf[TWOFISH_PARALLEL_BLOCKS];
+ struct crypt_priv crypt_ctx = {
+ .ctx = &ctx->crypt_ctx,
+ .fpu_enabled = false,
+ };
+ struct xts_crypt_req req = {
+ .tbuf = buf,
+ .tbuflen = sizeof(buf),
+
+ .tweak_ctx = &ctx->tweak_ctx,
+ .tweak_fn = XTS_TWEAK_CAST(twofish_enc_blk),
+ .crypt_ctx = &crypt_ctx,
+ .crypt_fn = encrypt_callback,
+ };
+ int ret;
+
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+ ret = xts_crypt(desc, dst, src, nbytes, &req);
+ twofish_fpu_end(crypt_ctx.fpu_enabled);
+
+ return ret;
+}
+
+static int xts_decrypt(struct blkcipher_desc *desc, struct scatterlist *dst,
+ struct scatterlist *src, unsigned int nbytes)
+{
+ struct twofish_xts_ctx *ctx = crypto_blkcipher_ctx(desc->tfm);
+ be128 buf[TWOFISH_PARALLEL_BLOCKS];
+ struct crypt_priv crypt_ctx = {
+ .ctx = &ctx->crypt_ctx,
+ .fpu_enabled = false,
+ };
+ struct xts_crypt_req req = {
+ .tbuf = buf,
+ .tbuflen = sizeof(buf),
+
+ .tweak_ctx = &ctx->tweak_ctx,
+ .tweak_fn = XTS_TWEAK_CAST(twofish_enc_blk),
+ .crypt_ctx = &crypt_ctx,
+ .crypt_fn = decrypt_callback,
+ };
+ int ret;
+
+ desc->flags &= ~CRYPTO_TFM_REQ_MAY_SLEEP;
+ ret = xts_crypt(desc, dst, src, nbytes, &req);
+ twofish_fpu_end(crypt_ctx.fpu_enabled);
+
+ return ret;
+}
+
+static int ablk_set_key(struct crypto_ablkcipher *tfm, const u8 *key,
+ unsigned int key_len)
+{
+ struct async_twofish_ctx *ctx = crypto_ablkcipher_ctx(tfm);
+ struct crypto_ablkcipher *child = &ctx->cryptd_tfm->base;
+ int err;
+
+ crypto_ablkcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+ crypto_ablkcipher_set_flags(child, crypto_ablkcipher_get_flags(tfm)
+ & CRYPTO_TFM_REQ_MASK);
+ err = crypto_ablkcipher_setkey(child, key, key_len);
+ crypto_ablkcipher_set_flags(tfm, crypto_ablkcipher_get_flags(child)
+ & CRYPTO_TFM_RES_MASK);
+ return err;
+}
+
+static int __ablk_encrypt(struct ablkcipher_request *req)
+{
+ struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(req);
+ struct async_twofish_ctx *ctx = crypto_ablkcipher_ctx(tfm);
+ struct blkcipher_desc desc;
+
+ desc.tfm = cryptd_ablkcipher_child(ctx->cryptd_tfm);
+ desc.info = req->info;
+ desc.flags = 0;
+
+ return crypto_blkcipher_crt(desc.tfm)->encrypt(
+ &desc, req->dst, req->src, req->nbytes);
+}
+
+static int ablk_encrypt(struct ablkcipher_request *req)
+{
+ struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(req);
+ struct async_twofish_ctx *ctx = crypto_ablkcipher_ctx(tfm);
+
+ if (!irq_fpu_usable()) {
+ struct ablkcipher_request *cryptd_req =
+ ablkcipher_request_ctx(req);
+
+ memcpy(cryptd_req, req, sizeof(*req));
+ ablkcipher_request_set_tfm(cryptd_req, &ctx->cryptd_tfm->base);
+
+ return crypto_ablkcipher_encrypt(cryptd_req);
+ } else {
+ return __ablk_encrypt(req);
+ }
+}
+
+static int ablk_decrypt(struct ablkcipher_request *req)
+{
+ struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(req);
+ struct async_twofish_ctx *ctx = crypto_ablkcipher_ctx(tfm);
+
+ if (!irq_fpu_usable()) {
+ struct ablkcipher_request *cryptd_req =
+ ablkcipher_request_ctx(req);
+
+ memcpy(cryptd_req, req, sizeof(*req));
+ ablkcipher_request_set_tfm(cryptd_req, &ctx->cryptd_tfm->base);
+
+ return crypto_ablkcipher_decrypt(cryptd_req);
+ } else {
+ struct blkcipher_desc desc;
+
+ desc.tfm = cryptd_ablkcipher_child(ctx->cryptd_tfm);
+ desc.info = req->info;
+ desc.flags = 0;
+
+ return crypto_blkcipher_crt(desc.tfm)->decrypt(
+ &desc, req->dst, req->src, req->nbytes);
+ }
+}
+
+static void ablk_exit(struct crypto_tfm *tfm)
+{
+ struct async_twofish_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ cryptd_free_ablkcipher(ctx->cryptd_tfm);
+}
+
+static int ablk_init(struct crypto_tfm *tfm)
+{
+ struct async_twofish_ctx *ctx = crypto_tfm_ctx(tfm);
+ struct cryptd_ablkcipher *cryptd_tfm;
+ char drv_name[CRYPTO_MAX_ALG_NAME];
+
+ snprintf(drv_name, sizeof(drv_name), "__driver-%s",
+ crypto_tfm_alg_driver_name(tfm));
+
+ cryptd_tfm = cryptd_alloc_ablkcipher(drv_name, 0, 0);
+ if (IS_ERR(cryptd_tfm))
+ return PTR_ERR(cryptd_tfm);
+
+ ctx->cryptd_tfm = cryptd_tfm;
+ tfm->crt_ablkcipher.reqsize = sizeof(struct ablkcipher_request) +
+ crypto_ablkcipher_reqsize(&cryptd_tfm->base);
+
+ return 0;
+}
+
+static struct crypto_alg twofish_algs[10] = { {
+ .cra_name = "__ecb-twofish-avx",
+ .cra_driver_name = "__driver-ecb-twofish-avx",
+ .cra_priority = 0,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[0].cra_list),
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE,
+ .setkey = twofish_setkey,
+ .encrypt = ecb_encrypt,
+ .decrypt = ecb_decrypt,
+ },
+ },
+}, {
+ .cra_name = "__cbc-twofish-avx",
+ .cra_driver_name = "__driver-cbc-twofish-avx",
+ .cra_priority = 0,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[1].cra_list),
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE,
+ .setkey = twofish_setkey,
+ .encrypt = cbc_encrypt,
+ .decrypt = cbc_decrypt,
+ },
+ },
+}, {
+ .cra_name = "__ctr-twofish-avx",
+ .cra_driver_name = "__driver-ctr-twofish-avx",
+ .cra_priority = 0,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[2].cra_list),
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = twofish_setkey,
+ .encrypt = ctr_crypt,
+ .decrypt = ctr_crypt,
+ },
+ },
+}, {
+ .cra_name = "__lrw-twofish-avx",
+ .cra_driver_name = "__driver-lrw-twofish-avx",
+ .cra_priority = 0,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct twofish_lrw_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[3].cra_list),
+ .cra_exit = lrw_exit_tfm,
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE +
+ TF_BLOCK_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE +
+ TF_BLOCK_SIZE,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = lrw_twofish_setkey,
+ .encrypt = lrw_encrypt,
+ .decrypt = lrw_decrypt,
+ },
+ },
+}, {
+ .cra_name = "__xts-twofish-avx",
+ .cra_driver_name = "__driver-xts-twofish-avx",
+ .cra_priority = 0,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct twofish_xts_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_blkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[4].cra_list),
+ .cra_u = {
+ .blkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE * 2,
+ .max_keysize = TF_MAX_KEY_SIZE * 2,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = xts_twofish_setkey,
+ .encrypt = xts_encrypt,
+ .decrypt = xts_decrypt,
+ },
+ },
+}, {
+ .cra_name = "ecb(twofish)",
+ .cra_driver_name = "ecb-twofish-avx",
+ .cra_priority = 400,
+ .cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct async_twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[5].cra_list),
+ .cra_init = ablk_init,
+ .cra_exit = ablk_exit,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE,
+ .setkey = ablk_set_key,
+ .encrypt = ablk_encrypt,
+ .decrypt = ablk_decrypt,
+ },
+ },
+}, {
+ .cra_name = "cbc(twofish)",
+ .cra_driver_name = "cbc-twofish-avx",
+ .cra_priority = 400,
+ .cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct async_twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[6].cra_list),
+ .cra_init = ablk_init,
+ .cra_exit = ablk_exit,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = ablk_set_key,
+ .encrypt = __ablk_encrypt,
+ .decrypt = ablk_decrypt,
+ },
+ },
+}, {
+ .cra_name = "ctr(twofish)",
+ .cra_driver_name = "ctr-twofish-avx",
+ .cra_priority = 400,
+ .cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = 1,
+ .cra_ctxsize = sizeof(struct async_twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[7].cra_list),
+ .cra_init = ablk_init,
+ .cra_exit = ablk_exit,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = ablk_set_key,
+ .encrypt = ablk_encrypt,
+ .decrypt = ablk_encrypt,
+ .geniv = "chainiv",
+ },
+ },
+}, {
+ .cra_name = "lrw(twofish)",
+ .cra_driver_name = "lrw-twofish-avx",
+ .cra_priority = 400,
+ .cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct async_twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[8].cra_list),
+ .cra_init = ablk_init,
+ .cra_exit = ablk_exit,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE +
+ TF_BLOCK_SIZE,
+ .max_keysize = TF_MAX_KEY_SIZE +
+ TF_BLOCK_SIZE,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = ablk_set_key,
+ .encrypt = ablk_encrypt,
+ .decrypt = ablk_decrypt,
+ },
+ },
+}, {
+ .cra_name = "xts(twofish)",
+ .cra_driver_name = "xts-twofish-avx",
+ .cra_priority = 400,
+ .cra_flags = CRYPTO_ALG_TYPE_ABLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = TF_BLOCK_SIZE,
+ .cra_ctxsize = sizeof(struct async_twofish_ctx),
+ .cra_alignmask = 0,
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(twofish_algs[9].cra_list),
+ .cra_init = ablk_init,
+ .cra_exit = ablk_exit,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = TF_MIN_KEY_SIZE * 2,
+ .max_keysize = TF_MAX_KEY_SIZE * 2,
+ .ivsize = TF_BLOCK_SIZE,
+ .setkey = ablk_set_key,
+ .encrypt = ablk_encrypt,
+ .decrypt = ablk_decrypt,
+ },
+ },
+} };
+
+static int __init twofish_init(void)
+{
+ u64 xcr0;
+
+ if (!cpu_has_avx || !cpu_has_osxsave) {
+ printk(KERN_INFO "AVX instructions are not detected.\n");
+ return -ENODEV;
+ }
+
+ xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
+ if ((xcr0 & (XSTATE_SSE | XSTATE_YMM)) != (XSTATE_SSE | XSTATE_YMM)) {
+ printk(KERN_INFO "AVX detected but unusable.\n");
+ return -ENODEV;
+ }
+
+ return crypto_register_algs(twofish_algs, ARRAY_SIZE(twofish_algs));
+}
+
+static void __exit twofish_exit(void)
+{
+ crypto_unregister_algs(twofish_algs, ARRAY_SIZE(twofish_algs));
+}
+
+module_init(twofish_init);
+module_exit(twofish_exit);
+
+MODULE_DESCRIPTION("Twofish Cipher Algorithm, AVX optimized");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("twofish");
diff --git a/arch/x86/crypto/twofish_glue_3way.c b/arch/x86/crypto/twofish_glue_3way.c
index 922ab24..77e4e55 100644
--- a/arch/x86/crypto/twofish_glue_3way.c
+++ b/arch/x86/crypto/twofish_glue_3way.c
@@ -45,8 +45,10 @@ asmlinkage void twofish_dec_blk(struct twofish_ctx *ctx, u8 *dst,
/* 3-way parallel cipher functions */
asmlinkage void __twofish_enc_blk_3way(struct twofish_ctx *ctx, u8 *dst,
const u8 *src, bool xor);
+EXPORT_SYMBOL_GPL(__twofish_enc_blk_3way);
asmlinkage void twofish_dec_blk_3way(struct twofish_ctx *ctx, u8 *dst,
const u8 *src);
+EXPORT_SYMBOL_GPL(twofish_dec_blk_3way);

static inline void twofish_enc_blk_3way(struct twofish_ctx *ctx, u8 *dst,
const u8 *src)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 8e84225..e00a4e4 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -913,6 +913,30 @@ config CRYPTO_TWOFISH_X86_64_3WAY
See also:
<http://www.schneier.com/twofish.html>

+config CRYPTO_TWOFISH_AVX_X86_64
+ tristate "Twofish cipher algorithm (x86_64/AVX)"
+ depends on X86 && 64BIT
+ select CRYPTO_ALGAPI
+ select CRYPTO_CRYPTD
+ select CRYPTO_TWOFISH_COMMON
+ select CRYPTO_TWOFISH_X86_64
+ select CRYPTO_TWOFISH_X86_64_3WAY
+ select CRYPTO_LRW
+ select CRYPTO_XTS
+ help
+ Twofish cipher algorithm (x86_64/AVX).
+
+ Twofish was submitted as an AES (Advanced Encryption Standard)
+ candidate cipher by researchers at CounterPane Systems. It is a
+ 16 round block cipher supporting key sizes of 128, 192, and 256
+ bits.
+
+ This module provides the Twofish cipher algorithm that processes
+ eight blocks in parallel using the AVX instruction set.
+
+ See also:
+ <http://www.schneier.com/twofish.html>
+
comment "Compression"

config CRYPTO_DEFLATE
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index 8f147bf..69b6fb1 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -1563,6 +1563,29 @@ static int do_test(int m)
speed_template_32_64);
break;

+ case 504:
+ test_acipher_speed("ecb(twofish)", ENCRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("ecb(twofish)", DECRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("cbc(twofish)", ENCRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("cbc(twofish)", DECRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("ctr(twofish)", ENCRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("ctr(twofish)", DECRYPT, sec, NULL, 0,
+ speed_template_16_24_32);
+ test_acipher_speed("lrw(twofish)", ENCRYPT, sec, NULL, 0,
+ speed_template_32_40_48);
+ test_acipher_speed("lrw(twofish)", DECRYPT, sec, NULL, 0,
+ speed_template_32_40_48);
+ test_acipher_speed("xts(twofish)", ENCRYPT, sec, NULL, 0,
+ speed_template_32_48_64);
+ test_acipher_speed("xts(twofish)", DECRYPT, sec, NULL, 0,
+ speed_template_32_48_64);
+ break;
+
case 1000:
test_available();
break;
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 5674878..29aadd8 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1549,6 +1549,21 @@ static const struct alg_test_desc alg_test_descs[] = {
}
}
}, {
+ .alg = "__cbc-twofish-avx",
+ .test = alg_test_null,
+ .suite = {
+ .cipher = {
+ .enc = {
+ .vecs = NULL,
+ .count = 0
+ },
+ .dec = {
+ .vecs = NULL,
+ .count = 0
+ }
+ }
+ }
+ }, {
.alg = "__driver-cbc-aes-aesni",
.test = alg_test_null,
.suite = {
@@ -1579,6 +1594,21 @@ static const struct alg_test_desc alg_test_descs[] = {
}
}
}, {
+ .alg = "__driver-cbc-twofish-avx",
+ .test = alg_test_null,
+ .suite = {
+ .cipher = {
+ .enc = {
+ .vecs = NULL,
+ .count = 0
+ },
+ .dec = {
+ .vecs = NULL,
+ .count = 0
+ }
+ }
+ }
+ }, {
.alg = "__driver-ecb-aes-aesni",
.test = alg_test_null,
.suite = {
@@ -1609,6 +1639,21 @@ static const struct alg_test_desc alg_test_descs[] = {
}
}
}, {
+ .alg = "__driver-ecb-twofish-avx",
+ .test = alg_test_null,
+ .suite = {
+ .cipher = {
+ .enc = {
+ .vecs = NULL,
+ .count = 0
+ },
+ .dec = {
+ .vecs = NULL,
+ .count = 0
+ }
+ }
+ }
+ }, {
.alg = "__ghash-pclmulqdqni",
.test = alg_test_null,
.suite = {
@@ -1806,6 +1851,21 @@ static const struct alg_test_desc alg_test_descs[] = {
}
}
}, {
+ .alg = "cryptd(__driver-ecb-twofish-avx)",
+ .test = alg_test_null,
+ .suite = {
+ .cipher = {
+ .enc = {
+ .vecs = NULL,
+ .count = 0
+ },
+ .dec = {
+ .vecs = NULL,
+ .count = 0
+ }
+ }
+ }
+ }, {
.alg = "cryptd(__ghash-pclmulqdqni)",
.test = alg_test_null,
.suite = {
--
1.7.2.5
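
The AVX check in twofish_init() above can also be reproduced from user space. The following is a minimal sketch, not part of the patch: the function name avx_usable() and the macro names are made up for illustration, and it assumes GCC or Clang (for <cpuid.h> and inline assembly). It tests the same two conditions: the CPU reports AVX and OSXSAVE, and XCR0 has both the SSE and YMM state bits enabled.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Illustrative names; bit positions per CPUID leaf 1 (ECX) and XCR0. */
#define CPUID1_ECX_OSXSAVE (1u << 27)
#define CPUID1_ECX_AVX     (1u << 28)
#define XCR0_SSE           (1u << 1)
#define XCR0_YMM           (1u << 2)

/* Returns 1 if AVX is present and the OS manages YMM state, else 0. */
int avx_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;
	uint32_t xcr0_lo, xcr0_hi;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;

	/* XGETBV faults unless OSXSAVE is set, so check it before use. */
	if (!(ecx & CPUID1_ECX_AVX) || !(ecx & CPUID1_ECX_OSXSAVE))
		return 0;

	/* Read XCR0: XGETBV with ECX = 0 returns it in EDX:EAX. */
	__asm__ volatile("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));

	return (xcr0_lo & (XCR0_SSE | XCR0_YMM)) == (XCR0_SSE | XCR0_YMM);
#else
	return 0; /* not an x86 build: AVX cannot be used */
#endif
}
```

The order of the checks matters: XCR0 is only read after OSXSAVE has been confirmed, which matches the order used in twofish_init().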


2012-05-28 06:25:09

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Johannes Goetzfried
<[email protected]>:

> This patch adds a x86_64/avx assembler implementation of the Twofish block
> cipher. The implementation processes eight blocks in parallel (two 4 block
> chunk AVX operations). The table-lookups are done in general-purpose
> registers.
> For small blocksizes the 3way-parallel functions from the twofish-x86_64-3way
> module are called. A good performance increase is provided for blocksizes
> greater or equal to 128B.
>
> Patch has been tested with tcrypt and automated filesystem tests.
>

It would be beneficial to expand the twofish vectors in
crypto/testmgr.h from 3 blocks to 8 blocks so that the 8-way
algorithm(s) can be checked at runtime. And while expanding the
test vectors, why not just expand to 16 blocks... AVX2 is just one
year away:

https://github.com/jkivilin/crypto-avx2/commit/1a72d7a6a1553aee70ad4b6a1980ca372181f40d
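
To illustrate why 3-block vectors cannot reach the 8-way code, here is a minimal sketch. It assumes the glue code consumes a request in 8-block chunks first, then 3-block chunks, then single blocks, as the patch description suggests. The struct and function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TF_BLOCK_SIZE       16 /* Twofish block size in bytes */
#define AVX_PARALLEL_BLOCKS  8 /* per the patch description */
#define THREEWAY_BLOCKS      3

/* Hypothetical helper type: how many calls each code path would get. */
struct chunk_counts {
	size_t avx8;   /* calls into the 8-way AVX routine */
	size_t way3;   /* calls into the 3-way routine */
	size_t single; /* single-block calls */
};

/* Split a request of nbytes greedily: 8-way, then 3-way, then 1-way. */
struct chunk_counts split_request(size_t nbytes)
{
	struct chunk_counts c = { 0, 0, 0 };
	size_t blocks = nbytes / TF_BLOCK_SIZE;

	c.avx8 = blocks / AVX_PARALLEL_BLOCKS;
	blocks %= AVX_PARALLEL_BLOCKS;
	c.way3 = blocks / THREEWAY_BLOCKS;
	blocks %= THREEWAY_BLOCKS;
	c.single = blocks;

	return c;
}
```

Under this assumption, split_request(48) (a 3-block vector) gives avx8 == 0, so the AVX routine is never called, while split_request(128) gives avx8 == 1.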

>
> Tcrypt benchmark results:
>
> Intel Core i5-2500 CPU (fam:6, model:42, step:7)

<snip>

> +/*
> + * Glue Code for AVX assembler version of Twofish Cipher
> + *
> + * Copyright (C) 2012 Johannes Goetzfried
> + * <[email protected]>
> + *
> + * Glue code based on twofish_sse2_glue.c by:
> + * Copyright (C) 2011 Jussi Kivilinna <[email protected]>

I think you mean serpent_sse2_glue.c :)

-Jussi

2012-05-28 13:52:03

by Johannes Goetzfried

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Hello,

> It would be beneficial to expand the twofish vectors in
> crypto/testmgr.h from 3 blocks to 8 blocks so that the 8-way
> algorithm(s) can be checked at runtime. And while expanding the
> test vectors, why not just expand to 16 blocks... AVX2 is just one
> year away:
> https://github.com/jkivilin/crypto-avx2/commit/1a72d7a6a1553aee70ad4b6a1980ca372181f40d

that's a good idea. Thank you for the link to your commit. I will send this as
an extra patch.

> I think you mean serpent_sse2_glue.c :)

Yeah, that's right, I replaced a bit too much *g* I will resend the patch.

- Johannes

2012-08-15 08:42:19

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Johannes Goetzfried
<[email protected]>:

> This patch adds a x86_64/avx assembler implementation of the Twofish block
> cipher. The implementation processes eight blocks in parallel (two 4 block
> chunk AVX operations). The table-lookups are done in general-purpose
> registers.
> For small blocksizes the 3way-parallel functions from the twofish-x86_64-3way
> module are called. A good performance increase is provided for blocksizes
> greater or equal to 128B.
>
> Patch has been tested with tcrypt and automated filesystem tests.
>
> Tcrypt benchmark results:
>
> Intel Core i5-2500 CPU (fam:6, model:42, step:7)

I started thinking about the performance on AMD Bulldozer.
vmovq/vmovd/vpextr*/vpinsr* between FPU and general-purpose registers
are a lot slower on AMD CPUs (latencies from 8 to 12 cycles) than on
Intel Sandy Bridge (where these instructions have a latency of 1 to 2
cycles). See:
http://www.agner.org/optimize/instruction_tables.pdf

It would be really good if the implementation could be tested on an AMD
CPU to determine whether it causes a performance regression. However,
I don't have access to a machine with such a CPU.

-Jussi

2012-08-15 09:28:04

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote:
> I started thinking about the performance on AMD Bulldozer.
> vmovq/vmovd/vpextr*/vpinsr* between FPU and general-purpose registers
> are a lot slower on AMD CPUs (latencies from 8 to 12 cycles) than on
> Intel Sandy Bridge (where these instructions have a latency of 1 to 2
> cycles). See:
> http://www.agner.org/optimize/instruction_tables.pdf
>
> It would be really good if the implementation could be tested on an AMD
> CPU to determine whether it causes a performance regression. However,
> I don't have access to a machine with such a CPU.

But I do. :)

And if you tell me exactly how to run the tests and on what kernel, I'll
try to do so.

HTH.

--
Regards/Gruss,
Boris.

2012-08-15 11:00:16

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

> On Wed, Aug 15, 2012 at 11:42:16AM +0300, Jussi Kivilinna wrote:
>> I started thinking about the performance on AMD Bulldozer.
>> vmovq/vmovd/vpextr*/vpinsr* between FPU and general-purpose registers
>> are a lot slower on AMD CPUs (latencies from 8 to 12 cycles) than on
>> Intel Sandy Bridge (where these instructions have a latency of 1 to 2
>> cycles). See:
>> http://www.agner.org/optimize/instruction_tables.pdf
>>
>> It would be really good if the implementation could be tested on an AMD
>> CPU to determine whether it causes a performance regression. However,
>> I don't have access to a machine with such a CPU.
>
> But I do. :)
>
> And if you tell me exactly how to run the tests and on what kernel, I'll
> try to do so.
>

Twofish-avx (CONFIG_TWOFISH_AVX_X86_64) is available in 3.6-rc1. For
testing you need CRYPTO_TEST built as a module. You should turn off
turbo-core, freq-scaling, etc.

Testing twofish-avx ('async twofish' speed test):
modprobe twofish-avx-x86_64
modprobe tcrypt mode=504 sec=1

Testing twofish-x86_64-3way ('sync twofish' speed test):
modprobe twofish-x86_64-3way
modprobe tcrypt mode=202 sec=1

Loading tcrypt will block until the tests are complete, after which
modprobe will return with an error. This is expected. The results are
in the kernel log.
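
For comparing the two runs, the result lines in the kernel log can be parsed mechanically. The following is a minimal sketch that assumes the tcrypt line format "test N (K bit key, B byte blocks): OPS operations in S seconds (BYTES bytes)". The struct and function names are hypothetical helpers, not part of tcrypt.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one tcrypt speed-test result line. */
struct tcrypt_result {
	unsigned int test;         /* test index within the run */
	unsigned int key_bits;     /* key size in bits */
	unsigned int block_bytes;  /* request size per operation */
	unsigned long ops;         /* operations completed */
	unsigned int secs;         /* measurement interval */
	unsigned long long bytes;  /* total bytes processed */
};

/*
 * Returns 1 and fills *out if the line holds a result, else 0.
 * On failure *out may be partially written.
 */
int parse_tcrypt_line(const char *line, struct tcrypt_result *out)
{
	/* Skip the "[ timestamp]" prefix; header lines ("testing speed
	 * of ...") do not contain "test " followed by a number. */
	const char *p = strstr(line, "test ");

	if (p == NULL)
		return 0;

	return sscanf(p,
		      "test %u (%u bit key, %u byte blocks): "
		      "%lu operations in %u seconds (%llu bytes)",
		      &out->test, &out->key_bits, &out->block_bytes,
		      &out->ops, &out->secs, &out->bytes) == 6;
}

/* Throughput in MiB/s; returns 0.0 for a zero-length interval. */
double mib_per_second(const struct tcrypt_result *r)
{
	if (r->secs == 0)
		return 0.0;
	return (double)r->bytes / r->secs / (1024.0 * 1024.0);
}
```

Dividing the MiB/s figures of the mode=504 run by those of the mode=202 run for the same key and block size gives ratios in the same form as the benchmark tables in the patch description.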

-Jussi

> HTH.
>
> --
> Regards/Gruss,
> Boris.
>
>

2012-08-15 12:52:25

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Ok, here we go. Raw data below.

On Wed, Aug 15, 2012 at 02:00:16PM +0300, Jussi Kivilinna wrote:
> >And if you tell me exactly how to run the tests and on what kernel,
> >I'll try to do so.

Ok, the box is a single-socket Bulldozer: "AMD FX(tm)-8100 Eight-Core
Processor stepping 02"; the kernel is 3.6.0-rc1+, which is the latest
Linus tree with tip/master merged on top.

> Twofish-avx (CONFIG_TWOFISH_AVX_X86_64) is available in 3.6-rc1. For

I took CONFIG_CRYPTO_TWOFISH_AVX_X86_64 but I'm pretty sure you meant
that.

> testing you need CRYPTO_TEST built as a module. You should turn off
> turbo-core, freq-scaling, etc.

$ for i in $(seq 0 7); do echo "performance" > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor ; done
$ for i in $(seq 0 7); do echo 0 > /sys/devices/system/cpu/cpu$i/cpufreq/cpb ; done

> Testing twofish-avx ('async twofish' speed test):
> modprobe twofish-avx-x86_64
> modprobe tcrypt mode=504 sec=1

$ modprobe twofish-avx-x86_64
$ modprobe tcrypt mode=504 sec=1

[ 224.672094]
[ 224.672094] testing speed of async ecb(twofish) encryption
[ 224.681444] test 0 (128 bit key, 16 byte blocks): 4862478 operations in 1 seconds (77799648 bytes)
[ 225.689190] test 1 (128 bit key, 64 byte blocks): 2040557 operations in 1 seconds (130595648 bytes)
[ 226.695864] test 2 (128 bit key, 256 byte blocks): 564098 operations in 1 seconds (144409088 bytes)
[ 227.702365] test 3 (128 bit key, 1024 byte blocks): 156553 operations in 1 seconds (160310272 bytes)
[ 228.708960] test 4 (128 bit key, 8192 byte blocks): 20128 operations in 1 seconds (164888576 bytes)
[ 229.715485] test 5 (192 bit key, 16 byte blocks): 4853879 operations in 1 seconds (77662064 bytes)
[ 230.722165] test 6 (192 bit key, 64 byte blocks): 2040187 operations in 1 seconds (130571968 bytes)
[ 231.729110] test 7 (192 bit key, 256 byte blocks): 564125 operations in 1 seconds (144416000 bytes)
[ 232.735600] test 8 (192 bit key, 1024 byte blocks): 156231 operations in 1 seconds (159980544 bytes)
[ 233.742205] test 9 (192 bit key, 8192 byte blocks): 19913 operations in 1 seconds (163127296 bytes)
[ 234.748777] test 10 (256 bit key, 16 byte blocks): 4880977 operations in 1 seconds (78095632 bytes)
[ 235.751405] test 11 (256 bit key, 64 byte blocks): 2045621 operations in 1 seconds (130919744 bytes)
[ 236.758079] test 12 (256 bit key, 256 byte blocks): 565273 operations in 1 seconds (144709888 bytes)
[ 237.764579] test 13 (256 bit key, 1024 byte blocks): 156625 operations in 1 seconds (160384000 bytes)
[ 238.771175] test 14 (256 bit key, 8192 byte blocks): 20125 operations in 1 seconds (164864000 bytes)
[ 239.777726]
[ 239.777726] testing speed of async ecb(twofish) decryption
[ 239.787020] test 0 (128 bit key, 16 byte blocks): 4962193 operations in 1 seconds (79395088 bytes)
[ 240.792405] test 1 (128 bit key, 64 byte blocks): 2056765 operations in 1 seconds (131632960 bytes)
[ 241.799070] test 2 (128 bit key, 256 byte blocks): 559384 operations in 1 seconds (143202304 bytes)
[ 242.805568] test 3 (128 bit key, 1024 byte blocks): 153881 operations in 1 seconds (157574144 bytes)
[ 243.812191] test 4 (128 bit key, 8192 byte blocks): 19636 operations in 1 seconds (160858112 bytes)
[ 244.818718] test 5 (192 bit key, 16 byte blocks): 4917689 operations in 1 seconds (78683024 bytes)
[ 245.825408] test 6 (192 bit key, 64 byte blocks): 2056235 operations in 1 seconds (131599040 bytes)
[ 246.832070] test 7 (192 bit key, 256 byte blocks): 560579 operations in 1 seconds (143508224 bytes)
[ 247.838598] test 8 (192 bit key, 1024 byte blocks): 153813 operations in 1 seconds (157504512 bytes)
[ 248.845201] test 9 (192 bit key, 8192 byte blocks): 19411 operations in 1 seconds (159014912 bytes)
[ 249.851755] test 10 (256 bit key, 16 byte blocks): 4932508 operations in 1 seconds (78920128 bytes)
[ 250.858372] test 11 (256 bit key, 64 byte blocks): 2057244 operations in 1 seconds (131663616 bytes)
[ 251.865039] test 12 (256 bit key, 256 byte blocks): 559493 operations in 1 seconds (143230208 bytes)
[ 252.871554] test 13 (256 bit key, 1024 byte blocks): 153980 operations in 1 seconds (157675520 bytes)
[ 253.878159] test 14 (256 bit key, 8192 byte blocks): 19665 operations in 1 seconds (161095680 bytes)
[ 254.884711]
[ 254.884711] testing speed of async cbc(twofish) encryption
[ 254.898925] test 0 (128 bit key, 16 byte blocks): 5194404 operations in 1 seconds (83110464 bytes)
[ 255.907087] test 1 (128 bit key, 64 byte blocks): 1916243 operations in 1 seconds (122639552 bytes)
[ 256.913758] test 2 (128 bit key, 256 byte blocks): 541282 operations in 1 seconds (138568192 bytes)
[ 257.916278] test 3 (128 bit key, 1024 byte blocks): 141389 operations in 1 seconds (144782336 bytes)
[ 258.918865] test 4 (128 bit key, 8192 byte blocks): 17811 operations in 1 seconds (145907712 bytes)
[ 259.925372] test 5 (192 bit key, 16 byte blocks): 5176387 operations in 1 seconds (82822192 bytes)
[ 260.932038] test 6 (192 bit key, 64 byte blocks): 1916300 operations in 1 seconds (122643200 bytes)
[ 261.938693] test 7 (192 bit key, 256 byte blocks): 542642 operations in 1 seconds (138916352 bytes)
[ 262.945201] test 8 (192 bit key, 1024 byte blocks): 141318 operations in 1 seconds (144709632 bytes)
[ 263.952090] test 9 (192 bit key, 8192 byte blocks): 17681 operations in 1 seconds (144842752 bytes)
[ 264.958650] test 10 (256 bit key, 16 byte blocks): 5174239 operations in 1 seconds (82787824 bytes)
[ 265.965289] test 11 (256 bit key, 64 byte blocks): 1909023 operations in 1 seconds (122177472 bytes)
[ 266.971660] test 12 (256 bit key, 256 byte blocks): 541859 operations in 1 seconds (138715904 bytes)
[ 267.978471] test 13 (256 bit key, 1024 byte blocks): 141247 operations in 1 seconds (144636928 bytes)
[ 268.985066] test 14 (256 bit key, 8192 byte blocks): 17808 operations in 1 seconds (145883136 bytes)
[ 269.991595]
[ 269.991595] testing speed of async cbc(twofish) decryption
[ 270.001048] test 0 (128 bit key, 16 byte blocks): 4914615 operations in 1 seconds (78633840 bytes)
[ 271.006285] test 1 (128 bit key, 64 byte blocks): 1986798 operations in 1 seconds (127155072 bytes)
[ 272.012949] test 2 (128 bit key, 256 byte blocks): 536765 operations in 1 seconds (137411840 bytes)
[ 273.019467] test 3 (128 bit key, 1024 byte blocks): 148321 operations in 1 seconds (151880704 bytes)
[ 274.026071] test 4 (128 bit key, 8192 byte blocks): 18928 operations in 1 seconds (155058176 bytes)
[ 275.032578] test 5 (192 bit key, 16 byte blocks): 4912929 operations in 1 seconds (78606864 bytes)
[ 276.039252] test 6 (192 bit key, 64 byte blocks): 1980857 operations in 1 seconds (126774848 bytes)
[ 277.045915] test 7 (192 bit key, 256 byte blocks): 533058 operations in 1 seconds (136462848 bytes)
[ 278.052433] test 8 (192 bit key, 1024 byte blocks): 147262 operations in 1 seconds (150796288 bytes)
[ 279.059038] test 9 (192 bit key, 8192 byte blocks): 18619 operations in 1 seconds (152526848 bytes)
[ 280.065555] test 10 (256 bit key, 16 byte blocks): 4889191 operations in 1 seconds (78227056 bytes)
[ 281.072228] test 11 (256 bit key, 64 byte blocks): 1981910 operations in 1 seconds (126842240 bytes)
[ 282.078902] test 12 (256 bit key, 256 byte blocks): 539723 operations in 1 seconds (138169088 bytes)
[ 283.081401] test 13 (256 bit key, 1024 byte blocks): 148718 operations in 1 seconds (152287232 bytes)
[ 284.083999] test 14 (256 bit key, 8192 byte blocks): 18967 operations in 1 seconds (155377664 bytes)
[ 285.090559]
[ 285.090559] testing speed of async ctr(twofish) encryption
[ 285.104630] test 0 (128 bit key, 16 byte blocks): 4582435 operations in 1 seconds (73318960 bytes)
[ 286.113221] test 1 (128 bit key, 64 byte blocks): 1948842 operations in 1 seconds (124725888 bytes)
[ 287.119875] test 2 (128 bit key, 256 byte blocks): 545866 operations in 1 seconds (139741696 bytes)
[ 288.126400] test 3 (128 bit key, 1024 byte blocks): 148249 operations in 1 seconds (151806976 bytes)
[ 289.133004] test 4 (128 bit key, 8192 byte blocks): 18970 operations in 1 seconds (155402240 bytes)
[ 290.139504] test 5 (192 bit key, 16 byte blocks): 4537518 operations in 1 seconds (72600288 bytes)
[ 291.146177] test 6 (192 bit key, 64 byte blocks): 1935999 operations in 1 seconds (123903936 bytes)
[ 292.152852] test 7 (192 bit key, 256 byte blocks): 537517 operations in 1 seconds (137604352 bytes)
[ 293.159351] test 8 (192 bit key, 1024 byte blocks): 147055 operations in 1 seconds (150584320 bytes)
[ 294.165963] test 9 (192 bit key, 8192 byte blocks): 18823 operations in 1 seconds (154198016 bytes)
[ 295.172516] test 10 (256 bit key, 16 byte blocks): 4351876 operations in 1 seconds (69630016 bytes)
[ 296.179154] test 11 (256 bit key, 64 byte blocks): 1957846 operations in 1 seconds (125302144 bytes)
[ 297.185818] test 12 (256 bit key, 256 byte blocks): 540281 operations in 1 seconds (138311936 bytes)
[ 298.192327] test 13 (256 bit key, 1024 byte blocks): 147917 operations in 1 seconds (151467008 bytes)
[ 299.198913] test 14 (256 bit key, 8192 byte blocks): 19127 operations in 1 seconds (156688384 bytes)
[ 300.205443]
[ 300.205443] testing speed of async ctr(twofish) decryption
[ 300.214834] test 0 (128 bit key, 16 byte blocks): 4527967 operations in 1 seconds (72447472 bytes)
[ 301.220136] test 1 (128 bit key, 64 byte blocks): 1949170 operations in 1 seconds (124746880 bytes)
[ 302.226792] test 2 (128 bit key, 256 byte blocks): 539500 operations in 1 seconds (138112000 bytes)
[ 303.233301] test 3 (128 bit key, 1024 byte blocks): 147991 operations in 1 seconds (151542784 bytes)
[ 304.239914] test 4 (128 bit key, 8192 byte blocks): 18995 operations in 1 seconds (155607040 bytes)
[ 305.246442] test 5 (192 bit key, 16 byte blocks): 4567525 operations in 1 seconds (73080400 bytes)
[ 306.249105] test 6 (192 bit key, 64 byte blocks): 1939242 operations in 1 seconds (124111488 bytes)
[ 307.251763] test 7 (192 bit key, 256 byte blocks): 537004 operations in 1 seconds (137473024 bytes)
[ 308.258272] test 8 (192 bit key, 1024 byte blocks): 147203 operations in 1 seconds (150735872 bytes)
[ 309.264884] test 9 (192 bit key, 8192 byte blocks): 18861 operations in 1 seconds (154509312 bytes)
[ 310.271428] test 10 (256 bit key, 16 byte blocks): 4390731 operations in 1 seconds (70251696 bytes)
[ 311.278075] test 11 (256 bit key, 64 byte blocks): 1961134 operations in 1 seconds (125512576 bytes)
[ 312.284729] test 12 (256 bit key, 256 byte blocks): 540294 operations in 1 seconds (138315264 bytes)
[ 313.291239] test 13 (256 bit key, 1024 byte blocks): 148623 operations in 1 seconds (152189952 bytes)
[ 314.297834] test 14 (256 bit key, 8192 byte blocks): 19020 operations in 1 seconds (155811840 bytes)
[ 315.304393]
[ 315.304393] testing speed of async lrw(twofish) encryption
[ 315.318957] test 0 (256 bit key, 16 byte blocks): 3469489 operations in 1 seconds (55511824 bytes)
[ 316.326743] test 1 (256 bit key, 64 byte blocks): 1608603 operations in 1 seconds (102950592 bytes)
[ 317.333414] test 2 (256 bit key, 256 byte blocks): 465927 operations in 1 seconds (119277312 bytes)
[ 318.339930] test 3 (256 bit key, 1024 byte blocks): 128940 operations in 1 seconds (132034560 bytes)
[ 319.346534] test 4 (256 bit key, 8192 byte blocks): 16585 operations in 1 seconds (135864320 bytes)
[ 320.353078] test 5 (320 bit key, 16 byte blocks): 3377257 operations in 1 seconds (54036112 bytes)
[ 321.359717] test 6 (320 bit key, 64 byte blocks): 1603153 operations in 1 seconds (102601792 bytes)
[ 322.366400] test 7 (320 bit key, 256 byte blocks): 458261 operations in 1 seconds (117314816 bytes)
[ 323.372916] test 8 (320 bit key, 1024 byte blocks): 128620 operations in 1 seconds (131706880 bytes)
[ 324.379485] test 9 (320 bit key, 8192 byte blocks): 16413 operations in 1 seconds (134455296 bytes)
[ 325.386011] test 10 (384 bit key, 16 byte blocks): 3532266 operations in 1 seconds (56516256 bytes)
[ 326.392692] test 11 (384 bit key, 64 byte blocks): 1589841 operations in 1 seconds (101749824 bytes)
[ 327.399356] test 12 (384 bit key, 256 byte blocks): 461842 operations in 1 seconds (118231552 bytes)
[ 328.405866] test 13 (384 bit key, 1024 byte blocks): 129080 operations in 1 seconds (132177920 bytes)
[ 329.412472] test 14 (384 bit key, 8192 byte blocks): 16629 operations in 1 seconds (136224768 bytes)
[ 330.415047]
[ 330.415047] testing speed of async lrw(twofish) decryption
[ 330.415051] test 0 (256 bit key, 16 byte blocks): 3407370 operations in 1 seconds (54517920 bytes)
[ 331.417671] test 1 (256 bit key, 64 byte blocks): 1616321 operations in 1 seconds (103444544 bytes)
[ 332.424354] test 2 (256 bit key, 256 byte blocks): 458505 operations in 1 seconds (117377280 bytes)
[ 333.430870] test 3 (256 bit key, 1024 byte blocks): 126675 operations in 1 seconds (129715200 bytes)
[ 334.437790] test 4 (256 bit key, 8192 byte blocks): 16239 operations in 1 seconds (133029888 bytes)
[ 335.444028] test 5 (320 bit key, 16 byte blocks): 3572964 operations in 1 seconds (57167424 bytes)
[ 336.450960] test 6 (320 bit key, 64 byte blocks): 1594182 operations in 1 seconds (102027648 bytes)
[ 337.457616] test 7 (320 bit key, 256 byte blocks): 459795 operations in 1 seconds (117707520 bytes)
[ 338.464141] test 8 (320 bit key, 1024 byte blocks): 126568 operations in 1 seconds (129605632 bytes)
[ 339.470746] test 9 (320 bit key, 8192 byte blocks): 16016 operations in 1 seconds (131203072 bytes)
[ 340.477280] test 10 (384 bit key, 16 byte blocks): 3481392 operations in 1 seconds (55702272 bytes)
[ 341.483944] test 11 (384 bit key, 64 byte blocks): 1611309 operations in 1 seconds (103123776 bytes)
[ 342.490591] test 12 (384 bit key, 256 byte blocks): 458111 operations in 1 seconds (117276416 bytes)
[ 343.497109] test 13 (384 bit key, 1024 byte blocks): 126501 operations in 1 seconds (129537024 bytes)
[ 344.503696] test 14 (384 bit key, 8192 byte blocks): 16251 operations in 1 seconds (133128192 bytes)
[ 345.510217]
[ 345.510217] testing speed of async xts(twofish) encryption
[ 345.524414] test 0 (256 bit key, 16 byte blocks): 3107202 operations in 1 seconds (49715232 bytes)
[ 346.532927] test 1 (256 bit key, 64 byte blocks): 1585412 operations in 1 seconds (101466368 bytes)
[ 347.539278] test 2 (256 bit key, 256 byte blocks): 487146 operations in 1 seconds (124709376 bytes)
[ 348.546099] test 3 (256 bit key, 1024 byte blocks): 137897 operations in 1 seconds (141206528 bytes)
[ 349.552720] test 4 (256 bit key, 8192 byte blocks): 18001 operations in 1 seconds (147464192 bytes)
[ 350.559245] test 5 (384 bit key, 16 byte blocks): 3094509 operations in 1 seconds (49512144 bytes)
[ 351.565900] test 6 (384 bit key, 64 byte blocks): 1585673 operations in 1 seconds (101483072 bytes)
[ 352.572557] test 7 (384 bit key, 256 byte blocks): 484334 operations in 1 seconds (123989504 bytes)
[ 353.579076] test 8 (384 bit key, 1024 byte blocks): 138064 operations in 1 seconds (141377536 bytes)
[ 354.581689] test 9 (384 bit key, 8192 byte blocks): 18021 operations in 1 seconds (147628032 bytes)
[ 355.584216] test 10 (512 bit key, 16 byte blocks): 3166517 operations in 1 seconds (50664272 bytes)
[ 356.590881] test 11 (512 bit key, 64 byte blocks): 1593724 operations in 1 seconds (101998336 bytes)
[ 357.597536] test 12 (512 bit key, 256 byte blocks): 487015 operations in 1 seconds (124675840 bytes)
[ 358.604045] test 13 (512 bit key, 1024 byte blocks): 138101 operations in 1 seconds (141415424 bytes)
[ 359.610641] test 14 (512 bit key, 8192 byte blocks): 17990 operations in 1 seconds (147374080 bytes)
[ 360.617193]
[ 360.617193] testing speed of async xts(twofish) decryption
[ 360.626573] test 0 (256 bit key, 16 byte blocks): 3107491 operations in 1 seconds (49719856 bytes)
[ 361.631845] test 1 (256 bit key, 64 byte blocks): 1542680 operations in 1 seconds (98731520 bytes)
[ 362.638423] test 2 (256 bit key, 256 byte blocks): 481115 operations in 1 seconds (123165440 bytes)
[ 363.645036] test 3 (256 bit key, 1024 byte blocks): 136886 operations in 1 seconds (140171264 bytes)
[ 364.651630] test 4 (256 bit key, 8192 byte blocks): 17624 operations in 1 seconds (144375808 bytes)
[ 365.658140] test 5 (384 bit key, 16 byte blocks): 3112081 operations in 1 seconds (49793296 bytes)
[ 366.664511] test 6 (384 bit key, 64 byte blocks): 1544403 operations in 1 seconds (98841792 bytes)
[ 367.671383] test 7 (384 bit key, 256 byte blocks): 481335 operations in 1 seconds (123221760 bytes)
[ 368.677986] test 8 (384 bit key, 1024 byte blocks): 136897 operations in 1 seconds (140182528 bytes)
[ 369.684600] test 9 (384 bit key, 8192 byte blocks): 17612 operations in 1 seconds (144277504 bytes)
[ 370.691109] test 10 (512 bit key, 16 byte blocks): 3199446 operations in 1 seconds (51191136 bytes)
[ 371.697798] test 11 (512 bit key, 64 byte blocks): 1569564 operations in 1 seconds (100452096 bytes)
[ 372.704454] test 12 (512 bit key, 256 byte blocks): 482158 operations in 1 seconds (123432448 bytes)
[ 373.710955] test 13 (512 bit key, 1024 byte blocks): 136846 operations in 1 seconds (140130304 bytes)
[ 374.717549] test 14 (512 bit key, 8192 byte blocks): 17522 operations in 1 seconds (143540224 bytes)

> Testing twofish-x86_64-3way ('sync twofish' speed test):
> modprobe twofish-x86_64-3way
> modprobe tcrypt mode=202 sec=1

$ modprobe twofish-x86_64-3way
$ modprobe tcrypt mode=202 sec=1

[ 841.095600]
[ 841.095600] testing speed of ecb(twofish) encryption
[ 841.103893] test 0 (128 bit key, 16 byte blocks): 5059409 operations in 1 seconds (80950544 bytes)
[ 842.105260] test 1 (128 bit key, 64 byte blocks): 2093363 operations in 1 seconds (133975232 bytes)
[ 843.111943] test 2 (128 bit key, 256 byte blocks): 610543 operations in 1 seconds (156299008 bytes)
[ 844.118754] test 3 (128 bit key, 1024 byte blocks): 161042 operations in 1 seconds (164907008 bytes)
[ 845.125367] test 4 (128 bit key, 8192 byte blocks): 20397 operations in 1 seconds (167092224 bytes)
[ 846.131876] test 5 (192 bit key, 16 byte blocks): 4967411 operations in 1 seconds (79478576 bytes)
[ 847.138548] test 6 (192 bit key, 64 byte blocks): 2081577 operations in 1 seconds (133220928 bytes)
[ 848.145213] test 7 (192 bit key, 256 byte blocks): 612129 operations in 1 seconds (156705024 bytes)
[ 849.151731] test 8 (192 bit key, 1024 byte blocks): 161409 operations in 1 seconds (165282816 bytes)
[ 850.158335] test 9 (192 bit key, 8192 byte blocks): 20228 operations in 1 seconds (165707776 bytes)
[ 851.164844] test 10 (256 bit key, 16 byte blocks): 4968195 operations in 1 seconds (79491120 bytes)
[ 852.171533] test 11 (256 bit key, 64 byte blocks): 2083566 operations in 1 seconds (133348224 bytes)
[ 853.178189] test 12 (256 bit key, 256 byte blocks): 611680 operations in 1 seconds (156590080 bytes)
[ 854.184697] test 13 (256 bit key, 1024 byte blocks): 161160 operations in 1 seconds (165027840 bytes)
[ 855.191294] test 14 (256 bit key, 8192 byte blocks): 20400 operations in 1 seconds (167116800 bytes)
[ 856.197847]
[ 856.197847] testing speed of ecb(twofish) decryption
[ 856.206729] test 0 (128 bit key, 16 byte blocks): 4975693 operations in 1 seconds (79611088 bytes)
[ 857.212507] test 1 (128 bit key, 64 byte blocks): 2072003 operations in 1 seconds (132608192 bytes)
[ 858.219170] test 2 (128 bit key, 256 byte blocks): 611965 operations in 1 seconds (156663040 bytes)
[ 859.225681] test 3 (128 bit key, 1024 byte blocks): 161027 operations in 1 seconds (164891648 bytes)
[ 860.232294] test 4 (128 bit key, 8192 byte blocks): 20348 operations in 1 seconds (166690816 bytes)
[ 861.238838] test 5 (192 bit key, 16 byte blocks): 4953128 operations in 1 seconds (79250048 bytes)
[ 862.245476] test 6 (192 bit key, 64 byte blocks): 2070776 operations in 1 seconds (132529664 bytes)
[ 863.252132] test 7 (192 bit key, 256 byte blocks): 611045 operations in 1 seconds (156427520 bytes)
[ 864.258639] test 8 (192 bit key, 1024 byte blocks): 160815 operations in 1 seconds (164674560 bytes)
[ 865.265271] test 9 (192 bit key, 8192 byte blocks): 20144 operations in 1 seconds (165019648 bytes)
[ 866.267824] test 10 (256 bit key, 16 byte blocks): 4970527 operations in 1 seconds (79528432 bytes)
[ 867.270444] test 11 (256 bit key, 64 byte blocks): 2073117 operations in 1 seconds (132679488 bytes)
[ 868.277128] test 12 (256 bit key, 256 byte blocks): 612096 operations in 1 seconds (156696576 bytes)
[ 869.283628] test 13 (256 bit key, 1024 byte blocks): 160923 operations in 1 seconds (164785152 bytes)
[ 870.290213] test 14 (256 bit key, 8192 byte blocks): 20333 operations in 1 seconds (166567936 bytes)
[ 871.296741]
[ 871.296741] testing speed of cbc(twofish) encryption
[ 871.305656] test 0 (128 bit key, 16 byte blocks): 5219296 operations in 1 seconds (83508736 bytes)
[ 872.311449] test 1 (128 bit key, 64 byte blocks): 1924062 operations in 1 seconds (123139968 bytes)
[ 873.317800] test 2 (128 bit key, 256 byte blocks): 543826 operations in 1 seconds (139219456 bytes)
[ 874.324307] test 3 (128 bit key, 1024 byte blocks): 141437 operations in 1 seconds (144831488 bytes)
[ 875.330902] test 4 (128 bit key, 8192 byte blocks): 17831 operations in 1 seconds (146071552 bytes)
[ 876.337439] test 5 (192 bit key, 16 byte blocks): 5208718 operations in 1 seconds (83339488 bytes)
[ 877.344101] test 6 (192 bit key, 64 byte blocks): 1920005 operations in 1 seconds (122880320 bytes)
[ 878.350767] test 7 (192 bit key, 256 byte blocks): 543963 operations in 1 seconds (139254528 bytes)
[ 879.357265] test 8 (192 bit key, 1024 byte blocks): 141507 operations in 1 seconds (144903168 bytes)
[ 880.363889] test 9 (192 bit key, 8192 byte blocks): 17685 operations in 1 seconds (144875520 bytes)
[ 881.370413] test 10 (256 bit key, 16 byte blocks): 5186062 operations in 1 seconds (82976992 bytes)
[ 882.377078] test 11 (256 bit key, 64 byte blocks): 1909259 operations in 1 seconds (122192576 bytes)
[ 883.383725] test 12 (256 bit key, 256 byte blocks): 543371 operations in 1 seconds (139102976 bytes)
[ 884.390250] test 13 (256 bit key, 1024 byte blocks): 141395 operations in 1 seconds (144788480 bytes)
[ 885.396838] test 14 (256 bit key, 8192 byte blocks): 17823 operations in 1 seconds (146006016 bytes)
[ 886.403391]
[ 886.403391] testing speed of cbc(twofish) decryption
[ 886.411632] test 0 (128 bit key, 16 byte blocks): 5012934 operations in 1 seconds (80206944 bytes)
[ 887.418033] test 1 (128 bit key, 64 byte blocks): 2025951 operations in 1 seconds (129660864 bytes)
[ 888.424706] test 2 (128 bit key, 256 byte blocks): 596675 operations in 1 seconds (152748800 bytes)
[ 889.431233] test 3 (128 bit key, 1024 byte blocks): 156569 operations in 1 seconds (160326656 bytes)
[ 890.433868] test 4 (128 bit key, 8192 byte blocks): 19783 operations in 1 seconds (162062336 bytes)
[ 891.436382] test 5 (192 bit key, 16 byte blocks): 4999583 operations in 1 seconds (79993328 bytes)
[ 892.443032] test 6 (192 bit key, 64 byte blocks): 2025099 operations in 1 seconds (129606336 bytes)
[ 893.449696] test 7 (192 bit key, 256 byte blocks): 593294 operations in 1 seconds (151883264 bytes)
[ 894.456204] test 8 (192 bit key, 1024 byte blocks): 156223 operations in 1 seconds (159972352 bytes)
[ 895.462798] test 9 (192 bit key, 8192 byte blocks): 19560 operations in 1 seconds (160235520 bytes)
[ 896.469351] test 10 (256 bit key, 16 byte blocks): 5002391 operations in 1 seconds (80038256 bytes)
[ 897.475997] test 11 (256 bit key, 64 byte blocks): 2021338 operations in 1 seconds (129365632 bytes)
[ 898.482681] test 12 (256 bit key, 256 byte blocks): 597158 operations in 1 seconds (152872448 bytes)
[ 899.489171] test 13 (256 bit key, 1024 byte blocks): 156466 operations in 1 seconds (160221184 bytes)
[ 900.495775] test 14 (256 bit key, 8192 byte blocks): 19748 operations in 1 seconds (161775616 bytes)
[ 901.502295]
[ 901.502295] testing speed of ctr(twofish) encryption
[ 901.510534] test 0 (128 bit key, 16 byte blocks): 4775185 operations in 1 seconds (76402960 bytes)
[ 902.516972] test 1 (128 bit key, 64 byte blocks): 1969757 operations in 1 seconds (126064448 bytes)
[ 903.523636] test 2 (128 bit key, 256 byte blocks): 596735 operations in 1 seconds (152764160 bytes)
[ 904.530162] test 3 (128 bit key, 1024 byte blocks): 157023 operations in 1 seconds (160791552 bytes)
[ 905.536756] test 4 (128 bit key, 8192 byte blocks): 19844 operations in 1 seconds (162562048 bytes)
[ 906.543299] test 5 (192 bit key, 16 byte blocks): 4802348 operations in 1 seconds (76837568 bytes)
[ 907.549938] test 6 (192 bit key, 64 byte blocks): 1977219 operations in 1 seconds (126542016 bytes)
[ 908.556613] test 7 (192 bit key, 256 byte blocks): 595537 operations in 1 seconds (152457472 bytes)
[ 909.563121] test 8 (192 bit key, 1024 byte blocks): 156491 operations in 1 seconds (160246784 bytes)
[ 910.569725] test 9 (192 bit key, 8192 byte blocks): 19541 operations in 1 seconds (160079872 bytes)
[ 911.576270] test 10 (256 bit key, 16 byte blocks): 4860804 operations in 1 seconds (77772864 bytes)
[ 912.582924] test 11 (256 bit key, 64 byte blocks): 1980010 operations in 1 seconds (126720640 bytes)
[ 913.589589] test 12 (256 bit key, 256 byte blocks): 597238 operations in 1 seconds (152892928 bytes)
[ 914.596105] test 13 (256 bit key, 1024 byte blocks): 157162 operations in 1 seconds (160933888 bytes)
[ 915.598703] test 14 (256 bit key, 8192 byte blocks): 19832 operations in 1 seconds (162463744 bytes)
[ 916.601249]
[ 916.601249] testing speed of ctr(twofish) decryption
[ 916.609490] test 0 (128 bit key, 16 byte blocks): 4601859 operations in 1 seconds (73629744 bytes)
[ 917.615919] test 1 (128 bit key, 64 byte blocks): 1970487 operations in 1 seconds (126111168 bytes)
[ 918.622573] test 2 (128 bit key, 256 byte blocks): 587668 operations in 1 seconds (150443008 bytes)
[ 919.629092] test 3 (128 bit key, 1024 byte blocks): 157030 operations in 1 seconds (160798720 bytes)
[ 920.635695] test 4 (128 bit key, 8192 byte blocks): 19868 operations in 1 seconds (162758656 bytes)
[ 921.642194] test 5 (192 bit key, 16 byte blocks): 4837646 operations in 1 seconds (77402336 bytes)
[ 922.648877] test 6 (192 bit key, 64 byte blocks): 1978413 operations in 1 seconds (126618432 bytes)
[ 923.655549] test 7 (192 bit key, 256 byte blocks): 590723 operations in 1 seconds (151225088 bytes)
[ 924.662059] test 8 (192 bit key, 1024 byte blocks): 156488 operations in 1 seconds (160243712 bytes)
[ 925.668663] test 9 (192 bit key, 8192 byte blocks): 19533 operations in 1 seconds (160014336 bytes)
[ 926.675208] test 10 (256 bit key, 16 byte blocks): 4877702 operations in 1 seconds (78043232 bytes)
[ 927.681854] test 11 (256 bit key, 64 byte blocks): 1981581 operations in 1 seconds (126821184 bytes)
[ 928.688517] test 12 (256 bit key, 256 byte blocks): 591865 operations in 1 seconds (151517440 bytes)
[ 929.695027] test 13 (256 bit key, 1024 byte blocks): 157106 operations in 1 seconds (160876544 bytes)
[ 930.701622] test 14 (256 bit key, 8192 byte blocks): 19818 operations in 1 seconds (162349056 bytes)
[ 931.708148]
[ 931.708148] testing speed of lrw(twofish) encryption
[ 931.716391] test 0 (256 bit key, 16 byte blocks): 3742901 operations in 1 seconds (59886416 bytes)
[ 932.723129] test 1 (256 bit key, 64 byte blocks): 1632818 operations in 1 seconds (104500352 bytes)
[ 933.729812] test 2 (256 bit key, 256 byte blocks): 507407 operations in 1 seconds (129896192 bytes)
[ 934.736320] test 3 (256 bit key, 1024 byte blocks): 134953 operations in 1 seconds (138191872 bytes)
[ 935.742933] test 4 (256 bit key, 8192 byte blocks): 17152 operations in 1 seconds (140509184 bytes)
[ 936.749449] test 5 (320 bit key, 16 byte blocks): 3604847 operations in 1 seconds (57677552 bytes)
[ 937.756114] test 6 (320 bit key, 64 byte blocks): 1645280 operations in 1 seconds (105297920 bytes)
[ 938.762787] test 7 (320 bit key, 256 byte blocks): 505243 operations in 1 seconds (129342208 bytes)
[ 939.765318] test 8 (320 bit key, 1024 byte blocks): 135382 operations in 1 seconds (138631168 bytes)
[ 940.767912] test 9 (320 bit key, 8192 byte blocks): 17004 operations in 1 seconds (139296768 bytes)
[ 941.774421] test 10 (384 bit key, 16 byte blocks): 3748381 operations in 1 seconds (59974096 bytes)
[ 942.781104] test 11 (384 bit key, 64 byte blocks): 1618390 operations in 1 seconds (103576960 bytes)
[ 943.787759] test 12 (384 bit key, 256 byte blocks): 508853 operations in 1 seconds (130266368 bytes)
[ 944.793973] test 13 (384 bit key, 1024 byte blocks): 135082 operations in 1 seconds (138323968 bytes)
[ 945.800560] test 14 (384 bit key, 8192 byte blocks): 17158 operations in 1 seconds (140558336 bytes)
[ 946.807124]
[ 946.807124] testing speed of lrw(twofish) decryption
[ 946.815364] test 0 (256 bit key, 16 byte blocks): 3601916 operations in 1 seconds (57630656 bytes)
[ 947.821765] test 1 (256 bit key, 64 byte blocks): 1661901 operations in 1 seconds (106361664 bytes)
[ 948.828439] test 2 (256 bit key, 256 byte blocks): 503586 operations in 1 seconds (128918016 bytes)
[ 949.834947] test 3 (256 bit key, 1024 byte blocks): 134739 operations in 1 seconds (137972736 bytes)
[ 950.841551] test 4 (256 bit key, 8192 byte blocks): 17087 operations in 1 seconds (139976704 bytes)
[ 951.848113] test 5 (320 bit key, 16 byte blocks): 3718723 operations in 1 seconds (59499568 bytes)
[ 952.854741] test 6 (320 bit key, 64 byte blocks): 1640905 operations in 1 seconds (105017920 bytes)
[ 953.861405] test 7 (320 bit key, 256 byte blocks): 505306 operations in 1 seconds (129358336 bytes)
[ 954.867924] test 8 (320 bit key, 1024 byte blocks): 134609 operations in 1 seconds (137839616 bytes)
[ 955.874527] test 9 (320 bit key, 8192 byte blocks): 16971 operations in 1 seconds (139026432 bytes)
[ 956.881088] test 10 (384 bit key, 16 byte blocks): 3591435 operations in 1 seconds (57462960 bytes)
[ 957.887717] test 11 (384 bit key, 64 byte blocks): 1649581 operations in 1 seconds (105573184 bytes)
[ 958.894382] test 12 (384 bit key, 256 byte blocks): 502560 operations in 1 seconds (128655360 bytes)
[ 959.900892] test 13 (384 bit key, 1024 byte blocks): 134723 operations in 1 seconds (137956352 bytes)
[ 960.907488] test 14 (384 bit key, 8192 byte blocks): 17095 operations in 1 seconds (140042240 bytes)
[ 961.914039]
[ 961.914039] testing speed of xts(twofish) encryption
[ 961.922282] test 0 (256 bit key, 16 byte blocks): 3145313 operations in 1 seconds (50325008 bytes)
[ 962.928692] test 1 (256 bit key, 64 byte blocks): 1583838 operations in 1 seconds (101365632 bytes)
[ 963.931688] test 2 (256 bit key, 256 byte blocks): 522571 operations in 1 seconds (133778176 bytes)
[ 964.934178] test 3 (256 bit key, 1024 byte blocks): 142343 operations in 1 seconds (145759232 bytes)
[ 965.940803] test 4 (256 bit key, 8192 byte blocks): 18213 operations in 1 seconds (149200896 bytes)
[ 966.947327] test 5 (384 bit key, 16 byte blocks): 3152410 operations in 1 seconds (50438560 bytes)
[ 967.953973] test 6 (384 bit key, 64 byte blocks): 1583572 operations in 1 seconds (101348608 bytes)
[ 968.960638] test 7 (384 bit key, 256 byte blocks): 523459 operations in 1 seconds (134005504 bytes)
[ 969.967147] test 8 (384 bit key, 1024 byte blocks): 142362 operations in 1 seconds (145778688 bytes)
[ 970.973760] test 9 (384 bit key, 8192 byte blocks): 18217 operations in 1 seconds (149233664 bytes)
[ 971.980303] test 10 (512 bit key, 16 byte blocks): 3303261 operations in 1 seconds (52852176 bytes)
[ 972.986948] test 11 (512 bit key, 64 byte blocks): 1626050 operations in 1 seconds (104067200 bytes)
[ 973.993616] test 12 (512 bit key, 256 byte blocks): 526250 operations in 1 seconds (134720000 bytes)
[ 975.000114] test 13 (512 bit key, 1024 byte blocks): 142627 operations in 1 seconds (146050048 bytes)
[ 976.006710] test 14 (512 bit key, 8192 byte blocks): 18277 operations in 1 seconds (149725184 bytes)
[ 977.013263]
[ 977.013263] testing speed of xts(twofish) decryption
[ 977.022105] test 0 (256 bit key, 16 byte blocks): 3135829 operations in 1 seconds (50173264 bytes)
[ 978.027922] test 1 (256 bit key, 64 byte blocks): 1578849 operations in 1 seconds (101046336 bytes)
[ 979.034578] test 2 (256 bit key, 256 byte blocks): 521004 operations in 1 seconds (133377024 bytes)
[ 980.041098] test 3 (256 bit key, 1024 byte blocks): 141705 operations in 1 seconds (145105920 bytes)
[ 981.047709] test 4 (256 bit key, 8192 byte blocks): 18161 operations in 1 seconds (148774912 bytes)
[ 982.054227] test 5 (384 bit key, 16 byte blocks): 3138227 operations in 1 seconds (50211632 bytes)
[ 983.060883] test 6 (384 bit key, 64 byte blocks): 1578454 operations in 1 seconds (101021056 bytes)
[ 984.067555] test 7 (384 bit key, 256 byte blocks): 520945 operations in 1 seconds (133361920 bytes)
[ 985.074064] test 8 (384 bit key, 1024 byte blocks): 141746 operations in 1 seconds (145147904 bytes)
[ 986.080676] test 9 (384 bit key, 8192 byte blocks): 18170 operations in 1 seconds (148848640 bytes)
[ 987.087194] test 10 (512 bit key, 16 byte blocks): 3303084 operations in 1 seconds (52849344 bytes)
[ 988.093869] test 11 (512 bit key, 64 byte blocks): 1623781 operations in 1 seconds (103921984 bytes)
[ 989.096562] test 12 (512 bit key, 256 byte blocks): 526076 operations in 1 seconds (134675456 bytes)
[ 990.099044] test 13 (512 bit key, 1024 byte blocks): 142068 operations in 1 seconds (145477632 bytes)
[ 991.105639] test 14 (512 bit key, 8192 byte blocks): 18138 operations in 1 seconds (148586496 bytes)

Let me know if you need more tests.

HTH.

--
Regards/Gruss,
Boris.

2012-08-15 13:48:54

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

> Ok, here we go. Raw data below.

Thanks a lot!

Twofish-avx appears somewhat slower than 3way here, ranging from ~9% slower
with 256-byte blocks to ~3% slower with 8 kB blocks.
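
For anyone who wants to redo this kind of comparison from the raw logs, here is a rough sketch. It parses two tcrypt lines that share key size and block size and divides their throughput. The regular expression and the helper names (`parse_tcrypt_line`, `relative_throughput`) are illustrative assumptions, not part of tcrypt; the two sample lines are the xts 256-byte results copied from the twofish-avx (mode=504) and twofish-x86_64-3way (mode=202) runs quoted in this thread.

```python
import re

# Pattern for the tcrypt speed-test lines as they appear in this thread
# (format assumed from the logs, not taken from the tcrypt source).
LINE_RE = re.compile(
    r"test\s+(\d+)\s+\((\d+) bit key, (\d+) byte blocks\):\s+"
    r"(\d+) operations in (\d+) seconds \((\d+) bytes\)"
)


def parse_tcrypt_line(line):
    """Return the numeric fields of one tcrypt speed-test line as a dict."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not a tcrypt speed line: %r" % line)
    test, key_bits, block_bytes, ops, seconds, total = map(int, m.groups())
    return {
        "test": test,
        "key_bits": key_bits,
        "block_bytes": block_bytes,
        "ops": ops,
        "seconds": seconds,
        "bytes": total,
    }


def relative_throughput(candidate, baseline):
    """Bytes/second of `candidate` divided by bytes/second of `baseline`."""
    a = parse_tcrypt_line(candidate)
    b = parse_tcrypt_line(baseline)
    if (a["key_bits"], a["block_bytes"]) != (b["key_bits"], b["block_bytes"]):
        raise ValueError("lines describe different test configurations")
    return (a["bytes"] / a["seconds"]) / (b["bytes"] / b["seconds"])


# xts(twofish) encryption, 256 bit key, 256 byte blocks, from the logs above.
avx_line = ("[ 347.539278] test 2 (256 bit key, 256 byte blocks): "
            "487146 operations in 1 seconds (124709376 bytes)")
threeway_line = ("[ 963.931688] test 2 (256 bit key, 256 byte blocks): "
                 "522571 operations in 1 seconds (133778176 bytes)")

ratio = relative_throughput(avx_line, threeway_line)
print("avx/3way = %.3f" % ratio)
```

For this particular xts pair the ratio comes out a little above 0.93. Other modes and block sizes give different numbers, so a full comparison would loop over every matching pair of lines.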

>

<snip>

>
> Let me know if you need more tests.

I posted a patch that optimizes twofish-avx a few weeks ago:
http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2

I'd be interested to know if this patch helps on Bulldozer.

-Jussi

>
> HTH.
>
> --
> Regards/Gruss,
> Boris.
>
>

2012-08-15 14:03:35

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote:
> I posted a patch that optimizes twofish-avx a few weeks ago:
> http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2
>
> I'd be interested to know if this patch helps on Bulldozer.

Sure, can you inline it here too please. The "Download message RAW" link
on marc.info gives me a diff but patch says:

patching file arch/x86/crypto/twofish-avx-x86_64-asm_64.S
patch unexpectedly ends in middle of line

Thanks.

--
Regards/Gruss,
Boris.

2012-08-15 14:22:03

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

> On Wed, Aug 15, 2012 at 04:48:54PM +0300, Jussi Kivilinna wrote:
> > I posted a patch that optimizes twofish-avx a few weeks ago:
> > http://marc.info/?l=linux-crypto-vger&m=134364845024825&w=2
> >
> > I'd be interested to know if this patch helps on Bulldozer.
>
> Sure, can you inline it here too please. The "Download message RAW" link
> on marc.info gives me a diff but patch says:
>
> patching file arch/x86/crypto/twofish-avx-x86_64-asm_64.S
> patch unexpectedly ends in middle of line
>
> Thanks.

Here...


Patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies and interleaves instructions better for out-of-order scheduling.

Also moves the common round code into separate functions to reduce object size.

Tested on Core i5-2450M.
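
To illustrate the first point, here is a small Python model (not kernel code) of the difference between the two moves. `movb` writes only the low byte and keeps the rest of the destination, so its result depends on the register's previous contents; `movzbl` overwrites the register with the zero-extended byte. The model also checks that the reordered lookup and the `vpinsrd`-style packing yield the same four 32-bit words as the old sequence. All function names are hypothetical, and the tables are random stand-ins, not the Twofish s-boxes.

```python
import random

MASK64 = (1 << 64) - 1


def movb(dst, byte):
    """Model of 'movb': only bits 0..7 of dst change, the rest is kept."""
    return (dst & ~0xFF & MASK64) | (byte & 0xFF)


def movzbl(dst, byte):
    """Model of 'movzbl': dst becomes the zero-extended byte; old dst unused."""
    return byte & 0xFF


def lookup_old(tables, src):
    """Old lookup_32bit: movb into index registers that were zeroed once."""
    t0, t1, t2, t3 = tables
    rid1 = rid2 = 0                      # xorq RID1, RID1; xorq RID2, RID2
    rid1 = movb(rid1, src)
    rid2 = movb(rid2, src >> 8)
    dst = t0[rid1] ^ t1[rid2]
    src >>= 16
    rid1 = movb(rid1, src)
    rid2 = movb(rid2, src >> 8)
    dst ^= t2[rid1] ^ t3[rid2]
    return dst, src


def lookup_new(tables, src, shift_next):
    """New lookup_32bit: movzbl, earlier shift, optional interleaved shift."""
    t0, t1, t2, t3 = tables
    rid1 = movzbl(0, src)
    rid2 = movzbl(0, src >> 8)
    src >>= 16
    dst = t0[rid1] ^ t1[rid2]
    rid1 = movzbl(0, src)
    rid2 = movzbl(0, src >> 8)
    if shift_next:                       # interleave_op == shr_next
        src >>= 16
    dst ^= t2[rid1] ^ t3[rid2]
    return dst, src


def g_old(tables, rgi1, rgi2):
    """Old G: explicit shift between the two lookups of each 64-bit half."""
    words = []
    for src in (rgi1, rgi2):
        lo, src = lookup_old(tables, src)
        src >>= 16
        hi, _ = lookup_old(tables, src)
        words += [lo, hi]
    return words


def g_new(tables, rgi1, rgi2):
    """New G: the extra shift is folded into the first lookup of each half."""
    words = []
    for src in (rgi1, rgi2):
        lo, src = lookup_new(tables, src, shift_next=True)
        hi, _ = lookup_new(tables, src, shift_next=False)
        words += [lo, hi]
    return words


rng = random.Random(0)
TABLES = [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]

# movb result depends on the old register value, movzbl result does not.
print(hex(movb(0x1234, 0xAB)), hex(movzbl(0x1234, 0xAB)))
# Both orderings compute the same four 32-bit words.
print(g_old(TABLES, 0x0123456789ABCDEF, 0xFEDCBA9876543210)
      == g_new(TABLES, 0x0123456789ABCDEF, 0xFEDCBA9876543210))
```

The model says nothing about timing. It only shows the data dependence that `movb` carries on the old register value and that the reordering leaves the computed values unchanged.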

---
arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 144 +++++++++++++++++----------
1 file changed, 92 insertions(+), 52 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..42b27b7 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
#define RC2 %xmm6
#define RD2 %xmm7

-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9

-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14

#define RID1 %rax
+#define RID1d %eax
#define RID1b %al
#define RID2 %rbx
+#define RID2d %ebx
#define RID2b %bl

#define RGI1 %rdx
@@ -73,40 +80,45 @@
#define RGS3d %r10d


-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ shrq $16, src; \
movl t0(CTX, RID1, 4), dst ## d; \
xorl t1(CTX, RID2, 4), dst ## d; \
- shrq $16, src; \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ interleave_op(il_reg); \
xorl t2(CTX, RID1, 4), dst ## d; \
xorl t3(CTX, RID2, 4), dst ## d;

+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+ shrq $16, reg;
+
#define G(a, x, t0, t1, t2, t3) \
vmovq a, RGI1; \
- vpsrldq $8, a, x; \
- vmovq x, RGI2; \
+ vpextrq $1, a, RGI2; \
\
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
- shrq $16, RGI1; \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
- shlq $32, RGS2; \
- orq RGS1, RGS2; \
+ lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+ vmovd RGS1d, x; \
+ lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
+ vpinsrd $1, RGS2d, x, x; \
\
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
- shrq $16, RGI2; \
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
- shlq $32, RGS3; \
- orq RGS1, RGS3; \
- \
- vmovq RGS2, x; \
- vpinsrq $1, RGS3, x, x;
+ lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, shr_next, RGI2); \
+ vpinsrd $2, RGS1d, x, x; \
+ lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, dummy, none); \
+ vpinsrd $3, RGS3d, x, x;
+
+#define encround_g1g2(a, b, c, d, x, y) \
+ G(a, x, s0, s1, s2, s3); \
+ G(b, y, s1, s2, s3, s0);

-#define encround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define encround_end(a, b, c, d, x, y) \
+ vpslld $1, d, RT; \
+ vpsrld $(32 - 1), d, d; \
+ vpor d, RT, d; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd x, RK1, x; \
@@ -115,14 +127,16 @@
vpsrld $1, c, x; \
vpslld $(32 - 1), c, c; \
vpor c, x, c; \
- vpslld $1, d, x; \
- vpsrld $(32 - 1), d, d; \
- vpor d, x, d; \
vpxor d, y, d;

-#define decround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define decround_g1g2(a, b, c, d, x, y) \
+ G(a, x, s0, s1, s2, s3); \
+ G(b, y, s1, s2, s3, s0);
+
+#define decround_end(a, b, c, d, x, y) \
+ vpslld $1, c, RT; \
+ vpsrld $(32 - 1), c, c; \
+ vpor c, RT, c; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd y, RK2, y; \
@@ -130,23 +144,50 @@
vpsrld $1, d, y; \
vpslld $(32 - 1), d, d; \
vpor d, y, d; \
- vpslld $1, c, y; \
- vpsrld $(32 - 1), c, c; \
- vpor c, y, c; \
vpaddd x, RK1, x; \
vpxor x, c, c;

+.align 4
+encround_RARBRCRD:
+ encround_g1g2(RA1, RB1, RC1, RD1, RX0, RY0);
+ encround_g1g2(RA2, RB2, RC2, RD2, RX1, RY1);
+ encround_end(RA1, RB1, RC1, RD1, RX0, RY0);
+ encround_end(RA2, RB2, RC2, RD2, RX1, RY1);
+ ret;
+
+.align 4
+encround_RCRDRARB:
+ encround_g1g2(RC1, RD1, RA1, RB1, RX0, RY0);
+ encround_g1g2(RC2, RD2, RA2, RB2, RX1, RY1);
+ encround_end(RC1, RD1, RA1, RB1, RX0, RY0);
+ encround_end(RC2, RD2, RA2, RB2, RX1, RY1);
+ ret;
+
#define encrypt_round(n, a, b, c, d) \
vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- encround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- encround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+ call encround_ ## a ## b ## c ## d;
+
+.align 4
+decround_RARBRCRD:
+ decround_g1g2(RA1, RB1, RC1, RD1, RX0, RY0);
+ decround_g1g2(RA2, RB2, RC2, RD2, RX1, RY1);
+ decround_end(RA1, RB1, RC1, RD1, RX0, RY0);
+ decround_end(RA2, RB2, RC2, RD2, RX1, RY1);
+ ret;
+
+.align 4
+decround_RCRDRARB:
+ decround_g1g2(RC1, RD1, RA1, RB1, RX0, RY0);
+ decround_g1g2(RC2, RD2, RA2, RB2, RX1, RY1);
+ decround_end(RC1, RD1, RA1, RB1, RX0, RY0);
+ decround_end(RC2, RD2, RA2, RB2, RX1, RY1);
+ ret;

#define decrypt_round(n, a, b, c, d) \
vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- decround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- decround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+ call decround_ ## a ## b ## c ## d;

#define encrypt_cycle(n) \
encrypt_round((2*n), RA, RB, RC, RD); \
@@ -156,7 +197,6 @@
decrypt_round(((2*n) + 1), RC, RD, RA, RB); \
decrypt_round((2*n), RA, RB, RC, RD);

-
#define transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
vpunpckldq x1, x0, t0; \
vpunpckhdq x1, x0, t2; \
@@ -222,8 +262,8 @@ __twofish_enc_blk_8way:
vmovdqu w(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

xorq RID1, RID1;
xorq RID2, RID2;
@@ -247,14 +287,14 @@ __twofish_enc_blk_8way:
testb %cl, %cl;
jnz __enc_xor8;

- outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

__enc_xor8:
- outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

@@ -274,8 +314,8 @@ twofish_dec_blk_8way:
vmovdqu (w+4*4)(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

xorq RID1, RID1;
xorq RID2, RID2;
@@ -294,7 +334,7 @@ twofish_dec_blk_8way:
popq %rbx;

leaq (4*4*4)(%rsi), %rax;
- outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

ret;

2012-08-15 15:33:10

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote:

> Patch replaces 'movb' instructions with 'movzbl' to break false
> register dependencies and interleaves instructions better for
> out-of-order scheduling.
>
> Also moves the common round code into separate functions to reduce
> object size.

Ok, redid the first test

$ modprobe twofish-avx-x86_64
$ modprobe tcrypt mode=504 sec=1

and from quickly juxtaposing the two results, I'd say the patch makes
things slightly worse, but you'd need to run your scripts on it to get
accurate results:

[ 98.206067] testing speed of async ecb(twofish) encryption
[ 98.214796] test 0 (128 bit key, 16 byte blocks): 4549296 operations in 1 seconds (72788736 bytes)
[ 99.221569] test 1 (128 bit key, 64 byte blocks): 1995934 operations in 1 seconds (127739776 bytes)
[ 100.228250] test 2 (128 bit key, 256 byte blocks): 535040 operations in 1 seconds (136970240 bytes)
[ 101.234751] test 3 (128 bit key, 1024 byte blocks): 148602 operations in 1 seconds (152168448 bytes)
[ 102.241345] test 4 (128 bit key, 8192 byte blocks): 19148 operations in 1 seconds (156860416 bytes)
[ 103.247880] test 5 (192 bit key, 16 byte blocks): 4558391 operations in 1 seconds (72934256 bytes)
[ 104.254547] test 6 (192 bit key, 64 byte blocks): 1997838 operations in 1 seconds (127861632 bytes)
[ 105.261202] test 7 (192 bit key, 256 byte blocks): 534396 operations in 1 seconds (136805376 bytes)
[ 106.267694] test 8 (192 bit key, 1024 byte blocks): 148199 operations in 1 seconds (151755776 bytes)
[ 107.274296] test 9 (192 bit key, 8192 byte blocks): 18913 operations in 1 seconds (154935296 bytes)
[ 108.280824] test 10 (256 bit key, 16 byte blocks): 4595524 operations in 1 seconds (73528384 bytes)
[ 109.287496] test 11 (256 bit key, 64 byte blocks): 1997893 operations in 1 seconds (127865152 bytes)
[ 110.294168] test 12 (256 bit key, 256 byte blocks): 533790 operations in 1 seconds (136650240 bytes)
[ 111.300679] test 13 (256 bit key, 1024 byte blocks): 148787 operations in 1 seconds (152357888 bytes)
[ 112.303561] test 14 (256 bit key, 8192 byte blocks): 19146 operations in 1 seconds (156844032 bytes)
[ 113.310104]
[ 113.310104] testing speed of async ecb(twofish) decryption
[ 113.319419] test 0 (128 bit key, 16 byte blocks): 4754043 operations in 1 seconds (76064688 bytes)
[ 114.324768] test 1 (128 bit key, 64 byte blocks): 1831420 operations in 1 seconds (117210880 bytes)
[ 115.331441] test 2 (128 bit key, 256 byte blocks): 541170 operations in 1 seconds (138539520 bytes)
[ 116.337957] test 3 (128 bit key, 1024 byte blocks): 150538 operations in 1 seconds (154150912 bytes)
[ 117.344571] test 4 (128 bit key, 8192 byte blocks): 19397 operations in 1 seconds (158900224 bytes)
[ 118.351122] test 5 (192 bit key, 16 byte blocks): 4753957 operations in 1 seconds (76063312 bytes)
[ 119.357778] test 6 (192 bit key, 64 byte blocks): 1828676 operations in 1 seconds (117035264 bytes)
[ 120.364459] test 7 (192 bit key, 256 byte blocks): 540331 operations in 1 seconds (138324736 bytes)
[ 121.370969] test 8 (192 bit key, 1024 byte blocks): 150348 operations in 1 seconds (153956352 bytes)
[ 122.377573] test 9 (192 bit key, 8192 byte blocks): 19196 operations in 1 seconds (157253632 bytes)
[ 123.384080] test 10 (256 bit key, 16 byte blocks): 4664399 operations in 1 seconds (74630384 bytes)
[ 124.390782] test 11 (256 bit key, 64 byte blocks): 1839324 operations in 1 seconds (117716736 bytes)
[ 125.397463] test 12 (256 bit key, 256 byte blocks): 538735 operations in 1 seconds (137916160 bytes)
[ 126.403962] test 13 (256 bit key, 1024 byte blocks): 150489 operations in 1 seconds (154100736 bytes)
[ 127.410567] test 14 (256 bit key, 8192 byte blocks): 19397 operations in 1 seconds (158900224 bytes)
[ 128.417091]
[ 128.417091] testing speed of async cbc(twofish) encryption
[ 128.431227] test 0 (128 bit key, 16 byte blocks): 4681239 operations in 1 seconds (74899824 bytes)
[ 129.439466] test 1 (128 bit key, 64 byte blocks): 1836636 operations in 1 seconds (117544704 bytes)
[ 130.446131] test 2 (128 bit key, 256 byte blocks): 536055 operations in 1 seconds (137230080 bytes)
[ 131.452631] test 3 (128 bit key, 1024 byte blocks): 140955 operations in 1 seconds (144337920 bytes)
[ 132.459243] test 4 (128 bit key, 8192 byte blocks): 17821 operations in 1 seconds (145989632 bytes)
[ 133.466124] test 5 (192 bit key, 16 byte blocks): 4674373 operations in 1 seconds (74789968 bytes)
[ 134.472728] test 6 (192 bit key, 64 byte blocks): 1835821 operations in 1 seconds (117492544 bytes)
[ 135.479374] test 7 (192 bit key, 256 byte blocks): 535882 operations in 1 seconds (137185792 bytes)
[ 136.485876] test 8 (192 bit key, 1024 byte blocks): 140917 operations in 1 seconds (144299008 bytes)
[ 137.492470] test 9 (192 bit key, 8192 byte blocks): 17707 operations in 1 seconds (145055744 bytes)
[ 138.498979] test 10 (256 bit key, 16 byte blocks): 4674648 operations in 1 seconds (74794368 bytes)
[ 139.505660] test 11 (256 bit key, 64 byte blocks): 1828219 operations in 1 seconds (117006016 bytes)
[ 140.512343] test 12 (256 bit key, 256 byte blocks): 535835 operations in 1 seconds (137173760 bytes)
[ 141.518842] test 13 (256 bit key, 1024 byte blocks): 140884 operations in 1 seconds (144265216 bytes)
[ 142.525447] test 14 (256 bit key, 8192 byte blocks): 17815 operations in 1 seconds (145940480 bytes)
[ 143.531972]
[ 143.531972] testing speed of async cbc(twofish) decryption
[ 143.541345] test 0 (128 bit key, 16 byte blocks): 4461471 operations in 1 seconds (71383536 bytes)
[ 144.546671] test 1 (128 bit key, 64 byte blocks): 1726158 operations in 1 seconds (110474112 bytes)
[ 145.553334] test 2 (128 bit key, 256 byte blocks): 524618 operations in 1 seconds (134302208 bytes)
[ 146.559862] test 3 (128 bit key, 1024 byte blocks): 145305 operations in 1 seconds (148792320 bytes)
[ 147.566457] test 4 (128 bit key, 8192 byte blocks): 18667 operations in 1 seconds (152920064 bytes)
[ 148.572965] test 5 (192 bit key, 16 byte blocks): 4458941 operations in 1 seconds (71343056 bytes)
[ 149.579638] test 6 (192 bit key, 64 byte blocks): 1734677 operations in 1 seconds (111019328 bytes)
[ 150.586303] test 7 (192 bit key, 256 byte blocks): 521797 operations in 1 seconds (133580032 bytes)
[ 151.592811] test 8 (192 bit key, 1024 byte blocks): 144554 operations in 1 seconds (148023296 bytes)
[ 152.599423] test 9 (192 bit key, 8192 byte blocks): 18461 operations in 1 seconds (151232512 bytes)
[ 153.605932] test 10 (256 bit key, 16 byte blocks): 4454216 operations in 1 seconds (71267456 bytes)
[ 154.612614] test 11 (256 bit key, 64 byte blocks): 1749350 operations in 1 seconds (111958400 bytes)
[ 155.619270] test 12 (256 bit key, 256 byte blocks): 525143 operations in 1 seconds (134436608 bytes)
[ 156.625778] test 13 (256 bit key, 1024 byte blocks): 145597 operations in 1 seconds (149091328 bytes)
[ 157.632367] test 14 (256 bit key, 8192 byte blocks): 18667 operations in 1 seconds (152920064 bytes)
[ 158.638911]
[ 158.638911] testing speed of async ctr(twofish) encryption
[ 158.652915] test 0 (128 bit key, 16 byte blocks): 4582013 operations in 1 seconds (73312208 bytes)
[ 159.661274] test 1 (128 bit key, 64 byte blocks): 1949294 operations in 1 seconds (124754816 bytes)
[ 160.667949] test 2 (128 bit key, 256 byte blocks): 519205 operations in 1 seconds (132916480 bytes)
[ 161.674749] test 3 (128 bit key, 1024 byte blocks): 142060 operations in 1 seconds (145469440 bytes)
[ 162.681372] test 4 (128 bit key, 8192 byte blocks): 18272 operations in 1 seconds (149684224 bytes)
[ 163.687577] test 5 (192 bit key, 16 byte blocks): 4539161 operations in 1 seconds (72626576 bytes)
[ 164.694561] test 6 (192 bit key, 64 byte blocks): 1935006 operations in 1 seconds (123840384 bytes)
[ 165.701209] test 7 (192 bit key, 256 byte blocks): 517208 operations in 1 seconds (132405248 bytes)
[ 166.707725] test 8 (192 bit key, 1024 byte blocks): 141790 operations in 1 seconds (145192960 bytes)
[ 167.714338] test 9 (192 bit key, 8192 byte blocks): 18120 operations in 1 seconds (148439040 bytes)
[ 168.720856] test 10 (256 bit key, 16 byte blocks): 4379275 operations in 1 seconds (70068400 bytes)
[ 169.727530] test 11 (256 bit key, 64 byte blocks): 1957465 operations in 1 seconds (125277760 bytes)
[ 170.734185] test 12 (256 bit key, 256 byte blocks): 519760 operations in 1 seconds (133058560 bytes)
[ 171.740392] test 13 (256 bit key, 1024 byte blocks): 142374 operations in 1 seconds (145790976 bytes)
[ 172.746986] test 14 (256 bit key, 8192 byte blocks): 18292 operations in 1 seconds (149848064 bytes)
[ 173.753539]
[ 173.753539] testing speed of async ctr(twofish) decryption
[ 173.762929] test 0 (128 bit key, 16 byte blocks): 4465609 operations in 1 seconds (71449744 bytes)
[ 174.768467] test 1 (128 bit key, 64 byte blocks): 1947565 operations in 1 seconds (124644160 bytes)
[ 175.775139] test 2 (128 bit key, 256 byte blocks): 523259 operations in 1 seconds (133954304 bytes)
[ 176.781352] test 3 (128 bit key, 1024 byte blocks): 141135 operations in 1 seconds (144522240 bytes)
[ 177.787959] test 4 (128 bit key, 8192 byte blocks): 17984 operations in 1 seconds (147324928 bytes)
[ 178.794512] test 5 (192 bit key, 16 byte blocks): 4541736 operations in 1 seconds (72667776 bytes)
[ 179.801141] test 6 (192 bit key, 64 byte blocks): 1937279 operations in 1 seconds (123985856 bytes)
[ 180.807805] test 7 (192 bit key, 256 byte blocks): 513856 operations in 1 seconds (131547136 bytes)
[ 181.814331] test 8 (192 bit key, 1024 byte blocks): 141039 operations in 1 seconds (144423936 bytes)
[ 182.820918] test 9 (192 bit key, 8192 byte blocks): 17825 operations in 1 seconds (146022400 bytes)
[ 183.827461] test 10 (256 bit key, 16 byte blocks): 4380875 operations in 1 seconds (70094000 bytes)
[ 184.834419] test 11 (256 bit key, 64 byte blocks): 1959937 operations in 1 seconds (125435968 bytes)
[ 185.841075] test 12 (256 bit key, 256 byte blocks): 515782 operations in 1 seconds (132040192 bytes)
[ 186.847585] test 13 (256 bit key, 1024 byte blocks): 142571 operations in 1 seconds (145992704 bytes)
[ 187.854181] test 14 (256 bit key, 8192 byte blocks): 18105 operations in 1 seconds (148316160 bytes)
[ 188.860717]
[ 188.860717] testing speed of async lrw(twofish) encryption
[ 188.875294] test 0 (256 bit key, 16 byte blocks): 3445285 operations in 1 seconds (55124560 bytes)
[ 189.883381] test 1 (256 bit key, 64 byte blocks): 1585896 operations in 1 seconds (101497344 bytes)
[ 190.890072] test 2 (256 bit key, 256 byte blocks): 449477 operations in 1 seconds (115066112 bytes)
[ 191.896590] test 3 (256 bit key, 1024 byte blocks): 123541 operations in 1 seconds (126505984 bytes)
[ 192.903174] test 4 (256 bit key, 8192 byte blocks): 15868 operations in 1 seconds (129990656 bytes)
[ 193.909694] test 5 (320 bit key, 16 byte blocks): 3590396 operations in 1 seconds (57446336 bytes)
[ 194.916355] test 6 (320 bit key, 64 byte blocks): 1579004 operations in 1 seconds (101056256 bytes)
[ 195.923041] test 7 (320 bit key, 256 byte blocks): 449033 operations in 1 seconds (114952448 bytes)
[ 196.929529] test 8 (320 bit key, 1024 byte blocks): 123347 operations in 1 seconds (126307328 bytes)
[ 197.936142] test 9 (320 bit key, 8192 byte blocks): 15762 operations in 1 seconds (129122304 bytes)
[ 198.942702] test 10 (384 bit key, 16 byte blocks): 3496049 operations in 1 seconds (55936784 bytes)
[ 199.949333] test 11 (384 bit key, 64 byte blocks): 1589166 operations in 1 seconds (101706624 bytes)
[ 200.955996] test 12 (384 bit key, 256 byte blocks): 449480 operations in 1 seconds (115066880 bytes)
[ 201.962497] test 13 (384 bit key, 1024 byte blocks): 123767 operations in 1 seconds (126737408 bytes)
[ 202.969101] test 14 (384 bit key, 8192 byte blocks): 15921 operations in 1 seconds (130424832 bytes)
[ 203.971665]
[ 203.971665] testing speed of async lrw(twofish) decryption
[ 203.971755] test 0 (256 bit key, 16 byte blocks): 3558879 operations in 1 seconds (56942064 bytes)
[ 204.974331] test 1 (256 bit key, 64 byte blocks): 1588116 operations in 1 seconds (101639424 bytes)
[ 205.981001] test 2 (256 bit key, 256 byte blocks): 451198 operations in 1 seconds (115506688 bytes)
[ 206.987510] test 3 (256 bit key, 1024 byte blocks): 124791 operations in 1 seconds (127785984 bytes)
[ 207.994115] test 4 (256 bit key, 8192 byte blocks): 16087 operations in 1 seconds (131784704 bytes)
[ 209.000650] test 5 (320 bit key, 16 byte blocks): 3559066 operations in 1 seconds (56945056 bytes)
[ 210.007298] test 6 (320 bit key, 64 byte blocks): 1579234 operations in 1 seconds (101070976 bytes)
[ 211.013960] test 7 (320 bit key, 256 byte blocks): 454953 operations in 1 seconds (116467968 bytes)
[ 212.020469] test 8 (320 bit key, 1024 byte blocks): 124810 operations in 1 seconds (127805440 bytes)
[ 213.027082] test 9 (320 bit key, 8192 byte blocks): 15887 operations in 1 seconds (130146304 bytes)
[ 214.033610] test 10 (384 bit key, 16 byte blocks): 3554484 operations in 1 seconds (56871744 bytes)
[ 215.040272] test 11 (384 bit key, 64 byte blocks): 1583334 operations in 1 seconds (101333376 bytes)
[ 216.046937] test 12 (384 bit key, 256 byte blocks): 453554 operations in 1 seconds (116109824 bytes)
[ 217.053436] test 13 (384 bit key, 1024 byte blocks): 124894 operations in 1 seconds (127891456 bytes)
[ 218.060032] test 14 (384 bit key, 8192 byte blocks): 16080 operations in 1 seconds (131727360 bytes)
[ 219.066597]
[ 219.066597] testing speed of async xts(twofish) encryption
[ 219.080737] test 0 (256 bit key, 16 byte blocks): 3105784 operations in 1 seconds (49692544 bytes)
[ 220.089254] test 1 (256 bit key, 64 byte blocks): 1586587 operations in 1 seconds (101541568 bytes)
[ 221.095918] test 2 (256 bit key, 256 byte blocks): 475166 operations in 1 seconds (121642496 bytes)
[ 222.102427] test 3 (256 bit key, 1024 byte blocks): 133144 operations in 1 seconds (136339456 bytes)
[ 223.109038] test 4 (256 bit key, 8192 byte blocks): 17219 operations in 1 seconds (141058048 bytes)
[ 224.115549] test 5 (384 bit key, 16 byte blocks): 3097574 operations in 1 seconds (49561184 bytes)
[ 225.122213] test 6 (384 bit key, 64 byte blocks): 1585836 operations in 1 seconds (101493504 bytes)
[ 226.128885] test 7 (384 bit key, 256 byte blocks): 475173 operations in 1 seconds (121644288 bytes)
[ 227.135398] test 8 (384 bit key, 1024 byte blocks): 133173 operations in 1 seconds (136369152 bytes)
[ 228.138011] test 9 (384 bit key, 8192 byte blocks): 17254 operations in 1 seconds (141344768 bytes)
[ 229.140563] test 10 (512 bit key, 16 byte blocks): 3171090 operations in 1 seconds (50737440 bytes)
[ 230.147211] test 11 (512 bit key, 64 byte blocks): 1595445 operations in 1 seconds (102108480 bytes)
[ 231.153866] test 12 (512 bit key, 256 byte blocks): 475161 operations in 1 seconds (121641216 bytes)
[ 232.160384] test 13 (512 bit key, 1024 byte blocks): 133269 operations in 1 seconds (136467456 bytes)
[ 233.166970] test 14 (512 bit key, 8192 byte blocks): 17225 operations in 1 seconds (141107200 bytes)
[ 234.173501]
[ 234.173501] testing speed of async xts(twofish) decryption
[ 234.182898] test 0 (256 bit key, 16 byte blocks): 3095689 operations in 1 seconds (49531024 bytes)
[ 235.188173] test 1 (256 bit key, 64 byte blocks): 1433025 operations in 1 seconds (91713600 bytes)
[ 236.194753] test 2 (256 bit key, 256 byte blocks): 472038 operations in 1 seconds (120841728 bytes)
[ 237.201347] test 3 (256 bit key, 1024 byte blocks): 134015 operations in 1 seconds (137231360 bytes)
[ 238.207969] test 4 (256 bit key, 8192 byte blocks): 17446 operations in 1 seconds (142917632 bytes)
[ 239.214478] test 5 (384 bit key, 16 byte blocks): 3099755 operations in 1 seconds (49596080 bytes)
[ 240.221142] test 6 (384 bit key, 64 byte blocks): 1432335 operations in 1 seconds (91669440 bytes)
[ 241.227711] test 7 (384 bit key, 256 byte blocks): 470340 operations in 1 seconds (120407040 bytes)
[ 242.234314] test 8 (384 bit key, 1024 byte blocks): 133929 operations in 1 seconds (137143296 bytes)
[ 243.240926] test 9 (384 bit key, 8192 byte blocks): 17442 operations in 1 seconds (142884864 bytes)
[ 244.247453] test 10 (512 bit key, 16 byte blocks): 3193773 operations in 1 seconds (51100368 bytes)
[ 245.254119] test 11 (512 bit key, 64 byte blocks): 1440631 operations in 1 seconds (92200384 bytes)
[ 246.260689] test 12 (512 bit key, 256 byte blocks): 475293 operations in 1 seconds (121675008 bytes)
[ 247.267283] test 13 (512 bit key, 1024 byte blocks): 134350 operations in 1 seconds (137574400 bytes)
[ 248.273879] test 14 (512 bit key, 8192 byte blocks): 17441 operations in 1 seconds (142876672 bytes)

--
Regards/Gruss,
Boris.

2012-08-15 17:34:31

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

> On Wed, Aug 15, 2012 at 05:22:03PM +0300, Jussi Kivilinna wrote:
>
>> Patch replaces 'movb' instructions with 'movzbl' to break false
>> register dependencies and interleaves instructions better for
>> out-of-order scheduling.
>>
>> Also move common round code to separate function to reduce object
>> size.
>
> Ok, redid the first test
>

Thanks.
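
A rough Python model of the data flow in the patched lookup_32bit and G
macros from the diff below may help when reading the patch. It is only a
sketch: the tables are made-up stand-ins (the real s0..s3 tables are
key-dependent and live in the context that CTX points to), and the function
names are hypothetical. Python cannot show the false-dependency effect
itself; the relevant fact is that movzbl writes the whole 32-bit register,
whereas movb writes only the low byte and so has to merge with the
register's previous contents.

```python
MASK32 = 0xFFFFFFFF


def make_table(seed):
    # Placeholder contents for illustration, NOT real Twofish s-box data.
    return [((i + seed) * 0x9E3779B1) & MASK32 for i in range(256)]


# Hypothetical stand-ins for the four key-dependent tables t0..t3.
TABLES = tuple(make_table(s) for s in range(4))


def lookup_32bit(tables, src):
    """Model one lookup_32bit expansion; returns (dst, src after the shift)."""
    t0, t1, t2, t3 = tables
    rid1 = src & 0xFF          # movzbl src##bl, RID1d
    rid2 = (src >> 8) & 0xFF   # movzbl src##bh, RID2d
    src >>= 16                 # shrq $16, src (moved up by the patch)
    dst = t0[rid1]             # movl t0(CTX, RID1, 4), dst##d
    dst ^= t1[rid2]            # xorl t1(CTX, RID2, 4), dst##d
    rid1 = src & 0xFF          # movzbl src##bl, RID1d
    rid2 = (src >> 8) & 0xFF   # movzbl src##bh, RID2d
    dst ^= t2[rid1]            # xorl t2(CTX, RID1, 4), dst##d
    dst ^= t3[rid2]            # xorl t3(CTX, RID2, 4), dst##d
    return dst, src


def g_half(tables, word64):
    """Model how G handles one 64-bit general-purpose register (RGI1 or RGI2)."""
    lo, src = lookup_32bit(tables, word64)
    src >>= 16                 # shr_next(reg): the interleaved extra shift
    hi, _ = lookup_32bit(tables, src)
    return (hi << 32) | lo     # shlq $32 on the high result, then orq


def g_half_reference(tables, word64):
    """Straightforward per-byte version used to cross-check the model."""
    out = 0
    for lane in range(2):
        w = (word64 >> (32 * lane)) & MASK32
        v = 0
        for byte_index, table in enumerate(tables):
            v ^= table[(w >> (8 * byte_index)) & 0xFF]
        out |= v << (32 * lane)
    return out


sample = 0x0123456789ABCDEF
print(hex(g_half(TABLES, sample)),
      g_half(TABLES, sample) == g_half_reference(TABLES, sample))
```

In the assembly the interleaved shift sits between the second pair of
movzbl instructions and the last two xorl instructions. The model applies
it afterwards, which gives the same value because both byte indices have
already been extracted at that point.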

> $ modprobe twofish-avx-x86_64
> $ modprobe tcrypt mode=504 sec=1
>
> and from quickly juxtaposing the two results, I'd say the patch makes
> things slightly worse but you'd need to run your scripts on it to get
> the accurate results:
>

About 5% slower, probably because I was tuning for Sandy Bridge and introduced
more FPU<=>CPU register moves.

Here's a new version of the patch, with the FPU<=>CPU moves from the original
implementation.

(Note: this version also changes the encryption function to inline all code
into the main function, while decryption still places the common code in a
separate function to reduce object size. This is to measure the difference.)

-Jussi

---
arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 124 +++++++++++++++++----------
1 file changed, 77 insertions(+), 47 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..d331ab8 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
#define RC2 %xmm6
#define RD2 %xmm7

-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9

-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14

#define RID1 %rax
+#define RID1d %eax
#define RID1b %al
#define RID2 %rbx
+#define RID2d %ebx
#define RID2b %bl

#define RGI1 %rdx
@@ -73,40 +80,48 @@
#define RGS3d %r10d


-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ shrq $16, src; \
movl t0(CTX, RID1, 4), dst ## d; \
xorl t1(CTX, RID2, 4), dst ## d; \
- shrq $16, src; \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ interleave_op(il_reg); \
xorl t2(CTX, RID1, 4), dst ## d; \
xorl t3(CTX, RID2, 4), dst ## d;

+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+ shrq $16, reg;
+
#define G(a, x, t0, t1, t2, t3) \
- vmovq a, RGI1; \
- vpsrldq $8, a, x; \
- vmovq x, RGI2; \
+ vmovq a, RGI1; \
+ vpextrq $1, a, RGI2; \
\
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
- shrq $16, RGI1; \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
+ lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+ lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
shlq $32, RGS2; \
orq RGS1, RGS2; \
\
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
- shrq $16, RGI2; \
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
- shlq $32, RGS3; \
+ lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, shr_next, RGI2); \
+ lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, dummy, none); \
+ shlq $32, RGS1; \
orq RGS1, RGS3; \
\
vmovq RGS2, x; \
vpinsrq $1, RGS3, x, x;

-#define encround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define encround_g1g2(a, b, c, d, x, y) \
+ G(a, x, s0, s1, s2, s3); \
+ G(b, y, s1, s2, s3, s0);
+
+#define encround_end(a, b, c, d, x, y) \
+ vpslld $1, d, RT; \
+ vpsrld $(32 - 1), d, d; \
+ vpor d, RT, d; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd x, RK1, x; \
@@ -115,14 +130,16 @@
vpsrld $1, c, x; \
vpslld $(32 - 1), c, c; \
vpor c, x, c; \
- vpslld $1, d, x; \
- vpsrld $(32 - 1), d, d; \
- vpor d, x, d; \
vpxor d, y, d;

-#define decround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define decround_g1g2(a, b, c, d, x, y) \
+ G(a, x, s0, s1, s2, s3); \
+ G(b, y, s1, s2, s3, s0);
+
+#define decround_end(a, b, c, d, x, y) \
+ vpslld $1, c, RT; \
+ vpsrld $(32 - 1), c, c; \
+ vpor c, RT, c; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd y, RK2, y; \
@@ -130,23 +147,37 @@
vpsrld $1, d, y; \
vpslld $(32 - 1), d, d; \
vpor d, y, d; \
- vpslld $1, c, y; \
- vpsrld $(32 - 1), c, c; \
- vpor c, y, c; \
vpaddd x, RK1, x; \
vpxor x, c, c;

#define encrypt_round(n, a, b, c, d) \
vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- encround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- encround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+ encround_g1g2(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0); \
+ encround_g1g2(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1); \
+ encround_end(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0); \
+ encround_end(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1);
+
+.align 4
+decround_RARBRCRD:
+ decround_g1g2(RA1, RB1, RC1, RD1, RX0, RY0);
+ decround_g1g2(RA2, RB2, RC2, RD2, RX1, RY1);
+ decround_end(RA1, RB1, RC1, RD1, RX0, RY0);
+ decround_end(RA2, RB2, RC2, RD2, RX1, RY1);
+ ret;
+
+.align 4
+decround_RCRDRARB:
+ decround_g1g2(RC1, RD1, RA1, RB1, RX0, RY0);
+ decround_g1g2(RC2, RD2, RA2, RB2, RX1, RY1);
+ decround_end(RC1, RD1, RA1, RB1, RX0, RY0);
+ decround_end(RC2, RD2, RA2, RB2, RX1, RY1);
+ ret;

#define decrypt_round(n, a, b, c, d) \
vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- decround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- decround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+ call decround_ ## a ## b ## c ## d;

#define encrypt_cycle(n) \
encrypt_round((2*n), RA, RB, RC, RD); \
@@ -156,7 +187,6 @@
decrypt_round(((2*n) + 1), RC, RD, RA, RB); \
decrypt_round((2*n), RA, RB, RC, RD);

-
#define transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
vpunpckldq x1, x0, t0; \
vpunpckhdq x1, x0, t2; \
@@ -222,8 +252,8 @@ __twofish_enc_blk_8way:
vmovdqu w(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

xorq RID1, RID1;
xorq RID2, RID2;
@@ -247,14 +277,14 @@ __twofish_enc_blk_8way:
testb %cl, %cl;
jnz __enc_xor8;

- outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

__enc_xor8:
- outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

@@ -274,8 +304,8 @@ twofish_dec_blk_8way:
vmovdqu (w+4*4)(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

xorq RID1, RID1;
xorq RID2, RID2;
@@ -294,7 +324,7 @@ twofish_dec_blk_8way:
popq %rbx;

leaq (4*4*4)(%rsi), %rax;
- outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

ret;

2012-08-16 13:29:29

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:
> About ~5% slower, probably because I was tuning for sandy-bridge and
> introduced more FPU<=>CPU register moves.
>
> Here's new version of patch, with FPU<=>CPU moves from original
> implementation.
>
> (Note: also changes encryption function to inline all code in to main
> function, decryption still places common code to separate function to
> reduce object size. This is to measure the difference.)

Yep, looks better than the previous run and also a bit better or on par
with the initial run I did.
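
Rather than eyeballing two dmesg dumps, the comparison can be scripted. The
following is a minimal sketch, not the script anyone in this thread used:
the function names are hypothetical, and the sample lines are two
cbc(twofish) decryption results copied from the earlier log in this thread
and two from the 4th run below. It parses tcrypt's "N operations in S
seconds (B bytes)" lines and reports new/old throughput per key size and
block size.

```python
import re

# Matches tcrypt speed-test lines as they appear in the logs in this thread.
LINE_RE = re.compile(
    r"test (\d+) \((\d+) bit key, (\d+) byte blocks\): "
    r"(\d+) operations in (\d+) seconds \((\d+) bytes\)"
)


def parse_tcrypt(log):
    """Return {(key_bits, block_bytes): bytes_per_second} for one test block."""
    results = {}
    for line in log.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        _test, key_bits, block_bytes, ops, secs, nbytes = map(int, m.groups())
        # Sanity check: the byte count should equal operations * block size.
        if ops * block_bytes != nbytes:
            raise ValueError("inconsistent line: " + line)
        results[(key_bits, block_bytes)] = nbytes / secs
    return results


def speedup(old_log, new_log):
    """Ratio new/old throughput for every test present in both logs."""
    old, new = parse_tcrypt(old_log), parse_tcrypt(new_log)
    return {key: new[key] / old[key] for key in old if key in new}


# cbc(twofish) decryption, 128 bit key, from the earlier log in this thread.
PREVIOUS_RUN = """\
[ 143.541345] test 0 (128 bit key, 16 byte blocks): 4461471 operations in 1 seconds (71383536 bytes)
[ 147.566457] test 4 (128 bit key, 8192 byte blocks): 18667 operations in 1 seconds (152920064 bytes)
"""

# The same two tests from the 4th run.
FOURTH_RUN = """\
[ 1059.400187] test 0 (128 bit key, 16 byte blocks): 4889197 operations in 1 seconds (78227152 bytes)
[ 1063.417258] test 4 (128 bit key, 8192 byte blocks): 20312 operations in 1 seconds (166395904 bytes)
"""

for (key_bits, block_bytes), ratio in sorted(speedup(PREVIOUS_RUN, FOURTH_RUN).items()):
    print(f"{key_bits} bit key, {block_bytes} byte blocks: {ratio:.3f}x")
```

Each tcrypt test runs for only one second, so single-run ratios like these
are noisy; averaging several runs would give a more trustworthy picture.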

The thing is, I'm not sure whether optimizing the thing for each uarch
is a workable solution software-wise or maybe having a single version
which performs sufficiently ok on all uarches is easier/better to
maintain without causing code bloat. Hmmm...

4th:
====
ran like 1st.

[ 1014.074150]
[ 1014.074150] testing speed of async ecb(twofish) encryption
[ 1014.083829] test 0 (128 bit key, 16 byte blocks): 4870055 operations in 1 seconds (77920880 bytes)
[ 1015.092757] test 1 (128 bit key, 64 byte blocks): 2043828 operations in 1 seconds (130804992 bytes)
[ 1016.099441] test 2 (128 bit key, 256 byte blocks): 606400 operations in 1 seconds (155238400 bytes)
[ 1017.105939] test 3 (128 bit key, 1024 byte blocks): 168939 operations in 1 seconds (172993536 bytes)
[ 1018.112517] test 4 (128 bit key, 8192 byte blocks): 21777 operations in 1 seconds (178397184 bytes)
[ 1019.119035] test 5 (192 bit key, 16 byte blocks): 4882254 operations in 1 seconds (78116064 bytes)
[ 1020.125716] test 6 (192 bit key, 64 byte blocks): 2043230 operations in 1 seconds (130766720 bytes)
[ 1021.132391] test 7 (192 bit key, 256 byte blocks): 607477 operations in 1 seconds (155514112 bytes)
[ 1022.138889] test 8 (192 bit key, 1024 byte blocks): 168743 operations in 1 seconds (172792832 bytes)
[ 1023.145476] test 9 (192 bit key, 8192 byte blocks): 21442 operations in 1 seconds (175652864 bytes)
[ 1024.152012] test 10 (256 bit key, 16 byte blocks): 4891863 operations in 1 seconds (78269808 bytes)
[ 1025.158684] test 11 (256 bit key, 64 byte blocks): 2049390 operations in 1 seconds (131160960 bytes)
[ 1026.165366] test 12 (256 bit key, 256 byte blocks): 606847 operations in 1 seconds (155352832 bytes)
[ 1027.171841] test 13 (256 bit key, 1024 byte blocks): 169228 operations in 1 seconds (173289472 bytes)
[ 1028.178436] test 14 (256 bit key, 8192 byte blocks): 21773 operations in 1 seconds (178364416 bytes)
[ 1029.184981]
[ 1029.184981] testing speed of async ecb(twofish) decryption
[ 1029.194508] test 0 (128 bit key, 16 byte blocks): 4931065 operations in 1 seconds (78897040 bytes)
[ 1030.199640] test 1 (128 bit key, 64 byte blocks): 2056931 operations in 1 seconds (131643584 bytes)
[ 1031.206303] test 2 (128 bit key, 256 byte blocks): 589409 operations in 1 seconds (150888704 bytes)
[ 1032.212832] test 3 (128 bit key, 1024 byte blocks): 163681 operations in 1 seconds (167609344 bytes)
[ 1033.219443] test 4 (128 bit key, 8192 byte blocks): 21062 operations in 1 seconds (172539904 bytes)
[ 1034.225979] test 5 (192 bit key, 16 byte blocks): 4931537 operations in 1 seconds (78904592 bytes)
[ 1035.232608] test 6 (192 bit key, 64 byte blocks): 2053989 operations in 1 seconds (131455296 bytes)
[ 1036.239289] test 7 (192 bit key, 256 byte blocks): 589591 operations in 1 seconds (150935296 bytes)
[ 1037.241784] test 8 (192 bit key, 1024 byte blocks): 163565 operations in 1 seconds (167490560 bytes)
[ 1038.244387] test 9 (192 bit key, 8192 byte blocks): 20899 operations in 1 seconds (171204608 bytes)
[ 1039.250923] test 10 (256 bit key, 16 byte blocks): 4937343 operations in 1 seconds (78997488 bytes)
[ 1040.257589] test 11 (256 bit key, 64 byte blocks): 2050678 operations in 1 seconds (131243392 bytes)
[ 1041.264262] test 12 (256 bit key, 256 byte blocks): 586869 operations in 1 seconds (150238464 bytes)
[ 1042.270753] test 13 (256 bit key, 1024 byte blocks): 163548 operations in 1 seconds (167473152 bytes)
[ 1043.277365] test 14 (256 bit key, 8192 byte blocks): 21053 operations in 1 seconds (172466176 bytes)
[ 1044.283892]
[ 1044.283892] testing speed of async cbc(twofish) encryption
[ 1044.293349] test 0 (128 bit key, 16 byte blocks): 5186240 operations in 1 seconds (82979840 bytes)
[ 1045.298534] test 1 (128 bit key, 64 byte blocks): 1921034 operations in 1 seconds (122946176 bytes)
[ 1046.305207] test 2 (128 bit key, 256 byte blocks): 542787 operations in 1 seconds (138953472 bytes)
[ 1047.311699] test 3 (128 bit key, 1024 byte blocks): 141399 operations in 1 seconds (144792576 bytes)
[ 1048.318312] test 4 (128 bit key, 8192 byte blocks): 17755 operations in 1 seconds (145448960 bytes)
[ 1049.324829] test 5 (192 bit key, 16 byte blocks): 5196441 operations in 1 seconds (83143056 bytes)
[ 1050.331485] test 6 (192 bit key, 64 byte blocks): 1921456 operations in 1 seconds (122973184 bytes)
[ 1051.338157] test 7 (192 bit key, 256 byte blocks): 543581 operations in 1 seconds (139156736 bytes)
[ 1052.344658] test 8 (192 bit key, 1024 byte blocks): 141473 operations in 1 seconds (144868352 bytes)
[ 1053.351270] test 9 (192 bit key, 8192 byte blocks): 17601 operations in 1 seconds (144187392 bytes)
[ 1054.357823] test 10 (256 bit key, 16 byte blocks): 5190283 operations in 1 seconds (83044528 bytes)
[ 1055.364462] test 11 (256 bit key, 64 byte blocks): 1912796 operations in 1 seconds (122418944 bytes)
[ 1056.371134] test 12 (256 bit key, 256 byte blocks): 542719 operations in 1 seconds (138936064 bytes)
[ 1057.377643] test 13 (256 bit key, 1024 byte blocks): 141377 operations in 1 seconds (144770048 bytes)
[ 1058.384229] test 14 (256 bit key, 8192 byte blocks): 17752 operations in 1 seconds (145424384 bytes)
[ 1059.390799]
[ 1059.390799] testing speed of async cbc(twofish) decryption
[ 1059.400187] test 0 (128 bit key, 16 byte blocks): 4889197 operations in 1 seconds (78227152 bytes)
[ 1060.405460] test 1 (128 bit key, 64 byte blocks): 1980831 operations in 1 seconds (126773184 bytes)
[ 1061.408145] test 2 (128 bit key, 256 byte blocks): 568695 operations in 1 seconds (145585920 bytes)
[ 1062.410647] test 3 (128 bit key, 1024 byte blocks): 158294 operations in 1 seconds (162093056 bytes)
[ 1063.417258] test 4 (128 bit key, 8192 byte blocks): 20312 operations in 1 seconds (166395904 bytes)
[ 1064.423758] test 5 (192 bit key, 16 byte blocks): 4904906 operations in 1 seconds (78478496 bytes)
[ 1065.430440] test 6 (192 bit key, 64 byte blocks): 1983636 operations in 1 seconds (126952704 bytes)
[ 1066.437104] test 7 (192 bit key, 256 byte blocks): 564340 operations in 1 seconds (144471040 bytes)
[ 1067.443613] test 8 (192 bit key, 1024 byte blocks): 157404 operations in 1 seconds (161181696 bytes)
[ 1068.450216] test 9 (192 bit key, 8192 byte blocks): 20055 operations in 1 seconds (164290560 bytes)
[ 1069.456753] test 10 (256 bit key, 16 byte blocks): 4901215 operations in 1 seconds (78419440 bytes)
[ 1070.463417] test 11 (256 bit key, 64 byte blocks): 1978968 operations in 1 seconds (126653952 bytes)
[ 1071.470073] test 12 (256 bit key, 256 byte blocks): 568440 operations in 1 seconds (145520640 bytes)
[ 1072.476580] test 13 (256 bit key, 1024 byte blocks): 158329 operations in 1 seconds (162128896 bytes)
[ 1073.483177] test 14 (256 bit key, 8192 byte blocks): 20311 operations in 1 seconds (166387712 bytes)
[ 1074.489739]
[ 1074.489739] testing speed of async ctr(twofish) encryption
[ 1074.499266] test 0 (128 bit key, 16 byte blocks): 4565109 operations in 1 seconds (73041744 bytes)
[ 1075.504391] test 1 (128 bit key, 64 byte blocks): 1955085 operations in 1 seconds (125125440 bytes)
[ 1076.511055] test 2 (128 bit key, 256 byte blocks): 573971 operations in 1 seconds (146936576 bytes)
[ 1077.517563] test 3 (128 bit key, 1024 byte blocks): 158489 operations in 1 seconds (162292736 bytes)
[ 1078.524175] test 4 (128 bit key, 8192 byte blocks): 20330 operations in 1 seconds (166543360 bytes)
[ 1079.530702] test 5 (192 bit key, 16 byte blocks): 4550468 operations in 1 seconds (72807488 bytes)
[ 1080.537358] test 6 (192 bit key, 64 byte blocks): 1943897 operations in 1 seconds (124409408 bytes)
[ 1081.544030] test 7 (192 bit key, 256 byte blocks): 564033 operations in 1 seconds (144392448 bytes)
[ 1082.550531] test 8 (192 bit key, 1024 byte blocks): 157126 operations in 1 seconds (160897024 bytes)
[ 1083.557170] test 9 (192 bit key, 8192 byte blocks): 20121 operations in 1 seconds (164831232 bytes)
[ 1084.563713] test 10 (256 bit key, 16 byte blocks): 4403637 operations in 1 seconds (70458192 bytes)
[ 1085.570360] test 11 (256 bit key, 64 byte blocks): 1961264 operations in 1 seconds (125520896 bytes)
[ 1086.577008] test 12 (256 bit key, 256 byte blocks): 571514 operations in 1 seconds (146307584 bytes)
[ 1087.583517] test 13 (256 bit key, 1024 byte blocks): 158342 operations in 1 seconds (162142208 bytes)
[ 1088.590121] test 14 (256 bit key, 8192 byte blocks): 20392 operations in 1 seconds (167051264 bytes)
[ 1089.596648]
[ 1089.596648] testing speed of async ctr(twofish) decryption
[ 1089.606061] test 0 (128 bit key, 16 byte blocks): 4517104 operations in 1 seconds (72273664 bytes)
[ 1090.611326] test 1 (128 bit key, 64 byte blocks): 1953102 operations in 1 seconds (124998528 bytes)
[ 1091.617989] test 2 (128 bit key, 256 byte blocks): 574354 operations in 1 seconds (147034624 bytes)
[ 1092.624497] test 3 (128 bit key, 1024 byte blocks): 158402 operations in 1 seconds (162203648 bytes)
[ 1093.631110] test 4 (128 bit key, 8192 byte blocks): 20369 operations in 1 seconds (166862848 bytes)
[ 1094.637618] test 5 (192 bit key, 16 byte blocks): 4524710 operations in 1 seconds (72395360 bytes)
[ 1095.644293] test 6 (192 bit key, 64 byte blocks): 1940148 operations in 1 seconds (124169472 bytes)
[ 1096.650957] test 7 (192 bit key, 256 byte blocks): 567684 operations in 1 seconds (145327104 bytes)
[ 1097.657466] test 8 (192 bit key, 1024 byte blocks): 158922 operations in 1 seconds (162736128 bytes)
[ 1098.664088] test 9 (192 bit key, 8192 byte blocks): 20087 operations in 1 seconds (164552704 bytes)
[ 1099.670596] test 10 (256 bit key, 16 byte blocks): 4397085 operations in 1 seconds (70353360 bytes)
[ 1100.677278] test 11 (256 bit key, 64 byte blocks): 1961007 operations in 1 seconds (125504448 bytes)
[ 1101.683933] test 12 (256 bit key, 256 byte blocks): 577961 operations in 1 seconds (147958016 bytes)
[ 1102.690452] test 13 (256 bit key, 1024 byte blocks): 158836 operations in 1 seconds (162648064 bytes)
[ 1103.697038] test 14 (256 bit key, 8192 byte blocks): 20427 operations in 1 seconds (167337984 bytes)
[ 1104.703575]
[ 1104.703575] testing speed of async lrw(twofish) encryption
[ 1104.713108] test 0 (256 bit key, 16 byte blocks): 3555452 operations in 1 seconds (56887232 bytes)
[ 1105.718261] test 1 (256 bit key, 64 byte blocks): 1617632 operations in 1 seconds (103528448 bytes)
[ 1106.724924] test 2 (256 bit key, 256 byte blocks): 495199 operations in 1 seconds (126770944 bytes)
[ 1107.731442] test 3 (256 bit key, 1024 byte blocks): 137358 operations in 1 seconds (140654592 bytes)
[ 1108.738065] test 4 (256 bit key, 8192 byte blocks): 17637 operations in 1 seconds (144482304 bytes)
[ 1109.740593] test 5 (320 bit key, 16 byte blocks): 3478175 operations in 1 seconds (55650800 bytes)
[ 1110.743248] test 6 (320 bit key, 64 byte blocks): 1591957 operations in 1 seconds (101885248 bytes)
[ 1111.749911] test 7 (320 bit key, 256 byte blocks): 493803 operations in 1 seconds (126413568 bytes)
[ 1112.756430] test 8 (320 bit key, 1024 byte blocks): 137066 operations in 1 seconds (140355584 bytes)
[ 1113.763034] test 9 (320 bit key, 8192 byte blocks): 17288 operations in 1 seconds (141623296 bytes)
[ 1114.769587] test 10 (384 bit key, 16 byte blocks): 3576437 operations in 1 seconds (57222992 bytes)
[ 1115.776232] test 11 (384 bit key, 64 byte blocks): 1587771 operations in 1 seconds (101617344 bytes)
[ 1116.782890] test 12 (384 bit key, 256 byte blocks): 493841 operations in 1 seconds (126423296 bytes)
[ 1117.789396] test 13 (384 bit key, 1024 byte blocks): 137324 operations in 1 seconds (140619776 bytes)
[ 1118.795993] test 14 (384 bit key, 8192 byte blocks): 17625 operations in 1 seconds (144384000 bytes)
[ 1119.802548]
[ 1119.802548] testing speed of async lrw(twofish) decryption
[ 1119.811940] test 0 (256 bit key, 16 byte blocks): 3590161 operations in 1 seconds (57442576 bytes)
[ 1120.817198] test 1 (256 bit key, 64 byte blocks): 1623745 operations in 1 seconds (103919680 bytes)
[ 1121.823872] test 2 (256 bit key, 256 byte blocks): 482001 operations in 1 seconds (123392256 bytes)
[ 1122.830398] test 3 (256 bit key, 1024 byte blocks): 133842 operations in 1 seconds (137054208 bytes)
[ 1123.836992] test 4 (256 bit key, 8192 byte blocks): 17195 operations in 1 seconds (140861440 bytes)
[ 1124.843536] test 5 (320 bit key, 16 byte blocks): 3536998 operations in 1 seconds (56591968 bytes)
[ 1125.850156] test 6 (320 bit key, 64 byte blocks): 1625698 operations in 1 seconds (104044672 bytes)
[ 1126.856830] test 7 (320 bit key, 256 byte blocks): 482518 operations in 1 seconds (123524608 bytes)
[ 1127.863348] test 8 (320 bit key, 1024 byte blocks): 133672 operations in 1 seconds (136880128 bytes)
[ 1128.869959] test 9 (320 bit key, 8192 byte blocks): 16860 operations in 1 seconds (138117120 bytes)
[ 1129.876469] test 10 (384 bit key, 16 byte blocks): 3637750 operations in 1 seconds (58204000 bytes)
[ 1130.883151] test 11 (384 bit key, 64 byte blocks): 1626131 operations in 1 seconds (104072384 bytes)
[ 1131.889814] test 12 (384 bit key, 256 byte blocks): 483999 operations in 1 seconds (123903744 bytes)
[ 1132.896324] test 13 (384 bit key, 1024 byte blocks): 133598 operations in 1 seconds (136804352 bytes)
[ 1133.902920] test 14 (384 bit key, 8192 byte blocks): 17206 operations in 1 seconds (140951552 bytes)
[ 1134.905485]
[ 1134.905485] testing speed of async xts(twofish) encryption
[ 1134.905501] test 0 (256 bit key, 16 byte blocks): 2908165 operations in 1 seconds (46530640 bytes)
[ 1135.908137] test 1 (256 bit key, 64 byte blocks): 1462715 operations in 1 seconds (93613760 bytes)
[ 1136.914715] test 2 (256 bit key, 256 byte blocks): 506478 operations in 1 seconds (129658368 bytes)
[ 1137.921320] test 3 (256 bit key, 1024 byte blocks): 148018 operations in 1 seconds (151570432 bytes)
[ 1138.927924] test 4 (256 bit key, 8192 byte blocks): 19435 operations in 1 seconds (159211520 bytes)
[ 1139.934451] test 5 (384 bit key, 16 byte blocks): 2905195 operations in 1 seconds (46483120 bytes)
[ 1140.941116] test 6 (384 bit key, 64 byte blocks): 1454656 operations in 1 seconds (93097984 bytes)
[ 1141.947683] test 7 (384 bit key, 256 byte blocks): 504479 operations in 1 seconds (129146624 bytes)
[ 1142.954280] test 8 (384 bit key, 1024 byte blocks): 148172 operations in 1 seconds (151728128 bytes)
[ 1143.960892] test 9 (384 bit key, 8192 byte blocks): 19433 operations in 1 seconds (159195136 bytes)
[ 1144.967410] test 10 (512 bit key, 16 byte blocks): 2904583 operations in 1 seconds (46473328 bytes)
[ 1145.974091] test 11 (512 bit key, 64 byte blocks): 1501387 operations in 1 seconds (96088768 bytes)
[ 1146.980652] test 12 (512 bit key, 256 byte blocks): 504501 operations in 1 seconds (129152256 bytes)
[ 1147.987254] test 13 (512 bit key, 1024 byte blocks): 148180 operations in 1 seconds (151736320 bytes)
[ 1148.993842] test 14 (512 bit key, 8192 byte blocks): 19439 operations in 1 seconds (159244288 bytes)
[ 1150.000380]
[ 1150.000380] testing speed of async xts(twofish) decryption
[ 1150.009770] test 0 (256 bit key, 16 byte blocks): 3007004 operations in 1 seconds (48112064 bytes)
[ 1151.015056] test 1 (256 bit key, 64 byte blocks): 1534733 operations in 1 seconds (98222912 bytes)
[ 1152.021642] test 2 (256 bit key, 256 byte blocks): 508129 operations in 1 seconds (130081024 bytes)
[ 1153.028246] test 3 (256 bit key, 1024 byte blocks): 144920 operations in 1 seconds (148398080 bytes)
[ 1154.034859] test 4 (256 bit key, 8192 byte blocks): 18870 operations in 1 seconds (154583040 bytes)
[ 1155.041367] test 5 (384 bit key, 16 byte blocks): 3009083 operations in 1 seconds (48145328 bytes)
[ 1156.048040] test 6 (384 bit key, 64 byte blocks): 1535084 operations in 1 seconds (98245376 bytes)
[ 1157.054609] test 7 (384 bit key, 256 byte blocks): 508112 operations in 1 seconds (130076672 bytes)
[ 1158.061215] test 8 (384 bit key, 1024 byte blocks): 145035 operations in 1 seconds (148515840 bytes)
[ 1159.067830] test 9 (384 bit key, 8192 byte blocks): 18890 operations in 1 seconds (154746880 bytes)
[ 1160.070368] test 10 (512 bit key, 16 byte blocks): 3076988 operations in 1 seconds (49231808 bytes)
[ 1161.073040] test 11 (512 bit key, 64 byte blocks): 1540659 operations in 1 seconds (98602176 bytes)
[ 1162.079610] test 12 (512 bit key, 256 byte blocks): 508316 operations in 1 seconds (130128896 bytes)
[ 1163.086195] test 13 (512 bit key, 1024 byte blocks): 144951 operations in 1 seconds (148429824 bytes)
[ 1164.092792] test 14 (512 bit key, 8192 byte blocks): 18865 operations in 1 seconds (154542080 bytes)

--
Regards/Gruss,
Boris.

2012-08-16 14:26:06

by Jussi Kivilinna

[permalink] [raw]
Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

> On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:
>> About ~5% slower, probably because I was tuning for sandy-bridge and
>> introduced more FPU<=>CPU register moves.
>>
>> Here's new version of patch, with FPU<=>CPU moves from original
>> implementation.
>>
>> (Note: also changes encryption function to inline all code in to main
>> function, decryption still places common code to separate function to
>> reduce object size. This is to measure the difference.)
>
> Yep, looks better than the previous run and also a bit better or on par
> with the initial run I did.

Thanks again. The speed gained with the patch is ~8%, which is enough to
get twofish-avx past twofish-3way.
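
For readers who want to check figures like this against the raw tcrypt
output, the following is a minimal Python sketch, not part of the patch or
of tcrypt. It parses the log format shown in this thread and compares two
runs. The names parse_tcrypt and speedup are hypothetical helpers made up
for this illustration. The sample lines are the first four ecb(twofish)
encryption results from the "4th" run and from the 2012-08-20 follow-up
run posted in this thread; they only exercise the parser and are not meant
to reproduce any particular percentage.

```python
import re

# Section header, e.g. "testing speed of async ecb(twofish) encryption"
HEADER = re.compile(r"testing speed of async (\S+) (encryption|decryption)")
# Result line, e.g. "test 0 (128 bit key, 16 byte blocks): N operations in 1 seconds (M bytes)"
RESULT = re.compile(
    r"test \d+ \((\d+) bit key, (\d+) byte blocks\): "
    r"(\d+) operations in (\d+) seconds \((\d+) bytes\)"
)

def parse_tcrypt(log):
    """Map (algorithm, direction, key bits, block size) to operations per second."""
    results = {}
    section = None
    for line in log.splitlines():
        header = HEADER.search(line)
        if header:
            section = (header.group(1), header.group(2))
            continue
        result = RESULT.search(line)
        if result and section:
            key_bits, block, ops, secs, nbytes = map(int, result.groups())
            # tcrypt reports bytes = operations * block size; use it as a sanity check.
            if ops * block != nbytes:
                raise ValueError("inconsistent log line: " + line)
            results[section + (key_bits, block)] = ops / secs
    return results

def speedup(new, old):
    """Ratio new/old for every test present in both runs."""
    return {key: new[key] / old[key] for key in new.keys() & old.keys()}

RUN_A = """\
[ 1014.074150] testing speed of async ecb(twofish) encryption
[ 1014.083829] test 0 (128 bit key, 16 byte blocks): 4870055 operations in 1 seconds (77920880 bytes)
[ 1015.092757] test 1 (128 bit key, 64 byte blocks): 2043828 operations in 1 seconds (130804992 bytes)
[ 1016.099441] test 2 (128 bit key, 256 byte blocks): 606400 operations in 1 seconds (155238400 bytes)
[ 1017.105939] test 3 (128 bit key, 1024 byte blocks): 168939 operations in 1 seconds (172993536 bytes)
"""

RUN_B = """\
[   52.282208] testing speed of async ecb(twofish) encryption
[   52.291580] test 0 (128 bit key, 16 byte blocks): 4890079 operations in 1 seconds (78241264 bytes)
[   53.301588] test 1 (128 bit key, 64 byte blocks): 2045945 operations in 1 seconds (130940480 bytes)
[   54.309656] test 2 (128 bit key, 256 byte blocks): 604184 operations in 1 seconds (154671104 bytes)
[   55.317289] test 3 (128 bit key, 1024 byte blocks): 168541 operations in 1 seconds (172585984 bytes)
"""

if __name__ == "__main__":
    ratios = speedup(parse_tcrypt(RUN_B), parse_tcrypt(RUN_A))
    for key in sorted(ratios):
        print(key, "%.3fx" % ratios[key])
```

Run-to-run noise in these one-second measurements is of the same order as
small differences between patch versions, so ratios close to 1.00x should
be read with care.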

>
> The thing is, I'm not sure whether optimizing the thing for each uarch
> is a workable solution software-wise or maybe having a single version
> which performs sufficiently ok on all uarches is easier/better to
> maintain without causing code bloat. Hmmm...

Agreed, testing on multiple CPUs to get a single well-working version is
what I have done in the past. But purchasing all the latest CPUs on
the market isn't an option for me, and for testing AVX I'm stuck with
Sandy Bridge :)

-Jussi

2012-08-17 07:37:17

by Jussi Kivilinna

[permalink] [raw]
Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

>
> Yep, looks better than the previous run and also a bit better or on par
> with the initial run I did.
>

I made a few further changes, mainly moving/interleaving 'vmovq/vpextrq'
ahead so that they complete before their target registers are needed. This
only gave a 0.5% increase on Sandy Bridge, but might help more on Bulldozer.

-Jussi
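
As a reading aid for the patch that follows, here is a rough Python model
of the data flow in the lookup_32bit and G macros as they appear in the
diff. It is a sketch under assumptions, not the kernel code: it models only
the values computed and says nothing about instruction scheduling, which is
what the patch actually tunes. The IDENTITY tables are placeholders chosen
so the result is easy to check by hand; the real tables are the
key-dependent s0..s3 arrays in the cipher context. The helper name g_half
is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def lookup_32bit(t0, t1, t2, t3, src):
    """XOR four table entries, each selected by one byte of the low 32 bits of src.

    Mirrors the macro: movzbl of the low two bytes, shrq $16, movzbl of the
    next two bytes, with the four loads combined by movl/xorl.
    """
    b0 = src & 0xFF
    b1 = (src >> 8) & 0xFF
    b2 = (src >> 16) & 0xFF
    b3 = (src >> 24) & 0xFF
    return (t0[b0] ^ t1[b1] ^ t2[b2] ^ t3[b3]) & MASK32

def g_half(tables, gi):
    """Process one 64-bit general-purpose register holding two 32-bit words.

    The low word is looked up first, the register is shifted down (shr_next),
    the high word is looked up, and the results are merged with shlq/orq.
    """
    lo = lookup_32bit(*tables, gi & MASK32)
    hi = lookup_32bit(*tables, (gi >> 32) & MASK32)
    return (hi << 32) | lo

def G(tables, gi1, gi2):
    """Return the (low, high) quadwords that vmovq/vpinsrq place in the xmm result."""
    return g_half(tables, gi1), g_half(tables, gi2)

# Placeholder tables (NOT Twofish s-boxes): table k maps byte i to i << (8*k),
# so a lookup through (t0, t1, t2, t3) reproduces its 32-bit input unchanged.
IDENTITY = tuple([i << (8 * k) for i in range(256)] for k in range(4))

if __name__ == "__main__":
    low, high = G(IDENTITY, 0x1122334455667788, 0x0123456789ABCDEF)
    print(hex(low), hex(high))
```

In the round macros the first G call uses the tables in the order
s0, s1, s2, s3 and the second in the order s1, s2, s3, s0; in this model
that corresponds to passing a rotated tuple of tables.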

---
arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 205 +++++++++++++++++----------
1 file changed, 130 insertions(+), 75 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..6638a87 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -4,6 +4,8 @@
* Copyright (C) 2012 Johannes Goetzfried
* <[email protected]>
*
+ * Copyright © 2012 Jussi Kivilinna <[email protected]>
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -47,16 +49,21 @@
#define RC2 %xmm6
#define RD2 %xmm7

-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
+
+#define RX1 %xmm10
+#define RY1 %xmm11

-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RK1 %xmm12
+#define RK2 %xmm13

-#define RID1 %rax
-#define RID1b %al
-#define RID2 %rbx
-#define RID2b %bl
+#define RT %xmm14
+
+#define RID1 %rbp
+#define RID1d %ebp
+#define RID2 %rsi
+#define RID2d %esi

#define RGI1 %rdx
#define RGI1bl %dl
@@ -65,6 +72,13 @@
#define RGI2bl %cl
#define RGI2bh %ch

+#define RGI3 %rax
+#define RGI3bl %al
+#define RGI3bh %ah
+#define RGI4 %rbx
+#define RGI4bl %bl
+#define RGI4bh %bh
+
#define RGS1 %r8
#define RGS1d %r8d
#define RGS2 %r9
@@ -73,40 +87,53 @@
#define RGS3d %r10d


-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ shrq $16, src; \
movl t0(CTX, RID1, 4), dst ## d; \
xorl t1(CTX, RID2, 4), dst ## d; \
- shrq $16, src; \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ interleave_op(il_reg); \
xorl t2(CTX, RID1, 4), dst ## d; \
xorl t3(CTX, RID2, 4), dst ## d;

-#define G(a, x, t0, t1, t2, t3) \
- vmovq a, RGI1; \
- vpsrldq $8, a, x; \
- vmovq x, RGI2; \
- \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
- shrq $16, RGI1; \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
- shlq $32, RGS2; \
- orq RGS1, RGS2; \
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+ shrq $16, reg;
+
+#define G(gi1, gi2, x, t0, t1, t2, t3) \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \
+ shlq $32, RGS2; \
+ orq RGS1, RGS2; \
\
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
- shrq $16, RGI2; \
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
- shlq $32, RGS3; \
- orq RGS1, RGS3; \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \
+ shlq $32, RGS1; \
+ orq RGS1, RGS3; \
\
- vmovq RGS2, x; \
+ vmovq RGS2, x; \
vpinsrq $1, RGS3, x, x;

-#define encround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define encround_head_2(a, b, c, d, x1, y1, x2, y2) \
+ vmovq b ## 1, RGI3; \
+ vpextrq $1, b ## 1, RGI4; \
+ G(RGI1, RGI2, x1, s0, s1, s2, s3); \
+ vmovq a ## 2, RGI1; \
+ vpextrq $1, a ## 2, RGI2; \
+ G(RGI3, RGI4, y1, s1, s2, s3, s0); \
+ vmovq b ## 2, RGI3; \
+ vpextrq $1, b ## 2, RGI4; \
+ G(RGI1, RGI2, x2, s0, s1, s2, s3); \
+ G(RGI3, RGI4, y2, s1, s2, s3, s0);
+
+#define encround_tail(a, b, c, d, x, y) \
+ vpslld $1, d, RT; \
+ vpsrld $(32 - 1), d, d; \
+ vpor d, RT, d; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd x, RK1, x; \
@@ -115,14 +142,24 @@
vpsrld $1, c, x; \
vpslld $(32 - 1), c, c; \
vpor c, x, c; \
- vpslld $1, d, x; \
- vpsrld $(32 - 1), d, d; \
- vpor d, x, d; \
vpxor d, y, d;

-#define decround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define decround_head_2(a, b, c, d, x1, y1, x2, y2) \
+ vmovq b ## 1, RGI3; \
+ vpextrq $1, b ## 1, RGI4; \
+ G(RGI1, RGI2, x1, s0, s1, s2, s3); \
+ vmovq a ## 2, RGI1; \
+ vpextrq $1, a ## 2, RGI2; \
+ G(RGI3, RGI4, y1, s1, s2, s3, s0); \
+ vmovq b ## 2, RGI3; \
+ vpextrq $1, b ## 2, RGI4; \
+ G(RGI1, RGI2, x2, s0, s1, s2, s3); \
+ G(RGI3, RGI4, y2, s1, s2, s3, s0);
+
+#define decround_tail(a, b, c, d, x, y) \
+ vpslld $1, c, RT; \
+ vpsrld $(32 - 1), c, c; \
+ vpor c, RT, c; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd y, RK2, y; \
@@ -130,32 +167,44 @@
vpsrld $1, d, y; \
vpslld $(32 - 1), d, d; \
vpor d, y, d; \
- vpslld $1, c, y; \
- vpsrld $(32 - 1), c, c; \
- vpor c, y, c; \
vpaddd x, RK1, x; \
vpxor x, c, c;

-#define encrypt_round(n, a, b, c, d) \
- vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
- vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- encround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- encround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
-
-#define decrypt_round(n, a, b, c, d) \
- vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
- vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- decround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- decround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+#define preload_rgi(c) \
+ vmovq c, RGI1; \
+ vpextrq $1, c, RGI2;
+
+#define encrypt_round(n, a, b, c, d, preload) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ encround_head_2(a, b, c, d, RX0, RY0, RX1, RY1); \
+ encround_tail(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0); \
+ preload(c ## 1); \
+ encround_tail(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1);
+
+#define decrypt_round(n, a, b, c, d, preload) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ decround_head_2(a, b, c, d, RX0, RY0, RX1, RY1); \
+ decround_tail(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0); \
+ preload(c ## 1); \
+ decround_tail(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1);

#define encrypt_cycle(n) \
- encrypt_round((2*n), RA, RB, RC, RD); \
- encrypt_round(((2*n) + 1), RC, RD, RA, RB);
+ encrypt_round((2*n), RA, RB, RC, RD, preload_rgi); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi);
+
+#define encrypt_cycle_last(n) \
+ encrypt_round((2*n), RA, RB, RC, RD, preload_rgi); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB, dummy);

#define decrypt_cycle(n) \
- decrypt_round(((2*n) + 1), RC, RD, RA, RB); \
- decrypt_round((2*n), RA, RB, RC, RD);
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi); \
+ decrypt_round((2*n), RA, RB, RC, RD, preload_rgi);

+#define decrypt_cycle_last(n) \
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi); \
+ decrypt_round((2*n), RA, RB, RC, RD, dummy);

#define transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
vpunpckldq x1, x0, t0; \
@@ -216,17 +265,19 @@ __twofish_enc_blk_8way:
* %rcx: bool, if true: xor output
*/

+ pushq %rbp;
pushq %rbx;
pushq %rcx;

vmovdqu w(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ vmovq RA1, RGI1;
+ vpextrq $1, RA1, RGI2;
+ inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

- xorq RID1, RID1;
- xorq RID2, RID2;
+ movq %rsi, %r11;

encrypt_cycle(0);
encrypt_cycle(1);
@@ -235,26 +286,27 @@ __twofish_enc_blk_8way:
encrypt_cycle(4);
encrypt_cycle(5);
encrypt_cycle(6);
- encrypt_cycle(7);
+ encrypt_cycle_last(7);

vmovdqu (w+4*4)(CTX), RK1;

popq %rcx;
popq %rbx;
+ popq %rbp;

- leaq (4*4*4)(%rsi), %rax;
+ leaq (4*4*4)(%r11), %rax;

testb %cl, %cl;
jnz __enc_xor8;

- outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_blocks(%r11, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

__enc_xor8:
- outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_xor_blocks(%r11, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

@@ -269,16 +321,18 @@ twofish_dec_blk_8way:
* %rdx: src
*/

+ pushq %rbp;
pushq %rbx;

vmovdqu (w+4*4)(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ vmovq RC1, RGI1;
+ vpextrq $1, RC1, RGI2;
+ inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

- xorq RID1, RID1;
- xorq RID2, RID2;
+ movq %rsi, %r11;

decrypt_cycle(7);
decrypt_cycle(6);
@@ -287,14 +341,15 @@ twofish_dec_blk_8way:
decrypt_cycle(3);
decrypt_cycle(2);
decrypt_cycle(1);
- decrypt_cycle(0);
+ decrypt_cycle_last(0);

vmovdqu (w)(CTX), RK1;

popq %rbx;
+ popq %rbp;

- leaq (4*4*4)(%rsi), %rax;
- outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ leaq (4*4*4)(%r11), %rax;
+ outunpack_blocks(%r11, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

ret;

2012-08-20 17:32:14

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Fri, Aug 17, 2012 at 10:37:10AM +0300, Jussi Kivilinna wrote:
> I made few further changes, mainly moving/interleaving 'vmovq/vpextrq'
> ahead so they should be completed before those target registers are
> needed. This only gave 0.5% increase on Sandy-bridge, but might help
> more on Bulldozer.

Here you go:

[ 52.282208]
[ 52.282208] testing speed of async ecb(twofish) encryption
[ 52.291580] test 0 (128 bit key, 16 byte blocks): 4890079 operations in 1 seconds (78241264 bytes)
[ 53.301588] test 1 (128 bit key, 64 byte blocks): 2045945 operations in 1 seconds (130940480 bytes)
[ 54.309656] test 2 (128 bit key, 256 byte blocks): 604184 operations in 1 seconds (154671104 bytes)
[ 55.317289] test 3 (128 bit key, 1024 byte blocks): 168541 operations in 1 seconds (172585984 bytes)
[ 56.325565] test 4 (128 bit key, 8192 byte blocks): 21673 operations in 1 seconds (177545216 bytes)
[ 57.333529] test 5 (192 bit key, 16 byte blocks): 4877931 operations in 1 seconds (78046896 bytes)
[ 58.341588] test 6 (192 bit key, 64 byte blocks): 2044495 operations in 1 seconds (130847680 bytes)
[ 59.349647] test 7 (192 bit key, 256 byte blocks): 604909 operations in 1 seconds (154856704 bytes)
[ 60.357533] test 8 (192 bit key, 1024 byte blocks): 167836 operations in 1 seconds (171864064 bytes)
[ 61.365545] test 9 (192 bit key, 8192 byte blocks): 21439 operations in 1 seconds (175628288 bytes)
[ 62.369497] test 10 (256 bit key, 16 byte blocks): 4907149 operations in 1 seconds (78514384 bytes)
[ 63.373535] test 11 (256 bit key, 64 byte blocks): 2060437 operations in 1 seconds (131867968 bytes)
[ 64.381620] test 12 (256 bit key, 256 byte blocks): 604784 operations in 1 seconds (154824704 bytes)
[ 65.389523] test 13 (256 bit key, 1024 byte blocks): 168547 operations in 1 seconds (172592128 bytes)
[ 66.397520] test 14 (256 bit key, 8192 byte blocks): 21682 operations in 1 seconds (177618944 bytes)
[ 67.405461]
[ 67.405461] testing speed of async ecb(twofish) decryption
[ 67.414776] test 0 (128 bit key, 16 byte blocks): 4903251 operations in 1 seconds (78452016 bytes)
[ 68.421569] test 1 (128 bit key, 64 byte blocks): 1979230 operations in 1 seconds (126670720 bytes)
[ 69.429644] test 2 (128 bit key, 256 byte blocks): 591549 operations in 1 seconds (151436544 bytes)
[ 70.437574] test 3 (128 bit key, 1024 byte blocks): 166478 operations in 1 seconds (170473472 bytes)
[ 71.445590] test 4 (128 bit key, 8192 byte blocks): 21441 operations in 1 seconds (175644672 bytes)
[ 72.453536] test 5 (192 bit key, 16 byte blocks): 4895430 operations in 1 seconds (78326880 bytes)
[ 73.461596] test 6 (192 bit key, 64 byte blocks): 1976120 operations in 1 seconds (126471680 bytes)
[ 74.469680] test 7 (192 bit key, 256 byte blocks): 590021 operations in 1 seconds (151045376 bytes)
[ 75.477600] test 8 (192 bit key, 1024 byte blocks): 165925 operations in 1 seconds (169907200 bytes)
[ 76.485606] test 9 (192 bit key, 8192 byte blocks): 21087 operations in 1 seconds (172744704 bytes)
[ 77.493561] test 10 (256 bit key, 16 byte blocks): 4882275 operations in 1 seconds (78116400 bytes)
[ 78.501621] test 11 (256 bit key, 64 byte blocks): 1976460 operations in 1 seconds (126493440 bytes)
[ 79.509706] test 12 (256 bit key, 256 byte blocks): 591122 operations in 1 seconds (151327232 bytes)
[ 80.517617] test 13 (256 bit key, 1024 byte blocks): 166587 operations in 1 seconds (170585088 bytes)
[ 81.525606] test 14 (256 bit key, 8192 byte blocks): 21439 operations in 1 seconds (175628288 bytes)
[ 82.533520]
[ 82.533520] testing speed of async cbc(twofish) encryption
[ 82.547843] test 0 (128 bit key, 16 byte blocks): 5182177 operations in 1 seconds (82914832 bytes)
[ 83.557344] test 1 (128 bit key, 64 byte blocks): 1913550 operations in 1 seconds (122467200 bytes)
[ 84.565418] test 2 (128 bit key, 256 byte blocks): 540406 operations in 1 seconds (138343936 bytes)
[ 85.573320] test 3 (128 bit key, 1024 byte blocks): 141160 operations in 1 seconds (144547840 bytes)
[ 86.581346] test 4 (128 bit key, 8192 byte blocks): 17791 operations in 1 seconds (145743872 bytes)
[ 87.589283] test 5 (192 bit key, 16 byte blocks): 5167742 operations in 1 seconds (82683872 bytes)
[ 88.597316] test 6 (192 bit key, 64 byte blocks): 1913755 operations in 1 seconds (122480320 bytes)
[ 89.605689] test 7 (192 bit key, 256 byte blocks): 541933 operations in 1 seconds (138734848 bytes)
[ 90.613599] test 8 (192 bit key, 1024 byte blocks): 141155 operations in 1 seconds (144542720 bytes)
[ 91.621597] test 9 (192 bit key, 8192 byte blocks): 17652 operations in 1 seconds (144605184 bytes)
[ 92.629509] test 10 (256 bit key, 16 byte blocks): 5166590 operations in 1 seconds (82665440 bytes)
[ 93.637594] test 11 (256 bit key, 64 byte blocks): 1906451 operations in 1 seconds (122012864 bytes)
[ 94.645680] test 12 (256 bit key, 256 byte blocks): 541165 operations in 1 seconds (138538240 bytes)
[ 95.653590] test 13 (256 bit key, 1024 byte blocks): 141115 operations in 1 seconds (144501760 bytes)
[ 96.661588] test 14 (256 bit key, 8192 byte blocks): 17790 operations in 1 seconds (145735680 bytes)
[ 97.669536]
[ 97.669536] testing speed of async cbc(twofish) decryption
[ 97.678949] test 0 (128 bit key, 16 byte blocks): 4869673 operations in 1 seconds (77914768 bytes)
[ 98.685593] test 1 (128 bit key, 64 byte blocks): 1903734 operations in 1 seconds (121838976 bytes)
[ 99.693669] test 2 (128 bit key, 256 byte blocks): 578537 operations in 1 seconds (148105472 bytes)
[ 100.701591] test 3 (128 bit key, 1024 byte blocks): 161224 operations in 1 seconds (165093376 bytes)
[ 101.709606] test 4 (128 bit key, 8192 byte blocks): 20570 operations in 1 seconds (168509440 bytes)
[ 102.717526] test 5 (192 bit key, 16 byte blocks): 4888753 operations in 1 seconds (78220048 bytes)
[ 103.725594] test 6 (192 bit key, 64 byte blocks): 1897049 operations in 1 seconds (121411136 bytes)
[ 104.733660] test 7 (192 bit key, 256 byte blocks): 576290 operations in 1 seconds (147530240 bytes)
[ 105.741572] test 8 (192 bit key, 1024 byte blocks): 160307 operations in 1 seconds (164154368 bytes)
[ 106.749588] test 9 (192 bit key, 8192 byte blocks): 20231 operations in 1 seconds (165732352 bytes)
[ 107.757500] test 10 (256 bit key, 16 byte blocks): 4900905 operations in 1 seconds (78414480 bytes)
[ 108.765608] test 11 (256 bit key, 64 byte blocks): 1913352 operations in 1 seconds (122454528 bytes)
[ 109.769683] test 12 (256 bit key, 256 byte blocks): 579791 operations in 1 seconds (148426496 bytes)
[ 110.773581] test 13 (256 bit key, 1024 byte blocks): 161259 operations in 1 seconds (165129216 bytes)
[ 111.781590] test 14 (256 bit key, 8192 byte blocks): 20569 operations in 1 seconds (168501248 bytes)
[ 112.789528]
[ 112.789528] testing speed of async ctr(twofish) encryption
[ 112.803833] test 0 (128 bit key, 16 byte blocks): 4524631 operations in 1 seconds (72394096 bytes)
[ 113.813345] test 1 (128 bit key, 64 byte blocks): 1929960 operations in 1 seconds (123517440 bytes)
[ 114.821706] test 2 (128 bit key, 256 byte blocks): 573250 operations in 1 seconds (146752000 bytes)
[ 115.829617] test 3 (128 bit key, 1024 byte blocks): 156671 operations in 1 seconds (160431104 bytes)
[ 116.837641] test 4 (128 bit key, 8192 byte blocks): 20175 operations in 1 seconds (165273600 bytes)
[ 117.845587] test 5 (192 bit key, 16 byte blocks): 4464459 operations in 1 seconds (71431344 bytes)
[ 118.853620] test 6 (192 bit key, 64 byte blocks): 1913816 operations in 1 seconds (122484224 bytes)
[ 119.861697] test 7 (192 bit key, 256 byte blocks): 560342 operations in 1 seconds (143447552 bytes)
[ 120.869607] test 8 (192 bit key, 1024 byte blocks): 156535 operations in 1 seconds (160291840 bytes)
[ 121.877623] test 9 (192 bit key, 8192 byte blocks): 20128 operations in 1 seconds (164888576 bytes)
[ 122.885535] test 10 (256 bit key, 16 byte blocks): 4310418 operations in 1 seconds (68966688 bytes)
[ 123.893619] test 11 (256 bit key, 64 byte blocks): 1928764 operations in 1 seconds (123440896 bytes)
[ 124.901679] test 12 (256 bit key, 256 byte blocks): 573752 operations in 1 seconds (146880512 bytes)
[ 125.909600] test 13 (256 bit key, 1024 byte blocks): 157643 operations in 1 seconds (161426432 bytes)
[ 126.917597] test 14 (256 bit key, 8192 byte blocks): 20256 operations in 1 seconds (165937152 bytes)
[ 127.925536]
[ 127.925536] testing speed of async ctr(twofish) decryption
[ 127.934939] test 0 (128 bit key, 16 byte blocks): 4539834 operations in 1 seconds (72637344 bytes)
[ 128.941593] test 1 (128 bit key, 64 byte blocks): 1948606 operations in 1 seconds (124710784 bytes)
[ 129.949670] test 2 (128 bit key, 256 byte blocks): 579095 operations in 1 seconds (148248320 bytes)
[ 130.957604] test 3 (128 bit key, 1024 byte blocks): 157576 operations in 1 seconds (161357824 bytes)
[ 131.965614] test 4 (128 bit key, 8192 byte blocks): 20272 operations in 1 seconds (166068224 bytes)
[ 132.969540] test 5 (192 bit key, 16 byte blocks): 4543224 operations in 1 seconds (72691584 bytes)
[ 133.973612] test 6 (192 bit key, 64 byte blocks): 1937373 operations in 1 seconds (123991872 bytes)
[ 134.981681] test 7 (192 bit key, 256 byte blocks): 566959 operations in 1 seconds (145141504 bytes)
[ 135.989592] test 8 (192 bit key, 1024 byte blocks): 157951 operations in 1 seconds (161741824 bytes)
[ 136.997607] test 9 (192 bit key, 8192 byte blocks): 20148 operations in 1 seconds (165052416 bytes)
[ 138.005528] test 10 (256 bit key, 16 byte blocks): 4395855 operations in 1 seconds (70333680 bytes)
[ 139.013612] test 11 (256 bit key, 64 byte blocks): 1957802 operations in 1 seconds (125299328 bytes)
[ 140.021687] test 12 (256 bit key, 256 byte blocks): 572735 operations in 1 seconds (146620160 bytes)
[ 141.029592] test 13 (256 bit key, 1024 byte blocks): 158475 operations in 1 seconds (162278400 bytes)
[ 142.037589] test 14 (256 bit key, 8192 byte blocks): 20350 operations in 1 seconds (166707200 bytes)
[ 143.045538]
[ 143.045538] testing speed of async lrw(twofish) encryption
[ 143.060417] test 0 (256 bit key, 16 byte blocks): 3264161 operations in 1 seconds (52226576 bytes)
[ 144.069309] test 1 (256 bit key, 64 byte blocks): 1554828 operations in 1 seconds (99508992 bytes)
[ 145.077289] test 2 (256 bit key, 256 byte blocks): 489501 operations in 1 seconds (125312256 bytes)
[ 146.085306] test 3 (256 bit key, 1024 byte blocks): 136369 operations in 1 seconds (139641856 bytes)
[ 147.093313] test 4 (256 bit key, 8192 byte blocks): 17659 operations in 1 seconds (144662528 bytes)
[ 148.101258] test 5 (320 bit key, 16 byte blocks): 3212599 operations in 1 seconds (51401584 bytes)
[ 149.109301] test 6 (320 bit key, 64 byte blocks): 1592816 operations in 1 seconds (101940224 bytes)
[ 150.117375] test 7 (320 bit key, 256 byte blocks): 484266 operations in 1 seconds (123972096 bytes)
[ 151.125583] test 8 (320 bit key, 1024 byte blocks): 136324 operations in 1 seconds (139595776 bytes)
[ 152.133598] test 9 (320 bit key, 8192 byte blocks): 17409 operations in 1 seconds (142614528 bytes)
[ 153.141528] test 10 (384 bit key, 16 byte blocks): 3341384 operations in 1 seconds (53462144 bytes)
[ 154.149595] test 11 (384 bit key, 64 byte blocks): 1568609 operations in 1 seconds (100390976 bytes)
[ 155.157663] test 12 (384 bit key, 256 byte blocks): 489544 operations in 1 seconds (125323264 bytes)
[ 156.165591] test 13 (384 bit key, 1024 byte blocks): 136252 operations in 1 seconds (139522048 bytes)
[ 157.169586] test 14 (384 bit key, 8192 byte blocks): 17666 operations in 1 seconds (144719872 bytes)
[ 158.173527]
[ 158.173527] testing speed of async lrw(twofish) decryption
[ 158.182931] test 0 (256 bit key, 16 byte blocks): 3299986 operations in 1 seconds (52799776 bytes)
[ 159.189595] test 1 (256 bit key, 64 byte blocks): 1483669 operations in 1 seconds (94954816 bytes)
[ 160.197584] test 2 (256 bit key, 256 byte blocks): 473621 operations in 1 seconds (121246976 bytes)
[ 161.205593] test 3 (256 bit key, 1024 byte blocks): 134830 operations in 1 seconds (138065920 bytes)
[ 162.213607] test 4 (256 bit key, 8192 byte blocks): 17453 operations in 1 seconds (142974976 bytes)
[ 163.221562] test 5 (320 bit key, 16 byte blocks): 3451006 operations in 1 seconds (55216096 bytes)
[ 164.229605] test 6 (320 bit key, 64 byte blocks): 1438524 operations in 1 seconds (92065536 bytes)
[ 165.237585] test 7 (320 bit key, 256 byte blocks): 476321 operations in 1 seconds (121938176 bytes)
[ 166.245591] test 8 (320 bit key, 1024 byte blocks): 134740 operations in 1 seconds (137973760 bytes)
[ 167.253287] test 9 (320 bit key, 8192 byte blocks): 17135 operations in 1 seconds (140369920 bytes)
[ 168.261215] test 10 (384 bit key, 16 byte blocks): 3327948 operations in 1 seconds (53247168 bytes)
[ 169.269284] test 11 (384 bit key, 64 byte blocks): 1477492 operations in 1 seconds (94559488 bytes)
[ 170.277265] test 12 (384 bit key, 256 byte blocks): 476087 operations in 1 seconds (121878272 bytes)
[ 171.285263] test 13 (384 bit key, 1024 byte blocks): 134794 operations in 1 seconds (138029056 bytes)
[ 172.293260] test 14 (384 bit key, 8192 byte blocks): 17417 operations in 1 seconds (142680064 bytes)
[ 173.301199]
[ 173.301199] testing speed of async xts(twofish) encryption
[ 173.314784] test 0 (256 bit key, 16 byte blocks): 3098318 operations in 1 seconds (49573088 bytes)
[ 174.321306] test 1 (256 bit key, 64 byte blocks): 1566215 operations in 1 seconds (100237760 bytes)
[ 175.329692] test 2 (256 bit key, 256 byte blocks): 506626 operations in 1 seconds (129696256 bytes)
[ 176.337596] test 3 (256 bit key, 1024 byte blocks): 147735 operations in 1 seconds (151280640 bytes)
[ 177.345602] test 4 (256 bit key, 8192 byte blocks): 19329 operations in 1 seconds (158343168 bytes)
[ 178.353549] test 5 (384 bit key, 16 byte blocks): 3100328 operations in 1 seconds (49605248 bytes)
[ 179.361609] test 6 (384 bit key, 64 byte blocks): 1565733 operations in 1 seconds (100206912 bytes)
[ 180.369684] test 7 (384 bit key, 256 byte blocks): 505319 operations in 1 seconds (129361664 bytes)
[ 181.373602] test 8 (384 bit key, 1024 byte blocks): 147921 operations in 1 seconds (151471104 bytes)
[ 182.377597] test 9 (384 bit key, 8192 byte blocks): 19357 operations in 1 seconds (158572544 bytes)
[ 183.385517] test 10 (512 bit key, 16 byte blocks): 3174613 operations in 1 seconds (50793808 bytes)
[ 184.393594] test 11 (512 bit key, 64 byte blocks): 1574183 operations in 1 seconds (100747712 bytes)
[ 185.401652] test 12 (512 bit key, 256 byte blocks): 508311 operations in 1 seconds (130127616 bytes)
[ 186.409563] test 13 (512 bit key, 1024 byte blocks): 148226 operations in 1 seconds (151783424 bytes)
[ 187.417570] test 14 (512 bit key, 8192 byte blocks): 19354 operations in 1 seconds (158547968 bytes)
[ 188.425520]
[ 188.425520] testing speed of async xts(twofish) decryption
[ 188.434933] test 0 (256 bit key, 16 byte blocks): 2984374 operations in 1 seconds (47749984 bytes)
[ 189.441610] test 1 (256 bit key, 64 byte blocks): 1391229 operations in 1 seconds (89038656 bytes)
[ 190.449590] test 2 (256 bit key, 256 byte blocks): 491896 operations in 1 seconds (125925376 bytes)
[ 191.457597] test 3 (256 bit key, 1024 byte blocks): 146033 operations in 1 seconds (149537792 bytes)
[ 192.465606] test 4 (256 bit key, 8192 byte blocks): 19087 operations in 1 seconds (156360704 bytes)
[ 193.473507] test 5 (384 bit key, 16 byte blocks): 2992604 operations in 1 seconds (47881664 bytes)
[ 194.481601] test 6 (384 bit key, 64 byte blocks): 1390541 operations in 1 seconds (88994624 bytes)
[ 195.489573] test 7 (384 bit key, 256 byte blocks): 492459 operations in 1 seconds (126069504 bytes)
[ 196.497591] test 8 (384 bit key, 1024 byte blocks): 146036 operations in 1 seconds (149540864 bytes)
[ 197.505598] test 9 (384 bit key, 8192 byte blocks): 19026 operations in 1 seconds (155860992 bytes)
[ 198.513517] test 10 (512 bit key, 16 byte blocks): 2961196 operations in 1 seconds (47379136 bytes)
[ 199.521593] test 11 (512 bit key, 64 byte blocks): 1398191 operations in 1 seconds (89484224 bytes)
[ 200.529575] test 12 (512 bit key, 256 byte blocks): 496017 operations in 1 seconds (126980352 bytes)
[ 201.537574] test 13 (512 bit key, 1024 byte blocks): 146297 operations in 1 seconds (149808128 bytes)
[ 202.545571] test 14 (512 bit key, 8192 byte blocks): 19039 operations in 1 seconds (155967488 bytes)

--
Regards/Gruss,
Boris.

2012-08-22 04:35:16

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

>
> Here you go:
>
> [ 52.282208]
> [ 52.282208] testing speed of async ecb(twofish) encryption

Thanks!

It looks like encryption lost ~0.4% while decryption gained ~1.8%.

For the 256 byte test, it's still slightly slower than twofish-3way (~3%). For
the 1k and 8k tests, it's ~5% faster.

Here's the very last test patch, which tries a different ordering of the
fpu<->cpu register instructions in a few places.

---
arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 232 ++++++++++++++++++---------
1 file changed, 154 insertions(+), 78 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..693963a 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -4,6 +4,8 @@
* Copyright (C) 2012 Johannes Goetzfried
* <[email protected]>
*
+ * Copyright © 2012 Jussi Kivilinna <[email protected]>
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -47,16 +49,21 @@
#define RC2 %xmm6
#define RD2 %xmm7

-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
+
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13

-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RT %xmm14

-#define RID1 %rax
-#define RID1b %al
-#define RID2 %rbx
-#define RID2b %bl
+#define RID1 %rbp
+#define RID1d %ebp
+#define RID2 %rsi
+#define RID2d %esi

#define RGI1 %rdx
#define RGI1bl %dl
@@ -65,6 +72,13 @@
#define RGI2bl %cl
#define RGI2bh %ch

+#define RGI3 %rax
+#define RGI3bl %al
+#define RGI3bh %ah
+#define RGI4 %rbx
+#define RGI4bl %bl
+#define RGI4bh %bh
+
#define RGS1 %r8
#define RGS1d %r8d
#define RGS2 %r9
@@ -73,40 +87,58 @@
#define RGS3d %r10d


-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ shrq $16, src; \
movl t0(CTX, RID1, 4), dst ## d; \
xorl t1(CTX, RID2, 4), dst ## d; \
- shrq $16, src; \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
+ interleave_op(il_reg); \
xorl t2(CTX, RID1, 4), dst ## d; \
xorl t3(CTX, RID2, 4), dst ## d;

-#define G(a, x, t0, t1, t2, t3) \
- vmovq a, RGI1; \
- vpsrldq $8, a, x; \
- vmovq x, RGI2; \
- \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
- shrq $16, RGI1; \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
- shlq $32, RGS2; \
- orq RGS1, RGS2; \
- \
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
- shrq $16, RGI2; \
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
- shlq $32, RGS3; \
- orq RGS1, RGS3; \
- \
- vmovq RGS2, x; \
- vpinsrq $1, RGS3, x, x;
+#define dummy(d) /* do nothing */

-#define encround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define shr_next(reg) \
+ shrq $16, reg;
+
+#define G_enc(gi1, gi2, x, t0, t1, t2, t3) \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \
+ shlq $32, RGS2; \
+ orq RGS1, RGS2; \
+ \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \
+ shlq $32, RGS1; \
+ orq RGS1, RGS3;
+
+#define encround_head_2(a, b, c, d, x1, y1, x2, y2) \
+ vmovq b ## 1, RGI3; \
+ vpextrq $1, b ## 1, RGI4; \
+ G_enc(RGI1, RGI2, x1, s0, s1, s2, s3); \
+ vmovq a ## 2, RGI1; \
+ vpextrq $1, a ## 2, RGI2; \
+ vmovq RGS2, x1; \
+ vpinsrq $1, RGS3, x1, x1; \
+ G_enc(RGI3, RGI4, y1, s1, s2, s3, s0); \
+ vmovq b ## 2, RGI3; \
+ vpextrq $1, b ## 2, RGI4; \
+ vmovq RGS2, y1; \
+ vpinsrq $1, RGS3, y1, y1; \
+ G_enc(RGI1, RGI2, x2, s0, s1, s2, s3); \
+ vmovq RGS2, x2; \
+ vpinsrq $1, RGS3, x2, x2; \
+ G_enc(RGI3, RGI4, y2, s1, s2, s3, s0); \
+ vmovq RGS2, y2; \
+ vpinsrq $1, RGS3, y2, y2;
+
+#define encround_tail(a, b, c, d, x, y) \
+ vpslld $1, d, RT; \
+ vpsrld $(32 - 1), d, d; \
+ vpor d, RT, d; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd x, RK1, x; \
@@ -115,14 +147,40 @@
vpsrld $1, c, x; \
vpslld $(32 - 1), c, c; \
vpor c, x, c; \
- vpslld $1, d, x; \
- vpsrld $(32 - 1), d, d; \
- vpor d, x, d; \
vpxor d, y, d;

-#define decround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define G_dec(gi1, gi2, x, t0, t1, t2, t3) \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \
+ shlq $32, RGS2; \
+ orq RGS1, RGS2; \
+ vmovq RGS2, x; \
+ \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \
+ shlq $32, RGS1; \
+ orq RGS1, RGS3;
+
+#define decround_head_2(a, b, c, d, x1, y1, x2, y2) \
+ vmovq b ## 1, RGI3; \
+ vpextrq $1, b ## 1, RGI4; \
+ G_dec(RGI1, RGI2, x1, s0, s1, s2, s3); \
+ vmovq a ## 2, RGI1; \
+ vpextrq $1, a ## 2, RGI2; \
+ vpinsrq $1, RGS3, x1, x1; \
+ G_dec(RGI3, RGI4, y1, s1, s2, s3, s0); \
+ vmovq b ## 2, RGI3; \
+ vpextrq $1, b ## 2, RGI4; \
+ vpinsrq $1, RGS3, y1, y1; \
+ G_dec(RGI1, RGI2, x2, s0, s1, s2, s3); \
+ vpinsrq $1, RGS3, x2, x2; \
+ G_dec(RGI3, RGI4, y2, s1, s2, s3, s0); \
+ vpinsrq $1, RGS3, y2, y2;
+
+#define decround_tail(a, b, c, d, x, y) \
+ vpslld $1, c, RT; \
+ vpsrld $(32 - 1), c, c; \
+ vpor c, RT, c; \
vpaddd x, y, x; \
vpaddd y, x, y; \
vpaddd y, RK2, y; \
@@ -130,32 +188,44 @@
vpsrld $1, d, y; \
vpslld $(32 - 1), d, d; \
vpor d, y, d; \
- vpslld $1, c, y; \
- vpsrld $(32 - 1), c, c; \
- vpor c, y, c; \
vpaddd x, RK1, x; \
vpxor x, c, c;

-#define encrypt_round(n, a, b, c, d) \
- vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
- vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- encround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- encround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
-
-#define decrypt_round(n, a, b, c, d) \
- vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
- vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- decround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- decround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+#define preload_rgi(c) \
+ vmovq c, RGI1; \
+ vpextrq $1, c, RGI2;
+
+#define encrypt_round(n, a, b, c, d, preload) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ encround_head_2(a, b, c, d, RX0, RY0, RX1, RY1); \
+ encround_tail(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0); \
+ preload(c ## 1); \
+ encround_tail(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1);
+
+#define decrypt_round(n, a, b, c, d, preload) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ decround_head_2(a, b, c, d, RX0, RY0, RX1, RY1); \
+ decround_tail(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0); \
+ preload(c ## 1); \
+ decround_tail(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1);

#define encrypt_cycle(n) \
- encrypt_round((2*n), RA, RB, RC, RD); \
- encrypt_round(((2*n) + 1), RC, RD, RA, RB);
+ encrypt_round((2*n), RA, RB, RC, RD, preload_rgi); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi);
+
+#define encrypt_cycle_last(n) \
+ encrypt_round((2*n), RA, RB, RC, RD, preload_rgi); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB, dummy);

#define decrypt_cycle(n) \
- decrypt_round(((2*n) + 1), RC, RD, RA, RB); \
- decrypt_round((2*n), RA, RB, RC, RD);
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi); \
+ decrypt_round((2*n), RA, RB, RC, RD, preload_rgi);

+#define decrypt_cycle_last(n) \
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi); \
+ decrypt_round((2*n), RA, RB, RC, RD, dummy);

#define transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
vpunpckldq x1, x0, t0; \
@@ -216,17 +286,19 @@ __twofish_enc_blk_8way:
* %rcx: bool, if true: xor output
*/

+ pushq %rbp;
pushq %rbx;
pushq %rcx;

vmovdqu w(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ vmovq RA1, RGI1;
+ vpextrq $1, RA1, RGI2;
+ inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

- xorq RID1, RID1;
- xorq RID2, RID2;
+ movq %rsi, %r11;

encrypt_cycle(0);
encrypt_cycle(1);
@@ -235,26 +307,27 @@ __twofish_enc_blk_8way:
encrypt_cycle(4);
encrypt_cycle(5);
encrypt_cycle(6);
- encrypt_cycle(7);
+ encrypt_cycle_last(7);

vmovdqu (w+4*4)(CTX), RK1;

popq %rcx;
popq %rbx;
+ popq %rbp;

- leaq (4*4*4)(%rsi), %rax;
+ leaq (4*4*4)(%r11), %rax;

testb %cl, %cl;
jnz __enc_xor8;

- outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_blocks(%r11, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

__enc_xor8:
- outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_xor_blocks(%r11, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

@@ -269,16 +342,18 @@ twofish_dec_blk_8way:
* %rdx: src
*/

+ pushq %rbp;
pushq %rbx;

vmovdqu (w+4*4)(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ vmovq RC1, RGI1;
+ vpextrq $1, RC1, RGI2;
+ inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

- xorq RID1, RID1;
- xorq RID2, RID2;
+ movq %rsi, %r11;

decrypt_cycle(7);
decrypt_cycle(6);
@@ -287,14 +362,15 @@ twofish_dec_blk_8way:
decrypt_cycle(3);
decrypt_cycle(2);
decrypt_cycle(1);
- decrypt_cycle(0);
+ decrypt_cycle_last(0);

vmovdqu (w)(CTX), RK1;

popq %rbx;
+ popq %rbp;

- leaq (4*4*4)(%rsi), %rax;
- outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ leaq (4*4*4)(%r11), %rax;
+ outunpack_blocks(%r11, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

ret;
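
For readers following the data flow in the patch above, here is a rough Python model of what the reworked lookup_32bit and G_enc/G_dec macros appear to compute for one 128-bit register. It is only a sketch: the tables are pseudo-random placeholders rather than real key-dependent Twofish S-boxes, and the function names (make_tables, g_half, g_vector, g_reference) are made up for illustration. It also shows why the old "xorq RID1, RID1" pre-zeroing could be dropped: movzbl zero-extends the index byte on every load.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def make_tables(seed=1):
    """Build four placeholder 256-entry tables of 32-bit words.

    These are NOT Twofish S-boxes; they only stand in for s0..s3 in CTX.
    """
    tables = []
    state = seed
    for _ in range(4):
        table = []
        for _ in range(256):
            state = (state * 1103515245 + 12345) & MASK32
            table.append(state)
        tables.append(table)
    return tables


def lookup_32bit(tables, src, shift_next):
    """Model one lookup_32bit expansion; returns (dst, updated src)."""
    t0, t1, t2, t3 = tables
    rid1 = src & 0xFF             # movzbl src##bl, RID1d (zero-extended)
    rid2 = (src >> 8) & 0xFF      # movzbl src##bh, RID2d
    src >>= 16                    # shrq $16, src
    dst = t0[rid1] ^ t1[rid2]     # movl t0(...), dst; xorl t1(...), dst
    rid1 = src & 0xFF             # movzbl src##bl, RID1d
    rid2 = (src >> 8) & 0xFF      # movzbl src##bh, RID2d
    if shift_next:                # interleave_op: shr_next or dummy
        src >>= 16
    dst ^= t2[rid1] ^ t3[rid2]    # xorl t2(...), dst; xorl t3(...), dst
    return dst, src


def g_half(tables, gi):
    """Two lookups on one 64-bit GPR, packed as (second << 32) | first."""
    lo, gi = lookup_32bit(tables, gi, True)
    hi, gi = lookup_32bit(tables, gi, False)
    return (hi << 32) | lo        # shlq $32 ...; orq ...


def g_vector(tables, vec128):
    """Model vmovq/vpextrq, four lookups, then vmovq/vpinsrq."""
    gi1 = vec128 & MASK64         # vmovq a, RGI1
    gi2 = (vec128 >> 64) & MASK64 # vpextrq $1, a, RGI2
    return (g_half(tables, gi2) << 64) | g_half(tables, gi1)


def g_reference(tables, word):
    """Straightforward per-word lookup used to check the model."""
    t0, t1, t2, t3 = tables
    return (t0[word & 0xFF] ^ t1[(word >> 8) & 0xFF]
            ^ t2[(word >> 16) & 0xFF] ^ t3[(word >> 24) & 0xFF])
```

In this model each of the four 32-bit lanes is transformed independently, so moving the vmovq/vpextrq extraction of the next input ahead of the current lookups changes only instruction scheduling, not the result.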

2012-08-22 13:31:40

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote:
> It looks like encryption lost ~0.4% while decryption gained ~1.8%.
>
> For the 256 byte test, it's still slightly slower than twofish-3way (~3%). For
> the 1k and 8k tests, it's ~5% faster.
>
> Here's the very last test patch, which tries a different ordering of the
> fpu<->cpu register instructions in a few places.

Hehe,

I don't mind testing patches, no worries there. Here are the results
this time; they don't look better than the last run, AFAICT.
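
One way to put numbers on a comparison like this is to parse the tcrypt lines and compute the relative change in operations per second between two runs. The sketch below is only an illustration: the helper names (parse_tcrypt, percent_change) are made up, and the sample lines are copied from the ecb(twofish) encryption sections of the 2012-08-20 and 2012-08-22 logs in this thread. Each figure comes from a single one-second run, so small differences may well be noise.

```python
import re

# Matches tcrypt speed-test lines such as:
# "test 4 (128 bit key, 8192 byte blocks): 21673 operations in 1 seconds (...)"
LINE_RE = re.compile(
    r"test\s+\d+\s+\((\d+) bit key, (\d+) byte blocks\):\s+"
    r"(\d+) operations in (\d+) seconds"
)


def parse_tcrypt(log):
    """Return {(key_bits, block_bytes): operations per second}."""
    results = {}
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            key_bits, block, ops, secs = map(int, match.groups())
            results[(key_bits, block)] = ops / secs
    return results


def percent_change(old, new):
    """Relative change in percent for every test present in both runs."""
    return {
        key: 100.0 * (new[key] - old[key]) / old[key]
        for key in old
        if key in new
    }


# ecb(twofish) encryption, 128 bit key, from the earlier run.
RUN_OLD = """\
[ 52.291580] test 0 (128 bit key, 16 byte blocks): 4890079 operations in 1 seconds (78241264 bytes)
[ 56.325565] test 4 (128 bit key, 8192 byte blocks): 21673 operations in 1 seconds (177545216 bytes)
"""

# The same tests from the later run.
RUN_NEW = """\
[ 133.961946] test 0 (128 bit key, 16 byte blocks): 4768513 operations in 1 seconds (76296208 bytes)
[ 137.988191] test 4 (128 bit key, 8192 byte blocks): 21847 operations in 1 seconds (178970624 bytes)
"""

if __name__ == "__main__":
    changes = percent_change(parse_tcrypt(RUN_OLD), parse_tcrypt(RUN_NEW))
    for (key_bits, block), delta in sorted(changes.items()):
        print(f"{key_bits} bit key, {block} byte blocks: {delta:+.2f}%")
```

Feeding whole log sections to parse_tcrypt works the same way, as long as each mode (ecb, cbc, ...) and direction is parsed separately, because the key is only (key size, block size).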

[ 133.952723]
[ 133.952723] testing speed of async ecb(twofish) encryption
[ 133.961946] test 0 (128 bit key, 16 byte blocks): 4768513 operations in 1 seconds (76296208 bytes)
[ 134.968388] test 1 (128 bit key, 64 byte blocks): 2033479 operations in 1 seconds (130142656 bytes)
[ 135.975070] test 2 (128 bit key, 256 byte blocks): 604754 operations in 1 seconds (154817024 bytes)
[ 136.981570] test 3 (128 bit key, 1024 byte blocks): 169578 operations in 1 seconds (173647872 bytes)
[ 137.988191] test 4 (128 bit key, 8192 byte blocks): 21847 operations in 1 seconds (178970624 bytes)
[ 138.994735] test 5 (192 bit key, 16 byte blocks): 4777481 operations in 1 seconds (76439696 bytes)
[ 140.001382] test 6 (192 bit key, 64 byte blocks): 2035352 operations in 1 seconds (130262528 bytes)
[ 141.008038] test 7 (192 bit key, 256 byte blocks): 603240 operations in 1 seconds (154429440 bytes)
[ 142.014591] test 8 (192 bit key, 1024 byte blocks): 169266 operations in 1 seconds (173328384 bytes)
[ 143.021169] test 9 (192 bit key, 8192 byte blocks): 21610 operations in 1 seconds (177029120 bytes)
[ 144.027703] test 10 (256 bit key, 16 byte blocks): 4798051 operations in 1 seconds (76768816 bytes)
[ 145.034341] test 11 (256 bit key, 64 byte blocks): 2036766 operations in 1 seconds (130353024 bytes)
[ 146.041015] test 12 (256 bit key, 256 byte blocks): 604216 operations in 1 seconds (154679296 bytes)
[ 147.047523] test 13 (256 bit key, 1024 byte blocks): 169594 operations in 1 seconds (173664256 bytes)
[ 148.054120] test 14 (256 bit key, 8192 byte blocks): 21889 operations in 1 seconds (179314688 bytes)
[ 149.060657]
[ 149.060657] testing speed of async ecb(twofish) decryption
[ 149.069830] test 0 (128 bit key, 16 byte blocks): 4890581 operations in 1 seconds (78249296 bytes)
[ 150.075322] test 1 (128 bit key, 64 byte blocks): 2006891 operations in 1 seconds (128441024 bytes)
[ 151.081994] test 2 (128 bit key, 256 byte blocks): 586650 operations in 1 seconds (150182400 bytes)
[ 152.088522] test 3 (128 bit key, 1024 byte blocks): 164734 operations in 1 seconds (168687616 bytes)
[ 153.091153] test 4 (128 bit key, 8192 byte blocks): 21111 operations in 1 seconds (172941312 bytes)
[ 154.097687] test 5 (192 bit key, 16 byte blocks): 4911365 operations in 1 seconds (78581840 bytes)
[ 155.104371] test 6 (192 bit key, 64 byte blocks): 2025363 operations in 1 seconds (129623232 bytes)
[ 156.111154] test 7 (192 bit key, 256 byte blocks): 591229 operations in 1 seconds (151354624 bytes)
[ 157.117723] test 8 (192 bit key, 1024 byte blocks): 164381 operations in 1 seconds (168326144 bytes)
[ 158.124336] test 9 (192 bit key, 8192 byte blocks): 20714 operations in 1 seconds (169689088 bytes)
[ 159.130724] test 10 (256 bit key, 16 byte blocks): 4931938 operations in 1 seconds (78911008 bytes)
[ 160.137379] test 11 (256 bit key, 64 byte blocks): 2029741 operations in 1 seconds (129903424 bytes)
[ 161.144078] test 12 (256 bit key, 256 byte blocks): 589340 operations in 1 seconds (150871040 bytes)
[ 162.150580] test 13 (256 bit key, 1024 byte blocks): 164484 operations in 1 seconds (168431616 bytes)
[ 163.157174] test 14 (256 bit key, 8192 byte blocks): 21116 operations in 1 seconds (172982272 bytes)
[ 164.163694]
[ 164.163694] testing speed of async cbc(twofish) encryption
[ 164.177772] test 0 (128 bit key, 16 byte blocks): 5197069 operations in 1 seconds (83153104 bytes)
[ 165.186414] test 1 (128 bit key, 64 byte blocks): 1912975 operations in 1 seconds (122430400 bytes)
[ 166.193078] test 2 (128 bit key, 256 byte blocks): 540464 operations in 1 seconds (138358784 bytes)
[ 167.199587] test 3 (128 bit key, 1024 byte blocks): 140709 operations in 1 seconds (144086016 bytes)
[ 168.206209] test 4 (128 bit key, 8192 byte blocks): 17747 operations in 1 seconds (145383424 bytes)
[ 169.212768] test 5 (192 bit key, 16 byte blocks): 5184004 operations in 1 seconds (82944064 bytes)
[ 170.219372] test 6 (192 bit key, 64 byte blocks): 1913377 operations in 1 seconds (122456128 bytes)
[ 171.226028] test 7 (192 bit key, 256 byte blocks): 541385 operations in 1 seconds (138594560 bytes)
[ 172.232538] test 8 (192 bit key, 1024 byte blocks): 140867 operations in 1 seconds (144247808 bytes)
[ 173.239280] test 9 (192 bit key, 8192 byte blocks): 17642 operations in 1 seconds (144523264 bytes)
[ 174.245667] test 10 (256 bit key, 16 byte blocks): 5193804 operations in 1 seconds (83100864 bytes)
[ 175.252331] test 11 (256 bit key, 64 byte blocks): 1907560 operations in 1 seconds (122083840 bytes)
[ 176.259013] test 12 (256 bit key, 256 byte blocks): 540773 operations in 1 seconds (138437888 bytes)
[ 177.265669] test 13 (256 bit key, 1024 byte blocks): 140699 operations in 1 seconds (144075776 bytes)
[ 178.272126] test 14 (256 bit key, 8192 byte blocks): 17744 operations in 1 seconds (145358848 bytes)
[ 179.278698]
[ 179.278698] testing speed of async cbc(twofish) decryption
[ 179.288016] test 0 (128 bit key, 16 byte blocks): 4877381 operations in 1 seconds (78038096 bytes)
[ 180.293323] test 1 (128 bit key, 64 byte blocks): 1947911 operations in 1 seconds (124666304 bytes)
[ 181.299994] test 2 (128 bit key, 256 byte blocks): 577589 operations in 1 seconds (147862784 bytes)
[ 182.306512] test 3 (128 bit key, 1024 byte blocks): 159665 operations in 1 seconds (163496960 bytes)
[ 183.313115] test 4 (128 bit key, 8192 byte blocks): 20403 operations in 1 seconds (167141376 bytes)
[ 184.319652] test 5 (192 bit key, 16 byte blocks): 4885336 operations in 1 seconds (78165376 bytes)
[ 185.326307] test 6 (192 bit key, 64 byte blocks): 1939707 operations in 1 seconds (124141248 bytes)
[ 186.332972] test 7 (192 bit key, 256 byte blocks): 574612 operations in 1 seconds (147100672 bytes)
[ 187.339496] test 8 (192 bit key, 1024 byte blocks): 158410 operations in 1 seconds (162211840 bytes)
[ 188.346102] test 9 (192 bit key, 8192 byte blocks): 19940 operations in 1 seconds (163348480 bytes)
[ 189.352646] test 10 (256 bit key, 16 byte blocks): 4897969 operations in 1 seconds (78367504 bytes)
[ 190.359301] test 11 (256 bit key, 64 byte blocks): 1945680 operations in 1 seconds (124523520 bytes)
[ 191.365965] test 12 (256 bit key, 256 byte blocks): 578743 operations in 1 seconds (148158208 bytes)
[ 192.372475] test 13 (256 bit key, 1024 byte blocks): 159732 operations in 1 seconds (163565568 bytes)
[ 193.379068] test 14 (256 bit key, 8192 byte blocks): 20421 operations in 1 seconds (167288832 bytes)
[ 194.385621]
[ 194.385621] testing speed of async ctr(twofish) encryption
[ 194.399652] test 0 (128 bit key, 16 byte blocks): 4576370 operations in 1 seconds (73221920 bytes)
[ 195.408279] test 1 (128 bit key, 64 byte blocks): 1945671 operations in 1 seconds (124522944 bytes)
[ 196.414951] test 2 (128 bit key, 256 byte blocks): 585959 operations in 1 seconds (150005504 bytes)
[ 197.421462] test 3 (128 bit key, 1024 byte blocks): 159292 operations in 1 seconds (163115008 bytes)
[ 198.428072] test 4 (128 bit key, 8192 byte blocks): 20497 operations in 1 seconds (167911424 bytes)
[ 199.434598] test 5 (192 bit key, 16 byte blocks): 4682261 operations in 1 seconds (74916176 bytes)
[ 200.441262] test 6 (192 bit key, 64 byte blocks): 1959838 operations in 1 seconds (125429632 bytes)
[ 201.447927] test 7 (192 bit key, 256 byte blocks): 571085 operations in 1 seconds (146197760 bytes)
[ 202.454445] test 8 (192 bit key, 1024 byte blocks): 158933 operations in 1 seconds (162747392 bytes)
[ 203.461056] test 9 (192 bit key, 8192 byte blocks): 20462 operations in 1 seconds (167624704 bytes)
[ 204.467565] test 10 (256 bit key, 16 byte blocks): 4373557 operations in 1 seconds (69976912 bytes)
[ 205.474257] test 11 (256 bit key, 64 byte blocks): 1949469 operations in 1 seconds (124766016 bytes)
[ 206.480921] test 12 (256 bit key, 256 byte blocks): 576799 operations in 1 seconds (147660544 bytes)
[ 207.487430] test 13 (256 bit key, 1024 byte blocks): 159786 operations in 1 seconds (163620864 bytes)
[ 208.494025] test 14 (256 bit key, 8192 byte blocks): 20514 operations in 1 seconds (168050688 bytes)
[ 209.500569]
[ 209.500569] testing speed of async ctr(twofish) decryption
[ 209.509891] test 0 (128 bit key, 16 byte blocks): 4573902 operations in 1 seconds (73182432 bytes)
[ 210.515256] test 1 (128 bit key, 64 byte blocks): 1950356 operations in 1 seconds (124822784 bytes)
[ 211.521921] test 2 (128 bit key, 256 byte blocks): 576961 operations in 1 seconds (147702016 bytes)
[ 212.528577] test 3 (128 bit key, 1024 byte blocks): 159763 operations in 1 seconds (163597312 bytes)
[ 213.535069] test 4 (128 bit key, 8192 byte blocks): 20487 operations in 1 seconds (167829504 bytes)
[ 214.541717] test 5 (192 bit key, 16 byte blocks): 4657220 operations in 1 seconds (74515520 bytes)
[ 215.548250] test 6 (192 bit key, 64 byte blocks): 1965789 operations in 1 seconds (125810496 bytes)
[ 216.554907] test 7 (192 bit key, 256 byte blocks): 573294 operations in 1 seconds (146763264 bytes)
[ 217.561432] test 8 (192 bit key, 1024 byte blocks): 159180 operations in 1 seconds (163000320 bytes)
[ 218.568037] test 9 (192 bit key, 8192 byte blocks): 20324 operations in 1 seconds (166494208 bytes)
[ 219.574719] test 10 (256 bit key, 16 byte blocks): 4453463 operations in 1 seconds (71255408 bytes)
[ 220.581245] test 11 (256 bit key, 64 byte blocks): 1965129 operations in 1 seconds (125768256 bytes)
[ 221.587910] test 12 (256 bit key, 256 byte blocks): 576236 operations in 1 seconds (147516416 bytes)
[ 222.594408] test 13 (256 bit key, 1024 byte blocks): 159425 operations in 1 seconds (163251200 bytes)
[ 223.601169] test 14 (256 bit key, 8192 byte blocks): 20489 operations in 1 seconds (167845888 bytes)
[ 224.607566]
[ 224.607566] testing speed of async lrw(twofish) encryption
[ 224.622145] test 0 (256 bit key, 16 byte blocks): 3501782 operations in 1 seconds (56028512 bytes)
[ 225.630224] test 1 (256 bit key, 64 byte blocks): 1613072 operations in 1 seconds (103236608 bytes)
[ 226.636896] test 2 (256 bit key, 256 byte blocks): 497185 operations in 1 seconds (127279360 bytes)
[ 227.643415] test 3 (256 bit key, 1024 byte blocks): 138762 operations in 1 seconds (142092288 bytes)
[ 228.650027] test 4 (256 bit key, 8192 byte blocks): 17841 operations in 1 seconds (146153472 bytes)
[ 229.656571] test 5 (320 bit key, 16 byte blocks): 3569802 operations in 1 seconds (57116832 bytes)
[ 230.663357] test 6 (320 bit key, 64 byte blocks): 1619243 operations in 1 seconds (103631552 bytes)
[ 231.669882] test 7 (320 bit key, 256 byte blocks): 497649 operations in 1 seconds (127398144 bytes)
[ 232.676382] test 8 (320 bit key, 1024 byte blocks): 138425 operations in 1 seconds (141747200 bytes)
[ 233.682986] test 9 (320 bit key, 8192 byte blocks): 17621 operations in 1 seconds (144351232 bytes)
[ 234.689512] test 10 (384 bit key, 16 byte blocks): 3572115 operations in 1 seconds (57153840 bytes)
[ 235.696175] test 11 (384 bit key, 64 byte blocks): 1632166 operations in 1 seconds (104458624 bytes)
[ 236.702850] test 12 (384 bit key, 256 byte blocks): 496593 operations in 1 seconds (127127808 bytes)
[ 237.709348] test 13 (384 bit key, 1024 byte blocks): 138736 operations in 1 seconds (142065664 bytes)
[ 238.715953] test 14 (384 bit key, 8192 byte blocks): 17864 operations in 1 seconds (146341888 bytes)
[ 239.722482]
[ 239.722482] testing speed of async lrw(twofish) decryption
[ 239.732092] test 0 (256 bit key, 16 byte blocks): 3369646 operations in 1 seconds (53914336 bytes)
[ 240.737175] test 1 (256 bit key, 64 byte blocks): 1595683 operations in 1 seconds (102123712 bytes)
[ 241.743969] test 2 (256 bit key, 256 byte blocks): 481201 operations in 1 seconds (123187456 bytes)
[ 242.750356] test 3 (256 bit key, 1024 byte blocks): 134713 operations in 1 seconds (137946112 bytes)
[ 243.756963] test 4 (256 bit key, 8192 byte blocks): 17342 operations in 1 seconds (142065664 bytes)
[ 244.763479] test 5 (320 bit key, 16 byte blocks): 3519317 operations in 1 seconds (56309072 bytes)
[ 245.770159] test 6 (320 bit key, 64 byte blocks): 1589175 operations in 1 seconds (101707200 bytes)
[ 246.776815] test 7 (320 bit key, 256 byte blocks): 480032 operations in 1 seconds (122888192 bytes)
[ 247.783341] test 8 (320 bit key, 1024 byte blocks): 134196 operations in 1 seconds (137416704 bytes)
[ 248.789955] test 9 (320 bit key, 8192 byte blocks): 16979 operations in 1 seconds (139091968 bytes)
[ 249.796480] test 10 (384 bit key, 16 byte blocks): 3569030 operations in 1 seconds (57104480 bytes)
[ 250.803154] test 11 (384 bit key, 64 byte blocks): 1598999 operations in 1 seconds (102335936 bytes)
[ 251.809809] test 12 (384 bit key, 256 byte blocks): 484369 operations in 1 seconds (123998464 bytes)
[ 252.816328] test 13 (384 bit key, 1024 byte blocks): 134804 operations in 1 seconds (138039296 bytes)
[ 253.822922] test 14 (384 bit key, 8192 byte blocks): 17314 operations in 1 seconds (141836288 bytes)
[ 254.829487]
[ 254.829487] testing speed of async xts(twofish) encryption
[ 254.843608] test 0 (256 bit key, 16 byte blocks): 3109395 operations in 1 seconds (49750320 bytes)
[ 255.852126] test 1 (256 bit key, 64 byte blocks): 1579951 operations in 1 seconds (101116864 bytes)
[ 256.858797] test 2 (256 bit key, 256 byte blocks): 504014 operations in 1 seconds (129027584 bytes)
[ 257.865306] test 3 (256 bit key, 1024 byte blocks): 147066 operations in 1 seconds (150595584 bytes)
[ 258.871918] test 4 (256 bit key, 8192 byte blocks): 19266 operations in 1 seconds (157827072 bytes)
[ 259.878445] test 5 (384 bit key, 16 byte blocks): 3099540 operations in 1 seconds (49592640 bytes)
[ 260.885109] test 6 (384 bit key, 64 byte blocks): 1579599 operations in 1 seconds (101094336 bytes)
[ 261.891774] test 7 (384 bit key, 256 byte blocks): 504289 operations in 1 seconds (129097984 bytes)
[ 262.898292] test 8 (384 bit key, 1024 byte blocks): 147102 operations in 1 seconds (150632448 bytes)
[ 263.904904] test 9 (384 bit key, 8192 byte blocks): 19264 operations in 1 seconds (157810688 bytes)
[ 264.911422] test 10 (512 bit key, 16 byte blocks): 3171752 operations in 1 seconds (50748032 bytes)
[ 265.918104] test 11 (512 bit key, 64 byte blocks): 1588640 operations in 1 seconds (101672960 bytes)
[ 266.924772] test 12 (512 bit key, 256 byte blocks): 505971 operations in 1 seconds (129528576 bytes)
[ 267.931267] test 13 (512 bit key, 1024 byte blocks): 147292 operations in 1 seconds (150827008 bytes)
[ 268.937863] test 14 (512 bit key, 8192 byte blocks): 19263 operations in 1 seconds (157802496 bytes)
[ 269.944426]
[ 269.944426] testing speed of async xts(twofish) decryption
[ 269.953737] test 0 (256 bit key, 16 byte blocks): 3097600 operations in 1 seconds (49561600 bytes)
[ 270.959104] test 1 (256 bit key, 64 byte blocks): 1552959 operations in 1 seconds (99389376 bytes)
[ 271.965690] test 2 (256 bit key, 256 byte blocks): 506885 operations in 1 seconds (129762560 bytes)
[ 272.972285] test 3 (256 bit key, 1024 byte blocks): 144134 operations in 1 seconds (147593216 bytes)
[ 273.978907] test 4 (256 bit key, 8192 byte blocks): 18638 operations in 1 seconds (152682496 bytes)
[ 274.985432] test 5 (384 bit key, 16 byte blocks): 3101878 operations in 1 seconds (49630048 bytes)
[ 275.992098] test 6 (384 bit key, 64 byte blocks): 1552884 operations in 1 seconds (99384576 bytes)
[ 276.998658] test 7 (384 bit key, 256 byte blocks): 507621 operations in 1 seconds (129950976 bytes)
[ 278.005271] test 8 (384 bit key, 1024 byte blocks): 144218 operations in 1 seconds (147679232 bytes)
[ 279.011884] test 9 (384 bit key, 8192 byte blocks): 18622 operations in 1 seconds (152551424 bytes)
[ 280.018419] test 10 (512 bit key, 16 byte blocks): 3185817 operations in 1 seconds (50973072 bytes)
[ 281.025090] test 11 (512 bit key, 64 byte blocks): 1562195 operations in 1 seconds (99980480 bytes)
[ 282.031661] test 12 (512 bit key, 256 byte blocks): 507517 operations in 1 seconds (129924352 bytes)
[ 283.038255] test 13 (512 bit key, 1024 byte blocks): 144199 operations in 1 seconds (147659776 bytes)
[ 284.044860] test 14 (512 bit key, 8192 byte blocks): 18609 operations in 1 seconds (152444928 bytes)

--
Regards/Gruss,
Boris.

2012-08-22 19:20:08

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

> On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote:
>> Looks that encryption lost ~0.4% while decryption gained ~1.8%.
>>
>> For 256 byte test, it's still slightly slower than twofish-3way
>> (~3%). For 1k
>> and 8k tests, it's ~5% faster.
>>
>> Here's very last test-patch, testing different ordering of fpu<->cpu reg
>> instructions at few places.
>
> Hehe,
>
> I don't mind testing patches, no worries there. Here are the results
> this time, doesn't look better than the last run, AFAICT.
>

Actually it does look better, at least for encryption. Decryption used a
different ordering for this test, which appears to be bad on Bulldozer, just as
it is on Sandy Bridge.

So, yet another patch then :)

This one interleaves at some new places (reordered lookup_32bit()s in the
G macro) and does one of the round rotations one round ahead. It also
introduces some more parallelism inside lookup_32bit.
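
For readers following the patch, the data flow of lookup_32bit and the G macro can be modelled in a few lines of Python. This is only a sketch of what the macros compute, not of how the instructions are scheduled. The Python function names and the `DEMO_TABLES` contents are illustrative assumptions; the real s0..s3 tables are key-dependent and live in the cipher context.

```python
MASK32 = 0xFFFFFFFF

def lookup_32bit(t0, t1, t2, t3, src):
    """XOR together four table entries, one per byte of the low 32 bits of src."""
    b0 = src & 0xFF
    b1 = (src >> 8) & 0xFF
    b2 = (src >> 16) & 0xFF
    b3 = (src >> 24) & 0xFF
    return (t0[b0] ^ t1[b1] ^ t2[b2] ^ t3[b3]) & MASK32

def g64(t0, t1, t2, t3, src):
    """Model of G for one 64-bit register that holds two 32-bit words.

    The low word is looked up first, the register is shifted down, the high
    word is looked up, and the two results are packed as (hi << 32) | lo.
    """
    lo = lookup_32bit(t0, t1, t2, t3, src)
    hi = lookup_32bit(t0, t1, t2, t3, src >> 32)
    return (hi << 32) | lo

# Hypothetical demo tables (NOT Twofish s-boxes): table k places its index in
# byte k, so the lookup reproduces its input and the byte routing is visible.
DEMO_TABLES = [[i << (8 * k) for i in range(256)] for k in range(4)]

print(hex(lookup_32bit(*DEMO_TABLES, 0x12345678)))
print(hex(g64(*DEMO_TABLES, 0x1122334455667788)))
```

With the identity-style demo tables each call returns its own input, which confirms that byte 0 goes to t0, byte 1 to t1, and so on.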

---
arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 227 +++++++++++++++++----------
1 file changed, 142 insertions(+), 85 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..1585abb 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -4,6 +4,8 @@
* Copyright (C) 2012 Johannes Goetzfried
* <[email protected]>
*
+ * Copyright © 2012 Jussi Kivilinna <[email protected]>
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
@@ -47,16 +49,22 @@
#define RC2 %xmm6
#define RD2 %xmm7

-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9
+
+#define RX1 %xmm10
+#define RY1 %xmm11

-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RK1 %xmm12
+#define RK2 %xmm13

-#define RID1 %rax
-#define RID1b %al
-#define RID2 %rbx
-#define RID2b %bl
+#define RT %xmm14
+#define RR %xmm15
+
+#define RID1 %rbp
+#define RID1d %ebp
+#define RID2 %rsi
+#define RID2d %esi

#define RGI1 %rdx
#define RGI1bl %dl
@@ -65,6 +73,13 @@
#define RGI2bl %cl
#define RGI2bh %ch

+#define RGI3 %rax
+#define RGI3bl %al
+#define RGI3bh %ah
+#define RGI4 %rbx
+#define RGI4bl %bl
+#define RGI4bh %bh
+
#define RGS1 %r8
#define RGS1d %r8d
#define RGS2 %r9
@@ -73,89 +88,123 @@
#define RGS3d %r10d


-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
- movl t0(CTX, RID1, 4), dst ## d; \
- xorl t1(CTX, RID2, 4), dst ## d; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+ movzbl src ## bl, RID1d; \
+ movzbl src ## bh, RID2d; \
shrq $16, src; \
- movb src ## bl, RID1b; \
- movb src ## bh, RID2b; \
+ movl t0(CTX, RID1, 4), dst ## d; \
+ movl t1(CTX, RID2, 4), RID2d; \
+ movzbl src ## bl, RID1d; \
+ xorl RID2d, dst ## d; \
+ movzbl src ## bh, RID2d; \
+ interleave_op(il_reg); \
xorl t2(CTX, RID1, 4), dst ## d; \
xorl t3(CTX, RID2, 4), dst ## d;

-#define G(a, x, t0, t1, t2, t3) \
- vmovq a, RGI1; \
- vpsrldq $8, a, x; \
- vmovq x, RGI2; \
+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+ shrq $16, reg;
+
+#define G(gi1, gi2, x, t0, t1, t2, t3) \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS1, shr_next, ##gi1); \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS3, shr_next, ##gi2); \
+ \
+ lookup_32bit(t0, t1, t2, t3, ##gi1, RGS2, dummy, none); \
+ shlq $32, RGS2; \
+ orq RGS1, RGS2; \
+ lookup_32bit(t0, t1, t2, t3, ##gi2, RGS1, dummy, none); \
+ shlq $32, RGS1; \
+ orq RGS1, RGS3;
+
+#define round_head_2(a, b, x1, y1, x2, y2) \
+ vmovq b ## 1, RGI3; \
+ vpextrq $1, b ## 1, RGI4; \
\
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
- shrq $16, RGI1; \
- lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
- shlq $32, RGS2; \
- orq RGS1, RGS2; \
+ G(RGI1, RGI2, x1, s0, s1, s2, s3); \
+ vmovq a ## 2, RGI1; \
+ vpextrq $1, a ## 2, RGI2; \
+ vmovq RGS2, x1; \
+ vpinsrq $1, RGS3, x1, x1; \
\
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
- shrq $16, RGI2; \
- lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
- shlq $32, RGS3; \
- orq RGS1, RGS3; \
+ G(RGI3, RGI4, y1, s1, s2, s3, s0); \
+ vmovq b ## 2, RGI3; \
+ vpextrq $1, b ## 2, RGI4; \
+ vmovq RGS2, y1; \
+ vpinsrq $1, RGS3, y1, y1; \
\
- vmovq RGS2, x; \
- vpinsrq $1, RGS3, x, x;
+ G(RGI1, RGI2, x2, s0, s1, s2, s3); \
+ vmovq RGS2, x2; \
+ vpinsrq $1, RGS3, x2, x2; \
+ \
+ G(RGI3, RGI4, y2, s1, s2, s3, s0); \
+ vmovq RGS2, y2; \
+ vpinsrq $1, RGS3, y2, y2;

-#define encround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+#define encround_tail(a, b, c, d, x, y, prerotate) \
vpaddd x, y, x; \
+ vpaddd x, RK1, RT;\
+ prerotate(b); \
+ vpxor RT, c, c; \
vpaddd y, x, y; \
- vpaddd x, RK1, x; \
vpaddd y, RK2, y; \
- vpxor x, c, c; \
- vpsrld $1, c, x; \
+ vpsrld $1, c, RT; \
vpslld $(32 - 1), c, c; \
- vpor c, x, c; \
- vpslld $1, d, x; \
- vpsrld $(32 - 1), d, d; \
- vpor d, x, d; \
- vpxor d, y, d;
-
-#define decround(a, b, c, d, x, y) \
- G(a, x, s0, s1, s2, s3); \
- G(b, y, s1, s2, s3, s0); \
+ vpor c, RT, c; \
+ vpxor d, y, d; \
+
+#define decround_tail(a, b, c, d, x, y, prerotate) \
vpaddd x, y, x; \
+ vpaddd x, RK1, RT;\
+ prerotate(a); \
+ vpxor RT, c, c; \
vpaddd y, x, y; \
vpaddd y, RK2, y; \
vpxor d, y, d; \
vpsrld $1, d, y; \
vpslld $(32 - 1), d, d; \
vpor d, y, d; \
- vpslld $1, c, y; \
- vpsrld $(32 - 1), c, c; \
- vpor c, y, c; \
- vpaddd x, RK1, x; \
- vpxor x, c, c;
-
-#define encrypt_round(n, a, b, c, d) \
- vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
- vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- encround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- encround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
-
-#define decrypt_round(n, a, b, c, d) \
- vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
- vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
- decround(a ## 1, b ## 1, c ## 1, d ## 1, RX, RY); \
- decround(a ## 2, b ## 2, c ## 2, d ## 2, RX, RY);
+
+#define rotate_1l(x) \
+ vpslld $1, x, RR; \
+ vpsrld $(32 - 1), x, x; \
+ vpor x, RR, x;
+
+#define preload_rgi(c) \
+ vmovq c, RGI1; \
+ vpextrq $1, c, RGI2;
+
+#define encrypt_round(n, a, b, c, d, preload, prerotate) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ round_head_2(a, b, RX0, RY0, RX1, RY1); \
+ encround_tail(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0, prerotate); \
+ preload(c ## 1); \
+ encround_tail(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1, prerotate);
+
+#define decrypt_round(n, a, b, c, d, preload, prerotate) \
+ vbroadcastss (k+4*(2*(n)))(CTX), RK1; \
+ vbroadcastss (k+4*(2*(n)+1))(CTX), RK2; \
+ round_head_2(a, b, RX0, RY0, RX1, RY1); \
+ decround_tail(a ## 1, b ## 1, c ## 1, d ## 1, RX0, RY0, prerotate); \
+ preload(c ## 1); \
+ decround_tail(a ## 2, b ## 2, c ## 2, d ## 2, RX1, RY1, prerotate);

#define encrypt_cycle(n) \
- encrypt_round((2*n), RA, RB, RC, RD); \
- encrypt_round(((2*n) + 1), RC, RD, RA, RB);
+ encrypt_round((2*n), RA, RB, RC, RD, preload_rgi, rotate_1l); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi, rotate_1l);
+
+#define encrypt_cycle_last(n) \
+ encrypt_round((2*n), RA, RB, RC, RD, preload_rgi, rotate_1l); \
+ encrypt_round(((2*n) + 1), RC, RD, RA, RB, dummy, dummy);

#define decrypt_cycle(n) \
- decrypt_round(((2*n) + 1), RC, RD, RA, RB); \
- decrypt_round((2*n), RA, RB, RC, RD);
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi, rotate_1l); \
+ decrypt_round((2*n), RA, RB, RC, RD, preload_rgi, rotate_1l);

+#define decrypt_cycle_last(n) \
+ decrypt_round(((2*n) + 1), RC, RD, RA, RB, preload_rgi, rotate_1l); \
+ decrypt_round((2*n), RA, RB, RC, RD, dummy, dummy);

#define transpose_4x4(x0, x1, x2, x3, t0, t1, t2) \
vpunpckldq x1, x0, t0; \
@@ -216,17 +265,20 @@ __twofish_enc_blk_8way:
* %rcx: bool, if true: xor output
*/

+ pushq %rbp;
pushq %rbx;
pushq %rcx;

vmovdqu w(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ preload_rgi(RA1);
+ rotate_1l(RD1);
+ inpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);
+ rotate_1l(RD2);

- xorq RID1, RID1;
- xorq RID2, RID2;
+ movq %rsi, %r11;

encrypt_cycle(0);
encrypt_cycle(1);
@@ -235,26 +287,27 @@ __twofish_enc_blk_8way:
encrypt_cycle(4);
encrypt_cycle(5);
encrypt_cycle(6);
- encrypt_cycle(7);
+ encrypt_cycle_last(7);

vmovdqu (w+4*4)(CTX), RK1;

popq %rcx;
popq %rbx;
+ popq %rbp;

- leaq (4*4*4)(%rsi), %rax;
+ leaq (4*4*4)(%r11), %rax;

testb %cl, %cl;
jnz __enc_xor8;

- outunpack_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_blocks(%r11, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

__enc_xor8:
- outunpack_xor_blocks(%rsi, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ outunpack_xor_blocks(%r11, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ outunpack_xor_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);

ret;

@@ -269,16 +322,19 @@ twofish_dec_blk_8way:
* %rdx: src
*/

+ pushq %rbp;
pushq %rbx;

vmovdqu (w+4*4)(CTX), RK1;

leaq (4*4*4)(%rdx), %rax;
- inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX, RY, RK2);
- inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX, RY, RK2);
+ inpack_blocks(%rdx, RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
+ preload_rgi(RC1);
+ rotate_1l(RA1);
+ inpack_blocks(%rax, RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);
+ rotate_1l(RA2);

- xorq RID1, RID1;
- xorq RID2, RID2;
+ movq %rsi, %r11;

decrypt_cycle(7);
decrypt_cycle(6);
@@ -287,14 +343,15 @@ twofish_dec_blk_8way:
decrypt_cycle(3);
decrypt_cycle(2);
decrypt_cycle(1);
- decrypt_cycle(0);
+ decrypt_cycle_last(0);

vmovdqu (w)(CTX), RK1;

popq %rbx;
+ popq %rbp;

- leaq (4*4*4)(%rsi), %rax;
- outunpack_blocks(%rsi, RA1, RB1, RC1, RD1, RK1, RX, RY, RK2);
- outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX, RY, RK2);
+ leaq (4*4*4)(%r11), %rax;
+ outunpack_blocks(%r11, RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
+ outunpack_blocks(%rax, RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);

ret;

2012-08-23 00:05:50

by Jason Garrett-Glaser

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 22, 2012 at 12:20 PM, Jussi Kivilinna
<[email protected]> wrote:
> Quoting Borislav Petkov <[email protected]>:
>
>> On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote:
>>> Looks that encryption lost ~0.4% while decryption gained ~1.8%.
>>>
>>> For 256 byte test, it's still slightly slower than twofish-3way
>>> (~3%). For 1k
>>> and 8k tests, it's ~5% faster.
>>>
>>> Here's very last test-patch, testing different ordering of fpu<->cpu reg
>>> instructions at few places.
>>
>> Hehe,
>>
>> I don't mind testing patches, no worries there. Here are the results
>> this time, doesn't look better than the last run, AFAICT.
>>
>
> Actually it does look better, at least for encryption. Decryption had different
> ordering for test, which appears to be bad on bulldozer as it is on
> sandy-bridge.
>
> So, yet another patch then :)
>
> Interleaving at some new places (reordered lookup_32bit()s in G-macro) and
> doing one of the round rotations one round ahead. Also introduces some
> more parallelism inside lookup_32bit.

Outsider looking in here, but avoiding the 256-way lookup tables
entirely might be faster. Looking at the twofish code, one byte-wise
calculation looks like this:

a0 = x >> 4; b0 = x & 15;
a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
a2 = qt0[n][a1]; b2 = qt1[n][b1];
a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
a4 = qt2[n][a3]; b4 = qt3[n][b3];
return (b4 << 4) | a4;
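
A runnable version of the byte-wise calculation above, as a minimal Python sketch for the q0 permutation only. The four 4-bit tables in `Q0_T` are transcribed from the Twofish specification and should be checked against it before being relied on; `ROR4` and `ASHX` are derived here from their definitions (4-bit rotate right by one, and a ^ (8a mod 16)). The names `Q0_T` and `q0` are chosen for this sketch.

```python
# 4-bit tables t0..t3 for q0 (transcribed from the Twofish specification;
# verify against the spec before reuse).
Q0_T = [
    [0x8, 0x1, 0x7, 0xD, 0x6, 0xF, 0x3, 0x2, 0x0, 0xB, 0x5, 0x9, 0xE, 0xC, 0xA, 0x4],
    [0xE, 0xC, 0xB, 0x8, 0x1, 0x2, 0x3, 0x5, 0xF, 0x4, 0xA, 0x6, 0x7, 0x0, 0x9, 0xD],
    [0xB, 0xA, 0x5, 0xE, 0x6, 0xD, 0x9, 0x0, 0xC, 0x8, 0xF, 0x3, 0x2, 0x4, 0x7, 0x1],
    [0xD, 0x7, 0xF, 0x4, 0x1, 0x2, 0x6, 0xE, 0x9, 0xB, 0x3, 0x0, 0x8, 0x5, 0xC, 0xA],
]

# ror4[x]: rotate a 4-bit value right by one bit.
ROR4 = [((x >> 1) | (x << 3)) & 0xF for x in range(16)]
# ashx[x]: x ^ (8*x mod 16), i.e. bit 0 is XORed into bit 3.
ASHX = [x ^ ((x << 3) & 0x8) for x in range(16)]

def q0(x):
    """Compute q0 of one byte using only 16-entry lookups."""
    a0, b0 = x >> 4, x & 15
    a1, b1 = a0 ^ b0, ROR4[b0] ^ ASHX[a0]
    a2, b2 = Q0_T[0][a1], Q0_T[1][b1]
    a3, b3 = a2 ^ b2, ROR4[b2] ^ ASHX[a2]
    a4, b4 = Q0_T[2][a3], Q0_T[3][b3]
    return (b4 << 4) | a4

print([hex(q0(x)) for x in range(4)])
```

Because every step is either a 16-entry lookup or an XOR, the same sequence maps directly onto byte shuffles over a whole vector register.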

This means that you can do something like this pseudocode (Intel
syntax). pshufb on ymm registers is AVX2, but splitting it into xmm
operations would probably be fine (as would using this for just a pure
SSE implementation!). On AVX2 you'd have to double the tables for both
ways, naturally.

constants:
pb_0x0f = {0x0f,0x0f,0x0f ... }
ashx: lookup table
ror4: lookup table
qt0[n]: lookup table
qt1[n]: lookup table
qt2[n]: lookup table
qt3[n]: lookup table

vpand b0, in, pb_0x0f
vpsrlw a0, in, 4
vpand a0, a0, pb_0x0f ; effectively vpsrlb, but that doesn't exist

vpxor a1, a0, b0
vpshufb a0, ashx, a0
vpshufb b0, ror4, b0
vpxor b1, a0, b0

vpshufb a2, qt0[n], a1
vpshufb b2, qt1[n], b1

vpxor a3, a2, b2
vpshufb a2, ashx, a2
vpshufb b2, ror4, b2
vpxor b3, a2, b2

vpshufb a4, qt2[n], a3
vpshufb b4, qt3[n], b3

vpsllw b4, b4, 4 ; effectively vpsllb, but that doesn't exist
vpor out, a4, b4

That's 15 instructions (plus maybe a move or two) to do 16 lookups for
SSE (~9 cycles by my guessing on a Nehalem). AVX would run into the
problem of lots of extra vinsert/vextract (just going 16-byte might be
better, might be not, depending on execution units). AVX2 would be
super fast (15 for 32).
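
One way to sanity-check a shuffle-based sequence like this is to emulate pshufb in software and compare it with the scalar formula. The following is a minimal Python sketch under stated assumptions: `pshufb` models only the 128-bit byte shuffle (a set high bit in the index byte yields zero), the q0 nibble tables are transcribed from the Twofish specification and should be verified against it, and all function names are made up for this example. It says nothing about performance.

```python
# q0 nibble tables (transcribed from the Twofish specification; verify before reuse).
T0 = [0x8, 0x1, 0x7, 0xD, 0x6, 0xF, 0x3, 0x2, 0x0, 0xB, 0x5, 0x9, 0xE, 0xC, 0xA, 0x4]
T1 = [0xE, 0xC, 0xB, 0x8, 0x1, 0x2, 0x3, 0x5, 0xF, 0x4, 0xA, 0x6, 0x7, 0x0, 0x9, 0xD]
T2 = [0xB, 0xA, 0x5, 0xE, 0x6, 0xD, 0x9, 0x0, 0xC, 0x8, 0xF, 0x3, 0x2, 0x4, 0x7, 0x1]
T3 = [0xD, 0x7, 0xF, 0x4, 0x1, 0x2, 0x6, 0xE, 0x9, 0xB, 0x3, 0x0, 0x8, 0x5, 0xC, 0xA]
ROR4 = [((x >> 1) | (x << 3)) & 0xF for x in range(16)]
ASHX = [x ^ ((x << 3) & 0x8) for x in range(16)]

def pshufb(table, idx):
    """Emulate a 128-bit pshufb: per-byte table lookup on the low 4 index bits."""
    return [0 if i & 0x80 else table[i & 0x0F] for i in idx]

def vxor(a, b):
    """Emulate a byte-wise vector XOR."""
    return [x ^ y for x, y in zip(a, b)]

def q0_vector(inp):
    """Apply q0 to 16 bytes at once using only shuffles, XORs, and masks."""
    b0 = [x & 0x0F for x in inp]          # low nibbles
    a0 = [(x >> 4) & 0x0F for x in inp]   # high nibbles
    a1 = vxor(a0, b0)
    b1 = vxor(pshufb(ASHX, a0), pshufb(ROR4, b0))
    a2 = pshufb(T0, a1)
    b2 = pshufb(T1, b1)
    a3 = vxor(a2, b2)
    b3 = vxor(pshufb(ASHX, a2), pshufb(ROR4, b2))
    a4 = pshufb(T2, a3)
    b4 = pshufb(T3, b3)
    return [(hi << 4) | lo for lo, hi in zip(a4, b4)]

def q0_scalar(x):
    """Reference: the byte-wise formula, one byte at a time."""
    a0, b0 = x >> 4, x & 15
    a1, b1 = a0 ^ b0, ROR4[b0] ^ ASHX[a0]
    a2, b2 = T0[a1], T1[b1]
    a3, b3 = a2 ^ b2, ROR4[b2] ^ ASHX[a2]
    return (T3[b3] << 4) | T2[a3]

# Compare the two forms over all 256 byte values, 16 bytes per "register".
ok = all(
    q0_vector(list(range(base, base + 16))) == [q0_scalar(x) for x in range(base, base + 16)]
    for base in range(0, 256, 16)
)
print("vector form matches scalar form:", ok)
```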

If this works, this could be quite a bit faster than the table-based approach.

Jason

2012-08-23 08:33:45

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Jason Garrett-Glaser <[email protected]>:

> On Wed, Aug 22, 2012 at 12:20 PM, Jussi Kivilinna
> <[email protected]> wrote:
>> Quoting Borislav Petkov <[email protected]>:
>>
>>> On Wed, Aug 22, 2012 at 07:35:12AM +0300, Jussi Kivilinna wrote:
>>>> Looks that encryption lost ~0.4% while decryption gained ~1.8%.
>>>>
>>>> For 256 byte test, it's still slightly slower than twofish-3way
>>>> (~3%). For 1k
>>>> and 8k tests, it's ~5% faster.
>>>>
>>>> Here's very last test-patch, testing different ordering of fpu<->cpu reg
>>>> instructions at few places.
>>>
>>> Hehe,
>>>
>>> I don't mind testing patches, no worries there. Here are the results
>>> this time, doesn't look better than the last run, AFAICT.
>>>
>>
>> Actually it does look better, at least for encryption. Decryption
>> had different
>> ordering for test, which appears to be bad on bulldozer as it is on
>> sandy-bridge.
>>
>> So, yet another patch then :)
>>
>> Interleaving at some new places (reordered lookup_32bit()s in G-macro) and
>> doing one of the round rotations one round ahead. Also introduces some
>> more parallelism inside lookup_32bit.
>
> Outsider looking in here, but avoiding the 256-way lookup tables
> entirely might be faster. Looking at the twofish code, one byte-wise
> calculation looks like this:
>
> a0 = x >> 4; b0 = x & 15;
> a1 = a0 ^ b0; b1 = ror4[b0] ^ ashx[a0];
> a2 = qt0[n][a1]; b2 = qt1[n][b1];
> a3 = a2 ^ b2; b3 = ror4[b2] ^ ashx[a2];
> a4 = qt2[n][a3]; b4 = qt3[n][b3];
> return (b4 << 4) | a4;
>
> This means that you can do something like this pseudocode (Intel
> syntax). pshufb on ymm registers is AVX2, but splitting it into xmm
> operations would probably be fine (as would using this for just a pure
> SSE implementation!). On AVX2 you'd have to double the tables for both
> ways, naturally.
>
> constants:
> pb_0x0f = {0x0f,0x0f,0x0f ... }
> ashx: lookup table
> ror4: lookup table
> qt0[n]: lookup table
> qt1[n]: lookup table
> qt2[n]: lookup table
> qt3[n]: lookup table
>
> vpand b0, in, pb_0x0f
> vpsrlw a0, in, 4
> vpand a0, a0, pb_0x0f ; effectively vpsrlb, but that doesn't exist
>
> vpxor a1, a0, b0
> vpshufb a0, ashx, a0
> vpshufb b0, ror4, b0
> vpxor b1, a0, b0
>
> vpshufb a2, qt0[n], a1
> vpshufb b2, qt1[n], b1
>
> vpxor a3, a2, b2
> vpshufb a2, ashx, a2
> vpshufb b2, ror4, b2
> vpxor b3, a2, b2
>
> vpshufb a4, qt2[n], a3
> vpshufb b4, qt3[n], b3
>
> vpsllw b4, b4, 4 ; effectively vpsllb, but that doesn't exist
> vpor out, a4, b4
>
> That's 15 instructions (plus maybe a move or two) to do 16 lookups for
> SSE (~9 cycles by my guessing on a Nehalem). AVX would run into the
> problem of lots of extra vinsert/vextract (just going 16-byte might be
> better, might be not, depending on execution units). AVX2 would be
> super fast (15 for 32).
>
> If this works, this could be quite a bit faster than the table-based
> approach.

The above would implement the Twofish permutations q0 and q1? For a
byte-sliced implementation you would need 8 parallel blocks (16-byte
registers, two parallel h-functions per round, 16/2).

In this setup, for the double h-function, you need 12 q0/q1 operations (for
a 128-bit key; 16 for 192-bit, 20 for 256-bit), plus 8 key-material XORs
(12 for 192-bit, 16 for 256-bit) and the MDS matrix multiplication (a lot
more than 15 instructions, I'd think). We do 16 rounds, so that gives us
((12*15+8+15)*16)/(8*16) > 25.3 instructions/byte. Usually I get ~2.5
instructions/cycle for pure SSE2, so that's about 10 cycles/byte.
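
The estimate above can be reproduced with a few lines of arithmetic. This is a sketch that simply plugs in the figures from the message for the 128-bit key case; the constant names are made up for this example, and the MDS cost of 15 instructions is the lower bound used in the estimate, not a measured value.

```python
Q_OPS = 12         # register-wide q0/q1 operations per double h-function (128-bit key)
INSTR_PER_Q = 15   # instruction count quoted for one q operation
KEY_XORS = 8       # key-material XORs (128-bit key)
MDS_INSTR = 15     # assumed lower bound for the MDS multiplication
ROUNDS = 16
BLOCKS = 8         # blocks processed in parallel
BLOCK_BYTES = 16   # Twofish block size in bytes
IPC = 2.5          # instructions per cycle assumed for pure SSE2 code

instr_per_round = Q_OPS * INSTR_PER_Q + KEY_XORS + MDS_INSTR
instr_per_byte = instr_per_round * ROUNDS / (BLOCKS * BLOCK_BYTES)
cycles_per_byte = instr_per_byte / IPC

print(f"{instr_per_byte:.3f} instructions/byte, {cycles_per_byte:.2f} cycles/byte")
```

Since the MDS term is a lower bound, the resulting figure is a floor for this approach rather than a prediction.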

After that we have the PHT phase. But now the problem is that PHT uses
32-bit additions, so either we move between byte-sliced and
dword-sliced modes here or propagate the addition carry across bytes. After
PHT there is a 32-bit addition with key material and 32-bit rotations.
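
To make the carry problem concrete, here is a minimal Python sketch of the PHT step on 32-bit words (matching the two vpaddd instructions in the round tail of the patch) next to what a byte-sliced layout would have to do for a single 32-bit addition. The function names are illustrative, and the byte-sliced helper works on one word for clarity rather than on whole registers.

```python
MASK32 = 0xFFFFFFFF

def pht(a, b):
    """Pseudo-Hadamard transform on 32-bit words: a' = a + b, b' = a + 2b."""
    a2 = (a + b) & MASK32
    b2 = (a2 + b) & MASK32
    return a2, b2

def add32_bytesliced(x_bytes, y_bytes):
    """Add two 32-bit values stored as four separate little-endian bytes.

    In a byte-sliced layout each byte position lives in a different register,
    so the carry has to be rippled from one position to the next by hand.
    """
    out, carry = [], 0
    for xb, yb in zip(x_bytes, y_bytes):
        s = xb + yb + carry
        out.append(s & 0xFF)
        carry = s >> 8
    return out  # the final carry is dropped, i.e. addition modulo 2**32

x, y = 0x89ABCDEF, 0x76543211
sliced = add32_bytesliced(list(x.to_bytes(4, "little")), list(y.to_bytes(4, "little")))
print(hex(int.from_bytes(bytes(sliced), "little")), hex((x + y) & MASK32))
```

The dword-sliced code gets each addition from a single instruction, while the byte-sliced form needs a dependent chain of four byte additions plus carry extraction.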

I don't think this is going to work. For AVX2, vpgatherdd is going to
speed up 32-bit lookups anyway.

-Jussi

>
> Jason
>
>

2012-08-23 14:36:18

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Wed, Aug 22, 2012 at 10:20:03PM +0300, Jussi Kivilinna wrote:
> Actually it does look better, at least for encryption. Decryption had different
> ordering for test, which appears to be bad on bulldozer as it is on
> sandy-bridge.
>
> So, yet another patch then :)

Here you go:

[ 153.736745]
[ 153.736745] testing speed of async ecb(twofish) encryption
[ 153.745806] test 0 (128 bit key, 16 byte blocks): 4832343 operations in 1 seconds (77317488 bytes)
[ 154.752525] test 1 (128 bit key, 64 byte blocks): 2049979 operations in 1 seconds (131198656 bytes)
[ 155.755195] test 2 (128 bit key, 256 byte blocks): 620439 operations in 1 seconds (158832384 bytes)
[ 156.761694] test 3 (128 bit key, 1024 byte blocks): 173900 operations in 1 seconds (178073600 bytes)
[ 157.768282] test 4 (128 bit key, 8192 byte blocks): 22366 operations in 1 seconds (183222272 bytes)
[ 158.774815] test 5 (192 bit key, 16 byte blocks): 4850741 operations in 1 seconds (77611856 bytes)
[ 159.781498] test 6 (192 bit key, 64 byte blocks): 2046772 operations in 1 seconds (130993408 bytes)
[ 160.788163] test 7 (192 bit key, 256 byte blocks): 619915 operations in 1 seconds (158698240 bytes)
[ 161.794636] test 8 (192 bit key, 1024 byte blocks): 173442 operations in 1 seconds (177604608 bytes)
[ 162.801242] test 9 (192 bit key, 8192 byte blocks): 22083 operations in 1 seconds (180903936 bytes)
[ 163.807793] test 10 (256 bit key, 16 byte blocks): 4862951 operations in 1 seconds (77807216 bytes)
[ 164.814449] test 11 (256 bit key, 64 byte blocks): 2050036 operations in 1 seconds (131202304 bytes)
[ 165.821121] test 12 (256 bit key, 256 byte blocks): 620349 operations in 1 seconds (158809344 bytes)
[ 166.827621] test 13 (256 bit key, 1024 byte blocks): 173917 operations in 1 seconds (178091008 bytes)
[ 167.834218] test 14 (256 bit key, 8192 byte blocks): 22362 operations in 1 seconds (183189504 bytes)
[ 168.840798]
[ 168.840798] testing speed of async ecb(twofish) decryption
[ 168.849968] test 0 (128 bit key, 16 byte blocks): 4889899 operations in 1 seconds (78238384 bytes)
[ 169.855439] test 1 (128 bit key, 64 byte blocks): 2052293 operations in 1 seconds (131346752 bytes)
[ 170.862113] test 2 (128 bit key, 256 byte blocks): 616979 operations in 1 seconds (157946624 bytes)
[ 171.868631] test 3 (128 bit key, 1024 byte blocks): 172773 operations in 1 seconds (176919552 bytes)
[ 172.875244] test 4 (128 bit key, 8192 byte blocks): 22224 operations in 1 seconds (182059008 bytes)
[ 173.881777] test 5 (192 bit key, 16 byte blocks): 4893653 operations in 1 seconds (78298448 bytes)
[ 174.888451] test 6 (192 bit key, 64 byte blocks): 2048078 operations in 1 seconds (131076992 bytes)
[ 175.895131] test 7 (192 bit key, 256 byte blocks): 619204 operations in 1 seconds (158516224 bytes)
[ 176.901651] test 8 (192 bit key, 1024 byte blocks): 172569 operations in 1 seconds (176710656 bytes)
[ 177.908253] test 9 (192 bit key, 8192 byte blocks): 21888 operations in 1 seconds (179306496 bytes)
[ 178.914781] test 10 (256 bit key, 16 byte blocks): 4921751 operations in 1 seconds (78748016 bytes)
[ 179.917481] test 11 (256 bit key, 64 byte blocks): 2051219 operations in 1 seconds (131278016 bytes)
[ 180.920147] test 12 (256 bit key, 256 byte blocks): 618536 operations in 1 seconds (158345216 bytes)
[ 181.926637] test 13 (256 bit key, 1024 byte blocks): 172886 operations in 1 seconds (177035264 bytes)
[ 182.933249] test 14 (256 bit key, 8192 byte blocks): 22222 operations in 1 seconds (182042624 bytes)
[ 183.939803]
[ 183.939803] testing speed of async cbc(twofish) encryption
[ 183.953902] test 0 (128 bit key, 16 byte blocks): 5195403 operations in 1 seconds (83126448 bytes)
[ 184.962487] test 1 (128 bit key, 64 byte blocks): 1912010 operations in 1 seconds (122368640 bytes)
[ 185.969150] test 2 (128 bit key, 256 byte blocks): 540125 operations in 1 seconds (138272000 bytes)
[ 186.975650] test 3 (128 bit key, 1024 byte blocks): 140631 operations in 1 seconds (144006144 bytes)
[ 187.982411] test 4 (128 bit key, 8192 byte blocks): 17737 operations in 1 seconds (145301504 bytes)
[ 188.988782] test 5 (192 bit key, 16 byte blocks): 5182287 operations in 1 seconds (82916592 bytes)
[ 189.995435] test 6 (192 bit key, 64 byte blocks): 1912356 operations in 1 seconds (122390784 bytes)
[ 191.002093] test 7 (192 bit key, 256 byte blocks): 540991 operations in 1 seconds (138493696 bytes)
[ 192.008600] test 8 (192 bit key, 1024 byte blocks): 140791 operations in 1 seconds (144169984 bytes)
[ 193.015197] test 9 (192 bit key, 8192 byte blocks): 17609 operations in 1 seconds (144252928 bytes)
[ 194.021740] test 10 (256 bit key, 16 byte blocks): 5191521 operations in 1 seconds (83064336 bytes)
[ 195.028534] test 11 (256 bit key, 64 byte blocks): 1906226 operations in 1 seconds (121998464 bytes)
[ 196.035069] test 12 (256 bit key, 256 byte blocks): 540479 operations in 1 seconds (138362624 bytes)
[ 197.041579] test 13 (256 bit key, 1024 byte blocks): 140654 operations in 1 seconds (144029696 bytes)
[ 198.048164] test 14 (256 bit key, 8192 byte blocks): 17741 operations in 1 seconds (145334272 bytes)
[ 199.054717]
[ 199.054717] testing speed of async cbc(twofish) decryption
[ 199.064019] test 0 (128 bit key, 16 byte blocks): 4783914 operations in 1 seconds (76542624 bytes)
[ 200.069414] test 1 (128 bit key, 64 byte blocks): 1954641 operations in 1 seconds (125097024 bytes)
[ 201.076079] test 2 (128 bit key, 256 byte blocks): 604230 operations in 1 seconds (154682880 bytes)
[ 202.082586] test 3 (128 bit key, 1024 byte blocks): 167613 operations in 1 seconds (171635712 bytes)
[ 203.089199] test 4 (128 bit key, 8192 byte blocks): 21451 operations in 1 seconds (175726592 bytes)
[ 204.095716] test 5 (192 bit key, 16 byte blocks): 4795759 operations in 1 seconds (76732144 bytes)
[ 205.102390] test 6 (192 bit key, 64 byte blocks): 1953134 operations in 1 seconds (125000576 bytes)
[ 206.109055] test 7 (192 bit key, 256 byte blocks): 599761 operations in 1 seconds (153538816 bytes)
[ 207.115564] test 8 (192 bit key, 1024 byte blocks): 166437 operations in 1 seconds (170431488 bytes)
[ 208.122184] test 9 (192 bit key, 8192 byte blocks): 20789 operations in 1 seconds (170303488 bytes)
[ 209.128728] test 10 (256 bit key, 16 byte blocks): 4794873 operations in 1 seconds (76717968 bytes)
[ 210.135375] test 11 (256 bit key, 64 byte blocks): 1953978 operations in 1 seconds (125054592 bytes)
[ 211.142039] test 12 (256 bit key, 256 byte blocks): 604269 operations in 1 seconds (154692864 bytes)
[ 212.148556] test 13 (256 bit key, 1024 byte blocks): 167571 operations in 1 seconds (171592704 bytes)
[ 213.155143] test 14 (256 bit key, 8192 byte blocks): 21453 operations in 1 seconds (175742976 bytes)
[ 214.161698]
[ 214.161698] testing speed of async ctr(twofish) encryption
[ 214.175571] test 0 (128 bit key, 16 byte blocks): 4581950 operations in 1 seconds (73311200 bytes)
[ 215.184354] test 1 (128 bit key, 64 byte blocks): 1944709 operations in 1 seconds (124461376 bytes)
[ 216.191166] test 2 (128 bit key, 256 byte blocks): 594086 operations in 1 seconds (152086016 bytes)
[ 217.197536] test 3 (128 bit key, 1024 byte blocks): 163216 operations in 1 seconds (167133184 bytes)
[ 218.204149] test 4 (128 bit key, 8192 byte blocks): 21075 operations in 1 seconds (172646400 bytes)
[ 219.210813] test 5 (192 bit key, 16 byte blocks): 4705554 operations in 1 seconds (75288864 bytes)
[ 220.217330] test 6 (192 bit key, 64 byte blocks): 1963988 operations in 1 seconds (125695232 bytes)
[ 221.224004] test 7 (192 bit key, 256 byte blocks): 581953 operations in 1 seconds (148979968 bytes)
[ 222.230513] test 8 (192 bit key, 1024 byte blocks): 162790 operations in 1 seconds (166696960 bytes)
[ 223.237126] test 9 (192 bit key, 8192 byte blocks): 20706 operations in 1 seconds (169623552 bytes)
[ 224.243642] test 10 (256 bit key, 16 byte blocks): 4437112 operations in 1 seconds (70993792 bytes)
[ 225.250324] test 11 (256 bit key, 64 byte blocks): 1963735 operations in 1 seconds (125679040 bytes)
[ 226.256990] test 12 (256 bit key, 256 byte blocks): 596765 operations in 1 seconds (152771840 bytes)
[ 227.263498] test 13 (256 bit key, 1024 byte blocks): 163385 operations in 1 seconds (167306240 bytes)
[ 228.270232] test 14 (256 bit key, 8192 byte blocks): 20950 operations in 1 seconds (171622400 bytes)
[ 229.276657]
[ 229.276657] testing speed of async ctr(twofish) decryption
[ 229.285975] test 0 (128 bit key, 16 byte blocks): 4571340 operations in 1 seconds (73141440 bytes)
[ 230.291288] test 1 (128 bit key, 64 byte blocks): 1949949 operations in 1 seconds (124796736 bytes)
[ 231.297951] test 2 (128 bit key, 256 byte blocks): 591529 operations in 1 seconds (151431424 bytes)
[ 232.304470] test 3 (128 bit key, 1024 byte blocks): 163609 operations in 1 seconds (167535616 bytes)
[ 233.311073] test 4 (128 bit key, 8192 byte blocks): 20975 operations in 1 seconds (171827200 bytes)
[ 234.317581] test 5 (192 bit key, 16 byte blocks): 4639461 operations in 1 seconds (74231376 bytes)
[ 235.324307] test 6 (192 bit key, 64 byte blocks): 1963173 operations in 1 seconds (125643072 bytes)
[ 236.330929] test 7 (192 bit key, 256 byte blocks): 585030 operations in 1 seconds (149767680 bytes)
[ 237.337445] test 8 (192 bit key, 1024 byte blocks): 162872 operations in 1 seconds (166780928 bytes)
[ 238.344050] test 9 (192 bit key, 8192 byte blocks): 20728 operations in 1 seconds (169803776 bytes)
[ 239.350603] test 10 (256 bit key, 16 byte blocks): 4443427 operations in 1 seconds (71094832 bytes)
[ 240.357259] test 11 (256 bit key, 64 byte blocks): 1965011 operations in 1 seconds (125760704 bytes)
[ 241.363914] test 12 (256 bit key, 256 byte blocks): 590193 operations in 1 seconds (151089408 bytes)
[ 242.370422] test 13 (256 bit key, 1024 byte blocks): 163370 operations in 1 seconds (167290880 bytes)
[ 243.377018] test 14 (256 bit key, 8192 byte blocks): 20969 operations in 1 seconds (171778048 bytes)
[ 244.383546]
[ 244.383546] testing speed of async lrw(twofish) encryption
[ 244.398118] test 0 (256 bit key, 16 byte blocks): 3582956 operations in 1 seconds (57327296 bytes)
[ 245.406230] test 1 (256 bit key, 64 byte blocks): 1618011 operations in 1 seconds (103552704 bytes)
[ 246.412911] test 2 (256 bit key, 256 byte blocks): 502411 operations in 1 seconds (128617216 bytes)
[ 247.419427] test 3 (256 bit key, 1024 byte blocks): 140501 operations in 1 seconds (143873024 bytes)
[ 248.422071] test 4 (256 bit key, 8192 byte blocks): 18166 operations in 1 seconds (148815872 bytes)
[ 249.424613] test 5 (320 bit key, 16 byte blocks): 3576354 operations in 1 seconds (57221664 bytes)
[ 250.431245] test 6 (320 bit key, 64 byte blocks): 1626817 operations in 1 seconds (104116288 bytes)
[ 251.437908] test 7 (320 bit key, 256 byte blocks): 504222 operations in 1 seconds (129080832 bytes)
[ 252.444407] test 8 (320 bit key, 1024 byte blocks): 140962 operations in 1 seconds (144345088 bytes)
[ 253.451020] test 9 (320 bit key, 8192 byte blocks): 17955 operations in 1 seconds (147087360 bytes)
[ 254.457555] test 10 (384 bit key, 16 byte blocks): 3558173 operations in 1 seconds (56930768 bytes)
[ 255.464210] test 11 (384 bit key, 64 byte blocks): 1630951 operations in 1 seconds (104380864 bytes)
[ 256.470866] test 12 (384 bit key, 256 byte blocks): 504089 operations in 1 seconds (129046784 bytes)
[ 257.477383] test 13 (384 bit key, 1024 byte blocks): 141065 operations in 1 seconds (144450560 bytes)
[ 258.483979] test 14 (384 bit key, 8192 byte blocks): 18168 operations in 1 seconds (148832256 bytes)
[ 259.490542]
[ 259.490542] testing speed of async lrw(twofish) decryption
[ 259.499858] test 0 (256 bit key, 16 byte blocks): 3557489 operations in 1 seconds (56919824 bytes)
[ 260.505175] test 1 (256 bit key, 64 byte blocks): 1630277 operations in 1 seconds (104337728 bytes)
[ 261.511865] test 2 (256 bit key, 256 byte blocks): 503750 operations in 1 seconds (128960000 bytes)
[ 262.518383] test 3 (256 bit key, 1024 byte blocks): 140698 operations in 1 seconds (144074752 bytes)
[ 263.524988] test 4 (256 bit key, 8192 byte blocks): 18124 operations in 1 seconds (148471808 bytes)
[ 264.531487] test 5 (320 bit key, 16 byte blocks): 3579978 operations in 1 seconds (57279648 bytes)
[ 265.538179] test 6 (320 bit key, 64 byte blocks): 1632251 operations in 1 seconds (104464064 bytes)
[ 266.544843] test 7 (320 bit key, 256 byte blocks): 502180 operations in 1 seconds (128558080 bytes)
[ 267.551350] test 8 (320 bit key, 1024 byte blocks): 139727 operations in 1 seconds (143080448 bytes)
[ 268.557964] test 9 (320 bit key, 8192 byte blocks): 17731 operations in 1 seconds (145252352 bytes)
[ 269.564481] test 10 (384 bit key, 16 byte blocks): 3570236 operations in 1 seconds (57123776 bytes)
[ 270.571162] test 11 (384 bit key, 64 byte blocks): 1623126 operations in 1 seconds (103880064 bytes)
[ 271.577828] test 12 (384 bit key, 256 byte blocks): 504857 operations in 1 seconds (129243392 bytes)
[ 272.584346] test 13 (384 bit key, 1024 byte blocks): 140801 operations in 1 seconds (144180224 bytes)
[ 273.586961] test 14 (384 bit key, 8192 byte blocks): 18139 operations in 1 seconds (148594688 bytes)
[ 274.589525]
[ 274.589525] testing speed of async xts(twofish) encryption
[ 274.603741] test 0 (256 bit key, 16 byte blocks): 3098851 operations in 1 seconds (49581616 bytes)
[ 275.612164] test 1 (256 bit key, 64 byte blocks): 1577161 operations in 1 seconds (100938304 bytes)
[ 276.618836] test 2 (256 bit key, 256 byte blocks): 525612 operations in 1 seconds (134556672 bytes)
[ 277.625459] test 3 (256 bit key, 1024 byte blocks): 150507 operations in 1 seconds (154119168 bytes)
[ 278.632105] test 4 (256 bit key, 8192 byte blocks): 19633 operations in 1 seconds (160833536 bytes)
[ 279.638587] test 5 (384 bit key, 16 byte blocks): 3092237 operations in 1 seconds (49475792 bytes)
[ 280.645261] test 6 (384 bit key, 64 byte blocks): 1576545 operations in 1 seconds (100898880 bytes)
[ 281.651795] test 7 (384 bit key, 256 byte blocks): 526516 operations in 1 seconds (134788096 bytes)
[ 282.658305] test 8 (384 bit key, 1024 byte blocks): 150782 operations in 1 seconds (154400768 bytes)
[ 283.664935] test 9 (384 bit key, 8192 byte blocks): 19632 operations in 1 seconds (160825344 bytes)
[ 284.671425] test 10 (512 bit key, 16 byte blocks): 3164770 operations in 1 seconds (50636320 bytes)
[ 285.678254] test 11 (512 bit key, 64 byte blocks): 1586822 operations in 1 seconds (101556608 bytes)
[ 286.684781] test 12 (512 bit key, 256 byte blocks): 527705 operations in 1 seconds (135092480 bytes)
[ 287.691290] test 13 (512 bit key, 1024 byte blocks): 150918 operations in 1 seconds (154540032 bytes)
[ 288.697885] test 14 (512 bit key, 8192 byte blocks): 19640 operations in 1 seconds (160890880 bytes)
[ 289.704422]
[ 289.704422] testing speed of async xts(twofish) decryption
[ 289.713733] test 0 (256 bit key, 16 byte blocks): 3082480 operations in 1 seconds (49319680 bytes)
[ 290.719098] test 1 (256 bit key, 64 byte blocks): 1571464 operations in 1 seconds (100573696 bytes)
[ 291.725752] test 2 (256 bit key, 256 byte blocks): 528360 operations in 1 seconds (135260160 bytes)
[ 292.732271] test 3 (256 bit key, 1024 byte blocks): 150115 operations in 1 seconds (153717760 bytes)
[ 293.738874] test 4 (256 bit key, 8192 byte blocks): 19513 operations in 1 seconds (159850496 bytes)
[ 294.745427] test 5 (384 bit key, 16 byte blocks): 3087055 operations in 1 seconds (49392880 bytes)
[ 295.752083] test 6 (384 bit key, 64 byte blocks): 1572391 operations in 1 seconds (100633024 bytes)
[ 296.754760] test 7 (384 bit key, 256 byte blocks): 527241 operations in 1 seconds (134973696 bytes)
[ 297.757259] test 8 (384 bit key, 1024 byte blocks): 150210 operations in 1 seconds (153815040 bytes)
[ 298.763871] test 9 (384 bit key, 8192 byte blocks): 19504 operations in 1 seconds (159776768 bytes)
[ 299.770425] test 10 (512 bit key, 16 byte blocks): 3157185 operations in 1 seconds (50514960 bytes)
[ 300.777072] test 11 (512 bit key, 64 byte blocks): 1579551 operations in 1 seconds (101091264 bytes)
[ 301.783745] test 12 (512 bit key, 256 byte blocks): 526692 operations in 1 seconds (134833152 bytes)
[ 302.790244] test 13 (512 bit key, 1024 byte blocks): 150220 operations in 1 seconds (153825280 bytes)
[ 303.796840] test 14 (512 bit key, 8192 byte blocks): 19498 operations in 1 seconds (159727616 bytes)
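
To compare runs, the log lines above can be reduced to throughput figures. What follows is a minimal sketch, assuming the line format shown in this log. The names `parse_line` and `throughput_mb_s` are hypothetical helpers for illustration and are not part of tcrypt. It also checks that the reported byte count equals operations times block size.

```python
import re

# Matches one tcrypt speed-test result line as it appears in the log above.
LINE_RE = re.compile(
    r"test (\d+) \((\d+) bit key, (\d+) byte blocks\): "
    r"(\d+) operations in (\d+) seconds \((\d+) bytes\)"
)


def parse_line(line):
    """Return the numeric fields of a result line, or None for any other line."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    test, key_bits, block_bytes, ops, seconds, total_bytes = map(int, m.groups())
    return {
        "test": test,
        "key_bits": key_bits,
        "block_bytes": block_bytes,
        "ops": ops,
        "seconds": seconds,
        "total_bytes": total_bytes,
    }


def throughput_mb_s(rec):
    """Throughput in MB/s (1 MB = 10**6 bytes) for one parsed record."""
    return rec["total_bytes"] / rec["seconds"] / 1e6


# Two lines copied from the ecb(twofish) encryption run above.
sample = [
    "[ 153.745806] test 0 (128 bit key, 16 byte blocks): 4832343 operations in 1 seconds (77317488 bytes)",
    "[ 157.768282] test 4 (128 bit key, 8192 byte blocks): 22366 operations in 1 seconds (183222272 bytes)",
]

records = [parse_line(line) for line in sample]
for rec in records:
    # The byte count in parentheses is just operations * block size.
    assert rec["ops"] * rec["block_bytes"] == rec["total_bytes"]
    print(f'{rec["block_bytes"]:>5} B blocks: {throughput_mb_s(rec):8.2f} MB/s')
```

Dividing the figures from two such runs for the same test gives speedup ratios of the kind quoted elsewhere in this thread.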

--
Regards/Gruss,
Boris.

2012-08-28 09:17:48

by Jussi Kivilinna

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

Quoting Borislav Petkov <[email protected]>:

> On Wed, Aug 22, 2012 at 10:20:03PM +0300, Jussi Kivilinna wrote:
>> Actually it does look better, at least for encryption. Decryption had
>> different ordering for test, which appears to be bad on bulldozer as it
>> is on sandy-bridge.
>>
>> So, yet another patch then :)
>
> Here you go:

Thanks!

With this patch twofish-avx is faster than twofish-3way for 256, 1k
and 8k tests.

size   old-vs-new        new-vs-3way       old-vs-3way
       ecb-enc ecb-dec   ecb-enc ecb-dec   ecb-enc ecb-dec
256    1.10x   1.11x     1.01x   1.01x     0.92x   0.91x
1k     1.11x   1.12x     1.08x   1.07x     0.97x   0.96x
8k     1.11x   1.13x     1.10x   1.08x     0.99x   0.97x
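
The three column groups above should be related to each other. This is a rough sketch, assuming "old-vs-new" means new throughput divided by old throughput and that the two "-vs-3way" groups are throughput divided by the 3way throughput. That reading is an assumption, since the table does not define its columns. Under it, old-vs-3way should be roughly new-vs-3way divided by old-vs-new, up to rounding of the two-decimal figures. The name `implied_old_vs_3way` is a hypothetical helper.

```python
# Ratios copied from the table above, keyed by (size, mode):
# (old-vs-new, new-vs-3way, old-vs-3way)
table = {
    ("256", "ecb-enc"): (1.10, 1.01, 0.92),
    ("256", "ecb-dec"): (1.11, 1.01, 0.91),
    ("1k", "ecb-enc"): (1.11, 1.08, 0.97),
    ("1k", "ecb-dec"): (1.12, 1.07, 0.96),
    ("8k", "ecb-enc"): (1.11, 1.10, 0.99),
    ("8k", "ecb-dec"): (1.13, 1.08, 0.97),
}


def implied_old_vs_3way(old_vs_new, new_vs_3way):
    """old/3way = (new/3way) / (new/old), under the assumed column meaning."""
    return new_vs_3way / old_vs_new


# Absolute difference between the implied and the reported old-vs-3way ratio.
deviations = {
    key: abs(implied_old_vs_3way(o2n, n2t) - o2t)
    for key, (o2n, n2t, o2t) in table.items()
}

for key, dev in deviations.items():
    print(key, f"deviation {dev:.3f}")
```

Small deviations are expected here because every input is already rounded to two decimals.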

-Jussi

2012-08-28 16:25:18

by Borislav Petkov

Subject: Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

On Tue, Aug 28, 2012 at 12:17:43PM +0300, Jussi Kivilinna wrote:
> With this patch twofish-avx is faster than twofish-3way for 256, 1k
> and 8k tests.
>
> size   old-vs-new        new-vs-3way       old-vs-3way
>        ecb-enc ecb-dec   ecb-enc ecb-dec   ecb-enc ecb-dec
> 256    1.10x   1.11x     1.01x   1.01x     0.92x   0.91x
> 1k     1.11x   1.12x     1.08x   1.07x     0.97x   0.96x
> 8k     1.11x   1.13x     1.10x   1.08x     0.99x   0.97x

Not bad, that's a 10ish percent improvement, after all.

Thanks.

--
Regards/Gruss,
Boris.