2020-08-02 09:07:21

by Ard Biesheuvel

Subject: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

Ben reports that CCM using AES-NI instructions performs pathologically
poorly, which is due to the overhead of preserving/restoring the SIMD
state, which is repeated after every 16 bytes of input when executing
the CBCMAC portion of the algorithm.

So let's clone the arm64 implementation of cbcmac(aes), which takes
care to only preserve/restore the SIMD state after processing the
whole input. Since cmac(aes) and xcbc(aes) can reuse most of the code,
let's expose those as well.

Cc: Ben Greear <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/x86/crypto/Makefile | 2 +-
arch/x86/crypto/aesni-intel.h | 39 +++
arch/x86/crypto/aesni-intel_glue.c | 42 +---
arch/x86/crypto/aesni-intel_mac.c | 257 ++++++++++++++++++++
4 files changed, 306 insertions(+), 34 deletions(-)

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a31de0c6ccde..f83e162f87ad 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -51,7 +51,7 @@ chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o
chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o

obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
-aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
+aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o aesni-intel_mac.o
aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o

obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
diff --git a/arch/x86/crypto/aesni-intel.h b/arch/x86/crypto/aesni-intel.h
new file mode 100644
index 000000000000..d9204c043184
--- /dev/null
+++ b/arch/x86/crypto/aesni-intel.h
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <crypto/algapi.h>
+#include <crypto/aes.h>
+#include <crypto/internal/hash.h>
+
+#define AESNI_ALIGN 16
+#define AESNI_ALIGN_ATTR __attribute__ ((__aligned__(AESNI_ALIGN)))
+#define AES_BLOCK_MASK (~(AES_BLOCK_SIZE - 1))
+#define RFC4106_HASH_SUBKEY_SIZE 16
+#define AESNI_ALIGN_EXTRA ((AESNI_ALIGN - 1) & ~(CRYPTO_MINALIGN - 1))
+#define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
+#define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)
+
+extern struct shash_alg aesni_mac_algs[];
+extern const int aesni_num_mac_algs;
+
+asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
+ unsigned int key_len);
+asmlinkage void aesni_enc(const void *ctx, u8 *out, const u8 *in);
+asmlinkage void aesni_dec(const void *ctx, u8 *out, const u8 *in);
+asmlinkage void aesni_ecb_enc(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len);
+asmlinkage void aesni_ecb_dec(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len);
+asmlinkage void aesni_cbc_enc(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len, u8 *iv);
+asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len, u8 *iv);
+
+static inline struct crypto_aes_ctx *aes_ctx(void *raw_ctx)
+{
+ unsigned long addr = (unsigned long)raw_ctx;
+ unsigned long align = AESNI_ALIGN;
+
+ if (align <= crypto_tfm_ctx_alignment())
+ align = 1;
+ return (struct crypto_aes_ctx *)ALIGN(addr, align);
+}
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index ad8a7188a2bf..bc51c5f80858 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -19,8 +19,6 @@
#include <linux/types.h>
#include <linux/module.h>
#include <linux/err.h>
-#include <crypto/algapi.h>
-#include <crypto/aes.h>
#include <crypto/ctr.h>
#include <crypto/b128ops.h>
#include <crypto/gcm.h>
@@ -37,14 +35,7 @@
#include <asm/crypto/glue_helper.h>
#endif

-
-#define AESNI_ALIGN 16
-#define AESNI_ALIGN_ATTR __attribute__ ((__aligned__(AESNI_ALIGN)))
-#define AES_BLOCK_MASK (~(AES_BLOCK_SIZE - 1))
-#define RFC4106_HASH_SUBKEY_SIZE 16
-#define AESNI_ALIGN_EXTRA ((AESNI_ALIGN - 1) & ~(CRYPTO_MINALIGN - 1))
-#define CRYPTO_AES_CTX_SIZE (sizeof(struct crypto_aes_ctx) + AESNI_ALIGN_EXTRA)
-#define XTS_AES_CTX_SIZE (sizeof(struct aesni_xts_ctx) + AESNI_ALIGN_EXTRA)
+#include "aesni-intel.h"

/* This data is stored at the end of the crypto_tfm struct.
* It's a type of per "session" data storage location.
@@ -81,19 +72,6 @@ struct gcm_context_data {
u8 hash_keys[GCM_BLOCK_LEN * 16];
};

-asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
- unsigned int key_len);
-asmlinkage void aesni_enc(const void *ctx, u8 *out, const u8 *in);
-asmlinkage void aesni_dec(const void *ctx, u8 *out, const u8 *in);
-asmlinkage void aesni_ecb_enc(struct crypto_aes_ctx *ctx, u8 *out,
- const u8 *in, unsigned int len);
-asmlinkage void aesni_ecb_dec(struct crypto_aes_ctx *ctx, u8 *out,
- const u8 *in, unsigned int len);
-asmlinkage void aesni_cbc_enc(struct crypto_aes_ctx *ctx, u8 *out,
- const u8 *in, unsigned int len, u8 *iv);
-asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
- const u8 *in, unsigned int len, u8 *iv);
-
#define AVX_GEN2_OPTSIZE 640
#define AVX_GEN4_OPTSIZE 4096

@@ -296,16 +274,6 @@ generic_gcmaes_ctx *generic_gcmaes_ctx_get(struct crypto_aead *tfm)
}
#endif

-static inline struct crypto_aes_ctx *aes_ctx(void *raw_ctx)
-{
- unsigned long addr = (unsigned long)raw_ctx;
- unsigned long align = AESNI_ALIGN;
-
- if (align <= crypto_tfm_ctx_alignment())
- align = 1;
- return (struct crypto_aes_ctx *)ALIGN(addr, align);
-}
-
static int aes_set_key_common(struct crypto_tfm *tfm, void *raw_ctx,
const u8 *in_key, unsigned int key_len)
{
@@ -1098,8 +1066,15 @@ static int __init aesni_init(void)
if (err)
goto unregister_skciphers;

+ err = crypto_register_shashes(aesni_mac_algs, aesni_num_mac_algs);
+ if (err)
+ goto unregister_aeads;
+
return 0;

+unregister_aeads:
+ simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads),
+ aesni_simd_aeads);
unregister_skciphers:
simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
aesni_simd_skciphers);
@@ -1110,6 +1085,7 @@ static int __init aesni_init(void)

static void __exit aesni_exit(void)
{
+ crypto_unregister_shashes(aesni_mac_algs, aesni_num_mac_algs);
simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads),
aesni_simd_aeads);
simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
diff --git a/arch/x86/crypto/aesni-intel_mac.c b/arch/x86/crypto/aesni-intel_mac.c
new file mode 100644
index 000000000000..7d546bbf21e5
--- /dev/null
+++ b/arch/x86/crypto/aesni-intel_mac.c
@@ -0,0 +1,257 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2013 - 2017 Linaro Ltd <[email protected]>
+ * Copyright (C) 2020 Arm Ltd <[email protected]>
+ */
+
+#include <asm/simd.h>
+#include <crypto/b128ops.h>
+#include <crypto/internal/hash.h>
+#include <crypto/internal/simd.h>
+#include <linux/module.h>
+
+#include "aesni-intel.h"
+
+MODULE_ALIAS_CRYPTO("cmac(aes)");
+MODULE_ALIAS_CRYPTO("xcbc(aes)");
+MODULE_ALIAS_CRYPTO("cbcmac(aes)");
+
+struct mac_tfm_ctx {
+ u8 key[CRYPTO_AES_CTX_SIZE];
+ u8 __aligned(8) consts[];
+};
+
+struct mac_desc_ctx {
+ u8 dg[AES_BLOCK_SIZE];
+ unsigned int len;
+};
+
+asmlinkage void aesni_enc(const void *ctx, u8 *out, const u8 *in);
+asmlinkage void aesni_ecb_enc(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len);
+asmlinkage void aesni_cbc_enc(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len, u8 *iv);
+
+static int cbcmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
+ unsigned int key_len)
+{
+ struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+
+ return aes_expandkey(aes_ctx(&ctx->key), in_key, key_len);
+}
+
+static void cmac_gf128_mul_by_x(be128 *y, const be128 *x)
+{
+ u64 a = be64_to_cpu(x->a);
+ u64 b = be64_to_cpu(x->b);
+
+ y->a = cpu_to_be64((a << 1) | (b >> 63));
+ y->b = cpu_to_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0));
+}
+
+static int cmac_setkey(struct crypto_shash *tfm, const u8 *in_key,
+ unsigned int key_len)
+{
+ struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+ be128 *consts = (be128 *)ctx->consts;
+ int err;
+
+ err = cbcmac_setkey(tfm, in_key, key_len);
+ if (err)
+ return err;
+
+ /* encrypt the zero vector */
+ kernel_fpu_begin();
+ aesni_ecb_enc(aes_ctx(&ctx->key), ctx->consts, (u8[AES_BLOCK_SIZE]){},
+ AES_BLOCK_SIZE);
+ kernel_fpu_end();
+
+ cmac_gf128_mul_by_x(consts, consts);
+ cmac_gf128_mul_by_x(consts + 1, consts);
+
+ return 0;
+}
+
+static int xcbc_setkey(struct crypto_shash *tfm, const u8 *in_key,
+ unsigned int key_len)
+{
+ static u8 const ks[3][AES_BLOCK_SIZE] = {
+ { [0 ... AES_BLOCK_SIZE - 1] = 0x1 },
+ { [0 ... AES_BLOCK_SIZE - 1] = 0x2 },
+ { [0 ... AES_BLOCK_SIZE - 1] = 0x3 },
+ };
+
+ struct mac_tfm_ctx *ctx = crypto_shash_ctx(tfm);
+ u8 key[AES_BLOCK_SIZE];
+ int err;
+
+ err = cbcmac_setkey(tfm, in_key, key_len);
+ if (err)
+ return err;
+
+ kernel_fpu_begin();
+ aesni_ecb_enc(aes_ctx(&ctx->key), key, ks[0], AES_BLOCK_SIZE);
+ aesni_ecb_enc(aes_ctx(&ctx->key), ctx->consts, ks[1], 2 * AES_BLOCK_SIZE);
+ kernel_fpu_end();
+
+ return cbcmac_setkey(tfm, key, sizeof(key));
+}
+
+static int mac_init(struct shash_desc *desc)
+{
+ struct mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ memset(ctx->dg, 0, AES_BLOCK_SIZE);
+ ctx->len = 0;
+
+ return 0;
+}
+
+static void mac_do_update(struct crypto_aes_ctx *ctx, u8 const in[], int blocks,
+ u8 dg[], int enc_before, int enc_after)
+{
+ if (likely(crypto_simd_usable())) {
+ kernel_fpu_begin();
+ if (enc_before)
+ aesni_enc(ctx, dg, dg);
+
+ while (blocks--) {
+ if (!blocks && !enc_after) {
+ crypto_xor(dg, in, AES_BLOCK_SIZE);
+ break;
+ }
+ aesni_cbc_enc(ctx, dg, in, AES_BLOCK_SIZE, dg);
+ in += AES_BLOCK_SIZE;
+ }
+ kernel_fpu_end();
+ } else {
+ if (enc_before)
+ aes_encrypt(ctx, dg, dg);
+
+ while (blocks--) {
+ crypto_xor(dg, in, AES_BLOCK_SIZE);
+ in += AES_BLOCK_SIZE;
+
+ if (blocks || enc_after)
+ aes_encrypt(ctx, dg, dg);
+ }
+ }
+}
+
+static int mac_update(struct shash_desc *desc, const u8 *p, unsigned int len)
+{
+ struct mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ while (len > 0) {
+ unsigned int l;
+
+ if ((ctx->len % AES_BLOCK_SIZE) == 0 &&
+ (ctx->len + len) > AES_BLOCK_SIZE) {
+
+ int blocks = len / AES_BLOCK_SIZE;
+
+ len %= AES_BLOCK_SIZE;
+
+ mac_do_update(aes_ctx(&tctx->key), p, blocks, ctx->dg,
+ (ctx->len != 0), (len != 0));
+
+ p += blocks * AES_BLOCK_SIZE;
+
+ if (!len) {
+ ctx->len = AES_BLOCK_SIZE;
+ break;
+ }
+ ctx->len = 0;
+ }
+
+ l = min(len, AES_BLOCK_SIZE - ctx->len);
+
+ if (l <= AES_BLOCK_SIZE) {
+ crypto_xor(ctx->dg + ctx->len, p, l);
+ ctx->len += l;
+ len -= l;
+ p += l;
+ }
+ }
+
+ return 0;
+}
+
+static int cbcmac_final(struct shash_desc *desc, u8 *out)
+{
+ struct mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct mac_desc_ctx *ctx = shash_desc_ctx(desc);
+
+ mac_do_update(aes_ctx(&tctx->key), NULL, 0, ctx->dg, (ctx->len != 0), 0);
+
+ memcpy(out, ctx->dg, AES_BLOCK_SIZE);
+
+ return 0;
+}
+
+static int cmac_final(struct shash_desc *desc, u8 *out)
+{
+ struct mac_tfm_ctx *tctx = crypto_shash_ctx(desc->tfm);
+ struct mac_desc_ctx *ctx = shash_desc_ctx(desc);
+ u8 *consts = tctx->consts;
+
+ if (ctx->len != AES_BLOCK_SIZE) {
+ ctx->dg[ctx->len] ^= 0x80;
+ consts += AES_BLOCK_SIZE;
+ }
+
+ mac_do_update(aes_ctx(&tctx->key), consts, 1, ctx->dg, 0, 1);
+
+ memcpy(out, ctx->dg, AES_BLOCK_SIZE);
+
+ return 0;
+}
+
+struct shash_alg aesni_mac_algs[] = { {
+ .base.cra_name = "cmac(aes)",
+ .base.cra_driver_name = "cmac-aes-aesni",
+ .base.cra_priority = 400,
+ .base.cra_blocksize = AES_BLOCK_SIZE,
+ .base.cra_ctxsize = sizeof(struct mac_tfm_ctx) +
+ 2 * AES_BLOCK_SIZE,
+ .base.cra_module = THIS_MODULE,
+
+ .digestsize = AES_BLOCK_SIZE,
+ .init = mac_init,
+ .update = mac_update,
+ .final = cmac_final,
+ .setkey = cmac_setkey,
+ .descsize = sizeof(struct mac_desc_ctx),
+}, {
+ .base.cra_name = "xcbc(aes)",
+ .base.cra_driver_name = "xcbc-aes-aesni",
+ .base.cra_priority = 400,
+ .base.cra_blocksize = AES_BLOCK_SIZE,
+ .base.cra_ctxsize = sizeof(struct mac_tfm_ctx) +
+ 2 * AES_BLOCK_SIZE,
+ .base.cra_module = THIS_MODULE,
+
+ .digestsize = AES_BLOCK_SIZE,
+ .init = mac_init,
+ .update = mac_update,
+ .final = cmac_final,
+ .setkey = xcbc_setkey,
+ .descsize = sizeof(struct mac_desc_ctx),
+}, {
+ .base.cra_name = "cbcmac(aes)",
+ .base.cra_driver_name = "cbcmac-aes-aesni",
+ .base.cra_priority = 400,
+ .base.cra_blocksize = 1,
+ .base.cra_ctxsize = sizeof(struct mac_tfm_ctx),
+ .base.cra_module = THIS_MODULE,
+
+ .digestsize = AES_BLOCK_SIZE,
+ .init = mac_init,
+ .update = mac_update,
+ .final = cbcmac_final,
+ .setkey = cbcmac_setkey,
+ .descsize = sizeof(struct mac_desc_ctx),
+} };
+
+const int aesni_num_mac_algs = ARRAY_SIZE(aesni_mac_algs);
--
2.17.1


2020-08-03 19:12:02

by Ben Greear

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

Hello,

This helps a bit...now download sw-crypt performance is about 150Mbps,
but still not as good as with my patch on 5.4 kernel, and fpu is still
high in perf top:

13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
6.62% [kernel] [k] kernel_fpu_begin
4.14% [kernel] [k] _aesni_enc1
2.06% [kernel] [k] __crypto_xor
1.95% [kernel] [k] copy_user_generic_string
1.93% libjvm.so [.] SpinPause
1.01% [kernel] [k] aesni_encrypt
0.98% [kernel] [k] crypto_ctr_crypt
0.93% [kernel] [k] udp_sendmsg
0.78% [kernel] [k] crypto_inc
0.74% [kernel] [k] __ip_append_data.isra.53
0.65% [kernel] [k] aesni_cbc_enc
0.64% [kernel] [k] __dev_queue_xmit
0.62% [kernel] [k] ipt_do_table
0.62% [kernel] [k] igb_xmit_frame_ring
0.59% [kernel] [k] ip_route_output_key_hash_rcu
0.57% [kernel] [k] memcpy
0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
0.56% [kernel] [k] irq_fpu_usable
0.56% [kernel] [k] mac_do_update

If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
then I can help. Possibly hwsim would also be a good test case, but I have not tried
that.

Not sure whether it is related or not, but the system was throwing this during
the test. We'll try to verify whether or not it is related to that
patch.

...
Message from syslogd@lf0350-35c0 at Aug 3 09:55:05 ...
kernel:[Hardware Error]: CPU:0 (16:30:1) MC2_STATUS[Over|CE|-|AddrV|-|CECC|-|-]: 0xd45840900012017a

Message from syslogd@lf0350-35c0 at Aug 3 09:55:05 ...
kernel:[Hardware Error]: Error Addr: 0x00000000a656a810

Message from syslogd@lf0350-35c0 at Aug 3 09:55:05 ...
kernel:[Hardware Error]: MC2 Error: ECC error in L2 data array (Vict).

Message from syslogd@lf0350-35c0 at Aug 3 09:55:05 ...
kernel:[Hardware Error]: cache level: L2, tx: GEN, mem-tx: EV

Message from syslogd@lf0350-35c0 at Aug 3 10:00:33 ...
kernel:[Hardware Error]: Corrected error, no action required.

Message from syslogd@lf0350-35c0 at Aug 3 10:00:33 ...
kernel:[Hardware Error]: CPU:0 (16:30:1) MC2_STATUS[Over|CE|-|AddrV|-|CECC|-|-]: 0xd55840920010011a


Thanks,
Ben


On 8/2/20 2:06 AM, Ard Biesheuvel wrote:
> [full patch snipped]


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-04 12:56:46

by Ard Biesheuvel

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
>
> Hello,
>
> This helps a bit...now download sw-crypt performance is about 150Mbps,
> but still not as good as with my patch on 5.4 kernel, and fpu is still
> high in perf top:
>
> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
> 6.62% [kernel] [k] kernel_fpu_begin
> 4.14% [kernel] [k] _aesni_enc1
> 2.06% [kernel] [k] __crypto_xor
> 1.95% [kernel] [k] copy_user_generic_string
> 1.93% libjvm.so [.] SpinPause
> 1.01% [kernel] [k] aesni_encrypt
> 0.98% [kernel] [k] crypto_ctr_crypt
> 0.93% [kernel] [k] udp_sendmsg
> 0.78% [kernel] [k] crypto_inc
> 0.74% [kernel] [k] __ip_append_data.isra.53
> 0.65% [kernel] [k] aesni_cbc_enc
> 0.64% [kernel] [k] __dev_queue_xmit
> 0.62% [kernel] [k] ipt_do_table
> 0.62% [kernel] [k] igb_xmit_frame_ring
> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
> 0.57% [kernel] [k] memcpy
> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
> 0.56% [kernel] [k] irq_fpu_usable
> 0.56% [kernel] [k] mac_do_update
>
> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
> then I can help. Possibly hwsim would also be a good test case, but I have not tried
> that.
>

I don't think this is likely to be reproducible on other
micro-architectures, so setting up a test rig is unlikely to help.

I'll send out a v2 which implements a ahash instead of a shash (and
implements some other tweaks) so that kernel_fpu_begin() is only
called twice for each packet on the cbcmac path.

Do you have any numbers for the old kernel without your patch? This
pathological FPU preserve/restore behavior could be caused by the
optimizations, or by other changes that landed in the meantime, so I
would like to know if kernel_fpu_begin() is as prominent in those
traces as well.

2020-08-04 13:02:01

by Ben Greear

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
> On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
>>
>> Hello,
>>
>> This helps a bit...now download sw-crypt performance is about 150Mbps,
>> but still not as good as with my patch on 5.4 kernel, and fpu is still
>> high in perf top:
>>
>> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
>> 6.62% [kernel] [k] kernel_fpu_begin
>> 4.14% [kernel] [k] _aesni_enc1
>> 2.06% [kernel] [k] __crypto_xor
>> 1.95% [kernel] [k] copy_user_generic_string
>> 1.93% libjvm.so [.] SpinPause
>> 1.01% [kernel] [k] aesni_encrypt
>> 0.98% [kernel] [k] crypto_ctr_crypt
>> 0.93% [kernel] [k] udp_sendmsg
>> 0.78% [kernel] [k] crypto_inc
>> 0.74% [kernel] [k] __ip_append_data.isra.53
>> 0.65% [kernel] [k] aesni_cbc_enc
>> 0.64% [kernel] [k] __dev_queue_xmit
>> 0.62% [kernel] [k] ipt_do_table
>> 0.62% [kernel] [k] igb_xmit_frame_ring
>> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
>> 0.57% [kernel] [k] memcpy
>> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
>> 0.56% [kernel] [k] irq_fpu_usable
>> 0.56% [kernel] [k] mac_do_update
>>
>> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
>> then I can help. Possibly hwsim would also be a good test case, but I have not tried
>> that.
>>
>
> I don't think this is likely to be reproducible on other
> micro-architectures, so setting up a test rig is unlikely to help.
>
> I'll send out a v2 which implements a ahash instead of a shash (and
> implements some other tweaks) so that kernel_fpu_begin() is only
> called twice for each packet on the cbcmac path.
>
> Do you have any numbers for the old kernel without your patch? This
> pathological FPU preserve/restore behavior could be caused by the
> optimizations, or by other changes that landed in the meantime, so I
> would like to know if kernel_fpu_begin() is as prominent in those
> traces as well.
>

This same patch makes i7 mobile processors able to handle 1Gbps+ software
decrypt rates, where without the patch, the rate was badly constrained and CPU
load was much higher, so it is definitely noticeable on other processors too.
The weak processor on the current test rig is convenient because the problem
is so noticeable even at slower wifi speeds.

We can do some tests on 5.4 with our patch reverted.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-04 13:11:36

by Ard Biesheuvel

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On Tue, 4 Aug 2020 at 15:01, Ben Greear <[email protected]> wrote:
>
> On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
> > On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> This helps a bit...now download sw-crypt performance is about 150Mbps,
> >> but still not as good as with my patch on 5.4 kernel, and fpu is still
> >> high in perf top:
> >>
> >> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
> >> 6.62% [kernel] [k] kernel_fpu_begin
> >> 4.14% [kernel] [k] _aesni_enc1
> >> 2.06% [kernel] [k] __crypto_xor
> >> 1.95% [kernel] [k] copy_user_generic_string
> >> 1.93% libjvm.so [.] SpinPause
> >> 1.01% [kernel] [k] aesni_encrypt
> >> 0.98% [kernel] [k] crypto_ctr_crypt
> >> 0.93% [kernel] [k] udp_sendmsg
> >> 0.78% [kernel] [k] crypto_inc
> >> 0.74% [kernel] [k] __ip_append_data.isra.53
> >> 0.65% [kernel] [k] aesni_cbc_enc
> >> 0.64% [kernel] [k] __dev_queue_xmit
> >> 0.62% [kernel] [k] ipt_do_table
> >> 0.62% [kernel] [k] igb_xmit_frame_ring
> >> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
> >> 0.57% [kernel] [k] memcpy
> >> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
> >> 0.56% [kernel] [k] irq_fpu_usable
> >> 0.56% [kernel] [k] mac_do_update
> >>
> >> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
> >> then I can help. Possibly hwsim would also be a good test case, but I have not tried
> >> that.
> >>
> >
> > I don't think this is likely to be reproducible on other
> > micro-architectures, so setting up a test rig is unlikely to help.
> >
> > I'll send out a v2 which implements a ahash instead of a shash (and
> > implements some other tweaks) so that kernel_fpu_begin() is only
> > called twice for each packet on the cbcmac path.
> >
> > Do you have any numbers for the old kernel without your patch? This
> > pathological FPU preserve/restore behavior could be caused by the
> > optimizations, or by other changes that landed in the meantime, so I
> > would like to know if kernel_fpu_begin() is as prominent in those
> > traces as well.
> >
>
> This same patch makes i7 mobile processors able to handle 1Gbps+ software
> decrypt rates, where without the patch, the rate was badly constrained and CPU
> load was much higher, so it is definitely noticeable on other processors too.

OK

> The weak processor on the current test rig is convenient because the problem
> is so noticeable even at slower wifi speeds.
>
> We can do some tests on 5.4 with our patch reverted.
>

The issue with your CCM patch is that it keeps the FPU enabled for the
entire input, which also means that preemption is disabled, which
makes the -rt people grumpy. (Of course, it also uses APIs that no
longer exist, but that should be easy to fix)
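
One middle ground, sketched very roughly below, would be to bound each
FPU section so that preemption points exist between chunks instead of
holding the FPU across the whole input. The 4 KiB chunk size is only an
illustration, not a measured value, and the helper name is made up:

static void cbcmac_chunked(struct crypto_aes_ctx *ctx, u8 dg[AES_BLOCK_SIZE],
                           const u8 *in, unsigned int blocks)
{
        while (blocks) {
                unsigned int n = min(blocks, 4096U / AES_BLOCK_SIZE);

                blocks -= n;
                kernel_fpu_begin();
                while (n--) {
                        /* CBC into the 16-byte digest, as mac_do_update() does */
                        aesni_cbc_enc(ctx, dg, in, AES_BLOCK_SIZE, dg);
                        in += AES_BLOCK_SIZE;
                }
                kernel_fpu_end();
        }
}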

Do you happen to have any ballpark figures for the packet sizes and
the time spent doing encryption?

2020-08-04 13:23:45

by Ben Greear

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On 8/4/20 6:08 AM, Ard Biesheuvel wrote:
> On Tue, 4 Aug 2020 at 15:01, Ben Greear <[email protected]> wrote:
>>
>> On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
>>> On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> This helps a bit...now download sw-crypt performance is about 150Mbps,
>>>> but still not as good as with my patch on 5.4 kernel, and fpu is still
>>>> high in perf top:
>>>>
>>>> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
>>>> 6.62% [kernel] [k] kernel_fpu_begin
>>>> 4.14% [kernel] [k] _aesni_enc1
>>>> 2.06% [kernel] [k] __crypto_xor
>>>> 1.95% [kernel] [k] copy_user_generic_string
>>>> 1.93% libjvm.so [.] SpinPause
>>>> 1.01% [kernel] [k] aesni_encrypt
>>>> 0.98% [kernel] [k] crypto_ctr_crypt
>>>> 0.93% [kernel] [k] udp_sendmsg
>>>> 0.78% [kernel] [k] crypto_inc
>>>> 0.74% [kernel] [k] __ip_append_data.isra.53
>>>> 0.65% [kernel] [k] aesni_cbc_enc
>>>> 0.64% [kernel] [k] __dev_queue_xmit
>>>> 0.62% [kernel] [k] ipt_do_table
>>>> 0.62% [kernel] [k] igb_xmit_frame_ring
>>>> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
>>>> 0.57% [kernel] [k] memcpy
>>>> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
>>>> 0.56% [kernel] [k] irq_fpu_usable
>>>> 0.56% [kernel] [k] mac_do_update
>>>>
>>>> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
>>>> then I can help. Possibly hwsim would also be a good test case, but I have not tried
>>>> that.
>>>>
>>>
>>> I don't think this is likely to be reproducible on other
>>> micro-architectures, so setting up a test rig is unlikely to help.
>>>
>>> I'll send out a v2 which implements a ahash instead of a shash (and
>>> implements some other tweaks) so that kernel_fpu_begin() is only
>>> called twice for each packet on the cbcmac path.
>>>
>>> Do you have any numbers for the old kernel without your patch? This
>>> pathological FPU preserve/restore behavior could be caused by the
>>> optimizations, or by other changes that landed in the meantime, so I
>>> would like to know if kernel_fpu_begin() is as prominent in those
>>> traces as well.
>>>
>>
>> This same patch makes i7 mobile processors able to handle 1Gbps+ software
>> decrypt rates, where without the patch, the rate was badly constrained and CPU
>> load was much higher, so it is definitely noticeable on other processors too.
>
> OK
>
>> The weak processor on the current test rig is convenient because the problem
>> is so noticeable even at slower wifi speeds.
>>
>> We can do some tests on 5.4 with our patch reverted.
>>
>
> The issue with your CCM patch is that it keeps the FPU enabled for the
> entire input, which also means that preemption is disabled, which
> makes the -rt people grumpy. (Of course, it also uses APIs that no
> longer exist, but that should be easy to fix)

So, if there is no other way to get back the performance, can it be a compile
or runtime option (disabled by default for -RT type folks) to re-enable the feature
that helps our CPU usage?

Or, can you do an add-on patch that keeps the FPU enabled, so that I can test
how that affects our performance?

>
> Do you happen to have any ballpark figures for the packet sizes and
> the time spent doing encryption?

This test was using MTU-sized UDP frames, I think, and mostly it is just sending
and receiving frames. The perf top output above gives you as much detail as I have
about where the kernel is spending its time.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-04 19:45:56

by Ben Greear

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On 8/4/20 6:08 AM, Ard Biesheuvel wrote:
> On Tue, 4 Aug 2020 at 15:01, Ben Greear <[email protected]> wrote:
>>
>> On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
>>> On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> This helps a bit...now download sw-crypt performance is about 150Mbps,
>>>> but still not as good as with my patch on 5.4 kernel, and fpu is still
>>>> high in perf top:
>>>>
>>>> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
>>>> 6.62% [kernel] [k] kernel_fpu_begin
>>>> 4.14% [kernel] [k] _aesni_enc1
>>>> 2.06% [kernel] [k] __crypto_xor
>>>> 1.95% [kernel] [k] copy_user_generic_string
>>>> 1.93% libjvm.so [.] SpinPause
>>>> 1.01% [kernel] [k] aesni_encrypt
>>>> 0.98% [kernel] [k] crypto_ctr_crypt
>>>> 0.93% [kernel] [k] udp_sendmsg
>>>> 0.78% [kernel] [k] crypto_inc
>>>> 0.74% [kernel] [k] __ip_append_data.isra.53
>>>> 0.65% [kernel] [k] aesni_cbc_enc
>>>> 0.64% [kernel] [k] __dev_queue_xmit
>>>> 0.62% [kernel] [k] ipt_do_table
>>>> 0.62% [kernel] [k] igb_xmit_frame_ring
>>>> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
>>>> 0.57% [kernel] [k] memcpy
>>>> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
>>>> 0.56% [kernel] [k] irq_fpu_usable
>>>> 0.56% [kernel] [k] mac_do_update
>>>>
>>>> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
>>>> then I can help. Possibly hwsim would also be a good test case, but I have not tried
>>>> that.
>>>>
>>>
>>> I don't think this is likely to be reproducible on other
>>> micro-architectures, so setting up a test rig is unlikely to help.
>>>
>>> I'll send out a v2 which implements a ahash instead of a shash (and
>>> implements some other tweaks) so that kernel_fpu_begin() is only
>>> called twice for each packet on the cbcmac path.
>>>
>>> Do you have any numbers for the old kernel without your patch? This
>>> pathological FPU preserve/restore behavior could be caused by the
>>> optimizations, or by other changes that landed in the meantime, so I
>>> would like to know if kernel_fpu_begin() is as prominent in those
>>> traces as well.
>>>
>>
>> This same patch makes i7 mobile processors able to handle 1Gbps+ software
>> decrypt rates, where without the patch, the rate was badly constrained and CPU
>> load was much higher, so it is definitely noticeable on other processors too.
>
> OK
>
>> The weak processor on the current test rig is convenient because the problem
>> is so noticeable even at slower wifi speeds.
>>
>> We can do some tests on 5.4 with our patch reverted.
>>
>
> The issue with your CCM patch is that it keeps the FPU enabled for the
> entire input, which also means that preemption is disabled, which
> makes the -rt people grumpy. (Of course, it also uses APIs that no
> longer exist, but that should be easy to fix)
>
> Do you happen to have any ballpark figures for the packet sizes and
> the time spent doing encryption?
>

My tester reports this last patch appears to break wpa-2 entirely, so we
cannot test it as is.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-04 20:13:21

by Ard Biesheuvel

Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On Tue, 4 Aug 2020 at 21:45, Ben Greear <[email protected]> wrote:
>
> On 8/4/20 6:08 AM, Ard Biesheuvel wrote:
> > On Tue, 4 Aug 2020 at 15:01, Ben Greear <[email protected]> wrote:
> >>
> >> On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
> >>> On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> This helps a bit...now download sw-crypt performance is about 150Mbps,
> >>>> but still not as good as with my patch on 5.4 kernel, and fpu is still
> >>>> high in perf top:
> >>>>
> >>>> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
> >>>> 6.62% [kernel] [k] kernel_fpu_begin
> >>>> 4.14% [kernel] [k] _aesni_enc1
> >>>> 2.06% [kernel] [k] __crypto_xor
> >>>> 1.95% [kernel] [k] copy_user_generic_string
> >>>> 1.93% libjvm.so [.] SpinPause
> >>>> 1.01% [kernel] [k] aesni_encrypt
> >>>> 0.98% [kernel] [k] crypto_ctr_crypt
> >>>> 0.93% [kernel] [k] udp_sendmsg
> >>>> 0.78% [kernel] [k] crypto_inc
> >>>> 0.74% [kernel] [k] __ip_append_data.isra.53
> >>>> 0.65% [kernel] [k] aesni_cbc_enc
> >>>> 0.64% [kernel] [k] __dev_queue_xmit
> >>>> 0.62% [kernel] [k] ipt_do_table
> >>>> 0.62% [kernel] [k] igb_xmit_frame_ring
> >>>> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
> >>>> 0.57% [kernel] [k] memcpy
> >>>> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
> >>>> 0.56% [kernel] [k] irq_fpu_usable
> >>>> 0.56% [kernel] [k] mac_do_update
> >>>>
> >>>> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
> >>>> then I can help. Possibly hwsim would also be a good test case, but I have not tried
> >>>> that.
> >>>>
> >>>
> >>> I don't think this is likely to be reproducible on other
> >>> micro-architectures, so setting up a test rig is unlikely to help.
> >>>
> >>> I'll send out a v2 which implements a ahash instead of a shash (and
> >>> implements some other tweaks) so that kernel_fpu_begin() is only
> >>> called twice for each packet on the cbcmac path.
> >>>
> >>> Do you have any numbers for the old kernel without your patch? This
> >>> pathological FPU preserve/restore behavior could be caused by the
> >>> optimizations, or by other changes that landed in the meantime, so I
> >>> would like to know if kernel_fpu_begin() is as prominent in those
> >>> traces as well.
> >>>
> >>
> >> This same patch makes i7 mobile processors able to handle 1Gbps+ software
> >> decrypt rates, where without the patch, the rate was badly constrained and CPU
> >> load was much higher, so it is definitely noticeable on other processors too.
> >
> > OK
> >
> >> The weak processor on the current test rig is convenient because the problem
> >> is so noticeable even at slower wifi speeds.
> >>
> >> We can do some tests on 5.4 with our patch reverted.
> >>
> >
> > The issue with your CCM patch is that it keeps the FPU enabled for the
> > entire input, which also means that preemption is disabled, which
> > makes the -rt people grumpy. (Of course, it also uses APIs that no
> > longer exist, but that should be easy to fix)
> >
> > Do you happen to have any ballpark figures for the packet sizes and
> > the time spent doing encryption?
> >
>
> My tester reports this last patch appears to break wpa-2 entirely, so we
> cannot test it as is.
>

Ah, that is unfortunate. It passed the internal selftests we have in
the kernel, but apparently, the coverage is not 100%.

I will take another look into this next week.

2020-08-18 08:24:47

by Herbert Xu

Subject: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Sun, Aug 02, 2020 at 12:06:16PM +0300, Ard Biesheuvel wrote:
> Ben reports that CCM using AES-NI instructions performs pathologically
> poorly, which is due to the overhead of preserving/restoring the SIMD
> state, which is repeated after every 16 bytes of input when executing
> the CBCMAC portion of the algorithm.
>
> So let's clone the arm64 implementation of cbcmac(aes), which takes
> care to only preserve/restore the SIMD state after processing the
> whole input. Since cmac(aes) and xcbc(aes) can reuse most of the code,
> let's expose those as well.
>
> Cc: Ben Greear <[email protected]>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/x86/crypto/Makefile | 2 +-
> arch/x86/crypto/aesni-intel.h | 39 +++
> arch/x86/crypto/aesni-intel_glue.c | 42 +---
> arch/x86/crypto/aesni-intel_mac.c | 257 ++++++++++++++++++++
> 4 files changed, 306 insertions(+), 34 deletions(-)

We should just use the accelerated cbc skcipher.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-18 08:33:25

by Ard Biesheuvel

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Tue, 18 Aug 2020 at 10:24, Herbert Xu <[email protected]> wrote:
>
> On Sun, Aug 02, 2020 at 12:06:16PM +0300, Ard Biesheuvel wrote:
> > Ben reports that CCM using AES-NI instructions performs pathologically
> > poorly, which is due to the overhead of preserving/restoring the SIMD
> > state, which is repeated after every 16 bytes of input when executing
> > the CBCMAC portion of the algorithm.
> >
> > So let's clone the arm64 implementation of cbcmac(aes), which takes
> > care to only preserve/restore the SIMD state after processing the
> > whole input. Since cmac(aes) and xcbc(aes) can reuse most of the code,
> > let's expose those as well.
> >
> > Cc: Ben Greear <[email protected]>
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/x86/crypto/Makefile | 2 +-
> > arch/x86/crypto/aesni-intel.h | 39 +++
> > arch/x86/crypto/aesni-intel_glue.c | 42 +---
> > arch/x86/crypto/aesni-intel_mac.c | 257 ++++++++++++++++++++
> > 4 files changed, 306 insertions(+), 34 deletions(-)
>
> We should just use the accelerated cbc skcipher.
>

What do you mean? You cannot implement cbcmac using a cbc skcipher
unless you provide a scratch buffer of arbitrary size as the
destination, in order to capture the skcipher output IV as the MAC.

2020-08-18 13:52:26

by Herbert Xu

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Tue, Aug 18, 2020 at 10:31:39AM +0200, Ard Biesheuvel wrote:
>
> What do you mean? You cannot implement cbcmac using a cbc skcipher
> unless you provide a scratch buffer of arbitrary size as the
> destination, in order to capture the skcipher output IV as the MAC.

Please have a look at patch 6. The trick is to use an SG list
chained to itself.
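
Very roughly, the destination side of that trick has the following
shape (a sketch only; req, src, len and iv are assumed to be set up as
for a normal cbc(aes) encryption, and the real code is in patch 6):

        struct scatterlist dst[2];
        u8 mac[AES_BLOCK_SIZE];
        int err;

        sg_init_table(dst, 2);
        sg_set_buf(&dst[0], mac, AES_BLOCK_SIZE);
        sg_chain(dst, 2, dst);          /* chain the list back onto itself */

        skcipher_request_set_crypt(req, src, dst, len, iv);
        err = crypto_skcipher_encrypt(req);
        /*
         * Every ciphertext block lands in mac[], so the last block left
         * there (i.e. the output IV) is the CBC-MAC of src.
         */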

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-18 13:56:44

by Ben Greear

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 8/18/20 6:51 AM, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 10:31:39AM +0200, Ard Biesheuvel wrote:
>>
>> What do you mean? You cannot implement cbcmac using a cbc skcipher
>> unless you provide a scratch buffer of arbitrary size as the
>> destination, in order to capture the skcipher output IV as the MAC.
>
> Please have a look at patch 6. The trick is to use an SG list
> chained to itself.

Herbert, thanks for working on this. If I apply the patches you posted,
that is expected to provide wifi aes decryption speedup similar to what
the original patch I sent does? Or, are additional patches needed?

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-18 14:06:03

by Herbert Xu

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Tue, Aug 18, 2020 at 06:56:12AM -0700, Ben Greear wrote:
>
> Herbert, thanks for working on this. If I apply the patches you posted,
> that is expected to provide wifi aes decryption speedup similar to what
> the original patch I sent does? Or, are additional patches needed?

It depends on whether the wifi code uses the async ahash interface,
if not it probably won't make much of an impact at all.

I just checked and if the code you're using is net/mac80211/aes_cmac.c
then it's using shash so it won't really benefit from this. However,
there's no reason why mac80211 can't be converted over to async as
we did for IPsec.
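
To illustrate the difference (every name below is made up, nothing is
taken from mac80211): the shash path blocks in crypto_shash_digest(),
while an ahash user submits a request and finishes the frame from a
completion callback:

static void frame_mac_done(struct crypto_async_request *areq, int err)
{
        /* runs once the digest is ready; resume the frame in areq->data */
}

static int frame_mac_compute(struct crypto_ahash *tfm, struct scatterlist *sg,
                             unsigned int len, u8 *out, void *frame)
{
        struct ahash_request *req = ahash_request_alloc(tfm, GFP_ATOMIC);
        int err;

        if (!req)
                return -ENOMEM;

        ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                   frame_mac_done, frame);
        ahash_request_set_crypt(req, sg, out, len);

        err = crypto_ahash_digest(req);
        if (err != -EINPROGRESS && err != -EBUSY)
                ahash_request_free(req);        /* completed synchronously */
        return err;
}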

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-18 14:19:12

by Ben Greear

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 8/18/20 7:05 AM, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 06:56:12AM -0700, Ben Greear wrote:
>>
>> Herbert, thanks for working on this. If I apply the patches you posted,
>> that is expected to provide wifi aes decryption speedup similar to what
>> the original patch I sent does? Or, are additional patches needed?
>
> It depends on whether the wifi code uses the async ahash interface,
> if not it probably won't make much of an impact at all.
>
> I just checked and if the code you're using is net/mac80211/aes_cmac.c
> then it's using shash so it won't really benefit from this. However,
> there's no reason why mac80211 can't be converted over to async as
> we did for IPsec.

I think converting it to async is beyond what I have time to work on,
and I am not sure it would be accepted upstream anyway.

Is there any easy way to use your work to make shash fast for aesni? I
basically just want it to perform as well as it used to with my patch.

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-18 22:17:15

by Herbert Xu

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
>
> Is there any easy way to use your work to make shash fast for aesni? I
> basically just want it to perform as well as it used to with my patch.

Yes. We could add a sync version of aesni that simply falls back
to aes-generic when simd is unavailable.
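
Roughly, the per-block path of such a sync driver would be the pattern
the earlier patch already uses in mac_do_update() (sketch only, the
function name is illustrative):

static void aes_encrypt_sync(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src)
{
        if (crypto_simd_usable()) {
                kernel_fpu_begin();
                aesni_enc(ctx, dst, src);       /* AES-NI path */
                kernel_fpu_end();
        } else {
                aes_encrypt(ctx, dst, src);     /* generic AES fallback */
        }
}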

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-18 22:31:46

by Herbert Xu

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Wed, Aug 19, 2020 at 08:15:50AM +1000, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
> >
> > Is there any easy way to use your work to make shash fast for aesni? I
> > basically just want it to perform as well as it used to with my patch.
>
> Yes. We could add a sync version of aesni that simply falls back
> to aes-generic when simd is unavailable.

But I think before anyone attempts this we should explore making
mac80211 async like IPsec. Is there any fundamental reason why
that is not possible? Have the wireless people expressed any
objections to making this async before?

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-18 22:34:18

by Ben Greear

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 8/18/20 3:27 PM, Herbert Xu wrote:
> On Wed, Aug 19, 2020 at 08:15:50AM +1000, Herbert Xu wrote:
>> On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
>>>
>>> Is there any easy way to use your work to make shash fast for aesni? I
>>> basically just want it to perform as well as it used to with my patch.
>>
>> Yes. We could add a sync version of aesni that simply falls back
>> to aes-generic when simd is unavailable.
>
> But I think before anyone attempts this we should explore making
> mac80211 async like IPsec. Is there any fundamental reason why
> that is not possible? Have the wireless people expressed any
> objections to making this async before?

I don't think it has been discussed recently, but mac80211 is already
a complicated beast, so if this added any significant complexity
it might not be worth it.

Truth is though, I know very little about what changes would be
needed to make it do async decrypt, so maybe it would be a simple
matter?

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-18 22:35:54

by Herbert Xu

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Tue, Aug 18, 2020 at 03:31:10PM -0700, Ben Greear wrote:
>
> I don't think it has been discussed recently, but mac80211 is already
> a complicated beast, so if this added any significant complexity
> it might not be worth it.

Any bulk data path should be using the async interface, otherwise
performance will seriously suffer should SIMD be unavailable. I
think someone should look at converting wireless to async like IPsec.

> Truth is though, I know very little about what changes would be
> needed to make it do async decrypt, so maybe it would be a simple
> matter?

IPsec was actually quite straightforward.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-18 22:40:14

by Ben Greear

Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 8/18/20 3:33 PM, Herbert Xu wrote:
> On Tue, Aug 18, 2020 at 03:31:10PM -0700, Ben Greear wrote:
>>
>> I don't think it has been discussed recently, but mac80211 is already
>> a complicated beast, so if this added any significant complexity
>> it might not be worth it.
>
> Any bulk data path should be using the async interface, otherwise
> performance will seriously suffer should SIMD be unavailable. I
> think someone should look at converting wireless to async like IPsec.

Most users in most cases are using hw crypt, so that is likely why
it hasn't gotten a huge amount of effort to optimize the software
crypt path.

If someone wants to give this async api a try for mac80211, I can
test, and I can sponsor the work, but I don't have time to try
to implement it myself.

Thanks,
Ben

>
>> Truth is though, I know very little about what changes would be
>> needed to make it do async decrypt, so maybe it would be a simple
>> matter?
>
> IPsec was actually quite straightforward.
>
> Cheers,
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-20 06:58:56

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Wed, 19 Aug 2020 at 00:39, Ben Greear <[email protected]> wrote:
>
> On 8/18/20 3:33 PM, Herbert Xu wrote:
> > On Tue, Aug 18, 2020 at 03:31:10PM -0700, Ben Greear wrote:
> >>
> >> I don't think it has been discussed recently, but mac80211 is already
> >> a complicated beast, so if this added any significant complexity
> >> it might not be worth it.
> >
> > Any bulk data path should be using the async interface, otherwise
> > performance will seriously suffer should SIMD be unavailable. I
> > think someone should look at converting wireless to async like IPsec.
>
> Most users in most cases are using hw crypt, so that is likely why
> it hasn't gotten a huge amount of effort to optimize the software
> crypt path.
>

As I understand it, it is highly unusual for these code paths to be
exercised in the first place. All mac80211 hardware anyone still cares
about supports CCMP offload, and so only in special cases like Ben's
do we need the software implementation. Also, in Ben's case, it is on
a hot path whereas obsolete hardware that does not implement CCMP
offload does not support anything over 11g speeds to begin with.

Then, there is the additional issue where the FPU preserve/restore
appears to be disproportionately expensive on the actual SoC in
question.

My preferred approach for cbcmac(aes-ni) would be to mirror the arm64
implementation exactly, which means going through the data only a single
time and interleaving the CTR and CBCMAC operations at the AES round
level. My cbcmac ahash approach (v2) is plan B as far as I am concerned,
but it turned out to be flawed and I haven't had time to look into this.

But if we look at the actual issue at hand, we might also look into
amortizing the FPU preserve/restore over multiple invocations of a
cipher. I proposed a patch a while ago that makes cipher an internal
crypto API abstraction, and we could easily add pre/post hooks that
preserve/restore the FPU in this case, in which case we would not need
any changes at higher levels.
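
For illustration, the shape of that idea is sketched below. The
crypto_cipher_batch_begin()/end() hooks are hypothetical, nothing like
them exists in the crypto API today; for aesni they would boil down to
kernel_fpu_begin()/kernel_fpu_end() done once per run of blocks:

static void cbcmac_run(struct crypto_cipher *tfm, u8 dg[AES_BLOCK_SIZE],
                       const u8 *src, unsigned int blocks)
{
        crypto_cipher_batch_begin(tfm);         /* hypothetical pre hook */
        while (blocks--) {
                crypto_xor(dg, src, AES_BLOCK_SIZE);
                crypto_cipher_encrypt_one(tfm, dg, dg);
                src += AES_BLOCK_SIZE;
        }
        crypto_cipher_batch_end(tfm);           /* hypothetical post hook */
}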


> If someone wants to give this async api a try for mac80211, I can
> test, and I can sponsor the work, but I don't have time to try
> to implement it myself.
>
> Thanks,
> Ben
>
> >
> >> Truth is though, I know very little about what changes would be
> >> needed to make it do async decrypt, so maybe it would be a simple
> >> matter?
> >
> > IPsec was actually quite straightforward.
> >
> > Cheers,
> >
>
>
> --
> Ben Greear <[email protected]>
> Candela Technologies Inc http://www.candelatech.com

2020-08-20 07:02:14

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 08:58:15AM +0200, Ard Biesheuvel wrote:
>
> But if we look at the actual issue at hand, we might also look into
> amortizing the FPU preserve/restore over multiple invocations of a
> cipher. I proposed a patch a while ago that makes cipher an internal
> crypto API abstraction, and we could easily add pre/post hooks that
> preserve/restore the FPU in this case, in which case we would not need
> any changes at higher levels.

I think any use of SIMD crypto_cipher on bulk data is just wrong.
Because the performance degradation when SIMD cannot be used is
too great for this to make sense.

So optimising the FPU overhead is attacking the wrong problem.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 07:05:08

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, 20 Aug 2020 at 09:01, Herbert Xu <[email protected]> wrote:
>
> On Thu, Aug 20, 2020 at 08:58:15AM +0200, Ard Biesheuvel wrote:
> >
> > But if we look at the actual issue at hand, we might also look into
> > amortizing the FPU preserve/restore over multiple invocations of a
> > cipher. I proposed a patch a while ago that makes cipher an internal
> > crypto API abstraction, and we could easily add pre/post hooks that
> > preserve/restore the FPU in this case, in which case we would not need
> > any changes at higher levels.
>
> I think any use of SIMD crypto_cipher on bulk data is just wrong.
> Because the performance degradation when SIMD cannot be used is
> too great for this to make sense.
>
> So optimising the FPU overhead is attacking the wrong problem.
>

I don't disagree with that, especially given all the effort that went
into optimizing FPU preserve/restore on both arm64 and x86. But the
bottom line is that this is what is causing the degradation in Ben's
case, so we cannot disregard it.

2020-08-20 07:07:15

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 09:04:26AM +0200, Ard Biesheuvel wrote:
>
> I don't disagree with that, especially given all the effort that went
> into optimizing FPU preserve/restore on both arm64 and x86. But the
> bottom line is that this is what is causing the degradation in Ben's
> case, so we cannot disregard it.

If he's having problems with the performance when SIMD is in use
due to preserve/restore, I'd hate to see his numbers when SIMD is
not available.

IOW if this really matters to him, then wireless code needs to switch
over to ahash.

Solving half of the problem simply makes no sense.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 07:20:39

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, 20 Aug 2020 at 09:06, Herbert Xu <[email protected]> wrote:
>
> On Thu, Aug 20, 2020 at 09:04:26AM +0200, Ard Biesheuvel wrote:
> >
> > I don't disagree with that, especially given all the effort that went
> > into optimizing FPU preserve/restore on both arm64 and x86. But the
> > bottom line is that this is what is causing the degradation in Ben's
> > case, so we cannot disregard it.
>
> If he's having problems with the performance when SIMD is in use
> due to preserve/restore, I'd hate to see his numbers when SIMD is
> not available.
>

Actually, I'm not so sure that they will be so much worse. The
expensive FPU preserve/restore occurs for every 16 bytes of data
processed by the AES cipher, which I'd estimate to take ~10 cycles per
byte for an unaccelerated implementation. But table based AES should
be avoided, especially for MAC algorithms where the plaintext may be
known to an attacker who is after the key.

However, the CCMP handling is invoked from softirq context or from
task context, and so SIMD is generally available unless the softirq
happens to be taken over the back of a hardirq that interrupted a task
running in the kernel that was using the SIMD already. IOW, this
happens so rarely in practice that I would not expect it to be
noticeable in the performance stats.
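
For reference, the check that picks between the two paths in a
synchronous SIMD implementation is roughly the following sketch, where
aes_encrypt() is the generic AES library routine and aesni_enc() the
bare AES-NI block function:

static void aes_mac_block(const struct crypto_aes_ctx *key,
                          u8 dg[AES_BLOCK_SIZE])
{
        if (crypto_simd_usable()) {
                /* task or softirq context with the FPU available */
                kernel_fpu_begin();
                aesni_enc(key, dg, dg);
                kernel_fpu_end();
        } else {
                /* rare case: fall back to the generic library code */
                aes_encrypt(key, dg, dg);
        }
}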

> IOW if this really matters to him, then wireless code needs to switch
> over to ahash.
>
> Solving half of the problem simply makes no sense.
>

My v2 attempt at cbcmac(aesni) implements an ahash, but a synchronous
one. This means we can amortize the FPU preserve/restore over the
entire scatterlist, instead of relying on the ahash walk to present
the data in virtually mapped chunks.

I'd still like to explore this approach, but I simply haven't had the
spare cycles to spend on this.
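
The rough shape of that idea is sketched below (not the actual v2
patch; partial blocks and finalization are omitted, and it assumes the
tfm context holds the expanded AES key and the request context the
running MAC block). The update routine walks the scatterlist itself and
brackets each mapped chunk with a single FPU preserve/restore instead
of one per 16-byte block:

static int cbcmac_ahash_update(struct ahash_request *req)
{
        struct crypto_ahash *tfm = crypto_ahash_reqtfm(req);
        struct crypto_aes_ctx *key = crypto_ahash_ctx(tfm);
        u8 *dg = ahash_request_ctx(req);
        struct crypto_hash_walk walk;
        int nbytes;

        for (nbytes = crypto_hash_walk_first(req, &walk); nbytes > 0;
             nbytes = crypto_hash_walk_done(&walk, 0)) {
                const u8 *src = walk.data;

                kernel_fpu_begin();
                while (nbytes >= AES_BLOCK_SIZE) {
                        crypto_xor(dg, src, AES_BLOCK_SIZE);
                        aesni_enc(key, dg, dg);
                        src += AES_BLOCK_SIZE;
                        nbytes -= AES_BLOCK_SIZE;
                }
                kernel_fpu_end();
        }

        return nbytes;
}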

2020-08-20 07:30:50

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 09:19:16AM +0200, Ard Biesheuvel wrote:
>
> Actually, I'm not so sure that they will be so much worse. The
> expensive FPU preserve/restore occurs for every 16 bytes of data
> processed by the AES cipher, which I'd estimate to take ~10 cycles per
> byte for an unaccelerated implementation. But table based AES should
> be avoided, especially for MAC algorithms where the plaintext may be
> known to an attacker who is after the key.

On my machine the performance difference on a 1472-byte request
between SIMD and generic is 2161 vs. 7558 (cycles).
>
> However, the CCMP handling is invoked from softirq context or from
> task context, and so SIMD is generally available unless the softirq
> happens to be taken over the back of a hardirq that interrupted a task
> running in the kernel that was using the SIMD already. IOW, this
> happens so rarely in practice that I would not expect it to be
> noticeable in the performance stats.

What if the same machine was doing TLS/IPsec sends at full throttle?
That would be exactly the wrong time to slow down softirqs four-fold,
no?

> My v2 attempt at cbcmac(aesni) implements an ahash, but a synchronous
> one. This means we can amortize the FPU preserve/restore over the
> entire scatterlist, instead of relying on the ahash walk to present
> the data in virtually mapped chunks.
>
> I'd still like to explore this approach, but I simply haven't had the
> spare cycles to spend on this.

I don't have an issue with your patch per se. But please make it so that
it has the async path like everything else. Also, wireless uses shash, so
it can't use an ahash anyway even if it is sync.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 07:34:25

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, 20 Aug 2020 at 09:29, Herbert Xu <[email protected]> wrote:
>
> On Thu, Aug 20, 2020 at 09:19:16AM +0200, Ard Biesheuvel wrote:
> >
> > Actually, I'm not so sure that they will be so much worse. The
> > expensive FPU preserve/restore occurs for every 16 bytes of data
> > processed by the AES cipher, which I'd estimate to take ~10 cycles per
> > byte for an unaccelerated implementation. But table based AES should
> > be avoided, especially for MAC algorithms where the plaintext may be
> > known to an attacker who is after the key.
>
> On my machine the performance difference on a 1472-byte request
> between SIMD and generic is 2161 vs. 7558 (cycles).

Sure. But your machine does not have the pathological FPU
preserve/restore performance.

> >
> > However, the CCMP handling is invoked from softirq context or from
> > task context, and so SIMD is generally available unless the softirq
> > happens to be taken over the back of a hardirq that interrupted a task
> > running in the kernel that was using the SIMD already. IOW, this
> > happens so rarely in practice that I would not expect it to be
> > noticeable in the performance stats.
>
> What if the same machine was doing TLS/IPsec sends at full throttle?
> That would be exactly the wrong time to slow down softirqs four-fold,
> no?
>

Fair point.

> > My v2 attempt at cbcmac(aesni) implements an ahash, but a synchronous
> > one. This means we can amortize the FPU preserve/restore over the
> > entire scatterlist, instead of relying on the ahash walk to present
> > the data in virtually mapped chunks.
> >
> > I'd still like to explore this approach, but I simply haven't had the
> > spare cycles to spend on this.
>
> I don't have an issue with your patch per se. But please make it so that
> it has the async path like everything else. Also, wireless uses shash, so
> it can't use an ahash anyway even if it is sync.
>

The mac80211 CCMP code uses a synchronous ccm aead, which gets backed
by a skcipher+ahash combo by the ccm template. So a synchronous ahash
is fine for this particular case.
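
Concretely, this is roughly the allocation that net/mac80211/aes_ccm.c
performs (sketch only; ccmp_alloc() is just an illustrative name). The
CRYPTO_ALG_ASYNC bit in the mask requests a synchronous aead, which the
ccm template then instantiates from a ctr skcipher plus a cbcmac ahash:

static struct crypto_aead *ccmp_alloc(const u8 *key, size_t key_len,
                                      size_t mic_len)
{
        struct crypto_aead *tfm;
        int err;

        tfm = crypto_alloc_aead("ccm(aes)", 0, CRYPTO_ALG_ASYNC);
        if (IS_ERR(tfm))
                return tfm;

        err = crypto_aead_setkey(tfm, key, key_len);
        if (!err)
                err = crypto_aead_setauthsize(tfm, mic_len);
        if (err) {
                crypto_free_aead(tfm);
                return ERR_PTR(err);
        }
        return tfm;
}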

2020-08-20 07:45:07

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 09:33:21AM +0200, Ard Biesheuvel wrote:
>
> > On my machine the performance difference on a 1472-byte request
> > between SIMD and generic is 2161 vs. 7558 (cycles).
>
> Sure. But your machine does not have the pathological FPU
> preserve/restore performance.

Why does that matter? These are numbers for cbc-aesni which means
just a single preserve/restore for the whole request.

Or are you saying on Ben's machine cbc-aesni would have worse
performance vs. aes-generic?

> The mac80211 CCMP code uses a synchronous ccm aead, which gets backed
> by a skcipher+ahash combo by the ccm template. So a synchronous ahash
> is fine for this particular case.

OK, I was just grepping for cmac so I didn't see this.

For this case, I think it's even more important that it be converted
over to async because its sending path is also in user context just
like IPsec.

So simply by sending wireless packets you can hog the CPU while
doing SIMD in kernel context which would then kill the receive
path if you're using the generic fallback.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 07:51:18

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, 20 Aug 2020 at 09:44, Herbert Xu <[email protected]> wrote:
>
> On Thu, Aug 20, 2020 at 09:33:21AM +0200, Ard Biesheuvel wrote:
> >
> > > On my machine the performance difference on a 1472-byte request
> > > between SIMD and generic is 2161 vs. 7558 (cycles).
> >
> > Sure. But your machine does not have the pathological FPU
> > preserve/restore performance.
>
> Why does that matter? These are numbers for cbc-aesni which means
> just a single preserve/restore for the whole request.
>

No, that is the whole problem. The CCM template has a CBCMAC
implementation that wraps the bare cipher, which means it invokes
crypto_cipher_encrypt_one() for each 16 bytes of input, and each of
those calls involves an FPU preserve/restore.
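
Simplified, that cbcmac code in crypto/ccm.c boils down to the loop
below (partial-block handling omitted); with aes-aesni, every
crypto_cipher_encrypt_one() call does its own kernel_fpu_begin()/
kernel_fpu_end() pair:

static void cbcmac_blocks(struct crypto_cipher *tfm, u8 dg[AES_BLOCK_SIZE],
                          const u8 *p, unsigned int len)
{
        while (len >= AES_BLOCK_SIZE) {
                crypto_xor(dg, p, AES_BLOCK_SIZE);
                crypto_cipher_encrypt_one(tfm, dg, dg); /* FPU save/restore here */
                p += AES_BLOCK_SIZE;
                len -= AES_BLOCK_SIZE;
        }
}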

> Or are you saying on Ben's machine cbc-aesni would have worse
> performance vs. aes-generic?
>

Yes, given the pathological overhead of FPU preserve/restore for every
block of 16 bytes processed by the cbcmac wrapper.

> > The mac80211 CCMP code uses a synchronous ccm aead, which gets backed
> > by a skcipher+ahash combo by the ccm template. So a synchronous ahash
> > is fine for this particular case.
>
> OK, I was just grepping for cmac so I didn't see this.
>
> For this case, I think it's even more important that it be converted
> over to async because its sending path is also in user context just
> like IPsec.
>

Indeed.

cmac() is not really relevant for performance, afaict. Only cbcmac()
is used for bulk data.

> So simply by sending wireless packets you can hog the CPU while
> doing SIMD in kernel context which would then kill the receive
> path if you're using the generic fallback.
>
> Cheers,
> --
> Email: Herbert Xu <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 07:54:52

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 09:48:02AM +0200, Ard Biesheuvel wrote:
>
> > Or are you saying on Ben's machine cbc-aesni would have worse
> > performance vs. aes-generic?
> >
>
> Yes, given the pathological overhead of FPU preserve/restore for every
> block of 16 bytes processed by the cbcmac wrapper.

I'm sceptical. Do we have numbers showing this? You can get them
from tcrypt with my patch:

https://patchwork.kernel.org/patch/11701343/

Just do

modprobe tcrypt mode=400 alg='cmac(aes-aesni)' klen=16
modprobe tcrypt mode=400 alg='cmac(aes-generic)' klen=16

> cmac() is not really relevant for performance, afaict. Only cbcmac()
> is used for bulk data.

Sure but it's trivial to extend my cmac patch to support cbcmac.
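
Assuming the extended patch keeps the same tcrypt interface, the cbcmac
numbers could then be collected the same way:

modprobe tcrypt mode=400 alg='cbcmac(aes-aesni)' klen=16
modprobe tcrypt mode=400 alg='cbcmac(aes-generic)' klen=16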

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 07:57:29

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, 20 Aug 2020 at 09:54, Herbert Xu <[email protected]> wrote:
>
> On Thu, Aug 20, 2020 at 09:48:02AM +0200, Ard Biesheuvel wrote:
> >
> > > Or are you saying on Ben's machine cbc-aesni would have worse
> > > performance vs. aes-generic?
> > >
> >
> > Yes, given the pathological overhead of FPU preserve/restore for every
> > block of 16 bytes processed by the cbcmac wrapper.
>
> I'm sceptical. Do we have numbers showing this? You can get them
> from tcrypt with my patch:
>
> https://patchwork.kernel.org/patch/11701343/
>
> Just do
>
> modprobe tcrypt mode=400 alg='cmac(aes-aesni)' klen=16
> modprobe tcrypt mode=400 alg='cmac(aes-generic)' klen=16
>
> > cmac() is not really relevant for performance, afaict. Only cbcmac()
> > is used for bulk data.
>
> Sure but it's trivial to extend my cmac patch to support cbcmac.
>


Sure.

Ben, care to have a go at the above on your hardware? It would help us
get to the bottom of this issue.

2020-08-20 14:12:11

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 8/20/20 12:56 AM, Ard Biesheuvel wrote:
> On Thu, 20 Aug 2020 at 09:54, Herbert Xu <[email protected]> wrote:
>>
>> On Thu, Aug 20, 2020 at 09:48:02AM +0200, Ard Biesheuvel wrote:
>>>
>>>> Or are you saying on Ben's machine cbc-aesni would have worse
>>>> performance vs. aes-generic?
>>>>
>>>
>>> Yes, given the pathological overhead of FPU preserve/restore for every
>>> block of 16 bytes processed by the cbcmac wrapper.
>>
>> I'm sceptical. Do we have numbers showing this? You can get them
>> from tcrypt with my patch:
>>
>> https://patchwork.kernel.org/patch/11701343/
>>
>> Just do
>>
>> modprobe tcrypt mode=400 alg='cmac(aes-aesni)' klen=16
>> modprobe tcrypt mode=400 alg='cmac(aes-generic)' klen=16
>>
>>> cmac() is not really relevant for performance, afaict. Only cbcmac()
>>> is used for bulk data.
>>
>> Sure but it's trivial to extend my cmac patch to support cbcmac.
>>
>
>
> Sure.
>
> Ben, care to have a go at the above on your hardware? It would help us
> get to the bottom of this issue.

Here's a run on an: Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz

testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
[ 259.397756] tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 244 cycles/operation, 15 cycles/byte
[ 259.397759] tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 1052 cycles/operation, 16 cycles/byte
[ 259.397765] tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 641 cycles/operation, 10 cycles/byte
[ 259.397768] tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 3909 cycles/operation, 15 cycles/byte
[ 259.397786] tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 2602 cycles/operation, 10 cycles/byte
[ 259.397797] tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 2211 cycles/operation, 8 cycles/byte
[ 259.397807] tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 15453 cycles/operation, 15 cycles/byte
[ 259.397872] tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 8863 cycles/operation, 8 cycles/byte
[ 259.397910] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 8442 cycles/operation, 8 cycles/byte
[ 259.397946] tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 43542 cycles/operation, 21 cycles/byte
[ 259.398110] tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 17649 cycles/operation, 8 cycles/byte
[ 259.398184] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 21255 cycles/operation, 10 cycles/byte
[ 259.398267] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 16322 cycles/operation, 7 cycles/byte
[ 259.398335] tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 60301 cycles/operation, 14 cycles/byte
[ 259.398585] tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 34413 cycles/operation, 8 cycles/byte
[ 259.398728] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 32894 cycles/operation, 8 cycles/byte
[ 259.398865] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 32521 cycles/operation, 7 cycles/byte
[ 259.399000] tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 120415 cycles/operation, 14 cycles/byte
[ 259.399550] tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 68635 cycles/operation, 8 cycles/byte
[ 259.399834] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 83770 cycles/operation, 10 cycles/byte
[ 259.400157] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 65075 cycles/operation, 7 cycles/byte
[ 259.400427] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 65085 cycles/operation, 7 cycles/byte
[ 294.171336]
testing speed of async cmac(aes-generic) (cmac(aes-generic))
[ 294.171340] tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 275 cycles/operation, 17 cycles/byte
[ 294.171343] tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 1191 cycles/operation, 18 cycles/byte
[ 294.171350] tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 738 cycles/operation, 11 cycles/byte
[ 294.171354] tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 4386 cycles/operation, 17 cycles/byte
[ 294.171374] tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 2915 cycles/operation, 11 cycles/byte
[ 294.171387] tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 2464 cycles/operation, 9 cycles/byte
[ 294.171398] tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 17558 cycles/operation, 17 cycles/byte
[ 294.171472] tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 14022 cycles/operation, 13 cycles/byte
[ 294.171530] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 9022 cycles/operation, 8 cycles/byte
[ 294.171569] tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 38107 cycles/operation, 18 cycles/byte
[ 294.171722] tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 18083 cycles/operation, 8 cycles/byte
[ 294.171798] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 17260 cycles/operation, 8 cycles/byte
[ 294.171870] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 17415 cycles/operation, 8 cycles/byte
[ 294.171943] tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 66005 cycles/operation, 16 cycles/byte
[ 294.172217] tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 36035 cycles/operation, 8 cycles/byte
[ 294.172366] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 42812 cycles/operation, 10 cycles/byte
[ 294.172533] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 53415 cycles/operation, 13 cycles/byte
[ 294.172745] tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 133326 cycles/operation, 16 cycles/byte
[ 294.173297] tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 90271 cycles/operation, 11 cycles/byte
[ 294.173646] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 68703 cycles/operation, 8 cycles/byte
[ 294.173931] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 67951 cycles/operation, 8 cycles/byte
[ 294.174213] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 68370 cycles/operation, 8 cycles/byte


On my slow apu2 board with processor: AMD GX-412TC SOC

testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
[ 51.750514] tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 600 cycles/operation, 37 cycle
[ 51.750532] tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 2063 cycles/operation, 32 cycle
[ 51.750582] tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 1326 cycles/operation, 20 cycle
[ 51.750619] tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 11190 cycles/operation, 43 cycle
[ 51.750775] tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 4935 cycles/operation, 19 cycle
[ 51.750840] tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 8652 cycles/operation, 33 cycle
[ 51.750948] tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 43430 cycles/operation, 42 cycle
[ 51.751488] tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 23589 cycles/operation, 23 cycle
[ 51.751810] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 18759 cycles/operation, 18 cycle
[ 51.752027] tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 79699 cycles/operation, 38 cycle
[ 51.753035] tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 39900 cycles/operation, 19 cycle
[ 51.753559] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 38390 cycles/operation, 18 cycle
[ 51.754057] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 40888 cycles/operation, 19 cycle
[ 51.754615] tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 143019 cycles/operation, 34 cycle
[ 51.756369] tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 89046 cycles/operation, 21 cycle
[ 51.757527] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 77992 cycles/operation, 19 cycle
[ 51.758526] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 76021 cycles/operation, 18 cycle
[ 51.759442] tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 312260 cycles/operation, 38 cycle
[ 51.763195] tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 176472 cycles/operation, 21 cycle
[ 51.765255] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 169565 cycles/operation, 20 cycle
[ 51.767321] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 164968 cycles/operation, 20 cycle
[ 51.769256] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 165096 cycles/operation, 20 cycle

testing speed of async cmac(aes-generic) (cmac(aes-generic))
[ 97.835925] tcrypt: test 0 ( 16 byte blocks, 16 bytes per update, 1 updates): 665 cycles/operation, 41 cycle
[ 97.835945] tcrypt: test 1 ( 64 byte blocks, 16 bytes per update, 4 updates): 2430 cycles/operation, 37 cycle
[ 97.836016] tcrypt: test 2 ( 64 byte blocks, 64 bytes per update, 1 updates): 1656 cycles/operation, 25 cycle
[ 97.836044] tcrypt: test 3 ( 256 byte blocks, 16 bytes per update, 16 updates): 9014 cycles/operation, 35 cycle
[ 97.836259] tcrypt: test 4 ( 256 byte blocks, 64 bytes per update, 4 updates): 13444 cycles/operation, 52 cycle
[ 97.836399] tcrypt: test 5 ( 256 byte blocks, 256 bytes per update, 1 updates): 8960 cycles/operation, 35 cycle
[ 97.836515] tcrypt: test 6 ( 1024 byte blocks, 16 bytes per update, 64 updates): 51594 cycles/operation, 50 cycle
[ 97.837151] tcrypt: test 7 ( 1024 byte blocks, 256 bytes per update, 4 updates): 28105 cycles/operation, 27 cycle
[ 97.837497] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 31365 cycles/operation, 30 cycle
[ 97.837865] tcrypt: test 9 ( 2048 byte blocks, 16 bytes per update, 128 updates): 86111 cycles/operation, 42 cycle
[ 97.838927] tcrypt: test 10 ( 2048 byte blocks, 256 bytes per update, 8 updates): 60021 cycles/operation, 29 cycle
[ 97.839628] tcrypt: test 11 ( 2048 byte blocks, 1024 bytes per update, 2 updates): 56311 cycles/operation, 27 cycle
[ 97.840308] tcrypt: test 12 ( 2048 byte blocks, 2048 bytes per update, 1 updates): 50877 cycles/operation, 24 cycle
[ 97.840943] tcrypt: test 13 ( 4096 byte blocks, 16 bytes per update, 256 updates): 174028 cycles/operation, 42 cycle
[ 97.843205] tcrypt: test 14 ( 4096 byte blocks, 256 bytes per update, 16 updates): 103243 cycles/operation, 25 cycle
[ 97.844524] tcrypt: test 15 ( 4096 byte blocks, 1024 bytes per update, 4 updates): 99960 cycles/operation, 24 cycle
[ 97.845865] tcrypt: test 16 ( 4096 byte blocks, 4096 bytes per update, 1 updates): 121735 cycles/operation, 29 cycle
[ 97.847355] tcrypt: test 17 ( 8192 byte blocks, 16 bytes per update, 512 updates): 387559 cycles/operation, 47 cycle
[ 97.851930] tcrypt: test 18 ( 8192 byte blocks, 256 bytes per update, 32 updates): 223662 cycles/operation, 27 cycle
[ 97.854617] tcrypt: test 19 ( 8192 byte blocks, 1024 bytes per update, 8 updates): 226131 cycles/operation, 27 cycle
[ 97.857385] tcrypt: test 20 ( 8192 byte blocks, 4096 bytes per update, 2 updates): 203840 cycles/operation, 24 cycle
[ 97.859888] tcrypt: test 21 ( 8192 byte blocks, 8192 bytes per update, 1 updates): 220232 cycles/operation, 26 cycle

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-20 21:51:40

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 06:54:58AM -0700, Ben Greear wrote:
>
> Here's a run on an: Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz
>
> testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>
> [ 259.397910] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 8442 cycles/operation, 8 cycles/byte
>
> testing speed of async cmac(aes-generic) (cmac(aes-generic))
>
> [ 294.171530] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 9022 cycles/operation, 8 cycles/byte
>
> On my slow apu2 board with processor: AMD GX-412TC SOC
>
> testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>
> [ 51.751810] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 18759 cycles/operation, 18 cycle
>
> testing speed of async cmac(aes-generic) (cmac(aes-generic))
>
> [ 97.837497] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 31365 cycles/operation, 30 cycle

So clearly aes-generic is slower than aes-aesni even with saving and
restoring for each block. Therefore improving the performance of
the latter per se does not make sense.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-20 22:20:04

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 8/20/20 1:10 PM, Herbert Xu wrote:
> On Thu, Aug 20, 2020 at 06:54:58AM -0700, Ben Greear wrote:
>>
>> Here's a run on an: Intel(R) Core(TM) i7-7700T CPU @ 2.90GHz
>>
>> testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>>
>> [ 259.397910] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 8442 cycles/operation, 8 cycles/byte
>>
>> testing speed of async cmac(aes-generic) (cmac(aes-generic))
>>
>> [ 294.171530] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 9022 cycles/operation, 8 cycles/byte
>>
>> On my slow apu2 board with processor: AMD GX-412TC SOC
>>
>> testing speed of async cmac(aes-aesni) (cmac(aes-aesni))
>>
>> [ 51.751810] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 18759 cycles/operation, 18 cycle
>>
>> testing speed of async cmac(aes-generic) (cmac(aes-generic))
>>
>> [ 97.837497] tcrypt: test 8 ( 1024 byte blocks, 1024 bytes per update, 1 updates): 31365 cycles/operation, 30 cycle
>
> So clearly aes-generic is slower than aes-aesni even with saving and
> restoring for each block. Therefore improving the performance of
> the latter per se does not make sense.

I have always assumed that I need aesni instructions to have any chance at this performing well,
but there are certainly chips out there that don't have aesni, so possibly it is still worth improving
if it is relatively easy to do so.

I am currently using x86-64 CPUs with aesni, and also some AP platforms running QCA ARM chips.
I am not sure if ARM is using aesni or not...it is certainly not that fast, but maybe for other
reasons.

Thanks,
Ben

>
> Cheers,
>


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2020-08-20 22:21:24

by Herbert Xu

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On Thu, Aug 20, 2020 at 03:09:55PM -0700, Ben Greear wrote:
>
> I have always assumed that I need aesni instructions to have any chance at this performing well,
> but there are certainly chips out there that don't have aesni, so possibly it is still worth improving
> if it is relatively easy to do so.

What we were discussing is the merit of improving aesni only
while still being exposed to aes-generic on the softirq path.

This is clearly not acceptable.

Improving aes-generic obviously would make sense.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2020-08-22 22:48:54

by Christian Lamparter

[permalink] [raw]
Subject: Re: [PATCH 0/5] crypto: Implement cmac based on cbc skcipher

On 2020-08-19 00:27, Herbert Xu wrote:
> On Wed, Aug 19, 2020 at 08:15:50AM +1000, Herbert Xu wrote:
>> On Tue, Aug 18, 2020 at 07:17:35AM -0700, Ben Greear wrote:
>>>
>>> Is there any easy way to use your work to make shash fast for aesni? I
>>> basically just want it to perform as well as it used to with my patch.
>>
>> Yes. We could add a sync version of aesni that simply falls back
>> to aes-generic when simd is unavailable.
>
> But I think before anyone attempts this we should explore making
> mac80211 async like IPsec. Is there any fundamental reason why
> that is not possible? Have the wireless people expressed any
> objections to making this async before?

Ohh, is this still about a heavily-updated and rewritten version
of my old initial patch from 2014 for 3.16-wl?
<https://lore.kernel.org/linux-wireless/1518134.xFh23iA8q1@blech/>

Because back in 2016, I asked this on linux-wireless:

| It would be great if mac80211 would do the encryption and
| decryption asynchronously. As this would work for other ciphers
| and also allow crypto offload to dedicated crypto hardware.

And the answer back then (same as now) was:
<https://lore.kernel.org/linux-wireless/[email protected]/>

> The only problem with that is that we'd essentially need a software
> queue for *all* frames, and release them from there in RX order after
> decrypt. That's surely doable, but so far nobody has thought it
> important enough since mostly HW crypto is used ...

Ben, keep up the good work!

Cheers,
Christian

2020-09-23 11:09:12

by Ben Greear

[permalink] [raw]
Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On 8/4/20 12:45 PM, Ben Greear wrote:
> On 8/4/20 6:08 AM, Ard Biesheuvel wrote:
>> On Tue, 4 Aug 2020 at 15:01, Ben Greear <[email protected]> wrote:
>>>
>>> On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
>>>> On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> This helps a bit...now download sw-crypt performance is about 150Mbps,
>>>>> but still not as good as with my patch on 5.4 kernel, and fpu is still
>>>>> high in perf top:
>>>>>
>>>>>       13.89%  libc-2.29.so   [.] __memset_sse2_unaligned_erms
>>>>>         6.62%  [kernel]       [k] kernel_fpu_begin
>>>>>         4.14%  [kernel]       [k] _aesni_enc1
>>>>>         2.06%  [kernel]       [k] __crypto_xor
>>>>>         1.95%  [kernel]       [k] copy_user_generic_string
>>>>>         1.93%  libjvm.so      [.] SpinPause
>>>>>         1.01%  [kernel]       [k] aesni_encrypt
>>>>>         0.98%  [kernel]       [k] crypto_ctr_crypt
>>>>>         0.93%  [kernel]       [k] udp_sendmsg
>>>>>         0.78%  [kernel]       [k] crypto_inc
>>>>>         0.74%  [kernel]       [k] __ip_append_data.isra.53
>>>>>         0.65%  [kernel]       [k] aesni_cbc_enc
>>>>>         0.64%  [kernel]       [k] __dev_queue_xmit
>>>>>         0.62%  [kernel]       [k] ipt_do_table
>>>>>         0.62%  [kernel]       [k] igb_xmit_frame_ring
>>>>>         0.59%  [kernel]       [k] ip_route_output_key_hash_rcu
>>>>>         0.57%  [kernel]       [k] memcpy
>>>>>         0.57%  libjvm.so      [.] InstanceKlass::oop_follow_contents
>>>>>         0.56%  [kernel]       [k] irq_fpu_usable
>>>>>         0.56%  [kernel]       [k] mac_do_update
>>>>>
>>>>> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
>>>>> then I can help.  Possibly hwsim would also be a good test case, but I have not tried
>>>>> that.
>>>>>
>>>>
>>>> I don't think this is likely to be reproducible on other
>>>> micro-architectures, so setting up a test rig is unlikely to help.
>>>>
>>>> I'll send out a v2 which implements a ahash instead of a shash (and
>>>> implements some other tweaks) so that kernel_fpu_begin() is only
>>>> called twice for each packet on the cbcmac path.
>>>>
>>>> Do you have any numbers for the old kernel without your patch? This
>>>> pathological FPU preserve/restore behavior could be caused be the
>>>> optimizations, or by other changes that landed in the meantime, so I
>>>> would like to know if kernel_fpu_begin() is as prominent in those
>>>> traces as well.
>>>>
>>>
>>> This same patch makes i7 mobile processors able to handle 1Gbps+ software
>>> decrypt rates, where without the patch, the rate was badly constrained and CPU
>>> load was much higher, so it is definitely noticeable on other processors too.
>>
>> OK
>>
>>> The weak processor on the current test rig is convenient because the problem
>>> is so noticeable even at slower wifi speeds.
>>>
>>> We can do some tests on 5.4 with our patch reverted.
>>>
>>
>> The issue with your CCM patch is that it keeps the FPU enabled for the
>> entire input, which also means that preemption is disabled, which
>> makes the -rt people grumpy. (Of course, it also uses APIs that no
>> longer exists, but that should be easy to fix)
>>
>> Do you happen to have any ballpark figures for the packet sizes and
>> the time spent doing encryption?
>>
>
> My tester reports this last patch appears to break wpa-2 entirely, so we
> cannot test it as is.

Hello,

I'm still hoping that I can find someone to help enable this feature again
on 5.9-ish kernels.

I don't care about preemption being disabled for long-ish times; that is no
worse than what I had before, so even if an official in-kernel patch is not
possible, I can carry an out-of-tree patch.

Thanks,
Ben

2020-10-29 16:59:17

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH] crypto: x86/aesni - implement accelerated CBCMAC, CMAC and XCBC shashes

On Wed, 23 Sep 2020 at 13:03, Ben Greear <[email protected]> wrote:
>
> On 8/4/20 12:45 PM, Ben Greear wrote:
> > On 8/4/20 6:08 AM, Ard Biesheuvel wrote:
> >> On Tue, 4 Aug 2020 at 15:01, Ben Greear <[email protected]> wrote:
> >>>
> >>> On 8/4/20 5:55 AM, Ard Biesheuvel wrote:
> >>>> On Mon, 3 Aug 2020 at 21:11, Ben Greear <[email protected]> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> This helps a bit...now download sw-crypt performance is about 150Mbps,
> >>>>> but still not as good as with my patch on 5.4 kernel, and fpu is still
> >>>>> high in perf top:
> >>>>>
> >>>>> 13.89% libc-2.29.so [.] __memset_sse2_unaligned_erms
> >>>>> 6.62% [kernel] [k] kernel_fpu_begin
> >>>>> 4.14% [kernel] [k] _aesni_enc1
> >>>>> 2.06% [kernel] [k] __crypto_xor
> >>>>> 1.95% [kernel] [k] copy_user_generic_string
> >>>>> 1.93% libjvm.so [.] SpinPause
> >>>>> 1.01% [kernel] [k] aesni_encrypt
> >>>>> 0.98% [kernel] [k] crypto_ctr_crypt
> >>>>> 0.93% [kernel] [k] udp_sendmsg
> >>>>> 0.78% [kernel] [k] crypto_inc
> >>>>> 0.74% [kernel] [k] __ip_append_data.isra.53
> >>>>> 0.65% [kernel] [k] aesni_cbc_enc
> >>>>> 0.64% [kernel] [k] __dev_queue_xmit
> >>>>> 0.62% [kernel] [k] ipt_do_table
> >>>>> 0.62% [kernel] [k] igb_xmit_frame_ring
> >>>>> 0.59% [kernel] [k] ip_route_output_key_hash_rcu
> >>>>> 0.57% [kernel] [k] memcpy
> >>>>> 0.57% libjvm.so [.] InstanceKlass::oop_follow_contents
> >>>>> 0.56% [kernel] [k] irq_fpu_usable
> >>>>> 0.56% [kernel] [k] mac_do_update
> >>>>>
> >>>>> If you'd like help setting up a test rig and have an ath10k pcie NIC or ath9k pcie NIC,
> >>>>> then I can help. Possibly hwsim would also be a good test case, but I have not tried
> >>>>> that.
> >>>>>
> >>>>
> >>>> I don't think this is likely to be reproducible on other
> >>>> micro-architectures, so setting up a test rig is unlikely to help.
> >>>>
> >>>> I'll send out a v2 which implements a ahash instead of a shash (and
> >>>> implements some other tweaks) so that kernel_fpu_begin() is only
> >>>> called twice for each packet on the cbcmac path.
> >>>>
> >>>> Do you have any numbers for the old kernel without your patch? This
> >>>> pathological FPU preserve/restore behavior could be caused be the
> >>>> optimizations, or by other changes that landed in the meantime, so I
> >>>> would like to know if kernel_fpu_begin() is as prominent in those
> >>>> traces as well.
> >>>>
> >>>
> >>> This same patch makes i7 mobile processors able to handle 1Gbps+ software
> >>> decrypt rates, where without the patch, the rate was badly constrained and CPU
> >>> load was much higher, so it is definitely noticeable on other processors too.
> >>
> >> OK
> >>
> >>> The weak processor on the current test rig is convenient because the problem
> >>> is so noticeable even at slower wifi speeds.
> >>>
> >>> We can do some tests on 5.4 with our patch reverted.
> >>>
> >>
> >> The issue with your CCM patch is that it keeps the FPU enabled for the
> >> entire input, which also means that preemption is disabled, which
> >> makes the -rt people grumpy. (Of course, it also uses APIs that no
> >> longer exists, but that should be easy to fix)
> >>
> >> Do you happen to have any ballpark figures for the packet sizes and
> >> the time spent doing encryption?
> >>
> >
> > My tester reports this last patch appears to break wpa-2 entirely, so we
> > cannot test it as is.
>
> Hello,
>
> I'm still hoping that I can find some one to help enable this feature again
> on 5.9-ish kernels.
>
> I don't care about preemption being disabled for long-ish times, that is no worse than what I had
> before, so even if an official in-kernel patch is not possible, I can carry an
> out-of-tree patch.
>

Just a note that this is still on my stack, just not at the very top :-)