2020-12-18 17:04:29

by Ard Biesheuvel

Subject: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled

[ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
SIMD in kernel mode could reduce complexity and improve performance, but we
need to decide whether we can do this, and how much softirq processing
latency we can tolerate. If we can find a satisfactory solution for this,
we might do the same for x86 and 32-bit ARM as well. ]

The crypto API provides two ways to invoke symmetric encryption algorithms
(sketched below):
- synchronously, where the transformation is guaranteed to be done by the
time the function returns;
- asynchronously, where the function may return with a -EINPROGRESS return code,
and a completion will be signalled when the transformation is done.
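
A minimal sketch of what the two styles look like from the caller's side; the
"cbc(aes)" choice, the error handling shortcuts and the demo_encrypt() name
are illustrative only, not taken from any of the users mentioned below:

#include <crypto/skcipher.h>
#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/types.h>

/*
 * Passing CRYPTO_ALG_ASYNC as the mask restricts the selection to
 * synchronous implementations only; passing 0 instead also allows async
 * ones, in which case crypto_skcipher_encrypt() may return -EINPROGRESS
 * or -EBUSY and the DECLARE_CRYPTO_WAIT()/crypto_wait_req() pair turns
 * that back into a blocking call.
 */
static int demo_encrypt(const u8 *key, unsigned int keylen,
                        struct scatterlist *src, struct scatterlist *dst,
                        unsigned int len, u8 *iv)
{
        struct crypto_skcipher *tfm;
        struct skcipher_request *req;
        DECLARE_CRYPTO_WAIT(wait);
        int err;

        tfm = crypto_alloc_skcipher("cbc(aes)", 0, CRYPTO_ALG_ASYNC);
        if (IS_ERR(tfm))
                return PTR_ERR(tfm);

        err = crypto_skcipher_setkey(tfm, key, keylen);
        if (err)
                goto out_free_tfm;

        req = skcipher_request_alloc(tfm, GFP_KERNEL);
        if (!req) {
                err = -ENOMEM;
                goto out_free_tfm;
        }

        skcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                      crypto_req_done, &wait);
        skcipher_request_set_crypt(req, src, dst, len, iv);

        /* For a synchronous tfm, crypto_wait_req() is effectively a no-op. */
        err = crypto_wait_req(crypto_skcipher_encrypt(req), &wait);

        skcipher_request_free(req);
out_free_tfm:
        crypto_free_skcipher(tfm);
        return err;
}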

The latter is mainly intended for h/w accelerators, where the throughput would
be severely limited by the latency otherwise. However, it is also being used
for software algorithms based on SIMD instructions, which cannot be issued
from just any context (the rules differ between architectures, but typically,
SIMD can be used in task context, or in softirq context provided the softirq
was not taken while the SIMD unit was already in use in kernel mode).
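
In code, that rule shows up as a gating check in the SIMD drivers; a
simplified sketch of the usual pattern, where simd_encrypt_blocks() and
scalar_encrypt_blocks() are made-up stand-ins for an accelerated NEON routine
and its scalar fallback:

#include <asm/neon.h>
#include <crypto/internal/simd.h>
#include <linux/types.h>

/* Hypothetical stand-ins for the accelerated routine and its fallback. */
void simd_encrypt_blocks(u8 *dst, const u8 *src, int blocks);
void scalar_encrypt_blocks(u8 *dst, const u8 *src, int blocks);

static void encrypt_blocks(u8 *dst, const u8 *src, int blocks)
{
        if (crypto_simd_usable()) {
                /* Safe to touch the NEON registers in this context. */
                kernel_neon_begin();
                simd_encrypt_blocks(dst, src, blocks);
                kernel_neon_end();
        } else {
                /*
                 * E.g. in hardirq context, or in a softirq that interrupted
                 * kernel mode NEON: fall back to the much slower scalar code.
                 */
                scalar_encrypt_blocks(dst, src, blocks);
        }
}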

Many users of the crypto API in the kernel today opt out of this asynchronous
interface (802.11, macsec, kerberos, sw kTLS), or use a library interface that
is fundamentally synchronous (wireguard). This means we end up in a degraded
mode both for the contended case (a scalar fallback) and for the uncontended
case (generic GCM/CCM/CTR chaining mode templates wrapped around the SIMD
cipher, as opposed to accelerated implementations of the full chaining modes
in question). Note that scalar AES runs ~20x slower than the SIMD
instruction based version.

So let's address this for arm64, by reorganizing kernel mode SIMD support so
that the SIMD unit can always be assumed to be available. This means we need
to defer softirq processing when grabbing the NEON unit in task context, so
that any use of it in softirq context is guaranteed not to interrupt any code
that was already using the NEON.
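
Conceptually, kernel_neon_begin() would disable softirq processing rather than
only preemption; the sketch below illustrates the idea only and is not the
actual patch (the real implementation lives in arch/arm64/kernel/fpsimd.c and
does more bookkeeping than this):

#include <asm/fpsimd.h>
#include <asm/neon.h>
#include <linux/bottom_half.h>

void kernel_neon_begin(void)
{
        /*
         * Disabling bottom halves also disables preemption, and guarantees
         * that no softirq handler can run (and hence start using the NEON)
         * on this CPU until kernel_neon_end().
         */
        local_bh_disable();

        /*
         * Preserve and invalidate any live FP/SIMD state of the current
         * task, so the NEON registers can be clobbered freely.
         */
        fpsimd_save_and_flush_cpu_state();
}

void kernel_neon_end(void)
{
        /* Any softirqs that became pending in the meantime run here. */
        local_bh_enable();
}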

This obviously impacts softirq processing latency, which is why the existing
conditional NEON yield support is modified to take pending softirqs into
account.

As an example of how this impacts the code, the existing arm64 GCM driver is
updated to:
- Add yield support - currently, the pending softirq check is performed every
64 bytes of input, which is way too often. One of the desired outcomes of
this RFC is getting a reasonable ballpark for how long we want to run with
softirqs disabled (see the sketch after this list).
- Remove the existing scalar fallbacks, which are no longer needed.
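
To make the trade-off concrete: the added softirq latency is bounded by the
amount of work done between kernel_neon_begin() and kernel_neon_end(). A
hypothetical C-level illustration of bounding that window - CHUNK and
simd_process_chunk() are made up, and the series performs the equivalent
pending-softirq check in the assembly code via the conditional yield macro:

#include <asm/neon.h>
#include <linux/minmax.h>
#include <linux/sizes.h>
#include <linux/types.h>

#define CHUNK   SZ_4K   /* hypothetical yield granularity */

void simd_process_chunk(u8 *dst, const u8 *src, unsigned int len);

static void process(u8 *dst, const u8 *src, unsigned int len)
{
        while (len) {
                unsigned int n = min_t(unsigned int, len, CHUNK);

                kernel_neon_begin();    /* softirqs deferred from here...   */
                simd_process_chunk(dst, src, n);
                kernel_neon_end();      /* ...to here; pending ones run now */

                dst += n;
                src += n;
                len -= n;
        }
}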

Questions:
- what did I miss or break horribly?
- does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
kthread, so I don't think it cares.
- what would be a reasonable upper bound to keep softirqs disabled? I suppose
100s of cycles or less is overkill, but I'm not sure how to derive a better
answer.
- could we do the same on x86, now that kernel_fpu_begin/end is no longer
expensive?

Cc: Dave Martin <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Ingo Molnar <[email protected]>

Ard Biesheuvel (5):
crypto: aead - disallow en/decrypt for non-task or non-softirq context
crypto: skcipher - disallow en/decrypt for non-task or non-softirq
context
crypto: arm64/gcm-aes-ce - add NEON yield support
arm64: fpsimd: run kernel mode NEON with softirqs disabled
crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path

arch/arm64/crypto/ghash-ce-core.S | 115 ++++++-----
arch/arm64/crypto/ghash-ce-glue.c | 209 +++++---------------
arch/arm64/include/asm/assembler.h | 19 +-
arch/arm64/kernel/asm-offsets.c | 2 +
arch/arm64/kernel/fpsimd.c | 4 +-
crypto/aead.c | 10 +
crypto/skcipher.c | 10 +
7 files changed, 155 insertions(+), 214 deletions(-)

--
2.17.1


2020-12-18 17:07:24

by Ard Biesheuvel

Subject: [RFC PATCH 5/5] crypto: arm64/gcm-aes-ce - remove non-SIMD fallback path

Now that kernel mode SIMD is guaranteed to be available when executing
in task or softirq context, we no longer need scalar fallbacks to use
when the NEON is unavailable. So get rid of them.

Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/crypto/ghash-ce-glue.c | 209 +++++---------------
1 file changed, 51 insertions(+), 158 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 720cd3a58da3..15794fe21a0b 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -362,84 +362,36 @@ static int gcm_encrypt(struct aead_request *req)

err = skcipher_walk_aead_encrypt(&walk, req, false);

- if (likely(crypto_simd_usable())) {
- do {
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- int nbytes = walk.nbytes;
-
- tag = (u8 *)&lengths;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
- src = dst = memcpy(buf + sizeof(buf) - nbytes,
- src, nbytes);
- } else if (nbytes < walk.total) {
- nbytes &= ~(AES_BLOCK_SIZE - 1);
- tag = NULL;
- }
-
- kernel_neon_begin();
- pmull_gcm_encrypt(nbytes, dst, src, ctx->ghash_key.h,
- dg, iv, ctx->aes_key.key_enc, nrounds,
- tag);
- kernel_neon_end();
-
- if (unlikely(!nbytes))
- break;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
- memcpy(walk.dst.virt.addr,
- buf + sizeof(buf) - nbytes, nbytes);
-
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- } while (walk.nbytes);
- } else {
- while (walk.nbytes >= AES_BLOCK_SIZE) {
- int blocks = walk.nbytes / AES_BLOCK_SIZE;
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- int remaining = blocks;
-
- do {
- aes_encrypt(&ctx->aes_key, buf, iv);
- crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
- crypto_inc(iv, AES_BLOCK_SIZE);
-
- dst += AES_BLOCK_SIZE;
- src += AES_BLOCK_SIZE;
- } while (--remaining > 0);
-
- ghash_do_update(blocks, dg, walk.dst.virt.addr,
- &ctx->ghash_key, NULL);
-
- err = skcipher_walk_done(&walk,
- walk.nbytes % AES_BLOCK_SIZE);
- }
-
- /* handle the tail */
- if (walk.nbytes) {
- aes_encrypt(&ctx->aes_key, buf, iv);
+ do {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ int nbytes = walk.nbytes;

- crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
- buf, walk.nbytes);
+ tag = (u8 *)&lengths;

- memcpy(buf, walk.dst.virt.addr, walk.nbytes);
- memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+ src = dst = memcpy(buf + sizeof(buf) - nbytes,
+ src, nbytes);
+ } else if (nbytes < walk.total) {
+ nbytes &= ~(AES_BLOCK_SIZE - 1);
+ tag = NULL;
}

- tag = (u8 *)&lengths;
- ghash_do_update(1, dg, tag, &ctx->ghash_key,
- walk.nbytes ? buf : NULL);
+ kernel_neon_begin();
+ pmull_gcm_encrypt(nbytes, dst, src, ctx->ghash_key.h,
+ dg, iv, ctx->aes_key.key_enc, nrounds,
+ tag);
+ kernel_neon_end();

- if (walk.nbytes)
- err = skcipher_walk_done(&walk, 0);
+ if (unlikely(!nbytes))
+ break;

- put_unaligned_be64(dg[1], tag);
- put_unaligned_be64(dg[0], tag + 8);
- put_unaligned_be32(1, iv + GCM_IV_SIZE);
- aes_encrypt(&ctx->aes_key, iv, iv);
- crypto_xor(tag, iv, AES_BLOCK_SIZE);
- }
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr,
+ buf + sizeof(buf) - nbytes, nbytes);
+
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ } while (walk.nbytes);

if (err)
return err;
@@ -464,6 +416,7 @@ static int gcm_decrypt(struct aead_request *req)
u64 dg[2] = {};
be128 lengths;
u8 *tag;
+ int ret;
int err;

lengths.a = cpu_to_be64(req->assoclen * 8);
@@ -481,101 +434,41 @@ static int gcm_decrypt(struct aead_request *req)

err = skcipher_walk_aead_decrypt(&walk, req, false);

- if (likely(crypto_simd_usable())) {
- int ret;
-
- do {
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
- int nbytes = walk.nbytes;
-
- tag = (u8 *)&lengths;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
- src = dst = memcpy(buf + sizeof(buf) - nbytes,
- src, nbytes);
- } else if (nbytes < walk.total) {
- nbytes &= ~(AES_BLOCK_SIZE - 1);
- tag = NULL;
- }
-
- kernel_neon_begin();
- ret = pmull_gcm_decrypt(nbytes, dst, src,
- ctx->ghash_key.h,
- dg, iv, ctx->aes_key.key_enc,
- nrounds, tag, otag, authsize);
- kernel_neon_end();
-
- if (unlikely(!nbytes))
- break;
-
- if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
- memcpy(walk.dst.virt.addr,
- buf + sizeof(buf) - nbytes, nbytes);
-
- err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
- } while (walk.nbytes);
-
- if (err)
- return err;
- if (ret)
- return -EBADMSG;
- } else {
- while (walk.nbytes >= AES_BLOCK_SIZE) {
- int blocks = walk.nbytes / AES_BLOCK_SIZE;
- const u8 *src = walk.src.virt.addr;
- u8 *dst = walk.dst.virt.addr;
-
- ghash_do_update(blocks, dg, walk.src.virt.addr,
- &ctx->ghash_key, NULL);
-
- do {
- aes_encrypt(&ctx->aes_key, buf, iv);
- crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
- crypto_inc(iv, AES_BLOCK_SIZE);
-
- dst += AES_BLOCK_SIZE;
- src += AES_BLOCK_SIZE;
- } while (--blocks > 0);
+ do {
+ const u8 *src = walk.src.virt.addr;
+ u8 *dst = walk.dst.virt.addr;
+ int nbytes = walk.nbytes;

- err = skcipher_walk_done(&walk,
- walk.nbytes % AES_BLOCK_SIZE);
- }
+ tag = (u8 *)&lengths;

- /* handle the tail */
- if (walk.nbytes) {
- memcpy(buf, walk.src.virt.addr, walk.nbytes);
- memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+ src = dst = memcpy(buf + sizeof(buf) - nbytes,
+ src, nbytes);
+ } else if (nbytes < walk.total) {
+ nbytes &= ~(AES_BLOCK_SIZE - 1);
+ tag = NULL;
}

- tag = (u8 *)&lengths;
- ghash_do_update(1, dg, tag, &ctx->ghash_key,
- walk.nbytes ? buf : NULL);
-
- if (walk.nbytes) {
- aes_encrypt(&ctx->aes_key, buf, iv);
+ kernel_neon_begin();
+ ret = pmull_gcm_decrypt(nbytes, dst, src, ctx->ghash_key.h,
+ dg, iv, ctx->aes_key.key_enc,
+ nrounds, tag, otag, authsize);
+ kernel_neon_end();

- crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
- buf, walk.nbytes);
+ if (unlikely(!nbytes))
+ break;

- err = skcipher_walk_done(&walk, 0);
- }
+ if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+ memcpy(walk.dst.virt.addr,
+ buf + sizeof(buf) - nbytes, nbytes);

- if (err)
- return err;
+ err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+ } while (walk.nbytes);

- put_unaligned_be64(dg[1], tag);
- put_unaligned_be64(dg[0], tag + 8);
- put_unaligned_be32(1, iv + GCM_IV_SIZE);
- aes_encrypt(&ctx->aes_key, iv, iv);
- crypto_xor(tag, iv, AES_BLOCK_SIZE);
+ if (err)
+ return err;

- if (crypto_memneq(tag, otag, authsize)) {
- memzero_explicit(tag, AES_BLOCK_SIZE);
- return -EBADMSG;
- }
- }
- return 0;
+ return ret ? -EBADMSG : 0;
}

static struct aead_alg gcm_aes_alg = {
--
2.17.1

2020-12-19 02:07:07

by Herbert Xu

Subject: Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled

On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
>
> Questions:
> - what did I miss or break horribly?
> - does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
> kthread, so I don't think it cares.
> - what would be a reasonable upper bound to keep softirqs disabled? I suppose
> 100s of cycles or less is overkill, but I'm not sure how to derive a better
> answer.
> - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> expensive?

If this approach works, not only would it allow us to support the
synchronous users better, it would also allow us to remove loads
of cruft in the Crypto API that exists solely to support these SIMD
code paths.

So I eagerly await the assessment of the scheduler/RT folks on this
approach.

Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2021-01-14 08:24:05

by Ard Biesheuvel

Subject: Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled

On Sat, 19 Dec 2020 at 03:05, Herbert Xu <[email protected]> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> >
> > Questions:
> > - what did I miss or break horribly?
> > - does any of this matter for RT? AIUI, RT runs softirqs from a dedicated
> > kthread, so I don't think it cares.
> > - what would be a reasonable upper bound to keep softirqs disabled? I suppose
> > 100s of cycles or less is overkill, but I'm not sure how to derive a better
> > answer.
> > - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> > expensive?
>
> If this approach works, not only would it allow us to support the
> synchronous users better, it would also allow us to remove loads
> of cruft in the Crypto API that exists solely to support these SIMD
> code paths.
>
> So I eagerly await the assessment of the scheduler/RT folks on this
> approach.
>

Any insights here? Is there a ballpark upper bound for the duration of
a softirq-disabled section? Are there other reasons why disabling and
re-enabling softirq handling would be a bad idea?

2021-02-16 10:12:12

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled

On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> [ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
> SIMD in kernel mode could reduce complexity and improve performance, but we
> need to decide whether we can do this, and how much softirq processing
> latency we can tolerate. If we can find a satisfactory solution for this,
> we might do the same for x86 and 32-bit ARM as well. ]

> - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> expensive?

Can't we simply save/restore the relevant register set?

So something like (note amluto was wanting to add a regset argument):

<task>
kernel_fpu_begin(MMX)
<SIRQ>
kernel_fpu_begin(SSE)
kernel_fpu_end();
</SIRQ>
...
kernel_fpu_end()

We would have to save the MMX regs on the first SIRQ invocation of
kernel_fpu_begin(), and then have the softirq context termination
(</SIRQ> above) restore them.

I mean, we already do much the same for the first kernel_fpu_begin(),
which has to save the user registers; those will be restored when we go
back to userspace.

So why not do exactly the same for softirq context?
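
A very rough sketch of what that nesting could look like - none of the names
below (struct kernel_fpu_state, save_kernel_fpu_regs(),
restore_kernel_fpu_regs(), softirq_fpu_exit()) exist in the x86 tree; they
are placeholders for whatever a real implementation would use:

#include <linux/interrupt.h>
#include <linux/percpu.h>
#include <linux/types.h>

struct kernel_fpu_state { u8 buf[512]; };               /* placeholder size */

static DEFINE_PER_CPU(struct kernel_fpu_state, softirq_fpu_save);
static DEFINE_PER_CPU(bool, softirq_fpu_saved);

void save_kernel_fpu_regs(struct kernel_fpu_state *st);         /* placeholder */
void restore_kernel_fpu_regs(struct kernel_fpu_state *st);      /* placeholder */

void kernel_fpu_begin(void)     /* regset argument omitted for brevity */
{
        if (in_serving_softirq() && !this_cpu_read(softirq_fpu_saved)) {
                /*
                 * First softirq-level use: stash the task-level kernel FPU
                 * contents (the "MMX" state in the example above) so they
                 * can be restored when softirq processing finishes.
                 */
                save_kernel_fpu_regs(this_cpu_ptr(&softirq_fpu_save));
                this_cpu_write(softirq_fpu_saved, true);
        }
        /* ... existing preempt_disable() + user state handling ... */
}

/* Called when softirq processing ends, i.e. at </SIRQ> above. */
void softirq_fpu_exit(void)
{
        if (this_cpu_read(softirq_fpu_saved)) {
                restore_kernel_fpu_regs(this_cpu_ptr(&softirq_fpu_save));
                this_cpu_write(softirq_fpu_saved, false);
        }
}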

2021-02-16 10:36:35

by Ard Biesheuvel

Subject: Re: [RFC PATCH 0/5] running kernel mode SIMD with softirqs disabled

On Tue, 16 Feb 2021 at 11:10, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Dec 18, 2020 at 06:01:01PM +0100, Ard Biesheuvel wrote:
> > [ TL;DR for the non-ARM folks on CC: disabling softirq processing when using
> > SIMD in kernel mode could reduce complexity and improve performance, but we
> > need to decide whether we can do this, and how much softirq processing
> > latency we can tolerate. If we can find a satisfactory solution for this,
> > we might do the same for x86 and 32-bit ARM as well. ]
>
> > - could we do the same on x86, now that kernel_fpu_begin/end is no longer
> > expensive?
>
> Can't we simply save/restore the relevant register set?
>
> So something like (note amluto was wanting to add a regset argument):
>
> <task>
> kernel_fpu_begin(MMX)
> <SIRQ>
> kernel_fpu_begin(SSE)
> kernel_fpu_end();
> </SIRQ>
> ...
> kernel_fpu_end()
>
> We would have to save the MMX regs on the first SIRQ invocation of
> kernel_fpu_begin(), and then have the softirq context termination
> (</SIRQ> above) restore them.
>
> I mean, we already do much the same for the first kernel_fpu_begin(),
> which has to save the user registers; those will be restored when we go
> back to userspace.
>
> So why not do exactly the same for softirq context?

That is what we originally had on arm64, with per-CPU buffers of the
appropriate size. This became a bit messy when SVE support was added,
because the register file is so large (32 registers of up to 2048 bits
each, i.e. up to 8 KB per buffer), and since the kernel does not use SVE
itself, we want the inner per-CPU buffer to cover only 128 bits per
register. This means we
cannot allow the <sirq></sirq> region above to interrupt the outer
preserve (which is implemented entirely in software), since resuming
the outer preserve after a sirq would preserve content that was
corrupted by the inner preserve/restore. This could be addressed by
disabling interrupts across the outer preserve, but this caused a
problem somewhere else (Dave might remember the details better than I
do). So we ended up making SIMD in task context mutually exclusive
with SIMD in softirq context, also because that is what 32-bit ARM and
x86 were already doing as well.

But I understand that these concerns may not apply to x86 at all, so
perhaps the answer there is indeed to have an alternate buffer. And
actually, Andy L. suggested the same when I asked him about it on IRC.