2020-04-20 07:59:01

by Jason A. Donenfeld

Subject: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

The initial Zinc patchset, after some mailing list discussion, contained
code to ensure that kernel_fpu_begin() would not be kept on for more than
a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
isn't totally scientific, but it's not a bad guess either, and it's
what's used in both the x86 poly1305 and blake2s library code already.
Unfortunately it appears to have been left out of the final patchset
that actually added the glue code. So, this commit adds back the
PAGE_SIZE chunking.

Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
Cc: Eric Biggers <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: [email protected]
Signed-off-by: Jason A. Donenfeld <[email protected]>
---
Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
patches, or if there was actually some later discussion in which we
concluded that the PAGE_SIZE chunking wasn't required, perhaps because
of FPU changes. If that's the case, please do let me know, and I'll
submit a _different_ patch that removes the chunking from x86 poly and
blake instead. I can't find any emails that would indicate that, but I
might be mistaken.

 arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
 arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
 arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
 arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
 arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
 5 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index 6fdb0ac62b3d..0e29ebac95fd 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 		return;
 	}

-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);

diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
index ceec04ec2f40..536a4a943ebe 100644
--- a/arch/arm/crypto/poly1305-glue.c
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -160,13 +160,22 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);

 		if (static_branch_likely(&have_neon) && do_neon) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			for (;;) {
+				unsigned int todo = min_t(unsigned int, PAGE_SIZE, len);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				if (!len)
+					break;
+				src += todo;
+			}
 		} else {
 			poly1305_blocks_arm(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
}

diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 37ca3e889848..3eff767f4f77 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -87,9 +87,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    !crypto_simd_usable())
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);

-	kernel_neon_begin();
-	chacha_doneon(state, dst, src, bytes, nrounds);
-	kernel_neon_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_neon_begin();
+		chacha_doneon(state, dst, src, todo, nrounds);
+		kernel_neon_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);

diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index e97b092f56b8..616134bef02c 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -143,13 +143,22 @@ void poly1305_update_arch(struct poly1305_desc_ctx *dctx, const u8 *src,
 		unsigned int len = round_down(nbytes, POLY1305_BLOCK_SIZE);

 		if (static_branch_likely(&have_neon) && crypto_simd_usable()) {
-			kernel_neon_begin();
-			poly1305_blocks_neon(&dctx->h, src, len, 1);
-			kernel_neon_end();
+			for (;;) {
+				unsigned int todo = min_t(unsigned int, PAGE_SIZE, len);
+
+				kernel_neon_begin();
+				poly1305_blocks_neon(&dctx->h, src, todo, 1);
+				kernel_neon_end();
+
+				len -= todo;
+				if (!len)
+					break;
+				src += todo;
+			}
 		} else {
 			poly1305_blocks(&dctx->h, src, len, 1);
+			src += len;
 		}
-		src += len;
 		nbytes %= POLY1305_BLOCK_SIZE;
}

diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index b412c21ee06e..10733035b81c 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -153,9 +153,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
 	    bytes <= CHACHA_BLOCK_SIZE)
 		return chacha_crypt_generic(state, dst, src, bytes, nrounds);

-	kernel_fpu_begin();
-	chacha_dosimd(state, dst, src, bytes, nrounds);
-	kernel_fpu_end();
+	for (;;) {
+		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
+
+		kernel_fpu_begin();
+		chacha_dosimd(state, dst, src, todo, nrounds);
+		kernel_fpu_end();
+
+		bytes -= todo;
+		if (!bytes)
+			break;
+		src += todo;
+		dst += todo;
+	}
 }
 EXPORT_SYMBOL(chacha_crypt_arch);

--
2.26.1


2020-04-22 04:06:10

by Eric Biggers

Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

On Mon, Apr 20, 2020 at 01:57:11AM -0600, Jason A. Donenfeld wrote:
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_begin() would not be kept on for more than
> a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> isn't totally scientific, but it's not a bad guess either, and it's
> what's used in both the x86 poly1305 and blake2s library code already.
> Unfortunately it appears to have been left out of the final patchset
> that actually added the glue code. So, this commit adds back the
> PAGE_SIZE chunking.
>
> Fixes: 84e03fa39fbe ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
> Fixes: b3aad5bad26a ("crypto: arm64/chacha - expose arm64 ChaCha routine as library function")
> Fixes: a44a3430d71b ("crypto: arm/chacha - expose ARM ChaCha routine as library function")
> Fixes: f569ca164751 ("crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Fixes: a6b803b3ddc7 ("crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON implementation")
> Cc: Eric Biggers <[email protected]>
> Cc: Ard Biesheuvel <[email protected]>
> Cc: [email protected]
> Signed-off-by: Jason A. Donenfeld <[email protected]>
> ---
> Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> patches, or if there was actually some later discussion in which we
> concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> of FPU changes. If that's the case, please do let me know, and I'll
> submit a _different_ patch that removes the chunking from x86 poly and
> blake instead. I can't find any emails that would indicate that, but I
> might be mistaken.
>
> arch/arm/crypto/chacha-glue.c        | 16 +++++++++++++---
> arch/arm/crypto/poly1305-glue.c      | 17 +++++++++++++----
> arch/arm64/crypto/chacha-neon-glue.c | 16 +++++++++++++---
> arch/arm64/crypto/poly1305-glue.c    | 17 +++++++++++++----
> arch/x86/crypto/chacha_glue.c        | 16 +++++++++++++---
> 5 files changed, 65 insertions(+), 17 deletions(-)

I don't think you're missing anything. On x86, kernel_fpu_begin() and
kernel_fpu_end() did get optimized in v5.2. But they still disable preemption,
which is the concern here.
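
To illustrate the concern with a simplified sketch (not the actual arch
code; the real implementation has more detail): the begin/end pair
brackets a preemption-off region, so the worst-case scheduling delay
grows with however many bytes get processed inside it:

	kernel_fpu_begin();	/* saves FPU state, calls preempt_disable() */
	chacha_dosimd(state, dst, src, bytes, nrounds);	/* preemption stays off for all of `bytes` */
	kernel_fpu_end();	/* calls preempt_enable() */

Chunking the input bounds that window regardless of the total length.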

>
> diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
> index 6fdb0ac62b3d..0e29ebac95fd 100644
> --- a/arch/arm/crypto/chacha-glue.c
> +++ b/arch/arm/crypto/chacha-glue.c
> @@ -91,9 +91,19 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
>  		return;
>  	}
> 
> -	kernel_neon_begin();
> -	chacha_doneon(state, dst, src, bytes, nrounds);
> -	kernel_neon_end();
> +	for (;;) {
> +		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> +
> +		kernel_neon_begin();
> +		chacha_doneon(state, dst, src, todo, nrounds);
> +		kernel_neon_end();
> +
> +		bytes -= todo;
> +		if (!bytes)
> +			break;
> +		src += todo;
> +		dst += todo;
> +	}
>  }
>  EXPORT_SYMBOL(chacha_crypt_arch);

Seems this should just be a 'while' loop?

	while (bytes) {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);

		kernel_neon_begin();
		chacha_doneon(state, dst, src, todo, nrounds);
		kernel_neon_end();

		bytes -= todo;
		src += todo;
		dst += todo;
	}

Likewise elsewhere in this patch. (For Poly1305, len >= POLY1305_BLOCK_SIZE at
the beginning, so that could use a 'do' loop.)
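
Something like this, say (an untested sketch; note it advances src on
every iteration, so the tail handling after the loop still sees the
right offset):

	do {
		unsigned int todo = min_t(unsigned int, PAGE_SIZE, len);

		kernel_neon_begin();
		poly1305_blocks_neon(&dctx->h, src, todo, 1);
		kernel_neon_end();

		len -= todo;
		src += todo;
	} while (len);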

- Eric

2020-04-22 19:37:25

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

On Wed, Apr 22, 2020 at 5:28 AM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2020-04-22 09:23:34 [+0200], Ard Biesheuvel wrote:
> > My memory is a bit fuzzy here. I remember talking to the linux-rt guys
> > about what delay is actually acceptable, which was a lot higher than I
> > had thought based on their initial reports about scheduling blackouts
> > on arm64 due to preemption remaining disabled for too long. I intended
> > to revisit this with more accurate bounds but then I apparently
> > forgot.
> >
> > So SIMD chacha20 and SIMD poly1305 both run in <5 cycles per byte,
> > both on x86 and ARM. If we take 20 microseconds as a ballpark upper
> > bound for how long preemption may be disabled, that gives us ~4000
> > bytes of ChaCha20 or Poly1305 on a hypothetical 1 GHz core.
> >
> > So I think 4 KB is indeed a reasonable quantum of work here. Only
> > PAGE_SIZE is not necessarily equal to 4 KB on arm64, so we should use
> > SZ_4K instead.
> >
> > *However*, at the time, the report was triggered by the fact that we
> > were keeping SIMD enabled across calls into the scatterwalk API, which
> > may call kmalloc()/kfree() etc. There is no need for that anymore, now
> > that the FPU begin/end routines all have been optimized to restore the
> > userland SIMD state lazily.
>
> The 20 usec sounds reasonable. The other concern was memory allocation
> within the preempt-disable section. If that no longer happens,
> perfect.
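
For what it's worth, working through Ard's numbers above (~5 cycles per
byte, ~20 microsecond budget, hypothetical 1 GHz core):

	20 us * 1e9 cycles/s = 20,000 cycles of budget
	20,000 cycles / 5 cycles per byte = 4,000 bytes

so a 4 KB (SZ_4K) quantum of work indeed stays within the bound.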

Cool, thanks for the confirmation. I'll get a v2 of this patch out the door.

2020-04-22 19:53:45

by Jason A. Donenfeld

Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <[email protected]> wrote:
>
> On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <[email protected]> wrote:
> >
> > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <[email protected]> wrote:
> > > Seems this should just be a 'while' loop?
> > >
> > > 	while (bytes) {
> > > 		unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > >
> > > 		kernel_neon_begin();
> > > 		chacha_doneon(state, dst, src, todo, nrounds);
> > > 		kernel_neon_end();
> > >
> > > 		bytes -= todo;
> > > 		src += todo;
> > > 		dst += todo;
> > > 	}
> >
> > The for(;;) is how it's done elsewhere in the kernel (in code this
> > patch doesn't touch), because then we can break out of the loop
> > before having to increment src and dst unnecessarily. Likely a
> > pointless optimization, as the compiler can probably figure out how
> > to avoid that. But maybe it can't. If you have a strong preference,
> > I can refactor everything to use `while (bytes)`, but if you don't
> > care, let's keep this as-is. Opinion?
> >
>
> Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> given that bytes is guaranteed to be non-zero before we enter the
> loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> we can.

Okay, will do-while it up for v2.