2020-04-20 08:32:41

by David Laight

[permalink] [raw]
Subject: RE: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

From: Jason A. Donenfeld
> Sent: 20 April 2020 08:57
>
> The initial Zinc patchset, after some mailing list discussion, contained
> code to ensure that kernel_fpu_enable would not be kept on for more than
> a PAGE_SIZE chunk, since it disables preemption. The choice of PAGE_SIZE
> isn't totally scientific, but it's not a bad guess either, and it's
> what's used in both the x86 poly1305 and blake2s library code already.
> Unfortunately it appears to have been left out of the final patchset
> that actually added the glue code. So, this commit adds back the
> PAGE_SIZE chunking.
>
...
> ---
> Eric, Ard - I'm wondering if this was in fact just an oversight in Ard's
> patches, or if there was actually some later discussion in which we
> concluded that the PAGE_SIZE chunking wasn't required, perhaps because
> of FPU changes. If that's the case, please do let me know, in which case
> I'll submit a _different_ patch that removes the chunking from x86 poly
> and blake. I can't find any emails that would indicate that, but I might
> be mistaken.

Maybe kernel_fp_begin() should be passed the address of somewhere
the address of an fpu save area buffer can be written to.
Then the pre-emption code can allocate the buffer and save the
state into it.

However that doesn't solve the problem for non-preemptive kernels.
The may need a cond_resched() in the loop if it might take 1ms (or so).

kernel_fpu_begin() ought also be passed a parameter saying which
fpu features are required, and return which are allocated.
On x86 this could be used to check for AVX512 (etc) which may be
available in an ISR unless it interrupted inside a kernel_fpu_begin()
section (etc).
It would also allow optimisations if only 1 or 2 fpu registers are
needed (eg for some of the crypto functions) rather than the whole
fpu register set.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2020-04-21 04:03:25

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

On Mon, Apr 20, 2020 at 2:32 AM David Laight <[email protected]> wrote:
> Maybe kernel_fp_begin() should be passed the address of somewhere
> the address of an fpu save area buffer can be written to.
> Then the pre-emption code can allocate the buffer and save the
> state into it.
>
> However that doesn't solve the problem for non-preemptive kernels.
> The may need a cond_resched() in the loop if it might take 1ms (or so).
>
> kernel_fpu_begin() ought also be passed a parameter saying which
> fpu features are required, and return which are allocated.
> On x86 this could be used to check for AVX512 (etc) which may be
> available in an ISR unless it interrupted inside a kernel_fpu_begin()
> section (etc).
> It would also allow optimisations if only 1 or 2 fpu registers are
> needed (eg for some of the crypto functions) rather than the whole
> fpu register set.

There might be ways to improve lots of FPU things, indeed. This patch
here is just a patch to Herbert's branch in order to make uniform
usage of our existing solution for this, fixing the existing bug. I
wouldn't mind seeing more involved and better solutions in a patchset
for crypto-next.

Will follow up with your suggestion in a different thread, so as not
to block this one.

2020-04-21 04:15:40

by Jason A. Donenfeld

[permalink] [raw]
Subject: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]

Hi David,

On Mon, Apr 20, 2020 at 2:32 AM David Laight <[email protected]> wrote:
> Maybe kernel_fp_begin() should be passed the address of somewhere
> the address of an fpu save area buffer can be written to.
> Then the pre-emption code can allocate the buffer and save the
> state into it.

Interesting idea. It looks like `struct xregs_state` is only 576
bytes. That's not exactly small, but it's not insanely huge either,
and maybe we could justifiably stick that on the stack, or even
reserve part of the stack allocation for that that the function would
know about, without needing to specify any address.

> kernel_fpu_begin() ought also be passed a parameter saying which
> fpu features are required, and return which are allocated.
> On x86 this could be used to check for AVX512 (etc) which may be
> available in an ISR unless it interrupted inside a kernel_fpu_begin()
> section (etc).
> It would also allow optimisations if only 1 or 2 fpu registers are
> needed (eg for some of the crypto functions) rather than the whole
> fpu register set.

For AVX512 this probably makes sense, I suppose. But I'm not sure if
there are too many bits of crypto code that only use a few registers.
There are those accelerated memcpy routines in i915 though -- ever see
drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
this way, I wonder if it'd make sense to totally overengineer it and
write a gcc/as plugin to create the register mask for us. Or, maybe
some checker inside of objtool could help here.

Actually, though, the thing I've been wondering about is actually
moving in the complete opposite direction: is there some
efficient-enough way that we could allow FPU registers in all contexts
always, without the need for kernel_fpu_begin/end? I was reversing
ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
their judicious use of vectorisation everywhere. I assume a lot of
that is being generated by their compiler, which of course gcc could
do for us if we let it. Is that an interesting avenue to consider? Or
are you pretty certain that it'd be a huge mistake, with an
irreversible speed hit?

Jason

2020-04-21 04:27:03

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]

On Mon, Apr 20, 2020 at 10:14 PM Jason A. Donenfeld <[email protected]> wrote:
>
> Hi David,
>
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <[email protected]> wrote:
> > Maybe kernel_fp_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
>
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.

Hah-hah, nevermind here. extended_state_area is of course huge,
bringing the whole structure to a whopping 3k with avx512. :)

2020-04-21 07:03:12

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]

On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld <[email protected]> wrote:
>
> Hi David,
>
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <[email protected]> wrote:
> > Maybe kernel_fp_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
>
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.
>
> > kernel_fpu_begin() ought also be passed a parameter saying which
> > fpu features are required, and return which are allocated.
> > On x86 this could be used to check for AVX512 (etc) which may be
> > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > section (etc).
> > It would also allow optimisations if only 1 or 2 fpu registers are
> > needed (eg for some of the crypto functions) rather than the whole
> > fpu register set.
>
> For AVX512 this probably makes sense, I suppose. But I'm not sure if
> there are too many bits of crypto code that only use a few registers.
> There are those accelerated memcpy routines in i915 though -- ever see
> drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> this way, I wonder if it'd make sense to totally overengineer it and
> write a gcc/as plugin to create the register mask for us. Or, maybe
> some checker inside of objtool could help here.
>
> Actually, though, the thing I've been wondering about is actually
> moving in the complete opposite direction: is there some
> efficient-enough way that we could allow FPU registers in all contexts
> always, without the need for kernel_fpu_begin/end? I was reversing
> ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> their judicious use of vectorisation everywhere. I assume a lot of
> that is being generated by their compiler, which of course gcc could
> do for us if we let it. Is that an interesting avenue to consider? Or
> are you pretty certain that it'd be a huge mistake, with an
> irreversible speed hit?
>

When I added support for kernel mode SIMD to arm64 originally, there
was a kernel_neon_begin_partial() that took an int to specify how many
registers were being used, the reason being that NEON preserve/store
was fully eager at this point, and arm64 has 32 SIMD registers, many
of which weren't really used, e.g., in the basic implementation of AES
based on special instructions.

With the introduction of lazy restore, and SVE handling for userspace,
we decided to remove this because it didn't really help anymore, and
made the code more difficult to manage.

However, I think it would make sense to have something like this in
the general case. I.e., NEON registers 0-3 are always preserved when
an exception or interrupt (or syscall) is taken, and so they can be
used anywhere in the kernel. If you want the whole set, you will have
to use begin/end as before. This would already unlock a few
interesting case, like memcpy, xor, and sequences that can easily be
implemented with only a few registers such as instructio based AES.

Unfortunately, the compiler needs to be taught about this to be
completely useful, which means lots of prototyping and benchmarking
upfront, as the contract will be set in stone once the compilers get
on board.

2020-04-21 08:06:29

by David Laight

[permalink] [raw]
Subject: RE: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]

From: Ard Biesheuvel
> Sent: 21 April 2020 08:02
> On Tue, 21 Apr 2020 at 06:15, Jason A. Donenfeld <[email protected]> wrote:
> >
> > Hi David,
> >
> > On Mon, Apr 20, 2020 at 2:32 AM David Laight <[email protected]> wrote:
> > > Maybe kernel_fp_begin() should be passed the address of somewhere
> > > the address of an fpu save area buffer can be written to.
> > > Then the pre-emption code can allocate the buffer and save the
> > > state into it.
> >
> > Interesting idea. It looks like `struct xregs_state` is only 576
> > bytes. That's not exactly small, but it's not insanely huge either,
> > and maybe we could justifiably stick that on the stack, or even
> > reserve part of the stack allocation for that that the function would
> > know about, without needing to specify any address.
> >
> > > kernel_fpu_begin() ought also be passed a parameter saying which
> > > fpu features are required, and return which are allocated.
> > > On x86 this could be used to check for AVX512 (etc) which may be
> > > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > > section (etc).
> > > It would also allow optimisations if only 1 or 2 fpu registers are
> > > needed (eg for some of the crypto functions) rather than the whole
> > > fpu register set.
> >
> > For AVX512 this probably makes sense, I suppose. But I'm not sure if
> > there are too many bits of crypto code that only use a few registers.
> > There are those accelerated memcpy routines in i915 though -- ever see
> > drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> > this way, I wonder if it'd make sense to totally overengineer it and
> > write a gcc/as plugin to create the register mask for us. Or, maybe
> > some checker inside of objtool could help here.
> >
> > Actually, though, the thing I've been wondering about is actually
> > moving in the complete opposite direction: is there some
> > efficient-enough way that we could allow FPU registers in all contexts
> > always, without the need for kernel_fpu_begin/end? I was reversing
> > ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> > their judicious use of vectorisation everywhere. I assume a lot of
> > that is being generated by their compiler, which of course gcc could
> > do for us if we let it. Is that an interesting avenue to consider? Or
> > are you pretty certain that it'd be a huge mistake, with an
> > irreversible speed hit?
> >
>
> When I added support for kernel mode SIMD to arm64 originally, there
> was a kernel_neon_begin_partial() that took an int to specify how many
> registers were being used, the reason being that NEON preserve/store
> was fully eager at this point, and arm64 has 32 SIMD registers, many
> of which weren't really used, e.g., in the basic implementation of AES
> based on special instructions.
>
> With the introduction of lazy restore, and SVE handling for userspace,
> we decided to remove this because it didn't really help anymore, and
> made the code more difficult to manage.
>
> However, I think it would make sense to have something like this in
> the general case. I.e., NEON registers 0-3 are always preserved when
> an exception or interrupt (or syscall) is taken, and so they can be
> used anywhere in the kernel. If you want the whole set, you will have
> to use begin/end as before. This would already unlock a few
> interesting case, like memcpy, xor, and sequences that can easily be
> implemented with only a few registers such as instructio based AES.
>
> Unfortunately, the compiler needs to be taught about this to be
> completely useful, which means lots of prototyping and benchmarking
> upfront, as the contract will be set in stone once the compilers get
> on board.

You can always just use asm with explicit registers.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-04-21 08:20:16

by David Laight

[permalink] [raw]
Subject: RE: FPU register granularity [Was: Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks]

From: Jason A. Donenfeld
> Sent: 21 April 2020 05:15
>
> Hi David,
>
> On Mon, Apr 20, 2020 at 2:32 AM David Laight <[email protected]> wrote:
> > Maybe kernel_fp_begin() should be passed the address of somewhere
> > the address of an fpu save area buffer can be written to.
> > Then the pre-emption code can allocate the buffer and save the
> > state into it.
>
> Interesting idea. It looks like `struct xregs_state` is only 576
> bytes. That's not exactly small, but it's not insanely huge either,
> and maybe we could justifiably stick that on the stack, or even
> reserve part of the stack allocation for that that the function would
> know about, without needing to specify any address.

As you said yourself, with AVX512 it is much larger.
Which is why I suggested the save code could allocate the area.
Note that this would only be needed for nested use (for a full save).

> > kernel_fpu_begin() ought also be passed a parameter saying which
> > fpu features are required, and return which are allocated.
> > On x86 this could be used to check for AVX512 (etc) which may be
> > available in an ISR unless it interrupted inside a kernel_fpu_begin()
> > section (etc).
> > It would also allow optimisations if only 1 or 2 fpu registers are
> > needed (eg for some of the crypto functions) rather than the whole
> > fpu register set.
>
> For AVX512 this probably makes sense, I suppose. But I'm not sure if
> there are too many bits of crypto code that only use a few registers.
> There are those accelerated memcpy routines in i915 though -- ever see
> drivers/gpu/drm/i915/i915_memcpy.c? sort of wild. But if we did go
> this way, I wonder if it'd make sense to totally overengineer it and
> write a gcc/as plugin to create the register mask for us. Or, maybe
> some checker inside of objtool could help here.

I suspect some of that code is overly unrolled.

> Actually, though, the thing I've been wondering about is actually
> moving in the complete opposite direction: is there some
> efficient-enough way that we could allow FPU registers in all contexts
> always, without the need for kernel_fpu_begin/end? I was reversing
> ntoskrnl.exe and was kind of impressed (maybe not the right word?) by
> their judicious use of vectorisation everywhere. I assume a lot of
> that is being generated by their compiler, which of course gcc could
> do for us if we let it. Is that an interesting avenue to consider? Or
> are you pretty certain that it'd be a huge mistake, with an
> irreversible speed hit?


I think windows takes the 'hit' of saving the entire fpu state on
every kernel entry.
Note that for system calls this is actually minimal.
All the 'callee saved' registers (most of the fpu ones) can be
trashed - ie reloaded with zeros.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)