Hi Andy & folks,

Lots of crypto routines look like this:

    kernel_fpu_begin();
    encrypt();
    kernel_fpu_end();

If you call such a routine twice, you get:

    kernel_fpu_begin();
    encrypt();
    kernel_fpu_end();
    kernel_fpu_begin();
    encrypt();
    kernel_fpu_end();

In a loop this looks like:

    for (thing) {
        kernel_fpu_begin();
        encrypt(thing);
        kernel_fpu_end();
    }

This is obviously very bad, because begin() and end() are slow, so
WireGuard does the obvious:

    kernel_fpu_begin();
    for (thing)
        encrypt(thing);
    kernel_fpu_end();

This is fine and well, and the crypto API I'm working on will enable
this to be done in a clear way, but I do wonder if maybe this is not
something that should be happening at the level of the caller, but
rather in the fpu functions themselves. Namely, what are your thoughts
on modifying kernel_fpu_end() so that it doesn't actually restore the
state, but just reenables preemption and marks that on the next
context switch, the state should be restored? Then, essentially,
kernel_fpu_begin() and end() become free after the first usage of
kernel_fpu_begin().
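
Concretely, I'm imagining something like the sketch below. This is not
working code -- TIF_FPU_RESTORE_DEFERRED is a made-up flag, and the
scheduler would need a matching hook that notices it and does the actual
restore on the next context switch:

    void kernel_fpu_begin(void)
    {
        preempt_disable();
        /* Only bother saving the user fpregs on the first begin(). */
        if (!test_thread_flag(TIF_FPU_RESTORE_DEFERRED))
            copy_fpregs_to_fpstate(&current->thread.fpu);
    }

    void kernel_fpu_end(void)
    {
        /*
         * No XRSTOR here: just note that the user state still lives
         * in memory and must be restored before this task runs user
         * code again.
         */
        set_thread_flag(TIF_FPU_RESTORE_DEFERRED);
        preempt_enable();
    }
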
Is this something feasible? I know that performance-wise, I'm really
gaining a lot from hoisting those functions out of the loops, and
API-wise, it'd be slightly simpler to implement if I didn't have to
allow for that hoisting.

Regards,
Jason
On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
> In a loop this looks like:
>
> for (thing) {
>     kernel_fpu_begin();
>     encrypt(thing);
>     kernel_fpu_end();
> }
>
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
>
> kernel_fpu_begin();
> for (thing)
>     encrypt(thing);
> kernel_fpu_end();
>
> This is fine and well, and the crypto API I'm working on will enable

It might be fine crypto-performance-wise, but it's a total nightmare
latency-wise, because kernel_fpu_begin() disables preemption. We've seen
latencies in the larger millisecond range due to processing large data sets
with kernel FPU.

If you want to go there, then we really need a better approach which allows
kernel FPU usage in preemptible context and, in case of preemption, a way to
stash the preempted FPU context and restore it when the task gets scheduled
in again. Just using the existing FPU stuff, moving the loops inside the
begin/end section, and keeping preemption disabled for arbitrary time spans
is not going to fly.
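
Roughly, the direction I mean looks like the sketch below. None of this
exists today -- the per-task 'kernel_fpu' save area and 'kernel_fpu_live'
flag are invented names, purely to illustrate the mechanism:

    /*
     * Sketch: give each task a second FPU save area for in-kernel FPU
     * state, so a preemptible kernel_fpu section can be stashed on
     * schedule-out and brought back on schedule-in.
     */
    static void switch_kernel_fpu(struct task_struct *prev,
                                  struct task_struct *next)
    {
        /* prev was preempted inside a kernel_fpu section: stash it */
        if (prev->thread.kernel_fpu_live)
            copy_fpregs_to_fpstate(&prev->thread.kernel_fpu);

        /* next resumes inside such a section: restore its state */
        if (next->thread.kernel_fpu_live)
            copy_kernel_to_fpregs(&next->thread.kernel_fpu.state);
    }
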
Thanks,
tglx
On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <[email protected]> wrote:
>
> Hi Andy & folks,
>
> Lots of crypto routines look like this:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> If you call such a routine twice, you get:
>
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
> kernel_fpu_begin();
> encrypt();
> kernel_fpu_end();
>
> In a loop this looks like:
>
> for (thing) {
>     kernel_fpu_begin();
>     encrypt(thing);
>     kernel_fpu_end();
> }
>
> This is obviously very bad, because begin() and end() are slow, so
> WireGuard does the obvious:
>
> kernel_fpu_begin();
> for (thing)
>     encrypt(thing);
> kernel_fpu_end();
>
> This is fine and well, and the crypto API I'm working on will enable
> this to be done in a clear way, but I do wonder if maybe this is not
> something that should be happening at the level of the caller, but
> rather in the fpu functions themselves. Namely, what are your thoughts
> on modifying kernel_fpu_end() so that it doesn't actually restore the
> state, but just reenables preemption and marks that on the next
> context switch, the state should be restored? Then, essentially,
> kernel_fpu_begin() and end() become free after the first usage of
> kernel_fpu_begin().
>
> Is this something feasible? I know that performance-wise, I'm really
> gaining a lot from hoisting those functions out of the loops, and API
> wise, it'd be slightly simpler to implement if I didn't have to allow
> for that hoisting.

Hi Jason-

Funny you asked. This has been discussed a few times, although not
quite in the form you imagined. The idea that we've tossed around is
to restore FPU state on return to user mode. Roughly, we'd introduce
a new thread flag TIF_FPU_UNLOADED (name TBD).
prepare_exit_to_usermode() would notice this flag, copy the fpstate to
fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- no
one has quite thought it through, but I think it should be outside the
loop.) We'd update all the FPU accessors to understand the flag.

We'd probably have a helper like:

    void unload_user_fpstate(void)

that would do nothing if TIF_FPU_UNLOADED is set. If TIF_FPU_UNLOADED
is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
We'd call unload_user_fpstate() during context switch in the prev task
context and in kernel_fpu_begin().
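
In sketch form, with the caveat that TIF_FPU_UNLOADED is still a
hypothetical flag and the call sites above are only roughly thought
through:

    void unload_user_fpstate(void)
    {
        if (test_thread_flag(TIF_FPU_UNLOADED))
            return;    /* user state already lives in memory */

        copy_fpregs_to_fpstate(&current->thread.fpu);
        set_thread_flag(TIF_FPU_UNLOADED);
    }

    /* ...and on the way back out, in prepare_exit_to_usermode() or so: */
    if (test_thread_flag(TIF_FPU_UNLOADED)) {
        copy_kernel_to_fpregs(&current->thread.fpu.state);
        clear_thread_flag(TIF_FPU_UNLOADED);
    }
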
The one major complication I know of is that PKRU is different from
all other FPU state in that it needs to be loaded when *kernel* code
does any user memory access. So we'd need to make sure that context
switches load the new task's PKRU eagerly. Using WRPKRU is easy, but,
unless we do something very clever, actually finding PKRU in the
in-memory fpstate image may be slightly nontrivial.

I suppose we could eagerly load the new FPU state on context switches,
but I'm not sure I see any point. We'd still have to load it when we
switch back to user mode, so it would be slower but not necessarily
any simpler.

I'd love to review patches to make this change, but I don't have the
bandwidth to write them right now.
On Fri, Jun 15, 2018 at 12:25 PM, Thomas Gleixner <[email protected]> wrote:
> On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
>> In a loop this looks like:
>>
>> for (thing) {
>>     kernel_fpu_begin();
>>     encrypt(thing);
>>     kernel_fpu_end();
>> }
>>
>> This is obviously very bad, because begin() and end() are slow, so
>> WireGuard does the obvious:
>>
>> kernel_fpu_begin();
>> for (thing)
>>     encrypt(thing);
>> kernel_fpu_end();
>>
>> This is fine and well, and the crypto API I'm working on will enable
>
> It might be fine crypto performance wise, but it's a total nightmare
> latency wise because kernel_fpu_begin() disables preemption. We've seen
> latencies in the larger millisecond range due to processing large data sets
> with kernel FPU.
>
> If you want to go there then we really need a better approach which allows
> kernel FPU usage in preemptible context and in case of preemption a way to
> stash the preempted FPU context and restore it when the task gets scheduled
> in again. Just using the existing FPU stuff and moving the loops inside the
> begin/end section and keeping preemption disabled for arbitrary time spans
> is not going to fly.

One optimization that can be done is to delay restoring the user FPU
state until we exit to userspace. That way the FPU is saved and
restored only once no matter how many times
kernel_fpu_begin()/kernel_fpu_end() are called.

--
Brian Gerst
On Fri, 2018-06-15 at 11:31 -0700, Andy Lutomirski wrote:
> On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <[email protected]>
> wrote:
> >
> > Hi Andy & folks,
> >
> > Lots of crypto routines look like this:
> >
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> >
> > If you call such a routine twice, you get:
> >
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> >
> > In a loop this looks like:
> >
> > for (thing) {
> >     kernel_fpu_begin();
> >     encrypt(thing);
> >     kernel_fpu_end();
> > }
> >
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> >
> > kernel_fpu_begin();
> > for (thing)
> >     encrypt(thing);
> > kernel_fpu_end();
> >
> > This is fine and well, and the crypto API I'm working on will
> > enable
> > this to be done in a clear way, but I do wonder if maybe this is
> > not
> > something that should be happening at the level of the caller, but
> > rather in the fpu functions themselves. Namely, what are your
> > thoughts
> > on modifying kernel_fpu_end() so that it doesn't actually restore
> > the
> > state, but just reenables preemption and marks that on the next
> > context switch, the state should be restored? Then, essentially,
> > kernel_fpu_begin() and end() become free after the first usage of
> > kernel_fpu_begin().
> >
> > Is this something feasible? I know that performance-wise, I'm
> > really
> > gaining a lot from hoisting those functions out of the loops, and
> > API
> > wise, it'd be slightly simpler to implement if I didn't have to allow
> > for that hoisting.
>
> Hi Jason-
>
> Funny you asked. This has been discussed a few times, although not
> quite in the form you imagined. The idea that we've tossed around is
> to restore FPU state on return to user mode. Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate
> to
> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> one has quite thought it through, but I think it should be outside
> the
> loop.) We'd update all the FPU accessors to understand the flag.
> We'd probably have a helper like:
>
> void unload_user_fpstate(void)
>
> that would do nothing if TIF_FPU_UNLOADED is set. If
> TIF_FPU_UNLOADED
> is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
> We'd call unload_user_fpstate() during context switch in the prev
> task
> context and in kernel_fpu_begin().
>
> The one major complication I know of is that PKRU is different from
> all other FPU state in that it needs to be loaded when *kernel* code
> does any user memory access. So we'd need to make sure that context
> switches load the new task's PKRU eagerly. Using WRPKRU is easy,
> but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

IIRC the in-kernel FPU state always has every FPU field
at a constant location.
> I suppose we could eagerly load the new FPU state on context
> switches,
> but I'm not sure I see any point. We'd still have to load it when we
> switch back to user mode, so it would be slower but not necessarily
> any simpler.
>
> I'd love to review patches to make this change, but I don't have the
> bandwidth to write them right now.

I can take a look at this, I am already looking at
some context switch code right now, anyway.

I also should still have the TIF_FPU_UNLOADED patches
(or whatever the flag was called) code around somewhere.

--
All Rights Reversed.
On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> Using WRPKRU is easy, but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

Why?

It's at a constant offset during any one boot, for sure. XSAVEC/XSAVES
can move it around, but only if you change the Requested Feature BitMap
(RFBM) that we pass to XSAVES or change the control register that
enables XSAVE states (XCR0).

We don't change XCR0 after boot, and RFBM is hard-coded to -1 as far as
I remember.
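
For instance, something like this sketch would pin it down once at boot,
assuming the standard (uncompacted) XSAVE layout, where CPUID leaf 0xD,
subleaf 9 (PKRU's feature number) reports the offset directly. The
function and variable names here are made up:

    static unsigned int pkru_offset __read_mostly;

    static void __init find_pkru_offset(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.(EAX=0xD, ECX=9): EBX = PKRU offset in the buffer */
        cpuid_count(0xd, 9, &eax, &ebx, &ecx, &edx);
        pkru_offset = ebx;    /* constant for the life of the boot */
    }

    /* Context switch could then simply do: */
    static void load_task_pkru(struct fpu *fpu)
    {
        u32 *pkru = (void *)&fpu->state.xsave + pkru_offset;

        __write_pkru(*pkru);
    }
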
On Fri, Jun 15, 2018 at 11:43 AM Dave Hansen
<[email protected]> wrote:
>
> On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > Using WRPKRU is easy, but,
> > unless we do something very clever, actually finding PKRU in the
> > in-memory fpstate image may be slightly nontrivial.
>
> Why?
>
> It's at a constant offset during any one boot, for sure. XSAVEC/XSAVES
> can move it around, but only if you change the Requested Feature BitMap
> (RFBM) that we pass to XSAVES or change the control register that
> enables XSAVE states (XCR0).
>
> We don't change XCR0 after boot, and RFBM is hard-coded to -1 as far as
> I remember.

I thought that XSAVES didn't allocate space for parts of the state
that are in the init state. I guess I was wrong :)

So never mind, context switch should just need to WRPKRU the field at
the predetermined offset in fpstate for the new task. And, if some
appropriate debug option is set, it should warn if TIF_FPU_UNLOADED is
clear for the new task.

I suspect that the whole patch should only be a couple hundred lines
of code. And I think there are VM workloads where it would be a
*huge* win.

--Andy
On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> for (thing) {
>     kernel_fpu_begin();
>     encrypt(thing);
>     kernel_fpu_end();
> }

Don't forget that the processor has optimizations for this, too. The
"modified optimization" will notice that between:

    kernel_fpu_end();   -> XRSTOR
and
    kernel_fpu_begin(); -> XSAVE(S|OPT)

the processor has not modified the states. It'll skip doing any writes
of the state. Doing what Andy is describing is still way better than
letting the processor do it, but you should just know up front that this
may not be as much of a win as you would expect.
On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
<[email protected]> wrote:
>
> On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > for (thing) {
> >     kernel_fpu_begin();
> >     encrypt(thing);
> >     kernel_fpu_end();
> > }
>
> Don't forget that the processor has optimizations for this, too. The
> "modified optimization" will notice that between:
>
> kernel_fpu_end(); -> XRSTOR
> and
> kernel_fpu_begin(); -> XSAVE(S|OPT)
>
> the processor has not modified the states. It'll skip doing any writes
> of the state. Doing what Andy is describing is still way better than
> letting the processor do it, but you should just know up front that this
> may not be as much of a win as you would expect.

Even with the modified optimization, kernel_fpu_end() still needs to
reload the state that was trashed by the kernel FPU use. If the
kernel is using something like AVX512 state, then kernel_fpu_end()
will transfer an enormous amount of data no matter how clever the CPU
is. And I think I once measured XSAVEOPT taking a hundred cycles or
so even when RFBM==0, so it's not exactly super fast.
On Fri, Jun 15, 2018 at 06:25:39PM +0200, Thomas Gleixner wrote:
> On Fri, 15 Jun 2018, Jason A. Donenfeld wrote:
> > In a loop this looks like:
> >
> > for (thing) {
> >     kernel_fpu_begin();
> >     encrypt(thing);
> >     kernel_fpu_end();
> > }
> >
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> >
> > kernel_fpu_begin();
> > for (thing)
> >     encrypt(thing);
> > kernel_fpu_end();
> >
> > This is fine and well, and the crypto API I'm working on will enable
>
> It might be fine crypto performance wise, but it's a total nightmare
> latency wise because kernel_fpu_begin() disables preemption. We've seen
> latencies in the larger millisecond range due to processing large data sets
> with kernel FPU.
>
> If you want to go there then we really need a better approach which allows
> kernel FPU usage in preemptible context and in case of preemption a way to
> stash the preempted FPU context and restore it when the task gets scheduled
> in again. Just using the existing FPU stuff and moving the loops inside the
> begin/end section and keeping preemption disabled for arbitrary time spans
> is not going to fly.

Didn't we recently do a bunch of crypto patches to help with this?

I think they had the pattern:

    kernel_fpu_begin();
    for (units-of-work) {
        do_unit_of_work();
        if (need_resched()) {
            kernel_fpu_end();
            cond_resched();
            kernel_fpu_begin();
        }
    }
    kernel_fpu_end();

I'd have to go dig out the actual series, but I think they had a bunch
of helpers to deal with that.
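
Something of this shape, I think, though kernel_fpu_yield() is my guess
at a name rather than what that series actually called it:

    static inline void kernel_fpu_yield(void)
    {
        /* Briefly leave the FPU section if we need to reschedule. */
        if (need_resched()) {
            kernel_fpu_end();
            cond_resched();
            kernel_fpu_begin();
        }
    }

so the loop above collapses to:

    kernel_fpu_begin();
    for (units-of-work) {
        do_unit_of_work();
        kernel_fpu_yield();
    }
    kernel_fpu_end();
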
On Fri, Jun 15, 2018 at 8:53 PM Andy Lutomirski <[email protected]> wrote:
>
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <[email protected]> wrote:
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use. If the
> kernel is using something like AVX512 state, then kernel_fpu_end()
> will transfer an enormous amount of data no matter how clever the CPU
> is. And I think I once measured XSAVEOPT taking a hundred cycles or
> so even when RFBM==0, so it's not exactly super fast.

Indeed the speedup is really significant, especially for the AVX512
case. Here are some numbers from my laptop and a server taken a few
seconds ago:

AVX2 - Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
    Inside: 684617437
    Outside: 547710093
    Percent speedup: 24

AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
    Inside: 634415672
    Outside: 286698960
    Percent speedup: 121

This is from this test -- https://xn--4db.cc/F7RF2fhv/c . There are
probably various issues with that test case, and it's possible there
are other effects going on (the avx512 case looks particularly insane)
to make the difference _that_ drastic, but I think there's no doubt
that the optimization here is a meaningful one.
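
For reference, the inside-vs-outside measurement is roughly of this
shape (a simplified sketch, not the linked test verbatim; TRIALS,
thing, and encrypt() are placeholders):

    u64 t0, t1;
    int i;

    /* begin/end inside the loop */
    t0 = ktime_get_ns();
    for (i = 0; i < TRIALS; ++i) {
        kernel_fpu_begin();
        encrypt(thing);
        kernel_fpu_end();
    }
    t1 = ktime_get_ns();
    pr_info("Inside: %llu\n", t1 - t0);

    /* begin/end hoisted outside the loop */
    t0 = ktime_get_ns();
    kernel_fpu_begin();
    for (i = 0; i < TRIALS; ++i)
        encrypt(thing);
    kernel_fpu_end();
    t1 = ktime_get_ns();
    pr_info("Outside: %llu\n", t1 - t0);
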
On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <[email protected]> wrote:
> Didn't we recently do a bunch of crypto patches to help with this?
>
> I think they had the pattern:
>
> kernel_fpu_begin();
> for (units-of-work) {
>     do_unit_of_work();
>     if (need_resched()) {
>         kernel_fpu_end();
>         cond_resched();
>         kernel_fpu_begin();
>     }
> }
> kernel_fpu_end();

Right, so that's the thing -- this is an optimization easily available
to individual crypto primitives. But I'm interested in applying this
kind of optimization to an entire queue of, say, tiny packets, where
each packet is processed individually. Or, to a cryptographic
construction, where several different primitives are used, such that
it'd be meaningful not to have to take the performance hit of
end()/begin() in between each and every one of them.
On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <[email protected]> wrote:
> quite in the form you imagined. The idea that we've tossed around is
> to restore FPU state on return to user mode. Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> one has quite thought it through, but I think it should be outside the
> loop.) We'd update all the FPU accessors to understand the flag.

Yes! This is exactly what I was thinking. Then those calls to begin()
and end() could be placed as close to the actual FPU usage as
possible.
On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <[email protected]> wrote:
>> quite in the form you imagined. The idea that we've tossed around is
>> to restore FPU state on return to user mode. Roughly, we'd introduce
>> a new thread flag TIF_FPU_UNLOADED (name TBD).
>> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
>> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
>> one has quite thought it through, but I think it should be outside the
>> loop.) We'd update all the FPU accessors to understand the flag.
> Yes! This is exactly what I was thinking. Then those calls to begin()
> and end() could be placed as close to the actual FPU usage as
> possible.

Andy, what was the specific concern about PKRU? That we might do:

    kernel_fpu_begin();          <- Saves the first time
    something()
    kernel_fpu_end();            <- Does not XRSTOR

    copy_from_user();            <- Sees old PKRU, does the wrong thing

    prepare_exit_to_usermode();  <- Does the XRSTOR
    // only now does PKRU have the right value
    SYSRET/IRET

?

Does that *matter* unless something() modified PKRU? We could just make
the rule that nobody is supposed to mess with it and that it's not
covered by kernel_fpu_begin/end() semantics. We could even
theoretically enforce that in a debug environment if we watch its value.
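
The watching could be as simple as this sketch (kfpu_pkru_snapshot is an
invented per-cpu variable; read_pkru() and CONFIG_X86_DEBUG_FPU already
exist):

    static DEFINE_PER_CPU(u32, kfpu_pkru_snapshot);

    /* at the end of kernel_fpu_begin(): */
    if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) && boot_cpu_has(X86_FEATURE_OSPKE))
        this_cpu_write(kfpu_pkru_snapshot, read_pkru());

    /* at the top of kernel_fpu_end(): warn if the section touched PKRU */
    if (IS_ENABLED(CONFIG_X86_DEBUG_FPU) && boot_cpu_has(X86_FEATURE_OSPKE))
        WARN_ON_ONCE(read_pkru() != this_cpu_read(kfpu_pkru_snapshot));
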
On Fri, Jun 15, 2018 at 10:27 PM Jason A. Donenfeld <[email protected]> wrote:
> AVX512 - Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
> Inside: 634415672
> Outside: 286698960
> Percent speedup: 121

More realistic results, with turbo turned off:

    Inside: 616851298
    Outside: 339606790
    Percent speedup: 81

Still a pretty outstanding speedup.
> On Jun 15, 2018, at 1:42 PM, Dave Hansen <[email protected]> wrote:
>
>> On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
>>> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <[email protected]> wrote:
>>> quite in the form you imagined. The idea that we've tossed around is
>>> to restore FPU state on return to user mode. Roughly, we'd introduce
>>> a new thread flag TIF_FPU_UNLOADED (name TBD).
>>> prepare_exit_to_usermode() would notice this flag, copy the fpstate to
>>> fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
>>> one has quite thought it through, but I think it should be outside the
>>> loop.) We'd update all the FPU accessors to understand the flag.
>> Yes! This is exactly what I was thinking. Then those calls to begin()
>> and end() could be placed as close to the actual FPU usage as
>> possible.
>
> Andy, what was the specific concern about PKRU? That we might do:
>
> kernel_fpu_begin(); <- Saves the first time
> something()
> kernel_fpu_end(); <- Does not XRSTOR
>
> copy_from_user(); <- Sees old PKRU, does the wrong thing
>
> prepare_exit_to_usermode(); <- Does the XRSTOR
> // only now does PKRU have the right value
> SYSRET/IRET
>
> ?
>
> Does that *matter* unless something() modified PKRU? We could just make
> the rule that nobody is supposed to mess with it and that it's not
> covered by kernel_fpu_begin/end() semantics. We could even
> theoretically enforce that in a debug environment if we watch its value.

Indeed, this case seems highly unlikely. I was imagining we have two tasks. Task A enters the kernel and sleeps. Task B runs and gets its PKRU loaded. Then it enters the kernel and we switch back to A. Now we have A’s FPU state in memory and B’s PKRU in the register. If A does copy_to_user, we lose.
On Fri, 2018-06-15 at 13:42 -0700, Dave Hansen wrote:
> On 06/15/2018 01:33 PM, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <[email protected]>
> > wrote:
> > > quite in the form you imagined. The idea that we've tossed
> > > around is
> > > to restore FPU state on return to user mode. Roughly, we'd
> > > introduce
> > > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > > prepare_exit_to_usermode() would notice this flag, copy the
> > > fpstate to
> > > fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() --
> > > No
> > > one has quite thought it through, but I think it should be
> > > outside the
> > > loop.) We'd update all the FPU accessors to understand the flag.
> >
> > Yes! This is exactly what I was thinking. Then those calls to
> > begin()
> > and end() could be placed as close to the actual FPU usage as
> > possible.
>
> Andy, what was the specific concern about PKRU? That we might do:
>
> kernel_fpu_begin(); <- Saves the first time
> something()
> kernel_fpu_end(); <- Does not XRSTOR
>
> copy_from_user(); <- Sees old PKRU, does the wrong thing
>
> prepare_exit_to_usermode(); <- Does the XRSTOR
> // only now does PKRU have the right value
> SYSRET/IRET
>
> ?
>
> Does that *matter* unless something() modified PKRU? We could just
> make
> the rule that nobody is supposed to mess with it and that it's not
> covered by kernel_fpu_begin/end() semantics. We could even
> theoretically enforce that in a debug environment if we watch its
> value.

KVM needs to change out guest and host PKRU values
when switching between guest and host mode, but
since f775b13eedee ("x86,kvm: move qemu/guest FPU
switching out to vcpu_run") that no longer happens
under kernel_fpu_begin/end so we don't need to care
about that :)

--
All Rights Reversed.
On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <[email protected]> wrote:
> > Didn't we recently do a bunch of crypto patches to help with this?
> >
> > I think they had the pattern:
> >
> > kernel_fpu_begin();
> > for (units-of-work) {
> >     do_unit_of_work();
> >     if (need_resched()) {
> >         kernel_fpu_end();
> >         cond_resched();
> >         kernel_fpu_begin();
> >     }
> > }
> > kernel_fpu_end();
>
> Right, so that's the thing -- this is an optimization easily available
> to individual crypto primitives. But I'm interested in applying this
> kind of optimization to an entire queue of, say, tiny packets, where
> each packet is processed individually. Or, to a cryptographic
> construction, where several different primitives are used, such that
> it'd be meaningful not to have to take the performance hit of
> end()/begin() in between each and every one of them.

I'm confused.. how does the above not apply to your situation?
On Mon, Jun 18, 2018 at 11:44 AM Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jun 15, 2018 at 10:30:46PM +0200, Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 9:34 PM Peter Zijlstra <[email protected]> wrote:
> > > Didn't we recently do a bunch of crypto patches to help with this?
> > >
> > > I think they had the pattern:
> > >
> > > kernel_fpu_begin();
> > > for (units-of-work) {
> > >     do_unit_of_work();
> > >     if (need_resched()) {
> > >         kernel_fpu_end();
> > >         cond_resched();
> > >         kernel_fpu_begin();
> > >     }
> > > }
> > > kernel_fpu_end();
> >
> > Right, so that's the thing -- this is an optimization easily available
> > to individual crypto primitives. But I'm interested in applying this
> > kind of optimization to an entire queue of, say, tiny packets, where
> > each packet is processed individually. Or, to a cryptographic
> > construction, where several different primitives are used, such that
> > it'd be meaningful not to have to take the performance hit of
> > end()/begin() in between each and every one of them.
>
> I'm confused.. how does the above not apply to your situation?

In the example you've given, the optimization is applied at the level
of the, say, encryption function. Suppose you send a scattergather off
to an encryption function, which then walks the sglist and encrypts
each of the parts using some particular key. For each of the parts, it
benefits from the above optimization.

But what I'm referring to is encrypting multiple different things,
with different keys. In the case I'd like to optimize, I have a worker
thread that's processing a large queue of separate sglists and
encrypting them separately under different keys. In this case, having
kernel_fpu_begin/end inside the encryption function itself is a
problem, since that means toggling the FPU in between every queue
item. The solution, for now, is to just hoist the kernel_fpu_begin/end
out of the encryption function, and put them instead at the beginning
and end of my worker thread that handles all the items of the queue.
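
Concretely, the worker currently has roughly this shape (a sketch with
made-up names -- crypt_queue, crypt_item, dequeue() -- not the actual
WireGuard code):

    struct crypt_queue {
        struct work_struct work;
        /* ...queue of struct crypt_item, each with its own key... */
    };

    static void crypt_worker(struct work_struct *work)
    {
        struct crypt_queue *queue =
            container_of(work, struct crypt_queue, work);
        struct crypt_item *item;

        kernel_fpu_begin();    /* hoisted out of the per-item path */
        while ((item = dequeue(queue)) != NULL)
            encrypt(item->sg, item->key);    /* per-item keys */
        kernel_fpu_end();
    }
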
This is fine and dandy, but far from ideal, as putting that kind of
logic inside the encryption function itself makes more sense. For
example, the encryption function can decide whether or not it even
wants to use the FPU before calling kernel_fpu_begin. Ostensibly this
logic too could be hoisted outside, but at what point do you draw the
line and decide these optimizations are leading the API in the wrong
direction?

Hence, the idea here in this thread is to make it cost-free to place
kernel_fpu_begin/end as close as possible to the actual use of the
FPU. The solution, it seems, is to have the actual kernel_fpu_end work
occur on context switch.
From: Andy Lutomirski
> Sent: 15 June 2018 19:54
> On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> <[email protected]> wrote:
> >
> > On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > > for (thing) {
> > >     kernel_fpu_begin();
> > >     encrypt(thing);
> > >     kernel_fpu_end();
> > > }
> >
> > Don't forget that the processor has optimizations for this, too. The
> > "modified optimization" will notice that between:
> >
> > kernel_fpu_end(); -> XRSTOR
> > and
> > kernel_fpu_begin(); -> XSAVE(S|OPT)
> >
> > the processor has not modified the states. It'll skip doing any writes
> > of the state. Doing what Andy is describing is still way better than
> > letting the processor do it, but you should just know up front that this
> > may not be as much of a win as you would expect.
>
> Even with the modified optimization, kernel_fpu_end() still needs to
> reload the state that was trashed by the kernel FPU use. If the
> kernel is using something like AVX512 state, then kernel_fpu_end()
> will transfer an enormous amount of data no matter how clever the CPU
> is. And I think I once measured XSAVEOPT taking a hundred cycles or
> so even when RFBM==0, so it's not exactly super fast.

If the kernel was entered by a system call, do you need to save the AVX512
state at all?

IIRC the registers are all defined as 'caller saved', so there is no expectation
that they will be saved across the syscall wrapper function call.
All you need to do is ensure that 'kernel' values aren't passed back to userspace.
There is a single instruction to zero all the AVX512 registers.

	David
On Tue, 19 Jun 2018, David Laight wrote:
> From: Andy Lutomirski
> > Sent: 15 June 2018 19:54
> > On Fri, Jun 15, 2018 at 11:50 AM Dave Hansen
> > <[email protected]> wrote:
> > >
> > > On 06/15/2018 11:31 AM, Andy Lutomirski wrote:
> > > > for (thing) {
> > > >     kernel_fpu_begin();
> > > >     encrypt(thing);
> > > >     kernel_fpu_end();
> > > > }
> > >
> > > Don't forget that the processor has optimizations for this, too. The
> > > "modified optimization" will notice that between:
> > >
> > > kernel_fpu_end(); -> XRSTOR
> > > and
> > > kernel_fpu_begin(); -> XSAVE(S|OPT)
> > >
> > > the processor has not modified the states. It'll skip doing any writes
> > > of the state. Doing what Andy is describing is still way better than
> > > letting the processor do it, but you should just know up front that this
> > > may not be as much of a win as you would expect.
> >
> > Even with the modified optimization, kernel_fpu_end() still needs to
> > reload the state that was trashed by the kernel FPU use. If the
> > kernel is using something like AVX512 state, then kernel_fpu_end()
> > will transfer an enormous amount of data no matter how clever the CPU
> > is. And I think I once measured XSAVEOPT taking a hundred cycles or
> > so even when RFBM==0, so it's not exactly super fast.
>
> If the kernel was entered by a system call do you need to save the AVX512
> state at all?
> IIRC the registers are all defined as 'caller saved', so there is no expectation
> that they will be saved across the syscall wrapper function call.
> All you need to do is ensure that 'kernel' values aren't passed back to userspace.
> There is a single instruction to zero all the AVX512 registers.

Then we need different treatment for exception entries and consecutive
preemption. Lots of corner cases to cover ...

Thanks,
tglx
On Wed, 2018-07-11 at 18:28 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-06-15 22:33:47 [+0200], Jason A. Donenfeld wrote:
> > On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <[email protected]>
> > wrote:
> > > quite in the form you imagined. The idea that we've tossed
> > > around is
> > > to restore FPU state on return to user mode. Roughly, we'd
> > > introduce
> > > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > > prepare_exit_to_usermode() would notice this flag, copy the
> > > fpstate to
> > > fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() --
> > > No
> > > one has quite thought it through, but I think it should be
> > > outside the
> > > loop.) We'd update all the FPU accessors to understand the flag.
> >
> > Yes! This is exactly what I was thinking. Then those calls to
> > begin()
> > and end() could be placed as close to the actual FPU usage as
> > possible.
>
> I was thinking about this myself. Did anyone try to hack something in
> the meantime? I might want to look into this, too :)

I have implemented this before, back before the
big rewrite of the syscall entry and exit code.
It seemed to work fine.

--
All Rights Reversed.
On 2018-06-15 22:33:47 [+0200], Jason A. Donenfeld wrote:
> On Fri, Jun 15, 2018 at 8:32 PM Andy Lutomirski <[email protected]> wrote:
> > quite in the form you imagined. The idea that we've tossed around is
> > to restore FPU state on return to user mode. Roughly, we'd introduce
> > a new thread flag TIF_FPU_UNLOADED (name TBD).
> > prepare_exit_to_usermode() would notice this flag, copy the fpstate to
> > fpregs, and clear the flag. (Or maybe exit_to_usermode_loop() -- No
> > one has quite thought it through, but I think it should be outside the
> > loop.) We'd update all the FPU accessors to understand the flag.
>
> Yes! This is exactly what I was thinking. Then those calls to begin()
> and end() could be placed as close to the actual FPU usage as
> possible.

I was thinking about this myself. Did anyone try to hack something in
the meantime? I might want to look into this, too :)

Sebastian