2022-05-04 17:15:07

by Thomas Gleixner

Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust

On Tue, May 03 2022 at 11:06, Peter Zijlstra wrote:
> On Mon, May 02, 2022 at 05:58:40PM +0200, Thomas Gleixner wrote:
>> Right, though currently it's guaranteed that softirq processing context
>> can use the FPU. Quite a bit of the network crypto work runs in softirq
>> context, so this might cause a regression. If so, then this needs to be
>> an explicit commit on top which is easy to revert. Let me stare at it
>> some more.
>
> Right, so with the:
>
> preempt_disable();
> this_cpu_write(fpu_in_use, true);
> barrier();
>
> sequence it is safe against both softirq and hardirq fpu usage. The only
> concern is performance not correctness when dropping that
> local_bh_disable() thing.
>
> So what Thomas proposes makes sense to me.
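
Spelled out as a begin/end pair, that would look roughly like the
sketch below. fpu_in_use is taken from your sequence; the pairing and
the names around it are mine, not the actual kernel_fpu_begin()/end()
implementation:

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(bool, fpu_in_use);

static inline void fpu_use_begin(void)
{
        preempt_disable();
        this_cpu_write(fpu_in_use, true);
        /* Order the flag write before any FPU usage */
        barrier();
}

static inline void fpu_use_end(void)
{
        /* Order the FPU usage before clearing the flag */
        barrier();
        this_cpu_write(fpu_in_use, false);
        preempt_enable();
}

An interrupt which hits the window between begin() and end() on the
same CPU sees fpu_in_use == true and has to fall back to the generic
code instead of touching the FPU.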

Now I was looking at it the other way round too, i.e. using
local_bh_disable() for both fpregs_lock() and kernel_fpu_begin().

That is not possible with the current constraints, because
kernel_fpu_begin() can be called from hard interrupt context.
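
For reference, fpregs_lock() today is essentially this (modulo
details):

static inline void fpregs_lock(void)
{
        /*
         * On !RT, disabling bottom halves implies disabling preemption
         * and keeps softirq users (e.g. network crypto) off the FPU.
         * local_bh_disable() is not usable from hard interrupt
         * context, which is what rules this scheme out for
         * kernel_fpu_begin().
         */
        if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                local_bh_disable();
        else
                preempt_disable();
}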

But the only use case which utilizes the FPU from hard interrupt
context is the random generator via add_*_randomness().

I did a benchmark of these functions, which invoke blake2s_update()
three times in a row, on a SKL-X and a ZEN3. The generic code and the
FPU-accelerated code are pretty much on par in terms of execution time
of the algorithm itself, plus/minus noise.
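
For illustration, the measurement was along these lines; the buffer
handling and the reporting below are made up, only the blake2s calls
reflect the pattern in question:

#include <crypto/blake2s.h>
#include <linux/printk.h>
#include <asm/timex.h>

static void bench_blake2s(const u8 *buf, size_t len)
{
        struct blake2s_state state;
        cycles_t t0, t1;

        blake2s_init(&state, BLAKE2S_HASH_SIZE);

        t0 = get_cycles();
        /* Mimic the add_*_randomness() pattern: three updates in a row */
        blake2s_update(&state, buf, len);
        blake2s_update(&state, buf, len);
        blake2s_update(&state, buf, len);
        t1 = get_cycles();

        pr_info("blake2s: %llu cycles for 3 x %zu bytes\n",
                (unsigned long long)(t1 - t0), len);
}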

But in case the interrupt hits a userspace task, the FPU state needs
to be saved, and if the interrupt does not result in rescheduling, the
return to user space has to restore it. That's _expensive_, and the
actual cost depends on the FPU state, but 200-300 cycles for the save
and 200-700 cycles for the restore are to be expected.

Even if we ignore the save/restore part and assume that it averages
out against schedule() having to save the FPU state anyway, there is
another aspect to this: power consumption, which also affects the
thermal budget and capacity.

That made me more curious, though, and I did the same comparison for
crc32c, which is heavily used by ext4. crc32c_pcl_intel_update()
already contains a fallback to the plain CRC32-instruction code when
the buffer length is less than 512 bytes. But even on larger buffers,
typically ~4k, the FPU variant is not a win: it's consistently slower
by a factor of ~1.4. And that's not due to XSAVE/XRSTOR overhead,
because these computations run on a worker thread which does not do
that dance at all.
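
That dispatch looks roughly like this (simplified from
arch/x86/crypto/crc32c-intel_glue.c, from memory, so take the names
with a grain of salt):

#include <crypto/internal/hash.h>
#include <crypto/internal/simd.h>
#include <asm/fpu/api.h>

#define CRC32C_PCL_BREAKEVEN    512

static int crc32c_pcl_intel_update(struct shash_desc *desc,
                                   const u8 *data, unsigned int len)
{
        u32 *crcp = shash_desc_ctx(desc);

        /*
         * Use the PCLMULQDQ (FPU) version only when the buffer is
         * large enough to amortize the FPU state save/restore cost.
         */
        if (len >= CRC32C_PCL_BREAKEVEN && crypto_simd_usable()) {
                kernel_fpu_begin();
                *crcp = crc_pcl(data, len, *crcp);
                kernel_fpu_end();
        } else {
                /* Plain CRC32 instruction, no FPU involved */
                *crcp = crc32c_intel_le_hw(*crcp, data, len);
        }
        return 0;
}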

IOW, using the FPU blindly for this kind of computation is not
necessarily a good plan. I have no idea how these things are analyzed
and evaluated, if at all. Maybe the crypto people can shed some light
on this.

Thanks,

tglx


2022-05-04 17:28:14

by Jason A. Donenfeld

Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust

Hi Thomas,

On Wed, May 04, 2022 at 05:36:38PM +0200, Thomas Gleixner wrote:
> But the only use case which utilizes the FPU from hard interrupt
> context is the random generator via add_*_randomness().
>
> I did a benchmark of these functions, which invoke blake2s_update()
> three times in a row, on a SKL-X and a ZEN3. The generic code and the
> FPU-accelerated code are pretty much on par in terms of execution time
> of the algorithm itself, plus/minus noise.
>
> IOW, using the FPU blindly for this kind of computation is not
> necessarily a good plan. I have no idea how these things are analyzed
> and evaluated, if at all. Maybe the crypto people can shed some light
> on this.

drivers/net/wireguard/{noise,cookie}.c makes pretty heavy use of BLAKE2s
in hot paths where the FPU is already being used for other algorithms,
and so there the save/restore is worth it (assuming restore finally
works lazily). In benchmarks, the SIMD code made a real difference.

But this presumably regards mix_pool_bytes() in the RNG. If it turns out
that supporting the FPU in hard IRQ context is a major PITA, and the RNG
is the only thing making use of it, then sure, drop hard IRQ context
support for it. However... This may be unearthing a larger bug.
Sebastian and I put in a decent amount of work during 5.18 to remove all
calls to mix_pool_bytes() (and hence to blake2s_compress()) from
add_interrupt_randomness(). Have a look:

https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/tree/drivers/char/random.c#n1289

It now accumulates in some per-CPU buffer, and then every 64 interrupts
a worker runs that does the actual mix_pool_bytes() from kthread
context.
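
The shape of it is roughly this (a simplified sketch from memory, not
the literal random.c code; INIT_WORK() setup omitted):

#include <linux/percpu.h>
#include <linux/workqueue.h>
#include <linux/timex.h>

struct fast_pool {
        struct work_struct mix;
        unsigned long pool[4];
        unsigned int count;
};

static DEFINE_PER_CPU(struct fast_pool, irq_randomness);

static void mix_interrupt_randomness(struct work_struct *work)
{
        struct fast_pool *fast_pool =
                container_of(work, struct fast_pool, mix);

        /* kthread context: the input pool lock and blake2s are fine here */
        mix_pool_bytes(fast_pool->pool, sizeof(fast_pool->pool));
        WRITE_ONCE(fast_pool->count, 0);
}

void add_interrupt_randomness(int irq)
{
        struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness);

        /* Cheap per-CPU accumulation; no hashing in hard irq context */
        fast_pool->pool[fast_pool->count & 3] ^=
                random_get_entropy() ^ irq;

        if (++fast_pool->count < 64)
                return;

        /* Every 64 interrupts, defer the expensive part to a worker */
        queue_work_on(raw_smp_processor_id(), system_highpri_wq,
                      &fast_pool->mix);
}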

So the question is: what is still hitting mix_pool_bytes() from hard IRQ
context? I'll investigate a bit and see.

Jason

2022-05-04 22:51:33

by Thomas Gleixner

Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust

Jason,

On Wed, May 04 2022 at 17:55, Jason A. Donenfeld wrote:
> On Wed, May 04, 2022 at 05:36:38PM +0200, Thomas Gleixner wrote:
>> But the only use case which utilizes the FPU from hard interrupt
>> context is the random generator via add_*_randomness().
>>
>> I did a benchmark of these functions, which invoke blake2s_update()
>> three times in a row, on a SKL-X and a ZEN3. The generic code and the
>> FPU-accelerated code are pretty much on par in terms of execution time
>> of the algorithm itself, plus/minus noise.
>>
>> IOW, using the FPU blindly for this kind of computation is not
>> necessarily a good plan. I have no idea how these things are analyzed
>> and evaluated, if at all. Maybe the crypto people can shed some light
>> on this.
>
> drivers/net/wireguard/{noise,cookie}.c makes pretty heavy use of BLAKE2s
> in hot paths where the FPU is already being used for other algorithms,
> and so there the save/restore is worth it (assuming restore finally
> works lazily). In benchmarks, the SIMD code made a real difference.

I'm sure there are very valid use cases, but just the two things I
looked at turned out to be at least questionable.

> But this presumably regards mix_pool_bytes() in the RNG. If it turns out
> that supporting the FPU in hard IRQ context is a major PITA, and the
> RNG

Supporting the FPU in hard interrupt context is possible if required,
and the preexisting bug which survived 10+ years has been fixed.

I just started to look into this because of that bug and due to the
inconsistency between the FPU protections we have. The inconsistency
comes from the hardirq requirement.

> is the only thing making use of it, then sure, drop hard IRQ context
> support for it. However... This may be unearthing a larger bug.
> Sebastian and I put in a decent amount of work during 5.18 to remove all
> calls to mix_pool_bytes() (and hence to blake2s_compress()) from
> add_interrupt_randomness(). Have a look:

I know.

> It now accumulates in some per-CPU buffer, and then every 64 interrupts
> a worker runs that does the actual mix_pool_bytes() from kthread
> context.

That's add_interrupt_randomness() and not affected by this.

> So the question is: what is still hitting mix_pool_bytes() from hard IRQ
> context? I'll investigate a bit and see.

add_disk_randomness() on !RT kernels. That's what made me look into this
in the first place as it unearthed the long-standing FPU protection
bug. See the first patch in this thread.

Possibly add_device_randomness() too, but I haven't seen evidence so far.

Thanks,

tglx

2022-05-04 23:08:10

by Jason A. Donenfeld

Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust

Hey Thomas,

On Wed, May 04, 2022 at 06:45:45PM +0200, Thomas Gleixner wrote:
> add_disk_randomness() on !RT kernels. That's what made me look into this
> in the first place as it unearthed the long-standing FPU protection
> bug. See the first patch in this thread.
>
> Possibly add_device_randomness() too, but I haven't seen evidence so far.

It looks like it's being hit from add_input_randomness() via
input_handle_event() too. There are two positions we could take toward
this:

One stance to take would be that if an event happens in an interrupt,
add_interrupt_randomness() should in theory already be siphashing in a
cycle counter, so additional calls to add_{input,disk}_randomness()
don't contribute substantially (unless you assume the num field has
some entropic value). If that's compelling, then the next thing to do
would be adding an `if (in_interrupt()) return;` early on in some
function, and from then on we impose a rule: "never mix into the input
pool directly from an irq".
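
Concretely, something like this, with add_input_randomness() as a
hypothetical example; only the early return is the new part:

void add_input_randomness(unsigned int type, unsigned int code,
                          unsigned int value)
{
        /*
         * Hypothetical rule: never mix into the input pool from an
         * irq; add_interrupt_randomness() already covers this event.
         */
        if (in_interrupt())
                return;

        /* the existing add_timer_randomness() path continues here */
}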

The other stance is that these input/disk events are relatively rare --
compared to, say, a storm of interrupts from a NIC -- so mixing into the
input pool from there isn't actually a problem, and we benefit from the
quasi domain-specific accounting and the superior mixing function there,
so keep it around. And the non-raw spinlock on the input pool won't
negatively affect RT from this context, because all its callers on RT
should be threaded.

The second stance seems easier and more conservative from a certain
perspective -- we don't need to change anything -- so I'm more inclined
toward it. And given that you've fixed the bug now, it sounds like
that's fine with you too. But if in fact you're thinking about it
differently, let me know.

Jason

2022-05-07 12:11:56

by Jason A. Donenfeld

Subject: Re: [patch 3/3] x86/fpu: Make FPU protection more robust

Hi Thomas,

On Wed, May 04, 2022 at 09:05:01PM +0200, Jason A. Donenfeld wrote:
> Hey Thomas,
>
> On Wed, May 04, 2022 at 06:45:45PM +0200, Thomas Gleixner wrote:
> > add_disk_randomness() on !RT kernels. That's what made me look into this
> > in the first place as it unearthed the long-standing FPU protection
> > bug. See the first patch in this thread.
> >
> > Possibly add_device_randomness() too, but I haven't seen evidence so far.
>
> It looks like it's being hit from add_input_randomness() via
> input_handle_event() too. There are two positions we could take toward
> this:
>
> One stance to take would be that if an event happens in an interrupt,
> add_interrupt_randomness() should in theory already be siphashing in a
> cycle counter, so additional calls to add_{input,disk}_randomness()
> don't contribute substantially (unless you assume the num field has
> some entropic value). If that's compelling, then the next thing to do
> would be adding an `if (in_interrupt()) return;` early on in some
> function, and from then on we impose a rule: "never mix into the input
> pool directly from an irq".
>
> The other stance is that these input/disk events are relatively rare --
> compared to, say, a storm of interrupts from a NIC -- so mixing into the
> input pool from there isn't actually a problem, and we benefit from the
> quasi domain-specific accounting and the superior mixing function there,
> so keep it around. And the non-raw spinlock on the input pool won't
> negatively affect RT from this context, because all its callers on RT
> should be threaded.

It turned out to be more complicated than this. Sometimes a given
interrupt handler would have multiple calls to add_disk_randomness(),
followed by a single call to add_interrupt_randomness(). Since those
multiple calls all represent different things being measured, the
first option of discarding them didn't seem like a good idea, but the
second option seemed pretty bad too, since the status quo is way too
much hashing from an irq handler. Further complicating things,
add_{input,disk}_randomness() is not called only from irq context or
only from elsewhere; different drivers do different things. Luckily
the way the entropy estimator currently works allows for a pretty
straightforward compromise, which I posted here:

https://lore.kernel.org/lkml/[email protected]/

The upshot for this discussion is that we can probably get rid of hard
irq FPU support. (This is in addition to the performance argument,
where it tentatively seemed like the difference wasn't /that/ big, and
not enough to justify the complexity.)

I intend to mull this over a bit more, though, so if my patch is to make
it in, it'll be for 5.19 and certainly not 5.18. So for 5.18, I think
it's probably a good idea to apply the patches in this thread. And then
for 5.19 we can take the next step.

Jason