2022-05-03 01:13:14

by Thomas Gleixner

Subject: [patch 0/3] x86/fpu: Prevent FPU state corruption

The recent changes in the random code unearthed a long-standing FPU state
corruption due to a buggy condition for granting in-kernel FPU usage.

The following series addresses this issue and makes the code more robust.

Thanks,

tglx
---
arch/um/include/asm/fpu/api.h | 2
arch/x86/include/asm/fpu/api.h | 21 +-------
arch/x86/include/asm/simd.h | 2
arch/x86/kernel/fpu/core.c | 92 +++++++++++++++++++-----------------
net/netfilter/nft_set_pipapo_avx2.c | 2
5 files changed, 57 insertions(+), 62 deletions(-)




2022-05-04 18:03:25

by Jason A. Donenfeld

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

Hi Thomas,

On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
> The recent changes in the random code unearthed a long-standing FPU state
> corruption due to a buggy condition for granting in-kernel FPU usage.

Thanks for working that out. I've been banging my head over [1] for a
few days now trying to see if it's a mis-bisect or a real thing. I'll
ask Larry to retry with this patchset.

Jason

[1] https://lore.kernel.org/lkml/[email protected]/

2022-05-05 15:45:51

by Thomas Gleixner

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

On Wed, May 04 2022 at 17:40, Jason A. Donenfeld wrote:
> On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
>> The recent changes in the random code unearthed a long-standing FPU state
>> corruption due to a buggy condition for granting in-kernel FPU usage.
>
> Thanks for working that out. I've been banging my head over [1] for a
> few days now trying to see if it's a mis-bisect or a real thing. I'll
> ask Larry to retry with this patchset.

You could have just Cc'ed him. :) Did so now.

Larry, can you please test:

https://lore.kernel.org/lkml/[email protected]

Thanks,

Thomas

2022-05-18 04:55:33

by Jason A. Donenfeld

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

Hey Thomas,

On Wed, May 04, 2022 at 05:40:26PM +0200, Jason A. Donenfeld wrote:
> Hi Thomas,
>
> On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
> > The recent changes in the random code unearthed a long-standing FPU state
> > corruption due to a buggy condition for granting in-kernel FPU usage.
>
> Thanks for working that out. I've been banging my head over [1] for a
> few days now trying to see if it's a mis-bisect or a real thing. I'll
> ask Larry to retry with this patchset.

So, Larry's debugging was inconsistent and didn't result in anything I
could piece together into basic cause and effect. But luckily Vadim, who
maintains the VirtualBox drivers for Oracle, was able to reproduce the
issue and was able to conduct some real debugging. I've CC'd him here.
From talking with Vadim, here are some findings thus far:

- Certain Linux guest processes crash under high load.
- Windows kernel guest panics.

Observation: the Windows kernel uses SSSE3 all over the place, generated
by the compiler.

- Moving the mouse around helps induce the crash.

Observation: add_input_randomness() -> .. -> kernel_fpu_begin() -> blake2s_compress().

- The problem exhibits itself in rc7, so this patchset does not fix
the issue.
- Applying https://xn--4db.cc/ttEUSvdC fixes the issue.

Observation: the problem is definitely related to using the FPU in a
hard IRQ.

I went reading KVM to get some idea of why KVM does *not* have this
problem, and it looks like there's some careful code there about doing
xsave and such around IRQs. So my current theory is that VirtualBox's
VMM just forgot to do this, and until now this bug went unnoticed.
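
To make that theory concrete, here is a toy user-space model of it. This
is only a sketch: toy_cpu, toy_hardirq_uses_simd() and toy_run_guest()
are made-up names, not kernel, KVM or VirtualBox code, and a single
integer stands in for the whole vector register file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy CPU with a single vector register. Hypothetical, for illustration only. */
struct toy_cpu {
	uint64_t xmm0;	/* the live register contents */
};

/* Stand-in for a hardirq handler that uses SIMD and leaves scratch data behind. */
static void toy_hardirq_uses_simd(struct toy_cpu *cpu)
{
	cpu->xmm0 = 0xdeadbeef;
}

/*
 * Load "guest" value 42 into xmm0 and let one interrupt fire.
 * With save_around_irq set, the toy VMM stashes the guest register before
 * the handler runs and puts it back before the guest resumes.
 * Returns the value the guest reads back afterwards.
 */
static uint64_t toy_run_guest(struct toy_cpu *cpu, bool save_around_irq)
{
	uint64_t guest_save = 0;

	assert(cpu != NULL);
	cpu->xmm0 = 42;			/* guest state is live in the register */

	if (save_around_irq)
		guest_save = cpu->xmm0;	/* VMM stashes guest state */

	toy_hardirq_uses_simd(cpu);	/* IRQ hits while guest state is loaded */

	if (save_around_irq)
		cpu->xmm0 = guest_save;	/* VMM restores guest state */

	return cpu->xmm0;
}
```

Calling toy_run_guest(&cpu, true) hands back 42, while
toy_run_guest(&cpu, false) hands back the handler's scratch value, which
is both a corrupted guest register and host data visible to the guest.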

Since VirtualBox is out of tree (and an extremely messy codebase), and
this appears to be an out of tree module problem rather than a kernel
problem, I'm inclined to think that there's not much for us to do, at
least until we receive information to the contrary of this presumption.

But in case you do want to do something proactively, I don't have any
objections to just disabling the FPU in hard IRQ for 5.18. And in 5.19,
add_input_randomness() isn't going to hit that path anyway. But also,
doing nothing and letting the VirtualBox people figure out their bug
would be fine with me too. Either way, just wanted to give you a heads
up.

Jason

2022-05-18 11:15:08

by Jason A. Donenfeld

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

Hi Vadim,

On Wed, May 18, 2022 at 03:02:05AM +0200, Jason A. Donenfeld wrote:
> Observation: the problem is definitely related to using the FPU in a
> hard IRQ.

I wrote a tiny reproducer that should be pretty reliable for testing
this, attached below. I think this proves my working theory. Run this in
a VirtualBox VM, and then move your mouse around or hit the keyboard, or
do something that triggers the add_{input,disk}_randomness() path from a
hardirq handler. On my laptop, for example, the trackpoint goes via
hardirq, but the touchpad does not. As soon as I move the trackpoint
around, the below program prints "XSAVE is borked!".

Also, note that this isn't just "corruption" of the guest VM, but also
leaking secret contents of the host into the guest. So you might
really want to make sure VirtualBox issues a fix for this before 5.18,
as it's arguably security sensitive.

Regards,
Jason


#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	int status = 0;

	/* One spinning child per online CPU. */
	for (int i = 0, nproc = sysconf(_SC_NPROCESSORS_ONLN); i < nproc; ++i) {
		if (!fork()) {
			/* Die together with the parent. */
			prctl(PR_SET_PDEATHSIG, SIGKILL);
			/* Park 42 in xmm0 and spin until it reads back differently. */
			asm("movq $42, %%rax\n"
			    "movq %%rax, %%xmm0\n"
			    "0:\n"
			    "movq %%xmm0, %%rbx\n"
			    "cmpq %%rax, %%rbx\n"
			    "je 0b\n"
			    : : : "rax", "rbx", "xmm0", "cc");
			_exit(77);
		}
	}
	/* Blocks until a child exits; 77 means its loop ended because xmm0 changed. */
	wait(&status);
	if (WEXITSTATUS(status) == 77)
		printf("XSAVE is borked!\n");
	return 1;
}

2022-05-18 11:21:10

by Jason A. Donenfeld

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

On Wed, May 18, 2022 at 01:14:40PM +0200, Jason A. Donenfeld wrote:
> I wrote a tiny reproducer that should be pretty reliable for testing
> this, attached below. I think this proves my working theory. Run this in
> a VirtualBox VM, and then move your mouse around or hit the keyboard, or
> do something that triggers the add_{input,disk}_randomness() path from a
> hardirq handler. On my laptop, for example, the trackpoint goes via
> hardirq, but the touchpad does not. As soon as I move the trackpoint
> around, the below program prints "XSAVE is borked!".

I should also add that the VM should have as many CPUs as the host, to
better the chances that the irq arrives on the same CPU that the guest
VM is running on.

2022-05-18 13:10:07

by Thomas Gleixner

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

On Wed, May 18 2022 at 03:02, Jason A. Donenfeld wrote:
> On Wed, May 04, 2022 at 05:40:26PM +0200, Jason A. Donenfeld wrote:
>> On Sun, May 01, 2022 at 09:31:42PM +0200, Thomas Gleixner wrote:
>> > The recent changes in the random code unearthed a long-standing FPU state
>> > corruption due to a buggy condition for granting in-kernel FPU usage.
>>
>> Thanks for working that out. I've been banging my head over [1] for a
>> few days now trying to see if it's a mis-bisect or a real thing. I'll
>> ask Larry to retry with this patchset.
>
> So, Larry's debugging was inconsistent and didn't result in anything I
> could piece together into basic cause and effect. But luckily Vadim, who
> maintains the VirtualBox drivers for Oracle, was able to reproduce the
> issue and was able to conduct some real debugging. I've CC'd him here.
> From talking with Vadim, here are some findings thus far:
>
> - Certain Linux guest processes crash under high load.
> - Windows kernel guest panics.
>
> Observation: the Windows kernel uses SSSE3 all over the place, generated
> by the compiler.
>
> - Moving the mouse around helps induce the crash.
>
> Observation: add_input_randomness() -> .. -> kernel_fpu_begin() -> blake2s_compress().
>
> - The problem exhibits itself in rc7, so this patchset does not fix
> the issue.
> - Applying https://xn--4db.cc/ttEUSvdC fixes the issue.
>
> Observation: the problem is definitely related to using the FPU in a
> hard IRQ.
>
> I went reading KVM to get some idea of why KVM does *not* have this
> problem, and it looks like there's some careful code there about doing
> xsave and such around IRQs. So my current theory is that VirtualBox's
> VMM just forgot to do this, and until now this bug went unnoticed.

That's a very valid assumption. I audited all places which fiddle with
the FPU in Linus' tree and with the fix applied they're all safe.

> Since VirtualBox is out of tree (and an extremely messy codebase), and
> this appears to be an out of tree module problem rather than a kernel
> problem, I'm inclined to think that there's not much for us to do, at
> least until we receive information to the contrary of this presumption.

Agreed on all points.

> But in case you do want to do something proactively, I don't have any
> objections to just disabling the FPU in hard IRQ for 5.18. And in 5.19,
> add_input_randomness() isn't going to hit that path anyway. But also,
> doing nothing and letting the VirtualBox people figure out their bug
> would be fine with me too. Either way, just wanted to give you a heads
> up.

That virtualborx bug has to be fixed in any case, as this problem has
existed forever and drivers have sporadically used the FPU in hard
interrupt context in the past, so it's sheer luck that this didn't
explode before. AFAICT all of this has been moved to softirq context
over the years, so the random code is probably the sole hard interrupt
user in mainline today.

In the interest of users we should probably bite the bullet and just
disable hard interrupt FPU usage upstream and Cc stable. The stable
kernel updates probably reach users faster.
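
As a rough illustration of what "disable hard interrupt FPU usage"
means for callers, here is a user-space sketch. It is not the kernel's
actual check: the gate_* names and the three-context enum are invented
for this example, and the "wide" routine merely stands in for a SIMD
implementation. The point is that callers already need a generic
fallback, so refusing the FPU in hardirq only changes which path runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical execution contexts for this sketch; not kernel definitions. */
enum gate_ctx { GATE_TASK, GATE_SOFTIRQ, GATE_HARDIRQ };

/* The policy under discussion: never grant FPU/SIMD use in hard interrupt context. */
static bool gate_fpu_usable(enum gate_ctx ctx)
{
	return ctx != GATE_HARDIRQ;
}

/* Generic path: plain byte-wise sum, fine in any context. */
static unsigned int gate_sum_generic(const unsigned char *p, size_t n)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += p[i];
	return sum;
}

/* Stand-in for an accelerated path: same result, four bytes per step. */
static unsigned int gate_sum_wide(const unsigned char *p, size_t n)
{
	unsigned int sum = 0;
	size_t i = 0;

	for (; i + 4 <= n; i += 4)
		sum += p[i] + p[i + 1] + p[i + 2] + p[i + 3];
	for (; i < n; i++)
		sum += p[i];
	return sum;
}

/* Pick a path based on context and report which one ran via *used_wide. */
static unsigned int gate_sum(enum gate_ctx ctx, const unsigned char *p,
			     size_t n, bool *used_wide)
{
	assert(p != NULL && used_wide != NULL);
	*used_wide = gate_fpu_usable(ctx);
	return *used_wide ? gate_sum_wide(p, n) : gate_sum_generic(p, n);
}
```

With this gate, task and softirq callers take the wide path and a
hardirq caller gets the same result from the generic one.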

Thanks,

tglx



2022-05-18 14:14:05

by Jason A. Donenfeld

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

Hi Thomas,

On Wed, May 18, 2022 at 03:09:54PM +0200, Thomas Gleixner wrote:
> In the interest of users we should probably bite the bullet and just
> disable hard interrupt FPU usage upstream and Cc stable. The stable
> kernel updates probably reach users faster.

Considering <https://git.zx2c4.com/linux-rng/commit/?id=e3e33fc2e> is
slated for 5.19, that seems fine, as this will remove what is hopefully
the last hardirq FPU user.

The bigger motivation, though, for removing the hardirq FPU code would
be for just simplifying the FPU handling. The VirtualBox people are
going to have to fix this bug anyway, since it also affects old kernels
they support.

Jason

2022-05-28 19:55:34

by Jason A. Donenfeld

Subject: Re: [patch 0/3] x86/fpu: Prevent FPU state corruption

Hey folks,

An update on the VirtualBox thing (not that anybody here actually
cares, but at least I've been nerd sniped):

It looks like they've got a patch out now for it:
https://www.virtualbox.org/attachment/ticket/20914/vbox-linux-5.18.patch
It seems like they call kernel_fpu_begin() before loading guest fpu
regs, but then immediately re-enable preemption. Yikes. So if another
kernel task preempts that one and uses the fpu... And it makes me
wonder about the extent of this breakage previously (maybe not just
hardirq? unsure.). Also, it apparently is only enabled for 5.18, so
that doesn't help with old kernels. Oh well. All that development
happens behind closed doors so probably not worth losing sleep over.
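
A toy model of why that early re-enable worries me. This is a sketch
under heavy assumptions: win_cpu and the win_* helpers are invented
here, a bare counter stands in for the preemption guard, and nothing
below reflects how the kernel really tracks FPU ownership or what the
VirtualBox patch literally does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy CPU for this sketch; the fields are hypothetical, not kernel state. */
struct win_cpu {
	int preempt_count;	/* > 0 means the current task cannot be preempted */
	uint64_t xmm0;		/* the live register contents */
};

static void win_preempt_disable(struct win_cpu *cpu) { cpu->preempt_count++; }
static void win_preempt_enable(struct win_cpu *cpu)  { cpu->preempt_count--; }

/* Another task wants the CPU; it only runs (and uses xmm0) if preemption is allowed. */
static bool win_try_preempt(struct win_cpu *cpu)
{
	if (cpu->preempt_count > 0)
		return false;
	cpu->xmm0 = 0xbadc0de;
	return true;
}

/*
 * Load "guest" value 42 under the guard, optionally drop the guard early,
 * then let another task try to run. Returns what the guest would see.
 */
static uint64_t win_load_guest(struct win_cpu *cpu, bool reenable_early)
{
	assert(cpu != NULL);
	cpu->preempt_count = 0;

	win_preempt_disable(cpu);	/* models the guard taken by the "begin" step */
	cpu->xmm0 = 42;			/* guest registers loaded */
	if (reenable_early)
		win_preempt_enable(cpu);	/* guard dropped while guest state is live */

	win_try_preempt(cpu);		/* a competing task shows up */

	if (!reenable_early)
		win_preempt_enable(cpu);
	return cpu->xmm0;
}
```

With the guard held across the whole section the guest still sees 42;
with the guard dropped early, the competing task's value ends up in the
register instead.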

Anyway, from a kernel perspective, there's now no urgency for us to do
anything about this. VirtualBox won't compile for 5.18 without that
patch, and given that Oracle fixed it, it doesn't appear to be our
bug. Case closed. So if we go ahead with removing hardirq fpu support,
it won't be due to this but for other motivations.

Jason