2022-02-15 21:36:47

by Brian Geffon

[permalink] [raw]
Subject: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

There are two issues with PKRU handling prior to 5.13. The first is that
when eagerly switching PKRU we check that current is not a kernel
thread as kernel threads will never use PKRU. It's possible that
this_cpu_read_stable() on current_task (ie. get_current()) is returning
an old cached value. By forcing the read with this_cpu_read() the
correct task is used. Without this it's possible when switching from
a kernel thread to a userspace thread that we'll still observe the
PF_KTHREAD flag and never restore the PKRU. And as a result this
issue only occurs when switching from a kernel thread to a userspace
thread, switching from a non kernel thread works perfectly fine because
all we consider in that situation is the flags from some other non
kernel task and the next fpu is passed in to switch_fpu_finish().

Without reloading the value finish_fpu_load() after being inlined into
__switch_to() uses a stale value of current:

ba1: 8b 35 00 00 00 00 mov 0x0(%rip),%esi
ba7: f0 41 80 4d 01 40 lock orb $0x40,0x1(%r13)
bad: e9 00 00 00 00 jmp bb2 <__switch_to+0x1eb>
bb2: 41 f6 45 3e 20 testb $0x20,0x3e(%r13)
bb7: 75 1c jne bd5 <__switch_to+0x20e>

By using this_cpu_read() and avoiding the cached value the compiler does
insert an additional load instruction and observes the correct value now:

ba1: 8b 35 00 00 00 00 mov 0x0(%rip),%esi
ba7: f0 41 80 4d 01 40 lock orb $0x40,0x1(%r13)
bad: e9 00 00 00 00 jmp bb2 <__switch_to+0x1eb>
bb2: 65 48 8b 05 00 00 00 mov %gs:0x0(%rip),%rax
bb9: 00
bba: f6 40 3e 20 testb $0x20,0x3e(%rax)
bbe: 75 1c jne bdc <__switch_to+0x215>

The second issue is when using write_pkru() we only write to the
xstate when the feature bit is set because get_xsave_addr() returns
NULL when the feature bit is not set. This is problematic as the CPU
is free to clear the feature bit when it observes the xstate in the
init state, this behavior seems to be documented a few places throughout
the kernel. If the bit was cleared then in write_pkru() we would happily
write to PKRU without ever updating the xstate, and the FPU restore on
return to userspace would load the old value agian.

Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Brian Geffon <[email protected]>
Signed-off-by: Willis Kung <[email protected]>
Tested-by: Willis Kung <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 2 +-
arch/x86/include/asm/pgtable.h | 14 ++++++++++----
2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 03b3de491b5e..540bda5bdd28 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
* PKRU state is switched eagerly because it needs to be valid before we
* return to userland e.g. for a copy_to_user() operation.
*/
- if (!(current->flags & PF_KTHREAD)) {
+ if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
/*
* If the PKRU bit in xsave.header.xfeatures is not set,
* then the PKRU component was in init state, which means
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 9e71bf86d8d0..aa381b530de0 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return;

- pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
-
/*
* The PKRU value in xstate needs to be in sync with the value that is
* written to the CPU. The FPU restore on return to userland would
* otherwise load the previous value again.
*/
fpregs_lock();
- if (pk)
- pk->pkru = pkru;
+ /*
+ * The CPU is free to clear the feature bit when the xstate is in the
+ * init state. For this reason, we need to make sure the feature bit is
+ * reset when we're explicitly writing to pkru. If we did not then we
+ * would write to pkru and it would not be saved on a context switch.
+ */
+ current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
+ pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
+ BUG_ON(!pk);
+ pk->pkru = pkru;
__write_pkru(pkru);
fpregs_unlock();
}
--
2.35.1.265.g69c8d7142f-goog


2022-02-15 22:15:00

by Brian Geffon

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

On Tue, Feb 15, 2022 at 4:14 PM Guenter Roeck <[email protected]> wrote:
>
> On Tue, Feb 15, 2022 at 9:13 AM Dave Hansen <[email protected]> wrote:
> >
> > On 2/15/22 07:36, Brian Geffon wrote:
> > > There are two issues with PKRU handling prior to 5.13.
> >
> > Are you sure both of these issues were introduced by 0cecca9d03c? I'm
> > surprised that the get_xsave_addr() issue is not older.
> >
> > Should this be two patches?
> >
> > > The first is that when eagerly switching PKRU we check that current
> >
> > Don't forget to write in imperative mood. No "we's", please.
> >
> > https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
> >
> > This goes for changelogs and comments too.
> >
> > > is not a kernel thread as kernel threads will never use PKRU. It's
> > > possible that this_cpu_read_stable() on current_task (ie.
> > > get_current()) is returning an old cached value. By forcing the read
> > > with this_cpu_read() the correct task is used. Without this it's
> > > possible when switching from a kernel thread to a userspace thread
> > > that we'll still observe the PF_KTHREAD flag and never restore the
> > > PKRU. And as a result this issue only occurs when switching from a
> > > kernel thread to a userspace thread, switching from a non kernel
> > > thread works perfectly fine because all we consider in that situation
> > > is the flags from some other non kernel task and the next fpu is
> > > passed in to switch_fpu_finish().
> >
> > It makes *sense* that there would be a place in the context switch code
> > where 'current' is wonky, but I never realized this. This seems really
> > fragile, but *also* trivially detectable.
> >
> > Is the PKRU code really the only code to use 'current' in a buggy way
> > like this?
> >
> > > The second issue is when using write_pkru() we only write to the
> > > xstate when the feature bit is set because get_xsave_addr() returns
> > > NULL when the feature bit is not set. This is problematic as the CPU
> > > is free to clear the feature bit when it observes the xstate in the
> > > init state, this behavior seems to be documented a few places throughout
> > > the kernel. If the bit was cleared then in write_pkru() we would happily
> > > write to PKRU without ever updating the xstate, and the FPU restore on
> > > return to userspace would load the old value agian.
> >
> >
> > ^ again
> >
> > It's probably worth noting that the AMD init tracker is a lot more
> > aggressive than Intel's. On Intel, I think XRSTOR is the only way to
> > get back to the init state. You're obviously hitting this on AMD.
> >
>
> Brian should correct me here, but I think we have seen this with one
> specific Intel CPU.
>
> Brian, would it make sense to list the affected CPU model(s), or at
> least the ones where we have observed the problem ?

The only CPU I have access to at the moment with OSPKE is an 11th Gen
Core i5-1135G7, so that's the only one I've observed it on. I can try
to search around for other hardware.

Brian

>
> Thanks,
> Guenter
>
> > It's also *very* unlikely that PKRU gets back to a value of 0. I think
> > we added a selftest for this case in later kernels.
> >
> > That helps explain why this bug hung around for so long.
> >
> > > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> > > index 03b3de491b5e..540bda5bdd28 100644
> > > --- a/arch/x86/include/asm/fpu/internal.h
> > > +++ b/arch/x86/include/asm/fpu/internal.h
> > > @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> > > * PKRU state is switched eagerly because it needs to be valid before we
> > > * return to userland e.g. for a copy_to_user() operation.
> > > */
> > > - if (!(current->flags & PF_KTHREAD)) {
> > > + if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
> >
> > This really deserves a specific comment.
> >
> > > /*
> > > * If the PKRU bit in xsave.header.xfeatures is not set,
> > > * then the PKRU component was in init state, which means
> > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > > index 9e71bf86d8d0..aa381b530de0 100644
> > > --- a/arch/x86/include/asm/pgtable.h
> > > +++ b/arch/x86/include/asm/pgtable.h
> > > @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> > > if (!boot_cpu_has(X86_FEATURE_OSPKE))
> > > return;
> > >
> > > - pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > > -
> > > /*
> > > * The PKRU value in xstate needs to be in sync with the value that is
> > > * written to the CPU. The FPU restore on return to userland would
> > > * otherwise load the previous value again.
> > > */
> > > fpregs_lock();
> > > - if (pk)
> > > - pk->pkru = pkru;
> > > + /*
> > > + * The CPU is free to clear the feature bit when the xstate is in the
> > > + * init state. For this reason, we need to make sure the feature bit is
> > > + * reset when we're explicitly writing to pkru. If we did not then we
> > > + * would write to pkru and it would not be saved on a context switch.
> > > + */
> > > + current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
> >
> > I don't think we need to describe how the init optimization works again.
> > I'm also not sure it's worth mentioning context switches here. It's a
> > wider problem than that. Maybe:
> >
> > /*
> > * All fpregs will be XRSTOR'd from this buffer before returning
> > * to userspace. Ensure that XRSTOR does not init PKRU and that
> > * get_xsave_addr() will work.
> > */
> >
> > > + pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > > + BUG_ON(!pk);
> >
> > A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
> > much good.
> >
> > > + pk->pkru = pkru;
> > > __write_pkru(pkru);
> > > fpregs_unlock();
> > > }
> >

2022-02-16 06:11:08

by Guenter Roeck

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

On Tue, Feb 15, 2022 at 9:13 AM Dave Hansen <[email protected]> wrote:
>
> On 2/15/22 07:36, Brian Geffon wrote:
> > There are two issues with PKRU handling prior to 5.13.
>
> Are you sure both of these issues were introduced by 0cecca9d03c? I'm
> surprised that the get_xsave_addr() issue is not older.
>
> Should this be two patches?
>
> > The first is that when eagerly switching PKRU we check that current
>
> Don't forget to write in imperative mood. No "we's", please.
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> This goes for changelogs and comments too.
>
> > is not a kernel thread as kernel threads will never use PKRU. It's
> > possible that this_cpu_read_stable() on current_task (ie.
> > get_current()) is returning an old cached value. By forcing the read
> > with this_cpu_read() the correct task is used. Without this it's
> > possible when switching from a kernel thread to a userspace thread
> > that we'll still observe the PF_KTHREAD flag and never restore the
> > PKRU. And as a result this issue only occurs when switching from a
> > kernel thread to a userspace thread, switching from a non kernel
> > thread works perfectly fine because all we consider in that situation
> > is the flags from some other non kernel task and the next fpu is
> > passed in to switch_fpu_finish().
>
> It makes *sense* that there would be a place in the context switch code
> where 'current' is wonky, but I never realized this. This seems really
> fragile, but *also* trivially detectable.
>
> Is the PKRU code really the only code to use 'current' in a buggy way
> like this?
>
> > The second issue is when using write_pkru() we only write to the
> > xstate when the feature bit is set because get_xsave_addr() returns
> > NULL when the feature bit is not set. This is problematic as the CPU
> > is free to clear the feature bit when it observes the xstate in the
> > init state, this behavior seems to be documented a few places throughout
> > the kernel. If the bit was cleared then in write_pkru() we would happily
> > write to PKRU without ever updating the xstate, and the FPU restore on
> > return to userspace would load the old value agian.
>
>
> ^ again
>
> It's probably worth noting that the AMD init tracker is a lot more
> aggressive than Intel's. On Intel, I think XRSTOR is the only way to
> get back to the init state. You're obviously hitting this on AMD.
>

Brian should correct me here, but I think we have seen this with one
specific Intel CPU.

Brian, would it make sense to list the affected CPU model(s), or at
least the ones where we have observed the problem ?

Thanks,
Guenter

> It's also *very* unlikely that PKRU gets back to a value of 0. I think
> we added a selftest for this case in later kernels.
>
> That helps explain why this bug hung around for so long.
>
> > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> > index 03b3de491b5e..540bda5bdd28 100644
> > --- a/arch/x86/include/asm/fpu/internal.h
> > +++ b/arch/x86/include/asm/fpu/internal.h
> > @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> > * PKRU state is switched eagerly because it needs to be valid before we
> > * return to userland e.g. for a copy_to_user() operation.
> > */
> > - if (!(current->flags & PF_KTHREAD)) {
> > + if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
>
> This really deserves a specific comment.
>
> > /*
> > * If the PKRU bit in xsave.header.xfeatures is not set,
> > * then the PKRU component was in init state, which means
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index 9e71bf86d8d0..aa381b530de0 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> > if (!boot_cpu_has(X86_FEATURE_OSPKE))
> > return;
> >
> > - pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > -
> > /*
> > * The PKRU value in xstate needs to be in sync with the value that is
> > * written to the CPU. The FPU restore on return to userland would
> > * otherwise load the previous value again.
> > */
> > fpregs_lock();
> > - if (pk)
> > - pk->pkru = pkru;
> > + /*
> > + * The CPU is free to clear the feature bit when the xstate is in the
> > + * init state. For this reason, we need to make sure the feature bit is
> > + * reset when we're explicitly writing to pkru. If we did not then we
> > + * would write to pkru and it would not be saved on a context switch.
> > + */
> > + current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
>
> I don't think we need to describe how the init optimization works again.
> I'm also not sure it's worth mentioning context switches here. It's a
> wider problem than that. Maybe:
>
> /*
> * All fpregs will be XRSTOR'd from this buffer before returning
> * to userspace. Ensure that XRSTOR does not init PKRU and that
> * get_xsave_addr() will work.
> */
>
> > + pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > + BUG_ON(!pk);
>
> A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
> much good.
>
> > + pk->pkru = pkru;
> > __write_pkru(pkru);
> > fpregs_unlock();
> > }
>

2022-02-16 06:55:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

On 2/15/22 07:36, Brian Geffon wrote:
> There are two issues with PKRU handling prior to 5.13.

Are you sure both of these issues were introduced by 0cecca9d03c? I'm
surprised that the get_xsave_addr() issue is not older.

Should this be two patches?

> The first is that when eagerly switching PKRU we check that current

Don't forget to write in imperative mood. No "we's", please.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

This goes for changelogs and comments too.

> is not a kernel thread as kernel threads will never use PKRU. It's
> possible that this_cpu_read_stable() on current_task (ie.
> get_current()) is returning an old cached value. By forcing the read
> with this_cpu_read() the correct task is used. Without this it's
> possible when switching from a kernel thread to a userspace thread
> that we'll still observe the PF_KTHREAD flag and never restore the
> PKRU. And as a result this issue only occurs when switching from a
> kernel thread to a userspace thread, switching from a non kernel
> thread works perfectly fine because all we consider in that situation
> is the flags from some other non kernel task and the next fpu is
> passed in to switch_fpu_finish().

It makes *sense* that there would be a place in the context switch code
where 'current' is wonky, but I never realized this. This seems really
fragile, but *also* trivially detectable.

Is the PKRU code really the only code to use 'current' in a buggy way
like this?

> The second issue is when using write_pkru() we only write to the
> xstate when the feature bit is set because get_xsave_addr() returns
> NULL when the feature bit is not set. This is problematic as the CPU
> is free to clear the feature bit when it observes the xstate in the
> init state, this behavior seems to be documented a few places throughout
> the kernel. If the bit was cleared then in write_pkru() we would happily
> write to PKRU without ever updating the xstate, and the FPU restore on
> return to userspace would load the old value agian.


^ again

It's probably worth noting that the AMD init tracker is a lot more
aggressive than Intel's. On Intel, I think XRSTOR is the only way to
get back to the init state. You're obviously hitting this on AMD.

It's also *very* unlikely that PKRU gets back to a value of 0. I think
we added a selftest for this case in later kernels.

That helps explain why this bug hung around for so long.

> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 03b3de491b5e..540bda5bdd28 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> * PKRU state is switched eagerly because it needs to be valid before we
> * return to userland e.g. for a copy_to_user() operation.
> */
> - if (!(current->flags & PF_KTHREAD)) {
> + if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {

This really deserves a specific comment.

> /*
> * If the PKRU bit in xsave.header.xfeatures is not set,
> * then the PKRU component was in init state, which means
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 9e71bf86d8d0..aa381b530de0 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> if (!boot_cpu_has(X86_FEATURE_OSPKE))
> return;
>
> - pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
> /*
> * The PKRU value in xstate needs to be in sync with the value that is
> * written to the CPU. The FPU restore on return to userland would
> * otherwise load the previous value again.
> */
> fpregs_lock();
> - if (pk)
> - pk->pkru = pkru;
> + /*
> + * The CPU is free to clear the feature bit when the xstate is in the
> + * init state. For this reason, we need to make sure the feature bit is
> + * reset when we're explicitly writing to pkru. If we did not then we
> + * would write to pkru and it would not be saved on a context switch.
> + */
> + current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;

I don't think we need to describe how the init optimization works again.
I'm also not sure it's worth mentioning context switches here. It's a
wider problem than that. Maybe:

/*
* All fpregs will be XRSTOR'd from this buffer before returning
* to userspace. Ensure that XRSTOR does not init PKRU and that
* get_xsave_addr() will work.
*/

> + pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> + BUG_ON(!pk);

A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
much good.

> + pk->pkru = pkru;
> __write_pkru(pkru);
> fpregs_unlock();
> }