2022-02-15 20:01:11

by Brian Geffon

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

Hi Dave,
Thank you for taking a look at this.

On Tue, Feb 15, 2022 at 12:13 PM Dave Hansen <[email protected]> wrote:
>
> On 2/15/22 07:36, Brian Geffon wrote:
> > There are two issues with PKRU handling prior to 5.13.
>
> Are you sure both of these issues were introduced by 0cecca9d03c? I'm
> surprised that the get_xsave_addr() issue is not older.
>
> Should this be two patches?

You're right, the get_xsave_addr() issue is much older than the eager
reloading of PKRU. I'll split this out into two patches.

>
> > The first is that when eagerly switching PKRU we check that current
>
> Don't forget to write in imperative mood. No "we's", please.
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
>
> This goes for changelogs and comments too.

This will be corrected in future patches.

>
> > is not a kernel thread as kernel threads will never use PKRU. It's
> > possible that this_cpu_read_stable() on current_task (ie.
> > get_current()) is returning an old cached value. By forcing the read
> > with this_cpu_read() the correct task is used. Without this it's
> > possible when switching from a kernel thread to a userspace thread
> > that we'll still observe the PF_KTHREAD flag and never restore the
> > PKRU. And as a result this issue only occurs when switching from a
> > kernel thread to a userspace thread, switching from a non kernel
> > thread works perfectly fine because all we consider in that situation
> > is the flags from some other non kernel task and the next fpu is
> > passed in to switch_fpu_finish().
>
> It makes *sense* that there would be a place in the context switch code
> where 'current' is wonky, but I never realized this. This seems really
> fragile, but *also* trivially detectable.
>
> Is the PKRU code really the only code to use 'current' in a buggy way
> like this?

Yes, because the remaining code in __switch_to() references the next
task as next_p rather than current. Technically, it might be more
correct to pass next_p to switch_fpu_finish(), what do you think? This
may make sense since we're also passing the next fpu anyway.

>
> > The second issue is when using write_pkru() we only write to the
> > xstate when the feature bit is set because get_xsave_addr() returns
> > NULL when the feature bit is not set. This is problematic as the CPU
> > is free to clear the feature bit when it observes the xstate in the
> > init state, this behavior seems to be documented a few places throughout
> > the kernel. If the bit was cleared then in write_pkru() we would happily
> > write to PKRU without ever updating the xstate, and the FPU restore on
> > return to userspace would load the old value agian.
>
>
> ^ again
>
> It's probably worth noting that the AMD init tracker is a lot more
> aggressive than Intel's. On Intel, I think XRSTOR is the only way to
> get back to the init state. You're obviously hitting this on AMD.
>
> It's also *very* unlikely that PKRU gets back to a value of 0. I think
> we added a selftest for this case in later kernels.
>
> That helps explain why this bug hung around for so long.
>
> > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> > index 03b3de491b5e..540bda5bdd28 100644
> > --- a/arch/x86/include/asm/fpu/internal.h
> > +++ b/arch/x86/include/asm/fpu/internal.h
> > @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
> > * PKRU state is switched eagerly because it needs to be valid before we
> > * return to userland e.g. for a copy_to_user() operation.
> > */
> > - if (!(current->flags & PF_KTHREAD)) {
> > + if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {
>
> This really deserves a specific comment.
>
> > /*
> > * If the PKRU bit in xsave.header.xfeatures is not set,
> > * then the PKRU component was in init state, which means
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > index 9e71bf86d8d0..aa381b530de0 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
> > if (!boot_cpu_has(X86_FEATURE_OSPKE))
> > return;
> >
> > - pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > -
> > /*
> > * The PKRU value in xstate needs to be in sync with the value that is
> > * written to the CPU. The FPU restore on return to userland would
> > * otherwise load the previous value again.
> > */
> > fpregs_lock();
> > - if (pk)
> > - pk->pkru = pkru;
> > + /*
> > + * The CPU is free to clear the feature bit when the xstate is in the
> > + * init state. For this reason, we need to make sure the feature bit is
> > + * reset when we're explicitly writing to pkru. If we did not then we
> > + * would write to pkru and it would not be saved on a context switch.
> > + */
> > + current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;
>
> I don't think we need to describe how the init optimization works again.
> I'm also not sure it's worth mentioning context switches here. It's a
> wider problem than that. Maybe:
>
> /*
> * All fpregs will be XRSTOR'd from this buffer before returning
> * to userspace. Ensure that XRSTOR does not init PKRU and that
> * get_xsave_addr() will work.
> */
>
> > + pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> > + BUG_ON(!pk);
>
> A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
> much good.
>
> > + pk->pkru = pkru;
> > __write_pkru(pkru);
> > fpregs_unlock();
> > }
>

Brian


2022-02-15 22:13:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency

On 2/15/22 09:50, Brian Geffon wrote:
>>> is not a kernel thread as kernel threads will never use PKRU. It's
>>> possible that this_cpu_read_stable() on current_task (ie.
>>> get_current()) is returning an old cached value. By forcing the read
>>> with this_cpu_read() the correct task is used. Without this it's
>>> possible when switching from a kernel thread to a userspace thread
>>> that we'll still observe the PF_KTHREAD flag and never restore the
>>> PKRU. And as a result this issue only occurs when switching from a
>>> kernel thread to a userspace thread, switching from a non kernel
>>> thread works perfectly fine because all we consider in that situation
>>> is the flags from some other non kernel task and the next fpu is
>>> passed in to switch_fpu_finish().
>>
>> It makes *sense* that there would be a place in the context switch code
>> where 'current' is wonky, but I never realized this. This seems really
>> fragile, but *also* trivially detectable.
>>
>> Is the PKRU code really the only code to use 'current' in a buggy way
>> like this?
>
> Yes, because the remaining code in __switch_to() references the next
> task as next_p rather than current. Technically, it might be more
> correct to pass next_p to switch_fpu_finish(), what do you think? This
> may make sense since we're also passing the next fpu anyway.

Yeah, passing next_p instead of next_fpu makes a lot of sense to me.

2022-02-16 04:39:39

by Brian Geffon

[permalink] [raw]
Subject: [PATCH stable 5.4,5.10] x86/fpu: Correct pkru/xstate inconsistency

When eagerly switching PKRU in switch_fpu_finish() it checks that
current is not a kernel thread as kernel threads will never use PKRU.
It's possible that this_cpu_read_stable() on current_task
(ie. get_current()) is returning an old cached value. To resolve this
reference next_p directly rather than relying on current.

As written it's possible when switching from a kernel thread to a
userspace thread to observe a cached PF_KTHREAD flag and never restore
the PKRU. And as a result this issue only occurs when switching
from a kernel thread to a userspace thread, switching from a non kernel
thread works perfectly fine because all that is considered in that
situation are the flags from some other non kernel task and the next fpu
is passed in to switch_fpu_finish().

This behavior only exists between 5.2 and 5.13 when it was fixed by a
rewrite decoupling PKRU from xstate, in:
commit 954436989cc5 ("x86/fpu: Remove PKRU handling from switch_fpu_finish()")

Unfortunately backporting the fix from 5.13 is probably not realistic as
it's part of a 60+ patch series which rewrites most of the PKRU handling.

Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
Signed-off-by: Brian Geffon <[email protected]>
Signed-off-by: Willis Kung <[email protected]>
Tested-by: Willis Kung <[email protected]>
Cc: <[email protected]> # v5.4.x
Cc: <[email protected]> # v5.10.x
---
arch/x86/include/asm/fpu/internal.h | 13 ++++++++-----
arch/x86/kernel/process_32.c | 6 ++----
arch/x86/kernel/process_64.c | 6 ++----
3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 03b3de491b5e..5ed702e2c55f 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -560,9 +560,11 @@ static inline void __fpregs_load_activate(void)
* The FPU context is only stored/restored for a user task and
* PF_KTHREAD is used to distinguish between kernel and user threads.
*/
-static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
+static inline void switch_fpu_prepare(struct task_struct *prev, int cpu)
{
- if (static_cpu_has(X86_FEATURE_FPU) && !(current->flags & PF_KTHREAD)) {
+ struct fpu *old_fpu = &prev->thread.fpu;
+
+ if (static_cpu_has(X86_FEATURE_FPU) && !(prev->flags & PF_KTHREAD)) {
if (!copy_fpregs_to_fpstate(old_fpu))
old_fpu->last_cpu = -1;
else
@@ -581,10 +583,11 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
* Load PKRU from the FPU context if available. Delay loading of the
* complete FPU state until the return to userland.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct task_struct *next)
{
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
+ struct fpu *next_fpu = &next->thread.fpu;

if (!static_cpu_has(X86_FEATURE_FPU))
return;
@@ -598,7 +601,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
* PKRU state is switched eagerly because it needs to be valid before we
* return to userland e.g. for a copy_to_user() operation.
*/
- if (!(current->flags & PF_KTHREAD)) {
+ if (!(next->flags & PF_KTHREAD)) {
/*
* If the PKRU bit in xsave.header.xfeatures is not set,
* then the PKRU component was in init state, which means
@@ -607,7 +610,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
* in memory is not valid. This means pkru_val has to be
* set to 0 and not to init_pkru_value.
*/
- pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+ pk = get_xsave_addr(&next_fpu->state.xsave, XFEATURE_PKRU);
pkru_val = pk ? pk->pkru : 0;
}
__write_pkru(pkru_val);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..352f876950ab 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -229,14 +229,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread,
*next = &next_p->thread;
- struct fpu *prev_fpu = &prev->fpu;
- struct fpu *next_fpu = &next->fpu;
int cpu = smp_processor_id();

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

if (!test_thread_flag(TIF_NEED_FPU_LOAD))
- switch_fpu_prepare(prev_fpu, cpu);
+ switch_fpu_prepare(prev_p, cpu);

/*
* Save away %gs. No need to save %fs, as it was saved on the
@@ -292,7 +290,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

this_cpu_write(current_task, next_p);

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(next_p);

/* Load the Intel cache allocation PQR MSR. */
resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index da3cc3a10d63..633788362906 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -505,15 +505,13 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread;
struct thread_struct *next = &next_p->thread;
- struct fpu *prev_fpu = &prev->fpu;
- struct fpu *next_fpu = &next->fpu;
int cpu = smp_processor_id();

WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
this_cpu_read(irq_count) != -1);

if (!test_thread_flag(TIF_NEED_FPU_LOAD))
- switch_fpu_prepare(prev_fpu, cpu);
+ switch_fpu_prepare(prev_p, cpu);

/* We must save %fs and %gs before load_TLS() because
* %fs and %gs may be cleared by load_TLS().
@@ -565,7 +563,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
this_cpu_write(current_task, next_p);
this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));

- switch_fpu_finish(next_fpu);
+ switch_fpu_finish(next_p);

/* Reload sp0. */
update_task_stack(next_p);
--
2.35.1.265.g69c8d7142f-goog

2022-02-24 16:23:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH stable 5.4,5.10] x86/fpu: Correct pkru/xstate inconsistency

On 2/15/22 11:22, Brian Geffon wrote:
> When eagerly switching PKRU in switch_fpu_finish() it checks that
> current is not a kernel thread as kernel threads will never use PKRU.
> It's possible that this_cpu_read_stable() on current_task
> (ie. get_current()) is returning an old cached value. To resolve this
> reference next_p directly rather than relying on current.
>
> As written it's possible when switching from a kernel thread to a
> userspace thread to observe a cached PF_KTHREAD flag and never restore
> the PKRU. And as a result this issue only occurs when switching
> from a kernel thread to a userspace thread, switching from a non kernel
> thread works perfectly fine because all that is considered in that
> situation are the flags from some other non kernel task and the next fpu
> is passed in to switch_fpu_finish().
>
> This behavior only exists between 5.2 and 5.13 when it was fixed by a
> rewrite decoupling PKRU from xstate, in:
> commit 954436989cc5 ("x86/fpu: Remove PKRU handling from switch_fpu_finish()")
>
> Unfortunately backporting the fix from 5.13 is probably not realistic as
> it's part of a 60+ patch series which rewrites most of the PKRU handling.
>
> Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
> Signed-off-by: Brian Geffon <[email protected]>
> Signed-off-by: Willis Kung <[email protected]>
> Tested-by: Willis Kung <[email protected]>
> Cc: <[email protected]> # v5.4.x
> Cc: <[email protected]> # v5.10.x

I don't like forking the stable code from mainline. But I also think
that backporting the FPU reworking that changed the PKRU handling is
likely to cause more bugs in stable than it fixes.

This fix is at least isolated to the protection keys code.

Acked-by: Dave Hansen <[email protected]>

2022-02-25 13:17:23

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH stable 5.4,5.10] x86/fpu: Correct pkru/xstate inconsistency

On Thu, Feb 24, 2022 at 07:16:17AM -0800, Dave Hansen wrote:
> On 2/15/22 11:22, Brian Geffon wrote:
> > When eagerly switching PKRU in switch_fpu_finish() it checks that
> > current is not a kernel thread as kernel threads will never use PKRU.
> > It's possible that this_cpu_read_stable() on current_task
> > (ie. get_current()) is returning an old cached value. To resolve this
> > reference next_p directly rather than relying on current.
> >
> > As written it's possible when switching from a kernel thread to a
> > userspace thread to observe a cached PF_KTHREAD flag and never restore
> > the PKRU. And as a result this issue only occurs when switching
> > from a kernel thread to a userspace thread, switching from a non kernel
> > thread works perfectly fine because all that is considered in that
> > situation are the flags from some other non kernel task and the next fpu
> > is passed in to switch_fpu_finish().
> >
> > This behavior only exists between 5.2 and 5.13 when it was fixed by a
> > rewrite decoupling PKRU from xstate, in:
> > commit 954436989cc5 ("x86/fpu: Remove PKRU handling from switch_fpu_finish()")
> >
> > Unfortunately backporting the fix from 5.13 is probably not realistic as
> > it's part of a 60+ patch series which rewrites most of the PKRU handling.
> >
> > Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state")
> > Signed-off-by: Brian Geffon <[email protected]>
> > Signed-off-by: Willis Kung <[email protected]>
> > Tested-by: Willis Kung <[email protected]>
> > Cc: <[email protected]> # v5.4.x
> > Cc: <[email protected]> # v5.10.x
>
> I don't like forking the stable code from mainline. But I also think
> that backporting the FPU reworking that changed the PKRU handling is
> likely to cause more bugs in stable than it fixes.
>
> This fix is at least isolated to the protection keys code.
>
> Acked-by: Dave Hansen <[email protected]>

now queued up, thanks.

greg k-h