Subject: [RFC PATCH] x86: load FPU registers on return to userland

This is a refurbished series originally started by Rik van Riel. The
goal is to load the FPU registers on return to userland instead of on
every context switch. With this optimisation we can:
- avoid loading the registers if the task stays in the kernel and does
not return to userland
- make kernel_fpu_begin() cheaper: it saves the registers only on the
first invocation; a second invocation does not need to save them again.

To access the FPU registers in the kernel we need to:
- disable preemption, so that the scheduler cannot switch tasks. A task
switch would set TIF_LOAD_FPU and the FPU registers would no longer be
valid.
- disable BH, because a softirq might use kernel_fpu_begin() and, on
completion, set TIF_LOAD_FPU instead of loading the FPU registers again
(a usage sketch follows below).
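
For illustration, this is the pattern a kernel-internal user of the FPU
registers ends up with. With the whole series applied, write_pkru()
(patch 04, using the helpers as they look after patch 10) boils down to
roughly:

	current->thread.fpu.pkru = pkru;  /* cached copy, used at context switch */
	__fpregs_changes_begin();         /* preemption and BH off */
	__fpregs_load_activate();         /* make the FPU registers valid again */
	__write_pkru(pkru);               /* update the live register */
	__fpregs_changes_end();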

This seems to work with userland & XMM registers. I haven't tested the
pkeys feature or KVM yet.

Sebastian




Subject: [RFC PATCH 07/10] x86/entry: add TIF_LOAD_FPU

Add TIF_LOAD_FPU. This flag is reserved for loading the FPU registers
before returning to userspace. It is introduced now so that the code
handling it can be added before the main feature.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/include/asm/thread_info.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index cd6920674b905..cd910fd8c8c4c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -83,6 +83,7 @@ struct thread_info {
#define TIF_SYSCALL_EMU 6 /* syscall emulation active */
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
+#define TIF_LOAD_FPU 10 /* load FPU on return to userspace */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
#define TIF_UPROBE 12 /* breakpointed or singlestepping */
#define TIF_PATCH_PENDING 13 /* pending live patching update */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_SYSCALL_EMU (1 << TIF_SYSCALL_EMU)
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
+#define _TIF_LOAD_FPU (1 << TIF_LOAD_FPU)
#define _TIF_USER_RETURN_NOTIFY (1 << TIF_USER_RETURN_NOTIFY)
#define _TIF_UPROBE (1 << TIF_UPROBE)
#define _TIF_PATCH_PENDING (1 << TIF_PATCH_PENDING)
--
2.19.0


Subject: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace

From: Rik van Riel <[email protected]>

Defer loading of FPU state until return to userspace. This gives
the kernel the potential to skip loading FPU state for tasks that
stay in kernel mode, or for tasks that end up with repeated
invocations of kernel_fpu_begin.

It also increases the chances that a task's FPU state will remain
valid in the FPU registers until it is scheduled back in, allowing
us to skip restoring that task's FPU state altogether.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/entry/common.c | 9 +++++++++
arch/x86/include/asm/fpu/api.h | 5 +++++
arch/x86/include/asm/fpu/internal.h | 16 +++++++++------
arch/x86/kernel/fpu/core.c | 31 ++++++++++++++++++++++-------
arch/x86/kernel/process_32.c | 7 ++++---
arch/x86/kernel/process_64.c | 7 ++++---
arch/x86/kvm/x86.c | 10 +++++++---
arch/x86/mm/pkeys.c | 2 +-
8 files changed, 64 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b819181..180e16a8177d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,6 +31,7 @@
#include <asm/vdso.h>
#include <linux/uaccess.h>
#include <asm/cpufeature.h>
+#include <asm/fpu/api.h>

#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -196,6 +197,14 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
exit_to_usermode_loop(regs, cached_flags);

+ /* Reload ti->flags; we may have rescheduled above. */
+ cached_flags = READ_ONCE(ti->flags);
+
+ if (unlikely(cached_flags & _TIF_LOAD_FPU)) {
+ clear_thread_flag(TIF_LOAD_FPU);
+ switch_fpu_return();
+ }
+
#ifdef CONFIG_COMPAT
/*
* Compat syscalls set TS_COMPAT. Make sure we clear it before
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index a9caac9d4a729..308e66a7fcd62 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -27,6 +27,11 @@ extern void kernel_fpu_begin(void);
extern void kernel_fpu_end(void);
extern bool irq_fpu_usable(void);

+/*
+ * Load the task FPU state before returning to userspace.
+ */
+extern void switch_fpu_return(void);
+
/*
* Query the presence of one or more xfeatures. Works on any legacy CPU as well.
*
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 184d76c6470b1..370686972592e 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -472,9 +472,11 @@ static inline void fpregs_activate(struct fpu *fpu)
/*
* Load the FPU state for the current task. Call with preemption disabled.
*/
-static inline void __fpregs_load_activate(struct fpu *fpu, int cpu)
+static inline void __fpregs_load_activate(void)
{
- if (!fpregs_state_valid(fpu, cpu))
+ struct fpu *fpu = &current->thread.fpu;
+
+ if (!fpregs_state_valid(fpu, smp_processor_id()))
copy_kernel_to_fpregs(&fpu->state);
fpregs_activate(fpu);
}
@@ -482,11 +484,13 @@ static inline void __fpregs_load_activate(struct fpu *fpu, int cpu)
static inline void __fpregs_changes_begin(void)
{
preempt_disable();
+ local_bh_disable();
}

static inline void __fpregs_changes_end(void)
{
preempt_enable();
+ local_bh_enable();
}

/*
@@ -497,8 +501,8 @@ static inline void __fpregs_changes_end(void)
* - switch_fpu_prepare() saves the old state.
* This is done within the context of the old process.
*
- * - switch_fpu_finish() restores the new state as
- * necessary.
+ * - switch_fpu_finish() sets TIF_LOAD_FPU; the floating point state
+ * will get loaded on return to userspace, or when the kernel needs it.
*/
static inline void
switch_fpu_prepare(struct fpu *old_fpu, int cpu)
@@ -523,7 +527,7 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu)
* Set up the userspace FPU context for the new task, if the task
* has used the FPU.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
+static inline void switch_fpu_finish(struct fpu *new_fpu)
{
bool preload = static_cpu_has(X86_FEATURE_FPU) &&
new_fpu->initialized;
@@ -531,7 +535,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
if (!preload)
return;

- __fpregs_load_activate(new_fpu, cpu);
+ set_thread_flag(TIF_LOAD_FPU);
/* Protection keys need to be in place right at context switch time. */
if (boot_cpu_has(X86_FEATURE_OSPKE)) {
if (new_fpu->pkru != __read_pkru())
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8564086c217fc..cbbd20515cf46 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)

kernel_fpu_disable();

- if (fpu->initialized) {
+ __cpu_invalidate_fpregs_state();
+
+ if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
/*
* Ignore return value -- we don't care if reg state
* is clobbered.
*/
copy_fpregs_to_fpstate(fpu);
- } else {
- __cpu_invalidate_fpregs_state();
}
}
EXPORT_SYMBOL(__kernel_fpu_begin);
@@ -117,8 +117,7 @@ void __kernel_fpu_end(void)
{
struct fpu *fpu = &current->thread.fpu;

- if (fpu->initialized)
- copy_kernel_to_fpregs(&fpu->state);
+ switch_fpu_finish(fpu);

kernel_fpu_enable();
}
@@ -147,7 +146,7 @@ void fpu__save(struct fpu *fpu)
{
WARN_ON_FPU(fpu != &current->thread.fpu);

- preempt_disable();
+ __fpregs_changes_begin();
trace_x86_fpu_before_save(fpu);
if (fpu->initialized) {
if (!copy_fpregs_to_fpstate(fpu)) {
@@ -155,7 +154,7 @@ void fpu__save(struct fpu *fpu)
}
}
trace_x86_fpu_after_save(fpu);
- preempt_enable();
+ __fpregs_changes_end();
}
EXPORT_SYMBOL_GPL(fpu__save);

@@ -403,6 +402,24 @@ void fpu__clear(struct fpu *fpu)
}
}

+/*
+ * Load FPU context before returning to userspace.
+ */
+void switch_fpu_return(void)
+{
+ if (!static_cpu_has(X86_FEATURE_FPU))
+ return;
+
+ /*
+ * We should never return to user space without the task's
+ * own FPU contents loaded into the registers. That makes it
+ * a bug to not have the task's FPU state set up.
+ */
+ WARN_ON_FPU(!current->thread.fpu.initialized);
+
+ __fpregs_load_activate();
+}
+
/*
* x87 math exception handling:
*/
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5046a3c9dec2f..a65f8ce36379b 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -236,7 +236,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

- switch_fpu_prepare(prev_fpu, cpu);
+ if (prev_fpu->initialized && !test_thread_flag(TIF_LOAD_FPU))
+ switch_fpu_prepare(prev_fpu, cpu);

/*
* Save away %gs. No need to save %fs, as it was saved on the
@@ -297,10 +298,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (prev->gs | next->gs)
lazy_load_gs(next->gs);

- switch_fpu_finish(next_fpu, cpu);
-
this_cpu_write(current_task, next_p);

+ switch_fpu_finish(next_fpu);
+
/* Load the Intel cache allocation PQR MSR. */
intel_rdt_sched_in();

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ea5ea850348da..66b763f3da6a0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -427,7 +427,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
this_cpu_read(irq_count) != -1);

- switch_fpu_prepare(prev_fpu, cpu);
+ if (prev_fpu->initialized && !test_thread_flag(TIF_LOAD_FPU))
+ switch_fpu_prepare(prev_fpu, cpu);

/* We must save %fs and %gs before load_TLS() because
* %fs and %gs may be cleared by load_TLS().
@@ -478,8 +479,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
load_seg_legacy(prev->gsindex, prev->gsbase,
next->gsindex, next->gsbase, GS);

- switch_fpu_finish(next_fpu, cpu);
-
/*
* Switch the PDA and FPU contexts.
*/
@@ -489,6 +488,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Reload sp0. */
update_task_stack(next_p);

+ switch_fpu_finish(next_fpu);
+
/*
* Now maybe reload the debug registers and handle I/O bitmaps
*/
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d8b3c257e769..d9a8c5ec8d6fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7573,6 +7573,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
wait_lapic_expire(vcpu);
guest_enter_irqoff();

+ if (test_and_clear_thread_flag(TIF_LOAD_FPU))
+ switch_fpu_return();
+
if (unlikely(vcpu->arch.switch_db_regs)) {
set_debugreg(0, 7);
set_debugreg(vcpu->arch.eff_db[0], 0);
@@ -7832,22 +7835,23 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
/* Swap (qemu) user FPU context for the guest FPU context. */
static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
- preempt_disable();
+ __fpregs_changes_begin();
copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
/* PKRU is separately restored in kvm_x86_ops->run. */
__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
~XFEATURE_MASK_PKRU);
- preempt_enable();
+ __fpregs_changes_end();
trace_kvm_fpu(1);
}

/* When vcpu_run ends, restore user space FPU context. */
static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
- preempt_disable();
+ __fpregs_changes_begin();
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
preempt_enable();
+ __fpregs_changes_end();
++vcpu->stat.fpu_reload;
trace_kvm_fpu(0);
}
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6e91529fe2a30..83835abd526d7 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -28,7 +28,7 @@ void write_pkru(u32 pkru)
current->thread.fpu.pkru = pkru;

__fpregs_changes_begin();
- __fpregs_load_activate(&current->thread.fpu, smp_processor_id());
+ __fpregs_load_activate();
__write_pkru(pkru);
__fpregs_changes_end();
}
--
2.19.0


Subject: [RFC PATCH 09/10] x86/fpu: copy non-resident FPU state at fork time

From: Rik van Riel <[email protected]>

If the FPU state is not loaded in registers at fork time, memcpy the
fpstate from the parent task to the child task.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/kernel/fpu/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2ea85b32421a0..8564086c217fc 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -210,11 +210,14 @@ int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu)
* ( The function 'fails' in the FNSAVE case, which destroys
* register contents so we have to copy them back. )
*/
- if (!copy_fpregs_to_fpstate(dst_fpu)) {
+ __fpregs_changes_begin();
+ if (test_thread_flag(TIF_LOAD_FPU)) {
+ memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
+ } else if (!copy_fpregs_to_fpstate(dst_fpu)) {
memcpy(&src_fpu->state, &dst_fpu->state, fpu_kernel_xstate_size);
copy_kernel_to_fpregs(&src_fpu->state);
}
-
+ __fpregs_changes_end();
trace_x86_fpu_copy_src(src_fpu);
trace_x86_fpu_copy_dst(dst_fpu);

--
2.19.0


Subject: [RFC PATCH 06/10] x86/fpu: Always store the registers in copy_fpstate_to_sigframe()

From: Rik van Riel <[email protected]>

copy_fpstate_to_sigframe() has two callers and both invoke the function only if
fpu->initialized is set. So the check for ->initialized within the function
makes no sense. It is probably a relic from the lazy-FPU days: if the FPU
registers were "loaded" we could save them directly; otherwise (the FPU
registers not being up to date) the state already saved in the fpu struct
could be used for the memcpy().

Since the registers are always loaded at this point, save them and copy them
to the signal frame afterwards.

This code is extracted from an earlier version of the patch set, from when
x86 still had lazy-FPU support. It is a preparation for loading the FPU
registers on return to userland.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 45 ---------------------------
arch/x86/kernel/fpu/signal.c | 48 ++++++++---------------------
2 files changed, 12 insertions(+), 81 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 57bd1576e033d..184d76c6470b1 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -125,22 +125,6 @@ extern void fpstate_sanitize_xstate(struct fpu *fpu);
_ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_fprestore) \
: output : input)

-static inline int copy_fregs_to_user(struct fregs_state __user *fx)
-{
- return user_insn(fnsave %[fx]; fwait, [fx] "=m" (*fx), "m" (*fx));
-}
-
-static inline int copy_fxregs_to_user(struct fxregs_state __user *fx)
-{
- if (IS_ENABLED(CONFIG_X86_32))
- return user_insn(fxsave %[fx], [fx] "=m" (*fx), "m" (*fx));
- else if (IS_ENABLED(CONFIG_AS_FXSAVEQ))
- return user_insn(fxsaveq %[fx], [fx] "=m" (*fx), "m" (*fx));
-
- /* See comment in copy_fxregs_to_kernel() below. */
- return user_insn(rex64/fxsave (%[fx]), "=m" (*fx), [fx] "R" (fx));
-}
-
static inline void copy_kernel_to_fxregs(struct fxregs_state *fx)
{
if (IS_ENABLED(CONFIG_X86_32)) {
@@ -351,35 +335,6 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
XSTATE_XRESTORE(xstate, lmask, hmask);
}

-/*
- * Save xstate to user space xsave area.
- *
- * We don't use modified optimization because xrstor/xrstors might track
- * a different application.
- *
- * We don't use compacted format xsave area for
- * backward compatibility for old applications which don't understand
- * compacted format of xsave area.
- */
-static inline int copy_xregs_to_user(struct xregs_state __user *buf)
-{
- int err;
-
- /*
- * Clear the xsave header first, so that reserved fields are
- * initialized to zero.
- */
- err = __clear_user(&buf->header, sizeof(buf->header));
- if (unlikely(err))
- return -EFAULT;
-
- stac();
- XSTATE_OP(XSAVE, buf, -1, -1, err);
- clac();
-
- return err;
-}
-
/*
* Restore xstate from user space xsave area.
*/
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 23f1691670b66..ff6e7a67522d8 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -117,23 +117,6 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)

return err;
}
-
-static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
-{
- int err;
-
- if (use_xsave())
- err = copy_xregs_to_user(buf);
- else if (use_fxsr())
- err = copy_fxregs_to_user((struct fxregs_state __user *) buf);
- else
- err = copy_fregs_to_user((struct fregs_state __user *) buf);
-
- if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
- err = -EFAULT;
- return err;
-}
-
/*
* Save the fpu, extended register state to the user signal frame.
*
@@ -172,27 +155,20 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
sizeof(struct user_i387_ia32_struct), NULL,
(struct _fpstate_32 __user *) buf) ? -1 : 1;

- if (fpu->initialized || using_compacted_format()) {
- /* Save the live register state to the user directly. */
- if (copy_fpregs_to_sigframe(buf_fx))
- return -1;
- /* Update the thread's fxstate to save the fsave header. */
- if (ia32_fxstate)
- copy_fxregs_to_kernel(fpu);
- } else {
- /*
- * It is a *bug* if kernel uses compacted-format for xsave
- * area and we copy it out directly to a signal frame. It
- * should have been handled above by saving the registers
- * directly.
- */
- if (boot_cpu_has(X86_FEATURE_XSAVES)) {
- WARN_ONCE(1, "x86/fpu: saving compacted-format xsave area to a signal frame!\n");
- return -1;
- }
+ /* Update the thread's fxstate to save the fsave header. */
+ if (ia32_fxstate)
+ copy_fxregs_to_kernel(fpu);
+ else {
+ copy_fpregs_to_fpstate(fpu);
+ fpregs_deactivate(fpu);
+ }

+ if (using_compacted_format()) {
+ copy_xstate_to_user(buf_fx, xsave, 0, size);
+ } else {
fpstate_sanitize_xstate(fpu);
- if (__copy_to_user(buf_fx, xsave, fpu_user_xstate_size))
+ size = fpu_user_xstate_size;
+ if (__copy_to_user(buf_fx, xsave, size))
return -1;
}

--
2.19.0


Subject: [RFC PATCH 02/10] kvm: x86: make kvm_{load|put}_guest_fpu() static

The functions
kvm_load_guest_fpu()
kvm_put_guest_fpu()

are only used locally, so make them static. This also requires moving both
functions, because they are used before their definition. They were exported
(via EXPORT_SYMBOL) until commit e5bb40251a920 ("KVM: Drop
kvm_{load,put}_guest_fpu() exports").

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/kvm/x86.c | 46 ++++++++++++++++++++--------------------
include/linux/kvm_host.h | 2 --
2 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 542f6315444d7..2d8b3c257e769 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7829,6 +7829,29 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
return 0;
}

+/* Swap (qemu) user FPU context for the guest FPU context. */
+static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
+ /* PKRU is separately restored in kvm_x86_ops->run. */
+ __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
+ ~XFEATURE_MASK_PKRU);
+ preempt_enable();
+ trace_kvm_fpu(1);
+}
+
+/* When vcpu_run ends, restore user space FPU context. */
+static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
+ copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+ preempt_enable();
+ ++vcpu->stat.fpu_reload;
+ trace_kvm_fpu(0);
+}
+
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
int r;
@@ -8406,29 +8429,6 @@ static void fx_init(struct kvm_vcpu *vcpu)
vcpu->arch.cr0 |= X86_CR0_ET;
}

-/* Swap (qemu) user FPU context for the guest FPU context. */
-void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
-{
- preempt_disable();
- copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
- /* PKRU is separately restored in kvm_x86_ops->run. */
- __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
- ~XFEATURE_MASK_PKRU);
- preempt_enable();
- trace_kvm_fpu(1);
-}
-
-/* When vcpu_run ends, restore user space FPU context. */
-void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
-{
- preempt_disable();
- copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
- copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
- preempt_enable();
- ++vcpu->stat.fpu_reload;
- trace_kvm_fpu(0);
-}
-
void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
{
void *wbinvd_dirty_mask = vcpu->arch.wbinvd_dirty_mask;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0205aee44dedd..c926698040e0d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -733,8 +733,6 @@ bool kvm_vcpu_wake_up(struct kvm_vcpu *vcpu);
void kvm_vcpu_kick(struct kvm_vcpu *vcpu);
int kvm_vcpu_yield_to(struct kvm_vcpu *target);
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool usermode_vcpu_not_eligible);
-void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
-void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);

void kvm_flush_remote_tlbs(struct kvm *kvm);
void kvm_reload_remote_mmus(struct kvm *kvm);
--
2.19.0


Subject: [RFC PATCH 03/10] x86/fpu: add (__)make_fpregs_active helpers

From: Rik van Riel <[email protected]>

Add helper functions that ensure the floating point registers for the
current task are active: __fpregs_load_activate(), which must be called
with preemption disabled, and the __fpregs_changes_begin()/_end() pair
which provides that protection.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 27 ++++++++++++++++++++++-----
1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index a38bf5a1e37ad..16c4077ffc945 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -514,6 +514,26 @@ static inline void fpregs_activate(struct fpu *fpu)
trace_x86_fpu_regs_activated(fpu);
}

+/*
+ * Load the FPU state for the current task. Call with preemption disabled.
+ */
+static inline void __fpregs_load_activate(struct fpu *fpu, int cpu)
+{
+ if (!fpregs_state_valid(fpu, cpu))
+ copy_kernel_to_fpregs(&fpu->state);
+ fpregs_activate(fpu);
+}
+
+static inline void __fpregs_changes_begin(void)
+{
+ preempt_disable();
+}
+
+static inline void __fpregs_changes_end(void)
+{
+ preempt_enable();
+}
+
/*
* FPU state switching for scheduling.
*
@@ -553,11 +573,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
bool preload = static_cpu_has(X86_FEATURE_FPU) &&
new_fpu->initialized;

- if (preload) {
- if (!fpregs_state_valid(new_fpu, cpu))
- copy_kernel_to_fpregs(&new_fpu->state);
- fpregs_activate(new_fpu);
- }
+ if (preload)
+ __fpregs_load_activate(new_fpu, cpu);
}

/*
--
2.19.0


Subject: [RFC PATCH 01/10] x86/entry: remove _TIF_ALLWORK_MASK

There has been no user of _TIF_ALLWORK_MASK since commit 21d375b6b34ff
("x86/entry/64: Remove the SYSCALL64 fast path").
Remove the unused define.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/include/asm/thread_info.h | 8 --------
1 file changed, 8 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 2ff2a30a264f4..cd6920674b905 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -136,14 +136,6 @@ struct thread_info {
_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
_TIF_NOHZ)

-/* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK \
- (_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING | \
- _TIF_NEED_RESCHED | _TIF_SINGLESTEP | _TIF_SYSCALL_EMU | \
- _TIF_SYSCALL_AUDIT | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | \
- _TIF_PATCH_PENDING | _TIF_NOHZ | _TIF_SYSCALL_TRACEPOINT | \
- _TIF_FSCHECK)
-
/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
(_TIF_IO_BITMAP|_TIF_NOCPUID|_TIF_NOTSC|_TIF_BLOCKSTEP|_TIF_SSBD)
--
2.19.0


Subject: [RFC PATCH 08/10] x86/fpu: prepare copy_fpstate_to_sigframe for TIF_LOAD_FPU

From: Rik van Riel <[email protected]>

If TIF_LOAD_FPU is set, the task's FPU registers are not loaded: its state
already lives in the fpu struct. In that case the save of the registers into
the fpstate can be skipped, because the in-memory copy is already up to date.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/kernel/fpu/signal.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index ff6e7a67522d8..906efd9572ffa 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -155,13 +155,17 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
sizeof(struct user_i387_ia32_struct), NULL,
(struct _fpstate_32 __user *) buf) ? -1 : 1;

- /* Update the thread's fxstate to save the fsave header. */
- if (ia32_fxstate)
- copy_fxregs_to_kernel(fpu);
- else {
- copy_fpregs_to_fpstate(fpu);
- fpregs_deactivate(fpu);
+ __fpregs_changes_begin();
+ if (!test_thread_flag(TIF_LOAD_FPU)) {
+ /* Update the thread's fxstate to save the fsave header. */
+ if (ia32_fxstate)
+ copy_fxregs_to_kernel(fpu);
+ else {
+ copy_fpregs_to_fpstate(fpu);
+ fpregs_deactivate(fpu);
+ }
}
+ __fpregs_changes_end();

if (using_compacted_format()) {
copy_xstate_to_user(buf_fx, xsave, 0, size);
--
2.19.0


Subject: [RFC PATCH 05/10] x86/pkeys: Drop the preempt-disable section

From: Rik van Riel <[email protected]>

The fpu->initialized flag should not change underneath us, so the
preempt-disable section is not needed; it is probably a leftover from the
removal of lazy-FPU support. The FPU is marked initialized as soon as the
state has been set to an initial value; the flag does not signal whether the
CPU's FPU registers are currently loaded.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/mm/pkeys.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index c7a7b6bd64009..6e91529fe2a30 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -53,17 +53,13 @@ int __execute_only_pkey(struct mm_struct *mm)
* dance to set PKRU if we do not need to. Check it
* first and assume that if the execute-only pkey is
* write-disabled that we do not have to set it
- * ourselves. We need preempt off so that nobody
- * can make fpregs inactive.
+ * ourselves.
*/
- preempt_disable();
if (!need_to_set_mm_pkey &&
current->thread.fpu.initialized &&
!__pkru_allows_read(read_pkru(), execute_only_pkey)) {
- preempt_enable();
return execute_only_pkey;
}
- preempt_enable();

/*
* Set up PKRU so that it denies access for everything
--
2.19.0


Subject: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state

From: Rik van Riel <[email protected]>

While most of a task's FPU state is only needed in user space,
the protection keys need to be in place immediately after a
context switch.

The reason is that any accesses to userspace memory while running
in kernel mode also need to abide by the memory permissions
specified in the protection keys.

The pkru info is put in its own cache line in the fpu struct because
that cache line is accessed anyway at context switch time, and the
location of the pkru info in the xsave buffer changes depending on
what other FPU registers are in use if the CPU uses compressed xsave
state (on by default).

The initial state of pkru is zeroed out automatically by fpstate_init.

Signed-off-by: Rik van Riel <[email protected]>
[bigeasy: load PKRU state only if we also load FPU content]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 11 +++++++++--
arch/x86/include/asm/fpu/types.h | 10 ++++++++++
arch/x86/include/asm/pgtable.h | 6 +-----
arch/x86/mm/pkeys.c | 14 ++++++++++++++
4 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 16c4077ffc945..57bd1576e033d 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -573,8 +573,15 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
bool preload = static_cpu_has(X86_FEATURE_FPU) &&
new_fpu->initialized;

- if (preload)
- __fpregs_load_activate(new_fpu, cpu);
+ if (!preload)
+ return;
+
+ __fpregs_load_activate(new_fpu, cpu);
+ /* Protection keys need to be in place right at context switch time. */
+ if (boot_cpu_has(X86_FEATURE_OSPKE)) {
+ if (new_fpu->pkru != __read_pkru())
+ __write_pkru(new_fpu->pkru);
+ }
}

/*
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c53918ecfa..6fa58d37938d2 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -293,6 +293,16 @@ struct fpu {
*/
unsigned int last_cpu;

+ /*
+ * Protection key bits. These also live inside fpu.state.xsave,
+ * but the location varies if the CPU uses the compressed format
+ * for XSAVE(OPT).
+ *
+ * The protection key needs to be switched out immediately at context
+ * switch time, so it is in place for things like copy_to_user.
+ */
+ unsigned int pkru;
+
/*
* @initialized:
*
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 690c0307afed0..cc36f91011ad7 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -132,11 +132,7 @@ static inline u32 read_pkru(void)
return 0;
}

-static inline void write_pkru(u32 pkru)
-{
- if (boot_cpu_has(X86_FEATURE_OSPKE))
- __write_pkru(pkru);
-}
+extern void write_pkru(u32 pkru);

static inline int pte_young(pte_t pte)
{
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6e98e0a7c9231..c7a7b6bd64009 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -18,6 +18,20 @@

#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/mmu_context.h> /* vma_pkey() */
+#include <asm/fpu/internal.h>
+
+void write_pkru(u32 pkru)
+{
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return;
+
+ current->thread.fpu.pkru = pkru;
+
+ __fpregs_changes_begin();
+ __fpregs_load_activate(&current->thread.fpu, smp_processor_id());
+ __write_pkru(pkru);
+ __fpregs_changes_end();
+}

int __execute_only_pkey(struct mm_struct *mm)
{
--
2.19.0


2018-09-12 14:19:33

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state

On 12/09/2018 15:33, Sebastian Andrzej Siewior wrote:
> From: Rik van Riel <[email protected]>
>
> While most of a task's FPU state is only needed in user space,
> the protection keys need to be in place immediately after a
> context switch.
>
> The reason is that any accesses to userspace memory while running
> in kernel mode also need to abide by the memory permissions
> specified in the protection keys.
>
> The pkru info is put in its own cache line in the fpu struct because
> that cache line is accessed anyway at context switch time, and the
> location of the pkru info in the xsave buffer changes depending on
> what other FPU registers are in use if the CPU uses compressed xsave
> state (on by default).
>
> The initial state of pkru is zeroed out automatically by fpstate_init.
>
> Signed-off-by: Rik van Riel <[email protected]>
> [bigeasy: load PKRU state only if we also load FPU content]
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> arch/x86/include/asm/fpu/internal.h | 11 +++++++++--
> arch/x86/include/asm/fpu/types.h | 10 ++++++++++
> arch/x86/include/asm/pgtable.h | 6 +-----
> arch/x86/mm/pkeys.c | 14 ++++++++++++++
> 4 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 16c4077ffc945..57bd1576e033d 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -573,8 +573,15 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
> bool preload = static_cpu_has(X86_FEATURE_FPU) &&
> new_fpu->initialized;
>
> - if (preload)
> - __fpregs_load_activate(new_fpu, cpu);
> + if (!preload)
> + return;
> +
> + __fpregs_load_activate(new_fpu, cpu);
> + /* Protection keys need to be in place right at context switch time. */
> + if (boot_cpu_has(X86_FEATURE_OSPKE)) {
> + if (new_fpu->pkru != __read_pkru())
> + __write_pkru(new_fpu->pkru);
> + }
> }
>
> /*
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index 202c53918ecfa..6fa58d37938d2 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -293,6 +293,16 @@ struct fpu {
> */
> unsigned int last_cpu;
>
> + /*
> + * Protection key bits. These also live inside fpu.state.xsave,
> + * but the location varies if the CPU uses the compressed format
> + * for XSAVE(OPT).
> + *
> + * The protection key needs to be switched out immediately at context
> + * switch time, so it is in place for things like copy_to_user.
> + */
> + unsigned int pkru;
> +
> /*
> * @initialized:
> *
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 690c0307afed0..cc36f91011ad7 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -132,11 +132,7 @@ static inline u32 read_pkru(void)
> return 0;
> }
>
> -static inline void write_pkru(u32 pkru)
> -{
> - if (boot_cpu_has(X86_FEATURE_OSPKE))
> - __write_pkru(pkru);
> -}
> +extern void write_pkru(u32 pkru);
>
> static inline int pte_young(pte_t pte)
> {
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 6e98e0a7c9231..c7a7b6bd64009 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -18,6 +18,20 @@
>
> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> #include <asm/mmu_context.h> /* vma_pkey() */
> +#include <asm/fpu/internal.h>
> +
> +void write_pkru(u32 pkru)
> +{
> + if (!boot_cpu_has(X86_FEATURE_OSPKE))
> + return;
> +
> + current->thread.fpu.pkru = pkru;
> +
> + __fpregs_changes_begin();
> + __fpregs_load_activate(&current->thread.fpu, smp_processor_id());
> + __write_pkru(pkru);
> + __fpregs_changes_end();
> +}
>
> int __execute_only_pkey(struct mm_struct *mm)
> {
>

I think you can go a step further and exclude PKRU state from
copy_kernel_to_fpregs altogether; you just use RDPKRU/WRPKRU. This also
means you don't need to call __fpregs_* functions in write_pkru.

Thanks,

Paolo

2018-09-12 15:21:16

by Andy Lutomirski

Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state



> On Sep 12, 2018, at 6:33 AM, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> From: Rik van Riel <[email protected]>
>
> While most of a task's FPU state is only needed in user space,
> the protection keys need to be in place immediately after a
> context switch.
>
> The reason is that any accesses to userspace memory while running
> in kernel mode also need to abide by the memory permissions
> specified in the protection keys.
>
> The pkru info is put in its own cache line in the fpu struct because
> that cache line is accessed anyway at context switch time, and the
> location of the pkru info in the xsave buffer changes depending on
> what other FPU registers are in use if the CPU uses compressed xsave
> state (on by default).
>
> The initial state of pkru is zeroed out automatically by fpstate_init.
>
> Signed-off-by: Rik van Riel <[email protected]>
> [bigeasy: load PKRU state only if we also load FPU content]
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> arch/x86/include/asm/fpu/internal.h | 11 +++++++++--
> arch/x86/include/asm/fpu/types.h | 10 ++++++++++
> arch/x86/include/asm/pgtable.h | 6 +-----
> arch/x86/mm/pkeys.c | 14 ++++++++++++++
> 4 files changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 16c4077ffc945..57bd1576e033d 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -573,8 +573,15 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
> bool preload = static_cpu_has(X86_FEATURE_FPU) &&
> new_fpu->initialized;
>
> - if (preload)
> - __fpregs_load_activate(new_fpu, cpu);
> + if (!preload)
> + return;
> +
> + __fpregs_load_activate(new_fpu, cpu);
> + /* Protection keys need to be in place right at context switch time. */
> + if (boot_cpu_has(X86_FEATURE_OSPKE)) {
> + if (new_fpu->pkru != __read_pkru())
> + __write_pkru(new_fpu->pkru);
> + }
> }
>
> /*
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index 202c53918ecfa..6fa58d37938d2 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -293,6 +293,16 @@ struct fpu {
> */
> unsigned int last_cpu;
>
> + /*
> + * Protection key bits. These also live inside fpu.state.xsave,
> + * but the location varies if the CPU uses the compressed format
> + * for XSAVE(OPT).
> + *
> + * The protection key needs to be switched out immediately at context
> + * switch time, so it is in place for things like copy_to_user.
> + */
> + unsigned int pkru;
> +
> /*
> * @initialized:
> *
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 690c0307afed0..cc36f91011ad7 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -132,11 +132,7 @@ static inline u32 read_pkru(void)
> return 0;
> }
>
> -static inline void write_pkru(u32 pkru)
> -{
> - if (boot_cpu_has(X86_FEATURE_OSPKE))
> - __write_pkru(pkru);
> -}
> +extern void write_pkru(u32 pkru);
>
> static inline int pte_young(pte_t pte)
> {
> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
> index 6e98e0a7c9231..c7a7b6bd64009 100644
> --- a/arch/x86/mm/pkeys.c
> +++ b/arch/x86/mm/pkeys.c
> @@ -18,6 +18,20 @@
>
> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
> #include <asm/mmu_context.h> /* vma_pkey() */
> +#include <asm/fpu/internal.h>
> +
> +void write_pkru(u32 pkru)
> +{
> + if (!boot_cpu_has(X86_FEATURE_OSPKE))
> + return;
> +
> + current->thread.fpu.pkru = pkru;
> +

I thought that the offset of PKRU in the xstate was fixed after boot.

Anyway, as written, this needs a lockdep assertion that we’re not preemptible, an explicit preempt_disable(), or a comment explaining why it’s okay if we get preempted in this function.

> + __fpregs_changes_begin();
> + __fpregs_load_activate(&current->thread.fpu, smp_processor_id());
> + __write_pkru(pkru);
> + __fpregs_changes_end();
> +}
>
> int __execute_only_pkey(struct mm_struct *mm)
> {
> --
> 2.19.0
>

2018-09-12 15:26:12

by Andy Lutomirski

Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state




> On Sep 12, 2018, at 7:18 AM, Paolo Bonzini <[email protected]> wrote:
>
>> On 12/09/2018 15:33, Sebastian Andrzej Siewior wrote:
>> From: Rik van Riel <[email protected]>
>>
>> While most of a task's FPU state is only needed in user space,
>> the protection keys need to be in place immediately after a
>> context switch.
>>
>> The reason is that any accesses to userspace memory while running
>> in kernel mode also need to abide by the memory permissions
>> specified in the protection keys.
>>
>> The pkru info is put in its own cache line in the fpu struct because
>> that cache line is accessed anyway at context switch time, and the
>> location of the pkru info in the xsave buffer changes depending on
>> what other FPU registers are in use if the CPU uses compressed xsave
>> state (on by default).
>>
>> The initial state of pkru is zeroed out automatically by fpstate_init.
>>
>> Signed-off-by: Rik van Riel <[email protected]>
>> [bigeasy: load PKRU state only if we also load FPU content]
>> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
>> ---
>> arch/x86/include/asm/fpu/internal.h | 11 +++++++++--
>> arch/x86/include/asm/fpu/types.h | 10 ++++++++++
>> arch/x86/include/asm/pgtable.h | 6 +-----
>> arch/x86/mm/pkeys.c | 14 ++++++++++++++
>> 4 files changed, 34 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
>> index 16c4077ffc945..57bd1576e033d 100644
>> --- a/arch/x86/include/asm/fpu/internal.h
>> +++ b/arch/x86/include/asm/fpu/internal.h
>> @@ -573,8 +573,15 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
>> bool preload = static_cpu_has(X86_FEATURE_FPU) &&
>> new_fpu->initialized;
>>
>> - if (preload)
>> - __fpregs_load_activate(new_fpu, cpu);
>> + if (!preload)
>> + return;
>> +
>> + __fpregs_load_activate(new_fpu, cpu);
>> + /* Protection keys need to be in place right at context switch time. */
>> + if (boot_cpu_has(X86_FEATURE_OSPKE)) {
>> + if (new_fpu->pkru != __read_pkru())
>> + __write_pkru(new_fpu->pkru);
>> + }
>> }
>>
>> /*
>> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
>> index 202c53918ecfa..6fa58d37938d2 100644
>> --- a/arch/x86/include/asm/fpu/types.h
>> +++ b/arch/x86/include/asm/fpu/types.h
>> @@ -293,6 +293,16 @@ struct fpu {
>> */
>> unsigned int last_cpu;
>>
>> + /*
>> + * Protection key bits. These also live inside fpu.state.xsave,
>> + * but the location varies if the CPU uses the compressed format
>> + * for XSAVE(OPT).
>> + *
>> + * The protection key needs to be switched out immediately at context
>> + * switch time, so it is in place for things like copy_to_user.
>> + */
>> + unsigned int pkru;
>> +
>> /*
>> * @initialized:
>> *
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 690c0307afed0..cc36f91011ad7 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -132,11 +132,7 @@ static inline u32 read_pkru(void)
>> return 0;
>> }
>>
>> -static inline void write_pkru(u32 pkru)
>> -{
>> - if (boot_cpu_has(X86_FEATURE_OSPKE))
>> - __write_pkru(pkru);
>> -}
>> +extern void write_pkru(u32 pkru);
>>
>> static inline int pte_young(pte_t pte)
>> {
>> diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
>> index 6e98e0a7c9231..c7a7b6bd64009 100644
>> --- a/arch/x86/mm/pkeys.c
>> +++ b/arch/x86/mm/pkeys.c
>> @@ -18,6 +18,20 @@
>>
>> #include <asm/cpufeature.h> /* boot_cpu_has, ... */
>> #include <asm/mmu_context.h> /* vma_pkey() */
>> +#include <asm/fpu/internal.h>
>> +
>> +void write_pkru(u32 pkru)
>> +{
>> + if (!boot_cpu_has(X86_FEATURE_OSPKE))
>> + return;
>> +
>> + current->thread.fpu.pkru = pkru;
>> +
>> + __fpregs_changes_begin();
>> + __fpregs_load_activate(&current->thread.fpu, smp_processor_id());
>> + __write_pkru(pkru);
>> + __fpregs_changes_end();
>> +}
>>
>> int __execute_only_pkey(struct mm_struct *mm)
>> {
>>
>
> I think you can go a step further and exclude PKRU state from
> copy_kernel_to_fpregs altogether; you just use RDPKRU/WRPKRU. This also
> means you don't need to call __fpregs_* functions in write_pkru.
>
>

Except that the signal ABI has PKRU in the xstate. So we’d need to fake it or do something special for signals.

2018-09-12 15:31:01

by Rik van Riel

Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state

On Wed, 2018-09-12 at 08:20 -0700, Andy Lutomirski wrote:
> >
> > --- a/arch/x86/mm/pkeys.c
> > +++ b/arch/x86/mm/pkeys.c
> > @@ -18,6 +18,20 @@
> >
> > #include <asm/cpufeature.h> /* boot_cpu_has,
> > ... */
> > #include <asm/mmu_context.h> /*
> > vma_pkey() */
> > +#include <asm/fpu/internal.h>
> > +
> > +void write_pkru(u32 pkru)
> > +{
> > + if (!boot_cpu_has(X86_FEATURE_OSPKE))
> > + return;
> > +
> > + current->thread.fpu.pkru = pkru;
> > +
>
> I thought that the offset of PKRU in the xstate was fixed after boot.

You are right, it is. However, that offset would need
to be stored somewhere, and the value read every time
we wanted to read or store the PKRU value from/to the
floating point state.

I suspect that would not be any faster than keeping a
copy of the PKRU value in a known location.
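
(The alternative would be something along the lines of

	u32 *pk = get_xsave_addr(&fpu->state.xsave, XFEATURE_MASK_PKRU);

on every read/write, plus dealing with get_xsave_addr() returning NULL
while PKRU is still in its init state.)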

> Anyway, as written, this needs a lockdep assertion that we’re not
> preemptible, an explicit preempt_disable(), or a comment explaining
> why it’s okay if we get preempted in this function.
>
> > + __fpregs_changes_begin();

This handles the preemption disabling, see patch
3 of the series.

> > + __fpregs_load_activate(&current->thread.fpu,
> > smp_processor_id());
> > + __write_pkru(pkru);
> > + __fpregs_changes_end();
> > +}
> >
> > int __execute_only_pkey(struct mm_struct *mm)
> > {
> > --
> > 2.19.0
> >
>
>
--
All Rights Reversed.



2018-09-12 15:31:34

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state

On 12/09/2018 17:24, Andy Lutomirski wrote:
>> I think you can go a step further and exclude PKRU state from
>> copy_kernel_to_fpregs altogether; you just use RDPKRU/WRPKRU. This also
>> means you don't need to call __fpregs_* functions in write_pkru.
>>
>>
> Except that the signal ABI has PKRU in the xstate. So we’d need to fake it or do something special for signals.

The signal ABI is already special because it uses the non-compacted
format. As long as copy_fpregs_to_sigframe includes the PKRU state
(i.e. EDX:EAX=-1), and PKRU value is okay (which it is because it's
switched eagerly), everything should work...

Paolo

2018-09-12 15:47:57

by Andy Lutomirski

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace


> On Sep 12, 2018, at 6:33 AM, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> From: Rik van Riel <[email protected]>
>
> Defer loading of FPU state until return to userspace. This gives
> the kernel the potential to skip loading FPU state for tasks that
> stay in kernel mode, or for tasks that end up with repeated
> invocations of kernel_fpu_begin.
>
> It also increases the chances that a task's FPU state will remain
> valid in the FPU registers until it is scheduled back in, allowing
> us to skip restoring that task's FPU state altogether.
>
>

> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
>
> kernel_fpu_disable();
>
> - if (fpu->initialized) {
> + __cpu_invalidate_fpregs_state();
> +
> + if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {

Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
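
I.e. roughly (untested):

	if (!test_thread_flag(TIF_LOAD_FPU)) {
		set_thread_flag(TIF_LOAD_FPU);
		copy_fpregs_to_fpstate(fpu);
	}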

Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.

2018-09-12 15:49:52

by Andy Lutomirski

Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state



> On Sep 12, 2018, at 8:30 AM, Rik van Riel <[email protected]> wrote:
>
> On Wed, 2018-09-12 at 08:20 -0700, Andy Lutomirski wrote:
>>>
>>> --- a/arch/x86/mm/pkeys.c
>>> +++ b/arch/x86/mm/pkeys.c
>>> @@ -18,6 +18,20 @@
>>>
>>> #include <asm/cpufeature.h> /* boot_cpu_has,
>>> ... */
>>> #include <asm/mmu_context.h> /*
>>> vma_pkey() */
>>> +#include <asm/fpu/internal.h>
>>> +
>>> +void write_pkru(u32 pkru)
>>> +{
>>> + if (!boot_cpu_has(X86_FEATURE_OSPKE))
>>> + return;
>>> +
>>> + current->thread.fpu.pkru = pkru;
>>> +
>>
>> I thought that the offset of PKRU in the xstate was fixed after boot.
>
> You are right, it is. However, that offset would need
> to be stored somewhere, and the value read every time
> we wanted to read or store the PKRU value from/to the
> floating point state.
>
> I suspect that would not be any faster than keeping a
> copy of the PKRU value in a known location.
>
>> Anyway, as written, this needs a lockdep assertion that we’re not
>> preemptible, an explicit preempt_disable(), or a comment explaining
>> why it’s okay if we get preempted in this function.
>>
>>> + __fpregs_changes_begin();
>
> This handles the preemption disabling, see patch
> 3 of the series.

Sure, but the first write is *before* this. So we can be preempted with the two copies of PKRU being out of sync.
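
I.e. the assignment should probably move inside the protected section,
something like:

	__fpregs_changes_begin();
	current->thread.fpu.pkru = pkru;
	__fpregs_load_activate(&current->thread.fpu, smp_processor_id());
	__write_pkru(pkru);
	__fpregs_changes_end();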

>
>>> + __fpregs_load_activate(&current->thread.fpu,
>>> smp_processor_id());
>>> + __write_pkru(pkru);
>>> + __fpregs_changes_end();
>>> +}
>>>
>>> int __execute_only_pkey(struct mm_struct *mm)
>>> {
>>> --
>>> 2.19.0
>>>
>>
>>
> --
> All Rights Reversed.

Subject: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

While most of a task's FPU state is only needed in user space,
the protection keys need to be in place immediately after a
context switch.

The reason is that any accesses to userspace memory while running
in kernel mode also need to abide by the memory permissions
specified in the protection keys.

The pkru info is put in its own cache line in the fpu struct because
that cache line is accessed anyway at context switch time.
Remove XFEATURE_MASK_PKRU from supported flags. This removes the PKRU
state from XSAVE/XRESTORE and so decouples it from the FPU state.

The initial state of pkru is updated in pkru_set_init_value() via
fpu__clear() - it is no longer affected by fpstate_init().

Signed-off-by: Rik van Riel <[email protected]>
[bigeasy: load PKRU state only if we also load FPU content]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

v1…v2: remove PKRU from xsave/srestore.

On 2018-09-12 16:18:44 [+0200], Paolo Bonzini wrote:

> I think you can go a step further and exclude PKRU state from
> copy_kernel_to_fpregs altogether; you just use RDPKRU/WRPKRU. This also
> means you don't need to call __fpregs_* functions in write_pkru.

something like this then? It looks like kvm excludes PKRU from
xsave/xrestore, too. This wouldn't be required then. This is (again)
untested since I have no box with this PKRU feature. This only available
on Intel's Xeon Scalable, right?

> Thanks,
>
> Paolo

arch/x86/include/asm/fpu/internal.h | 14 ++++++++++++--
arch/x86/include/asm/fpu/types.h | 7 +++++++
arch/x86/include/asm/fpu/xstate.h | 1 -
arch/x86/include/asm/pgtable.h | 7 +++++--
arch/x86/include/asm/pkeys.h | 2 +-
arch/x86/kernel/fpu/core.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 4 ----
arch/x86/mm/pkeys.c | 21 ++++-----------------
include/linux/pkeys.h | 2 +-
9 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 16c4077ffc945..903ee77b6d5b0 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -573,8 +573,18 @@ static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
bool preload = static_cpu_has(X86_FEATURE_FPU) &&
new_fpu->initialized;

- if (preload)
- __fpregs_load_activate(new_fpu, cpu);
+ if (!preload)
+ return;
+
+ __fpregs_load_activate(new_fpu, cpu);
+
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+ /* Protection keys need to be in place right at context switch time. */
+ if (boot_cpu_has(X86_FEATURE_OSPKE)) {
+ if (new_fpu->pkru != __read_pkru())
+ __write_pkru(new_fpu->pkru);
+ }
+#endif
}

/*
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c53918ecfa..257b092bdaa4e 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -293,6 +293,13 @@ struct fpu {
*/
unsigned int last_cpu;

+ /*
+ * Protection key bits. The protection key needs to be switched out
+ * immediately at context switch time, so it is in place for things
+ * like copy_to_user.
+ */
+ unsigned int pkru;
+
/*
* @initialized:
*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 48581988d78c7..abe8793fa50f9 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -29,7 +29,6 @@
XFEATURE_MASK_OPMASK | \
XFEATURE_MASK_ZMM_Hi256 | \
XFEATURE_MASK_Hi16_ZMM | \
- XFEATURE_MASK_PKRU | \
XFEATURE_MASK_BNDREGS | \
XFEATURE_MASK_BNDCSR)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 690c0307afed0..d87bdfaf45e56 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -134,8 +134,11 @@ static inline u32 read_pkru(void)

static inline void write_pkru(u32 pkru)
{
- if (boot_cpu_has(X86_FEATURE_OSPKE))
- __write_pkru(pkru);
+ if (!boot_cpu_has(X86_FEATURE_OSPKE))
+ return;
+
+ current->thread.fpu.pkru = pkru;
+ __write_pkru(pkru);
}

static inline int pte_young(pte_t pte)
diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h
index 19b137f1b3beb..b184f916319e5 100644
--- a/arch/x86/include/asm/pkeys.h
+++ b/arch/x86/include/asm/pkeys.h
@@ -119,7 +119,7 @@ extern int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val);
extern int __arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
unsigned long init_val);
-extern void copy_init_pkru_to_fpregs(void);
+extern void pkru_set_init_value(void);

static inline int vma_pkey(struct vm_area_struct *vma)
{
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2ea85b32421a0..72cd2e2a07194 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -373,7 +373,7 @@ static inline void copy_init_fpstate_to_fpregs(void)
copy_kernel_to_fregs(&init_fpstate.fsave);

if (boot_cpu_has(X86_FEATURE_OSPKE))
- copy_init_pkru_to_fpregs();
+ pkru_set_init_value();
}

/*
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 87a57b7642d36..11014d841b9f7 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -920,10 +920,6 @@ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
int pkey_shift = (pkey * PKRU_BITS_PER_PKEY);
u32 new_pkru_bits = 0;

- /*
- * This check implies XSAVE support. OSPKE only gets
- * set if we enable XSAVE and we enable PKU in XCR0.
- */
if (!boot_cpu_has(X86_FEATURE_OSPKE))
return -EINVAL;

diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6e98e0a7c9231..2f9d95206c741 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -137,26 +137,13 @@ u32 init_pkru_value = PKRU_AD_KEY( 1) | PKRU_AD_KEY( 2) | PKRU_AD_KEY( 3) |
PKRU_AD_KEY(10) | PKRU_AD_KEY(11) | PKRU_AD_KEY(12) |
PKRU_AD_KEY(13) | PKRU_AD_KEY(14) | PKRU_AD_KEY(15);

-/*
- * Called from the FPU code when creating a fresh set of FPU
- * registers. This is called from a very specific context where
- * we know the FPU regstiers are safe for use and we can use PKRU
- * directly.
- */
-void copy_init_pkru_to_fpregs(void)
+void pkru_set_init_value(void)
{
u32 init_pkru_value_snapshot = READ_ONCE(init_pkru_value);
- /*
- * Any write to PKRU takes it out of the XSAVE 'init
- * state' which increases context switch cost. Avoid
- * writing 0 when PKRU was already 0.
- */
- if (!init_pkru_value_snapshot && !read_pkru())
+
+ if (init_pkru_value_snapshot == read_pkru())
return;
- /*
- * Override the PKRU state that came from 'init_fpstate'
- * with the baseline from the process.
- */
+
write_pkru(init_pkru_value_snapshot);
}

diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 2955ba9760489..9a9efecc1388f 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -44,7 +44,7 @@ static inline bool arch_pkeys_enabled(void)
return false;
}

-static inline void copy_init_pkru_to_fpregs(void)
+static inline void pkru_set_init_value(void)
{
}

--
2.19.0


2018-09-17 08:38:00

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

>> I think you can go a step further and exclude PKRU state from
>> copy_kernel_to_fpregs altogether; you just use RDPKRU/WRPKRU. This also
>> means you don't need to call __fpregs_* functions in write_pkru.
>
> something like this then? It looks like kvm excludes PKRU from
> xsave/xrestore, too. This wouldn't be required then

Yes, that's why the subject caught my eye! In fact, the reason for KVM
to do so is either the opposite of your patch (it wants to call WRPKRU
_after_ XRSTOR, not before) or the same (it wants to keep the userspace
PKRU loaded for as long as possible), depending on how you look at it.

> . This is (again)
> untested since I have no box with this PKRU feature. This only available
> on Intel's Xeon Scalable, right?

It is available on QEMU too (the slower JIT thing without KVM, i.e. use
/usr/bin/qemu-system-x86_64 instead of /usr/bin/qemu-kvm or /usr/bin/kvm).

> - if (preload)
> - __fpregs_load_activate(new_fpu, cpu);
> + if (!preload)
> + return;
> +
> + __fpregs_load_activate(new_fpu, cpu);
> +
> +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> + /* Protection keys need to be in place right at context switch time. */
> + if (boot_cpu_has(X86_FEATURE_OSPKE)) {
> + if (new_fpu->pkru != __read_pkru())
> + __write_pkru(new_fpu->pkru);
> + }
> +#endif

I think this would be before the "if (preload)"?
>
> if (boot_cpu_has(X86_FEATURE_OSPKE))
> - copy_init_pkru_to_fpregs();
> + pkru_set_init_value();
> }
>

Likewise, move this to fpu__clear and outside "if
(static_cpu_has(X86_FEATURE_FPU))"?

Also, a __read_pkru() seems to be missing in switch_fpu_prepare.

Thanks,

Paolo

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 2018-09-17 10:37:20 [+0200], Paolo Bonzini wrote:
> > . This is (again)
> > untested since I have no box with this PKRU feature. This is only available
> > on Intel's Xeon Scalable, right?
>
> It is available on QEMU too (the slower JIT thing without KVM, i.e. use
> /usr/bin/qemu-system-x86_64 instead of /usr/bin/qemu-kvm or /usr/bin/kvm).

okay. I managed to get the kernel to report this flag as available now.

> > - if (preload)
> > - __fpregs_load_activate(new_fpu, cpu);
> > + if (!preload)
> > + return;
> > +
> > + __fpregs_load_activate(new_fpu, cpu);
> > +
> > +#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> > + /* Protection keys need to be in place right at context switch time. */
> > + if (boot_cpu_has(X86_FEATURE_OSPKE)) {
> > + if (new_fpu->pkru != __read_pkru())
> > + __write_pkru(new_fpu->pkru);
> > + }
> > +#endif
>
> I think this would be before the "if (preload)"?

I did not find an explicit loading of pkru except this "xrestore"
which happens on "preload". From what I saw, preload was always set
except for kernel threads. Based on that, it looks to me like it can be
skipped if there is no FPU/kernel thread.

> >
> > if (boot_cpu_has(X86_FEATURE_OSPKE))
> > - copy_init_pkru_to_fpregs();
> > + pkru_set_init_value();
> > }
> >
>
> Likewise, move this to fpu__clear and outside "if
> (static_cpu_has(X86_FEATURE_FPU))"?

okay. But if there is no FPU we did not save/restore the pkru value. Is
this supposed to be an improvement?

> Also, a __read_pkru() seems to be missing in switch_fpu_prepare.

But the value is stored during write_pkru(). So the copy that is saved
should be identical to the value that would be read, correct?
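
In other words, the series presumably makes write_pkru() keep the cached
per-task value in sync with the register, roughly like this (a sketch only,
the actual hunk is not quoted in this thread; fpu->pkru is the field from
the v2 patch):

static inline void write_pkru(u32 pkru)
{
        if (!boot_cpu_has(X86_FEATURE_OSPKE))
                return;

        /*
         * Keep the cached copy and the register in sync so that
         * switch_fpu_prepare() can rely on fpu->pkru without a
         * __read_pkru().
         */
        current->thread.fpu.pkru = pkru;
        __write_pkru(pkru);
}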

>
> Thanks,
>
> Paolo

Sebastian

2018-09-18 15:08:04

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 18/09/2018 16:27, Sebastian Andrzej Siewior wrote:
>> Likewise, move this to fpu__clear and outside "if
>> (static_cpu_has(X86_FEATURE_FPU))"?
> okay. But if there is no FPU we did not save/restore the pkru value. Is
> this supposed to be an improvement?

Honestly it just seemed "more correct", but now that I think about it,
kernel threads should run with PKRU=0. Maybe there's a preexisting bug
that your patch has the occasion to fix.

Paolo

>> Also, a __read_pkru() seems to be missing in switch_fpu_prepare.
> But the value is stored during write_pkru(). So the copy that is saved
> should be identical to the value that would be read, correct?
>


2018-09-18 15:14:23

by Rik van Riel

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On Tue, 2018-09-18 at 17:07 +0200, Paolo Bonzini wrote:
> On 18/09/2018 16:27, Sebastian Andrzej Siewior wrote:
> > > Likewise, move this to fpu__clear and outside "if
> > > (static_cpu_has(X86_FEATURE_FPU))"?
> >
> > okay. But if there is no FPU we did not save/restore the pkru
> > value. Is
> > this supposed to be an improvement?
>
> Honestly it just seemed "more correct", but now that I think about
> it,
> kernel threads should run with PKRU=0. Maybe there's a preexisting
> bug
> that your patch has the occasion to fix.

I don't think it matters what the PKRU state is
for kernel threads, since kernel PTEs should not
be using protection keys anyway.

--
All Rights Reversed.



2018-09-18 15:31:17

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 18/09/2018 17:11, Rik van Riel wrote:
> On Tue, 2018-09-18 at 17:07 +0200, Paolo Bonzini wrote:
>> On 18/09/2018 16:27, Sebastian Andrzej Siewior wrote:
>>>> Likewise, move this to fpu__clear and outside "if
>>>> (static_cpu_has(X86_FEATURE_FPU))"?
>>>
>>> okay. But if there is no FPU we did not save/restore the pkru
>>> value. Is
>>> this supposed to be an improvement?
>>
>> Honestly it just seemed "more correct", but now that I think about
>> it,
>> kernel threads should run with PKRU=0. Maybe there's a preexisting
>> bug
>> that your patch has the occasion to fix.
>
> I don't think it matters what the PKRU state is
> for kernel threads, since kernel PTEs should not
> be using protection keys anyway.

What about copy_from/to_user?

Paolo


Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 2018-09-18 17:29:52 [+0200], Paolo Bonzini wrote:
> > I don't think it matters what the PKRU state is
> > for kernel threads, since kernel PTEs should not
> > be using protection keys anyway.
>
> What about copy_from/to_user?

This doesn't work for a kernel thread, does it? I mean they share the
init's MM and never do copy_{from|to}_user.

> Paolo

Sebastian

2018-09-18 17:29:26

by Rik van Riel

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On Tue, 2018-09-18 at 18:04 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-09-18 17:29:52 [+0200], Paolo Bonzini wrote:
> > > I don't think it matters what the PKRU state is
> > > for kernel threads, since kernel PTEs should not
> > > be using protection keys anyway.
> >
> > What about copy_from/to_user?
>
> This doesn't work for a kernel thread, does it? I mean they share the
> init's MM and never do copy_{from|to}_user.

Indeed, copy_from/to_user only works if current->mm
points at an mm_struct with userspace memory.

--
All Rights Reversed.



2018-09-19 05:56:19

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 18/09/2018 19:29, Rik van Riel wrote:
> On Tue, 2018-09-18 at 18:04 +0200, Sebastian Andrzej Siewior wrote:
>> On 2018-09-18 17:29:52 [+0200], Paolo Bonzini wrote:
>>>> I don't think it matters what the PKRU state is
>>>> for kernel threads, since kernel PTEs should not
>>>> be using protection keys anyway.
>>>
>>> What about copy_from/to_user?
>>
>> This doesn't work for a kernel thread, does it? I mean they share the
>> init's MM and never do copy_{from|to}_user.
>
> Indeed, copy_from/to_user only works if current->mm
> points at an mm_struct with userspace memory.

A kthread can do use_mm/unuse_mm.

Paolo


Subject: Re: [RFC PATCH 04/10] x86/fpu: eager switch PKRU state

On 2018-09-12 08:49:01 [-0700], Andy Lutomirski wrote:
> Sure, but the first write is *before* this. So we can be preempted with the two copies of PKRU being out of sync.

so it took a while to understand this but now that I did, I will
consider this in the next version. Thank you.

Sebastian

2018-09-19 17:05:41

by Paolo Bonzini

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 19/09/2018 18:57, Sebastian Andrzej Siewior wrote:
> On 2018-09-19 07:55:51 [+0200], Paolo Bonzini wrote:
>> A kthread can do use_mm/unuse_mm.
>
> indeed. The FPU struct for the kernel thread isn't valid / does not
> contain the expected PKRU value. So loading the pkru value from the
> struct FPU does not work as expected. We could set it to 0 for a kernel
> thread so we don't end up with a random value.
> If we want to get this usecase working then we would have to move pkru
> value from FPU to mm_struct and consider it in use_mm(). Do we want
> this?

As a start, I think keeping it in the FPU struct but loading it
unconditionally will work. kthreads will not obey PKU but it will be
better already.

I honestly don't know if PKRU should be per-mm, I don't know mm very
well despite my brilliant observation above. :)

Paolo

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 2018-09-19 07:55:51 [+0200], Paolo Bonzini wrote:
> A kthread can do use_mm/unuse_mm.

indeed. The FPU struct for the kernel thread isn't valid / does not
contain the expected PKRU value. So loading the pkru value from the
struct FPU does not work as expected. We could set it to 0 for a kernel
thread so we don't end up with a random value.
If we want to get this usecase working then we would have to move pkru
value from FPU to mm_struct and consider it in use_mm(). Do we want
this?

> Paolo

Sebastian

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace

On 2018-09-12 08:47:19 [-0700], Andy Lutomirski wrote:
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
> >
> > kernel_fpu_disable();
> >
> > - if (fpu->initialized) {
> > + __cpu_invalidate_fpregs_state();
> > +
> > + if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
>
> Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
Okay.
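
A minimal sketch of that change (untested; the save path is otherwise the
same as in the quoted hunk):

void __kernel_fpu_begin(void)
{
        struct fpu *fpu = &current->thread.fpu;

        WARN_ON_FPU(!irq_fpu_usable());

        kernel_fpu_disable();

        __cpu_invalidate_fpregs_state();

        /*
         * Fast path: once TIF_LOAD_FPU is set, this is a plain flag test
         * instead of a locked RMW. Only the first invocation needs to
         * save the registers.
         */
        if (!test_thread_flag(TIF_LOAD_FPU)) {
                set_thread_flag(TIF_LOAD_FPU);
                copy_fpregs_to_fpstate(fpu);
        }
}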

> Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.

Do you refer to the fpu.initialized check or something else?

Sebastian

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state

On 2018-09-19 19:00:34 [+0200], Paolo Bonzini wrote:
> On 19/09/2018 18:57, Sebastian Andrzej Siewior wrote:
> > On 2018-09-19 07:55:51 [+0200], Paolo Bonzini wrote:
> >> A kthread can do use_mm/unuse_mm.
> >
> > indeed. The FPU struct for the kernel thread isn't valid / does not
> > contain the expected PKRU value. So loading the pkru value from the
> > struct FPU does not work as expected. We could set it to 0 for a kernel
> > thread so we don't end up with a random value.
> > If we want to get this usecase working then we would have to move pkru
> > value from FPU to mm_struct and consider it in use_mm(). Do we want
> > this?
>
> As a start, I think keeping it in the FPU struct but loading it
> unconditionally will work. kthreads will not obey PKU but it will be
> better already.

Are you saying I should load the uninitialized value for kthreads or
load 0 to have it in a known state?

> I honestly don't know if PKRU should be per-mm, I don't know mm very
> well despite my brilliant observation above. :)

Now that I have a qemu machine with PKRU I would need to figure out how
this works. So unless I am mistaken, mm is per task and the FPU is per
thread. And since the FPU struct isn't initialized for kthreads, we
should end up with 0. But setting it to 0 if not used sounds good.

> Paolo

Sebastian

2018-09-19 19:52:02

by Rik van Riel

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state



> On Sep 19, 2018, at 1:00 PM, Paolo Bonzini <[email protected]> wrote:
>
> On 19/09/2018 18:57, Sebastian Andrzej Siewior wrote:
>> On 2018-09-19 07:55:51 [+0200], Paolo Bonzini wrote:
>>> A kthread can do use_mm/unuse_mm.
>>
>> indeed. The FPU struct for the kernel thread isn't valid / does not
>> contain the expected PKRU value. So loading the pkru value from the
>> struct FPU does not work as expected. We could set it to 0 for a kernel
>> thread so we don't end up with a random value.
>> If we want to get this usecase working then we would have to move pkru
>> value from FPU to mm_struct and consider it in use_mm(). Do we want
>> this?
>
> As a start, I think keeping it in the FPU struct but loading it
> unconditionally will work. kthreads will not obey PKU but it will be
> better already.
>
> I honestly don't know if PKRU should be per-mm, I don't know mm very
> well despite my brilliant observation above. :)
>
One of the rumored use cases for PKRU is to allow different
threads in the same process to have different memory
permissions, while still sharing the same page tables.

Making it per-mm would break that :)


2018-09-19 19:57:32

by Andy Lutomirski

Subject: Re: [RFC PATCH 04/10 v2 ] x86/fpu: eager switch PKRU state



> On Sep 19, 2018, at 10:00 AM, Paolo Bonzini <[email protected]> wrote:
>
>> On 19/09/2018 18:57, Sebastian Andrzej Siewior wrote:
>>> On 2018-09-19 07:55:51 [+0200], Paolo Bonzini wrote:
>>> A kthread can do use_mm/unuse_mm.
>>
>> indeed. The FPU struct for the kernel thread isn't valid / does not
>> contain the expected PKRU value. So loading the pkru value from the
>> struct FPU does not work as expected. We could set it to 0 for a kernel
>> thread so we don't end up with a random value.
>> If we want to get this usecase working then we would have to move pkru
>> value from FPU to mm_struct and consider it in use_mm(). Do we want
>> this?
>
> As a start, I think keeping it in the FPU struct but loading it
> unconditionally will work. kthreads will not obey PKU but it will be
> better already.
>
> I honestly don't know if PKRU should be per-mm, I don't know mm very
> well despite my brilliant observation above. :)
>
>

It must be per thread. I don’t think it’s possible to have sane semantics per mm.

I also think that use_mm should set PKRU to the same value that signal handlers use. If we use 0, it’s a recipe for accidental PK bypass.
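
Sketching the idea only (use_mm() itself is generic code, so a real patch
would need an arch hook; the helper name here is made up and init_pkru_value
would have to be made visible outside of arch/x86/mm/pkeys.c):

/*
 * Hypothetical: called after use_mm() has switched ->mm, so a kthread
 * gets the same baseline PKRU value a fresh thread or signal handler
 * would get, instead of the all-access PKRU=0.
 */
static inline void pkru_load_init_value(void)
{
        if (boot_cpu_has(X86_FEATURE_OSPKE))
                write_pkru(READ_ONCE(init_pkru_value));
}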

2018-09-21 03:46:34

by Andy Lutomirski

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace



> On Sep 19, 2018, at 10:05 AM, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> On 2018-09-12 08:47:19 [-0700], Andy Lutomirski wrote:
>>> --- a/arch/x86/kernel/fpu/core.c
>>> +++ b/arch/x86/kernel/fpu/core.c
>>> @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
>>>
>>> kernel_fpu_disable();
>>>
>>> - if (fpu->initialized) {
>>> + __cpu_invalidate_fpregs_state();
>>> +
>>> + if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
>>
>> Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
> okay.
>
>> Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.
>
> Do you refer to the fpu.initialized check or something else?
>

I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded. So we don’t need the special case for kernel threads.

Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context? It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn. I think that at least switch_fpu_finish should call it. Arguably switch_fpu_prepare should too, at the beginning.
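
A minimal sketch of such a helper (hypothetical name; it assumes the
existing fpregs_state_valid() check and callers that run with preemption
disabled, like switch_fpu_finish()):

static void fpregs_assert_state_consistent(void)
{
        struct fpu *fpu = &current->thread.fpu;

        /*
         * If owner_ctx/last_cpu say the registers do not hold current's
         * state, TIF_LOAD_FPU must be set so the state gets loaded before
         * the task returns to userspace.
         */
        if (!fpregs_state_valid(fpu, smp_processor_id()))
                WARN_ON_FPU(!test_thread_flag(TIF_LOAD_FPU));
}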

2018-09-21 04:16:00

by Andy Lutomirski

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace



> On Sep 20, 2018, at 8:45 PM, Andy Lutomirski <[email protected]> wrote:
>
>
>
>> On Sep 19, 2018, at 10:05 AM, Sebastian Andrzej Siewior <[email protected]> wrote:
>>
>> On 2018-09-12 08:47:19 [-0700], Andy Lutomirski wrote:
>>>> --- a/arch/x86/kernel/fpu/core.c
>>>> +++ b/arch/x86/kernel/fpu/core.c
>>>> @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
>>>>
>>>> kernel_fpu_disable();
>>>>
>>>> - if (fpu->initialized) {
>>>> + __cpu_invalidate_fpregs_state();
>>>> +
>>>> + if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
>>>
>>> Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
>> okay.
>>
>>> Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.
>>
>> Do you refer to the fpu.initialized check or something else?
>>
>
> I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded. So we don’t need the special case for kernel threads.
>
> Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context? It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn. I think that at least switch_fpu_finish should call it. Arguably switch_fpu_prepare should too, at the beginning.

Looking some more, the “preload” variable needs to go away or be renamed. It hasn’t had anything to do with preloading for some time.

Also, the interaction between TIF_LOAD_FPU and FPU emulation needs to be documented somewhere. Probably FPU-less systems should never have TIF_LOAD_FPU set.

Or we could decide that no one uses FPU emulation any more.

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace

On 2018-09-20 21:15:35 [-0700], Andy Lutomirski wrote:
> > I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded. So we don’t need the special case for kernel threads.
> >
> > Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context? It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn. I think that at least switch_fpu_finish should call it. Arguably switch_fpu_prepare should too, at the beginning.
>
> Looking some more, the “preload” variable needs to go away or be renamed. It hasn’t had anything to do with preloading for some time.
okay.

> Also, the interaction between TIF_LOAD_FPU and FPU emulation needs to be documented somewhere. Probably FPU-less systems should never have TIF_LOAD_FPU set.
Yes, they should not.

> Or we could decide that no one uses FPU emulation any more.

Oh. Removing unused code? I'm all yours.
There is this Intel Quark thingy which comes to mind and can still be
bought. Its data sheet[0] has this:
| 13.1 Features:
| Note: The processor does not provide an x87 Floating Point Unit (FPU) and does
| not support x87 FPU instructions

so not only does it not support the lock prefix, it also relies on
soft-FPU.
The latest bsp release notes quotes a package named
quark_linux_v3.14+v1.2.1.1.tar.gz

so they still use v3.14 (which is not a supported kernel anymore).
And then I took a look into their Yocto-BSP and found this:
| $ fgrep -R MATH_EMULATION .
| ./meta-intel-quark/recipes-kernel/linux/files/quark_3.14.cfg:# CONFIG_MATH_EMULATION is not set

so they don't set this option. This is a small SoC and does not run any
distro due to the missing lock prefix. So if they use yocto to recompile
everything, they can rebuild their toolchain with soft-fpu support which
is more efficient than calling into the kernel for every opcode.

So I *think* nobody relies on FPU-emulation anymore. I would suggest
getting this patch set into shape and then getting rid of
CONFIG_MATH_EMULATION?

[0] https://www.intel.de/content/dam/www/public/us/en/documents/datasheets/quark-c1000-datasheet.pdf
[1] https://downloadmirror.intel.com/23197/eng/Quark_SW_RelNotes_330232_007.pdf

Sebastian

2018-09-26 14:34:36

by Andy Lutomirski

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace



> On Sep 26, 2018, at 4:12 AM, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> On 2018-09-20 21:15:35 [-0700], Andy Lutomirski wrote:
>>> I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded. So we don’t need the special case for kernel threads.
>>>
>>> Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context? It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn. I think that at least switch_fpu_finish should call it. Arguably switch_fpu_prepare should too, at the beginning.
>>
>> Looking some more, the “preload” variable needs to go away or be renamed. It hasn’t had anything to do with preloading for some time.
> okay.
>
>> Also, the interaction between TIF_LOAD_FPU and FPU emulation needs to be documented somewhere. Probably FPU-less systems should never have TIF_LOAD_FPU set.
> Yes, they should not.
>
>> Or we could decide that no one uses FPU emulation any more.
>
> Oh. Removing unused code? I'm all yours.
> There is this Intel Quark thingy which comes to mind and can still be
> bought. Its data sheet[0] has this:
> | 13.1 Features:
> | Note: The processor does not provide an x87 Floating Point Unit (FPU) and does
> | not support x87 FPU instructions
>
> so not only does it not support the lock prefix, it also relies on
> soft-FPU.
> The latest bsp release notes quotes a package named
> quark_linux_v3.14+v1.2.1.1.tar.gz
>
> so they still use v3.14 (which is not a supported kernel anymore).
> And then I took a look into their Yocto-BSP and found this:
> | $ fgrep -R MATH_EMULATION .
> | ./meta-intel-quark/recipes-kernel/linux/files/quark_3.14.cfg:# CONFIG_MATH_EMULATION is not set
>
> so they don't set this option. This is a small SoC and does not run any
> distro due to the missing lock prefix. So if they use yocto to recompile
> everything, they can rebuild their toolchain with soft-fpu support which
> is more efficient than calling into the kernel for every opcode.
>
> So I *think* nobody relies on FPU-emulation anymore. I would suggest
> getting this patch set into shape and then getting rid of
> CONFIG_MATH_EMULATION?

I don’t think you can boot a kernel without math emulation on a no-FPU CPU. So maybe we should table this particular idea. I didn’t realize there were Quark CPUs without FPU.


Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace

On 2018-09-26 07:34:09 [-0700], Andy Lutomirski wrote:
> > So I *think* nobody relies on FPU-emulation anymore. I would suggest
> > getting this patch set into shape and then getting rid of
> > CONFIG_MATH_EMULATION?

Bryan, Denys, do either of you rely on CONFIG_MATH_EMULATION?
The manual for Quark claims that the CPU has no FPU and yet the current
kernel (or the yocto v3.14) does not select this option.
Denys added support for some opcodes in 2015 but I didn't figure out
*why* or which CPU in particular requires this.

> I don’t think you can boot a kernel without math emulation on a no-FPU CPU. So maybe we should table this particular idea. I didn’t realize there were Quark CPUs without FPU.

Okay.

Sebastian

2018-09-26 16:27:03

by Andy Lutomirski

Subject: Re: [RFC PATCH 10/10] x86/fpu: defer FPU state load until return to userspace



> On Sep 26, 2018, at 8:32 AM, Sebastian Andrzej Siewior <[email protected]> wrote:
>
> On 2018-09-26 07:34:09 [-0700], Andy Lutomirski wrote:
>>> So I *think* nobody relies on FPU-emulation anymore. I would suggest
>>> getting this patch set into shape and then getting rid of
>>> CONFIG_MATH_EMULATION?
>
> Bryan, Denys, does anyone of you rely on CONFIG_MATH_EMULATION?
> The manual for Quark claims that the CPU has no FPU and yet the current
> kernel (or the yocto v3.14) does not select this option.
> Denys added support for some opcodes in 2015 but I didn't figure out
> *why* or which CPU in particular requires this.

Certainly the early Quark CPUs were, um, crappy. They had so many terrifying errata that I’m surprised anyone is willing to use them.

I would *hope* that all supported Quark CPUs support FPU instructions even if they’re microcoded and run extremely slowly.

Subject: Re: [RFC PATCH 01/10] x86/entry: remove _TIF_ALLWORK_MASK

On 2018-09-12 15:33:44 [+0200], To [email protected] wrote:
> There is no user of _TIF_ALLWORK_MASK since commit 21d375b6b34ff
> ("x86/entry/64: Remove the SYSCALL64 fast path").
> Remove unused define _TIF_ALLWORK_MASK.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

Is it okay for that one to be merged as-is or should I repost it without
the RFC?

Sebastian