2016-10-01 20:50:17

by Rik van Riel

Subject: [PATCH RFC 0/5] x86,fpu: make FPU context switching much lazier

This series is an attempt at making the x86 FPU context switching
code much lazier. By only reloading the FPU context when a task
switches to user mode, we can avoid switching FPU context for tasks
that spin in kernel mode, and avoid reloading the FPU context for
tasks that get interrupted by a kernel thread or briefly go idle.

It also allows us to skip restoring the userspace FPU context when
exiting a KVM guest.
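
At a high level, the flow this series moves to looks like this
(a sketch; the function names come from the patches below):

	/*
	 * context switch:  switch_fpu_prepare() - save the old task's registers
	 *                  switch_fpu_finish()  - set TIF_LOAD_FPU, defer the load
	 * return to user:  switch_fpu_return()  - reload the registers, only
	 *                                         if actually needed
	 */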

This series is still BROKEN. The first 3 patches seem to work
fine in my tests (but should not, due to missing signal path
code), while the 4th patch makes it easier to trigger bugs.

I am posting this to ask about obvious issues people may see,
ideas on what direction I should take this series in, and to
avoid code conflicts with Andy's plans wrt. lazy fpu mode.


2016-10-01 20:50:05

by Rik van Riel

Subject: [PATCH RFC 3/5] x86,fpu: add kernel fpu argument to __kernel_fpu_begin

From: Rik van Riel <[email protected]>

Most kernel FPU contexts are transient, but a KVM VCPU context
persists. Add a kernel FPU argument to __kernel_fpu_begin, so
we can know whether or not the KVM VCPU context got clobbered
by another kernel FPU context.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/efi.h | 2 +-
arch/x86/include/asm/fpu/api.h | 2 +-
arch/x86/kernel/fpu/core.c | 6 +++---
arch/x86/kvm/x86.c | 8 ++++++--
4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index d0bb76d81402..603d2cdd6b82 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -71,7 +71,7 @@ struct efi_scratch {
({ \
efi_sync_low_kernel_mappings(); \
preempt_disable(); \
- __kernel_fpu_begin(); \
+ __kernel_fpu_begin(NULL); \
\
if (efi_scratch.use_pgd) { \
efi_scratch.prev_cr3 = read_cr3(); \
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index edd7dc7ae4f7..f6704edf9904 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -20,7 +20,7 @@
* All other cases use kernel_fpu_begin/end() which disable preemption
* during kernel FPU usage.
*/
-extern void __kernel_fpu_begin(void);
+extern void __kernel_fpu_begin(struct fpu *fpu);
extern void __kernel_fpu_end(void);
extern void kernel_fpu_begin(void);
extern void kernel_fpu_end(void);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c4350f188be1..537eb65b6ae6 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -110,7 +110,7 @@ bool irq_fpu_usable(void)
}
EXPORT_SYMBOL(irq_fpu_usable);

-void __kernel_fpu_begin(void)
+void __kernel_fpu_begin(struct fpu *kernelfpu)
{
struct fpu *fpu = &current->thread.fpu;

@@ -118,7 +118,7 @@ void __kernel_fpu_begin(void)

kernel_fpu_disable();

- this_cpu_write(fpu_fpregs_owner_ctx, NULL);
+ this_cpu_write(fpu_fpregs_owner_ctx, kernelfpu);

if (fpu->fpregs_active) {
/*
@@ -150,7 +150,7 @@ EXPORT_SYMBOL(__kernel_fpu_end);
void kernel_fpu_begin(void)
{
preempt_disable();
- __kernel_fpu_begin();
+ __kernel_fpu_begin(NULL);
}
EXPORT_SYMBOL_GPL(kernel_fpu_begin);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 699f8726539a..55c82d066d3a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7332,6 +7332,8 @@ static void fx_init(struct kvm_vcpu *vcpu)

void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
+ struct fpu *fpu;
+
if (vcpu->guest_fpu_loaded)
return;

@@ -7340,9 +7342,11 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
* and assume host would use all available bits.
* Guest xcr0 would be loaded later.
*/
+ fpu = &vcpu->arch.guest_fpu;
+
vcpu->guest_fpu_loaded = 1;
- __kernel_fpu_begin();
- __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state);
+ __kernel_fpu_begin(fpu);
+ __copy_kernel_to_fpregs(&fpu->state);
trace_kvm_fpu(1);
}

--
2.7.4
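
With fpu_fpregs_owner_ctx now recording which kernel FPU context owns
the registers, a persistent user like KVM can detect clobbering with a
check along these lines (a sketch; the helper name is invented here,
and patch 4/5 adds the real helper, fpu_lazy_skip_restore()):

	/*
	 * Did another kernel FPU context clobber our registers? Like the
	 * rest of the FPU switching code, this must run with preemption
	 * disabled.
	 */
	static inline bool vcpu_fpregs_still_loaded(struct fpu *fpu)
	{
		return this_cpu_read(fpu_fpregs_owner_ctx) == fpu;
	}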

2016-10-01 20:50:08

by Rik van Riel

Subject: [PATCH RFC 5/5] x86,fpu: kinda sorta fix up signal path

From: Rik van Riel <[email protected]>

We need to ensure that the FPU save code and the lazy restore code
never take stale state from one location (kernel memory or the FPU
registers) and copy it over to the other.

I am pretty sure this is incomplete.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/kernel/fpu/core.c | 1 +
arch/x86/kernel/fpu/signal.c | 4 ++++
2 files changed, 5 insertions(+)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 537eb65b6ae6..fa59cc741fa5 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -314,6 +314,7 @@ void fpu__activate_curr(struct fpu *fpu)
trace_x86_fpu_activate_state(fpu);
/* Safe to do for the current task: */
fpu->fpstate_active = 1;
+ fpu->last_cpu = -1;
}
}
EXPORT_SYMBOL_GPL(fpu__activate_curr);
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index a184c210efba..89f882983da7 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -171,6 +171,10 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
(struct _fpstate_32 __user *) buf) ? -1 : 1;

if (fpregs_active() || using_compacted_format()) {
+ /* Compacted format, but the FP state is not loaded yet. */
+ if (unlikely(!fpu_lazy_skip_restore(&tsk->thread.fpu)))
+ copy_kernel_to_fpregs(&tsk->thread.fpu.state);
+
/* Save the live register state to the user directly. */
if (copy_fpregs_to_sigframe(buf_fx))
return -1;
--
2.7.4

2016-10-01 20:49:59

by Rik van Riel

Subject: [PATCH RFC 1/5] x86,fpu: split prev/next task fpu state handling

From: Rik van Riel <[email protected]>

Move all handling of the next state FPU state handling into
switch_fpu_finish, in preparation for more lazily switching
FPU states.

CR0.TS state is mirrored in a per-cpu variable, instead of
being passed around in a local variable, because that will
not be possible later in the series.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 78 ++++++++++++++++---------------------
arch/x86/kernel/fpu/core.c | 1 +
arch/x86/kernel/process_32.c | 5 +--
arch/x86/kernel/process_64.c | 5 +--
4 files changed, 39 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 2737366ea583..79e1cee9f3b0 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -482,6 +482,7 @@ extern int copy_fpstate_to_sigframe(void __user *buf, void __user *fp, int size)
*/

DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
+DECLARE_PER_CPU(bool, fpu_active);

/*
* Must be run with preemption disabled: this clears the fpu_fpregs_owner_ctx,
@@ -577,28 +578,15 @@ static inline void fpregs_deactivate(struct fpu *fpu)
*
* This is a two-stage process:
*
- * - switch_fpu_prepare() saves the old state and
- * sets the new state of the CR0.TS bit. This is
- * done within the context of the old process.
+ * - switch_fpu_prepare() saves the old state.
+ * This is done within the context of the old process.
*
- * - switch_fpu_finish() restores the new state as
- * necessary.
+ * - switch_fpu_finish() restores the new state
+ * and flips CR0.TS as necessary.
*/
-typedef struct { int preload; } fpu_switch_t;
-
-static inline fpu_switch_t
-switch_fpu_prepare(struct fpu *old_fpu, struct fpu *new_fpu, int cpu)
+static inline void
+switch_fpu_prepare(struct fpu *old_fpu, int cpu)
{
- fpu_switch_t fpu;
-
- /*
- * If the task has used the math, pre-load the FPU on xsave processors
- * or if the past 5 consecutive context-switches used math.
- */
- fpu.preload = static_cpu_has(X86_FEATURE_FPU) &&
- new_fpu->fpstate_active &&
- (use_eager_fpu() || new_fpu->counter > 5);
-
if (old_fpu->fpregs_active) {
if (!copy_fpregs_to_fpstate(old_fpu))
old_fpu->last_cpu = -1;
@@ -608,29 +596,10 @@ switch_fpu_prepare(struct fpu *old_fpu, struct fpu *new_fpu, int cpu)
/* But leave fpu_fpregs_owner_ctx! */
old_fpu->fpregs_active = 0;
trace_x86_fpu_regs_deactivated(old_fpu);
-
- /* Don't change CR0.TS if we just switch! */
- if (fpu.preload) {
- new_fpu->counter++;
- __fpregs_activate(new_fpu);
- trace_x86_fpu_regs_activated(new_fpu);
- prefetch(&new_fpu->state);
- } else {
- __fpregs_deactivate_hw();
- }
} else {
old_fpu->counter = 0;
old_fpu->last_cpu = -1;
- if (fpu.preload) {
- new_fpu->counter++;
- if (fpu_want_lazy_restore(new_fpu, cpu))
- fpu.preload = 0;
- else
- prefetch(&new_fpu->state);
- fpregs_activate(new_fpu);
- }
}
- return fpu;
}

/*
@@ -638,15 +607,36 @@ switch_fpu_prepare(struct fpu *old_fpu, struct fpu *new_fpu, int cpu)
*/

/*
- * By the time this gets called, we've already cleared CR0.TS and
- * given the process the FPU if we are going to preload the FPU
- * state - all we need to do is to conditionally restore the register
- * state itself.
+ * Set up the userspace FPU context for the new task.
*/
-static inline void switch_fpu_finish(struct fpu *new_fpu, fpu_switch_t fpu_switch)
+static inline void switch_fpu_finish(struct fpu *new_fpu)
{
- if (fpu_switch.preload)
+ bool preload;
+ /*
+ * If the task has used the math, pre-load the FPU on xsave processors
+ * or if the past 5 consecutive context-switches used math.
+ */
+ preload = static_cpu_has(X86_FEATURE_FPU) &&
+ new_fpu->fpstate_active &&
+ (use_eager_fpu() || new_fpu->counter > 5);
+
+ if (preload) {
+ prefetch(&new_fpu->state);
+ new_fpu->counter++;
+ __fpregs_activate(new_fpu);
+ trace_x86_fpu_regs_activated(new_fpu);
+
+ /* Don't change CR0.TS if we just switch! */
+ if (!__this_cpu_read(fpu_active)) {
+ __fpregs_activate_hw();
+ __this_cpu_write(fpu_active, true);
+ }
+
copy_kernel_to_fpregs(&new_fpu->state);
+ } else if (__this_cpu_read(fpu_active)) {
+ __this_cpu_write(fpu_active, false);
+ __fpregs_deactivate_hw();
+ }
}

/*
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 3fc03a09a93b..82cd46584528 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -39,6 +39,7 @@ static DEFINE_PER_CPU(bool, in_kernel_fpu);
* Track which context is using the FPU on the CPU:
*/
DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
+DEFINE_PER_CPU(bool, fpu_active) = false;

static void kernel_fpu_disable(void)
{
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index d86be29c38c7..8cd2f42190dc 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -247,11 +247,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
struct fpu *next_fpu = &next->fpu;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
- fpu_switch_t fpu_switch;

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

- fpu_switch = switch_fpu_prepare(prev_fpu, next_fpu, cpu);
+ switch_fpu_prepare(prev_fpu, cpu);

/*
* Save away %gs. No need to save %fs, as it was saved on the
@@ -310,7 +309,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (prev->gs | next->gs)
lazy_load_gs(next->gs);

- switch_fpu_finish(next_fpu, fpu_switch);
+ switch_fpu_finish(next_fpu);

this_cpu_write(current_task, next_p);

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 63236d8f84bf..92b9485a6a18 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -264,9 +264,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
unsigned prev_fsindex, prev_gsindex;
- fpu_switch_t fpu_switch;

- fpu_switch = switch_fpu_prepare(prev_fpu, next_fpu, cpu);
+ switch_fpu_prepare(prev_fpu, cpu);

/* We must save %fs and %gs before load_TLS() because
* %fs and %gs may be cleared by load_TLS().
@@ -416,7 +415,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
prev->gsbase = 0;
prev->gsindex = prev_gsindex;

- switch_fpu_finish(next_fpu, fpu_switch);
+ switch_fpu_finish(next_fpu);

/*
* Switch the PDA and FPU contexts.
--
2.7.4

2016-10-01 20:50:55

by Rik van Riel

Subject: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

From: Rik van Riel <[email protected]>

Delay the loading of FPU registers until a process switches back to
userspace. This allows us to skip FPU saving & restoring for kernel
threads, the idle task, and tasks that are spinning in kernel space.

It also allows us to not repeatedly save & restore the userspace FPU
context on repeated invocations of kernel_fpu_begin & kernel_fpu_end.

Not overwriting the FPU state of a task unless we need to also allows
us to be lazier about restoring it, in a future patch.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/entry/common.c | 4 ++++
arch/x86/include/asm/fpu/api.h | 5 +++++
arch/x86/include/asm/fpu/internal.h | 44 +++++++++----------------------------
arch/x86/include/asm/thread_info.h | 4 +++-
arch/x86/kernel/fpu/core.c | 17 ++++++++------
arch/x86/kernel/process.c | 35 +++++++++++++++++++++++++++++
arch/x86/kernel/process_32.c | 5 ++---
arch/x86/kernel/process_64.c | 5 ++---
8 files changed, 71 insertions(+), 48 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 1433f6b4607d..a69bbefa3408 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -27,6 +27,7 @@
#include <asm/vdso.h>
#include <asm/uaccess.h>
#include <asm/cpufeature.h>
+#include <asm/fpu/api.h>

#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
exit_to_usermode_loop(regs, cached_flags);

+ if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
+ switch_fpu_return();
+
#ifdef CONFIG_COMPAT
/*
* Compat syscalls set TS_COMPAT. Make sure we clear it before
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 1429a7c736db..edd7dc7ae4f7 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -37,6 +37,11 @@ extern int irq_ts_save(void);
extern void irq_ts_restore(int TS_state);

/*
+ * Set up the userspace FPU context before returning to userspace.
+ */
+extern void switch_fpu_return(void);
+
+/*
* Query the presence of one or more xfeatures. Works on any legacy CPU as well.
*
* If 'feature_name' is set then put a human-readable description of
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 79e1cee9f3b0..b5accb35e434 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -19,6 +19,7 @@
#include <asm/fpu/xstate.h>
#include <asm/cpufeature.h>
#include <asm/trace/fpu.h>
+#include <asm/thread_info.h>

/*
* High level FPU state handling functions:
@@ -576,13 +577,17 @@ static inline void fpregs_deactivate(struct fpu *fpu)
/*
* FPU state switching for scheduling.
*
- * This is a two-stage process:
+ * This is a three-stage process:
*
* - switch_fpu_prepare() saves the old state.
* This is done within the context of the old process.
*
- * - switch_fpu_finish() restores the new state
- * and flips CR0.TS as necessary.
+ * - switch_fpu_finish() sets TIF_LOAD_FPU, causing FPU state to
+ * be loaded when the new process returns to userspace.
+ * This is done with current_task pointing to the new process.
+ *
+ * - switch_fpu_return() restores the new state and flips CR0.TS as
+ * necessary. This only runs if the process returns to userspace.
*/
static inline void
switch_fpu_prepare(struct fpu *old_fpu, int cpu)
@@ -605,38 +610,9 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu)
/*
* Misc helper functions:
*/
-
-/*
- * Set up the userspace FPU context for the new task.
- */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(void)
{
- bool preload;
- /*
- * If the task has used the math, pre-load the FPU on xsave processors
- * or if the past 5 consecutive context-switches used math.
- */
- preload = static_cpu_has(X86_FEATURE_FPU) &&
- new_fpu->fpstate_active &&
- (use_eager_fpu() || new_fpu->counter > 5);
-
- if (preload) {
- prefetch(&new_fpu->state);
- new_fpu->counter++;
- __fpregs_activate(new_fpu);
- trace_x86_fpu_regs_activated(new_fpu);
-
- /* Don't change CR0.TS if we just switch! */
- if (!__this_cpu_read(fpu_active)) {
- __fpregs_activate_hw();
- __this_cpu_write(fpu_active, true);
- }
-
- copy_kernel_to_fpregs(&new_fpu->state);
- } else if (__this_cpu_read(fpu_active)) {
- __this_cpu_write(fpu_active, false);
- __fpregs_deactivate_hw();
- }
+ set_thread_flag(TIF_LOAD_FPU);
}

/*
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8b7c8d8e0852..401e9c3e6039 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -106,6 +106,7 @@ struct thread_info {
#define TIF_SYSCALL_TRACEPOINT 28 /* syscall tracepoint instrumentation */
#define TIF_ADDR32 29 /* 32-bit address space on 64 bits */
#define TIF_X32 30 /* 32-bit native x86-64 binary */
+#define TIF_LOAD_FPU 31 /* load FPU on return to userspace */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -129,6 +130,7 @@ struct thread_info {
#define _TIF_SYSCALL_TRACEPOINT (1 << TIF_SYSCALL_TRACEPOINT)
#define _TIF_ADDR32 (1 << TIF_ADDR32)
#define _TIF_X32 (1 << TIF_X32)
+#define _TIF_LOAD_FPU (1 << TIF_LOAD_FPU)

/*
* work to do in syscall_trace_enter(). Also includes TIF_NOHZ for
@@ -142,7 +144,7 @@ struct thread_info {
/* work to do on any return to user space */
#define _TIF_ALLWORK_MASK \
((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT | \
- _TIF_NOHZ)
+ _TIF_NOHZ | _TIF_LOAD_FPU)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 82cd46584528..c4350f188be1 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -118,6 +118,8 @@ void __kernel_fpu_begin(void)

kernel_fpu_disable();

+ this_cpu_write(fpu_fpregs_owner_ctx, NULL);
+
if (fpu->fpregs_active) {
/*
* Ignore return value -- we don't care if reg state
@@ -125,8 +127,10 @@ void __kernel_fpu_begin(void)
*/
copy_fpregs_to_fpstate(fpu);
} else {
- this_cpu_write(fpu_fpregs_owner_ctx, NULL);
- __fpregs_activate_hw();
+ if (!__this_cpu_read(fpu_active)) {
+ __this_cpu_write(fpu_active, true);
+ __fpregs_activate_hw();
+ }
}
}
EXPORT_SYMBOL(__kernel_fpu_begin);
@@ -135,11 +139,10 @@ void __kernel_fpu_end(void)
{
struct fpu *fpu = &current->thread.fpu;

- if (fpu->fpregs_active)
- copy_kernel_to_fpregs(&fpu->state);
- else
- __fpregs_deactivate_hw();
-
+ if (fpu->fpregs_active) {
+ switch_fpu_finish();
+ fpu->fpregs_active = 0;
+ }
kernel_fpu_enable();
}
EXPORT_SYMBOL(__kernel_fpu_end);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 62c0b0ea2ce4..087413be39cf 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -32,6 +32,7 @@
#include <asm/tlbflush.h>
#include <asm/mce.h>
#include <asm/vm86.h>
+#include <asm/fpu/types.h>

/*
* per-CPU TSS segments. Threads are completely 'soft' on Linux,
@@ -191,6 +192,40 @@ int set_tsc_mode(unsigned int val)
return 0;
}

+/*
+ * Set up the userspace FPU context before returning to userspace.
+ */
+void switch_fpu_return(void)
+{
+ struct fpu *fpu = &current->thread.fpu;
+ bool preload;
+ /*
+ * If the task has used the math, pre-load the FPU on xsave processors
+ * or if the past 5 consecutive context-switches used math.
+ */
+ preload = static_cpu_has(X86_FEATURE_FPU) &&
+ fpu->fpstate_active &&
+ (use_eager_fpu() || fpu->counter > 5);
+
+ if (preload) {
+ prefetch(&fpu->state);
+ fpu->counter++;
+ __fpregs_activate(fpu);
+ trace_x86_fpu_regs_activated(fpu);
+
+ /* Don't change CR0.TS if we just switch! */
+ if (!__this_cpu_read(fpu_active)) {
+ __fpregs_activate_hw();
+ __this_cpu_write(fpu_active, true);
+ }
+
+ copy_kernel_to_fpregs(&fpu->state);
+ } else if (__this_cpu_read(fpu_active)) {
+ __this_cpu_write(fpu_active, false);
+ __fpregs_deactivate_hw();
+ }
+}
+
void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
struct tss_struct *tss)
{
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8cd2f42190dc..45e08c14e06d 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -244,7 +244,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
struct thread_struct *prev = &prev_p->thread,
*next = &next_p->thread;
struct fpu *prev_fpu = &prev->fpu;
- struct fpu *next_fpu = &next->fpu;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(cpu_tss, cpu);

@@ -309,9 +308,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (prev->gs | next->gs)
lazy_load_gs(next->gs);

- switch_fpu_finish(next_fpu);
-
this_cpu_write(current_task, next_p);

+ switch_fpu_finish();
+
return prev_p;
}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 92b9485a6a18..f3b83b6af6ea 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -260,7 +260,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
struct thread_struct *prev = &prev_p->thread;
struct thread_struct *next = &next_p->thread;
struct fpu *prev_fpu = &prev->fpu;
- struct fpu *next_fpu = &next->fpu;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
unsigned prev_fsindex, prev_gsindex;
@@ -415,8 +414,6 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
prev->gsbase = 0;
prev->gsindex = prev_gsindex;

- switch_fpu_finish(next_fpu);
-
/*
* Switch the PDA and FPU contexts.
*/
@@ -425,6 +422,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* Reload esp0 and ss1. This changes current_thread_info(). */
load_sp0(tss, next);

+ switch_fpu_finish();
+
/*
* Now maybe reload the debug registers and handle I/O bitmaps
*/
--
2.7.4

2016-10-01 20:51:07

by Rik van Riel

Subject: [PATCH RFC 4/5] x86,fpu: lazily skip FPU restore when still loaded

From: Rik van Riel <[email protected]>

When the FPU register set has not been touched by anybody else,
we can lazily skip the restore.

Intel has a number of clever optimizations to reduce the FPU
restore overhead, but those optimizations do not work across
the guest/host virtualization boundary, and even on bare metal
it should be faster to skip the restore entirely.

This code is still BROKEN. I am not yet sure why.

Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/fpu/internal.h | 13 +++++++++++++
arch/x86/kernel/process.c | 3 +++
arch/x86/kvm/x86.c | 8 +++++++-
3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index b5accb35e434..f69960e9aea1 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -575,6 +575,19 @@ static inline void fpregs_deactivate(struct fpu *fpu)
}

/*
+ * Check whether an FPU's register set is still loaded in the CPU.
+ */
+static inline bool fpu_lazy_skip_restore(struct fpu *fpu)
+{
+ bool still_loaded = (fpu->fpstate_active &&
+ fpu->last_cpu == raw_smp_processor_id() &&
+ __this_cpu_read(fpu_fpregs_owner_ctx) == fpu);
+
+ fpu->fpregs_active = still_loaded;
+ return still_loaded;
+}
+
+/*
* FPU state switching for scheduling.
*
* This is a three-stage process:
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 087413be39cf..6b72415e400f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -208,6 +208,9 @@ void switch_fpu_return(void)
(use_eager_fpu() || fpu->counter > 5);

if (preload) {
+ if (fpu_lazy_skip_restore(fpu))
+ return;
+
prefetch(&fpu->state);
fpu->counter++;
__fpregs_activate(fpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 55c82d066d3a..16ebcd12edf7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7346,7 +7346,12 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)

vcpu->guest_fpu_loaded = 1;
__kernel_fpu_begin(fpu);
- __copy_kernel_to_fpregs(&fpu->state);
+
+ if (!fpu_lazy_skip_restore(fpu)) {
+ fpu->last_cpu = raw_smp_processor_id();
+ __copy_kernel_to_fpregs(&fpu->state);
+ }
+
trace_kvm_fpu(1);
}

@@ -7358,6 +7363,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
}

vcpu->guest_fpu_loaded = 0;
+ vcpu->arch.guest_fpu.fpregs_active = 0;
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
__kernel_fpu_end();
++vcpu->stat.fpu_reload;
--
2.7.4

2016-10-01 23:27:13

by Andy Lutomirski

Subject: Re: [PATCH RFC 1/5] x86,fpu: split prev/next task fpu state handling

On Oct 1, 2016 1:49 PM, <[email protected]> wrote:
>
> From: Rik van Riel <[email protected]>
>
> Move all handling of the next state FPU state handling into
> switch_fpu_finish, in preparation for more lazily switching
> FPU states.
>
> CR0.TS state is mirrored in a per-cpu variable, instead of
> being passed around in a local variable, because that will
> not be possible later in the series.

This seems reasonable in principle, but IMO it would be less scary if
you rebased onto this:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/fpu

Because the amount of testing needed and the amount of code that gets
rearranged would be reduced. Want to fold those patches into you
series? I can also just send them in directly, although this is an
awkward time to do so.

--Andy

2016-10-01 23:44:48

by Andy Lutomirski

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Sat, Oct 1, 2016 at 1:31 PM, <[email protected]> wrote:
> From: Rik van Riel <[email protected]>
>
> Delay the loading of FPU registers until a process switches back to
> userspace. This allows us to skip FPU saving & restoring for kernel
> threads, the idle task, and tasks that are spinning in kernel space.
>
> It also allows us to not repeatedly save & restore the userspace FPU
> context on repeated invocations of kernel_fpu_begin & kernel_fpu_end.
>
> Not overwriting the FPU state of a task unless we need to also allows
> us to be lazier about restoring it, in a future patch.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> arch/x86/entry/common.c | 4 ++++
> arch/x86/include/asm/fpu/api.h | 5 +++++
> arch/x86/include/asm/fpu/internal.h | 44 +++++++++----------------------------
> arch/x86/include/asm/thread_info.h | 4 +++-
> arch/x86/kernel/fpu/core.c | 17 ++++++++------
> arch/x86/kernel/process.c | 35 +++++++++++++++++++++++++++++
> arch/x86/kernel/process_32.c | 5 ++---
> arch/x86/kernel/process_64.c | 5 ++---
> 8 files changed, 71 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..a69bbefa3408 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -27,6 +27,7 @@
> #include <asm/vdso.h>
> #include <asm/uaccess.h>
> #include <asm/cpufeature.h>
> +#include <asm/fpu/api.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/syscalls.h>
> @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
> if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> exit_to_usermode_loop(regs, cached_flags);
>
> + if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> + switch_fpu_return();
> +

How about:

if (unlikely(...)) {
exit_to_usermode_loop(regs, cached_flags);
cached_flags = READ_ONCE(ti->flags);
}

if (ti->flags & _TIF_LOAD_FPU) {
clear_thread_flag(TIF_LOAD_FPU);
switch_fpu_return();
}

or something along those lines. The issue is that
test_and_clear_thread_flag is unconditionally atomic, which means that
it will slow down the common fast case by quite a bit.

Alternatively, you could try to jam it into thread_struct::status,
which would let you avoid an atomic operation when you clear it, but
you'd still need an atomic op to set it, so this might not be a win.
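
For reference, the cost difference boils down to roughly this on x86
(a sketch of what the two primitives reduce to, not exact assembly):

	/*
	 * test_and_clear_thread_flag(TIF_LOAD_FPU): a locked
	 * read-modify-write, paid on every exit to userspace even when
	 * the flag is clear:
	 *
	 *	lock btrl $TIF_LOAD_FPU, ti->flags
	 *
	 * test_thread_flag(TIF_LOAD_FPU): an ordinary load and test:
	 *
	 *	testl $_TIF_LOAD_FPU, ti->flags
	 */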


> +static inline void switch_fpu_finish(void)
> {
> - bool preload;
> - /*
> - * If the task has used the math, pre-load the FPU on xsave processors
> - * or if the past 5 consecutive context-switches used math.
> - */
> - preload = static_cpu_has(X86_FEATURE_FPU) &&
> - new_fpu->fpstate_active &&
> - (use_eager_fpu() || new_fpu->counter > 5);
> -
> - if (preload) {
> - prefetch(&new_fpu->state);
> - new_fpu->counter++;
> - __fpregs_activate(new_fpu);
> - trace_x86_fpu_regs_activated(new_fpu);
> -
> - /* Don't change CR0.TS if we just switch! */
> - if (!__this_cpu_read(fpu_active)) {
> - __fpregs_activate_hw();
> - __this_cpu_write(fpu_active, true);
> - }
> -
> - copy_kernel_to_fpregs(&new_fpu->state);
> - } else if (__this_cpu_read(fpu_active)) {
> - __this_cpu_write(fpu_active, false);
> - __fpregs_deactivate_hw();
> - }
> + set_thread_flag(TIF_LOAD_FPU);
> }

I can imagine this causing problems with kernel code that accesses
current's FPU state, e.g. get_xsave_field_ptr(). I wonder if it would
make sense to make your changes deeper into the FPU core innards so
that, for example, we'd have explicit functions that cause the
in-memory state for current to be up-to-date and readable, to cause
the in-memory state to be up-to-date and writable (which is the same
thing + TIF_LOAD_FPU + whatever other bookkeeping), and causing the
in-CPU state to be up-to-date (possibly readable and writable).
TIF_LOAD_FPU would trigger the latter.

I've often found it confusing that fpu__save by itself has somewhat
ill-defined effects.


> +/*
> + * Set up the userspace FPU context before returning to userspace.
> + */
> +void switch_fpu_return(void)
> +{
> + struct fpu *fpu = &current->thread.fpu;
> + bool preload;
> + /*
> + * If the task has used the math, pre-load the FPU on xsave processors
> + * or if the past 5 consecutive context-switches used math.
> + */
> + preload = static_cpu_has(X86_FEATURE_FPU) &&
> + fpu->fpstate_active &&
> + (use_eager_fpu() || fpu->counter > 5);
> +
> + if (preload) {
> + prefetch(&fpu->state);
> + fpu->counter++;
> + __fpregs_activate(fpu);
> + trace_x86_fpu_regs_activated(fpu);
> +
> + /* Don't change CR0.TS if we just switch! */
> + if (!__this_cpu_read(fpu_active)) {
> + __fpregs_activate_hw();
> + __this_cpu_write(fpu_active, true);
> + }

We should just finish getting rid of all TS uses.

2016-10-02 00:08:30

by Rik van Riel

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 1:31 PM,  <[email protected]> wrote:
> > 
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/syscalls.h>
> > @@ -197,6 +198,9 @@ __visible inline void
> > prepare_exit_to_usermode(struct pt_regs *regs)
> >         if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> >                 exit_to_usermode_loop(regs, cached_flags);
> >
> > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > +               switch_fpu_return();
> > +
>
> How about:
>
> if (unlikely(...)) {
>   exit_to_usermode_loop(regs, cached_flags);
>   cached_flags = READ_ONCE(ti->flags);
> }
>
> if (ti->flags & _TIF_LOAD_FPU) {
>   clear_thread_flag(TIF_LOAD_FPU);
>   switch_fpu_return();
> }

It looks like test_thread_flag should be fine for avoiding
atomics, too?

  if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
      clear_thread_flag(TIF_LOAD_FPU);
      switch_fpu_return();
  }


> > +static inline void switch_fpu_finish(void)
> >  {
> > +       set_thread_flag(TIF_LOAD_FPU);
> >  }
>
> I can imagine this causing problems with kernel code that accesses
> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it would
> make sense to make your changes deeper into the FPU core innards so
> that, for example, we'd have explicit functions that cause the

Whereabouts do you have in mind?

I like your general idea, but am not sure quite where
to put it.

> in-memory state for current to be up-to-date and readable, to cause
> the in-memory state to be up-to-date and writable (which is the same
> thing + TIF_LOAD_FPU + whatever other bookkeeping), and causing the
> in-CPU state to be up-to-date (possibly readable and writable).
> TIF_LOAD_FPU would trigger the latter.
>
> I've often found it confusing that fpu__save by itself has somewhat
> ill-defined effects.

Fully agreed on both. I guess that means we want a few
different functions:

1) initialize FPU state, both in memory and in registers

2) cause FPU state in registers to be updated from memory,
   if necessary

3) cause FPU state in memory to be updated from registers,
   if necessary

The latter functions could use fpu->fpregs_active to see
whether any action is required, and be called from various
places in the code.
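
As prototypes, that could look like the following (a sketch; all
three names are invented here, not existing kernel API):

	/* 1) initialize FPU state, both in memory and in registers */
	void fpu__init_state(struct fpu *fpu);

	/* 2) load the registers from memory, only if the in-CPU copy
	 *    is stale */
	void fpu__sync_to_registers(struct fpu *fpu);

	/* 3) save the registers to memory, only if fpu->fpregs_active
	 *    says the in-CPU copy is newer */
	void fpu__sync_to_memory(struct fpu *fpu);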

The signal code in particular is doing some strange things
that should probably be hidden away in FPU specific functions.

>
> >
> > +/*
> > + * Set up the userspace FPU context before returning to userspace.
> > + */
> > +void switch_fpu_return(void)
> > +{
> > +       struct fpu *fpu = &current->thread.fpu;
> > +       bool preload;
> > +       /*
> > +        * If the task has used the math, pre-load the FPU on xsave
> > processors
> > +        * or if the past 5 consecutive context-switches used math.
> > +        */
> > +       preload = static_cpu_has(X86_FEATURE_FPU) &&
> > +                 fpu->fpstate_active &&
> > +                 (use_eager_fpu() || fpu->counter > 5);
> > +
> > +       if (preload) {
> > +               prefetch(&fpu->state);
> > +               fpu->counter++;
> > +               __fpregs_activate(fpu);
> > +               trace_x86_fpu_regs_activated(fpu);
> > +
> > +               /* Don't change CR0.TS if we just switch! */
> > +               if (!__this_cpu_read(fpu_active)) {
> > +                       __fpregs_activate_hw();
> > +                       __this_cpu_write(fpu_active, true);
> > +               }
>
> We should just finish getting rid of all TS uses.

Agreed. I will rebase on top of your FPU changes, that
will make things easier for everybody.

--
All Rights Reversed.



2016-10-02 00:42:46

by Rik van Riel

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 1:31 PM,  <[email protected]> wrote:
> > 
> >
> > +static inline void switch_fpu_finish(void)
> >  {
> > +       set_thread_flag(TIF_LOAD_FPU);
> >  }
>
> I can imagine this causing problems with kernel code that accesses
> current's FPU state, e.g. get_xsave_field_ptr(). 

That makes me wonder, what test programs do people have
to verify the correctness of the FPU switching code?

I have a few floating point benchmarks that check
the results for correctness, but nothing that adds in
signals, for example.

What do people use?

--
All Rights Reversed.



2016-10-03 16:23:29

by Dave Hansen

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On 10/01/2016 05:42 PM, Rik van Riel wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > On Sat, Oct 1, 2016 at 1:31 PM, <[email protected]> wrote:
>>> > > +static inline void switch_fpu_finish(void)
>>> > > {
>>> > > + set_thread_flag(TIF_LOAD_FPU);
>>> > > }
>> >
>> > I can imagine this causing problems with kernel code that accesses
>> > current's FPU state, e.g. get_xsave_field_ptr().
> That makes me wonder, what test programs do people have
> to verify the correctness of the FPU switching code?

The MPX testing code in selftests/ goes off the rails pretty quickly
when the FP/XSAVE state gets corrupt. It has found quite a few bugs in
the last few years. The protection keys code in there also keeps a
shadow copy of the "PKRU" register in software which also makes it
notice pretty quickly if something goes awry.

That said, I'd _love_ to see more formal FPU testing done. We've had
lots of bugs in there, and they tend to hit the newer features and newer
CPUs. Let me know if you find something. :)

2016-10-03 20:05:06

by Dave Hansen

Subject: Re: [PATCH RFC 4/5] x86,fpu: lazily skip FPU restore when still loaded

On 10/01/2016 01:31 PM, [email protected] wrote:
> /*
> + * Check whether an FPU's register set is still loaded in the CPU.
> + */
> +static inline bool fpu_lazy_skip_restore(struct fpu *fpu)
> +{
> + bool still_loaded = (fpu->fpstate_active &&
> + fpu->last_cpu == raw_smp_processor_id() &&
> + __this_cpu_read(fpu_fpregs_owner_ctx) == fpu);
> +
> + fpu->fpregs_active = still_loaded;
> + return still_loaded;
> +}

I wonder if we should call this something more along the lines of
fpregs_activate_fast(), which returns if it managed to do the activation
fast or not. I _think_ that's more along the lines of what it is
actually doing. The fact that it can be lazy is really an
implementation detail.

What are the preempt rules with this thing? This needs to be called in
preempt-disabled contexts, right?
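
Concretely, the suggestion might look like this (a sketch; the rename
and the explicit preemption check are from this mail, and the body
reuses fpu_lazy_skip_restore() from patch 4/5):

	static inline bool fpregs_activate_fast(struct fpu *fpu)
	{
		/*
		 * Caller must have preemption disabled: being migrated
		 * here would invalidate the last_cpu check inside.
		 */
		WARN_ON_ONCE(preemptible());
		return fpu_lazy_skip_restore(fpu);
	}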

2016-10-03 20:23:02

by Rik van Riel

Subject: Re: [PATCH RFC 4/5] x86,fpu: lazily skip FPU restore when still loaded

On Mon, 2016-10-03 at 13:04 -0700, Dave Hansen wrote:
> On 10/01/2016 01:31 PM, [email protected] wrote:
> >
> >  /*
> > + * Check whether an FPU's register set is still loaded in the CPU.
> > + */
> > +static inline bool fpu_lazy_skip_restore(struct fpu *fpu)
> > +{
> > +	bool still_loaded = (fpu->fpstate_active &&
> > +			     fpu->last_cpu == raw_smp_processor_id() &&
> > +			     __this_cpu_read(fpu_fpregs_owner_ctx) == fpu);
> > +
> > + fpu->fpregs_active = still_loaded;
> > + return still_loaded;
> > +}
> I wonder if we should call this something more along the lines of
> fpregs_activate_fast(), which returns if it managed to do the activation
> fast or not.  I _think_ that's more along the lines of what it is
> actually doing.  The fact that it can be lazy is really an
> implementation detail.
>
> What are the preempt rules with this thing?  This needs to be called
> in preempt-disabled contexts, right?

Indeed, all the FPU context switching code needs 
to be called in preempt-disabled contexts.

You do not want to get preempted halfway through
saving or restoring floating point registers.

--
All rights reversed



2016-10-03 20:50:00

by Dave Hansen

Subject: Re: [PATCH RFC 4/5] x86,fpu: lazily skip FPU restore when still loaded

On 10/03/2016 01:22 PM, Rik van Riel wrote:
>> > What are the preempt rules with this thing? This needs to be called
>> > in preempt-disabled contexts, right?
> Indeed, all the FPU context switching code needs
> to be called in preempt-disabled contexts.
>
> You do not want to get preempted halfway through
> saving or restoring floating point registers.

OK, cool, that's what I expected. Could you just add a comment about it
to make it clear that it's also the case for this new
fpu_lazy_skip_restore() helper?

2016-10-03 20:54:29

by Andy Lutomirski

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <[email protected]> wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 1:31 PM, <[email protected]> wrote:
>> >
>> > #define CREATE_TRACE_POINTS
>> > #include <trace/events/syscalls.h>
>> > @@ -197,6 +198,9 @@ __visible inline void
>> > prepare_exit_to_usermode(struct pt_regs *regs)
>> > if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> > exit_to_usermode_loop(regs, cached_flags);
>> >
>> > + if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > + switch_fpu_return();
>> > +
>>
>> How about:
>>
>> if (unlikely(...)) {
>> exit_to_usermode_loop(regs, cached_flags);
>> cached_flags = READ_ONCE(ti->flags);
>> }
>>
>> if (ti->flags & _TIF_LOAD_FPU) {
>> clear_thread_flag(TIF_LOAD_FPU);
>> switch_fpu_return();
>> }
>
> It looks like test_thread_flag should be fine for avoiding
> atomics, too?
>
> if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
> clear_thread_flag(TIF_LOAD_FPU);
> switch_fpu_return();
> }

True, but we've got that cached_flags variable already...

>
>
>> > +static inline void switch_fpu_finish(void)
>> > {
>> > + set_thread_flag(TIF_LOAD_FPU);
>> > }
>>
>> I can imagine this causing problems with kernel code that accesses
>> current's FPU state, e.g. get_xsave_field_ptr(). I wonder if it
>> would
>> make sense to make your changes deeper into the FPU core innards so
>> that, for example, we'd have explicit functions that cause the
>
> Whereabouts do you have in mind?
>
> I like your general idea, but am not sure quite where
> to put it.

Somewhere in arch/x86/kernel/fpu/ perhaps?

One of my objections to the current code structure is that I find it
quite hard to keep track of all of the possible states of the fpu that
we have. Off the top of my head, we have:

1. In-cpu state is valid and in-memory state is invalid. (This is the
case when we're running user code, for example.)
2. In-memory state is valid.
2a. In-cpu state is *also* valid for some particular task. In this
state, no one may write to either FPU state copy for that task lest
they get out of sync.
2b. In-cpu state is not valid.
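
(The same states written out as a sketch enum, for reference only; no
such type exists in the kernel:)

	enum fpu_state {
		FPU_CPU_ONLY,		/* 1:  registers valid, in-memory copy stale */
		FPU_BOTH_VALID,		/* 2a: both valid, both read-only */
		FPU_MEMORY_ONLY,	/* 2b: in-memory copy valid, registers stale */
	};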

Rik, your patches further complicate this by making it possible to
have the in-cpu state be valid for a non-current task even when
non-idle. This is why I'm thinking that we should maybe revisit the
API a bit to make this clearer and to avoid scattering around all the
state manipulation quite so much. I propose, as a straw-man:

user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
After calling this, we're in state 1 or 2a.

user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
for current. Invalidates any in-memory copy. After calling this,
we're in state 1.

user_fpregs_copy_to_memory() -- Makes the in-memory state be valid for
current. After calling this, we're in state 2a or 2b.

user_fpregs_move_to_memory() -- Makes the in-memory state be valid and
writable for current. After calling this, we're in state 2b.

The xxx_to_cpu() functions will make sure that, if the cpu state was
valid for a different task, it stops being valid for that different
task.

Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
and PKRU memory state readers. MPX and PKRU will do
user_fpregs_move_to_memory() before writing to memory.
kernel_fpu_begin() does user_fpregs_move_to_memory().

There are a couple ways we could deal with TIF_LOAD_FPU in this
scheme. My instinct would be to rename it TIF_REFRESH_FPU, set it
unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
and make the handler do user_fpregs_move_to_cpu(). This avoids
needing to deal with complications in which we're in state 2a and we
land in actual user code and then forget that the in-memory copy is
out of date. It also means we get TIF_REFRESH_FPU for free for
newly-forked threads.
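
Pulled together as prototypes, the straw-man looks like this (a
sketch; all four names are proposals from this mail, not existing
kernel functions, and the state numbers refer to the list above):

	/* -> state 1 or 2a: CPU registers valid for current, read-only */
	void user_fpregs_copy_to_cpu(void);

	/* -> state 1: CPU registers valid and writable; in-memory copy
	 * invalidated */
	void user_fpregs_move_to_cpu(void);

	/* -> state 2a or 2b: in-memory copy valid for current, read-only */
	void user_fpregs_copy_to_memory(void);

	/* -> state 2b: in-memory copy valid and writable; in-CPU copy
	 * invalidated */
	void user_fpregs_move_to_memory(void);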

>> in-memory state for current to be up-to-date and readable, to cause
>> the in-memory state to be up-to-date and writable (which is the same
>> thing + TIF_LOAD_FPU + whatever other bookkeeping), and causing the
>> in-CPU state to be up-to-date (possibly readable and writable).
>> TIF_LOAD_FPU would trigger the latter.
>>
>> I've often found it confusing that fpu__save by itself has somewhat
>> ill-defined effects.
>
> Fully agreed on both. I guess that means we want a few
> different functions:
>
> 1) initialize FPU state, both in memory and in registers
>
> 2) cause FPU state in registers to be updated from memory,
> if necessary
>
> 3) cause FPU state in memory to be updated from registers,
> if necessary
>
> The latter functions could use fpu->fpregs_active to see
> whether any action is required, and be called from various
> places in the code.
>
> The signal code in particular is doing some strange things
> that should probably be hidden away in FPU specific functions.
>
>>
>> >
>> > +/*
>> > + * Set up the userspace FPU context before returning to userspace.
>> > + */
>> > +void switch_fpu_return(void)
>> > +{
>> > + struct fpu *fpu = &current->thread.fpu;
>> > + bool preload;
>> > + /*
>> > + * If the task has used the math, pre-load the FPU on xsave
>> > processors
>> > + * or if the past 5 consecutive context-switches used math.
>> > + */
>> > + preload = static_cpu_has(X86_FEATURE_FPU) &&
>> > + fpu->fpstate_active &&
>> > + (use_eager_fpu() || fpu->counter > 5);
>> > +
>> > + if (preload) {
>> > + prefetch(&fpu->state);
>> > + fpu->counter++;
>> > + __fpregs_activate(fpu);
>> > + trace_x86_fpu_regs_activated(fpu);
>> > +
>> > + /* Don't change CR0.TS if we just switch! */
>> > + if (!__this_cpu_read(fpu_active)) {
>> > + __fpregs_activate_hw();
>> > + __this_cpu_write(fpu_active, true);
>> > + }
>>
>> We should just finish getting rid of all TS uses.
>
> Agreed. I will rebase on top of your FPU changes, that
> will make things easier for everybody.
>
> --
> All Rights Reversed.



--
Andy Lutomirski
AMA Capital Management, LLC

2016-10-03 21:03:10

by Rik van Riel

Subject: Re: [PATCH RFC 4/5] x86,fpu: lazily skip FPU restore when still loaded

On Mon, 2016-10-03 at 13:49 -0700, Dave Hansen wrote:
> On 10/03/2016 01:22 PM, Rik van Riel wrote:
> >
> > >
> > > >
> > > > What are the preempt rules with this thing?  This needs to be called
> > > > in preempt-disabled contexts, right?
> > Indeed, all the FPU context switching code needs 
> > to be called in preempt-disabled contexts.
> >
> > You do not want to get preempted halfway through
> > saving or restoring floating point registers.
> OK, cool, that's what I expected.  Could you just add a comment about it
> to make it clear that it's also the case for this new
> fpu_lazy_skip_restore() helper?

Turns out the code already has an old
fpu_want_lazy_restore(), which is what
I will use instead :)

I will add documentation about preemption
in places where it is necessary.

--
All rights reversed



2016-10-03 21:22:40

by Rik van Riel

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <[email protected]> wrote:
> >
> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> > >
> > > On Sat, Oct 1, 2016 at 1:31 PM,  <[email protected]> wrote:
> > > >
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include <trace/events/syscalls.h>
> > > > @@ -197,6 +198,9 @@ __visible inline void
> > > > prepare_exit_to_usermode(struct pt_regs *regs)
> > > >         if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
> > > >                 exit_to_usermode_loop(regs, cached_flags);
> > > >
> > > > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > > > +               switch_fpu_return();
> > > > +
> > > How about:
> > >
> > > if (unlikely(...)) {
> > >   exit_to_usermode_loop(regs, cached_flags);
> > >   cached_flags = READ_ONCE(ti->flags);
> > > }
> > >
> > > if (ti->flags & _TIF_LOAD_FPU) {
> > >   clear_thread_flag(TIF_LOAD_FPU);
> > >   switch_fpu_return();
> > > }
> > It looks like test_thread_flag should be fine for avoiding
> > atomics, too?
> >
> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
> >       clear_thread_flag(TIF_LOAD_FPU);
> >       switch_fpu_return();
> >   }
> True, but we've got that cached_flags variable already...

The cached flag may no longer be valid after
exit_to_usermode_loop() returns, because that
code is preemptible.

We have to reload a fresh ti->flags after 
preemption is disabled and irqs are off.

> One of my objections to the current code structure is that I find it
> quite hard to keep track of all of the possible states of the fpu that
> we have.  Off the top of my head, we have:
>
> 1. In-cpu state is valid and in-memory state is invalid.  (This is the
> case when we're running user code, for example.)
> 2. In-memory state is valid.
> 2a. In-cpu state is *also* valid for some particular task.  In this
> state, no one may write to either FPU state copy for that task lest
> they get out of sync.
> 2b. In-cpu state is not valid.

It gets more fun with ptrace. Look at xfpregs_get
and xfpregs_set, which call fpu__activate_fpstate_read
and fpu__activate_fpstate_write respectively.

It looks like those functions already deal with the
cases you outline above.

> Rik, your patches further complicate this by making it possible to
> have the in-cpu state be valid for a non-current task even when
> non-idle. 

However, we never need to examine the in-cpu state for
a task that is not currently running, or being scheduled
to in the context switch code.

We always save the FPU register state when a task is
context switched out, so for a non-current task, we
still only need to consider whether the in-memory state
is valid or not.

The in-cpu state does not matter.

fpu__activate_fpstate_write will disable lazy restore,
and ensure a full restore from memory.

>  This is why I'm thinking that we should maybe revisit the
> API a bit to make this clearer and to avoid scattering around all the
> state manipulation quite so much.  I propose, as a straw-man:
>
> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
> After calling this, we're in state 1 or 2a.
>
> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
> for current.  Invalidates any in-memory copy.  After calling this,
> we're in state 1.

How can we distinguish between these two?

If the task is running, it can change its FPU
registers at any time, without any obvious thing
to mark the transition from state 2a to state 1.

It sounds like the first function you name would
only be useful if we prevented a switch to user
space, which could immediately do whatever it wants
with the registers.

Is there a use case for user_fpregs_copy_to_cpu()?

> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid for
> current.  After calling this, we're in state 2a or 2b.
>
> user_fpregs_move_to_memory() -- Makes the in-memory state be valid and
> writable for current.  After calling this, we're in state 2b.

We can only ensure that the in-memory copy stays
valid by stopping the task from writing to the
in-CPU copy.

This sounds a lot like fpu__activate_fpstate_read

What we need after copying the contents to memory is
a way to invalidate the in-CPU copy, if changes were
made to the in-memory copy.

This is already done by fpu__activate_fpstate_write

> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
> and PKRU memory state readers.  MPX and PKRU will do
> user_fpregs_move_to_memory() before writing to memory.
> kernel_fpu_begin() does user_fpregs_move_to_memory().

This is done by switch_fpu_prepare.

Can you think of any other place in the kernel where we
need a 1 -> 2a transition, besides context switching and
ptrace/xfpregs_get?

> There are a couple ways we could deal with TIF_LOAD_FPU in this
> scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
> and make the handler do user_fpregs_move_to_cpu().

I can see where you are coming from, but given that
there is no way to run a task besides context switching
to it, I am not sure why we would need to do anything
besides disabling the lazy restore (indicating that the
in-CPU state is stale) anywhere else?

If user_fpregs_move_to_cpu determines that the in-register
state is still valid, it can skip the restore.

This is what switch_fpu_finish does after my first patch,
and switch_fpu_prepare later in my series.

--
All rights reversed



2016-10-03 21:36:30

by Andy Lutomirski

Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <[email protected]> wrote:
>> >
>> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > >
>> > > On Sat, Oct 1, 2016 at 1:31 PM, <[email protected]> wrote:
>> > > >
>> > > >
>> > > > #define CREATE_TRACE_POINTS
>> > > > #include <trace/events/syscalls.h>
>> > > > @@ -197,6 +198,9 @@ __visible inline void
>> > > > prepare_exit_to_usermode(struct pt_regs *regs)
>> > > > if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> > > > exit_to_usermode_loop(regs, cached_flags);
>> > > >
>> > > > + if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > > > + switch_fpu_return();
>> > > > +
>> > > How about:
>> > >
>> > > if (unlikely(...)) {
>> > > exit_to_usermode_loop(regs, cached_flags);
>> > > cached_flags = READ_ONCE(ti->flags);

Like this?

>> > > }
>> > >
>> > > if (ti->flags & _TIF_LOAD_FPU) {
>> > > clear_thread_flag(TIF_LOAD_FPU);
>> > > switch_fpu_return();
>> > > }
>> > It looks like test_thread_flag should be fine for avoiding
>> > atomics, too?
>> >
>> > if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>> > clear_thread_flag(TIF_LOAD_FPU);
>> > switch_fpu_return();
>> > }
>> True, but we've got that cached_flags variable already...
>
> The cached flag may no longer be valid after
> exit_to_usermode_loop() returns, because that
> code is preemptible.
>
> We have to reload a fresh ti->flags after
> preemption is disabled and irqs are off.

See above...

>
>> One of my objections to the current code structure is that I find it
>> quite hard to keep track of all of the possible states of the fpu that
>> we have. Off the top of my head, we have:
>>
>> 1. In-cpu state is valid and in-memory state is invalid. (This is the
>> case when we're running user code, for example.)
>> 2. In-memory state is valid.
>> 2a. In-cpu state is *also* valid for some particular task. In this
>> state, no one may write to either FPU state copy for that task lest
>> they get out of sync.
>> 2b. In-cpu state is not valid.
>
> It gets more fun with ptrace. Look at xfpregs_get
> and xfpregs_set, which call fpu__activate_fpstate_read
> and fpu__activate_fpstate_write respectively.
>
> It looks like those functions already deal with the
> cases you outline above.

Hmm, maybe they could be used. The names are IMO atrocious. I had to
read the docs and the code to figure out WTF "activating" the
"fpstate" even meant.

>
>> Rik, your patches further complicate this by making it possible to
>> have the in-cpu state be valid for a non-current task even when
>> non-idle.
>
> However, we never need to examine the in-cpu state for
> a task that is not currently running, or being scheduled
> to in the context switch code.
>
> We always save the FPU register state when a task is
> context switched out, so for a non-current task, we
> still only need to consider whether the in-memory state
> is valid or not.
>
> The in-cpu state does not matter.
>
> fpu__activate_fpstate_write will disable lazy restore,
> and ensure a full restore from memory.

Something has to fix up owner_ctx so that the previous thread doesn't
think its CPU state is still valid. I don't think
activate_fpstate_write does it, because nothing should (I think) be
calling it as a matter of course on context switch.
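
(In this series the fixup itself is small: patch 5/5 does it by
setting fpu->last_cpu = -1, which defeats the fpu_lazy_skip_restore()
check. As a sketch, with an invented name:)

	static inline void fpregs_invalidate_cpu_copy(struct fpu *fpu)
	{
		/* make any later fpu_lazy_skip_restore(fpu) check fail */
		fpu->last_cpu = -1;
	}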

>
>> This is why I'm thinking that we should maybe revisit the
>> API a bit to make this clearer and to avoid scattering around all the
>> state manipulation quite so much. I propose, as a straw-man:
>>
>> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
>> After calling this, we're in state 1 or 2a.
>>
>> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
>> for current. Invalidates any in-memory copy. After calling this,
>> we're in state 1.
>
> How can we distinguish between these two?
>
> If the task is running, it can change its FPU
> registers at any time, without any obvious thing
> to mark the transition from state 2a to state 1.

That's true now, but I don't think it's really true any more with your
patches. If a task is running in *kernel* mode, then I think that
pretty much all of the combinations are possible. When the task
actually enters user mode, we have to make sure we're in state 1 so
the CPU regs belong to the task and are writable.

>
> It sounds like the first function you name would
> only be useful if we prevented a switch to user
> space, which could immediately do whatever it wants
> with the registers.
>
> Is there a use case for user_fpregs_copy_to_cpu()?

With PKRU, any user memory access would need
user_fpregs_copy_to_cpu(). I think we should get rid of this and just
switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
context switch to avoid this. Other than that, I can't think of any
use cases off the top of my head unless the signal code wanted to play
games with this stuff.

>
>> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
>> for
>> current. After calling this, we're in state 2a or 2b.
>>
>> user_fpregs_move_to_memory() -- Makes the in-memory state be valid
>> and
>> writable for current. After calling this, we're in state 2b.
>
> We can only ensure that the in-memory copy stays
> valid by stopping the task from writing to the
> in-CPU copy.
>
> This sounds a lot like fpu__activate_fpstate_read
>
> What we need after copying the contents to memory is
> a way to invalidate the in-CPU copy, if changes were
> made to the in-memory copy.

My proposal is that we invalidate the CPU state *before* writing the
in-memory state, not after.

>
> This is already done by fpu__activate_fpstate_write
>
>> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
>> and PKRU memory state readers. MPX and PKRU will do
>> user_fpregs_move_to_memory() before writing to memory.
>> kernel_fpu_begin() does user_fpregs_move_to_memory().
>
> This is done by switch_fpu_prepare.
>
> Can you think of any other place in the kernel where we
> need a 1 -> 2a transition, besides context switching and
> ptrace/xfpregs_get?

Anything else that tries to read task xstate from memory, i.e. MPX and
PKRU. (Although if we switch to eager-switched PKRU, then PKRU stops
mattering for this purpose.)

Actually, I don't see any way your patches can be compatible with PKRU
without switching to eager-switched PKRU.
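
(FWIW, the MPX read side today funnels through get_xsave_field_ptr(),
which does an eager save before touching the buffer -- roughly:

const void *get_xsave_field_ptr(int xsave_state)
{
        struct fpu *fpu = &current->thread.fpu;

        if (!fpu->fpstate_active)
                return NULL;

        fpu__save(fpu);         /* 1 -> 2a: sync regs into memory */

        return get_xsave_addr(&fpu->state.xsave, xsave_state);
}

so at least there is a single choke point to fix up.)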

>
>> There are a couple ways we could deal with TIF_LOAD_FPU in this
>> scheme. My instinct would be to rename it TIF_REFRESH_FPU, set it
>> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
>> and make the handler do user_fpregs_move_to_cpu().
>
> I can see where you are coming from, but given that
> there is no way to run a task besides context switching
> to it, I am not sure why we would need to do anything
> besides disabling the lazy restore (indicating that the
> in-CPU state is stale) anywhere else?
>
> If user_fpregs_move_to_cpu determines that the in-register
> state is still valid, it can skip the restore.
>
> This is what switch_fpu_finish does after my first patch,
> and switch_fpu_prepare later in my series.

I think my main objection is that there are too many functions like
this outside the fpu/ directory that all interact in complicated ways.
I think that activate_fpstate_read/write are along the right lines,
but we should start using them and their friends everywhere.

2016-10-04 01:29:44

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel <[email protected]> wrote:
> > 
> > >
> > > One of my objections to the current code structure is that I find
> > > it
> > > quite hard to keep track of all of the possible states of the fpu
> > > that
> > > we have.  Off the top of my head, we have:
> > >
> > > 1. In-cpu state is valid and in-memory state is invalid.  (This
> > > is
> > > the
> > > case when we're running user code, for example.)
> > > 2. In-memory state is valid.
> > > 2a. In-cpu state is *also* valid for some particular task.  In
> > > this
> > > state, no one may write to either FPU state copy for that task
> > > lest
> > > they get out of sync.
> > > 2b. In-cpu state is not valid.
> >
> > It gets more fun with ptrace. Look at xfpregs_get
> > and xfpregs_set, which call fpu__activate_fpstate_read
> > and fpu__activate_fpstate_write respectively.
> >
> > It looks like those functions already deal with the
> > cases you outline above.
>
> Hmm, maybe they could be used.  The names are IMO atrocious.  I had
> to
> read the docs and the code to figure out WTF "activating" the
> "fpstate" even meant.

If you have good naming ideas, I'd be willing to submit
patches to rename them :)

> > >
> > > Rik, your patches further complicate this by making it possible
> > > to
> > > have the in-cpu state be valid for a non-current task even when
> > > non-idle.
> >
> > However, we never need to examine the in-cpu state for
> > a task that is not currently running, or being scheduled
> > to in the context switch code.
> >
> > We always save the FPU register state when a task is
> > context switched out, so for a non-current task, we
> > still only need to consider whether the in-memory state
> > is valid or not.
> >
> > The in-cpu state does not matter.
> >
> > fpu__activate_fpstate_write will disable lazy restore,
> > and ensure a full restore from memory.
>
> Something has to fix up owner_ctx so that the previous thread doesn't
> think its CPU state is still valid.  I don't think
> activate_fpstate_write does it, because nothing should (I think) be
> calling it as a matter of course on context switch.

These functions are called on tasks that are ptraced,
and currently stopped. fpu__activate_fpstate_write
sets fpu->last_cpu to -1, which will cause the next
context switch to the task to load the fpstate into
the registers.

The status in the struct fpu keeps track of the status
of things, though maybe not as explicitly as you would
want.

Having two separate status booleans for "registers valid"
and "memory valid" may make more sense.

Running the task in user space would have to ensure
"registers valid" is true, and make "memory valid"
false, because userspace could write to the registers.

Doing a ptrace fpstate_read would make "memory valid"
true, but does not need to invalidate "registers valid".

Doing a ptrace fpstate_write would make "memory valid"
true, but would invalidate "registers valid".

Context switching away from a task would make "memory
valid" true, by virtue of copying the fpstate to
memory.
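
In pseudo-code, something like this (sketch only, names TBD):

static inline void fpu__regs_activated(struct fpu *fpu)
{
        fpu->regs_valid = 1;
        fpu->mem_valid = 0;     /* user space may write the registers */
}

static inline void fpu__state_saved(struct fpu *fpu)
{
        fpu->mem_valid = 1;     /* registers stay valid */
}

static inline void fpu__state_written(struct fpu *fpu)
{
        fpu->mem_valid = 1;
        fpu->regs_valid = 0;    /* in-CPU copy is now stale */
}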

Ingo, would you have any objection to patches tracking
the validity of both register and memory states
independently?

We can get rid of fpu.counter, since nobody uses it
any more.

> > >  This is why I'm thinking that we should maybe revisit the
> > > API a bit to make this clearer and to avoid scattering around all
> > > the
> > > state manipulation quite so much.  I propose, as a straw-man:
> > >
> > > user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
> > > current.
> > > After calling this, we're in state 1 or 2a.
> > >
> > > user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
> > > writable
> > > for current.  Invalidates any in-memory copy.  After calling
> > > this,
> > > we're in state 1.
> >
> > How can we distinguish between these two?
> >
> > If the task is running, it can change its FPU
> > registers at any time, without any obvious thing
> > to mark the transition from state 2a to state 1.
>
> That's true now, but I don't think it's really true any more with
> your
> patches.  If a task is running in *kernel* mode, then I think that
> pretty much all of the combinations are possible.  When the task
> actually enters user mode, we have to make sure we're in state 1 so
> the CPU regs belong to the task and are writable.

That is fair.

> > It sounds like the first function you name would
> > only be useful if we prevented a switch to user
> > space, which could immediately do whatever it wants
> > with the registers.
> >
> > Is there a use case for user_fpregs_copy_to_cpu()?
>
> With PKRU, any user memory access would need
> user_fpregs_copy_to_cpu().  I think we should get rid of this and
> just
> switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
> context switch to avoid this.  Other than that, I can't think of any
> use cases off the top of my head unless the signal code wanted to
> play
> games with this stuff.

You are right, read_pkru() and write_pkru() can only deal with
the pkru state being present in registers. Is this because of an
assumption in the code, or because of a hardware requirement?

> > > user_fpregs_copy_to_memory() -- Makes the in-memory state be
> > > valid
> > > for
> > > current.  After calling this, we're in state 2a or 2b.
> > >
> > > user_fpregs_move_to_memory() -- Makes the in-memory state be
> > > valid
> > > and
> > > writable for current.  After calling this, we're in state 2b.
> >
> > We can only ensure that the in-memory copy stays
> > valid by stopping the task from writing to the
> > in-CPU copy.
> >
> > This sounds a lot like fpu__activate_fpstate_read
> >
> > What we need after copying the contents to memory is
> > a way to invalidate the in-CPU copy, if changes were
> > made to the in-memory copy.
>
> My proposal is that we invalidate the CPU state *before* writing the
> in-memory state, not after.

Modifying the in-memory FPU state can only be done on a task
that is not currently running in user space. As long as the
modification is done before the task runs again, that is safe.

I am probably missing some special case you are concerned about.

> > This is already done by fpu__activate_fpstate_write
> >
> > >
> > > Scheduling out calls user_fpregs_copy_to_memory(), as do all the
> > > MPX
> > > and PKRU memory state readers.  MPX and PKRU will do
> > > user_fpregs_move_to_memory() before writing to memory.
> > > kernel_fpu_begin() does user_fpregs_move_to_memory().
> >
> > This is done by switch_fpu_prepare.
> >
> > Can you think of any other place in the kernel where we
> > need a 1 -> 2a transition, besides context switching and
> > ptrace/xfpregs_get?
>
> Anything else that tries to read task xstate from memory, i.e. MPX
> and
> PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
> mattering for this purpose.)
>
> Actually, I don't see any way your patches can be compatible with
> PKRU
> without switching to eager-switched PKRU.

My patches preserve eager saving of the FPU state; they only
make restore lazy. That is, 1 -> 2a transitions continue to
be eager.
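
That is, switch_fpu_prepare already does (roughly) this on every
switch-out:

        /* context switch out: make the memory copy valid eagerly */
        if (old_fpu->fpregs_active) {
                if (!copy_fpregs_to_fpstate(old_fpu))
                        old_fpu->last_cpu = -1;
                else
                        old_fpu->last_cpu = cpu;
        }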

--
All Rights Reversed.



2016-10-04 02:10:09

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, Oct 3, 2016 at 6:29 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
>> On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel <[email protected]> wrote:
>> >
>> > >
>> > > One of my objections to the current code structure is that I find
>> > > it
>> > > quite hard to keep track of all of the possible states of the fpu
>> > > that
>> > > we have. Off the top of my head, we have:
>> > >
>> > > 1. In-cpu state is valid and in-memory state is invalid. (This
>> > > is
>> > > the
>> > > case when we're running user code, for example.)
>> > > 2. In-memory state is valid.
>> > > 2a. In-cpu state is *also* valid for some particular task. In
>> > > this
>> > > state, no one may write to either FPU state copy for that task
>> > > lest
>> > > they get out of sync.
>> > > 2b. In-cpu state is not valid.
>> >
>> > It gets more fun with ptrace. Look at xfpregs_get
>> > and xfpregs_set, which call fpu__activate_fpstate_read
>> > and fpu__activate_fpstate_write respectively.
>> >
>> > It looks like those functions already deal with the
>> > cases you outline above.
>>
>> Hmm, maybe they could be used. The names are IMO atrocious. I had
>> to
>> read the docs and the code to figure out WTF "activating" the
>> "fpstate" even meant.
>
> If you have good naming ideas, I'd be willing to submit
> patches to rename them :)
>
>> > >
>> > > Rik, your patches further complicate this by making it possible
>> > > to
>> > > have the in-cpu state be valid for a non-current task even when
>> > > non-idle.
>> >
>> > However, we never need to examine the in-cpu state for
>> > a task that is not currently running, or being scheduled
>> > to in the context switch code.
>> >
>> > We always save the FPU register state when a task is
>> > context switched out, so for a non-current task, we
>> > still only need to consider whether the in-memory state
>> > is valid or not.
>> >
>> > The in-cpu state does not matter.
>> >
>> > fpu__activate_fpstate_write will disable lazy restore,
>> > and ensure a full restore from memory.
>>
>> Something has to fix up owner_ctx so that the previous thread doesn't
>> think its CPU state is still valid. I don't think
>> activate_fpstate_write does it, because nothing should (I think) be
>> calling it as a matter of course on context switch.
>
> These functions are called on tasks that are ptraced,
> and currently stopped. fpu__activate_fpstate_write
> sets fpu->last_cpu to -1, which will cause the next
> context switch to the task to load the fpstate into
> the registers.
>
> The status in the struct fpu keeps track of the status
> of things, though maybe not as explicitly as you would
> want.
>
> Having two separate status booleans for "registers valid"
> and "memory valid" may make more sense.

I have no problem with the concept of "owner_ctx", and I think it's a
perfectly reasonable data structure. My problem with it is that it's
subtle and knowledge of it is spread all over the place. Just going
with "registers valid" in a variable won't work, I think, because
there's nowhere to put it. We need to be able to delete a struct fpu
while that struct fpu might have a valid copy in a different cpu's
registers.

Anyway, feel free to tell me that I'm making this too difficult :)

>
> Running the task in user space would have to ensure
> "registers valid" is true, and make "memory valid"
> false, because userspace could write to the registers.
>
> Doing a ptrace fpstate_read would make "memory valid"
> true, but does not need to invalidate "registers valid".
>
> Doing a ptrace fpstate_write would make "memory valid"
> true, but would invalidate "registers valid".
>
> Context switching away from a task would make "memory
> valid" true, by virtue of copying the fpstate to
> memory.
>
> Ingo, would you have any objection to patches tracking
> the validity of both register and memory states
> independently?
>
> We can get rid of fpu.counter, since nobody uses it
> any more.

We should definitely do this.

Maybe getting in some cleanups (my lazy fpu deletion,
fpu.counter removal, etc) first is the way to go.

>
>> > > This is why I'm thinking that we should maybe revisit the
>> > > API a bit to make this clearer and to avoid scattering around all
>> > > the
>> > > state manipulation quite so much. I propose, as a straw-man:
>> > >
>> > > user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> > > current.
>> > > After calling this, we're in state 1 or 2a.
>> > >
>> > > user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> > > writable
>> > > for current. Invalidates any in-memory copy. After calling
>> > > this,
>> > > we're in state 1.
>> >
>> > How can we distinguish between these two?
>> >
>> > If the task is running, it can change its FPU
>> > registers at any time, without any obvious thing
>> > to mark the transition from state 2a to state 1.
>>
>> That's true now, but I don't think it's really true any more with
>> your
>> patches. If a task is running in *kernel* mode, then I think that
>> pretty much all of the combinations are possible. When the task
>> actually enters user mode, we have to make sure we're in state 1 so
>> the CPU regs belong to the task and are writable.
>
> That is fair.
>
>> > It sounds like the first function you name would
>> > only be useful if we prevented a switch to user
>> > space, which could immediately do whatever it wants
>> > with the registers.
>> >
>> > Is there a use case for user_fpregs_copy_to_cpu()?
>>
>> With PKRU, any user memory access would need
>> user_fpregs_copy_to_cpu(). I think we should get rid of this and
>> just
>> switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
>> context switch to avoid this. Other than that, I can't think of any
>> use cases off the top of my head unless the signal code wanted to
>> play
>> games with this stuff.
>
> You are right, read_pkru() and write_pkru() can only deal with
> the pkru state being present in registers. Is this because of an
> assumption in the code, or because of a hardware requirement?
>
>> > > user_fpregs_copy_to_memory() -- Makes the in-memory state be
>> > > valid
>> > > for
>> > > current. After calling this, we're in state 2a or 2b.
>> > >
>> > > user_fpregs_move_to_memory() -- Makes the in-memory state be
>> > > valid
>> > > and
>> > > writable for current. After calling this, we're in state 2b.
>> >
>> > We can only ensure that the in-memory copy stays
>> > valid by stopping the task from writing to the
>> > in-CPU copy.
>> >
>> > This sounds a lot like fpu__activate_fpstate_read
>> >
>> > What we need after copying the contents to memory is
>> > a way to invalidate the in-CPU copy, if changes were
>> > made to the in-memory copy.
>>
>> My proposal is that we invalidate the CPU state *before* writing the
>> in-memory state, not after.
>
> Modifying the in-memory FPU state can only be done on a task
> that is not currently running in user space. As long as the
> modification is done before the task runs again, that is safe.
>
> I am probably missing some special case you are concerned about.
>
>> > This is already done by fpu__activate_fpstate_write
>> >
>> > >
>> > > Scheduling out calls user_fpregs_copy_to_memory(), as do all the
>> > > MPX
>> > > and PKRU memory state readers. MPX and PKRU will do
>> > > user_fpregs_move_to_memory() before writing to memory.
>> > > kernel_fpu_begin() does user_fpregs_move_to_memory().
>> >
>> > This is done by switch_fpu_prepare.
>> >
>> > Can you think of any other place in the kernel where we
>> > need a 1 -> 2a transition, besides context switching and
>> > ptrace/xfpregs_get?
>>
>> Anything else that tries to read task xstate from memory, i.e. MPX
>> and
>> PKRU. (Although if we switch to eager-switched PKRU, then PKRU stops
>> mattering for this purpose.)
>>
>> Actually, I don't see any way your patches can be compatible with
>> PKRU
>> without switching to eager-switched PKRU.
>
> My patches preserve eager saving of the FPU state; they only
> make restore lazy. That is, 1 -> 2a transitions continue to
> be eager.
>
> --
> All Rights Reversed.



--
Andy Lutomirski
AMA Capital Management, LLC

2016-10-04 02:11:27

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:

> Anything else that tries to read task xstate from memory, i.e. MPX
> and
> PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
> mattering for this purpose.)
>
> Actually, I don't see any way your patches can be compatible with
> PKRU
> without switching to eager-switched PKRU.

There is one case where the in-register PKRU state matters:
- user space accesses to memory

There are several cases where the in-memory PKRU state would
suffice:
- get_user_pages(_fast) to the local task (could also use registers)
- setting VMA/PTE permission bits (could also use registers)

There is one case where only in-memory PKRU state works, where
PKRU is currently simply ignored:
- get_user_pages to another task's memory

Dave, are there major obstacles to making read_pkru and write_pkru
work with in-memory state?

Would it be better for read/write_pkru to force the FPU state
to get loaded into registers, under the assumption that if things
like get_user_pages_fast happen, we will likely switch back to
userspace soon, anyway?

Would that assumption be wrong with KVM? :)

--
All Rights Reversed.



2016-10-04 02:47:59

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, 2016-10-03 at 19:09 -0700, Andy Lutomirski wrote:

> > Having two separate status booleans for "registers valid"
> > and "memory valid" may make more sense.
>
> I have no problem with the concept of "owner_ctx", and I think it's a
> perfectly reasonable data structure.  My problem with it is that it's
> subtle and knowledge of it is spread all over the place.  Just going
> with "registers valid" in a variable won't work, I think, because
> there's nowhere to put it.  We need to be able to delete a struct fpu
> while that struct fpu might have a valid copy in a different cpu's
> registers.
>
> Anyway, feel free to tell me that I'm making this too difficult :)

How about we rename fpu_want_lazy_restore to
fpu_registers_valid()?  Problem solved :)

Then we can rename __cpu_disable_lazy_restore
to fpu_invalidate_registers(), and call that
before we modify any in-memory FPU state.
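
That is just renaming what we already have, something like (untested):

/* was: fpu_want_lazy_restore */
static inline bool fpu_registers_valid(struct fpu *fpu, unsigned int cpu)
{
        return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) &&
               cpu == fpu->last_cpu;
}

/* was: __cpu_disable_lazy_restore */
static inline void fpu_invalidate_registers(unsigned int cpu)
{
        per_cpu(fpu_fpregs_owner_ctx, cpu) = NULL;
}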

> > We can get rid of fpu.counter, since nobody uses it
> > any more.
>
> We should definitely do this.
>
> Maybe getting in some cleanups first (my lazy fpu deletion,
> fpu.counter removal, etc) first is the way to go.

Sounds good.  I will keep my patch 1/4 as part of the
cleanup series, and will not move on to the harder
stuff until after the cleanups.

Any other stuff I should clean up while we're there?

> > > > > 
> > You are right, read_pkru() and write_pkru() can only deal with
> > the pkru state being present in registers. Is this because of an
> > assumption in the code, or because of a hardware requirement?

read_pkru and write_pkru would be candidates for using
fpu_registers_valid, and potentially a fpu_make_registers_valid,
which restores the contents of the fpu registers from memory,
if fpu_registers_valid is not true.

Likewise, we can have an fpu_make_memory_valid to ensure the
in-kernel memory copy of the FPU registers is valid, potentially
with _read and _write versions that do exactly what the ptrace code
wants today.
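
Something like this, perhaps (untested, and glossing over preemption):

static inline void fpu_make_registers_valid(struct fpu *fpu)
{
        if (!fpu_registers_valid(fpu, smp_processor_id()))
                copy_kernel_to_fpregs(&fpu->state);
}

static inline void fpu_make_memory_valid(struct fpu *fpu)
{
        if (fpu->fpregs_active)         /* registers are the master copy */
                copy_fpregs_to_fpstate(fpu);
}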

Would that make sense as an API?

--
All Rights Reversed.



2016-10-04 03:02:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Oct 3, 2016 7:11 PM, "Rik van Riel" <[email protected]> wrote:
>
> On Mon, 2016-10-03 at 14:36 -0700, Andy Lutomirski wrote:
> >
> > Anything else that tries to read task xstate from memory, i.e. MPX
> > and
> > PKRU. (Although if we switch to eager-switched PKRU, then PKRU stops
> > mattering for this purpose.)
> >
> > Actually, I don't see any way your patches can be compatible with
> > PKRU
> > without switching to eager-switched PKRU.
>
> There is one case where the in-register PKRU state matters:
> - user space accesses to memory
>
> There are several cases where the in-memory PKRU state would
> suffice:
> - get_user_pages(_fast) to the local task (could also use registers)
> - setting VMA/PTE permission bits (could also use registers)
>
> There is one case where only in-memory PKRU state works, where
> PKRU is currently simply ignored:
> - get_user_pages to another task's memory

Also __get_user, etc. I don't think you want to start playing with
TIF_LOAD_FPU there. I think tracking PKRU separately and eagerly
loading it with WRPKRU may be the only decent choice here.
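
Concretely, __switch_to() could do something like this (untested, and
assuming the next task's PKRU value can be pulled out of its xsave
buffer with get_xsave_addr()):

        if (boot_cpu_has(X86_FEATURE_OSPKE)) {
                struct pkru_state *pk;

                pk = get_xsave_addr(&next_fpu->state.xsave,
                                    XFEATURE_MASK_PKRU);
                __write_pkru(pk ? pk->pkru : 0);
        }

Then PKRU stops caring whether the rest of the FPU state has been
loaded yet.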

>
> Dave, are there major obstacles to making read_pkru and write_pkru
> work with in-memory state?
>
> Would it be better for read/write_pkru to force the FPU state
> to get loaded into registers, under the assumption that if things
> like get_user_pages_fast happen, we will likely switch back to
> userspace soon, anyway?
>
> Would that assumption be wrong with KVM? :)

>
> --
> All Rights Reversed.

2016-10-04 03:03:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Mon, Oct 3, 2016 at 7:47 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2016-10-03 at 19:09 -0700, Andy Lutomirski wrote:
>
>> > Having two separate status booleans for "registers valid"
>> > and "memory valid" may make more sense.
>>
>> I have no problem with the concept of "owner_ctx", and I think it's a
>> perfectly reasonable data structure. My problem with it is that it's
>> subtle and knowledge of it is spread all over the place. Just going
>> with "registers valid" in a variable won't work, I think, because
>> there's nowhere to put it. We need to be able to delete a struct fpu
>> while that struct fpu might have a valid copy in a different cpu's
>> registers.
>>
>> Anyway, feel free to tell me that I'm making this too difficult :)
>
> How about we rename fpu_want_lazy_restore to
> fpu_registers_valid()? Problem solved :)
>
> Then we can rename __cpu_disable_lazy_restore
> to fpu_invalidate_registers(), and call that
> before we modify any in-memory FPU state.

Sounds good to me.

>
>> > We can get rid of fpu.counter, since nobody uses it
>> > any more.
>>
>> We should definitely do this.
>>
>> Maybe getting in some cleanups (my lazy fpu deletion,
>> fpu.counter removal, etc) first is the way to go.
>
> Sounds good. I will keep my patch 1/4 as part of the
> cleanup series, and will not move on to the harder
> stuff until after the cleanups.
>
> Any other stuff I should clean up while we're there?

Almost certainly, but nothing I'm thinking of right now :)

>
>> > > > >
>> > You are right, read_pkru() and write_pkru() can only deal with
>> > the pkru state being present in registers. Is this because of an
>> > assumption in the code, or because of a hardware requirement?
>
> read_pkru and write_pkru would be candidates for using
> fpu_registers_valid, and potentially a fpu_make_registers_valid,
> which restores the contents of the fpu registers from memory,
> if fpu_registers_valid is not true.
>
> Likewise, we can have an fpu_make_memory_valid to ensure the
> in-kernel memory copy of the FPU registers is valid, potentially
> with _read and _write versions that do exactly what the ptrace code
> wants today.
>
> Would that make sense as an API?
>
> --
> All Rights Reversed.



--
Andy Lutomirski
AMA Capital Management, LLC

2016-10-04 06:35:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace


* Andy Lutomirski <[email protected]> wrote:

> > Running the task in user space would have to ensure
> > "registers valid" is true, and make "memory valid"
> > false, because userspace could write to the registers.
> >
> > Doing a ptrace fpstate_read would make "memory valid"
> > true, but does not need to invalidate "registers valid".
> >
> > Doing a ptrace fpstate_write would make "memory valid"
> > true, but would invalidate "registers valid".
> >
> > Context switching away from a task would make "memory
> > valid" true, by virtue of copying the fpstate to
> > memory.
> >
> > Ingo, would you have any objection to patches tracking
> > the validity of both register and memory states
> > independently?

I'd like to reserve judgement - tentatively I see no red flags
at the moment - but in any case I'd like to start with:

> >
> > We can get rid of fpu.counter, since nobody uses it
> > any more.
>
> We should definitely do this.
>
> Maybe getting in some cleanups (my lazy fpu deletion,
> fpu.counter removal, etc) first is the way to go.

Could you guys please submit all pending FPU bits that aren't new
functionality to -tip? I'll remove lazy FPU handling if you don't
beat me to it.

Plus after lazy FPU removal we should take another good look and
make the FPU state machine even more obvious and self-documenting.

Upstream merge commit 597f03f9d133 would be a good base.

Thanks,

Ingo

2016-10-04 12:48:25

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace

On Tue, 2016-10-04 at 08:35 +0200, Ingo Molnar wrote:
> * Andy Lutomirski <[email protected]> wrote:

> > > We can get rid of fpu.counter, since nobody uses it
> > > any more.
> >
> > We should definitely do this.
> >
> > Maybe getting in some cleanups (my lazy fpu deletion,
> > fpu.counter removal, etc) first is the way to go.
>
> Could you guys please submit all pending FPU bits that aren't new
> functionality to -tip? I'll remove lazy FPU handling if you don't
> beat me to it.
>
> Plus after lazy FPU removal we should take another good look and
> make the FPU state machine even more obvious and self-documenting.
>
> Upstream merge commit 597f03f9d133 would be a good base.

I am flying out to Europe tomorrow, but will try to get
you the cleanups some time this week.  Besides the first
four patches from Andy, there is a shortlist of a few
more small changes.

--
All Rights Reversed.



2016-10-04 15:28:58

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH RFC 1/5] x86,fpu: split prev/next task fpu state handling

On Sat, 2016-10-01 at 16:26 -0700, Andy Lutomirski wrote:
> On Oct 1, 2016 1:49 PM, <[email protected]> wrote:
> >
> >
> > From: Rik van Riel <[email protected]>
> >
> > Move all handling of the next state FPU state handling into
> > switch_fpu_finish, in preparation for more lazily switching
> > FPU states.
> >
> > CR0.TS state is mirrored in a per-cpu variable, instead of
> > being passed around in a local variable, because that will
> > not be possible later in the series.
>
> This seems reasonable in principle, but IMO it would be less scary if
> you rebased onto this:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x8
> 6/fpu

I can rebase on top of that.

I am perfectly fine with your patches going in
first, and mine later on. Too many FPU changes
at once is risky, anyway.

--
All Rights Reversed.

