2012-08-24 21:15:31

by Suresh Siddha

Subject: [PATCH 0/6] x86, fpu: cleanups, introduce non-lazy FPU restore for xsave

These patches are against tip/x86/fpu. A few cleanups and, more importantly,
this series introduces a non-lazy FPU restore mechanism for processors
supporting the xsave feature. More details are in the individual patch changelogs.

Thanks.

Suresh Siddha (6):
x86, fpu: drop_fpu() before restoring new state from sigframe
x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
x86, fpu: use non-lazy fpu restore for processors supporting xsave

arch/x86/include/asm/fpu-internal.h | 118 +++++++++++++++++++---------------
arch/x86/include/asm/i387.h | 1 +
arch/x86/include/asm/xor_32.h | 56 +++--------------
arch/x86/include/asm/xor_64.h | 61 +++---------------
arch/x86/include/asm/xor_avx.h | 54 ++++------------
arch/x86/include/asm/xsave.h | 1 +
arch/x86/kernel/i387.c | 20 +++++-
arch/x86/kernel/process.c | 12 +++-
arch/x86/kernel/process_32.c | 4 -
arch/x86/kernel/process_64.c | 4 -
arch/x86/kernel/traps.c | 5 +-
arch/x86/kernel/xsave.c | 58 +++++++++++++----
arch/x86/kvm/x86.c | 3 +-
drivers/lguest/x86/core.c | 10 ++-
14 files changed, 180 insertions(+), 227 deletions(-)

--
1.7.6.5


2012-08-24 21:15:41

by Suresh Siddha

Subject: [PATCH 2/6] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()

A few lines below we do drop_fpu(), which is safer. Remove the
unnecessary user_fpu_end() in save_xstate_sig(); this allows
drop_fpu() to ignore any pending exceptions from user-space
and drop the current fpu.
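
For reference, drop_fpu() relies on the existing __drop_fpu() helper
(visible in the fpu-internal.h hunks in this series), which swallows any
pending user-space x87 exception with an exception-table-protected fwait
before releasing the FPU:

static inline void __drop_fpu(struct task_struct *tsk)
{
	if (__thread_has_fpu(tsk)) {
		/*
		 * A delayed exception raised by fwait faults into the
		 * fixup at label 2 and is simply ignored.
		 */
		asm volatile("1: fwait\n"
			     "2:\n"
			     _ASM_EXTABLE(1b, 2b));
		__thread_fpu_end(tsk);
	}
}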

Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/include/asm/fpu-internal.h | 17 +++--------------
arch/x86/kernel/xsave.c | 1 -
2 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index fe95ad0..fac39e9 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk)
}

/*
- * The actual user_fpu_begin/end() functions
- * need to be preemption-safe.
+ * Need to be preemption-safe.
*
- * NOTE! user_fpu_end() must be used only after you
- * have saved the FP state, and user_fpu_begin() must
- * be used only immediately before restoring it.
- * These functions do not do any save/restore on
- * their own.
+ * NOTE! user_fpu_begin() must be used only immediately before restoring
+ * the FP state. This function does not do any save/restore on its own.
*/
-static inline void user_fpu_end(void)
-{
- preempt_disable();
- __thread_fpu_end(current);
- preempt_enable();
-}
-
static inline void user_fpu_begin(void)
{
preempt_disable();
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 6cfc7d9..f0bb844 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -254,7 +254,6 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size)
/* Update the thread's fxstate to save the fsave header. */
if (ia32_fxstate)
fpu_fxsave(&tsk->thread.fpu);
- user_fpu_end();
} else {
sanitize_i387_state(tsk);
if (__copy_to_user(buf_fx, xsave, xstate_size))
--
1.7.6.5

2012-08-24 21:15:49

by Suresh Siddha

Subject: [PATCH 5/6] lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models

Instead of using unlazy_fpu(), check user_has_fpu() and set/clear
the host TS bit so that lguest works fine with both the
lazy and non-lazy FPU host models with minimal changes.
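
Condensed, the guest run path now brackets the switch into the Guest like
this (a sketch of the flow in the full diff below; the switcher call
itself is elided):

	/*
	 * Before entering the Guest: if the Guest wants TS set and the
	 * host's math state is live in the registers, set the host TS
	 * bit so a Guest FPU access traps and can be passed to the Guest.
	 */
	if (cpu->ts && user_has_fpu())
		stts();

	/* ... switch into the Guest and back ... */

	/*
	 * On the way out, clear TS again: a non-lazy host expects TS to
	 * be clear while its FPU state is live in the registers.
	 */
	if (cpu->ts && user_has_fpu())
		clts();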

Signed-off-by: Suresh Siddha <[email protected]>
Cc: Rusty Russell <[email protected]>
---
drivers/lguest/x86/core.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c
index 39809035..4af12e1 100644
--- a/drivers/lguest/x86/core.c
+++ b/drivers/lguest/x86/core.c
@@ -203,8 +203,8 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
* we set it now, so we can trap and pass that trap to the Guest if it
* uses the FPU.
*/
- if (cpu->ts)
- unlazy_fpu(current);
+ if (cpu->ts && user_has_fpu())
+ stts();

/*
* SYSENTER is an optimized way of doing system calls. We can't allow
@@ -234,6 +234,10 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
if (boot_cpu_has(X86_FEATURE_SEP))
wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);

+ /* Clear the host TS bit if it was set above. */
+ if (cpu->ts && user_has_fpu())
+ clts();
+
/*
* If the Guest page faulted, then the cr2 register will tell us the
* bad virtual address. We have to grab this now, because once we
@@ -249,7 +253,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
* a different CPU. So all the critical stuff should be done
* before this.
*/
- else if (cpu->regs->trapnum == 7)
+ else if (cpu->regs->trapnum == 7 && !user_has_fpu())
math_state_restore();
}

--
1.7.6.5

2012-08-24 21:16:05

by Suresh Siddha

Subject: [PATCH 6/6] x86, fpu: use non-lazy fpu restore for processors supporting xsave

The fundamental model of the current Linux kernel is to lazily init and
restore the FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to a non-lazy model for
processors supporting the xsave feature.

Reasons driving this model change are:

i. Newer processors support optimized state save/restore using xsaveopt and
xrstor by tracking the INIT state and MODIFIED state during context switch.
This is faster than toggling the cr0.TS bit, which has serializing semantics.

ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
With certain workloads (like boot, kernel compilation, etc.), an application
completes its work within the first 5 task switches, thus taking up to 5 #DNA
traps without the kernel ever getting a chance to apply the fpu_counter
pre-load heuristic.

iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
and thus will not work correctly in the presence of lazy restore. Non-lazy
state restore is needed to enable such features.
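
As a concrete illustration of (i), the save and restore paths in this
series drive the xsave primitives directly with the full feature mask
(-1, i.e. all enabled xstate features), never touching cr0.TS; a minimal
sketch using the calls from the diffs below:

	/*
	 * Save the task's complete extended state on switch-out. With
	 * xsaveopt the hardware skips components that are still in their
	 * INIT state or unmodified since the last xrstor on this cpu.
	 */
	xsave_state(&tsk->thread.fpu.state->xsave, -1);

	/* Load the pristine init state, e.g. when dropping the fpu. */
	xrstor_state(init_xstate_buf, -1);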

Some data from a two-socket SNB system:
* Saved 20K DNA exceptions during boot.
* Saved 50K DNA exceptions during a kernel-compilation workload.
* Improved throughput of the AVX-based checksumming function inside the
kernel by ~15%, as xsave/xrstor is faster than the serializing clts/stts
pair.

Signed-off-by: Suresh Siddha <[email protected]>
Cc: Jim Kukunas <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Avi Kivity <[email protected]>
---
arch/x86/include/asm/fpu-internal.h | 96 +++++++++++++++++++++++------------
arch/x86/include/asm/i387.h | 1 +
arch/x86/include/asm/xsave.h | 1 +
arch/x86/kernel/i387.c | 20 ++++++-
arch/x86/kernel/process.c | 12 +++--
arch/x86/kernel/process_32.c | 4 --
arch/x86/kernel/process_64.c | 4 --
arch/x86/kernel/traps.c | 5 ++-
arch/x86/kernel/xsave.c | 57 +++++++++++++++++----
9 files changed, 140 insertions(+), 60 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index fac39e9..e31cc6e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -291,15 +291,48 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk)
static inline void __thread_fpu_end(struct task_struct *tsk)
{
__thread_clear_has_fpu(tsk);
- stts();
+ if (!use_xsave())
+ stts();
}

static inline void __thread_fpu_begin(struct task_struct *tsk)
{
- clts();
+ if (!use_xsave())
+ clts();
__thread_set_has_fpu(tsk);
}

+static inline void __drop_fpu(struct task_struct *tsk)
+{
+ if (__thread_has_fpu(tsk)) {
+ /* Ignore delayed exceptions from user space */
+ asm volatile("1: fwait\n"
+ "2:\n"
+ _ASM_EXTABLE(1b, 2b));
+ __thread_fpu_end(tsk);
+ }
+}
+
+static inline void drop_fpu(struct task_struct *tsk)
+{
+ /*
+ * Forget coprocessor state..
+ */
+ preempt_disable();
+ tsk->fpu_counter = 0;
+ __drop_fpu(tsk);
+ clear_used_math();
+ preempt_enable();
+}
+
+static inline void drop_init_fpu(struct task_struct *tsk)
+{
+ if (!use_xsave())
+ drop_fpu(tsk);
+ else
+ xrstor_state(init_xstate_buf, -1);
+}
+
/*
* FPU state switching for scheduling.
*
@@ -333,7 +366,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
{
fpu_switch_t fpu;

- fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
+ /*
+ * If the task has used the math, pre-load the FPU on xsave processors
+ * or if the past 5 consecutive context-switches used math.
+ */
+ fpu.preload = tsk_used_math(new) && (use_xsave() ||
+ new->fpu_counter > 5);
if (__thread_has_fpu(old)) {
if (!__save_init_fpu(old))
cpu = ~0;
@@ -345,14 +383,14 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
new->fpu_counter++;
__thread_set_has_fpu(new);
prefetch(new->thread.fpu.state);
- } else
+ } else if (!use_xsave())
stts();
} else {
old->fpu_counter = 0;
old->thread.fpu.last_cpu = ~0;
if (fpu.preload) {
new->fpu_counter++;
- if (fpu_lazy_restore(new, cpu))
+ if (!use_xsave() && fpu_lazy_restore(new, cpu))
fpu.preload = 0;
else
prefetch(new->thread.fpu.state);
@@ -372,7 +410,7 @@ static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
{
if (fpu.preload) {
if (unlikely(restore_fpu_checking(new)))
- __thread_fpu_end(new);
+ drop_init_fpu(new);
}
}

@@ -400,17 +438,6 @@ static inline int restore_xstate_sig(void __user *buf, int ia32_frame)
return __restore_xstate_sig(buf, buf_fx, size);
}

-static inline void __drop_fpu(struct task_struct *tsk)
-{
- if (__thread_has_fpu(tsk)) {
- /* Ignore delayed exceptions from user space */
- asm volatile("1: fwait\n"
- "2:\n"
- _ASM_EXTABLE(1b, 2b));
- __thread_fpu_end(tsk);
- }
-}
-
/*
* Need to be preemption-safe.
*
@@ -431,24 +458,18 @@ static inline void user_fpu_begin(void)
static inline void save_init_fpu(struct task_struct *tsk)
{
WARN_ON_ONCE(!__thread_has_fpu(tsk));
+
+ if (use_xsave()) {
+ xsave_state(&tsk->thread.fpu.state->xsave, -1);
+ return;
+ }
+
preempt_disable();
__save_init_fpu(tsk);
__thread_fpu_end(tsk);
preempt_enable();
}

-static inline void drop_fpu(struct task_struct *tsk)
-{
- /*
- * Forget coprocessor state..
- */
- tsk->fpu_counter = 0;
- preempt_disable();
- __drop_fpu(tsk);
- preempt_enable();
- clear_used_math();
-}
-
/*
* i387 state interaction
*/
@@ -503,12 +524,21 @@ static inline void fpu_free(struct fpu *fpu)
}
}

-static inline void fpu_copy(struct fpu *dst, struct fpu *src)
+static inline void fpu_copy(struct task_struct *dst, struct task_struct *src)
{
- memcpy(dst->state, src->state, xstate_size);
-}
+ if (use_xsave()) {
+ struct xsave_struct *xsave = &dst->thread.fpu.state->xsave;

-extern void fpu_finit(struct fpu *fpu);
+ memset(&xsave->xsave_hdr, 0, sizeof(struct xsave_hdr_struct));
+ xsave_state(xsave, -1);
+ } else {
+ struct fpu *dfpu = &dst->thread.fpu;
+ struct fpu *sfpu = &src->thread.fpu;
+
+ unlazy_fpu(src);
+ memcpy(dfpu->state, sfpu->state, xstate_size);
+ }
+}

static inline unsigned long
alloc_mathframe(unsigned long sp, int ia32_frame, unsigned long *buf_fx,
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 257d9cc..6c3bd37 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -19,6 +19,7 @@ struct pt_regs;
struct user_i387_struct;

extern int init_fpu(struct task_struct *child);
+extern void fpu_finit(struct fpu *fpu);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);
extern void math_state_restore(void);

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index c1d989a..2ddee1b8 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -34,6 +34,7 @@
extern unsigned int xstate_size;
extern u64 pcntxt_mask;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
+extern struct xsave_struct *init_xstate_buf;

extern void xsave_init(void);
extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index ab6a2e8..5285574 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -22,7 +22,15 @@
/*
* Were we in an interrupt that interrupted kernel mode?
*
- * We can do a kernel_fpu_begin/end() pair *ONLY* if that
+ * For now, on xsave platforms we will return interrupted
+ * kernel FPU as not-idle. TBD: As we use non-lazy FPU restore
+ * for xsave platforms, ideally we can change the return value
+ * to something like __thread_has_fpu(current). But we need to
+ * be careful of doing __thread_clear_has_fpu() before saving
+ * the FPU etc for supporting nested uses etc. For now, take
+ * the simple route!
+ *
+ * On others, we can do a kernel_fpu_begin/end() pair *ONLY* if that
* pair does nothing at all: the thread must not have fpu (so
* that we don't try to save the FPU state), and TS must
* be set (so that the clts/stts pair does nothing that is
@@ -30,6 +38,9 @@
*/
static inline bool interrupted_kernel_fpu_idle(void)
{
+ if (use_xsave())
+ return 0;
+
return !__thread_has_fpu(current) &&
(read_cr0() & X86_CR0_TS);
}
@@ -73,7 +84,7 @@ void kernel_fpu_begin(void)
__save_init_fpu(me);
__thread_clear_has_fpu(me);
/* We do 'stts()' in kernel_fpu_end() */
- } else {
+ } else if (!use_xsave()) {
this_cpu_write(fpu_owner_task, NULL);
clts();
}
@@ -82,7 +93,10 @@ EXPORT_SYMBOL(kernel_fpu_begin);

void kernel_fpu_end(void)
{
- stts();
+ if (use_xsave())
+ math_state_restore();
+ else
+ stts();
preempt_enable();
}
EXPORT_SYMBOL(kernel_fpu_end);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 30069d1..c21e30f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -66,15 +66,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
int ret;

- unlazy_fpu(src);
-
*dst = *src;
if (fpu_allocated(&src->thread.fpu)) {
memset(&dst->thread.fpu, 0, sizeof(dst->thread.fpu));
ret = fpu_alloc(&dst->thread.fpu);
if (ret)
return ret;
- fpu_copy(&dst->thread.fpu, &src->thread.fpu);
+ fpu_copy(dst, src);
}
return 0;
}
@@ -153,7 +151,13 @@ void flush_thread(void)

flush_ptrace_hw_breakpoint(tsk);
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
- drop_fpu(tsk);
+ drop_init_fpu(tsk);
+ /*
+ * Free the FPU state for non xsave platforms. They get reallocated
+ * lazily at the first use.
+ */
+ if (!use_xsave())
+ free_thread_xstate(tsk);
}

static void hard_disable_TSC(void)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 516fa18..b9ff83c 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -190,10 +190,6 @@ start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
regs->cs = __USER_CS;
regs->ip = new_ip;
regs->sp = new_sp;
- /*
- * Free the old FP and other extended state
- */
- free_thread_xstate(current);
}
EXPORT_SYMBOL_GPL(start_thread);

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 0a980c9..8a6d20c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -232,10 +232,6 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
regs->cs = _cs;
regs->ss = _ss;
regs->flags = X86_EFLAGS_IF;
- /*
- * Free the old FP and other extended state
- */
- free_thread_xstate(current);
}

void
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b481341..ac7d527 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -613,11 +613,12 @@ void math_state_restore(void)
}

__thread_fpu_begin(tsk);
+
/*
* Paranoid restore. send a SIGSEGV if we fail to restore the state.
*/
if (unlikely(restore_fpu_checking(tsk))) {
- __thread_fpu_end(tsk);
+ drop_init_fpu(tsk);
force_sig(SIGSEGV, tsk);
return;
}
@@ -629,6 +630,8 @@ EXPORT_SYMBOL_GPL(math_state_restore);
dotraplinkage void __kprobes
do_device_not_available(struct pt_regs *regs, long error_code)
{
+ BUG_ON(use_xsave());
+
#ifdef CONFIG_MATH_EMULATION
if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index f0bb844..55bf571 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -21,7 +21,7 @@ u64 pcntxt_mask;
/*
* Represents init state for the supported extended state.
*/
-static struct xsave_struct *init_xstate_buf;
+struct xsave_struct *init_xstate_buf;

static struct _fpx_sw_bytes fx_sw_reserved, fx_sw_reserved_ia32;
static unsigned int *xstate_offsets, *xstate_sizes, xstate_features;
@@ -267,7 +267,7 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size)
if (use_fxsr() && save_xstate_epilog(buf_fx, ia32_fxstate))
return -1;

- drop_fpu(tsk); /* trigger finit */
+ drop_init_fpu(tsk); /* trigger finit */

return 0;
}
@@ -339,7 +339,7 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
config_enabled(CONFIG_IA32_EMULATION));

if (!buf) {
- drop_fpu(tsk);
+ drop_init_fpu(tsk);
return 0;
}

@@ -379,15 +379,30 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
*/
struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave;
struct user_i387_ia32_struct env;
+ int err = 0;

+ /*
+ * Drop the current fpu which clears used_math(). This ensures
+ * that any context-switch during the copy of the new state,
+ * avoids the intermediate state from getting restored/saved.
+ * Thus avoiding the new restored state from getting corrupted.
+ * We will be ready to restore/save the state only after
+ * set_used_math() is again set.
+ */
drop_fpu(tsk);

if (__copy_from_user(xsave, buf_fx, state_size) ||
- __copy_from_user(&env, buf, sizeof(env)))
- return -1;
+ __copy_from_user(&env, buf, sizeof(env))) {
+ err = -1;
+ } else {
+ sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only);
+ set_used_math();
+ }

- sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only);
- set_used_math();
+ if (use_xsave())
+ math_state_restore();
+
+ return err;
} else {
/*
* For 64-bit frames and 32-bit fsave frames, restore the user
@@ -395,7 +410,7 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
*/
user_fpu_begin();
if (restore_user_xstate(buf_fx, xstate_bv, fx_only)) {
- drop_fpu(tsk);
+ drop_init_fpu(tsk);
return -1;
}
}
@@ -434,11 +449,29 @@ static void prepare_fx_sw_frame(void)
*/
static inline void xstate_enable(void)
{
+ clts();
set_in_cr4(X86_CR4_OSXSAVE);
xsetbv(XCR_XFEATURE_ENABLED_MASK, pcntxt_mask);
}

/*
+ * This is same as math_state_restore(). But use_xsave() is not yet
+ * patched to use math_state_restore().
+ */
+static inline void init_restore_xstate(void)
+{
+ init_fpu(current);
+ __thread_fpu_begin(current);
+ xrstor_state(init_xstate_buf, -1);
+}
+
+static inline void xstate_enable_ap(void)
+{
+ xstate_enable();
+ init_restore_xstate();
+}
+
+/*
* Record the offsets and sizes of different state managed by the xsave
* memory layout.
*/
@@ -478,7 +511,6 @@ static void __init setup_xstate_init(void)
__alignof__(struct xsave_struct));
init_xstate_buf->i387.mxcsr = MXCSR_DEFAULT;

- clts();
/*
* Init all the features state with header_bv being 0x0
*/
@@ -488,7 +520,6 @@ static void __init setup_xstate_init(void)
* of any feature which is not represented by all zero's.
*/
xsave_state(init_xstate_buf, -1);
- stts();
}

/*
@@ -532,6 +563,10 @@ static void __init xstate_enable_boot_cpu(void)

pr_info("enabled xstate_bv 0x%llx, cntxt size 0x%x\n",
pcntxt_mask, xstate_size);
+
+ current->thread.fpu.state =
+ alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct));
+ init_restore_xstate();
}

/*
@@ -550,6 +585,6 @@ void __cpuinit xsave_init(void)
return;

this_func = next_func;
- next_func = xstate_enable;
+ next_func = xstate_enable_ap;
this_func();
}
--
1.7.6.5

2012-08-24 21:16:22

by Suresh Siddha

Subject: [PATCH 4/6] x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage

Use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and
saving/restoring just the few used xmm/ymm registers.

This has several advantages:
* If the task's FPU state is already active, then kernel_fpu_begin()
will just save the user state, avoiding the read/write of cr0.
In general, cr0 accesses are much slower.

* Manual save/restore of xmm/ymm registers will defeat the 'modified' and
the 'init' optimizations brought in by the xsaveopt/xrstor
infrastructure.

* Forward compatibility with future vector register extensions will be a
problem if the xmm/ymm registers are manually saved and restored
(corrupting the extended state of those vector registers).

With this patch, there was no significant difference in the xor throughput
using AVX, measured during boot.
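
The conversion itself is mechanical; every xor routine now follows the
same pattern (a condensed sketch of the diffs below):

	kernel_fpu_begin();	/* preemption off; live user FPU state saved */

	/*
	 * Use the xmm/ymm registers freely in inline asm here; no
	 * sleeping or faulting allowed inside this region.
	 */

	kernel_fpu_end();	/* state handling restored; preemption back on */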

Signed-off-by: Suresh Siddha <[email protected]>
Cc: Jim Kukunas <[email protected]>
Cc: NeilBrown <[email protected]>
---
arch/x86/include/asm/xor_32.h | 56 +++++-------------------------------
arch/x86/include/asm/xor_64.h | 61 ++++++----------------------------------
arch/x86/include/asm/xor_avx.h | 54 ++++++++---------------------------
3 files changed, 29 insertions(+), 142 deletions(-)

diff --git a/arch/x86/include/asm/xor_32.h b/arch/x86/include/asm/xor_32.h
index 4545708..aabd585 100644
--- a/arch/x86/include/asm/xor_32.h
+++ b/arch/x86/include/asm/xor_32.h
@@ -534,38 +534,6 @@ static struct xor_block_template xor_block_p5_mmx = {
* Copyright (C) 1999 Zach Brown (with obvious credit due Ingo)
*/

-#define XMMS_SAVE \
-do { \
- preempt_disable(); \
- cr0 = read_cr0(); \
- clts(); \
- asm volatile( \
- "movups %%xmm0,(%0) ;\n\t" \
- "movups %%xmm1,0x10(%0) ;\n\t" \
- "movups %%xmm2,0x20(%0) ;\n\t" \
- "movups %%xmm3,0x30(%0) ;\n\t" \
- : \
- : "r" (xmm_save) \
- : "memory"); \
-} while (0)
-
-#define XMMS_RESTORE \
-do { \
- asm volatile( \
- "sfence ;\n\t" \
- "movups (%0),%%xmm0 ;\n\t" \
- "movups 0x10(%0),%%xmm1 ;\n\t" \
- "movups 0x20(%0),%%xmm2 ;\n\t" \
- "movups 0x30(%0),%%xmm3 ;\n\t" \
- : \
- : "r" (xmm_save) \
- : "memory"); \
- write_cr0(cr0); \
- preempt_enable(); \
-} while (0)
-
-#define ALIGN16 __attribute__((aligned(16)))
-
#define OFFS(x) "16*("#x")"
#define PF_OFFS(x) "256+16*("#x")"
#define PF0(x) " prefetchnta "PF_OFFS(x)"(%1) ;\n"
@@ -587,10 +555,8 @@ static void
xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -633,7 +599,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
:
: "memory");

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -641,10 +607,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -694,7 +658,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
:
: "memory" );

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -702,10 +666,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -762,7 +724,7 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
:
: "memory" );

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -770,10 +732,8 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4, unsigned long *p5)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

/* Make sure GCC forgets anything it knows about p4 or p5,
such that it won't pass to the asm volatile below a
@@ -850,7 +810,7 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
like assuming they have some legal value. */
asm("" : "=r" (p4), "=r" (p5));

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static struct xor_block_template xor_block_pIII_sse = {
diff --git a/arch/x86/include/asm/xor_64.h b/arch/x86/include/asm/xor_64.h
index b9b2323..5fc06d0 100644
--- a/arch/x86/include/asm/xor_64.h
+++ b/arch/x86/include/asm/xor_64.h
@@ -34,41 +34,7 @@
* no advantages to be gotten from x86-64 here anyways.
*/

-typedef struct {
- unsigned long a, b;
-} __attribute__((aligned(16))) xmm_store_t;
-
-/* Doesn't use gcc to save the XMM registers, because there is no easy way to
- tell it to do a clts before the register saving. */
-#define XMMS_SAVE \
-do { \
- preempt_disable(); \
- asm volatile( \
- "movq %%cr0,%0 ;\n\t" \
- "clts ;\n\t" \
- "movups %%xmm0,(%1) ;\n\t" \
- "movups %%xmm1,0x10(%1) ;\n\t" \
- "movups %%xmm2,0x20(%1) ;\n\t" \
- "movups %%xmm3,0x30(%1) ;\n\t" \
- : "=&r" (cr0) \
- : "r" (xmm_save) \
- : "memory"); \
-} while (0)
-
-#define XMMS_RESTORE \
-do { \
- asm volatile( \
- "sfence ;\n\t" \
- "movups (%1),%%xmm0 ;\n\t" \
- "movups 0x10(%1),%%xmm1 ;\n\t" \
- "movups 0x20(%1),%%xmm2 ;\n\t" \
- "movups 0x30(%1),%%xmm3 ;\n\t" \
- "movq %0,%%cr0 ;\n\t" \
- : \
- : "r" (cr0), "r" (xmm_save) \
- : "memory"); \
- preempt_enable(); \
-} while (0)
+#include <asm/i387.h>

#define OFFS(x) "16*("#x")"
#define PF_OFFS(x) "256+16*("#x")"
@@ -91,10 +57,8 @@ static void
xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
unsigned int lines = bytes >> 8;
- unsigned long cr0;
- xmm_store_t xmm_save[4];

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -135,7 +99,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
: [inc] "r" (256UL)
: "memory");

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -143,11 +107,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3)
{
unsigned int lines = bytes >> 8;
- xmm_store_t xmm_save[4];
- unsigned long cr0;
-
- XMMS_SAVE;

+ kernel_fpu_begin();
asm volatile(
#undef BLOCK
#define BLOCK(i) \
@@ -194,7 +155,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
[p1] "+r" (p1), [p2] "+r" (p2), [p3] "+r" (p3)
: [inc] "r" (256UL)
: "memory");
- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -202,10 +163,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4)
{
unsigned int lines = bytes >> 8;
- xmm_store_t xmm_save[4];
- unsigned long cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -261,7 +220,7 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
: [inc] "r" (256UL)
: "memory" );

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -269,10 +228,8 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4, unsigned long *p5)
{
unsigned int lines = bytes >> 8;
- xmm_store_t xmm_save[4];
- unsigned long cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -336,7 +293,7 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
: [inc] "r" (256UL)
: "memory");

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static struct xor_block_template xor_block_sse = {
diff --git a/arch/x86/include/asm/xor_avx.h b/arch/x86/include/asm/xor_avx.h
index 2510d35..7ea79c5 100644
--- a/arch/x86/include/asm/xor_avx.h
+++ b/arch/x86/include/asm/xor_avx.h
@@ -20,32 +20,6 @@
#include <linux/compiler.h>
#include <asm/i387.h>

-#define ALIGN32 __aligned(32)
-
-#define YMM_SAVED_REGS 4
-
-#define YMMS_SAVE \
-do { \
- preempt_disable(); \
- cr0 = read_cr0(); \
- clts(); \
- asm volatile("vmovaps %%ymm0, %0" : "=m" (ymm_save[0]) : : "memory"); \
- asm volatile("vmovaps %%ymm1, %0" : "=m" (ymm_save[32]) : : "memory"); \
- asm volatile("vmovaps %%ymm2, %0" : "=m" (ymm_save[64]) : : "memory"); \
- asm volatile("vmovaps %%ymm3, %0" : "=m" (ymm_save[96]) : : "memory"); \
-} while (0);
-
-#define YMMS_RESTORE \
-do { \
- asm volatile("sfence" : : : "memory"); \
- asm volatile("vmovaps %0, %%ymm3" : : "m" (ymm_save[96])); \
- asm volatile("vmovaps %0, %%ymm2" : : "m" (ymm_save[64])); \
- asm volatile("vmovaps %0, %%ymm1" : : "m" (ymm_save[32])); \
- asm volatile("vmovaps %0, %%ymm0" : : "m" (ymm_save[0])); \
- write_cr0(cr0); \
- preempt_enable(); \
-} while (0);
-
#define BLOCK4(i) \
BLOCK(32 * i, 0) \
BLOCK(32 * (i + 1), 1) \
@@ -60,10 +34,9 @@ do { \

static void xor_avx_2(unsigned long bytes, unsigned long *p0, unsigned long *p1)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -82,16 +55,15 @@ do { \
p1 = (unsigned long *)((uintptr_t)p1 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static void xor_avx_3(unsigned long bytes, unsigned long *p0, unsigned long *p1,
unsigned long *p2)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -113,16 +85,15 @@ do { \
p2 = (unsigned long *)((uintptr_t)p2 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static void xor_avx_4(unsigned long bytes, unsigned long *p0, unsigned long *p1,
unsigned long *p2, unsigned long *p3)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -147,16 +118,15 @@ do { \
p3 = (unsigned long *)((uintptr_t)p3 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static void xor_avx_5(unsigned long bytes, unsigned long *p0, unsigned long *p1,
unsigned long *p2, unsigned long *p3, unsigned long *p4)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -184,7 +154,7 @@ do { \
p4 = (unsigned long *)((uintptr_t)p4 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static struct xor_block_template xor_block_avx = {
--
1.7.6.5

2012-08-24 21:16:30

by Suresh Siddha

Subject: [PATCH 1/6] x86, fpu: drop_fpu() before restoring new state from sigframe

No need to save the state with unlazy_fpu() when it is about to get overwritten
by the state from the signal frame. Instead, use drop_fpu() and continue
to restore the new state.

Also fold stop_fpu_preload() into drop_fpu().
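
In short, the restore path changes like this (a sketch; the full diff is
below):

	/* Before: save state that is about to be thrown away */
	stop_fpu_preload(tsk);
	unlazy_fpu(tsk);

	/* After: just forget it; the sigframe state is copied in next */
	drop_fpu(tsk);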

Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/include/asm/fpu-internal.h | 7 +------
arch/x86/kernel/xsave.c | 8 +++-----
2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index ba83a08..fe95ad0 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -448,17 +448,12 @@ static inline void save_init_fpu(struct task_struct *tsk)
preempt_enable();
}

-static inline void stop_fpu_preload(struct task_struct *tsk)
-{
- tsk->fpu_counter = 0;
-}
-
static inline void drop_fpu(struct task_struct *tsk)
{
/*
* Forget coprocessor state..
*/
- stop_fpu_preload(tsk);
+ tsk->fpu_counter = 0;
preempt_disable();
__drop_fpu(tsk);
preempt_enable();
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index a23d100..6cfc7d9 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -381,16 +381,14 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave;
struct user_i387_ia32_struct env;

- stop_fpu_preload(tsk);
- unlazy_fpu(tsk);
+ drop_fpu(tsk);

if (__copy_from_user(xsave, buf_fx, state_size) ||
- __copy_from_user(&env, buf, sizeof(env))) {
- drop_fpu(tsk);
+ __copy_from_user(&env, buf, sizeof(env)))
return -1;
- }

sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only);
+ set_used_math();
} else {
/*
* For 64-bit frames and 32-bit fsave frames, restore the user
--
1.7.6.5

2012-08-24 21:16:25

by Suresh Siddha

Subject: [PATCH 3/6] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()

kvm's guest fpu save/restore should be wrapped in
kernel_fpu_begin/end(). This will avoid, for example, taking a DNA fault
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.

More importantly, this will prevent the host process's fpu from being
corrupted.
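
The guest FPU handling then nests inside a kernel FPU region (a sketch of
the resulting flow across the two functions in the diff below):

	/* kvm_load_guest_fpu() */
	kernel_fpu_begin();		/* host user state saved if live */
	fpu_restore_checking(&vcpu->arch.guest_fpu);

	/* ... guest runs with its own fpu state ... */

	/* kvm_put_guest_fpu() */
	fpu_save_init(&vcpu->arch.guest_fpu);
	kernel_fpu_end();		/* host FPU handling restored */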

Signed-off-by: Suresh Siddha <[email protected]>
Cc: Avi Kivity <[email protected]>
---
arch/x86/kvm/x86.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42bce48..67e773c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5969,7 +5969,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
*/
kvm_put_guest_xcr0(vcpu);
vcpu->guest_fpu_loaded = 1;
- unlazy_fpu(current);
+ kernel_fpu_begin();
fpu_restore_checking(&vcpu->arch.guest_fpu);
trace_kvm_fpu(1);
}
@@ -5983,6 +5983,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)

vcpu->guest_fpu_loaded = 0;
fpu_save_init(&vcpu->arch.guest_fpu);
+ kernel_fpu_end();
++vcpu->stat.fpu_reload;
kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
trace_kvm_fpu(0);
--
1.7.6.5

2012-08-24 21:33:39

by H. Peter Anvin

Subject: Re: [PATCH 0/6] x86, fpu: cleanups, introduce non-lazy FPU restore for xsave

I have applied this to tip:x86/fpu, but I have also asked Suresh to
prepare a follow-on patch to decouple eager save from the existence of
the XSAVE instruction. It seems pretty clear that eager save is a net
benefit in the presence of XSAVEOPT, but it isn't as clear with only
XSAVE, as far as I can tell. Either way it would seem to be a
policy decision that is somewhat separate from the exact instruction.

-hpa

2012-08-24 22:13:45

by Suresh Siddha

Subject: [tip:x86/fpu] x86, fpu: drop_fpu() before restoring new state from sigframe

Commit-ID: 739390035c5fba2132fa424309786ff7bdd2cc1e
Gitweb: http://git.kernel.org/tip/739390035c5fba2132fa424309786ff7bdd2cc1e
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 24 Aug 2012 14:12:57 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 24 Aug 2012 14:26:47 -0700

x86, fpu: drop_fpu() before restoring new state from sigframe

No need to save the state with unlazy_fpu() when it is about to get overwritten
by the state from the signal frame. Instead, use drop_fpu() and continue
to restore the new state.

Also fold stop_fpu_preload() into drop_fpu().

Signed-off-by: Suresh Siddha <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/fpu-internal.h | 7 +------
arch/x86/kernel/xsave.c | 8 +++-----
2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index ba83a08..fe95ad0 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -448,17 +448,12 @@ static inline void save_init_fpu(struct task_struct *tsk)
preempt_enable();
}

-static inline void stop_fpu_preload(struct task_struct *tsk)
-{
- tsk->fpu_counter = 0;
-}
-
static inline void drop_fpu(struct task_struct *tsk)
{
/*
* Forget coprocessor state..
*/
- stop_fpu_preload(tsk);
+ tsk->fpu_counter = 0;
preempt_disable();
__drop_fpu(tsk);
preempt_enable();
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index a23d100..6cfc7d9 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -381,16 +381,14 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave;
struct user_i387_ia32_struct env;

- stop_fpu_preload(tsk);
- unlazy_fpu(tsk);
+ drop_fpu(tsk);

if (__copy_from_user(xsave, buf_fx, state_size) ||
- __copy_from_user(&env, buf, sizeof(env))) {
- drop_fpu(tsk);
+ __copy_from_user(&env, buf, sizeof(env)))
return -1;
- }

sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only);
+ set_used_math();
} else {
/*
* For 64-bit frames and 32-bit fsave frames, restore the user

2012-08-24 22:14:39

by Suresh Siddha

Subject: [tip:x86/fpu] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()

Commit-ID: cc50fae05beb2db9f4587bbb1a0d6aba2af5b407
Gitweb: http://git.kernel.org/tip/cc50fae05beb2db9f4587bbb1a0d6aba2af5b407
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 24 Aug 2012 14:12:58 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 24 Aug 2012 14:26:48 -0700

x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()

A few lines below we do drop_fpu(), which is safer. Remove the
unnecessary user_fpu_end() in save_xstate_sig(); this allows
drop_fpu() to ignore any pending exceptions from user-space
and drop the current fpu.

Signed-off-by: Suresh Siddha <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/fpu-internal.h | 17 +++--------------
arch/x86/kernel/xsave.c | 1 -
2 files changed, 3 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index fe95ad0..fac39e9 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk)
}

/*
- * The actual user_fpu_begin/end() functions
- * need to be preemption-safe.
+ * Need to be preemption-safe.
*
- * NOTE! user_fpu_end() must be used only after you
- * have saved the FP state, and user_fpu_begin() must
- * be used only immediately before restoring it.
- * These functions do not do any save/restore on
- * their own.
+ * NOTE! user_fpu_begin() must be used only immediately before restoring
+ * the FP state. This function does not do any save/restore on its own.
*/
-static inline void user_fpu_end(void)
-{
- preempt_disable();
- __thread_fpu_end(current);
- preempt_enable();
-}
-
static inline void user_fpu_begin(void)
{
preempt_disable();
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 6cfc7d9..f0bb844 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -254,7 +254,6 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size)
/* Update the thread's fxstate to save the fsave header. */
if (ia32_fxstate)
fpu_fxsave(&tsk->thread.fpu);
- user_fpu_end();
} else {
sanitize_i387_state(tsk);
if (__copy_to_user(buf_fx, xsave, xstate_size))

2012-08-24 22:15:31

by Suresh Siddha

Subject: [tip:x86/fpu] x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()

Commit-ID: 98700fa647b3572f7fa55485570ab9fc53b91d23
Gitweb: http://git.kernel.org/tip/98700fa647b3572f7fa55485570ab9fc53b91d23
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 24 Aug 2012 14:12:59 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 24 Aug 2012 14:26:49 -0700

x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()

kvm's guest fpu save/restore should be wrapped in
kernel_fpu_begin/end(). This will avoid, for example, taking a DNA fault
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.

More importantly, this will prevent the host process's fpu from being
corrupted.

Signed-off-by: Suresh Siddha <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Avi Kivity <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kvm/x86.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..b92cc39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5954,7 +5954,7 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
*/
kvm_put_guest_xcr0(vcpu);
vcpu->guest_fpu_loaded = 1;
- unlazy_fpu(current);
+ kernel_fpu_begin();
fpu_restore_checking(&vcpu->arch.guest_fpu);
trace_kvm_fpu(1);
}
@@ -5968,6 +5968,7 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)

vcpu->guest_fpu_loaded = 0;
fpu_save_init(&vcpu->arch.guest_fpu);
+ kernel_fpu_end();
++vcpu->stat.fpu_reload;
kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
trace_kvm_fpu(0);

2012-08-24 22:16:25

by Suresh Siddha

Subject: [tip:x86/fpu] x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage

Commit-ID: 964735018df03c94dd12665385d59e3b2c7c08b8
Gitweb: http://git.kernel.org/tip/964735018df03c94dd12665385d59e3b2c7c08b8
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 24 Aug 2012 14:13:00 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 24 Aug 2012 14:26:50 -0700

x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage

Use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and
saving/restoring just the few used xmm/ymm registers.

This has several advantages:
* If the task's FPU state is already active, then kernel_fpu_begin()
will just save the user state, avoiding the read/write of cr0.
In general, cr0 accesses are much slower.

* Manual save/restore of xmm/ymm registers will defeat the 'modified' and
the 'init' optimizations brought in by the xsaveopt/xrstor
infrastructure.

* Forward compatibility with future vector register extensions will be a
problem if the xmm/ymm registers are manually saved and restored
(corrupting the extended state of those vector registers).

With this patch, there was no significant difference in the xor throughput
using AVX, measured during boot.

Signed-off-by: Suresh Siddha <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Jim Kukunas <[email protected]>
Cc: NeilBrown <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/xor_32.h | 56 +++++-------------------------------
arch/x86/include/asm/xor_64.h | 61 ++++++----------------------------------
arch/x86/include/asm/xor_avx.h | 54 ++++++++---------------------------
3 files changed, 29 insertions(+), 142 deletions(-)

diff --git a/arch/x86/include/asm/xor_32.h b/arch/x86/include/asm/xor_32.h
index 4545708..aabd585 100644
--- a/arch/x86/include/asm/xor_32.h
+++ b/arch/x86/include/asm/xor_32.h
@@ -534,38 +534,6 @@ static struct xor_block_template xor_block_p5_mmx = {
* Copyright (C) 1999 Zach Brown (with obvious credit due Ingo)
*/

-#define XMMS_SAVE \
-do { \
- preempt_disable(); \
- cr0 = read_cr0(); \
- clts(); \
- asm volatile( \
- "movups %%xmm0,(%0) ;\n\t" \
- "movups %%xmm1,0x10(%0) ;\n\t" \
- "movups %%xmm2,0x20(%0) ;\n\t" \
- "movups %%xmm3,0x30(%0) ;\n\t" \
- : \
- : "r" (xmm_save) \
- : "memory"); \
-} while (0)
-
-#define XMMS_RESTORE \
-do { \
- asm volatile( \
- "sfence ;\n\t" \
- "movups (%0),%%xmm0 ;\n\t" \
- "movups 0x10(%0),%%xmm1 ;\n\t" \
- "movups 0x20(%0),%%xmm2 ;\n\t" \
- "movups 0x30(%0),%%xmm3 ;\n\t" \
- : \
- : "r" (xmm_save) \
- : "memory"); \
- write_cr0(cr0); \
- preempt_enable(); \
-} while (0)
-
-#define ALIGN16 __attribute__((aligned(16)))
-
#define OFFS(x) "16*("#x")"
#define PF_OFFS(x) "256+16*("#x")"
#define PF0(x) " prefetchnta "PF_OFFS(x)"(%1) ;\n"
@@ -587,10 +555,8 @@ static void
xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -633,7 +599,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
:
: "memory");

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -641,10 +607,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -694,7 +658,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
:
: "memory" );

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -702,10 +666,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -762,7 +724,7 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
:
: "memory" );

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -770,10 +732,8 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4, unsigned long *p5)
{
unsigned long lines = bytes >> 8;
- char xmm_save[16*4] ALIGN16;
- int cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

/* Make sure GCC forgets anything it knows about p4 or p5,
such that it won't pass to the asm volatile below a
@@ -850,7 +810,7 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
like assuming they have some legal value. */
asm("" : "=r" (p4), "=r" (p5));

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static struct xor_block_template xor_block_pIII_sse = {
diff --git a/arch/x86/include/asm/xor_64.h b/arch/x86/include/asm/xor_64.h
index b9b2323..5fc06d0 100644
--- a/arch/x86/include/asm/xor_64.h
+++ b/arch/x86/include/asm/xor_64.h
@@ -34,41 +34,7 @@
* no advantages to be gotten from x86-64 here anyways.
*/

-typedef struct {
- unsigned long a, b;
-} __attribute__((aligned(16))) xmm_store_t;
-
-/* Doesn't use gcc to save the XMM registers, because there is no easy way to
- tell it to do a clts before the register saving. */
-#define XMMS_SAVE \
-do { \
- preempt_disable(); \
- asm volatile( \
- "movq %%cr0,%0 ;\n\t" \
- "clts ;\n\t" \
- "movups %%xmm0,(%1) ;\n\t" \
- "movups %%xmm1,0x10(%1) ;\n\t" \
- "movups %%xmm2,0x20(%1) ;\n\t" \
- "movups %%xmm3,0x30(%1) ;\n\t" \
- : "=&r" (cr0) \
- : "r" (xmm_save) \
- : "memory"); \
-} while (0)
-
-#define XMMS_RESTORE \
-do { \
- asm volatile( \
- "sfence ;\n\t" \
- "movups (%1),%%xmm0 ;\n\t" \
- "movups 0x10(%1),%%xmm1 ;\n\t" \
- "movups 0x20(%1),%%xmm2 ;\n\t" \
- "movups 0x30(%1),%%xmm3 ;\n\t" \
- "movq %0,%%cr0 ;\n\t" \
- : \
- : "r" (cr0), "r" (xmm_save) \
- : "memory"); \
- preempt_enable(); \
-} while (0)
+#include <asm/i387.h>

#define OFFS(x) "16*("#x")"
#define PF_OFFS(x) "256+16*("#x")"
@@ -91,10 +57,8 @@ static void
xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
{
unsigned int lines = bytes >> 8;
- unsigned long cr0;
- xmm_store_t xmm_save[4];

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -135,7 +99,7 @@ xor_sse_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
: [inc] "r" (256UL)
: "memory");

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -143,11 +107,8 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3)
{
unsigned int lines = bytes >> 8;
- xmm_store_t xmm_save[4];
- unsigned long cr0;
-
- XMMS_SAVE;

+ kernel_fpu_begin();
asm volatile(
#undef BLOCK
#define BLOCK(i) \
@@ -194,7 +155,7 @@ xor_sse_3(unsigned long bytes, unsigned long *p1, unsigned long *p2,
[p1] "+r" (p1), [p2] "+r" (p2), [p3] "+r" (p3)
: [inc] "r" (256UL)
: "memory");
- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -202,10 +163,8 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4)
{
unsigned int lines = bytes >> 8;
- xmm_store_t xmm_save[4];
- unsigned long cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -261,7 +220,7 @@ xor_sse_4(unsigned long bytes, unsigned long *p1, unsigned long *p2,
: [inc] "r" (256UL)
: "memory" );

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static void
@@ -269,10 +228,8 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
unsigned long *p3, unsigned long *p4, unsigned long *p5)
{
unsigned int lines = bytes >> 8;
- xmm_store_t xmm_save[4];
- unsigned long cr0;

- XMMS_SAVE;
+ kernel_fpu_begin();

asm volatile(
#undef BLOCK
@@ -336,7 +293,7 @@ xor_sse_5(unsigned long bytes, unsigned long *p1, unsigned long *p2,
: [inc] "r" (256UL)
: "memory");

- XMMS_RESTORE;
+ kernel_fpu_end();
}

static struct xor_block_template xor_block_sse = {
diff --git a/arch/x86/include/asm/xor_avx.h b/arch/x86/include/asm/xor_avx.h
index 2510d35..7ea79c5 100644
--- a/arch/x86/include/asm/xor_avx.h
+++ b/arch/x86/include/asm/xor_avx.h
@@ -20,32 +20,6 @@
#include <linux/compiler.h>
#include <asm/i387.h>

-#define ALIGN32 __aligned(32)
-
-#define YMM_SAVED_REGS 4
-
-#define YMMS_SAVE \
-do { \
- preempt_disable(); \
- cr0 = read_cr0(); \
- clts(); \
- asm volatile("vmovaps %%ymm0, %0" : "=m" (ymm_save[0]) : : "memory"); \
- asm volatile("vmovaps %%ymm1, %0" : "=m" (ymm_save[32]) : : "memory"); \
- asm volatile("vmovaps %%ymm2, %0" : "=m" (ymm_save[64]) : : "memory"); \
- asm volatile("vmovaps %%ymm3, %0" : "=m" (ymm_save[96]) : : "memory"); \
-} while (0);
-
-#define YMMS_RESTORE \
-do { \
- asm volatile("sfence" : : : "memory"); \
- asm volatile("vmovaps %0, %%ymm3" : : "m" (ymm_save[96])); \
- asm volatile("vmovaps %0, %%ymm2" : : "m" (ymm_save[64])); \
- asm volatile("vmovaps %0, %%ymm1" : : "m" (ymm_save[32])); \
- asm volatile("vmovaps %0, %%ymm0" : : "m" (ymm_save[0])); \
- write_cr0(cr0); \
- preempt_enable(); \
-} while (0);
-
#define BLOCK4(i) \
BLOCK(32 * i, 0) \
BLOCK(32 * (i + 1), 1) \
@@ -60,10 +34,9 @@ do { \

static void xor_avx_2(unsigned long bytes, unsigned long *p0, unsigned long *p1)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -82,16 +55,15 @@ do { \
p1 = (unsigned long *)((uintptr_t)p1 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static void xor_avx_3(unsigned long bytes, unsigned long *p0, unsigned long *p1,
unsigned long *p2)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -113,16 +85,15 @@ do { \
p2 = (unsigned long *)((uintptr_t)p2 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static void xor_avx_4(unsigned long bytes, unsigned long *p0, unsigned long *p1,
unsigned long *p2, unsigned long *p3)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -147,16 +118,15 @@ do { \
p3 = (unsigned long *)((uintptr_t)p3 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static void xor_avx_5(unsigned long bytes, unsigned long *p0, unsigned long *p1,
unsigned long *p2, unsigned long *p3, unsigned long *p4)
{
- unsigned long cr0, lines = bytes >> 9;
- char ymm_save[32 * YMM_SAVED_REGS] ALIGN32;
+ unsigned long lines = bytes >> 9;

- YMMS_SAVE
+ kernel_fpu_begin();

while (lines--) {
#undef BLOCK
@@ -184,7 +154,7 @@ do { \
p4 = (unsigned long *)((uintptr_t)p4 + 512);
}

- YMMS_RESTORE
+ kernel_fpu_end();
}

static struct xor_block_template xor_block_avx = {

2012-08-24 22:17:18

by Suresh Siddha

Subject: [tip:x86/fpu] lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models

Commit-ID: 1ce83ffda9aea53e6e4b6b6a82c028a019526010
Gitweb: http://git.kernel.org/tip/1ce83ffda9aea53e6e4b6b6a82c028a019526010
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 24 Aug 2012 14:13:01 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 24 Aug 2012 14:26:52 -0700

lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models

Instead of using unlazy_fpu(), check user_has_fpu() and set/clear
the host TS bit so that lguest works fine with both the
lazy and non-lazy FPU host models with minimal changes.

Signed-off-by: Suresh Siddha <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Rusty Russell <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
drivers/lguest/x86/core.c | 10 +++++++---
1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/lguest/x86/core.c b/drivers/lguest/x86/core.c
index 39809035..4af12e1 100644
--- a/drivers/lguest/x86/core.c
+++ b/drivers/lguest/x86/core.c
@@ -203,8 +203,8 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
* we set it now, so we can trap and pass that trap to the Guest if it
* uses the FPU.
*/
- if (cpu->ts)
- unlazy_fpu(current);
+ if (cpu->ts && user_has_fpu())
+ stts();

/*
* SYSENTER is an optimized way of doing system calls. We can't allow
@@ -234,6 +234,10 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
if (boot_cpu_has(X86_FEATURE_SEP))
wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);

+ /* Clear the host TS bit if it was set above. */
+ if (cpu->ts && user_has_fpu())
+ clts();
+
/*
* If the Guest page faulted, then the cr2 register will tell us the
* bad virtual address. We have to grab this now, because once we
@@ -249,7 +253,7 @@ void lguest_arch_run_guest(struct lg_cpu *cpu)
* a different CPU. So all the critical stuff should be done
* before this.
*/
- else if (cpu->regs->trapnum == 7)
+ else if (cpu->regs->trapnum == 7 && !user_has_fpu())
math_state_restore();
}

2012-08-24 22:18:15

by Suresh Siddha

Subject: [tip:x86/fpu] x86, fpu: use non-lazy fpu restore for processors supporting xsave

Commit-ID: 127f5403bfbc5f52cf0fbbadfa5e624a32a137ff
Gitweb: http://git.kernel.org/tip/127f5403bfbc5f52cf0fbbadfa5e624a32a137ff
Author: Suresh Siddha <[email protected]>
AuthorDate: Fri, 24 Aug 2012 14:13:02 -0700
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 24 Aug 2012 14:26:54 -0700

x86, fpu: use non-lazy fpu restore for processors supporting xsave

The fundamental model of the current Linux kernel is to lazily init and
restore the FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to a non-lazy model for
processors supporting the xsave feature.

Reasons driving this model change are:

i. Newer processors support optimized state save/restore using xsaveopt and
xrstor by tracking the INIT state and MODIFIED state during context switch.
This is faster than toggling the cr0.TS bit, which has serializing semantics.

ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
With certain workloads (like boot, kernel compilation, etc.), an application
completes its work within the first 5 task switches, thus taking up to 5 #DNA
traps without the kernel ever getting a chance to apply the fpu_counter
pre-load heuristic.

iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
and thus will not work correctly in the presence of lazy restore. Non-lazy
state restore is needed to enable such features.

Some data from a two-socket SNB system:
* Saved 20K DNA exceptions during boot.
* Saved 50K DNA exceptions during a kernel-compilation workload.
* Improved throughput of the AVX-based checksumming function inside the
kernel by ~15%, as xsave/xrstor is faster than the serializing clts/stts
pair.

Signed-off-by: Suresh Siddha <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Jim Kukunas <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Avi Kivity <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/fpu-internal.h | 96 +++++++++++++++++++++++------------
arch/x86/include/asm/i387.h | 1 +
arch/x86/include/asm/xsave.h | 1 +
arch/x86/kernel/i387.c | 20 ++++++-
arch/x86/kernel/process.c | 12 +++--
arch/x86/kernel/process_32.c | 4 --
arch/x86/kernel/process_64.c | 4 --
arch/x86/kernel/traps.c | 5 ++-
arch/x86/kernel/xsave.c | 57 +++++++++++++++++----
9 files changed, 140 insertions(+), 60 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
index fac39e9..e31cc6e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -291,15 +291,48 @@ static inline void __thread_set_has_fpu(struct task_struct *tsk)
static inline void __thread_fpu_end(struct task_struct *tsk)
{
__thread_clear_has_fpu(tsk);
- stts();
+ if (!use_xsave())
+ stts();
}

static inline void __thread_fpu_begin(struct task_struct *tsk)
{
- clts();
+ if (!use_xsave())
+ clts();
__thread_set_has_fpu(tsk);
}

+static inline void __drop_fpu(struct task_struct *tsk)
+{
+ if (__thread_has_fpu(tsk)) {
+ /* Ignore delayed exceptions from user space */
+ asm volatile("1: fwait\n"
+ "2:\n"
+ _ASM_EXTABLE(1b, 2b));
+ __thread_fpu_end(tsk);
+ }
+}
+
+static inline void drop_fpu(struct task_struct *tsk)
+{
+ /*
+ * Forget coprocessor state..
+ */
+ preempt_disable();
+ tsk->fpu_counter = 0;
+ __drop_fpu(tsk);
+ clear_used_math();
+ preempt_enable();
+}
+
+static inline void drop_init_fpu(struct task_struct *tsk)
+{
+ if (!use_xsave())
+ drop_fpu(tsk);
+ else
+ xrstor_state(init_xstate_buf, -1);
+}
+
/*
* FPU state switching for scheduling.
*
@@ -333,7 +366,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
{
fpu_switch_t fpu;

- fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
+	/*
+	 * Pre-load the FPU if the task has used math and either the CPU
+	 * supports xsave or the past 5 consecutive switches used math.
+	 */
+ fpu.preload = tsk_used_math(new) && (use_xsave() ||
+ new->fpu_counter > 5);
if (__thread_has_fpu(old)) {
if (!__save_init_fpu(old))
cpu = ~0;
@@ -345,14 +383,14 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
new->fpu_counter++;
__thread_set_has_fpu(new);
prefetch(new->thread.fpu.state);
- } else
+ } else if (!use_xsave())
stts();
} else {
old->fpu_counter = 0;
old->thread.fpu.last_cpu = ~0;
if (fpu.preload) {
new->fpu_counter++;
- if (fpu_lazy_restore(new, cpu))
+ if (!use_xsave() && fpu_lazy_restore(new, cpu))
fpu.preload = 0;
else
prefetch(new->thread.fpu.state);
@@ -372,7 +410,7 @@ static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
{
if (fpu.preload) {
if (unlikely(restore_fpu_checking(new)))
- __thread_fpu_end(new);
+ drop_init_fpu(new);
}
}

@@ -400,17 +438,6 @@ static inline int restore_xstate_sig(void __user *buf, int ia32_frame)
return __restore_xstate_sig(buf, buf_fx, size);
}

-static inline void __drop_fpu(struct task_struct *tsk)
-{
- if (__thread_has_fpu(tsk)) {
- /* Ignore delayed exceptions from user space */
- asm volatile("1: fwait\n"
- "2:\n"
- _ASM_EXTABLE(1b, 2b));
- __thread_fpu_end(tsk);
- }
-}
-
/*
* Need to be preemption-safe.
*
@@ -431,24 +458,18 @@ static inline void user_fpu_begin(void)
static inline void save_init_fpu(struct task_struct *tsk)
{
WARN_ON_ONCE(!__thread_has_fpu(tsk));
+
+ if (use_xsave()) {
+ xsave_state(&tsk->thread.fpu.state->xsave, -1);
+ return;
+ }
+
preempt_disable();
__save_init_fpu(tsk);
__thread_fpu_end(tsk);
preempt_enable();
}

-static inline void drop_fpu(struct task_struct *tsk)
-{
- /*
- * Forget coprocessor state..
- */
- tsk->fpu_counter = 0;
- preempt_disable();
- __drop_fpu(tsk);
- preempt_enable();
- clear_used_math();
-}
-
/*
* i387 state interaction
*/
@@ -503,12 +524,21 @@ static inline void fpu_free(struct fpu *fpu)
}
}

-static inline void fpu_copy(struct fpu *dst, struct fpu *src)
+static inline void fpu_copy(struct task_struct *dst, struct task_struct *src)
{
- memcpy(dst->state, src->state, xstate_size);
-}
+ if (use_xsave()) {
+ struct xsave_struct *xsave = &dst->thread.fpu.state->xsave;

-extern void fpu_finit(struct fpu *fpu);
+ memset(&xsave->xsave_hdr, 0, sizeof(struct xsave_hdr_struct));
+ xsave_state(xsave, -1);
+ } else {
+ struct fpu *dfpu = &dst->thread.fpu;
+ struct fpu *sfpu = &src->thread.fpu;
+
+ unlazy_fpu(src);
+ memcpy(dfpu->state, sfpu->state, xstate_size);
+ }
+}

static inline unsigned long
alloc_mathframe(unsigned long sp, int ia32_frame, unsigned long *buf_fx,
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 257d9cc..6c3bd37 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -19,6 +19,7 @@ struct pt_regs;
struct user_i387_struct;

extern int init_fpu(struct task_struct *child);
+extern void fpu_finit(struct fpu *fpu);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);
extern void math_state_restore(void);

diff --git a/arch/x86/include/asm/xsave.h b/arch/x86/include/asm/xsave.h
index c1d989a..2ddee1b8 100644
--- a/arch/x86/include/asm/xsave.h
+++ b/arch/x86/include/asm/xsave.h
@@ -34,6 +34,7 @@
extern unsigned int xstate_size;
extern u64 pcntxt_mask;
extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
+extern struct xsave_struct *init_xstate_buf;

extern void xsave_init(void);
extern void update_regset_xstate_info(unsigned int size, u64 xstate_mask);
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index ab6a2e8..5285574 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -22,7 +22,15 @@
/*
* Were we in an interrupt that interrupted kernel mode?
*
- * We can do a kernel_fpu_begin/end() pair *ONLY* if that
+ * For now, on xsave platforms we report an interrupted kernel
+ * FPU as not-idle. TBD: since we use non-lazy FPU restore on
+ * xsave platforms, ideally we could change the return value to
+ * something like __thread_has_fpu(current). But we would need
+ * to be careful about doing __thread_clear_has_fpu() before
+ * saving the FPU state, to support nested uses. For now, take
+ * the simple route!
+ *
+ * On others, we can do a kernel_fpu_begin/end() pair *ONLY* if that
* pair does nothing at all: the thread must not have fpu (so
* that we don't try to save the FPU state), and TS must
* be set (so that the clts/stts pair does nothing that is
@@ -30,6 +38,9 @@
*/
static inline bool interrupted_kernel_fpu_idle(void)
{
+ if (use_xsave())
+ return 0;
+
return !__thread_has_fpu(current) &&
(read_cr0() & X86_CR0_TS);
}
@@ -73,7 +84,7 @@ void kernel_fpu_begin(void)
__save_init_fpu(me);
__thread_clear_has_fpu(me);
/* We do 'stts()' in kernel_fpu_end() */
- } else {
+ } else if (!use_xsave()) {
this_cpu_write(fpu_owner_task, NULL);
clts();
}
@@ -82,7 +93,10 @@ EXPORT_SYMBOL(kernel_fpu_begin);

void kernel_fpu_end(void)
{
- stts();
+ if (use_xsave())
+ math_state_restore();
+ else
+ stts();
preempt_enable();
}
EXPORT_SYMBOL(kernel_fpu_end);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 30069d1..c21e30f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -66,15 +66,13 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
int ret;

- unlazy_fpu(src);
-
*dst = *src;
if (fpu_allocated(&src->thread.fpu)) {
memset(&dst->thread.fpu, 0, sizeof(dst->thread.fpu));
ret = fpu_alloc(&dst->thread.fpu);
if (ret)
return ret;
- fpu_copy(&dst->thread.fpu, &src->thread.fpu);
+ fpu_copy(dst, src);
}
return 0;
}
@@ -153,7 +151,13 @@ void flush_thread(void)

flush_ptrace_hw_breakpoint(tsk);
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
- drop_fpu(tsk);
+ drop_init_fpu(tsk);
+ /*
+ * Free the FPU state for non xsave platforms. They get reallocated
+ * lazily at the first use.
+ */
+ if (!use_xsave())
+ free_thread_xstate(tsk);
}

static void hard_disable_TSC(void)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 516fa18..b9ff83c 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -190,10 +190,6 @@ start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
regs->cs = __USER_CS;
regs->ip = new_ip;
regs->sp = new_sp;
- /*
- * Free the old FP and other extended state
- */
- free_thread_xstate(current);
}
EXPORT_SYMBOL_GPL(start_thread);

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 85151f3..6ef2d84 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -232,10 +232,6 @@ start_thread_common(struct pt_regs *regs, unsigned long new_ip,
regs->cs = _cs;
regs->ss = _ss;
regs->flags = X86_EFLAGS_IF;
- /*
- * Free the old FP and other extended state
- */
- free_thread_xstate(current);
}

void
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b481341..ac7d527 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -613,11 +613,12 @@ void math_state_restore(void)
}

__thread_fpu_begin(tsk);
+
/*
* Paranoid restore. send a SIGSEGV if we fail to restore the state.
*/
if (unlikely(restore_fpu_checking(tsk))) {
- __thread_fpu_end(tsk);
+ drop_init_fpu(tsk);
force_sig(SIGSEGV, tsk);
return;
}
@@ -629,6 +630,8 @@ EXPORT_SYMBOL_GPL(math_state_restore);
dotraplinkage void __kprobes
do_device_not_available(struct pt_regs *regs, long error_code)
{
+ BUG_ON(use_xsave());
+
#ifdef CONFIG_MATH_EMULATION
if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index f0bb844..55bf571 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -21,7 +21,7 @@ u64 pcntxt_mask;
/*
* Represents init state for the supported extended state.
*/
-static struct xsave_struct *init_xstate_buf;
+struct xsave_struct *init_xstate_buf;

static struct _fpx_sw_bytes fx_sw_reserved, fx_sw_reserved_ia32;
static unsigned int *xstate_offsets, *xstate_sizes, xstate_features;
@@ -267,7 +267,7 @@ int save_xstate_sig(void __user *buf, void __user *buf_fx, int size)
if (use_fxsr() && save_xstate_epilog(buf_fx, ia32_fxstate))
return -1;

- drop_fpu(tsk); /* trigger finit */
+ drop_init_fpu(tsk); /* trigger finit */

return 0;
}
@@ -339,7 +339,7 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
config_enabled(CONFIG_IA32_EMULATION));

if (!buf) {
- drop_fpu(tsk);
+ drop_init_fpu(tsk);
return 0;
}

@@ -379,15 +379,30 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
*/
struct xsave_struct *xsave = &tsk->thread.fpu.state->xsave;
struct user_i387_ia32_struct env;
+ int err = 0;

+		/*
+		 * Drop the current fpu, which clears used_math(). This
+		 * ensures that a context-switch during the copy of the
+		 * new state neither saves nor restores the half-copied
+		 * intermediate state, which would corrupt the newly
+		 * restored state. We will be ready to restore/save the
+		 * state only after set_used_math() is set again.
+		 */
drop_fpu(tsk);

if (__copy_from_user(xsave, buf_fx, state_size) ||
- __copy_from_user(&env, buf, sizeof(env)))
- return -1;
+ __copy_from_user(&env, buf, sizeof(env))) {
+ err = -1;
+ } else {
+ sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only);
+ set_used_math();
+ }

- sanitize_restored_xstate(tsk, &env, xstate_bv, fx_only);
- set_used_math();
+ if (use_xsave())
+ math_state_restore();
+
+ return err;
} else {
/*
* For 64-bit frames and 32-bit fsave frames, restore the user
@@ -395,7 +410,7 @@ int __restore_xstate_sig(void __user *buf, void __user *buf_fx, int size)
*/
user_fpu_begin();
if (restore_user_xstate(buf_fx, xstate_bv, fx_only)) {
- drop_fpu(tsk);
+ drop_init_fpu(tsk);
return -1;
}
}
@@ -434,11 +449,29 @@ static void prepare_fx_sw_frame(void)
*/
static inline void xstate_enable(void)
{
+ clts();
set_in_cr4(X86_CR4_OSXSAVE);
xsetbv(XCR_XFEATURE_ENABLED_MASK, pcntxt_mask);
}

/*
+ * This is the same as math_state_restore(), but use_xsave() is not
+ * yet patched to use math_state_restore().
+ */
+static inline void init_restore_xstate(void)
+{
+ init_fpu(current);
+ __thread_fpu_begin(current);
+ xrstor_state(init_xstate_buf, -1);
+}
+
+static inline void xstate_enable_ap(void)
+{
+ xstate_enable();
+ init_restore_xstate();
+}
+
+/*
* Record the offsets and sizes of different state managed by the xsave
* memory layout.
*/
@@ -478,7 +511,6 @@ static void __init setup_xstate_init(void)
__alignof__(struct xsave_struct));
init_xstate_buf->i387.mxcsr = MXCSR_DEFAULT;

- clts();
/*
* Init all the features state with header_bv being 0x0
*/
@@ -488,7 +520,6 @@ static void __init setup_xstate_init(void)
* of any feature which is not represented by all zero's.
*/
xsave_state(init_xstate_buf, -1);
- stts();
}

/*
@@ -532,6 +563,10 @@ static void __init xstate_enable_boot_cpu(void)

pr_info("enabled xstate_bv 0x%llx, cntxt size 0x%x\n",
pcntxt_mask, xstate_size);
+
+ current->thread.fpu.state =
+ alloc_bootmem_align(xstate_size, __alignof__(struct xsave_struct));
+ init_restore_xstate();
}

/*
@@ -550,6 +585,6 @@ void __cpuinit xsave_init(void)
return;

this_func = next_func;
- next_func = xstate_enable;
+ next_func = xstate_enable_ap;
this_func();
}
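
For context, the in-kernel usage pattern affected by the
kernel_fpu_begin/end() change looks roughly like this; xor_blocks_sse()
is a hypothetical call site sketched for illustration, not a function
from this series:

static void xor_blocks_sse(unsigned long bytes, unsigned long *p1,
			   unsigned long *p2)
{
	kernel_fpu_begin();	/* saves the current owner's FPU state */

	/* ... SSE/AVX loads, xors and stores on p1/p2 ... */

	kernel_fpu_end();	/* xsave: math_state_restore(), else stts() */
}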

2012-08-25 18:34:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/6] x86, fpu: cleanups, introduce non-lazy FPU restore for xsave

On Fri, Aug 24, 2012 at 2:33 PM, H. Peter Anvin <[email protected]> wrote:
> I have applied this to tip:x86/fpu, but I have also asked Suresh to
> prepare a followon patch to decouple eager save from the existence of
> the XSAVE instruction. It seems pretty clear that eager save is a net
> benefit in the presence of the XSAVEOPT, but it isn't as clear for only
> having XSAVE, as far as I can tell. Either way it would seem to be a
> policy decision that is somewhat separate from the exact instruction.

Agreed. It would also be good for testing bugs and/or performance for
both xsave and non-xsave cases.

Linus
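
One way to express that policy split is sketched below with a hypothetical
"eagerfpu=" boot parameter; the parameter name, the auto heuristic, the
use_eager_fpu() helper and the static_cpu_has(X86_FEATURE_XSAVEOPT) test
are assumptions illustrating the suggestion, not the follow-on patch
itself:

static int eagerfpu = -1;	/* -1: auto, 0: off, 1: on */

static int __init eager_fpu_setup(char *s)
{
	if (!strcmp(s, "on"))
		eagerfpu = 1;
	else if (!strcmp(s, "off"))
		eagerfpu = 0;
	return 1;
}
__setup("eagerfpu=", eager_fpu_setup);

static inline bool use_eager_fpu(void)
{
	if (eagerfpu >= 0)
		return eagerfpu;
	/* auto: eager save clearly pays off only with xsaveopt */
	return static_cpu_has(X86_FEATURE_XSAVEOPT);
}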

2012-08-26 11:30:16

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 2/6] x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()

On Fri, Aug 24, 2012 at 02:12:58PM -0700, Suresh Siddha wrote:
> Few lines below we do drop_fpu() which is more safer. Remove the
> unnecessary user_fpu_end() in save_xstate_sig(), which allows
> the drop_fpu() to ignore any pending exceptions from the user-space
> and drop the current fpu.
>
> Signed-off-by: Suresh Siddha <[email protected]>
> ---
> arch/x86/include/asm/fpu-internal.h | 17 +++--------------
> arch/x86/kernel/xsave.c | 1 -
> 2 files changed, 3 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
> index fe95ad0..fac39e9 100644
> --- a/arch/x86/include/asm/fpu-internal.h
> +++ b/arch/x86/include/asm/fpu-internal.h
> @@ -412,22 +412,11 @@ static inline void __drop_fpu(struct task_struct *tsk)
> }
>
> /*
> - * The actual user_fpu_begin/end() functions
> - * need to be preemption-safe.
> + * Need to be preemption-safe.
> *
> - * NOTE! user_fpu_end() must be used only after you
> - * have saved the FP state, and user_fpu_begin() must
> - * be used only immediately before restoring it.
> - * These functions do not do any save/restore on
> - * their own.
> + * NOTE! user_fpu_begin() must be used only immediately before restoring
> + * it. This function does not do any save/restore on their own.
> */
> -static inline void user_fpu_end(void)
> -{
> - preempt_disable();
> - __thread_fpu_end(current);
> - preempt_enable();
> -}

Now that user_fpu_begin has lost its counterpart user_fpu_end, maybe
rename it to user_fpu_init() or something similar which doesn't suggest
there's a _begin and an _end thingy?

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551