2012-02-19 22:23:24

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 0/2] More i387 state save/restore work


Ok, this is a series of two patches that continue my i387 state
save/restore series, but aren't necessarily worth it for Linux-3.3.

That said, the first one is a bug-fix - but it's an old bug, and I'm not
sure it can actually be triggered. The failure path for the FP state
preload is bogus - and always was. But I'm not sure it really *can* fail.

The first one has another small bugfix in it too, and I think that one may
be new to the rewritten FP state preloading - it doesn't update the
fpu_counter, so once it starts preloading, it never stops.

I wrote a silly FPU task switch testing program, which basically starts
two processes pinned to the same CPU, and then uses sched_yield() in both
to switch back-and-forth between them. *One* of the processes uses the FPU
between every yield, the other does not. It runs for two seconds, and
counts how many loops it gets through.

With that test, I get:

- Plain 3.3-rc4:

[torvalds@i5 ~]$ uname -r
3.3.0-rc4
[torvalds@i5 ~]$ ./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;
2216090 loops in 2 seconds
2216922 loops in 2 seconds
2217148 loops in 2 seconds
2232191 loops in 2 seconds
2186203 loops in 2 seconds
2231614 loops in 2 seconds

- With the first patch that fixes the FPU preloading to eventually stop:

[torvalds@i5 ~]$ uname -r
3.3.0-rc4-00001-g704ed737bd3c
[torvalds@i5 ~]$ ./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;
2306667 loops in 2 seconds
2295760 loops in 2 seconds
2295494 loops in 2 seconds
2296282 loops in 2 seconds
2282229 loops in 2 seconds
2301842 loops in 2 seconds

- With the second patch that does the lazy preloading

[torvalds@i5 ~]$ uname -r
3.3.0-rc4-00002-g022899d937f9
[torvalds@i5 ~]$ ./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;
2466973 loops in 2 seconds
2456168 loops in 2 seconds
2449863 loops in 2 seconds
2461588 loops in 2 seconds
2478256 loops in 2 seconds
2476844 loops in 2 seconds

so these things do make some difference. But it is also interesting to see
from profiles just how expensive setting CR0.TS is (the write to CR0 is
very expensive indeed), so even when you avoid the FP state restore
lazily, just setting TS in between task switches is still a big cost of
FPU save/restore.
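
A sketch of what those TS manipulations boil down to - simplified from the
kernel headers of this era, for illustration rather than the verbatim
definitions:

#define X86_CR0_TS 0x00000008

static inline void stts_sketch(void)
{
	unsigned long cr0;

	/* read-modify-write of CR0: the write to CR0 is serializing and slow */
	asm volatile("mov %%cr0,%0" : "=r" (cr0));
	asm volatile("mov %0,%%cr0" : : "r" (cr0 | X86_CR0_TS));
}

static inline void clts_sketch(void)
{
	/* dedicated instruction that just clears CR0.TS */
	asm volatile("clts");
}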

Linus Torvalds (2):
i387: use 'restore_fpu_checking()' directly in task switching code
i387: support lazy restore of FPU state

arch/x86/include/asm/i387.h | 48 +++++++++++++++++++++++++++----------
arch/x86/include/asm/processor.h | 3 +-
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 40 ++++++-------------------------
6 files changed, 49 insertions(+), 48 deletions(-)

Comments? I feel confident enough about these that I think they might even
work in 3.3, especially the first one. But I want people to look at
them.

Linus

--
1.7.9.188.g12766.dirty


2012-02-19 22:26:43

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 1/2] i387: use 'restore_fpu_checking()' directly in task switching code


From: Linus Torvalds <[email protected]>
Date: Sun, 19 Feb 2012 11:48:44 -0800

This inlines what is usually just a couple of instructions, but more
importantly it also fixes the theoretical error case (can that FPU
restore really ever fail? Maybe we should remove the checking).

We can't start sending signals from within the scheduler, we're much too
deep in the kernel and are holding the runqueue lock etc. So don't
bother even trying.

Also, make sure that we increment the fpu_counter even when we preload
the FP state, since that is also how we eventually decide to try not
preloading (when the byte counter overflows and becomes zero again).

Signed-off-by: Linus Torvalds <[email protected]>
---

The fpu_counter part in switch_fpu_finish() is technically separate, but
these are both bugs in the same preloading code, so I made it one patch.
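
As an aside, the "overflows and becomes zero" behaviour exists because
fpu_counter is a single byte in the task struct. A standalone sketch of the
wrap (the "> 5" test is the preload condition from switch_fpu_prepare()):

#include <stdio.h>

int main(void)
{
	unsigned char counter = 250;	/* fpu_counter is a byte, so it wraps */
	int i;

	for (i = 0; i < 10; i++)
		counter++;		/* 250..255, then wraps around to 0..4 */

	/* the preload test: false again once the counter has wrapped */
	printf("counter=%u preload=%d\n", counter, counter > 5);
	return 0;
}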

Can that paranoid restore actually fail? We're loading from kernel state,
stuff we saved there. But maybe you can mess things up on x86-32 by
messing up the signal stack restore state, which gets copied by hand into
the kernel buffer? Regardless, *if* it can fail, that signal sending from
the scheduler is seriously wrong, afaik.

arch/x86/include/asm/i387.h | 19 ++++++++++++++++---
arch/x86/kernel/traps.c | 40 ++++++++--------------------------------
2 files changed, 24 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index a850b4d8d14d..251c366289b9 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -29,7 +29,6 @@ extern unsigned int sig_xstate_size;
extern void fpu_init(void);
extern void mxcsr_feature_mask_init(void);
extern int init_fpu(struct task_struct *child);
-extern void __math_state_restore(struct task_struct *);
extern void math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

@@ -269,6 +268,16 @@ static inline int fpu_restore_checking(struct fpu *fpu)

static inline int restore_fpu_checking(struct task_struct *tsk)
{
+ /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
+ is pending. Clear the x87 state here by setting it to fixed
+ values. "m" is a random variable that should be in L1 */
+ alternative_input(
+ ASM_NOP8 ASM_NOP2,
+ "emms\n\t" /* clear stack tags */
+ "fildl %P[addr]", /* set F?P to defined value */
+ X86_FEATURE_FXSAVE_LEAK,
+ [addr] "m" (tsk->thread.has_fpu));
+
return fpu_restore_checking(&tsk->thread.fpu);
}

@@ -377,8 +386,12 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
*/
static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
{
- if (fpu.preload)
- __math_state_restore(new);
+ if (fpu.preload) {
+ new->fpu_counter++;
+ if (unlikely(restore_fpu_checking(new))) {
+ __thread_fpu_end(new);
+ }
+ }
}

/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 77da5b475ad2..4bbe04d96744 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -571,37 +571,6 @@ asmlinkage void __attribute__((weak)) smp_threshold_interrupt(void)
}

/*
- * This gets called with the process already owning the
- * FPU state, and with CR0.TS cleared. It just needs to
- * restore the FPU register state.
- */
-void __math_state_restore(struct task_struct *tsk)
-{
- /* We need a safe address that is cheap to find and that is already
- in L1. We've just brought in "tsk->thread.has_fpu", so use that */
-#define safe_address (tsk->thread.has_fpu)
-
- /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
- is pending. Clear the x87 state here by setting it to fixed
- values. safe_address is a random variable that should be in L1 */
- alternative_input(
- ASM_NOP8 ASM_NOP2,
- "emms\n\t" /* clear stack tags */
- "fildl %P[addr]", /* set F?P to defined value */
- X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (safe_address));
-
- /*
- * Paranoid restore. send a SIGSEGV if we fail to restore the state.
- */
- if (unlikely(restore_fpu_checking(tsk))) {
- __thread_fpu_end(tsk);
- force_sig(SIGSEGV, tsk);
- return;
- }
-}
-
-/*
* 'math_state_restore()' saves the current math information in the
* old math state array, and gets the new ones from the current task
*
@@ -631,7 +600,14 @@ void math_state_restore(void)
}

__thread_fpu_begin(tsk);
- __math_state_restore(tsk);
+ /*
+ * Paranoid restore. send a SIGSEGV if we fail to restore the state.
+ */
+ if (unlikely(restore_fpu_checking(tsk))) {
+ __thread_fpu_end(tsk);
+ force_sig(SIGSEGV, tsk);
+ return;
+ }

tsk->fpu_counter++;
}
--
1.7.9.188.g12766.dirty

2012-02-19 22:37:20

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 2/2] i387: support lazy restore of FPU state


From: Linus Torvalds <[email protected]>
Date: Sun, 19 Feb 2012 13:27:00 -0800

This makes us recognize when we try to restore FPU state that matches
what we already have in the FPU on this CPU, and avoids the restore
entirely if so.

To do this, we add two new data fields:

- a percpu 'fpu_owner_task' variable that gets written any time we
update the "has_fpu" field, and thus acts as a kind of back-pointer
to the task that owns the CPU. The exception is when we save the FPU
state as part of a context switch - if the save can keep the FPU
state around, we leave the 'fpu_owner_task' variable pointing at the
task whose FP state still remains on the CPU.

- a per-thread "last cpu" field, that indicates which CPU that thread
used its FPU on last.

These two fields together can be used when next switching back to the
task to see if the CPU still matches, and nobody else has taken over the
FPU state. In that case, we can avoid the 'f[x]rstor' entirely, and
just clear the CR0.TS bit.

Signed-off-by: Linus Torvalds <[email protected]>
---

The argument why this is "obviously correct" goes as follows (but somebody
*really* should double-check my logic):

- on *every* task switch from task A, we write A->thread.fpu.last_cpu,
whether we owned the FPU or not. And we only write a real CPU number in
the case where we owned it, and the FPU save left the state untouched
in the FPU.

- so when we switch into task A next time, comparing the current CPU
number with that 'last_cpu' field inarguably says "when I last switched
out, I really saved it on this CPU"

That, together with verifying that the per-cpu "fpu_owner_task" matches
"task A", guarantees that the state is really valid. Because we will
clear (or set to another task) fpu_owner_task if it ever gets
switched to anything else.

But somebody should really validate this. Think through all the
kernel_fpu_begin() etc cases. I think it looks pretty obvious, and it
really does seem to work and improve task switching, but...

Linus

---
arch/x86/include/asm/i387.h | 35 +++++++++++++++++++++++------------
arch/x86/include/asm/processor.h | 3 ++-
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
5 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 251c366289b9..2a361ed26901 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -32,6 +32,8 @@ extern int init_fpu(struct task_struct *child);
extern void math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

+DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
+
extern user_regset_active_fn fpregs_active, xfpregs_active;
extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
xstateregs_get;
@@ -276,7 +278,7 @@ static inline int restore_fpu_checking(struct task_struct *tsk)
"emms\n\t" /* clear stack tags */
"fildl %P[addr]", /* set F?P to defined value */
X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (tsk->thread.has_fpu));
+ [addr] "m" (tsk->thread.fpu.has_fpu));

return fpu_restore_checking(&tsk->thread.fpu);
}
@@ -288,19 +290,21 @@ static inline int restore_fpu_checking(struct task_struct *tsk)
*/
static inline int __thread_has_fpu(struct task_struct *tsk)
{
- return tsk->thread.has_fpu;
+ return tsk->thread.fpu.has_fpu;
}

/* Must be paired with an 'stts' after! */
static inline void __thread_clear_has_fpu(struct task_struct *tsk)
{
- tsk->thread.has_fpu = 0;
+ tsk->thread.fpu.has_fpu = 0;
+ percpu_write(fpu_owner_task, NULL);
}

/* Must be paired with a 'clts' before! */
static inline void __thread_set_has_fpu(struct task_struct *tsk)
{
- tsk->thread.has_fpu = 1;
+ tsk->thread.fpu.has_fpu = 1;
+ percpu_write(fpu_owner_task, tsk);
}

/*
@@ -345,19 +349,23 @@ typedef struct { int preload; } fpu_switch_t;
* We don't do that yet, so "fpu_lazy_restore()" always returns
* false, but some day..
*/
-#define fpu_lazy_restore(tsk) (0)
-#define fpu_lazy_state_intact(tsk) do { } while (0)
+static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
+{
+ return new == percpu_read_stable(fpu_owner_task) &&
+ cpu == new->thread.fpu.last_cpu;
+}

-static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new)
+static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu)
{
fpu_switch_t fpu;

fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
if (__thread_has_fpu(old)) {
- if (__save_init_fpu(old))
- fpu_lazy_state_intact(old);
- __thread_clear_has_fpu(old);
+ if (!__save_init_fpu(old))
+ cpu = ~0;
old->fpu_counter++;
+ old->thread.fpu.last_cpu = cpu;
+ old->thread.fpu.has_fpu = 0; /* But leave fpu_owner_task! */

/* Don't change CR0.TS if we just switch! */
if (fpu.preload) {
@@ -367,8 +375,9 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
stts();
} else {
old->fpu_counter = 0;
+ old->thread.fpu.last_cpu = ~0;
if (fpu.preload) {
- if (fpu_lazy_restore(new))
+ if (fpu_lazy_restore(new, cpu))
fpu.preload = 0;
else
prefetch(new->thread.fpu.state);
@@ -464,8 +473,10 @@ static inline void kernel_fpu_begin(void)
__save_init_fpu(me);
__thread_clear_has_fpu(me);
/* We do 'stts()' in kernel_fpu_end() */
- } else
+ } else {
+ percpu_write(fpu_owner_task, NULL);
clts();
+ }
}

static inline void kernel_fpu_end(void)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f7c89e231c6c..58545c97d071 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -374,6 +374,8 @@ union thread_xstate {
};

struct fpu {
+ unsigned int last_cpu;
+ unsigned int has_fpu;
union thread_xstate *state;
};

@@ -454,7 +456,6 @@ struct thread_struct {
unsigned long trap_no;
unsigned long error_code;
/* floating point and extended processor state */
- unsigned long has_fpu;
struct fpu fpu;
#ifdef CONFIG_X86_32
/* Virtual 86 mode info */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d43cad74f166..b667148dfad7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1044,6 +1044,8 @@ DEFINE_PER_CPU(char *, irq_stack_ptr) =

DEFINE_PER_CPU(unsigned int, irq_count) = -1;

+DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
+
/*
* Special IST stacks which the CPU switches to when it calls
* an IST-marked descriptor entry. Up to 7 stacks (hardware
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 80bfe1ab0031..15e6c6494e82 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -303,7 +303,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

- fpu = switch_fpu_prepare(prev_p, next_p);
+ fpu = switch_fpu_prepare(prev_p, next_p, cpu);

/*
* Reload esp0.
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1fd94bc4279d..b6ba67d76402 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -388,7 +388,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
unsigned fsindex, gsindex;
fpu_switch_t fpu;

- fpu = switch_fpu_prepare(prev_p, next_p);
+ fpu = switch_fpu_prepare(prev_p, next_p, cpu);

/*
* Reload esp0, LDT and the page table pointer:
--
1.7.9.188.g12766.dirty

2012-02-19 22:44:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: support lazy restore of FPU state

On 02/19/2012 02:37 PM, Linus Torvalds wrote:
>
> - on *every* task switch from task A, we write A->thread.fpu.last_cpu,
> whether we owned the FPU or not. And we only write a real CPU number in
> the case where we owned it, and the FPU save left the state untouched
> in the FPU.
>
> - so when we switch into task A next time, comparing the current CPU
> number with that 'last_cpu' field inarguably says "when I last switched
> out, I really saved it on this CPU"
>
> That, together with verifying that the per-cpu "fpu_owner_task" matches
> "task A", guarantees that the state is really valid. Because we will
> clear (or set to another task) fpu_owner_task if it ever gets
> switched to anything else.
>
> But somebody should really validate this. Think through all the
> kernel_fpu_begin() etc cases. I think it looks pretty obvious, and it
> really does seem to work and improve task switching, but...
>

I think your logic is correct but suboptimal.

What would make more sense to me is that we write last_cpu when we
*load* the state. After all, if you didn't load the state you couldn't
have modified it. In kernel_fpu_begin, *if* we end up flushing the
state, we should set last_cpu to -1 indicating that *no* CPU currently
owns the state -- after all, even on this CPU we would now have to
reload the state from memory.
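
(A minimal sketch of that variant, using a hypothetical helper name against
the fields from Linus's patch, would move the bookkeeping to the load side:)

/*
 * Hypothetical: record FPU ownership only at the point where a task's
 * state is actually loaded into the FPU, instead of on every switch.
 */
static inline void fpu_mark_loaded(struct task_struct *tsk, unsigned int cpu)
{
	tsk->thread.fpu.last_cpu = cpu;
	percpu_write(fpu_owner_task, tsk);
}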

Does that make sense?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-19 23:18:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: support lazy restore of FPU state

On 02/19/2012 02:44 PM, H. Peter Anvin wrote:
> On 02/19/2012 02:37 PM, Linus Torvalds wrote:
>>
>> - on *every* task switch from task A, we write A->thread.fpu.last_cpu,
>> whether we owned the FPU or not. And we only write a real CPU number in
>> the case where we owned it, and the FPU save left the state untouched
>> in the FPU.
>>
>> - so when we switch into task A next time, comparing the current CPU
>> number with that 'last_cpu' field inarguably says "when I last switched
>> out, I really saved it on this CPU"
>>
>> That, together with verifying that the per-cpu "fpu_owner_task" matches
>> "task A", guarantees that the state is really valid. Because we will
>> clear (or set to another task) fpu_owner_task if it ever gets
>> switched to anything else.
>>
>> But somebody should really validate this. Think through all the
>> kernel_fpu_begin() etc cases. I think it looks pretty obvious, and it
>> really does seem to work and improve task switching, but...
>>
>
> I think your logic is correct but suboptimal.
>
> What would make more sense to me is that we write last_cpu when we
> *load* the state. After all, if you didn't load the state you couldn't
> have modified it. In kernel_fpu_begin, *if* we end up flushing the
> state, we should set last_cpu to -1 indicating that *no* CPU currently
> owns the state -- after all, even on this CPU we would now have to
> reload the state from memory.
>

This is obviously wrong for kernel_fpu_begin... what we should do there
is to just set fpu_owner_task to NULL as we no longer have any task's
content in the fpu; no need to muck with last_cpu though.

-hpa

2012-02-19 23:56:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: support lazy restore of FPU state

On Sun, Feb 19, 2012 at 2:44 PM, H. Peter Anvin <[email protected]> wrote:
>
> I think your logic is correct but suboptimal.

I do agree. But I had two or three previous versions of this that all
worked fine, but that I decided simply weren't safe.

So at some point I just decided that "optimal" was less important than
"simple to think about".

For example, one of the things I originally wanted to do was to be
able to switch to another CPU and back - and if the process didn't use
the FPU on the other CPU, and nothing else used the FPU on the
original one, we would just restore the state.

It's definitely doable (with these same fields), but I decided that
it's not something we actually care about from a performance angle,
and just thinking about it made me worry more than I wanted about the
correctness angle.

Which was why I ended up with that simpler approach.

> What would make more sense to me is that we write last_cpu when we
> *load* the state.

Yes. But writing last_cpu on every context switch, and *only* using a
valid CPU number if the save actually successfully left the state
untouched just made it easier for me to think about it. Then "last_cpu
matches a cpu" validates not just the CPU, but also "and we actually
saved it on this CPU at the *last* context switch".

Because there are multiple ways to use the FPU, and not all of them
come from restoring the math state. There's a few cases where we
initialize it from scratch without restoring it, for example. I
decided I just didn't want to worry about it.

> In kernel_fpu_begin, *if* we end up flushing the state, we should set
> last_cpu to -1 indicating that *no* CPU currently owns the state

No, you really want to use the per-cpu data there, not the thread
data. Because the process that has a 'last_cpu' pointing to this cpu
may not be running right now - that is, after all, the whole point of
the lazy restore: we cache FPU state of a process that isn't even
active.

But this is exactly the kind of thing I got wrong at least twice
before I decided to not even try to be clever about it.

And even now I still would prefer others to look over even my totally
non-subtle logic, just in case there was something else I forgot
about.

But hey, if you can convince me with a counter-patch that "obviously
works", I won't argue too much. I'm just explaining why I ended up
deciding on the stupid-but-fairly-straightforward approach.

Linus

2012-02-20 00:53:39

by Michael Neuling

[permalink] [raw]
Subject: Re: [PATCH 0/2] More i387 state save/restore work

Linus,

> Ok, this is a series of two patches that continue my i387 state
> save/restore series, but aren't necessarily worth it for Linux-3.3.

We have similar lazy save/restore code on powerpc here:

http://lists.ozlabs.org/pipermail/linuxppc-dev/2010-December/087422.html

With your test, it looks like you're getting about a 10% performance
boost. For VSX registers on powerpc we got about 8% with a similar
micro-benchmark. We were a little disappointed it took such a
tailored/synthetic micro-benchmark to get such modest performance
improvements.

> That said, the first one is a bug-fix - but it's an old bug, and I'm not
> sure it can actually be triggered. The failure path for the FP state
> preload is bogus - and always was. But I'm not sure it really *can* fail.
>
> The first one has another small bugfix in it too, and I think that one may
> be new to the rewritten FP state preloading - it doesn't update the
> fpu_counter, so once it starts preloading, it never stops.
>
> I wrote a silly FPU task switch testing program, which basically starts
> two processes pinned to the same CPU, and then uses sched_yield() in both
> to switch back-and-forth between them. *One* of the processes uses the FPU
> between every yield, the other does not. It runs for two seconds, and
> counts how many loops it gets through.

> With that test, I get:
>
> - Plain 3.3-rc4:
>
> [torvalds@i5 ~]$ uname -r
> 3.3.0-rc4
> [torvalds@i5 ~]$ ./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;
> 2216090 loops in 2 seconds
> 2216922 loops in 2 seconds
> 2217148 loops in 2 seconds
> 2232191 loops in 2 seconds
> 2186203 loops in 2 seconds
> 2231614 loops in 2 seconds
>
> - With the first patch that fixes the FPU preloading to eventually stop:
>
> [torvalds@i5 ~]$ uname -r
> 3.3.0-rc4-00001-g704ed737bd3c
> [torvalds@i5 ~]$ ./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;
> 2306667 loops in 2 seconds
> 2295760 loops in 2 seconds
> 2295494 loops in 2 seconds
> 2296282 loops in 2 seconds
> 2282229 loops in 2 seconds
> 2301842 loops in 2 seconds
>
> - With the second patch that does the lazy preloading
>
> [torvalds@i5 ~]$ uname -r
> 3.3.0-rc4-00002-g022899d937f9
> [torvalds@i5 ~]$ ./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;./a.out ;
> 2466973 loops in 2 seconds
> 2456168 loops in 2 seconds
> 2449863 loops in 2 seconds
> 2461588 loops in 2 seconds
> 2478256 loops in 2 seconds
> 2476844 loops in 2 seconds

Does "2476844 loops in 2 seconds" imply 2476844 context switches in 2
sec? With Anton's context_switch [1] benchmark, we don't even hit 100K
context switches per sec.

Do you have this test program anywhere?

Mikey

1. http://ozlabs.org/~anton/junkcode/context_switch.c

> so these things do make some difference. But it is also interesting to see
> from profiles just how expensive setting CR0.TS is (the write to CR0 is
> very expensive indeed), so even when you avoid the FP state restore
> lazily, just setting TS in between task switches is still a big cost of
> FPU save/restore.
>
>
> Linus Torvalds (2):
> i387: use 'restore_fpu_checking()' directly in task switching code
> i387: support lazy restore of FPU state
>
> arch/x86/include/asm/i387.h | 48 +++++++++++++++++++++++----------
> arch/x86/include/asm/processor.h | 3 +-
> arch/x86/kernel/cpu/common.c | 2 +
> arch/x86/kernel/process_32.c | 2 +-
> arch/x86/kernel/process_64.c | 2 +-
> arch/x86/kernel/traps.c | 40 ++++++-------------------------
> 6 files changed, 49 insertions(+), 48 deletions(-)
>
> Comments? I feel confident enough about these that I think they might even
> work in 3.3, especially the first one. But I want people to look at
> them.
>
> Linus
>
> --
> 1.7.9.188.g12766.dirty
>

2012-02-20 01:04:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] More i387 state save/restore work

On Sun, Feb 19, 2012 at 4:53 PM, Michael Neuling <[email protected]> wrote:
>
> Does "2476844 loops in 2 seconds" imply 2476844 context switches in 2
> sec? With Anton's context_switch [1] benchmark, we don't even hit 100K
> context switches per sec.
>
> Do you have this test program anywhere?

Here. No guarantees that this is at all sane, it's special-cased code
literally for testing only this one issue. The only indication I have
that this works at all is that the numbers did change roughly as
expected, and the kernel profile changes made sense.

You may have to do something else than just do some FP calculation to
force it to use VSX on PPC, obviously.

Linus
---
[torvalds@i5 ~]$ cat fp-switch.c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

#define SECONDS (2)
unsigned long loops = 0, child;

static void use_math(void)
{
double x = 0;
asm volatile("":"+m" (x));
x += 1;
asm volatile("":"+m" (x));
}

static void end(int signr)
{
printf("%d loops in %d seconds\n", loops, SECONDS);
kill(child, SIGKILL);
exit(0);
}

int main(int argc, char **argv)
{
cpu_set_t cpuset;

CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);

sched_setaffinity(0, sizeof(cpu_set_t), &cpuset);

child = fork();
if (!child) {
for (;;)
sched_yield();
}

signal(SIGALRM, end);
alarm(SECONDS);

for (;;) {
use_math();
sched_yield();
loops++;
}
}

2012-02-20 01:06:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] More i387 state save/restore work

On Sun, Feb 19, 2012 at 5:03 PM, Linus Torvalds
<[email protected]> wrote:
>
> You may have to do something else than just do some FP calculation to
> force it to use VSX on PPC, obviously.

And don't tell me that I should use pipes to bounce a byte back and
forth instead of just relying on sched_yield() hopefully doing what I
wanted it to do even for non-RT case. Yeah, I probably should, but
this was simpler and I'm a moron.

Linus

2012-02-20 01:11:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/2] More i387 state save/restore work

Oh, and final comment - looking at the thing you pointed at, it looks
much more adventurous than my x86 FP state thing.

I always save things unconditionally, so that I don't have to do the
IPI or just in general care about the "oops, now I want things in
memory, not in some random CPU FP state". So mine is just a "writeback
cache", and only optimizes the reading things back: there is never any
dirty state in the CPU except when the process is actively using it.

That obviously does mean that I only optimize away the restore side,
not the save side. But it's *way* simpler, and considering that I just
spent almost a week trying to figure out FP state save bugs, simple is
good.

Linus

2012-02-20 02:09:42

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH 0/2] More i387 state save/restore work

On Mon, February 20, 2012 02:03, Linus Torvalds wrote:
> On Sun, Feb 19, 2012 at 4:53 PM, Michael Neuling <[email protected]> wrote:
>>
>> Does "2476844 loops in 2 seconds" imply 2476844 context switches in 2
>> sec? With Anton's context_switch [1] benchmark, we don't even hit 100K
>> context switches per sec.

No, it implies 2476844 context switches per second, because it only counts
the loops in one process, and it takes two context switches to switch away
and back again.

My numbers for context_switch.c are 418K for no VDSO/FPU and 413K with FPU.
Linus' test program gets:

1050525 loops in 2 seconds with FPU
1150258 loops in 2 seconds with use_math() commented.

So it seems that the overhead of doing the pipe thing is quite high compared
to sched_yield().

These numbers are for an old Pentium M pinned at 1.4GHz, so getting only
100K seems very bad.

>>
>> Do you have this test program anywhere?
>
> Here. No guarantees that this is at all sane, it's special-cased code
> literally for testing only this one issue. The only indication I have
> that this works at all is that the numbers did change roughly as
> expected, and the kernel profile changes made sense.

I tested both programs, and your loops per second is half the context
switches according to vmstat, so it works as expected. I haven't tested
your FPU patches though.

Greetings,

Indan

2012-02-20 07:51:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: support lazy restore of FPU state


* Linus Torvalds <[email protected]> wrote:

> +DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);

> +static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
> +{
> + return new == percpu_read_stable(fpu_owner_task) &&
> + cpu == new->thread.fpu.last_cpu;
> +}

I guess the CPU hotplug case deserves a comment in the code: CPU
hotplug + replug of the same (but meanwhile reset) CPU is safe
because fpu_owner_task[cpu] gets reset to NULL.

Likewise, if we hot-unplug two CPUs and then insert the second
one, it's the same CPU index but not the same FPU state anymore.
There too the CPU hotplug code resetting fpu_owner_task to NULL
makes this optimization safe.
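
(Purely to illustrate that invariant - this is not the actual mechanism in
the tree - the reset could be written as an old-style hotplug notifier:)

/* illustration only: forget the cached FPU owner when a CPU goes down */
static int fpu_cpu_notify(struct notifier_block *nb,
			  unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;

	if (action == CPU_DEAD)
		per_cpu(fpu_owner_task, cpu) = NULL;
	return NOTIFY_OK;
}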

Thanks,

Ingo

2012-02-20 19:46:42

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH v2 0/3] More i387 state save/restore work


This is a slightly updated patch-series that does the same thing my
previous one did, just fixing a few special cases.

In particular, I noticed that the 'fpu_counter' situation was still
rather confused in the last patch-series, and instead of having patch#1
fix up a part of it as part of re-organizing the FP state restore code,
this just fixes the fpu_counter confusion separately. Especially since
I wanted to clear the fpu_counter at fork time for new processes, just
to make sure that we never try to lazily switch to stale FPU state
(re-used task struct pointer and all that).

So patch#1 is that fpu_counter work.

The two other patches are basically the same as last time, except with
slightly updated commit logs and obviously updated for the fpu_counter
fixes (it is, for example, wrong to update the counter at FPU preload
time: if we do the lazy restore, there won't be any explicit preload, but
the fpu_counter should still update - so that affected the preloading
thing).

I've also done more testing of the series, although some of that shows
funny effects: it's actually faster on my machine (and in my specialized
tests) to switch between two processes where *one* uses the FPU with
these lazy restore patches than it is to switch between two non-FPU
users.

However, that seems to be due to some funny cache interaction: the
profiles clearly show that '__switch_to()' itself is much faster if it
doesn't have to work with any FPU state at all. But probably because of
some random cache replacement detail, I then get more switches with the
more expensive __switch_to when I only save the FP state (without ever
restoring it).

Regardless, I'm pretty happy with this series, and I think I'll commit
and push out at least the two first patches just because they fix real
(albeit not very important) bugs. I feel like I'm going to do the third
one too for 3.3, but I'm not sure yet. And comments welcome.

Linus


Linus Torvalds (3):
i387: fix up some fpu_counter confusion
i387: use 'restore_fpu_checking()' directly in task switching code
i387: support lazy restore of FPU state

arch/x86/include/asm/i387.h | 53 +++++++++++++++++++++++++++----------
arch/x86/include/asm/processor.h | 3 +-
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/process_32.c | 3 +-
arch/x86/kernel/process_64.c | 3 +-
arch/x86/kernel/traps.c | 40 +++++-----------------------
6 files changed, 54 insertions(+), 50 deletions(-)

--
1.7.9.188.g12766.dirty

2012-02-20 19:47:28

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH v2 1/3] i387: fix up some fpu_counter confusion


From: Linus Torvalds <[email protected]>
Date: Mon, 20 Feb 2012 10:24:09 -0800
Subject: [PATCH v2 1/3] i387: fix up some fpu_counter confusion

This makes sure we clear the FPU usage counter for newly created tasks,
just so that we start off in a known state (for example, don't try to
preload the FPU state on the first task switch etc).

It also fixes a thinko in when we increment the fpu_counter at task
switch time, introduced by commit 34ddc81a230b ("i387: re-introduce FPU
state preloading at context switch time"). We should increment the
*new* task fpu_counter, not the old task, and only if we decide to use
that state (whether lazily or preloaded).

Signed-off-by: Linus Torvalds <[email protected]>
---

That nonsensical "old->fpu_counter++" actually got the right result in
most common cases (because in the case of a preload and holding on to
the FPU for the whole timeslice - which is the normal case - it acted
the same way as if we had incremented the count at FPU load time), but
it really makes no sense.

And the "clear counters at fork time" don't really matter until you do
lazy restore, but it annoyed me that we started out a new process with
basically random fpu_counter values from the old one.

arch/x86/include/asm/i387.h | 3 ++-
arch/x86/kernel/process_32.c | 1 +
arch/x86/kernel/process_64.c | 1 +
3 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index a850b4d8d14d..8df95849721d 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -348,10 +348,10 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
if (__save_init_fpu(old))
fpu_lazy_state_intact(old);
__thread_clear_has_fpu(old);
- old->fpu_counter++;

/* Don't change CR0.TS if we just switch! */
if (fpu.preload) {
+ new->fpu_counter++;
__thread_set_has_fpu(new);
prefetch(new->thread.fpu.state);
} else
@@ -359,6 +359,7 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
} else {
old->fpu_counter = 0;
if (fpu.preload) {
+ new->fpu_counter++;
if (fpu_lazy_restore(new))
fpu.preload = 0;
else
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 80bfe1ab0031..bc32761bc27a 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -214,6 +214,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,

task_user_gs(p) = get_user_gs(regs);

+ p->fpu_counter = 0;
p->thread.io_bitmap_ptr = NULL;
tsk = current;
err = -ENOMEM;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 1fd94bc4279d..8ad880b3bc1c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -286,6 +286,7 @@ int copy_thread(unsigned long clone_flags, unsigned long sp,

set_tsk_thread_flag(p, TIF_FORK);

+ p->fpu_counter = 0;
p->thread.io_bitmap_ptr = NULL;

savesegment(gs, p->thread.gsindex);
--
1.7.9.188.g12766.dirty

2012-02-20 19:48:09

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH v2 2/3] i387: use 'restore_fpu_checking()' directly in task switching code


From: Linus Torvalds <[email protected]>
Date: Sun, 19 Feb 2012 11:48:44 -0800
Subject: [PATCH v2 2/3] i387: use 'restore_fpu_checking()' directly in task switching code

This inlines what is usually just a couple of instructions, but more
importantly it also fixes the theoretical error case (can that FPU
restore really ever fail? Maybe we should remove the checking).

We can't start sending signals from within the scheduler, we're much too
deep in the kernel and are holding the runqueue lock etc. So don't
bother even trying.

Signed-off-by: Linus Torvalds <[email protected]>
---

I do think the "load bad FP state from user stack at a signal restore,
and then get an exception on task switch preload and try to send a
signal while holding the runqueue lock" is a real possibility on x86-32,
so this *may* actually be -stable material too. But somebody should
really double-check my thinking.

arch/x86/include/asm/i387.h | 17 ++++++++++++++---
arch/x86/kernel/traps.c | 40 ++++++++--------------------------------
2 files changed, 22 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 8df95849721d..74c607b37e87 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -29,7 +29,6 @@ extern unsigned int sig_xstate_size;
extern void fpu_init(void);
extern void mxcsr_feature_mask_init(void);
extern int init_fpu(struct task_struct *child);
-extern void __math_state_restore(struct task_struct *);
extern void math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

@@ -269,6 +268,16 @@ static inline int fpu_restore_checking(struct fpu *fpu)

static inline int restore_fpu_checking(struct task_struct *tsk)
{
+ /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
+ is pending. Clear the x87 state here by setting it to fixed
+ values. "m" is a random variable that should be in L1 */
+ alternative_input(
+ ASM_NOP8 ASM_NOP2,
+ "emms\n\t" /* clear stack tags */
+ "fildl %P[addr]", /* set F?P to defined value */
+ X86_FEATURE_FXSAVE_LEAK,
+ [addr] "m" (tsk->thread.has_fpu));
+
return fpu_restore_checking(&tsk->thread.fpu);
}

@@ -378,8 +387,10 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
*/
static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
{
- if (fpu.preload)
- __math_state_restore(new);
+ if (fpu.preload) {
+ if (unlikely(restore_fpu_checking(new)))
+ __thread_fpu_end(new);
+ }
}

/*
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 77da5b475ad2..4bbe04d96744 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -571,37 +571,6 @@ asmlinkage void __attribute__((weak)) smp_threshold_interrupt(void)
}

/*
- * This gets called with the process already owning the
- * FPU state, and with CR0.TS cleared. It just needs to
- * restore the FPU register state.
- */
-void __math_state_restore(struct task_struct *tsk)
-{
- /* We need a safe address that is cheap to find and that is already
- in L1. We've just brought in "tsk->thread.has_fpu", so use that */
-#define safe_address (tsk->thread.has_fpu)
-
- /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
- is pending. Clear the x87 state here by setting it to fixed
- values. safe_address is a random variable that should be in L1 */
- alternative_input(
- ASM_NOP8 ASM_NOP2,
- "emms\n\t" /* clear stack tags */
- "fildl %P[addr]", /* set F?P to defined value */
- X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (safe_address));
-
- /*
- * Paranoid restore. send a SIGSEGV if we fail to restore the state.
- */
- if (unlikely(restore_fpu_checking(tsk))) {
- __thread_fpu_end(tsk);
- force_sig(SIGSEGV, tsk);
- return;
- }
-}
-
-/*
* 'math_state_restore()' saves the current math information in the
* old math state array, and gets the new ones from the current task
*
@@ -631,7 +600,14 @@ void math_state_restore(void)
}

__thread_fpu_begin(tsk);
- __math_state_restore(tsk);
+ /*
+ * Paranoid restore. send a SIGSEGV if we fail to restore the state.
+ */
+ if (unlikely(restore_fpu_checking(tsk))) {
+ __thread_fpu_end(tsk);
+ force_sig(SIGSEGV, tsk);
+ return;
+ }

tsk->fpu_counter++;
}
--
1.7.9.188.g12766.dirty

2012-02-20 19:48:38

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH v2 3/3] i387: support lazy restore of FPU state


From: Linus Torvalds <[email protected]>
Date: Sun, 19 Feb 2012 13:27:00 -0800
Subject: [PATCH v2 3/3] i387: support lazy restore of FPU state

This makes us recognize when we try to restore FPU state that matches
what we already have in the FPU on this CPU, and avoids the restore
entirely if so.

To do this, we add two new data fields:

- a percpu 'fpu_owner_task' variable that gets written any time we
update the "has_fpu" field, and thus acts as a kind of back-pointer
to the task that owns the CPU. The exception is when we save the FPU
state as part of a context switch - if the save can keep the FPU
state around, we leave the 'fpu_owner_task' variable pointing at the
task whose FP state still remains on the CPU.

- a per-thread 'last_cpu' field, that indicates which CPU that thread
used its FPU on last. We update this on every context switch
(writing an invalid CPU number if the last context switch didn't
leave the FPU in a lazily usable state), so we know that *that*
thread has done nothing else with the FPU since.

These two fields together can be used when next switching back to the
task to see if the CPU still matches: if 'fpu_owner_task' matches the
task we are switching to, we know that no other task (or kernel FPU
usage) touched the FPU on this CPU in the meantime, and if the current
CPU number matches the 'last_cpu' field, we know that this thread did no
other FP work on any other CPU, so the FPU state on the CPU must match
what was saved on last context switch.

In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
CR0.TS bit.

Signed-off-by: Linus Torvalds <[email protected]>
---

No changes since last time, although the fpu_counter changes nearby
means that the patch itself isn't quite identical. And there's a bit
more commentary on why it's being so anal on task switch time in the
commit log.

Kernel profiles and just timing things both agree: this works. That
said, I don't know how much it matters in real life, but it seems to be
the RightThing(tm) to do, and it really falls out quite nicely without a
lot of complexity.

arch/x86/include/asm/i387.h | 35 +++++++++++++++++++++++------------
arch/x86/include/asm/processor.h | 3 ++-
arch/x86/kernel/cpu/common.c | 2 ++
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
5 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 74c607b37e87..247904945d3f 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -32,6 +32,8 @@ extern int init_fpu(struct task_struct *child);
extern void math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

+DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
+
extern user_regset_active_fn fpregs_active, xfpregs_active;
extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
xstateregs_get;
@@ -276,7 +278,7 @@ static inline int restore_fpu_checking(struct task_struct *tsk)
"emms\n\t" /* clear stack tags */
"fildl %P[addr]", /* set F?P to defined value */
X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (tsk->thread.has_fpu));
+ [addr] "m" (tsk->thread.fpu.has_fpu));

return fpu_restore_checking(&tsk->thread.fpu);
}
@@ -288,19 +290,21 @@ static inline int restore_fpu_checking(struct task_struct *tsk)
*/
static inline int __thread_has_fpu(struct task_struct *tsk)
{
- return tsk->thread.has_fpu;
+ return tsk->thread.fpu.has_fpu;
}

/* Must be paired with an 'stts' after! */
static inline void __thread_clear_has_fpu(struct task_struct *tsk)
{
- tsk->thread.has_fpu = 0;
+ tsk->thread.fpu.has_fpu = 0;
+ percpu_write(fpu_owner_task, NULL);
}

/* Must be paired with a 'clts' before! */
static inline void __thread_set_has_fpu(struct task_struct *tsk)
{
- tsk->thread.has_fpu = 1;
+ tsk->thread.fpu.has_fpu = 1;
+ percpu_write(fpu_owner_task, tsk);
}

/*
@@ -345,18 +349,22 @@ typedef struct { int preload; } fpu_switch_t;
* We don't do that yet, so "fpu_lazy_restore()" always returns
* false, but some day..
*/
-#define fpu_lazy_restore(tsk) (0)
-#define fpu_lazy_state_intact(tsk) do { } while (0)
+static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
+{
+ return new == percpu_read_stable(fpu_owner_task) &&
+ cpu == new->thread.fpu.last_cpu;
+}

-static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new)
+static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu)
{
fpu_switch_t fpu;

fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
if (__thread_has_fpu(old)) {
- if (__save_init_fpu(old))
- fpu_lazy_state_intact(old);
- __thread_clear_has_fpu(old);
+ if (!__save_init_fpu(old))
+ cpu = ~0;
+ old->thread.fpu.last_cpu = cpu;
+ old->thread.fpu.has_fpu = 0; /* But leave fpu_owner_task! */

/* Don't change CR0.TS if we just switch! */
if (fpu.preload) {
@@ -367,9 +375,10 @@ static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct ta
stts();
} else {
old->fpu_counter = 0;
+ old->thread.fpu.last_cpu = ~0;
if (fpu.preload) {
new->fpu_counter++;
- if (fpu_lazy_restore(new))
+ if (fpu_lazy_restore(new, cpu))
fpu.preload = 0;
else
prefetch(new->thread.fpu.state);
@@ -463,8 +472,10 @@ static inline void kernel_fpu_begin(void)
__save_init_fpu(me);
__thread_clear_has_fpu(me);
/* We do 'stts()' in kernel_fpu_end() */
- } else
+ } else {
+ percpu_write(fpu_owner_task, NULL);
clts();
+ }
}

static inline void kernel_fpu_end(void)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index f7c89e231c6c..58545c97d071 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -374,6 +374,8 @@ union thread_xstate {
};

struct fpu {
+ unsigned int last_cpu;
+ unsigned int has_fpu;
union thread_xstate *state;
};

@@ -454,7 +456,6 @@ struct thread_struct {
unsigned long trap_no;
unsigned long error_code;
/* floating point and extended processor state */
- unsigned long has_fpu;
struct fpu fpu;
#ifdef CONFIG_X86_32
/* Virtual 86 mode info */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d43cad74f166..b667148dfad7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1044,6 +1044,8 @@ DEFINE_PER_CPU(char *, irq_stack_ptr) =

DEFINE_PER_CPU(unsigned int, irq_count) = -1;

+DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
+
/*
* Special IST stacks which the CPU switches to when it calls
* an IST-marked descriptor entry. Up to 7 stacks (hardware
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index bc32761bc27a..c08d1ff12b7c 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -304,7 +304,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

- fpu = switch_fpu_prepare(prev_p, next_p);
+ fpu = switch_fpu_prepare(prev_p, next_p, cpu);

/*
* Reload esp0.
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 8ad880b3bc1c..cfa5c90c01db 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -389,7 +389,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
unsigned fsindex, gsindex;
fpu_switch_t fpu;

- fpu = switch_fpu_prepare(prev_p, next_p);
+ fpu = switch_fpu_prepare(prev_p, next_p, cpu);

/*
* Reload esp0, LDT and the page table pointer:
--
1.7.9.188.g12766.dirty

2012-02-21 01:50:25

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Mon, Feb 20, 2012 at 2:48 PM, Linus Torvalds
<[email protected]> wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Sun, 19 Feb 2012 13:27:00 -0800
> Subject: [PATCH v2 3/3] i387: support lazy restore of FPU state
>
> This makes us recognize when we try to restore FPU state that matches
> what we already have in the FPU on this CPU, and avoids the restore
> entirely if so.
>
> To do this, we add two new data fields:
>
> - a percpu 'fpu_owner_task' variable that gets written any time we
> update the "has_fpu" field, and thus acts as a kind of back-pointer
> to the task that owns the CPU. The exception is when we save the FPU
> state as part of a context switch - if the save can keep the FPU
> state around, we leave the 'fpu_owner_task' variable pointing at the
> task whose FP state still remains on the CPU.
>
> - a per-thread 'last_cpu' field, that indicates which CPU that thread
> used its FPU on last. We update this on every context switch
> (writing an invalid CPU number if the last context switch didn't
> leave the FPU in a lazily usable state), so we know that *that*
> thread has done nothing else with the FPU since.
>
> These two fields together can be used when next switching back to the
> task to see if the CPU still matches: if 'fpu_owner_task' matches the
> task we are switching to, we know that no other task (or kernel FPU
> usage) touched the FPU on this CPU in the meantime, and if the current
> CPU number matches the 'last_cpu' field, we know that this thread did no
> other FP work on any other CPU, so the FPU state on the CPU must match
> what was saved on last context switch.
>
> In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
> CR0.TS bit.
>
> Signed-off-by: Linus Torvalds <[email protected]>

I haven't tried really figuring this out yet, but building the Fedora kernel
on x86_64 with your latest tree results in:

ERROR: "fpu_owner_task" [lib/raid6/raid6_pq.ko] undefined!
ERROR: "fpu_owner_task" [arch/x86/kvm/kvm.ko] undefined!
ERROR: "fpu_owner_task" [arch/x86/crypto/sha1-ssse3.ko] undefined!
ERROR: "fpu_owner_task" [arch/x86/crypto/serpent-sse2-x86_64.ko] undefined!
ERROR: "fpu_owner_task" [arch/x86/crypto/ghash-clmulni-intel.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
+ exit 1

Since this patch went in as 7e16838d94b566a1, I'm guessing it's at least
related.

I'm building again with more verbose output but I thought I'd send this out
quickly.

josh

2012-02-21 02:10:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Mon, Feb 20, 2012 at 5:50 PM, Josh Boyer <[email protected]> wrote:
>
> I haven't tried really figuring this out yet, but building the Fedora kernel
> on x86_64 with your latest tree results in:
>
> ERROR: "fpu_owner_task" [lib/raid6/raid6_pq.ko] undefined!

Ugh. My dislike of modules on my machines strikes again, and
apparently nobody else tested the patches I sent out.

The attached trivial patch fixes it, I bet.

Although I do wonder if we should just make kernel_fpu_begin() be a
real function instead of inlining it. I'm not sure it makes sense to
inline that thing, and it might be better to export that one instead.
Comments?
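
(Roughly, that would mean moving the inline body from the patch above into
arch/x86/kernel/i387.c and exporting the function instead of the percpu
variable - a sketch, with the irq/warn checks elided:)

void kernel_fpu_begin(void)
{
	struct task_struct *me = current;

	preempt_disable();
	if (__thread_has_fpu(me)) {
		__save_init_fpu(me);
		__thread_clear_has_fpu(me);
		/* the matching 'stts()' happens in kernel_fpu_end() */
	} else {
		percpu_write(fpu_owner_task, NULL);
		clts();
	}
}
EXPORT_SYMBOL(kernel_fpu_begin);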

Linus


Attachments:
patch.diff (548.00 B)

2012-02-21 02:12:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On 02/20/2012 05:50 PM, Josh Boyer wrote:
>
> I haven't tried really figuring this out yet, but building the Fedora kernel
> on x86_64 with your latest tree results in:
>
> ERROR: "fpu_owner_task" [lib/raid6/raid6_pq.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/kvm/kvm.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/crypto/sha1-ssse3.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/crypto/serpent-sse2-x86_64.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/crypto/ghash-clmulni-intel.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> + exit 1
>
> Since this patch went in as 7e16838d94b566a1, I'm guessing it's at least
> related.
>
> I'm building again with more verbose output but I thought I'd send this out
> quickly.
>

fpu_owner_task needs an EXPORT_SYMBOL_GPL().

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-21 02:14:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On 02/20/2012 06:10 PM, Linus Torvalds wrote:
> On Mon, Feb 20, 2012 at 5:50 PM, Josh Boyer <[email protected]> wrote:
>>
>> I haven't tried really figuring this out yet, but building the Fedora kernel
>> on x86_64 with your latest tree results in:
>>
>> ERROR: "fpu_owner_task" [lib/raid6/raid6_pq.ko] undefined!
>
> Ugh. My dislike of modules on my machines strikes again, and
> apparently nobody else tested the patches I sent out.
>
> The attached trivial patch fixes it, I bet.
>
> Although I do wonder if we should just make kernel_fpu_begin() be a
> real function instead of inlining it. I'm not sure it makes sense to
> inline that thing, and it might be better to export that one instead.
> Comments?
>

I would agree with that.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-21 02:19:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Mon, Feb 20, 2012 at 6:10 PM, Linus Torvalds
<[email protected]> wrote:
>
> The attached trivial patch fixes it, I bet.

Actually, it doesn't fix it on x86-32, because we actually have an
#ifdef CONFIG_X86_64 around the "current_task" definition due to
pointless differences in how we do that on x86-64 and x86-32.

So much for the "common" part of "arch/x86/kernel/cpu/common.c"

> Although I do wonder if we should just make kernel_fpu_begin() be a
> real function instead of inlining it. I'm not sure it makes sense to
> inline that thing, and it might be better to export that one instead.

I do think that would be better in the long run, but for now here's an
updated "trivial" patch to fix it.

I want the fpu_owner_task to be declared next to the cache-hot
task-switching stuff, and since they are different on 32-bit and
64-bit (for no really good reason), that gets duplicated too. Sad.

Linus


Attachments:
patch.diff (897.00 B)

2012-02-21 02:32:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On 02/20/2012 06:18 PM, Linus Torvalds wrote:
>
>> Although I do wonder if we should just make kernel_fpu_begin() be a
>> real function instead of inlining it. I'm not sure it makes sense to
>> inline that thing, and it might be better to export that one instead.
>
> I do think that would be better in the long run, but for now here's an
> updated "trivial" patch to fix it.
>

There is actually another very good reason for out-of-lining
kernel_fpu_begin/_end: it will act as a compiler barrier if we ever do
compiler-generated SSE or AVX code, which we may very well want to do.

In fact, there may even be good reason for something like:

kernel_fpu_use(void (*func)(void *), void *arg);

... where the FPU-using stuff is actively embedded in the call.
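
(Sketched out, assuming the existing kernel_fpu_begin()/kernel_fpu_end()
pair, that could look like:)

/* run 'func' with kernel FPU access; the call through a function pointer
   also provides the compiler barrier mentioned above */
static void kernel_fpu_use(void (*func)(void *), void *arg)
{
	kernel_fpu_begin();
	func(arg);
	kernel_fpu_end();
}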

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-21 02:36:52

by Jongman Heo

[permalink] [raw]
Subject: Re: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state


> Sender : Josh Boyer<[email protected]>
> Date : 2012-02-21 10:50 (GMT+09:00)
> Title : Re: [PATCH v2 3/3] i387: support lazy restore of FPU state
>
> > On Mon, Feb 20, 2012 at 2:48 PM, Linus Torvalds
> > wrote:
> >
> > From: Linus Torvalds
> > Date: Sun, 19 Feb 2012 13:27:00 -0800
> > Subject: [PATCH v2 3/3] i387: support lazy restore of FPU state
> >
> > This makes us recognize when we try to restore FPU state that matches
> > what we already have in the FPU on this CPU, and avoids the restore
> > entirely if so.
> >
> > To do this, we add two new data fields:
> >
> > - a percpu 'fpu_owner_task' variable that gets written any time we
> > update the "has_fpu" field, and thus acts as a kind of back-pointer
> > to the task that owns the CPU. The exception is when we save the FPU
> > state as part of a context switch - if the save can keep the FPU
> > state around, we leave the 'fpu_owner_task' variable pointing at the
> > task whose FP state still remains on the CPU.
> >
> > - a per-thread 'last_cpu' field, that indicates which CPU that thread
> > used its FPU on last. We update this on every context switch
> > (writing an invalid CPU number if the last context switch didn't
> > leave the FPU in a lazily usable state), so we know that *that*
> > thread has done nothing else with the FPU since.
> >
> > These two fields together can be used when next switching back to the
> > task to see if the CPU still matches: if 'fpu_owner_task' matches the
> > task we are switching to, we know that no other task (or kernel FPU
> > usage) touched the FPU on this CPU in the meantime, and if the current
> > CPU number matches the 'last_cpu' field, we know that this thread did no
> > other FP work on any other CPU, so the FPU state on the CPU must match
> > what was saved on last context switch.
> >
> > In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
> > CR0.TS bit.
> >
> > Signed-off-by: Linus Torvalds
>
> I haven't tried really figuring this out yet, but building the Fedora kernel
> on x86_64 with your latest tree results in:
>
> ERROR: "fpu_owner_task" [lib/raid6/raid6_pq.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/kvm/kvm.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/crypto/sha1-ssse3.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/crypto/serpent-sse2-x86_64.ko] undefined!
> ERROR: "fpu_owner_task" [arch/x86/crypto/ghash-clmulni-intel.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> + exit 1
>
> Since this patch went in as 7e16838d94b566a1, I'm guessing it's at least
> related.
>
> I'm building again with more verbose output but I thought I'd send this out
> quickly.
>
> josh

Similar here, with a 32-bit x86 build.

[snip]
LD net/built-in.o
LD vmlinux.o
MODPOST vmlinux.o
GEN .version
CHK include/generated/compile.h
UPD include/generated/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
arch/x86/built-in.o: In function `__thread_clear_has_fpu':
/usr/src/linux/arch/x86/include/asm/i387.h:300: undefined reference to `fpu_owner_task'
arch/x86/built-in.o: In function `__thread_set_has_fpu':
/usr/src/linux/arch/x86/include/asm/i387.h:307: undefined reference to `fpu_owner_task'
arch/x86/built-in.o: In function `fpu_lazy_restore':
/usr/src/linux/arch/x86/include/asm/i387.h:354: undefined reference to `fpu_owner_task'
arch/x86/built-in.o: In function `__thread_set_has_fpu':
/usr/src/linux/arch/x86/include/asm/i387.h:307: undefined reference to `fpu_owner_task'
arch/x86/built-in.o: In function `__thread_clear_has_fpu':
/usr/src/linux/arch/x86/include/asm/i387.h:300: undefined reference to `fpu_owner_task'
arch/x86/built-in.o:/usr/src/linux/arch/x86/include/asm/i387.h:300: more undefined references to `fpu_owner_task' follow
make: *** [.tmp_vmlinux1] Error


In case you need my .config, please let me know~.

Jongman Heo.

2012-02-21 02:47:11

by Jongman Heo

[permalink] [raw]
Subject: Re: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state


> Sender : Linus Torvalds<[email protected]>
> Date : 2012-02-21 11:18 (GMT+09:00)
> Title : Re: [PATCH v2 3/3] i387: support lazy restore of FPU state
>
> On Mon, Feb 20, 2012 at 6:10 PM, Linus Torvalds
> wrote:
> >
> > The attached trivial patch fixes it, I bet.
>
> Actually, it doesn't fix it on x86-32, because we actually have an
> #ifdef CONFIG_X86_64 around the "current_task" definition due to
> pointless differences in how we do that on x86-64 and x86-32.
>
> So much for the "common" part of "arch/x86/kernel/cpu/common.c"
>
> > Although I do wonder if we should just make kernel_fpu_begin() be a
> > real function instead of inlining it. I'm not sure it makes sense to
> > inline that thing, and it might be better to export that one instead.
>
> I do think that would be better in the long run, but for now here's an
> updated "trivial" patch to fix it.
>
> I want the fpu_owner_task to be declared next to the cache-hot
> task-switching stuff, and since they are different on 32-bit and
> 64-bit (for no really good reason), that gets duplicated too. Sad.
>
> Linus

Yeah, this patch fixes my x86-32 build.

Thanks,
Jongman Heo.

2012-02-21 05:28:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Mon, Feb 20, 2012 at 6:14 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/20/2012 06:10 PM, Linus Torvalds wrote:
>>
>> Although I do wonder if we should just make kernel_fpu_begin() be a
>> real function instead of inlining it. I'm not sure it makes sense to
>> inline that thing, and it might be better to export that one instead.
>> Comments?
>
> I would agree with that.

So I have a patch that does that, but it's noticeably bigger.

It uninlines a fair amount of i387.h, and moves it into i387.c. I do
think it's probably the right thing to do, though.

I did a "make allmodconfig" with this on x86-64, but it's quite
possible that x86-32 does additional cases. Does this patch work for
people?

(This is *on*top*of* the quick "let's just get it to work" patch that
just exports the new percpu variable. I already committed that and
pushed it out, since I wanted a quick fix so that people wouldn't be
held up by this)

IOW, if you can try this on top of current -git, that would be lovely.

Linus


Attachments:
uninline-i387.diff (5.66 kB)

2012-02-21 05:35:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

Running build tests now; will check results in the morning.

Linus Torvalds <[email protected]> wrote:

>On Mon, Feb 20, 2012 at 6:14 PM, H. Peter Anvin <[email protected]> wrote:
>> On 02/20/2012 06:10 PM, Linus Torvalds wrote:
>>>
>>> Although I do wonder if we should just make kernel_fpu_begin() be a
>>> real function instead of inlining it. I'm not sure it makes sense to
>>> inline that thing, and it might be better to export that one
>>> instead.
>>> Comments?
>>
>> I would agree with that.
>
>So I have a patch that does that, but it's noticeably bigger.
>
>It uninlines a fair amount of i387.h, and moves it into i387.c. I do
>think it's probably the right thing to do, though.
>
>I did a "make allmodconfig" with this on x86-64, but it's quite
>possible that x86-32 does additional cases. Does this patch work for
>people?
>
>(This is *on*top*of* the quick "let's just get it to work" patch that
>just exports the new percpu variable. I already committed that and
>pushed it out, since I wanted a quick fix so that people wouldn't be
>held up by this)
>
>IOW, if you can try this on top of current -git, that would be lovely.
>
> Linus

--
Sent from my mobile phone. Please excuse my brevity and lack of formatting.

2012-02-21 14:19:26

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Tue, Feb 21, 2012 at 12:27 AM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Feb 20, 2012 at 6:14 PM, H. Peter Anvin <[email protected]> wrote:
>> On 02/20/2012 06:10 PM, Linus Torvalds wrote:
>>>
>>> Although I do wonder if we should just make kernel_fpu_begin() be a
>>> real function instead of inlining it. I'm not sure it makes sense to
>>> inline that thing, and it might be better to export that one instead.
>>> Comments?
>>
>> I would agree with that.
>
> So I have a patch that does that, but it's noticeably bigger.
>
> It uninlines a fair amount of i387.h, and moves it into i387.c. I do
> think it's probably the right thing to do, though.
>
> I did a "make allmodconfig" with this on x86-64, but it's quite
> possible that x86-32 does additional cases. Does this patch work for
> people?
>
> (This is *on*top*of* the quick "let's just get it to work" patch that
> just exports the new percpu variable. I already committed that and
> pushed it out, since I wanted a quick fix so that people wouldn't be
> held up by this)
>
> IOW, if you can try this on top of current -git, that would be lovely.

The quick patch cleared up the build issue I was seeing. Thanks for the quick
fix. I'll try building with this one a bit later today as well.

josh

2012-02-21 18:00:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On 02/20/2012 09:27 PM, Linus Torvalds wrote:
>
> IOW, if you can try this on top of current -git, that would be lovely.
>

This passed all my build tests overnight.

I don't know if unlazy_fpu() being out of line will cause a context
switch performance loss, though.

Otherwise:

Acked-by: H. Peter Anvin <[email protected]>

Let me know if you'd prefer me to pull this into the -tip tree.

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-21 18:07:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state


* H. Peter Anvin <[email protected]> wrote:

> On 02/20/2012 09:27 PM, Linus Torvalds wrote:
> >
> > IOW, if you can try this on top of current -git, that would be lovely.
> >
>
> This passed all my build tests overnight.
>
> I don't know if unlazy_fpu() being out of line will cause a context
> switch performance loss, though.
>
> Otherwise:
>
> Acked-by: H. Peter Anvin <[email protected]>
>
> Let me know if you'd prefer me to pull this into the -tip tree.

I think we should put it into tip:x86/fpu or so, and run it
through all the regular testing channels and queue it up for
v3.4.

Thanks,

Ingo

2012-02-21 18:26:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Tue, Feb 21, 2012 at 9:59 AM, H. Peter Anvin <[email protected]> wrote:
>
> I don't know if unlazy_fpu() being out of line will cause a context
> switch performance loss, though.

unlazy_fpu() is no longer used for context switching - it's only used
for unlazying the FPU.

Part of the whole FP mess is that there were all these *insane*
dependencies where these things were used for different semantic
issues. That's largely fixed now, and things have a single semantic
use.

> Acked-by: H. Peter Anvin <[email protected]>
>
> Let me know if you'd prefer me to pull this into the -tip tree.

Yes, but I'll send you this with a proper commit log. I'll also send
you another patch that splits "<asm/i387.h>" into two.

Right now <asm/i387.h> has two different kinds of users: the first is
the "normal kernel use" kind of things (ie kernel_fpu_begin() and friends)
that really don't care about the internals very deeply. But the second
class of user is the actual i387 internal implementation that is
used by i387.c and xsave.c etc to actually implement the exposed
interfaces.

As a result, <asm/i387.h> is this mixture of exposed interfaces and
"deep internal knowledge". And as a result, that deep internal
knowledge kind of accidentally gets exposed to code that really
shouldn't be exposed to it.

I have a patch that fixes that. I'll send you a series of two patches
with sign-offs and commentary asap.

Linus

2012-02-21 21:15:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On 02/21/2012 10:26 AM, Linus Torvalds wrote:
>
> Yes, but I'll send you this with a proper commit log. I'll also send
> you another patch that splits "<asm/i387.h>" into two.
>
> Right now <asm/i387.h> has two different kinds of users: the first is
> the "normal kernel use" kind of things (ie kernel_fpu_begin() and friends)
> that really don't care about the internals very deeply. But the second
> class of user is the actual i387 internal implementation that is
> used by i387.c and xsave.c etc to actually implement the exposed
> interfaces.
>
> As a result, <asm/i387.h> is this mixture of exposed interfaces and
> "deep internal knowledge". And as a result, that deep internal
> knowledge kind of accidentally gets exposed to code that really
> shouldn't be exposed to it.
>
> I have a patch that fixes that. I'll send you a series of two patches
> with sign-offs and commentary asap.
>

Thanks!

-hpa

2012-02-21 21:39:52

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 0/2] i387: FP state interface cleanups


From: Linus Torvalds <[email protected]>
Date: Tue, 21 Feb 2012 13:32:49 -0800
Subject: [PATCH 0/2] i387: FP state interface cleanups

This cleans up some of our FP-state mess. The aim is to expose much
less of the internal interfaces and implementation to various users that
really don't care.

We shouldn't export our internal 'fpu_owner_task' tracking to modules,
and we shouldn't expose all the low-level FP state save/restore code to
code that simply must never use it anyway.

This passed allmodconfig on x86-32 and -64.

Linus

Linus Torvalds (2):
i387: uninline the generic FP helpers that we expose to kernel
modules
i387: split up <asm/i387.h> into exported and internal interfaces

arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/fpu-internal.h | 520 ++++++++++++++++++++++++++++++
arch/x86/include/asm/i387.h | 590 +----------------------------------
arch/x86/kernel/cpu/common.c | 3 +-
arch/x86/kernel/i387.c | 83 +++++-
arch/x86/kernel/process.c | 1 +
arch/x86/kernel/process_32.c | 1 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/ptrace.c | 1 +
arch/x86/kernel/signal.c | 1 +
arch/x86/kernel/traps.c | 1 +
arch/x86/kernel/xsave.c | 1 +
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 1 +
arch/x86/power/cpu.c | 1 +
15 files changed, 624 insertions(+), 584 deletions(-)
create mode 100644 arch/x86/include/asm/fpu-internal.h

--
1.7.9.188.g12766.dirty

2012-02-21 21:40:31

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 1/2] i387: uninline the generic FP helpers that we expose to kernel modules


From: Linus Torvalds <[email protected]>
Date: Tue, 21 Feb 2012 10:25:45 -0800
Subject: [PATCH 1/2] i387: uninline the generic FP helpers that we expose to kernel modules

Instead of exporting the very low-level internals of the FPU state
save/restore code (ie things like 'fpu_owner_task'), we should export
the higher-level interfaces.

Inlining these things is pointless anyway: sure, sometimes the end
result is small, but while 'stts()' can result in just three x86
instructions, those are not cheap instructions (writing %cr0 is a
serializing instruction and a very slow one at that).

So the overhead of a function call is not noticeable, and we really
don't want random modules mucking about with our internal state save
logic anyway.

So this unexports 'fpu_owner_task', and instead uninlines and exports
the actual functions that modules can use: kernel_fpu_begin/end() and
unlazy_fpu().

Acked-by: H. Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
---

This is the same patch I already sent out earlier, now with a commit
message and Peter's ack added.
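
[For reference, the "three x86 instructions" mentioned above: stts() at
this point in the tree is, if memory serves, a one-line macro doing a
cr0 read-modify-write --

	/* asm/system.h (paraphrased, not quoted from the tree): */
	#define stts() write_cr0(read_cr0() | X86_CR0_TS)

-- and the write_cr0() is the serializing, slow part.]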

arch/x86/include/asm/i387.h | 78 ++--------------------------------------
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/i387.c | 80 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 84 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 247904945d3f..0c1031d354f2 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -419,70 +419,9 @@ static inline void __clear_fpu(struct task_struct *tsk)
}
}

-/*
- * Were we in an interrupt that interrupted kernel mode?
- *
- * We can do a kernel_fpu_begin/end() pair *ONLY* if that
- * pair does nothing at all: the thread must not have fpu (so
- * that we don't try to save the FPU state), and TS must
- * be set (so that the clts/stts pair does nothing that is
- * visible in the interrupted kernel thread).
- */
-static inline bool interrupted_kernel_fpu_idle(void)
-{
- return !__thread_has_fpu(current) &&
- (read_cr0() & X86_CR0_TS);
-}
-
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
- *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
- */
-static inline bool interrupted_user_mode(void)
-{
- struct pt_regs *regs = get_irq_regs();
- return regs && user_mode_vm(regs);
-}
-
-/*
- * Can we use the FPU in kernel mode with the
- * whole "kernel_fpu_begin/end()" sequence?
- *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
- */
-static inline bool irq_fpu_usable(void)
-{
- return !in_interrupt() ||
- interrupted_user_mode() ||
- interrupted_kernel_fpu_idle();
-}
-
-static inline void kernel_fpu_begin(void)
-{
- struct task_struct *me = current;
-
- WARN_ON_ONCE(!irq_fpu_usable());
- preempt_disable();
- if (__thread_has_fpu(me)) {
- __save_init_fpu(me);
- __thread_clear_has_fpu(me);
- /* We do 'stts()' in kernel_fpu_end() */
- } else {
- percpu_write(fpu_owner_task, NULL);
- clts();
- }
-}
-
-static inline void kernel_fpu_end(void)
-{
- stts();
- preempt_enable();
-}
+extern bool irq_fpu_usable(void);
+extern void kernel_fpu_begin(void);
+extern void kernel_fpu_end(void);

/*
* Some instructions like VIA's padlock instructions generate a spurious
@@ -566,16 +505,7 @@ static inline void save_init_fpu(struct task_struct *tsk)
preempt_enable();
}

-static inline void unlazy_fpu(struct task_struct *tsk)
-{
- preempt_disable();
- if (__thread_has_fpu(tsk)) {
- __save_init_fpu(tsk);
- __thread_fpu_end(tsk);
- } else
- tsk->fpu_counter = 0;
- preempt_enable();
-}
+extern void unlazy_fpu(struct task_struct *tsk);

static inline void clear_fpu(struct task_struct *tsk)
{
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0f7d68d318f..cb71b01ab66e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1045,7 +1045,6 @@ DEFINE_PER_CPU(char *, irq_stack_ptr) =
DEFINE_PER_CPU(unsigned int, irq_count) = -1;

DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
-EXPORT_PER_CPU_SYMBOL(fpu_owner_task);

/*
* Special IST stacks which the CPU switches to when it calls
@@ -1115,7 +1114,6 @@ void debug_stack_reset(void)
DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);
DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
-EXPORT_PER_CPU_SYMBOL(fpu_owner_task);

#ifdef CONFIG_CC_STACKPROTECTOR
DEFINE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 739d8598f789..17b7549c4134 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -32,6 +32,86 @@
# define user32_fxsr_struct user_fxsr_struct
#endif

+/*
+ * Were we in an interrupt that interrupted kernel mode?
+ *
+ * We can do a kernel_fpu_begin/end() pair *ONLY* if that
+ * pair does nothing at all: the thread must not have fpu (so
+ * that we don't try to save the FPU state), and TS must
+ * be set (so that the clts/stts pair does nothing that is
+ * visible in the interrupted kernel thread).
+ */
+static inline bool interrupted_kernel_fpu_idle(void)
+{
+ return !__thread_has_fpu(current) &&
+ (read_cr0() & X86_CR0_TS);
+}
+
+/*
+ * Were we in user mode (or vm86 mode) when we were
+ * interrupted?
+ *
+ * Doing kernel_fpu_begin/end() is ok if we are running
+ * in an interrupt context from user mode - we'll just
+ * save the FPU state as required.
+ */
+static inline bool interrupted_user_mode(void)
+{
+ struct pt_regs *regs = get_irq_regs();
+ return regs && user_mode_vm(regs);
+}
+
+/*
+ * Can we use the FPU in kernel mode with the
+ * whole "kernel_fpu_begin/end()" sequence?
+ *
+ * It's always ok in process context (ie "not interrupt")
+ * but it is sometimes ok even from an irq.
+ */
+bool irq_fpu_usable(void)
+{
+ return !in_interrupt() ||
+ interrupted_user_mode() ||
+ interrupted_kernel_fpu_idle();
+}
+EXPORT_SYMBOL(irq_fpu_usable);
+
+void kernel_fpu_begin(void)
+{
+ struct task_struct *me = current;
+
+ WARN_ON_ONCE(!irq_fpu_usable());
+ preempt_disable();
+ if (__thread_has_fpu(me)) {
+ __save_init_fpu(me);
+ __thread_clear_has_fpu(me);
+ /* We do 'stts()' in kernel_fpu_end() */
+ } else {
+ percpu_write(fpu_owner_task, NULL);
+ clts();
+ }
+}
+EXPORT_SYMBOL(kernel_fpu_begin);
+
+void kernel_fpu_end(void)
+{
+ stts();
+ preempt_enable();
+}
+EXPORT_SYMBOL(kernel_fpu_end);
+
+void unlazy_fpu(struct task_struct *tsk)
+{
+ preempt_disable();
+ if (__thread_has_fpu(tsk)) {
+ __save_init_fpu(tsk);
+ __thread_fpu_end(tsk);
+ } else
+ tsk->fpu_counter = 0;
+ preempt_enable();
+}
+EXPORT_SYMBOL(unlazy_fpu);
+
#ifdef CONFIG_MATH_EMULATION
# define HAVE_HWFP (boot_cpu_data.hard_math)
#else
--
1.7.9.188.g12766.dirty

2012-02-21 21:41:09

by Linus Torvalds

[permalink] [raw]
Subject: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces


From: Linus Torvalds <[email protected]>
Date: Tue, 21 Feb 2012 13:19:22 -0800
Subject: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module. But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained. Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <[email protected]>
---

This took longer to generate than I expected, because I actually did
allmodconfig builds on both x86-32 and -64.

Btw, I really don't like what arch/x86/kvm/ does with CR0.TS and the FP
state. I'm not at all sure that's all kosher. But I don't know the
code, so I just made sure that at no point did any of the semantics
change.

Linus

arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/fpu-internal.h | 520 +++++++++++++++++++++++++++++++++++
arch/x86/include/asm/i387.h | 512 +----------------------------------
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/i387.c | 3 +-
arch/x86/kernel/process.c | 1 +
arch/x86/kernel/process_32.c | 1 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/ptrace.c | 1 +
arch/x86/kernel/signal.c | 1 +
arch/x86/kernel/traps.c | 1 +
arch/x86/kernel/xsave.c | 1 +
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 1 +
arch/x86/power/cpu.c | 1 +
15 files changed, 540 insertions(+), 508 deletions(-)
create mode 100644 arch/x86/include/asm/fpu-internal.h

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 65577698cab2..5563ba1cf513 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -24,6 +24,7 @@
#include <asm/ucontext.h>
#include <asm/uaccess.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/ptrace.h>
#include <asm/ia32_unistd.h>
#include <asm/user32.h>
diff --git a/arch/x86/include/asm/fpu-internal.h b/arch/x86/include/asm/fpu-internal.h
new file mode 100644
index 000000000000..1709fc78884b
--- /dev/null
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -0,0 +1,520 @@
+/*
+ * Copyright (C) 1994 Linus Torvalds
+ *
+ * Pentium III FXSR, SSE support
+ * General FPU state handling cleanups
+ * Gareth Hughes <[email protected]>, May 2000
+ * x86-64 work by Andi Kleen 2002
+ */
+
+#ifndef _FPU_INTERNAL_H
+#define _FPU_INTERNAL_H
+
+#include <linux/kernel_stat.h>
+#include <linux/regset.h>
+#include <linux/slab.h>
+#include <asm/asm.h>
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+#include <asm/sigcontext.h>
+#include <asm/user.h>
+#include <asm/uaccess.h>
+#include <asm/xsave.h>
+
+extern unsigned int sig_xstate_size;
+extern void fpu_init(void);
+
+DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
+
+extern user_regset_active_fn fpregs_active, xfpregs_active;
+extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
+ xstateregs_get;
+extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
+ xstateregs_set;
+
+
+/*
+ * xstateregs_active == fpregs_active. Please refer to the comment
+ * at the definition of fpregs_active.
+ */
+#define xstateregs_active fpregs_active
+
+extern struct _fpx_sw_bytes fx_sw_reserved;
+#ifdef CONFIG_IA32_EMULATION
+extern unsigned int sig_xstate_ia32_size;
+extern struct _fpx_sw_bytes fx_sw_reserved_ia32;
+struct _fpstate_ia32;
+struct _xstate_ia32;
+extern int save_i387_xstate_ia32(void __user *buf);
+extern int restore_i387_xstate_ia32(void __user *buf);
+#endif
+
+#ifdef CONFIG_MATH_EMULATION
+extern void finit_soft_fpu(struct i387_soft_struct *soft);
+#else
+static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}
+#endif
+
+#define X87_FSW_ES (1 << 7) /* Exception Summary */
+
+static __always_inline __pure bool use_xsaveopt(void)
+{
+ return static_cpu_has(X86_FEATURE_XSAVEOPT);
+}
+
+static __always_inline __pure bool use_xsave(void)
+{
+ return static_cpu_has(X86_FEATURE_XSAVE);
+}
+
+static __always_inline __pure bool use_fxsr(void)
+{
+ return static_cpu_has(X86_FEATURE_FXSR);
+}
+
+extern void __sanitize_i387_state(struct task_struct *);
+
+static inline void sanitize_i387_state(struct task_struct *tsk)
+{
+ if (!use_xsaveopt())
+ return;
+ __sanitize_i387_state(tsk);
+}
+
+#ifdef CONFIG_X86_64
+static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
+{
+ int err;
+
+ /* See comment in fxsave() below. */
+#ifdef CONFIG_AS_FXSAVEQ
+ asm volatile("1: fxrstorq %[fx]\n\t"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b, 3b)
+ : [err] "=r" (err)
+ : [fx] "m" (*fx), "0" (0));
+#else
+ asm volatile("1: rex64/fxrstor (%[fx])\n\t"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b, 3b)
+ : [err] "=r" (err)
+ : [fx] "R" (fx), "m" (*fx), "0" (0));
+#endif
+ return err;
+}
+
+static inline int fxsave_user(struct i387_fxsave_struct __user *fx)
+{
+ int err;
+
+ /*
+ * Clear the bytes not touched by the fxsave and reserved
+ * for the SW usage.
+ */
+ err = __clear_user(&fx->sw_reserved,
+ sizeof(struct _fpx_sw_bytes));
+ if (unlikely(err))
+ return -EFAULT;
+
+ /* See comment in fxsave() below. */
+#ifdef CONFIG_AS_FXSAVEQ
+ asm volatile("1: fxsaveq %[fx]\n\t"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b, 3b)
+ : [err] "=r" (err), [fx] "=m" (*fx)
+ : "0" (0));
+#else
+ asm volatile("1: rex64/fxsave (%[fx])\n\t"
+ "2:\n"
+ ".section .fixup,\"ax\"\n"
+ "3: movl $-1,%[err]\n"
+ " jmp 2b\n"
+ ".previous\n"
+ _ASM_EXTABLE(1b, 3b)
+ : [err] "=r" (err), "=m" (*fx)
+ : [fx] "R" (fx), "0" (0));
+#endif
+ if (unlikely(err) &&
+ __clear_user(fx, sizeof(struct i387_fxsave_struct)))
+ err = -EFAULT;
+ /* No need to clear here because the caller clears USED_MATH */
+ return err;
+}
+
+static inline void fpu_fxsave(struct fpu *fpu)
+{
+ /* Using "rex64; fxsave %0" is broken because, if the memory operand
+ uses any extended registers for addressing, a second REX prefix
+ will be generated (to the assembler, rex64 followed by semicolon
+ is a separate instruction), and hence the 64-bitness is lost. */
+
+#ifdef CONFIG_AS_FXSAVEQ
+ /* Using "fxsaveq %0" would be the ideal choice, but is only supported
+ starting with gas 2.16. */
+ __asm__ __volatile__("fxsaveq %0"
+ : "=m" (fpu->state->fxsave));
+#else
+ /* Using, as a workaround, the properly prefixed form below isn't
+ accepted by any binutils version so far released, complaining that
+ the same type of prefix is used twice if an extended register is
+ needed for addressing (fix submitted to mainline 2005-11-21).
+ asm volatile("rex64/fxsave %0"
+ : "=m" (fpu->state->fxsave));
+ This, however, we can work around by forcing the compiler to select
+ an addressing mode that doesn't require extended registers. */
+ asm volatile("rex64/fxsave (%[fx])"
+ : "=m" (fpu->state->fxsave)
+ : [fx] "R" (&fpu->state->fxsave));
+#endif
+}
+
+#else /* CONFIG_X86_32 */
+
+/* perform fxrstor iff the processor has extended states, otherwise frstor */
+static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
+{
+ /*
+ * The "nop" is needed to make the instructions the same
+ * length.
+ */
+ alternative_input(
+ "nop ; frstor %1",
+ "fxrstor %1",
+ X86_FEATURE_FXSR,
+ "m" (*fx));
+
+ return 0;
+}
+
+static inline void fpu_fxsave(struct fpu *fpu)
+{
+ asm volatile("fxsave %[fx]"
+ : [fx] "=m" (fpu->state->fxsave));
+}
+
+#endif /* CONFIG_X86_64 */
+
+/*
+ * These must be called with preempt disabled. Returns
+ * 'true' if the FPU state is still intact.
+ */
+static inline int fpu_save_init(struct fpu *fpu)
+{
+ if (use_xsave()) {
+ fpu_xsave(fpu);
+
+ /*
+ * xsave header may indicate the init state of the FP.
+ */
+ if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP))
+ return 1;
+ } else if (use_fxsr()) {
+ fpu_fxsave(fpu);
+ } else {
+ asm volatile("fnsave %[fx]; fwait"
+ : [fx] "=m" (fpu->state->fsave));
+ return 0;
+ }
+
+ /*
+ * If exceptions are pending, we need to clear them so
+ * that we don't randomly get exceptions later.
+ *
+ * FIXME! Is this perhaps only true for the old-style
+ * irq13 case? Maybe we could leave the x87 state
+ * intact otherwise?
+ */
+ if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) {
+ asm volatile("fnclex");
+ return 0;
+ }
+ return 1;
+}
+
+static inline int __save_init_fpu(struct task_struct *tsk)
+{
+ return fpu_save_init(&tsk->thread.fpu);
+}
+
+static inline int fpu_fxrstor_checking(struct fpu *fpu)
+{
+ return fxrstor_checking(&fpu->state->fxsave);
+}
+
+static inline int fpu_restore_checking(struct fpu *fpu)
+{
+ if (use_xsave())
+ return fpu_xrstor_checking(fpu);
+ else
+ return fpu_fxrstor_checking(fpu);
+}
+
+static inline int restore_fpu_checking(struct task_struct *tsk)
+{
+ /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
+ is pending. Clear the x87 state here by setting it to fixed
+ values. "m" is a random variable that should be in L1 */
+ alternative_input(
+ ASM_NOP8 ASM_NOP2,
+ "emms\n\t" /* clear stack tags */
+ "fildl %P[addr]", /* set F?P to defined value */
+ X86_FEATURE_FXSAVE_LEAK,
+ [addr] "m" (tsk->thread.fpu.has_fpu));
+
+ return fpu_restore_checking(&tsk->thread.fpu);
+}
+
+/*
+ * Software FPU state helpers. Careful: these need to
+ * be preemption protection *and* they need to be
+ * properly paired with the CR0.TS changes!
+ */
+static inline int __thread_has_fpu(struct task_struct *tsk)
+{
+ return tsk->thread.fpu.has_fpu;
+}
+
+/* Must be paired with an 'stts' after! */
+static inline void __thread_clear_has_fpu(struct task_struct *tsk)
+{
+ tsk->thread.fpu.has_fpu = 0;
+ percpu_write(fpu_owner_task, NULL);
+}
+
+/* Must be paired with a 'clts' before! */
+static inline void __thread_set_has_fpu(struct task_struct *tsk)
+{
+ tsk->thread.fpu.has_fpu = 1;
+ percpu_write(fpu_owner_task, tsk);
+}
+
+/*
+ * Encapsulate the CR0.TS handling together with the
+ * software flag.
+ *
+ * These generally need preemption protection to work,
+ * do try to avoid using these on their own.
+ */
+static inline void __thread_fpu_end(struct task_struct *tsk)
+{
+ __thread_clear_has_fpu(tsk);
+ stts();
+}
+
+static inline void __thread_fpu_begin(struct task_struct *tsk)
+{
+ clts();
+ __thread_set_has_fpu(tsk);
+}
+
+/*
+ * FPU state switching for scheduling.
+ *
+ * This is a two-stage process:
+ *
+ * - switch_fpu_prepare() saves the old state and
+ * sets the new state of the CR0.TS bit. This is
+ * done within the context of the old process.
+ *
+ * - switch_fpu_finish() restores the new state as
+ * necessary.
+ */
+typedef struct { int preload; } fpu_switch_t;
+
+/*
+ * FIXME! We could do a totally lazy restore, but we need to
+ * add a per-cpu "this was the task that last touched the FPU
+ * on this CPU" variable, and the task needs to have a "I last
+ * touched the FPU on this CPU" and check them.
+ *
+ * We don't do that yet, so "fpu_lazy_restore()" always returns
+ * false, but some day..
+ */
+static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
+{
+ return new == percpu_read_stable(fpu_owner_task) &&
+ cpu == new->thread.fpu.last_cpu;
+}
+
+static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu)
+{
+ fpu_switch_t fpu;
+
+ fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
+ if (__thread_has_fpu(old)) {
+ if (!__save_init_fpu(old))
+ cpu = ~0;
+ old->thread.fpu.last_cpu = cpu;
+ old->thread.fpu.has_fpu = 0; /* But leave fpu_owner_task! */
+
+ /* Don't change CR0.TS if we just switch! */
+ if (fpu.preload) {
+ new->fpu_counter++;
+ __thread_set_has_fpu(new);
+ prefetch(new->thread.fpu.state);
+ } else
+ stts();
+ } else {
+ old->fpu_counter = 0;
+ old->thread.fpu.last_cpu = ~0;
+ if (fpu.preload) {
+ new->fpu_counter++;
+ if (fpu_lazy_restore(new, cpu))
+ fpu.preload = 0;
+ else
+ prefetch(new->thread.fpu.state);
+ __thread_fpu_begin(new);
+ }
+ }
+ return fpu;
+}
+
+/*
+ * By the time this gets called, we've already cleared CR0.TS and
+ * given the process the FPU if we are going to preload the FPU
+ * state - all we need to do is to conditionally restore the register
+ * state itself.
+ */
+static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
+{
+ if (fpu.preload) {
+ if (unlikely(restore_fpu_checking(new)))
+ __thread_fpu_end(new);
+ }
+}
+
+/*
+ * Signal frame handlers...
+ */
+extern int save_i387_xstate(void __user *buf);
+extern int restore_i387_xstate(void __user *buf);
+
+static inline void __clear_fpu(struct task_struct *tsk)
+{
+ if (__thread_has_fpu(tsk)) {
+ /* Ignore delayed exceptions from user space */
+ asm volatile("1: fwait\n"
+ "2:\n"
+ _ASM_EXTABLE(1b, 2b));
+ __thread_fpu_end(tsk);
+ }
+}
+
+/*
+ * The actual user_fpu_begin/end() functions
+ * need to be preemption-safe.
+ *
+ * NOTE! user_fpu_end() must be used only after you
+ * have saved the FP state, and user_fpu_begin() must
+ * be used only immediately before restoring it.
+ * These functions do not do any save/restore on
+ * their own.
+ */
+static inline void user_fpu_end(void)
+{
+ preempt_disable();
+ __thread_fpu_end(current);
+ preempt_enable();
+}
+
+static inline void user_fpu_begin(void)
+{
+ preempt_disable();
+ if (!user_has_fpu())
+ __thread_fpu_begin(current);
+ preempt_enable();
+}
+
+/*
+ * These disable preemption on their own and are safe
+ */
+static inline void save_init_fpu(struct task_struct *tsk)
+{
+ WARN_ON_ONCE(!__thread_has_fpu(tsk));
+ preempt_disable();
+ __save_init_fpu(tsk);
+ __thread_fpu_end(tsk);
+ preempt_enable();
+}
+
+static inline void clear_fpu(struct task_struct *tsk)
+{
+ preempt_disable();
+ __clear_fpu(tsk);
+ preempt_enable();
+}
+
+/*
+ * i387 state interaction
+ */
+static inline unsigned short get_fpu_cwd(struct task_struct *tsk)
+{
+ if (cpu_has_fxsr) {
+ return tsk->thread.fpu.state->fxsave.cwd;
+ } else {
+ return (unsigned short)tsk->thread.fpu.state->fsave.cwd;
+ }
+}
+
+static inline unsigned short get_fpu_swd(struct task_struct *tsk)
+{
+ if (cpu_has_fxsr) {
+ return tsk->thread.fpu.state->fxsave.swd;
+ } else {
+ return (unsigned short)tsk->thread.fpu.state->fsave.swd;
+ }
+}
+
+static inline unsigned short get_fpu_mxcsr(struct task_struct *tsk)
+{
+ if (cpu_has_xmm) {
+ return tsk->thread.fpu.state->fxsave.mxcsr;
+ } else {
+ return MXCSR_DEFAULT;
+ }
+}
+
+static bool fpu_allocated(struct fpu *fpu)
+{
+ return fpu->state != NULL;
+}
+
+static inline int fpu_alloc(struct fpu *fpu)
+{
+ if (fpu_allocated(fpu))
+ return 0;
+ fpu->state = kmem_cache_alloc(task_xstate_cachep, GFP_KERNEL);
+ if (!fpu->state)
+ return -ENOMEM;
+ WARN_ON((unsigned long)fpu->state & 15);
+ return 0;
+}
+
+static inline void fpu_free(struct fpu *fpu)
+{
+ if (fpu->state) {
+ kmem_cache_free(task_xstate_cachep, fpu->state);
+ fpu->state = NULL;
+ }
+}
+
+static inline void fpu_copy(struct fpu *dst, struct fpu *src)
+{
+ memcpy(dst->state, src->state, xstate_size);
+}
+
+extern void fpu_finit(struct fpu *fpu);
+
+#endif
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 0c1031d354f2..7ce0798b1b26 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -13,411 +13,15 @@
#ifndef __ASSEMBLY__

#include <linux/sched.h>
-#include <linux/kernel_stat.h>
-#include <linux/regset.h>
#include <linux/hardirq.h>
-#include <linux/slab.h>
-#include <asm/asm.h>
-#include <asm/cpufeature.h>
-#include <asm/processor.h>
-#include <asm/sigcontext.h>
-#include <asm/user.h>
-#include <asm/uaccess.h>
-#include <asm/xsave.h>
+#include <asm/system.h>
+
+struct pt_regs;
+struct user_i387_struct;

-extern unsigned int sig_xstate_size;
-extern void fpu_init(void);
-extern void mxcsr_feature_mask_init(void);
extern int init_fpu(struct task_struct *child);
-extern void math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);
-
-DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
-
-extern user_regset_active_fn fpregs_active, xfpregs_active;
-extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
- xstateregs_get;
-extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
- xstateregs_set;
-
-/*
- * xstateregs_active == fpregs_active. Please refer to the comment
- * at the definition of fpregs_active.
- */
-#define xstateregs_active fpregs_active
-
-extern struct _fpx_sw_bytes fx_sw_reserved;
-#ifdef CONFIG_IA32_EMULATION
-extern unsigned int sig_xstate_ia32_size;
-extern struct _fpx_sw_bytes fx_sw_reserved_ia32;
-struct _fpstate_ia32;
-struct _xstate_ia32;
-extern int save_i387_xstate_ia32(void __user *buf);
-extern int restore_i387_xstate_ia32(void __user *buf);
-#endif
-
-#ifdef CONFIG_MATH_EMULATION
-extern void finit_soft_fpu(struct i387_soft_struct *soft);
-#else
-static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}
-#endif
-
-#define X87_FSW_ES (1 << 7) /* Exception Summary */
-
-static __always_inline __pure bool use_xsaveopt(void)
-{
- return static_cpu_has(X86_FEATURE_XSAVEOPT);
-}
-
-static __always_inline __pure bool use_xsave(void)
-{
- return static_cpu_has(X86_FEATURE_XSAVE);
-}
-
-static __always_inline __pure bool use_fxsr(void)
-{
- return static_cpu_has(X86_FEATURE_FXSR);
-}
-
-extern void __sanitize_i387_state(struct task_struct *);
-
-static inline void sanitize_i387_state(struct task_struct *tsk)
-{
- if (!use_xsaveopt())
- return;
- __sanitize_i387_state(tsk);
-}
-
-#ifdef CONFIG_X86_64
-static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
-{
- int err;
-
- /* See comment in fxsave() below. */
-#ifdef CONFIG_AS_FXSAVEQ
- asm volatile("1: fxrstorq %[fx]\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : [fx] "m" (*fx), "0" (0));
-#else
- asm volatile("1: rex64/fxrstor (%[fx])\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : [fx] "R" (fx), "m" (*fx), "0" (0));
-#endif
- return err;
-}
-
-static inline int fxsave_user(struct i387_fxsave_struct __user *fx)
-{
- int err;
-
- /*
- * Clear the bytes not touched by the fxsave and reserved
- * for the SW usage.
- */
- err = __clear_user(&fx->sw_reserved,
- sizeof(struct _fpx_sw_bytes));
- if (unlikely(err))
- return -EFAULT;
-
- /* See comment in fxsave() below. */
-#ifdef CONFIG_AS_FXSAVEQ
- asm volatile("1: fxsaveq %[fx]\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err), [fx] "=m" (*fx)
- : "0" (0));
-#else
- asm volatile("1: rex64/fxsave (%[fx])\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err), "=m" (*fx)
- : [fx] "R" (fx), "0" (0));
-#endif
- if (unlikely(err) &&
- __clear_user(fx, sizeof(struct i387_fxsave_struct)))
- err = -EFAULT;
- /* No need to clear here because the caller clears USED_MATH */
- return err;
-}
-
-static inline void fpu_fxsave(struct fpu *fpu)
-{
- /* Using "rex64; fxsave %0" is broken because, if the memory operand
- uses any extended registers for addressing, a second REX prefix
- will be generated (to the assembler, rex64 followed by semicolon
- is a separate instruction), and hence the 64-bitness is lost. */
-
-#ifdef CONFIG_AS_FXSAVEQ
- /* Using "fxsaveq %0" would be the ideal choice, but is only supported
- starting with gas 2.16. */
- __asm__ __volatile__("fxsaveq %0"
- : "=m" (fpu->state->fxsave));
-#else
- /* Using, as a workaround, the properly prefixed form below isn't
- accepted by any binutils version so far released, complaining that
- the same type of prefix is used twice if an extended register is
- needed for addressing (fix submitted to mainline 2005-11-21).
- asm volatile("rex64/fxsave %0"
- : "=m" (fpu->state->fxsave));
- This, however, we can work around by forcing the compiler to select
- an addressing mode that doesn't require extended registers. */
- asm volatile("rex64/fxsave (%[fx])"
- : "=m" (fpu->state->fxsave)
- : [fx] "R" (&fpu->state->fxsave));
-#endif
-}
-
-#else /* CONFIG_X86_32 */
-
-/* perform fxrstor iff the processor has extended states, otherwise frstor */
-static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
-{
- /*
- * The "nop" is needed to make the instructions the same
- * length.
- */
- alternative_input(
- "nop ; frstor %1",
- "fxrstor %1",
- X86_FEATURE_FXSR,
- "m" (*fx));
-
- return 0;
-}
-
-static inline void fpu_fxsave(struct fpu *fpu)
-{
- asm volatile("fxsave %[fx]"
- : [fx] "=m" (fpu->state->fxsave));
-}
-
-#endif /* CONFIG_X86_64 */
-
-/*
- * These must be called with preempt disabled. Returns
- * 'true' if the FPU state is still intact.
- */
-static inline int fpu_save_init(struct fpu *fpu)
-{
- if (use_xsave()) {
- fpu_xsave(fpu);
-
- /*
- * xsave header may indicate the init state of the FP.
- */
- if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP))
- return 1;
- } else if (use_fxsr()) {
- fpu_fxsave(fpu);
- } else {
- asm volatile("fnsave %[fx]; fwait"
- : [fx] "=m" (fpu->state->fsave));
- return 0;
- }
-
- /*
- * If exceptions are pending, we need to clear them so
- * that we don't randomly get exceptions later.
- *
- * FIXME! Is this perhaps only true for the old-style
- * irq13 case? Maybe we could leave the x87 state
- * intact otherwise?
- */
- if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) {
- asm volatile("fnclex");
- return 0;
- }
- return 1;
-}
-
-static inline int __save_init_fpu(struct task_struct *tsk)
-{
- return fpu_save_init(&tsk->thread.fpu);
-}
-
-static inline int fpu_fxrstor_checking(struct fpu *fpu)
-{
- return fxrstor_checking(&fpu->state->fxsave);
-}
-
-static inline int fpu_restore_checking(struct fpu *fpu)
-{
- if (use_xsave())
- return fpu_xrstor_checking(fpu);
- else
- return fpu_fxrstor_checking(fpu);
-}
-
-static inline int restore_fpu_checking(struct task_struct *tsk)
-{
- /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
- is pending. Clear the x87 state here by setting it to fixed
- values. "m" is a random variable that should be in L1 */
- alternative_input(
- ASM_NOP8 ASM_NOP2,
- "emms\n\t" /* clear stack tags */
- "fildl %P[addr]", /* set F?P to defined value */
- X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (tsk->thread.fpu.has_fpu));
-
- return fpu_restore_checking(&tsk->thread.fpu);
-}
-
-/*
- * Software FPU state helpers. Careful: these need to
- * be preemption protection *and* they need to be
- * properly paired with the CR0.TS changes!
- */
-static inline int __thread_has_fpu(struct task_struct *tsk)
-{
- return tsk->thread.fpu.has_fpu;
-}
-
-/* Must be paired with an 'stts' after! */
-static inline void __thread_clear_has_fpu(struct task_struct *tsk)
-{
- tsk->thread.fpu.has_fpu = 0;
- percpu_write(fpu_owner_task, NULL);
-}
-
-/* Must be paired with a 'clts' before! */
-static inline void __thread_set_has_fpu(struct task_struct *tsk)
-{
- tsk->thread.fpu.has_fpu = 1;
- percpu_write(fpu_owner_task, tsk);
-}
-
-/*
- * Encapsulate the CR0.TS handling together with the
- * software flag.
- *
- * These generally need preemption protection to work,
- * do try to avoid using these on their own.
- */
-static inline void __thread_fpu_end(struct task_struct *tsk)
-{
- __thread_clear_has_fpu(tsk);
- stts();
-}
-
-static inline void __thread_fpu_begin(struct task_struct *tsk)
-{
- clts();
- __thread_set_has_fpu(tsk);
-}
-
-/*
- * FPU state switching for scheduling.
- *
- * This is a two-stage process:
- *
- * - switch_fpu_prepare() saves the old state and
- * sets the new state of the CR0.TS bit. This is
- * done within the context of the old process.
- *
- * - switch_fpu_finish() restores the new state as
- * necessary.
- */
-typedef struct { int preload; } fpu_switch_t;
-
-/*
- * FIXME! We could do a totally lazy restore, but we need to
- * add a per-cpu "this was the task that last touched the FPU
- * on this CPU" variable, and the task needs to have a "I last
- * touched the FPU on this CPU" and check them.
- *
- * We don't do that yet, so "fpu_lazy_restore()" always returns
- * false, but some day..
- */
-static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
-{
- return new == percpu_read_stable(fpu_owner_task) &&
- cpu == new->thread.fpu.last_cpu;
-}
-
-static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu)
-{
- fpu_switch_t fpu;
-
- fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
- if (__thread_has_fpu(old)) {
- if (!__save_init_fpu(old))
- cpu = ~0;
- old->thread.fpu.last_cpu = cpu;
- old->thread.fpu.has_fpu = 0; /* But leave fpu_owner_task! */
-
- /* Don't change CR0.TS if we just switch! */
- if (fpu.preload) {
- new->fpu_counter++;
- __thread_set_has_fpu(new);
- prefetch(new->thread.fpu.state);
- } else
- stts();
- } else {
- old->fpu_counter = 0;
- old->thread.fpu.last_cpu = ~0;
- if (fpu.preload) {
- new->fpu_counter++;
- if (fpu_lazy_restore(new, cpu))
- fpu.preload = 0;
- else
- prefetch(new->thread.fpu.state);
- __thread_fpu_begin(new);
- }
- }
- return fpu;
-}
-
-/*
- * By the time this gets called, we've already cleared CR0.TS and
- * given the process the FPU if we are going to preload the FPU
- * state - all we need to do is to conditionally restore the register
- * state itself.
- */
-static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
-{
- if (fpu.preload) {
- if (unlikely(restore_fpu_checking(new)))
- __thread_fpu_end(new);
- }
-}
-
-/*
- * Signal frame handlers...
- */
-extern int save_i387_xstate(void __user *buf);
-extern int restore_i387_xstate(void __user *buf);
-
-static inline void __clear_fpu(struct task_struct *tsk)
-{
- if (__thread_has_fpu(tsk)) {
- /* Ignore delayed exceptions from user space */
- asm volatile("1: fwait\n"
- "2:\n"
- _ASM_EXTABLE(1b, 2b));
- __thread_fpu_end(tsk);
- }
-}
+extern void math_state_restore(void);

extern bool irq_fpu_usable(void);
extern void kernel_fpu_begin(void);
@@ -463,118 +67,14 @@ static inline void irq_ts_restore(int TS_state)
* we can just assume we have FPU access - typically
* to save the FP state - we'll just take a #NM
* fault and get the FPU access back.
- *
- * The actual user_fpu_begin/end() functions
- * need to be preemption-safe, though.
- *
- * NOTE! user_fpu_end() must be used only after you
- * have saved the FP state, and user_fpu_begin() must
- * be used only immediately before restoring it.
- * These functions do not do any save/restore on
- * their own.
*/
static inline int user_has_fpu(void)
{
- return __thread_has_fpu(current);
-}
-
-static inline void user_fpu_end(void)
-{
- preempt_disable();
- __thread_fpu_end(current);
- preempt_enable();
-}
-
-static inline void user_fpu_begin(void)
-{
- preempt_disable();
- if (!user_has_fpu())
- __thread_fpu_begin(current);
- preempt_enable();
-}
-
-/*
- * These disable preemption on their own and are safe
- */
-static inline void save_init_fpu(struct task_struct *tsk)
-{
- WARN_ON_ONCE(!__thread_has_fpu(tsk));
- preempt_disable();
- __save_init_fpu(tsk);
- __thread_fpu_end(tsk);
- preempt_enable();
+ return current->thread.fpu.has_fpu;
}

extern void unlazy_fpu(struct task_struct *tsk);

-static inline void clear_fpu(struct task_struct *tsk)
-{
- preempt_disable();
- __clear_fpu(tsk);
- preempt_enable();
-}
-
-/*
- * i387 state interaction
- */
-static inline unsigned short get_fpu_cwd(struct task_struct *tsk)
-{
- if (cpu_has_fxsr) {
- return tsk->thread.fpu.state->fxsave.cwd;
- } else {
- return (unsigned short)tsk->thread.fpu.state->fsave.cwd;
- }
-}
-
-static inline unsigned short get_fpu_swd(struct task_struct *tsk)
-{
- if (cpu_has_fxsr) {
- return tsk->thread.fpu.state->fxsave.swd;
- } else {
- return (unsigned short)tsk->thread.fpu.state->fsave.swd;
- }
-}
-
-static inline unsigned short get_fpu_mxcsr(struct task_struct *tsk)
-{
- if (cpu_has_xmm) {
- return tsk->thread.fpu.state->fxsave.mxcsr;
- } else {
- return MXCSR_DEFAULT;
- }
-}
-
-static bool fpu_allocated(struct fpu *fpu)
-{
- return fpu->state != NULL;
-}
-
-static inline int fpu_alloc(struct fpu *fpu)
-{
- if (fpu_allocated(fpu))
- return 0;
- fpu->state = kmem_cache_alloc(task_xstate_cachep, GFP_KERNEL);
- if (!fpu->state)
- return -ENOMEM;
- WARN_ON((unsigned long)fpu->state & 15);
- return 0;
-}
-
-static inline void fpu_free(struct fpu *fpu)
-{
- if (fpu->state) {
- kmem_cache_free(task_xstate_cachep, fpu->state);
- fpu->state = NULL;
- }
-}
-
-static inline void fpu_copy(struct fpu *dst, struct fpu *src)
-{
- memcpy(dst->state, src->state, xstate_size);
-}
-
-extern void fpu_finit(struct fpu *fpu);
-
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_I387_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb71b01ab66e..89620b1725d4 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -28,6 +28,7 @@
#include <asm/apic.h>
#include <asm/desc.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/mtrr.h>
#include <linux/numa.h>
#include <asm/asm.h>
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 17b7549c4134..7734bcbb5a3a 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -16,6 +16,7 @@
#include <asm/uaccess.h>
#include <asm/ptrace.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/user.h>

#ifdef CONFIG_X86_64
@@ -124,7 +125,7 @@ EXPORT_SYMBOL_GPL(xstate_size);
unsigned int sig_xstate_ia32_size = sizeof(struct _fpstate_ia32);
static struct i387_fxsave_struct fx_scratch __cpuinitdata;

-void __cpuinit mxcsr_feature_mask_init(void)
+static void __cpuinit mxcsr_feature_mask_init(void)
{
unsigned long mask = 0;

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 15763af7bfe3..c38d84e01022 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -21,6 +21,7 @@
#include <asm/idle.h>
#include <asm/uaccess.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/debugreg.h>

struct kmem_cache *task_xstate_cachep;
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index c08d1ff12b7c..ee32dee7a0a3 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -45,6 +45,7 @@
#include <asm/ldt.h>
#include <asm/processor.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/desc.h>
#ifdef CONFIG_MATH_EMULATION
#include <asm/math_emu.h>
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index cfa5c90c01db..5bad3c71e48f 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -43,6 +43,7 @@
#include <asm/system.h>
#include <asm/processor.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/mmu_context.h>
#include <asm/prctl.h>
#include <asm/desc.h>
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 50267386b766..78f05e438be5 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -27,6 +27,7 @@
#include <asm/system.h>
#include <asm/processor.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/debugreg.h>
#include <asm/ldt.h>
#include <asm/desc.h>
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 46a01bdc27e2..25edcfc9ba5b 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -24,6 +24,7 @@
#include <asm/processor.h>
#include <asm/ucontext.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/vdso.h>
#include <asm/mce.h>

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4bbe04d96744..ec61d4c1b93b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -54,6 +54,7 @@
#include <asm/traps.h>
#include <asm/desc.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/mce.h>

#include <asm/mach_traps.h>
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 711091114119..e62728e30b01 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -6,6 +6,7 @@
#include <linux/bootmem.h>
#include <linux/compat.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#ifdef CONFIG_IA32_EMULATION
#include <asm/sigcontext32.h>
#endif
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3b4c8d8ad906..246490f643b6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1457,7 +1457,7 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
#ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
#endif
- if (__thread_has_fpu(current))
+ if (user_has_fpu())
clts();
load_gdt(&__get_cpu_var(host_gdt));
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cbfc0698118..b937b6179d80 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -57,6 +57,7 @@
#include <asm/mtrr.h>
#include <asm/mce.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h> /* Ugh! */
#include <asm/xcr.h>
#include <asm/pvclock.h>
#include <asm/div64.h>
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index f10c0afa1cb4..4889655ba784 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -20,6 +20,7 @@
#include <asm/xcr.h>
#include <asm/suspend.h>
#include <asm/debugreg.h>
+#include <asm/fpu-internal.h> /* pcntxt_mask */

#ifdef CONFIG_X86_32
static struct saved_context saved_context;
--
1.7.9.188.g12766.dirty

2012-02-21 21:54:49

by Suresh Siddha

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Mon, 2012-02-20 at 11:48 -0800, Linus Torvalds wrote:
> From: Linus Torvalds <[email protected]>
> Date: Sun, 19 Feb 2012 13:27:00 -0800
> Subject: [PATCH v2 3/3] i387: support lazy restore of FPU state
>
> This makes us recognize when we try to restore FPU state that matches
> what we already have in the FPU on this CPU, and avoids the restore
> entirely if so.
>
> To do this, we add two new data fields:
>
> - a percpu 'fpu_owner_task' variable that gets written any time we
> update the "has_fpu" field, and thus acts as a kind of back-pointer
> to the task that owns the CPU. The exception is when we save the FPU
> state as part of a context switch - if the save can keep the FPU
> state around, we leave the 'fpu_owner_task' variable pointing at the
> task whose FP state still remains on the CPU.
>
> - a per-thread 'last_cpu' field, that indicates which CPU that thread
> used its FPU on last. We update this on every context switch
> (writing an invalid CPU number if the last context switch didn't
> leave the FPU in a lazily usable state), so we know that *that*
> thread has done nothing else with the FPU since.
>
> These two fields together can be used when next switching back to the
> task to see if the CPU still matches: if 'fpu_owner_task' matches the
> task we are switching to, we know that no other task (or kernel FPU
> usage) touched the FPU on this CPU in the meantime, and if the current
> CPU number matches the 'last_cpu' field, we know that this thread did no
> other FP work on any other CPU, so the FPU state on the CPU must match
> what was saved on last context switch.
>
> In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
> CR0.TS bit.
>

Reviewing this code, I think we need to set the 'last_cpu' to an invalid
number in the fpu_alloc too. Appended is the patch.
---

From: Suresh Siddha <[email protected]>
Subject: x86, fpu: set the last_cpu in fpu_alloc() to an invalid cpu

Initialize the struct fpu's last_cpu in fpu_alloc() to an invalid cpu number,
so that the check in fpu_lazy_restore() will always return false
on the first context-switch in of this new task.

Otherwise, on a fork(), last_cpu of the new task's fpu will be copied from
the parent task and fpu_lazy_restore() can potentially return success wrongly
on the first context-switch in of this new task, leading to fpu corruption.

Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/include/asm/i387.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 2479049..58ba656 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -627,6 +627,7 @@ static inline int fpu_alloc(struct fpu *fpu)
if (!fpu->state)
return -ENOMEM;
WARN_ON((unsigned long)fpu->state & 15);
+ fpu->last_cpu = ~0;
return 0;
}


2012-02-21 21:57:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Tue, Feb 21, 2012 at 1:54 PM, Suresh Siddha
<[email protected]> wrote:
>
> Reviewing this code, I think we need to set the 'last_cpu' to an invalid
> number in the fpu_alloc too. Appended is the patch.

I cleared fpu_counter in copy_thread() instead.

Doing this in fpu_alloc() isn't necessarily a bad idea, though.

Linus

2012-02-21 22:19:51

by Suresh Siddha

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] i387: support lazy restore of FPU state

On Tue, 2012-02-21 at 13:57 -0800, Linus Torvalds wrote:
> On Tue, Feb 21, 2012 at 1:54 PM, Suresh Siddha
> <[email protected]> wrote:
> >
> > Reviewing this code, I think we need to set the 'last_cpu' to an invalid
> > number in the fpu_alloc too. Appended is the patch.
>
> I cleared fpu_counter in copy_thread() instead.

Hmm, didn't pay attention to that patch. Too many patches ;(

> Doing this in fpu_alloc() isn't necessarily a bad idea, though.

yes, I think we should do this in fpu_alloc(), or a bit more explicitly by
updating 'last_cpu' in copy_thread().

Someone experimenting with removing fpu pre-load may not realize this
subtle issue.

thanks,
suresh



2012-02-21 23:50:19

by Linus Torvalds

[permalink] [raw]
Subject: [tip:x86/fpu] i387: Uninline the generic FP helpers that we expose to kernel modules

Commit-ID: 8546c008924d5fd1724fa698eaa92b414bafd50d
Gitweb: http://git.kernel.org/tip/8546c008924d5fd1724fa698eaa92b414bafd50d
Author: Linus Torvalds <[email protected]>
AuthorDate: Tue, 21 Feb 2012 10:25:45 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Tue, 21 Feb 2012 14:12:46 -0800

i387: Uninline the generic FP helpers that we expose to kernel modules

Instead of exporting the very low-level internals of the FPU state
save/restore code (ie things like 'fpu_owner_task'), we should export
the higher-level interfaces.

Inlining these things is pointless anyway: sure, sometimes the end
result is small, but while 'stts()' can result in just three x86
instructions, those are not cheap instructions (writing %cr0 is a
serializing instruction and a very slow one at that).

So the overhead of a function call is not noticeable, and we really
don't want random modules mucking about with our internal state save
logic anyway.

So this unexports 'fpu_owner_task', and instead uninlines and exports
the actual functions that modules can use: fpu_kernel_begin/end() and
unlazy_fpu().

Signed-off-by: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/i387.h | 78 ++--------------------------------------
arch/x86/kernel/cpu/common.c | 2 -
arch/x86/kernel/i387.c | 80 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 84 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 2479049..0c1031d 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -419,70 +419,9 @@ static inline void __clear_fpu(struct task_struct *tsk)
}
}

-/*
- * Were we in an interrupt that interrupted kernel mode?
- *
- * We can do a kernel_fpu_begin/end() pair *ONLY* if that
- * pair does nothing at all: the thread must not have fpu (so
- * that we don't try to save the FPU state), and TS must
- * be set (so that the clts/stts pair does nothing that is
- * visible in the interrupted kernel thread).
- */
-static inline bool interrupted_kernel_fpu_idle(void)
-{
- return !__thread_has_fpu(current) &&
- (read_cr0() & X86_CR0_TS);
-}
-
-/*
- * Were we in user mode (or vm86 mode) when we were
- * interrupted?
- *
- * Doing kernel_fpu_begin/end() is ok if we are running
- * in an interrupt context from user mode - we'll just
- * save the FPU state as required.
- */
-static inline bool interrupted_user_mode(void)
-{
- struct pt_regs *regs = get_irq_regs();
- return regs && user_mode_vm(regs);
-}
-
-/*
- * Can we use the FPU in kernel mode with the
- * whole "kernel_fpu_begin/end()" sequence?
- *
- * It's always ok in process context (ie "not interrupt")
- * but it is sometimes ok even from an irq.
- */
-static inline bool irq_fpu_usable(void)
-{
- return !in_interrupt() ||
- interrupted_user_mode() ||
- interrupted_kernel_fpu_idle();
-}
-
-static inline void kernel_fpu_begin(void)
-{
- struct task_struct *me = current;
-
- WARN_ON_ONCE(!irq_fpu_usable());
- preempt_disable();
- if (__thread_has_fpu(me)) {
- __save_init_fpu(me);
- __thread_clear_has_fpu(me);
- /* We do 'stts()' in kernel_fpu_end() */
- } else {
- percpu_write(fpu_owner_task, NULL);
- clts();
- }
-}
-
-static inline void kernel_fpu_end(void)
-{
- stts();
- preempt_enable();
-}
+extern bool irq_fpu_usable(void);
+extern void kernel_fpu_begin(void);
+extern void kernel_fpu_end(void);

/*
* Some instructions like VIA's padlock instructions generate a spurious
@@ -566,16 +505,7 @@ static inline void save_init_fpu(struct task_struct *tsk)
preempt_enable();
}

-static inline void unlazy_fpu(struct task_struct *tsk)
-{
- preempt_disable();
- if (__thread_has_fpu(tsk)) {
- __save_init_fpu(tsk);
- __thread_fpu_end(tsk);
- } else
- tsk->fpu_counter = 0;
- preempt_enable();
-}
+extern void unlazy_fpu(struct task_struct *tsk);

static inline void clear_fpu(struct task_struct *tsk)
{
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0f7d68..cb71b01 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1045,7 +1045,6 @@ DEFINE_PER_CPU(char *, irq_stack_ptr) =
DEFINE_PER_CPU(unsigned int, irq_count) = -1;

DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
-EXPORT_PER_CPU_SYMBOL(fpu_owner_task);

/*
* Special IST stacks which the CPU switches to when it calls
@@ -1115,7 +1114,6 @@ void debug_stack_reset(void)
DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);
DEFINE_PER_CPU(struct task_struct *, fpu_owner_task);
-EXPORT_PER_CPU_SYMBOL(fpu_owner_task);

#ifdef CONFIG_CC_STACKPROTECTOR
DEFINE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 739d859..17b7549 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -32,6 +32,86 @@
# define user32_fxsr_struct user_fxsr_struct
#endif

+/*
+ * Were we in an interrupt that interrupted kernel mode?
+ *
+ * We can do a kernel_fpu_begin/end() pair *ONLY* if that
+ * pair does nothing at all: the thread must not have fpu (so
+ * that we don't try to save the FPU state), and TS must
+ * be set (so that the clts/stts pair does nothing that is
+ * visible in the interrupted kernel thread).
+ */
+static inline bool interrupted_kernel_fpu_idle(void)
+{
+ return !__thread_has_fpu(current) &&
+ (read_cr0() & X86_CR0_TS);
+}
+
+/*
+ * Were we in user mode (or vm86 mode) when we were
+ * interrupted?
+ *
+ * Doing kernel_fpu_begin/end() is ok if we are running
+ * in an interrupt context from user mode - we'll just
+ * save the FPU state as required.
+ */
+static inline bool interrupted_user_mode(void)
+{
+ struct pt_regs *regs = get_irq_regs();
+ return regs && user_mode_vm(regs);
+}
+
+/*
+ * Can we use the FPU in kernel mode with the
+ * whole "kernel_fpu_begin/end()" sequence?
+ *
+ * It's always ok in process context (ie "not interrupt")
+ * but it is sometimes ok even from an irq.
+ */
+bool irq_fpu_usable(void)
+{
+ return !in_interrupt() ||
+ interrupted_user_mode() ||
+ interrupted_kernel_fpu_idle();
+}
+EXPORT_SYMBOL(irq_fpu_usable);
+
+void kernel_fpu_begin(void)
+{
+ struct task_struct *me = current;
+
+ WARN_ON_ONCE(!irq_fpu_usable());
+ preempt_disable();
+ if (__thread_has_fpu(me)) {
+ __save_init_fpu(me);
+ __thread_clear_has_fpu(me);
+ /* We do 'stts()' in kernel_fpu_end() */
+ } else {
+ percpu_write(fpu_owner_task, NULL);
+ clts();
+ }
+}
+EXPORT_SYMBOL(kernel_fpu_begin);
+
+void kernel_fpu_end(void)
+{
+ stts();
+ preempt_enable();
+}
+EXPORT_SYMBOL(kernel_fpu_end);
+
+void unlazy_fpu(struct task_struct *tsk)
+{
+ preempt_disable();
+ if (__thread_has_fpu(tsk)) {
+ __save_init_fpu(tsk);
+ __thread_fpu_end(tsk);
+ } else
+ tsk->fpu_counter = 0;
+ preempt_enable();
+}
+EXPORT_SYMBOL(unlazy_fpu);
+
#ifdef CONFIG_MATH_EMULATION
# define HAVE_HWFP (boot_cpu_data.hard_math)
#else

2012-02-21 23:51:28

by Linus Torvalds

[permalink] [raw]
Subject: [tip:x86/fpu] i387: Split up <asm/i387.h> into exported and internal interfaces

Commit-ID: 1361b83a13d4d92e53fbb6c877528713e118b821
Gitweb: http://git.kernel.org/tip/1361b83a13d4d92e53fbb6c877528713e118b821
Author: Linus Torvalds <[email protected]>
AuthorDate: Tue, 21 Feb 2012 13:19:22 -0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Tue, 21 Feb 2012 14:12:54 -0800

i387: Split up <asm/i387.h> into exported and internal interfaces

While various modules include <asm/i387.h> to get access to things we
actually *intend* for them to use, most of that header file was really
pretty low-level internal stuff that we really don't want to expose to
others.

So split the header file into two: the small exported interfaces remain
in <asm/i387.h>, while the internal definitions that are only used by
core architecture code are now in <asm/fpu-internal.h>.

The guiding principle for this was to expose functions that we export to
modules, and leave them in <asm/i387.h>, while stuff that is used by
task switching or was marked GPL-only is in <asm/fpu-internal.h>.

The fpu-internal.h file could be further split up too, especially since
arch/x86/kvm/ uses some of the remaining stuff for its module. But that
kvm usage should probably be abstracted out a bit, and at least now the
internal FPU accessor functions are much more contained. Even if it
isn't perhaps as contained as it _could_ be.

Signed-off-by: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/ia32/ia32_signal.c | 1 +
arch/x86/include/asm/{i387.h => fpu-internal.h} | 72 +---
arch/x86/include/asm/i387.h | 512 +----------------------
arch/x86/kernel/cpu/common.c | 1 +
arch/x86/kernel/i387.c | 3 +-
arch/x86/kernel/process.c | 1 +
arch/x86/kernel/process_32.c | 1 +
arch/x86/kernel/process_64.c | 1 +
arch/x86/kernel/ptrace.c | 1 +
arch/x86/kernel/signal.c | 1 +
arch/x86/kernel/traps.c | 1 +
arch/x86/kernel/xsave.c | 1 +
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 1 +
arch/x86/power/cpu.c | 1 +
15 files changed, 26 insertions(+), 574 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 6557769..5563ba1 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -24,6 +24,7 @@
#include <asm/ucontext.h>
#include <asm/uaccess.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/ptrace.h>
#include <asm/ia32_unistd.h>
#include <asm/user32.h>
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/fpu-internal.h
similarity index 87%
copy from arch/x86/include/asm/i387.h
copy to arch/x86/include/asm/fpu-internal.h
index 0c1031d..4fa8815 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -7,15 +7,11 @@
* x86-64 work by Andi Kleen 2002
*/

-#ifndef _ASM_X86_I387_H
-#define _ASM_X86_I387_H
+#ifndef _FPU_INTERNAL_H
+#define _FPU_INTERNAL_H

-#ifndef __ASSEMBLY__
-
-#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/regset.h>
-#include <linux/hardirq.h>
#include <linux/slab.h>
#include <asm/asm.h>
#include <asm/cpufeature.h>
@@ -27,10 +23,6 @@

extern unsigned int sig_xstate_size;
extern void fpu_init(void);
-extern void mxcsr_feature_mask_init(void);
-extern int init_fpu(struct task_struct *child);
-extern void math_state_restore(void);
-extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);

@@ -40,6 +32,7 @@ extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
xstateregs_set;

+
/*
* xstateregs_active == fpregs_active. Please refer to the comment
* at the definition of fpregs_active.
@@ -275,7 +268,7 @@ static inline int restore_fpu_checking(struct task_struct *tsk)
values. "m" is a random variable that should be in L1 */
alternative_input(
ASM_NOP8 ASM_NOP2,
- "emms\n\t" /* clear stack tags */
+ "emms\n\t" /* clear stack tags */
"fildl %P[addr]", /* set F?P to defined value */
X86_FEATURE_FXSAVE_LEAK,
[addr] "m" (tsk->thread.fpu.has_fpu));
@@ -419,53 +412,9 @@ static inline void __clear_fpu(struct task_struct *tsk)
}
}

-extern bool irq_fpu_usable(void);
-extern void kernel_fpu_begin(void);
-extern void kernel_fpu_end(void);
-
-/*
- * Some instructions like VIA's padlock instructions generate a spurious
- * DNA fault but don't modify SSE registers. And these instructions
- * get used from interrupt context as well. To prevent these kernel instructions
- * in interrupt context interacting wrongly with other user/kernel fpu usage, we
- * should use them only in the context of irq_ts_save/restore()
- */
-static inline int irq_ts_save(void)
-{
- /*
- * If in process context and not atomic, we can take a spurious DNA fault.
- * Otherwise, doing clts() in process context requires disabling preemption
- * or some heavy lifting like kernel_fpu_begin()
- */
- if (!in_atomic())
- return 0;
-
- if (read_cr0() & X86_CR0_TS) {
- clts();
- return 1;
- }
-
- return 0;
-}
-
-static inline void irq_ts_restore(int TS_state)
-{
- if (TS_state)
- stts();
-}
-
/*
- * The question "does this thread have fpu access?"
- * is slightly racy, since preemption could come in
- * and revoke it immediately after the test.
- *
- * However, even in that very unlikely scenario,
- * we can just assume we have FPU access - typically
- * to save the FP state - we'll just take a #NM
- * fault and get the FPU access back.
- *
* The actual user_fpu_begin/end() functions
- * need to be preemption-safe, though.
+ * need to be preemption-safe.
*
* NOTE! user_fpu_end() must be used only after you
* have saved the FP state, and user_fpu_begin() must
@@ -473,11 +422,6 @@ static inline void irq_ts_restore(int TS_state)
* These functions do not do any save/restore on
* their own.
*/
-static inline int user_has_fpu(void)
-{
- return __thread_has_fpu(current);
-}
-
static inline void user_fpu_end(void)
{
preempt_disable();
@@ -505,8 +449,6 @@ static inline void save_init_fpu(struct task_struct *tsk)
preempt_enable();
}

-extern void unlazy_fpu(struct task_struct *tsk);
-
static inline void clear_fpu(struct task_struct *tsk)
{
preempt_disable();
@@ -575,6 +517,4 @@ static inline void fpu_copy(struct fpu *dst, struct fpu *src)

extern void fpu_finit(struct fpu *fpu);

-#endif /* __ASSEMBLY__ */
-
-#endif /* _ASM_X86_I387_H */
+#endif
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 0c1031d..7ce0798 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -13,411 +13,15 @@
#ifndef __ASSEMBLY__

#include <linux/sched.h>
-#include <linux/kernel_stat.h>
-#include <linux/regset.h>
#include <linux/hardirq.h>
-#include <linux/slab.h>
-#include <asm/asm.h>
-#include <asm/cpufeature.h>
-#include <asm/processor.h>
-#include <asm/sigcontext.h>
-#include <asm/user.h>
-#include <asm/uaccess.h>
-#include <asm/xsave.h>
+#include <asm/system.h>
+
+struct pt_regs;
+struct user_i387_struct;

-extern unsigned int sig_xstate_size;
-extern void fpu_init(void);
-extern void mxcsr_feature_mask_init(void);
extern int init_fpu(struct task_struct *child);
-extern void math_state_restore(void);
extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);
-
-DECLARE_PER_CPU(struct task_struct *, fpu_owner_task);
-
-extern user_regset_active_fn fpregs_active, xfpregs_active;
-extern user_regset_get_fn fpregs_get, xfpregs_get, fpregs_soft_get,
- xstateregs_get;
-extern user_regset_set_fn fpregs_set, xfpregs_set, fpregs_soft_set,
- xstateregs_set;
-
-/*
- * xstateregs_active == fpregs_active. Please refer to the comment
- * at the definition of fpregs_active.
- */
-#define xstateregs_active fpregs_active
-
-extern struct _fpx_sw_bytes fx_sw_reserved;
-#ifdef CONFIG_IA32_EMULATION
-extern unsigned int sig_xstate_ia32_size;
-extern struct _fpx_sw_bytes fx_sw_reserved_ia32;
-struct _fpstate_ia32;
-struct _xstate_ia32;
-extern int save_i387_xstate_ia32(void __user *buf);
-extern int restore_i387_xstate_ia32(void __user *buf);
-#endif
-
-#ifdef CONFIG_MATH_EMULATION
-extern void finit_soft_fpu(struct i387_soft_struct *soft);
-#else
-static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}
-#endif
-
-#define X87_FSW_ES (1 << 7) /* Exception Summary */
-
-static __always_inline __pure bool use_xsaveopt(void)
-{
- return static_cpu_has(X86_FEATURE_XSAVEOPT);
-}
-
-static __always_inline __pure bool use_xsave(void)
-{
- return static_cpu_has(X86_FEATURE_XSAVE);
-}
-
-static __always_inline __pure bool use_fxsr(void)
-{
- return static_cpu_has(X86_FEATURE_FXSR);
-}
-
-extern void __sanitize_i387_state(struct task_struct *);
-
-static inline void sanitize_i387_state(struct task_struct *tsk)
-{
- if (!use_xsaveopt())
- return;
- __sanitize_i387_state(tsk);
-}
-
-#ifdef CONFIG_X86_64
-static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
-{
- int err;
-
- /* See comment in fxsave() below. */
-#ifdef CONFIG_AS_FXSAVEQ
- asm volatile("1: fxrstorq %[fx]\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : [fx] "m" (*fx), "0" (0));
-#else
- asm volatile("1: rex64/fxrstor (%[fx])\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err)
- : [fx] "R" (fx), "m" (*fx), "0" (0));
-#endif
- return err;
-}
-
-static inline int fxsave_user(struct i387_fxsave_struct __user *fx)
-{
- int err;
-
- /*
- * Clear the bytes not touched by the fxsave and reserved
- * for the SW usage.
- */
- err = __clear_user(&fx->sw_reserved,
- sizeof(struct _fpx_sw_bytes));
- if (unlikely(err))
- return -EFAULT;
-
- /* See comment in fxsave() below. */
-#ifdef CONFIG_AS_FXSAVEQ
- asm volatile("1: fxsaveq %[fx]\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err), [fx] "=m" (*fx)
- : "0" (0));
-#else
- asm volatile("1: rex64/fxsave (%[fx])\n\t"
- "2:\n"
- ".section .fixup,\"ax\"\n"
- "3: movl $-1,%[err]\n"
- " jmp 2b\n"
- ".previous\n"
- _ASM_EXTABLE(1b, 3b)
- : [err] "=r" (err), "=m" (*fx)
- : [fx] "R" (fx), "0" (0));
-#endif
- if (unlikely(err) &&
- __clear_user(fx, sizeof(struct i387_fxsave_struct)))
- err = -EFAULT;
- /* No need to clear here because the caller clears USED_MATH */
- return err;
-}
-
-static inline void fpu_fxsave(struct fpu *fpu)
-{
- /* Using "rex64; fxsave %0" is broken because, if the memory operand
- uses any extended registers for addressing, a second REX prefix
- will be generated (to the assembler, rex64 followed by semicolon
- is a separate instruction), and hence the 64-bitness is lost. */
-
-#ifdef CONFIG_AS_FXSAVEQ
- /* Using "fxsaveq %0" would be the ideal choice, but is only supported
- starting with gas 2.16. */
- __asm__ __volatile__("fxsaveq %0"
- : "=m" (fpu->state->fxsave));
-#else
- /* Using, as a workaround, the properly prefixed form below isn't
- accepted by any binutils version so far released, complaining that
- the same type of prefix is used twice if an extended register is
- needed for addressing (fix submitted to mainline 2005-11-21).
- asm volatile("rex64/fxsave %0"
- : "=m" (fpu->state->fxsave));
- This, however, we can work around by forcing the compiler to select
- an addressing mode that doesn't require extended registers. */
- asm volatile("rex64/fxsave (%[fx])"
- : "=m" (fpu->state->fxsave)
- : [fx] "R" (&fpu->state->fxsave));
-#endif
-}
-
-#else /* CONFIG_X86_32 */
-
-/* perform fxrstor iff the processor has extended states, otherwise frstor */
-static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
-{
- /*
- * The "nop" is needed to make the instructions the same
- * length.
- */
- alternative_input(
- "nop ; frstor %1",
- "fxrstor %1",
- X86_FEATURE_FXSR,
- "m" (*fx));
-
- return 0;
-}
-
-static inline void fpu_fxsave(struct fpu *fpu)
-{
- asm volatile("fxsave %[fx]"
- : [fx] "=m" (fpu->state->fxsave));
-}
-
-#endif /* CONFIG_X86_64 */
-
-/*
- * These must be called with preempt disabled. Returns
- * 'true' if the FPU state is still intact.
- */
-static inline int fpu_save_init(struct fpu *fpu)
-{
- if (use_xsave()) {
- fpu_xsave(fpu);
-
- /*
- * xsave header may indicate the init state of the FP.
- */
- if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP))
- return 1;
- } else if (use_fxsr()) {
- fpu_fxsave(fpu);
- } else {
- asm volatile("fnsave %[fx]; fwait"
- : [fx] "=m" (fpu->state->fsave));
- return 0;
- }
-
- /*
- * If exceptions are pending, we need to clear them so
- * that we don't randomly get exceptions later.
- *
- * FIXME! Is this perhaps only true for the old-style
- * irq13 case? Maybe we could leave the x87 state
- * intact otherwise?
- */
- if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) {
- asm volatile("fnclex");
- return 0;
- }
- return 1;
-}
-
-static inline int __save_init_fpu(struct task_struct *tsk)
-{
- return fpu_save_init(&tsk->thread.fpu);
-}
-
-static inline int fpu_fxrstor_checking(struct fpu *fpu)
-{
- return fxrstor_checking(&fpu->state->fxsave);
-}
-
-static inline int fpu_restore_checking(struct fpu *fpu)
-{
- if (use_xsave())
- return fpu_xrstor_checking(fpu);
- else
- return fpu_fxrstor_checking(fpu);
-}
-
-static inline int restore_fpu_checking(struct task_struct *tsk)
-{
- /* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
- is pending. Clear the x87 state here by setting it to fixed
- values. "m" is a random variable that should be in L1 */
- alternative_input(
- ASM_NOP8 ASM_NOP2,
- "emms\n\t" /* clear stack tags */
- "fildl %P[addr]", /* set F?P to defined value */
- X86_FEATURE_FXSAVE_LEAK,
- [addr] "m" (tsk->thread.fpu.has_fpu));
-
- return fpu_restore_checking(&tsk->thread.fpu);
-}
-
-/*
- * Software FPU state helpers. Careful: these need to
- * be preemption protection *and* they need to be
- * properly paired with the CR0.TS changes!
- */
-static inline int __thread_has_fpu(struct task_struct *tsk)
-{
- return tsk->thread.fpu.has_fpu;
-}
-
-/* Must be paired with an 'stts' after! */
-static inline void __thread_clear_has_fpu(struct task_struct *tsk)
-{
- tsk->thread.fpu.has_fpu = 0;
- percpu_write(fpu_owner_task, NULL);
-}
-
-/* Must be paired with a 'clts' before! */
-static inline void __thread_set_has_fpu(struct task_struct *tsk)
-{
- tsk->thread.fpu.has_fpu = 1;
- percpu_write(fpu_owner_task, tsk);
-}
-
-/*
- * Encapsulate the CR0.TS handling together with the
- * software flag.
- *
- * These generally need preemption protection to work,
- * do try to avoid using these on their own.
- */
-static inline void __thread_fpu_end(struct task_struct *tsk)
-{
- __thread_clear_has_fpu(tsk);
- stts();
-}
-
-static inline void __thread_fpu_begin(struct task_struct *tsk)
-{
- clts();
- __thread_set_has_fpu(tsk);
-}
-
-/*
- * FPU state switching for scheduling.
- *
- * This is a two-stage process:
- *
- * - switch_fpu_prepare() saves the old state and
- * sets the new state of the CR0.TS bit. This is
- * done within the context of the old process.
- *
- * - switch_fpu_finish() restores the new state as
- * necessary.
- */
-typedef struct { int preload; } fpu_switch_t;
-
-/*
- * FIXME! We could do a totally lazy restore, but we need to
- * add a per-cpu "this was the task that last touched the FPU
- * on this CPU" variable, and the task needs to have a "I last
- * touched the FPU on this CPU" and check them.
- *
- * We don't do that yet, so "fpu_lazy_restore()" always returns
- * false, but some day..
- */
-static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
-{
- return new == percpu_read_stable(fpu_owner_task) &&
- cpu == new->thread.fpu.last_cpu;
-}
-
-static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new, int cpu)
-{
- fpu_switch_t fpu;
-
- fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
- if (__thread_has_fpu(old)) {
- if (!__save_init_fpu(old))
- cpu = ~0;
- old->thread.fpu.last_cpu = cpu;
- old->thread.fpu.has_fpu = 0; /* But leave fpu_owner_task! */
-
- /* Don't change CR0.TS if we just switch! */
- if (fpu.preload) {
- new->fpu_counter++;
- __thread_set_has_fpu(new);
- prefetch(new->thread.fpu.state);
- } else
- stts();
- } else {
- old->fpu_counter = 0;
- old->thread.fpu.last_cpu = ~0;
- if (fpu.preload) {
- new->fpu_counter++;
- if (fpu_lazy_restore(new, cpu))
- fpu.preload = 0;
- else
- prefetch(new->thread.fpu.state);
- __thread_fpu_begin(new);
- }
- }
- return fpu;
-}
-
-/*
- * By the time this gets called, we've already cleared CR0.TS and
- * given the process the FPU if we are going to preload the FPU
- * state - all we need to do is to conditionally restore the register
- * state itself.
- */
-static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
-{
- if (fpu.preload) {
- if (unlikely(restore_fpu_checking(new)))
- __thread_fpu_end(new);
- }
-}
-
-/*
- * Signal frame handlers...
- */
-extern int save_i387_xstate(void __user *buf);
-extern int restore_i387_xstate(void __user *buf);
-
-static inline void __clear_fpu(struct task_struct *tsk)
-{
- if (__thread_has_fpu(tsk)) {
- /* Ignore delayed exceptions from user space */
- asm volatile("1: fwait\n"
- "2:\n"
- _ASM_EXTABLE(1b, 2b));
- __thread_fpu_end(tsk);
- }
-}
+extern void math_state_restore(void);

extern bool irq_fpu_usable(void);
extern void kernel_fpu_begin(void);
@@ -463,118 +67,14 @@ static inline void irq_ts_restore(int TS_state)
* we can just assume we have FPU access - typically
* to save the FP state - we'll just take a #NM
* fault and get the FPU access back.
- *
- * The actual user_fpu_begin/end() functions
- * need to be preemption-safe, though.
- *
- * NOTE! user_fpu_end() must be used only after you
- * have saved the FP state, and user_fpu_begin() must
- * be used only immediately before restoring it.
- * These functions do not do any save/restore on
- * their own.
*/
static inline int user_has_fpu(void)
{
- return __thread_has_fpu(current);
-}
-
-static inline void user_fpu_end(void)
-{
- preempt_disable();
- __thread_fpu_end(current);
- preempt_enable();
-}
-
-static inline void user_fpu_begin(void)
-{
- preempt_disable();
- if (!user_has_fpu())
- __thread_fpu_begin(current);
- preempt_enable();
-}
-
-/*
- * These disable preemption on their own and are safe
- */
-static inline void save_init_fpu(struct task_struct *tsk)
-{
- WARN_ON_ONCE(!__thread_has_fpu(tsk));
- preempt_disable();
- __save_init_fpu(tsk);
- __thread_fpu_end(tsk);
- preempt_enable();
+ return current->thread.fpu.has_fpu;
}

extern void unlazy_fpu(struct task_struct *tsk);

-static inline void clear_fpu(struct task_struct *tsk)
-{
- preempt_disable();
- __clear_fpu(tsk);
- preempt_enable();
-}
-
-/*
- * i387 state interaction
- */
-static inline unsigned short get_fpu_cwd(struct task_struct *tsk)
-{
- if (cpu_has_fxsr) {
- return tsk->thread.fpu.state->fxsave.cwd;
- } else {
- return (unsigned short)tsk->thread.fpu.state->fsave.cwd;
- }
-}
-
-static inline unsigned short get_fpu_swd(struct task_struct *tsk)
-{
- if (cpu_has_fxsr) {
- return tsk->thread.fpu.state->fxsave.swd;
- } else {
- return (unsigned short)tsk->thread.fpu.state->fsave.swd;
- }
-}
-
-static inline unsigned short get_fpu_mxcsr(struct task_struct *tsk)
-{
- if (cpu_has_xmm) {
- return tsk->thread.fpu.state->fxsave.mxcsr;
- } else {
- return MXCSR_DEFAULT;
- }
-}
-
-static bool fpu_allocated(struct fpu *fpu)
-{
- return fpu->state != NULL;
-}
-
-static inline int fpu_alloc(struct fpu *fpu)
-{
- if (fpu_allocated(fpu))
- return 0;
- fpu->state = kmem_cache_alloc(task_xstate_cachep, GFP_KERNEL);
- if (!fpu->state)
- return -ENOMEM;
- WARN_ON((unsigned long)fpu->state & 15);
- return 0;
-}
-
-static inline void fpu_free(struct fpu *fpu)
-{
- if (fpu->state) {
- kmem_cache_free(task_xstate_cachep, fpu->state);
- fpu->state = NULL;
- }
-}
-
-static inline void fpu_copy(struct fpu *dst, struct fpu *src)
-{
- memcpy(dst->state, src->state, xstate_size);
-}
-
-extern void fpu_finit(struct fpu *fpu);
-
#endif /* __ASSEMBLY__ */

#endif /* _ASM_X86_I387_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb71b01..89620b1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -28,6 +28,7 @@
#include <asm/apic.h>
#include <asm/desc.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/mtrr.h>
#include <linux/numa.h>
#include <asm/asm.h>
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 17b7549..7734bcb 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -16,6 +16,7 @@
#include <asm/uaccess.h>
#include <asm/ptrace.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/user.h>

#ifdef CONFIG_X86_64
@@ -124,7 +125,7 @@ EXPORT_SYMBOL_GPL(xstate_size);
unsigned int sig_xstate_ia32_size = sizeof(struct _fpstate_ia32);
static struct i387_fxsave_struct fx_scratch __cpuinitdata;

-void __cpuinit mxcsr_feature_mask_init(void)
+static void __cpuinit mxcsr_feature_mask_init(void)
{
unsigned long mask = 0;

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 15763af..c38d84e 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -21,6 +21,7 @@
#include <asm/idle.h>
#include <asm/uaccess.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/debugreg.h>

struct kmem_cache *task_xstate_cachep;
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index c08d1ff..ee32dee 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -45,6 +45,7 @@
#include <asm/ldt.h>
#include <asm/processor.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/desc.h>
#ifdef CONFIG_MATH_EMULATION
#include <asm/math_emu.h>
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index cfa5c90..5bad3c7 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -43,6 +43,7 @@
#include <asm/system.h>
#include <asm/processor.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/mmu_context.h>
#include <asm/prctl.h>
#include <asm/desc.h>
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..78f05e4 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -27,6 +27,7 @@
#include <asm/system.h>
#include <asm/processor.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/debugreg.h>
#include <asm/ldt.h>
#include <asm/desc.h>
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 46a01bd..25edcfc 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -24,6 +24,7 @@
#include <asm/processor.h>
#include <asm/ucontext.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/vdso.h>
#include <asm/mce.h>

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 4bbe04d..ec61d4c 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -54,6 +54,7 @@
#include <asm/traps.h>
#include <asm/desc.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#include <asm/mce.h>

#include <asm/mach_traps.h>
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
index 7110911..e62728e 100644
--- a/arch/x86/kernel/xsave.c
+++ b/arch/x86/kernel/xsave.c
@@ -6,6 +6,7 @@
#include <linux/bootmem.h>
#include <linux/compat.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h>
#ifdef CONFIG_IA32_EMULATION
#include <asm/sigcontext32.h>
#endif
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3b4c8d8..246490f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1457,7 +1457,7 @@ static void __vmx_load_host_state(struct vcpu_vmx *vmx)
#ifdef CONFIG_X86_64
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
#endif
- if (__thread_has_fpu(current))
+ if (user_has_fpu())
clts();
load_gdt(&__get_cpu_var(host_gdt));
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9cbfc06..b937b61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -57,6 +57,7 @@
#include <asm/mtrr.h>
#include <asm/mce.h>
#include <asm/i387.h>
+#include <asm/fpu-internal.h> /* Ugh! */
#include <asm/xcr.h>
#include <asm/pvclock.h>
#include <asm/div64.h>
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index f10c0af..4889655 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -20,6 +20,7 @@
#include <asm/xcr.h>
#include <asm/suspend.h>
#include <asm/debugreg.h>
+#include <asm/fpu-internal.h> /* pcntxt_mask */

#ifdef CONFIG_X86_32
static struct saved_context saved_context;

2012-02-28 11:21:23

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On 02/21/2012 11:41 PM, Linus Torvalds wrote:
> Btw, I really don't like what arch/x86/kvm/ does with CR0.TS and the FP
> state. I'm not at all sure that's all kosher. But I don't know the
> code, so I just made sure that at no point did any of the semantics
> change.
>

Can you elaborate on what you don't like in the kvm code (apart from "it
does virtualization")?

btw, some time ago I did some work to lazify the fpu save (as opposed to
just the fpu restore) and abstract out the various users (user mode,
kernel threads, irq context, guest mode, and signal handlers). This
would allow you to run task A's user mode with task B's fpu loaded, have
a preemptible kernel_fpu_begin(), avoid fpu switching while handling
signals, and run user mode with a guest fpu loaded or vice versa.
However, I abandoned the effort as too complex. Perhaps a more
determined hacker can make more progress there.

--
error compiling committee.c: too many arguments to function

2012-02-28 16:05:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On Tue, Feb 28, 2012 at 3:21 AM, Avi Kivity <[email protected]> wrote:
>
> Can you elaborate on what you don't like in the kvm code (apart from "it
> does virtualization")?

It doesn't fit any of the patterns of the x87 save/restore code, and I
don't know what it does.

It does clts on its own, in random places without actually restoring
the FPU state. Why is that ok? I don't know. And I don't think it is,
but I didn't change any of it. Why doesn't that thing corrupt the lazy
state save of some other process, for example?

Doing a "clts()" without restoring the FPU state immediately
afterwards is fundamentally *wrong*. It's crazy. Insane. You can now
use the FPU, but with whatever random state that is in it that caused
TS to be set to begin with.

And if you don't have any FPU state to restore, because you want to
use your own kernel state, you should use the
"kernel_fpu_begin()/end()" things that we have had forever.

Here's an example of the kind of UTTER AND ABSOLUTE SHIT that kvm FPU
state restore is:

static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
{
	preempt_disable();
	kvm_load_guest_fpu(emul_to_vcpu(ctxt));
	/*
	 * CR0.TS may reference the host fpu state, not the guest fpu state,
	 * so it may be clear at this point.
	 */
	clts();
}

that whole "comment" says nothing at all. And clearing CR0.TS *after*
loading the FPU state is a f*cking joke, since you need it clear to
load the FPU state to begin with. So as far as I can tell,
kvm_load_guest_fpu() will have cleared the FPU state already, but *it*
did it by:

	unlazy_fpu(current);
	fpu_restore_checking(&vcpu->arch.guest_fpu);

where "unlazy_fpu()" will have *set* TS if it wasn't set before, so
fpu_restore_checking() will now TAKE A FAULT, and in that fault
handler it will clear TS so that it can reload the state we just saved
(yes, really), only to then return to fpu_restore_checking() and
reload yet *another* state.

The code is crap. It's insane. It may work, but if it does, it does so
by pure chance and happenstance. The code is CLEARLY INSANE.

I wasn't going to touch it. It had been written by a
random-code-generator that had strung the various FPU accessor
functions up in random order until it compiled.

Linus

2012-02-28 17:22:37

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On 02/28/2012 06:05 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 3:21 AM, Avi Kivity <[email protected]> wrote:
> >
> > Can you elaborate on what you don't like in the kvm code (apart from "it
> > does virtualization")?
>
> It doesn't fit any of the patterns of the x87 save/restore code, and I
> don't know what it does.

It tries to do two things: first, keep the guest fpu loaded while
running kernel code, and second, allow the instruction emulator to
access the guest fpu.

> It does clts on its own, in random places without actually restoring
> the FPU state. Why is that ok? I don't know.

The way we use vmx, it does an implicit stts() after an exit from a
guest (it's not required, but it's expensive to play with the value of
the host cr0, so we set it to a safe value and clear it when needed).
So sometimes we need these random clts()s.

> And I don't think it is,
> but I didn't change any of it. Why doesn't that thing corrupt the lazy
> state save of some other process, for example?
>
> Doing a "clts()" without restoring the FPU state immediately
> afterwards is fundamentally *wrong*. It's crazy. Insane. You can now
> use the FPU, but with whatever random state that is in it that caused
> TS to be set to begin with.

There are two cases. In one of them, we do restore the guest fpu
immediately afterwards. In the other, we're just clearing a CR0.TS that
was set spuriously.

> And if you don't have any FPU state to restore, because you want to
> use your own kernel state, you should use the
> "kernel_fpu_begin()/end()" things that we have had forever.

We do have state - the guest state.

> Here's an example of the kind of UTTER AND ABSOLUTE SHIT that kvm FPU
> state restore is:
>
> static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
> {
> preempt_disable();
> kvm_load_guest_fpu(emul_to_vcpu(ctxt));
> /*
> * CR0.TS may reference the host fpu state, not the guest fpu state,
> * so it may be clear at this point.
> */
> clts();
> }
>
> that whole "comment" says nothing at all. And clearing CR0.TS *after*
> loading the FPU state is a f*cking joke, since you need it clear to
> load the FPU state to begin with. So as far as I can tell,
> kvm_load_guest_fpu() will have cleared the FPU state already, but *it*
> did it by:
>
> unlazy_fpu(current);
> fpu_restore_checking(&vcpu->arch.guest_fpu);
>
> where "unlazy_fpu()" will have *set* TS if it wasn't set before, so
> fpu_restore_checking() will now TAKE A FAULT, and in that fault
> handler it will clear TS so that it can reload the state we just saved
> (yes, really), only to then return to fpu_restore_checking() and
> reload yet *another* state.
>
> The code is crap. It's insane. It may work, but if it does, it does so
> by pure chance and happenstance. The code is CLEARLY INSANE.

What you described is the slow path. The fast path is

void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
	if (vcpu->guest_fpu_loaded)
		return;

If we're emulating an fpu instruction, it's very likely that we have the
guest fpu loaded into the cpu. If we do take that path, we have the
right fpu state loaded, but CR0.TS is set by the recent exit, so we need
to clear it (the comment is in fact correct, except that it misspelled
"set").

> I wasn't going to touch it. It had been written by a
> random-code-generator that had strung the various FPU accessor
> functions up in random order until it compiled.

The tried and tested way, yes.

--
error compiling committee.c: too many arguments to function

2012-02-28 17:37:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On Tue, Feb 28, 2012 at 9:21 AM, Avi Kivity <[email protected]> wrote:
>
> What you described is the slow path.

Indeed. I'd even call it the "glacial and stupid" path.

> The fast path is
>
> void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
> {
>         if (vcpu->guest_fpu_loaded)
>                 return;
>
> If we're emulating an fpu instruction, it's very likely that we have the
> guest fpu loaded into the cpu.  If we do take that path, we have the
> right fpu state loaded, but CR0.TS is set by the recent exit, so we need
> to clear it (the comment is in fact correct, except that it misspelled
> "set").

So why the hell do you put the clts in the wrong place, then?

Dammit, the code is crap.

The clts's are in random places, they don't make any sense, the
comments are *wrong*, and the only reason they exist in the first
place is exactly the fact that the code does what it does in the wrong
place.

There's a reason I called the code crap. It makes no sense. Your
explanation only explains what it does (badly) when what *should* have
happened is you saying "ok, that makes no sense, let's fix it up".

So let me re-iterate: it makes ZERO SENSE to clear TS *after*
restoring the state. Don't do it. Don't make excuses for doing it.
It's WRONG. Whether it even happens to work by random chance isn't
even the issue.

Where it *can* make sense to clear TS is in your code that says

>         if (vcpu->guest_fpu_loaded)
>                 return;

where you could have done it like this:

	/* If we already have the FPU loaded, just clear TS - it was set
	   by a recent exit */
	if (vcpu->guest_fpu_loaded) {
		clts();
		return;
	}

And then at least the *placement* of clts would make sense. HOWEVER.
Even if you did that, what guarantees that the most recent FP usage
was by *your* kvm process? Sure, the "recent exit" may have set TS,
but have you had preemption disabled for the whole time? Guaranteed?

Because TS may be set because something else rescheduled too.

So where's the comment about why you actually own and control CR0.TS,
and nobody else does?

Finally, how does this all interact with the fact that the task
switching now keeps the FPU state around in the FPU and caches what
state it is? I have no idea, because the kvm code is so impenetrable
due to all these totally unexplained things.

Btw, don't get me wrong - the core FPU state save/restore was a mess
of random "TS_USEDFPU" and clts() crap too. We had several bugs there,
partly exactly because the core FPU restore code also had "clts()"
separated from the logic that actually set or cleared the TS_USEDFPU
bit, and it was not at all clear at a "local" setting what the F was
going on.

Most of the recent i387 changes were actually to clean up and make
sense of that thing, and making sure that the clts() was paired with
the action of actually giving the FPU to the thread etc. So at least
now the core FPU handling is reasonably sane, and the clts's and
stts's are paired with the things that take control of the FPU, and we
have a few helper functions and some abstraction in place.

The kvm code definitely needs the same kind of cleanup. Because as it
is now, it's just random code junk, and there is no obvious reason why
it wouldn't interact horribly badly with an interrupt doing
"irq_fpu_usable()" + "kernel_fpu_begin/end()" for example.

Seriously.

Linus

2012-02-28 18:09:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On Tue, Feb 28, 2012 at 9:37 AM, Linus Torvalds
<[email protected]> wrote:
>
> So where's the comment about why you actually own and control CR0.TS,
> and nobody else does?

So what I think KVM should strive for (but I really don't know the
code, so maybe there are good reasons why it is impossible) is to just
never touch TS at all, and let the core kernel code do it all for you.

When you need access to the FPU, let the core code just handle it for
you. Let it trap and restore the state. When you get scheduled away,
let the core code just set TS, because you really can't touch the FP
state again.

IOW, just do the FP operations you do within the thread you are in. Never
touch TS at all, just don't worry about it. Worry about your own
internal FP state machine, but don't interact with the "global" kernel
TS state machine.

You can't do a lot better than that, I think. Especially now that we
do the lazy restore, we can schedule between two tasks and if only one
of them actually uses the FPU, we won't bother with extraneous state
restores.

The one exception I can think of is that if you are loading totally
*new* FP state, and you think that TS is likely to be set, instead of
trapping (and loading the old state in the trap handling) only to
return to load the *new* state, we could expose a helper for that
situation. It would look something like

	user_fpu_begin();
	fpu_restore_checking(newfpustate);

and it would avoid the trap when loading the new state.
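
A minimal sketch of what such a helper might look like, built only on
the user_fpu_begin()/fpu_restore_checking() primitives from this series
(the function name is made up, and the preemption handling is the naive
version):

	/*
	 * Hypothetical sketch only: load completely *new* FP state
	 * without bouncing through a #NM trap first.  Assumes the
	 * caller really is throwing away the old register contents.
	 */
	static inline int restore_new_fpu_state(struct fpu *newfpu)
	{
		int err;

		preempt_disable();
		user_fpu_begin();	/* clts() + take FPU ownership */
		err = fpu_restore_checking(newfpu);
		preempt_enable();
		return err;
	}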

Linus

2012-02-28 18:09:43

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On 02/28/2012 07:37 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 9:21 AM, Avi Kivity <[email protected]> wrote:
> >
> > What you described is the slow path.
>
> Indeed. I'd even call it the "glacial and stupid" path.

Right. It won't be offended, since it's almost never called.

> >The fast path is
> >
> > void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
> > {
> > if (vcpu->guest_fpu_loaded)
> > return;
> >
> > If we're emulating an fpu instruction, it's very likely that we have the
> > guest fpu loaded into the cpu. If we do take that path, we have the
> > right fpu state loaded, but CR0.TS is set by the recent exit, so we need
> > to clear it (the comment is in fact correct, except that it misspelled
> > "set").
>
> So why the hell do you put the clts in the wrong place, then?

You mean, why not in kvm_load_guest_fpu()? Most of the uses of
kvm_load_guest_fpu() are just before guest entry, so the clts() would be
immediately overwritten by the loading of the GUEST_CR0 register (which
may have TS set or clear). So putting it here would be wasting cycles.

>
> Dammit, the code is crap.
>
> The clts's are in random places, they don't make any sense, the
> comments are *wrong*, and the only reason they exist in the first
> place is exactly the fact that the code does what it does in the wrong
> place.
>
> There's a reason I called the code crap. It makes no sense. Your
> explanation only explains what it does (badly) when what *should* have
> happened is you saying "ok, that makes no sense, let's fix it up".
>
> So let me re-iterate: it makes ZERO SENSE to clear TS *after*
> restoring the state. Don't do it. Don't make excuses for doing it.
> It's WRONG. Whether it even happens to work by random chance isn't
> even the issue.
>
> Where it *can* make sense to clear TS is in your code that says
>
> > if (vcpu->guest_fpu_loaded)
> > return;
>
> where you could have done it like this:
>
> /* If we already have the FPU loaded, just clear TS - it was set
> by a recent exit */
> if (vcpu->guest_fpu_loaded) {
> clts();
> return;
> }
>
> And then at least the *placement* of clts would make sense.

True, it's cleaner, but as noted above, it's wasteful.

> HOWEVER.
> Even if you did that, what guarantees that the most recent FP usage
> was by *your* kvm process? Sure, the "recent exit" may have set TS,
> but have you had preemption disabled for the whole time? Guaranteed?

Both the vcpu_load() and emulation paths happen with preemption disabled.

> Because TS may be set because something else rescheduled too.
>
> So where's the comment about why you actually own and control CR0.TS,
> and nobody else does?

The code is poorly commented, yes.

> Finally, how does this all interact with the fact that the task
> switching now keeps the FPU state around in the FPU and caches what
> state it is? I have no idea, because the kvm code is so impenetrable
> due to all these totally unexplained things.

This is done by preempt notifiers. Whenever a task switch happens we
push the guest fpu state into memory (if loaded) and let the normal
stuff happen. So if we had a task switch during instruction
emulation, for example, then we'd get the "glacial and stupid" path to fire.

> Btw, don't get me wrong - the core FPU state save/restore was a mess
> of random "TS_USEDFPU" and clts() crap too. We had several bugs there,
> partly exactly because the core FPU restore code also had "clts()"
> separated from the logic that actually set or cleared the TS_USEDFPU
> bit, and it was not at all clear at a "local" setting what the F was
> going on.
>
> Most of the recent i387 changes were actually to clean up and make
> sense of that thing, and making sure that the clts() was paired with
> the action of actually giving the FPU to the thread etc. So at least
> now the core FPU handling is reasonably sane, and the clts's and
> stts's are paired with the things that take control of the FPU, and we
> have a few helper functions and some abstraction in place.
>
> The kvm code definitely needs the same kind of cleanup. Because as it
> is now, it's just random code junk, and there is no obvious reason why
> it wouldn't interact horribly badly with an interrupt doing
> "irq_fpu_usable()" + "kernel_fpu_begin/end()" for example.

Well, interrupted_kernel_fpu_idle() does look like it returns the wrong
result when kvm is active. Have the semantics of that changed in the
recent round? The kvm fpu code is quite old and we haven't had any
reports of bad interactions with RAID/encryption since it was stabilized.

> Seriously.

I agree a refactoring is needed. We may need to replace read_cr0() in

static inline bool interrupted_kernel_fpu_idle(void)
{
	return !(current_thread_info()->status & TS_USEDFPU) &&
		(read_cr0() & X86_CR0_TS);
}

with some percpu variable since CR0.TS is not reliable in interrupt
context while kvm is running.
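
A rough sketch of that direction, with every name hypothetical (nothing
below is in the tree): all the paths that legitimately touch CR0.TS
would go through wrappers that keep a percpu shadow in sync, and the
interrupt-context check would trust the shadow instead of the register.
kvm would additionally have to refresh the shadow after each guest exit,
since the hardware rewrites CR0 behind the kernel's back.

	/* Hypothetical percpu shadow of CR0.TS */
	DEFINE_PER_CPU(int, cr0_ts_shadow);

	static inline void shadow_clts(void)
	{
		clts();
		percpu_write(cr0_ts_shadow, 0);
	}

	static inline void shadow_stts(void)
	{
		stts();
		percpu_write(cr0_ts_shadow, 1);
	}

	static inline bool interrupted_kernel_fpu_idle(void)
	{
		return !__thread_has_fpu(current) &&
			percpu_read(cr0_ts_shadow);
	}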

--
error compiling committee.c: too many arguments to function

2012-02-28 18:29:22

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On 02/28/2012 08:08 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 9:37 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So where's the comment about why you actually own and control CR0.TS,
> > and nobody else does?
>
> So what I think KVM should strive for (but I really don't know the
> code, so maybe there are good reasons why it is impossible) is to just
> never touch TS at all, and let the core kernel code do it all for you.

Which TS? With kvm (restricting ourselves to vmx at this moment) there
are three versions lying around: CR0.TS, GUEST_CR0.TS (which is loaded
by the cpu during entry into the guest) and HOST_CR0.TS (which is loaded
by the cpu during guest exit). GUEST_CR0.TS is actually a combination
of the guest's virtualized CR0.TS, and a flag that says whether the
guest fpu is loaded or not. HOST_CR0 is basically a cached CR0, but as
it's expensive to change it, we don't want to reflect CR0.TS into
HOST_CR0.TS.

> When you need access to the FPU, let the core code just handle it for
> you. Let it trap and restore the state. When you get scheduled away,
> let the core code just set TS, because you really can't touch the FP
> state again.
>
> IOW, just do the FP operations you do within the thread you are. Never
> touch TS at all, just don't worry about it. Worry about your own
> internal FP state machine, but don't interact with the "global" kernel
> TS state machine.

I can't avoid touching it. On exit vmx will set it for me. I can
atomically copy CR0.TS into HOST_CR0.TS, but that's expensive.

Maybe we should just virtualize it into a percpu variable. Should speed
up the non-kvm case as well since read_cr0() is likely not very fast.

> You can't do a lot better than that, I think. Especially now that we
> do the lazy restore, we can schedule between two tasks and if only one
> of them actually uses the FPU, we won't bother with extraneous state
> restores.

Ah, this is a new bit, I'll have to study it.

> The one exception I can think of is that if you are loading totally
> *new* FP state, and you think that TS is likely to be set, instead of
> trapping (and loading the old state in the trap handling) only to
> return to load the *new* state, we could expose a helper for that
> situation. It would look something like
>
> user_fpu_begin();
> fpu_restore_checking(newfpustate);
>
> and it would avoid the trap when loading the new state.
>

--
error compiling committee.c: too many arguments to function

2012-02-28 18:35:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On Tue, Feb 28, 2012 at 10:09 AM, Avi Kivity <[email protected]> wrote:
>
> This is done by preempt notifiers.  Whenever a task switch happens we
> push the guest fpu state into memory (if loaded) and let the normal
> stuff happen.  So if we had a task switch during instruction
> emulation, for example, then we'd get the "glacial and stupid" path to fire.

Oh christ.

This is exactly what the scheduler has ALWAYS ALREADY DONE FOR YOU.

That's what the i387 save-and-restore code is all about. What's the
advantage of just re-implementing it in non-obvious ways?

Stop doing it. You get *zero* advantages from just doing what the
scheduler natively does for you, and the scheduler does it *better*.

Linus

2012-02-28 19:06:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On 02/28/2012 08:34 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 10:09 AM, Avi Kivity <[email protected]> wrote:
> >
> > This is done by preempt notifiers. Whenever a task switch happens we
> > push the guest fpu state into memory (if loaded) and let the normal
> > stuff happen. So if we had a task switch during instruction
> > emulation, for example, then we'd get the "glacial and stupid" path to fire.
>
> Oh christ.
>
> This is exactly what the scheduler has ALWAYS ALREADY DONE FOR YOU.

No, the scheduler saves the state into task_struct. I need it saved
into the vcpu structure. We have two fpu states, the user state, and
the guest state. APIs that take a task_struct as a parameter, or
reference current implicitly, aren't going to work.

> That's what the i387 save-and-restore code is all about. What's the
> advantage of just re-implementing it in non-obvious ways?
>
> Stop doing it. You get *zero* advantages from just doing what the
> scheduler natively does for you, and the scheduler does it *better*.

The scheduler does something different.

What I'd ideally want is

struct fpu {
	int cpu;			/* -1 = not loaded */
	union thread_xstate *state;
};

Perhaps with a struct fpu_ops *ops if needed. We could then let various
users' fpus float around freely and only save/load them at the last moment.
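
As a rough illustration of the "float around freely" part (hypothetical
helpers leaning on the existing fpu_save_init()/fpu_restore_checking(),
and ignoring the fpu_owner_task cross-check that the lazy restore patch
needs for full correctness):

	/* Hypothetical: push a floating fpu out to memory if it is live */
	static void fpu_unload(struct fpu *fpu)
	{
		if (fpu->cpu >= 0) {
			fpu_save_init(fpu);
			fpu->cpu = -1;
		}
	}

	/* Hypothetical: load it at the last moment, skipping the restore
	   if it is still resident on this cpu - the same trick as the
	   lazy restore patch */
	static void fpu_load(struct fpu *fpu, int cpu)
	{
		if (fpu->cpu == cpu)
			return;
		fpu_restore_checking(fpu);
		fpu->cpu = cpu;
	}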


--
error compiling committee.c: too many arguments to function

2012-02-28 19:26:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On Tue, Feb 28, 2012 at 11:06 AM, Avi Kivity <[email protected]> wrote:
>
> No, the scheduler saves the state into task_struct.  I need it saved
> into the vcpu structure.  We have two fpu states, the user state, and
> the guest state.  APIs that take a task_struct as a parameter, or
> reference current implicitly, aren't going to work.

As far as I can tell, you should do the saving into the vcpu structure
when you actually switch the thing around.

In fact, you can do it these days by just playing around with the
"tsk->thread.fpu.state" pointer, I guess.

But it all boils down to the fact that your code is not just ugly,
it's *buggy*. If you play around with setting TS, you *will* be hit by
interrupts etc that will start to use the FP code that you "don't
use".

And there is no excuse for you touching the host TS. The kernel does
that for you, and does it better. And caches the end result in
TS_USEDFPU (old) or in some variable that you shouldn't look at but
can access with the user_has_fpu() helpers.

Linus

2012-02-28 19:45:35

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 2/2] i387: split up <asm/i387.h> into exported and internal interfaces

On 02/28/2012 09:26 PM, Linus Torvalds wrote:
> On Tue, Feb 28, 2012 at 11:06 AM, Avi Kivity <[email protected]> wrote:
> >
> > No, the scheduler saves the state into task_struct. I need it saved
> > into the vcpu structure. We have two fpu states, the user state, and
> > the guest state. APIs that take a task_struct as a parameter, or
> > reference current implicitly, aren't going to work.
>
> As far as I can tell, you should do the saving into the vcpu structure
> when you actually switch the thing around.
>
> In fact, you can do it these days by just playing around with the
> "tsk->thread.fpu.state" pointer, I guess.

Good idea. I can't say I like poking into struct fpu's internals, but
we can treat it as an opaque structure and copy it around.

We can also do this in kernel_fpu_begin(), and allow it to be preemptible.

> But it all boils down to the fact that your code is not just ugly,
> it's *buggy*. If you play around with setting TS, you *will* be hit by
> interrupts etc that will start to use the FP code that you "don't
> use".
>
> And there is no excuse for you touching the host TS. The kernel does
> that for you, and does it better. And caches the end result in
> TS_USEDFPU (old) or in some variable that you shouldn't look at but
> can access with the user_has_fpu() helpers.

Again, I can't avoid touching it. I can try to get the hardware to
always preserve its value, but that comes with a cost.

btw, I think the current code is safe wrt kvm. If the guest fpu has
been loaded, then we know that TS_USEDFPU is set, since we will
have saved the user fpu earlier. Yes it's "accidental" and needs to be
improved, but I don't think it's a data corruptor waiting to happen.

--
error compiling committee.c: too many arguments to function