Currently fpu management is only lazy in one direction. When we switch into
a task, we may avoid loading the fpu state in the hope that the task will
never use it. If we guess right we save an fpu load/save cycle; if not,
a Device not Available exception will remind us to load the fpu.
However, in the other direction, fpu management is eager. When we switch out
of an fpu-using task, we always save its fpu state.
This is wasteful if the task(s) that run until we switch back in all don't use
the fpu, since we could have kept the task's fpu on the cpu all this time
and saved an fpu save/load cycle. This can be quite common with threaded
interrupts, but will also happen with normal kernel threads and even normal
user tasks.
This patch series converts task fpu management to be fully lazy. When
switching out of a task, we keep its fpu state on the cpu, only flushing it
if some other task needs the fpu.
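For orientation, here is a condensed view of what the switch-out/switch-in
paths look like with the series applied. This is an illustrative sketch only,
not code from the patches: switch_out_fpu()/switch_in_fpu() are made-up wrapper
names, while fpu_loaded() and fpu_restore_checking() are the helpers patch 3
introduces.

    /* Illustrative sketch -- the real code is in patches 3 and 4 below. */
    static inline void switch_out_fpu(struct task_struct *prev)
    {
        /* nothing: prev's registers stay live, tracked via prev->thread.fpu.cpu */
    }

    static inline void switch_in_fpu(struct task_struct *next)
    {
        if (fpu_loaded(&next->thread.fpu))
            clts();     /* the registers already hold next's state */
        else
            stts();     /* the first fp instruction faults (#NM) and
                         * fpu_restore_checking() pulls the state in,
                         * IPI'ing the remote cpu if necessary */
    }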
Open issues/TODO:
- patch 2 enables interrupts during #NM. There's a comment that says
it shouldn't be done, presumably because of old-style #FERR handling.
Need to fix one way or the other (dropping #FERR support, eagerly saving
state when #FERR is detected, or dropping the entire optimization on i386)
- flush fpu state on cpu offlining (trivial)
- make sure the AMD FXSAVE workaround still works correctly
- reduce IPIs by flushing fpu state when we know a task is being migrated
(guidance from scheduler folk appreciated)
- preemptible kernel_fpu_begin() to improve latency on raid and crypto setups
(will post patches)
- lazy host-side kvm fpu management (will post patches)
- accelerate signal delivery by allocating signal handlers their own fpu
state, and letting them run with the normal task's fpu until they use
an fp instruction (will generously leave to interested parties)
Avi Kivity (4):
x86, fpu: merge __save_init_fpu() implementations
x86, fpu: run device not available trap with interrupts enabled
x86, fpu: Let the fpu remember which cpu it is active on
x86, fpu: don't save fpu state when switching from a task
arch/x86/include/asm/i387.h | 126 +++++++++++++++++++++++++++++++++-----
arch/x86/include/asm/processor.h | 4 +
arch/x86/kernel/i387.c | 3 +
arch/x86/kernel/process.c | 1 +
arch/x86/kernel/process_32.c | 12 +++-
arch/x86/kernel/process_64.c | 13 +++--
arch/x86/kernel/traps.c | 13 ++---
7 files changed, 139 insertions(+), 33 deletions(-)
In order to allow a task's fpu state to fully float, we may need to
bring it back from another processor. To do that, we need interrupts to
be enabled so we can fire off an IPI to that processor.
May break 80386/7 combos with FERR# wired through the interrupt controller.
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kernel/traps.c | 13 +++++--------
1 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 142d70c..c7d67cb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -740,7 +740,6 @@ asmlinkage void math_state_restore(void)
struct task_struct *tsk = thread->task;
if (!tsk_used_math(tsk)) {
- local_irq_enable();
/*
* does a slab alloc which can sleep
*/
@@ -751,7 +750,6 @@ asmlinkage void math_state_restore(void)
do_group_exit(SIGKILL);
return;
}
- local_irq_disable();
}
clts(); /* Allow maths ops (or we recurse) */
@@ -774,21 +772,20 @@ void math_emulate(struct math_emu_info *info)
dotraplinkage void __kprobes
do_device_not_available(struct pt_regs *regs, long error_code)
{
+ preempt_disable();
+ local_irq_enable();
#ifdef CONFIG_X86_32
if (read_cr0() & X86_CR0_EM) {
struct math_emu_info info = { };
- conditional_sti(regs);
-
info.regs = regs;
math_emulate(&info);
- } else {
- math_state_restore(); /* interrupts still off */
- conditional_sti(regs);
- }
+ } else
+ math_state_restore();
#else
math_state_restore();
#endif
+ preempt_enable();
}
#ifdef CONFIG_X86_32
--
1.7.1
Currently, we load the fpu state lazily when switching into a task: usually
we leave the fpu state in memory and only load it on demand.
However, when switching out of an fpu-using task, we eagerly save the fpu
state to memory. This can be detrimental if we'll switch right back to this
task without touching the fpu again - we'll have run a save/load cycle for
nothing.
This patch changes fpu saving on switch out to be lazy - we simply leave the
fpu state alone. If we're lucky, when we're back in this task the fpu state
will be loaded. If not, the fpu API will save the current fpu state and load
our state back.
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kernel/process_32.c | 12 ++++++++----
arch/x86/kernel/process_64.c | 13 ++++++++-----
2 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8d12878..4cb5bc4 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -302,10 +302,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
* If the task has used fpu the last 5 timeslices, just do a full
* restore of the math state immediately to avoid the trap; the
* chances of needing FPU soon are obviously high now
+ *
+ * If the fpu is remote, we can't preload it since that requires an
+ * IPI. Let a math exception move it locally.
*/
- preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
-
- __unlazy_fpu(prev_p);
+ preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5
+ && !fpu_remote(&next->fpu);
/* we're going to use this soon, after a few expensive things */
if (preload_fpu)
@@ -351,8 +353,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/* If we're going to preload the fpu context, make sure clts
is run while we're batching the cpu state updates. */
- if (preload_fpu)
+ if (preload_fpu || fpu_loaded(&next->fpu))
clts();
+ else
+ stts();
/*
* Leave lazy mode, flushing any hypercalls made here.
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3c2422a..65d2130 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -383,8 +383,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
* If the task has used fpu the last 5 timeslices, just do a full
* restore of the math state immediately to avoid the trap; the
* chances of needing FPU soon are obviously high now
+ *
+ * If the fpu is remote, we can't preload it since that requires an
+ * IPI. Let a math exception move it locally.
*/
- preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
+ preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5
+ && !fpu_remote(&next->fpu);
/* we're going to use this soon, after a few expensive things */
if (preload_fpu)
@@ -418,12 +422,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
load_TLS(next, cpu);
- /* Must be after DS reload */
- unlazy_fpu(prev_p);
-
/* Make sure cpu is ready for new context */
- if (preload_fpu)
+ if (preload_fpu || fpu_loaded(&next->fpu))
clts();
+ else
+ stts();
/*
* Leave lazy mode, flushing any hypercalls made here.
--
1.7.1
Add a member fpu->cpu to struct fpu, recording which cpu has this fpu
register set loaded (or -1 if the registers were flushed to memory in
fpu->state).
The various fpu accessors are modified to IPI the loaded cpu if it
happens to be different from the current cpu.
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/i387.h | 115 +++++++++++++++++++++++++++++++++++--
arch/x86/include/asm/processor.h | 4 +
arch/x86/kernel/i387.c | 3 +
arch/x86/kernel/process.c | 1 +
4 files changed, 116 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index df5badf..124c89d 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -174,7 +174,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
#endif
}
-static inline void fpu_save_init(struct fpu *fpu)
+static inline void __fpu_save_init(struct fpu *fpu)
{
if (use_xsave())
fpu_xsave(fpu);
@@ -222,10 +222,7 @@ static inline int fxrstor_checking(struct i387_fxsave_struct *fx)
#define safe_address (kstat_cpu(0).cpustat.user)
#endif
-/*
- * These must be called with preempt disabled
- */
-static inline void fpu_save_init(struct fpu *fpu)
+static inline void __fpu_save_init(struct fpu *fpu)
{
if (use_xsave()) {
struct xsave_struct *xstate = &fpu->state->xsave;
@@ -273,6 +270,33 @@ end:
#endif /* CONFIG_X86_64 */
+static inline bool fpu_loaded(struct fpu *fpu)
+{
+ return fpu->cpu == smp_processor_id();
+}
+
+static inline bool fpu_remote(struct fpu *fpu)
+{
+ return fpu->cpu != -1 && fpu->cpu != smp_processor_id();
+}
+
+/*
+ * These must be called with preempt disabled
+ */
+static inline void fpu_save_init(struct fpu *fpu)
+{
+ ulong flags;
+
+ if (__get_cpu_var(current_fpu) != fpu
+ || fpu->cpu != smp_processor_id())
+ return;
+ local_irq_save(flags);
+ __fpu_save_init(fpu);
+ fpu->cpu = -1;
+ __get_cpu_var(current_fpu) = NULL;
+ local_irq_restore(flags);
+}
+
static inline void __save_init_fpu(struct task_struct *tsk)
{
fpu_save_init(&tsk->thread.fpu);
@@ -284,7 +308,7 @@ static inline int fpu_fxrstor_checking(struct fpu *fpu)
return fxrstor_checking(&fpu->state->fxsave);
}
-static inline int fpu_restore_checking(struct fpu *fpu)
+static inline int __fpu_restore_checking(struct fpu *fpu)
{
if (use_xsave())
return fpu_xrstor_checking(fpu);
@@ -292,6 +316,47 @@ static inline int fpu_restore_checking(struct fpu *fpu)
return fpu_fxrstor_checking(fpu);
}
+static inline void __fpu_unload(void *_fpu)
+{
+ struct fpu *fpu = _fpu;
+ unsigned cr0 = read_cr0();
+
+ if (cr0 & X86_CR0_TS)
+ clts();
+ if (__get_cpu_var(current_fpu) == fpu)
+ fpu_save_init(fpu);
+ if (cr0 & X86_CR0_TS)
+ write_cr0(cr0);
+}
+
+static inline void fpu_unload(struct fpu *fpu)
+{
+ int cpu = ACCESS_ONCE(fpu->cpu);
+
+ if (cpu != -1)
+ smp_call_function_single(cpu, __fpu_unload, fpu, 1);
+}
+
+static inline int fpu_restore_checking(struct fpu *fpu)
+{
+ ulong flags;
+ struct fpu *oldfpu;
+ int ret;
+
+ if (fpu->cpu == smp_processor_id())
+ return 0;
+ fpu_unload(fpu);
+ local_irq_save(flags);
+ oldfpu = __get_cpu_var(current_fpu);
+ if (oldfpu)
+ fpu_save_init(oldfpu);
+ ret = __fpu_restore_checking(fpu);
+ fpu->cpu = smp_processor_id();
+ __get_cpu_var(current_fpu) = fpu;
+ local_irq_restore(flags);
+ return ret;
+}
+
static inline int restore_fpu_checking(struct task_struct *tsk)
{
return fpu_restore_checking(&tsk->thread.fpu);
@@ -451,18 +516,46 @@ static bool fpu_allocated(struct fpu *fpu)
return fpu->state != NULL;
}
+static inline void fpu_init_empty(struct fpu *fpu)
+{
+ fpu->state = NULL;
+ fpu->cpu = -1;
+}
+
static inline int fpu_alloc(struct fpu *fpu)
{
if (fpu_allocated(fpu))
return 0;
fpu->state = kmem_cache_alloc(task_xstate_cachep, GFP_KERNEL);
+ fpu->cpu = -1;
if (!fpu->state)
return -ENOMEM;
WARN_ON((unsigned long)fpu->state & 15);
return 0;
}
-static inline void fpu_free(struct fpu *fpu)
+static inline void __fpu_forget(void *_fpu)
+{
+ struct fpu *fpu = _fpu;
+
+ if (fpu->cpu == smp_processor_id()) {
+ fpu->cpu = -1;
+ __get_cpu_var(current_fpu) = NULL;
+ }
+}
+
+static inline void fpu_forget(struct fpu *fpu)
+{
+ int cpu;
+
+ preempt_disable();
+ cpu = ACCESS_ONCE(fpu->cpu);
+ if (cpu != -1)
+ smp_call_function_single(cpu, __fpu_forget, fpu, 1);
+ preempt_enable();
+}
+
+static inline void __fpu_free(struct fpu *fpu)
{
if (fpu->state) {
kmem_cache_free(task_xstate_cachep, fpu->state);
@@ -470,8 +563,16 @@ static inline void fpu_free(struct fpu *fpu)
}
}
+static inline void fpu_free(struct fpu *fpu)
+{
+ fpu_forget(fpu);
+ __fpu_free(fpu);
+}
+
static inline void fpu_copy(struct fpu *dst, struct fpu *src)
{
+ fpu_unload(src);
+ fpu_unload(dst);
memcpy(dst->state, src->state, xstate_size);
}
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7e5c6a6..98996fe 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -378,8 +378,11 @@ union thread_xstate {
struct fpu {
union thread_xstate *state;
+ int cpu; /* -1 = unloaded */
};
+DECLARE_PER_CPU(struct fpu *, current_fpu);
+
#ifdef CONFIG_X86_64
DECLARE_PER_CPU(struct orig_ist, orig_ist);
@@ -892,6 +895,7 @@ static inline void spin_lock_prefetch(const void *x)
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
+ .fpu = { .cpu = -1, }, \
}
/*
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index c4444bc..e56f486 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -38,6 +38,9 @@
# define HAVE_HWFP 1
#endif
+DEFINE_PER_CPU(struct fpu *, current_fpu);
+EXPORT_PER_CPU_SYMBOL_GPL(current_fpu);
+
static unsigned int mxcsr_feature_mask __read_mostly = 0xffffffffu;
unsigned int xstate_size;
unsigned int sig_xstate_ia32_size = sizeof(struct _fpstate_ia32);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ebcfcce..16a7a9b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -35,6 +35,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
int ret;
*dst = *src;
+ fpu_init_empty(&dst->thread.fpu);
if (fpu_allocated(&src->thread.fpu)) {
memset(&dst->thread.fpu, 0, sizeof(dst->thread.fpu));
ret = fpu_alloc(&dst->thread.fpu);
--
1.7.1
The two __save_init_fpu() implementations are identical, merge them.
Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/i387.h | 11 ++---------
1 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 815c5b2..df5badf 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -184,12 +184,6 @@ static inline void fpu_save_init(struct fpu *fpu)
fpu_clear(fpu);
}
-static inline void __save_init_fpu(struct task_struct *tsk)
-{
- fpu_save_init(&tsk->thread.fpu);
- task_thread_info(tsk)->status &= ~TS_USEDFPU;
-}
-
#else /* CONFIG_X86_32 */
#ifdef CONFIG_MATH_EMULATION
@@ -277,15 +271,14 @@ end:
;
}
+#endif /* CONFIG_X86_64 */
+
static inline void __save_init_fpu(struct task_struct *tsk)
{
fpu_save_init(&tsk->thread.fpu);
task_thread_info(tsk)->status &= ~TS_USEDFPU;
}
-
-#endif /* CONFIG_X86_64 */
-
static inline int fpu_fxrstor_checking(struct fpu *fpu)
{
return fxrstor_checking(&fpu->state->fxsave);
--
1.7.1
On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
> Currently fpu management is only lazy in one direction. When we switch into
> a task, we may avoid loading the fpu state in the hope that the task will
> never use it. If we guess right we save an fpu load/save cycle; if not,
> a Device not Available exception will remind us to load the fpu.
>
> However, in the other direction, fpu management is eager. When we switch out
> of an fpu-using task, we always save its fpu state.
Does anybody have numbers on how many clocks it takes a modern CPU design
to do a FPU state save or restore? I know it must have been painful in the
days before cache memory, having to make added trips out to RAM for 128-bit
registers. But what's the impact today? (Yes, I see there's the potential
for a painful IPI call - anything else?)
Do we have any numbers on how many saves/restores this will save us when
running the hypothetical "standard Gnome desktop" environment? How common
is the "we went all the way around to the original single FPU-using task" case?
On 06/13/2010 11:45 PM, [email protected] wrote:
> On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
>
>> Currently fpu management is only lazy in one direction. When we switch into
>> a task, we may avoid loading the fpu state in the hope that the task will
>> never use it. If we guess right we save an fpu load/save cycle; if not,
>> a Device not Available exception will remind us to load the fpu.
>>
>> However, in the other direction, fpu management is eager. When we switch out
>> of an fpu-using task, we always save its fpu state.
>>
> Does anybody have numbers on how many clocks it takes a modern CPU design
> to do a FPU state save or restore?
About 320 cycles for a back-to-back save/restore round trip. Presumably less
on more modern hardware, more if the state is uncached, and more again on
hardware that also has to save the xsave header (8 bytes) and ymm state (256
bytes).
> I know it must have been painful in the
> days before cache memory, having to make added trips out to RAM for 128-bit
> registers. But what's the impact today?
I'd estimate between 300 and 600 cycles depending on the factors above.
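For anyone who wants a ballpark on their own machine, a minimal userspace
approximation might look like the sketch below. It is not the measurement
behind the numbers above, and it deliberately measures the cheap case:
cache-hot, fxsave-only, no xsave header or ymm state, no CR0.TS handling.

    /* fxtime.c: rough cycle count for a cache-hot fxsave+fxrstor pair.
     * Build with: gcc -O2 fxtime.c -o fxtime   (x86-64) */
    #include <stdio.h>
    #include <stdint.h>

    struct fxarea { uint8_t data[512]; } __attribute__((aligned(16)));
    static struct fxarea buf;

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        enum { ITER = 1000000 };
        uint64_t start, end;
        int i;

        start = rdtsc();
        for (i = 0; i < ITER; i++)
            asm volatile("fxsave %0; fxrstor %0" : "+m"(buf));
        end = rdtsc();

        printf("%.1f cycles per fxsave+fxrstor\n",
               (double)(end - start) / ITER);
        return 0;
    }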
> (Yes, I see there's the potential
> for a painful IPI call - anything else?)
>
The IPI is only taken after a task migration, hopefully a rare event.
The patchset also adds the overhead of irq save/restore. I think I can
remove that at the cost of some complexity, but prefer to start with a
simple approach.
> Do we have any numbers on how many saves/restores this will save us when
> running the hypothetical "standard Gnome desktop" environment?
The potential is proportional to the number of context switches per second.
In a desktop environment I don't see much potential for a throughput
improvement, rather a latency reduction from making the crypto threads
preemptible and from reducing context switch times.
Servers with high context switch rates, esp. with real-time preemptible
kernels (due to threaded interrupts), will see throughput gains. And,
of course, kvm will benefit from not needing to switch the fpu when
going from guest to host userspace or to a host kernel thread (vhost-net).
> How common
> is the "we went all the way around to the original single FPU-using task" case?
>
When your context switch is due to an oversubscribed cpu, not very
common. When it is due to the need to service an event and go back to
sleep, very common.
--
error compiling committee.c: too many arguments to function
On 06/13/2010 06:03 PM, Avi Kivity wrote:
> Currently fpu management is only lazy in one direction. When we switch into
> a task, we may avoid loading the fpu state in the hope that the task will
> never use it. If we guess right we save an fpu load/save cycle; if not,
> a Device not Available exception will remind us to load the fpu.
>
> However, in the other direction, fpu management is eager. When we switch out
> of an fpu-using task, we always save its fpu state.
>
> This is wasteful if the task(s) that run until we switch back in all don't use
> the fpu, since we could have kept the task's fpu on the cpu all this time
> and saved an fpu save/load cycle. This can be quite common with threaded
> interrupts, but will also happen with normal kernel threads and even normal
> user tasks.
>
> This patch series converts task fpu management to be fully lazy. When
> switching out of a task, we keep its fpu state on the cpu, only flushing it
> if some other task needs the fpu.
>
Ingo, Peter, any feedback on this?
--
error compiling committee.c: too many arguments to function
On 06/16/2010 12:24 AM, Avi Kivity wrote:
>
> Ingo, Peter, any feedback on this?
>
Conceptually, this makes sense to me. However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU. That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> On 06/16/2010 12:24 AM, Avi Kivity wrote:
>
>> Ingo, Peter, any feedback on this?
>>
> Conceptually, this makes sense to me. However, I have a concern what
> happens when a task is scheduled on another CPU, while its FPU state is
> still in registers in the original CPU. That would seem to require
> expensive IPIs to spill the state in order for the rescheduling to
> proceed, and this could really damage performance.
>
Right, this optimization isn't free.
I think the tradeoff is favourable since task migrations are much less
frequent than context switches within the same cpu, can the scheduler
experts comment?
We can also mitigate some of the IPIs if we know that we're migrating on
the cpu we're migrating from (i.e. we're pushing tasks to another cpu,
not pulling them from their cpu). Is that a common case, and if so,
where can I hook a call to unlazy_fpu() (or its new equivalent)?
Note that kvm on intel has exactly the same issue (the VMPTR and VMCS
are on-chip registers that are expensive to load and save, so we keep
them loaded even while not scheduled, and IPI if we notice we've
migrated; note that architecturally the cpu can cache multiple VMCSs
simultaneously (though I doubt they cache multiple VMCSs
microarchitecturally at this point)).
--
error compiling committee.c: too many arguments to function
(Cc:-ed various performance/optimization folks)
* Avi Kivity <[email protected]> wrote:
> On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> >On 06/16/2010 12:24 AM, Avi Kivity wrote:
> >>Ingo, Peter, any feedback on this?
> > Conceptually, this makes sense to me. However, I have a concern what
> > happens when a task is scheduled on another CPU, while its FPU state is
> > still in registers in the original CPU. That would seem to require
> > expensive IPIs to spill the state in order for the rescheduling to
> > proceed, and this could really damage performance.
>
> Right, this optimization isn't free.
>
> I think the tradeoff is favourable since task migrations are much
> less frequent than context switches within the same cpu, can the
> scheduler experts comment?
This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.
I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combined
with the fact that most micro-benchmarks wont be using the FPU, while in the
long run most processes will be using the FPU due to SIMD instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and i havent seen a _single_ performance measurement in the
submission mail. This is really essential.
So this does not look like a patch-set we could apply without gathering a
_ton_ of hard data about advantages and disadvantages.
> We can also mitigate some of the IPIs if we know that we're migrating on the
> cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
> pulling them from their cpu). Is that a common case, and if so, where can I
> hook a call to unlazy_fpu() (or its new equivalent)?
When the system goes from idle to less idle then most of the 'fast' migrations
happen on a 'push' model - on a busy CPU we wake up a new task and push it out
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
little cost.
But on busy servers where most wakeups are IRQ based the chance of being on
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
CPUs.
If there's some sucky corner case in theory we could approach it statistically
and measure the ratio of fast vs. slow migration vs. local context switches -
but that looks a bit complex.
Dunno.
Ingo
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
>
> (Cc:-ed various performance/optimization folks)
>
> * Avi Kivity <[email protected]> wrote:
>
> > On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> > >On 06/16/2010 12:24 AM, Avi Kivity wrote:
> > >>Ingo, Peter, any feedback on this?
> > > Conceptually, this makes sense to me. However, I have a concern what
> > > happens when a task is scheduled on another CPU, while its FPU state is
> > > still in registers in the original CPU. That would seem to require
> > > expensive IPIs to spill the state in order for the rescheduling to
> > > proceed, and this could really damage performance.
> >
> > Right, this optimization isn't free.
> >
> > I think the tradeoff is favourable since task migrations are much
> > less frequent than context switches within the same cpu, can the
> > scheduler experts comment?
>
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
> these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
> various lmbench components, X benchmarks, tiobench - you name it. Combined
> with the fact that most micro-benchmarks wont be using the FPU, while in the
> long run most processes will be using the FPU due to SIMD instructions. So
> even a positive result might be skewed in practice. Has to be measured
> carefully IMO - and i havent seen a _single_ performance measurement in the
> submission mail. This is really essential.
It can be nice to code an absolute worst-case microbenchmark too.
Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contended resource (IO or mutex or whatever).
I suspect the main issues in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
Ingo Molnar, on Wed 16 Jun 2010 10:39:41 +0200, wrote:
> in the long run most processes will be using the FPU due to SIMD
> instructions.
I believe glibc already uses SIMD instructions for e.g. memcpy and
friends, i.e. basically all applications...
Samuel
On 06/16/2010 11:39 AM, Ingo Molnar wrote:
> (Cc:-ed various performance/optimization folks)
>
> * Avi Kivity<[email protected]> wrote:
>
>
>> On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
>>
>>> On 06/16/2010 12:24 AM, Avi Kivity wrote:
>>>
>>>> Ingo, Peter, any feedback on this?
>>>>
>>> Conceptually, this makes sense to me. However, I have a concern what
>>> happens when a task is scheduled on another CPU, while its FPU state is
>>> still in registers in the original CPU. That would seem to require
>>> expensive IPIs to spill the state in order for the rescheduling to
>>> proceed, and this could really damage performance.
>>>
>> Right, this optimization isn't free.
>>
>> I think the tradeoff is favourable since task migrations are much
>> less frequent than context switches within the same cpu, can the
>> scheduler experts comment?
>>
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
> these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
> various lmbench components, X benchmarks, tiobench - you name it. Combined
> with the fact that most micro-benchmarks wont be using the FPU, while in the
> long run most processes will be using the FPU due to SIMD instructions. So
> even a positive result might be skewed in practice. Has to be measured
> carefully IMO - and i havent seen a _single_ performance measurement in the
> submission mail. This is really essential.
>
I have really no idea what to measure. Which would you most like to see?
> So this does not look like a patch-set we could apply without gathering a
> _ton_ of hard data about advantages and disadvantages.
>
I agree (not to mention that I'm not really close to having an applyable
patchset).
Note some of the advantages will not be in throughput but in latency
(making kernel_fpu_begin() preemptible, and reducing context switch time
for event threads).
>> We can also mitigate some of the IPIs if we know that we're migrating on the
>> cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
>> pulling them from their cpu). Is that a common case, and if so, where can I
>> hook a call to unlazy_fpu() (or its new equivalent)?
>>
> When the system goes from idle to less idle then most of the 'fast' migrations
> happen on a 'push' model - on a busy CPU we wake up a new task and push it out
> to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
> little cost.
>
Can you point me to the code which does this?
> But on busy servers where most wakeups are IRQ based the chance of being on
> the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
> CPUs.
>
But don't we usually avoid pulls due to NUMA and cache considerations?
> If there's some sucky corner case in theory we could approach it statistically
> and measure the ratio of fast vs. slow migration vs. local context switches -
> but that looks a bit complex.
>
>
I certainly wouldn't want to start with it.
> Dunno.
>
--
error compiling committee.c: too many arguments to function
On 06/16/2010 12:10 PM, Nick Piggin wrote:
>
>> This cannot be stated categorically without precise measurements of
>> known-good, known-bad, average FPU usage and average CPU usage scenarios. All
>> these workloads have different characteristics.
>>
>> I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
>> various lmbench components, X benchmarks, tiobench - you name it. Combined
>> with the fact that most micro-benchmarks wont be using the FPU, while in the
>> long run most processes will be using the FPU due to SIMD instructions. So
>> even a positive result might be skewed in practice. Has to be measured
>> carefully IMO - and i havent seen a _single_ performance measurement in the
>> submission mail. This is really essential.
>>
> It can be nice to code an absolute worst-case microbenchmark too.
>
Sure.
> Task migration can actually be very important to the point of being
> almost a fastpath in some workloads where threads are oversubscribed to
> CPUs and blocking on some contended resource (IO or mutex or whatever).
> I suspect the main issues in that case is the actual context switching
> and contention, but it would be nice to see just how much slower it
> could get.
>
If it's just cpu oversubscription then the IPIs will be limited by the
rebalance rate and the time slice, so as you say it has to involve
contention and frequent wakeups as well as heavy cpu usage. That won't
be easy to code. Can you suggest an existing benchmark to run?
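One possible shape for the worst-case microbenchmark Nick suggested - more
FPU-touching threads than cpus, all blocking on a single contended mutex so
they wake and migrate constantly - is sketched below. It is untested and
illustrative only; the thread multiplier and iteration count are arbitrary.

    /* fpu-pingpong.c: oversubscribed FPU-using threads fighting over one mutex.
     * Build with: gcc -O2 -pthread fpu-pingpong.c -o fpu-pingpong */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ITERS 200000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static volatile double sink;

    static void *worker(void *arg)
    {
        double x = 1.0;
        int i;

        for (i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);
            x = x * 1.000001 + 0.5;     /* keep live fpu/sse state */
            pthread_mutex_unlock(&lock);
            sched_yield();              /* encourage switches/migrations */
        }
        sink = x;                       /* defeat the optimizer */
        return NULL;
    }

    int main(void)
    {
        long nthreads = 4 * sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t *tids = calloc(nthreads, sizeof(*tids));
        long i;

        for (i = 0; i < nthreads; i++)
            pthread_create(&tids[i], NULL, worker, NULL);
        for (i = 0; i < nthreads; i++)
            pthread_join(tids[i], NULL);
        printf("done, %ld threads\n", nthreads);
        return 0;
    }

Comparing 'perf stat' output (context switches, migrations, cycles) with and
without the patchset on such a load would at least bound the IPI cost.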
--
error compiling committee.c: too many arguments to function
On 06/16/2010 12:01 PM, Samuel Thibault wrote:
> Ingo Molnar, on Wed 16 Jun 2010 10:39:41 +0200, wrote:
>
>> in the long run most processes will be using the FPU due to SIMD
>> instructions.
>>
> I believe glibc already uses SIMD instructions for e.g. memcpy and
> friends, i.e. basically all applications...
>
I think they ought to be using 'rep movs' on newer processors, but yes
you're right.
--
error compiling committee.c: too many arguments to function
> But on busy servers where most wakeups are IRQ based the chance of being on
> the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
> CPUs.
That doesn't seem right. If the server is busy with FPU-using tasks, then
the FPU state has already been swapped out, and no IPI is necessary.
The worst-case seems to be a lot of non-FPU CPU hogs, and a few FPU-using tasks
that get bounced around the CPUs like pinballs.
It is an explicit scheduler goal to keep tasks on the same CPU across
schedules, so they get to re-use their cache state. The IPI only happens
when that goal is not met, *and* the FPU state has not been forced out
by another FPU-using task.
Not completely trivial to arrange.
(A halfway version of this optimization which would avoid the need for
an IPI would be *save* the FPU state, but mark it "clean", so the re-load
can be skipped if we're lucky. If the code supported this as well as the
IPI alternative, you could make a heuristic guess at switch-out time
whether to save immediately or hope the odds of needing the IPI are less than
the fxsave/IPI cost ratio.)
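To make that halfway variant concrete, a rough illustration might look like
the sketch below. The still_loaded_on field and the per-cpu last_fpu variable
are hypothetical (not in the posted patches), __fpu_save_init() and
__fpu_restore_checking() are the helpers from patch 3, and the heuristic that
decides between this path and the pure-lazy one is left out.

    /* Illustrative only -- not from the posted patches. */
    DECLARE_PER_CPU(struct fpu *, last_fpu);        /* hypothetical */

    static inline void switch_out_fpu(struct fpu *fpu, int cpu)
    {
        __fpu_save_init(fpu);           /* memory copy valid: no IPI on migration */
        fpu->still_loaded_on = cpu;     /* but the registers are intact, too */
        per_cpu(last_fpu, cpu) = fpu;
    }

    static inline void switch_in_fpu(struct fpu *fpu, int cpu)
    {
        if (fpu->still_loaded_on == cpu && per_cpu(last_fpu, cpu) == fpu)
            return;                     /* lucky: registers untouched, skip reload */
        __fpu_restore_checking(fpu);
        fpu->still_loaded_on = cpu;
        per_cpu(last_fpu, cpu) = fpu;
    }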
On 06/16/2010 02:32 PM, George Spelvin wrote:
>
> (A halfway version of this optimization which would avoid the need for
> an IPI would be *save* the FPU state, but mark it "clean", so the re-load
> can be skipped if we're lucky. If the code supported this as well as the
> IPI alternative, you could make a heuristic guess at switch-out time
> whether to save immediately or hope the odds of needing the IPI are less than
> the fxsave/IPI cost ratio.)
>
That's an interesting optimization - and we already have something
similar in the form of fpu preload. Shouldn't be too hard to do.
--
error compiling committee.c: too many arguments to function
> That's an interesting optimization - and we already have something
> similar in the form of fpu preload. Shouldn't be too hard to do.
Unfortunately, there's no dirty flag for the preloaded FPU state.
Unless you take an interrupt, which defeats the whole purpose of
preload.
AFAIK, I should add; there's a lot of obscure stuff in the x86
system-level architecture. But a bit of searching around the source
didn't show me anything; once we've used the FPU for 5 context switches,
the kernel calls __math_state_restore when loading the new state,
which sets TS_USEDFPU.
(While you're mucking about in there, do you suppose the gas < 2.16
workaround in arch/x86/include/asm/i387.h:fxsave() can be yanked yet?)