LinuxLists.cc - [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Dave Jones wrote:
> I feel I'm missing something obvious here, but is this part the
> low-hanging fruit that it seems ?

You have eliminated one MSR write very cleanly, although there are
still a few unnecessary conditionals when compared with grabbing a
whole branch of the fruit tree, as it were.

That leaves the other MSR write, which is also unnecessary.

-- Jamie

2003-02-12 04:08:11

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Martin J. Bligh wrote:
> Since the SYSENTER/vsyscall support went in the 2.5 __switch_to/load_esp0
> function does two WRMSRs to rewrite MSR_IA32_SYSENTER_CS and
> MSR_IA32_SYSENTER_ESP. This is hidden in processor.h:load_esp0. WRMSR is
> very slow (60+ cycles) especially on a Pentium 4 and slows down the context
> switch considerably. This is a trade off between faster system calls using
> SYSENTER and slower context switches, but the context switches got unduly
> hit here.

<Boggle!> I'm amazed this slipped in.

> The reason it rewrites SYSENTER_CS is, non-obviously, vm86, which
> doesn't guarantee the MSR stays constant (sigh).

I am confused by your sentence. Can vm86 code alter the sysenter
MSRs? That should raise a GPF, surely... Or do you mean that the
code in vm86.c alters sysenter, because it calls disable_sysenter()?

> I think this would be better handled by having a global flag or
> process flag when any process uses vm86 and not do it when this flag
> is not set (as in 99% of all normal use cases)

Is there a bug?

I think there's a bug with CONFIG_PREEMPT - can someone confirm? The
kernel can be preempted after the call to disable_sysenter() in
vm86.c, and it will reschedule (see resume_kernel), and reload the
MSRs if I understand entry.S correctly.

So there needs to be a different way to set/clear the MSRs anyway.

Perhaps the debug register loads, ts_io_bitmap loads, and MSR loads
could all be efficiently conditional on a flag?

I.e., in __switch_to:

	#define SLOW_SWITCH_VM86	(1 << 0)
	#define SLOW_SWITCH_IO_BITMAP	(1 << 1)
	#define SLOW_SWITCH_DEBUG	(1 << 2)

	if (unlikely(prev->slow_switch | next->slow_switch)) {
		if (unlikely(next->slow_switch & SLOW_SWITCH_DEBUG)) {
			// ...
		}
		if (unlikely((prev->slow_switch ^ next->slow_switch)
			     & SLOW_SWITCH_IO_BITMAP)) {
			// ...
		}
		if (unlikely((prev->slow_switch ^ next->slow_switch)
			     & SLOW_SWITCH_VM86)) {
			if (next->slow_switch & SLOW_SWITCH_VM86)
				disable_sysenter();
			else
				enable_sysenter();
		}
	}

And whenever ts_io_bitmap or debugreg[7] are written to, recalculate
the value of slow_switch (bits 1 and 2). And set bit 0 in
do_sys_vm86, clear it in save_v86_state, and recalculate that bit in
restore_sigcontext.

That captures the rare cases, and ensures that the MSRs are always
clear in vm86 mode even if it is preempted, always set otherwise, and
not changed normally.

(The above assumes we revert to a trampoline stack, so the MSRs don't
have to be rewritten during normal context switches).
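
A rough userspace model of the VM86 part of that idea, for counting how
often the MSR would actually be touched. This is only a sketch: struct
task, model_switch_to() and the counters are made-up stand-ins, and the
"MSR write" is just an integer increment.

```c
#include <assert.h>

#define SLOW_SWITCH_VM86	(1 << 0)
#define SLOW_SWITCH_IO_BITMAP	(1 << 1)
#define SLOW_SWITCH_DEBUG	(1 << 2)

/* Hypothetical stand-in for a task; only the flag word matters here. */
struct task {
	unsigned slow_switch;
};

static int msr_writes;			/* simulated WRMSR count */
static int sysenter_enabled = 1;	/* simulated SYSENTER_CS != 0 */

static void disable_sysenter(void) { sysenter_enabled = 0; msr_writes++; }
static void enable_sysenter(void)  { sysenter_enabled = 1; msr_writes++; }

/* Models only the VM86 branch; the debug and I/O bitmap branches are omitted. */
static void model_switch_to(const struct task *prev, const struct task *next)
{
	if (!(prev->slow_switch | next->slow_switch))
		return;		/* common case: no rare feature in use */

	/* Touch the MSR only when the VM86 bit differs between the tasks. */
	if ((prev->slow_switch ^ next->slow_switch) & SLOW_SWITCH_VM86) {
		if (next->slow_switch & SLOW_SWITCH_VM86)
			disable_sysenter();
		else
			enable_sysenter();
	}
}
```

Switching between two ordinary tasks leaves msr_writes untouched in this
model; only entering or leaving a vm86 task costs a write.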

> It rewrites SYSENTER_ESP to the stack page of the current process.
> Previous implementations used a trampoline for that. The reason it was
> moved to the context switch was that an NMI could see the trampoline
> stack for one instruction, and if it then calls current (very unlikely)
> and references the stack pointer, it doesn't get a valid task_struct.
> The obvious solution would be to somehow check this case (e.g. by
> looking at esp) in the NMI slow path.

It's very easy to fix NMIs by either looking at EIP or ESP at the
start of the NMI handler. EIP is a bit simpler, because the address
range is fixed at link time and does not vary between CPUs (each CPU
needs its own 6-word trampoline).
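
The EIP test could look roughly like the following. This is a plain C
sketch under assumptions: struct text_range and nmi_effective_esp() are
invented names, the bounds would really come from linker symbols, and a
real fix would live in the assembly entry path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical link-time bounds of the sysenter entry stub: [start, end). */
struct text_range {
	uintptr_t start;
	uintptr_t end;
};

static bool eip_in_range(uintptr_t eip, const struct text_range *r)
{
	return eip >= r->start && eip < r->end;
}

/*
 * If the NMI interrupted the entry stub before it loaded the real kernel
 * stack, use esp0 from the TSS; otherwise trust the interrupted ESP.
 */
static uintptr_t nmi_effective_esp(uintptr_t eip, uintptr_t esp,
				   uintptr_t tss_esp0,
				   const struct text_range *stub)
{
	return eip_in_range(eip, stub) ? tss_esp0 : esp;
}
```

Because the range is the same on every CPU, the check needs no per-CPU
data, which is the reason given above for preferring EIP over ESP.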

With a trampoline stack, it's also necessary to fixup the case where a
Debug trap occurs at the start of the sysenter handler (in the debug
path), and when an NMI interrupts that debug path before it has fixed
up the stack.

A cute and wonderful hack is to use the 6 words in the TSS prior to
&tss->es as the trampoline. Now that __switch_to is done in software,
those words are not used for anything else. The nice thing is that a
single load from a fixed offset from ESP gets you the replacement
value of %esp0, i.e. the "real" kernel stack is loaded like this:

movl -68(%esp),%esp

Other fixed offsets from &tss->esp0 are possible - especially nice
would be to share a cache line with the GDT's hot cache line. (To do
this, place GDT before TSS, make KERNEL_CS near the end of the GDT,
and then the accesses to GDT, trampoline and tss->esp0 will all touch
the same cache line if you're lucky).
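
The -68 can be checked against the 32-bit hardware TSS layout. In the
sketch below, struct tss32 is an illustrative copy of that layout (not
the kernel's struct tss_struct), and it assumes SYSENTER_ESP is pointed
at &tss->es so that the six words below it serve as the trampoline stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of the 32-bit hardware TSS; field names are assumptions. */
struct tss32 {
	uint32_t back_link;
	uint32_t esp0, ss0;
	uint32_t esp1, ss1;
	uint32_t esp2, ss2;
	uint32_t cr3, eip, eflags;
	uint32_t eax, ecx, edx, ebx;
	uint32_t esp, ebp, esi, edi;
	uint32_t es, cs, ss, ds, fs, gs;
	uint32_t ldt;
	uint32_t trace_and_iomap_base;
};

/* Distance from &tss->es (assumed SYSENTER_ESP) down to &tss->esp0. */
static size_t esp0_displacement(void)
{
	return offsetof(struct tss32, es) - offsetof(struct tss32, esp0);
}

/* Number of 32-bit words from &tss->edx up to (not including) &tss->es. */
static size_t trampoline_words(void)
{
	return (offsetof(struct tss32, es) - offsetof(struct tss32, edx)) / 4;
}
```

With this layout esp0_displacement() is 72 - 4 = 68 and
trampoline_words() is 6, matching the numbers quoted above.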

The fixup cases for NMI and debug are a bit tricky but not _that_
tricky.

Enjoy,
-- Jamie

2003-02-12 05:43:07

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

In article <[email protected]>,
Jamie Lokier <[email protected]> wrote:
>Dave Jones wrote:
>> I feel I'm missing something obvious here, but is this part the
>> low-hanging fruit that it seems ?
>
>You have eliminated one MSR write very cleanly, although there are
>still a few unnecessary conditionals when compared with grabbing a
>whole branch of the fruit tree, as it were.
>
>That leaves the other MSR write, which is also unnecessary.

No, the other one _is_ necessary. I did timings, and having it in the
context switch path made system calls clearly faster on a P4 (as
compared to my original trampoline approach).

It may be only two instructions difference ("movl xx,%esp ; jmp common")
in the system call path, but it was much more than two cycles. I don't
know why, but I assume the system call causes a total pipeline flush,
and then the immediate jmp basically means that the P4 has a hard time
getting the pipe restarted.

This might be fixable by moving more (all?) of the kernel-side fast
system call code into the per-cpu trampoline page, so that you wouldn't
have the immediate jump. Somebody needs to try it and time it, otherwise
the wrmsr stays in the context switch.

I want fast system calls. Most people don't see it yet (because you need
a glibc that takes advantage of it), but those fast system calls are
more than enough to make up for some scheduling overhead.

Linus

2003-02-12 05:48:11

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

In article <[email protected]>,
Jamie Lokier <[email protected]> wrote:
>
>A cute and wonderful hack is to use the 6 words in the TSS prior to
>&tss->es as the trampoline. Now that __switch_to is done in software,
>those words are not used for anything else.

No!!

That's not cute and wonderful, that's _horrible_.

Mixing data and code on the same page is very very slow on a P4 (well, I
think it's "same half-page", but the point is that you should not EVER
mix data and code - it ends up being slow on modern CPU's).

>Other fixed offsets from &tss->esp0 are possible - especially nice
>would be to share a cache line with the GDT's hot cache line. (To do
>this, place GDT before TSS, make KERNEL_CS near the end of the GDT,
>and then the accesses to GDT, trampoline and tss->esp0 will all touch
>the same cache line if you're lucky).

Since almost all x86 CPU's have some kind of cacheline exclusion policy
between the I$ and the D$ (to handle the strict x86 I$ coherency
requirements), your "if you're lucky" is completely bogus. In fact,
you'd get the _pessimal_ cache behaviour for something like that, i.e. you
get lines that ping-pong between the L2 and the two instruction caches.

Don't do it. Keep data and code on separate pages.

Linus

2003-02-12 07:41:05

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, Feb 12, 2003 at 02:59:02AM +0000, Dave Jones wrote:
> On Tue, Feb 11, 2003 at 05:35:43PM -0800, Martin J. Bligh wrote:
>
> > The reason it rewrites SYSENTER_CS is, non-obviously, vm86, which
> > doesn't guarantee the MSR stays constant (sigh). I think this would
> > be better handled by having a global flag or process flag when any process
> > uses vm86 and not do it when this flag is not set (as in 99% of all
> > normal use cases)
>
> I feel I'm missing something obvious here, but is this part the
> low-hanging fruit that it seems ?

Yes, I implemented a similar patch last night too. It also fixes a few other
fast-path problems in __switch_to:

- Fix false sharing in the GDT and replace an imul with a shift.
Really pad the GDT to cache lines now.

- Don't use LOCK prefixes in bit operations when accessing the
thread_info flags of the switched threads. LOCK is very slow on P4
and it isn't necessary here.

Really we should have __set_bit/__test_bit without memory barrier
and atomic stuff on each arch and use that for thread_info.h,
but for now do it this way.

[this is a port from x86-64]

- Inline FPU switch - it is only a few lines.
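
Two of the points above, sketched in userspace C. The helper names are
invented for illustration; TIF_USEDFPU_MASK plays the role of
_TIF_USEDFPU, and the flag helpers take the mask (1 << bit), not the bit
number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GDT_ENTRY_SIZE	8	/* bytes per descriptor */

/* Generic per-CPU GDT offset: needs a multiply when the size is 28 * 8. */
static size_t gdt_offset_mul(unsigned cpu, unsigned entries)
{
	return (size_t)cpu * entries * GDT_ENTRY_SIZE;
}

/* With 32 entries the per-CPU size is 32 * 8 = 256 = 1 << 8: a plain shift. */
static size_t gdt_offset_shift(unsigned cpu)
{
	return (size_t)cpu << 8;
}

static int is_power_of_two(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

#define TIF_USEDFPU		16
#define TIF_USEDFPU_MASK	(1u << TIF_USEDFPU)

/*
 * Plain, non-LOCKed flag helpers. Only valid when no other CPU can modify
 * the same flags word concurrently.
 */
static int test_flag_nonatomic(const uint32_t *flags, uint32_t mask)
{
	return (*flags & mask) != 0;
}

static void clear_flag_nonatomic(uint32_t *flags, uint32_t mask)
{
	*flags &= ~mask;
}
```

28 * 8 = 224 is not a power of two, while 32 * 8 = 256 is, which is what
lets the compiler index the per-CPU table with a shift.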

But I must say I don't know vm86() semantics well enough to know if this is
good enough, especially when to clear the TIF_VM86 flag. Could someone
more familiar with it review it?

BTW vm86.c at first look doesn't look very preempt safe to me.

comments?

-Andi

diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/cpu/common.c linux-2.5.60-work/arch/i386/kernel/cpu/common.c
--- linux-2.5.60/arch/i386/kernel/cpu/common.c 2003-02-10 19:37:57.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/cpu/common.c 2003-02-12 01:42:01.000000000 +0100
@@ -484,7 +484,7 @@ void __init cpu_init (void)
 		BUG();
 	enter_lazy_tlb(&init_mm, current, cpu);

-	load_esp0(t, thread->esp0);
+	load_esp0(current, t, thread->esp0);
 	set_tss_desc(cpu,t);
 	cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff;
 	load_TR_desc();
diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/i387.c linux-2.5.60-work/arch/i386/kernel/i387.c
--- linux-2.5.60/arch/i386/kernel/i387.c 2003-02-10 19:39:17.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/i387.c 2003-02-11 23:51:58.000000000 +0100
@@ -52,24 +52,6 @@ void init_fpu(struct task_struct *tsk)
  * FPU lazy state save handling.
  */

-static inline void __save_init_fpu( struct task_struct *tsk )
-{
-	if ( cpu_has_fxsr ) {
-		asm volatile( "fxsave %0 ; fnclex"
-			      : "=m" (tsk->thread.i387.fxsave) );
-	} else {
-		asm volatile( "fnsave %0 ; fwait"
-			      : "=m" (tsk->thread.i387.fsave) );
-	}
-	clear_tsk_thread_flag(tsk, TIF_USEDFPU);
-}
-
-void save_init_fpu( struct task_struct *tsk )
-{
-	__save_init_fpu(tsk);
-	stts();
-}
-
 void kernel_fpu_begin(void)
 {
 	preempt_disable();
diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/process.c linux-2.5.60-work/arch/i386/kernel/process.c
--- linux-2.5.60/arch/i386/kernel/process.c 2003-02-10 19:37:54.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/process.c 2003-02-12 01:40:02.000000000 +0100
@@ -437,7 +437,7 @@ void __switch_to(struct task_struct *pre
 	/*
 	 * Reload esp0, LDT and the page table pointer:
 	 */
-	load_esp0(tss, next->esp0);
+	load_esp0(prev_p, tss, next->esp0);

 	/*
 	 * Load the per-thread Thread-Local Storage descriptor.
diff -burpN -X ../KDIFX linux-2.5.60/arch/i386/kernel/vm86.c linux-2.5.60-work/arch/i386/kernel/vm86.c
--- linux-2.5.60/arch/i386/kernel/vm86.c 2003-02-10 19:37:58.000000000 +0100
+++ linux-2.5.60-work/arch/i386/kernel/vm86.c 2003-02-12 01:46:51.000000000 +0100
@@ -114,7 +117,7 @@ struct pt_regs * save_v86_state(struct k
 	}
 	tss = init_tss + smp_processor_id();
 	current->thread.esp0 = current->thread.saved_esp0;
-	load_esp0(tss, current->thread.esp0);
+	load_esp0(current, tss, current->thread.esp0);
 	current->thread.saved_esp0 = 0;
 	loadsegment(fs, current->thread.saved_fs);
 	loadsegment(gs, current->thread.saved_gs);
@@ -309,6 +313,10 @@ static inline void return_to_32bit(struc
 {
 	struct pt_regs * regs32;

+	/* FIXME should disable preemption here but how can we reenable it? */
+
+	enable_sysenter();
+
 	regs32 = save_v86_state(regs16);
 	regs32->eax = retval;
 	__asm__ __volatile__("movl %0,%%esp\n\t"
diff -burpN -X ../KDIFX linux-2.5.60/arch/x86_64/kernel/process.c linux-2.5.60-work/arch/x86_64/kernel/process.c
--- linux-2.5.60/arch/x86_64/kernel/process.c 2003-02-10 19:37:56.000000000 +0100
+++ linux-2.5.60-work/arch/x86_64/kernel/process.c 2003-02-12 01:51:00.000000000 +0100
@@ -41,6 +41,7 @@
 #include <linux/init.h>
 #include <linux/ctype.h>
 #include <linux/slab.h>
+#include <linux/thread_info.h>

 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/i387.h linux-2.5.60-work/include/asm-i386/i387.h
--- linux-2.5.60/include/asm-i386/i387.h 2003-02-10 19:38:49.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/i387.h 2003-02-12 01:21:13.000000000 +0100
@@ -21,23 +21,41 @@ extern void init_fpu(struct task_struct
 /*
  * FPU lazy state save handling...
  */
-extern void save_init_fpu( struct task_struct *tsk );
 extern void restore_fpu( struct task_struct *tsk );

 extern void kernel_fpu_begin(void);
 #define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)

+static inline void __save_init_fpu( struct task_struct *tsk )
+{
+	if ( cpu_has_fxsr ) {
+		asm volatile( "fxsave %0 ; fnclex"
+			      : "=m" (tsk->thread.i387.fxsave) );
+	} else {
+		asm volatile( "fnsave %0 ; fwait"
+			      : "=m" (tsk->thread.i387.fsave) );
+	}
+	tsk->thread_info->flags &= ~_TIF_USEDFPU;
+}
+
+static inline void save_init_fpu( struct task_struct *tsk )
+{
+	__save_init_fpu(tsk);
+	stts();
+}
+
+
 #define unlazy_fpu( tsk ) do { \
-	if (test_tsk_thread_flag(tsk, TIF_USEDFPU)) \
+	if ((tsk)->thread_info->flags & _TIF_USEDFPU) \
 		save_init_fpu( tsk ); \
 } while (0)

 #define clear_fpu( tsk ) \
 do { \
-	if (test_tsk_thread_flag(tsk, TIF_USEDFPU)) { \
+	if ((tsk)->thread_info->flags & _TIF_USEDFPU) { \
 		asm volatile("fwait"); \
-		clear_tsk_thread_flag(tsk, TIF_USEDFPU); \
+		(tsk)->thread_info->flags &= ~_TIF_USEDFPU; \
 		stts(); \
 	} \
 } while (0)
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/processor.h linux-2.5.60-work/include/asm-i386/processor.h
--- linux-2.5.60/include/asm-i386/processor.h 2003-02-10 19:37:57.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/processor.h 2003-02-12 01:52:28.000000000 +0100
@@ -408,20 +408,30 @@ struct thread_struct {
 	.io_bitmap = { [ 0 ... IO_BITMAP_SIZE ] = ~0 }, \
 }

-static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
-{
-	tss->esp0 = esp0;
-	if (cpu_has_sep) {
-		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
-		wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
-	}
-}
-
-static inline void disable_sysenter(void)
-{
-	if (cpu_has_sep)
-		wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);
-}
+#define load_esp0(prev, tss, _esp0) do { \
+	(tss)->esp0 = _esp0; \
+	if (cpu_has_sep) { \
+		if (unlikely((prev)->thread_info->flags & _TIF_VM86)) \
+			wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0); \
+		wrmsr(MSR_IA32_SYSENTER_ESP, (_esp0), 0); \
+	} \
+} while(0)
+
+/* The caller of the next two functions should have disabled preemption. */
+
+#define disable_sysenter() do { \
+	if (cpu_has_sep) { \
+		set_thread_flag(TIF_VM86); \
+		wrmsr(MSR_IA32_SYSENTER_CS, 0, 0); \
+	} \
+} while(0)
+
+#define enable_sysenter() do { \
+	if (cpu_has_sep) { \
+		clear_thread_flag(TIF_VM86); \
+		wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0); \
+	} \
+} while(0)

 #define start_thread(regs, new_eip, new_esp) do { \
 	__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \
diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/segment.h linux-2.5.60-work/include/asm-i386/segment.h
--- linux-2.5.60/include/asm-i386/segment.h 2003-02-10 19:38:06.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/segment.h 2003-02-11 23:56:37.000000000 +0100
@@ -67,7 +67,7 @@
 /*
  * The GDT has 25 entries but we pad it to cacheline boundary:
  */
-#define GDT_ENTRIES 28
+#define GDT_ENTRIES 32

 #define GDT_SIZE (GDT_ENTRIES * 8)

diff -burpN -X ../KDIFX linux-2.5.60/include/asm-i386/thread_info.h linux-2.5.60-work/include/asm-i386/thread_info.h
--- linux-2.5.60/include/asm-i386/thread_info.h 2003-02-10 19:37:59.000000000 +0100
+++ linux-2.5.60-work/include/asm-i386/thread_info.h 2003-02-12 01:51:26.000000000 +0100
@@ -111,15 +111,18 @@ static inline struct thread_info *curren
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
 #define TIF_SINGLESTEP		4	/* restore singlestep on return to user mode */
 #define TIF_IRET		5	/* return with iret */
+#define TIF_VM86		6	/* may use vm86 */
 #define TIF_USEDFPU		16	/* FPU was used by this task this quantum (SMP) */
 #define TIF_POLLING_NRFLAG	17	/* true if poll_idle() is polling TIF_NEED_RESCHED */

+
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
 #define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
 #define _TIF_IRET		(1<<TIF_IRET)
+#define _TIF_VM86		(1<<TIF_VM86)
 #define _TIF_USEDFPU		(1<<TIF_USEDFPU)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)

2003-02-12 10:01:28

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Linus Torvalds wrote:
> >That leaves the other MSR write, which is also unnecessary.
>
> No, the other one _is_ necessary. I did timings, and having it in the
> context switch path made system calls clearly faster on a P4 (as
> compared to my original trampoline approach).
>
> It may be only two instructions difference ("movl xx,%esp ; jmp common")
> in the system call path, but it was much more than two cycles. I don't
> know why, but I assume the system call causes a total pipeline flush,
> and then the immediate jmp basically means that the P4 has a hard time
> getting the pipe restarted.

The jump is not necessary, and you don't need to duplicate the system
call code either. You place an instruction like this at the start of
the system call code - this is _all_ that you need.

movl -60(%esp),%esp

The current task's esp0 is always stored in the TSS. We get that for
free. And you can point SYSENTER_ESP in, before or after the TSS too.
The trampoline stack needs exactly 6 words to handle debug and NMI.

The constant may vary according to how you lay things out, and you
might put it after cld;sti[2] in the entry code, but you get the idea.

I suspect you are right that it is the jump which is expensive - a
stack load _should_ be less than a cycle. Normal functions do it all
the time. But then a jump _should_ be less than a cycle too. Ah well!

(Of course even a single load of %esp, even if it turns out to be
cheap, can cost more on average than writing the MSR per context
switch.)

> This might be fixable by moving more (all?) of the kernel-side fast
> system call code into the per-cpu trampoline page, so that you wouldn't
> have the immediate jump. Somebody needs to try it and time it, otherwise
> the wrmsr stays in the context switch.

I have timed how long it takes to do sysenter, call a function in
kernel space and sysexit to return, complete with the above method of
stack setup (and the debug + NMI fixups). This is a module for <=2.4 kernels.

It takes ages - 82 cycles.

Here are my notes from last year (sorry, I don't have a P4):

Performance and emulation methods
---------------------------------

* On everything since later Pentium Pros from Intel, and since
the K7 from AMD, `sysenter' is available as a native instruction.

On my Celeron 366, it takes 82 (84.5 on an 800MHz P3) cycles to
enter the kernel, call an empty C function and return to
userspace. Compare this to 236 (242) cycles using `int $0x81' to
do the same thing.

* On old CPUs which don't support `sysenter', it is emulated
using the "illegal opcode" trap (#UD).

This is actually quite fast: the empty C function takes only 17
(16) cycles longer than `int $0x81'. Because classic system
calls use `int $0x80', you can see that emulating `sysenter'
would be a useful fallback method for userspace system calls.

* Don't take the cycle timings too seriously. They vary by about
8% according to the exact layout of the userspace code and also
from one module loading to the next (probably due to cache or TLB
colour effects). I haven't quoted the _best_ timings (which are
about 8% better than the ones I quoted), because they only occur
occasionally and cannot be repeated to order (you have to unload
and reload the module until the best timing appears).

> I want fast system calls. Most people don't see it yet (because you need
> a glibc that takes advantage of it), but those fast system calls are
> more than enough to make up for some scheduling overhead.

By the way, your suggestion of comparing %ds and %es to __USER_DS and
avoiding loading them if they are the expected values saves 8 cycles
on the two CPUs I did it on. Not loading them on exit, which you
already do, saves a further 10 cycles.

Because you are careful to disallow sysenter from vm86 mode,
transitions from vm86 _always_ go through the
interrupt/int$0x80/exception paths, which always reload %ds and %es.

So your concern about vm86 screwing with the cpu's internal segment
descriptors doesn't apply, AFAICT, to the sysenter path. (It probably
does apply to the interrupt and exception paths).

So here are some system call optimisation hints:

[0] Comparing %ds and %es to __USER_DS and not reloading them in the
sysenter path is safe, and worth 8 cycles on my CPUs.

[1] "movl %ds,%ebx" is about 9-10 cycles faster than "pushl %ds;
popl %ebx" for what it's worth. I think it's the pushl which is
slow but I haven't timed it by itself.

[2] Putting cli immediately before sti saves exactly 5 cycles on my
Celeron, and putting that just before the %esp load helps a little.
Cost of loading the flags register?
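
Hint [0] amounts to a compare-before-load. A small userspace model of
it follows; struct seg_model, the load counter and the function names are
invented for illustration, and the selector value is passed in by the
caller rather than taken from any kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the two data segment registers plus a load counter. */
struct seg_model {
	uint16_t ds;
	uint16_t es;
	unsigned loads;		/* simulated segment register loads */
};

static void load_seg(uint16_t *reg, uint16_t val, unsigned *loads)
{
	*reg = val;
	(*loads)++;
}

/* Reload %ds and %es only when they differ from the expected selector. */
static void entry_fixup_segments(struct seg_model *m, uint16_t expected)
{
	if (m->ds != expected)
		load_seg(&m->ds, expected, &m->loads);
	if (m->es != expected)
		load_seg(&m->es, expected, &m->loads);
}
```

In the usual case, where userspace left both registers at the expected
value, the model performs no loads at all.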

I am a wimp, or perhaps impatient, when it comes to the
compile-reboot-test cycle so I'm not likely to try the above any time
soon. But those are good hints if anyone (Ingo? :) wants to try them.

enjoy,
-- Jamie

2003-02-12 10:07:49

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Linus Torvalds wrote:
> In article <[email protected]>,
> Jamie Lokier <[email protected]> wrote:
> >
> >A cute and wonderful hack is to use the 6 words in the TSS prior to
> >&tss->es as the trampoline. Now that __switch_to is done in software,
> >those words are not used for anything else.
>
> No!!
>
> That's not cute and wonderful, that's _horrible_.

I meant: the trampoline _stack_ lives in the TSS.

There is no trampoline _code_.

My apologies for poor wording.

-- Jamie

2003-02-12 10:16:57

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Andi Kleen wrote:
> + /* FIXME should disable preemption here but how can we reenable it? */
> +
> + enable_sysenter();
> +

Try this:

1. Disable preemption in do_sys_vm86(), at the same place as
disable_sysenter() is called.

2. Enable preemption in save_v86_state(), and put the call
to enable_sysenter() there.

3. In restore_sigcontext() [signal.c], _iff_ the VM flag
is set in the restored context, call disable_sysenter()
and also disable preemption.

That should make vm86 simply disable preemption while it is activated.
It is not as nice as actually being preemptible, but safe first,
optimise later.

The return path to vm86 mode has the peculiar property of not doing
the need_resched test, unlike the return path to normal user space,
which is a boon here.

-- Jamie

2003-02-12 10:35:37

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, Feb 12, 2003 at 10:27:41AM +0000, Jamie Lokier wrote:
> Andi Kleen wrote:
> > + /* FIXME should disable preemption here but how can we reenable it? */
> > +
> > + enable_sysenter();
> > +
>
> Try this:

[...] I have no real interest in vm86 mode, perhaps one of the people
interested in dosemu etc. could take care of it. I'm very glad it doesn't
exist on my main architecture - x86-64 - given how many hacks it needs to be
supported.

I would like to have fast context switch on IA32 though so it would be nice
if someone deeply familiar with sys_vm86 could review my patch.

Avoiding the SYSENTER_CS MSR write is independent of the issues Linus raised.

-Andi

2003-02-12 12:48:50

by Dave Jones

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, Feb 12, 2003 at 04:21:43AM +0000, Jamie Lokier wrote:

> > I feel I'm missing something obvious here, but is this part the
> > low-hanging fruit that it seems ?
> You have eliminated one MSR write very cleanly, although there are
> still a few unnecessary conditionals when compared with grabbing a
> whole branch of the fruit tree, as it were.
>
> That leaves the other MSR write, which is also unnecessary.

Removing that one didn't seem quite so easy, so I wussed out.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2003-02-12 14:12:18

by Mika Penttilä

Subject: Re: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

BTW, why is sysenter supposed to be disabled while in vm86? And if it should be disabled (as now in sys_vm86), the next context switch back to the vm86 process re-enables it, in load_esp0, right? So what's the gain?

--Mika

2003-02-12 17:17:44

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

In article <[email protected]>,
Jamie Lokier <[email protected]> wrote:
>
>I meant: the trampoline _stack_ lives in the TSS.
>
>There is no trampoline _code_.

Ahh, ok. That sounds quite doable, and all my complaints go away.

It still leaves the debug exception and NMI issue.

The debug exception case is easy to trigger: use gdb to single-step
through the user-level fast system call code, and you _will_ get a debug
exception on the very first kernel instruction (which is also the one
that doesn't have a valid stack).

So anybody want to actually try to implement this?

Linus

2003-02-12 18:12:31

Subject: Re: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

[email protected] wrote:
> BTW, why is sysenter supposed to be disabled while in vm86? And if
> it should be disabled (as now in sys_vm86), the next context switch
> back to the vm86 process re-enables it, in load_esp0, right? So what's
> the gain?

I quite misread the code in entry.S at ret_from_int. The comment
"returning to kernel or vm86-space" confused me - of course scheduling
happens in vm86 mode.

(Andi et al., forget anything I've said about CONFIG_PREEMPT problems! :).

-- Jamie

2003-02-12 18:08:58

by Dave Jones

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, Feb 12, 2003 at 06:52:00PM +0100, Ingo Oeser wrote:
> On Wed, Feb 12, 2003 at 11:45:08AM +0100, Andi Kleen wrote:
> > [...] I have no real interest in vm86 mode, perhaps one of the people
> > interested in dosemu etc. could take care of it. I'm very glad it doesn't
> > exist on my main architecture - x86-64 - given how many hacks it needs to be
> > supported.
>
> So what about making it a CONFIG_XXX option? The few dosemu users
> could then configure it in.

Overkill. Andi's TIF_VM86 fix looks to be the nicest way to do it.
If you don't use dosemu etc, the wrmsr should never be hit.
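
To put numbers on that, here is a userspace model of the patched
load_esp0 that simply counts simulated MSR writes. All names prefixed
with model_ or MODEL_ are invented, and MODEL_KERNEL_CS is a placeholder
value, not the real selector.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_TIF_VM86	(1u << 6)	/* mirrors bit 6 from the patch */
#define MODEL_KERNEL_CS	0x60u		/* placeholder selector value */

struct model_cpu {
	uint32_t sysenter_cs;
	uint32_t sysenter_esp;
	unsigned wrmsr_count;	/* simulated WRMSR instructions executed */
};

struct model_task {
	uint32_t flags;
	uint32_t esp0;
};

static void model_wrmsr(struct model_cpu *c, uint32_t *msr, uint32_t val)
{
	*msr = val;
	c->wrmsr_count++;
}

/* Same shape as the patched macro: the CS write only happens after vm86. */
static void model_load_esp0(struct model_cpu *c,
			    const struct model_task *prev,
			    const struct model_task *next)
{
	if (prev->flags & MODEL_TIF_VM86)
		model_wrmsr(c, &c->sysenter_cs, MODEL_KERNEL_CS);
	model_wrmsr(c, &c->sysenter_esp, next->esp0);
}
```

In this model an ordinary switch costs one write (SYSENTER_ESP) instead
of two; the SYSENTER_CS write only shows up when the previous task had
the vm86 flag set.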

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2003-02-12 18:05:44

by Ingo Oeser

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, Feb 12, 2003 at 11:45:08AM +0100, Andi Kleen wrote:
> [...] I have no real interest in vm86 mode, perhaps one of the people
> interested in dosemu etc. could take care of it. I'm very glad it doesn't
> exist on my main architecture - x86-64 - given how many hacks it needs to be
> supported.

So what about making it a CONFIG_XXX option? The few dosemu users
could then configure it in.

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2003-02-12 18:09:06

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, Feb 12, 2003 at 06:52:00PM +0100, Ingo Oeser wrote:
> On Wed, Feb 12, 2003 at 11:45:08AM +0100, Andi Kleen wrote:
> > [...] I have no real interest in vm86 mode, perhaps one of the people
> > interested in dosemu etc. could take care of it. I'm very glad it doesn't
> > exist on my main architecture - x86-64 - given how many hacks it needs to be
> > supported.
>
> So what about making it a CONFIG_XXX option? The few dosemu users
> could then configure it in.

Doesn't help for precompiled distribution kernels, which is what the majority
of linux users run these days.

-Andi

2003-02-12 18:51:24

Subject: Re: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

[email protected] wrote:
> BTW, why is sysenter supposed to be disabled while in vm86?

The only reason is to make vm86-mode more "compatible". In other words,
trap the GP that happens if SYSENTER_CS is 0, and make sure vm86 mode
works the way it historically did.

We can choose to say "screw it", I guess, and just see if it actually
breaks anything.

Linus

2003-02-13 01:32:12

by Alan

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Wed, 2003-02-12 at 18:18, Andi Kleen wrote:
> > So what about making it a CONFIG_XXX option? The few dosemu users
> > could then configure it in.
>
> Doesn't help for precompiled distribution kernels, which is what the majority
> of linux users run these days.

XFree86 makes significant use of it, and its software x86 emulator isn't up to
replacing it on many cards (e.g. my C&T only works with vm86, not the emulator).
Obviously on x86-64 you have little choice, but for x86-32 it's somewhat
relevant.

2003-02-13 05:08:01

by Eric W. Biederman

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Andi Kleen <[email protected]> writes:

> On Wed, Feb 12, 2003 at 10:27:41AM +0000, Jamie Lokier wrote:
> > Andi Kleen wrote:
> > > + /* FIXME should disable preemption here but how can we reenable it? */
> > > +
> > > + enable_sysenter();
> > > +
> >
> > Try this:
>
> [...] I have no real interest in vm86 mode, perhaps one of the people
> interested in dosemu etc. could take care of it. I'm very glad it doesn't
> exist on my main architecture - x86-64 - given how many hacks it needs to be
> supported.

There is certainly some old cruft in there, but...

I have been thinking evil thoughts lately about what it would take
to implement on x86-64.

Switching in and out of long mode is evil enough that I don't think it
is worth it. And encouraging people to write good JIT compiling
emulators sounds much better, especially in the long run. But it can
be written.

Eric

2003-02-13 17:57:17

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

[Hmm, this is becoming a FAQ]

> Switching in and out of long mode is evil enough that I don't think it
> is worth it. And encouraging people to write good JIT compiling

Forget it. It is completely undefined in the architecture what happens
then. You'll lose interrupts and everything. Nothing for an operating
system intended to be stable.

I have no plans at all to even think about it for Linux/x86-64.

> emulators sounds much better, especially in the long run. But it can
> be written.

For DOS even a slow emulator should be good enough. After all, most
DOS programs are written for slow machines. Bochs running on a K8
will hopefully be fast enough. If not, a JIT can be written; perhaps
you can extend valgrind for it.

Or if you really rely on a DOS program executing fast you can
always boot a 32bit kernel which of course still supports vm86
in legacy mode.

-Andi

2003-02-14 00:05:28

by Peter Tattam

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Thu, 13 Feb 2003, Andi Kleen wrote:

> [Hmm, this is becoming a FAQ]
>
> > Switching in and out of long mode is evil enough that I don't think it
> > is worth it. And encouraging people to write good JIT compiling
>
> Forget it. It is completely undefined in the architecture what happens
> then. You'll lose interrupts and everything. Nothing for an operating
> system intended to be stable.
>
> I have no plans at all to even think about it for Linux/x86-64.

I have given this some thought even though the accepted wisdom is to avoid it.

As far as I can tell, there are only two critical structures which need to be
warped. You would need to have a legacy set of page tables ready to go, and a
new IDT which is used only for legacy mode. If an IRQ or exception happens in
legacy mode, you warp the CPU back to long mode to handle it and then warp
back.

The concept of warping the CPU is being formalized in the plex86 project anyway
and is likely to become more common as time goes on. (warping is replacing
GDT/IDT/LDT/CR3 etc by stubs which then warp back to the host when anything
"interesting" happens)

So as far as I can tell, a switch to v86 mode requires reloading page tables
(this would happen on a typical task context switch anyway), and switching the
IDT. GDT, LDT and TR can stay as is since these should trap to #GP which is
handled by the IDT change. I can't see why you would want to touch the PIC or
APIC at all, and this is usually what causes the loss of interrupts when
handling this kind of thing.

The only other unknown quantity is the time it takes for the CPU to
enable/disable long mode, but with modern CPU speeds, the interrupt latency may
only be mildly affected by such a process, unless the CPU is broken in some way.
I see no discussion in the AMD manuals regarding the cost of the mode switch,
only what AMD engineers have hinted at.

>
> > emulators sounds much better, especially in the long run. But it can
> > be written.
>
> For DOS even a slow emulator should be good enough. After all most
> DOS Programs are written for slow machines. Bochs running on a K8
> will hopefully be fast enough. If not, a JIT can be written; perhaps
> you can extend valgrind for it.
>
> Or if you really rely on a DOS program executing fast you can
> always boot a 32bit kernel which of course still supports vm86
> in legacy mode.

While an emulator sounds like a good idea, it is baggage that needs to be
included. JIT is probably overkill if the hardware can already do it.

If the use for running v86 code is infrequent, the cost in CPU cycles to change
modes may be negligible anyway.

If it's for regular use (e.g. an MSDOS box), I am sure the scheduler could take
into account that a v86 context switch is more expensive than a normal one and
steps could be taken to avoid it.

I contend that if the thunking code is reasonably well defined and thought out,
jumping in & out of long mode might not be as big a hassle as originally
thought.

I have a need to run v86 code from ring 0, so I'm not keen to slip other
people's code in there. This would mean I'd need to write a v86 emulator from
scratch which I think is more time than writing the warping code that I've
suggested.

I am going to have a go at doing it anyway and I'll let you know my results when I
get some real hardware.

>
> -Andi
>

Peter

--
Peter R. Tattam [email protected]
Managing Director, Trumpet Software International Pty Ltd
Hobart, Australia, Ph. +61-3-6245-0220, Fax +61-3-62450210

2003-02-14 01:19:40

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

> I have a need to run v86 code from ring 0, so I'm not keen to slip other
...

[for the unsuspecting readers - Peter is talking about non Linux here]

> people's code in there. This would mean I'd need to write a v86 emulator from
> scratch which I think is more time than writing the warping code that I've
> suggested.

Have you taken a look at valgrind? (http://developer.kde.org/~sewardj/)

It is a free software x86 JIT. I don't think it supports 16bit code currently,
but it probably wouldn't be too difficult to add. It wasn't primarily designed
for speed - its main application is to instrument programs - but its slowdown
compared to running on the real CPU is moderate and it's certainly fast enough
for anything designed to run on DOS.

-Andi

2003-02-14 01:41:55

by Eric Northup

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Thursday 13 February 2003 07:14 pm, Peter Tattam wrote:
> On Thu, 13 Feb 2003, Andi Kleen wrote:
> > [Hmm, this is becoming a FAQ]
> >
> > > Switching in and out of long mode is evil enough that I don't think it
> > > is worth it. And encouraging people to write good JIT compiling
> >
> > Forget it. It is completely undefined in the architecture what happens
> > then. You'll lose interrupts and everything. Nothing for an operating
> > system intended to be stable.
> >
> > I have no plans at all to even think about it for Linux/x86-64.
[snip]
>
> The only other unknown quantity is the time it takes for the CPU to
> enable/disable long mode, but with modern CPU speeds, the interrupt latency
> may only be mildly affected by such a process, unless the CPU is broken in
> some way. I see no discussion in the AMD manuals regarding the cost of the
> mode switch, only what AMD engineers have hinted at.

I think the real issue is that AMD neither recommends nor supports this
strategy. ( http://www.x86-64.org/lists/discuss/msg02964.html ... there were
better posts but I couldn't find them) People with real hardware can't talk
about it right now, but it seems to me this is just begging to get hit by
errata -- how much effort do you think team Hammer spent testing a subtle
mode transition which is marked "Don't do that!" ?

> > > emulators sounds much better, especially in the long run. But it can
> > > be written.
> >
> > For DOS even a slow emulator should be good enough. After all most
> > DOS Programs are written for slow machines. Bochs running on a K8
> > will hopefully be fast enough. If not, a JIT can be written; perhaps
> > you can extend valgrind for it.
> >
> > Or if you really rely on a DOS program executing fast you can
> > always boot a 32bit kernel which of course still supports vm86
> > in legacy mode.
>
> While an emulator sounds like a good idea, it is baggage that needs to be
> included. JIT is probably overkill if the hardware can already do it.

I am actually working on a dynamic translator for x86, and am starting with
16-bit real-mode. It's a bit OT for linux-kernel, and it's not done yet so
I'll spare you the details, but the point is that the kernel doesn't need to
do anything special to help an emulator/dynamic translator, and that it
*shouldn't* let you run real-mode code on the hardware.

> I contend that if the thunking code is reasonably well defined and thought
> out, jumping in & out of long mode might not be as big a hassle as
> originally thought.

Even the best code is subject to the limitations of the hardware it is run on.

-Eric

2003-02-14 01:51:53

by Peter Tattam

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Thu, 13 Feb 2003, Eric Northup wrote:

> On Thursday 13 February 2003 07:14 pm, Peter Tattam wrote:
> > On Thu, 13 Feb 2003, Andi Kleen wrote:
> > > [Hmm, this is becoming a FAQ]
> > >
> > > > Switching in and out of long mode is evil enough that I don't think it
> > > > is worth it. And encouraging people to write good JIT compiling
> > >
> > > Forget it. It is completely undefined in the architecture what happens
> > > then. You'll lose interrupts and everything. Nothing for an operating
> > > system intended to be stable.
> > >
> > > I have no plans at all to even think about it for Linux/x86-64.
> [snip]
> >
> > The only other unknown quantity is the time it takes for the CPU to
> > enable/disable long mode, but with modern CPU speeds, the interrupt latency
> > may only be mildly affected by such a process, unless the CPU is broken in
> > some way. I see no discussion in the AMD manuals regarding the cost of the
> > mode switch, only what AMD engineers have hinted at.
>
> I think the real issue is that AMD neither recommends nor supports this
> strategy. ( http://www.x86-64.org/lists/discuss/msg02964.html ... there were
> better posts but I couldn't find them) People with real hardware can't talk
> about it right now, but it seems to me this is just begging to get hit by
> errata -- how much effort do you think team Hammer spent testing a subtle
> mode transition which is marked "Don't do that!" ?
>

well, I guess AMD need to come out & explicitly state this somewhere other than
on a mailing list. I wouldn't be the only one tempted to see if it can be done,
and if it becomes "necessary" for some OSes, AMD will get locked into a
backward compatibility minefield. Anyone know what Windows 64 does about this
issue? If Microsoft considers that it is sufficient to warp the CPU for v86
emulation, it may just be a done deal.

Peter

--
Peter R. Tattam [email protected]
Managing Director, Trumpet Software International Pty Ltd
Hobart, Australia, Ph. +61-3-6245-0220, Fax +61-3-62450210

2003-02-14 03:57:22

by Thomas J. Merritt

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

|<><><><><> Original message from Peter Tattam <><><><><>
|On Thu, 13 Feb 2003, Eric Northup wrote:
|
|> On Thursday 13 February 2003 07:14 pm, Peter Tattam wrote:
|> > On Thu, 13 Feb 2003, Andi Kleen wrote:
|> > > [Hmm, this is becoming a FAQ]
|> > >
|> > > > Switching in and out of long mode is evil enough that I don't think it
|> > > > is worth it. And encouraging people to write good JIT compiling
|> > >
|> > > Forget it. It is completely undefined in the architecture what happens
|> > > then. You'll lose interrupts and everything. Nothing for an operating
|> > > system intended to be stable.
|> > >
|> > > I have no plans at all to even think about it for Linux/x86-64.
|> [snip]
|> >
|> > The only other unknown quantity is the time it takes for the CPU to
|> > enable/disable long mode, but with modern CPU speeds, the interrupt latency
|> > may only be mildly affected by such a process, unless the CPU is broken in
|> > some way. I see no discussion in the AMD manuals regarding the cost of the
|> > mode switch, only what AMD engineers have hinted at.
|>
|> I think the real issue is that AMD neither recommends nor supports this
|> strategy. ( http://www.x86-64.org/lists/discuss/msg02964.html ... there were
|
|> better posts but I couldn't find them) People with real hardware can't talk
|> about it right now, but it seems to me this is just begging to get hit by
|> errata -- how much effort do you think team Hammer spent testing a subtle
|> mode transition which is marked "Don't do that!" ?
|>
|
|well, I guess AMD need to come out & explicitly state this somewhere other than
|on a mailing list. I wouldn't be the only one tempted to see if it can be done,
|and if it becomes "necessary" for some OSes, AMD will get locked into a
|backward compatibility minefield. Anyone know what Windows 64 does about this
|issue? If Microsoft considers that it is sufficient to warp the CPU for v86
|emulation, it may just be a done deal.

The only way to get from long-mode back to legacy-mode is to reset the
processor. It can be done in software but you will likely lose interrupts.
Attempting to switch out of long-mode by modifying EFER will just get you a #GP
fault. You might want to read Volume 2 section 14.6.2.
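
The #GP mentioned above matches the architectural rule that EFER.LME may not be changed while paging is enabled. As a rough illustration only, the check can be modelled in C as below. The bit positions (CR0.PG is bit 31, EFER.LME is bit 8, EFER.LMA is bit 10) are the documented ones; the function name is hypothetical, and the model ignores every other condition that can fault on a real WRMSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG   (1u << 31)  /* paging enabled */
#define EFER_LME (1u << 8)   /* long mode enable */
#define EFER_LMA (1u << 10)  /* long mode active (set by the CPU) */

/*
 * Hypothetical model: returns true if writing new_efer would raise #GP
 * because it flips LME while paging is still on. Everything else about
 * the MSR write is deliberately ignored.
 */
static bool efer_write_faults(uint32_t cr0, uint32_t old_efer, uint32_t new_efer)
{
    bool lme_changes = ((old_efer ^ new_efer) & EFER_LME) != 0;
    bool paging_on = (cr0 & CR0_PG) != 0;

    return lme_changes && paging_on;
}
```

In this model, clearing LME with paging on faults, while a write that leaves LME alone does not.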

TJ Merritt
[email protected]
1-925-462-4300 x115

2003-02-14 08:18:51

by Eric W. Biederman

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Peter Tattam writes:

> On Thu, 13 Feb 2003, Andi Kleen wrote:
>
> > For DOS even a slow emulator should be good enough. After all most
> > DOS Programs are written for slow machines. Bochs running on a K8
> > will hopefully be fast enough. If not, a JIT can be written; perhaps
> > you can extend valgrind for it.

valgrind certainly sounds interesting.

> > Or if you really rely on a DOS program executing fast you can
> > always boot a 32bit kernel which of course still supports vm86
> > in legacy mode.
>
> While an emulator sounds like a good idea, it is baggage that needs to be
> included. JIT is probably overkill if the hardware can already do it.

A good JIT supporting infrastructure is a requirement of a high quality
implementation otherwise someone will complain. But at the same time
only the inner loops really need to be optimized. The rest of the code
will be much less noticeable.

> If the use for running v86 code is infrequent, the cost in CPU cycles to change
> modes may be negligible anyway.
>
> If it's for regular use (e.g. an MSDOS box), I am sure the scheduler could take
> into account that a v86 context switch is more expensive than a normal one and
> steps could be taken to avoid it.

Nope, that doesn't help. The trap rate is governed by the number of instructions
that must be emulated, not by the timer. I haven't profiled this closely but
it matches my experience with dosemu. Given the increasing cost of traps,
and the relatively fixed frequency of instructions that require traps to
be emulated, I believe it can be shown that using the native cpu is an
increasingly bad idea.

The worst case is emulating an EGA 4-plane mode, where it is actually
faster to emulate all of the instructions than to take a trap for each
instruction you must emulate.

> I contend that if the thunking code is reasonably well defined and thought out,
> jumping in & out of long mode might not be as big a hassle as originally
> thought.

I contend that v86 mode has stunted the growth of more powerful techniques on
Linux because v86 mode is so trivial to use. Not implementing v86 mode in
the kernel of x86-64 should encourage all of the nice techniques that need to
be built.

For most JIT recompilation all that needs to be written is a fast loop
scanning for instructions that must be emulated. If those instructions
are not present on a page the page is good to go. For questionable pages
you can do something else.
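
A minimal sketch of such a scanning loop, under stated assumptions: it does a plain byte scan instead of decoding instructions, so an operand byte that happens to match is reported as questionable (false positives only). The opcode list is an illustrative, incomplete set of one-byte opcodes (two-byte 0x0F opcodes and instructions straddling a page boundary are not handled), and the names `is_sensitive_opcode` and `page_is_clean` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define V86_PAGE_SIZE 4096u

/* Illustrative, incomplete list of one-byte opcodes a monitor must intercept. */
static bool is_sensitive_opcode(uint8_t b)
{
    switch (b) {
    case 0x9C: /* PUSHF */
    case 0x9D: /* POPF  */
    case 0xCC: /* INT3  */
    case 0xCD: /* INT n */
    case 0xCF: /* IRET  */
    case 0xF4: /* HLT   */
    case 0xFA: /* CLI   */
    case 0xFB: /* STI   */
        return true;
    default:
        /* IN/OUT with immediate port (0xE4-0xE7) or port in DX (0xEC-0xEF) */
        return (b >= 0xE4 && b <= 0xE7) || (b >= 0xEC && b <= 0xEF);
    }
}

/* True if no byte in the buffer matches; such a page is "good to go". */
static bool page_is_clean(const uint8_t *page, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (is_sensitive_opcode(page[i]))
            return false;
    return true;
}
```

A page for which `page_is_clean()` returns false is only questionable, not necessarily bad; it would need real instruction decoding or the "something else" mentioned above.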

> I have a need to run v86 code from ring 0, so I'm not keen to slip other
> people's code in there. This would mean I'd need to write a v86 emulator from
> scratch which I think is more time than writing the warping code that I've
> suggested.
>
> I am going to have a go at doing it anyway and I'll let you know my results when I
> get some real hardware.

You don't want to try it in the simulator? Is it too slow for you?

I have gone as far as testing switching in and out of long mode, and it is not
too difficult. But I have not set up exception handlers to reflect
things from 32bit mode to 64bit mode etc.

Eric

2003-02-14 09:30:01

by Firefly

Subject: Re: [discuss] Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Thu, 13 Feb 2003, Thomas J. Merritt wrote:

> The only way to get from long-mode back to legacy-mode is to reset the
> processor. It can be done in software but you will likely lose interrupts.

Smartdrv.sys and triple-faults come back, all is forgiven! ;)

-Peter

2003-03-10 02:59:03

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Ok Jamie,
since you've been interested in the past, I thought I'd ask you to test
the current context switch stuff. Andi cleaned up some FPU reload stuff
(and I fixed a bug in it, tssk tssk Andi - you'd obviously not actually
timed your cleanups), and I just committed and pushed out my "cache the
value of SYSENTER_CS in the TSS" patch.

It won't bring context switching back to where it _could_ be, but it
should be noticeably better. My pipe bandwidth is up from under 600MB/s
to about ~700MB/s according to lmbench.

Your SYSENTER_ESP hack would probably get back the rest, but I haven't
seen any patches for it, hint hint.

In the meantime, we're almost back to where we were _and_ we support
sysenter (ie my system calls are down by almost a factor of four). So
we're doing pretty well.
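
For readers following along, the idea of caching the SYSENTER_CS value so that the WRMSR is skipped when nothing changed can be sketched in user space roughly as follows. This is not the committed patch: the structs, `load_esp0_sketch()` and `fake_wrmsr_sysenter_cs()` are hypothetical stand-ins, and the MSR write is simulated with a counter.

```c
#include <stdint.h>

static unsigned wrmsr_count;      /* number of simulated WRMSR executions */
static uint32_t msr_sysenter_cs;  /* stand-in for MSR_IA32_SYSENTER_CS */

/* Simulated (and counted) MSR write; a real WRMSR is the expensive part. */
static void fake_wrmsr_sysenter_cs(uint32_t value)
{
    msr_sysenter_cs = value;
    wrmsr_count++;
}

struct tss_sketch {
    uint32_t esp0;
    uint32_t cached_sysenter_cs;  /* what the MSR currently holds */
};

struct thread_sketch {
    uint32_t esp0;
    uint32_t sysenter_cs;         /* 0 for a task that disabled sysenter (vm86) */
};

/* Switch to 'next': always update esp0, touch the MSR only on a change. */
static void load_esp0_sketch(struct tss_sketch *tss, const struct thread_sketch *next)
{
    tss->esp0 = next->esp0;
    if (tss->cached_sysenter_cs != next->sysenter_cs) {
        tss->cached_sysenter_cs = next->sysenter_cs;
        fake_wrmsr_sysenter_cs(next->sysenter_cs);
    }
}
```

Switching between ordinary tasks that share one SYSENTER_CS value costs no MSR write in this model; only a switch to or from a task with a different value (the vm86 case) pays for one.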

Linus

2003-03-10 10:56:21

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Mon, Mar 10, 2003 at 04:07:36AM +0100, Linus Torvalds wrote:
> since you've been interested in the past, I thought I'd ask you to test
> the current context switch stuff. Andi cleaned up some FPU reload stuff
> (and I fixed a bug in it, tssk tssk Andi - you'd obviously not actually
> timed your cleanups), and I just committed and pushed out my "cache the

You mean the TIF->_TIF thing? Yes that was wrong in the first patch,
but fixed in the patches later. Unfortunately the patch still
has the problem pointed out by Manfred Spraul: if you're unlucky
it could destroy the _TIF_SIGPENDING set by another CPU with the
non atomic access. Really thread_info should have two flag words:
one that is truly local and can be accessed without LOCK and
one that can be changed at will by external users too.

After some discussion with him I think the right fix for now is to
move it back to PF_USEDFPU in task_struct->flags.

Will submit a patch for that later, once I have been able to test it.

-Andi

2003-03-10 18:25:53

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Mon, 10 Mar 2003, Andi Kleen wrote:
>
> Unfortunately the patch still has the problem pointed out by Manfred
> Spraul: if you're unlucky it could destroy the _TIF_SIGPENDING set by
> another CPU with the non atomic access. Really thread_info should have
> two flag words: one that is truly local and can be accessed without LOCK
> and one that can be changed at will by external users too.

Yup, you're right.

I fixed that by splitting the "flags" field into two: "flags" is the old
flags, and "status" is thread-synchronous stuff (ie things that don't need
to worry about atomicity). Right now the FP lazy bit is the only thing
that is marked as thread-synchronous.
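
The split can be pictured with a small user-space sketch. It assumes C11 atomics as a stand-in for the kernel's locked bit operations; the struct, the bit values and the helper names are all hypothetical, not the kernel's definitions.

```c
#include <stdatomic.h>

#define SKETCH_SIGPENDING (1UL << 2)  /* hypothetical bit in "flags"  */
#define SKETCH_USEDFPU    (1UL << 0)  /* hypothetical bit in "status" */

struct thread_info_sketch {
    atomic_ulong  flags;   /* other CPUs may set bits here: atomic RMW only */
    unsigned long status;  /* thread-synchronous: only the owner touches it */
};

/* May run on any CPU, so it has to be an atomic read-modify-write. */
static void signal_from_other_cpu(struct thread_info_sketch *ti)
{
    atomic_fetch_or(&ti->flags, SKETCH_SIGPENDING);
}

/* Run only by the owning thread, so plain (unlocked) updates are enough. */
static void mark_fpu_used(struct thread_info_sketch *ti)
{
    ti->status |= SKETCH_USEDFPU;
}

static void clear_fpu_used(struct thread_info_sketch *ti)
{
    ti->status &= ~SKETCH_USEDFPU;
}
```

Because the FP bit lives in its own word, an unlocked update of it can no longer overwrite a pending-signal bit that another CPU sets in `flags` at the same moment.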

While going through the users I also noticed that fork() did the FPU
unlazy() thing totally wrong - it did the parent unlazy() _after_ it had
already copied the process flags to the child, so even though it copied
the x87 state to the child, the process flags could still say that the
child was using lazy state, and thus the FP state in the child was
basically totally corrupt. I wrote a test program to verify.

So I fixed that part too, by having a "prepare_to_copy()" function that
properly "calms down" the parent before we copy the task and thread
states. That fixes the bug, and also avoids an extra unnecessary x87 state
copy on x86.

(Not that the extra copy is noticeable - fork() is expensive enough
anyway. It might just _barely_ be noticeable on thread creation when we
don't have to worry about copying the VM state. But the bug was real,
and the simplification is an added bonus).
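
The ordering problem can be reproduced in a toy model. Everything here is hypothetical scaffolding (a single int stands in for the x87 state, a bool for the lazy bit); it only illustrates why the parent has to be "calmed down" before anything is copied.

```c
#include <stdbool.h>

struct task_sketch {
    bool fpu_live_in_cpu;  /* lazy bit: true means the registers hold the state */
    int  saved_fpu;        /* stand-in for the saved x87 area */
};

static int cpu_fpu_regs;   /* stand-in for the hardware FPU registers */

/* Save the live registers into the task and clear the lazy bit. */
static void unlazy_fpu(struct task_sketch *t)
{
    if (t->fpu_live_in_cpu) {
        t->saved_fpu = cpu_fpu_regs;
        t->fpu_live_in_cpu = false;
    }
}

/* Buggy order: the flag is copied before the parent has been unlazied. */
static struct task_sketch fork_buggy(struct task_sketch *parent)
{
    struct task_sketch child;

    child.fpu_live_in_cpu = parent->fpu_live_in_cpu;  /* snapshot taken too early */
    unlazy_fpu(parent);
    child.saved_fpu = parent->saved_fpu;
    return child;
}

/* Fixed order: calm the parent down first, then copy everything at once. */
static struct task_sketch fork_fixed(struct task_sketch *parent)
{
    unlazy_fpu(parent);
    return *parent;
}
```

With `fork_buggy()` the child receives the right saved state but still carries the lazy bit, so it believes its state lives in registers it never owned; with `fork_fixed()` the bit and the saved state agree.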

Linus

2003-03-10 22:35:41

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Sun, 9 Mar 2003, Linus Torvalds wrote:
>
> Your SYSENTER_ESP hack would probably get back the rest, but I haven't
> seen any patches for it, hint hint.

Oh, well, I just did it myself. And tested with both NMI's and debug
traps, just to make sure that we do the right thing there too.

(If we get an NMI on the first three instructions in a debug trap that
happens on the first instruction of the sysenter path, we're still
screwed. I'm still trying to figure out a good way to unscrew us).

> In the meantime, we're almost back to where we were _and_ we support
> sysenter (ie my system calls are down by almost a factor of four). So
> we're doing pretty well.

We're now pretty much back to 2.4.x performance on the scheduler, as far
as I can tell. Can people confirm and close the bug?

Linus

2003-03-18 15:15:16

by Kevin Pedretti

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Linus,
I wasn't aware of what you state below but it makes sense. What I
haven't been able to figure out, and nobody seems to know, is why the
rodata section of an executable is placed in the text section and is not
page aligned. This seems to be a mixing of code and data on the same
page. Maybe it doesn't matter since it is read only?

Example:

Idx Name          Size      VMA       LMA       File off  Algn
 11 .text         000000e8  08048244  08048244  00000244  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 12 .fini         0000001c  0804832c  0804832c  0000032c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
 13 .rodata       0000000c  08048348  08048348  00000348  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
 14 .data         0000000c  08049354  08049354  00000354  2**2
                  CONTENTS, ALLOC, LOAD, DATA
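
Whether two addresses from a listing like this share a page or a cache line is simple mask arithmetic. The helper below is a hypothetical sketch; the 64-byte line size used in the note after it is an assumption, not something the listing states.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if addresses a and b fall into the same aligned block.
 * block_size must be a power of two (e.g. 4096 for a page).
 */
static bool same_block(uint32_t a, uint32_t b, uint32_t block_size)
{
    uint32_t mask = ~(block_size - 1);

    return (a & mask) == (b & mask);
}
```

Here .fini ends at 0x0804832c + 0x1c - 1 = 0x08048347 and .rodata starts at 0x08048348. `same_block(0x08048347, 0x08048348, 4096)` and `same_block(0x08048347, 0x08048348, 64)` are both true, so the last code byte and the first read-only data byte share a page and, with 64-byte lines, a cache line. .data at 0x08049354 sits on the next page.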

Thanks,
Kevin

[email protected] wrote:

>In article <[email protected]>,
>Jamie Lokier <[email protected]> wrote:
>
>
>>A cute and wonderful hack is to use the 6 words in the TSS prior to
>>&tss->es as the trampoline. Now that __switch_to is done in software,
>>those words are not used for anything else.
>>
>>
>
>No!!
>
>That's not cute and wonderful, that's _horrible_.
>
>Mixing data and code on the same page is very very slow on a P4 (well, I
>think it's "same half-page", but the point is that you should not EVER
>mix data and code - it ends up being slow on modern CPU's).
>
>
>
>>Other fixed offsets from &tss->esp0 are possible - especially nice
>>would be to share a cache line with the GDT's hot cache line. (To do
>>this, place GDT before TSS, make KERNEL_CS near the end of the GDT,
>>and then the accesses to GDT, trampoline and tss->esp0 will all touch
>>the same cache line if you're lucky).
>>
>>
>
>Since almost all x86 CPU's have some kind of cacheline exclusion policy
>between the I$ and the D$ (to handle the strict x86 I$ coherency
>requirements), your "if you're lucky" is completely bogus. In fact,
>you'd be the _pessimal_ cache behaviour for something like that, ie you
>get lines that ping-pong between the L2 and the two instruction caches.
>
>Don't do it. Keep data and code on separate pages.
>
> Linus

2003-03-18 16:32:15

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

On Tue, 18 Mar 2003, Kevin Pedretti wrote:
>
> I wasn't aware of what you state below but it makes sense. What I
> haven't been able to figure out, and nobody seems to know, is why the
> rodata section of an executable is placed in the text section and is not
> page aligned. This seems to be a mixing of code and data on the same
> page. Maybe it doesn't matter since it is read only?

It's a bad idea to share even read-only data, but the impact of read-only
data is much less than read-write. In particular, you should avoid sharing
_any_ code and data in the same physical L1 cache-line, since that will be
a big problem for any CPU with exclusion between the I$ and D$.

HOWEVER, modern x86 CPU's tend to have the I$ be part of the cache
coherency protocol, so instead of having exclusion they allow sharing as
long as the D$ isn't actually dirty. In that case it's fine to share
read-only data and code, although the cache utilization goes down if you
do a lot of it.

Anyway, as long as they are in separate cache-lines, you should be ok even
on something with cache exclusion.

When it comes to actually _writing_ to the data, at least on the P4 you
don't want to have read-write data anywhere _near_ the I$ (somebody
reported half-page granularity). This is true on crusoe too, btw (at a
128-byte granularity).

Anyway, I think gcc should make sure that even the ro-data section is at
least cacheline-aligned so that it stays away from cachelines used for I$.
That makes sense even on CPU's that don't have exclusion, since it
actually gives slightly better L1 cache utilization.

You can run this (stupid) test-program to try. On my P4 I get

empty overhead=320 cycles
load overhead=0 cycles
I$ load overhead=0 cycles
I$ load overhead=0 cycles
I$ store overhead=264 cycles

and on my PIII I get

empty overhead=74 cycles
load overhead=8 cycles
I$ load overhead=8 cycles
I$ load overhead=8 cycles
I$ store overhead=103 cycles

and (just for fun) on an old crusoe I get

empty overhead=67 cycles
load overhead=-9 cycles
I$ load overhead=-14 cycles
I$ load overhead=-14 cycles
I$ store overhead=12 cycles

where that "negative overhead" just shows that we do some strange things to
scheduling, and the loop actually ends up faster if it has a load in it
than without the load..
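
The way the numbers are derived (best of many runs, minus the best empty run) can be separated from the timing itself. The sketch below is hypothetical helper code, not part of the test program; it only shows the arithmetic, using signed subtraction so that a loaded loop that beats the baseline yields a negative result.

```c
#include <stddef.h>

/* Smallest sample wins: the minimum filters out interrupts and cache misses. */
static unsigned long min_sample(const unsigned long *samples, size_t n)
{
    unsigned long best = ~0UL;

    for (size_t i = 0; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Net cost of the measured operation; negative if it beat the empty loop. */
static long net_overhead(unsigned long loaded_min, unsigned long empty_min)
{
    return (long)loaded_min - (long)empty_min;
}
```

On the Crusoe figures, an empty minimum of 67 cycles and a reported load overhead of -9 mean the best loaded run took 58 cycles: `net_overhead(58, 67)` is -9.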

But you can see that storing to code is a really bad idea. Especially on a
P4, where the overhead for a store was 264 cycles! (You can also see the
cost of doing just the empty synchronization and rdtsc - 320 cycles for a
rdtsc and two locked memory accesses on a P4).

I don't have access to an old Pentium - I think that was the one that had
the strict exclusion between the L1 I$ and D$, and then you should see the
I$ load overhead go up.

Linus

----
#include <sys/types.h>
#include <time.h>
#include <sys/time.h>
#include <sys/fcntl.h>
#include <asm/unistd.h>
#include <sys/stat.h>
#include <stdio.h>

#include <sys/mman.h>

#define PAGE_SIZE (4096UL)
#define PAGE_MASK (~(PAGE_SIZE-1))

/* Locked add of zero to the top of the stack: a memory barrier (32-bit x86 only) */
#define serialize() asm volatile("lock ; addl $0,(%esp)")

/* Read the time-stamp counter; only the low 32 bits (%eax) are used */
#define rdtsc() ({ unsigned long a, d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })

static int unused = 0;

#define NR (100000)

int main()
{
	int i;
	unsigned long overhead = ~0UL, empty = 0;
	void * address = (void *)(PAGE_MASK & (unsigned long)main);

	/* Make the page holding main() writable so the last test can store into it */
	mprotect(address, PAGE_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC);

	/* Baseline: rdtsc plus two barriers, minimum over NR runs */
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long cycles = rdtsc();
		serialize();
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("empty overhead=%ld cycles\n", overhead);
	empty = overhead;

	/* Load from ordinary data */
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles = rdtsc();
		serialize();
		asm volatile("movl %1,%0":"=r" (dummy):"m" (unused));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	/* Printed as signed, so a loop faster than the baseline shows up negative */
	printf("load overhead=%ld cycles\n", overhead-empty);

	/* Load from the bytes of the load instruction itself */
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles = rdtsc();
		serialize();
		asm volatile("1:\tmovl 1b,%0":"=r" (dummy));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("I$ load overhead=%ld cycles\n", overhead-empty);

	/* Load from a 128-byte-aligned word embedded in the code */
	asm volatile("jmp 1f\n.align 128\n99:\t.long 0\n1:");
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy;
		unsigned long cycles;
		cycles = rdtsc();
		serialize();
		asm volatile("movl 99b,%0":"=r" (dummy));
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("I$ load overhead=%ld cycles\n", overhead-empty);

	/* Store to a word embedded in the code, right next to the loop */
	asm volatile("jmp 1f\n99:\t.long 0\n1:");
	overhead = ~0UL;
	for (i = 0; i < NR; i++) {
		unsigned long dummy = 0;
		unsigned long cycles;
		cycles = rdtsc();
		serialize();
		/* dummy is the value being stored, so it is an input operand */
		asm volatile("1:\tmovl %0,99b": :"r" (dummy):"memory");
		serialize();
		cycles = rdtsc() - cycles;
		if (cycles < overhead)
			overhead = cycles;
	}
	printf("I$ store overhead=%ld cycles\n", overhead-empty);
	return 0;
}

2003-03-18 18:20:08

by Brian Gerst

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

Linus Torvalds wrote:
> On Tue, 18 Mar 2003, Kevin Pedretti wrote:
>
>> I wasn't aware of what you state below but it makes sense. What I
>>haven't been able to figure out, and nobody seems to know, is why the
>>rodata section of an executable is placed in the text section and is not
>>page aligned. This seems to be a mixing of code and data on the same
>>page. Maybe it doesn't matter since it is read only?
>
>
> It's a bad idea to share even read-only data, but the impact of read-only
> data is much less than read-write. In particular, you should avoid sharing
> _any_ code and data in the same physical L1 cache-line, since that will be
> a big problem for any CPU with exclusion between the I$ and D$.
>
> HOWEVER, modern x86 CPU's tend to have the I$ be part of the cache
> coherency protocol, so instead of having exclusion they allow sharing as
> long as the D$ isn't actually dirty. In that case it's fine to share
> read-only data and code, although the cache utilization goes down if you
> do a lot of it.
>
> Anyway, as long as they are in separate cache-lines, you should be ok even
> on something with cache exclusion.
>
> When it comes to actually _writing_ to the data, at least on the P4 you
> don't want to have read-write data anywhere _near_ the I$ (somebody
> reported half-page granularity). This is true on crusoe too, btw (at a
> 128-byte granularity).
>
> Anyway, I think gcc should make sure that even the ro-data section is at
> least cacheline-aligned so that it stays away from cachelines used for I$.
> That makes sense even on CPU's that don't have exclusion, since it
> actually gives slightly better L1 cache utilization.
>
> You can run this (stupid) test-program to try. On my P4 I get
>
> empty overhead=320 cycles
> load overhead=0 cycles
> I$ load overhead=0 cycles
> I$ load overhead=0 cycles
> I$ store overhead=264 cycles
>
> and on my PIII I get
>
> empty overhead=74 cycles
> load overhead=8 cycles
> I$ load overhead=8 cycles
> I$ load overhead=8 cycles
> I$ store overhead=103 cycles
>
> and (just for fun) on an old crusoe I get
>
> empty overhead=67 cycles
> load overhead=-9 cycles
> I$ load overhead=-14 cycles
> I$ load overhead=-14 cycles
> I$ store overhead=12 cycles
>
> where that "negative overhead" just shows that we do some strange things to
> scheduling, and the loop actually ends up faster if it has a load in it
> than without the load..
>
> But you can see that storing to code is a really bad idea. Especially on a
> P4, where the overhead for a store was 264 cycles! (You can also see the
> cost of doing just the empty synchronization and rdtsc - 320 cycles for a
> rdtsc and two locked memory accesses on a P4).
>
> I don't have access to an old Pentium - I think that was the one that had
> the strict exclusion between the L1 I$ and D$, and then you should see the
> I$ load overhead go up.
>
> Linus

Here's a few more data points:

vendor_id : AuthenticAMD
cpu family : 5
model : 8
model name : AMD-K6(tm) 3D processor
stepping : 12
cpu MHz : 451.037
empty overhead=105 cycles
load overhead=-2 cycles
I$ load overhead=30 cycles
I$ load overhead=90 cycles
I$ store overhead=95 cycles

vendor_id : GenuineIntel
cpu family : 6
model : 3
model name : Pentium II (Klamath)
stepping : 3
cpu MHz : 265.913
empty overhead=73 cycles
load overhead=10 cycles
I$ load overhead=10 cycles
I$ load overhead=10 cycles
I$ store overhead=2 cycles

vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1409.946
empty overhead=11 cycles
load overhead=5 cycles
I$ load overhead=5 cycles
I$ load overhead=5 cycles
I$ store overhead=826 cycles

The Athlon XP shows really bad behavior when you store to the text area.

--
Brian Gerst

2003-03-18 19:03:35

by Thomas Molina

Subject: Re: [Bug 350] New: i386 context switch very slow compared to 2.4 due to wrmsr (performance)

> > You can run this (stupid) test-program to try. On my P4 I get
> >
> > empty overhead=320 cycles
> > load overhead=0 cycles
> > I$ load overhead=0 cycles
> > I$ load overhead=0 cycles
> > I$ store overhead=264 cycles

On my Athlon 1.3GHz system I get:
[tmolina@dad tmolina]$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) Processor
stepping : 2
cpu MHz : 1343.030
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov
pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 2637.82

empty overhead=16 cycles
load overhead=1 cycles
I$ load overhead=1 cycles
I$ load overhead=1 cycles
I$ store overhead=763 cycles

2003-03-18 19:12:00