2002-12-09 17:41:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

In article <[email protected]>,
Mike Hayward <[email protected]> wrote:
>
>I have been benchmarking Pentium 4 boxes against my Pentium III laptop
>with the exact same kernel and executables as well as custom compiled
>kernels. The Pentium III has a much lower clock rate and I have
>noticed that system call performance (and hence io performance) is up
>to an order of magnitude higher on my Pentium III laptop. 1k block IO
>reads/writes are anemic on the Pentium 4, for example, so I'm trying
>to figure out why and thought someone might have an idea.

P4's really suck at system calls. A 2.8GHz P4 does a simple system call
a lot _slower_ than a 500MHz PIII.

The P4 has problems with some other things too, but the "int + iret"
instruction combination is absolutely the worst I've seen. A 1.2GHz
Athlon will be 5-10 times faster than the fastest P4 on system call
overhead.

HOWEVER, the P4 is really good at a lot of other things. On average, a
P4 tends to perform quite well on most loads, and hyperthreading (if you
have a Xeon or one of the newer desktop CPU's) also tends to work quite
well to smooth things out in real life.

In short: the P4 architecture excels at some things, and it sucks at
others. It _mostly_ tends to excel more than suck.

Linus


2002-12-09 19:32:06

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 09, 2002 at 05:48:45PM +0000, Linus Torvalds wrote:

> P4's really suck at system calls. A 2.8GHz P4 does a simple system call
> a lot _slower_ than a 500MHz PIII.
>
> The P4 has problems with some other things too, but the "int + iret"
> instruction combination is absolutely the worst I've seen. A 1.2GHz
> Athlon will be 5-10 times faster than the fastest P4 on system call
> overhead.

Time to look into an alternative like SYSCALL perhaps?

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-17 00:42:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Mon, 9 Dec 2002, Dave Jones wrote:
>
> Time to look into an alternative like SYSCALL perhaps?

Well, here's a very raw first try at using intel sysenter/sysexit.

It does actually work, I've done a "hello world" program that used
sysenter to enter the kernel, but kernel exit requires knowing where to
return to (the SYSENTER_RETURN define in entry.S), and I didn't set up a
fixmap entry for this yet, so I don't have a good value to return to yet.

But this, together with a fixmap entry that is user-readable (and thus
executable) that contains the "sysenter" instruction (and enough setup so
that %ebp points to the stack we want to return with), and together with
some debugging should get you there.

WARNING! I may be setting up the stack slightly incorrectly, since this
also hurls chunks when debugging. Dunno. Ingo, care to take a look?

Btw, that per-CPU sysenter entry-point is really clever of me, but it's
not strictly NMI-safe. There's a single-instruction window between having
started "sysenter" and having a valid kernel stack, and if an NMI comes in
at that point, the NMI will now have a bogus stack pointer.

That NMI problem is pretty fundamentally unfixable due to the stupid
sysenter semantics, but we could just make the NMI handlers be real
careful about it and fix it up if it happens.

Most of the diff here is actually moving around some of the segments,
since sysenter/sysexit wants them in one particular order. The setup code
to initialize sysenter is itself pretty trivial.

Linus

----
===== arch/i386/kernel/sysenter.c 1.1 vs edited =====
--- 1.1/arch/i386/kernel/sysenter.c Sat Dec 14 04:38:56 2002
+++ edited/arch/i386/kernel/sysenter.c 2002-12-16 16:37:32.000000000 -0800
@@ -0,0 +1,52 @@
+/*
+ * linux/arch/i386/kernel/sysenter.c
+ *
+ * (C) Copyright 2002 Linus Torvalds
+ *
+ * This file contains the needed initializations to support sysenter.
+ */
+
+#include <linux/init.h>
+#include <linux/smp.h>
+#include <linux/thread_info.h>
+#include <linux/gfp.h>
+
+#include <asm/cpufeature.h>
+#include <asm/msr.h>
+
+extern asmlinkage void sysenter_entry(void);
+
+static void __init enable_sep_cpu(void *info)
+{
+ unsigned long page = __get_free_page(GFP_ATOMIC);
+ int cpu = get_cpu();
+ unsigned long *esp0_ptr = &(init_tss + cpu)->esp0;
+ unsigned long rel32;
+
+ rel32 = (unsigned long) sysenter_entry - (page+11);
+
+
+ *(short *) (page+0) = 0x258b; /* movl xxxxx,%esp */
+ *(long **) (page+2) = esp0_ptr;
+ *(char *) (page+6) = 0xe9; /* jmp rl32 */
+ *(long *) (page+7) = rel32;
+
+ wrmsr(0x174, __KERNEL_CS, 0); /* SYSENTER_CS_MSR */
+ wrmsr(0x175, page+PAGE_SIZE, 0); /* SYSENTER_ESP_MSR */
+ wrmsr(0x176, page, 0); /* SYSENTER_EIP_MSR */
+
+ printk("Enabling SEP on CPU %d\n", cpu);
+ put_cpu();
+}
+
+static int __init sysenter_setup(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_SEP))
+ return 0;
+
+ enable_sep_cpu(NULL);
+ smp_call_function(enable_sep_cpu, NULL, 1, 1);
+ return 0;
+}
+
+__initcall(sysenter_setup);
===== arch/i386/kernel/Makefile 1.30 vs edited =====
--- 1.30/arch/i386/kernel/Makefile Sat Dec 14 04:38:56 2002
+++ edited/arch/i386/kernel/Makefile Mon Dec 16 13:43:57 2002
@@ -29,6 +29,7 @@
obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_EDD) += edd.o
obj-$(CONFIG_MODULES) += module.o
+obj-y += sysenter.o

EXTRA_AFLAGS := -traditional

===== arch/i386/kernel/entry.S 1.41 vs edited =====
--- 1.41/arch/i386/kernel/entry.S Fri Dec 6 09:43:43 2002
+++ edited/arch/i386/kernel/entry.S Mon Dec 16 16:17:47 2002
@@ -94,7 +94,7 @@
movl %edx, %ds; \
movl %edx, %es;

-#define RESTORE_ALL \
+#define RESTORE_REGS \
popl %ebx; \
popl %ecx; \
popl %edx; \
@@ -104,14 +104,25 @@
popl %eax; \
1: popl %ds; \
2: popl %es; \
- addl $4, %esp; \
-3: iret; \
.section .fixup,"ax"; \
-4: movl $0,(%esp); \
+3: movl $0,(%esp); \
jmp 1b; \
-5: movl $0,(%esp); \
+4: movl $0,(%esp); \
jmp 2b; \
-6: pushl %ss; \
+.previous; \
+.section __ex_table,"a";\
+ .align 4; \
+ .long 1b,3b; \
+ .long 2b,4b; \
+.previous
+
+
+#define RESTORE_ALL \
+ RESTORE_REGS \
+ addl $4, %esp; \
+1: iret; \
+.section .fixup,"ax"; \
+2: pushl %ss; \
popl %ds; \
pushl %ss; \
popl %es; \
@@ -120,11 +131,11 @@
.previous; \
.section __ex_table,"a";\
.align 4; \
- .long 1b,4b; \
- .long 2b,5b; \
- .long 3b,6b; \
+ .long 1b,2b; \
.previous

+
+
ENTRY(lcall7)
pushfl # We get a different stack layout with call
# gates, which has to be cleaned up later..
@@ -219,6 +230,39 @@
cli
jmp need_resched
#endif
+
+#define SYSENTER_RETURN 0
+
+ # sysenter call handler stub
+ ALIGN
+ENTRY(sysenter_entry)
+ sti
+ pushl $(__USER_DS)
+ pushl %ebp
+ pushfl
+ pushl $(__USER_CS)
+ pushl $SYSENTER_RETURN
+
+ pushl %eax
+ SAVE_ALL
+ GET_THREAD_INFO(%ebx)
+ cmpl $(NR_syscalls), %eax
+ jae syscall_badsys
+
+ testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebx)
+ jnz syscall_trace_entry
+ call *sys_call_table(,%eax,4)
+ movl %eax,EAX(%esp)
+ cli
+ movl TI_FLAGS(%ebx), %ecx
+ testw $_TIF_ALLWORK_MASK, %cx
+ jne syscall_exit_work
+ RESTORE_REGS
+ movl 4(%esp),%edx
+ movl 16(%esp),%ecx
+ sti
+ sysexit
+

# system call handler stub
ALIGN
===== arch/i386/kernel/head.S 1.18 vs edited =====
--- 1.18/arch/i386/kernel/head.S Thu Dec 5 18:56:49 2002
+++ edited/arch/i386/kernel/head.S Mon Dec 16 14:14:44 2002
@@ -414,8 +414,8 @@
.quad 0x0000000000000000 /* 0x0b reserved */
.quad 0x0000000000000000 /* 0x13 reserved */
.quad 0x0000000000000000 /* 0x1b reserved */
- .quad 0x00cffa000000ffff /* 0x23 user 4GB code at 0x00000000 */
- .quad 0x00cff2000000ffff /* 0x2b user 4GB data at 0x00000000 */
+ .quad 0x0000000000000000 /* 0x20 unused */
+ .quad 0x0000000000000000 /* 0x28 unused */
.quad 0x0000000000000000 /* 0x33 TLS entry 1 */
.quad 0x0000000000000000 /* 0x3b TLS entry 2 */
.quad 0x0000000000000000 /* 0x43 TLS entry 3 */
@@ -425,22 +425,25 @@

.quad 0x00cf9a000000ffff /* 0x60 kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* 0x68 kernel 4GB data at 0x00000000 */
- .quad 0x0000000000000000 /* 0x70 TSS descriptor */
- .quad 0x0000000000000000 /* 0x78 LDT descriptor */
+ .quad 0x00cffa000000ffff /* 0x73 user 4GB code at 0x00000000 */
+ .quad 0x00cff2000000ffff /* 0x7b user 4GB data at 0x00000000 */
+
+ .quad 0x0000000000000000 /* 0x80 TSS descriptor */
+ .quad 0x0000000000000000 /* 0x88 LDT descriptor */

/* Segments used for calling PnP BIOS */
- .quad 0x00c09a0000000000 /* 0x80 32-bit code */
- .quad 0x00809a0000000000 /* 0x88 16-bit code */
- .quad 0x0080920000000000 /* 0x90 16-bit data */
- .quad 0x0080920000000000 /* 0x98 16-bit data */
+ .quad 0x00c09a0000000000 /* 0x90 32-bit code */
+ .quad 0x00809a0000000000 /* 0x98 16-bit code */
.quad 0x0080920000000000 /* 0xa0 16-bit data */
+ .quad 0x0080920000000000 /* 0xa8 16-bit data */
+ .quad 0x0080920000000000 /* 0xb0 16-bit data */
/*
* The APM segments have byte granularity and their bases
* and limits are set at run time.
*/
- .quad 0x00409a0000000000 /* 0xa8 APM CS code */
- .quad 0x00009a0000000000 /* 0xb0 APM CS 16 code (16 bit) */
- .quad 0x0040920000000000 /* 0xb8 APM DS data */
+ .quad 0x00409a0000000000 /* 0xb8 APM CS code */
+ .quad 0x00009a0000000000 /* 0xc0 APM CS 16 code (16 bit) */
+ .quad 0x0040920000000000 /* 0xc8 APM DS data */

#if CONFIG_SMP
.fill (NR_CPUS-1)*GDT_ENTRIES,8,0 /* other CPU's GDT */
===== include/asm-i386/segment.h 1.2 vs edited =====
--- 1.2/include/asm-i386/segment.h Mon Aug 12 10:56:27 2002
+++ edited/include/asm-i386/segment.h Mon Dec 16 14:08:09 2002
@@ -9,8 +9,8 @@
* 2 - reserved
* 3 - reserved
*
- * 4 - default user CS <==== new cacheline
- * 5 - default user DS
+ * 4 - unused <==== new cacheline
+ * 5 - unused
*
* ------- start of TLS (Thread-Local Storage) segments:
*
@@ -25,16 +25,18 @@
*
* 12 - kernel code segment <==== new cacheline
* 13 - kernel data segment
- * 14 - TSS
- * 15 - LDT
- * 16 - PNPBIOS support (16->32 gate)
- * 17 - PNPBIOS support
- * 18 - PNPBIOS support
+ * 14 - default user CS
+ * 15 - default user DS
+ * 16 - TSS
+ * 17 - LDT
+ * 18 - PNPBIOS support (16->32 gate)
* 19 - PNPBIOS support
* 20 - PNPBIOS support
- * 21 - APM BIOS support
- * 22 - APM BIOS support
- * 23 - APM BIOS support
+ * 21 - PNPBIOS support
+ * 22 - PNPBIOS support
+ * 23 - APM BIOS support
+ * 24 - APM BIOS support
+ * 25 - APM BIOS support
*/
#define GDT_ENTRY_TLS_ENTRIES 3
#define GDT_ENTRY_TLS_MIN 6
@@ -42,10 +44,10 @@

#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES * 8)

-#define GDT_ENTRY_DEFAULT_USER_CS 4
+#define GDT_ENTRY_DEFAULT_USER_CS 14
#define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS * 8 + 3)

-#define GDT_ENTRY_DEFAULT_USER_DS 5
+#define GDT_ENTRY_DEFAULT_USER_DS 15
#define __USER_DS (GDT_ENTRY_DEFAULT_USER_DS * 8 + 3)

#define GDT_ENTRY_KERNEL_BASE 12
@@ -56,14 +58,14 @@
#define GDT_ENTRY_KERNEL_DS (GDT_ENTRY_KERNEL_BASE + 1)
#define __KERNEL_DS (GDT_ENTRY_KERNEL_DS * 8)

-#define GDT_ENTRY_TSS (GDT_ENTRY_KERNEL_BASE + 2)
-#define GDT_ENTRY_LDT (GDT_ENTRY_KERNEL_BASE + 3)
+#define GDT_ENTRY_TSS (GDT_ENTRY_KERNEL_BASE + 4)
+#define GDT_ENTRY_LDT (GDT_ENTRY_KERNEL_BASE + 5)

-#define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 4)
-#define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 9)
+#define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 6)
+#define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11)

/*
- * The GDT has 21 entries but we pad it to cacheline boundary:
+ * The GDT has 23 entries but we pad it to cacheline boundary:
*/
#define GDT_ENTRIES 24


2002-12-17 00:56:51

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 16, 2002 at 04:47:00PM -0800, Linus Torvalds wrote:

Cool, new toys 8-) I'll have a play with this tomorrow.
After a quick glance, one thing jumped out at me.

> +static int __init sysenter_setup(void)
> +{
> + if (!boot_cpu_has(X86_FEATURE_SEP))
> + return 0;
> +
> + enable_sep_cpu(NULL);
> + smp_call_function(enable_sep_cpu, NULL, 1, 1);
> + return 0;

I'm sure I recall seeing errata on at least one CPU re sysenter.
If we do decide to go this route, we'll need to blacklist ones
with any really icky problems.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-17 02:27:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Dave Jones wrote:
>
> I'm sure I recall seeing errata on at least 1 CPU re sysenter.
> If we do decide to go this route, we'll need to blacklist ones
> with any really icky problems.

The erratum is something like "all P6's report SEP, but it doesn't
actually _work_ on anything before the third stepping".

However, that should _not_ be handled by magic sysenter-specific code.
That's what the per-vendor cpu feature fixups are there for, so that these
kinds of bugs get fixed in _one_ place (initialization) and not in all the
users of the feature flags.

In fact, we already have that code in the proper place, namely
arch/i386/kernel/cpu/intel.c:

/* SEP CPUID bug: Pentium Pro reports SEP but doesn't have it */
if ( c->x86 == 6 && c->x86_model < 3 && c->x86_mask < 3 )
clear_bit(X86_FEATURE_SEP, c->x86_capability);

so the stuff I sent out should work on everything.

(Modulo the missing syscall page I already mentioned and potential bugs
in the code itself, of course ;)

Linus

2002-12-17 05:46:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> (Modulo the missing syscall page I already mentioned and potential bugs
> in the code itself, of course ;)

Ok, I did the vsyscall page too, and tried to make it do the right thing
(but I didn't bother to test it on a non-SEP machine).

I'm pushing the changes out right now, but basically it boils down to the
fact that with these changes, user space can instead of doing an

int $0x80

instruction for a system call just do a

call 0xfffff000

instead. The vsyscall page will be set up to use sysenter if the CPU
supports it, and if it doesn't, it will just do the old "int $0x80"
instead (and it could use the AMD syscall instruction if it wants to).
User mode shouldn't know or care, the calling convention is the same as it
ever was.

On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
something like 1761 cycles with the old "int 0x80/iret" approach. That's a
noticeable improvement, but I have to say that I'm a bit disappointed in
the P4 still, it shouldn't be even that much.

As a comparison, an Athlon will do a full int/iret faster than a P4 does a
sysenter/sysexit. Pathetic. But it's better than it used to be.

Whatever. The code is extremely simple, and while I'm sure there are
things I've missed I'd love to hear if this works for anybody else. I'm
appending the (extremely stupid) test-program I used to test it.

The way I did this, things like system call restarting etc _should_ all
work fine even with "sysenter", simply by virtue of both sysenter and "int
0x80" being two-byte opcodes. But it might be interesting to verify that a
recompiled glibc (or even just a preload) really works with this on a
"whole system" testbed rather than just testing one system call (and not
even caring about the return value) a million times.

The good news is that the kernel part really looks pretty clean.

Linus

---
#include <time.h>
#include <sys/time.h>
#include <asm/unistd.h>
#include <sys/stat.h>
#include <stdio.h>

#define rdtsc() ({ unsigned long a,d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })

int main()
{
int i, ret;
unsigned long start, end;

start = rdtsc();
for (i = 0; i < 1000000; i++) {
asm volatile("call 0xfffff000"
:"=a" (ret)
:"0" (__NR_getppid));
}
end = rdtsc();
printf("%f cycles\n", (end - start) / 1000000.0);

start = rdtsc();
for (i = 0; i < 1000000; i++) {
asm volatile("int $0x80"
:"=a" (ret)
:"0" (__NR_getppid));
}
end = rdtsc();
printf("%f cycles\n", (end - start) / 1000000.0);
}


2002-12-17 06:00:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
> something like 1761 cycles with the old "int 0x80/iret" approach. That's a
> noticeable improvement, but I have to say that I'm a bit disappointed in
> the P4 still, it shouldn't be even that much.

On a slightly more real system call (gettimeofday - which actually matters
in real life) the difference is still visible, but less so - because the
system call itself takes more of the time, and the kernel entry overhead
is thus not as clear.

For gettimeofday(), the results on my P4 are:

sysenter: 1280.425844 cycles
int/iret: 2415.698224 cycles
1135.272380 cycles diff
factor 1.886637

ie sysenter makes that system call almost twice as fast.

It's not as good as a pure user-mode solution using tsc could be, but
we've seen the kinds of complexities that has with multi-CPU systems, and
they are so painful that I suspect the sysenter approach is a lot more
palatable even if it doesn't allow for the absolute best theoretical
numbers.

Linus

2002-12-17 06:11:11

by GrandMasterLee

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 00:09, Linus Torvalds wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > On my P4 machine, a "getppid()" is 641 cycles with sysenter/sysexit, and
> > something like 1761 cycles with the old "int 0x80/iret" approach. That's a
> > noticeable improvement, but I have to say that I'm a bit disappointed in
> > the P4 still, it shouldn't be even that much.
>
> On a slightly more real system call (gettimeofday - which actually matters
> in real life) the difference is still visible, but less so - because the
> system call itself takes more of the time, and the kernel entry overhead
> is thus not as clear.
>
> For gettimeofday(), the results on my P4 are:
>
> sysenter: 1280.425844 cycles
> int/iret: 2415.698224 cycles
> 1135.272380 cycles diff
> factor 1.886637
>
> ie sysenter makes that system call almost twice as fast.


I'm curious whether this is one of the dual non-Xeon P4's (say, 2.4 GHz+?)
or one of the Xeons? There seems to be some perceived disparity in how
they perform. I think the biggest differences on the Xeons are the
stepping and the cache (pipeline too?), but not too much else.

[...]
> Linus
>

2002-12-17 06:09:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> For gettimeofday(), the results on my P4 are:
>
> sysenter: 1280.425844 cycles
> int/iret: 2415.698224 cycles
> 1135.272380 cycles diff
> factor 1.886637
>
> ie sysenter makes that system call almost twice as fast.

Final comparison for the evening: a PIII looks very different, since the
system call overhead is much smaller to begin with. On a PIII, the above
ends up looking like

gettimeofday() testing:
sysenter: 561.697236 cycles
int/iret: 686.170463 cycles
124.473227 cycles diff
factor 1.221602

ie the speedup is much less because the original int/iret numbers aren't
nearly as embarrassing as the P4 ones. It's still a win, though.

Linus

2002-12-17 06:35:34

by dean gaudet

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, 16 Dec 2002, Linus Torvalds wrote:

> It's not as good as a pure user-mode solution using tsc could be, but
> we've seen the kinds of complexities that has with multi-CPU systems, and
> they are so painful that I suspect the sysenter approach is a lot more
> palatable even if it doesn't allow for the absolute best theoretical
> numbers.

don't many of the multi-CPU problems with tsc go away because you've got a
per-cpu physical page for the vsyscall?

i.e. per-cpu tsc epoch and scaling can be set on that page.

the only trouble i know of is what happens when an interrupt occurs and
the task is rescheduled on another cpu... in theory you could test %eip
against 0xfffffxxx and "rollback" (or complete) any incomplete
gettimeofday call prior to saving a task's state. but i bet that test is
undesirable on all interrupt paths.

-dean

2002-12-17 09:40:06

by Andre Hedrick

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


Linus,

Are you serious about moving off the banging we currently do on 0x80?
If so, I have a P4 development board with LEDs to monitor all the lower I/O
ports and can decode for you.

On Mon, 16 Dec 2002, Linus Torvalds wrote:

>
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > (Modulo the missing syscall page I already mentioned and potential bugs
> > in the code itself, of course ;)
>
> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
>
> I'm pushing the changes out right now, but basically it boils down to the
> fact that with these changes, user space can instead of doing an
>
> int $0x80
>
> instruction for a system call just do a
>
> call 0xfffff000
>
> instead.
[...]

Andre Hedrick
LAD Storage Consulting Group

2002-12-17 10:46:07

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
>
> But it might be interesting to verify that a
> recompiled glibc (or even just a preload) really works with this on a
> "whole system" testbed rather than just testing one system call (and not
> even caring about the return value) a million times.


I've created a modified glibc which uses the syscall code for almost
everything. There are a few int $0x80 left here and there but mostly it
is a centralized change.

The result: all works as expected. Nice.

On my test machine your little test program performs the syscalls
roughly twice as fast (HT P4, pretty new). Your numbers are perhaps for
the P4 Xeons. Anyway, when measuring some more involved code (I ran my
thread benchmark) I got only about 3% performance increase. It's doing
a fair amount of system calls. But again, the good news is your code
survived even this stress test.


The problem with the current solution is the instruction set of the x86.
In your test code you simply use call 0xfffff000 and it magically works.
But this is only the case because your program is linked statically.

For the libc DSO I had to play some dirty tricks. The x86 CPU has no
absolute call. The variant with an immediate parameter is a relative
jump. Only when jumping through a register or memory location is it
possible to jump to an absolute address. To be clear, if I have

call 0xfffff000

in a DSO which is loaded at address 0x80000000 the jump ends at
0x7ffff000. The problem is that the static linker doesn't know the load
address. We could of course have the dynamic linker fix up the
addresses but this is plain stupid. It would mean fixing up a lot of
places and making the pages covered non-sharable.

Instead I've changed the syscall handling to effectively do

pushl %ebp
movl $0xfffff000, %ebp
call *%ebp
popl %ebp

An alternative is to store the address in a memory location. But since
%ebx is used for a syscall parameter it is necessary to address the
memory relative to the stack pointer which would mean loading the stack
address with 0xfffff000 before making the syscall. Not much better than
the code sequence above.

Anyway, it's still an improvement. But now the question comes up: how
does ld.so detect that the kernel supports these syscalls and can use an
appropriate DSO? This brings up again the idea of the read-only page(s)
mapped into all processes (you remember).


Anyway, it works nicely. If you need more testing let me know.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 12:33:28

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 01:45:52AM -0800, Andre Hedrick wrote:

> Are you serious about moving of the banging we currently do on 0x80?
> If so, I have a P4 development board with leds to monitor all the lower io
> ports and can decode for you.

INT 0x80 != IO port 0x80

8-)

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-17 14:25:13

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 09:45, Andre Hedrick wrote:
>
> Linus,
>
> Are you serious about moving of the banging we currently do on 0x80?
> If so, I have a P4 development board with leds to monitor all the lower io
> ports and can decode for you.

Different thing - the int 0x80 syscall, not I/O port 0x80. I've done I/O
port 0x80 (it's very easy), but it requires that we set up some udelay
constants with an initial safety value right at boot (which we should do -
we udelay before it is initialised).

2002-12-17 16:02:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
>
> I'm pushing the changes out right now, but basically it boils down to the
> fact that with these changes, user space can instead of doing an
>
> int $0x80
>
> instruction for a system call just do a
>
> call 0xfffff000

I thought that last page was intentionally left invalid?

So that, for example, *(char *)MAP_FAILED will give SIGSEGV;
whereas now I can read a 0 there (and perhaps you should be
using get_zeroed_page rather than __get_free_page?).

I cannot name anything which relies on that page being invalid,
but I think it would be safer to keep it that way; though I guess
more compatibility pain to use the next page down (or could
seg lim be used? I forget the granularity restrictions).

Hugh

2002-12-17 16:23:17

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 17 Dec 2002, Hugh Dickins wrote:

> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > Ok, I did the vsyscall page too, and tried to make it do the right thing
> > (but I didn't bother to test it on a non-SEP machine).
> >
> > I'm pushing the changes out right now, but basically it boils down to the
> > fact that with these changes, user space can instead of doing an
> >
> > int $0x80
> >
> > instruction for a system call just do a
> >
> > call 0xfffff000
>

So you are going to do a system-call off a trap instead of an interrupt.
The difference in performance should be practically nothing. There is
also going to be additional overhead in returning from the trap since
the IP and caller's segment was not saved by the initial trap. I don't
see how you can possibly claim any improvement in performance. Further,
it doesn't make any sense. We don't call physical addresses from a
virtual address anyway, so there will be additional translation that
must take some time. With the current page-table translation you
would need to put your system-call entry point at 0xfffff000 - 0xc0000000
= 0x3ffff000 and there might not even be any RAM there. This guarantees
that you are going to have to set up a special PTE, resulting in
additional overhead.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-17 16:41:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Mon, 16 Dec 2002, dean gaudet wrote:
>
> don't many of the multi-CPU problems with tsc go away because you've got a
> per-cpu physical page for the vsyscall?

No.

The per-cpu page is _inside_ the kernel, and is only pointed at by the
SYSENTER_EIP_MSR, and not accessible from user space. It's not virtually
mapped to the same address at all.

The userspace vsyscall page is shared on the whole system, and has to be
so, because anything else is a disaster from a TLB standpoint (two threads
running on different CPU's have the same page tables, so it's basically
impossible to sanely do per-cpu TLB mappings with a hw-filled TLB like the
x86).

Linus

2002-12-17 16:45:31

by Hugh Dickins

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 17 Dec 2002, Hugh Dickins wrote:
> whereas now I can read a 0 there (and perhaps you should be
> using get_zeroed_page rather than __get_free_page?).

Sorry, yes, you are using get_zeroed_page for the one that needs it.

Hugh

2002-12-17 16:57:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> The problem with the current solution is the instruction set of the x86.
> In your test code you simply use call 0xfffff000 and it magically works.
> But this is only the case because your program is linked statically.

Yeah, it's not very convenient. I didn't find any real alternatives,
though, and you can always just put 0xfffff000 in memory or registers and
jump to that. In fact, I suspect that if you actually want to use it in
glibc, then at least in the short term that's what you need to do anyway,
since you probably don't want to have a glibc that only works with very
recent kernels.

So I was actually assuming that the glibc code would look more like
something like this:

old_fashioned:
int $0x80
ret

unsigned long system_call_ptr = old_fashioned;

/* .. startup .. */
if (kernel_version > xxx)
system_call_ptr = 0xfffff000;


/* ... usage ... */
call *system_call_ptr;

since you cannot depend on the 0xfffff000 on older kernels anyway..

> Instead I've changed the syscall handling to effectively do
>
> pushl %ebp
> movl $0xfffff000, %ebp
> call *%ebp
> popl %ebp

The above will work, but then you'd have limited yourself to a binary that
_only_ works on new kernels. So I'd suggest the memory indirection
instead.

Linus

2002-12-17 16:58:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Hugh Dickins wrote:
>
> I thought that last page was intentionally left invalid?

It was. But I thought it made sense to use, as it's the only really
"special" page.

But yes, we should decide on this quickly - it's easy to change right now,
but..

Linus

2002-12-17 17:11:14

by Matti Aarnio

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 09:07:21AM -0800, Linus Torvalds wrote:
> On Tue, 17 Dec 2002, Hugh Dickins wrote:
> > I thought that last page was intentionally left invalid?
>
> It was. But I thought it made sense to use, as it's the only really
> "special" page.

On a couple of occasions I have caught myself pre-decrementing
a char pointer which "just happened" to be NULL.

Please keep the last page, as well as a few of the first pages as
NULL-pointer poisons.

> But yes, we should decide on this quickly - it's easy to change right now,
> but..
>
> Linus

/Matti Aarnio

2002-12-17 17:39:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance




On Tue, 17 Dec 2002, Richard B. Johnson wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > instruction for a system call just do a
> >
> > call 0xfffff000
>
> So you are going to do a system-call off a trap instead of an interrupt.

No no. The kernel maps a magic read-only page at 0xfffff000, and there's
no trap involved. The code at that address is kernel-generated for the CPU
in question, and it will do whatever is most convenient.

No traps. They're slow as hell.

Linus

2002-12-17 17:46:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Matti Aarnio wrote:
>
> On Tue, Dec 17, 2002 at 09:07:21AM -0800, Linus Torvalds wrote:
> > On Tue, 17 Dec 2002, Hugh Dickins wrote:
> > > I thought that last page was intentionally left invalid?
> >
> > It was. But I thought it made sense to use, as it's the only really
> > "special" page.
>
> On a couple of occasions I have caught myself pre-decrementing
> a char pointer which "just happened" to be NULL.
>
> Please keep the last page, as well as a few of the first pages as
> NULL-pointer poisons.

I think I have a good clean solution to this, that not only avoids the
need for any hard-coded address _at_all_, but also solves Uli's problem
quite cleanly.

Uli, how about I just add one new architecture-specific ELF AT flag, which
is the "base of sysinfo page". Right now that page is all zeroes except
for the system call trampoline at the beginning, but we might want to add
other system information to the page in the future (it is readable, after
all).

So we'd have an AT_SYSINFO entry, that with the current implementation
would just get the value 0xfffff000. And then the glibc startup code could
easily be backwards compatible with the suggestion I had in the previous
email. Since we basically want to do an indirect jump anyway (because of
the lack of absolute jumps in the instruction set), this looks like the
natural way to do it.

That also allows the kernel to move around the SYSINFO page at will, and
even makes it possible to avoid it altogether (ie this will solve the
inevitable problems with UML - UML just wouldn't set AT_SYSINFO, so user
level just wouldn't even _try_ to use it).

With that, there's nothing "special" about the vsyscall page, and I'd just
go back to having the very last page unmapped (and have the vsyscall page
in some other fixmap location that might even depend on kernel
configuration).

Whaddaya think?

Linus

2002-12-17 17:47:22

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> Yeah, it's not very convenient. I didn't find any real alternatives,
> though, and you can always just put 0xfffff000 in memory or registers and
> jump to that.

Putting the value into memory myself is not possible. In a DSO I have
to address memory indirectly. But all registers (except %ebp, and maybe
it'll be used some day) are used at the time of the call.

But there is a way: if I'm using

#define makesyscall(name) \
movl $__NR_##name, %eax; \
call *0xfffff000-__NR_##name(%eax)

and you'd put at address 0xfffff000 the address of the entry point the
wrappers wouldn't have any problems finding it.


> In fact, I suspect that if you actually want to use it in
> glibc, then at least in the short term that's what you need to do anyway,
> sinc eyou probably don't want to have a glibc that only works with very
> recent kernels.

That's a compilation option. We might want to do dynamic testing and
yes, a simple pointer indirection is adequate.

But still, the problem is detecting the capable kernels. You have said
not long ago that comparing kernel versions is wrong. And I agree. It
doesn't cover backports or anything. But there is no alternative.

If you don't like the process-global page thingy (anymore) the
alternative would be a sysconf() system call.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 17:53:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> If you don't like the process-global page thingy (anymore) the
> alternative would be a sysconf() system call.

Well, we do _have_ the process-global thingy now - it's the vsyscall page.
It's not settable by the process, but it's useful for information.
Together with an elf AT_ entry pointing to it, it's certainly sufficient
for this usage, and it should also be sufficient for "future use" (ie we
can add future system information in the page later: bitmaps of features
at offset "start + 128" for example).

Linus

2002-12-17 18:15:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).

Here's the suggested (totally untested as of yet) patch:

- it moves the system call page to 0xffffe000 instead, leaving an
unmapped page at the very top of the address space. So trying to
dereference -1 will cause a SIGSEGV.

- it adds the AT_SYSINFO elf entry on x86 that points to the system page.

Thus glibc startup should be able to just do

ptr = default_int80_syscall;
if (AT_SYSINFO entry found)
ptr = value(AT_SYSINFO)

and then you can just do a

call *ptr

to do a system call regardless of kernel version. This also allows the
kernel to later move the page around as it sees fit.

The advantage of using an AT_SYSINFO entry is that

- no new system call needed to figure anything out
- backwards compatibility (ie old kernels automatically detected)
- I think glibc already parses the AT entries at startup anyway

so it _looks_ like a perfect way to do this.
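A hedged C sketch of that startup scan (the auxv array and the placeholder default value here are hand-built for illustration; real startup code walks the vector the kernel places past envp):

```c
#include <assert.h>

/* Walk the ELF auxiliary vector looking for AT_SYSINFO, falling back
 * to the old int $0x80 stub when no entry is present (old kernels). */

#define AT_NULL    0
#define AT_SYSINFO 32            /* x86-specific value from the patch */

typedef struct { unsigned long a_type, a_val; } auxv_t;

#define DEFAULT_INT80_STUB 0x1UL /* placeholder for &old_fashioned */

static unsigned long find_syscall_entry(const auxv_t *auxv)
{
    unsigned long ptr = DEFAULT_INT80_STUB;
    for (; auxv->a_type != AT_NULL; auxv++)
        if (auxv->a_type == AT_SYSINFO)
            ptr = auxv->a_val;   /* e.g. 0xffffe000 with this patch */
    return ptr;
}
```

On a new kernel the scan yields the vsyscall page address; on an old kernel the default stub address survives, so the same `call *ptr` works everywhere.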

Linus

----
===== arch/i386/kernel/entry.S 1.42 vs edited =====
--- 1.42/arch/i386/kernel/entry.S Mon Dec 16 21:39:04 2002
+++ edited/arch/i386/kernel/entry.S Tue Dec 17 10:13:16 2002
@@ -232,7 +232,7 @@
#endif

/* Points to after the "sysenter" instruction in the vsyscall page */
-#define SYSENTER_RETURN 0xfffff007
+#define SYSENTER_RETURN 0xffffe007

# sysenter call handler stub
ALIGN
===== include/asm-i386/elf.h 1.3 vs edited =====
--- 1.3/include/asm-i386/elf.h Thu Oct 17 00:48:55 2002
+++ edited/include/asm-i386/elf.h Tue Dec 17 10:12:58 2002
@@ -100,6 +100,12 @@

#define ELF_PLATFORM (system_utsname.machine)

+/*
+ * Architecture-neutral AT_ values in 0-17, leave some room
+ * for more of them, start the x86-specific ones at 32.
+ */
+#define AT_SYSINFO 32
+
#ifdef __KERNEL__
#define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)

@@ -115,6 +121,11 @@
extern void dump_smp_unlazy_fpu(void);
#define ELF_CORE_SYNC dump_smp_unlazy_fpu
#endif
+
+#define ARCH_DLINFO \
+do { \
+ NEW_AUX_ENT(AT_SYSINFO, 0xffffe000); \
+} while (0)

#endif

===== include/asm-i386/fixmap.h 1.9 vs edited =====
--- 1.9/include/asm-i386/fixmap.h Mon Dec 16 21:39:04 2002
+++ edited/include/asm-i386/fixmap.h Tue Dec 17 10:11:31 2002
@@ -42,8 +42,8 @@
* task switches.
*/
enum fixed_addresses {
- FIX_VSYSCALL,
FIX_HOLE,
+ FIX_VSYSCALL,
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
#endif

2002-12-17 18:25:17

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> Thus glibc startup should be able to just do
>
> ptr = default_int80_syscall;
> if (AT_SYSINFO entry found)
> ptr = value(AT_SYSINFO)
>
> and then you can just do a
>
> call *ptr

This won't work exactly as I just wrote, but something similar I can make
work. I think the use of the TCB is the best thing to do. Replicating
the info in all new threads' TCBs doesn't cost much and with NPTL
it's even lower cost since we reuse old TCBs.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 18:22:48

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).
>
> So we'd have an AT_SYSINFO entry, that with the current implementation
> would just get the value 0xfffff000. And then the glibc startup code could
> easily be backwards compatible with the suggestion I had in the previous
> email. Since we basically want to do an indirect jump anyway (because of
> the lack of absolute jumps in the instruction set), this looks like the
> natural way to do it.

Yes, I definitely think that a new AT_* value is in order and it's a
nice way to determine the address.

But it will not eliminate the problem. Remember: the x86 (unlike x86-64)
has no PC-relative data addressing mode. I.e., in a DSO to find a
memory location with an address I need a base register which isn't
available anymore at the time the call is made.

You have to assume that all the registers, including %ebp, are used at
the time of the call. This makes it impossible to find a memory
location in a DSO without text relocation (i.e., making parts of the
code writable, at least for a moment). This is time consuming and not
resource friendly.

There is one way around this and maybe it is what should be done: we
have the TLS memory available. And since this vsyscall stuff gets
introduced after the TLS is functional it is a possibility.

The address received in AT_SYSINFO is stored in a word in the TCB
(thread control block). Then the code to call through this is a variant
of what I posted earlier

movl $__NR_##name, %eax
call *%gs:-__NR_##name+TCB_OFFSET(%eax)

In case the vsyscall stuff is not available we jump to something like

int $0x80
ret

The address of this code is the default value of the TCB word.


There is another thing we might want to consider. The above code jumps
to 0xfffff000 or whatever address is specified. I.e., the
demultiplexing happens in the kernel. Do we want to do this at
userlevel? This would allow almost no-cost determination of those
syscalls which can be handled at userlevel (getpid, getppid, ...).

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 18:27:54

by Jeff Dike

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

[email protected] said:
> That also allows the kernel to move around the SYSINFO page at will,
> and even makes it possible to avoid it altogether (ie this will solve
> the inevitable problems with UML - UML just wouldn't set AT_SYSINFO,
> so user level just wouldn't even _try_ to use it).

Why shouldn't I just set it to where UML provides its own SYSINFO page?

Jeff

2002-12-17 18:36:00

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 17:55, Ulrich Drepper wrote:
> But there is a way: if I'm using
>
> #define makesyscall(name) \
> movl $__NR_##name, %eax; \
> call *0xfffff000-__NR_##name(%eax)
>
> and you'd put at address 0xfffff000 the address of the entry point the
> wrappers wouldn't have any problems finding it.

Is there any reason you can't just keep the linker out of the entire
mess by generating

.byte whatever
.dword 0xFFFF0000

instead of call ?


2002-12-17 18:39:52

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 18:30, Ulrich Drepper wrote:
> demultiplexing happens in the kernel. Do we want to do this at
> userlevel? This would allow almost no-cost determination of those
> syscalls which can be handled at userlevel (getpid, getppid, ...).

getppid changes and so I think has to go to kernel (unless we go around
patching user pages on process exit [ick]). getpid can already be done
by reading it once at startup time and caching the data.
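A minimal sketch of the caching Alan describes (helper names here are hypothetical; a real libc would also have to invalidate the cache across fork(), which is omitted for brevity):

```c
#include <sys/types.h>
#include <unistd.h>
#include <assert.h>

/* Read the pid once at startup and serve later calls from memory. */

static pid_t cached_pid;

static void cache_pid_at_startup(void)
{
    cached_pid = getpid();       /* one real system call, ever */
}

static pid_t fast_getpid(void)
{
    return cached_pid;           /* no kernel entry at all */
}
```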


2002-12-17 18:40:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On 17 Dec 2002, Alan Cox wrote:
>
> Is there any reason you can't just keep the linker out of the entire
> mess by generating
>
> .byte whatever
> .dword 0xFFFF0000
>
> instead of call ?

Alan, the problem is that there _is_ no such instruction as a "call
absolute".

There is only a "call relative" or "call indirect-absolute". So you either
have to indirect through memory or a register, or you have to fix up the
call at link-time.

Yeah, I know it sounds strange, but it makes sense. Absolute calls are
actually very unusual, and using relative calls is _usually_ the right
thing to do. It's only in cases like this that we really want to call a
specific address.

Linus

2002-12-17 18:42:23

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Alan Cox wrote:

> Is there any reason you can't just keep the linker out of the entire
> mess by generating
>
> .byte whatever
> .dword 0xFFFF0000
>
> instead of call ?

There is no such instruction. Unless you know about some secret
undocumented opcode...

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 18:50:38

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Alan Cox wrote:

> getppid changes and so I think has to go to kernel (unless we go around
> patching user pages on process exit [ick]).

But this is exactly what I expect to happen. If you want to implement
gettimeofday() at user-level you need to modify the page. Some of the
information the kernel has to keep for the thread group can be stored in
this place and eventually be used by some userlevel code executed by
jumping to 0xfffff000 or whatever the address is.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 18:56:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> But it will eliminate the problem. Remember: the x86 (unlike x86-64)
> has no PC-relative data addressing mode. I.e., in a DSO to find a
> memory location with an address I need a base register which isn't
> available anymore at the time the call is made.

Actually, I see a more serious problem with the current "syscall"
interface: it doesn't allow six-argument system calls AT ALL, since it
needed %ebp to keep the stack pointer.

So a six-argument system call _has_ to use "int $0x80" anyway, which to
some degree simplifies your problem: you can only use the indirect call
approach for things where %ebp will be free for use anyway.

So then you can use %ebp as the indirection, and play %ebp
games, since that register is guaranteed not to be ever used by a system
call (it wasn't guaranteed before, but since the sysenter really needs
something to hold the stack pointer I made %ebp do that, so there's no
way we can ever use %ebp for system calls on x86).

So you _can_ do something like this:

syscall_with_5_args:
pushl %ebx
pushl %esi
pushl %edi
pushl %ebp
movl syscall_ptr + GOT,%ebp // uses DSO ptr in %ebx or whatever
movl $__NR_xxxxxx,%eax
movl 20(%esp),%ebx
movl 24(%esp),%ecx
movl 28(%esp),%edx
movl 32(%esp),%esi
movl 36(%esp),%edi
call *%ebp
.. test for errno if needed ..
popl %ebp
popl %edi
popl %esi
popl %ebx
ret

> You have to assume that all the registers, including %ebp, are used at
> the time of the call.

See above for why this isn't possible right now anyway.

Hmm.. Which system calls have all six parameters? I'll have to see if I
can find any way to make those use the new interface too.

In the meantime, I do agree with you that the TLS approach should work
too, and might be better. It will allow all six arguments to be used if we
just find a good calling conventions (too bad sysenter is such a pig of an
instruction, it's really not very well designed since it loses
information).

Linus

2002-12-17 18:56:40

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 18:48, Ulrich Drepper wrote:
> Alan Cox wrote:
>
> > Is there any reason you can't just keep the linker out of the entire
> > mess by generating
> >
> > .byte whatever
> > .dword 0xFFFF0000
> >
> > instead of call ?
>
> There is no such instruction. Unless you know about some secret
> undocumented opcode...

No, I'd forgotten how broken x86 was.

2002-12-17 19:01:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> But this is exactly what I expect to happen. If you want to implement
> gettimeofday() at user-level you need to modify the page.

Note that I really don't think we ever want to do the user-level
gettimeofday(). The complexity just argues against it, it's better to try
to make system calls be cheap enough that you really don't care.

sysenter helps a bit there.

If we'd need to modify the page, we couldn't share one page between all
processes, and we couldn't make it global in the TLB. So modifying the
info page is something we should avoid at all cost - it's not totally
unlikely that the overheads implied by per-thread pages would drown out
the wins from trying to be clever.

The advantage of the current static fixmap is that it's _extremely_
streamlined. The only overhead is literally the system entry itself, which
while a bit too high on a P4 is not that bad in general (and hopefully
Intel will fix the stupidities that cause the P4 to be slow at kernel
entry. Somebody already mentioned that apparently the newer P4 cores are
actually faster at system calls than mine is).

Linus

2002-12-17 18:57:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Jeff Dike wrote:
>
> [email protected] said:
> > That also allows the kernel to move around the SYSINFO page at will,
> > and even makes it possible to avoid it altogether (ie this will solve
> > the inevitable problems with UML - UML just wouldn't set AT_SYSINFO,
> > so user level just wouldn't even _try_ to use it).
>
> Why shouldn't I just set it to where UML provides its own SYSINFO page?

Sure, that works fine too.

Linus

2002-12-17 19:05:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>
> It's not as good as a pure user-mode solution using tsc could be, but
> we've seen the kinds of complexities that has with multi-CPU systems, and
> they are so painful that I suspect the sysenter approach is a lot more
> palatable even if it doesn't allow for the absolute best theoretical
> numbers.
>

The complexity only applies to nonsynchronized TSCs though, I would
assume. I believe x86-64 uses a vsyscall using the TSC when it can
provide synchronized TSCs, and if it can't it puts a normal system call
inside the vsyscall in question.

-hpa


2002-12-17 19:03:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

dean gaudet wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
>
>>It's not as good as a pure user-mode solution using tsc could be, but
>>we've seen the kinds of complexities that has with multi-CPU systems, and
>>they are so painful that I suspect the sysenter approach is a lot more
>>palatable even if it doesn't allow for the absolute best theoretical
>>numbers.
>
> don't many of the multi-CPU problems with tsc go away because you've got a
> per-cpu physical page for the vsyscall?
>
> i.e. per-cpu tsc epoch and scaling can be set on that page.
>
> the only trouble i know of is what happens when an interrupt occurs and
> the task is rescheduled on another cpu... in theory you could test %eip
> against 0xfffffxxx and "rollback" (or complete) any incomplete
> gettimeofday call prior to saving a task's state. but i bet that test is
> undesirable on all interrupt paths.
>

Exactly. This is a real problem.

-hpa

2002-12-17 19:11:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ulrich Drepper wrote:
> Alan Cox wrote:
>
>
>>Is there any reason you can't just keep the linker out of the entire
>>mess by generating
>>
>> .byte whatever
>> .dword 0xFFFF0000
>>
>>instead of call ?
>
>
> There is no such instruction. Unless you know about some secret
> undocumented opcode...
>

Well, there is lcall $0xffff0000, $USER_CS... (no, I'm most definitely
*not* suggesting it.)

-hpa

2002-12-17 19:14:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>
> On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
>>But this is exactly what I expect to happen. If you want to implement
>>gettimeofday() at user-level you need to modify the page.
>
> Note that I really don't think we ever want to do the user-level
> gettimeofday(). The complexity just argues against it, it's better to try
> to make system calls be cheap enough that you really don't care.
>

Let's see... it works fine on UP and on *most* SMP, and on the ones
where it doesn't work you just fill in a system call into the vsyscall
slot. It just means that gettimeofday() needs a different vsyscall slot.

-hpa

2002-12-17 19:12:03

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> In the meantime, I do agree with you that the TLS approach should work
> too, and might be better. It will allow all six arguments to be used if we
> just find a good calling conventions

If you push out the AT_* patch I'll hack the glibc bits (probably the
TLS variant). Won't take too long, you'll get results this afternoon.

What about AMD's instruction? Is it as flawed as sysenter? If not and
%ebp is available I really should use the TLS method.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 19:19:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> Hmm.. Which system calls have all six parameters? I'll have to see if I
> can find any way to make those use the new interface too.

The only ones I found from a quick grep are
- sys_recvfrom
- sys_sendto
- sys_mmap2()
- sys_ipc()

and none of them are of a kind where the system call entry itself is the
biggest performance issue (and sys_ipc() is deprecated anyway), so it's
probably acceptable to just use the old interface for them.

One other alternative is to change the calling convention for the
new-style system call, and not have arguments in registers at all. We
could make the interface something like

- %eax contains system call number
- %edx contains pointer to argument block
- call *syscallptr // trashes all registers

and then the old "compatibility" function would be something like

movl 0(%edx),%ebx
movl 4(%edx),%ecx
movl 12(%edx),%esi
movl 16(%edx),%edi
movl 20(%edx),%ebp
movl 8(%edx),%edx
int $0x80
ret

while the "sysenter" interface would do the loads from kernel space.

That would make some things easier, but the problem with this approach is
that if you have a single-argument system call, and you just pass in the
stack pointer offset in %edx directly, then the system call stubs will
always load 6 arguments, and if we're just at the end of the stack it
won't actually _work_. So part of the calling convention would have to be
the guarantee that there is stack-space available (should always be true
in practice, of course).
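A small C model of that argument-block convention (struct and function names are illustrative, not from the thread): the compatibility stub always reads six slots regardless of how many the call actually uses, which is exactly the behavior behind the stack-space caveat above.

```c
#include <assert.h>

/* Model of the proposed convention: the syscall number plays the role
 * of %eax, and the argument block pointer plays the role of %edx. */

struct argblock { long a[6]; };

static long fake_kernel(long nr, const struct argblock *args)
{
    /* stand-in for the register loads before "int $0x80" */
    long sum = nr;
    for (int i = 0; i < 6; i++)
        sum += args->a[i];       /* reads all 6, even for 1-arg calls */
    return sum;
}
```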

Linus

2002-12-17 19:24:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>
> On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
>>Hmm.. Which system calls have all six parameters? I'll have to see if I
>>can find any way to make those use the new interface too.
>
>
> The only ones I found from a quick grep are
> - sys_recvfrom
> - sys_sendto
> - sys_mmap2()
> - sys_ipc()
>
> and none of them are of a kind where the system call entry itself is the
> biggest performance issue (and sys_ipc() is deprecated anyway), so it's
> probably acceptable to just use the old interface for them.
>

recvfrom() and sendto() can also be implemented as sendmsg()/recvmsg() if
one really wants to.

What one can also do is that a sixth argument, if one exists, is passed
on the stack (i.e. in (%ebp), since %ebp contains the stack pointer.)

-hpa

2002-12-17 19:28:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, H. Peter Anvin wrote:
>
> Let's see... it works fine on UP and on *most* SMP, and on the ones
> where it doesn't work you just fill in a system call into the vsyscall
> slot. It just means that gettimeofday() needs a different vsyscall slot.

The thing is, gettimeofday() isn't _that_ special. It's just not worth a
vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
Just because we can?

This is especially true since the people who _really_ might care about
gettimeofday() are exactly the people who wouldn't be able to use the fast
user-space-only version.

How much do you think gettimeofday() really matters on a desktop? Sure, X
apps do gettimeofday() calls, but they do a whole lot more of _other_
calls, and gettimeofday() is really far far down in the noise for them.
The people who really call for gettimeofday() as a performance thing seem
to be database people who want it as a timestamp. But those are the same
people who also want NUMA machines which don't necessarily have
synchronized clocks.

Linus

2002-12-17 19:25:32

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

>> It's not as good as a pure user-mode solution using tsc could be, but
>> we've seen the kinds of complexities that has with multi-CPU systems, and
>> they are so painful that I suspect the sysenter approach is a lot more
>> palatable even if it doesn't allow for the absolute best theoretical
>> numbers.
>
> The complexity only applies to nonsynchronized TSCs though, I would
> assume. I believe x86-64 uses a vsyscall using the TSC when it can
> provide synchronized TSCs, and if it can't it puts a normal system call
> inside the vsyscall in question.

You can't use the TSC to do gettimeofday on boxes where they aren't
synchronised anyway though. That's nothing to do with vsyscalls, you just
need a different time source (eg the legacy stuff or HPET/cyclone).

M.

2002-12-17 19:40:44

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 11:10:20AM -0800, Linus Torvalds wrote:
> Intel will fix the stupidities that cause the P4 to be slow at kernel
> entry. Somebody already mentioned that apparently the newer P4 cores are
> actually faster at system calls than mine is).

My HT Northwood returns slightly better results than your Xeon,
but the sysenter stuff still completely trounces it.

(19:38:46:davej@tetrachloride:davej)$ ./a.out
440.107164 cycles
1152.596084 cycles

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-17 19:45:42

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>

> The only ones I found from a quick grep are
> - sys_recvfrom
> - sys_sendto
> - sys_mmap2()
> - sys_ipc()

All but mmap2 avoid actually using 6 parameters: they are implemented via
the sys_ipc multiplexer, which takes the stack pointer as an argument and
then uses it to locate the parameters.


--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 19:36:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, H. Peter Anvin wrote:
>
> What one can also do is that a sixth argument, if one exists, is passed
> on the stack (i.e. in (%ebp), since %ebp contains the stack pointer.)

I like this. I will make it so. It will allow the old calling conventions
and has none of the stack size issues that my "memory block" approach had.

Also, this will all be done inside the wrapper and is thus entirely
invisible to the caller. Good, this solves the six-arg case nicely.

Linus

2002-12-17 19:41:23

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On 17 Dec 2002, Alan Cox wrote:

> On Tue, 2002-12-17 at 18:48, Ulrich Drepper wrote:
> > Alan Cox wrote:
> >
> > > Is there any reason you can't just keep the linker out of the entire
> > > mess by generating
> > >
> > > .byte whatever
> > > .dword 0xFFFF0000
> > >
> > > instead of call ?
> >
> > There is no such instruction. Unless you know about some secret
> > undocumented opcode...
>
> No I'd forgotten how broken x86 was
>

You can call intersegment with a full pointer. I don't know how
expensive that is. Since USER_CS is a fixed value in Linux, it
can be hard-coded:

.byte 0x9a
.dword 0xfffff000
.word USER_CS

No, I didn't try this, I'm just looking at the manual. I don't know
what USER_CS is (didn't look in the kernel). The book says the
pointer is 16:32, which means that it's a dword followed by a word.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-17 19:36:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>
> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> Just because we can?
>

getpid() could be implemented in userspace, but not via vsyscalls
(instead it could be passed in the ELF data area at process start.)

"Because we can and it's relatively easy" is a pretty good argument in
my opinion.

> This is especially true since the people who _really_ might care about
> gettimeofday() are exactly the people who wouldn't be able to use the fast
> user-space-only version.
>
> How much do you think gettimeofday() really matters on a desktop? Sure, X
> apps do gettimeofday() calls, but they do a whole lot more of _other_
> calls, and gettimeofday() is really far far down in the noise for them.
> The people who really call for gettimeofday() as a performance thing seem
> to be database people who want it as a timestamp. But those are the same
> people who also want NUMA machines which don't necessarily have
> synchronized clocks.
>

I think this is really an overstatement. Timestamping etc. (and heck,
even databases) are actually perfectly usable even on smaller machines
these days. Sure, DB vendors like to boast of their 128-way NUMA
machines, but I suspect the bulk of them run on single- and
dual-processor machines (by count, not necessarily by data volume.)

-hpa


2002-12-17 19:49:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Richard B. Johnson wrote:
>
> You can call intersegment with a full pointer. I don't know how
> expensive that is.

It's so expensive as to not be worth it; it's cheaper to load a register
or something, i.e. you can do

pushl $0xfffff000
call *(%esp)

faster than doing a far call.
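[In portable C terms, the sequence above is just an indirect near call through a constant address. A minimal sketch, using an ordinary function in place of the fixed 0xfffff000 entry point -- the names here are hypothetical:]

```c
/* Stand-in for the kernel-provided entry stub at a fixed address. */
static int entry_stub(void)
{
	return 42;
}

int call_via_pointer(void)
{
	/* The compiler materializes the target address in a register or
	 * on the stack and emits an indirect near call ("call *%reg" or
	 * "call *(%esp)"), which is far cheaper on x86 than a far call
	 * through a full segment:offset pointer. */
	int (*fn)(void) = entry_stub;
	return fn();
}
```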

Linus

2002-12-17 19:51:23

by Matti Aarnio

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 11:37:04AM -0800, Linus Torvalds wrote:
> On Tue, 17 Dec 2002, H. Peter Anvin wrote:
> > Let's see... it works fine on UP and on *most* SMP, and on the ones
> > where it doesn't work you just fill in a system call into the vsyscall
> > slot. It just means that gettimeofday() needs a different vsyscall slot.
>
> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> Just because we can?

clone() -- which doesn't really like anybody using the stack pointer?

(I do use gettimeofday() a _lot_, but I have my own userspace
mapped shared segment thingamajingie doing it.. And I write
code that runs on lots of systems, not only at Linux. )

> Linus

/Matti Aarnio

2002-12-17 19:52:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


How about this diff? It does both the 6-parameter thing _and_ the
AT_SYSINFO addition. Untested, since I have to run off and watch my kids
do their winter program ;)

Linus

-----
===== arch/i386/kernel/entry.S 1.42 vs edited =====
--- 1.42/arch/i386/kernel/entry.S Mon Dec 16 21:39:04 2002
+++ edited/arch/i386/kernel/entry.S Tue Dec 17 11:59:13 2002
@@ -232,7 +232,7 @@
#endif

/* Points to after the "sysenter" instruction in the vsyscall page */
-#define SYSENTER_RETURN 0xfffff007
+#define SYSENTER_RETURN 0xffffe007

# sysenter call handler stub
ALIGN
@@ -243,6 +243,21 @@
pushfl
pushl $(__USER_CS)
pushl $SYSENTER_RETURN
+
+/*
+ * Load the potential sixth argument from user stack.
+ * Careful about security.
+ */
+ cmpl $0xc0000000,%ebp
+ jae syscall_badsys
+1: movl (%ebp),%ebp
+.section .fixup,"ax"
+2: xorl %ebp,%ebp
+.previous
+.section __ex_table,"a"
+ .align 4
+ .long 1b,2b
+.previous

pushl %eax
SAVE_ALL
===== arch/i386/kernel/sysenter.c 1.1 vs edited =====
--- 1.1/arch/i386/kernel/sysenter.c Mon Dec 16 21:39:04 2002
+++ edited/arch/i386/kernel/sysenter.c Tue Dec 17 11:39:39 2002
@@ -48,14 +48,14 @@
0xc3 /* ret */
};
static const char sysent[] = {
- 0x55, /* push %ebp */
0x51, /* push %ecx */
0x52, /* push %edx */
+ 0x55, /* push %ebp */
0x89, 0xe5, /* movl %esp,%ebp */
0x0f, 0x34, /* sysenter */
+ 0x5d, /* pop %ebp */
0x5a, /* pop %edx */
0x59, /* pop %ecx */
- 0x5d, /* pop %ebp */
0xc3 /* ret */
};
unsigned long page = get_zeroed_page(GFP_ATOMIC);
===== include/asm-i386/elf.h 1.3 vs edited =====
--- 1.3/include/asm-i386/elf.h Thu Oct 17 00:48:55 2002
+++ edited/include/asm-i386/elf.h Tue Dec 17 10:12:58 2002
@@ -100,6 +100,12 @@

#define ELF_PLATFORM (system_utsname.machine)

+/*
+ * Architecture-neutral AT_ values in 0-17, leave some room
+ * for more of them, start the x86-specific ones at 32.
+ */
+#define AT_SYSINFO 32
+
#ifdef __KERNEL__
#define SET_PERSONALITY(ex, ibcs2) set_personality((ibcs2)?PER_SVR4:PER_LINUX)

@@ -115,6 +121,11 @@
extern void dump_smp_unlazy_fpu(void);
#define ELF_CORE_SYNC dump_smp_unlazy_fpu
#endif
+
+#define ARCH_DLINFO \
+do { \
+ NEW_AUX_ENT(AT_SYSINFO, 0xffffe000); \
+} while (0)

#endif

===== include/asm-i386/fixmap.h 1.9 vs edited =====
--- 1.9/include/asm-i386/fixmap.h Mon Dec 16 21:39:04 2002
+++ edited/include/asm-i386/fixmap.h Tue Dec 17 10:11:31 2002
@@ -42,8 +42,8 @@
* task switches.
*/
enum fixed_addresses {
- FIX_VSYSCALL,
FIX_HOLE,
+ FIX_VSYSCALL,
#ifdef CONFIG_X86_LOCAL_APIC
FIX_APIC_BASE, /* local (CPU) APIC) -- required for SMP or not */
#endif

2002-12-17 19:59:56

by Matti Aarnio

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

(cutting down To:/Cc:)

On Tue, Dec 17, 2002 at 11:43:57AM -0800, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> > vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> > Just because we can?
>
> getpid() could be implemented in userspace, but not via vsyscalls
> (instead it could be passed in the ELF data area at process start.)

After fork() or clone() ?
If we had only spawn(), and some separate way to start threads...

...
> -hpa

/Matti Aarnio

2002-12-17 19:58:50

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?

This is why I'd say make no distinction at all. Have the first
nr_syscalls * 8 bytes starting at 0xfffff000 as a jump table. We can
transfer to a different slot for each syscall. Each slot then could be
a PC-relative jump to the common sysenter code or to some special code
sequence which is also in the global page.

If we don't do this now and it seems desirable in future we either have
to introduce a second ABI for the vsyscall stuff (ugly!) or you'll have
to do the demultiplexing yourself in the code starting at 0xfffff000.


--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 20:02:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Matti Aarnio wrote:
> (cutting down To:/Cc:)
>
> On Tue, Dec 17, 2002 at 11:43:57AM -0800, H. Peter Anvin wrote:
>
>>Linus Torvalds wrote:
>>
>>>The thing is, gettimeofday() isn't _that_ special. It's just not worth a
>>>vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
>>>Just because we can?
>>
>>getpid() could be implemented in userspace, but not via vsyscalls
>>(instead it could be passed in the ELF data area at process start.)
>
>
> After fork() or clone() ?
> If we had only spawn(), and some separate way to start threads...
>

fork() and clone() would have to return the self-pid as an auxiliary
return value. This, of course, is getting rather fuggly.

Anything that cares caches getpid() anyway.
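[The caching described above is trivial to sketch in userspace C; `my_getpid` is a hypothetical wrapper, and a real implementation would also have to invalidate the cache across fork():]

```c
#include <sys/types.h>
#include <unistd.h>

static pid_t cached_pid;	/* 0 means "not yet fetched" */

pid_t my_getpid(void)
{
	/* Only the first call pays for a real system call; every
	 * later call is just a memory load. */
	if (cached_pid == 0)
		cached_pid = getpid();
	return cached_pid;
}
```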

-hpa


2002-12-17 20:03:18

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 19:26, Martin J. Bligh wrote:
> >> It's not as good as a pure user-mode solution using tsc could be, but
> You can't use the TSC to do gettimeofday on boxes where they aren't
> syncronised anyway though. That's nothing to do with vsyscalls, you just
> need a different time source (eg the legacy stuff or HPET/cyclone).

Ditto all the laptops and the like. With code provided by the kernel we
can cheat however. If we know the fastest the CPU can go (ie full speed
on spudstop/powernow etc) we can tell the tsc value at which we have to
query the kernel to get time to any given accuracy, so allowing limited
caching

Ditto by knowing the worst case drift on summit
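[The bound sketched above reduces to simple arithmetic: if the TSC can tick at most `max_hz` times per second, a cached timestamp stays within `accuracy_us` microseconds for at most that many ticks. A rough sketch, with hypothetical names:]

```c
#include <stdint.h>

/* Maximum number of TSC ticks for which a cached gettimeofday()
 * result is still guaranteed accurate to 'accuracy_us' microseconds,
 * assuming the CPU never runs faster than 'max_hz' (i.e. full speed,
 * no frequency scaling above it). */
uint64_t max_cached_ticks(uint64_t max_hz, uint64_t accuracy_us)
{
	return max_hz / 1000000 * accuracy_us;
}
```

e.g. on a 2GHz part with a 10us accuracy budget, the cached value may be reused for up to 20,000 ticks before querying the kernel again.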

Alan

2002-12-17 20:01:34

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 19:12, H. Peter Anvin wrote:
> The complexity only applies to nonsynchronized TSCs though, I would
> assume. I believe x86-64 uses a vsyscall using the TSC when it can
> provide synchronized TSCs, and if it can't it puts a normal system call
> inside the vsyscall in question.

For x86-64 there is the HPET timer, which is a lot saner, but I don't
think we can mmap it.


2002-12-17 20:09:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Alan Cox wrote:
> On Tue, 2002-12-17 at 19:26, Martin J. Bligh wrote:
>
>>>>It's not as good as a pure user-mode solution using tsc could be, but
>>
>>You can't use the TSC to do gettimeofday on boxes where they aren't
>>syncronised anyway though. That's nothing to do with vsyscalls, you just
>>need a different time source (eg the legacy stuff or HPET/cyclone).
>
>
> Ditto all the laptops and the like. With code provided by the kernel we
> can cheat however. If we know the fastest the CPU can go (ie full speed
> on spudstop/powernow etc) we can tell the tsc value at which we have to
> query the kernel to get time to any given accuracy, so allowing limited
> caching
>
> Ditto by knowing the worst case drift on summit
>

Clever. I like it :)

-hpa


2002-12-17 20:09:51

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> ===== arch/i386/kernel/sysenter.c 1.1 vs edited =====
> --- 1.1/arch/i386/kernel/sysenter.c Mon Dec 16 21:39:04 2002
> +++ edited/arch/i386/kernel/sysenter.c Tue Dec 17 11:39:39 2002
> @@ -48,14 +48,14 @@
> 0xc3 /* ret */
> };
> static const char sysent[] = {
> - 0x55, /* push %ebp */
> 0x51, /* push %ecx */
> 0x52, /* push %edx */
> + 0x55, /* push %ebp */
> 0x89, 0xe5, /* movl %esp,%ebp */
> 0x0f, 0x34, /* sysenter */
> + 0x5d, /* pop %ebp */
> 0x5a, /* pop %edx */
> 0x59, /* pop %ecx */
> - 0x5d, /* pop %ebp */
> 0xc3 /* ret */

Instead of duplicating the push/pop %ebp just use the first one by using

movl 12(%ebp), %ebp

in the kernel code, or remove the first. The latter is better: smaller code.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 20:26:04

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 12:06:29PM -0800, Ulrich Drepper wrote:
> Linus Torvalds wrote:
>
> > The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> > vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
>
> This is why I'd say make no distinction at all. Have the first
> nr_syscalls * 8 bytes starting at 0xfffff000 as a jump table. We can
> transfer to a different slot for each syscall. Each slot then could be
> a PC-relative jump to the common sysenter code or to some special code
> sequence which is also in the global page.
>
> If we don't do this now and it seems desirable in future we either have
> to introduce a second ABI for the vsyscall stuff (ugly!) or you'll have
> to do the demultiplexing yourself in the code starting at 0xfffff000.

But what does this do to things like PTRACE_SYSCALL? And do we care...
I suppose not if we keep the syscall trace checks on every kernel entry
path.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-12-17 21:27:00

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 10:49:31AM -0800, Linus Torvalds wrote:
> There is only a "call relative" or "call indirect-absolute". So you either
> have to indirect through memory or a register, or you have to fix up the
> call at link-time.
>
> Yeah, I know it sounds strange, but it makes sense. Absolute calls are
> actually very unusual, and using relative calls is _usually_ the right
> thing to do. It's only in cases like this that we really want to call a
> specific address.

The stubs I used for the vsyscall bits just did an absolute jump to
the vsyscall page, which would then do a ret to the original calling
userspace code (since that provided library symbols for the user to
bind against).

-ben
--
"Do you seek knowledge in time travel?"

2002-12-17 21:30:41

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 10:57:29AM -0800, Ulrich Drepper wrote:
> But this is exactly what I expect to happen. If you want to implement
> gettimeofday() at user-level you need to modify the page. Some of the
> information the kernel has to keep for the thread group can be stored in
> this place and eventually be used by some userlevel code executed by
> jumping to 0xfffff000 or whatever the address is.

You don't actually need to modify the page, rather the data for the user
level gettimeofday needs to be in a shared page and some register (like
%tr) must expose the current cpu number to index into the data. Either
way, it's an internal implementation detail for the kernel to take care
of, with multiple potential solutions.

-ben
--
"Do you seek knowledge in time travel?"

2002-12-17 21:31:59

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 11:11:19AM -0800, H. Peter Anvin wrote:
> > against 0xfffffxxx and "rollback" (or complete) any incomplete
> > gettimeofday call prior to saving a task's state. but i bet that test is
> > undesirable on all interrupt paths.
> >
>
> Exactly. This is a real problem.

No, just take the number of context switches before and after the attempt
to read the time of day.

-ben
--
"Do you seek knowledge in time travel?"

2002-12-17 21:29:03

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Benjamin LaHaise wrote:
>
> The stubs I used for the vsyscall bits just did an absolute jump to
> the vsyscall page, which would then do a ret to the original calling
> userspace code (since that provided library symbols for the user to
> bind against).
>

What kind of "absolute jumps" were these?

-hpa


2002-12-17 21:34:20

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Benjamin LaHaise wrote:
> On Tue, Dec 17, 2002 at 11:11:19AM -0800, H. Peter Anvin wrote:
>
>>>against 0xfffffxxx and "rollback" (or complete) any incomplete
>>>gettimeofday call prior to saving a task's state. but i bet that test is
>>>undesirable on all interrupt paths.
>>>
>>
>>Exactly. This is a real problem.
>
>
> No, just take the number of context switches before and after the attempt
> to read the time of day.
>

How do you do that from userspace, atomically? A counter in the shared
page?

-hpa


2002-12-17 21:33:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Benjamin LaHaise wrote:
> On Tue, Dec 17, 2002 at 10:57:29AM -0800, Ulrich Drepper wrote:
>
>>But this is exactly what I expect to happen. If you want to implement
>>gettimeofday() at user-level you need to modify the page. Some of the
>>information the kernel has to keep for the thread group can be stored in
>>this place and eventually be used by some userlevel code executed by
>>jumping to 0xfffff000 or whatever the address is.
>
>
> You don't actually need to modify the page, rather the data for the user
> level gettimeofday needs to be in a shared page and some register (like
> %tr) must expose the current cpu number to index into the data. Either
> way, it's an internal implementation detail for the kernel to take care
> of, with multiple potential solutions.
>

That's not the problem... the problem is that the userland code can get
preempted at any time and rescheduled on another CPU.

-hpa


2002-12-17 21:45:40

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 01:41:55PM -0800, H. Peter Anvin wrote:
> > No, just take the number of context switches before and after the attempt
> > to read the time of day.

> How do you do that from userspace, atomically? A counter in the shared
> page?

Yup. You need some shared data for the TSC offset anyway, so
moving the context switch counter onto such a page won't be much of
a problem. Using the %tr trick to get the CPU number would allow for
some of these data structures to be per-cpu without incurring any LOCK
overhead, too.
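[The counter scheme is essentially a sequence lock: the kernel bumps a counter in the shared page around every update, and userspace retries until it reads a consistent snapshot. A self-contained sketch, with the shared page simulated and all names hypothetical:]

```c
#include <stdint.h>

/* Simulated kernel-maintained shared page. */
struct time_page {
	volatile uint32_t seq;		/* odd while the kernel is updating */
	volatile uint64_t seconds;
	volatile uint64_t micros;
};

/* Retry until the counter is even and unchanged across the reads,
 * i.e. no update (or context switch) happened in between. */
void read_time(const struct time_page *p, uint64_t *s, uint64_t *us)
{
	uint32_t seq;
	do {
		seq = p->seq;
		*s = p->seconds;
		*us = p->micros;
	} while ((seq & 1) || seq != p->seq);
}
```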

-ben
--
"Do you seek knowledge in time travel?"

2002-12-17 21:42:36

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 01:36:53PM -0800, H. Peter Anvin wrote:
> Benjamin LaHaise wrote:
> >
> > The stubs I used for the vsyscall bits just did an absolute jump to
> > the vsyscall page, which would then do a ret to the original calling
> > userspace code (since that provided library symbols for the user to
> > bind against).
> >
>
> What kind of "absolute jumps" were these?

It was a far jump (ljmp $__USER_CS,<address>).

-ben
--
"Do you seek knowledge in time travel?"

2002-12-17 23:12:59

by Andre Hedrick

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


Okay I will go back to my storage cave, call when you need something.

Got some meat tenderizer for the shoe leather to make choking it down
easier?

Cheers,

On Tue, 17 Dec 2002, Dave Jones wrote:

> On Tue, Dec 17, 2002 at 01:45:52AM -0800, Andre Hedrick wrote:
>
> > Are you serious about moving of the banging we currently do on 0x80?
> > If so, I have a P4 development board with leds to monitor all the lower io
> > ports and can decode for you.
>
> INT 0x80 != IO port 0x80
>
> 8-)
>
> Dave
>
> --
> | Dave Jones. http://www.codemonkey.org.uk
> | SuSE Labs
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

Andre Hedrick
LAD Storage Consulting Group

2002-12-18 00:11:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Ulrich Drepper wrote:
>
> This is why I'd say make no distinction at all. Have the first
> nr_syscalls * 8 bytes starting at 0xfffff000 as a jump table.

No, the way sysenter works, the table approach just sucks up dcache space
(the kernel cannot know which sysenter is the one that the user uses
anyway, so the jump table would have to just add back some index and we'd
be back exactly where we started)

I'll keep it the way it is now.

Linus

2002-12-18 00:32:13

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> No, the way sysenter works, the table approach just sucks up dcache space
> (the kernel cannot know which sysenter is the one that the user uses
> anyway, so the jump table would have to just add back some index and we'd
> be back exactly where we started)
>
> I'll keep it the way it is now.

I won't argue since honestly, not doing it is much easier for me. But I
want to be sure I'm clear.

What I suggested is to have the first part of the global page be


.p2align 3
jmp sysenter_label
.p2align 3
jmp sysenter_label
...
.p2align 3
jmp userlevel_gettimeofday

sysenter_label:
the usual sysenter code

userlevel_gettimeofday:
whatever necessary


All this would be in the global page. There is only one sysenter call.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-18 04:06:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Tue, 17 Dec 2002, Ulrich Drepper wrote:

> > - 0x55, /* push %ebp */
> > + 0x55, /* push %ebp */
> > + 0x5d, /* pop %ebp */
> > - 0x5d, /* pop %ebp */
>
> Instead of duplicating the push/pop %ebp just use the first one by using

No, it's not duplicating it. Look closer. It's just _moving_ it, so that
the old %ebp value will naturally be pointed to by %esp, which is what we
want.

Anyway, I reverted the %ebp games from my kernel, because they are
fundamentally not restartable and thus not really a good idea. Besides, it
might be wrong to try to optimize the fast system calls to handle six
arguments too, if that makes the (much more common case) the other system
calls slower. So the six-argument case might as well just continue to use
"int 0x80".

Linus


2002-12-18 04:07:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> How about this diff? It does both the 6-parameter thing _and_ the
> AT_SYSINFO addition.

The 6-parameter thing is broken. It's clever, but playing games with %ebp
is not going to work with restarting of the system call - we need to
restart with the proper %ebp.

I pushed out the AT_SYSINFO stuff, but we're back to the "needs to use
'int $0x80' for system calls that take 6 arguments" drawing board.

The only sane way I see to fix the %ebp problem is to actually expand the
kernel "struct pt_regs" to have separate "ebp" and "arg6" fields (so that
we can re-start with the right ebp, and have arg6 as the right argument on
the stack). That would work but is not really worth it.

Linus


2002-12-18 05:26:20

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 2002-12-17 at 09:55, Linus Torvalds wrote:
> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).

The P4 optimisation guide promises horrible things if you write within
2k of a cached instruction from another CPU (it dumps the whole trace
cache, it seems), so you'd need to be careful about mixing mutable data
and the syscall code in that page.

Immutable data should be fine.

J

2002-12-18 08:38:54

by kaih

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

[email protected] (Linus Torvalds) wrote on 17.12.02 in <[email protected]>:

> On Tue, 17 Dec 2002, H. Peter Anvin wrote:
> >
> > Let's see... it works fine on UP and on *most* SMP, and on the ones
> > where it doesn't work you just fill in a system call into the vsyscall
> > slot. It just means that gettimeofday() needs a different vsyscall slot.
>
> The thing is, gettimeofday() isn't _that_ special. It's just not worth a
> vsyscall of its own, I feel. Where do you stop? Do we do getpid() too?
> Just because we can?

It's special enough that while programming under DOS, I had my own special
routine which just took the BIOS ticker from low memory for a lot of
things - even to decide if calling the actual time-of-day syscall was
useful or if I should expect to get the same value back as last time.

That was a *serious* performance improvement. (Of course, DOS syscalls are
S-L-O-W ...)

These days, the equivalent does call gettimeofday(). It's still probably
the most-used syscall by far. (Hmm - maybe I can get some numbers for
that? Must see if I get time today.) And *that* is why optimizing this one
call makes sense.

> This is especially true since the people who _really_ might care about
> gettimeofday() are exactly the people who wouldn't be able to use the fast
> user-space-only version.

Say what? Why wouldn't I be able to use it? Right now, I know of no SMP
installation that's even in the planning ...

> How much do you think gettimeofday() really matters on a desktop? Sure, X

Why desktop? We use the same kind of thing in the server, and it's much
more important there. Client performance is uninteresting - clients mostly
wait anyway.

> The people who really call for gettimeofday() as a performance thing seem
> to be database people who want it as a timestamp. But those are the same

Not database, but otherwise on the nail.

> people who also want NUMA machines which don't necessarily have
> synchronized clocks.

Nope, no interest in those. SMP *might* become interesting, but I don't
think we'd ever want to care about weird stuff like NUMA ... at least not
for the next five years or so.

We don't shovel nearly as much data around as the database guys.

MfG Kai

2002-12-18 08:38:52

by kaih

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

[email protected] (Linus Torvalds) wrote on 17.12.02 in <[email protected]>:

> On Tue, 17 Dec 2002, Richard B. Johnson wrote:
> >
> > You can call intersegment with a full pointer. I don't know how
> > expensive that is.
>
> It's so expensive as to not be worth it, it's cheaper to load a register
> or something, i.e. you can do
>
> pushl $0xfffff000
> call *(%esp)
>
> faster than doing a far call.

Hmm ...

How expensive would it be to have a special virtual DSO built into ld.so
which exported this (and any other future entry points), to be linked
against like any other DSO? That way, the *actual* interface would only be
between the kernel and ld.so.

MfG Kai

2002-12-18 12:50:21

by Rogier Wolff

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 11:10:20AM -0800, Linus Torvalds wrote:
>
>
> On Tue, 17 Dec 2002, Ulrich Drepper wrote:
> >
> > But this is exactly what I expect to happen. If you want to implement
> > gettimeofday() at user-level you need to modify the page.
>
> Note that I really don't think we ever want to do the user-level
> gettimeofday(). The complexity just argues against it, it's better to try
> to make system calls be cheap enough that you really don't care.

I'd say that this should not be "fixed" from userspace, but from the
kernel. Thus if the kernel knows that the "gettimeofday" can be made
faster by doing it completely in userspace, then that system call
should be "patched" by the kernel to do it faster for everybody.

Next, someone might find a faster (full-userspace) way to do some
"reads"(*). Then it might pay to check for that specific
filedescriptor in userspace, and only call into the kernel for the
other filedescriptors. The idea is that the kernel knows best when
optimizations are possible.

Thus that ONE magic address is IMHO not the right way to do it. By
demultiplexing the stuff in userspace, you can do "sane" things with
specific syscalls.

So for example, the code at 0xffff8000 would be:
movl $0x00,%eax
int $0x80
ret

(in the case where sysenter & friends is not available)

moving the "load syscall number into the register" into the
kernel-modifiable area does not cost a thing, but because we have
demultiplexed the code, we can easily replace the gettimeofday call by
something that (when it's easy) doesn't require the 600-cycle call
into kernel mode.

The "syscall _NR" would then become:

call 0xffff8000 + _NR * 0x80

allowing for up to 0x80 bytes of "patchup code" or "do it quickly"
code, but also for a jump to some other "magic page", that has more
extensive code.

(Oh, I'm showing a base of 0xffff8000: A bit lower than previous
suggestions: allowing for a per-syscall entrypoint, and up to 0x80
bytes of fixup or "do it really quickly" code.)

P.S. People might argue that using this large "stride" would have a
larger cache-footprint. I think that all "where it matters" programs
will have a very small working-set of system calls. It might pay to
use a stride of say 0xa0 to spread the different
system-call-code-points over different cache-lines whenever possible.
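[The proposed layout makes each entry point a fixed arithmetic function of the syscall number; a tiny sketch of the address computation, with base and stride taken from the proposal above (both hypothetical):]

```c
#include <stdint.h>

#define VSYS_BASE	0xffff8000u	/* proposed base of the syscall stubs */
#define VSYS_STRIDE	0x80u		/* bytes reserved per syscall stub */

uint32_t vsys_entry(unsigned int nr)
{
	/* "call 0xffff8000 + _NR * 0x80" resolves to this address. */
	return VSYS_BASE + nr * VSYS_STRIDE;
}
```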

Roger.


(*) I was trying to pick a particularly unlikely case, but I can even
see a case where this is useful. For example, a kernel might be
compiled with "high performance pipes", which would move most of the
pipe reads and writes into userspace, through a shared-memory window.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an *
* excursion: The stable situation does not include humans. ***************

2002-12-18 12:53:19

by Rogier Wolff

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 17, 2002 at 11:37:04AM -0800, Linus Torvalds wrote:
> How much do you think gettimeofday() really matters on a desktop? Sure, X
> apps do gettimeofday() calls, but they do a whole lot more of _other_
> calls, and gettimeofday() is really far far down in the noise for them.
> The people who really call for gettimeofday() as a performance thing seem
> to be database people who want it as a timestamp. But those are the same
> people who also want NUMA machines which don't necessarily have
> synchronized clocks.

Once the kernel provides the right infrastructure, doing it may become
so easy that it can be tried and implemented and benchmarked with so
little effort that it would simply stick.

Roger.


--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an *
* excursion: The stable situation does not include humans. ***************

2002-12-18 13:07:47

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 17 Dec 2002, Linus Torvalds wrote:

>
> On Tue, 17 Dec 2002, Linus Torvalds wrote:
> >
> > How about this diff? It does both the 6-parameter thing _and_ the
> > AT_SYSINFO addition.
>
> The 6-parameter thing is broken. It's clever, but playing games with %ebp
> is not going to work with restarting of the system call - we need to
> restart with the proper %ebp.
>
> I pushed out the AT_SYSINFO stuff, but we're back to the "needs to use
> 'int $0x80' for system calls that take 6 arguments" drawing board.
>
> The only sane way I see to fix the %ebp problem is to actually expand the
> kernel "struct pt_regs" to have separate "ebp" and "arg6" fields (so that
> we can re-start with the right ebp, and have arg6 as the right argument on
> the stack). That would work but is not really worth it.
>
> Linus
>

How about for the new interface, a one-parameter arg, i.e., a pointer
to a descriptor (structure)?? For the typical one-argument call, i.e.,
getpid(), it's just one de-reference. The pointer register can be
EAX on Intel, a register normally available in a 'C' call.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-18 15:02:47

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, 2002-12-18 at 05:34, Jeremy Fitzhardinge wrote:
> The P4 optimisation guide promises horrible things if you write within
> 2k of a cached instruction from another CPU (it dumps the whole trace
> cache, it seems), so you'd need to be careful about mixing mutable data
> and the syscall code in that page.

The PIII errata promise worse things with SMP and code modified as
another cpu ruins it and seems to mark them WONTFIX, so there is another
dragon to beware of

2002-12-19 13:54:03

by Shuji YAMAMURA

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi,

We've measured gettimeofday() cost on both Xeon and P3, too.
We also measured it on different kernels (UP and MP).

              Xeon (2GHz)      P3 (1GHz)
=========================================
UP kernel      939 [ns]         441 [ns]
              1878 [cycles]     441 [cycles]
-----------------------------------------
MP kernel     1054 [ns]         485 [ns]
              2108 [cycles]     485 [cycles]
-----------------------------------------
(The kernel version is 2.4.18)

In this experiment, the Xeon is two times slower than the P3, even
though the Xeon's clock frequency is two times higher. Moreover, the
performance difference between the UP and MP kernels is very
interesting in the Xeon's case: the Xeon's difference (230 cycles) is
five times larger than the P3's (44 cycles).

We think the lock-prefixed instructions in the MP kernel hurt the
Xeon's performance, since they serialize operations in the execution
pipeline. On P4/Xeon systems, these locked operations should be
avoided as much as possible.

The following web page shows the details of this experiment.

http://www.labs.fujitsu.com/en/techinfo/linux/lse-0211/index.html

Regards

At Mon, 16 Dec 2002 22:18:27 -0800 (PST),
Linus wrote:
> On Mon, 16 Dec 2002, Linus Torvalds wrote:
> >
> > For gettimeofday(), the results on my P4 are:
> >
> > sysenter: 1280.425844 cycles
> > int/iret: 2415.698224 cycles
> > 1135.272380 cycles diff
> > factor 1.886637
> >
> > ie sysenter makes that system call almost twice as fast.
>
> Final comparison for the evening: a PIII looks very different, since the
> system call overhead is much smaller to begin with. On a PIII, the above
> ends up looking like
>
> gettimeofday() testing:
> sysenter: 561.697236 cycles
> int/iret: 686.170463 cycles
> 124.473227 cycles diff
> factor 1.221602

-----
Shuji Yamamura ([email protected])
Grid Computing & Bioinformatics Laboratory
Information Technology Core Laboratories
Fujitsu Laboratories LTD.

2002-12-19 22:06:13

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > Are you serious about moving of the banging we currently do on 0x80?
> > If so, I have a P4 development board with leds to monitor all the lower io
> > ports and can decode for you.
>
> Different thing - int 0x80 syscall not i/o port 80. I've done I/O port
> 80 (it's very easy), but requires we set up some udelay constants with an
> initial safety value right at boot (which we should do - we udelay
> before it is initialised)

Actually that would be nice -- I have leds on 0x80 too ;-).
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-19 22:09:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek wrote:
>>
>>Different thing - int 0x80 syscall not i/o port 80. I've done I/O port
>>80 (it's very easy), but requires we set up some udelay constants with an
>>initial safety value right at boot (which we should do - we udelay
>>before it is initialised)
>
> Actually that would be nice -- I have leds on 0x80 too ;-).
> Pavel

We have tried before, and failed. Phoenix uses something like 0xe2, but
apparently some machines with non-Phoenix BIOSes actually use that port.

-hpa

2002-12-19 22:07:05

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> I've created a modified glibc which uses the syscall code for almost
> everything. There are a few int $0x80 left here and there but mostly it
> is a centralized change.
>
> The result: all works as expected. Nice.
>
> On my test machine your little test program performs the syscalls
> roughly twice as fast (HT P4, pretty new). Your numbers are perhaps for
> the P4 Xeons. Anyway, when measuring some more involved code (I ran my
> thread benchmark) I got only about 3% performance increase. It's doing
> a fair amount of system calls. But again, the good news is your code
> survived even this stress test.
>
>
> The problem with the current solution is the instruction set of the x86.
> In your test code you simply use call 0xfffff000 and it magically work.
> But this is only the case because your program is linked statically.
>
> For the libc DSO I had to play some dirty tricks. The x86 CPU has no
> absolute call. The variant with an immediate parameter is a relative
> jump. Only when jumping through a register or memory location is it
> possible to jump to an absolute address. To be clear, if I have
>
> call 0xfffff000
>
> in a DSO which is loaded at address 0x80000000, the jump ends at
> 0x7fffffff. The problem is that the static linker doesn't know the load
> address. We could of course have the dynamic linker fix up the
> addresses but this is plain stupid. It would mean fixing up a lot of
> places and making the pages covered by those fixups non-sharable.

Can't you do call far __SOME_CS, 0xfffff000?

Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-19 22:38:01

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > > But this is exactly what I expect to happen. If you want to implement
> > > gettimeofday() at user-level you need to modify the page.
> >
> > Note that I really don't think we ever want to do the user-level
> > gettimeofday(). The complexity just argues against it, it's better to try
> > to make system calls be cheap enough that you really don't care.
>
> I'd say that this should not be "fixed" from userspace, but from the
> kernel. Thus if the kernel knows that the "gettimeofday" can be made
> faster by doing it completely in userspace, then that system call
> should be "patched" by the kernel to do it faster for everybody.
>
> Next, someone might find a faster (full-userspace) way to do some
> "reads"(*). Then it might pay to check for that specific
> filedescriptor in userspace, and only call into the kernel for the
> other filedescriptors. The idea is that the kernel knows best when
> optimizations are possible.
>
> Thus that ONE magic address is IMHO not the right way to do it. By
> demultiplexing the stuff in userspace, you can do "sane" things with
> specific syscalls.
>
> So for example, the code at 0xffff8000 would be:
> 	movl $0x00, %eax
> 	int $0x80
> 	ret
>
> (in the case where sysenter & friends is not available)

This could save the one register needed for 6-arg syscalls. If the
code at 0xffff8000 were mov %ebp, %eax; sysenter; ret on the P4, you
could do 6-arg syscalls this way.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-19 22:08:09

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > (Modulo the missing syscall page I already mentioned and potential bugs
> > in the code itself, of course ;)
>
> Ok, I did the vsyscall page too, and tried to make it do the right thing
> (but I didn't bother to test it on a non-SEP machine).
>
> I'm pushing the changes out right now, but basically it boils down to the
> fact that with these changes, user space can instead of doing an
>
> int $0x80
>
> instruction for a system call just do a
>
> call 0xfffff000
>
> instead. The vsyscall page will be set up to use sysenter if the CPU
> supports it, and if it doesn't, it will just do the old "int $0x80"
> instead (and it could use the AMD syscall instruction if it wants to).
> User mode shouldn't know or care, the calling convention is the same as it
> ever was.

Perhaps it makes sense to define NOW that gettimeofday is done by

	call 0xfffff100

so we can add vsyscalls later?
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-19 22:08:11

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > It's not as good as a pure user-mode solution using tsc could be, but
> > we've seen the kinds of complexities that has with multi-CPU systems, and
> > they are so painful that I suspect the sysenter approach is a lot more
> > palatable even if it doesn't allow for the absolute best theoretical
> > numbers.
>
> don't many of the multi-CPU problems with tsc go away because you've got a
> per-cpu physical page for the vsyscall?
>
> i.e. per-cpu tsc epoch and scaling can be set on that page.

The problem is that the CPUs' TSCs can randomly drift by +/- 100
clocks or so... Not nice at all.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?