2002-12-09 08:24:21

by Mike Hayward

Subject: Intel P6 vs P7 system call performance

I have been benchmarking Pentium 4 boxes against my Pentium III laptop
with the exact same kernel and executables as well as custom compiled
kernels. The Pentium III has a much lower clock rate and I have
noticed that system call performance (and hence io performance) is up
to an order of magnitude higher on my Pentium III laptop. 1k block IO
reads/writes are anemic on the Pentium 4, for example, so I'm trying
to figure out why and thought someone might have an idea.

Notice below that the System Call overhead is much higher on the
Pentium 4 even though the cpu runs more than twice the speed and the
system has DDRAM, a 400 Mhz FSB, etc. I even get pretty remarkable
syscall/io performance on my Pentium III laptop vs. an otherwise idle
dual Xeon.

See how the performance is nearly opposite of what one would expect:

----------------------------------------------------------------------
basic sys call performance iterated for 10 secs:

while (1) {
close(dup(0));
getpid();
getuid();
umask(022);
iter++;
}
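
A self-contained harness in the same spirit might look like this (a
sketch only, assuming a 10-second SIGALRM window rather than the exact
bytebench timing logic):

#include <signal.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static volatile int done;
static void stop(int sig) { done = 1; }

int main(void)
{
    unsigned long iter = 0;
    const int secs = 10;

    signal(SIGALRM, stop);
    alarm(secs);
    while (!done) {
        close(dup(0));
        getpid();
        getuid();
        umask(022);
        iter++;
    }
    /* each iteration issues a handful of cheap syscalls */
    printf("%.1f iterations/sec\n", (double)iter / secs);
    return 0;
}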

M-Pentium III 850Mhz Sys Call Rate 433741.8
Pentium 4 2Ghz Sys Call Rate 233637.8
Xeon x 2 2.4Ghz Sys Call Rate 207684.2

----------------------------------------------------------------------
1k read sys calls iterated for 10 secs (all buffered reads, no disk):

M-Pentium III 850Mhz File Read 1492961.0 (~149 io/s)
Pentium 4 2Ghz File Read 1088629.0 (~108 io/s)
Xeon x 2 2.4Ghz File Read 686892.0 (~ 69 io/s)

Any ideas? Not sure I want to upgrade to the P7 architecture if this
is right, since for me system calls are probably more important than
raw cpu computational power.

- Mike

--- Mobile Pentium III 850 Mhz ---

BYTE UNIX Benchmarks (Version 3.11)
System -- Linux flux.loup.net 2.4.7-10 #1 Thu Sep 6 17:27:27 EDT 2001 i686 unknown
Start Benchmark Run: Thu Nov 8 07:55:04 PST 2001
1 interactive users.
Dhrystone 2 without register variables 1652556.1 lps (10 secs, 6 samples)
Dhrystone 2 using register variables 1513809.2 lps (10 secs, 6 samples)
Arithmetic Test (type = arithoh) 3770106.2 lps (10 secs, 6 samples)
Arithmetic Test (type = register) 230897.5 lps (10 secs, 6 samples)
Arithmetic Test (type = short) 230586.1 lps (10 secs, 6 samples)
Arithmetic Test (type = int) 230916.2 lps (10 secs, 6 samples)
Arithmetic Test (type = long) 232229.7 lps (10 secs, 6 samples)
Arithmetic Test (type = float) 222990.2 lps (10 secs, 6 samples)
Arithmetic Test (type = double) 224339.4 lps (10 secs, 6 samples)
System Call Overhead Test 433741.8 lps (10 secs, 6 samples)
Pipe Throughput Test 499465.5 lps (10 secs, 6 samples)
Pipe-based Context Switching Test 229029.2 lps (10 secs, 6 samples)
Process Creation Test 8696.6 lps (10 secs, 6 samples)
Execl Throughput Test 1089.8 lps (9 secs, 6 samples)
File Read (10 seconds) 1492961.0 KBps (10 secs, 6 samples)
File Write (10 seconds) 157663.0 KBps (10 secs, 6 samples)
File Copy (10 seconds) 32516.0 KBps (10 secs, 6 samples)
File Read (30 seconds) 1507645.0 KBps (30 secs, 6 samples)
File Write (30 seconds) 161130.0 KBps (30 secs, 6 samples)
File Copy (30 seconds) 20155.0 KBps (30 secs, 6 samples)
C Compiler Test 491.2 lpm (60 secs, 3 samples)
Shell scripts (1 concurrent) 1315.2 lpm (60 secs, 3 samples)
Shell scripts (2 concurrent) 694.4 lpm (60 secs, 3 samples)
Shell scripts (4 concurrent) 357.1 lpm (60 secs, 3 samples)
Shell scripts (8 concurrent) 180.4 lpm (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 46831.0 lpm (60 secs, 6 samples)
Recursion Test--Tower of Hanoi 20954.1 lps (10 secs, 6 samples)


INDEX VALUES
TEST BASELINE RESULT INDEX

Arithmetic Test (type = double) 2541.7 224339.4 88.3
Dhrystone 2 without register variables 22366.3 1652556.1 73.9
Execl Throughput Test 16.5 1089.8 66.0
File Copy (30 seconds) 179.0 20155.0 112.6
Pipe-based Context Switching Test 1318.5 229029.2 173.7
Shell scripts (8 concurrent) 4.0 180.4 45.1
=========
SUM of 6 items 559.6
AVERAGE 93.3

--- Desktop Pentium 4 2.0 Ghz w/ 266 Mhz DDR ---

BYTE UNIX Benchmarks (Version 3.11)
System -- Linux gw2 2.4.19 #1 Mon Dec 9 05:31:23 GMT-7 2002 i686 unknown
Start Benchmark Run: Mon Dec 9 05:45:47 GMT-7 2002
1 interactive users.
Dhrystone 2 without register variables 2910759.3 lps (10 secs, 6 samples)
Dhrystone 2 using register variables 2928495.6 lps (10 secs, 6 samples)
Arithmetic Test (type = arithoh) 9252565.4 lps (10 secs, 6 samples)
Arithmetic Test (type = register) 498894.3 lps (10 secs, 6 samples)
Arithmetic Test (type = short) 473452.0 lps (10 secs, 6 samples)
Arithmetic Test (type = int) 498956.5 lps (10 secs, 6 samples)
Arithmetic Test (type = long) 498932.0 lps (10 secs, 6 samples)
Arithmetic Test (type = float) 451138.8 lps (10 secs, 6 samples)
Arithmetic Test (type = double) 451106.8 lps (10 secs, 6 samples)
System Call Overhead Test 233637.8 lps (10 secs, 6 samples)
Pipe Throughput Test 437441.1 lps (10 secs, 6 samples)
Pipe-based Context Switching Test 167229.2 lps (10 secs, 6 samples)
Process Creation Test 9407.2 lps (10 secs, 6 samples)
Execl Throughput Test 2158.8 lps (10 secs, 6 samples)
File Read (10 seconds) 1088629.0 KBps (10 secs, 6 samples)
File Write (10 seconds) 472315.0 KBps (10 secs, 6 samples)
File Copy (10 seconds) 10569.0 KBps (10 secs, 6 samples)
File Read (120 seconds) 1089526.0 KBps (120 secs, 6 samples)
File Write (120 seconds) 467028.0 KBps (120 secs, 6 samples)
File Copy (120 seconds) 3541.0 KBps (120 secs, 6 samples)
C Compiler Test 973.9 lpm (60 secs, 3 samples)
Shell scripts (1 concurrent) 2590.8 lpm (60 secs, 3 samples)
Shell scripts (2 concurrent) 1359.6 lpm (60 secs, 3 samples)
Shell scripts (4 concurrent) 696.4 lpm (60 secs, 3 samples)
Shell scripts (8 concurrent) 352.1 lpm (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 99120.4 lpm (60 secs, 6 samples)
Recursion Test--Tower of Hanoi 44857.5 lps (10 secs, 6 samples)


INDEX VALUES
TEST BASELINE RESULT INDEX

Arithmetic Test (type = double) 2541.7 451106.8 177.5
Dhrystone 2 without register variables 22366.3 2910759.3 130.1
Execl Throughput Test 16.5 2158.8 130.8
File Copy (120 seconds) 179.0 3541.0 19.7
Pipe-based Context Switching Test 1318.5 167229.2 126.8
Shell scripts (8 concurrent) 4.0 352.1 88.0
=========
SUM of 6 items 673.0
AVERAGE 112.1


--- Pentium 4 Xeon 2.4 Ghz x 2 w/ 2.4.19 ---

BYTE UNIX Benchmarks (Version 3.11)
System -- Linux brent-xeon 2.4.19-kel #5 SMP Wed Sep 25 03:15:13 GMT 2002 i686 unknown
Start Benchmark Run: Thu Oct 10 03:48:07 MDT 2002
0 interactive users.
Dhrystone 2 without register variables 2200821.4 lps (10 secs, 6 samples)
Dhrystone 2 using register variables 2233296.6 lps (10 secs, 6 samples)
Arithmetic Test (type = arithoh) 7366670.5 lps (10 secs, 6 samples)
Arithmetic Test (type = register) 399261.4 lps (10 secs, 6 samples)
Arithmetic Test (type = short) 361354.7 lps (10 secs, 6 samples)
Arithmetic Test (type = int) 364200.0 lps (10 secs, 6 samples)
Arithmetic Test (type = long) 345292.9 lps (10 secs, 6 samples)
Arithmetic Test (type = float) 539907.7 lps (10 secs, 6 samples)
Arithmetic Test (type = double) 537355.5 lps (10 secs, 6 samples)
System Call Overhead Test 207684.2 lps (10 secs, 6 samples)
Pipe Throughput Test 283868.3 lps (10 secs, 6 samples)
Pipe-based Context Switching Test 98205.6 lps (10 secs, 6 samples)
Process Creation Test 5395.9 lps (10 secs, 6 samples)
Execl Throughput Test 1612.9 lps (9 secs, 6 samples)
File Read (10 seconds) 686892.0 KBps (10 secs, 6 samples)
File Write (10 seconds) 272217.0 KBps (10 secs, 6 samples)
File Copy (10 seconds) 56415.0 KBps (10 secs, 6 samples)
File Read (30 seconds) 681181.0 KBps (30 secs, 6 samples)
File Write (30 seconds) 272351.0 KBps (30 secs, 6 samples)
File Copy (30 seconds) 20611.0 KBps (30 secs, 6 samples)
C Compiler Test 873.5 lpm (60 secs, 3 samples)
Shell scripts (1 concurrent) 2970.1 lpm (60 secs, 3 samples)
Shell scripts (2 concurrent) 1294.2 lpm (60 secs, 3 samples)
Shell scripts (4 concurrent) 845.2 lpm (60 secs, 3 samples)
Shell scripts (8 concurrent) 409.2 lpm (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places no measured results
Recursion Test--Tower of Hanoi 33661.9 lps (10 secs, 6 samples)


INDEX VALUES
TEST BASELINE RESULT INDEX

Arithmetic Test (type = double) 2541.7 537355.5 211.4
Dhrystone 2 without register variables 22366.3 2200821.4 98.4
Execl Throughput Test 16.5 1612.9 97.8
File Copy (30 seconds) 179.0 20611.0 115.1
Pipe-based Context Switching Test 1318.5 98205.6 74.5
Shell scripts (8 concurrent) 4.0 409.2 102.3
=========
SUM of 6 items 699.5
AVERAGE 116.6


2002-12-09 15:33:35

by Erich Boleyn

Subject: Re: Intel P6 vs P7 system call performance


Mike Hayward <[email protected]> wrote:

> I have been benchmarking Pentium 4 boxes against my Pentium III laptop
> with the exact same kernel and executables as well as custom compiled
> kernels. The Pentium III has a much lower clock rate and I have
> noticed that system call performance (and hence io performance) is up
> to an order of magnitude higher on my Pentium III laptop. 1k block IO
> reads/writes are anemic on the Pentium 4, for example, so I'm trying
> to figure out why and thought someone might have an idea.
>
> Notice below that the System Call overhead is much higher on the
> Pentium 4 even though the cpu runs more than twice the speed and the
> system has DDRAM, a 400 Mhz FSB, etc. I even get pretty remarkable
> syscall/io performance on my Pentium III laptop vs. an otherwise idle
> dual Xeon.
>
> See how the performance is nearly opposite of what one would expect:
...
> M-Pentium III 850Mhz Sys Call Rate 433741.8
> Pentium 4 2Ghz Sys Call Rate 233637.8
> Xeon x 2 2.4Ghz Sys Call Rate 207684.2
...[other benchmark deleted]...
> Any ideas? Not sure I want to upgrade to the P7 architecture if this
> is right, since for me system calls are probably more important than
> raw cpu computational power.

You're assuming that ALL operations in a P4 are linearly faster than
a P-III. This is definitely not the case.

A P4 has a much longer pipeline (in many cases, considerably
longer than the diagrams imply) than the P-III, and in particular
it has a much longer latency in handling mode transitions.

The results you got don't surprise me whatsoever. In fact the raw
system call transition instructions are likely 5x slower on the
P4.

--
Erich Stefan Boleyn <[email protected]> http://www.uruk.org/
"Reality is truly stranger than fiction; Probably why fiction is so popular"

2002-12-09 19:39:21

by H. Peter Anvin

Subject: Re: Intel P6 vs P7 system call performance

Followup to: <[email protected]>
By author: Dave Jones <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Mon, Dec 09, 2002 at 05:48:45PM +0000, Linus Torvalds wrote:
>
> > P4's really suck at system calls. A 2.8GHz P4 does a simple system call
> > a lot _slower_ than a 500MHz PIII.
> >
> > The P4 has problems with some other things too, but the "int + iret"
> > instruction combination is absolutely the worst I've seen. A 1.2GHz
> > Athlon will be 5-10 times faster than the fastest P4 on system call
> > overhead.
>
> Time to look into an alternative like SYSCALL perhaps ?
>

SYSCALL is AMD. SYSENTER is Intel, and is likely to be significantly
faster. Unfortunately SYSENTER is also extremely braindamaged, in
that it destroys *both* the EIP and the ESP beyond recovery, and
because it's allowed in V86 and 16-bit modes (where it will cause
permanent data loss) which means that it needs to be able to be turned
off for things like DOSEMU and WINE to work correctly.
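
For background: SYSENTER takes its entire target from three MSRs and
saves no return state, which is why the old EIP/ESP are simply gone.
A minimal kernel-side sketch of the setup (illustrative only -- wrmsr
is privileged, and this is not the actual Linux code):

#define MSR_IA32_SYSENTER_CS   0x174
#define MSR_IA32_SYSENTER_ESP  0x175
#define MSR_IA32_SYSENTER_EIP  0x176

static inline void wrmsr(unsigned int msr, unsigned int lo, unsigned int hi)
{
    __asm__ volatile("wrmsr" : : "c" (msr), "a" (lo), "d" (hi));
}

/* On SYSENTER the CPU loads CS, ESP and EIP from these MSRs and saves
 * *nothing*, so the old EIP/ESP are lost unless user space stashed
 * them away before executing the instruction. */
void sysenter_setup(unsigned int kernel_cs, unsigned long kstack_top,
                    void (*entry)(void))
{
    wrmsr(MSR_IA32_SYSENTER_CS,  kernel_cs, 0);
    wrmsr(MSR_IA32_SYSENTER_ESP, kstack_top, 0);
    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry, 0);
}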

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2002-12-11 12:40:46

by Terje Eggestad

Subject: Re: Intel P6 vs P7 system call performance

It gets even worse with Hammer. When you run Hammer in compatibility mode
(a 32-bit app on a 64-bit OS), sysenter is an illegal instruction.
Since Intel doesn't implement syscall, there is no portable sys*
instruction for 32-bit apps. You could argue that libc hides it for you
and you just need libc to test the host at startup (do I get a SIGILL if
I try to do getpid() with sysenter? With syscall? If so, we use int 0x80
for syscalls). But not all programs are linked dynamically.
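
A rough sketch of that startup probe idea (illustrative only -- the
handler setup, the __NR_getpid value and the inline asm are my
assumptions for i386 Linux, not anything libc actually ships):

#include <setjmp.h>
#include <signal.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void bad_insn(int sig) { siglongjmp(probe_env, 1); }

/* Does the AMD "syscall" instruction reach the kernel here, or does it
 * fault?  If it faults (SIGILL/SIGSEGV) we would fall back to int 0x80. */
static int have_syscall_insn(void)
{
    struct sigaction sa, old_ill, old_segv;
    long ret = -1;
    int ok = 0;

    sa.sa_handler = bad_insn;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NODEFER;
    sigaction(SIGILL, &sa, &old_ill);
    sigaction(SIGSEGV, &sa, &old_segv);

    if (sigsetjmp(probe_env, 1) == 0) {
        /* __NR_getpid == 20 on i386 */
        __asm__ volatile("syscall"
                         : "=a" (ret)
                         : "a" (20L)
                         : "ecx", "edx", "memory");
        ok = (ret == getpid());
    }

    sigaction(SIGILL, &old_ill, 0);
    sigaction(SIGSEGV, &old_segv, 0);
    return ok;
}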

Too bad really, I tried the sysenter patch once, and the gain (on PIII
and athlon) was significant.

Fortunately the 64bit libc for hammer uses syscall.


PS: rdtsc on P4 is also painfully slow!!!

TJ

On man, 2002-12-09 at 20:46, H. Peter Anvin wrote:
> Followup to: <[email protected]>
> By author: Dave Jones <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > On Mon, Dec 09, 2002 at 05:48:45PM +0000, Linus Torvalds wrote:
> >
> > > P4's really suck at system calls. A 2.8GHz P4 does a simple system call
> > > a lot _slower_ than a 500MHz PIII.
> > >
> > > The P4 has problems with some other things too, but the "int + iret"
> > > instruction combination is absolutely the worst I've seen. A 1.2GHz
> > > Athlon will be 5-10 times faster than the fastest P4 on system call
> > > overhead.
> >
> > Time to look into an alternative like SYSCALL perhaps ?
> >
>
> SYSCALL is AMD. SYSENTER is Intel, and is likely to be significantly
> faster. Unfortunately SYSENTER is also extremely braindamaged, in
> that it destroys *both* the EIP and the ESP beyond recovery, and
> because it's allowed in V86 and 16-bit modes (where it will cause
> permanent data loss) which means that it needs to be able to be turned
> off for things like DOSEMU and WINE to work correctly.
>
> -hpa



--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-12-11 18:43:49

by H. Peter Anvin

Subject: Re: Intel P6 vs P7 system call performance

Terje Eggestad wrote:
> It get even worse with Hammer. When you run hammer in compatibility mode
> (32 bit app on a 64 bit OS) the sysenter is an illegal instruction.
>
> Since Intel don't implement syscall, there is no portable sys*
> instruction for 32 bit apps. You could argue that libc hides it for you
> and you just need libc to test the host at startup (do I get a sigill if
> I try to do getpid() with sysenter? syscall? if so we uses int80 for
> syscalls). But not all programs are linked dyn.


Linus talked about this once, and it was agreed that the only sane way
to do this properly was via vsyscalls... have a page mapped somewhere in
high (kernel-area) memory, say at 0xfffff000, but readable by normal
processes. A system call can be invoked via call 0xfffff000, and the
*kernel* enters whatever code is appropriate to enter itself.
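
In user space a call would then reduce to something like this (a sketch
of the idea only; the fixed address and register convention are
assumptions):

/* Assumed fixed, user-readable address where the kernel maps its
 * preferred entry stub (per the description above). */
#define VSYSCALL_ENTRY 0xfffff000UL

static inline long vsyscall0(long nr)
{
    long ret;
    /* syscall number in %eax as usual; the stub at the fixed address
     * does int $0x80, sysenter or syscall -- whatever this CPU/kernel
     * combination prefers. */
    __asm__ volatile("call *%1"
                     : "=a" (ret)
                     : "r" (VSYSCALL_ENTRY), "a" (nr)
                     : "ecx", "edx", "memory");
    return ret;
}

/* e.g. getpid: vsyscall0(20)   (__NR_getpid == 20 on i386) */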

> Too bad really, I tried the sysenter patch once, and the gain (on PIII
> and athlon) was significant.
>
> Fortunately the 64bit libc for hammer uses syscall.
>

Yes.

>
> PS: rdtsc on P4 is also painfully slow!!!
>

Now that's just braindead...

-hpa


2002-12-12 09:35:13

by Terje Eggestad

Subject: Re: Intel P6 vs P7 system call performance

On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> Terje Eggestad wrote:
>
> >
> > PS: rdtsc on P4 is also painfully slow!!!
> >
>
> Now that's just braindead...
>

It takes about 11 cycles on the Athlon, 34 on the PII, and a whopping 84 on the P4.

For a simple op like that, even 11 is a lot... Really makes you wonder.


> -hpa

TJ

--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-12-12 09:59:09

by Arjan van de Ven

Subject: Re: Intel P6 vs P7 system call performance

On Thu, 2002-12-12 at 10:42, Terje Eggestad wrote:

> It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
>
> For a simple op like that, even 11 is a lot... Really makes you wonder.

wasn't rdtsc also supposed to be a pipeline sync of the cpu?
(or am I confusing it with cpuid)

2002-12-12 10:24:00

by Terje Eggestad

Subject: Re: Intel P6 vs P7 system call performance

On tor, 2002-12-12 at 11:06, Arjan van de Ven wrote:
> On Thu, 2002-12-12 at 10:42, Terje Eggestad wrote:
>
> > It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
> >
> > For a simple op like that, even 11 is a lot... Really makes you wonder.
>
> wasn't rdtsc also supposed to be a pipeline sync of the cpu?
> (or am I confusing it with cpuid)

This is what the P4 manual says:

"The RDTSC instruction is not a serializing instruction. Thus, it does
not necessarily wait until all previous instructions have been executed
before reading the counter. Similarly, subsequent instructions may begin
execution before the read operation is performed."

Thus it *shouldn't* sync the pipeline. cpuid is a serializing inst, yes.
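
Which is why the usual timing idiom brackets the measurement with CPUID,
e.g. (a sketch; the cost of CPUID itself has to be calibrated out):

static inline unsigned long long serialized_rdtsc(void)
{
    unsigned int lo, hi;

    /* CPUID serializes (drains the pipeline); RDTSC then reads the TSC
     * with no earlier instructions still in flight. */
    __asm__ volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a" (lo), "=d" (hi)
                     : "a" (0)
                     : "ebx", "ecx");
    return ((unsigned long long)hi << 32) | lo;
}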

TJ

--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-12-12 20:20:27

by Mark Mielke

Subject: Re: Intel P6 vs P7 system call performance

On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
> On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> > Terje Eggestad wrote:
> > > PS: rdtsc on P4 is also painfully slow!!!
> > Now that's just braindead...
> It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
> For a simple op like that, even 11 is a lot... Really makes you wonder.

Some of this discussion is a little bit unfair. My understanding of what
Intel has done with the P4 is create an architecture that allows for
higher clock rates. Sure, the P4 might take 84 cycles vs. the PII's 34,
but how many 2.4 GHz PII machines have you ever seen on the market?

Certainly, some of their decisions seem to be a little odd on the surface.

That doesn't mean the situation is black and white.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-12-12 20:49:34

by J.A. Magallon

Subject: Re: Intel P6 vs P7 system call performance


On 2002.12.12 Mark Mielke wrote:
>On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
>> On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
>> > Terje Eggestad wrote:
>> > > PS: rdtsc on P4 is also painfully slow!!!
>> > Now that's just braindead...
>> It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
>> For a simple op like that, even 11 is a lot... Really makes you wonder.
>
>Some of this discussion is a little bit unfair. My understanding of what
>Intel has done with the P4, is create an architecture that allows for
>higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
>PII 2.4 Ghz machines have you ever seen on the market?
>
>Certainly, some of their decisions seem to be a little odd on the surface.
>
>That doesn't mean the situation is black and white.
>

No. The situation is just black. Each day Intel processors are a bigger
pile of crap and less intelligent, but MHz compensates for the average
office user. Think of what a P4 could do if the same effort put into
MHz were put into getting a cheap 4 MB or 8 MB cache like MIPSes have.
Or, closer to home, 1 MB like G4s.
If syscalls take 300% of the time but the processor is also 300% faster,
'nobody notices'.

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam1 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))

2002-12-12 20:49:46

by Vojtech Pavlik

Subject: Re: Intel P6 vs P7 system call performance

On Thu, Dec 12, 2002 at 03:36:46PM -0500, Mark Mielke wrote:
> On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
> > On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> > > Terje Eggestad wrote:
> > > > PS: rdtsc on P4 is also painfully slow!!!
> > > Now that's just braindead...
> > It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
> > For a simple op like that, even 11 is a lot... Really makes you wonder.
>
> Some of this discussion is a little bit unfair. My understanding of what
> Intel has done with the P4, is create an architecture that allows for
> higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
> PII 2.4 Ghz machines have you ever seen on the market?
>
> Certainly, some of their decisions seem to be a little odd on the surface.
>
> That doesn't mean the situation is black and white.

Assume a 1GHz P-III. 34 clocks @ 1GHz = 34 ns. 84 clocks @ 2.4 GHz = 35 ns.
That's actually slower. Fortunately the P4 isn't this bad on all
instructions.

--
Vojtech Pavlik
SuSE Labs

2002-12-12 22:17:32

by Zac Hansen

Subject: Re: Intel P6 vs P7 system call performance

>
> No. The situation is just black. Each day Intel processors are a bigger
> pile of crap and less intelligent

My hyper-threaded xeons beg to argue with you -- all 4 (2) of them.

> , but MHz compensate for the average
> office user. Think of what could a P4 do if the same effort put on
> Hz was put on getting cheap a cache of 4Mb or 8Mb like MIPSes have. Or
> closer, 1Mb like G4s.

Err, syscalls are still going to take the same amount of time no matter
how much cache the chip has on it. And, IMHO, adding more cache to make a
processor faster is just as "dumb" as bumping the MHz.

> If syscalls take 300% time but processor is also 300% faster 'nobody
> notices'.
>

The point many are forgetting is that processors do a lot more than system
calls. And P4s are quite quick at doing that... especially those new
3+GHz ones (with hyperthreading).

By the way, did everyone see the test on Tom's Hardware Guide comparing
the P4 3.06 with hyperthreading on and a P4 3.6 without
hyperthreading?

http://www17.tomshardware.com/cpu/20021114/index.html

For those of you who just want the info -- here's the spoiler -- when
running multiple apps, the 3.06 can torch the 3.6. Check out the second
benchmark on this page

http://www17.tomshardware.com/cpu/20021114/p4_306ht-16.html

25% faster. Most of the other benchmarks don't show off hyperthreading,
as they're running a single process, but from personal experience, it's
nice. I don't know why they give you the option to turn it off in the
BIOS. I have 2 Xeons, and even then I leave HT on on both. I wouldn't
even consider turning it off if I only had 1 processor.

--Zac
[email protected]

2002-12-13 00:38:56

by H. Peter Anvin

Subject: Re: Intel P6 vs P7 system call performance

Arjan van de Ven wrote:
> On Thu, 2002-12-12 at 10:42, Terje Eggestad wrote:
>
>
>>It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
>>
>>For a simple op like that, even 11 is a lot... Really makes you wonder.
>
>
> wasn't rdtsc also supposed to be a pipeline sync of the cpu?
> (or am I confusing it with cpuid)

That's CPUID.

-hpa

2002-12-13 09:13:30

by Terje Eggestad

Subject: Re: Intel P6 vs P7 system call performance

On tor, 2002-12-12 at 21:56, J.A. Magallon wrote:
> On 2002.12.12 Mark Mielke wrote:
> >On Thu, Dec 12, 2002 at 10:42:56AM +0100, Terje Eggestad wrote:
> >> On ons, 2002-12-11 at 19:50, H. Peter Anvin wrote:
> >> > Terje Eggestad wrote:
> >> > > PS: rdtsc on P4 is also painfully slow!!!
> >> > Now that's just braindead...
> >> It takes about 11 cycles on athlon, 34 on PII, and a whooping 84 on P4.
> >> For a simple op like that, even 11 is a lot... Really makes you wonder.
> >
> >Some of this discussion is a little bit unfair. My understanding of what
> >Intel has done with the P4, is create an architecture that allows for
> >higher clock rates. Sure the P4 might take 84, vs PII 34, but how many
> >PII 2.4 Ghz machines have you ever seen on the market?
> >
> >Certainly, some of their decisions seem to be a little odd on the surface.
> >
> >That doesn't mean the situation is black and white.
> >
>
> No. The situation is just black. Each day Intel processors are a bigger
> pile of crap and less intelligent, but MHz compensate for the average
> office user. Think of what could a P4 do if the same effort put on
> Hz was put on getting cheap a cache of 4Mb or 8Mb like MIPSes have. Or
> closer, 1Mb like G4s.
> If syscalls take 300% time but processor is also 300% faster 'nobody
> notices'.

Well, it would make sense if Intel had traded rdtsc away for more commonly
used things, but even that doesn't seem to be the case. I'm measuring the
overhead of doing a syscall on Linux (int 0x80) to be ~280 cycles on the
PIII and Athlon, while it's ~1600 cycles on the P4.
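
Roughly the kind of measurement meant here, as a sketch (assuming a raw
int 0x80 getpid and plain rdtsc; per-call cost is the delta divided by
the iteration count):

#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    const int iters = 100000;
    unsigned long long t0, t1;
    long ret;
    int i;

    t0 = rdtsc();
    for (i = 0; i < iters; i++) {
        /* raw int 0x80 getpid: __NR_getpid == 20 on i386 */
        __asm__ volatile("int $0x80"
                         : "=a" (ret)
                         : "a" (20L)
                         : "memory");
    }
    t1 = rdtsc();

    printf("~%llu cycles per int 0x80 getpid\n", (t1 - t0) / iters);
    return 0;
}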

TJ



--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-12-13 15:38:40

by William Lee Irwin III

Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 09, 2002 at 01:30:28AM -0700, Mike Hayward wrote:
> Any ideas? Not sure I want to upgrade to the P7 architecture if this
> is right, since for me system calls are probably more important than
> raw cpu computational power.

This is the same for me. I'm extremely uninterested in the P-IV for my
own use because of this.


Bill

2002-12-13 15:51:30

by Ville Herva

Subject: Re: Intel P6 vs P7 system call performance

On Fri, Dec 13, 2002 at 10:21:11AM +0100, you [Terje Eggestad] wrote:
>
> Well, it does make sense if Intel optimized away rdtsc for more commonly
> used things, but even that don't seem to be the case. I'm measuring the
> overhead of doing a syscall on Linux (int 80) to be ~280 cycles on PIII,
> and Athlon, while it's 1600 cycles on P4.

Just out of interest, how much would sysenter (or syscall on amd) cost,
then? (Supposing it can be feasibly implemented.)

I think I heard WinXP (W2k too?) is using sysenter?


-- v --

[email protected]

2002-12-13 16:44:09

by Mike Hayward

Subject: Re: Intel P6 vs P7 system call performance

Hi Bill,

> On Mon, Dec 09, 2002 at 01:30:28AM -0700, Mike Hayward wrote:
> > Any ideas? Not sure I want to upgrade to the P7 architecture if this
> > is right, since for me system calls are probably more important than
> > raw cpu computational power.
>
> This is the same for me. I'm extremely uninterested in the P-IV for my
> own use because of this.

I've also noticed that algorithms like the recursive one I run, which
simulates solving the Tower of Hanoi problem, are most likely very hard
to do branch prediction on. Both the code and data no doubt fit
entirely in the L2 cache. The AMD processor below is much lower
cost and has a significantly lower clock rate (and sits in a machine with
only a 100 MHz memory bus) than the Xeon, yet dramatically outperforms it
with the same executable, compiled with gcc -march=i686 -O3. Maybe with a
better Pentium 4 optimizing compiler the P4 and Xeon could improve a
few percent, but I doubt they'll ever see the AMD numbers.

Recursion Test--Tower of Hanoi

Uni AMD XP 1800 2.4.18 kernel 46751.6 lps (10 secs, 6 samples)
Dual Pentium 4 Xeon 2.4Ghz 2.4.19 kernel 33661.9 lps (10 secs, 6 samples)
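
For reference, the core of that benchmark is just a tiny recursive
routine along these lines (a sketch from memory, not the exact
bytebench hanoi.c):

/* Count simulated disk moves for an n-disk Tower of Hanoi; the
 * benchmark repeats this in a timed loop and reports loops/sec. */
static long moves;

static void mov(int n, int from, int to)
{
    int other = 6 - (from + to);    /* pegs are numbered 1..3 */

    if (n == 1) {
        moves++;
        return;
    }
    mov(n - 1, from, other);
    mov(1, from, to);
    mov(n - 1, other, to);
}

/* e.g. mov(10, 1, 3) performs 2^10 - 1 = 1023 moves */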

- Mike

2002-12-13 17:43:23

by Margit Schubert-While

Subject: Re: Intel P6 vs P7 system call performance

Well, in the 2.4.x kernels, the P4 gets compiled as an i686 with NO special
treatment :-) (not even prefetch, because of an ifdef bug).
The P3 at least gets one level of prefetch, and the AMDs get special compile
options (arch=k6,athlon), full prefetch and SSE.

From Mike Hayward
>Dual Pentium 4 Xeon 2.4Ghz 2.4.19 kernel 33661.9 lps (10 secs, 6 samples)

Hmm, P4 2.4Ghz , also gcc -O3 -march=i686

margit:/disk03/bytebench-3.1/src # ./hanoi 10
576264 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
571001 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
571133 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
570517 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
571019 loops
margit:/disk03/bytebench-3.1/src # ./hanoi 10
582688 loops

Margit

2002-12-13 19:24:21

by Dieter Nützel

Subject: Re: Intel P6 vs P7 system call performance

> Well, in the 2.4.x kernels, the P4 gets compiled as a I686 with NO special
> treatment :-) (Not even prefetch, because of an ifdef bug)
> The P3 at least gets one level of prefetch and the AMD's get special compile
> options(arch=k6,athlon), full prefetch and SSE.
>
> From Mike Hayward
> >Dual Pentium 4 Xeon 2.4Ghz 2.4.19 kernel 33661.9 lps (10 secs, 6 samples)
>
> Hmm, P4 2.4Ghz , also gcc -O3 -march=i686
>
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 576264 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 571001 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 571133 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 570517 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 571019 loops
> margit:/disk03/bytebench-3.1/src # ./hanoi 10
> 582688 loops

Apples and oranges? ;-)

dual AMD Athlon MP 1900+, 1.6 GHz
(but single threaded app)
2.4.20-aa1
gcc-2.95.3

unixbench-4.1.0/src> gcc -O -mcpu=k6 -march=i686 -fomit-frame-pointer
-mpreferred-stack-boundary=2 -malign-functions=4 -o hanoi hanoi.c
unixbench-4.1.0/src> sync
unixbench-4.1.0/src> ./hanoi 10
565338 loops
unixbench-4.1.0/src> ./hanoi 10
565379 loops
unixbench-4.1.0/src> ./hanoi 10
565448 loops
unixbench-4.1.0/src> ./hanoi 10
565218 loops
unixbench-4.1.0/src> ./hanoi 10
565148 loops
unixbench-4.1.0/src> ./hanoi 10
565136 loops

You should run "./Run hanoi"...

Recursion Test--Tower of Hanoi 58404.5 lps (19.3 secs, 3 samples)

Regards,
Dieter
--
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter.Nuetzel at hamburg.de (replace at with @)

2002-12-13 21:44:43

by Margit Schubert-While

Subject: Re: Intel P6 vs P7 system call performance

Hmm Apples & Oranges

diff hanoi.c hanoi2.c
17a18
> void mov();
51c52
< mov(disk,1,3);
---
> (void)mov(disk,1,3);
58c59
< mov(n,f,t)
---
> void mov(n,f,t)
67,69c68,70
< mov(n-1,f,o);
< mov(1,f,t);
< mov(n-1,o,t);
---
> (void)mov(n-1,f,o);
> (void)mov(1,f,t);
> (void)mov(n-1,o,t);


cc -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer hanoi.c -o hanoi
cc -O3 -march=i686 -mcpu=i686 -fomit-frame-pointer hanoi2.c -o hanoi2
./hanoi 10
536837 loops
./hanoi 10
538709 loops
./hanoi2 10
850127 loops
./hanoi2 10
852651 loops

Huu ?

Margit

2002-12-13 21:50:24

by Terje Eggestad

Subject: Re: Intel P6 vs P7 system call performance

I haven't tried the vsyscall patch, but there was a sysenter patch
floating around that I tried. It reduced the syscall overhead by 1/3
to 1/4, but I never tried it on a P4.

FYI: Just note that I say overhead, which I take to be the time it
takes to do something like getpid(), write(-1,...), select(-1, ...)
(etc. -- calls that are immediately returned with -EINVAL by the kernel).
Since the kernel does execute quite a few instructions besides the
int/iret or sysenter/sysexit, it's an assumption that the int 0x80 is
the culprit.

It would be nice if someone bothered to try this on a windoze box.
(Un)fortunately I live in a windoze-free environment. :-)

TJ


On Fri, 2002-12-13 at 16:58, Ville Herva wrote:
> On Fri, Dec 13, 2002 at 10:21:11AM +0100, you [Terje Eggestad] wrote:
> >
> > Well, it does make sense if Intel optimized away rdtsc for more commonly
> > used things, but even that don't seem to be the case. I'm measuring the
> > overhead of doing a syscall on Linux (int 80) to be ~280 cycles on PIII,
> > and Athlon, while it's 1600 cycles on P4.
>
> Just out of interest, how much would sysenter (or syscall on amd) cost,
> then? (Supposing it can be feasibly implemented.)
>
> I think I heard WinXP (W2k too?) is using sysenter?
>
>
> -- v --
>
> [email protected]
--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2002-12-13 22:46:00

by H. Peter Anvin

Subject: Re: Intel P6 vs P7 system call performance

Terje Eggestad wrote:
> I haven't tried the vsyscall patch, but there was a sysenter patch
> floating around that I tried. It reduced the syscall overhead with 1/3
> to 1/4, but I never tried it on P4.
>
> FYI: Just note that I say overhead, which I assume to be the time it
> take to do someting like getpid(), write(-1,...), select(-1, ...) (etc
> that is immediatlely returned with -EINVAL by the kernel).
> Since the kernel do execute a quite afew instructions beside the
> int/iret sysenter/sysexit, it's an assumption that the int 80 is the
> culprit.
>

IRET in particular is a very slow instruction.

As far as I know, though, the SYSENTER patch didn't deal with several of
the corner cases introduced by the generally weird SYSENTER instruction
(such as the fact that V86 tasks can execute it despite the fact there
is in general no way to resume execution of the V86 task afterwards.)

In practice this means that vsyscalls is pretty much the only sensible
way to do this. Also note that INT 80h will need to be supported
indefinitely.

Personally, I wonder if it's worth the trouble, when x86-64 takes care
of the issue anyway :)

-hpa


2002-12-14 00:48:50

by GrandMasterLee

Subject: Re: Intel P6 vs P7 system call performance

On Fri, 2002-12-13 at 10:49, Mike Hayward wrote:
> Hi Bill,
>
> > On Mon, Dec 09, 2002 at 01:30:28AM -0700, Mike Hayward wrote:
> > > Any ideas? Not sure I want to upgrade to the P7 architecture if this
> > > is right, since for me system calls are probably more important than
> > > raw cpu computational power.
> >
> > This is the same for me. I'm extremely uninterested in the P-IV for my
> > own use because of this.
>
> I've also noticed that algorithms like the recursive one I run which
> simulates solving the Tower of Hanoi problem are most likely very hard
> to do branch prediction on. Both the code and data no doubt fit
> entirely in the L2 cache. The AMD processor below is a much lower
> cost and significantly lower clock rate (and on a machine with only a
> 100Mhz Memory bus) than the Xeon, yet dramatically outperforms it with
> the same executable, compiled with gcc -march=i686 -O3. Maybe with a
> better Pentium 4 optimizing compiler the P4 and Xeon could improve a
> few percent, but I doubt it'll ever see the AMD numbers.

What GCC were you using? I'd use 3.2 or 3.2.1 myself, with
-march=pentium4 and -mcpu=pentium4, to see if there *is* any difference
there. On my quad P4 Xeon 1.6 GHz with 1 MB L3 cache, I can compile a
kernel in about 35 seconds. Mind you, that's my own config, not
*everything*. On a dual Athlon MP at 1.8 GHz, I get about 5 mins or so.
Both are running with make -jX where X is the saturation value.


> Recursion Test--Tower of Hanoi
>
> Uni AMD XP 1800 2.4.18 kernel 46751.6 lps (10 secs, 6 samples)
> Dual Pentium 4 Xeon 2.4Ghz 2.4.19 kernel 33661.9 lps (10 secs, 6 samples)
>
> - Mike
--
GrandMasterLee <[email protected]>

2002-12-14 04:33:24

by Mike Dresser

Subject: Re: Intel P6 vs P7 system call performance

On 13 Dec 2002, GrandMasterLee wrote:

> there. On my quad P4 Xeon 1.6Ghz with 1M L3 cache, I can compile a
> kernel in about 35 seconds. Mind you that's my own config, not
> *everything*. On a dual athlon MP at 1.8 Ghz, I get about 5 mins or so.
> Both are running with make -jx where X is the saturation value.

Something seems odd about the Athlon MP time. I've got a Celeron 533
with slow disks that does a pretty standard make dep ; make of 2.4.20 in
7m05, which is not that much different considering it's a third the speed,
and one CPU instead of two.

The single P4/2.53 in another machine can haul down in 3m17s

Guess our kernel .config's or version must vary greatly.

Mike

2002-12-14 04:46:02

by Mike Dresser

Subject: Re: Intel P6 vs P7 system call performance

On Fri, 13 Dec 2002, Mike Dresser wrote:

> The single P4/2.53 in another machine can haul down in 3m17s
>
Amend that to 2m19s, forgot to kill a background backup that was moving
files around at about 20 meg a second.

Mike

2002-12-14 09:53:55

by Dave Jones

Subject: Re: Intel P6 vs P7 system call performance

On Fri, Dec 13, 2002 at 11:53:51PM -0500, Mike Dresser wrote:
> On Fri, 13 Dec 2002, Mike Dresser wrote:
>
> > The single P4/2.53 in another machine can haul down in 3m17s
> >
> Amend that to 2m19s, forgot to kill a background backup that was moving
> files around at about 20 meg a second.

Note that there are more factors at play than raw CPU speed in a
kernel compile. Your time here is slightly faster than my 2.8 GHz P4-HT,
for example. My guess is you have faster disk(s) than I do, as most of
the time mine seems to be waiting for something to do.

*Note also that this is compiling stock 2.4.20 with the default
configuration. The minute you change any options, we're comparing apples
to oranges.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-14 17:40:59

by Mike Dresser

Subject: Re: Intel P6 vs P7 system call performance

On Sat, 14 Dec 2002, Dave Jones wrote:

> Note that there are more factors at play than raw cpu speed in a
> kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
> example. My guess is you have faster disk(s) than I do, as most of
> the time mine seems to be waiting for something to do.

Quantum Fireball AS's in that machine. My main comment was that his
Athlon MP at 1.8 was half or less the speed of a single P4. Even with
compiler changes, I wouldn't think it would make THAT much of a
difference?

Mike

2002-12-14 18:28:54

by GrandMasterLee

Subject: Re: Intel P6 vs P7 system call performance

On Sat, 2002-12-14 at 04:01, Dave Jones wrote:
> On Fri, Dec 13, 2002 at 11:53:51PM -0500, Mike Dresser wrote:
> > On Fri, 13 Dec 2002, Mike Dresser wrote:
> >
> > > The single P4/2.53 in another machine can haul down in 3m17s
> > >
> > Amend that to 2m19s, forgot to kill a background backup that was moving
> > files around at about 20 meg a second.



> Note that there are more factors at play than raw cpu speed in a
> kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
> example. My guess is you have faster disk(s) than I do, as most of
> the time mine seems to be waiting for something to do.

An easy way to level the playing field would be to use /dev/shm to build
your kernel in. That way it's all in memory. If you've got a machine
with 512 MB, then it's easily accomplished.

> *note also that this is compiling stock 2.4.20 with default configuration.
> The minute you change any options, we're comparings apples to oranges.
>
> Dave

2002-12-15 01:55:56

by J.A. Magallon

Subject: Re: Intel P6 vs P7 system call performance


On 2002.12.14 GrandMasterLee wrote:
>On Sat, 2002-12-14 at 04:01, Dave Jones wrote:
>> On Fri, Dec 13, 2002 at 11:53:51PM -0500, Mike Dresser wrote:
>> > On Fri, 13 Dec 2002, Mike Dresser wrote:
>> >
>> > > The single P4/2.53 in another machine can haul down in 3m17s
>> > >
>> > Amend that to 2m19s, forgot to kill a background backup that was moving
>> > files around at about 20 meg a second.
>
>
>
>> Note that there are more factors at play than raw cpu speed in a
>> kernel compile. Your time here is slightly faster than my 2.8Ghz P4-HT for
>> example. My guess is you have faster disk(s) than I do, as most of
>> the time mine seems to be waiting for something to do.
>
>An easy way to level the playing field would be to use /dev/shm to build
>your kernel in. That way it's all in memory. If you've got a maching
>with 512M, then it's easily accomplished.
>

tmpfs does not guarantee you that it is always in RAM. It can also be
paged out. An easier way is to fill your page cache with the kernel tree,
like

werewolf:/usr/src/linux# grep -v -r "" *

and then build, so no disk reads will be thrown.

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.20-jam1 (gcc 3.2 (Mandrake Linux 9.1 3.2-4mdk))

2002-12-15 03:58:20

by Albert D. Cahalan

Subject: Re: Intel P6 vs P7 system call performance


H. Peter Anvin writes:

> As far as I know, though, the SYSENTER patch didn't deal with several of
> the corner cases introduced by the generally weird SYSENTER instruction
> (such as the fact that V86 tasks can execute it despite the fact there
> is in general no way to resume execution of the V86 task afterwards.)
>
> In practice this means that vsyscalls is pretty much the only sensible
> way to do this. Also note that INT 80h will need to be supported
> indefinitely.
>
> Personally, I wonder if it's worth the trouble, when x86-64 takes care
> of the issue anyway :)

There is another way:

Have apps enter kernel mode via Intel's purposely undefined
instruction, plus a few bytes of padding and identification.
Require that this not cross a page boundary. When it faults,
write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
the page marked clean so it doesn't need to hit swap; if it
gets paged in again it gets patched again.
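
Purely as illustration, a call site for that scheme might start out
something like this (hypothetical -- the idea was never implemented,
and the padding/alignment choices here are mine):

/* Hypothetical libc-side stub: born as UD2 (an architecturally
 * undefined opcode) plus NOP padding, 16-byte aligned so the 8-byte
 * patch site can never cross a page boundary.  First execution faults;
 * the kernel's fault handler would then overwrite it in place with
 * int $0x80, the sysenter glue, or syscall -- whichever fits this CPU. */
static long raw_syscall(long nr)
{
    long ret;
    __asm__ volatile(".p2align 4\n\t"
                     "ud2\n\t"
                     "nop; nop; nop; nop; nop; nop"
                     : "=a" (ret)
                     : "a" (nr)
                     : "ecx", "edx", "memory");
    return ret;
}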


2002-12-15 08:35:36

by scott thomason

Subject: Re: Intel P6 vs P7 system call performance

On Saturday 14 December 2002 11:48 am, Mike Dresser wrote:
> On Sat, 14 Dec 2002, Dave Jones wrote:
> > Note that there are more factors at play than raw cpu speed in a
> > kernel compile. Your time here is slightly faster than my 2.8Ghz
> > P4-HT for example. My guess is you have faster disk(s) than I
> > do, as most of the time mine seems to be waiting for something to
> > do.
>
> Quantum Fireball AS's in that machine. My main comment was that
> his Althon MP at 1.8 was half or less the speed of a single P4.
> Even with compiler changes, I wouldn't think it would make THAT
> much of a difference?

I've been doing a lot of benchmarking with "contest" lately, and one
thing I can state emphatically is that the kernel that you are
running while performing a compile can be a large factor, especially
if you are maxing out the machine with a large "make -jN". Some
kernel versions vary enormously in their ability to handle I/O load
(an area I've been paying close attention to). Sounds like you have
some decent SMP hardware, and probably a good chunk of memory to go
with it, so you might want to experiment with these kernels, which
have given good I/O performance in my tests:

linux-2.4.19-rmap14c
linux-2.4.19-rmap15a
linux-2.4.18-rml-O1 (slow at creating tarballs, fast everywhere else)

And if you don't mind bleeding edge, just go with a more recent
2.5 kernel that you can make work. You simply can't get comparable
performance out of 2.4.

I've attached some contest numbers for the tests I've run to date below.
Please note that while I use contest as the benchmarking tool, I use
qmail compiles as the actual load, not kernel compiles (I don't have
the patience--qmail compiles take about 35-40% of the time of a kernel
compile. Now if we can get Con to work on speeding up "Killing
the load process..." <g>).
---scott

sorry for the html table to text pasting conversion :(

(columns: noload, process_load, ctar_load, xtar_load, read_load, list_load, mem_load)

linux-2.4.18                           16.73   22.61  244.52   78.84  108.52   18.58   53.12
linux-2.4.18-ac3                       19.01   25.64   99.52   94.23  314.29   23.34  119.95
linux-2.4.18-rc1-akpm-low-latency      16.69   21.92  335.62   79.10  122.34   18.39  104.80
linux-2.4.18-rc4-aa1                   16.43   93.85  179.12  100.29   46.64   17.15   96.91
linux-2.4.18-rmap12h                   18.84   24.72  143.12   95.11  298.85   23.17  121.22
linux-2.4.18-rml-O1                    16.83   31.42  266.28   77.98   77.15   18.18   63.03
linux-2.4.18-rml-preempt               16.93   21.87  334.08   84.22  116.30   18.46   60.30
linux-2.4.18-rml-preempt+lockbreak     16.85   22.42  271.52   74.37  229.96   19.57   45.21
linux-2.4.19                           16.99   22.42  261.69  103.61  163.55   18.44   66.16
linux-2.4.19-ac4                       19.08   30.32  176.03   89.38  288.53   22.79  102.09
linux-2.4.19-akpm-low-latency          16.90   21.87  230.92  111.37  179.63   18.36   87.47
linux-2.4.19-ck14                          -       -       -       -       -       -  176.41
linux-2.4.19-rc5-aa1                   18.37   27.18  931.45  154.94  372.73   22.01  125.92
linux-2.4.19-rmap14c                   17.84   24.56   74.81   76.73  121.86   20.57  165.10
linux-2.4.19-rmap15                    18.27   24.09   71.32   77.05  146.68   18.99  102.56
linux-2.4.19-rmap15-splitactive        17.28   23.09   69.16   79.49  140.15   20.27  129.84
linux-2.4.19-rmap15a                   17.10   23.00   62.44   78.12  138.96   18.46  133.32
linux-2.4.19-rml-O1                    16.61   25.45  314.24   90.43  124.27   18.32   72.90
linux-2.4.19-rml-preempt               16.88   21.80  238.80   86.46  155.89   18.45   56.74
linux-2.4.20                           16.62   21.84  191.12  101.06  100.35   18.22   70.47
linux-2.4.20-aa1                       18.23   29.03  331.96  137.70   96.88   22.22  143.22
linux-2.4.20-ac1                       20.24   28.41  776.73  138.35  221.55   22.06  171.13
linux-2.4.20-rc2-aa1                   18.44   28.39  255.79  156.30   86.78   21.98  139.04
linux-2.5.49                           17.66   22.39   36.73   26.85   19.91   20.29   57.34
linux-2.5.50                           17.80   24.19   32.81   25.87   21.43   21.17   45.96

2002-12-15 22:22:01

by Pavel Machek

Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > As far as I know, though, the SYSENTER patch didn't deal with several of
> > the corner cases introduced by the generally weird SYSENTER instruction
> > (such as the fact that V86 tasks can execute it despite the fact there
> > is in general no way to resume execution of the V86 task afterwards.)
> >
> > In practice this means that vsyscalls is pretty much the only sensible
> > way to do this. Also note that INT 80h will need to be supported
> > indefinitely.
> >
> > Personally, I wonder if it's worth the trouble, when x86-64 takes care
> > of the issue anyway :)
>
> There is another way:
>
> Have apps enter kernel mode via Intel's purposely undefined
> instruction, plus a few bytes of padding and identification.
> Require that this not cross a page boundry. When it faults,
> write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
> the page marked clean so it doesn't need to hit swap; if it
> gets paged in again it gets patched again.

That's a *very* dirty hack. vsyscalls seem cleaner than that.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-15 22:22:10

by Pavel Machek

Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > Any ideas? Not sure I want to upgrade to the P7 architecture if this
> > is right, since for me system calls are probably more important than
> > raw cpu computational power.
>
> This is the same for me. I'm extremely uninterested in the P-IV for my
> own use because of this.

Well, then you should fix the kernel so that syscalls are done by
sysenter (or whatever it is called).
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-15 22:30:06

by William Lee Irwin III

Subject: Re: Intel P6 vs P7 system call performance

At some point in the past, I wrote:
>> This is the same for me. I'm extremely uninterested in the P-IV for my
>> own use because of this.

On Sun, Dec 15, 2002 at 10:59:51PM +0100, Pavel Machek wrote:
> Well, then you should fix the kernel so that syscalls are done by
> sysenter (or how is it called).
> Pavel

ABI is immutable. I actually run apps at home.

sysenter is also unusable for low-level loss-of-state reasons mentioned
elsewhere in this thread.


Nice try, though.


Bill

2002-12-15 22:35:55

by Pavel Machek

Subject: Re: Intel P6 vs P7 system call performance

Hi!

> >> This is the same for me. I'm extremely uninterested in the P-IV for my
> >> own use because of this.
>
> > Well, then you should fix the kernel so that syscalls are done by
> > sysenter (or how is it called).
> > Pavel
>
> ABI is immutable. I actually run apps at home.

Perhaps that one killer app can be recompiled?

> sysenter is also unusable for low-level loss-of-state reasons mentioned
> elsewhere in this thread.

Well, disabling v86 may well be worth it :-).
Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-12-16 07:26:00

by Albert D. Cahalan

Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek writes:
> [Albert Cahalan]

>> Have apps enter kernel mode via Intel's purposely undefined
>> instruction, plus a few bytes of padding and identification.
>> Require that this not cross a page boundry. When it faults,
>> write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
>> the page marked clean so it doesn't need to hit swap; if it
>> gets paged in again it gets patched again.
>
> Thats *very* dirty hack. vsyscalls seem cleaner than that.

Sure it's dirty. It's also fast, with the only overhead being
a few NOPs that could get skipped on syscall return anyway.
Patching overhead is negligible, since it only happens when a
page is brought in fresh from the disk.

The vsyscall stuff costs you on every syscall. It's nice for
when you can avoid entering kernel mode entirely, but in that
case the hack I described above can write out a call to user
code (for time-of-day I imagine) just as well as it can write
out a SYSENTER, INT 0x80, or SYSCALL instruction.

Enter with INT 0x42 if you prefer, or just pick one of the new
instructions.

An alternative would be to hack ld.so to patch the syscalls,
but then you get dirty C-O-W pages in every address space.
Permissions change, swap gets used, etc.

2002-12-16 11:10:10

by Pavel Machek

Subject: Re: Intel P6 vs P7 system call performance

Hi!

> >> Have apps enter kernel mode via Intel's purposely undefined
> >> instruction, plus a few bytes of padding and identification.
> >> Require that this not cross a page boundry. When it faults,
> >> write the SYSENTER, INT 0x80, or SYSCALL as needed. Leave
> >> the page marked clean so it doesn't need to hit swap; if it
> >> gets paged in again it gets patched again.
> >
> > Thats *very* dirty hack. vsyscalls seem cleaner than that.
>
> Sure it's dirty. It's also fast, with the only overhead being
> a few NOPs that could get skipped on syscall return anyway.
> Patching overhead is negligible, since it only happens when a
> page is brought in fresh from the disk.

Yes but "read only" code changing under you... Should better be
avoided.

> The vsyscall stuff costs you on every syscall. It's nice for

Well, the cost is basically one call. That's not *that* big a cost.

Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-12-16 17:38:09

by Mark Mielke

Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 16, 2002 at 12:17:59PM +0100, Pavel Machek wrote:
> > Sure it's dirty. It's also fast, with the only overhead being
> > a few NOPs that could get skipped on syscall return anyway.
> > Patching overhead is negligible, since it only happens when a
> > page is brought in fresh from the disk.
> Yes but "read only" code changing under you... Should better be
> avoided.

Programs that self-verify their own CRC may get a little confused (are
there any of these left?). Other than that: 'goto is better avoided' as
well, but sometimes 'goto' is the best answer.

> > The vsyscall stuff costs you on every syscall. It's nice for
> Well, the cost is basically one call. That's not *that* big cost.

Time for benchmarks... :-)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-12-16 19:48:15

by H. Peter Anvin

Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek wrote:
>
>>The vsyscall stuff costs you on every syscall. It's nice for
>
>
> Well, the cost is basically one call. That's not *that* big cost.
>

You absolutely, positively *need* the call anyway. SYSENTER trashes EIP.

-hpa


2002-12-16 20:58:20

by Jonah Sherman

Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 16, 2002 at 12:54:32PM -0500, Mark Mielke wrote:
> Programs that self verify their own CRC may get a little confused (are
> there any of these left?), but other than that, 'goto is better avoided'
> as well, but sometimes 'goto' is the best answer.

This shouldn't cause any problems. The only way this would cause a problem is if the program had direct system calls in it, but as long as they are using libc (what self-CRCing program doesn't use libc?), the changes would only be made to code pages inside libc, so the program's own code pages would remain untouched.

2002-12-17 04:02:41

by David Schwartz

Subject: Re: Intel P6 vs P7 system call performance


On Mon, 16 Dec 2002 11:07:06 -0500, Jonah Sherman wrote:

>On Mon, Dec 16, 2002 at 12:54:32PM -0500, Mark Mielke wrote:

>>Programs that self verify their own CRC may get a little confused (are
>>there any of these left?), but other than that, 'goto is better avoided'
>>as well, but sometimes 'goto' is the best answer.

>This shouldn't cause any problems. The only way this would cause a problem
>is if the program had direct system calls in it, but as long as they are
>using libc(what self-crcing program doesn't use libc?), the changes would
>only be made to code pages inside libc, so the program's own code pages
>would remain untouched.

A program that checked its own CRC would probably be statically linked. This
is especially likely to be true if the CRC was for security reasons.

DS


2002-12-17 07:59:42

by Helge Hafting

Subject: Re: Intel P6 vs P7 system call performance

Mark Mielke wrote:
>
> On Mon, Dec 16, 2002 at 12:17:59PM +0100, Pavel Machek wrote:
> > > Sure it's dirty. It's also fast, with the only overhead being
> > > a few NOPs that could get skipped on syscall return anyway.
> > > Patching overhead is negligible, since it only happens when a
> > > page is brought in fresh from the disk.
> > Yes but "read only" code changing under you... Should better be
> > avoided.
>
> Programs that self verify their own CRC may get a little confused (are
> there any of these left?), but other than that, 'goto is better avoided'
> as well, but sometimes 'goto' is the best answer.

And then there are programs that store constants as parts of the code,
so that their constant-ness is enforced by the MMU.

This can be taken further - the compiler can save space by looking
through the generated code and using an address in the code as the
constant if it happens to have the right value. With some
bad luck it chooses the syscall sequence, which it really doesn't expect
to be modified.

Helge Hafting

2002-12-17 08:48:57

by Andi Kleen

Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds <[email protected]> writes:
>
> That NMI problem is pretty fundamentally unfixable due to the stupid
> sysenter semantics, but we could just make the NMI handlers be real
> careful about it and fix it up if it happens.

You just have to make the NMI a task gate with its own TSS; then the
microcode will set up its own stack for you.

The only issue afterwards is that "current" does not work, but that
can be worked around by being a bit careful in the handler.
It has to run with interrupts off too, to avoid a race with a
timer interrupt which uses current (or alternatively the timer
interrupt could check for the "in NMI" condition - I don't think
any other interrupts access current except when they crash).

[in theory it would be also possible to align the NMI stacks to
8K and put a "pseudo" task into that stack, but it would look
a bit inelegant for me]

Using a task gate would be a good idea for kernel stack faults and
double faults too, then it would be at least possible to get an oops
for them, not the usual double fault.

[x86-64 does it similarly, except that it uses ISTs instead of task
gates and avoids the current problem by using an explicit base register]

I cannot implement SYSENTER for x86-64/32bit emulation, but I think
I can change the vsyscall code to use SYSCALL, not SYSENTER. The only
issue is that I cannot easily use a fixmap to map into 32bit processes,
because the kernel fixmaps are way up in the 48bit address space
and not reachable from compatibility mode.
I suspect a similar trick as with the lazy vmallocs - mapping it in the
page fault handler on demand - will work. I hope there won't be many
more of these special cases though; do_page_fault is getting
awfully complicated.

-Andi
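
For readers unfamiliar with task gates, a rough sketch of what this means at
the descriptor level (an editor's illustration with made-up helper names, not
the change Andi is describing): the IDT entry for vector 2 stops carrying a
handler address and instead selects a TSS in the GDT, so the CPU loads a fresh
%esp/%eip from that TSS and the NMI handler never runs on a possibly-bogus
user stack.

struct idt_gate {
	unsigned short offset_low;	/* unused for a task gate            */
	unsigned short selector;	/* GDT selector of the NMI TSS       */
	unsigned char  reserved;
	unsigned char  type;		/* 0x85 = present, DPL 0, task gate  */
	unsigned short offset_high;	/* unused for a task gate            */
} __attribute__((packed));

static void set_nmi_task_gate(struct idt_gate idt[], unsigned short tss_selector)
{
	idt[2] = (struct idt_gate){ 0, tss_selector, 0, 0x85, 0 };	/* vector 2 = NMI */
}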

2002-12-17 11:10:36

by Eric Dumazet

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

> For the libc DSO I had to play some dirty tricks. The x86 CPU has no
> absolute call. The variant with an immediate parameter is a relative
> jump. Only when jumping through a register or memory location is it
> possible to jump to an absolute address. To be clear, if I have
>
> call 0xfffff000
>
> in a DSO which is loaded at address 0x80000000 the jump ends at
> 0x7fffffff. The problem is that the static linker doesn't know the load
> address. We could of course have the dynamic linker fix up the
> addresses but this is plain stupid. It would mean fixing up a lot of
> places and making the pages covered by those fixups non-sharable.
>

You could have only one routine that would need a relocation / patch at
dynamic linking stage :

absolute_syscall:
jmp 0xfffff000

Then all syscall routines could use:

getpid:
...
call absolute_syscall
...
instead of "call 0xfffff000"


If the kernel doesn't support the 0xfffff000 page, you could patch
absolute_syscall (if it resides in a .data section) with:
absolute_syscall:
int 0x80
ret
(3 bytes instead of 5 bytes)

See you
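
A rough C-level variant of the same single-indirection idea (an editor's
sketch with hypothetical names, not Eric's actual proposal): keep one pointer
that defaults to an int $0x80 stub and is redirected once at startup if the
kernel provides the 0xfffff000 page, so no per-call-site relocation is needed.

/* legacy entry: plain int $0x80, same register convention as always */
extern void int80_entry(void);
asm(".text\n"
    ".globl int80_entry\n"
    "int80_entry:\n"
    "\tint $0x80\n"
    "\tret\n");

/* the single word that needs patching / relocating */
void (*absolute_syscall)(void) = int80_entry;

/* hypothetical startup hook: switch to the vsyscall page when present */
void select_syscall_entry(int kernel_has_vsyscall_page)
{
	if (kernel_has_vsyscall_page)
		absolute_syscall = (void (*)(void)) 0xfffff000;
}

A syscall stub would then load %eax and the argument registers as usual and do
"call *absolute_syscall", paying one indirect call regardless of which entry
path the kernel offers.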

2002-12-17 15:56:17

by John Reiser

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, 16 Dec 2002, Linus Torvalds wrote [regarding vsyscall implementation]:
> The good news is that the kernel part really looks pretty clean.

Where is the CPU serializing instruction which must be executed before return
to user mode, so that kernel accesses to hardware devices are guaranteed to
complete before any subsequent user access begins? (Otherwise a read/write
by the user to a memory-mapped device page can appear out-of-order with respect
to the kernel accesses in a preceding syscall.) The only generally useful
serializing instructions are IRET and CPUID; only IRET is implemented universally.

--
John Reiser, [email protected]

2002-12-17 16:09:48

by John Reiser

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ulrich Drepper wrote:
[snip]
> pushl %ebp
> movl $0xfffff000, %ebp
> call *%ebp
> popl %ebp

This does not work for mmap64 [syscall 192], which passes a parameter in %ebp.

--
John Reiser, [email protected]

2002-12-17 16:24:44

by Manfred Spraul

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

>
>
> pushl %ebp
> movl $0xfffff000, %ebp
> call *%ebp
> popl %ebp
>
>

You could avoid clobbering a register with something like

pushl $0xfffff000
call *(%esp)
addl $4, %esp

--
Manfred

2002-12-17 16:48:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On 17 Dec 2002, Andi Kleen wrote:
>
> Linus Torvalds <[email protected]> writes:
> >
> > That NMI problem is pretty fundamentally unfixable due to the stupid
> > sysenter semantics, but we could just make the NMI handlers be real
> > careful about it and fix it up if it happens.
>
> You just have to make the NMI a task gate with an own TSS, then the
> microcode will set up an own stack for you.

Actually, I came up with a much simpler solution (which I didn't yet
implement, but should be just a few lines).

The simpler solution is to just make the temporary ESP stack _look_ like
it's a real process - ie make it 8kB per CPU (instead of the current 4kB)
and put a fake "thread_info" at the bottom of it with the right CPU
number etc. That way if an NMI comes in (in the _extremely_ tiny window),
it will still see a sane picture of the system. It will basically think
that we had a micro-task-switch between two instructions.

It's also entirely possible that the NMI window may not actually even
exist, since I'm not even sure that Intel checks for pending interrupt
before the first instruction of a trap handler.

> Using a task gate would be a good idea for kernel stack faults and
> double faults too, then it would be at least possible to get an oops
> for them, not the usual double fault.

We can't get stack faults without degrading performance horribly (they
require you to set up the stack segment in magic ways that gcc doesn't
even support). For double-faults, yes, but quite frankly, if you ever get
a double fault things are _so_ screwed up that it's not very funny any
more.

> I cannot implement SYSENTER for x86-64/32bit emulation, but I think
> I can change the vsyscall code to use SYSCALL, not SYSENTER.

Right. The point of my patches is that user-level really _cannot_ use
sysenter directly, because the sysenter semantics are just not useful for
user land. So as far as user land is concerned, it really _is_ just a
"call 0xfffff000", and then the kernel can do whatever is appropriate for
that CPU.

Linus
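
A rough sketch of the layout Linus describes (an editor's illustration; the
field and constant names are assumptions borrowed from the 2.5 i386 tree, not
code from this thread): the temporary SYSENTER stack is sized and aligned like
a real task stack, with a fake thread_info at the bottom, so an NMI arriving
in the tiny pre-stack-switch window still sees something that looks like a
normal task.

#include <linux/threads.h>	/* NR_CPUS                          */
#include <asm/thread_info.h>	/* struct thread_info, 2.5-era i386 */

union sysenter_stack {
	struct thread_info fake;	/* cpu number, flags, preempt count, ... */
	unsigned long stack[2048];	/* 8 kB, same size as a task stack       */
};

static union sysenter_stack sysenter_stacks[NR_CPUS]
	__attribute__((__aligned__(8192)));

/* SYSENTER_ESP would then point at the top of this area, e.g.:
 *	wrmsr(MSR_IA32_SYSENTER_ESP,
 *	      (unsigned long) &sysenter_stacks[cpu].stack[2048], 0);
 */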

2002-12-17 17:02:56

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 17 Dec 2002, Manfred Spraul wrote:

> >
> >
> > pushl %ebp
> > movl $0xfffff000, %ebp
> > call *%ebp
> > popl %ebp
> >
> >
>
> You could avoid clobbering a register with something like
>
> pushl $0xfffff000
> call *(%esp)
> addl $4, %esp
>

This is a near 'call'.

pushl $0xfffff000
ret

This is a 'far' 'call'; I think you will need to reload the segment
back to user-mode segments on the return.

pushl $KERNEL_CS
pushl $0xfffff000
lret




Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-17 17:09:28

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 17 Dec 2002, Richard B. Johnson wrote:

> On Tue, 17 Dec 2002, Manfred Spraul wrote:
>
> > >
> > >
> > > pushl %ebp
> > > movl $0xfffff000, %ebp
> > > call *%ebp
> > > popl %ebp
> > >
> > >
> >
> > You could avoid clobbering a register with something like
> >
> > pushl $0xfffff000
> > call *(%esp)
> > addl $4, %esp
> >
>
> This is a near 'call'.
>
> pushl $0xfffff000
> ret
>

I hate answering my own stuff......... This gets back and modifies
no registers.

Actually it should be:

pushl $next_address # Where to go when the call returns
pushl $0xfffff000 # Put this on the stack
ret # 'Return' to it (jump)
next_address: # Where we end up after



Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-17 17:26:07

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

dada1 wrote:

> You could have only one routine that would need a relocation / patch at
> dynamic linking stage :

That's a horrible way to deal with this in DSOs. There is no writable
and executable segment, and one would have to be created, which means
enormous additional setup costs and higher memory requirements. I'm not
going to use any code modification.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-17 17:29:41

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Richard B. Johnson writes:
> Actually I should be:
>
> pushl $next_address # Where to go when the call returns
> pushl $0xfffff000 # Put this on the stack
> ret # 'Return' to it (jump)
> next_address: # Were we end up after

You just killed that process' performance by causing the
return-stack branch prediction buffer to go out of sync.

It might have worked ok on a 486, but P6+ don't like it one bit.

This is also why I'm slightly unhappy about the
s/int $0x80/call <address of sysenter>/ approach, since it leads
to yet another recursion level and risks overflowing the RSB.

2002-12-17 19:02:10

by Ross Biro

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


It doesn't make sense to me to use a specially formatted page forced
into user space to tell libraries how to do system calls. Perhaps each
executable personality in the kernel should export a special shared
library in its own native format that contains the necessary
information. That way we don't have to worry as much about code or
values changing sizes or locations.

We would have the chicken/egg problem of how the special shared
library gets loaded in the first place. For that we either support a
legacy syscall method (i.e. int 0x80 on x86), which should only be used
by ld.so or the equivalent, or magically force the library into user
space at a known address.

Ross


Linus Torvalds wrote:

>On 17 Dec 2002, Alan Cox wrote:
>
>
>>Is there any reason you can't just keep the linker out of the entire
>>mess by generating
>>
>> .byte whatever
>> .dword 0xFFFF0000
>>
>>instead of call ?
>>
>>
>
>Alan, the problem is that there _is_ no such instruction as a "call
>absolute".
>
>There is only a "call relative" or "call indirect-absolute". So you either
>have to indirect through memory or a register, or you have to fix up the
>call at link-time.
>
>Yeah, I know it sounds strange, but it makes sense. Absolute calls are
>actually very unusual, and using relative calls is _usually_ the right
>thing to do. It's only in cases like this that we really want to call a
>specific address.
>
> Linus
>
>



2002-12-17 19:46:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Richard B. Johnson wrote:
>
> You can call intersegment with a full pointer. I don't know how
> expensive that is. Since USER_CS is a fixed value in Linux, it
> can be hard-coded
>
> .byte 0x9a
> .dword 0xfffff000
> .word USER_CS
>
> No. I didn't try this, I'm just looking at the manual. I don't know
> what the USER_CS is (didn't look in the kernel) The book says the
> pointer is 16:32 which means that it's a dword, followed by a word.
>

It's quite expensive (not as expensive as INT, but not that far from
it), and you also push CS onto the stack.

-hpa


2002-12-17 20:04:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Alan Cox wrote:
> On Tue, 2002-12-17 at 19:12, H. Peter Anvin wrote:
>
>>The complexity only applies to nonsynchronized TSCs though, I would
>>assume. I believe x86-64 uses a vsyscall using the TSC when it can
>>provide synchronized TSCs, and if it can't it puts a normal system call
>>inside the vsyscall in question.
>
>
> For x86-64 there is the hpet timer, which is a lot saner but I don't
> think we can mmap it
>

It's only necessary, though, when TSC isn't usable. TSC is psycho fast
when it's available. Just about anything is saner than the old 8042 or
whatever it is called timer chip, though...

-hpa


2002-12-18 01:23:05

by Nakajima, Jun

[permalink] [raw]
Subject: RE: Intel P6 vs P7 system call performance

AMD (at least Athlon, as far as I know) supports sysenter/sysexit. We tested it on an Athlon box as well, and it worked fine. And sysenter/sysexit was better than int/iret too (about 40% faster) there.

Jun

> -----Original Message-----
> From: Ulrich Drepper [mailto:[email protected]]
> Sent: Tuesday, December 17, 2002 11:19 AM
> To: Linus Torvalds
> Cc: Matti Aarnio; Hugh Dickins; Dave Jones; Ingo Molnar; linux-
> [email protected]; [email protected]
> Subject: Re: Intel P6 vs P7 system call performance
>
> Linus Torvalds wrote:
>
> > In the meantime, I do agree with you that the TLS approach should work
> > too, and might be better. It will allow all six arguments to be used if
> we
> > just find a good calling conventions
>
> If you push out the AT_* patch I'll hack the glibc bits (probably the
> TLS variant). Won't take too long, you'll get results this afternoon.
>
> What about AMD's instruction? Is it as flawed as sysenter? If not and
> %ebp is available I really should use the TLS method.
>
> --
> --------------. ,-. 444 Castro Street
> Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
> Red Hat `--' drepper at redhat.com `---------------------------
>

2002-12-18 01:46:54

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Nakajima, Jun wrote:
> AMD (at least Athlon, as far as I know) supports sysenter/sysexit. We tested it on an Athlon box as well, and it worked fine. And sysenter/sysexit was better than int/iret too (about 40% faster) there.

That's good to know but not what I meant.

I referred to syscall/sysret opcodes. They are broken in their own way
(destroying ecx on kernel entry) but at least they preserve eip.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-18 03:29:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ulrich Drepper wrote:
>
> That's good to know but not what I meant.
>
> I referred to syscall/sysret opcodes. They are broken in their own way
> (destroying ecx on kernel entry) but at least they preserve eip.
>

Destroying %ecx is a lot less destructive than destroying %eip and %esp...

-hpa

2002-12-18 03:58:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


Btw, on another tangent - Andrew Morton reports that APM is unhappy about
the fact that the fast system call stuff required us to move the segments
around a bit. That's probably because the APM code has the old APM segment
numbers hardcoded somewhere, but I don't see where (I certainly knew about
the segment number issue, and tried to update the cases I saw).

Debugging help would be appreciated, especially from somebody who knows
the APM code.

Linus

2002-12-18 03:56:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, H. Peter Anvin wrote:
> Ulrich Drepper wrote:
> >
> > That's good to know but not what I meant.
> >
> > I referred to syscall/sysret opcodes. They are broken in their own way
> > (destroying ecx on kernel entry) but at least they preserve eip.
> >
>
> Destroying %ecx is a lot less destructive than destroying %eip and %esp...

Actually, as far as the kernel is concerned, they are about equally bad.

Destroying %eip is the _least_ bad register to destroy, since the kernel
can control that part, and it is trivial to just have a single call site.

But destroying %esp or %ecx is pretty much totally equivalent - it
destroys one user mode register, and it doesn't really matter _which_ one.
In both cases 32 bits of user information is destroyed, and they are 100%
equivalent as far as the kernel is concerned.

On intel with sysenter, destroying %esp means that we have to save the
value in %ebp, and we thus lose argument 6.

On AMD, %ecx is destroyed on entry, which means that we lose argument 2
(which is more important than arg6, but that only means that the AMD
trampoline will have to move the old value of %ecx into %ebp, at which
point the two approaches are again exactly the same).

In either case, one GP register is irrevocably lost, which means that
there are only 5 GP registers left for arguments. And thus both Intel and
AMD will have _exactly_ the same problem with six-argument system calls.

The _sane_ thing to do would have been to save the old user %esp/%eip on
the kernel stack. Preferably together with %eflags and %ss and %cs, just
for completeness. That stack save part is _not_ the expensive or complex
part of a "int 0x80" or long call (the _complex_ part is all the stupid
GDT/IDT lookups and all the segment switching crap).

In short, both AMD and Intel optimized away too much.

The good news is that since both of them suck, it's easier to make the
six-argument decision. Since six arguments are problematic for all major
"fast" system calls, my executive decision is to just say that
six-argument system calls will just have to continue using the old and
slower system call interface. It's kind of a crock, but it's simply due to
silly CPU designers.

Linus
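
For reference, the i386 convention under discussion (an editor's summary,
worth keeping in mind through the rest of the thread): the syscall number goes
in %eax and arguments 1-6 go in %ebx, %ecx, %edx, %esi, %edi, %ebp. With
int $0x80 all six fit in registers; with SYSENTER or SYSCALL one of %ebp/%ecx
is consumed by the entry mechanism, which is exactly the six-argument problem
above. A minimal three-argument example via the old path, assuming
__NR_write == 4 on i386:

#include <stddef.h>

static inline long int80_write(int fd, const void *buf, size_t len)
{
	long ret;

	asm volatile("int $0x80"
		     : "=a" (ret)		/* return value (or -errno) */
		     : "0" (4L),		/* %eax = __NR_write        */
		       "b" ((long) fd),		/* arg1 -> %ebx             */
		       "c" (buf),		/* arg2 -> %ecx             */
		       "d" (len)		/* arg3 -> %edx             */
		     : "memory");
	return ret;
}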

2002-12-18 04:32:43

by Stephen Rothwell

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi Linus, Andrew,

On Tue, 17 Dec 2002 20:07:53 -0800 (PST) Linus Torvalds <[email protected]> wrote:
>
> Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> the fact that the fast system call stuff required us to move the segments
> around a bit. That's probably because the APM code has the old APM segment
> numbers hardcoded somewhere, but I don't see where (I certainly knew about
> the segment number issue, and tried to update the cases I saw).

I looked at this yesterday and decided that it was OK as well.

> Debugging help would be appreciated, especially from somebody who knows
> the APM code.

It would help to know what "unhappy" means :-)

Does the following fix it for you? Untested, assumes cache lines are 32
bytes.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/

diff -ruN 2.5.52-200212181207/include/asm-i386/segment.h 2.5.52-200212181207-apm/include/asm-i386/segment.h
--- 2.5.52-200212181207/include/asm-i386/segment.h 2002-12-18 15:25:48.000000000 +1100
+++ 2.5.52-200212181207-apm/include/asm-i386/segment.h 2002-12-18 15:38:34.000000000 +1100
@@ -65,9 +65,9 @@
#define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11)

/*
- * The GDT has 23 entries but we pad it to cacheline boundary:
+ * The GDT has 25 entries but we pad it to cacheline boundary:
*/
-#define GDT_ENTRIES 24
+#define GDT_ENTRIES 28

#define GDT_SIZE (GDT_ENTRIES * 8)

2002-12-18 04:31:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
>>How about this diff? It does both the 6-parameter thing _and_ the
>>AT_SYSINFO addition.
>
>
> The 6-parameter thing is broken. It's clever, but playing games with %ebp
> is not going to work with restarting of the system call - we need to
> restart with the proper %ebp.
>

This confuses me -- there seems to be no reason this shouldn't work as
long as %esp == %ebp on sysexit. The SYSEXIT-trashed GPRs seem like a
bigger problem.

-hpa


2002-12-18 04:29:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>>
>>Destroying %ecx is a lot less destructive than destroying %eip and %esp...
>
> Actually, as far as the kernel is concerned, they are about equally bad.
>

Right, but from a user-mode point of view it means at least one extra
instruction.

> Destroying %eip is the _least_ bad register to destroy, since the kernel
> can control that part, and it is trivial to just have a single call site.

Trivial, perhaps, but it requires a call/ret pair in userspace, which is
a fairly expensive form of push/pop.

> The good news is that since both of them suck, it's easier to make the
> six-argument decision. Since six arguments are problematic for all major
> "fast" system calls, my executive decision is to just say that
> six-argument system calls will just have to continue using the old and
> slower system call interface. It's kind of a crock, but it's simply due to
> silly CPU designers.

Oh, so you're not going to do the "read from stack" thing? (Agreed, by
the way, on the CPU design -- both SYSENTER and SYSCALL suck. SYSCALL
was changed rather substantially in x86-64 for that reason.)

-hpa



2002-12-18 04:43:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Wed, 18 Dec 2002, Stephen Rothwell wrote:
>
> It would help to know what "unhappy" means :-)

Andrew reported an oops in the BIOS. I have the full oops info somewhere,
but quite frankly it isn't that readable. It shows

EIP: 00b8:[<000044d7>] Not tainted
ds: 0000 es: 0000 ss: 0068
Call Trace:
[<c0112739>] apm_bios_call+0x75/0xf4
[<c0130000>] cache_init_objs+0x34/0xd8
[<c0112b72>] apm_get_power_status+0x42/0x84
[<c012d843>] __alloc_pages+0x77/0x244
[<c0113828>] apm_get_info+0x38/0xe4
[<c016982d>] proc_file_read+0xa9/0x1ac
[<c0141b53>] vfs_read+0xb7/0x138
[<c0141dee>] sys_read+0x2a/0x40
[<c0108e67>] syscall_call+0x7/0xb

and I suspect the problem is the 0 in ds/es..

> Does the following fix it for you? Untested, assumes cache lines are 32
> bytes.

Andrew?

Linus

2002-12-18 04:40:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, H. Peter Anvin wrote:
>
> This confuses me -- there seems to be no reason this shouldn't work as
> long as %esp == %ebp on sysexit. The SYSEXIT-trashed GPRs seem like a
> bigger problem.

The thing is, the argument save area == the kernel stack frame. This is
part of the reason why Linux has very fast system calls - there is
absolutely _zero_ extraneous setup. No argument fetching and marshalling,
it's all part of just setting up the regular kernel stack.

So to get the right argument in arg6, the argument _needs_ to be saved in
the %ebp entry on the kernel stack. Which means that on return from the
system call (which may not actually be through a "sysenter" at all, if
signals happen it will go through the generic paths), %ebp will have been
updated as part of the kernel stack unwinding.

Which is ok for a regular fast system call (ebp will get restored
immediately), but it is NOT ok for the system call restart case, since in
that case we want %ebp to contain the old stack pointer, not the sixth
argument.

If we just save the stack pointer value (== the initial %ebx value), the
right thing will get restored, but then system calls will see the stack
pointer value as arg6 - because of the 1:1 relationship between arguments
and stack save.

Linus
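
A sketch of the 1:1 correspondence Linus describes, following the 2.5-era i386
struct pt_regs layout (the declaration below is an editor's reconstruction, so
treat the exact field order as an assumption): the registers saved at kernel
entry double as the syscall argument array, with %ebp serving both as argument
6 and as the slot that must hold the user stack pointer for the sysenter
restart case.

struct pt_regs {
	long ebx;	/* arg1 */
	long ecx;	/* arg2 */
	long edx;	/* arg3 */
	long esi;	/* arg4 */
	long edi;	/* arg5 */
	long ebp;	/* arg6 -- and the saved user %esp on the sysenter path */
	long eax;	/* syscall number on entry, return value on exit */
	int  xds;
	int  xes;
	long orig_eax;
	long eip;	/* saved user EIP (SYSENTER_RETURN on the sysenter path) */
	int  xcs;
	long eflags;
	long esp;
	int  xss;
};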

2002-12-18 04:45:48

by Andrew Morton

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Stephen Rothwell wrote:
>
> Hi Linus, Andrew,
>
> On Tue, 17 Dec 2002 20:07:53 -0800 (PST) Linus Torvalds <[email protected]> wrote:
> >
> > Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> > the fact that the fast system call stuff required us to move the segments
> > around a bit. That's probably because the APM code has the old APM segment
> > numbers hardcoded somewhere, but I don't see where (I certainly knew about
> > the segment number issue, and tried to update the cases I saw).
>
> I looked at this yesterday and decided that it was OK as well.
>
> > Debugging help would be appreciated, especially from somebody who knows
> > the APM code.
>
> It would help to know what "unhappy" means :-)

The lcall seems to be going awry. It oopses when apmd starts up,
and the sysenter patch is the trigger.

CPU: 0
EIP: 00b8:[<000044d7>] Not tainted
EFLAGS: 00010202
EIP is at 0x44d7
eax: 000000c8 ebx: 00000001 ecx: 00000000 edx: 00000000
esi: c02e0091 edi: 000000ff ebp: ceed1ec4 esp: ceed1e74
ds: 0000 es: 0000 ss: 0068
Process apmd (pid: 679, threadinfo=ceed0000 task=cfa058a0)
Stack: 0000530a 00b844e8 00000000 ceed1ec4 c0112739 00000060 ceed1ec4 000000ff
00000068 00000068 ceed1f32 c02e0091 000000ff 00000202 ceed0000 cf706c24
00000000 00000000 c0130000 c1740000 ceed1f04 c0112b72 0000530a 00000001
Call Trace:
[<c0112739>] apm_bios_call+0x75/0xf4
[<c0130000>] cache_init_objs+0x34/0xd8
[<c0112b72>] apm_get_power_status+0x42/0x84
[<c012d843>] __alloc_pages+0x77/0x244
[<c0113828>] apm_get_info+0x38/0xe4
[<c016982d>] proc_file_read+0xa9/0x1ac
[<c0141b53>] vfs_read+0xb7/0x138
[<c0141dee>] sys_read+0x2a/0x40
[<c0108e67>] syscall_call+0x7/0xb

> Does the following fix it for you? Untested, assumes cache lines are 32
> bytes.
>

I cleverly left the laptop at work. Shall test tomorrow.

2002-12-18 05:17:54

by Brian Gerst

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

diff -urN linux-2.5.52-bk2/arch/i386/kernel/cpu/common.c linux/arch/i386/kernel/cpu/common.c
--- linux-2.5.52-bk2/arch/i386/kernel/cpu/common.c Sat Dec 14 12:32:00 2002
+++ linux/arch/i386/kernel/cpu/common.c Tue Dec 17 23:21:55 2002
@@ -487,7 +487,7 @@
BUG();
enter_lazy_tlb(&init_mm, current, cpu);

- t->esp0 = thread->esp0;
+ load_esp0(t, thread->esp0);
set_tss_desc(cpu,t);
cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff;
load_TR_desc();
diff -urN linux-2.5.52-bk2/arch/i386/kernel/process.c linux/arch/i386/kernel/process.c
--- linux-2.5.52-bk2/arch/i386/kernel/process.c Sat Dec 14 12:32:04 2002
+++ linux/arch/i386/kernel/process.c Tue Dec 17 23:29:54 2002
@@ -440,7 +440,7 @@
/*
* Reload esp0, LDT and the page table pointer:
*/
- tss->esp0 = next->esp0;
+ load_esp0(tss, next->esp0);

/*
* Load the per-thread Thread-Local Storage descriptor.
diff -urN linux-2.5.52-bk2/arch/i386/kernel/sysenter.c linux/arch/i386/kernel/sysenter.c
--- linux-2.5.52-bk2/arch/i386/kernel/sysenter.c Tue Dec 17 23:21:45 2002
+++ linux/arch/i386/kernel/sysenter.c Tue Dec 17 23:31:01 2002
@@ -20,22 +20,12 @@

static void __init enable_sep_cpu(void *info)
{
- unsigned long page = __get_free_page(GFP_ATOMIC);
int cpu = get_cpu();
- unsigned long *esp0_ptr = &(init_tss + cpu)->esp0;
- unsigned long rel32;
+ struct tss_struct *tss = init_tss + cpu;

- rel32 = (unsigned long) sysenter_entry - (page+11);
-
-
- *(short *) (page+0) = 0x258b; /* movl xxxxx,%esp */
- *(long **) (page+2) = esp0_ptr;
- *(char *) (page+6) = 0xe9; /* jmp rl32 */
- *(long *) (page+7) = rel32;
-
- wrmsr(0x174, __KERNEL_CS, 0); /* SYSENTER_CS_MSR */
- wrmsr(0x175, page+PAGE_SIZE, 0); /* SYSENTER_ESP_MSR */
- wrmsr(0x176, page, 0); /* SYSENTER_EIP_MSR */
+ wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
+ wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp0, 0);
+ wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);

printk("Enabling SEP on CPU %d\n", cpu);
put_cpu();
@@ -60,14 +50,15 @@
};
unsigned long page = get_zeroed_page(GFP_ATOMIC);

+ if (cpu_has_sep) {
+ memcpy((void *) page, sysent, sizeof(sysent));
+ enable_sep_cpu(NULL);
+ smp_call_function(enable_sep_cpu, NULL, 1, 1);
+ } else
+ memcpy((void *) page, int80, sizeof(int80));
+
__set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY);
- memcpy((void *) page, int80, sizeof(int80));
- if (!boot_cpu_has(X86_FEATURE_SEP))
- return 0;
-
- memcpy((void *) page, sysent, sizeof(sysent));
- enable_sep_cpu(NULL);
- smp_call_function(enable_sep_cpu, NULL, 1, 1);
+
return 0;
}

diff -urN linux-2.5.52-bk2/arch/i386/kernel/vm86.c linux/arch/i386/kernel/vm86.c
--- linux-2.5.52-bk2/arch/i386/kernel/vm86.c Sat Dec 14 12:32:02 2002
+++ linux/arch/i386/kernel/vm86.c Tue Dec 17 23:21:55 2002
@@ -113,7 +113,7 @@
do_exit(SIGSEGV);
}
tss = init_tss + smp_processor_id();
- tss->esp0 = current->thread.esp0 = current->thread.saved_esp0;
+ load_esp0(tss, current->thread.saved_esp0);
current->thread.saved_esp0 = 0;
ret = KVM86->regs32;
return ret;
@@ -283,7 +283,8 @@
info->regs32->eax = 0;
tsk->thread.saved_esp0 = tsk->thread.esp0;
tss = init_tss + smp_processor_id();
- tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
+ tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;
+ load_esp0(tss, tsk->thread.esp0);

tsk->thread.screen_bitmap = info->screen_bitmap;
if (info->flags & VM86_SCREEN_BITMAP)
diff -urN linux-2.5.52-bk2/include/asm-i386/cpufeature.h linux/include/asm-i386/cpufeature.h
--- linux-2.5.52-bk2/include/asm-i386/cpufeature.h Sun Sep 15 22:18:22 2002
+++ linux/include/asm-i386/cpufeature.h Tue Dec 17 23:29:27 2002
@@ -7,6 +7,8 @@
#ifndef __ASM_I386_CPUFEATURE_H
#define __ASM_I386_CPUFEATURE_H

+#include <linux/bitops.h>
+
#define NCAPINTS 4 /* Currently we have 4 32-bit words worth of info */

/* Intel-defined CPU features, CPUID level 0x00000001, word 0 */
@@ -74,6 +76,7 @@
#define cpu_has_pae boot_cpu_has(X86_FEATURE_PAE)
#define cpu_has_pge boot_cpu_has(X86_FEATURE_PGE)
#define cpu_has_apic boot_cpu_has(X86_FEATURE_APIC)
+#define cpu_has_sep boot_cpu_has(X86_FEATURE_SEP)
#define cpu_has_mtrr boot_cpu_has(X86_FEATURE_MTRR)
#define cpu_has_mmx boot_cpu_has(X86_FEATURE_MMX)
#define cpu_has_fxsr boot_cpu_has(X86_FEATURE_FXSR)
diff -urN linux-2.5.52-bk2/include/asm-i386/msr.h linux/include/asm-i386/msr.h
--- linux-2.5.52-bk2/include/asm-i386/msr.h Sat Dec 14 12:32:05 2002
+++ linux/include/asm-i386/msr.h Tue Dec 17 23:21:55 2002
@@ -53,6 +53,10 @@

#define MSR_IA32_BBL_CR_CTL 0x119

+#define MSR_IA32_SYSENTER_CS 0x174
+#define MSR_IA32_SYSENTER_ESP 0x175
+#define MSR_IA32_SYSENTER_EIP 0x176
+
#define MSR_IA32_MCG_CAP 0x179
#define MSR_IA32_MCG_STATUS 0x17a
#define MSR_IA32_MCG_CTL 0x17b
diff -urN linux-2.5.52-bk2/include/asm-i386/processor.h linux/include/asm-i386/processor.h
--- linux-2.5.52-bk2/include/asm-i386/processor.h Sat Dec 14 12:32:08 2002
+++ linux/include/asm-i386/processor.h Tue Dec 17 23:26:16 2002
@@ -14,6 +14,7 @@
#include <asm/types.h>
#include <asm/sigcontext.h>
#include <asm/cpufeature.h>
+#include <asm/msr.h>
#include <linux/cache.h>
#include <linux/config.h>
#include <linux/threads.h>
@@ -416,6 +417,13 @@
{~0, } /* ioperm */ \
}

+static inline void load_esp0(struct tss_struct *tss, unsigned long esp0)
+{
+ tss->esp0 = esp0;
+ if (cpu_has_sep)
+ wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
+}
+
#define start_thread(regs, new_eip, new_esp) do { \
__asm__("movl %0,%%fs ; movl %0,%%gs": :"r" (0)); \
set_fs(USER_DS); \


Attachments:
sysenter-1 (5.44 kB)

2002-12-18 05:31:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Jeremy Fitzhardinge wrote:
> On Tue, 2002-12-17 at 09:55, Linus Torvalds wrote:
>
>>Uli, how about I just add one ne warchitecture-specific ELF AT flag, which
>>is the "base of sysinfo page". Right now that page is all zeroes except
>>for the system call trampoline at the beginning, but we might want to add
>>other system information to the page in the future (it is readable, after
>>all).
>
>
> The P4 optimisation guide promises horrible things if you write within
> 2k of a cached instruction from another CPU (it dumps the whole trace
> cache, it seems), so you'd need to be careful about mixing mutable data
> and the syscall code in that page.
>
> Immutable data should be fine.
>

Yes, you really want to use a second page.

-hpa



2002-12-18 05:52:39

by Brian Gerst

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ulrich Drepper wrote:
> Nakajima, Jun wrote:
>
>>AMD (at least Athlon, as far as I know) supports sysenter/sysexit. We tested it on an Athlon box as well, and it worked fine. And sysenter/sysexit was better than int/iret too (about 40% faster) there.
>
>
> That's good to know but not what I meant.
>
> I referred to syscall/sysret opcodes. They are broken in their own way
> (destroying ecx on kernel entry) but at least they preserve eip.
>

syscall is pretty much unusable unless the NMI is changed to a task
gate. syscall does not change %esp on entry to the kernel, so an NMI
before the manual stack switch would still use the user stack, which is
not guaranteed to be valid - oops. x86-64 gets around this by using an
interrupt stack, its replacement for task gates.

--
Brian Gerst

2002-12-18 05:57:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Wed, 18 Dec 2002, Brian Gerst wrote:
>
> How about this patch? Instead of making a per-cpu trampoline, write to
> the msr during each context switch.

I wanted to avoid slowing down the context switch, but I didn't actually
time how much the MSR write hurts you (it needs to be conditional, though,
I think).

Linus

2002-12-18 06:29:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 17 Dec 2002, Linus Torvalds wrote:
>
> Which is ok for a regular fast system call (ebp will get restored
> immediately), but it is NOT ok for the system call restart case, since in
> that case we want %ebp to contain the old stack pointer, not the sixth
> argument.

I came up with an absolutely wonderfully _disgusting_ solution for this.

The thing to realize on how to solve this is that since "sysenter" loses
track of EIP, there's really no real reason to try to return directly
after the "sysenter" instruction anyway. The return point is really
totally arbitrary, after all.

Now, couple this with the fact that system call restarting will always
just subtract two from the "return point" aka saved EIP value (that's the
size of an "int 0x80" instruction), and what you can do is make the
kernel point the sysexit return point not just past the "sysenter", but
instead just past a totally unrelated 2-byte jump
instruction.

With that in mind, I made the sysentry trampoline look like this:

static const char sysent[] = {
0x51, /* push %ecx */
0x52, /* push %edx */
0x55, /* push %ebp */
0x89, 0xe5, /* movl %esp,%ebp */
0x0f, 0x34, /* sysenter */
/* System call restart point is here! (SYSENTER_RETURN - 2) */
0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
0x5d, /* pop %ebp */
0x5a, /* pop %edx */
0x59, /* pop %ecx */
0xc3 /* ret */
};

which does the right thing for a "restarted" system call (ie when it
restarts, it won't re-do just the sysenter instruction, it will really
restart at the backwards jump, and thus re-start the "movl %esp,%ebp"
too).

Which means that now the kernel can happily trash %ebp as part of the
sixth argument setup, since system call restarting will re-initialize it
to point to the user-level stack that we need in %ebp because otherwise it
gets totally lost.

I'm a disgusting pig, and proud of it to boot.

Linus
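
To make the trick easier to follow, here is the same trampoline annotated with
byte offsets (an editor's annotation; the offsets are simply the instruction
lengths above added up):

/*
 * off  0:  51       push %ecx
 * off  1:  52       push %edx
 * off  2:  55       push %ebp
 * off  3:  89 e5    movl %esp,%ebp
 * off  5:  0f 34    sysenter
 * off  7:  eb fa    jmp  <off 3>      (restart lands here: SYSENTER_RETURN - 2)
 * off  9:  5d       pop  %ebp         (SYSENTER_RETURN, the saved user EIP)
 * off 10:  5a       pop  %edx
 * off 11:  59       pop  %ecx
 * off 12:  c3       ret
 *
 * Restart subtracts 2 from the saved EIP (offset 9) and resumes at the jmp at
 * offset 7, which goes back to offset 3 and redoes "movl %esp,%ebp" before
 * re-entering the kernel -- exactly the property described above.
 */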

2002-12-18 12:47:54

by Terje Eggestad

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


what about:

int (*_vsyscall) (int, ...);
_vsyscall = mmap(NULL, getpagesize(), PROT_READ|PROT_EXEC,
                 MAP_VSYSCALL, -1, 0);

or if you're afraid of running out of MAP_* flags:

fd = open("/dev/vsyscall", O_RDONLY);
_vsyscall = mmap(NULL, getpagesize(), PROT_READ|PROT_EXEC, MAP_SHARED,
                 fd, 0);

Then you can leisurely map it in just after the program's text segment.

TJ


On tir, 2002-12-17 at 18:55, Linus Torvalds wrote:
> On Tue, 17 Dec 2002, Matti Aarnio wrote:
> >
> > On Tue, Dec 17, 2002 at 09:07:21AM -0800, Linus Torvalds wrote:
> > > On Tue, 17 Dec 2002, Hugh Dickins wrote:
> > > > I thought that last page was intentionally left invalid?
> > >
> > > It was. But I thought it made sense to use, as it's the only really
> > > "special" page.
> >
> > In couple of occasions I have caught myself from pre-decrementing
> > a char pointer which "just happened" to be NULL.
> >
> > Please keep the last page, as well as a few of the first pages as
> > NULL-pointer poisons.
>
> I think I have a good clean solution to this, that not only avoids the
> need for any hard-coded address _at_all_, but also solves Uli's problem
> quite cleanly.
>
> Uli, how about I just add one new architecture-specific ELF AT flag, which
> is the "base of sysinfo page". Right now that page is all zeroes except
> for the system call trampoline at the beginning, but we might want to add
> other system information to the page in the future (it is readable, after
> all).
>
> So we'd have an AT_SYSINFO entry, that with the current implementation
> would just get the value 0xfffff000. And then the glibc startup code could
> easily be backwards compatible with the suggestion I had in the previous
> email. Since we basically want to do an indirect jump anyway (because of
> the lack of absolute jumps in the instruction set), this looks like the
> natural way to do it.
>
> That also allows the kernel to move around the SYSINFO page at will, and
> even makes it possible to avoid it altogether (ie this will solve the
> inevitable problems with UML - UML just wouldn't set AT_SYSINFO, so user
> level just wouldn't even _try_ to use it).
>
> With that, there's nothing "special" about the vsyscall page, and I'd just
> go back to having the very last page unmapped (and have the vsyscall page
> in some other fixmap location that might even depend on kernel
> configuration).
>
> Whaddaya think?
>
> Linus
>



--
_________________________________________________________________________

Terje Eggestad mailto:[email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 150, Oppsal +47 975 31 574 (MOBILE)
N-0619 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________
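
For comparison with the quoted AT_SYSINFO proposal, a sketch of the consumer
side (an editor's illustration of assumed glibc-style startup handling, not
code from this thread; AT_SYSINFO later got the value 32 on Linux): scan the
ELF auxiliary vector, remember the entry point if the kernel supplied one, and
fall back to int $0x80 otherwise.

#include <elf.h>

#ifndef AT_SYSINFO
#define AT_SYSINFO 32			/* value later assigned on Linux */
#endif

static unsigned long vsyscall_entry;	/* 0 means: use int $0x80 */

static void scan_auxv(Elf32_auxv_t *auxv)
{
	for (; auxv->a_type != AT_NULL; auxv++)
		if (auxv->a_type == AT_SYSINFO)
			vsyscall_entry = auxv->a_un.a_val;
}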

2002-12-18 13:32:40

by Horst H. von Brand

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

[Extremely interesting new syscall mechanism thread elided]

What happened to "feature freeze"?
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2002-12-18 13:39:24

by Sean Neakums

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

commence Horst von Brand quotation:

> [Extremely interesting new syscall mechanism tread elided]
>
> What happened to "feature freeze"?

How are system calls a new feature? Or is optimizing an existing
feature not allowed by your definition of "feature freeze"?

--
/ |
[|] Sean Neakums | Questions are a burden to others;
[|] <[email protected]> | answers a prison for oneself.
\ |

2002-12-18 14:02:57

by Horst H. von Brand

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Sean Neakums <[email protected]> said:
> commence Horst von Brand quotation:
>
> > [Extremely interesting new syscall mechanism tread elided]
> >
> > What happened to "feature freeze"?
>
> How are system calls a new feature? Or is optimizing an existing
> feature not allowed by your definition of "feature freeze"?

This "optimizing" is very much userspace-visible, and a radical change in
an interface this fundamental counts as a new feature in my book.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2002-12-18 14:43:15

by Eric Dumazet

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

From: "Horst von Brand" <[email protected]>
> > How are system calls a new feature? Or is optimizing an existing
> > feature not allowed by your definition of "feature freeze"?
>
> This "optimizing" is very much userspace-visible, and a radical change in
> an interface this fundamental counts as a new feature in my book.

Since int 0x80 is supported / will be supported for the next 20 years, I don't
think this is a radical change.
Nothing userspace-visible at all.
You are free to use the old way of calling the kernel...

2002-12-18 15:04:15

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, 2002-12-18 at 13:40, Horst von Brand wrote:
> [Extremely interesting new syscall mechanism tread elided]
>
> What happened to "feature freeze"?

I'm wondering that. 2.5.49 was usable for devel work; no kernel since
has been. It's stopped IDE getting touched until January.

Linus, you are doing the slow slide into a second round of development
work again, just like mid 2.3, just like 1.3.60, ...

2002-12-18 16:34:15

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
> [Extremely interesting new syscall mechanism tread elided]
>
> What happened to "feature freeze"?

*bites lip* it's fairly low impact *duck*.
Given the wins seem to be fairly impressive across the board, spending
a few days on getting this right isn't going to push 2.6 back any
noticeable amount of time.

This stuff is mostly a case of "it either works, or it doesn't".
And right now, corner cases like apm aside, it seems to be holding up
so far. This isn't as far-reaching as it sounds. There are still
drivers being turned upside down which are changing things in a
lot bigger ways than this [1].

Dave

[1] Myself being one of the guilty parties there, wrt agp.

--
| Dave Jones. http://www.codemonkey.org.uk

2002-12-18 16:40:52

by Linus Torvalds

[permalink] [raw]
Subject: Freezing.. (was Re: Intel P6 vs P7 system call performance)



On Wed, 18 Dec 2002, Dave Jones wrote:
> On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
> > [Extremely interesting new syscall mechanism tread elided]
> >
> > What happened to "feature freeze"?
>
> *bites lip* it's fairly low impact *duck*.

However, it's a fair question.

I've been wondering how to formalize patch acceptance at code freeze, but
it might be a good idea to start talking about some way to maybe put
brakes on patches earlier, ie some kind of "required approval process".

I think the system call thing is very localized and thus not a big issue,
but in general we do need to have something in place.

I just don't know what that "something" should be. Any ideas? I thought
about the code freeze require buy-in from three of four people (me, Alan,
Dave and Andrew come to mind) for a patch to go in, but that's probably
too draconian for now. Or is it (maybe start with "needs approval by two"
and switch it to three when going into code freeze)?

Linus

2002-12-18 16:51:35

by Dave Jones

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, Dec 18, 2002 at 08:49:37AM -0800, Linus Torvalds wrote:

> > > What happened to "feature freeze"?
> > *bites lip* it's fairly low impact *duck*.
> However, it's a fair question.

Indeed. Were you merging something like preempt at this stage, I'd be wondering
if you'd broken out the eggnog a little too soon.

> I just don't know what that "something" should be. Any ideas? I thought
> about the code freeze require buy-in from three of four people (me, Alan,
> Dave and Andrew come to mind) for a patch to go in, but that's probably
> too draconian for now. Or is it (maybe start with "needs approval by two"
> and switch it to three when going into code freeze)?

You'd likely need an odd number of folks in this cabal^Winner circle
though, or would you just do it and be damned if you got an equal
number of 'aye's and 'nay's ? 8-)

Other than that, it reminds me of the way the gcc folks work, with a
number of people reviewing patches before acceptance [not that this
doesn't happen on l-k already], and at least 1 approval from someone
prepared to approve submissions.

The approval process does seem to be quite a lot of work though.
I think it was rth last year at OLS who told me that at that time
he'd been doing more approving of other people's stuff than coding himself.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-18 16:49:08

by Larry McVoy

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> I've been wondering how to formalize patch acceptance at code freeze, but
> it might be a good idea to start talking about some way to maybe put
> brakes on patches earlier, ie some kind of "required approval process".

We went through this here for the bk-3.0 release. We're a much smaller
team so this may not work at all for you, but it was very successful
for us, so much so that we are looking at formalizing it in BK. But
you can apply the same process outside of BK just fine.

We created a well known spot for pending patches; all reviewers need access
to that spot. Here's the README from that directory:

There should be the following subdirectories here

ready/ -> waiting on review
done/ -> in the tree
rejected/ -> no good


In the ready/ subdirectory, for each repository which has changes that
want to be in bk-3.0 but are not, I want:

ready/atrev -> /home/bk/wscott/bk-3.0-atrev
ready/atrev.RTI
ready/atrev.REVIEWED

The first is a symlink to the location of the repository.

The second is an RTI request which describes what is in the repo and why
it should go in.

The third contains the review comments in the form

lm (approved|not approved)
review comments
wscott (approved|not approved)
review comments
etc.

Once the REVIEWED file contains enough approvals, in the judgement
of the gatekeeper, then he pulls the repo into the bk-3.0 tree and moves
the 3 files from ready/* to done/*

The things which worked very well were:

a) extremely simple. As we added developers they understood right away
what the process was.
b) centralized location. Anyone could be bored and go do a review.
c) tight control on the tree.



We're thinking about formalizing this in the context of BK as follows:

NAME
bk queue - manage the queue of pending changes

DESCRIPTION
bk queue is used to manage a queue of changes to a repository.
It is typically used on integration repositories where tighter
controls on change are desirable.

In all commands, if no URL is specified, the implied URL is the
parent of the current repository, if any. The URL "." means this
repository.

XXX - need a large paragraph on the importance of not circulating
changesets which are in review state. They'll come back.

bk queue [-n<name>] [-R<rti>] [<URL>]
This is like a bk push but wants a "request to integrate"
(RTI) which is sent with the changes. It also wants a name
for the set of changes. All pending changesets are pushed.
If no name is given, the user is prompted for one. If no
RTI is given, the user is prompted for one.

bk queue -l [-n<name>] [<URL>]
Lists the set of pending changes in the queue like so:
<name> <date> <user> <state>

Values for the <state> field:
unreviewed - nobody has looked at it yet
reviewed by <reviewer> on <date> - obvious
accepted - it is in accepted state but not integrated
rejected - reviewed and rejected

Note that if there are multiple reviewers of a change, there
will be multiple lines in the listing for that change.

If the <name> arg is present then restrict the listing to
that name. If the <name> arg is present more than once,
restrict the listing to the set of named changes.

Could also have a -s<state> option which restricts the listing
to those changes in <state> state.

bk queue -pR [-o<file>] <name> [<URL>]
Retrieves and displays the RTI for change <name>.
If <file> is specified, put the form there.

bk queue -pr [-o<file>] [-u<user>] <name> [<URL>]
Retrieves and displays the review form[s] for change <name>.
If a user is specified, retrieve that users' review only.
If <file> is specified, put the form there.

bk queue -uR [<rti>] [<URL>]
Replaces any existing RTI with the specified RTI. If no RTI
is specified, it prompts you for one like bk setup does.

bk queue -ur [<review>] [<URL>]
Adds or replaces any existing review form with the specified
review. If no review is specified, it prompts you for one
like bk setup does. You may only replace your own reviews.

bk queue -O[<owner>] [<URL>]
Sets the owner of the repository to <owner>. Only the owner
may update the repository. Only the current owner can change
the ownership. If no owner is specified and there is an owner
and the caller is the owner, then delete the owner.
(This is nothing more than a pre-{incoming,commit}-owner trigger)

bk queue -d<name> [-f] [<URL>]
Delete the named change from the queue. This deletes EVERYTHING,
the patch, rti, reviews, everything. Only the submitter of the
change may delete the change unless the -f option is supplied.

bk queue -U<name> [-R<rti>] [<URL>]
Replace the changes in the queue <name> with the set of
changesets in the current repository. If the <rti> is
present, replace the current RTI form with the specified form.
All reviews, if any, are updated with a note that indicates
the existing review was against changes which have been replaced.

GUI
This is a command line tool; Bryan gets to do bk queuetool
using these interfaces.

TODO
- how do we merge?
- define a format for the RTI
- define a format for reviews
- should the RTI & review files be KV files?
- should the {name/RTI/REVIEWS} live as part of the repo and be
propagated? I think yes for upstream propagation, no for
downstream. Hard to say.
- need a way to add a queue item with no changes, i.e., an RFE which
needs to be in the tree but there are no changes yet.

FILES
BitKeeper/queue/<name>/CSETS - changeset keys for change <name>
BitKeeper/queue/<name>/RTI - RTI for change <name>
BitKeeper/queue/<name>/PATCH - BK patch for change <name>
BitKeeper/queue/<name>/RESYNC - exploded patch for change <name>
BitKeeper/queue/<name>/review.user - review by user for change <name>
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-12-18 17:00:53

by Andrew Morton

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Linus Torvalds wrote:
>
> On Wed, 18 Dec 2002, Dave Jones wrote:
> > On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
> > > [Extremely interesting new syscall mechanism tread elided]
> > >
> > > What happened to "feature freeze"?
> >
> > *bites lip* it's fairly low impact *duck*.
>
> However, it's a fair question.
>
> I've been wondering how to formalize patch acceptance at code freeze, but
> it might be a good idea to start talking about some way to maybe put
> brakes on patches earlier, ie some kind of "required approval process".
>
> I think the system call thing is very localized and thus not a big issue,
> but in general we do need to have something in place.
>
> I just don't know what that "something" should be. Any ideas? I thought
> about the code freeze require buy-in from three of four people (me, Alan,
> Dave and Andrew come to mind) for a patch to go in, but that's probably
> too draconian for now. Or is it (maybe start with "needs approval by two"
> and switch it to three when going into code freeze)?
>

It does sound a little bureaucratic for this point in development.

The first thing we need is a set of widely-understood guidelines.
Such as:

Only
- bugfixes
- speedups
- previously-agreed-to or in-progress features
- totally new things (new drivers, new filesystems)

Once everyone understands this framework then it becomes easy to
decide what to drop, what not.

So right now, sysenter is "in". Later, even "speedups" falls off
the list and sysenter would at that stage be "out".

Can it be that simple?

2002-12-18 16:58:48

by Eli Carter

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Linus Torvalds wrote:
>
> On Wed, 18 Dec 2002, Dave Jones wrote:
>
>>On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
>> > [Extremely interesting new syscall mechanism tread elided]
>> >
>> > What happened to "feature freeze"?
>>
>>*bites lip* it's fairly low impact *duck*.
>
>
> However, it's a fair question.
>
> I've been wondering how to formalize patch acceptance at code freeze, but
> it might be a good idea to start talking about some way to maybe put
> brakes on patches earlier, ie some kind of "required approval process".
>
> I think the system call thing is very localized and thus not a big issue,
> but in general we do need to have something in place.
>
> I just don't know what that "something" should be. Any ideas? I thought
> about the code freeze require buy-in from three of four people (me, Alan,
> Dave and Andrew come to mind) for a patch to go in, but that's probably
> too draconian for now. Or is it (maybe start with "needs approval by two"
> and switch it to three when going into code freeze)?

Well, Linus, you're not the most conservative when it comes to freezes.
(Hey! Watch it with those thunderbolts!) Alan, on the other hand, I
would trust to be pretty conservative.
I'm afraid I haven't followed Dave & Andrew well enough in that light.

But my question is... if 2 are required, and say, Dave is as slushy on
freezes as you are, then have we gained much?

Perhaps 2 of 4 approve with no dissenting votes?

If Dave and Andrew are relatively conservative on freezes, then this
concern is sufficiently addressed already.

Food for thought from a relative nobody. ;)

Eli
--------------------. "If it ain't broke now,
Eli Carter \ it will be soon." -- crypto-gram
eli.carter(a)inet.com `-------------------------------------------------

2002-12-18 17:32:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)



On Wed, 18 Dec 2002, Dave Jones wrote:
>
> > I just don't know what that "something" should be. Any ideas? I thought
> > about the code freeze require buy-in from three of four people (me, Alan,
> > Dave and Andrew come to mind) for a patch to go in, but that's probably
> > too draconian for now. Or is it (maybe start with "needs approval by two"
> > and switch it to three when going into code freeze)?
>
> You'd likely need an odd number of folks in this cabal^Winner circle
> though, or would you just do it and be damned if you got an equal
> number of 'aye's and 'nay's ? 8-)

Quite frankly, I wouldn't expect a lot of dissent.

I suspect a group approach has very little inherent disagreement, and to
me the main result of having an "approval process" is really just to slow
things down and make people think about submitting. The actual
approval itself is secondary (it _looks_ like a primary objective, but in
real life it's just the _existence_ of rules that makes more of a
difference).

> The approval process does seem to be quite a lot of work though.
> I think it was rth last year at OLS who told me that at that time
> he'd been doing more approving of other peoples stuff than coding himself.

I heartily disagree with the approval process for development, just
because it gets so much in the way and just annoys people. But for
stabilization, that's exactly what you want. So I think gcc is using the
approval process much too much, but apparently it works for them.

And I think it could work for the kernel too, especially the stable
releases and for the process of getting there. I just don't really know
how to set it up well.

Linus

2002-12-18 17:56:03

by Jeff Garzik

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Linus Torvalds wrote:
> On Wed, 18 Dec 2002, Dave Jones wrote:
>>The approval process does seem to be quite a lot of work though.
>>I think it was rth last year at OLS who told me that at that time
>>he'd been doing more approving of other peoples stuff than coding himself.
>
>
> I heartily disagree with the approval process for development, just
> because it gets so much in the way and just annoys people. But for
> stabilization, that's exactly what you want. So I think gcc is using the
> approval process much too much, but apparently it works for them.


gcc's approval process looks a lot like the Linux approval process.
Dave's description of rth's work sounds a lot like the Linus role in
Linux... with the exception, I guess, that there are multiple peer Linii
in gcc, and they read every patch <runs for cover>. More seriously, gcc
appears to be "post the patch to gcc-patches, hope someone applies it",
which is a lot more like Linux than some think :)

Jeff



2002-12-18 18:01:37

by Mike Dresser

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, 18 Dec 2002, Jeff Garzik wrote:

> Linux... with the exception I guess that there are multiple peer Linii

Perhaps this is the solution. Would someone please obtain a DNA sample
from Linus?

Mike

2002-12-18 18:18:00

by John Alvord

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, 18 Dec 2002 08:49:37 -0800 (PST), Linus Torvalds
<[email protected]> wrote:

>
>
>On Wed, 18 Dec 2002, Dave Jones wrote:
>> On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
>> > [Extremely interesting new syscall mechanism thread elided]
>> >
>> > What happened to "feature freeze"?
>>
>> *bites lip* it's fairly low impact *duck*.
>
>However, it's a fair question.
>
>I've been wondering how to formalize patch acceptance at code freeze, but
>it might be a good idea to start talking about some way to maybe put
>brakes on patches earlier, ie some kind of "required approval process".
>
>I think the system call thing is very localized and thus not a big issue,
>but in general we do need to have something in place.
>
>I just don't know what that "something" should be. Any ideas? I thought
>about the code freeze require buy-in from three of four people (me, Alan,
>Dave and Andrew come to mind) for a patch to go in, but that's probably
>too draconian for now. Or is it (maybe start with "needs approval by two"
>and switch it to three when going into code freeze)?
>
> Linus

I think there should be a distinction between changes which introduce an
API change/new function/user interface change, versus bug fixes,
adapting to new APIs, documentation, etc.

john alvord

2002-12-18 18:34:10

by Horst H. von Brand

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Dave Jones <[email protected]> said:
> On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
> > [Extremely interesting new syscall mechanism thread elided]
> >
> > What happened to "feature freeze"?

> *bites lip* it's fairly low impact *duck*.
> Given the wins seem to be fairly impressive across the board, spending
> a few days on getting this right isn't going to push 2.6 back any
> noticeable amount of time.

Ever hear Larry McVoy's [I think, please correct me if wrong] standard
rant of how $UNIX_FROM_BIG_VENDOR sucks, one "almost unnoticeable
performance impact" feature at a time?

Similarly, Fred Brooks tells in "The Mythical Man Month" how schedules
don't slip by months, they slip a day at a time...

> This stuff is mostly of the case "it either works, or it doesn't".
> And right now, corner cases like apm aside, it seems to be holding up
> so far. This isn't as far reaching as it sounds. There are still
> drivers being turned upside down which are changing things in a
> lot bigger ways than this[1]
>
> Dave
>
> [1] Myself being one of the guilty parties there, wrt agp.

Fixing a broken feature is in for me. Adding new features is supposed to be
out until 2.7 opens.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2002-12-18 18:55:31

by Mark Mielke

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, Dec 18, 2002 at 11:10:50AM -0300, Horst von Brand wrote:
> Sean Neakums <[email protected]> said:
> > How are system calls a new feature? Or is optimizing an existing
> > feature not allowed by your definition of "feature freeze"?
> This "optimizing" is very much userspace-visible, and a radical change in
> an interface this fundamental counts as a new feature in my book.

Since operating systems like WIN32 are at least published to take
advantage of SYSENTER, it may not be in Linux's interest to
purposefully use a slower interface until 2.8 (and how long will it be
until people can use that?). The last thing I want to read about in a
technical journal is how WIN32 has lower system call overhead than
Linux on IA-32 architectures. That might just be selfish of me for
the Linux community... :-)

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-12-18 19:00:39

by Alan Cox

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> And I think it could work for the kernel too, especially the stable
> releases and for the process of getting there. I just don't really know
> how to set it up well.

A start might be

1. Ack large patches you don't want with "Not for 2.6" instead
of ignoring them. I'm bored of seeing the 18th resend of
this and that wildly bogus patch.

Then people know the status

2. Apply patches only after they have been approved by the maintainer
of that code area.

Where it is core code run it past Andrew, Al and other people
with extremely good taste.

3. Anything which changes core stuff and needs new tools, setup
etc please just say NO to for now. Modules was a mistake (hindsight
I grant is a great thing), but it's done. We don't want any more.


4. Violate 1-3 when appropriate as always, but preferably not too
often and after consulting the good taste department 8)

Alan

2002-12-18 19:04:25

by Andrew Morton

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Stephen Rothwell wrote:
>
> Hi Linus, Andrew,
>
> On Tue, 17 Dec 2002 20:07:53 -0800 (PST) Linus Torvalds <[email protected]> wrote:
> >
> > Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> > the fact that the fast system call stuff required us to move the segments
> > around a bit. That's probably because the APM code has the old APM segment
> > numbers hardcoded somewhere, but I don't see where (I certainly knew about
> > the segment number issue, and tried to update the cases I saw).
>
> I looked at this yesterday and decided that it was OK as well.
>
> > Debugging help would be appreciated, especially from somebody who knows
> > the APM code.
>
> It would help to know what "unhappy" means :-)
>
> Does the following fix it for you? Untested, assumes cache lines are 32
> bytes.

It does fix the apmd oops, and APM works fine.

Here's the patch again. (But what happens if cachelines are not 32 bytes?)


--- 25/include/asm-i386/segment.h~sfr Wed Dec 18 10:54:07 2002
+++ 25-akpm/include/asm-i386/segment.h Wed Dec 18 10:54:07 2002
@@ -65,9 +65,9 @@
#define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11)

/*
- * The GDT has 23 entries but we pad it to cacheline boundary:
+ * The GDT has 25 entries but we pad it to cacheline boundary:
*/
-#define GDT_ENTRIES 24
+#define GDT_ENTRIES 28

#define GDT_SIZE (GDT_ENTRIES * 8)


_

2002-12-18 19:15:55

by Larry McVoy

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Make it async. So anyone can review stuff and record their feelings in a
centralized place. We have a spare machine set up, kernel.bkbits.net,
that could be used as a dumping grounds for patches and reviews if
master.kernel.org is too locked down.

If you force the review process into a "push" model where patches are
sent to someone, then you are stuck waiting for them to review it and
it may or may not happen. Do the reviews in a centralized place where
everyone can see them and add their own comments.

On Wed, Dec 18, 2002 at 02:08:02PM -0500, Alan Cox wrote:
> > And I think it could work for the kernel too, especially the stable
> > releases and for the process of getting there. I just don't really know
> > how to set it up well.
>
> A start might be
>
> 1. Ack large patches you don't want with "Not for 2.6" instead
> of ignoring them. I'm bored of seeing the 18th resend of
> this and that wildly bogus patch.
>
> Then people know the status
>
> 2. Apply patches only after they have been approved by the maintainer
> of that code area.
>
> Where it is core code run it past Andrew, Al and other people
> with extremely good taste.
>
> 3. Anything which changes core stuff and needs new tools, setup
> etc please just say NO to for now. Modules was a mistake (hindsight
> I grant is a great thing), but its done. We don't want any more
>
>
> 4. Violate 1-3 when appropriate as always, but preferably not to
> often and after consulting the good taste department 8)
>
> Alan

--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-12-18 19:25:41

by Larry McVoy

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, Dec 18, 2002 at 02:30:48PM -0500, Alan Cox wrote:
> We've got one - its called linux-kernel.

Huh? That's like saying "we don't need a bug database, we have a mailing
list". That's patently wrong and so is your statement. If you want
reviews you need some place to store them. A mailing list isn't storage.

You'll do it however you want of course, but you are being stupid about it.
Why is that?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-12-18 19:23:18

by Alan Cox

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

We've got one - its called linux-kernel.

Alan

2002-12-18 19:37:21

by Larry McVoy

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, Dec 18, 2002 at 02:42:51PM -0500, Alan Cox wrote:
> > On Wed, Dec 18, 2002 at 02:30:48PM -0500, Alan Cox wrote:
> > > We've got one - its called linux-kernel.
> >
> > Huh? That's like saying "we don't need a bug database, we have a mailing
> > list". That's patently wrong and so is your statement. If you want
> > reviews you need some place to store them. A mailing list isn't storage.
> >
> > You'll do it however you want of course, but you are being stupid about it.
> > Why is that?
>
> We've got a bug database (bugzilla), we've got a system for seeing what opinion
> appears to be -kernel-list

And exactly how is your statement different than

"we have a system for seeing what bugs appear to be -kernel-list"

?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-12-18 19:35:00

by Alan Cox

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> On Wed, Dec 18, 2002 at 02:30:48PM -0500, Alan Cox wrote:
> > We've got one - its called linux-kernel.
>
> Huh? That's like saying "we don't need a bug database, we have a mailing
> list". That's patently wrong and so is your statement. If you want
> reviews you need some place to store them. A mailing list isn't storage.
>
> You'll do it however you want of course, but you are being stupid about it.
> Why is that?

We've got a bug database (bugzilla), we've got a system for seeing what opinion
appears to be -kernel-list


2002-12-18 19:42:47

by Oliver Xymoron

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, Dec 18, 2002 at 09:41:15AM -0800, Linus Torvalds wrote:
>
> > The approval process does seem to be quite a lot of work though.
> > I think it was rth last year at OLS who told me that at that time
> > he'd been doing more approving of other peoples stuff than coding himself.
>
> I heartily disagree with the approval process for development, just
> because it gets so much in the way and just annoys people. But for
> stabilization, that's exactly what you want. So I think gcc is using the
> approval process much too much, but apparently it works for them.
>
> And I think it could work for the kernel too, especially the stable
> releases and for the process of getting there. I just don't really know
> how to set it up well.

Actually, I think Marcelo's got the stable process pretty well
figured out without any of this committee business. And given that his
credibility as 2.4 maintainer depends on his holding to the mandate to
make the kernel stable, he probably doesn't have too hard a time
holding the line. As benevolent dictator, you're simply not beholden
to such expectations and I doubt the committee approach would work for
long either.

So perhaps you should throw out a date for 'code freeze' and then plan to
hand off to the 2.6 maintainer at that date.

The other piece that will help is if the timeline for 2.7 shows up
around then and is short enough so that people won't despair of ever
getting their big feature in.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-12-18 20:06:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Terje Eggestad wrote:
> what about:
>
> int (*_vsyscall) (int, ...);
> _vsyscall = mmap(NULL, getpagesize(), PROT_READ|PROT_EXEC,
> MAP_VSYSCALL, , );
>
> or if you're afraid of running out of MAP_* flags:
>
> fd = open("/dev/vsyscall", );
> _vsyscall = mmap(NULL, getpagesize(), PROT_READ|PROT_EXEC, MAP_SHARED,
> fd, 0);
>
> Then you can leisurely map it in just after the programs text segment.
>

Very ugly -- then the application has to do indirect calls.

-hpa


2002-12-18 20:14:28

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


The number of CPU clocks necessary to make the 'far' or
full-pointer call by pushing the segment register, the offset,
then issuing a 'lret' is 33 clocks on a Pentium II.

longcall clocks = 46
call clocks = 13
actual full-pointer call clocks = 33

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 1
cpu MHz : 399.573
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 797.90

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 1
cpu MHz : 399.573
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 797.90


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


Attachments:
sysenter.tar.gz (3.87 kB)

2002-12-18 20:20:11

by John Bradford

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> On Wed, Dec 18, 2002 at 02:42:51PM -0500, Alan Cox wrote:
> > > On Wed, Dec 18, 2002 at 02:30:48PM -0500, Alan Cox wrote:
> > > > We've got one - its called linux-kernel.
> > >
> > > Huh? That's like saying "we don't need a bug database, we have a mailing
> > > list". That's patently wrong and so is your statement. If you want
> > > reviews you need some place to store them. A mailing list isn't storage.
> > >
> > > You'll do it however you want of course, but you are being
> > > stupid about it.
> > > Why is that?
> >
> > We've got a bug database (bugzilla), we've got a system for seeing
> > what opinion appears to be -kernel-list
>
> And exactly how is your statement different than
>
> "we have a system for seeing what bugs appear to be -kernel-list"
>
> ?

This forthcoming BK-related flamewar falls into category 1, i.e. is
not a 2.6 feature :-)

John.

2002-12-18 20:18:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Richard B. Johnson wrote:
> The number of CPU clocks necessary to make the 'far' or
> full-pointer call by pushing the segment register, the offset,
> then issuing a 'lret' is 33 clocks on a Pentium II.
>
> longcall clocks = 46
> call clocks = 13
> actual full-pointer call clocks = 33

That's not a call, that's a jump. Comparing it to a call instruction is
meaningless.

-hpa

2002-12-18 21:52:59

by Nakajima, Jun

[permalink] [raw]
Subject: RE: Freezing.. (was Re: Intel P6 vs P7 system call performance)

BTW, in terms of validation, I think we might want to compare the results from LTP (http://ltp.sourceforge.net/), for example, by having it run on the two setups (sysenter/sysexit and int/iret).

Jun

> -----Original Message-----
> From: Linus Torvalds [mailto:[email protected]]
> Sent: Wednesday, December 18, 2002 8:50 AM
> To: Dave Jones
> Cc: Horst von Brand; [email protected]; Alan Cox; Andrew Morton
> Subject: Freezing.. (was Re: Intel P6 vs P7 system call performance)
>
>
>
> On Wed, 18 Dec 2002, Dave Jones wrote:
> > On Wed, Dec 18, 2002 at 10:40:24AM -0300, Horst von Brand wrote:
> > > [Extremely interesting new syscall mechanism thread elided]
> > >
> > > What happened to "feature freeze"?
> >
> > *bites lip* it's fairly low impact *duck*.
>
> However, it's a fair question.
>
> I've been wondering how to formalize patch acceptance at code freeze, but
> it might be a good idea to start talking about some way to maybe put
> brakes on patches earlier, ie some kind of "required approval process".
>
> I think the system call thing is very localized and thus not a big issue,
> but in general we do need to have something in place.
>
> I just don't know what that "something" should be. Any ideas? I thought
> about the code freeze require buy-in from three of four people (me, Alan,
> Dave and Andrew come to mind) for a patch to go in, but that's probably
> too draconian for now. Or is it (maybe start with "needs approval by two"
> and switch it to three when going into code freeze)?
>
> Linus
>

2002-12-18 22:01:33

by Larry McVoy

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> > And exactly how is your statement different than
> >
> > "we have a system for seeing what bugs appear to be -kernel-list"
> > ?
>
> This forthcoming BK-related flamewar falls in to category 1, I.E. is
> not a 2.6 feature :-)

I don't understand why BK is part of the conversation. It has nothing to
do with it. If every time I post to this list the assumption is that it's
"time to beat larry up about BK" then it's time for me to get off the list.

I can understand it when we're discussing BK; other than that, it's pretty
friggin lame. If that's what was behind your posts, Alan, there is an
easy procmail fix for that.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-12-18 22:18:59

by John Bradford

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> > This forthcoming BK-related flamewar falls in to category 1, I.E. is
> > not a 2.6 feature :-)
>
> I don't understand why BK is part of the conversation. It has nothing to
> do with it. If every time I post to this list the assumption is that it's
> "time to beat larry up about BK" then it's time for me to get off
> the list.
> I can understand it when we're discussing BK; other than that, it's pretty
> friggin lame. If that's what was behind your posts, Alan, there is an
> easy procmail fix for that.

My interpretation was that that is what he meant. If I was wrong, I
apologise.

I was trying to point out in an amusing way that a repeat of the BK
flamewar we've seen on LKML was inappropriate.

John.

2002-12-18 22:23:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

H. Peter Anvin wrote:
> Terje Eggestad wrote:
> > fd = open("/dev/vsyscall", );
> > _vsyscall = mmap(NULL, getpagesize(), PROT_READ|PROT_EXEC, MAP_SHARED,
> > fd, 0);
>
> Very ugly -- then the application has to do indirect calls.

No it doesn't.

The application, or library, would map the vsyscall page to an address
in its own data section. This means that position-independent code
can do vsyscalls without any relocations, and hence without dirtying
its own caller pages.

In some ways this is better than the 0xfffe0000 address: _that_
requires position-independent code to do indirect calls to the
absolute address, or to dirty its caller pages.

That said, you always need the page at 0xfffe0000 mapped anyway, so
that sysexit can jump to a fixed address (which is fastest).

-- Jamie
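
A minimal sketch of the mapping Jamie describes, assuming the hypothetical
/dev/vsyscall device from Terje's earlier proposal (nothing in the actual
patches works this way). The library reserves a page-aligned slot in its own
image and maps the kernel page over it, so position-independent callers can
reach the entry point with a plain PC-relative call:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* page-aligned slot inside the library's own image */
    static char vsyscall_slot[4096] __attribute__((aligned(4096)));

    int main(void)
    {
            int fd = open("/dev/vsyscall", O_RDONLY);  /* hypothetical device */
            void *p;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* overlay the kernel's trampoline page on our reserved slot */
            p = mmap(vsyscall_slot, getpagesize(), PROT_READ | PROT_EXEC,
                     MAP_SHARED | MAP_FIXED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            printf("vsyscall entry mapped at %p\n", p);
            return 0;
    }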

2002-12-18 22:31:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Jamie Lokier wrote:
> H. Peter Anvin wrote:
>
>>Terje Eggestad wrote:
>>
>>>fd = open("/dev/vsyscall", );
>>>_vsyscall = mmap(NULL, getpagesize(), PROT_READ|PROT_EXEC, MAP_SHARED,
>>>fd, 0);
>>
>>Very ugly -- then the application has to do indirect calls.
>
>
> No it doesn't.
>
> The application, or library, would map the vsyscall page to an address
> in its own data section. This means that position-independent code
> can do vsyscalls without any relocations, and hence without dirtying
> its own caller pages.
>

Oh, I see... you don't really mean NULL in the first argument :)

This has one additional advantage: an application which wants to
override vsyscalls can simply map something instead of the kernel page,
and UML can present its own vsyscall page.

> In some ways this is better than the 0xfffe0000 address: _that_
> requires position-independent code to do indirect calls to the
> absolute address, or to dirty its caller pages.
>
> That said, you always need the page at 0xfffe0000 mapped anyway, so
> that sysexit can jump to a fixed address (which is fastest).

That's a possibility, or if the task_struct contains the desired return
address for a particular process that might also work -- it's just a GPR
after all.

-hpa



2002-12-18 22:31:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Wed, 18 Dec 2002, Jamie Lokier wrote:
>
> That said, you always need the page at 0xfffe0000 mapped anyway, so
> that sysexit can jump to a fixed address (which is fastest).

Yes. This is important. There _needs_ to be some fixed address at least as
far as the kernel is concerned (it might move around between reboots or
something like that, but it needs to be something the kernel knows about
intimately and doesn't need lots of dynamic lookup).

However, there's another issue, namely process startup cost. I personally
want it to be as light as at all possible. I hate doing an "strace" on
user processes and seeing tons and tons of crapola showing up. Just for
fun, do a

strace /bin/sh -c "echo hello"

to see what I'm talking about. And that's actually a _lot_ better these
days than it used to be.

Anyway, I really hate to see "unnecessary crap" in the user mode startup
just because kernel interfaces are bad. That's why I like the AT_SYSINFO
ELF auxiliary table approach - it's something that is already _there_ for
the process to just take advantage of. Having to do a magic mmap for
something that everybody needs to do is just bad design.

Linus
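
For reference, a rough sketch of what the AT_SYSINFO approach looks like from
userspace: the kernel passes the entry point address in the ELF auxiliary
vector, which already sits just past the environment pointer array, so no
extra open or mmap is needed. This assumes a 32-bit process and the
AT_SYSINFO/Elf32_auxv_t definitions from elf.h; it is only an illustration,
not code from the patch:

    #include <elf.h>
    #include <stdio.h>

    static unsigned long find_sysinfo(char **envp)
    {
            Elf32_auxv_t *aux;

            while (*envp)           /* skip to the NULL that ends envp */
                    envp++;
            /* the auxv entries start right after that NULL */
            for (aux = (Elf32_auxv_t *)(envp + 1); aux->a_type != AT_NULL; aux++)
                    if (aux->a_type == AT_SYSINFO)
                            return aux->a_un.a_val;
            return 0;
    }

    int main(int argc, char **argv, char **envp)
    {
            printf("AT_SYSINFO = %#lx\n", find_sysinfo(envp));
            return 0;
    }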

2002-12-18 22:51:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


Btw, I'm pushing what looks like the "final" version of sysenter/sysexit
for now. There may be bugs left, but all the known issues are resolved:

- single-stepping over the system call now works. It doesn't actually see
all of the user-mode instructions, since the fast system call interface
does not lend itself well to restoring "TF" in eflags on return, but
the trampoline code saves and restores the flags, so you will be able
to step over the important bits.

(ptrace also doesn't actually allow you to look at the instruction
contents in high memory, so gdb won't see the instructions in the
user-mode fast system call trampoline even when it can single-step
them, and I don't think I'll bother to fix it up).

- NMI at the "wrong" time (just before first instruction in kernel
space) should now be a non-issue. The per-CPU SEP stack looks like a
real (nonpreemptable) process, and follows all the conventions needed
for "current_thread_info()" and friends. This behaviour is also
triggered by the single-step debug trap, so while I've obviously not
tested NMI behaviour, I _have_ tested the very same concept at that
exact point.

- The APM problem was confirmed by Andrew to apparently be just a GDT
that was too small for the new layout.

This is in addition to the six-argument issues and the glibc address query
issues that were resolved yesterday.

Linus

2002-12-18 23:43:15

by Billy Rose

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Richard B. Johnson wrote:
> The number of CPU clocks necessary to make the 'far' or
> full-pointer call by pushing the segment register, the offset,
> then issuing a 'lret' is 33 clocks on a Pentium II.
>
> longcall clocks = 46
> call clocks = 13
> actual full-pointer call clocks = 33

this is not correct. the assumed target (of a _far_ call) would issue a far
return, and only an offset would be left on the stack to return to (oops). the
code segment of the original caller needs to be pushed to create the seg:off
pair, and hence a far return would land back at the original calling routine.
this is a very convoluted method of making the original call far, as simply
calling far in the first place should issue much faster. OTOH, if you are
making a workaround to an already existing piece of code, this works
beautifully (with the additional seg pushed on the stack).

b.

2002-12-19 00:01:07

by Alan Cox

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> I don't understand why BK is part of the conversation. It has nothing to
> do with it. If every time I post to this list the assumption is that it's
> "time to beat larry up about BK" then it's time for me to get off the list.
>
> I can understand it when we're discussing BK; other than that, it's pretty
> friggin lame. If that's what was behind your posts, Alan, there is an
> easy procmail fix for that.

It wasn't me who brought up bitkeeper

2002-12-19 00:10:47

by Alan Cox

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> > > I can understand it when we're discussing BK; other than that, it's pretty
> > > friggin lame. If that's what was behind your posts, Alan, there is an
> > > easy procmail fix for that.
> >
> > It wasnt me who brought up bitkeeper
>
> PLONK. Into kernel-spam you go. I've had it with ax grinders.

Oh dear me. Larry McVoy has flipped

I'm now being added to his spam list for *not* mentioning bitkeeper

Poor Larry, I hope he has a nice Christmas break, he clearly needs it

2002-12-19 00:21:00

by Alan

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, 2002-12-18 at 22:37, John Bradford wrote:
> I was trying to point out in an amusing way that a repeat of the BK
> flamewar we've seen on LKML was inappropriate.

I got the joke but I don't have a US postal address 8)

More seriously we have defect tracking now -> bugzilla.kernel.org
We have an advanced scalable groupware communication environment (email)

How the actual patches get applied really isn't relevant. I know Linus
hated jitterbug, I'm guessing he hates bugzilla too?

2002-12-19 00:29:44

by Russell King

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Thu, Dec 19, 2002 at 01:09:17AM +0000, Alan Cox wrote:
> How the actual patches get applied really isnt relevant. I know Linus
> hated jitterbug, Im guessing he hates bugzilla too ?

I'm waiting for the kernel bugzilla to become useful - currently the
record for me has been:

3 bugs total
3 bugs for serial code for drivers I don't maintain, reassigned to mbligh.

This means I write (choose one):

1. non-buggy code (highly unlikely)
2. code no one tests
3. code people do test but report via other means (eg, email, irc)

If it's (3), which it seems to be, it means that bugzilla is failing to
do its job properly, which is most unfortunate.

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-12-19 00:30:00

by Larry McVoy

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, Dec 18, 2002 at 07:18:44PM -0500, Alan Cox wrote:
> > > > I can understand it when we're discussing BK; other than that, it's pretty
> > > > friggin lame. If that's what was behind your posts, Alan, there is an
> > > > easy procmail fix for that.
> > >
> > > It wasnt me who brought up bitkeeper
> >
> > PLONK. Into kernel-spam you go. I've had it with ax grinders.
>
> Oh dear me. Larry McVoy has flipped
>
> I'm now being added to his spam list for *not* mentioning bitkeeper
>
> Poor Larry, I hope has a nice christmas break, he clearly needs it

Look, Alan and anyone else, I'm sort of sick of the flames about BK.
It's apparent that there will always be people who are looking for
excuses to attack BK because it isn't GPLed and how dare the kernel
hackers use it. Your mail was so senseless that that was the only sane
explanation I could find and apparently I wasn't being paranoid, that's
what John thought as well.

I have a bad habit of taking things personally and too seriously and
the result is that attacks on me/BK/whatever, imagined or real, stress
me out and waste my time. Life's too short for me to deal with that
nonsense anymore. I discovered procmail and I dump people into a spam
file if I feel they have a track record of yanking my chain. It's my
fault that I'm such a wuss that I can't handle it but this works.
It's not personal, it's about having a more pleasant life and I find
things to be more pleasant without the flames.

I'll still read your mail, I do so about every 2 weeks, but that way
whatever yankage you were (or were not) trying to do is in the past and
I'll ignore it.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-12-19 00:34:12

by John Bradford

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> > I don't understand why BK is part of the conversation. It has nothing to
> > do with it. If every time I post to this list the assumption is that it's
> > "time to beat larry up about BK" then it's time for me to get off the list.
> >
> > I can understand it when we're discussing BK; other than that, it's pretty
> > friggin lame. If that's what was behind your posts, Alan, there is an
> > easy procmail fix for that.
>
> It wasnt me who brought up bitkeeper
>

No, it's my fault - I was skimming through list traffic, and not
concentrating, (proof of this is the fact that I've had sendmail
configured incorrectly all day, and been posting from the wrong
address, and only just realised :-) ).

I saw Larry mention kernel.bkbits.net, and Alan say, "We've got one -
its called linux-kernel", (in a separate message without quoting
anything, so it's really your fault :-) :-) :-) ), and assumed that a
BK argument was imminent, and I made a joke comment that it, (an
argument), was not a 2.6 required feature.

Sorry about the wasted bandwidth, I'll stop posting as it's now past
midnight, and I obviously need sleep.

Oh, 2.4.20-pre2 compiled OK for me, I hope that proves I've done
something useful tonight.

John.

2002-12-19 00:40:41

by John Bradford

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> > I was trying to point out in an amusing way that a repeat of the BK
> > flamewar we've seen on LKML was inappropriate.
>
> I got the joke but I don't have a US postal address 8)

Eh??? US postal address? What!? Now I am really confused.

> More seriously we have defect tracking now - > bugzilla.kernel.org
> We have an advanced scalable groupware communication environment (email)
>
> How the actual patches get applied really isnt relevant. I know Linus
> hated jitterbug, Im guessing he hates bugzilla too ?

I don't like bugzilla particularly, it's too clunky, and it's
difficult to check that you are not entering a duplicate bug when the
database gets too big. Maybe that's just my opinion, though. Maybe I
should write a better bug tracking system...

John.

2002-12-19 00:50:46

by Jeff Garzik

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Thu, Dec 19, 2002 at 12:37:40AM +0000, Russell King wrote:
> This means I write (choose one):
> 3. code people do test but report via other means (eg, email, irc)

> If it's (3), which it seems to be, it means that bugzilla is failing to
> do its job properly, which is most unfortunate.

Given that it started around Halloween, I would at least give it a
chance before claiming its failure. :)

IMO Bugzilla is gonna become even more useful as the code freeze hits,
and there are bugs we want to track until we get around to fixing
them...

Jeff




2002-12-19 01:02:01

by Adam J. Richter

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Russell King wrote:
>I'm waiting for the kernel bugzilla to become useful - currently the
>record for me has been:
>
>3 bugs total
>3 bugs for serial code for drivers I don't maintain, reassigned to mbligh.
>
>This means I write (choose one):
>
>1. non-buggy code (highly unlikely)
>2. code no one tests
>3. code people do test but report via other means (eg, email, irc)
>
>If it's (3), which it seems to be, it means that bugzilla is failing to
>do its job properly, which is most unfortunate.

I don't currently use bugzilla (just due to inertia), but the
whole world doesn't have to switch to something overnight in order for
that facility to end up saving more time and resources than it has
cost. Adoption can grow gradually, and it's probably easier to work
out bugs (in bugzilla) and improvements that way anyhow.

Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."

2002-12-19 01:11:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)


On 19 Dec 2002, Alan Cox wrote:
>
> How the actual patches get applied really isnt relevant. I know Linus
> hated jitterbug, Im guessing he hates bugzilla too ?

I didn't start out hating jitterbug, I tried it for a while.

I ended up not really being able to work with anything that was so
email-hostile. You had to click on things from a browser, write passwords,
and generally just act "gooey", instead of getting things just _done_.

If I can't do my work by email from a standard keyboard interface, it's
just not worth it. Maybe bugzilla works better, but I seriously expect it
to help _others_ track bugs more than it helps me.

Which is fine. We don't all have to agree on the tools or on how to track
stuff. The worst we can do (I think) is to _force_ people to work some
way.

[ This is where the angel chorus behind me started singing "Why can't we
all live together in peace and harmony" and put up a big banner saying
"Larry [heart] Alan". At that point my fever-induced brain just said
"plop" ]

Linus

2002-12-19 01:42:09

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> This means I write (choose one):
>
> 1. non-buggy code (highly unlikely)
> 2. code no one tests
> 3. code people do test but report via other means (eg, email, irc)
>
> If it's (3), which it seems to be, it means that bugzilla is failing to
> do its job properly, which is most unfortunate.

Not everyone will end up using it ... if people want to log bugs from
lkml into bugzilla, I think that'd help gather a critical mass.

Are you getting a lot of bug-reports for serial code on lkml? I use it
heavily, and it seems to work just fine to me .... so I pick (1). Yay! ;-)

Some of the bugs in there lie fallow, but I've seen quite a few get fixed.
The fact that some people (Dave Jones springs to mind) trawl through there
being extremely helpful fixing things is very useful ;-) Lots of things got
fixed, though I can't *prove* it was solely due to it being in Bugzilla.

As the list of bugs increases, it'll become an increasingly powerful
search engine for information as well .... I'll draw up a list of things
that don't seem to be getting worked on, and mail it out to kernel-janitors
and/or lkml and see if people are interested in fixing some of the fallow
stuff.

M.

2002-12-19 06:29:56

by Timothy D. Witham

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Related thought:

One of the things that we are trying to do is to automate
patch testing.

The PLM (http://www.osdl.org/plm) takes every patch that it gets
and does a quick "Does it compile test". Right now there
are only 4 kernel configuration files that we try but we are
going to be adding more. We could expand this to 100's
if needed as it would just be a matter of adding additional
hardware to make the compiles go faster in parallel.

Here is the example of the output from a baseline kernel.

http://www.osdl.org/cgi-bin/plm?module=patch_info&patch_id=986

A patch would look the same. The PASS reports are really
short and the FAIL reports just give you the configuration
files and the tail of the output from the kernel make.

We've talked to a couple of system vendors about expanding
this to take the configurations that have passed and running
them on their 10's of hardware platforms of interest and we
would be very happy to expand this to a very large number of
configurations of all sorts.

Tim

On Wed, 2002-12-18 at 11:08, Alan Cox wrote:
> > And I think it could work for the kernel too, especially the stable
> > releases and for the process of getting there. I just don't really know
> > how to set it up well.
>
> A start might be
>
> 1. Ack large patches you don't want with "Not for 2.6" instead
> of ignoring them. I'm bored of seeing the 18th resend of
> this and that wildly bogus patch.
>
> Then people know the status
>
> 2. Apply patches only after they have been approved by the maintainer
> of that code area.
>
> Where it is core code run it past Andrew, Al and other people
> with extremely good taste.
>
> 3. Anything which changes core stuff and needs new tools, setup
> etc please just say NO to for now. Modules was a mistake (hindsight
> I grant is a great thing), but its done. We don't want any more
>
>
> 4. Violate 1-3 when appropriate as always, but preferably not to
> often and after consulting the good taste department 8)
>
> Alan
--
Timothy D. Witham <[email protected]>
Open Source Development Lab, Inc

2002-12-19 06:35:51

by Andrew Morton

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

"Timothy D. Witham" wrote:
>
> Related thought:
>
> One of the things that we are trying to do is to automate
> patch testing.
>
> The PLM (http://www.osdl.org/plm) takes every patch that it gets
> and does a quick "Does it compile test". Right now there
> are only 4 kernel configuration files that we try but we are
> going to be adding more. We could expand this to 100's
> if needed as it would just be a matter of adding additional
> hardware to make the compiles go faster in parallel.

It would be valuable to be able to test that things compile
cleanly on non-ia32 machines. And boot, too.

That's probably a lot of ongoing work though.

2002-12-19 06:40:29

by Timothy D. Witham

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, 2002-12-18 at 22:43, Andrew Morton wrote:
> "Timothy D. Witham" wrote:
> >
> > Related thought:
> >
> > One of the things that we are trying to do is to automate
> > patch testing.
> >
> > The PLM (http://www.osdl.org/plm) takes every patch that it gets
> > and does a quick "Does it compile test". Right now there
> > are only 4 kernel configuration files that we try but we are
> > going to be adding more. We could expand this to 100's
> > if needed as it would just be a matter of adding additional
> > hardware to make the compiles go faster in parallel.
>
> It would be valuable to be able to test that things compile
> cleanly on non-ia32 machines. And boot, too.
>
The way the software is configured it is fairly easy to
add multiple servers (even different instruction sets) that
have the compiles farmed out to them.

> That's probably a lot of ongoing work though.

The largest portion of the work would be keeping
up with the breakages in the trees.

BTW I'm in Japan so my access times are going to be
a little strange.
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2002-12-19 06:57:32

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

> Related thought:
>
> One of the things that we are trying to do is to automate
> patch testing.
>
> The PLM (http://www.osdl.org/plm) takes every patch that it gets
> and does a quick "Does it compile test". Right now there
> are only 4 kernel configuration files that we try but we are
> going to be adding more. We could expand this to 100's
> if needed as it would just be a matter of adding additional
> hardware to make the compiles go faster in parallel.

URL doesn't seem to work. But would be cool if you had one SMP
config, one UP with IO/APIC, and one without IO/APIC. I seem
to break the middle one whenever I write a patch ;-(

M.

2002-12-19 07:02:53

by Timothy D. Witham

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Sorry, they changed it last week and my fingers still
have the old firmware.

http://www.osdl.org/cgi-bin/plm

Tim

On Wed, 2002-12-18 at 23:05, Martin J. Bligh wrote:
> > Related thought:
> >
> > One of the things that we are trying to do is to automate
> > patch testing.
> >
> > The PLM (http://www.osdl.org/plm) takes every patch that it gets
> > and does a quick "Does it compile test". Right now there
> > are only 4 kernel configuration files that we try but we are
> > going to be adding more. We could expand this to 100's
> > if needed as it would just be a matter of adding additional
> > hardware to make the compiles go faster in parallel.
>
> URL doesn't seem to work. But would be cool if you had one SMP
> config, one UP with IO/APIC, and one without IO/APIC. I seem
> to break the middle one whenever I write a patch ;-(
>
> M.
>
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2002-12-19 09:05:38

by Russell King

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Wed, Dec 18, 2002 at 05:08:45PM -0800, Adam J. Richter wrote:
> I don't currently use bugzilla (just due to inertia), but the
> whole world doesn't have to switch to something overnight in order for
> that facility to end up saving more time and resources than it has
> cost. Adoption can grow gradually, and it's probably easier to work
> out bugs (in bugzilla) and improvements that way anyhow.

I'm not asking the world to switch to it overnight. Just one person
would be nice. 8)

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-12-19 10:20:54

by Dave Jones

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Thu, Dec 19, 2002 at 12:59:20AM +0000, John Bradford wrote:

> I don't like bugzilla particularly, it's too clunky, and it's
> difficult to check that you are not entering a duplicate bug when the
> database gets too big.

File the bug anyway and worry about it later. The bugzilla elves regularly
go through the database cleaning up crufty bits, marking dupes,
closing invalids, world peace etc etc. It seems to be holding
up well so far. Of the 180 bugs filed, I think I've personally rejected
<10 dupes/invalids. Other folks haven't rejected that many either.
Here's to hoping it continues to remain high signal.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk

2002-12-19 10:43:55

by Dave Jones

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

On Thu, Dec 19, 2002 at 12:37:40AM +0000, Russell King wrote:
> On Thu, Dec 19, 2002 at 01:09:17AM +0000, Alan Cox wrote:
> > How the actual patches get applied really isnt relevant. I know Linus
> > hated jitterbug, Im guessing he hates bugzilla too ?
>
> I'm waiting for the kernel bugzilla to become useful - currently the
> record for me has been:
>
> 3 bugs total
> 3 bugs for serial code for drivers I don't maintain, reassigned to mbligh.

That was unfortunate, and you got dumped with those because some thought
"Ah, serial! RMK!". Some of the categories in bugzilla still need
broadening IMO.

> This means I write (choose one):
> 1. non-buggy code (highly unlikely)
> 2. code no one tests
> 3. code people do test but report via other means (eg, email, irc)
>
> If it's (3), which it seems to be, it means that bugzilla is failing to
> do its job properly, which is most unfortunate.

It's early days. The types of bugs being filed still fall into the
"useful"/"not useful" categories though. I don't think it's really
that important that we track what doesn't compile at this stage.
Those reports are either being closed within a few hours of being
opened with a "Fixed in BK", or are against drivers which no-one currently
wants to fix/can fix (things like the various sti/cli breakage).

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-19 13:00:29

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, 18 Dec 2002 [email protected] wrote:

> Richard B. Johnson wrote:
> > The number of CPU clocks necessary to make the 'far' or
> > full-pointer call by pushing the segment register, the offset,
> > then issuing a 'lret' is 33 clocks on a Pentium II.
> >
> > longcall clocks = 46
> > call clocks = 13
> > actual full-pointer call clocks = 33
>
> this is not correct. the assumed target (of a _far_ call) would issue a far
> return, and only an offset would be left on the stack to return to (oops). the
> code segment of the original caller needs to be pushed to create the seg:off
> pair, and hence a far return would land back at the original calling routine.
> this is a very convoluted method of making the original call far, as simply
> calling far in the first place should issue much faster. OTOH, if you are
> making a workaround to an already existing piece of code, this works
> beautifully (with the additional seg pushed on the stack).
>

The target, i.e., the label 'goto' would be the reserved page for the
system call. The whole purpose was to minimize the number of CPU cycles
necessary to call 0xfffff000 and return. The system call does not have to
issue a 'far' return; it can do anything it requires. The page at
0xfffff000 is mapped into every process and is in that process' CS space
already.

I have already gotten responses from people who looked at the code
and said it was broken. It is not broken. It does what is expected.
It takes the same number of CPU cycles as:

pushl $0xfffff000
call *(%esp)
addl $4, %esp

... which is the current proposal. It has the advantage that only
the return address is on the stack when the target code is executed.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
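
For comparison, the "current proposal" sequence above can also be reached from
C through an indirect call, so no link-time fixup of a relative displacement
is needed. This is only a sketch: it assumes the kernel maps the entry page at
0xfffff000 and that the page uses the int $0x80 register convention (system
call number in %eax; __NR_getpid is 20 on i386):

    #include <stdio.h>

    static int vsys_getpid(void)
    {
            int ret;
            /* indirect near call to the fixed-address entry page */
            asm volatile("call *%1"
                         : "=a" (ret)
                         : "r" (0xfffff000UL), "a" (20)
                         : "ecx", "edx", "memory");
            return ret;
    }

    int main(void)
    {
            printf("pid via 0xfffff000: %d\n", vsys_getpid());
            return 0;
    }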


2002-12-19 13:09:32

by Stephen Satchell

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

At 02:08 PM 12/18/02 -0800, Larry McVoy wrote:
>I don't understand why BK is part of the conversation. It has nothing to
>do with it. If every time I post to this list the assumption is that it's
>"time to beat larry up about BK" then it's time for me to get off the list.
>
>I can understand it when we're discussing BK; other than that, it's pretty
>friggin lame. If that's what was behind your posts, Alan, there is an
>easy procmail fix for that.

Boy, talk about humor-impaired. When was the last time you got out and had
some fun not related to computers, Larry?

I don't read more than 95 percent of this mailing list and I got the joke.

Lighten up, and take a hint from the nearest cat: see the toy in everything.

Satch, another relative nobody.


--
The human mind treats a new idea the way the body treats a strange
protein: it rejects it. -- P. Medawar
This posting is for entertainment purposes only; it is not a legal opinion.

2002-12-19 13:14:56

by Bart Hartgers

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On 18 Dec, Linus Torvalds wrote:
>
> On Wed, 18 Dec 2002, Jamie Lokier wrote:
>>
>> That said, you always need the page at 0xfffe0000 mapped anyway, so
>> that sysexit can jump to a fixed address (which is fastest).
>
> Yes. This is important. There _needs_ to be some fixed address at least as
> far as the kernel is concerned (it might move around between reboots or
> something like that, but it needs to be something the kernel knows about
> intimately and doesn't need lots of dynamic lookup).
>
> However, there's another issue, namely process startup cost. I personally
> want it to be as light as at all possible. I hate doing an "strace" on
> user processes and seeing tons and tons of crapola showing up. Just for

So why not map the magic page at 0xffffe000 at some other address as
well?

Static binaries can just directly jump/call into the magic page.

Shared binaries do some kind of mmap("/proc/self/mem") magic to put a
copy of the page at an address that is convenient for them. Shared
binaries have to do a lot of mmap-ing anyway, so the overhead should be
negligible.




--
Bart Hartgers - TUE Eindhoven
http://plasimo.phys.tue.nl/bart/contact.html

2002-12-19 13:32:00

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Thu, Dec 19, 2002 at 02:22:36PM +0100, [email protected] wrote:
> > However, there's another issue, namely process startup cost. I personally
> > want it to be as light as at all possible. I hate doing an "strace" on
> > user processes and seeing tons and tons of crapola showing up. Just for
> So why not map the magic page at 0xffffe000 at some other address as
> well?
> Static binaries can just directly jump/call into the magic page.

.. and explode nicely when you try to run them on an older kernel
without the new syscall magick. This is what Linus' first
proof-of-concept code did.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-19 13:47:17

by Bart Hartgers

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On 19 Dec, Dave Jones wrote:
> On Thu, Dec 19, 2002 at 02:22:36PM +0100, [email protected] wrote:
> > > However, there's another issue, namely process startup cost. I personally
> > > want it to be as light as at all possible. I hate doing an "strace" on
> > > user processes and seeing tons and tons of crapola showing up. Just for
> > So why not map the magic page at 0xffffe000 at some other address as
> > well?
> > Static binaries can just directly jump/call into the magic page.
>
> .. and explode nicely when you try to run them on an older kernel
> without the new syscall magick. This is what Linus' first
> proof-of-concept code did.


True, but unless I really don't get it, compatibility of a new static
binary with an old kernel is going to break anyway.
My point was that the double-mapped page trick adds no overhead in the
case of a static binary, and just one extra mmap in case of a shared
binary.

Bart

>
> Dave
>

--
Bart Hartgers - TUE Eindhoven
http://plasimo.phys.tue.nl/bart/contact.html

2002-12-19 14:14:29

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Dave Jones wrote:
> > Static binaries can just directly jump/call into the magic page.
>
> .. and explode nicely when you try to run them on an older kernel
> without the new syscall magick. This is what Linus' first
> proof-of-concept code did.

<evil-grin>

No, because the static binary installs a SIGSEGV handler to emulate
the magic page on older kernels :)

</evil-grin>

-- Jamie

2002-12-19 14:32:31

by Billy Rose

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Richard B. Johnson wrote:

> The target, i.e., the label 'goto' would be the reserved page for the
> system call. The whole purpose was to minimize the number of CPU cycles
> necessary to call 0xfffff000 and return. The system call does not have
> issue a 'far' return, it can do anything it requires. The page at
> 0xfffff000 is mapped into every process and is in that process CS space
> already.

that being the case, why push %cs and reload it without reason as the
code is mapped into every process?

therefore, would it not suffice to use:

...
long_call(); //call to $0xfffff000 via near ret
//code at $0xfffff000 returns directly here when a ret is issued
...

long_call:
pushl $0xfffff000
ret

2002-12-19 14:49:02

by Bart Hartgers

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On 19 Dec, [email protected] wrote:
> long_call:
> pushl $0xfffff000
> ret
>

A ret(urn) to an address that wasn't put on the stack by a call
severely confuses the branch prediction on many processors.


--
Bart Hartgers - TUE Eindhoven
http://plasimo.phys.tue.nl/bart/contact.html

2002-12-19 15:01:13

by Richard B. Johnson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Thu, 19 Dec 2002 [email protected] wrote:

> Richard B. Johnson wrote:
>
> > The target, i.e., the label 'goto' would be the reserved page for the
> > system call. The whole purpose was to minimize the number of CPU cycles
> > necessary to call 0xfffff000 and return. The system call does not have
> > issue a 'far' return, it can do anything it requires. The page at
> > 0xfffff000 is mapped into every process and is in that process CS space
> > already.
>
> that being the case, why push %cs and reload it without reason as the
> code is mapped into every process?
>
> therefore, would it not suffice to use:
>
> ...
> long_call(); //call to $0xfffff000 via near ret
> //code at $0xfffff000 returns directly here when a ret is issued
> ...
>
> long_call:
> pushl $0xfffff000
> ret
>

Because the number pushed onto the stack is a displacement, not
an address, i.e., -4095. To have the address act as an address,
you need to load a full-pointer, i.e. SEG:OFFSET (like the old
16-bit days). The offset is 32-bits and the segment is whatever
the kernel has set up for __USER_CS (0x23). All the 'near' calls
are calls to a signed displacement, same for jumps.

It would be nice if you could just do call $0xfffff000, but
the problem is that the 'call' expects a displacement, usually
determined (fixed-up) by the linker. So, in this case, you
end up calling some code that exists 4095 bytes before the
call instruction. NotGood(tm).

So the whole idea of this exercise is to do the same thing as
call far ptr 0x23:0xfffff000 (Intel syntax), without
requiring a fixup, but minimizing the instructions and disruption
due to reloading a segment.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2002-12-19 15:12:27

by Bart Hartgers

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On 19 Dec, Richard B. Johnson wrote:
> On Thu, 19 Dec 2002 [email protected] wrote:

>> long_call:
>> pushl $0xfffff000
>> ret
>>
>
> Because the number pushed onto the stack is a displacement, not
> an address, i.e., -4095. To have the address act as an address,

Not true. A ret(urn) is (sort of) equivalent to 'pop %eip'. The above
code would actually jump to address 0xfffff000, but probably be slow
since it confuses the branch prediction.

Bart

--
Bart Hartgers - TUE Eindhoven
http://plasimo.phys.tue.nl/bart/contact.html

2002-12-19 16:03:04

by Billy Rose

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Richard B. Johnson wrote:

> Because the number pushed onto the stack is a displacement, not
> an address, i.e., -4095. To have the address act as an address,
> you need to load a full-pointer, i.e. SEG:OFFSET (like the old
> 16-bit days). The offset is 32-bits and the segment is whatever
> the kernel has set up for __USER_CS (0x23). All the 'near' calls
> are calls to a signed displacement, same for jumps.

calls and jmps use displacements; rets are _always_ absolute.

2002-12-19 16:31:17

by Eli Carter

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

Russell King wrote:
> On Wed, Dec 18, 2002 at 05:08:45PM -0800, Adam J. Richter wrote:
>
>> I don't currently use bugzilla (just due to inertia), but the
>>whole world doesn't have to switch to something overnight in order for
>>that facility to end up saving more time and resources than it has
>>cost. Adoption can grow gradually, and it's probably easier to work
>>out bugs (in bugzilla) and improvements that way anyhow.
>
>
> I'm not asking the world to switch to it overnight. Just one person
> would be nice. 8)
>

Ok, Russell, maybe I can lend a small hand there....

You have a bug tracking mechanism of your own on http://www.arm.linux.org.uk,
along with a separate patch tracker.
Do you want ARM bug reports in bugzilla instead of your site? If so,
can you link to it from that bug tracker page? (I suppose you'd want to
direct people to bugzilla for just 2.5.* and 2.5.*-rmk*)

I submitted a 2.4 bug to your bug tracker, got an answer to the question
when I posted to the arm mailing lists (thanks!), and submitted a patch
to the mailing list. But nothing has happened on the bug status. I
asked if you wanted patches for bugs put in the patch tracker or the bug
tracker, but got no reply.
I understand that you're fighting the Acorn battle of 2.5.50 -> 2.5.52,
so I'm trying not to sound like I'm complaining. (Failing, yes, I know,
sorry. :/ ) Some assurance that you will acknowledge bugs in bugzilla
would be greatly encouraging to me. (Such as a reply to this message?)

I'll try to get 2.5 bug reports for ARM into bugzilla based on your
comments here, but a couple of suggestions:
- post an announcement to the arm lists of where you want which bugs to go,
- link to the same in a prominent place from your bug and patch trackers
- if you can, perhaps give priority in terms of replies and such to
those who use bugzilla... I value your replies, and if I can do
something to increase my chances of even getting an "Ack, I'll look at
it next week", I'll try to do that.

Comments?

Eli
--------------------. "If it ain't broke now,
Eli Carter \ it will be soon." -- crypto-gram
eli.carter(a)inet.com `-------------------------------------------------

2002-12-19 16:49:26

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Thu, Dec 19, 2002 at 02:22:12PM +0000, Jamie Lokier wrote:

> <evil-grin>
> No, because the static binary installs a SIGSEGV handler to emulate
> the magic page on older kernels :)
> </evil-grin>

You're a sick man. Really. 8)

Dave

--
| Dave Jones. http://www.codemonkey.org.uk

2002-12-19 18:38:36

by Billy Rose

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


> Not true. A ret(urn) is (sort of) equivalent to 'pop %eip'. The above
> code would actually jump to address 0xfffff000, but probably be slow
> since it confuses the branch prediction.
>
>
>Bart

That being the case, the original code that Linus put forth:

pushl $0xfffff000
call *(%esp)
add $4,%esp

would be the way to go, as it is highly readable. Actually, the code at
0xfffff000 could issue a ret $4 and eliminate the add after the call.

2002-12-19 19:22:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

[email protected] wrote:
> On 18 Dec, Linus Torvalds wrote:
>
>>On Wed, 18 Dec 2002, Jamie Lokier wrote:
>>
>>>That said, you always need the page at 0xfffe0000 mapped anyway, so
>>>that sysexit can jump to a fixed address (which is fastest).
>>
>>Yes. This is important. There _needs_ to be some fixed address at least as
>>far as the kernel is concerned (it might move around between reboots or
>>something like that, but it needs to be something the kernel knows about
>>intimately and doesn't need lots of dynamic lookup).
>>
>>However, there's another issue, namely process startup cost. I personally
>>want it to be as light as at all possible. I hate doing an "strace" on
>>user processes and seeing tons and tons of crapola showing up. Just for
>
> So why not map the magic page at 0xffffe000 at some other address as
> well?
>
> Static binaries can just directly jump/call into the magic page.
>
> Shared binaries do somekind of mmap("/proc/self/mem") magic to put a
> copy of the page at an address that is convenient for them. Shared
> binaries have to do a lot of mmap-ing anyway, so the overhead should be
> negligible.
>

That would require /proc to be mounted for all shared binaries to work.
That is tantamount to killing chroot().

Perhaps it could be done with mremap(), but I would assume that would
entail a special case in the mremap() code.

A special system call would be a bit gross, but it's better than a total
hack.

-hpa


2002-12-19 19:31:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Thu, 19 Dec 2002 [email protected] wrote:
>
> True, but unless I really don't get it, compatibility of a new static
> binary with an old kernel is going to break anyway.

NO.

The current code in 2.5.x is perfectly able to be 100% compatible with
binaries even on old kernels. This whole discussion is _totally_
pointless. I solved all the glibc problems early on, and Uli is already
happy with the interfaces, and they work fine for old kernels that don't
have a clue about the new system call interfaces.

WITHOUT any new magic system calls.

WITHOUT any stupid SIGSEGV tricks.

WITHOUT any silly mmap()'s on magic files.

> My point was that the double-mapped page trick adds no overhead in the
> case of a static binary, and just one extra mmap in case of a shared
> binary.

For _zero_ gain. The jump to the library address has to be indirect
anyway, and glibc has several places to put the information without any
mmap's or anything like that.

Linus

2002-12-19 22:03:05

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> For _zero_ gain. The jump to the library address has to be indirect
> anyway, and glibc has several places to put the information without any
> mmap's or anything like that.

This is not true, (but your overall point is still correct).

The jump to the magic page can be direct in statically linked code, or
in the executable itself. The assembler and linker have no problem
with this, I have just tried it.

What people (not Linus) have said about static binaries is moot,
because a static binary is linked at an absolute address itself, and
so can use the standard "call relative" instruction directly to the
fixed magic page address.

Dynamic binaries or libraries can use the indirect call or relocate
the calls at load time, or if they _really_ want a magic page at a
position relative to the library, they can just _copy_ the magic page
from 0xfffe0000. It is not all that magic.
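
For illustration, a rough sketch of the "copy it" idea in user-space C
(assuming a 4k page and using 0xffffe000, the address used elsewhere in this
thread; whether doing this is ever worthwhile is exactly what is being
debated here):

#include <string.h>
#include <sys/mman.h>

/* map an anonymous executable page and copy the vsyscall page into it */
static void *copy_magic_page(void)
{
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        memcpy(p, (void *) 0xffffe000, 4096);
        return p;
}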

-- Jamie

2002-12-19 22:08:10

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> Btw, on another tangent - Andrew Morton reports that APM is unhappy about
> the fact that the fast system call stuff required us to move the segments
> around a bit. That's probably because the APM code has the old APM segment
> numbers hardcoded somewhere, but I don't see where (I certainly knew about
> the segment number issue, and tried to update the cases I saw).
>
> Debugging help would be appreciated, especially from somebody who knows
> the APM code.

IIRC, segment 0x40 was special in BIOS days, and some APM BIOSes
blindly access 0x40 even from protected mode (Windows has segment
0x40 with base 0x400...). Is that the issue you are hitting?
Pavel

--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-19 22:09:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Jamie Lokier wrote:
>
> Dynamic binaries or libraries can use the indirect call or relocate
> the calls at load time, or if they _really_ want a magic page at a
> position relative to the library, they can just _copy_ the magic page
> from 0xfffe0000. It is not all that magic.
>

That would make it impossible for the kernel to have kernel-controlled
data on that page (or another page), though...

I personally would like to see some better interface than mmap()
/proc/self/mem in order to alias pages, anyway. We could use a
MAP_ALIAS flag in mmap() for this (where the fd would be ignored, but
the offset would matter.)
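
Roughly what such an interface might look like from user space (MAP_ALIAS is
purely hypothetical here and its value is made up; this is only a sketch of
the proposal):

#include <sys/mman.h>

#define MAP_ALIAS 0x8000        /* invented value, for illustration only */

/* alias an existing mapping at a new address: fd is ignored and the
 * "offset" names the address of the page to alias */
static void *alias_page(void *new_addr, unsigned long existing_addr)
{
        return mmap(new_addr, 4096, PROT_READ | PROT_EXEC,
                    MAP_ALIAS | MAP_FIXED | MAP_PRIVATE, -1,
                    (off_t) existing_addr);
}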

-hpa


2002-12-19 22:12:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek wrote:
>>
>>don't many of the multi-CPU problems with tsc go away because you've got a
>>per-cpu physical page for the vsyscall?
>>
>>i.e. per-cpu tsc epoch and scaling can be set on that page.
>
> Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
> nice at all.
>

±100 clocks is what... ~50 ns these days? You can't get that kind of
accuracy for anything outside the CPU core anyway...

-hpa

2002-12-19 22:17:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Thu, 19 Dec 2002, Jamie Lokier wrote:
> Linus Torvalds wrote:
> > For _zero_ gain. The jump to the library address has to be indirect
> > anyway, and glibc has several places to put the information without any
> > mmap's or anything like that.
>
> This is not true, (but your overall point is still correct).

Go back and read the postings by Uli.

Uli's suggested glibc approach is to just put the magic system call
address (which glibc gets from the AT_SYSINFO elf aux table entry) into
the per-thread TLS area, which is always pointed to by %gs anyway.

THIS WORKS WITH ALL DSO'S WITHOUT ANY GAMES, ANY MMAP'S, ANY RELINKING, OR
ANY EXTRA WORK AT ALL!

The system call entry becomes a simple

call *%gs:constant-offset

Not mmap. No magic system calls. No relinking. Not _nothing_. One
instruction, that's it.

See for example Uli's posting in this thread from the day before
yesterday, message ID <[email protected]>. So please stop
arguing about any extra work, because none is needed.
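
For concreteness, a rough user-space sketch of the same lookup (this is not
glibc's code: glibc stashes the value in the thread's TLS block so %gs can
reach it, while this toy just uses a global, and AT_SYSINFO may need defining
by hand with older headers):

#include <elf.h>
#include <stdio.h>

#ifndef AT_SYSINFO
#define AT_SYSINFO 32
#endif

static unsigned long vsyscall_entry;    /* glibc would keep this at %gs:0x18 */

static long my_getpid(void)
{
        long ret;

        if (vsyscall_entry)
                asm volatile("call *%1"
                             : "=a" (ret)
                             : "r" (vsyscall_entry), "0" (20 /* __NR_getpid */)
                             : "memory");
        else
                asm volatile("int $0x80" : "=a" (ret) : "0" (20) : "memory");
        return ret;
}

int main(int argc, char **argv, char **envp)
{
        Elf32_auxv_t *av;

        while (*envp)                   /* the aux vector follows the environment */
                envp++;
        for (av = (Elf32_auxv_t *) (envp + 1); av->a_type != AT_NULL; av++)
                if (av->a_type == AT_SYSINFO)
                        vsyscall_entry = av->a_un.a_val;

        printf("getpid() = %ld\n", my_getpid());
        return 0;
}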

Linus

2002-12-19 22:17:57

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek wrote:
> Hi!
>
>
>>>>don't many of the multi-CPU problems with tsc go away because you've got a
>>>>per-cpu physical page for the vsyscall?
>>>>
>>>>i.e. per-cpu tsc epoch and scaling can be set on that page.
>>>
>>>Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
>>>nice at all.
>>>
>>
>>±100 clocks is what... ~50 ns these days? You can't get that kind of
>>accuracy for anything outside the CPU core anyway...
>
> 50ns is bad enough when it makes your time go backwards.
>

Backwards?? Clock spreading should make the rate change, but it should
never decrement.

-hpa


2002-12-19 22:20:23

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> >>>>don't many of the multi-CPU problems with tsc go away because you've got a
> >>>>per-cpu physical page for the vsyscall?
> >>>>
> >>>>i.e. per-cpu tsc epoch and scaling can be set on that page.
> >>>
> >>>Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
> >>>nice at all.
> >>>
> >>
> >>±100 clocks is what... ~50 ns these days? You can't get that kind of
> >>accuracy for anything outside the CPU core anyway...
> >
> > 50ns is bad enough when it makes your time go backwards.
> >
>
> Backwards?? Clock spreading should make the rate change, but it should
> never decrement.

User on cpu1 reads the time and communicates it to cpu2, but cpu2 has drifted
-50ns, so it reads a time "before" the time reported by cpu1, and gets confused.

Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-12-19 22:24:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek wrote:
>
> User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
> -50ns, so it reads time "before" time reported cpu1. And gets confused.
>

How can you get that communication to happen in < 50 ns?

-hpa


2002-12-19 22:20:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
>
> Uli's suggested glibc approach is to just put the magic system call
> address (which glibc gets from the AT_SYSINFO elf aux table entry) into
> the per-thread TLS area, which is always pointed to by %gs anyway.
>
> THIS WORKS WITH ALL DSO'S WITHOUT ANY GAMES, ANY MMAP'S, ANY RELINKING, OR
> ANY EXTRA WORK AT ALL!
>
> The system call entry becomes a simple
>
> call *%gs:constant-offset
>
> Not mmap. No magic system calls. No relinking. Not _nothing_. One
> instruction, that's it.
>

Unfortunately it means taking an indirect call cost for every invocation...

-hpa

2002-12-19 22:31:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Pavel Machek wrote:
> Hi!
>
>
>>>User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
>>>-50ns, so it reads time "before" time reported cpu1. And gets confused.
>>>
>>
>>How can you get that communication to happen in < 50 ns?
>
>
> I'm not sure I can do that, but I'm not sure I can't either. CPUs
> snoop each other's cache, and that's supposed to be fast...
>

Even over a 400 MHz FSB you have 2.5 ns cycles. I doubt you can
transfer a cache line in 20 FSB cycles.

-hpa


2002-12-19 22:29:34

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> > User on cpu1 reads time, communicates it to cpu2, but cpu2 is drifted
> > -50ns, so it reads time "before" time reported cpu1. And gets confused.
> >
>
> How can you get that communication to happen in < 50 ns?

I'm not sure I can do that, but I'm not sure I can't either. CPUs
snoop each other's cache, and that's supposed to be fast...

--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-12-19 22:15:09

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> >>don't many of the multi-CPU problems with tsc go away because you've got a
> >>per-cpu physical page for the vsyscall?
> >>
> >>i.e. per-cpu tsc epoch and scaling can be set on that page.
> >
> > Problem is that cpu's can randomly drift +/- 100 clocks or so... Not
> > nice at all.
> >
>
> ±100 clocks is what... ~50 ns these days? You can't get that kind of
> accuracy for anything outside the CPU core anyway...

50ns is bad enough when it makes your time go backwards.

Pavel
--
Casualities in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2002-12-19 22:44:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Thu, 19 Dec 2002, H. Peter Anvin wrote:
>
> Unfortunately it means taking an indirect call cost for every invocation...

Ehh.. I just tested the "cost" of this on a PIII (comparing an indirect
call with a direct one), and it's exactly one extra cycle.

ONE CYCLE.

On a P4 the difference was 4 cycles. On my test P95 system I didn't see
any difference at all. And I don't have an athlon handy in my office.

That's the difference between

static void *address = &do_nothing;
asm("call *%0" :"m" (address))

and

asm("call do_nothing");

So it's between 0-4 cycles on machines that take 200 - 1000 cycles for
just the system call overhead.

And for that "overhead", you get a binary that trivially works on all
kernels, _and_ doesn't need extra mmap's etc (which are _easily_ thousands
of cycles).
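
A self-contained version of that kind of measurement might look roughly like
this (assuming gcc and a 32-bit build; do_nothing and the loop count are
obviously arbitrary):

#include <stdio.h>

void do_nothing(void) { }
static void (*address)(void) = do_nothing;

static inline unsigned long long rdtsc(void)
{
        unsigned long long t;
        asm volatile("rdtsc" : "=A" (t));
        return t;
}

int main(void)
{
        enum { LOOPS = 1000000 };
        unsigned long long t0, t1, t2;
        int i;

        t0 = rdtsc();
        for (i = 0; i < LOOPS; i++)     /* indirect call through memory */
                asm volatile("call *%0" : : "m" (address)
                             : "eax", "ecx", "edx", "cc", "memory");
        t1 = rdtsc();
        for (i = 0; i < LOOPS; i++)     /* direct call */
                asm volatile("call do_nothing"
                             : : : "eax", "ecx", "edx", "cc", "memory");
        t2 = rdtsc();

        printf("indirect: %f cycles/call\n", (double) (t1 - t0) / LOOPS);
        printf("direct:   %f cycles/call\n", (double) (t2 - t1) / LOOPS);
        return 0;
}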

Linus

2002-12-19 23:24:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Thu, 19 Dec 2002, Linus Torvalds wrote:
>
> So it's between 0-4 cycles on machines that take 200 - 1000 cycles for
> just the system call overhead.

Side note: I'd expect indirect calls - and especially the predictable
ones, like this - to maintain competitive behaviour on CPU's. Even the P4,
which usually has really bad worst-case behaviour for more complex
instructions (just look at the 2000 cycles for a regular int80/iret and
shudder) does an indirect call without huge problems.

That's because indirect calls are actually very common, and to some degree
getting _more_ so with the proliferation of OO languages (and I'm
discounting the "indirect call" that is just a return statement - they've
obviously always been common, but a return stack means that CPU
optimizations for "ret" instructions are different from "real" indirect
calls).

So I don't worry about the indirection per se. I'd worry a lot more about
some of the tricks people mentioned (ie the "pushl $0xfffff000 ; ret"
approach probably sucks quite badly on some CPU's, simply because it does
bad things to return stacks on modern CPU's - not necessarily visible in
microbenchmarks, but..).
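
To make the difference concrete, the two patterns look roughly like this from
C (sketch only; 0xfffff000 stands in for the magic page, and the second form
is just the earlier push/ret trick condensed into one asm):

/* call/ret stay paired, so the return-stack predictor is happy */
static inline void enter_via_call(void)
{
        asm volatile("movl $0xfffff000, %%eax\n\t"
                     "call *%%eax"
                     : : : "eax", "memory");
}

/* push the target and "return" into it; the magic page's final ret then
 * lands on the 1: label, but neither ret matches a real call as far as
 * the return stack is concerned */
static inline void enter_via_push_ret(void)
{
        asm volatile("pushl $1f\n\t"
                     "pushl $0xfffff000\n\t"
                     "ret\n"
                     "1:"
                     : : : "memory");
}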

Linus

2002-12-20 00:46:16

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, Dec 18, 2002 at 02:57:11PM -0800, Linus Torvalds wrote:
>
> Btw, I'm pushing what looks like the "final" version of sysenter/sysexit
> for now. There may be bugs left, but all the known issues are resolved:
>
> - single-stepping over the system call now works. It doesn't actually see
> all of the user-mode instructions, since the fast system call interface
> does not lend itself well to restoring "TF" in eflags on return, but
> the trampoline code saves and restores the flags, so you will be able
> to step over the important bits.
>
> (ptrace also doesn't actually allow you to look at the instruction
> contents in high memory, so gdb won't see the instructions in the
> user-mode fast system call trampoline even when it can single-step
> them, and I don't think I'll bother to fix it up).

This worries me. I'm no x86 guru, but I assume the trampoline's setting of
the TF bit will kick in right around the following 'ret'. So the
application will stop and GDB won't be able to read the instruction at
PC. I bet that makes it unhappy.

Shouldn't be that hard to fix this up in ptrace, though.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-12-20 01:39:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Thu, 19 Dec 2002, Daniel Jacobowitz wrote:
> >
> > (ptrace also doesn't actually allow you to look at the instruction
> > contents in high memory, so gdb won't see the instructions in the
> > user-mode fast system call trampoline even when it can single-step
> > them, and I don't think I'll bother to fix it up).
>
> This worries me. I'm no x86 guru, but I assume the trampoline's setting of
> the TF bit will kick in right around the following 'ret'. So the
> application will stop and GDB won't be able to read the instruction at
> PC. I bet that makes it unhappy.

It doesn't make gdb all that unhappy, everything seems to work fine
despite the fact that gdb decides it just can't display the instructions.

> Shouldn't be that hard to fix this up in ptrace, though.

Or even in user space, since the high pages are all the same in all
processes (so gdb doesn't even strictly need ptrace, it can just read its
_own_ codespace there). But yeah, we could make ptrace aware of the magic
pages.

Linus

2002-12-20 02:16:27

by Alan

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, 2002-12-18 at 23:45, Pavel Machek wrote:
> IIRC, segment 0x40 was special in BIOS days, and some APM bioses
> blindly access 0x40 even from protected mode (windows have segment
> 0x40 with base 0x400....) Is that issue you are hitting?

Well the spec says it is not special. Windows leaves it pointing to
0x400 and if you don't do that your APM doesn't work.

Alan

2002-12-20 02:28:55

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Thu, Dec 19, 2002 at 05:47:55PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 19 Dec 2002, Daniel Jacobowitz wrote:
> > >
> > > (ptrace also doesn't actually allow you to look at the instruction
> > > contents in high memory, so gdb won't see the instructions in the
> > > user-mode fast system call trampoline even when it can single-step
> > > them, and I don't think I'll bother to fix it up).
> >
> > This worries me. I'm no x86 guru, but I assume the trampoline's setting of
> > the TF bit will kick in right around the following 'ret'. So the
> > application will stop and GDB won't be able to read the instruction at
> > PC. I bet that makes it unhappy.
>
> It doesn't make gdb all that unhappy, everything seems to work fine
> despite the fact that gdb decides it just can't display the instructions.

Yeah; sometimes it will generate that error in the middle of
single-stepping over something larger, though, and it breaks you out of
whatever you were doing.

> > Shouldn't be that hard to fix this up in ptrace, though.
>
> Or even in user space, since the high pages are all the same in all
> processes (so gdb doesn't even strictly need ptrace, it can just read it's
> _own_ codespace there). But yeah, we could make ptrace aware of the magic
> pages.

I need to get back to my scheduled ptrace cleanups. Meanwhile, here's
a patch to do this. Completely untested, like all good patches; but
it's pretty simple.

===== arch/i386/kernel/ptrace.c 1.17 vs edited =====
--- 1.17/arch/i386/kernel/ptrace.c      Wed Nov 27 18:14:11 2002
+++ edited/arch/i386/kernel/ptrace.c    Thu Dec 19 21:33:37 2002
@@ -21,6 +21,7 @@
 #include <asm/processor.h>
 #include <asm/i387.h>
 #include <asm/debugreg.h>
+#include <asm/fixmap.h>
 
 /*
  * does not yet catch signals sent when the child dies.
@@ -196,6 +197,18 @@
        case PTRACE_PEEKDATA: {
                unsigned long tmp;
                int copied;
+
+               /* Allow ptrace to read from the vsyscall page. */
+               if (addr >= FIXADDR_START && addr < FIXADDR_TOP &&
+                   (addr & 3) == 0) {
+                       int idx = virt_to_fix (addr);
+                       if (idx == FIX_VSYSCALL) {
+                               tmp = * (unsigned long *) addr;
+                               ret = put_user (tmp, (unsigned long *) data);
+                               break;
+                       }
+               }
+

                copied = access_process_vm(child, addr, &tmp, sizeof(tmp), 0);
                ret = -EIO;


--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-12-20 03:56:25

by Stephen Rothwell

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On 20 Dec 2002 03:05:15 +0000 Alan Cox <[email protected]> wrote:
>
> On Wed, 2002-12-18 at 23:45, Pavel Machek wrote:
> > IIRC, segment 0x40 was special in BIOS days, and some APM bioses
> > blindly access 0x40 even from protected mode (windows have segment
> > 0x40 with base 0x400....) Is that issue you are hitting?
>
> Well the spec says it is not special. Windows leaves it pointing to
> 0x400 and if you don't do that your APM doesn't work.

The problem with the new syscall stuff is fixed in BK (the GDT was no longer
long enough ...)

The 0x40 thing is set up and torn down for each BIOS call these days.
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/

2002-12-20 10:01:47

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> For _zero_ gain. The jump to the library address has to be indirect
> anyway, and glibc has several places to put the information without any
> mmap's or anything like that.

Correct. The current implementation is optimal.

It is necessary to have indirection since the target address can change.

I'm never going to use self-modifying code.

And it's a simple, one-instruction change.

int $0x80 -> call *%gs:0x18


That's it. It's all implemented and tested. The results are in the
latest NPTL source drop. The code won't be available in LinuxThreads
since it requires a kernel with TLS support.

As far as I'm concerned the discussion is over. I'm happy with what I
have now. The additional overhead for the case when AT_SYSINFO is not
available is negligible (and can be compiled out completely if one
really wants), and in case AT_SYSINFO is available the code really is
the fastest possible given the constraints mentioned above.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-20 11:59:15

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

This is a suggestion for a small performance improvement.

Ulrich Drepper wrote:
> int $0x80 -> call *%gs:0x18

The calling convention has been (slightly) changed - i.e. 6 argument
calls don't work, so why not go a bit further: allow the vsyscall entry
point to clobber more GPRs?

I see 3 pushes and pops in the vsyscall page (if I've looked at the
correct patch from Linus), to preserve %ecx, %edx and %ebp:

vsyscall:
        pushl %ebp
        pushl %ecx
        pushl %edx
0:
        movl %esp,%ebp
        sysenter
        jmp 0b
        popl %edx
        popl %ecx
        popl %ebp
        ret

The benefit is that this allows Glibc to do a wholesale replacement of
"int $0x80" -> "single call instruction". Otherwise, those pushes are
completely unnecessary. It could be this short instead:

vsyscall:
        movl %esp,%ebp
        sysenter
        jmp vsyscall
        ret

It is nice to be able to use the _exact_ same convention in glibc, for
getting a patch out of the door quickly. But it is just as easy to do
that by putting the pushes and pops into the library itself:

Instead of

int $0x80 -> call *%gs:0x18

Write

int $0x80 -> pushl %ebp
             pushl %ecx
             pushl %edx
             call *%gs:0x18
             popl %edx
             popl %ecx
             popl %ebp

It has exactly the same cost as the current patches, but provides
userspace with more optimisation flexibility, using an asm clobber
list instead of explicit instructions for inline syscalls, etc.
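
As a sketch of what that buys inside the library (hypothetical macro,
assuming the %gs:0x18 slot stays where Uli put it): the clobbered registers
go into the asm's clobber list, and the compiler only saves whatever is
actually live at each call site:

/* one-argument syscall under a "clobbers %ecx/%edx" convention */
#define vsyscall1(nr, arg1)                                     \
({                                                              \
        long __ret;                                             \
        asm volatile("call *%%gs:0x18"                          \
                     : "=a" (__ret)                             \
                     : "0" (nr), "b" (arg1)                     \
                     : "ecx", "edx", "memory", "cc");           \
        __ret;                                                  \
})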

Cheers,
-- Jamie

2002-12-20 16:50:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Fri, 20 Dec 2002, Jamie Lokier wrote:
>
> Ulrich Drepper wrote:
> > int $0x80 -> call *%gs:0x18
>
> The calling convention has been (slightly) changed - i.e. 6 argument
> calls don't work, so why not go a bit further: allow the vsyscall entry
> point to clobber more GPRs?

Actually, six-argument syscalls _do_ work. I admit that the way to make
them work is "interesting", but it's also extremely simple.

> The benefit is that this allows Glibc to do a wholesale replacement of
> "int $0x80" -> "single call instruction". Otherwise, those pushes are
> completely unnecessary. It could be this short instead:
>
> vsyscall:
> movl %esp,%ebp
> sysenter
> jmp vsyscall
> ret

Yes, we could have changed the implementation to clobber more registers,
but if we want to support all system calls it would still have to save
%ebp, so the minimal approach would have been

vsyscall:
        pushl %ebp
0:
        movl %esp,%ebp
        sysenter
        jmp 0b          /* only done for restarting */
        popl %ebp
        ret

which is all of 4 (simple) instructions cheaper than the one we have now.

And if the caller cannot depend on registers being saved, the caller may
actually end up being more complicated. For example, with the current
setup, you can have

getpid():
        movl $__NR_getpid,%eax
        jmp *%gs:0x18

but if system calls clobber registers, the caller needs to be

getpid():
        pushl %ebx
        pushl %esi
        pushl %edi
        pushl %ebp
        movl $__NR_getpid,%eax
        call *%gs:0x18
        popl %ebp
        popl %edi
        popl %esi
        popl %ebx
        ret

and notice how the _real_ code sequence actually got much _worse_ from the
fact that you tried to save time by not saving registers.


> It is nice to be able to use the _exact_ same convention in glibc, for
> getting a patch out of the door quickly. But it is just as easy to do
> that putting the pushes and pops into the library itself:
>
> Instead of
>
> int $0x80 -> call *%gs:0x18
>
> Write
>
> int $0x80 -> pushl %ebp
> pushl %ecx
> pushl %edx
> call *%gs:0x18
> popl %edx
> popl %ecx
> popl %ebp

But where's the advantage then? You use the same number of instructions
dynamically, and you use _more_ icache space than if you have the pushes
and pops in just one place?

> It has exactly the same cost as the current patches, but provides
> userspace with more optimisation flexibility, using an asm clobber
> list instead of explicit instructions for inline syscalls, etc.

In practice, there is nothing around the call. And you have to realize
that the pushes and pops you added in your version are _wasted_ for other
cases. If the system call ends up being int 0x80, you just wasted time. If
the system call ends up being AMD's x86-64 version of syscall, you just
wasted time.

The advantage of putting all the register save in the trampoline is that
user mode literally doesn't have to _know_ what it is calling. It only
needs to know two simple rules:

- registers are preserved (except for %eax which is the return value, of
course)
- it should fill in arguments in %ebx, %ecx ... (but the things that
aren't arguments can just be left untouched)

And then depending on what the real low-level calling convention is, the
trampoline will save the _minimum_ number of registers (ie some calling
conventions might clobber different registers than %ecx/%edx - you have to
remember that "sysenter" is just _one_ out of at least three calling
conventions available).
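
A sketch of how that choice might look on the kernel side (loosely modelled
on the 2.5 code, with approximate names): the boot code copies whichever
trampoline the CPU supports into the fixed page, and user space never learns
which variant it got:

static int __init setup_vsyscall_page(void)
{
        /* assumed to be assembled elsewhere: one trampoline per convention */
        extern char vsyscall_int80_start[], vsyscall_int80_end[];
        extern char vsyscall_sysenter_start[], vsyscall_sysenter_end[];
        unsigned long page = get_zeroed_page(GFP_ATOMIC);

        if (boot_cpu_has(X86_FEATURE_SEP))
                memcpy((void *) page, vsyscall_sysenter_start,
                       vsyscall_sysenter_end - vsyscall_sysenter_start);
        else
                memcpy((void *) page, vsyscall_int80_start,
                       vsyscall_int80_end - vsyscall_int80_start);

        __set_fixmap(FIX_VSYSCALL, __pa(page), PAGE_READONLY);
        return 0;
}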

Linus

2002-12-20 23:31:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> And if the caller cannot depend on registers being saved, the caller may
> actually end up being more complicated. For example, with the current
> setup, you can have
>
> getpid():
> movl $__NR_getpid,%eax
> jmp *%gs:0x18
>
> but if system calls clobber registers, the caller needs to be
>
> [long code snippet]
>
> and notice how the _real_ code sequence actually got much _worse_ from the
> fact that you tried to save time by not saving registers.

No, your "real" code sequence is wrong.

%ebx/%edi/%esi are preserved across sysenter/sysexit, whereas
%ecx/%edx are call-clobbered registers in the i386 function call ABI.

This is not a coincidence.

So, getpid looks like this with the _smaller_ vsyscall code:

getpid():
movl $__NR_getpid,%eax
call *%gs:0x18
ret

Intel didn't choose %ecx/%edx as the sysexit registers by accident.
They were chosen for exactly this reason.

By the way, the same applies to AMD's syscall/sysret, which clobbers %ecx.

What I'm suggesting is that we should say that "call 0xffffe000"
clobbers only the registers (%eax/%ecx/%edx) that _normal_ function
calls clobber on i386, and preserves the call-saved registers.

This keeps the size of system call stubs in libc to the minimum.
Think about it.

-- Jamie

2002-12-20 23:42:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Jamie Lokier wrote:
>
> No, your "real" code sequence is wrong.
>
> %ebx/%edi/%esi are preserved across sysenter/sysexit, whereas
> %ecx/%edx are call-clobbered registers in the i386 function call ABI.
>
> This is not a coincidence.
>
> So, getpid looks like this with the _smaller_ vsyscall code:
>
> getpid():
> movl $__NR_getpid,%eax
> call *%gs:0x18
> ret

... or just...

getpid:
movl $__NR_getpid, %eax
jmp *%gs:0x18

This doesn't mess up the call/return stack, even, because the ret in the
stub matches the call to getpid.

-hpa

2002-12-21 00:01:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Fri, 20 Dec 2002, Jamie Lokier wrote:
>
> %ebx/%edi/%esi are preserved across sysenter/sysexit, whereas
> %ecx/%edx are call-clobbered registers in the i386 function call ABI.
>
> This is not a coincidence.

Yes, you can make the "clobbers %eax/%edx/%ecx" argument, but the fact is,
we quite fundamentally need to save %edx/%ecx _anyway_.

The reason is system call restarting and signal handling. You don't see it
right now, because the system call restart mechanism doesn't actually use
"sysexit" at all, but that's because the current implementation is only
the minimal possible implementation.

The way we do signal handling right now, we always punt to the "old" code,
ie the return path that will eventually return with an "iret".

And that old code will restore _all_ registers, including %ecx and %edx.
So when we return after a restart to the restart handler, %ecx and %edx
will have their original values, which is why restarting works right now.

The "iret" will trash "%ebp", simply because we fake out the whole %ebp
saving to get the six-argument case right. That's why we have to have that
extra complicated restart sequence:

0:
        movl %esp,%ebp
        syscall
restart:
        jmp 0b

but once we start using sysexit for the signal handler return path too, we
will need to restore %edx and %ecx too, otherwise our restarted system
call will have crap in the registers. I already wrote the code, but
decided that as long as we don't do that kind of restarting, we shouldn't
have the overhead in the trampoline. But basically the trampoline then
will become

system_call_trampoline:
        pushfl
        pushl %ecx
        pushl %edx
        pushl %ebp
        movl %esp,%ebp
        syscall
0:
        movl %esp,%ebp
        movl 4(%ebp),%edx
        movl 8(%ebp),%ecx
        syscall

restart:
        jmp 0b
sysenter_return_point:
        popl %ebp
        popl %edx
        popl %ecx
        popfl
        ret

see? So you _have_ to really save the arguments anyway, because you cannot
do a sysexit-based system call restart otherwise. And once you save them,
you might as well restore them too.

And since you have to restore them for system call restart anyway, you
might as well just make it part of the calling convention.

Yes, I'm thinking ahead. Sue me.

Linus

2002-12-21 11:10:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Tue, 17 Dec 2002, Linus Torvalds wrote:

> > How about this patch? Instead of making a per-cpu trampoline, write to
> > the msr during each context switch.
>
> I wanted to avoid slowing down the context switch, but I didn't actually
> time how much the MSR write hurts you (it needs to be conditional,
> though, I think).

this is the solution i took in the original vsyscall patches. The syscall
entry cost is at least one factor more important than the context-switch
cost. The MSR write was not all that slow when i measured it (it was in
the range of 20 cycles), and it's definitely something the chip makers
should keep fast.
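
A rough sketch of how such a measurement could be taken in the kernel
(2.5-era helper names; a real test would restore the old MSR value and
average over many iterations):

static void __init time_sysenter_msr_write(unsigned long esp0)
{
        unsigned long long t0, t1;

        rdtscll(t0);
        wrmsr(MSR_IA32_SYSENTER_ESP, esp0, 0);
        rdtscll(t1);

        printk(KERN_INFO "SYSENTER_ESP wrmsr: %llu cycles\n", t1 - t0);
}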

Ingo

2002-12-21 15:59:03

by Christian Leber

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Wed, Dec 18, 2002 at 12:25:38AM -0500, Brian Gerst wrote:

> How about this patch? Instead of making a per-cpu trampoline, write to
> the msr during each context switch. This means that the stack pointer
> is valid at all times, and also saves memory and a cache line bounce. I
> also included some misc cleanups.

Just a little bit of benchmarking:
(little testprogram by Linus out of this thread)
(on a AMD Duron 750)

2.5.52-bk2+sysenter-1 (Brian Gerst):
igor3:~# ./a.out
187.894946 cycles (call 0xfffff000)
299.155075 cycles (int $0x80)

2.5.52-bk6:
igor3:~# ./a.out
202.134535 cycles (call 0xffffe000)
299.117583 cycles (int $0x80)

Not really much, but the difference is there. (I don't know about other side
effects.)


Christian Leber

2002-12-21 17:10:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> Yes, you can make the "clobbers %eax/%edx/%ecx" argument, but the fact is,
> we quite fundamentally need to save %edx/%ecx _anyway_.

On the kernel side, yes. In the userspace trampoline, it's not required.

> but once we start using sysexit for the signal handler return path too, we
> will need to restore %edx and %ecx too, otherwise our restarted system
> call will have crap in the registers. I already wrote the code, but
> decided that as long as we don't do that kind of restarting, we shouldn't
> have the overhead in the trampoline. But basically the trampoline then
> will become
>
> system_call_trampoline:
> pushfl
> pushl %ecx
> pushl %edx
> pushl %ebp
> movl %esp,%ebp
> syscall
> 0:
> movl %esp,%ebp
> movl 4(%ebp),%edx
> movl 8(%ebp),%ecx
> syscall
>
> restart:
> jmp 0b
> sysenter_return_point:
> popl %ebp
> popl %edx
> popl %ecx
> popfl
> ret

> see? So you _have_ to really save the arguments anyway, because you cannot
> do a sysexit-based system call restart otherwise. And once you save them,
> you might as well restore them too.
>
> And since you have to restore them for system call restart anyway, you
> might as well just make it part of the calling convention.
>
> Yes, I'm thinking ahead. Sue me.

You're optimising the _rare_ case.

The correct [;)] trampoline looks like this:

system_call_trampoline:
        pushl %ebp
        movl %esp,%ebp
        sysenter
sysenter_return_point:
        popl %ebp
        ret
sysenter_restart:
        popl %edx
        popl %ecx
        movl %esp,%ebp
        sysenter

This is accompanied by changing this line in arch/i386/kernel/signal.c:

regs->eip -= 2;

To this (best moved to an inline function):

if (likely(regs->eip == sysenter_return_point)) {
        /* make room for the two words the restart stub will pop */
        unsigned long *esp = (unsigned long *) (regs->esp - 8);
        if (!access_ok(VERIFY_WRITE, esp, 8)
            || __put_user(regs->edx, esp)
            || __put_user(regs->ecx, esp + 1)) {
                if (sig == SIGSEGV)
                        ka->sa.sa_handler = SIG_DFL;
                force_sig(SIGSEGV, current);
        }
        regs->esp = (long) esp;
        regs->eip = sysenter_restart;
} else {
        regs->eip -= 2;
}

Thus the common case, system calls, are optimised. The uncommon case,
signal interrupts system call, is penalised, though it's not a large
penalty. (Much less than the gain from using sysexit!) Which is more
common?

By the way, this works with the AMD syscall instruction too.
Then the trampoline is:

system_call_trampoline:
        pushl %ebp
        movl %ecx,%ebp
        syscall
syscall_return_point:
        popl %ebp
        ret
syscall_restart:
        popl %edx
        popl %ebp
        syscall

Finally, there is no need to restore %ebp in the kernel sysexit/sysret
paths, because it will always be restored in the trampoline. So you
save 1 memory access there too.

(ps. I have a question: why does your trampoline save & restore
the flags?)

-- Jamie

2002-12-21 17:20:58

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ingo Molnar wrote:
> > > How about this patch? Instead of making a per-cpu trampoline, write to
> > > the msr during each context switch.
> >
> > I wanted to avoid slowing down the context switch, but I didn't actually
> > time how much the MSR write hurts you (it needs to be conditional,
> > though, I think).
>
> this is the solution i took in the original vsyscall patches. The syscall
> entry cost is at least one factor more important than the context-switch
> cost. The MSR write was not all that slow when i measured it (it was on
> the range of 20 cycles), and it's definitely something the chip makers
> should keep fast.

I think it would be better to make NMI and Debug trap use task gates,
if that does actually work, and let the NMI and debug handlers fix the
stack if they trap in the entry paths. They are much rarer after all.

My unreliable memory recalls about 40 cycles for an MSR write on my Celeron.

However, I think you still need MSR writes _sometimes_ in the context
switch to disable sysenter for vm86-mode tasks.

Could the context switch be written like this?:

if (unlikely((prev_task->flags | next_task->flags) & PF_SLOW_SWITCH)) {
        // Do rare stuff including debug registers
        // and sysenter/syscall MSR change for vm86.
}

// Other stuff.
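
Fleshing out that rare path a little (sketch only; whether thread.vm86_info
is the right predicate is an assumption on my part, the MSR and wrmsr()
names are the kernel's):

static inline void switch_sysenter_msr(struct task_struct *prev,
                                       struct task_struct *next)
{
        /* common case: neither task has touched vm86, nothing to do */
        if (likely(!prev->thread.vm86_info && !next->thread.vm86_info))
                return;

        if (next->thread.vm86_info)
                wrmsr(MSR_IA32_SYSENTER_CS, 0, 0);           /* sysenter will fault */
        else
                wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0); /* re-enable it */
}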

-- Jamie

2002-12-21 19:32:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Sat, 21 Dec 2002, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Yes, you can make the "clobbers %eax/%edx/%ecx" argument, but the fact is,
> > we quite fundamentally need to save %edx/%ecx _anyway_.
>
> On the kernel side, yes. In the userspace trampoline, it's not required.

No, it _is_ required.

There are a few registers that _have_ to be saved on the user side,
because the kernel will trash them. Those registers are:

- eflags (kernel has no sane way to restore things like TF in it
atomically with a sysexit)
- ebp (kernel has to reload it with arg-6)
- ecx/edx (kernel _cannot_ restore them).

Your games with looking at %eip are fragile as hell.

> You're optimising the _rare_ case.

NO. I'm making it WORK.

> This is accompanied by changing this line in arch/i386/kernel/signal.c:
>
> regs->eip -= 2;

You're full of it.

You're adding fundamental complexity and special cases, because you have
a political agenda that you want to support, that is not really
supportable.

The fact is, system calls have a special calling convention anyway, and
doing them the way we're doing them now is a hell of a lot saner than
making much more complex code. Saving and restoring the two registers
means that they get easier and more efficient to use from inline asms for
example, and means that the code is simpler.

Your suggestion has _zero_ advantages. Doing two register pop's takes a
cycle, and means that the calling sequence is simple and has no special
cases.

The example code you posted is fragile as hell. Looking at "eip" means
that the different system call entry points now have to be extra careful
not to have the same return points, which is just _bad_ programming.

Linus


2002-12-22 02:10:42

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> - eflags (kernel has no sane way to restore things like TF in it
> atomically with a sysexit)

It is better to use iret with TF. The penalty of forcing _every_
system call to pushfl and popfl in user space is quite a lot: I
measured 30 cycles for "pushfl;popfl" on my 366MHz Celeron.

("sysenter, setup segments, call a function in kernel space, restore
segments, sysexit" takes 82 cycles on the same Celeron, so 30 cycles
is quite a significant proportion to add to that. Btw, _82_, not 200
or so).

> - ebp (kernel has to reload it with arg-6)
> - ecx/edx (kernel _cannot_ restore them).

These are only needed when delivering a signal.

> Your games with looking at %eip are fragile as hell.

Like we don't play %eip games anywhere else... (the page fault fixup
table comes to mind).

> because you have a political agenda that you want to support, that
> is not really supportable.

And there was me thinking I was performance-tuning some code.
Politics, it gets everywhere, like curry gets onto anything white.

> Saving and restoring the two registers
> means that they get easier and more efficient to use from inline asms for
> example, and means that the code is simpler.

They are not more efficient from inline asms, though marginally
simpler to write (shorter clobber list). You just moved the cost from
the asm itself, where it is usually optimised away, to the trampoline
where it is always present (and cast in stone).

> Your suggestion has _zero_ advantages. Doing two register pop's takes a
> cycle, and means that the calling sequence is simple and has no special
> cases.

(Plus another cycle for the two pushes...)

> Th eexample code you posted is fragile as hell. Looking at "eip" means
> that the different system call entry points now have to be extra careful
> not to have the same return points, which is just _bad_ programming.

We are talking about a _very_ small trampoline, which is simplicity
itself compared with entry.S in general. You're right about the extra
care (so write a comment!), although it does just work for _all_ entry
points. Is this really worse than your own "wonderful hack"?

<shrug> You're the executive decision maker. I just know how to
write fast code. Thanks for listening.

-- Jamie

2002-12-22 03:14:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Sun, 22 Dec 2002, Jamie Lokier wrote:
>
> It is better to use iret with TF. The penalty of forcing _every_
> system call to pushfl and popfl in user space is quite a lot: I
> measured 30 cycles for "pushfl;popfl" on my 366MHz Celeron.

Jamie, please stop these emails.

The fact is, when a user enters the kernel with TF set using "sysenter",
the kernel doesn't even _know_ that TF is set, because it will take a
debug trap on the very first instruction, and the debug handler has no
real option except to just return with TF cleared before the kernel even
had a chance to save eflags. So at no point in the sysenter/sysexit path
does the code have a chance to even _realize_ that the user called it with
single-stepping on.

So how do you want the code to figure that out, and then (a) set TF on the
stack and (b) do the jump to the slow path? Sure, we could add magic
per-process flags in the debug handler, and then test them in the sysenter
path - but that really is pretty gross.

Saving and restoring eflags in user mode avoids all of these
complications, and means that there are no special cases. None. Zero.
Nada.

Special case code is bad. It's certainly a lot more important to me to
have a straightforward approach that doesn't have any strange cases, and
where debugging "just works", instead of having a lot of magic small
details scattered all over the place.

So if you really care, create all your special case magic tricks, and see
just how ugly it gets. Then see whether it makes any difference at all
except on the very simplest system calls ("getpid" really isn't very
important).

Linus

2002-12-22 09:59:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Sat, 21 Dec 2002, Linus Torvalds wrote:

> Saving and restoring eflags in user mode avoids all of these
> complications, and means that there are no special cases. None. Zero.
> Nada.

and i'm 100% sure the more robust eflags saving will also avoid security
holes. The amount of security-relevant complexity that comes from all the
x86 features [and their combinations] is amazing.

Ingo

2002-12-22 10:08:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


while reviewing the sysenter trampoline code i started wondering about the
HT case. Don't HT boxes share the MSRs between logical CPUs? This pretty
much breaks the concept of per-logical-CPU sysenter trampolines. It also
makes context-switch time sysenter MSR writing impossible, so i really
hope this is not the case.

Ingo




2002-12-22 11:00:36

by James Cloos

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus> The system call entry becomes a simple

Linus> call *%gs:constant-offset

Linus> Not mmap. No magic system calls. No relinking. Not
Linus> _nothing_. One instruction, that's it.

I presume *%gs:0x18 is only for shared objects?

A naïve:

- asm volatile("call 0xffffe000"
+ asm volatile("call *%%gs:0x18"

in the trivial getppid benchmark code gives a SEGV, since
(according to gdb's info all-registers) %gs == 0 when it runs.

Is it just that my glibc is too old, or is there a shared vs static difference?

-JimC

P.S. On a (1 Gig) mobile p3 the getppid bench gives ~333 cycles for
int $0x80 and ~215 for call 0xffffe000, before yesterday's push.

2002-12-22 12:25:36

by Mikael Pettersson

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Sun, 22 Dec 2002 11:23:08 +0100 (CET), Ingo Molnar wrote:
>while reviewing the sysenter trampoline code i started wondering about the
>HT case. Dont HT boxes share the MSRs between logical CPUs? This pretty
>much breaks the concept of per-logical-CPU sysenter trampolines. It also
>makes context-switch time sysenter MSR writing impossible, so i really
>hope this is not the case.

Some MSRs are shared, some aren't. One must always check this in
the IA32 Volume 3 manual. The three SYSENTER MSRs are not shared.

However, no-one has yet proven that writing to these in the context
switch path has acceptable performance -- remember, there is _no_
a priori reason to assume _anything_ about performance on P4s,
you really do need to measure things before taking design decisions.

Manfred had a version with fixed MSR values and the varying data
in memory. Maybe that's actually faster.

2002-12-22 15:24:55

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ingo Molnar wrote:
> and i'm 100% sure the more robust eflags saving will also avoid security
> holes. The amount of security-relevant complexity that comes from all the
> x86 features [and their combinations] is amazing.

Userspace can skip the "popfl" with a well-timed signal. If the
"sysexit" path leaves the kernel with an unsafe eflags, that will
propagate into the signal handler.

AFAICT, one of these is required:

1. eflags must be safe before leaving kernel space, or
2. setup_sigcontext() must clean it up (it already does clear TF).
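
Option 2 could be as small as this (a sketch; the mask names are the 2.5-era
i386 ones, and exactly which bits need scrubbing is the open question):

/* scrub eflags bits a signal handler should never inherit */
static inline unsigned long sanitize_sigcontext_eflags(unsigned long eflags)
{
        eflags &= ~(TF_MASK | NT_MASK);         /* no single-step, no nested task */
        eflags |= IF_MASK;                      /* user code runs with interrupts on */
        return eflags;
}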

-- Jamie

2002-12-22 15:37:14

by Nakajima, Jun

[permalink] [raw]
Subject: RE: Intel P6 vs P7 system call performance

Correct. Please look at Table B-1. Most MSRs are shared, but some MSRs are unique to each logical processor, to provide the x86 architectural state. The SYSENTER MSRs and the Machine Check register save state (IA32_MCG_XXX), for example, are unique.

Jun

> -----Original Message-----
> From: Mikael Pettersson [mailto:[email protected]]
> Sent: Sunday, December 22, 2002 4:34 AM
> To: [email protected]; [email protected]
> Cc: [email protected]; Nakajima, Jun; [email protected]
> Subject: Re: Intel P6 vs P7 system call performance
>
> On Sun, 22 Dec 2002 11:23:08 +0100 (CET), Ingo Molnar wrote:
> >while reviewing the sysenter trampoline code i started wondering about
> the
> >HT case. Dont HT boxes share the MSRs between logical CPUs? This pretty
> >much breaks the concept of per-logical-CPU sysenter trampolines. It also
> >makes context-switch time sysenter MSR writing impossible, so i really
> >hope this is not the case.
>
> Some MSRs are shared, some aren't. One must always check this in
> the IA32 Volume 3 manual. The three SYSENTER MSRs are not shared.
>
> However, no-one has yet proven that writing to these in the context
> switch path has acceptable performance -- remember, there is _no_
> a priori reason to assume _anything_ about performance on P4s,
> you really do need to measure things before taking design decisions.
>
> Manfred had a version with fixed MSR values and the varying data
> in memory. Maybe that's actually faster.

2002-12-22 15:52:34

by Jamie Lokier

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Mikael Pettersson wrote:
> Manfred had a version with fixed MSR values and the varying data
> in memory. Maybe that's actually faster.

The stack pointer is already changed on context switches since Ingo
changed the kernel to use per-cpu TSS segments.

Manfred's code used a tiny stack (without a valid task struct,
a different method of recovering than Linus' code). You can get that
stack down to 6 words, (3 for debug trap, 3 more for nested NMI).

Combining these leads to an IMHO beautiful hack, which does work btw:
the 6 words fit into unused parts of the per-CPU TSS (just before
tss->es). The MSR has a constant value:

wrmsr(MSR_IA32_SYSENTER_ESP, (u32) &tss->es, 0);

I found the fastest sysenter entry code looks like this:

sysenter_entry_point:
cld # Faster before sti.
sti # Re-enable interrupts after next insn.
movl -68(%esp),%esp # Load per-CPU stack from tss->esp0.

with appropriate fixups at the start of the NMI and debug trap handlers.

enjoy,
-- Jamie

2002-12-22 18:40:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On 22 Dec 2002, James H. Cloos Jr. wrote:
>
> I presume *%gs:0x18 is only for shared objects?

No, it's for everything, but it requires a glibc that has set it up.

Uli, do you make public snapshots available so that people can test the
new libraries and maybe see system-wide performance issues?

(It would also be good for testing - I've tried to be _very_ careful
inside the kernel, but in the end wide testing is always a good idea)

Linus

2002-12-22 18:45:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Sun, 22 Dec 2002, Ingo Molnar wrote:
>
> On Sat, 21 Dec 2002, Linus Torvalds wrote:
>
> > Saving and restoring eflags in user mode avoids all of these
> > complications, and means that there are no special cases. None. Zero.
> > Nada.
>
> and i'm 100% sure the more robust eflags saving will also avoid security
> holes. The amount of security-relevant complexity that comes from all the
> x86 features [and their combinations] is amazing.

I looked a bit at what it would take to have the TF bit handled by the
sysenter path, and it might not be so horrible - certainly not as ugly as
the register restore bits.

Jamie, if you want to do it, it looks like you could add a new "work" bit
in the thread flags, and add it to the _TIF_ALLWORK_MASK tests. At least
that way it wouldn't touch the regular code, and I don't think that the
result would have any strange "magic EIP" tests or anything horrible like
that ;)

Linus

2002-12-22 18:59:38

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> Uli, do you make public snapshots available so that people can test the
> new libraries and maybe see system-wide performance issues?

It is already available. I've announced it on the NPTL mailing list a
couple of days ago. There is no support without NPTL since the TLS
setup isn't present in sufficient form in the LinuxThreads code which
has to work on stone-old kernels. But the NPTL code is more than stable
enough to run on test systems. In fact, I've a complete system running
using it.

Announcement:

https://listman.redhat.com/pipermail/phil-list/2002-December/000387.html

It is not easy to build glibc and you can easily ruin your system. You
need very recent tools, the CVS version of glibc and the NPTL add-on.
See for instance

https://listman.redhat.com/pipermail/phil-list/2002-December/000352.html

for a recipe on how to build glibc and how to run binaries using it
*without* replacing your system's libc. That's safe, but
still the build is demanding. I know I'll be lynched again for saying
this, but it's the only experience I have: use RHL8 and get the very
latest tools (gcc, binutils) from rawhide. Then you should be fine.

If there is interest in RPMs of the binaries I might _try_ to provide
some. But this would mean replacing the system's libc.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-22 19:09:15

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

>>I presume *%gs:0x18 is only for shared objects?
>
>
> No, it's for everything, but it requires a glibc that has set it up.

Actually, the above is used only in the DSOs. In static objects I'm
using a global variable. This saves the %gs prefix.

But of course Linus is right: using the new functionality needs quite a
bit of infrastructure which most definitely isn't present in the libc
you have on your system. See my post from a few minutes ago.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-22 19:25:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Sun, 22 Dec 2002, Ulrich Drepper wrote:
>
> It is already available. I've announced it on the NPTL mailing list a
> couple of days ago.

Ok. I was definitely thinking of something rpm-like, since I know building
it is a bitch, and doing things wrong tends to result in systems that
don't work all that well.

> If there is interest in RPMs of the binaries I might _try_ to provide
> some. But this would mean replacing the system's libc.

I suspect that many people who test out 2.5.x kernels (and especially -bk
snapshots) don't find that too scary.

Linus

2002-12-22 19:42:35

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> Ok. I was definitely thinking of something rpm-like, since I know building
> it is a bitch, and doing things wrong tends to result in systems that
> don't work all that well.

I've talked to our guy producing the glibc RPMs and he said that he'll
produce them soon. We'll let people know when it happened.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-22 20:42:33

by James Cloos

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

>>>>> "Ulrich" == Ulrich Drepper <[email protected]> writes:

Ulrich> I've talked to our guy producing the glibc RPMs and he said
Ulrich> that he'll produce them soon. We'll let people know when it
Ulrich> happened.

I'd tend to prefer an LD_PRELOAD-able dso that just set up %gs and had
entries for each of the foo(2) over a full glibc rpm. I've only got
the one box to test on right now, but would like to see how well
sysenter[*] works.

-JimC

[*] Assuming I didn't just mix up the intel and amd opcodes....

2002-12-22 20:48:20

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

James H. Cloos Jr. wrote:

> I'd tend to prefer an LD_PRELOAD-able dso that just set up %gs and had
> entries for each of the foo(2) over a full glibc rpm.

This is not possible. The infrastructure is set up in the dynamic
linker. Read the mail with the references to the NPTL mailing list.
The second referenced mail contains a recipe for building glibc and then
using it in-place. This is not possible with binary RPMs in the way we
build them.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-23 04:55:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Sun, 22 Dec 2002, Linus Torvalds wrote:
>
> I looked a bit at what it would take to have the TF bit handled by the
> sysenter path, and it might not be so horrible - certainly not as ugly as
> the register restore bits.

Hey, I tried it out, and it does indeed turn out to be fairly easy and
clean (in fact, it's mostly four pretty obvious "one-liners").

Let nobody say I won't change my mind - you were right, Jamie (*). The
pushfl/popfl is unnecessary, and does show up in microbenchmarks.

How does the attached patch work for people? I've verified that
single-stepping works, and I've also verified that it does improve
performance for simple system calls. Everything looks quite simple.

Linus

(*) In fact, people sometimes complain that I change my mind way too
often. Hey, sue me.

-=-=-=

===== arch/i386/kernel/signal.c 1.22 vs edited =====
--- 1.22/arch/i386/kernel/signal.c Fri Dec 6 09:43:43 2002
+++ edited/arch/i386/kernel/signal.c Sun Dec 22 20:31:38 2002
@@ -609,6 +609,11 @@
void do_notify_resume(struct pt_regs *regs, sigset_t *oldset,
__u32 thread_info_flags)
{
+ /* Pending single-step? */
+ if (thread_info_flags & _TIF_SINGLESTEP) {
+ regs->eflags |= TF_MASK;
+ clear_thread_flag(TIF_SINGLESTEP);
+ }
/* deal with pending signal delivery */
if (thread_info_flags & _TIF_SIGPENDING)
do_signal(regs,oldset);
===== arch/i386/kernel/sysenter.c 1.4 vs edited =====
--- 1.4/arch/i386/kernel/sysenter.c Sat Dec 21 16:02:02 2002
+++ edited/arch/i386/kernel/sysenter.c Sun Dec 22 20:17:28 2002
@@ -54,19 +54,18 @@
0xc3 /* ret */
};
static const char sysent[] = {
- 0x9c, /* pushf */
0x51, /* push %ecx */
0x52, /* push %edx */
0x55, /* push %ebp */
0x89, 0xe5, /* movl %esp,%ebp */
0x0f, 0x34, /* sysenter */
+ 0x00, /* align return point */
/* System call restart point is here! (SYSENTER_RETURN - 2) */
0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
0x5d, /* pop %ebp */
0x5a, /* pop %edx */
0x59, /* pop %ecx */
- 0x9d, /* popf - restore TF */
0xc3 /* ret */
};
unsigned long page = get_zeroed_page(GFP_ATOMIC);
===== arch/i386/kernel/traps.c 1.36 vs edited =====
--- 1.36/arch/i386/kernel/traps.c Mon Nov 18 10:10:45 2002
+++ edited/arch/i386/kernel/traps.c Sun Dec 22 20:03:35 2002
@@ -605,7 +605,7 @@
* interface.
*/
if ((regs->xcs & 3) == 0)
- goto clear_TF;
+ goto clear_TF_reenable;
if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE)
goto clear_TF;
}
@@ -637,6 +637,8 @@
handle_vm86_trap((struct kernel_vm86_regs *) regs, error_code, 1);
return;

+clear_TF_reenable:
+ set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
clear_TF:
regs->eflags &= ~TF_MASK;
return;
===== include/asm-i386/thread_info.h 1.8 vs edited =====
--- 1.8/include/asm-i386/thread_info.h Fri Dec 6 09:43:43 2002
+++ edited/include/asm-i386/thread_info.h Sun Dec 22 20:30:28 2002
@@ -109,6 +109,7 @@
#define TIF_NOTIFY_RESUME 1 /* resumption notification requested */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
+#define TIF_SINGLESTEP 4 /* restore singlestep on return to user mode */
#define TIF_USEDFPU 16 /* FPU was used by this task this quantum (SMP) */
#define TIF_POLLING_NRFLAG 17 /* true if poll_idle() is polling TIF_NEED_RESCHED */

@@ -116,6 +117,7 @@
#define _TIF_NOTIFY_RESUME (1<<TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
+#define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP)
#define _TIF_USEDFPU (1<<TIF_USEDFPU)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)


2002-12-23 07:06:34

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:

> How does the attached patch work for people?

I've compiled glibc and ran the test suite without any problems.

--
--------------. ,-. 444 Castro Street
Ulrich Drepper \ ,-----------------' \ Mountain View, CA 94041 USA
Red Hat `--' drepper at redhat.com `---------------------------

2002-12-23 14:47:49

by kaih

[permalink] [raw]
Subject: Re: Freezing.. (was Re: Intel P6 vs P7 system call performance)

[email protected] (Mike Dresser) wrote on 18.12.02 in <Pine.LNX.4.33.0212181308380.11644-100000@router.windsormachine.com>:

> On Wed, 18 Dec 2002, Jeff Garzik wrote:
>
> > Linux... with the exception I guess that there are multiple peer Linii
>
> Perhaps this is the solution. Would someone please obtain a DNA sample
> from Linus?

It's been in the works for quite some time now, I gather, but the process
is expected to take maybe two decades more before the first candidate
becomes available.

It *was* announced here, however.

MfG Kai

2002-12-23 23:28:18

by Petr Vandrovec

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Sun, Dec 22, 2002 at 09:03:44PM -0800, Linus Torvalds wrote:
>
> How does the attached patch work for people? I've verified that
> single-stepping works, and I've also verified that it does improve
> performance for simple system calls. Everything looks quite simple.

> ===== arch/i386/kernel/sysenter.c 1.4 vs edited =====
> --- 1.4/arch/i386/kernel/sysenter.c Sat Dec 21 16:02:02 2002
> +++ edited/arch/i386/kernel/sysenter.c Sun Dec 22 20:17:28 2002
> @@ -54,19 +54,18 @@
> 0xc3 /* ret */
> };
> static const char sysent[] = {
> - 0x9c, /* pushf */
> 0x51, /* push %ecx */
> 0x52, /* push %edx */
> 0x55, /* push %ebp */
> 0x89, 0xe5, /* movl %esp,%ebp */
> 0x0f, 0x34, /* sysenter */
> + 0x00, /* align return point */
> /* System call restart point is here! (SYSENTER_RETURN - 2) */
> 0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */

Hi Linus,

The jump instruction should be 0xeb, 0xf9; with 0xeb, 0xfa it jumps into
the middle of movl %esp,%ebp because of the added alignment byte.
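
To spell it out - a quick sketch of the byte offsets in the stub quoted
above, where the jmp target is the end of the two-byte jmp plus its
signed 8-bit displacement:

/*
 *  0: 0x51          push %ecx
 *  1: 0x52          push %edx
 *  2: 0x55          push %ebp
 *  3: 0x89 0xe5     movl %esp,%ebp    <- intended restart target
 *  5: 0x0f 0x34     sysenter
 *  7: 0x00          (alignment byte)
 *  8: 0xeb XX       jmp, target = 10 + (signed char) XX
 * 10: 0x5d          pop %ebp
 *
 * XX = 0xfa = -6 gives 10 - 6 = 4, the middle of the movl;
 * XX = 0xf9 = -7 gives 10 - 7 = 3, its first byte.
 */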

Maybe the glibc tests also need something to check restarted syscalls...
Thanks,
Petr Vandrovec
[email protected]


--- linux/arch/i386/kernel/sysenter.c.orig 2002-12-24 00:23:41.000000000 +0100
+++ linux/arch/i386/kernel/sysenter.c 2002-12-24 00:23:50.000000000 +0100
@@ -61,7 +61,7 @@
0x0f, 0x34, /* sysenter */
0x00, /* align return point */
/* System call restart point is here! (SYSENTER_RETURN - 2) */
- 0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
+ 0xeb, 0xf9, /* jmp to "movl %esp,%ebp" */
/* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
0x5d, /* pop %ebp */
0x5a, /* pop %edx */

2002-12-24 00:15:55

by Stephen Rothwell

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, 24 Dec 2002 00:27:43 +0100 Petr Vandrovec <[email protected]> wrote:
>
> On Sun, Dec 22, 2002 at 09:03:44PM -0800, Linus Torvalds wrote:
> >
> > How does the attached patch work for people? I've verified that
> > single-stepping works, and I've also verified that it does improve
> > performance for simple system calls. Everything looks quite simple.
>
> > ===== arch/i386/kernel/sysenter.c 1.4 vs edited =====
> > --- 1.4/arch/i386/kernel/sysenter.c Sat Dec 21 16:02:02 2002
> > +++ edited/arch/i386/kernel/sysenter.c Sun Dec 22 20:17:28 2002
> > @@ -54,19 +54,18 @@
> > 0xc3 /* ret */
> > };
> > static const char sysent[] = {
> > - 0x9c, /* pushf */
> > 0x51, /* push %ecx */
> > 0x52, /* push %edx */
> > 0x55, /* push %ebp */
> > 0x89, 0xe5, /* movl %esp,%ebp */
> > 0x0f, 0x34, /* sysenter */
> > + 0x00, /* align return point */
> > /* System call restart point is here! (SYSENTER_RETURN - 2) */
> > 0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
>
> Hi Linus,
>
> Jump instruction should be 0xeb, 0xf9, with 0xeb, 0xfa it jumps into
> the middle of movl %esp,%ebp because of added alignment.

And if you change the 0x00 used for alignment to 0x90 (nop) you can
use gdb to disassemble that array of bytes to check any changes ...
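
Something like this throwaway user-space holder (the bytes copied from the
patch above, with Petr's 0xf9 fix and the padding byte already changed to
0x90) makes that easy: build it with "gcc -g", then "gdb ./a.out" and
"x/11i &stub" shows what the array really decodes to.

/* Not kernel code - just a container for the stub bytes so gdb can
 * disassemble them. */
static const unsigned char stub[] = {
	0x51,			/* push %ecx */
	0x52,			/* push %edx */
	0x55,			/* push %ebp */
	0x89, 0xe5,		/* movl %esp,%ebp */
	0x0f, 0x34,		/* sysenter */
	0x90,			/* nop (was 0x00) */
	0xeb, 0xf9,		/* jmp back to "movl %esp,%ebp" */
	0x5d,			/* pop %ebp */
	0x5a,			/* pop %edx */
	0x59,			/* pop %ecx */
	0xc3			/* ret */
};

int main(void)
{
	return stub[0];		/* keep the array referenced */
}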

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/

2002-12-24 04:01:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 24 Dec 2002, Stephen Rothwell wrote:
>
> And if you change the 0x00 use for alighment to 0x90 (nop) you can
> use gdb to disassemble that array of bytes to check any changes ...

Yeah, and I really should align the _normal_ return address (and not the
restart address).

Something like the appended, perhaps?

Linus

===== arch/i386/kernel/entry.S 1.45 vs edited =====
--- 1.45/arch/i386/kernel/entry.S Wed Dec 18 14:42:17 2002
+++ edited/arch/i386/kernel/entry.S Mon Dec 23 20:02:10 2002
@@ -233,7 +233,7 @@
#endif

/* Points to after the "sysenter" instruction in the vsyscall page */
-#define SYSENTER_RETURN 0xffffe00a
+#define SYSENTER_RETURN 0xffffe010

# sysenter call handler stub
ALIGN
===== arch/i386/kernel/sysenter.c 1.5 vs edited =====
--- 1.5/arch/i386/kernel/sysenter.c Sun Dec 22 21:12:23 2002
+++ edited/arch/i386/kernel/sysenter.c Mon Dec 23 20:04:33 2002
@@ -57,12 +57,17 @@
0x51, /* push %ecx */
0x52, /* push %edx */
0x55, /* push %ebp */
+ /* 3: backjump target */
0x89, 0xe5, /* movl %esp,%ebp */
0x0f, 0x34, /* sysenter */
- 0x00, /* align return point */
- /* System call restart point is here! (SYSENTER_RETURN - 2) */
- 0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
- /* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
+
+ /* 7: align return point with nop's to make disassembly easier */
+ 0x90, 0x90, 0x90, 0x90,
+ 0x90, 0x90, 0x90,
+
+ /* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
+ 0xeb, 0xf3, /* jmp to "movl %esp,%ebp" */
+ /* 16: System call normal return point is here! (SYSENTER_RETURN in entry.S) */
0x5d, /* pop %ebp */
0x5a, /* pop %edx */
0x59, /* pop %ecx */

2002-12-24 07:58:32

by Rogier Wolff

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 23, 2002 at 08:10:14PM -0800, Linus Torvalds wrote:
>
>
> On Tue, 24 Dec 2002, Stephen Rothwell wrote:
> >
> > And if you change the 0x00 use for alighment to 0x90 (nop) you can
> > use gdb to disassemble that array of bytes to check any changes ...
>
> Yeah, and I really should align the _normal_ return address (and not the
> restart address).
>
> Something like the appended, perhaps?
>
> Linus
>
> ===== arch/i386/kernel/entry.S 1.45 vs edited =====
> --- 1.45/arch/i386/kernel/entry.S Wed Dec 18 14:42:17 2002
> +++ edited/arch/i386/kernel/entry.S Mon Dec 23 20:02:10 2002
> @@ -233,7 +233,7 @@
> #endif
>
> /* Points to after the "sysenter" instruction in the vsyscall page */
> -#define SYSENTER_RETURN 0xffffe00a
> +#define SYSENTER_RETURN 0xffffe010
>
> # sysenter call handler stub
> ALIGN
> ===== arch/i386/kernel/sysenter.c 1.5 vs edited =====
> --- 1.5/arch/i386/kernel/sysenter.c Sun Dec 22 21:12:23 2002
> +++ edited/arch/i386/kernel/sysenter.c Mon Dec 23 20:04:33 2002
> @@ -57,12 +57,17 @@
> 0x51, /* push %ecx */
> 0x52, /* push %edx */
> 0x55, /* push %ebp */
> + /* 3: backjump target */
> 0x89, 0xe5, /* movl %esp,%ebp */
> 0x0f, 0x34, /* sysenter */
> - 0x00, /* align return point */
> - /* System call restart point is here! (SYSENTER_RETURN - 2) */
> - 0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
> - /* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
> +
> + /* 7: align return point with nop's to make disassembly easier */
> + 0x90, 0x90, 0x90, 0x90,
> + 0x90, 0x90, 0x90,
> +
> + /* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
> + 0xeb, 0xf3, /* jmp to "movl %esp,%ebp" */
> + /* 16: System call normal return point is here! (SYSENTER_RETURN in entry.S) */
> 0x5d, /* pop %ebp */
> 0x5a, /* pop %edx */
> 0x59, /* pop %ecx */

Ehmm, Linus,

Why do you want to align the return point? Why are jump-targets aligned?
Because they are faster. But why are they faster? Because the
cache-line fill is more efficient: the CPU might execute those
instructions, while it has a smaller chance of hitting the instructions
before the target.

In this case, I'd guess we'd have more benefit from the sysenter return
prefetching the sysenter cache line, than from prefetching the bunch
of noops just behind the return from syscall.

Now this is very hard to prove using a benchmark: In the benchmark
you'll quite likely run from a hot cache, and the cache line
effects are the things we would want to measure.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an *
* excursion: The stable situation does not include humans. ***************

2002-12-24 18:42:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 24 Dec 2002, Rogier Wolff wrote:
>
> Ehmm, Linus,
>
> Why do you want to align the return point? Why are jump-targets aligned?
> Because they are faster. But why are they faster? Because the
> cache-line fill is more efficient: the CPU might execute those
> instructions, while it has a smaller chance of hitting the instructions
> before the target.

Actually, no. Many CPU's apparently also have issues with instruction
decoding etc, where certain alignments (4 or 8-byte aligned) are better
simply because they feed the decode logic more efficiently.

Everything here fits in one cache-line, so clearly the cacheline issues
don't matter.

Linus

2002-12-24 19:28:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


Ok, one final optimization.

We have traditionally held ES/DS constant at __KERNEL_DS in the kernel,
and we've used that to avoid saving unnecessary segment registers over
context switches etc.

I realized that there is really no reason to use __KERNEL_DS for this, and
that as far as the kernel is concerned, the only thing that matters is
that it's a flat 32-bit segment. So we might as well make the kernel
always load ES/DS with __USER_DS instead, which has the advantage that we
can avoid one set of segment loads for the "sysenter/sysexit" case.

(We still need to load ES/DS at entry to the kernel, since we cannot rely
on user space not trying to do strange things. But once we load them with
__USER_DS, we at least don't need to restore them on return to user mode
any more, since "sysenter" only works in a flat 32-bit user mode anyway
(*)).

This doesn't matter much for a P4 (surprisingly, a P4 does very well
indeed on segment loads), but it does make a difference on PIII-class
CPU's.

This makes a PIII do a "getpid()" system call in something like 160
cycles (a P4 is at 430 cycles, oh well).

Ingo, would you mind taking a look at the patch, to see if you see any
paths where we don't follow the new segment register rules. It looks like
swsuspend isn't properly saving and restoring segment register contents,
so that will need double-checking (it wasn't correct before either, so
this doesn't make it any worse, at least).

Linus

(*) We could avoid even that initial load by instead _testing_ that the
values are the correct ones and jumping out if not, but I worry about vm86
mode being able to fool us with segments that have the right selectors but
the wrong segment caches. I disabled sysenter for vm86 mode, but it's so
subtle that I prefer just doing the segment loads rather than doing two
moves and comparisons.

###########################################
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/12/24 [email protected] 1.953
# Make the default values for DS/ES be the _user_ segment descriptors
# on x86 - the kernel doesn't really care (as long as it's all flat 32-bit),
# and it means that the return path for sysenter/sysexit can avoid re-loading
# the segment registers.
#
# NOTE! This means that _all_ kernel code (not just the sysenter path) must
# be appropriately changed, since the kernel knows the conventions and doesn't
# save/restore DS/ES internally on context switches etc.
# --------------------------------------------
#
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S Tue Dec 24 11:34:28 2002
+++ b/arch/i386/kernel/entry.S Tue Dec 24 11:34:28 2002
@@ -91,18 +91,21 @@
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
- movl $(__KERNEL_DS), %edx; \
+ movl $(__USER_DS), %edx; \
movl %edx, %ds; \
movl %edx, %es;

-#define RESTORE_REGS \
+#define RESTORE_INT_REGS \
popl %ebx; \
popl %ecx; \
popl %edx; \
popl %esi; \
popl %edi; \
popl %ebp; \
- popl %eax; \
+ popl %eax
+
+#define RESTORE_REGS \
+ RESTORE_INT_REGS; \
1: popl %ds; \
2: popl %es; \
.section .fixup,"ax"; \
@@ -271,9 +274,9 @@
movl TI_FLAGS(%ebx), %ecx
testw $_TIF_ALLWORK_MASK, %cx
jne syscall_exit_work
- RESTORE_REGS
- movl 4(%esp),%edx
- movl 16(%esp),%ecx
+ RESTORE_INT_REGS
+ movl 12(%esp),%edx
+ movl 24(%esp),%ecx
sti
sysexit

@@ -428,7 +431,7 @@
movl %esp, %edx
pushl %esi # push the error code
pushl %edx # push the pt_regs pointer
- movl $(__KERNEL_DS), %edx
+ movl $(__USER_DS), %edx
movl %edx, %ds
movl %edx, %es
call *%edi
diff -Nru a/arch/i386/kernel/head.S b/arch/i386/kernel/head.S
--- a/arch/i386/kernel/head.S Tue Dec 24 11:34:28 2002
+++ b/arch/i386/kernel/head.S Tue Dec 24 11:34:28 2002
@@ -235,12 +235,15 @@
lidt idt_descr
ljmp $(__KERNEL_CS),$1f
1: movl $(__KERNEL_DS),%eax # reload all the segment registers
- movl %eax,%ds # after changing gdt.
+ movl %eax,%ss # after changing gdt.
+
+ movl $(__USER_DS),%eax # DS/ES contains default USER segment
+ movl %eax,%ds
movl %eax,%es
+
+ xorl %eax,%eax # Clear FS/GS and LDT
movl %eax,%fs
movl %eax,%gs
- movl %eax,%ss
- xorl %eax,%eax
lldt %ax
cld # gcc2 wants the direction flag cleared at all times
#ifdef CONFIG_SMP
diff -Nru a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c Tue Dec 24 11:34:28 2002
+++ b/arch/i386/kernel/process.c Tue Dec 24 11:34:28 2002
@@ -219,8 +219,8 @@
regs.ebx = (unsigned long) fn;
regs.edx = (unsigned long) arg;

- regs.xds = __KERNEL_DS;
- regs.xes = __KERNEL_DS;
+ regs.xds = __USER_DS;
+ regs.xes = __USER_DS;
regs.orig_eax = -1;
regs.eip = (unsigned long) kernel_thread_helper;
regs.xcs = __KERNEL_CS;

2002-12-24 20:08:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Tue, 24 Dec 2002, Linus Torvalds wrote:

> Ingo, would you mind taking a look at the patch, to see if you see any
> paths where we don't follow the new segment register rules. It looks
> like swsuspend isn't properly saving and restoring segment register
> contents. so that will need double-checking (it wasn't correct before
> either, so this doesn't make it any worse, at least).

this reminds me of another related matter that is not fixed yet: a bug
that caused XFree86 to crash if it was linked against the new libpthreads -
in vm86 mode we did not save/restore %gs [and %fs] properly, which breaks
new-style threading. The attached patch is against the 2.4 backport of the
threading stuff, i'll do a 2.5 patch after christmas eve :-)

Ingo

--- linux/include/asm-i386/processor.h.orig 2002-12-06 11:49:24.000000000 +0100
+++ linux/include/asm-i386/processor.h 2002-12-06 11:52:39.000000000 +0100
@@ -388,6 +388,7 @@
struct vm86_struct * vm86_info;
unsigned long screen_bitmap;
unsigned long v86flags, v86mask, saved_esp0;
+ unsigned int saved_fs, saved_gs;
/* IO permissions */
int ioperm;
unsigned long io_bitmap[IO_BITMAP_SIZE+1];
--- linux/arch/i386/kernel/vm86.c.orig 2002-12-06 11:50:26.000000000 +0100
+++ linux/arch/i386/kernel/vm86.c 2002-12-06 11:53:40.000000000 +0100
@@ -113,6 +113,8 @@
tss = init_tss + smp_processor_id();
tss->esp0 = current->thread.esp0 = current->thread.saved_esp0;
current->thread.saved_esp0 = 0;
+ loadsegment(fs, current->thread.saved_fs);
+ loadsegment(gs, current->thread.saved_gs);
ret = KVM86->regs32;
return ret;
}
@@ -277,6 +279,9 @@
*/
info->regs32->eax = 0;
tsk->thread.saved_esp0 = tsk->thread.esp0;
+ asm volatile("movl %%fs,%0":"=m" (tsk->thread.saved_fs));
+ asm volatile("movl %%gs,%0":"=m" (tsk->thread.saved_gs));
+
tss = init_tss + smp_processor_id();
tss->esp0 = tsk->thread.esp0 = (unsigned long) &info->VM86_TSS_ESP0;


2002-12-24 20:19:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Tue, 24 Dec 2002, Linus Torvalds wrote:

> I realized that there is really no reason to use __KERNEL_DS for this,
> and that as far as the kernel is concerned, the only thing that matters
> is that it's a flat 32-bit segment. So we might as well make the kernel
> always load ES/DS with __USER_DS instead, which has the advantage that
> we can avoid one set of segment loads for the "sysenter/sysexit" case.

this basically hardcodes flat segment layout on x86. If anything (Wine?)
modifies the default segments, it can wrap syscalls by saving/restoring
the modified %ds and %es selectors explicitly.
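
(a rough sketch of such a wrapper, with made-up names and assuming the
vsyscall page at its fixed 0xffffe000 address - the point is only the
pushl/popl of the two selectors around the call:)

#include <stdio.h>
#include <asm/unistd.h>

/* Save and restore the caller's %ds/%es around the vsyscall-page call,
 * since the sysexit path now comes back with __USER_DS in both. */
static long wrapped_syscall0(int nr)
{
	long ret;
	asm volatile("pushl %%ds\n\t"
		     "pushl %%es\n\t"
		     "call  0xffffe000\n\t"
		     "popl  %%es\n\t"
		     "popl  %%ds"
		     : "=a" (ret)
		     : "0" (nr)
		     : "memory");
	return ret;
}

int main(void)
{
	printf("pid: %ld\n", wrapped_syscall0(__NR_getpid));
	return 0;
}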

Ingo

2002-12-24 20:18:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 24 Dec 2002, Ingo Molnar wrote:
>
> this reminds me of another related matter that is not fixed yet, which bug
> caused XFree86 to crash if it was linked against the new libpthreads - in
> vm86 mode we did not save/restore %gs [and %fs] properly, which breaks
> new-style threading. The attached patch is against the 2.4 backport of the
> threading stuff, i'll do a 2.5 patch after christmas eve :-)

Actually, pretty much nothing has changed in vm86 handling, so the patch
should work fine as-is on 2.5.x too.

Linus

2002-12-24 20:30:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance



On Tue, 24 Dec 2002, Ingo Molnar wrote:
>
> this basically hardcodes flat segment layout on x86. If anything (Wine?)
> modifies the default segments, it can wrap syscalls by saving/restoring
> the modified %ds and %es selectors explicitly.

Note that that was true even before this patch - you cannot use glibc
without having the default DS/ES settings anyway. I not only checked with
Uli, but gcc simply cannot generate code that has different segments for
stack and data, so if you have non-flat segments you had to either

- flatten them out before calling the standard library
- do your system calls directly by hand

And note how both of these still work fine (if you flatten things out it
trivially works, and if you do system calls by hand the old "int 0x80"
approach obviously doesn't change anything, and non-flat still works).

So the new code really only takes advantage of the fact that non-flat
wouldn't have worked with glibc in the first place, and without glibc you
don't see any difference in behaviour since it won't be using the new
calling convention.

Linus

2002-12-24 21:02:45

by Rogier Wolff

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Tue, Dec 24, 2002 at 10:51:11AM -0800, Linus Torvalds wrote:
>
> Everything here fits in one cache-line, so clearly the cacheline issues
> don't matter.

I'm getting old. Larger cache lines, you're right.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currently in such an *
* excursion: The stable situation does not include humans. ***************

2002-12-26 18:51:02

by Pavel Machek

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Hi!

> Ok, one final optimization.
>
> We have traditionally held ES/DS constant at __KERNEL_DS in the kernel,
> and we've used that to avoid saving unnecessary segment registers over
> context switches etc.
>
> I realized that there is really no reason to use __KERNEL_DS for this, and
> that as far as the kernel is concerned, the only thing that matters is
> that it's a flat 32-bit segment. So we might as well make the kernel
> always load ES/DS with __USER_DS instead, which has the advantage that we
> can avoid one set of segment loads for the "sysenter/sysexit" case.
>
> (We still need to load ES/DS at entry to the kernel, since we cannot rely
> on user space not trying to do strange things. But once we load them with
> __USER_DS, we at least don't need to restore them on return to user mode
> any more, since "sysenter" only works in a flat 32-bit user mode anyway
> (*)).
>
> This doesn't matter much for a P4 (surprisingly, a P4 does very well
> indeed on segment loads), but it does make a difference on PIII-class
> CPU's.
>
> This makes a PIII do a "getpid()" system call in something like 160
> cycles (a P4 is at 430 cycles, oh well).
>
> Ingo, would you mind taking a look at the patch, to see if you see any
> paths where we don't follow the new segment register rules. It looks like
> swsuspend isn't properly saving and restoring segment register contents.
> so that will need double-checking (it wasn't correct before either, so
> this doesn't make it any worse, at least).

Does this look like fixing it?
Pavel

--- clean/arch/i386/kernel/suspend_asm.S 2002-12-18 22:20:47.000000000 +0100
+++ linux-swsusp/arch/i386/kernel/suspend_asm.S 2002-12-26 08:45:34.000000000 +0100
@@ -64,9 +64,10 @@
jb .L1455
.p2align 4,,7
.L1453:
- movl $104,%eax
+ movl $__USER_DS,%eax

movw %ax, %ds
+ movw %ax, %es
movl saved_context_esp, %esp
movl saved_context_ebp, %ebp
movl saved_context_eax, %eax


--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?

2002-12-27 17:37:59

by kaih

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

[email protected] (Linus Torvalds) wrote on 23.12.02 in <[email protected]>:

> Something like the appended, perhaps?
>
> Linus
>
> ===== arch/i386/kernel/entry.S 1.45 vs edited =====
> --- 1.45/arch/i386/kernel/entry.S Wed Dec 18 14:42:17 2002
> +++ edited/arch/i386/kernel/entry.S Mon Dec 23 20:02:10 2002
> @@ -233,7 +233,7 @@
> #endif
>
> /* Points to after the "sysenter" instruction in the vsyscall page */
> -#define SYSENTER_RETURN 0xffffe00a
> +#define SYSENTER_RETURN 0xffffe010
>
> # sysenter call handler stub
> ALIGN
> ===== arch/i386/kernel/sysenter.c 1.5 vs edited =====
> --- 1.5/arch/i386/kernel/sysenter.c Sun Dec 22 21:12:23 2002
> +++ edited/arch/i386/kernel/sysenter.c Mon Dec 23 20:04:33 2002
> @@ -57,12 +57,17 @@
> 0x51, /* push %ecx */
> 0x52, /* push %edx */
> 0x55, /* push %ebp */
> + /* 3: backjump target */
> 0x89, 0xe5, /* movl %esp,%ebp */
> 0x0f, 0x34, /* sysenter */
> - 0x00, /* align return point */

Also 0x90 here?

> - /* System call restart point is here! (SYSENTER_RETURN - 2) */
> - 0xeb, 0xfa, /* jmp to "movl %esp,%ebp" */
> - /* System call normal return point is here! (SYSENTER_RETURN in entry.S) */
> +
> + /* 7: align return point with nop's to make disassembly easier */
> + 0x90, 0x90, 0x90, 0x90,
> + 0x90, 0x90, 0x90,
> +
> + /* 14: System call restart point is here! (SYSENTER_RETURN - 2) */
> + 0xeb, 0xf3, /* jmp to "movl %esp,%ebp" */
> + /* 16: System call normal return point is here! (SYSENTER_RETURN in entry.S) */
> 0x5d, /* pop %ebp */
> 0x5a, /* pop %edx */
> 0x59, /* pop %ecx */


MfG Kai

2002-12-28 01:58:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> Note that that was true even before this patch - you cannot use glibc
> without having the default DS/ES settings anyway. I not only checked with
> Uli, but gcc simply cannot generate code that has different segments for
> stack and data, so if you have non-flat segments you had to either

More importantly, SYSENTER hardcodes flat layout.

-hpa


2002-12-28 01:57:14

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Ingo Molnar wrote:
> On Tue, 24 Dec 2002, Linus Torvalds wrote:
>
>
>>I realized that there is really no reason to use __KERNEL_DS for this,
>>and that as far as the kernel is concerned, the only thing that matters
>>is that it's a flat 32-bit segment. So we might as well make the kernel
>>always load ES/DS with __USER_DS instead, which has the advantage that
>>we can avoid one set of segment loads for the "sysenter/sysexit" case.
>
>
> this basically hardcodes flat segment layout on x86. If anything (Wine?)
> modifies the default segments, it can wrap syscalls by saving/restoring
> the modified %ds and %es selectors explicitly.
>

I don't think you can modify the GDT segments.

-hpa

P.S. Please don't use my @transmeta.com address for non-Transmeta
business. I'm trying very hard to keep my mailboxes semi-organized.



2002-12-28 20:28:56

by Ville Herva

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Mon, Dec 09, 2002 at 11:46:47AM -0800, you [H. Peter Anvin] wrote:
>
> SYSCALL is AMD. SYSENTER is Intel, and is likely to be significantly

Now that Linus has killed the dragon and everybody seems happy with the
shiny new SYSENTER code, let just add one more stupid question to this
thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?


-- v --

[email protected]

2002-12-29 01:56:53

by Christian Leber

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Sat, Dec 28, 2002 at 10:37:06PM +0200, Ville Herva wrote:

> Now that Linus has killed the dragon and everybody seems happy with the
> shiny new SYSENTER code, let just add one more stupid question to this
> thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?

Yes, the output of the program Linus posted looks like this on a
Duron 750 with 2.5.53:

igor3:~# ./a.out
187.894946 cycles (call 0xffffe000)
299.155075 cycles (int 80)

(cycles per getpid() call)


Christian Leber

--
"Omnis enim res, quae dando non deficit, dum habetur et non datur,
nondum habetur, quomodo habenda est." (Aurelius Augustinus)
Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>

2002-12-30 11:22:16

by Dave Jones

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Sat, Dec 28, 2002 at 10:37:06PM +0200, Ville Herva wrote:

> > SYSCALL is AMD. SYSENTER is Intel, and is likely to be significantly
> Now that Linus has killed the dragon and everybody seems happy with the
> shiny new SYSENTER code, let just add one more stupid question to this
> thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?

It's something I wondered about too. Even if it isn't a win for K7,
it's possible that the K6 family may benefit from SYSCALL support.
Maybe even the K5, if it was around that early? (too lazy to check pdf's)

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-12-30 12:58:14

by Manfred Spraul

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

DaveJ wrote:

>On Sat, Dec 28, 2002 at 10:37:06PM +0200, Ville Herva wrote:
>
> > > SYSCALL is AMD. SYSENTER is Intel, and is likely to be significantly
> > Now that Linus has killed the dragon and everybody seems happy with the
> > shiny new SYSENTER code, let just add one more stupid question to this
> > thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> > SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?
>
>Its something I wondered about too. Even if it isn't a win for K7,
>it's possible that the K6 family may benefit from SYSCALL support.
>Maybe even the K5 if it was around that early ? (too lazy to check pdf's)
>
>

I looked at SYSCALL once, and noticed some problems:

- it doesn't even load ESP with a kernel value, a task gate for NMI is
mandatory.
- SMP support is only possible with a per-cpu entry point with
(boot-time) fixups to the address where the entry point can find the
kernel stack.
- The AMD docs contain one odd sentence:
"The CS and SS registers must not be modified by the operating system
between the execution of the SYSCALL and the corresponding SYSRET
instruction".
Is SYSCALL+iretd permitted? That's needed for execve, iopl, task
switches, signal delivery.
What about interrupts during SYSCALLs? NMI to taskgate?

Either that sentence is just wrong, or SYSCALL is unusable.

It's not supported by the K5 cpu:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/20734.pdf

--
Manfred

2002-12-30 14:46:55

by Andi Kleen

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Manfred Spraul <[email protected]> writes:

> - The AMD docs contain one odd sentence:
> "The CS and SS registers must not be modified by the operating system
> between the execution of the SYSCALL and the corresponding SYSRET
> instruction".

As I understand it:

SYSCALL does not actually load the new CS/SS from the GDT,
but just sets the internal base/limit/descriptor-valid registers to
"flat" values, and SYSRET undoes this. When SYSRET is executed the
selectors should still be the same, otherwise the resulting state may not be
exactly the same as when SYSCALL happened.

But if you make sure that CS/SS have the same state on SYSRET then
everything should be ok. This means a context switch to a process with
possibly different CS/SS values should be ok.

The GDT is guaranteed to stay constant and SYSCALL forbids use of an
LDT for SS/CS.

I have not actually tried this on an Athlon, but on x86-64 with
entering long mode it works (including context switches to processes with
different segments etc.)

To add to the confusion, there are three slightly different
SYSCALL flavours:

K6 (unusable iirc), Athlon/Hammer with 32bit OS
(would work, but they have SYSENTER too so it makes sense to just share code
with Intel), Hammer with 64bit OS (working, has to use SYSCALL from both
32bit and 64bit processes)

They are different in what registers they clobber and how EFLAGS is handled
and some other details.

On x86-64 SYSCALL is the only and native system call entry instruction
for 64bit processes. The only reason to use SYSCALL from 32bit programs is
that on a x86-64 SYSENTER from 32bit processes to 64bit kernels is not
supported, so the 2.5.53 x86-64 kernel implements the AT_SYSINFO vsyscall
page using SYSCALL.

> Is SYSCALL+iretd permitted? That's needed for execve, iopl, task

Yes, it works.

> switches, signal delivery.
> What about interrupts during SYSCALLs?

SYSCALL in 32bit mode turns off IF on entry (in long mode it is configurable
using a MSR)

-Andi

2002-12-30 18:14:14

by Christian Leber

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

On Sun, Dec 29, 2002 at 03:05:10AM +0100, Christian Leber wrote:

> > Now that Linus has killed the dragon and everybody seems happy with the
> > shiny new SYSENTER code, let just add one more stupid question to this
> > thread: has anyone made benchmarks on SYSCALL/SYSENTER/INT80 on Athlon? Is
> > SYSCALL worth doing separately for Athlon (and perhaps Hammer/32-bit mode)?
>
> Yes, the output of the programm Linus posted is on a Duron 750 with
> 2.5.53 like this:
>
> igor3:~# ./a.out
> 187.894946 cycles (call 0xffffe000)
> 299.155075 cycles (int 80)
> (cycles per getpid() call)

Damn, wrong lines - those were numbers from 2.5.52-bk2+sysenter-patch.

But now the right and interesting lines:

2.5.53:
igor3:~# ./a.out
166.283549 cycles
278.461609 cycles

2.5.53-bk5:
igor3:~# ./a.out
150.895348 cycles
279.441955 cycles

The question is: are the numbers correct?
(I don't know if the TSC thing is actually right)

And why has int 80 also gotten faster?


Is this a valid test program to find out how long a system call takes?
igor3:~# cat sysc.c
#include <stdio.h>
#include <unistd.h>

#define rdtscl(low) \
__asm__ __volatile__ ("rdtsc" : "=a" (low) : : "edx")

int getpiddd()
{
int i=0; return i+10;
}

int main(int argc, char **argv) {
long a,b,c,d;
int i1,i2,i3;

rdtscl(a);
i1 = getpiddd(); //just to see how long a simple function takes
rdtscl(b);
i2 = getpid();
rdtscl(c);
i3 = getpid();
rdtscl(d);
printf("function call: %lu first: %lu second: %lu cycles\n",b-a,c-b,d-c);
return 0;
}

I link it against a slightly modified (1 line of code) dietlibc:
igor3:~# dietlibc-0.22/bin-i386/diet gcc sysc.c
igor3:~# ./a.out
function call: 42 first: 1821 second: 169 cycles

I heard that there are serious problems involved with the TSC, so I
don't know if the numbers are correct/make sense.


Christian Leber

--
"Omnis enim res, quae dando non deficit, dum habetur et non datur,
nondum habetur, quomodo habenda est." (Aurelius Augustinus)
Translation: <http://gnuhh.org/work/fsf-europe/augustinus.html>

2002-12-30 21:14:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

In article <[email protected]>,
Christian Leber <[email protected]> wrote:
>
>But now the right and interesting lines:
>
>2.5.53:
>igor3:~# ./a.out
>166.283549 cycles
>278.461609 cycles
>
>2.5.53-bk5:
>igor3:~# ./a.out
>150.895348 cycles
>279.441955 cycles
>
>The question is: are the numbers correct?

Roughly. The program I posted has some overflow errors (which you will
hit if testing expensive system calls that take >4000 cycles). It also
does an average, which is "mostly correct", but not stable if there is
some load on the machine. The right way to do timings like this is
probably to do minimums for individual calls, and then subtract out the
TSC reading overhead. See attached silly program.

>And why have int 80 also gotten faster?

Random luck. Sometimes you get cacheline alignment magic etc. Or just
because the timings aren't stable for other reasons (background
processes etc).

>Is this a valid testprogramm to find out how long a system call takes?

Not really. The results won't be stable, since you might have cache
misses, page faults, other processes, whatever.

So you'll get _somewhat_ correct numbers, but they may be randomly off.

Linus

---
#include <sys/types.h>
#include <time.h>
#include <sys/time.h>
#include <sys/fcntl.h>
#include <asm/unistd.h>
#include <sys/stat.h>
#include <stdio.h>

#define rdtsc() ({ unsigned long a, d; asm volatile("rdtsc":"=a" (a), "=d" (d)); a; })

// for testing _just_ system call overhead.
//#define __NR_syscall __NR_stat64
#define __NR_syscall __NR_getpid

#define NR (100000)

int main()
{
int i, ret;
unsigned long fast = ~0UL, slow = ~0UL, overhead = ~0UL;
struct timeval x,y;
char *filename = "test";
struct stat st;
int j;

for (i = 0; i < NR; i++) {
unsigned long cycles = rdtsc();
asm volatile("");
cycles = rdtsc() - cycles;
if (cycles < overhead)
overhead = cycles;
}

printf("overhead: %6d\n", overhead);

for (j = 0; j < 10; j++)
for (i = 0; i < NR; i++) {
unsigned long cycles = rdtsc();
asm volatile("call 0xffffe000"
:"=a" (ret)
:"0" (__NR_syscall),
"b" (filename),
"c" (&st));
cycles = rdtsc() - cycles;
if (cycles < fast)
fast = cycles;
}

fast -= overhead;
printf("sysenter: %6d cycles\n", fast);

for (i = 0; i < NR; i++) {
unsigned long cycles = rdtsc();
asm volatile("int $0x80"
:"=a" (ret)
:"0" (__NR_syscall),
"b" (filename),
"c" (&st));
cycles = rdtsc() - cycles;
if (cycles < slow)
slow = cycles;
}

slow -= overhead;
printf("int0x80: %6d cycles\n", slow);
printf(" %6d cycles difference\n", slow-fast);
printf("factor %f\n", (double) slow / fast);
}


2003-01-10 11:23:19

by Gabriel Paubert

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance




On Tue, 24 Dec 2002, Linus Torvalds wrote:

[That's old, I know. I'm slowly catching up on my email backlog after
almost 3 weeks away]

>
> Ok, one final optimization.
>
> We have traditionally held ES/DS constant at __KERNEL_DS in the kernel,
> and we've used that to avoid saving unnecessary segment registers over
> context switches etc.
>
> I realized that there is really no reason to use __KERNEL_DS for this, and
> that as far as the kernel is concerned, the only thing that matters is
> that it's a flat 32-bit segment. So we might as well make the kernel
> always load ES/DS with __USER_DS instead, which has the advantage that we
> can avoid one set of segment loads for the "sysenter/sysexit" case.
>
> (We still need to load ES/DS at entry to the kernel, since we cannot rely
> on user space not trying to do strange things. But once we load them with
> __USER_DS, we at least don't need to restore them on return to user mode
> any more, since "sysenter" only works in a flat 32-bit user mode anyway
> (*)).

We cannot rely either on userspace not setting the NT bit in eflags. While
it won't cause an oops since the only instruction which ever depends on
it, iret, has a handler (which needs to be patched, see below),
I'm absolutely not convinced that all code paths are "NT safe" ;-)

For example, set NT and then execute sysenter with garbage in %eax, the
kernel will try to return (-ENOSYS) with iret and kill the task. As long
as it only allows a task to kill itself, it's not a big deal. But NT is
not cleared across task switches unless I miss something, and that looks
very dangerous.
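
(Setting it from user space is as trivial as this sketch, which is the
whole point:)

/* Set NT (bit 14 of EFLAGS) from user mode - nothing stops this. */
static void set_nt(void)
{
	asm volatile("pushfl\n\t"
		     "orl $0x4000, (%%esp)\n\t"
		     "popfl"
		     : : : "cc", "memory");
}

int main(void)
{
	set_nt();
	return 0;
}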

It's so complex that I'm not sure that clearing NT in __switch_to is
sufficient, but clearing it in every sysenter path will make clock cycles
accountants scream (the only way is through popfl).

>
> This doesn't matter much for a P4 (surprisingly, a P4 does very well
> indeed on segment loads), but it does make a difference on PIII-class
> CPU's.
>
> This makes a PIII do a "getpid()" system call in something like 160
> cycles (a P4 is at 430 cycles, oh well).
>
> Ingo, would you mind taking a look at the patch, to see if you see any
> paths where we don't follow the new segment register rules. It looks like
> swsuspend isn't properly saving and restoring segment register contents.
> so that will need double-checking (it wasn't correct before either, so
> this doesn't make it any worse, at least).

I'm no Ingo, unfortunately, but you'll need at least the following patch
(the second hunk is only a typo fix) to the iret exception recovery code,
which used pushes and pops to get the smallest possible code size.

That's a minimal patch, let me know if you prefer to have a single copy of
the exception handler for all instances of RESTORE_ALL.

===== entry.S 1.49 vs edited =====
--- 1.49/arch/i386/kernel/entry.S Sat Jan 4 19:06:07 2003
+++ edited/entry.S Fri Jan 10 02:12:00 2003
@@ -126,10 +126,9 @@
addl $4, %esp; \
1: iret; \
.section .fixup,"ax"; \
-2: pushl %ss; \
- popl %ds; \
- pushl %ss; \
- popl %es; \
+2: movl $(__USER_DS), %edx; \
+ movl %edx, %ds; \
+ movl %edx, %es; \
pushl $11; \
call do_exit; \
.previous; \
@@ -225,7 +224,7 @@
movl TI_FLAGS(%ebx), %ecx # need_resched set ?
testb $_TIF_NEED_RESCHED, %cl
jz restore_all
- testl $IF_MASK,EFLAGS(%esp) # interrupts off (execption path) ?
+ testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ?
jz restore_all
movl $PREEMPT_ACTIVE,TI_PRE_COUNT(%ebx)
sti


Regards,
Gabriel.

2003-01-10 17:09:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance


On Fri, 10 Jan 2003, Gabriel Paubert wrote:
>
> We cannot rely either on userspace not setting NT bit in eflags. While
> it won't cause an oops since the only instruction which ever depends on
> it, iret, has a handler (which needs to be patched, see below),
> I'm absolutely not convinced that all code paths are "NT safe" ;-)

It shouldn't matter.

NT is only tested by "iret", and if somebody sets NT in user space they
get exactly what they deserve.

> For example, set NT and then execute sysenter with garbage in %eax, the
> kernel will try to return (-ENOSYS) with iret and kill the task. As long
> as it only allows a task to kill itself, it's not a big deal. But NT is
> not cleared across task switches unless I miss something, and that looks
> very dangerous.

It _is_ cleared by task-switching these days. Or rather, it's saved and
restored, so the original NT setter will get it restored when resumed.

> I'm no Ingo, unfortunately, but you'll need at least the following patch
> (the second hunk is only a typo fix) to the iret exception recovery code,
> which used push and pops to get the smallest possible code size.

Good job.

Linus

2003-01-10 18:01:17

by Gabriel Paubert

[permalink] [raw]
Subject: Re: Intel P6 vs P7 system call performance

Linus Torvalds wrote:
> It shouldn't matter.
>
> NT is only tested by "iret", and if somebody sets NT in user space they
> get exactly what they deserve.

Indeed. I realized after I sent the previous mail that I had missed the
flags save/restore in switch_to :-(

Still, does this mean that there is some micro optimization opportunity in
the lcall7/lcall27 handlers to remove the popfl? After all TF is now
handled by some magic in do_debug unless I miss (again) something,
NT has become irrelevant, and cld in SAVE_ALL takes care of DF.

In short something like the following (I just love patches which only
remove code):

===== entry.S 1.51 vs edited =====
--- 1.51/arch/i386/kernel/entry.S Mon Jan 6 04:54:58 2003
+++ edited/entry.S Fri Jan 10 18:57:42 2003
@@ -156,16 +156,6 @@
movl %edx,EIP(%ebp) # Now we move them to their "normal" places
movl %ecx,CS(%ebp) #

- #
- # Call gates don't clear TF and NT in eflags like
- # traps do, so we need to do it ourselves.
- # %eax already contains eflags (but it may have
- # DF set, clear that also)
- #
- andl $~(DF_MASK | TF_MASK | NT_MASK),%eax
- pushl %eax
- popfl
-
andl $-8192, %ebp # GET_THREAD_INFO
movl TI_EXEC_DOMAIN(%ebp), %edx # Get the execution domain
call *4(%edx) # Call the lcall7 handler for the domain


>>For example, set NT and then execute sysenter with garbage in %eax, the
>>kernel will try to return (-ENOSYS) with iret and kill the task. As long
>>as it only allows a task to kill itself, it's not a big deal. But NT is
>>not cleared across task switches unless I miss something, and that looks
>>very dangerous.
>
>
> It _is_ cleared by task-switching these days. Or rather, it's saved and
> restored, so the original NT setter will get it restored when resumed.

Yeah, sorry for the noise.

>
>
>>I'm no Ingo, unfortunately, but you'll need at least the following patch
>>(the second hunk is only a typo fix) to the iret exception recovery code,
>>which used push and pops to get the smallest possible code size.
>
>
> Good job.

That was too easy since I did originally suggest the push/pop sequence :-)

Gabriel.