2001-04-28 15:54:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)


On Sat, 28 Apr 2001, Andi Kleen wrote:

> You can also just use the cycle counter directly in most modern CPUs.
> It can be read with a single instruction. In fact modern glibc will do
> it for you when you use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...)

well, it's not reliable while using things like APM, so i'd not recommend
to depend on it too much.

Ingo



2001-04-28 19:53:51

by Andi Kleen

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sat, Apr 28, 2001 at 05:52:42PM +0200, Ingo Molnar wrote:
>
> On Sat, 28 Apr 2001, Andi Kleen wrote:
>
> > You can also just use the cycle counter directly in most modern CPUs.
> > It can be read with a single instruction. In fact modern glibc will do
> > it for you when you use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...)
>
> well, it's not reliable while using things like APM, so i'd not recommend
> to depend on it too much.

*If* you use APM on your server boxes. Not likely even when it doesn't have more than one CPU
and it can be checked at runtime.

I guess glibc could also regularly (every 10 calls or so) call regular gettimeofday
to recheck synchronization; at least for a web server that potential inaccuracy would
be acceptable ("best effort") and the cost of the system call is 1/10.

In x86-64 there are special vsyscalls btw to solve this problem that export
a lockless kernel gettimeofday()

-Andi

2001-04-28 22:58:18

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Andi Kleen writes:
> On Sat, Apr 28, 2001 at 05:52:42PM +0200, Ingo Molnar wrote:
> >
> > On Sat, 28 Apr 2001, Andi Kleen wrote:
> >
> > > You can also just use the cycle counter directly in most modern CPUs.
> > > It can be read with a single instruction. In fact modern glibc will do
> > > it for you when you use clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...)
> >
> > well, it's not reliable while using things like APM, so i'd not recommend
> > to depend on it too much.
>
> *If* you use APM on your server boxes. Not likely even when it doesn't have more than one CPU
> and it can be checked at runtime.
>
> I guess glibc could also regularly (every 10 calls or so) call
> regular gettimeofday to recheck synchronization; at least for a web
> server that potential inaccuracy would be acceptable ("best effort")
> and the cost of the system call is 1/10.
>
> In x86-64 there are special vsyscalls btw to solve this problem that export
> a lockless kernel gettimeofday()

Whatever happened to that hack that was discussed a year or two ago?
The one where (also on IA32) a magic page was set up by the kernel
containing code for fast system calls, and the kernel would write
calibation information to that magic page. The code written there
would use the TSC in conjunction with that calibration data.

There was much discussion about this idea, even Linus was keen on
it. But IIRC, nothing ever happened.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 05:13:43

by H. Peter Anvin

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Followup to: <[email protected]>
By author: Richard Gooch <[email protected]>
In newsgroup: linux.dev.kernel
> >
> > In x86-64 there are special vsyscalls btw to solve this problem that export
> > a lockless kernel gettimeofday()
>
> Whatever happened to that hack that was discussed a year or two ago?
> The one where (also on IA32) a magic page was set up by the kernel
> containing code for fast system calls, and the kernel would write
> calibation information to that magic page. The code written there
> would use the TSC in conjunction with that calibration data.
>
> There was much discussion about this idea, even Linus was keen on
> it. But IIRC, nothing ever happened.
>

We discussed this at the Summit, not a year or two ago. x86-64 has
it, and it wouldn't be too bad to do in i386... just noone did.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-29 11:14:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

"H. Peter Anvin" wrote:
>
> Followup to: <[email protected]>
> By author: Richard Gooch <[email protected]>
> In newsgroup: linux.dev.kernel
> > >
> > > In x86-64 there are special vsyscalls btw to solve this problem that export
> > > a lockless kernel gettimeofday()
> >
> > Whatever happened to that hack that was discussed a year or two ago?
> > The one where (also on IA32) a magic page was set up by the kernel
> > containing code for fast system calls, and the kernel would write
> > calibation information to that magic page. The code written there
> > would use the TSC in conjunction with that calibration data.
> >
> > There was much discussion about this idea, even Linus was keen on
> > it. But IIRC, nothing ever happened.
> >
>
> We discussed this at the Summit, not a year or two ago. x86-64 has
> it, and it wouldn't be too bad to do in i386... just noone did.

It came up long before that. I refer to the technique in a post dated
Nov 17, even though I can't find the original.
http://www.mail-archive.com/[email protected]/msg13584.html

Initiated by a post from (iirc) Dean Gaudet, we found out that
gettimeofday was one particular system call in the Apache fast path that
couldn't be optimized well, or moved out of the fast path. After a
couple of suggestions for improving things, Linus chimed in with the
magic page suggestion.

--
Jeff Garzik | Game called on account of naked chick
Building 1024 |
MandrakeSoft |

2001-04-29 11:28:19

by David Miller

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)


Jeff Garzik writes:
> After a couple of suggestions for improving things, Linus chimed in
> with the magic page suggestion.

Since this is being brought up again, I want to mention something.

If we are going to map in a page like this, there are other cool
things one could do with this page. We should keep it at _1_ page
so people don't go crazy with ideas of stuff to put here btw...

The idea is that the one thing one tends to optimize for new cpus
is the memcpy/memset implementation. What better way to shield
libc from having to be updated for new cpus but to put it into
the kernel in this magic page?

There is a secondary effect to doing this on systems with physically
indexed caches (read as: most if not all x86 cpus today), the kernel's
memcpy/memset call icache usage can be shared with the user.

This also allows things like "kernel disabled cpu feature XYZ because
of a hardware bug, so instead of the usual optimized memcpy for this
processor, memcpy FOO is now faster since the feature is disabled, so
that is what we'll use" Really, libc shouldn't know things like this.

I thought about doing something along these lines on sparc64 sometime
around the next to last Linux EXPO held in North Caroline (the one
which was on the Duke university campus). In fact I believe I
remember specifically mentioning this idea to Jakub Jelinek during
that conference. It's particularly attractive on sparc64 because you
can use a "global" TLB entry which is thus shared between all address
spaces.

Later,
David S. Miller
[email protected]

2001-04-29 13:32:58

by Ingo Oeser

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 04:27:48AM -0700, David S. Miller wrote:
> The idea is that the one thing one tends to optimize for new cpus
> is the memcpy/memset implementation. What better way to shield
> libc from having to be updated for new cpus but to put it into
> the kernel in this magic page?

Hehe, you have read this MXT patch on linux-mm, too? ;-)

There we have 10x faster memmove/memcpy/bzero for 1K blocks
granularity (== alignment is 1K and size is multiple of 1K), that
is done by the memory controller.

This can only be done in the kernel, because it is critical we
access here.

Good idea.

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-29 16:22:58

by dean gaudet

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, 29 Apr 2001, Jeff Garzik wrote:

> "H. Peter Anvin" wrote:
> > We discussed this at the Summit, not a year or two ago. x86-64 has
> > it, and it wouldn't be too bad to do in i386... just noone did.
>
> It came up long before that. I refer to the technique in a post dated
> Nov 17, even though I can't find the original.
> http://www.mail-archive.com/[email protected]/msg13584.html
>
> Initiated by a post from (iirc) Dean Gaudet, we found out that
> gettimeofday was one particular system call in the Apache fast path that
> couldn't be optimized well, or moved out of the fast path. After a
> couple of suggestions for improving things, Linus chimed in with the
> magic page suggestion.

heheh. i can't claim that i was the first ever to think of this. but
here's the post i originally made on the topic. iirc a few folks said
"security horror!"... then last year ingo and linus (and probably others)
came up with a scheme everyone was happy with.

i was kind of solving a different problem with the code page though -- the
ability to use rdtsc on SMP boxes with processors of varying speeds and
synchronizations.

-dean

>From [email protected] Sun Apr 29 09:14:20 2001
Date: Mon, 11 May 1998 18:28:46 -0700 (PDT)
From: Dean Gaudet <[email protected]>
To: [email protected]
Subject: Re: do_fast_gettimeoffset oops explained
X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information
regarding copyright and disclaimer.

On 12 May 1998, Linus Torvalds wrote:

> And if you wonder why we care, then the reason is simple: there are
> real-world cases where a large fraction of our CPU time is spent getting
> timestamps. The reason gettimeofday() was optimized is that it actually
> showed up very clearly on system profiles.
>
> For example, X tends to timestamp each and every event it gets. And
> getting accurate benchmark numbers implies having an accurate clock: the
> "fast" gettimeoffset is not only 5 times faster than the slow one, it
> also gives more precision because it doesn't have to go outside the
> (fast and accurate) CPU to the (slow and less accurate) timer chip.

apache w/NSPR threading is doing gettimeofday() left and right too (it's
used after poll() to figure out how much time elapsed)... so much that
I was talking to Ingo about ways to make it faster... and came up with
a user-space method of using RDTSC which can handle changes to the
system clock. In a nutshell it requires a /dev/calibrate (or whatever
you want to call it) which is mmappable -- you need the "epoch" value (the
time that cycle 0 occured at), and the "cycles per microsecond" value.

I suppose that isn't too revolutionary... what had me stumped for a
while though, was how to do this on SMP boxes, I was assuming their
TSCs weren't synchronized (Ingo tells me they are on Intel). In case it
happens elsewhere, here's my idea. Use a separate v->p mapping for the
/dev/calibrate page on each processor. It's marked read-only of course.
In order to handle atomicity (can't take a task switch while in the
middle of using the "epoch" and "cycles per microsecond" constants), put
the code which actually calculates the time of day on the /dev/calibrate
page itself. The kernel notices EIP on this page when it's switching
away from a task, and completes the call in the kernel prior to switching.
(It only needs to futz the stack a bit -- unroll a stack frame and set
edx:eax... it can do it right in the saved registers.)

Note that this trick provides for more "user space system calls"... I
imagine a bunch of the signal routines such as sigprocmask and sigaction
could actually be done through routines on a special read-only page.
The kernel deals with atomicity only when it needs to.

Dean


2001-04-29 18:48:30

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Ingo Oeser writes:
> On Sun, Apr 29, 2001 at 04:27:48AM -0700, David S. Miller wrote:
> > The idea is that the one thing one tends to optimize for new cpus
> > is the memcpy/memset implementation. What better way to shield
> > libc from having to be updated for new cpus but to put it into
> > the kernel in this magic page?
>
> Hehe, you have read this MXT patch on linux-mm, too? ;-)
>
> There we have 10x faster memmove/memcpy/bzero for 1K blocks
> granularity (== alignment is 1K and size is multiple of 1K), that
> is done by the memory controller.

This sounds different to me. Using the memory controller is (should
be!) a privileged operation, thus it requires a system call. This is
quite different from code in a magic page, which is excuted entirely
in user-space. The point of the magic page is to avoid the syscall
overhead.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 18:56:19

by Gregory Maxwell

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 12:48:06PM -0600, Richard Gooch wrote:
> Ingo Oeser writes:
> > On Sun, Apr 29, 2001 at 04:27:48AM -0700, David S. Miller wrote:
> > > The idea is that the one thing one tends to optimize for new cpus
> > > is the memcpy/memset implementation. What better way to shield
> > > libc from having to be updated for new cpus but to put it into
> > > the kernel in this magic page?
> >
> > Hehe, you have read this MXT patch on linux-mm, too? ;-)
> >
> > There we have 10x faster memmove/memcpy/bzero for 1K blocks
> > granularity (== alignment is 1K and size is multiple of 1K), that
> > is done by the memory controller.
>
> This sounds different to me. Using the memory controller is (should
> be!) a privileged operation, thus it requires a system call. This is
> quite different from code in a magic page, which is excuted entirely
> in user-space. The point of the magic page is to avoid the syscall
> overhead.

Too bad this is a performance hack, otherwise we could place the privlaged
code in the read-only page, allow it to get execute from user space, catch
the exception, notice the EIP and let it continue on.

2001-04-29 19:02:49

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Gregory Maxwell writes:
> On Sun, Apr 29, 2001 at 12:48:06PM -0600, Richard Gooch wrote:
> > Ingo Oeser writes:
> > > On Sun, Apr 29, 2001 at 04:27:48AM -0700, David S. Miller wrote:
> > > > The idea is that the one thing one tends to optimize for new cpus
> > > > is the memcpy/memset implementation. What better way to shield
> > > > libc from having to be updated for new cpus but to put it into
> > > > the kernel in this magic page?
> > >
> > > Hehe, you have read this MXT patch on linux-mm, too? ;-)
> > >
> > > There we have 10x faster memmove/memcpy/bzero for 1K blocks
> > > granularity (== alignment is 1K and size is multiple of 1K), that
> > > is done by the memory controller.
> >
> > This sounds different to me. Using the memory controller is (should
> > be!) a privileged operation, thus it requires a system call. This is
> > quite different from code in a magic page, which is excuted entirely
> > in user-space. The point of the magic page is to avoid the syscall
> > overhead.
>
> Too bad this is a performance hack, otherwise we could place the
> privlaged code in the read-only page, allow it to get execute from
> user space, catch the exception, notice the EIP and let it continue
> on.

No need for anything that complicated. We can merge David's user-space
memcpy code with the memory controller scheme. We need a new syscall
anyway to access the memory controller, so we may as well just make it
a simple interface. Then the user-space code may, on some machines,
contain a test (for alignment) and call to the new syscall.

The two schemes are independent, and should be treated as such. Just
as the magic page code can call the new syscall, so could libc.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 19:38:36

by Jamie Lokier

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

David S. Miller wrote:
> It's particularly attractive on sparc64 because you
> can use a "global" TLB entry which is thus shared between all address
> spaces.

Fwiw, modern x86 has global TLB entries too.

-- Jamie

2001-04-29 19:47:48

by Gregory Maxwell

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 01:02:13PM -0600, Richard Gooch wrote:
> Gregory Maxwell writes:
> > On Sun, Apr 29, 2001 at 12:48:06PM -0600, Richard Gooch wrote:
> > > Ingo Oeser writes:
> > > > On Sun, Apr 29, 2001 at 04:27:48AM -0700, David S. Miller wrote:
> > > > > The idea is that the one thing one tends to optimize for new cpus
> > > > > is the memcpy/memset implementation. What better way to shield
> > > > > libc from having to be updated for new cpus but to put it into
> > > > > the kernel in this magic page?
> > > >
> > > > Hehe, you have read this MXT patch on linux-mm, too? ;-)
> > > >
> > > > There we have 10x faster memmove/memcpy/bzero for 1K blocks
> > > > granularity (== alignment is 1K and size is multiple of 1K), that
> > > > is done by the memory controller.
> > >
> > > This sounds different to me. Using the memory controller is (should
> > > be!) a privileged operation, thus it requires a system call. This is
> > > quite different from code in a magic page, which is excuted entirely
> > > in user-space. The point of the magic page is to avoid the syscall
> > > overhead.
> >
> > Too bad this is a performance hack, otherwise we could place the
> > privlaged code in the read-only page, allow it to get execute from
> > user space, catch the exception, notice the EIP and let it continue
> > on.
>
> No need for anything that complicated. We can merge David's user-space
> memcpy code with the memory controller scheme. We need a new syscall
> anyway to access the memory controller, so we may as well just make it
> a simple interface. Then the user-space code may, on some machines,
> contain a test (for alignment) and call to the new syscall.
>
> The two schemes are independent, and should be treated as such. Just
> as the magic page code can call the new syscall, so could libc.

Would it make sence to have libc use the magic page for all syscalls? Then
on cpus with a fast syscall instruction, the magic page could contain the
needed junk in userspace to use it.

(i.e. that really should be in libc, but we don't want libc to contain all
sorts of CPU specific cruft.. or is there a more general way to accomplish
this?)

2001-04-29 19:54:40

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Gregory Maxwell writes:
> Would it make sence to have libc use the magic page for all
> syscalls? Then on cpus with a fast syscall instruction, the magic
> page could contain the needed junk in userspace to use it.

That's pretty much what Linus suggested. He proposed having a new
syscall interface which was just calls into the magic page. All
syscalls would thus be available via the magic page. The kernel could
then selectively optimise individual syscalls (like gettimeofday(2))
or optimise the interface into kernel space, without libc ever having
to know about the details.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 20:12:22

by Ingo Oeser

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 12:48:06PM -0600, Richard Gooch wrote:
> Ingo Oeser writes:
> > There we have 10x faster memmove/memcpy/bzero for 1K blocks
> > granularity (== alignment is 1K and size is multiple of 1K), that
> > is done by the memory controller.
> This sounds different to me. Using the memory controller is (should
> be!) a privileged operation, thus it requires a system call. This is
> quite different from code in a magic page, which is excuted entirely
> in user-space. The point of the magic page is to avoid the syscall
> overhead.

Yes, but we currently have more than 10K cycles for doing
memset of a page. If we do an syscall, we have around 600-900
(don't know exactly), which is still less.

The point is: The code in that "magic page" that considers the
tradeoff is KERNEL code, which is designed to care about such
trade-offs for that machine. Glibc never knows this stuff and
shouldn't, because it is already bloated.

We get the full win here, for our "compile the kernel for THIS
machine to get maximum performance"-strategy.

People tend to compile the kernel, but not the glibc.

Just let the benchmarks, Linus and Ulrich decide ;-)

Regards

Ingo Oeser
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< been there and had much fun >>>>>>>>>>>>

2001-04-29 20:18:52

by Gregory Maxwell

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 10:11:59PM +0200, Ingo Oeser wrote:
[snip]
> The point is: The code in that "magic page" that considers the
> tradeoff is KERNEL code, which is designed to care about such
> trade-offs for that machine. Glibc never knows this stuff and
> shouldn't, because it is already bloated.
>
> We get the full win here, for our "compile the kernel for THIS
> machine to get maximum performance"-strategy.
>
> People tend to compile the kernel, but not the glibc.
>
> Just let the benchmarks, Linus and Ulrich decide ;-)

The kernel can even customize the page at runtime if it needs to, such as
changing algorithims to deal with lock contention.

Of course, this page will need to present a stable interface to glibc, and
having both the code and a comprehensive jump-table might become tough in a
single page...

2001-04-29 20:22:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Followup to: <[email protected]>
By author: dean gaudet <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Sun, 29 Apr 2001, Jeff Garzik wrote:
>
> > "H. Peter Anvin" wrote:
> > > We discussed this at the Summit, not a year or two ago. x86-64 has
> > > it, and it wouldn't be too bad to do in i386... just noone did.
> >
> > It came up long before that. I refer to the technique in a post dated
> > Nov 17, even though I can't find the original.
> > http://www.mail-archive.com/[email protected]/msg13584.html
> >
> > Initiated by a post from (iirc) Dean Gaudet, we found out that
> > gettimeofday was one particular system call in the Apache fast path that
> > couldn't be optimized well, or moved out of the fast path. After a
> > couple of suggestions for improving things, Linus chimed in with the
> > magic page suggestion.
>
> heheh. i can't claim that i was the first ever to think of this. but
> here's the post i originally made on the topic. iirc a few folks said
> "security horror!"... then last year ingo and linus (and probably others)
> came up with a scheme everyone was happy with.
>
> i was kind of solving a different problem with the code page though -- the
> ability to use rdtsc on SMP boxes with processors of varying speeds and
> synchronizations.
>

The thing that made me say we discussed this last month was Richard's
comment that it had already been implemented (which it has, by Andrea,
for x86-64.) The idea of doing it for i386 has been kicked around for
years, originally as a way to handle INT 0x80 vs SYSENTER vs SYSCALL,
which I think is part of why it never got implemented, since handling
multiple flavours of system calls apparently causes some pain in the
system call entry/exit path.

The handling of a few things like gettimeofday etc. was something we
observed could be added on top at that time, but was largely
considered secondary.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-29 20:45:43

by Arjan van de Ven

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

In article <[email protected]> you wrote:
> Yes, but we currently have more than 10K cycles for doing
> memset of a page.

make that 3800 or so..... (700 Mhz AMD Duron)

2001-04-29 21:17:14

by jg

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

The "put the time into a magic location in shared memory" goes back, as
far as I know, to Bob Scheifler or myself for the X Window System, sometime
around 1984 or 1985: we put it into a page of shared memory where we used
a circular buffer scheme to put input events (keyboard/mice), so that
we could avoid the read system call overhead to get these events (and
more importantly, check between each request if there was input to
process). I don't think we ever claimed it was novel, just that we did
it that way (I'd have to ask Bob if he had heard of that before we did
it). We put it into the same piece of memory we put the circular event
buffer, avoiding both the get-time-of day calls, but also the much more
expensive reads that would have been required (we put the events into a
circular buffer, with the kernel only updating one value, and user space
updating the other value defining the circular buffer).

In X, it is important for interactivity to get input events and send them
to clients ASAP: just note the effect of Keith Packard's recent implementation
of "silken mouse", where signals are used to deliver events to the X server.
This finally has made mouse tracking (done in user space on Linux; generally
done by kernel drivers on most UNIX boxes) what we were getting on 1 mip machines
under load (Keith has also done more than this with his new internal X
scheduler, which prevents clients from monopolizing the X server anywhere
like the old implementation).

This shared memory technique is very powerful to allow a client application to know if
it needs to do a system call, and is very useful for high performance servers
(like X), where a system call is way too expensive.

I've certainly mentioned this technique in the past in the Web community
(but HTTP servers are processing requests about 1/100-1/1000 the rate of
an X server, which gets into the millions of requests/second on current machines.

So if you want to get user space to really go fast, sometimes you resort
to such trickery.... I think the technique has real value: the interesting
question is should there be general kernel facilities to make this easy
(we did it via ugly hacks on VAX and MIPS boxes) for kernel facilities
to provide.

"X is an exercise in avoiding system calls". I think I said this around
1984-1985.
- Jim

--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
[email protected]

2001-04-29 21:40:53

by H. Peter Anvin

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Jim Gettys wrote:
>
> The "put the time into a magic location in shared memory" goes back...
>

Short summary: depending on how much you were talking general idea versus
specifics, you can go arbitrarily far back (I wouldn't be surprised if
shared memory techniques were used regularly before memory protection.)

Fair?

Not to pick on you or anyone else, but it is well-known to everyone
except the U.S. patent office that "there are no new ideas in computer
science." :)

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-29 21:47:33

by jg

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)


>
> Short summary: depending on how much you were talking general idea versus
> specifics, you can go arbitrarily far back (I wouldn't be surprised if
> shared memory techniques were used regularly before memory protection.)
>
> Fair?

Very fair.

>
> Not to pick on you or anyone else, but it is well-known to everyone
> except the U.S. patent office that "there are no new ideas in computer
> science." :)
>


Exactly why I noted in my mail that I didn't consider it novel even back then; just
a good engineering idea that we went ahead and used a long time ago...
- Jim
--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
[email protected]

2001-04-29 22:19:29

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Ingo Oeser writes:
> On Sun, Apr 29, 2001 at 12:48:06PM -0600, Richard Gooch wrote:
> > Ingo Oeser writes:
> > > There we have 10x faster memmove/memcpy/bzero for 1K blocks
> > > granularity (== alignment is 1K and size is multiple of 1K), that
> > > is done by the memory controller.
> > This sounds different to me. Using the memory controller is (should
> > be!) a privileged operation, thus it requires a system call. This is
> > quite different from code in a magic page, which is excuted entirely
> > in user-space. The point of the magic page is to avoid the syscall
> > overhead.
>
> Yes, but we currently have more than 10K cycles for doing
> memset of a page. If we do an syscall, we have around 600-900
> (don't know exactly), which is still less.
>
> The point is: The code in that "magic page" that considers the
> tradeoff is KERNEL code, which is designed to care about such
> trade-offs for that machine.

Um, yes. I don't disagree with that. I'm just saying the two issues
are conceptually separate, and should be considered independently.

> Glibc never knows this stuff and shouldn't, because it is already
> bloated.

True, true and true.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 22:22:03

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

Gregory Maxwell writes:
> On Sun, Apr 29, 2001 at 10:11:59PM +0200, Ingo Oeser wrote:
> [snip]
> > The point is: The code in that "magic page" that considers the
> > tradeoff is KERNEL code, which is designed to care about such
> > trade-offs for that machine. Glibc never knows this stuff and
> > shouldn't, because it is already bloated.
> >
> > We get the full win here, for our "compile the kernel for THIS
> > machine to get maximum performance"-strategy.
> >
> > People tend to compile the kernel, but not the glibc.
> >
> > Just let the benchmarks, Linus and Ulrich decide ;-)
>
> The kernel can even customize the page at runtime if it needs to, such as
> changing algorithims to deal with lock contention.
>
> Of course, this page will need to present a stable interface to
> glibc, and having both the code and a comprehensive jump-table might
> become tough in a single page...

Sure. IIRC, Linus talked about "a few pages".

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 22:30:21

by Richard Gooch

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

H. Peter Anvin writes:
> The thing that made me say we discussed this last month was
> Richard's comment that it had already been implemented (which it
> has, by Andrea, for x86-64.) The idea of doing it for i386 has been
> kicked around for

Correction: I didn't say it had been implemented. I just asked what
happened to the idea. I never saw it go into i386.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-04-29 23:53:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 09:38:04PM +0200, Jamie Lokier wrote:
> Fwiw, modern x86 has global TLB entries too.

my x86-64 implementation is marking the tlb entry global of course (so
it's not flushed during context switch):

#define __PAGE_KERNEL_VSYSCALL \
(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
#define MAKE_GLOBAL(x) __pgprot((x) | _PAGE_GLOBAL)
#define PAGE_KERNEL_VSYSCALL MAKE_GLOBAL(__PAGE_KERNEL_VSYSCALL)

static void __init map_vsyscall(void)
{
extern char __vsyscall_0;
unsigned long physaddr_page0 = (unsigned long) &__vsyscall_0 - __START_KERNEL_map;

__set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
}

Andrea

2001-04-30 00:14:40

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

On Sun, Apr 29, 2001 at 04:18:27PM -0400, Gregory Maxwell wrote:
> having both the code and a comprehensive jump-table might become tough in a

In the x86-64 implementation there's no jump table. The original design
had a jump table but Peter raised the issue that indirect jumps are very
costly and he suggested to jump to a fixed virtual address instead, I
agreed with his suggestion. So this is what I implemented for x86-64
with regard to the userspace vsyscall API (which will be used by glibc):

enum vsyscall_num {
__NR_vgettimeofday,
__NR_vtime,
};

#define VSYSCALL_ADDR(vsyscall_nr) (VSYSCALL_START+VSYSCALL_SIZE*(vsyscall_nr))

the linker can prelink the vsyscall virtual address into the binary as a
weak symbol and the dynamic linker will need to patch it only if
somebody is overriding the weak symbol with a LD_PRELOAD.

Virtual address space is relatively cheap. Currently the 64bit
vgettimeofday bytecode + data is nearly 200 bytes, and the first two
slots are large 512bytes each. So with 1024 bytes we do the whole thing,
and we still have space for further 6 vsyscalls without paying any
additional tlb entry.

(the implementation of the above #define will change shortly but the
VSYSCALL_ADDR() API for glibc will remain the same)

Andrea

2001-04-30 07:02:44

by David Miller

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)


dean gaudet writes:
> i was kind of solving a different problem with the code page though -- the
> ability to use rdtsc on SMP boxes with processors of varying speeds and
> synchronizations.

A better way to solve that problem is the way UltraSPARC-III do and
future ia64 systems will, by way of a "system tick" register which
increments at a constant rate regardless of how the cpus are clocked.

The "system tick" pulse goes into the processor, so it's still a local
cpu register being accessed. Think of it as a system bus clock cycle
counter.

Granted, you probably couldn't make changes to the hardware you were
working on at the time :-)

Later,
David S. Miller
[email protected]

2001-04-30 07:30:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

"David S. Miller" wrote:
>
> dean gaudet writes:
> > i was kind of solving a different problem with the code page though -- the
> > ability to use rdtsc on SMP boxes with processors of varying speeds and
> > synchronizations.
>
> A better way to solve that problem is the way UltraSPARC-III do and
> future ia64 systems will, by way of a "system tick" register which
> increments at a constant rate regardless of how the cpus are clocked.
>
> The "system tick" pulse goes into the processor, so it's still a local
> cpu register being accessed. Think of it as a system bus clock cycle
> counter.
>
> Granted, you probably couldn't make changes to the hardware you were
> working on at the time :-)
>

RDTSC in Crusoe processors does basically this.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-04-30 07:51:51

by David Miller

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)


H. Peter Anvin writes:
> RDTSC in Crusoe processors does basically this.

Hmmm, one of the advantages of using a seperate tick register for this
constant clock is that you can still do cycle accurate asm code
analysis even when the cpu is down clocked.

The joys of compatability I suppose :-)

Later,
David S. Miller
[email protected]

2001-04-30 14:56:42

by Jonathan Lundell

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

At 12:29 AM -0700 2001-04-30, H. Peter Anvin wrote:
>"David S. Miller" wrote:
>>
>> dean gaudet writes:
>> > i was kind of solving a different problem with the code page
>>though -- the
>> > ability to use rdtsc on SMP boxes with processors of varying speeds and
>> > synchronizations.
>>
>> A better way to solve that problem is the way UltraSPARC-III do and
>> future ia64 systems will, by way of a "system tick" register which
>> increments at a constant rate regardless of how the cpus are clocked.
>>
>> The "system tick" pulse goes into the processor, so it's still a local
>> cpu register being accessed. Think of it as a system bus clock cycle
>> counter.
>>
>> Granted, you probably couldn't make changes to the hardware you were
>> working on at the time :-)
>>
>
>RDTSC in Crusoe processors does basically this.
>
> -hpa

The Pentium III TSC has the bizarre characteristic, per Intel docs
anyway, that only the low half can be written (as I recall the high
half gets set to zero), making restoration problematical in certain
power-management regimes. Hopefully the Crusoe does better.
--
/Jonathan Lundell.

2001-04-30 16:47:31

by Alan

[permalink] [raw]
Subject: Re: X15 alpha release: as fast as TUX but in user space (fwd)

> The point is: The code in that "magic page" that considers the
> tradeoff is KERNEL code, which is designed to care about such
> trade-offs for that machine. Glibc never knows this stuff and
> shouldn't, because it is already bloated.

glibc is bloated because it cares about such stuff and complex standards.
There is no reason to make a mess of the kernel when you can handle more
stuff nicely with the libraries.

Since glibc inlines most memcpy calls you'd need to build an MXT glibc,
which is doable. Uninlining most memcpy calls is a loss on some processors
and often a loss anyway as the copies are so small