2006-10-12 15:45:00

by John Richard Moser

[permalink] [raw]
Subject: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Can context switches be made faster? This is a simple question, mainly
because I don't really understand what happens during a context switch
that the kernel has control over (besides storing registers).

Linux ported onto the L4-Iguana microkernel is reported to be faster
than the monolith[1]; it's not like microkernels are faster, but the
L4-Iguana apparently just has super awesome context switching code:

Wombat's context-switching overheads as measured by lmbench on an
XScale processor are up to thirty times less than those of native
Linux, thanks to Wombat profiting from the implementation of fast
context switches in L4-embedded.

The first question that comes into my mind is, obviously, is this some
special "fast context switch" code for only embedded systems; or is it
possible to work this on normal systems?

The second is, if it IS possible to get faster context switches in
general use, can the L4 context switch methods be used in Linux? I
believe L4 is BSD licensed-- at least the files in their CVS repo that I
looked at have "BSD" stamped on them. Maybe some of the code can be
examined, adopted, adapted, etc.

If it's not possible to speed up context switches, the classical
question would probably be.. why not? ;) No use knowing a thing and
not understanding it; you get situations like this right here... :)

[1]http://l4hq.org/
- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS5i9ws1xW0HCTEFAQKHtw/+PtjvtkfynX0uItgpa2zocbo8/hadu4kp
dGmxI9eBouUc0T5GDjX7hHYolaIFswuNWyrnELU6uE4WeQ6l5BFnobc1FCiHLVBE
7cZlr9FaFw3r7Ohb4AJBTLKXRYP3h107SnccxJLcqVqspwmzs6lZHaXCU9vrxpCW
Xaam8bSBUrqJ3tIPalM20Nl4SrVF0clMYlKRT2LdD3/TFdN2e60m9sczyGEnLVT/
Co+/JpQ5qxk7DqjXJHr0N5a0CmgjlTZQHEjtvfcPlrKa5CprLECrYx2aJHgs+nIz
CXf9L2z3oiE4yWADK5+zXlJWcF7+pvspIsI9rQDdHoO2xFUzguiVup9XJOLbytpV
yN0dVrOWaAXQMBtrYCInOtA6ynpAZ+hTv2EBSHRaOC+mnxcDTBSqJrj979RrKGlj
Mz282LaSRDL4XPq9d8LwrnuPHIoqGGfj0wUKwmxC19vDfGOk3Y1I/frvcRgnjlb1
TwG/QPgiGcXXVTgEDeogqgq+DRPmuoxxXo+OPMgA4441BzgqkCzxmjLA0uQL15dd
CrAO8NF1fOzvWCvAQO8DhSaGGOikeur4BdkwnF6/eTQYA7QGewaCVdY0u6Q2dhAF
wrGho4pGaEh/ev59/KsHvtSD88SfTUsLigTgGrwiRTufUm4XVbr5AzldTPUMUkUU
+jRWrqbwkLM=
=IDvt
-----END PGP SIGNATURE-----


2006-10-12 15:53:12

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



John Richard Moser wrote:
...
> The second is, if it IS possible to get faster context switches in
> general use, can the L4 context switch methods be used in Linux? I
> believe L4 is BSD licensed-- at least the files in their CVS repo that I
> looked at have "BSD" stamped on them. Maybe some of the code can be
> examined, adopted, adapted, etc.
>

Looking a bit deeper, Iguana is OzPLB licensed, oops. :(

The question still remains, though, as to what must happen during
context switches that takes so long and if any of it can be sped up.
Wikipedia has some light detail...


- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS5k4gs1xW0HCTEFAQIzYxAApoMBD2Oo9GwYFurgwjAAxwaikHcAIeXI
14Defpeb863KguvMSk698+O1HuwhJlMfMw6Ir1pKvLS85ooPpUXSsV1SMRG27fNY
GsHdoIFgBJiAeokrYdXfmGE8HZvnAvVBNPrPik9F7OfglptEe1kMXQ4gbqCzWUq0
qVWE6/X8UqaTw3M1fDS7+uebq1Mc4aafCIf6ANRRxzrz4TQzeIs4EWuy9x8wjuF9
laBV9xrIghXKf+vcstgLbkLdalBvfBpIWHaD5qbZk7G6nRzJrZeKCThn4zivylzZ
3W6xUu5U4fZyKiMbGiZNqnJa7ym9z15LPIHlVZ4R2uZkjT4fv70Hd1zh9frWWCqI
4Nw0RWz3DWtTi17CgTgprRM/TGN4MEnU+DFo6noPqCpxSdNn+LsKP8e+lpgZxql9
Vj3KvFl7uUI3Bp/PWdp8f6if/DyhXbrYkb9Dp/SY68tVAiffkfPrwgtCHpmAVGqZ
NtdpwRYQgsMH3VMB0zJDJOG3zxmP4l/X/ay8zPBmd+elItR3OPHu4rcnlDYwgK3x
TyIaAzyC175nTRgyA3ZkuN1JWHEnFOj3LDMlSFOlSiqMu37ncNocPbLb+L/papt+
GIxFO4eYR/D6X9N2lr1UXCxyDljMD7Y0fWc7667QN7eGH7jqaiIg8FYq3uUt7jZc
6Gdh7sZXIMg=
=vdRP
-----END PGP SIGNATURE-----

2006-10-12 17:19:36

by Russell King

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Thu, Oct 12, 2006 at 11:44:56AM -0400, John Richard Moser wrote:
> Can context switches be made faster? This is a simple question, mainly
> because I don't really understand what happens during a context switch
> that the kernel has control over (besides storing registers).

They can be, but there's a big penalty that you pay for it. You must
limit the virtual memory space to 32MB for _every_ process in the
system, and if you have many processes running (I forget how many)
you end up running into the same latency problems.

The latency problem comes from the requirement to keep the cache
coherent with the VM mappings, and to this extent on Linux we need to
flush the cache each time we change the VM mapping.

There have been projects in the past which have come and gone to
support the "Fast Context Switch" approach found on these CPUs, but
patches have _never_ been submitted, so I can only conclude that the
projects never got off the ground.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-10-12 18:20:21

by Phillip Susi

[permalink] [raw]
Subject: Re: Can context switches be faster?

John Richard Moser wrote:
> Can context switches be made faster? This is a simple question, mainly
> because I don't really understand what happens during a context switch
> that the kernel has control over (besides storing registers).
>

Besides saving the registers, the expensive operation in a context
switch involves flushing caches and switching page tables. This can be
avoided if the new and old processes both share the same address space.


2006-10-12 18:25:23

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Russell King wrote:
> On Thu, Oct 12, 2006 at 11:44:56AM -0400, John Richard Moser wrote:
>> Can context switches be made faster? This is a simple question, mainly
>> because I don't really understand what happens during a context switch
>> that the kernel has control over (besides storing registers).
>
> They can be, but there's a big penalty that you pay for it. You must
> limit the virtual memory space to 32MB for _every_ process in the
> system, and if you have many processes running (I forget how many)
> you end up running into the same latency problems.

Interesting information; for the rest of this discussion let's assume
that we want the system to remain functional. :)

>
> The latency problem comes from the requirement to keep the cache
> coherent with the VM mappings, and to this extent on Linux we need to
> flush the cache each time we change the VM mapping.

*OUCH*

Flushing cache takes time, doesn't it? Worse, you can't have happy
accidents where cache remains the same for various areas (i.e. I1's
caching of libc and gtk) between processes.

I guess on x86 and x86-64 at least (popular CPUs) the cache is not tied
to physical memory; but rather to virtual memory? Wikipedia:

Multiple virtual addresses can map to a single physical address. Most
processors guarantee that all updates to that single physical address
will happen in program order. To deliver on that guarantee, the
processor must ensure that only one copy of a physical address resides
in the cache at any given time.

...

But virtual indexing is not always the best choice. It introduces the
problem of virtual aliases -- the cache may have multiple locations
which can store the value of a single physical address. The cost of
dealing with virtual aliases grows with cache size, and as a result
most level-2 and larger caches are physically indexed.

-- http://en.wikipedia.org/wiki/CPU_cache

So apparently most CPUs virtually address L1 cache and physically
address L2; but sometimes physically addressing L1 is better.. hur.

I'd need more information on this one.

- Is L1 on <CPU of choice> physically aliased or physically tagged
such that leaving it in place between switches will cause the CPU to
recognize it's wrong?

- Is L2 on <CPU of choice> in such a manner?

- Can L1 be flushed without flushing L2?

- Does the current code act on these behaviors, or just flush all
cache regardless?

>
> There have been projects in the past which have come and gone to
> support the "Fast Context Switch" approach found on these CPUs, but
> patches have _never_ been submitted, so I can only conclude that the
> projects never got off the ground.
>

A shame.

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS6IiQs1xW0HCTEFAQJTwRAAlpf3+FRxxRkpIEiWMV3k7Prauj8uZEFK
I6eITkNAe7kYt2q+TqTSjSaDI106iFUFXa/D20fbaykQ5mRio9nqIxNfGlqZdYyq
f62CrEzt6gfHdS4KSA7uIz8v+N6D/aUscxxPXr/LTGfZms1WWsDA/hBPZFtUL3cH
8/ydOwVa/23mGcuOnf8yAP9yzvK2k0XJPbHJo4JnZh11bf9ttLF5+pxLIxWrvZgL
eAp5QuwcQ7Rg7MJ50sSUBtbCHPpsRvC0KsROdLltGWxyEXMOlwfOn0UNBd0GhWRB
O2B91x4elGBESwWpIJ+cB6gzCRIbuJcVuwgn73RDVi3hWg+suxa6IR7+lJ+Jv6tA
/iTbnr6Um4X+Dv4WISYxiB3vYuSuTKmzq75iqpRj+9KOOCHMnkxx7qUdy8qEiSoB
vuO3LuRMGwahOqe9gqUFNwdM7X4cOuoayKe1hZR36ocjC0rHXaDCH9t+UBn+iFyo
sCDnfq6RbKvxFZTiShdhv3Qb6tHNaOzwaIhO29rY9RvD5fkQYFK/DBdcQHa6Pc+Q
SHoS8lJ2/uU4BpzioHO4GqwsqOl4jUjgg7/zzeuK6JWVnihowK3/1BgPKJ50K0e5
7yQ1yC1FiWiwO584rp5qrm3lJPwVjdf9cWxzaZqtl6WT4QE3wb6BI9rvwrEEhddk
IrJbcFEjS00=
=i0Ac
-----END PGP SIGNATURE-----

2006-10-12 18:29:23

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Phillip Susi wrote:
> John Richard Moser wrote:
>> Can context switches be made faster? This is a simple question, mainly
>> because I don't really understand what happens during a context switch
>> that the kernel has control over (besides storing registers).
>>
>
> Besides saving the registers, the expensive operation in a context
> switch involves flushing caches and switching page tables. This can be
> avoided if the new and old processes both share the same address space.
>

i.e. this can be avoided when switching to threads in the same process.

How does a page table switch work? As I understand there are PTE chains
which are pretty much linked lists the MMU follows; I can't imagine this
being a harder problem than replacing the head. I'd imagine the head
PTE would be something like a no-access page that's not really mapped to
anything so exchanging it is pretty much exchanging its "next" pointer?


>
>

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS6JfQs1xW0HCTEFAQL9NQ//Qs7SGrlE0fGDj0HFssed/cVDrjR9GUFL
RzjJFNyB0DmDKLEcIfcYjWnvvyvFX1ggCPYflOcUXo57jz6GCsSBs0p+nXAheZiH
B301nk4ZKj6B18UOqv3NDY7r5wV0L2x54TM98Ty4CxgvKw5Jglno4fLZQs9Tefmk
2woLmyJkiDG2xlRijmfEbSXQPo/wNIfYaDrJlvRPgEJiDa1j7cCHP+o2wZEnJCr9
UMpEooaZ8n8cO5CVP7YiG/sG52GG3cbW9j5FbueQOAjxySNX2X8JVT3wLGHixvB4
z6BpvnCpgNVoe1LQFDleUjP3Lb0rO9ap3HMlLJ/lbs7LeegLgtUDW2uaa4PXcc9C
uEiwA1BNEZMIYhSsiek+acojmmGh9zOsI2nfjBmZoj6UJOjSyYz7srZTWKLILVi+
0hffeQs2FWHQ7sMSRhjM8ITMjZGvK85AnBwdsf2jEIGGOMixq2BAkYu+Ljq5FLOJ
VKdaqmLq60vzMqLeb7pdWJgDXNvxU5RLvnktkuPBSlftPLwNvGf1GVSEXEKkJ9/Y
VhTfX9gFKHC59N9qn01KpkhG9cVXfUPU5nDlfYdCrgkf7QX4aTEg6kt9ALg4fDlN
W2cFr4bfjGLBmqsI9FP/Zxq42N1F9bXIym3HcSmTn/fsoiTQdUcuxUKkdKA7cXiE
xMyKfw/+y0Q=
=NEcB
-----END PGP SIGNATURE-----

2006-10-12 18:37:15

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Thu, 2006-10-12 at 14:25 -0400, John Richard Moser wrote:

Hi,

> So apparently most CPUs virtually address L1 cache and physically
> address L2; but sometimes physically addressing L1 is better.. hur.

if you are interested in this I would strongly urge you to read Curt
Schimmel's book (UNIX(R) Systems for Modern Architectures: Symmetric
Multiprocessing and Caching for Kernel Programmers); it explains this
and related materials really really well.


> - Does the current code act on these behaviors, or just flush all
> cache regardless?

the cache flushing is a per architecture property. On x86, the cache
flushing isn't needed; but a TLB flush is. Depending on your hardware
that can be expensive as well.

Greetings,
Arjan van de Ven

2006-10-12 18:56:36

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Arjan van de Ven wrote:
> On Thu, 2006-10-12 at 14:25 -0400, John Richard Moser wrote:
>
> Hi,
>
>> So apparently most CPUs virtually address L1 cache and physically
>> address L2; but sometimes physically addressing L1 is better.. hur.
>
> if you are interested in this I would strongly urge you to read Curt
> Schimmel's book (UNIX(R) Systems for Modern Architectures: Symmetric
> Multiprocessing and Caching for Kernel Programmers); it explains this
> and related materials really really well.
>

That will likely be more useful when I've got more background knowledge;
right now my biggest problem is I'm inexperienced and haven't yet gotten
my 2-year compsci degree (more importantly, the associated knowledge
that goes with it), so things like compiler design are like "There's a
concept of turning stuff into trees and associating actions with virtual
registers and then turning that into real register accounting and
instructions, but I really don't know WTF it's doing."

I imagine once I get into the 4-year stuff I'll be able to digest
something like that; so I'll keep that in mind. The '1st year compsci
student' crutch is annoying and I need to get rid of it so I don't feel
so damn crippled trying to do anything.

>
>> - Does the current code act on these behaviors, or just flush all
>> cache regardless?
>
> the cache flushing is a per architecture property. On x86, the cache
> flushing isn't needed; but a TLB flush is. Depending on your hardware
> that can be expensive as well.
>

Mm. TLB flush is expensive, and pretty unavoidable unless you do stupid
things like map libc into the same address everywhere. TLB flush
between threads in the same process is probably avoidable; aren't
threads basically processes with the same PID in something called a
thread group? (Hmm... creative scheduling possibilities...)

> Greetings,
> Arjan van de Ven
>
>

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS6P4As1xW0HCTEFAQLcjQ//aQF7RWL4INtwl8ZPI35ueTj6y7Ap5vdP
bJUjXOgF4M/yyyRB9KThTaoRxUkTapPbUaNAa2VBU5c8vBW3anyTuGjUXhW7dBUZ
Xfk0611xEVrEsCcwNjpRhpZH55npjaymzSvdDulqdVThLJCDxkYGonL+6K7T5PRF
eRqFvniRur/pNormpkxzXoqxEqhO0dpy/RWbMuhF4hGpCLBz6ZNXWjRDJrzTtoFp
dhDAy5sJuf4gGqG0oQY2FN9FzFYrCguKGReP/xuzWhqUUyxvKrgbJLFBj1Zut3Uc
Gm59FhSpfKyJifJGGvFhrMxMKLL/wVsLxfRUbvFuD8Xki6UxxQ3Kve2AVihKcRz6
RpYDQesAG0TgP9GypM8eT5uezKnzIe1FJpg9619Isem11dogvuMA6h7qgdJH7+gb
E8yWeVFoZx1GPNCpQgZhy2IijztgYS40GMf+0xoUZcanE3pO93+85DNxapcyGkNY
kYIGSiugKDlQlIVaBov8+EtLr6hILBXPT9+PjwD9EWWjjQZB9iJy+ROnKntd+twe
knNYHm6aKuXVxmndfrzW+VDDN4I0TnbD3ctoepkEUQK2RFoBAt5tEceFmI6YILSr
w/h71yBztmvMAzSe6vEVSKN95jEg7UGd09UQN/GbcEJvw2s5l+HL1LGVRIbR0YN/
4gK2kAq28eA=
=Grg2
-----END PGP SIGNATURE-----

2006-10-12 19:02:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Thu, 2006-10-12 at 14:56 -0400, John Richard Moser wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>
> Arjan van de Ven wrote:
> > On Thu, 2006-10-12 at 14:25 -0400, John Richard Moser wrote:
> >
> > Hi,
> >
> >> So apparently most CPUs virtually address L1 cache and physically
> >> address L2; but sometimes physically addressing L1 is better.. hur.
> >
> > if you are interested in this I would strongly urge you to read Curt
> > Schimmel's book (UNIX(R) Systems for Modern Architectures: Symmetric
> > Multiprocessing and Caching for Kernel Programmers); it explains this
> > and related materials really really well.
> >
>
> That will likely be more useful when I've got more background knowledge;

the book is quite explenatory, so maybe your uni has it in the library
so that you can try to see if it makes sense?

> >
> > the cache flushing is a per architecture property. On x86, the cache
> > flushing isn't needed; but a TLB flush is. Depending on your hardware
> > that can be expensive as well.
> >
>
> Mm. TLB flush is expensive, and pretty unavoidable unless you do stupid
> things like map libc into the same address everywhere.

TLB flush is needed if ANY part of the address space is pointing at
different pages. Since that is always the case for different processes,
where exactly glibc is doesn't matter... as long as there is ONE page of
memory different you have to flush. (And the flush is not expensive
enough to do partial ones based on lots of smarts; the smarts will be
more expensive than just doing a full flush)

> TLB flush
> between threads in the same process is probably avoidable; aren't
> threads basically processes with the same PID in something called a
> thread group? (Hmm... creative scheduling possibilities...)

the scheduler is smart enough to avoid the flush then afaik
(threads share the "mm" in Linux, and "different mm" is used as trigger
for the flush. Linux is even smarter; kernel threads don't have any mm
at all so those count as wildcards in this respect, so that if you have
"thread A" -> "kernel thread" -> "thread B" you'll have no flush)



2006-10-12 19:58:10

by Chris Friesen

[permalink] [raw]
Subject: Re: Can context switches be faster?

John Richard Moser wrote:

> Linux ported onto the L4-Iguana microkernel is reported to be faster
> than the monolith[1]; it's not like microkernels are faster, but the
> L4-Iguana apparently just has super awesome context switching code:
>
> Wombat's context-switching overheads as measured by lmbench on an
> XScale processor are up to thirty times less than those of native
> Linux, thanks to Wombat profiting from the implementation of fast
> context switches in L4-embedded.

The Xscale is a fairly special beast, and it's context-switch times are
pretty slow by default.

Here are some context-switch times from lmbench on a modified 2.6.10
kernel. Times are in microseconds:

cpu clock speed context switch
pentium-M 1.8GHz 0.890
dual-Xeon 2GHz 7.430
Xscale 700MHz 108.2
dual 970FX 1.8GHz 5.850
ppc 7447 1GHz 1.720

Reducing the Xscale time by a factor of 30 would basically bring it into
line with the other uniprocessor machines.

Chris

2006-10-12 20:23:33

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Chris Friesen wrote:
> John Richard Moser wrote:
>
>> Linux ported onto the L4-Iguana microkernel is reported to be faster
>> than the monolith[1]; it's not like microkernels are faster, but the
>> L4-Iguana apparently just has super awesome context switching code:
>>
>> Wombat's context-switching overheads as measured by lmbench on an
>> XScale processor are up to thirty times less than those of native
>> Linux, thanks to Wombat profiting from the implementation of fast
>> context switches in L4-embedded.
>
> The Xscale is a fairly special beast, and it's context-switch times are
> pretty slow by default.
>
> Here are some context-switch times from lmbench on a modified 2.6.10
> kernel. Times are in microseconds:
>
> cpu clock speed context switch
> pentium-M 1.8GHz 0.890
> dual-Xeon 2GHz 7.430
> Xscale 700MHz 108.2
> dual 970FX 1.8GHz 5.850
> ppc 7447 1GHz 1.720
>
> Reducing the Xscale time by a factor of 30 would basically bring it into
> line with the other uniprocessor machines.
>

That's a load more descriptive :D

0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
PPC performs similarly, 1 cycle should be about 1uS.

> Chris
>

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS6kPws1xW0HCTEFAQJnAhAAloJ1KvztR+CiKSvBEvEHUcoUm9HwxpaE
WrTKrBxEV36VXgHVzJsqvs6YLzyLhyLWv2YFKlPS6iYyKvwO3v8P54fVPLqZAcAn
7VSAml9wSRt1woh1MxwApDlQO/snf5rAV/+1cuCGXKmpqLAHCGjrMJgn5TEbknCG
9H4Ie5Db6541lO7Zw5rouas03wLVPUU831UCTKJi8ngyx30FNDeaDds9EeYA0Ox/
5gEOVavkHRV5AJ6GhGtmgEpTdB69oB6nwv07UtuYN7QKC2tJ0E7tomuLeh4mbMM9
N5rV9Q+KzL955HINqQPe5+pF31+W4fQ4zKBzyyz0AQ6BHkyu+v9B371HuIo5RCeV
adC5p17PM5Ms819bfB0Nl6WjRhje5ybTDHyKxiNEHQL7T+LCkCjJvDvkeIr33aS2
vcAlFQvk8RYz5bZn1dsJpXfYc0GPH9M93Zf3dOr8syPmzOCFlD3MgWJvcTGlg2fT
Lxg5e8MBfjwyZ1XYBouFaY4GlytaebCXT5jXlutZDYQfIIIsHhQ2BV3xCcr2jP5K
Q1UmOld8GB+HeHmfvge3/5gXWIQxZx/vgNEm+XPUQk7j0Ei9E4yy/MVdXdx4GKDv
uNbclRT3xOlYhsbabGsxunCRok6Ph4eTQPmT3YLO+rpxLq1vyGDW3+kEca6+yCIp
WEBb7mlPRaw=
=pvD8
-----END PGP SIGNATURE-----

2006-10-12 20:29:20

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Can context switches be faster?

> ther uniprocessor machines.
> >
>
> That's a load more descriptive :D
>
> 0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
> PPC performs similarly, 1 cycle should be about 1uS.
>
> > Chris

you have your units off; 1 cycle is 1 nS not 1 uS (or 0.556 nS for the
pM)

2006-10-12 20:33:21

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Can context switches be faster?

John Richard Moser wrote:
> That's a load more descriptive :D
>
> 0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
> PPC performs similarly, 1 cycle should be about 1uS.
>

No, you're a factor of 1000 off - these numbers show the context switch
is around 1600-75000 cycles. And that doesn't really tell the whole
story: if caches/TLB get flushed on context switch, then the newly
switched-to task will bear the cost of having cold caches, which isn't
visible in the raw context switch time.

But modern x86 processors have a very quick context switch time, and I
don't think there's much room for improvement aside from
micro-optimisations (though that might change if the architecture grows
a way to avoid flushing the TLB on switch).

J

2006-10-12 20:36:58

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Arjan van de Ven wrote:
>> ther uniprocessor machines.
>> That's a load more descriptive :D
>>
>> 0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
>> PPC performs similarly, 1 cycle should be about 1uS.
>>
>>> Chris
>
> you have your units off; 1 cycle is 1 nS not 1 uS (or 0.556 nS for the
> pM)

Ah, right right. Nano is a billion, micro is a million. How did I get
that mixed up.

So that's barely two THOUSAND cycles. Which is closer to what I
expected (i.e., not instant)


>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS6nZAs1xW0HCTEFAQImYQ/+Lmcd2zzOhtJHZ6AofBajWh7CEhqLCfqd
mQ09b95YPtDXEr0f5Uud69/v4+dE5gVh1ze68z4+15dFtfLfR7EuBJpXykoSDSGB
fMVhwd6COzN8l4Bl826tbH3jvNnX8jssLcCr/qNvtYA1pDntXjjFdPF0OHdU87Kr
KYODxIM69akjTALjYO4NinYZdhcJn++DkHfKtIfjL5qD9gc0VMe8EMj1bQo2Jz8T
FBpMtfz2JPcXq8CFYDz9fMREtrJQfpYYFMyG+MJHWiNpEWUXTy7Jg40q/k5Jhsf0
ReDh+FJDWGmtep938wTquIojhSu/WYNyyBjAXlWcLo08S1cl5BSLj0hD12sKXT6c
0FmqIZO44Pp6dBfoHxRrss6tljbsoTAmAf0ac4PLMciHJfidYvoAfWH++70i+4Wm
CtsQdRn0jY7Ws0vB3pVLA6IlpZ/D5OXE+ko7ntiOUDkiKynoUHz71h6WwtpCorF2
myUurBvHbL85VTCOegomy/5/6UcozFK2LovkjRN9jXoP2ZportXPIYL0ssrRnXN0
7GtS9Ye5Mu2+n88pk64HdKWg75uflZPAobfCBg1lt933wlXfpOwi7FKKuwTHP8vD
2IabVRcJEk2VnOgJDfQ5iSooN4XDFRIG+DjI3/7lzRDpLRpMiUBy91K7adi5u7ah
tx+izQjdo3c=
=HqrS
-----END PGP SIGNATURE-----

2006-10-13 02:54:20

by Andrew James Wade

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Thursday 12 October 2006 14:29, John Richard Moser wrote:
> How does a page table switch work? As I understand there are PTE chains
> which are pretty much linked lists the MMU follows; I can't imagine this
> being a harder problem than replacing the head.

Generally, the virtual memory mappings are stored as high-fanout trees
rather than linked lists. (ia64 supports a hash table based scheme,
but I don't know if Linux uses it.) But the bulk of the mapping
lookups will actually occur in a cache of the virtual memory mappings
called the translation lookaside buffer (TLB). It is from the TLB and
not the memory mapping trees that some of the performance problems
with address space switches originate.

The kernel can tolerate some small inconsistencies between the TLB
and the mapping tree (it can fix them in the page fault handler). But
for the most part the TLB must be kept consistent with the current
address space mappings for correct operation. Unfortunately, on some
architectures the only practical way of doing this is to flush the TLB
on address space switches. I do not know if the flush itself takes any
appreciable time, but each of the subsequent TLB cache misses will
necessitate walking the current mapping tree. Whether done by the MMU
or by the kernel (implementations vary), these walks in the aggregate
can be a performance issue.

On some architectures the L1 cache can also require attention from the
kernel on address space switches for correct operation. Even when the
L1 cache doesn't need flushing a change in address space will generally
be accompanied by a change of working set, leading to a period of high
cache misses for the L1/L2 caches.

Microbenchmarks can miss the cache miss costs associated with context
switches. But I believe the costs of cache thrashing and flushing are
the reason that the time-sharing granularity is so coarse in Linux,
rather than the time it takes the kernel to actually perform a context
switch. (The default time-slice is 100 ms.) Still, the cache miss costs
are workload-dependent, and the actual time the kernel takes to context
switch can be important as well.

Andrew Wade

2006-10-13 05:29:27

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Andrew James Wade wrote:
> On Thursday 12 October 2006 14:29, John Richard Moser wrote:
>> How does a page table switch work? As I understand there are PTE chains
>> which are pretty much linked lists the MMU follows; I can't imagine this
>> being a harder problem than replacing the head.
>
> Generally, the virtual memory mappings are stored as high-fanout trees
> rather than linked lists. (ia64 supports a hash table based scheme,
> but I don't know if Linux uses it.) But the bulk of the mapping
> lookups will actually occur in a cache of the virtual memory mappings
> called the translation lookaside buffer (TLB). It is from the TLB and
> not the memory mapping trees that some of the performance problems
> with address space switches originate.
>
> The kernel can tolerate some small inconsistencies between the TLB
> and the mapping tree (it can fix them in the page fault handler). But
> for the most part the TLB must be kept consistent with the current
> address space mappings for correct operation. Unfortunately, on some
> architectures the only practical way of doing this is to flush the TLB
> on address space switches. I do not know if the flush itself takes any
> appreciable time, but each of the subsequent TLB cache misses will
> necessitate walking the current mapping tree. Whether done by the MMU
> or by the kernel (implementations vary), these walks in the aggregate
> can be a performance issue.

True. You can trick the MMU into faulting into the kernel (PaX does
this to apply non-executable pages-- pages, not halves of VM-- on x86),
but it's orders of magnitude slower as I understand and the petty gains
you can get over the hardware MMU doing it are not going to outweigh it.

>
> On some architectures the L1 cache can also require attention from the
> kernel on address space switches for correct operation. Even when the
> L1 cache doesn't need flushing a change in address space will generally
> be accompanied by a change of working set, leading to a period of high
> cache misses for the L1/L2 caches.

Yeah, only exception being if L1 and L2 are both physically addressed,
and thing like libc's .text are shared, leading to shared working sets
in I1 and L2.

>
> Microbenchmarks can miss the cache miss costs associated with context
> switches. But I believe the costs of cache thrashing and flushing are

cachegrind is probably guilty but I haven't examined it.

> the reason that the time-sharing granularity is so coarse in Linux,
> rather than the time it takes the kernel to actually perform a context
> switch. (The default time-slice is 100 ms.) Still, the cache miss costs

I thought it was minimum 5mS... I don't know what default is. Heh.

> are workload-dependent, and the actual time the kernel takes to context
> switch can be important as well.
>
> Andrew Wade
>

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS8kMQs1xW0HCTEFAQLS9w/+MAgrzYrZx1n59MgcfJTfp5eaD+tS5fJ7
qDd47qbycVR44sdj7kd89MNs7+oenm+511SRm1boQnT4mApqm/WXOXP3CnlFersS
f8TfS/HwMw79IBGDm9m0mIOZytaWDdDhoNyj3J993QUwfdjVbpOLhXXp+66ZqV7N
cHfVVX9MabJo5jwbzFWjvJMzoIfIQWGHHcqSqQ33sC3Kep+zbSs4yvoWGVE687eq
g9rxzT03z32a0oyUongqr6jV0X5v4w3u83sYlARgGGJcOcVFC3ulj05zz1w9GQHc
/SADT5Y3uFmHr11Rh2gJMnGEQoqu2a+dda5sLKD9R9Q0CtSKnNSV1g1HYDzwUyce
8f/xoFbP9yFjBynW2nZ8ZXNVQeiCy0M22Pq7K+VRwywE5U1ow6BEhNFLTXC0WPyQ
kl+EZZaXxhGa1m2EUvUeebchpx4uLyYPmHaOuSS6qiNxf5ct5TO2f94nwR5rwZLD
iKy2A7rkE6mM1z5WFTyO3QAlQg6vdObURHb38d/lp55iATg2z0FiUto2pzE9h61y
3Eax60EKDCtzCm69Sx2hnYaWr4Bj6NifZZbrYjZrjxb7feELFba2oZ6Y6ew0+v4d
Sp6V3dvRthpsNF+Mm0lR9KbC1QwHrnHQg3gvVC6N86XGCLiEaRkMtRNSV45pybYy
jdOycAlM9Hs=
=iUgf
-----END PGP SIGNATURE-----

2006-10-13 11:06:01

by James Courtier-Dutton

[permalink] [raw]
Subject: Re: Can context switches be faster?

Arjan van de Ven wrote:
> On Thu, 2006-10-12 at 14:25 -0400, John Richard Moser wrote:
>
>
>> - Does the current code act on these behaviors, or just flush all
>> cache regardless?
>
> the cache flushing is a per architecture property. On x86, the cache
> flushing isn't needed; but a TLB flush is. Depending on your hardware
> that can be expensive as well.
>

So, that is needed for a full process context switch to another process.
Is the context switch between threads quicker as it should not need to
flush the TLB?

James

2006-10-13 14:51:55

by Chase Venters

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Fri, 13 Oct 2006, James Courtier-Dutton wrote:

> Arjan van de Ven wrote:
>> On Thu, 2006-10-12 at 14:25 -0400, John Richard Moser wrote:
>>
>>
>>> - Does the current code act on these behaviors, or just flush all
>>> cache regardless?
>>
>> the cache flushing is a per architecture property. On x86, the cache
>> flushing isn't needed; but a TLB flush is. Depending on your hardware
>> that can be expensive as well.
>>
>
> So, that is needed for a full process context switch to another process.
> Is the context switch between threads quicker as it should not need to
> flush the TLB?

Indeed. This is also true for switching from a process to a kernel thread
and back, because kernel threads don't have their own user-space virtual
memory; they just live inside the kernel virtual memory mapped into every
process.

> James
>

Thanks,
Chase

2006-10-13 16:58:24

by Andrew James Wade

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Friday 13 October 2006 01:29, John Richard Moser wrote:
> True. You can trick the MMU into faulting into the kernel (PaX does
> this to apply non-executable pages-- pages, not halves of VM-- on x86),

Oooh, that is a neat hack!

> but it's orders of magnitude slower as I understand and the petty gains
> you can get over the hardware MMU doing it are not going to outweigh it.

It's architecture-dependent; not all architectures are even capeable of
walking the page table trees in hardware. They compensate with
lightweight traps for TLB cache misses.

Andrew Wade

2006-10-13 17:24:15

by John Richard Moser

[permalink] [raw]
Subject: Re: Can context switches be faster?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Andrew James Wade wrote:
> On Friday 13 October 2006 01:29, John Richard Moser wrote:
>> True. You can trick the MMU into faulting into the kernel (PaX does
>> this to apply non-executable pages-- pages, not halves of VM-- on x86),
>
> Oooh, that is a neat hack!
>

Yes, very neat ;)

The way it works is pretty much that he has NX pages marked SUPERVISOR
so the userspace code can't put them in the TLB. What happens on access is:

- MMU cries, faults into kernel
- Kernel examines nature of fault
- If attempting to read/write, put the mapping in the DTLB for the
process
- If attempting to execute, KILL.
- Application continues running if it was a data access.

As I understand it, this is about 7000 times slower than having a
hardware NX bit and an MMU that handles the event; but once the page is
in the DTLB the MMU doesn't complain any more until it's pushed out, so
the hit is minimal. Wide memory access patterns that can push data out
of the TLB quickly and come back to it just as fast get a pretty
noticeable performance hit; but they're rare.

These days he's got the OpenBSD W^X/RedHat Exec Shield method in there
as well, so when possible the stack is made non-executable using
hardware and is completely exempt from these performance considerations.

- From what I hear, Linus isn't interested in either PTE hacks or segment
limit hacks, so old x86 will never have an NX bit (full like PaX or best
effort like Exec Shield) in the kernel. Too bad; I'd really like to see
some enforcement of non-executable pages on good old x86 chips in mainline.

>> but it's orders of magnitude slower as I understand and the petty gains
>> you can get over the hardware MMU doing it are not going to outweigh it.
>
> It's architecture-dependent; not all architectures are even capeable of
> walking the page table trees in hardware. They compensate with
> lightweight traps for TLB cache misses.

The ones that can do it probably have hardware tuned enough that a
software implementation isn't going to outrun them, least of all not far
enough to overtake the weight of the traps that can be set up. It's a
lovely area of theoretical hacks though, if you like coming up with
impractical "what kinds of nasty things can we do to the hardware"
ideas. :)

>
> Andrew Wade
>

- --
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iQIVAwUBRS/LuQs1xW0HCTEFAQJHvw//TvBqPD8YlOqjCkgKEPSnCICuGE3LW8Yq
BKo1sSwdMnBW9b9IH3jlO3nKSG49MJvJnyQeb5OEzq0B0qHVct8z8ax80+XDbwWg
9FNo85SnAQMuAq9FxgHof9itja4eqgNOXwI1kib1OT5FAw/cINGxsWoSOTS5Ko4R
uAc74bbtoYACCvktKr9kpp/LSrctDtJ0ObaGpKO+rdi6023zGRF6IrIaI8GTAXIg
yoE0hue+eukKvDLCI7L6ZyIMH95mlKPBbsirhx1sVeXIfbDMNulyC56o9Bcs/QVi
oplLe7SEi67vCUslonbley/MG9b8cA2tT8YSAxJDTwfBgxKghY6mn5nRXxG/6UuG
6Zxbmfn4bhxPRpbVO8G+10WtfbxOlPpnHGaC1NCw7W+h/rMcUf3I/wkHJwtbkJG3
hfs6rOgSx72319kqT3eC/L0JUlBIOt0Z9hmVLkJAW85JAx2bfVSVrYCzhVzK1NFn
5v9RD1lRCuZDq72S14VaCcHcIwPuTBLIPklsQWy4XmDLtRQoOO1VVPgakB2Zs7h5
hURD7r8qXbWQj4XNCNb8ZOa3f36yAzqJNdE6fppZPAwWcT3kMMkWuUoslGMjrJfw
4nH37n0PrTnXusP1HyBjTzcGE2wSKbzreclHYWsbds3S5QIxFhqW6Qh6eFnTaawv
AAAxHihjHSg=
=uOEk
-----END PGP SIGNATURE-----

2006-10-13 23:32:40

by Andreas Mohr

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Thu, Oct 12, 2006 at 01:35:12PM -0700, Jeremy Fitzhardinge wrote:
> John Richard Moser wrote:
> >That's a load more descriptive :D
> >
> >0.890 uS, 0.556uS/cycle, that's barely 2 cycles you know. (Pentium M)
> >PPC performs similarly, 1 cycle should be about 1uS.
> >
>
> No, you're a factor of 1000 off - these numbers show the context switch
> is around 1600-75000 cycles. And that doesn't really tell the whole
> story: if caches/TLB get flushed on context switch, then the newly
> switched-to task will bear the cost of having cold caches, which isn't
> visible in the raw context switch time.
>
> But modern x86 processors have a very quick context switch time, and I
> don't think there's much room for improvement aside from
> micro-optimisations (though that might change if the architecture grows
> a way to avoid flushing the TLB on switch).

OK, so since we've now amply worked out in this thread that TLB/cache flushing
is a real problem for context switching management, would it be possible to
smartly reorder processes on the runqueue (probably works best with many active
processes with the same/similar priority on the runqueue!) to minimize
TLB flushing needs due to less mm context differences of adjacently scheduled
processes?
(i.e. don't immediately switch from user process 1 to user process 2 and
back to 1 again, but always try to sort some kernel threads in between
to avoid excessive TLB flushing)

See also my new posting about this at
http://bhhdoa.org.au/pipermail/ck/2006-October/006442.html

Andreas Mohr

2006-10-13 23:48:37

by David Lang

[permalink] [raw]
Subject: Re: Can context switches be faster?

On Sat, 14 Oct 2006, Andreas Mohr wrote:

> OK, so since we've now amply worked out in this thread that TLB/cache flushing
> is a real problem for context switching management, would it be possible to
> smartly reorder processes on the runqueue (probably works best with many active
> processes with the same/similar priority on the runqueue!) to minimize
> TLB flushing needs due to less mm context differences of adjacently scheduled
> processes?
> (i.e. don't immediately switch from user process 1 to user process 2 and
> back to 1 again, but always try to sort some kernel threads in between
> to avoid excessive TLB flushing)

since kernel threads don't cause flushing it shouldn't matter where they appear
in the scheduleing.

other then kernel threads, only threaded programs share the mm context (in
normal situations), and it would be a fair bit of work to sort the list of
potential things to be scheduled to group these togeather (not to mention the
potential fairness issues that would arise from this).

I suspect that the overhead of doing this sorting (and looking up the mm context
to do the sorting) would overwelm the relativly small number of TLB flushes that
would be saved.

I could see this being a potential advantage for servers with massive numbers of
threads for one program, but someone would have to look at how much overhead the
sorting would be (not to mention the fact that the kernel devs tend to frown on
special case optimizations that have a noticable impact on the general case)

David Lang

2006-10-14 00:03:57

by Alan

[permalink] [raw]
Subject: Re: Can context switches be faster?

Ar Sad, 2006-10-14 am 01:32 +0200, ysgrifennodd Andreas Mohr:
> OK, so since we've now amply worked out in this thread that TLB/cache flushing
> is a real problem for context switching management, would it be possible to
> smartly reorder processes on the runqueue (probably works best with many active
> processes with the same/similar priority on the runqueue!) to minimize
> TLB flushing needs due to less mm context differences of adjacently scheduled
> processes?

We already do. The newer x86 processors also have TLB tagging so they
can (if you find one without errata!) avoid the actual flush and instead
track the tag.

Alan

2006-10-14 00:11:45

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Can context switches be faster?

Alan Cox wrote:
> We already do. The newer x86 processors also have TLB tagging so they
> can (if you find one without errata!) avoid the actual flush and instead
> track the tag.

Are there any? I think AMD dropped it rather than spend the effort to
make it work.

J

2006-10-14 00:11:14

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: Can context switches be faster?

Andreas Mohr wrote:
> OK, so since we've now amply worked out in this thread that TLB/cache flushing
> is a real problem for context switching management, would it be possible to
> smartly reorder processes on the runqueue (probably works best with many active
> processes with the same/similar priority on the runqueue!) to minimize
> TLB flushing needs due to less mm context differences of adjacently scheduled
> processes?
> (i.e. don't immediately switch from user process 1 to user process 2 and
> back to 1 again, but always try to sort some kernel threads in between
> to avoid excessive TLB flushing)
>

It does. The kernel will (slightly) prefer to switch between two
threads sharing an address space over switching to a different address
space. (Hm, at least it used to, but I can't see where that happens now.)

J