2005-01-18 17:42:29

by Luck, Tony

Subject: pipe performance regression on ia64

David Mosberger pointed out to me that 2.6.11-rc1 kernel scores
very badly on ia64 in lmbench pipe throughput test (bw_pipe) compared
with earlier kernels.

Nanhai Zou looked into this, and found that the performance loss
began with Linus' patch to speed up pipe performance by allocating
a circular list of pages.

Here's his analysis:

>OK, I know the reason now.
>
>This regression comes from the scheduler load balancer.
>
>A pipe is a kind of workload where the writer and reader never run at
>the same time. They are synchronized by a semaphore; one is always
>sleeping while the other end is working.
>
>To keep the cache hot, we do not want the writer and reader to be
>balanced onto 2 cpus. That is why in fs/pipe.c the kernel uses
>wake_up_interruptible_sync() instead of wake_up_interruptible() to wake
>up the other process.
>
>However, the load balancer still spreads the processes out if any other
>cpu is idle. Note that on an HT-enabled x86 the load balancer will
>first balance the process to a cpu in the SMT domain, where there is no
>cache miss penalty.
>
>So, when we run bw_pipe on a lightly loaded SMP machine, the load
>balancer keeps trying to spread the 2 processes out while
>wake_up_interruptible_sync() keeps trying to draw them back onto
>1 cpu.
>
>Linus's patch greatly reduces how often wake_up_interruptible_sync()
>is called.
>
>For the bw_pipe writer or reader, the buffer size is 64k. On a 16k-page
>kernel, the old kernel would call wake_up_interruptible_sync() 4 times
>per buffer, but the new kernel calls it only once.
>
>Now the load balancer wins: the processes run on 2 cpus most of the
>time, and they pay a heavy cache miss penalty.
>
>To prove this, just run 4 instances of bw_pipe on a 4-way Tiger so the
>load balancer is less active.
>
>Or simply add some code at the top of main() in bw_pipe.c
>
>{
> long affinity = 1;
> sched_setaffinity(getpid(), sizeof(long), &affinity);
>}
>then make and run bw_pipe again.
>
>Now I get a throughput of 5GB/s...

-Tony
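
For reference, a self-contained version of the pinning hack Nanhai Zou
describes might look like the sketch below. It uses the cpu_set_t
interface of current glibc (CPU_ZERO/CPU_SET) rather than the older
long-bitmask calling convention shown in the mail; the benchmark body
itself is elided.

    /*
     * Sketch: pin the current process (and, after fork(), both ends of
     * the pipe it creates) to CPU 0 before running the benchmark loop.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);              /* bitmask: CPU 0 only */
        if (sched_setaffinity(getpid(), sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... bw_pipe's existing pipe/fork/measurement code ... */
        printf("pinned to CPU 0\n");
        return 0;
    }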


2005-01-18 18:11:38

by Linus Torvalds

Subject: Re: pipe performance regression on ia64



On Tue, 18 Jan 2005, Luck, Tony wrote:
> David Mosberger:
> >
> >So, when we run bw_pipe on a lightly loaded SMP machine, the load
> >balancer keeps trying to spread the 2 processes out while
> >wake_up_interruptible_sync() keeps trying to draw them back onto
> >1 cpu.
> >
> >Linus's patch greatly reduces how often wake_up_interruptible_sync()
> >is called.
> >
> >For the bw_pipe writer or reader, the buffer size is 64k. On a 16k-page
> >kernel, the old kernel would call wake_up_interruptible_sync() 4 times
> >per buffer, but the new kernel calls it only once.

Yes, it will depend on the buffer size, and on whether the writer actually
does any _work_ to fill it, or just writes it.

The thing is, in real life, the "wake_up()" tends to be preferable,
because even though we are totally synchronized on the pipe semaphore
(which is a locking issue in itself that might be worth looking into),
most real loads will actually do something to _generate_ the write data in
the first place, and thus you actually want to spread the load out over
CPU's.

The lmbench pipe benchmark is kind of special, since the writer literally
does nothing but write and the reader does nothing but read, so there is
nothing to parallelize.

The "wake_up_sync()" hack only helps for the special case where we know
the writer is going to write more. Of course, we could make the pipe code
use that "synchronous" write unconditionally, and benchmarks would look
better, but I suspect it would hurt real life.

The _normal_ use of a pipe, after all, is having a writer that does real
work to generate the data (like 'cc1'), and a sink that actually does real
work with it (like 'as'), and having less synchronization is a _good_
thing.

I don't know how to make the benchmark look repeatable and good, though.
The CPU affinity thing may be the right thing.

For example, if somebody blocks on a semaphore, we actually do have some
code to try to wake it up on the same CPU that released the semaphore (in
"try_to_wake_up()"), but in this case that again tends to be fought by the
idle balancing there too.. And again, that does tend to be the right thing
to do if the process has _other_ data than the stuff protected by the
semaphore. It's just that pipe_bw doesn't have that..

(pipe_bw() also makes zero-copy pipes with VM tricks look really good,
because it never does a store operation to the buffer it uses to write the
data, so VM tricks never see any COW faults, and can just move pages
around without any cost. Again, that is not what real life does, so
optimizing for the benchmark does the wrong thing).

Linus

2005-01-18 18:31:17

by David Mosberger

Subject: Re: pipe performance regression on ia64

>>>>> On Tue, 18 Jan 2005 10:11:26 -0800 (PST), Linus Torvalds <[email protected]> said:

Linus> I don't know how to make the benchmark look repeatable and
Linus> good, though. The CPU affinity thing may be the right thing.

Perhaps it should be split up into three cases:

- producer/consumer pinned to the same CPU
- producer/consumer pinned to different CPUs
- producer/consumer left under the control of the scheduler

The first two would let us observe any changes in the actual pipe
code, whereas the 3rd case would tell us which case the scheduler is
leaning towards (or if it starts doing something real crazy, like
rescheduling the tasks on different CPUs each time, we'd see a bandwidth
lower than case 2, and that should ring alarm bells).

--david
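
As a rough illustration of that split (not lmbench code), a minimal
harness could measure the same pipe transfer in the three placements,
assuming Linux's sched_setaffinity() and hard-coding CPUs 0 and 1 and a
64k buffer purely for the example:

    /*
     * Measure pipe bandwidth three ways: writer/reader pinned to the
     * same CPU, pinned to different CPUs, and left to the scheduler.
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFSZ (64 * 1024)
    #define TOTAL (512 * 1024 * 1024L)   /* bytes moved per measurement */

    static void pin(int cpu)
    {
        cpu_set_t mask;

        if (cpu < 0)                     /* negative: let the scheduler decide */
            return;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
            perror("sched_setaffinity");
    }

    static double mb_per_sec(int wcpu, int rcpu)
    {
        static char buf[BUFSZ];
        struct timeval t0, t1;
        long left;
        int fds[2];
        pid_t child;

        if (pipe(fds) != 0) {
            perror("pipe");
            exit(1);
        }
        child = fork();
        if (child < 0) {
            perror("fork");
            exit(1);
        }
        if (child == 0) {                /* writer */
            pin(wcpu);
            close(fds[0]);
            for (left = TOTAL; left > 0; left -= BUFSZ)
                if (write(fds[1], buf, BUFSZ) != BUFSZ)
                    exit(1);
            exit(0);
        }
        pin(rcpu);                       /* reader */
        close(fds[1]);
        gettimeofday(&t0, NULL);
        while (read(fds[0], buf, BUFSZ) > 0)
            ;
        gettimeofday(&t1, NULL);
        close(fds[0]);
        waitpid(child, NULL, 0);
        return TOTAL / (1e6 * ((t1.tv_sec - t0.tv_sec) +
                               (t1.tv_usec - t0.tv_usec) / 1e6));
    }

    int main(void)
    {
        printf("same cpu:       %8.1f MB/s\n", mb_per_sec(0, 0));
        printf("different cpus: %8.1f MB/s\n", mb_per_sec(0, 1));
        printf("scheduler:      %8.1f MB/s\n", mb_per_sec(-1, -1));
        return 0;
    }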

2005-01-18 20:18:12

by Linus Torvalds

Subject: Re: pipe performance regression on ia64



On Tue, 18 Jan 2005, David Mosberger wrote:
>
> >>>>> On Tue, 18 Jan 2005 10:11:26 -0800 (PST), Linus Torvalds <[email protected]> said:
>
> Linus> I don't know how to make the benchmark look repeatable and
> Linus> good, though. The CPU affinity thing may be the right thing.
>
> Perhaps it should be split up into three cases:
>
> - producer/consumer pinned to the same CPU
> - producer/consumer pinned to different CPUs
> - producer/consumer left under the control of the scheduler
>
> The first two would let us observe any changes in the actual pipe
> code, whereas the 3rd case would tell us which case the scheduler is
> leaning towards (or if it starts doing something real crazy, like
> rescheduling the tasks on different CPUs each time, we'd see a bandwidth
> lower than case 2, and that should ring alarm bells).

Yes, that would be good.

However, I don't know who (if anybody) maintains lmbench any more. It
might be Carl Staelin (added to cc), and there used to be a mailing list
which may or may not be active any more..

[ Background for Carl (and/or lmbench-users):

The "pipe bandwidth" test ends up giving wildly fluctuating (and even
when stable, pretty nonsensical, since they depend very strongly on the
size of the buffer being used to do the writes vs the buffer size in the
kernel) numbers purely depending on where the reader/writer got
scheduled.

So a recent kernel buffer management change made lmbench numbers vary
radically, ranging from huge improvements to big decreases. It would be
useful to see the numbers as a function of CPU selection on SMP (the
same is probably true also for the scheduling latency benchmark, which
is also extremely unstable on SMP).

It's not just that it has big variance - you can't just average out many
runs. It has very "modal" operation, making averages meaningless.

A trivial thing that would work for most cases is just a simple (change
the "1" to whatever CPU-mask you want for some case)

long affinity = 1; /* bitmask: CPU0 only */
sched_setaffinity(0, sizeof(long), &affinity);

but I don't know what other OS's do, so it's obviously not portable ]

Hmm?

Linus
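
One way to keep such a hack out of the portable code paths, sketched
here rather than taken from lmbench, is to compile the pinning in only
on Linux and make it a no-op everywhere else; the LMBENCH_PIN_CPU
environment variable is a made-up name used purely for illustration:

    /* Pin to the CPU named in LMBENCH_PIN_CPU, if set; otherwise (or on
     * non-Linux systems) do nothing and leave placement to the scheduler. */
    #ifdef __linux__
    #define _GNU_SOURCE
    #include <sched.h>
    #endif
    #include <stdlib.h>

    static void maybe_pin(void)
    {
    #ifdef __linux__
        const char *env = getenv("LMBENCH_PIN_CPU");
        cpu_set_t mask;

        if (env == NULL)
            return;
        CPU_ZERO(&mask);
        CPU_SET(atoi(env), &mask);
        sched_setaffinity(0, sizeof(mask), &mask);
    #endif
    }

    int main(void)
    {
        maybe_pin();
        /* ... benchmark proper would run here ... */
        return 0;
    }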

2005-01-18 23:34:47

by Nick Piggin

Subject: Re: pipe performance regression on ia64

Linus Torvalds wrote:
>
> On Tue, 18 Jan 2005, Luck, Tony wrote:
>
>>David Mosberger:
>>
>>>So, when we run bw_pipe on a lightly loaded SMP machine, the load
>>>balancer keeps trying to spread the 2 processes out while
>>>wake_up_interruptible_sync() keeps trying to draw them back onto
>>>1 cpu.
>>>
>>>Linus's patch greatly reduces how often wake_up_interruptible_sync()
>>>is called.
>>>
>>>For the bw_pipe writer or reader, the buffer size is 64k. On a 16k-page
>>>kernel, the old kernel would call wake_up_interruptible_sync() 4 times
>>>per buffer, but the new kernel calls it only once.
>
>
> Yes, it will depend on the buffer size, and on whether the writer actually
> does any _work_ to fill it, or just writes it.
>
> The thing is, in real life, the "wake_up()" tends to be preferable,
> because even though we are totally synchronized on the pipe semaphore
> (which is a locking issue in itself that might be worth looking into),
> most real loads will actually do something to _generate_ the write data in
> the first place, and thus you actually want to spread the load out over
> CPU's.
>
> The lmbench pipe benchmark is kind of special, since the writer literally
> does nothing but write and the reader does nothing but read, so there is
> nothing to parallelize.
>
> The "wake_up_sync()" hack only helps for the special case where we know
> the writer is going to write more. Of course, we could make the pipe code
> use that "synchronous" write unconditionally, and benchmarks would look
> better, but I suspect it would hurt real life.
>
> The _normal_ use of a pipe, after all, is having a writer that does real
> work to generate the data (like 'cc1'), and a sink that actually does real
> work with it (like 'as'), and having less synchronization is a _good_
> thing.
>
> I don't know how to make the benchmark look repeatable and good, though.
> The CPU affinity thing may be the right thing.
>

Regarding scheduler balancing behaviour:

The problem could also be magnified in recent -bk kernels by the
"wake up to an idle CPU" code in sched.c:try_to_wake_up(). To turn
this off, remove SD_WAKE_IDLE from include/linux/topology.h:SD_CPU_INIT
and include/asm/topology.h:SD_NODE_INIT

David I remember you reporting a pipe bandwidth regression, and I had
a patch for it, but that hurt other workloads, so I don't think we
ever really got anywhere. I've recently begun having another look at
the multiprocessor balancer, so hopefully I can get a bit further with
it this time.

2005-01-19 03:08:06

by Larry McVoy

Subject: Re: [Lmbench-users] Re: pipe performance regression on ia64

It would be good if you copied me directly since I don't read the kernel
list anymore (I'd love to but don't have the bandwidth) and I rarely read
the lmbench list. But only if you want to drag me into it, of course.

Carl and I both work on LMbench but not very actively. I had really
hoped that once people saw how small the benchmarks are they would
create their own:

work ~/LMbench2/src wc bw_pipe.c
120 340 2399 bw_pipe.c

I'm very unthrilled with the idea of adding stuff to the release benchmark
which is OS specific. That said, there is nothing to say that you can't
grab the benchmark and tweak your own test case in there to prove or
disprove your theory.

If you want to take LMbench and turn it into LinuxBench or something like
that so that it is clear that it is just a regression test for Linux then
hacking in a bunch of tests would make a ton of sense.

But, if you keep it generic I can give you output on a pile of different
OS's on relatively recent hardware since we just upgraded our build
cluster:

Welcome to redhat52.bitmover.com, a 2.1Ghz Athlon running Red Hat 5.2.
Welcome to redhat62.bitmover.com, a 2.16Ghz Athlon running Red Hat 6.2.
Welcome to redhat71.bitmover.com, a 2.1Ghz Athlon running Red Hat 7.1.
Welcome to redhat9.bitmover.com, a 2.1Ghz Athlon running Red Hat 9.
Welcome to amd64.bitmover.com, a 2Ghz AMD 64 running Fedora Core 1.
Welcome to parisc.bitmover.com, a 552Mhz PA8600 running Debian 3.1
Welcome to ppc.bitmover.com, a 400Mhz PowerPC running Yellow Dog 1.2.
Welcome to macos.bitmover.com, a dual 1.2Ghz G4 running MacOS 10.2.8.
Welcome to sparc.bitmover.com a 440 Mhz Sun Netra T1 running Debian 3.1.
Welcome to alpha.bitmover.com, a 500Mhz AlphaPC running Red Hat 7.2.
Welcome to ia64.bitmover.com, a dual 800Mhz Itanium running Red Hat 7.2.
Welcome to freebsd.bitmover.com, a 2.17Ghz Athlon running FreeBSD 2.2.8.
Welcome to freebsd3.bitmover.com, a 1.8Ghz Athlon running FreeBSD 3.2.
Welcome to freebsd4.bitmover.com, a 1.8Ghz Athlon running FreeBSD 4.1.
Welcome to freebsd5.bitmover.com, a 1.6Ghz Athlon running FreeBSD 5.1.
Welcome to openbsd.bitmover.com, a 2.17Ghz Athlon running OpenBSD 3.4.
Welcome to netbsd.bitmover.com, a 1Ghz Athlon running NetBSD 1.6.1.
Welcome to sco.bitmover.com, a 1.8Ghz Athlon running SCO OpenServer R5.
Welcome to sun.bitmover.com, a 440Mhz Sun Ultra 10 running Solaris 2.6
Welcome to sunx86.bitmover.com, a dual 1Ghz PIII running Solaris 2.7.
Welcome to sgi.bitmover.com, a 195Mhz MIPS IP28 running IRIX 6.5.
Welcome to sibyte.bitmover.com, a dual 800Mhz MIPS running Debian 3.0.
Welcome to hp.bitmover.com, a 552Mhz PA8600 running HP-UX 10.20.
Welcome to hp11.bitmover.com, a dual 550Mhz PA8500 running HP-UX 11.11.
Welcome to hp11-32bit.bitmover.com, a 400Mhz PA8500 running HP-UX 11.11.
Welcome to aix.bitmover.com, a 332Mhz PowerPC running AIX 4.1.5.
Welcome to qube.bitmover.com, a 250Mhz MIPS running Linux 2.0.34.
Welcome to arm.bitmover.com, a 233Mhz StrongARM running Linux 2.2.
Welcome to tru64.bitmover.com, a 600Mhz Alpha running Tru64 5.1B.
Welcome to winxp2.bitmover.com, a 2.1Ghz Athlon running Windows XP.


On Tue, Jan 18, 2005 at 12:17:11PM -0800, Linus Torvalds wrote:
>
>
> On Tue, 18 Jan 2005, David Mosberger wrote:
> >
> > >>>>> On Tue, 18 Jan 2005 10:11:26 -0800 (PST), Linus Torvalds <[email protected]> said:
> >
> > Linus> I don't know how to make the benchmark look repeatable and
> > Linus> good, though. The CPU affinity thing may be the right thing.
> >
> > Perhaps it should be split up into three cases:
> >
> > - producer/consumer pinned to the same CPU
> > - producer/consumer pinned to different CPUs
> > - producer/consumer left under the control of the scheduler
> >
> > The first two would let us observe any changes in the actual pipe
> > code, whereas the 3rd case would tell us which case the scheduler is
> > leaning towards (or if it starts doing something real crazy, like
> > rescheduling the tasks on different CPUs each time, we'd see a bandwidth
> > lower than case 2, and that should ring alarm bells).
>
> Yes, that would be good.
>
> However, I don't know who (if anybody) maintains lmbench any more. It
> might be Carl Staelin (added to cc), and there used to be a mailing list
> which may or may not be active any more..
>
> [ Background for Carl (and/or lmbench-users):
>
> The "pipe bandwidth" test ends up giving wildly fluctuating (and even
> when stable, pretty nonsensical, since they depend very strongly on the
> size of the buffer being used to do the writes vs the buffer size in the
> kernel) numbers purely depending on where the reader/writer got
> scheduled.
>
> So a recent kernel buffer management change made lmbench numbers vary
> radically, ranging from huge improvements to big decreases. It would be
> useful to see the numbers as a function of CPU selection on SMP (the
> same is probably true also for the scheduling latency benchmark, which
> is also extremely unstable on SMP).
>
> It's not just that it has big variance - you can't just average out many
> runs. It has very "modal" operation, making averages meaningless.
>
> A trivial thing that would work for most cases is just a simple (change
> the "1" to whatever CPU-mask you want for some case)
>
> long affinity = 1; /* bitmask: CPU0 only */
> sched_setaffinity(0, sizeof(long), &affinity);
>
> but I don't know what other OS's do, so it's obviously not portable ]
>
> Hmm?
>
> Linus

--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com

2005-01-19 03:21:25

by Linus Torvalds

Subject: Re: [Lmbench-users] Re: pipe performance regression on ia64



On Tue, 18 Jan 2005, Larry McVoy wrote:
>
> I'm very unthrilled with the idea of adding stuff to the release benchmark
> which is OS specific. That said, there is nothing to say that you can't
> grab the benchmark and tweak your own test case in there to prove or
> disprove your theory.

Hmm.. The notion of SMP and CPU pinning is certainly not OS-specific (and
I bet you'll see all the same issues everywhere else too), but the
interfaces do tend to be, which makes it a bit uncomfortable..

Linus

2005-01-19 05:11:54

by David Mosberger

Subject: Re: pipe performance regression on ia64

>>>>> On Wed, 19 Jan 2005 10:34:30 +1100, Nick Piggin <[email protected]> said:

Nick> David I remember you reporting a pipe bandwidth regression,
Nick> and I had a patch for it, but that hurt other workloads, so I
Nick> don't think we ever really got anywhere. I've recently begun
Nick> having another look at the multiprocessor balancer, so
Nick> hopefully I can get a bit further with it this time.

While it may be worthwhile to improve the scheduler, it's clear that
there isn't going to be a trivial "fix" for this issue, especially
since it's not even clear that anything is really broken. Independent
of the scheduler work, it would be very useful to have a pipe
benchmark which at least made the dependencies on the scheduler
obvious. So I think improving the scheduler and improving the LMbench
pipe benchmark are entirely complementary.

--david

2005-01-19 12:44:17

by Nick Piggin

Subject: Re: pipe performance regression on ia64

David Mosberger wrote:
>>>>>>On Wed, 19 Jan 2005 10:34:30 +1100, Nick Piggin <[email protected]> said:
>
>
> Nick> David I remember you reporting a pipe bandwidth regression,
> Nick> and I had a patch for it, but that hurt other workloads, so I
> Nick> don't think we ever really got anywhere. I've recently begun
> Nick> having another look at the multiprocessor balancer, so
> Nick> hopefully I can get a bit further with it this time.
>
> While it may be worthwhile to improve the scheduler, it's clear that
> there isn't going to be a trivial "fix" for this issue, especially
> since it's not even clear that anything is really broken. Independent
> of the scheduler work, it would be very useful to have a pipe
> benchmark which at least made the dependencies on the scheduler
> obvious. So I think improving the scheduler and improving the LMbench
> pipe benchmark are entirely complementary.
>

Oh that's quite true. A bad score on SMP on the pipe benchmark does
not mean anything is broken.

And IMO, probably many (most?) lmbench tests should be run with all
processes bound to the same CPU on SMP systems to get the best
repeatability and an indication of the basic serial speed of the
operation (which AFAIK is what they aim to measure).

Having the scheduler take care of process placement is interesting
too, of course. But it adds a new variable to the tests, which IMO
doesn't always suit lmbench too well.

2005-01-19 12:53:45

by Ingo Molnar

Subject: Re: pipe performance regression on ia64


* Linus Torvalds <[email protected]> wrote:

> The "wake_up_sync()" hack only helps for the special case where we
> know the writer is going to write more. Of course, we could make the
> pipe code use that "synchronous" write unconditionally, and benchmarks
> would look better, but I suspect it would hurt real life.

not just that, it's incorrect scheduling, because it introduces the
potential to delay the woken up task by a long time, amounting to a
missed wakeup.

> I don't know how to make the benchmark look repeatable and good,
> though. The CPU affinity thing may be the right thing.

the fundamental bw_pipe scenario is this: the wakeup will happen earlier
than the waker suspends. (because it's userspace that decides about
suspension.) So the kernel rightfully notifies another, idle CPU to run
the freshly woken task. If the message passing across CPUs and the
target CPU is fast enough to 'grab' the task, then we'll get the "slow"
benchmark case, waker remaining on this CPU, wakee running on another
CPU. If this CPU happens to suspend fast enough, before that other
CPU has had the chance to grab the task (we 'steal the task back'), then
we'll see the "fast" benchmark scenario.

i've seen traces where a single bw_pipe testrun showed _both_ variants
in chunks of 100s of milliseconds, probably due to cacheline placement
putting the overhead sometimes above the critical latency, sometimes
below it.

so there will always be this 'latency and tendency to reschedule on
another CPU' thing that will act as a barrier between 'really good' and
'really bad' numbers, and if a test happens to be around that boundary
it will fluctuate back and forth.

and this property also has another effect: _worse_ scheduling decisions
(not waking up an idle CPU when we could) can result in _better_ bw_pipe
numbers. Also, a _slower_ scheduler can sometimes move the bw_pipe
workload below the threshold, resulting in _better_ numbers. So as far
as SMP systems are concerned, bw_pipe numbers have to be considered very
carefully.

this is a generic thing: message passing latency always scales inversely
with the quality of SMP task distribution. The better we are at
spreading out tasks, the worse message passing latency gets. (nothing
will beat passive, work-less 'message passing' between two tasks on the
same CPU.)

Ingo

2005-01-19 16:42:54

by Larry McVoy

Subject: Re: [Lmbench-users] Re: pipe performance regression on ia64

On Tue, Jan 18, 2005 at 12:17:11PM -0800, Linus Torvalds wrote:
>
>
> On Tue, 18 Jan 2005, David Mosberger wrote:
> >
> > >>>>> On Tue, 18 Jan 2005 10:11:26 -0800 (PST), Linus Torvalds <[email protected]> said:
> >
> > Linus> I don't know how to make the benchmark look repeatable and
> > Linus> good, though. The CPU affinity thing may be the right thing.
> >
> > Perhaps it should be split up into three cases:
> >
> > - producer/consumer pinned to the same CPU
> > - producer/consumer pinned to different CPUs
> > - producer/consumer left under the control of the scheduler
> >
> > The first two would let us observe any changes in the actual pipe
> > code, whereas the 3rd case would tell us which case the scheduler is
> > leaning towards (or if it starts doing something real crazy, like
> > rescheduling the tasks on different CPUs each time, we'd see a bandwidth
> > lower than case 2, and that should ring alarm bells).
>
> Yes, that would be good.

You're revisiting a pile of work I did back at SGI; I'm pretty sure all
of this has been thought through before, but it's worth going over again.
I have some pretty strong opinions about this that schedulers tend not
to like.

It's certainly true that you can increase the performance of this sort
of problem by pinning the processes to a CPU and/or different CPUs.
For specific applications that is a fine thing to do; I did that for
the bulk data server that was moving NFS traffic over a TCP socket
over HIPPI (if you look at images from space that came from the
military it is pretty likely that they passed through that code).
Pinning the processes to a particular _cache_ (not CPU, CPU has
nothing to do with it) gave me around 20% better throughput.

The problem is that schedulers tend to be too smart: they try to figure
out where to put the process each time they schedule. In general that
is the wrong answer, for two reasons:

a) It's more work on each context switch.
b) It only works for processes that use up a substantial fraction of
their time slice (because the calculation is typically based in part
on the idea that if you ran on this cache for a long time then you
want to stay there).

The problem with the "thinking scheduler" is that it doesn't work for I/O
loads. That sort of approach will believe that it is fine to move a process
which hasn't run for a long time. That's false: you are invalidating its
cache, and that hurts. That's where my 20% gain came from.

You are far better off, in my opinion, having a scheduler that thinks at
process creation time and then only when the load gets unbalanced. Other
than that, it always puts the process back on the CPU where it last ran.
If the scheduler guesses wrong and puts two processes on the same CPU
and they fight, one will get moved. But it shouldn't happen right away:
leave it there and let things settle a bit.

If someone coded up this policy and tried it, I think it would go a
long way towards making the LMbench timings more stable. I could be
wrong, but it would be interesting to compare this approach with manual
placement. Manual placement will always do better, but the difference
should be in the 5% range, not the 20% range.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com
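
As a toy model (plain userspace C, not kernel code) of the policy Larry
describes, the sketch below picks a CPU once at creation, always wakes a
task on the CPU it last ran on, and migrates only after an imbalance has
persisted for a while; all structures, names and thresholds are invented
for illustration:

    #include <stdio.h>

    #define NR_CPUS      4
    #define SETTLE_TICKS 8              /* let an imbalance persist this long */

    struct task {
        int last_cpu;                   /* the cache this task is presumed warm on */
    };

    static int nr_running[NR_CPUS];
    static int imbalanced_for;          /* ticks the load has stayed unbalanced */

    /* Creation time: the only point where we "think" about placement. */
    static void place_new_task(struct task *t)
    {
        int cpu, best = 0;

        for (cpu = 1; cpu < NR_CPUS; cpu++)
            if (nr_running[cpu] < nr_running[best])
                best = cpu;
        t->last_cpu = best;
        nr_running[best]++;
    }

    /* Wakeup time: no thinking, just go back to the warm cache. */
    static int cpu_for_wakeup(const struct task *t)
    {
        return t->last_cpu;
    }

    /* Periodic balancing: move a task off the busiest CPU only after the
     * imbalance has persisted, so transient fights are left to settle. */
    static void rebalance_tick(struct task *victim)
    {
        int cpu, busiest = 0, idlest = 0;

        for (cpu = 1; cpu < NR_CPUS; cpu++) {
            if (nr_running[cpu] > nr_running[busiest])
                busiest = cpu;
            if (nr_running[cpu] < nr_running[idlest])
                idlest = cpu;
        }
        if (nr_running[busiest] - nr_running[idlest] < 2) {
            imbalanced_for = 0;
            return;
        }
        if (++imbalanced_for < SETTLE_TICKS)
            return;
        imbalanced_for = 0;
        if (victim->last_cpu == busiest) {      /* migrate, but reluctantly */
            nr_running[busiest]--;
            nr_running[idlest]++;
            victim->last_cpu = idlest;
        }
    }

    int main(void)
    {
        struct task a, b;
        int i;

        place_new_task(&a);
        place_new_task(&b);
        /* Simulate a wrong guess: force both tasks onto the same CPU. */
        nr_running[b.last_cpu]--;
        b.last_cpu = a.last_cpu;
        nr_running[b.last_cpu]++;

        for (i = 0; i < 2 * SETTLE_TICKS; i++) {
            rebalance_tick(&b);
            printf("tick %2d: a on cpu%d, b on cpu%d\n",
                   i, cpu_for_wakeup(&a), cpu_for_wakeup(&b));
        }
        return 0;
    }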

2005-01-19 17:34:22

by David Mosberger

Subject: Re: pipe performance regression on ia64

>>>>> On Wed, 19 Jan 2005 23:43:45 +1100, Nick Piggin <[email protected]> said:

Nick> Oh that's quite true. A bad score on SMP on the pipe benchmark
Nick> does not mean anything is broken.

Nick> And IMO, probably many (most?) lmbench tests should be run
Nick> with all processes bound to the same CPU on SMP systems to get
Nick> the best repeatability and an indication of the basic serial
Nick> speed of the operation (which AFAIK is what they aim to
Nick> measure).

We need to keep an eye on both the intra- and the inter-cpu
pipe-bandwidth and should measure them explicitly. The problem is
that at the moment, we get one, the other, or a mixture of the two,
subject to the vagaries of the scheduler. If we could reliably
measure both intra and inter-cpu cases, we may well find new
optimization opportunities (I'm almost certain that's true for the
cross-cpu case, which is probably the more important one, actually).

--david