Hello all,
Everyone knows that there are three kinds of lies: lies, damned lies, and benchmarks.
Despite this obvious fact, I have recently tried to compare pipe performance on
Linux and FreeBSD systems. Unfortunately, the Linux results are poor - roughly 2x slower
than FreeBSD. A detailed description of the test case, preparation, environment and
results is located at http://213.148.29.37/PipeBench, and everyone is welcome to
look at it, reproduce it, criticize it, etc.
Thanks,
Dmitry
From: Antipov Dmitry <[email protected]>
Date: Wed, 05 Mar 2008 10:46:57 +0300
> Despite of this obvious fact, recently I've tried to compare pipe
> performance on Linux and FreeBSD systems. Unfortunately, Linux
> results are poor - ~2x slower than FreeBSD. The detailed description
> of the test case, preparation, environment and results are located
> at http://213.148.29.37/PipeBench, and everyone are pleased to look
> at, reproduce, criticize, etc.
FreeBSD does page flipping into the pipe receiver, so rerun your test
case but have either the sender or the receiver make changes to
their memory buffer in between the read/write calls.
FreeBSD's scheme is only good for benchmarks, rather than real life.
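A rough sketch of that suggestion (hypothetical code, not the benchmark's own pipe.c; the function name, buffer handling and page size are only illustrative): have the receiver scribble on the data between read() calls, so a page-flipping implementation cannot keep handing back the same untouched pages.

/* Sketch: dirty one byte per page of the received buffer between read()
 * calls, so a page-flipping pipe implementation has to do real work.
 * Hypothetical loop, not the actual benchmark code. */
#include <unistd.h>

#define PAGE_BYTES 4096

static void receiver(int fd, char *buf, size_t len)
{
        while (read(fd, buf, len) > 0) {
                /* touch every page of the buffer before the next read() */
                for (size_t off = 0; off < len; off += PAGE_BYTES)
                        buf[off] ^= 1;
        }
}

The same idea applies on the sender side: modify the buffer before each write() instead of sending the same bytes over and over.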
David Miller wrote:
> From: Antipov Dmitry <[email protected]>
> Date: Wed, 05 Mar 2008 10:46:57 +0300
>
>
>> Despite of this obvious fact, recently I've tried to compare pipe
>> performance on Linux and FreeBSD systems. Unfortunately, Linux
>> results are poor - ~2x slower than FreeBSD. The detailed description
>> of the test case, preparation, environment and results are located
>> at http://213.148.29.37/PipeBench, and everyone are pleased to look
>> at, reproduce, criticize, etc.
>>
>
> FreeBSD does page flipping into the pipe receiver, so rerun your test
> case but have either the sender or the receiver make changes to
> their memory buffer in between the read/write calls.
>
> FreeBSD's scheme is only good for benchmarks, rather then real life.
>
>
>
>
Page flipping might explain the differences for big transfers, but note the
difference with small buffers (64, 128, 256, 512 bytes).
I tried the 'pipe' program on a fresh linux-2.6.24.2, on a dual Xeon 5120
machine, and we can notice that four CPUs are used (but only two threads
are running in this benchmark):
Cpu0 : 3.7% us, 38.7% sy, 0.0% ni, 57.7% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu1 : 4.0% us, 36.5% sy, 0.0% ni, 58.5% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu2 : 3.7% us, 25.9% sy, 0.0% ni, 70.4% id, 0.0% wa, 0.0% hi, 0.0% si
Cpu3 : 2.0% us, 25.3% sy, 0.0% ni, 72.7% id, 0.0% wa, 0.0% hi, 0.0% si
# vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in     cs us sy id wa
 2  0      0 1356432 125788 416796    0    0     3     1    5      2  1  2 97  0
 1  0      0 1356432 125788 416796    0    0     0     0   18 336471  5 35 61  0
 1  0      0 1356680 125788 416796    0    0     0     0   16 330420  6 34 60  0
 1  0      0 1356680 125788 416796    0    0     0     0   16 319826  6 34 61  0
 1  0      0 1356680 125788 416796    0    0     0     0   16 311708  5 34 61  0
 2  0      0 1356680 125788 416796    0    0     0     0   17 331712  4 35 61  0
 1  0      0 1356680 125788 416796    0    0     0     4   17 333001  6 32 62  0
 1  0      0 1356680 125788 416796    0    0     0     0   15 336755  7 31 62  0
 2  0      0 1356680 125788 416796    0    0     0     0   16 323086  5 34 61  0
 1  0      0 1356680 125788 416796    0    0     0     0   12 373822  4 33 63  0
# opreport -l /boot/vmlinux-2.6.24.2 |head -n 30
CPU: Core 2, speed 1866.8 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
52137 9.3521 kunmap_atomic
50983 9.1451 mwait_idle_with_hints
50448 9.0492 system_call
49727 8.9198 task_rq_lock
24531 4.4003 pipe_read
19820 3.5552 pipe_write
16176 2.9016 dnotify_parent
15455 2.7723 file_update_time
15216 2.7294 find_busiest_group
12449 2.2331 __copy_from_user_ll_nozero
12291 2.2047 set_next_entity
12023 2.1566 resched_task
11728 2.1037 __switch_to
11294 2.0259 update_curr
10749 1.9281 touch_atime
9334 1.6743 kmap_atomic_prot
9084 1.6295 __wake_up_sync
8321 1.4926 try_to_wake_up
7522 1.3493 pick_next_entity
7216 1.2944 cpu_idle
6780 1.2162 vfs_read
6727 1.2067 vfs_write
6570 1.1785 __copy_to_user_ll
6407 1.1493 syscall_exit
6283 1.1270 restore_nocheck
6064 1.0877 weighted_cpuload
6045 1.0843 rw_verify_area
This benchmark mostly stresses the scheduler AFAIK, not really the pipe() code...
Lots of context switches...
Eric Dumazet <[email protected]> writes:
>
> This benchmarek mostly stress scheduler AFAIK, not really pipe() code...
>
> Lot of context switches...
The recent MySQL sysbench benchmark that they also won was likewise a
(somewhat unrealistic) scheduler benchmark with a lot of over-scheduling.
Just speculating here, but perhaps the context switch got slower
recently?
-Andi
Eric Dumazet wrote:
> I tried the 'pipe' prog on a fresh linux-2.6.24.2, on a dual Xeon 5120
> machine, and we can notice that four cpus are used (but only two
> threads are running on this benchmark)
Do the threads migrate from CPU to CPU? That would be sub-optimal,
wouldn't it?
On Wednesday 05 March 2008 20:47, Eric Dumazet wrote:
> David Miller wrote:
> > From: Antipov Dmitry <[email protected]>
> > Date: Wed, 05 Mar 2008 10:46:57 +0300
> >
> >> Despite of this obvious fact, recently I've tried to compare pipe
> >> performance on Linux and FreeBSD systems. Unfortunately, Linux
> >> results are poor - ~2x slower than FreeBSD. The detailed description
> >> of the test case, preparation, environment and results are located
> >> at http://213.148.29.37/PipeBench, and everyone are pleased to look
> >> at, reproduce, criticize, etc.
> >
> > FreeBSD does page flipping into the pipe receiver, so rerun your test
> > case but have either the sender or the receiver make changes to
> > their memory buffer in between the read/write calls.
> >
> > FreeBSD's scheme is only good for benchmarks, rather then real life.
>
> page flipping might explain differences for big transferts, but note the
> difference with small buffers (64, 128, 256, 512 bytes)
>
> I tried the 'pipe' prog on a fresh linux-2.6.24.2, on a dual Xeon 5120
> machine, and we can notice that four cpus are used (but only two threads
> are running on this benchmark)
One thing to try is pinning both processes on the same CPU. This
may be what the FreeBSD scheduler is preferring to do, and it ends
up really being a tradeoff that helps some workloads and hurts
others. With a very unscientific test with an old kernel, the
pipe.c test gets anywhere from about 1.5 to 3 times faster when
running it as "taskset 1 ./pipe".
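For reference, the same pinning can also be done from inside the test program rather than via taskset; a minimal Linux-specific sketch, with the function name, CPU number and error handling purely illustrative:

/* Pin the calling process to CPU 0, roughly equivalent to running it
 * under "taskset 1". Sketch only. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu0(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}

If both the reader and the writer call this after fork(), they end up sharing CPU 0, which is the case being compared here.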
> # opreport -l /boot/vmlinux-2.6.24.2 |head -n 30
> CPU: Core 2, speed 1866.8 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
> unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % symbol name
> 52137 9.3521 kunmap_atomic
I wonder whether FreeBSD allocates its pipe buffers from kernel-addressable
memory. We could do this to eliminate the cost completely on highmem
systems (whether it is a good idea I don't know; normally you'd actually
do a bit of work between reading from or writing to a pipe...)
> 50983 9.1451 mwait_idle_with_hints
> 50448 9.0492 system_call
> 49727 8.9198 task_rq_lock
> 24531 4.4003 pipe_read
> 19820 3.5552 pipe_write
> 16176 2.9016 dnotify_parent
Just say no to dnotify.
> 15455 2.7723 file_update_time
Dumb question: anyone know why pipe.c calls this?
Nick Piggin wrote:
> On Wednesday 05 March 2008 20:47, Eric Dumazet wrote:
>
>> David Miller wrote:
>>
>>> From: Antipov Dmitry <[email protected]>
>>> Date: Wed, 05 Mar 2008 10:46:57 +0300
>>>
>>>
>>>> Despite of this obvious fact, recently I've tried to compare pipe
>>>> performance on Linux and FreeBSD systems. Unfortunately, Linux
>>>> results are poor - ~2x slower than FreeBSD. The detailed description
>>>> of the test case, preparation, environment and results are located
>>>> at http://213.148.29.37/PipeBench, and everyone are pleased to look
>>>> at, reproduce, criticize, etc.
>>>>
>>> FreeBSD does page flipping into the pipe receiver, so rerun your test
>>> case but have either the sender or the receiver make changes to
>>> their memory buffer in between the read/write calls.
>>>
>>> FreeBSD's scheme is only good for benchmarks, rather then real life.
>>>
>> page flipping might explain differences for big transferts, but note the
>> difference with small buffers (64, 128, 256, 512 bytes)
>>
>> I tried the 'pipe' prog on a fresh linux-2.6.24.2, on a dual Xeon 5120
>> machine, and we can notice that four cpus are used (but only two threads
>> are running on this benchmark)
>>
>
> One thing to try is pinning both processes on the same CPU. This
> may be what the FreeBSD scheduler is preferring to do, and it ends
> up being really a tradeoff that helps some workloads and hurts
> others. With a very unscientific test with an old kernel, the
> pipe.c test gets anywhere from about 1.5 to 3 times faster when
> running it as taskset 1 ./pipe
>
>
>
>> # opreport -l /boot/vmlinux-2.6.24.2 |head -n 30
>> CPU: Core 2, speed 1866.8 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
>> unit mask of 0x00 (Unhalted core cycles) count 100000
>> samples % symbol name
>> 52137 9.3521 kunmap_atomic
>>
>
> I wonder if FreeBSD doesn't allocate their pipe buffers from kernel
> addressable memory. We could do this to eliminate the cost completely
> on highmem systems (whether it is a good idea I don't know, normally
> you'd actually do a bit of work between reading or writing from a
> pipe...)
>
>
>
>> 50983 9.1451 mwait_idle_with_hints
>> 50448 9.0492 system_call
>> 49727 8.9198 task_rq_lock
>> 24531 4.4003 pipe_read
>> 19820 3.5552 pipe_write
>> 16176 2.9016 dnotify_parent
>>
>
> Just say no to dnotify.
>
>
>
>> 15455 2.7723 file_update_time
>>
>
> Dumb question: anyone know why pipe.c calls this?
>
>
Because the pipe writer calls the write() syscall -> file_update_time() in
the kernel, while the pipe reader calls the read() syscall -> touch_atime()
in the kernel.
The inode's i_mtime, i_ctime, i_atime and i_mutex fields share the same
cache line, so there is nothing we can improve to avoid the cache line
ping-pong.
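This is easy to see from user space; a small hypothetical demo (not part of the benchmark; the helper name and sleeps are only illustrative) that fstat()s a pipe around a write() and a read() shows the timestamps moving:

/* Hypothetical demo: show that write() updates the pipe's mtime/ctime
 * and read() updates its atime. The sleeps are only there because the
 * st_* fields have one-second resolution. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

static void show(const char *tag, int fd)
{
        struct stat st;

        if (fstat(fd, &st) == 0)
                printf("%-14s atime=%ld mtime=%ld ctime=%ld\n", tag,
                       (long)st.st_atime, (long)st.st_mtime, (long)st.st_ctime);
}

int main(void)
{
        int fds[2];
        char c = 'x';

        if (pipe(fds) != 0)
                return 1;
        show("after pipe():", fds[0]);
        sleep(2);
        if (write(fds[1], &c, 1) != 1)  /* writer path -> file_update_time() */
                return 1;
        show("after write():", fds[0]);
        sleep(2);
        if (read(fds[0], &c, 1) != 1)   /* reader path -> touch_atime() */
                return 1;
        show("after read():", fds[0]);
        return 0;
}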
On Thursday 06 March 2008 01:55, Eric Dumazet wrote:
> Nick Piggin wrote:
> > On Wednesday 05 March 2008 20:47, Eric Dumazet wrote:
> >> David Miller wrote:
> >>> From: Antipov Dmitry <[email protected]>
> >>> Date: Wed, 05 Mar 2008 10:46:57 +0300
> >>>
> >>>> Despite of this obvious fact, recently I've tried to compare pipe
> >>>> performance on Linux and FreeBSD systems. Unfortunately, Linux
> >>>> results are poor - ~2x slower than FreeBSD. The detailed description
> >>>> of the test case, preparation, environment and results are located
> >>>> at http://213.148.29.37/PipeBench, and everyone are pleased to look
> >>>> at, reproduce, criticize, etc.
> >>>
> >>> FreeBSD does page flipping into the pipe receiver, so rerun your test
> >>> case but have either the sender or the receiver make changes to
> >>> their memory buffer in between the read/write calls.
> >>>
> >>> FreeBSD's scheme is only good for benchmarks, rather then real life.
> >>
> >> page flipping might explain differences for big transferts, but note the
> >> difference with small buffers (64, 128, 256, 512 bytes)
> >>
> >> I tried the 'pipe' prog on a fresh linux-2.6.24.2, on a dual Xeon 5120
> >> machine, and we can notice that four cpus are used (but only two threads
> >> are running on this benchmark)
> >
> > One thing to try is pinning both processes on the same CPU. This
> > may be what the FreeBSD scheduler is preferring to do, and it ends
> > up being really a tradeoff that helps some workloads and hurts
> > others. With a very unscientific test with an old kernel, the
> > pipe.c test gets anywhere from about 1.5 to 3 times faster when
> > running it as taskset 1 ./pipe
> >
> >> # opreport -l /boot/vmlinux-2.6.24.2 |head -n 30
> >> CPU: Core 2, speed 1866.8 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
> >> unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples % symbol name
> >> 52137 9.3521 kunmap_atomic
> >
> > I wonder if FreeBSD doesn't allocate their pipe buffers from kernel
> > addressable memory. We could do this to eliminate the cost completely
> > on highmem systems (whether it is a good idea I don't know, normally
> > you'd actually do a bit of work between reading or writing from a
> > pipe...)
> >
> >> 50983 9.1451 mwait_idle_with_hints
> >> 50448 9.0492 system_call
> >> 49727 8.9198 task_rq_lock
> >> 24531 4.4003 pipe_read
> >> 19820 3.5552 pipe_write
> >> 16176 2.9016 dnotify_parent
> >
> > Just say no to dnotify.
> >
> >> 15455 2.7723 file_update_time
> >
> > Dumb question: anyone know why pipe.c calls this?
>
> Because pipe writer calls write() syscall -> file_update_time() in kernel
> while pipe reader calls read() syscall -> touch_atime() in kernel
Yeah, but why does the pipe inode need to have its times updated?
I guess there is some reason... hopefully not C&P related.
On Wed, Mar 5, 2008 at 7:38 AM, Nick Piggin <[email protected]> wrote:
> On Thursday 06 March 2008 01:55, Eric Dumazet wrote:
> > Because pipe writer calls write() syscall -> file_update_time() in kernel
> > while pipe reader calls read() syscall -> touch_atime() in kernel
>
> Yeah, but why does the pipe inode need to have its times updated?
> I guess there is some reason... hopefully not C&P related.
In principle so that the reader or writer can find out the last time
the other end did any processing of the pipe. And yeah, for POSIX
compliance: "Upon successful completion, pipe() will mark for update
the st_atime, st_ctime and st_mtime fields of the pipe." But it'd be
nice if there were a way to avoid touching it more than once a second
(note the 'will mark for update' language). Or if the pipe is a
physical FIFO on a noatime filesystem?
On Thursday 06 March 2008 02:55, Ray Lee wrote:
> On Wed, Mar 5, 2008 at 7:38 AM, Nick Piggin <[email protected]> wrote:
> > Yeah, but why does the pipe inode need to have its times updated?
> > I guess there is some reason... hopefully not C&P related.
>
> In principle so that the reader or writer can find out the last time
> the other end did any processing of the pipe. And yeah, for POSIX
> compliance: "Upon successful completion, pipe() will mark for update
> the st_atime, st_ctime and st_mtime fields of the pipe. "
Thanks.
> But it'd be
> nice if there were a way to avoid touching it more than once a second
> (note the 'will mark for update' language). Or if the pipe is a
> physical FIFO on a noatime filesystem?
I doubt it really matters for anything except this test. I wouldn't
bother doing anything fancy really. It just caught my eye and I was
wondering why it was there at all.
Thanks,
Nick
Nick Piggin wrote:
> One thing to try is pinning both processes on the same CPU. This
> may be what the FreeBSD scheduler is preferring to do, and it ends
> up being really a tradeoff that helps some workloads and hurts
> others. With a very unscientific test with an old kernel, the
> pipe.c test gets anywhere from about 1.5 to 3 times faster when
> running it as taskset 1 ./pipe
Sounds interesting. What kernel version did you try? Can you
send your .config to me?
I've tried this trick on 2.6.25-rc4, and got ~20% more throughput for
large (> 8K) buffers at the cost of ~30% less for the small ones.
Dmitry
On Thursday 06 March 2008 23:11, Dmitry Antipov wrote:
> Nick Piggin wrote:
> > One thing to try is pinning both processes on the same CPU. This
> > may be what the FreeBSD scheduler is preferring to do, and it ends
> > up being really a tradeoff that helps some workloads and hurts
> > others. With a very unscientific test with an old kernel, the
> > pipe.c test gets anywhere from about 1.5 to 3 times faster when
> > running it as taskset 1 ./pipe
>
> Sounds interesting. What kernel version did you tried? Can you
> send your .config to me?
>
> I've tried this trick on 2.6.25-rc4, and got ~20% more throughput for
> large (> 8K) buffers at the cost of going ~30% down for the small ones.
Seems some people are still concerned about this benchmark. OK, I
tried with Linux 2.6.25-rc6 (just because it's what I've got on
this system) versus FreeBSD 7.0.
Unfortunately, I don't think FreeBSD supports binding a process to a
CPU, and on either system when the scheduler is allowed to choose
what happens, results are more variable than you would like.
That being said, I found that Linux often outscored FreeBSD in all 3
tests of pipe_v3. FreeBSD does appear to get a slightly higher
throughput at 64K in test #1, so maybe its data copy routines are
slightly better. OTOH, I found Linux is better at 64K in test #2. For
the low sizes, I found Linux was usually faster than FreeBSD in tests
1 and 2, and around the same on test 3.
The other thing is that this test is pretty much a random context
switch benchmark that really depends on slight variations in how the
scheduler runs things. If you happen to be able to keep the
pipe from filling or emptying completely, you can run both processes
at the same time on different CPUs. If you run both processes on the
same CPU, then to minimise context switches you want to avoid
preempting the producer until it fills the pipe, and to avoid
preempting the consumer until it empties the pipe.
For example, if I call nice(20) at the start of the reader process
in tests #1 and #2, I get around a 5x speedup in Linux when running
reader and writer on the same CPU.
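A sketch of what that change amounts to (hypothetical code, not the benchmark's actual reader; the function name and error handling are illustrative): lower the reader's priority so the producer tends to run until the pipe fills.

/* Sketch: de-prioritise the reader so the writer tends to fill the pipe
 * before a context switch happens, reducing switches per byte moved.
 * Illustrative only. */
#include <unistd.h>

static void reader(int fd, char *buf, size_t len)
{
        nice(20);               /* run the reader at the lowest priority */

        while (read(fd, buf, len) > 0)
                ;               /* consume until the writer closes the pipe */
}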
I won't bother posting actual numbers... if anybody is interested I
can mail raw results offline.
But again, it isn't such a great test because a higher number doesn't
really mean you'll do better with any real program, and optimising for
a higher number here could actually harm real programs.
pipe test v3 is also doing funny things with "simulating" real
accesses. It should generally write into the buffer before
write(2)ing it, and read from the buffer after read(2)ing it. Instead
it writes to the buffer after write(2) and after read(2). Also, it
should probably touch a significant number of the pages and
cachelines transferred in each case, rather than the one or two stores it
does right now. There are a lot of ways you can copy data around, so
even if you defeat page flipping (for small transfers), you
still don't know whether one method of copying data around is better than
another.
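To make that concrete, a more realistic pair of inner loops would look something like this sketch (hypothetical code, shown only to illustrate where the buffer should be touched; the function names are not from the benchmark):

/* Sketch: the writer fills the buffer before write(2), the reader
 * consumes the data after read(2), and every cacheline is touched
 * rather than one or two bytes. Not the benchmark's actual code. */
#include <unistd.h>

#define CACHELINE 64

static void writer_iter(int fd, char *buf, size_t len, int iter)
{
        /* produce the data first: touch every cacheline... */
        for (size_t off = 0; off < len; off += CACHELINE)
                buf[off] = (char)iter;
        /* ...then send it */
        (void)write(fd, buf, len);
}

static unsigned long reader_iter(int fd, char *buf, size_t len)
{
        unsigned long sum = 0;
        ssize_t n = read(fd, buf, len);

        /* consume what was received: read every cacheline of the new data */
        for (ssize_t off = 0; off < n; off += CACHELINE)
                sum += (unsigned char)buf[off];
        return sum;
}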
Basically I will just reiterate what I said before: it is really
difficult to draw any conclusions from a test like this, and from the
numbers I see, you certainly can't say FreeBSD is faster than Linux.
If you want to run this kind of microbenchmark, something like lmbench
at least has been around for a long time and been reviewed (whether or
not it is any more meaningful, I don't know). Or do you have a real
workload that this pipe test simulates?
Thanks,
Nick