This is the first release that moves the pagetables into highmem. It only
compiles on x86 and it is still a bit experimental, though I haven't been
able to reproduce any problems yet. The new pte-highmem patch can be
downloaded from here:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.18pre4aa1/20_pte-highmem-6
The next relevant things to do are getting the non-x86 archs to compile,
and I'd like to sort out the vary-IO for rawio and the
hardblocksize-O_DIRECT patch.
URL:
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.18pre4aa1.bz2
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.18pre4aa1/
Diff between 2.4.18pre2aa2 and 2.4.18pre4aa1 follows:
Only in 2.4.18pre2aa2: 00_3.5G-address-space-2
Only in 2.4.18pre4aa1/: 00_3.5G-address-space-3
Merge 1-2-3 GB option.
Only in 2.4.18pre4aa1/: 00_access_process_vm-1
Fix oops in access_process_vm (get_area_pages will
set the page pointer to NULL on non-ram maps).
Only in 2.4.18pre4aa1/: 00_allow_mixed_b_size-1
This is the groundwork for the O_DIRECT-hardblocksize
patch, and for the IOvary patch for rawio.
In short this prevents the merging of different b_size
in the same request at the blkdev layer. After I mentioned
this Jens immediately sent me a patch and here it is.
So now I'd suggest to drop the varyIO thing which shouldn't
be necessary any longer, and to port the rawio-large-bsize
patch, and the O_DIRECT hardblocksize patches on top of
my current tree. I'd like to include both. Of course the
O_DIRECT-hardblocksize patch can also take advantage of
the large-b_size improvement to brw_kiovec during large
requests hardblocksize aligned. At least unless we want
to change the alignment requirements, in such a case
the varyIO info would be still valuable.
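To make the merge rule concrete, here is a toy model in Python of "don't merge different b_size into the same request". It is only a sketch of the idea described above, not the kernel code and not Jens's patch: Buffer, Request, can_back_merge and queue_buffer are names made up for this illustration, and only back merges of contiguous buffers are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Buffer:
    sector: int  # starting sector, in 512-byte units
    b_size: int  # buffer size in bytes


@dataclass
class Request:
    buffers: list = field(default_factory=list)

    def can_back_merge(self, buf: Buffer) -> bool:
        """Allow a merge only if buf is contiguous and has the same b_size."""
        if not self.buffers:
            return True
        last = self.buffers[-1]
        contiguous = last.sector + last.b_size // 512 == buf.sector
        return contiguous and last.b_size == buf.b_size


def queue_buffer(queue: list, buf: Buffer) -> None:
    """Append buf to the last request if allowed, else start a new request."""
    if queue and queue[-1].can_back_merge(buf):
        queue[-1].buffers.append(buf)
    else:
        queue.append(Request([buf]))


# Hypothetical I/O stream: two 4k buffers followed by two 512-byte buffers,
# all contiguous on disk.
queue = []
for buf in (Buffer(0, 4096), Buffer(8, 4096), Buffer(16, 512), Buffer(17, 512)):
    queue_buffer(queue, buf)

print([[b.b_size for b in r.buffers] for r in queue])
# prints [[4096, 4096], [512, 512]]
```

Even though the third buffer is contiguous with the second, it starts a new request because its b_size differs, so every request ends up with a single b_size.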
About the O_DIRECT-hardblocksize patch there's also another problem,
though: if get_block says that the buffer is new (buffer_new()), then
the whole "soft" block must be cleared out, unless the write itself
completely overwrites it. I just fixed similar bugs in the
presence of I/O errors or ENOSPC with O_DIRECT, and I don't want to
reintroduce the very same problem while adding a new feature. The
buffer_new() path is a very slow path from the DB usage point of view,
so it's perfectly fine there to just write out the zero page (or
something like that) on the surrounding blocks in a synchronous manner etc.
Only in 2.4.18pre4aa1/: 00_icmp-offset-1
Remote security fix from Andi (see bugtraq).
Only in 2.4.18pre4aa1/: 00_init-blk-freelist-1
The requests' cmd field wasn't initialized when they were first queued into
the blkdev, so if dequeued and then re-enqueued without being used, they
could get unbalanced. Now it is always initialized during get_request, so it
certainly works right.
Only in 2.4.18pre2aa2: 00_msync-ret-1
Only in 2.4.18pre2aa2: 00_page-cache-release-1
Only in 2.4.18pre2aa2: 00_ramdisk-buffercache-2
Only in 2.4.18pre2aa2: 00_truncate-garbage-1
Merged in mainline.
Only in 2.4.18pre2aa2: 00_vmalloc-tlb-flush-1
Merged into mainline (modulo Jeff having implemented pagetable
walking/tlb misses into uml that doesn't assume the tlb
flush [ouch, right Andrew, tlb invalidate :) ] comes first).
Only in 2.4.18pre2aa2: 00_nfs-2.4.17-cto-1
Only in 2.4.18pre4aa1/: 00_nfs-2.4.17-cto-2
Only in 2.4.18pre2aa2: 00_nfs-bkl-1
Only in 2.4.18pre4aa1/: 00_nfs-bkl-2
Only in 2.4.18pre2aa2: 00_nfs-rdplus-1
Only in 2.4.18pre4aa1/: 00_nfs-rdplus-2
Only in 2.4.18pre2aa2: 00_nfs-svc_tcp-1
Only in 2.4.18pre4aa1/: 00_nfs-svc_tcp-2
Only in 2.4.18pre2aa2: 00_nfs-tcp-tweaks-1
Only in 2.4.18pre4aa1/: 00_nfs-tcp-tweaks-2
Only in 2.4.18pre4aa1/: 10_nfs-o_direct-1
New NFS updates from Trond.
Only in 2.4.18pre2aa2: 00_rwsem-fair-25
Only in 2.4.18pre2aa2: 00_rwsem-fair-25-recursive-7
Only in 2.4.18pre4aa1/: 00_rwsem-fair-26
Only in 2.4.18pre4aa1/: 00_rwsem-fair-26-recursive-7
Rediffed.
Only in 2.4.18pre4aa1/: 00_waitfor-one-page-1
Export complaining symbol.
Only in 2.4.18pre2aa2: 10_vm-22
Only in 2.4.18pre4aa1/: 10_vm-23
Minor changes (try to always do some relevant work during the
refiling).
Only in 2.4.18pre4aa1/: 20_pte-highmem-6
First "working" version of the pte-highmem patch, this fixes (or at
least "should fix" :) lots of bugs. pte_offset_lowmem is still there
because kmap doesn't yet work by the time pte_offset_lowmem is
called. Lots of fixes, special thanks to Hugh, Linus and others for
the review and the feedback! All drivers should be updated. Works
for me so far.
Only in 2.4.18pre2aa2: 30_dyn-sched-2
Only in 2.4.18pre4aa1/: 30_dyn-sched-3
Minor changes, volatile would be needed only to avoid confusing
gcc, but nobody cares about variables changing under gcc anyways so
let's remove it so it will be a little faster.
Only in 2.4.18pre2aa2: 50_uml-patch-2.4.17-4.bz2
Only in 2.4.18pre4aa1/: 50_uml-patch-2.4.17-7.bz2
Latest update from Jeff (hopefully vmalloc works even though it doesn't
start with the tlb invalidate).
Only in 2.4.18pre4aa1/: 60_show-stack-1
Export symbol, so CONFIG_TUX_DEBUG has a chance to generate a loadable
kernel module.
Only in 2.4.18pre2aa2: 60_tux-vfs-4
Only in 2.4.18pre4aa1/: 60_tux-vfs-5
Rediffed.
Andrea
On Tue, 2002-01-22 at 01:48, Andrea Arcangeli wrote:
> Only in 2.4.18pre4aa1/: 00_icmp-offset-1
>
> Remote security fix from Andi (see bugtraq).
Are we sure this works? I thought I saw someone (IRC perhaps?) who had
weird anomalies with this fix (although it does certainly fix the hole).
> Only in 2.4.18pre2aa2: 10_vm-22
> Only in 2.4.18pre4aa1/: 10_vm-23
>
> Minor changes (try to always do some relevant work during the
> refiling).
When will we see this in 2.4 stock? ;-)
I know you have said you are busy, but it would be great to get the bits
pushed to Marcelo in reasonable, documented chunks so he can merge
them...
Also, these should be pushed to Linus, too. Same VM in 2.5, after all.
Robert Love
No weird anomalies here. I believe the ones you refer to were a result
of ipv6 bits not being updated as well. Russell posted two patches for
those.
http://marc.theaimsgroup.com/?l=linux-kernel&m=101164602428323&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=101164602428401&w=2
On Tue, Jan 22, 2002 at 01:58:58AM -0500, Robert Love wrote:
> > Only in 2.4.18pre4aa1/: 00_icmp-offset-1
> >
> > Remote security fix from Andi (see bugtraq).
>
> Are we sure this works? I thought I saw someone (IRC perhaps?) who had
> weird anomalies with this fix (although it does certainly fix the hole).
--
Dan Chen [email protected]
GPG key: http://www.unc.edu/~crimsun/pubkey.gpg.asc
On Tue, 2002-01-22 at 02:37, Dan Chen wrote:
> No weird anomalies here. I believe the ones you refer to were a result
> of ipv6 bits not being updated as well. Russell posted two patches for
> those.
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=101164602428323&w=2
> http://marc.theaimsgroup.com/?l=linux-kernel&m=101164602428401&w=2
Maybe, although I seem to recall odd ICMP behavior being the problem.
Although I don't think the above is in -aa. Andrea, perhaps this too
should be merged?
Ideally this will all show up in 2.4-proper soon, anyhow.
Robert Love
On Tue, Jan 22, 2002 at 02:37:42AM -0500, Dan Chen wrote:
> No weird anomalies here. I believe the ones you refer to were a result
> of ipv6 bits not being updated as well. Russell posted two patches for
> those.
No - I do see weirdness in ipv4 as well:
bash-2.04# uptime
10:00am up 18:57, 1 user, load average: 0.02, 0.03, 0.00
bash-2.04# dmesg|grep 'broad'
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
127.0.0.1 sent an invalid ICMP error to a broadcast.
Only one of these happened on boot. The rest randomly pop up over time.
I'm going to try tcpdumping lo to see if I can work out what's causing
them.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Tue, 2002-01-22 at 05:02, Russell King wrote:
> On Tue, Jan 22, 2002 at 02:37:42AM -0500, Dan Chen wrote:
> > No weird anomalies here. I believe the ones you refer to were a result
> > of ipv6 bits not being updated as well. Russell posted two patches for
> > those.
>
> No - I do see weirdness in ipv4 as well:
OK, this is the anomaly I spoke of. Weird ICMP errors. I've seen
others with this problem.
I don't think we have a proper solution here.
> bash-2.04# uptime
> 10:00am up 18:57, 1 user, load average: 0.02, 0.03, 0.00
> bash-2.04# dmesg|grep 'broad'
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
> 127.0.0.1 sent an invalid ICMP error to a broadcast.
>
> Only one of these happened on boot. The rest randomly pop up over time.
> I'm going to try tcpdumping lo to see if I can work out what's causing
> them.
Robert Love
Changelog with history at:
http://home.earthlink.net/~rwhron/kernel/2.4.18pre4aa1.html
Benchmarks on 2.4.18pre4aa1 and lots of other kernels at:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
--
Randy Hron
On January 24, 2002 06:23 am, [email protected] wrote:
> Benchmarks on 2.4.18pre4aa1 and lots of other kernels at:
> http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
"dbench 64, 128, 192 on ext2fs. dbench may not be the best I/O benchmark,
but it does create a high load, and may put some pressure on the cpu and
i/o schedulers. Each dbench process creates about 21 megabytes worth of
files, so disk usage is 1.3 GB, 2.6 GB and 4.0 GB for the dbench runs. Big
enough so the tests cannot run from the buffer/page caches on this box."
Thanks kindly for the testing, but please don't use dbench any more for
benchmarks. If you are testing stability, fine, but dbench throughput
numbers are not good for much more than wild goose chases.
Even when mostly uncached, dbench still produces flaky results.
--
Daniel
On Thu, Jan 24, 2002 at 07:27:43AM +0100, Daniel Phillips wrote:
> On January 24, 2002 06:23 am, [email protected] wrote:
> > Benchmarks on 2.4.18pre4aa1 and lots of other kernels at:
> > http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
>
> "dbench 64, 128, 192 on ext2fs. dbench may not be the best I/O benchmark,
> but it does create a high load, and may put some pressure on the cpu and
> i/o schedulers. Each dbench process creates about 21 megabytes worth of
> files, so disk usage is 1.3 GB, 2.6 GB and 4.0 GB for the dbench runs. Big
> enough so the tests cannot run from the buffer/page caches on this box."
>
> Thanks kindly for the testing, but please don't use dbench any more for
> benchmarks. If you are testing stability, fine, but dbench throughput
> numbers are not good for much more than wild goose chases.
>
> Even when mostly uncached, dbench still produces flaky results.
This is not entirely true. dbench has a value. The only problem with
dbench is that you can trivially cheat and change the kernel in a way
that is broken but optimal _only_ for dbench, just to get stellar dbench
numbers. That is definitely not the case with the -aa tree: it is
definitely not optimized for dbench. In fact the recent improvement most
probably comes from dyn-sched and the introduction of bdflush hysteresis,
not from vm changes at all (there were no recent significant vm changes in
the page replacement and aging algorithms). rmap instead sucks in
most of the benchmarks because of the noticeable overhead of maintaining
the reverse maps, which start to help only by the time you need to
swap/pageout (they are pure overhead for number crunching,
database self-caching etc.). This is the only issue with the rmap design
and you can definitely see it in the numbers. Here I'm only speaking
about the design; I never checked the current implementation.
Andrea
On Thu, Jan 24, 2002 at 12:23:42AM -0500, [email protected] wrote:
> Changelog with history at:
> http://home.earthlink.net/~rwhron/kernel/2.4.18pre4aa1.html
>
> Benchmarks on 2.4.18pre4aa1 and lots of other kernels at:
> http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
Randy, I will reiterate the obvious, but your reliable and impartial
performance feedback is extremely helpful. Thanks,
Keep up the good work :),
Andrea
> > http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
>
> Even when mostly uncached, dbench still produces flaky results.
dbench results are not perfectly repeatable. I agree that dbench
results that vary by 20% or so may not be meaningful. I think
dbench is of some value though. In some cases the difference
between kernels is 200% or more.
Below are results from a couple of aa releases, and a few rmap
releases. Some of the tests were run twice. You can see that
there is some variation between "identical" runs. You can see
that aa kernels do extremely well with large numbers of processes,
and as the number of processes increases from 64 -> 128 -> 192,
the throughput drops in a predictable way.
rmap, when compared with most other kernels does well with 64 processes.
At 192, rmap doesn't do as well. That may be useful information for the
people developing rmap.
dbench 64 processes
2.4.18pre4aa1 ************************************************** 25.2 MB/sec
2.4.18pre2aa2 ******************************************** 22.2 MB/sec
2.4.17rmap11a **************************** 14.2 MB/sec
2.4.17rmap11a *************************** 13.9 MB/sec
2.4.17rmap12a *************************** 13.7 MB/sec
2.4.18pre3rmap11b ********************** 11.4 MB/sec
2.4.17rmap11c ********************* 10.8 MB/sec
2.4.17rmap11c ********************* 10.6 MB/sec
2.4.17rmap11b ******************* 9.7 MB/sec
dbench 128 processes
2.4.18pre4aa1 ******************************** 16.4 MB/sec
2.4.18pre2aa2 ******************************** 16.3 MB/sec
2.4.18pre2aa2 ***************************** 14.9 MB/sec
2.4.17rmap11a ************ 6.1 MB/sec
2.4.17rmap11a ************ 6.1 MB/sec
2.4.18pre3rmap11b ********** 5.1 MB/sec
2.4.17rmap11b ********* 5.0 MB/sec
2.4.17rmap12a ********* 4.5 MB/sec
2.4.17rmap11c ******** 4.2 MB/sec
2.4.17rmap11c ******** 4.2 MB/sec
dbench 192 processes
2.4.18pre2aa2 ***************** 8.8 MB/sec
2.4.18pre4aa1 **************** 8.2 MB/sec
2.4.18pre2aa2 *************** 7.7 MB/sec
2.4.17rmap11a ******** 4.4 MB/sec
2.4.17rmap11a ******** 4.3 MB/sec
2.4.18pre3rmap11b ******* 3.8 MB/sec
2.4.17rmap11b ******* 3.8 MB/sec
2.4.17rmap12a ****** 3.1 MB/sec
2.4.17rmap11c ***** 3.0 MB/sec
2.4.17rmap11c ***** 2.9 MB/sec
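One rough way to put a number on the variation between "identical" runs is to compare the best and worst result for each kernel that was tested twice. This is only a sketch: the figures are copied from the tables above, and spread_percent is a name picked for this example, not part of dbench.

```python
# Repeated dbench runs (MB/sec), copied from the tables above.
repeated_runs = {
    ("2.4.17rmap11a", 64): [14.2, 13.9],
    ("2.4.17rmap11c", 64): [10.8, 10.6],
    ("2.4.18pre2aa2", 128): [16.3, 14.9],
    ("2.4.18pre2aa2", 192): [8.8, 7.7],
    ("2.4.17rmap11a", 192): [4.4, 4.3],
}


def spread_percent(samples):
    """Difference between best and worst run, as a percentage of the worst."""
    return 100.0 * (max(samples) - min(samples)) / min(samples)


for (kernel, procs), runs in repeated_runs.items():
    print(f"{kernel:15s} dbench {procs:3d}: {spread_percent(runs):4.1f}% spread")
```

With only two samples per kernel this says little about the true variance, but it gives a yardstick for deciding which differences between kernels are large enough to matter.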
On the other hand, rmap does very well with sequential reads
on tiobench, which is running a lot fewer processes than dbench.
Read, Write, and Seeks are MB/sec
                Num  Seq Read       Rand Read      Seq Write      Rand Write
                Thr    Rate (CPU%)    Rate (CPU%)    Rate (CPU%)    Rate (CPU%)
                ---  -------------  -------------  -------------  -------------
2.4.17rmap12a     1   22.85  32.2%    1.15   2.2%   13.10  83.5%    0.71   1.6%
2.4.18pre2aa2     1   11.96  23.1%    2.24   3.2%   12.90  76.8%    0.71   1.6%
2.4.18pre4aa1     1   11.23  21.3%    3.12   4.8%   11.92  66.1%    0.66   1.3%
2.4.17rmap12a     2   22.07  32.1%    1.20   2.2%   12.84  80.4%    0.71   1.6%
2.4.18pre2aa2     2   11.09  22.0%    2.57   3.2%   13.10  76.3%    0.70   1.6%
2.4.18pre4aa1     2   10.68  20.9%    3.39   4.4%   12.14  67.9%    0.67   1.3%
2.4.17rmap12a     4   21.75  32.0%    1.20   2.2%   12.69  78.5%    0.71   1.6%
2.4.18pre2aa2     4   10.52  21.1%    2.82   3.6%   12.84  73.9%    0.69   1.5%
2.4.18pre4aa1     4   10.48  20.4%    3.56   4.2%   12.28  69.0%    0.67   1.4%
2.4.17rmap12a     8   21.34  31.8%    1.23   2.3%   12.57  77.3%    0.71   1.7%
2.4.18pre2aa2     8   10.24  19.5%    3.01   4.0%   12.94  74.1%    0.70   1.6%
2.4.18pre4aa1     8   10.08  18.9%    3.63   4.5%   12.24  68.8%    0.67   1.4%
I added bonnie++ to the list of tests a day or so ago.
I'll begin putting those results up in the near future.
--
Randy Hron
On Thu, 24 Jan 2002 [email protected] wrote:
> > > http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
> >
> > Even when mostly uncached, dbench still produces flaky results.
> Below are results from a couple of aa releases, and a few rmap
> releases.
[snip results: -aa twice as fast as -rmap for dbench,
-rmap twice as fast as -aa for tiobench]
What would be interesting here are the dbench dots, where
a '+' indicates that a program exits.
It's possible that under one of the kernels the programs
are getting throttled differently and some of the dbench
processes exit _way_ earlier than the others, leaving a
much lighter load on the rest of the system for the second
part of the test.
It would be interesting to see the dbench dots from both
-aa and -rmap ;)
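A quick way to compare the dots from two kernels would be to look at how far through the output each '+' shows up. The snippet below is a minimal sketch that assumes only what is said above (one '+' per exiting process, '.' otherwise); the two sample strings are invented for illustration and are not real dbench output.

```python
def exit_positions(dots: str) -> list:
    """Return, for each '+', how far through the output it appears (0.0 to 1.0)."""
    marks = [c for c in dots if c in ".+"]  # ignore whitespace and newlines
    total = len(marks)
    return [(i + 1) / total for i, c in enumerate(marks) if c == "+"]


# Made-up samples, NOT real dbench output.
early_exits = "....+...+....+......"
late_exits = "................++++"

for name, sample in (("early", early_exits), ("late", late_exits)):
    print(name, [f"{p:.0%}" for p in exit_positions(sample)])
```

If one kernel shows exits spread over the whole run and the other shows them bunched at the end, the load during the second half of the test was quite different.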
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
> [snip results: -aa twice as fast as -rmap for dbench,
> -rmap twice as fast as -aa for tiobench]
Look closely at all the numbers:
dbench 64 128 192 on ext completed in 4500 seconds on 2.4.18pre4aa1
dbench 64 128 192 on ext completed in 12471 seconds on 2.4.17rmap12a
2.4.18pre4aa1 completed the three dbenches 2.77 times as fast (177% faster).
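For anyone who wants to check the arithmetic, the comparison is just the ratio of the two elapsed times. A minimal sketch using only the two totals quoted above (the function name is made up for the example):

```python
def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times as fast the quicker run was."""
    return slow_seconds / fast_seconds


ratio = speedup(12471, 4500)
print(f"{ratio:.2f} times as fast, i.e. {100 * (ratio - 1):.0f}% faster")
```

A ratio of about 2.77 means the run took roughly 36% of the time; "percent faster" in the strict sense is the ratio minus one.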
For tiobench:
Tiobench is interesting because it has the CPU% column. I mentioned
sequential reads because it's a bench where 2.4.17rmap12a was faster.
Someone else might say 2.4.18pre4aa1 was 2.71 times as fast at random reads.
Let's analyze CPU efficiency where threads = 1:
                Num  Seq Read       Rand Read      Seq Write      Rand Write
                Thr    Rate (CPU%)    Rate (CPU%)    Rate (CPU%)    Rate (CPU%)
                ---  -------------  -------------  -------------  -------------
2.4.17rmap12a     1   22.85  32.2%    1.15   2.2%   13.10  83.5%    0.71   1.6%
2.4.18pre4aa1     1   11.23  21.3%    3.12   4.8%   11.92  66.1%    0.66   1.3%
Sequential Read CPU Efficiency
2.4.18pre4aa1 11.23 / .213 = 52.723
2.4.17rmap12a 22.85 / .322 = 70.963
2.4.17rmap12a was 35% more CPU efficient.
Random Read CPU Efficiency
2.4.18pre4aa1 3.12 / .048 = 65.000
2.4.17rmap12a 1.15 / .022 = 52.273
2.4.18pre4aa1 was 24% more CPU efficient.
Sequential Write CPU Efficiency
2.4.18pre4aa1 11.92 / .661 = 18.033
2.4.17rmap12a 13.10 / .835 = 15.689
2.4.18pre4aa1 was 15% more CPU efficient.
Random Write CPU Efficiency
2.4.18pre4aa1 0.66 / .013 = 50.769
2.4.17rmap12a 0.71 / .016 = 44.375
2.4.18pre4aa1 was 14% more CPU efficient.
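The same CPU efficiency figures can be reproduced with a few lines of Python. This is just a sketch of the calculation used above (rate divided by CPU fraction); the numbers are the threads = 1 rows of the tiobench table, and the dictionary layout and the efficiency name are my own choices for the example.

```python
# (rate in MB/sec, CPU%) for threads = 1, copied from the tiobench table above.
results = {
    "2.4.17rmap12a": {"seq_read": (22.85, 32.2), "rand_read": (1.15, 2.2),
                      "seq_write": (13.10, 83.5), "rand_write": (0.71, 1.6)},
    "2.4.18pre4aa1": {"seq_read": (11.23, 21.3), "rand_read": (3.12, 4.8),
                      "seq_write": (11.92, 66.1), "rand_write": (0.66, 1.3)},
}


def efficiency(rate_mb_s, cpu_percent):
    """MB/sec delivered per unit of CPU (rate divided by CPU fraction)."""
    return rate_mb_s / (cpu_percent / 100.0)


for test in ("seq_read", "rand_read", "seq_write", "rand_write"):
    rmap = efficiency(*results["2.4.17rmap12a"][test])
    aa = efficiency(*results["2.4.18pre4aa1"][test])
    if aa > rmap:
        winner, ratio = "2.4.18pre4aa1", aa / rmap
    else:
        winner, ratio = "2.4.17rmap12a", rmap / aa
    print(f"{test:10s} aa={aa:7.3f} rmap={rmap:7.3f} "
          f"{winner} ahead by {100 * (ratio - 1):.0f}%")
```

Swapping in the rows for 2, 4 or 8 threads only requires changing the tuples in the dictionary.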
> It would be interesting to see the dbench dots from both
> -aa and -rmap ;)
All the dots are at:
http://home.earthlink.net/~rwhron/kernel/dots/
--
Randy Hron
On Thu, 24 Jan 2002 [email protected] wrote:
> > It would be interesting to see the dbench dots from both
> > -aa and -rmap ;)
>
> All the dots are at:
> http://home.earthlink.net/~rwhron/kernel/dots/
I think we have an explanation here.
With dbench 192 on -aa the first processes exit around
halfway through the dbench test and around the end only
a few processes are left.
With rmap the write throttling is a bit smoother, but
this results in all processes running to about 70% through
the test and many more processes running at the last part
of the test, exiting simultaneously.
Considering the possible bad consequences for real
workloads, I'm not sure I want to make the system more
unfair just to better accommodate dbench ;)
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
> workloads, I'm not sure I want to make the system more
> unfair just to better accommodate dbench ;)
I'm wondering if rmap is a little too aggressive on
read-ahead, and if that has a negative impact on
a complex workload.
--
Randy Hron
On Thu, 24 Jan 2002 [email protected] wrote:
> > workloads, I'm not sure I want to make the system more
> > unfair just to better accommodate dbench ;)
>
> I'm wondering if rmap is a little too aggressive on
> read-ahead, and if that has a negative impact on
> a complex workload.
I haven't changed the readahead code one bit compared
to 2.4 mainline, but I'm wondering the same.
Fixing readahead window sizing has been on my TODO list
for quite a while already.
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Fri, Jan 25, 2002 at 02:57:02AM -0200, Rik van Riel wrote:
> On Thu, 24 Jan 2002 [email protected] wrote:
>
> > > workloads, I'm not sure I want to make the system more
> > > unfair just to better accommodate dbench ;)
> >
> > I'm wondering if rmap is a little too aggressive on
> > read-ahead, and if that has a negative impact on
> > a complex workload.
>
> I haven't changed the readahead code one bit compared
> to 2.4 mainline, but I'm wondering the same.
>
> Fixing readahead window sizing has been on my TODO list
> for quite a while already.
One thing that struck me about this: don't both the rmap patches and
the aa patches contain changes other than merely changes to the VM? If
so, couldn't these changes tip the result in an unfair direction?! After
all, what we want is a VM-to-VM shoot-out, not a VM-to-VM+whatever
shoot-out. After all, one would assume that the non VM-related changes
would be merged to the kernel no matter what VM is used, right?
Then again, maybe I just ate the blue pill and returned to a world of
illusions not knowing what's best for me.
Regards: David Weinehall
_ _
// David Weinehall <[email protected]> /> Northern lights wander \\
// Maintainer of the v2.0 kernel // Dance across the winter sky //
\> http://www.acc.umu.se/~tao/ </ Full colour fire </
On Fri, Jan 25, 2002 at 01:35:08AM -0200, Rik van Riel wrote:
> Considering the possible bad consequences for real
> workloads, I'm not sure I want to make the system more
> unfair just to better accommodate dbench ;)
It may be useful if Randy can throw a real world test
into the benchmarking, to get a better comparison of
the various systems. The obvious one that springs to mind
would be something like compilation of a large source tree
(kernel/mozilla/etc., same version, same config options
every time). Though, as compilation is largely compute bound
rather than IO bound, the more small files that need to be
read/generated the better.
Or maybe timing an updatedb. It's real-world enough in that it's
a daily task that generates lots of IO.
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
> it may be useful if Randy can throw a real world test
> into the benchmarking, to get a better comparison of
> the various systems. The obvious one that springs to mind
> would be something like compilation of a large source tree
Thanks for the feedback.
2.5.2-dj5 wins the lucky "first-timer" award on the new tests.
Extract/configure/make/check autoconf-2.52:
Executes over 100000 processes and creates a lot of small
temporary files. Won't hit the disk much on this box.
Extract/Configure/make/test perl-5.6.1:
For perl, "make test" is executed 5 times. "make test" is about
75% system and 25% user, which may provide more variation between
kernel versions.
> Or maybe timing an updatedb. It's real-world enough in that it's
> a daily task that generates lots of IO.
I'll time updatedb too. updatedb may vary over time, depending
on how many src trees are extracted. I'll make an effort to
keep that variable consistent.
--
Randy Hron
On Fri, 25 Jan 2002, David Weinehall wrote:
> One thing that struck me about this: don't both the rmap patches and
> the aa patches contain changes other than merely changes to the VM? If
> so, couldn't these changes tip the result in an unfair direction?! After
> all, what we want is a VM-to-VM shoot-out, not a VM-to-VM+whatever
> shoot-out. After all, one would assume that the non VM-related changes
> would be merged to the kernel no matter what VM is used, right?
The -aa kernel seems to contain patches to a few dozen subsystems.
The -rmap patch is pretty much only VM changes.
You're right that this is not a strict VM vs VM comparison...
kind regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
On Fri, Jan 25, 2002 at 03:03:16PM -0200, Rik van Riel wrote:
> The -aa kernel seems to contain patches to a few dozen subsystems.
> The -rmap patch is pretty much only VM changes.
> You're right that this is not a strict VM vs VM comparison...
Agreed. Andrea's tree seemed to gain quite a bit of a lead
when bits of the lowlat patches were applied, for example.
Just taking 00_vm_?? from ../people/andrea/.. would give better
comparison for a head to head vm pissing contest.
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
On Thu, Jan 24, 2002 at 10:23:57PM -0500, [email protected] wrote:
> > [snip results: -aa twice as fast as -rmap for dbench,
> > -rmap twice as fast as -aa for tiobench]
>
> Look closely at all the numbers:
>
> dbench 64 128 192 on ext completed in 4500 seconds on 2.4.18pre4aa1
> dbench 64 128 192 on ext completed in 12471 seconds on 2.4.17rmap12a
>
> 2.4.18pre4aa1 completed the three dbenches 2.77 times as fast (177% faster).
>
> For tiobench:
>
> Tiobench is interesting because it has the CPU% column. I mentioned
> sequential reads because it's a bench where 2.4.17rmap12a was faster.
> Someone else might say 2.4.18pre4aa1 was 2.71 times as fast at random reads.
> Let's analyze CPU efficiency where threads = 1:
>
> Num Seq Read Rand Read Seq Write Rand Write
> Thr Rate (CPU%) Rate (CPU%) Rate (CPU%) Rate (CPU%)
> --- ------------- ----------- ------------- -----------
> 2.4.17rmap12a 1 22.85 32.2% 1.15 2.2% 13.10 83.5% 0.71 1.6%
> 2.4.18pre4aa1 1 11.23 21.3% 3.12 4.8% 11.92 66.1% 0.66 1.3%
Those weird numbers generated by rmap12a on tiobench show that the page
replacement algorithm in rmap is not able to detect cache pollution: it
leaves the pollution in cache rather than discarding it, so later reads
are served from cache instead of from disk.
Since tiobench is an I/O benchmark, the above is a completely fake result;
seq read I/O is not going to be faster with rmap. If you change tiobench
to remount the fs where the output files are generated between the
"random write" and the "seq read" tests, you should get comparable
numbers.
I don't consider it a good thing that rmap12a leaves old pollution in the
caches; that seems to prove it will do the wrong thing when the most
recently used data is part of the working set (for example, after you do
the first cvs checkout you want the second checkout not to hit the disk,
but this page replacement in rmap12a should hit the disk the second time
too).
In some ways tiobench has the same problems as dbench. A broken page
replacement algorithm can generate stellar numbers in both
benchmarks.
Furthermore, running the 'seq read' after the 'random write' (as tiobench
does) adds even more randomness to the output of the 'seq read', because
the 'random read' and 'random write' tests are not comparable in the first
place: the random seed is set up differently every time. Also, to make a
real 'seq read' test, the 'seq read' should be run after the 'seq
write', not after the 'random write' (even assuming the random seed is
always initialized to the same value).
Andrea
On January 25, 2002 01:09 am, Andrea Arcangeli wrote:
> On Thu, Jan 24, 2002 at 07:27:43AM +0100, Daniel Phillips wrote:
> > On January 24, 2002 06:23 am, [email protected] wrote:
> > > Benchmarks on 2.4.18pre4aa1 and lots of other kernels at:
> > > http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
> >
> > "dbench 64, 128, 192 on ext2fs. dbench may not be the best I/O
> > benchmark, but it does create a high load, and may put some pressure on
> > the cpu and i/o schedulers. Each dbench process creates about 21
> > megabytes worth of files, so disk usage is 1.3 GB, 2.6 GB and 4.0 GB
> > for the dbench runs. Big enough so the tests cannot run from the
> > buffer/page caches on this box."
> >
> > Thanks kindly for the testing, but please don't use dbench any more for
> > benchmarks. If you are testing stability, fine, but dbench throughput
> > numbers are not good for much more than wild goose chases.
> >
> > Even when mostly uncached, dbench still produces flaky results.
>
> this is not entirely true. dbench has a value.
Yes, but not for benchmarks. It has value only as a stability test - while
it may in some cases provide some general indication of performance, its
variance is far too large, even under controlled conditions, for it to have
any value as a benchmark. I'm surprised you'd even suggest this.
Andrea, please, if we want good benchmarks let's at least be clear on what
tools benchmarkers should/should not be using.
> the only problem with
> dbench is that you can trivially cheat and change the kernel in a broken
> way, but optimal _only_ for dbench, just to get stellar dbench numbers,
No, this is not the only problem. DBench is just plain *flaky*. You don't
appear to be clear on why. In short, dbench has two main flaws:
- It's extremely sensitive to scheduling. If one process happens to make
progress then it gets more heavily cached and its progress becomes even
greater. The benchmark completes much more quickly in this case, whereas
if all processes progress at nearly the same rate (by chance) it runs
more slowly.
- It can happen (again by chance) that dbench files get deleted while still
in cache, and this process completes in a fraction of the time that real
disk IO would require.
I've seen successive runs of dbench *under identical conditions* (that is,
from a clean reboot etc.) vary by as much as 30%. Others report even greater
variance. Can we please agree that dbench is useless for benchmarks?
--
Daniel
On Mon, Jan 28, 2002 at 10:53:25AM +0100, Daniel Phillips wrote:
> On January 25, 2002 01:09 am, Andrea Arcangeli wrote:
> > On Thu, Jan 24, 2002 at 07:27:43AM +0100, Daniel Phillips wrote:
> > > On January 24, 2002 06:23 am, [email protected] wrote:
> > > > Benchmarks on 2.4.18pre4aa1 and lots of other kernels at:
> > > > http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
> > >
> > > "dbench 64, 128, 192 on ext2fs. dbench may not be the best I/O
> > > benchmark, but it does create a high load, and may put some pressure on
> > > the cpu and i/o schedulers. Each dbench process creates about 21
> > > megabytes worth of files, so disk usage is 1.3 GB, 2.6 GB and 4.0 GB
> > > for the dbench runs. Big enough so the tests cannot run from the
> > > buffer/page caches on this box."
> > >
> > > Thanks kindly for the testing, but please don't use dbench any more for
> > > benchmarks. If you are testing stability, fine, but dbench throughput
> > > numbers are not good for much more than wild goose chases.
> > >
> > > Even when mostly uncached, dbench still produces flaky results.
> >
> > this is not entirely true. dbench has a value.
>
> Yes, but not for benchmarks. It has value only as a stability test - while
> it may in some cases provide some general indication of performance, its
> variance is far too large, even under controlled conditions, for it to have
> any value as a benchmark. I'm surprised you'd even suggest this.
>
> Andrea, please, if we want good benchmarks let's at least be clear on what
> tools benchmarkers should/should not be using.
>
> > the only problem with
> > dbench is that you can trivially cheat and change the kernel in a broken
> > way, but optimal _only_ for dbench, just to get stellar dbench numbers,
>
> No, this is not the only problem. DBench is just plain *flaky*. You don't
> appear to be clear on why. In short, dbench has two main flaws:
>
> - It's extremely sensitive to scheduling. If one process happens to make
> progress then it gets more heavily cached and its progress becomes even
> greater. The benchmark completes much more quickly in this case, whereas
> if all processes progress at nearly the same rate (by chance) it runs
> more slowly.
>
> - It can happen (again by chance) that dbench files get deleted while still
> in cache, and this process completes in a fraction of the time that real
> disk IO would require.
>
> I've seen successive runs of dbench *under identical conditions* (that is,
> from a clean reboot etc.) vary by as much as 30%. Others report even greater
> variance. Can we please agree that dbench is useless for benchmarks?
I've never seen it vary 30% on the same kernel.
Anyways dbench tells you mostly about the elevator etc... it's a good test
to check the elevator is working properly: the ++ must be mixed with the
dots etc. if the elevator is aggressive enough. Of course that means
the elevator is not perfectly fair, but that's the whole point of
having an elevator. It is also an interesting test for page replacement,
but with page replacement it would be possible to write a broken
algorithm that produces good numbers; that's the thing I believe to be
bad about dbench (oh, like tiotest fake numbers too of course). Other
than this it just shows rmap12a has an elevator that is not aggressive
enough, which is probably true. I doubt it has anything to do with the VM
changes in rmap (of course the significant overhead of the rmap design is
helping to slow it down too), more likely it is the bomb_segments logic
from Andrew that Rik has included. In fact the broken page replacement that
leaves old stuff in cache should, if anything, generate more unfairness
and so faster dbench numbers for rmap, but on this last
bit I'm not 100% sure (AFAIK to get a fast dbench by cheating with the vm
you need to make sure to cache lots of the readahead as well (also the
part not used yet), but I'm not 100% sure about the effect of leaving old
pollution in cache rather than recycling it, I never attempted it).
Andrea
On January 28, 2002 04:29 pm, Andrea Arcangeli wrote:
> On Mon, Jan 28, 2002 at 10:53:25AM +0100, Daniel Phillips wrote:
> > On January 25, 2002 01:09 am, Andrea Arcangeli wrote:
> > > On Thu, Jan 24, 2002 at 07:27:43AM +0100, Daniel Phillips wrote:
> > > > On January 24, 2002 06:23 am, [email protected] wrote:
> > > > Even when mostly uncached, dbench still produces flaky results.
> > [...]
> > > the only problem with
> > > dbench is that you can trivially cheat and change the kernel in a broken
> > > way, but optimal _only_ for dbench, just to get stellar dbench numbers,
> >
> > No, this is not the only problem. DBench is just plain *flaky*. You don't
> > appear to be clear on why. In short, dbench has two main flaws:
> >
> > - It's extremely sensitive to scheduling. If one process happens to make
> > progress then it gets more heavily cached and its progress becomes even
> > greater. The benchmark completes much more quickly in this case, whereas
> > if all processes progress at nearly the same rate (by chance) it runs
> > more slowly.
> >
> > - It can happen (again by chance) that dbench files get deleted while still
> > in cache, and this process completes in a fraction of the time that real
> > disk IO would require.
> >
> > I've seen successive runs of dbench *under identical conditions* (that is,
> > from a clean reboot etc.) vary by as much as 30%. Others report even greater
> > variance. Can we please agree that dbench is useless for benchmarks?
>
> I never seen it to vary 30% on the same kernel.
Just ask around. Marcelo or Andrew Morton would be a good place to start.
> Anyways dbench tells you mostly about elevator etc... it's a good test
> to check the elevator is working properly, the ++ must be mixed with the
> dots etc... if the elevator is aggressive enough. Of course that means
> the elevator is not perfectly fair but that's the whole point about
> having an elevator. It is also an interesting test for page replacement,
> but with page replacement it would be possible to write a broken
> algorithm that produces good numbers, that's the thing I believe to be
> bad about dbench (oh, like tiotest fake numbers too of course). Other
> than this it just shows rmap12a has an elevator not aggressive enough
> which is probably true, I doubt it has anything to do with the VM
> changes in rmap (of course rmap design significant overhead is helping
> to slow it down too though), more likely the bomb_segments logic from
> Andrew that Rik has included, infact the broken page replacement that
> lefts old stuff in cache if something might generate more unfairness
> that should generate faster dbench numbers for rmap, but on this last
> bit I'm not 100% sure (AFIK to get a fast dbench by cheating with the vm
> you need to make sure to cache lots of the readahead as well (also the
> one not used yet), but I'm not 100% sure on the effect of lefting old
> pollution in cache rather than recycling it, I never attempted it).
Interesting analysis. It's a hint at how hard the elevator problem really
is. Fairness as in 'equal load distribution' is not the best policy under
heavy load, just as it is not the best policy under heavy swapping. Exactly
what kind of unfairness is best, though, is a deep, difficult question. I'll
bet it doesn't get seriously addressed even in this kernel cycle, or at best,
very late in the cycle after the big infrastructure changes settle down.
--
Daniel
On Mon, Jan 28, 2002 at 09:28:24PM +0100, Daniel Phillips wrote:
> Just ask around. Marcelo or Andrew Morton would be a good place to start.
ah, btw, if you test with a broken page replacement (kind of random),
it's normal that you get huge variations.
But with my -aa tree you should never get a significant difference (no
matter whether it's Marcelo or Andrew running the benchmark). I should also
say I always mke2fs first when I run my benchmarks, so I don't take
possible filesystem layout differences into account, but I doubt
(unless you're running into a corner case like running out of space or
stuff like that) that they would make a significant difference either.
> > Anyways dbench tells you mostly about elevator etc... it's a good test
> > to check the elevator is working properly, the ++ must be mixed with the
> > dots etc... if the elevator is aggressive enough. Of course that means
> > the elevator is not perfectly fair but that's the whole point about
> > having an elevator. It is also an interesting test for page replacement,
> > but with page replacement it would be possible to write a broken
> > algorithm that produces good numbers, that's the thing I believe to be
> > bad about dbench (oh, like tiotest fake numbers too of course). Other
> > than this it just shows rmap12a has an elevator not aggressive enough
> > which is probably true, I doubt it has anything to do with the VM
> > changes in rmap (of course rmap design significant overhead is helping
> > to slow it down too though), more likely the bomb_segments logic from
> > Andrew that Rik has included, infact the broken page replacement that
> > lefts old stuff in cache if something might generate more unfairness
> > that should generate faster dbench numbers for rmap, but on this last
> > bit I'm not 100% sure (AFIK to get a fast dbench by cheating with the vm
> > you need to make sure to cache lots of the readahead as well (also the
> > one not used yet), but I'm not 100% sure on the effect of lefting old
> > pollution in cache rather than recycling it, I never attempted it).
>
> Interesting analysis. It's a hint at how hard the elevator problem really
> is. Fairness as in 'equal load distribution' is not the best policy under
> heavy load, just as it is not the best policy under heavy swapping. Exactly
As always it depends on whether the objective is throughput or latency;
for dbench the objective is throughput.
Also the relation between throughput and latency is not linear, and it
depends on too many factors to find an elevator algorithm that works
well on paper.
So, given that, one vapourware idea I had while reading your
email is to use feedback from the output throughput to know
when it's worthwhile to decrease or increase the latency. If decreasing
latency doesn't decrease the final throughput, that means
we're ok to decrease latency even more. As soon as the throughput
decreases (despite people waiting on the submit_bh pipeline), we know
we'd better not decrease latency further, unless we want to hurt
performance.
The current elevator (not rmap) is always very permissive, so throughput
is ok in dbench (and anything seeking as hard as dbench), but latency
often sucks (actually in -aa I decreased the read latency so it's
acceptable, not like in mainline, but it's still far from being very
reactive under a write flood). Feedback from the output channel to
control the latency parameters dynamically may help to decrease
latency when possible (not unconditionally as with elvtune). One of the
things I love about analog electronics is the operational amplifier: a
feedback loop solves so many difficult problems so easily. Software can
do similar things lots of times. Anyways this is just vapourware
(probably quite complex to implement in a generic manner) but fixed
algorithms are not likely to give us a solution (we'll be either too
permissive or too slow in dbench), while this kind of feedback sounds
like something that may solve the problem dynamically, or maybe I'm
simply just dreaming :).
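The feedback idea above could be sketched roughly as follows. This is only
an illustration in user-space C, not code from any tree: struct
lat_feedback, lat_feedback_update() and all of their fields are
hypothetical names, the latency bound is in arbitrary units, and the
tolerance that decides what counts as a throughput drop is an assumption
that would need tuning.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical controller state; none of these names exist in the kernel. */
struct lat_feedback {
	uint32_t latency;	/* current latency bound, arbitrary units */
	uint32_t min_latency;	/* never tighten below this */
	uint32_t max_latency;	/* never relax above this */
	uint32_t step;		/* adjustment applied per sample */
	uint64_t ref_tput;	/* best throughput observed so far */
	uint32_t tolerance_pct;	/* drop (in %) still counted as "no loss" */
};

/*
 * Feed one throughput sample (e.g. sectors completed in the last interval)
 * and return the latency bound to use for the next interval.
 */
static uint32_t lat_feedback_update(struct lat_feedback *fb, uint64_t tput)
{
	uint64_t tput_floor;

	assert(fb->tolerance_pct <= 100);
	assert(fb->min_latency <= fb->latency);
	assert(fb->latency <= fb->max_latency);

	/* Lowest throughput still accepted relative to the best seen. */
	tput_floor = fb->ref_tput - fb->ref_tput / 100 * fb->tolerance_pct;

	if (tput >= tput_floor) {
		/* Throughput held up: tighten latency a bit more. */
		if (tput > fb->ref_tput)
			fb->ref_tput = tput;
		if (fb->latency - fb->min_latency > fb->step)
			fb->latency -= fb->step;
		else
			fb->latency = fb->min_latency;
	} else {
		/* Throughput dropped: back off to a more permissive bound. */
		if (fb->max_latency - fb->latency > fb->step)
			fb->latency += fb->step;
		else
			fb->latency = fb->max_latency;
	}
	return fb->latency;
}
```

With tolerance_pct = 5 and a reference throughput of 1000, a sample of 990
tightens the bound by one step while a sample of 900 relaxes it again. A
real version would also have to age ref_tput, otherwise a single burst of
high throughput pins the bound at max_latency.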
> what kind of unfairness is best, though, is a deep, difficult question. I'll
> bet it doesn't get seriously addressed even in this kernel cycle, or at best,
> very late in the cycle after the big infrastructure changes settle down.
>
> --
> Daniel
Andrea
On January 29, 2002 12:40 am, Andrea Arcangeli wrote:
> On Mon, Jan 28, 2002 at 09:28:24PM +0100, Daniel Phillips wrote:
> > Just ask around. Marcelo or Andrew Morton would be a good place to start.
>
> ah, btw, if you test with a broken page replacement (kind of random)
> it's normal you get huge variations.
>
> But with my -aa tree, you should never get a significant difference (no
> matter if it's Marcelo or Andrew to run the benchmark).
Oh, that's interesting, and actually I can see why that might be (feedback
in your VM is quite predictable, so it isn't prone to oscillation). It's
not just the VM that affects dbench's running pattern though, it's also
scheduling.
> I've also to say I always mke2fs first when I run my benchmarks,
Yes, and it would be nice if we had an operation to squeeze cache down to
its minimum size (whatever that means) just for running benchmarks
accurately without rebooting.
> so I don't consider
> possible filesystem layout differences into the equation but I doubt
> (unless you're running with a corner case like running out of space or
> stuff like that), that it will make a significant difference either.
> > > Anyways dbench tells you mostly about elevator etc... it's a good test
> > > to check the elevator is working properly, the ++ must be mixed with the
> > > dots etc... if the elevator is aggressive enough. Of course that means
> > > the elevator is not perfectly fair but that's the whole point about
> > > having an elevator. It is also an interesting test for page replacement,
> > > but with page replacement it would be possible to write a broken
> > > algorithm that produces good numbers, that's the thing I believe to be
> > > bad about dbench (oh, like tiotest fake numbers too of course). Other
> > > than this it just shows rmap12a has an elevator not aggressive enough
> > > which is probably true, I doubt it has anything to do with the VM
> > > changes in rmap (of course rmap design significant overhead is helping
> > > to slow it down too though), more likely the bomb_segments logic from
> > > Andrew that Rik has included, infact the broken page replacement that
> > > lefts old stuff in cache if something might generate more unfairness
> > > that should generate faster dbench numbers for rmap, but on this last
> > > bit I'm not 100% sure (AFIK to get a fast dbench by cheating with the vm
> > > you need to make sure to cache lots of the readahead as well (also the
> > > one not used yet), but I'm not 100% sure on the effect of lefting old
> > > pollution in cache rather than recycling it, I never attempted it).
> >
> > Interesting analysis. It's a hint at how hard the elevator problem really
> > is. Fairness as in 'equal load distribution' is not the best policy under
> > heavy load, just as it is not the best policy under heavy swapping. Exactly
>
> as always it depends if the object is throughput or latency, for dbench
> that's the object.
>
> Also the function between throughtput and latency is not linear and it
> depends on too many factors to find an elevator algorithm that works
> well on the paper.
>
> So, in function of that, one vapourware idea I had while reading your
> email is to use the feedback from the output througput generated to know
> when it's worthwhile to decrease or increase the latency. If decreasing
> latency doesn't decrease the final throughput generated, that means
> we're ok to decrease latency even more. As soon as the throughput
> decreases (despite of people waiting on the submit_bh pipeline), we know
> we'd better not decrease latency further, unless we want to hurt
> performance.
But what is the knob by which you control latency?
> The current elevator (not rmap) is always very permissive, so throughput
> is ok in dbench (and anything seeking as hard as dbench), but latency
> often sucks (actually in -aa I decreased the read latency so it's
> acceptable, not like in mainline, but still it's far from being very
> reactive under a write flood). The feedback from the output channel to
> control the latency parameters in a dynamic manner may help to decrease
> latency when possible (not unconditionally with elvtune). One of the
> thing I love about the analog electronics are the operational chips, a
> feedback loop solves so much difficult problems so easily. Software can
> do similar things lots of times.
Oh yes, that's exactly the way I think of these things, and I did
experiment with a similar idea earlier this year with my 'early flush with
bandwidth estimation'. What I found is, it's very hard
to get a good 'signal' by tracking kernel statistics. By the time I
averaged the disk bandwidth enough to get a smooth signal, the lag was way
too high to be useful. The statistics just aren't very continuous, so they
tend to resist analysis by analog methods. Note: they resist analysis,
they don't defy it.
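The smoothing-versus-lag tradeoff can be seen with a plain exponential
moving average. The sketch below is not the 'early flush' code, which is
not shown in this thread; ewma_steps_to_settle() is a hypothetical helper
that counts how many samples a shift-based average needs before it reaches
90% of a sudden step in its input.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: feed a step input (0 -> target) into the filter
 *	avg += (sample - avg) / 2^shift
 * and count the samples needed before avg reaches 90% of target.
 * The state is kept scaled by 2^shift so the low bits are not lost.
 */
static unsigned ewma_steps_to_settle(unsigned shift, uint32_t target)
{
	uint64_t scaled = 0;			/* avg << shift */
	uint64_t threshold = (uint64_t)target * 9 / 10;
	unsigned steps = 0;

	assert(shift <= 16);

	while ((scaled >> shift) < threshold) {
		/* avg never exceeds target, so this cannot underflow. */
		scaled += target - (scaled >> shift);
		steps++;
	}
	return steps;
}
```

For a 0-to-1000 step, shift = 1 settles in 4 samples while shift = 4 needs
36: the smoother the signal, the later a controller sees a real change.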
> Anyways this is just vapourware
> (probably quite complex to implement in a generic manner) but fixed
> algorithms are not likely to give us a solution (we'll be either too
> permissive or too slow in dbench), while this kind of feedback sounds
> like something that may solve the problem dynamically, or maybe I'm
> simply just dreaming :).
Well I'm dreaming the same dreams, and by coincidence it's the reason I was
complaining earlier today on lkml about the lack of good muldiv operations
with double-wide intermediate results in the kernel. Such operators are
needed to do the filtering calculations and so on with enough precision -
and by this, I mean 'enough choices of divisor' more than 'enough bits' -
so the algorithms don't choke on their own noise.
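A muldiv with a double-wide intermediate might look like the sketch below.
The names muldiv_u32() and muldiv_naive() are hypothetical and this is
user-space C; in-kernel code on 32-bit x86 would presumably have to go
through do_div() rather than a plain 64-bit "/". Saturating on overflow is
also just one possible choice.

```c
#include <assert.h>
#include <stdint.h>

/* Naive version: the 32-bit product wraps before the division happens. */
static uint32_t muldiv_naive(uint32_t a, uint32_t b, uint32_t c)
{
	assert(c != 0);
	return (uint32_t)(a * b) / c;
}

/* a * b / c with a 64-bit intermediate; saturates if the result overflows. */
static uint32_t muldiv_u32(uint32_t a, uint32_t b, uint32_t c)
{
	uint64_t q;

	assert(c != 0);
	q = (uint64_t)a * b / c;
	return q > UINT32_MAX ? UINT32_MAX : (uint32_t)q;
}
```

Because the divisor is arbitrary rather than a power of two, a filter
coefficient such as 15/16 or 7/10 can be applied as muldiv_u32(x, 15, 16)
or muldiv_u32(x, 7, 10) without losing the high bits of the product.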
But before you can do signal processing, feedback, or whatever, you have to
have a good signal.
--
Daniel
Hi!
> > I've also to say I always mke2fs first when I run my benchmarks,
>
> Yes, and it would be nice if we had an operation to squeeze cache down to
> its minimum size (whatever that means) just for running benchmarks
> accurately without rebooting.
Take a look at swsusp -- it frees as much memory as possible before
doing anything.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.