I've been crunching on Andrea's 10_vm-32 patch for a number of days
with a view to getting it into the main tree.
At http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre3/aa/ are 24
separate patches - the sum of all these is basically identical to
10_vm-32. Note that the commentary in some of those patches is not
completely accurate.
Linus reviewed those patches over the weekend. As a result of that and
some of my own work, we're down to 16 patches. They are at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre3/aa2/
I'll be feeding those 16 patches onto this mailing list for review.
Here's a summary:
aa-010-show_stack.patch
Sort-of provides an arch-independent show_stack() API.
aa-020-sync_buffers.patch
writeback changes.
aa-030-writeout_scheduling.patch
More writeback changes.
aa-040-touch_buffer.patch
Buffer page aging changes.
aa-050-page_virtual.patch
Linus said: "it just makes a micro-optimization to
"page_address()" for the case where there is only one mem_map.
In my book, it only makes the thing more unreadable."
Not included in aa2.
aa-060-start_aggressive_readahead.patch
This was patch leakage from the XFS tree. Not included in aa2
aa-070-exclusive_swap_page.patch
Linus says this was addressed by other means. Not included.
aa-080-async_swap_speedup.patch
Linus said "The other part (that allows async swapins)
was dangerous in my testing (it's originally from me): it
allows a process that has a big dirty footprint to keep on to
its pages without ever waiting for their dirty writeback. I
ended up reverting it because it allowed hoggers to make for
worse interactive behaviour, but it definitely improves
throughput."
Not included.
aa-090-zone_watermarks.patch
"I think the same problem got fixed with a simple
one-liner in my 2.4.17 or something: make the "pages_low"
requirement add up over all zones you go through."
Not included.
aa-093-vm_tunables.patch
Adds /proc tunables
aa-096-swap_out.patch
Changes to the swap_out logic. With this patch alone the VM
becomes *totally* unusable; it needs the changes in aa-110 as
well.
aa-100-local_pages.patch
This was some code which added a memclass check to the
local_pages logic. After examination I decided that there was
tons more code in there than we actually needed, so I stripped
this down to just a single process-local page. Which is
basically all that the code ever did.
aa-110-zone_accounting.patch
Adds all the instrumentation which aa-096 needs. 'Fraid I got those patches
backwards.
Andrea had implemented this as lots of macros in
swap.h. Linus said it "should be cleaned up to use real
functions instead of those macros from hell", so I did that.
aa-120-try_to_free_pages_nozone.patch
Support function needed by buffer.c
aa-140-misc_junk.patch
Random little stuff
aa-150-read_write_tweaks.patch
Little changes to pagefault and write(2) code.
aa-160-lru_release_check.patch
Hugh's famous BUG() check in free_pages.
aa-170-drain_cpu_caches.patch
microoptimisation
aa-180-activate_page_cleanup.patch
Code cleanup
aa-190-block_flushpage_check.patch
BUG check
aa-200-active_page_swapout.patch
Remove dead code (I think)
aa-210-tlb_flush_speedup.patch
microoptimisation
aa-230-free_zone_bhs.patch
Prevent ZONE_NORMAL from getting clogged with
buffer_heads. Everyone agrees that this is a pretty ugly hack.
I'm not proposing it for inclusion at this time, but the diff
is there, and it is stable.
aa-240-page_table_hash.patch
This is the patch which changes Bill Irwin's hashing
scheme for per-page waitqueues. I'm not proposing it for
merging at this time - I think that more discussion and
evaluation is needed to justify such action. But the patch is
there, and is stable.
For a merging plan I'd propose that the patches be considered in three
groups. Maybe split across three kernel releases.
writeback changes:
aa-010-show_stack.patch
aa-020-sync_buffers.patch
aa-030-writeout_scheduling.patch
aa-040-touch_buffer.patch
VM changes:
aa-093-vm_tunables.patch
aa-096-swap_out.patch
aa-100-local_pages.patch
aa-110-zone_accounting.patch
aa-120-try_to_free_pages_nozone.patch
The rest:
aa-140-misc_junk.patch
aa-150-read_write_tweaks.patch
aa-160-lru_release_check.patch
aa-170-drain_cpu_caches.patch
aa-180-activate_page_cleanup.patch
aa-190-block_flushpage_check.patch
aa-200-active_page_swapout.patch
There are still a few areas which need more work, but they're not
critical. They're highlighted in the commentary against the individual
patches.
-
On Tue, 19 Mar 2002, Andrew Morton wrote:
> I've been crunching on Andrea's 10_vm-32 patch for a number of days
> with a view to getting it into the main tree.
Hi Andrew, (just an fyi)
I was curious as to what effect some of these aa patches would have
on 2.5.7 throughput, so I wiggled..
aa-096-swap_out
aa-180-activate_page_cleanup
aa-150-read_write_tweaks
aa-110-zone_accounting
aa-093-vm_tunables
aa-040-touch_buffer
aa-030-writeout_scheduling
aa-020-sync_buffers
..into it and gave it a little spin. At swap, the aa modified
kernel won. At disk blasting though, the stock kernel won big.
freshboot; time make -j70 bzImage CC=gcc-3.0.3 && procinfo
Linux 2.5.7aa (root@mikeg) (gcc gcc-2.95.3 20010315 ) #89 [mikeg.]
user : 0:08:02.98 77.9% page in : 549622
nice : 0:00:00.00 0.0% page out: 437908
system: 0:00:50.93 8.2% swap in : 105628
idle : 0:01:25.86 13.9% swap out: 89827
real 8m59.259s aside 6m55.169s gcc-2.95.3 doing make -j90
user 7m54.930s 6m12.100s on very same 384 mb box...
sys 0m34.320s 0m25.050s
#!/bin/sh
# testo
# /tmp is tmpfs
for i in 1 2 3 4 5
do
mv /test/linux-2.5.7 /tmp/.
mv /tmp/linux-2.5.7 /test/.
done
time testo
real 3m43.775s
user 0m4.850s
sys 0m47.460s
freshboot; time make -j70 bzImage CC=gcc-3.0.3 && procinfo
Linux 2.5.7 (root@mikeg) (gcc gcc-2.95.3 20010315 ) #83 [mikeg.]
user : 0:08:04.81 73.4% page in : 606872
nice : 0:00:00.00 0.0% page out: 482386
system: 0:00:50.95 7.7% swap in : 129625
idle : 0:02:04.66 18.9% swap out: 115843
real 9m39.606s
user 7m56.300s
sys 0m34.570s
time testo
real 2m7.581s
user 0m4.490s
sys 0m41.850s
On Sun, Mar 31, 2002 at 02:26:14PM +0200, Mike Galbraith wrote:
> #!/bin/sh
> # testo
> # /tmp is tmpfs
>
> for i in 1 2 3 4 5
> do
> mv /test/linux-2.5.7 /tmp/.
> mv /tmp/linux-2.5.7 /test/.
> done
It would be important to see the /tmp and /test tests benchmarked
separately; the way tmpfs and normal filesystems write to disk is very
different and involves different algorithms, so it's not easy to say
which one could go wrong by looking at the global result. Just in case:
it is very important that the tmpfs contents are exactly the same before
starting the two tests. If you load something into /tmp before starting
the test, performance will be different due to the need for additional
swapouts.
So I would suggest moving linux-2.5.7 between two normal filesystems,
and then just between two tmpfs mounts, so we know which one is running slower.
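Something along these lines would split it (a sketch only: /test2 and a
second tmpfs mount are hypothetical, substitute whatever you actually
have mounted):

#!/bin/sh
# testo-disk -- disk-only half of the split, run under "time" like testo;
# a matching testo-tmpfs would do the same moves between two tmpfs mounts
for i in 1 2 3 4 5
do
mv /test/linux-2.5.7 /test2/.
mv /test2/linux-2.5.7 /test/.
done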
Another possibility is that the lru could be more fair (we may be better
at flushing dirty pages, allowing them to be discarded in lru order). I
assume your machine cannot hold a kernel tree in cache, so there should
be a total cache-thrashing scenario. So you may want to verify with
vmstat that both kernels are doing the very same amount of I/O, just to
be sure one of the two isn't faster because of additional fairness in
the lru information rather than because the other is doing slower I/O.
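For instance (just a sketch, the log naming is arbitrary):

vmstat 1 > vmstat-`uname -r`.log &
time testo
kill %1

and then compare the bi/bo (blocks in/out) columns between the two kernels.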
Thanks,
Andrea
Andrea Arcangeli wrote:
>
> Another possibility is that the lru could be more fair (we may be better at
> flushing dirty pages, allowing them to be discarded in lru order). I
> assume your machine cannot hold a kernel tree in cache, so there should
> be a total cache-thrashing scenario. So you may want to verify with
> vmstat that both kernels are doing the very same amount of I/O, just to
> be sure one of the two isn't faster because of additional fairness in
> the lru information rather than because the other is doing slower I/O.
2.5.x is spending significantly more CPU on I/O for smallish machines
such as Mike's. The extra bio allocation for each bh is showing up
heavily.
But more significantly, something has gone wrong at the buffer
writeback level. Try writing a 100 meg file in 4096 byte
write()s. It's nice.
Now write it in 5000 byte write()s. It's horrid. We spend more
time in write_some_buffers() than in copy_*_user. With 4096
byte writes, write_some_buffers visits 150,000 buffers. With
5000-byte writes, it visits 8,000,000. Under lru_list_lock.
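A rough way to reproduce this (a sketch only, not necessarily the exact
commands that were used):

time dd if=/dev/zero of=bigfile bs=4096 count=25600   # 100MB in 4096-byte write()s: fine
rm -f bigfile
time dd if=/dev/zero of=bigfile bs=5000 count=20972   # ~100MB in 5000-byte write()s: horrid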
I assume what's happening is that we write to a buffer, it goes onto
BUF_DIRTY and then balance_dirty or bdflush or kupdate moves it to
BUF_LOCKED. Then write(2) redirties the buffer so it is locked, dirty,
on BUF_DIRTY, with I/O underway. BUF_DIRTY gets flooded with locked buffers
and we just do enormous amounts of scanning in write_some_buffers().
I haven't looked into this further yet. Not sure why it only
happens in 2.5. Maybe it is happening in 2.4, but it's not as
easy to trigger for some reason.
But I don't think there's anything to prevent this from happening
in 2.4 is there?
Also, I've been *trying* to get some decent I/O bandwidth on my test
box, but 2.4 badness keeps on getting in the way. bounce_end_io_read()
is being particularly irritating. It's copying tons of data
which has just come in from PCI while inside io_request_lock. Ugh.
Is there any reason why we can't drop io_request_lock around the
completion handler in end_that_request_first()?
-
On Sun, Mar 31, 2002 at 05:52:06PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > Another possibility is that the lru could be more fair (we may be better at
> > flushing dirty pages, allowing them to be discarded in lru order). I
> > assume your machine cannot hold a kernel tree in cache, so there should
> > be a total cache-thrashing scenario. So you may want to verify with
> > vmstat that both kernels are doing the very same amount of I/O, just to
> > be sure one of the two isn't faster because of additional fairness in
> > the lru information rather than because the other is doing slower I/O.
>
> 2.5.x is spending significantly more CPU on I/O for smallish machines
> such as Mike's. The extra bio allocation for each bh is showing up
> heavily.
But that should be a fixed cost for both (again assuming they're doing
the same amount of I/O, but if they aren't that's not an I/O comparison
in the first place). This is why I ignored the bio bit.
> But more significantly, something has gone wrong at the buffer
> writeback level. Try writing a 100 meg file in 4096 byte
> write()s. It's nice.
>
> Now write it in 5000 byte write()s. It's horrid. We spend more
> time in write_some_buffers() than in copy_*_user. With 4096
> byte writes, write_some_buffers visits 150,000 buffers. With
> 5000-byte writes, it visits 8,000,000. Under lru_list_lock.
>
> I assume what's happening is that we write to a buffer, it goes onto
> BUF_DIRTY and then balance_dirty or bdflush or kupdate moves it to
> BUF_LOCKED. Then write(2) redirties the buffer so it is locked, dirty,
> on BUF_DIRTY, with I/O underway. BUF_DIRTY gets flooded with locked buffers
> and we just do enormous amounts of scanning in write_some_buffers().
>
> I haven't looked into this further yet. Not sure why it only
> happens in 2.5. Maybe it is happening in 2.4, but it's not as
> easy to trigger for some reason.
>
> But I don't think there's anything to prevent this from happening
> in 2.4 is there?
The dirty buffer should be inserted at the end of the list, so it should
be marked dirty again the second time around, before we have had a chance
to start the I/O on it. We should start writing from the buffers at the
top of the list, which have already had their final rewrite by the time we
start the I/O. That is what should mitigate the badness of the O(N)
algorithm, by avoiding a flood of dirty+locked buffers at the top of the
list. Maybe 2.5 simply writes out the whole list as fast as possible
without stopping after some reasonable amount of work (bdflush breakage),
or it balances way too early. Any of those errors could lead 2.5 into
submitting the buffer for I/O before it had a chance to be rewritten in
cache while it was still only dirty. My async flushing changes, or any
equivalent well-balanced logic, will hopefully avoid such bad behaviour.
> Also, I've been *trying* to get some decent I/O bandwidth on my test
> box, but 2.4 badness keeps on getting in the way. bounce_end_io_read()
> is being particularly irritating. It's copying tons of data
> which has just come in from PCI while inside io_request_lock. Ugh.
>
> Is there any reason why we can't drop io_request_lock around the
> completion handler in end_that_request_first()?
You can apply the 00_block-highmem patch from 2.4.19pre5aa1; it will
apply cleanly to vanilla 2.4.19pre5 too, and it will avoid the bounces on
all common high-end hardware.
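Roughly (the patch name and tree directory here are assumptions, adjust
to your own layout):

cd linux-2.4.19pre5
patch -p1 --dry-run < ../00_block-highmem   # check it applies cleanly first
patch -p1 < ../00_block-highmem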
Andrea
On Mon, 1 Apr 2002, Andrea Arcangeli wrote:
> On Sun, Mar 31, 2002 at 02:26:14PM +0200, Mike Galbraith wrote:
> > #!/bin/sh
> > # testo
> > # /tmp is tmpfs
> >
> > for i in 1 2 3 4 5
> > do
> > mv /test/linux-2.5.7 /tmp/.
> > mv /tmp/linux-2.5.7 /test/.
> > done
>
> It would be important to see the /tmp and /test tests benchmarked
> separately; the way tmpfs and normal filesystems write to disk is very
> different and involves different algorithms, so it's not easy to say
> which one could go wrong by looking at the global result. Just in case:
> it is very important that the tmpfs contents are exactly the same before
> starting the two tests. If you load something into /tmp before starting
> the test, performance will be different due to the need for additional
> swapouts.
>
> So I would suggest moving linux-2.5.7 between two normal filesystems,
> and then just between two tmpfs mounts, so we know which one is running slower.
2.5.7.virgin
time testo (mv tree between /test and /usr/local partitions)
real 10m42.697s
user 0m5.110s
sys 1m16.240s
Bonnie -s 1000
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1000 7197 27.4 8557 12.0 4273 7.3 8049 40.6 9049 8.0 111.5 1.3
2.5.7.aa
time testo
real 51m17.577s
user 0m5.680s
sys 1m15.320s
Bonnie -s 1000
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1000 5184 19.6 5965 8.9 3081 4.4 8936 45.5 9049 8.1 98.2 1.0
Egad. Before I gallop off to look for a merge booboo, let me show you
what I was looking for with the aa writeout changes ;-)
2.4.6.virgin
time testo
real 12m12.384s
user 0m5.280s
sys 0m54.110s
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1000 8377 31.3 10560 12.3 3236 5.2 7289 36.1 8974 6.7 113.3 1.0
2.4.6.flushto
time testo
real 9m5.801s
user 0m5.060s
sys 0m59.310s
Bonnie -s 1000
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1000 10553 37.6 11785 13.5 4174 6.5 6785 33.7 8964 6.9 115.7 1.2
pittypatterpittypatter...
-Mike
On Mon, Apr 01, 2002 at 02:07:23PM +0200, Mike Galbraith wrote:
> On Mon, 1 Apr 2002, Andrea Arcangeli wrote:
>
> > On Sun, Mar 31, 2002 at 02:26:14PM +0200, Mike Galbraith wrote:
> > > #!/bin/sh
> > > # testo
> > > # /tmp is tmpfs
> > >
> > > for i in 1 2 3 4 5
> > > do
> > > mv /test/linux-2.5.7 /tmp/.
> > > mv /tmp/linux-2.5.7 /test/.
> > > done
> >
> > It would be important to see the /tmp and /test tests benchmarked
> > separately; the way tmpfs and normal filesystems write to disk is very
> > different and involves different algorithms, so it's not easy to say
> > which one could go wrong by looking at the global result. Just in case:
> > it is very important that the tmpfs contents are exactly the same before
> > starting the two tests. If you load something into /tmp before starting
> > the test, performance will be different due to the need for additional
> > swapouts.
> >
> > So I would suggest moving linux-2.5.7 between two normal filesystems,
> > and then just between two tmpfs mounts, so we know which one is running slower.
>
> 2.5.7.virgin
>
> time testo (mv tree between /test and /usr/local partitions)
> real 10m42.697s
> user 0m5.110s
> sys 1m16.240s
>
> Bonnie -s 1000
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 1000 7197 27.4 8557 12.0 4273 7.3 8049 40.6 9049 8.0 111.5 1.3
>
> 2.5.7.aa
>
> time testo
> real 51m17.577s
> user 0m5.680s
> sys 1m15.320s
>
> Bonnie -s 1000
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 1000 5184 19.6 5965 8.9 3081 4.4 8936 45.5 9049 8.1 98.2 1.0
>
> Egad. Before I gallop off to look for a merge booboo, let me show you
> what I was looking for with the aa writeout changes ;-)
>
> 2.4.6.virgin
>
> time testo
> real 12m12.384s
> user 0m5.280s
> sys 0m54.110s
>
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 1000 8377 31.3 10560 12.3 3236 5.2 7289 36.1 8974 6.7 113.3 1.0
>
> 2.4.6.flushto
>
> time testo
> real 9m5.801s
> user 0m5.060s
> sys 0m59.310s
>
> Bonnie -s 1000
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 1000 10553 37.6 11785 13.5 4174 6.5 6785 33.7 8964 6.9 115.7 1.2
>
> pittypatterpittypatter...
Comparing 2.4 with 2.5 is a bit unfair; can you try 2.4.19pre5aa1
first? Note that you didn't apply all the vm patches; in fact I've no
idea how they apply to 2.5 in the first place (I assume they applied
cleanly).
Also it would be interesting to know how much memory you have in use
before starting the benchmark; maybe you're triggering some swap
because the VM understands that lots of your mappings are unused, and
so you're swapping out during the I/O benchmark because of that. The
anon pages in the lru are meant exactly for that purpose. If you want a
VM that never ever swaps during an I/O benchmark, mapped pages should
not be considered by the VM until we run out of unmapped pages; that's
roughly equivalent to raising vm_mapped_ratio to 10000, so in fact you
can try with vm_mapped_ratio set to 10000 too.
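For example, assuming the aa-093 tunables show up under /proc/sys/vm
(the exact file name here is an assumption, check what your kernel
actually exposes):

cat /proc/sys/vm/vm_mapped_ratio
echo 10000 > /proc/sys/vm/vm_mapped_ratio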
Andrea
On Mon, 1 Apr 2002, Andrea Arcangeli wrote:
> Comparing 2.4 with 2.5 is a bit unfair; can you try 2.4.19pre5aa1
> first? Note that you didn't apply all the vm patches; in fact I've no
> idea how they apply to 2.5 in the first place (I assume they applied
> cleanly).
Sure, I'll give 2.4.19pre5aa1 a go. The patches didn't go in cleanly,
but the changes made since the split weren't too big, so it was easy to
plug them in. (I only applied the patches which looked interesting for
my little UP box.)
I was comparing the old 2.4 kernel to 2.5 because of a loss in write
throughput which I've been tracking for a while, and hoped to get back
via aa after reading some bits.
> Also it would be interesting to know how much memory you have in use
> before starting the benchmark; maybe you're triggering some swap
Mostly empty, but that doesn't matter much.. see below.
> because the VM understands that lots of your mappings are unused, and
> so you're swapping out during the I/O benchmark because of that. The
> anon pages in the lru are meant exactly for that purpose. If you want a
> VM that never ever swaps during an I/O benchmark, mapped pages should
> not be considered by the VM until we run out of unmapped pages; that's
> roughly equivalent to raising vm_mapped_ratio to 10000, so in fact you
> can try with vm_mapped_ratio set to 10000 too.
I don't mind if my box swaps a bit when stressed. In fact, I like it
to go find bored pages and get them out of the way :)
It's the buffer.c changes (the ones I'm most interested in:) that are
causing my disk woes. They look like they're in right, but are causing
bad (synchronous) IO behavior for some reason. I have tomorrow yet to
figure it out.
-Mike
On Mon, 1 Apr 2002, Mike Galbraith wrote:
<snip>
> It's the buffer.c changes (the ones I'm most interested in:) that are
> causing my disk woes. They look like they're in right, but are causing
> bad (synchronous) IO behavior for some reason. I have tomorrow yet to
> figure it out.
Just to make sure: You mean the buffer.c changes alone (pre4 -> pre5) are
causing bad synchronous IO behaviour for you ?
On Mon, 1 Apr 2002, Marcelo Tosatti wrote:
> On Mon, 1 Apr 2002, Mike Galbraith wrote:
>
> <snip>
>
> > It's the buffer.c changes (the ones I'm most interested in:) that are
> > causing my disk woes. They look like they're in right, but are causing
> > bad (synchronous) IO behavior for some reason. I have tomorrow yet to
> > figure it out.
>
> Just to make sure: You mean the buffer.c changes alone (pre4 -> pre5) are
> causing bad synchronous IO behaviour for you ?
I'm working out of 2.5, not 2.4. I'm going to test 2.4.19pre5
and aa to make sure they don't show this behavior.. seriously
doubt that they will.
-Mike
On Mon, 1 Apr 2002, Andrea Arcangeli wrote:
> ........ can you try 2.4.19pre5aa1 first?
Ok, I tested (and repeated, for consistency, to prevent a repeat of
unfortunate premature results) 2.4.19-pre5 and 2.4.19pre5aa1.
I didn't get my write throughput back (oh well), but I do NOT
see any bad behavior. IO looks/feels good in both kernels.
The only interesting thing during testing was that 2.4.19pre5aa1
lost by a consistent ~15% in the move-a-tree-around test.
(if I find anything interesting on the 2.5 thingy, I'll let you
know offline)
-Mike
On Tue, Apr 02, 2002 at 09:28:51AM +0200, Mike Galbraith wrote:
> The only interesting thing during testing was that 2.4.19pre5aa1
> lost by a consistent ~15% in the move-a-tree-around test.
I'm not sure what your problem is but I made a fast check on the
performance difference between 2.4.6 and 2.4.19pre5aa1:
2.4.19pre5aa1 mem=1200M:
time dd if=/dev/zero of=test bs=4096 count=$[1500*1024*1024/4096] ; time sync ; time dd if=/dev/zero of=test bs=4096 count=$[1500*1024*1024/4096] ; time sync
384000+0 records in
384000+0 records out
real 0m48.776s
user 0m0.380s
sys 0m11.850s
real 0m24.589s
user 0m0.000s
sys 0m0.290s
384000+0 records in
384000+0 records out
real 0m45.774s
user 0m0.330s
sys 0m12.190s
real 0m26.813s
user 0m0.010s
sys 0m0.290s
time dd if=test of=/dev/null bs=4096 count=$[1500*1024*1024/4096]; time dd if=test of=/dev/null bs=4096 count=$[1500*1024*1024/4096]
384000+0 records in
384000+0 records out
real 1m2.269s
user 0m0.250s
sys 0m12.070s
384000+0 records in
384000+0 records out
real 1m2.284s
user 0m0.240s
sys 0m12.510s
(andrew test below, just in case)
time dd if=/dev/zero of=test bs=5000 count=$[1500*1024*1024/5000] ; time sync ; time dd if=/dev/zero of=test bs=5000 count=$[1500*1024*1024/5000] ; time sync
314572+0 records in
314572+0 records out
real 0m49.273s
user 0m0.430s
sys 0m12.420s
real 0m25.064s
user 0m0.000s
sys 0m0.300s
314572+0 records in
314572+0 records out
real 0m49.567s
user 0m0.350s
sys 0m12.070s
real 0m25.618s
user 0m0.010s
sys 0m0.320s
official 2.4.6 mem=1200M:
time dd if=/dev/zero of=test bs=4096 count=$[1500*1024*1024/4096] ; time sync ; time dd if=/dev/zero of=test bs=4096 count=$[1500*1024*1024/4096] ; time sync
384000+0 records in
384000+0 records out
real 0m37.425s
user 0m0.590s
sys 0m14.900s
real 0m33.751s
user 0m0.000s
sys 0m0.100s
384000+0 records in
384000+0 records out
real 0m34.182s
user 0m0.500s
sys 0m14.780s
real 0m35.487s
user 0m0.000s
sys 0m0.160s
time dd if=test of=/dev/null bs=4096 count=$[1500*1024*1024/4096]; time dd if=test of=/dev/null bs=4096 count=$[1500*1024*1024/4096]
384000+0 records in
384000+0 records out
real 1m2.978s
user 0m0.230s
sys 0m11.330s
384000+0 records in
384000+0 records out
real 1m2.660s
user 0m0.290s
sys 0m11.050s
time dd if=/dev/zero of=test bs=5000 count=$[1500*1024*1024/5000] ; time sync ; time dd if=/dev/zero of=test bs=5000 count=$[1500*1024*1024/5000] ; time sync
314572+0 records in
314572+0 records out
real 0m37.602s
user 0m0.450s
sys 0m15.270s
real 0m36.123s
user 0m0.000s
sys 0m0.180s
314572+0 records in
314572+0 records out
real 0m36.118s
user 0m0.460s
sys 0m15.820s
real 0m33.486s
user 0m0.020s
sys 0m0.160s
As you can see, the raw write/read performance is the same, dominated
only by raw disk speed. I'm not sure how you can write or read to disk
faster with 2.4.6. Also note that raw speed with total cache thrashing is
quite unrelated to the VM if balance_dirty()/bdflush works sanely.
One thing that comes to mind is that the old 2.4.6 balance_dirty()
was passing over .ndirty buffers before breaking the loop; now we stop
much earlier, so we may take less advantage of the cpu cache in doing
so. That may matter on slow-cpu machines; on my box it clearly doesn't matter.
Are you sure the 15% of performance you're talking about isn't just your
disk still writing after your workload finishes? I mean, look at the sync
time: it decreases because the latest kernels have lower bdflush sync
percentages, which is normal and expected. You should take the sync time
into account too of course. If you want the level of dirty buffers in
the system to be larger you only need to tweak bdflush; the default has
to be conservative to be fair to users who are only reading.
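As a rough sketch of what that tweak might look like (the nine-field
/proc/sys/vm/bdflush layout with nfract as the first field is assumed
here; field meanings vary a little between 2.4 releases, so check
Documentation/sysctl/vm.txt for your tree first):

#!/bin/sh
# bump only the first bdflush field (assumed to be nfract, the percentage
# of dirty buffers that wakes bdflush) and keep the other fields unchanged
set -- `cat /proc/sys/vm/bdflush`
echo "old bdflush settings: $*"
shift
echo "60 $*" > /proc/sys/vm/bdflush
cat /proc/sys/vm/bdflush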
The difference between 2.4.6 and the latest VM code should kick in when
the VM really starts to matter.
It is possible that your mixed read/write workload gets the elevator
into the equation too. But the most important thing is that raw
read/write speed with total cache thrashing is fast, and that's the case;
so whatever is going on with the mixed read/write load, it has to be only
an elevator thing or a bdflush tuning parameter changeable via sysctl.
Andrea
Hello,
I have patched a stock 2.4.19-pre5 kernel with Andrew Morton's -aa
VM splitup [1], Ingo Molnar's O(1) scheduler [2], Andrew Morton's read
latency [3] and IDE lockup patches [4] and the mini low latency patch
[5] plus fixes for it [6] (in this order). When running ps or top under
this kernel, I get the following error messages:
{vmalloc_to_page} {GPLONLY_vmalloc_to_page}
Warning: /boot/System.map does not match kernel data.
I made sure that I didn't forget to copy the new System.map. Perhaps
some symbol needs to be exported?
Andreas
--
[1] http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre5/aa1/
[2] http://people.redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.18-pre8-K3.patch
[3] http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre5/read-latency2.patch
[4] http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre5/ide-lockup.patch
[5] http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.19-pre5-jam2/23-lowlatency-mini.gz
[6] http://giga.cps.unizar.es/~magallon/linux/kernel/2.4.19-pre5-jam2/24-lowlatency-fixes-5.gz
On Tue, 2 Apr 2002, Andrea Arcangeli wrote:
> As you can see, the raw write/read performance is the same, dominated
> only by raw disk speed. I'm not sure how you can write or read to disk
> faster with 2.4.6. Also note that raw speed with total cache thrashing is
> quite unrelated to the VM if balance_dirty()/bdflush works sanely.
>
> One thing that comes to mind is that the old 2.4.6 balance_dirty()
> was passing over .ndirty buffers before breaking the loop; now we stop
> much earlier, so we may take less advantage of the cpu cache in doing
> so. That may matter on slow-cpu machines; on my box it clearly doesn't matter.
Hmm. Could it have something to do with the 1k blocksize?
> Are you sure the 15% of performance you're talking about isn't your disk
> that keeps writing when your workload finishes? I mean, see the sync
> time, it decreases because the latest kernels have lower bdflush sync
> percentages, that's normal and expected. You should take the sync time
> into account too of course. If you want the level of dirty buffers in
> the system to be larger you only need to tweak bdflush, the default has
> to be conservative to be fair with the users that are only reading.
The 15% delta was between 2.4.19-pre5 and 2.4.19-pre5aa1. The
move-a-tree-around test may not be a particularly wonderful test though.
(Bonnie, known to be a less-than-wonderful test, shows no difference.)
> The difference between 2.4.6 and the latest VM code should kick in when
> the VM really starts to matter.
My disk throughput loss has nothing to do with the VM change I think.
I thought your flush changes would give me more, just as my flush
changes to 2.4.6 did. Alas, it didn't pan out.
> It is possible that your mixed read/write workload gets the elevator
> into the equation too. But the most important thing is that raw
> read/write speed with total cache thrashing is fast, and that's the case;
> so whatever is going on with the mixed read/write load, it has to be only
> an elevator thing or a bdflush tuning parameter changeable via sysctl.
>
> Andrea
-Mike