[ Does anyone know what "CFT" means? It means "call for testers". It
doesn't mean "woo-hoo, it'll be neat when that's merged <delete>". It means
"help, help - there's no point in just one guy testing this" (thanks Randy). ]
This is an update of the delayed-allocation and multipage pagecache I/O
patches. I'm calling this a beta, because it all works, and I have
other stuff to do for a while.
Of the thirteen patches, seven (dallocbase-* and tuning-*) are
applicable to the base 2.5.6 kernel.
You need to mount an ext2 filesystem with the `-o delalloc' mount
option to turn on most of the functionality.
The rolled up patch is at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6/everything.patch.gz
These patches do a ton of stuff. Generally, the CPU overhead for filesystem
operations is decreased by about 40%. Note "overhead": this is after factoring
out the constant copy_*_user overhead. This translates to a 15-25% reduction
in CPU use for most workloads.
All benchmark results improve, to varying degrees. The best case is two
instances of `dbench 64' against different disks, which went from 7
megabytes/sec to 25. This is due to better write layout patterns, avoidance of
synchronous reads in the writeback path, better memory management and better
management of writeback threads.
The patch breakdown is:
dallocbase-10-readahead
Unifies the current three readahead functions (mmap reads, read(2) and
sys_readahead) into a single implementation.
More aggressive in building up the readahead windows.
More conservative in tearing them down.
Special start-of-file heuristics.
Preallocates the readahead pages, to avoid the (never demonstrated, but
potentially catastrophic) scenario where allocation of readahead pages causes
the allocator to perform VM writeout.
(hidden agenda): Gets all the readahead pages gathered together in one
spot, so they can be marshalled into big BIOs.
Reinstates the readahead tuning ioctls, so hdparm(8) and blockdev(8) are
working again. The readahead settings are now per-request-queue, and the
drivers never have to know about it.
Big code cleanup.
Identifies readahead thrashing.
Currently, it just performs a shrink on the readahead window when thrashing
occurs. This greatly reduces the amount of pointless I/O which we perform,
and will reduce the CPU load. The idea is that the readahead window
dynamically adjusts to a sustainable size. It improves things, but not
hugely, experimentally.
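To give a feel for it, the adjustment amounts to something like this (a
simplified model with made-up names, not the patch code):

struct ra_window {
        unsigned long size;     /* current window, in pages */
        unsigned long max;      /* per-request-queue limit */
};

/*
 * Called as the reader advances.  `was_present' is zero when a page
 * which we read ahead earlier was reclaimed before the reader got to
 * it - that's thrashing.
 */
static void ra_adjust(struct ra_window *ra, int was_present)
{
        if (was_present) {
                /* Grow gently, up to the queue's limit. */
                if (ra->size + 2 <= ra->max)
                        ra->size += 2;
        } else {
                /* Shrink hard, towards a sustainable size. */
                ra->size = ra->size > 1 ? ra->size / 2 : 1;
        }
}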
We really need drop-behind for read and write streams. Or O_STREAMING,
indeed.
dallocbase-15-pageprivate
page->buffers is a bit of a layering violation. Not all address_spaces
have pages which are backed by buffers.
The exclusive use of page->buffers for buffers means that a piece of prime
real estate in struct page is unavailable to other forms of address_space.
This patch turns page->buffers into `unsigned long page->private' and sets
in place all the infrastructure which is needed to allow other address_spaces
to use this storage.
With this change in place, the multipage-bio no-buffer_head code can use
page->private to cache the results of an earlier get_block(), so repeated
calls into the filesystem are not needed in the case of file overwriting.
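In outline, the change amounts to this (a sketch; the accessor names
are illustrative):

struct buffer_head;

struct page {
        unsigned long flags;
        unsigned long private;  /* was: struct buffer_head *buffers */
};

/* Buffer-backed address_spaces keep working through an accessor... */
static struct buffer_head *page_buffers(struct page *page)
{
        return (struct buffer_head *)page->private;
}

/* ...while a no-buffer_head address_space caches a block number. */
static void cache_block(struct page *page, unsigned long blocknr)
{
        page->private = blocknr;
}

A flag in page->flags would record which interpretation is in force.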
dallocbase-20-page_accounting
This patch provides global accounting of locked and dirty pages. It does
this via lightweight per-CPU data structures. The page_cache_size accounting
has been changed to use this facility as well.
Locked and dirty page accounting is needed for making writeback and
throttling decisions in the delayed-allocation code.
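Roughly like this (modelled in userspace; in the kernel the counters
live in per-CPU data):

#define NR_CPUS 8

struct page_state {
        unsigned long nr_dirty;
        unsigned long nr_locked;
};

static struct page_state per_cpu_state[NR_CPUS];

/* Writers touch only their own CPU's counters: no locking, no
 * cacheline ping-pong. */
static void account_dirty(int cpu, long delta)
{
        per_cpu_state[cpu].nr_dirty += delta;
}

/* Readers sum across CPUs.  The result may be slightly stale, which
 * is fine for writeback and throttling decisions. */
static unsigned long total_dirty(void)
{
        unsigned long total = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                total += per_cpu_state[cpu].nr_dirty;
        return total;
}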
dallocbase-30-pdflush
This patch creates an adaptively-sized pool of writeback threads, called
`pdflush'. A simple algorithm is used to determine when new threads are
needed, and when excess threads should be reaped.
The kupdate and bdflush kernel threads are removed - the pdflush pool is
used instead.
The (ab)use of keventd for writing back unused inodes has been removed -
the pdflush pool is now used for that operation.
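One plausible sizing rule looks like this (a sketch; the thresholds
and names are invented):

#define MIN_PDFLUSH     2
#define MAX_PDFLUSH     8

static int nr_pdflush;          /* threads in the pool */
static int nr_pdflush_idle;     /* of those, currently sleeping */

static void start_one_pdflush(void) { nr_pdflush++; nr_pdflush_idle++; }
static void reap_one_pdflush(void)  { nr_pdflush--; nr_pdflush_idle--; }

/* Run each time work is handed to the pool. */
static void pdflush_balance(void)
{
        /* Everyone is busy: writeback demand exceeds supply. */
        if (nr_pdflush_idle == 0 && nr_pdflush < MAX_PDFLUSH)
                start_one_pdflush();

        /* More than one thread has nothing to do: shrink. */
        if (nr_pdflush_idle > 1 && nr_pdflush > MIN_PDFLUSH)
                reap_one_pdflush();
}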
dalloc-10-core
The core delayed allocation code. There's a description in the
dalloc-10-core.patch file (all the patches have descriptions).
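The gist, in miniature (a toy model; the patch description has the
real story):

/* write() only reserves space and dirties pagecache; actual block
 * allocation is deferred to writeback, when the full picture is
 * known and blocks can be laid out contiguously. */
struct fs_space {
        long free_blocks;
        long reserved_blocks;   /* promised to dirty pages, unplaced */
};

static int delalloc_write(struct fs_space *fs, long new_blocks)
{
        if (fs->free_blocks - fs->reserved_blocks < new_blocks)
                return -1;      /* report ENOSPC now, not at flush */
        fs->reserved_blocks += new_blocks;
        return 0;               /* note: no get_block() here */
}

static void delalloc_writeback(struct fs_space *fs, long nr_blocks)
{
        /* Allocate all nr_blocks in one go, contiguously if we can. */
        fs->reserved_blocks -= nr_blocks;
        fs->free_blocks -= nr_blocks;
}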
dalloc-20-ext2
Implements delayed allocation for ext2.
dalloc-30-ratcache
The radix-tree pagecache patch.
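This replaces the global page hash with a per-mapping radix tree
indexed by file offset. A two-level toy version of the lookup (the
real tree grows to arbitrary depth):

#include <stddef.h>

#define RADIX_SHIFT     6
#define RADIX_SIZE      (1 << RADIX_SHIFT)
#define RADIX_MASK      (RADIX_SIZE - 1)

struct page;

struct radix_node {
        void *slots[RADIX_SIZE];        /* child nodes, or pages */
};

static struct page *radix_lookup(struct radix_node *root,
                                 unsigned long index)
{
        struct radix_node *node;

        node = root->slots[(index >> RADIX_SHIFT) & RADIX_MASK];
        if (node == NULL)
                return NULL;
        return node->slots[index & RADIX_MASK];
}

Lookups touch only this mapping's tree; no global lock or hash chain
is involved.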
mpage-10-biobits
Little API extensions in the BIO layer which were needed for building the
pagecache BIOs.
mpage-20-core
The core multipage I/O layer.
This now implements multipage BIO reads into the pagecache. Also caching
of get_block() results at page->private.
The get_block() result caching currently only applies if all of a page's
blocks are laid out contiguously on disk. Caching of a discontiguous list of
blocks at page->private is easy enough to do, but would require a memory
allocation, and the requirement is so rare that I didn't bother.
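The cacheability test is just a contiguity check, something like
(sketch):

static int blocks_contiguous(const unsigned long *b, int nblocks)
{
        int i;

        /* One number can describe the whole page on disk only if
         * every block follows its predecessor. */
        for (i = 1; i < nblocks; i++)
                if (b[i] != b[0] + i)
                        return 0;
        return 1;
}

When it passes, b[0] is stashed at page->private and a later overwrite
builds its BIO without calling back into the filesystem.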
mpage-30-ext2
Implements multipage I/O for ext2.
tuning-10-request
get_request() fairness for 2.5.x. Avoids the situation where a thread
sleeps for ages on the request queue while other threads whizz in and steal
requests which they didn't wait for.
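A ticket scheme gives the flavour of the fix (a simplified model, not
necessarily the patch's exact mechanism):

struct request_fifo {
        unsigned long next_ticket;      /* taken by each new arrival */
        unsigned long now_serving;      /* the oldest waiter's turn */
};

/* A free request may only be taken by the thread whose turn it is,
 * so newcomers can't steal a sleeper's wakeup. */
static int may_get_request(struct request_fifo *q,
                           unsigned long my_ticket, int free_requests)
{
        return free_requests > 0 && my_ticket == q->now_serving;
}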
tuning-20-ext2-preread-inode
When we create a new inode, preread its backing block.
Without this patch, many-inode writeout gets seriously stalled by having to
read many individual inode table blocks.
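In outline (names invented; the point is just to get the block in
flight early):

struct super_block { unsigned long itable_start; };

static unsigned long inode_table_block(struct super_block *sb,
                                       unsigned long ino)
{
        return sb->itable_start + ino / 32;     /* say, 32 per block */
}

static void start_async_read(struct super_block *sb, unsigned long blk)
{
        /* Queue the read and return without waiting (stubbed out). */
        (void)sb;
        (void)blk;
}

static void preread_inode_block(struct super_block *sb, unsigned long ino)
{
        /* The table block gets read-modify-written at writeback time
         * anyway; starting the read now means many-inode writeout
         * never stalls on it synchronously. */
        start_async_read(sb, inode_table_block(sb, ino));
}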
tuning-30-read_latency
read-latency2, ported from 2.4. Intelligently promotes reads ahead of
writes on the request queue, to prevent reads from being stalled for very
long periods of time.
Also reinstates the BLKELVGET and BLKELVSET ioctls, so `elvtune' may be
used in 2.5.
Also increases the size of the request queues, which allows better request
merging. This is acceptable now that reads are not heavily penalised by a
large queue.
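The promotion logic is a bounded bypass, in outline (a toy queue, not
the real elevator; the field names are invented):

#include <stddef.h>

struct rq {
        int is_read;
        int patience;   /* writes: bypasses left before un-passable */
        struct rq *next;
};

/* A read is inserted ahead of writes which still have patience, and
 * every write it passes loses one unit - reads win, but writes are
 * never starved indefinitely. */
static void queue_read(struct rq **head, struct rq *rq)
{
        struct rq **p = head;
        struct rq *w;

        while (*p && ((*p)->is_read || (*p)->patience == 0))
                p = &(*p)->next;

        for (w = *p; w != NULL; w = w->next)
                if (!w->is_read && w->patience > 0)
                        w->patience--;

        rq->next = *p;
        *p = rq;
}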
-
On March 12, 2002 07:00 am, Andrew Morton wrote:
> Identifies readahead thrashing.
>
> Currently, it just performs a shrink on the readahead window when thrashing
> occurs. This greatly reduces the amount of pointless I/O which we perform,
> and will reduce the CPU load. The idea is that the readahead window
> dynamically adjusts to a sustainable size. It improves things, but not
> hugely, experimentally.
The question is, does it wipe out a nasty corner case? If so, then the
improvement for the average case is just a nice fringe benefit. A carefully
constructed test that triggers the corner case would be most interesting.
--
Daniel
On March 12, 2002 07:00 am, Andrew Morton wrote:
> dallocbase-15-pageprivate
>
> page->buffers is a bit of a layering violation. Not all address_spaces
> have pages which are backed by buffers.
>
> The exclusive use of page->buffers for buffers means that a piece of prime
> real estate in struct page is unavailable to other forms of address_space.
>
> This patch turns page->buffers into `unsigned long page->private' and sets
> in place all the infrastructure which is needed to allow other address_spaces
> to use this storage.
>
> With this change in place, the multipage-bio no-buffer_head code can use
> page->private to cache the results of an earlier get_block(), so repeated
> calls into the filesystem are not needed in the case of file overwriting.
That's pragmatic, a good short term solution. Getting rid of page->buffers
entirely will be nicer, and in that case you want to cache the physical block
only for those pages that have one, e.g., not for swap-backed pages, which
keep that information in the page table.
I've been playing with the idea of caching the physical block in the radix
tree, which imposes the cost only on cache pages. This forces you to do a
tree probe at IO time, but that cost is probably insignificant against the
cost of the IO. This arrangement could make it quite convenient for the
filesystem to exploit the structure by doing opportunistic map-ahead, i.e.,
when ->get_block consults the metadata to fill in one physical address, why
not fill in several more, if it's convenient?
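Roughly what I have in mind (types invented):

struct page;

/* Each slot in the pagecache radix tree carries the physical block
 * beside the page pointer, so only cached pages pay for the field. */
struct pcache_slot {
        struct page *page;
        unsigned long blocknr;          /* ~0UL = not mapped yet */
};

/* When ->get_block has the metadata in hand for one slot, it may as
 * well fill in neighbouring slots too - opportunistic map-ahead.
 * (Contiguous allocation is assumed here for simplicity.) */
static void map_ahead(struct pcache_slot *slot, int n,
                      unsigned long first_block)
{
        int i;

        for (i = 0; i < n; i++)
                if (slot[i].page)
                        slot[i].blocknr = first_block + i;
}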
--
Daniel
Daniel Phillips wrote:
>
> On March 12, 2002 07:00 am, Andrew Morton wrote:
> > Identifies readahead thrashing.
> >
> > Currently, it just performs a shrink on the readahead window when thrashing
> > occurs. This greatly reduces the amount of pointless I/O which we perform,
> > and will reduce the CPU load. The idea is that the readahead window
> > dynamically adjusts to a sustainable size. It improves things, but not
> > hugely, experimentally.
>
> > The question is, does it wipe out a nasty corner case? If so, then the
> > improvement for the average case is just a nice fringe benefit. A carefully
> > constructed test that triggers the corner case would be most interesting.
>
There are many test scenarios. The one I use is:
- 64 megs of memory.
- Process A loops across N 10-megabyte files, reading 4k from each one
and terminates when all N files are fully read.
- Process B loops, repeatedly reading a one gig file off another disk.
The total wallclock time for process A exhibits *massive* step jumps
as you vary N. In stock 2.5.6 the runtime jumps from 40 seconds to
ten minutes when N is increased from 40 to 60.
With my changes, the rate of increase of runtime-versus-N is lower,
and happens at later N. But it's still very sudden and very bad.
Yes, it's a known-and-nasty corner case. Worth fixing if the
fix is clean. But IMO the problem is not common enough to
justify significantly compromising the common case.
-
On March 12, 2002 09:29 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > On March 12, 2002 07:00 am, Andrew Morton wrote:
> > > Identifies readahead thrashing.
> > >
> > > Currently, it just performs a shrink on the readahead window when thrashing
> > > occurs. This greatly reduces the amount of pointless I/O which we perform,
> > > and will reduce the CPU load. The idea is that the readahead window
> > > dynamically adjusts to a sustainable size. It improves things, but not
> > > hugely, experimentally.
> >
> > The question is, does it wipe out a nasty corner case? If so, then the
> > improvement for the average case is just a nice fringe benefit. A carefully
> > constructed test that triggers the corner case would be most interesting.
>
> There are many test scenarios. The one I use is:
>
> - 64 megs of memory.
>
> - Process A loops across N 10-megabyte files, reading 4k from each one
> and terminates when all N files are fully read.
>
> - Process B loops, repeatedly reading a one gig file off another disk.
>
> The total wallclock time for process A exhibits *massive* step jumps
> as you vary N. In stock 2.5.6 the runtime jumps from 40 seconds to
> ten minutes when N is increased from 40 to 60.
>
> With my changes, the rate of increase of runtime-versus-N is lower,
> and happens at later N. But it's still very sudden and very bad.
>
> Yes, it's a known-and-nasty corner case. Worth fixing if the
> fix is clean. But IMO the problem is not common enough to
> justify significantly compromising the common case.
It's a given that the common case should be optimal. I'm sure there's an
algorithm that fixes up your test case, which by the way isn't that uncommon -
it's Rik's '100 ftp processes' case. I'll buy the suggestion that it isn't
common enough to drop everything right now and go fix it.
--
Daniel
Daniel Phillips wrote:
>
> On March 12, 2002 07:00 am, Andrew Morton wrote:
> > dallocbase-15-pageprivate
> >
> > page->buffers is a bit of a layering violation. Not all address_spaces
> > have pages which are backed by buffers.
> >
> > The exclusive use of page->buffers for buffers means that a piece of prime
> > real estate in struct page is unavailable to other forms of address_space.
> >
> > This patch turns page->buffers into `unsigned long page->private' and sets
> > in place all the infrastructure which is needed to allow other address_spaces
> > to use this storage.
> >
> > With this change in place, the multipage-bio no-buffer_head code can use
> > page->private to cache the results of an earlier get_block(), so repeated
> > calls into the filesystem are not needed in the case of file overwriting.
>
> That's pragmatic, a good short term solution. Getting rid of page->buffers
> entirely will be nicer, and in that case you want to cache the physical block
> only for those pages that have one, e.g., not for swap-backed pages, which
> keep that information in the page table.
Really, I don't think we can lose page->buffers for *enough* users
of address_spaces to make it worthwhile.
If it was only being used for, say, blockdev inodes then we could
perhaps take it out and hash for it, but there are a ton of
filesystems out there...
The main problem I see with this patch series is that it introduces
a new way of performing writeback while leaving the old way in place.
The new way is better, I think - it's just a_ops->write_many_pages().
But at present, there are some address_spaces which support write_many_pages(),
and others which still use ->writepage() and sync_page_buffers().
This will make VM development harder, because the VM now needs to cope
with the nice, uniform, does-clustering-for-you writeback as well as
the crufty old write-little-bits-of-crap-all-over-the-disk writeback :)
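For concreteness, the two shapes are roughly (signatures invented for
illustration):

struct page;
struct address_space;

struct address_space_operations {
        /* the old way: one page at a time, no clustering */
        int (*writepage)(struct page *page);

        /* the new way: the mapping does the clustering for you */
        int (*write_many_pages)(struct address_space *mapping,
                                int nr_pages);
};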
I need to give the VM a uniform way of performing writeback for
all address_spaces. My current thinking there is that all
address_spaces (even the non-delalloc, buffer_head-backed ones)
need to be taught to perform multipage clustered writeback
based on the address_space, not the dirty buffer LRU.
This is pretty deep surgery. If it can be made to work, it'll
be nice - it will heavily deprecate the buffer_head layer and will
unify the current two-or-three different ways of performing
writeback (I've already unified all ways of performing writeback
for delalloc filesystems - my version of kupdate writeback, bdflush
writeback, vm-writeback and write(2) writeback are all unified).
> I've been playing with the idea of caching the physical block in the radix
> tree, which imposes the cost only on cache pages. This forces you to do a
> tree probe at IO time, but that cost is probably insignificant against the
> cost of the IO. This arrangement could make it quite convenient for the
> filesystem to exploit the structure by doing opportunistic map-ahead, i.e.,
> when ->get_block consults the metadata to fill in one physical address, why
> not fill in several more, if it's convenient?
That would be fairly easy to do. My current writeback interface
into the filesystem is, basically, "write back N pages from your
mapping->dirty_pages list" [1]. The address_space could quite simply
whizz through that list and map all the required pages in a batched
manner.
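i.e. something along these lines (list and helper names made up):

struct page {
        unsigned long index;
        struct page *next_dirty;
};

struct address_space {
        struct page *dirty_pages;
};

static void map_page(struct address_space *mapping, struct page *page)
{
        /* One get_block() per page - or, batched, several pages per
         * trip into the filesystem's metadata (stubbed out here). */
        (void)mapping;
        (void)page;
}

/* "write back N pages from your mapping->dirty_pages list" */
static void write_some_pages(struct address_space *mapping, int nr)
{
        struct page *p;

        for (p = mapping->dirty_pages; p && nr > 0; p = p->next_dirty) {
                map_page(mapping, p);
                nr--;
        }
        /* ...then marshal the now-mapped pages into big BIOs. */
}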
[1] Problem with the current implementation is that I've taken
out the guarantee that the page which the VM wanted to free
actually has I/O started against it. So if the VM wants to
free something from ZONE_NORMAL, the address_space may just
go and start writeback against 1000 ZONE_HIGHMEM pages instead.
In practice, I suspect this doesn't matter much. But it needs
fixing.
(Our current behaviour in this scenario is terrible. Suppose
a mapping has a mixture of dirty pages from two or more zones,
and the VM is trying to free up a particular zone: the VM will
*selectively* perform writepage against *some* of the dirty
pages, and will skip writeback of pages from other zones.
This means that we're submitting great chunks of discontiguous
I/O. It'll fragment the layout of sparse files and will
greatly decrease writeout bandwidth. We should be opportunistically
submitting writeback against disk-contiguous and file-offset-contiguous
pages from other zones at the same time! I'm doing that now, but
with the present VM design [2] I do need to provide a way to
ensure that writeback has commenced against the target page).
[2] The more I think about it, the less I like it. I have a feeling
that I'll end up having to, umm, redesign the VM. Damn.
-
[email protected] said:
> Really, I don't think we can lose page->buffers for *enough* users of
> address_spaces to make it worthwhile.
> If it was only being used for, say, blockdev inodes then we could
> perhaps take it out and hash for it, but there are a ton of
> filesystems out there...
I have plenty of boxes which never have any use for page->buffers. Ever.
--
dwmw2
On March 12, 2002 10:00 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> > On March 12, 2002 07:00 am, Andrew Morton wrote:
> > > With this change in place, the multipage-bio no-buffer_head code can use
> > > page->private to cache the results of an earlier get_block(), so repeated
> > > calls into the filesystem are not needed in the case of file overwriting.
> >
> > That's pragmatic, a good short term solution. Getting rid of page->buffers
> > entirely will be nicer, and in that case you want to cache the physical block
> > only for those pages that have one, e.g., not for swap-backed pages, which
> > keep that information in the page table.
>
> Really, I don't think we can lose page->buffers for *enough* users
> of address_spaces to make it worthwhile.
>
> If it was only being used for, say, blockdev inodes then we could
> perhaps take it out and hash for it, but there are a ton of
> filesystems out there...
That's the thrust of my current work - massaging things into a form where
struct page can be substituted for buffer_head as the block data handle for
the mass of filesystems that use it.
> The main problem I see with this patch series is that it introduces
> a new way of performing writeback while leaving the old way in place.
> The new way is better, I think - it's just a_ops->write_many_pages().
> But at present, there are some address_spaces which support write_many_pages(),
> and others which still use ->writepage() and sync_page_buffers().
>
> This will make VM development harder, because the VM now needs to cope
> with the nice, uniform, does-clustering-for-you writeback as well as
> the crufty old write-little-bits-of-crap-all-over-the-disk writeback :)
>
> I need to give the VM a uniform way of performing writeback for
> all address_spaces. My current thinking there is that all
> address_spaces (even the non-delalloc, buffer_head-backed ones)
> need to be taught to perform multipage clustered writeback
> based on the address_space, not the dirty buffer LRU.
>
> This is pretty deep surgery. If it can be made to work, it'll
> be nice - it will heavily deprecate the buffer_head layer and will
> unify the current two-or-three different ways of performing
> writeback (I've already unified all ways of performing writeback
> for delalloc filesystems - my version of kupdate writeback, bdflush
> writeback, vm-writeback and write(2) writeback are all unified).
For me, the missing piece of the puzzle is how to recover the semantics of
->b_flushtime. The crude solution is just to put that in struct page for
now. At least that's a wash in terms of size because ->buffers goes out.
It's not right for the long term though, because flushtime is only needed for
cache-under-io. It's annoying to bloat up all pages for the sake of the
block cache. Some kind of external structure is needed that can capture both
flushtime ordering and physical ordering, and be readily accessible from
struct page without costing a field in struct page. Right now that structure
is the buffer dirty list, though that fails the requirement of not bloating
the struct page. It falls short by other measures as well, such as offering
no good way to search for physical adjacency for the purpose of clustering.
I don't know what kind of thing I'm searching for here, I thought I'd dump
the problem on you and see what happens.
--
Daniel
Daniel Phillips wrote:
>
> That's the thrust of my current work - massaging things into a form where
> struct page can be substituted for buffer_head as the block data handle for
> the mass of filesystems that use it.
>
> ...
>
> For me, the missing piece of the puzzle is how to recover the semantics of
> ->b_flushtime. The crude solution is just to put that in struct page for
> now. At least that's a wash in terms of size because ->buffers goes out.
I'm currently doing that in struct address_space(!). Maybe struct inode
would make more sense...
So the mapping records the time at which it was first dirtied. So the
`kupdate' function simply writes back all files which had their
first-dirtying time between 30 and 35 seconds ago.
That works OK, but it also needs to cope with the case of a single
huge dirty file. For that case, periodic writeback also terminates
when it has written back 1/6th of all the dirty pages in the machine.
This is all fairly arbitrary, and is basically designed to map onto
the time-honoured behaviour. I haven't observed any surprises from
it, nor any reason to change it.
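As a model (the numbers are from above; the structure is invented):

#define MIN_AGE 30      /* seconds */
#define MAX_AGE 35

struct mapping {
        long first_dirtied;     /* when this file was first dirtied */
        long nr_dirty;          /* dirty pages in this mapping */
        struct mapping *next;
};

/* Flush files first dirtied 30-35 seconds ago, but give up after
 * 1/6th of the machine's dirty pages (the single-huge-file case). */
static void kupdate_pass(struct mapping *list, long now, long total_dirty)
{
        long budget = total_dirty / 6;
        struct mapping *m;

        for (m = list; m && budget > 0; m = m->next) {
                long age = now - m->first_dirtied;

                if (age >= MIN_AGE && age <= MAX_AGE) {
                        long n = m->nr_dirty < budget ? m->nr_dirty : budget;
                        /* ...write back n pages of m here... */
                        budget -= n;
                }
        }
}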
> It's not right for the long term though, because flushtime is only needed for
> cache-under-io. It's annoying to bloat up all pages for the sake of the
> block cache. Some kind of external structure is needed that can capture both
> flushtime ordering and physical ordering, and be readily accessible from
> struct page without costing a field in struct page. Right now that structure
> is the buffer dirty list, though that fails the requirement of not bloating
> the struct page. It falls short by other measures as well, such as offering
> no good way to search for physical adjacency for the purpose of clustering.
>
> I don't know what kind of thing I'm searching for here, I thought I'd dump
> the problem on you and see what happens.
See above...
We also need to discuss writeback of metadata. For delayed allocate
files, indirect blocks are not a problem, because I start I/O against
them *immediately*, as soon as they're dirtied. This is because we
know that the indirect's associated data blocks are also under I/O.
Which leaves bitmaps and inode blocks. These I am leaving on the
dirty buffer LRU, so nothing has changed there.
Now, I think it's fair to say that the ext2/ext3 inter-file fragmentation
issue is one of the three biggest performance problems in Linux. (The
other two being excessive latency in the page allocator due to VM writeback
and read latency in the I/O scheduler).
The fix for interfile fragmentation lies inside ext2/ext3, not inside
any generic layers of the kernel. And this really is a must-fix,
because the completion time for writeback is approximately proportional
to the size of the filesystem. So we're getting, what? Fifty percent
slower per year?
The `tar xfz linux.tar.gz ; sync' workload can be sped up 4x-5x by
using find_group_other() for directories. I spent a week or so
poking at this when it first came up. Basically, *everything*
which I did to address the rapid-growth problem ended up penalising
the slow-growth fragmentation - long-term intra-file fragmentation
suffered at the expense of short-term inter-file fragmentation.
So I ended up concluding that we need online defrag to fix the
intra-file fragmentation. Then this would *enable* the death
of find_group_dir(). I do have fully-journalled, cache-coherent,
crash-proof block relocation coded for ext3. It's just an ioctl:
int try_to_relocate_page(int fd, long blocks[]);
But I haven't sat down and thought about the userspace application
yet. Which is kinda dumb, because the payback from this will be
considerable.
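If it existed, the sweep side might start out as simply as this
(everything below is invented apart from the ioctl above):

/* Stub standing in for the ioctl wrapper; the real call asks the
 * filesystem to move the page's blocks, journalled and
 * cache-coherent, and is free to decline. */
static int try_to_relocate_page(int fd, long blocks[])
{
        (void)fd;
        (void)blocks;
        return 0;
}

/* Sweep a file page by page, proposing a better home for each page's
 * blocks.  Choosing good targets is the actual hard part. */
static int defrag_file(int fd, long nr_pages)
{
        long blocks[8] = { 0 };         /* proposed target blocks */
        long pg;

        for (pg = 0; pg < nr_pages; pg++)
                if (try_to_relocate_page(fd, blocks) < 0)
                        return -1;
        return 0;
}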
So. Conclusion: periodic writeback is based on inode-dirty-time,
with a limit on the number of pages. Buffer LRU writeback for
bitmaps and inodes is OK, but we need to fix the directory
allocator.
-
On Wed, Mar 13, 2002 at 11:50:32AM -0800, Andrew Morton wrote:
> Now, I think it's fair to say that the ext2/ext3 inter-file fragmentation
> issue is one of the three biggest performance problems in Linux. (The
> other two being excessive latency in the page allocator due to VM writeback
> and read latency in the I/O scheduler).
>
> The fix for interfile fragmentation lies inside ext2/ext3, not inside
> any generic layers of the kernel. And this really is a must-fix,
> because the completion time for writeback is approximately proportional
> to the size of the filesystem. So we're getting, what? Fifty percent
> slower per year?
>
> The `tar xfz linux.tar.gz ; sync' workload can be sped up 4x-5x by
> using find_group_other() for directories. I spent a week or so
> poking at this when it first came up. Basically, *everything*
> which I did to address the rapid-growth problem ended up penalising
> the slow-growth fragmentation - long-term intra-file fragmentation
> suffered at the expense of short-term inter-file fragmentation.
I know ReiserFS has similar problems.
Can anyone say whether JFS or XFS has this problem also?
On March 13, 2002 08:50 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > That's the thrust of my current work - massaging things into a form where
> > struct page can be substituted for buffer_head as the block data handle
> > for the mass of filesystems that use it.
> > ...
> >
> > For me, the missing piece of the puzzle is how to recover the semantics of
> > ->b_flushtime. The crude solution is just to put that in struct page for
> > now. At least that's a wash in terms of size because ->buffers goes out.
>
> I'm currently doing that in struct address_space(!). Maybe struct inode
> would make more sense...
I don't know, when do we ever use an address_space that's not attached to an
inode? Hmm, swapper_space. Though nobody does it now, it's also possible to
have more than one address_space per inode. This confuses the issue because
you don't know whether to flush them together, separately, or what. By the
time things get this murky, it's time for the VM to step back and let the
filesystem itself establish the flushing policy. No, we don't have any model
for how to express that, and we need one. What you're developing here is
a generic_flush, mainly for use by dumb filesystems, and by lucky accident,
also suitable for Ext3.
So, hrm, the sky won't fall in either place you put it.
> So the mapping records the time at which it was first dirtied. So the
> `kupdate' function simply writes back all files which had their
> first-dirtying time between 30 and 35 seconds ago.
I guess you have to be careful to set the first-dirtying time again after
submitting all the IO, in case the inode got dirtied again while you were
busy submitting.
> That works OK, but it also needs to cope with the case of a single
> huge dirty file. For that case, periodic writeback also terminates
> when it has written back 1/6th of all the dirty pages in the machine.
You need a way of cycling through all the inodes on the system reliably,
otherwise you'll get nasty situations where repeated dirtying starves some
inodes of updates. This has to have file offset resolution, otherwise more
flush starvation corner cases will start crawling out of the woodwork.
The 1/6th rule is an oversimplification; it should at least be based on how
much flush IO is already in flight, and on other, more difficult measures we
haven't even started to address yet, such as how much and what kind of other
IO is competing for the same bandwidth, and how much bandwidth is available.
> This is all fairly arbitrary, and is basically designed to map onto
> the time-honoured behaviour. I haven't observed any surprises from
> it, nor any reason to change it.
It's surprisingly resistant to flaming. The starvation problem is going to
get ugly; it's just as hard as the elevator starvation question, and the
crude, inode-level resolution of the flushtime makes it tricky. But I think
it can be beaten into some kind of shape where the corner cases are seen to
be bounded.
I don't know, I need to think about it more. It's both convenient and
strange not to maintain per-block flushtime.
> We also need to discuss writeback of metadata. For delayed allocate
> files, indirect blocks are not a problem, because I start I/O against
> them *immediately*, as soon as they're dirtied. This is because we
> know that the indirect's associated data blocks are also under I/O.
>
> Which leaves bitmaps and inode blocks. These I am leaving on the
> dirty buffer LRU, so nothing has changed there.
Easy, give them an address_space.
[snip fascinating/timesucking sortie into online defrag]
--
Daniel
--On Monday, March 11, 2002 22:00:57 -0800 Andrew Morton <[email protected]> wrote:
> "help, help - there's no point in just one guy testing this" (thanks Randy).
Will you accept the testing of a gal? ;)
>
> This is an update of the delayed-allocation and multipage pagecache I/O
> patches. I'm calling this a beta, because it all works, and I have
> other stuff to do for a while.
>
Here are the dbench throughput results on an 8-way SMP with 2GB memory.
These are run with 64 and then 128 clients, 15 times each, averaged. It looked
pretty good.
Running with more than 180 clients caused the system to hang; after
a reset there was much filesystem corruption. This happened twice, probably
related to filling up disk space. There are no ServerRaid drivers for 2.5 yet,
so the biggest disks on this system are unusable. lockmeter results are
forthcoming (a day or two).
Running dbench on an 8-way SMP 15 times each.
2.5.6 clean
Clients    Avg
 64        37.9821
128        29.8258
2.5.6 with everything.patch
Clients    Avg
 64        41.0204
128        30.6431
Hanna
Hanna Linder wrote:
>
> --On Monday, March 11, 2002 22:00:57 -0800 Andrew Morton <[email protected]> wrote:
>
> > "help, help - there's no point in just one guy testing this" (thanks Randy).
>
> Will you accept the testing of a gal? ;)
With alacrity :) Thanks.
> >
> > This is an update of the delayed-allocation and multipage pagecache I/O
> > patches. I'm calling this a beta, because it all works, and I have
> > other stuff to do for a while.
> >
>
> Here are the dbench throughput results on an 8-way SMP with 2GB memory.
> These are run with 64 and then 128 clients, 15 times each, averaged. It looked
> pretty good.
> Running with more than 180 clients caused the system to hang; after
> a reset there was much filesystem corruption. This happened twice, probably
> related to filling up disk space.
It could be space-related. A couple of gigs should have been plenty..
One other possible explanation is to do with radix-tree pagecache.
It has to allocate memory to add nodes to the tree. When these
allocations start failing due to out-of-memory, the VM will keep
on calling swap_out() a trillion times without noticing that it
didn't work out. But if this happened, you would have seen a huge
number of "0-order allocation failed" messages.
> There are no ServerRaid drivers for 2.5 yet,
> so the biggest disks on this system are unusable. lockmeter results are forthcoming (a day or two).
>
> Running dbench on an 8-way SMP 15 times each.
>
> 2.5.6 clean
>
> Clients Avg
>
> 64 37.9821
> 128 29.8258
>
> 2.5.6 with everything.patch
>
> Clients Avg
>
> 64 41.0204
> 128 30.6431
>
That's odd. I'm showing 50% increases in dbench throughput. Not
that there's anything particularly clever about that - these patches
allow the kernel to just throw more memory in dbench's direction, and
it likes that. But it does indicate that something funny is up.
I'll take a closer look - thanks again.
-
--On Monday, March 18, 2002 12:14:27 -0800 Andrew Morton <[email protected]> wrote:
> One other possible explanation is to do with radix-tree pagecache.
> It has to allocate memory to add nodes to the tree. When these
> allocations start failing due to out-of-memory, the VM will keep
> on calling swap_out() a trillion times without noticing that it
> didn't work out. But if this happened, you would have seen a huge
> number of "0-order allocation failed" messages.
Yes, I did see a huge number of those messages. It also died
on 2.5.6 clean though. I chalked it up to 2.5 instability.
Will test again when things calm down. Any chance you will
backport to 2.4?
Glad to help.
Hanna
Hanna Linder wrote:
>
> --On Monday, March 18, 2002 12:14:27 -0800 Andrew Morton <[email protected]> wrote:
>
> > One other possible explanation is to do with radix-tree pagecache.
> > It has to allocate memory to add nodes to the tree. When these
> > allocations start failing due to out-of-memory, the VM will keep
> > on calling swap_out() a trillion times without noticing that it
> > didn't work out. But if this happened, you would have seen a huge
> > number of "0-order allocation failed" messages.
>
> Yes, I did see a huge number of those messages.
OK. Probably it would have eventually recovered, because there
would have been *some* I/O queued up somewhere. Of course, it
would recover a damn sight faster if I hadn't added those
printk's in the page allocator :)
> It also died
> on 2.5.6 clean though. I chalked it up to 2.5 instability.
mm. 2.5.6 is stable in my testing. PIIX4 IDE and aic7xxx SCSI.
So maybe a driver problem, maybe a highmem problem?
> Will test again when things calm down. Any chance you will
> backport to 2.4?
I don't really plan to do that. There's still quite a lot more stuff
that needs doing, and it'll end up a huuuuge patch. Plus a fair bit of
the value isn't there, because 2.4 doesn't have BIOs.
-
2.5.6 with Andrew's everything patch on ext2
filesystem mounted with delalloc came up with
these MB/second on k6-2/475 with IDE disk:
dbench 128
  2.5.6    2.5.6-akpme    % faster
  8.4      12.5           48
tiobench seq reads (8 - 128 threads avg 3 runs)
  2.5.6    2.5.6-akpme    % faster
  9.36     12.97          38
tiobench seq writes (8 - 128 threads avg 3 runs)
  2.5.6    2.5.6-akpme    % faster
  15.3     19.29          26
Both kernels needed reiserfs patches for 2.5.6, but the
above tests are on ext2.
More on these tests and a few other akpm patches at:
http://home.earthlink.net/~rwhron/kernel/akpm.html
--
Randy Hron