2002-03-01 08:32:20

by Andrew Morton

Subject: [patch] delayed disk block allocation


A bunch of patches which implement allocate-on-flush for 2.5.6-pre1 are
available at

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-10-core.patch
- Core functions
and
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-20-ext2.patch
- delalloc implementation for ext2.

Also,
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-30-ratcache.patch
- Momchil Velikov/Christoph Hellwig radix-tree pagecache ported onto delalloc.


With "allocate on flush", (aka delayed allocation), file data is
assigned a disk mapping when the data is being written out, rather than
at write(2) time. This has the following advantages:

- Less disk fragmentation

- By deferring writeout, we can send larger amounts of data into
the filesystem, which gives the fs a better chance of laying the
blocks out contiguously.

This defeats the current (really bad) performance bug in ext2
and ext3 wherein the blocks for two streaming files are
intermingled in AAAAAAAABBBBBBBBAAAAAAAABBBBBBBB manner (ext2) and
ABABABABAB manner (ext3).

- Short-lived files *never* get a disk mapping, which prevents
them from fragmenting disk free space.

- Faster

- Enables complete bypass of the buffer_head layer (that's the
next patch series).

- Less work to do for short-lived files

- Unifies the handling of shared mappings and write(2) data.


The patches as they stand work fine, but they enable much more. I'd be
interested in hearing some test results of this plus the mpio patch on
big boxes.

Delayed allocate for ext2 is turned on with the `-o delalloc' mount
option. Without this, nothing much has changed (well, things will be a
bit faster, but not much).

The `MPIO_DEBUG' define in mm.h should be set to zero prior to running
any benchmarking.


There is a reservation API by which the kernel tells the filesystem to
reserve space for the delalloc page. It is completely impractical to
perform exact reservations for ext2, at least. So what the code does
is to take worst-case reservations based on the page's file offset, and
to force a writeback when ENOSPC is encountered. The writeback
collapses all existing reservations into real disk mappings. If we're
still out of space, we're really out of space. This happens within
delalloc_prepare_write() at present. I'll be moving it elsewhere soon.
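
To make the shape of that concrete, here is a minimal sketch of a
worst-case reservation for an ext2-style indirect-block filesystem.
The helper names (try_reserve_blocks, force_delalloc_writeback) and the
exact accounting are illustrative assumptions, not the code in
dalloc-10-core.patch:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/errno.h>

/* Illustrative stand-ins for whatever the core patch provides */
extern int try_reserve_blocks(struct super_block *sb, unsigned int nr);
extern void force_delalloc_writeback(struct super_block *sb);

/*
 * Worst case for one page: every block the page covers, plus up to
 * three levels of indirect blocks which might have to be allocated
 * to map it.
 */
static unsigned int worst_case_blocks_for_page(struct inode *inode)
{
        unsigned int blocks_per_page =
                PAGE_CACHE_SIZE >> inode->i_sb->s_blocksize_bits;

        return blocks_per_page + 3;
}

static int delalloc_reserve_page(struct inode *inode)
{
        unsigned int nr = worst_case_blocks_for_page(inode);

        if (try_reserve_blocks(inode->i_sb, nr))
                return 0;
        /*
         * Apparent ENOSPC: collapse the outstanding worst-case
         * reservations into exact disk mappings by forcing writeback,
         * then retry.  Only then is it real ENOSPC.
         */
        force_delalloc_writeback(inode->i_sb);
        if (try_reserve_blocks(inode->i_sb, nr))
                return 0;
        return -ENOSPC;
}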

There are new writeback mechanisms - an adaptively-sized pool of
threads called "pdflush" is available for writeback. These perform the
kupdate and bdflush functions for dirty pages. They are designed to
avoid the situation where bdflush gets stuck on a particular device,
starving writeout for other devices. pdflush should provide increased
writeout bandwidth on many-disk machines.

There are a minimum of two instances of pdflush. Additional threads
are created and destroyed on-demand according to a
simple-but-seems-to-work algorithm, which is described in the code.
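
For illustration only, one "simple but seems to work" policy of this
kind might look like the sketch below; the constants and decision
points are assumptions, not necessarily the heuristic in the patch:

#include <linux/sched.h>
#include <linux/timer.h>

#define MIN_PDFLUSH_THREADS     2
#define MAX_PDFLUSH_THREADS     8

enum pool_action { POOL_NOTHING, POOL_SPAWN, POOL_EXIT };

/*
 * Evaluated by a worker each time it finishes a writeback job.
 *
 * nr_threads        -- current size of the pool
 * idle_threads      -- how many threads are currently waiting for work
 * all_busy_since    -- jiffies value at which the last idle thread went busy
 * oldest_idle_since -- jiffies value at which the longest-idle thread
 *                      went to sleep (ignored if no thread is idle)
 */
static enum pool_action pool_policy(int nr_threads, int idle_threads,
                                    unsigned long all_busy_since,
                                    unsigned long oldest_idle_since)
{
        /* Saturated for a second or more: grow the pool, up to a cap. */
        if (idle_threads == 0 && nr_threads < MAX_PDFLUSH_THREADS &&
            time_after(jiffies, all_busy_since + HZ))
                return POOL_SPAWN;

        /* Someone idle for a second or more: shrink the pool. */
        if (idle_threads > 0 && nr_threads > MIN_PDFLUSH_THREADS &&
            time_after(jiffies, oldest_idle_since + HZ))
                return POOL_EXIT;

        return POOL_NOTHING;
}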

The pdflush threads are used to perform writeback within shrink_cache,
thus making kswapd almost non-blocking.

Global accounting of locked and dirty pages has been introduced. This
permits accurate writeback/throttle decisions in balance_dirty_pages().
Testing is showing considerable improvements in system tractability
under heavy load, while approximately doubling heavy dbench throughput.
Other benchmarks are pretty much unchanged, apart from those which are
affected by file fragmentation, which show improvement.
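
A rough sketch of what that accounting can look like, using the
page_states[] name which comes up later in the thread; the fields, the
helpers and the 40% threshold here are illustrative assumptions rather
than the patch's actual code:

#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/cache.h>

struct page_state {
        unsigned long nr_dirty;         /* dirty pagecache pages */
        unsigned long nr_locked;        /* locked / under-writeout pages */
} ____cacheline_aligned;

static struct page_state page_states[NR_CPUS];

/* Cheap per-CPU increment on the fast path... */
static inline void inc_nr_dirty(void)
{
        page_states[smp_processor_id()].nr_dirty++;
}

/* ...and a good-enough global sum when a throttling decision is due. */
static unsigned long total_dirty_pages(void)
{
        unsigned long sum = 0;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                sum += page_states[cpu].nr_dirty;
        return sum;
}

/* balance_dirty_pages()-style check: heavy writers throttle themselves. */
static int over_dirty_threshold(void)
{
        return total_dirty_pages() > num_physpages * 40 / 100;
}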

With this patch, writepage() is still using the buffer layer, so lock
contention will still be high.

A few performance glitches in the dirty-data writeout path have been
fixed.

The PG_locked and PG_dirty flags have been renamed to prevent people
from using them, which would bypass locked- and dirty-page accounting.

A number of functions in fs/inode.c have been renamed. We have a huge
and confusing array of sync_foo() functions in there. I've attempted
to differentiate between `writeback_foo()', which starts I/O, and
`sync_foo()', which starts I/O and waits on it. It's still a bit of a
mess, and needs revisiting.

The ratcache patch removes generic_buffer_fdatasync() from the kernel.
Nothing in the tree is using it.

Within the VM, the concept of ->writepage() has been replaced with the
concept of "write back a mapping". This means that rather than writing
back a single page, we write back *all* dirty pages against the mapping
to which the LRU page belongs.

Despite its crudeness, this actually works rather well. And it's
important, because disk blocks are allocated at ->writepage time, and
we *need* to write out good chunks of data, in ascending file offset
order so that the files are laid out well. Random page-at-a-time
sprinkliness won't cut it.

A simple future refinement is to change the API to be "write back N
pages around this one, including this one". At present, I'll have to
pull a number out of the air (128 pages?). Some additional
intelligence from the VM may help here.
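
As a very rough sketch of what such a helper could look like -- the
window constant and the writeout helper are hypothetical, and the real
code would have to follow the (renamed) locked/dirty page protocol:

#include <linux/mm.h>
#include <linux/pagemap.h>

#define WRITE_AROUND_PAGES      128     /* the number pulled from the air */

extern void delalloc_writeout_page(struct page *page);  /* hypothetical */

static void write_pages_around(struct page *target)
{
        struct address_space *mapping = target->mapping;
        unsigned long index, start, end;

        /* An aligned window keeps successive calls from overlapping much */
        start = target->index & ~((unsigned long)WRITE_AROUND_PAGES - 1);
        end = start + WRITE_AROUND_PAGES;

        /* Ascending file offset order, so blocks get allocated contiguously */
        for (index = start; index < end; index++) {
                struct page *page = find_get_page(mapping, index);

                if (!page)
                        continue;
                if (PageDirty(page))
                        delalloc_writeout_page(page);
                page_cache_release(page);
        }
}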

Or not. It could be that writing out all the mapping's pages is
always the right thing to do - it's what bdflush is doing at present,
and it *has* to have the best bandwidth. But it may come unstuck
when applied to swapcache.


Things which must still be done include:

- Closely review the ratcache patch. I fixed several fairly fatal
bugs in it, and it works just fine now. But it seems I was working
from an old version. Still, it would benefit from a careful
walkthrough. This version might be flakey in the tmpfs/shmfs area.

- Remove bdflush and kupdate - use the pdflush pool to provide these
functions.

- Expose the three writeback tunables to userspace (/proc/sys/vm/pdflush?)

- Use pdflush for try_to_sync_unused_inodes(), to stop the keventd
abuse.

- Move the page_cache_size accounting into the per-cpu accumulators.

- Use delalloc for ext2 directories and symlinks.

- Throttle tasks which are dirtying pages via shared mappings.

- Make preallocation and quotas play together.

- Implement non-blocking try_to_free_buffers in the VM for buffers
against delalloc filesystems.

Overall system behaviour is noticeably improved by these patches,
but memory-requesters can still occasionally get blocked for a long
time in sync_page_buffers(). Which is fairly pointless, because the
amount of non-dirty-page memory which is backed by buffers is tiny.
Just hand this work off to pdflush if the backing filesystem is
delayed-allocate.

- Verify that the new APIs and implementation are suitable for XFS.

- Prove the API by implementing delalloc on other filesystems.

Looks to be fairly simple for reiserfs and ext3. But implementing
multi-page no-buffer_head pagecache writeout will be harder for these
filesystems.


Nice-to-do things include:

- Maybe balance_dirty_state() calculations for buffer_head based
filesystems can take into account locked and dirty page counts to
make better flush/throttling decisions.

- Turn swap space into a delayed allocate address_space. Allocation
of swapspace at ->writepage() time should provide improved swap
bandwidth.

- Unjumble the writeout order.

In the current code (as in the current 2.4 and 2.5 kernels), disk
blocks are laid out in the order in which the application wrote(2)
them. So files which are created by lseeky applications are all
jumbled up.

I can't really see a practical solution for this in the general
case, even with radix-tree pagecache. And it may be that we don't
*need* a solution, because it'll often be the case that the read
pattern for the file is also lseeky.

But when the "write some pages around this one" function is
implemented, it will perform this unjumbling. It'll be OK to
implement this by probing the pagecache, or by walking the radix
tree.

- Don't perform synchronous I/O (reads) inside lock_super() on ext2.
Massive numbers of threads get piled up on the superblock lock when
using silly workloads.


2002-03-04 02:13:51

by Daniel Phillips

Subject: Re: [patch] delayed disk block allocation

On March 1, 2002 09:26 am, Andrew Morton wrote:
> A bunch of patches which implement allocate-on-flush for 2.5.6-pre1 are
> available at
>
>
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-10-core.patch
> - Core functions
> and
>
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-20-ext2.patch
> - delalloc implementation for ext2.
>
> Also,
>
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-30-ratcache.patch
> - Momchil Velikov/Christoph Hellwig radix-tree pagecache ported onto
> delalloc.

Wow, this is massive. Why did you write [patch] instead of [PATCH]? ;-) I'm
surprised there aren't any comments on this patch so far, that should teach
you to post on a Friday afternoon.

> Global accounting of locked and dirty pages has been introduced.

Alan seems to be working on this as well. Besides locked and dirty we also
have 'pinned', i.e., pages that somebody has taken a use count on, beyond the
number of pte+pcache references.

I'm just going to poke at a couple of inconsequential things for now, to show
I've read the post. In general this is really important work because it
starts to move away from the vfs's current dumb filesystem orientation.

I doubt all the subproblems you've addressed are tackled in the simplest
possible way, and because of that it's a cinch Linus isn't just going to
apply this. But hopefully the benchmarking team descend upon this and find
out if it does/doesn't suck, and hopefully you plan to maintain it through 2.5.

> Testing is showing considerable improvements in system tractability
> under heavy load, while approximately doubling heavy dbench throughput.
> Other benchmarks are pretty much unchanged, apart from those which are
> affected by file fragmentation, which show improvement.

What is system tractability?

> With this patch, writepage() is still using the buffer layer, so lock
> contention will still be high.

Right, and buffers are going away one way or another.

> A number of functions in fs/inode.c have been renamed. We have a huge
> and confusing array of sync_foo() functions in there. I've attempted
> to differentiate between `writeback_foo()', which starts I/O, and
> `sync_foo()', which starts I/O and waits on it. It's still a bit of a
> mess, and needs revisiting.

Yup.

> Within the VM, the concept of ->writepage() has been replaced with the
> concept of "write back a mapping". This means that rather than writing
> back a single page, we write back *all* dirty pages against the mapping
> to which the LRU page belongs.

This is a good and natural step, but don't we want to go even more global
than that and look at all the dirty data on a superblock, so the decision on
what to write out is optimized across files for better locality?

> Despite its crudeness, this actually works rather well. And it's
> important, because disk blocks are allocated at ->writepage time, and
> we *need* to write out good chunks of data, in ascending file offset
> order so that the files are laid out well. Random page-at-a-time
> sprinkliness won't cut it.

Right.

> A simple future refinement is to change the API to be "write back N
> pages around this one, including this one". At present, I'll have to
> pull a number out of the air (128 pages?). Some additional
> intelligence from the VM may help here.
>
> Or not. It could be that writing out all the mapping's pages is
> always the right thing to do - it's what bdflush is doing at present,
> and it *has* to have the best bandwidth.

This has the desirable result of deferring the work on how best to "write
back N pages around this one, including this one" above. I really think that
if we sit back and let the vfs evolve on its own a little more, that that
question will get much easier to answer.

> But it may come unstuck when applied to swapcache.

You're not even trying to apply this to swap cache right now are you?

> Things which must still be done include:
>
> [...]
>
> - Remove bdflush and kupdate - use the pdflush pool to provide these
> functions.

The main disconnect there is sub-page sized writes; you will bundle together
young and old 1K buffers. Since it's getting harder to find a 1K blocksize
filesystem, we might not care. There is also my nefarious plan to make
struct pages refer to variable-binary-sized objects, including smaller than
4K PAGE_SIZE.

> - Expose the three writeback tunables to userspace (/proc/sys/vm/pdflush?)

/proc/sys/vm/pageflush <- we know the pages are dirty

> - Use pdflush for try_to_sync_unused_inodes(), to stop the keventd
> abuse.

Could you explain please?

> - Verify that the new APIs and implementation are suitable for XFS.

Hey, I was going to mention that :-) The question is, how long will it be
before vfs-level delayed allocation is in 2.5? Steve might see that as a
way to get infinitely sidetracked.

Well, I used up way more than the time budget I have for reading your patch
and post, and I barely scratched the surface. I hope the rest of the
filesystem gang gets involved at this point, or maybe not. Cooks/soup.

I guess the thing to do is start thinking about parts that can be broken out
because of obvious correctness. The dirty/locked accounting would be one
candidate, the multiple flush threads another, and I'm sure there are more
because you don't seem to have treated much as sacred ;-)

Something tells me the vfs is going to look quite different by the end
of 2.6.

--
Daniel

2002-03-04 03:10:18

by Jeff Garzik

Subject: Re: [patch] delayed disk block allocation

Daniel Phillips wrote:
> On March 1, 2002 09:26 am, Andrew Morton wrote:
> > A bunch of patches which implement allocate-on-flush for 2.5.6-pre1 are
> > available at http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-10-core.patch
> > - Core functions
> > and
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-20-ext2.patch
> > - delalloc implementation for ext2.

> Wow, this is massive. Why did you write [patch] instead of [PATCH]? ;-) I'm
> surprised there aren't any comments on this patch so far, that should teach
> you to post on a Friday afternoon.

My only comment is: how fast can we get delalloc into 2.5.x for further
testing and development?

IMNSHO there are few comments because I believe that few people actually
realize the benefits of delalloc. My ext2 filesystem with --10--
percent fragmentation could sure use code like this, though.


> > But it may come unstuck when applied to swapcache.
>
> You're not even trying to apply this to swap cache right now are you?

This is a disagreement akpm and I have, actually :)

I actually would rather that it was made a requirement that all
swapfiles are "dense", so that new block allocation NEVER must be
performed when swapping.


> There is also my nefarious plan to make
> struct pages refer to variable-binary-sized objects, including smaller than
> 4K PAGE_SIZE.

sigh...

--
Jeff Garzik |
Building 1024 |
MandrakeSoft | Choose life.

2002-03-04 05:04:22

by Mike Fedyk

Subject: Re: [patch] delayed disk block allocation

On Mon, Mar 04, 2002 at 03:08:54AM +0100, Daniel Phillips wrote:
> The main disconnect there is sub-page sized writes, you will bundle together
> young and old 1K buffers. Since it's getting harder to find a 1K blocksize
> filesystem, we might not care.

Please don't do that.

Hopefully, once this is in, 1k blocks will work much better. There are many
cases where people work with lots of small files, and using 1k blocks is bad
enough; 4k would be worse.

Also, with dhash going into ext2/3, lots of tiny files in one dir will be
feasible and comparable with reiserfs.

2002-03-04 05:32:44

by Andreas Dilger

Subject: Re: [patch] delayed disk block allocation

On Mar 03, 2002 21:04 -0800, Mike Fedyk wrote:
> On Mon, Mar 04, 2002 at 03:08:54AM +0100, Daniel Phillips wrote:
> > The main disconnect there is sub-page sized writes, you will bundle together
> > young and old 1K buffers. Since it's getting harder to find a 1K blocksize
> > filesystem, we might not care.
>
> Please don't do that.
>
> Hopefully, once this is in, 1k blocks will work much better. There are many
> cases where people work with lots of small files, and using 1k blocks is bad
> enough, 4k would be worse.
>
> Also, with dhash going into ext2/3 lots of tiny files in one dir will be
> feasible and comparible with reiserfs.

Actually, there are a whole bunch of performance issues with 1kB block
ext2 filesystems. For very small files, you are probably better off
to have tails in EAs stored with the inode, or with other tails/EAs in
a shared block. We discussed this on ext2-devel a few months ago, and
while the current ext2 EA design is totally unsuitable for that, it
isn't impossible to fix.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-03-04 05:39:56

by Mike Fedyk

Subject: Re: [patch] delayed disk block allocation

On Sun, Mar 03, 2002 at 10:31:03PM -0700, Andreas Dilger wrote:
> On Mar 03, 2002 21:04 -0800, Mike Fedyk wrote:
> > On Mon, Mar 04, 2002 at 03:08:54AM +0100, Daniel Phillips wrote:
> > > The main disconnect there is sub-page sized writes, you will bundle together
> > > young and old 1K buffers. Since it's getting harder to find a 1K blocksize
> > > filesystem, we might not care.
> >
> > Please don't do that.
> >
> > Hopefully, once this is in, 1k blocks will work much better. There are many
> > cases where people work with lots of small files, and using 1k blocks is bad
> > enough, 4k would be worse.
> >
> > Also, with dhash going into ext2/3 lots of tiny files in one dir will be
> > feasible and comparible with reiserfs.
>
> Actually, there are a whole bunch of performance issues with 1kB block
> ext2 filesystems. For very small files, you are probably better off
> to have tails in EAs stored with the inode, or with other tails/EAs in
> a shared block. We discussed this on ext2-devel a few months ago, and
> while the current ext2 EA design is totally unsuitable for that, it
> isn't impossible to fix.

Great, we're finally heading toward dual-sized blocks (or clusters, etc.).
I'll be looking forward to that. :)

Do you think it'll look like block tails (like ffs?) or will it be more like
tail packing in reiserfs?

2002-03-04 05:41:15

by Jeff Garzik

Subject: Re: [patch] delayed disk block allocation

Andreas Dilger wrote:
> Actually, there are a whole bunch of performance issues with 1kB block
> ext2 filesystems. For very small files, you are probably better off
> to have tails in EAs stored with the inode, or with other tails/EAs in
> a shared block. We discussed this on ext2-devel a few months ago, and
> while the current ext2 EA design is totally unsuitable for that, it
> isn't impossible to fix.

IMO the ext2 filesystem design is on its last legs ;-) I tend to
think that a new filesystem efficiently handling these features is far
better than dragging ext2 kicking and screaming into the 2002's :)

--
Jeff Garzik |
Building 1024 |
MandrakeSoft | Choose life.

2002-03-04 06:16:24

by Andreas Dilger

Subject: Re: [patch] delayed disk block allocation

On Mar 03, 2002 21:40 -0800, Mike Fedyk wrote:
> On Sun, Mar 03, 2002 at 10:31:03PM -0700, Andreas Dilger wrote:
> > Actually, there are a whole bunch of performance issues with 1kB block
> > ext2 filesystems. For very small files, you are probably better off
> > to have tails in EAs stored with the inode, or with other tails/EAs in
> > a shared block. We discussed this on ext2-devel a few months ago, and
> > while the current ext2 EA design is totally unsuitable for that, it
> > isn't impossible to fix.
>
> Do you think it'll look like block tails (like ffs?) or will it be more like
> tail packing in reiserfs?

It will be like tail packing in reiserfs, I believe. FFS has fixed-sized
'fragments' while reiserfs has arbitrary-sized 'tails'. The ext2 tail
packing will be arbitrary-sized tails, stored as an extended attribute
(along with any other EAs that this inode has). With proper EA design,
you can share multiple inode's EAs in the same block. We also discussed
for very small files that you could store the tail (or other EA data)
within the inode itself.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-03-04 06:23:06

by Andreas Dilger

Subject: Re: [patch] delayed disk block allocation

On Mar 04, 2002 00:41 -0500, Jeff Garzik wrote:
> Andreas Dilger wrote:
> > Actually, there are a whole bunch of performance issues with 1kB block
> > ext2 filesystems. For very small files, you are probably better off
> > to have tails in EAs stored with the inode, or with other tails/EAs in
> > a shared block. We discussed this on ext2-devel a few months ago, and
> > while the current ext2 EA design is totally unsuitable for that, it
> > isn't impossible to fix.
>
> IMO the ext2 filesystem design is on it's last legs ;-) I tend to
> think that a new filesystem efficiently handling these features is far
> better than dragging ext2 kicking and screaming into the 2002's :)

That's why we have ext3 ;-). Given that reiserfs just barely has an
fsck that finally works most of the time, and they are about to re-do
the entire filesystem for reiser-v4 in 6 months, I'd rather stick with
glueing features onto an ext2 core than rebuilding everything from
scratch each time.

Given that ext3, and htree, and all of the other ext2 'hacks' seem to
do very well, I think it will continue to improve for some time to come.
A wise man once said "I'm not dead yet".

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2002-03-04 07:22:35

by Andrew Morton

Subject: Re: [patch] delayed disk block allocation

Daniel Phillips wrote:
>
> ...
> Why did you write [patch] instead of [PATCH]? ;-)

It's a start ;)

>...
> > Global accounting of locked and dirty pages has been introduced.
>
> Alan seems to be working on this as well. Besides locked and dirty we also
> have 'pinned', i.e., pages that somebody has taken a use count on, beyond the
> number of pte+pcache references.

Well one little problem with tree owners writing code is that it's
rather hard for mortals to see what they've done. Because the diffs
come with so much other stuff. Which is my lame excuse for not having
reviewed Alan's work. But if he has global (as opposed to per-mm, per-vma,
etc) data then yes, it can go into the page_states[] array.

> I'm just going to poke at a couple of inconsequential things for now, to show
> I've read the post. In general this is really important work because it
> starts to move away from the vfs's current dumb filesystem orientation.
>
> I doubt all the subproblems you've addressed are tackled in the simplest
> possible way, and because of that it's a cinch Linus isn't just going to
> apply this. But hopefully the benchmarking team descend upon this and find
> out if it does/does't suck, and hopefully you plan to maintain it through 2.5.

The problem with little, incremental patches is that they require
a high degree of planning, trust and design. A belief that the
end outcome will be right. That's hard, and it generates a lot of
talk, and the end outcome may *not* be right.

So in the best of worlds, we have the end outcome in-hand, and testable.
If it works, then we go back to the incremental patches.

> > Testing is showing considerable improvements in system tractability
> > under heavy load, while approximately doubling heavy dbench throughput.
> > Other benchmarks are pretty much unchanged, apart from those which are
> > affected by file fragmentation, which show improvement.
>
> What is system tractability?

Sorry. General usability when the system is under load. With these
patches it's better, but still bad.

Look. A process calls the page allocator to, duh, allocate some pages.
Processes do *not* call the page allocator because they suddenly feel
like spending fifteen seconds asleep on the damned request queue.

We need to throttle the writers and only the writers. We need other tasks
to be able to obtain full benefit of the rate at which the disks can
clean memory.

You know where this is headed, don't you:

- writeout is performed by the writers, and by the gang-of-flush-threads.
- kswapd is 100% non-blocking. It never does I/O.
- kswapd is the only process which runs page_launder/shrink_caches.
- Memory requesters do not perform I/O. They sleep until memory
is available. kswapd gives them pages as they become available, and
wakes them up.

So that's the grand plan. It may be fatally flawed - I remember Linus
had a serious-sounding objection to it some time back, but I forget
what that was. We come badly unstuck if it's a disk-writer who
goes to sleep on the i-want-some-memory queue, but I don't think
it was that.

Still, this is just a VM rant. It's not the objective of this work.


> > With this patch, writepage() is still using the buffer layer, so lock
> > contention will still be high.
>
> Right, and buffers are going away one way or another.

This is a problem. I'm adding new stuff which does old things in
a new way, with no believable plan in place for getting rid of the
old stuff.

I don't think it's humanly possible to do away with struct buffer_head.
It is *the* way of representing a disk block. And unless we plan
to live with 4k pages and 4k blocks for ever, the problem is about
to get worse. Think 64k pages with 4k blocks.

Possibly we could handle sub-page segments of memory via a per-page up-to-date
bitmask. And then a `dirty' bitmask. And then a `locked' bitmask, etc. I
suspect eventually we'll end up with, say, a vector of structures attached to
each page which represents the state of each of the page's sub-segments. whoops.
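
Purely to illustrate the "vector of structures" idea being mused about
here (this is not code from any of the patches):

/* One entry per filesystem-block-sized piece of a (possibly large) page */
struct page_segment {
        unsigned long   flags;          /* uptodate, dirty, locked, ... */
        unsigned long   b_blocknr;      /* disk mapping, once allocated */
};

/* e.g. a 64k page over a 4k-block filesystem would need 16 of these */
struct page_segments {
        unsigned int            nr;
        struct page_segment     seg[0]; /* sized at allocation time */
};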

So as a tool for representing disk blocks - for subdividing individual
pages of the block device's pagecache entries, buffer_heads make sense,
and I doubt if they're going away.

But as a tool for getting bulk file data on and off disk, buffer_heads
really must die. Note how submit_bh() now adds an extra memory allocation
into each buffer as it goes past. Look at some 2.5 kernel profiles....

> ...
> > Within the VM, the concept of ->writepage() has been replaced with the
> > concept of "write back a mapping". This means that rather than writing
> > back a single page, we write back *all* dirty pages against the mapping
> > to which the LRU page belongs.
>
> This is a good and natural step, but don't we want to go even more global
> than that and look at all the dirty data on a superblock, so the decision on
> what to write out is optimized across files for better locality.

Possibly, yes.

The way I'm performing writeback now is quite different from the
2.4 way. Instead of:

for (buffer = oldest; buffer != newest; buffer++)
        write(buffer);

it's

for (superblock = first; superblock != last; superblock++)
        for (dirty_inode = first; dirty_inode != last; dirty_inode++)
                filemap_fdatasync(dirty_inode);

Again, by luck and by design, it turns out that this almost always
works. Care is taken to ensure that the ordering of the various
lists is preserved, and that we end up writing data in program-creation
order. Which works OK, because filesystems allocate inodes and blocks
in the way we expect (and desire).

What you're proposing is that, within the VM, we opportunistically
flush out more inodes - those which neighbour the one which owns
the page which the VM wants to free.

That would work. It doesn't particularly help us in the case where the VM
is trying to get free pages against a specific zone, but it would perhaps
provide some overall bandwidth benefits.

However, I'm kind of already doing this. Note how the VM's wakeup_bdflush()
call also wakes pdflush. pdflush will wake up, walk through all the
superblocks, find one which doesn't currently have a pdflush instance
working it and will start writing back that superblock's dirty pages.

(And the next wakeup_bdflush call will wake another pdflush thread,
which will go off and find a different superblock to sync, which is
in theory tons better than using a single bdflush thread for all dirty
data in the machine. But I haven't demonstrated practical benefit
from this yet).

> ...
>
> > But it may come unstuck when applied to swapcache.
>
> You're not even trying to apply this to swap cache right now are you?

No.

> > Things which must still be done include:
> >
> > [...]
> >
> > - Remove bdflush and kupdate - use the pdflush pool to provide these
> > functions.
>
> The main disconnect there is sub-page sized writes, you will bundle together
> young and old 1K buffers. Since it's getting harder to find a 1K blocksize
> filesystem, we might not care. There is also my nefarious plan to make
> struct pages refer to variable-binary-sized objects, including smaller than
> 4K PAGE_SIZE.

I was merely suggesting a tidy-up here. pdflush provides a dynamically-sized
pool of threads for writing data back to disk. So we can remove the
dedicated kupdate and bdflush kernel threads and replace them with:

wakeup_bdflush()
{
        pdflush_operation(sync_old_buffers, NULL);
}

Additionally, we do need to provide ways of turning the kupdate,
bdflush and pdflush functions off and on. For laptops, swsusp, etc.
But these are really strange interfaces which have sort of crept
up on us over time. In this case we need to go back, work out
what we're really trying to do here and provide a proper set of
interfaces. Rather than `kill -STOP $(pidof kupdate)' or whatever
the heck people are using.

> ...
> > - Use pdflush for try_to_sync_unused_inodes(), to stop the keventd
> > abuse.
>
> Could you explain please?

keventd is a "process context bottom half handler". It should provide
the caller with reasonably-good response times. Recently, schedule_task()
has been used for writing ginormous gobs of discontiguous data out to
disk because the VM happened to get itself into a sticky corner.

So it's another little tidy-up. Use the pdflush pool for this operation,
and restore keventd's righteousness.
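
In sketch form, the tidy-up replaces the schedule_task() hand-off with a
hand-off to the writeback pool, assuming the worker function takes the
unsigned long argument that pdflush work functions take in this scheme:

/* Assumed shapes, matching the wakeup_bdflush() example above */
extern int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
extern void try_to_sync_unused_inodes(unsigned long arg);

/*
 * Hand the slow, bulk inode writeback to a pdflush thread rather than
 * queueing it through keventd.
 */
static void kick_unused_inode_writeback(void)
{
        pdflush_operation(try_to_sync_unused_inodes, 0);
}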

> ...
> I guess the thing to do is start thinking about parts that can be broken out
> because of obvious correctness. The dirty/locked accounting would be one
> candidate, the multiple flush threads another, and I'm sure there are more
> because you don't seem to have treated much as sacred ;-)

Yes, that's a reasonable ordering. pdflush is simple and powerful enough
to be useful even if the rest founders - rationalise kupdate, bdflush,
keventd non-abuse, etc. ratcache is ready, IMO. The global page-accounting
is not provably needed yet.

Here's another patch for you:

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre2/dallocbase-10-readahead.patch

It's against 2.5.6-pre2 base. It's a partial redesign and a big
tidy-up of the readahead code. It's largely described in the
comments (naturally).

- Unifies the current three readahead functions (mmap reads, read(2)
and sys_readhead) into a single implementation.

- More aggressive in building up the readahead windows.

- More conservative in tearing them down.

- Special start-of-file heuristics.

- Preallocates the readahead pages, to avoid the (never demonstrated,
but potentially catastrophic) scenario where allocation of readahead
pages causes the allocator to perform VM writeout.

- (hidden agenda): Gets all the readahead pages gathered together in
one spot, so they can be marshalled into big BIOs.

- reinstates the readahead tunables which Mr Dalecki cheerfully chainsawed.
So hdparm(8) and blockdev(8) are working again. The readahead settings
are now per-request-queue, and the drivers never have to know about it.

- Identifies readahead thrashing.

Note "identifies". This is 100% reliable - it detects readahead
thrashing beautifully. It just doesn't do anything useful about
it :(

Currently, I just perform a massive shrink on the readahead window
when thrashing occurs. This greatly reduces the amount of pointless
I/O which we perform, and will reduce the CPU load. But big deal. It
doesn't do anything to reduce the seek load, and it's the seek load
which is the killer here. I have a little test case which goes from
40 seconds with 40 files to eight minutes with 50 files, because the
50 file case invokes thrashing. Still thinking about this one.

- Provides almost unmeasurable throughput speedups!

-

2002-03-04 07:29:40

by Andrew Morton

Subject: Re: [patch] delayed disk block allocation

Jeff Garzik wrote:
>
> ...
> >
> > You're not even trying to apply this to swap cache right now are you?
>
> This is a disagreement akpm and I have, actually :)
>

Misunderstanding, rather.

Swapfiles aren't interesting, IMO. And I agree that mkswap or swapon
should just barf if the file has any holes in it.

But what I refer to here is, simply, delayed allocate for swapspace.
So swap_out() sticks the target page into the swapcache, marks it
dirty and takes a space reservation for the page out of the swapcache's
address_space. But no disk space is allocated at swap_out() time.
Instead, the real disk mapping is created when the VM calls
a_ops->vm_writeback() against the swapcache page's address_space.

All of which rather implies a ripup-and-rewrite of half the swap
code. It would certainly require a new allocator. So I just mentioned
the possibility. Glad you're interested :)
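
A sketch of the flow just described, for shape only -- every helper here
is hypothetical, since (as noted) the real thing would need a rewrite of
half the swap path and a new allocator:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/errno.h>

/* Hypothetical interfaces, named only for the sake of the sketch */
extern int swap_reserve_page(struct address_space *mapping);
extern void add_page_to_swap_cache_unmapped(struct page *page);
extern swp_entry_t swap_slot_allocate(void);
extern void attach_swap_mapping(struct page *page, swp_entry_t entry);
extern int write_swap_page(struct page *page, swp_entry_t entry);

/* swap_out() side: reserve space in the swap address_space, no mapping */
static int delalloc_swap_out(struct page *page)
{
        if (!swap_reserve_page(&swapper_space))
                return -ENOSPC;
        add_page_to_swap_cache_unmapped(page);
        SetPageDirty(page);
        return 0;
}

/* a_ops->vm_writeback() side: only now is a real swap slot allocated */
static int swap_vm_writeback(struct page *page)
{
        swp_entry_t entry = swap_slot_allocate();

        attach_swap_mapping(page, entry);
        return write_swap_page(page, entry);
}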

-

2002-03-04 07:56:12

by Andrew Morton

Subject: Re: [patch] delayed disk block allocation

Andreas Dilger wrote:
>
> Actually, there are a whole bunch of performance issues with 1kB block
> ext2 filesystems.

I don't see any new problems with file tails here. They're catered for
OK. What I have not catered for is file holes. With the delalloc
patches, whole pages are always written out (except for at eof). So
if your file has lots of very small non-holes in it, these will become
larger non-holes.

If we're serious about 64k PAGE_CACHE_SIZE then this becomes more of
a problem. In the worst case, a file which used to consist of

4k block | (1 meg - 4k) hole | 4k block | (1 meg - 4k) hole | ...

will become:

64k block | (1 meg - 64k) hole | 64k block | (1 meg - 64k) hole | ...

Which is a *lot* more disk space. It'll happen right now, if the
file is written via mmap. But with no-buffer-head delayed allocate,
it'll happen with write(2) as well.
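
To put a rough number on it, assuming the 1 meg stride above: a 1 gig
file of that shape holds 1024 data chunks, so it occupies 1024 x 4k =
4 meg of disk with 4k pages, but 1024 x 64k = 64 meg with 64k pages --
sixteen times as much.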

Is such space wastage on sparse files a showstopper? Maybe,
but probably not if there is always at least one filesystem
which handles this scenario well, because it _is_ a specialised
scenario.

-

2002-03-04 08:11:38

by Daniel Phillips

Subject: Re: [patch] delayed disk block allocation

On March 4, 2002 08:53 am, Andrew Morton wrote:
> Andreas Dilger wrote:
> >
> > Actually, there are a whole bunch of performance issues with 1kB block
> > ext2 filesystems.
>
> I don't see any new problems with file tails here. They're catered for
> OK. What I have not catered for is file holes. With the delalloc
> patches, whole pages are always written out (except for at eof). So
> if your file has lots of very small non-holes in it, these will become
> larger non-holes.
>
> If we're serious about 64k PAGE_CACHE_SIZE then this becomes more of
> a problem. In the worst case, a file which used to consist of
>
> 4k block | (1 meg - 4k) hole | 4k block | (1 meg - 4k) hole | ...
>
> will become:
>
> 64k block | (1 meg - 64k) hole | 64k block | (1 meg - 64k hole) | ...
>
> Which is a *lot* more disk space. It'll happen right now, if the
> file is written via mmap. But with no-buffer-head delayed allocate,
> it'll happen with write(2) as well.
>
> Is such space wastage on sparse files a showstopper? Maybe,
> but probably not if there is always at least one filesystem
> which handles this scenario well, because it _is_ a specialised
> scenario.

I guess 4K PAGE_CACHE_SIZE will serve us well for another couple of years,
and in that time I hope to produce a patch that generalizes the notion of
page size so we can use the size that works best for each address_space,
i.e., the same as the filesystem blocksize.

By the way, have you ever seen a sparse 1K blocksize file? I haven't; I
wouldn't get too worked up about treating the holes a little less than
optimally.

--
Daniel

2002-03-04 08:36:44

by Andrew Morton

Subject: Re: [patch] delayed disk block allocation

Daniel Phillips wrote:
>
> ..
> I guess 4K PAGE_CACHE_SIZE will serve us well for another couple of years,

Having reviewed the archives, it seems that the multipage PAGE_CACHE_SIZE
patches which Hugh and Ben were working on were mainly designed to increase
I/O efficiency.

If that's the only reason for large pages then yeah, I think we can stick
with 4k PAGE_CACHE_SIZE :). There really are tremendous efficiencies
available in the current code.

Another (and very significant) reason for large pages is to decrease
TLB misses. Said to be very important for large-working-set scientific
apps and such. But that doesn't seem to have a lot to do with PAGE_CACHE_SIZE?

> ...
> By the way, have you ever seen a sparse 1K blocksize file?
> ...

Sure I have. I just created one. (I'm writing test cases for my
emails now. Sheesh).

-

2002-03-04 09:28:25

by Daniel Phillips

Subject: Re: [patch] delayed disk block allocation

On March 4, 2002 08:20 am, Andrew Morton wrote:
> You know where this is headed, don't you:

Yes I do, because it's more or less a carbon copy of what I had in mind.

> - writeout is performed by the writers, and by the gang-of-flush-threads.
> - kswapd is 100% non-blocking. It never does I/O.
> - kswapd is the only process which runs page_launder/shrink_caches.
> - Memory requesters do not perform I/O. They sleep until memory
> is available. kswapd gives them pages as they become available, and
> wakes them up.
>
> So that's the grand plan. It may be fatally flawed - I remember Linus
> had a serious-sounding objection to it some time back, but I forget
> what that was.

I remember, since he gently roasted me last autumn for thinking such thoughts.
The idea is that by making threads do their own vm scanning they throttle
themselves. I don't think the resulting chaotic scanning behavior is worth
it.

> > > With this patch, writepage() is still using the buffer layer, so lock
> > > contention will still be high.
> >
> > Right, and buffers are going away one way or another.
>
> This is a problem. I'm adding new stuff which does old things in
> a new way, with no believable plan in place for getting rid of the
> old stuff.
>
> I don't think it's humanly possible to do away with struct buffer_head.
> It is *the* way of representing a disk block. And unless we plan
> to live with 4k pages and 4k blocks for ever, the problem is about
> to get worse. Think 64k pages with 4k blocks.

If struct page refers to an object the same size as a disk block then struct
page can take the place of a buffer in the getblk interface. For IO we have
other ways of doing things, kvecs, bio's and so on. We don't need buffers
for that, the only thing we need them for is handles for disk blocks, and
locking thereof.

If you actually had this now your patch would be quite a lot simpler. It's
going to take a while though, because first we have to do active physical
defragmentation, and that requires rmap. So a prototype is a few months away
at least, but six months ago I would have said two years.

> Possibly we could handle sub-page segments of memory via a per-page up-to-date
> bitmask. And then a `dirty' bitmask. And then a `locked' bitmask, etc. I
> suspect eventually we'll end up with, say, a vector of structures attached to
> each page which represents the state of each of the page's sub-segments. whoops.

You could, but that would be a lot messier than what I have in mind. Your
work fits really nicely with that since it gets rid of the IO function of
buffers.

--
Daniel

2002-03-04 14:50:20

by Randy Hron

Subject: Re: [patch] delayed disk block allocation

It's not on a big box, but I have a side-by-side comparison of
2.5.6-pre2 and 2.5.6-pre2 with patches from
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.6-pre2/

2.5.6-pre2-akpm compiled with MPIO_DEBUG = 0 and ext2
filesystem mounted with delalloc.

tiobench and dbench were on ext2.
bonnie++ and most other tests were on reiserfs.

2.5.6-pre2-akpm throughput on ext2 is much improved.

http://home.earthlink.net/~rwhron/kernel/akpm.html

One odd number in lmbench is page fault latency. Lmbench also
showed high page fault latency in 2.4.18-pre9 with make_request,
read-latency2, and low-latency. 2.4.19-pre1aa1 with read_latency2
(2.4.19pre1aa1rl) did not show a bump in page fault latency.

--
Randy Hron

2002-03-04 15:10:21

by Jeff Garzik

Subject: Re: [patch] delayed disk block allocation

Randy Hron wrote:
> 2.5.6-pre2-akpm compiled with MPIO_DEBUG = 0 and ext2
> filesystem mounted with delalloc.
>
> tiobench and dbench were on ext2.
> bonnie++ and most other tests were on reiserfs.
>
> 2.5.6-pre2-akpm throughput on ext2 is much improved.
>
> http://home.earthlink.net/~rwhron/kernel/akpm.html

This page of statistics is pretty gnifty -- I wish more people posted
such :)

I also like your other comparisons at the top of that link... I was
thinking, what would be even neater and more useful would be
to post comparisons -between- the various kernels you have tested, i.e.
a single page that includes benchmarks for 2.4.x-official,
2.5.x-official, -aa, -rmap, etc.

Regards,

Jeff



--
Jeff Garzik |
Building 1024 |
MandrakeSoft | Choose life.

2002-03-05 00:07:21

by Hans Reiser

Subject: Re: [patch] delayed disk block allocation

Andreas Dilger wrote:

>On Mar 04, 2002 00:41 -0500, Jeff Garzik wrote:
>
>>Andreas Dilger wrote:
>>
>>>Actually, there are a whole bunch of performance issues with 1kB block
>>>ext2 filesystems. For very small files, you are probably better off
>>>to have tails in EAs stored with the inode, or with other tails/EAs in
>>>a shared block. We discussed this on ext2-devel a few months ago, and
>>>while the current ext2 EA design is totally unsuitable for that, it
>>>isn't impossible to fix.
>>>
>>IMO the ext2 filesystem design is on it's last legs ;-) I tend to
>>think that a new filesystem efficiently handling these features is far
>>better than dragging ext2 kicking and screaming into the 2002's :)
>>
>
>That's why we have ext3 ;-). Given that reiserfs just barely has an
>fsck that finally works most of the time, and they are about to re-do
>the entire filesystem for reiser-v4 in 6 months, I'd rather stick with
>glueing features onto an ext2 core than rebuilding everything from
>scratch each time.
>
Isn't this that old evolution vs. design argument? ReiserFS is a design,
code, tweak for a while, and then start to design again methodology. We
take novel designs and algorithms, and stick with them until they are
stable production code, and then we go back and do more novel
algorithms. Such a methodology is well suited for doing deep rewrites
at 5 year intervals.

No pain, no gain, or so we think. Rewriting the core is hard work.
Some people get success and then coast. Some people get success and
then accelerate. We're keeping the pedal on the metal. We're throwing
every penny we make into rebuilding the foundation of our filesystem.

ReiserFS V3 is pretty stable right now. Fsck is usually the last thing
to stabilize for a filesystem, and we were no exception to that rule.

ReiserFS V3 will last for probably another 3 years (though it will
remain supported for I imagine at least a decade, maybe more), with V4
gradually edging it out of the market as V4 gets more and more stable.
During at least the next year we will keep some staff adding minor
tweaks to V3. For instance, our layout policies will improve over the
next few months, journal relocation is about to move from Linux 2.5 to
2.4, and data write ordering code is being developed by Chris. I don't
know how long fsck maintenance for V3 will continue, but it will be at
least as long as users find bugs in it.

The biggest reason we are writing V4 from scratch is that it is a thing
of beauty. If you don't understand this, words will not explain it.

>
>
>Given that ext3, and htree, and all of the other ext2 'hacks' seem to
>do very well, I think it will continue to improve for some time to come.
>A wise man once said "I'm not dead yet".
>
>Cheers, Andreas
>--
>Andreas Dilger
>http://sourceforge.net/projects/ext2resize/
>http://www-mddsp.enel.ucalgary.ca/People/adilger/
>



2002-03-07 12:06:25

by Etienne Lorrain

Subject: Re: [patch] delayed disk block allocation

> With "allocate on flush", (aka delayed allocation), file data is
> assigned a disk mapping when the data is being written out, rather than
> at write(2) time. This has the following advantages:

I do agree that this is a better solution than the current one, but
(even if I have not had time to test the patch) I have a question:
how about bootloaders?

IMHO all current bootloaders need to write to disk a "chain" of sectors
to load for their own initialisation, i.e. loading the remaining part
of the code, stored in a file on one filesystem, from the 512-byte
bootcode. This "chain" of sectors can only be known once the file has
been allocated on disk - and it has to be written into the same file,
at its allocated location.

So can you upgrade LILO or GRUB with your patch installed?
It is not such a big problem (the solution being to install the
bootloader on an unmounted filesystem with tools like e2fsprogs),
but it seems incompatible with the current executables.

Comment?
Etienne.


2002-03-07 14:47:57

by Steve Lord

Subject: Re: [patch] delayed disk block allocation

On Thu, 2002-03-07 at 06:06, Etienne Lorrain wrote:
> > With "allocate on flush", (aka delayed allocation), file data is
> > assigned a disk mapping when the data is being written out, rather than
> > at write(2) time. This has the following advantages:
>
> I do agree that this is a better solution than current one,
> but (even if I did not had time to test the patch), I have
> a question: How about bootloaders?
>
> IHMO all current bootloaders need to write to disk a "chain" of sector
> to load for their own initialisation, i.e. loading the remainning
> part of code stored on a file in one filesystem from the 512 bytes
> bootcode. This "chain" of sector can only be known once the file
> has been allocated to disk - and it has to be written on the same file,
> at its allocated space.
>
> So can you upgrade LILO or GRUB with your patch installed?
> It is not a so big problem (the solution being to install the
> bootloader on an unmounted filesystem with tools like e2fsprogs),
> but it seems incompatible with the current executables.

The interface used by lilo to read the kernel location needs to flush
data out to disk before returning results. It's not too hard to do.

Steve


2002-03-07 17:29:54

by Mike Fedyk

Subject: Re: [patch] delayed disk block allocation

On Thu, Mar 07, 2002 at 08:47:24AM -0600, Steve Lord wrote:
> On Thu, 2002-03-07 at 06:06, Etienne Lorrain wrote:
> > > With "allocate on flush", (aka delayed allocation), file data is
> > > assigned a disk mapping when the data is being written out, rather than
> > > at write(2) time. This has the following advantages:
> >
> > I do agree that this is a better solution than current one,
> > but (even if I did not had time to test the patch), I have
> > a question: How about bootloaders?
> >
> > IHMO all current bootloaders need to write to disk a "chain" of sector
> > to load for their own initialisation, i.e. loading the remainning
> > part of code stored on a file in one filesystem from the 512 bytes
> > bootcode. This "chain" of sector can only be known once the file
> > has been allocated to disk - and it has to be written on the same file,
> > at its allocated space.
> >
> > So can you upgrade LILO or GRUB with your patch installed?
> > It is not a so big problem (the solution being to install the
> > bootloader on an unmounted filesystem with tools like e2fsprogs),
> > but it seems incompatible with the current executables.
>
> The interface used by lilo to read the kernel location needs to flush
> data out to disk before returning results. It's not too hard to do.
>

Also, GRUB shouldn't have this problem since it reads the filesystems
directly on boot.

2002-03-07 20:52:52

by Pavel Machek

Subject: Re: [patch] delayed disk block allocation

Hi!

> ReiserFS V3 is pretty stable right now. Fsck is usually the last thing
> to stabilize for a filesystem, and we were no exception to that rule.
>
> ReiserFS V3 will last for probably another 3 years (though it will
> remain supported for I imagine at least a decade, maybe more), with V4
> gradually edging it out of the market as V4 gets more and more stable.
> During at least the next year we will keep some staff adding minor
> tweaks to V3. For instance, our layout policies will improve over the
> next few months, journal relocation is about to move from Linux 2.5 to
> 2.4, and data write ordering code is being developed by Chris. I don't
> know how long fsck maintenance for V3 will continue, but it will be at
> least as long as users find bugs in it.

When you think your fsck works, let me know. I have some tests to find
bugs in fsck....
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.