This would seem to be an error on the part of ll_rw_block.
Setting b_end_io to a default handler without checking to see
whether a callback has already been defined defeats the purpose of having
a function op.
void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
{
@@ -928,7 +1046,8 @@
if (test_and_set_bit(BH_Lock, &bh->b_state))
continue;
- set_bit(BH_Req, &bh->b_state);
+ /* We have the buffer lock */
+ bh->b_end_io = end_buffer_io_sync;
switch(rw) {
case WRITE:
-Russell Cattelan
In article <[email protected]>,
Russell Cattelan <[email protected]> wrote:
>This would seem to be an error on the part of ll_rw_block.
>Setting b_end_io to a default handler without checking to see
>whether a callback has already been defined defeats the purpose of having
>a function op.
No.
It just means that if you have your own function op, you had better not
call "ll_rw_block()".
The problem is that as it was done before, people would set the function
op without actually holding the buffer lock.
Which meant that you could have two unrelated users setting the function
op to two different things, and it would be only a matter of the purest
luck which one happened to "win".
If you want to set your own end-operation, you now need to lock the
buffer _first_, then set "b_end_io" to your operation, and then do a
"submit_bh()". You cannot use ll_rw_block().
Yes, this is different than before. Sorry about that.
But yes, this way actually happens to work reliably.
Linus
On Thu, 14 Dec 2000, Russell Cattelan wrote:
>
> So one more observation in
> filemap_sync_pte
>
> lock_page(page);
> error = filemap_write_page(page, 1);
> -> UnlockPage(page);
> This unlock page was removed? is that correct?
Yes. The "writepage" thing changed: "struct file" disappeared (as I'm sure
you also noticed), and the page writer is supposed to unlock the page
itself. Which it may do at any time, of course.
There are some reasons to do it only after the IO has actually completed:
this way the VM layer won't try to write it out _again_ before the first
IO has even finished, and the writing logic can possibly be
simplified if you know that nobody else will be touching that page.
But that is up to you: you can do the UnlockPage before even returning
from your "->writepage()" function, if you choose to do so.
Linus
On Thu, 14 Dec 2000, Russell Cattelan wrote:
>
> Ok one more wrinkle.
> sync_buffers calls ll_rw_block, this is going to have the same problem as
> calling ll_rw_block directly.
Good point.
This actually looks fairly nasty to fix. The obvious fix would be to not
put such buffers on the dirty list at all, and instead rely on the VM
layer calling "writepage()" when it wants to push out the pages.
That would be the nice behaviour from a VM standpoint.
However, that assumes that you don't have any "anonymous" buffers, which
is probably an unrealistic assumption.
The problem is that we don't have any per-buffer "writebuffer()" function,
the way we have them per-page. It was never needed for any of the normal
filesystems, and XFS just happened to be able to take advantage of the
b_end_io behaviour.
Suggestions welcome.
Linus
On Thu, 14 Dec 2000, Linus Torvalds wrote:
> Good point.
>
> This actually looks fairly nasty to fix. The obvious fix would be to not
> put such buffers on the dirty list at all, and instead rely on the VM
> layer calling "writepage()" when it wants to push out the pages.
> That would be the nice behaviour from a VM standpoint.
>
> However, that assumes that you don't have any "anonymous" buffers, which
> is probably an unrealistic assumption.
>
> The problem is that we don't have any per-buffer "writebuffer()" function,
> the way we have them per-page. It was never needed for any of the normal
> filesystems, and XFS just happened to be able to take advantage of the
> b_end_io behaviour.
>
> Suggestions welcome.
Just one: any fs that really cares about completion callback is very likely
to be picky about the requests ordering. So sync_buffers() is very unlikely
to be useful anyway.
In that sense we really don't have anonymous buffers here. I seriously
suspect that "unrealistic" assumption is not unrealistic at all. I'm
not sufficiently familiar with XFS code to say for sure, but...
What we really need is a way for VFS/VM to pass the pressure on filesystem.
That's it. If fs wants unusual completions for requests - let it have its
own queueing mechanism and submit these requests when it finds that convenient.
Stephen, you probably already thought about that area. Could you comment on
that?
Cheers,
Al
On Fri, 15 Dec 2000, Alexander Viro wrote:
> Just one: any fs that really cares about completion callback is very likely
> to be picky about the requests ordering. So sync_buffers() is very unlikely
> to be useful anyway.
>
Somewhat. I guess there are at least two ways to do it. First flush the
buffers where ordering matters (log blocks), then send the others onto the
dirty list (general metadata). You might have your own end_io for those, and
sync_buffers would lose it.
Second way (reiserfs recently changed to this method) is to do all the
flushing yourself, and remove the need for an end_io call back.
> In that sense we really don't have anonymous buffers here. I seriously
> suspect that "unrealistic" assumption is not unrealistic at all. I'm
> not sufficiently familiar with XFS code to say for sure, but...
>
> What we really need is a way for VFS/VM to pass the pressure on filesystem.
> That's it. If fs wants unusual completions for requests - let it have its
> own queueing mechanism and submit these requests when it finds that convenient.
>
Yes, this is exactly what we've discussed.
-chris
Hi,
On Fri, Dec 15, 2000 at 02:00:19AM -0500, Alexander Viro wrote:
> On Thu, 14 Dec 2000, Linus Torvalds wrote:
>
> Just one: any fs that really cares about completion callback is very likely
> to be picky about the requests ordering. So sync_buffers() is very unlikely
> to be useful anyway.
>
> In that sense we really don't have anonymous buffers here. I seriously
> suspect that "unrealistic" assumption is not unrealistic at all. I'm
> not sufficiently familiar with XFS code to say for sure, but...
Right. ext3 and reiserfs just want to submit their own IOs when it
comes to the journal. (At least in ext3, already-journaled buffers
can be written back by the VM freely.) It's a matter of telling the
fs when that should start.
> What we really need is a way for VFS/VM to pass the pressure on filesystem.
> That's it. If fs wants unusual completions for requests - let it have its
> own queueing mechanism and submit these requests when it finds that convenient.
There is a very clean way of doing this with address spaces. It's
something I would like to see done properly for 2.5: eliminate all
knowledge of buffer_heads from the VM layer. It would be pretty
simple to remove page->buffers completely and replace it with a
page->private pointer, owned by whatever address_space controlled the
page. Instead of trying to unmap and flush buffers on the page
directly, these operations would become address_space operations.
We could still provide the standard try_to_free_buffers() and
unmap_underlying_metadata() functions to operate on the per-page
buffer_head lists, and existing filesystems would only have to point
their address_space "private metadata" operations at the generic
functions. However, a filesystem which had additional ordering
constraints could then intercept the flush or writeback calls from the
VM and decide on its own how best to honour the VM pressure.
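A hypothetical sketch of what those operations might look like (the hook
names are invented for illustration; nothing like this exists in 2.4, where
the VM calls try_to_free_buffers() directly):

/* Per-address_space "private metadata" hooks, as suggested above. */
struct private_metadata_operations {
	int (*releasepage)(struct page *page);	/* free page->private, e.g. buffer_heads */
	int (*flushpage)(struct page *page);	/* honour VM pressure however the fs likes */
};

static int generic_buffer_releasepage(struct page *page);	/* hypothetical wrapper around try_to_free_buffers() */

/* An ordinary block filesystem would just point at the generic helpers: */
static struct private_metadata_operations generic_private_ops = {
	releasepage:	generic_buffer_releasepage,
	flushpage:	NULL,		/* no extra ordering constraints */
};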
This even works both for hashed and unhashed anonymous buffers, *if*
you allow the filesystem to attach all of its hashed buffers
to an address_space of its own rather than having the buffer cache
pool all such buffers together.
--Stephen
In article <[email protected]>,
Stephen C. Tweedie <[email protected]> wrote:
>
>> What we really need is a way for VFS/VM to pass the pressure on filesystem.
>> That's it. If fs wants unusual completions for requests - let it have its
>> own queueing mechanism and submit these requests when it finds that convenient.
>
>There is a very clean way of doing this with address spaces. It's
>something I would like to see done properly for 2.5: eliminate all
>knowledge of buffer_heads from the VM layer.
Note that you should be able to already get this effect with the current
2.4.0 tree.
The way to get the VM to ignore your buffer heads is to never mark the
buffers dirty, and instead mark the page they are on dirty along with
giving it a mapping (you can have a special per-superblock
"metadata-mapping" for stuff that isn't actually associated with any
particular file).
That way, the VM will just call writepage() for you, and you can use
that to schedule your writeouts (you don't actually need to write the
page the VM _asks_ you to write - you can just mark it dirty again and
consider the writepage to be mainly a VM pressure indicator).
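Something along these lines, roughly (a sketch only; myfs_queue_for_writeout
is an invented per-filesystem helper, not code from this thread):

static void myfs_queue_for_writeout(struct page *page);	/* hypothetical */

/* writepage() treated as a pressure hint: queue a private, ordered writeout,
 * keep the page dirty, and let the VM move on. */
static int myfs_metadata_writepage(struct page *page)
{
	myfs_queue_for_writeout(page);	/* fs decides when/how to really write */
	SetPageDirty(page);		/* we did not actually clean it */
	UnlockPage(page);		/* the page writer unlocks the page itself */
	return 0;
}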
Now, I also agree that we should be able to clean this up properly for
2.5.x, and actually do exactly this for the anonymous buffers, so that
the VM no longer needs to worry about buffer knowledge, and fs/buffer.c
becomes just another user of the writepage functionality. That is not
all that hard to do, it mainly just requires some small changes to how
"mark_buffer_dirty()" works (ie it would also mark the page dirty, so
that the VM layer would know to call "writepage()").
I really think almost all of the VM infrastructure for this is in place;
the PageDirty code has both simplified the VM enormously and made it a
lot more powerful at the same time.
Linus
> Now, I also agree that we should be able to clean this up properly for
> 2.5.x, and actually do exactly this for the anonymous buffers, so that
> the VM no longer needs to worry about buffer knowledge, and fs/buffer.c
> becomes just another user of the writepage functionality. That is not
> all that hard to do, it mainly just requires some small changes to how
Why not incorporate this change into 2.4.x?
Jeff
On Sat, 16 Dec 2000, Jeff Chua wrote:
>
> > Now, I also agree that we should be able to clean this up properly for
> > 2.5.x, and actually do exactly this for the anonymous buffers, so that
> > the VM no longer needs to worry about buffer knowledge, and fs/buffer.c
> > becomes just another user of the writepage functionality. That is not
> > all that hard to do, it mainly just requires some small changes to how
>
> Why not incorporate this change into 2.4.x?
It might be 10 lines of change, and obviously correct.
And it might not be. If somebody wants to try out the DirtyPage approach
for buffer handling, please do so. I'll apply it if it _does_ turn out to
be as small as I suspect it might be, and if the code is straightforward
and obvious.
If not, we're better off leaving it for 2.5.x
Linus
On Fri, 15 Dec 2000, Linus Torvalds wrote:
[ writepage for anon buffers ]
>
> It might be 10 lines of change, and obviously correct.
>
I'll give this a try; it will be interesting regardless of whether it is
simple enough for kernel inclusion.
On a related note, I hit a snag porting reiserfs into test12, where
block_read_full_page assumes the buffer it gets back from get_block won't
be up to date. When reiserfs reads a packed tail directly into the page,
reading the newly mapped buffer isn't required, and is actually a bad
idea, since the packed tails have a block number of 0 when copied into
the page cache.
In other words, after calling reiserfs_get_block, the buffer might be
mapped and uptodate, with no i/o required in block_read_full_page.
The following patch to block_read_full_page fixes things for me, and seems
like a good idea in general. It might be better to apply something
similar to submit_bh instead...comments?
-chris
--- linux-test12/fs/buffer.c Mon Dec 18 11:37:42 2000
+++ linux/fs/buffer.c Mon Dec 18 11:38:36 2000
@@ -1706,8 +1706,10 @@
}
}
- arr[nr] = bh;
- nr++;
+ if (!buffer_uptodate(bh)) {
+ arr[nr] = bh;
+ nr++;
+ }
} while (i++, iblock++, (bh = bh->b_this_page) != head);
if (!nr) {
On Sat, 16 Dec 2000, Chris Mason wrote:
>
> In other words, after calling reiserfs_get_block, the buffer might be
> mapped and uptodate, with no i/o required in block_read_full_page
>
> The following patch to block_read_full_page fixes things for me, and seems
> like a good idea in general. It might be better to apply something
> similar to submit_bh instead...comments?
Making the change to submit_bh() would make it look more like the old
ll_rw_block() in this regard, but at the same time I really don't like the
old ll_rw_block() interface that knew about semantics..
Your patch looks fine, although I'd personally prefer this one even more:
fs/buffer.c patch cut-and-paste:
+++ fs/buffer.c Sat Dec 16 11:02:44 2000
@@ -1700,6 +1693,9 @@
set_bit(BH_Uptodate, &bh->b_state);
continue;
}
+ /* get_block() might have updated the buffer synchronously */
+ if (buffer_uptodate(bh))
+ continue;
}
arr[nr] = bh;
which makes it explicit about how we could have suddenly gotten an
up-to-date buffer. Does that work for you?
Linus
On Sat, 16 Dec 2000, Linus Torvalds wrote:
> Your patch looks fine, although I'd personally prefer this one even more:
>
Yes, that looks better and works here.
-chris
Linus Torvalds wrote:
> On Thu, 14 Dec 2000, Russell Cattelan wrote:
> >
> > Ok one more wrinkle.
> > sync_buffers calls ll_rw_block, this is going to have the same problem as
> > calling ll_rw_block directly.
>
> Good point.
>
> This actually looks fairly nasty to fix. The obvious fix would be to not
> put such buffers on the dirty list at all, and instead rely on the VM
> layer calling "writepage()" when it wants to push out the pages.
> That would be the nice behaviour from a VM standpoint.
>
> However, that assumes that you don't have any "anonymous" buffers, which
> is probably an unrealistic assumption.
>
> The problem is that we don't have any per-buffer "writebuffer()" function,
> the way we have them per-page. It was never needed for any of the normal
> filesystems, and XFS just happened to be able to take advantage of the
> b_end_io behaviour.
>
> Suggestions welcome.
>
> Linus
Ok after a bit of trial and error I do have something working.
I wouldn't call it the most elegant solution but it does work
and it isn't very intrusive.
#define BH_End_io 7	/* end_io function defined, don't remap it */

	/* don't change the callback if somebody explicitly set it */
	if (!test_bit(BH_End_io, &bh->b_state)) {
		bh->b_end_io = end_buffer_io_sync;
	}
What I've done in the XFS buffer_head setup functions is to set the initial
value of b_state to BH_Lock and BH_End_io, set the callback function and the
rest of the relevant fields, and then unlock the buffer.
The only other quick fix that comes to mind is to change sync_buffers to use
submit_bh rather than ll_rw_block.
--
Russell Cattelan
[email protected]
Alexander Viro wrote:
> On Thu, 14 Dec 2000, Linus Torvalds wrote:
>
> > Good point.
> >
> > This actually looks fairly nasty to fix. The obvious fix would be to not
> > put such buffers on the dirty list at all, and instead rely on the VM
> > layer calling "writepage()" when it wants to push out the pages.
> > That would be the nice behaviour from a VM standpoint.
> >
> > However, that assumes that you don't have any "anonymous" buffers, which
> > is probably an unrealistic assumption.
> >
> > The problem is that we don't have any per-buffer "writebuffer()" function,
> > the way we have them per-page. It was never needed for any of the normal
> > filesystems, and XFS just happened to be able to take advantage of the
> > b_end_io behaviour.
> >
> > Suggestions welcome.
>
> Just one: any fs that really cares about completion callback is very likely
> to be picky about the requests ordering. So sync_buffers() is very unlikely
> to be useful anyway.
Actually no, that's not the issue.
The XFS log uses an LSN (Log Sequence Number) to keep track of log write ordering.
Sync IO on each log buffer isn't realistic; the performance hit would be too great.
I wasn't around when most of XFS was developed, but from what I understand it
was discovered early on that firing off writes in a particular order doesn't
guarantee that they will finish in that order. Thus the implementation of a
sequence number for each log write.
One of the obstacles we ran into early on in the Linux port was the fact that
Linux used fixed-size IO requests to any given device, but most of XFS's
metadata structures vary in size in multiples of 512 bytes.
We were also implementing a page caching / clustering layer called
page_buf, which understands primarily pages and not necessarily
disk blocks. If your FS block size happens to match your page size then things
are good, but often it doesn't....
So we added a bitmap field to the page structure.
Each bit then represents one basic block (512 bytes, for all practical purposes).
The end_io functions XFS defines update the correct bit, or the whole bit array
if the whole page is valid, thus signaling the rest of page_buf that the IO
has completed.
Ok, there is a lot more to it than what I've just described, but you probably
get the idea.
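Very roughly, the completion side of that scheme might look like the sketch
below (invented names, not the actual XFS/page_buf code; it assumes b_private
carries the page_buf pointer set at submit time):

struct my_page_buf {
	unsigned long	pb_valid_mask;		/* one bit per 512-byte basic block */
	atomic_t	pb_io_remaining;	/* outstanding buffer IOs on this page */
};

static void my_pagebuf_iodone(struct my_page_buf *pb);	/* hypothetical: wake up waiters */

static void my_pagebuf_end_io(struct buffer_head *bh, int uptodate)
{
	struct my_page_buf *pb = bh->b_private;
	int block = ((unsigned long)bh->b_data & ~PAGE_MASK) >> 9;	/* basic-block index */

	if (uptodate)
		set_bit(block, &pb->pb_valid_mask);
	if (atomic_dec_and_test(&pb->pb_io_remaining))
		my_pagebuf_iodone(pb);
	unlock_buffer(bh);
}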
>
>
> In that sense we really don't have anonymous buffers here. I seriously
> suspect that "unrealistic" assumption is not unrealistic at all. I'm
> not sufficiently familiar with XFS code to say for sure, but...
>
> What we really need is a way for VFS/VM to pass the pressure on filesystem.
> That's it. If fs wants unusual completions for requests - let it have its
> own queueing mechanism and submit these requests when it finds that convenient.
>
> Stephen, you probably already thought about that area. Could you comment on
> that?
> Cheers,
> Al
--
Russell Cattelan
[email protected]
Chris Mason wrote:
> On Fri, 15 Dec 2000, Alexander Viro wrote:
>
> > Just one: any fs that really cares about completion callback is very likely
> > to be picky about the requests ordering. So sync_buffers() is very unlikely
> > to be useful anyway.
> >
> Somewhat. I guess there are at least two ways to do it. First flush the
> buffers where ordering matters (log blocks), then send the others onto the
> dirty list (general metadata). You might have your own end_io for those, and
> sync_buffers would lose it.
>
> Second way (reiserfs recently changed to this method) is to do all the
> flushing yourself, and remove the need for an end_io call back.
>
I'm curious about this.
Does this mean reiserfs is doing all of its own buffer management?
This would seem a little redundant with what is already in the kernel?
>
>
> > In that sense we really don't have anonymous buffers here. I seriously
> > suspect that "unrealistic" assumption is not unrealistic at all. I'm
> > not sufficiently familiar with XFS code to say for sure, but...
> >
> > What we really need is a way for VFS/VM to pass the pressure on filesystem.
> > That's it. If fs wants unusual completions for requests - let it have its
> > own queueing mechanism and submit these requests when it finds that convenient.
> >
> Yes, this is exactly what we've discussed.
>
> -chris
--
Russell Cattelan
[email protected]
"Stephen C. Tweedie" wrote:
> Hi,
>
> On Fri, Dec 15, 2000 at 02:00:19AM -0500, Alexander Viro wrote:
> > On Thu, 14 Dec 2000, Linus Torvalds wrote:
> >
> > Just one: any fs that really cares about completion callback is very likely
> > to be picky about the requests ordering. So sync_buffers() is very unlikely
> > to be useful anyway.
> >
> > In that sense we really don't have anonymous buffers here. I seriously
> > suspect that "unrealistic" assumption is not unrealistic at all. I'm
> > not sufficiently familiar with XFS code to say for sure, but...
>
> Right. ext3 and reiserfs just want to submit their own IOs when it
> comes to the journal. (At least in ext3, already-journaled buffers
> can be written back by the VM freely.) It's a matter of telling the
> fs when that should start.
>
> > What we really need is a way for VFS/VM to pass the pressure on filesystem.
> > That's it. If fs wants unusual completions for requests - let it have its
> > own queueing mechanism and submit these requests when it finds that convenient.
>
> There is a very clean way of doing this with address spaces. It's
> something I would like to see done properly for 2.5: eliminate all
> knowledge of buffer_heads from the VM layer. It would be pretty
> simple to remove page->buffers completely and replace it with a
> page->private pointer, owned by whatever address_space controlled the
> page. Instead of trying to unmap and flush buffers on the page
> directly, these operations would become address_space operations.
Yes, this is a lot of what page_buf would like to do eventually:
have the VM system pressure page_buf for pages, which would
then be able to intelligently call the filesystem to free up cached pages.
A big part of getting delayed allocation to not completely consume all the
system pages is being told when it's time to start really allocating disk
space and pushing pages out.
>
>
> We could still provide the standard try_to_free_buffers() and
> unmap_underlying_metadata() functions to operate on the per-page
> buffer_head lists, and existing filesystems would only have to point
> their address_space "private metadata" operations at the generic
> functions. However, a filesystem which had additional ordering
> constraints could then intercept the flush or writeback calls from the
> VM and decide on its own how best to honour the VM pressure.
>
> This even works both for hashed and unhashed anonymous buffers, *if*
> you allow the filesystem to attach all of its hashed buffers
> to an address_space of its own rather than having the buffer cache
> pool all such buffers together.
>
> --Stephen
--
Russell Cattelan
[email protected]
On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, Dec 15, 2000 at 02:00:19AM -0500, Alexander Viro wrote:
> > On Thu, 14 Dec 2000, Linus Torvalds wrote:
> >
> > Just one: any fs that really cares about completion callback is very likely
> > to be picky about the requests ordering. So sync_buffers() is very unlikely
> > to be useful anyway.
> >
> > In that sense we really don't have anonymous buffers here. I seriously
> > suspect that "unrealistic" assumption is not unrealistic at all. I'm
> > not sufficiently familiar with XFS code to say for sure, but...
>
> Right. ext3 and reiserfs just want to submit their own IOs when it
> comes to the journal. (At least in ext3, already-journaled buffers
> can be written back by the VM freely.) It's a matter of telling the
> fs when that should start.
>
> > What we really need is a way for VFS/VM to pass the pressure on filesystem.
> > That's it. If fs wants unusual completions for requests - let it have its
> > own queueing mechanism and submit these requests when it finds that convenient.
>
> There is a very clean way of doing this with address spaces. It's
> something I would like to see done properly for 2.5: eliminate all
> knowledge of buffer_heads from the VM layer. It would be pretty
> simple to remove page->buffers completely and replace it with a
> page->private pointer, owned by whatever address_space controlled the
> page. Instead of trying to unmap and flush buffers on the page
> directly, these operations would become address_space operations.
>
> We could still provide the standard try_to_free_buffers() and
> unmap_underlying_metadata() functions to operate on the per-page
> buffer_head lists, and existing filesystems would only have to point
> their address_space "private metadata" operations at the generic
> functions. However, a filesystem which had additional ordering
> constraints could then intercept the flush or writeback calls from the
> VM and decide on its own how best to honour the VM pressure.
Stephen,
The ->flush() operation (which we've been discussing a bit) would be very
useful now (mainly for XFS).
At page_launder(), we can call ->flush() if the given page has it defined.
Otherwise use try_to_free_buffers() as we do now for filesystems which
dont care about the special flushing treatment.
On Sat, 16 Dec 2000, Russell Cattelan wrote:
> >
> I'm curious about this.
> Does this mean reiserfs is doing all of its own buffer management?
>
> This would seem a little redundant with what is already in the kernel?
>
For metadata only, reiserfs does its own write management. The buffers
come from getblk(); we just don't mark them dirty for flushing by
flush_dirty_buffers().
This has the advantage of avoiding races against bdflush and friends, and
makes it easier to keep track of which buffers have actually made their
way to disk. It has all of the obvious disadvantages with respect to
memory pressure.
-chris
Hi,
On Sat, Dec 16, 2000 at 07:08:02PM -0600, Russell Cattelan wrote:
> > There is a very clean way of doing this with address spaces. It's
> > something I would like to see done properly for 2.5: eliminate all
> > knowledge of buffer_heads from the VM layer. It would be pretty
> > simple to remove page->buffers completely and replace it with a
> > page->private pointer, owned by whatever address_space controlled the
> > page. Instead of trying to unmap and flush buffers on the page
> > directly, these operations would become address_space operations.
>
> Yes this is a lot of what page buf would like to do eventually.
> Have the VM system pressure page_buf for pages which would
> then be able to intelligently call the file system to free up cached pages.
> A big part of getting Delay Alloc to not completely consume all the
> system pages, is being told when it's time to start really allocating disk
> space and push pages out.
Delayed allocation is actually much easier, since it's entirely an
operation on logical page addresses, not physical ones --- by
definition you don't have any buffer_heads yet because you haven't
decided on the disk blocks. If you're just dealing with pages, not
blocks, then the address_space is the natural way of dealing with it
already.
Only the full semantics of the flush callback have been missing to
date, and with 2.4.0-test12 even that is mostly solved, since
page_launder will give you the writeback() callbacks you need to flush
things to disk when you start getting memory pressure. You can even
treat the writepage() as an advisory call.
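For illustration, a delayed-allocation writepage() along those lines might
look like this (a sketch only; all the myfs_* helpers are invented, and the
early-out is the redirty-and-unlock trick mentioned above):

static int myfs_blocks_reserved(struct page *page);		/* hypothetical */
static void myfs_allocate_blocks(struct page *page);		/* hypothetical: pick disk blocks now */
static int myfs_get_block(struct inode *, long, struct buffer_head *, int);	/* hypothetical */

static int myfs_dalloc_writepage(struct page *page)
{
	if (!myfs_blocks_reserved(page)) {
		/* not convenient to allocate yet: stay dirty, treat the call as advisory */
		SetPageDirty(page);
		UnlockPage(page);
		return 0;
	}
	myfs_allocate_blocks(page);	/* the logical -> physical decision happens here */
	return block_write_full_page(page, myfs_get_block);
}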
--Stephen
Hi,
On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
>
> Stephen,
>
> The ->flush() operation (which we've been discussing a bit) would be very
> useful now (mainly for XFS).
>
> At page_launder(), we can call ->flush() if the given page has it defined.
> Otherwise use try_to_free_buffers() as we do now for filesystems which
> dont care about the special flushing treatment.
As of 2.4.0test12, page_launder() will already call the
per-address-space writepage() operation for dirty pages. Do you need
something similar for clean pages too, or does Linus's new laundry
code give you what you need now?
Cheers,
Stephen
On Mon, 18 Dec 2000, Stephen C. Tweedie wrote:
> Hi,
>
> On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> > On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> >
> > Stephen,
> >
> > The ->flush() operation (which we've been discussing a bit) would be very
> > useful now (mainly for XFS).
> >
> > At page_launder(), we can call ->flush() if the given page has it defined.
> > Otherwise use try_to_free_buffers() as we do now for filesystems which
> > dont care about the special flushing treatment.
>
> As of 2.4.0test12, page_launder() will already call the
> per-address-space writepage() operation for dirty pages. Do you need
> something similar for clean pages too, or does Linus's new laundry
> code give you what you need now?
I think the semantics of the filesystem specific ->flush and ->writepage
are not the same.
Is it ok for filesystem-specific writepage() code to sync other "physically
contiguous" dirty pages with reference to the one requested by
writepage()?
If so, it can do the same job as the ->flush() idea we've been discussing.
Marcelo Tosatti wrote:
>
> On Mon, 18 Dec 2000, Stephen C. Tweedie wrote:
>
> > Hi,
> >
> > On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> > > On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> > >
> > > Stephen,
> > >
> > > The ->flush() operation (which we've been discussing a bit) would be very
> > > useful now (mainly for XFS).
> > >
> > > At page_launder(), we can call ->flush() if the given page has it defined.
> > > Otherwise use try_to_free_buffers() as we do now for filesystems which
> > > dont care about the special flushing treatment.
> >
> > As of 2.4.0test12, page_launder() will already call the
> > per-address-space writepage() operation for dirty pages. Do you need
> > something similar for clean pages too, or does Linus's new laundry
> > code give you what you need now?
>
> I think the semantics of the filesystem specific ->flush and ->writepage
> are not the same.
>
> Is ok for filesystem specific writepage() code to sync other "physically
> contiguous" dirty pages with reference to the one requested by
> writepage() ?
>
> If so, it can do the same job as the ->flush() idea we've discussing.
Except that for ->writepage you don't have the option of *not* writing
the specified page.
--
Daniel
In article <[email protected]> you wrote:
>> I think the semantics of the filesystem specific ->flush and ->writepage
>> are not the same.
>>
>> Is ok for filesystem specific writepage() code to sync other "physically
>> contiguous" dirty pages with reference to the one requested by
>> writepage() ?
>>
>> If so, it can do the same job as the ->flush() idea we've discussing.
>
> Except that for ->writepage you don't have the option of *not* writing
> the specified page.
In current -test13pre you have.
Christoph
--
Whip me. Beat me. Make me maintain AIX.
On Tue, 19 Dec 2000, Daniel Phillips wrote:
> Marcelo Tosatti wrote:
> >
> > On Mon, 18 Dec 2000, Stephen C. Tweedie wrote:
> >
> > > Hi,
> > >
> > > On Sun, Dec 17, 2000 at 12:38:17AM -0200, Marcelo Tosatti wrote:
> > > > On Fri, 15 Dec 2000, Stephen C. Tweedie wrote:
> > > >
> > > > Stephen,
> > > >
> > > > The ->flush() operation (which we've been discussing a bit) would be very
> > > > useful now (mainly for XFS).
> > > >
> > > > At page_launder(), we can call ->flush() if the given page has it defined.
> > > > Otherwise use try_to_free_buffers() as we do now for filesystems which
> > > > dont care about the special flushing treatment.
> > >
> > > As of 2.4.0test12, page_launder() will already call the
> > > per-address-space writepage() operation for dirty pages. Do you need
> > > something similar for clean pages too, or does Linus's new laundry
> > > code give you what you need now?
> >
> > I think the semantics of the filesystem specific ->flush and ->writepage
> > are not the same.
> >
> > Is ok for filesystem specific writepage() code to sync other "physically
> > contiguous" dirty pages with reference to the one requested by
> > writepage() ?
> >
> > If so, it can do the same job as the ->flush() idea we've discussing.
>
> Except that for ->writepage you don't have the option of *not* writing
> the specified page.
It does.
Both the swapper writepage() operation and shm_writepage() can fail to
write the page.
Ok guys, I think I've taken Linus' suggestion to have buffer.c use its
own writepage a bit too far. This patch marks pages dirty when the
buffer head is marked dirty, and changes flush_dirty_buffers and
sync_buffers to use writepage instead of ll_rw_block. The idea is
to allow filesystems to use the buffer lists and provide their own
i/o mechanism.
The result is a serious semantics change for writepage, which now is
expected to do partial page writes when the page isn't up to date and
there are dirty buffers inside. For all the obvious reasons, this isn't
fit for 2.4.0, and if you all feel it is a 2.5 thing I'll send along
the shorter patch Linus originally suggested. But, I think it would
be pretty useful for the new filesystems (once I also fix
fsync_inode_buffers and sync_page_buffers).
Other changes: submit_bh now cleans the buffers. I don't see how
they were getting cleaned before, it must have been try_to_free_buffers
sending the page through sync_page_buffers, meaning they were probably
getting written twice. Unless someone throws a clue my way, I'll send
this out as a separate diff.
page_launder doesn't fiddle with the buffermem_pages counter, it is done
in try_to_free_buffers instead.
Obvious bug, block_write_full_page zeros out the bits past the end of
file every time. This should not be needed for normal file writes.
Most testing was on ext2, which actually calls mark_buffer_dirty, and
supports blocksizes < intel page size. More tests are running
overnight.
-chris
diff -urN linux.test13p3p/drivers/block/ll_rw_blk.c linux-test12/drivers/block/ll_rw_blk.c
--- linux.test13p3p/drivers/block/ll_rw_blk.c Tue Dec 19 05:07:50 2000
+++ linux.test13p3/drivers/block/ll_rw_blk.c Thu Dec 21 16:56:43 2000
@@ -954,6 +954,7 @@
if (!test_bit(BH_Lock, &bh->b_state))
BUG();
+ mark_buffer_clean(bh) ;
set_bit(BH_Req, &bh->b_state);
/*
diff -urN linux.test13p3p/fs/buffer.c linux-test12/fs/buffer.c
--- linux.test13p3p/fs/buffer.c Thu Dec 21 17:24:17 2000
+++ linux.test13p3/fs/buffer.c Thu Dec 21 17:28:46 2000
@@ -97,6 +97,16 @@
static int grow_buffers(int size);
static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+
+static struct address_space_operations anon_space_ops = {
+ writepage: block_write_anon_page,
+ sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+ pages: { &anon_space_mapping.pages, &anon_space_mapping.pages },
+ a_ops: &anon_space_ops,
+} ;
/* This is used by some architectures to estimate available memory. */
atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +171,37 @@
atomic_dec(&bh->b_count);
}
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page) {
+ int (*writepage)(struct page *) ;
+ int ret ;
+
+ writepage = page->mapping->a_ops->writepage ;
+
+ if (!writepage) {
+ writepage = anon_space_ops.writepage ;
+ }
+
+ page_cache_get(page) ;
+ ClearPageDirty(page) ;
+ ret = writepage(page) ;
+ /* note, when writepage returns 1, we mark the page dirty again
+ ** but the writepage func is responsible for making sure any
+ ** buffers that need to stay dirty really do stay dirty
+ ** ick.
+ */
+ if (ret == 1) {
+ SetPageDirty(page) ;
+ UnlockPage(page) ;
+ }
+ page_cache_release(page) ;
+ return ret ;
+}
+
/* Call sync_buffers with wait!=0 to ensure that the call does not
* return until all buffer writes have completed. Sync() may return
* before the writes have finished; fsync() may not.
@@ -175,6 +216,7 @@
{
int i, retry, pass = 0, err = 0;
struct buffer_head * bh, *next;
+ struct page *page ;
/* One pass for no-wait, three for wait:
* 0) write out all dirty, unlocked buffers;
@@ -230,10 +272,19 @@
if (!buffer_dirty(bh) || pass >= 2)
continue;
- atomic_inc(&bh->b_count);
+ page = bh->b_page ;
+ if (TryLockPage(page)) {
+ if (!wait || !pass) {
+ retry = 1 ;
+ continue ;
+ }
+ spin_unlock(&lru_list_lock);
+ wait_on_page(page) ;
+ goto repeat ;
+ }
spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- atomic_dec(&bh->b_count);
+
+ dirty_list_writepage(page) ;
retry = 1;
goto repeat;
}
@@ -1096,8 +1147,10 @@
int dispose = BUF_CLEAN;
if (buffer_locked(bh))
dispose = BUF_LOCKED;
- if (buffer_dirty(bh))
+ if (buffer_dirty(bh)) {
dispose = BUF_DIRTY;
+ SetPageDirty(bh->b_page) ;
+ }
if (buffer_protected(bh))
dispose = BUF_PROTECTED;
if (dispose != bh->b_list) {
@@ -1473,6 +1526,52 @@
* "Dirty" is valid only with the last case (mapped+uptodate).
*/
+static int block_write_anon_page(struct page *page)
+{
+ struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ int i, nr = 0 ;
+ int partial = 0 ;
+ int ret = 0 ;
+
+ if (!PageLocked(page))
+ BUG();
+
+ if (!page->buffers)
+ BUG() ;
+
+ head = page->buffers;
+ bh = head;
+
+ /* Stage 1: find the dirty buffers, lock them for submit_bh */
+ do {
+ if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+ if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_io_async;
+ atomic_inc(&bh->b_count);
+ arr[nr++] = bh ;
+ } else {
+ partial = 1 ;
+ unlock_buffer(bh) ;
+ }
+ } else {
+ partial = 1 ;
+ }
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ /* Stage 2: submit the IO */
+ for (i = 0 ; i < nr ; i++) {
+ submit_bh(WRITE, arr[i]) ;
+ }
+ /* Done - end_buffer_io_async will unlock */
+ if (!partial)
+ SetPageUptodate(page);
+ if (nr == 0) {
+ UnlockPage(page) ;
+ }
+ return ret ;
+}
+
/*
* block_write_full_page() is SMP-safe - currently it's still
* being called with the kernel lock held, but the code is ready.
@@ -1482,6 +1581,10 @@
int err, i;
unsigned long block;
struct buffer_head *bh, *head;
+ int nr = 0 ;
+ struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+ int page_ok = Page_Uptodate(page) ;
+ int partial = 0;
if (!PageLocked(page))
BUG();
@@ -1504,36 +1607,46 @@
*
* Leave it to the low-level FS to make all those
* decisions (block #0 may actually be a valid block)
+ *
+ * only bother when the page is up to date or the buffer
+ * is dirty.
*/
- if (!buffer_mapped(bh)) {
- err = get_block(inode, block, bh, 1);
- if (err)
- goto out;
- if (buffer_new(bh))
- unmap_underlying_metadata(bh);
+ if (page_ok || buffer_dirty(bh)) {
+ if (!buffer_mapped(bh)) {
+ err = get_block(inode, block, bh, 1);
+ if (err)
+ goto out;
+ if (buffer_new(bh))
+ unmap_underlying_metadata(bh);
+ }
+ arr[nr++] = bh ;
+ } else {
+ partial = 1 ;
}
bh = bh->b_this_page;
block++;
} while (bh != head);
/* Stage 2: lock the buffers, mark them dirty */
- do {
+ for (i = 0 ; i < nr ; i++) {
+ bh = arr[i] ;
lock_buffer(bh);
bh->b_end_io = end_buffer_io_async;
atomic_inc(&bh->b_count);
set_bit(BH_Uptodate, &bh->b_state);
set_bit(BH_Dirty, &bh->b_state);
- bh = bh->b_this_page;
- } while (bh != head);
+ }
- /* Stage 3: submit the IO */
- do {
- submit_bh(WRITE, bh);
- bh = bh->b_this_page;
- } while (bh != head);
+ for (i = 0 ; i < nr ; i++) {
+ submit_bh(WRITE, arr[i]) ;
+ }
+
+ if (nr == 0)
+ UnlockPage(page) ;
/* Done - end_buffer_io_async will unlock */
- SetPageUptodate(page);
+ if (!partial)
+ SetPageUptodate(page);
return 0;
out:
@@ -1653,6 +1766,45 @@
}
/*
+** just sets the dirty bits for a range of buffers in the page. Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+ unsigned from, unsigned to)
+{
+ unsigned block_start, block_end;
+ int partial = 0 ;
+ unsigned blocksize;
+ struct buffer_head *bh, *head;
+
+ blocksize = inode->i_sb->s_blocksize;
+
+ for(bh = head = page->buffers, block_start = 0;
+ bh != head || !block_start;
+ block_start=block_end, bh = bh->b_this_page) {
+ block_end = block_start + blocksize;
+ if (block_end <= from || block_start >= to) {
+ if (!buffer_uptodate(bh))
+ partial = 1;
+ } else {
+ set_bit(BH_Uptodate, &bh->b_state);
+ if (!atomic_set_buffer_dirty(bh)) {
+ buffer_insert_inode_queue(bh, inode);
+ }
+ }
+ }
+ /*
+ * is this a partial write that happened to make all buffers
+ * uptodate then we can optimize away a bogus readpage() for
+ * the next read(). Here we 'discover' wether the page went
+ * uptodate as a result of this (potentially partial) write.
+ */
+ if (!partial)
+ SetPageUptodate(page);
+ return 0;
+}
+
+/*
* Generic "read page" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
* Reads the page asynchronously --- the unlock_buffer() and
@@ -1942,13 +2093,23 @@
if (!err) {
memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
- __block_commit_write(inode,page,0,offset);
+
+ /* this will just set the dirty bits for block_write_full_page
+ ** it is only safe because we have the page locked and
+ ** nobody will try to write the buffers between
+ ** the block_dirty_range and the write_full_page calls
+ ** we have to clear the page up to date so the buffers
+ ** past the end of the file won't get written
+ */
+ __block_dirty_range(inode,page,0,offset);
+ ClearPageUptodate(page);
+ err = __block_write_full_page(inode, page, get_block) ;
done:
kunmap(page);
- UnlockPage(page);
return err;
}
ClearPageUptodate(page);
+ UnlockPage(page);
goto done;
}
@@ -2239,7 +2400,7 @@
struct buffer_head *bh, *tmp;
struct buffer_head * insert_point;
int isize;
-
+ unsigned long index ;
if ((size & 511) || (size > PAGE_SIZE)) {
printk("VFS: grow_buffers: size = %d\n",size);
return 0;
@@ -2255,6 +2416,16 @@
isize = BUFSIZE_INDEX(size);
+ /* don't put this buffer head on the free list until the
+ ** page is setup. Is there a better index to use? Would 0
+ ** be good enough?
+ */
+ page->flags &= ~(1 << PG_referenced);
+ index = atomic_read(&buffermem_pages) ;
+ atomic_inc(&buffermem_pages);
+ add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+ page->buffers = bh;
+
spin_lock(&free_list[isize].lock);
insert_point = free_list[isize].list;
tmp = bh;
@@ -2278,11 +2449,7 @@
free_list[isize].list = bh;
spin_unlock(&free_list[isize].lock);
- page->buffers = bh;
- page->flags &= ~(1 << PG_referenced);
- lru_cache_add(page);
UnlockPage(page);
- atomic_inc(&buffermem_pages);
return 1;
no_buffer_head:
@@ -2381,6 +2548,9 @@
/* And free the page */
page->buffers = NULL;
+ if (page->mapping == (&anon_space_mapping)) {
+ atomic_dec(&buffermem_pages) ;
+ }
page_cache_release(page);
spin_unlock(&free_list[index].lock);
write_unlock(&hash_table_lock);
@@ -2559,6 +2729,7 @@
static int flush_dirty_buffers(int check_flushtime)
{
struct buffer_head * bh, *next;
+ struct page *page ;
int flushed = 0, i;
restart:
@@ -2575,6 +2746,8 @@
}
if (buffer_locked(bh))
continue;
+ if (!buffer_uptodate(bh))
+ continue ;
if (check_flushtime) {
/* The dirty lru list is chronologically ordered so
@@ -2587,13 +2760,12 @@
if (++flushed > bdf_prm.b_un.ndirty)
goto out_unlock;
}
-
- /* OK, now we are committed to write it out. */
- atomic_inc(&bh->b_count);
- spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- atomic_dec(&bh->b_count);
-
+ page = bh->b_page ;
+ if (TryLockPage(page)) {
+ continue ;
+ }
+ spin_unlock(&lru_list_lock) ;
+ dirty_list_writepage(page) ;
if (current->need_resched)
schedule();
goto restart;
diff -urN linux.test13p3p/mm/page_alloc.c linux-test12/mm/page_alloc.c
--- linux.test13p3p/mm/page_alloc.c Mon Dec 18 06:53:45 2000
+++ linux.test13p3/mm/page_alloc.c Thu Dec 21 16:56:38 2000
@@ -317,11 +317,12 @@
/*
* If we are about to get low on free pages and cleaning
* the inactive_dirty pages would fix the situation,
- * wake up bdflush.
+ * wake up kswapd here as well, so page_launder can start
+ * sending things to disk.
*/
else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
&& nr_inactive_dirty_pages >= freepages.high)
- wakeup_bdflush(0);
+ wakeup_kswapd(0);
try_again:
/*
diff -urN linux.test13p3p/mm/vmscan.c linux-test12/mm/vmscan.c
--- linux.test13p3p/mm/vmscan.c Tue Dec 19 05:07:57 2000
+++ linux.test13p3/mm/vmscan.c Thu Dec 21 17:32:57 2000
@@ -678,7 +678,6 @@
/* The page was only in the buffer cache. */
} else if (!page->mapping) {
- atomic_dec(&buffermem_pages);
freed_page = 1;
cleaned_pages++;
@@ -739,9 +738,6 @@
* free anything yet, we wait synchronously on the writeout of
* MAX_SYNC_LAUNDER pages.
*
- * We also wake up bdflush, since bdflush should, under most
- * loads, flush out the dirty pages before we have to wait on
- * IO.
*/
if (can_get_io_locks && !launder_loop && free_shortage()) {
launder_loop = 1;
@@ -750,8 +746,6 @@
sync = 0;
/* We only do a few "out of order" flushes. */
maxlaunder = MAX_LAUNDER;
- /* Kflushd takes care of the rest. */
- wakeup_bdflush(0);
goto dirty_page_rescan;
}
On Thu, 21 Dec 2000, Chris Mason wrote:
> Obvious bug, block_write_full_page zeros out the bits past the end of
> file every time. This should not be needed for normal file writes.
Unfortunately, it _is_ needed for the pageout path. mmap() the last page
of file. Dirty the data past the EOF (MMU will not catch that access).
Let the pageout send the page to disk. You don't want to have the data
past EOF end up on the backstore.
Al, slowly getting back to life and sanity. Fsck the flu...
On Thu, 21 Dec 2000, Chris Mason wrote:
> Ok guys, I think I've taken Linus' suggestion to have buffer.c use its
> own writepage a bit too far. This patch marks pages dirty when the
> buffer head is marked dirty, and changes flush_dirty_buffers and
> sync_buffers to use writepage instead of ll_rw_block. The idea is
> to allow filesystems to use the buffer lists and provide their own
> i/o mechanism.
>
> The result is a serious semantics change for writepage, which now is
> expected to do partial page writes when the page isn't up to date and
> there are dirty buffers inside. For all the obvious reasons, this isn't
> fit for 2.4.0, and if you all feel it is a 2.5. thing I'll send along
> the shorter patch Linus originally suggested. But, I think it would
> be pretty useful for the new filesystems (once I also fix
> fsync_inode_buffers and sync_page_buffers).
It is very powerful.
With this in place, the filesystem is able to do write clustering in its
writepage() function by checking whether the physically nearby on-disk pages
are dirty.
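As a rough illustration of that idea (hedged: a sketch only, using logical
adjacency as a stand-in for the physical adjacency Marcelo mentions, and
assuming the 2.4 find_get_page()/TryLockPage() helpers; the window size is
made up):

/* From inside writepage(), opportunistically push out neighbouring pages of
 * the same mapping that are already dirty and not locked. */
static void myfs_cluster_writeout(struct address_space *mapping, unsigned long index)
{
	unsigned long i;

	for (i = index + 1; i < index + 8; i++) {
		struct page *page = find_get_page(mapping, i);

		if (!page)
			break;
		if (PageDirty(page) && !TryLockPage(page)) {
			ClearPageDirty(page);
			mapping->a_ops->writepage(page);	/* callee unlocks the page */
		}
		page_cache_release(page);
	}
}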
> Other changes: submit_bh now cleans the buffers. I don't see how
> they were getting cleaned before, it must have been try_to_free_buffers
> sending the page through sync_page_buffers, meaning they were probably
> getting written twice. Unless someone throws a clue my way, I'll send
> this out as a separate diff.
Ouch.
> page_launder doesn't fiddle with the buffermem_pages counter, it is done
> in try_to_free_buffers instead.
>
> Obvious bug, block_write_full_page zeros out the bits past the end of
> file every time. This should not be needed for normal file writes.
>
> Most testing was on ext2, who actually calls mark_buffer_dirty, and
> supports blocksizes < intel page size. More tests are running
> overnight.
It seems your code has a problem with bh flush time.
In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
be written in case its old enough. (bh->b_flushtime)
If the flush happens for an anonymous buffer, you'll end up writing all
buffers which are sitting on the same page (with block_write_anon_page),
but these other buffers are not necessarily old enough to be flushed.
Marcelo Tosatti writes:
> It seems your code has a problem with bh flush time.
>
> In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
> be written in case its old enough. (bh->b_flushtime)
>
> If the flush happens for an anonymous buffer, you'll end up writing all
> buffers which are sitting on the same page (with block_write_anon_page),
> but these other buffers are not necessarily old enough to be flushed.
This isn't really a "problem" however. The page is the _maximum_ age of
the buffer before it needs to be written. If we can efficiently write it
out with another buffer (essentially for free if they are on the same
spot on disk), then there is less work for us to do later.
Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert
On Thu, 21 Dec 2000, Andreas Dilger wrote:
> Marcelo Tosatti writes:
> > It seems your code has a problem with bh flush time.
> >
> > In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
> > be written in case its old enough. (bh->b_flushtime)
> >
> > If the flush happens for an anonymous buffer, you'll end up writing all
> > buffers which are sitting on the same page (with block_write_anon_page),
> > but these other buffers are not necessarily old enough to be flushed.
>
> This isn't really a "problem" however. The page is the _maximum_ age of
> the buffer before it needs to be written. If we can efficiently write it
> out with another buffer
> (essentially for free if they are on the same spot on disk)
Are you sure this is true for buffer pages in most cases?
On Thursday, December 21, 2000 20:54:09 -0500 Alexander Viro <[email protected]> wrote:
>
>
> On Thu, 21 Dec 2000, Chris Mason wrote:
>
>> Obvious bug, block_write_full_page zeros out the bits past the end of
>> file every time. This should not be needed for normal file writes.
>
> Unfortunately, it _is_ needed for pageout path. mmap() the last page
> of file. Dirty the data past the EOF (MMU will not catch that access).
> Let the pageout send the page to disk. You don't want to have the data
> past EOF end up on the backstore.
>
Sorry, I wasn't very clear. I'd like to find a way to just do the memset in
the one case it is needed (a real writepage), and not do it when we are using
writepage just to flush dirty buffers. I don't see how to do it though,
without making a new flush operation. That would also solve my concerns about
the change in writepage semantics (even though every FS I could find in the
kernel tree was using block_write_full_page).
thanks,
Chris
On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti <[email protected]> wrote:
>
> On Thu, 21 Dec 2000, Andreas Dilger wrote:
>
>> Marcelo Tosatti writes:
>> > It seems your code has a problem with bh flush time.
>> >
>> > In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
>> > be written in case its old enough. (bh->b_flushtime)
>> >
>> > If the flush happens for an anonymous buffer, you'll end up writing all
>> > buffers which are sitting on the same page (with block_write_anon_page),
>> > but these other buffers are not necessarily old enough to be flushed.
>>
>> This isn't really a "problem" however. The page is the _maximum_ age of
>> the buffer before it needs to be written. If we can efficiently write it
>> out with another buffer
>
>
>> (essentially for free if they are on the same spot on disk)
>
> Are you sure this is true for buffer pages in most cases?
>
>
It's a good point. block_write_anon_page could be changed to just write the oldest buffer and redirty the page (if the buffers are far apart). If memory is tight, and we *really* need the page back, it will be flushed by try_to_free_buffers.
It seems a bit nasty to me though...writepage should write the page.
-chris
On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti
<[email protected]> wrote:
>> Marcelo Tosatti writes:
>> > It seems your code has a problem with bh flush time.
>> >
>> > In flush_dirty_buffers(), a buffer may (if being called from kupdate)
>> > only be written in case its old enough. (bh->b_flushtime)
>> >
>> > If the flush happens for an anonymous buffer, you'll end up writing all
>> > buffers which are sitting on the same page (with
>> > block_write_anon_page), but these other buffers are not necessarily
>> > old enough to be flushed.
>>
A quick benchmark shows there's room for improvement here. I'll play
around with a version of block_write_anon_page that tries to be more
selective when flushing things out.
-chris
Chris Mason wrote:
>
> On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti <[email protected]> wrote:
> >
> > On Thu, 21 Dec 2000, Andreas Dilger wrote:
> >
> >> Marcelo Tosatti writes:
> >> > It seems your code has a problem with bh flush time.
> >> >
> >> > In flush_dirty_buffers(), a buffer may (if being called from kupdate) only
> >> > be written in case its old enough. (bh->b_flushtime)
> >> >
> >> > If the flush happens for an anonymous buffer, you'll end up writing all
> >> > buffers which are sitting on the same page (with block_write_anon_page),
> >> > but these other buffers are not necessarily old enough to be flushed.
> >>
> >> This isn't really a "problem" however. The page is the _maximum_ age of
> >> the buffer before it needs to be written. If we can efficiently write it
> >> out with another buffer
> >
> >> (essentially for free if they are on the same spot on disk)
> >
> > Are you sure this is true for buffer pages in most cases?
>
> It's a good point. block_write_anon_page could be changed to just
> write the oldest buffer and redirty the page (if the buffers are
> far apart). If memory is tight, and we *really* need the page back,
> it will be flushed by try_to_free_buffers.
>
> It seems a bit nasty to me though...writepage should write the page.
Um. Why cater to the uncommon case of 1K blocks? Just let
bdflush/kupdated deal with them in the normal way - it's pretty
efficient. Only try to do the clustering optimization when buffer size
matches memory page size.
--
Daniel
On Friday, December 22, 2000 17:45:57 +0100 Daniel Phillips
<[email protected]> wrote:
[ flushing a page at a time in bdflush ]
> Um. Why cater to the uncommon case of 1K blocks? Just let
> bdflush/kupdated deal with them in the normal way - it's pretty
> efficient. Only try to do the clustering optimization when buffer size
> matches memory page size.
>
This isn't really an attempt at a clustering optimization. The problem at
hand is that buffer cache buffers can be on relatively random pages. So, a
page might have buffers that are very far apart, where one needs flushing
and the other doesn't.
In the blocksize == page size case, this won't happen, and we don't lose
any speed over the existing code. In the blocksize < pagesize case, my new
code is slower, so my goal is to fix just that problem.
Real write clustering would be a different issue entirely, and is worth
doing ;-)
-chris
On Fri, 22 Dec 2000, Chris Mason wrote:
>
>
> On Thursday, December 21, 2000 22:38:04 -0200 Marcelo Tosatti
> <[email protected]> wrote:
>
> >> Marcelo Tosatti writes:
> >> > It seems your code has a problem with bh flush time.
> >> >
> >> > In flush_dirty_buffers(), a buffer may (if being called from kupdate)
> >> > only be written in case its old enough. (bh->b_flushtime)
> >> >
> >> > If the flush happens for an anonymous buffer, you'll end up writing all
> >> > buffers which are sitting on the same page (with
> >> > block_write_anon_page), but these other buffers are not necessarily
> >> > old enough to be flushed.
> >>
>
> A quick benchmark shows there's room for improvement here. I'll play
> around with a version of block_write_anon_page that tries to be more
> selective when flushing things out.
There is one more nasty issue to deal with.
You only want to take into account the buffer flushtime if the
"check_flushtime" parameter is passed as true to flush_dirty_buffers
(which is done by kupdate).
Thinking a bit more about the issue, I don't see any reason why we want to
write all buffers of an anonymous page at
sync_buffers/flush_dirty_buffers.
I think we could simply write the buffer with ll_rw_block() if the page
which this buffer is sitting on does not have a mapping->a_ops->writepage
operation defined.
Chris?
On Friday, December 22, 2000 17:52:28 -0200 Marcelo Tosatti
<[email protected]> wrote:
> There is one more nasty issue to deal with.
>
> You only want to take into account the buffer flushtime if
> "check_flushtime" parameter is passed as true to flush_dirty_buffers
> (which is done by kupdate).
>
> Thinking a bit more about the issue, I dont see any reason why we want to
> write all buffers of an anonymous page at
> sync_buffers/flush_dirty_buffers.
>
> I think we could simply write the buffer with ll_rw_block() if the page
> which this buffer is sitting on does not have mapping->a_ops->writepage
> operation defined.
>
It is enough to leave buffer heads we don't flush on the dirty list (and
redirty the page); they'll get written by a future loop through
flush_dirty_pages, or by page_launder. We could use ll_rw_block instead,
even though anon pages do have a writepage with this patch (just check if
page->mapping == &anon_space_mapping).
There are lots of things we could try here, including some logic inside
block_write_anon_page to check the current memory pressure/dirty levels to
see how much work should be done.
I'll play with this a bit, but more ideas would be appreciated.
BTW, a recent change to my local code was to remove the ll_rw_block call from
sync_page_buffers. It seemed cleaner than making try_to_free_buffers
understand that sometimes writepage will be called, and sometimes the page
will end up unlocked because of it....comments?
-chris
On Fri, 22 Dec 2000, Chris Mason wrote:
> It is enough to leave buffer heads we don't flush on the dirty list (and
> redirty the page), they'll get written by a future loop through
> flush_dirty_pages, or by page_launder. We could use ll_rw_block instead,
> even though anon pages do have a writepage with this patch (just check if
> page->mapping == &anon_space_mapping).
If we use ll_rw_block directly on buffers of anonymous pages
(page->mapping == &anon_space_mapping) instead of using
dirty_list_writepage() (which will end up calling block_write_anon_page)
we can fix the buffer flushtime issue.
> There are lots of things we could try here, including some logic inside
> block_write_anon_page to check the current memory pressure/dirty levels to
> see how much work should be done.
In writepage() context we do not know whether we should care about the
flushtime.
> I'll play with this a bit, but more ideas would be appreciated.
>
> BTW, recent change to my local code was to remove the ll_rw_block call from
> sync_page_buffers. It seemed cleaner than making try_to_free_buffers
> understand that sometimes writepage will be called, and sometimes the page
> will end up unlocked because of it....comments?
Could you please send me your latest patch?
Thanks
Chris Mason wrote:
> It is enough to leave buffer heads we don't flush on the dirty list (and
> redirty the page), they'll get written by a future loop through
> flush_dirty_buffers, or by page_launder. We could use ll_rw_block instead,
> even though anon pages do have a writepage with this patch (just check if
> page->mapping == &anon_space_mapping).
anon_space_mapping... this is really useful. This would make a nice
patch on its own.
--
Daniel
On Friday, December 22, 2000 21:26:33 -0200 Marcelo Tosatti
> If we use ll_rw_block directly on buffers of anonymous pages
> (page->mapping == &anon_space_mapping) instead of using
> dirty_list_writepage() (which will end up calling block_write_anon_page)
> we can fix the buffer flushtime issue.
>
Ok, I'm just being stubborn. The point of the patch was to get rid
of the ll_rw_block calls, so I'm resisting adding one back in ;-)
But, your way does seem like the least complicated method to honor
the flushtime param for anon pages, and I'll switch over to that.
I've updated to test13-pre4, and removed the hunk for submit_bh.
Looks as though pre4 changed the submit_bh callers to clear the dirty
bit, so my code does the same.
Other changes: sync_page_buffers doesn't write blocks on the page, it
only waits on them. Minor change to when the page count is
increased before writepage is called.
I still need to update fsync_inode_buffers....doesn't look like that
will happen before I head off to visit family, where for some reason
they've banned me from reading email ;-) I'll be back on Tuesday,
hope everyone has a good holiday. New patch below:
-chris
diff -urN linux-test13-pre4/fs/buffer.c linux-anon-space/fs/buffer.c
--- linux-test13-pre4/fs/buffer.c Sat Dec 23 13:14:48 2000
+++ linux-anon-space/fs/buffer.c Sat Dec 23 13:30:24 2000
@@ -97,6 +97,16 @@
static int grow_buffers(int size);
static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+
+static struct address_space_operations anon_space_ops = {
+ writepage: block_write_anon_page,
+ sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+ pages: { &anon_space_mapping.pages, &anon_space_mapping.pages },
+ a_ops: &anon_space_ops,
+} ;
/* This is used by some architectures to estimate available memory. */
atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +171,30 @@
atomic_dec(&bh->b_count);
}
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page) {
+ int (*writepage)(struct page *) ;
+ int ret ;
+
+ writepage = page->mapping->a_ops->writepage ;
+
+ if (!writepage) {
+ writepage = anon_space_ops.writepage ;
+ }
+
+ ClearPageDirty(page) ;
+ ret = writepage(page) ;
+ if (ret == 1) {
+ SetPageDirty(page) ;
+ UnlockPage(page) ;
+ }
+ return ret ;
+}
+
/* Call sync_buffers with wait!=0 to ensure that the call does not
* return until all buffer writes have completed. Sync() may return
* before the writes have finished; fsync() may not.
@@ -175,6 +209,7 @@
{
int i, retry, pass = 0, err = 0;
struct buffer_head * bh, *next;
+ struct page *page ;
/* One pass for no-wait, three for wait:
* 0) write out all dirty, unlocked buffers;
@@ -230,10 +265,22 @@
if (!buffer_dirty(bh) || pass >= 2)
continue;
- atomic_inc(&bh->b_count);
+ page = bh->b_page ;
+ page_cache_get(page) ;
+ if (TryLockPage(page)) {
+ if (!wait || !pass) {
+ retry = 1 ;
+ continue ;
+ }
+ spin_unlock(&lru_list_lock);
+ wait_on_page(page) ;
+ page_cache_release(page) ;
+ goto repeat ;
+ }
spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- atomic_dec(&bh->b_count);
+
+ dirty_list_writepage(page) ;
+ page_cache_release(page) ;
retry = 1;
goto repeat;
}
@@ -1101,8 +1148,10 @@
int dispose = BUF_CLEAN;
if (buffer_locked(bh))
dispose = BUF_LOCKED;
- if (buffer_dirty(bh))
+ if (buffer_dirty(bh)) {
dispose = BUF_DIRTY;
+ SetPageDirty(bh->b_page) ;
+ }
if (buffer_protected(bh))
dispose = BUF_PROTECTED;
if (dispose != bh->b_list) {
@@ -1478,6 +1527,53 @@
* "Dirty" is valid only with the last case (mapped+uptodate).
*/
+static int block_write_anon_page(struct page *page)
+{
+ struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ int i, nr = 0 ;
+ int partial = 0 ;
+ int ret = 0 ;
+
+ if (!PageLocked(page))
+ BUG();
+
+ if (!page->buffers)
+ BUG() ;
+
+ head = page->buffers;
+ bh = head;
+
+ /* Stage 1: find the dirty buffers, lock them for submit_bh */
+ do {
+ if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+ if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_io_async;
+ clear_bit(BH_Dirty, &bh->b_state) ;
+ atomic_inc(&bh->b_count);
+ arr[nr++] = bh ;
+ } else {
+ partial = 1 ;
+ unlock_buffer(bh) ;
+ }
+ } else {
+ partial = 1 ;
+ }
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ /* Stage 2: submit the IO */
+ for (i = 0 ; i < nr ; i++) {
+ submit_bh(WRITE, arr[i]) ;
+ }
+ /* Done - end_buffer_io_async will unlock */
+ if (!partial)
+ SetPageUptodate(page);
+ if (nr == 0) {
+ UnlockPage(page) ;
+ }
+ return ret ;
+}
+
/*
* block_write_full_page() is SMP-safe - currently it's still
* being called with the kernel lock held, but the code is ready.
@@ -1487,6 +1583,10 @@
int err, i;
unsigned long block;
struct buffer_head *bh, *head;
+ int nr = 0 ;
+ struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+ int page_ok = Page_Uptodate(page) ;
+ int partial = 0;
if (!PageLocked(page))
BUG();
@@ -1509,36 +1609,46 @@
*
* Leave it to the low-level FS to make all those
* decisions (block #0 may actually be a valid block)
+ *
+ * only bother when the page is up to date or the buffer
+ * is dirty.
*/
- if (!buffer_mapped(bh)) {
- err = get_block(inode, block, bh, 1);
- if (err)
- goto out;
- if (buffer_new(bh))
- unmap_underlying_metadata(bh);
+ if (page_ok || buffer_dirty(bh)) {
+ if (!buffer_mapped(bh)) {
+ err = get_block(inode, block, bh, 1);
+ if (err)
+ goto out;
+ if (buffer_new(bh))
+ unmap_underlying_metadata(bh);
+ }
+ arr[nr++] = bh ;
+ } else {
+ partial = 1 ;
}
bh = bh->b_this_page;
block++;
} while (bh != head);
/* Stage 2: lock the buffers, mark them clean */
- do {
+ for (i = 0 ; i < nr ; i++) {
+ bh = arr[i] ;
lock_buffer(bh);
bh->b_end_io = end_buffer_io_async;
atomic_inc(&bh->b_count);
set_bit(BH_Uptodate, &bh->b_state);
clear_bit(BH_Dirty, &bh->b_state);
- bh = bh->b_this_page;
- } while (bh != head);
+ }
- /* Stage 3: submit the IO */
- do {
- submit_bh(WRITE, bh);
- bh = bh->b_this_page;
- } while (bh != head);
+ for (i = 0 ; i < nr ; i++) {
+ submit_bh(WRITE, arr[i]) ;
+ }
+
+ if (nr == 0)
+ UnlockPage(page) ;
/* Done - end_buffer_io_async will unlock */
- SetPageUptodate(page);
+ if (!partial)
+ SetPageUptodate(page);
return 0;
out:
@@ -1658,6 +1768,45 @@
}
/*
+** just sets the dirty bits for a range of buffers in the page. Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+ unsigned from, unsigned to)
+{
+ unsigned block_start, block_end;
+ int partial = 0 ;
+ unsigned blocksize;
+ struct buffer_head *bh, *head;
+
+ blocksize = inode->i_sb->s_blocksize;
+
+ for(bh = head = page->buffers, block_start = 0;
+ bh != head || !block_start;
+ block_start=block_end, bh = bh->b_this_page) {
+ block_end = block_start + blocksize;
+ if (block_end <= from || block_start >= to) {
+ if (!buffer_uptodate(bh))
+ partial = 1;
+ } else {
+ set_bit(BH_Uptodate, &bh->b_state);
+ if (!atomic_set_buffer_dirty(bh)) {
+ buffer_insert_inode_queue(bh, inode);
+ }
+ }
+ }
+ /*
+ * is this a partial write that happened to make all buffers
+ * uptodate then we can optimize away a bogus readpage() for
+ * the next read(). Here we 'discover' wether the page went
+ * uptodate as a result of this (potentially partial) write.
+ */
+ if (!partial)
+ SetPageUptodate(page);
+ return 0;
+}
+
+/*
* Generic "read page" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
* Reads the page asynchronously --- the unlock_buffer() and
@@ -1947,13 +2096,23 @@
if (!err) {
memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
- __block_commit_write(inode,page,0,offset);
+
+ /* this will just set the dirty bits for block_write_full_page
+ ** it is only safe because we have the page locked and
+ ** nobody will try to write the buffers between
+ ** the block_dirty_range and the write_full_page calls
+ ** we have to clear the page up to date so the buffers
+ ** past the end of the file won't get written
+ */
+ __block_dirty_range(inode,page,0,offset);
+ ClearPageUptodate(page);
+ err = __block_write_full_page(inode, page, get_block) ;
done:
kunmap(page);
- UnlockPage(page);
return err;
}
ClearPageUptodate(page);
+ UnlockPage(page);
goto done;
}
@@ -2244,7 +2403,7 @@
struct buffer_head *bh, *tmp;
struct buffer_head * insert_point;
int isize;
-
+ unsigned long index ;
if ((size & 511) || (size > PAGE_SIZE)) {
printk("VFS: grow_buffers: size = %d\n",size);
return 0;
@@ -2260,6 +2419,16 @@
isize = BUFSIZE_INDEX(size);
+ /* don't put this buffer head on the free list until the
+ ** page is setup. Is there a better index to use? Would 0
+ ** be good enough?
+ */
+ page->flags &= ~(1 << PG_referenced);
+ index = atomic_read(&buffermem_pages) ;
+ atomic_inc(&buffermem_pages);
+ add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+ page->buffers = bh;
+
spin_lock(&free_list[isize].lock);
insert_point = free_list[isize].list;
tmp = bh;
@@ -2283,11 +2452,7 @@
free_list[isize].list = bh;
spin_unlock(&free_list[isize].lock);
- page->buffers = bh;
- page->flags &= ~(1 << PG_referenced);
- lru_cache_add(page);
UnlockPage(page);
- atomic_inc(&buffermem_pages);
return 1;
no_buffer_head:
@@ -2309,7 +2474,6 @@
*
* Wait:
* 0 - no wait (this does not get called - see try_to_free_buffers below)
- * 1 - start IO for dirty buffers
* 2 - wait for completion of locked buffers
*/
static void sync_page_buffers(struct buffer_head *bh, int wait)
@@ -2319,11 +2483,9 @@
do {
struct buffer_head *p = tmp;
tmp = tmp->b_this_page;
- if (buffer_locked(p)) {
- if (wait > 1)
- __wait_on_buffer(p);
- } else if (buffer_dirty(p))
- ll_rw_block(WRITE, 1, &p);
+ if (buffer_locked(p) && wait > 1) {
+ __wait_on_buffer(p);
+ }
} while (tmp != bh);
}
@@ -2386,6 +2548,9 @@
/* And free the page */
page->buffers = NULL;
+ if (page->mapping == (&anon_space_mapping)) {
+ atomic_dec(&buffermem_pages) ;
+ }
page_cache_release(page);
spin_unlock(&free_list[index].lock);
write_unlock(&hash_table_lock);
@@ -2564,6 +2729,7 @@
static int flush_dirty_buffers(int check_flushtime)
{
struct buffer_head * bh, *next;
+ struct page *page ;
int flushed = 0, i;
restart:
@@ -2580,6 +2746,8 @@
}
if (buffer_locked(bh))
continue;
+ if (!buffer_uptodate(bh))
+ continue ;
if (check_flushtime) {
/* The dirty lru list is chronologically ordered so
@@ -2592,13 +2760,15 @@
if (++flushed > bdf_prm.b_un.ndirty)
goto out_unlock;
}
-
- /* OK, now we are committed to write it out. */
- atomic_inc(&bh->b_count);
- spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- atomic_dec(&bh->b_count);
-
+ page = bh->b_page ;
+ page_cache_get(page) ;
+ if (TryLockPage(page)) {
+ page_cache_release(page) ;
+ continue ;
+ }
+ spin_unlock(&lru_list_lock) ;
+ dirty_list_writepage(page) ;
+ page_cache_release(page) ;
if (current->need_resched)
schedule();
goto restart;
diff -urN linux-test13-pre4/mm/page_alloc.c linux-anon-space/mm/page_alloc.c
--- linux-test13-pre4/mm/page_alloc.c Tue Nov 28 13:54:31 2000
+++ linux-anon-space/mm/page_alloc.c Thu Dec 21 16:24:44 2000
@@ -317,11 +317,12 @@
/*
* If we are about to get low on free pages and cleaning
* the inactive_dirty pages would fix the situation,
- * wake up bdflush.
+ * wake up kswapd here as well, so page_launder can start
+ * sending things to disk.
*/
else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
&& nr_inactive_dirty_pages >= freepages.high)
- wakeup_bdflush(0);
+ wakeup_kswapd(0);
try_again:
/*
diff -urN linux-test13-pre4/mm/vmscan.c linux-anon-space/mm/vmscan.c
--- linux-test13-pre4/mm/vmscan.c Sat Dec 23 13:14:26 2000
+++ linux-anon-space/mm/vmscan.c Sat Dec 23 13:33:53 2000
@@ -678,7 +678,6 @@
/* The page was only in the buffer cache. */
} else if (!page->mapping) {
- atomic_dec(&buffermem_pages);
freed_page = 1;
cleaned_pages++;
@@ -739,9 +738,6 @@
* free anything yet, we wait synchronously on the writeout of
* MAX_SYNC_LAUNDER pages.
*
- * We also wake up bdflush, since bdflush should, under most
- * loads, flush out the dirty pages before we have to wait on
- * IO.
*/
if (can_get_io_locks && !launder_loop && free_shortage()) {
launder_loop = 1;
@@ -750,8 +746,6 @@
sync = 0;
/* We only do a few "out of order" flushes. */
maxlaunder = MAX_LAUNDER;
- /* Kflushd takes care of the rest. */
- wakeup_bdflush(0);
goto dirty_page_rescan;
}
On Sat, 23 Dec 2000, Chris Mason wrote:
>
> I've updated to test13-pre4, and removed the hunk for submit_bh.
> Looks as though pre4 changed the submit_bh callers to clear the dirty
> bit, so my code does the same.
Basically, I wanted to think of "submit_bh()" as a pure IO thing. When we
call submit_bh(), that is basically the same as "start IO on this thing".
Which implies to me that submit_bh() doesn't care, or know, about why the
higher layers did this.
Which is why I prefer the higher layers handling the dirty/uptodate/xxx
bits.
Linus
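(Schematically, the caller-side pattern is the one block_write_anon_page()
in Chris's patch already uses -- all of these names are from the code
quoted in this thread:)

	lock_buffer(bh);			/* caller takes the buffer lock */
	bh->b_end_io = end_buffer_io_async;	/* caller picks the completion handler */
	clear_bit(BH_Dirty, &bh->b_state);	/* caller clears the dirty bit */
	atomic_inc(&bh->b_count);
	submit_bh(WRITE, bh);			/* submit_bh() just starts the IO */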
On Saturday, December 23, 2000 11:02:53 -0800 Linus Torvalds
<[email protected]> wrote:
>
> Which is why I prefer the higher layers handling the dirty/uptodate/xxx
> bits.
>
Grin, I should have taken the hint when we talked about the buffer
up-to-date checks for block_read_full_page; it made sense then too.
brw_page doesn't clear the dirty bit in pre4, was that on purpose?
-chris
Hi guys,
Here's my latest code, which uses ll_rw_block for anon pages (or
pages without a writepage func) when flush_dirty_buffers,
sync_buffers, or fsync_inode_buffers are flushing things. This
seems to have fixed my slowdown on 1k buffer sizes, but I
haven't done extensive benchmarks yet.
Other changes: After freeing a page with buffers, page_launder
now stops if (!free_shortage()). This is a mod of the check where
page_launder checked free_shortage after freeing a buffer cache
page. Code outside buffer.c can't detect buffer cache pages with
this patch, so the old check doesn't apply.
My change doesn't seem quite right though, if page_launder wants
to stop when there isn't a shortage, it should do that regardless of
whether the page it just freed had buffers. It looks like this was added
so bdflush could call page_launder, and get an early out after
freeing some buffer heads, but I'm not sure.
In test13-pre4, invalidate_buffers skips buffers on a page
with a mapping. I changed that to skip mappings other than the
anon space mapping.
Comments and/or suggestions on how to make better use of this stuff
are more than welcome ;-)
-chris
diff -urN linux-test13-pre4/fs/buffer.c linux-anon-space/fs/buffer.c
--- linux-test13-pre4/fs/buffer.c Sat Dec 23 13:14:48 2000
+++ linux-anon-space/fs/buffer.c Tue Dec 26 00:58:06 2000
@@ -97,6 +97,17 @@
static int grow_buffers(int size);
static void __refile_buffer(struct buffer_head *);
+static int block_write_anon_page(struct page *);
+static void end_buffer_io_async(struct buffer_head * bh, int uptodate) ;
+
+static struct address_space_operations anon_space_ops = {
+ writepage: block_write_anon_page,
+ sync_page: block_sync_page,
+} ;
+static struct address_space anon_space_mapping = {
+ pages: { &anon_space_mapping.pages, &anon_space_mapping.pages },
+ a_ops: &anon_space_ops,
+} ;
/* This is used by some architectures to estimate available memory. */
atomic_t buffermem_pages = ATOMIC_INIT(0);
@@ -161,6 +172,73 @@
atomic_dec(&bh->b_count);
}
+/* just for use with anon pages, or pages that don't provide their own
+** writepage func. We just want to write bh, not the whole page, so we
+** queue that io here instead of calling writepage.
+*/
+static int __dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+ int other_dirty = 0 ;
+ struct buffer_head *cur ;
+
+ /* check for other dirty buffers on this page. If there are none,
+ ** clear the page dirty bit
+ */
+ cur = bh->b_this_page ;
+ while(cur != bh) {
+ other_dirty += buffer_dirty(cur) ;
+ cur = cur->b_this_page ;
+ }
+ if (other_dirty == 0) {
+ ClearPageDirty(page) ;
+ }
+
+ /* we want the page available for locking again right away.
+ ** someone walking the dirty buffer list might find another
+ ** buffer from this page, and we don't want them to skip it in
+ ** favor of a younger buffer.
+ */
+ atomic_inc(&bh->b_count) ;
+ ll_rw_block(WRITE, 1, &bh) ;
+ atomic_dec(&bh->b_count) ;
+ UnlockPage(page) ;
+ return 0 ;
+}
+
+/*
+** util function for sync_buffers and flush_dirty_buffers
+** uses either the writepage func supplied in the page's mapping,
+** or the anon address space writepage
+*/
+static int dirty_list_writepage(struct page *page, struct buffer_head *bh) {
+ int (*writepage)(struct page *) ;
+ int ret ;
+
+ /* someone wrote this page while we were locking before */
+ if (!PageDirty(page) && !buffer_dirty(bh)) {
+ UnlockPage(page) ;
+ return 0 ;
+ }
+ writepage = page->mapping->a_ops->writepage ;
+
+ /* For anon pages, and pages that don't have a writepage
+ ** func, just write this one dirty buffer. __dirty_list_writepage
+ ** does a little more work to make sure the page dirty bit is cleared
+ ** when we are the only dirty buffer on this page
+ */
+ if (!writepage || page->mapping == &anon_space_mapping) {
+ writepage = anon_space_ops.writepage ;
+ return __dirty_list_writepage(page, bh) ;
+ }
+
+ ClearPageDirty(page) ;
+ ret = writepage(page) ;
+ if (ret == 1) {
+ SetPageDirty(page) ;
+ UnlockPage(page) ;
+ }
+ return ret ;
+}
+
/* Call sync_buffers with wait!=0 to ensure that the call does not
* return until all buffer writes have completed. Sync() may return
* before the writes have finished; fsync() may not.
@@ -175,6 +253,7 @@
{
int i, retry, pass = 0, err = 0;
struct buffer_head * bh, *next;
+ struct page *page ;
/* One pass for no-wait, three for wait:
* 0) write out all dirty, unlocked buffers;
@@ -230,10 +309,27 @@
if (!buffer_dirty(bh) || pass >= 2)
continue;
- atomic_inc(&bh->b_count);
+ page = bh->b_page ;
+ page_cache_get(page) ;
+ if (TryLockPage(page)) {
+ if (!wait || !pass) {
+ retry = 1 ;
+ continue ;
+ }
+ spin_unlock(&lru_list_lock);
+ wait_on_page(page) ;
+ page_cache_release(page) ;
+ goto repeat ;
+ }
spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- atomic_dec(&bh->b_count);
+
+ /* if the writepage func returns 1, it is
+ ** responsible for marking the buffers dirty
+ ** again (or not marking them clean at all).
+ ** we'll catch them again on the next pass
+ */
+ dirty_list_writepage(page, bh) ;
+ page_cache_release(page) ;
retry = 1;
goto repeat;
}
@@ -644,7 +740,7 @@
if (bh->b_dev != dev)
continue;
/* Part of a mapping? */
- if (bh->b_page->mapping)
+ if (bh->b_page->mapping != &anon_space_mapping)
continue;
if (buffer_locked(bh)) {
atomic_inc(&bh->b_count);
@@ -852,13 +948,14 @@
int fsync_inode_buffers(struct inode *inode)
{
struct buffer_head *bh;
- struct inode tmp;
+ struct inode tmp ;
int err = 0, err2;
+ struct page * page ;
+ int ret ;
INIT_LIST_HEAD(&tmp.i_dirty_buffers);
spin_lock(&lru_list_lock);
-
while (!list_empty(&inode->i_dirty_buffers)) {
bh = BH_ENTRY(inode->i_dirty_buffers.next);
list_del(&bh->b_inode_buffers);
@@ -868,11 +965,28 @@
bh->b_inode = &tmp;
list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
if (buffer_dirty(bh)) {
- atomic_inc(&bh->b_count);
+ page = bh->b_page ;
+ page_cache_get(page) ;
spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- brelse(bh);
+
+ LockPage(page) ;
+ ret = dirty_list_writepage(page, bh) ;
+ page_cache_release(page) ;
+
spin_lock(&lru_list_lock);
+
+ /* if the writepage func decided to skip
+ ** this page, we have to put it back onto
+ ** the dirty buffer list. we add onto the
+ ** tail so this buffer will be retried after
+ ** all the other writes have gone through.
+ */
+ if (ret == 1) {
+ list_del(&bh->b_inode_buffers) ;
+ list_add_tail(&bh->b_inode_buffers,
+ &inode->i_dirty_buffers) ;
+ bh->b_inode = inode ;
+ }
}
}
}
@@ -1101,8 +1215,10 @@
int dispose = BUF_CLEAN;
if (buffer_locked(bh))
dispose = BUF_LOCKED;
- if (buffer_dirty(bh))
+ if (buffer_dirty(bh)) {
dispose = BUF_DIRTY;
+ SetPageDirty(bh->b_page) ;
+ }
if (buffer_protected(bh))
dispose = BUF_PROTECTED;
if (dispose != bh->b_list) {
@@ -1478,6 +1594,53 @@
* "Dirty" is valid only with the last case (mapped+uptodate).
*/
+static int block_write_anon_page(struct page *page)
+{
+ struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ int i, nr = 0 ;
+ int partial = 0 ;
+ int ret = 0 ;
+
+ if (!PageLocked(page))
+ BUG();
+
+ if (!page->buffers)
+ BUG() ;
+
+ head = page->buffers;
+ bh = head;
+
+ /* Stage 1: find the dirty buffers, lock them for submit_bh */
+ do {
+ if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
+ if (buffer_uptodate(bh) && buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_io_async;
+ clear_bit(BH_Dirty, &bh->b_state) ;
+ atomic_inc(&bh->b_count);
+ arr[nr++] = bh ;
+ } else {
+ partial = 1 ;
+ unlock_buffer(bh) ;
+ }
+ } else {
+ partial = 1 ;
+ }
+ bh = bh->b_this_page;
+ } while (bh != head);
+
+ /* Stage 2: submit the IO */
+ for (i = 0 ; i < nr ; i++) {
+ submit_bh(WRITE, arr[i]) ;
+ }
+ /* Done - end_buffer_io_async will unlock */
+ if (!partial)
+ SetPageUptodate(page);
+ if (nr == 0) {
+ UnlockPage(page) ;
+ }
+ return ret ;
+}
+
/*
* block_write_full_page() is SMP-safe - currently it's still
* being called with the kernel lock held, but the code is ready.
@@ -1487,6 +1650,10 @@
int err, i;
unsigned long block;
struct buffer_head *bh, *head;
+ int nr = 0 ;
+ struct buffer_head *arr[MAX_BUF_PER_PAGE] ;
+ int page_ok = Page_Uptodate(page) ;
+ int partial = 0;
if (!PageLocked(page))
BUG();
@@ -1509,36 +1676,46 @@
*
* Leave it to the low-level FS to make all those
* decisions (block #0 may actually be a valid block)
+ *
+ * only bother when the page is up to date or the buffer
+ * is dirty.
*/
- if (!buffer_mapped(bh)) {
- err = get_block(inode, block, bh, 1);
- if (err)
- goto out;
- if (buffer_new(bh))
- unmap_underlying_metadata(bh);
+ if (page_ok || buffer_dirty(bh)) {
+ if (!buffer_mapped(bh)) {
+ err = get_block(inode, block, bh, 1);
+ if (err)
+ goto out;
+ if (buffer_new(bh))
+ unmap_underlying_metadata(bh);
+ }
+ arr[nr++] = bh ;
+ } else {
+ partial = 1 ;
}
bh = bh->b_this_page;
block++;
} while (bh != head);
/* Stage 2: lock the buffers, mark them clean */
- do {
+ for (i = 0 ; i < nr ; i++) {
+ bh = arr[i] ;
lock_buffer(bh);
bh->b_end_io = end_buffer_io_async;
atomic_inc(&bh->b_count);
set_bit(BH_Uptodate, &bh->b_state);
clear_bit(BH_Dirty, &bh->b_state);
- bh = bh->b_this_page;
- } while (bh != head);
+ }
- /* Stage 3: submit the IO */
- do {
- submit_bh(WRITE, bh);
- bh = bh->b_this_page;
- } while (bh != head);
+ for (i = 0 ; i < nr ; i++) {
+ submit_bh(WRITE, arr[i]) ;
+ }
+
+ if (nr == 0)
+ UnlockPage(page) ;
/* Done - end_buffer_io_async will unlock */
- SetPageUptodate(page);
+ if (!partial)
+ SetPageUptodate(page);
return 0;
out:
@@ -1658,6 +1835,45 @@
}
/*
+** just sets the dirty bits for a range of buffers in the page. Does
+** not balance the dirty list, or put the buffers onto the dirty list
+*/
+static int __block_dirty_range(struct inode *inode, struct page *page,
+ unsigned from, unsigned to)
+{
+ unsigned block_start, block_end;
+ int partial = 0 ;
+ unsigned blocksize;
+ struct buffer_head *bh, *head;
+
+ blocksize = inode->i_sb->s_blocksize;
+
+ for(bh = head = page->buffers, block_start = 0;
+ bh != head || !block_start;
+ block_start=block_end, bh = bh->b_this_page) {
+ block_end = block_start + blocksize;
+ if (block_end <= from || block_start >= to) {
+ if (!buffer_uptodate(bh))
+ partial = 1;
+ } else {
+ set_bit(BH_Uptodate, &bh->b_state);
+ if (!atomic_set_buffer_dirty(bh)) {
+ buffer_insert_inode_queue(bh, inode);
+ }
+ }
+ }
+ /*
+ * is this a partial write that happened to make all buffers
+ * uptodate then we can optimize away a bogus readpage() for
+ * the next read(). Here we 'discover' wether the page went
+ * uptodate as a result of this (potentially partial) write.
+ */
+ if (!partial)
+ SetPageUptodate(page);
+ return 0;
+}
+
+/*
* Generic "read page" function for block devices that have the normal
* get_block functionality. This is most of the block device filesystems.
* Reads the page asynchronously --- the unlock_buffer() and
@@ -1947,13 +2163,23 @@
if (!err) {
memset(page_address(page) + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
- __block_commit_write(inode,page,0,offset);
+
+ /* this will just set the dirty bits for block_write_full_page
+ ** it is only safe because we have the page locked and
+ ** nobody will try to write the buffers between
+ ** the block_dirty_range and the write_full_page calls
+ ** we have to clear the page up to date so the buffers
+ ** past the end of the file won't get written
+ */
+ __block_dirty_range(inode,page,0,offset);
+ ClearPageUptodate(page);
+ err = __block_write_full_page(inode, page, get_block) ;
done:
kunmap(page);
- UnlockPage(page);
return err;
}
ClearPageUptodate(page);
+ UnlockPage(page);
goto done;
}
@@ -2244,7 +2470,7 @@
struct buffer_head *bh, *tmp;
struct buffer_head * insert_point;
int isize;
-
+ unsigned long index ;
if ((size & 511) || (size > PAGE_SIZE)) {
printk("VFS: grow_buffers: size = %d\n",size);
return 0;
@@ -2260,6 +2486,16 @@
isize = BUFSIZE_INDEX(size);
+ /* don't put this buffer head on the free list until the
+ ** page is setup. Is there a better index to use? Would 0
+ ** be good enough?
+ */
+ page->flags &= ~(1 << PG_referenced);
+ index = atomic_read(&buffermem_pages) ;
+ atomic_inc(&buffermem_pages);
+ add_to_page_cache_locked(page, &anon_space_mapping, index) ;
+ page->buffers = bh;
+
spin_lock(&free_list[isize].lock);
insert_point = free_list[isize].list;
tmp = bh;
@@ -2283,11 +2519,7 @@
free_list[isize].list = bh;
spin_unlock(&free_list[isize].lock);
- page->buffers = bh;
- page->flags &= ~(1 << PG_referenced);
- lru_cache_add(page);
UnlockPage(page);
- atomic_inc(&buffermem_pages);
return 1;
no_buffer_head:
@@ -2309,7 +2541,6 @@
*
* Wait:
* 0 - no wait (this does not get called - see try_to_free_buffers below)
- * 1 - start IO for dirty buffers
* 2 - wait for completion of locked buffers
*/
static void sync_page_buffers(struct buffer_head *bh, int wait)
@@ -2319,11 +2550,9 @@
do {
struct buffer_head *p = tmp;
tmp = tmp->b_this_page;
- if (buffer_locked(p)) {
- if (wait > 1)
- __wait_on_buffer(p);
- } else if (buffer_dirty(p))
- ll_rw_block(WRITE, 1, &p);
+ if (buffer_locked(p) && wait > 1) {
+ __wait_on_buffer(p);
+ }
} while (tmp != bh);
}
@@ -2386,6 +2615,9 @@
/* And free the page */
page->buffers = NULL;
+ if (page->mapping == &anon_space_mapping) {
+ atomic_dec(&buffermem_pages) ;
+ }
page_cache_release(page);
spin_unlock(&free_list[index].lock);
write_unlock(&hash_table_lock);
@@ -2564,6 +2796,7 @@
static int flush_dirty_buffers(int check_flushtime)
{
struct buffer_head * bh, *next;
+ struct page *page ;
int flushed = 0, i;
restart:
@@ -2580,6 +2813,8 @@
}
if (buffer_locked(bh))
continue;
+ if (!buffer_uptodate(bh))
+ continue ;
if (check_flushtime) {
/* The dirty lru list is chronologically ordered so
@@ -2592,13 +2827,15 @@
if (++flushed > bdf_prm.b_un.ndirty)
goto out_unlock;
}
-
- /* OK, now we are committed to write it out. */
- atomic_inc(&bh->b_count);
- spin_unlock(&lru_list_lock);
- ll_rw_block(WRITE, 1, &bh);
- atomic_dec(&bh->b_count);
-
+ page = bh->b_page ;
+ page_cache_get(page) ;
+ if (TryLockPage(page)) {
+ page_cache_release(page) ;
+ continue ;
+ }
+ spin_unlock(&lru_list_lock) ;
+ dirty_list_writepage(page, bh) ;
+ page_cache_release(page) ;
if (current->need_resched)
schedule();
goto restart;
diff -urN linux-test13-pre4/mm/page_alloc.c linux-anon-space/mm/page_alloc.c
--- linux-test13-pre4/mm/page_alloc.c Tue Nov 28 13:54:31 2000
+++ linux-anon-space/mm/page_alloc.c Sun Dec 24 19:00:31 2000
@@ -317,11 +317,12 @@
/*
* If we are about to get low on free pages and cleaning
* the inactive_dirty pages would fix the situation,
- * wake up bdflush.
+ * wake up kswapd here as well, so page_launder can start
+ * sending things to disk.
*/
else if (free_shortage() && nr_inactive_dirty_pages > free_shortage()
&& nr_inactive_dirty_pages >= freepages.high)
- wakeup_bdflush(0);
+ wakeup_kswapd(0);
try_again:
/*
diff -urN linux-test13-pre4/mm/vmscan.c linux-anon-space/mm/vmscan.c
--- linux-test13-pre4/mm/vmscan.c Sat Dec 23 13:14:26 2000
+++ linux-anon-space/mm/vmscan.c Tue Dec 26 00:52:32 2000
@@ -678,7 +678,6 @@
/* The page was only in the buffer cache. */
} else if (!page->mapping) {
- atomic_dec(&buffermem_pages);
freed_page = 1;
cleaned_pages++;
@@ -701,10 +700,9 @@
page_cache_release(page);
/*
- * If we're freeing buffer cache pages, stop when
- * we've got enough free memory.
+ * stop when we've got enough free memory.
*/
- if (freed_page && !free_shortage())
+ if (!free_shortage())
break;
continue;
} else if (page->mapping && !PageDirty(page)) {
@@ -739,9 +737,6 @@
* free anything yet, we wait synchronously on the writeout of
* MAX_SYNC_LAUNDER pages.
*
- * We also wake up bdflush, since bdflush should, under most
- * loads, flush out the dirty pages before we have to wait on
- * IO.
*/
if (can_get_io_locks && !launder_loop && free_shortage()) {
launder_loop = 1;
@@ -750,8 +745,6 @@
sync = 0;
/* We only do a few "out of order" flushes. */
maxlaunder = MAX_LAUNDER;
- /* Kflushd takes care of the rest. */
- wakeup_bdflush(0);
goto dirty_page_rescan;
}
On Tue, 26 Dec 2000, Chris Mason wrote:
>
> Hi guys,
>
> Here's my latest code, which uses ll_rw_block for anon pages (or
> pages without a writepage func) when flush_dirty_buffers,
> sync_buffers, or fsync_inode_buffers are flushing things. This
> seems to have fixed my slowdown on 1k buffer sizes, but I
> haven't done extensive benchmarks yet.
Great.
I'll run some benchmarks around here too and let you know.
> Other changes: After freeing a page with buffers, page_launder
> now stops if (!free_shortage()). This is a mod of the check where
> page_launder checked free_shortage after freeing a buffer cache
> page. Code outside buffer.c can't detect buffer cache pages with
> this patch, so the old check doesn't apply.
>
> My change doesn't seem quite right though, if page_launder wants
> to stop when there isn't a shortage, it should do that regardless of
> whether the page it just freed had buffers. It looks like this was added
> so bdflush could call page_launder, and get an early out after
> freeing some buffer heads, but I'm not sure.
>
> In test13-pre4, invalidate_buffers skips buffers on a page
> with a mapping. I changed that to skip mappings other than the
> anon space mapping.
>
> Comments and/or suggestions on how to make better use of this stuff
> are more than welcome ;-)
Well, the best use of this patch seems to be the ability to do write
clustering in the ->writepage() operation for normal filesystems.
I'll try to do a lightweight write clustering patch for
block_write_full_page soon.
Chris Mason wrote:
>
> Hi guys,
>
> Here's my latest code, which uses ll_rw_block for anon pages (or
> pages without a writepage func) when flush_dirty_buffers,
> sync_buffers, or fsync_inode_buffers are flushing things. This
> seems to have fixed my slowdown on 1k buffer sizes, but I
> haven't done extensive benchmarks yet.
>
> Other changes: After freeing a page with buffers, page_launder
> now stops if (!free_shortage()). This is a mod of the check where
> page_launder checked free_shortage after freeing a buffer cache
> page. Code outside buffer.c can't detect buffer cache pages with
> this patch, so the old check doesn't apply.
>
> My change doesn't seem quite right though, if page_launder wants
> to stop when there isn't a shortage, it should do that regardless of
> whether the page it just freed had buffers. It looks like this was added
> so bdflush could call page_launder, and get an early out after
> freeing some buffer heads, but I'm not sure.
>
> In test13-pre4, invalidate_buffers skips buffers on a page
> with a mapping. I changed that to skip mappings other than the
> anon space mapping.
>
> Comments and/or suggestions on how to make better use of this stuff
> are more than welcome ;-)
Hi Chris. I took your patch for a test drive under dbench and it seems
impressively stable under load, but there are performance problems.
Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
Without patch: 9.5 MB/sec, 11 min 6 secs
With patch: 3.12 MB/sec, 33 min 51 sec
Philosophically, I wonder if it's right for the buffer flush mechanism
to be calling into the filesystem. It seems like the buffer lists
should stay sitting between the filesystem and the block layer; it
actually does a pretty good job.
What about just bumping the buffers of pages found on the inactive_dirty
list to the front of the buffer LRU list? Then we can do write
clustering by being selective about the bumping.
--
Daniel
On Wednesday, December 27, 2000 21:26:02 +0100 Daniel Phillips
<[email protected]> wrote:
> Hi Chris. I took your patch for a test drive under dbench and it seems
> impressively stable under load, but there are performance problems.
>
> Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
> Without patch: 9.5 MB/sec, 11 min 6 secs
> With patch: 3.12 MB/sec, 33 min 51 sec
>
Cool, thanks for the testing. Which benchmark are you using? bonnie and
dbench don't show any changes on my scsi disks, I'll give IDE a try as well.
> Philosophically, I wonder if it's right for the buffer flush mechanism
> to be calling into the filesystem. It seems like the buffer lists
> should stay sitting between the filesystem and the block layer; it
> actually does a pretty good job.
>
What I'm looking for is a separation of the write management (aging, memory
pressure, etc, etc) from the actual write method. The lists (VM, buffer.c,
whatever) should do the management, and the FS should do the i/o. This
patch is not a perfect solution by any means, but it's a start.
-chris
Chris Mason wrote:
>
> On Wednesday, December 27, 2000 21:26:02 +0100 Daniel Phillips
> <[email protected]> wrote:
>
> > Hi Chris. I took your patch for a test drive under dbench and it seems
> > impressively stable under load, but there are performance problems.
> >
> > Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
> > Without patch: 9.5 MB/sec, 11 min 6 secs
> > With patch: 3.12 MB/sec, 33 min 51 sec
> >
>
> Cool, thanks for the testing. Which benchmark are you using? bonnie and
> dbench don't show any changes on my scsi disks, I'll give IDE a try as well.
Chris, this was an error; I had accidentally booted the wrong kernel. The
'With patch' results above are for 2.2.16, not your patch.
With correct results you are looking much better:
Test machine: 64 meg, 500 Mhz K6, IDE, Ext2, Blocksize=4K
Test: dbench 48
pre13 without patch: 9.5 MB/sec 11 min 6 secs
pre13 with patch: 8.9 MB/sec 11 min 46 secs
2.2.16: 3.1 MB/sec 33 min 51 sec
This benchmark doesn't seem to suffer a lot from noise, so the 7%
slowdown with your patch is likely real.
We've come a long way from 2.2.16, haven't we? I'll run some of these
tests against 2.2 pre19 kernels and maybe fan the flames of the 2.2/2.4
competition a little.
--
Daniel
On Thursday, December 28, 2000 16:49:14 +0100 Daniel Phillips <[email protected]> wrote:
[ dbench 48 test on the anon space mapping patch ]
>
> This benchmark doesn't seem to suffer a lot from noise, so the 7%
> slowdown with your patch is likely real.
>
Ok, page_launder is supposed to run through the inactive dirty
list twice, and on the second run, it wants to start i/o. But,
if the page is dirty, writepage is called on the first run. With
my patch, this flushes lots more data than it used to.
I have writepage doing all the i/o, and try_to_free_buffers
only waits on it. This diff makes it so writepage is only called
on the second loop through the inactive dirty list, could you
please give it a try (slightly faster in my tests).
Linus and Rik are cc'd in to find out if this is a good idea in
general.
-chris
--- linux-test13-pre4/mm/vmscan.c Sat Dec 23 13:14:26 2000
+++ linux/mm/vmscan.c Thu Dec 28 15:02:08 2000
@@ -609,7 +609,7 @@
goto page_active;
/* Can't start IO? Move it to the back of the list */
- if (!can_get_io_locks) {
+ if (!launder_loop || !can_get_io_locks) {
list_del(page_lru);
list_add(page_lru, &inactive_dirty_list);
UnlockPage(page);
On Thu, 28 Dec 2000, Chris Mason wrote:
>
> Linus and Rik are cc'd in to find out if this is a good idea in
> general.
Probably.
There are some arguments for starting the writeout early, but there are
tons of arguments against it too (the main one being "avoid doing IO if
you can do so"), so your patch is probably fine. In the end, the
performance characteristics are what matters. Does the patch make for
smoother behaviour and better performance?
Anyway, the "can_get_io_locks" check is subsumed by the "launder_loop"
check: we will never set "launder_loop" to non-zero if we can't get the
io_locks, so you might as well just make the test be
/* First loop through? Don't start IO, just move it to the back of the list */
if (!launder_loop) {
....
and be done with it.
I'd like to hear what that does for dbench.
Linus
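(Applied to the hunk above, the test would end up as, roughly -- the rest
of the block unchanged:)

	/* First loop through? Don't start IO, just move it to the back of the list */
	if (!launder_loop) {
		list_del(page_lru);
		list_add(page_lru, &inactive_dirty_list);
		UnlockPage(page);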
On Thursday, December 28, 2000 11:29:01 AM -0800 Linus Torvalds
<[email protected]> wrote:
[ skipping io on the first walk in page_launder ]
>
> There are some arguments for starting the writeout early, but there are
> tons of arguments against it too (the main one being "avoid doing IO if
> you can do so"), so your patch is probably fine. In the end, the
> performance characteristics are what matters. Does the patch make for
> smoother behaviour and better performance?
My dbench speeds have always varied from run to run, but the average speed
went up about 9% with the anon space mapping patch and the page_launder
change. I could not find much difference in a pure test13-pre4, probably
because dbench doesn't generate much swap on my machine. I'll do more
tests when I get back on Monday night.
Daniel, sounds like dbench varies less on your machine, what did the patch
do for you?
BTW, the last anon space mapping patch I sent also works on test13-pre5.
The block_truncate_page fix does help my patch, since I have bdflush
locking pages ( thanks Marcelo )
-chris
Chris Mason wrote:
>
> On Thursday, December 28, 2000 11:29:01 AM -0800 Linus Torvalds
> <[email protected]> wrote:
> [ skipping io on the first walk in page_launder ]
> >
> > There are some arguments for starting the writeout early, but there are
> > tons of arguments against it too (the main one being "avoid doing IO if
> > you can do so"), so your patch is probably fine. In the end, the
> > performance characteristics are what matters. Does the patch make for
> > smoother behaviour and better performance?
>
> My dbench speeds have always varied from run to run, but the average speed
> went up about 9% with the anon space mapping patch and the page_launder
> change. I could not find much difference in a pure test13-pre4, probably
> because dbench doesn't generate much swap on my machine. I'll do more
> tests when I get back on Monday night.
>
> Daniel, sounds like dbench varies less on your machine, what did the patch
> do for you?
>
> BTW, the last anon space mapping patch I sent also works on test13-pre5.
> The block_truncate_page fix does help my patch, since I have bdflush
> locking pages ( thanks Marcelo )
Yes it does, but the little additional patch you posted no longer
applies. Your patch is suffering badly under pre5 here, reducing
throughput from around 10 MB/sec to 5-6 MB/sec, and highly variable. If
you're not seeing this you should probably try to get a few others to
run it.
--
Daniel
On Fri, 29 Dec 2000, Chris Mason wrote:
> BTW, the last anon space mapping patch I sent also works on test13-pre5.
> The block_truncate_page fix does help my patch, since I have bdflush
> locking pages ( thanks Marcelo )
Good to know. I thought that fix would not change performance noticeably.
Now the ext2 changes will for sure make a difference since right now the
superblock lock is suffering from contention.
On Fri, 29 Dec 2000, Marcelo Tosatti wrote:
> Now the ext2 changes will for sure make a difference since right now the
> superblock lock is suffering from contention.
superblock lock is suffering from more than just contention. Consider the
following:
sys_ustat() does get_super(), which leaves the sb unlocked.
We are going to call ->statfs().
In the meantime, the fs is umounted. Woops...
In other words, get_super() callers are oopsen waiting to happen. That's
what is fundamentally wrong with lock_super()/wait_super()/unlock_super() -
we have no reference counter on superblocks and we blindly grab references
to them assuming that they will stay.
The main source of get_super() calls is in device drivers. I propose to add
void invalidate_dev(kdev_t dev, int sync_flag)
{
struct super_block *sb = get_super(dev);
switch (sync_flag) {
case 1: sync_dev(dev);break;
case 2: fsync_dev(dev);break;
}
if (sb)
invalidate_inodes(sb);
invalidate_buffers(dev);
}
to fs/buffer.c, export it and let drivers call that instead of doing the
thing by hands. Then we have a chance to deal with the problem inside the
kernel proper. Right now almost every block device has that sequence in
BLKRRPART handling and fixing each of them separately is going to hurt like
hell.
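(Illustratively, a driver's BLKRRPART handling would then go from the
open-coded sequence to one call; the exact "before" shape varies per
driver, so treat this as a sketch of the intent only:)

	-	fsync_dev(dev);
	-	sb = get_super(dev);
	-	if (sb)
	-		invalidate_inodes(sb);
	-	invalidate_buffers(dev);
	+	invalidate_dev(dev, 2);	/* sync_flag 2: fsync_dev() before invalidating */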
Linus, I realize that it means changing the block devices this close to release, but
AFAICS we have to change them anyway. I'm asking for permission to take
get_super() out.
Cheers,
Al