2004-04-27 01:12:43

by Shantanu Goel

[permalink] [raw]
Subject: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Hi,

During page reclamation, when the scanner encounters a
dirty page and invokes writepage(), if the FS layer
returns WRITEPAGE_ACTIVATE as NFS does, I think the
page should not be placed on the active list as is
presently done. This can cause a lot of extraneous
swapout activity because, in the presence of a large
active list, the pages being written out will not be
reclaimed quickly enough. It also seems
counterintuitive since the scanner has just determined
that the page has not been recently referenced.

Shouldn't the following code from shrink_list():

res = mapping->a_ops->writepage(page, &wbc);
if (res < 0)
	handle_write_error(mapping, page, res);
if (res == WRITEPAGE_ACTIVATE) {
	ClearPageReclaim(page);
	goto activate_locked;
}

read:

res = mapping->a_ops->writepage(page, &wbc);
if (res < 0)
	handle_write_error(mapping, page, res);
if (res == WRITEPAGE_ACTIVATE) {
	ClearPageReclaim(page);
	goto keep_locked;
}

I can observe the benefit of this change if I run a dd
on an NFS mount with the active list full of mostly
mapped pages. The stock kernel ends up paging out
quite a bit of memory whereas the modified kernel does
not.

Comments?

Thanks,
Shantanu

2004-04-27 02:15:33

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Shantanu Goel <[email protected]> wrote:
>
> Hi,
>
> During page reclamation when the scanner encounters a
> dirty page and invokes writepage(), if the FS layer
> returns WRITEPAGE_ACTIVATE as NFS does, I think the
> page should not be placed on the active as is
> presently done. This can cause a lot of extraneous
> swapout activity because in the presence of a large
> active list, the pages being written out will not be
> reclaimed quickly enough. It also seems counter
> intuitive since the scanner has just determined that
> the page has not been recently referenced.
>
> Shouldn't the following code from shrink_list():
>
> res = mapping->a_ops->writepage(page, &wbc);
> if (res < 0)
> handle_write_error(mapping, page, res);
> if (res == WRITEPAGE_ACTIVATE) {
> ClearPageReclaim(page);
> goto activate_locked;
> }
>
> read:
>
> res = mapping->a_ops->writepage(page, &wbc);
> if (res < 0)
> handle_write_error(mapping, page, res);
> if (res == WRITEPAGE_ACTIVATE) {
> ClearPageReclaim(page);
> goto keep_locked;
> }

WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
the interaction between tmpfs and page reclaim.

Trond, the changelog for that patch does not explain what is going on in
there - can you help out?

Also, what's the theory behind the handling of BDI_write_congested and
nfs_wait_on_write_congestion() in nfs_writepages()? From a quick peek it
looks like NFS should be waking the sleepers in blk_congestion_wait()
rather than doing it privately?

> I can observe the benefit of this change if I run a dd
> on an NFS mount with the active list full of mostly
> mapped pages. The stock kernel ends up paging out
> quite a bit of memory whereas the modified kernel does
> not.

yup. We should be able to handle the throttling and writeback scheduling
from within core VFS/VM. NFS should set and clear the backing_dev
congestion state appropriately and the VFS should take care of the rest.
The missing bit is the early blk_congestion_wait() termination.
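
Something like this on the NFS side is what I have in mind (untested
sketch; the nfs_* helper names are invented for illustration - only
backing_dev_info and its BDI_write_congested bit are real):

#include <linux/backing-dev.h>

/* call when the number of writes on the wire crosses some threshold */
static void nfs_mark_congested(struct backing_dev_info *bdi)
{
	set_bit(BDI_write_congested, &bdi->state);
}

/* call when enough of those writes have completed */
static void nfs_clear_congested(struct backing_dev_info *bdi)
{
	clear_bit(BDI_write_congested, &bdi->state);
	/*
	 * This is also where we would want to wake the
	 * blk_congestion_wait() sleepers - the missing bit
	 * mentioned above.
	 */
}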

2004-04-27 03:11:15

by Trond Myklebust

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Mon, 2004-04-26 at 22:15, Andrew Morton wrote:
> WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
> the interaction between tmpfs and page reclaim.
>
> Trond, the changelog for that patch does not explain what is going on in
> there - can you help out?

As far as I understand, the WRITEPAGE_ACTIVATE hack is supposed to allow
filesystems to defer actually starting the I/O until the call to
->writepages(). This is indeed beneficial to NFS, since most servers
will work more efficiently if we cluster NFS write requests into "wsize"
sized chunks.

> Also, what's the theory behind the handling of BDI_write_congested and
> nfs_wait_on_write_congestion() in nfs_writepages()? From a quick peek it
> looks like NFS should be waking the sleepers in blk_congestion_wait()
> rather than doing it privately?

The idea is mainly to prevent tasks from scheduling new writes if we are
in the situation of wanting to reclaim or otherwise flush out dirty
pages. IOW: I am assuming that the writepages() method is usually called
only when we are low on memory and/or if pdflush() was triggered.

> yup. We should be able to handle the throttling and writeback scheduling
> from within core VFS/VM. NFS should set and clear the backing_dev
> congestion state appropriately and the VFS should take care of the rest.
> The missing bit is the early blk_congestion_wait() termination.

Err... I appear to be missing something here. What is it you mean by
*early* blk_congestion_wait() termination?

Cheers,
Trond

2004-04-27 03:59:55

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2004-04-26 at 22:15, Andrew Morton wrote:
> > WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
> > the interaction between tmpfs and page reclaim.
> >
> > Trond, the changelog for that patch does not explain what is going on in
> > there - can you help out?
>
> As far as I understand, the WRITEPAGE_ACTIVATE hack is supposed to allow
> filesystems to defer actually starting the I/O until the call to
> ->writepages(). This is indeed beneficial to NFS, since most servers
> will work more efficiently if we cluster NFS write requests into "wsize"
> sized chunks.

No, it's purely used by tmpfs when we discover the page was under mlock or
we've run out of swapspace.

Yes, single-page writepage off the LRU is inefficient and we want
writepages() to do most of the work. For most workloads, this is the case.
It's only the whacky mmap-based test which should encounter significant
amounts of vmscan.c writepage activity. If you're seeing much traffic via
that route I'd be interested in knowing what the workload is.

There's one scenario in which we _do_ do too much writepage off the LRU,
and that's on 1G ia32 highmem boxes: the tiny highmem zone is smaller than
dirty_ratio and vmscan ends up doing work which balance_dirty_pages()
should have done. Hard to fix, apart from reducing the dirty memory
limits.

> > Also, what's the theory behind the handling of BDI_write_congested and
> > nfs_wait_on_write_congestion() in nfs_writepages()? From a quick peek it
> > looks like NFS should be waking the sleepers in blk_congestion_wait()
> > rather than doing it privately?
>
> The idea is mainly to prevent tasks from scheduling new writes if we are
> in the situation of wanting to reclaim or otherwise flush out dirty
> pages. IOW: I am assuming that the writepages() method is usually called
> only when we are low on memory and/or if pdflush() was triggered.

writepages() is called by pdflush when the dirty memory exceeds
dirty_background_ratio and it is called by the write(2) caller when dirty
memory exceeds dirty_ratio.

balance_dirty_pages() will throttle the write(2) caller when we hit
dirty_ratio: it attempts to prevent the amount of dirty memory from
exceeding dirty_ratio by blocking the write(2) caller.

> > yup. We should be able to handle the throttling and writeback scheduling
> > from within core VFS/VM. NFS should set and clear the backing_dev
> > congestion state appropriately and the VFS should take care of the rest.
> > The missing bit is the early blk_congestion_wait() termination.
>
> Err... I appear to be missing something here. What is it you mean by
> *early* blk_congestion_wait() termination?

In the various places where the VM/VFS decides "gee, we need to wait for
some write I/O to terminate" it will call blk_congestion_wait() (the
function is seriously misnamed nowadays). blk_congestion_wait() will wait
for a certain number of milliseconds, but that wait will terminate earlier
if the block layer completes some write requests. So if writes are being
retired quickly, the sleeps are shorter.
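
For reference, the guts of it are roughly this (simplified and from
memory, not the verbatim ll_rw_blk.c source; congestion_wqh[] is the
per-direction waitqueue that gets woken when a queue exits congestion):

void blk_congestion_wait(int rw, long timeout)
{
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[rw];

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(timeout);	/* a wakeup on wqh cuts this short */
	finish_wait(wqh, &wait);
}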

What we've never had, and which I expected we'd need a year ago, was a
connection between network filesystems and blk_congestion_wait(). At
present, if someone calls blk_congestion_wait(HZ/20) when the only write
I/O in flight is NFS, they will sleep for 50 milliseconds no matter what's
going on. What we should do is to deliver a wakeup from within NFS to the
blk_congestion_wait() sleepers when write I/O has completed.


Throttling of NFS writers should be working OK in 2.6.5. The only problem
of which I am aware is that the callers of try_to_free_pages() and
balance_dirty_pages() may be sleeping for too long in
blk_congestion_wait(). We can address that by calling into
clear_backing_dev_congested() (see below) from the place in NFS where we
determine that writes have completed. Of course there may be additional
problems of which I'm unaware.






---

25-akpm/drivers/block/ll_rw_blk.c | 20 +++++++++++++-------
25-akpm/include/linux/writeback.h | 1 +
2 files changed, 14 insertions(+), 7 deletions(-)

diff -puN drivers/block/ll_rw_blk.c~clear_baking_dev_congested drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~clear_baking_dev_congested 2004-04-26 20:52:05.752041120 -0700
+++ 25-akpm/drivers/block/ll_rw_blk.c 2004-04-26 20:54:20.997480688 -0700
@@ -95,22 +95,28 @@ static inline int queue_congestion_off_t
return ret;
}

-/*
- * A queue has just exitted congestion. Note this in the global counter of
- * congested queues, and wake up anyone who was waiting for requests to be
- * put back.
- */
-static void clear_queue_congested(request_queue_t *q, int rw)
+void clear_backing_dev_congested(struct backing_dev_info *bdi, int rw)
{
enum bdi_state bit;
wait_queue_head_t *wqh = &congestion_wqh[rw];

bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
- clear_bit(bit, &q->backing_dev_info.state);
+ clear_bit(bit, &bdi->state);
smp_mb__after_clear_bit();
if (waitqueue_active(wqh))
wake_up(wqh);
}
+EXPORT_SYMBOL(clear_backing_dev_congested);
+
+/*
+ * A queue has just exitted congestion. Note this in the global counter of
+ * congested queues, and wake up anyone who was waiting for requests to be
+ * put back.
+ */
+static void clear_queue_congested(request_queue_t *q, int rw)
+{
+ clear_backing_dev_congested(&q->backing_dev_info, rw);
+}

/*
* A queue has just entered congestion. Flag that in the queue's VM-visible
diff -puN include/linux/writeback.h~clear_baking_dev_congested include/linux/writeback.h
--- 25/include/linux/writeback.h~clear_baking_dev_congested 2004-04-26 20:53:19.680802240 -0700
+++ 25-akpm/include/linux/writeback.h 2004-04-26 20:53:36.333270680 -0700
@@ -97,5 +97,6 @@ int do_writepages(struct address_space *
extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
read-only. */

+void clear_backing_dev_congested(struct backing_dev_info *bdi, int rw);

#endif /* WRITEBACK_H */

_

2004-04-27 05:24:21

by Trond Myklebust

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Mon, 2004-04-26 at 23:59, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:

> No, it's purely used by tmpfs when we discover the page was under mlock or
> we've run out of swapspace.

No: either it is a bona-fide interface, in which case it belongs in the
mm/vfs, or it is a private interface, in which case it doesn't.
I can see no reason to be playing games purely for the case of tmpfs,
but if indeed there is, then it needs a *lot* more comments in the code
than is currently the case: something along the lines of "don't ever use
this unless you are tmpfs"...

> Yes, single-page writepage off the LRU is inefficient and we want
> writepages() to do most of the work. For most workloads, this is the case.
> It's only the whacky mmap-based test which should encounter significant
> amounts of vmscan.c writepage activity. If you're seeing much traffic via
> that route I'd be interested in knowing what the workload is.

Err... Anything that currently ends up calling writepage() and returning
WRITEPAGE_ACTIVATE. This is a problem that you believe you are seeing
the effects of, right?
Currently, we use this on virtually anything that triggers writepage()
with the wbc->for_reclaim flag set.

> > The idea is mainly to prevent tasks from scheduling new writes if we are
> > in the situation of wanting to reclaim or otherwise flush out dirty
> > pages. IOW: I am assuming that the writepages() method is usually called
> > only when we are low on memory and/or if pdflush() was triggered.
>
> writepages() is called by pdflush when the dirty memory exceeds
> dirty_background_ratio and it is called by the write(2) caller when dirty
> memory exceeds dirty_ratio.
>
> balance_dirty_pages() will throttle the write(2) caller when we hit
> dirty_ratio. balance_dirty_pages() attempts to prevent amount of dirty
> memory from exceeding dirty_ratio by blocking the write(2) caller.

...but if the pdflush() process hangs, due to an unavailable server or
something like that, how do we prevent other processes from hanging too?
AFAICS the backing-dev->state is the only way to ensure that alternate
pdflush processes try to free memory through other means first.

> In the various places where the VM/VFS decides "gee, we need to wait for
> some write I/O to terminate" it will call blk_congestion_wait() (the
> function is seriously misnamed nowadays). blk_congestion_wait() will wait
> for a certain number of milliseconds, but that wait will terminate earlier
> if the block layer completes some write requests. So if writes are being
> retired quickly, the sleeps are shorter.

We're not using the block layer.

> What we've never had, and which I expected we'd need a year ago was a
> connection between network filesytems and blk_congestion_wait(). At
> present, if someone calls blk_congestion_wait(HZ/20) when the only write
> I/O in flight is NFS, they will sleep for 50 milliseconds no matter what's
> going on. What we should do is to deliver a wakeup from within NFS to the
> blk_congestion_wait() sleepers when write I/O has completed.

So exactly *how* do we define the concept "write I/O has completed"
here?

AFAICS that has to be entirely defined by the MM layer since it alone
can declare that writepages() (plus whatever was racing with
writepages() at the time) has satisfied its request for reclaiming
memory.
I certainly don't want the wretched thing to be scheduling yet more
writes while I'm still busy completing the last batch. If I happen to be
on a slow network, that can cause me to enter a vicious circle where I
keep scheduling more and more I/O simply because my write speed does not
match the bandwidth expectations of the MM.
This is why I want to be throttling *all* those processes that would
otherwise be triggering yet more writepages() calls.

Cheers,
Trond

2004-04-27 05:59:09

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2004-04-26 at 23:59, Andrew Morton wrote:
> > Trond Myklebust <[email protected]> wrote:
>
> > No, it's purely used by tmpfs when we discover the page was under mlock or
> > we've run out of swapspace.
>
> No: either it is a bona-fide interface, in which case it belongs in the
> mm/vfs, or it is a private interface, in which case it doesn't.
> I can see no reason to be playing games purely for the case of tmpfs,
> but if indeed there is, then it needs a *lot* more comments in the code
> than is currently the case: something along the lines of "don't ever use
> this unless you are tmpfs"...

It was designed for tmpfs's particular problems - tmpfs is quite incestuous
with the MM.

Yes, I suppose it should be renamed to TMPFS_WRITEPAGE_FAILED or something.

> > Yes, single-page writepage off the LRU is inefficient and we want
> > writepages() to do most of the work. For most workloads, this is the case.
> > It's only the whacky mmap-based test which should encounter significant
> > amounts of vmscan.c writepage activity. If you're seeing much traffic via
> > that route I'd be interested in knowing what the workload is.
>
> Err... Anything that currently ends up calling writepage() and returning
> WRITEPAGE_ACTIVATE. This is a problem that you believe you are seeing
> the effects of, right?

I didn't report the problem - Shantanu is reporting problems where all the
NFS pages are stuck on the active list and the VM has nothing to reclaim on
the inactive list.

> Currently, we use this on virtually anything that trigggers writepage()
> with the wbc->for_reclaim flag set.

Why? Sorry, I still do not understand the effect you're trying to
achieve. I thought it was an efficiency thing: send more traffic via
writepages(). But your other comments seem to indicate that this is not
the primary reason.

> ...but if the pdflush() process hangs, due to an unavailable server or
> something like that, how do we prevent other processes from hanging too?

We'll lose one pdflush thread if this happens. After that,
writeback_acquire() will fail and no other pdflush instances will get
stuck.

After that, user processes will start entering that fs trying to write
things back and they too will get stuck.

Not good. You could do a writeback_acquire() in writepage and writepages
perhaps.

> AFAICS the backing-dev->state is the only way to ensure that alternate
> pdflush processes try to free memory through other means first.

Certainly backing_dev_info is the right place to do this.

Again, I'd need to know more about what problems you're trying to solve to
say much more.

If the only problem is the everyone-hangs-because-of-a-dead-server problem
then we can either utilise the write-congestion state or we can introduce
an additional layer of congestion avoidance (similar to writeback_acquire)
to prevent tasks from trying to write to dead NFS servers.

> > In the various places where the VM/VFS decides "gee, we need to wait for
> > some write I/O to terminate" it will call blk_congestion_wait() (the
> > function is seriously misnamed nowadays). blk_congestion_wait() will wait
> > for a certain number of milliseconds, but that wait will terminate earlier
> > if the block layer completes some write requests. So if writes are being
> > retired quickly, the sleeps are shorter.
>
> We're not using the block layer.

That's a fairly irritating comment :(

I'm trying to explain how the current congestion/collision avoidance logic
works and how I expected that it could be generalised to network-backed
filesystems.

If we can determine beforehand that writes to the server will block then
this should fall out nicely. NFS needs to set a backing_dev flag which
says "writes to this backing_dev will block soon" and writers can then take
avoiding action.

I assume such a thing can be implemented - there's an upper bound on the
number of writes which can be in flight on the wire?

> > What we've never had, and which I expected we'd need a year ago was a
> > connection between network filesytems and blk_congestion_wait(). At
> > present, if someone calls blk_congestion_wait(HZ/20) when the only write
> > I/O in flight is NFS, they will sleep for 50 milliseconds no matter what's
> > going on. What we should do is to deliver a wakeup from within NFS to the
> > blk_congestion_wait() sleepers when write I/O has completed.
>
> So exactly *how* do we define the concept "write I/O has completed"
> here?

Simplistically: wherever NFS calls end_page_writeback(). But it would be
better to do it at a higher level if poss: nfs_writeback_done_full(),
perhaps?
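
i.e. something along these lines (sketch only - the helper name and the
exact call site are whatever suits NFS; the backing_dev_info is the one
NFS already marks congested):

/* called wherever NFS decides a batch of writes has completed */
static void nfs_writes_completed(struct inode *inode)
{
	struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

	clear_backing_dev_congested(bdi, WRITE);
}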

> AFAICS that has to be entirely defined by the MM layer since it alone
> can declare that writepages() (plus whatever was racing with
> writepages() at the time) has satisfied its request for reclaiming
> memory.

Well... the matter of reclaiming memory is somewhat separate from all of
this. The code in there is all designed to throttle write()rs and memory
allocators to the rate at which the I/O system can retire writes.

It sounds like the problem which you're trying to solve here is the dead
server problem? If so, yes, that will need additional logic somewhere.

> I certainly don't want the wretched thing to be scheduling yet more
> writes while I'm still busy completing the last batch.

OK. In that case it would be preferable for the memory reclaim code to
just skip the page altogether (via may_write_to_queue()) and to go on and
try to reclaim the next page off the LRU.

And it would be preferable for the write(2) caller to skip this superblock
altogether and to go on and start working on a different superblock.
Setting the congestion indicator will cause this to happen. If there are
no additional superblocks to be written then the write() caller will block
in blk_congestion_wait() and will then retry a writeback pass across the
superblocks.

> If I happen to be
> on a slow network, that can cause me to enter a vicious circle where I
> keep scheduling more and more I/O simply because my write speed does not
> match the bandwidth expectations of the MM.
> This is why I want to be throttling *all* those processes that would
> otherwise be triggering yet more writepages() calls.

Setting BDI_write_congested should have this effect for write(2) callers.

We may need an additional backing_dev flag for may_write_to_queue() to
inspect.

Bottom line: I _think_ you need to remove the WRITEPAGE_ACTIVATE stuff, set
a bdi->state flag (as well as BDI_write_congested) when the server gets
stuck, test that new flag in may_write_to_queue().
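
Roughly this (sketch; BDI_write_stuck is an invented name for the new
bdi->state flag, and I'm paraphrasing the existing may_write_to_queue()
checks from memory):

/* NFS sets this, alongside BDI_write_congested, when the server looks
 * wedged, and clears both when traffic starts flowing again. */
static inline int bdi_write_stuck(struct backing_dev_info *bdi)
{
	return test_bit(BDI_write_stuck, &bdi->state);
}

/* mm/vmscan.c */
static int may_write_to_queue(struct backing_dev_info *bdi)
{
	if (bdi_write_stuck(bdi))
		return 0;	/* dead server: don't enter ->writepage() at all */
	if (!bdi_write_congested(bdi))
		return 1;
	if (bdi == current->backing_dev_info)
		return 1;
	return 0;
}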

2004-04-27 06:10:54

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
>
>>On Mon, 2004-04-26 at 22:15, Andrew Morton wrote:
>>
>>>WRITEPAGE_ACTIVATE is a bit of a hack to fix up specific peculiarities of
>>>the interaction between tmpfs and page reclaim.
>>>
>>>Trond, the changelog for that patch does not explain what is going on in
>>>there - can you help out?
>>
>>As far as I understand, the WRITEPAGE_ACTIVATE hack is supposed to allow
>>filesystems to defer actually starting the I/O until the call to
>>->writepages(). This is indeed beneficial to NFS, since most servers
>>will work more efficiently if we cluster NFS write requests into "wsize"
>>sized chunks.
>
>
> No, it's purely used by tmpfs when we discover the page was under mlock or
> we've run out of swapspace.
>
> Yes, single-page writepage off the LRU is inefficient and we want
> writepages() to do most of the work. For most workloads, this is the case.
> It's only the whacky mmap-based test which should encounter significant
> amounts of vmscan.c writepage activity.

Nikita has a patch to gather a page's dirty bits before
taking it off the active list. It might help here a bit.
Looks like it would need porting to the new rmap stuff.

> If you're seeing much traffic via
> that route I'd be interested in knowing what the workload is.
>
> There's one scenario in which we _do_ do too much writepage off the LRU,
> and that's on 1G ia32 highmem boxes: the tiny highmem zone is smaller than
> dirty_ratio and vmscan ends up doing work which balance_dirty_pages()
> should have done. Hard to fix, apart from reducing the dirty memory
> limits.
>

You might be able to improve this by not allocating the
highmem zone right down to its scanning watermark while
zone normal has lots of free memory.

2004-04-27 11:44:39

by Shantanu Goel

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback


--- Andrew Morton <[email protected]> wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > Err... Anything that currently ends up calling writepage() and
> > returning WRITEPAGE_ACTIVATE. This is a problem that you believe
> > you are seeing the effects of, right?
>
> I didn't report the problem - Shantanu is reporting problems where
> all the NFS pages are stuck on the active list and the VM has
> nothing to reclaim on the inactive list.
>

Yup, what happens is the NFS writepage() returns
WRITEPAGE_ACTIVATE causing the scanner to place the
page at the head of the active list. Now the page
won't be reclaimed until the scanner has run through
the entire active list. I do not see such behaviour
with ext3 for instance.

2004-04-27 15:37:58

by Trond Myklebust

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Tue, 2004-04-27 at 01:58, Andrew Morton wrote:

> > Currently, we use this on virtually anything that trigggers writepage()
> > with the wbc->for_reclaim flag set.
>
> Why? Sorry, I still do not understand the effect which you're trying to
> achieve? I thought it was an efficiency thing: send more traffic via
> writepages(). But your other comments seem to indicate that this is not
> the primary reason.

That is the primary reason. Under NFS, writing is a two-stage process:

- Stage 1: "schedule" the page for writing. Basically this entails
creating an nfs_page struct for each dirty page and putting this on the
NFS private list of dirty pages.
- Stage 2: "flush" the NFS private list of dirty pages. This involves
coalescing contiguous nfs_page structs into chunks of size "wsize", and
actually putting them on the wire in the form of RPC requests.

writepage() only deals with one page at a time, so it will work fine for
doing stage 1.
If you also try to force it to do stage 2, then you will end up with chunks
of size <= PAGE_CACHE_SIZE because there will only be 1 page on the NFS
private list of dirty pages. In practice this again means that you
usually have to send 8 times as many RPC requests to the server in order
to process the same amount of data.

This is why I'd like to try to leave stage 2 to writepages().

Now normally, that is not a problem, since the actual calls to
writepage() are in fact made by means of a call to generic_writepages()
from within nfs_writepages(), so I can do all the correct things once
I'm done with generic_writepages().

However, shrink_cache() (where the MM calls writepage() directly) needs
to be treated specially, since that is not wrapped by a writepages()
call. In that case, we return WRITEPAGE_ACTIVATE and wait for pdflush to
call writepages().
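
Spelled out, the split looks something like this (illustrative sketch
only, not the actual fs/nfs/write.c code; nfs_schedule_page() is a
stand-in name for the real stage 1 helper):

static int sketch_nfs_writepage(struct page *page,
				struct writeback_control *wbc)
{
	int err;

	/* Stage 1: create an nfs_page for this dirty page and put it on
	 * the per-inode list of dirty requests - no RPC goes out here. */
	err = nfs_schedule_page(page);
	if (err < 0)
		goto out_unlock;
	if (wbc->for_reclaim)
		return WRITEPAGE_ACTIVATE;	/* page stays locked; stage 2
						 * is left to ->writepages() */
	err = 0;
out_unlock:
	unlock_page(page);
	return err;
}

static int sketch_nfs_writepages(struct address_space *mapping,
				 struct writeback_control *wbc)
{
	int err;

	/* stage 1 for every dirty page in the mapping... */
	err = generic_writepages(mapping, wbc);
	if (err < 0)
		return err;
	/* ...then stage 2: coalesce the queued nfs_pages into wsize-sized
	 * chunks and put them on the wire */
	err = nfs_flush_inode(mapping->host, 0, 0, 0);
	return err < 0 ? err : 0;
}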

Cheers,
Trond

2004-04-28 00:47:11

by Shantanu Goel

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Andrew/Trond,

Any consensus as to what the right approach is here?

One cheap though perhaps hackish solution would be to
introduce yet another special error code like
WRITEPAGE_ACTIVATE which, instead of moving the page
to the head of the active list, moves it to the head
of the inactive list. In this case, would it also
make sense for balance_pgdat() to call
wakeup_bdflush() before blk_congestion_wait()?
Otherwise, kswapd() would keep looping over the same
pages at least until kupdate kicks in or a process
stalls in try_to_free_pages()?

Thanks,
Shantanu

2004-04-28 01:03:47

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Shantanu Goel <[email protected]> wrote:
>
> Andrew/Trond,
>
> Any consensus as to what the right approach is here?

For now I suggest you do

- err = WRITEPAGE_ACTIVATE;
+ err = 0;

in nfs_writepage().

> One cheap though perhaps hack'ish solution would be to
> introduce yet another special error code like
> WRITEPAGE_ACTIVATE which instead of moving the page to
> the head of the active list instead moves it to the
> head of the inactive list.

It would be better to set a flag in the backing_dev so that page reclaim
knows to not enter writepage at all.

2004-04-28 01:29:01

by Trond Myklebust

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Tue, 2004-04-27 at 21:02, Andrew Morton wrote:
> Shantanu Goel <[email protected]> wrote:
> >
> > Andrew/Trond,
> >
> > Any consensus as to what the right approach is here?
>
> For now I suggest you do
>
> - err = WRITEPAGE_ACTIVATE;
> + err = 0;
>
> in nfs_writepage().

That will just cause the page to be put back onto the inactive list
without starting writeback. Won't that cause precisely those kswapd
loops that Shantanu was worried about?
AFAICS if you want to do this, you probably need to flush the page
immediately to disk on the server using a STABLE write as per the
appended patch. The problem is that this screws the server over pretty
hard as it will get flooded with what are in effect a load of 4k O_SYNC
writes.

Cheers,
Trond


--- linux-2.6.6-rc2/fs/nfs/write.c.orig 2004-04-27 16:01:01.000000000 -0400
+++ linux-2.6.6-rc2/fs/nfs/write.c 2004-04-27 21:18:58.000000000 -0400
@@ -313,7 +313,7 @@
if (err >= 0) {
err = 0;
if (wbc->for_reclaim)
- err = WRITEPAGE_ACTIVATE;
+ nfs_flush_inode(inode, 0, 0, FLUSH_STABLE);
}
} else {
err = nfs_writepage_sync(NULL, inode, page, 0,

2004-04-28 01:38:25

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Trond Myklebust <[email protected]> wrote:
>
> On Tue, 2004-04-27 at 21:02, Andrew Morton wrote:
> > Shantanu Goel <[email protected]> wrote:
> > >
> > > Andrew/Trond,
> > >
> > > Any consensus as to what the right approach is here?
> >
> > For now I suggest you do
> >
> > - err = WRITEPAGE_ACTIVATE;
> > + err = 0;
> >
> > in nfs_writepage().
>
> That will just cause the page to be put back onto the inactive list
> without starting writeback. Won't that cause precisely those kswapd
> loops that Shantanu was worried about?

Possibly - but the process which is in reclaim will throttle and will kick
pdflush which will do the writepages() thing.

It needs testing.

> AFAICS if you want to do this, you probably need to flush the page
> immediately to disk on the server using a STABLE write as per the
> appeanded patch. The problem is that screws the server over pretty hard
> as it will get flooded with what are in effect a load of 4k O_SYNC
> writes.

Well I'd be interested in discovering which workloads actually suffer
significantly from doing this. If it's a significant problem then perhaps
we should resurrect the writearound-from-within-writepage thing. It's
pretty simple to do, especially since we now have efficient ways of finding
neighbouring dirty pages.

2004-04-28 05:29:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Tue, Apr 27, 2004 at 11:36:47AM -0400, Trond Myklebust wrote:
> writepage() only deals with one page at a time, so it will work fine for
> doing stage 1.
> If you also try force it to do stage 2, then you will end up with chunks
> of size <= PAGE_CACHE_SIZE because there will only be 1 page on the NFS
> private list of dirty pages. In practice this again means that you
> usually have to send 8 times as many RPC requests to the server in order
> to process the same amount of data.

There's nothing preventing us from probing for more dirty pages before
and after the one ->writepage wants to write out, and sending out one
big request. XFS does this to avoid creating small extents when
converting from delayed allocated space to real extents.

2004-04-28 16:18:01

by Trond Myklebust

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Wed, 2004-04-28 at 01:29, Christoph Hellwig wrote:
> There's nothing speaking against probing for more dirty pages before and
> after the one ->writepage wants to write out and send the big request
> out. XFS does this to avoid creating small extents when converting from
> delayed allocated space to real extents.

You are referring to xfs_probe_unmapped_page()? True: we could do
that... In fact we could do it a lot more efficiently now that we have
the pagevec_lookup_tag() interface.
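
Something like this, say (rough sketch, locking and request setup glossed
over): start at the page ->writepage() was handed and walk forward over
contiguous dirty pages, queueing them for the same RPC so the result is
closer to wsize.

#include <linux/pagemap.h>
#include <linux/pagevec.h>

static void sketch_probe_dirty_neighbours(struct address_space *mapping,
					  pgoff_t start)
{
	struct pagevec pvec;
	pgoff_t index = start;
	pgoff_t next = start;
	int i, done = 0;

	pagevec_init(&pvec, 0);
	while (!done && pagevec_lookup_tag(&pvec, mapping, &index,
					   PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			if (page->index != next) {	/* hit a hole: stop */
				done = 1;
				break;
			}
			next++;
			/* queue the page behind the original request here */
		}
		pagevec_release(&pvec);
	}
}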

Cheers,
Trond

2004-04-28 19:17:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Wed, Apr 28, 2004 at 12:17:42PM -0400, Trond Myklebust wrote:
> On Wed, 2004-04-28 at 01:29, Christoph Hellwig wrote:
> > There's nothing speaking against probing for more dirty pages before and
> > after the one ->writepage wants to write out and send the big request
> > out. XFS does this to avoid creating small extents when converting from
> > delayed allocated space to real extents.
>
> You are referring to xfs_probe_unmapped_page()?

Yes, and the two other functions doing similar stuff.

> True: we could do
> that... In fact we could do it a lot more efficiently now that we have
> the pagevec_lookup_tag() interface.

Yes, pagevecs are much more efficient for that. In fact I have a work
area switching XFS over to use pagevecs for that probing.

I'm not yet sure where I'm heading with revamping xfs_aops.c, but what
I'd love to see in the end is more or less XFS implementing only
writepages, plus a generic "implement writepage as a writepages wrapper"
helper.

2004-04-28 20:00:15

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

Christoph Hellwig <[email protected]> wrote:
>
> I'm not yet sure where I'm heading with revamping xfs_aops.c, but what
> I'd love to see in the end is more or less xfs implementing only
> writepages and some generic implement writepage as writepages wrapper.

That might make sense. One problem is that writepage expects to be passed
a locked page whereas writepages() does not.

Any code which implements writearound-inside-writepage should be targeted
at a generic implementation, not an fs-specific one if poss. We could go
look at the ->vm_writeback() a_op which was in 2.5.20 or thereabouts.
It was causing problems and had no discernible benefits so I ripped it out.

A writearound-within-writepage implementation would need to decide whether
it's going to use lock_page() or TryLockPage(). I expect lock_page() will
be OK - we only call in there for __GFP_FS allocators.

2004-04-29 01:52:26

by Nathan Scott

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Wed, Apr 28, 2004 at 12:07:15PM -0700, Andrew Morton wrote:
> Christoph Hellwig <[email protected]> wrote:
> >
> > I'm not yet sure where I'm heading with revamping xfs_aops.c, but what
> > I'd love to see in the end is more or less xfs implementing only
> > writepages and some generic implement writepage as writepages wrapper.
> ...
> Any code which implements writearound-inside-writepage should be targetted
> at a generic implementation, not an fs-specific one if poss. We could go

Another thing I think XFS could make good use of on the generic
buffered IO path would be mpage_readpages and mpage_writepages
variants that can use a get_blocks_t (like the direct IO path)
instead of the single-block-at-a-time get_block_t interface that's
used now. The existence of such beasts would probably impact the
direction xfs_aops.c takes a fair bit, I'd guess, Christoph?
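
For reference, the two prototypes look roughly like this (paraphrased
from memory, so check fs.h before relying on the exact signatures):

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
			  struct buffer_head *bh_result, int create);

/* direct IO variant: can map a run of up to max_blocks in one call */
typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
			   unsigned long max_blocks,
			   struct buffer_head *bh_result, int create);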

cheers.

--
Nathan

2004-04-29 08:46:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.6.6-rc{1,2} bad VM/NFS interaction in case of dirty page writeback

On Thu, Apr 29, 2004 at 11:48:21AM +1000, Nathan Scott wrote:
> Another thing I think XFS could make good use of on the generic
> buffered IO path would be mpage_readpages and mpage_writepages
> variants that can use a get_blocks_t (like the direct IO path)
> instead of the single-block-at-a-time get_block_t interface thats
> used now. The existence of such beasts would probably impact the
> direction xfs_aops.c takes a fair bit I'd guess Christoph?

For mpage_readpages - yes. To use mpage_writepages from XFS I'd need
a large-scale rewrite to handle delayed allocations and unwritten extents,
so at least for the 2.6.x timeframe we're probably better off keeping our
own ->writepages inside XFS.