LinuxLists.cc - [PATCH] remove throttle_vm

2007-10-04 12:26:23

Subject: [PATCH] remove throttle_vm_writeout()

This in preparation for the writable mmap patches for fuse. I know it
conflicts with

writeback-remove-unnecessary-wait-in-throttle_vm_writeout.patch

but if this function is to be removed, it doesn't make much sense to
fix it first ;)
---

From: Miklos Szeredi <[email protected]>

By relying on the global diry limits, this can cause a deadlock when
devices are stacked.

If the stacking is done through a fuse filesystem, the __GFP_FS,
__GFP_IO tests won't help: the process doing the allocation doesn't
have any special flag.

So why exactly does this function exist?

Direct reclaim does not _increase_ the number of dirty pages in the
system, so rate limiting it seems somewhat pointless.

There are two cases:

1) File backed pages -> file

dirty + writeback count remains constant

2) Anonymous pages -> swap

writeback count increases, dirty balancing will hold back file
writeback in favor of swap

So the real question is: does case 2 need rate limiting, or is it OK
to let the device queue fill with swap pages as fast as possible?

Signed-off-by: Miklos Szeredi <[email protected]>
---

Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h 2007-10-02 16:55:03.000000000 +0200
+++ linux/include/linux/writeback.h 2007-10-04 13:40:33.000000000 +0200
@@ -94,7 +94,6 @@ static inline void inode_sync_wait(struc
int wakeup_pdflush(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);

/* These are exported to sysctl. */
extern int dirty_background_ratio;
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c 2007-10-02 16:55:03.000000000 +0200
+++ linux/mm/page-writeback.c 2007-10-04 13:40:33.000000000 +0200
@@ -497,37 +497,6 @@ void balance_dirty_pages_ratelimited_nr(
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
-{
- long background_thresh;
- long dirty_thresh;
-
- if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
- /*
- * The caller might hold locks which can prevent IO completion
- * or progress in the filesystem. So we cannot just sit here
- * waiting for IO to complete.
- */
- congestion_wait(WRITE, HZ/10);
- return;
- }
-
- for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-
- /*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
- congestion_wait(WRITE, HZ/10);
- }
-}
-
/*
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2007-10-02 16:55:03.000000000 +0200
+++ linux/mm/vmscan.c 2007-10-04 13:40:33.000000000 +0200
@@ -1184,7 +1184,6 @@ static unsigned long shrink_zone(int pri
}
}

- throttle_vm_writeout(sc->gfp_mask);
return nr_reclaimed;
}

2007-10-04 12:40:39

On Fri, 05 Oct 2007 02:12:30 +0200 Miklos Szeredi <[email protected]> wrote:

> >
> > I don't think I understand that. Sure, it _shouldn't_ be a problem. But it
> > _is_. That's what we're trying to fix, isn't it?
>
> The problem, I believe is in the memory allocation code, not in fuse.

fuse is trying to do something which page reclaim was not designed for.
Stuff broke.

> In the example, memory allocation may be blocking indefinitely,
> because we have 4MB under writeback, even though 28MB can still be
> made available. And that _should_ be fixable.

Well yes. But we need to work out how, without re-breaking the thing which
throttle_vm_writeout() fixed.

> > > So the only thing the kernel should be careful about, is not to block
> > > on an allocation if not strictly necessary.
> > >
> > > Actually a trivial fix for this problem could be to just tweak the
> > > thresholds, so to make the above scenario impossible. Although I'm
> > > still not convinced, this patch is perfect, because the dirty
> > > threshold can actually change in time...
> > >
> > > Index: linux/mm/page-writeback.c
> > > ===================================================================
> > > --- linux.orig/mm/page-writeback.c 2007-10-05 00:31:01.000000000 +0200
> > > +++ linux/mm/page-writeback.c 2007-10-05 00:50:11.000000000 +0200
> > > @@ -515,6 +515,12 @@ void throttle_vm_writeout(gfp_t gfp_mask
> > > for ( ; ; ) {
> > > get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> > >
> > > + /*
> > > + * Make sure the theshold is over the hard limit of
> > > + * dirty_thresh + ratelimit_pages * nr_cpus
> > > + */
> > > + dirty_thresh += ratelimit_pages * num_online_cpus();
> > > +
> > > /*
> > > * Boost the allowable dirty threshold a bit for page
> > > * allocators so they don't get DoS'ed by heavy writers
> >
> > I can probably kind of guess what you're trying to do here. But if
> > ratelimit_pages * num_online_cpus() exceeds the size of the offending zone
> > then things might go bad.
>
> I think the admin can do quite a bit of other damage, by setting
> dirty_ratio too high.
>
> Maybe this writeback throttling should just have a fixed limit of 80%
> ZONE_NORMAL, and limit dirty_ratio to something like 50%.

Bear in mind that the same problem will occur for the 16MB ZONE_DMA, and
we cannot limit the system-wide dirty-memory threshold to 12MB.

iow, throttle_vm_writeout() needs to become zone-aware. Then it only
throttles when, say, 80% of ZONE_FOO is under writeback.

Except I don't think that'll fix the problem 100%: if your fuse kernel
component somehow manages to put 80% of ZONE_FOO under writeback (and
remmeber this might be only 12MB on a 16GB machine) then we get stuck again
- the fuse server process (is that the correct terminology, btw?) ends up
waiting upon itself.

I'll think about it a bit.

2007-10-05 07:33:18

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH] remove throttle_vm_writeout()

On Thu, 2007-10-04 at 16:09 -0700, Andrew Morton wrote:
> On Fri, 05 Oct 2007 00:39:16 +0200
> Miklos Szeredi <[email protected]> wrote:
>
> > > throttle_vm_writeout() should be a per-zone thing, I guess. Perhaps fixing
> > > that would fix your deadlock. That's doubtful, but I don't know anything
> > > about your deadlock so I cannot say.
> >
> > No, doing the throttling per-zone won't in itself fix the deadlock.
> >
> > Here's a deadlock example:
> >
> > Total memory = 32M
> > /proc/sys/vm/dirty_ratio = 10
> > dirty_threshold = 3M
> > ratelimit_pages = 1M
> >
> > Some program dirties 4M (dirty_threshold + ratelimit_pages) of mmap on
> > a fuse fs. Page balancing is called which turns all these into
> > writeback pages.
> >
> > Then userspace filesystem gets a write request, and tries to allocate
> > memory needed to complete the writeout.
> >
> > That will possibly trigger direct reclaim, and throttle_vm_writeout()
> > will be called. That will block until nr_writeback goes below 3.3M
> > (dirty_threshold + 10%). But since all 4M of writeback is from the
> > fuse fs, that will never happen.
> >
> > Does that explain it better?
> >
>
> yup, thanks.
>
> This is a somewhat general problem: a userspace process is in the IO path.
> Userspace block drivers, for example - pretty much anything which involves
> kernel->userspace upcalls for storage applications.
>
> I solved it once in the past by marking the userspace process as
> PF_MEMALLOC and I beleive that others have implemented the same hack.
>
> I suspect that what we need is a general solution, and that the solution
> will involve explicitly telling the kernel that this process is one which
> actually cleans memory and needs special treatment.
>
> Because I bet there will be other corner-cases where such a process needs
> kernel help, and there might be optimisation opportunities as well.
>
> Problem is, any such mark-me-as-special syscall would need to be
> privileged, and FUSE servers presently don't require special perms (do
> they?)

I think just adding nr_cpus * ratelimit_pages to the dirth_thresh in
throttle_vm_writeout() will also solve the problem

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2007-10-05 08:22:19

On Fri, 2007-10-05 at 11:22 +0200, Miklos Szeredi wrote:
> > So how do we end up with more writeback pages than that? should we teach
> > pdflush about these limits as well?
>
> Ugh.
>
> I think we should rather fix vmscan to not spin when all pages of a
> zone are already under writeout. Which is the _real_ problem,
> according to Andrew.

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4ef4d22..eff2438 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode)
int wakeup_pdflush(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask);

/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eec1481..f949997 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask)
{
- long background_thresh;
- long dirty_thresh;
-
if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
/*
* The caller might hold locks which can prevent IO completion
@@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask)
}

for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+ unsigned long thresh = zone_page_state(zone, NR_ACTIVE) +
+ zone_page_state(zone, NR_INACTIVE);

- /*
- * Boost the allowable dirty threshold a bit for page
- * allocators so they don't get DoS'ed by heavy writers
- */
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
+ /*
+ * wait when 75% of the zone's pages are under writeback
+ */
+ thresh -= thresh >> 2;
+ if (zone_page_state(zone, NR_WRITEBACK) < thresh)
+ break;

- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
congestion_wait(WRITE, HZ/10);
}
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1be5a63..7dd6bd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
}
}

- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(zone, sc->gfp_mask);

atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2007-10-05 10:27:42

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] remove throttle_vm_writeout()

> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 4ef4d22..eff2438 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -88,7 +88,7 @@ static inline void wait_on_inode(struct inode *inode)
> int wakeup_pdflush(long nr_pages);
> void laptop_io_completion(void);
> void laptop_sync_completion(void);
> -void throttle_vm_writeout(gfp_t gfp_mask);
> +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask);
>
> /* These are exported to sysctl. */
> extern int dirty_background_ratio;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index eec1481..f949997 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -326,11 +326,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> }
> EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> +void throttle_vm_writeout(struct zone *zone, gfp_t gfp_mask)
> {
> - long background_thresh;
> - long dirty_thresh;
> -
> if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
> /*
> * The caller might hold locks which can prevent IO completion
> @@ -342,17 +339,16 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> }
>
> for ( ; ; ) {
> - get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
> + unsigned long thresh = zone_page_state(zone, NR_ACTIVE) +
> + zone_page_state(zone, NR_INACTIVE);
>
> - /*
> - * Boost the allowable dirty threshold a bit for page
> - * allocators so they don't get DoS'ed by heavy writers
> - */
> - dirty_thresh += dirty_thresh / 10; /* wheeee... */
> + /*
> + * wait when 75% of the zone's pages are under writeback
> + */
> + thresh -= thresh >> 2;
> + if (zone_page_state(zone, NR_WRITEBACK) < thresh)
> + break;
>
> - if (global_page_state(NR_UNSTABLE_NFS) +
> - global_page_state(NR_WRITEBACK) <= dirty_thresh)
> - break;
> congestion_wait(WRITE, HZ/10);
> }
> }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1be5a63..7dd6bd9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -948,7 +948,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
> }
> }
>
> - throttle_vm_writeout(sc->gfp_mask);
> + throttle_vm_writeout(zone, sc->gfp_mask);
>
> atomic_dec(&zone->reclaim_in_progress);
> return nr_reclaimed;
>
>

I think that's an improvement in all respects.

However it still does not generally address the deadlock scenario: if
there's a small DMA zone, and fuse manages to put all of those pages
under writeout, then there's trouble.

But it's not really fuse specific. If it was a normal filesystem that
did that, and it needed a GFP_DMA allocation for writeout, it is in
trouble also, as that allocation would fail (at least no deadlock).

Or is GFP_DMA never used by fs/io writeout paths?

Miklos

2007-10-05 10:32:42

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] remove throttle_vm_writeout()

> I think that's an improvement in all respects.
>
> However it still does not generally address the deadlock scenario: if
> there's a small DMA zone, and fuse manages to put all of those pages
> under writeout, then there's trouble.

And the only way to solve that AFAICS, is to make sure fuse never uses
more than e.g. 50% of _any_ zone for page cache. And that may need
some tweaking in the allocator...

Miklos

2007-10-05 10:57:48

On Mon, Oct 08, 2007 at 09:54:33AM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 08:30:28PM +0800, Fengguang Wu wrote:
> > The improvement could be:
> > - kswapd is now explicitly preferred to do the writeout;
>
> Careful. kswapd is much less efficient at writeout than pdflush
> because it does not do low->high offset writeback per address space.
> It just flushes the pages in LRU order and that turns writeback into
> a non-sequential mess. I/O sizes decrease substantially and
> throughput falls through the floor.
>
> So if you want kswapd to take over all the writeback, it needs to do
> writeback in the same manner as the background flushes. i.e. by
> grabbing page->mapping and flushing that in sequential order rather
> than just the page on the end of the LRU....
>
> I documented the effect of kswapd taking over writeback in this
> paper (section 5.3):
>
> http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

Ah, indeed. That means introducing a new "really congested" threshold
for kswapd is *dangerous*. I realized this later on, and am now
heading for another direction.

The basic idea is to
- rotate pdflush issued writeback pages for kswapd;
- use the more precise zone_rotate_wait() to throttle kswapd.

The code is a quick hack and not tested yet.
Early comments are more than welcome.

Fengguang
---
include/linux/mmzone.h | 1 +
mm/filemap.c | 5 ++++-
mm/page_alloc.c | 1 +
mm/swap.c | 13 +++++++++++++
mm/vmscan.c | 12 ++++++++++--
5 files changed, 29 insertions(+), 3 deletions(-)

--- linux-2.6.23-rc8-mm2.orig/include/linux/mmzone.h
+++ linux-2.6.23-rc8-mm2/include/linux/mmzone.h
@@ -316,6 +316,7 @@ struct zone {
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
+ wait_queue_head_t wait_rotate;

/*
* Discontig memory support fields.
--- linux-2.6.23-rc8-mm2.orig/mm/filemap.c
+++ linux-2.6.23-rc8-mm2/mm/filemap.c
@@ -558,12 +558,15 @@ EXPORT_SYMBOL(unlock_page);
*/
void end_page_writeback(struct page *page)
{
- if (!TestClearPageReclaim(page) || rotate_reclaimable_page(page)) {
+ int r = 1;
+ if (!TestClearPageReclaim(page) || (r = rotate_reclaimable_page(page))) {
if (!test_clear_page_writeback(page))
BUG();
}
smp_mb__after_clear_bit();
wake_up_page(page, PG_writeback);
+ if (!r)
+ wake_up(&page_zone(page)->wait_rotate);
}
EXPORT_SYMBOL(end_page_writeback);

--- linux-2.6.23-rc8-mm2.orig/mm/page_alloc.c
+++ linux-2.6.23-rc8-mm2/mm/page_alloc.c
@@ -3482,6 +3482,7 @@ static void __meminit free_area_init_cor
zone->prev_priority = DEF_PRIORITY;

zone_pcp_init(zone);
+ init_waitqueue_head(&zone->wait_rotate);
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
zone->nr_scan_active = 0;
--- linux-2.6.23-rc8-mm2.orig/mm/vmscan.c
+++ linux-2.6.23-rc8-mm2/mm/vmscan.c
@@ -50,6 +50,7 @@
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
+ unsigned long nr_dirty_writeback;

/* This context's GFP mask */
gfp_t gfp_mask;
@@ -558,8 +559,10 @@ static unsigned long shrink_page_list(st
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
- if (PageWriteback(page) || PageDirty(page))
+ if (PageWriteback(page) || PageDirty(page)) {
+ sc->nr_dirty_writeback++;
goto keep;
+ }
/*
* A synchronous write - probably a ramdisk. Go
* ahead and try to reclaim the page.
@@ -620,6 +623,10 @@ keep_locked:
keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page));
+ if (PageLocked(page) && PageWriteback(page)) {
+ SetPageReclaim(page);
+ sc->nr_dirty_writeback++;
+ }
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
@@ -1184,7 +1191,8 @@ static unsigned long shrink_zone(int pri
}
}

- throttle_vm_writeout(sc->gfp_mask);
+ if (!nr_reclaimed && sc->nr_dirty_writeback)
+ zone_rotate_wait(zone, HZ/100);
return nr_reclaimed;
}

--- linux-2.6.23-rc8-mm2.orig/mm/swap.c
+++ linux-2.6.23-rc8-mm2/mm/swap.c
@@ -174,6 +174,19 @@ int rotate_reclaimable_page(struct page
return 0;
}

+long zone_rotate_wait(struct zone* z, long timeout)
+{
+ long ret;
+ DEFINE_WAIT(wait);
+ wait_queue_head_t *wqh = &z->wait_rotate;
+
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = io_schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+ return ret;
+}
+EXPORT_SYMBOL(zone_rotate_wait);
+
/*
* FIXME: speed this up?
*/