Date: Tue, 9 Nov 2004 14:40:33 -0800
From: Andrew Morton
To: Marcelo Tosatti
Cc: 76306.1226@compuserve.com, linux-kernel@vger.kernel.org,
	nickpiggin@yahoo.com.au
Subject: Re: balance_pgdat(): where is total_scanned ever updated?
Message-Id: <20041109144033.3629a07e.akpm@osdl.org>
In-Reply-To: <20041109185221.GA8414@logos.cnet>
References: <200411061418_MC3-1-8E17-8B6C@compuserve.com>
	<20041106161114.1cbb512b.akpm@osdl.org>
	<20041109104220.GB6326@logos.cnet>
	<20041109113620.16b47e28.akpm@osdl.org>
	<20041109180223.GG7632@logos.cnet>
	<20041109134032.124b55fa.akpm@osdl.org>
	<20041109185221.GA8414@logos.cnet>

Marcelo Tosatti wrote:
>
> ...
> > > You're talking about laptop_mode ONLY, then?
> >
> > No, not at all.
> >
> > If we restore the total_scanned logic then kswapd will throttle itself, as
> > designed.  Regardless of laptop_mode.  I did that, and monitored the page
> > scanning and reclaim rates under various workloads.  I observed that with
> > the fix in place, kswapd performed less page reclaim and direct-reclaim
> > performed more reclaim.  And I wasn't able to demonstrate any benchmark
> > improvements with the fix in place, so things are left as they are.
>
> Ah, OK, I understand what you mean.  I was thinking about sc->may_writepage
> only and its effects on shrink_list/pageout.
>
> You remind me about the self throttling (blk_congestion_wait).
> It makes sense now.
>
> Andrea noted that blk_congestion_wait waits on IO which is not generated by
> reclaim - which is indeed a bad thing - it should only wait on IO which the
> VM itself has started.

Yes, blk_congestion_wait() is very approximate.  It was always intended that
it be replaced by wakeups from end_page_writeback(), directed to waitqueues
which correspond to the classzones to which the page belongs.  That way, page
reclaiming processes can throttle precisely upon I/O completion against pages
which are useful to them.  But I've never seen a report of a problem which
would be solved by such a change, and so the cost of delivering multiple
wakeups at end_page_writeback() doesn't seem justified thus far.
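For illustration only, a rough sketch of that scheme, simplified to a
per-zone rather than strictly per-classzone waitqueue.  The reclaim_wait
field and the zone_congestion_wait() helper are names invented for this
sketch; nothing like them exists in the tree:

/*
 * Sketch only.  A reclaimer sleeps on the waitqueue of the zone it is
 * scanning and is woken when writeback completes against a page in that
 * zone, instead of sleeping in the global blk_congestion_wait().
 */

/* struct zone would grow a waitqueue, initialised at zone setup time: */
	wait_queue_head_t	reclaim_wait;

/* end_page_writeback() would deliver the wakeup for the page's zone: */
	if (waitqueue_active(&page_zone(page)->reclaim_wait))
		wake_up_all(&page_zone(page)->reclaim_wait);

/* ...and reclaimers would call this instead of blk_congestion_wait(): */
static long zone_congestion_wait(struct zone *zone, long timeout)
{
	long ret;
	DEFINE_WAIT(wait);

	prepare_to_wait(&zone->reclaim_wait, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(&zone->reclaim_wait, &wait);
	return ret;
}

A task reclaiming for a ZONE_NORMAL allocation would then be woken by
completion of writeback against ZONE_NORMAL pages, not by I/O it has no
interest in.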
> ...
> Another related thing I noted this afternoon is that right now kswapd will
> always block on full queues:
>
> static int may_write_to_queue(struct backing_dev_info *bdi)
> {
> 	if (current_is_kswapd())
> 		return 1;
> 	if (current_is_pdflush())	/* This is unlikely, but why not... */
> 		return 1;
> 	if (!bdi_write_congested(bdi))
> 		return 1;
> 	if (bdi == current->backing_dev_info)
> 		return 1;
> 	return 0;
> }
>
> We should make kswapd use the "bdi_write_congested" information and avoid
> blocking on full queues.  It should improve performance on multi-device
> systems with intense VM loads.
>
> Maybe something along the lines
>
> "if the reclaim ratio is high, do not writepage"
> "if the reclaim ratio is below high, writepage but not block"
> "if the reclaim ratio is low, writepage and block"

It used to do that, but it was taken out.

gack, brain-strain.  umm, dig, dig.  Here you go:



The `low latency page reclaim' design works by preventing page
allocators from blocking on request queues (and by preventing them from
blocking against writeback of individual pages, but that is immaterial
here).

This has a problem under some situations.  pdflush (or a write(2)
caller) could be saturating the queue with highmem pages.  This
prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
enormous amounts of scanning.

A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
then kill the mmapping applications.  The machine instantly goes from
0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
the least-recently-dirtied pages, which are all highmem.  The queue is
congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
efficiency (pages_reclaimed/pages_scanned) falls to 2%.

So this patch changes the policy for kswapd.  kswapd may use all of a
request queue, and is prepared to block on request queues.

What will now happen in the above scenario is:

1: The page allocator scans some pages, fails to reclaim enough memory
   and takes a nap in blk_congestion_wait().

2: kswapd() will scan the ZONE_NORMAL LRU and will start writing back
   pages.  (These pages will be rotated to the tail of the inactive
   list at IO-completion interrupt time).

   This writeback will saturate the queue with ZONE_NORMAL pages.
   Conveniently, pdflush will avoid the congested queues.  So we end up
   writing the correct pages.

In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
efficiency rises from 2% to 40% and things are generally a lot happier.


The downside is that kswapd may now do a lot less page reclaim,
increasing page allocation latency, causing more direct reclaim,
increasing lock contention in the VM, etc.  But I have not been able to
demonstrate that in testing.


The other problem is that there is only one kswapd, and there are lots
of disks.  That is a generic problem - without being able to co-opt
user processes we don't have enough threads to keep lots of disks
saturated.

One fix for this would be to add an additional "really congested"
threshold in the request queues, so kswapd can still perform
nonblocking writeout.  This gives kswapd priority over pdflush while
allowing kswapd to feed many disk queues.  I doubt if this will be
called for.  (A sketch of what such a threshold might look like follows
the patch.)



 include/linux/swap.h |    6 ++++++
 mm/vmscan.c          |   21 +++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

--- 25/mm/vmscan.c~blocking-kswapd	Sat Dec 21 16:24:37 2002
+++ 25-akpm/mm/vmscan.c	Sat Dec 21 16:24:37 2002
@@ -204,6 +204,19 @@ static inline int is_page_cache_freeable
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
 
+static int may_write_to_queue(struct backing_dev_info *bdi)
+{
+	if (current_is_kswapd())
+		return 1;
+	if (current_is_pdflush())	/* This is unlikely, but why not... */
+		return 1;
+	if (!bdi_write_congested(bdi))
+		return 1;
+	if (bdi == current->backing_dev_info)
+		return 1;
+	return 0;
+}
+
 /*
  * shrink_list returns the number of reclaimed pages
  */
@@ -303,8 +316,6 @@ shrink_list(struct list_head *page_list,
 		 * See swapfile.c:page_queue_congested().
 		 */
 		if (PageDirty(page)) {
-			struct backing_dev_info *bdi;
-
 			if (!is_page_cache_freeable(page))
 				goto keep_locked;
 			if (!mapping)
@@ -313,9 +324,7 @@ shrink_list(struct list_head *page_list,
 				goto activate_locked;
 			if (!may_enter_fs)
 				goto keep_locked;
-			bdi = mapping->backing_dev_info;
-			if (bdi != current->backing_dev_info &&
-					bdi_write_congested(bdi))
+			if (!may_write_to_queue(mapping->backing_dev_info))
 				goto keep_locked;
 			write_lock(&mapping->page_lock);
 			if (test_clear_page_dirty(page)) {
@@ -424,7 +433,7 @@ keep:
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
 	mod_page_state(pgsteal, ret);
-	if (current->flags & PF_KSWAPD)
+	if (current_is_kswapd())
 		mod_page_state(kswapd_steal, ret);
 	mod_page_state(pgactivate, pgactivate);
 	return ret;
--- 25/include/linux/swap.h~blocking-kswapd	Sat Dec 21 16:24:37 2002
+++ 25-akpm/include/linux/swap.h	Sat Dec 21 16:24:37 2002
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -14,6 +15,11 @@
 #define SWAP_FLAG_PRIO_MASK	0x7fff
 #define SWAP_FLAG_PRIO_SHIFT	0
 
+static inline int current_is_kswapd(void)
+{
+	return current->flags & PF_KSWAPD;
+}
+
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are
_
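To put the "really congested" idea from the description above into code, a
sketch only: bdi_write_really_congested() is invented for this sketch; it
would be a second congestion flag which the block layer sets and clears at a
higher fraction of q->nr_requests than the existing write-congestion
threshold.

/*
 * Sketch: kswapd writes back until the queue is *really* congested, then
 * skips the page rather than blocking on request allocation.  pdflush and
 * direct reclaimers still back off at the existing, lower threshold, so
 * kswapd gets priority over pdflush yet remains free to move on and feed
 * the other disk queues.
 */
static int may_write_to_queue(struct backing_dev_info *bdi)
{
	if (current_is_kswapd())
		return !bdi_write_really_congested(bdi);
	if (current_is_pdflush())	/* This is unlikely, but why not... */
		return 1;
	if (!bdi_write_congested(bdi))
		return 1;
	if (bdi == current->backing_dev_info)
		return 1;
	return 0;
}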