Date: Mon, 2 May 2011 01:35:42 +0900
From: Minchan Kim <minchan.kim@gmail.com>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
        Mel Gorman <mel@linux.vnet.ibm.com>,
        Dave Young <hidave.darkstar@gmail.com>, linux-mm <linux-mm@kvack.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
        KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
        Christoph Lameter <cl@linux.com>, Dave Chinner <david@fromorbit.com>,
        David Rientjes <rientjes@google.com>
Subject: Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation
 failures
Message-ID: <20110501163542.GA3204@barrios-desktop>
References: <BANLkTim0MNgqeh1KTfvpVFuAvebKyQV8Hg@mail.gmail.com>
 <20110426062535.GB19717@localhost>
 <BANLkTinM9DjK9QsGtN0Sh308rr+86UMF0A@mail.gmail.com>
 <20110426063421.GC19717@localhost>
 <BANLkTi=xDozFNBXNdGDLK6EwWrfHyBifQw@mail.gmail.com>
 <20110426092029.GA27053@localhost>
 <20110426124743.e58d9746.akpm@linux-foundation.org>
 <20110428133644.GA12400@localhost>
 <20110429022824.GA8061@localhost>
 <20110430141741.GA4511@localhost>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20110430141741.GA4511@localhost>
User-Agent: Mutt/1.5.20 (2009-06-14)

Hi Wu, 

On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
> On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > > Test results:
> > > 
> > > - the failure rate is pretty sensible to the page reclaim size,
> > >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> > > 
> > > - the IPIs are reduced by over 100 times
> > 
> > It's reduced by 500 times indeed.
> > 
> > CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
> > CAL:         93        463        410        540        298        282        272        306   Function call interrupts
> > 
> > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > > -------------------------------------------------------------------------------
> > > nr_alloc_fail 10496
> > > allocstall 1576602
> > 
> > > patched (WMARK_MIN)
> > > -------------------
> > > nr_alloc_fail 704
> > > allocstall 105551
> > 
> > > patched (WMARK_HIGH)
> > > --------------------
> > > nr_alloc_fail 282
> > > allocstall 53860
> > 
> > > this patch (WMARK_HIGH, limited scan)
> > > -------------------------------------
> > > nr_alloc_fail 276
> > > allocstall 54034
> > 
> > There is a bad side effect though: the much reduced "allocstall" means
> > each direct reclaim will take much more time to complete. A simple solution
> > is to terminate direct reclaim after 10ms. I noticed that an 100ms
> > time threshold can reduce the reclaim latency from 621ms to 358ms.
> > Further lowering the time threshold to 20ms does not help reducing the
> > real latencies though.
> 
> Experiments going on...
> 
> I tried the more reasonable terminate condition: stop direct reclaim
> when the preferred zone is above high watermark (see the below chunk).
> 
> This helps reduce the average reclaim latency to under 100ms in the
> 1000-dd case.
> 
> However nr_alloc_fail is around 5000 and not ideal. The interesting
> thing is, even if zone watermark is high, the task still may fail to
> get a free page..
> 
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>                         }
>                 }
>                 total_scanned += sc->nr_scanned;
> -               if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> -                       goto out;
> +               if (sc->nr_reclaimed >= min_reclaim) {
> +                       if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> +                               goto out;
> +                       if (total_scanned > 2 * sc->nr_to_reclaim)
> +                               goto out;
> +                       if (preferred_zone &&
> +                           zone_watermark_ok_safe(preferred_zone, sc->order,
> +                                       high_wmark_pages(preferred_zone),
> +                                       zone_idx(preferred_zone), 0))
> +                               goto out;
> +               }
>                
>                 /*
>                  * Try to write back as many pages as we just scanned.  This
> 
> Thanks,
> Fengguang
> ---
> Subject: mm: cut down __GFP_NORETRY page allocation failures
> Date: Thu Apr 28 13:46:39 CST 2011
> 
> Concurrent page allocations are suffering from high failure rates.
> 
> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
> 
> nr_alloc_fail 733 	# interleaved reads by 1 single task
> nr_alloc_fail 11799	# concurrent reads by 1000 tasks
> 
> The concurrent read test script is:
> 
> 	for i in `seq 1000`
> 	do
> 		truncate -s 1G /fs/sparse-$i
> 		dd if=/fs/sparse-$i of=/dev/null &
> 	done
> 
> In order for get_page_from_freelist() to get free page,
> 
> (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>     possible low watermark state as well as fill the pcp with enough free
>     pages to overflow its high watermark.
> 
> (2) the get_page_from_freelist() _after_ direct reclaim should use lower
>     watermark than its normal invocations, so that it can reasonably
>     "reserve" some free pages for itself and prevent other concurrent
>     page allocators stealing all its reclaimed pages.

Did you see my old patch? It wasn't complete, but it's good enough to show the idea.
http://marc.info/?l=linux-mm&m=129187231129887&w=4
The idea is to keep at least one page for the process that did the direct reclaim.
Could it mitigate your problem, or could you enhance the idea?
I think it's a very simple and fair solution.
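
A rough userspace sketch of that idea, only to show the intent. It is not
the old patch and not kernel code; struct toy_zone and both helpers are
hypothetical names made up for illustration.

```c
/* Hypothetical userspace model, not kernel code. */
struct toy_zone {
	unsigned long nr_free;	/* pages on the shared free list */
};

/*
 * Model of direct reclaim by one task: it frees nr_reclaimed pages but
 * holds one back for itself instead of putting all of them on the shared
 * free list, where concurrent allocators could take them.
 * Returns the number of pages kept for the caller (0 or 1).
 */
static unsigned long toy_direct_reclaim(struct toy_zone *zone,
					unsigned long nr_reclaimed)
{
	unsigned long kept = nr_reclaimed ? 1 : 0;

	zone->nr_free += nr_reclaimed - kept;
	return kept;
}

/* Model of concurrent allocators draining the shared free list. */
static void toy_steal_all(struct toy_zone *zone)
{
	zone->nr_free = 0;
}
```

Even if toy_steal_all() runs right after the reclaim, the reclaiming task
still holds the page it kept, so its own allocation does not fail.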

> 
> Some notes:
> 
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>   reclaim allocation fails") has the same target, however is obviously
>   costly and less effective. It seems more clean to just remove the
>   retry and drain code than to retain it.

Tend to agree.
My old patch can solve it, I think.

> 
> - it's a bit hacky to reclaim more than requested pages inside
>   do_try_to_free_page(), and it won't help cgroup for now
> 
> - it only aims to reduce failures when there are plenty of reclaimable
>   pages, so it stops the opportunistic reclaim when scanned 2 times pages
> 
> Test results:
> 
> - the failure rate is pretty sensible to the page reclaim size,
>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> 
> - the IPIs are reduced by over 100 times
> 
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
> 
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
> 
> CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
> 
> LOC:     536274     532529     531734     536801     536510     533676     534853     532038   Local timer interrupts
> RES:       3032       2128       1792       1765       2184       1703       1754       1865   Rescheduling interrupts
> TLB:        189         15         13         17         64        294         97         63   TLB shootdowns
> 
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
> 
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
> 
> CAL:         93        286        396        754        272        297        275        281   Function call interrupts
> 
> LOC:     520550     517751     517043     522016     520302     518479     519329     517179   Local timer interrupts
> RES:       2131       1371       1376       1269       1390       1181       1409       1280   Rescheduling interrupts
> TLB:        280         26         27         30         65        305        134         75   TLB shootdowns
> 
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
> 
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
> 
> CAL:         93        463        410        540        298        282        272        306   Function call interrupts
> 
> LOC:     513956     510749     509890     514897     514300     512392     512825     510574   Local timer interrupts
> RES:       1174       2081       1411       1320       1742       2683       1380       1230   Rescheduling interrupts
> TLB:        274         21         19         22         57        317        131         61   TLB shootdowns
> 
> patched (WMARK_HIGH, limited scan)
> ----------------------------------
> nr_alloc_fail 276
> allocstall 54034
> 
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
> 
> CAL:         69        443        421        567        273        279        269        334   Function call interrupts

Looks amazing.

> 
> LOC:     514736     511698     510993     514069     514185     512986     513838     511229   Local timer interrupts
> RES:       2153       1556       1126       1351       3047       1554       1131       1560   Rescheduling interrupts
> TLB:        209         26         20         15         71        315        117         71   TLB shootdowns
> 
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
> ----------------------------------------------------------------
> 
> start time: 3
> total time: 50
> nr_alloc_fail 162
> allocstall 45523
> 
> CPU             count     real total  virtual total    delay total
>                   921     3024540200     3009244668    37123129525
> IO              count    delay total  delay average
>                     0              0              0ms
> SWAP            count    delay total  delay average
>                     0              0              0ms
> RECLAIM         count    delay total  delay average
>                   357     4891766796             13ms
> dd: read=0, write=0, cancelled_write=0
> 
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
> -----------------------------------------------------------------
> 
> start time: 272
> total time: 509
> nr_alloc_fail 3913
> allocstall 541789
> 
> CPU             count     real total  virtual total    delay total
>                  1044     3445476208     3437200482   229919915202
> IO              count    delay total  delay average
>                     0              0              0ms
> SWAP            count    delay total  delay average
>                     0              0              0ms
> RECLAIM         count    delay total  delay average
>                   452    34691441605             76ms
> dd: read=0, write=0, cancelled_write=0
> 
> patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
> --------------------------------------------------------------------------------
> 
> start time: 278
> total time: 513
> nr_alloc_fail 4737
> allocstall 436392
> 
> 
> CPU             count     real total  virtual total    delay total
>                  1024     3371487456     3359441487   225088210977
> IO              count    delay total  delay average
>                     1      160631171            160ms
> SWAP            count    delay total  delay average
>                     0              0              0ms
> RECLAIM         count    delay total  delay average
>                   367    30809994722             83ms
> dd: read=20480, write=0, cancelled_write=0
> 
> 
> no cond_resched():

What's this?

> 
> start time: 263
> total time: 516
> nr_alloc_fail 5144
> allocstall 436787
> 
> CPU             count     real total  virtual total    delay total
>                  1018     3305497488     3283831119   241982934044
> IO              count    delay total  delay average
>                     0              0              0ms
> SWAP            count    delay total  delay average
>                     0              0              0ms
> RECLAIM         count    delay total  delay average
>                   328    31398481378             95ms
> dd: read=0, write=0, cancelled_write=0
> 
> zone_watermark_ok_safe():
> 
> start time: 266
> total time: 513
> nr_alloc_fail 4526
> allocstall 440246
> 
> CPU             count     real total  virtual total    delay total
>                  1119     3640446568     3619184439   240945024724
> IO              count    delay total  delay average
>                     3      303620082            101ms
> SWAP            count    delay total  delay average
>                     0              0              0ms
> RECLAIM         count    delay total  delay average
>                   372    27320731898             73ms
> dd: read=77824, write=0, cancelled_write=0
> 
> 
> start time: 275

What's the meaning of start time?

> total time: 517

Is total time the elapsed time of your experiment?

> nr_alloc_fail 4694
> allocstall 431021
> 
> 
> CPU             count     real total  virtual total    delay total
>                  1073     3534462680     3512544928   234056498221

What's the meaning of the CPU fields?

> IO              count    delay total  delay average
>                     0              0              0ms
> SWAP            count    delay total  delay average
>                     0              0              0ms
> RECLAIM         count    delay total  delay average
>                   386    34751778363             89ms
> dd: read=0, write=0, cancelled_write=0
> 

Where is the vanilla data for comparing latency?
Personally, I find it hard to parse your data.


> CC: Mel Gorman <mel@linux.vnet.ibm.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/buffer.c          |    4 ++--
>  include/linux/swap.h |    3 ++-
>  mm/page_alloc.c      |   20 +++++---------------
>  mm/vmscan.c          |   31 +++++++++++++++++++++++--------
>  4 files changed, 32 insertions(+), 26 deletions(-)
> --- linux-next.orig/mm/vmscan.c	2011-04-29 10:42:14.000000000 +0800
> +++ linux-next/mm/vmscan.c	2011-04-30 21:59:33.000000000 +0800
> @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
>   * returns:	0, if no pages reclaimed
>   * 		else, the number of pages reclaimed
>   */
> -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> -					struct scan_control *sc)
> +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
> +					  struct zonelist *zonelist,
> +					  struct scan_control *sc)
>  {
>  	int priority;
>  	unsigned long total_scanned = 0;
> @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
>  	struct zoneref *z;
>  	struct zone *zone;
>  	unsigned long writeback_threshold;
> +	unsigned long min_reclaim = sc->nr_to_reclaim;

Hmm,

>  
>  	get_mems_allowed();
>  	delayacct_freepages_start();
> @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
>  	if (scanning_global_lru(sc))
>  		count_vm_event(ALLOCSTALL);
>  
> +	if (preferred_zone)
> +		sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
> +

Hmm, I don't like this idea.
The goal of the direct reclaim path is to reclaim pages ASAP, I believe.
Most of the work should be done by kswapd in the background.
If the admin changes min_free_kbytes, it can affect the latency of direct reclaim.
That doesn't make sense to me.
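
A rough sketch of how the direct reclaim target would scale with
min_free_kbytes under the hunk above. The assumptions are a simplification:
a single zone, 4KB pages, and a high watermark of min + min/2. All toy_*
names are hypothetical; only SWAP_CLUSTER_MAX=32 comes from the changelog.

```c
/* Hypothetical single-zone model; the constants below are assumptions. */
#define TOY_PAGE_SIZE_KB	4UL
#define TOY_SWAP_CLUSTER_MAX	32UL

/* Min watermark in pages, assuming one zone holds all of lowmem. */
static unsigned long toy_min_wmark(unsigned long min_free_kbytes)
{
	return min_free_kbytes / TOY_PAGE_SIZE_KB;
}

/* High watermark, assumed here to be min + min/2. */
static unsigned long toy_high_wmark(unsigned long min_free_kbytes)
{
	unsigned long min = toy_min_wmark(min_free_kbytes);

	return min + (min >> 1);
}

/* Direct reclaim target once the high watermark is added to it. */
static unsigned long toy_nr_to_reclaim(unsigned long min_free_kbytes)
{
	return TOY_SWAP_CLUSTER_MAX + toy_high_wmark(min_free_kbytes);
}
```

In this model min_free_kbytes=4096 gives a target of 1568 pages, and
min_free_kbytes=65536 gives 24608 pages, so a sysctl change directly
changes how much work each direct reclaimer is asked to do.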


>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		sc->nr_scanned = 0;
>  		if (!priority)
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>  			}
>  		}
>  		total_scanned += sc->nr_scanned;
> -		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> -			goto out;
> +		if (sc->nr_reclaimed >= min_reclaim) {
> +			if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> +				goto out;

I can't understand the logic.
If nr_reclaimed is bigger than min_reclaim, it's always greater than
nr_to_reclaim. What's the meaning of min_reclaim?


> +			if (total_scanned > 2 * sc->nr_to_reclaim)
> +				goto out;

What if there are lots of dirty pages in the LRU?
What if there are lots of unevictable pages in the LRU?
What if there are lots of mapped pages in the LRU but may_unmap = 0?
I mean it's a rather risky early conclusion.
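
A rough userspace sketch of that concern. It walks a hypothetical LRU page
by page and applies only the scan limit from the hunk above; it leaves out
the min_reclaim gate and the per-priority loop for brevity, and toy_reclaim
is a made-up name.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: reclaimable[i] says whether page i can be freed
 * right now (false stands for dirty, unevictable, or mapped pages when
 * may_unmap == 0). The walk stops once nr_to_reclaim pages are freed,
 * or once more than 2 * nr_to_reclaim pages have been scanned.
 * Returns the number of pages reclaimed.
 */
static unsigned long toy_reclaim(const bool *reclaimable, size_t nr_pages,
				 unsigned long nr_to_reclaim,
				 unsigned long *total_scanned)
{
	unsigned long nr_reclaimed = 0;
	size_t i;

	*total_scanned = 0;
	for (i = 0; i < nr_pages; i++) {
		(*total_scanned)++;
		if (reclaimable[i])
			nr_reclaimed++;
		if (nr_reclaimed >= nr_to_reclaim)
			break;
		if (*total_scanned > 2 * nr_to_reclaim)
			break;
	}
	return nr_reclaimed;
}
```

With only one page in ten reclaimable and nr_to_reclaim = 20, this walk
gives up after 41 pages with 5 pages reclaimed, although plenty of
reclaimable pages remain further down the list.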


> +			if (preferred_zone &&
> +			    zone_watermark_ok_safe(preferred_zone, sc->order,
> +					high_wmark_pages(preferred_zone),
> +					zone_idx(preferred_zone), 0))
> +				goto out;
> +		}

As I said, I think the direct reclaim path should be fast if possible, and
its latency should not be a function of min_free_kbytes.
Of course, there are already lots of obstacles to keeping direct reclaim
latency consistent, but at least I don't want to add another source of variance.


-- 
Kind regards,
Minchan Kim