Date: Mon, 2 May 2011 01:35:42 +0900
From: Minchan Kim
To: Wu Fengguang
Cc: Andrew Morton, Mel Gorman, Dave Young, linux-mm,
    Linux Kernel Mailing List, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
    Christoph Lameter, Dave Chinner, David Rientjes
Subject: Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation failures
Message-ID: <20110501163542.GA3204@barrios-desktop>
References: <20110426062535.GB19717@localhost>
    <20110426063421.GC19717@localhost>
    <20110426092029.GA27053@localhost>
    <20110426124743.e58d9746.akpm@linux-foundation.org>
    <20110428133644.GA12400@localhost>
    <20110429022824.GA8061@localhost>
    <20110430141741.GA4511@localhost>
In-Reply-To: <20110430141741.GA4511@localhost>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Wu,

On Sat, Apr 30, 2011 at 10:17:41PM +0800, Wu Fengguang wrote:
> On Fri, Apr 29, 2011 at 10:28:24AM +0800, Wu Fengguang wrote:
> > > Test results:
> > >
> > > - the failure rate is pretty sensible to the page reclaim size,
> > >   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
> > >
> > > - the IPIs are reduced by over 100 times
> >
> > It's reduced by 500 times indeed.
> >
> > CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
> > CAL: 93 463 410 540 298 282 272 306 Function call interrupts
> >
> > > base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> > > -------------------------------------------------------------------------------
> > > nr_alloc_fail 10496
> > > allocstall 1576602
> >
> > > patched (WMARK_MIN)
> > > -------------------
> > > nr_alloc_fail 704
> > > allocstall 105551
> >
> > > patched (WMARK_HIGH)
> > > --------------------
> > > nr_alloc_fail 282
> > > allocstall 53860
> >
> > > this patch (WMARK_HIGH, limited scan)
> > > -------------------------------------
> > > nr_alloc_fail 276
> > > allocstall 54034
> >
> > There is a bad side effect though: the much reduced "allocstall" means
> > each direct reclaim will take much more time to complete. A simple solution
> > is to terminate direct reclaim after 10ms. I noticed that an 100ms
> > time threshold can reduce the reclaim latency from 621ms to 358ms.
> > Further lowering the time threshold to 20ms does not help reducing the
> > real latencies though.
>
> Experiments going on...
>
> I tried the more reasonable terminate condition: stop direct reclaim
> when the preferred zone is above high watermark (see the below chunk).
>
> This helps reduce the average reclaim latency to under 100ms in the
> 1000-dd case.
>
> However nr_alloc_fail is around 5000 and not ideal. The interesting
> thing is, even if zone watermark is high, the task still may fail to
> get a free page..
>
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>  			}
>  		}
>  		total_scanned += sc->nr_scanned;
> -		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> -			goto out;
> +		if (sc->nr_reclaimed >= min_reclaim) {
> +			if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> +				goto out;
> +			if (total_scanned > 2 * sc->nr_to_reclaim)
> +				goto out;
> +			if (preferred_zone &&
> +			    zone_watermark_ok_safe(preferred_zone, sc->order,
> +					high_wmark_pages(preferred_zone),
> +					zone_idx(preferred_zone), 0))
> +				goto out;
> +		}
>
>  		/*
>  		 * Try to write back as many pages as we just scanned. This
>
> Thanks,
> Fengguang
> ---
> Subject: mm: cut down __GFP_NORETRY page allocation failures
> Date: Thu Apr 28 13:46:39 CST 2011
>
> Concurrent page allocations are suffering from high failure rates.
>
> On a 8p, 3GB ram test box, when reading 1000 sparse files of size 1GB,
> the page allocation failures are
>
> 	nr_alloc_fail 733	# interleaved reads by 1 single task
> 	nr_alloc_fail 11799	# concurrent reads by 1000 tasks
>
> The concurrent read test script is:
>
> 	for i in `seq 1000`
> 	do
> 		truncate -s 1G /fs/sparse-$i
> 		dd if=/fs/sparse-$i of=/dev/null &
> 	done
>
> In order for get_page_from_freelist() to get free page,
>
> (1) try_to_free_pages() should use much higher .nr_to_reclaim than the
>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>     possible low watermark state as well as fill the pcp with enough free
>     pages to overflow its high watermark.
>
> (2) the get_page_from_freelist() _after_ direct reclaim should use lower
>     watermark than its normal invocations, so that it can reasonably
>     "reserve" some free pages for itself and prevent other concurrent
>     page allocators stealing all its reclaimed pages.

Did you see my old patch? It wasn't complete, but it's good enough to
show the idea.

http://marc.info/?l=linux-mm&m=129187231129887&w=4

The idea is to keep at least one page for the process that performed the
direct reclaim. Could it mitigate your problem, or could you enhance the
idea? I think it's a very simple and fair solution.

>
> Some notes:
>
> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>   reclaim allocation fails") has the same target, however is obviously
>   costly and less effective. It seems more clean to just remove the
>   retry and drain code than to retain it.

I tend to agree. I think my old patch can solve it.

>
> - it's a bit hacky to reclaim more than requested pages inside
>   do_try_to_free_page(), and it won't help cgroup for now
>
> - it only aims to reduce failures when there are plenty of reclaimable
>   pages, so it stops the opportunistic reclaim when scanned 2 times pages
>
> Test results:
>
> - the failure rate is pretty sensible to the page reclaim size,
>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>
> - the IPIs are reduced by over 100 times
>
> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
> -------------------------------------------------------------------------------
> nr_alloc_fail 10496
> allocstall 1576602
>
> slabs_scanned 21632
> kswapd_steal 4393382
> kswapd_inodesteal 124
> kswapd_low_wmark_hit_quickly 885
> kswapd_high_wmark_hit_quickly 2321
> kswapd_skip_congestion_wait 0
> pageoutrun 29426
>
> CAL: 220449 220246 220372 220558 220251 219740 220043 219968 Function call interrupts
>
> LOC: 536274 532529 531734 536801 536510 533676 534853 532038 Local timer interrupts
> RES: 3032 2128 1792 1765 2184 1703 1754 1865 Rescheduling interrupts
> TLB: 189 15 13 17 64 294 97 63 TLB shootdowns
>
> patched (WMARK_MIN)
> -------------------
> nr_alloc_fail 704
> allocstall 105551
>
> slabs_scanned 33280
> kswapd_steal 4525537
> kswapd_inodesteal 187
> kswapd_low_wmark_hit_quickly 4980
> kswapd_high_wmark_hit_quickly 2573
> kswapd_skip_congestion_wait 0
> pageoutrun 35429
>
> CAL: 93 286 396 754 272 297 275 281 Function call interrupts
>
> LOC: 520550 517751 517043 522016 520302 518479 519329 517179 Local timer interrupts
> RES: 2131 1371 1376 1269 1390 1181 1409 1280 Rescheduling interrupts
> TLB: 280 26 27 30 65 305 134 75 TLB shootdowns
>
> patched (WMARK_HIGH)
> --------------------
> nr_alloc_fail 282
> allocstall 53860
>
> slabs_scanned 23936
> kswapd_steal 4561178
> kswapd_inodesteal 0
> kswapd_low_wmark_hit_quickly 2760
> kswapd_high_wmark_hit_quickly 1748
> kswapd_skip_congestion_wait 0
> pageoutrun 32639
>
> CAL: 93 463 410 540 298 282 272 306 Function call interrupts
>
> LOC: 513956 510749 509890 514897 514300 512392 512825 510574 Local timer interrupts
> RES: 1174 2081 1411 1320 1742 2683 1380 1230 Rescheduling interrupts
> TLB: 274 21 19 22 57 317 131 61 TLB shootdowns
>
> patched (WMARK_HIGH, limited scan)
> ----------------------------------
> nr_alloc_fail 276
> allocstall 54034
>
> slabs_scanned 24320
> kswapd_steal 4507482
> kswapd_inodesteal 262
> kswapd_low_wmark_hit_quickly 2638
> kswapd_high_wmark_hit_quickly 1710
> kswapd_skip_congestion_wait 0
> pageoutrun 32182
>
> CAL: 69 443 421 567 273 279 269 334 Function call interrupts

Looks amazing.
>
> LOC: 514736 511698 510993 514069 514185 512986 513838 511229 Local timer interrupts
> RES: 2153 1556 1126 1351 3047 1554 1131 1560 Rescheduling interrupts
> TLB: 209 26 20 15 71 315 117 71 TLB shootdowns
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 100 dd
> ----------------------------------------------------------------
>
> start time: 3
> total time: 50
> nr_alloc_fail 162
> allocstall 45523
>
> CPU               count     real total  virtual total    delay total
>                     921     3024540200     3009244668    37123129525
> IO                count    delay total  delay average
>                       0              0            0ms
> SWAP              count    delay total  delay average
>                       0              0            0ms
> RECLAIM           count    delay total  delay average
>                     357     4891766796           13ms
> dd: read=0, write=0, cancelled_write=0
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK), 1000 dd
> -----------------------------------------------------------------
>
> start time: 272
> total time: 509
> nr_alloc_fail 3913
> allocstall 541789
>
> CPU               count     real total  virtual total    delay total
>                    1044     3445476208     3437200482   229919915202
> IO                count    delay total  delay average
>                       0              0            0ms
> SWAP              count    delay total  delay average
>                       0              0            0ms
> RECLAIM           count    delay total  delay average
>                     452    34691441605           76ms
> dd: read=0, write=0, cancelled_write=0
>
> patched (WMARK_HIGH, limited scan, stop on watermark OK, no time limit), 1000 dd
> --------------------------------------------------------------------------------
>
> start time: 278
> total time: 513
> nr_alloc_fail 4737
> allocstall 436392
>
>
> CPU               count     real total  virtual total    delay total
>                    1024     3371487456     3359441487   225088210977
> IO                count    delay total  delay average
>                       1      160631171          160ms
> SWAP              count    delay total  delay average
>                       0              0            0ms
> RECLAIM           count    delay total  delay average
>                     367    30809994722           83ms
> dd: read=20480, write=0, cancelled_write=0
>
>
> no cond_resched():

What's this?

>
> start time: 263
> total time: 516
> nr_alloc_fail 5144
> allocstall 436787
>
> CPU               count     real total  virtual total    delay total
>                    1018     3305497488     3283831119   241982934044
> IO                count    delay total  delay average
>                       0              0            0ms
> SWAP              count    delay total  delay average
>                       0              0            0ms
> RECLAIM           count    delay total  delay average
>                     328    31398481378           95ms
> dd: read=0, write=0, cancelled_write=0
>
> zone_watermark_ok_safe():
>
> start time: 266
> total time: 513
> nr_alloc_fail 4526
> allocstall 440246
>
> CPU               count     real total  virtual total    delay total
>                    1119     3640446568     3619184439   240945024724
> IO                count    delay total  delay average
>                       3      303620082          101ms
> SWAP              count    delay total  delay average
>                       0              0            0ms
> RECLAIM           count    delay total  delay average
>                     372    27320731898           73ms
> dd: read=77824, write=0, cancelled_write=0
>
>
> start time: 275

What's the meaning of start time?

> total time: 517

Is total time the elapsed time of your experiment?

> nr_alloc_fail 4694
> allocstall 431021
>
>
> CPU               count     real total  virtual total    delay total
>                    1073     3534462680     3512544928   234056498221

What's the meaning of the CPU fields?

> IO                count    delay total  delay average
>                       0              0            0ms
> SWAP              count    delay total  delay average
>                       0              0            0ms
> RECLAIM           count    delay total  delay average
>                     386    34751778363           89ms
> dd: read=0, write=0, cancelled_write=0
>

Where is the vanilla data for comparing latency?
Personally, I find your data hard to parse.
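As a hedged aside on reading the quoted tables: the "delay average"
column is consistent with dividing "delay total" by "count" and
truncating to whole milliseconds, if the totals are assumed to be in
nanoseconds. That unit, and the names delay_average_ms() and
check_quoted_rows() below, are assumptions made for illustration; the
quoted mail does not state them. A minimal userspace sketch:

```c
#include <assert.h>

/*
 * Average delay in whole milliseconds.
 * Assumption: delay_total_ns is a cumulative delay in nanoseconds.
 * A zero count yields 0 instead of dividing by zero.
 */
unsigned long long delay_average_ms(unsigned long long delay_total_ns,
				    unsigned long long count)
{
	if (count == 0)
		return 0;
	return delay_total_ns / 1000000ULL / count;
}

/* Re-derive a few rows from the quoted results. */
void check_quoted_rows(void)
{
	/* RECLAIM, 100 dd: 357 events, reported 13ms */
	assert(delay_average_ms(4891766796ULL, 357) == 13);
	/* RECLAIM, 1000 dd: 452 events, reported 76ms */
	assert(delay_average_ms(34691441605ULL, 452) == 76);
	/* IO, zone_watermark_ok_safe() run: 3 events, reported 101ms */
	assert(delay_average_ms(303620082ULL, 3) == 101);
}
```

Most of the quoted rows agree with this reading. The last RECLAIM row
(386 events, 89ms) comes out as 90 under the same formula, so the real
tool may compute the average slightly differently.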
> CC: Mel Gorman
> Signed-off-by: Wu Fengguang
> ---
>  fs/buffer.c          |    4 ++--
>  include/linux/swap.h |    3 ++-
>  mm/page_alloc.c      |   20 +++++---------------
>  mm/vmscan.c          |   31 +++++++++++++++++++++++--------
>  4 files changed, 32 insertions(+), 26 deletions(-)
> --- linux-next.orig/mm/vmscan.c	2011-04-29 10:42:14.000000000 +0800
> +++ linux-next/mm/vmscan.c	2011-04-30 21:59:33.000000000 +0800
> @@ -2025,8 +2025,9 @@ static bool all_unreclaimable(struct zon
>   * returns:	0, if no pages reclaimed
>   * 		else, the number of pages reclaimed
>   */
> -static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> -					struct scan_control *sc)
> +static unsigned long do_try_to_free_pages(struct zone *preferred_zone,
> +					  struct zonelist *zonelist,
> +					  struct scan_control *sc)
>  {
>  	int priority;
>  	unsigned long total_scanned = 0;
> @@ -2034,6 +2035,7 @@ static unsigned long do_try_to_free_page
>  	struct zoneref *z;
>  	struct zone *zone;
>  	unsigned long writeback_threshold;
> +	unsigned long min_reclaim = sc->nr_to_reclaim;

Hmm,

>
>  	get_mems_allowed();
>  	delayacct_freepages_start();
> @@ -2041,6 +2043,9 @@ static unsigned long do_try_to_free_page
>  	if (scanning_global_lru(sc))
>  		count_vm_event(ALLOCSTALL);
>
> +	if (preferred_zone)
> +		sc->nr_to_reclaim += preferred_zone->watermark[WMARK_HIGH];
> +

Hmm, I don't like this idea.
I believe the goal of the direct reclaim path is to reclaim pages as
soon as possible; most of the work should be done by kswapd in the
background. With this change, an admin who changes min_free_kbytes can
affect the latency of direct reclaim.
That doesn't make sense to me.

>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		sc->nr_scanned = 0;
>  		if (!priority)
> @@ -2067,8 +2072,17 @@ static unsigned long do_try_to_free_page
>  			}
>  		}
>  		total_scanned += sc->nr_scanned;
> -		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> -			goto out;
> +		if (sc->nr_reclaimed >= min_reclaim) {
> +			if (sc->nr_reclaimed >= sc->nr_to_reclaim)
> +				goto out;

I can't understand the logic.
If nr_reclaimed is bigger than min_reclaim, it's always greater than
nr_to_reclaim. What's the meaning of min_reclaim?

> +			if (total_scanned > 2 * sc->nr_to_reclaim)
> +				goto out;

What if there are lots of dirty pages in the LRU?
What if there are lots of unevictable pages in the LRU?
What if there are lots of mapped pages in the LRU but may_unmap = 0?
I mean, it's a rather risky early conclusion.

> +			if (preferred_zone &&
> +			    zone_watermark_ok_safe(preferred_zone, sc->order,
> +					high_wmark_pages(preferred_zone),
> +					zone_idx(preferred_zone), 0))
> +				goto out;
> +		}

As I said, I think the direct reclaim path should be as fast as
possible, and its latency should not be a function of min_free_kbytes.
Of course, there are already many obstacles to keeping direct reclaim
latency consistent, but at the least I don't want to add another source.

--
Kind regards,
Minchan Kim
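To make the question about min_reclaim concrete, here is a userspace
sketch of the exit conditions in the last quoted hunk. It assumes, as
the quoted hunks suggest, that min_reclaim holds the caller's original
nr_to_reclaim and that nr_to_reclaim has since been raised by the high
watermark. struct reclaim_state, enum stop_reason and
reclaim_stop_reason() are hypothetical names for illustration, not
kernel code, and the boolean stands in for zone_watermark_ok_safe().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few scan_control fields used here. */
struct reclaim_state {
	unsigned long nr_reclaimed;	/* pages reclaimed so far */
	unsigned long total_scanned;	/* pages scanned so far */
	unsigned long min_reclaim;	/* caller's original request */
	unsigned long nr_to_reclaim;	/* request + high watermark */
	bool zone_above_high_wmark;	/* stand-in for the watermark check */
};

enum stop_reason { KEEP_GOING, STOP_TARGET, STOP_SCAN_LIMIT, STOP_WMARK_OK };

/* Applies the checks in the same order as the quoted hunk. */
enum stop_reason reclaim_stop_reason(const struct reclaim_state *s)
{
	/* Assumption: the target was only ever raised, never lowered. */
	assert(s->nr_to_reclaim >= s->min_reclaim);

	if (s->nr_reclaimed < s->min_reclaim)
		return KEEP_GOING;	/* none of the exits is considered yet */
	if (s->nr_reclaimed >= s->nr_to_reclaim)
		return STOP_TARGET;
	if (s->total_scanned > 2 * s->nr_to_reclaim)
		return STOP_SCAN_LIMIT;
	if (s->zone_above_high_wmark)
		return STOP_WMARK_OK;
	return KEEP_GOING;
}
```

Under that assumption there is a window between min_reclaim and
nr_to_reclaim in which only the scan limit and the watermark check can
end the loop. For example, with an illustrative request of 32 pages and
a watermark of 1000 pages, a state with 40 pages reclaimed stops only
if the zone is above its high watermark or more than 2064 pages have
been scanned.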