Date: Wed, 4 May 2011 10:32:01 +0800
Subject: Re: [RFC][PATCH] mm: cut down __GFP_NORETRY page allocation failures
From: Dave Young
To: Wu Fengguang
Cc: Andrew Morton, Minchan Kim, linux-mm, Linux Kernel Mailing List,
    Mel Gorman, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Christoph Lameter,
    Dave Chinner, David Rientjes

On Wed, May 4, 2011 at 9:56 AM, Dave Young wrote:
> On Thu, Apr 28, 2011 at 9:36 PM, Wu Fengguang wrote:
>> Concurrent page allocations are suffering from high failure rates.
>>
>> On an 8p, 3GB RAM test box, when reading 1000 sparse files of size 1GB,
>> the page allocation failures are
>>
>> nr_alloc_fail 733       # interleaved reads by 1 single task
>> nr_alloc_fail 11799     # concurrent reads by 1000 tasks
>>
>> The concurrent read test script is:
>>
>>        for i in `seq 1000`
>>        do
>>                truncate -s 1G /fs/sparse-$i
>>                dd if=/fs/sparse-$i of=/dev/null &
>>        done
>>
>
> With a Core2 Duo, 3G RAM and no swap partition I cannot reproduce the alloc failures.

Unsetting CONFIG_SCHED_AUTOGROUP and CONFIG_CGROUP_SCHED seems to affect the
test results; now I see several nr_alloc_fail events (dd is not finished yet):

dave@darkstar-32:$ grep fail /proc/vmstat
nr_alloc_fail 4
compact_pagemigrate_failed 0
compact_fail 3
htlb_buddy_alloc_fail 0
thp_collapse_alloc_fail 4

So the result is related to the CPU scheduler.

>
>> In order for get_page_from_freelist() to get a free page,
>>
>> (1) try_to_free_pages() should use a much higher .nr_to_reclaim than the
>>     current SWAP_CLUSTER_MAX=32, in order to draw the zone out of the
>>     possible low watermark state as well as fill the pcp with enough free
>>     pages to overflow its high watermark.
>>
>> (2) the get_page_from_freelist() _after_ direct reclaim should use a lower
>>     watermark than its normal invocations, so that it can reasonably
>>     "reserve" some free pages for itself and prevent other concurrent
>>     page allocators from stealing all its reclaimed pages.
>>
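For context on point (2): the patch below implements the lower post-reclaim
watermark by setting ALLOC_HARDER before the retried get_page_from_freelist()
call. In the 2.6.39-era watermark check that flag shaves roughly a quarter off
the required minimum, along the lines of this simplified sketch (not the exact
kernel code; the per-order free_area loop is omitted and the helper name is
made up for illustration):

/*
 * Simplified sketch of the zone watermark test (2.6.39-era logic).
 * ALLOC_HARDER lowers the bar, so the task that just paid for direct
 * reclaim may dip a bit deeper than ordinary concurrent allocators.
 */
static bool zone_watermark_ok_sketch(struct zone *z, unsigned long mark,
				     int classzone_idx, int alloc_flags)
{
	long min = mark;
	long free_pages = zone_page_state(z, NR_FREE_PAGES);

	if (alloc_flags & ALLOC_HIGH)		/* __GFP_HIGH callers */
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)		/* now also post-reclaim retries */
		min -= min / 4;

	/* higher-order free_area accounting omitted for brevity */
	return free_pages > min + z->lowmem_reserve[classzone_idx];
}
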
>> Some notes:
>>
>> - commit 9ee493ce ("mm: page allocator: drain per-cpu lists after direct
>>   reclaim allocation fails") has the same goal, but is obviously more
>>   costly and less effective. It seems cleaner to just remove the retry
>>   and drain code than to retain it.
>>
>> - it's a bit hacky to reclaim more than the requested pages inside
>>   do_try_to_free_pages(), and it won't help cgroup for now
>>
>> - it only aims to reduce failures when there are plenty of reclaimable
>>   pages, so it stops the opportunistic reclaim once it has scanned twice
>>   the number of pages it was asked to reclaim
>>
>> Test results:
>>
>> - the failure rate is pretty sensitive to the page reclaim size,
>>   from 282 (WMARK_HIGH) to 704 (WMARK_MIN) to 10496 (SWAP_CLUSTER_MAX)
>>
>> - the IPIs are reduced by over 100 times
>>
>> base kernel: vanilla 2.6.39-rc3 + __GFP_NORETRY readahead page allocation patch
>> -------------------------------------------------------------------------------
>> nr_alloc_fail 10496
>> allocstall 1576602
>>
>> slabs_scanned 21632
>> kswapd_steal 4393382
>> kswapd_inodesteal 124
>> kswapd_low_wmark_hit_quickly 885
>> kswapd_high_wmark_hit_quickly 2321
>> kswapd_skip_congestion_wait 0
>> pageoutrun 29426
>>
>> CAL:     220449     220246     220372     220558     220251     219740     220043     219968   Function call interrupts
>> LOC:     536274     532529     531734     536801     536510     533676     534853     532038   Local timer interrupts
>> RES:       3032       2128       1792       1765       2184       1703       1754       1865   Rescheduling interrupts
>> TLB:        189         15         13         17         64        294         97         63   TLB shootdowns
>
> Could you tell me how to get the above info?
>
>> patched (WMARK_MIN)
>> -------------------
>> nr_alloc_fail 704
>> allocstall 105551
>>
>> slabs_scanned 33280
>> kswapd_steal 4525537
>> kswapd_inodesteal 187
>> kswapd_low_wmark_hit_quickly 4980
>> kswapd_high_wmark_hit_quickly 2573
>> kswapd_skip_congestion_wait 0
>> pageoutrun 35429
>>
>> CAL:         93        286        396        754        272        297        275        281   Function call interrupts
>> LOC:     520550     517751     517043     522016     520302     518479     519329     517179   Local timer interrupts
>> RES:       2131       1371       1376       1269       1390       1181       1409       1280   Rescheduling interrupts
>> TLB:        280         26         27         30         65        305        134         75   TLB shootdowns
>>
>> patched (WMARK_HIGH)
>> --------------------
>> nr_alloc_fail 282
>> allocstall 53860
>>
>> slabs_scanned 23936
>> kswapd_steal 4561178
>> kswapd_inodesteal 0
>> kswapd_low_wmark_hit_quickly 2760
>> kswapd_high_wmark_hit_quickly 1748
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32639
>>
>> CAL:         93        463        410        540        298        282        272        306   Function call interrupts
>> LOC:     513956     510749     509890     514897     514300     512392     512825     510574   Local timer interrupts
>> RES:       1174       2081       1411       1320       1742       2683       1380       1230   Rescheduling interrupts
>> TLB:        274         21         19         22         57        317        131         61   TLB shootdowns
>>
>> this patch (WMARK_HIGH, limited scan)
>> -------------------------------------
>> nr_alloc_fail 276
>> allocstall 54034
>>
>> slabs_scanned 24320
>> kswapd_steal 4507482
>> kswapd_inodesteal 262
>> kswapd_low_wmark_hit_quickly 2638
>> kswapd_high_wmark_hit_quickly 1710
>> kswapd_skip_congestion_wait 0
>> pageoutrun 32182
>>
>> CAL:         69        443        421        567        273        279        269        334   Function call interrupts
>> LOC:     514736     511698     510993     514069     514185     512986     513838     511229   Local timer interrupts
>> RES:       2153       1556       1126       1351       3047       1554       1131       1560   Rescheduling interrupts
>> TLB:        209         26         20         15         71        315        117         71   TLB shootdowns
>>
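The CAL rows above are the function-call interrupt counts from
/proc/interrupts; nr_alloc_fail, allocstall and the kswapd_* numbers come from
/proc/vmstat. The large CAL drop is expected: in 2.6.39 the retry path removed
by this patch drains the per-CPU pagelists with drain_all_pages(), which
broadcasts an IPI to every online CPU, roughly (2.6.39-era behaviour,
simplified):

/*
 * The drain invoked by the retry path that this patch removes.
 * on_each_cpu() sends a function-call IPI to every online CPU, which
 * is what shows up in the CAL column of /proc/interrupts -- once per
 * failed post-reclaim allocation attempt.
 */
void drain_all_pages(void)
{
	on_each_cpu(drain_local_pages, NULL, 1);
}
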
>> CC: Mel Gorman
>> Signed-off-by: Wu Fengguang
>> ---
>>  mm/page_alloc.c |   17 +++--------------
>>  mm/vmscan.c     |    6 ++++++
>>  2 files changed, 9 insertions(+), 14 deletions(-)
>> --- linux-next.orig/mm/vmscan.c 2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/vmscan.c      2011-04-28 21:28:57.000000000 +0800
>> @@ -1978,6 +1978,8 @@ static void shrink_zones(int priority, s
>>                                continue;
>>                        if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>>                                continue;       /* Let kswapd poll it */
>> +                       sc->nr_to_reclaim = max(sc->nr_to_reclaim,
>> +                                               zone->watermark[WMARK_HIGH]);
>>                }
>>
>>                shrink_zone(priority, zone, sc);
>> @@ -2034,6 +2036,7 @@ static unsigned long do_try_to_free_page
>>        struct zoneref *z;
>>        struct zone *zone;
>>        unsigned long writeback_threshold;
>> +       unsigned long min_reclaim = sc->nr_to_reclaim;
>>
>>        get_mems_allowed();
>>        delayacct_freepages_start();
>> @@ -2067,6 +2070,9 @@ static unsigned long do_try_to_free_page
>>                        }
>>                }
>>                total_scanned += sc->nr_scanned;
>> +               if (sc->nr_reclaimed >= min_reclaim &&
>> +                   total_scanned > 2 * sc->nr_to_reclaim)
>> +                       goto out;
>>                if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>>                        goto out;
>>
>> --- linux-next.orig/mm/page_alloc.c     2011-04-28 21:16:16.000000000 +0800
>> +++ linux-next/mm/page_alloc.c  2011-04-28 21:16:18.000000000 +0800
>> @@ -1888,9 +1888,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>>        nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
>>        int migratetype, unsigned long *did_some_progress)
>>  {
>> -       struct page *page = NULL;
>> +       struct page *page;
>>        struct reclaim_state reclaim_state;
>> -       bool drained = false;
>>
>>        cond_resched();
>>
>> @@ -1912,22 +1911,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
>>        if (unlikely(!(*did_some_progress)))
>>                return NULL;
>>
>> -retry:
>> +       alloc_flags |= ALLOC_HARDER;
>> +
>>        page = get_page_from_freelist(gfp_mask, nodemask, order,
>>                                        zonelist, high_zoneidx,
>>                                        alloc_flags, preferred_zone,
>>                                        migratetype);
>> -
>> -       /*
>> -        * If an allocation failed after direct reclaim, it could be because
>> -        * pages are pinned on the per-cpu lists. Drain them and try again
>> -        */
>> -       if (!page && !drained) {
>> -               drain_all_pages();
>> -               drained = true;
>> -               goto retry;
>> -       }
>> -
>>        return page;
>>  }
>>
>
> --
> Regards
> dave

--
Regards
dave
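Taken together, the two vmscan.c hunks in the quoted patch make the main loop
of do_try_to_free_pages() behave roughly like the condensed sketch below (a
simplification for illustration, not the actual code: the shrink_zones()
internals, lumpy reclaim and writeback throttling are omitted, and the helper
name is invented):

/*
 * Condensed view of the patched reclaim loop.  min_reclaim remembers
 * what the caller originally asked for, while shrink_zones() may raise
 * sc->nr_to_reclaim up to the zone's high watermark so that one direct
 * reclaim pass frees enough pages to keep a "reserve" against
 * concurrent allocators.
 */
static unsigned long do_try_to_free_pages_sketch(struct zonelist *zonelist,
						 struct scan_control *sc)
{
	unsigned long min_reclaim = sc->nr_to_reclaim;	/* caller's request */
	unsigned long total_scanned = 0;
	int priority;

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc->nr_scanned = 0;

		/* bumps sc->nr_to_reclaim towards zone->watermark[WMARK_HIGH]
		 * for global reclaim */
		shrink_zones(priority, zonelist, sc);

		total_scanned += sc->nr_scanned;

		/* Opportunistic part: once the original request is met, stop
		 * after scanning twice the (boosted) target, so the extra
		 * reclaim only happens while pages are easy to find. */
		if (sc->nr_reclaimed >= min_reclaim &&
		    total_scanned > 2 * sc->nr_to_reclaim)
			break;

		/* Full success: even the boosted target was reached. */
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;
	}
	return sc->nr_reclaimed;
}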