From: Shakeel Butt
Date: Thu, 5 Apr 2018 18:04:50 -0700
Subject: Re: [PATCH v2 3/4] mm/vmscan: Don't change pgdat state on base of a single LRU list state.
In-Reply-To: <20180323152029.11084-4-aryabinin@virtuozzo.com>
References: <20180323152029.11084-1-aryabinin@virtuozzo.com> <20180323152029.11084-4-aryabinin@virtuozzo.com>
To: Andrey Ryabinin
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Johannes Weiner, Michal Hocko,
    Steven Rostedt, Linux MM, LKML, Cgroups
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Mar 23, 2018 at 8:20 AM, Andrey Ryabinin wrote:
> We have a separate LRU list for each memory cgroup. Memory reclaim
> iterates over cgroups and calls shrink_inactive_list() for every
> inactive LRU list.
> Based on the state of a single LRU, shrink_inactive_list() may flag
> the whole node as dirty, congested or under writeback. This is obviously
> wrong and hurtful. It's especially hurtful when we have a possibly
> small congested cgroup in the system. Then *all* direct reclaims waste
> time by sleeping in wait_iff_congested(). And the more memcgs the
> system has, the longer the memory allocation stall is, because
> wait_iff_congested() is called on each lru-list scan.
>
> Sum reclaim stats across all visited LRUs on the node and flag the node
> as dirty, congested or under writeback based on that sum. Also call
> congestion_wait() and wait_iff_congested() once per pgdat scan, instead
> of once per lru-list scan.
>
> This only fixes the problem for the global reclaim case. Per-cgroup
> reclaim may alter global pgdat flags too, which is wrong. But that is
> a separate issue and will be addressed in the next patch.
>
> This change will have no effect on systems with all workload
> concentrated in a single cgroup.
>
> Signed-off-by: Andrey Ryabinin

Seems reasonable.

Reviewed-by: Shakeel Butt

> ---
>  mm/vmscan.c | 124 +++++++++++++++++++++++++++++++++++-------------------------
>  1 file changed, 73 insertions(+), 51 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 403f59edd53e..2134b3ac8fa0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -116,6 +116,15 @@ struct scan_control {
>
>       /* Number of pages freed so far during a call to shrink_zones() */
>       unsigned long nr_reclaimed;
> +
> +     struct {
> +             unsigned int dirty;
> +             unsigned int unqueued_dirty;
> +             unsigned int congested;
> +             unsigned int writeback;
> +             unsigned int immediate;
> +             unsigned int file_taken;
> +     } nr;
>  };
>
>  #ifdef ARCH_HAS_PREFETCH
> @@ -1754,23 +1763,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>       mem_cgroup_uncharge_list(&page_list);
>       free_unref_page_list(&page_list);
>
> -     /*
> -      * If reclaim is isolating dirty pages under writeback, it implies
> -      * that the long-lived page allocation rate is exceeding the page
> -      * laundering rate. Either the global limits are not being effective
> -      * at throttling processes due to the page distribution throughout
> -      * zones or there is heavy usage of a slow backing device. The
> -      * only option is to throttle from reclaim context which is not ideal
> -      * as there is no guarantee the dirtying process is throttled in the
> -      * same way balance_dirty_pages() manages.
> -      *
> -      * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the number
> -      * of pages under pages flagged for immediate reclaim and stall if any
> -      * are encountered in the nr_immediate check below.
> -      */
> -     if (stat.nr_writeback && stat.nr_writeback == nr_taken)
> -             set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> -
>       /*
>        * If dirty pages are scanned that are not queued for IO, it
>        * implies that flushers are not doing their job. This can
> @@ -1785,40 +1777,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>       if (stat.nr_unqueued_dirty == nr_taken)
>               wakeup_flusher_threads(WB_REASON_VMSCAN);
>
> -     /*
> -      * Legacy memcg will stall in page writeback so avoid forcibly
> -      * stalling here.
> -      */
> -     if (sane_reclaim(sc)) {
> -             /*
> -              * Tag a node as congested if all the dirty pages scanned were
> -              * backed by a congested BDI and wait_iff_congested will stall.
> -              */
> -             if (stat.nr_dirty && stat.nr_dirty == stat.nr_congested)
> -                     set_bit(PGDAT_CONGESTED, &pgdat->flags);
> -
> -             /* Allow kswapd to start writing pages during reclaim. */
> -             if (stat.nr_unqueued_dirty == nr_taken)
> -                     set_bit(PGDAT_DIRTY, &pgdat->flags);
> -
> -             /*
> -              * If kswapd scans pages marked marked for immediate
> -              * reclaim and under writeback (nr_immediate), it implies
> -              * that pages are cycling through the LRU faster than
> -              * they are written so also forcibly stall.
> -              */
> -             if (stat.nr_immediate)
> -                     congestion_wait(BLK_RW_ASYNC, HZ/10);
> -     }
> -
> -     /*
> -      * Stall direct reclaim for IO completions if underlying BDIs and node
> -      * is congested. Allow kswapd to continue until it starts encountering
> -      * unqueued dirty pages or cycling through the LRU too quickly.
> -      */
> -     if (!sc->hibernation_mode && !current_is_kswapd() &&
> -         current_may_throttle())
> -             wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> +     sc->nr.dirty += stat.nr_dirty;
> +     sc->nr.congested += stat.nr_congested;
> +     sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> +     sc->nr.writeback += stat.nr_writeback;
> +     sc->nr.immediate += stat.nr_immediate;
> +     if (file)
> +             sc->nr.file_taken += nr_taken;
>
>       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                       nr_scanned, nr_reclaimed,
> @@ -2522,6 +2487,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>               unsigned long node_lru_pages = 0;
>               struct mem_cgroup *memcg;
>
> +             memset(&sc->nr, 0, sizeof(sc->nr));
> +
>               nr_reclaimed = sc->nr_reclaimed;
>               nr_scanned = sc->nr_scanned;
>
> @@ -2587,6 +2554,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>               if (sc->nr_reclaimed - nr_reclaimed)
>                       reclaimable = true;
>
> +             /*
> +              * If reclaim is isolating dirty pages under writeback, it
> +              * implies that the long-lived page allocation rate is exceeding
> +              * the page laundering rate. Either the global limits are not
> +              * being effective at throttling processes due to the page
> +              * distribution throughout zones or there is heavy usage of a
> +              * slow backing device. The only option is to throttle from
> +              * reclaim context which is not ideal as there is no guarantee
> +              * the dirtying process is throttled in the same way
> +              * balance_dirty_pages() manages.
> +              *
> +              * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
> +              * number of pages under pages flagged for immediate reclaim and
> +              * stall if any are encountered in the nr_immediate check below.
> +              */
> +             if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
> +                     set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +
> +             /*
> +              * Legacy memcg will stall in page writeback so avoid forcibly
> +              * stalling here.
> +              */
> +             if (sane_reclaim(sc)) {
> +                     /*
> +                      * Tag a node as congested if all the dirty pages
> +                      * scanned were backed by a congested BDI and
> +                      * wait_iff_congested will stall.
> +                      */
> +                     if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> +                             set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> +                     /* Allow kswapd to start writing pages during reclaim.*/
> +                     if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> +                             set_bit(PGDAT_DIRTY, &pgdat->flags);
> +
> +                     /*
> +                      * If kswapd scans pages marked marked for immediate
> +                      * reclaim and under writeback (nr_immediate), it
> +                      * implies that pages are cycling through the LRU
> +                      * faster than they are written so also forcibly stall.
> +                      */
> +                     if (sc->nr.immediate)
> +                             congestion_wait(BLK_RW_ASYNC, HZ/10);
> +             }
> +
> +             /*
> +              * Stall direct reclaim for IO completions if underlying BDIs
> +              * and node is congested. Allow kswapd to continue until it
> +              * starts encountering unqueued dirty pages or cycling through
> +              * the LRU too quickly.
> +              */
> +             if (!sc->hibernation_mode && !current_is_kswapd() &&
> +                 current_may_throttle())
> +                     wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> +
>       } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>                                        sc->nr_scanned - nr_scanned, sc));
>
> --
> 2.16.1
>
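For readers skimming the thread, here is a minimal standalone sketch of
the accumulate-then-decide pattern the patch introduces. It is not
kernel code: the struct and function names below are invented for
illustration, and only the congestion test mirrors the patch's
sc->nr.dirty == sc->nr.congested check.

    /* Illustrative userspace sketch only; all names are hypothetical. */
    #include <stdio.h>
    #include <stdbool.h>

    /* Per-list stats, standing in for shrink_inactive_list()'s "stat". */
    struct list_stats { unsigned int dirty, congested, taken; };

    /* Per-scan totals, standing in for the new sc->nr member. */
    struct scan_totals { unsigned int dirty, congested, taken; };

    /* Per-list step: only accumulate; never flag the whole node here. */
    static void account_list(struct scan_totals *t, const struct list_stats *s)
    {
            t->dirty     += s->dirty;
            t->congested += s->congested;
            t->taken     += s->taken;
    }

    /* Per-node step: decide once, after every list has been visited. */
    static bool node_congested(const struct scan_totals *t)
    {
            return t->dirty && t->dirty == t->congested;
    }

    int main(void)
    {
            /* One small fully-congested list among larger, cleaner ones. */
            struct list_stats lists[] = {
                    { .dirty = 4,  .congested = 4, .taken = 4  },
                    { .dirty = 10, .congested = 0, .taken = 64 },
                    { .dirty = 0,  .congested = 0, .taken = 64 },
            };
            struct scan_totals t = { 0 };

            for (unsigned int i = 0; i < sizeof(lists) / sizeof(lists[0]); i++)
                    account_list(&t, &lists[i]);

            /* Deciding per list, list 0 alone (4 == 4) would have flagged
             * the node congested; on the summed stats (14 != 4) it does
             * not, so direct reclaim is not needlessly stalled. */
            printf("node congested: %s\n", node_congested(&t) ? "yes" : "no");
            return 0;
    }

The point of the sketch is the split: the inner loop only sums, and the
single node-wide test runs once per pgdat scan, which is why
wait_iff_congested() stops being hit once per lru-list scan.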