Date: Fri, 6 Apr 2018 12:28:35 -0400
From: Johannes Weiner
To: Andrey Ryabinin
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Michal Hocko, Shakeel Butt, Steven Rostedt, linux-mm@kvack.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH v2 3/4] mm/vmscan: Don't change pgdat state on base of a single LRU list state.
Message-ID: <20180406162835.GD20806@cmpxchg.org>
References: <20180323152029.11084-1-aryabinin@virtuozzo.com> <20180323152029.11084-4-aryabinin@virtuozzo.com>
In-Reply-To: <20180323152029.11084-4-aryabinin@virtuozzo.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Mar 23, 2018 at 06:20:28PM +0300, Andrey Ryabinin wrote:
> We have separate LRU list for each memory cgroup. Memory reclaim iterates
> over cgroups and calls shrink_inactive_list() every inactive LRU list.
> Based on the state of a single LRU shrink_inactive_list() may flag
> the whole node as dirty,congested or under writeback. This is obviously
> wrong and hurtful. It's especially hurtful when we have possibly
> small congested cgroup in system. Than *all* direct reclaims waste time
> by sleeping in wait_iff_congested(). And the more memcgs in the system
> we have the longer memory allocation stall is, because
> wait_iff_congested() called on each lru-list scan.
>
> Sum reclaim stats across all visited LRUs on node and flag node as dirty,
> congested or under writeback based on that sum. Also call
> congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
> once per lru-list scan.
>
> This only fixes the problem for global reclaim case. Per-cgroup reclaim
> may alter global pgdat flags too, which is wrong. But that is separate
> issue and will be addressed in the next patch.
>
> This change will not have any effect on a systems with all workload
> concentrated in a single cgroup.

This makes a ton of sense, and I'm going to ack the patch, but there is
one issue here:

> @@ -2587,6 +2554,61 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		if (sc->nr_reclaimed - nr_reclaimed)
>  			reclaimable = true;
>
> +		/*
> +		 * If reclaim is isolating dirty pages under writeback, it
> +		 * implies that the long-lived page allocation rate is exceeding
> +		 * the page laundering rate. Either the global limits are not
> +		 * being effective at throttling processes due to the page
> +		 * distribution throughout zones or there is heavy usage of a
> +		 * slow backing device. The only option is to throttle from
> +		 * reclaim context which is not ideal as there is no guarantee
> +		 * the dirtying process is throttled in the same way
> +		 * balance_dirty_pages() manages.
> +		 *
> +		 * Once a node is flagged PGDAT_WRITEBACK, kswapd will count the
> +		 * number of pages under pages flagged for immediate reclaim and
> +		 * stall if any are encountered in the nr_immediate check below.
> +		 */
> +		if (sc->nr.writeback && sc->nr.writeback == sc->nr.file_taken)
> +			set_bit(PGDAT_WRITEBACK, &pgdat->flags);
> +
> +		/*
> +		 * Legacy memcg will stall in page writeback so avoid forcibly
> +		 * stalling here.
> +		 */
> +		if (sane_reclaim(sc)) {
> +			/*
> +			 * Tag a node as congested if all the dirty pages
> +			 * scanned were backed by a congested BDI and
> +			 * wait_iff_congested will stall.
> +			 */
> +			if (sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
> +				set_bit(PGDAT_CONGESTED, &pgdat->flags);
> +
> +			/* Allow kswapd to start writing pages during reclaim.*/
> +			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
> +				set_bit(PGDAT_DIRTY, &pgdat->flags);
> +
> +			/*
> +			 * If kswapd scans pages marked marked for immediate
> +			 * reclaim and under writeback (nr_immediate), it
> +			 * implies that pages are cycling through the LRU
> +			 * faster than they are written so also forcibly stall.
> +			 */
> +			if (sc->nr.immediate)
> +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		}

This isn't quite equivalent to what we have right now.

Yes, nr_dirty, nr_unqueued_dirty and nr_congested apply to file pages
only. That part is about waking the flushers and avoiding writing
files in 4k chunks from reclaim context. So those numbers do need to
be compared against scanned *file* pages.

But nr_writeback and nr_immediate are about throttling reclaim when we
hit too many pages under writeout, and that applies to both file and
anonymous/swap pages. We do want to throttle on swapout, too.

So nr_writeback needs to check against all nr_taken, not just file.
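
To make the difference concrete, here is a rough userspace sketch of
the two checks. It is not kernel code: struct reclaim_counts, the
"taken" field and both function names are hypothetical stand-ins for
sc->nr and for whatever total the final patch ends up tracking. It
only models the comparison, assuming "taken" counts all isolated pages
(file plus anon).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-scan counters (sc->nr in the hunk). */
struct reclaim_counts {
	unsigned long writeback;	/* isolated pages under writeback */
	unsigned long file_taken;	/* isolated file pages */
	unsigned long taken;		/* assumed: all isolated pages, file + anon */
};

/* The check as written in the quoted hunk: file pages only. */
bool writeback_stall_file_only(const struct reclaim_counts *nr)
{
	return nr->writeback && nr->writeback == nr->file_taken;
}

/* The check as suggested above: compare against everything isolated. */
bool writeback_stall_all_taken(const struct reclaim_counts *nr)
{
	/* File pages are a subset of all isolated pages. */
	assert(nr->file_taken <= nr->taken);
	return nr->writeback && nr->writeback == nr->taken;
}
```

With a swap-heavy scan (say 32 anon pages isolated, all of them under
writeback, and no file pages), the file-only variant never fires
because file_taken is 0, while the all-taken variant does. When only
file pages are isolated, the two variants agree.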