Date: Tue, 11 Feb 2020 10:16:27 +0000
From: Mel Gorman <mgorman@techsingularity.net>
To: Ivan Babrou
Cc: linux-mm@kvack.org, linux-kernel <linux-kernel@vger.kernel.org>,
	kernel-team, Andrew Morton, Rik van Riel, Vlastimil Babka
Subject: Re: Reclaim regression after 1c30844d2dfe
Message-ID: <20200211101627.GJ3466@techsingularity.net>

On Fri, Feb 07, 2020 at 02:54:43PM -0800, Ivan Babrou wrote:
> This change from 5.5:
>
> * https://github.com/torvalds/linux/commit/1c30844d2dfe
>
> > mm: reclaim small amounts of memory when an external fragmentation
> > event occurs
>
> introduced undesired effects in our environment.
>
> * NUMA with 2 x CPU
> * 128GB of RAM
> * THP disabled
> * Upgraded from 4.19 to 5.4
>
> Before the upgrade, free memory hovered at around 1.4GB with no
> spikes. After the upgrade, some machines decided that they needed a
> lot more than that, with frequent spikes above 10GB, often only on a
> single NUMA node.
>
> We can see kswapd quite active in balance_pgdat (it didn't look like
> it slept at all):
>
> $ ps uax | fgrep kswapd
> root  1850 23.0  0.0  0  0 ?  R  Jan30 1902:24 [kswapd0]
> root  1851  1.8  0.0  0  0 ?  S  Jan30  152:16 [kswapd1]
>
> This in turn massively increased pressure on the page cache, which
> did not go well for services that depend on a quick response from a
> local cache backed by solid storage.
>
> Here's how it looked when I zeroed vm.watermark_boost_factor:
>
> * https://imgur.com/a/6IZWicU
>
> IO subsided from 100% busy in page cache population at 300MB/s on a
> single SATA drive down to under 100MB/s.
>
> This sort of regression doesn't seem like a good thing.

It is not a good thing, so thanks for the report. Obviously I have not
seen anything similar, or at least nothing severe enough to show up on
my radar. I had seen some increases in reclaim activity affecting
benchmarks that rely on use-twice data remaining resident, but nothing
severe enough to warrant action.

Can you tell me if it is *always* node 0 that shows the crazy activity?
I ask because some conditions would have to be met for the boost to
always apply. The boost is already a per-zone attribute, but it is
treated indirectly as a pgdat property. What I'm thinking is that on
node 0, the DMA32 or DMA zone gets boosted but vmscan then reclaims
from higher zones until the boost is removed. That would excessively
reclaim memory but be specific to node 0. I've cc'd Rik as he says he
saw something similar even on single-node systems. The boost applying
to lower zones would still affect single-node systems, but NUMA
machines always being impacted by the boost would show that the boost
really needs to be a per-node flag.

Sure, we *could* apply the reclaim to just the lower zones, but that
potentially means a *lot* of scan activity -- up to 124G of pages on
Ivan's machine before a lower-zone page is found. That might be the
very situation being encountered here.

An alternative is that boosting is only ever applied to the highest
populated zone in a system. The intent of the patch was primarily about
THP, which can use any zone, to reduce its allocation latency. While
it's possible that there are cases where the latency of other orders
matters *and* they require lower zones, I think that is unlikely and
that this would be a safer option overall.

However, overall I think the simplest approach is to abort the boosting
if reclaim reaches higher priorities without being able to clear the
boost. The boost is a best-effort attempt to reduce allocation latency
in the future. This approach still has some overhead, as there is a
reclaim pass, but kswapd will abort and go to sleep once the normal
watermarks are met. This is build-tested only. Ideally someone on the
cc has a test case that can reproduce this specific problem of
excessive kswapd activity.
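For anyone reading along without the source handy, the reason a boost
on a single low zone behaves like a node-wide property is the top-down
check kswapd uses to decide whether boost reclaim is still needed. The
following is a condensed version of pgdat_watermark_boosted() from
5.4-era mm/vmscan.c, reconstructed from memory, so treat it as a sketch
rather than a verbatim quote:

static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx)
{
	struct zone *zone;
	int i;

	/*
	 * Walk top-down as higher zones are more likely to be
	 * populated. Any boosted zone at or below classzone_idx marks
	 * the whole node as boosted, which is why a boost on node 0's
	 * DMA32 zone alone can keep balance_pgdat() running.
	 */
	for (i = classzone_idx; i >= 0; i--) {
		zone = pgdat->node_zones + i;
		if (!managed_zone(zone))
			continue;

		if (zone->watermark_boost)
			return true;
	}

	return false;
}
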
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 572fb17c6273..71dd47172cef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3462,6 +3462,25 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	return false;
 }
 
+static void acct_boosted_reclaim(pg_data_t *pgdat, int classzone_idx,
+				unsigned long *zone_boosts)
+{
+	struct zone *zone;
+	unsigned long flags;
+	int i;
+
+	for (i = 0; i <= classzone_idx; i++) {
+		if (!zone_boosts[i])
+			continue;
+
+		/* Increments are under the zone lock */
+		zone = pgdat->node_zones + i;
+		spin_lock_irqsave(&zone->lock, flags);
+		zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
 /* Clear pgdat state for congested, dirty or under writeback. */
 static void clear_pgdat_congested(pg_data_t *pgdat)
 {
@@ -3654,9 +3673,17 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!nr_boost_reclaim && balanced)
 			goto out;
 
-		/* Limit the priority of boosting to avoid reclaim writeback */
-		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
-			raise_priority = false;
+		/*
+		 * Abort boosting if reclaiming at higher priority is not
+		 * working to avoid excessive reclaim due to lower zones
+		 * being boosted.
+		 */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2) {
+			acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
+			boosted = false;
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
 
 		/*
 		 * Do not writeback or swap pages for boosted reclaim. The
@@ -3738,18 +3765,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 out:
 	/* If reclaim was boosted, account for the reclaim done in this pass */
 	if (boosted) {
-		unsigned long flags;
-
-		for (i = 0; i <= classzone_idx; i++) {
-			if (!zone_boosts[i])
-				continue;
-
-			/* Increments are under the zone lock */
-			zone = pgdat->node_zones + i;
-			spin_lock_irqsave(&zone->lock, flags);
-			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
-			spin_unlock_irqrestore(&zone->lock, flags);
-		}
+		acct_boosted_reclaim(pgdat, classzone_idx, zone_boosts);
 
 		/*
 		 * As there is now likely space, wakeup kcompactd to defragment
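
As an aside, if the highest-populated-zone alternative mentioned above
turns out to be needed instead, one untested way to prototype it would
be a guard at the top of boost_watermark() in mm/page_alloc.c. To be
clear, zone_is_highest_populated() below is a hypothetical helper that
does not exist in the tree; zone_idx(), managed_zone() and MAX_NR_ZONES
are existing identifiers:

/* Hypothetical helper, not part of the posted patch */
static bool zone_is_highest_populated(struct zone *zone)
{
	pg_data_t *pgdat = zone->zone_pgdat;
	int i;

	/* Any managed zone above this one means we are not the highest */
	for (i = zone_idx(zone) + 1; i < MAX_NR_ZONES; i++) {
		if (managed_zone(pgdat->node_zones + i))
			return false;
	}

	return true;
}

boost_watermark() could then bail out early with something like

	if (!zone_is_highest_populated(zone))
		return;

before applying the boost, so the DMA and DMA32 zones on node 0 would
never accumulate a boost in the first place.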