From: Johannes Weiner
To: Andrew Morton
Cc: Roman Gushchin, Michal Hocko, Tejun Heo, Chris Down, Michal Koutný,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 1/3] mm: memcontrol: fix memory.low proportional distribution
Date: Thu, 27 Feb 2020 14:56:04 -0500
Message-Id: <20200227195606.46212-2-hannes@cmpxchg.org>
In-Reply-To: <20200227195606.46212-1-hannes@cmpxchg.org>
References: <20200227195606.46212-1-hannes@cmpxchg.org>

When memory.low is overcommitted - i.e. the children claim more
protection than their shared ancestor grants them - the allowance is
distributed in proportion to how much each sibling uses their own
declared protection:

    low_usage = min(memory.low, memory.current)
    elow = parent_elow * (low_usage / siblings_low_usage)

However, siblings_low_usage is not the sum of all low_usages.
It sums up the usages of *only those cgroups that are within their
memory.low*. That means that low_usage can be *bigger* than
siblings_low_usage, and consequently the total protection afforded to
the children can be bigger than what the ancestor grants the subtree.

Consider three groups where two are in excess of their protection:

    A/memory.low = 10G
    A/A1/memory.low = 10G, memory.current = 20G
    A/A2/memory.low = 10G, memory.current = 20G
    A/A3/memory.low = 10G, memory.current = 8G

    siblings_low_usage = 8G (only A3 contributes)
    A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
    A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(8G) = 12.5G -> 10G
    A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(8G) = 10.0G

    (the 12.5G are capped to the explicit memory.low setting of 10G)

With that, the sum of all awarded protection below A is 30G, when A
only grants 10G for the entire subtree.

What does this mean in practice? A1 and A2 would still be in excess of
their 10G allowance and would be reclaimed, whereas A3 would not. As
A1 and A2 eventually drop below their protection setting, they would
be counted in siblings_low_usage again and the error would right
itself.

When reclaim was applied in a binary fashion (cgroup is reclaimed when
it's above its protection, otherwise it's skipped) this would actually
work out just fine. However, since 1bc63fb1272b ("mm, memcg: make scan
aggression always exclude protection"), reclaim pressure is scaled to
how much a cgroup is above its protection. As a result this
calculation error unduly skews pressure away from A1 and A2 toward the
rest of the system.

But why did we do it like this in the first place?
The reasoning behind exempting groups in excess from
siblings_low_usage was to go after them first during reclaim in an
overcommitted subtree:

    A/memory.low = 2G, memory.current = 4G
    A/A1/memory.low = 3G, memory.current = 2G
    A/A2/memory.low = 1G, memory.current = 2G

    siblings_low_usage = 2G (only A1 contributes)
    A1/elow = parent_elow(2G) * low_usage(2G) / siblings_low_usage(2G) = 2G
    A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G

While the children combined are overcommitting A and are technically
both at fault, A2 is actively declaring unprotected memory and we
would like to reclaim that first.

However, while this sounds like a noble goal on the face of it, it
doesn't make much difference in actual memory distribution: Because A
is overcommitted, reclaim will not stop once A2 gets pushed back to
within its allowance; we'll have to reclaim A1 either way. The end
result is still that protection is distributed proportionally, with A1
getting 3/4 (1.5G) and A2 getting 1/4 (0.5G) of A's allowance.

[ If A weren't overcommitted, it wouldn't make a difference since each
  cgroup would just get the protection it declares:

    A/memory.low = 2G, memory.current = 3G
    A/A1/memory.low = 1G, memory.current = 1G
    A/A2/memory.low = 1G, memory.current = 2G

  With the current calculation:

    siblings_low_usage = 1G (only A1 contributes)
    A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G
    A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(1G) = 2G -> 1G

  Including excess groups in siblings_low_usage:

    siblings_low_usage = 2G
    A1/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
    A2/elow = parent_elow(2G) * low_usage(1G) / siblings_low_usage(2G) = 1G -> 1G
]

Simplify the calculation and fix the proportional reclaim bug by
including excess cgroups in siblings_low_usage.
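
As a rough illustration of the two accounting schemes, the standalone
C sketch below models how siblings_low_usage is summed with and
without the exemption for groups in excess, and adds up the resulting
elow values. It is not the kernel implementation: struct child,
protected_old(), protected_new() and total_elow() are hypothetical
names made up for this sketch, and sizes are plain integers in MiB.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one child cgroup; all sizes are in MiB. */
struct child {
	unsigned long low;	/* memory.low */
	unsigned long usage;	/* memory.current */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Contribution to siblings_low_usage with the exemption: 0 once in excess. */
static unsigned long protected_old(const struct child *c)
{
	return c->usage <= c->low ? c->usage : 0;
}

/* Contribution without the exemption: min(memory.low, memory.current). */
static unsigned long protected_new(const struct child *c)
{
	return min_ul(c->usage, c->low);
}

/*
 * Sum the effective protection (elow) of all children of one parent.
 * 'protected' selects how each child counts toward siblings_low_usage.
 */
static unsigned long total_elow(const struct child *kids, size_t n,
				unsigned long parent_elow,
				unsigned long (*protected)(const struct child *))
{
	unsigned long siblings_low_usage = 0, total = 0;
	size_t i;

	assert(kids || !n);

	for (i = 0; i < n; i++)
		siblings_low_usage += protected(&kids[i]);

	for (i = 0; i < n; i++) {
		unsigned long low_usage = min_ul(kids[i].low, kids[i].usage);
		unsigned long elow = min_ul(kids[i].low, parent_elow);

		/* Proportional share, capped at the explicit memory.low. */
		if (siblings_low_usage)
			elow = min_ul(elow, parent_elow * low_usage /
					    siblings_low_usage);
		total += elow;
	}
	return total;
}
```

For the three-group example (A1 and A2 at 20G, A3 at 8G, each with
memory.low = 10G under parent_elow = 10G), total_elow() with
protected_old adds up to 30720 MiB (30G), while protected_new yields
10239 MiB, within the 10G that A grants; the missing MiB is
integer-division rounding.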

After this patch, the effective memory.low distribution from the
three-group example above would be as follows:

    A/memory.low = 10G
    A/A1/memory.low = 10G, memory.current = 20G
    A/A2/memory.low = 10G, memory.current = 20G
    A/A3/memory.low = 10G, memory.current = 8G

    siblings_low_usage = 28G
    A1/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.57G
    A2/elow = parent_elow(10G) * low_usage(10G) / siblings_low_usage(28G) = 3.57G
    A3/elow = parent_elow(10G) * low_usage(8G) / siblings_low_usage(28G) = 2.86G

Fixes: 1bc63fb1272b ("mm, memcg: make scan aggression always exclude protection")
Fixes: 230671533d64 ("mm: memory.low hierarchical behavior")
Acked-by: Tejun Heo
Acked-by: Roman Gushchin
Acked-by: Chris Down
Acked-by: Michal Hocko
Signed-off-by: Johannes Weiner
---
 mm/memcontrol.c   |  4 +---
 mm/page_counter.c | 12 ++----------
 2 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..874a0b00f89b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6236,9 +6236,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  * elow = min( memory.low, parent->elow * ------------------ ),
  *                                        siblings_low_usage
  *
- *             | memory.current, if memory.current < memory.low
- * low_usage = |
- *             | 0, otherwise.
+ * low_usage = min(memory.low, memory.current)
  *
  *
  * Such definition of the effective memory.low provides the expected
diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..75d53f15f040 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,11 +23,7 @@ static void propagate_protected_usage(struct page_counter *c,
 		return;
 
 	if (c->min || atomic_long_read(&c->min_usage)) {
-		if (usage <= c->min)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->min);
 		old_protected = atomic_long_xchg(&c->min_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
@@ -35,11 +31,7 @@ static void propagate_protected_usage(struct page_counter *c,
 	}
 
 	if (c->low || atomic_long_read(&c->low_usage)) {
-		if (usage <= c->low)
-			protected = usage;
-		else
-			protected = 0;
-
+		protected = min(usage, c->low);
 		old_protected = atomic_long_xchg(&c->low_usage, protected);
 		delta = protected - old_protected;
 		if (delta)
-- 
2.24.1