From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton
Cc: Michal Hocko, linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
Date: Mon, 12 Aug 2019 15:23:16 -0400
Message-Id: <20190812192316.13615-1-hannes@cmpxchg.org>

One of our services observed a high rate of cgroup OOM kills in the
presence of large amounts of clean cache. Debugging showed that the
culprit is the shared cgroup iteration in page reclaim.
Under high allocation concurrency, multiple threads enter reclaim at
the same time. Fearing overreclaim when we first switched from the
single global LRU to cgrouped LRU lists, we introduced a shared
iteration state for reclaim invocations - whether 1 or 20 reclaimers
are active concurrently, we only walk the cgroup tree once: the first
reclaimer reclaims the first cgroup, the second reclaimer the second
cgroup, and so on. With more reclaimers than cgroups, we start another
walk from the top.

This sounded reasonable at the time, but the problem is that reclaim
concurrency doesn't scale with allocation concurrency. As reclaim
concurrency increases, the amount of memory individual reclaimers get
to scan gets smaller and smaller. Individual reclaimers may only see
one cgroup per cycle, and that cgroup may not have much reclaimable
memory. We see individual reclaimers declare OOM when there is plenty
of reclaimable memory available in cgroups they didn't visit.

This patch does away with the shared iterator: every reclaimer is
allowed to scan the full cgroup tree and see all of the reclaimable
memory, just like it would on a non-cgrouped system. This way, when
OOM is declared, we know that the reclaimer actually had a chance.

To still maintain fairness in reclaim pressure, disallow cgroup
reclaim from bailing out of the tree walk early. Kswapd and regular
direct reclaim already don't bail, so it's not clear why limit reclaim
would have to, especially since it only walks subtrees to begin with.

This change completely eliminates the OOM kills on our service, while
showing no signs of overreclaim - no increased scan rates, %sys time,
or abrupt free memory spikes. I tested across 100 machines that have
64G of RAM and host about 300 cgroups each.

[ It's possible overreclaim never was a *practical* issue to begin
  with - it was simply a concern we had on the mailing lists at the
  time, with no real data to back it up. But we have also added more
  bail-out conditions deeper inside reclaim (e.g. the proportional
  exit in shrink_node_memcg) since. Regardless, now we have data that
  suggests full walks are more reliable and scale just fine. ]

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)
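For illustration only - this is not part of the patch, just a
simplified, hand-written sketch of the before/after iteration pattern
in shrink_node() for readers who don't have the mem_cgroup_iter() API
in their head. It assumes the mm/vmscan.c context (struct scan_control,
pg_data_t), and do_per_cgroup_reclaim() is a made-up stand-in for the
real per-cgroup scan-and-shrink work:

static void walk_shared(pg_data_t *pgdat, struct scan_control *sc,
                        struct mem_cgroup *root)
{
        /*
         * Before the patch: the cookie threads a shared iteration
         * cursor through mem_cgroup_iter(), so concurrent reclaimers
         * split the tree between them and each one only sees a slice
         * of it.  Non-kswapd reclaim also bails out as soon as
         * nr_to_reclaim has been met.
         */
        struct mem_cgroup_reclaim_cookie reclaim = {
                .pgdat = pgdat,
                .priority = sc->priority,
        };
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
                do_per_cgroup_reclaim(pgdat, memcg, sc);
                if (!current_is_kswapd() &&
                    sc->nr_reclaimed >= sc->nr_to_reclaim) {
                        mem_cgroup_iter_break(root, memcg);
                        break;
                }
        } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
}

static void walk_private(pg_data_t *pgdat, struct scan_control *sc,
                         struct mem_cgroup *root)
{
        /*
         * After the patch: a NULL cookie gives this reclaimer its own
         * full pre-order walk of the tree under @root, and the walk
         * is no longer cut short once nr_to_reclaim has been met.
         */
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(root, NULL, NULL);
        do {
                do_per_cgroup_reclaim(pgdat, memcg, sc);
        } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
}

The only functional differences between the two are the third argument
to mem_cgroup_iter() and the removal of the early bail-out, which is
exactly what the diff below does.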
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dbdc46a84f63..b2f10fa49c88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2667,10 +2667,6 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
-		struct mem_cgroup_reclaim_cookie reclaim = {
-			.pgdat = pgdat,
-			.priority = sc->priority,
-		};
 		unsigned long node_lru_pages = 0;
 		struct mem_cgroup *memcg;
 
@@ -2679,7 +2675,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		nr_reclaimed = sc->nr_reclaimed;
 		nr_scanned = sc->nr_scanned;
 
-		memcg = mem_cgroup_iter(root, NULL, &reclaim);
+		memcg = mem_cgroup_iter(root, NULL, NULL);
 		do {
 			unsigned long lru_pages;
 			unsigned long reclaimed;
@@ -2724,21 +2720,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 				   sc->nr_scanned - scanned,
 				   sc->nr_reclaimed - reclaimed);
 
-			/*
-			 * Kswapd have to scan all memory cgroups to fulfill
-			 * the overall scan target for the node.
-			 *
-			 * Limit reclaim, on the other hand, only cares about
-			 * nr_to_reclaim pages to be reclaimed and it will
-			 * retry with decreasing priority if one round over the
-			 * whole hierarchy is not sufficient.
-			 */
-			if (!current_is_kswapd() &&
-			    sc->nr_reclaimed >= sc->nr_to_reclaim) {
-				mem_cgroup_iter_break(root, memcg);
-				break;
-			}
-		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
+		} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-- 
2.22.0