Date: Tue, 13 Aug 2019 14:14:31 -0400
From: Johannes Weiner
To: Yang Shi
Cc: Michal Hocko, Andrew Morton, Linux MM, cgroups@vger.kernel.org,
    Linux Kernel Mailing List, kernel-team@fb.com
Subject: Re: [PATCH] mm: vmscan: do not share cgroup iteration between reclaimers
Message-ID: <20190813181431.GA23056@cmpxchg.org>
References: <20190812192316.13615-1-hannes@cmpxchg.org>
 <20190813132938.GJ17933@dhcp22.suse.cz>
User-Agent: Mutt/1.12.0 (2019-05-25)
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Aug 13, 2019 at 08:59:38AM -0700, Yang Shi wrote:
> On Tue, Aug 13, 2019 at 6:29 AM Michal Hocko wrote:
> >
> > On Mon 12-08-19 15:23:16, Johannes Weiner wrote:
> > > One of our services observed a high rate of cgroup OOM kills in the
> > > presence of large amounts of clean cache. Debugging showed that the
> > > culprit is the shared cgroup iteration in page reclaim.
> > >
> > > Under high allocation concurrency, multiple threads enter reclaim at
> > > the same time. Fearing overreclaim when we first switched from the
> > > single global LRU to cgrouped LRU lists, we introduced a shared
> > > iteration state for reclaim invocations - whether 1 or 20 reclaimers
> > > are active concurrently, we only walk the cgroup tree once: the
> > > first reclaimer reclaims the first cgroup, the second reclaimer the
> > > second cgroup, and so on. With more reclaimers than cgroups, we
> > > start another walk from the top.
> > >
> > > This sounded reasonable at the time, but the problem is that reclaim
> > > concurrency doesn't scale with allocation concurrency. As reclaim
> > > concurrency increases, the amount of memory individual reclaimers get
> > > to scan gets smaller and smaller. Individual reclaimers may only see
> > > one cgroup per cycle, and that one may not have much reclaimable
> > > memory. We see individual reclaimers declare OOM when there is plenty
> > > of reclaimable memory available in cgroups they didn't visit.
> > >
> > > This patch does away with the shared iterator; every reclaimer is
> > > allowed to scan the full cgroup tree and see all of the reclaimable
> > > memory, just like it would on a non-cgrouped system. This way, when
> > > OOM is declared, we know that the reclaimer actually had a chance.
> > >
> > > To still maintain fairness in reclaim pressure, disallow cgroup
> > > reclaim from bailing out of the tree walk early. Kswapd and regular
> > > direct reclaim already don't bail, so it's not clear why limit
> > > reclaim would have to, especially since it only walks subtrees to
> > > begin with.
> >
> > The code does bail out on any direct reclaim - be it limit or page
> > allocator triggered. Check the !current_is_kswapd part of the
> > condition.
>
> Yes, please see commit 2bb0f34fe3c1 ("mm: vmscan: do not iterate all
> mem cgroups for global direct reclaim").

That commit was a workaround for the cgroup tree blowing up with
zombie cgroups. Roman's slab reparenting patches are fixing the
zombies, so we shouldn't need it anymore. Because with or without the
direct reclaim rule, we still don't want offline cgroups to accumulate
like this: they also slow down kswapd, and they eat a ton of RAM.
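[ For readers following along, the loop being discussed is the memcg
  walk in shrink_node(). The sketch below is paraphrased from memory
  of the 5.2/5.3-era mm/vmscan.c rather than quoted, so treat names
  and fields as approximate; the point is the shared reclaim cookie
  and the !current_is_kswapd() bail-out referenced above: ]

	struct mem_cgroup *root = sc->target_mem_cgroup;
	/*
	 * Shared iteration state: concurrent reclaimers resume each
	 * other's position in the cgroup tree instead of each doing
	 * a full walk of their own.
	 */
	struct mem_cgroup_reclaim_cookie reclaim = {
		.pgdat = pgdat,
		.priority = sc->priority,
	};
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		unsigned long lru_pages;

		shrink_node_memcg(pgdat, memcg, sc, &lru_pages);

		/*
		 * The bail-out Michal points at: since commit
		 * 2bb0f34fe3c1, any non-kswapd reclaimer - limit or
		 * page allocator triggered - stops the walk once it
		 * has reclaimed enough.
		 */
		if (!current_is_kswapd() &&
		    sc->nr_reclaimed >= sc->nr_to_reclaim) {
			mem_cgroup_iter_break(root, memcg);
			break;
		}
	} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

[ The patch at the top of this thread, as described above, effectively
  drops the shared cookie - i.e. passes NULL to mem_cgroup_iter() - so
  every reclaimer performs its own full walk, and the fairness
  question shifts entirely to the bail-out condition. ]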
> > > This change completely eliminates the OOM kills on our service, while
> > > showing no signs of overreclaim - no increased scan rates, %sys time,
> > > or abrupt free memory spikes. I tested across 100 machines that have
> > > 64G of RAM and host about 300 cgroups each.
> >
> > What is the usual direct reclaim involvement on those machines?
> >
> > > [ It's possible overreclaim never was a *practical* issue to begin
> > >   with - it was simply a concern we had on the mailing lists at the
> > >   time, with no real data to back it up. But we have also added more
> > >   bail-out conditions deeper inside reclaim (e.g. the proportional
> > >   exit in shrink_node_memcg) since. Regardless, now we have data that
> > >   suggests full walks are more reliable and scale just fine. ]
> >
> > I do not see how the shrink_node_memcg bail-out helps here. We do scan
> > up to SWAP_CLUSTER_MAX pages for each LRU at least once, so we are
> > looking at an nr_memcgs_with_pages multiplier with the patch applied
> > in the worst case.
> >
> > How much that matters is another question, and it depends on the
> > number of cgroups and the rate at which direct reclaim happens. I do
> > not remember exact numbers, but even walking a very large memcg tree
> > was noticeable.
>
> I'm concerned by this too. It might be OK for cgroup v2, but v1 still
> dominates. And, considering offline memcgs, it might not be unusual to
> have a quite large memcg tree.

cgroup2 was affected by the offline memcgs just as much as cgroup1 -
probably even more so, because it tracks more types of memory by
default. That's why Roman worked tirelessly on a solution. But we
shouldn't keep those bandaid patches around forever.
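[ To put a rough number on the worst case quoted above: each memcg
  with pages gets visited at least once per full walk, and each of its
  LRU lists gets one scan pass of up to SWAP_CLUSTER_MAX (32) pages.
  Taking the ~300 cgroups per machine from the test setup, and
  assuming 4 evictable LRU lists, 4K pages, and every LRU holding at
  least 32 pages (assumptions for illustration, not numbers from this
  thread), the floor works out to roughly the following - toy
  userspace arithmetic, not kernel code: ]

	#include <stdio.h>

	int main(void)
	{
		const unsigned long swap_cluster_max = 32; /* SWAP_CLUSTER_MAX */
		const unsigned long nr_memcgs = 300;  /* test setup above */
		const unsigned long nr_lrus = 4;      /* assumed: anon/file x active/inactive */
		const unsigned long page_size = 4096; /* assumed 4K pages */

		unsigned long pages = nr_memcgs * nr_lrus * swap_cluster_max;

		/* ~38400 pages, i.e. on the order of 150MB looked at per walk */
		printf("min pages scanned per full walk: %lu (~%lu MB)\n",
		       pages, pages * page_size >> 20);
		return 0;
	}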