References: <20231010032117.1577496-1-yosryahmed@google.com>
 <20231010032117.1577496-4-yosryahmed@google.com>
 <20231011003646.dt5rlqmnq6ybrlnd@google.com>
From: Yosry Ahmed
Date: Thu, 12 Oct 2023 15:23:06 -0700
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
To: Shakeel Butt, Andrew Morton
Cc: michael@phoronix.com, Feng Tang, kernel test robot, Johannes Weiner,
 Michal Hocko, Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo,
 Michal Koutný, Waiman Long, kernel-team@cloudflare.com, Wei Xu,
 Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"

On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt wrote:
>
> On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed wrote:
> >
[...]
> > > Yes this looks better. I think we should also ask intel perf and
> > > phoronix folks to run their benchmarks as well (but no need to block
> > > on them).
> >
> > Anything I need to do for this to happen? (I thought such testing is
> > already done on linux-next)
>
> Just Cced the relevant folks.
>
> Michael, Oliver & Feng, if you have some time/resource available,
> please do trigger your performance benchmarks on the following series
> (but nothing urgent):
>
> https://lore.kernel.org/all/20231010032117.1577496-1-yosryahmed@google.com/

Thanks for that.

> >
> > Also, any further comments on the patch (or the series in general)? If
> > not, I can send a new commit message for this patch in-place.
>
> Sorry, I haven't taken a look yet but will try in a week or so.

Sounds good, thanks.

Meanwhile, Andrew, could you please replace the commit log of this
patch as follows for more updated testing info:

Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg

A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant.
This avoids unnecessary flushes, which are not very cheap even if there
aren't a lot of stats to flush. It also avoids unnecessary lock
contention on the underlying global rstat lock.

Make this threshold per-memcg. The same scheme is used: percpu (now also
per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.

This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.

(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters remove this as a blocker from doing subtree flushes, which
helps avoid unnecessary work when the stats of a small subtree are
needed.

Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.

(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a parent loop, and so does charging.
Testing results below show no significant regressions.

(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.

This patch was tested to make sure no significant regressions are
introduced on the update path as follows.
The following benchmarks were run in a cgroup that is 2 levels deep
(/sys/fs/cgroup/a/b/):

(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
  # netserver -6
  # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K

Averaging 20 runs, the numbers are as follows:
  Base: 40198.0 mbps
  Patched: 38629.7 mbps (-3.9%)

The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).

(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:

             LABEL            |     MEAN    |   MEDIAN    |   STDDEV   |
------------------------------+-------------+-------------+------------+
  page_fault1_per_process_ops |             |             |            |
  (A) base                    |  270249.164 |  265437.000 |  13451.836 |
  (B) patched                 |  261368.709 |  255725.000 |  13394.767 |
                              |    -3.29%   |    -3.66%   |            |
  page_fault1_per_thread_ops  |             |             |            |
  (A) base                    |  242111.345 |  239737.000 |  10026.031 |
  (B) patched                 |  237057.109 |  235305.000 |   9769.687 |
                              |    -2.09%   |    -1.85%   |            |
  page_fault1_scalability     |             |             |            |
  (A) base                    |    0.034387 |    0.035168 |  0.0018283 |
  (B) patched                 |    0.033988 |    0.034573 |  0.0018056 |
                              |    -1.16%   |    -1.69%   |            |
  page_fault2_per_process_ops |             |             |            |
  (A) base                    |  203561.836 |  203301.000 |   2550.764 |
  (B) patched                 |  197195.945 |  197746.000 |   2264.263 |
                              |    -3.13%   |    -2.73%   |            |
  page_fault2_per_thread_ops  |             |             |            |
  (A) base                    |  171046.473 |  170776.000 |   1509.679 |
  (B) patched                 |  166626.327 |  166406.000 |    768.753 |
                              |    -2.58%   |    -2.56%   |            |
  page_fault2_scalability     |             |             |            |
  (A) base                    |    0.054026 |    0.053821 | 0.00062121 |
  (B) patched                 |    0.053329 |    0.05306   | 0.00048394 |
                              |    -1.29%   |    -1.41%   |            |
  page_fault3_per_process_ops |             |             |            |
  (A) base                    | 1295807.782 | 1297550.000 |   5907.585 |
  (B) patched                 | 1275579.873 | 1273359.000 |   8759.160 |
                              |    -1.56%   |    -1.86%   |            |
  page_fault3_per_thread_ops  |             |             |            |
  (A) base                    |  391234.164 |  390860.000 |   1760.720 |
  (B) patched                 |  377231.273 |  376369.000 |   1874.971 |
                              |    -3.58%   |    -3.71%   |            |
  page_fault3_scalability     |             |             |            |
  (A) base                    |     0.60369 |     0.60072 |  0.0083029 |
  (B) patched                 |     0.61733 |     0.61544 |   0.009855 |
                              |    +2.26%   |    +2.45%   |            |

All regressions seem to be minimal, and within the normal variance for
the benchmark. The fix for [1] assumes that 3% is noise (and there were
no further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical
workloads.

(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.

[1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/