From: Yosry Ahmed
Date: Mon, 9 Oct 2023 20:24:51 -0700
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
 Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long, kernel-team@cloudflare.com,
 Wei Xu, Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org
References: <20231010032117.1577496-1-yosryahmed@google.com> <20231010032117.1577496-4-yosryahmed@google.com>
In-Reply-To: <20231010032117.1577496-4-yosryahmed@google.com>
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed wrote:
>
> A global counter for the magnitude of memcg stats updates is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The same scheme is used: percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as
> we cannot reset the global counter except for a full flush. Per-memcg
> counters remove this blocker to doing subtree flushes, which helps
> avoid unnecessary work when the stats of a small subtree are needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept;
> adding a cgroup to the rstat tree involves a parent loop, and so does
> charging. Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks
> were run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d),
> which is deeper than a usual setup:
>
> (a) neper [1] with 1000 flows and 100 threads (single machine). The
> values in the table are the average of server and client throughputs
> in mbps after 30 iterations, each running for 30s:
>
>                        tcp_rr       tcp_stream
> Base                9504218.56     357366.84
> Patched             9656205.68     356978.39
> Delta                   +1.6%         -0.1%
> Standard Deviation      0.95%         1.03%
>
> An increase in the performance of tcp_rr doesn't really make sense,
> but it's probably in the noise. The same tests were run with 1 flow
> and 1 thread, but the throughput was too noisy to make any conclusions
> (the averages did not show regressions nonetheless).
>
> Looking at perf for one iteration of the above test,
> __mod_memcg_state() (which is where memcg_rstat_updated() is called)
> does not show up at all without this patch, but it shows up with this
> patch as 1.06% for tcp_rr and 0.36% for tcp_stream.
>
> (b) "stress-ng --vm 0 -t 1m --times --perf".
> I don't understand
> stress-ng very well, so I am not sure that's the best way to test
> this, but it spawns 384 workers and spits out a lot of metrics, which
> looks nice :) I picked a few that seem relevant to the stats update
> path. I also included cache misses, as this patch introduces more
> atomics that may bounce between cpu caches:
>
> Metric                 Base          Patched       Delta
> Cache Misses           3.394 B/sec   3.433 B/sec   +1.14%
> Cache L1D Read         0.148 T/sec   0.154 T/sec   +4.05%
> Cache L1D Read Miss    20.430 B/sec  21.820 B/sec  +6.8%
> Page Faults Total      4.304 M/sec   4.535 M/sec   +5.4%
> Page Faults Minor      4.304 M/sec   4.535 M/sec   +5.4%
> Page Faults Major      18.794 /sec   0.000 /sec
> Kmalloc                0.153 M/sec   0.152 M/sec   -0.65%
> Kfree                  0.152 M/sec   0.153 M/sec   +0.65%
> MM Page Alloc          4.640 M/sec   4.898 M/sec   +5.56%
> MM Page Free           4.639 M/sec   4.897 M/sec   +5.56%
> Lock Contention Begin  0.362 M/sec   0.479 M/sec   +32.32%
> Lock Contention End    0.362 M/sec   0.479 M/sec   +32.32%
> page-cache add         238.057 /sec  0.000 /sec
> page-cache del         6.265 /sec    6.267 /sec    -0.03%
>
> This is only using a single run in each case. I am not sure what to
> make of most of these numbers, but they mostly seem to be in the noise
> (some better, some worse). The lock contention numbers are
> interesting. I am not sure if higher is better or worse here. No new
> locks or lock sections are introduced by this patch either way.
>
> Looking at perf, __mod_memcg_state() shows up as 0.00% with and
> without this patch. This is suspicious, but I verified while stress-ng
> is running that all the threads are in the right cgroup.
>
> (c) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in the page_fault3 test) previously detected a 25.9%
> regression for a change in the stats update path [2].
> These are the numbers from 30 runs (+ is good):
>
> LABEL                        |     MEAN    |    MEDIAN   |    STDDEV   |
> -----------------------------+-------------+-------------+-------------+
> page_fault1_per_process_ops  |             |             |             |
> (A) base                     |  265207.738 |  262941.000 |   12112.379 |
> (B) patched                  |  249249.191 |  248781.000 |    8767.457 |
>                              |      -6.02% |      -5.39% |             |
> page_fault1_per_thread_ops   |             |             |             |
> (A) base                     |  241618.484 |  240209.000 |   10162.207 |
> (B) patched                  |  229820.671 |  229108.000 |    7506.582 |
>                              |      -4.88% |      -4.62% |             |
> page_fault1_scalability      |             |             |             |
> (A) base                     |     0.03545 |    0.035705 |   0.0015837 |
> (B) patched                  |    0.029952 |    0.029957 |   0.0013551 |
>                              |      -9.29% |      -9.35% |             |
> page_fault2_per_process_ops  |             |             |             |
> (A) base                     |  203916.148 |  203496.000 |    2908.331 |
> (B) patched                  |  186975.419 |  187023.000 |    1991.100 |
>                              |      -6.85% |      -6.90% |             |
> page_fault2_per_thread_ops   |             |             |             |
> (A) base                     |  170604.972 |  170532.000 |    1624.834 |
> (B) patched                  |  163100.260 |  163263.000 |    1517.967 |
>                              |      -4.40% |      -4.26% |             |
> page_fault2_scalability      |             |             |             |
> (A) base                     |    0.054603 |    0.054693 |  0.00080196 |
> (B) patched                  |    0.044882 |    0.044957 |   0.0011766 |
>                              |      -0.05% |      +0.33% |             |
> page_fault3_per_process_ops  |             |             |             |
> (A) base                     | 1299821.099 | 1297918.000 |    9882.872 |
> (B) patched                  | 1248700.839 | 1247168.000 |    8454.891 |
>                              |      -3.93% |      -3.91% |             |
> page_fault3_per_thread_ops   |             |             |             |
> (A) base                     |  387216.963 |  387115.000 |    1605.760 |
> (B) patched                  |  368538.213 |  368826.000 |    1852.594 |
>                              |      -4.82% |      -4.72% |             |
> page_fault3_scalability      |             |             |             |
> (A) base                     |     0.59909 |     0.59367 |     0.01256 |
> (B) patched                  |     0.59995 |     0.59769 |    0.010088 |
>                              |      +0.14% |      +0.68% |             |
>
> There are some microbenchmark regressions (and some minute
> improvements), but nothing outside the normal variance of this
> benchmark between kernel versions. The fix for [2] assumed that 3% is
> noise (and there were no further practical complaints), so hopefully
> this means that such variations in these microbenchmarks do not
> reflect on practical workloads.
>
> [1] https://github.com/google/neper
> [2] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
>
> Signed-off-by: Yosry Ahmed

Johannes, as I mentioned in a reply to v1, I think this might be what
you suggested in our previous discussion [1], but I am not sure this is
what you meant for the update path, so I did not add a Suggested-by.
Please let me know if this is what you meant, and I can amend the tag
as such.

[1] https://lore.kernel.org/lkml/20230913153758.GB45543@cmpxchg.org/