From: Yosry Ahmed
Date: Tue, 25 Jul 2023 12:24:19 -0700
Subject: Re: [PATCH] mm: memcg: use rstat for non-hierarchical stats
To: Johannes Weiner
Cc: Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
    Andrew Morton, linux-kernel@vger.kernel.org,
    cgroups@vger.kernel.org, linux-mm@kvack.org

Hey Johannes,

On Tue, Jul 25, 2023 at 7:04 AM Johannes Weiner wrote:
>
> On Wed, Jul 19, 2023 at 05:46:13PM +0000, Yosry Ahmed wrote:
> > Currently, memcg uses rstat to maintain hierarchical stats. The
> > rstat framework keeps track of which cgroups have updates on which
> > cpus.
> >
> > For non-hierarchical stats, as memcg moved to rstat, they are no
> > longer readily available as counters. Instead, the percpu counters
> > for a given stat need to be summed to get the non-hierarchical stat
> > value. This causes a performance regression when reading
> > non-hierarchical stats on kernels where memcg moved to using rstat.
> > This is especially visible when reading memory.stat on cgroup v1.
> > There are also some code paths internal to the kernel that read
> > such non-hierarchical stats.
>
> It's actually not an rstat regression. It's always been this costly.
>
> Quick history:

Thanks for the context.

>
> We used to maintain *all* stats in per-cpu counters at the local
> level. memory.stat reads would have to iterate and aggregate the
> entire subtree every time. This was obviously very costly, so we
> added batched upward propagation during stat updates to simplify
> reads:
>
> commit 42a300353577ccc17ecc627b8570a89fa1678bec
> Author: Johannes Weiner
> Date:   Tue May 14 15:47:12 2019 -0700
>
>     mm: memcontrol: fix recursive statistics correctness & scalabilty
>
> However, that caused a regression in the stat write path, as the
> upward propagation would bottleneck on the cachelines in the shared
> parents. The fix for *that* re-introduced the per-cpu loops in the
> local stat reads:
>
> commit 815744d75152078cde5391fc1e3c2d4424323fb6
> Author: Johannes Weiner
> Date:   Thu Jun 13 15:55:46 2019 -0700
>
>     mm: memcontrol: don't batch updates of local VM stats and events
>
> So I wouldn't say it's a regression from rstat. Except for that short
> period between the two commits above, the read side for local stats
> was always expensive.

I was comparing from a 4.15 kernel, so I assumed the major change was
from rstat, but that was not accurate. Thanks for the history.

However, in that 4.15 kernel the local (non-hierarchical) stats were
readily available without iterating percpu counters. There is a
regression that was introduced somewhere.

Looking at the history you described, it seems like up until
815744d75152 we used to maintain "local" (aka non-hierarchical)
counters, so reading a local stat meant reading a single counter;
starting with 815744d75152, we have to loop over percpu counters
instead.

So it is not a regression of rstat, but seemingly it is a regression
of 815744d75152. Is my understanding incorrect?

>
> rstat promises a shot at finally fixing it, with less risk to the
> write path.
>
> > It is inefficient to iterate and sum counters in all cpus when the
> > rstat framework knows exactly when a percpu counter has an update.
> > Instead, maintain cpu-aggregated non-hierarchical counters for each
> > stat. During an rstat flush, keep those updated as well. When
> > reading non-hierarchical stats, we no longer need to iterate cpus;
> > we just need to read the maintained counters, similar to
> > hierarchical stats.
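
To make the read side concrete for the thread, here is a rough sketch
of before vs. after. This is simplified from the description above,
not the literal diff; the state_local field and the helper names are
illustrative:

  /*
   * Before: reading a local (non-hierarchical) stat walks every
   * possible CPU and sums the percpu counters on each read.
   */
  static unsigned long page_state_local_before(struct mem_cgroup *memcg,
                                               int idx)
  {
          unsigned long x = 0;
          int cpu;

          for_each_possible_cpu(cpu)
                  x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
          return x;
  }

  /*
   * After: the rstat flush, which already visits exactly the CPUs
   * with pending updates, also folds the deltas into a maintained
   * aggregate, so a read becomes a single load.
   */
  static unsigned long page_state_local_after(struct mem_cgroup *memcg,
                                              int idx)
  {
          return READ_ONCE(memcg->vmstats->state_local[idx]);
  }

The flush side only has to add the per-cpu deltas it is already
aggregating for the hierarchical counters into the local aggregate as
well.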
> >
> > A caveat is that we now need a stats flush before reading
> > local/non-hierarchical stats through
> > {memcg/lruvec}_page_state_local() or memcg_events_local(), where we
> > previously only needed a flush to read hierarchical stats. Most
> > contexts reading non-hierarchical stats are already doing a flush;
> > add a flush to the only missing context, count_shadow_nodes().
> >
> > With this patch, reading memory.stat from 1000 memcgs is 3x faster
> > on a machine with 256 cpus on cgroup v1:
> >
> >  # for i in $(seq 1000); do mkdir /sys/fs/cgroup/memory/cg$i; done
> >  # time cat /dev/cgroup/memory/cg*/memory.stat > /dev/null
> >
> >  real   0m0.125s
> >  user   0m0.005s
> >  sys    0m0.120s
> >
> > After:
> >
> >  real   0m0.032s
> >  user   0m0.005s
> >  sys    0m0.027s
> >
> > Signed-off-by: Yosry Ahmed
>
> Acked-by: Johannes Weiner

Thanks! I will reformulate the commit log after we agree on the
history.

>
> But I want to be clear: this isn't a regression fix. It's a new
> performance optimization for the deprecated cgroup1 code. And it
> comes at the cost of higher memory footprint for both cgroup1 AND
> cgroup2.

I still think it is, but I can easily be wrong.

I am hoping that the memory footprint is not a problem. There are
*roughly* 80 per-memcg stats/events (MEMCG_NR_STAT + NR_MEMCG_EVENTS)
and 55 per-lruvec stats (NR_VM_NODE_STAT_ITEMS). For each stat there
is an extra 8 bytes, so on a two-node machine that's
8 * (80 + 55 * 2) ~= 1.5 KiB per memcg.

Given that struct mem_cgroup is already large, and can easily be 100s
of KiBs on a large machine with many cpus, I hope there won't be a
noticeable regression.

>
> If this causes a regression, we should revert it again. But let's
> try.

Of course. Fingers crossed.
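
P.S. Spelling out the footprint arithmetic above, in case it helps;
the constants are rough and depend on config and node count:

  /* One extra 8-byte aggregate per stat item, per memcg: */
  unsigned long nstats  = 80;  /* ~ MEMCG_NR_STAT + NR_MEMCG_EVENTS */
  unsigned long nlruvec = 55;  /* ~ NR_VM_NODE_STAT_ITEMS */
  unsigned long nnodes  = 2;   /* two-node machine */
  unsigned long extra   = 8 * (nstats + nlruvec * nnodes);
  /* extra == 1520 bytes, i.e. ~1.5 KiB per memcg */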