References: <20230921081057.3440885-1-yosryahmed@google.com>
 <20230921081057.3440885-4-yosryahmed@google.com>
In-Reply-To: <20230921081057.3440885-4-yosryahmed@google.com>
From: Yosry Ahmed
Date: Thu, 21 Sep 2023 01:13:23 -0700
Subject: Re: [PATCH 3/5] mm: memcg: make stats flushing threshold per-memcg
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
 kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Thu, Sep 21, 2023 at 1:11 AM Yosry Ahmed wrote:
>
> A global counter for the magnitude of memcg stats updates is
> maintained on the memcg side to avoid invoking rstat flushes when the
> pending updates are not significant. This avoids unnecessary flushes,
> which are not very cheap even if there isn't a lot of stats to flush.
> It also avoids unnecessary lock contention on the underlying global
> rstat lock.
>
> Make this threshold per-memcg. The same scheme is followed: percpu
> (now also per-memcg) counters are incremented in the update path, and
> only propagated to per-memcg atomics when they exceed a certain
> threshold.
>
> This provides two benefits:
>
> (a) On large machines with a lot of memcgs, the global threshold can
> be reached relatively fast, so guarding the underlying lock becomes
> less effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as
> we cannot reset the global counter except for a full flush. Per-memcg
> counters remove this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
>
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS *
> 4 bytes.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept:
> adding a cgroup to the rstat tree involves a parent loop, and so does
> charging. Testing in a later patch shows this doesn't introduce
> significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s
> anyway.
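
To put rough, illustrative numbers on the error bound in (c), assuming
MEMCG_CHARGE_BATCH stays at its current value of 64: on a 256-CPU
machine, the old system-wide worst case is 256 * 64 = 16k unflushed
stat updates, while with 10k memcgs the new worst case becomes
256 * 64 * 10000 ~= 160M updates. In practice the 2s periodic flusher
bounds how long this error can accumulate, and, as noted above, percpu
stocks already allow a comparable per-memcg charge error.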
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. In a cgroup that is 4
> levels deep (/sys/fs/cgroup/a/b/c/d), the following benchmarks were
> run:
>
> (a) neper [1] with 1000 flows and 100 threads (single machine). The
> values in the table are the averages of the server and client
> throughputs in mbps after 30 iterations, each running for 30s:
>
>                         tcp_rr          tcp_stream
> Base                    9504218.56      357366.84
> Patched                 9656205.68      356978.39
> Delta                   +1.6%           -0.1%
> Standard Deviation      0.95%           1.03%
>
> An increase in the performance of tcp_rr doesn't really make sense,
> but it's probably in the noise. The same tests were run with 1 flow
> and 1 thread, but the throughput was too noisy to draw any
> conclusions (nonetheless, the averages did not show regressions).
>
> Looking at perf for one iteration of the above test,
> __mod_memcg_state() (which is where memcg_rstat_updated() is called)
> does not show up at all without this patch, but it shows up with this
> patch as 1.06% for tcp_rr and 0.36% for tcp_stream.
>
> (b) Running "stress-ng --vm 0 -t 1m --times --perf". I don't
> understand stress-ng very well, so I am not sure that's the best way
> to test this, but it spawns 384 workers and spits out a lot of
> metrics, which looks nice :) I picked a few that seem relevant to the
> stats update path. I also included cache misses, as this patch
> introduces more atomics that may bounce between cpu caches:
>
> Metric                  Base            Patched         Delta
> Cache Misses            3.394 B/sec     3.433 B/sec     +1.14%
> Cache L1D Read          0.148 T/sec     0.154 T/sec     +4.05%
> Cache L1D Read Miss     20.430 B/sec    21.820 B/sec    +6.8%
> Page Faults Total       4.304 M/sec     4.535 M/sec     +5.4%
> Page Faults Minor       4.304 M/sec     4.535 M/sec     +5.4%
> Page Faults Major       18.794 /sec     0.000 /sec
> Kmalloc                 0.153 M/sec     0.152 M/sec     -0.65%
> Kfree                   0.152 M/sec     0.153 M/sec     +0.65%
> MM Page Alloc           4.640 M/sec     4.898 M/sec     +5.56%
> MM Page Free            4.639 M/sec     4.897 M/sec     +5.56%
> Lock Contention Begin   0.362 M/sec     0.479 M/sec     +32.32%
> Lock Contention End     0.362 M/sec     0.479 M/sec     +32.32%
> page-cache add          238.057 /sec    0.000 /sec
> page-cache del          6.265 /sec      6.267 /sec      -0.03%
>
> This is only using a single run in each case. I am not sure what to
> make of most of these numbers, but they mostly seem to be in the
> noise (some better, some worse). The lock contention numbers are
> interesting. I am not sure if higher is better or worse here. No new
> locks or lock sections are introduced by this patch either way.
>
> Looking at perf, __mod_memcg_state() shows up as 0.00% with and
> without this patch. This is suspicious, but I verified while
> stress-ng is running that all the threads are in the right cgroup.
>
> [1] https://github.com/google/neper
>
> Signed-off-by: Yosry Ahmed

Johannes, I think this might be what you have suggested in our previous
discussion, but I am not sure this is what you meant for the update
path, so I did not add a Suggested-by. Please let me know if this is
what you meant, and I can amend the tag as such.
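
For anyone skimming, here is a minimal user-space sketch of the scheme
described above: per-cpu counters folded into a shared atomic once they
accumulate a batch, with the fold skipped when the group is already
past its flush threshold. The names (BATCH, NCPUS, struct group,
record_update) are invented for illustration only; the real kernel
implementation is in the quoted diff below.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define BATCH 64   /* stands in for MEMCG_CHARGE_BATCH */
#define NCPUS 8    /* stands in for num_online_cpus() */

struct group {
        struct group *parent;
        long percpu[NCPUS];   /* models the per-cpu stats_updates */
        atomic_long pending;  /* models vmstats->stats_updates */
};

static bool should_flush(struct group *g)
{
        return atomic_load(&g->pending) > (long)BATCH * NCPUS;
}

/* Called on every stat update from the current CPU. */
static void record_update(struct group *g, int cpu, long val)
{
        for (; g; g = g->parent) {
                long x = (g->percpu[cpu] += labs(val));

                if (x < BATCH)
                        continue;
                /*
                 * Fold the local batch into the shared atomic, unless
                 * this group is already flush-able (the atomic add
                 * would then be pure overhead).
                 */
                if (!should_flush(g))
                        atomic_fetch_add(&g->pending, x);
                g->percpu[cpu] = 0;
        }
}

The sketch omits the reset done at flush time; in the patch,
mem_cgroup_css_rstat_flush() zeroes the atomic (last hunk below).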
>
> ---
>  mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++----------------
>  1 file changed, 33 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ef7ad66a9e4c..c273c65bb642 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -627,6 +627,9 @@ struct memcg_vmstats_percpu {
>         /* Cgroup1: threshold notifications & softlimit tree updates */
>         unsigned long           nr_page_events;
>         unsigned long           targets[MEM_CGROUP_NTARGETS];
> +
> +       /* Stats updates since the last flush */
> +       unsigned int            stats_updates;
>  };
>
>  struct memcg_vmstats {
> @@ -641,6 +644,9 @@ struct memcg_vmstats {
>         /* Pending child counts during tree propagation */
>         long                    state_pending[MEMCG_NR_STAT];
>         unsigned long           events_pending[NR_MEMCG_EVENTS];
> +
> +       /* Stats updates since the last flush */
> +       atomic64_t              stats_updates;
>  };
>
>  /*
> @@ -660,9 +666,7 @@ struct memcg_vmstats {
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> -static DEFINE_PER_CPU(unsigned int, stats_updates);
>  static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> -static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_last_time;
>
>  #define FLUSH_TIME (2UL*HZ)
> @@ -689,26 +693,37 @@ static void memcg_stats_unlock(void)
>         preempt_enable_nested();
>  }
>
> +
> +static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
> +{
> +       return atomic64_read(&memcg->vmstats->stats_updates) >
> +               MEMCG_CHARGE_BATCH * num_online_cpus();
> +}
> +
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  {
> +       int cpu = smp_processor_id();
>         unsigned int x;
>
>         if (!val)
>                 return;
>
> -       cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
> +       cgroup_rstat_updated(memcg->css.cgroup, cpu);
> +
> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +               x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
> +                                         abs(val));
> +
> +               if (x < MEMCG_CHARGE_BATCH)
> +                       continue;
>
> -       x = __this_cpu_add_return(stats_updates, abs(val));
> -       if (x > MEMCG_CHARGE_BATCH) {
>                 /*
> -                * If stats_flush_threshold exceeds the threshold
> -                * (>num_online_cpus()), cgroup stats update will be triggered
> -                * in __mem_cgroup_flush_stats(). Increasing this var further
> -                * is redundant and simply adds overhead in atomic update.
> +                * If @memcg is already flush-able, increasing stats_updates is
> +                * redundant. Avoid the overhead of the atomic update.
>                  */
> -               if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
> -                       atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
> -               __this_cpu_write(stats_updates, 0);
> +               if (!memcg_should_flush_stats(memcg))
> +                       atomic64_add(x, &memcg->vmstats->stats_updates);
> +               __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
>         }
>  }
>
> @@ -727,13 +742,12 @@ static void do_flush_stats(void)
>
>         cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
>
> -       atomic_set(&stats_flush_threshold, 0);
>         atomic_set(&stats_flush_ongoing, 0);
>  }
>
>  void mem_cgroup_flush_stats(void)
>  {
> -       if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> +       if (memcg_should_flush_stats(root_mem_cgroup))
>                 do_flush_stats();
>  }
>
> @@ -747,8 +761,8 @@ void mem_cgroup_flush_stats_ratelimited(void)
>  static void flush_memcg_stats_dwork(struct work_struct *w)
>  {
>         /*
> -        * Always flush here so that flushing in latency-sensitive paths is
> -        * as cheap as possible.
> +        * Deliberately ignore memcg_should_flush_stats() here so that flushing
> +        * in latency-sensitive paths is as cheap as possible.
>          */
>         do_flush_stats();
>         queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
> @@ -5622,6 +5636,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>                         }
>                 }
>         }
> +       /* We are in a per-cpu loop here, only do the atomic write once */
> +       if (atomic64_read(&memcg->vmstats->stats_updates))
> +               atomic64_set(&memcg->vmstats->stats_updates, 0);
>  }
>
>  #ifdef CONFIG_MMU
> --
> 2.42.0.459.ge4e396fd5e-goog
>