From: Yosry Ahmed
Date: Mon, 23 Oct 2023 19:13:50 -0700
References: <20231010032117.1577496-4-yosryahmed@google.com> <202310202303.c68e7639-oliver.sang@intel.com>
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
To: Feng Tang
Cc: Shakeel Butt, "Sang, Oliver", oe-lkp@lists.linux.dev, lkp,
    cgroups@vger.kernel.org, linux-mm@kvack.org, "Huang, Ying",
    "Yin, Fengwei", Andrew Morton, Johannes Weiner, Michal Hocko,
    Roman Gushchin, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
    Waiman Long, kernel-team@cloudflare.com, Wei Xu, Greg Thelen,
    linux-kernel@vger.kernel.org

On Mon, Oct 23, 2023 at 11:25 AM Yosry Ahmed wrote:
>
> On Sun, Oct 22, 2023 at 6:34 PM Feng Tang wrote:
> >
> > On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt wrote:
> > > >
> > > > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a -25.8% regression of
> > > > > will-it-scale.per_thread_ops on:
> > > > >
> > > > > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > > > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > > > > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > > > > patch link: https://lore.kernel.org/all/20231010032117.1577496-4-yosryahmed@google.com/
> > > > > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> > > > >
> > > > > testcase: will-it-scale
> > > > > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > > > > parameters:
> > > > >
> > > > >   nr_task: 100%
> > > > >   mode: thread
> > > > >   test: fallocate1
> > > > >   cpufreq_governor: performance
> > > > >
> > > > > In addition to that, the commit also has significant impact on the
> > > > > following tests:
> > > > >
> > > > > +------------------+----------------------------------------------------------------+
> > > > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression  |
> > > > > | test machine     | 104 threads 2 sockets (Skylake) with 192G memory               |
> > > > > | test parameters  | cpufreq_governor=performance                                   |
> > > > > |                  | mode=thread                                                    |
> > > > > |                  | nr_task=50%                                                    |
> > > > > |                  | test=fallocate1                                                |
> > > > > +------------------+----------------------------------------------------------------+
> > > >
> > > > Yosry, I don't think a 25% to 30% regression can be ignored. Unless
> > > > there is a quick fix, IMO this series should be skipped for the
> > > > upcoming merge window.
> > >
> > > I am currently looking into it. It's reasonable to skip the next merge
> > > window if a quick fix isn't found soon.
> > >
> > > I am surprised by the size of the regression given the following:
> > >       1.12 ± 5%      +1.4        2.50 ± 2%
> > > perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> > >
> > > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
> >
> > Yes, this is kind of confusing. And we have seen similar cases before,
> > especially for micro-benchmarks like will-it-scale, stress-ng, netperf,
> > etc., where a change to those functions in the hot path was greatly
> > amplified in the final benchmark score.
> >
> > In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> > the affected functions had around a 10% change in perf's cpu-cycles,
> > yet triggered a 69% regression. IIRC, micro-benchmarks are very
> > sensitive to those statistics updates, like memcg's and vmstat's.
>
> Thanks for clarifying. I am still trying to reproduce locally, but I am
> running into some quirks with tooling. I may have to run a modified
> version of the fallocate test manually. Meanwhile, I noticed that the
> patch was tested without the fixlet that I posted [1] for it,
> understandably. Would it be possible to get some numbers with that
> fixlet? It should reduce the total number of contended atomic
> operations, so it may help.
>
> [1] https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
>
> I am also wondering if aligning the stats_updates atomic will help.
> Right now it may share a cacheline with some items of the
> events_pending array. The latter may be dirtied during a flush and
> unnecessarily dirty the former, but the chances are slim, to be honest.
> If it's easy to test such a diff, that would be nice, but I don't
> expect a lot of difference:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..a35fce653262 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -646,7 +646,7 @@ struct memcg_vmstats {
>         unsigned long events_pending[NR_MEMCG_EVENTS];
>
>         /* Stats updates since the last flush */
> -       atomic64_t stats_updates;
> +       atomic64_t stats_updates ____cacheline_aligned_in_smp;
>  };
>
>  /*

I still could not run the benchmark, but I used a version of
fallocate1.c that does 1 million iterations, and ran 100 instances in
parallel. This showed a ~13% regression with the patch, so not the same
as the will-it-scale version, but it could be an indicator.

With that, I did not see any improvement from the fixlet above or from
the ____cacheline_aligned_in_smp annotation, so you can scratch those.

I did, however, see some improvement from reducing the indirection
layers by moving stats_updates directly into struct mem_cgroup. The
regression in my manual testing went down to 9%. Still not great, but
I am wondering how this reflects on the benchmark.
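To make the indirection difference concrete, here is a standalone
userspace sketch (not the kernel code; the struct names and the batch
math only mirror the kernel's) of the two layouts the threshold check
has to walk:

/*
 * "before": the shared atomic sits behind memcg->vmstats, so every
 * threshold check pays two dependent loads (the vmstats pointer, then
 * the counter). "after": the atomic is embedded in mem_cgroup itself,
 * so the check needs one load fewer.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MEMCG_CHARGE_BATCH 64
#define NR_CPUS 104                     /* the Skylake test machine */

struct memcg_vmstats { atomic_llong stats_updates; };

struct mem_cgroup_before { struct memcg_vmstats *vmstats; };
struct mem_cgroup_after  { atomic_llong stats_updates; };

static bool should_flush_before(struct mem_cgroup_before *memcg)
{
        /* two dependent loads: memcg->vmstats, then the atomic */
        return atomic_load(&memcg->vmstats->stats_updates) >
               (long long)MEMCG_CHARGE_BATCH * NR_CPUS;
}

static bool should_flush_after(struct mem_cgroup_after *memcg)
{
        /* one load: the counter lives inside mem_cgroup */
        return atomic_load(&memcg->stats_updates) >
               (long long)MEMCG_CHARGE_BATCH * NR_CPUS;
}

int main(void)
{
        struct memcg_vmstats vmstats;
        struct mem_cgroup_before before = { .vmstats = &vmstats };
        struct mem_cgroup_after after;

        atomic_init(&vmstats.stats_updates, 10000);
        atomic_init(&after.stats_updates, 10000);
        printf("flush? before: %d, after: %d\n",
               should_flush_before(&before), should_flush_after(&after));
        return 0;
}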
If you're able to test it, that would be great; the diff is below.
Meanwhile, I am still looking for other improvements that can be made.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083e..b4dfcd8b9cc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {

        CACHELINE_PADDING(_pad1_);

+       /* Stats updates since the last flush */
+       atomic64_t stats_updates;
+
        /* memory.stat */
        struct memcg_vmstats *vmstats;

@@ -309,6 +312,7 @@ struct mem_cgroup {
        atomic_t moving_account;
        struct task_struct *move_lock_task;

+       unsigned int __percpu *stats_updates_percpu;
        struct memcg_vmstats_percpu __percpu *vmstats_percpu;

 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..e5d2f3d4d874 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
        /* Cgroup1: threshold notifications & softlimit tree updates */
        unsigned long nr_page_events;
        unsigned long targets[MEM_CGROUP_NTARGETS];
-
-       /* Stats updates since the last flush */
-       unsigned int stats_updates;
 };

 struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
        /* Pending child counts during tree propagation */
        long state_pending[MEMCG_NR_STAT];
        unsigned long events_pending[NR_MEMCG_EVENTS];
-
-       /* Stats updates since the last flush */
-       atomic64_t stats_updates;
 };

 /*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)

 static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
 {
-       return atomic64_read(&memcg->vmstats->stats_updates) >
+       return atomic64_read(&memcg->stats_updates) >
                MEMCG_CHARGE_BATCH * num_online_cpus();
 }

 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
        int cpu = smp_processor_id();
-       unsigned int x;
+       unsigned int *stats_updates_percpu;

        if (!val)
                return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        cgroup_rstat_updated(memcg->css.cgroup, cpu);

        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-               x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
-                                         abs(val));
+               stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);

-               if (x < MEMCG_CHARGE_BATCH)
+               *stats_updates_percpu += abs(val);
+               if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
                        continue;

                /*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
                 * redundant. Avoid the overhead of the atomic update.
                 */
                if (!memcg_should_flush_stats(memcg))
-                       atomic64_add(x, &memcg->vmstats->stats_updates);
-               __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+                       atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+               *stats_updates_percpu = 0;
        }
 }

@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
                free_mem_cgroup_per_node_info(memcg, node);
        kfree(memcg->vmstats);
        free_percpu(memcg->vmstats_percpu);
+       free_percpu(memcg->stats_updates_percpu);
        kfree(memcg);
 }

@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
        if (!memcg->vmstats_percpu)
                goto fail;

+       memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+                                                      GFP_KERNEL_ACCOUNT);
+       if (!memcg->stats_updates_percpu)
+               goto fail;
+
        for_each_node(node)
                if (alloc_mem_cgroup_per_node_info(memcg, node))
                        goto fail;

@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
        struct mem_cgroup *memcg = mem_cgroup_from_css(css);
        struct mem_cgroup *parent = parent_mem_cgroup(memcg);
        struct memcg_vmstats_percpu *statc;
+       unsigned int *stats_updates_percpu;
        long delta, delta_cpu, v;
        int i, nid;

        statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+       stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);

        for (i = 0; i < MEMCG_NR_STAT; i++) {
                /*
@@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
                        }
                }
        }
-       statc->stats_updates = 0;
+       *stats_updates_percpu = 0;
        /* We are in a per-cpu loop here, only do the atomic write once */
-       if (atomic64_read(&memcg->vmstats->stats_updates))
-               atomic64_set(&memcg->vmstats->stats_updates, 0);
+       if (atomic64_read(&memcg->stats_updates))
+               atomic64_set(&memcg->stats_updates, 0);
 }

 #ifdef CONFIG_MMU
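For reference, the batching scheme that memcg_rstat_updated() implements
above can be demonstrated outside the kernel. A minimal pthreads sketch,
under my own stand-in assumptions (one thread per "CPU", a thread-local
counter in place of the per-CPU variable):

/*
 * Each updater accumulates into a private counter and folds it into the
 * shared atomic only once MEMCG_CHARGE_BATCH updates have built up, so
 * the contended atomic is touched once per batch instead of once per
 * stat update. Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MEMCG_CHARGE_BATCH 64
#define NTHREADS 8
#define UPDATES_PER_THREAD 1000000

static atomic_llong stats_updates;      /* shared, contended */

static void *updater(void *arg)
{
        unsigned int pending = 0;       /* stand-in for the per-CPU counter */
        (void)arg;

        for (long i = 0; i < UPDATES_PER_THREAD; i++) {
                pending += 1;           /* abs(val) in the kernel code */
                if (pending < MEMCG_CHARGE_BATCH)
                        continue;
                /* one atomic RMW per batch instead of per update */
                atomic_fetch_add(&stats_updates, pending);
                pending = 0;
        }
        /* any residue below the threshold stays unfolded, as in the kernel */
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, updater, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);

        printf("folded into the shared atomic: %lld\n",
               (long long)atomic_load(&stats_updates));
        return 0;
}

With 8 threads doing 1M updates each, the shared counter still ends at
8,000,000, but each thread performs only 15,625 atomic adds (1M / 64)
instead of 1M.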