MIME-Version: 1.0
References: <20220623185730.25b88096@kernel.org>
 <20220624070656.GE79500@shbuild999.sh.intel.com>
 <20220624144358.lqt2ffjdry6p5u4d@google.com>
 <20220625023642.GA40868@shbuild999.sh.intel.com>
 <20220627023812.GA29314@shbuild999.sh.intel.com>
 <20220627123415.GA32052@shbuild999.sh.intel.com>
 <20220627144822.GA20878@shbuild999.sh.intel.com>
In-Reply-To: <20220627144822.GA20878@shbuild999.sh.intel.com>
From: Eric Dumazet
Date: Mon, 27 Jun 2022 18:25:59 +0200
Subject: Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression
To: Feng Tang
Cc: Shakeel Butt, Linux MM, Andrew Morton, Roman Gushchin, Michal Hocko,
 Johannes Weiner, Muchun Song, Jakub Kicinski, Xin Long,
 Marcelo Ricardo Leitner, kernel test robot, Soheil Hassas Yeganeh, LKML,
 network dev, linux-s390@vger.kernel.org, MPTCP Upstream,
 "linux-sctp @ vger . kernel .
 org", lkp@lists.01.org, kbuild test robot, Huang Ying, Xing Zhengjun,
 Yin Fengwei, Ying Xu
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 27, 2022 at 4:48 PM Feng Tang wrote:
>
> On Mon, Jun 27, 2022 at 04:07:55PM +0200, Eric Dumazet wrote:
> > On Mon, Jun 27, 2022 at 2:34 PM Feng Tang wrote:
> > >
> > > On Mon, Jun 27, 2022 at 10:46:21AM +0200, Eric Dumazet wrote:
> > > > On Mon, Jun 27, 2022 at 4:38 AM Feng Tang wrote:
> > > [snip]
> > > > > > >
> > > > > > > Thanks Feng. Can you check the value of memory.kmem.tcp.max_usage_in_bytes
> > > > > > > in /sys/fs/cgroup/memory/system.slice/lkp-bootstrap.service after making
> > > > > > > sure that the netperf test has already run?
> > > > > >
> > > > > > memory.kmem.tcp.max_usage_in_bytes:0
> > > > >
> > > > > Sorry, I made a mistake: in the original report from Oliver, it
> > > > > was 'cgroup v2' with a 'debian-11.1' rootfs.
> > > > >
> > > > > When you asked about cgroup info, I tried the job on another tbox, and
> > > > > the original 'job.yaml' didn't work, so I kept the 'netperf' test
> > > > > parameters and started a new job which somehow ran with a 'debian-10.4'
> > > > > rootfs and actually ran with cgroup v1.
> > > > >
> > > > > And as you mentioned, the cgroup version does make a big difference:
> > > > > with v1, the regression is reduced to 1% ~ 5% on different generations
> > > > > of test platforms. Eric mentioned they also got a regression report,
> > > > > but a much smaller one; maybe it's due to the cgroup version?
> > > >
> > > > This was using the current net-next tree.
> > > > Used recipe was something like:
> > > >
> > > > Make sure cgroup2 is mounted or mount it by mount -t cgroup2 none $MOUNT_POINT.
> > > > Enable memory controller by echo +memory > $MOUNT_POINT/cgroup.subtree_control.
> > > > Create a cgroup by mkdir $MOUNT_POINT/job.
> > > > Jump into that cgroup by echo $$ > $MOUNT_POINT/job/cgroup.procs.
> > > >
> > > > The regression was smaller than 1%, so considered noise compared to
> > > > the benefits of the bug fix.
> > >
> > > Yes, 1% is just around noise level for a microbenchmark.
> > >
> > > I went to check the original test data of Oliver's report: the tests
> > > were run for 6 rounds and the performance data is pretty stable (0Day's
> > > report will show any std deviation bigger than 2%).
> > >
> > > The test platform is a 4-socket 72C/144T machine, and I ran the
> > > same job (nr_tasks = 25% * nr_cpus) on one CascadeLake AP (4 nodes)
> > > and one Icelake 2-socket platform, and saw 75% and 53% regression on
> > > them.
> > >
> > > In the first email, there is a file named 'reproduce'; it shows the
> > > basic test process:
> > >
> > > "
> > >   use 'performance' cpufreq governor for all CPUs
> > >
> > >   netserver -4 -D
> > >   modprobe sctp
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > >   netperf -4 -H 127.0.0.1 -t SCTP_STREAM_MANY -c -C -l 300 -- -m 10K &
> > >   (repeat 36 times in total)
> > >   ...
> > >
> > > "
> > >
> > > This starts 36 (25% of nr_cpus) netperf clients. The number of clients
> > > also matters: I tried to increase the client number from 36 to 72 (50%),
> > > and the regression changed from 69.4% to 73.7%.
> > >
> >
> > This seems like a lot of opportunities for memcg folks :)
> >
> > struct page_counter has poor field placement [1], and no per-cpu cache.
> >
> > [1] "atomic_long_t usage" is sharing cache line with read mostly fields.
> >
> > (struct mem_cgroup also has poor field placement, mainly because of
> > struct page_counter)
> >
> >     28.69%  [kernel]       [k] copy_user_enhanced_fast_string
> >     16.13%  [kernel]       [k] intel_idle_irq
> >      6.46%  [kernel]       [k] page_counter_try_charge
> >      6.20%  [kernel]       [k] __sk_mem_reduce_allocated
> >      5.68%  [kernel]       [k] try_charge_memcg
> >      5.16%  [kernel]       [k] page_counter_cancel
>
> Yes, I also analyzed the perf-profile data, and made some layout changes
> which could reduce the regression from 69% to 40%.
>
> 7c80b038d23e1f4c 4890b686f4088c90432149bd6de 332b589c49656a45881bca4ecc0
> ---------------- --------------------------- ---------------------------
>      15722           -69.5%       4792           -40.8%       9300        netperf.Throughput_Mbps
>
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 1bfcfb1af352..aa37bd39116c 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -179,14 +179,13 @@ struct cgroup_subsys_state {
>         atomic_t online_cnt;
>
>         /* percpu_ref killing and RCU release */
> -       struct work_struct destroy_work;
>         struct rcu_work destroy_rwork;
> -
> +       struct cgroup_subsys_state *parent;
> +       struct work_struct destroy_work;
>         /*
>          * PI: the parent css. Placed here for cache proximity to following
>          * fields of the containing structure.
>          */
> -       struct cgroup_subsys_state *parent;
>  };
>
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 9ecead1042b9..963b88ab9930 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -239,9 +239,6 @@ struct mem_cgroup {
>         /* Private memcg ID. Used to ID objects that outlive the cgroup */
>         struct mem_cgroup_id id;
>
> -       /* Accounted resources */
> -       struct page_counter memory;             /* Both v1 & v2 */
> -
>         union {
>                 struct page_counter swap;       /* v2 only */
>                 struct page_counter memsw;      /* v1 only */
> @@ -251,6 +248,9 @@ struct mem_cgroup {
>         struct page_counter kmem;               /* v1 only */
>         struct page_counter tcpmem;             /* v1 only */
>
> +       /* Accounted resources */
> +       struct page_counter memory;             /* Both v1 & v2 */
> +
>         /* Range enforcement for interrupt charges */
>         struct work_struct high_work;
>
> @@ -313,7 +313,6 @@ struct mem_cgroup {
>         atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
>         atomic_long_t memory_events_local[MEMCG_NR_MEMORY_EVENTS];
>
> -       unsigned long socket_pressure;
>
>         /* Legacy tcp memory accounting */
>         bool tcpmem_active;
> @@ -349,6 +348,7 @@ struct mem_cgroup {
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         struct deferred_split deferred_split_queue;
>  #endif
> +       unsigned long socket_pressure;
>
>         struct mem_cgroup_per_node *nodeinfo[];
>  };
>

I simply did the following and got much better results.

But I am not sure if updates to ->usage are really needed that often...

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 679591301994d316062f92b275efa2459a8349c9..e267be4ba849760117d9fd041e22c2a44658ab36 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -3,12 +3,15 @@
 #define _LINUX_PAGE_COUNTER_H

 #include
+#include
 #include
 #include

 struct page_counter {
-       atomic_long_t usage;
-       unsigned long min;
+       /* contended cache line. */
+       atomic_long_t usage ____cacheline_aligned_in_smp;
+
+       unsigned long min ____cacheline_aligned_in_smp;
        unsigned long low;
        unsigned long high;
        unsigned long max;
@@ -27,12 +30,6 @@ struct page_counter {
        unsigned long watermark;
        unsigned long failcnt;

-       /*
-        * 'parent' is placed here to be far from 'usage' to reduce
-        * cache false sharing, as 'usage' is written mostly while
-        * parent is frequently read for cgroup's hierarchical
-        * counting nature.
-        */
        struct page_counter *parent;
 };

> And some of these are specific for network and may not be a universal
> win, though I think the 'cgroup_subsys_state' could keep the
> read-mostly 'parent' away from following written-mostly counters.
>
> Btw, I tried your debug patch, which failed to compile with 0Day's kbuild
> system, but it did compile ok on my local machine.
>
> Thanks,
> Feng
>
> >
> > > Thanks,
> > > Feng
> > >
> > > > >
> > > > > Thanks,
> > > > > Feng