References: <20230913073846.1528938-1-yosryahmed@google.com> <20230913073846.1528938-4-yosryahmed@google.com>
From: Yosry Ahmed
Date: Thu, 14 Sep 2023 10:23:31 -0700
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
To: Waiman Long
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 14, 2023 at 10:19 AM Waiman Long wrote:
>
> On 9/13/23 03:38, Yosry Ahmed wrote:
> > Stats flushing for memcg currently follows the following rules:
> > - Always flush the entire memcg hierarchy (i.e. flush the root).
> > - Only one flusher is allowed at a time. If someone else tries to flush
> >   concurrently, they skip and return immediately.
> > - A periodic flusher flushes all the stats every 2 seconds.
> >
> > The reason this approach is followed is because all flushes are
> > serialized by a global rstat spinlock. On the memcg side, flushing is
> > invoked from userspace reads as well as in-kernel flushers (e.g.
> > reclaim, refault, etc). This approach aims to avoid serializing all
> > flushers on the global lock, which can cause a significant performance
> > hit under high concurrency.
> >
> > This approach has the following problems:
> > - Occasionally a userspace read of the stats of a non-root cgroup will
> >   be too expensive as it has to flush the entire hierarchy [1].
> > - Sometimes the stats accuracy are compromised if there is an ongoing
> >   flush, and we skip and return before the subtree of interest is
> >   actually flushed. This is more visible when reading stats from
> >   userspace, but can also affect in-kernel flushers.
> >
> > This patch aims to solve both problems by reworking how flushing
> > currently works as follows:
> > - Without contention, there is no need to flush the entire tree. In this
> >   case, only flush the subtree of interest. This avoids the latency of a
> >   full root flush if unnecessary.
> > - With contention, fallback to a coalesced (aka unified) flush of the
> >   entire hierarchy, a root flush. In this case, instead of returning
> >   immediately if a root flush is ongoing, wait for it to finish
> >   *without* attempting to acquire the lock or flush. This is done using
> >   a completion. Compared to competing directly on the underlying lock,
> >   this approach makes concurrent flushing a synchronization point
> >   instead of a serialization point. Once a root flush finishes, *all*
> >   waiters can wake up and continue at once.
> > - Finally, with very high contention, bound the number of waiters to the
> >   number of online cpus. This keeps the flush latency bounded at the tail
> >   (very high concurrency). We fallback to sacrificing stats freshness only
> >   in such cases in favor of performance.
> >
> > This was tested in two ways on a machine with 384 cpus:
> > - A synthetic test with 5000 concurrent workers doing allocations and
> >   reclaim, as well as 1000 readers for memory.stat (variation of [2]).
> >   No significant regressions were noticed in the total runtime.
> >   Note that if concurrent flushers compete directly on the spinlock
> >   instead of waiting for a completion, this test shows 2x-3x slowdowns.
> >   Even though subsequent flushers would have nothing to flush, just the
> >   serialization and lock contention is a major problem. Using a
> >   completion for synchronization instead seems to overcome this problem.
> >
> > - A synthetic stress test for concurrently reading memcg stats provided
> >   by Wei Xu.
> >   With 10k threads reading the stats every 100ms:
> >   - 98.8% of reads take <100us
> >   - 1.09% of reads take 100us to 1ms.
> >   - 0.11% of reads take 1ms to 10ms.
> >   - Almost no reads take more than 10ms.
> >   With 10k threads reading the stats every 10ms:
> >   - 82.3% of reads take <100us.
> >   - 4.2% of reads take 100us to 1ms.
> >   - 4.7% of reads take 1ms to 10ms.
> >   - 8.8% of reads take 10ms to 100ms.
> >   - Almost no reads take more than 100ms.
> >
> > [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
> >
> > [weixugc@google.com: suggested the fallback logic and bounding the
> > number of waiters]
> >
> > Signed-off-by: Yosry Ahmed
> > ---
> >  include/linux/memcontrol.h |   4 +-
> >  mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
> >  mm/vmscan.c                |   2 +-
> >  mm/workingset.c            |   8 ++-
> >  4 files changed, 85 insertions(+), 29 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 11810a2cfd2d..4453cd3fc4b8 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >  	return x;
> >  }
> >
> > -void mem_cgroup_flush_stats(void);
> > +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
> >  void mem_cgroup_flush_stats_ratelimited(void);
> >
> >  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> > @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >  	return node_page_state(lruvec_pgdat(lruvec), idx);
> >  }
> >
> > -static inline void mem_cgroup_flush_stats(void)
> > +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> >  {
> >  }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d729870505f1..edff41e4b4e7 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> >  /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
> >  static atomic_t stats_updates_order = ATOMIC_INIT(0);
> >  static u64 flush_last_time;
> > @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >  	}
> >  }
> >
> > -static void do_flush_stats(void)
> > +/*
> > + * do_flush_stats - flush the statistics of a memory cgroup and its tree
> > + * @memcg: the memory cgroup to flush
> > + * @wait: wait for an ongoing root flush to complete before returning
> > + *
> > + * All flushes are serialized by the underlying rstat global lock. If there is
> > + * no contention, we try to only flush the subtree of the passed @memcg to
> > + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
> > + * single flush of the root memcg. When there is an ongoing root flush, we wait
> > + * for its completion (unless otherwise requested), to get fresh stats. If the
> > + * number of waiters exceeds the number of cpus just skip the flush to bound the
> > + * flush latency at the tail with very high concurrency.
> > + *
> > + * This is a trade-off between stats accuracy and flush latency.
> > + */
> > +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
> >  {
> > +	static DECLARE_COMPLETION(root_flush_done);
> > +	static DEFINE_SPINLOCK(root_flusher_lock);
> > +	static DEFINE_MUTEX(subtree_flush_mutex);
> > +	static atomic_t waiters = ATOMIC_INIT(0);
> > +	static bool root_flush_ongoing;
> > +	bool root_flusher = false;
> > +
> > +	/* Ongoing root flush, just wait for it (unless otherwise requested) */
> > +	if (READ_ONCE(root_flush_ongoing))
> > +		goto root_flush_or_wait;
> > +
> >  	/*
> > -	 * We always flush the entire tree, so concurrent flushers can just
> > -	 * skip. This avoids a thundering herd problem on the rstat global lock
> > -	 * from memcg flushers (e.g. reclaim, refault, etc).
> > +	 * Opportunistically try to only flush the requested subtree. Otherwise
> > +	 * fallback to a coalesced flush below.
> >  	 */
> > -	if (atomic_read(&stats_flush_ongoing) ||
> > -	    atomic_xchg(&stats_flush_ongoing, 1))
> > +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> > +		cgroup_rstat_flush(memcg->css.cgroup);
> > +		mutex_unlock(&subtree_flush_mutex);
> >  		return;
> > +	}
>
> If mutex_trylock() is the only way to acquire subtree_flush_mutex, you
> don't really need a mutex. Just a simple integer flag with xchg() call
> should be enough.

Thanks for pointing this out. Agreed. If we keep this approach I will
drop that mutex.

>
> Cheers,
> Longman
>
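
For illustration only, here is a rough userspace sketch of the "integer
flag with xchg()" idea discussed above. It is not the kernel code and not
what a follow-up patch would necessarily look like: it assumes C11
<stdatomic.h> in place of the kernel's atomic_t/atomic_xchg(), and the
names subtree_flush_busy, subtree_flushes_done, subtree_flush_tryacquire(),
subtree_flush_release() and try_flush_subtree() are all hypothetical. The
counter increment merely stands in for cgroup_rstat_flush(). The
read-before-exchange check mirrors the atomic_read() || atomic_xchg()
pattern that the patch removes.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag standing in for subtree_flush_mutex: 0 = free, 1 = busy. */
static atomic_int subtree_flush_busy;

/* Hypothetical counter standing in for the work done by cgroup_rstat_flush(). */
static int subtree_flushes_done;

/* Try to take the flag without ever blocking; true means we now own it. */
static bool subtree_flush_tryacquire(void)
{
	/* Cheap read first, so contended callers avoid a needless write. */
	if (atomic_load_explicit(&subtree_flush_busy, memory_order_relaxed))
		return false;

	/*
	 * exchange returns the previous value: 0 means the flag was free and
	 * is ours now. Acquire ordering is an assumption of this sketch; the
	 * kernel's atomic_xchg() is fully ordered.
	 */
	return atomic_exchange_explicit(&subtree_flush_busy, 1,
					memory_order_acquire) == 0;
}

/* Drop the flag so that the next caller can flush a subtree. */
static void subtree_flush_release(void)
{
	atomic_store_explicit(&subtree_flush_busy, 0, memory_order_release);
}

/*
 * Opportunistic subtree flush. Returns false when another flusher holds
 * the flag, in which case the caller would fall back to the coalesced
 * root flush path.
 */
static bool try_flush_subtree(void)
{
	if (!subtree_flush_tryacquire())
		return false;

	subtree_flushes_done++;	/* placeholder for the real flush */
	subtree_flush_release();
	return true;
}
```

Since nobody ever sleeps on the flag, none of the mutex's waiter
machinery is needed, which seems to be the point of the suggestion: a
failed attempt simply takes the fallback path.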