Received: by 2002:a05:7412:31a9:b0:e2:908c:2ebd with SMTP id et41csp4206360rdb; Thu, 14 Sep 2023 15:31:25 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFBvrb1SDBnTbG9qTtclnhIbC32vuDBLJZ1qQLvc1oHrMEY/UMJoWViIB0u/y74kAhNEmPg X-Received: by 2002:a05:6a00:a08:b0:68f:bb02:fdf with SMTP id p8-20020a056a000a0800b0068fbb020fdfmr7738263pfh.27.1694730685003; Thu, 14 Sep 2023 15:31:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1694730684; cv=none; d=google.com; s=arc-20160816; b=kN9TxFgkmgWcwHs8oGL2DnIRZA+Ok5TTMCbDz7y7zsrDDbnVmwXQws17HIy2/wvXVd gfndfGyo5oyWpfOa0c8qIIhE7yMUzL+db86FnO/MwtR03h+/mjLLCpxksZlL16n9wEbV 0iDAGKKZcQab8Ky28481IFDve6l91TgTjj5JtKhoIumtFDL2J2bS9s+YXhl+LNtHtebQ OlStlVYVdaS/SsaEdcThms0k9Ap0pVT8bS6+DUwddD8ncQE7FPAK5xLRgic/kL+wIpuf tC/aAiOn9cSmvo+iOuzFKuI0lQxCWjHr/jQXQgrZoGj841Q7UnpdQnFfF9jFlXxCC4U2 XiLQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=8P8Iz0AXrz+yvXhSdzWefKmGB+E6NxTMaN+Ka9Zbzsc=; fh=hwD/jZK8wpxe/cPK0H3F31VhNOQajItgeiUqz5NiXt8=; b=EWb5rowK/iKv6b1RIKQsYxmjPUCHbmwMPWlXo23CePbjZfEdIzxA9I1zinQy0tMrLU T84Q10gi2Z2WZFywJfz28kjRnM6LRJUnmm5a9+kEo8gocxcsir1dfAC8R+uQwYb7ySWY NorFgSDR7EQMP9TYEx5Ie9ktiQOCrCekxP/+jT9yG3kCf41QmYyCNFD17+/nrPjTN1Y8 lrvJeUfCJT7ESWjLdGEejF41hdeqYfwHRH30crr9MyNgD8aK/0dDUp/MFbsn5MXpScp0 TaLUCs+HZsGHoe7X5HFd9RpIH6pKU33TfIKZM2HLNDU/EyBU6QrLwFlDdvEPqC90GG8j A2RQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=FAGDhbzU; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from agentk.vger.email (agentk.vger.email. [2620:137:e000::3:2]) by mx.google.com with ESMTPS id g17-20020a056a001a1100b0069023e4bca8si2285536pfv.214.2023.09.14.15.31.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 14 Sep 2023 15:31:24 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) client-ip=2620:137:e000::3:2; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=FAGDhbzU; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:2 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id 3A012829C322; Thu, 14 Sep 2023 10:24:12 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240484AbjINRYC (ORCPT + 99 others); Thu, 14 Sep 2023 13:24:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33624 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239969AbjINRXf (ORCPT ); Thu, 14 Sep 2023 13:23:35 -0400 Received: from mail-ej1-x62c.google.com (mail-ej1-x62c.google.com [IPv6:2a00:1450:4864:20::62c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9C7C230CA for ; Thu, 14 Sep 2023 10:22:59 -0700 (PDT) Received: by mail-ej1-x62c.google.com with SMTP id a640c23a62f3a-9ada2e6e75fso169110666b.2 for ; Thu, 14 Sep 2023 10:22:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1694712178; x=1695316978; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=8P8Iz0AXrz+yvXhSdzWefKmGB+E6NxTMaN+Ka9Zbzsc=; b=FAGDhbzUm8fYwBUzU8t1UQ1PrtOq8au6U1PIQ+YF0ED+DyavoVwTdTCRMnIM1dQFOx QeT54VR0nfEdA5VWV4i+KxrZJfTe9Eggji/WXACghUZGYdIbmA6kip9DuY0KF+xlRd5m UM/lfR/zTEh8DsyvUJsrRPIjx6MD2XOyKVPpUlECY2gw2j5c/EkrvZSXH962ISyZDhOq DHO71trPKGLQ8Fn2u7n4WUy1yFTkdk50+BCuXAtYACbTwe781fGtuu2fnNRxKkWvv8bt w51qPFYIALPc2BLuIZ99BeAnnn2ZLjCDTQJTCzPVa/YHzFqEN/nO+unH31KJuUVvpHqL HvqA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1694712178; x=1695316978; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8P8Iz0AXrz+yvXhSdzWefKmGB+E6NxTMaN+Ka9Zbzsc=; b=WnUbLzLjjDAQLc5pONr7LABEijWcZ7uyedfDyoHvsl6K0LHF6V6zp28A8EWAXJ/ijg fTKFx9a8Xm557SeEAqVfzKiHhvk+GRLaj+9f0mGLgIiGvI9hnKO0UnpYzW6nZaZvEnlY iOc/SKHvOhrbQnkkGyFyULdMh+L/caqw2/87+7kFldok0/oTnNraVxDcioQJM8s/pWzA 5yj55nqR4ZynBIsHTnTIMsGT5krxQ9jnPYnl3JBDf4zVYY/4agHTsSK0CDVxi5Jjdm38 TGvUv0xXByvPOy9NJWTddM0TTxUw4QFon1Tj/JJL9uPtOubxlsj+AUqjM/H0X2zf4w+U MuaQ== X-Gm-Message-State: AOJu0Yx0c9RlRU6l7dYytalkNuSnJUjPHNXzFM2BFs714h+17daaPOxC AxZGyqvv9+bO9e3pmzLSHo5NxyjQt/gRTYnliT6gDQ== X-Received: by 2002:a17:907:2c6a:b0:9a1:aea2:d18d with SMTP id ib10-20020a1709072c6a00b009a1aea2d18dmr4701892ejc.48.1694712177785; Thu, 14 Sep 2023 10:22:57 -0700 (PDT) MIME-Version: 1.0 References: <20230913073846.1528938-1-yosryahmed@google.com> <20230913073846.1528938-4-yosryahmed@google.com> <20230913153758.GB45543@cmpxchg.org> <20230914160646.GA101038@cmpxchg.org> In-Reply-To: <20230914160646.GA101038@cmpxchg.org> From: Yosry Ahmed Date: Thu, 14 Sep 2023 10:22:18 -0700 Message-ID: Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , Roman Gushchin , Shakeel Butt , Muchun Song , Ivan Babrou , Tejun Heo , =?UTF-8?Q?Michal_Koutn=C3=BD?= , Waiman Long , kernel-team@cloudflare.com, Wei Xu , Greg Thelen , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Thu, 14 Sep 2023 10:24:12 -0700 (PDT) X-Spam-Status: No, score=-8.4 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email [..] > > > > - > > > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup); > > > > + /* A coalesced root flush is in order. Are we the designated = flusher? */ > > > > + spin_lock(&root_flusher_lock); > > > > > > I can't say I'm crazy about this. > > > > > > Let's say you have a fairly sprawling and active cgroup tree with lot= s > > > of updates in lots of groups in the system. > > > > > > Add a periodic memory.stat reader to one of the subgroups, and that > > > reader will do very fast, localized aggregation. > > > > > > Now add a periodic memory.stat reader to some other subgroup. They > > > might still both do very fast, localized aggregation. Or they might > > > collide; and now - despite having nothing in common, and sharing no > > > data besides the rstat lock - will have to flush the entire massive > > > tree. The rate at which this happens is purely dependent on timing of > > > what should be independent actors. This is really not great for the > > > p50 vs p99 gap. > > > > Initially I had a few retry iterations for the subtree flush, where we > > fsleep for a bit and trylock again. I thought it was a little bit too > > complicated and the fsleep duration would be a magic value. > > Hm, how is that different than a mutex / sleepable lock? It is essentially a lock with a timeout, which IIUC we don't support explicitly in the kernel. What I was trying to do was basically to try and do a subtree flush, but if we are waiting for too long then a lot of people are probably flushing, so let's all switch to a root flush and wait for one flusher instead of contending among ourselves. > > > > I think this whole thing is getting away from us. > > > > > > When we first switched to rstat, every reader would take the global > > > rstat lock and then flush its local subtree of interest. > > > > > > This was changed in the following commit: > > > > > > commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf > > > Author: Shakeel Butt > > > Date: Fri Nov 5 13:37:34 2021 -0700 > > > > > > memcg: unify memcg stat flushing > > > > > > The memcg stats can be flushed in multiple context and potentiall= y in > > > parallel too. For example multiple parallel user space readers f= or > > > memcg stats will contend on the rstat locks with each other. The= re is > > > no need for that. We just need one flusher and everyone else can > > > benefit. > > > > > > In addition after aa48e47e3906 ("memcg: infrastructure to flush m= emcg > > > stats") the kernel periodically flush the memcg stats from the ro= ot, so, > > > the other flushers will potentially have much less work to do. > > > > > > Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@g= oogle.com > > > Signed-off-by: Shakeel Butt > > > Acked-by: Johannes Weiner > > > Cc: Michal Hocko > > > Cc: "Michal Koutn=C3=BD" > > > Signed-off-by: Andrew Morton > > > Signed-off-by: Linus Torvalds > > > > > > The idea was that we can avoid lock contention if somebody is already > > > doing the flushing. However, you're now bringing global serialization= . > > > Clearly that idea didn't work out. What would be an obstacle to go > > > back to the original way of doing it? > > > > The obstacle is high concurrency among flushers. A good example is > > reclaim code, we can have a lot of concurrent reclaimers. When I tried > > to go back to the original way of doing it, a stress test with high > > reclaim concurrency (100s or 1000s) would be 2x-3x slower. I think > > high concurrency among userspace reads would have a similar outcome, > > but I hadn't really checked. > > > > Basically this patch is trying to switch to root flushing when the > > cost of contending on the lock is roughly the same as a root flush (or > > waiting for one). It's doing that too eagerly now of course (if > > contenders > 1), we can try to calibrate this better. > > I don't quite understand this. > > If you have two readers on separate subtrees, why is it cheaper to > flush the entire tree in sequence (where both readers don't care about > the majority of the data), than having each reader flush their own > small subset one after another? It's not like the whole tree flush > parallelizes its work. > > Where is that overhead actually coming from? If you have N concurrent readers flushing different parts of the subtree, at some point the overhead of N sequential subtree flushes will exceed the overhead of a single root flush. Ideally, we would be able to identify this point, and switch to a single root flush with everyone else waiting. > > > > With one reader, this will work the same as in your proposal. > > > > > > With two readers, just like in your proposal, flushing must be > > > serialized against the root level. But at least the two flushes only > > > aggregate the local data they actually care about - not the entire > > > tree data that doesn't even have readers! This is much better for loc= k > > > contention and performance. > > > > Keep in mind that in my testing, I noticed that synchronization using > > a completion is more performant than serialization on a lock. I am > > assuming because when we contend on the underlying lock, we serially > > wake up and do the flush. Even if there is nothing to do (as you > > mention below), we still do this work. On the contrary, in this > > proposal we just wait for the root flush to finish and return > > immediately, and all waiters return at once (that's a lie because of > > scheduling internals). > > Right, because rstat's do-nothing case is still somewhat costly due to > the per-cpu loop and the tree walk. > > But that should be possible to avoid with the outlined caching of > recently-flushed state at the memcg level. This helps only if concurrent flushers are overlapping, right? > > > Also, in the current code, in the two reader scenario, the first > > reader will flush the entire tree anyway. The only difference is that > > the second reader will wait for it to finish instead of just skipping, > > which is the right thing to do from a correctness point of view. Right > > now it's a coin flip on whether you get updated stats if someone else > > is already flushing. > > Agreed, it should wait. My mutex would accomplish this, right? I think what you're describing here is v4 of the patchset I mention in the cover letter: https://lore.kernel.org/lkml/20230831165611.2610118-5-yosryahmed@google.com= / The problem with that was that with very high concurrency among readers, the read latency is unbounded and can get out of hand. Wei shared some numbers in that thread. What this patch is trying to do is to switch to root flushing if the mutex is contended to avoid that scenario. Also, if we keep acquiring more flushers at some point we just skip the flush in favor of performance (if the number of waiters exceeds the number of cpus). Now that I think about it, perhaps we can just go back to the mutex approach w/ limiting the number of waiters, without ever falling back to a root flush. Not sure how that would work. Taking a step back, I think what you are implying is that if we make the thresholding code per cpu instead of global, this should mitigate the issue in a different way than falling back to root flushing or limiting the number of waiters, is that right? If yes, I think it will work in some cases where the flushes are overlapping as I mentioned above, but not if concurrent readers are flushing different parts of the tree. Right? > > > > One concern is the thresholding code. The cost of flushing on every > > > read is too high: even when there is no flush work, checking for it i= s > > > kind of expensive due to the locks and the for_each_possible_cpu(). > > > > > > Could we do something like the following? > > > > > > mem_cgroup_flush(memcg) > > > { > > > mutex_lock(&memcg_flush_mutex); > > > if (atomic_read(&memcg->stat_delta) > THRESHOLD) > > > cgroup_rstat_flush(memcg); > > > mutex_unlock(&memcg_flush_mutex); > > > } > > > > > > mem_cgroup_css_rstat_flush(css, cpu) > > > { > > > ... > > > /* > > > * Reset here instead of mem_cgroup_flush() > > > * so that each flushed subgroup is reset > > > * recursively. A recent parent flush will > > > * allow a child flush to skip. > > > */ > > > atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0); > > > } > > > > > > memcg_rstat_updated(memcg, val) > > > { > > > atomic_add(&memcg->stat_delta, val); > > > > We need to do this for each parent in the hierarchy, not just the > > memcg being updated, right? I guess that may be too expensive for the > > update path. > > How so? > > We need to mark the subgroups that are flushed, so that if you have > > root - a - a1 - foo > `- a2 > > and somebody flushes a, then a1 and a2 also don't need to be flushed > for a while. > > But if we flush a1 first, foo is aggregated into a1. Reading a still > needs to aggregate a1 and a2 into a. > > Maybe I'm missing something blatant, but I don't see how we have to > mark parents in any way. We only have to tag cgroups up to the point > to which they were recently flushed, and we already visit all those. > > Let me know if I'm missing something blatant here. I think we are talking about different paths. I believe you are talking about resetting memcg->stat_delta, which I agree should be done during the flush. What I am talking about is updating memcg->stat_delta when memcg_rstat_updated() is called. We would need to update that for all parents. In your example hierarchy, if stat updates happened in a2, and someone tries to flush a, they should be aware that there are stats to flush.