From: Yosry Ahmed
Date: Thu, 14 Sep 2023 10:26:13 -0700
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
    Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
    kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20230913073846.1528938-1-yosryahmed@google.com>
    <20230913073846.1528938-4-yosryahmed@google.com>
    <20230913153758.GB45543@cmpxchg.org>
    <20230914160646.GA101038@cmpxchg.org>

On Thu, Sep 14, 2023 at 10:22 AM Yosry Ahmed wrote:
>
> [..]
> > > > > -
> > > > > -       cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> > > > > +       /* A coalesced root flush is in order. Are we the designated flusher? */
> > > > > +       spin_lock(&root_flusher_lock);
> > > >
> > > > I can't say I'm crazy about this.
> > > >
> > > > Let's say you have a fairly sprawling and active cgroup tree with
> > > > lots of updates in lots of groups in the system.
> > > >
> > > > Add a periodic memory.stat reader to one of the subgroups, and that
> > > > reader will do very fast, localized aggregation.
> > > >
> > > > Now add a periodic memory.stat reader to some other subgroup. They
> > > > might still both do very fast, localized aggregation. Or they might
> > > > collide; and now - despite having nothing in common, and sharing no
> > > > data besides the rstat lock - they will have to flush the entire
> > > > massive tree. The rate at which this happens depends purely on the
> > > > timing of what should be independent actors. This is really not
> > > > great for the p50 vs p99 gap.
> > >
> > > Initially I had a few retry iterations for the subtree flush, where we
> > > would fsleep for a bit and trylock again. I thought it was a little
> > > too complicated, and the fsleep duration would be a magic value.
> >
> > Hm, how is that different from a mutex / sleepable lock?
>
> It is essentially a lock with a timeout, which IIUC we don't support
> explicitly in the kernel. What I was trying to do was basically to
> attempt a subtree flush, but if we are waiting for too long then a lot
> of people are probably flushing, so let's all switch to a root flush
> and wait for one flusher instead of contending among ourselves.
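
For concreteness, the retry-then-fallback idea described above might have
looked roughly like the sketch below. This is illustrative only:
flush_mutex, the retry count, the sleep duration, and
do_coalesced_root_flush() are hypothetical stand-ins, not code from the
actual patches.

        static DEFINE_MUTEX(flush_mutex);       /* stand-in for the contended lock */

        static void flush_subtree_with_timeout(struct mem_cgroup *memcg)
        {
                int retries = 3;                /* magic value */

                while (retries--) {
                        if (mutex_trylock(&flush_mutex)) {
                                /* Uncontended enough: cheap, localized flush. */
                                cgroup_rstat_flush(memcg->css.cgroup);
                                mutex_unlock(&flush_mutex);
                                return;
                        }
                        fsleep(1000);           /* magic value: back off 1ms */
                }
                /* Contended for too long: many flushers, coalesce into one. */
                do_coalesced_root_flush();
        }
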
> > > > I think this whole thing is getting away from us.
> > > >
> > > > When we first switched to rstat, every reader would take the global
> > > > rstat lock and then flush its local subtree of interest.
> > > >
> > > > This was changed in the following commit:
> > > >
> > > > commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf
> > > > Author: Shakeel Butt
> > > > Date:   Fri Nov 5 13:37:34 2021 -0700
> > > >
> > > >     memcg: unify memcg stat flushing
> > > >
> > > >     The memcg stats can be flushed in multiple contexts and
> > > >     potentially in parallel too. For example, multiple parallel user
> > > >     space readers for memcg stats will contend on the rstat locks
> > > >     with each other. There is no need for that. We just need one
> > > >     flusher and everyone else can benefit.
> > > >
> > > >     In addition, after aa48e47e3906 ("memcg: infrastructure to flush
> > > >     memcg stats") the kernel periodically flushes the memcg stats
> > > >     from the root, so the other flushers will potentially have much
> > > >     less work to do.
> > > >
> > > >     Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
> > > >     Signed-off-by: Shakeel Butt
> > > >     Acked-by: Johannes Weiner
> > > >     Cc: Michal Hocko
> > > >     Cc: "Michal Koutný"
> > > >     Signed-off-by: Andrew Morton
> > > >     Signed-off-by: Linus Torvalds
> > > >
> > > > The idea was that we can avoid lock contention if somebody is
> > > > already doing the flushing. However, you're now bringing in global
> > > > serialization. Clearly that idea didn't work out. What would be an
> > > > obstacle to going back to the original way of doing it?
> > >
> > > The obstacle is high concurrency among flushers. A good example is the
> > > reclaim code; we can have a lot of concurrent reclaimers. When I tried
> > > to go back to the original way of doing it, a stress test with high
> > > reclaim concurrency (100s or 1000s of reclaimers) was 2x-3x slower. I
> > > think high concurrency among userspace reads would have a similar
> > > outcome, but I haven't really checked.
> > >
> > > Basically this patch is trying to switch to root flushing when the
> > > cost of contending on the lock is roughly the same as that of a root
> > > flush (or of waiting for one). It's doing that too eagerly now of
> > > course (if contenders > 1); we can try to calibrate this better.
> >
> > I don't quite understand this.
> >
> > If you have two readers on separate subtrees, why is it cheaper to
> > flush the entire tree in sequence (where both readers don't care about
> > the majority of the data) than to have each reader flush their own
> > small subset one after another? It's not like the whole-tree flush
> > parallelizes its work.
> >
> > Where is that overhead actually coming from?
>
> If you have N concurrent readers flushing different parts of the
> subtree, at some point the overhead of N sequential subtree flushes
> will exceed the overhead of a single root flush.
>
> Ideally, we would be able to identify this point and switch to a
> single root flush, with everyone else waiting.
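
For reference, "a single root flush with everyone else waiting" can be
sketched with a spinlock plus a completion, along the lines below. This is
again a sketch with hypothetical names, not the literal patch; the point it
illustrates is that complete_all() releases all waiters at once rather than
waking them serially the way a contended lock does.

        static DEFINE_SPINLOCK(root_flusher_lock);
        static DECLARE_COMPLETION(root_flush_done);
        static bool root_flush_ongoing;

        static void coalesced_root_flush(void)
        {
                bool designated = false;

                spin_lock(&root_flusher_lock);
                if (!root_flush_ongoing) {
                        /* No flush in flight; we become the designated flusher. */
                        root_flush_ongoing = true;
                        reinit_completion(&root_flush_done);
                        designated = true;
                }
                spin_unlock(&root_flusher_lock);

                if (!designated) {
                        /* Piggyback on the ongoing flush instead of contending. */
                        wait_for_completion(&root_flush_done);
                        return;
                }

                cgroup_rstat_flush(root_mem_cgroup->css.cgroup);

                /* Wake all waiters of this round at once, then open a new round. */
                complete_all(&root_flush_done);

                spin_lock(&root_flusher_lock);
                root_flush_ongoing = false;
                spin_unlock(&root_flusher_lock);
        }
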
> > > > With one reader, this will work the same as in your proposal.
> > > >
> > > > With two readers, just like in your proposal, flushing must be
> > > > serialized against the root level. But at least the two flushes only
> > > > aggregate the local data they actually care about - not the entire
> > > > tree data that doesn't even have readers! This is much better for
> > > > lock contention and performance.
> > >
> > > Keep in mind that in my testing, I noticed that synchronization using
> > > a completion is more performant than serialization on a lock. I assume
> > > that is because when we contend on the underlying lock, we serially
> > > wake up and do the flush. Even if there is nothing to do (as you
> > > mention below), we still do this work. On the contrary, in this
> > > proposal we just wait for the root flush to finish and return
> > > immediately, and all waiters return at once (that's a lie because of
> > > scheduling internals).
> >
> > Right, because rstat's do-nothing case is still somewhat costly due to
> > the per-cpu loop and the tree walk.
> >
> > But that should be possible to avoid with the outlined caching of
> > recently-flushed state at the memcg level.
>
> This helps only if concurrent flushers are overlapping, right?
>
> > > Also, in the current code, in the two-reader scenario, the first
> > > reader will flush the entire tree anyway. The only difference is that
> > > the second reader will wait for it to finish instead of just skipping,
> > > which is the right thing to do from a correctness point of view. Right
> > > now it's a coin flip whether you get updated stats if someone else is
> > > already flushing.
> >
> > Agreed, it should wait. My mutex would accomplish this, right?
>
> I think what you're describing here is v4 of the patchset I mentioned
> in the cover letter:
> https://lore.kernel.org/lkml/20230831165611.2610118-5-yosryahmed@google.com/
>
> The problem with that was that with very high concurrency among
> readers, the read latency is unbounded and can get out of hand. Wei
> shared some numbers in that thread.
>
> What this patch is trying to do is to switch to root flushing if the
> mutex is contended, to avoid that scenario. Also, if we keep acquiring
> more flushers, at some point we just skip the flush in favor of
> performance (if the number of waiters exceeds the number of cpus). Now
> that I think about it, perhaps we can just go back to the mutex
> approach while limiting the number of waiters, without ever falling
> back to a root flush. Not sure how that would work.
>
> Taking a step back, I think what you are implying is that if we make
> the thresholding code per cpu instead of global, this should mitigate

per cgroup*

> the issue in a different way than falling back to root flushing or
> limiting the number of waiters, is that right?
> If yes, I think it will work in some cases where the flushes are
> overlapping, as I mentioned above, but not if concurrent readers are
> flushing different parts of the tree. Right?
>
> > > > One concern is the thresholding code. The cost of flushing on every
> > > > read is too high: even when there is no flush work, checking for it
> > > > is kind of expensive due to the locks and the
> > > > for_each_possible_cpu().
> > > >
> > > > Could we do something like the following?
> > > >
> > > > mem_cgroup_flush(memcg)
> > > > {
> > > >         mutex_lock(&memcg_flush_mutex);
> > > >         if (atomic_read(&memcg->stat_delta) > THRESHOLD)
> > > >                 cgroup_rstat_flush(memcg);
> > > >         mutex_unlock(&memcg_flush_mutex);
> > > > }
> > > >
> > > > mem_cgroup_css_rstat_flush(css, cpu)
> > > > {
> > > >         ...
> > > >         /*
> > > >          * Reset here instead of mem_cgroup_flush()
> > > >          * so that each flushed subgroup is reset
> > > >          * recursively. A recent parent flush will
> > > >          * allow a child flush to skip.
> > > >          */
> > > >         atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
> > > > }
> > > >
> > > > memcg_rstat_updated(memcg, val)
> > > > {
> > > >         atomic_add(&memcg->stat_delta, val);
> > >
> > > We need to do this for each parent in the hierarchy, not just the
> > > memcg being updated, right? I guess that may be too expensive for the
> > > update path.
> >
> > How so?
> >
> > We need to mark the subgroups that are flushed, so that if you have
> >
> > root - a - a1 - foo
> >         `- a2
> >
> > and somebody flushes a, then a1 and a2 also don't need to be flushed
> > for a while.
> >
> > But if we flush a1 first, foo is aggregated into a1. Reading a still
> > needs to aggregate a1 and a2 into a.
> >
> > Maybe I'm missing something blatant, but I don't see how we have to
> > mark parents in any way. We only have to tag cgroups up to the point
> > to which they were recently flushed, and we already visit all those.
> >
> > Let me know if I'm missing something blatant here.
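
Filling in the sketch quoted above with concrete types may make it easier
to read; stat_delta (a new atomic_t in struct mem_cgroup) and THRESHOLD are
hypothetical:

        static DEFINE_MUTEX(memcg_flush_mutex);

        static void mem_cgroup_flush(struct mem_cgroup *memcg)
        {
                mutex_lock(&memcg_flush_mutex);
                /* Skip the per-cpu walk entirely when little changed here. */
                if (atomic_read(&memcg->stat_delta) > THRESHOLD)
                        cgroup_rstat_flush(memcg->css.cgroup);
                mutex_unlock(&memcg_flush_mutex);
        }

        static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css,
                                               int cpu)
        {
                /* ... aggregate this css's per-cpu deltas ... */

                /*
                 * Reset here rather than in mem_cgroup_flush() so that
                 * every subgroup visited by the flush is reset
                 * recursively; a recent parent flush then lets a child
                 * flush skip.
                 */
                atomic_set(&mem_cgroup_from_css(css)->stat_delta, 0);
        }
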
>
> I think we are talking about different paths. I believe you are
> talking about resetting memcg->stat_delta, which I agree should be
> done during the flush. What I am talking about is updating
> memcg->stat_delta when memcg_rstat_updated() is called. We would need
> to update that for all parents. In your example hierarchy, if stat
> updates happen in a2 and someone then tries to flush a, whoever
> flushes a should be aware that there are stats to flush.
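
A minimal sketch of that parent update, reusing the hypothetical stat_delta
counter from the earlier sketch; the concern above is precisely that this
loop runs on the hot update path:

        static void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
        {
                /*
                 * Propagate the pending-update magnitude to every ancestor,
                 * so that a later flush of any parent (e.g. a) can see that
                 * work is pending from a descendant (e.g. a2).
                 */
                for (; memcg; memcg = parent_mem_cgroup(memcg))
                        atomic_add(abs(val), &memcg->stat_delta);
        }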