From: Yosry Ahmed
Date: Wed, 9 Aug 2023 06:13:05 -0700
Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Andrew Morton,
 Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20230809045810.1659356-1-yosryahmed@google.com>

On Wed, Aug 9, 2023 at 5:58 AM Michal Hocko wrote:
>
> On Wed 09-08-23 05:31:04, Yosry Ahmed wrote:
> > On Wed, Aug 9, 2023 at 1:51 AM Michal Hocko wrote:
> > >
> > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote:
> > > > Over time, the memcg code added multiple optimizations to the stats
> > > > flushing path that introduce a tradeoff between accuracy and
> > > > performance. In some contexts (e.g. dirty throttling, refaults,
> > > > etc.), a full rstat flush of the stats in the tree can be too
> > > > expensive. Such optimizations include [1]:
> > > > (a) Introducing a periodic background flusher to keep the size of the
> > > > update tree from growing unbounded.
> > > > (b) Allowing only one thread to flush at a time; other concurrent
> > > > flushers just skip the flush. This avoids a thundering herd problem
> > > > when multiple reclaim/refault threads attempt to flush the stats at
> > > > once.
> > > > (c) Only executing a flush if the magnitude of the stats updates
> > > > exceeds a certain threshold.
> > > >
> > > > These optimizations were necessary to make flushing feasible in
> > > > performance-critical paths, and they come at the cost of some
> > > > accuracy that we choose to live without. On the other hand, for
> > > > flushes invoked when userspace is reading the stats, the tradeoff is
> > > > less appealing. This code path is not performance-critical, and the
> > > > inaccuracies can affect userspace behavior. For example, skipping
> > > > flushing when there is another ongoing flush is essentially a coin
> > > > flip. We don't know if the ongoing flush is done with the subtree of
> > > > interest or not.
> > >
> > > I am not convinced by this much, TBH. What kind of precision do you
> > > really need, and how far off is what we provide?
> > >
> > > A more expensive read of stats from userspace is quite easy to notice
> > > and is usually reported as a regression. So you should have a
> > > convincing argument that the extra time spent is really worth it.
> > > AFAIK there are many monitoring (top-like) tools which simply read
> > > those files regularly just to show numbers, and they certainly do not
> > > need a high level of precision.
> >
> > We used to spend this time before commit fd25a9e0e23b ("memcg: unify
> > memcg stat flushing"), which generalized the "skip if ongoing flush"
> > behavior for all stat flushing. As far as I know, the problem was
> > contention on the flushing lock, which also affected critical paths
> > like refault.
> >
> > The problem is that the current behavior is nondeterministic: if cpu A
> > tries to flush stats and cpu B is already doing that, cpu A will just
> > skip. At that point, the cgroup(s) that cpu A cares about may have
> > been fully flushed, partially flushed (in terms of cpus), or not
> > flushed at all. We have no idea. We just know that someone else is
> > flushing something. IOW, in some cases the flush request will be
> > completely ignored and userspace will read stale stats (up to 2s plus
> > the periodic flusher runtime).
>
> Yes, that is certainly true, but why does that matter? Stats are always
> a snapshot of the past. Do we get an inconsistent image that would be
> actively harmful?

That can very well be the case, because we may be in a state where some
cpus are flushed and some aren't. Also, sometimes a few seconds is too
old. We have some workloads that read the stats every 1-2 seconds to
keep a fresh state, and they certainly do not expect stats to be 2+
seconds old when they read them.
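For reference, the skip in question is the pattern below (a simplified
sketch; details of the real do_flush_stats() are elided):

/*
 * Sketch of the current skip-if-ongoing behavior: whoever loses the
 * xchg race returns immediately, with no idea whether the winner has
 * already flushed the subtree this reader cares about.
 */
static void do_flush_stats(void)
{
	if (atomic_read(&stats_flush_ongoing) ||
	    atomic_xchg(&stats_flush_ongoing, 1))
		return;		/* cpu A bails here while cpu B is mid-flush */

	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_ongoing, 0);
}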
> > Some workloads need to read up-to-date stats as feedback to actions
> > (e.g. after proactive reclaim, or for userspace OOM killing purposes),
> > and reading such stale stats causes regressions or misbehavior by
> > userspace.
>
> Please tell us more about those, and why all others that do not require
> such precision should pay that price as well.

Everyone used to pay this price, though, and no one used to complain.
Even before rstat, we used to iterate the entire hierarchy when
userspace read the stats. rstat came in and made this much more
efficient by only iterating the subtrees that actually have updates.

The "skip if someone else is flushing" behavior was introduced for
flushers in critical paths (e.g. refault), and hurting the accuracy for
userspace readers was a side effect of it. This patch is trying to
remedy this side effect by restoring the old behavior for userspace
reads.

One other side effect is testing. Some tests started becoming flaky
because a test performs an action and expects the state of the system
to change in a certain way deterministically. In some cases the
flushing race leads to false negatives.
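(The hunk quoted below is truncated mid-function; the overall proposed
flow is roughly the following sketch, with the flush tail filled in for
illustration only, not quoted from the patch:)

static void do_flush_stats(bool full)
{
	/* The xchg winner proceeds to do the flush itself. */
	if (!atomic_read(&stats_flush_ongoing) &&
	    !atomic_xchg(&stats_flush_ongoing, 1))
		goto flush;

	/*
	 * Losers: non-full flushers (reclaim, refault, etc.) skip;
	 * full flushers (userspace reads) wait for the winner.
	 */
	while (full && atomic_read(&stats_flush_ongoing) == 1) {
		if (!cond_resched())
			cpu_relax();
	}
	return;

flush:	/* reconstructed for illustration; not part of the quoted hunk */
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	atomic_set(&stats_flush_ongoing, 0);
}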
> > > > [...]
> > > > @@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > > >  	}
> > > >  }
> > > >
> > > > -static void do_flush_stats(void)
> > > > +static void do_flush_stats(bool full)
> > > >  {
> > > > +	if (!atomic_read(&stats_flush_ongoing) &&
> > > > +	    !atomic_xchg(&stats_flush_ongoing, 1))
> > > > +		goto flush;
> > > > +
> > > >  	/*
> > > > -	 * We always flush the entire tree, so concurrent flushers can just
> > > > -	 * skip. This avoids a thundering herd problem on the rstat global lock
> > > > -	 * from memcg flushers (e.g. reclaim, refault, etc).
> > > > +	 * We always flush the entire tree, so concurrent flushers can choose to
> > > > +	 * skip if accuracy is not critical. Otherwise, wait for the ongoing
> > > > +	 * flush to complete. This avoids a thundering herd problem on the rstat
> > > > +	 * global lock from memcg flushers (e.g. reclaim, refault, etc).
> > > >  	 */
> > > > -	if (atomic_read(&stats_flush_ongoing) ||
> > > > -	    atomic_xchg(&stats_flush_ongoing, 1))
> > > > -		return;
> > > > -
> > > > +	while (full && atomic_read(&stats_flush_ongoing) == 1) {
> > > > +		if (!cond_resched())
> > > > +			cpu_relax();
> > >
> > > You are reinventing a mutex with a spinning waiter. Why don't you
> > > simply make stats_flush_ongoing a real mutex, and use try_lock for the
> > > !full flush and a normal lock otherwise?
> >
> > So that was actually a spinlock at one point, when we used to skip if
> > try_lock failed.
>
> AFAICS cgroup_rstat_flush() is allowed to sleep, so spinlocks are not
> really possible.
>
> > We opted for an atomic because the lock was only used
> > in a try_lock fashion. The problem here is that the atomic is used to
> > ensure that only one thread actually attempts to flush at a time (and
> > others skip/wait), to avoid a thundering herd problem on
> > cgroup_rstat_lock.
> >
> > Here, what I am trying to do is essentially equivalent to "wait until
> > the lock is available but don't grab it". If we make
> > stats_flush_ongoing a mutex, I am afraid the thundering herd problem
> > will be reintroduced, for stats_flush_ongoing this time.
>
> You will have potentially many spinners for something that might take
> quite a lot of time (sleep) if there is nothing else to schedule. I do
> not think this is proper behavior. Really, you shouldn't be busy
> waiting for a sleeper.
>
> > I am not sure if there's a cleaner way of doing this, but I am
> > certainly open to suggestions. I also don't like how the spinning
> > loop looks as of now.
>
> mutex_trylock() for non-critical flushers and mutex_lock() for syncing
> ones. We can talk about a custom locking scheme if that proves
> insufficient or problematic.
> --
> Michal Hocko
> SUSE Labs
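For illustration, the locking scheme suggested above would look roughly
like the sketch below (untested; it keeps the bool full parameter from
the patch, and stats_flush_mutex is a hypothetical name):

static DEFINE_MUTEX(stats_flush_mutex);

static void do_flush_stats(bool full)
{
	if (full) {
		/* Syncing read: sleep until we can do an accurate flush. */
		mutex_lock(&stats_flush_mutex);
	} else if (!mutex_trylock(&stats_flush_mutex)) {
		/* Non-critical flusher (reclaim, refault): skip if busy. */
		return;
	}

	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	mutex_unlock(&stats_flush_mutex);
}

Unlike the busy-wait loop in the patch, waiters here sleep inside
mutex_lock() rather than spinning on a flag, which is what addresses
the "busy waiting for a sleeper" concern.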