Received: by 2002:a05:7412:6592:b0:d7:7d3a:4fe2 with SMTP id m18csp1195037rdg; Fri, 11 Aug 2023 13:01:09 -0700 (PDT) X-Google-Smtp-Source: AGHT+IGQtOj9zyp6dYeWkdKQ7KnwRiJcOxRcsWEbg+aL9qc4gEok2hVT7pkGCF6VCVsBDyPbhePp X-Received: by 2002:a17:902:e848:b0:1bc:1b01:8961 with SMTP id t8-20020a170902e84800b001bc1b018961mr3248426plg.1.1691784068850; Fri, 11 Aug 2023 13:01:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1691784068; cv=none; d=google.com; s=arc-20160816; b=waHrKd2y3V/Nz82vzpnrSVapDIFHO5QJTf/gwo6Dh0YO8sUPI0gqw3PmpdwNmOPmaO NIf/3XdhEH60FEYImeROSTlPTldFDZF0xN4jcUnC6RZt/nDQRyseFDFEGg97KDrEAMwB wFWNMrGJttaFU1C2TqRfuAUnp5B+4VkbtIQplKL9vxKNjho3rTlCiGzDb6vUEqSbO6UE O+mYICIV21tek9jStT7789DuNX6Lmc8UHEtG1RhZaTtP7XAqCoRMNWLRfIfcRQwZpqKi HGAPXAyaVdwmYl7v5UGiRfo5/YPj53u969sEsZ2YDRQJU1HukVNW6NKqH4r6Ye9NTbzr cxYA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=x/sryX5oyAT1OcjzxkuS075LBhWkQfYR3UGrlpCh99U=; fh=z72+QYuJciOOOFzf2RoL3C2GbcAXRoRQhsmgvx/iL5k=; b=TedpvX/gMEZIy3mArShIMAOzitxCs7Kp3R6W5OyBFtsEZnKntdrYop2xkA6Yj2rGHA TzCUfPEuW7IXaqfZAhQUPSCNy5fZQAHNJIecDlCqGH0iEt+djlTI0qgQpc06hpxXLPc8 u/+t6oU7Y6BnNvJJD8liy32KiETU42sh0F7orGOImaSD3QrQD14qn3NZv8DZBqiKFz6w xiRY0sWz5DxOWv1T41L5Rp7fjcMnyLvscsHiVrpoJ4kpcamv8GmyERz+66Or6Ou2h+TO W1f5y74wxq49DkCxVivlWOiNbE1S+Qlnxnp1unqizFeGbWwcG7IHF4IFNAalGGMKiDUF IgNg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b=dKctG1GF; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id t9-20020a170902e84900b001bbb175a81asi3867236plg.263.2023.08.11.13.00.55; Fri, 11 Aug 2023 13:01:08 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b=dKctG1GF; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229576AbjHKTD3 (ORCPT + 99 others); Fri, 11 Aug 2023 15:03:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45428 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229706AbjHKTD2 (ORCPT ); Fri, 11 Aug 2023 15:03:28 -0400 Received: from mail-ej1-x62b.google.com (mail-ej1-x62b.google.com [IPv6:2a00:1450:4864:20::62b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2C46018F for ; Fri, 11 Aug 2023 12:03:27 -0700 (PDT) Received: by mail-ej1-x62b.google.com with SMTP id a640c23a62f3a-99c136ee106so314739166b.1 for ; Fri, 11 Aug 2023 12:03:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1691780605; x=1692385405; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=x/sryX5oyAT1OcjzxkuS075LBhWkQfYR3UGrlpCh99U=; b=dKctG1GFM+Jh407/MAvdPbf8fvy2f8tU/NAZX1EVR9ab6tHyEHzJ0TRR9NbLFvZNa/ uJuDpVcGhfi6OejziQiC+fPB0qopemacTt1pqoJYoTjFUnOVkclVFkP14P0Pf/ZvmzcB bQXFtreNVS4ATqWWHVWXucbITer2pHwn16sODp+ca/16BUjSEovkmVQHDjcwHssUg/Re A1Kqn8Q/Hd2HVhK1MQLNDMSwsHkd2MyNjtHa/1SPkQwToShj/E5FCsX6EE2V6IU4vWRY QtsNioHuAmmjhFOcFdvyk1HCf5aSwHfSqXRPhCNlDMFjz01aYt8ch9DEpg298zdChejE ZWrw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1691780605; x=1692385405; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=x/sryX5oyAT1OcjzxkuS075LBhWkQfYR3UGrlpCh99U=; b=hhEdciRXpkpuFhOi9hGA7mJl7i54gWyrQqjpEzSW1g7KAlQYULgeT/GREwsz+63IBi oblxXDNwpoCwB8nyZvsts3+uNh2cxLkqjIPkN6ebiPmQNpqekmgw0V/WQHL82awe41yU ytoSL6EqsJO1eN5/E3W97wLdM0CniZbVFssqVsCPB/gR5B6p8RlgLijkUvXxnhj0T9Vl O/XbGdj1MdPWLRPsXaASpXMtNK3hoqTpliBBDMCcRTES1kJqHfzMmkNu+ey609USJuYo DLSgLb3DQcIv65XDmvOylpe7ECyk0nAglhwLPBDMLejgKS8McPoplKu/IszKBkJopdrJ WouA== X-Gm-Message-State: AOJu0Yz8Bh4AtqSV5fqnoG/nhY4nGqa6ZFCVKc1OUcu34B+QtnKRzO/q j8Pf3509ZeGt8hJfDL4VHvKfXMn4XYl+CkejhMCeRg== X-Received: by 2002:a17:907:75d5:b0:98d:5333:79c with SMTP id jl21-20020a17090775d500b0098d5333079cmr2133043ejc.5.1691780605368; Fri, 11 Aug 2023 12:03:25 -0700 (PDT) MIME-Version: 1.0 References: <20230809045810.1659356-1-yosryahmed@google.com> In-Reply-To: From: Yosry Ahmed Date: Fri, 11 Aug 2023 12:02:48 -0700 Message-ID: Subject: Re: [PATCH] mm: memcg: provide accurate stats for userspace reads To: Michal Hocko Cc: Johannes Weiner , Roman Gushchin , Shakeel Butt , Andrew Morton , Muchun Song , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Aug 11, 2023 at 5:21=E2=80=AFAM Michal Hocko wrot= e: > > On Wed 09-08-23 11:33:20, Yosry Ahmed wrote: > > On Wed, Aug 9, 2023 at 6:32=E2=80=AFAM Michal Hocko w= rote: > > > > > > On Wed 09-08-23 06:13:05, Yosry Ahmed wrote: > > > > On Wed, Aug 9, 2023 at 5:58=E2=80=AFAM Michal Hocko wrote: > > > > > > > > > > On Wed 09-08-23 05:31:04, Yosry Ahmed wrote: > > > > > > On Wed, Aug 9, 2023 at 1:51=E2=80=AFAM Michal Hocko wrote: > > > > > > > > > > > > > > On Wed 09-08-23 04:58:10, Yosry Ahmed wrote: > > > > > > > > Over time, the memcg code added multiple optimizations to t= he stats > > > > > > > > flushing path that introduce a tradeoff between accuracy an= d > > > > > > > > performance. In some contexts (e.g. dirty throttling, refau= lts, etc), a > > > > > > > > full rstat flush of the stats in the tree can be too expens= ive. Such > > > > > > > > optimizations include [1]: > > > > > > > > (a) Introducing a periodic background flusher to keep the s= ize of the > > > > > > > > update tree from growing unbounded. > > > > > > > > (b) Allowing only one thread to flush at a time, and other = concurrent > > > > > > > > flushers just skip the flush. This avoids a thundering herd= problem > > > > > > > > when multiple reclaim/refault threads attempt to flush the = stats at > > > > > > > > once. > > > > > > > > (c) Only executing a flush if the magnitude of the stats up= dates exceeds > > > > > > > > a certain threshold. > > > > > > > > > > > > > > > > These optimizations were necessary to make flushing feasibl= e in > > > > > > > > performance-critical paths, and they come at the cost of so= me accuracy > > > > > > > > that we choose to live without. On the other hand, for flus= hes invoked > > > > > > > > when userspace is reading the stats, the tradeoff is less a= ppealing > > > > > > > > This code path is not performance-critical, and the inaccur= acies can > > > > > > > > affect userspace behavior. For example, skipping flushing w= hen there is > > > > > > > > another ongoing flush is essentially a coin flip. We don't = know if the > > > > > > > > ongoing flush is done with the subtree of interest or not. > > > > > > > > > > > > > > I am not convinced by this much TBH. What kind of precision d= o you > > > > > > > really need and how much off is what we provide? > > > > > > > > > > > > > > More expensive read of stats from userspace is quite easy to = notice > > > > > > > and usually reported as a regression. So you should have a co= nvincing > > > > > > > argument that an extra time spent is really worth it. AFAIK t= here are > > > > > > > many monitoring (top like) tools which simply read those file= s regularly > > > > > > > just to show numbers and they certainly do not need a high le= vel of > > > > > > > precision. > > > > > > > > > > > > We used to spend this time before commit fd25a9e0e23b ("memcg: = unify > > > > > > memcg stat flushing") which generalized the "skip if ongoing fl= ush" > > > > > > for all stat flushing. As far I know, the problem was contentio= n on > > > > > > the flushing lock which also affected critical paths like refau= lt. > > > > > > > > > > > > The problem is that the current behavior is indeterministic, if= cpu A > > > > > > tries to flush stats and cpu B is already doing that, cpu A wil= l just > > > > > > skip. At that point, the cgroup(s) that cpu A cares about may h= ave > > > > > > been fully flushed, partially flushed (in terms of cpus), or no= t > > > > > > flushed at all. We have no idea. We just know that someone else= is > > > > > > flushing something. IOW, in some cases the flush request will b= e > > > > > > completely ignored and userspace will read stale stats (up to 2= s + the > > > > > > periodic flusher runtime). > > > > > > > > > > Yes, that is certainly true but why does that matter? Stats are a= lways a > > > > > snapshot of the past. Do we get an inconsistent image that would = be > > > > > actively harmful. > > > > > > > > That can very well be the case because we may be in a state where s= ome > > > > cpus are flushed and some aren't. Also sometimes a few seconds is t= oo > > > > old. We have some workloads that read the stats every 1-2 seconds t= o > > > > keep a fresh state, and they certainly do not expect stats to be 2+ > > > > seconds old when they read them. > > > > > > I hate to repeat myself but please be more specific. This all sounds > > > just too wavy to me. > > > > Sorry I didn't have the full story in mind, I had to do my homework. > > One example is userspace OOM killing. Our userspace OOM killer makes > > decisions based on some stats from memory.stat, and stale stats (a few > > seconds in this case) can result in an unrightful OOM kill, which can > > easily cascade. > > OK, but how is this any different from having outdated data because you > have to wait for memory.stat to read (being blocked inside the rstat > code)? Either your oom killer is reading the stats directly and then you > depend on that flushing which is something that could be really harmful > itself or you rely on another thread doing the blocking and you do not > have up-to-date numbers anyway. So how does blocking actually help? I am not sure I understand. The problem is that when you skip when someone else is flushing, there is a chance that the stats we care about haven't been flushed since the last time the periodic flusher ran. Which is supposed to be ~2 seconds ago, but maybe more depending on how busy the workqueue is. When you block until the flusher finishes, the stats are being refreshed as you wait. So the stats are not getting more outdated as you wait in the general case (unless your cgroup was flushed first and you're waiting for others to be flushed). [Let's call this approach A] Furthermore, with the implementation you suggested using a mutex, we will wait until the ongoing flush is completed, then we will grab the mutex and do a flush ourselves. That second flush should mostly be very fast, but it will guarantee even fresher stats. [Let's call this approach B] See below for test results with either A or B. We can add a new API that checks if the specific cgroup we care about is flushed and wait on that instead of waiting for the entire flush to finish, which will add stronger guarantees. However, as you said when you suggested the mutex approach, let's start simple and add more complexity when needed. > > > A simplified example of that is when a hierarchy has a parent cgroup > > with multiple related children. In this case, there are usually > > file-backed resources that are shared between those children, and OOM > > killing one of them will not free those resources. Hence, the OOM > > killer only considers their anonymous usage to be reap-able when a > > memcg is nuked. For that we use the "anon" stat (or "rss" in cgroup > > v1) in memory.stat. > > > > > > > > > > > Some workloads need to read up-to-date stats as feedback to act= ions > > > > > > (e.g. after proactive reclaim, or for userspace OOM killing pur= poses), > > > > > > and reading such stale stats causes regressions or misbehavior = by > > > > > > userspace. > > > > > > > > > > Please tell us more about those and why should all others that do= not > > > > > require such a precision should page that price as well. > > > > > > > > Everyone used to pay this price though and no one used to complain. > > > > > > Right, and then the overhead has been reduced and now you want to bri= ng > > > it back and that will be seen as a regression. It doesn't really matt= er > > > what used to be the overhead. People always care when something gets > > > slower. > > > > People also care when something gets less accurate :) > > Accuracy will never be 100%. We have to carefully balance between > accuracy and overhead. So far we haven't heard about how much inaccuracy > you are getting. Numbers help! Very good question, I should have added numbers since the beginning to clarify the significance of the problem. To easily produce numbers I will use another use case that we have that relies on having fresh stats, which is proactive reclaim. Proactive reclaim usually operates in a feedback loop where it requests some reclaim, queries the stats, and decides how to operate based on that (e.g. fallback for a while). When running a test that is proactively reclaiming some memory and expecting to see the memory swapped, without this patch, we see significant inaccuracy. In some failure instances we expect ~2000 pages to be swapped but we only find ~1200. This is observed on machines with hundreds of cpus, where the problem is most noticeable. This is a huge difference. Keep in mind that the inaccuracy would probably be even worse in a production environment if the system is under enough pressure (e.g. the periodic flusher is late). For both approach A (wait until flusher finishes and exit, i.e this patch) and approach B (wait until flusher finishes then flush, i.e the mutex approach), I stop seeing this failure in the proactive reclaim test and the stats are accurate. I have v2 ready that implements approach B with the mutex ready to fire, just say the word :) > > In any case I do get the argument about consistency within a subtree > (children data largely not matching parents'). Examples like that would > be really helpful as well. If that is indeed the case then I would > consider it much more serious than accuracy which is always problematic > (100ms of an actively allocating context can ruin your just read numbers > and there is no way around that wihtout stopping the world). 100% agreed. It's more difficult to get testing results for this, but that can easily be the case when we have no idea how much is flushed when we return from mem_cgroup_flush_stats(). > > Last note, for /proc/vmstat we have /proc/sys/vm/stat_refresh to trigger > an explicit refresh. For those users who really need more accurate > numbers we might consider interface like that. Or allow to write to stat > file and do that in the write handler. This wouldn't be my first option, but if that's the only way to get accurate stats I'll take it. Keep in mind that the normal stats read path will always try to refresh, it's just that it will often skip refreshing due to an implementation-specific race. So having an interface for an explicit flush might be too implementation specific, especially if the race disappears later and the interface is not needed later. Having said that, I am not opposed to this if that's the only way forward for accurate stats, but I would rather have the stat reads be always accurate unless a regression is noticed. > -- > Michal Hocko > SUSE Labs