Received: by 2002:a05:7412:37c9:b0:e2:908c:2ebd with SMTP id jz9csp544313rdb; Tue, 19 Sep 2023 03:20:46 -0700 (PDT) X-Google-Smtp-Source: AGHT+IHGOSKRXJxYX4liszYPTLckR4h6TakLFvej16bFrJMJ3Vk3wjPO5gNIeuZ5gayQLRTvEgJj X-Received: by 2002:a05:6a20:938e:b0:14d:5796:bccd with SMTP id x14-20020a056a20938e00b0014d5796bccdmr12465887pzh.24.1695118845793; Tue, 19 Sep 2023 03:20:45 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1695118845; cv=none; d=google.com; s=arc-20160816; b=PEdXczlMkfDfTuS2g8WmayppOMUbxkQCaSlTZ82Rl2Kn8/GCsuYE08anK6ZtEqRrGR k9GP24m8vMX5l1E7eRV8zSmrS5H9PuaTI2feeLA+JpoOenPVyc8ouU6wzYnuExVLJQxF yozCCueTeaFo1Z942Cu3F0L9Y+7lIBthg9ei14sZoDBAUg4Yy2ToWaSMeCibDlI6giak 7wm4m5yeOTFKZj9OGejZuHFtpPJdLuxeTvtVFVrNQ+Q6/VnZkAns+8pZ+6ZmxuIF3iRx Gfinnwm74dl+rrIwwt+hvSozFYMlsF0Dz+JJ7QyxmHj9ClwxcmfxXDnH/iCUctJ9pnZr 2/MQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=fneQXzvskbWAJ/u8RbohNhmDqQxy1hEL61UT0wNAZho=; fh=xoUKUYwsHP88jVXv8iAtrUCIr9Fd+NLU74akQLg97ME=; b=A3Y+7RWJ58iORhKb4NHMOFu+25JjrwLqmdKUtsZNQEs/CGJjw6VZOa49fhY4MZY3mH asrYrfPIDaUP6N7fNTmIjWWS85amYj8fwOFhBa0R/ux0h4ane9OZFRFrwnetx46J5ykk OZ0OKkB/38RKs0IvDtVS9OsgZmbK99d0C3OQ30L9VpRXBdkJkAGzKL4/IsvO+ly8/4VK yGgP3JdbdYUOSQdnmmJNYlJdsrgbcmI4MZR6CguPZKCDvxzMEus6IZ+dZcSrabdEqq3p Qy6j+s7paV+1Sp//Nw9mTw6audAgKLYi5evWkai1j7F5dL/c3swNVQA7dWpBVKAIe9YA BVEg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=n+hkRywl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from groat.vger.email (groat.vger.email. [23.128.96.35]) by mx.google.com with ESMTPS id z17-20020a631911000000b00577a519621bsi9291210pgl.231.2023.09.19.03.20.44 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 19 Sep 2023 03:20:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) client-ip=23.128.96.35; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=n+hkRywl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.35 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by groat.vger.email (Postfix) with ESMTP id 0F3BB8235792; Mon, 18 Sep 2023 22:31:04 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at groat.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229960AbjISFa3 (ORCPT + 99 others); Tue, 19 Sep 2023 01:30:29 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41500 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230513AbjISFaZ (ORCPT ); Tue, 19 Sep 2023 01:30:25 -0400 Received: from mail-ej1-x636.google.com (mail-ej1-x636.google.com [IPv6:2a00:1450:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 44808102 for ; Mon, 18 Sep 2023 22:30:19 -0700 (PDT) Received: by mail-ej1-x636.google.com with SMTP id a640c23a62f3a-9adc75f6f09so551692766b.0 for ; Mon, 18 Sep 2023 22:30:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1695101418; x=1695706218; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=fneQXzvskbWAJ/u8RbohNhmDqQxy1hEL61UT0wNAZho=; b=n+hkRywlkXvqXRXYkd6GjF6nP8DsNC6F0mvMGKVBzHflUYLPfg3aiDk99QYMyzn78H o3gRGBX95B8NQJQdGbWpcJH1pIiePnxIYcl24auvrKEFUmLcUhDDQvJOlyOon1ycRYTZ L/TyPvCBzCZVhi+t7hFBL3JZ4pmfUkPpsFEhImM+pR0tSwLOFNEWlSxzyiwMXM7C4wCT gk5X72MVmV6vMMkjfhCJ3CK7rvbuT4NfEnav5uN4N1D9HGNk9kS6F29PyYihqeBTj67j ciOm2YcHFdQ7jMtzDQj2MT6u9P56suWMikwj91N7BRCk9XTHLm8ZDyqsiC+Zq1Sh40d8 AT4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695101418; x=1695706218; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=fneQXzvskbWAJ/u8RbohNhmDqQxy1hEL61UT0wNAZho=; b=UwzXKR09Y2K4crGbRI/8eL555Vb/lYBAG62g/XDxQZVpKLXv96mlVST4plKNBsbhKZ PEJXyGyF03agf7YCwZWlUlQfReNyow0fAmfMliVVeVI423+dpcEpTgu+3ILpyqFr4k3/ WYIEkzAYaTSveyhRu8igi44QNo0dOUhhR+FW/MkPcxJQYkBdztRdOnh0DQnfWTzQkelC ciQxGmlGNUKQ/unK/BuQKTblddTTjzopAfOpN8R3K/x6ffngdJs9b7nngtEZj9h0F+4y KjE++RvRu2KuOc0AnsPeK+hqcbWG6Sw7u6WCwOj1hM6Ev2hi1vYnXbphk2riwRx0s3pr clrw== X-Gm-Message-State: AOJu0Yx2EPBVM1x7N3zQQKWIUIEHUjRVW+BBm0nvh22lFqjLWIlB6ZSN pwWWuMFf9n93iY5eW7OWDWp8QxHHGa/RhYo4HASwHw== X-Received: by 2002:a17:907:2c6b:b0:9ae:1872:d0ab with SMTP id ib11-20020a1709072c6b00b009ae1872d0abmr2757113ejc.0.1695101417573; Mon, 18 Sep 2023 22:30:17 -0700 (PDT) MIME-Version: 1.0 References: <20230913073846.1528938-1-yosryahmed@google.com> <20230913073846.1528938-4-yosryahmed@google.com> <20230914225844.woz7mke6vnmwijh7@google.com> <20230915010138.knjli6ovpozxbpss@google.com> In-Reply-To: <20230915010138.knjli6ovpozxbpss@google.com> From: Yosry Ahmed Date: Mon, 18 Sep 2023 22:29:38 -0700 Message-ID: Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy To: Shakeel Butt Cc: Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Ivan Babrou , Tejun Heo , =?UTF-8?Q?Michal_Koutn=C3=BD?= , Waiman Long , kernel-team@cloudflare.com, Wei Xu , Greg Thelen , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-8.4 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on groat.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (groat.vger.email [0.0.0.0]); Mon, 18 Sep 2023 22:31:04 -0700 (PDT) On Thu, Sep 14, 2023 at 6:01=E2=80=AFPM Shakeel Butt = wrote: > > On Thu, Sep 14, 2023 at 04:30:56PM -0700, Yosry Ahmed wrote: > [...] > > > > > > I think first you need to show if this (2 sec stale stats) is really = a > > > problem. > > > > That's the thing, my main concern is that if this causes a problem, we > > probably won't be able to tell it was because of stale stats. It's > > very hard to make that connection. > > > > Please articulate what the problem would look like which you did in the > use-cases description below, let's discuss there. > > > Pre-rstat, reading stats would always yield fresh stats (as much as > > possible). Now the stats can be up to 2s stale, and we don't really > > know how this will affect our existing workloads. > > > > Pre-rstat the stat read would traverse the memcg tree. With rstat > the tradeoff was made between expensive read and staleness. > Yeah there > might still be memcg update tree traversal which I would like to remove > completely. However you are saying to I think this sentence is truncated. > > [...] > > > > > > I don't see why userspace OOM killing and proactive reclaim need > > > subsecond accuracy. Please explain. > > > > For proactive reclaim it is not about sub-second accuracy. It is about > > doing the reclaim then reading the stats immediately to see the > > effect. Naturally one would expect that a stat read after reclaim > > would show the system state after reclaim. > > > > For userspace OOM killing I am not really sure. It depends on how > > dynamic the workload is. If a task recently had a spike in memory > > usage causing a threshold to be hit, userspace can kill a different > > task if the stats are stale. > > > > Please add above reasoning in your commit message (though I am not > convinced but let's leave it at that). Will do in the next version, thanks. > > > I think the whole point is *not* about the amount of staleness. It is > > more about that you expect a stats read after an event to reflect the > > system state after the event. > > The whole point is to understand the tradeoff between accuracy and cost > of accuracy. I don't think you want to pay the cost of strong > consistency/ordering between stats reading and an event. My worry is > that you are enforcing a tradeoff which *might* be just applicable to > your use-cases. Anyways this is not something that can not be changed > later. Given the numbers I got with the patch, it doesn't seem like we are paying a significant cost for the accuracy. Anyway, as you say, it's not something that can not be changed. In fact, I have another proposal that I am currently testing, please see my next response to Johannes. > > > > > > Same for system overhead but I can > > > see the complication of two different sources for stats. Can you prov= ide > > > the formula of system overhead? I am wondering why do you need to rea= d > > > stats from memory.stat files. Why not the memory.current of top level > > > cgroups and /proc/meminfo be enough. Something like: > > > > > > Overhead =3D MemTotal - MemFree - SumOfTopCgroups(memory.current) > > > > We use the amount of compressed memory in zswap from memory.stat, > > which is not accounted as memory usage in cgroup v1. > > > > There are zswap stats in /proc/meminfo. Will those work for you? Yeah this should work for this specific use case, thanks. > > [...] > > > Fix the in-kernel flushers separately. > > > > The in-kernel flushers are basically facing the same problem. For > > instance, reclaim would expect a stats read after a reclaim iteration > > to reflect the system state after the reclaim iteration. > > > > I have not seen any complains on memory reclaim recently. Maybe > reclaim does not really need that such accuracy :P Perhaps, it's full of heuristics anyway :) > > > > Also the problem Cloudflare is facing does not need to be tied with t= his. > > > > When we try to wait for flushing to complete we run into the same > > latency problem of the root flush. > > Not sure what wait for flushing has to do with Cloudflare's report. They > are ok with no sync flushing at all stat read. Oh I am not saying the wait benefits their use case. I am saying when the wait is implemented, we face the same problem (expensive flush of the entire hierarchy), so we need to mitigate it anyway -- hence the relevance to Cloudflare's use case. Anyway, I have an alternative that I will propose shortly in response to Johannes's reply.