From: Shakeel Butt
Date: Fri, 18 Jun 2021 16:38:48 -0700
Subject: Re: [PATCH v1] proc: Implement /proc/self/meminfo
To: "Eric W. Biederman"
Cc: Alexey Gladkov, Christian Brauner, LKML, Linux Containers, Linux FS Devel, Linux MM, Andrew Morton, Johannes Weiner, Michal Hocko, Chris Down, Cgroups
In-Reply-To: <87zgvpg4wt.fsf@disp2133>
References: <20210615113222.edzkaqfvrris4nth@wittgenstein> <20210615124715.nzd5we5tl7xc2n2p@example.org> <87zgvpg4wt.fsf@disp2133>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 16, 2021 at 9:17 AM Eric W. Biederman wrote:
>
> Shakeel Butt writes:
>
> > On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov wrote:
> >>
> > [...]
> >>
> >> I made the second version of the patch [1], but then I had a conversation
> >> with Eric W. Biederman offlist. He convinced me that it is a bad idea to
> >> change all the values in meminfo to accommodate cgroups. But we agreed
> >> that MemAvailable in /proc/meminfo should respect cgroups limits. This
> >> field was created to hide implementation details when calculating
> >> available memory. You can see that it is quite widely used [2].
> >> So I want to try to move in that direction.
> >>
> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
> >> [2] https://codesearch.debian.net/search?q=MemAvailable%3A
> >>
> >
> > Please see the following two links for the previous discussion on having
> > a per-memcg MemAvailable stat.
> >
> > [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/
> > [2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/
> >
> > MemAvailable itself is an imprecise metric, and involving memcg makes
> > the metric even weirder. The difference in swap accounting semantics
> > between cgroup v1 and v2 is one source of this weirdness (I have not
> > checked whether your patch handles it). Lazyfree and deferred-split
> > pages are another source.
> >
> > So I am not sure that complicating an already imprecise metric will
> > make it more useful.
>
> Making a good guess at how much memory can be allocated without
> triggering swapping or otherwise stressing the system is something that
> requires understanding our mm internals.
>
> To be able to continue changing the mm or even mm policy without
> introducing regressions in userspace we need to export values that
> userspace can use.
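
As context for how userspace consumes this value today, here is a minimal
sketch of the usual /proc/meminfo parsing (the helper name is made up and
nothing below is specific to this patch):

  #include <stdio.h>

  /* Return MemAvailable in kB, or -1 if it cannot be read. */
  static long read_mem_available_kb(void)
  {
          FILE *f = fopen("/proc/meminfo", "r");
          char line[256];
          long kb = -1;

          if (!f)
                  return -1;
          while (fgets(line, sizeof(line), f)) {
                  /* Line looks like "MemAvailable:   123456 kB". */
                  if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
                          break;
          }
          fclose(f);
          return kb;
  }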
The issue is the dependence of such exported values on mm internals: mm
code and policy changes will change these values, and that carries a real
potential for userspace regressions.

>
> At a first approximation that seems to look like MemAvailable.
>
> MemAvailable seems to have a good definition. Roughly the amount of
> memory that can be allocated without triggering swapping.

Nowadays I don't think MemAvailable's "amount of memory that can be
allocated without triggering swapping" is even roughly accurate. Actually,
IMO "without triggering swap" is not something an application should
concern itself with, given that refaults from some swap backends
(zswap/swap-on-zram) are much faster than refaults from disk.

> Updated to include not triggering memory cgroup based swapping, and it
> sounds good.
>
> I don't know if it will work in practice but I think it is worth
> exploring.

I agree.

> I do know that hiding the implementation details and providing userspace
> with information it can directly use seems like the programming model
> that needs to be explored. Most programs should not care if they are in
> a memory cgroup, etc. Programs, load management systems, and even
> balloon drivers have a legitimate interest in how much additional load
> can be placed on a system's memory.

How much additional load can be placed on a system *until what*? I think
we should focus more on the "until" part to make the problem more
tractable.

>
> A version of this that I remember working fairly well is free space
> on compressed filesystems. As I recall compressed filesystems report
> the amount of uncompressed space that is available (an underestimate).
> This results in the amount of space consumed going up faster than the
> free space goes down.
>
> We can't do exactly the same thing with our memory usability estimate,
> but having our estimate be a reliable underestimate might be enough
> to avoid problems with reporting too much memory as available to
> userspace.
>
> I know that MemAvailable already does that /2 so maybe it is already
> aiming at being an underestimate. Perhaps we need some additional
> accounting to help create a useful metric for userspace as well.

The real challenge is that we are not 100% sure whether a page is
reclaimable until we actually try to reclaim it. For example, the file
LRUs might be filled with lazyfree pages that have since been accessed:
MemAvailable will show half the size of the file LRUs, but once we try to
reclaim those pages we have to move them back to the anon LRU, and
MemAvailable drops drastically.

> I don't know the final answer. I do know that not designing an
> interface that userspace can use to deal with its legitimate concerns
> is sticking our collective heads in the sand and wishing the problem
> will go away.

I am a bit skeptical that a single interface would be enough, but first we
should formalize what exactly the application wants, with some concrete
use-cases. More specifically, are applications interested in avoiding
swapping, OOM, or stalls?

Second, is a reactive approach acceptable? Instead of an upfront number
representing the room for growth, how about just growing and backing off
when the event we want to avoid (OOM or stall) is about to happen? This is
achievable today for OOM and stalls with PSI and memory.high, and it
avoids the hard problem of reliably estimating reclaimable memory.
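
A rough sketch of that reactive approach, just to make it concrete (the
cgroup path and the avg10 threshold below are made up for illustration; a
real implementation would more likely use poll()-based PSI triggers than
periodic sampling):

  #include <stdio.h>

  /* Read the "some avg10" figure from a cgroup v2 memory.pressure file. */
  static double memory_pressure_avg10(const char *cgroup)
  {
          char path[256], line[256];
          double avg10 = -1.0;
          FILE *f;

          snprintf(path, sizeof(path), "%s/memory.pressure", cgroup);
          f = fopen(path, "r");
          if (!f)
                  return -1.0;
          while (fgets(line, sizeof(line), f)) {
                  /* "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
                  if (sscanf(line, "some avg10=%lf", &avg10) == 1)
                          break;
          }
          fclose(f);
          return avg10;
  }

  int main(void)
  {
          const char *cg = "/sys/fs/cgroup/workload";   /* hypothetical */
          double stall = memory_pressure_avg10(cg);

          if (stall < 0.0) {
                  fprintf(stderr, "no PSI data for %s\n", cg);
                  return 1;
          }
          if (stall > 10.0)
                  puts("back off: stop growing, shrink caches");
          else
                  puts("ok to grow");
          return 0;
  }

The workload would sit under memory.high so that overshoot results in
reclaim and throttling (visible as PSI stall) rather than an immediate OOM
kill, which is what gives the backoff loop something to react to.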