Date: Thu, 1 Oct 2020 10:31:49 -0400
From: Johannes Weiner
To: Shakeel Butt
Cc: Michal Hocko, Roman Gushchin, Yang Shi, Greg Thelen, David Rientjes,
 Michal Koutný, Andrew Morton, Linux MM, Cgroups, LKML
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Message-ID: <20201001143149.GA493631@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com>
 <20200928210216.GA378894@cmpxchg.org>
 <20200929150444.GG2277@dhcp22.suse.cz>
 <20200929215341.GA408059@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote:
> On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner wrote:
> >
> > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> > > [...]
> > > > My take is that a proactive reclaim feature, whose goal is never to
> > > > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > > > would ideally have:
> > > >
> > > > - a pressure or size target specified by userspace but with
> > > >   enforcement driven inside the kernel from the allocation path
> > > >
> > > > - the enforcement work NOT be done synchronously by the workload
> > > >   (something I'd argue we want for *all* memory limits)
> > > >
> > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> > > >   cgroup's memory allocations causing the work (again something I'd
> > > >   argue we want in general)
> > > >
> > > > - a delegatable knob that is independent of setting the maximum size
> > > >   of a container, as that expresses a different type of policy
> > > >
> > > > - if size target, self-limiting (ha) enforcement on a pressure
> > > >   threshold or stop enforcement when the userspace component dies
> > > >
> > > > Thoughts?
> > >
> > > Agreed with above points. What do you think about
> > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz.
> >
> > I definitely agree with what you wrote in this email for background
> > reclaim. Indeed, your description sounds like what I proposed in
> > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/
> > - what's missing from that patch is proper work attribution.
> >
> > > I assume that you do not want to override memory.high to implement
> > > this because that tends to be tricky from the configuration POV as
> > > you mentioned above. But a new limit (memory.middle for a lack of a
> > > better name) to define the background reclaim sounds like a good fit
> > > with above points.
> >
> > I can see that with a new memory.middle you could kind of sort of do
> > both - background reclaim and proactive reclaim.
> >
> > That said, I do see advantages in keeping them separate:
> >
> > 1. Background reclaim is essentially an allocation optimization that
> >    we may want to provide per default, just like kswapd.
> >
> >    Kswapd is tweakable of course, but I think actually few users do,
> >    and it works pretty well out of the box. It would be nice to
> >    provide the same thing on a per-cgroup basis per default and not
> >    ask users to make decisions that we are generally better at making.
> >
> > 2. Proactive reclaim may actually be better configured through a
> >    pressure threshold rather than a size target.
> >
> >    As per above, the goal is not to be punitive or containing. The
> >    goal is to keep the LRUs warm and move the colder pages to disk.
> >
> >    But how aggressively do you run reclaim for this purpose? What
> >    target value should a user write to such a memory.middle file?
> >
> >    For one, it depends on the job. A batch job, or a less important
> >    background job, may tolerate higher paging overhead than an
> >    interactive job. That means more of its pages could be trimmed from
> >    RAM and reloaded on-demand from disk.
> >
> >    But also, it depends on the storage device. If you move a workload
> >    from a machine with a slow disk to a machine with a fast disk, you
> >    can page more data in the same amount of time. That means while
> >    your workload's tolerance stays the same, the faster the disk, the
> >    more aggressively you can do reclaim and offload memory.
> >
> >    So again, what should a user write to such a control file?
> >
> >    Of course, you can approximate an optimal target size for the
> >    workload. You can run a manual workingset analysis with page_idle,
> >    damon, or similar, determine a hot/cold cutoff based on what you
> >    know about the storage characteristics, then echo a number of pages
> >    or a size target into a cgroup file and let the kernel do the reclaim
> >    accordingly. The drawbacks are that the kernel LRU may do a
> >    different hot/cold classification than you did and evict the wrong
> >    pages, the storage device latencies may vary based on overall IO
> >    pattern, and two equally warm pages may have very different paging
> >    overhead depending on whether readahead can avert a major fault or
> >    not. So it's easy to overshoot the tolerance target and disrupt the
> >    workload, or undershoot and have stale LRU data, waste memory etc.
> >
> >    You can also do a feedback loop, where you guess an optimal size,
> >    then adjust based on how much paging overhead the workload is
> >    experiencing, i.e. memory pressure. The drawbacks are that you have
> >    to monitor pressure closely and react quickly when the workload is
> >    expanding, as it can be potentially sensitive to latencies in the
> >    usec range. This can be tricky to do from userspace.
>
> This is actually what we do in our production, i.e. a feedback loop to
> adjust the next iteration of proactive reclaim.

That's what we do also right now. It works reasonably well, the only
two pain points are/have been the reaction time under quick workload
expansion and inadvertently forcing the workload into direct reclaim.
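For concreteness, here is a stripped-down sketch of such a userspace
loop, using the memory.reclaim file proposed in this patch as the
reclaim knob; the cgroup path, tolerance and step size are made up for
illustration, not our production tuning:

#include <stdio.h>
#include <unistd.h>

/* Illustrative knobs - every value here is a placeholder: */
#define CG		"/sys/fs/cgroup/workload"
#define TOLERANCE	1.0		/* "some" avg10 stall percentage */
#define STEP		"16777216"	/* ask for 16M of reclaim per round */

/* Parse the "some avg10=" field from CG/memory.pressure */
static double some_avg10(void)
{
	char line[256];
	double avg10 = 999.0;	/* treat read errors as max pressure */
	FILE *f = fopen(CG "/memory.pressure", "r");

	if (!f)
		return avg10;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "some avg10=%lf", &avg10) == 1)
			break;
	fclose(f);
	return avg10;
}

int main(void)
{
	for (;;) {
		if (some_avg10() < TOLERANCE) {
			/* workload is comfortable - trim some more */
			FILE *f = fopen(CG "/memory.reclaim", "w");

			if (f) {
				fputs(STEP, f);
				fclose(f);
			}
		}
		sleep(5);	/* way too coarse for the latencies above */
	}
}

The reaction-time problem is visible right in the structure: if the
sleep is too long or the step too big, the loop overshoots before the
next pressure reading and pushes the workload into direct reclaim.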
> We eliminated the IO or slow disk issues you mentioned by only
> focusing on anon memory and doing zswap.

Interesting, may I ask how the file cache is managed in this setup?

> > So instead of asking users for a target size whose suitability
> > heavily depends on the kernel's LRU implementation, the readahead
> > code, the IO device's capability and general load, why not directly
> > ask the user for a pressure level that the workload is comfortable
> > with and which captures all of the above factors implicitly? Then
> > let the kernel do this feedback loop from a per-cgroup worker.
>
> I am assuming here by pressure level you are referring to the PSI-like
> interface, e.g. allowing the users to tell about their jobs that X
> amount of stalls in a fixed time window is tolerable.

Right, essentially the same parameters that psi poll() would take.
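I.e. a stall threshold and a time window. For reference, this is
roughly what the monitoring side looks like with the existing psi
trigger interface - the trigger below means "150ms of some-stall per
1s window", and both the numbers and the cgroup path are illustrative:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 150000 1000000";	/* threshold, window (us) */
	struct pollfd fds;

	fds.fd = open("/sys/fs/cgroup/workload/memory.pressure",
		      O_RDWR | O_NONBLOCK);
	if (fds.fd < 0) {
		perror("open");
		return 1;
	}
	/* arm the trigger, then wait for threshold crossings */
	if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
		perror("write trigger");
		return 1;
	}
	fds.events = POLLPRI;

	for (;;) {
		if (poll(&fds, 1, -1) < 0) {
			perror("poll");
			return 1;
		}
		if (fds.revents & POLLERR) {
			fprintf(stderr, "cgroup is gone\n");
			return 1;
		}
		if (fds.revents & POLLPRI)
			printf("stall threshold crossed\n");
	}
}

A kernel-side proactive reclaim worker could consume the same
(threshold, window) pair and throttle itself once the workload's
stalls exceed what the user declared tolerable.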