From: Shakeel Butt
Date: Wed, 30 Sep 2020 08:45:17 -0700
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
To: Johannes Weiner
Cc: Michal Hocko, Roman Gushchin, Yang Shi, Greg Thelen, David Rientjes,
 Michal Koutný, Andrew Morton, Linux MM, Cgroups, LKML
In-Reply-To: <20200929215341.GA408059@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com>
 <20200928210216.GA378894@cmpxchg.org>
 <20200929150444.GG2277@dhcp22.suse.cz>
 <20200929215341.GA408059@cmpxchg.org>

On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner wrote:
>
> On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> > On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> > [...]
> > > My take is that a proactive reclaim feature, whose goal is never to
> > > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > > would ideally have:
> > >
> > > - a pressure or size target specified by userspace but with
> > >   enforcement driven inside the kernel from the allocation path
> > >
> > > - the enforcement work NOT be done synchronously by the workload
> > >   (something I'd argue we want for *all* memory limits)
> > >
> > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> > >   cgroup's memory allocations causing the work (again something I'd
> > >   argue we want in general)
> > >
> > > - a delegatable knob that is independent of setting the maximum size
> > >   of a container, as that expresses a different type of policy
> > >
> > > - if size target, self-limiting (ha) enforcement on a pressure
> > >   threshold or stop enforcement when the userspace component dies
> > >
> > > Thoughts?
> >
> > Agreed with above points. What do you think about
> > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz.
>
> I definitely agree with what you wrote in this email for background
> reclaim. Indeed, your description sounds like what I proposed in
> https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/
> - what's missing from that patch is proper work attribution.
>
> > I assume that you do not want to override memory.high to implement
> > this because that tends to be tricky from the configuration POV as
> > you mentioned above. But a new limit (memory.middle for a lack of a
> > better name) to define the background reclaim sounds like a good fit
> > with above points.
>
> I can see that with a new memory.middle you could kind of sort of do
> both - background reclaim and proactive reclaim.
>
> That said, I do see advantages in keeping them separate:
>
> 1. Background reclaim is essentially an allocation optimization that
>    we may want to provide per default, just like kswapd.
>
>    Kswapd is tweakable of course, but I think actually few users do,
>    and it works pretty well out of the box. It would be nice to
>    provide the same thing on a per-cgroup basis per default and not
>    ask users to make decisions that we are generally better at making.
>
> 2. Proactive reclaim may actually be better configured through a
>    pressure threshold rather than a size target.
>
>    As per above, the goal is not to be punitive or containing. The
>    goal is to keep the LRUs warm and move the colder pages to disk.
>
>    But how aggressively do you run reclaim for this purpose? What
>    target value should a user write to such a memory.middle file?
>
>    For one, it depends on the job. A batch job, or a less important
>    background job, may tolerate higher paging overhead than an
>    interactive job. That means more of its pages could be trimmed from
>    RAM and reloaded on-demand from disk.
>
>    But also, it depends on the storage device. If you move a workload
>    from a machine with a slow disk to a machine with a fast disk, you
>    can page more data in the same amount of time. That means while
>    your workload tolerances stays the same, the faster the disk, the
>    more aggressively you can do reclaim and offload memory.
>
>    So again, what should a user write to such a control file?
>
>    Of course, you can approximate an optimal target size for the
>    workload. You can run a manual workingset analysis with page_idle,
>    damon, or similar, determine a hot/cold cutoff based on what you
>    know about the storage characteristics, then echo a number of pages
>    or a size target into a cgroup file and let kernel do the reclaim
>    accordingly. The drawbacks are that the kernel LRU may do a
>    different hot/cold classification than you did and evict the wrong
>    pages, the storage device latencies may vary based on overall IO
>    pattern, and two equally warm pages may have very different paging
>    overhead depending on whether readahead can avert a major fault or
>    not. So it's easy to overshoot the tolerance target and disrupt the
>    workload, or undershoot and have stale LRU data, waste memory etc.
>
>    You can also do a feedback loop, where you guess an optimal size,
>    then adjust based on how much paging overhead the workload is
>    experiencing, i.e. memory pressure. The drawbacks are that you have
>    to monitor pressure closely and react quickly when the workload is
>    expanding, as it can be potentially sensitive to latencies in the
>    usec range. This can be tricky to do from userspace.
>

This is actually what we do in production, i.e. a feedback loop that
adjusts the next iteration of proactive reclaim. We avoid the IO and
slow-disk issues you mention by focusing only on anon memory and
using zswap.

> So instead of asking users for a target size whose suitability
> heavily depends on the kernel's LRU implementation, the readahead
> code, the IO device's capability and general load, why not directly
> ask the user for a pressure level that the workload is comfortable
> with and which captures all of the above factors implicitly? Then
> let the kernel do this feedback loop from a per-cgroup worker.

I am assuming that by pressure level you are referring to a PSI-like
interface, e.g. letting users declare for their jobs that X amount of
stall in a fixed time window is tolerable.
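
For concreteness, the userspace loop we run is roughly of the
following shape. This is a heavily simplified sketch: reading the PSI
averages from memory.pressure is real, but the per-cgroup
memory.reclaim knob it writes to is a hypothetical stand-in for
whatever interface comes out of this discussion, and the thresholds
and step sizes are invented for illustration.

#include <stdio.h>
#include <unistd.h>

#define CGROUP "/sys/fs/cgroup/workload"

/* Return the "some avg10" stall percentage from memory.pressure. */
static double psi_some_avg10(void)
{
        char line[256];
        double avg10 = -1.0;
        FILE *f = fopen(CGROUP "/memory.pressure", "r");

        if (!f)
                return -1.0;
        /* Format: "some avg10=0.12 avg60=... avg300=... total=..." */
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "some avg10=%lf", &avg10) == 1)
                        break;
        fclose(f);
        return avg10;
}

/* Hypothetical knob: ask the kernel to reclaim 'bytes' from the cgroup. */
static void reclaim(long bytes)
{
        FILE *f = fopen(CGROUP "/memory.reclaim", "w");

        if (!f)
                return;
        fprintf(f, "%ld\n", bytes);
        fclose(f);
}

int main(void)
{
        long step = 16L << 20;          /* current reclaim step: 16M */
        const double tolerance = 0.5;   /* acceptable "some avg10", in percent */

        for (;;) {
                double psi = psi_some_avg10();

                if (psi >= 0.0 && psi < tolerance) {
                        /* Workload is comfortable: trim, ramp up the step. */
                        reclaim(step);
                        if (step < (256L << 20))
                                step *= 2;
                } else if (step > (1L << 20)) {
                        /* Paging overhead above tolerance: back off. */
                        step /= 2;
                }
                sleep(10);
        }
}

An in-kernel per-cgroup worker would essentially internalize this
loop, without the userspace reaction-latency problem you describe.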
Seems promising, though I would like flexibility in how resources are
given to the per-cgroup worker. Are you planning to work on this, or
should I give it a try?
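
P.S. For the pressure-level knob itself, one form I can imagine is
reusing the PSI trigger syntax ("some|full <stall us> <window us>")
that memory.pressure already accepts for notifications. A purely
hypothetical sketch, with an invented file name, just to make the
delegation point concrete:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Invented knob: no such file exists today. */
        int fd = open("/sys/fs/cgroup/job/memory.reclaim.pressure", O_WRONLY);

        if (fd < 0)
                return 1;
        /* "Up to 150ms of memory stall per 1s window is tolerable." */
        dprintf(fd, "some 150000 1000000\n");
        close(fd);
        return 0;
}

The kernel worker would then drive reclaim only while the job stays
below that threshold, and could stand down when the writer goes away,
which would also cover your self-limiting point.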