From: Greg Thelen
Date: Tue, 24 Apr 2018 00:56:09 +0000
Subject: Re: [RFC PATCH 0/2] memory.low,min reclaim
To: guro@fb.com
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Vladimir Davydov,
	Tejun Heo, Cgroups, kernel-team@fb.com, Linux MM, LKML
In-Reply-To: <20180423103804.GA12648@castle.DHCP.thefacebook.com>
References: <20180320223353.5673-1-guro@fb.com>
	<20180422202612.127760-1-gthelen@google.com>
	<20180423103804.GA12648@castle.DHCP.thefacebook.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 23, 2018 at 3:38 AM Roman Gushchin wrote:

> Hi, Greg!
>
> On Sun, Apr 22, 2018 at 01:26:10PM -0700, Greg Thelen wrote:
> > Roman's previously posted memory.low,min patches add a per-memcg
> > effective low limit to detect overcommitment of parental limits.
> > But if we flip low,min reclaim to bail if usage < {low,min} at any
> > level, then we don't need an effective low limit, which makes the code
> > simpler. When parent limits are overcommitted, memory.min will oom kill,
> > which is more drastic but makes memory.low a simpler concept. If memcg
> > a/b wants oom kill before reclaim, then give it to them. It seems a bit
> > strange for a/b/memory.low's behaviour to depend on a/c/memory.low
> > (i.e. a/b.low is strong unless a/b.low + a/c.low exceed a.low).
>
> It's actually not strange: a/b and a/c are sharing a common resource:
> a/memory.low.
>
> Exactly as a/b/memory.max and a/c/memory.max are sharing a/memory.max.
> If there are sibling cgroups which are consuming memory, a cgroup can't
> exceed its parent's memory.max, even if its own memory.max is greater.
>
> > I think there might be a simpler way (albeit it doesn't yet include
> > Documentation):
> > - memcg: fix memory.low
> > - memcg: add memory.min
> >  3 files changed, 75 insertions(+), 6 deletions(-)
> >
> > The idea of this alternate approach is for memory.low,min to avoid
> > reclaim if any portion of the under-consideration memcg's ancestry is
> > under its respective limit.
>
> This approach has a significant downside: it breaks hierarchical
> constraints for memory.low/min. There are two important outcomes:
>
> 1) Any leaf's memory.low/min value is respected, even if the parent's
> value is lower or even 0. It's no longer possible to limit the amount of
> protected memory for a sub-tree. This is especially bad in the case of
> delegation.

As someone who has been using something like memory.min for a while, I have
cases where it needs to be a strong protection. Such jobs prefer oom kill to
reclaim. These jobs know they need X MB of memory. But I guess it's on me to
avoid configuring machines which overcommit memory.min at any cgroup level
all the way up to the root.

> 2) If a cgroup has an ancestor with usage under its memory.low/min, it
> gains that protection, even if its own memory.low/min is 0. So it becomes
> impossible to have unprotected cgroups in a protected sub-tree.

Fair point. One use case is a non-trivial job which has several
memory-accounting subcontainers. Is there a way to set memory.low only at
the top and have it offer protection to the entire job? The case I'm
thinking of is:

  % cd /cgroup
  % echo +memory > cgroup.subtree_control
  % mkdir top
  % echo +memory > top/cgroup.subtree_control
  % mkdir top/part1 top/part2
  % echo 1G > top/memory.min
  % (echo $BASHPID > top/part1/cgroup.procs && part1)
  % (echo $BASHPID > top/part2/cgroup.procs && part2)

Empirically it's been measured that the entire workload (/top) needs 1GB to
perform well. But we don't care how the memory is distributed between part1
and part2. Is the strategy for that to set memory.min to 1G on /top,
/top/part1, and /top/part2?
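In other words, presumably the workaround would be something like this
(an untested sketch of my reading of the proposal, reusing the paths and
the 1G value from the example above):

  % echo 1G > top/memory.min
  % echo 1G > top/part1/memory.min
  % echo 1G > top/part2/memory.min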
What do you think about exposing emin and elow to user space? I think that
would reduce admin/user confusion in situations where memory.min is
internally discounted.

(tangent) Delegation in v2 isn't something I've been able to fully
internalize yet. The "no interior processes" rule challenges my notion of
subdelegation. My current model is one where a system controller creates a
container C with C.min set and then starts a client manager process M in C.
M can then choose to further divide C's resources (e.g. C/S). This doesn't
seem possible because v2 doesn't allow for interior processes. So the system
manager would need to create C, set C.low, create C/sub_manager, create
C/sub_resources, set C/sub_manager.low, set C/sub_resources.low, and then
start M in C/sub_manager. sub_manager can then create and manage
C/sub_resources/S (see the sketch after the PS).

PS: Thanks for the memory.low and memory.min work. Regardless of how we
proceed, it's better than the upstream memory.soft_limit_in_bytes!
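For concreteness, the layout from the tangent above might look like the
following (an untested sketch: the 1G values and $MANAGER_PID are made-up
placeholders, and it assumes the memory controller is already enabled in the
root's cgroup.subtree_control):

  % mkdir C
  % echo 1G > C/memory.low                          # set C.low
  % echo +memory > C/cgroup.subtree_control
  % mkdir C/sub_manager C/sub_resources
  % echo 1G > C/sub_manager/memory.low
  % echo 1G > C/sub_resources/memory.low
  % echo $MANAGER_PID > C/sub_manager/cgroup.procs  # start M here
  # M, running in C/sub_manager, then manages the delegated sub-tree:
  % echo +memory > C/sub_resources/cgroup.subtree_control
  % mkdir C/sub_resources/S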