Date: Tue, 30 Jan 2018 09:50:13 +0100
From: Michal Hocko
To: David Rientjes
Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
    Tetsuo Handa, Tejun Heo, kernel-team@fb.com, cgroups@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
Message-ID: <20180130085013.GP21609@dhcp22.suse.cz>
References: <20180126171548.GB16763@dhcp22.suse.cz>

On Mon 29-01-18 14:38:02, David Rientjes wrote:
> On Fri, 26 Jan 2018, Michal Hocko wrote:
> 
> > > The cgroup aware oom killer is needlessly declared for the entire
> > > system by a mount option. It's unnecessary to force the system into
> > > a single oom policy: either cgroup aware, or the traditional
> > > process aware.
> > > 
> > > This patch introduces a memory.oom_policy tunable for all mem
> > > cgroups. It is currently a no-op: it can only be set to "none",
> > > which is its default policy. It will be expanded in the next patch
> > > to define cgroup aware oom killer behavior.
> > > 
> > > This is an extensible interface that can be used to define cgroup
> > > aware assessment of mem cgroup subtrees or the traditional process
> > > aware assessment.
> > 
> > So what is the actual semantic and scope of this policy? Does it
> > apply only down the hierarchy? Also, how do you compare cgroups with
> > different policies? Let's say you have
> > 
> >         root
> >        /  |  \
> >       A   B   C
> >      / \     / \
> >     D   E   F   G
> > 
> > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> 
> At each level of the hierarchy, memory.oom_policy compares immediate
> children; it's the only way that an admin can lock in a specific oom
> policy like "tree" and then delegate the subtree to the user. If
> you've configured it as above, comparing A and C should be the same
> based on the cumulative usage of their child mem cgroups.
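If I understand the described semantics correctly, the per-level
comparison amounts to something like the sketch below. This is purely
illustrative - the iterator and the function are made up, not taken
from the patch:

/*
 * Illustrative sketch only; all identifiers are made up.  Compare the
 * immediate children of one parent by their usage and pick the
 * largest.  The caller would then recurse into the winner according
 * to that child's own policy.
 */
static struct mem_cgroup *select_victim_child(struct mem_cgroup *parent)
{
	struct mem_cgroup *iter, *chosen = NULL;
	unsigned long chosen_usage = 0;

	for_each_mem_cgroup_child(iter, parent) { /* hypothetical iterator */
		/* v2 counters are hierarchical, so this covers the subtree */
		unsigned long usage = page_counter_read(&iter->memory);

		if (usage > chosen_usage) {
			chosen_usage = usage;
			chosen = iter;
		}
	}
	return chosen;
}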
So cgroup == tree if we are memcg aware OOM killing, right? Why do we
need both then? Just to make memcg aware OOM killing possible?

> The policy for B hasn't been specified, but since it does not have
> any children, "cgroup" and "tree" should be the same.

So now you have a killable cgroup selected by a process criterion?
That just doesn't make any sense. So I guess it would at least require
enforcing (cgroup || tree) to allow oom_group. But even then it
doesn't make much sense to me, because having a memcg killable or not
is an attribute of the _memcg_ rather than of the OOM context, no? In
other words, how much sense does it make to have B be an OOM entity or
not depending on whether this is a global OOM or a B OOM? Either the
workload running inside B can cope with a partial tear down or it
cannot. Or do you have an example when something like that would be
useful?

> > Now we have the global OOM killer to choose a victim. From a quick
> > glance over those patches, it seems that we will be comparing only
> > tasks because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. A, B and
> > C policies are ignored.
> 
> Right, a policy of "none" reverts its subtree back to per-process
> comparison if you are either not using the cgroup aware oom killer or
> your subtree is not using the cgroup aware oom killer.

So how are you going to compare "none" cgroups with those that
consider the full memcg or hierarchy (cgroup, tree)? Are you going to
consider oom_score_adj?

> > Moreover, if I select any of B's tasks then I will happily kill it,
> > breaking the expectation that the whole memcg will go away. Weird,
> > don't you think? Or did I misunderstand?
> 
> It's just as weird as the behavior of memory.oom_group today without
> using the mount option :)

Which is why oom_group returns -ENOTSUPP, so you simply cannot even
set any memcg as oom killable. And you do not have this weirdness.
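The gating presumably sits in the write handler itself, along these
lines. This is just a sketch of the shape - the mount option check is
a made-up helper and the real -mm code may differ in detail:

/*
 * Sketch only: without the cgroup aware selection enabled, the write
 * fails outright, so no memcg can ever become a group-kill target.
 */
static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
				      char *buf, size_t nbytes, loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	int oom_group, err;

	if (!cgroup_aware_oom_enabled())	/* hypothetical helper */
		return -ENOTSUPP;

	err = kstrtoint(strstrip(buf), 0, &oom_group);
	if (err)
		return err;
	if (oom_group != 0 && oom_group != 1)
		return -EINVAL;

	memcg->oom_group = oom_group;
	return nbytes;
}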
> In that case, mem_cgroup_select_oom_victim() always returns false
> and the value of memory.oom_group is ignored. I agree that it's weird
> in -mm and there's nothing preventing us from separating
> memory.oom_group from the cgroup aware oom killer and allowing it to
> be set regardless of a selection change.

It is not weird. I suspect you misunderstood the code and its
intention.

> If memory.oom_group is set, and the kill originates from that mem
> cgroup, kill all processes attached to it and its subtree.
> 
> This is a criticism of the current implementation in -mm, however;
> my extension only respects its weirdness.

> > So let's assume that root: cgroup. Then we are finally comparing
> > cgroups: D, E, B, C. Of those, D, E and F do not have any policy.
> > Do they inherit their policy from the parent? If they don't, then
> > we should be comparing their tasks separately, no? The code
> > disagrees, because once we are in the cgroup mode, we do not care
> > about separate tasks.
> 
> No, perhaps I wasn't clear in the documentation: the policy at each
> level of the hierarchy is specified by memory.oom_policy and compares
> its immediate children with that policy. So the per-cgroup usage of
> A, B, and C is compared regardless of A, B, and C's own oom policies.

You are still operating in terms of levels. That is rather confusing,
because we are operating on a _tree_, and the walk has to be
independent of the way we traverse that tree - i.e. whether we do DFS
or BFS ordering.

> > Let's say we choose C because it has the largest cumulative
> > consumption. It is not oom_group, so it will select a task from F
> > or G. Again, you are breaking the oom_group policy of G if you kill
> > a single task. So you would have to be recursive here. That sounds
> > fixable, though. Just be recursive.
> 
> I fully agree, but that's (another) implementation detail of what is
> in -mm that isn't specified. I think where you're going is the
> complete separation of mem cgroup selection from memory.oom_group. I
> agree, and we can fix that. memory.oom_group also shouldn't depend on
> any mount option; it can be set or unset depending on the properties
> of the workload.

Huh? oom_group is completely orthogonal to the selection strategy.
Full stop. I do not know how many times I have to repeat that.
oom_group defines who to kill from the target. It is completely
irrelevant how we have selected the target.
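To spell out the separation in code form (an illustrative sketch with
made-up helper names, not the -mm implementation):

/*
 * Step one: any selection strategy whatsoever picks the target.
 * Step two: oom_group, an attribute of the chosen memcg, only decides
 * the scope of the kill, never the choice of the target.
 */
static void oom_kill_selected(struct oom_control *oc)
{
	struct mem_cgroup *victim;

	victim = select_victim_memcg(oc);	/* hypothetical: any strategy */
	if (victim && victim->oom_group)
		/* the whole memcg is an indivisible OOM entity */
		mem_cgroup_scan_tasks(victim, oom_kill_memcg_member, NULL);
	else
		/* fall back to killing a single task */
		oom_kill_process(oc, "Out of memory");
}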
> > Then you say
> > 
> > > Another benefit of such an approach is that an admin can lock in
> > > a certain policy for the system or for a mem cgroup subtree and
> > > can delegate the policy decision to the user to determine if the
> > > kill should originate from a subcontainer, as indivisible memory
> > > consumers themselves, or selection should be done per process.
> > 
> > And the code indeed doesn't check oom_policy on each level of the
> > hierarchy, unless I am missing something. So the subgroup is simply
> > locked in to the oom_policy the parent has chosen. That is not the
> > case for the tree policy.
> > 
> > So look how we are comparing cumulative groups without policy with
> > groups with policy with subtrees. Either I have grossly
> > misunderstood something, or this is massively inconsistent and it
> > doesn't make much sense to me. A root memcg without the cgroup
> > policy will simply turn off the whole thing for the global OOM
> > case. So you really need to enable it there, but then it is not
> > really clear how to configure the lower levels.
> > 
> > From the above it seems that you are more interested in memcg OOMs
> > and want to give different hierarchies different policies, but you
> > quickly hit similar inconsistencies there as well.
> > 
> > I am not sure how extensible this actually is. How do we place
> > priorities on top?
> 
> If you're referring to strict priorities where one cgroup can be
> preferred or biased against regardless of usage, that would be an
> extension with yet another tunable.

How does that fit into the cgroup, tree, none policy model?

> Userspace influence over the selection is not addressed by this
> patchset, nor is the unfair comparison of the root mem cgroup with
> leaf mem cgroups. My suggestion previously was a memory.oom_value
> tunable, which is configured depending on its parent's
> memory.oom_policy. If that is "cgroup" or "tree", it behaves like
> oom_score_adj, i.e. it's an adjustment on the usage. This way, a
> subtree's usage can have a certain amount of memory discounted, for
> example, because it is supposed to use more than 50% of memory. If a
> new "priority" memory.oom_policy were implemented, which would be
> trivial, comparison between child cgroups would be as simple as
> comparing memory.oom_value integers.

What if we need to implement a "kill the youngest" policy? That
wouldn't really work out, right?
-- 
Michal Hocko
SUSE Labs