Date: Fri, 26 Jan 2018 18:15:48 +0100
From: Michal Hocko
To: David Rientjes
Cc: Andrew Morton, Roman Gushchin, Vladimir Davydov, Johannes Weiner,
	Tetsuo Handa, Tejun Heo, kernel-team@fb.com,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable
Message-ID: <20180126171548.GB16763@dhcp22.suse.cz>

On Thu 25-01-18 15:53:45, David Rientjes wrote:
> The cgroup aware oom killer is needlessly
> declared for the entire system by a mount option. It's unnecessary to
> force the system into a single oom policy: either cgroup aware, or the
> traditional process aware.
>
> This patch introduces a memory.oom_policy tunable for all mem cgroups.
> It is currently a no-op: it can only be set to "none", which is its
> default policy. It will be expanded in the next patch to define cgroup
> aware oom killer behavior.
>
> This is an extensible interface that can be used to define cgroup aware
> assessment of mem cgroup subtrees or the traditional process aware
> assessment.

So what are the actual semantics and scope of this policy? Does it apply
only down the hierarchy? Also, how do you compare cgroups with different
policies? Let's say you have

            root
           / |  \
          A  B   C
         / \    / \
        D   E  F   G

and assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1.

Now we have the global OOM killer to choose a victim. From a quick
glance over those patches, it seems that we will be comparing only tasks
because root->oom_policy != MEMCG_OOM_POLICY_CGROUP. The A, B and C
policies are ignored. Moreover, if I select any of B's tasks then I will
happily kill it, breaking the expectation that the whole memcg will go
away. Weird, don't you think? Or did I misunderstand?

So let's assume that root: cgroup. Then we are finally comparing
cgroups: D, E, B, C. Of those, D and E (and F down in C's subtree) do
not have any policy. Do they inherit their policy from the parent? If
they don't, then we should be comparing their tasks separately, no? The
code disagrees, because once we are in the cgroup mode we do not care
about separate tasks. Let's say we choose C because it has the largest
cumulative consumption. It is not oom_group, so it will select a task
from F, G. Again you are breaking the oom_group policy of G if you kill
a single task. So you would have to be recursive here. That sounds
fixable, though. Just be recursive (a toy model of what I mean is
appended below).

Then you say

> Another benefit of such an approach is that an admin can lock in a
> certain policy for the system or for a mem cgroup subtree and can
> delegate the policy decision to the user to determine if the kill
> should originate from a subcontainer, as indivisible memory consumers
> themselves, or selection should be done per process.

And the code indeed doesn't check oom_policy on each level of the
hierarchy, unless I am missing something. So the subgroup is simply
locked in to the oom_policy its parent has chosen. That is not the case
for the tree policy, though. So we end up comparing cumulative groups
without any policy against groups with a policy against whole subtrees.
Either I have grossly misunderstood something, or this is massively
inconsistent and it doesn't make much sense to me.

A root memcg without the cgroup policy will simply turn off the whole
thing for the global OOM case. So you really need to enable it there,
but then it is not really clear how to configure the lower levels. From
the above it seems that you are more interested in memcg OOMs and want
to give different hierarchies different policies, but you quickly hit
similar inconsistencies there as well.

I am not sure how extensible this is, actually. How do we place
priorities on top?

> Signed-off-by: David Rientjes

-- 
Michal Hocko
SUSE Labs
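
The toy model referenced above: plain, self-contained userspace C, not
the kernel code -- every type, helper and number here is made up for
illustration. It encodes the example hierarchy with root: cgroup and
shows that applying the same selection recursively inside the chosen
group is what lets G's oom_group be honoured instead of a random task
being picked over its head.

#include <stdio.h>

enum policy { POL_NONE, POL_CGROUP, POL_TREE };

struct memcg {
	const char *name;
	enum policy policy;
	int oom_group;		/* kill the whole group, not one task */
	long usage;		/* pages charged to this memcg alone */
	struct memcg **child;
	int nr_children;
};

/* usage of the whole subtree rooted at @m */
static long cumulative(const struct memcg *m)
{
	long sum = m->usage;

	for (int i = 0; i < m->nr_children; i++)
		sum += cumulative(m->child[i]);
	return sum;
}

/*
 * Build the comparison set under @m.  A child which is itself in the
 * "cgroup" policy contributes its own children instead of being
 * compared as a unit -- this is how {D, E, B, C} falls out of the
 * example above.
 */
static void gather(struct memcg *m, struct memcg **set, int *n)
{
	for (int i = 0; i < m->nr_children; i++) {
		struct memcg *c = m->child[i];

		if (c->policy == POL_CGROUP)
			gather(c, set, n);
		else
			set[(*n)++] = c;
	}
}

static struct memcg *select_victim(struct memcg *m)
{
	struct memcg *set[16], *best = NULL;
	int n = 0;

	gather(m, set, &n);
	for (int i = 0; i < n; i++)
		if (!best || cumulative(set[i]) > cumulative(best))
			best = set[i];
	if (!best)
		return m;	/* leaf: pick a task (or all, if oom_group) */
	/*
	 * The recursion argued for above: unless the chosen group is an
	 * indivisible oom_group, keep applying the same selection inside
	 * it, so an inner oom_group such as G is honoured.
	 */
	if (!best->oom_group && best->nr_children)
		return select_victim(best);
	return best;
}

int main(void)
{
	struct memcg D = { "D", POL_NONE, 0, 100, NULL, 0 };
	struct memcg E = { "E", POL_NONE, 0, 200, NULL, 0 };
	struct memcg F = { "F", POL_NONE, 0, 300, NULL, 0 };
	struct memcg G = { "G", POL_NONE, 1, 500, NULL, 0 };
	struct memcg *ac[] = { &D, &E }, *cc[] = { &F, &G };
	struct memcg A = { "A", POL_CGROUP, 0, 0, ac, 2 };
	struct memcg B = { "B", POL_NONE, 1, 400, NULL, 0 };
	struct memcg C = { "C", POL_TREE, 0, 0, cc, 2 };
	struct memcg *rc[] = { &A, &B, &C };
	struct memcg root = { "root", POL_CGROUP, 0, 0, rc, 3 };

	struct memcg *victim = select_victim(&root);

	printf("victim: %s (%s)\n", victim->name,
	       victim->oom_group ? "whole group" : "one task");
	return 0;
}

With the sample numbers above this prints "victim: G (whole group)": C
wins on cumulative usage, and recursing into it hands the kill to G as
a unit rather than to a lone task from F or G.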
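
One more aside, on the interface itself. Driven from userspace, the
knob in the quoted description would presumably look like the sketch
below; the cgroup2 mount point and the group name "A" are assumptions
on my part, and per the quoted text "none" is the only value this
patch accepts.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* assumed path: cgroup2 mounted at /sys/fs/cgroup, memcg "A" */
	const char *path = "/sys/fs/cgroup/A/memory.oom_policy";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	/* "none" is the default and, in this patch, the only valid value */
	if (write(fd, "none", strlen("none")) != (ssize_t)strlen("none"))
		perror("write");
	close(fd);
	return 0;
}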