Date: Tue, 19 Sep 2017 13:51:25 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Roman Gushchin, linux-mm@kvack.org, Vladimir Davydov, Johannes Weiner,
    Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team@fb.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [v8 0/4] cgroup-aware OOM killer

On Mon, 18 Sep 2017, Michal Hocko wrote:

> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > >
> > > >          root
> > > >         /    \
> > > >        A      D
> > > >       / \
> > > >      B   C
> > > >
> > > > is a different thing than
> > > >
> > > >          root
> > > >         / | \
> > > >        B  C  D
> > > >
> > >
> > > I actually don't have a strong argument against an approach that
> > > selects the largest leaf or kill-all-set memcg.  I think in practice
> > > there will not be much difference.
> > >
> > > The only real concern I have is that then we have to do the same with
> > > oom_priorities (select the largest priority tree-wide), and this will
> > > limit the ability to enforce the priority by the parent cgroup.
> > >
> >
> > Yes, oom_priority cannot select the largest priority tree-wide for
> > exactly that reason.  We need the ability to control from which subtree
> > the kill occurs in ancestor cgroups.  If multiple jobs are allocated
> > their own cgroups and they can own memory.oom_priority for their own
> > subcontainers, this becomes quite powerful so they can define their own
> > oom priorities.  Otherwise, they can easily override the oom priorities
> > of other cgroups.
>
> Could you be more specific about your usecase?  What would be the
> problem if we allowed only increasing the priority in children (like
> other hierarchical controls)?
>

For memcg-constrained oom conditions, there is only a theoretical issue:
if the subtree is not under the control of a single user, the various
users can alter their priorities without knowing the priorities of other
children in the same subtree that is oom, or those values can change
without a child's knowledge.  I don't know of anybody who configures
memory cgroup hierarchies that way, though.

The problem is more obvious in system oom conditions.  If we have two
top-level memory cgroups with the same "job" priority, they get the same
oom priority.  The user who configures subcontainers is now always
targeted for oom kill under an "increase priority in children" policy.
The hierarchy becomes this:

	         root
	        /    \
	       A      D
	      / \   / | \
	     B   C E  F  G

where A/memory.oom_priority == D/memory.oom_priority.

D wants to kill in the order E -> F -> G, but can't configure that if
B = A - 1 and C = B - 1.  It also shouldn't need to adjust its own oom
priorities based on a hierarchy outside its control, one that can change
at any time at the discretion of the user (with namespaces you may not
even be able to access it).

But also, if A/memory.oom_priority = D/memory.oom_priority - 100, A is
preferred unless its subcontainers configure themselves in a way where
they have higher oom priority values than E, F, and G.  That may yield
very different results when additional jobs get scheduled on the system
(an H tree) where the user has full control over their own oom
priorities, even when the value must only increase.
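
To make the ordering problem concrete, here is a small sketch.  It is
illustrative only and not code from the patchset: the rule that the
cgroup with the highest memory.oom_priority value is selected, the choice
between a level-by-level walk and a tree-wide leaf comparison, and all of
the numbers are assumptions made up for the example.

    # Illustrative model only, not from the patchset.  Assumed rule: the
    # victim is the cgroup with the highest memory.oom_priority value.
    # Whether siblings are compared level by level or all leaves are
    # compared tree-wide is exactly the policy question above.

    tree = {"root": ["A", "D"], "A": ["B", "C"], "D": ["E", "F", "G"]}

    prio = {
        "A": 5, "D": 5,          # same "job" priority at the top level
        "B": 6, "C": 7,          # chosen by A's owner, invisible to D's owner
        "E": 8, "F": 7, "G": 6,  # D's owner wants the order E -> F -> G
    }

    def pick_hierarchical(node="root"):
        # Compare only siblings at each level and descend into the winner.
        # A tie at the top level (A == D here) would have to be broken by
        # some other means, e.g. usage; max() arbitrarily picks the first.
        while node in tree:
            node = max(tree[node], key=prio.get)
        return node

    def pick_treewide():
        # Compare every leaf in the system directly.
        leaves = [name for name in prio if name not in tree]
        return max(leaves, key=prio.get)

    print(pick_hierarchical())  # "C": the A vs D tie is broken arbitrarily
                                # here, but whichever subtree the ancestor
                                # picks, its owner controls the order inside it
    print(pick_treewide())      # "E", but only because 8 happens to beat B
                                # and C; the E -> F -> G order now depends on
                                # values chosen inside A's subtree

With the level-by-level walk, D's internal kill order is self-contained;
with the flat comparison it is only as good as D's knowledge of the rest
of the tree, which is the concern above.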