Date: Tue, 19 Sep 2017 13:54:48 -0700 (PDT)
From: David Rientjes
To: Roman Gushchin
Cc: Michal Hocko, linux-mm@kvack.org, Vladimir Davydov, Johannes Weiner,
    Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team@fb.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [v8 0/4] cgroup-aware OOM killer
In-Reply-To: <20170915210807.GA5238@castle>
References: <20170911131742.16482-1-guro@fb.com>
    <20170913122914.5gdksbmkolum7ita@dhcp22.suse.cz>
    <20170913215607.GA19259@castle>
    <20170914134014.wqemev2kgychv7m5@dhcp22.suse.cz>
    <20170914160548.GA30441@castle>
    <20170915105826.hq5afcu2ij7hevb4@dhcp22.suse.cz>
    <20170915152301.GA29379@castle>
    <20170915210807.GA5238@castle>

On Fri, 15 Sep 2017, Roman Gushchin wrote:

> > > > But then you just enforce a structural restriction on your
> > > > configuration, because
> > > >
> > > >         root
> > > >        /    \
> > > >       A      D
> > > >      / \
> > > >     B   C
> > > >
> > > > is a different thing than
> > > >
> > > >         root
> > > >        / | \
> > > >       B  C  D
> > > >
> > >
> > > I actually don't have a strong argument against an approach that
> > > selects the largest leaf or kill-all-set memcg. I think in practice
> > > there will not be much difference.
> > >
> > > The only real concern I have is that we would then have to do the
> > > same with oom_priorities (select the largest priority tree-wide),
> > > and this would limit the ability to enforce the priority from the
> > > parent cgroup.
> > >
> >
> > Yes, oom_priority cannot select the largest priority tree-wide for
> > exactly that reason. We need the ability to control from which
> > subtree the kill occurs in ancestor cgroups. If multiple jobs are
> > allocated their own cgroups and they can own memory.oom_priority for
> > their own subcontainers, this becomes quite powerful so they can
> > define their own oom priorities. Otherwise, they can easily override
> > the oom priorities of other cgroups.
>
> I believe it's a solvable problem: we can require CAP_SYS_RESOURCE to
> set the oom_priority below the parent's value, or something like this.
>
> But it looks more complex, and I'm not sure there are real examples
> where we have to compare memcgs that are on different levels (or in
> different subtrees).

It's actually much more complex, because in our environment we'd need an
"activity manager" with CAP_SYS_RESOURCE to control the oom priorities
of user subcontainers, whereas today it need only be concerned with
top-level memory cgroups. Users can create their own hierarchies with
their own oom priorities at will; that doesn't alter the selection
heuristic for any other user running on the same system, and it gives
them full control over the selection in their own subtree. We shouldn't
need a system-wide daemon with CAP_SYS_RESOURCE to manage subcontainers
when nothing else requires it.
I believe it's also much easier to document: oom_priority is compared
among all sibling cgroups at each level of the hierarchy, and the
selection iterates into the cgroup with the lowest priority value.
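To make that rule concrete, here is a minimal userspace sketch of the
per-level descent. It is only illustrative, not the patchset's actual
implementation; the struct layout, the NULL-terminated child arrays,
and the tie-break on larger usage are assumptions made for the example.

#include <stddef.h>
#include <stdio.h>

struct cgroup {
	const char *name;
	int oom_priority;		/* lower value = preferred victim */
	unsigned long usage;		/* cumulative usage of the subtree */
	struct cgroup **children;	/* NULL-terminated, or NULL for a leaf */
};

/* Among siblings only, pick the lowest oom_priority; ties are broken
 * here (as an assumption) in favor of the larger usage. */
static struct cgroup *pick_sibling(struct cgroup **siblings)
{
	struct cgroup *victim = NULL;
	size_t i;

	for (i = 0; siblings[i]; i++) {
		struct cgroup *c = siblings[i];

		if (!victim || c->oom_priority < victim->oom_priority ||
		    (c->oom_priority == victim->oom_priority &&
		     c->usage > victim->usage))
			victim = c;
	}
	return victim;
}

/* Descend from the root one level at a time until a leaf is reached. */
static struct cgroup *select_victim(struct cgroup *root)
{
	struct cgroup *c = root;

	while (c->children && c->children[0])
		c = pick_sibling(c->children);
	return c;
}

int main(void)
{
	/* The nested layout quoted above: root -> { A -> { B, C }, D } */
	struct cgroup b = { "B", 3, 50, NULL };
	struct cgroup c = { "C", 1, 40, NULL };
	struct cgroup *a_children[] = { &b, &c, NULL };
	struct cgroup a = { "A", 1, 90, a_children };
	struct cgroup d = { "D", 2, 60, NULL };
	struct cgroup *root_children[] = { &a, &d, NULL };
	struct cgroup root = { "root", 0, 150, root_children };

	/* Level 1 compares only A and D (A wins on priority); level 2
	 * compares only B and C (C wins). D's priority never competes
	 * with B's or C's -- the comparison is never tree-wide. */
	printf("victim: %s\n", select_victim(&root)->name);

	return 0;
}

This prints "victim: C". In the flat layout quoted above
(root -> { B, C, D }), all three priorities would instead be compared
at the same level, which is exactly the structural difference being
discussed, and it is why delegating memory.oom_priority inside a
subtree cannot affect the selection outside it.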