Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752195AbdIVUxw (ORCPT ); Fri, 22 Sep 2017 16:53:52 -0400 Received: from mail-pf0-f177.google.com ([209.85.192.177]:53796 "EHLO mail-pf0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752070AbdIVUxt (ORCPT ); Fri, 22 Sep 2017 16:53:49 -0400 X-Google-Smtp-Source: AOwi7QCOwSRNRbv7Si15JWFP6fHjUtEg7ndn2Ovrq/7XcFcpmxGHu0x/Rv4AH0Ajm9NSJyM8p44MUA== Date: Fri, 22 Sep 2017 13:53:47 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Johannes Weiner cc: Roman Gushchin , linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Tetsuo Handa , Andrew Morton , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [v8 0/4] cgroup-aware OOM killer In-Reply-To: <20170921215103.GA23772@cmpxchg.org> Message-ID: References: <20170911131742.16482-1-guro@fb.com> <20170921142107.GA20109@cmpxchg.org> <20170921215103.GA23772@cmpxchg.org> User-Agent: Alpine 2.10 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3777 Lines: 73 On Thu, 21 Sep 2017, Johannes Weiner wrote: > > The issue is that if you opt-in to the new feature, then you are forced to > > change /proc/pid/oom_score_adj of all processes attached to a cgroup that > > you do not want oom killed based on size to be oom disabled. > > You're assuming that most people would want to influence the oom > behavior in the first place. I think the opposite is the case: most > people don't care as long as the OOM killer takes the intent the user > has expressed wrt runtime containerization/grouping into account. > If you do not want to influence the oom behavior, do not change memory.oom_priority from its default. It's that simple. > > The kernel provides no other remedy without oom priorities since the > > new feature would otherwise disregard oom_score_adj. > > As of v8, it respects this setting and doesn't kill min score tasks. > That's the issue. To protect a memory cgroup from being oom killed in a system oom condition, you need to change oom_score_adj of *all* processes attached to be oom disabled. Then, you have a huge problem in memory cgroup oom conditions because nothing can be killed in that hierarchy itself. > > The patchset compares memory cgroup size relative to sibling cgroups only, > > the same comparison for memory.oom_priority. There is a guarantee > > provided on how cgroup size is compared in select_victim_memcg(), it > > hierarchically accumulates the "size" from leaf nodes up to the root memcg > > and then iterates the tree comparing sizes between sibling cgroups to > > choose a victim memcg. That algorithm could be more elaborately described > > in the documentation, but we simply cannot change the implementation of > > select_victim_memcg() later even without oom priorities since users cannot > > get inconsistent results after opting into a feature between kernel > > versions. I believe the selection criteria should be implemented to be > > deterministic, as select_victim_memcg() does, and the documentation should > > fully describe what the selection criteria is, and then allow the user to > > decide. > > I wholeheartedly disagree. We have changed the behavior multiple times > in the past. In fact, you have arguably done the most drastic changes > to the algorithm since the OOM killer was first introduced. E.g. > > a63d83f427fb oom: badness heuristic rewrite > > And that's completely fine. Because this thing is not a resource > management tool for userspace, it's the kernel saving itself. At best > in a manner that's not too surprising to userspace. > When I did that, I had to add /proc/pid/oom_score_adj to allow userspace to influence selection. We came up with /proc/pid/oom_score_adj when working with kde, openssh, chromium, and udev because they cared about the ability to influence the decisionmaking. I'm perfectly happy with the new heuristic presented in this patchset, I simply want userspace to be able to influence it, if it desires. Requiring userspace to set all processes to be oom disabled to protect a hierarchy is totally and completely broken. It livelocks the memory cgroup if it is oom itself. > To me, your argument behind the NAK still boils down to "this doesn't > support my highly specialized usecase." But since it doesn't prohibit > your usecase - which isn't even supported upstream, btw - this really > doesn't carry much weight. > > I'd say if you want configurability on top of Roman's code, please > submit patches and push the case for these in a separate effort. > Roman implemented memory.oom_priority himself, it has my Tested-by, and it allows users who want to protect high priority memory cgroups from using the size based comparison for all other cgroups that we very much desire.