Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752754AbdHOVrR (ORCPT ); Tue, 15 Aug 2017 17:47:17 -0400 Received: from mail-pg0-f44.google.com ([74.125.83.44]:37896 "EHLO mail-pg0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752694AbdHOVrO (ORCPT ); Tue, 15 Aug 2017 17:47:14 -0400 Date: Tue, 15 Aug 2017 14:47:10 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Roman Gushchin cc: linux-mm@kvack.org, Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer In-Reply-To: <20170815121558.GA15892@castle.dhcp.TheFacebook.com> Message-ID: References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com> User-Agent: Alpine 2.10 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3088 Lines: 61 On Tue, 15 Aug 2017, Roman Gushchin wrote: > > I'm curious about the decision made in this conditional and how > > oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means > > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it > > should otherwise be disabled. > > > > It's undocumented in the changelog, but I'm questioning whether it's the > > right decision. Doesn't it make sense to kill all tasks that are not oom > > disabled, and allow the user to still protect certain processes by their > > /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that > > protection without a sibling memcg and its own reservation of memory. I'm > > thinking about a process that governs jobs inside the memcg and if there > > is an oom kill, it wants to do logging and any cleanup necessary before > > exiting itself. It seems like a powerful combination if coupled with oom > > notification. > > Good question! > I think, that an ability to override any oom_score_adj value and get all tasks > killed is more important, than an ability to kill all processes with some > exceptions. > I'm disagreeing because getting all tasks killed is not necessarily something that only the kernel can do. If all processes are oom disabled, that's a configuration issue done by sysadmin and the kernel should decide to kill the next largest memory cgroup or lower priority memory cgroup. It's not killing things like sshd that intentionally oom disable themselves. You could argue that having an oom disabled process attached to these memcgs in the first place is also a configuration issue, but the problem is that in cgroup v2 with a restriction on processes only being attached at the leaf cgroups that there is no competition for memory in this case. I must assign memory resources to that sshd, or "Activity Manager" described by the cgroup v1 documentation, just to prevent it from being killed. I think the latter of what you describe, killing all processes with some exceptions, is actually quite powerful. I can guarantee that processes that set themselves to oom disabled are really oom disabled and I don't need to work around that in the cgroup hierarchy only because of this caveat. I can also oom disable my Activity Manger that wants to wait on oom notification and collect the oom kill logs, raise notifications, and perhaps restart the process that it manages. > In your example someone still needs to look after the remaining process, > and kill it after some timeout, if it will not quit by itself, right? > No, it can restart the process that was oom killed; or it can be sshd and I can still ssh into my machine. > The special treatment of the -1000 value (without oom_kill_all_tasks) > is required only to not to break the existing setups. > I think as a general principle that allowing an oom disabled process to be oom killed is incorrect and if you really do want these to be killed, then (1) either your oom_score_adj is already wrong or (2) you can wait on oom notification and exit.