Date: Tue, 15 Aug 2017 14:47:10 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Roman Gushchin <guro@fb.com>
cc: linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
        Tejun Heo <tj@kernel.org>, kernel-team@fb.com, cgroups@vger.kernel.org,
        linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer
In-Reply-To: <20170815121558.GA15892@castle.dhcp.TheFacebook.com>
Message-ID: <alpine.DEB.2.10.1708151435290.104516@chino.kir.corp.google.com>
References: <20170814183213.12319-1-guro@fb.com> <20170814183213.12319-3-guro@fb.com> <alpine.DEB.2.10.1708141532300.63207@chino.kir.corp.google.com> <20170815121558.GA15892@castle.dhcp.TheFacebook.com>
User-Agent: Alpine 2.10 (DEB 1266 2009-07-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3088
Lines: 61

On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > I'm curious about the decision made in this conditional and how 
> > oom_kill_memcg_member() ignores task->signal->oom_score_adj.  It means 
> > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it 
> > should otherwise be disabled.
> > 
> > It's undocumented in the changelog, but I'm questioning whether it's the 
> > right decision.  Doesn't it make sense to kill all tasks that are not oom 
> > disabled, and allow the user to still protect certain processes by their 
> > /proc/pid/oom_score_adj setting?  Otherwise, there's no way to do that 
> > protection without a sibling memcg and its own reservation of memory.  I'm 
> > thinking about a process that governs jobs inside the memcg and if there 
> > is an oom kill, it wants to do logging and any cleanup necessary before 
> > exiting itself.  It seems like a powerful combination if coupled with oom 
> > notification.
> 
> Good question!
> I think, that an ability to override any oom_score_adj value and get all tasks
> killed is more important, than an ability to kill all processes with some
> exceptions.
> 

I'm disagreeing because getting all tasks killed is not necessarily 
something that only the kernel can do.  If all processes are oom disabled, 
that's a configuration issue done by sysadmin and the kernel should decide 
to kill the next largest memory cgroup or lower priority memory cgroup.  
It's not killing things like sshd that intentionally oom disable 
themselves.

You could argue that having an oom disabled process attached to these 
memcgs in the first place is also a configuration issue, but the problem 
is that in cgroup v2 with a restriction on processes only being attached 
at the leaf cgroups that there is no competition for memory in this case.  
I must assign memory resources to that sshd, or "Activity Manager" 
described by the cgroup v1 documentation, just to prevent it from being 
killed.

I think the latter of what you describe, killing all processes with some 
exceptions, is actually quite powerful.  I can guarantee that processes 
that set themselves to oom disabled are really oom disabled and I don't 
need to work around that in the cgroup hierarchy only because of this 
caveat.  I can also oom disable my Activity Manger that wants to wait on 
oom notification and collect the oom kill logs, raise notifications, and 
perhaps restart the process that it manages.

> In your example someone still needs to look after the remaining process,
> and kill it after some timeout, if it will not quit by itself, right?
> 

No, it can restart the process that was oom killed; or it can be sshd and 
I can still ssh into my machine.

> The special treatment of the -1000 value (without oom_kill_all_tasks)
> is required only to not to break the existing setups.
> 

I think as a general principle that allowing an oom disabled process to be 
oom killed is incorrect and if you really do want these to be killed, then 
(1) either your oom_score_adj is already wrong or (2) you can wait on oom 
notification and exit.