Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753881AbeAJTdv (ORCPT + 1 other); Wed, 10 Jan 2018 14:33:51 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:49946 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753682AbeAJTds (ORCPT ); Wed, 10 Jan 2018 14:33:48 -0500 Date: Wed, 10 Jan 2018 11:33:45 -0800 From: Andrew Morton To: Roman Gushchin Cc: David Rientjes , , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , , , , , Subject: Re: [PATCH v13 0/7] cgroup-aware OOM killer Message-Id: <20180110113345.54dd571967fd6e70bfba68c3@linux-foundation.org> In-Reply-To: <20180110131143.GB26913@castle.DHCP.thefacebook.com> References: <20171130152824.1591-1-guro@fb.com> <20171130123930.cf3217c816fd270fa35a40cb@linux-foundation.org> <20180110131143.GB26913@castle.DHCP.thefacebook.com> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Return-Path: On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin wrote: > Hello, David! > > On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote: > > On Thu, 30 Nov 2017, Andrew Morton wrote: > > > > > > This patchset makes the OOM killer cgroup-aware. > > > > > > Thanks, I'll grab these. > > > > > > There has been controversy over this patchset, to say the least. I > > > can't say that I followed it closely! Could those who still have > > > reservations please summarise their concerns and hopefully suggest a > > > way forward? > > > > > > > Yes, I'll summarize what my concerns have been in the past and what they > > are wrt the patchset as it stands in -mm. None of them originate from my > > current usecase or anticipated future usecase of the oom killer for > > system-wide or memcg-constrained oom conditions. They are based purely on > > the patchset's use of an incomplete and unfair heuristic for deciding > > which cgroup to target. > > > > I'll also suggest simple changes to the patchset, which I have in the > > past, that can be made to address all of these concerns. > > > > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups > > > > The patchset uses two different heuristics to compare root and leaf mem > > cgroups and scores them based on number of pages. For the root mem > > cgroup, it totals the /proc/pid/oom_score of all processes attached: > > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj. > > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, > > unreclaimable slab, kernel stack, and swap counters. These can be wildly > > different independent of /proc/pid/oom_score_adj, but the most obvious > > unfairness comes from users who tune oom_score_adj. > > > > An example: start a process that faults 1GB of anonymous memory and leave > > it attached to the root mem cgroup. Start six more processes that each > > fault 1GB of anonymous memory and attached them to a leaf mem cgroup. Set > > all processes to have /proc/pid/oom_score_adj of 1000. System oom kill > > will always kill the 1GB process attached to the root mem cgroup. It's > > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to > > evaluate the root mem cgroup, and leaf mem cgroups completely disregard > > it. > > > > In this example, the leaf mem cgroup's score is 1,573,044, the number of > > pages for the 6GB of faulted memory. The root mem cgroup's score is > > 12,652,907, eight times larger even though its usage is six times smaller. > > > > This is caused by the patchset disregarding oom_score_adj entirely for > > leaf mem cgroups and relying on it heavily for the root mem cgroup. It's > > the complete opposite result of what the cgroup aware oom killer > > advertises. > > > > It also works the other way, if a large memory hog is attached to the root > > mem cgroup but has a negative oom_score_adj it is never killed and random > > processes are nuked solely because they happened to be attached to a leaf > > mem cgroup. This behavior wrt oom_score_adj is completely undocumented, > > so I can't presume that it is either known nor tested. > > > > Solution: compare the root mem cgroup and leaf mem cgroups equally with > > the same criteria by doing hierarchical accounting of usage and > > subtracting from total system usage to find root usage. > > I find this problem quite minor, because I haven't seen any practical problems > caused by accounting of the root cgroup memory. > If it's a serious problem for you, it can be solved without switching to the > hierarchical accounting: it's possible to sum up all leaf cgroup stats and > substract them from global values. So, it can be a relatively small enhancement > on top of the current mm tree. This has nothing to do with global victim selection > approach. It sounds like a significant shortcoming to me - the oom-killing decisions which David describes are clearly incorrect? If this can be fixed against the -mm patchset with a "relatively small enhancement" then please let's get that done so it can be reviewed and tested. > > > > 2. Evading the oom killer by attaching processes to child cgroups > > > > Any cgroup on the system can attach all their processes to individual > > child cgroups. This is functionally the same as doing > > > > for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done > > > > without the no internal process constraint introduced with cgroup v2. All > > child cgroups are evaluated based on their own usage: all anon, > > unevictable, and unreclaimable slab as described previously. It requires > > an individual cgroup to be the single largest consumer to be targeted by > > the oom killer. > > > > An example: allow users to manage two different mem cgroup hierarchies > > limited to 100GB each. User A uses 10GB of memory and user B uses 90GB of > > memory in their respective hierarchies. On a system oom condition, we'd > > expect at least one process from user B's hierarchy would always be oom > > killed with the cgroup aware oom killer. In fact, the changelog > > explicitly states it solves an issue where "1) There is no fairness > > between containers. A small container with few large processes will be > > chosen over a large one with huge number of small processes." > > > > The opposite becomes true, however, if user B creates child cgroups and > > distributes its processes such that each child cgroup's usage never > > exceeds 10GB of memory. This can either be done intentionally to > > purposefully have a low cgroup memory footprint to evade the oom killer or > > unintentionally with cgroup v2 to allow those individual processes to be > > constrained by other cgroups in a single hierarchy model. User A, using > > 10% of his memory limit, is always oom killed instead of user B, using 90% > > of his memory limit. > > > > Others have commented its still possible to do this with a per-process > > model if users split their processes into many subprocesses with small > > memory footprints. > > > > Solution: comparing cgroups must be done hierarchically. Neither user A > > nor user B can evade the oom killer because targeting is done based on the > > total hierarchical usage rather than individual cgroups in their > > hierarchies. > > We've discussed this a lot. > Hierarchical approach has their own issues, which we've discussed during > previous iterations of the patchset. If you know how to address them > (I've no idea), please, go on and suggest your version. Well, if a hierarchical approach isn't a workable fix for the problem which David has identified then what *is* the fix? > > > > 3. Userspace has zero control over oom kill selection in leaf mem cgroups > > > > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes > > from the oom killer, the cgroup aware oom killer does not provide any > > solution for the user to protect leaf mem cgroups. This is a result of > > leaf mem cgroups being evaluated based on their anon, unevictable, and > > unreclaimable slab usage and disregarding any user tunable. > > > > Absent the cgroup aware oom killer, users have the ability to strongly > > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or > > strongly bias against a process (/proc/pid/oom_score_adj = -999). > > > > An example: a process knows its going to use a lot of memory, so it sets > > /proc/self/oom_score_adj to 1000. It wants to be killed first to avoid > > distrupting any other process. If it's attached to the root mem cgroup, > > it will be oom killed. If it's attached to a leaf mem cgroup by an admin > > outside its control, it will never be oom killed unless that cgroup's > > usage is the largest single cgroup usage on the system. The reverse also > > is true for processes that the admin does not want to be oom killed: set > > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its > > cgroup has the highest usage on the system. > > > > The result is that both admins and users have lost all control over which > > processes are oom killed. They are left with only one alternative: set > > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom > > kill. It doesn't address the issue at all for memcg-constrained oom > > conditions since no processes are killable anymore, and risks panicking > > the system if it is the only process left on the system. A process > > preferring that it is first in line for oom kill simply cannot volunteer > > anymore. > > > > Solution: allow users and admins to control oom kill selection by > > introducing a memory.oom_score_adj to affect the oom score of that mem > > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score > > of a process. > > The per-process oom_score_adj interface is not the nicest one, and I'm not > sure we want to replicate it on cgroup level as is. If you have an idea of how > it should look like, please, propose a patch; otherwise it's hard to discuss > it without the code. It does make sense to have some form of per-cgroup tunability. Why is the oom_score_adj approach inappropriate and what would be better? How hard is it to graft such a thing onto the -mm patchset? > > > > > > I proposed a solution in > > https://marc.info/?l=linux-kernel&m=150956897302725, which was never > > responded to, for all of these issues. The idea is to do hierarchical > > accounting of mem cgroup hierarchies so that the hierarchy is traversed > > comparing total usage at each level to select target cgroups. Admins and > > users can use memory.oom_score_adj to influence that decisionmaking at > > each level. > > > > This solves #1 because mem cgroups can be compared based on the same > > classes of memory and the root mem cgroup's usage can be fairly compared > > by subtracting top-level mem cgroup usage from system usage. All of the > > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide > > counterpart that can be used to do the simple subtraction. > > > > This solves #2 because evaluation is done hierarchically so that > > distributing processes over a set of child cgroups either intentionally > > or unintentionally no longer evades the oom killer. Total usage is always > > accounted to the parent and there is no escaping this criteria for users. > > > > This solves #3 because it allows admins to protect important processes in > > cgroups that are supposed to use, for example, 75% of system memory > > without it unconditionally being selected for oom kill but still oom kill > > if it exceeds a certain threshold. In this sense, the cgroup aware oom > > killer, as currently implemented, is selling mem cgroups short by > > requiring the user to accept that the important process will be oom killed > > iff it uses mem cgroups and isn't attached to root. It also allows users > > to actually volunteer to be oom killed first without majority usage. > > > > It has come up time and time again that this support can be introduced on > > top of the cgroup oom killer as implemented. It simply cannot. For > > admins and users to have control over decisionmaking, it needs a > > oom_score_adj type tunable that cannot change semantics from kernel > > version to kernel version and without polluting the mem cgroup filesystem. > > That, in my suggestion, is an adjustment on the amount of total > > hierarchical usage of each mem cgroup at each level of the hierarchy. > > That requires that the heuristic uses hierarchical usage rather than > > considering each cgroup as independent consumers as it does today. We > > need to implement that heuristic and introduce userspace influence over > > oom kill selection now rather than later because its implementation > > changes how this patchset is implemented. > > > > I can implement these changes, if preferred, on top of the current > > patchset, but I do not believe we want inconsistencies between kernel > > versions that introduce user visible changes for the sole reason that this > > current implementation is incomplete and unfair. We can implement and > > introduce it once without behavior changing later because the core > > heuristic has necessarily changed. > > David, I _had_ hierarchical accounting implemented in one of the previous > versions of this patchset. And there were _reasons_, why we went away from it. Can you please summarize those issues for my understanding? > You can't just ignore them and say that "there is a simple solution, which > Roman is not responding". If you know how to address these issues and > convince everybody that hierarchical approach is a way to go, please, > go on and send your version of the patchset. > > Thanks! > > Roman