From: Tim Hockin
Date: Wed, 27 Sep 2017 11:11:42 -0700
Subject: Re: [v8 0/4] cgroup-aware OOM killer
To: Roman Gushchin
Cc: Michal Hocko, Johannes Weiner, Tejun Heo, kernel-team@fb.com,
	David Rientjes, linux-mm@kvack.org, Vladimir Davydov, Tetsuo Handa,
	Andrew Morton, Cgroups, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20170927162300.GA5623@castle.DHCP.thefacebook.com>

On Wed, Sep 27, 2017 at 9:23 AM, Roman Gushchin wrote:
> On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote:
>> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko wrote:
>> > On Tue 26-09-17 20:37:37, Tim Hockin wrote:
>> > [...]
>> >> I feel like David has offered examples here, and many of us at Google
>> >> have offered examples as long ago as 2013 (if I recall) of cases where
>> >> the proposed heuristic is EXACTLY WRONG.
>> >
>> > I do not think we have discussed anything resembling the current
>> > approach. And I would really appreciate some more examples where
>> > decisions based on leaf nodes would be EXACTLY WRONG.
>> >
>> >> We need OOM behavior to kill in a deterministic order configured by
>> >> policy.
>> >
>> > And nobody is objecting to this use case. I think we can build a
>> > priority policy on top of a leaf-based decision as well. The main
>> > point we are trying to sort out here is a reasonable semantic that
>> > would work for most workloads. Sibling-based selection will simply
>> > not work for those that have to use deeper hierarchies for
>> > organizational purposes. I haven't heard a counter-argument to that
>> > example yet.
>
> Hi, Tim!
>
>> We have a priority-based, multi-user cluster. That cluster runs a
>> variety of work, including critical things like search and Gmail, as
>> well as non-critical things like batch work. We try to offer our
>> users an SLA around how often they will be killed by factors outside
>> themselves, but we also want to get higher utilization. We know for
>> a fact (data, lots of data) that most jobs have spare memory
>> capacity, set aside for spikes or simply because accurate sizing is
>> hard. We can sell "guaranteed" resources to critical jobs, with a
>> high SLA. We can sell "best effort" resources to non-critical jobs,
>> with a low SLA. We achieve much better overall utilization this way.
>
> This is well understood.
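
To make the "guaranteed" vs. "best effort" split concrete: the ordering
we need at OOM time boils down to the comparator below. This is a
minimal sketch of the semantic only, not code from this patchset, and
"oom_priority" is a hypothetical knob, not an existing interface.

/*
 * Sketch of the victim ordering described above.  Illustrative only:
 * "oom_priority" is a hypothetical per-group setting, not something
 * in this patchset or in the kernel today.
 */
struct oom_candidate {
	int oom_priority;		/* hypothetical: higher dies first */
	unsigned long long start_time;	/* for tie-breaking by age */
	unsigned long usage;		/* memory size; deliberately unused */
};

/* Return nonzero if 'a' should be killed before 'b'. */
static int kill_before(const struct oom_candidate *a,
		       const struct oom_candidate *b)
{
	/* Priority dominates: best-effort always dies before guaranteed. */
	if (a->oom_priority != b->oom_priority)
		return a->oom_priority > b->oom_priority;

	/* Equal priority: prefer the younger candidate; size is never
	 * a factor. */
	return a->start_time > b->start_time;
}
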
>
>> I need to represent the priority of these tasks in a way that gives
>> me a very strong promise that, in case of system OOM, the
>> non-critical jobs will be chosen before the critical jobs.
>> Regardless of size. Regardless of how many non-critical jobs have to
>> die. I'd rather kill *all* of the non-critical jobs than a single
>> critical job. Size of the process or cgroup is simply not a factor,
>> and honestly, given 2 options of equal priority, I'd say age matters
>> more than size.
>>
>> So concretely I have 2 first-level cgroups, one for "guaranteed" and
>> one for "best effort" classes. I always want to kill from "best
>> effort", even if that means killing 100 small cgroups, before
>> touching "guaranteed".
>>
>> I apologize if this is not as thorough as the rest of the thread - I
>> am somewhat out of touch with the guts of it all these days. I just
>> feel compelled to indicate that, as a historical user (via Google
>> systems) and current user (via Kubernetes), some of the assertions
>> being made here do not ring true for our very real use cases. I
>> desperately want cgroup-aware OOM handling, but it has to be
>> policy-based or it is just not useful to us.
>
> A policy-based approach was suggested by Michal at the very beginning
> of this discussion. Although nobody had any strong objections to it,
> we agreed that it is out of scope for this patchset.
>
> The idea of this patchset is to introduce the ability to select a
> memcg as an OOM victim, with the optional subsequent killing of all
> tasks belonging to it. I believe this is absolutely mandatory for
> _any_ further development of an OOM killer that wants to treat
> memory cgroups as OOM entities.
>
> If you think this makes it impossible to support some use cases in
> the future, let's discuss it. Otherwise, I'd prefer to finish this
> part of the work and proceed to further improvements on top of it.
>
> Thank you!

I am 100% in favor of killing whole groups. We want that too. I just
needed to express disagreement with statements that size-based
decisions could not produce bad results. They can and do.
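
For the record, here is my (possibly imperfect) reading of the
selection shape on the table - pick the biggest leaf memcg, or the
biggest subtree that has opted in to whole-group killing. The tree
layout, field names, and "kill_all" flag below are illustrative, not
the actual patch:

/*
 * Rough sketch of my reading of the proposed selection; not the
 * actual patch.  All names here are illustrative.
 */
struct memcg_node {
	unsigned long usage;		/* pages charged to this subtree */
	int kill_all;			/* opt-in: treat subtree as one victim */
	int nr_children;
	struct memcg_node **children;
};

/* Find the biggest eligible victim: a leaf, or a kill_all subtree. */
static struct memcg_node *pick_victim(struct memcg_node *node)
{
	struct memcg_node *best = NULL;
	int i;

	/* Leaves and opted-in subtrees are the units of comparison. */
	if (node->nr_children == 0 || node->kill_all)
		return node;

	/* Otherwise, recurse and keep the largest candidate by usage. */
	for (i = 0; i < node->nr_children; i++) {
		struct memcg_node *v = pick_victim(node->children[i]);

		if (!best || v->usage > best->usage)
			best = v;
	}
	return best;
}

Nothing in that walk consults policy; for our use cases, a priority
comparison like the one sketched earlier would have to dominate the
size comparison.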