Date: Wed, 17 Jan 2018 14:14:47 -0800 (PST)
From: David Rientjes
To: Tejun Heo
Cc: Andrew Morton, Roman Gushchin, Michal Hocko, Vladimir Davydov,
    Johannes Weiner, Tetsuo Handa, kernel-team@fb.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
In-Reply-To: <20180117154155.GU3460072@devbig577.frc2.facebook.com>

On Wed, 17 Jan 2018, Tejun Heo wrote:

> Hello, David.
>
Hi Tejun!

> > The behavior of killing an entire indivisible memory consumer, enabled
> > by memory.oom_group, is an oom policy itself.  It specifies that all
>
> I thought we discussed this before, but maybe I'm misremembering.
> There are two parts to the OOM policy.
> One is victim selection; the
> other is the action to take thereafter.
>
> The two are different, and conflating them doesn't work too well.  For
> example, please consider what should be given to the delegatee when
> delegating a subtree, which is often a good exercise when designing
> these APIs.
>
> When a given workload is selected for OOM kill (IOW, selected to free
> some memory), whether the workload can handle individual process kills
> or not is a property of the workload itself.  Some applications can
> safely handle some of their processes being picked off and killed;
> most others can't and want to be handled as a single unit, which makes
> it a property of the workload.
>
Yes, this is a valid point.  "tree" and "all" are identical selection
policies; the mechanisms then differ with respect to whether one process
or all eligible processes are killed, respectively.  My motivation for
combining them was to avoid having two different tunables, especially
because we'll later need a way for userspace to influence the
decisionmaking to protect (bias selection against) important subtrees.

What would really be nice is cgroup.subtree_control-type behavior where
we could set a policy and a mechanism at the same time.  It's not clear
how that could cleanly be restricted to one policy and one mechanism,
however.  The simplest interface for the user would be a new file to
specify the mechanism, leaving memory.oom_policy alone.  Would another
file really be warranted?  I'm not sure.

> That makes sense in the hierarchy too, because whether one process or
> the whole workload is killed doesn't infringe upon the parent's
> authority over resources, which in turn implies that there's nothing
> to worry about in how the parent's groupoom setting should constrain
> the descendants.
>
> OOM victim selection policy is a different beast.
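To make the two-file split concrete, here is a rough sketch of what it
might look like from userspace.  The file memory.oom_kill_mode and its
values are my invention for illustration, not an existing interface, and
a scratch directory stands in for a real cgroup mount:

```shell
# Hypothetical two-file interface (names are illustrative only):
#   memory.oom_policy    - victim selection policy ("none", "tree", "all")
#   memory.oom_kill_mode - action after selection ("process" or "cgroup")
# CG stands in for a cgroup directory such as /sys/fs/cgroup/workload.
CG=${CG:-/tmp/demo-cgroup}
mkdir -p "$CG"

# Select the victim by hierarchical (subtree) usage...
echo tree > "$CG/memory.oom_policy"

# ...but let the workload state whether a single process or the whole
# cgroup should be killed once it has been selected:
echo cgroup > "$CG/memory.oom_kill_mode"

# Read both knobs back; prints "tree" then "cgroup".
cat "$CG/memory.oom_policy" "$CG/memory.oom_kill_mode"
```

The point of the split is exactly the one above: selection is the
ancestor's business, while the kill mode is a property of the workload.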
> As you've mentioned
> multiple times, especially if you're worried about people abusing OOM
> policies by creating sub-cgroups and so on, the policy, first of all,
> shouldn't be delegatable, and secondly it should have meaningful
> hierarchical restrictions so that a policy an ancestor chose can't be
> nullified by a descendant.
>
The goal here was to require a policy of either "tree" or "all" that
the user can't change.  Users are allowed to have their own oom policies
internal to their subtrees, for oom conditions in those subtrees alone.
If the common ancestor hits its limit, however, it is forced to be
either "tree" or "all", so hierarchical usage is considered instead of
localized usage.

Either "tree" or "all" is appropriate, and this may be why you brought
up the point about separating them: the policy can be demanded by the
common ancestor, while the actual mechanism the oom killer uses, killing
either a single process or the full cgroup, is left to the user
depending on their workload.  That sounds reasonable, and I can easily
separate the two by introducing a new file, similar to memory.oom_group
but more extensible, so that it is not simply a choice between a full
cgroup kill and a single process kill.

> I'm not necessarily against adding hierarchical victim selection
> policy tunables; however, I am skeptical whether static tunables on
> the cgroup hierarchy (including selectable policies) can be made clean
> and versatile enough, especially because the resource hierarchy
> doesn't necessarily, or rather in most cases doesn't, match the OOM
> victim selection decision tree.  But I'd be happy to be proven wrong.
>
Right, and I think that giving users control over their subtrees is a
powerful tool, one that can lead to very effective use of the cgroup v2
hierarchy.  Being able to circumvent oom selection by creating child
cgroups is certainly something that can trivially be prevented.
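As a concrete (and again hypothetical) illustration of that constraint,
assuming the memory.oom_policy file from this series and using scratch
directories in place of a real cgroup v2 mount:

```shell
# The common ancestor pins a hierarchical policy; a delegated subtree
# may still set a policy for OOMs local to itself, but that cannot
# nullify what the ancestor chose for OOMs at the ancestor's limit.
# The directory layout stands in for a cgroup v2 hierarchy.
ROOT=${ROOT:-/tmp/demo-hierarchy}
mkdir -p "$ROOT/ancestor/delegated/job1" "$ROOT/ancestor/delegated/job2"

# Ancestor requires hierarchical ("tree") comparison at its limit.
echo tree > "$ROOT/ancestor/memory.oom_policy"

# The delegatee picks a local policy for OOMs within its own subtree;
# creating the job1/job2 sub-cgroups does not help it evade the
# ancestor's "tree" comparison, since their usage charges up the tree.
echo none > "$ROOT/ancestor/delegated/memory.oom_policy"
```

In this sketch the delegated subtree's write is honored only for its own
local OOMs, which is the non-delegatable, hierarchically restricted
behavior described above.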
The argument that users can currently split their workloads into many
smaller processes to circumvent today's heuristic doesn't mean we can't
have "tree"-like comparisons between cgroups to address that very
issue, since all processes charge up the tree itself.

I became convinced of this when I saw the real-world usecases that
would use such a feature on cgroup v2: we want hierarchical usage for
comparison when full subtrees are dedicated to individual consumers,
for example, and local memcg usage for comparison when hierarchies are
used for organization, such as the top-level /admins and /students
cgroups for which Michal provided an example.  These can coexist on the
same system, so a single system-wide policy decision for the
cgroup-aware oom killer (the idea behind the current mount option)
isn't needed anymore.

So defining the actual policy, and the mechanism as you pointed out,
per subtree is a very powerful tool: it's extensible, it doesn't
require the system to either fully enable or fully disable the feature,
and it doesn't require a remount of cgroup v2 to change.

> Without explicit configuration, the only thing the OOM killer needs
> to guarantee is that the system can make forward progress.  We've
> always been tweaking victim selection, with or without cgroups, and
> we absolutely shouldn't be locked into a specific heuristic.  The
> heuristic is an implementation detail subject to improvement.
>
> To me, your patchset actually seems to demonstrate that these are
> separate issues.  The goal of groupoom is just to kill logical units,
> as the cgroup hierarchy can inform the kernel of how workloads are
> composed in userspace.  If you want to improve victim selection,
> sure, please go ahead, but your argument that groupoom can't be
> merged because of victim selection policy doesn't make sense to me.
memory.oom_group, the mechanism behind what the oom killer chooses to
do after victim selection, is not implemented without the selection
heuristic that compares cgroups as indivisible memory consumers.  It
could be done first, prior to introducing the new selection criteria;
we don't have patches for that right now because Roman's work
introduces things in the opposite order.  If it is acceptable to add a
separate file solely for this purpose, it's rather trivial to do.

My other thought was some kind of

	echo "hierarchy killall" > memory.oom_policy

where both the policy and an (optional) mechanism could be specified.
Your input on the actual tunables would be very valuable.
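For illustration, the combined syntax could be parsed as "policy,
optional mechanism".  This is only a sketch of the proposed write
format, not something the kernel parses today, so a scratch file stands
in for the real cgroup file:

```shell
# Hypothetical combined write: the first token selects the victim
# selection policy, the optional second token the kill mechanism.
CG=${CG:-/tmp/demo-oom}
mkdir -p "$CG"
echo "hierarchy killall" > "$CG/memory.oom_policy"

# Split the stored line the way the kernel might: whitespace-separated
# tokens, mechanism left empty when only a policy was written.
read -r policy mechanism < "$CG/memory.oom_policy"
echo "policy=$policy mechanism=$mechanism"
# prints: policy=hierarchy mechanism=killall
```

A plain `echo hierarchy > memory.oom_policy` would then leave the
mechanism at its default, which keeps the single-file interface
backward compatible with a policy-only write.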