Date: Sun, 19 Aug 2018 16:26:50 -0700 (PDT)
From: David Rientjes
To: Roman Gushchin
Cc: linux-mm@kvack.org, Michal Hocko, Johannes Weiner, Tetsuo Handa,
    Tejun Heo, kernel-team@fb.com, linux-kernel@vger.kernel.org
Subject: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)
References: <20180730180100.25079-1-guro@fb.com>
    <20180731235135.GA23436@castle.DHCP.thefacebook.com>
    <20180801224706.GA32269@castle.DHCP.thefacebook.com>
    <20180807003020.GA21483@castle.DHCP.thefacebook.com>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
X-Mailing-List: linux-kernel@vger.kernel.org

Roman, have you had time to go through this?

On Tue, 7 Aug 2018, David Rientjes wrote:

> On Mon, 6 Aug 2018, Roman Gushchin wrote:
>
> > > In a cgroup-aware oom killer world, yes, we need the ability to
> > > specify that the usage of the entire subtree should be compared as a
> > > single entity with other cgroups.
> > > That is necessary for user subtrees but may not be necessary for
> > > top-level cgroups depending on how you structure your unified cgroup
> > > hierarchy.  So it needs to be configurable, as you suggest, and you
> > > are correct it can be different than oom.group.
> > >
> > > That's not the only thing we need though, as I'm sure you were
> > > expecting me to say :)
> > >
> > > We need the ability to preserve existing behavior, i.e. process based
> > > and not cgroup aware, for subtrees so that our users who have clear
> > > expectations and tune their oom_score_adj accordingly based on how
> > > the oom killer has always chosen processes for oom kill do not
> > > suddenly regress.
> >
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1
> > describing this case?  This basically means that if a memcg is
> > selected as the target, the process inside will be selected using the
> > traditional per-process approach.
>
> No, that would overload the policy and mechanism.  We want the ability
> to consider user-controlled subtrees as a single entity for comparison
> with other user subtrees to select which subtree to target.  This does
> not imply that users want their entire subtree oom killed.
>
> > > So we need to define the policy for a subtree that is oom, and I
> > > suggest we do that as a characteristic of the cgroup that is oom
> > > ("process" vs "cgroup", and "process" would be the default to
> > > preserve what currently happens in a user subtree).
> >
> > I'm not entirely convinced here.
> > I do agree that some subtrees may have a well tuned oom_score_adj,
> > and it's preferable to keep the current behavior.
> >
> > At the same time I don't like the idea of looking at the policy of the
> > OOMing cgroup.  Why should exceeding one limit be handled differently
> > from exceeding another?  This seems to be a property of the workload,
> > not of the limit.
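(Aside: the two-stage selection described in this exchange -- pick a memcg as the target by comparing subtrees as single entities, then select a process inside it with the traditional heuristic -- can be modeled roughly as below. Illustrative Python only, not kernel code; all structures and names are invented for the sketch, and the score formula only approximates the kernel's oom_score_adj bias.)

```python
# Rough model of the two-stage selection discussed above: compare
# user-controlled subtrees by aggregate usage, then pick a victim inside
# the chosen subtree with the classic per-process heuristic.
# Illustrative only -- not kernel code; every name here is invented.

def process_score(proc, total_pages):
    # Approximates the kernel heuristic: memory usage plus
    # oom_score_adj/1000 of the total memory under consideration.
    return proc["usage"] + proc["oom_score_adj"] * total_pages // 1000

def pick_victim(subtrees, total_pages):
    # Stage 1: each subtree is evaluated as a single entity by the sum
    # of its processes' usage (the "evaluate as group" comparison).
    target = max(subtrees, key=lambda s: sum(p["usage"] for p in s["procs"]))
    # Stage 2: traditional per-process selection inside the target, so
    # existing oom_score_adj tuning still applies (the oom.group=0 case).
    return max(target["procs"], key=lambda p: process_score(p, total_pages))

subtrees = [
    {"name": "david", "procs": [{"usage": 300, "oom_score_adj": 0},
                                {"usage": 200, "oom_score_adj": 500}]},
    {"name": "roman", "procs": [{"usage": 450, "oom_score_adj": 0}]},
]
victim = pick_victim(subtrees, total_pages=1000)
print(victim["usage"])  # -> 200: the adj-biased process in the larger subtree
```

Note that /roman's single 450-unit process loses in stage 1 because /david's subtree is compared by its aggregate (500), even though no individual process in /david is the largest.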
>
> The limit is the property of the mem cgroup, so it's logical that the
> policy when reaching that limit is a property of the same mem cgroup.
> Using the user-controlled subtree example, if we have /david and /roman,
> we can define our own policies on oom; we are not restricted to cgroup
> aware selection on the entire hierarchy.  /david/oom.policy can be
> "process" so that I haven't regressed with earlier kernels, and
> /roman/oom.policy can be "cgroup" to target the largest cgroup in your
> subtree.
>
> Something needs to be oom killed when a mem cgroup at any level in the
> hierarchy is reached and reclaim has failed.  What to do when that limit
> is reached is a property of that cgroup.
>
> > > Now, as users who rely on process selection are well aware, we have
> > > oom_score_adj to influence the decision of which process to oom
> > > kill.  If our oom subtree is cgroup aware, we should have the
> > > ability to likewise influence that decision.  For example, we have
> > > high priority applications that run at the top level that use a lot
> > > of memory, and strictly oom killing them in all scenarios because
> > > they use a lot of memory isn't appropriate.  We need to be able to
> > > adjust the comparison of a cgroup (or subtree) when compared to
> > > other cgroups.
> > >
> > > I've also suggested, but did not implement in my patchset because I
> > > was trying to define the API and find common ground first, that we
> > > have a need for priority based selection.  In other words, define
> > > the priority of a subtree regardless of cgroup usage.
> > >
> > > So with these four things, we have
> > >
> > >  - an "oom.policy" tunable to define "cgroup" or "process" for that
> > >    subtree (and plans for "priority" in the future),
> > >
> > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > >    subtree as the cgroup's own usage for comparison with others,
> > >
> > >  - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > >    to protect important applications and bias against unimportant
> > >    applications.
> > >
> > > This adds several tunables, which I didn't like, so I tried to
> > > overload oom.policy and oom.evaluate_as_group.  When I referred to
> > > separating out the subtree usage accounting into a separate tunable,
> > > that is what I have referenced above.
> >
> > IMO, merging multiple tunables into one doesn't make it saner.
> > The real question is how to make a reasonable interface with fewer
> > tunables.
> >
> > The reason behind introducing all these knobs is to provide a generic
> > solution to define OOM handling rules, but then the question arises
> > whether the kernel is the best place for it.
> >
> > I really doubt that an interface with so many knobs has any chance of
> > being merged.
>
> This is why I attempted to overload oom.policy and oom.evaluate_as_group:
> I could not think of a reasonable usecase where a subtree would be used
> to account for cgroup usage but not use a cgroup aware policy itself.
> You've objected to that, where memory.oom_policy == "tree" implied
> cgroup awareness in my patchset, so I've separated that out.
>
> > IMO, there should be a compromise between the simplicity (basically,
> > the number of tunables and possible values) and the functionality of
> > the interface.  You nacked my previous version, and unfortunately I
> > don't have anything better so far.
>
> If you do not agree with the overloading and have a preference for
> single value tunables, then all three tunables are needed.
> This functionality could be represented as two tunables or one if they
> are not single value, but from the oom.group discussion you preferred
> single values.
>
> I assume you'd also object to adding and removing files based on
> oom.policy, since oom.evaluate_as_group and oom.adj are only needed for
> an oom.policy of "cgroup" or "priority", and they do not need to exist
> for the default oom.policy of "process".
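(Aside: if the pieces proposed in this thread were all adopted, configuring the /david and /roman example subtrees might look like the following. This is a sketch only: memory.oom.policy, memory.oom.evaluate_as_group, and memory.oom.adj are names from the proposals above and are not a merged kernel ABI -- of the knobs discussed in this thread, only memory.oom.group was eventually merged. The /important cgroup is invented here for illustration.)

```shell
# Sketch of the proposed interface on a cgroup v2 mount.  The file names
# below come from the proposals in this thread and are NOT a merged
# kernel ABI.

# /david keeps traditional per-process selection, so existing
# oom_score_adj tuning continues to behave as on earlier kernels.
echo process > /sys/fs/cgroup/david/memory.oom.policy

# /roman is compared against siblings as a single entity (aggregate
# subtree usage) and targets the largest cgroup within the subtree.
echo cgroup > /sys/fs/cgroup/roman/memory.oom.policy
echo 1 > /sys/fs/cgroup/roman/memory.oom.evaluate_as_group

# Bias a high-priority, memory-hungry top-level cgroup away from being
# selected purely because of its size.
echo -500 > /sys/fs/cgroup/important/memory.oom.adj
```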