Date: Thu, 15 Mar 2018 13:16:53 -0700 (PDT)
From: David Rientjes
To: Roman Gushchin
Cc: Andrew Morton, Michal Hocko, Vladimir Davydov, Johannes Weiner,
    Tejun Heo, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: [patch -mm v3 1/3] mm, memcg: introduce per-memcg oom policy tunable
In-Reply-To: <20180315171039.GB1853@castle.DHCP.thefacebook.com>
References: <20180314123851.GB20850@castle.DHCP.thefacebook.com>
 <20180315171039.GB1853@castle.DHCP.thefacebook.com>

On Thu, 15 Mar 2018, Roman Gushchin wrote:

> > - Does not lock the entire system into a single methodology.  Users
> >   working in a subtree can default to what they are used to: per-process
> >   oom selection even though their subtree might be targeted by a system
> >   policy level decision at the root.
> >   This allows them the flexibility to
> >   organize their subtree intuitively for use with other controllers in a
> >   single hierarchy.
> >
> >   The real-world example is a user who currently organizes their subtree
> >   for this purpose and has defined oom_score_adj appropriately, and now
> >   regresses if the admin mounts with the needless "groupoom" option.
>
> I find this extremely confusing.
>
> The problem is that an OOM policy independently defines how OOM in the
> corresponding scope is handled, not how it prefers OOMs from above to be
> handled.
>
> As I've said, if you're inside a container, you can have OOMs of
> different types, depending on settings which you don't even know about.
> Sometimes oom_score_adj works, sometimes not.
> Sometimes all processes are killed, sometimes not.
> IMO, this adds nothing but mess.
>

Yes, there are many additional problems with the cgroup-aware oom killer
in -mm; the fact that memory.oom_group is factored into the selection
logic is another of them.  Users who prefer to account their subtree for
comparison (the only way to avoid allowing users to evade the oom killer
completely) should use the memory.oom_policy of "tree" introduced later in
this series.  memory.oom_group needs to be completely separated from the
policy of selecting a victim: it should only be a mechanism that defines,
as a property of the workload, whether a single process is oom killed or
all processes attached to the victim mem cgroup.

> The mount option (which I'm not a big fan of either) was added only to
> provide 100% backward compatibility, which was forced by Michal.
> But I doubt that mixing the per-process and per-cgroup approaches makes
> any sense.

It makes absolute sense and has real users who can immediately use this if
it's merged.  There is nothing wrong with a user preferring to kill the
largest process from their subtree on mem cgroup oom.  It's what they've
always experienced, with both cgroup v1 and v2.
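For illustration, the per-process selection these users are accustomed to
works roughly like this (a simplified model loosely following the kernel's
oom_badness(); the real kernel also charges swap and page-table pages, so
treat this as a sketch, not the implementation):

```python
# Simplified model of per-process oom victim scoring: userspace biases the
# score with /proc/pid/oom_score_adj, in units of 1/1000 of total memory.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000

def badness(rss_pages: int, oom_score_adj: int, totalpages: int):
    """Return an oom score for a task, or None if the task is exempt."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None                    # task opted out of oom killing entirely
    points = rss_pages                 # memory footprint dominates the score
    # Bias by up to +/-100% of total pages, scaled by oom_score_adj/1000.
    points += oom_score_adj * totalpages // 1000
    return max(points, 1)
```

Under a per-process policy these scores pick the victim; under a
cgroup-aware policy imposed from an ancestor they are ignored, which is
exactly the regression described above.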
It's the difference between users in a subtree being able to use
/proc/pid/oom_score_adj or not.  Without it, their oom_score_adj values
become entirely irrelevant.

We have users who tune their oom_score_adj and are running in a subtree
they control.  If an overcommitted ancestor is oom, which is up to the
admin to define through the organization of the hierarchy and the limits
imposed, the user does not control which process or group of processes is
oom killed.  That's a decision for the ancestor, which controls all
descendant cgroups, including their limits and oom policies.

> >
> > - Allows changing the oom policy at runtime without remounting the entire
> >   cgroup fs.  Depending on how cgroups are going to be used, per-process
> >   vs cgroup-aware may be mandated separately.  This is a trait only of
> >   the mem cgroup controller; the root-level oom policy is no different
> >   from the subtree's and depends directly on how the subtree is organized.
> >   If other controllers are already being used, requiring a remount to
> >   change the system-wide oom policy is an unnecessary burden.
> >
> >   The real-world example is systems software that either supports user
> >   subtrees or strictly subtrees that it maintains itself.  While other
> >   controllers are used, the mem cgroup oom policy can be changed at
> >   runtime rather than requiring a remount and reorganizing the other
> >   controllers exactly as before.
>
> Btw, what's the problem with remounting? You don't have to re-create
> cgroups or anything like that; the operation is as trivial as adding a
> flag.

Remounting applies to the entire mem cgroup hierarchy.  The point of this
entire patchset is that different subtrees will have different policies;
the system cannot be locked into a single selection logic.  This also
completely avoids users being able to evade the cgroup-aware oom killer by
creating subcontainers.

Obviously I've been focused on users controlling subtrees in a lot of my
examples.
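The per-subtree flexibility argued for here can be sketched as a toy model.
The policy values "none" and "tree" follow this patchset; the data
structures and the kill logic are hypothetical simplifications, not the
-mm implementation:

```python
# Toy model: each memcg carries its own oom_policy, consulted only for oom
# conditions in its own scope; oom_group decides *what* is killed, never
# *which* victim is chosen.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Memcg:
    name: str
    oom_policy: str = "none"     # "none" = per-process, "tree" = subtree-accounted
    oom_group: bool = False      # kill mechanism only, not a selection input
    procs: List[Tuple[int, int]] = field(default_factory=list)  # (pid, usage)
    children: List["Memcg"] = field(default_factory=list)

def subtree_usage(cg: Memcg) -> int:
    return sum(u for _, u in cg.procs) + sum(subtree_usage(c) for c in cg.children)

def all_procs(cg: Memcg):
    yield from cg.procs
    for c in cg.children:
        yield from all_procs(c)

def oom_kill(oom_root: Memcg) -> List[int]:
    """Handle an oom condition at oom_root, honoring only its own policy."""
    if oom_root.oom_policy == "tree":
        # Cgroup-aware: compare direct children by whole-subtree usage, so
        # spawning subcontainers cannot hide memory from the comparison.
        victim_cg = max(oom_root.children, key=subtree_usage, default=oom_root)
        # oom_group is consulted only *after* selection, as a kill mechanism.
        if victim_cg.oom_group:
            return [pid for pid, _ in all_procs(victim_cg)]
        pid, _ = max(all_procs(victim_cg), key=lambda p: p[1])
        return [pid]
    # Per-process policy: kill the single largest process in the oom subtree.
    pid, _ = max(all_procs(oom_root), key=lambda p: p[1])
    return [pid]
```

A subtree with policy "none" keeps classic per-process semantics for its
own ooms even if an ancestor uses "tree" for its scope, and changing a
policy is just a write to one cgroup, with no remount.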
Those users may already prefer oom killing the largest process on the
system (or in their subtree).  They can still do that with this patch and
opt out of cgroup awareness for their subtree.  It also provides all the
functionality that the current implementation in -mm provides.

> >
> > - Can be extended to cgroup v1 if necessary.  There is no need for a
> >   new cgroup v1 mount option, and mem cgroup oom selection is not
> >   dependent on any functionality provided by cgroup v2.  The policies
> >   introduced here work exactly the same if used with cgroup v1.
> >
> >   The real-world example is a cgroup configuration that hasn't had
> >   the ability to move to cgroup v2 yet and still would like to use
> >   cgroup-aware oom selection with a very trivial change to add the
> >   memory.oom_policy file to the cgroup v1 filesystem.
>
> I assume that the v1 interface is frozen.

It requires adding the memory.oom_policy file to the cgroup v1 fs; that's
it.  No other support is needed.  If that's allowed upstream, great; if
not, it's a very simple patch for a distribution to carry.