Received: by 10.223.176.5 with SMTP id f5csp8384wra; Fri, 26 Jan 2018 16:18:17 -0800 (PST) X-Google-Smtp-Source: AH8x22735W5F/05i+LTuwjgwEQsapWtH7/2Y9OHCvZeh4M2bk5kvmkll/D4AmxwZYAXC1SpwD31z X-Received: by 2002:a17:902:5902:: with SMTP id o2-v6mr15528500pli.79.1517012297498; Fri, 26 Jan 2018 16:18:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1517012297; cv=none; d=google.com; s=arc-20160816; b=e+gAhcKKzRA9euliOIrPFPTiWWigwjIABl4m5XKymA/+yz4OKSwTpLn5dhJDNnnyrg L1Phj6rDVJ6cL5iPOuquKjS3P1fDMZim0RgVZnEqihd148/raHHvWDkEbo8s7pawHCk+ 62CHy4mvHAeG4l2dtaH2u2VhuzjPfffoXQP05teSBg0z2POmUMZFLNsn5/6BcVLPrkkz OhO6peogyGVW2qfOrWP9hljpuLjAmsCaYHzTzUlQi0aa/cfYIWffHILAi7LU0aWttqib 6qm3LGApjDr++6Uxd/mpfJaqF+x/OWZ74Do860D3EMYhIcUL+MXasR59vxfFaXmT6qLH RIZw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date :arc-authentication-results; bh=QpMjpn5x9xC2An4KK0gAe/ak3RvwERMHmjBA9Ep140E=; b=sw70y0FZHT7nP3GqN/M5ummSQxoJThgil9rM9Z4Hu5o03zjmwapcG44r0WTjw34sD7 v3iruk+6WFpQCX0SCb00BP+y8QYQNoj3foGAIRUk70ab7kTfIdFK7PY3FeGnRtf9yEc/ qvhUGp6GoHjDG9OnC5m3VaLq2FQx3yYxqVqz8W4mRqjNM3nCneBoZvuOy+C3au9lyYbs Wa+4Ag6T1USLPs8oN+S4AQZutiJBfezj67yQyWz2xGBpVK9s3unoAGevnUfucP0JUeqj u1xJxxa1YDs/I01E29oXgqvmhKHihkDqQTJ5iBB2hnfeZyTDDV0AfZ/QMkIhi55XbHw9 57KQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l16si3671059pgn.681.2018.01.26.16.18.03; Fri, 26 Jan 2018 16:18:17 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752268AbeA0ARi (ORCPT + 99 others); Fri, 26 Jan 2018 19:17:38 -0500 Received: from mail.linuxfoundation.org ([140.211.169.12]:55844 "EHLO mail.linuxfoundation.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751536AbeA0ARh (ORCPT ); Fri, 26 Jan 2018 19:17:37 -0500 Received: from akpm3.svl.corp.google.com (unknown [104.133.9.92]) by mail.linuxfoundation.org (Postfix) with ESMTPSA id 53028F3B; Sat, 27 Jan 2018 00:17:36 +0000 (UTC) Date: Fri, 26 Jan 2018 16:17:35 -0800 From: Andrew Morton To: David Rientjes Cc: Roman Gushchin , Michal Hocko , Vladimir Davydov , Johannes Weiner , Tetsuo Handa , Tejun Heo , kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable Message-Id: <20180126161735.b999356fbe96c0acd33aaa66@linux-foundation.org> In-Reply-To: References: <20180125160016.30e019e546125bb13b5b6b4f@linux-foundation.org> <20180126143950.719912507bd993d92188877f@linux-foundation.org> X-Mailer: Sylpheed 3.4.1 (GTK+ 2.24.23; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 26 Jan 2018 14:52:59 -0800 (PST) David Rientjes wrote: > On Fri, 26 Jan 2018, Andrew Morton wrote: > > > > -ECONFUSED. We want to have a mount option that has the sole purpose of > > > doing echo cgroup > /mnt/cgroup/memory.oom_policy? > > > > Approximately. Let me put it another way: can we modify your patchset > > so that the mount option remains, and continues to have a sufficiently > > same effect? For backward compatibility. > > > > The mount option would exist solely to set the oom policy of the root mem > cgroup, it would lose its effect of mandating that policy for any subtree > since it would become configurable by the user if delegated. Why can't we propagate the mount option into the subtrees? If the user then alters that behaviour with new added-by-David tunables then fine, that's still backward compatible. > Let me put it another way: if the cgroup aware oom killer is merged for > 4.16 without this patchset, userspace can reasonably infer the oom policy > from checking how cgroups were mounted. If my followup patchset were > merged for 4.17, that's invalid and it becomes dependent on kernel > version: it could have the "groupoom" mount option but configured through > the root mem cgroup's memory.oom_policy to not be cgroup aware at all. That concern seems unreasonable to me. Is an application *really* going to peek at the mount options to figure out what its present oom policy is? Well, maybe. But that's a pretty dopey thing to do and I wouldn't lose much sleep over breaking any such application in the very unlikely case that such a thing was developed in that two-month window. If that's really a concern then let's add (to Roman's patchset) a proper interface for an application to query its own oom policy. > That inconsistency, to me, is unfair to burden userspace with. > > > > This, and fixes to fairly compare the root mem cgroup with leaf mem > > > cgroups, are essential before the feature is merged otherwise it yields > > > wildly unpredictable (and unexpected, since its interaction with > > > oom_score_adj isn't documented) results as I already demonstrated where > > > cgroups with 1GB of usage are killed instead of 6GB workers outside of > > > that subtree. > > > > OK, so Roman's new feature is incomplete: it satisfies some use cases > > but not others. And we kinda have a plan to address the other use > > cases in the future. > > > > Those use cases are also undocumented such that the user doesn't know the > behavior they are opting into. Nowhere in the patchset does it mention > anything about oom_score_adj other than being oom disabled. It doesn't > mention that a per-process tunable now depends strictly on whether it is > attached to root or not. It specifies a fair comparison between the root > mem cgroup and leaf mem cgroups, which is obviously incorrect by the > implementation itself. So I'm not sure the user would know which use > cases it is valid for, which is why I've been trying to make it generally > purposeful and documented. Documentation patches are nice. We can cc:stable them too, so no huge hurry. > > There's nothing wrong with that! As long as we don't break existing > > setups while evolving the feature. How do we do that? > > > > We'd break the setups that actually configure their cgroups and processes > to abide by the current implementation since we'd need to discount > oom_score_adj from the the root mem cgroup usage to fix it. Am having trouble understanding that. Expand, please? Can we address this (and other such) issues in the (interim) documentation?