Date: Thu, 31 Jan 2019 11:22:48 -0500
From: Johannes Weiner
To: Michal Hocko
Cc: Tejun Heo, Chris Down, Andrew Morton, Roman Gushchin, Dennis Zhou, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, kernel-team@fb.com
Subject: Re: [PATCH 2/2] mm: Consider subtrees in memory.events
Message-ID: <20190131162248.GA17354@cmpxchg.org>
In-Reply-To: <20190131085808.GO18811@dhcp22.suse.cz>

On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote:
> On Wed 30-01-19 16:31:31, Johannes Weiner wrote:
> > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote:
> [...]
> > > I thought I have already mentioned an example. Say you have an observer
> > > on the top of a delegated cgroup hierarchy and you set up limits (e.g. hard
> > > limit) on the root of it. If you get an OOM event then you know that the
> > > whole hierarchy might be underprovisioned and perform some rebalancing.
> > > Now you really do not care that somewhere down the delegated tree there
> > > was an oom. Such a spurious event would just confuse the monitoring and
> > > lead to wrong decisions.
> >
> > You can construct a use case like this, as per above with OOM, but it's
> > incredibly unlikely for something like this to exist. There is plenty
> > of evidence on adoption rate that supports this: we know where the big
> > names in containerization are; we see the things we run into that have
> > not been reported yet etc.
> >
> > Compare this to real problems this has already caused for
> > us. Multi-level control and monitoring is a fundamental concept of the
> > cgroup design, so naturally our infrastructure doesn't monitor and log
> > at the individual job level (too much data, and also kind of pointless
> > when the jobs are identical) but at aggregate parental levels.
> >
> > Because of this wart, we have missed problematic configurations when
> > the low, high, max events were not propagated as expected (we log oom
> > separately, so we still noticed those). Even once we knew about it, we
> > had trouble tracking these configurations down for the same reason -
> > the data isn't logged, and won't be logged, at this level.
>
> Yes, I do understand that you might be interested in the hierarchical
> accounting.
>
> > Adding a separate, hierarchical file would solve this one particular
> > problem for us, but it wouldn't fix this pitfall for all future users
> > of cgroup2 (which by all available evidence is still most of them) and
> > would be a wart on the interface that we'd carry forever.
>
> I understand even this reasoning but if I have to choose between a risk
> of user breakage that would require reimplementing the monitoring or an
> API inconsistency I vote for the first option. It is unfortunate but this
> is the way we deal with APIs and compatibility.

I don't know why you keep repeating this; it's simply not how the
Linux API is maintained in practice.

In cgroup2, we fixed io.stat to not conflate discard IO and write IO:
636620b66d5d4012c4a9c86206013964d3986c4f

Linus changed the Vmalloc field semantics in /proc/meminfo after over
a decade, without a knob to restore it in production:

    If this breaks anything, we'll obviously have to re-introduce the code
    to compute this all and add the caching patches on top. But if given
    the option, I'd really prefer to just remove this bad idea entirely
    rather than add even more code to work around our historical mistake
    that likely nobody really cares about.
    a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0

Mel changed the zone_reclaim_mode default behavior after over a
decade:

    Those that require zone_reclaim_mode are likely to be able to
    detect when it needs to be enabled and tune appropriately so lets
    have a sensible default for the bulk of users.
    4f9b16a64753d0bb607454347036dc997fd03b82
    Acked-by: Michal Hocko

And then Mel changed the default zonelist ordering to pick saner
behavior for most users, followed by a complete removal of the zone
list ordering, after again, decades of existence of these things:

    commit c9bff3eebc09be23fbc868f5e6731666d23cbea3
    Author: Michal Hocko
    Date:   Wed Sep 6 16:20:13 2017 -0700

        mm, page_alloc: rip out ZONELIST_ORDER_ZONE

And why did we do any of those things and risk user disruption every
single time?
Because the existing behavior was not a good default, a
burden on people, and the risk of breakage was sufficiently low.

I don't see how this case is different, and you haven't provided any
arguments that would explain that.

> > Adding a note in cgroup-v2.txt doesn't make up for the fact that this
> > behavior flies in the face of basic UX concepts that underlie the
> > hierarchical monitoring and control idea of the cgroup2fs.
> >
> > The fact that the current behavior MIGHT HAVE a valid application does
> > not mean that THIS FILE should be providing it. It IS NOT an argument
> > against this patch here, just an argument for a separate patch that
> > adds this functionality in a way that is consistent with the rest of
> > the interface (e.g. systematically adding .local files).
> >
> > The current semantics have real costs to real users. You cannot
> > dismiss them or handwave them away with a hypothetical regression.
> >
> > I would really ask you to consider the real world usage and adoption
> > data we have on cgroup2, rather than insist on a black and white
> > answer to this situation.
>
> Those users requiring the hierarchical behavior can use the new file
> without any risk of breakages so I really do not see why we should
> undertake the risk and do it the other way around.

Okay, so let's find a way forward here.

1. A new memory.events_tree file or similar. This would give us a way
   to get the desired hierarchical behavior. The downside is that it's
   suggesting that ${x} and ${x}_tree are the local and hierarchical
   versions of a cgroup file, and that's false everywhere else. Saying
   we would document it is a cop-out and doesn't actually make the
   interface less confusing (most people don't look at errata
   documentation until they've been burned by unexpected behavior).

2. A runtime switch (cgroup mount option, sysctl, what have you) that
   lets you switch between the local and the tree behavior.
   This would be able to provide the desired semantics in a clean
   interface, while still having the ability to support legacy users.

   2a. A runtime switch that defaults to the local behavior.

   2b. A runtime switch that defaults to the tree behavior.

The choice between 2a and 2b comes down to how big we judge the risk
to be that somebody has an existing dependency on the local behavior.

Given what we know about cgroup2 usage, and considering our previous
behavior in such matters, I'd say 2b is reasonable and in line with
how we tend to handle these things. On the tiny chance that somebody
is using the current behavior, they can flick the switch (until we add
the .local files, or simply use the switch forever).
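
To make the difference between the local and the tree behavior
concrete, here is a toy model in Python. It is only a sketch for
illustration: the Cgroup class, record(), read_events() and the
local_events flag are made-up names, not kernel or cgroup2
interfaces. The flag defaults to the tree behavior, as in option 2b.

```python
# Toy model of local vs. hierarchical event counting.
# All names here are hypothetical; this is not how the kernel stores
# or exposes memory.events.

EVENTS = ("low", "high", "max", "oom")


class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        # Events that happened in this cgroup itself.
        self.local = dict.fromkeys(EVENTS, 0)
        if parent is not None:
            parent.children.append(self)

    def record(self, event, count=1):
        """Account an event against this cgroup only."""
        self.local[event] += count

    def read_events(self, local_events=False):
        """Return the event counts a reader of this cgroup would see.

        local_events=False (the default, option 2b): counts include
        everything that happened in the subtree below this cgroup.
        local_events=True (the legacy switch): counts cover only
        events that happened in this cgroup itself.
        """
        totals = dict(self.local)
        if not local_events:
            for child in self.children:
                for event, count in child.read_events().items():
                    totals[event] += count
        return totals


# A parent that is monitored, with two jobs below it.
workload = Cgroup("workload")
job_a = Cgroup("job-a", parent=workload)
job_b = Cgroup("job-b", parent=workload)

job_a.record("high", 3)
job_b.record("max")
job_b.record("oom")

# Tree behavior: the parent sees what happened in its children.
print(workload.read_events())
# {'low': 0, 'high': 3, 'max': 1, 'oom': 1}

# Local behavior: the parent sees nothing, which is how events at the
# job level go unnoticed when only the parent is logged.
print(workload.read_events(local_events=True))
# {'low': 0, 'high': 0, 'max': 0, 'oom': 0}
```

For a leaf such as job-a both modes return the same counts, so the
switch only matters for whoever monitors at the aggregate level.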