Date: Fri, 26 Jan 2018 14:33:36 -0800 (PST)
From: David Rientjes
To: Michal Hocko
cc: Tejun Heo, Andrew Morton, Roman Gushchin, Vladimir Davydov,
    Johannes Weiner, Tetsuo Handa, kernel-team@fb.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

On Fri, 26 Jan 2018, Michal Hocko wrote:

> > Could you elaborate on why specifying the oom policy for the entire
> > hierarchy as part of the root mem cgroup and also for individual subtrees
> > is incomplete?
> > It allows admins to specify and delegate policy decisions
> > to subtrees owners as appropriate. It addresses your concern in the
> > /admins and /students example. It addresses my concern about evading the
> > selection criteria simply by creating child cgroups. It appears to be a
> > win-win. What is incomplete or are you concerned about?
>
> I will get back to this later. I am really busy these days. This is not
> a trivial thing at all.
>

Please follow-up in the v2 patchset when you have time.

> Most usecases I've ever seen usually use oom_score_adj only to disable
> the oom killer for a particular service. In those cases the current
> heuristic works reasonably well.
>

I'm not familiar with the workloads you have worked with that use
oom_score_adj. We use it to prefer a subset of processes first and a
subset of processes last. I don't expect this to be a highly specialized
usecase; it's the purpose of the tunable. The fact remains that
oom_score_adj tuning is only effective with the current implementation
when the process is attached to the root mem cgroup, and that behavior is
undocumented. The preference or bias changes as soon as the process is
attached to a cgroup, even if it's the only non-root mem cgroup on the
system.

> > That's because per-process usage and oom_score_adj are only relevant
> > for the root mem cgroup and irrelevant when attached to a leaf.
>
> This is the simplest implementation. You could go and ignore
> oom_score_adj on root tasks. Would it be much better? Should you ignore
> oom disabled tasks? Should you consider kernel memory footprint of those
> tasks? Maybe we will realize that we simply have to account root memcg
> like any other memcg. We used to do that but it has been reverted due
> to performance footprint. There are more questions to answer I believe
> but the most important one is whether actually any _real_ user cares.
>

The goal is to compare the root mem cgroup and leaf mem cgroups equally.
That is specifically listed as a goal for the cgroup aware oom killer, and
it is clearly not implemented correctly: partly because of this bias, but
also because the sum of oom_badness() != anon + unevictable +
unreclaimable slab, even discounting oom_score_adj. The amount of slab is
also only considered for leaf mem cgroups. What I've proposed in the past
was to use the global state of anon, unevictable, and unreclaimable slab
to fairly account the root mem cgroup, without bias from oom_score_adj,
when comparing cgroup usage. oom_score_adj is valid when choosing the
process from the root mem cgroup to kill, not when comparing against other
cgroups, since leaf cgroups discount it.

> I can see your arguments and they are true. You can construct setups
> where the current memcg oom heuristic works sub-optimally. The same has
> been the case for the OOM killer in general. The OOM killer has always
> been just a heuristic and there will always be somebody complaining. This
> doesn't mean we should just remove it because it works reasonably well
> for most users.
>

It's not most users. It only works for configurations that are fully
containerized, where no user processes are attached to the root mem cgroup
and nobody uses oom_score_adj as it is defined to be used. It's also
undocumented, so users don't even know that without looking at the kernel
implementation.

> > Because of that, users are
> > affected by the design decision and will organize their hierarchies as
> > appropriate to avoid it. Users who only want to use cgroups for a subset
> > of processes but still treat those processes as indivisible logical units
> > when attached to cgroups find that it is simply not possible.
>
> Nobody enforces the memcg oom selection as presented here for those
> users. They have to explicitly _opt-in_. If the new heuristic doesn't
> work for them we will hear about that most likely. I am really skeptical
> that oom_score_adj can be reused for memcg aware oom selection.
>

oom_score_adj is valid for choosing a process attached to a mem cgroup to
kill, absent memory.oom_group being set. It is not valid for comparing
cgroups, obviously. That's why it shouldn't be used for the root mem
cgroup either, which is what the current implementation does, while
falsely documenting the result as a fair comparison.

> I do not think anything you have proposed so far is even close to
> mergeable state. I think you are simply oversimplifying this. We have
> spent many months discussing different aspects of the memcg aware OOM
> killer. The result is a compromise that should work reasonably well
> for the targeted usecases and it doesn't bring unsustainable APIs that
> will get carved into stone.

If you don't have time to review the patchset to show that it's not
mergeable, I'm not sure that I have anything to work with.
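
For illustration only, the root versus leaf asymmetry discussed above can be modelled roughly as follows. This is a toy sketch, not kernel code: the names Task, root_score, leaf_score and TOTAL_PAGES are hypothetical, the numbers are made up, and oom_badness() here is a simplified stand-in for the kernel function of the same name. It assumes the root mem cgroup is scored as the sum of per-process badness, including oom_score_adj, while a leaf is scored as anon + unevictable + unreclaimable slab.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical amount of RAM + swap, in pages (assumption for the example).
TOTAL_PAGES = 1_000_000


@dataclass
class Task:
    usage: int          # per-process memory usage, in pages
    oom_score_adj: int  # -1000..1000, as set via /proc/<pid>/oom_score_adj


def oom_badness(task: Task) -> int:
    """Simplified per-process score: usage biased by oom_score_adj."""
    if task.oom_score_adj == -1000:
        return 0  # oom disabled tasks are never selected
    points = task.usage + task.oom_score_adj * (TOTAL_PAGES // 1000)
    return max(points, 1)


def root_score(tasks: Iterable[Task]) -> int:
    """Root mem cgroup: sum of per-process scores, so oom_score_adj leaks in."""
    return sum(oom_badness(t) for t in tasks)


def leaf_score(anon: int, unevictable: int, unreclaimable_slab: int) -> int:
    """Leaf mem cgroup: charged memory only, oom_score_adj is not consulted."""
    return anon + unevictable + unreclaimable_slab


if __name__ == "__main__":
    # The same workload, first attached to the root, then to a leaf.
    task = Task(usage=100_000, oom_score_adj=-500)
    print("attached to root:", root_score([task]))
    print("attached to leaf:",
          leaf_score(anon=100_000, unevictable=0, unreclaimable_slab=5_000))
```

In this model the same process scores very differently depending only on where it is attached: the negative oom_score_adj shrinks the root's score, while the leaf's score ignores it and additionally counts slab.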