Date: Mon, 12 Mar 2018 17:57:55 -0700 (PDT)
From: David Rientjes
To: Andrew Morton, Roman Gushchin
Cc: Michal Hocko, Vladimir Davydov, Johannes Weiner, Tejun Heo,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch -mm v3 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

Now that each mem cgroup on the system has a memory.oom_policy tunable
to specify oom kill selection behavior, remove the needless "groupoom"
mount option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount
to change.

Instead of enabling the cgroup aware oom killer with the "groupoom"
mount option, set the mem cgroup subtree's memory.oom_policy to
"cgroup".
The heuristic used to select a process or cgroup to kill is controlled
by the oom mem cgroup's memory.oom_policy.  This means that if a
descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.

For example, consider a hierarchy where each mem cgroup has "memory"
set in cgroup.controllers:

	mem cgroup	cgroup.procs
	==========	============
	/cg1		1 process consuming 250MB
	/cg2		3 processes consuming 100MB each
	/cg3/cg31	2 processes consuming 100MB each
	/cg3/cg32	2 processes consuming 100MB each

If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim.  If memory.oom_policy is "cgroup", a
process from /cg2 is chosen because it is in the single indivisible
memory consumer with the greatest usage.  This policy of "cgroup" is
identical to the current "groupoom" mount option, now removed.

Note that /cg3 is not the chosen victim when the oom mem cgroup policy
is "cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage.  This will be addressed in a
follow-up patch.

This has the added benefit of allowing descendant cgroups to control
their own oom policies if they have memory.oom_policy file permissions
without being restricted to the system-wide policy.  In the above
example, /cg2 and /cg3 can be either "none" or "cgroup" with the same
results: the selection heuristic depends only on the policy of the oom
mem cgroup.  If /cg2 or /cg3 themselves are oom, however, the policy is
controlled by their own oom policies, either process aware or cgroup
aware.
Signed-off-by: David Rientjes
---
 Documentation/cgroup-v2.txt | 78 +++++++++++++++++++------------------
 include/linux/cgroup-defs.h |  5 ---
 include/linux/memcontrol.h  |  5 +++
 kernel/cgroup/cgroup.c      | 13 +------
 mm/memcontrol.c             | 17 ++++----
 5 files changed, 55 insertions(+), 63 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1076,6 +1076,17 @@ PAGE_SIZE multiple when read back.
 	  Documentation/filesystems/proc.txt).  This is the same policy as
 	  if memory cgroups were not even mounted.
 
+	  If "cgroup", the OOM killer will compare mem cgroups as indivisible
+	  memory consumers; that is, they will compare mem cgroup usage rather
+	  than process memory footprint. See the "OOM Killer" section below.
+
+	  When an OOM condition occurs, the policy is dictated by the mem
+	  cgroup that is OOM (the root mem cgroup for a system-wide OOM
+	  condition). If a descendant mem cgroup has a policy of "none", for
+	  example, for an OOM condition in a mem cgroup with policy "cgroup",
+	  the heuristic will still compare mem cgroups as indivisible memory
+	  consumers.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined. Unless specified
@@ -1282,43 +1293,36 @@ belonging to the affected files to ensure correct memory ownership.
 OOM Killer
 ~~~~~~~~~~
 
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
-  # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer to kill all processes attached to the cgroup,
-except processes with oom_score_adj set to -1000.
-
-This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
-the memory controller considers only cgroups belonging to the sub-tree
-of the OOM'ing cgroup.
-
-Leaf cgroups and cgroups with oom_group option set are compared based
-on their cumulative memory usage. The root cgroup is treated as a
-leaf memory cgroup as well, so it is compared with other leaf memory
-cgroups. Due to internal implementation restrictions the size of
-the root cgroup is the cumulative sum of oom_badness of all its tasks
-(in other words oom_score_adj of each task is obeyed). Relying on
-oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can lead to over- or
-underestimation of the root cgroup consumption and it is therefore
-discouraged. This might change in the future, however.
-
-If there are no cgroups with the enabled memory controller,
-the OOM killer is using the "traditional" process-based approach.
+Cgroup v2 memory controller implements an optional cgroup-aware out of
+memory killer, which treats cgroups as indivisible OOM entities.
+
+This policy is controlled by memory.oom_policy. When a memory cgroup is
+out of memory, its memory.oom_policy will dictate how the OOM killer will
+select a process, or cgroup, to kill. Likewise, when the system is OOM,
+the policy is dictated by the root mem cgroup.
+
+There are currently two available oom policies:
+
+ - "none": default, choose the largest single memory hogging process to
+   oom kill, as traditionally the OOM killer has always done.
+
+ - "cgroup": choose the cgroup with the largest memory footprint from the
+   subtree as an OOM victim and kill at least one process, depending on
+   memory.oom_group, from it.
+
+When selecting a cgroup as a victim, the OOM killer will kill the process
+with the largest memory footprint. A user can control this behavior by
+enabling the per-cgroup memory.oom_group option. If set, it causes the
+OOM killer to kill all processes attached to the cgroup, except processes
+with /proc/pid/oom_score_adj set to -1000 (oom disabled).
+
+The root cgroup is treated as a leaf memory cgroup as well, so it is
+compared with other leaf memory cgroups. Due to internal implementation
+restrictions the size of the root cgroup is the cumulative sum of
+oom_badness of all its tasks (in other words oom_score_adj of each task
+is obeyed). Relying on oom_score_adj (apart from OOM_SCORE_ADJ_MIN) can
+lead to over- or underestimation of the root cgroup consumption and it is
+therefore discouraged. This might change in the future, however.
 
 Please, note that memory charges are not migrating if tasks
 are moved between different memory cgroups. Moving tasks with
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -81,11 +81,6 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
-
-	/*
-	 * Enable cgroup-aware OOM killer.
-	 */
-	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -64,6 +64,11 @@ enum memcg_oom_policy {
 	 * oom_badness()
 	 */
 	MEMCG_OOM_POLICY_NONE,
+	/*
+	 * Local cgroup usage is used to select a target cgroup, treating each
+	 * mem cgroup as an indivisible consumer
+	 */
+	MEMCG_OOM_POLICY_CGROUP,
 };
 
 struct mem_cgroup_reclaim_cookie {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1732,9 +1732,6 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
-		} else if (!strcmp(token, "groupoom")) {
-			*root_flags |= CGRP_GROUP_OOM;
-			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1751,11 +1748,6 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
-
-		if (root_flags & CGRP_GROUP_OOM)
-			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
-		else
-			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
@@ -1763,8 +1755,6 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
-	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
-		seq_puts(seq, ",groupoom");
 
 	return 0;
 }
@@ -5932,8 +5922,7 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate);
 static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
 			     char *buf)
 {
-	return snprintf(buf, PAGE_SIZE, "nsdelegate\n"
-			"groupoom\n");
+	return snprintf(buf, PAGE_SIZE, "nsdelegate\n");
 }
 
 static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2811,14 +2811,14 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return false;
-
 	if (oc->memcg)
 		root = oc->memcg;
 	else
 		root = root_mem_cgroup;
 
+	if (root->oom_policy != MEMCG_OOM_POLICY_CGROUP)
+		return false;
+
 	select_victim_memcg(root, oc);
 
 	return oc->chosen_memcg;
@@ -5425,9 +5425,6 @@ static int memory_oom_group_show(struct seq_file *m, void *v)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 	bool oom_group = memcg->oom_group;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	seq_printf(m, "%d\n", oom_group);
 
 	return 0;
@@ -5441,9 +5438,6 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	int oom_group;
 	int err;
 
-	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
-		return -ENOTSUPP;
-
 	err = kstrtoint(strstrip(buf), 0, &oom_group);
 	if (err)
 		return err;
@@ -5554,6 +5548,9 @@ static int memory_oom_policy_show(struct seq_file *m, void *v)
 	enum memcg_oom_policy policy = READ_ONCE(memcg->oom_policy);
 
 	switch (policy) {
+	case MEMCG_OOM_POLICY_CGROUP:
+		seq_puts(m, "cgroup\n");
+		break;
 	case MEMCG_OOM_POLICY_NONE:
 	default:
 		seq_puts(m, "none\n");
@@ -5570,6 +5567,8 @@ static ssize_t memory_oom_policy_write(struct kernfs_open_file *of,
 	buf = strstrip(buf);
 	if (!memcmp("none", buf, min(sizeof("none")-1, nbytes)))
 		memcg->oom_policy = MEMCG_OOM_POLICY_NONE;
+	else if (!memcmp("cgroup", buf, min(sizeof("cgroup")-1, nbytes)))
+		memcg->oom_policy = MEMCG_OOM_POLICY_CGROUP;
 	else
 		ret = -EINVAL;