Date: Wed, 1 Aug 2018 07:55:03 +0200
From: Michal Hocko
To: Roman Gushchin
Cc: linux-mm@kvack.org, Johannes Weiner, David Rientjes, Tetsuo Handa,
	Tejun Heo, kernel-team@fb.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm, oom: introduce memory.oom.group
Message-ID: <20180801055503.GB16767@dhcp22.suse.cz>
References: <20180730180100.25079-1-guro@fb.com>
	<20180730180100.25079-4-guro@fb.com>
	<20180731090700.GF4557@dhcp22.suse.cz>
	<20180801011447.GB25953@castle.DHCP.thefacebook.com>
In-Reply-To: <20180801011447.GB25953@castle.DHCP.thefacebook.com>
User-Agent: Mutt/1.10.1 (2018-07-13)

On Tue 31-07-18 18:14:48, Roman Gushchin wrote:
> On Tue, Jul 31, 2018 at 11:07:00AM +0200, Michal Hocko wrote:
> > On Mon 30-07-18 11:01:00, Roman Gushchin wrote:
> > > For some workloads an intervention from the OOM killer
> > > can be painful. Killing a random task can bring
> > > the workload into an inconsistent state.
> > >
> > > Historically, there are two common solutions for this
> > > problem:
> > > 1) enabling panic_on_oom,
> > > 2) using a userspace daemon to monitor OOMs and kill
> > >    all outstanding processes.
> > >
> > > Both approaches have their downsides:
> > > rebooting on each OOM is an obvious waste of capacity,
> > > and handling all in userspace is tricky and requires
> > > a userspace agent, which will monitor all cgroups
> > > for OOMs.
> > >
> > > In most cases an in-kernel after-OOM cleaning-up
> > > mechanism can eliminate the necessity of enabling
> > > panic_on_oom. Also, it can simplify the cgroup
> > > management for userspace applications.
> > >
> > > This commit introduces a new knob for cgroup v2 memory
> > > controller: memory.oom.group. The knob determines
> > > whether the cgroup should be treated as a single
> > > unit by the OOM killer. If set, the cgroup and its
> > > descendants are killed together or not at all.
> >
> > I do not want to nit pick on wording but unit is not really a good
> > description. I would expect that to mean that the oom killer will
> > consider the unit also when selecting the task and that is not the case.
> > I would be more explicit about this being a single killable entity
> > because it forms an indivisible workload.
> >
> > You can reuse http://lkml.kernel.org/r/20180730080357.GA24267@dhcp22.suse.cz
> > if you want.
>
> Ok, I'll do my best to make it clearer.
>
> > [...]
> > > +/**
> > > + * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
> > > + * @victim: task to be killed by the OOM killer
> > > + * @oom_domain: memcg in case of memcg OOM, NULL in case of system-wide OOM
> > > + *
> > > + * Returns a pointer to a memory cgroup, which has to be cleaned up
> > > + * by killing all belonging OOM-killable tasks.
> >
> > Caller has to call mem_cgroup_put on the returned non-null memcg.
>
> Added.
>
> > > + */
> > > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > > +                                            struct mem_cgroup *oom_domain)
> > > +{
> > > +        struct mem_cgroup *oom_group = NULL;
> > > +        struct mem_cgroup *memcg;
> > > +
> > > +        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > +                return NULL;
> > > +
> > > +        if (!oom_domain)
> > > +                oom_domain = root_mem_cgroup;
> > > +
> > > +        rcu_read_lock();
> > > +
> > > +        memcg = mem_cgroup_from_task(victim);
> > > +        if (!memcg || memcg == root_mem_cgroup)
> > > +                goto out;
> >
> > When can we have memcg == NULL? victim should be always non-NULL.
> > Also why do you need to special case the root_mem_cgroup here. The loop
> > below should handle that just fine no?
>
> Idk, I prefer to keep an explicit root_mem_cgroup check,
> rather than traversing the tree and relying on an inability
> to set oom_group on the root.

I will not insist but this just makes the code harder to read.
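To make the exchange above easier to follow, here is a rough illustration of
the kind of upward walk being discussed. It is a sketch, not the patch's code:
the helper name highest_oom_group is invented for this illustration,
parent_mem_cgroup() is the existing memcg helper, and the oom_group flag on
struct mem_cgroup is assumed from the patch description.

/*
 * Sketch only: find the highest ancestor of @memcg (stopping at
 * @oom_domain) that has oom_group set.  Returns NULL if no ancestor
 * opted in, which also covers the root memcg as long as oom_group
 * can never be set on the root.
 */
static struct mem_cgroup *highest_oom_group(struct mem_cgroup *memcg,
                                            struct mem_cgroup *oom_domain)
{
        struct mem_cgroup *oom_group = NULL;

        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                if (memcg->oom_group)
                        oom_group = memcg;
                if (memcg == oom_domain)
                        break;
        }
        return oom_group;
}

With a walk of this shape the explicit root_mem_cgroup check above is
redundant, provided oom_group can never be set on the root, which is exactly
the readability-versus-explicitness trade-off being debated here.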

[...]

> > > +        if (oom_group) {
> >
> > we want a printk explaining that we are going to tear down the whole
> > oom_group here.
>
> Does this look good?
> Or it's better to remove "memory." prefix?
>
> [ 52.835327] Out of memory: Kill process 1221 (allocate) score 241 or sacrifice child
> [ 52.836625] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
> [ 52.841431] Tasks in /A1 are going to be killed due to memory.oom.group set

Yes, looks good to me.

> [ 52.869439] Killed process 1217 (allocate) total-vm:2052344kB, anon-rss:1704036kB, file-rss:0kB, shmem-rss:0kB
> [ 52.875601] Killed process 1218 (allocate) total-vm:106668kB, anon-rss:24668kB, file-rss:0kB, shmem-rss:0kB
> [ 52.882914] Killed process 1219 (allocate) total-vm:106668kB, anon-rss:21528kB, file-rss:0kB, shmem-rss:0kB
> [ 52.891806] Killed process 1220 (allocate) total-vm:2257144kB, anon-rss:1984120kB, file-rss:4kB, shmem-rss:0kB
> [ 52.903770] Killed process 1221 (allocate) total-vm:2257144kB, anon-rss:2009128kB, file-rss:4kB, shmem-rss:0kB
> [ 52.905574] Killed process 1222 (allocate) total-vm:2257144kB, anon-rss:2063640kB, file-rss:0kB, shmem-rss:0kB
> [ 53.202153] oom_reaper: reaped process 1222 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
>
> > > +                mem_cgroup_scan_tasks(oom_group, oom_kill_memcg_member, NULL);
> > > +                mem_cgroup_put(oom_group);
> > > +        }
> > > }
> >
> > Other than that looks good to me. My concern that the previous
> > implementation was more consistent because we were comparing memcgs
> > still holds but if there is no way forward that direction this should be
> > acceptable as well.
> >
> > After above small things are addressed you can add
> > Acked-by: Michal Hocko
>
> Thank you!

-- 
Michal Hocko
SUSE Labs
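As a usage note for the knob discussed in this thread: once the series is
applied, enabling group OOM killing amounts to writing "1" to memory.oom.group
in the target cgroup's directory on a cgroup v2 mount. The sketch below is
illustrative only; the /sys/fs/cgroup mount point and the A1 cgroup name
(borrowed from the log above) are assumptions rather than anything mandated by
the patch.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Assumed paths: cgroup v2 mounted at /sys/fs/cgroup, cgroup named A1. */
        const char *path = "/sys/fs/cgroup/A1/memory.oom.group";
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open memory.oom.group");
                return 1;
        }
        /* "1": on OOM, kill every task in this cgroup and its descendants. */
        if (write(fd, "1", 1) != 1) {
                perror("write");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}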