Date: Thu, 2 Aug 2018 13:21:14 +0200
From: Michal Hocko
To: Tetsuo Handa
Cc: Roman Gushchin, linux-mm@kvack.org, Johannes Weiner, David Rientjes,
	Tejun Heo, kernel-team@fb.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 3/3] mm, oom: introduce memory.oom.group
Message-ID: <20180802112114.GG10808@dhcp22.suse.cz>
References: <20180802003201.817-1-guro@fb.com>
	<20180802003201.817-4-guro@fb.com>
	<879f1767-8b15-4e83-d9ef-d8df0e8b4d83@i-love.sakura.ne.jp>
In-Reply-To: <879f1767-8b15-4e83-d9ef-d8df0e8b4d83@i-love.sakura.ne.jp>
On Thu 02-08-18 19:53:13, Tetsuo Handa wrote:
> On 2018/08/02 9:32, Roman Gushchin wrote:
[...]
> > +struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
> > +                                            struct mem_cgroup *oom_domain)
> > +{
> > +        struct mem_cgroup *oom_group = NULL;
> > +        struct mem_cgroup *memcg;
> > +
> > +        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +                return NULL;
> > +
> > +        if (!oom_domain)
> > +                oom_domain = root_mem_cgroup;
> > +
> > +        rcu_read_lock();
> > +
> > +        memcg = mem_cgroup_from_task(victim);
> 
> Isn't this racy? I guess that the memcg of this "victim" can change to
> somewhere else from the one it had when the final candidate was
> determined.

How is this any different from the existing code? We select a victim
and then kill it. The victim might move away and won't be part of the
oom memcg anymore but we will still kill it. I do not remember this
ever being a problem. Migration is a privileged operation. If you
loosen this restriction you shouldn't allow moving outside of the oom
domain.

> This "victim" might have already passed exit_mm()/cgroup_exit() from
> do_exit().

Why does this matter? The victim hasn't been killed yet, so if it exits
on its own I do not think we really have to tear the whole cgroup down.

> This "victim" might be moving to a memcg which is different from the
> one that determined the final candidate.
> 
> > +        if (memcg == root_mem_cgroup)
> > +                goto out;
> > +
> > +        /*
> > +         * Traverse the memory cgroup hierarchy from the victim task's
> > +         * cgroup up to the OOMing cgroup (or root) to find the
> > +         * highest-level memory cgroup with oom.group set.
> > +         */
> > +        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > +                if (memcg->oom_group)
> > +                        oom_group = memcg;
> > +
> > +                if (memcg == oom_domain)
> > +                        break;
> > +        }
> > +
> > +        if (oom_group)
> > +                css_get(&oom_group->css);
> > +out:
> > +        rcu_read_unlock();
> > +
> > +        return oom_group;
> > +}
> 
> > @@ -974,7 +988,23 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >          }
> >          read_unlock(&tasklist_lock);
> > 
> > +        /*
> > +         * Do we need to kill the entire memory cgroup?
> > +         * Or even one of the ancestor memory cgroups?
> > +         * Check this out before killing the victim task.
> > +         */
> > +        oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
> > +
> >          __oom_kill_process(victim);
> > +
> > +        /*
> > +         * If necessary, kill all tasks in the selected memory cgroup.
> > +         */
> > +        if (oom_group) {
> 
> Isn't "killing a child process of the biggest memory hog" and "killing
> all processes which belong to a memcg which the child process of the
> biggest memory hog belongs to" strange? The intent of selecting a child
> is to try to minimize lost work while the intent of oom_cgroup is to
> try to discard all work. If oom_cgroup is enabled, I feel that we
> should
> 
>     pr_err("%s: Kill all processes in ", message);
>     pr_cont_cgroup_path(memcg->css.cgroup);
>     pr_cont(" due to memory.oom.group set\n");
> 
> without
> 
>     pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
>            message, task_pid_nr(p), p->comm, points);
> 
> (I mean, don't try to select a child).

Well, the child can belong to a different memcg. Whether the heuristic
to pick the child is sensible is another question, and I do not think
it is related to this patchset. The code works as intended, albeit
being questionable.
-- 
Michal Hocko
SUSE Labs
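
For reference, here is a standalone, userspace sketch of the ancestor
walk that the quoted mem_cgroup_get_oom_group() performs: starting
from the victim's cgroup, walk towards the root, keep overwriting the
result with every cgroup that has oom.group set, and stop once the
OOMing domain has been visited, so the highest-level cgroup at or
below the OOM domain wins. The types and function names (struct cg,
find_oom_group) are illustrative only, not kernel API.

	/* Standalone model of the memory.oom.group ancestor walk. */
	#include <stdbool.h>
	#include <stdio.h>

	struct cg {
		const char *name;
		struct cg *parent;	/* NULL for the root */
		bool oom_group;		/* memory.oom.group */
	};

	static struct cg *find_oom_group(struct cg *victim_cg,
					 struct cg *oom_domain)
	{
		struct cg *oom_group = NULL;
		struct cg *cg;

		for (cg = victim_cg; cg; cg = cg->parent) {
			if (cg->oom_group)
				oom_group = cg;	/* highest match wins */
			if (cg == oom_domain)
				break;		/* never above the OOM domain */
		}
		return oom_group;
	}

	int main(void)
	{
		struct cg root = { "/",      NULL,  false };
		struct cg a    = { "/a",     &root, true  };	/* oom.group=1 */
		struct cg b    = { "/a/b",   &a,    false };
		struct cg c    = { "/a/b/c", &b,    true  };	/* oom.group=1 */

		/* OOM scoped to /a, victim lives in /a/b/c: /a is selected. */
		struct cg *grp = find_oom_group(&c, &a);

		printf("kill group: %s\n", grp ? grp->name : "(none)");
		return 0;
	}

In the proposed cgroup v2 interface, the flag modeled by oom_group
above corresponds to the per-cgroup memory.oom.group file (written
with 0 or 1).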