Message-ID: <342c1967-8a68-275c-042e-765d5993157c@redhat.com>
Date: Tue, 4 Apr 2023 13:24:03 -0400
Subject: Re: [PATCH v2] mm: oom: introduce cpuset oom
To: Michal Hocko, Gang Li
Cc: rientjes@google.com, linux-kernel@vger.kernel.org, Zefan Li, cgroups@vger.kernel.org
References: <20230404115509.14299-1-ligang.bdlg@bytedance.com>
From: Waiman Long
X-Mailing-List: linux-kernel@vger.kernel.org

On 4/4/23 10:31, Michal Hocko wrote:
> [CC cpuset people]
>
> On Tue 04-04-23 19:55:09, Gang Li wrote:
>> When a process in cpuset triggers oom, it may kill a completely
>> irrelevant process on another numa node, which will not release any
>> memory for this cpuset.
>>
>> It seems that `CONSTRAINT_CPUSET` is not really doing much these
>> days.
>> Using CONSTRAINT_CPUSET, we can easily achieve node aware oom
>> by selecting victim from all cpusets with the same mems_allowed as
>> the current cpuset.
> This should go into more details about the usecase, testing and ideally
> also spend couple of words about how CONSTRAINT_CPUSET is actually
> implemented because this is not really immediately obvious. An example
> of before/after behavior would have been really nice as well.
>
> You should also go into much more details about how oom victims are
> actually evaluated.
>
>> Suggested-by: Michal Hocko
>> Signed-off-by: Gang Li
>> ---
>> Changes in v2:
>> - Select victim from all cpusets with the same mems_allowed as the current cpuset.
>>   (David Rientjes)
>>
>> v1:
>> - https://lore.kernel.org/all/20220921064710.89663-1-ligang.bdlg@bytedance.com/
>>
>> ---
>>  include/linux/cpuset.h |  6 ++++++
>>  kernel/cgroup/cpuset.c | 28 ++++++++++++++++++++++++++++
>>  mm/oom_kill.c          |  4 ++++
>>  3 files changed, 38 insertions(+)
> As this is a userspace visible change it should also be documented
> somewhere in Documentation.
>
> I am not really familiar with cpusets internals so I cannot really judge
> cpuset_cgroup_scan_tasks implementation.
>
> The oom report should be explicit about this being CPUSET specific oom
> handling so unexpected behavior could be nailed down to this change so I
> do not see a major concern from the oom POV. Nevertheless it would be
> still good to consider whether this should be an opt-in behavior. I
> personally do not see a major problem because most cpuset deployments I
> have seen tend to be well partitioned so the new behavior makes more
> sense.
>
>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
>> index 980b76a1237e..fc244141bd52 100644
>> --- a/include/linux/cpuset.h
>> +++ b/include/linux/cpuset.h
>> @@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>>  	task_unlock(current);
>>  }
>>
>> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
>> +
>>  #else /* !CONFIG_CPUSETS */
>>
>>  static inline bool cpusets_enabled(void) { return false; }
>> @@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
>>  	return false;
>>  }
>>
>> +static inline int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
>> +{
>> +	return 0;
>> +}
>>  #endif /* !CONFIG_CPUSETS */
>>
>>  #endif /* _LINUX_CPUSET_H */
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index bc4dcfd7bee5..b009c98ca19e 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -4013,6 +4013,34 @@ void cpuset_print_current_mems_allowed(void)
>>  	rcu_read_unlock();
>>  }
>>
>> +int cpuset_cgroup_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
>> +{
>> +	int ret = 0;
>> +	struct css_task_iter it;
>> +	struct task_struct *task;
>> +	struct cpuset *cs;
>> +	struct cgroup_subsys_state *pos_css;
>> +
>> +	/*
>> +	 * Situation gets complex with overlapping nodemasks in different cpusets.
>> +	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
>> +	 *
>> +	 * But for now, let's make it simple. Just iterate through all cpusets
>> +	 * with the same mems_allowed as the current cpuset.
>> +	 */
>> +	rcu_read_lock();
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	return ret;
>> +}

You will also need to take cpuset_rwsem to make sure that cpusets are
stable.

BTW, the cpuset_cgroup_scan_tasks() name is kind of redundant. I suggest
you just name it cpuset_scan_tasks(). Please also add a doctext comment
about its purpose and how it should be used.

Cheers,
Longman
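
To make the suggestions above concrete, here is a minimal userspace sketch of the scan pattern. It is not kernel code and not the patch author's implementation. Every name in it (cpuset_model, task_model, model_read_lock(), cpuset_scan_tasks_model(), pick_worst_fn()) is hypothetical. The depth counter only stands in for holding cpuset_rwsem for reading, and mems_allowed is modelled as a plain bitmask. It shows the shape of the review comments: a shorter name, a doc comment, and a lock held for the whole walk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are the kernel's types. */
typedef unsigned long nodemask_model_t;	/* one bit per NUMA node */

struct task_model {
	int pid;
	long badness;			/* made-up score, for illustration only */
};

struct cpuset_model {
	nodemask_model_t mems_allowed;
	struct task_model *tasks;
	size_t nr_tasks;
};

/*
 * Stand-in for taking cpuset_rwsem for reading. This is NOT a real lock;
 * it only records that the "lock" is held so the pattern can be checked.
 */
static int model_read_depth;
static void model_read_lock(void)   { model_read_depth++; }
static void model_read_unlock(void) { model_read_depth--; }

/**
 * cpuset_scan_tasks_model - call @fn on tasks in matching cpusets
 * @sets:         array of modelled cpusets
 * @nr_sets:      number of entries in @sets
 * @current_mems: mems_allowed of the cpuset that triggered the scan
 * @fn:           callback; a non-zero return stops the scan
 * @arg:          opaque pointer passed through to @fn
 *
 * Only cpusets whose mems_allowed equals @current_mems are visited.
 * The read "lock" is held for the whole walk so the set list is stable.
 * Returns 0 if every callback returned 0, else the first non-zero value.
 */
static int cpuset_scan_tasks_model(const struct cpuset_model *sets, size_t nr_sets,
				   nodemask_model_t current_mems,
				   int (*fn)(struct task_model *, void *), void *arg)
{
	int ret = 0;

	assert(fn != NULL);
	model_read_lock();
	for (size_t i = 0; i < nr_sets && !ret; i++) {
		if (sets[i].mems_allowed != current_mems)
			continue;	/* different node mask: not a candidate */
		for (size_t j = 0; j < sets[i].nr_tasks && !ret; j++)
			ret = fn(&sets[i].tasks[j], arg);
	}
	model_read_unlock();
	return ret;
}

/* Example callback: remember the task with the highest made-up score. */
struct pick_worst {
	struct task_model *chosen;
};

static int pick_worst_fn(struct task_model *t, void *arg)
{
	struct pick_worst *p = arg;

	assert(model_read_depth > 0);	/* callbacks run with the lock held */
	if (!p->chosen || t->badness > p->chosen->badness)
		p->chosen = t;
	return 0;			/* keep scanning */
}
```

With this model, a task in a cpuset with a different node mask is never passed to the callback, however high its score, which is the behaviour the patch description asks for. In the kernel the callback would be the OOM killer's own evaluation function rather than pick_worst_fn().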