Date: Wed, 30 Nov 2022 17:27:54 +0100
From: Michal Hocko
To: 程垲涛 Chengkaitao Cheng
Cc: Tao pilgrim, "tj@kernel.org", "lizefan.x@bytedance.com",
 "hannes@cmpxchg.org", "corbet@lwn.net", "roman.gushchin@linux.dev",
 "shakeelb@google.com", "akpm@linux-foundation.org",
 "songmuchun@bytedance.com", "cgel.zte@gmail.com",
 "ran.xiaokai@zte.com.cn", "viro@zeniv.linux.org.uk",
 "zhengqi.arch@bytedance.com", "ebiederm@xmission.com",
 "Liam.Howlett@oracle.com", "chengzhihao1@huawei.com",
 "haolee.swjtu@gmail.com", "yuzhao@google.com", "willy@infradead.org",
 "vasily.averin@linux.dev", "vbabka@suse.cz", "surenb@google.com",
 "sfr@canb.auug.org.au", "mcgrof@kernel.org", "sujiaxun@uniontech.com",
 "feng.tang@intel.com", "cgroups@vger.kernel.org",
 "linux-doc@vger.kernel.org", "linux-kernel@vger.kernel.org",
 "linux-fsdevel@vger.kernel.org", Bagas Sanjaya, "linux-mm@kvack.org",
 Greg Kroah-Hartman
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed
References: <7EF16CB9-C34A-410B-BEBE-0303C1BB7BA0@didiglobal.com>
In-Reply-To: <7EF16CB9-C34A-410B-BEBE-0303C1BB7BA0@didiglobal.com>

On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
> On 2022-11-30 21:15:06, "Michal Hocko" wrote:
> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
> > > From: chengkaitao
> > >
> > > We created a new interface for memory, If there is
> > > the OOM killer under parent memory cgroup, and the memory usage of a
> > > child cgroup is within its effective oom.protect boundary, the cgroup's
> > > tasks won't be OOM killed unless there is no unprotected tasks in other
> > > children cgroups. It draws on the logic of in the
> > > inheritance relationship.
> >
> > Could you be more specific about usecases?

This is a very important question to answer.

> > How do you tune oom.protect
> > wrt to other tunables? How does this interact with the oom_score_adj
> > tuning (e.g. a first hand oom victim with the score_adj 1000 sitting
> > in a oom protected memcg)?
>
> We prefer users to use score_adj and oom.protect independently. Score_adj is
> a parameter applicable to host, and oom.protect is a parameter applicable to cgroup.
> When the physical machine's memory size is particularly large, the score_adj
> granularity is also very large. However, oom.protect can achieve more fine-grained
> adjustment.

Let me clarify a bit. I am not trying to defend oom_score_adj. It has
its well-known limitations and it is essentially unusable for many
situations other than hiding or auto-selecting a potential oom victim.

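The granularity point in the quoted reply can be made concrete. In the
kernel's oom_badness(), oom_score_adj is multiplied by totalpages / 1000
before it is added to a task's points, so one score_adj unit is worth one
thousandth of RAM plus swap. Below is a minimal Python sketch of that
arithmetic, assuming 4 KiB pages. The helper name score_adj_step_bytes is
hypothetical and is not a kernel symbol.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def score_adj_step_bytes(total_bytes: int) -> int:
    """Memory equivalent of a single oom_score_adj unit.

    Mirrors the integer scaling adj * (totalpages / 1000), where
    totalpages covers RAM plus swap.
    """
    totalpages = total_bytes // PAGE_SIZE
    return (totalpages // 1000) * PAGE_SIZE


# Illustrative machine sizes only; they are not taken from the thread.
for gib in (16, 256, 1024):
    step = score_adj_step_bytes(gib << 30)
    print(f"{gib:5d} GiB RAM+swap: one score_adj unit ~ {step / (1 << 20):.1f} MiB")
```

The step grows linearly with machine size, which is the coarseness the
patch author is referring to.
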
> When the score_adj of the processes are the same, I list the following cases
> for explanation,
>
>           root
>            |
>         cgroup A
>        /        \
>  cgroup B      cgroup C
> (task m,n)    (task x,y)
>
> score_adj(all task) = 0;
> oom.protect(cgroup A) = 0;
> oom.protect(cgroup B) = 0;
> oom.protect(cgroup C) = 3G;

How can you enforce protection at C level without any protection at A
level? This would easily allow an arbitrary cgroup to hide from the oom
killer and spill over to other cgroups.

> usage(task m) = 1G
> usage(task n) = 2G
> usage(task x) = 1G
> usage(task y) = 2G
>
> oom killer order of cgroup A: n > m > y > x
> oom killer order of host:     y = n > x = m
>
> If cgroup A is a directory maintained by users, users can use oom.protect
> to protect relatively important tasks x and y.
>
> However, when score_adj and oom.protect are used at the same time, we
> will also consider the impact of both, as expressed in the following formula.
> but I have to admit that it is an unstable result.
> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage

I hope I am not misreading but this has some rather unexpected
properties. First off, bigger memory consumers in a protected memcg are
protected more. Also I would expect the protection discount would be
capped by the actual usage, otherwise an excessive protection
configuration could skew the results considerably.

> > I haven't really read through the whole patch but this struck me odd.
> >
> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> > >      unsigned long totalpages = totalram_pages() + total_swap_pages;
> > >      unsigned long points = 0;
> > >      long badness;
> > > +#ifdef CONFIG_MEMCG
> > > +    struct mem_cgroup *memcg;
> > >
> > > -    badness = oom_badness(task, totalpages);
> > > +    rcu_read_lock();
> > > +    memcg = mem_cgroup_from_task(task);
> > > +    if (memcg && !css_tryget(&memcg->css))
> > > +        memcg = NULL;
> > > +    rcu_read_unlock();
> > > +
> > > +    update_parent_oom_protection(root_mem_cgroup, memcg);
> > > +    css_put(&memcg->css);
> > > +#endif
> > > +    badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
> >
> > the badness means different thing depending on which memcg hierarchy
> > subtree you look at. Scaling based on the global oom could get really
> > misleading.
>
> I also took it into consideration. I planned to change "/proc/pid/oom_score"
> to a writable node. When writing to different cgroup paths, different values
> will be output. The default output is root cgroup. Do you think this idea is
> feasible?

I do not follow. Care to elaborate?
-- 
Michal Hocko
SUSE Labs
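
The two properties noted above about the quoted score formula can be
checked numerically on cgroup C from the example (tasks x and y, 3G local
usage, score_adj 0). This is a rough Python sketch that evaluates the
formula exactly as written in the thread, not the patch's kernel code.
The cap=True branch is a hypothetical reading of "capped by the actual
usage" and is an assumption of this sketch. All quantities are in G.

```python
def score(task_usage, score_adj, totalpage, eprotect, memcg_usage, cap=False):
    """Score formula as quoted in the thread.

    cap=True is a hypothetical variant (an assumption of this sketch):
    the effective protection is limited to the memcg's actual usage.
    """
    if cap:
        eprotect = min(eprotect, memcg_usage)
    # Protection discount, split in proportion to each task's usage.
    discount = eprotect * task_usage / memcg_usage
    return task_usage + score_adj * totalpage - discount


# cgroup C: task x uses 1G, task y uses 2G, local usage 3G, score_adj 0.
for protect in (3, 6):
    for name, usage in (("x", 1), ("y", 2)):
        raw = score(usage, 0, 0, protect, 3)
        capped = score(usage, 0, 0, protect, 3, cap=True)
        print(f"protect={protect}G task {name}: score={raw:+.1f}G capped={capped:+.1f}G")
```

With oom.protect at 6G against 3G of usage, the uncapped score goes
negative and the larger task y ends up below the smaller task x. With
the cap, both stay at zero.
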