Date: Tue, 13 Jun 2023 14:06:44 +0200
From: Michal Hocko
To: Yosry Ahmed
Cc: 程垲涛 Chengkaitao Cheng, "tj@kernel.org", "lizefan.x@bytedance.com",
 "hannes@cmpxchg.org", "corbet@lwn.net", "roman.gushchin@linux.dev",
 "shakeelb@google.com", "akpm@linux-foundation.org", "brauner@kernel.org",
 "muchun.song@linux.dev", "viro@zeniv.linux.org.uk",
 "zhengqi.arch@bytedance.com", "ebiederm@xmission.com",
 "Liam.Howlett@oracle.com", "chengzhihao1@huawei.com",
 "pilgrimtao@gmail.com", "haolee.swjtu@gmail.com", "yuzhao@google.com",
 "willy@infradead.org", "vasily.averin@linux.dev", "vbabka@suse.cz",
 "surenb@google.com", "sfr@canb.auug.org.au", "mcgrof@kernel.org",
 "sujiaxun@uniontech.com", "feng.tang@intel.com",
 "cgroups@vger.kernel.org", "linux-doc@vger.kernel.org",
 "linux-kernel@vger.kernel.org", "linux-fsdevel@vger.kernel.org",
 "linux-mm@kvack.org", David Rientjes
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection
References: <66F9BB37-3BE1-4B0F-8DE1-97085AF4BED2@didiglobal.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> +David Rientjes
>
> On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko wrote:
> >
> > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > [...]
> > > There has been a parallel discussion in the cover letter thread of v4
> > > [1]. To summarize, at Google, we have been using OOM scores to
> > > describe different job priorities in a more explicit way -- regardless
> > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > broken based on memory usage.
> > >
> > > We understand that something like memory.oom.protect has an advantage
> > > in the sense that you can skip killing a process if you know that it
> > > won't free enough memory anyway, but for an environment where multiple
> > > jobs of different priorities are running, we find it crucial to be
> > > able to define strict ordering. Some jobs are simply more important
> > > than others, regardless of their memory usage.
> >
> > I do remember that discussion. I am not a great fan of simple priority
> > based interfaces TBH. It sounds as an easy interface but it hits
> > complications as soon as you try to define a proper/sensible
> > hierarchical semantic. I can see how they might work on leaf memcgs with
> > statically assigned priorities but that sounds like a very narrow
> > usecase IMHO.
>
> Do you mind elaborating the problem with the hierarchical semantics?

Well, let me be more specific. If you have a simple hierarchical numeric
enforcement (assume higher priority more likely to be chosen and the
effective priority to be max(self, max(parents))) then the semantic
itself is straightforward.

I am not really sure about the practical manageability though.
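To make the max(self, max(parents)) rule concrete, here is a minimal Python sketch. The `Memcg` class, its `priority` field, and the sample values are hypothetical and not part of any kernel interface; the only thing taken from the text is that the effective priority is the maximum over a group and all of its ancestors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memcg:
    """Hypothetical stand-in for a memory cgroup with a numeric priority."""
    name: str
    priority: int
    parent: Optional["Memcg"] = None


def effective_priority(cg: Memcg) -> int:
    """Return max(self, max(parents)) by walking up to the root."""
    best = cg.priority
    node = cg.parent
    while node is not None:
        best = max(best, node.priority)
        node = node.parent
    return best


# Illustrative hierarchy: root -> cont_A -> worker (values are made up).
root = Memcg("root", 0)
cont_a = Memcg("cont_A", 5, root)
worker = Memcg("cont_A/worker", 2, cont_a)

# The worker inherits the higher priority of its parent container.
print(effective_priority(worker))
```

Under this rule a child can never end up with a lower effective priority than any of its ancestors, so changing one parent's value silently changes the ordering of its whole sub-tree.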
I have a hard time imagining priority assignment on something like a
shared workload with a more complex hierarchy. For example:

                root
              /  |   \
        cont_A cont_B cont_C

each container running its workload with its own hierarchy structures
that might be rather dynamic during the lifetime. In order to have a
predictable OOM behavior you need to watch and reassign priorities all
the time, no?

> The way it works with our internal implementation is (imo) sensible
> and straightforward from a hierarchy POV. Starting at the OOM memcg
> (which can be root), we recursively compare the OOM scores of the
> children memcgs and pick the one with the lowest score, until we
> arrive at a leaf memcg.

This approach has a strong requirement on the memcg hierarchy
organization. Siblings have to be directly comparable because you cut
off many potential sub-trees this way (e.g. is it easy to tell
whether you want to rule out all system or user slices?).

I can imagine usecases where this could work reasonably well e.g. a set
of workers of a different priority all of them running under a shared
memcg parent. But more involved hierarchies seem more complex because
you always have to keep in mind how the hierarchy is organized to get to
your desired victim.
-- 
Michal Hocko
SUSE Labs
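The recursive descent described in the quoted text, and the way it cuts off sub-trees, can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the internal implementation being discussed: `MemcgNode`, `oom_score`, `usage`, and the slice names are hypothetical, and breaking ties toward the larger memory user is an assumption (the thread only says that ties are broken based on memory usage).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemcgNode:
    """Hypothetical memcg with an OOM score and a memory usage figure."""
    name: str
    oom_score: int
    usage: int = 0
    children: List["MemcgNode"] = field(default_factory=list)


def pick_victim(oom_memcg: MemcgNode) -> MemcgNode:
    """Descend from the OOM memcg, always taking the lowest-scored child."""
    node = oom_memcg
    while node.children:
        # Lowest score wins; on a tie prefer the larger usage (assumption).
        node = min(node.children, key=lambda c: (c.oom_score, -c.usage))
    return node


# Made-up hierarchy with a system slice and a user slice under root.
logger = MemcgNode("system/logger", oom_score=1, usage=50)
system = MemcgNode("system", oom_score=10, children=[logger])
editor = MemcgNode("user/editor", oom_score=8, usage=300)
browser = MemcgNode("user/browser", oom_score=8, usage=700)
user = MemcgNode("user", oom_score=5, children=[editor, browser])
root = MemcgNode("root", oom_score=0, children=[system, user])

print(pick_victim(root).name)
```

In this example the group with the globally lowest score (`system/logger`) is never considered for a root-level OOM, because its parent loses the sibling comparison at the first level. That is the sense in which siblings have to be directly comparable.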