Date: Fri, 9 Dec 2022 09:25:37 +0100
From: Michal Hocko
To: 程垲涛 Chengkaitao Cheng
Cc: chengkaitao, tj@kernel.org, lizefan.x@bytedance.com,
    hannes@cmpxchg.org, corbet@lwn.net, roman.gushchin@linux.dev,
    shakeelb@google.com, akpm@linux-foundation.org,
    songmuchun@bytedance.com, viro@zeniv.linux.org.uk,
    zhengqi.arch@bytedance.com, ebiederm@xmission.com,
    Liam.Howlett@oracle.com, chengzhihao1@huawei.com,
    haolee.swjtu@gmail.com, yuzhao@google.com, willy@infradead.org,
    vasily.averin@linux.dev, vbabka@suse.cz, surenb@google.com,
    sfr@canb.auug.org.au, mcgrof@kernel.org, sujiaxun@uniontech.com,
    feng.tang@intel.com, cgroups@vger.kernel.org,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2] mm: memcontrol: protect the memory in cgroup from being oom killed
In-Reply-To: <114DF8F0-3E68-4F2B-8E35-0943EC2F51AE@didiglobal.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri 09-12-22 05:07:15, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-08 22:23:56, "Michal Hocko" wrote:
[...]
> >oom killer is a memory reclaim of the last resort. So yes, there is some
> >difference but fundamentally it is about releasing some memory. And long
> >term we have learned that the more clever it tries to be the more likely
> >corner cases can happen. It is simply impossible to know the best
> >candidate so this is a just a best effort. We try to aim for
> >predictability at least.
>
> Is the current oom_score strategy predictable? I don't think so. The score_adj
> has broken the predictability of oom_score (it is no longer simply killing the
> process that uses the most mems).

oom_score as reported to userspace already takes oom_score_adj into
account, which means that you can compare processes and get a reasonable
guess at what the current oom victim would be. There is a certain fuzz
level because this is not atomic, and there is no clear candidate when
multiple processes have an equal score. So yes, it is not 100%
predictable. memory.reclaim as you propose doesn't change that though.

Is oom_score_adj a good interface? No, not really. If I could go back in
time I would nack it, but here we are. We have an interface that promises
quite a lot but essentially only supports two usecases
(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
clumsy at best because a real userspace oom policy would have to
re-evaluate the whole oom domain (be it global or memcg oom) as the
memory consumption evolves over time. I am really worried that your
memory.oom.protection is headed on a very similar trajectory because
protection really needs to consider other memcgs to balance properly.

[...]
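To make the "compare processes and guess the victim" point concrete, here is a minimal Python sketch. It assumes a Linux procfs mounted at /proc; the function names read_oom_scores and likely_victims are made up for this illustration and are not part of any kernel or library interface. It only reproduces a userspace guess, with the same fuzz described above: the scan is not atomic, and ties are returned as a list rather than resolved.

```python
from pathlib import Path


def read_oom_scores(proc_root="/proc"):
    """Return {pid: oom_score} for every process that can be read.

    The scan is not atomic: processes may exit or change their memory
    footprint while the directory is being walked.
    """
    scores = {}
    try:
        entries = list(Path(proc_root).iterdir())
    except OSError:
        return scores  # no procfs here (or not readable)
    for entry in entries:
        if not entry.name.isdigit():
            continue  # not a per-process directory
        try:
            scores[int(entry.name)] = int((entry / "oom_score").read_text())
        except (OSError, ValueError):
            continue  # process exited or the file is unreadable
    return scores


def likely_victims(scores):
    """Return all pids sharing the highest score (ties are not resolved)."""
    if not scores:
        return []
    top = max(scores.values())
    return sorted(pid for pid, score in scores.items() if score == top)


if __name__ == "__main__":
    print("pids with the highest oom_score:", likely_victims(read_oom_scores()))
```

A userspace policy that wanted to steer the outcome with oom_score_adj values between the two extremes would have to rerun a scan like this, and rewrite the adjustments, every time memory consumption shifts.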
> > But I am really open
> >to be convinced otherwise and this is in fact what I have been asking
> >for since the beginning. I would love to see some examples on the
> >reasonable configuration for a practical usecase.
>
> Here is a simple example. In a docker container, users can divide all processes
> into two categories (important and normal), and put them in different cgroups.
> One cgroup's oom.protect is set to "max", the other is set to "0". In this way,
> important processes in the container can be protected.

That is effectively oom_score_adj = OOM_SCORE_ADJ_MIN - 1 for all
processes in the important group. I would argue you can achieve a very
similar result by having the process launcher set oom_score_adj and
letting all processes in that important container inherit it. You do not
need any memcg tunable for that.

I am really much more interested in examples where the protection is to
be fine tuned.
-- 
Michal Hocko
SUSE Labs
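For readers who want to see the launcher approach from the reply spelled out, the following is a rough Python sketch, not code from the thread. It assumes Linux, where /proc/self/oom_score_adj accepts values from -1000 to 1000 and the value is inherited by child processes. The helper names clamp_adj and launch_with_adj are hypothetical. Lowering the value typically requires CAP_SYS_RESOURCE, so the write can fail for unprivileged launchers.

```python
import subprocess
import sys

# Range limits from the kernel's uapi header include/uapi/linux/oom.h.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def clamp_adj(value):
    """Clamp a requested adjustment into the range the kernel accepts."""
    return max(OOM_SCORE_ADJ_MIN, min(OOM_SCORE_ADJ_MAX, value))


def launch_with_adj(argv, adj):
    """Start argv with oom_score_adj set before exec.

    The write happens in the child between fork and exec, so the launcher
    keeps its own value while the new process, and everything it forks
    later, starts with the adjusted one.
    """
    def set_adj():
        with open("/proc/self/oom_score_adj", "w") as f:
            f.write(str(clamp_adj(adj)))

    return subprocess.Popen(argv, preexec_fn=set_adj)


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical usage: launcher.py <adj> <command> [args...]
    sys.exit(launch_with_adj(sys.argv[2:], int(sys.argv[1])).wait())
```

Note that preexec_fn is documented as unsafe in multi-threaded launchers; a single-threaded launcher, or one that sets its own oom_score_adj before spawning, avoids that caveat.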