From: Yosry Ahmed
Date: Thu, 25 May 2023 10:19:20 -0700
Subject: Re: [PATCH v4 0/2] memcontrol: support cgroup level OOM protection
To: 程垲涛 Chengkaitao Cheng
Cc: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, corbet@lwn.net, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, akpm@linux-foundation.org, brauner@kernel.org, muchun.song@linux.dev, viro@zeniv.linux.org.uk, zhengqi.arch@bytedance.com, ebiederm@xmission.com, Liam.Howlett@oracle.com, chengzhihao1@huawei.com, pilgrimtao@gmail.com, haolee.swjtu@gmail.com, yuzhao@google.com, willy@infradead.org, vasily.averin@linux.dev, vbabka@suse.cz, surenb@google.com, sfr@canb.auug.org.au, mcgrof@kernel.org, feng.tang@intel.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, David Rientjes

On Thu, May 25, 2023 at 1:19 AM 程垲涛 Chengkaitao Cheng wrote:
>
> At 2023-05-24 06:02:55, "Yosry Ahmed" wrote:
> >On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng wrote:
> >>
> >> At 2023-05-20 06:04:26, "Yosry Ahmed" wrote:
> >> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng wrote:
> >> >>
> >> >> At 2023-05-18 04:42:12, "Yosry Ahmed" wrote:
> >> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >>
> >> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" wrote:
> >> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng wrote:
> >> >> >> >>
> >> >> >>
> >> >> >> Killing processes in order of memory usage cannot effectively protect
> >> >> >> important processes. Killing processes in a user-defined priority order
> >> >> >> will result in a large number of OOM events and still not be able to
> >> >> >> release enough memory. I have been searching for a balance between
> >> >> >> the two methods, so that their shortcomings are not too obvious.
> >> >> >> The biggest advantage of memcg is its tree topology, and I also hope
> >> >> >> to make good use of it.
> >> >> >
> >> >> >For us, killing processes in a user-defined priority order works well.
> >> >> >
> >> >> >It seems that to tune memory.oom.protect you use oom_kill_inherit to
> >> >> >observe how many times this memcg has been killed due to a limit in an
> >> >> >ancestor. Wouldn't it be more straightforward to specify the priority
> >> >> >of protections among memcgs?
> >> >> >
> >> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >> >increase memory.oom.protect for more, based on their importance.
> >> >> >Otherwise, if you increase all of them, then there is no point if all
> >> >> >the memory is protected, right?
> >> >>
> >> >> If all memory in a memcg is protected, its meaning is similar to that of
> >> >> the highest-priority memcg in your approach: it is killed last or never
> >> >> killed.
> >> >
> >> >Makes sense. I believe it gets a bit trickier when you want to
> >> >describe relative ordering between memcgs using memory.oom.protect.
> >>
> >> Actually, my original intention was not to use memory.oom.protect to
> >> achieve relative ordering between memcgs; that was just a feature that
> >> happened to be achievable. My initial idea was to protect a certain
> >> proportion of memory in a memcg from being killed; through this
> >> method, physical memory can be planned sensibly. Both the physical
> >> machine manager and the container manager can add some unimportant
> >> loads beyond the oom.protect limit, greatly improving the memory
> >> oversubscription ratio. In the worst case, the physical machine can
> >> always provide all the memory limited by memory.oom.protect for each memcg.
> >>
> >> On the other hand, I also want to achieve relative ordering of internal
> >> processes within a memcg, not just a unified ordering of all memcgs on
> >> physical machines.
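The proportional-protection idea described in the thread (each memcg shields up to its oom.protect amount of usage, and the OOM killer ranks memcgs by the unprotected remainder) can be sketched as a toy model. This is purely illustrative, with hypothetical names and numbers; it is not the patch's actual kernel logic, which walks the memcg tree.

```python
# Toy model of memory.oom.protect as discussed above: each memcg shields
# up to `protect` bytes of its usage, and the OOM killer ranks memcgs by
# the unprotected remainder. Names and numbers below are hypothetical.

def unprotected(usage, protect):
    """Memory that remains eligible for OOM accounting."""
    return max(usage - protect, 0)

def pick_victim_memcg(memcgs):
    """Pick the memcg with the largest unprotected usage.

    memcgs: dict of name -> (usage_bytes, protect_bytes)
    """
    return max(memcgs, key=lambda name: unprotected(*memcgs[name]))

memcgs = {
    # name: (usage_bytes, protect_bytes)
    "important":   (8 << 30, 8 << 30),  # fully protected, never ranked
    "batch":       (4 << 30, 1 << 30),  # 3 GiB still eligible
    "best-effort": (2 << 30, 0),        # fully eligible
}

print(pick_victim_memcg(memcgs))  # prints "batch": largest unprotected usage
```

Note how the fully protected memcg behaves like a highest-priority job in a score-based scheme: it is only ever reached once everything else's unprotected memory is exhausted, matching the equivalence discussed above.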
> >
> >For us, having a strict priority ordering-based selection is
> >essential. We have different tiers of jobs of different importance,
> >and a job of higher priority should not be killed before a lower
> >priority task if possible, no matter how much memory either of them is
> >using. Protecting memcgs solely based on their usage can be useful in
> >some scenarios, but not in a system where you have different tiers of
> >jobs running with strict priority ordering.
>
> If you want to run with strict priority ordering, it can also be achieved,
> but it may be quite troublesome. The directory structure shown below
> can achieve the goal.
>
>                  root
>                 /    \
>         cgroup A      cgroup B
>      (protect=max)   (protect=0)
>         /    \
>  cgroup C     cgroup D
> (protect=max) (protect=0)
>    /    \
> cgroup E  cgroup F
> (protect=max) (protect=0)
>
> OOM kill order: F > E > C > A

This requires restructuring the cgroup hierarchy, which comes with a lot
of other factors; I don't think that's practically an option.

> As mentioned earlier, "running with strict priority ordering" may raise
> some extreme cases that require the manager to make a choice.

We have been using strict priority ordering in our fleet for many years
now and we depend on it. Some jobs are simply more important than
others, regardless of their usage.

> >>
> >> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >> >relative priority among the memcgs?
> >> >> >
> >> >> >>
> >> >> >> >If this approach works for you (or any other audience), that's great,
> >> >> >> >I can share more details and perhaps we can reach something that we
> >> >> >> >can both use :)
> >> >> >>
> >> >> >> If you have a good idea, please share more details or show some code.
> >> >> >> I would greatly appreciate it.
> >> >> >
> >> >> >The code we have needs to be rebased onto a different version and
> >> >> >cleaned up before it can be shared, but essentially it is as
> >> >> >described.
> >> >> >
> >> >> >(a) All processes and memcgs start with a default score.
> >> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >> >score means higher priority (i.e., a lower score gets killed first).
> >> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >> >score to kill, then among this memcg, it looks for the process with
> >> >> >the lowest score. Ties are broken based on usage, so essentially if
> >> >> >all processes/memcgs have the default score, we fall back to the
> >> >> >current OOM behavior.
> >> >>
> >> >> If memory oversubscription is severe, all processes of the
> >> >> lowest-priority memcg may be killed before any process in another
> >> >> memcg is selected. If there are 1000 processes with almost zero
> >> >> memory usage in the lowest-priority memcg, 1000 futile kill events
> >> >> may occur. To avoid this situation, even for the lowest-priority
> >> >> memcg, I would leave it a very small oom.protect quota.
> >> >
> >> >I checked internally, and this is indeed something that we see from
> >> >time to time. We try to avoid it with userspace OOM killing, but
> >> >it's not 100% effective.
> >> >
> >> >>
> >> >> If faced with two memcgs with the same total memory usage and
> >> >> priority, where memcg A has more processes but less memory usage per
> >> >> process, and memcg B has fewer processes but more memory usage per
> >> >> process, then when OOM occurs, the processes in memcg B may keep
> >> >> being killed until all processes in memcg B are gone, which is
> >> >> unfair to memcg B because memcg A also occupies a large amount of
> >> >> memory.
> >> >
> >> >I believe in this case we will kill one process in memcg B, then the
> >> >usage of memcg A will become higher, so we will pick a process from
> >> >memcg A next.
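The (a)-(c) selection scheme described above can be modeled in a few lines. This is a sketch of the described behavior only, with hypothetical names, scores, and sizes; it is not the actual (unshared) implementation.

```python
# Toy model of the score-based OOM selection described in (a)-(c):
# everything starts at a default score, userspace may override scores,
# the killer picks the lowest-score memcg, then the lowest-score process
# inside it; ties on score are broken by usage (largest usage loses).
# All names and numbers are hypothetical.

DEFAULT_SCORE = 100

def pick(entries):
    """entries: list of (name, score, usage_bytes).

    Lowest score is chosen first; among equal scores, the entry with
    the largest usage is the victim (negated usage in the sort key).
    """
    return min(entries, key=lambda e: (e[1], -e[2]))[0]

memcgs = [
    ("high-prio", 200,           6 << 30),  # high score: protected
    ("default-a", DEFAULT_SCORE, 2 << 30),
    ("default-b", DEFAULT_SCORE, 5 << 30),  # tied score, more usage
]
victim_memcg = pick(memcgs)  # "default-b"

processes_in_victim = [
    ("worker-1", DEFAULT_SCORE, 3 << 30),
    ("worker-2", DEFAULT_SCORE, 1 << 30),
]
victim_process = pick(processes_in_victim)  # "worker-1"
```

With all-default scores this degenerates to usage-based selection, which is the fallback-to-current-behavior property point (c) claims. It also illustrates the self-correcting dynamic mentioned just above: after "worker-1" in memcg B dies, B's usage drops, so the next tie-break can land on memcg A.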
> >>
> >> If there is only one process in memcg A and its memory usage is higher
> >> than that of any single process in memcg B, but the total memory usage
> >> of memcg A is lower than that of memcg B, then if the OOM killer still
> >> chooses the process in memcg A, it may be unfair to memcg A.
> >>
> >> >> Does your approach have these issues? Killing processes in a
> >> >> user-defined priority order is indeed easier and can work well in
> >> >> most cases, but I have been trying to solve the cases that it
> >> >> cannot cover.
> >> >
> >> >The first issue applies to our approach as well. Let me dig up more
> >> >info from our internal teams and get back to you with more details.
>
> --
> Thanks for your comment!
> chengkaitao