From: Chuyi Zhou
Date: Mon, 31 Jul 2023 14:00:22 +0800
Subject: Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY
To: Michal Hocko
Cc: hannes@cmpxchg.org, roman.gushchin@linux.dev, ast@kernel.org,
    daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, wuyun.abel@bytedance.com,
    robin.lu@bytedance.com, muchun.song@linux.dev,
    zhengqi.arch@bytedance.com
References: <20230727073632.44983-1-zhouchuyi@bytedance.com>
    <7347aad5-f25c-6b76-9db5-9f1be3a9f303@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello, Michal

On 2023/7/28 01:23, Michal Hocko wrote:
> On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:
>>
>>
>> On 2023/7/27 16:15, Michal Hocko wrote:
>>> On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
>>>> This patchset tries to add a new bpf prog type and use it to select
>>>> a victim memcg when global OOM is invoked. The main motivation is
>>>> the need for customizable OOM victim selection functionality so that
>>>> we can protect more important apps from the OOM killer.
>>>
>>> This is rather modest to give an idea how the whole thing is supposed to
>>> work. I have looked through patches very quickly but there is no overall
>>> design described anywhere either.
>>>
>>> Please could you give us a high level design description and reasoning
>>> why certain decisions have been made? e.g. why is this limited to the
>>> global oom situation, why is the BPF program forced to operate on memcgs
>>> as entities etc...
>>> Also it would be very helpful to call out limitations of the BPF
>>> program, if there are any.
>>>
>>> Thanks!
>>
>> Hi,
>>
>> Thanks for your advice.
>>
>> The global/memcg OOM victim selection uses the process as the base search
>> granularity. However, we can see a need for cgroup level protection and
>> there's been some discussion[1]. It seems reasonable to consider using memcg
>> as a search granularity in the victim selection algorithm.
>
> Yes, it can be reasonable for some policies but making it central to the
> design is very limiting.
>
>> Besides, it seems a pretty good fit for offloading policy decisions to a BPF
>> program, since BPF is scalable and flexible. That's why the new BPF
>> program operates on memcgs as entities.
>
> I do not follow your line of argumentation here. The same could be
> argued for processes or beans.
>
>> The idea is to let the user choose which leaf in the memcg tree should be
>> selected as the victim. At the first layer, if we choose A, then it protects
>> the memcgs under the B, C, and D subtrees.
>>
>>           root
>>        /  |  \  \
>>       A   B   C  D
>>      /\
>>     E  F
>>
>>
>> Using the BPF prog, we are allowed to compare the OOM priority between
>> two siblings so that we can choose the best victim in each layer.
>
> How is the priority defined and communicated to userspace?
>
>> For example:
>>
>> run_prog(B, C) -> choose B
>> run_prog(B, D) -> choose D
>> run_prog(A, D) -> choose A
>>
>> Once we select A as the victim in the first layer, the victim in the next
>> layer would be selected among A's children. Finally, we select a leaf memcg
>> as the victim.
>
> This sounds like a very specific oom policy and that is fine. But the
> interface shouldn't be bound to any concepts like priorities let alone
> be bound to memcg based selection. Ideally the BPF program should get
> the oom_control as an input and either get a hook to kill process or if
> that is not possible then return an entity to kill (either process or
> set of processes).

Here are two interfaces I can think of. I was wondering if you could give
me some feedback.

1. Add a new hook in select_bad_process(). A program attached to it returns
a set of pids or cgroup_ids pre-selected by a user-defined policy, as
suggested by Roman. We could then use oom_evaluate_task() to find the final
victim among them. It's user-friendly and we can offload the OOM policy to
userspace.

2. Add a new hook in oom_evaluate_task() that returns a score to override
the default oom_badness() return value.
The simplest way to use this is to protect certain processes by setting
the minimum score.

Of course if you have a better idea, please let me know.

Thanks!
---
Chuyi Zhou

>
>> In our scenarios, the impact caused by global OOMs is much more common, so
>> we only considered global in this patchset. But it seems that the idea can
>> also be applied to memcg OOM.
>
> The global and memcg OOMs shouldn't have a different interface. If a
> specific BPF program wants to implement a different policy for global
> vs. memcg OOM then be it but this should be a decision of the said
> program not an inherent limitation of the interface.
>
>>
>> [1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
>>
>> Thanks!
>> --
>> Chuyi Zhou
>
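
For reference, the layer-by-layer selection quoted above (compare siblings
pairwise, then descend into the winner until a leaf is reached) could be
sketched in plain userspace C roughly as follows. This is only an
illustration under assumptions: struct memcg_node, the oom_priority field
and its "lower value is killed first" ordering, and select_victim_memcg()
are hypothetical names, not kernel or patchset code, and run_prog() here is
an ordinary function standing in for the attached BPF program.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memcg; NOT the kernel's struct mem_cgroup. */
struct memcg_node {
	const char *name;
	int oom_priority;		/* assumption: lower value = killed first */
	struct memcg_node **children;
	size_t nr_children;
};

/*
 * Stand-in for the BPF program: given two siblings, return the one that
 * is the better victim. On a tie the first argument is kept.
 */
struct memcg_node *run_prog(struct memcg_node *a, struct memcg_node *b)
{
	return (b->oom_priority < a->oom_priority) ? b : a;
}

/*
 * Walk down from @root one layer at a time. In each layer, compare the
 * siblings pairwise and keep the winner, then continue among the winner's
 * children. Stops at a leaf; returns @root itself if it has no children.
 */
struct memcg_node *select_victim_memcg(struct memcg_node *root)
{
	struct memcg_node *victim = root;

	while (victim->nr_children > 0) {
		struct memcg_node *best = victim->children[0];
		size_t i;

		for (i = 1; i < victim->nr_children; i++)
			best = run_prog(best, victim->children[i]);
		victim = best;
	}
	return victim;
}
```

The sketch visits siblings in array order, which differs from the order of
the run_prog() calls in the example, but the layer winner is the same
whenever the priorities are distinct. Note that it hard-codes exactly the
priority-based, memcg-only policy that Michal points out should not be
baked into the interface.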