Date: Thu, 27 Jul 2023 19:23:13 +0200
From: Michal Hocko
To: Chuyi Zhou
Cc: hannes@cmpxchg.org, roman.gushchin@linux.dev, ast@kernel.org,
    daniel@iogearbox.net, andrii@kernel.org, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, wuyun.abel@bytedance.com,
    robin.lu@bytedance.com
Subject: Re: [RFC PATCH 0/5] mm: Select victim memcg using BPF_OOM_POLICY
References: <20230727073632.44983-1-zhouchuyi@bytedance.com>
    <7347aad5-f25c-6b76-9db5-9f1be3a9f303@bytedance.com>
In-Reply-To: <7347aad5-f25c-6b76-9db5-9f1be3a9f303@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu 27-07-23 20:12:01, Chuyi Zhou wrote:
>
>
> On 2023/7/27 16:15, Michal Hocko wrote:
> > On Thu 27-07-23 15:36:27, Chuyi Zhou wrote:
> > > This patchset tries to add a new bpf prog type and use it to select
> > > a victim memcg when global OOM is invoked.
> > > The main motivation is
> > > the need for customizable OOM victim selection functionality so that
> > > we can protect more important apps from the OOM killer.
> >
> > This is rather modest to give an idea how the whole thing is supposed to
> > work. I have looked through patches very quickly but there is no overall
> > design described anywhere either.
> >
> > Please could you give us a high level design description and reasoning
> > why certain decisions have been made? e.g. why is this limited to the
> > global oom situation, why is the BPF program forced to operate on memcgs
> > as entities etc...
> > Also it would be very helpful to call out limitations of the BPF
> > program, if there are any.
> >
> > Thanks!
>
> Hi,
>
> Thanks for your advice.
>
> The global/memcg OOM victim selection uses process as the base search
> granularity. However, we can see a need for cgroup level protection and
> there's been some discussion[1]. It seems reasonable to consider using memcg
> as a search granularity in the victim selection algorithm.

Yes, it can be reasonable for some policies but making it central to the
design is very limiting.

> Besides, it seems a pretty good fit for offloading policy decisions to a BPF
> program, since BPF is scalable and flexible. That's why the new BPF
> program operates on memcgs as entities.

I do not follow your line of argumentation here. The same could be
argued for processes or beans.

> The idea is to let the user choose which leaf in the memcg tree should be
> selected as the victim. At the first layer, if we choose A, then it protects
> the memcgs under the B, C, and D subtrees.
>
>           root
>        /  |  \  \
>       A   B   C   D
>      /\
>     E  F
>
>
> Using the BPF prog, we are allowed to compare the OOM priority between
> two siblings so that we can choose the best victim in each layer.

How is the priority defined and communicated to the userspace?
> For example:
>
> run_prog(B, C) -> choose B
> run_prog(B, D) -> choose D
> run_prog(A, D) -> choose A
>
> Once we select A as the victim in the first layer, the victim in the next
> layer would be selected among A's children. Finally, we select a leaf memcg
> as the victim.

This sounds like a very specific oom policy and that is fine. But the
interface shouldn't be bound to any concepts like priorities, let alone
be bound to memcg based selection. Ideally the BPF program should get
the oom_control as an input and either get a hook to kill a process or,
if that is not possible, return an entity to kill (either a process or
a set of processes).

> In our scenarios, the impact caused by global OOMs is much more common, so
> we only considered global OOM in this patchset. But it seems that the idea
> can also be applied to memcg OOM.

The global and memcg OOMs shouldn't have a different interface. If a
specific BPF program wants to implement a different policy for global
vs. memcg OOM then so be it, but this should be a decision of the said
program, not an inherent limitation of the interface.

>
> [1]https://lore.kernel.org/lkml/ZIgodGWoC%2FR07eak@dhcp22.suse.cz/
>
> Thanks!
> --
> Chuyi Zhou

-- 
Michal Hocko
SUSE Labs
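
The layer-by-layer selection described in the quoted proposal can be
sketched in userspace Python as below. This only illustrates the traversal
and is not the patchset's code: the Memcg class, select_victim(), the
PRIORITY table and all of its values are hypothetical, and run_prog is a
plain function standing in for the BPF program that compares two siblings.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable, List


@dataclass
class Memcg:
    """Hypothetical stand-in for a memcg node in the cgroup tree."""
    name: str
    children: List["Memcg"] = field(default_factory=list)


# Stand-in for the BPF program: given two siblings, return the one that
# should be preferred as the OOM victim.
Policy = Callable[[Memcg, Memcg], Memcg]


def select_victim(root: Memcg, run_prog: Policy) -> Memcg:
    """Descend from root, picking one child per layer until a leaf is hit."""
    node = root
    while node.children:
        # Pairwise comparison: the winner so far is compared with the
        # next sibling, as in run_prog(B, C), run_prog(B, D), ...
        node = reduce(run_prog, node.children)
    return node


# Assumed priorities for illustration only (lower value = killed first).
PRIORITY = {"A": 1, "B": 3, "C": 4, "D": 2, "E": 5, "F": 2}


def lowest_priority_first(x: Memcg, y: Memcg) -> Memcg:
    return x if PRIORITY[x.name] <= PRIORITY[y.name] else y


# The tree from the quoted diagram: root -> A, B, C, D and A -> E, F.
root = Memcg("root", [
    Memcg("A", [Memcg("E"), Memcg("F")]),
    Memcg("B"),
    Memcg("C"),
    Memcg("D"),
])

victim = select_victim(root, lowest_priority_first)
print(victim.name)
```

With these assumed priorities A wins the first layer and the search then
continues only among A's children, so the B, C and D subtrees are never
examined. The sketch also makes the objection above concrete: both the
notion of priority and the memcg tree walk are baked into the traversal
rather than left to the policy.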