Date: Mon, 7 Aug 2023 10:28:17 -0700
From: Roman Gushchin
To: Michal Hocko
Cc: Chuyi Zhou, hannes@cmpxchg.org, ast@kernel.org, daniel@iogearbox.net,
    andrii@kernel.org, muchun.song@linux.dev, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, wuyun.abel@bytedance.com,
    robin.lu@bytedance.com
Subject: Re: [RFC PATCH 1/2] mm, oom: Introduce bpf_select_task
References: <20230804093804.47039-1-zhouchuyi@bytedance.com>
    <20230804093804.47039-2-zhouchuyi@bytedance.com>
    <866462cf-6045-6239-6e27-45a733aa7daa@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Aug 07, 2023 at 09:04:34AM +0200, Michal Hocko wrote:
> On Mon 07-08-23 10:21:09, Chuyi Zhou wrote:
> >
> >
> > On 2023/8/4 21:34, Michal Hocko wrote:
> > > On Fri 04-08-23 21:15:57, Chuyi Zhou wrote:
> > > [...]
> > > > > +	switch (bpf_oom_evaluate_task(task, oc, &points)) {
> > > > > +		case -EOPNOTSUPP: break; /* No BPF policy */
> > > > > +		case -EBUSY: goto abort; /* abort search process */
> > > > > +		case 0: goto next; /* ignore process */
> > > > > +		default: goto select; /* note the task */
> > > > > +	}

To be honest, I can't say I like it. IMO it's not really using the full
bpf potential and is too attached to the current oom implementation.

First, I'm a bit concerned about implicit restrictions we apply to bpf
programs which will be executed potentially thousands of times under
very heavy memory pressure. We will need to make sure that they don't
allocate (much) memory, don't take any locks which might deadlock with
other memory allocations etc.
It will potentially require hard restrictions on what these programs
can and can't do, and this is something that the bpf community will have
to maintain long-term.

Second, if we're introducing bpf here (which I'm not yet convinced of),
IMO we should use it in a more generic and expressive way. Instead of
adding hooks into the existing oom killer implementation, we can call a
bpf program before invoking the in-kernel oom killer and let it do
whatever it takes to free some memory. E.g. we can provide it with an
API to kill individual tasks as well as all tasks in a cgroup. This
approach is more generic and will allow solving certain problems which
can't be solved by the current oom killer, e.g. deleting files from a
tmpfs instead of killing tasks.

So I think the alternative approach is to provide some sort of an
interface to pre-select oom victims in advance. E.g. on memcg level it
can look like:

  echo PID >> memory.oom.victim_proc

If the list is empty, the default oom killer is invoked. If there are
tasks, the first one is killed on OOM. A similar interface can exist to
choose between sibling cgroups:

  echo CGROUP_NAME >> memory.oom.victim_cgroup

This is just a rough idea.

Thanks!
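
For illustration only, the pre-selection semantics described above (an
empty list falls back to the default oom killer, otherwise the first
listed task is killed) can be modelled in a few lines of user-space C.
This is a rough sketch, not kernel code: struct victim_list,
victim_list_add(), pick_oom_victim() and the MAX_VICTIMS bound are all
hypothetical names invented here, and the sketch assumes the list is a
simple append-only array. It does not address what should happen when a
listed task has already exited.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VICTIMS 16	/* arbitrary bound, assumed for this sketch */

/* Hypothetical model of a per-memcg list of pre-selected victims. */
struct victim_list {
	int pids[MAX_VICTIMS];
	size_t count;
};

/*
 * Models "echo PID >> memory.oom.victim_proc": append a PID.
 * Returns 0 on success, -1 if the (assumed) fixed-size list is full.
 */
static int victim_list_add(struct victim_list *vl, int pid)
{
	assert(vl != NULL);
	if (vl->count >= MAX_VICTIMS)
		return -1;
	vl->pids[vl->count++] = pid;
	return 0;
}

/*
 * Models victim choice at OOM time: if the list is empty, defer to
 * whatever the default oom killer would have picked; otherwise the
 * first listed task is the victim.
 */
static int pick_oom_victim(const struct victim_list *vl, int default_victim)
{
	assert(vl != NULL);
	if (vl->count == 0)
		return default_victim;
	return vl->pids[0];
}
```

With an empty list, pick_oom_victim(&vl, 42) returns the default choice
(42 here); after adding PIDs 100 and 200 in that order it returns 100.
A model for memory.oom.victim_cgroup would presumably look the same with
cgroup names in place of PIDs.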