From: Andrii Nakryiko
Date: Thu, 14 Sep 2023 10:16:06 -0700
Subject: Re: [RFC PATCH bpf-next 2/4] bpf: Introduce process open coded iterator kfuncs
To: Kumar Kartikeya Dwivedi
Cc: Alexei Starovoitov, Chuyi Zhou, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, LKML
References: <20230827072057.1591929-1-zhouchuyi@bytedance.com> <20230827072057.1591929-3-zhouchuyi@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Sep 12, 2023 at 3:21 PM Kumar Kartikeya Dwivedi wrote:
>
> On Wed, 13 Sept 2023 at 00:12, Andrii Nakryiko wrote:
> >
> > On Wed, Sep 6, 2023 at 10:18 AM Alexei Starovoitov wrote:
> > >
> > > On Wed, Sep 6, 2023 at 5:38 AM Chuyi Zhou wrote:
> > > >
> > > > Hello, Alexei.
> > > >
> > > > On 2023/9/6 04:09, Alexei Starovoitov wrote:
> > > > > On Sun, Aug 27, 2023 at 12:21 AM Chuyi Zhou wrote:
> > > > >>
> > > > >> This patch adds kfuncs bpf_iter_process_{new,next,destroy} which allow
> > > > >> creation and manipulation of struct bpf_iter_process in open-coded iterator
> > > > >> style. BPF programs can use these kfuncs or through bpf_for_each macro to
> > > > >> iterate all processes in the system.
> > > > >>
> > > > >> Signed-off-by: Chuyi Zhou
> > > > >> ---
> > > > >>   include/uapi/linux/bpf.h       |  4 ++++
> > > > >>   kernel/bpf/helpers.c           |  3 +++
> > > > >>   kernel/bpf/task_iter.c         | 31 +++++++++++++++++++++++++++++++
> > > > >>   tools/include/uapi/linux/bpf.h |  4 ++++
> > > > >>   tools/lib/bpf/bpf_helpers.h    |  5 +++++
> > > > >>   5 files changed, 47 insertions(+)
> > > > >>
> > > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > >> index 2a6e9b99564b..cfbd527e3733 100644
> > > > >> --- a/include/uapi/linux/bpf.h
> > > > >> +++ b/include/uapi/linux/bpf.h
> > > > >> @@ -7199,4 +7199,8 @@ struct bpf_iter_css_task {
> > > > >>          __u64 __opaque[1];
> > > > >>   } __attribute__((aligned(8)));
> > > > >>
> > > > >> +struct bpf_iter_process {
> > > > >> +        __u64 __opaque[1];
> > > > >> +} __attribute__((aligned(8)));
> > > > >> +
> > > > >>   #endif /* _UAPI__LINUX_BPF_H__ */
> > > > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > > >> index cf113ad24837..81a2005edc26 100644
> > > > >> --- a/kernel/bpf/helpers.c
> > > > >> +++ b/kernel/bpf/helpers.c
> > > > >> @@ -2458,6 +2458,9 @@ BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY)
> > > > >>   BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW)
> > > > >>   BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL)
> > > > >>   BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY)
> > > > >> +BTF_ID_FLAGS(func, bpf_iter_process_new, KF_ITER_NEW)
> > > > >> +BTF_ID_FLAGS(func, bpf_iter_process_next, KF_ITER_NEXT | KF_RET_NULL)
> > > > >> +BTF_ID_FLAGS(func, bpf_iter_process_destroy, KF_ITER_DESTROY)
> > > > >>   BTF_ID_FLAGS(func, bpf_dynptr_adjust)
> > > > >>   BTF_ID_FLAGS(func, bpf_dynptr_is_null)
> > > > >>   BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
> > > > >> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > > > >> index b1bdba40b684..a6717a76c1e0 100644
> > > > >> --- a/kernel/bpf/task_iter.c
> > > > >> +++ b/kernel/bpf/task_iter.c
> > > > >> @@ -862,6 +862,37 @@ __bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it)
> > > > >>          kfree(kit->css_it);
> > > > >>   }
> > > > >>
> > > > >> +struct bpf_iter_process_kern {
> > > > >> +        struct task_struct *tsk;
> > > > >> +} __attribute__((aligned(8)));
> > > > >> +
> > > > >> +__bpf_kfunc int bpf_iter_process_new(struct bpf_iter_process *it)
> > > > >> +{
> > > > >> +        struct bpf_iter_process_kern *kit = (void *)it;
> > > > >> +
> > > > >> +        BUILD_BUG_ON(sizeof(struct bpf_iter_process_kern) != sizeof(struct bpf_iter_process));
> > > > >> +        BUILD_BUG_ON(__alignof__(struct bpf_iter_process_kern) !=
> > > > >> +                                        __alignof__(struct bpf_iter_process));
> > > > >> +
> > > > >> +        rcu_read_lock();
> > > > >> +        kit->tsk = &init_task;
> > > > >> +        return 0;
> > > > >> +}
> > > > >> +
> > > > >> +__bpf_kfunc struct task_struct *bpf_iter_process_next(struct bpf_iter_process *it)
> > > > >> +{
> > > > >> +        struct bpf_iter_process_kern *kit = (void *)it;
> > > > >> +
> > > > >> +        kit->tsk = next_task(kit->tsk);
> > > > >> +
> > > > >> +        return kit->tsk == &init_task ? NULL : kit->tsk;
> > > > >> +}
> > > > >> +
> > > > >> +__bpf_kfunc void bpf_iter_process_destroy(struct bpf_iter_process *it)
> > > > >> +{
> > > > >> +        rcu_read_unlock();
> > > > >> +}
> > > > >
> > > > > This iter can be used in all ctx-s which is nice, but let's
> > > > > make the verifier enforce rcu_read_lock/unlock done by bpf prog
> > > > > instead of doing in the ctor/dtor of iter, since
> > > > > in sleepable progs the verifier won't recognize that body is RCU CS.
> > > > > We'd need to teach the verifier to allow bpf_iter_process_new()
> > > > > inside in_rcu_cs() and make sure there is no rcu_read_unlock
> > > > > while BPF_ITER_STATE_ACTIVE.
> > > > > bpf_iter_process_destroy() would become a nop.
> > > >
> > > > Thanks for your review!
> > > >
> > > > I think bpf_iter_process_{new, next, destroy} should be protected by
> > > > bpf_rcu_read_lock/unlock explicitly whether the prog is sleepable or
> > > > not, right?
> > >
> > > Correct. By explicit bpf_rcu_read_lock() in case of sleepable progs
> > > or just by using them in normal bpf progs that have implicit rcu_read_lock()
> > > done before calling into them.
> > >
> > > > I'm not very familiar with the BPF verifier, but I believe
> > > > there is still a risk in directly calling these kfuncs even if
> > > > in_rcu_cs() is true.
> > > >
> > > > Maybe what we actually need here is to enforce BPF verifier to check
> > > > env->cur_state->active_rcu_lock is true when we want to call these kfuncs.
> > >
> > > active_rcu_lock means explicit bpf_rcu_read_lock.
> > > Currently we do allow bpf_rcu_read_lock in non-sleepable, but it's pointless.
> > >
> > > Technically we can extend the check:
> > >                 if (in_rbtree_lock_required_cb(env) && (rcu_lock || rcu_unlock)) {
> > >                         verbose(env, "Calling bpf_rcu_read_{lock,unlock} in unnecessary rbtree callback\n");
> > >                         return -EACCES;
> > >                 }
> > > to discourage their use in all non-sleepable, but it will break some progs.
> > >
> > > I think it's ok to check in_rcu_cs() to allow bpf_iter_process_*().
> > > If bpf prog adds explicit and unnecessary bpf_rcu_read_lock() around
> > > the iter ops it won't do any harm.
> > > Just need to make sure that rcu unlock logic:
> > >                 } else if (rcu_unlock) {
> > >                         bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
> > >                                 if (reg->type & MEM_RCU) {
> > >                                         reg->type &= ~(MEM_RCU | PTR_MAYBE_NULL);
> > >                                         reg->type |= PTR_UNTRUSTED;
> > >                                 }
> > >                         }));
> > > clears iter state that depends on rcu.
> > >
> > > I thought about changing mark_stack_slots_iter() to do
> > > st->type = PTR_TO_STACK | MEM_RCU;
> > > so that the above clearing logic kicks in,
> > > but it might be better to have something iter specific.
> > > is_iter_reg_valid_init() should probably be changed to
> > > make sure reg->type is not UNTRUSTED.
> > >
> > > Andrii,
> > > do you have better suggestions?
> >
> > What if we just remember inside bpf_reg_state.iter state whether
> > iterator needs to be RCU protected (it's just one bit if we don't
> > allow nesting rcu_read_lock()/rcu_read_unlock(), or we'd need to
> > remember RCU nestedness level), and then when validating iter_next and
> > iter_destroy() kfuncs, check that we are still in RCU-protected region
> > (if we have nestedness, then iter->rcu_nest_level <=
> > cur_rcu_nest_level, if I understand correctly). And if not, provide a
> > clear and nice message.
> >
> > That seems straightforward enough, but am I missing anything subtle?
> >
>
> We also need to ensure one does not do a bpf_rcu_read_unlock and
> bpf_rcu_read_lock again between the iter_new and
> iter_next/iter_destroy calls. Simply checking we are in an RCU
> protected region will pass the verifier in such a case.

Yep, you are right, what I proposed is too naive, of course.

>
> A simple solution might be associating an ID with the RCU CS, so make
> active_rcu_lock a 32-bit ID which is monotonically increasing for each
> new RCU region. Of course, all of this only matters for sleepable
> programs. Then check if id recorded in iter state is same on next and
> destroy.

Yep, I think each RCU region should ideally be tracked separately and
get a unique ID. Kind of like a ref. It is some lifetime/scope, not
necessarily an actual kernel object.

And if/when we have it, we can grab the ID of most nested RCU scope,
associate it with RCU-protected iter, and then make sure that this RCU
scope is active at every next/destroy invocation.