References: <20220722174829.3422466-1-yosryahmed@google.com> <20220722174829.3422466-5-yosryahmed@google.com>
From: Hao Luo
Date: Tue, 2 Aug 2022 15:27:43 -0700
Subject: Re: [PATCH bpf-next v5 4/8] bpf: Introduce cgroup iter
To: Andrii Nakryiko
Cc: Yosry Ahmed, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
    Martin KaFai Lau, Song Liu, Yonghong Song, Tejun Heo, Zefan Li,
    Johannes Weiner, Shuah Khan, Michal Hocko, KP Singh,
    Benjamin Tissoires, John Fastabend, Michal Koutný, Roman Gushchin,
    David Rientjes, Stanislav Fomichev, Greg Thelen, Shakeel Butt,
    linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    bpf@vger.kernel.org, cgroups@vger.kernel.org, Kui-Feng Lee
X-Mailing-List:
linux-kernel@vger.kernel.org

Hi Andrii,

On Mon, Aug 1, 2022 at 8:43 PM Andrii Nakryiko wrote:
>
> On Fri, Jul 22, 2022 at 10:48 AM Yosry Ahmed wrote:
> >
> > From: Hao Luo
> >
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes:
> >
> >  - walking a cgroup's descendants in pre-order.
> >  - walking a cgroup's descendants in post-order.
> >  - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > program is called with cgroup_mutex held.
> >
> > Currently only one session is supported, which means, depending on the
> > volume of data bpf program intends to send to user space, the number
> > of cgroups that can be walked is limited. For example, given the current
> > buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each
> > cgroup, the total number of cgroups that can be walked is 512. This is
> > a limitation of cgroup_iter. If the output data is larger than the
> > buffer size, the second read() will signal EOPNOTSUPP. In order to work
> > around, the user may have to update their program to reduce the volume
> > of data sent to output. For example, skip some uninteresting cgroups.
> > In future, we may extend bpf_iter flags to allow customizing buffer
> > size.
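
As a quick check of the arithmetic in the quoted commit message, here is a minimal sketch. It assumes 4 KiB pages (PAGE_SIZE is architecture dependent) and a fixed-size record per cgroup. The macro names and iter_max_records() are hypothetical, for illustration only; they are not kernel symbols.

```c
#include <stddef.h>

/* Assumption: 4 KiB pages. The real PAGE_SIZE depends on the architecture. */
#define ASSUMED_PAGE_SIZE 4096u
/* Buffer size quoted in the commit message: 8 * PAGE_SIZE. */
#define ASSUMED_ITER_BUF_SIZE (8u * ASSUMED_PAGE_SIZE)

/*
 * Hypothetical helper (not a kernel function): number of fixed-size
 * records that fit into a single-session buffer of buf_size bytes.
 * Returns 0 when record_size is 0 to avoid dividing by zero.
 */
static size_t iter_max_records(size_t buf_size, size_t record_size)
{
	if (record_size == 0)
		return 0;
	return buf_size / record_size;
}
```

Under these assumptions, iter_max_records(ASSUMED_ITER_BUF_SIZE, 64) evaluates to 512, which matches the figure in the commit message. Doubling the per-cgroup record to 128 bytes would halve the limit to 256.
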
> >
> > Signed-off-by: Hao Luo
> > Signed-off-by: Yosry Ahmed
> > Acked-by: Yonghong Song
> > ---
> >  include/linux/bpf.h                           |   8 +
> >  include/uapi/linux/bpf.h                      |  30 +++
> >  kernel/bpf/Makefile                           |   3 +
> >  kernel/bpf/cgroup_iter.c                      | 252 ++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h                |  30 +++
> >  .../selftests/bpf/prog_tests/btf_dump.c       |   4 +-
> >  6 files changed, 325 insertions(+), 2 deletions(-)
> >  create mode 100644 kernel/bpf/cgroup_iter.c
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index a97751d845c9..9061618fe929 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -47,6 +47,7 @@ struct kobject;
> >  struct mem_cgroup;
> >  struct module;
> >  struct bpf_func_state;
> > +struct cgroup;
> >
> >  extern struct idr btf_idr;
> >  extern spinlock_t btf_idr_lock;
> > @@ -1717,7 +1718,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >         int __init bpf_iter_ ## target(args) { return 0; }
> >
> >  struct bpf_iter_aux_info {
> > +       /* for map_elem iter */
> >         struct bpf_map *map;
> > +
> > +       /* for cgroup iter */
> > +       struct {
> > +               struct cgroup *start; /* starting cgroup */
> > +               int order;
> > +       } cgroup;
> >  };
> >
> >  typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index ffcbf79a556b..fe50c2489350 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -87,10 +87,30 @@ struct bpf_cgroup_storage_key {
> >         __u32   attach_type;    /* program attach type (enum bpf_attach_type) */
> >  };
> >
> > +enum bpf_iter_cgroup_traversal_order {
> > +       BPF_ITER_CGROUP_PRE = 0,        /* pre-order traversal */
> > +       BPF_ITER_CGROUP_POST,           /* post-order traversal */
> > +       BPF_ITER_CGROUP_PARENT_UP,      /* traversal of ancestors up to the root */
>
> I've just put up my arguments why it's a good idea to also support a
> "trivial" mode of only traversing specified cgroup and no descendants
> or parents. Please see [0].

cc Kui-Feng in this thread.

Yeah, I think it's a good idea. It's useful when we only want to show
a single object, which can be common. Going further, I think we may
want to restructure bpf_iter to optimize for this case.

> I think the same applies here, especially
> considering that it seems like a good idea to support
> task/task_vma/task_files iteration within a cgroup.

I have reservations about these use cases. I don't see an immediate use
for iterating vma or files within a cgroup. Tasks within a cgroup?
Maybe. :)

> So depending on
> how successful I am in arguing for supporting task iterator with
> target cgroup, I think we should reuse *exactly* this
> bpf_iter_cgroup_traversal_order and how we specify cgroup (FD or ID,
> see some more below) *as is* in task iterators as well. In the latter
> case, having an ability to say "iterate task for only given cgroup" is
> very useful, and for such mode all the PRE/POST/PARENT_UP is just an
> unnecessary nuisance.
>
> So please consider also adding and supporting BPF_ITER_CGROUP_SELF (or
> whatever naming makes most sense).
>

PRE/POST/UP can be reused for iters of tree-structured containers, like
rbtree [1]. SELF can be reused for any iter, like iter/task,
iter/cgroup, etc. Promoting all of them out of the cgroup-specific
struct seems valuable.

[1] https://lwn.net/Articles/902405/

>
> Some more naming nits. I find BPF_ITER_CGROUP_PRE and
> BPF_ITER_CGROUP_POST a bit confusing. Even internally in kernel we
> have css_next_descendant_pre/css_next_descendant_post, so why not
> reflect the fact that we are going to iterate descendants:
> BPF_ITER_CGROUP_DESCENDANTS_{PRE,POST}. And now that we use
> "descendants" terminology, PARENT_UP should be ANCESTORS. ANCESTORS_UP
> probably is fine, but seems a bit redundant (unless we consider a
> somewhat weird ANCESTORS_DOWN, where we find the furthest parent and
> then descend through preceding parents until we reach specified
> cgroup; seems a bit exotic).
>

BPF_ITER_CGROUP_DESCENDANTS_PRE is too verbose. If there is a
possibility of merging rbtree and supporting a walk order for the
rbtree iter, maybe the name here could be general, like
BPF_ITER_DESCENDANTS_PRE, which seems better.

> [0] https://lore.kernel.org/bpf/f92e20e9961963e20766e290ee6668edd4bacf06.camel@fb.com/T/#m5ce50632aa550dd87a99241efb168cbcde1ee98f
>
> > +};
> > +
> >  union bpf_iter_link_info {
> >         struct {
> >                 __u32   map_fd;
> >         } map;
> > +
> > +       /* cgroup_iter walks either the live descendants of a cgroup subtree, or the
> > +        * ancestors of a given cgroup.
> > +        */
> > +       struct {
> > +               /* Cgroup file descriptor. This is root of the subtree if walking
> > +                * descendants; it's the starting cgroup if walking the ancestors.
> > +                * If it is left 0, the traversal starts from the default cgroup v2
> > +                * root. For walking v1 hierarchy, one should always explicitly
> > +                * specify the cgroup_fd.
> > +                */
> > +               __u32   cgroup_fd;
>
> Now, similar to what I argued in regard of pidfd vs pid, I think the
> same applied to cgroup_fd vs cgroup_id. Why can't we support both?
> cgroup_fd has some benefits, but cgroup_id is nice due to simplicity
> and not having to open/close/keep extra FDs (which can add up if we
> want to periodically query something about a large set of cgroups).
> Please see my arguments from [0] above.
>
> Thoughts?
>

We can support both; it's a good idea IMO. But what exactly is the
interface going to look like? Can you be more specific about that?
Below is something I tried based on your description.

@@ -91,6 +91,18 @@ union bpf_iter_link_info {
        struct {
                __u32   map_fd;
        } map;
+       struct {
+               /* PRE/POST/UP/SELF */
+               __u32 order;
+               struct {
+                       __u32 cgroup_fd;
+                       __u64 cgroup_id;
+               } cgroup;
+               struct {
+                       __u32 pid_fd;
+                       __u64 pid;
+               } task;
+       };
 };

> > +               __u32   traversal_order;
> > +       } cgroup;
> >  };
> >
> >  /* BPF syscall commands, see bpf(2) man-page for more details. */
> > [...]
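
For reference, the four walk orders discussed in this thread (PRE, POST, UP, and the proposed SELF) can be illustrated on a toy tree. This is only a userspace sketch: struct toy_node, enum toy_order and toy_walk() are hypothetical names, not the kernel's css iteration helpers or the UAPI enum. It assumes, as the commit message describes for ancestors, that the starting node is itself part of the walk.

```c
#include <stddef.h>

/* Illustration-only walk orders mirroring PRE/POST/UP/SELF. */
enum toy_order { TOY_PRE, TOY_POST, TOY_UP, TOY_SELF };

/* Toy stand-in for a cgroup: parent, first child, next sibling. */
struct toy_node {
	int id;
	struct toy_node *parent;
	struct toy_node *child;
	struct toy_node *sibling;
};

/* Store id in out[] if there is room; always count the visit. */
static void emit(int id, int *out, size_t cap, size_t *n)
{
	if (*n < cap)
		out[*n] = id;
	(*n)++;
}

/* Recursive descendant walk; visits node before or after its children. */
static void walk_desc(const struct toy_node *node, enum toy_order order,
		      int *out, size_t cap, size_t *n)
{
	const struct toy_node *c;

	if (order == TOY_PRE)
		emit(node->id, out, cap, n);
	for (c = node->child; c; c = c->sibling)
		walk_desc(c, order, out, cap, n);
	if (order == TOY_POST)
		emit(node->id, out, cap, n);
}

/* Walk from start in the given order; returns the number of nodes visited. */
static size_t toy_walk(const struct toy_node *start, enum toy_order order,
		       int *out, size_t cap)
{
	const struct toy_node *p;
	size_t n = 0;

	switch (order) {
	case TOY_PRE:
	case TOY_POST:
		walk_desc(start, order, out, cap, &n);
		break;
	case TOY_UP:
		/* Start node first, then each ancestor up to the root. */
		for (p = start; p; p = p->parent)
			emit(p->id, out, cap, &n);
		break;
	case TOY_SELF:
		/* The "trivial" mode: only the specified node. */
		emit(start->id, out, cap, &n);
		break;
	}
	return n;
}
```

For a tree where node 1 has children 2 and 3, and node 2 has child 4, a walk from node 1 visits 1, 2, 4, 3 in TOY_PRE and 4, 2, 3, 1 in TOY_POST; a TOY_UP walk from node 4 visits 4, 2, 1; TOY_SELF from node 2 visits only 2.
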