Received: by 2002:ac0:e34a:0:0:0:0:0 with SMTP id g10csp618618imn; Thu, 28 Jul 2022 10:34:19 -0700 (PDT) X-Google-Smtp-Source: AGRyM1vuXCpGHSq3K7VL1T9F5C9IpDDU99/kM7oiW1KqQUPjAVw0q4g2vge7WOv6Ziwqs+fRJvQl X-Received: by 2002:a05:6a00:134e:b0:52a:d5b4:19bb with SMTP id k14-20020a056a00134e00b0052ad5b419bbmr27639180pfu.45.1659029659141; Thu, 28 Jul 2022 10:34:19 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1659029659; cv=none; d=google.com; s=arc-20160816; b=SIThubVTtoQM+aHGP0NtHNkPfzzZfpsnscAUbG9G7GGTuTw0le6NdKqEEwdzyvIlnL nztMY2oT5GPz1PBjReMMUVzaHWNUw3m7I4n0Kb23sa6I8KEF36JraObFS+baTiL+CKvY d24q8uLxmMnZDjGr0c+5wDJHeTlI8CJuvFDU7vXss5rbTxi17t77qAHqtnqpnGaPPtwD cjalMBIDEUMIhOD7o2fWEjuwiNTZ0QJsFITRc4ll1Wob0VTwbNHMM2/ivgqs2gkmxZhl JJxZ3YAOzjH5oKJWFusVVoPmceB19j22qDyaqLAB/c71TTzGznLr8GAA3pGiZjXb0P4q Ce8Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=rJljmSyv8w2VGn14si1kaNQ7bWs8OfRzJatRimvQO6Y=; b=ZgEYhhhDpPNKF5EKdwCvR53tYx1kvFh71AH4lqnpl7D03Rk6ke+55tgFCZi6Jpv5Ki Jz1zH86WG1x3Vs5Nkw1nfdFYawThvS3XNHjijZJ8MnxZh1nu90EQwZWOTt5D/68cATlv XfTIVS4O3XVZObLP2GznjatpJxTpUjbt7Nns4/76dfGBrFmjd+PBygBExfa0bQwyHUlU NxWJ2bUtD6gi9s4w4DYq5mWLVstMBlBddRscOdKnHKbQEUZNq/x3ZpH+hPdpqjlU81wM 9NeKLmayYmwgHLvya0HI8NyLFfbtf5zLfb4jxMsZ0czhQSy1ZJoGoKv99+n+jYxrG62S q+7Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=CfRo5oBk; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ls8-20020a17090b350800b001f30ceb03fasi1827641pjb.61.2022.07.28.10.34.04; Thu, 28 Jul 2022 10:34:19 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=CfRo5oBk; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229971AbiG1RZX (ORCPT + 99 others); Thu, 28 Jul 2022 13:25:23 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35824 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230438AbiG1RZV (ORCPT ); Thu, 28 Jul 2022 13:25:21 -0400 Received: from mail-qk1-x72d.google.com (mail-qk1-x72d.google.com [IPv6:2607:f8b0:4864:20::72d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 29EC96367 for ; Thu, 28 Jul 2022 10:25:16 -0700 (PDT) Received: by mail-qk1-x72d.google.com with SMTP id g1so1917886qki.7 for ; Thu, 28 Jul 2022 10:25:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=rJljmSyv8w2VGn14si1kaNQ7bWs8OfRzJatRimvQO6Y=; b=CfRo5oBkcztfAwLnAzZIUzpKexeCvxCHJMUJKzRww2U7OEoRyVjvJpexbqWSjQxBdA qL+OEIx8n0OE+R6HQ4z2++cp1adc9RBIWZpjQHxCC6IXy0soY1sQQL21u8IFjIR7oNjt JugLaK4fjGaz2dR3U1bSiPhJUDOVj70Y5UW7dUzdzA2kYz8qFBSUen71AWX+ctzp6S65 vzIcqCTDPa37qDWAmKNZegbLDxgy4fW2TAybnzYvDfd2nJsqc6WNRYsccjdZiJb3Q1uO WxYfnaOByYHvcmwd1Wql+a+RKDG/g4/EN2QSmOiTsp20OWZa9UL2gZbXDZtGnJ1v2b4E ulhw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=rJljmSyv8w2VGn14si1kaNQ7bWs8OfRzJatRimvQO6Y=; b=3R6emMS/pi925S+1mo2VsNB+v3OBf39roznaJhvKDURrahsmj+1UKmLkIsUtYixZ/l /SY252uYcig3UMtHsXD69Ek33PVb9H/QE/Z+uYPQpUwJVVZj6EOhs70C/3MymybOkSnp o+s2S5S/ZAMz4UOYU6/Ffs/hTNsnGfahr0xelI1u+09K6d0PRvRQuuqMGX7odFEqSMVF D9n/xxdDf8/z5xlSWDwgCiUUoX/YbfQ9TJgTpdFRHZzrq6AQSOCLputMTntfWOxPWR2k wDJL77Gp4K2WdBKkL4TNOIIXu8j+2Wie2gKB/T05+ptfJ1wmklgeLrmv5f2UqQ6ivy53 J6aQ== X-Gm-Message-State: AJIora+qWP0HGSn14fL6geXobm11ddLW4cdDEn4Qp4Wq7OZEaoZuUUtW N9emw0B6H9q039lrI+8eRQMruUVLvMHgAxY3y6pelQ== X-Received: by 2002:a05:620a:1927:b0:6b5:fe70:9acc with SMTP id bj39-20020a05620a192700b006b5fe709accmr21200732qkb.669.1659029115045; Thu, 28 Jul 2022 10:25:15 -0700 (PDT) MIME-Version: 1.0 References: <20220722174829.3422466-1-yosryahmed@google.com> <20220722174829.3422466-5-yosryahmed@google.com> In-Reply-To: From: Hao Luo Date: Thu, 28 Jul 2022 10:25:04 -0700 Message-ID: Subject: Re: [PATCH bpf-next v5 4/8] bpf: Introduce cgroup iter To: Yonghong Song Cc: Yosry Ahmed , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Song Liu , Tejun Heo , Zefan Li , Johannes Weiner , Shuah Khan , Michal Hocko , KP Singh , Benjamin Tissoires , John Fastabend , =?UTF-8?Q?Michal_Koutn=C3=BD?= , Roman Gushchin , David Rientjes , Stanislav Fomichev , Greg Thelen , Shakeel Butt , linux-kernel@vger.kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, cgroups@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 27, 2022 at 10:49 PM Yonghong Song wrote: > > > > On 7/22/22 10:48 AM, Yosry Ahmed wrote: > > From: Hao Luo > > > > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes: > > > > - walking a cgroup's descendants in pre-order. > > - walking a cgroup's descendants in post-order. > > - walking a cgroup's ancestors. > > > > When attaching cgroup_iter, one can set a cgroup to the iter_link > > created from attaching. This cgroup is passed as a file descriptor and > > serves as the starting point of the walk. If no cgroup is specified, > > the starting point will be the root cgroup. > > > > For walking descendants, one can specify the order: either pre-order or > > post-order. For walking ancestors, the walk starts at the specified > > cgroup and ends at the root. > > > > One can also terminate the walk early by returning 1 from the iter > > program. > > > > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter > > program is called with cgroup_mutex held. > > > > Currently only one session is supported, which means, depending on the > > volume of data bpf program intends to send to user space, the number > > of cgroups that can be walked is limited. For example, given the current > > buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each > > cgroup, the total number of cgroups that can be walked is 512. This is > > PAGE_SIZE needs to be 4KB in order to conclude that the total number of > walked cgroups is 512. > Sure. Will change that. > > a limitation of cgroup_iter. If the output data is larger than the > > buffer size, the second read() will signal EOPNOTSUPP. In order to work > > around, the user may have to update their program to reduce the volume > > of data sent to output. For example, skip some uninteresting cgroups. > > In future, we may extend bpf_iter flags to allow customizing buffer > > size. > > > > Signed-off-by: Hao Luo > > Signed-off-by: Yosry Ahmed > > Acked-by: Yonghong Song > > --- > > include/linux/bpf.h | 8 + > > include/uapi/linux/bpf.h | 30 +++ > > kernel/bpf/Makefile | 3 + > > kernel/bpf/cgroup_iter.c | 252 ++++++++++++++++++ > > tools/include/uapi/linux/bpf.h | 30 +++ > > .../selftests/bpf/prog_tests/btf_dump.c | 4 +- > > 6 files changed, 325 insertions(+), 2 deletions(-) > > create mode 100644 kernel/bpf/cgroup_iter.c > > This patch cannot apply to bpf-next cleanly, so please rebase > and post again. > Sorry about that. Will do. > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index a97751d845c9..9061618fe929 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -47,6 +47,7 @@ struct kobject; > > struct mem_cgroup; > > struct module; > > struct bpf_func_state; > > +struct cgroup; > > > > extern struct idr btf_idr; > > extern spinlock_t btf_idr_lock; > > @@ -1717,7 +1718,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags); > > int __init bpf_iter_ ## target(args) { return 0; } > > > > struct bpf_iter_aux_info { > > + /* for map_elem iter */ > > struct bpf_map *map; > > + > > + /* for cgroup iter */ > > + struct { > > + struct cgroup *start; /* starting cgroup */ > > + int order; > > + } cgroup; > > }; > > > > typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog, > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index ffcbf79a556b..fe50c2489350 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -87,10 +87,30 @@ struct bpf_cgroup_storage_key { > > __u32 attach_type; /* program attach type (enum bpf_attach_type) */ > > }; > > > > +enum bpf_iter_cgroup_traversal_order { > > + BPF_ITER_CGROUP_PRE = 0, /* pre-order traversal */ > > + BPF_ITER_CGROUP_POST, /* post-order traversal */ > > + BPF_ITER_CGROUP_PARENT_UP, /* traversal of ancestors up to the root */ > > +}; > > + > > union bpf_iter_link_info { > > struct { > > __u32 map_fd; > > } map; > > + > > + /* cgroup_iter walks either the live descendants of a cgroup subtree, or the > > + * ancestors of a given cgroup. > > + */ > > + struct { > > + /* Cgroup file descriptor. This is root of the subtree if walking > > + * descendants; it's the starting cgroup if walking the ancestors. > > + * If it is left 0, the traversal starts from the default cgroup v2 > > + * root. For walking v1 hierarchy, one should always explicitly > > + * specify the cgroup_fd. > > + */ > > I did see how the above cgroup v1/v2 scenarios are enforced. > Do you mean _not_ see? Yosry and I experimented a bit. We found even on systems where v2 is not enabled, cgroup v2 root always exists and can be attached to, and can be iterated on (only trivially). We didn't find a way to tell v1 and v2 apart and deemed a comment to instruct v1 users is fine? > > + __u32 cgroup_fd; > > + __u32 traversal_order; > > + } cgroup; > > }; > > > > /* BPF syscall commands, see bpf(2) man-page for more details. */ > > @@ -6136,6 +6156,16 @@ struct bpf_link_info { > > __u32 map_id; > > } map; > > }; > > + union { > > + struct { > > + __u64 cgroup_id; > > + __u32 traversal_order; > > + } cgroup; > > + }; > > + /* For new iters, if the first field is larger than __u32, > > + * the struct should be added in the second union. Otherwise, > > + * it will create holes before map_id, breaking uapi. > > + */ > > Please put the comment above the union. Let us just say, if > the iter specific field is __u32, it can be put in the first or > second union. Otherwise, it is put in second union. > Ok. > > } iter; > > struct { > > __u32 netns_ino; > > diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile > > index 057ba8e01e70..00e05b69a4df 100644 > > --- a/kernel/bpf/Makefile > > +++ b/kernel/bpf/Makefile > > @@ -24,6 +24,9 @@ endif > > ifeq ($(CONFIG_PERF_EVENTS),y) > > obj-$(CONFIG_BPF_SYSCALL) += stackmap.o > > endif > > +ifeq ($(CONFIG_CGROUPS),y) > > +obj-$(CONFIG_BPF_SYSCALL) += cgroup_iter.o > > +endif > > obj-$(CONFIG_CGROUP_BPF) += cgroup.o > > ifeq ($(CONFIG_INET),y) > > obj-$(CONFIG_BPF_SYSCALL) += reuseport_array.o > > diff --git a/kernel/bpf/cgroup_iter.c b/kernel/bpf/cgroup_iter.c > > new file mode 100644 > > index 000000000000..1027faed0b8b > > --- /dev/null > > +++ b/kernel/bpf/cgroup_iter.c > > @@ -0,0 +1,252 @@ > > +// SPDX-License-Identifier: GPL-2.0-only > > +/* Copyright (c) 2022 Google */ > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +#include "../cgroup/cgroup-internal.h" /* cgroup_mutex and cgroup_is_dead */ > > + > > +/* cgroup_iter provides three modes of traversal to the cgroup hierarchy. > > + * > > + * 1. Walk the descendants of a cgroup in pre-order. > > + * 2. Walk the descendants of a cgroup in post-order. > > + * 2. Walk the ancestors of a cgroup. > > + * > > + * For walking descendants, cgroup_iter can walk in either pre-order or > > + * post-order. For walking ancestors, the iter walks up from a cgroup to > > + * the root. > > + * > > + * The iter program can terminate the walk early by returning 1. Walk > > + * continues if prog returns 0. > > + * > > + * The prog can check (seq->num == 0) to determine whether this is > > + * the first element. The prog may also be passed a NULL cgroup, > > + * which means the walk has completed and the prog has a chance to > > + * do post-processing, such as outputing an epilogue. > > + * > > + * Note: the iter_prog is called with cgroup_mutex held. > > + * > > + * Currently only one session is supported, which means, depending on the > > + * volume of data bpf program intends to send to user space, the number > > + * of cgroups that can be walked is limited. For example, given the current > > + * buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each > > + * cgroup, the total number of cgroups that can be walked is 512. This is > > Again, let us specify PAGE_SIZE = 4KB here. > Ok. > > + * a limitation of cgroup_iter. If the output data is larger than the > > + * buffer size, the second read() will signal EOPNOTSUPP. In order to work > > + * around, the user may have to update their program to reduce the volume > > + * of data sent to output. For example, skip some uninteresting cgroups. > > + */ > > + > > +struct bpf_iter__cgroup { > > + __bpf_md_ptr(struct bpf_iter_meta *, meta); > > + __bpf_md_ptr(struct cgroup *, cgroup); > > +}; > > + > > +struct cgroup_iter_priv { > > + struct cgroup_subsys_state *start_css; > > + bool terminate; > > + int order; > > +}; > > + > > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos) > > +{ > > + struct cgroup_iter_priv *p = seq->private; > > + > > + mutex_lock(&cgroup_mutex); > > + > > + /* cgroup_iter doesn't support read across multiple sessions. */ > > + if (*pos > 0) > > + return ERR_PTR(-EOPNOTSUPP); > > This is not quite right. Let us say, the number of cgroups is 1, > after bpf program run, pos = 1, and the control return to user > space. Now the second read() will return -EOPNOTSUPP which is not > right. -EOPNOTSUPP should be returned ONLY if the previous cgroup > iterations do not traverse all cgroups. > So you might need to record additional information in cgroup_iter_priv > to record such information. > Oh, I missed seeing this scenario. Then we need a flag in cgroup_iter_priv indicating whether iter is happening. Only when during iter hasn't ended, we return -EOPNOTSUPP. > > + > > + ++*pos; > > + p->terminate = false; > > + if (p->order == BPF_ITER_CGROUP_PRE) > > + return css_next_descendant_pre(NULL, p->start_css); > > + else if (p->order == BPF_ITER_CGROUP_POST) > > + return css_next_descendant_post(NULL, p->start_css); > > + else /* BPF_ITER_CGROUP_PARENT_UP */ > > + return p->start_css; > > +} > > + > > +static int __cgroup_iter_seq_show(struct seq_file *seq, > > + struct cgroup_subsys_state *css, int in_stop); > > + > > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v) > > +{ > > + /* pass NULL to the prog for post-processing */ > > + if (!v) > > + __cgroup_iter_seq_show(seq, NULL, true); > > + mutex_unlock(&cgroup_mutex); > > +} > > + > > +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos) > > +{ > > + struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v; > > + struct cgroup_iter_priv *p = seq->private; > > + > > + ++*pos; > > + if (p->terminate) > > + return NULL; > > + > > + if (p->order == BPF_ITER_CGROUP_PRE) > > + return css_next_descendant_pre(curr, p->start_css); > > + else if (p->order == BPF_ITER_CGROUP_POST) > > + return css_next_descendant_post(curr, p->start_css); > > + else > > + return curr->parent; > > +} > > + > [...]