From: Hao Luo
Date: Wed, 3 Aug 2022 17:18:25 -0700
Subject: Re: [PATCH bpf-next v6 4/8] bpf: Introduce cgroup iter
To: Yonghong Song
Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, cgroups@vger.kernel.org,
 netdev@vger.kernel.org, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
 Martin KaFai Lau, Song Liu, Tejun Heo, Zefan Li, KP Singh, Johannes Weiner,
 Michal Hocko, Benjamin Tissoires, John Fastabend, Michal Koutny,
 Roman Gushchin, David Rientjes, Stanislav Fomichev, Shakeel Butt, Yosry Ahmed

On Wed, Aug 3, 2022 at 12:44 AM Yonghong Song wrote:
>
> On 8/1/22 10:54 AM, Hao Luo wrote:
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in three modes:
> >
> > - walking a cgroup's descendants in pre-order.
> > - walking a cgroup's descendants in post-order.
> > - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking the cgroup hierarchy holds cgroup_mutex, the
> > iter program is called with cgroup_mutex held.
> >
> > Currently only one session is supported, which means, depending on the
> > volume of data the bpf program intends to send to user space, the number
> > of cgroups that can be walked is limited. For example, given the current
> > buffer size is 8 * PAGE_SIZE, if the program sends 64B of data for each
> > cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
> > be walked is 512. This is a limitation of cgroup_iter. If the output
> > data is larger than the buffer size, the second read() will signal
> > EOPNOTSUPP. In order to work around this, the user may have to update their
>
> 'the second read() will signal EOPNOTSUPP' is not true. For bpf_iter,
> we have a user buffer from the read() syscall and a kernel buffer. The
> buffer size above, like 8 * PAGE_SIZE, refers to the kernel buffer size.
>
> If the read() syscall buffer size is less than the kernel buffer size,
> the second read() will not signal EOPNOTSUPP.
> So to make it precise, we can say:
>
> If the output data is larger than the kernel buffer size, after
> all data in the kernel buffer is consumed by user space, the
> subsequent read() syscall will signal EOPNOTSUPP.
>

Thanks Yonghong. Will update.

> > program to reduce the volume of data sent to output. For example, skip
> > some uninteresting cgroups. In the future, we may extend bpf_iter flags
> > to allow customizing the buffer size.
> >
> > Acked-by: Yonghong Song
> > Acked-by: Tejun Heo
> > Signed-off-by: Hao Luo
> > ---

[...]

> > + *
> > + * Currently only one session is supported, which means, depending on the
> > + * volume of data the bpf program intends to send to user space, the number
> > + * of cgroups that can be walked is limited. For example, given the current
> > + * buffer size is 8 * PAGE_SIZE, if the program sends 64B of data for each
> > + * cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can
> > + * be walked is 512. This is a limitation of cgroup_iter. If the output data
> > + * is larger than the buffer size, the second read() will signal EOPNOTSUPP.
> > + * In order to work around this, the user may have to update their program to
>
> same here as above for better description.
>

SG. Will update.

> > + * reduce the volume of data sent to output. For example, skip some
> > + * uninteresting cgroups.
> > + */
> > +
> > +struct bpf_iter__cgroup {
> > +	__bpf_md_ptr(struct bpf_iter_meta *, meta);
> > +	__bpf_md_ptr(struct cgroup *, cgroup);
> > +};
> > +
> > +struct cgroup_iter_priv {
> > +	struct cgroup_subsys_state *start_css;
> > +	bool visited_all;
> > +	bool terminate;
> > +	int order;
> > +};
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	struct cgroup_iter_priv *p = seq->private;
> > +
> > +	mutex_lock(&cgroup_mutex);
> > +
> > +	/* cgroup_iter doesn't support read across multiple sessions. */
> > +	if (*pos > 0) {
> > +		if (p->visited_all)
> > +			return NULL;
>
> This looks good.

thanks!
> > +
> > +		/* Haven't visited all, but because cgroup_mutex has dropped,
> > +		 * return -EOPNOTSUPP to indicate incomplete iteration.
> > +		 */
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +	}
> > +
> > +	++*pos;
> > +	p->terminate = false;
> > +	p->visited_all = false;
> > +	if (p->order == BPF_ITER_CGROUP_PRE)
> > +		return css_next_descendant_pre(NULL, p->start_css);
> > +	else if (p->order == BPF_ITER_CGROUP_POST)
> > +		return css_next_descendant_post(NULL, p->start_css);
> > +	else /* BPF_ITER_CGROUP_PARENT_UP */
> > +		return p->start_css;
> > +}
> > +
[...]