From: Yosry Ahmed
Date: Mon, 27 Jun 2022 23:03:26 -0700
Subject: Re: [PATCH bpf-next v2 4/8] bpf: Introduce cgroup iter
To: Yonghong Song
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
    Song Liu, John Fastabend, KP Singh, Hao Luo, Tejun Heo, Zefan Li,
    Johannes Weiner, Shuah Khan, Michal Hocko, Roman Gushchin, David Rientjes,
    Stanislav Fomichev, Greg Thelen, Shakeel Butt, Linux Kernel Mailing List,
    Networking, bpf, Cgroups
References: <20220610194435.2268290-1-yosryahmed@google.com>
    <20220610194435.2268290-5-yosryahmed@google.com>
    <40114462-d5e2-ab07-7af9-5e60180027f9@fb.com>
In-Reply-To: <40114462-d5e2-ab07-7af9-5e60180027f9@fb.com>
X-Mailing-List:
linux-kernel@vger.kernel.org

On Mon, Jun 27, 2022 at 9:14 PM Yonghong Song wrote:
>
>
>
> On 6/10/22 12:44 PM, Yosry Ahmed wrote:
> > From: Hao Luo
> >
> > Cgroup_iter is a type of bpf_iter. It walks over cgroups in two modes:
> >
> >  - walking a cgroup's descendants.
> >  - walking a cgroup's ancestors.
> >
> > When attaching cgroup_iter, one can set a cgroup to the iter_link
> > created from attaching. This cgroup is passed as a file descriptor and
> > serves as the starting point of the walk. If no cgroup is specified,
> > the starting point will be the root cgroup.
> >
> > For walking descendants, one can specify the order: either pre-order or
> > post-order. For walking ancestors, the walk starts at the specified
> > cgroup and ends at the root.
> >
> > One can also terminate the walk early by returning 1 from the iter
> > program.
> >
> > Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
> > program is called with cgroup_mutex held.
> >
> > Signed-off-by: Hao Luo
> > Signed-off-by: Yosry Ahmed
> > ---
> >  include/linux/bpf.h            |   8 ++
> >  include/uapi/linux/bpf.h       |  21 +++
> >  kernel/bpf/Makefile            |   2 +-
> >  kernel/bpf/cgroup_iter.c       | 235 +++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h |  21 +++
> >  5 files changed, 286 insertions(+), 1 deletion(-)
> >  create mode 100644 kernel/bpf/cgroup_iter.c
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 8e6092d0ea956..48d8e836b9748 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -44,6 +44,7 @@ struct kobject;
> >  struct mem_cgroup;
> >  struct module;
> >  struct bpf_func_state;
> > +struct cgroup;
> >
> >  extern struct idr btf_idr;
> >  extern spinlock_t btf_idr_lock;
> > @@ -1590,7 +1591,14 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
> >  	int __init bpf_iter_ ## target(args) { return 0; }
> >
> >  struct bpf_iter_aux_info {
> > +	/* for map_elem iter */
> >  	struct bpf_map *map;
> > +
> > +	/* for cgroup iter */
> > +	struct {
> > +		struct cgroup *start; /* starting cgroup */
> > +		int order;
> > +	} cgroup;
> >  };
> >
> [...]
> > +
> > +static void *cgroup_iter_seq_start(struct seq_file *seq, loff_t *pos)
> > +{
> > +	struct cgroup_iter_priv *p = seq->private;
> > +
> > +	mutex_lock(&cgroup_mutex);
> > +
> > +	/* support only one session */
> > +	if (*pos > 0)
> > +		return NULL;
> > +
> > +	++*pos;
> > +	p->terminate = false;
> > +	if (p->order == BPF_ITER_CGROUP_PRE)
> > +		return css_next_descendant_pre(NULL, p->start_css);
> > +	else if (p->order == BPF_ITER_CGROUP_POST)
> > +		return css_next_descendant_post(NULL, p->start_css);
> > +	else /* BPF_ITER_CGROUP_PARENT_UP */
> > +		return p->start_css;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +				  struct cgroup_subsys_state *css, int in_stop);
> > +
> > +static void cgroup_iter_seq_stop(struct seq_file *seq, void *v)
> > +{
> > +	/* pass NULL to the prog for post-processing */
> > +	if (!v)
> > +		__cgroup_iter_seq_show(seq, NULL, true);
> > +	mutex_unlock(&cgroup_mutex);
> > +}
> > +
> > +static void *cgroup_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > +	struct cgroup_subsys_state *curr = (struct cgroup_subsys_state *)v;
> > +	struct cgroup_iter_priv *p = seq->private;
> > +
> > +	++*pos;
> > +	if (p->terminate)
> > +		return NULL;
> > +
> > +	if (p->order == BPF_ITER_CGROUP_PRE)
> > +		return css_next_descendant_pre(curr, p->start_css);
> > +	else if (p->order == BPF_ITER_CGROUP_POST)
> > +		return css_next_descendant_post(curr, p->start_css);
> > +	else
> > +		return curr->parent;
> > +}
> > +
> > +static int __cgroup_iter_seq_show(struct seq_file *seq,
> > +				  struct cgroup_subsys_state *css, int in_stop)
> > +{
> > +	struct cgroup_iter_priv *p = seq->private;
> > +	struct bpf_iter__cgroup ctx;
> > +	struct bpf_iter_meta meta;
> > +	struct bpf_prog *prog;
> > +	int ret = 0;
> > +
> > +	/* cgroup is dead, skip this element */
> > +	if (css && cgroup_is_dead(css->cgroup))
> > +		return 0;
> > +
> > +	ctx.meta = &meta;
> > +	ctx.cgroup = css ? css->cgroup : NULL;
> > +	meta.seq = seq;
> > +	prog = bpf_iter_get_info(&meta, in_stop);
> > +	if (prog)
> > +		ret = bpf_iter_run_prog(prog, &ctx);
>
> Do we need to do anything special to ensure bpf program gets
> up-to-date stat from ctx.cgroup?

Later patches in the series add a cgroup_flush_rstat() kfunc, which
flushes cgroup stats that use rstat (e.g. memcg stats). It can be called
directly from the bpf program if needed.

It would be better to leave this to the bpf program: flushing the stats
for every cgroup_iter program is an unnecessary cost, since a program
might not access stats at all, or might only access stats that are not
maintained using rstat.

>
> > +
> > +	/* if prog returns > 0, terminate after this element. */
> > +	if (ret != 0)
> > +		p->terminate = true;
> > +
> > +	return 0;
> > +}
> > +
> [...]
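
For readers following the walk modes discussed above, here is a rough
userspace sketch of the three orders (pre-order, post-order, parent-up)
and of early termination when the callback returns non-zero. It is not
the kernel code: struct node, walk(), record() and the other names are
hypothetical stand-ins, the tree is a toy first-child/next-sibling
structure rather than real cgroups, and locking and the final stop
callback with a NULL cgroup are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a cgroup: only the links a walk needs (hypothetical). */
struct node {
	const char *name;
	struct node *parent;
	struct node *child;	/* first child */
	struct node *sibling;	/* next sibling */
};

enum walk_order { WALK_PRE, WALK_POST, WALK_PARENT_UP };

/* Plays the role of the iter program: a non-zero return ends the walk. */
typedef int (*visit_fn)(const struct node *n, void *arg);

/* Append child as the last child of parent. */
void add_child(struct node *parent, struct node *child)
{
	struct node **slot = &parent->child;

	while (*slot)
		slot = &(*slot)->sibling;
	*slot = child;
	child->parent = parent;
}

/* Pre-order successor of pos within root's subtree; NULL pos starts the walk. */
struct node *next_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root;
	if (pos->child)
		return pos->child;
	for (; pos != root; pos = pos->parent)
		if (pos->sibling)
			return pos->sibling;
	return NULL;
}

static struct node *leftmost(struct node *n)
{
	while (n->child)
		n = n->child;
	return n;
}

/* Post-order successor of pos within root's subtree; NULL pos starts the walk. */
struct node *next_post(struct node *pos, struct node *root)
{
	if (!pos)
		return leftmost(root);
	if (pos == root)
		return NULL;
	if (pos->sibling)
		return leftmost(pos->sibling);
	return pos->parent;
}

/* Walk from start in the given order; returns how many nodes were visited. */
int walk(struct node *start, enum walk_order order, visit_fn fn, void *arg)
{
	struct node *pos;
	int visited = 0;

	if (order == WALK_PRE)
		pos = next_pre(NULL, start);
	else if (order == WALK_POST)
		pos = next_post(NULL, start);
	else
		pos = start;

	while (pos) {
		visited++;
		if (fn(pos, arg))	/* terminate after this element */
			break;
		if (order == WALK_PRE)
			pos = next_pre(pos, start);
		else if (order == WALK_POST)
			pos = next_post(pos, start);
		else
			pos = pos->parent;
	}
	return visited;
}

/* Example callback state: records the first letter of each visited name. */
struct trace {
	char buf[32];
	size_t len;
	char stop_at;	/* stop after visiting this letter; 0 means never */
};

void trace_reset(struct trace *t, char stop_at)
{
	memset(t, 0, sizeof(*t));
	t->stop_at = stop_at;
}

int record(const struct node *n, void *arg)
{
	struct trace *t = arg;

	assert(t->len + 1 < sizeof(t->buf));
	t->buf[t->len++] = n->name[0];
	return n->name[0] == t->stop_at;
}
```

With a tree r -> {a -> {c, d}, b}, walking from r visits r a c d b in
pre-order and c d a b r in post-order, and a parent-up walk from d
visits d a r. With stop_at set to 'c', the pre-order walk ends after
r a c.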