From: Hao Luo
Date: Thu, 3 Mar 2022 13:52:12 -0800
Subject: Re: [PATCH bpf-next v1 8/9] bpf: Introduce cgroup iter
To: Yonghong Song
Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Song Liu, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20220225234339.2386398-1-haoluo@google.com> <20220225234339.2386398-9-haoluo@google.com> <20220302224506.jc7jwkdaatukicik@apollo.legion> <20220303030349.drd7mmwtufl45p3u@apollo.legion>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 2, 2022 at 11:34 PM Yonghong Song wrote:
>
>
>
> On 3/2/22 7:03 PM, Kumar Kartikeya Dwivedi wrote:
> > On Thu, Mar 03, 2022 at 07:33:16AM IST, Yonghong Song wrote:
> >>
> >>
> >> On 3/2/22 2:45 PM, Kumar Kartikeya Dwivedi wrote:
> >>> On Sat, Feb 26, 2022 at 05:13:38AM IST, Hao Luo wrote:
> >>>> Introduce a new type of iter prog: cgroup. Unlike other bpf_iter, this
> >>>> iter doesn't iterate a set of kernel objects. Instead, it is supposed to
> >>>> be parameterized by a cgroup id and prints only that cgroup. So one
> >>>> needs to specify a target cgroup id when attaching this iter.
> >>>>
> >>>> The target cgroup's state can be read out via a link of this iter.
> >>>> Typically, we can monitor cgroup creation and deletion using sleepable
> >>>> tracing and use it to create corresponding directories in bpffs and pin
> >>>> a cgroup id parameterized link in the directory. Then we can read the
> >>>> auto-pinned iter link to get cgroup's state. The output of the iter link
> >>>> is determined by the program. See the selftest test_cgroup_stats.c for
> >>>> an example.
> >>>>
> >>>> Signed-off-by: Hao Luo
> >>>> ---
> >>>>  include/linux/bpf.h            |   1 +
> >>>>  include/uapi/linux/bpf.h       |   6 ++
> >>>>  kernel/bpf/Makefile            |   2 +-
> >>>>  kernel/bpf/cgroup_iter.c       | 141 +++++++++++++++++++++++++++++++++
> >>>>  tools/include/uapi/linux/bpf.h |   6 ++
> >>>>  5 files changed, 155 insertions(+), 1 deletion(-)
> >>>>  create mode 100644 kernel/bpf/cgroup_iter.c
[...]
> >>>
> >>> I think in existing iterators, we make a final call to seq_show, with v as NULL,
> >>> is there a specific reason to do it differently for this? There is logic in
> >>> bpf_iter.c to trigger ->stop() callback again when ->start() or ->next() returns
> >>> NULL, to execute BPF program with NULL p, see the comment above stop label.
> >>>
> >>> If you do add the seq_show call with NULL, you'd also need to change the
> >>> ctx_arg_info PTR_TO_BTF_ID to PTR_TO_BTF_ID_OR_NULL.
> >>
> >> Kumar, PTR_TO_BTF_ID should be okay since the show() never takes a NULL
> >> cgroup. But we do have issues for cgroup_iter_seq_stop() which I missed
> >> earlier.
> >>
> >
> > Right, I was thinking whether it should call seq_show for v == NULL case. All
> > other iterators seem to do so, it's a bit different here since it is only
> > iterating over a single cgroup, I guess, but it would be nice to have some
> > consistency.
>
> You are correct that I think it is okay since it only iterates with one
> cgroup. This is different from other cases so far where more than one
> object may be traversed. We may have other future use cases, e.g.,
> one task. I think we can abstract out start()/next()/stop() callbacks
> for such use cases. So it is okay that it is different from other existing
> iterators since they are indeed different.
>

Right. This iter is special. It has a single element, so we don't
really need a preamble and epilogue; those can be coded directly in
the iter program. We can also guarantee that the cgroup passed is
always valid, otherwise we wouldn't invoke show(). So passing
PTR_TO_BTF_ID is fine. I did so mainly to save a null check
inside the prog.

> >
> >> For cgroup_iter, the following is the current workflow:
> >>    start -> not NULL -> show -> next -> NULL -> stop
> >> or
> >>    start -> NULL -> stop
> >>
> >> So for cgroup_iter_seq_stop, the input parameter 'v' will be NULL, so
> >> the cgroup_put() is not actually called, i.e., corresponding cgroup is
> >> not freed.
> >>
> >> There are two ways to fix the issue:
> >>    . call cgroup_put() in next() before return NULL. This way,
> >>      stop() will be a noop.
> >>    . put cgroup_get_from_id() and cgroup_put() in
> >>      bpf_iter_attach_cgroup() and bpf_iter_detach_cgroup().
> >>
> >> I prefer the second approach as it is cleaner.
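
The workflow and the missed cgroup_put() described above can be sketched in
plain userspace C. This is only a toy model under stated assumptions, not the
kernel code: every toy_* name is hypothetical, the reference count is a bare
int, and the read loop is a simplification of what seq_file actually does.
The get_in_start flag switches between "get in start(), put in stop()" (the
v1 behaviour under discussion) and "reference held from attach to detach"
(the second fix).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct cgroup: only a reference count (hypothetical). */
struct toy_cgroup {
    int refcnt;
};

static void toy_get(struct toy_cgroup *cg) { cg->refcnt++; }

static void toy_put(struct toy_cgroup *cg)
{
    assert(cg->refcnt > 0);
    cg->refcnt--;
}

/* Per-link state, loosely modelled on an iterator's private data. */
struct toy_iter {
    struct toy_cgroup *target; /* the single element to show */
    int get_in_start;          /* 1: get in start(); 0: ref held attach..detach */
    int shown;                 /* number of show() calls */
};

static void *toy_start(struct toy_iter *it, long *pos)
{
    if (*pos > 0)
        return NULL;           /* start -> NULL -> stop */
    if (it->get_in_start)
        toy_get(it->target);
    return it->target;
}

static void *toy_next(struct toy_iter *it, void *v, long *pos)
{
    (void)it;
    (void)v;
    ++*pos;
    return NULL;               /* single element: nothing follows */
}

static void toy_show(struct toy_iter *it, void *v)
{
    (void)v;
    it->shown++;
}

static void toy_stop(struct toy_iter *it, void *v)
{
    /* In both flows quoted above, v is NULL here, so the put is skipped. */
    if (it->get_in_start && v)
        toy_put((struct toy_cgroup *)v);
}

/* One read: start -> not NULL -> show -> next -> NULL -> stop. */
static void toy_read_once(struct toy_iter *it)
{
    long pos = 0;
    void *v = toy_start(it, &pos);

    while (v) {
        toy_show(it, v);
        v = toy_next(it, v, &pos);
    }
    toy_stop(it, v);
}

/* References left over after one read when the get happens in start(). */
static int leaked_refs_get_in_start(void)
{
    struct toy_cgroup cg = { .refcnt = 1 };
    struct toy_iter it = { .target = &cg, .get_in_start = 1 };

    toy_read_once(&it);
    return cg.refcnt - 1;
}

/* References left over when the get/put pair lives in attach/detach. */
static int leaked_refs_attach_detach(void)
{
    struct toy_cgroup cg = { .refcnt = 1 };
    struct toy_iter it = { .target = &cg, .get_in_start = 0 };

    toy_get(&cg);              /* attach */
    toy_read_once(&it);
    toy_read_once(&it);
    toy_put(&cg);              /* detach */
    return cg.refcnt - 1;
}
```

In this model leaked_refs_get_in_start() returns 1 (one reference per read is
never dropped) and leaked_refs_attach_detach() returns 0 regardless of how
many reads happen in between.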
> >>

Yeah, the second approach should be fine. I was thinking of holding
the cgroup's reference only while we are actually reading, so that a
cgroup can go away at any time and this iter gets a reference only on
a best-effort basis. Now a reference is held from attach to detach,
but I think that should be fine. Let me test.

> >
> > I think current approach is also not safe if cgroup_id gets reused, right? I.e.
> > it only does cgroup_get_from_id in seq_start, not at attach time, so it may not
> > be the same cgroup when calling read(2). kernfs is using idr_alloc_cyclic, so it
> > is less likely to occur, but since it wraps around to find a free ID it might
> > not be purely theoretical.
>
> As Alexei mentioned, cgroup id is 64-bit, the collision should
> be nearly impossible. Another option is to get a fd from
> the cgroup path, and send the fd to the kernel. This probably
> works.
>

A 64-bit cgroup id should be fine. Using a cgroup path and fd is more
complicated, unnecessarily so IMHO.

> [...]
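
The id-reuse concern and the 64-bit argument can be illustrated with a small
userspace sketch. It is a toy model only: the toy_* names are hypothetical,
the allocator merely imitates the "search from a cursor, wrap around" idea of
cyclic allocation over a deliberately tiny id space, and the years estimate
assumes ids are handed out one after another across the full 64-bit range,
which is a simplification rather than a statement about kernfs internals.

```c
#include <assert.h>

#define TOY_ID_SPACE 4 /* deliberately tiny so wrap-around is visible */

struct toy_idr {
    int in_use[TOY_ID_SPACE];
    int owner[TOY_ID_SPACE]; /* label of the object holding each id */
    int next;                /* cursor: where the next search starts */
};

/* Cyclic allocation: search from the cursor, wrapping once. -1 if full. */
static int toy_alloc_cyclic(struct toy_idr *idr, int owner)
{
    for (int i = 0; i < TOY_ID_SPACE; i++) {
        int id = (idr->next + i) % TOY_ID_SPACE;

        if (!idr->in_use[id]) {
            idr->in_use[id] = 1;
            idr->owner[id] = owner;
            idr->next = (id + 1) % TOY_ID_SPACE;
            return id;
        }
    }
    return -1;
}

static void toy_free(struct toy_idr *idr, int id)
{
    assert(id >= 0 && id < TOY_ID_SPACE);
    idr->in_use[id] = 0;
}

/* Returns the owner label for id, or -1 if the id is not allocated. */
static int toy_lookup(const struct toy_idr *idr, int id)
{
    return idr->in_use[id] ? idr->owner[id] : -1;
}

/*
 * Save an id at "attach" time, let the object die, churn through the id
 * space, then resolve the saved id at "read" time. Returns the owner found.
 */
static int stale_id_resolves_to(void)
{
    struct toy_idr idr = { 0 };
    int saved = toy_alloc_cyclic(&idr, 100); /* object 100 gets the id */

    toy_free(&idr, saved);                   /* object 100 goes away */

    /* Create and remove other objects until the cursor wraps. */
    for (int owner = 101; owner < 101 + TOY_ID_SPACE - 1; owner++)
        toy_free(&idr, toy_alloc_cyclic(&idr, owner));

    toy_alloc_cyclic(&idr, 200);             /* lands on the saved id */
    return toy_lookup(&idr, saved);
}

/* Years needed to hand out 2^64 ids at a constant allocation rate. */
static double years_to_exhaust_u64_ids(double allocs_per_sec)
{
    return 18446744073709551616.0 / allocs_per_sec / (365.0 * 24.0 * 3600.0);
}
```

With four ids, stale_id_resolves_to() returns 200: the saved id now names a
different object, which is the hazard of resolving the id only at read time.
With 2^64 ids and one million allocations per second, the same wrap-around
would take more than 500,000 years under the sequential-allocation assumption.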