Received: by 2002:a05:6a10:1a4d:0:0:0:0 with SMTP id nk13csp5482861pxb; Mon, 7 Feb 2022 03:07:10 -0800 (PST) X-Google-Smtp-Source: ABdhPJyFFuz3bZtZMWjccCyNBwo8CJQjeLs1UXocjwpwrYMyRxEJ2N2POHmZrY36mrsaxd8Rvr31 X-Received: by 2002:a63:e94e:: with SMTP id q14mr9001281pgj.376.1644232030485; Mon, 07 Feb 2022 03:07:10 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1644232030; cv=none; d=google.com; s=arc-20160816; b=q1GOGpQcNJCBhD5iFkjLea6chbD2+E/KQIpvi6u2tulPual8wK3C6Ihyqq4ACs8GbT WZ/is69BFNCubAk3YitMcwzJPUGPRuPEBrPmKiYxtcw2pgAqi9Yaw4+z2JEl/KrxEIsd 7Deq5ZVPegUYbfeWqzqmp8FuQ0Mxd5T3axgpcxhfugZQKbnXHZ5Zyv+HNYoyUYC8rZFk JAf+LL/rZ1R5rBMygvd/hyLRfEiS6GR+GK2T5oC1hpAb4S8VMiW0PIRqNHxBjaSiTlcX Xdob3ADQKJdZaBTGxHHwuvXNoe+ZhQZvQQOJjaZ/StlIqUEElYB1POWkztyCB2KEIZ28 pJqw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=jeWAra30FXkZs/gRmR49z6lt/T+9ynnjq9B+AWcFc0M=; b=xRgJyYYBWq7zZx8rNPyZ4sylhiV0e5cN0GcQP7IjQBOj5Ks6w+2boI/baEW6SzV3eE /BXqAYPJGREboWYtU3P+cyjPisH2O0momWWzR0bhAzMTC9Hi5z2OOYr7kFbeaeNCSwqO wnfzyZrDRoCkimh+btgfqGjl4tOoW7rsPoKK0StmW+VUmREvkr/1WAKzuptqqMuNrq7B PyZyvoxoCiV2KKxAj3gNCum7rnABGwjiyehOjYVOTRfBaqTQmeH32nIaVKorAkZI1mWh gy++X+VaQyc2WhHX3NayFLczVeR6hvTR1NYv/PUwTXsJq64Dz49xwy9qubYhI3cxrUCI oMDA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=LOIf1tBu; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id v184si9362675pgd.712.2022.02.07.03.06.57; Mon, 07 Feb 2022 03:07:10 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=LOIf1tBu; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1377517AbiBDS1N (ORCPT + 99 others); Fri, 4 Feb 2022 13:27:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45786 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1377511AbiBDS1M (ORCPT ); Fri, 4 Feb 2022 13:27:12 -0500 Received: from mail-qk1-x732.google.com (mail-qk1-x732.google.com [IPv6:2607:f8b0:4864:20::732]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E21BEC06173D for ; Fri, 4 Feb 2022 10:27:11 -0800 (PST) Received: by mail-qk1-x732.google.com with SMTP id c189so5422776qkg.11 for ; Fri, 04 Feb 2022 10:27:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=jeWAra30FXkZs/gRmR49z6lt/T+9ynnjq9B+AWcFc0M=; b=LOIf1tBuS1tAG6bCCiNP8m5VoC8AK9dgPZpaYYC8CMFPy2FtTuNt3HOG/39ywfFeXf 2SvwmqypOW2WfYOvprujZf3zzwP7EZlK6Tw3mksOgbimD03VlJ4E2UTarrCm3HBAsDmp 4DFmgTvJAJGS+bcOFGbM78sNuJ2xoCUsTqfsB4AD0F+L/YHeSKzm7Rr5uQMVIpEiqVr6 UFVW2VMOW9OTwQEmkjRKzUTUGC4KESydLa4s3Te4ai8t98QE11qc9ljjjW8ZVTgFnvbW 4u0WLxTWCwZezv2TYyo0d7dEK1Ejnt98unT3m6/txxJOAPuQ5QAugdqnsGnXMsrYuJjA 7k6w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=jeWAra30FXkZs/gRmR49z6lt/T+9ynnjq9B+AWcFc0M=; b=dXKoVhrO/y2LGxjTC1G09za2yWxfSpnjQ1rex2OPEQ4d4ELlVemEk+g+v9MrM8a2cF o27P7jKW/L3EygR/NQ5tgpJf7RoWEb9M0N41k/7jqhRqTmEZ5y6Gi5AaDQFm7cblpkSD x5pcEbgoRYRUeR6u86R/uJZUCn9IrZI/osbsXR16gdPCiRNw5rjKA3fm2gE5b29hsm6x d9F0l1lmq9fygmPQ34QFM4OVdTo7Mq3Cga/znxPBzVsqON+NyohhOk9L1OYHTnQ1G2dK ik9KquAc4kjwhAEwZp0Jo3kEBl/RlnlbzF6nHf5ZiK5Xsl4blhsHgI9mwb0bFMuQ8y1q nbFQ== X-Gm-Message-State: AOAM530IIhQlYJ39oPEiBgeuVwNIskLqi7IPRaBh+J5hKGn/TMzMzObL 0yFSJJPH/+CmIB55aMTOK9bF2x8+AuLPZfjAm52T8A== X-Received: by 2002:ae9:ed96:: with SMTP id c144mr241939qkg.221.1643999230713; Fri, 04 Feb 2022 10:27:10 -0800 (PST) MIME-Version: 1.0 References: <20220201205534.1962784-1-haoluo@google.com> <20220201205534.1962784-6-haoluo@google.com> <20220203180414.blk6ou3ccmod2qck@ast-mbp.dhcp.thefacebook.com> In-Reply-To: From: Hao Luo Date: Fri, 4 Feb 2022 10:26:59 -0800 Message-ID: Subject: Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link To: Alexei Starovoitov Cc: Alexei Starovoitov , Andrii Nakryiko , Daniel Borkmann , Martin KaFai Lau , Song Liu , Yonghong Song , KP Singh , Shakeel Butt , Joe Burton , Stanislav Fomichev , bpf , LKML Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 3, 2022 at 7:33 PM Alexei Starovoitov wrote: > > On Thu, Feb 3, 2022 at 2:46 PM Hao Luo wrote: > > > > On Thu, Feb 3, 2022 at 10:04 AM Alexei Starovoitov > > wrote: > > > > > > On Tue, Feb 01, 2022 at 12:55:34PM -0800, Hao Luo wrote: > > > > + > > > > +SEC("iter/cgroup_view") > > > > +int dump_cgroup_lat(struct bpf_iter__cgroup_view *ctx) > > > > +{ > > > > + struct seq_file *seq = ctx->meta->seq; > > > > + struct cgroup *cgroup = ctx->cgroup; > > > > + struct wait_lat *lat; > > > > + u64 id; > > > > + > > > > + BPF_SEQ_PRINTF(seq, "cgroup_id: %8lu\n", cgroup->kn->id); > > > > + lat = bpf_map_lookup_elem(&cgroup_lat, &id); > > > > > > Looks like "id = cgroup->kn->id" assignment is missing here? > > > > > > > Ah, yes. I'll fix it. > > > > > Thanks a lot for this test. It explains the motivation well. > > > > > > It seems that the patches 1-4 are there to automatically > > > supply cgroup pointer into bpf_iter__cgroup_view. > > > > > > Since user space needs to track good part of cgroup dir opreations > > > can we task it with the job of patches 1-4 as well? > > > It can register notifier for cgroupfs operations and > > > do mkdir in bpffs similarly _and_ parametrize 'view' bpf program > > > with corresponding cgroup_id. > > > Ideally there is no new 'view' program and it's a subset of 'iter' > > > bpf program. They're already parametrizable. > > > When 'iter' is pinned the user space can tell it which object it should > > > iterate on. The 'view' will be an interator of one element and > > > argument to it can be cgroup_id. > > > When user space pins the same 'view' program in a newly created bpffs > > > directory it will parametrize it with a different cgroup_id. > > > At the end the same 'view' program will be pinned in multiple directories > > > with different cgroup_id arguments. > > > This patch 5 will look very much the same, but patches 1-4 will not be > > > necessary. > > > Of course there are races between cgroup create/destroy and bpffs > > > mkdir, prog pin operatiosn, but they will be there regardless. > > > The patch 1-4 approach is not race free either. > > > > Right. I tried to minimize the races between cgroupfs and bpffs in > > this patchset. The cgroup and kernfs APIs called in this patchset > > guarantee that the cgroup and kernfs objects are alive once get. Some > > states in the objects such as 'id' should be valid at least. > > > > > Will that work? > > > > Thanks Alexei for the idea. > > > > The parameterization part sounds good. By 'parametrize', do you mean a > > variable in iter prog (like the 'pid' variable in bpf_iter_task_vma.c > > [1])? or some metadata of the program? I assume it's program's > > metadata. Either parameterizing with cgroup_id or passing cgroup > > object to the prog should work. The problem is at pinning. > > The bpf_iter_link_info is used to parametrize the iterator. > The map iterator will iterate the given map_fd. > iirc pinning is not parameterizable yet, > but that's not difficult to add. > I can take a look at that. This will be useful in our use case. > > > In our use case, we can't ask the users who create cgroups to do the > > pinning. Pinning requires root privilege. In our use case, we have > > non-root users who can create cgroup directories and still want to > > read bpf stats. They can't do pinning by themselves. This is why > > inheritance is a requirement for us. With inheritance, they only need > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning > > operation is required. Patch 1-4 are needed to implement inheritance. > > > > It's also not a good idea in our use case to add a userspace > > privileged process to monitor cgroupfs operations and perform the > > pinning. It's more complex and has a higher maintenance cost and > > runtime overhead, compared to the solution of asking whoever makes > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in > > the data center that don't have the userspace process deployed, the > > stats will be unavailable, which is a no-no for some of our users. > > The commit log says that there will be a daemon that does that > monitoring of cgroupfs. And that daemon needs to mkdir > directories in bpffs when a new cgroup is created, no? > The kernel is only doing inheritance of bpf progs into > new dirs. I think that daemon can pin as well. > > The cgroup creation is typically managed by an agent like systemd. > Sounds like you have your own agent that creates cgroups? > If so it has to be privileged and it can mkdir in bpffs and pin too ? Ah, yes, we have our own daemon to manage cgroups. That daemon creates the top-level cgroup for each job to run inside. However, the job can create its own cgroups inside the top-level cgroup, for fine grained resource control. This doesn't go through the daemon. The job-created cgroups don't have the pinned objects and this is a no-no for our users.