References: <20220201205534.1962784-1-haoluo@google.com> <20220201205534.1962784-6-haoluo@google.com> <20220203180414.blk6ou3ccmod2qck@ast-mbp.dhcp.thefacebook.com>
From: Hao Luo
Date: Tue, 8 Feb 2022 12:07:31 -0800
Subject: Re: [PATCH RFC bpf-next v2 5/5] selftests/bpf: test for pinning for cgroup_view link
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Stanislav Fomichev, bpf, LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat,
Feb 5, 2022 at 8:29 PM Alexei Starovoitov wrote:
>
> On Fri, Feb 4, 2022 at 10:27 AM Hao Luo wrote:
> > >
> > > > In our use case, we can't ask the users who create cgroups to do the
> > > > pinning. Pinning requires root privilege. In our use case, we have
> > > > non-root users who can create cgroup directories and still want to
> > > > read bpf stats. They can't do pinning by themselves. This is why
> > > > inheritance is a requirement for us. With inheritance, they only need
> > > > to mkdir in cgroupfs and bpffs (unprivileged operations), no pinning
> > > > operation is required. Patch 1-4 are needed to implement inheritance.
> > > >
> > > > It's also not a good idea in our use case to add a userspace
> > > > privileged process to monitor cgroupfs operations and perform the
> > > > pinning. It's more complex and has a higher maintenance cost and
> > > > runtime overhead, compared to the solution of asking whoever makes
> > > > cgroups to mkdir in bpffs. The other problem is: if there are nodes in
> > > > the data center that don't have the userspace process deployed, the
> > > > stats will be unavailable, which is a no-no for some of our users.
> > >
> > > The commit log says that there will be a daemon that does that
> > > monitoring of cgroupfs. And that daemon needs to mkdir
> > > directories in bpffs when a new cgroup is created, no?
> > > The kernel is only doing inheritance of bpf progs into
> > > new dirs. I think that daemon can pin as well.
> > >
> > > The cgroup creation is typically managed by an agent like systemd.
> > > Sounds like you have your own agent that creates cgroups?
> > > If so it has to be privileged and it can mkdir in bpffs and pin too ?
> >
> > Ah, yes, we have our own daemon to manage cgroups. That daemon creates
> > the top-level cgroup for each job to run inside. However, the job can
> > create its own cgroups inside the top-level cgroup, for fine grained
> > resource control. This doesn't go through the daemon.
> > The job-created cgroups don't have the pinned objects and this is a
> > no-no for our users.
>
> We can whitelist certain tracepoints to be sleepable and extend
> tp_btf prog type to include everything from prog_type_syscall.
> Such prog would attach to cgroup_mkdir and cgroup_release
> and would call bpf_sys_bpf() helper to pin progs in new bpffs dirs.
> We can allow prog_type_syscall to do mkdir in bpffs as well.
>
> This feature could be useful for similar monitoring/introspection tasks.
> We can write a program that would monitor bpf prog load/unload
> and would pin an iterator prog that would show debug info about a prog.
> Like cat /sys/fs/bpf/progs.debug shows a list of loaded progs.
> With this feature we can implement:
> ls /sys/fs/bpf/all_progs.debug/
> and each loaded prog would have a corresponding file.
> The file name would be a program name, for example.
> cat /sys/fs/bpf/all_progs.debug/my_prog
> would pretty print info about 'my_prog' bpf program.
>
> This way the kernfs/cgroupfs specific logic from patches 1-4
> will not be necessary.
>
> wdyt?

Thanks Alexei. I gave it more thought over the last couple of days.
Actually, I think it's a good idea and more flexible. It removes the need
for a user space daemon to monitor cgroup creation and destruction. We
could monitor task creation and exit as well, so that we can export
per-task information (e.g. task_vma_iter) more efficiently.

A couple of thoughts on the details:

- Regarding parameterized pinning, I don't think we can have a single
  bpf_iter_link object that is reused with different parameters, because
  the parameters (bpf_iter_aux_info) are part of the bpf_iter_link. So
  every time we pin, we first have to attach the iter to get a new link
  object. That means we need to add attach and detach to bpf_sys_bpf().

- We also need to add these syscalls for cleanup in prog_type_syscall:
  (1) unlink, for removing the pinned object, and (2) rmdir, for removing
  the directory.
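
A minimal userspace sketch of the cleanup order in the second point above. It uses only plain POSIX calls on an ordinary directory as a stand-in for bpffs; the helper names (view_path, fake_pin, cleanup_view) are hypothetical and an empty file plays the role of a pinned object. It is not the proposed in-kernel interface, only an illustration of why both unlink and rmdir would be needed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define VIEW_PATH_MAX 4096

/* Hypothetical helper: build "dir/name" in buf. Returns 0 or -ENAMETOOLONG. */
static int view_path(char *buf, size_t len, const char *dir, const char *name)
{
	int n;

	assert(buf && dir && name);
	n = snprintf(buf, len, "%s/%s", dir, name);
	return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/* Stand-in for pinning: create an empty file where a pinned object would live. */
static int fake_pin(const char *dir, const char *name)
{
	char path[VIEW_PATH_MAX];
	int fd, err;

	err = view_path(path, sizeof(path), dir, name);
	if (err)
		return err;
	fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}

/* Cleanup in the order listed: (1) unlink the object, (2) rmdir the directory.
 * Returns 0 on success or a negative errno value. */
static int cleanup_view(const char *dir, const char *name)
{
	char path[VIEW_PATH_MAX];
	int err;

	err = view_path(path, sizeof(path), dir, name);
	if (err)
		return err;
	if (unlink(path) < 0)
		return -errno;
	if (rmdir(dir) < 0)
		return -errno;
	return 0;
}
```

On a regular Linux filesystem, calling rmdir() on the directory while the stand-in file still exists fails, so the unlink step has to come first; cleanup_view() encodes that ordering.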

With these extensions, we can shift some of the bpf operations currently
performed in system daemons into the kernel. IMHO that's a great thing:
it makes system monitoring more flexible.