From: Hao Luo
Date: Tue, 8 Mar 2022 13:08:39 -0800
Subject: Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko,
    Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
    KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, Josh Don,
    Stanislav Fomichev, bpf, LKML

On Sat, Mar 5, 2022 at 3:47 PM Alexei Starovoitov wrote:
>
> On Fri, Mar 4, 2022 at 10:37 AM Hao Luo wrote:
> >
> > I gave this question more thought. We don't need to bind mount the top
> > bpffs into the container; instead, we may be able to overlay a bpffs
> > directory into the container. Here is the workflow in my mind:
>
> I don't quite follow what you mean by 'overlay' here.
> Another bpffs mount or future overlayfs that supports bpffs?
>
> > For each job, let's say A, the container runtime can create a
> > directory in bpffs, for example
> >
> > /sys/fs/bpf/jobs/A
> >
> > and then create the cgroup for A. The sleepable tracing prog will
> > create the file:
> >
> > /sys/fs/bpf/jobs/A/100/stats
> >
> > 100 is the created cgroup's id. Then the container runtime overlays
> > the bpffs directory into container A at the same path:
>
> Why cgroup id? Wouldn't it be easier to use the same cgroup name
> as in cgroupfs?
>

Cgroup names aren't unique, and we don't need the hierarchy information
of cgroups. We can use a library function to translate a cgroup path to
a cgroup id; see get_cgroup_id() in patch 9/9. It works fine in the
selftest.
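For reference, such a path-to-id translation can be done from userspace
with name_to_handle_at(). The sketch below is not the code from patch
9/9, only a minimal illustration; it assumes cgroup v2, where the file
handle encodes the 64-bit cgroup id, and the helper name
cgroup_path_to_id() is made up here:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Translate a cgroup v2 path (e.g. under /sys/fs/cgroup) to its
 * 64-bit cgroup id. Returns 0 on failure.
 */
static uint64_t cgroup_path_to_id(const char *path)
{
	struct file_handle *fhp;
	uint64_t cgid = 0;
	int mount_id;

	/* cgroup v2 file handles are 8 bytes: the cgroup id itself. */
	fhp = calloc(1, sizeof(*fhp) + sizeof(uint64_t));
	if (!fhp)
		return 0;
	fhp->handle_bytes = sizeof(uint64_t);

	if (!name_to_handle_at(AT_FDCWD, path, fhp, &mount_id, 0) &&
	    fhp->handle_bytes == sizeof(uint64_t))
		memcpy(&cgid, fhp->f_handle, sizeof(cgid));

	free(fhp);
	return cgid;
}

A container runtime could call this on the job's cgroup path and use
the result to locate /sys/fs/bpf/jobs/A/<id>/stats.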
> > [A's container path]/sys/fs/bpf/jobs/A.
> >
> > A can see the stats at the path within its mount ns:
> >
> > /sys/fs/bpf/jobs/A/100/stats
> >
> > When A creates a cgroup, it is able to write to the top layer of the
> > overlaid directory. So it is
> >
> > /sys/fs/bpf/jobs/A/101/stats
> >
> > Some of my thoughts:
> > 1. Compared to bind mounting the top bpffs into the container,
> > overlaying a directory avoids exposing other jobs' stats. This gives
> > better isolation. I already have a patch for supporting layering
> > bpffs over other filesystems; it's not too hard.
>
> So it's an overlayfs combination of bpffs and something like ext4, right?
> I thought you found out that overlayfs has to be the upper fs
> and the lower fs shouldn't be modified underneath.
> So if bpffs is a lower fs, the writes into it should go
> through the upper overlayfs, right?
>

It's overlayfs combining bpffs and ext4. Bpffs is the upper layer; the
lower layer is an empty ext4 directory. The merged directory is a
directory in the container. The upper layer contains the bpf objects
that we want to expose to the container, for example, the sleepable
tracing progs and the iter link for reading stats. Only the merged
directory is visible to the container, and all updates go through the
merged directory.

The following is an example of the workflow I'm thinking of:

Step 1: Set up the directories and bpf objects needed by containers.

[# ~] ls /sys/fs/bpf/container/upper
tracing_prog iter_link
[# ~] ls /sys/fs/bpf/container/work
[# ~] ls /container
root lower
[# ~] ls /container/root
bpf
[# ~] ls /container/root/bpf

Step 2: Use overlayfs to mount a directory from bpffs into the
container's home.

[# ~] mkdir /container/lower
[# ~] mkdir /sys/fs/bpf/container/workdir
[# ~] mount -t overlay overlay -o \
      lowerdir=/container/lower,\
      upperdir=/sys/fs/bpf/container/upper,\
      workdir=/sys/fs/bpf/container/work \
      /container/root/bpf
[# ~] ls /container/root/bpf
tracing_prog iter_link

(A programmatic equivalent of this mount is sketched at the end of
this mail.)

Step 3: Pivot root for the container; we expect to see the bpf objects
mapped into the container:

[# ~] chroot /container/root
[# ~] ls /
bpf
[# ~] ls /bpf
tracing_prog iter_link

Note:
- I haven't tested Step 3, but Step 1 and Step 2 seem to be working as
  expected. I am testing the behaviors of the bpf objects after we
  enter the container.
- Only a directory in bpffs is mapped into the container, not the top
  bpffs. The path is uniform in all containers, that is, /bpf. The
  container should be able to mkdir in /bpf, etc.

> > 2. Once the container runtime has overlaid the directory into the
> > container, it has no need to create more cgroups for this job. It
> > doesn't need to track the stats of job-created cgroups, which are
> > mainly for inspection by the job itself. Even if it needs to collect
> > the stats from those cgroups, it can read them from the path in the
> > container.
> > 3. The overlay path in the container doesn't have to be exactly the
> > same as the path in the root mount ns. In the sleepable tracing prog,
> > we may select paths based on the current process's ns. If we choose
> > to do this, we can further avoid exposing the cgroup id and job name
> > to the container.
>
> The benefits make sense.
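For completeness, here is a sketch of how a container runtime could do
the Step 2 mount programmatically instead of shelling out to mount(8).
It is only a minimal illustration using mount(2) with the same example
paths as above; it is not code from this patch set:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Same layout as Step 2: bpffs directory as upper, empty ext4
	 * directory as lower, merged view at /container/root/bpf.
	 */
	const char *opts = "lowerdir=/container/lower,"
			   "upperdir=/sys/fs/bpf/container/upper,"
			   "workdir=/sys/fs/bpf/container/work";

	if (mount("overlay", "/container/root/bpf", "overlay", 0, opts)) {
		perror("mount overlay");
		return 1;
	}
	return 0;
}

Error handling aside, this is equivalent to the mount -t overlay
invocation in Step 2.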