Date: Wed, 2 Mar 2022 11:34:11 -0800
From: Alexei Starovoitov
To: Hao Luo
Cc: Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh, Shakeel Butt, Joe Burton, Tejun Heo, joshdon@google.com, sdf@google.com, bpf@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall
Message-ID: <20220302193411.ieooguqoa6tpraoe@ast-mbp.dhcp.thefacebook.com>
References: <20220225234339.2386398-1-haoluo@google.com> <20220225234339.2386398-2-haoluo@google.com> <20220227051821.fwrmeu7r6bab6tio@apollo.legion>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 28, 2022 at 02:10:39PM -0800, Hao Luo wrote:
> Hi Kumar,
>
> On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi wrote:
> >
> > On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> > > This patch allows bpf_syscall prog to perform some basic filesystem
> > > operations: create, remove directories and unlink files. Three bpf
> > > helpers are added for this purpose. When combined with the following
> > > patches that allow pinning and getting bpf objects from bpf prog,
> > > this feature can be used to create directory hierarchy in bpffs that
> > > help manage bpf objects purely using bpf progs.
> > >
> > > The added helpers subject to the same permission checks as their syscall
> > > version. For example, one can not write to a read-only file system;
> > > The identity of the current process is checked to see whether it has
> > > sufficient permission to perform the operations.
> > >
> > > Only directories and files in bpffs can be created or removed by these
> > > helpers. But it won't be too hard to allow these helpers to operate
> > > on files in other filesystems, if we want.
> > >
> > > Signed-off-by: Hao Luo
> > > ---
> > > + *
> > > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> > > + *     Description
> > > + *             Attempts to create a directory name *pathname*. The argument
> > > + *             *pathname_sz* specifies the length of the string *pathname*.
> > > + *             The argument *mode* specifies the mode for the new directory.
> > > + *             It is modified by the process's umask. It has the same semantic as
> > > + *             the syscall mkdir(2).
> > > + *     Return
> > > + *             0 on success, or a negative error in case of failure.
> > > + *
> > > + * long bpf_rmdir(const char *pathname, int pathname_sz)
> > > + *     Description
> > > + *             Deletes a directory, which must be empty.
> > > + *     Return
> > > + *             0 on sucess, or a negative error in case of failure.
> > > + *
> > > + * long bpf_unlink(const char *pathname, int pathname_sz)
> > > + *     Description
> > > + *             Deletes a name and possibly the file it refers to. It has the
> > > + *             same semantic as the syscall unlink(2).
> > > + *     Return
> > > + *             0 on success, or a negative error in case of failure.
> > > */
> > >
> >
> > How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
> > more useful for other cases in future, and when AT_FDCWD is passed, has the same
> > functionality as these, but when openat/fget is supported, it would work
> > relative to other dirfds as well. It can also allow using dirfd of the process
> > calling read for a iterator (e.g. if it sets the fd number using skel->bss).
> > unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
> >
> > WDYT?
> >
>
> The idea sounds good to me, more flexible. But I don't have a real use
> case for using the added 'dirfd' at this moment. For all the use cases
> I can think of, absolute paths will suffice, I think. Unless other
> reviewers have opposition, I will try switching to mkdirat and
> unlinkat in v2.

I'm surprised you don't need "at" variants.
I thought your production setup has a top level cgroup controller and
then inner tasks inside containers manage cgroups on their own.
Since containers are involved they likely run inside their own mountns.
cgroupfs mount is single. So you probably don't even need to bind mount it
inside containers, but bpffs is not a single mount. You need
to bind mount top bpffs inside containers for tasks to access it.
Now for cgroupfs the abs path is not an issue, but for bpffs
AT_FDCWD becomes a problem. AT_FDCWD uses the current mount ns.
Inside a container that will be different. Unless you bind mount into the exact
same path, the full path has different meanings inside and outside of the container.
It seems to me the bpf progs attached to cgroup sleepable events should
be using an FD of bpffs. Then when these tracepoints are triggered from
different containers in different mountns they will get the right dir prefix.
What am I missing?

I think non-AT variants are not needed. The prog can always pass AT_FDCWD
if that's really the intent, but passing an actual FD seems more error-proof.
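
For illustration only, here is a minimal userspace sketch of the semantics
discussed above. It uses the existing mkdirat(2) and unlinkat(2) syscalls, not
the proposed BPF helpers, whose final signatures are not settled in this thread.
The names make_dir_at(), remove_at() and demo_at_variants() are hypothetical,
and the scratch directory under /tmp is an assumption standing in for a bpffs
mount. It shows that a relative path is resolved against whatever directory the
FD refers to, and that AT_REMOVEDIR covers the rmdir case.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for mkdirat, unlinkat, mkdtemp, O_DIRECTORY */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* AT_FDCWD, AT_REMOVEDIR, open */
#include <stdlib.h>	/* mkdtemp */
#include <sys/stat.h>	/* mkdirat */
#include <sys/types.h>
#include <unistd.h>	/* unlinkat, faccessat, close, rmdir */

/* Hypothetical wrapper: create a directory relative to dirfd (or AT_FDCWD).
 * Returns 0 on success or a negative errno, like the helpers in the patch. */
int make_dir_at(int dirfd, const char *path, mode_t mode)
{
	return mkdirat(dirfd, path, mode) == 0 ? 0 : -errno;
}

/* Hypothetical wrapper: one call covers both unlink and rmdir.
 * AT_REMOVEDIR makes unlinkat() behave like rmdir(2). */
int remove_at(int dirfd, const char *path, int is_dir)
{
	return unlinkat(dirfd, path, is_dir ? AT_REMOVEDIR : 0) == 0 ? 0 : -errno;
}

/* Demo: open a base directory once, then operate relative to that FD.
 * The /tmp scratch directory is an assumption made for this sketch. */
int demo_at_variants(void)
{
	char base[] = "/tmp/at-demo-XXXXXX";
	int dirfd, ret;

	if (!mkdtemp(base))
		return -errno;

	dirfd = open(base, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0) {
		ret = -errno;
		rmdir(base);
		return ret;
	}

	/* "objs" is looked up under the directory behind dirfd,
	 * independent of the caller's current working directory. */
	ret = make_dir_at(dirfd, "objs", 0755);
	if (ret == 0)
		ret = remove_at(dirfd, "objs", 1);
	if (ret == 0)
		assert(faccessat(dirfd, "objs", F_OK, 0) == -1);

	close(dirfd);
	rmdir(base);
	return ret;
}
```

Passing AT_FDCWD as dirfd to either wrapper gives the non-"at" behaviour, which
is why separate mkdir/rmdir/unlink variants look redundant.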