Date: Wed, 18 Jan 2023 11:22:42 +1100
From: Dave Chinner
To: Christian Brauner
Cc: Giuseppe Scrivano, Amir Goldstein, Gao Xiang, Alexander Larsson,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    Miklos Szeredi, Yurii Zubrytskyi, Eugene Zemtsov, Vivek Goyal,
    Al Viro
Subject: Re: [PATCH v2 0/6] Composefs: an opportunistically sharing
    verified image filesystem
Message-ID: <20230118002242.GB937597@dread.disaster.area>
In-Reply-To: <20230117152756.jbwmeq724potyzju@wittgenstein>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote:
> > Christian Brauner writes:
> > 2) no multi repo support:
> >
> > Both reflinks and hardlinks do not work across mount points, so we
>
> Just fwiw, afaict reflinks work across mount points since at least 5.18.

They might work for NFS server *file clones* across different exports
within the same NFS server (or server cluster), but they most
certainly don't work across mountpoints for local filesystems, or
across different types of filesystems.

I'm not here to advocate composefs as the right solution, I'm just
pointing out that the proposed alternatives do not, in any way, have
the same critical behavioural characteristics that composefs provides
container orchestration systems, and hence do not solve the problems
that composefs is attempting to solve.

In short: any solution that requires userspace to create a new
filesystem hierarchy one file at a time via standard syscall
mechanisms is not going to perform acceptably at scale - that's a
major problem that composefs addresses.

The whole problem with file copying to create images - even with
reflink or hardlinks avoiding data copying - is the overhead of
creating and destroying those copies in the first place.
A reflink copy of tens of thousands of files in a complex directory
structure is not free - each individual reflink has a time, CPU,
memory and IO cost to it. The teardown cost is similar - the only way
to remove the "container image" built with reflinks is "rm -rf", and
that has significant time, CPU, memory and IO costs associated with
it as well.

Further, you can't ship container images to remote hosts using
reflink copies - they can only be created at runtime on the host
that the container will be instantiated on. IOWs, the entire cost of
reflink copies for container instances must be taken at container
instantiation and destruction time.

When you have container instances that might only be needed for a
few seconds, taking half a minute to set up the container instance
and then another half a minute to tear it down just isn't viable -
we need instantiation and teardown times in the order of a second or
two.

From my reading of the code, composefs is based around the concept
of a verifiable "shipping manifest", where the filesystem namespace
presented to users by the kernel is derived from the manifest rather
than from some other filesystem namespace. Overlay, reflinks, etc all
use some other filesystem namespace to generate the container
namespace that links to the common data, whilst composefs uses the
manifest for that.

The use of a manifest file means there is almost zero container
setup overhead - ship the manifest file, mount it, all done - and
zero teardown overhead, as unmounting the filesystem is all that is
needed to remove all traces of the container instance from the
system.

In having a custom manifest format, the manifest can easily contain
verification information alongside the pointer to the content the
namespace should expose. i.e. the manifest references a secure
content addressed repository that is protected by fsverity and
contains the fsverity digests itself.
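
As a rough illustration of that idea (a toy userspace sketch, not
composefs's actual manifest format or kernel interface): the manifest
maps each path in the namespace to an object in a content addressed
repository and records the digest that object must have. Plain
SHA-256 of the file contents stands in for a real fsverity digest
here, which is an assumption made purely to keep the example
self-contained, and the names `store_object`, `open_verified` and
`demo` are hypothetical.

```python
import hashlib
import os
import tempfile


def digest(data: bytes) -> str:
    # Assumption: plain SHA-256 stands in for a real fsverity digest.
    return hashlib.sha256(data).hexdigest()


def store_object(repo: str, data: bytes) -> str:
    """Store data in the content addressed repo, named by its digest."""
    name = digest(data)
    with open(os.path.join(repo, name), "wb") as f:
        f.write(data)
    return name


def open_verified(repo: str, manifest: dict[str, str], path: str) -> bytes:
    """Return the contents for `path`, checked against the manifest."""
    expected = manifest[path]
    with open(os.path.join(repo, expected), "rb") as f:
        data = f.read()
    # The manifest, not the repository, says what the data must be.
    if digest(data) != expected:
        raise ValueError(f"{path}: repository object does not match manifest")
    return data


def demo() -> tuple[bytes, bool]:
    """Read a file via the manifest, then tamper with the repo and retry."""
    with tempfile.TemporaryDirectory() as repo:
        manifest = {"/etc/hostname": store_object(repo, b"container\n")}
        good = open_verified(repo, manifest, "/etc/hostname")

        # Overwrite the stored object behind the manifest's back.
        with open(os.path.join(repo, manifest["/etc/hostname"]), "wb") as f:
            f.write(b"evil\n")
        try:
            open_verified(repo, manifest, "/etc/hostname")
            rejected = False
        except ValueError:
            rejected = True
        return good, rejected
```

Calling `demo()` reads the file back once through the manifest, then
overwrites the stored object and checks that the second read is
refused.
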
Hence it doesn't rely on the repository to self-verify, it ensures
that the repository files actually contain the data the manifest
expects them to contain.

Hence if the composefs kernel module is provided with a mechanism
for validating the chain of trust for the manifest file that a user
is trying to mount, then we just don't care who the mounting user
is. This architecture is a viable path to rootless mounting of
pre-built third party container images.

Also, with the host's content addressed repository being managed
separately by the trusted host and distro package management, the
manifest is not unique to a single container host. The distro can
build manifests so that containers are running known, signed and
verified container images built by the distro. The container
orchestration software or admin could also build manifests on demand
and sign them.

If the manifest is not signed, not signed with a key loaded into the
kernel keyring, or does not pass verification, then we simply fall
back to root-in-the-init-ns permissions being required to mount the
manifest. This fallback is exactly the same security model we have
for every other type of filesystem image that the linux kernel can
mount - we trust root not to be mounting malicious images.

Essentially, I don't think any of the filesystems in the linux
kernel currently provide a viable solution to the problem that
composefs is trying to solve. We need a different way of solving the
ephemeral container namespace creation and destruction overhead
problem. Composefs provides a mechanism that not only solves this
problem and potentially several others, but is also easy to retrofit
into existing production container stacks.

As such, I think composefs is definitely worth further time and
investment as a unique line of filesystem development for Linux.
Solve the chain of trust problem (i.e.
crypto signing for the manifest files) and we potentially have game
changing container infrastructure in a couple of thousand lines of
code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com