From: Alexander Larsson
To: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, gscrivan@redhat.com, alexl@redhat.com
Subject: [PATCH RFC 0/6] Composefs: an opportunistically sharing verified image filesystem
Date: Mon, 28 Nov 2022 12:13:31 +0100

Giuseppe Scrivano and I have recently been working on a new project we
call composefs. This is the first time we have proposed it publicly,
and we would like some feedback on it.

At its core, composefs is a way to construct and use read-only images
that are used similarly to how you would use e.g. loop-back mounted
squashfs images. On top of this, composefs has two new fundamental
features. First, it allows sharing of file data (both on disk and in
page cache) between images, and secondly it has dm-verity-like
validation on read.

Let me first start with a minimal example of how this can be used,
before going into the details:

Suppose we have this source for an image:

rootfs/
├── dir
│   └── another_a
├── file_a
└── file_b

We can then use this to generate an image file and a set of
content-addressed backing files:

$ mkcomposefs --digest-store=objects rootfs/ rootfs.img
$ ls -l rootfs.img objects/*/*
-rw-------. 1 root root   10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
-rw-------. 1 root root   10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
-rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img

The rootfs.img file contains all information about directory and file
metadata plus references to the backing files by name. We can now
mount this and look at the result:

$ mount -t composefs rootfs.img -o basedir=objects,verity_check=2 /mnt
$ ls /mnt/
dir  file_a  file_b
$ cat /mnt/file_a
content_a

When reading this file the kernel is actually reading the backing
file, in a fashion similar to overlayfs. Since the backing file is
content-addressed, the objects directory can be shared between multiple
images, and any files that happen to have the same content are
shared. I refer to this as opportunistic sharing, as it is different
from the more coarse-grained explicit sharing used by e.g. container
base images.

The next step is the validation. Note how the object files have
fs-verity enabled. In fact, they are named by their fs-verity digest:

$ fsverity digest objects/*/*
sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f

The generated filesystem image can contain the expected digest for the
backing files. If you mount the filesystem with the verity_check
option, then open will fail when the backing file digest is incorrect.
And if the open succeeds, any other on-disk file changes will be
detected by fs-verity:

$ cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
content_a
$ rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
$ echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
$ cat /mnt/file_a
WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
cat: /mnt/file_a: Input/output error

This re-uses the existing fs-verity functionality to protect against
changes in file contents, while adding on top of it protection against
changes in filesystem metadata and structure. In other words, it
protects against replacing an fs-verity-enabled file or modifying file
permissions, xattrs or other metadata not verified by fs-verity.

To be fully verified we need another step: we use fs-verity on the
image itself. Then we pass the expected digest on the mount command
line (where it is verified at mount time):

$ fsverity enable rootfs.img
$ fsverity digest rootfs.img
sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
$ mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

So, given a trusted set of mount options (say unlocked from TPM), we
have a fully verified filesystem tree mounted, with opportunistic
fine-grained sharing of identical files.

So, why do we want this? There are two initial use cases. First of
all we want to use the opportunistic sharing for podman container
layers. The idea is to use a composefs mount as the lower directory in
an overlay mount, with the upper directory being the container work
dir. This will allow automatic file-level disk and page-cache sharing
between any two images, independent of details like the permissions
and timestamps of the files and the origin of the images.

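To make the file-level sharing concrete, here is a minimal Python
sketch of a content-addressed object store that uses the same
objects/<2 hex chars>/<rest> naming as the example above. It is only an
illustration under stated assumptions: it names objects by a plain
SHA-256 of the file contents as a stand-in (composefs names them by
their fs-verity digest, which is computed differently), and the helper
names and the file contents other than content_a are made up for the
example.

```python
import hashlib
import os
import tempfile


def store_object(objects_dir: str, data: bytes) -> str:
    """Store data once and return its path relative to objects_dir.

    Assumption: plain SHA-256 of the content stands in for the
    fs-verity digest that composefs actually uses for naming.
    """
    digest = hashlib.sha256(data).hexdigest()
    rel_path = os.path.join(digest[:2], digest[2:])
    full_path = os.path.join(objects_dir, rel_path)
    if not os.path.exists(full_path):  # identical content is written only once
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        with open(full_path, "wb") as f:
            f.write(data)
    return rel_path


def build_image(objects_dir: str, files: dict[str, bytes]) -> dict[str, str]:
    """Map every file name of an image to its backing object path."""
    return {name: store_object(objects_dir, data) for name, data in files.items()}


def count_objects(objects_dir: str) -> int:
    """Count the backing files currently in the store."""
    return sum(len(names) for _, _, names in os.walk(objects_dir))


objects_dir = tempfile.mkdtemp()

# Two unrelated "images" that happen to contain a file with equal content.
image_1 = build_image(objects_dir, {
    "dir/another_a": b"content_a\n",
    "file_a": b"content_a\n",
    "file_b": b"content_b\n",  # hypothetical content
})
image_2 = build_image(objects_dir, {
    "other_name": b"content_a\n",
    "file_c": b"content_c\n",  # hypothetical content
})

print(image_1["file_a"] == image_2["other_name"])
print(count_objects(objects_dir))
```

Five file entries across the two images end up as three backing files,
because the name, permissions and origin of a file play no part in
where its data is stored.
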
Secondly, we are interested in using the verification aspects of
composefs in the ostree project. Ostree already uses a
content-addressed object store, but it is currently referenced by
hardlink farms. The object store and the trees that reference it are
signed and verified at download time, but there is no runtime
verification. If we replace the hardlink farm with a composefs image
that points into the existing object store, we can use the composefs
verification to implement runtime verification. In fact, the tooling to
create composefs images is fully reproducible, so all we need is to add
the fs-verity digest of the composefs image into the ostree commit
metadata. Then the image can be reconstructed from the ostree commit,
generating a composefs image with the same fs-verity digest.

These are the use cases we're currently interested in, but there seems
to be a wealth of other possible uses. For example, many systems use
loopback mounts for images (like lxc or snap), and these could take
advantage of the opportunistic sharing. We've also talked about using
fuse to implement a local cache for the backing files. I.e. you would
have a second basedir be a fuse filesystem, and on lookup failure in
the first basedir the fuse one triggers a download which is also saved
in the first dir for later lookups. There are many interesting
possibilities here.

The patch series contains documentation on the file format and how to
use the filesystem.

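The local-cache idea mentioned above can be modelled in a few lines of
Python. This is only a sketch of the lookup order (first basedir, then
a fallback that downloads and saves the object); it is not fuse code
and not how the kernel module resolves basedirs. The function names,
the fetch callback and the shortened object name are all hypothetical.

```python
import os
import tempfile
from typing import Callable


def open_backing_file(rel_path: str, cache_dir: str,
                      fetch: Callable[[str], bytes]) -> str:
    """Return the path of a backing file, downloading it on a cache miss.

    cache_dir plays the role of the first basedir; fetch stands in for
    the download that the fuse-backed second basedir would trigger.
    """
    local_path = os.path.join(cache_dir, rel_path)
    if os.path.exists(local_path):
        return local_path  # hit in the first basedir

    data = fetch(rel_path)  # miss: fall through to the fallback
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(data)  # save it so later lookups hit the first basedir
    return local_path


downloads = []


def fake_download(rel_path: str) -> bytes:
    """Hypothetical downloader that records what it was asked for."""
    downloads.append(rel_path)
    return b"content_a\n"


cache_dir = tempfile.mkdtemp()
obj = os.path.join("cc", "3da5b1")  # shortened, made-up object name

first = open_backing_file(obj, cache_dir, fake_download)
second = open_backing_file(obj, cache_dir, fake_download)

print(first == second)
print(len(downloads))
```

Only the first lookup reaches the downloader; the second one is served
from the local directory. A real implementation would also want to
check the downloaded data against the expected digest before saving it.
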
The userspace tools (and a standalone kernel module) are available
here:
  https://github.com/containers/composefs

Initial work on ostree integration is here:
  https://github.com/ostreedev/ostree/pull/2640

This patchset in git is available here:
  https://github.com/alexlarsson/linux/tree/composefs-v1

Alexander Larsson (6):
  fsverity: Export fsverity_get_digest
  composefs: Add on-disk layout
  composefs: Add descriptor parsing code
  composefs: Add filesystem implementation
  composefs: Add documentation
  composefs: Add kconfig and build support

 Documentation/filesystems/composefs.rst | 162 ++++
 fs/Kconfig                              |   1 +
 fs/Makefile                             |   1 +
 fs/composefs/Kconfig                    |  18 +
 fs/composefs/Makefile                   |   5 +
 fs/composefs/cfs-internals.h            |  65 ++
 fs/composefs/cfs-reader.c               | 958 ++++++++++++++++++++++++
 fs/composefs/cfs.c                      | 941 +++++++++++++++++++++++
 fs/composefs/cfs.h                      | 242 ++++++
 fs/verity/measure.c                     |   1 +
 10 files changed, 2394 insertions(+)
 create mode 100644 Documentation/filesystems/composefs.rst
 create mode 100644 fs/composefs/Kconfig
 create mode 100644 fs/composefs/Makefile
 create mode 100644 fs/composefs/cfs-internals.h
 create mode 100644 fs/composefs/cfs-reader.c
 create mode 100644 fs/composefs/cfs.c
 create mode 100644 fs/composefs/cfs.h

-- 
2.38.1