From: Christian Brauner
To: Jens Axboe, Greg Kroah-Hartman, linux-kernel@vger.kernel.org,
    linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: Jonathan Corbet, Serge Hallyn, "Rafael J. Wysocki", Tejun Heo,
    "David S. Miller", Christian Brauner, Saravana Kannan, Jan Kara,
    David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
    Christian Kellner, Dmitry Vyukov, Stéphane Graber,
    linux-doc@vger.kernel.org, netdev@vger.kernel.org
Subject: [PATCH 0/8] loopfs
Date: Wed, 8 Apr 2020 17:21:43 +0200
Message-Id: <20200408152151.5780-1-christian.brauner@ubuntu.com>

Hey everyone,

After having been pinged about this by various people recently, here's
loopfs.

This implements loopfs, a loop device filesystem. It takes inspiration
from the binderfs filesystem I implemented about two years ago and with
which we have had good experiences overall so far. Parts of it are also
based on [3], but it's mostly a new, imho cleaner and more complete
approach.

To experiment, the patchset can be found in the following locations:
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=loopfs
https://gitlab.com/brauner/linux/-/commits/loopfs
https://github.com/brauner/linux/tree/loopfs

One of the use-cases for loopfs is to allow loop devices to be
dynamically allocated in sandboxed workloads without exposing /dev or
/dev/loop-control to the workload in question and without having to
implement a complex and also racy protocol to send around file
descriptors for loop devices. With loopfs each mount is a new instance,
i.e. loop devices created in one loopfs instance are independent of any
loop devices created in another loopfs instance. This allows
sufficiently privileged tools to have their own private stash of loop
device instances. Dmitry has expressed his desire to use this for
syzkaller in a private discussion. And various parties that want to use
it are Cced here too.
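
To illustrate the per-instance allocation described above, here is a
minimal userspace sketch in Python. It assumes a kernel with this
patchset applied and a loopfs instance already mounted at a directory
of the caller's choosing (/dev/loopfs in the illustration in [4]). The
helper name get_free_loop_device is made up for this example.
LOOP_CTL_GET_FREE is the existing loop-control ioctl from
<linux/loop.h>; the sketch assumes the loop-control node of a loopfs
instance accepts it in the same way /dev/loop-control does.

```python
import fcntl
import os

# ioctl request number from <linux/loop.h>
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_device(loopfs_mount: str) -> str:
    """Ask one loopfs instance for a free loop device.

    loopfs_mount is the directory where the instance is mounted
    (an assumption of this sketch, e.g. /dev/loopfs). Returns the
    path of the device node inside that instance.
    """
    control = os.path.join(loopfs_mount, "loop-control")
    fd = os.open(control, os.O_RDWR | os.O_CLOEXEC)
    try:
        # With no argument, fcntl.ioctl() returns the ioctl's integer
        # result, which for LOOP_CTL_GET_FREE is the device index.
        index = fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    finally:
        os.close(fd)
    return os.path.join(loopfs_mount, f"loop{index}")
```

A caller would run get_free_loop_device("/dev/loopfs") and then attach
a backing file to the returned node as usual. Nothing outside the
mount directory is touched, so /dev and /dev/loop-control do not need
to be visible to the workload.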

In addition, the loopfs filesystem can be mounted by user namespace
root and is thus suitable for use in containers. Combined with syscall
interception this makes it possible to securely delegate mounting of
images on loop devices, i.e. when a user calls mount -o loop it will be
possible to completely set up the loop device. The final mount syscall
to actually perform the mount will be handled through syscall
interception and be performed by a sufficiently privileged process.
Syscall interception is already supported through a new seccomp feature
we implemented in [1] and extended in [2] and is actively used in
production workloads. The additional loopfs work will be used there and
in various other workloads too. You'll find a short illustration of how
this works with syscall interception below in [4].

The number of loop devices available to a loopfs instance can be
limited by setting the "max" mount option to a positive integer. This
e.g. allows sufficiently privileged processes to dynamically enforce a
limit on the number of devices. This limit is dynamic in contrast to
the max_loop module option in that a sufficiently privileged process
can update it with a simple remount operation.

The loopfs filesystem is placed under a new config option and special
care has been taken to not introduce any new code when users do not
select this config option.

Thanks!
Christian

[1]: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace")
[2]: fb3c5386b382 ("seccomp: add SECCOMP_USER_NOTIF_FLAG_CONTINUE")
[3]: https://lore.kernel.org/lkml/1401227936-15698-1-git-send-email-seth.forshee@canonical.com
[4]:
     root@f1:~# cat /proc/self/uid_map
              0     100000 1000000000
     root@f1:~# cat /proc/self/gid_map
              0     100000 1000000000
     root@f1:~# mkdir /dev/loopfs
     root@f1:~# mount -t loop loop /dev/loopfs/
     root@f1:~# ln -sf /dev/loopfs/loop-control /dev/loop-control
     root@f1:~# losetup -f
     /dev/loop9
     root@f1:~# ln -sf /dev/loopfs/loop9 /dev/loop9
     root@f1:~# ls -al /sys/class/block/loop9
     lrwxrwxrwx 1 root root 0 Apr  8 14:53 /sys/class/block/loop9 -> ../../devices/virtual/block/loop9
     root@f1:~# ls -al /sys/class/block/loop9/
     total 0
     drwxr-xr-x  9 root   root       0 Apr  8 14:53 .
     drwxr-xr-x 13 nobody nogroup    0 Apr  8 14:53 ..
     -r--r--r--  1 root   root    4096 Apr  8 14:53 alignment_offset
     lrwxrwxrwx  1 nobody nogroup    0 Apr  8 14:53 bdi -> ../../bdi/7:9
     -r--r--r--  1 root   root    4096 Apr  8 14:53 capability
     -r--r--r--  1 root   root    4096 Apr  8 14:53 dev
     -r--r--r--  1 root   root    4096 Apr  8 14:53 discard_alignment
     -r--r--r--  1 root   root    4096 Apr  8 14:53 events
     -r--r--r--  1 root   root    4096 Apr  8 14:53 events_async
     -rw-r--r--  1 root   root    4096 Apr  8 14:53 events_poll_msecs
     -r--r--r--  1 root   root    4096 Apr  8 14:53 ext_range
     -r--r--r--  1 root   root    4096 Apr  8 14:53 hidden
     drwxr-xr-x  2 nobody nogroup    0 Apr  8 14:53 holders
     -r--r--r--  1 root   root    4096 Apr  8 14:53 inflight
     drwxr-xr-x  2 nobody nogroup    0 Apr  8 14:53 integrity
     drwxr-xr-x  3 nobody nogroup    0 Apr  8 14:53 mq
     drwxr-xr-x  2 root   root       0 Apr  8 14:53 power
     drwxr-xr-x  3 nobody nogroup    0 Apr  8 14:53 queue
     -r--r--r--  1 root   root    4096 Apr  8 14:53 range
     -r--r--r--  1 root   root    4096 Apr  8 14:53 removable
     -r--r--r--  1 root   root    4096 Apr  8 14:53 ro
     -r--r--r--  1 root   root    4096 Apr  8 14:53 size
     drwxr-xr-x  2 nobody nogroup    0 Apr  8 14:53 slaves
     -r--r--r--  1 root   root    4096 Apr  8 14:53 stat
     lrwxrwxrwx  1 nobody nogroup    0 Apr  8 14:53 subsystem -> ../../../../class/block
     drwxr-xr-x  2 root   root       0 Apr  8 14:53 trace
     -rw-r--r--  1 root   root    4096 Apr  8 14:53 uevent
     root@f1:~#
     root@f1:~# stat --file-system /bla.img
       File: "/bla.img"
         ID: 4396dc4f5f3ffe1b Namelen: 255     Type: btrfs
     Block size: 4096       Fundamental block size: 4096
     Blocks: Total: 11230468   Free: 10851929   Available: 10738585
     Inodes: Total: 0          Free: 0
     root@f1:~# mount -o loop /bla.img /opt
     root@f1:~# findmnt | grep opt
     └─/opt /dev/loop9 btrfs rw,relatime,ssd,space_cache,subvolid=5,subvol=/

Christian Brauner (8):
  kobject_uevent: remove unneeded netlink_ns check
  loopfs: implement loopfs
  loop: use ns_capable for some loop operations
  kernfs: handle multiple namespace tags
  kernfs: let objects opt-in to propagating from the initial namespace
  genhd: add minimal namespace infrastructure
  loopfs: start attaching correct namespace during loop_add()
  loopfs: only show devices in their correct instance

 Documentation/filesystems/sysfs-tagging.txt |   1 -
 MAINTAINERS                                 |   5 +
 block/genhd.c                               |  79 ++++
 drivers/base/devtmpfs.c                     |   4 +-
 drivers/block/Kconfig                       |   4 +
 drivers/block/Makefile                      |   1 +
 drivers/block/loop.c                        | 186 +++++++--
 drivers/block/loop.h                        |   8 +-
 drivers/block/loopfs/Makefile               |   3 +
 drivers/block/loopfs/loopfs.c               | 429 ++++++++++++++++++++
 drivers/block/loopfs/loopfs.h               |  35 ++
 fs/kernfs/dir.c                             |  38 +-
 fs/kernfs/kernfs-internal.h                 |  26 +-
 fs/kernfs/mount.c                           |  11 +-
 fs/sysfs/mount.c                            |  14 +-
 include/linux/device.h                      |   3 +
 include/linux/genhd.h                       |   3 +
 include/linux/kernfs.h                      |  44 +-
 include/linux/kobject_ns.h                  |   7 +-
 include/linux/sysfs.h                       |   8 +-
 include/uapi/linux/magic.h                  |   1 +
 lib/kobject.c                               |  17 +-
 lib/kobject_uevent.c                        |   2 +-
 net/core/net-sysfs.c                        |   6 -
 24 files changed, 834 insertions(+), 101 deletions(-)
 create mode 100644 drivers/block/loopfs/Makefile
 create mode 100644 drivers/block/loopfs/loopfs.c
 create mode 100644 drivers/block/loopfs/loopfs.h

base-commit: 7111951b8d4973bda27ff663f2cf18b663d15b48
--
2.26.0
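
For anyone who wants to try the "max" mount option from the cover
letter, the following Python sketch builds the mount(8) command lines
for an initial mount with a limit and for a later remount that changes
it. The filesystem type and source name ("loop") are taken from the
transcript in [4] and "max" from the description above; the helper
names, the target path and the numbers are hypothetical, and the exact
remount behaviour is an assumption until checked against the patches.

```python
import subprocess
from typing import List, Optional


def loopfs_mount_argv(target: str,
                      max_devices: Optional[int] = None,
                      remount: bool = False) -> List[str]:
    """Build a mount(8) command line for a loopfs instance."""
    options = []
    if remount:
        options.append("remount")
    if max_devices is not None:
        # The cover letter says "max" takes a positive integer.
        if max_devices <= 0:
            raise ValueError("max must be a positive integer")
        options.append(f"max={max_devices}")
    argv = ["mount", "-t", "loop"]
    if options:
        argv += ["-o", ",".join(options)]
    return argv + ["loop", target]


def run_mount(argv: List[str]) -> None:
    """Execute the command; needs sufficient privileges and loopfs."""
    subprocess.run(argv, check=True)
```

run_mount(loopfs_mount_argv("/dev/loopfs", max_devices=16)) would
create an instance limited to 16 devices, and passing max_devices=32
with remount=True would raise the limit later without unmounting.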