From: Stéphane Graber
Date: Wed, 8 Apr 2020 12:41:41 -0400
Subject: Re: [PATCH 0/8] loopfs
To: Jann Horn
Cc: Christian Brauner, Jens Axboe, Greg Kroah-Hartman, kernel list,
    linux-block@vger.kernel.org, Linux API, Jonathan Corbet, Serge Hallyn,
    "Rafael J. Wysocki", Tejun Heo, "David S. Miller", Saravana Kannan,
    Jan Kara, David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
    Christian Kellner, Dmitry Vyukov, linux-doc@vger.kernel.org,
    Network Development, Matthew Garrett, linux-fsdevel
References: <20200408152151.5780-1-christian.brauner@ubuntu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 8, 2020 at 12:24 PM Jann Horn wrote:
>
> On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner wrote:
> > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > devices in sandboxed workloads without exposing /dev or
> > /dev/loop-control to the workload in question and without having to
> > implement a complex and also racy protocol to send around file
> > descriptors for loop devices. With loopfs each mount is a new instance,
> > i.e. loop devices created in one loopfs instance are independent of any
> > loop devices created in another loopfs instance. This allows
> > sufficiently privileged tools to have their own private stash of loop
> > device instances. Dmitry has expressed his desire to use this for
> > syzkaller in a private discussion. And various parties that want to use
> > it are Cced here too.
> >
> > In addition, the loopfs filesystem can be mounted by user namespace root
> > and is thus suitable for use in containers. Combined with syscall
> > interception this makes it possible to securely delegate mounting of
> > images on loop devices, i.e. when a user calls mount -o loop
> > it will be possible to completely setup the loop device.
> > The final mount syscall to actually perform the mount will be handled
> > through syscall interception and be performed by a sufficiently
> > privileged process. Syscall interception is already supported through a
> > new seccomp feature we implemented in [1] and extended in [2] and is
> > actively used in production workloads.
> > The additional loopfs work will
> > be used there and in various other workloads too. You'll find a short
> > illustration how this works with syscall interception below in [4].
>
> Would that privileged process then allow you to mount your filesystem
> images with things like ext4? As far as I know, the filesystem
> maintainers don't generally consider "untrusted filesystem image" to
> be a strongly enforced security boundary; and worse, if an attacker
> has access to a loop device from which something like ext4 is mounted,
> things like "struct ext4_dir_entry_2" will effectively be in shared
> memory, and an attacker can trivially bypass e.g.
> ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> anything other than kernel lockdown) because only root normally has
> access to loop devices.
>
> Ubuntu carries an out-of-tree patch that afaik blocks the shared
> memory thing:
>
> But even with that patch, I'm not super excited about exposing
> filesystem image parsing attack surface to containers unless you run
> the filesystem in a sandboxed environment (at which point you don't
> need a loop device anymore either).

So in general we certainly agree that you should never expose
syscall-interception mounting of real kernel filesystems to someone
that you wouldn't trust with root on the host.

But that's not all that our syscall interception logic can do.
We have support for rewriting a normal filesystem mount attempt to
instead use an available FUSE implementation.

As far as the user is concerned, they ran "mount /dev/sdaX /mnt" and
got that ext4 filesystem mounted on /mnt as requested, except that the
container manager intercepted the mount attempt and instead spawned
fuse2fs for that mount. This requires absolutely no change to the
software the user is running.

loopfs, with that interception mode, will let us also handle all cases
where a loop would be used, similarly without needing any change to
the software being run.
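
To make the rewrite step concrete, here is a rough sketch in Python. It
is not LXD's code: MountRequest, FUSE_HELPERS and plan_mount are
hypothetical names, and the policy table is an assumption made only for
illustration. The one real piece is fuse2fs itself, which takes a
device or image followed by a mountpoint.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MountRequest:
    """Arguments of an intercepted mount(2) call (hypothetical structure)."""
    source: str   # e.g. "/dev/sdaX" or a loop device
    target: str   # e.g. "/mnt"
    fstype: str   # e.g. "ext4"


# Hypothetical policy: filesystem type -> FUSE helper binary.
# fuse2fs is the helper named in the mail; the table is an assumption.
FUSE_HELPERS: Dict[str, str] = {"ext4": "fuse2fs"}


def plan_mount(req: MountRequest,
               allow_kernel_mount: bool = False) -> Optional[List[str]]:
    """Decide how an intercepted mount should be served.

    Returns the helper command to spawn when a FUSE helper is configured,
    or None when the real kernel mount should be performed instead.
    Raises PermissionError when neither path is allowed.
    """
    helper = FUSE_HELPERS.get(req.fstype)
    if helper is not None:
        # Redirect: the caller still sees the filesystem appear on target.
        return [helper, req.source, req.target]
    if allow_kernel_mount:
        # Only sensible for containers trusted with root on the host.
        return None
    raise PermissionError(f"no FUSE helper configured for {req.fstype!r}")
```

With this sketch, plan_mount(MountRequest("/dev/sdaX", "/mnt", "ext4"))
yields the fuse2fs command line, while the software that called mount
is left unchanged.
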
If a piece of software calls the command "mount -o loop blah.img /mnt",
the "mount" command will set up a loop device as it normally would
(doing so through loopfs) and then will call the "mount" syscall, which
will get intercepted and redirected to a FUSE implementation if so
configured, resulting in the expected filesystem being mounted for the
user.

LXD with syscall interception offers both straight-up privileged
mounting using the kernel fs and mounting using a FUSE-based
implementation. This is configurable on a per-filesystem and
per-container basis.

I hope that clarifies what we're doing here :)

Stéphane