From: Dmitry Vyukov
Date: Thu, 9 Apr 2020 09:02:54 +0200
Subject: Re: [PATCH 0/8] loopfs
To: Stéphane Graber
Cc: Jann Horn, Christian Brauner, Jens Axboe, Greg Kroah-Hartman,
    kernel list, linux-block, Linux API, Jonathan Corbet, Serge Hallyn,
    "Rafael J. Wysocki", Tejun Heo, "David S. Miller", Saravana Kannan,
    Jan Kara, David Howells, Seth Forshee, David Rheinsberg,
    Tom Gundersen, Christian Kellner, "open list:DOCUMENTATION",
    Network Development, Matthew Garrett, linux-fsdevel, syzkaller
References: <20200408152151.5780-1-christian.brauner@ubuntu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 8, 2020 at 6:41 PM Stéphane Graber wrote:
>
> On Wed, Apr 8, 2020 at 12:24 PM Jann Horn wrote:
> >
> > On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> > wrote:
> > > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > > devices in sandboxed workloads without exposing /dev or
> > > /dev/loop-control to the workload in question and without having to
> > > implement a complex and also racy protocol to send around file
> > > descriptors for loop devices. With loopfs each mount is a new instance,
> > > i.e. loop devices created in one loopfs instance are independent of any
> > > loop devices created in another loopfs instance. This allows
> > > sufficiently privileged tools to have their own private stash of loop
> > > device instances. Dmitry has expressed his desire to use this for
> > > syzkaller in a private discussion. And various parties that want to use
> > > it are Cced here too.
> > >
> > > In addition, the loopfs filesystem can be mounted by user namespace root
> > > and is thus suitable for use in containers. Combined with syscall
> > > interception this makes it possible to securely delegate mounting of
> > > images on loop devices, i.e. when a user calls mount -o loop
> > > it will be possible to completely setup the loop device.
> > > The final mount syscall to actually perform the mount will be handled
> > > through syscall interception and be performed by a sufficiently
> > > privileged process. Syscall interception is already supported through a
> > > new seccomp feature we implemented in [1] and extended in [2] and is
> > > actively used in production workloads. The additional loopfs work will
> > > be used there and in various other workloads too. You'll find a short
> > > illustration how this works with syscall interception below in [4].
> >
> > Would that privileged process then allow you to mount your filesystem
> > images with things like ext4? As far as I know, the filesystem
> > maintainers don't generally consider "untrusted filesystem image" to
> > be a strongly enforced security boundary; and worse, if an attacker
> > has access to a loop device from which something like ext4 is mounted,
> > things like "struct ext4_dir_entry_2" will effectively be in shared
> > memory, and an attacker can trivially bypass e.g.
> > ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> > anything other than kernel lockdown) because only root normally has
> > access to loop devices.
> >
> > Ubuntu carries an out-of-tree patch that afaik blocks the shared
> > memory thing:
> >
> > But even with that patch, I'm not super excited about exposing
> > filesystem image parsing attack surface to containers unless you run
> > the filesystem in a sandboxed environment (at which point you don't
> > need a loop device anymore either).
>
> So in general we certainly agree that you should never expose someone
> that you wouldn't trust with root on the host to syscall interception
> mounting of real kernel filesystems.
>
> But that's not all that our syscall interception logic can do. We have
> support for rewriting a normal filesystem mount attempt to instead use
> an available FUSE implementation. As far as the user is concerned,
> they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
> on /mnt as requested, except that the container manager intercepted
> the mount attempt and instead spawned fuse2fs for that mount. This
> requires absolutely no change to the software the user is running.
>
> loopfs, with that interception mode, will let us also handle all cases
> where a loop would be used, similarly without needing any change to
> the software being run. If a piece of software calls the command
> "mount -o loop blah.img /mnt", the "mount" command will setup a loop
> device as it normally would (doing so through loopfs) and then will
> call the "mount" syscall, which will get intercepted and redirected to
> a FUSE implementation if so configured, resulting in the expected
> filesystem being mounted for the user.
>
> LXD with syscall interception offers both straight up privileged
> mounting using the kernel fs or using a FUSE based implementation.
> This is configurable on a per-filesystem and per-container basis.
>
> I hope that clarifies what we're doing here :)
>
> Stéphane

Hi Christian,

Our use case for loopfs in syzkaller would be isolation of several test
processes from each other.
Currently all loop devices and loop-control are global and cause test processes to collide, which in turn causes non-reproducible coverage and non-reproducible crashes. Ideally we give each test process its own loopfs instance.
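
For illustration only (this sketch is not part of the original message): the collision described above comes from every process asking the same global control node for a free device. Below is a minimal Python sketch of that request, using the existing LOOP_CTL_GET_FREE ioctl from <linux/loop.h>. The helper name get_free_loop_index and the per-instance path /mnt/loopfs/loop-control are assumptions made for the example; the thread does not say where a loopfs instance would expose its control node.

```python
import fcntl
import os

# Value of LOOP_CTL_GET_FREE from <linux/loop.h>.
LOOP_CTL_GET_FREE = 0x4C82


def get_free_loop_index(ctl_path="/dev/loop-control"):
    """Ask a loop-control node for a free loop device index.

    Returns the index, or None if the node cannot be opened or the
    ioctl fails (for example: no permission, or no loop driver).
    """
    try:
        fd = os.open(ctl_path, os.O_RDWR)
    except OSError:
        return None
    try:
        return fcntl.ioctl(fd, LOOP_CTL_GET_FREE)
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Today every process shares this one node, so the index returned
    # depends on what other processes have already allocated.
    print("global:", get_free_loop_index())
    # HYPOTHETICAL path: a control node inside a private loopfs mount,
    # assumed here purely for illustration.
    print("private:", get_free_loop_index("/mnt/loopfs/loop-control"))
```

With one control node per test process, the indices a process sees would no longer depend on what its neighbours allocate, which is the reproducibility property asked for above.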