References:
 <20201029003252.2128653-1-christian.brauner@ubuntu.com>
 <8E455D54-FED4-4D06-8CB7-FC6291C64259@amacapital.net>
 <20201030120157.exz4rxmebruh7bgp@wittgenstein>
In-Reply-To: <20201030120157.exz4rxmebruh7bgp@wittgenstein>
From: Andy Lutomirski
Date: Sat, 31 Oct 2020 10:43:29 -0700
Subject: Re: [PATCH 00/34] fs: idmapped mounts
To: Christian Brauner
Cc: Alexander Viro, Christoph Hellwig, Linux FS Devel, John Johansen,
 James Morris, Mimi Zohar, Dmitry Kasatkin, Stephen Smalley,
 Casey Schaufler, Arnd Bergmann, Andreas Dilger, OGAWA Hirofumi,
 Geoffrey Thomas, Mrunal Patel, Josh Triplett, Andy Lutomirski,
 Amir Goldstein, Miklos Szeredi, Theodore Tso, Alban Crequy,
 Tycho Andersen, David Howells, James Bottomley, Jann Horn,
 Seth Forshee, Stéphane Graber, Aleksa Sarai, Lennart Poettering,
 "Eric W. Biederman", Stephen Barber, Phil Estes, Serge Hallyn,
 Kees Cook, Todd Kjos, Jonathan Corbet, Linux Containers, LSM List,
 Linux API, Ext4 Developers List, linux-unionfs@vger.kernel.org,
 linux-audit@redhat.com, linux-integrity, selinux@vger.kernel.org
X-Mailing-List: linux-ext4@vger.kernel.org

On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner wrote:
>
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> >
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner wrote:
> > >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids.
> > > The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > >
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > >
> > > The solution proposed here has its origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this slightly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > >
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > >
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> >
> > So one way of thinking about this is that a user namespace that has
> > an idmapped mount can, effectively, create or chown files with *any*
> > on-disk uid or gid by doing it directly (if that uid exists
> > in-namespace, which is likely for interesting ids like 0) or by
> > creating a new userns with that id inside.
> >
> > For a file system that is private to a container, this seems
> > moderately safe, although this may depend on what exactly “private”
> > means. We probably want a mechanism such that, if you are outside the
> > namespace, a reference to a file with the namespace’s vfsmnt does not
> > confer suid privilege.
> >
> > Imagine the following attack: user creates a namespace with a root
> > user and arranges to get an idmapped fs, e.g. by inserting an ext4
> > usb stick or using whatever container management tool does this.
> > Inside the namespace, the user creates a suid-root file.
> >
> > Now, outside the namespace, the user has privilege over the
> > namespace. (I’m assuming there is some tool that will idmap things in
> > a namespace owned by an unprivileged user, which seems likely.) So
> > the user makes a new bind mount and idmaps it to the init namespace.
> > Game over.
> >
> > So I think we need to have some control to mitigate this in a
> > comprehensible way. A big hammer would be to require nosuid.
> > A smaller hammer might be to say that you can’t create a new idmapped
> > mount unless you have privilege over the userns that you want to use
> > for the idmap and to say that a vfsmnt’s paths don’t do suid outside
> > the idmap namespace. We already do the latter for the vfsmnt’s
> > mntns’s userns.
>
> With this series, in order to create an idmapped mount the user must
> either be cap_sys_admin in the superblock of the underlying filesystem
> or if the mount is already idmapped and they want to create another
> idmapped mount from it they must have cap_sys_admin in the userns that
> the mount is currently marked with. It is also not possible to change
> an idmapped mount once it has been idmapped, i.e. the user must create a
> new detached bind-mount first.

I think my attack might not work, but I also think I didn't explain it
very well. Let me try again. I'll also try to lay out what I
understand the rules of idmaps to be so that you can correct me when
I'm inevitably wrong :)

First, background: there are a bunch of user namespaces around. Every
superblock has one, every idmapped mount has one, and every vfsmnt
also (indirectly) has one: mnt->mnt_ns->user_ns. So, if you're
looking at a given vfsmnt, you have three user namespaces that are
relevant, in addition to whatever namespaces are active for the task
(or kernel thread) accessing that mount. I'm wondering whether
mnt_user_ns() should possibly have a name that makes it clear that it
refers to the idmap namespace and not mnt->mnt_ns->user_ns.

So here's the attack. An attacker with uid=1000 creates a userns N
(so the attacker owns the ns and 1000 outside maps to 0 inside). N is
a child of init_user_ns. Now the attacker creates a mount namespace M
inside the userns and, potentially with the help of a container
management tool, creates an idmapped filesystem mount F inside M.
So, playing fast and loose with my ampersands:

    F->mnt_ns == M
    F->mnt_ns->user_ns == N
    mnt_user_ns(F) == N

I expect that this wouldn't be a particularly uncommon setup. Now the
user has the ability to create files with inode->uid == 0 and the SUID
bit set on their filesystem. This isn't terribly different from FUSE,
except that the mount won't have nosuid set, whereas many uses of
unprivileged FUSE would have nosuid set. So this is the thing that
makes me a little bit nervous.

But it actually seems likely that I was wrong and this is okay.
Specifically, to exploit this using kernel mechanisms, one would need
to pass a mnt_may_suid() check, which means that one would need to
acquire a mount of F in one's current mount namespace, and one would
need one's current user namespace to be init_user_ns (or something
else sensitive). But you already need to own the namespace to create
mounts, unless you have a way to confuse some existing user tooling.
You would also need to be in F's superblock's user_ns (second line of
mnt_may_suid()), which totally kills this type of attack if F's
superblock is in the container's user_ns, but I wouldn't count on
that.

So maybe this is all fine. I'll continue to try to poke holes in it,
but perhaps there aren't any holes to poke. I'll also continue to try
to see if I can state the security properties of idmap in a way that
is clear and obviously has nice properties.

Why are you allowing the creation of a new idmapped mount if you have
cap_sys_admin over an existing idmap userns but not over the
superblock's userns? I assume this is for a nested container use
case, but can you spell out a specific example usage?

--Andy
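
For anyone who wants to poke at the mnt_may_suid() reasoning above
without a kernel tree, here is a toy userspace model in Python. It is
only a sketch under stated assumptions: the classes (UserNS, MountNS,
Superblock, Mount, Task) and the idmap_ns field are hypothetical
stand-ins invented for illustration, and the three-part condition is
meant to mirror the checks described in this mail (no nosuid flag,
mount belongs to the caller's mount namespace, caller is in the
superblock's user namespace or a descendant of it), not to reproduce
the kernel source.

```python
from dataclasses import dataclass
from typing import Optional

MNT_NOSUID = 0x01  # illustrative flag bit for this model


@dataclass(eq=False)
class UserNS:
    name: str
    parent: Optional["UserNS"] = None


@dataclass(eq=False)
class MountNS:
    user_ns: UserNS          # models mnt->mnt_ns->user_ns


@dataclass(eq=False)
class Superblock:
    s_user_ns: UserNS


@dataclass(eq=False)
class Mount:
    mnt_ns: MountNS
    sb: Superblock
    idmap_ns: UserNS         # hypothetical stand-in for mnt_user_ns(F)
    flags: int = 0


@dataclass(eq=False)
class Task:
    user_ns: UserNS
    mnt_ns: MountNS


def in_userns(ns: Optional[UserNS], target: UserNS) -> bool:
    """True if ns is target itself or a descendant of target."""
    while ns is not None:
        if ns is target:
            return True
        ns = ns.parent
    return False


def mnt_may_suid(task: Task, mnt: Mount) -> bool:
    """Model of the check: idmap_ns is deliberately not consulted here."""
    return (not (mnt.flags & MNT_NOSUID)
            and mnt.mnt_ns is task.mnt_ns
            and in_userns(task.user_ns, mnt.sb.s_user_ns))


# The setup from the mail: userns N under init_user_ns, mount ns M in N,
# idmapped mount F in M. Assumption: F's superblock belongs to init_user_ns.
init_user_ns = UserNS("init_user_ns")
N = UserNS("N", parent=init_user_ns)
host_mntns = MountNS(init_user_ns)
M = MountNS(N)
F = Mount(mnt_ns=M, sb=Superblock(init_user_ns), idmap_ns=N)

host_task = Task(user_ns=init_user_ns, mnt_ns=host_mntns)
container_task = Task(user_ns=N, mnt_ns=M)

# Variant: superblock owned by N, mount somehow present in the host mount ns.
F_private = Mount(mnt_ns=host_mntns, sb=Superblock(N), idmap_ns=N)

outside_result = mnt_may_suid(host_task, F)          # foreign mount ns
inside_result = mnt_may_suid(container_task, F)      # same mount ns, in N
private_result = mnt_may_suid(host_task, F_private)  # not in sb's userns
```

In this model the host task is refused for F because F is not in its
mount namespace, and it is refused for F_private because init_user_ns
is not N or a descendant of N. That second case corresponds to the
"superblock is in the container's user_ns" situation mentioned above.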