MIME-Version: 1.0
References: <20200214183554.1133805-1-christian.brauner@ubuntu.com>
 <1581973919.24289.12.camel@HansenPartnership.com>
 <1581980625.24289.30.camel@HansenPartnership.com>
In-Reply-To: <1581980625.24289.30.camel@HansenPartnership.com>
From: Stéphane Graber
Date: Mon, 17 Feb 2020 18:11:41 -0500
Subject: Re: [PATCH v2 00/28] user_namespace: introduce fsid mappings
To: James Bottomley
Cc:
 linux-security-module@vger.kernel.org, Kees Cook, Jonathan Corbet,
 linux-api@vger.kernel.org, Linux Containers, Jann Horn,
 linux-kernel@vger.kernel.org, smbarber@chromium.org, Seth Forshee,
 "Eric W. Biederman", linux-fsdevel, Christian Brauner,
 Alexey Dobriyan, Alexander Viro
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Feb 17, 2020 at 6:03 PM James Bottomley wrote:
>
> On Mon, 2020-02-17 at 16:57 -0500, Stéphane Graber wrote:
> > On Mon, Feb 17, 2020 at 4:12 PM James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> > >
> > > On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
> > > [...]
> > > > With this patch series we simply introduce the ability to
> > > > create fsid mappings that are different from the id mappings
> > > > of a user namespace. The whole feature set is placed under a
> > > > config option that defaults to false.
> > > >
> > > > In the usual case of running an unprivileged container we
> > > > will have set up an id mapping, e.g. 0 100000 100000. The
> > > > on-disk mapping will correspond to this id mapping, i.e. all
> > > > files which we want to appear as 0:0 inside the user
> > > > namespace will be chowned to 100000:100000 on the host. This
> > > > works because, whenever the kernel needs to do a filesystem
> > > > access, it will look up the corresponding uid and gid in the
> > > > idmapping tables of the container.
> > > >
> > > > Now think about the case where we want to have an id mapping
> > > > of 0 100000 100000 but an on-disk mapping of 0 300000 100000,
> > > > which is needed to e.g. share a single on-disk mapping with
> > > > multiple containers that all have different id mappings. This
> > > > will be problematic.
> > > > Whenever a filesystem access is requested, the kernel will
> > > > now try to look up a mapping for 300000 in the id mapping
> > > > tables of the user namespace, but since there is none the
> > > > files will appear to be owned by the overflow id, i.e.
> > > > usually 65534:65534 or nobody:nogroup.
> > > >
> > > > With fsid mappings we can solve this by writing an id mapping
> > > > of 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > > filesystem access the kernel will now look up the mapping for
> > > > 300000 in the fsid mapping tables of the user namespace. And
> > > > since such a mapping exists, the corresponding files will
> > > > have correct ownership.
> > >
> > > How do we parametrise this new fsid shift for the unprivileged
> > > use case? For newuidmap/newgidmap, it's easy because each user
> > > gets a dedicated range and everything "just works (tm)".
> > > However, for the fsid mapping, assuming some newfsuid/newfsgid
> > > tool to help, that tool has to know not only your allocated
> > > uid/gid chunk, but also the offset map of the image. The former
> > > is easy, but the latter is going to vary by the actual image
> > > ... well, unless we standardise some accepted shift for images
> > > and it simply becomes a known static offset.
> >
> > For unprivileged runtimes, I would expect images to be unshifted
> > and to be unpacked from within a userns.
>
> For images whose resting format is an archive like tar, I concur.
>
> > So your unprivileged user would be allowed a uid/gid range
> > through /etc/subuid and /etc/subgid and allowed to use them
> > through newuidmap/newgidmap. In that namespace, you can then pull
> > and unpack any images/layers you may want and the resulting fs
> > tree will look correct from within that namespace.
> >
> > All that is possible today and is how, for example, unprivileged
> > LXC works right now.
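The lookup behaviour described in the cover letter above can be sketched in a few lines of Python. This is purely illustrative and not kernel code; extents here are (ns_first, host_first, count) triples in the same order as the `0 100000 100000` notation used above:

```python
OVERFLOW_ID = 65534  # kernel overflow uid/gid, usually nobody:nogroup

def map_id(extents, on_disk_id):
    """Translate an on-disk id into a namespace id using mapping
    extents of the form (ns_first, host_first, count)."""
    for ns_first, host_first, count in extents:
        if host_first <= on_disk_id < host_first + count:
            return ns_first + (on_disk_id - host_first)
    return OVERFLOW_ID  # no extent covers the id

id_map = [(0, 100000, 100000)]    # id mapping:   0 100000 100000
fsid_map = [(0, 300000, 100000)]  # fsid mapping: 0 300000 100000

# A file chowned to 300000 on disk is unmapped in the id table but
# resolves to 0 in the fsid table:
print(map_id(id_map, 300000))    # 65534
print(map_id(fsid_map, 300000))  # 0
```

So with only the id mapping in place the file shows up as nobody:nogroup, while consulting the fsid table instead yields the expected 0:0.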
>
> I do have a counter-example, but it might be more esoteric: I use
> unprivileged architecture emulation containers to maintain actual
> physical system boot environments. These are stored as mountable
> disk images, not as archives, so I do need a simple remapping ...
> however, I think this use case is simple: it's a back shift along
> my owned uid/gid range, so tools for allowing unprivileged use can
> easily cope with it. The use is then either fsid identity or fsid
> back along the existing user_ns mapping.
>
> > What this patchset then allows is for containers to have
> > differing uid/gid maps while still being based on the same image
> > or layers. In this scenario, you would carve a subset of your
> > main uid/gid map for each container you run and run them in a
> > child user namespace while setting up a fsuid/fsgid map such
> > that their filesystem accesses do not follow their uid/gid map.
> > This then results in proper isolation for processes, networks,
> > ... as everything runs as different kuid/kgid, but the VFS view
> > will be the same in all containers.
>
> Who owns the shifted range of the image ... all tenants or none?

I would expect the most common case to be none of them. So you'd have
a uid/gid range carved out of your own allocation which is used to
unpack images; let's call that the image map. Your containers would
then use a uid/gid map which is distinct from that map and distinct
from each other, but all using the image map as their fsuid/fsgid
map. This will make the VFS behave in a normal way and would also
allow for shared paths between those containers, by bind-mounting a
shared directory which is also owned by a uid/gid in that image
range.

> > Shared storage between those otherwise isolated containers would
> > also work just fine by simply bind-mounting the same path into
> > two or more containers.
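To make the image-map arrangement above concrete, here is a rough sketch of carving distinct per-container ranges that all share one fsid map. The function name and layout are mine for illustration, not a proposed tool interface:

```python
def carve_container_maps(parent_first, image_first, n_containers,
                         size=65536):
    """Give each container a distinct uid/gid range carved out of the
    parent allocation, while all containers share the same on-disk
    ("image") range as their fsid map."""
    return [
        {
            "id_map": (0, parent_first + i * size, size),
            "fsid_map": (0, image_first, size),
        }
        for i in range(n_containers)
    ]

maps = carve_container_maps(parent_first=1000000, image_first=300000,
                            n_containers=2)
# Distinct kuid ranges, identical on-disk view:
print(maps[0]["id_map"])   # (0, 1000000, 65536)
print(maps[1]["id_map"])   # (0, 1065536, 65536)
print(maps[0]["fsid_map"] == maps[1]["fsid_map"])  # True
```

Every container's processes run as different kuids/kgids, yet all of them resolve file ownership through the same image range, which is what makes shared bind-mounted paths come out with sensible ownership in each of them.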
> >
> > Now one additional thing that would be safe for a setuid wrapper
> > to allow would be arbitrary mapping of any of the uid/gid that
> > the user owns to be used within the fsuid/fsgid map. One
> > potential use for this would be to create any number of user
> > namespaces, each with their own mapping for uid 0, while still
> > having all VFS access be mapped to the user that spawned them
> > (say uid=1000, gid=1000).
> >
> > Note that in our case, the intended use for this is from a
> > privileged runtime where our images would be unshifted, as would
> > be the container storage and any shared storage for containers.
> > The security model effectively relies on properly configured
> > filesystem permissions and mount namespaces such that the
> > content of those paths can never be seen by anyone but root
> > outside of those containers (and therefore avoids all the issues
> > around setuid/setgid/fscaps).
>
> Yes, I understand ... all orchestration systems are currently
> hugely privileged. However, there is interest in getting them down
> to only "slightly privileged".
>
> James
>
> > We will then be able to allocate distinct, random ranges of
> > 65536 uids/gids (or more) for each container without ever having
> > to do any uid/gid shifting at the filesystem layer, or run into
> > issues when having to set up shared storage between containers
> > or attach external storage volumes to those containers.
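For what it's worth, the mappings discussed in this thread would presumably be written in the same "ns_first host_first count" format as uid_map/gid_map. A small sketch; note the fsuid_map/fsgid_map file names are my assumption modelled on the existing convention, and actually writing them would need the appropriate privilege over the target namespace:

```python
def map_payload(extents):
    """Render extents in the "ns_first host_first count" line format
    used by /proc/<pid>/uid_map and friends."""
    return "".join(f"{ns} {host} {count}\n"
                   for ns, host, count in extents)

id_payload = map_payload([(0, 100000, 100000)])
fsid_payload = map_payload([(0, 300000, 100000)])
print(id_payload, end="")    # 0 100000 100000
print(fsid_payload, end="")  # 0 300000 100000

# With sufficient privilege, a runtime would then do something like:
#   open(f"/proc/{pid}/uid_map", "w").write(id_payload)
#   open(f"/proc/{pid}/fsuid_map", "w").write(fsid_payload)  # assumed name
```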