From: Stéphane Graber
Date: Mon, 17 Feb 2020 17:02:52 -0500
Subject: Re: [PATCH v2 00/28] user_namespace: introduce fsid mappings
To: James Bottomley
Cc: Christian Brauner, "Eric W. Biederman", Aleksa Sarai, Jann Horn,
    Kees Cook, Jonathan Corbet, linux-kernel@vger.kernel.org,
    Linux Containers, smbarber@chromium.org, Seth Forshee,
    linux-security-module@vger.kernel.org, Alexander Viro,
    linux-api@vger.kernel.org, linux-fsdevel, Alexey Dobriyan
References: <20200214183554.1133805-1-christian.brauner@ubuntu.com>
    <1581973919.24289.12.camel@HansenPartnership.com>
X-Mailing-List: linux-kernel@vger.kernel.org

And re-sending, this time hopefully actually in plain text mode.
Sorry about that, my e-mail client isn't behaving today...

Stéphane

On Mon, Feb 17, 2020 at 4:57 PM Stéphane Graber wrote:
>
> On Mon, Feb 17, 2020 at 4:12 PM James Bottomley wrote:
>>
>> On Fri, 2020-02-14 at 19:35 +0100, Christian Brauner wrote:
>> [...]
>> > With this patch series we simply introduce the ability to create fsid
>> > mappings that are different from the id mappings of a user namespace.
>> > The whole feature set is placed under a config option that defaults
>> > to false.
>> >
>> > In the usual case of running an unprivileged container we will have
>> > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will
>> > correspond to this id mapping, i.e. all files which we want to appear
>> > as 0:0 inside the user namespace will be chowned to 100000:100000 on
>> > the host. This works, because whenever the kernel needs to do a
>> > filesystem access it will lookup the corresponding uid and gid in the
>> > idmapping tables of the container.
>> > Now think about the case where we want to have an id mapping of 0
>> > 100000 100000 but an on-disk mapping of 0 300000 100000 which is
>> > needed to e.g. share a single on-disk mapping with multiple
>> > containers that all have different id mappings.
>> > This will be problematic.
>> > Whenever a filesystem access is requested,
>> > the kernel will now try to lookup a mapping for 300000 in the id
>> > mapping tables of the user namespace but since there is none the
>> > files will appear to be owned by the overflow id, i.e. usually
>> > 65534:65534 or nobody:nogroup.
>> >
>> > With fsid mappings we can solve this by writing an id mapping of 0
>> > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem
>> > access the kernel will now lookup the mapping for 300000 in the fsid
>> > mapping tables of the user namespace. And since such a mapping
>> > exists, the corresponding files will have correct ownership.
>>
>> How do we parametrise this new fsid shift for the unprivileged use
>> case? For newuidmap/newgidmap, it's easy because each user gets a
>> dedicated range and everything "just works (tm)". However, for the
>> fsid mapping, assuming some newfsuid/newfsgid tool to help, that tool
>> has to know not only your allocated uid/gid chunk, but also the offset
>> map of the image. The former is easy, but the latter is going to vary
>> by the actual image ... well unless we standardise some accepted shift
>> for images and it simply becomes a known static offset.
>
>
> For unprivileged runtimes, I would expect images to be unshifted and be
> unpacked from within a userns. So your unprivileged user would be allowed
> a uid/gid range through /etc/subuid and /etc/subgid and allowed to use
> them through newuidmap/newgidmap. In that namespace, you can then pull
> and unpack any images/layers you may want and the resulting fs tree will
> look correct from within that namespace.
>
> All that is possible today and is how, for example, unprivileged LXC works
> right now.
>
> What this patchset then allows is for containers to have differing
> uid/gid maps while still being based off the same image or layers.
> In this scenario, you would carve a subset of your main uid/gid map for
> each container you run and run them in a child user namespace while
> setting up a fsuid/fsgid map such that their filesystem accesses do not
> follow their uid/gid map. This then results in proper isolation for
> processes, networks, ... as everything runs as different kuid/kgid but
> the VFS view will be the same in all containers.
>
> Shared storage between those otherwise isolated containers would also
> work just fine by simply bind-mounting the same path into two or more
> containers.
>
>
> Now one additional thing that would be safe for a setuid wrapper to
> allow would be for arbitrary mapping of any of the uid/gid that the user
> owns to be used within the fsuid/fsgid map. One potential use for this
> would be to create any number of user namespaces, each with their own
> mapping for uid 0 while still having all VFS access be mapped to the
> user that spawned them (say uid=1000, gid=1000).
>
>
> Note that in our case, the intended use for this is from a privileged runtime
> where our images would be unshifted as would be the container storage
> and any shared storage for containers. The security model effectively relies
> on properly configured filesystem permissions and mount namespaces such
> that the content of those paths can never be seen by anyone but root outside
> of those containers (and therefore avoids all the issues around setuid/setgid/fscaps).
>
> We will then be able to allocate distinct, random ranges of 65536 uids/gids (or more)
> for each container without ever having to do any uid/gid shifting at the filesystem layer
> or run into issues when having to set up shared storage between containers or attaching
> external storage volumes to those containers.
>
>> James
>
>
> Stéphane
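
For readers following the numbers in the quoted discussion, the lookup behaviour can be illustrated with a toy model. This is only a sketch, not the kernel's implementation: the names `Extent`, `to_ns`, `to_host` and `file_owner_in_ns` are hypothetical, each map is modelled as a list of "inside-id outside-id count" extents, and the model assumes that filesystem lookups fall back to the regular id map when no fsid map has been written.

```python
from typing import List, NamedTuple, Optional

OVERFLOW_ID = 65534  # the usual overflow id mentioned in the thread


class Extent(NamedTuple):
    """One mapping line in 'inside-id outside-id count' form (hypothetical model)."""
    ns_first: int    # first id as seen inside the user namespace
    host_first: int  # first id on the host / on disk
    count: int       # number of consecutive ids covered


IdMap = List[Extent]


def to_ns(table: IdMap, host_id: int) -> int:
    """Translate a host-side id into the namespace, or return the overflow id."""
    for e in table:
        if e.host_first <= host_id < e.host_first + e.count:
            return e.ns_first + (host_id - e.host_first)
    return OVERFLOW_ID


def to_host(table: IdMap, ns_id: int) -> Optional[int]:
    """Translate a namespace id to the host side, or None if it is unmapped."""
    for e in table:
        if e.ns_first <= ns_id < e.ns_first + e.count:
            return e.host_first + (ns_id - e.ns_first)
    return None


def file_owner_in_ns(id_map: IdMap, fsid_map: Optional[IdMap], on_disk_id: int) -> int:
    """Owner of a file as seen from inside the namespace.

    Assumption of this sketch: the fsid map is consulted for filesystem
    access when present, otherwise the regular id map is used.
    """
    table = fsid_map if fsid_map is not None else id_map
    return to_ns(table, on_disk_id)


# The numbers from the thread: id mapping 0 100000 100000,
# on-disk (fsid) mapping 0 300000 100000.
id_map = [Extent(0, 100000, 100000)]
fsid_map = [Extent(0, 300000, 100000)]

matching = file_owner_in_ns(id_map, None, 100000)        # disk layout matches id map
mismatched = file_owner_in_ns(id_map, None, 300000)      # no mapping for 300000
with_fsid = file_owner_in_ns(id_map, fsid_map, 300000)   # fsid map covers 300000

# A second container with a different id map sharing the same on-disk layout.
other_id_map = [Extent(0, 200000, 100000)]
other_with_fsid = file_owner_in_ns(other_id_map, fsid_map, 300000)

print(matching, mismatched, with_fsid, other_with_fsid)
print(to_host(id_map, 0), to_host(other_id_map, 0))
```

In this model both containers see the shared files as owned by 0, while root inside each container corresponds to a different host id (100000 and 200000), which is the combination of a common VFS view and distinct process ids described in the reply.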