From: Alban Crequy
Date: Tue, 3 Nov 2020 15:10:35 +0100
Subject: Re: [PATCH 00/34] fs: idmapped mounts
To: "Eric W. Biederman"
Cc: Aleksa Sarai, Christian Brauner, Alexander Viro, Christoph Hellwig,
 linux-fsdevel, John Johansen, James Morris, Mimi Zohar, Dmitry Kasatkin,
 Stephen Smalley, Casey Schaufler, Arnd Bergmann, Andreas Dilger,
 OGAWA Hirofumi, Geoffrey Thomas, Mrunal Patel, Josh Triplett,
 Andy Lutomirski, Amir Goldstein, Miklos Szeredi, Theodore Tso,
 Tycho Andersen, David Howells, James Bottomley, Jann Horn, Seth Forshee,
 Stéphane Graber, Lennart Poettering, smbarber@chromium.org, Phil Estes,
 Serge Hallyn, Kees Cook, Todd Kjos, Jonathan Corbet, Linux Containers,
 LSM, linux-api@vger.kernel.org, linux-ext4@vger.kernel.org,
 linux-unionfs@vger.kernel.org, linux-audit@redhat.com, linux-integrity,
 selinux@vger.kernel.org
References: <20201029003252.2128653-1-christian.brauner@ubuntu.com>
 <87pn51ghju.fsf@x220.int.ebiederm.org>
 <20201029155148.5odu4j2kt62ahcxq@yavin.dot.cyphar.com>
 <87361xdm4c.fsf@x220.int.ebiederm.org>
In-Reply-To: <87361xdm4c.fsf@x220.int.ebiederm.org>
X-Mailing-List: linux-ext4@vger.kernel.org

On Thu, Oct 29, 2020 at 5:37 PM Eric W. Biederman wrote:
>
> Aleksa Sarai writes:
>
> > On 2020-10-29, Eric W. Biederman wrote:
> >> Christian Brauner writes:
> >>
> >> > Hey everyone,
> >> >
> >> > I vanished for a little while to focus on this work here so sorry for
> >> > not being available by mail for a while.
> >> >
> >> > For quite a long time we have had issues with sharing mounts between
> >> > multiple unprivileged containers with different id mappings, sharing a
> >> > rootfs between multiple containers with different id mappings, and also
> >> > sharing regular directories and filesystems between users with different
> >> > uids and gids. The latter use-cases have become even more important with
> >> > the availability and adoption of systemd-homed (cf. [1]) to implement
> >> > portable home directories.
> >>
> >> Can you walk us through the motivating use case?
> >>
> >> As of this year's LPC I had the distinct impression that the primary use
> >> case for such a feature was due to the RLIMIT_NPROC problem where two
> >> containers with the same users still wanted different uid mappings to
> >> the disk because the users were conflicting with each other because of
> >> the per user rlimits.
> >>
> >> Fixing rlimits is straightforward to implement, and easier to manage
> >> for implementations and administrators.
> >
> > This is separate to the question of "isolated user namespaces" and
> > managing different mappings between containers. This patchset is solving
> > the same problem that shiftfs solved -- sharing a single directory tree
> > between containers that have different ID mappings. Neither rlimits nor
> > any of the other proposals we discussed at LPC will help with this
> > problem.
>
> First and foremost: A uid shift on write to a filesystem is a security
> bug waiting to happen. This is especially true in the context of
> facilities like io_uring, that play very aggressive games with how
> process context makes it to system calls.
>
> The only reason containers were not immediately exploitable when io_uring
> was introduced is because the mechanisms are built so that even if
> something escapes containment the security properties still apply.
> Changes to the uid when writing to the filesystem do not have that
> property. The tiniest slip in containment will be a security issue.
>
> This is not even the least bit theoretical. I have seen reports of how
> shiftfs+overlayfs created a situation where anyone could read
> /etc/shadow.
>
> If you are going to write using the same uid to disk from different
> containers the question becomes why can't those containers configure
> those users to use the same kuid?
>
> What fixing rlimits does is it fixes one of the reasons that different
> containers could not share the same kuid for users that want to write to
> disk with the same uid.
>
>
> I humbly suggest that it will be more secure, and easier to maintain for
> both developers and users if we fix the reasons people want different
> containers to have the same user running with different kuids.
>
> If not, what are the reasons we fundamentally need the same on-disk user
> using multiple kuids in the kernel?

I would like to use this patch set in the context of Kubernetes. I
described my two possible setups in
https://www.spinics.net/lists/linux-containers/msg36537.html:

1. Each Kubernetes pod has its own userns but with the same user id mapping
2. Each Kubernetes pod has its own userns with non-overlapping user id
mapping (providing additional isolation between pods)

But even in the setup where all pods run with the same id mappings,
this patch set is still useful to me for 2 reasons:

1. To avoid the expensive recursive chown of the rootfs.
We cannot necessarily extract the tarball directly with the right uids
because we might use the same container image for privileged containers
(with the host userns) and unprivileged containers (with a new userns),
so we have at least 2 “mappings” (taking more time and resulting in more
storage space). Although the “metacopy” mount option in overlayfs helps
to make the recursive chown faster, it can still take time with large
container images with lots of files. I’d like to use this patch set to
set up the root fs in constant time.

2. To manage large external volumes (NFS or other filesystems). Even
if admins can decide to use the same kuid on all the nodes of the
Kubernetes cluster, this is impractical for migration. People can have
existing Kubernetes clusters (currently without using user namespaces)
and large NFS volumes. If they want to switch to a new version of
Kubernetes with the user namespace feature enabled, they would need to
recursively chown all the files on the NFS shares. This could take
time on large filesystems and realistically, we want to support
rolling updates where some nodes use the previous version without user
namespaces and new nodes are progressively migrated to the new userns
with the new id mapping. If both sets of nodes use the same NFS share,
that can’t work.

Alban
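
P.S. For readers less familiar with id mappings, the two pod setups
listed above can be sketched roughly as follows. This is only an
illustration, not code from Kubernetes or from the patch set: it assumes
the three-column "inside outside count" format used by
/proc/<pid>/uid_map, and the ranges (100000, 165536) and the helper names
are hypothetical values picked for the example.

```python
from typing import List, NamedTuple, Optional


class IdMapEntry(NamedTuple):
    inside: int   # first id as seen inside the user namespace
    outside: int  # first id as seen from the parent (host) namespace
    count: int    # number of consecutive ids covered by this entry


def parse_id_map(text: str) -> List[IdMapEntry]:
    """Parse lines in the 'inside outside count' uid_map format."""
    return [
        IdMapEntry(*(int(field) for field in line.split()))
        for line in text.splitlines()
        if line.strip()
    ]


def to_host(id_map: List[IdMapEntry], inside_id: int) -> Optional[int]:
    """Translate an in-container id to a host id, or None if unmapped."""
    for entry in id_map:
        if entry.inside <= inside_id < entry.inside + entry.count:
            return entry.outside + (inside_id - entry.inside)
    return None


def host_ranges_overlap(a: List[IdMapEntry], b: List[IdMapEntry]) -> bool:
    """Return True if any host-side range of a intersects one of b."""
    return any(
        x.outside < y.outside + y.count and y.outside < x.outside + x.count
        for x in a
        for y in b
    )


# Hypothetical mappings, chosen only for illustration.
pod_a = parse_id_map("0 100000 65536")
pod_b_same = parse_id_map("0 100000 65536")      # setup 1: same mapping
pod_b_disjoint = parse_id_map("0 165536 65536")  # setup 2: non-overlapping

print(to_host(pod_a, 0), to_host(pod_b_disjoint, 0))
print(host_ranges_overlap(pod_a, pod_b_same))
print(host_ranges_overlap(pod_a, pod_b_disjoint))
```

In the non-overlapping setup, root in each pod corresponds to a different
host uid, so a file owned on disk by one pod's host uid has no mapping in
the other pod. That is why sharing a rootfs or a volume between such pods
currently requires either per-mapping copies or a recursive chown.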