Date: Wed, 21 Feb 2024 22:37:05 -0500
From: Kent Overstreet
To: James Bottomley
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    lsf-pc@lists.linux-foundation.org, Christian Brauner,
    Stéphane Graber
Subject: Re: [LSF TOPIC] beyond uidmapping, & towards a better security model

On Thu, Feb 22, 2024 at 01:33:14AM +0100, James Bottomley wrote:
> On Wed, 2024-02-21 at 18:01 -0500, Kent Overstreet wrote:
> > Strings are just arrays of integers, and anyways this stuff would be
> > within helpers.
>
> Length limits and comparisons are the problem

We'd be using qstrs for this, not c strings, so they really are
equivalent to arrays for this purpose.

>
> >
> > But what you're not seeing is the beauty and simplicity of killing
> > the mapping layer.
>
> Well, that's the problem: you don't for certain use cases. That's what
> I've been trying to explain. For the fully unprivileged use case,
> sure, it all works (as does the upper 32 bits proposal or the integer
> array ... equally well.
>
> Once you're representing to the userns contained entity they have a
> privileged admin that can write to the fsimage as an apparently
> privileged user then the problems begin.

In what sense?

If they're in a userns and all their mounts are username mapped, that's
completely fine from a userns POV; they can put a suid root binary into
the fs image but when they mount that suid root will be suid to the root
user of their userns.
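
To make the suid case concrete, here is a minimal userspace sketch of a
username-mapped mount. It is an illustration under stated assumptions, not
kernel code: users are modelled as tuples of name components, the empty
tuple stands for root, and the mount mapping is assumed to be a plain
prefix with the userns's username. The helper names (parse_user,
format_user, map_into_userns) are hypothetical.

```python
# Assumption: a username is a sequence of components, written "a.b.c".
# Assumption: the empty username stands for actual root.
ROOT = ()


def parse_user(name):
    """Split a dotted username into components; "" is root."""
    return tuple(name.split(".")) if name else ROOT


def format_user(user):
    """Join components back into a dotted username."""
    return ".".join(user)


def map_into_userns(userns_user, image_owner):
    """Map an owner stored in the image to the user seen by the system.

    Assumed rule: prepend the username of the userns to the on-image owner.
    """
    return userns_user + image_owner


userns = parse_user("alice.container")

# A suid binary owned by root *inside the image* ends up suid to the
# userns's own user, never to actual root.
suid_owner = map_into_userns(userns, ROOT)
print(format_user(suid_owner))   # alice.container

# Any other on-image owner becomes a subuser of the userns.
www = map_into_userns(userns, parse_user("www"))
print(format_user(www))          # alice.container.www
```

Under these assumptions no owner stored in the image can map to the empty
username unless the userns itself is root, which is the property the
paragraph above relies on.
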
>
> > When usernames are strings all the way into the kernel, creating and
> > switching to a new user is a single syscall. You can't do that if
> > users are small integer identifiers to the kernel; you have to create
> > a new entry in /etc/passwd or some equivalent, and that is strictly
> > required in order to avoid collisions. Users also can't be ephemeral.
> >
> > To sketch out an example of how this would work, say we've got a new
> > set_subuser() syscall and the username equivalent of chown().
> >
> > Now if we want to run firefox as a subuser, giving it access only to
> > .local/state/firefox, we'd do the following sequence of syscalls
> > within the start of the new firefox process:
> >
> > mkdir(".local/state/firefox");
> > chown_subuser(".local/state/firefox", "firefox"); /* now owned by $USER.firefox */
> > set_subuser("firefox");
> >
> > If we want to guarantee uniqueness, we'd append a UUID to the
> > subusername for the chown_subuser() call, and then for subsequent
> > invocations read it with statx() (or subuser enabled equivalent) for
> > the set_subuser() call.
> >
> > Now firefox is running in a sandbox, where it has no access to the
> > rest of your home directory - unless explicitly granted with normal
> > ACLs. And the sandbox requires no system configuration; rm -rfing the
> > .local/state/firefox directory cleans everything up.
> >
> > And these trivially nest: Firefox itself wants to sandbox individual
> > tabs from each other, so firefox could run each sub-process as a
> > different subuser.
> >
> > This is dead easy compared to what we've been doing.
>
> The above is the unprivileged use case. It works, but it's not all we
> have to support.

There is only one root user, in the sense of _actual_ root -
CAP_SYS_ADMIN and all that.

>
> > > > > However, neither proposal would get us out of the problem of
> > > > > mount mapping because we'd have to keep the filesystem
> > > > > permission check on the owning uid unless told otherwise.
> > > >
> > > > Not sure I follow?
> > >
> > > Mounting a filesystem inside a userns can cause huge security
> > > problems if we map fs root to inner root without the admin blessing
> > > it. Think of binding /bin into the userns and then altering one of
> > > the root owned binaries as inner root: if the permission check
> > > passes, the change appears in system /bin.
> >
> > So with this proposal mount mapping becomes "map all users on this
> > filesystem to subusers of username x". That's a much simpler mapping
> > than mapping integer ranges to integer ranges, much easier to verify
> > that there aren't accidental root escapes.
>
> That doesn't work for the privileged container run in unprivileged
> userns containment use case because we need a mapping from inner to
> outer root.

I can't parse this. "Privileged container in an unprivileged
containment"? Do you just mean a container that has a root user (which
is only root over that container, not the rest of the system, of
course)?

Any user is root over its subusers - so that works perfectly.

Or do you mean something else by "privileged container"? Do you mean a
container that actually has CAP_SYS_ADMIN?

> > > > And it wouldn't have to be administrator assigned. Some
> > > > administrator assignment might be required for the username <->
> > > > 16 bit uid mapping, but if those mappings are ephemeral (i.e. if
> > > > we get filesystems persistently storing usernames, which is easy
> > > > enough with xattrs) then that just becomes "reserve x range of
> > > > the 16 bit uid space for ephemeral translations".
> > >
> > > *if* the user names you're dealing with are all unprivileged. When
> > > we have a mix of privileged and unprivileged users owning the
> > > files, the problems begin.
> >
> > Yes, all subusers are unprivileged - only one username, the empty
> > username (which we'd probably map to root) maps to existing uid 0.
>
> But, as I said above, that's only a subset of the use cases.
> The equally big use case is figuring out how to run privileged
> containers in a deprivileged mode and yet still allow them to update
> images (and other things).

If you're running in a userns, all your mounts get the same user mapping
as your userns - where that usermapping is just prepending the username
of the userns. That part is easy.

The big difficulty with letting them update images is that our current
filesystems really aren't ready for the mounting of untrusted images -
they're ~100k LOC codebases each and the amount of hardening required is
significant.

I would hazard a guess that XFS is the furthest along in this respect
(from all the screaming I hear from Darrick about syzkaller it sounds
like they're taking this the most seriously) - but I would hesitate to
depend on any of our filesystems to be secure in this respect, even my
own - not until we get them rewritten in Rust...
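
As a closing illustration of the rule "any user is root over its
subusers", the following is a small userspace model, not a proposed
implementation. It assumes usernames are compared component by component
(as arrays, not as raw strings) and that the empty username is actual
root. set_subuser() here is only a Python stand-in for the hypothetical
syscall of the same name quoted earlier; has_authority() is an invented
helper.

```python
ROOT = ()  # assumption: the empty username is actual root


def set_subuser(current, name):
    """Model of the hypothetical set_subuser(): descend into a subuser."""
    return current + (name,)


def has_authority(actor, target):
    """True if actor is target itself or an ancestor of target.

    The comparison is per component, so "alice.firefox" has no authority
    over "alice.firefoxy" even though one string is a prefix of the other.
    """
    return target[:len(actor)] == actor


alice = ("alice",)
firefox = set_subuser(alice, "firefox")   # alice.firefox
tab = set_subuser(firefox, "tab1")        # alice.firefox.tab1

print(has_authority(alice, tab))      # True: alice is root over her subusers
print(has_authority(firefox, tab))    # True: sandboxes nest
print(has_authority(tab, firefox))    # False: a tab cannot reach its parent
print(has_authority(firefox, alice))  # False: firefox cannot reach $HOME
print(has_authority(ROOT, tab))       # True: actual root is over everyone
```

The per-component comparison is the reason length-carrying name arrays
matter here: a plain string-prefix test would wrongly grant
"alice.firefox" authority over "alice.firefoxy".
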