Date: Mon, 29 Apr 2024 11:12:39 +0200
From: Christian Brauner
To: Andy Lutomirski
Cc: Stas Sergeev, Aleksa Sarai, "Serge E.
 Hallyn", linux-kernel@vger.kernel.org, Stefan Metzmacher, Eric Biederman,
 Alexander Viro, Andy Lutomirski, Jan Kara, Jeff Layton, Chuck Lever,
 Alexander Aring, David Laight, linux-fsdevel@vger.kernel.org,
 linux-api@vger.kernel.org, Paolo Bonzini, Christian Göttsche
Subject: Re: [PATCH v5 0/3] implement OA2_CRED_INHERIT flag for openat2()
Message-ID: <20240429-donnerstag-behilflich-a083311d8e00@brauner>
References: <20240426133310.1159976-1-stsp2@yandex.ru>

On Sun, Apr 28, 2024 at 09:41:20AM -0700, Andy Lutomirski wrote:
> > On Apr 26, 2024, at 6:39 AM, Stas Sergeev wrote:
> > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > It is needed to perform an open operation with the creds that were in
> > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > flag. This allows the process to pre-open some dirs and switch eUID
> > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > the possibility to open/create files within the pre-opened directory set.
> >
>
> I’ve been contemplating this, and I want to propose a different solution.
>
> First, the problem Stas is solving is quite narrow and doesn’t
> actually need kernel support: if I want to write a user program that
> sandboxes itself, I have at least three solutions already. I can make
> a userns and a mountns; I can use landlock; and I can have a separate
> process that brokers filesystem access using SCM_RIGHTS.
>
> But what if I want to run a container, where the container can access
> a specific host directory, and the contained application is not aware
> of the exact technology being used?
> I recently started using
> containers in anger in a production setting, and “anger” was
> definitely the right word: binding part of a filesystem in is
> *miserable*. Getting the DAC rules right is nasty. LSMs are worse.

Nowadays it's extremely simple due to open_tree(OPEN_TREE_CLONE) and
move_mount(). I rewrote the bind-mount logic in systemd based on that
and util-linux uses that as well now.

https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

> Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I
> actually gave up on making one of my use cases work on a Fedora
> system.
>
> Here’s what I wanted to do, logically, in production: pick a host
> directory, pick a host *principal* (UID, GID, label, etc), and have
> the *entire container* access the directory as that principal. This is
> what happens automatically if I run the whole container as a userns
> with only a single UID mapped, but I don’t really want to do that for
> a whole variety of reasons.

You're describing idmapped mounts for the most part, which are upstream
and are used in exactly that way by a lot of userspace.

>
> So maybe reimagining Stas’ feature a bit can actually solve this
> problem. Instead of a special dirfd, what if there was a special
> subtree (in the sense of open_tree) that captures a set of creds and
> does all opens inside the subtree using those creds?

That would mean overriding creds in the VFS layer when accessing a
specific subtree, which is a terrible idea imho. Not just because it will
quickly become a potential DoS when you do that with a lot of subtrees;
it will also have complex interactions with overlayfs.

>
> This isn’t a fully formed proposal, but I *think* it should be
> generally fairly safe for even an unprivileged user to clone a subtree
> with a specific flag set to do this.
> Maybe a capability would be
> needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> this to a daemon if a privilege is needed, and getting the API right
> might be a bit tricky.
>
> Then two different things could be done:
>
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
>
> 2. The subtree could be mounted, and accesses through the mount would
> use the captured creds.
>
> (Hmm. What would a new open_tree() pointing at this special subtree do?)
>
>
> With all this done, if userspace wired it up, a container user could
> do something like:
>
> --bind-capture-creds source=dest
>
> And the contained program would access source *as the user who started
> the container*, and this would just work without relabeling or
> fiddling with owner uids or gids or ACLs, and it would continue to
> work even if the container has multiple dynamically allocated subuids
> mapped (e.g. one for “root” and one for the actual application).
>
> Bonus points for the ability to revoke the creds in an already opened
> subtree. Or even for the creds to automatically revoke themselves when
> the opener exits (or maybe when a specific cred-pinning fd goes away).
>
> (This should work for single files as well as for directories.)
>
> New LSM hooks or extensions of existing hooks might be needed to make
> LSMs comfortable with this.
>
> What do you all think?

I think the problem you're describing is already mostly solved.
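
For readers who have not used the open_tree(OPEN_TREE_CLONE) and
move_mount() pair mentioned above, here is a minimal sketch of a bind
mount built from those two syscalls. It is an illustration only and is
not taken from systemd or util-linux. The helper names (_syscall,
bind_mount_detached) are hypothetical, the syscall numbers assume an
architecture that uses the unified syscall table (x86-64, arm64 and most
others), and the call is expected to need CAP_SYS_ADMIN over the mount
namespace.

```python
import ctypes
import os

# Assumption: numbers from the unified syscall table (Linux 5.2 and later).
SYS_OPEN_TREE = 428
SYS_MOVE_MOUNT = 429

# Values from the kernel UAPI headers <linux/mount.h> and <linux/fcntl.h>.
AT_FDCWD = -100
AT_RECURSIVE = 0x8000
OPEN_TREE_CLONE = 1
OPEN_TREE_CLOEXEC = os.O_CLOEXEC
MOVE_MOUNT_F_EMPTY_PATH = 0x4

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def _syscall(number, *args):
    """Hypothetical helper: raw syscall that raises OSError on failure."""
    ret = _libc.syscall(ctypes.c_long(number), *args)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def bind_mount_detached(source, target):
    """Clone the mount tree at `source` and attach the copy at `target`."""
    # Step 1: open_tree() with OPEN_TREE_CLONE returns an fd that refers
    # to a detached copy of the mount tree; nothing is visible yet.
    tree_fd = _syscall(
        SYS_OPEN_TREE,
        ctypes.c_int(AT_FDCWD),
        ctypes.c_char_p(os.fsencode(source)),
        ctypes.c_uint(OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE),
    )
    try:
        # Step 2: move_mount() attaches the detached tree at the target.
        # The empty from-path plus MOVE_MOUNT_F_EMPTY_PATH means "the
        # mount that tree_fd itself refers to".
        _syscall(
            SYS_MOVE_MOUNT,
            ctypes.c_int(tree_fd),
            ctypes.c_char_p(b""),
            ctypes.c_int(AT_FDCWD),
            ctypes.c_char_p(os.fsencode(target)),
            ctypes.c_uint(MOVE_MOUNT_F_EMPTY_PATH),
        )
    finally:
        os.close(tree_fd)
```

A privileged caller would use it as bind_mount_detached("/srv/data",
"/mnt/data"), where both paths are placeholders. The sketch attaches the
tree in the caller's own mount namespace for simplicity. Because the
detached tree is just a file descriptor between the two steps, it can in
principle be handed to, or attached from, a different context before
move_mount() is called.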