Received: by 2002:a25:1985:0:0:0:0:0 with SMTP id 127csp3253102ybz; Mon, 27 Apr 2020 12:43:49 -0700 (PDT) X-Google-Smtp-Source: APiQypKs31i/7xBUtUJKlYK9/dkjqLd36gsqVkpCA/9ssKcTxbNoLzoKlRZNhZk1QqZ9MBUQqYet X-Received: by 2002:a17:906:b2c2:: with SMTP id cf2mr20396700ejb.262.1588016629653; Mon, 27 Apr 2020 12:43:49 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1588016629; cv=none; d=google.com; s=arc-20160816; b=e/UbQfbKGHvaDjq0o8/0puWSeOCp1VN9SjP8jAyxDMAqlQlJcCtnXnef3Nf3rFOHhC mgkb2guf0KyPw66kcirFmfmBMUxUJWQR743xzuhB1TiX5PCOtOzqqyz4aTtXE5hMgbeQ kbxKTXkXlL0B+ODaaJJC1mTgeVV13qZ2Xi1UqNEPE4E0sSeodoVExyU040sEnDklJOv2 dmCoOI+QydiJZdKxMSuIVS4m9NsT2Kone9wXTPVFsmZ1tX1f0k+o4hMscbO5fkktfjeg 86iYxU76BiVQBEJHhy/V+mfZWlO2JOCfmTXYOiLOMXhUUQ9zi2ZYzSRqwKaiaZC5RW8i udxg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=Xl34K+WyxLr+OBvYQN0aKmiQr358TgE2J+a3gDDkuqM=; b=B3C2SvRqhr5kd2Nu0tVk4Iyz9wAPaRyiT3Z0aPMo3bmFEzAb4RA8g8DkmHQ+FkAtqm A5FmMw0tvocVGgxtveEt7dQGZQUCu0Pcy83bgQ17SmI1vX+pyWcFkfV72uhCfsA4KPwL rmnO+zXxhlZ8qvtXizyCP8/e8kJlsHPE0hh+oBuwbbJaQMxZh0Y302HXEUmRZcZ5QFzA vC/A0Lf7KlOZNwiJNBWWqQP+H8OY0wQq/VqmBPNNT1BfLbNSBMCQJsGfY3YfvMd98O5G CfnX85d78MwPZVEoZRe6j/+r4WBCISek3zpH0mM+eQcdb+hHS2EEJ0FY8xom1qQAviZN DTDQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=MF42SY0y; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id f6si296126edw.325.2020.04.27.12.43.26; Mon, 27 Apr 2020 12:43:49 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=MF42SY0y; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726690AbgD0Tlt (ORCPT + 99 others); Mon, 27 Apr 2020 15:41:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55298 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S1726384AbgD0Tls (ORCPT ); Mon, 27 Apr 2020 15:41:48 -0400 Received: from mail-lj1-x243.google.com (mail-lj1-x243.google.com [IPv6:2a00:1450:4864:20::243]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9282DC03C1A8 for ; Mon, 27 Apr 2020 12:41:48 -0700 (PDT) Received: by mail-lj1-x243.google.com with SMTP id b2so18882661ljp.4 for ; Mon, 27 Apr 2020 12:41:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Xl34K+WyxLr+OBvYQN0aKmiQr358TgE2J+a3gDDkuqM=; b=MF42SY0yJDHJdeoyY2Bjwc+L+5Rr7RvBuYk1vaEDqwpzTy0Ar8l+AbQfdBq1oayuyv SJXbUSShjHhaEBMZLwdk+OxRBeDjhp1VRKMog083j5glNgfGCc+7I5FgZ6vPiIIHhGWh sF7V0Ruklt+nNaNQX+JILLGfdp6vZWflEQ6iV76Id0K1rDOJArWBcMfpAUBaBLoVsXVb ypw+plAYtlWVCBp2m7iOCt6tepzBEZbd2OKkM4g8zh2qfOKC//uFor3uGBjViUMwr4CN VzVmewGExwi+4HTaPdyE/SZ/V0ozK5wjvdksx5p7yrkQWK1+DILp1h1GhuPi1M7VJjv0 YM8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Xl34K+WyxLr+OBvYQN0aKmiQr358TgE2J+a3gDDkuqM=; b=YWo1s8DxRTUeXZeb+XYTKCBXxzSbqJftQo7wu8Kh6d0QVtAWNzqU50z/gymI8wOCNI 7LiSEp9Ca1SGtnQVaLccI8slVtQlu7CZu8OrLL6Xh4VvpTcUHJ6/aTZZBF+aVaBP8qVP fQ59s9EbvOcaZAfRg5YpxtukiAw9TiQX1hxiMz7R1gbBw5jr5PJZeCd/5jKqllUJuhXK sM+XMESu/RcwSW6O6UkbLyHDGutSEIJWSeHbfyxdz6B36ivrFqjOS8qvmBWemryoiHky Zv/kdKESaT3U0jIQDpHXP+08/mdV8bnRKdKj8sZWq/HoLw1PKfIZcVIAOsr662TyV/xX bGLw== X-Gm-Message-State: AGi0PuYg1G9kGVR0rF68yrZTXktN+Dlt+KcXKFb0mqZRjC19l97+iF5n 0MYzy1RGoNghdq2dme9taxdfEpLirEx2Ci7lNs9vqw== X-Received: by 2002:a2e:760c:: with SMTP id r12mr14755084ljc.139.1588016506746; Mon, 27 Apr 2020 12:41:46 -0700 (PDT) MIME-Version: 1.0 References: <20200427143646.619227-1-christian.brauner@ubuntu.com> <20200427181507.ry3hw7ufiifwhi5k@wittgenstein> In-Reply-To: <20200427181507.ry3hw7ufiifwhi5k@wittgenstein> From: Jann Horn Date: Mon, 27 Apr 2020 21:41:20 +0200 Message-ID: Subject: Re: [PATCH] nsproxy: attach to namespaces via pidfds To: Christian Brauner Cc: kernel list , Alexander Viro , =?UTF-8?Q?St=C3=A9phane_Graber?= , Linux Containers , "Eric W . Biederman" , Serge Hallyn , Aleksa Sarai , linux-security-module , Kernel Hardening , Linux API Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Apr 27, 2020 at 8:15 PM Christian Brauner wrote: > On Mon, Apr 27, 2020 at 07:28:56PM +0200, Jann Horn wrote: > > On Mon, Apr 27, 2020 at 4:47 PM Christian Brauner > > wrote: [...] > > > That means > > > setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However, > > > when a pidfd is passed, multiple namespace flags can be specified in the > > > second setns() argument and setns() will attach the caller to all the > > > specified namespaces all at once or to none of them. If 0 is specified > > > together with a pidfd then setns() will interpret it the same way 0 is > > > interpreted together with a nsfd argument, i.e. attach to any/all > > > namespaces. > > [...] > > > Apart from significiantly reducing the number of syscalls from double > > > digit to single digit which is a decent reason post-spectre/meltdown > > > this also allows to switch to a set of namespaces atomically, i.e. > > > either attaching to all the specified namespaces succeeds or we fail. > > > > Apart from the issues I've pointed out below, I think it's worth > > calling out explicitly that with the current design, the switch will > > not, in fact, be fully atomic - the process will temporarily be in > > intermediate stages where the switches to some namespaces have > > completed while the switches to other namespaces are still pending; > > and while there will be less of these intermediate stages than before, > > it also means that they will be less explicit to userspace. > > Right, that can be fixed by switching to the unshare model of getting a > new set of credentials and committing it after the nsproxy has been > installed? Then there shouldn't be an intermediate state anymore or > rather an intermediate stage where we can still fail somehow. It still wouldn't be atomic (in the sense of parallelism, not in the sense of intermediate error handling) though; for example, if task B does setns(, 0) and task C concurrently does setns(, 0), then task C may end up with the new mount namespace of task B but the old user namespace, or something like that. If C is more privileged than B, that may cause C to have more privileges through its configuration of namespaces than B does (e.g. by running in the &init_user_ns but with a mount namespace owned by an unprivileged user), which C may not expect. Same thing for racing between unshare() and setns(). [...] > > > + put_user_ns(user_ns); > > > + } > > > +#else > > > + if (flags & CLONE_NEWUSER) > > > + ret = -EINVAL; > > > +#endif > > > + > > > + if (!ret && wants_ns(flags, CLONE_NEWNS)) > > > + ret = __ns_install(nsproxy, mnt_ns_to_common(nsp->mnt_ns)); > > > > And this one might be even worse, because the mount namespace change > > itself is only stored in the nsproxy at this point, but the cwd and > > root paths have already been overwritten on the task's fs_struct. > > > > To actually make sys_set_ns() atomic, I think you'd need some > > moderately complicated prep work, splitting the ->install handlers up > > into prep work and a commit phase that can't fail. > > Wouldn't it be sufficient to move to an unshare like model, i.e. > creating a new set of creds, and passing the new user_ns to > create_new_namespaces() as well as having a temporary new_fs struct? > That should get rid of all intermediate stages. Ah, good point, I didn't realize that that already exists for unshare().