Date: Sat, 20 Apr 2019 07:14:06 +0000
From: Kevin Easton
To: Andy Lutomirski
Cc: Aleksa Sarai, "Enrico Weigelt, metux IT consult", Christian Brauner,
    Linus Torvalds, Al Viro, Jann Horn, David Howells, Linux API, LKML,
    "Serge E. Hallyn", Arnd Bergmann, "Eric W. Biederman", Kees Cook,
    Thomas Gleixner, Michael Kerrisk, Andrew Morton, Oleg Nesterov,
    Joel Fernandes, Daniel Colascione
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
Message-ID: <20190420071406.GA22257@ip-172-31-15-78>
References: <20190414201436.19502-1-christian@brauner.io> <20190415195911.z7b7miwsj67ha54y@yavin>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai wrote:
> >
> > On 2019-04-15, Enrico Weigelt, metux IT consult wrote:
> > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > clone() system call as previously discussed.
> > >
> > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > consider when introducing new CLONE_* flags.
> > >
> > > The reason I'm asking is:
> > >
> > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > processes can change their own namespace at will. For that, certain
> > > traditional unix'ish things have to be disabled, most notably suid.
> > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > about making this its own feature. Doing that switch on clone() seems
> > > a nice place for that, IMHO.
> >
> > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > Not granting privileges such as setuid during execve(2) is the main
> > point of that flag.
> >
>
> I would personally *love* it if distros started setting no_new_privs
> for basically all processes.
> And pidfd actually gets us part of the
> way toward a straightforward way to make sudo and su still work in a
> no_new_privs world: su could call into a daemon that would spawn the
> privileged task, and su would get a (read-only!) pidfd back and then
> wait for the fd and exit. I suppose that, done naively, this might
> cause some odd effects with respect to tty handling, but I bet it's
> solveable. I suppose it would be nifty if there were a way for a
> process, by mutual agreement, to reparent itself to an unrelated
> process.
>
> Anyway, clone(2) is an enormous mess. Surely the right solution here
> is to have a whole new process creation API that takes a big,
> extensible struct as an argument, and supports *at least* the full
> abilities of posix_spawn() and ideally covers all the use cases for
> fork() + do stuff + exec(). It would be nifty if this API also had a
> way to say "add no_new_privs and therefore enable extra functionality
> that doesn't work without no_new_privs". This functionality would
> include things like returning a future extra-privileged pidfd that
> gives ptrace-like access.
>
> As basic examples, the improved process creation API should take a
> list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> from, fds to close (or, maybe even better, a list of fds to *not*
> close), a list of rlimit changes to make, a list of signal changes to
> make, the ability to set sid, pgrp, uid, gid (as in
> setresuid/setresgid), the ability to do capset() operations, etc. The
> posix_spawn() API, for all that it's rather complicated, covers a
> bunch of the basics pretty well.

The idea of a system call that takes an infinitely-extendable laundry
list of operations to perform in kernel space seems quite inelegant, if
only for the error-reporting reason.

Instead, I suggest that what you'd want is a way to create a new
embryonic process that has no address space and isn't yet schedulable.
You then just need other-process-directed variants of all the normal
setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
etc.

Then when it's all set up you pr_execve() to kick it off.

    - Kevin
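
For comparison, the "list of operations" style already exists in
userspace as the posix_spawn() file-actions API, so the error-reporting
concern can be illustrated with real calls. The sketch below assumes a
POSIX system with /bin/sh; spawn_with_actions() is a hypothetical helper
name for illustration, not something from this thread. It queues two
file actions and spawns a trivial child. The pr_*() calls proposed above
do not exist, so they are not shown.

```c
/* Sketch only: "queue a list of operations, then spawn", using the real
 * posix_spawn() file-actions API. spawn_with_actions() is a hypothetical
 * helper name chosen for this example. */
#include <errno.h>
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Queue two file actions, then spawn `/bin/sh -c "exit 0"`.
 * src_fd is duplicated onto the child's stdout; pass an fd that is not
 * open to make that action fail.
 * Returns 0 if the child ran and exited 0; otherwise an errno value, the
 * child's nonzero exit status, or -1 if the child was killed by a signal.
 * In every failure case the caller gets a single number back. */
int spawn_with_actions(int src_fd)
{
    posix_spawn_file_actions_t fa;
    char *argv[] = { "sh", "-c", "exit 0", NULL };
    pid_t pid;
    int rc, status;

    rc = posix_spawn_file_actions_init(&fa);
    if (rc != 0)
        return rc;

    /* Action 1: duplicate src_fd onto stdout in the child. */
    rc = posix_spawn_file_actions_adddup2(&fa, src_fd, 1);
    /* Action 2: open /dev/null as stdin in the child. */
    if (rc == 0)
        rc = posix_spawn_file_actions_addopen(&fa, 0, "/dev/null",
                                              O_RDONLY, 0);
    if (rc == 0)
        rc = posix_spawn(&pid, "/bin/sh", &fa, NULL, argv, environ);

    posix_spawn_file_actions_destroy(&fa);
    if (rc != 0)
        return rc; /* an action or the exec failed; rc does not say which */

    if (waitpid(pid, &status, 0) < 0)
        return errno;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with an fd that is not open, the helper returns nonzero: an errno
value, or exit status 127 on implementations that only report the
failure from the child. From that value alone the caller cannot tell
whether the dup2, the open, or the exec was at fault. With one call per
setup step, as suggested above, each step would return its own error.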