From: Daniel Colascione
Date: Sat, 20 Apr 2019 08:06:19 -0700
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
To: Kevin Easton
Cc: Andy Lutomirski, Aleksa Sarai, "Enrico Weigelt, metux IT consult",
    Christian Brauner, Linus Torvalds, Al Viro, Jann Horn, David Howells,
    Linux API, LKML, "Serge E. Hallyn", Arnd Bergmann, "Eric W. Biederman",
    Kees Cook, Thomas Gleixner, Michael Kerrisk, Andrew Morton,
    Oleg Nesterov, Joel Fernandes
References: <20190414201436.19502-1-christian@brauner.io>
    <20190415195911.z7b7miwsj67ha54y@yavin>
    <20190420071406.GA22257@ip-172-31-15-78>
In-Reply-To: <20190420071406.GA22257@ip-172-31-15-78>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Apr 20, 2019 at 12:14 AM Kevin Easton wrote:
> On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai wrote:
> > >
> > > On 2019-04-15, Enrico Weigelt, metux IT consult wrote:
> > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > clone() system call as previously discussed.
> > > >
> > > > Sorry, for highjacking this thread, but I'm curious on what things to
> > > > consider when introducing new CLONE_* flags.
> > > >
> > > > The reason I'm asking is:
> > > >
> > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > processes can change their own namespace at will. For that, certain
> > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > about making this its own feature. Doing that switch on clone() seems
> > > > a nice place for that, IMHO.
> > >
> > > Just spit-balling -- is no_new_privs not sufficient for this usecase?
> > > Not granting privileges such as setuid during execve(2) is the main
> > > point of that flag.
> > >
> > I would personally *love* it if distros started setting no_new_privs
> > for basically all processes. And pidfd actually gets us part of the
> > way toward a straightforward way to make sudo and su still work in a
> > no_new_privs world: su could call into a daemon that would spawn the
> > privileged task, and su would get a (read-only!) pidfd back and then
> > wait for the fd and exit. I suppose that, done naively, this might
> > cause some odd effects with respect to tty handling, but I bet it's
> > solveable. I suppose it would be nifty if there were a way for a
> > process, by mutual agreement, to reparent itself to an unrelated
> > process.
> >
> > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > is to have a whole new process creation API that takes a big,
> > extensible struct as an argument, and supports *at least* the full
> > abilities of posix_spawn() and ideally covers all the use cases for
> > fork() + do stuff + exec(). It would be nifty if this API also had a
> > way to say "add no_new_privs and therefore enable extra functionality
> > that doesn't work without no_new_privs". This functionality would
> > include things like returning a future extra-privileged pidfd that
> > gives ptrace-like access.
> >
> > As basic examples, the improved process creation API should take a
> > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > from, fds to close (or, maybe even better, a list of fds to *not*
> > close), a list of rlimit changes to make, a list of signal changes to
> > make, the ability to set sid, pgrp, uid, gid (as in
> > setresuid/setresgid), the ability to do capset() operations, etc. The
> > posix_spawn() API, for all that it's rather complicated, covers a
> > bunch of the basics pretty well.
>
> The idea of a system call that takes an infinitely-extendable laundry
> list of operations to perform in kernel space seems quite inelegant, if
> only for the error-reporting reason.
>
> Instead, I suggest that what you'd want is a way to create a new
> embryonic process that has no address space and isn't yet schedulable.
> You then just need other-process-directed variants of all the normal
> setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> etc.

Providing process-directed versions of these functions would be useful
for a variety of management tasks anyway.

> Then when it's all set up you pr_execve() to kick it off.

Yes. That's the right general approach.
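
For comparison with the "list of operations" style discussed above, here
is a rough sketch of how today's posix_spawn() interface expresses a
small part of that list: an open and a dup2() in the child, a signal
disposition reset, and a new process group. It is only an illustration,
not a proposal; the helper name spawn_redirected() and its return
convention are made up for this example. The pr_*() calls named above do
not exist, so they are not shown.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper (not part of any library): run argv[0] with stdout
 * and stderr sent to out_path, SIGPIPE reset to its default action, and
 * the child placed in its own process group.
 * Returns the child's exit status, or -1 on any failure.
 */
int spawn_redirected(char *const argv[], const char *out_path)
{
    posix_spawn_file_actions_t fa;
    posix_spawnattr_t attr;
    sigset_t defaults;
    pid_t pid;
    int status;
    int err;

    assert(argv != NULL && argv[0] != NULL);

    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;
    if (posix_spawnattr_init(&attr) != 0) {
        posix_spawn_file_actions_destroy(&fa);
        return -1;
    }

    /* File actions run in the child, in the order they were added. */
    err = posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, out_path,
                                           O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (!err)
        err = posix_spawn_file_actions_adddup2(&fa, STDOUT_FILENO,
                                               STDERR_FILENO);

    /* Attribute changes: default SIGPIPE handling, new process group. */
    sigemptyset(&defaults);
    sigaddset(&defaults, SIGPIPE);
    if (!err)
        err = posix_spawnattr_setsigdefault(&attr, &defaults);
    if (!err)
        err = posix_spawnattr_setpgroup(&attr, 0);
    if (!err)
        err = posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSIGDEF |
                                              POSIX_SPAWN_SETPGROUP);

    /* One call carries the whole list; the result is one error number. */
    if (!err)
        err = posix_spawnp(&pid, argv[0], &fa, &attr, argv, environ);

    posix_spawn_file_actions_destroy(&fa);
    posix_spawnattr_destroy(&attr);
    if (err)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would build an argv such as {"sh", "-c", "exit 3", NULL} and
pass "/dev/null" as out_path. Note that a failure in any queued step
comes back as a single error number from posix_spawnp() or, on some
implementations, as exit status 127 from the child, with no indication
of which step failed. That is the error-reporting concern raised above,
and a per-operation interface would sidestep it.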