MIME-Version: 1.0
References: <20190414201436.19502-1-christian@brauner.io>
 <20190415195911.z7b7miwsj67ha54y@yavin>
 <20190420071406.GA22257@ip-172-31-15-78>
From: Jann Horn
Date: Mon, 29 Apr 2019 15:55:11 -0400
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
To: Kevin Easton, Andy Lutomirski, Christian Brauner
Cc: Aleksa Sarai, "Enrico Weigelt, metux IT consult", Linus Torvalds,
 Al Viro, David Howells, Linux API, LKML, "Serge E. Hallyn",
 Arnd Bergmann, "Eric W. Biederman", Kees Cook, Thomas Gleixner,
 Michael Kerrisk, Andrew Morton, Oleg Nesterov, Joel Fernandes,
 Daniel Colascione
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 29, 2019 at 3:30 PM Jann Horn wrote:
> On Sat, Apr 20, 2019 at 3:14 AM Kevin Easton wrote:
> > On Mon, Apr 15, 2019 at 01:29:23PM -0700, Andy Lutomirski wrote:
> > > On Mon, Apr 15, 2019 at 12:59 PM Aleksa Sarai wrote:
> > > >
> > > > On 2019-04-15, Enrico Weigelt, metux IT consult wrote:
> > > > > > This patchset makes it possible to retrieve pid file descriptors at
> > > > > > process creation time by introducing the new flag CLONE_PIDFD to the
> > > > > > clone() system call as previously discussed.
> > > > >
> > > > > Sorry for hijacking this thread, but I'm curious about what things to
> > > > > consider when introducing new CLONE_* flags.
> > > > >
> > > > > The reason I'm asking is:
> > > > >
> > > > > I'm working on implementing plan9-like fs namespaces, where unprivileged
> > > > > processes can change their own namespace at will. For that, certain
> > > > > traditional unix'ish things have to be disabled, most notably suid.
> > > > > As forbidding suid can be helpful in other scenarios, too, I thought
> > > > > about making this its own feature. Doing that switch on clone() seems
> > > > > a nice place for that, IMHO.
> > > >
> > > > Just spit-balling -- is no_new_privs not sufficient for this use case?
> > > > Not granting privileges such as setuid during execve(2) is the main
> > > > point of that flag.
> > > >
> > >
> > > I would personally *love* it if distros started setting no_new_privs
> > > for basically all processes. And pidfd actually gets us part of the
> > > way toward a straightforward way to make sudo and su still work in a
> > > no_new_privs world: su could call into a daemon that would spawn the
> > > privileged task, and su would get a (read-only!) pidfd back and then
> > > wait for the fd and exit. I suppose that, done naively, this might
> > > cause some odd effects with respect to tty handling, but I bet it's
> > > solvable. I suppose it would be nifty if there were a way for a
> > > process, by mutual agreement, to reparent itself to an unrelated
> > > process.
> > >
> > > Anyway, clone(2) is an enormous mess. Surely the right solution here
> > > is to have a whole new process creation API that takes a big,
> > > extensible struct as an argument, and supports *at least* the full
> > > abilities of posix_spawn() and ideally covers all the use cases for
> > > fork() + do stuff + exec().
> > > It would be nifty if this API also had a
> > > way to say "add no_new_privs and therefore enable extra functionality
> > > that doesn't work without no_new_privs". This functionality would
> > > include things like returning a future extra-privileged pidfd that
> > > gives ptrace-like access.
> > >
> > > As basic examples, the improved process creation API should take a
> > > list of dup2() operations to perform, fds to remove the O_CLOEXEC flag
> > > from, fds to close (or, maybe even better, a list of fds to *not*
> > > close), a list of rlimit changes to make, a list of signal changes to
> > > make, the ability to set sid, pgrp, uid, gid (as in
> > > setresuid/setresgid), the ability to do capset() operations, etc. The
> > > posix_spawn() API, for all that it's rather complicated, covers a
> > > bunch of the basics pretty well.
> >
> > The idea of a system call that takes an infinitely-extendable laundry
> > list of operations to perform in kernel space seems quite inelegant, if
> > only for the error-reporting reason.
> >
> > Instead, I suggest that what you'd want is a way to create a new
> > embryonic process that has no address space and isn't yet schedulable.
> > You then just need other-process-directed variants of all the normal
> > setup functions - so pr_openat(pidfd, dirfd, pathname, flags, mode),
> > pr_sigaction(pidfd, signum, act, oldact), pr_dup2(pidfd, oldfd, newfd)
> > etc.
> >
> > Then when it's all set up you pr_execve() to kick it off.
>
> Is this really necessary? I agree that fork()+exec() is suboptimal,
> but if you just want to avoid the cost of duplicating the address
> space, you can AFAICS already do that in userspace with
> clone(CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD). Then
> the parent can block on a futex until the child leaves the mm_struct
> through execve() (or by exiting, in the case of an error), and the
> child can temporarily have its stack at the bottom of the caller's
> stack.
> You could build an API like this around it in userspace:
>
> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
> )
>
> and then you'd use it like this to fork off a child process:
>
> int spawn_shell_subprocess_(void *arg) {
>   char *cmdline = arg;
>   /* the execl() argument list must end with a null pointer */
>   execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
>   return -1;
> }
> pid_t spawn_shell_subprocess(char *cmdline) {
>   pid_t child_pid;
>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>                             &child_pid, [...]);
>   if (res == 0) return child_pid;
>   return res;
> }
>
> clone_temporary() could be implemented roughly as follows by the libc
> (or other userspace code):
>
> sigset_t sigset, sigset_old;
> sigfillset(&sigset);
> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
> int child_pid;
> int result = 0;
> /* starting here, use inline assembly to ensure that no stack
>    allocations occur */
> long child = syscall(__NR_clone,
>     CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
>     ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
> if (child == -1) { result = -1; goto reset_sigmask; }
> if (child == 0) {
>   result = fn(arg);
>   syscall(__NR_exit, 0);
> }
> futex(&child_pid, FUTEX_WAIT, child, NULL);
> /* end of no-stack-allocations zone */
> reset_sigmask:
> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
> return result;

... I guess that already has a name, and it's called vfork().
(Well, except that the Linux vfork() isn't a real vfork().)

So I guess my question is: Why not vfork()?

And if vfork() alone isn't flexible enough, alternatively: How about
an API that forks a new child in the same address space, and then
allows the parent to invoke arbitrary syscalls in the context of the
child? You could also build that in userspace if you wanted, I think -
just let the child run an assembly loop that reads registers from a
unix seqpacket socket, invokes the syscall instruction, and writes the
value of the result register back into the seqpacket socket.
As long as you use CLONE_VM, you don't have to worry about moving the
pointer targets of syscalls. The user-visible API could look like this:

// flags added by the implementation: CLONE_VM|CLONE_CHILD_SETTID
puppet_handle = fork_puppet(CLONE_NEWUSER|SIGCHLD);
int uid_map_fd = puppet_syscall(SYS_open, "/proc/self/uid_map");
char uid_map_buf[1000];
puppet_syscall(SYS_write, uid_map_fd, uid_map_buf, strlen(uid_map_buf));
puppet_syscall(SYS_close, uid_map_fd);
// waits for the child to either exit or switch to new mm via
// CLONE_CHILD_CLEARTID
puppet_finish_execve(path, argv, envv);
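
To make the request/response loop a bit more concrete, here is a rough
userspace sketch of just the seqpacket part. It is not the API proposed
above and it makes several simplifying assumptions: it uses plain fork()
instead of CLONE_VM (so pointer arguments would refer to the child's own
copy of memory), it goes through the libc syscall() wrapper instead of
an assembly loop (so failures come back as -1 rather than a raw -errno),
and struct puppet, struct puppet_req, fork_puppet(), puppet_syscall()
and puppet_finish() are hypothetical names with made-up signatures.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical wire format: one syscall request per seqpacket message. */
struct puppet_req {
    long nr;       /* syscall number */
    long args[6];  /* raw syscall arguments */
};

/* Hypothetical handle; none of these names come from an existing API. */
struct puppet {
    pid_t pid;     /* pid of the puppet child */
    int sock;      /* parent's end of the socketpair */
};

/* Child side: read a request, run the syscall, send the result back. */
static void puppet_loop(int sock)
{
    struct puppet_req req;

    while (recv(sock, &req, sizeof(req), 0) == (ssize_t)sizeof(req)) {
        long res = syscall(req.nr, req.args[0], req.args[1], req.args[2],
                           req.args[3], req.args[4], req.args[5]);
        if (send(sock, &res, sizeof(res), MSG_NOSIGNAL) != (ssize_t)sizeof(res))
            break;
    }
    _exit(0);  /* parent closed its end, or the socket failed */
}

/* Create the puppet. Returns 0 on success, -1 on failure. */
int fork_puppet(struct puppet *p)
{
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv) == -1)
        return -1;
    p->pid = fork();  /* assumption: plain fork(), not CLONE_VM */
    if (p->pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (p->pid == 0) {
        close(sv[0]);
        puppet_loop(sv[1]);  /* never returns */
    }
    close(sv[1]);
    p->sock = sv[0];
    return 0;
}

/* Run one syscall (up to three arguments) in the puppet.
 * Returns the syscall() result, or -1 on a transport error. */
long puppet_syscall(struct puppet *p, long nr, long a0, long a1, long a2)
{
    struct puppet_req req = { .nr = nr, .args = { a0, a1, a2, 0, 0, 0 } };
    long res;

    if (send(p->sock, &req, sizeof(req), MSG_NOSIGNAL) != (ssize_t)sizeof(req))
        return -1;
    if (recv(p->sock, &res, sizeof(res), 0) != (ssize_t)sizeof(res))
        return -1;
    return res;
}

/* Close the socket so the puppet's loop ends, then reap the child.
 * Returns the child's exit status, or -1 on error. */
int puppet_finish(struct puppet *p)
{
    int status;

    close(p->sock);
    if (waitpid(p->pid, &status, 0) != p->pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, puppet_syscall(&p, SYS_getpid, 0, 0, 0) should return the
puppet's pid rather than the caller's. A version that actually shares the
address space would need the clone() flags and stack handling discussed
earlier in the thread.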