From: Florian Weimer
To: Jann Horn
Cc: Kevin Easton, Andy Lutomirski, Christian Brauner, Aleksa Sarai,
 "Enrico Weigelt\, metux IT consult", Linus Torvalds, Al Viro,
 David Howells, Linux API, LKML, "Serge E. Hallyn", Arnd Bergmann, "Eric W.
 Biederman", Kees Cook, Thomas Gleixner, Michael Kerrisk, Andrew Morton,
 Oleg Nesterov, Joel Fernandes, Daniel Colascione
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
References: <20190414201436.19502-1-christian@brauner.io>
 <20190415195911.z7b7miwsj67ha54y@yavin>
 <20190420071406.GA22257@ip-172-31-15-78>
Date: Mon, 29 Apr 2019 22:49:55 +0200
In-Reply-To: (Jann Horn's message of "Mon, 29 Apr 2019 15:55:11 -0400")
Message-ID: <87v9ywbkp8.fsf@oldenburg2.str.redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Jann Horn:

>> int clone_temporary(int (*fn)(void *arg), void *arg, pid_t *child_pid,
>> )
>>
>> and then you'd use it like this to fork off a child process:
>>
>> int spawn_shell_subprocess_(void *arg) {
>>   char *cmdline = arg;
>>   execl("/bin/sh", "sh", "-c", cmdline);
>>   return -1;
>> }
>> pid_t spawn_shell_subprocess(char *cmdline) {
>>   pid_t child_pid;
>>   int res = clone_temporary(spawn_shell_subprocess_, cmdline,
>>                             &child_pid, [...]);
>>   if (res == 0) return child_pid;
>>   return res;
>> }
>>
>> clone_temporary() could be implemented roughly as follows by the libc
>> (or other userspace code):
>>
>> sigset_t sigset, sigset_old;
>> sigfillset(&sigset);
>> sigprocmask(SIG_SETMASK, &sigset, &sigset_old);
>> int child_pid;
>> int result = 0;
>> /* starting here, use inline assembly to ensure that no stack
>>    allocations occur */
>> long child = syscall(__NR_clone,
>>   CLONE_VM|CLONE_CHILD_SETTID|CLONE_CHILD_CLEARTID|SIGCHLD, $RSP -
>>   ABI_STACK_REDZONE_SIZE, NULL, &child_pid, 0);
>> if (child == -1) { result = -1; goto reset_sigmask; }
>> if (child == 0) {
>>   result = fn(arg);
>>   syscall(__NR_exit, 0);
>> }
>> futex(&child_pid, FUTEX_WAIT, child, NULL);
>> /* end of no-stack-allocations zone */
>> reset_sigmask:
>> sigprocmask(SIG_SETMASK, &sigset_old, NULL);
>> return result;
>
> ... I guess that already has a name, and it's called vfork(). (Well,
> except that the Linux vfork() isn't a real vfork().)
>
> So I guess my question is: Why not vfork()?

Mainly because some users want access to the clone flags, and that's
not possible with the current userspace wrappers. The stack setup for
the undocumented clone wrapper is also cumbersome, and the ia64
peculiarity is annoying.

For the stack sharing, the callback-based interface looks like the
absolutely right thing to do to me. It enforces the notion that you
can safely return on the child path from a function calling vfork.

> And if vfork() alone isn't flexible enough, alternatively: How about
> an API that forks a new child in the same address space, and then
> allows the parent to invoke arbitrary syscalls in the context of the
> child?

As long as it's not an eBPF script …

> You could also build that in userspace if you wanted, I think - just
> let the child run an assembly loop that reads registers from a unix
> seqpacket socket, invokes the syscall instruction, and writes the
> value of the result register back into the seqpacket socket. As long
> as you use CLONE_VM, you don't have to worry about moving the pointer
> targets of syscalls. The user-visible API could look like this:

People already use a variant of this, execve'ing twice. See
jspawnhelper.

Thanks,
Florian
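
For illustration only, the callback-based shape discussed above can be
approximated on top of plain vfork(). This is a minimal sketch, not the
proposed clone_temporary() and not an existing libc interface: the names
spawn_with_callback(), exec_shell() and run_shell() are hypothetical, no
clone flags are exposed, and it relies on the Linux behaviour that the
vfork() child may call ordinary functions on the shared stack (POSIX
itself only permits _exit() and the exec family there).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* ask glibc to declare vfork() */
#endif
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run fn(arg) in a vfork() child.
 * The child never returns from this function; it either execs inside
 * fn or leaves through _exit(), so the shared stack frame stays intact
 * for the parent. Returns the child PID, or -1 with errno set. */
static pid_t spawn_with_callback(int (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    pid_t pid = vfork();
    if (pid == 0) {
        /* Child path: the parent is suspended until exec or _exit. */
        _exit(fn(arg) & 0xff);
    }
    return pid;
}

/* Hypothetical callback: replace the child with "/bin/sh -c cmdline". */
static int exec_shell(void *arg)
{
    const char *cmdline = arg;
    /* The argument list must end with a null pointer. */
    execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
    return 127; /* reached only if execl() failed */
}

/* Hypothetical convenience wrapper: spawn the shell command and wait.
 * Returns the exit status, or -1 on error or abnormal termination. */
static int run_shell(const char *cmdline)
{
    pid_t pid = spawn_with_callback(exec_shell, (void *)cmdline);
    if (pid < 0)
        return -1;
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would use run_shell("some command") and inspect the returned
exit status. Because the child leaves spawn_with_callback() only through
exec or _exit(), the caller cannot accidentally return on the child path,
which is the property the callback interface is meant to provide.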