From: Florian Weimer
To: Linus Torvalds
Cc: Jann Horn, Kevin Easton, Andy Lutomirski, Christian Brauner,
    Aleksa Sarai, "Enrico Weigelt\, metux IT consult", Al Viro,
    David Howells, Linux API, LKML, "Serge E. Hallyn", Arnd Bergmann,
    "Eric W. Biederman", Kees Cook, Thomas Gleixner, Michael Kerrisk,
    Andrew Morton, Oleg Nesterov, Joel Fernandes, Daniel Colascione
Subject: Re: RFC: on adding new CLONE_* flags [WAS Re: [PATCH 0/4] clone: add CLONE_PIDFD]
Date: Tue, 30 Apr 2019 19:07:19 +0200
Message-ID: <871s1j777c.fsf@oldenburg2.str.redhat.com>
References: <20190414201436.19502-1-christian@brauner.io>
    <20190415195911.z7b7miwsj67ha54y@yavin>
    <20190420071406.GA22257@ip-172-31-15-78>
    <87r29jaoov.fsf@oldenburg2.str.redhat.com>
In-Reply-To: (Linus Torvalds's message of "Tue, 30 Apr 2019 09:26:57 -0700")
X-Mailing-List: linux-kernel@vger.kernel.org

* Linus Torvalds:

> On Tue, Apr 30, 2019 at 9:19 AM Linus Torvalds wrote:
>>
>> Of course, if you *don't* need the exact vfork() semantics, clone
>> itself actually very much supports a callback model with s separate
>> stack. You can basically do this:
>>
>>  - allocate new stack for the child
>>  - in trivial asm wrapper, do:
>>    - push the callback address on the child stack
>>    - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>>    - "ret"
>>  - free new stack
>>
>> where the "ret" in the child will just go to the callback, while the
>> parent (eventually) just returns from the trivial wrapper and frees
>> the new stack (which by definition is no longer used, since the child
>> has exited or execve'd.
>
> In fact, Florian, maybe this is the solution to your "I want to use
> vfork for posix_spawn(), but I don't know if I can trust it" problem.
>
> Just use clone() directly.
> On WSL it will presumably just fail, and
> you can then fall back on doing the slow stupid
> fork+pipes-to-communicate.

We already use clone. I don't know why. We should add a comment that
provides the reason.

> On valgrind, I don't know what will happen. Maybe it will just do an
> unchecked posix_spawn() because valgrind doesn't catch it?

I think what happens with these emulators is that they use fork (no
shared address space) but suspend the parent thread. clone with
CLONE_VFORK definitely does not fail. That mostly works, except that
you do not get back the error code from the execve. Instead, the
process is considered launched, and the caller collects the exit status
from the _exit after the failed execve.

> Of course, if you *don't* need the exact vfork() semantics, clone
> itself actually very much supports a callback model with s separate
> stack. You can basically do this:
>
>  - allocate new stack for the child
>  - in trivial asm wrapper, do:
>    - push the callback address on the child stack
>    - clone(CLONE_VFORK|CLONE_VM|CLONE_SIGCHLD, chld_stack, NULL, NULL,NULL)
>    - "ret"
>  - free new stack
>
> where the "ret" in the child will just go to the callback, while the
> parent (eventually) just returns from the trivial wrapper and frees
> the new stack (which by definition is no longer used, since the child
> has exited or execve'd.
>
> So you can most definitely create a "vfork_with_child_callback()" with
> clone, and it would arguably be a much superior interface to vfork()
> anyway (maybe you'd like to pass in some arguments to the callback too
> - add more stack setup for the child as needed), but it wouldn't be
> the right solution for programs that just want to use the standard BSD
> vfork() model.

As far as we understand the situation, we absolutely must block all
signals for both the parent thread and the new subprocess. Signals can
be unblocked in the subprocess, but only after setting their handlers
to SIG_DFL or SIG_IGN. (Parent signal handlers cannot run in the
subprocess because application-supplied signal handlers generally do
not expect to run with a corrupt TCB or a different PID.)

At that point, I wonder if we can just skip the stack setup for the
CLONE_VFORK case and reuse the existing stack.

Thanks,
Florian
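
For illustration only, here is a minimal sketch of the signal handling
described above, written against the glibc clone() wrapper, which
already runs a callback on a separate stack. The names spawn_args,
spawn_child and spawn_vfork_like are made up for this example. This is
not the glibc posix_spawn code, and it still allocates a child stack
rather than reusing the parent's.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical argument block; the child sees it through CLONE_VM. */
struct spawn_args {
    const char *path;
    char *const *argv;
    char *const *envp;
    sigset_t oldmask;  /* caller's signal mask, restored in the child */
    int err;           /* errno of a failed execve, 0 otherwise */
};

/* Runs in the child, on its own stack, with all signals still blocked. */
static int spawn_child(void *p)
{
    struct spawn_args *a = p;
    struct sigaction dfl, cur;

    dfl.sa_handler = SIG_DFL;
    dfl.sa_flags = 0;
    sigemptyset(&dfl.sa_mask);

    /* Reset every handler that is not SIG_IGN before unblocking anything.
       The handler table is a copy (no CLONE_SIGHAND), so the parent is
       unaffected.  Calls for SIGKILL/SIGSTOP fail and are ignored. */
    for (int sig = 1; sig < NSIG; sig++)
        if (sigaction(sig, NULL, &cur) == 0 && cur.sa_handler != SIG_IGN)
            sigaction(sig, &dfl, NULL);

    sigprocmask(SIG_SETMASK, &a->oldmask, NULL);
    execve(a->path, a->argv, a->envp);
    a->err = errno;  /* visible to the parent: the address space is shared */
    _exit(127);
}

/* Returns the child PID, or -1 with the error code stored in *err_out. */
pid_t spawn_vfork_like(const char *path, char *const argv[],
                       char *const envp[], int *err_out)
{
    enum { STACK_SIZE = 64 * 1024 };  /* assumed to be enough for the child */
    struct spawn_args a = { .path = path, .argv = argv, .envp = envp, .err = 0 };
    sigset_t all;

    assert(path != NULL && argv != NULL && err_out != NULL);
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        *err_out = ENOMEM;
        return -1;
    }

    /* Block everything in the parent thread; the child inherits the mask. */
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &a.oldmask);

    /* Pass the top of the stack (it grows down on common architectures).
       CLONE_VFORK suspends us until the child has exec'd or exited. */
    pid_t pid = clone(spawn_child, stack + STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, &a);
    int clone_errno = errno;

    sigprocmask(SIG_SETMASK, &a.oldmask, NULL);
    free(stack);  /* no longer in use once clone() has returned */

    if (pid < 0) {
        *err_out = clone_errno;
        return -1;
    }
    if (a.err != 0) {
        waitpid(pid, NULL, 0);  /* reap the child that did _exit(127) */
        *err_out = a.err;
        return -1;
    }
    *err_out = 0;
    return pid;
}
```

Because of CLONE_VM, a failed execve can report its errno through the
shared struct. With the fork-based emulation mentioned above, that
write would not reach the parent, which matches the "considered
launched" behaviour.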