Date: Wed, 6 Apr 2022 10:46:13 +0200
From: Christian Brauner
To: Alejandro Colomar
Cc: "linux-kernel@vger.kernel.org", Коренберг Марк, Andrei Vagin,
 Dmitry Safonov, Thomas Gleixner, Arnd Bergmann, Serge Hallyn,
 bugzilla-daemon@kernel.org
Subject: Re: vfork(2) behavior not consistent with fork(2) (was: vfork(2)
 fails after unshare(CLONE_NEWTIME) (was: [Bug 215769] man 2 vfork() does
 not document corner case when PID == 1))
Message-ID: <20220406084613.3srklyt27qxcmrcx@wittgenstein>
References: <4fb02f5f-60f9-42af-ddd5-fe5af877231f@gmail.com>
 <20220404080519.pi6izyuop3mmdg2g@wittgenstein>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 05, 2022 at 09:28:12PM +0200, Alejandro Colomar wrote:
> Hey, Christian!
>
> On 4/4/22 10:05, Christian Brauner wrote:
> > On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > > [Added some kernel CCs that may know what's going on]
> [...]
> > > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > > vfork(2) manual pages that explains the reason (if it's intended).
> >
> > Hey Alejandro,
> >
> > I won't be able to send a patch very soon but I can at least explain why
> > you see EINVAL. :)
>
> Don't hurry, we're not planning to release anytime soon :)
>
> >
> > This is intended.
> >
> > vfork() suspends the parent process and the child process will share the
> > same vm as the parent process. If the child process is in a new time
> > namespace different from its parent process it is not allowed to be in
> > the same threadgroup or share virtual memory with the parent process.
> > That's why you see EINVAL.
>
> That makes a lot of sense to me.
>
> >
> > Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> > process to be moved into a different time namespace. Only the newly
> > created child process will be after a subsequent
> > fork()/vfork()/clone()/clone3()...
> >
> > The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> > can see this via /proc/<pid>/ns/ where you see two entries for pid
> > namespaces and also two entries for time namespaces:
> >
> > * CLONE_NEWTIME
> >   * /proc/<pid>/ns/time // current time namespace
> >   * /proc/<pid>/ns/time_for_children // time namespace for the new child process
>
> Also makes sense. Michael taught me that a few weeks ago :)
>
> This also triggers some doubt: will the same problem happen with
> CLONE_NEWPID since it also moves the child into a new ns (in this case a PID
> one)? See test program below.

No, it won't. A pid namespace places no relevant constraints on vm usage
whereas a time namespace does. If a task joins a new time namespace
it'll clean the VVAR page tables and refault them with the new layout
after the timens change. That affects all tasks which use the same
task->mm. Since CLONE_THREAD implies CLONE_VM this would affect the
whole thread-group behind their back. All threads would suddenly change
timens. No such issues exist for pid namespaces; they don't need to
alter task->mm.

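To make the constraint concrete, here is a small userspace model of the
rule quoted just below. It is only a sketch, not kernel source: struct
timens_refs, VFORK_FLAGS and timens_clone_check() are made-up names for
illustration, and the integers merely stand in for namespace identities.
The one thing taken as given is that vfork(2) behaves like clone(2) with
CLONE_VFORK | CLONE_VM and SIGCHLD as the exit signal.

```c
#include <errno.h>
#include <signal.h>         /* SIGCHLD */
#include <linux/sched.h>    /* CLONE_VM, CLONE_THREAD, CLONE_VFORK */

/* Hypothetical stand-in for the two time namespace references of a task. */
struct timens_refs {
	unsigned long time;               /* namespace the task is in now */
	unsigned long time_for_children;  /* namespace its next child joins */
};

/* vfork(2) expressed as clone(2) flags (exit signal folded in). */
#define VFORK_FLAGS (CLONE_VFORK | CLONE_VM | SIGCHLD)

/*
 * Model of the check: a child that would land in a different time
 * namespace must not share the mm or the thread group of its parent.
 * Returns 0 if the clone may proceed, -EINVAL otherwise.
 */
int timens_clone_check(const struct timens_refs *t, unsigned long flags)
{
	if ((flags & (CLONE_VM | CLONE_THREAD)) &&
	    t->time != t->time_for_children)
		return -EINVAL;
	return 0;
}
```

With time != time_for_children (the state after unshare(CLONE_NEWTIME)),
the model rejects VFORK_FLAGS but accepts a plain fork()-style call with
only SIGCHLD, which matches what was reported in this thread.
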
> >
> > If during fork:
> >
> > parent_process->time != parent_process->time_for_children
> >
> > and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> >
> > You can thus replicate the same error via:
> >
> > unshare(CLONE_NEWTIME)
> >
> > and a
> >
> > clone() or clone3() call with CLONE_VM or CLONE_THREAD.
>
> So, to test my doubts, I wrote this similar program (and also similar
> programs where only the CLONE_NEW* flag was changed, one with CLONE_NEWTIME,
> and one with CLONE_NEWNS):
>
> $ cat vfork_newpid.c
> #define _GNU_SOURCE
> #include <err.h>
> #include <errno.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> static char *const child_argv[] = {
>         "print_pid",
>         NULL
> };
>
> static char *const child_envp[] = {
>         NULL
> };
>
> int
> main(void)
> {
>         pid_t pid;
>
>         printf("%s: PID: %ld\n", program_invocation_short_name,
>                (long) getpid());
>
>         if (unshare(CLONE_NEWPID) == -1)
>                 err(EXIT_FAILURE, "unshare(2)");
>         if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
>                 err(EXIT_FAILURE, "signal(2)");
>
>         pid = syscall(SYS_vfork);
>         //pid = vfork(); // This behaves differently.
>         switch (pid) {
>         case 0:
>                 execve("/home/alx/tmp/print_pid", child_argv, child_envp);
>                 err(EXIT_SUCCESS, "PID %jd exiting after execve(2)",
>                     (long) getpid());
>         case -1:
>                 err(EXIT_FAILURE, "vfork(2)");
>         default:
>                 errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
>         }
> }
>
> $ cat print_pid.c
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int
> main(void)
> {
>         errx(EXIT_SUCCESS, "PID %jd exiting.", (long) getpid());
> }
>
> $ cc -Wall -Wextra -Werror -o print_pid print_pid.c
> $ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
> $
> $
> $ sudo ./vfork_newpid
> vfork_newpid: PID: 8479
> vfork_newpid: PID 8479 exiting after execve(2): Success
> print_pid: PID 1 exiting.
> $
> $
> $ sudo ./vfork_newtime
> vfork_newtime: PID: 8484
> vfork_newtime: vfork(2): Invalid argument
> $
> $
> $ sudo ./vfork_newns
> vfork_newns: PID: 8486
> vfork_newns: PID 8486 exiting after execve(2): Success
> print_pid: PID 8487 exiting.
>
>
> The first thing I noted is that usage of vfork(2) differs considerably from
> fork(2), and that's something that's not clear by reading the manual page.
> It says that the parent process is suspended until the child calls
> execve(2), but I expected it to mean that vfork(2) doesn't return to the
> parent until that happened, but was otherwise transparent. I was wrong and
> my tests showed me that.
>
> I was going to propose an example program for the manual page, when I
> decided to try a slightly different thing: call vfork() instead of
> syscall(SYS_vfork); that changed the behavior to the same one as with
> fork(2) (i.e., the parent resumes after vfork(2) returns the PID of the
> child).
>
> Is that also intended? I couldn't find the glibc wrapper source code, so I
> don't know what is glibc doing here, but I straced the processes, and
> they're all calling vfork(), so the behavior should be consistent; it's
> quite weird. I'm very confused at this point.

glibc does vfork() via inline assembly massaging. There are probably
atfork handlers and a bunch of other stuff involved, so it's difficult to
do a remote diagnosis. (And note that calling anything other than
execve() or _exit() after vfork() is basically undefined behavior.)

>
>
> I'm also wondering why it's okay to have processes in different PID ns share
> the same vm, but I guess that's implementation details that I don't need to
> care that much.

See earlier in the thread.
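
For reference, a minimal sketch of a vfork() child that sticks to the
execve()/_exit() rule mentioned above. The helper name run_with_vfork()
and the 127 exit code for a failed exec are choices made for this
illustration only; it goes through the glibc vfork() wrapper rather than
syscall(SYS_vfork), and it does not touch any namespace.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run `path` with `argv` and an empty environment.
 * Returns the child's exit status, 127 if execve() failed in the child,
 * or -1 if vfork()/waitpid() failed or the child did not exit normally.
 */
int run_with_vfork(const char *path, char *const argv[])
{
	static char *const envp[] = { NULL };
	int status;
	pid_t pid;

	pid = vfork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child shares the parent's memory: only execve() or _exit(). */
		execve(path, argv, envp);
		_exit(127);
	}

	/* The parent resumes only once the child has exec'd or exited. */
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Something like run_with_vfork("/bin/sh", (char *const []){ "sh", "-c",
"exit 3", NULL }) should return 3. Note that the child writes nothing
and calls no stdio or err(3) function before execve().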