From: "Eric W. Biederman"
To: Yunhui Cui
Cc: akpm@linux-foundation.org, keescook@chromium.org, brauner@kernel.org,
    jeffxu@google.com, frederic@kernel.org, mcgrof@kernel.org,
    cyphar@cyphar.com, rongtao@cestc.cn, linux-kernel@vger.kernel.org,
    Linux Containers
References: <20231011065446.53034-1-cuiyunhui@bytedance.com>
Date: Wed, 11 Oct 2023 22:30:24 -0500
In-Reply-To: <20231011065446.53034-1-cuiyunhui@bytedance.com> (Yunhui Cui's message of "Wed, 11 Oct 2023 14:54:46 +0800")
Message-ID: <87sf6gcyb3.fsf@email.froward.int.ebiederm.org>
Subject: Re: [PATCH] pid_ns: support pidns switching between sibling

Yunhui Cui writes:

> In the scenario of container acceleration,

What is container acceleration?

Are you perhaps performing what is essentially checkpoint/restart from
one set of processes to a new set of processes so you can get a
container starting faster?

> when a target pstree is cloned from a temp pstree, we hope that the
> cloned process is inherently in the target's pid namespace.

I am having a hard time figuring out what you are saying here.

> Examples of what we expected:
>
> /* switch to target ns first. */
> setns(target_ns, CLONE_NEWPID);
  ^-------- Is this the line that fails for you?
> if(!fork()) {
> 	/* Child */
> 	...
> }
> /* switch back */
> setns(temp_ns, CLONE_NEWPID);

Assuming that the "switch back" means returning to your
task_active_pid_ns, that should always work.

If I had to guess, I think what you are missing is that entire pid
namespaces can be inside other pid namespaces.  So there is no reason to
believe that any random pid namespace that happens to pass the
CAP_SYS_ADMIN permission check is also in your process's
task_active_pid_ns.

> However, it is limited by the existing implementation, CAP_SYS_ADMIN
> has been checked in pidns_install(), so remove the limitation that only
> by traversing parent can switch pidns.

The check you are deleting is what verifies that the pid namespace you
are attempting to change pid_ns_for_children to is a member of the
task's current pid namespace (aka task_active_pid_ns).

There is a perfectly good comment describing why what you are attempting
to do is unsupportable.

	/*
	 * Only allow entering the current active pid namespace
	 * or a child of the current active pid namespace.
	 *
	 * This is required for fork to return a usable pid value and
	 * this maintains the property that processes and their
	 * children can not escape their current pid namespace.
	 */

If you pick a pid namespace that does not meet the restrictions you are
removing, the pid of the new child can not be mapped into the pid
namespace of the parent that called setns.

AKA the following code can not work.

	pid = fork();
	if (!pid) {
		/* child */
		do_something();
		_exit(0);
	}
	waitpid(pid, NULL, 0);

So no.  The suggested change to pidns_install makes no sense at all.

Not being able to escape your current pid namespace is also an
important invariant when reasoning about pid namespaces.

It would have to be a strong, well thought out case for me to agree it
makes sense to abandon the invariant that a process can not escape its
pid namespace.

> Signed-off-by: Yunhui Cui
> ---
>  kernel/pid_namespace.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 3028b2218aa4..774db1f268f1 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -389,7 +389,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
>  {
>  	struct nsproxy *nsproxy = nsset->nsproxy;
>  	struct pid_namespace *active = task_active_pid_ns(current);
> -	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
> +	struct pid_namespace *new = to_pid_ns(ns);
>
>  	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
>  	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
> @@ -406,12 +406,6 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
>  	if (new->level < active->level)
>  		return -EINVAL;
>
> -	ancestor = new;
> -	while (ancestor->level > active->level)
> -		ancestor = ancestor->parent;
> -	if (ancestor != active)
> -		return -EINVAL;
> -
>  	put_pid_ns(nsproxy->pid_ns_for_children);
>  	nsproxy->pid_ns_for_children = get_pid_ns(new);
>  	return 0;

Eric
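
To make the ancestry rule concrete, it can be modeled outside the kernel.
What follows is only a minimal user-space sketch, not kernel code:
struct model_pid_ns and model_pidns_may_install() are hypothetical names
invented for illustration, and the struct keeps only the level and parent
fields that the quoted check reads. Under those assumptions it reproduces
the level test and the ancestor walk from the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-in for struct pid_namespace: only the
 * two fields that the ancestry check reads are modeled.
 */
struct model_pid_ns {
	unsigned int level;           /* nesting depth; initial namespace is 0 */
	struct model_pid_ns *parent;  /* NULL for the initial namespace */
};

/*
 * Mirrors the level test and the ancestor walk quoted in the diff.
 * Returns 0 if `new` is `active` itself or a descendant of `active`,
 * and -EINVAL otherwise (for example, for a sibling of `active`).
 */
static int model_pidns_may_install(const struct model_pid_ns *active,
				   const struct model_pid_ns *new)
{
	const struct model_pid_ns *ancestor;

	assert(active != NULL && new != NULL);

	/* A namespace above the active one can never be a descendant. */
	if (new->level < active->level)
		return -EINVAL;

	/* Climb from `new` until we reach the depth of `active`. */
	ancestor = new;
	while (ancestor->level > active->level)
		ancestor = ancestor->parent;

	/* At equal depth, only `active` itself is acceptable. */
	if (ancestor != active)
		return -EINVAL;

	return 0;
}
```

Given an initial namespace with two children a and b, plus a child of a,
a task active in a passes the check for a and for a's child, but gets
-EINVAL for the sibling b and for the initial namespace. The sibling case
is the one the patch would start allowing.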