Received: by 2002:a05:6902:102b:0:0:0:0 with SMTP id x11csp482843ybt; Fri, 10 Jul 2020 05:05:01 -0700 (PDT) X-Google-Smtp-Source: ABdhPJz3FG7ofTmttJ+Or8J5QnBUttK6MfAFfEoqr6br1/3mTRPoBsWdN2TWSlceToRvhAZimHy5 X-Received: by 2002:aa7:cd52:: with SMTP id v18mr73564638edw.196.1594382701448; Fri, 10 Jul 2020 05:05:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1594382701; cv=none; d=google.com; s=arc-20160816; b=oY1M51cXzGCpf0bDLo2dvZvFaDGEPU7JGzW4mRDj/ryL5Ye2wGQwYUtrMi58BItriF IK2i1Mp5M5rdc9sx8cdYEd2ZykS7kruQF9hSEQQagnnfU01gJIs9oNnfQexYGpUgBr3Q tVuuCHynE4AqsNjiFPn51iUkrnuE/k/WHtvdDC4BpdBhRv/Ox4tLwJHNaMzk0WiguWpg DnNqFUU5h8N6dBk8vBlYLCZ79WVI0kM8uF8f7/YRn8mlGcEX1kjKBFRsEyB6UjIQcAky QzwFvBtqQE2C01AG9cwwFBs9c8ai2L/bhwTEtEFZoNE3vFjL0xvsogGugPa3MJPsAL50 2a6w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=xs6zJmeRsqnjL/YjjUWiuqvp8apN2JkCAJWbX8FpzD4=; b=diqYHMHRHFNAcSaEISqbXepUuxbMiHxpldLbQCVeWsZS2Mb5ztfypgvnZtCpeTdHiz yHpWWVBt7P7ZKbh4FBCxGVuGGiO8zx87gOrYSwUReNk/Y+Vx91hn8BzT4OBbO82/zkJ2 X5fW/LiUFRl5DXxEpi5kZ/ptkm3Zd0o6ypKQcFb2qf1aAzOtvkrMVsH/7BgxutnlXdiC gi6exjvs0mMgLW7kX/ESEChoomBQ/pW3b4WocUd5gs6tiQp51JvLFp97zhlkcoREp7/K 382o8/xCuv7MuXwgu/MaSzlapDYZR4B1GDDfjpm1uaa5jkDPzArzrdioKvhp0/nVlH7V o03A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id i12si3916512edq.340.2020.07.10.05.04.36; Fri, 10 Jul 2020 05:05:01 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727019AbgGJMBy (ORCPT + 99 others); Fri, 10 Jul 2020 08:01:54 -0400 Received: from youngberry.canonical.com ([91.189.89.112]:45479 "EHLO youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726872AbgGJMBy (ORCPT ); Fri, 10 Jul 2020 08:01:54 -0400 Received: from ip5f5af08c.dynamic.kabel-deutschland.de ([95.90.240.140] helo=wittgenstein) by youngberry.canonical.com with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1jtriy-0000Fg-3q; Fri, 10 Jul 2020 12:01:52 +0000 Date: Fri, 10 Jul 2020 14:01:51 +0200 From: Christian Brauner To: Qian Cai Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Wolfgang Bumiller , Serge Hallyn , Michael Kerrisk , Alexander Viro Subject: Re: [PATCH] nsfs: add NS_GET_INIT_PID ioctl Message-ID: <20200710120151.yeoonuttjc4wivaf@wittgenstein> References: <20200618084543.326605-1-christian.brauner@ubuntu.com> <20200710115836.GA1027@lca.pw> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20200710115836.GA1027@lca.pw> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jul 10, 2020 at 07:58:36AM -0400, Qian Cai wrote: > On Thu, Jun 18, 2020 at 10:45:43AM +0200, Christian Brauner wrote: > > Add an ioctl() to return the PID of the init process/child reaper of a pid > > namespace as seen in the caller's pid namespace. > > > > LXCFS is a tiny fuse filesystem used to virtualize various aspects of > > procfs. It is used actively by a large number of users including ChromeOS > > and cloud providers. LXCFS is run on the host. The files and directories it > > creates can be bind-mounted by e.g. a container at startup and mounted over > > the various procfs files the container wishes to have virtualized. When > > e.g. a read request for uptime is received, LXCFS will receive the pid of > > the reader. In order to virtualize the corresponding read, LXCFS needs to > > know the pid of the init process of the reader's pid namespace. In order to > > do this, LXCFS first needs to fork() two helper processes. The first helper > > process setns() to the readers pid namespace. The second helper process is > > needed to create a process that is a proper member of the pid namespace. > > The second helper process then creates a ucred message with ucred.pid set > > to 1 and sends it back to LXCFS. The kernel will translate the ucred.pid > > field to the corresponding pid number in LXCFS's pid namespace. This way > > LXCFS can learn the init pid number of the reader's pid namespace and can > > go on to virtualize. Since these two forks() are costly LXCFS maintains an > > init pid cache that caches a given pid for a fixed amount of time. The > > cache is pruned during new read requests. However, even with the cache the > > hit of the two forks() is singificant when a very large number of > > containers are running. With this simple patch we add an ns ioctl that > > let's a caller retrieve the init pid nr of a pid namespace through its > > pid namespace fd. This _significantly_ improves our performance with a very > > simple change. A caller should do something like: > > - pid_t init_pid = ioctl(pid_ns_fd, NS_GET_INIT_PID); > > - verify init_pid is still valid (not necessarily both but recommended): > > - opening a pidfd to get a stable reference > > - opening /proc//ns/pid and verifying that > > and the pid namespace fd of refer to the same pid namespace > > > > Note, it is possible for the init process of the pid namespace (identified > > via the child_reaper member in the relevant pid namespace) to die and get > > reaped right after the ioctl returned. If that happens there are two cases > > to consider: > > - if the init process was single threaded, all other processes in the pid > > namespace will be zapped and any new process creation in there will fail; > > A caller can detect this case since either the init pid is still around > > but it is a zombie, or it already has exited and not been recycled, or it > > has exited, been reaped, and also been recycled. The last case is the > > most interesting one but a caller would then be able to detect that the > > recycled process lives in a different pid namespace. > > - if the init process was multi-threaded, then the kernel will try to make > > one of the threads in the same thread-group - if any are still alive - > > the new child_reaper. In this case the caller can detect that the thread > > which exited and used to be the child_reaper is no longer alive. If it's > > tid has been recycled in the same pid namespace a caller can detect this > > by parsing through /proc//stat, looking at the Nspid: field and if > > there's a entry with pid nr 1 in the respective pid namespace it can be > > sure that it hasn't been recycled. > > Both options can be combined with pidfd_open() to make sure that a stable > > reference is maintained. > > > > Cc: Wolfgang Bumiller > > Cc: Serge Hallyn > > Cc: Michael Kerrisk > > Cc: Alexander Viro > > Cc: linux-fsdevel@vger.kernel.org > > Signed-off-by: Christian Brauner > > fs/nsfs.c: In function ‘ns_ioctl’: > fs/nsfs.c:195:14: warning: unused variable ‘pid_struct’ [-Wunused-variable] > struct pid *pid_struct; > ^~~~~~~~~~ > fs/nsfs.c:194:22: warning: unused variable ‘child_reaper’ [-Wunused-variable] > struct task_struct *child_reaper; > ^~~~~~~~~~~~ Thanks. Fyi, this has been reported by Stephen already. Christian