Subject: Re: [PATCH 0/4] pid: add pidctl()
To: Daniel Colascione, Christian Brauner
Cc: Jann Horn, Andy Lutomirski, David Howells, "Serge E. Hallyn", "Eric W. Biederman", Linux API, linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner, Michael Kerrisk-manpages, bl0pbl33p@gmail.com, "Dmitry V. Levin", Andrew Morton, Oleg Nesterov, nagarathnam.muthusamy@oracle.com, Aleksa Sarai, Al Viro, Joel Fernandes
References: <20190325162052.28987-1-christian@brauner.io>
From: Konstantin Khlebnikov
Message-ID: <8075dfac-94d2-b8c5-e37a-afe9b88bb48e@yandex-team.ru>
Date: Mon, 25 Mar 2019 20:05:23 +0300
X-Mailing-List: linux-kernel@vger.kernel.org

On 25.03.2019 19:48, Daniel Colascione wrote:
> On Mon, Mar 25, 2019 at 9:21 AM Christian Brauner wrote:
>> The pidctl() syscall builds on, extends, and improves translate_pid() [4].
>> I quote Konstantin's original patchset first that has already been acked and
>> picked up by Eric before and whose functionality is preserved in this
>> syscall. Multiple people have asked when this patchset will be sent in
>> for merging (cf. [1], [2]). It has recently been revived by Nagarathnam
>> Muthusamy from Oracle [3].
>>
>> The intention of the original translate_pid() syscall was twofold:
>> 1. Provide translation of pids between pid namespaces
>> 2. Provide implicit pid namespace introspection
>>
>> Both functionalities are preserved.
>> The latter task has been improved
>> upon though. In the original version of the patchset, passing pid as 1
>> would allow one to determine the relationship between the pid namespaces.
>> This is inherently racy. If pid 1 inside a pid namespace has died it
>> would report false negatives. For example, if pid 1 inside of the target
>> pid namespace already died, it would report that the target pid
>> namespace cannot be reached from the source pid namespace because it
>> couldn't find the pid inside of the target pid namespace and thus
>> falsely report to the user that the two pid namespaces are not related.
>> This problem is simple to avoid. In the new version we simply walk the
>> list of ancestors and check whether the namespaces are related to each
>> other. By doing it this way we can reliably report what the relationship
>> between two pid namespace file descriptors looks like.
>>
>> Additionally, this syscall has been extended to allow the retrieval of
>> pidfds independent of procfs. These pidfds can e.g. be used with the new
>> pidfd_send_signal() syscall we recently merged. The ability to retrieve
>> pidfds independent of procfs had already been requested in the
>> pidfd_send_signal patchset by e.g. Andrew [4] and later again by Alexey
>> [5]. A use-case where a kernel is compiled without procfs but where
>> pidfds are still useful has been outlined by Andy in [6]. Regular
>> anon-inode based file descriptors are used that stash a reference to
>> struct pid in file->private_data and drop that reference on close.
>>
>> With this, translate_pid() has three closely related but still distinct
>> functionalities. To clarify the semantics and to make it easier for
>> userspace to use the syscall it has:
>> - gained a command argument and three commands clearly reflecting the
>> distinct functionalities (PIDCMD_QUERY_PID, PIDCMD_QUERY_PIDNS,
>> PIDCMD_GET_PIDFD).
>> - been renamed to pidctl()
>
> Having made these changes, you've built a general-purpose
> command multiplexer, not one operation that happens to be flexible.
> The general-purpose command multiplexer is a common antipattern:
> multiplexers make it hard to talk about different kernel-provided
> operations using the common vocabulary we use to distinguish
> kernel-related operations, the system call number. socketcall, for
> example, turned out to be cumbersome for users like SELinux policy
> writers. People had to do work later to split socketcall into
> fine-grained system calls. Please split the pidctl system call so that
> the design is clean from the start and we avoid work later. System
> calls are cheap.
>
> Also, I'm still confused about how metadata access is supposed to work
> for these procfs-less pidfds. If I use PIDCMD_GET_PIDFD on a process,
> You snipped out a portion of a previous email in which I asked about
> your thoughts on this question. With the PIDCMD_GET_PIDFD command in
> place, we have two different kinds of file descriptors for processes,
> one derived from procfs and one that's independent. The former works
> with openat(2). The latter does not. To be very specific: if I'm
> writing a function that accepts a pidfd and I get a pidfd that comes
> from PIDCMD_GET_PIDFD, how am I supposed to get the equivalent of
> smaps or oom_score_adj or statm for the named process in a race-free
> manner?
>

Task metadata could be exposed via "pages" identified by offset:

struct pidfd_stats stats;

pread(pidfd, &stats, sizeof(stats), PIDFD_STATS_OFFSET);

I'm not sure that we need yet another binary procfs.
But it would certainly be faster than the current text-based interface.
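
To make the offset idea a bit more concrete, here is a rough userspace sketch. It is only an illustration, not a proposed ABI: the fields of struct pidfd_stats, the value of PIDFD_STATS_OFFSET, and the helper names fake_pidfd_create() and pidfd_read_stats() are all made up for this example, and an ordinary temporary file stands in for the pidfd, since no kernel exposes such a record today.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record layout; these fields exist in no kernel. */
struct pidfd_stats {
	uint64_t vm_size_pages;
	uint64_t vm_rss_pages;
	int32_t  oom_score_adj;
	uint32_t pad;
};

/* Hypothetical offset selecting the "stats page". */
#define PIDFD_STATS_OFFSET ((off_t)4096)

/*
 * Stand-in for the kernel side: store one stats record at its fixed
 * offset in an unlinked temporary file. Returns an fd, or -1 on error.
 */
int fake_pidfd_create(const struct pidfd_stats *stats)
{
	char path[] = "/tmp/fake-pidfd-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the file lives only as long as the fd */
	if (pwrite(fd, stats, sizeof(*stats), PIDFD_STATS_OFFSET) !=
	    (ssize_t)sizeof(*stats)) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Reader side: a single pread() at the well-known offset.
 * Returns 0 on success, -1 on error or a short read.
 */
int pidfd_read_stats(int fd, struct pidfd_stats *out)
{
	ssize_t n;

	assert(out != NULL);
	n = pread(fd, out, sizeof(*out), PIDFD_STATS_OFFSET);
	return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

A caller would fill a struct pidfd_stats, pass it to fake_pidfd_create(), and read it back with pidfd_read_stats(). The property of interest is that the read takes the fd itself rather than a path, so the reader does not have to look the process up again by pid, and each kind of metadata could get its own offset.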