Date: Sun, 31 Mar 2019 17:05:08 +0200
From: Christian Brauner
To: Linus Torvalds
Cc: Jann Horn, Joel Fernandes, Daniel Colascione, Andrew Lutomirski,
    David Howells, "Serge E. Hallyn", Linux API,
    Linux List Kernel Mailing, Arnd Bergmann, "Eric W. Biederman",
    Konstantin Khlebnikov, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
    Michael Kerrisk-manpages, Jonathan Kowalski, "Dmitry V. Levin",
    Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai,
    Al Viro
Subject: Re: [PATCH v2 0/5] pid: add pidfd_open()
Message-ID: <20190331150507.zpyugdvtmr6rgpda@brauner.io>
References: <20190329155425.26059-1-christian@brauner.io>
    <20190331010716.GA189578@google.com>
    <20190331040810.GB189578@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Mar 31, 2019 at 07:52:28AM -0700, Linus Torvalds wrote:
> On Sat, Mar 30, 2019 at 9:47 PM Jann Horn wrote:
> >
> > Sure, given a pidfd_clone() syscall, as long as the parent of the
> > process is giving you a pidfd for it and you don't have to deal with
> > grandchildren created by fork() calls outside your control, that
> > works.
>
> Don't do pidfd_clone() and pidfd_wait().
>
> Both of those existing system calls already get a "flags" argument.
> Just make a WPIDFD (for waitid) and CLONE_PIDFD (for clone) bit, and
> make the existing system calls just take/return a pidfd.

Yes, that's one of the options I was considering, but I was afraid of
pitching it because of the massive opposition I got regarding
"multiplexers". I'm perfectly happy with doing it this way.

>
> Side note: we could (should?) also make the default maxpid just be
> larger. It needs to fit in an 'int', but MAXINT instead of 65535 would
> likely already make a lot of these attacks harder.

Yes, agreed.

>
> There was some really old legacy reason why we actually limited it to
> 65535 originally. It was old and crufty even back when..

So Jann and I have been thinking about going forward with the following
idea:
With the pidfd_open() patchset I have, pidfds are simple anon inode file
descriptors stashing a reference to the struct pid of a process. I have
mentioned this in prior mails.
This cleanly decouples pidfds from procfs completely.

The reason why we want to use pidfds with no connection to a specific
procfs instance, even in environments that have procfs, is that we would
like to add the API you just mentioned: clone with CLONE_PIDFD, which
creates a new process or thread and returns a pidfd to it. In the
context of such a syscall, it would be awkward to have the kernel open a
file in some procfs instance, since then userspace would have to specify
which procfs instance the fd should come from.

There is an argument to be made that for consistency's sake we should -
although I don't feel strongly about it - disallow the usage of
pidfd_send_signal() with fds referring to /proc/ then. Unless you want
this to always work.

If you want this to work then we would simply submit pidfd_open() for
the 5.2 window. If you agree that it makes sense to only have
pidfd_open() file descriptors working with pidfd_send_signal(), we would
send a revert for pidfd_send_signal() now and resubmit it together with
pidfd_open() during the 5.2 merge window. This decouples pidfds from
procfs completely, not just when procfs is not compiled in or mounted.
I very much care about this being done right. If this means temporarily
kicking pidfd_send_signal() out until 5.2, I'm happy to do so.

Btw, the /proc/ race issue that is mentioned constantly can be avoided
simply by placing the pid that the pidfd has stashed, relative to the
pid namespace of the caller's procfs mount, in the pidfd's fdinfo. So
there's not even a need to really go through /proc/ in the first place.
A caller wanting to get metadata access and avoid a race with pid
recycling can then simply do:

int pidfd = pidfd_open(pid, 0);
int pid = parse_fdinfo("/proc/self/fdinfo/");
int procpidfd = open("/proc/", ...);

/* Test if process still exists by sending signal 0 through our pidfd. */
int ret = pidfd_send_signal(pidfd, 0, NULL, PIDFD_SIGNAL_THREAD);
if (ret < 0 && errno == ESRCH) {
	/* pid has been recycled and procpidfd refers to another process */
}

Christian
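
For illustration, the parse_fdinfo() step in the snippet above could be
sketched roughly as follows. This is only a sketch under assumptions: it
assumes the pidfd's fdinfo exposes the stashed pid on a line of the form
"Pid:<whitespace><number>", and the helper names parse_fdinfo_pid() and
read_fdinfo_pid() are hypothetical, not part of the patchset.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: scan fdinfo-style text for a line starting with
 * "Pid:" and return the number that follows, or -1 if there is no such
 * line or no digits after the field name. The "Pid:" field name is an
 * assumption made for this sketch.
 */
static long parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, "Pid:", 4) == 0) {
			const char *p = line + 4;

			/* Skip the blanks or tabs separating name and value. */
			while (*p == ' ' || *p == '\t')
				p++;
			if (*p < '0' || *p > '9')
				return -1;
			return strtol(p, NULL, 10);
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Hypothetical wrapper: read /proc/self/fdinfo/<fd> for the given file
 * descriptor and extract the pid from it. Returns -1 on any failure.
 */
static long read_fdinfo_pid(int fd)
{
	char path[64];
	char buf[4096];
	size_t n;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_fdinfo_pid(buf);
}
```

Splitting the text parsing from the file read keeps the parsing logic
testable without a real pidfd. A caller would pass the descriptor it got
from pidfd_open() to read_fdinfo_pid() and treat -1 as "no pid
available".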