Received: by 2002:ac0:bc90:0:0:0:0:0 with SMTP id a16csp3556002img; Mon, 25 Mar 2019 12:43:36 -0700 (PDT) X-Google-Smtp-Source: APXvYqwKfCt/q8rdeZ5bt9gTjZ8BOw8Qh3HuXSlcHTlmX3xEBKe6NYeHWahf0pgevWHies5XxwiZ X-Received: by 2002:a65:420b:: with SMTP id c11mr3733315pgq.24.1553543016221; Mon, 25 Mar 2019 12:43:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1553543016; cv=none; d=google.com; s=arc-20160816; b=vM8ZBvjuvhEbqlyLBrOfEXP4ut70Inis0KjxJjJy5gWvhFLt8R+uGqjwtmTVa3r4MW MGh81LZP2WcHUzqh8z4fw3z9o1jtUnKp+dELlfO1MOAXpQzUjlC9z81MSahsAOEWHTM1 BWBsvtpuvs+4Q6nFG65++qoPFHI+r6Ife99doL8LQNcSf++wnDugIE4rO+Doc55QSiJ2 kIkmkhTSO0W0B0wjdfn34GBWKHgy3pMmOaZWB9K4K6jEe1ml3cVypaF7kenkST4/PGzM lY9unoc49BlHo5n3lUffin92UAP6NkV4gaHnKQjNmZlmcpa/x551anlLEaiV4C59OR50 nMww== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=zIcail1q0Vj9y3ZxkGiNoq0bOit/ZXD2J+TAZmW+YZQ=; b=tdOPfwa/MFPjQrpHye/5l+wRo3J5hhyAjdfdu6sOK7jYCa/4WaBCt9DIgZJzY4mdDy c2Gl31XrDjujpEiqfDmcmwqmI4HT21MiMtz1kkiOZXcGzL7fh7jAjqUS6Ad4CSqymO2v xEuVQkwEVLnmEh5CKo9iwNGvG1oMrNx2PiQlPDT307pHO/bbQqIdcaFKI2SDfIAMAtHA icuywKPBS9ERjdNtY8amQIlcwjPpg8Q/Y0mC5F26SC45jEZ4lQ/E4FC35KVBfq5oH4Mp /q1/WwTfJvsWPgif7/1y38nNWnKFBVoFsYvKFT6RPeNzZ4bKJ0T96AhBCcfJh5fuKk+r /TfQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=hxN1DAwL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id g10si14770169pge.304.2019.03.25.12.43.20; Mon, 25 Mar 2019 12:43:36 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=hxN1DAwL; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730063AbfCYTmb (ORCPT + 99 others); Mon, 25 Mar 2019 15:42:31 -0400 Received: from mail-qt1-f196.google.com ([209.85.160.196]:45113 "EHLO mail-qt1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728912AbfCYTma (ORCPT ); Mon, 25 Mar 2019 15:42:30 -0400 Received: by mail-qt1-f196.google.com with SMTP id v20so11752484qtv.12; Mon, 25 Mar 2019 12:42:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=zIcail1q0Vj9y3ZxkGiNoq0bOit/ZXD2J+TAZmW+YZQ=; b=hxN1DAwLOQEb0uILnk/dx4xPBTyasoOOVsbh+Fi1vgANtH7/ieicXcRq0TyV0GnNN0 A4QjyJq+agF9+EATBnVaHTWr7MPWHrs9hodEwaIRUYd5ot2O9KTwEIpoI7/rOMULsYyL i4IPoshHFxCKD0d3ny3LY0uPMVLWFbxBv/1W7mI7a6P2n0wHNnc1ebVeFI0JhZzJ34v7 I3IDV2M5zga0Psmxo+8Crdvnc/EP99GL/z9AHBIiCNFi1Emaub3DN6ZmtRH9CiQLdyev v+EWwuB6loc7lgK/NYVmSzOZ2IVVAvrBl4r8moyKmxLK5iuOVpJAEJt/yIsnd4jaApQU PgXA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=zIcail1q0Vj9y3ZxkGiNoq0bOit/ZXD2J+TAZmW+YZQ=; b=nznegCMCOwgDDLoyQGoVR57j4mpWB6Uh73Oe/7qpEYR1J+RBT3RMHyjOFjoFxzqKhP k7Jp8qm/SLSOFgyh9loDBAHu/VPaMrMUKsIVrREwk9bE+eh7uTXFXdy1ORcaHSpfC64c TVVTakYYkuo5IyrzlhU3QOc8NExpue//aYdi9XlE4V/I71MtwrUkWCq9eSyIW2p2J9VG 8Nr3M+HoXy/XmriERfuzTFED0TZKUFDwz4s4ts0D8MWQk9XFBJgRgEyu1qAvrajgG2t7 JmCQjLt/Iyd3jyXYQXWfD6SkYJ3OZk57HN4Xw0OhA5+a3jjsCNHOHGDLeZBRO/+rmXl3 7ncQ== X-Gm-Message-State: APjAAAX6PWXXjKKLtELYHBHsMhZuzLO89gqxcOocrzldARDtBcWOY/me wRJpK6NThjA+P/ZBazVXJdiCinfrAPoXMp/qrCc= X-Received: by 2002:ac8:2b53:: with SMTP id 19mr22717611qtv.374.1553542949602; Mon, 25 Mar 2019 12:42:29 -0700 (PDT) MIME-Version: 1.0 References: <20190325162052.28987-1-christian@brauner.io> <20190325173614.GB25975@google.com> In-Reply-To: From: Jonathan Kowalski Date: Mon, 25 Mar 2019 19:42:14 +0000 Message-ID: Subject: Re: [PATCH 0/4] pid: add pidctl() To: Daniel Colascione Cc: Joel Fernandes , Christian Brauner , Jann Horn , Konstantin Khlebnikov , Andy Lutomirski , David Howells , "Serge E. Hallyn" , "Eric W. Biederman" , Linux API , linux-kernel , Arnd Bergmann , Kees Cook , Alexey Dobriyan , Thomas Gleixner , Michael Kerrisk-manpages , "Dmitry V. Levin" , Andrew Morton , Oleg Nesterov , Nagarathnam Muthusamy , Aleksa Sarai , Al Viro Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 25, 2019 at 6:57 PM Daniel Colascione wrote: > > On Mon, Mar 25, 2019 at 11:19 AM Jonathan Kowalski wrote: > > > > On Mon, Mar 25, 2019 at 5:53 PM Daniel Colascione wrote: > > > > > > [..snip..] > > > > > > I don't like the idea of having one kind of pollfd be pollable and > > > another not. Such an interface would be confusing for users. If, as > > > you suggest below, we instead make the procfs-less FD the only thing > > > we call a "pidfd", then this proposal becomes less confusing and more > > > viable. > > > > That's certainly on the table, we remove the ability to open > > /proc/ and use the dir fd with pidfd_send_signal. I'm in favor of > > this. > > > > > > But.. one thing we could do for Daniel usecase is if a /proc/pid directory fd > > > > can be translated into a "pidfd" using another syscall or even a node, like > > > > /proc/pid/handle or something. I think this is what Christian suggested in > > > > the previous threads. > > > > > > /proc/pid/handle, if I understand correctly, "translates" a > > > procfs-based pidfd to a non-pidfd one. The needed interface would have > > > to perform the opposite translation, providing a procfs directory for > > > a given pidfd. > > > > > > > And also for the translation the other way, add a syscall or modify > > > > translate_fd or something, to covert a anon_inode pidfd into a /proc/pid > > > > directory fd. Then the user is welcomed to do openat(2) on _that_ directory fd. > > > > Then we modify pidfd_send_signal to only send signals to pure pidfd fds, not > > > > to /proc/pid directory fds. > > > > > > This approach would work, but there's one subtlety to take into > > > account: which procfs? You'd want to take, as inputs, 1) the procfs > > > root you want, and 2) the pidfd, this way you could return to the user > > > a directory FD in the right procfs. > > > > > > > I don't think metadata access is in the scope of such a pidfd itself. > > It should be possible to write a race-free pkill(1) using pidfds. > Without metadata access tied to the pidfd somehow, that can't be done. > > > > > > Should we work on patches for these? Please let us know if this idea makes > > > > sense and thanks a lot for adding us to the review as well. > > > > > > I would strongly prefer that we not merge pidctl (or whatever it is) > > > without a story for metadata access, be it your suggestion or > > > something else. > > > > IMO, this is out of scope for a process handle. Hence, the need for > > metadata access musn't stall it as is. > > On the contrary, rather than metadata being out of scope, metadata > access is essential. Given a file handle (an FD), you can learn about > the file backing that handle by using fstat(2). Given a socket handle > (a socket FD), you can learn about the socket with getsockopt(2) and > ioctl(2). It would be strange not to be able, similarly, to learn > things about a process given a handle (a pidfd) to that process. I > want process handles to be "boring" in that they let users query and > manipulate processes mostly like they manipulate other resources. > Yes, but everything in /proc is not equivalent to an attribute, or an option, and depending on its configuration, you may not want to allow processes to even be able to see /proc for any PIDs other than those running as their own user (hidepid). This means, even if this new system call is added, to respect hidepid, it must, depending on if /proc is mounted (and what hidepid is set to, and what gid= is set to), return EPERM, because then there is a discrepancy between how the two entrypoints to acquire a process handle do access control. This is ugly, but ideally should work the way it is described above. Otherwise, things would be able to bypass the restrictions made therein, because we also want a system call. PIDs however are addressable with kill(2) even with hidepid enabled. For good reason, the capabilities you can exercise using a PID is limited in that case. The same logic applies to a process reference like pidfd. I agree there should be some way (if you can take care of the hidepid gotcha) to return a dir fd of /proc entry, but I don't think it should be the pidfd itself. You can get it to work using the fdinfo thing for now. > > If you really need this, the pid this pidfd is mapped to can be > > exposed through fdinfo (which is perfectly fine for your case as you > > use /proc anyway). This means you can reliably determine if you're > > opening it for the same pid (and it hasn't been recycled) because this > > pidfd will be pollable for state change (in the future). Exposing it > > through fdinfo isn't a problem, pid ns is bound to the process during > > its lifetime, it's a process creation time property, so the value > > isn't going to change anyway. So you can do all metadata access you > > want subsequently. > > Thanks for the proposal. I'm a bit confused. Are you suggesting that > we report the numeric FD in fdinfo? Are you saying it should work > basically like this? > Yes, numeric PID that you would see from your namespace for the said process the pidfd is for. It could be -1 if this process is not in any of the namespaces (current or children) (which would happen, say if you pass it to some other pidns residing process, during fd installation on ithe target process we set that up properly). > int get_oom_score_adj(int pidfd) { > unique_fd fdinfo_fd = open(fmt("/proc/self/fdinfo/%d", pidfd)); > string fdinfo = read_all(fdinfo_fd); > numeric_pid = parse_fdinfo_for_line("pid"); > unique_fd procdir_fd = open(fmt("/proc/%d", numeric_pid), O_DIRECTORY); > if(!is_pidfd_alive(pidfd)) { return -1; /* process died */ } /* > LIVENESS CHECK */ > // We know the process named by pidfd was called NUMERIC_PID > // in our namespace (because fdinfo told us) and we know that the > named process > // was alive after we successfully opened its /proc/pid directory, therefore, > // PROCDIR_FD and PIDFD must refer to the same underlying process. > unique_fd oom_adj_score_fd = openat(procdir_fd, "oom_score_adj"); > return parse_int(read_all(oom_adj_score_fd)); > } > > While this interface functions correctly, I have two concerns: 1) the > number of system calls necessary is very large -- by my count, > starting with the pifd, we need six, not counting the follow-on > close(2) calls (which would bring the total to nine[1]), But hey, that's a bit rich if you're planning to collect metadata from /proc ;-). > and 2) it's > "fail unsafe": IMHO, most users in practice will skip the line marked > "LIVENESS CHECK", and as a result, their code will appear to work but > contain subtle race conditions. An explicit interface to translate > from a (PIDFD, PROCFS_ROOT) tuple to a /proc/pid directory file > descriptor would be both more efficient and fail-safe. > > [1] as a separate matter, it'd be nice to have a batch version of close(2). Since /proc is full of gunk, how about adding more to it and making the magic symlink of /proc/self/fd for the pidfd to lead to the dirfd of the /proc entry of the process it maps to, when one uses O_DIRECTORY while opening it? Otherwise, it behaves as it does today. It would be equivalent to opening the proc entry with usual access restrictions (and hidepid made to work) but without the races, and because for processes outside your and children pid ns, it shouldn't work anyway, and since they wouldn't have their entry on this procfs instance, it would all just fit in nicely?