References: <20190327162147.23198-1-christian@brauner.io>
 <20190327162147.23198-3-christian@brauner.io>
 <20190327213404.pv4wqtkjbufkx36u@brauner.io>
 <20190327222543.huugotqcew6jyytv@brauner.io>
 <20190328103813.eogszrqbitw3e7k7@brauner.io>
From: Jonathan Kowalski
Date: Sat, 30 Mar 2019 14:30:44 +0000
Subject: Re: [PATCH 2/4] pid: add pidfd_open()
To: Daniel Colascione
Cc: Christian Brauner, Jann Horn, Konstantin Khlebnikov, Andy Lutomirski,
 David Howells, "Serge E. Hallyn", "Eric W. Biederman", Linux API,
 linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan, Thomas Gleixner,
 Michael Kerrisk-manpages, "Dmitry V. Levin", Andrew Morton, Oleg Nesterov,
 Nagarathnam Muthusamy, Aleksa Sarai, Al Viro, Joel Fernandes
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione wrote:
>
> [SNIP]
>
> Thanks again.
>
> I agree that the operation we're discussing has a simple signature,
> but signature flexibility isn't the only reason to prefer a system
> call over an ioctl. There are other reasons for preferring system
> calls to ioctls (safety, tracing, etc.)
> that apply even if the
> operation we're discussing has a relatively simple signature: for
> example, every system call has a distinct and convenient ftrace event,
> but ioctls don't; strace filtering Just Works on a
> system-call-by-system-call basis, but it doesn't for ioctls; and

It does for ioctls with a unique number.

> documentation for system calls is much more discoverable (e.g., man
> -k) than documentation for ioctls. Even if the distinction doesn't
> matter much, IMHO, it still matters a little, enough to favor a system
> call without an offsetting advantage for the ioctl option.

It will be alongside other I/O operations in the pidfd man page.

>
> > If you start adding a system call for every specific operation on file
> > descriptors, it *will* become a problem.
>
> I'm not sure what you mean. Do you mean that adding a top-level system
> call for every operation that might apply to one specific kind of file
> descriptor would lead, as the overall result, to the kernel having
> enough system calls to cause negative consequences? I'm not sure I
> agree, but accepting this idea for the sake of discussion: shouldn't
> we be more okay with system calls for features present on almost all
> systems --- like procfs --- even if we punt to ioctls very rarely-used
> functionality, e.g., some hypothetical special squeak noise that you
> could get some specific 1995 adlib clone to make?

Consider if we were to do:

int procpidfd_open(int pidfd, int procrootfd);

This system call is useless besides a very specific operation for a
very specific use case on a file descriptor. Then, you figure you might
want to support procpidfd back to pidfd (although YAGNI), so you will
add a CMD or flags argument now,

int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);

int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);

vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
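
To make the overloading concrete, here is a minimal userspace sketch of the
argument checks a flag-multiplexed procpidfd_open() would need. Everything
in it is an assumption for illustration: the flag values are made up, the
helper name procpidfd_open_check() is hypothetical, and it opens nothing.
It only shows that the second argument means one thing under one flag and
something else under the other.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical command values: the names come from the example above,
 * the numbers are made up. None of this is a real kernel ABI. */
#define PIDFD_TO_PROCPIDFD 1u
#define PROCPIDFD_TO_PIDFD 2u

static_assert(PIDFD_TO_PROCPIDFD != PROCPIDFD_TO_PIDFD,
	      "each command needs its own value");

/*
 * Userspace mock of the argument checks a flag-multiplexed
 * procpidfd_open() would need. It opens nothing; it only reports
 * whether the (pidfd, fd, flags) triple is a valid combination.
 * Returns 0 if the combination is acceptable, a negative errno if not.
 */
static int procpidfd_open_check(int pidfd, int fd, unsigned int flags)
{
	switch (flags) {
	case PIDFD_TO_PROCPIDFD:
		/* pidfd is the input, fd is the /proc root descriptor. */
		if (pidfd < 0 || fd < 0)
			return -EBADF;
		return 0;
	case PROCPIDFD_TO_PIDFD:
		/* pidfd is unused and must be -1, fd is now a per-process dirfd. */
		if (pidfd != -1)
			return -EINVAL;
		if (fd < 0)
			return -EBADF;
		return 0;
	default:
		/* Unknown command, or several commands OR-ed together. */
		return -EINVAL;
	}
}
```

Each new direction of translation adds another case with its own rules for
which argument is live, which is the part a caller has to memorise.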
Then, you want pidfd2pid, and so on. If you don't, you will have to
add a command interface in the new system call, which changes
parameters depending on the flags.

This is already starting to get ugly. In the end, it's a matter of
taste, but this pattern, if exploited, leads to endless proliferation
of the system call interface, often riddled with short-sighted APIs,
because you cannot know if you want a focused call or a command-style
call. ioctl is already a command interface, and 10 system calls, one
for each command, is _NOT_ nice from a user perspective.

>
> > Besides, the translation is
> > just there because it is racy to do in userspace, it is not some well
> > defined core kernel functionality.
> > Therefore, it is just a way to
> > enter the kernel to do the openat in a race free and safe manner.
>
> I agree that the translation has to be done in the kernel, not
> userspace, and that the kernel must provide to userspace some
> interface for requesting that the translation happen: we're just
> discussing the shape of this interface. Shouldn't all interfaces
> provided by the kernel to userspace be equally well defined? I'm not
> sure that the internal simplicity of the operation matters much
> either. There are already explicit system calls for some
> simple-to-implement things, e.g., timerfd_gettime. It's worth noting
> that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
> and optional.

It is well defined, it has a well-defined signature, and it will error
out if you don't use it properly. Again, what it does is very limited
and niche. I am not sure it warrants a system call of its own.
timerfd_gettime was an afterthought anyway, so that's probably not a
good example (it was added more to match the POSIX timers interface, as
the original timerfd never had support for querying, hence the split
into two steps, create and initialize; you could argue one could do it
without a syscall, but it still has a well-defined argument list, and
accepting elaborate data structures into an ioctl is not a good
interface to plumb, so that's easily justified). It's much like socket
and setsockopt/getsockopt in nature. I would even say APIs separating
creation and configuration age well and are better, but a process
doesn't fit such a model cleanly.

>
> > As is, the facility being provided through an ioctl on the pidfd is
> > not something I'd consider a problem.
>
> You're right that from a signature perspective, using an ioctl isn't a
> problem. I just want to make sure we take into account the other,
> non-signature advantages that system calls have over ioctls.
>
> > I think the translation stuff
> > should also probably be an extension of ioctl_ns(2) (but I wouldn't be
> > opposed if translate_pid is resurrected as is).
> > For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
> > procrootfd), I'd agree that a system call would be a cleaner
> > interface, otherwise, if you cannot generalise it, using ioctls as a
> > command interface is probably the better tradeoff here.
>
> Sure: I just want to better understand everyone else's thought process
> here, having been frustrated with things like the termios ioctls being
> ioctls.

On that note, I don't buy the safety argument here, or the seccomp
argument. You need to consider what this is, on a case-by-case basis.

ioctl(pidfd, PIDFD_TO_PROCFD, procrootfd);

See? You need /proc's root fd to be able to get to the /proc/ dir fd,
in a race-free manner. You don't need to have the overhead of filtering
to contain access, as privilege is bound to descriptors.
However, an ioctl to do pidfd_send_signal would indeed have been
problematic: it takes in structures, it has a very core use case, and
it has a very compelling advantage over existing signal sending APIs.

On that point: using file descriptors, you can safely delegate to a
process metadata access for whatever it has a pidfd open for, without
/proc being in its mount namespace (which was what Andy was after).
File descriptors are acting as capabilities, so if you don't leak
/proc's root fd to the process, you can be assured it won't be able to
do any metadata access (provided you also don't have /proc mounted).
*No need* to blacklist the ioctl or the system call, as it wouldn't
work anyway. *This* is a good interface, with little to no overhead for
security.

So whether this is an ioctl or a system call doesn't matter at all,
because it doesn't suffer from any of those cases where you really want
to avoid ioctls and do a well-scoped system call as an entrypoint
instead. There is not much incentive to it, really.