From: Daniel Colascione
Date: Sat, 30 Mar 2019 09:08:31 -0700
Subject: Re: [PATCH 2/4] pid: add pidfd_open()
To: Jonathan Kowalski
Cc: Christian Brauner, Jann Horn, Konstantin Khlebnikov, Andy Lutomirski,
    David Howells, "Serge E. Hallyn", "Eric W. Biederman", Linux API,
    linux-kernel, Arnd Bergmann, Kees Cook, Alexey Dobriyan,
    Thomas Gleixner, Michael Kerrisk-manpages, "Dmitry V. Levin",
    Andrew Morton, Oleg Nesterov, Nagarathnam Muthusamy, Aleksa Sarai,
    Al Viro, Joel Fernandes

On Sat, Mar 30, 2019 at 7:30 AM Jonathan Kowalski wrote:
>
> On Sat, Mar 30, 2019 at 7:39 AM Daniel Colascione wrote:
> >
> > [SNIP]
> >
> > Thanks again.
> >
> > I agree that the operation we're discussing has a simple signature,
> > but signature flexibility isn't the only reason to prefer a system
> > call over an ioctl. There are other reasons for preferring system
> > calls to ioctls (safety, tracing, etc.)
> > that apply even if the
> > operation we're discussing has a relatively simple signature: for
> > example, every system call has a distinct and convenient ftrace event,
> > but ioctls don't; strace filtering Just Works on a
> > system-call-by-system-call basis, but it doesn't for ioctls; and
>
> It does for those with a unique number.

There's no such thing as a unique ioctl number though. Anyone is free
to create an ioctl with a conflicting number. Sure, you can guarantee
ioctl request number uniqueness within things that ship with the
kernel, but you can also guarantee system call number uniqueness, so
what do we gain by using an ioctl? And yes, you can do *some* filtering
based on ioctl command, but it's frequently more awkward than the
system call equivalent and sometimes doesn't work at all. What's the
strace -e pidfd_open equivalent for an ioctl? Grep isn't quite
equivalent.

> > documentation for system calls is much more discoverable (e.g., man
> > -k) than documentation for ioctls. Even if the distinction doesn't
> > matter much, IMHO, it still matters a little, enough to favor a system
> > call without an offsetting advantage for the ioctl option.
>
> It will be alongside other I/O operations in the pidfd man page.
>
> >
> > > If you start adding a system call for every specific operation on file
> > > descriptors, it *will* become a problem.
> >
> > I'm not sure what you mean. Do you mean that adding a top-level system
> > call for every operation that might apply to one specific kind of file
> > descriptor would lead, as the overall result, to the kernel having
> > enough system calls to cause negative consequences?
> > I'm not sure I
> > agree, but accepting this idea for the sake of discussion: shouldn't
> > we be more okay with system calls for features present on almost all
> > systems --- like procfs --- even if we punt to ioctls very rarely-used
> > functionality, e.g., some hypothetical special squeak noise that you
> > could get some specific 1995 adlib clone to make?
>
> Consider if we were to do:
>
> int procpidfd_open(int pidfd, int procrootfd);
>
> This system call is useless besides a very specific operation for a
> very specific usecase on a file descriptor.

I don't understand this argument based on an operation being "very
specific". Are you suggesting that epoll_wait should have been an
ioctl? I don't see how an operation being specific to a type of file
descriptor is an argument for making that operation an ioctl instead
of a system call.

> Then, you figure you might
> want to support procpidfd back to pidfd (although YAGNI), so you will
> add a CMD or flags argument now,

Why would we need a system call at all? And why are we invoking
hypothetical argument B in an argument against adding operation A as a
system call? B doesn't exist yet. As Christian mentioned, we could
create a named /proc/pid/handle file which, when opened, would yield a
pidfd. More generally, though: if we expose new functionality, we can
make a new system call. Why is that worse than adding a new ioctl
code?

> int procpidfd_open(int pidfd, int procrootfd/procpidfd, unsigned int flags);
>
> int procpidfd = procpidfd_open(fd, procrootfd, PIDFD_TO_PROCPIDFD);
> int pidfd = procpidfd_open(-1, procpidfd, PROCPIDFD_TO_PIDFD);
>
> vs procpidfd2pidfd(int procpidfd); if you did not foresee the need.
> Then, you want pidfd2pid, and so on.
>
> If you don't, you will have to add a command interface in the new
> system call, which changes parameters depending on the flags. This is
> already starting to get ugly.
> In the end, it's a matter of taste,

I don't think it's quite a matter of taste.
That idiom suggests that the two options are roughly functionally
equivalent. In this case, there are clear and specific technical
reasons to prefer system calls. We're not talking about two equivalent
approaches. Making the operation a system call gives users more power,
and we shouldn't deprive them of this power without a good reason. So
far, I haven't seen such a reason.

> this
> pattern if exploited leads to endless proliferation of the system call
> interface, often ridden with short-sighted APIs, because you cannot
> know if you want a focused call or a command style call.

Bad interfaces proliferate no matter what, and there's no difference
in support burden between an ioctl and a system call: both need to
work indefinitely. Are you saying that it's easier to remove a bad
interface if it's an ioctl instead of a system call? I don't recall
any precedent, but maybe I'm wrong.

> ioctl is
> already a command interface, and 10 system calls for each command is
> _NOT_ nice, from a user perspective.

syscall(2) is already a "command" interface, just with more automatic
features than ioctl. What do we gain by pushing the "command" down a
level from the syscall(2) multiplexer to the ioctl(2) multiplexer?

> > > Besides, the translation is
> > > just there because it is racy to do in userspace, it is not some well
> > > defined core kernel functionality.
> > > Therefore, it is just a way to
> > > enter the kernel to do the openat in a race free and safe manner.
> >
> > I agree that the translation has to be done in the kernel, not
> > userspace, and that the kernel must provide to userspace some
> > interface for requesting that the translation happen: we're just
> > discussing the shape of this interface. Shouldn't all interfaces
> > provided by the kernel to userspace be equally well defined? I'm not
> > sure that the internal simplicity of the operation matters much
> > either.
> > There are already explicit system calls for some
> > simple-to-implement things, e.g., timerfd_gettime. It's worth noting
> > that timerfd is (IIRC), like procfs, a feature that's both ubiquitous
> > and optional.
>
> It is well defined, it has a well defined signature, and it will error
> out if you don't use it properly. Again, what it does is very limited
> and niche. I am not sure it warrants a system call of its own.

Do we need to think in terms of whether a bit of functionality
"warrants" a system call? A system call is not expensive to add.
Instead of thinking of whether a function meets some bar for promotion
to a system call, we should be thinking of whether a function meets
the bar for demotion to ioctl. System calls are generally more useful
to users than ioctls, and all things being equal, creating an
interface more pleasant for users should be the priority.

> timerfd_gettime was an afterthought anyway, that's probably not a good
> example (it was more to just match the POSIX timers interface as the
> original timerfd never had support for querying, so the split into two
> steps, create and initialize, you could argue one could do it without
> a syscall, but it still has a well defined argument list, and
> accepting elaborate data structures into an ioctl is not a good
> interface to plumb, so that's easily justified).

TCGETS deals with an elaborate data structure. In any case, I don't
think the elaborate data structure argument is relevant. The things
that the system call mechanism lets you do that the ioctl mechanism
does not are independent of whether a system call accepts simple or
complex argument types. An ioctl and a system call both specify a bit
of kernel mode code to run: the difference is in identifying that bit
of kernel-side code to run and in the features you get automatically
through the call mechanism.

> It's much like socket and setsockopt/getsockopt in nature.
> I would
> even say APIs separating creation and configuration age well and are
> better, but a process doesn't fit such a model cleanly.
>
> > > As is, the facility being provided through an ioctl on the pidfd is
> > > not something I'd consider a problem.
> >
> > You're right that from a signature perspective, using an ioctl isn't a
> > problem. I just want to make sure we take into account the other,
> > non-signature advantages that system calls have over ioctls.
> >
> > > I think the translation stuff
> > > should also probably be an extension of ioctl_ns(2) (but I wouldn't be
> > > opposed if translate_pid is resurrected as is).
> > > For anything more involved than ioctl(pidfd, PIDFD_TO_PROCFD,
> > > procrootfd), I'd agree that a system call would be a cleaner
> > > interface, otherwise, if you cannot generalise it, using ioctls as a
> > > command interface is probably the better tradeoff here.
> >
> > Sure: I just want to better understand everyone else's thought process
> > here, having been frustrated with things like the termios ioctls being
> > ioctls.
>
> On that note, I don't buy the safety argument here, or the seccomp argument.
>
> You need to consider what this is, on a case by case basis.
>
> ioctl(pidfd, PIDFD_TO_PROCFD, procrootfd);
>
> See? You need /proc's root fd to be able to get to the /proc/ dir
> fd, in a race free manner. You don't need to have the overhead of
> filtering to contain access, as privilege is bound to descriptors.

There's no filtering overhead when you haven't enabled filtering,
right? You seem to be making an argument that we should make this
facility an ioctl because we don't need the features that the system
call mechanism provides --- correct me if I'm wrong. Even if we can't
see a case for filtering or containment or tracing _today_, such a
case might arise in the future, as has happened many times before.
Besides, we haven't touched on the other features we get "for free"
when we make an operation a system call, e.g., explicit tracing
support. I don't understand why we should forestall these use cases,
both present and future, by using an ioctl instead of a system call,
especially when we get so little in return. I think there's a lot of
value in regularity and consistency and that making some operations
ioctls and some syscalls based on details like whether an operation
accepts a structure or a primitive makes the system less flexible and
understandable.

> [snip further discussion along the same lines]
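
A minimal C sketch of the two multiplexers contrasted above, assuming Linux with glibc. The helper names are hypothetical, and getpid and FIONREAD are stand-ins chosen only because they exist today; neither is one of the pidfd interfaces under discussion.

```c
#define _GNU_SOURCE
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Top-level multiplexer: the system call number alone selects the
 * kernel code to run, so tracers and filters can key on it directly.
 * (Hypothetical helper name, for illustration only.) */
pid_t pid_via_syscall_number(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Second-level multiplexer: ioctl(2) is itself a single system call.
 * The request code (FIONREAD here) is interpreted by whatever object
 * the descriptor refers to, in this case a pipe.
 * Returns the number of unread bytes in the pipe, or -1 on error.
 * (Hypothetical helper name, for illustration only.) */
int pending_bytes_via_ioctl(void)
{
    int fds[2];
    int pending = -1;

    if (pipe(fds) != 0)
        return -1;

    /* Queue five bytes, then ask the pipe how many are readable. */
    if (write(fds[1], "hello", 5) == 5 &&
        ioctl(fds[0], FIONREAD, &pending) != 0)
        pending = -1;

    close(fds[0]);
    close(fds[1]);
    return pending;
}
```

Both helpers trap into the kernel once. In the first, the number passed to syscall(2) picks the handler; in the second, the system call is always ioctl and the request code is dispatched further down by the file's own handler. Under strace, `-e trace=getpid` would isolate the first call by name, while the second appears as one ioctl among all the others.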