From: Daniel Colascione
Date: Thu, 7 Nov 2019 08:15:53 -0800
Subject: Re: [PATCH 1/1] userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
To: Andrea Arcangeli
Cc: Mike Rapoport, Andy Lutomirski, linux-kernel, Andrew Morton, Jann Horn, Linus Torvalds, Lokesh Gidra, Nick Kralevich, Nosh Minwalla, Pavel Emelyanov, Tim Murray, Linux API, linux-mm
In-Reply-To: <20191107153801.GF17896@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Nov 7, 2019 at 7:38 AM Andrea Arcangeli wrote:
> On Thu, Nov 07, 2019 at 12:54:59AM -0800, Daniel Colascione wrote:
> > On Thu, Nov 7, 2019 at 12:39 AM Mike Rapoport wrote:
> > > On Tue, Nov 05, 2019 at 08:41:18AM -0800, Daniel Colascione wrote:
> > > > On Tue, Nov 5, 2019 at 8:24 AM Andrea Arcangeli wrote:
> > > > > The long term plan is to introduce UFFD_FEATURE_EVENT_FORK2 feature
> > > > > flag that uses the ioctl to receive the child uffd, it'll consume more
> > > > > CPU, but it wouldn't require the PTRACE privilege anymore.
> > > >
> > > > Why not just have callers retrieve FDs using recvmsg?
> > > > This way, you retrieve the message packet and the file descriptor at
> > > > the same time and you don't need any appreciable extra CPU use.
> > >
> > > I don't follow you here. Can you elaborate on how recvmsg would be used in
> > > this case?
> >
> > Imagine an AF_UNIX SOCK_DGRAM socket. You call recvmsg(). You get a
> > blob of regular data along with some ancillary data. The ancillary
> > data may include some file descriptors or it may not. Isn't the UFFD
> > message model the same thing? You'd call recvmsg() on a UFFD and get
> > back a uffd_msg data structure. If that uffd_msg came with file
> > descriptors, these descriptors would be in ancillary data. If you
> > didn't reserve enough space for the message or enough space for its
> > ancillary data, the recvmsg() call would fail cleanly with MSG_TRUNC
> > or MSG_CTRUNC.
>
> Having to check for truncation is just a slowdown doesn't sound a
> feature here but just a complication and unnecessary branches. You can
> already read as much as you want in multiples of the uffd size.

You're already paying for bounds checking. Receiving a message via a
datagram socket is basically the same thing as what UFFD's read is
doing anyway.

> > The nice thing about using recvmsg() for this purpose is that there's
> > tons of existing code for dealing with recvmsg()'s calling convention
> > and its ancillary data. You can, for example, use recvmsg out of the
> > box in a Python script. You could make an ioctl that also returned a
> > data blob plus some optional file descriptors, but if recvmsg already
> > does exactly that job and it's well-understood, why not just reuse the
> > recvmsg interface?
>
> uffd can't become an plain AF_UNIX because on the other end there's no
> other process but the kernel.
> Even if it could the fact it'd facilitate a pure python backend isn't
> relevant because handling page faults is a performance critical system
> activity, and rust can do the ioctl like it can do poll/epoll without
> mio/tokyo by just calling glibc. We can't write kernel code in python
> either for the same reason.

My point isn't "hey, you should write this in Python". (Although for
prototyping, why not?) My point is that where there's an existing
kernel interface for exactly the functionality you want, you should
use it instead of inventing some new thing, because when we use the
same interface for things that have the same shape and purpose, we get
to reuse not only code but also the knowledge in people's heads.

> > point is only that *from a userspace API* point of view, recvmsg()
> > seems ideal.
>
> Now thinking about this, the semantics of the ancillary data seems to
> be per socket family. So what does prevent you to create an AF_UNIX
> socket, send it to a SCM_RIGHTS receiving daemon unaware that it is
> getting an AF_UNIX socket. The daemon is calling recvmsg on the fd it
> receives from SCM_RIGHTS in order to receive ancillary data from
> another non-AF_UNIX family instead (it is irrelevant what the
> semantics of the ancillary data are but they're not AF_UNIX). So the
> daemon calls recvmsg and it will not understand that the fd in the
> ancillary data represents an installed "fd" in the fd space and in
> turn still gets the fd involuntary installed with the exact same side
> effects of what we're fixing in the uffd fork event read?

SCM_RIGHTS (AFAIK) is the only bit of ancillary data which indicates
that the kernel has created a file descriptor in the process doing the
recvmsg.

> I guess there shall be something somewhere that prevents recvmsg to
> run on anything but an AF_UNIX if msg_control isn't NULL and
> msg_controllen > 0? Otherwise even if we implemented the uffd fork
> event with recvmsg, we would be back to square one.
Why would we limit recvmsg to AF_UNIX? We can receive ancillary data
on other sockets, e.g., netlink. SCM_RIGHTS works only with AF_UNIX
right now, but this limitation isn't written in stone.

> As a corollary this could also imply we don't need the ptrace check
> after all if the same thing can happen already to SCM_RIGHTS receiving
> daemon expecting to receive ancillary data from AF_SOMETHING but
> getting an AF_UNIX instead through SCM_RIGHTS (just like the uffd
> example was expecting to call read() on a normal fd and instead it got
> an uffd).

Programs generally don't go calling recvmsg() on random FDs they get
from the outside world. They do call read() on those FDs, which is why
read() having unexpected side effects is terrible.

> I'm sure there's something stopping SCM_RIGHTS to have the same
> pitfalls of uffd event fork and that makes recvmsg safe unlike read()
> but then it's not immediately clear what it is.

If you call recvmsg() with a non-empty ancillary data buffer, you know
to react to what you get. You're *opting into* the possibility of
getting file descriptors. Sure, it's theoretically possible that a
program calls recvmsg on random FDs it gets from unknown sources, sees
SCM_RIGHTS unexpectedly, and just ignores the SCM_RIGHTS message and
its FD payload, but that's an outright bug, while calling read() on
stdin is no bug.

Anyway, IMHO, UFFD should be a netlink-like SOCK_DGRAM socket that
sends FDs with SCM_RIGHTS. This interface is already very efficient --
people have been optimizing the hell out of AF_UNIX for decades -- and
it provides exactly the right semantics for what UFFD needs to do.
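
To make the calling convention concrete, here is a rough Python sketch of
the model described above. It is only an illustration under stated
assumptions: userfaultfd does not support recvmsg() today, so an AF_UNIX
SOCK_DGRAM socketpair stands in for the UFFD, a pipe stands in for the
child uffd, and the names send_event, recv_event and demo are made up for
this example. It shows a receiver that reserves ancillary space getting
the payload and the FD in one call, and a receiver that reserves none
getting the payload with MSG_CTRUNC set and no FD (behavior as on Linux).

```python
import array
import os
import socket


def send_event(sock, payload, fds=()):
    """Send one datagram, optionally attaching FDs via SCM_RIGHTS."""
    ancillary = []
    if fds:
        packed = array.array("i", fds).tobytes()
        ancillary.append((socket.SOL_SOCKET, socket.SCM_RIGHTS, packed))
    sock.sendmsg([payload], ancillary)


def recv_event(sock, max_fds):
    """Receive one datagram; max_fds=0 reserves no ancillary buffer."""
    fd_size = array.array("i").itemsize
    ancbufsize = socket.CMSG_LEN(max_fds * fd_size) if max_fds else 0
    data, ancdata, flags, _addr = sock.recvmsg(4096, ancbufsize)
    fds = array.array("i")
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial int in the control data.
            fds.frombytes(cdata[: len(cdata) - (len(cdata) % fd_size)])
    return data, list(fds), flags


def demo():
    # Stand-ins (assumptions): the socketpair plays the UFFD, the pipe's
    # read end plays the child uffd delivered with a fork event.
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    read_end, write_end = os.pipe()
    received = []
    try:
        # Receiver opts in by reserving room for one descriptor.
        send_event(sender, b"fork-event", [read_end])
        data1, fds1, _flags1 = recv_event(receiver, max_fds=1)
        received.extend(fds1)

        # Receiver reserves no ancillary space: nothing gets installed.
        send_event(sender, b"fork-event", [read_end])
        data2, fds2, flags2 = recv_event(receiver, max_fds=0)
        received.extend(fds2)

        # Check that the received descriptor refers to the same pipe.
        os.write(write_end, b"x")
        same_pipe = bool(fds1) and os.read(fds1[0], 1) == b"x"

        return {
            "opted_in": {
                "data": data1,
                "fd_count": len(fds1),
                "same_pipe": same_pipe,
            },
            "not_opted_in": {
                "data": data2,
                "fd_count": len(fds2),
                "ctrunc": bool(flags2 & socket.MSG_CTRUNC),
            },
        }
    finally:
        for fd in received + [read_end, write_end]:
            os.close(fd)
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(demo())
```

The second call is the interesting one: a caller that never passes a
control buffer cannot have a descriptor installed behind its back, which
is the property read() on a UFFD lacks.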