Date: Thu, 7 Nov 2019 10:38:01 -0500
From: Andrea Arcangeli
To: Daniel Colascione
Cc: Mike Rapoport, Andy Lutomirski, linux-kernel, Andrew Morton,
 Jann Horn, Linus Torvalds, Lokesh Gidra, Nick Kralevich,
 Nosh Minwalla, Pavel Emelyanov, Tim Murray, Linux API, linux-mm
Subject: Re: [PATCH 1/1] userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK
Message-ID: <20191107153801.GF17896@redhat.com>
References: <1572967777-8812-1-git-send-email-rppt@linux.ibm.com>
 <1572967777-8812-2-git-send-email-rppt@linux.ibm.com>
 <20191105162424.GH30717@redhat.com>
 <20191107083902.GB3247@linux.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

On Thu, Nov 07, 2019 at 12:54:59AM -0800, Daniel Colascione wrote:
> On Thu, Nov 7, 2019 at 12:39 AM Mike Rapoport wrote:
> > On Tue, Nov 05, 2019 at 08:41:18AM -0800, Daniel Colascione wrote:
> > > On Tue, Nov 5, 2019 at 8:24 AM Andrea Arcangeli wrote:
> > > > The long term plan is to introduce UFFD_FEATURE_EVENT_FORK2 feature
> > > > flag that uses the ioctl to receive the child uffd, it'll consume more
> > > > CPU, but it wouldn't require the PTRACE privilege anymore.
> > >
> > > Why not just have callers retrieve FDs using recvmsg? This way, you
> > > retrieve the message packet and the file descriptor at the same time
> > > and you don't need any appreciable extra CPU use.
> >
> > I don't follow you here.
> > Can you elaborate on how recvmsg would be used in
> > this case?
>
> Imagine an AF_UNIX SOCK_DGRAM socket. You call recvmsg(). You get a
> blob of regular data along with some ancillary data. The ancillary
> data may include some file descriptors or it may not. Isn't the UFFD
> message model the same thing? You'd call recvmsg() on a UFFD and get
> back a uffd_msg data structure. If that uffd_msg came with file
> descriptors, these descriptors would be in ancillary data. If you
> didn't reserve enough space for the message or enough space for its
> ancillary data, the recvmsg() call would fail cleanly with MSG_TRUNC
> or MSG_CTRUNC.

Having to check for truncation doesn't sound like a feature here, just
a slowdown: a complication and unnecessary branches. You can already
read as much as you want in multiples of the uffd_msg size.

> The nice thing about using recvmsg() for this purpose is that there's
> tons of existing code for dealing with recvmsg()'s calling convention
> and its ancillary data. You can, for example, use recvmsg out of the
> box in a Python script. You could make an ioctl that also returned a
> data blob plus some optional file descriptors, but if recvmsg already
> does exactly that job and it's well-understood, why not just reuse the
> recvmsg interface?

uffd can't become a plain AF_UNIX socket because on the other end
there's no other process, only the kernel. Even if it could, the fact
that it'd facilitate a pure Python backend isn't relevant, because
handling page faults is a performance critical system activity, and
Rust can do the ioctl, like it can do poll/epoll, without mio/tokio by
just calling glibc. We can't write kernel code in Python either, for
the same reason.

> How practical is it to actually support recvmsg without being a
> socket? How hard would it be to just become a socket? I don't know. My

AF_UNIX has more features than we need (credentials), and dealing with
skbs and truncation would slow down the protocol.
The objective is to get the highest performance possible out of the
uffd API so that it performs as close as possible to running page
faults in the kernel. So even if we could avoid a syscall in CRIU,
we'd be slowing down QEMU and all other normal cooperative usages if
we made uffd a socket. So overall it would be a net loss.

> point is only that *from a userspace API* point of view, recvmsg()
> seems ideal.

Now thinking about this, the semantics of the ancillary data seem to
be per socket family. So what prevents you from creating an AF_UNIX
socket and sending it to an SCM_RIGHTS receiving daemon that is
unaware it is getting an AF_UNIX socket? The daemon calls recvmsg on
the fd it received through SCM_RIGHTS, expecting ancillary data from
another, non-AF_UNIX family (it is irrelevant what the semantics of
that ancillary data are, only that they're not AF_UNIX). So the daemon
calls recvmsg without understanding that the fd in the ancillary data
represents an installed "fd" in its fd space, and in turn it still
gets the fd involuntarily installed, with the exact same side effects
as what we're fixing in the uffd fork event read?

I guess there must be something somewhere that prevents recvmsg from
running on anything but an AF_UNIX socket if msg_control isn't NULL
and msg_controllen > 0? Otherwise, even if we implemented the uffd
fork event with recvmsg, we would be back to square one.

As a corollary, this could also imply we don't need the ptrace check
after all, if the same thing can already happen to an SCM_RIGHTS
receiving daemon that expects ancillary data from AF_SOMETHING but
gets an AF_UNIX socket instead through SCM_RIGHTS (just like the uffd
example was expecting to call read() on a normal fd and instead got a
uffd).

I'm sure there's something stopping SCM_RIGHTS from having the same
pitfalls as the uffd fork event, something that makes recvmsg safe
unlike read(), but it's not immediately clear what it is.

Thanks,
Andrea
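
For reference, here is a minimal sketch of the AF_UNIX behaviour
discussed above: a descriptor passed with SCM_RIGHTS is installed in
the receiver's fd table as a side effect of recvmsg() itself, and a
receiver that reserves no control space gets MSG_CTRUNC instead of the
descriptor. It only exercises an AF_UNIX socketpair on Linux and says
nothing about uffd. The helper names (send_with_fd, recv_with_fds,
demo) are made up for illustration.

```python
import array
import os
import socket


def send_with_fd(sock, payload, fd):
    """Send payload with one descriptor attached as SCM_RIGHTS ancillary data."""
    fds = array.array("i", [fd])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_with_fds(sock, bufsize, max_fds):
    """Receive one datagram; return (data, received_fds, msg_flags)."""
    fds = array.array("i")
    # Reserve control space for max_fds descriptors, or none at all.
    ancbufsize = socket.CMSG_LEN(max_fds * fds.itemsize) if max_fds else 0
    data, ancdata, flags, _addr = sock.recvmsg(bufsize, ancbufsize)
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore a trailing partial int if the control buffer was cut short.
            usable = len(cdata) - (len(cdata) % fds.itemsize)
            fds.frombytes(cdata[:usable])
    return data, list(fds), flags


def demo():
    """Pass a pipe's read end over a socketpair, then repeat with no control space."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    r, w = os.pipe()
    try:
        os.write(w, b"hello")
        send_with_fd(a, b"msg", r)
        data, fds, _flags = recv_with_fds(b, 64, max_fds=1)
        # By the time recvmsg() returned, the kernel had already installed the
        # descriptor in this process; the caller only learns its number here.
        via_new_fd = os.read(fds[0], 5)
        os.close(fds[0])

        # Same message again, but the receiver reserves no control space.
        send_with_fd(a, b"msg", r)
        _, fds2, flags2 = recv_with_fds(b, 64, max_fds=0)
        return data, via_new_fd, fds2, bool(flags2 & socket.MSG_CTRUNC)
    finally:
        os.close(r)
        os.close(w)
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

The second call in demo() is the truncation case from the quoted text:
the data still arrives, but the receiver has to look at the returned
flags to notice that a descriptor was dropped.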