Received: by 2002:a05:6a10:a841:0:0:0:0 with SMTP id d1csp758520pxy; Thu, 22 Apr 2021 12:42:32 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzXdTkpdN/yzgcY2NnenR0926cWl+jgQvrrC28mX0LqfiE2yNBQs4gplQ2R6iOmdsAg/Qzy X-Received: by 2002:aa7:c850:: with SMTP id g16mr123706edt.324.1619120551981; Thu, 22 Apr 2021 12:42:31 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1619120551; cv=none; d=google.com; s=arc-20160816; b=nc0cSPjcdKNnCI40hC1ZALGJRyydM9nYrQzB401XNKaDasVociiobPjm2MopCyWJ+E c2snFI/AtLMVz66g36w5DzgPsL4ZeIwMstqIcq9Od/hp/Z/7IHXIbdom8Pk0lMpUON1s comMcNThtO+m1fS40Bbsy63D48sHy3Fs6KDqrxTWfSHSjqOmD3bPsuTERUDz17BnEs5h K4JK+HW23c/RA/bq7Z/W1F9zNmuMXzuS9eeeYrEiUOwdXS86LKbXeJvvdROWBbGVJl1B R193/wVBSNgYOFXyS/vggUHeX8Y5nWah80ByDDowMV69DCrAHzms1etT8T3qzqWQ4/D9 NWiA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:subject:cc:to:from:date :dkim-signature; bh=biOoOdkcB54+0RggxGqfQD/QFH+RgEKqiVLmKbv8Qt8=; b=kLVK0Y0XP4T7d+v1+4RLuFvQIdV7tFeNnpkLqRyLYs+x8kXkaIF0XBdfqCOOf/xglh 3lLnOQGbgOR3xG5BA+39wA0LirDXWJatBxJ30LiKtHIM1bMJXydK03QnPXbCNUi5p07y hWfd2Ev67N5sQMxNubtvTRgmHSgLZzGuW7Rt2OPC8kg0VAjjZUUtbe2npnrcD7RXtoKY 6ZPpU/UNs6MYRoMXhj6nUuBNveYtF72C+L0DQbU9zAmIB/lWb1pWt+SPmi383nxXtPV0 cug4tplpPafb0vAQySDBb2SGbdO4bmr59qqGaH5aJChoQFS6gjYkBb0KxNwx3yrOrLfm /35Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Pu09vGhe; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id hp30si3449555ejc.344.2021.04.22.12.42.07; Thu, 22 Apr 2021 12:42:31 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Pu09vGhe; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238956AbhDVTig (ORCPT + 99 others); Thu, 22 Apr 2021 15:38:36 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:40487 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239025AbhDVTif (ORCPT ); Thu, 22 Apr 2021 15:38:35 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1619120279; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=biOoOdkcB54+0RggxGqfQD/QFH+RgEKqiVLmKbv8Qt8=; b=Pu09vGheXqQfCSTMH110QUwahN1ua1wGZwpuVPE9YPn+BbXO9jBdXjYJVP1zgGugDgD+lE wSmkNEXwgxx/CO5HvD0A65fIKTF9ZGLJzYylwKBn6elpfQUiiYfAlcYNWvJzXFfohOijL+ 4YRxcCUyFjqUpTiW4rivQDXqeFtXxkU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-144-Jm4NmWn3NMmQ8koJLzBOpw-1; Thu, 22 Apr 2021 15:37:56 -0400 X-MC-Unique: Jm4NmWn3NMmQ8koJLzBOpw-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 30D138542C1; Thu, 22 Apr 2021 19:37:53 +0000 (UTC) Received: from redhat.com (ovpn-114-21.phx2.redhat.com [10.3.114.21]) by smtp.corp.redhat.com (Postfix) with ESMTP id 8D5A360BE5; Thu, 22 Apr 2021 19:37:47 +0000 (UTC) Date: Thu, 22 Apr 2021 13:37:47 -0600 From: Alex Williamson To: Jason Gunthorpe Cc: "Liu, Yi L" , Jacob Pan , Auger Eric , Jean-Philippe Brucker , "Tian, Kevin" , LKML , Joerg Roedel , Lu Baolu , David Woodhouse , "iommu@lists.linux-foundation.org" , "cgroups@vger.kernel.org" , Tejun Heo , Li Zefan , Johannes Weiner , Jean-Philippe Brucker , Jonathan Corbet , "Raj, Ashok" , "Wu, Hao" , "Jiang, Dave" , David Gibson , Alexey Kardashevskiy Subject: Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs Message-ID: <20210422133747.23322269@redhat.com> In-Reply-To: <20210422175715.GA1370958@nvidia.com> References: <20210415230732.GG1370958@nvidia.com> <20210416061258.325e762e@jacob-builder> <20210416094547.1774e1a3@redhat.com> <20210421162307.GM1370958@nvidia.com> <20210421105451.56d3670a@redhat.com> <20210421175203.GN1370958@nvidia.com> <20210421133312.15307c44@redhat.com> <20210421230301.GP1370958@nvidia.com> <20210422111337.6ac3624d@redhat.com> <20210422175715.GA1370958@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 22 Apr 2021 14:57:15 -0300 Jason Gunthorpe wrote: > > > The security rule for isolation is that once a device is attached to a > > > /dev/ioasid fd then all other devices in that security group must be > > > attached to the same ioasid FD or left unused. > > > > Sounds like a group... Note also that if those other devices are not > > isolated from the user's device, the user could manipulate "unused" > > devices via DMA. So even unused devices should be within the same > > IOMMU context... thus attaching groups to IOMMU domains. > > That is a very interesting point. So, say, in the classic PCI bus > world if I have a NIC and HD on my PCI bus and both are in the group, > I assign the NIC to a /dev/ioasid & VFIO then it is possible to use > the NIC to access the HD via DMA > > And here you want a more explicit statement that the HD is at risk by > using the NIC? If by "classic" you mean conventional PCI bus, then this is much worse than simply "at risk". The IOMMU cannot differentiate devices behind a PCIe-to-PCI bridge, so the moment you turn on the IOMMU context for the NIC, the address space for your HBA is pulled out from under it. In the vfio world, the NIC and HBA are grouped and managed together, the user cannot change the IOMMU context of a group unless all of the devices in the group are "viable", ie. they are released from any host drivers. > Honestly, I'm not sure the current group FD is actually showing that > very strongly - though I get the point it is modeled in the sysfs and > kind of implicit in the API - we evolved things in a way where most > actual applications are taking in a PCI BDF from the user, not a group > reference. So the actual security impact seems lost on the user. vfio users are extremely aware of grouping, they understand the model, if not always the reason for the grouping. You only need to look at r/VFIO to find various lsgroup scripts and kernel patches to manipulate grouping. The visibility to the user is valuable imo. > Along my sketch if we have: > > ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd) > [..] > ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid_id) == ENOPERM > > I would feel comfortable if the ATTACH_IOASID fails by default if all > devices in the group have not been joined to the same ioasidfd. And without a group representation to userspace, how would a user know to resolve that? > So in the NIC&HD example the application would need to do: > > ioasid_fd = open("/dev/ioasid"); > nic_device_fd = open("/dev/vfio/device0") > hd_device_fd = open("/dev/vfio/device1") > > ioctl(nic_device_fd, JOIN_IOASID_FD, ioasifd) > ioctl(hd_device_fd, JOIN_IOASID_FD, ioasifd) > [..] > ioctl(nice_device, ATTACH_IOASID, gpa_ioasid_id) == SUCCESS > > Now the security relation is forced by the kernel to be very explicit. But not discoverable to the user. > However to keep current semantics, I'd suggest a flag on > JOIN_IOASID_FD called "IOASID_IMPLICIT_GROUP" which has the effect of > allowing the ATTACH_IOASID to succeed without the user having to > explicitly join all the group devices. This is analogous to the world > we have today of opening the VFIO group FD but only instantiating one > device FD. > > In effect the ioasid FD becomes the group and the numbered IOASID's > inside the FD become the /dev/vfio/vfio objects - we don't end up with > fewer objects in the system, they just have different uAPI > presentations. > > I'd envision applications like DPDK that are BDF centric to use the > first API with some '--allow-insecure-vfio' flag to switch on the > IOASID_IMPLICIT_GROUP. Maybe good applications would also print: > "Danger Will Robinson these PCI BDFs [...] are also at risk" > When the switch is used by parsing the sysfs So the group still exist in sysfs, they just don't have vfio representations? An implicit grouping does what, automatically unbind the devices, so an admin gives a user access to the NIC but their HBA device disappears because they were implicitly linked? That's why vfio basis ownership on the group, if a user owns the group but the group is not viable because a device is still bound to another kernel driver, the use can't do anything. Implicitly snarfing up subtly affected devices is bad. > > > Thus /dev/ioasid also becomes the unit of security and the IOMMU > > > subsystem level becomes aware of and enforces the group security > > > rules. Userspace does not need to "see" the group > > > > What tools does userspace have to understand isolation of individual > > devices without groups? > > I think we can continue to show all of this group information in sysfs > files, it just doesn't require application code to open a group FD. > > This becomes relavent the more I think about it - elmininating the > group and container FD uAPI by directly creating the device FD also > sidesteps questions about how to model these objects in a /dev/ioasid > only world. We simply don't have them at all so the answer is pretty > easy. I'm not sold. Ideally each device would be fully isolated, then we could assume a 1:1 relation of group and device and collapse the model to work on devices. We don't live in that world and I see a benefit to making that explicit in the uapi, even if that group fd might seem superfluous at times. Thanks, Alex