Reply-To: eric.auger@redhat.com
Subject: Re: [RFC v2] /dev/iommu uAPI proposal
To: "Tian, Kevin", Jason Gunthorpe, "Alex Williamson (alex.williamson@redhat.com)", Jean-Philippe Brucker, David Gibson, Jason Wang, "parav@mellanox.com", "Enrico Weigelt, metux IT consult", Paolo Bonzini, Shenming Lu, Joerg Roedel
Cc: Jonathan Corbet, "Raj, Ashok", "Liu, Yi L", "Wu, Hao", "Jiang, Dave", Jacob Pan, Kirti Wankhede, Robin Murphy, "kvm@vger.kernel.org", "iommu@lists.linux-foundation.org", David Woodhouse, LKML, Lu Baolu
From: Eric Auger
Date: Tue, 10 Aug 2021 09:17:10 +0200
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Kevin,

On 8/5/21 2:36 AM, Tian, Kevin wrote:
>> From: Eric Auger
>> Sent: Wednesday, August 4, 2021 11:59 PM
>>
> [...]
>>> 1.2. Attach Device to I/O address space
>>> +++++++++++++++++++++++++++++++++++++++
>>>
>>> Device attach/bind is initiated through passthrough framework uAPI.
>>>
>>> Device attaching is allowed only after a device is successfully bound to
>>> the IOMMU fd. User should provide a device cookie when binding the
>>> device through VFIO uAPI. This cookie is used when the user queries
>>> device capability/format, issues per-device iotlb invalidation and
>>> receives per-device I/O page fault data via IOMMU fd.
>>>
>>> Successful binding puts the device into a security context which isolates
>>> its DMA from the rest system.
>>> VFIO should not allow user to access the
>> s/from the rest system/from the rest of the system
>>> device before binding is completed. Similarly, VFIO should prevent the
>>> user from unbinding the device before user access is withdrawn.
>> With Intel scalable IOV, I understand you could assign an RID/PASID to
>> one VM and another one to another VM (which is not the case for ARM). Is
>> it a targeted use case? How would it be handled? Is it related to the
>> sub-groups mentioned hereafter?
> Not related to sub-group. Each mdev is bound to the IOMMU fd respectively
> with the defPASID which represents the mdev.

But how does it work in terms of security? The device (RID) is bound to
an IOMMU fd, but then each SID/PASID may be working for a different VM.
How do you detect that this is safe, i.e. that each SID can work safely
for a different VM, versus the ARM case where this is not possible?

1.3 says
"
1) A successful binding call for the first device in the group creates
the security context for the entire group, by:
"
What does it mean for the above scalable IOV use case?

>
>> Actually all devices bound to an IOMMU fd should have the same parent
>> I/O address space or root address space, am I correct? If so, maybe add
>> this comment explicitly?
> in most cases yes but it's not mandatory. multiple roots are allowed
> (e.g. with vIOMMU but no nesting).

OK, right, this corresponds to example 4.2, for instance. I misinterpreted
the notion of security context. The security context does not match the
IOMMU fd but is something implicitly created on the first device binding.

>
> [...]
>>> The device in the /dev/iommu context always refers to a physical one
>>> (pdev) which is identifiable via RID. Physically each pdev can support
>>> one default I/O address space (routed via RID) and optionally multiple
>>> non-default I/O address spaces (via RID+PASID).
>>>
>>> The device in VFIO context is a logic concept, being either a physical
>>> device (pdev) or mediated device (mdev or subdev).
>>> Each vfio device
>>> is represented by RID+cookie in IOMMU fd. User is allowed to create
>>> one default I/O address space (routed by vRID from user p.o.v) per
>>> each vfio_device.
>> The concept of default address space is not fully clear to me. I
>> currently understand this is a
>> root address space (not nesting). Is that correct? This may need
>> clarification.
> w/o PASID there is only one address space (either GPA or GIOVA)
> per device. This one is called default. whether it's root is orthogonal
> (e.g. GIOVA could be also nested) to the device view of this space.
>
> w/ PASID additional address spaces can be targeted by the device.
> those are called non-default.
>
> I could also rename default to RID address space and non-default to
> RID+PASID address space if doing so makes it clearer.

Yes, I think it is worth having a kind of glossary that defines root
address space and default address space, as you clearly did for
child/parent.

>
>>> VFIO decides the routing information for this default
>>> space based on device type:
>>>
>>> 1) pdev, routed via RID;
>>>
>>> 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
>>>    the parent's RID plus the PASID marking this mdev;
>>>
>>> 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
>>>    need to install the I/O page table in the IOMMU. sw mdev just uses
>>>    the metadata to assist its internal DMA isolation logic on top of
>>>    the parent's IOMMU page table;
>> Maybe you should introduce this concept of SW mediated device earlier
>> because it seems to special case the way the attach behaves. I am
>> especially referring to
>>
>> "Successful attaching activates an I/O address space in the IOMMU, if the
>> device is not purely software mediated"
> makes sense.
>
>>> In addition, VFIO may allow user to create additional I/O address spaces
>>> on a vfio_device based on the hardware capability.
>>> In such case the user
>>> has its own view of the virtual routing information (vPASID) when marking
>>> these non-default address spaces.
>> I do not understand what "marking these non-default address spaces" means.
> as explained above, those non-default address spaces are identified/routed
> via PASID.
>
>>> 1.3. Group isolation
>>> ++++++++++++++++++++
> [...]
>>> 1) A successful binding call for the first device in the group creates
>>>    the security context for the entire group, by:
>>>
>>>    * Verifying group viability in a similar way as VFIO does;
>>>
>>>    * Calling IOMMU-API to move the group into a block-dma state,
>>>      which makes all devices in the group attached to an block-dma
>>>      domain with an empty I/O page table;
>> This block-dma state/domain deserves a better definition (I know
>> you already mentioned it in 1.1 with the dma mapping protocol, though).
>> Does it activate an empty I/O page table in the IOMMU (if the device is
>> not purely SW mediated)?
> sure. some explanations are scattered in following paragraph, but I
> can consider to further clarify it.
>
>> How does that relate to the default address space? Is it the same?
> different. this block-dma domain doesn't hold any valid mapping. The
> default address space is represented by a normal unmanaged domain.
> the ioasid attaching operation will detach the device from the block-dma
> domain and then attach it to the target ioasid.

OK

Thanks

Eric

>
>>> 2. uAPI Proposal
>>> ----------------------
> [...]
>>> /*
>>>  * Allocate an IOASID.
>>>  *
>>>  * IOASID is the FD-local software handle representing an I/O address
>>>  * space. Each IOASID is associated with a single I/O page table. User
>>>  * must call this ioctl to get an IOASID for every I/O address space that is
>>>  * intended to be tracked by the kernel.
>>>  *
>>>  * User needs to specify the attributes of the IOASID and associated
>>>  * I/O page table format information according to one or multiple devices
>>>  * which will be attached to this IOASID right after. The I/O page table
>>>  * is activated in the IOMMU when it's attached by a device. Incompatible
>> .. if not SW mediated
>>>  * format between device and IOASID will lead to attaching failure.
>>>  *
>>>  * The root IOASID should always have a kernel-managed I/O page
>>>  * table for safety. Locked page accounting is also conducted on the root.
>> The definition of root IOASID is not easily found in this spec. Maybe
>> this would deserve some clarification.
> make sense.
>
> and thanks for other typo-related comments.
>
> Thanks
> Kevin
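
For reference, the routing rules for the default I/O address space quoted
in this thread (pdev routed via RID, mdev/subdev routed via the parent's RID
plus a PASID, sw mdev with no routing and no I/O page table installed) can be
sketched in C roughly as below. This is only an illustration under stated
assumptions: the enum, the struct and the function default_space_routing()
are hypothetical names made up for this sketch. They are not part of the
proposed uAPI or of any existing kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device kinds, named after the three cases in the proposal. */
enum vfio_dev_kind {
	DEV_PDEV,	/* physical device */
	DEV_HW_MDEV,	/* mdev/subdev with IOMMU-enforced DMA isolation */
	DEV_SW_MDEV,	/* purely software-mediated device */
};

/* Hypothetical routing record for a device's default I/O address space. */
struct default_routing {
	bool install_pgtable;	/* must an I/O page table be installed? */
	bool use_pasid;		/* routed via RID+PASID rather than RID only? */
	uint16_t rid;		/* RID used for routing, 0 if unused */
	uint32_t pasid;		/* meaningful only when use_pasid is true */
};

/*
 * Pick the routing information for the default address space.
 * 'rid' is the device's own RID (pdev case); 'parent_rid' and 'def_pasid'
 * are the parent's RID and the PASID marking the mdev (mdev case).
 */
static struct default_routing
default_space_routing(enum vfio_dev_kind kind, uint16_t rid,
		      uint16_t parent_rid, uint32_t def_pasid)
{
	struct default_routing r = { false, false, 0, 0 };

	switch (kind) {
	case DEV_PDEV:
		/* 1) pdev, routed via RID */
		r.install_pgtable = true;
		r.rid = rid;
		break;
	case DEV_HW_MDEV:
		/* 2) routed via the parent's RID plus the mdev's PASID */
		r.install_pgtable = true;
		r.use_pasid = true;
		r.rid = parent_rid;
		r.pasid = def_pasid;
		break;
	case DEV_SW_MDEV:
		/* 3) no routing: nothing is installed in the IOMMU */
		break;
	}
	return r;
}
```

The sw mdev case returns an all-clear record on purpose: per the quoted
text, such a device only uses the metadata on top of the parent's IOMMU
page table, so an attach would not activate anything in the IOMMU.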