From: "Tian, Kevin"
To: Jason Gunthorpe, "Alex Williamson (alex.williamson@redhat.com)",
    "Jean-Philippe Brucker", David Gibson, Jason Wang,
    "parav@mellanox.com", "Enrico Weigelt, metux IT consult",
    Paolo Bonzini, Shenming Lu, Joerg Roedel, Eric Auger
CC: Jonathan Corbet, "Raj, Ashok", "Tian, Kevin", "Liu, Yi L", "Wu, Hao",
    "Jiang, Dave", Jacob Pan, Kirti Wankhede, "Robin Murphy",
    "kvm@vger.kernel.org", "iommu@lists.linux-foundation.org",
    "David Woodhouse", LKML, "Lu Baolu"
Subject: [RFC v2] /dev/iommu uAPI proposal
Date: Fri, 9 Jul 2021 07:48:44 +0000

/dev/iommu provides a unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own
logic to isolate untrusted device DMAs initiated by userspace.

This proposal describes the uAPI of /dev/iommu and also sample sequences
with VFIO as an example in typical usages. The driver-facing kernel API
provided by the iommu layer is still TBD, which can be discussed after
consensus is reached on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/20210330132830.GO2356281@nvidia.com/

v1 can be found here:
https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/

This doc is also tracked on github, though it's not very useful for
v1->v2 given dramatic refactoring:
https://github.com/luxis1999/dev_iommu_uapi

Changelog (v1->v2):
- Rename /dev/ioasid to /dev/iommu (Jason);
- Add a section for device-centric vs.
  group-centric design (many);
- Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
- Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
- Allow one device bound to multiple iommu fd's (Jason);
- No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
- Add a device cookie for iotlb invalidation and fault handling
  (Jean/Jason);
- Add capability/format query interface per device cookie (Jason);
- Specify format/attribute when creating an IOASID, leading to several
  v1 uAPI commands removed (Jason);
- Explain the value of software nesting (Jean);
- Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting
  (David/Jason);
- Cover software mdev usage (Jason);
- No restriction on map/unmap vs. bind/invalidate (Jason/David);
- Report permitted IOVA range instead of reserved range (David);
- Refine the sample structures and helper functions (Jason);
- Add definition of default and non-default I/O address spaces;
- Expand and clarify the design for PASID virtualization;
- and lots of subtle refinement according to above changes;

TOC
====
1. Terminologies and Concepts
    1.1. Manage I/O address space
    1.2. Attach device to I/O address space
    1.3. Group isolation
    1.4. PASID virtualization
        1.4.1. Devices which don't support DMWr
        1.4.2. Devices which support DMWr
        1.4.3. Mix different types together
        1.4.4. User sequence
    1.5. No-snoop DMA
2. uAPI Proposal
    2.1. /dev/iommu uAPI
    2.2. /dev/vfio device uAPI
    2.3. /dev/kvm uAPI
3. Sample Structures and Helper Functions
4. Use Cases and Flows
    4.1. A simple example
    4.2. Multiple IOASIDs (no nesting)
    4.3. IOASID nesting (software)
    4.4. IOASID nesting (hardware)
    4.5. Guest SVA (vSVA)
    4.6. I/O page fault
====

1. Terminologies and Concepts
-----------------------------------------

IOMMU fd is the container holding multiple I/O address spaces. User
manages those address spaces through fd operations.
Multiple fd's are allowed per process, but with this proposal one fd
should be sufficient for all intended usages.

IOASID is the fd-local software handle representing an I/O address space.
Each IOASID is associated with a single I/O page table. IOASIDs can be
nested together, implying the output address from one I/O page table
(represented by child IOASID) must be further translated by another I/O
page table (represented by parent IOASID).

An I/O address space takes effect only after it is attached by a device.
One device is allowed to attach to multiple I/O address spaces. One I/O
address space can be attached by multiple devices.

A device must be bound to an IOMMU fd before an attach operation can be
conducted. Though not necessary, the user could bind one device to
multiple IOMMU fd's. But no cross-fd IOASID nesting is allowed.

The format of an I/O page table must be compatible with the attached
devices (or more specifically with the IOMMU which serves the DMA from
the attached devices). User is responsible for specifying the format
when allocating an IOASID, according to one or multiple devices which
will be attached right after. Attaching a device to an IOASID with an
incompatible format is simply rejected.

Relationship between IOMMU fd, VFIO fd and KVM fd:

-   IOMMU fd provides uAPI for managing IOASIDs and I/O page tables.
    It also provides a unified capability/format reporting interface for
    each bound device.

-   VFIO fd provides uAPI for device binding and attaching. In this
    proposal VFIO is used as the example of device passthrough
    frameworks. The routing information that identifies an I/O address
    space on the wire is per-device and registered to IOMMU fd via VFIO
    uAPI.

-   KVM fd provides uAPI for handling no-snoop DMA and PASID
    virtualization in CPU (when PASID is carried in instruction payload).

1.1. Manage I/O address space
+++++++++++++++++++++++++++++

An I/O address space can be created in three ways, according to how
the corresponding I/O page table is managed:

-   kernel-managed I/O page table which is created via IOMMU fd, e.g.
    for IOVA space (dpdk), GPA space (Qemu), GIOVA space (vIOMMU), etc.

-   user-managed I/O page table which is created by the user, e.g. for
    GIOVA/GVA space (vIOMMU), etc.

-   shared kernel-managed CPU page table which is created by another
    subsystem, e.g. for process VA space (mm), GPA space (kvm), etc.

The first category is managed via a dma mapping protocol (similar to
existing VFIO iommu type1), which allows the user to explicitly specify
which range in the I/O address space should be mapped.

The second category is managed via an iotlb protocol (similar to the
underlying IOMMU semantics). Once the user-managed page table is
bound to the IOMMU, the user can invoke an invalidation command
to update the kernel-side cache (either in software or in physical
IOMMU). In the meantime, a fault reporting/completion mechanism is also
provided for the user to fix up potential I/O page faults.

The last category is supposed to be managed via the subsystem which
actually owns the shared address space. Likely what's minimally required
in /dev/iommu uAPI is to build the connection with the address space
owner when allocating the IOASID, so an in-kernel interface (e.g.
mmu_notifier) is activated for any required synchronization between
IOMMU fd and the space owner.

This proposal focuses on how to manage the first two categories, as
they are existing and more urgent requirements. Support of the last
category can be discussed when a real usage comes in the future.

The user needs to specify the desired management protocol and page
table format when creating a new I/O address space. Before allocating
the IOASID, the user should already know at least one device that will be
attached to this space.
The user is expected to first query (via IOMMU fd) the supported
capabilities and page table format information of the to-be-attached
device (or a common set between multiple devices) and then choose a
compatible format to set on the IOASID.

I/O address spaces can be nested together, called IOASID nesting. IOASID
nesting can be implemented in two ways: hardware nesting and software
nesting. With hardware support the child and parent I/O page tables are
walked consecutively by the IOMMU to form a nested translation. When
it's implemented in software, /dev/iommu is responsible for merging the
two-level mappings into a single-level shadow I/O page table.

A user-managed I/O page table can be set up only on the child IOASID,
implying IOASID nesting must be enabled. This is because the kernel
doesn't trust userspace. Nesting allows the kernel to enforce its DMA
isolation policy through the parent IOASID.

Software nesting is useful in several scenarios. First, it allows
centralized accounting on locked pages between multiple root IOASIDs
(no parent). In this case a 'dummy' IOASID can be created with an
identity mapping (HVA->HVA), dedicated for page pinning/accounting and
nested by all root IOASIDs. Second, it's also useful for mdev drivers
(e.g. kvmgt) to write-protect guest structures when vIOMMU is enabled.
In this case the protected addresses are in GIOVA space while KVM
write-protection API is based on GPA. Software nesting allows finding
GPA according to GIOVA in the kernel.

1.2. Attach Device to I/O address space
+++++++++++++++++++++++++++++++++++++++

Device attach/bind is initiated through passthrough framework uAPI.

Device attaching is allowed only after a device is successfully bound to
the IOMMU fd. User should provide a device cookie when binding the
device through VFIO uAPI.
This cookie is used when the user queries device capability/format,
issues per-device iotlb invalidation and receives per-device I/O page
fault data via IOMMU fd.

Successful binding puts the device into a security context which isolates
its DMA from the rest of the system. VFIO should not allow user to access
the device before binding is completed. Similarly, VFIO should prevent
the user from unbinding the device before user access is withdrawn.

When a device is in an iommu group which contains multiple devices,
all devices within the group must enter/exit the security context
together. Please check {1.3} for more info about group isolation via
this device-centric design.

Successful attaching activates an I/O address space in the IOMMU,
if the device is not purely software mediated. VFIO must provide device
specific routing information for where to install the I/O page table in
the IOMMU for this device. VFIO must also guarantee that the attached
device is configured to compose DMAs with the routing information that
is provided in the attaching call. When handling DMA requests, IOMMU
identifies the target I/O address space according to the routing
information carried in the request. Misconfiguration breaks DMA
isolation and thus could lead to a severe security vulnerability.

Routing information is per-device and bus specific. For PCI, it is
Requester ID (RID) identifying the device plus optional Process Address
Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
ID (SSID). PASID or SSID is used when multiple I/O address spaces are
enabled on a single device. For simplicity and continuity reasons the
following context uses RID+PASID though SID+SSID may sound a clearer
naming from device p.o.v. We can decide the actual naming when coding.
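
To make the routing tuple above concrete, here is a minimal C sketch of
a RID plus optional PASID. It is only an illustration: struct
routing_info, pci_rid(), route_rid(), route_rid_pasid() and
routing_match() are hypothetical names that do not appear in this
proposal. The RID layout (bus in bits 15:8, device in bits 7:3, function
in bits 2:0) is the usual PCI convention, and the PASID is assumed to be
at most 20 bits wide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PASID_BITS 20u
#define PASID_MAX  ((1u << PASID_BITS) - 1)

/* Hypothetical per-device routing tuple: RID plus optional PASID. */
struct routing_info {
	uint16_t rid;       /* PCI Requester ID (bus/device/function) */
	bool     has_pasid; /* false: default I/O address space, RID only */
	uint32_t pasid;     /* meaningful only when has_pasid is true */
};

/* Compose a RID from bus, device and function numbers. */
static inline uint16_t pci_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
	assert(dev < 32 && fn < 8);
	return (uint16_t)((bus << 8) | (dev << 3) | fn);
}

/* Routing for a default I/O address space (RID only). */
static inline struct routing_info route_rid(uint16_t rid)
{
	struct routing_info r = { .rid = rid, .has_pasid = false, .pasid = 0 };
	return r;
}

/* Routing for a non-default I/O address space (RID+PASID). */
static inline struct routing_info route_rid_pasid(uint16_t rid, uint32_t pasid)
{
	assert(pasid <= PASID_MAX);
	struct routing_info r = { .rid = rid, .has_pasid = true, .pasid = pasid };
	return r;
}

/* True when a request tagged with 'req' targets the space installed at 'ent'. */
static inline bool routing_match(struct routing_info ent, struct routing_info req)
{
	if (ent.rid != req.rid || ent.has_pasid != req.has_pasid)
		return false;
	return !ent.has_pasid || ent.pasid == req.pasid;
}
```

In this model a request carrying only a RID never matches an entry
installed with RID+PASID, which mirrors the split between the default
and non-default I/O address spaces of a device.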

Because one I/O address space can be attached by multiple devices,
per-device routing information (plus device cookie) is tracked under
each IOASID and is used respectively when activating the I/O address
space in the IOMMU for each attached device.

The device in the /dev/iommu context always refers to a physical one
(pdev) which is identifiable via RID. Physically each pdev can support
one default I/O address space (routed via RID) and optionally multiple
non-default I/O address spaces (via RID+PASID).

The device in VFIO context is a logical concept, being either a physical
device (pdev) or mediated device (mdev or subdev). Each vfio device
is represented by RID+cookie in IOMMU fd. User is allowed to create
one default I/O address space (routed by vRID from user p.o.v) per
vfio_device. VFIO decides the routing information for this default
space based on device type:

1)  pdev, routed via RID;

2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via
    the parent's RID plus the PASID marking this mdev;

3)  a purely sw-mediated device (sw mdev), no routing required i.e. no
    need to install the I/O page table in the IOMMU. sw mdev just uses
    the metadata to assist its internal DMA isolation logic on top of
    the parent's IOMMU page table;

In addition, VFIO may allow user to create additional I/O address spaces
on a vfio_device based on the hardware capability. In such case the user
has its own view of the virtual routing information (vPASID) when marking
these non-default address spaces. How to virtualize vPASID is platform
specific and device specific. Some platforms allow the user to fully
manage the PASID space, thus vPASIDs are directly used for routing and
even hidden from the kernel. Other platforms require the user to
explicitly register the vPASID information to the kernel when attaching
the vfio_device.
In this case VFIO must figure out whether vPASID should be directly used
(pdev) or converted to a kernel-allocated pPASID (mdev) for physical
routing. A detailed explanation of PASID virtualization can be found in
{1.4}.

For mdev both default and non-default I/O address spaces are routed
via PASIDs. To better differentiate them we use "default PASID" (or
defPASID) when talking about the default I/O address space on mdev. When
vPASID or pPASID is referred to in PASID virtualization it's all about
the non-default spaces. defPASID and pPASID are always hidden from
userspace and can only be indirectly referenced via IOASID.

1.3. Group isolation
++++++++++++++++++++

Group is the minimal object when talking about DMA isolation in the
iommu layer. Devices which cannot be isolated from each other are
organized into a single group. Lack of isolation could be caused by
multiple reasons: no ACS capability in the upstream port, behind a
PCIe-to-PCI bridge (thus sharing RID), or DMA aliasing (multiple RIDs
per device), etc.

All devices in the group must be put in a security context together
before one or more devices in the group are operated by an untrusted
user. Passthrough frameworks must guarantee that:

1)  No user access is granted on a device before a security context is
    established for the entire group (becomes viable).

2)  Group viability is not broken before the user relinquishes the
    device. This implies that devices in the group must be either
    assigned to this user, or driver-less, or bound to a driver which is
    known safe (does not do DMA).

3)  The security context should not be destroyed before user access
    permission is withdrawn.
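
Guarantee 2) above can be read as a per-group predicate. The sketch
below models it in C under stated assumptions: enum dev_state, struct
group_dev and group_viable() are hypothetical names used only for
illustration, and this is not how the iommu layer or VFIO actually
implements the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical binding state of one device in an iommu group. */
enum dev_state {
	DEV_DRIVERLESS,  /* no driver bound */
	DEV_SAFE_DRIVER, /* bound to a driver known not to do DMA */
	DEV_ASSIGNED,    /* assigned to a user, see 'owner' */
	DEV_HOST_DRIVER, /* bound to an ordinary host driver */
};

struct group_dev {
	enum dev_state state;
	int owner;       /* user id, meaningful only for DEV_ASSIGNED */
};

/*
 * Return true if the whole group is viable for 'user': every device is
 * assigned to that user, driver-less, or bound to a known-safe driver.
 */
static inline bool group_viable(const struct group_dev *devs, size_t n, int user)
{
	assert(devs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		switch (devs[i].state) {
		case DEV_DRIVERLESS:
		case DEV_SAFE_DRIVER:
			break;
		case DEV_ASSIGNED:
			if (devs[i].owner != user)
				return false;
			break;
		default:
			/* any other driver may issue DMA outside the security context */
			return false;
		}
	}
	return true;
}
```

A passthrough framework following this model would evaluate the
predicate before granting user access and again whenever the driver
binding status of any device in the group changes.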

Existing VFIO introduces explicit container and group semantics in
its uAPI to meet the above requirements:

1)  VFIO user can open a device fd only after:

    * A container is created;
    * The group is attached to the container (VFIO_GROUP_SET_CONTAINER);
    * An empty I/O page table is created in the container
      (VFIO_SET_IOMMU);
    * Group viability is passed and the entire group is attached to
      the empty I/O page table (the security context);

2)  VFIO monitors driver binding status to verify group viability:

    * IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
    * BUG_ON() if group viability is broken;

3)  Detach the group from the container when the last device fd in the
    group is closed and destroy the I/O page table only after the last
    group is detached from the container.

With this proposal VFIO can move to a simpler device-centric model by
directly exposing device nodes under "/dev/vfio/devices" w/o using
container and group uAPI at all. In this case group isolation is enforced
implicitly within IOMMU fd:

1)  A successful binding call for the first device in the group creates
    the security context for the entire group, by:

    * Verifying group viability in a similar way as VFIO does;
    * Calling IOMMU-API to move the group into a block-dma state,
      which makes all devices in the group attached to a block-dma
      domain with an empty I/O page table;

    VFIO should not allow the user to mmap the MMIO bar of the bound
    device until the binding call succeeds.

    Binding other devices in the same group just succeeds since the
    security context has already been established for the entire group.

2)  IOMMU fd monitors driver binding status in case group viability is
    broken, same as VFIO does today. BUG_ON() might be eliminated if we
    can find a way to deny probe of non-iommu-safe drivers.

    Before a device is unbound from IOMMU fd, it is always attached to a
    security context (either the block-dma domain or an IOASID domain).
    Switching between the two domains is initiated by attaching the
    device to or detaching it from an IOASID. The IOMMU layer should
    ensure that the default domain is not implicitly re-attached in the
    switching process, before the group is moved out of the block-dma
    state.

    To stay on par with legacy VFIO, IOMMU fd could verify that all
    bound devices in the same group must be attached to a single IOASID.

3)  When a device fd is closed, VFIO automatically unbinds the device
    from IOMMU fd before zapping the mmio mapping. Unbinding the last
    device in the group moves the entire group out of the block-dma
    state and re-attaches it to the default domain.

Actual implementation may use a staging approach, e.g. only support
one-device group in the start (leaving multi-devices group handled via
legacy VFIO uAPI) and then cover multi-devices group in a later stage.

If necessary, devices within a group may be further allowed to be
attached to different IOASIDs in the same IOMMU fd, in case that the
source devices can be reliably identified (e.g. due to !ACS). This will
require additional sub-group logic in the iommu layer and with
sub-group topology exposed to userspace. But there is no expectation of
changing the device-centric semantics except introducing sub-group
awareness within IOMMU fd.

A more detailed explanation of the staging approach can be found here:
https://lore.kernel.org/linux-iommu/BN9PR11MB543382665D34E58155A9593C8C039@BN9PR11MB5433.namprd11.prod.outlook.com/

1.4. PASID Virtualization
+++++++++++++++++++++++++

As explained in {1.2}, PASID virtualization is required when multiple I/O
address spaces are supported on a device. The actual policy is per-device
and thus defined by the specific VFIO device driver.

A PASID virtualization policy is defined by four aspects:

1)  Whether this device allows the user to create multiple I/O address
    spaces (vPASID capability). This is decided upon whether this device
    and its upstream IOMMU both support PASID.

2)  If yes, whether the PASID space is delegated to the user, based on
    whether the PASID table should be managed by user or kernel.

3)  If not delegated, the user should register vPASID to the kernel.
    Then the next question is whether vPASID should be directly used for
    physical routing (vPASID==pPASID or vPASID!=pPASID). The key is
    whether this device must share the PASID space with others (pdev vs.
    mdev).

4)  If vPASID!=pPASID, whether pPASID should be allocated from the
    per-RID space or a global space. This is about whether the device
    supports PCIe DMWr-type work submission (e.g. Intel ENQCMD) which
    requires global pPASID allocation across multiple devices.

Only vPASIDs are part of the VM state to be migrated in VM live
migration. This is basically about the virtual PASID table state in
vendor vIOMMU. If vPASID!=pPASID, new pPASIDs will be re-allocated on
the destination and VFIO device driver is responsible for programming
the device to use the new pPASID when restoring the device state.

Different policies may imply different uAPI semantics for user to follow
when attaching a device. The semantics information is expected to be
reported to the user via VFIO uAPI instead of via IOMMU fd, since the
latter only cares about pPASID. But if there is a different thought we'd
like to hear it.

Following sections (1.4.1 - 1.4.3) provide a detailed explanation of how
the above are selected on different device types and the implication when
multiple types are mixed together (i.e. assigned to a single user). The
last section (1.4.4) then summarizes what uAPI semantics information is
reported and how user is expected to deal with it.

1.4.1. Devices which don't support DMWr
***************************************

This section is about the following types:

1)  a pdev which doesn't issue PASID;
2)  a sw mdev which doesn't issue PASID;
3)  a mdev which is programmed a fixed defPASID (for default I/O address
    space), but does not expose vPASID capability;

4)  a pdev which exposes vPASID and has its PASID table managed by user;
5)  a pdev which exposes vPASID and has its PASID table managed by
    kernel;
6)  a mdev which exposes vPASID and shares the parent's PASID table
    with other mdev's;

+--------+---------+---------+----------+-----------+
|        |         |Delegated| vPASID== |  per-RID  |
|        | vPASID  | to user |  pPASID  |  pPASID   |
+========+=========+=========+==========+===========+
| type-1 |   N/A   |   N/A   |   N/A    |    N/A    |
+--------+---------+---------+----------+-----------+
| type-2 |   N/A   |   N/A   |   N/A    |    N/A    |
+--------+---------+---------+----------+-----------+
| type-3 |   N/A   |   N/A   |   N/A    |    N/A    |
+--------+---------+---------+----------+-----------+
| type-4 |   Yes   |   Yes   | v==p(*)  | per-RID(*)|
+--------+---------+---------+----------+-----------+
| type-5 |   Yes   |   No    |   v==p   |  per-RID  |
+--------+---------+---------+----------+-----------+
| type-6 |   Yes   |   No    |   v!=p   |  per-RID  |
+--------+---------+---------+----------+-----------+
<* conceptual definition though the PASID space is fully delegated>

for 1-3 there is no vPASID capability exposed and the user can create
only one default I/O address space on this device. Thus there is no PASID
virtualization at all.

4) is specific to ARM/AMD platforms where the PASID table is managed by
the user. In this case the entire PASID space is delegated to the user
which just needs to create a single IOASID linked to the user-managed
PASID table, as placeholder covering all non-default I/O address spaces
on pdev.
In concept this looks like a big 84bit address space (20bit
PASID + 64bit addr). vPASID may be carried in the uAPI data to help define
the operation scope when invalidating IOTLB or reporting I/O page fault.
IOMMU fd doesn't touch it and just acts as a channel for vIOMMU/pIOMMU to
exchange info.

5) is specific to Intel platforms where the PASID table is managed by
the kernel. In this case vPASIDs should be registered to the kernel
in the attaching call. This implies that every non-default I/O address
space on pdev is explicitly tracked by a unique IOASID in the kernel.
Because pdev is fully controlled by the user, its DMA request carries
vPASID as the routing information thus requires VFIO device driver to
adopt vPASID==pPASID policy. Because an IOASID already represents a
standalone address space, there is no need to further carry vPASID in
the invalidation and fault paths.

6) is about mdev, as those enabled by Intel Scalable IOV. The main
difference from type-5) is on whether vPASID==pPASID. There is
only a single PASID table per the parent device, implying the per-RID
PASID space shared by all mdevs created on this parent. VFIO device
driver must use vPASID!=pPASID policy and allocate a pPASID from the
per-RID space for every registered vPASID to guarantee DMA isolation
between sibling mdev's. VFIO device driver needs to conduct
vPASID->pPASID conversion properly in several paths:

  - When VFIO device driver provides the routing information in the
    attaching call, since IOMMU fd only cares about pPASID;
  - When VFIO device driver updates a PASID MMIO register in the
    parent according to the vPASID intercepted in the mediation path;

1.4.2. Devices which support DMWr
*********************************

Modern devices may support a scalable workload submission interface
based on PCI Deferrable Memory Write (DMWr) capability, allowing a
single work queue to access multiple I/O address spaces.
One example using DMWr is Intel ENQCMD, having PASID saved in the CPU
MSR and carried in the non-posted DMWr payload when sent out to the
device. Then a single work queue shared by multiple processes can compose
DMAs toward different address spaces, by carrying the PASID value
retrieved from the DMWr payload. The role of DMWr is allowing the
shared work queue to return a retry response when the work queue
is under pressure (due to capacity or QoS). Upon such response the
software could try re-submitting the descriptor.

When ENQCMD is executed in the guest, the value saved in the CPU
MSR is vPASID (part of the xsave state). This creates another point for
consideration regarding PASID virtualization.

Two device types are relevant:

7)  a pdev same as 5) plus DMWr support;
8)  a mdev same as 6) plus DMWr support;

and respective policies:

+--------+---------+---------+----------+-----------+
|        |         |Delegated| vPASID== |  per-RID  |
|        | vPASID  | to user |  pPASID  |  pPASID   |
+========+=========+=========+==========+===========+
| type-7 |   Yes   |   Yes   |   v==p   |  per-RID  |
+--------+---------+---------+----------+-----------+
| type-8 |   Yes   |   Yes   |   v!=p   |  global   |
+--------+---------+---------+----------+-----------+

DMWr or shared mode is configurable per work queue. It's completely
sane if an assigned device with multiple queues needs to handle both
DMWr (shared work queue) and normal write (dedicated work queue)
simultaneously. Thus the PASID virtualization policy must be consistent
when both paths are activated.

for 7) we should use the same policy as 5), i.e. directly using vPASID
for physical routing on pdev. In this case ENQCMD in the guest just works
w/o additional work because the vPASID saved in the PASID_MSR
matches the routing information configured for the target I/O address
space in the IOMMU.
When receiving a DMWr request, the shared
work queue grabs vPASID from the payload and then tags outgoing
DMAs with vPASID. This is consistent with the dedicated work queue
path where vPASID is grabbed from the MMIO register to tag DMAs.

for 8) vPASID in the PASID_MSR must be converted to pPASID before
being sent to the wire (given vPASID!=pPASID for the same reason as 6).
Intel CPU provides a hardware PASID translation capability for
auto-conversion when ENQCMD is being executed. In this case the payload
received by the work queue contains pPASID thus outgoing DMAs are
tagged with pPASID. This is consistent with the dedicated work
queue path where pPASID is programmed to the MMIO register in the
mediation path and then grabbed to tag DMAs.

However, the CPU translation structure is per-VM which implies that
the same pPASID must be used across all type-8 devices (of this VM)
given a vPASID. This requires the pPASID to be allocated from a global
pool by the first type-8 device and then shared by the following type-8
devices when they are attached to the same vPASID.

CPU translation capability is enabled via KVM uAPI. We need a secure
contract between VFIO device fd and KVM fd so VFIO device driver knows
when it's secure to allow guest access to the cmd portal of the type-8
device. It's dangerous to allow the guest to issue ENQCMD to the
device before CPU is ready for PASID translation. In this window the
vPASID is untranslated and thus lets the guest access a random I/O
address space on the parent of this mdev.

We plan to utilize existing kvm-vfio contract. It is currently used for
multiple purposes including propagating the kvm pointer to the VFIO
device driver. It can be extended to further notify whether CPU PASID
translation capability is turned on. Before receiving this notification,
the VFIO device driver should not allow user to access the DMWr-capable
work queue on type-8 device.

1.4.3. Mix different types together
***********************************

In most cases mixing different types doesn't change the aforementioned
PASID virtualization policy for each type. The user just needs to handle
them on a per-device basis.

There is one exception though, when mixing type 7) and 8) together,
due to conflicting policies on how PASID_MSR should be handled.
For mdev (type-8) the CPU translation capability must be enabled to
prevent a malicious guest from doing bad things. But once per-VM
PASID translation is enabled, the shared work queue of pdev (type-7)
will also receive a pPASID allocated for mdev instead of the vPASID
that is expected on this pdev.

Fixing this exception for pdev is not easy. There are three options.

One is moving pdev to also accept pPASID. Because pdev may have both
shared work queue (PASID in MSR) and dedicated work queue (PASID
in MMIO) enabled by the guest, this requires VFIO device driver to
mediate the dedicated work queue path so vPASIDs programmed by
the guest are manually translated to pPASIDs before written to the
pdev. This may add undesired software complexity and potential
performance impact if the PASID register is located alongside other
fast-path resources in the same 4K page. If it works it essentially
converts type-7 to type-8 from user p.o.v.

The second option is using an enlightened approach so the guest
directly uses the host-allocated pPASIDs instead of creating its own
vPASID space. In this case even the dedicated work queue path uses
pPASID w/o the need of mediation. However this requires different uAPI
semantics (from register-vPASID to return-pPASID) and exposes pPASID
knowledge to userspace which also implies breaking VM live migration.

The third option is making pPASID an alias routing info to vPASID
and having both linked to the same I/O page table in the IOMMU, so
either way can hit the desired address space.
This further requires some sort of range split scheme to avoid conflict
between vPASID and pPASID. However, we haven't found a clear way to fold
this trick into this uAPI proposal yet. And this option may not work
when PASID is also used to tag the IMS entry for verifying the interrupt
source. In this case there is no room for aliasing.

So, none of above can work cleanly based on current thoughts. We
decide to not support type-7/8 mix in this proposal. User could detect
this exception based on reported PASID flags, as outlined in next section.

1.4.4. User sequence
********************

A new PASID capability info could be introduced to VFIO_DEVICE_GET_INFO.
The presence indicates allowing the user to create multiple I/O address
spaces with vPASID on the device. This capability further includes
following flags to help describe the desired uAPI semantics:

  - PASID_DELEGATED;	// PASID space delegated to the user?
  - PASID_CPU;		// Allow vPASID used in the CPU?
  - PASID_CPU_VIRT;	// Require vPASID translation in the CPU?

The last two flags together help the user to detect the unsupported
type 7/8 mix scenario.

Take Qemu for example. It queries above flags for every vfio device at
initialization time, after identifying the PASID capability:

1)  If PASID_DELEGATED is set, the PASID space is fully managed by the
    user thus a single IOASID (linked to user-managed page table) is
    required as the placeholder for all non-default I/O address spaces
    on the device.

    If not set, an IOASID must be created for every non-default I/O
    address space on this device and vPASID must be registered to the
    kernel when attaching the device to this IOASID.

    User may want to sanity check on all devices with the same setting
    as this flag is a platform attribute though it's exported per device.

    If not set, continue to step 2.

2)  If PASID_CPU is not set, done.

    Otherwise check whether the PASID_CPU_VIRT flag on this device is
    consistent with all other devices with PASID_CPU set. If inconsistency
    is found (indicating type 7/8 mix), only one type of devices (all set,
    or all clear) should have the vPASID capability exposed to the guest.

3)  If PASID_CPU_VIRT is not set, done.

    If set and consistency check in 2) is passed, call KVM uAPI to
    enable CPU PASID translation if it is the first device with this flag
    set. Later when a new vPASID is identified through vIOMMU at run-time,
    call another KVM uAPI to update the corresponding PASID mapping.

1.5. No-snoop DMA
++++++++++++++++++++

Snoop behavior of a DMA specifies whether the access is coherent (snoops
the processor caches) or not. The snoop behavior is decided by both device
and IOMMU. Device can set a no-snoop attribute in DMA request to force
the non-coherent behavior, while IOMMU may support a configuration which
enforces DMAs to be coherent (with the no-snoop attribute ignored).

No-snoop DMA requires the driver to manually flush caches for
observing the latest content. When such driver is running in the guest,
it further requires KVM to intercept/emulate WBINVD plus favoring
guest cache attributes in the EPT page table.

Alex helped create a matrix as below:
(https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#mbfc96278b078d3ec07eabb9aa46abfe03a886dc6)

              \   Device supports
IOMMU enforces \     no-snoop
     snoop      \ yes | no  |
----------------+-----+-----+
      yes       |  1  |  2  |
----------------+-----+-----+
      no        |  3  |  4  |
----------------+-----+-----+

DMA is always coherent in boxes {1, 2, 4}. No-snoop DMA is allowed
in {3} but whether it is actually used is a driver decision.

VFIO currently adopts a simple policy - always turn on IOMMU
enforce-snoop if available.
It provides a contract via kvm-vfio fd for KVM to
learn whether no-snoop DMA is used thus special tricks on WBINVD
and EPT must be enabled. However, the criteria of no-snoop DMA is
solely based on the fact of lacking IOMMU enforce-snoop for any vfio
device, i.e. both 3) and 4) are considered capable of doing no-snoop
DMA. This model has several limitations:

  - It's impossible to move a device from 1) to 3) when no-snoop DMA
    is a must to achieve the desired user experience;

  - Unnecessary overhead in KVM side in 4) or if the driver doesn't do
    no-snoop DMA in 3). Although the driver doesn't use WBINVD, the
    guest still uses WBINVD in other places e.g. when changing
    cache-related registers (e.g. MTRR/CR0);

We want to adopt a user-driven model in /dev/iommu for more accurate
control over the no-snoop usage. In this model the enforce-snoop format
is specified when an IOASID is created, while the device no-snoop usage
can be further clarified when it's attached to the IOASID.

IOMMU fd is expected to provide uAPIs and helper functions for:

  - reporting IOMMU enforce-snoop capability to the user per device
    cookie (device no-snoop capability is reported via VFIO).

  - allowing user to specify whether an IOASID should be created in the
    IOMMU enforce-snoop format (enable/disable/auto):

    * This allows moving a device from 1) to 3) in case of performance
      requirement.

    * 'auto' falls back to the legacy VFIO policy, i.e. always enables
      enforce-snoop if available.

    * Any device can be attached to a non-enforce-snoop IOASID,
      because this format is supported by all IOMMUs. In this case the
      device belongs to {3, 4} and whether it is considered doing
      no-snoop DMA is decided by the next interface.

    * Attaching a device which cannot be forced to snoop by its IOMMU
      to an enforce-snoop IOASID gets a failure. Successful attaching
      implies the device always does snoop DMA, i.e. belonging to {1,2}.

    * Some platforms support page-granular enforce-snoop.
      One open is whether a page-granular interface is necessary here.

  - allowing user to further hint whether no-snoop DMA is actually used
    in {3, 4} on a specific IOASID, via the VFIO attaching call:

    * in case the user has such intrinsic knowledge on a specific device.

    * {3} can be filtered out with this hint.

    * {4} can be filtered out automatically by VFIO device driver,
      based on device no-snoop capability.

    * If no hint is provided, fall back to legacy VFIO policy, i.e.
      treating all devices in {3, 4} as capable of doing no-snoop.

  - a new contract for KVM to learn whether any IOASID is attached by
    devices which require no-snoop DMA:

    * Once we thought existing kvm-vfio fd can be leveraged as a short
      term approach (see above link). However kvm-vfio is centralized
      on vfio group concept, while this proposal is moving to a
      device-centric model.

    * The new contract will allow KVM to query no-snoop requirement
      per IOMMU fd. This will apply to all passthrough frameworks.

    * A notification mechanism might be introduced to switch between
      WBINVD emulation and no-op intercept according to device
      attaching status change in registered IOMMU fd.

    * whether kvm-vfio will be completely deprecated is a TBD. It's
      still used for non-iommu related contract, e.g. notifying kvm
      pointer to mdev driver and pvIOMMU acceleration in PPC.

  - optional bulk cache invalidation:

    * Userspace driver can use clflush to invalidate cachelines for
      buffers used for no-snoop DMA. But this may be inefficient when
      a big buffer needs to be invalidated. In this case a bulk
      invalidation could be provided based on WBINVD.

The implementation might take a staged approach. At the start IOMMU fd
only supports devices which can be forced to snoop via the IOMMU (i.e.
{1, 2}), while leaving {3, 4} still handled via legacy VFIO. In
this case there is no need to introduce a new contract with KVM. An easy
way is having VFIO not expose {3, 4} devices in /dev/vfio/devices.
Then we have plenty of time to figure out the implementation detail of
the new model at a later stage.

2. uAPI Proposal
----------------------

/dev/iommu uAPI covers everything about managing I/O address spaces.

/dev/vfio device uAPI builds connection between devices and I/O address
spaces.

/dev/kvm uAPI is optionally required as far as no-snoop DMA or ENQCMD
is concerned.

2.1. /dev/iommu uAPI
++++++++++++++++++++

/*
 * Check whether a uAPI extension is supported.
 *
 * It's unlikely that all planned capabilities in IOMMU fd will be ready in
 * one breath. User should check which uAPI extension is supported
 * according to its intended usage.
 *
 * A rough list of possible extensions may include:
 *
 *	- EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
 *	- EXT_MAP_NEWTYPE for an enhanced map semantics;
 *	- EXT_IOASID_NESTING for what the name stands;
 *	- EXT_USER_PAGE_TABLE for user managed page table;
 *	- EXT_USER_PASID_TABLE for user managed PASID table;
 *	- EXT_MULTIDEV_GROUP for 1:N iommu group;
 *	- EXT_DMA_NO_SNOOP for no-snoop DMA support;
 *	- EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
 *	- ...
 *
 * Return: 0 if not supported, 1 if supported.
 */
#define IOMMU_CHECK_EXTENSION	_IO(IOMMU_TYPE, IOMMU_BASE + 0)

/*
 * Check capabilities and format information on a bound device.
 *
 * It could be reported either via a capability chain as implemented in
 * VFIO or a per-capability query interface. The device is identified
 * by device cookie (registered when binding this device).
 *
 * Sample capability info:
 *	- VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
 *	- IOASID nesting: hardware nesting vs. software nesting;
 *	- User-managed page table: vendor specific formats;
 *	- User-managed pasid table: vendor specific formats;
 *	- coherency: whether IOMMU can enforce snoop for this device;
 *	- ...
 *
 */
#define IOMMU_DEVICE_GET_INFO	_IO(IOMMU_TYPE, IOMMU_BASE + 1)

/*
 * Allocate an IOASID.
 *
 * IOASID is the FD-local software handle representing an I/O address
 * space. Each IOASID is associated with a single I/O page table. User
 * must call this ioctl to get an IOASID for every I/O address space that is
 * intended to be tracked by the kernel.
 *
 * User needs to specify the attributes of the IOASID and associated
 * I/O page table format information according to one or multiple devices
 * which will be attached to this IOASID right after. The I/O page table
 * is activated in the IOMMU when it's attached by a device. Incompatible
 * format between device and IOASID will lead to attaching failure.
 *
 * The root IOASID should always have a kernel-managed I/O page
 * table for safety. Locked page accounting is also conducted on the root.
 *
 * Multiple roots are possible, e.g. when multiple I/O address spaces
 * are created but IOASID nesting is disabled. However, one page might
 * be accounted multiple times in this case. The user is recommended to
 * instead create a 'dummy' root with identity mapping (HVA->HVA) for
 * centralized accounting, nested by all other IOASIDs which represent
 * 'real' I/O address spaces.
 *
 * Sample attributes:
 *	- Ownership: kernel-managed or user-managed I/O page table;
 *	- IOASID nesting: the parent IOASID info if enabled;
 *	- User-managed page table: addr and vendor specific formats;
 *	- User-managed pasid table: addr and vendor specific formats;
 *	- coherency: enforce-snoop;
 *	- ...
 *
 * Return: allocated ioasid on success, -errno on failure.
 */
#define IOMMU_IOASID_ALLOC	_IO(IOMMU_TYPE, IOMMU_BASE + 2)
#define IOMMU_IOASID_FREE	_IO(IOMMU_TYPE, IOMMU_BASE + 3)

/*
 * Map/unmap process virtual addresses to I/O virtual addresses.
 *
 * Provide VFIO type1 equivalent semantics. Start with the same
 * restriction e.g.
 * the unmap size should match the size used in the
 * original mapping call.
 *
 * If the specified IOASID is the root, the mapped pages are automatically
 * pinned and accounted as locked memory. Pinning might be postponed
 * until the IOASID is attached by a device. Software mdev driver may
 * further provide a hint to skip auto-pinning at attaching time, since
 * it does selective pinning at run-time. Auto-pinning can be also
 * skipped when I/O page fault is enabled on the root.
 *
 * When software nesting is enabled, this implies that the merged
 * shadow mapping will also be updated accordingly. However if the
 * change happens on the parent, it requires reverse lookup to update
 * all relevant child mappings which is time consuming. So the user
 * is not suggested to change the parent mapping after the software
 * nesting is established (maybe disallow?). There is no such restriction
 * with hardware nesting, as the IOMMU will catch up the change
 * when actually walking the page table.
 *
 * Input parameters:
 *	- u32 ioasid;
 *	- refer to vfio_iommu_type1_dma_{un}map
 *
 * Return: 0 on success, -errno on failure.
 */
#define IOMMU_MAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 4)
#define IOMMU_UNMAP_DMA	_IO(IOMMU_TYPE, IOMMU_BASE + 5)

/*
 * Invalidate IOTLB for a user-managed I/O page table
 *
 * check include/uapi/linux/iommu.h for supported cache types and
 * granularities. Device cookie and vPASID may be specified to help
 * decide the scope of this operation.
 *
 * Input parameters:
 *	- child_ioasid;
 *	- granularity (per-device, per-pasid, range-based);
 *	- cache type (iotlb, devtlb, pasid cache);
 *
 * Return: 0 on success, -errno on failure
 */
#define IOMMU_INVALIDATE_CACHE	_IO(IOMMU_TYPE, IOMMU_BASE + 6)

/*
 * Page fault report and response
 *
 * This is TBD. Can be added after other parts are cleared up. It may
 * include a fault region to report fault data via read(), an
 * eventfd to notify the user and an ioctl to complete the fault.
 *
 * The fault data includes {IOASID, device_cookie, faulting addr, perm}
 * as common info. Vendor specific fault info can be also included if
 * necessary.
 *
 * If the IOASID represents a user-managed PASID table, the vendor
 * fault info includes vPASID information for the user to figure out
 * which I/O page table triggers the fault.
 *
 * If the IOASID represents a user-managed I/O page table, the user
 * is expected to find out vPASID itself according to {IOASID,
 * device_cookie}.
 */

/*
 * Dirty page tracking
 *
 * Track and report memory pages dirtied in I/O address spaces. There
 * is ongoing work by Kunkun Jiang on extending existing VFIO type1.
 * It needs to be adapted to /dev/iommu later.
 */

2.2. /dev/vfio device uAPI
++++++++++++++++++++++++++

/*
 * Bind a vfio_device to the specified IOMMU fd
 *
 * The user should provide a device cookie when calling this ioctl. The
 * cookie is later used in IOMMU fd for capability query, iotlb invalidation
 * and I/O fault handling.
 *
 * User is not allowed to access the device before the binding operation
 * is completed.
 *
 * Unbind is automatically conducted when device fd is closed.
 *
 * Input parameters:
 *	- iommu_fd;
 *	- cookie;
 *
 * Return: 0 on success, -errno on failure.
 */
#define VFIO_BIND_IOMMU_FD	_IO(VFIO_TYPE, VFIO_BASE + 22)

/*
 * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
 *
 * Add a new device capability. The presence indicates that the user
 * is allowed to create multiple I/O address spaces on this device. The
 * capability further includes following flags:
 *
 *	- PASID_DELEGATED, if clear every vPASID must be registered to
 *	  the kernel;
 *	- PASID_CPU, if set vPASID is allowed to be carried in the CPU
 *	  instructions (e.g. ENQCMD);
 *	- PASID_CPU_VIRT, if set require vPASID translation in the CPU;
 *
 * The user must check that all devices with PASID_CPU set have the
 * same setting on PASID_CPU_VIRT.
 * If mismatching, it should enable
 * vPASID only in one category (all set, or all clear).
 *
 * When the user enables vPASID on the device with PASID_CPU_VIRT
 * set, it must enable vPASID CPU translation via kvm fd before attempting
 * to use ENQCMD to submit work items. The command portal is blocked
 * by the kernel until the CPU translation is enabled.
 */
#define VFIO_DEVICE_INFO_CAP_PASID	5

/*
 * Attach a vfio device to the specified IOASID
 *
 * Multiple vfio devices can be attached to the same IOASID, and vice
 * versa.
 *
 * User may optionally provide a "virtual PASID" to mark an I/O page
 * table on this vfio device, if PASID_DELEGATED is not set in device info.
 * Whether the virtual PASID is physically used or converted to another
 * kernel-allocated PASID is a policy in the kernel.
 *
 * Because one device is allowed to bind to multiple IOMMU fd's, the
 * user should provide both iommu_fd and ioasid for this attach operation.
 *
 * Input parameter:
 *	- iommu_fd;
 *	- ioasid;
 *	- flag;
 *	- vpasid (if specified);
 *
 * Return: 0 on success, -errno on failure.
 */
#define VFIO_ATTACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 23)
#define VFIO_DETACH_IOASID	_IO(VFIO_TYPE, VFIO_BASE + 24)

2.3. KVM uAPI
+++++++++++++

/*
 * Check/enable CPU PASID translation via KVM CAP interface
 *
 * This is necessary when ENQCMD will be used in the guest while the
 * targeted device doesn't accept the vPASID saved in the CPU MSR.
 */
#define KVM_CAP_PASID_TRANSLATION	206

/*
 * Update CPU PASID mapping
 *
 * This command allows user to set/clear the vPASID->pPASID mapping
 * in the CPU, by providing the IOASID (and FD) information representing
 * the I/O address space marked by this vPASID. KVM calls iommu helper
 * function to retrieve pPASID according to the input parameters. So the
 * pPASID value is completely hidden from the user.
 *
 * Input parameters:
 *	- user_pasid;
 *	- iommu_fd;
 *	- ioasid;
 */
#define KVM_MAP_PASID	_IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID	_IO(KVMIO, 0xf1)

/*
 * and a new contract to exchange no-snoop dma status with IOMMU fd.
 * this will be a device-centric interface, thus existing vfio-kvm contract
 * is not suitable as it's group-centric.
 *
 * actual definition TBD.
 */

3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOMMU_FD:

	struct iommu_ctx *iommu_ctx_fdget(int fd);
	struct iommu_dev *iommu_register_device(
		struct iommu_ctx *ctx, struct device *device, u64 cookie);
	int iommu_unregister_device(struct iommu_dev *dev);

An iommu_ctx is created for each fd:

	struct iommu_ctx {
		// a list of allocated IOASID data's
		struct xarray		ioasid_xa;

		// a list of registered devices
		struct xarray		dev_xa;
	};

Later some group-tracking fields will be also introduced to support
multi-devices group.

Each registered device is represented by iommu_dev:

	struct iommu_dev {
		struct iommu_ctx	*ctx;
		// always be the physical device
		struct device		*device;
		u64			cookie;
		struct kref		kref;
	};

A successful binding establishes a security context for the bound
device and returns struct iommu_dev pointer to the caller. After this
point, the user is allowed to query device capabilities via
IOMMU_DEVICE_GET_INFO.
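
The binding bookkeeping above can be modelled in plain userspace C. This
is only an illustrative sketch under stated assumptions, not kernel code:
a fixed-size array stands in for the xarray, struct device is left opaque,
a plain counter replaces struct kref, and both the duplicate-cookie check
and the iommu_find_device() lookup helper are hypothetical additions that
the proposal does not spell out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DEVS 8		/* arbitrary limit for this sketch */

struct device;			/* opaque stand-in for the kernel type */
struct iommu_ctx;

struct iommu_dev {
	struct iommu_ctx *ctx;
	struct device *device;	/* always the physical (parent) device */
	uint64_t cookie;	/* user-provided handle */
	int refcount;		/* stand-in for struct kref */
	int in_use;
};

struct iommu_ctx {
	struct iommu_dev devs[MAX_DEVS];	/* stand-in for dev_xa */
};

/* Hypothetical lookup by cookie, e.g. on behalf of a capability query. */
struct iommu_dev *iommu_find_device(struct iommu_ctx *ctx, uint64_t cookie)
{
	for (size_t i = 0; i < MAX_DEVS; i++)
		if (ctx->devs[i].in_use && ctx->devs[i].cookie == cookie)
			return &ctx->devs[i];
	return NULL;
}

/* Register a device; NULL on duplicate cookie (assumed policy) or full table. */
struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
					struct device *device, uint64_t cookie)
{
	if (iommu_find_device(ctx, cookie))
		return NULL;

	for (size_t i = 0; i < MAX_DEVS; i++) {
		struct iommu_dev *dev = &ctx->devs[i];

		if (!dev->in_use) {
			dev->ctx = ctx;
			dev->device = device;
			dev->cookie = cookie;
			dev->refcount = 1;
			dev->in_use = 1;
			return dev;
		}
	}
	return NULL;
}

/* Unregister; 0 on success, -EINVAL if dev is not currently registered. */
int iommu_unregister_device(struct iommu_dev *dev)
{
	if (!dev || !dev->in_use)
		return -EINVAL;
	dev->in_use = 0;
	dev->refcount = 0;
	return 0;
}
```

In this model the cookie is the only handle the user ever sees, which
mirrors how the cookie is later reused for capability query, invalidation
and fault reporting.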
For mdev the struct device should be the pointer to the parent device.

An ioasid_data is created when IOMMU_IOASID_ALLOC is called, as the main
object describing characteristics about an I/O page table:

	struct ioasid_data {
		struct iommu_ctx	*ctx;

		// the IOASID number
		u32			ioasid;

		// the handle for kernel-managed I/O page table
		struct iommu_domain	*domain;

		// map metadata (vfio type1 semantics)
		struct rb_node		dma_list;

		// pointer to user-managed pgtable
		u64			user_pgd;

		// link to the parent ioasid (for nesting)
		struct ioasid_data	*parent;

		// IOMMU enforce-snoop
		bool			enforce_snoop;

		// various format information
		...

		// a list of device attach data (routing information)
		struct list_head	attach_data;

		// a list of fault_data reported from the iommu layer
		struct list_head	fault_data;

		...
	};

iommu_domain is the object for operating the kernel-managed I/O
page tables in the IOMMU layer. ioasid_data is associated to an
iommu_domain explicitly or implicitly:

  - root IOASID (except the 'dummy' one for locked accounting)
    must use kernel-managed I/O page table thus always linked to an
    iommu_domain;

  - child IOASID (via software nesting) is explicitly linked to an iommu
    domain as the shadow I/O page table is managed by the kernel;

  - child IOASID (via hardware nesting) is linked to another simpler iommu
    layer object (TBD) for tracking user-managed page table. Due to
    nesting it is also implicitly linked to the iommu_domain of the
    parent;

Following link has an initial discussion on this part:
https://lore.kernel.org/linux-iommu/BN9PR11MB54331FC6BB31E8CBF11914A48C019@BN9PR11MB5433.namprd11.prod.outlook.com/T/#m2c19d3825cc096daf2026ea94e00cc5858cda321

As Jason recommends in v1, bus-specific wrapper functions are provided
explicitly to support VFIO_ATTACH_IOASID, e.g.

	struct iommu_attach_data *iommu_pci_device_attach(
		struct iommu_dev *dev, struct pci_dev *pdev,
		u32 ioasid);

	struct iommu_attach_data *iommu_pci_device_attach_pasid(
		struct iommu_dev *dev, struct pci_dev *pdev,
		u32 ioasid, u32 pasid);

and variants for non-PCI devices.

A helper function is provided for above wrappers:

	// flags specifies whether pasid is valid
	struct iommu_attach_data *__iommu_device_attach(
		struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);

A new object is introduced and linked to ioasid_data->attach_data for
each successful attach operation:

	struct iommu_attach_data {
		struct list_head	next;
		struct iommu_dev	*dev;
		u32			pasid;
	};

The helper function for VFIO_DETACH_IOASID is generic:

	int iommu_device_detach(struct iommu_attach_data *data);

4. Use Cases and Flows
-------------------------------

Here assume VFIO will support a new model where /dev/iommu capable
devices are explicitly listed under /dev/vfio/devices thus a device fd can
be acquired w/o going through legacy container/group interface. They
may be further categorized into sub-directories based on device types
(e.g. pdev, mdev, etc.). For illustration purpose those devices are put
together and just called dev[1...N]:

	device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

VFIO continues to support container/group model for legacy applications
and also for devices which are not moved to /dev/iommu in one breath
(e.g. in a group with multiple devices, or support no-snoop DMA). In concept
there is no problem for VFIO to support two models simultaneously, but
we'll wait to see any issue when reaching implementation.

As explained earlier, one IOMMU fd is sufficient for all intended use cases:

	iommu_fd = open("/dev/iommu", mode);

For simplicity below examples are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.
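
As a rough illustration of the fd acquisition step only, the two open()
calls could be wrapped as below. This is a sketch under assumptions: the
/dev/vfio/devices/devN naming is the illustrative one used in this
proposal, O_RDWR is an assumed choice for 'mode', and vfio_dev_path() and
open_fds() are hypothetical helper names. On a system without these nodes
open_fds() simply fails.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* Build "/dev/vfio/devices/dev<idx>"; 0 on success, -1 if buf is too small. */
int vfio_dev_path(char *buf, size_t len, unsigned int idx)
{
	int n = snprintf(buf, len, "/dev/vfio/devices/dev%u", idx);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open one device fd plus the IOMMU fd; 0 on success, -1 on any failure. */
int open_fds(unsigned int idx, int *device_fd, int *iommu_fd)
{
	char path[64];

	if (vfio_dev_path(path, sizeof(path), idx))
		return -1;

	*device_fd = open(path, O_RDWR);
	if (*device_fd < 0)
		return -1;

	*iommu_fd = open("/dev/iommu", O_RDWR);
	if (*iommu_fd < 0) {
		/* Do not leak the device fd when the IOMMU fd is unavailable. */
		close(*device_fd);
		*device_fd = -1;
		return -1;
	}
	return 0;
}
```

A single iommu_fd would normally be opened once and shared by all devices;
the helper opens both together only to keep the sketch short.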
Three types of IOASIDs are considered:

	gpa_ioasid[1...N]:	GPA as the default address space
	giova_ioasid[1...N]:	GIOVA as the default address space (nesting)
	gva_ioasid[1...N]:	CPU VA as non-default address space (nesting)

At least one gpa_ioasid must always be created per guest, while the other
two are relevant as far as vIOMMU is concerned.

Examples here apply to both pdev and mdev. VFIO device driver in the
kernel will figure out the associated routing information in the attaching
operation.

For illustration simplicity, IOMMU_CHECK_EXTENSION and
IOMMU_DEVICE_GET_INFO are skipped in these examples. No-snoop DMA is also
not covered here.

Below examples may not apply to all platforms. For example, the PAPR IOMMU
in PPC platform always requires a vIOMMU and blocks DMAs until the device is
explicitly attached to a GIOVA address space. There are even fixed
associations between available GIOVA spaces and devices. Those platform
specific variances are not covered here and will be figured out in the
implementation phase.

4.1. A simple example
+++++++++++++++++++++

Dev1 is assigned to the guest. A cookie has been allocated by the user
to represent this device in the iommu_fd.

One gpa_ioasid is created.
The GPA address space is managed through
DMA mapping protocol by specifying that the I/O page table is managed
by the kernel:

	/* Bind device to IOMMU fd */
	device_fd = open("/dev/vfio/devices/dev1", mode);
	iommu_fd = open("/dev/iommu", mode);
	bind_data = {.fd = iommu_fd; .cookie = cookie};
	ioctl(device_fd, VFIO_BIND_IOMMU_FD, &bind_data);

	/* Allocate IOASID */
	alloc_data = {.user_pgtable = false};
	gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* Attach device to IOASID */
	at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping [0 - 1GB] */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

If the guest is assigned with more than dev1, the user follows above
sequence to attach other devices to the same gpa_ioasid i.e. sharing
the GPA address space across all assigned devices, e.g. for dev2:

	bind_data = {.fd = iommu_fd; .cookie = cookie2};
	ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

4.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
through mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today). The side-effect is that duplicated locked page
accounting might be incurred in this example as there are two root
IOASIDs now.
It will be fixed once IOASID nesting is supported:

	device_fd1 = open("/dev/vfio/devices/dev1", mode);
	device_fd2 = open("/dev/vfio/devices/dev2", mode);
	iommu_fd = open("/dev/iommu", mode);

	/* Bind device to IOMMU fd */
	bind_data = {.fd = iommu_fd; .device_cookie = cookie1};
	ioctl(device_fd1, VFIO_BIND_IOASID_FD, &bind_data);
	bind_data = {.fd = iommu_fd; .device_cookie = cookie2};
	ioctl(device_fd2, VFIO_BIND_IOASID_FD, &bind_data);

	/* Allocate IOASID */
	alloc_data = {.user_pgtable = false};
	gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* Attach dev1 and dev2 to gpa_ioasid */
	at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup GPA mapping [0 - 1GB] */
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;		// GPA
		.vaddr	= 0x40000000;	// HVA
		.size	= 1GB;
	};
	ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

	/* After boot, guest enables a GIOVA space for dev2 via vIOMMU */
	alloc_data = {.user_pgtable = false};
	giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* First detach dev2 from previous address space */
	at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
	ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

	/* Then attach dev2 to the new address space */
	at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a shadow DMA mapping according to vIOMMU.
	 *
	 * e.g. the vIOMMU page table adds a new 4KB mapping:
	 *	GIOVA [0x2000] -> GPA [0x1000]
	 *
	 * and GPA [0x1000] is mapped to HVA [0x40001000] in gpa_ioasid.
	 *
	 * In this case the shadow mapping should be:
	 *	GIOVA [0x2000] -> HVA [0x40001000]
	 */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x40001000;	// HVA
		.size	= 4KB;
	};
	ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

4.3. IOASID nesting (software)
++++++++++++++++++++++++++++++

Same usage scenario as 4.2, with software-based IOASID nesting
available. In this mode it is the kernel, instead of the user, that creates
the shadow mapping.

The flow before the guest boots is the same as in 4.2, except for one point.
Because giova_ioasid is nested on gpa_ioasid, locked accounting is only
conducted for gpa_ioasid, which becomes the only root.

There could be a case where different gpa_ioasids are created due
to incompatible formats between dev1/dev2 (e.g. about IOMMU
enforce-snoop). In such a case the user could further create a dummy
IOASID (HVA->HVA) as the root parent for the two gpa_ioasids to avoid
duplicated accounting. But this scenario is not covered in the following
flows.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

	/* After boot */
	/* Create GIOVA space nested on GPA space
	 * Both page tables are managed by the kernel
	 */
	alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
	giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* Attach dev2 to the new address space (child)
	 * Note dev2 is still attached to gpa_ioasid (parent)
	 */
	at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Setup a GIOVA [0x2000] -> GPA [0x1000] mapping for giova_ioasid,
	 * based on the vIOMMU page table. The kernel is responsible for
	 * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
	 * by walking the parent's I/O page table to find out GPA [0x1000] ->
	 * HVA [0x40001000].
	 */
	dma_map = {
		.ioasid	= giova_ioasid;
		.iova	= 0x2000;	// GIOVA
		.vaddr	= 0x1000;	// GPA
		.size	= 4KB;
	};
	ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

4.4. IOASID nesting (hardware)
++++++++++++++++++++++++++++++

Same usage scenario as 4.2, with hardware-based IOASID nesting
available.
In this mode the I/O page table is managed by userspace, thus an
invalidation interface is used for the user to request iotlb invalidation.

	/* After boot */
	/* Create GIOVA space nested on GPA space.
	 * Claim it's a user-managed I/O page table.
	 */
	alloc_data = {
		.user_pgtable	= true;
		.parent		= gpa_ioasid;
		.addr		= giova_pgtable;
		// and format information;
	};
	giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* Attach dev2 to the new address space (child)
	 * Note dev2 is still attached to gpa_ioasid (parent)
	 */
	at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

	/* Invalidate IOTLB when required */
	inv_data = {
		.ioasid	= giova_ioasid;
		// granular/cache type information
	};
	ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);

	/* See 4.6 for I/O page fault handling */

4.5. Guest SVA (vSVA)
+++++++++++++++++++++

After boot the guest further creates a GVA address space (vpasid1) on
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 1.4, the user should check the PASID capability
exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
semantics when doing the attaching call:

/****** If dev1 reports PASID_DELEGATED=false **********/
	/* After boot */
	/* Create GVA space nested on GPA space.
	 * Claim it's a user-managed I/O page table.
	 */
	alloc_data = {
		.user_pgtable	= true;
		.parent		= gpa_ioasid;
		.addr		= gva_pgtable;
		// and format information;
	};
	gva_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* Attach dev1 to the new address space (child) and specify
	 * vPASID.
	 * Note dev1 is still attached to gpa_ioasid (parent)
	 */
	at_data = {
		.fd	= iommu_fd;
		.ioasid	= gva_ioasid;
		.flag	= IOASID_ATTACH_VPASID;
		.vpasid	= vpasid1;
	};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* Enable CPU PASID translation if required */
	if (PASID_CPU and PASID_CPU_VIRT are both true for dev1) {
		pa_data = {
			.iommu_fd	= iommu_fd;
			.ioasid		= gva_ioasid;
			.vpasid		= vpasid1;
		};
		ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
	}

	/* Invalidate IOTLB when required */
	...

/****** If dev1 reports PASID_DELEGATED=true **********/
	/* Create user-managed vPASID space when it's enabled via vIOMMU */
	alloc_data = {
		.user_pasid_table	= true;
		.parent			= gpa_ioasid;
		.addr			= gpasid_tbl;
		// and format information;
	};
	pasidtbl_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

	/* Attach dev1 to the vPASID space */
	at_data = {.fd = iommu_fd; .ioasid = pasidtbl_ioasid};
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* From now on all GVA address spaces on dev1 are represented by
	 * a single pasidtbl_ioasid as the placeholder in the kernel.
	 *
	 * But iotlb invalidation and fault handling are still per GVA
	 * address space. They are still going through the IOMMU fd in the
	 * same way as in the PASID_DELEGATED=false scenario.
	 */
	...

4.6. I/O page fault
+++++++++++++++++++

The uAPI is TBD. This section only describes the high-level flow from the
host IOMMU driver to the guest IOMMU driver and back. This flow assumes
that I/O page faults are reported via IOMMU interrupts. Some devices report
faults in a device-specific way instead of going through the IOMMU.
That usage is not covered here:

-   Host IOMMU driver receives an I/O page fault with raw fault_data {rid,
    pasid, addr};

-   Host IOMMU driver identifies the faulting I/O page table according to
    {rid, pasid} and calls the corresponding fault handler with an opaque
    object (registered by the handler) and raw fault_data {rid, pasid, addr};

-   IOASID fault handler identifies the corresponding ioasid and device
    cookie according to the opaque object, generates a user fault_data
    (ioasid, cookie, addr) in the fault region, and triggers eventfd to
    userspace;

    *   In case ioasid represents a pasid table, pasid is also included as
        additional fault_data;

    *   The raw fault_data is also cached in ioasid_data->fault_data and
        used when generating the response;

-   Upon receiving the event, Qemu needs to find the virtual routing
    information (v_rid + v_pasid) of the device attached to the faulting
    ioasid;

    *   v_rid is identified according to device_cookie;

    *   v_pasid is either identified according to ioasid, or already carried
        in the fault data;

-   Qemu generates a virtual I/O page fault through vIOMMU into the guest,
    carrying the virtual fault data (v_rid, v_pasid, addr);

-   Guest IOMMU driver fixes up the fault, updates the guest I/O page table
    (GIOVA or GVA), and then sends a page response with virtual completion
    data (v_rid, v_pasid, response_code) to vIOMMU;

-   Qemu finds the pending fault event, converts the virtual completion data
    into (ioasid, cookie, response_code), and then calls a /dev/iommu ioctl
    to complete the pending fault;

-   /dev/iommu finds the pending fault data {rid, pasid, addr} saved in
    ioasid_data->fault_data, and then calls the iommu api to complete it
    with {rid, pasid, response_code}.
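
The two kernel-side translation steps in the flow above (raw fault to user
fault, and user response back to raw response) can be sketched in C roughly
as follows. This is only a minimal illustration under stated assumptions,
not the uAPI, which is still TBD: struct raw_fault, struct user_fault,
struct raw_resp, report_fault() and complete_fault() are hypothetical names
made up for this sketch, and only the idea of caching the raw data in
ioasid_data->fault_data comes from the description. It also assumes at most
one pending fault per ioasid_data, which a real implementation may not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All types below are hypothetical; the real fault uAPI is TBD. */
struct raw_fault  { uint32_t rid; uint32_t pasid; uint64_t addr; };
struct user_fault { uint32_t ioasid; uint64_t cookie; uint64_t addr; };
struct raw_resp   { uint32_t rid; uint32_t pasid; int response_code; };

/* Stands in for the "opaque object" registered by the fault handler. */
struct ioasid_data {
	uint32_t ioasid;
	uint64_t cookie;		/* device cookie chosen by the user */
	struct raw_fault fault_data;	/* cached raw fault */
	int pending;			/* assumption: one fault at a time */
};

/*
 * Host side, fault direction: cache the raw {rid, pasid, addr} and
 * build the (ioasid, cookie, addr) record that userspace would see.
 */
static struct user_fault report_fault(struct ioasid_data *d,
				      struct raw_fault raw)
{
	assert(d != NULL);
	d->fault_data = raw;
	d->pending = 1;
	return (struct user_fault){
		.ioasid = d->ioasid,
		.cookie = d->cookie,
		.addr   = raw.addr,
	};
}

/*
 * Host side, response direction: userspace completes the fault with
 * (ioasid, cookie, response_code); the cached raw data supplies the
 * {rid, pasid} needed by the host IOMMU driver.
 * Returns 0 on success, -1 if no matching fault is pending.
 */
static int complete_fault(struct ioasid_data *d, uint32_t ioasid,
			  uint64_t cookie, int response_code,
			  struct raw_resp *out)
{
	if (!d->pending || d->ioasid != ioasid || d->cookie != cookie)
		return -1;
	*out = (struct raw_resp){
		.rid           = d->fault_data.rid,
		.pasid         = d->fault_data.pasid,
		.response_code = response_code,
	};
	d->pending = 0;
	return 0;
}
```

The point of the sketch is that userspace never sees {rid, pasid}: it only
handles (ioasid, cookie), and the kernel recovers the physical routing
information from the cached fault when the response arrives.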