From: Jike Song
Date: Thu, 19 Nov 2015 15:22:56 +0800
To: Alex Williamson
CC: "Tian, Kevin", "xen-devel@lists.xen.org", "igvt-g@ml01.01.org",
 "intel-gfx@lists.freedesktop.org", "linux-kernel@vger.kernel.org",
 "White, Michael L", "Dong, Eddie", "Li, Susie",
 "Cowperthwaite, David J", "Reddy, Raghuveer", "Zhu, Libo",
 "Zhou, Chao", "Wang, Hongbo", "Lv, Zhiyuan", qemu-devel,
 Paolo Bonzini, Gerd Hoffmann
Subject: Re: [Intel-gfx] [Announcement] 2015-Q3 release of XenGT - a Mediated Graphics Passthrough Solution from Intel

Hi Alex,

On 11/19/2015 12:06 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Thursday, November 19, 2015 2:12 AM
>>
>> [cc +qemu-devel, +paolo, +gerd]
>>
>> On Tue, 2015-10-27 at 17:25 +0800, Jike Song wrote:
>>> {snip}
>>
>> Hi!
>>
>> At Red Hat we've been thinking about how to support vGPUs from multiple
>> vendors in a common way within QEMU. We want to enable code sharing
>> between vendors and give new vendors an easy path to add their own
>> support. We also have the complication that not all vGPU vendors are as
>> open-source friendly as Intel, so being able to abstract the device
>> mediation and access outside of QEMU is a big advantage.
>>
>> The proposal I'd like to make is that a vGPU, whether it is from Intel
>> or another vendor, is predominantly a PCI(e) device. We already have an
>> interface in QEMU for exposing arbitrary PCI devices: vfio-pci.
>> Currently vfio-pci uses the VFIO API to interact with "physical" devices
>> and system IOMMUs. I highlight /physical/ there because some of these
>> physical devices are SR-IOV VFs, which is somewhat of a fuzzy concept,
>> somewhere between fixed hardware and a virtual device implemented in
>> software. That software just happens to be running on the physical
>> endpoint.
>
> Agree.
>
> One clarification for the rest of the discussion: we're talking about the
> GVT-g vGPU here, which is a pure software GPU virtualization technique.
> GVT-d (note some uses in the text) refers to passing through the whole GPU
> or a specific VF. GVT-d already falls into the existing VFIO APIs nicely
> (though there is some ongoing effort to remove Intel-specific platform
> stickiness from the gfx driver). :-)
>

Hi Alex, thanks for the discussion. In addition to Kevin's replies, I have
a high-level question: can VFIO be used by QEMU for both KVM and Xen?

--
Thanks,
Jike
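For reference, the existing VFIO userspace flow that vfio-pci builds on is
roughly the sketch below, following the example in the kernel's
Documentation/vfio.txt. It is only a sketch: the group number "26" and the
device address "0000:00:02.0" are placeholders (not values from this
thread), and error handling is omitted.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
    int container, group, device;
    struct vfio_group_status group_status = { .argsz = sizeof(group_status) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };

    /* A container holds the IOMMU context for one or more groups */
    container = open("/dev/vfio/vfio", O_RDWR);

    /* The device's IOMMU group, as found under /sys/kernel/iommu_groups/ */
    group = open("/dev/vfio/26", O_RDWR);
    ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
    /* group_status.flags should report VFIO_GROUP_FLAGS_VIABLE here */

    /* Attach the group to the container and pick the type1 IOMMU backend */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Finally obtain a file descriptor for the device itself */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:02.0");
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);

    return 0;
}

A vGPU exposed as proposed below would reuse exactly this flow; per the
proposal, only the sysfs traversal that locates the device and its IOMMU
group would differ.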
>>
>> vGPUs are similar, with the virtual device created at a different point,
>> host software. They also rely on different IOMMU constructs, making use
>> of the MMU capabilities of the GPU (GTTs and such), but really having
>> similar requirements.
>
> One important difference between the system IOMMU and the GPU MMU here:
> the system IOMMU is very much about translation from a DMA target (IOVA on
> native, or GPA in the virtualization case) to HPA. However, the GPU's
> internal MMU translates from a Graphics Memory Address (GMA) to a DMA
> target (HPA if the system IOMMU is disabled, or IOVA/GPA if the system
> IOMMU is enabled). GMA is an internal address space within the GPU, not
> exposed to Qemu and fully managed by the GVT-g device model. Since it's
> not a standard PCI-defined resource, we don't need to abstract this
> capability in the VFIO interface.
>
>>
>> The proposal is therefore that GPU vendors can expose vGPUs to
>> userspace, and thus to QEMU, using the VFIO API. For instance, vfio
>> supports modular bus drivers and IOMMU drivers. An intel-vfio-gvt-d
>> module (or extension of i915) can register as a vfio bus driver, create
>> a struct device per vGPU, create an IOMMU group for that device, and
>> register that device with the vfio-core. Since we don't rely on the
>> system IOMMU for GVT-d vGPU assignment, another vGPU vendor driver (or
>> extension of the same module) can register a "type1"-compliant IOMMU
>> driver into vfio-core. From the perspective of QEMU then, all of the
>> existing vfio-pci code is re-used, QEMU remains largely unaware of any
>> specifics of the vGPU being assigned, and the only necessary change so
>> far is how QEMU traverses sysfs to find the device and thus the IOMMU
>> group leading to the vfio group.
>
> GVT-g requires pinning guest memory and querying GPA->HPA information,
> upon which the shadow GTTs will be updated accordingly, from (GMA->GPA)
> to (GMA->HPA). So yes, a dummy or simple "type1"-compliant IOMMU can be
> introduced here just for this requirement.
>
> However, there's one tricky point where I'm not sure whether the overall
> VFIO concept would be violated. GVT-g doesn't require the system IOMMU to
> function, but the host system may enable the system IOMMU just for
> hardening purposes. This means two levels of translation exist (GMA->
> IOVA->HPA), so the dummy IOMMU driver has to request the system IOMMU
> driver to allocate IOVAs for VMs and then set up the IOVA->HPA mappings
> in the IOMMU page table. In this case, multiple VMs' translations are
> multiplexed in one IOMMU page table.
>
> We might need to create some group/sub-group or parent/child concepts
> among those IOMMUs for thorough permission control.
>
>>
>> There are a few areas where we know we'll need to extend the VFIO API to
>> make this work, but it seems like they can all be done generically. One
>> is that PCI BARs are described through the VFIO API as regions, and each
>> region has a single flag describing whether mmap (i.e. direct mapping)
>> of that region is possible. We expect that vGPUs likely need finer
>> granularity, enabling some areas within a BAR to be trapped and
>> forwarded as a read or write access for the vGPU-vfio-device module to
>> emulate, while other regions, like framebuffers or texture regions, are
>> directly mapped. I have prototype code to enable this already.
>
> Yes, in GVT-g one BAR resource might be partitioned among multiple vGPUs.
> If VFIO can support such partial resource assignment, it'd be great. A
> similar parent/child concept might also be required here, so that any
> resource enumerated on a vGPU doesn't break limitations enforced on the
> physical device.
>
> One unique requirement for GVT-g here, though, is that the vGPU device
> model needs to know the guest BAR configuration for proper emulation
> (e.g. to register I/O emulation handlers with KVM). The same applies to
> the guest MSI vector for virtual interrupt injection. I'm not sure how
> this can fit into the common VFIO model. Does VFIO allow vendor-specific
> extensions today?
>
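To make the mmap-granularity point above concrete: with today's VFIO API a
BAR is a single region carrying one mmap flag, so userspace can only choose
between mapping the whole BAR and trapping every access. A minimal sketch,
assuming "device" is an open VFIO device fd as in the earlier example; the
helper name map_bar0() is only for illustration, and error handling is
omitted.

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/* Query BAR0 of a vfio-pci device and map it if the kernel allows it */
static void *map_bar0(int device)
{
    struct vfio_region_info reg = {
        .argsz = sizeof(reg),
        .index = VFIO_PCI_BAR0_REGION_INDEX,
    };

    ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

    if (reg.flags & VFIO_REGION_INFO_FLAG_MMAP)
        /* the whole region may be mapped at reg.offset on the device fd */
        return mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, device, reg.offset);

    /* otherwise every access goes through pread()/pwrite() and is trapped */
    return NULL;
}

The extension discussed above would let the kernel describe sub-ranges of
such a region as mappable while the rest stays trapped; that would be new
API surface beyond what this sketch can express today.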
>>
>> Another area is that we really don't want to proliferate each vGPU
>> needing a new IOMMU type within vfio. The existing type1 IOMMU provides
>> potentially the most simple mapping and unmapping interface possible.
>> We'd therefore need to allow multiple "type1" IOMMU drivers for vfio,
>> making type1 more of an interface specification than a single
>> implementation. This is a trivial change to make within vfio and one
>> that I believe is compatible with the existing API. Note that
>> implementing a type1-compliant vfio IOMMU does not imply pinning and
>> mapping every registered page. A vGPU, with mediated device access, may
>> use this only to track the current HVA to GPA mappings for a VM. Only
>> when a DMA is enabled for the vGPU instance is that HVA pinned and an
>> HPA to GPA translation programmed into the GPU MMU.
>>
>> Another area of extension is how to expose a framebuffer to QEMU for
>> seamless integration into a SPICE/VNC channel. For this I believe we
>> could use a new region, much like we've done to expose VGA access
>> through a vfio device file descriptor. An area within this new
>> framebuffer region could be directly mappable in QEMU while a
>> non-mappable page, at a standard location with a standardized format,
>> provides a description of the framebuffer and potentially even a
>> communication channel to synchronize framebuffer captures. This would
>> be new code for QEMU, but something we could share among all vGPU
>> implementations.
>
> GVT-g already provides an interface to decode framebuffer information,
> with the assumption that the framebuffer will be further composited into
> OpenGL APIs, so the format is defined according to the OpenGL definition.
> Does that meet the SPICE requirement?
>
> Another thing to be added: framebuffers are switched frequently in
> reality, so either Qemu needs to poll or a notification mechanism is
> required. And since it's dynamic, having the framebuffer page directly
> exposed in the new region might be tricky. We could just expose the
> framebuffer information (including base, format, etc.) and let Qemu map
> it separately outside of the VFIO interface.
>
> And... this works fine with the vGPU model, since software knows all the
> details about the framebuffer. However, in the pass-through case, who do
> you expect to provide that information? Is it OK to introduce
> vGPU-specific APIs in VFIO?
>
>>
>> Another obvious area to be standardized would be how to discover,
>> create, and destroy vGPU instances. SR-IOV has a standard mechanism to
>> create VFs in sysfs and I would propose that vGPU vendors try to
>> standardize on similar interfaces to enable libvirt to easily discover
>> the vGPU capabilities of a given GPU and manage the lifecycle of a vGPU
>> instance.
>
> Right now there is no standard. We expose vGPU life-cycle management APIs
> through sysfs (under the i915 node), which is very Intel-specific. In
> reality different vendors have quite different capabilities for their own
> vGPUs, so I'm not sure how standard such a mechanism can be. But this
> code should be minor to maintain in libvirt.
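For reference, the type1 map/unmap interface mentioned above is just a pair
of ioctls on the container fd; this is what any vGPU-specific
"type1"-compliant backend would have to accept from QEMU. A minimal sketch:
the helper names and the gpa/size arguments are placeholders for
illustration, and error handling is omitted.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Tell the (possibly vGPU-specific) type1 backend that guest physical
 * address 'gpa' of length 'size' is backed by host virtual address 'hva'. */
static void map_guest_ram(int container, void *hva, uint64_t gpa, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)hva,    /* HVA of the guest RAM block */
        .iova  = gpa,               /* GPA as the vGPU sees it */
        .size  = size,
    };

    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}

static void unmap_guest_ram(int container, uint64_t gpa, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = gpa,
        .size  = size,
    };

    ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}

As noted above, a mediated backend is free to only record these HVA-to-GPA
ranges and defer any pinning until the vGPU actually programs a DMA.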
>
>>
>> This is obviously a lot to digest, but I'd certainly be interested in
>> hearing feedback on this proposal, as well as trying to clarify anything
>> I've left out or misrepresented above. Another benefit of this
>> mechanism is that direct GPU assignment and vGPU assignment use the same
>> code within QEMU and the same API to the kernel, which should make
>> debugging and code support between the two easier. I'd really like to
>> start a discussion around this proposal, and of course the first open
>> source implementation of this sort of model will really help to drive
>> the direction it takes. Thanks!
>>
>
> Thanks for starting this discussion. Intel will definitely work with the
> community on this work. Based on the earlier comments, I'm not sure
> whether we can use exactly the same code for direct GPU assignment and
> vGPU assignment, since even if we extend VFIO, some interfaces might be
> vGPU-specific. Does this approach still achieve your end goal?
>
> Thanks
> Kevin