Date: Wed, 9 Nov 2016 13:01:14 -0700
From: Alex Williamson
To: Christoffer Dall
Cc: Don Dutile, Will Deacon, Eric Auger, eric.auger.pro@gmail.com,
    marc.zyngier@arm.com, robin.murphy@arm.com, joro@8bytes.org,
    tglx@linutronix.de, jason@lakedaemon.net,
    linux-arm-kernel@lists.infradead.org, kvm@vger.kernel.org,
    drjones@redhat.com, linux-kernel@vger.kernel.org,
    pranav.sawargaonkar@gmail.com, iommu@lists.linux-foundation.org,
    punit.agrawal@arm.com, diana.craciun@nxp.com, benh@kernel.crashing.org,
    arnd@arndb.de, jcm@redhat.com, dwmw@amazon.co.uk
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe
Message-ID: <20161109130114.3e17bba9@t450s.home>
In-Reply-To: <20161109192303.GD15676@cbox>
References: <1478209178-3009-1-git-send-email-eric.auger@redhat.com>
    <20161103220205.37715b49@t450s.home> <20161108024559.GA20591@arm.com>
    <20161108202922.GC15676@cbox> <20161108163508.1bcae0c2@t450s.home>
    <58228F71.6020108@redhat.com> <20161109170326.GG17771@arm.com>
    <582371FB.2040808@redhat.com> <20161109192303.GD15676@cbox>

On Wed, 9 Nov 2016 20:23:03 +0100
Christoffer Dall wrote:

> On Wed, Nov 09, 2016 at 01:59:07PM -0500, Don Dutile wrote:
> > On 11/09/2016 12:03 PM, Will Deacon wrote:
> > >On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:
> > >>On 11/08/2016 06:35 PM, Alex Williamson wrote:
> > >>>On Tue, 8 Nov 2016 21:29:22 +0100
> > >>>Christoffer Dall wrote:
> > >>>>Is my understanding correct, that you need to tell userspace about the
> > >>>>location of the doorbell (in the IOVA space) in case (2), because even
> > >>>>though the configuration of the device is handled by the (host) kernel
> > >>>>through trapping of the BARs, we have to avoid the VFIO user programming
> > >>>>the device to create other DMA transactions to this particular address,
> > >>>>since that will obviously conflict and either not produce the desired
> > >>>>DMA transactions or result in unintended weird interrupts?
> > >
> > >Yes, that's the crux of the issue.
> > >
> > >>>Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
> > >>>it's potentially a DMA target and we'll get bogus data on DMA read from
> > >>>the device, and lose data and potentially trigger spurious interrupts on
> > >>>DMA write from the device. Thanks,
> > >>>
> > >>That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
> > >>they address match before the SMMU checks are done. if
> > >>all DMA addrs had to go through SMMU first, then the DMA access could
> > >>be ignored/rejected.
> > >
> > >That's actually not true :( The SMMU can't generally distinguish between MSI
> > >writes and DMA writes, so it would just see a write transaction to the
> > >doorbell address, regardless of how it was generated by the endpoint.
> > >
> > >Will
> > >
> > So, we have real systems where MSI doorbells are placed at the same IOVA
> > that could have memory for a guest
>
> I don't think this is a property of a hardware system. The problem is
> userspace not knowing where in the IOVA space the kernel is going to
> place the doorbell, so you can end up (basically by chance) that some
> IPA range of guest memory overlaps with the IOVA space for the doorbell.
>
> > >, but not at the same IOVA as memory on real hw ?
>
> On real hardware without an IOMMU the system designer would have to
> separate the IOVA and RAM in the physical address space. With an IOMMU,
> the SMMU driver just makes sure to allocate separate regions in the IOVA
> space.
>
> The challenge, as I understand it, happens with the VM, because the VM
> doesn't allocate the IOVA for the MSI doorbell itself, but the host
> kernel does this, independently from the attributes (e.g. memory map) of
> the VM.
>
> Because the IOVA is a single resource, but with two independent entities
> allocating chunks of it (the host kernel for the MSI doorbell IOVA, and
> the VFIO user for other DMA operations), you have to provide some
> coordination between those two entities to avoid conflicts. In the case
> of KVM, the two entities are the host kernel and the VFIO user (QEMU/the
> VM), and the host kernel informs the VFIO user to never attempt to use
> the doorbell IOVA already reserved by the host kernel for DMA.
>
> One way to do that is to ensure that the IPA space of the VFIO user
> corresponding to the doorbell IOVA is simply not valid, ie. the reserved
> regions that avoid for example QEMU to allocate RAM there.
>
> (I suppose it's technically possible to get around this issue by letting
> QEMU place RAM wherever it wants but tell the guest to never use a
> particular subset of its RAM for DMA, because that would conflict with
> the doorbell IOVA or be seen as p2p transactions. But I think we all
> probably agree that it's a disgusting idea.)

Well, it's not like QEMU or libvirt stumbling through sysfs to figure
out where holes could be in order to instantiate a VM with matching
holes, just in case someone might decide to hot-add a device into the
VM, at some point, and hopefully they don't migrate the VM to another
host with a different layout first, is all that much less disgusting or
foolproof.
It's just that in order to dynamically remove a page as a
possible DMA target we require a paravirt channel, such as a balloon
driver that's able to pluck a specific page. In some ways it's
actually less disgusting, but it puts some prerequisites on
enlightening the guest OS. Thanks,

Alex
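
For readers following the thread, the "matching holes" approach discussed
above can be sketched roughly as below. This is only an illustration under
stated assumptions, not QEMU, libvirt, or VFIO code: the function names
(overlaps, carve_holes) and every address are hypothetical, and the reserved
doorbell window is simply passed in rather than discovered from the host.

```python
from typing import List, Tuple

# A half-open address range [start, end). Purely illustrative type alias.
Range = Tuple[int, int]


def overlaps(a: Range, b: Range) -> bool:
    """Return True if the two half-open ranges share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def carve_holes(ram: List[Range], reserved: List[Range]) -> List[Range]:
    """Return the guest RAM layout with every reserved window removed.

    'reserved' stands in for the IOVA windows the host kernel has set
    aside for MSI doorbells (hypothetical input; how userspace learns
    them is exactly what the thread is debating).
    """
    result = list(ram)
    for window in reserved:
        remaining = []
        for seg in result:
            if not overlaps(seg, window):
                remaining.append(seg)
                continue
            # Keep the part of the segment below the reserved window.
            if seg[0] < window[0]:
                remaining.append((seg[0], window[0]))
            # Keep the part of the segment above the reserved window.
            if window[1] < seg[1]:
                remaining.append((window[1], seg[1]))
        result = remaining
    return result


if __name__ == "__main__":
    # Made-up layout: 1 GiB of guest RAM and a 1 MiB doorbell window.
    guest_ram = [(0x40000000, 0x80000000)]
    doorbell = [(0x50000000, 0x50100000)]
    for start, end in carve_holes(guest_ram, doorbell):
        print(f"RAM: {start:#x}-{end:#x}")
```

The sketch also shows why the approach is fragile in the ways described
above: the holes are fixed at VM creation, so a hot-added device or a
migration to a host with a different reserved layout invalidates them.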