Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754190AbcKHTCr (ORCPT ); Tue, 8 Nov 2016 14:02:47 -0500 Received: from mx1.redhat.com ([209.132.183.28]:46428 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754163AbcKHTCp (ORCPT ); Tue, 8 Nov 2016 14:02:45 -0500 Subject: Re: Summary of LPC guest MSI discussion in Santa Fe To: Will Deacon , Auger Eric References: <1478209178-3009-1-git-send-email-eric.auger@redhat.com> <20161103220205.37715b49@t450s.home> <20161108024559.GA20591@arm.com> <20161108175457.GK20591@arm.com> Cc: Alex Williamson , drjones@redhat.com, jason@lakedaemon.net, kvm@vger.kernel.org, marc.zyngier@arm.com, benh@kernel.crashing.org, joro@8bytes.org, punit.agrawal@arm.com, linux-kernel@vger.kernel.org, arnd@arndb.de, diana.craciun@nxp.com, iommu@lists.linux-foundation.org, pranav.sawargaonkar@gmail.com, linux-arm-kernel@lists.infradead.org, jcm@redhat.com, tglx@linutronix.de, robin.murphy@arm.com, dwmw@amazon.co.uk, christoffer.dall@linaro.org, eric.auger.pro@gmail.com From: Don Dutile Message-ID: <5822214F.2070500@redhat.com> Date: Tue, 8 Nov 2016 14:02:39 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: <20161108175457.GK20591@arm.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Tue, 08 Nov 2016 19:02:44 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5128 Lines: 95 On 11/08/2016 12:54 PM, Will Deacon wrote: > On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote: >> On 08/11/2016 03:45, Will Deacon wrote: >>> Rather than treat these as separate problems, a better interface is to >>> tell userspace about a set of reserved regions, and have this include >>> the MSI doorbell, irrespective of whether or not it can be remapped. >>> Don suggested that we statically pick an address for the doorbell in a >>> similar way to x86, and have the kernel map it there. We could even pick >>> 0xfee00000. If it conflicts with a reserved region on the platform (due >>> to (4)), then we'd obviously have to (deterministically?) allocate it >>> somewhere else, but probably within the bottom 4G. >> This is tentatively achieved now with >> [1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II >> (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1264506.html) > Yup, I saw that fly by. Hopefully some of the internals can be reused > with the current thinking on user ABI. > >>> The next question is how to tell userspace about all of the reserved >>> regions. Initially, the idea was to extend VFIO, however Alex pointed >>> out a horrible scenario: >>> >>> 1. QEMU spawns a VM on system 0 >>> 2. VM is migrated to system 1 >>> 3. QEMU attempts to passthrough a device using PCI hotplug >>> >>> In this scenario, the guest memory map is chosen at step (1), yet there >>> is no VFIO fd available to determine the reserved regions. Furthermore, >>> the reserved regions may vary between system 0 and system 1. This pretty >>> much rules out using VFIO to determine the reserved regions.Alex suggested >>> that the SMMU driver can advertise the regions via /sys/class/iommu/. This >>> would solve part of the problem, but migration between systems with >>> different memory maps can still cause problems if the reserved regions >>> of the new system conflict with the guest memory map chosen by QEMU. >> >> OK so I understand we do not want anymore the VFIO chain capability API >> (patch 5 of above series) but we prefer a sysfs approach instead. > Right. > >> I understand the sysfs approach which allows the userspace to get the >> info earlier and independently on VFIO. Keeping in mind current QEMU >> virt - which is not the only userspace - will not do much from this info >> until we bring upheavals in virt address space management. So if I am >> not wrong, at the moment the main action to be undertaken is the >> rejection of the PCI hotplug in case we detect a collision. > I don't think so; it should be up to userspace to reject the hotplug. > If userspace doesn't have support for the regions, then that's fine -- > you just end up in a situation where the CPU page table maps memory > somewhere that the device can't see. In other words, you'll end up with > spurious DMA failures, but that's exactly what happens with current systems > if you passthrough an overlapping region (Robin demonstrated this on Juno). > > Additionally, you can imagine some future support where you can tell the > guest not to use certain regions of its memory for DMA. In this case, you > wouldn't want to refuse the hotplug in the case of overlapping regions. > > Really, I think the kernel side just needs to enumerate the fixed reserved > regions, place the doorbell at a fixed address and then advertise these > via sysfs. > >> I can respin [1] >> - studying and taking into account Robin's comments about dm_regions >> similarities >> - removing the VFIO capability chain and replacing this by a sysfs API > Ideally, this would be reusable between different SMMU drivers so the sysfs > entries have the same format etc. > >> Would that be OK? > Sounds good to me. Are you in a position to prototype something on the qemu > side once we've got kernel-side agreement? > >> What about Alex comments who wanted to report the usable memory ranges >> instead of unusable memory ranges? >> >> Also did you have a chance to discuss the following items: >> 1) the VFIO irq safety assessment > The discussion really focussed on system topology, as opposed to properties > of the doorbell. Regardless of how the device talks to the doorbell, if > the doorbell can't protect against things like MSI spoofing, then it's > unsafe. My opinion is that we shouldn't allow passthrough by default on > systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts > cmdline option to VFIO). > > A first step would be making all this opt-in, and only supporting GICv3 > ITS for now. You're trying to support a config that is < GICv3 and no ITS ? ... That would be the equiv. of x86 pre-intr-remap, and that's why allow_unsafe_interrupts hook was created ... to enable devel/kick-the-tires. >> 2) the MSI reserved size computation (is an arbitrary size OK?) > If we fix the base address, we could fix a size too. However, we'd still > need to enumerate the doorbells to check that they fit in the region we > have. If not, then we can warn during boot and treat it the same way as > a resource conflict (that is, reallocate the region in some deterministic > way). > > Will