Date: Tue, 22 Jun 2021 13:12:17 -0600
From: Alex Williamson
To: "Tian, Kevin"
Cc: Thomas Gleixner, Jason Gunthorpe, "Dey, Megha", "Raj, Ashok",
 "Pan, Jacob jun", "Jiang, Dave", "Liu, Yi L", "Lu, Baolu",
 "Williams, Dan J", "Luck, Tony", "Kumar, Sanjay K", LKML,
 "kvm@vger.kernel.org", Kirti Wankhede
Subject: Re: Virtualizing MSI-X on IMS via VFIO
Message-ID: <20210622131217.76b28f6f.alex.williamson@redhat.com>
Organization: Red Hat

On Tue, 22 Jun 2021 10:16:15 +0000
"Tian, Kevin" wrote:

> Hi, Alex,
>
> Need your help to understand the current MSI-X virtualization flow in
> VFIO. Some background info first.
>
> Recently we are discussing how to virtualize MSI-X with Interrupt
> Message Storage (IMS) on mdev:
> https://lore.kernel.org/kvm/87im2lyiv6.ffs@nanos.tec.linutronix.de/
>
> IMS is a device-specific interrupt storage, allowing interrupts to be
> generated in an optimized and scalable manner. The idxd mdev exposes a
> virtual MSI-X capability to the guest but physically uses IMS entries
> for generating interrupts.
>
> Thomas has helped implement a generic ims irqchip driver:
> https://lore.kernel.org/linux-hyperv/20200826112335.202234502@linutronix.de/
>
> The idxd device allows software to specify an IMS entry (for triggering
> the completion interrupt) when submitting a descriptor. To prevent one
> mdev from triggering a malicious interrupt into another mdev (by
> specifying an arbitrary entry), the idxd ims entry includes a PASID
> field for validation - only a matching PASID in the executed descriptor
> can trigger an interrupt via this entry. The idxd driver is expected to
> program ims entries with PASIDs that are allocated to the mdev which
> owns those entries.
>
> Other devices may use a different ID and format to isolate ims entries.
> But we need to abstract a generic means for programming a
> vendor-specific ID into a vendor-specific ims entry, without violating
> the layering model.
>
> Thomas suggested that the vendor driver first register the ID
> information (possibly plus the location where to write the ID to) in
> msi_desc when allocating irqs (by extending the existing alloc function
> or via a new helper function) and then have the generic ims irqchip
> driver write the ID to the ims entry when it is started up by
> request_irq().
>
> Then there are two questions to be answered:
>
> 1) How does the vendor driver decide the ID to be registered to msi_desc?
> 2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?
>
> For the 1st open, there are two types of PASIDs on an idxd mdev:
>
> 1) default PASID: one per mdev, allocated when the mdev is created;
> 2) sva PASIDs: multiple per mdev, allocated on demand (via vIOMMU);
>
> If vIOMMU is not exposed, all ims entries of this mdev should be
> programmed with the default PASID, which is always available in the
> mdev's lifespan.
>
> If vIOMMU is exposed and guest sva is enabled, entries used for sva
> should be tagged with sva PASIDs, leaving others tagged with the
> default PASID. To help achieve intra-guest interrupt isolation, the
> guest idxd driver needs to program guest sva PASIDs into the virtual
> MSIX_PERM register (one per MSI-X entry) for validation. Access to
> MSIX_PERM is trap-and-emulated by the host idxd driver, which then
> figures out which PASID to register to msi_desc (requiring PASID
> translation info via the new /dev/iommu proposal).
>
> The guest driver is expected to update MSIX_PERM before request_irq().
>
> Now the 2nd open requires your help. Below is what I learned from the
> current vfio/qemu code (for a vfio-pci device):
>
> 0) Qemu doesn't attempt to allocate all irqs as reported by
> msix->table_size. It is done in a dynamic and incremental way.
Not by table_size; our expectation is that the device interrupt behavior
can be implicitly affected by the enabled vectors, and the table size
may support far more vectors than the driver might actually use. It's
also easier if we never need to get into the scenario of
pci_alloc_irq_vectors() returning a smaller than requested number of
vectors and needing to fall back to a vector negotiation that doesn't
exist via MSI-X. FWIW, more recent QEMU will scan the vector table for
the highest non-masked vector to initially enable that number of
vectors in the host, both to improve restore behavior after migration
and to avoid overhead for guests that write the vector table before
setting the MSI-X capability enable bit (Windows?).

> 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
> allocating/enabling irqs given a set of vMSIX vectors [start, count]:
>
> a) if irqs are not allocated, allocate irqs [start+count]. Enable irqs
> for the specified vectors [start, count] via request_irq();
> b) if irqs are already allocated, enable irqs for the specified vectors;
> c) if an irq is already enabled, disable and re-enable irqs for the
> specified vectors because the user may specify a different eventfd;
>
> 2) When the guest enables the virtual MSI-X capability, Qemu calls
> VFIO_DEVICE_SET_IRQS to enable vector#0, even though it's currently
> masked by the guest. Interrupts are received by Qemu but blocked
> from the guest via mask/pending bit emulation. The main intention is
> to enable physical MSI-X;

Yes, this is a bit awkward since the interrupt API is via SET_IRQS and
we don't allow writes to the MSI-X enable bit via config space.

> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls
> VFIO_DEVICE_SET_IRQS to enable vector#0 again, with an eventfd
> different from the one provided in 2);
>
> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
> allocated vectors (only vector#0 now):
>
> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
> irq for vector#0;
>
> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> irqs for both vector#0 and vector#1;
>
> 5) When the guest unmasks vector#2, the same flow as in 4) continues.
>
> ....
>
> If the above understanding is correct, how is a lost interrupt avoided
> between 4.a) and 4.b), given that the irq has been torn down for
> vector#0 in the middle while from the guest's p.o.v. this vector is
> actually unmasked? There must be a mechanism in place, but I just
> didn't figure it out...

In practice unmasking new vectors is rare and done only at
initialization, so the risk of lost interrupts at this time is low.
When masking and unmasking vectors that are already in use, we're only
changing the signaling eventfd between KVM and QEMU such that QEMU can
set emulated pending bits in response to interrupts (given our lack of
interfaces to handle the mask/unmask at the host). I believe that
locking in the vfio-pci driver prevents an interrupt from being lost
during the eventfd switch.
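For reference, the userspace side of the disable/re-enable in 4.a)/4.b)
looks roughly like the sketch below, using the vfio_irq_set uapi from
<linux/vfio.h>. This is only an illustration, not the actual QEMU code;
the device fd and eventfds are placeholders (the eventfds would normally
already be wired up to KVM as irqfds) and error handling is omitted.

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Re-enable MSI-X vectors [0, nvec) on an open vfio device fd,
 * signaling each vector through the corresponding eventfd. */
static void vfio_msix_reenable(int device, int *eventfds, int nvec)
{
	size_t argsz = sizeof(struct vfio_irq_set) + nvec * sizeof(int);
	struct vfio_irq_set *set = calloc(1, argsz);

	/* 4.a) disable MSI-X entirely: ACTION_TRIGGER with count = 0 */
	set->argsz = sizeof(*set);
	set->flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->index = VFIO_PCI_MSIX_IRQ_INDEX;
	set->start = 0;
	set->count = 0;
	ioctl(device, VFIO_DEVICE_SET_IRQS, set);

	/* 4.b) enable vectors [0, nvec) with one eventfd per vector */
	set->argsz = argsz;
	set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	set->count = nvec;
	memcpy(set->data, eventfds, nvec * sizeof(int));
	ioctl(device, VFIO_DEVICE_SET_IRQS, set);

	free(set);
}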
> Given the above flow is robust, mapping Thomas's model to this flow is
> straightforward. Assume the idxd mdev has two vectors: vector#0 for
> the misc/error interrupt and vector#1 as the completion interrupt for
> guest sva. VFIO_DEVICE_SET_IRQS is handled by the idxd mdev driver:
>
> 2) When the guest enables the virtual MSI-X capability, Qemu calls
> VFIO_DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
> used for sva, MSIX_PERM#0 has PASID disabled. The host idxd driver
> knows to register the default PASID to msi_desc#0 when allocating
> irqs. Then the .startup() callback of the ims irqchip is called to
> program the default PASID saved in msi_desc#0 into the target ims
> entry at request_irq() time.
>
> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls
> VFIO_DEVICE_SET_IRQS to enable vector#0 again. Following the same
> logic as vfio-pci, the idxd driver first disables irq#0 via free_irq()
> and then re-enables irq#0 via request_irq(). It's still the default
> PASID being used, according to msi_desc#0.
>
> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
> allocated vectors (only vector#0 now):
>
> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
> irq for vector#0. msi_desc#0 is also freed.
>
> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0
> has PASID disabled while MSIX_PERM#1 has a valid guest PASID1
> for sva. The idxd driver registers the default PASID to msi_desc#0
> and host PASID2 (translated from guest PASID1) to msi_desc#1 when
> allocating irqs. Later, when both irqs are enabled via request_irq(),
> the ims irqchip driver updates the target ims entries according to
> msi_desc#0 and msi_desc#1 respectively.
>
> But this is specific to how Qemu virtualizes MSI-X today. What if it
> changes (or another device model chooses) to allocate all table_size
> irqs when the guest enables the MSI-X capability? At that point we
> don't have valid MSIX_PERM content to register PASID info to msi_desc.
> Possibly what we really require is a separate helper function allowing
> the driver to update msi_desc after irq allocation, e.g. when the
> guest unmasks a vector...

I think you're basically asking how you can guarantee that you always
have a mechanism to update your device-specific MSIX_PERM table before
or after the vector tables. You're trapping and emulating the MSIX_PERM
table, so every write traps to your driver. Therefore shouldn't the
driver be able to set up any vector using the default PASID if
MSIX_PERM is not configured and update it with the correct PASID
translation based on the trapped write to the register? Logically I
think you'd want your guest driver to mask and unmask the affected
vector around modifying the MSIX_PERM entry as well, so another option
would be to reevaluate MSIX_PERM on unmask, which triggers a SET_IRQS
ioctl into the idxd host driver. Either entry point could trigger a
descriptor update to the host irqchip driver.

> and do you see any other facets which are overlooked here?

AIUI, the default with no sva PASID should always work: the host driver
initializes the device with a virtual MSIX_PERM table with all the
entries disabled, and no special guest/host driver coordination is
required. In the case of setting a PASID for a vector, you have entry
points into the driver either via the virtualization of the MSIX_PERM
table or likely also at the unmasking of the vector, so it seems fully
contained. Thanks,

Alex
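As a rough sketch of the trap-side handling described above: every
identifier below (idxd_vdev, idxd_vdev_default_pasid,
idxd_vdev_translate_pasid, idxd_vdev_set_ims_pasid) is hypothetical,
this is not the actual idxd mdev driver code, and the exact mechanism
for handing the PASID to the ims irqchip was still under discussion in
this thread.

/*
 * Hypothetical handler for a trapped guest write to a virtual MSIX_PERM
 * entry; all helper names are made up for illustration only.
 */
static void idxd_vdev_msix_perm_write(struct idxd_vdev *vdev,
				      unsigned int vector,
				      u32 guest_pasid, bool pasid_en)
{
	u32 host_pasid;

	if (!pasid_en)
		/* No sva PASID for this vector: use the mdev's default PASID */
		host_pasid = idxd_vdev_default_pasid(vdev);
	else
		/* Translate guest PASID1 to host PASID2 via the iommu layer */
		host_pasid = idxd_vdev_translate_pasid(vdev, guest_pasid);

	/*
	 * Record the PASID for this vector.  The same update could also be
	 * driven from the SET_IRQS path when the vector is (re)enabled, as
	 * discussed above; either entry point ends up pushing the PASID to
	 * the target ims entry via the irqchip.
	 */
	idxd_vdev_set_ims_pasid(vdev, vector, host_pasid);
}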