Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3867AC7618E for ; Thu, 16 Mar 2023 21:57:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230089AbjCPV5k (ORCPT ); Thu, 16 Mar 2023 17:57:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57838 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229534AbjCPV5i (ORCPT ); Thu, 16 Mar 2023 17:57:38 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B40436040F for ; Thu, 16 Mar 2023 14:56:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1679003810; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=yBfyzuvRoe+QOXGN5SbtszQ6PVd6w0zcMfyA5scQ3Zo=; b=BhbG+7xU8eRNHOjTRgeEdxG3RBaawYqQzWTnXWNWrtNDRdsuWQXBCw8ISvJcnJLa9ryrmx N4HmRsWOqV4mIXJCiBTlv3SP0zRTKOR0JKuz8ZoDSyU0erClSVtJDOtwx63XS/W/CSVSqv pEIYUgaHSAEgF7Y7iABuWlJpKf/fYww= Received: from mail-il1-f198.google.com (mail-il1-f198.google.com [209.85.166.198]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-393-dVw7NiGJPd2IoLPCHVDX7A-1; Thu, 16 Mar 2023 17:56:49 -0400 X-MC-Unique: dVw7NiGJPd2IoLPCHVDX7A-1 Received: by mail-il1-f198.google.com with SMTP id b9-20020a926709000000b00322ffd64968so1492153ilc.13 for ; Thu, 16 Mar 2023 14:56:49 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679003808; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:subject:cc:to:from:date:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=yBfyzuvRoe+QOXGN5SbtszQ6PVd6w0zcMfyA5scQ3Zo=; b=Nnh+abNPtIC1dJuJ7U+1CyiUMvbpUTR2CTT2t93nIqaqO72KqogFr8QarAhDVFmCGb Srs+n+OcrjFi0PCLZFjtaE9JNcwvv3PCbZlhy1UmcJHaGO1H3aUsaEAcHtsifjsmFU0k 9Ult+kUSrjOx1mJBu/i6xc9nnc3LJGvmJE7QyWUc9bDngZEhE3MmxfjvyYHIU7C1moZF /z/U2QV4fU12NAsrcicj4WUQtL8aK4SvdWGHgUS5sYBHEA58HPZvi6O92heQKML3/Buf PMcU/qgVX71Og+N9eDejFgvwBquiKdD7OuXl1jBbP5gdBFcAqX0FlIGKLd9ZIW1yYXjB GyzQ== X-Gm-Message-State: AO0yUKUGUX2C9PsU8Unciztr28rx0C/YzjMbHgtj7U28Pz+7kBKPnj8D e9A2dwEHyOG9UMf4XhgzvtsjJzZwy68lT5c4ql2VQXCTqTBpvtjNodK8y+caz28ILub7NicPTlp A4qUTG7WQjwfmwWPdfk97m7ZU X-Received: by 2002:a92:da8e:0:b0:315:53b6:d293 with SMTP id u14-20020a92da8e000000b0031553b6d293mr8470341iln.21.1679003808585; Thu, 16 Mar 2023 14:56:48 -0700 (PDT) X-Google-Smtp-Source: AK7set8hE9qRjiVkADLNj9Ym+CS3y91gsTlZRJSIxUV9UT3cXYpWiTMNkB+43yCbBxsrCLOPFbEbcA== X-Received: by 2002:a92:da8e:0:b0:315:53b6:d293 with SMTP id u14-20020a92da8e000000b0031553b6d293mr8470326iln.21.1679003808286; Thu, 16 Mar 2023 14:56:48 -0700 (PDT) Received: from redhat.com ([38.15.36.239]) by smtp.gmail.com with ESMTPSA id c13-20020a928e0d000000b003157b2c504bsm137825ild.24.2023.03.16.14.56.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 16 Mar 2023 14:56:47 -0700 (PDT) Date: Thu, 16 Mar 2023 15:56:46 -0600 From: Alex Williamson To: Reinette Chatre Cc: jgg@nvidia.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, tglx@linutronix.de, darwi@linutronix.de, kvm@vger.kernel.org, dave.jiang@intel.com, jing2.liu@intel.com, ashok.raj@intel.com, fenghua.yu@intel.com, tom.zanussi@linux.intel.com, linux-kernel@vger.kernel.org Subject: Re: [RFC PATCH 0/8] vfio/pci: Support dynamic allocation of MSI-X interrupts Message-ID: <20230316155646.07ae266f.alex.williamson@redhat.com> In-Reply-To: References: X-Mailer: Claws Mail 4.1.1 (GTK 3.24.35; x86_64-redhat-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 15 Mar 2023 13:59:20 -0700 Reinette Chatre wrote: > == Cover letter == > > Qemu allocates interrupts incrementally at the time the guest unmasks an > interrupt, for example each time a Linux guest runs request_irq(). > > Dynamic allocation of MSI-X interrupts was not possible until v6.2 [1]. > This prompted Qemu to, when allocating a new interrupt, first release all > previously allocated interrupts (including disable of MSI-X) followed > by re-allocation of all interrupts that includes the new interrupt. > Please see [2] for a detailed discussion about this issue. > > Releasing and re-allocating interrupts may be acceptable if all > interrupts are unmasked during device initialization. If unmasking of > interrupts occur during runtime this may result in lost interrupts. > For example, consider an accelerator device with multiple work queues, > each work queue having a dedicated interrupt. A work queue can be > enabled at any time with its associated interrupt unmasked while other > work queues are already active. Having all interrupts released and MSI-X > disabled to enable the new work queue will impact active work queues. > > This series builds on the recent interrupt sub-system core changes > that added support for dynamic MSI-X allocation after initial MSI-X > enabling. > > Add support for dynamic MSI-X allocation to vfio-pci. A flag > indicating lack of support for dynamic allocation already exist: > VFIO_IRQ_INFO_NORESIZE and has always been set for MSI and MSI-X. With > support for dynamic MSI-X the flag is cleared for MSI-X, enabling > Qemu to modify its behavior. > > == Why is this an RFC ? == > > vfio support for dynamic MSI-X needs to work with existing user space > as well as upcoming user space that takes advantage of this feature. > I would appreciate guidance on the expectations and requirements > surrounding error handling when considering existing user space. > > For example, consider the following scenario: > Start: Consider a passthrough device that supports 10 MSI-X interrupts > (0 to 9) and existing Qemu allocated interrupts 0 to 4. > > Scenario: Qemu (hypothetically) attempts to associate a new action to > interrupts 0 to 7. Early checking of this range in > vfio_set_irqs_validate_and_prepare() will pass since it > is a valid range for the device. What happens after the > early checking is considered next: > > Current behavior (before this series): Since the provided range, 0 - 7, > exceeds the allocated range, no action will be taken on existing > allocated interrupts 0 - 4 and the Qemu request will fail without > making any state changes. I must be missing something, it was described correctly earlier that QEMU will disable MSI-X and re-enable with a new vector count. Not only does QEMU not really have a way to fail this change, pretty much nothing would work if we did. What happens in this case is that the QEMU vfio-pci driver gets a vector_use callback for one of these new vectors {5,6,7}, vfio_msix_vector_do_use() checks that's greater than we have enabled, disables MSI-X, and re-enables it for the new vector count. Worst case we'll do that 3 times if the vectors are presented in ascending order. > New behavior (with this series): Since vfio supports dynamic MSI-X new > interrupts will be allocated for vectors 5, 6, and 7. Please note > that this changes the behavior encountered by unmodified Qemu: new > interrupts are allocated instead of returning an error. Even more, > since the range provided by Qemu spans 0 - 7, a failure during > allocation of 5, 6, or 7 will result in release of entire range. As above, QEMU doesn't currently return an error here, nor is there actually any means to return an error. MSI-X is not paravirtualized and the PCI spec definition of MSI-X does not define that a driver needs to check whether the device accepted unmasking a vector. All we can do in QEMU if the host fails to configure the device as directed is print an error message as a breadcrumb to help debug why the device stopped working. In practice, I don't think I've ever seen this. So really the difference should be transparent to the guest. The risk of an error allocating vectors on the host is the same, the only significant difference should be the amount of disruption at the device that we can better approximate the request of the guest. > This series aims to maintain existing error behavior for MSI (please see > "vfio/pci: Remove interrupt context counter") but I would appreciate your > guidance on whether existing error behavior should be maintained for MSI-X > and how to do so if it is a requirement. Based on above, there really can never be an error if we expect the device to work, so I think there's a misread of the current status. Dynamic MSI-X support should simply reduce the disruption and chance of lost interrupts at the device, but the points where we risk that the host cannot provide the configuration we need are the same. Thanks, Alex