Received: by 2002:a25:8b12:0:0:0:0:0 with SMTP id i18csp1155347ybl; Wed, 28 Aug 2019 10:24:47 -0700 (PDT) X-Google-Smtp-Source: APXvYqx6a0JIK/z9AVfxgQMb4w+amcH5BHxWS7c37MCvw0gE7eSLI1c8H+Yd7wC0PqQ9ox96oCCF X-Received: by 2002:a65:6256:: with SMTP id q22mr4440364pgv.408.1567013086953; Wed, 28 Aug 2019 10:24:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1567013086; cv=none; d=google.com; s=arc-20160816; b=Yrys5VVqult9ptN+VgY1l+7TSPAzS9exIb6oW7t84XaMh8vwhFhUs7B1mnm9Ciya6o w1gJWbqHlHmS0ZDaY32KQYFbBRMw9E859ZW3+3dUorHwbRhZpZBATLQin+E4SLW0flYa TLXLaVElgRMw60jZ3cHSrhTB3LR2JtE0WcbdUyVfpKAeTx/R/1UyFGjc6AAEPicCXTJm 65QDC0ypM+NdSkb7V0LwWb92OoywqDYfxh7vURd44T++Zjc5Y2EJkbGOiEzOfKP7zlVZ /MVISdaAB57cmXdVydT41MSghpHdLV/2XlVQo9XyUQ3062SmLshp13iq5aWpfYfWPImg yDZQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :organization:references:in-reply-to:message-id:subject:cc:to:from :date; bh=3u2FC1e2tyuaLuUO7nMKgCL9cP3LfmDPaBae1b7Ey6Q=; b=Ddo+LVc/p8g0rN/73q4G+l4w99gmBv8tXLVt7J706TewFUFmW9qYJFvvs15joc874O /C9y5/PNrX/T3qmrYi9ARKaMF/8ytpC9wuZxK2ZJudU70DnKk0BQ7SjXUneFm0LXsIs2 hnDFOKq86633a7oXfuESau6so6UTQ51lHJSG22KV7FhorlZYq5WFkyt5MmhR/cmyYD0a +qYfySvXc8pLB1HbiLlX3vjL+y7gDca3YwLSsa/xg42dKbMu1eYH/4TmTxmtM5S2RBY2 t5DWZRRbMKRb/kWevRuSJzpuCCkiyHmUh4Fem4d9bm06vscC/1zxvIFhprQ98GJviP/D 6Bmw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 33si2388276pli.149.2019.08.28.10.24.30; Wed, 28 Aug 2019 10:24:46 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726616AbfH1RXg convert rfc822-to-8bit (ORCPT + 99 others); Wed, 28 Aug 2019 13:23:36 -0400 Received: from mx1.redhat.com ([209.132.183.28]:42348 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726504AbfH1RXf (ORCPT ); Wed, 28 Aug 2019 13:23:35 -0400 Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com [10.5.11.22]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id AD10A4627A; Wed, 28 Aug 2019 17:23:34 +0000 (UTC) Received: from x1.home (ovpn-116-131.phx2.redhat.com [10.3.116.131]) by smtp.corp.redhat.com (Postfix) with ESMTP id 83218100197A; Wed, 28 Aug 2019 17:23:33 +0000 (UTC) Date: Wed, 28 Aug 2019 11:23:31 -0600 From: Alex Williamson To: Ben Luo Cc: tglx@linutronix.de, linux-kernel@vger.kernel.org, tao.ma@linux.alibaba.com, gerry@linux.alibaba.com, nanhai.zou@linux.alibaba.com, linyunsheng@huawei.com Subject: Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops Message-ID: <20190828112331.7178af1a@x1.home> In-Reply-To: <429ae9ed-de9d-f8de-de65-a41c3a0c501d@linux.alibaba.com> References: <8721e56f15dbcb1e0a1d8fc645def7b9bc752988.1566486156.git.luoben@linux.alibaba.com> <20190827143305.1ac826e1@x1.home> <429ae9ed-de9d-f8de-de65-a41c3a0c501d@linux.alibaba.com> Organization: Red Hat MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.29]); Wed, 28 Aug 2019 17:23:34 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 28 Aug 2019 18:08:02 +0800 Ben Luo wrote: > 在 2019/8/28 上午4:33, Alex Williamson 写道: > > On Thu, 22 Aug 2019 23:34:43 +0800 > > Ben Luo wrote: > > > >> When userspace (e.g. qemu) triggers a switch between KVM > >> irqfd and userspace eventfd, only dev_id of irq action > >> (i.e. the "trigger" in this patch's context) will be > >> changed, but a free-then-request-irq action is taken in > >> current code. And, irq affinity setting in VM will also > >> trigger a free-then-request-irq action, which actually > >> changes nothing, but only fires a producer re-registration > >> to update irte in case that posted-interrupt is in use. > >> > >> This patch makes use of irq_update_devid() and optimize > >> both cases above, which reduces the risk of losing interrupt > >> and also cuts some overhead. > >> > >> Signed-off-by: Ben Luo > >> --- > >> drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++------------- > >> 1 file changed, 74 insertions(+), 38 deletions(-) > >> > >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > >> index 3fa3f72..60d3023 100644 > >> --- a/drivers/vfio/pci/vfio_pci_intrs.c > >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c > >> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix) > >> static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev, > >> int vector, int fd, bool msix) > >> { > >> + struct eventfd_ctx *trigger = NULL; > >> struct pci_dev *pdev = vdev->pdev; > >> - struct eventfd_ctx *trigger; > >> int irq, ret; > >> > >> if (vector < 0 || vector >= vdev->num_ctx) > >> return -EINVAL; > >> > >> + if (fd >= 0) { > >> + trigger = eventfd_ctx_fdget(fd); > >> + if (IS_ERR(trigger)) > >> + return PTR_ERR(trigger); > >> + } > > I think this is a user visible change. Previously the vector is > > disabled first, then if an error occurs re-enabling, we return an errno > > with the vector disabled. Here we instead fail the ioctl and leave the > > state as if it had never happened. For instance with QEMU, if they > > were trying to change from KVM to userspace signaling and entered this > > condition, previously the interrupt would signal to neither eventfd, now > > it would continue signaling to KVM. If QEMU's intent was to emulate > > vector masking, this could induce unhandled interrupts in the guest. > > Maybe we need a tear-down on fault here to maintain that behavior, or > > do you see some justification for the change? > > Thanks for your comments, this reminds me to think more about the > effects to users. > > After I reviewed the related code in Qemu and VFIO, I think maybe there > is a problem in current behavior > for the signal path changing case. Qemu has neither recovery nor retry > code in case that ioctl with > 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been > disabled on fault of setting > up new path, the corresponding vector may be disabled forever. Following > is an example from qemu's > vfio_msix_vector_do_use(): > >         ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); >         g_free(irq_set); >         if (ret) { >             error_report("vfio: failed to modify vector, %d", ret); >         } > > I think the singal path before changing should be still working at this > moment and the caller should keep it > working if the changing fails, so that at least we still have the old > path instead of no path. > > For masking vector case, the 'fd' should be -1, and the interrupt will > be freed as before this patch. QEMU doesn't really have an opportunity to signal an error to the guest, we're emulating the hardware masking of MSI and MSI-X. The guest is simply trying to write a mask bit in the vector, there's no provision in the PCI spec that setting this bit can fail. The current behavior is that the vector is disabled on error. We can argue whether that's the optimal behavior, but it's the existing behavior and changing it would require and evaluation of all existing users. > >> + > >> irq = pci_irq_vector(pdev, vector); > >> > >> + /* > >> + * For KVM-VFIO case, interrupt from passthrough device will be directly > >> + * delivered to VM after producer and consumer connected successfully. > >> + * If producer and consumer are disconnected, this interrupt process > >> + * will fall back to remap mode, where interrupt handler uses 'trigger' > >> + * to find the right way to deliver the interrupt to VM. So, it is safe > >> + * to do irq_update_devid() before irq_bypass_unregister_producer() which > >> + * switches interrupt process to remap mode. To producer and consumer, > >> + * 'trigger' is only a token used for pairing them togather. > >> + */ > >> if (vdev->ctx[vector].trigger) { > >> - free_irq(irq, vdev->ctx[vector].trigger); > >> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer); > >> - kfree(vdev->ctx[vector].name); > >> - eventfd_ctx_put(vdev->ctx[vector].trigger); > >> - vdev->ctx[vector].trigger = NULL; > >> + if (vdev->ctx[vector].trigger == trigger) { > >> + /* switch back to remap mode */ > >> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer); > > I think we leak the fd context we acquired above in this case. > Thanks for pointing it out, I will fix this in next version. > > > > Why do we do anything in this case, couldn't we just 'put' the extra ctx > > and return 0 here? > > Sorry for confusing and I do this for a reason,  I will add some more > comments like this: > > Unregistration here is for re-resigtraion later, which will trigger the > reconnection of irq_bypass producer > and consumer, which in turn calls vmx_update_pi_irte() to update IRTE if > posted interrupt is in use. > (vmx_update_pi_irte() will modify IRTE based on the information > retrieved from KVM.) > Whether producer token changed or not, irq_bypass_register_producer() is > a way (seems the only way) to > update IRTE by VFIO for posted interrupt. The IRTE will be used by IOMMU > directly to find the target CPU > for an interrupt posted to VM, since hypervisor is bypassed. This is only explaining what the bypass de-registration and re-registration does, not why we need to perform those actions here. If the trigger is the same as that already attached to this vector, why is the IRTE changing? Seems this ought to be a no-op for this vector. > >> + } else if (trigger) { > >> + ret = irq_update_devid(irq, > >> + vdev->ctx[vector].trigger, trigger); > >> + if (unlikely(ret)) { > >> + dev_info(&pdev->dev, > >> + "update devid of %d (token %p) failed: %d\n", > >> + irq, vdev->ctx[vector].trigger, ret); > >> + eventfd_ctx_put(trigger); > >> + return ret; > >> + } > >> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer); > > Can you explain this ordering, I would have expected that we'd > > unregister the bypass before we updated the devid. Thanks, > > > > Alex > Actually, I have explained this in comments above this whole control > block. I think it is safe and better to > update devid before irq_bypass_unregister_producer() which switches > interrupt process from posted mode > to remap mode. So, if update fails, the posted interrupt can still work. > > Anyway, to producer and consumer,  'trigger' is only a token used for > pairing them togather. The bypass is not a guaranteed mechanism, it's an opportunistic accelerator. If the devid update fails, what state are we left with? The irq action may not work, but the bypass does; maybe; maybe not all the time? This seems to fall into the same consistency in userspace behavior issue identified above. The user ABI is that the vector is disabled if an error is returned. Thanks, Alex