Received: by 2002:a25:8b12:0:0:0:0:0 with SMTP id i18csp1828698ybl; Wed, 28 Aug 2019 22:42:28 -0700 (PDT) X-Google-Smtp-Source: APXvYqwDuJS+mvhMjqYlHukAU8vuFp0XYo8BM8ybBwT+D5WtdBCru6Dp9Ui6RvLITPHDwX+0XOzs X-Received: by 2002:a17:90a:3a8d:: with SMTP id b13mr8086555pjc.75.1567057348805; Wed, 28 Aug 2019 22:42:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1567057348; cv=none; d=google.com; s=arc-20160816; b=D9G7VKWBWEH+Mu6Yys7YT9esuxA+ZhXk632IJImsHxHhXDJK3FRmzKv/Lc2m9f1dKP cSG6gXhVTsW5WnwncajdEz+b9dAXqEzD0A8jvzjrQNNUG1yPt2s79GPqXqlAT+HoJgsM 7/3SnPGU78HHpsfLQut6C/jZ+fn0W0DhIA1ROnzxkg8szSjYZpBJIhWANmreDSKaHHmn bptopEX9BrD9RTKKp3l+bOoXVEe4rpiBiThCf7TBhrN7JIE2GrZmUJkOzuq6ciLUkcY/ olC/RWDT3KEhgnH7kvtW184MiNBsiOxG3hW3f7uAHQEq40VwiQggp2+7CZwRJjWwoo+q yZ3w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:in-reply-to :mime-version:user-agent:date:message-id:from:references:cc:to :subject; bh=0szv72V/YVVB37MMhLHZRgWJMTpiXCjJnHKMgYSiJso=; b=Hg5B3RiShE/BHzUh6qHO2fqgSHdk2FcWWYle/qwbzfNjv0FwI5HdCcubZ3X/SAVhEn mMrIoXW2kd184LVEVuaX8XaUlEnJb5V1gIwkeVuDnMigmAX+SAPpmPwpoECGf73gLuwe Wwp2Al4xNYt9K4tdVl+lFGpmM7Scoah0j2O0vIIAUYetJKbGNLx0vX+oBv5WzvzFM1hp RgTj5l0FZ1EEfgMYlIntBexbKDE2gA2WiPpBqcFocCJz8HH5lfugryZyAxJ+jLZ23yGV RjE7y+9QDQYd94HMEgqXNw9i48EaW8QlsUxyJMA8t8rkzfS3SX/f1amozoTJ5JXBWnBI 38/g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r3si978488plb.14.2019.08.28.22.42.13; Wed, 28 Aug 2019 22:42:28 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727392AbfH2FlN (ORCPT + 99 others); Thu, 29 Aug 2019 01:41:13 -0400 Received: from out4436.biz.mail.alibaba.com ([47.88.44.36]:54543 "EHLO out4436.biz.mail.alibaba.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725855AbfH2FlN (ORCPT ); Thu, 29 Aug 2019 01:41:13 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R421e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=luoben@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0TakuPvd_1567057259; Received: from bn0418deMacBook-Pro.local(mailfrom:luoben@linux.alibaba.com fp:SMTPD_---0TakuPvd_1567057259) by smtp.aliyun-inc.com(127.0.0.1); Thu, 29 Aug 2019 13:40:59 +0800 Subject: Re: [PATCH v4 3/3] vfio/pci: make use of irq_update_devid and optimize irq ops To: Alex Williamson Cc: tglx@linutronix.de, linux-kernel@vger.kernel.org, tao.ma@linux.alibaba.com, gerry@linux.alibaba.com, nanhai.zou@linux.alibaba.com, linyunsheng@huawei.com References: <8721e56f15dbcb1e0a1d8fc645def7b9bc752988.1566486156.git.luoben@linux.alibaba.com> <20190827143305.1ac826e1@x1.home> <429ae9ed-de9d-f8de-de65-a41c3a0c501d@linux.alibaba.com> <20190828112331.7178af1a@x1.home> From: Ben Luo Message-ID: Date: Thu, 29 Aug 2019 13:40:59 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.8.0 MIME-Version: 1.0 In-Reply-To: <20190828112331.7178af1a@x1.home> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 在 2019/8/29 上午1:23, Alex Williamson 写道: > On Wed, 28 Aug 2019 18:08:02 +0800 > Ben Luo wrote: > >> 在 2019/8/28 上午4:33, Alex Williamson 写道: >>> On Thu, 22 Aug 2019 23:34:43 +0800 >>> Ben Luo wrote: >>> >>>> When userspace (e.g. qemu) triggers a switch between KVM >>>> irqfd and userspace eventfd, only dev_id of irq action >>>> (i.e. the "trigger" in this patch's context) will be >>>> changed, but a free-then-request-irq action is taken in >>>> current code. And, irq affinity setting in VM will also >>>> trigger a free-then-request-irq action, which actually >>>> changes nothing, but only fires a producer re-registration >>>> to update irte in case that posted-interrupt is in use. >>>> >>>> This patch makes use of irq_update_devid() and optimize >>>> both cases above, which reduces the risk of losing interrupt >>>> and also cuts some overhead. >>>> >>>> Signed-off-by: Ben Luo >>>> --- >>>> drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++------------- >>>> 1 file changed, 74 insertions(+), 38 deletions(-) >>>> >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c >>>> index 3fa3f72..60d3023 100644 >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c >>>> @@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix) >>>> static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev, >>>> int vector, int fd, bool msix) >>>> { >>>> + struct eventfd_ctx *trigger = NULL; >>>> struct pci_dev *pdev = vdev->pdev; >>>> - struct eventfd_ctx *trigger; >>>> int irq, ret; >>>> >>>> if (vector < 0 || vector >= vdev->num_ctx) >>>> return -EINVAL; >>>> >>>> + if (fd >= 0) { >>>> + trigger = eventfd_ctx_fdget(fd); >>>> + if (IS_ERR(trigger)) >>>> + return PTR_ERR(trigger); >>>> + } >>> I think this is a user visible change. Previously the vector is >>> disabled first, then if an error occurs re-enabling, we return an errno >>> with the vector disabled. Here we instead fail the ioctl and leave the >>> state as if it had never happened. For instance with QEMU, if they >>> were trying to change from KVM to userspace signaling and entered this >>> condition, previously the interrupt would signal to neither eventfd, now >>> it would continue signaling to KVM. If QEMU's intent was to emulate >>> vector masking, this could induce unhandled interrupts in the guest. >>> Maybe we need a tear-down on fault here to maintain that behavior, or >>> do you see some justification for the change? >> Thanks for your comments, this reminds me to think more about the >> effects to users. >> >> After I reviewed the related code in Qemu and VFIO, I think maybe there >> is a problem in current behavior >> for the signal path changing case. Qemu has neither recovery nor retry >> code in case that ioctl with >> 'VFIO_DEVICE_SET_IRQS' command fails, so if the old signal path has been >> disabled on fault of setting >> up new path, the corresponding vector may be disabled forever. Following >> is an example from qemu's >> vfio_msix_vector_do_use(): >> >>         ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); >>         g_free(irq_set); >>         if (ret) { >>             error_report("vfio: failed to modify vector, %d", ret); >>         } >> >> I think the singal path before changing should be still working at this >> moment and the caller should keep it >> working if the changing fails, so that at least we still have the old >> path instead of no path. >> >> For masking vector case, the 'fd' should be -1, and the interrupt will >> be freed as before this patch. > QEMU doesn't really have an opportunity to signal an error to the > guest, we're emulating the hardware masking of MSI and MSI-X. The > guest is simply trying to write a mask bit in the vector, there's no > provision in the PCI spec that setting this bit can fail. The current > behavior is that the vector is disabled on error. We can argue whether > that's the optimal behavior, but it's the existing behavior and > changing it would require and evaluation of all existing users. I totally agree with you that masking of MSI and MSI-X should follow current behavior, and my code does follow this behavior when 'fd' == -1, the interrupt will surely be disabled. There is another case that 'fd' is not -1, means userspace just want to change the singal path, this time we do have a chance of encountering error on eventfd_ctx_fdget(fd). So, I think it is better to keep the old path working in this case. But, as you said this may break the expectation of existing users, I make it tear-down on fault in next version. > >>>> + >>>> irq = pci_irq_vector(pdev, vector); >>>> >>>> + /* >>>> + * For KVM-VFIO case, interrupt from passthrough device will be directly >>>> + * delivered to VM after producer and consumer connected successfully. >>>> + * If producer and consumer are disconnected, this interrupt process >>>> + * will fall back to remap mode, where interrupt handler uses 'trigger' >>>> + * to find the right way to deliver the interrupt to VM. So, it is safe >>>> + * to do irq_update_devid() before irq_bypass_unregister_producer() which >>>> + * switches interrupt process to remap mode. To producer and consumer, >>>> + * 'trigger' is only a token used for pairing them togather. >>>> + */ >>>> if (vdev->ctx[vector].trigger) { >>>> - free_irq(irq, vdev->ctx[vector].trigger); >>>> - irq_bypass_unregister_producer(&vdev->ctx[vector].producer); >>>> - kfree(vdev->ctx[vector].name); >>>> - eventfd_ctx_put(vdev->ctx[vector].trigger); >>>> - vdev->ctx[vector].trigger = NULL; >>>> + if (vdev->ctx[vector].trigger == trigger) { >>>> + /* switch back to remap mode */ >>>> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer); >>> I think we leak the fd context we acquired above in this case. >> Thanks for pointing it out, I will fix this in next version. >>> Why do we do anything in this case, couldn't we just 'put' the extra ctx >>> and return 0 here? >> Sorry for confusing and I do this for a reason,  I will add some more >> comments like this: >> >> Unregistration here is for re-resigtraion later, which will trigger the >> reconnection of irq_bypass producer >> and consumer, which in turn calls vmx_update_pi_irte() to update IRTE if >> posted interrupt is in use. >> (vmx_update_pi_irte() will modify IRTE based on the information >> retrieved from KVM.) >> Whether producer token changed or not, irq_bypass_register_producer() is >> a way (seems the only way) to >> update IRTE by VFIO for posted interrupt. The IRTE will be used by IOMMU >> directly to find the target CPU >> for an interrupt posted to VM, since hypervisor is bypassed. > This is only explaining what the bypass de-registration and > re-registration does, not why we need to perform those actions here. > If the trigger is the same as that already attached to this vector, why > is the IRTE changing? Seems this ought to be a no-op for this vector. Sorry, I think it cannot be a no-op here. The interrupt affinity setting in guest also triggers the calling of this function and IRTE will be modified with new affinity information retrieved from KVM's data structure by vmx_update_pi_irte() when posted interrupt is in use, not from the 'trigger'. > >>>> + } else if (trigger) { >>>> + ret = irq_update_devid(irq, >>>> + vdev->ctx[vector].trigger, trigger); >>>> + if (unlikely(ret)) { >>>> + dev_info(&pdev->dev, >>>> + "update devid of %d (token %p) failed: %d\n", >>>> + irq, vdev->ctx[vector].trigger, ret); >>>> + eventfd_ctx_put(trigger); >>>> + return ret; >>>> + } >>>> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer); >>> Can you explain this ordering, I would have expected that we'd >>> unregister the bypass before we updated the devid. Thanks, >>> >>> Alex >> Actually, I have explained this in comments above this whole control >> block. I think it is safe and better to >> update devid before irq_bypass_unregister_producer() which switches >> interrupt process from posted mode >> to remap mode. So, if update fails, the posted interrupt can still work. >> >> Anyway, to producer and consumer,  'trigger' is only a token used for >> pairing them togather. > The bypass is not a guaranteed mechanism, it's an opportunistic > accelerator. If the devid update fails, what state are we left with? > The irq action may not work, but the bypass does; maybe; maybe not all > the time? This seems to fall into the same consistency in userspace > behavior issue identified above. The user ABI is that the vector is > disabled if an error is returned. Thanks, > > Alex Thanks, now I see it is better to unregister the bypass before update the devid, will change ordering in next version. Ben