Subject: Re: [Qemu-devel] [RFC PATCH] vfio/pci: Add ioeventfd support
To: Alex Williamson
References: <20180207000731.32764.95992.stgit@gimli.home> <20180207095746.14cef887@w520.home>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, qemu-devel@nongnu.org
From: Auger Eric
Message-ID: <070da5ee-2895-0350-3bc7-d17eb1f19344@redhat.com>
Date: Thu, 8 Feb 2018 14:48:13 +0100
In-Reply-To: <20180207095746.14cef887@w520.home>

Hi Alex,

On 07/02/18 17:57, Alex Williamson wrote:
> On Wed, 7 Feb 2018 16:46:19 +0100
> Auger Eric wrote:
>
>> Hi Alex,
>>
>> On 07/02/18 01:08, Alex Williamson wrote:
>>> The ioeventfd here is actually irqfd handling of an ioeventfd such as
>>> supported in KVM. A user is able to pre-program a device write to
>>> occur when the eventfd triggers. This is yet another instance of
>>> eventfd-irqfd triggering between KVM and vfio. The impetus for this
>>> is high frequency writes to pages which are virtualized in QEMU.
>>> Enabling this near-direct write path for selected registers within
>>> the virtualized page can improve performance and reduce overhead.
>>> Specifically this is initially targeted at NVIDIA graphics cards where
>>> the driver issues a write to an MMIO register within a virtualized
>>> region in order to allow the MSI interrupt to re-trigger.
>>>
>>> Signed-off-by: Alex Williamson
>>
>> fyi it does not apply anymore on master (conflict in
>> include/uapi/linux/vfio.h on GFX stuff)
>
> Sorry, I should have noted that this was against v4.15, I didn't want
> the churn of the merge window since I was benchmarking against this.
> Will update for non-RFC.
>
> ...
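
To make the round trip above concrete: KVM's existing ioeventfd turns the
trapped guest write to the virtualized page into an eventfd signal, and the
code quoted below lets vfio replay a pre-programmed write to the real BAR
when that same eventfd fires. A rough sketch of the KVM half, using the
existing KVM_IOEVENTFD ioctl (the guest address and the 4-byte width are
illustrative assumptions, not values taken from this patch):

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/*
 * Sketch only: have KVM signal an eventfd whenever the guest writes 4
 * bytes to a trapped MMIO address, instead of exiting to userspace.
 * vm_fd is an open KVM VM fd; guest_mmio_addr is hypothetical.
 */
static int kvm_wire_ioeventfd(int vm_fd, __u64 guest_mmio_addr)
{
        struct kvm_ioeventfd kick = {
                .addr  = guest_mmio_addr,  /* trapped guest MMIO address */
                .len   = 4,                /* width of the guest write */
                .flags = 0,                /* no datamatch: any value kicks */
        };

        kick.fd = eventfd(0, EFD_CLOEXEC);
        if (kick.fd < 0)
                return -1;

        if (ioctl(vm_fd, KVM_IOEVENTFD, &kick) < 0) {
                close(kick.fd);            /* fall back to trap-and-emulate */
                return -1;
        }

        return kick.fd;    /* this fd is what the new vfio ioctl consumes */
}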
>>> +long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
>>> +                        uint64_t data, int count, int fd)
>>> +{
>>> +        struct pci_dev *pdev = vdev->pdev;
>>> +        loff_t pos = offset & VFIO_PCI_OFFSET_MASK;
>>> +        int ret, bar = VFIO_PCI_OFFSET_TO_INDEX(offset);
>>> +        struct vfio_pci_ioeventfd *ioeventfd;
>>> +        int (*handler)(void *, void *);
>>> +        unsigned long val;
>>> +
>>> +        /* Only support ioeventfds into BARs */
>>> +        if (bar > VFIO_PCI_BAR5_REGION_INDEX)
>>> +                return -EINVAL;
>>> +
>>> +        if (pos + count > pci_resource_len(pdev, bar))
>>> +                return -EINVAL;
>>> +
>>> +        /* Disallow ioeventfds working around MSI-X table writes */
>>> +        if (bar == vdev->msix_bar &&
>>> +            !(pos + count <= vdev->msix_offset ||
>>> +              pos >= vdev->msix_offset + vdev->msix_size))
>>> +                return -EINVAL;
>>> +
>>> +        switch (count) {
>>> +        case 1:
>>> +                handler = &vfio_pci_ioeventfd_handler8;
>>> +                val = data;
>>> +                break;
>>> +        case 2:
>>> +                handler = &vfio_pci_ioeventfd_handler16;
>>> +                val = le16_to_cpu(data);
>>> +                break;
>>> +        case 4:
>>> +                handler = &vfio_pci_ioeventfd_handler32;
>>> +                val = le32_to_cpu(data);
>>> +                break;
>>> +#ifdef iowrite64
>>> +        case 8:
>>> +                handler = &vfio_pci_ioeventfd_handler64;
>>> +                val = le64_to_cpu(data);
>>> +                break;
>>> +#endif
>>> +        default:
>>> +                return -EINVAL;
>>> +        }
>>> +
>>> +        ret = vfio_pci_setup_barmap(vdev, bar);
>>> +        if (ret)
>>> +                return ret;
>>> +
>>> +        mutex_lock(&vdev->ioeventfds_lock);
>>> +
>>> +        list_for_each_entry(ioeventfd, &vdev->ioeventfds_list, next) {
>>> +                if (ioeventfd->pos == pos && ioeventfd->bar == bar &&
>>> +                    ioeventfd->data == data && ioeventfd->count == count) {
>>> +                        if (fd == -1) {
>>> +                                vfio_virqfd_disable(&ioeventfd->virqfd);
>>> +                                list_del(&ioeventfd->next);
>>> +                                kfree(ioeventfd);
>>> +                                ret = 0;
>>> +                        } else
>>> +                                ret = -EEXIST;
>>> +
>>> +                        goto out_unlock;
>>> +                }
>>> +        }
>>> +
>>> +        if (fd < 0) {
>>> +                ret = -ENODEV;
>>> +                goto out_unlock;
>>> +        }
>>> +
>>> +        ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL);
>>> +        if (!ioeventfd) {
>>> +                ret = -ENOMEM;
>>> +                goto out_unlock;
>>> +        }
>>> +
>>> +        ioeventfd->pos = pos;
>>> +        ioeventfd->bar = bar;
>>> +        ioeventfd->data = data;
>>> +        ioeventfd->count = count;
>>> +
>>> +        ret = vfio_virqfd_enable(vdev->barmap[ioeventfd->bar] + ioeventfd->pos,
>> nit: bar and pos could be used directly
>
> Indeed, probably leftover from development. Fixed and re-wrapped the
> following lines.
>
>>> +                                 handler, NULL, (void *)val,
>>> +                                 &ioeventfd->virqfd, fd);
>>> +        if (ret) {
>>> +                kfree(ioeventfd);
>>> +                goto out_unlock;
>>> +        }
>>> +
>>> +        list_add(&ioeventfd->next, &vdev->ioeventfds_list);
>>> +
>>> +out_unlock:
>>> +        mutex_unlock(&vdev->ioeventfds_lock);
>>> +
>>> +        return ret;
>>> +}
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index e3301dbd27d4..07966a5f0832 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
>>>  
>>>  #define VFIO_DEVICE_PCI_HOT_RESET        _IO(VFIO_TYPE, VFIO_BASE + 13)
>>>  
>>> +/**
>>> + * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
>>> + *                              struct vfio_device_ioeventfd)
>>> + *
>>> + * Perform a write to the device at the specified device fd offset, with
>>> + * the specified data and width when the provided eventfd is triggered.
>> don't you want to document the limitation on BAR only and excluding the
>> MSIx table.
>
> Sure, we could. This is also just an acceleration interface, it's not
> required for functionality, so a user can probe the capabilities by
> trying to enable an ioeventfd to test for support. I don't really want
> to add a flag to each region to identify support or create yet another
> sparse map identifying which sections of regions are supported. We
> could enable this for PCI config space, but I didn't think it really
> made sense (and I was too lazy). Real PCI config space (not MMIO
> mirrors of config space on GPUs) should never be a performance path,
> therefore I question if there's any ROI for the code to support it.
> Device specific regions would need to be handled on a case by case
> basis, and I don't think we have any cases yet where it makes sense
> (IIRC the IGD regions are read-only). Of course ROMs are read-only, so
> it doesn't make sense to support them.
>
> I also note that this patch of course only supports directly assigned
> vfio-pci devices, not vfio-platform and not mdev devices. Since the
> ioctl is intended to be device agnostic, maybe we could have a default
> handler that simply uses the device file write interface. At least one
> issue there is that those expect a user buffer. I'll look into how I
> might add the support more generically, if perhaps less optimized.
OK
>
> Does it seem like a sufficient user API to try an ioeventfd and be
> prepared for it to fail? Given that this is a new ioctl, users need to
> do that for backwards compatibility anyway.
looks OK to me.
>
> Something I forgot to mention in my rush to send this out is that I
> debated whether to create a new ioctl or re-use the SET_IRQS ioctl. In
> the end, I couldn't resolve how to handle the index, start, and count
> fields, so I created this new ioctl. Would we create an ioeventfd
> index? We couldn't simply pass an array of ioeventfds since each needs
> to be associated with an offset, data, and size, so we'd need a new
> data type with a structure encompassing those fields. In the end it
> obviously didn't make sense to me to re-use SET_IRQS, but if others can
> come up with something that makes sense, I'm open to it. Thanks,
as far as I am concerned, I prefer this API rather than the SET_IRQS one ;-)

Thanks

Eric
>
> Alex
>
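
The probing model discussed above (just try the ioctl and fall back on
failure) might look roughly like the sketch below in userspace. The struct
layout, flag name and ioctl name are taken from the vfio uapi header and are
assumptions against this RFC revision; they must match the kernel headers
actually in use, and the offset/data values are purely illustrative.

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/*
 * Sketch only: ask the kernel to replay a 4-byte write of 'data' at
 * 'offset' (a vfio device fd offset within a BAR) whenever the returned
 * eventfd is signalled.  If the ioctl fails -- older kernel, unsupported
 * region, MSI-X table overlap -- the caller keeps the existing
 * trap-and-emulate path, so no capability flag or sparse map is needed.
 */
static int vfio_try_ioeventfd(int device_fd, uint64_t offset, uint32_t data)
{
        struct vfio_device_ioeventfd ioeventfd = {
                .argsz  = sizeof(ioeventfd),
                .flags  = VFIO_DEVICE_IOEVENTFD_32,   /* 4-byte write */
                .offset = offset,
                .data   = data,
                .fd     = eventfd(0, EFD_CLOEXEC),
        };

        if (ioeventfd.fd < 0)
                return -1;

        if (ioctl(device_fd, VFIO_DEVICE_IOEVENTFD, &ioeventfd) < 0) {
                close(ioeventfd.fd);    /* not supported here: fall back */
                return -1;
        }

        return ioeventfd.fd;    /* hand the same fd to KVM_IOEVENTFD */
}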