Date: Wed, 7 Feb 2018 09:57:46 -0700
From: Alex Williamson
To: Auger Eric
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH] vfio/pci: Add ioeventfd support
Message-ID: <20180207095746.14cef887@w520.home>
References: <20180207000731.32764.95992.stgit@gimli.home>
On Wed, 7 Feb 2018 16:46:19 +0100
Auger Eric wrote:

> Hi Alex,
>
> On 07/02/18 01:08, Alex Williamson wrote:
> > The ioeventfd here is actually irqfd handling of an ioeventfd such as
> > supported in KVM.  A user is able to pre-program a device write to
> > occur when the eventfd triggers.  This is yet another instance of
> > eventfd-irqfd triggering between KVM and vfio.  The impetus for this
> > is high frequency writes to pages which are virtualized in QEMU.
> > Enabling this near-direct write path for selected registers within
> > the virtualized page can improve performance and reduce overhead.
> > Specifically this is initially targeted at NVIDIA graphics cards where
> > the driver issues a write to an MMIO register within a virtualized
> > region in order to allow the MSI interrupt to re-trigger.
> >
> > Signed-off-by: Alex Williamson
>
> fyi it does not apply anymore on master (conflict in
> include/uapi/linux/vfio.h on GFX stuff)

Sorry, I should have noted that this was against v4.15; I didn't want
the churn of the merge window since I was benchmarking against this.
Will update for non-RFC.

...
> > +long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
> > +			uint64_t data, int count, int fd)
> > +{
> > +	struct pci_dev *pdev = vdev->pdev;
> > +	loff_t pos = offset & VFIO_PCI_OFFSET_MASK;
> > +	int ret, bar = VFIO_PCI_OFFSET_TO_INDEX(offset);
> > +	struct vfio_pci_ioeventfd *ioeventfd;
> > +	int (*handler)(void *, void *);
> > +	unsigned long val;
> > +
> > +	/* Only support ioeventfds into BARs */
> > +	if (bar > VFIO_PCI_BAR5_REGION_INDEX)
> > +		return -EINVAL;
> > +
> > +	if (pos + count > pci_resource_len(pdev, bar))
> > +		return -EINVAL;
> > +
> > +	/* Disallow ioeventfds working around MSI-X table writes */
> > +	if (bar == vdev->msix_bar &&
> > +	    !(pos + count <= vdev->msix_offset ||
> > +	      pos >= vdev->msix_offset + vdev->msix_size))
> > +		return -EINVAL;
> > +
> > +	switch (count) {
> > +	case 1:
> > +		handler = &vfio_pci_ioeventfd_handler8;
> > +		val = data;
> > +		break;
> > +	case 2:
> > +		handler = &vfio_pci_ioeventfd_handler16;
> > +		val = le16_to_cpu(data);
> > +		break;
> > +	case 4:
> > +		handler = &vfio_pci_ioeventfd_handler32;
> > +		val = le32_to_cpu(data);
> > +		break;
> > +#ifdef iowrite64
> > +	case 8:
> > +		handler = &vfio_pci_ioeventfd_handler64;
> > +		val = le64_to_cpu(data);
> > +		break;
> > +#endif
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	ret = vfio_pci_setup_barmap(vdev, bar);
> > +	if (ret)
> > +		return ret;
> > +
> > +	mutex_lock(&vdev->ioeventfds_lock);
> > +
> > +	list_for_each_entry(ioeventfd, &vdev->ioeventfds_list, next) {
> > +		if (ioeventfd->pos == pos && ioeventfd->bar == bar &&
> > +		    ioeventfd->data == data && ioeventfd->count == count) {
> > +			if (fd == -1) {
> > +				vfio_virqfd_disable(&ioeventfd->virqfd);
> > +				list_del(&ioeventfd->next);
> > +				kfree(ioeventfd);
> > +				ret = 0;
> > +			} else
> > +				ret = -EEXIST;
> > +
> > +			goto out_unlock;
> > +		}
> > +	}
> > +
> > +	if (fd < 0) {
> > +		ret = -ENODEV;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL);
> > +	if (!ioeventfd) {
> > +		ret = -ENOMEM;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ioeventfd->pos = pos;
> > +	ioeventfd->bar = bar;
> > +	ioeventfd->data = data;
> > +	ioeventfd->count = count;
> > +
> > +	ret = vfio_virqfd_enable(vdev->barmap[ioeventfd->bar] + ioeventfd->pos,
>
> nit: bar and pos could be used directly

Indeed, probably a leftover from development.  Fixed and re-wrapped the
following lines.

> > +				 handler, NULL, (void *)val,
> > +				 &ioeventfd->virqfd, fd);
> > +	if (ret) {
> > +		kfree(ioeventfd);
> > +		goto out_unlock;
> > +	}
> > +
> > +	list_add(&ioeventfd->next, &vdev->ioeventfds_list);
> > +
> > +out_unlock:
> > +	mutex_unlock(&vdev->ioeventfds_lock);
> > +
> > +	return ret;
> > +}
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index e3301dbd27d4..07966a5f0832 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
> >
> >  #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
> >
> > +/**
> > + * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
> > + *                              struct vfio_device_ioeventfd)
> > + *
> > + * Perform a write to the device at the specified device fd offset, with
> > + * the specified data and width when the provided eventfd is triggered.
>
> Don't you want to document the limitation to BARs only, excluding the
> MSI-X table?

Sure, we could.  This is also just an acceleration interface; it's not
required for functionality, so a user can probe for support by simply
trying to enable an ioeventfd.  I don't really want to add a flag to
each region to identify support, or create yet another sparse map
identifying which sections of regions are supported.  We could enable
this for PCI config space, but I didn't think it really made sense (and
I was too lazy).
Real PCI config space (not MMIO mirrors of config space on GPUs) should
never be a performance path, so I question whether there's any ROI for
the code to support it.  Device specific regions would need to be
handled on a case by case basis, and I don't think we have any cases
yet where it makes sense (IIRC the IGD regions are read-only).  ROMs
are of course read-only, so it doesn't make sense to support them
either.

I also note that this patch only supports directly assigned vfio-pci
devices, not vfio-platform and not mdev devices.  Since the ioctl is
intended to be device agnostic, maybe we could have a default handler
that simply uses the device file write interface.  At least one issue
there is that those expect a user buffer.  I'll look into how I might
add the support more generically, if perhaps less optimized.

Does it seem like a sufficient user API to try an ioeventfd and be
prepared for it to fail?  Given that this is a new ioctl, users need to
do that for backwards compatibility anyway.

Something I forgot to mention in my rush to send this out is that I
debated whether to create a new ioctl or re-use the SET_IRQS ioctl.  In
the end, I couldn't resolve how to handle the index, start, and count
fields, so I created this new ioctl.  Would we create an ioeventfd
index?  We couldn't simply pass an array of ioeventfds, since each
needs to be associated with an offset, data, and size, so we'd need a
new data type with a structure encompassing those fields.  In the end
it obviously didn't make sense to me to re-use SET_IRQS, but if others
can come up with something that makes sense, I'm open to it.  Thanks,

Alex