Date: Tue, 10 Apr 2018 12:57:23 +0800
From: Tiwei Bie
To: Jason Wang
Cc: mst@redhat.com, alex.williamson@redhat.com, ddutile@redhat.com,
    alexander.h.duyck@intel.com, virtio-dev@lists.oasis-open.org,
    linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
    virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
    dan.daly@intel.com, cunming.liang@intel.com, zhihong.wang@intel.com,
    jianfeng.tan@intel.com, xiao.w.wang@intel.com
Subject: Re: [RFC] vhost: introduce mdev based hardware vhost backend
Message-ID: <20180410045723.rftsb7l4l3ip2ioi@debian>
References: <20180402152330.4158-1-tiwei.bie@intel.com> <622f4bd7-1249-5545-dc5a-5a92b64f5c26@redhat.com>
In-Reply-To: <622f4bd7-1249-5545-dc5a-5a92b64f5c26@redhat.com>

On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018/04/02 23:23, Tiwei Bie wrote:
> > This patch introduces a mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially any device that uses
> > a virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> >
> > What's this
> > ===========
> >
> > The idea is that we can set up a virtio ring compatible device
> > with the messages available at the vhost backend. Originally,
> > these messages are used to implement a software vhost backend,
> > but now we will use them to set up a virtio ring compatible
> > hardware device. The hardware device will then be able to work
> > with the guest virtio driver in the VM just like the software
> > backend does. That is to say, we can implement a hardware based
> > vhost backend in QEMU, and any virtio ring compatible device can
> > potentially be used with this backend. (We also call it vDPA --
> > vhost Data Path Acceleration.)
> >
> > One problem is that different virtio ring compatible devices may
> > have different device interfaces. That is to say, we would need
> > different drivers in QEMU. It could be troublesome. And that's
> > what this patch is trying to fix. The idea behind this patch is
> > very simple: mdev is a standard way to emulate devices in the
> > kernel.
>
> So you just move the abstraction layer from qemu to the kernel, and you
> still need different drivers in the kernel for the different device
> interfaces of accelerators. This looks even more complex than leaving it
> in qemu. As you said, another idea is to implement a userspace vhost
> backend for accelerators, which seems easier and could cooperate with
> other parts of qemu without inventing a new type of messages.

I'm not quite sure. Do you think it's acceptable to
add various vendor-specific hardware drivers in QEMU?

>
> Need careful thought here to seek the best solution.

Yeah, definitely! :)
And your opinions would be very helpful!

>
> > So we defined a standard device based on mdev, which is able to
> > accept vhost messages. When the mdev emulation code (i.e. the
> > generic mdev parent ops provided by this patch) gets vhost
> > messages, it will parse and deliver them to the accelerator
> > drivers. The drivers can use these messages to set up the
> > accelerators.
> >
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> >
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
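As an aside on what that single userspace driver has to do first: the
vhost_vfio_write()/vhost_vfio_read() helpers quoted further down assume a
VFIO device fd plus the file offsets of the BAR0/BAR1 regions. A minimal
sketch of how those could be obtained with the standard VFIO ioctls is
below; vhost_vfio_probe_bars() is a made-up name, and the usual VFIO
container/group setup and error reporting are omitted.

/* Sketch only (not from the patch): locate the BAR0/BAR1 file offsets of
 * the mdev instance. `device_fd` is assumed to come from the usual VFIO
 * setup (VFIO_GROUP_GET_DEVICE_FD with the mdev UUID as the device name). */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int vhost_vfio_probe_bars(int device_fd, uint64_t *bar0_offset,
                                 uint64_t *bar1_offset)
{
    struct vfio_region_info reg = { .argsz = sizeof(reg) };

    /* BAR0: the message-based control region, accessed at offset 0. */
    reg.index = VFIO_PCI_BAR0_REGION_INDEX;
    if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
        return -1;
    *bar0_offset = reg.offset;

    /* BAR1: the per-queue notification region (possibly mmap-able). */
    reg.index = VFIO_PCI_BAR1_REGION_INDEX;
    if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
        return -1;
    *bar1_offset = reg.offset;

    return 0;
}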
> >
> > Difference between vDPA and PCI passthru
> > ========================================
> >
> > The key difference between vDPA and PCI passthru is that, in
> > vDPA, only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is passed through to the VM, while
> > the device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> >
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> >
> > - consistent device interface for the guest OS in the VM;
> > - maximum flexibility in the hardware design; in particular, the
> >   accelerator for each vhost backend doesn't have to be a full
> >   PCI device;
> > - leveraging the existing virtio live-migration framework.
> >
> > The interface of this mdev based device
> > =======================================
> >
> > 1. BAR0
> >
> > The MMIO region described by BAR0 is the main control
> > interface. Messages will be written to or read from
> > this region.
> >
> > The message type is determined by the `request` field
> > in the message header. The message size is encoded in
> > the message header too. The message format looks like this:
> >
> > struct vhost_vfio_op {
> > 	__u64 request;
> > 	__u32 flags;
> > 	/* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether a reply is needed */
> > 	__u32 size;
> > 	union {
> > 		__u64 u64;
> > 		struct vhost_vring_state state;
> > 		struct vhost_vring_addr addr;
> > 		struct vhost_memory memory;
> > 	} payload;
> > };
> >
> > The existing vhost-kernel ioctl cmds are reused as
> > the message requests in the above structure.
> >
> > Each message will be written to or read from this
> > region at offset 0:
> >
> > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > 	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > 	struct vhost_vfio *vfio = dev->opaque;
> > 	int ret;
> >
> > 	ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
> > 	if (ret != count)
> > 		return -1;
> >
> > 	return 0;
> > }
> >
> > int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > 	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > 	struct vhost_vfio *vfio = dev->opaque;
> > 	uint64_t request = op->request;
> > 	int ret;
> >
> > 	ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
> > 	if (ret != count || request != op->request)
> > 		return -1;
> >
> > 	return 0;
> > }
> >
> > It's quite straightforward to set things on the device.
> > We just need to write the message to the device directly:
> >
> > int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
> > {
> > 	struct vhost_vfio_op op;
> >
> > 	op.request = VHOST_SET_FEATURES;
> > 	op.flags = 0;
> > 	op.size = sizeof(features);
> > 	op.payload.u64 = features;
> >
> > 	return vhost_vfio_write(dev, &op);
> > }
> >
> > To get things from the device, two steps are needed.
> > Take VHOST_GET_FEATURES as an example:
> >
> > int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
> > {
> > 	struct vhost_vfio_op op;
> > 	int ret;
> >
> > 	op.request = VHOST_GET_FEATURES;
> > 	op.flags = VHOST_VFIO_NEED_REPLY;
> > 	op.size = 0;
> >
> > 	/* Just need to write the header */
> > 	ret = vhost_vfio_write(dev, &op);
> > 	if (ret != 0)
> > 		goto out;
> >
> > 	/* `op` wasn't changed during write */
> > 	op.flags = 0;
> > 	op.size = sizeof(*features);
> >
> > 	ret = vhost_vfio_read(dev, &op);
> > 	if (ret != 0)
> > 		goto out;
> >
> > 	*features = op.payload.u64;
> > out:
> > 	return ret;
> > }
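Just to illustrate the other payload members (this helper is hypothetical,
not in the patch): since the existing vhost ioctl cmds are reused as the
requests, setting a vring's addresses would presumably follow the same
pattern as the VHOST_SET_FEATURES example above, only using the `addr`
member of the payload union:

int vhost_vfio_set_vring_addr(struct vhost_dev *dev,
                              struct vhost_vring_addr *addr)
{
	struct vhost_vfio_op op;

	op.request = VHOST_SET_VRING_ADDR;
	op.flags = 0;
	op.size = sizeof(*addr);
	op.payload.addr = *addr;

	return vhost_vfio_write(dev, &op);
}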
> >
> > 2. BAR1 (mmap-able)
> >
> > The MMIO region described by BAR1 will be used to notify the
> > device.
> >
> > Each queue will have a page for notification, and it can be
> > mapped into the VM (if the hardware also supports it), so that
> > the virtio driver in the VM will be able to notify the device
> > directly.
> >
> > The MMIO region described by BAR1 is also writable. If the
> > accelerator's notification register(s) cannot be mapped into the
> > VM, write() can also be used to notify the device. Something
> > like this:
> >
> > void notify_relay(void *opaque)
> > {
> > 	......
> > 	offset = 0x1000 * queue_idx; /* XXX assume page size is 4K here. */
> >
> > 	ret = pwrite64(vfio->device_fd, &queue_idx, sizeof(queue_idx),
> > 		       vfio->bar1_offset + offset);
> > 	......
> > }
> >
> > Other BARs are reserved.
> >
> > 3. VFIO interrupt ioctl API
> >
> > The VFIO interrupt ioctl API is used to set up device interrupts.
> > IRQ-bypass will also be supported.
> >
> > Currently, only VFIO_PCI_MSIX_IRQ_INDEX is supported.
> >
> > The API for drivers to provide mdev instances
> > =============================================
> >
> > The read()/write()/ioctl()/mmap()/open()/release() mdev
> > parent ops have been provided for accelerator drivers
> > to provide mdev instances.
> >
> > ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
> > 		  size_t count, loff_t *ppos);
> > ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
> > 		   size_t count, loff_t *ppos);
> > long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg);
> > int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
> > int vdpa_open(struct mdev_device *mdev);
> > void vdpa_close(struct mdev_device *mdev);
> >
> > Each accelerator driver just needs to implement its own
> > create()/remove() ops, and provide a vdpa device ops structure
> > which will be called by the generic mdev emulation code.
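To make the driver side a bit more concrete, here is a rough sketch (not
part of the patch) of how an accelerator driver might plug into this API:
it routes the VFIO-facing file operations to the generic vdpa_* parent ops
listed above, implements its own create()/remove(), and registers the
parent device with the mdev core. It assumes the current (4.14-era) mdev
parent-ops API; every ifcvf_* name here is made up, and the sysfs mdev
type groups are omitted.

/* Rough sketch of an accelerator driver using the proposed generic ops.
 * All ifcvf_* names are hypothetical; mdev type groups are omitted. */
#include <linux/mdev.h>
#include <linux/module.h>
#include <linux/pci.h>

static int ifcvf_mdev_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* Allocate the generic vdpa state with vdpa_alloc(mdev, ...), set
	 * vdpa->ops to the driver's vdpa_device_ops table (the ops are
	 * defined just below in the RFC), and set up any per-instance
	 * hardware resources. */
	return 0;
}

static int ifcvf_mdev_remove(struct mdev_device *mdev)
{
	/* Tear down the hardware resources and call vdpa_free(). */
	return 0;
}

/* The VFIO-facing file operations are simply routed to the generic
 * mdev emulation code proposed by this RFC. */
static const struct mdev_parent_ops ifcvf_mdev_ops = {
	.owner		= THIS_MODULE,
	/* .supported_type_groups = ...  (sysfs mdev type groups, omitted) */
	.create		= ifcvf_mdev_create,
	.remove		= ifcvf_mdev_remove,
	.open		= vdpa_open,
	.release	= vdpa_close,
	.read		= vdpa_read,
	.write		= vdpa_write,
	.ioctl		= vdpa_ioctl,
	.mmap		= vdpa_mmap,
};

/* Called from the accelerator's PCI probe routine. */
static int ifcvf_register_mdev_parent(struct pci_dev *pdev)
{
	return mdev_register_device(&pdev->dev, &ifcvf_mdev_ops);
}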
> >
> > Currently, the vdpa device ops are defined as:
> >
> > typedef int (*vdpa_start_device_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_stop_device_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_dma_map_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_dma_unmap_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_set_eventfd_t)(struct vdpa_dev *vdpa, int vector, int fd);
> > typedef u64 (*vdpa_supported_features_t)(struct vdpa_dev *vdpa);
> > typedef void (*vdpa_notify_device_t)(struct vdpa_dev *vdpa, int qid);
> > typedef u64 (*vdpa_get_notify_addr_t)(struct vdpa_dev *vdpa, int qid);
> >
> > struct vdpa_device_ops {
> > 	vdpa_start_device_t		start;
> > 	vdpa_stop_device_t		stop;
> > 	vdpa_dma_map_t			dma_map;
> > 	vdpa_dma_unmap_t		dma_unmap;
> > 	vdpa_set_eventfd_t		set_eventfd;
> > 	vdpa_supported_features_t	supported_features;
> > 	vdpa_notify_device_t		notify;
> > 	vdpa_get_notify_addr_t		get_notify_addr;
> > };
> >
> > struct vdpa_dev {
> > 	struct mdev_device *mdev;
> > 	struct mutex ops_lock;
> > 	u8 vconfig[VDPA_CONFIG_SIZE];
> > 	int nr_vring;
> > 	u64 features;
> > 	u64 state;
> > 	struct vhost_memory *mem_table;
> > 	bool pending_reply;
> > 	struct vhost_vfio_op pending;
> > 	const struct vdpa_device_ops *ops;
> > 	void *private;
> > 	int max_vrings;
> > 	struct vdpa_vring_info vring_info[0];
> > };
> >
> > struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
> > 			    int max_vrings);
> > void vdpa_free(struct vdpa_dev *vdpa);
> >
> > A simple example
> > ================
> >
> > # Query the number of available mdev instances
> > $ cat /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/available_instances
> >
> > # Create an mdev instance
> > $ echo $UUID > /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/create
> >
> > # Launch QEMU with a virtio-net device
> > $ qemu \
> > 	...... \
> > 	-netdev type=vhost-vfio,sysfsdev=/sys/bus/mdev/devices/$UUID,id=$ID \
> > 	-device virtio-net-pci,netdev=$ID
> >
> > -------- END --------
> >
> > Most of the above text will be refined and moved to a doc in
> > the formal patch. In this RFC, all introductions and code are
> > gathered in this one patch; the idea is to make it easier to
> > find all the relevant information. Anyone who wants to comment
> > can use inline comments and just keep the relevant parts.
> > Sorry for the big RFC patch..
> >
> > This patch is just an RFC for now, and some things are still
> > missing or need to be refined. But it's never too early to
> > hear the thoughts from the community. So any comments would
> > be appreciated! Thanks! :-)
>
> I don't see vhost_vfio_write() and the other functions above in the
> patch. It looks like some part of the patch is missing; it would be
> better to post a complete series with an example driver (vDPA) to get
> a full picture.

No problem. We will send out the QEMU changes soon! Thanks!

>
> Thanks
>
[...]