2020-02-10 03:58:04

by Jason Wang

[permalink] [raw]
Subject: [PATCH V2 0/5] vDPA support

Hi all:

This is an updated version of kernel support for vDPA device. Various
changes were made based on the feedback since last verion. One major
change is to drop the sysfs API and leave the management interface for
future development, and introudce the incremental DMA bus
operations. Please see changelog for more information.

The work on vhost, IFCVF (intel VF driver for vDPA) and qemu is
ongoing.

Please provide feedback.

Thanks

Change from V1:

- drop sysfs API, leave the management interface to future development
(Michael)
- introduce incremental DMA ops (dma_map/dma_unmap) (Michael)
- introduce dma_device and use it instead of parent device for doing
IOMMU or DMA from bus driver (Michael, Jason, Ling Shan, Tiwei)
- accept parent device and dma device when register vdpa device
- coding style and compile fixes (Randy)
- using vdpa_xxx instead of xxx_vdpa (Jason)
- ove vDPA accessors to header and make it static inline (Jason)
- split vdp_register_device() into two helpers vdpa_init_device() and
vdpa_register_device() which allows intermediate step to be done (Jason)
- warn on invalidate queue state when fail to creating virtqueue (Jason)
- make to_virtio_vdpa_device() static (Jason)
- use kmalloc/kfree instead of devres for virtio vdpa device (Jason)
- avoid using cast in vdpa bus function (Jason)
- introduce module_vdpa_driver and fix module refcnt (Jason)
- fix returning freed address in vdapsim coherent DMA addr allocation (Dan)
- various other fixes and tweaks

V1: https://lkml.org/lkml/2020/1/16/353

Jason Wang (5):
vhost: factor out IOTLB
vringh: IOTLB support
vDPA: introduce vDPA bus
virtio: introduce a vDPA based transport
vdpasim: vDPA device simulator

MAINTAINERS | 2 +
drivers/vhost/Kconfig | 7 +
drivers/vhost/Kconfig.vringh | 1 +
drivers/vhost/Makefile | 2 +
drivers/vhost/net.c | 2 +-
drivers/vhost/vhost.c | 221 ++++-------
drivers/vhost/vhost.h | 36 +-
drivers/vhost/vhost_iotlb.c | 171 +++++++++
drivers/vhost/vringh.c | 421 ++++++++++++++++++--
drivers/virtio/Kconfig | 15 +
drivers/virtio/Makefile | 2 +
drivers/virtio/vdpa/Kconfig | 26 ++
drivers/virtio/vdpa/Makefile | 3 +
drivers/virtio/vdpa/vdpa.c | 160 ++++++++
drivers/virtio/vdpa/vdpa_sim.c | 678 +++++++++++++++++++++++++++++++++
drivers/virtio/virtio_vdpa.c | 392 +++++++++++++++++++
include/linux/vdpa.h | 233 +++++++++++
include/linux/vhost_iotlb.h | 45 +++
include/linux/vringh.h | 36 ++
19 files changed, 2249 insertions(+), 204 deletions(-)
create mode 100644 drivers/vhost/vhost_iotlb.c
create mode 100644 drivers/virtio/vdpa/Kconfig
create mode 100644 drivers/virtio/vdpa/Makefile
create mode 100644 drivers/virtio/vdpa/vdpa.c
create mode 100644 drivers/virtio/vdpa/vdpa_sim.c
create mode 100644 drivers/virtio/virtio_vdpa.c
create mode 100644 include/linux/vdpa.h
create mode 100644 include/linux/vhost_iotlb.h

--
2.19.1


2020-02-10 04:00:01

by Jason Wang

[permalink] [raw]
Subject: [PATCH V2 1/5] vhost: factor out IOTLB

This patch factors out IOTLB into a dedicated module in order to be
reused by other modules like vringh. User may choose to enable the
automatic retiring by specifying VHOST_IOTLB_FLAG_RETIRE flag to fit
for the case of vhost device IOTLB implementation.

Signed-off-by: Jason Wang <[email protected]>
---
MAINTAINERS | 1 +
drivers/vhost/Kconfig | 7 ++
drivers/vhost/Makefile | 2 +
drivers/vhost/net.c | 2 +-
drivers/vhost/vhost.c | 221 +++++++++++-------------------------
drivers/vhost/vhost.h | 36 ++----
drivers/vhost/vhost_iotlb.c | 171 ++++++++++++++++++++++++++++
include/linux/vhost_iotlb.h | 45 ++++++++
8 files changed, 304 insertions(+), 181 deletions(-)
create mode 100644 drivers/vhost/vhost_iotlb.c
create mode 100644 include/linux/vhost_iotlb.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2549f10eb0b1..d4bda9c900fa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17607,6 +17607,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
S: Maintained
F: drivers/vhost/
F: include/uapi/linux/vhost.h
+F: include/linux/vhost_iotlb.h

VIRTIO INPUT DRIVER
M: Gerd Hoffmann <[email protected]>
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..eef634ff9a6e 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -36,6 +36,7 @@ config VHOST_VSOCK

config VHOST
tristate
+ select VHOST_IOTLB
---help---
This option is selected by any driver which needs to access
the core of vhost.
@@ -54,3 +55,9 @@ config VHOST_CROSS_ENDIAN_LEGACY
adds some overhead, it is disabled by default.

If unsure, say "N".
+
+config VHOST_IOTLB
+ tristate
+ default m
+ help
+ Generic IOTLB implementation for vhost and vringh.
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..df99756fbb26 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -11,3 +11,5 @@ vhost_vsock-y := vsock.o
obj-$(CONFIG_VHOST_RING) += vringh.o

obj-$(CONFIG_VHOST) += vhost.o
+
+obj-$(CONFIG_VHOST_IOTLB) += vhost_iotlb.o
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e158159671fa..e4a20d7a2921 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1594,7 +1594,7 @@ static long vhost_net_reset_owner(struct vhost_net *n)
struct socket *tx_sock = NULL;
struct socket *rx_sock = NULL;
long err;
- struct vhost_umem *umem;
+ struct vhost_iotlb *umem;

mutex_lock(&n->dev.mutex);
err = vhost_dev_check_owner(&n->dev);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f44340b41494..9059b95cac83 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -50,10 +50,6 @@ enum {
#define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
#define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])

-INTERVAL_TREE_DEFINE(struct vhost_umem_node,
- rb, __u64, __subtree_last,
- START, LAST, static inline, vhost_umem_interval_tree);
-
#ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
{
@@ -581,21 +577,25 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_set_owner);

-struct vhost_umem *vhost_dev_reset_owner_prepare(void)
+static struct vhost_iotlb *iotlb_alloc(void)
+{
+ return vhost_iotlb_alloc(max_iotlb_entries,
+ VHOST_IOTLB_FLAG_RETIRE);
+}
+
+struct vhost_iotlb *vhost_dev_reset_owner_prepare(void)
{
- return kvzalloc(sizeof(struct vhost_umem), GFP_KERNEL);
+ return iotlb_alloc();
}
EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);

/* Caller should have device mutex */
-void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_umem *umem)
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *umem)
{
int i;

vhost_dev_cleanup(dev);

- /* Restore memory to default empty mapping. */
- INIT_LIST_HEAD(&umem->umem_list);
dev->umem = umem;
/* We don't need VQ locks below since vhost_dev_cleanup makes sure
* VQs aren't running.
@@ -618,28 +618,6 @@ void vhost_dev_stop(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_stop);

-static void vhost_umem_free(struct vhost_umem *umem,
- struct vhost_umem_node *node)
-{
- vhost_umem_interval_tree_remove(node, &umem->umem_tree);
- list_del(&node->link);
- kfree(node);
- umem->numem--;
-}
-
-static void vhost_umem_clean(struct vhost_umem *umem)
-{
- struct vhost_umem_node *node, *tmp;
-
- if (!umem)
- return;
-
- list_for_each_entry_safe(node, tmp, &umem->umem_list, link)
- vhost_umem_free(umem, node);
-
- kvfree(umem);
-}
-
static void vhost_clear_msg(struct vhost_dev *dev)
{
struct vhost_msg_node *node, *n;
@@ -677,9 +655,9 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
eventfd_ctx_put(dev->log_ctx);
dev->log_ctx = NULL;
/* No one will access memory at this point */
- vhost_umem_clean(dev->umem);
+ vhost_iotlb_free(dev->umem);
dev->umem = NULL;
- vhost_umem_clean(dev->iotlb);
+ vhost_iotlb_free(dev->iotlb);
dev->iotlb = NULL;
vhost_clear_msg(dev);
wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
@@ -715,27 +693,26 @@ static bool vhost_overflow(u64 uaddr, u64 size)
}

/* Caller should have vq mutex and device mutex. */
-static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_iotlb *umem,
int log_all)
{
- struct vhost_umem_node *node;
+ struct vhost_iotlb_map *map;

if (!umem)
return false;

- list_for_each_entry(node, &umem->umem_list, link) {
- unsigned long a = node->userspace_addr;
+ list_for_each_entry(map, &umem->list, link) {
+ unsigned long a = map->addr;

- if (vhost_overflow(node->userspace_addr, node->size))
+ if (vhost_overflow(map->addr, map->size))
return false;


- if (!access_ok((void __user *)a,
- node->size))
+ if (!access_ok((void __user *)a, map->size))
return false;
else if (log_all && !log_access_ok(log_base,
- node->start,
- node->size))
+ map->start,
+ map->size))
return false;
}
return true;
@@ -745,17 +722,17 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
u64 addr, unsigned int size,
int type)
{
- const struct vhost_umem_node *node = vq->meta_iotlb[type];
+ const struct vhost_iotlb_map *map = vq->meta_iotlb[type];

- if (!node)
+ if (!map)
return NULL;

- return (void *)(uintptr_t)(node->userspace_addr + addr - node->start);
+ return (void *)(uintptr_t)(map->addr + addr - map->start);
}

/* Can we switch to this memory table? */
/* Caller should have device mutex but not vq mutex */
-static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_iotlb *umem,
int log_all)
{
int i;
@@ -1020,47 +997,6 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
}

-static int vhost_new_umem_range(struct vhost_umem *umem,
- u64 start, u64 size, u64 end,
- u64 userspace_addr, int perm)
-{
- struct vhost_umem_node *tmp, *node;
-
- if (!size)
- return -EFAULT;
-
- node = kmalloc(sizeof(*node), GFP_ATOMIC);
- if (!node)
- return -ENOMEM;
-
- if (umem->numem == max_iotlb_entries) {
- tmp = list_first_entry(&umem->umem_list, typeof(*tmp), link);
- vhost_umem_free(umem, tmp);
- }
-
- node->start = start;
- node->size = size;
- node->last = end;
- node->userspace_addr = userspace_addr;
- node->perm = perm;
- INIT_LIST_HEAD(&node->link);
- list_add_tail(&node->link, &umem->umem_list);
- vhost_umem_interval_tree_insert(node, &umem->umem_tree);
- umem->numem++;
-
- return 0;
-}
-
-static void vhost_del_umem_range(struct vhost_umem *umem,
- u64 start, u64 end)
-{
- struct vhost_umem_node *node;
-
- while ((node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
- start, end)))
- vhost_umem_free(umem, node);
-}
-
static void vhost_iotlb_notify_vq(struct vhost_dev *d,
struct vhost_iotlb_msg *msg)
{
@@ -1117,9 +1053,9 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
break;
}
vhost_vq_meta_reset(dev);
- if (vhost_new_umem_range(dev->iotlb, msg->iova, msg->size,
- msg->iova + msg->size - 1,
- msg->uaddr, msg->perm)) {
+ if (vhost_iotlb_add_range(dev->iotlb, msg->iova,
+ msg->iova + msg->size - 1,
+ msg->uaddr, msg->perm)) {
ret = -ENOMEM;
break;
}
@@ -1131,8 +1067,8 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
break;
}
vhost_vq_meta_reset(dev);
- vhost_del_umem_range(dev->iotlb, msg->iova,
- msg->iova + msg->size - 1);
+ vhost_iotlb_del_range(dev->iotlb, msg->iova,
+ msg->iova + msg->size - 1);
break;
default:
ret = -EINVAL;
@@ -1311,44 +1247,42 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
}

static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
- const struct vhost_umem_node *node,
+ const struct vhost_iotlb_map *map,
int type)
{
int access = (type == VHOST_ADDR_USED) ?
VHOST_ACCESS_WO : VHOST_ACCESS_RO;

- if (likely(node->perm & access))
- vq->meta_iotlb[type] = node;
+ if (likely(map->perm & access))
+ vq->meta_iotlb[type] = map;
}

static bool iotlb_access_ok(struct vhost_virtqueue *vq,
int access, u64 addr, u64 len, int type)
{
- const struct vhost_umem_node *node;
- struct vhost_umem *umem = vq->iotlb;
+ const struct vhost_iotlb_map *map;
+ struct vhost_iotlb *umem = vq->iotlb;
u64 s = 0, size, orig_addr = addr, last = addr + len - 1;

if (vhost_vq_meta_fetch(vq, addr, len, type))
return true;

while (len > s) {
- node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
- addr,
- last);
- if (node == NULL || node->start > addr) {
+ map = vhost_iotlb_itree_first(umem, addr, last);
+ if (map == NULL || map->start > addr) {
vhost_iotlb_miss(vq, addr, access);
return false;
- } else if (!(node->perm & access)) {
+ } else if (!(map->perm & access)) {
/* Report the possible access violation by
* request another translation from userspace.
*/
return false;
}

- size = node->size - addr + node->start;
+ size = map->size - addr + map->start;

if (orig_addr == addr && size >= len)
- vhost_vq_meta_update(vq, node, type);
+ vhost_vq_meta_update(vq, map, type);

s += size;
addr += size;
@@ -1364,12 +1298,12 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
if (!vq->iotlb)
return 1;

- return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
+ return iotlb_access_ok(vq, VHOST_MAP_RO, (u64)(uintptr_t)vq->desc,
vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
- iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
+ iotlb_access_ok(vq, VHOST_MAP_RO, (u64)(uintptr_t)vq->avail,
vhost_get_avail_size(vq, num),
VHOST_ADDR_AVAIL) &&
- iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
+ iotlb_access_ok(vq, VHOST_MAP_WO, (u64)(uintptr_t)vq->used,
vhost_get_used_size(vq, num), VHOST_ADDR_USED);
}
EXPORT_SYMBOL_GPL(vq_meta_prefetch);
@@ -1408,25 +1342,11 @@ bool vhost_vq_access_ok(struct vhost_virtqueue *vq)
}
EXPORT_SYMBOL_GPL(vhost_vq_access_ok);

-static struct vhost_umem *vhost_umem_alloc(void)
-{
- struct vhost_umem *umem = kvzalloc(sizeof(*umem), GFP_KERNEL);
-
- if (!umem)
- return NULL;
-
- umem->umem_tree = RB_ROOT_CACHED;
- umem->numem = 0;
- INIT_LIST_HEAD(&umem->umem_list);
-
- return umem;
-}
-
static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
{
struct vhost_memory mem, *newmem;
struct vhost_memory_region *region;
- struct vhost_umem *newumem, *oldumem;
+ struct vhost_iotlb *newumem, *oldumem;
unsigned long size = offsetof(struct vhost_memory, regions);
int i;

@@ -1448,7 +1368,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
return -EFAULT;
}

- newumem = vhost_umem_alloc();
+ newumem = iotlb_alloc();
if (!newumem) {
kvfree(newmem);
return -ENOMEM;
@@ -1457,13 +1377,12 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
for (region = newmem->regions;
region < newmem->regions + mem.nregions;
region++) {
- if (vhost_new_umem_range(newumem,
- region->guest_phys_addr,
- region->memory_size,
- region->guest_phys_addr +
- region->memory_size - 1,
- region->userspace_addr,
- VHOST_ACCESS_RW))
+ if (vhost_iotlb_add_range(newumem,
+ region->guest_phys_addr,
+ region->guest_phys_addr +
+ region->memory_size - 1,
+ region->userspace_addr,
+ VHOST_MAP_RW))
goto err;
}

@@ -1481,11 +1400,11 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
}

kvfree(newmem);
- vhost_umem_clean(oldumem);
+ vhost_iotlb_free(oldumem);
return 0;

err:
- vhost_umem_clean(newumem);
+ vhost_iotlb_free(newumem);
kvfree(newmem);
return -EFAULT;
}
@@ -1726,10 +1645,10 @@ EXPORT_SYMBOL_GPL(vhost_vring_ioctl);

int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
{
- struct vhost_umem *niotlb, *oiotlb;
+ struct vhost_iotlb *niotlb, *oiotlb;
int i;

- niotlb = vhost_umem_alloc();
+ niotlb = iotlb_alloc();
if (!niotlb)
return -ENOMEM;

@@ -1745,7 +1664,7 @@ int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
mutex_unlock(&vq->mutex);
}

- vhost_umem_clean(oiotlb);
+ vhost_iotlb_free(oiotlb);

return 0;
}
@@ -1875,8 +1794,8 @@ static int log_write(void __user *log_base,

static int log_write_hva(struct vhost_virtqueue *vq, u64 hva, u64 len)
{
- struct vhost_umem *umem = vq->umem;
- struct vhost_umem_node *u;
+ struct vhost_iotlb *umem = vq->umem;
+ struct vhost_iotlb_map *u;
u64 start, end, l, min;
int r;
bool hit = false;
@@ -1886,16 +1805,15 @@ static int log_write_hva(struct vhost_virtqueue *vq, u64 hva, u64 len)
/* More than one GPAs can be mapped into a single HVA. So
* iterate all possible umems here to be safe.
*/
- list_for_each_entry(u, &umem->umem_list, link) {
- if (u->userspace_addr > hva - 1 + len ||
- u->userspace_addr - 1 + u->size < hva)
+ list_for_each_entry(u, &umem->list, link) {
+ if (u->addr > hva - 1 + len ||
+ u->addr - 1 + u->size < hva)
continue;
- start = max(u->userspace_addr, hva);
- end = min(u->userspace_addr - 1 + u->size,
- hva - 1 + len);
+ start = max(u->addr, hva);
+ end = min(u->addr - 1 + u->size, hva - 1 + len);
l = end - start + 1;
r = log_write(vq->log_base,
- u->start + start - u->userspace_addr,
+ u->start + start - u->addr,
l);
if (r < 0)
return r;
@@ -2046,9 +1964,9 @@ EXPORT_SYMBOL_GPL(vhost_vq_init_access);
static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
struct iovec iov[], int iov_size, int access)
{
- const struct vhost_umem_node *node;
+ const struct vhost_iotlb_map *map;
struct vhost_dev *dev = vq->dev;
- struct vhost_umem *umem = dev->iotlb ? dev->iotlb : dev->umem;
+ struct vhost_iotlb *umem = dev->iotlb ? dev->iotlb : dev->umem;
struct iovec *_iov;
u64 s = 0;
int ret = 0;
@@ -2060,25 +1978,24 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
break;
}

- node = vhost_umem_interval_tree_iter_first(&umem->umem_tree,
- addr, addr + len - 1);
- if (node == NULL || node->start > addr) {
+ map = vhost_iotlb_itree_first(umem, addr, addr + len - 1);
+ if (map == NULL || map->start > addr) {
if (umem != dev->iotlb) {
ret = -EFAULT;
break;
}
ret = -EAGAIN;
break;
- } else if (!(node->perm & access)) {
+ } else if (!(map->perm & access)) {
ret = -EPERM;
break;
}

_iov = iov + ret;
- size = node->size - addr + node->start;
+ size = map->size - addr + map->start;
_iov->iov_len = min((u64)len - s, size);
_iov->iov_base = (void __user *)(unsigned long)
- (node->userspace_addr + addr - node->start);
+ (map->addr + addr - map->start);
s += size;
addr += size;
++ret;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index a123fd70847e..b99c6ffb6be1 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,7 @@
#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>
#include <linux/atomic.h>
+#include <linux/vhost_iotlb.h>

struct vhost_work;
typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -52,27 +53,6 @@ struct vhost_log {
u64 len;
};

-#define START(node) ((node)->start)
-#define LAST(node) ((node)->last)
-
-struct vhost_umem_node {
- struct rb_node rb;
- struct list_head link;
- __u64 start;
- __u64 last;
- __u64 size;
- __u64 userspace_addr;
- __u32 perm;
- __u32 flags_padding;
- __u64 __subtree_last;
-};
-
-struct vhost_umem {
- struct rb_root_cached umem_tree;
- struct list_head umem_list;
- int numem;
-};
-
enum vhost_uaddr_type {
VHOST_ADDR_DESC = 0,
VHOST_ADDR_AVAIL = 1,
@@ -90,7 +70,7 @@ struct vhost_virtqueue {
struct vring_desc __user *desc;
struct vring_avail __user *avail;
struct vring_used __user *used;
- const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
+ const struct vhost_iotlb_map *meta_iotlb[VHOST_NUM_ADDRS];
struct file *kick;
struct eventfd_ctx *call_ctx;
struct eventfd_ctx *error_ctx;
@@ -128,8 +108,8 @@ struct vhost_virtqueue {
struct iovec *indirect;
struct vring_used_elem *heads;
/* Protected by virtqueue mutex. */
- struct vhost_umem *umem;
- struct vhost_umem *iotlb;
+ struct vhost_iotlb *umem;
+ struct vhost_iotlb *iotlb;
void *private_data;
u64 acked_features;
u64 acked_backend_features;
@@ -164,8 +144,8 @@ struct vhost_dev {
struct eventfd_ctx *log_ctx;
struct llist_head work_list;
struct task_struct *worker;
- struct vhost_umem *umem;
- struct vhost_umem *iotlb;
+ struct vhost_iotlb *umem;
+ struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;
struct list_head read_list;
struct list_head pending_list;
@@ -182,8 +162,8 @@ void vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue **vqs,
long vhost_dev_set_owner(struct vhost_dev *dev);
bool vhost_dev_has_owner(struct vhost_dev *dev);
long vhost_dev_check_owner(struct vhost_dev *);
-struct vhost_umem *vhost_dev_reset_owner_prepare(void);
-void vhost_dev_reset_owner(struct vhost_dev *, struct vhost_umem *);
+struct vhost_iotlb *vhost_dev_reset_owner_prepare(void);
+void vhost_dev_reset_owner(struct vhost_dev *dev, struct vhost_iotlb *iotlb);
void vhost_dev_cleanup(struct vhost_dev *);
void vhost_dev_stop(struct vhost_dev *);
long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
diff --git a/drivers/vhost/vhost_iotlb.c b/drivers/vhost/vhost_iotlb.c
new file mode 100644
index 000000000000..e08710f1690c
--- /dev/null
+++ b/drivers/vhost/vhost_iotlb.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2020 Red Hat, Inc.
+ * Author: Jason Wang <[email protected]>
+ *
+ * IOTLB implementation for vhost.
+ */
+#include <linux/slab.h>
+#include <linux/vhost_iotlb.h>
+#include <linux/module.h>
+
+#define MOD_VERSION "0.1"
+#define MOD_DESC "VHOST IOTLB"
+#define MOD_AUTHOR "Jason Wang <[email protected]>"
+#define MOD_LICENSE "GPL v2"
+
+#define START(map) ((map)->start)
+#define LAST(map) ((map)->last)
+
+INTERVAL_TREE_DEFINE(struct vhost_iotlb_map,
+ rb, __u64, __subtree_last,
+ START, LAST, static inline, vhost_iotlb_itree);
+
+static void iotlb_map_free(struct vhost_iotlb *iotlb,
+ struct vhost_iotlb_map *map)
+{
+ vhost_iotlb_itree_remove(map, &iotlb->root);
+ list_del(&map->link);
+ kfree(map);
+ iotlb->nmaps--;
+}
+
+/**
+ * vhost_iotlb_add_range - add a new range to vhost IOTLB
+ * @iotlb: the IOTLB
+ * @start: start of the IOVA range
+ * @last: last of IOVA range
+ * @addr: the address that is mapped to @start
+ * @perm: access permission of this range
+ *
+ * Returns an error last is smaller than start or memory allocation
+ * fails
+ */
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb,
+ u64 start, u64 last,
+ u64 addr, unsigned int perm)
+{
+ struct vhost_iotlb_map *map;
+
+ if (last < start)
+ return -EFAULT;
+
+ if (iotlb->limit &&
+ iotlb->nmaps == iotlb->limit &&
+ iotlb->flags & VHOST_IOTLB_FLAG_RETIRE) {
+ map = list_first_entry(&iotlb->list, typeof(*map), link);
+ iotlb_map_free(iotlb, map);
+ }
+
+ map = kmalloc(sizeof(*map), GFP_ATOMIC);
+ if (!map)
+ return -ENOMEM;
+
+ map->start = start;
+ map->size = last - start + 1;
+ map->last = last;
+ map->addr = addr;
+ map->perm = perm;
+
+ iotlb->nmaps++;
+ vhost_iotlb_itree_insert(map, &iotlb->root);
+
+ INIT_LIST_HEAD(&map->link);
+ list_add_tail(&map->link, &iotlb->list);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_add_range);
+
+/**
+ * vring_iotlb_del_range - delete overlapped ranges from vhost IOTLB
+ * @iotlb: the IOTLB
+ * @start: start of the IOVA range
+ * @last: last of IOVA range
+ */
+void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last)
+{
+ struct vhost_iotlb_map *map;
+
+ while ((map = vhost_iotlb_itree_iter_first(&iotlb->root,
+ start, last)))
+ iotlb_map_free(iotlb, map);
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_del_range);
+
+/**
+ * vhost_iotlb_alloc - add a new vhost IOTLB
+ * @limit: maximum number of IOTLB entries
+ * @flags: VHOST_IOTLB_FLAG_XXX
+ *
+ * Returns an error is memory allocation fails
+ */
+struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags)
+{
+ struct vhost_iotlb *iotlb = kzalloc(sizeof(*iotlb), GFP_KERNEL);
+
+ if (!iotlb)
+ return NULL;
+
+ iotlb->root = RB_ROOT_CACHED;
+ iotlb->limit = limit;
+ iotlb->nmaps = 0;
+ iotlb->flags = flags;
+ INIT_LIST_HEAD(&iotlb->list);
+
+ return iotlb;
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_alloc);
+
+/**
+ * vhost_iotlb_reset - reset vhost IOTLB (free all IOTLB entries)
+ * @iotlb: the IOTLB to be reset
+ */
+void vhost_iotlb_reset(struct vhost_iotlb *iotlb)
+{
+ vhost_iotlb_del_range(iotlb, 0ULL, 0ULL - 1);
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_reset);
+
+/**
+ * vhost_iotlb_free - reset and free vhost IOTLB
+ * @iotlb: the IOTLB to be freed
+ */
+void vhost_iotlb_free(struct vhost_iotlb *iotlb)
+{
+ if (iotlb) {
+ vhost_iotlb_reset(iotlb);
+ kfree(iotlb);
+ }
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_free);
+
+/**
+ * vhost_iotlb_itree_first - return the first overlapped range
+ * @iotlb: the IOTLB
+ * @start: start of IOVA range
+ * @end: end of IOVA range
+ */
+struct vhost_iotlb_map *
+vhost_iotlb_itree_first(struct vhost_iotlb *iotlb, u64 start, u64 last)
+{
+ return vhost_iotlb_itree_iter_first(&iotlb->root, start, last);
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_itree_first);
+
+/**
+ * vhost_iotlb_itree_first - return the next overlapped range
+ * @iotlb: the IOTLB
+ * @start: start of IOVA range
+ * @end: end of IOVA range
+ */
+struct vhost_iotlb_map *
+vhost_iotlb_itree_next(struct vhost_iotlb_map *map, u64 start, u64 last)
+{
+ return vhost_iotlb_itree_iter_next(map, start, last);
+}
+EXPORT_SYMBOL_GPL(vhost_iotlb_itree_next);
+
+MODULE_VERSION(MOD_VERSION);
+MODULE_DESCRIPTION(MOD_DESC);
+MODULE_AUTHOR(MOD_AUTHOR);
+MODULE_LICENSE(MOD_LICENSE);
diff --git a/include/linux/vhost_iotlb.h b/include/linux/vhost_iotlb.h
new file mode 100644
index 000000000000..a44c61f5627b
--- /dev/null
+++ b/include/linux/vhost_iotlb.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VHOST_IOTLB_H
+#define _LINUX_VHOST_IOTLB_H
+
+#include <linux/interval_tree_generic.h>
+
+struct vhost_iotlb_map {
+ struct rb_node rb;
+ struct list_head link;
+ u64 start;
+ u64 last;
+ u64 size;
+ u64 addr;
+#define VHOST_MAP_RO 0x1
+#define VHOST_MAP_WO 0x2
+#define VHOST_MAP_RW 0x3
+ u32 perm;
+ u32 flags_padding;
+ u64 __subtree_last;
+};
+
+#define VHOST_IOTLB_FLAG_RETIRE 0x1
+
+struct vhost_iotlb {
+ struct rb_root_cached root;
+ struct list_head list;
+ unsigned int limit;
+ unsigned int nmaps;
+ unsigned int flags;
+};
+
+int vhost_iotlb_add_range(struct vhost_iotlb *iotlb, u64 start, u64 last,
+ u64 addr, unsigned int perm);
+void vhost_iotlb_del_range(struct vhost_iotlb *iotlb, u64 start, u64 last);
+
+struct vhost_iotlb *vhost_iotlb_alloc(unsigned int limit, unsigned int flags);
+void vhost_iotlb_free(struct vhost_iotlb *iotlb);
+void vhost_iotlb_reset(struct vhost_iotlb *iotlb);
+
+struct vhost_iotlb_map *
+vhost_iotlb_itree_first(struct vhost_iotlb *iotlb, u64 start, u64 last);
+struct vhost_iotlb_map *
+vhost_iotlb_itree_next(struct vhost_iotlb_map *map, u64 start, u64 last);
+
+#endif
--
2.19.1

2020-02-10 04:00:23

by Jason Wang

[permalink] [raw]
Subject: [PATCH V2 2/5] vringh: IOTLB support

This patch implements the third memory accessor for vringh besides
current kernel and userspace accessors. This idea is to allow vringh
to do the address translation through an IOTLB which is implemented
via vhost_map interval tree. Users should setup and IOVA to PA mapping
in this IOTLB.

This allows us to:

- Using vringh to access virtqueues with vIOMMU
- Using vringh to implement software vDPA devices

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/Kconfig.vringh | 1 +
drivers/vhost/vringh.c | 421 +++++++++++++++++++++++++++++++++--
include/linux/vringh.h | 36 +++
3 files changed, 435 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/Kconfig.vringh b/drivers/vhost/Kconfig.vringh
index c1fe36a9b8d4..a8d4dd0cb06e 100644
--- a/drivers/vhost/Kconfig.vringh
+++ b/drivers/vhost/Kconfig.vringh
@@ -1,6 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
config VHOST_RING
tristate
+ select VHOST_IOTLB
---help---
This option is selected by any driver which needs to access
the host side of a virtio ring.
diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index a0a2d74967ef..ee0491f579ac 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -13,6 +13,9 @@
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/export.h>
+#include <linux/bvec.h>
+#include <linux/highmem.h>
+#include <linux/vhost_iotlb.h>
#include <uapi/linux/virtio_config.h>

static __printf(1,2) __cold void vringh_bad(const char *fmt, ...)
@@ -71,9 +74,11 @@ static inline int __vringh_get_head(const struct vringh *vrh,
}

/* Copy some bytes to/from the iovec. Returns num copied. */
-static inline ssize_t vringh_iov_xfer(struct vringh_kiov *iov,
+static inline ssize_t vringh_iov_xfer(struct vringh *vrh,
+ struct vringh_kiov *iov,
void *ptr, size_t len,
- int (*xfer)(void *addr, void *ptr,
+ int (*xfer)(const struct vringh *vrh,
+ void *addr, void *ptr,
size_t len))
{
int err, done = 0;
@@ -82,7 +87,7 @@ static inline ssize_t vringh_iov_xfer(struct vringh_kiov *iov,
size_t partlen;

partlen = min(iov->iov[iov->i].iov_len, len);
- err = xfer(iov->iov[iov->i].iov_base, ptr, partlen);
+ err = xfer(vrh, iov->iov[iov->i].iov_base, ptr, partlen);
if (err)
return err;
done += partlen;
@@ -96,6 +101,7 @@ static inline ssize_t vringh_iov_xfer(struct vringh_kiov *iov,
/* Fix up old iov element then increment. */
iov->iov[iov->i].iov_len = iov->consumed;
iov->iov[iov->i].iov_base -= iov->consumed;
+

iov->consumed = 0;
iov->i++;
@@ -227,7 +233,8 @@ static int slow_copy(struct vringh *vrh, void *dst, const void *src,
u64 addr,
struct vringh_range *r),
struct vringh_range *range,
- int (*copy)(void *dst, const void *src, size_t len))
+ int (*copy)(const struct vringh *vrh,
+ void *dst, const void *src, size_t len))
{
size_t part, len = sizeof(struct vring_desc);

@@ -241,7 +248,7 @@ static int slow_copy(struct vringh *vrh, void *dst, const void *src,
if (!rcheck(vrh, addr, &part, range, getrange))
return -EINVAL;

- err = copy(dst, src, part);
+ err = copy(vrh, dst, src, part);
if (err)
return err;

@@ -262,7 +269,8 @@ __vringh_iov(struct vringh *vrh, u16 i,
struct vringh_range *)),
bool (*getrange)(struct vringh *, u64, struct vringh_range *),
gfp_t gfp,
- int (*copy)(void *dst, const void *src, size_t len))
+ int (*copy)(const struct vringh *vrh,
+ void *dst, const void *src, size_t len))
{
int err, count = 0, up_next, desc_max;
struct vring_desc desc, *descs;
@@ -291,7 +299,7 @@ __vringh_iov(struct vringh *vrh, u16 i,
err = slow_copy(vrh, &desc, &descs[i], rcheck, getrange,
&slowrange, copy);
else
- err = copy(&desc, &descs[i], sizeof(desc));
+ err = copy(vrh, &desc, &descs[i], sizeof(desc));
if (unlikely(err))
goto fail;

@@ -404,7 +412,8 @@ static inline int __vringh_complete(struct vringh *vrh,
unsigned int num_used,
int (*putu16)(const struct vringh *vrh,
__virtio16 *p, u16 val),
- int (*putused)(struct vring_used_elem *dst,
+ int (*putused)(const struct vringh *vrh,
+ struct vring_used_elem *dst,
const struct vring_used_elem
*src, unsigned num))
{
@@ -420,12 +429,12 @@ static inline int __vringh_complete(struct vringh *vrh,
/* Compiler knows num_used == 1 sometimes, hence extra check */
if (num_used > 1 && unlikely(off + num_used >= vrh->vring.num)) {
u16 part = vrh->vring.num - off;
- err = putused(&used_ring->ring[off], used, part);
+ err = putused(vrh, &used_ring->ring[off], used, part);
if (!err)
- err = putused(&used_ring->ring[0], used + part,
+ err = putused(vrh, &used_ring->ring[0], used + part,
num_used - part);
} else
- err = putused(&used_ring->ring[off], used, num_used);
+ err = putused(vrh, &used_ring->ring[off], used, num_used);

if (err) {
vringh_bad("Failed to write %u used entries %u at %p",
@@ -564,13 +573,15 @@ static inline int putu16_user(const struct vringh *vrh, __virtio16 *p, u16 val)
return put_user(v, (__force __virtio16 __user *)p);
}

-static inline int copydesc_user(void *dst, const void *src, size_t len)
+static inline int copydesc_user(const struct vringh *vrh,
+ void *dst, const void *src, size_t len)
{
return copy_from_user(dst, (__force void __user *)src, len) ?
-EFAULT : 0;
}

-static inline int putused_user(struct vring_used_elem *dst,
+static inline int putused_user(const struct vringh *vrh,
+ struct vring_used_elem *dst,
const struct vring_used_elem *src,
unsigned int num)
{
@@ -578,13 +589,15 @@ static inline int putused_user(struct vring_used_elem *dst,
sizeof(*dst) * num) ? -EFAULT : 0;
}

-static inline int xfer_from_user(void *src, void *dst, size_t len)
+static inline int xfer_from_user(const struct vringh *vrh, void *src,
+ void *dst, size_t len)
{
return copy_from_user(dst, (__force void __user *)src, len) ?
-EFAULT : 0;
}

-static inline int xfer_to_user(void *dst, void *src, size_t len)
+static inline int xfer_to_user(const struct vringh *vrh,
+ void *dst, void *src, size_t len)
{
return copy_to_user((__force void __user *)dst, src, len) ?
-EFAULT : 0;
@@ -706,7 +719,7 @@ EXPORT_SYMBOL(vringh_getdesc_user);
*/
ssize_t vringh_iov_pull_user(struct vringh_iov *riov, void *dst, size_t len)
{
- return vringh_iov_xfer((struct vringh_kiov *)riov,
+ return vringh_iov_xfer(NULL, (struct vringh_kiov *)riov,
dst, len, xfer_from_user);
}
EXPORT_SYMBOL(vringh_iov_pull_user);
@@ -722,7 +735,7 @@ EXPORT_SYMBOL(vringh_iov_pull_user);
ssize_t vringh_iov_push_user(struct vringh_iov *wiov,
const void *src, size_t len)
{
- return vringh_iov_xfer((struct vringh_kiov *)wiov,
+ return vringh_iov_xfer(NULL, (struct vringh_kiov *)wiov,
(void *)src, len, xfer_to_user);
}
EXPORT_SYMBOL(vringh_iov_push_user);
@@ -832,13 +845,15 @@ static inline int putu16_kern(const struct vringh *vrh, __virtio16 *p, u16 val)
return 0;
}

-static inline int copydesc_kern(void *dst, const void *src, size_t len)
+static inline int copydesc_kern(const struct vringh *vrh,
+ void *dst, const void *src, size_t len)
{
memcpy(dst, src, len);
return 0;
}

-static inline int putused_kern(struct vring_used_elem *dst,
+static inline int putused_kern(const struct vringh *vrh,
+ struct vring_used_elem *dst,
const struct vring_used_elem *src,
unsigned int num)
{
@@ -846,13 +861,15 @@ static inline int putused_kern(struct vring_used_elem *dst,
return 0;
}

-static inline int xfer_kern(void *src, void *dst, size_t len)
+static inline int xfer_kern(const struct vringh *vrh, void *src,
+ void *dst, size_t len)
{
memcpy(dst, src, len);
return 0;
}

-static inline int kern_xfer(void *dst, void *src, size_t len)
+static inline int kern_xfer(const struct vringh *vrh, void *dst,
+ void *src, size_t len)
{
memcpy(dst, src, len);
return 0;
@@ -949,7 +966,7 @@ EXPORT_SYMBOL(vringh_getdesc_kern);
*/
ssize_t vringh_iov_pull_kern(struct vringh_kiov *riov, void *dst, size_t len)
{
- return vringh_iov_xfer(riov, dst, len, xfer_kern);
+ return vringh_iov_xfer(NULL, riov, dst, len, xfer_kern);
}
EXPORT_SYMBOL(vringh_iov_pull_kern);

@@ -964,7 +981,7 @@ EXPORT_SYMBOL(vringh_iov_pull_kern);
ssize_t vringh_iov_push_kern(struct vringh_kiov *wiov,
const void *src, size_t len)
{
- return vringh_iov_xfer(wiov, (void *)src, len, kern_xfer);
+ return vringh_iov_xfer(NULL, wiov, (void *)src, len, kern_xfer);
}
EXPORT_SYMBOL(vringh_iov_push_kern);

@@ -1042,4 +1059,362 @@ int vringh_need_notify_kern(struct vringh *vrh)
}
EXPORT_SYMBOL(vringh_need_notify_kern);

+static int iotlb_translate(const struct vringh *vrh,
+ u64 addr, u64 len, struct bio_vec iov[],
+ int iov_size, u32 perm)
+{
+ struct vhost_iotlb_map *map;
+ struct vhost_iotlb *iotlb = vrh->iotlb;
+ int ret = 0;
+ u64 s = 0;
+
+ while (len > s) {
+ u64 size, pa, pfn;
+
+ if (unlikely(ret >= iov_size)) {
+ ret = -ENOBUFS;
+ break;
+ }
+
+ map = vhost_iotlb_itree_first(iotlb, addr,
+ addr + len - 1);
+ if (!map || map->start > addr) {
+ ret = -EINVAL;
+ break;
+ } else if (!(map->perm & perm)) {
+ ret = -EPERM;
+ break;
+ }
+
+ size = map->size - addr + map->start;
+ pa = map->addr + addr - map->start;
+ pfn = pa >> PAGE_SHIFT;
+ iov[ret].bv_page = pfn_to_page(pfn);
+ iov[ret].bv_len = min(len - s, size);
+ iov[ret].bv_offset = pa & (PAGE_SIZE - 1);
+ s += size;
+ addr += size;
+ ++ret;
+ }
+
+ return ret;
+}
+
+static inline int copy_from_iotlb(const struct vringh *vrh, void *dst,
+ void *src, size_t len)
+{
+ struct iov_iter iter;
+ struct bio_vec iov[16];
+ int ret;
+
+ ret = iotlb_translate(vrh, (u64)(uintptr_t)src,
+ len, iov, 16, VHOST_MAP_RO);
+ if (ret < 0)
+ return ret;
+
+ iov_iter_bvec(&iter, READ, iov, ret, len);
+
+ ret = copy_from_iter(dst, len, &iter);
+
+ return ret;
+}
+
+static inline int copy_to_iotlb(const struct vringh *vrh, void *dst,
+ void *src, size_t len)
+{
+ struct iov_iter iter;
+ struct bio_vec iov[16];
+ int ret;
+
+ ret = iotlb_translate(vrh, (u64)(uintptr_t)dst,
+ len, iov, 16, VHOST_MAP_WO);
+ if (ret < 0)
+ return ret;
+
+ iov_iter_bvec(&iter, WRITE, iov, ret, len);
+
+ return copy_to_iter(src, len, &iter);
+}
+
+static inline int getu16_iotlb(const struct vringh *vrh,
+ u16 *val, const __virtio16 *p)
+{
+ struct bio_vec iov;
+ void *kaddr, *from;
+ int ret;
+
+ /* Atomic read is needed for getu16 */
+ ret = iotlb_translate(vrh, (u64)(uintptr_t)p, sizeof(*p),
+ &iov, 1, VHOST_MAP_RO);
+ if (ret < 0)
+ return ret;
+
+ kaddr = kmap_atomic(iov.bv_page);
+ from = kaddr + iov.bv_offset;
+ *val = vringh16_to_cpu(vrh, READ_ONCE(*(__virtio16 *)from));
+ kunmap_atomic(kaddr);
+
+ return 0;
+}
+
+static inline int putu16_iotlb(const struct vringh *vrh,
+ __virtio16 *p, u16 val)
+{
+ struct bio_vec iov;
+ void *kaddr, *to;
+ int ret;
+
+ /* Atomic write is needed for putu16 */
+ ret = iotlb_translate(vrh, (u64)(uintptr_t)p, sizeof(*p),
+ &iov, 1, VHOST_MAP_WO);
+ if (ret < 0)
+ return ret;
+
+ kaddr = kmap_atomic(iov.bv_page);
+ to = kaddr + iov.bv_offset;
+ WRITE_ONCE(*(__virtio16 *)to, cpu_to_vringh16(vrh, val));
+ kunmap_atomic(kaddr);
+
+ return 0;
+}
+
+static inline int copydesc_iotlb(const struct vringh *vrh,
+ void *dst, const void *src, size_t len)
+{
+ int ret;
+
+ ret = copy_from_iotlb(vrh, dst, (void *)src, len);
+ if (ret != len)
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline int xfer_from_iotlb(const struct vringh *vrh, void *src,
+ void *dst, size_t len)
+{
+ int ret;
+
+ ret = copy_from_iotlb(vrh, dst, src, len);
+ if (ret != len)
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline int xfer_to_iotlb(const struct vringh *vrh,
+ void *dst, void *src, size_t len)
+{
+ int ret;
+
+ ret = copy_to_iotlb(vrh, dst, src, len);
+ if (ret != len)
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline int putused_iotlb(const struct vringh *vrh,
+ struct vring_used_elem *dst,
+ const struct vring_used_elem *src,
+ unsigned int num)
+{
+ int size = num * sizeof(*dst);
+ int ret;
+
+ ret = copy_to_iotlb(vrh, dst, (void *)src, num * sizeof(*dst));
+ if (ret != size)
+ return -EFAULT;
+
+ return 0;
+}
+
+/**
+ * vringh_init_iotlb - initialize a vringh for a ring with IOTLB.
+ * @vrh: the vringh to initialize.
+ * @features: the feature bits for this ring.
+ * @num: the number of elements.
+ * @weak_barriers: true if we only need memory barriers, not I/O.
+ * @desc: the userpace descriptor pointer.
+ * @avail: the userpace avail pointer.
+ * @used: the userpace used pointer.
+ *
+ * Returns an error if num is invalid.
+ */
+int vringh_init_iotlb(struct vringh *vrh, u64 features,
+ unsigned int num, bool weak_barriers,
+ struct vring_desc *desc,
+ struct vring_avail *avail,
+ struct vring_used *used)
+{
+ return vringh_init_kern(vrh, features, num, weak_barriers,
+ desc, avail, used);
+}
+EXPORT_SYMBOL(vringh_init_iotlb);
+
+/**
+ * vringh_set_iotlb - initialize a vringh for a ring with IOTLB.
+ * @vrh: the vring
+ * @iotlb: iotlb associated with this vring
+ */
+void vringh_set_iotlb(struct vringh *vrh, struct vhost_iotlb *iotlb)
+{
+ vrh->iotlb = iotlb;
+}
+EXPORT_SYMBOL(vringh_set_iotlb);
+
+/**
+ * vringh_getdesc_iotlb - get next available descriptor from ring with
+ * IOTLB.
+ * @vrh: the kernelspace vring.
+ * @riov: where to put the readable descriptors (or NULL)
+ * @wiov: where to put the writable descriptors (or NULL)
+ * @head: head index we received, for passing to vringh_complete_iotlb().
+ * @gfp: flags for allocating larger riov/wiov.
+ *
+ * Returns 0 if there was no descriptor, 1 if there was, or -errno.
+ *
+ * Note that on error return, you can tell the difference between an
+ * invalid ring and a single invalid descriptor: in the former case,
+ * *head will be vrh->vring.num. You may be able to ignore an invalid
+ * descriptor, but there's not much you can do with an invalid ring.
+ *
+ * Note that you may need to clean up riov and wiov, even on error!
+ */
+int vringh_getdesc_iotlb(struct vringh *vrh,
+ struct vringh_kiov *riov,
+ struct vringh_kiov *wiov,
+ u16 *head,
+ gfp_t gfp)
+{
+ int err;
+
+ err = __vringh_get_head(vrh, getu16_iotlb, &vrh->last_avail_idx);
+ if (err < 0)
+ return err;
+
+ /* Empty... */
+ if (err == vrh->vring.num)
+ return 0;
+
+ *head = err;
+ err = __vringh_iov(vrh, *head, riov, wiov, no_range_check, NULL,
+ gfp, copydesc_iotlb);
+ if (err)
+ return err;
+
+ return 1;
+}
+EXPORT_SYMBOL(vringh_getdesc_iotlb);
+
+/**
+ * vringh_iov_pull_iotlb - copy bytes from vring_iov.
+ * @vrh: the vring.
+ * @riov: the riov as passed to vringh_getdesc_iotlb() (updated as we consume)
+ * @dst: the place to copy.
+ * @len: the maximum length to copy.
+ *
+ * Returns the bytes copied <= len or a negative errno.
+ */
+ssize_t vringh_iov_pull_iotlb(struct vringh *vrh,
+ struct vringh_kiov *riov,
+ void *dst, size_t len)
+{
+ return vringh_iov_xfer(vrh, riov, dst, len, xfer_from_iotlb);
+}
+EXPORT_SYMBOL(vringh_iov_pull_iotlb);
+
+/**
+ * vringh_iov_push_iotlb - copy bytes into vring_iov.
+ * @vrh: the vring.
+ * @wiov: the wiov as passed to vringh_getdesc_iotlb() (updated as we consume)
+ * @dst: the place to copy.
+ * @len: the maximum length to copy.
+ *
+ * Returns the bytes copied <= len or a negative errno.
+ */
+ssize_t vringh_iov_push_iotlb(struct vringh *vrh,
+ struct vringh_kiov *wiov,
+ const void *src, size_t len)
+{
+ return vringh_iov_xfer(vrh, wiov, (void *)src, len, xfer_to_iotlb);
+}
+EXPORT_SYMBOL(vringh_iov_push_iotlb);
+
+/**
+ * vringh_abandon_iotlb - we've decided not to handle the descriptor(s).
+ * @vrh: the vring.
+ * @num: the number of descriptors to put back (ie. num
+ * vringh_get_iotlb() to undo).
+ *
+ * The next vringh_get_iotlb() will return the old descriptor(s) again.
+ */
+void vringh_abandon_iotlb(struct vringh *vrh, unsigned int num)
+{
+ /* We only update vring_avail_event(vr) when we want to be notified,
+ * so we haven't changed that yet.
+ */
+ vrh->last_avail_idx -= num;
+}
+EXPORT_SYMBOL(vringh_abandon_iotlb);
+
+/**
+ * vringh_complete_iotlb - we've finished with descriptor, publish it.
+ * @vrh: the vring.
+ * @head: the head as filled in by vringh_getdesc_iotlb.
+ * @len: the length of data we have written.
+ *
+ * You should check vringh_need_notify_iotlb() after one or more calls
+ * to this function.
+ */
+int vringh_complete_iotlb(struct vringh *vrh, u16 head, u32 len)
+{
+ struct vring_used_elem used;
+
+ used.id = cpu_to_vringh32(vrh, head);
+ used.len = cpu_to_vringh32(vrh, len);
+
+ return __vringh_complete(vrh, &used, 1, putu16_iotlb, putused_iotlb);
+}
+EXPORT_SYMBOL(vringh_complete_iotlb);
+
+/**
+ * vringh_notify_enable_iotlb - we want to know if something changes.
+ * @vrh: the vring.
+ *
+ * This always enables notifications, but returns false if there are
+ * now more buffers available in the vring.
+ */
+bool vringh_notify_enable_iotlb(struct vringh *vrh)
+{
+ return __vringh_notify_enable(vrh, getu16_iotlb, putu16_iotlb);
+}
+EXPORT_SYMBOL(vringh_notify_enable_iotlb);
+
+/**
+ * vringh_notify_disable_iotlb - don't tell us if something changes.
+ * @vrh: the vring.
+ *
+ * This is our normal running state: we disable and then only enable when
+ * we're going to sleep.
+ */
+void vringh_notify_disable_iotlb(struct vringh *vrh)
+{
+ __vringh_notify_disable(vrh, putu16_iotlb);
+}
+EXPORT_SYMBOL(vringh_notify_disable_iotlb);
+
+/**
+ * vringh_need_notify_iotlb - must we tell the other side about used buffers?
+ * @vrh: the vring we've called vringh_complete_iotlb() on.
+ *
+ * Returns -errno or 0 if we don't need to tell the other side, 1 if we do.
+ */
+int vringh_need_notify_iotlb(struct vringh *vrh)
+{
+ return __vringh_need_notify(vrh, getu16_iotlb);
+}
+EXPORT_SYMBOL(vringh_need_notify_iotlb);
+
+
MODULE_LICENSE("GPL");
diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index d237087eb257..bd0503ca6f8f 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -14,6 +14,8 @@
#include <linux/virtio_byteorder.h>
#include <linux/uio.h>
#include <linux/slab.h>
+#include <linux/dma-direction.h>
+#include <linux/vhost_iotlb.h>
#include <asm/barrier.h>

/* virtio_ring with information needed for host access. */
@@ -39,6 +41,9 @@ struct vringh {
/* The vring (note: it may contain user pointers!) */
struct vring vring;

+ /* IOTLB for this vring */
+ struct vhost_iotlb *iotlb;
+
/* The function to call to notify the guest about added buffers */
void (*notify)(struct vringh *);
};
@@ -248,4 +253,35 @@ static inline __virtio64 cpu_to_vringh64(const struct vringh *vrh, u64 val)
{
return __cpu_to_virtio64(vringh_is_little_endian(vrh), val);
}
+
+void vringh_set_iotlb(struct vringh *vrh, struct vhost_iotlb *iotlb);
+
+int vringh_init_iotlb(struct vringh *vrh, u64 features,
+ unsigned int num, bool weak_barriers,
+ struct vring_desc *desc,
+ struct vring_avail *avail,
+ struct vring_used *used);
+
+int vringh_getdesc_iotlb(struct vringh *vrh,
+ struct vringh_kiov *riov,
+ struct vringh_kiov *wiov,
+ u16 *head,
+ gfp_t gfp);
+
+ssize_t vringh_iov_pull_iotlb(struct vringh *vrh,
+ struct vringh_kiov *riov,
+ void *dst, size_t len);
+ssize_t vringh_iov_push_iotlb(struct vringh *vrh,
+ struct vringh_kiov *wiov,
+ const void *src, size_t len);
+
+void vringh_abandon_iotlb(struct vringh *vrh, unsigned int num);
+
+int vringh_complete_iotlb(struct vringh *vrh, u16 head, u32 len);
+
+bool vringh_notify_enable_iotlb(struct vringh *vrh);
+void vringh_notify_disable_iotlb(struct vringh *vrh);
+
+int vringh_need_notify_iotlb(struct vringh *vrh);
+
#endif /* _LINUX_VRINGH_H */
--
2.19.1

2020-02-10 04:01:34

by Jason Wang

[permalink] [raw]
Subject: [PATCH V2 3/5] vDPA: introduce vDPA bus

vDPA device is a device that uses a datapath which complies with the
virtio specifications with vendor specific control path. vDPA devices
can be both physically located on the hardware or emulated by
software. vDPA hardware devices are usually implemented through PCIE
with the following types:

- PF (Physical Function) - A single Physical Function
- VF (Virtual Function) - Device that supports single root I/O
virtualization (SR-IOV). Its Virtual Function (VF) represents a
virtualized instance of the device that can be assigned to different
partitions
- ADI (Assignable Device Interface) and its equivalents - With
technologies such as Intel Scalable IOV, a virtual device (VDEV)
composed by host OS utilizing one or more ADIs. Or its equivalent
like SF (Sub function) from Mellanox.

From a driver's perspective, depends on how and where the DMA
translation is done, vDPA devices are split into two types:

- Platform specific DMA translation - From the driver's perspective,
the device can be used on a platform where device access to data in
memory is limited and/or translated. An example is a PCIE vDPA whose
DMA request was tagged via a bus (e.g PCIE) specific way. DMA
translation and protection are done at PCIE bus IOMMU level.
- Device specific DMA translation - The device implements DMA
isolation and protection through its own logic. An example is a vDPA
device which uses on-chip IOMMU.

To hide the differences and complexity of the above types for a vDPA
device/IOMMU options and in order to present a generic virtio device
to the upper layer, a device agnostic framework is required.

This patch introduces a software vDPA bus which abstracts the
common attributes of vDPA device, vDPA bus driver and the
communication method (vdpa_config_ops) between the vDPA device
abstraction and the vDPA bus driver:

With the abstraction of vDPA bus and vDPA bus operations, the
difference and complexity of the under layer hardware is hidden from
upper layer. The vDPA bus drivers on top can use a unified
vdpa_config_ops to control different types of vDPA device.

Signed-off-by: Jason Wang <[email protected]>
---
MAINTAINERS | 1 +
drivers/virtio/Kconfig | 2 +
drivers/virtio/Makefile | 1 +
drivers/virtio/vdpa/Kconfig | 9 ++
drivers/virtio/vdpa/Makefile | 2 +
drivers/virtio/vdpa/vdpa.c | 160 ++++++++++++++++++++++++
include/linux/vdpa.h | 233 +++++++++++++++++++++++++++++++++++
7 files changed, 408 insertions(+)
create mode 100644 drivers/virtio/vdpa/Kconfig
create mode 100644 drivers/virtio/vdpa/Makefile
create mode 100644 drivers/virtio/vdpa/vdpa.c
create mode 100644 include/linux/vdpa.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d4bda9c900fa..578d2a581e3b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17540,6 +17540,7 @@ F: tools/virtio/
F: drivers/net/virtio_net.c
F: drivers/block/virtio_blk.c
F: include/linux/virtio*.h
+F: include/linux/vdpa.h
F: include/uapi/linux/virtio_*.h
F: drivers/crypto/virtio/
F: mm/balloon_compaction.c
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..9c4fdb64d9ac 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -96,3 +96,5 @@ config VIRTIO_MMIO_CMDLINE_DEVICES
If unsure, say 'N'.

endif # VIRTIO_MENU
+
+source "drivers/virtio/vdpa/Kconfig"
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index 3a2b5c5dcf46..fdf5eacd0d0a 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,3 +6,4 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VDPA) += vdpa/
diff --git a/drivers/virtio/vdpa/Kconfig b/drivers/virtio/vdpa/Kconfig
new file mode 100644
index 000000000000..7a99170e6c30
--- /dev/null
+++ b/drivers/virtio/vdpa/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config VDPA
+ tristate
+ default m
+ help
+ Enable this module to support vDPA device that uses a
+ datapath which complies with virtio specifications with
+ vendor specific control path.
+
diff --git a/drivers/virtio/vdpa/Makefile b/drivers/virtio/vdpa/Makefile
new file mode 100644
index 000000000000..ee6a35e8a4fb
--- /dev/null
+++ b/drivers/virtio/vdpa/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_VDPA) += vdpa.o
diff --git a/drivers/virtio/vdpa/vdpa.c b/drivers/virtio/vdpa/vdpa.c
new file mode 100644
index 000000000000..4003d90f5df3
--- /dev/null
+++ b/drivers/virtio/vdpa/vdpa.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vDPA bus.
+ *
+ * Copyright (c) 2020, Red Hat. All rights reserved.
+ * Author: Jason Wang <[email protected]>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/idr.h>
+#include <linux/vdpa.h>
+
+static DEFINE_IDA(vdpa_index_ida);
+
+static int vdpa_dev_probe(struct device *d)
+{
+ struct vdpa_device *vdev = dev_to_vdpa(d);
+ struct vdpa_driver *drv = drv_to_vdpa(vdev->dev.driver);
+ int ret = 0;
+
+ if (drv && drv->probe)
+ ret = drv->probe(vdev);
+
+ return ret;
+}
+
+static int vdpa_dev_remove(struct device *d)
+{
+ struct vdpa_device *vdev = dev_to_vdpa(d);
+ struct vdpa_driver *drv = drv_to_vdpa(vdev->dev.driver);
+
+ if (drv && drv->remove)
+ drv->remove(vdev);
+
+ return 0;
+}
+
+static struct bus_type vdpa_bus = {
+ .name = "vdpa",
+ .probe = vdpa_dev_probe,
+ .remove = vdpa_dev_remove,
+};
+
+/**
+ * vdpa_init_device - initilaize a vDPA device
+ * This allows driver to some prepartion between after device is
+ * initialized but before vdpa_register_device()
+ * @vdev: the vdpa device to be initialized
+ * @parent: the paretn device
+ * @dma_dev: the actual device that is performing DMA
+ * @config: the bus operations support by this device
+ *
+ * Returns an error when parent/config/dma_dev is not set or fail to get
+ * ida.
+ */
+int vdpa_init_device(struct vdpa_device *vdev, struct device *parent,
+ struct device *dma_dev,
+ const struct vdpa_config_ops *config)
+{
+ int err;
+
+ if (!parent || !dma_dev || !config)
+ return -EINVAL;
+
+ err = ida_simple_get(&vdpa_index_ida, 0, 0, GFP_KERNEL);
+ if (err < 0)
+ return -EFAULT;
+
+ vdev->dev.bus = &vdpa_bus;
+ vdev->dev.parent = parent;
+
+ device_initialize(&vdev->dev);
+
+ vdev->index = err;
+ vdev->dma_dev = dma_dev;
+ vdev->config = config;
+
+ dev_set_name(&vdev->dev, "vdpa%u", vdev->index);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vdpa_init_device);
+
+/**
+ * vdpa_register_device - register a vDPA device
+ * Callers must have a succeed call of vdpa_init_device() before.
+ * @vdev: the vdpa device to be registered to vDPA bus
+ *
+ * Returns an error when fail to add to vDPA bus
+ */
+int vdpa_register_device(struct vdpa_device *vdev)
+{
+ int err = device_add(&vdev->dev);
+
+ if (err) {
+ put_device(&vdev->dev);
+ ida_simple_remove(&vdpa_index_ida, vdev->index);
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(vdpa_register_device);
+
+/**
+ * vdpa_unregister_device - unregister a vDPA device
+ * @vdev: the vdpa device to be unregisted from vDPA bus
+ */
+void vdpa_unregister_device(struct vdpa_device *vdev)
+{
+ int index = vdev->index;
+
+ device_unregister(&vdev->dev);
+ ida_simple_remove(&vdpa_index_ida, index);
+}
+EXPORT_SYMBOL_GPL(vdpa_unregister_device);
+
+/**
+ * __vdpa_register_driver - register a vDPA device driver
+ * @drv: the vdpa device driver to be registered
+ * @owner: module owner of the driver
+ *
+ * Returns an err when fail to do the registration
+ */
+int __vdpa_register_driver(struct vdpa_driver *drv, struct module *owner)
+{
+ drv->driver.bus = &vdpa_bus;
+ drv->driver.owner = owner;
+
+ return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL_GPL(__vdpa_register_driver);
+
+/**
+ * vdpa_unregister_driver - unregister a vDPA device driver
+ * @drv: the vdpa device driver to be unregistered
+ */
+void vdpa_unregister_driver(struct vdpa_driver *drv)
+{
+ driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL_GPL(vdpa_unregister_driver);
+
+static int vdpa_init(void)
+{
+ if (bus_register(&vdpa_bus) != 0)
+ panic("virtio bus registration failed");
+ return 0;
+}
+
+static void __exit vdpa_exit(void)
+{
+ bus_unregister(&vdpa_bus);
+ ida_destroy(&vdpa_index_ida);
+}
+core_initcall(vdpa_init);
+module_exit(vdpa_exit);
+
+MODULE_AUTHOR("Jason Wang <[email protected]>");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
new file mode 100644
index 000000000000..52543a862801
--- /dev/null
+++ b/include/linux/vdpa.h
@@ -0,0 +1,233 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VDPA_H
+#define _LINUX_VDPA_H
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/vhost_iotlb.h>
+
+/**
+ * vDPA callback definition.
+ * @callback: interrupt callback function
+ * @private: the data passed to the callback function
+ */
+struct vdpa_callback {
+ irqreturn_t (*callback)(void *data);
+ void *private;
+};
+
+/**
+ * vDPA device - representation of a vDPA device
+ * @dev: underlying device
+ * @dma_dev: the actual device that is performing DMA
+ * @config: the configuration ops for this device.
+ * @index: device index
+ */
+struct vdpa_device {
+ struct device dev;
+ struct device *dma_dev;
+ const struct vdpa_config_ops *config;
+ int index;
+};
+
+/**
+ * vDPA_config_ops - operations for configuring a vDPA device.
+ * Note: vDPA device drivers are required to implement all of the
+ * operations unless it is mentioned to be optional in the following
+ * list.
+ *
+ * @set_vq_address: Set the address of virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * @desc_area: address of desc area
+ * @driver_area: address of driver area
+ * @device_area: address of device area
+ * Returns integer: success (0) or error (< 0)
+ * @set_vq_num: Set the size of virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * @num: the size of virtqueue
+ * @kick_vq: Kick the virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * @set_vq_cb: Set the interrupt callback function for
+ * a virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * @cb: virtio-vdev interrupt callback structure
+ * @set_vq_ready: Set ready status for a virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * @ready: ready (true) not ready(false)
+ * @get_vq_ready: Get ready status for a virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * Returns boolean: ready (true) or not (false)
+ * @set_vq_state: Set the state for a virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * @state: virtqueue state (last_avail_idx)
+ * Returns integer: success (0) or error (< 0)
+ * @get_vq_state: Get the state for a virtqueue
+ * @vdev: vdpa device
+ * @idx: virtqueue index
+ * Returns virtqueue state (last_avail_idx)
+ * @get_vq_align: Get the virtqueue align requirement
+ * for the device
+ * @vdev: vdpa device
+ * Returns virtqueue algin requirement
+ * @get_features: Get virtio features supported by the device
+ * @vdev: vdpa device
+ * Returns the virtio features support by the
+ * device
+ * @set_features: Set virtio features supported by the driver
+ * @vdev: vdpa device
+ * @features: feature support by the driver
+ * Returns integer: success (0) or error (< 0)
+ * @set_config_cb: Set the config interrupt callback
+ * @vdev: vdpa device
+ * @cb: virtio-vdev interrupt callback structure
+ * @get_vq_num_max: Get the max size of virtqueue
+ * @vdev: vdpa device
+ * Returns u16: max size of virtqueue
+ * @get_device_id: Get virtio device id
+ * @vdev: vdpa device
+ * Returns u32: virtio device id
+ * @get_vendor_id: Get id for the vendor that provides this device
+ * @vdev: vdpa device
+ * Returns u32: virtio vendor id
+ * @get_status: Get the device status
+ * @vdev: vdpa device
+ * Returns u8: virtio device status
+ * @set_status: Set the device status
+ * @vdev: vdpa device
+ * @status: virtio device status
+ * @get_config: Read from device specific configuration space
+ * @vdev: vdpa device
+ * @offset: offset from the beginning of
+ * configuration space
+ * @buf: buffer used to read to
+ * @len: the length to read from
+ * configuration space
+ * @set_config: Write to device specific configuration space
+ * @vdev: vdpa device
+ * @offset: offset from the beginning of
+ * configuration space
+ * @buf: buffer used to write from
+ * @len: the length to write to
+ * configuration space
+ * @get_generation: Get device config generation (optional)
+ * @vdev: vdpa device
+ * Returns u32: device generation
+ * @set_map: Set device memory mapping (optional)
+ * and only needed for device that using
+ * device specific DMA translation
+ * (on-chip IOMMU)
+ * @vdev: vdpa device
+ * @iotlb: vhost memory mapping to be
+ * used by the vDPA
+ * Returns integer: success (0) or error (< 0)
+ * @dma_map: Map an area of PA to IOVA
+ * @vdev: vdpa device
+ * @iova: iova to be mapped
+ * @size: size of the area
+ * @pa: physical address for the map
+ * @perm: device access permission (VHOST_MAP_XX)
+ * Returns integer: success (0) or error (< 0)
+ * @dma_unmap: Unmap an area of IOVA
+ * @vdev: vdpa device
+ * @iova: iova to be unmapped
+ * @size: size of the area
+ * Returns integer: success (0) or error (< 0)
+ */
+struct vdpa_config_ops {
+ /* Virtqueue ops */
+ int (*set_vq_address)(struct vdpa_device *vdev,
+ u16 idx, u64 desc_area, u64 driver_area,
+ u64 device_area);
+ void (*set_vq_num)(struct vdpa_device *vdev, u16 idx, u32 num);
+ void (*kick_vq)(struct vdpa_device *vdev, u16 idx);
+ void (*set_vq_cb)(struct vdpa_device *vdev, u16 idx,
+ struct vdpa_callback *cb);
+ void (*set_vq_ready)(struct vdpa_device *vdev, u16 idx, bool ready);
+ bool (*get_vq_ready)(struct vdpa_device *vdev, u16 idx);
+ int (*set_vq_state)(struct vdpa_device *vdev, u16 idx, u64 state);
+ u64 (*get_vq_state)(struct vdpa_device *vdev, u16 idx);
+
+ /* Device ops */
+ u16 (*get_vq_align)(struct vdpa_device *vdev);
+ u64 (*get_features)(struct vdpa_device *vdev);
+ int (*set_features)(struct vdpa_device *vdev, u64 features);
+ void (*set_config_cb)(struct vdpa_device *vdev,
+ struct vdpa_callback *cb);
+ u16 (*get_vq_num_max)(struct vdpa_device *vdev);
+ u32 (*get_device_id)(struct vdpa_device *vdev);
+ u32 (*get_vendor_id)(struct vdpa_device *vdev);
+ u8 (*get_status)(struct vdpa_device *vdev);
+ void (*set_status)(struct vdpa_device *vdev, u8 status);
+ void (*get_config)(struct vdpa_device *vdev, unsigned int offset,
+ void *buf, unsigned int len);
+ void (*set_config)(struct vdpa_device *vdev, unsigned int offset,
+ const void *buf, unsigned int len);
+ u32 (*get_generation)(struct vdpa_device *vdev);
+
+ /* DMA ops */
+ int (*set_map)(struct vdpa_device *vdev, struct vhost_iotlb *iotlb);
+ int (*dma_map)(struct vdpa_device *vdev, u64 iova, u64 size,
+ u64 pa, u32 perm);
+ int (*dma_unmap)(struct vdpa_device *vdev, u64 iova, u64 size);
+};
+
+int vdpa_init_device(struct vdpa_device *vdev,
+ struct device *parent,
+ struct device *dma_dev,
+ const struct vdpa_config_ops *config);
+int vdpa_register_device(struct vdpa_device *vdev);
+void vdpa_unregister_device(struct vdpa_device *vdev);
+
+/**
+ * vdpa_driver - operations for a vDPA driver
+ * @driver: underlying device driver
+ * @probe: the function to call when a device is found. Returns 0 or -errno.
+ * @remove: the function to call when a device is removed.
+ */
+struct vdpa_driver {
+ struct device_driver driver;
+ int (*probe)(struct vdpa_device *vdev);
+ void (*remove)(struct vdpa_device *vdev);
+};
+
+#define vdpa_register_driver(drv) \
+ __vdpa_register_driver(drv, THIS_MODULE)
+int __vdpa_register_driver(struct vdpa_driver *drv, struct module *owner);
+void vdpa_unregister_driver(struct vdpa_driver *drv);
+
+#define module_vdpa_driver(__vdpa_driver) \
+ module_driver(__vdpa_driver, vdpa_register_driver, \
+ vdpa_unregister_driver)
+
+static inline struct vdpa_driver *drv_to_vdpa(struct device_driver *driver)
+{
+ return container_of(driver, struct vdpa_driver, driver);
+}
+
+static inline struct vdpa_device *dev_to_vdpa(struct device *_dev)
+{
+ return container_of(_dev, struct vdpa_device, dev);
+}
+
+static inline void *vdpa_get_drvdata(const struct vdpa_device *vdev)
+{
+ return dev_get_drvdata(&vdev->dev);
+}
+
+static inline void vdpa_set_drvdata(struct vdpa_device *vdev, void *data)
+{
+ dev_set_drvdata(&vdev->dev, data);
+}
+
+static inline struct device *vdpa_get_dma_dev(struct vdpa_device *vdev)
+{
+ return vdev->dma_dev;
+}
+#endif /* _LINUX_VDPA_H */
--
2.19.1

2020-02-10 04:03:26

by Jason Wang

[permalink] [raw]
Subject: [PATCH V2 4/5] virtio: introduce a vDPA based transport

This patch introduces a vDPA transport for virtio. This is used to
use kernel virtio driver to drive the mediated device that is capable
of populating virtqueue directly.

A new virtio-vdpa driver will be registered to the vDPA bus, when a
new virtio-vdpa device is probed, it will register the device with
vdpa based config ops. This means it is a software transport between
vDPA driver and vDPA device. The transport was implemented through
bus_ops of vDPA parent.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/virtio/Kconfig | 13 ++
drivers/virtio/Makefile | 1 +
drivers/virtio/virtio_vdpa.c | 392 +++++++++++++++++++++++++++++++++++
3 files changed, 406 insertions(+)
create mode 100644 drivers/virtio/virtio_vdpa.c

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 9c4fdb64d9ac..0df3676b0f4f 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -43,6 +43,19 @@ config VIRTIO_PCI_LEGACY

If unsure, say Y.

+config VIRTIO_VDPA
+ tristate "vDPA driver for virtio devices"
+ select VDPA
+ select VIRTIO
+ help
+ This driver provides support for virtio based paravirtual
+ device driver over vDPA bus. For this to be useful, you need
+ an appropriate vDPA device implementation that operates on a
+ physical device to allow the datapath of virtio to be
+ offloaded to hardware.
+
+ If unsure, say M.
+
config VIRTIO_PMEM
tristate "Support for virtio pmem driver"
depends on VIRTIO
diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
index fdf5eacd0d0a..3407ac03fe60 100644
--- a/drivers/virtio/Makefile
+++ b/drivers/virtio/Makefile
@@ -6,4 +6,5 @@ virtio_pci-y := virtio_pci_modern.o virtio_pci_common.o
virtio_pci-$(CONFIG_VIRTIO_PCI_LEGACY) += virtio_pci_legacy.o
obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
obj-$(CONFIG_VIRTIO_INPUT) += virtio_input.o
+obj-$(CONFIG_VIRTIO_VDPA) += virtio_vdpa.o
obj-$(CONFIG_VDPA) += vdpa/
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
new file mode 100644
index 000000000000..077796087abf
--- /dev/null
+++ b/drivers/virtio/virtio_vdpa.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VIRTIO based driver for vDPA device
+ *
+ * Copyright (c) 2020, Red Hat. All rights reserved.
+ * Author: Jason Wang <[email protected]>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/virtio.h>
+#include <linux/vdpa.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ring.h>
+
+#define MOD_VERSION "0.1"
+#define MOD_AUTHOR "Jason Wang <[email protected]>"
+#define MOD_DESC "vDPA bus driver for virtio devices"
+#define MOD_LICENSE "GPL v2"
+
+struct virtio_vdpa_device {
+ struct virtio_device vdev;
+ struct vdpa_device *vdpa;
+ u64 features;
+
+ /* The lock to protect virtqueue list */
+ spinlock_t lock;
+ /* List of virtio_vdpa_vq_info */
+ struct list_head virtqueues;
+};
+
+struct virtio_vdpa_vq_info {
+ /* the actual virtqueue */
+ struct virtqueue *vq;
+
+ /* the list node for the virtqueues list */
+ struct list_head node;
+};
+
+static inline struct virtio_vdpa_device *
+to_virtio_vdpa_device(struct virtio_device *dev)
+{
+ return container_of(dev, struct virtio_vdpa_device, vdev);
+}
+
+static struct vdpa_device *vd_get_vdpa(struct virtio_device *vdev)
+{
+ return to_virtio_vdpa_device(vdev)->vdpa;
+}
+
+static void virtio_vdpa_get(struct virtio_device *vdev, unsigned offset,
+ void *buf, unsigned len)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ ops->get_config(vdpa, offset, buf, len);
+}
+
+static void virtio_vdpa_set(struct virtio_device *vdev, unsigned offset,
+ const void *buf, unsigned len)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ ops->set_config(vdpa, offset, buf, len);
+}
+
+static u32 virtio_vdpa_generation(struct virtio_device *vdev)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ if (ops->get_generation)
+ return ops->get_generation(vdpa);
+
+ return 0;
+}
+
+static u8 virtio_vdpa_get_status(struct virtio_device *vdev)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ return ops->get_status(vdpa);
+}
+
+static void virtio_vdpa_set_status(struct virtio_device *vdev, u8 status)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ return ops->set_status(vdpa, status);
+}
+
+static void virtio_vdpa_reset(struct virtio_device *vdev)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ return ops->set_status(vdpa, 0);
+}
+
+static bool virtio_vdpa_notify(struct virtqueue *vq)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vq->vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ ops->kick_vq(vdpa, vq->index);
+
+ return true;
+}
+
+static irqreturn_t virtio_vdpa_config_cb(void *private)
+{
+ struct virtio_vdpa_device *vd_dev = private;
+
+ virtio_config_changed(&vd_dev->vdev);
+
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t virtio_vdpa_virtqueue_cb(void *private)
+{
+ struct virtio_vdpa_vq_info *info = private;
+
+ return vring_interrupt(0, info->vq);
+}
+
+static struct virtqueue *
+virtio_vdpa_setup_vq(struct virtio_device *vdev, unsigned int index,
+ void (*callback)(struct virtqueue *vq),
+ const char *name, bool ctx)
+{
+ struct virtio_vdpa_device *vd_dev = to_virtio_vdpa_device(vdev);
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+ struct virtio_vdpa_vq_info *info;
+ struct vdpa_callback cb;
+ struct virtqueue *vq;
+ u64 desc_addr, driver_addr, device_addr;
+ unsigned long flags;
+ u32 align, num;
+ int err;
+
+ if (!name)
+ return NULL;
+
+ /* Queue shouldn't already be set up. */
+ if (ops->get_vq_ready(vdpa, index))
+ return ERR_PTR(-ENOENT);
+
+ /* Allocate and fill out our active queue description */
+ info = kmalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return ERR_PTR(-ENOMEM);
+
+ num = ops->get_vq_num_max(vdpa);
+ if (num == 0) {
+ err = -ENOENT;
+ goto error_new_virtqueue;
+ }
+
+ /* Create the vring */
+ align = ops->get_vq_align(vdpa);
+ vq = vring_create_virtqueue(index, num, align, vdev,
+ true, true, ctx,
+ virtio_vdpa_notify, callback, name);
+ if (!vq) {
+ err = -ENOMEM;
+ goto error_new_virtqueue;
+ }
+
+ /* Setup virtqueue callback */
+ cb.callback = virtio_vdpa_virtqueue_cb;
+ cb.private = info;
+ ops->set_vq_cb(vdpa, index, &cb);
+ ops->set_vq_num(vdpa, index, virtqueue_get_vring_size(vq));
+
+ desc_addr = virtqueue_get_desc_addr(vq);
+ driver_addr = virtqueue_get_avail_addr(vq);
+ device_addr = virtqueue_get_used_addr(vq);
+
+ if (ops->set_vq_address(vdpa, index,
+ desc_addr, driver_addr,
+ device_addr)) {
+ err = -EINVAL;
+ goto err_vq;
+ }
+
+ ops->set_vq_ready(vdpa, index, 1);
+
+ vq->priv = info;
+ info->vq = vq;
+
+ spin_lock_irqsave(&vd_dev->lock, flags);
+ list_add(&info->node, &vd_dev->virtqueues);
+ spin_unlock_irqrestore(&vd_dev->lock, flags);
+
+ return vq;
+
+err_vq:
+ vring_del_virtqueue(vq);
+error_new_virtqueue:
+ ops->set_vq_ready(vdpa, index, 0);
+ /* VDPA driver should make sure vq is stopeed here */
+ WARN_ON(ops->get_vq_ready(vdpa, index));
+ kfree(info);
+ return ERR_PTR(err);
+}
+
+static void virtio_vdpa_del_vq(struct virtqueue *vq)
+{
+ struct virtio_vdpa_device *vd_dev = to_virtio_vdpa_device(vq->vdev);
+ struct vdpa_device *vdpa = vd_dev->vdpa;
+ const struct vdpa_config_ops *ops = vdpa->config;
+ struct virtio_vdpa_vq_info *info = vq->priv;
+ unsigned int index = vq->index;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vd_dev->lock, flags);
+ list_del(&info->node);
+ spin_unlock_irqrestore(&vd_dev->lock, flags);
+
+ /* Select and deactivate the queue */
+ ops->set_vq_ready(vdpa, index, 0);
+ WARN_ON(ops->get_vq_ready(vdpa, index));
+
+ vring_del_virtqueue(vq);
+
+ kfree(info);
+}
+
+static void virtio_vdpa_del_vqs(struct virtio_device *vdev)
+{
+ struct virtqueue *vq, *n;
+
+ list_for_each_entry_safe(vq, n, &vdev->vqs, list)
+ virtio_vdpa_del_vq(vq);
+}
+
+static int virtio_vdpa_find_vqs(struct virtio_device *vdev, unsigned nvqs,
+ struct virtqueue *vqs[],
+ vq_callback_t *callbacks[],
+ const char * const names[],
+ const bool *ctx,
+ struct irq_affinity *desc)
+{
+ struct virtio_vdpa_device *vd_dev = to_virtio_vdpa_device(vdev);
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+ struct vdpa_callback cb;
+ int i, err, queue_idx = 0;
+
+ for (i = 0; i < nvqs; ++i) {
+ if (!names[i]) {
+ vqs[i] = NULL;
+ continue;
+ }
+
+ vqs[i] = virtio_vdpa_setup_vq(vdev, queue_idx++,
+ callbacks[i], names[i], ctx ?
+ ctx[i] : false);
+ if (IS_ERR(vqs[i])) {
+ err = PTR_ERR(vqs[i]);
+ goto err_setup_vq;
+ }
+ }
+
+ cb.callback = virtio_vdpa_config_cb;
+ cb.private = vd_dev;
+ ops->set_config_cb(vdpa, &cb);
+
+ return 0;
+
+err_setup_vq:
+ virtio_vdpa_del_vqs(vdev);
+ return err;
+}
+
+static u64 virtio_vdpa_get_features(struct virtio_device *vdev)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ return ops->get_features(vdpa);
+}
+
+static int virtio_vdpa_finalize_features(struct virtio_device *vdev)
+{
+ struct vdpa_device *vdpa = vd_get_vdpa(vdev);
+ const struct vdpa_config_ops *ops = vdpa->config;
+
+ /* Give virtio_ring a chance to accept features. */
+ vring_transport_features(vdev);
+
+ return ops->set_features(vdpa, vdev->features);
+}
+
+static const char *virtio_vdpa_bus_name(struct virtio_device *vdev)
+{
+ struct virtio_vdpa_device *vd_dev = to_virtio_vdpa_device(vdev);
+ struct vdpa_device *vdpa = vd_dev->vdpa;
+
+ return dev_name(&vdpa->dev);
+}
+
+static const struct virtio_config_ops virtio_vdpa_config_ops = {
+ .get = virtio_vdpa_get,
+ .set = virtio_vdpa_set,
+ .generation = virtio_vdpa_generation,
+ .get_status = virtio_vdpa_get_status,
+ .set_status = virtio_vdpa_set_status,
+ .reset = virtio_vdpa_reset,
+ .find_vqs = virtio_vdpa_find_vqs,
+ .del_vqs = virtio_vdpa_del_vqs,
+ .get_features = virtio_vdpa_get_features,
+ .finalize_features = virtio_vdpa_finalize_features,
+ .bus_name = virtio_vdpa_bus_name,
+};
+
+static void virtio_vdpa_release_dev(struct device *_d)
+{
+ struct virtio_device *vdev =
+ container_of(_d, struct virtio_device, dev);
+ struct virtio_vdpa_device *vd_dev =
+ container_of(vdev, struct virtio_vdpa_device, vdev);
+
+ kfree(vd_dev);
+}
+
+static int virtio_vdpa_probe(struct vdpa_device *vdpa)
+{
+ const struct vdpa_config_ops *ops = vdpa->config;
+ struct virtio_vdpa_device *vd_dev;
+ int ret = -EINVAL;
+
+ vd_dev = kzalloc(sizeof(*vd_dev), GFP_KERNEL);
+ if (!vd_dev)
+ return -ENOMEM;
+
+ vd_dev->vdev.dev.parent = vdpa_get_dma_dev(vdpa);
+ vd_dev->vdev.dev.release = virtio_vdpa_release_dev;
+ vd_dev->vdev.config = &virtio_vdpa_config_ops;
+ vd_dev->vdpa = vdpa;
+ INIT_LIST_HEAD(&vd_dev->virtqueues);
+ spin_lock_init(&vd_dev->lock);
+
+ vd_dev->vdev.id.device = ops->get_device_id(vdpa);
+ if (vd_dev->vdev.id.device == 0)
+ goto err;
+
+ vd_dev->vdev.id.vendor = ops->get_vendor_id(vdpa);
+ ret = register_virtio_device(&vd_dev->vdev);
+ if (ret)
+ goto err;
+
+ vdpa_set_drvdata(vdpa, vd_dev);
+
+ return 0;
+
+err:
+ kfree(vd_dev);
+ return ret;
+}
+
+static void virtio_vdpa_remove(struct vdpa_device *vdpa)
+{
+ struct virtio_vdpa_device *vd_dev = vdpa_get_drvdata(vdpa);
+
+ unregister_virtio_device(&vd_dev->vdev);
+}
+
+static struct vdpa_driver virtio_vdpa_driver = {
+ .driver = {
+ .name = "virtio_vdpa",
+ },
+ .probe = virtio_vdpa_probe,
+ .remove = virtio_vdpa_remove,
+};
+
+module_vdpa_driver(virtio_vdpa_driver);
+
+MODULE_VERSION(MOD_VERSION);
+MODULE_LICENSE(MOD_LICENSE);
+MODULE_AUTHOR(MOD_AUTHOR);
+MODULE_DESCRIPTION(MOD_DESC);
--
2.19.1

2020-02-10 04:03:41

by Jason Wang

[permalink] [raw]
Subject: [PATCH V2 5/5] vdpasim: vDPA device simulator

This patch implements a software vDPA networking device. The datapath
is implemented through vringh and workqueue. The device has an on-chip
IOMMU which translates IOVA to PA. For kernel virtio drivers, vDPA
simulator driver provides dma_ops. For vhost driers, set_map() methods
of vdpa_config_ops is implemented to accept mappings from vhost.

Currently, vDPA device simulator will loopback TX traffic to RX. So
the main use case for the device is vDPA feature testing, prototyping
and development.

Note, there's no management API implemented, a vDPA device will be
registered once the module is probed. We need to handle this in the
future development.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/virtio/vdpa/Kconfig | 17 +
drivers/virtio/vdpa/Makefile | 1 +
drivers/virtio/vdpa/vdpa_sim.c | 678 +++++++++++++++++++++++++++++++++
3 files changed, 696 insertions(+)
create mode 100644 drivers/virtio/vdpa/vdpa_sim.c

diff --git a/drivers/virtio/vdpa/Kconfig b/drivers/virtio/vdpa/Kconfig
index 7a99170e6c30..a7888974dda8 100644
--- a/drivers/virtio/vdpa/Kconfig
+++ b/drivers/virtio/vdpa/Kconfig
@@ -7,3 +7,20 @@ config VDPA
datapath which complies with virtio specifications with
vendor specific control path.

+menuconfig VDPA_MENU
+ bool "VDPA drivers"
+ default n
+
+if VDPA_MENU
+
+config VDPA_SIM
+ tristate "vDPA device simulator"
+ select VDPA
+ default n
+ help
+ vDPA networking device simulator which loop TX traffic back
+ to RX. This device is used for testing, prototyping and
+ development of vDPA.
+
+endif # VDPA_MENU
+
diff --git a/drivers/virtio/vdpa/Makefile b/drivers/virtio/vdpa/Makefile
index ee6a35e8a4fb..5ec0e6ae3c57 100644
--- a/drivers/virtio/vdpa/Makefile
+++ b/drivers/virtio/vdpa/Makefile
@@ -1,2 +1,3 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_VDPA) += vdpa.o
+obj-$(CONFIG_VDPA_SIM) += vdpa_sim.o
diff --git a/drivers/virtio/vdpa/vdpa_sim.c b/drivers/virtio/vdpa/vdpa_sim.c
new file mode 100644
index 000000000000..89783a2b9773
--- /dev/null
+++ b/drivers/virtio/vdpa/vdpa_sim.c
@@ -0,0 +1,678 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDPA networking device simulator.
+ *
+ * Copyright (c) 2020, Red Hat Inc. All rights reserved.
+ * Author: Jason Wang <[email protected]>
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/file.h>
+#include <linux/etherdevice.h>
+#include <linux/vringh.h>
+#include <linux/vdpa.h>
+#include <linux/vhost_iotlb.h>
+#include <uapi/linux/virtio_config.h>
+#include <uapi/linux/virtio_net.h>
+
+#define DRV_VERSION "0.1"
+#define DRV_AUTHOR "Jason Wang <[email protected]>"
+#define DRV_DESC "vDPA Device Simulator"
+#define DRV_LICENSE "GPL v2"
+
+struct vdpasim_dev {
+ struct device dev;
+};
+
+struct vdpasim_dev *vdpasim_dev;
+
+struct vdpasim_virtqueue {
+ struct vringh vring;
+ struct vringh_kiov iov;
+ unsigned short head;
+ bool ready;
+ u64 desc_addr;
+ u64 device_addr;
+ u64 driver_addr;
+ u32 num;
+ void *private;
+ irqreturn_t (*cb)(void *data);
+};
+
+#define VDPASIM_QUEUE_ALIGN PAGE_SIZE
+#define VDPASIM_QUEUE_MAX 256
+#define VDPASIM_DEVICE_ID 0x1
+#define VDPASIM_VENDOR_ID 0
+#define VDPASIM_VQ_NUM 0x2
+#define VDPASIM_NAME "vdpasim-netdev"
+
+static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) |
+ (1ULL << VIRTIO_F_VERSION_1) |
+ (1ULL << VIRTIO_F_IOMMU_PLATFORM);
+
+/* State of each vdpasim device */
+struct vdpasim {
+ struct vdpasim_virtqueue vqs[2];
+ struct work_struct work;
+ /* spinlock to synchronize virtqueue state */
+ spinlock_t lock;
+ struct vdpa_device vdpa;
+ struct virtio_net_config config;
+ struct vhost_iotlb *iommu;
+ void *buffer;
+ u32 status;
+ u32 generation;
+ u64 features;
+};
+
+struct vdpasim *vdpa_sim;
+
+static struct vdpasim *vdpa_to_sim(struct vdpa_device *vdpa)
+{
+ return container_of(vdpa, struct vdpasim, vdpa);
+}
+
+static void vdpasim_queue_ready(struct vdpasim *vdpasim, unsigned int idx)
+{
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+ int ret;
+
+ ret = vringh_init_iotlb(&vq->vring, vdpasim_features,
+ VDPASIM_QUEUE_MAX, false,
+ (struct vring_desc *)(uintptr_t)vq->desc_addr,
+ (struct vring_avail *)
+ (uintptr_t)vq->driver_addr,
+ (struct vring_used *)
+ (uintptr_t)vq->device_addr);
+}
+
+static void vdpasim_vq_reset(struct vdpasim_virtqueue *vq)
+{
+ vq->ready = 0;
+ vq->desc_addr = 0;
+ vq->driver_addr = 0;
+ vq->device_addr = 0;
+ vq->cb = NULL;
+ vq->private = NULL;
+ vringh_init_iotlb(&vq->vring, vdpasim_features, VDPASIM_QUEUE_MAX,
+ false, 0, 0, 0);
+}
+
+static void vdpasim_reset(struct vdpasim *vdpasim)
+{
+ int i;
+
+ for (i = 0; i < VDPASIM_VQ_NUM; i++)
+ vdpasim_vq_reset(&vdpasim->vqs[i]);
+
+ vhost_iotlb_reset(vdpasim->iommu);
+
+ vdpasim->features = 0;
+ vdpasim->status = 0;
+ ++vdpasim->generation;
+}
+
+static void vdpasim_work(struct work_struct *work)
+{
+ struct vdpasim *vdpasim = container_of(work, struct
+ vdpasim, work);
+ struct vdpasim_virtqueue *txq = &vdpasim->vqs[1];
+ struct vdpasim_virtqueue *rxq = &vdpasim->vqs[0];
+ size_t read, write, total_write;
+ int err;
+ int pkts = 0;
+
+ spin_lock(&vdpasim->lock);
+
+ if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
+ goto out;
+
+ if (!txq->ready || !rxq->ready)
+ goto out;
+
+ while (true) {
+ total_write = 0;
+ err = vringh_getdesc_iotlb(&txq->vring, &txq->iov, NULL,
+ &txq->head, GFP_ATOMIC);
+ if (err <= 0)
+ break;
+
+ err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->iov,
+ &rxq->head, GFP_ATOMIC);
+ if (err <= 0) {
+ vringh_complete_iotlb(&txq->vring, txq->head, 0);
+ break;
+ }
+
+ while (true) {
+ read = vringh_iov_pull_iotlb(&txq->vring, &txq->iov,
+ vdpasim->buffer,
+ PAGE_SIZE);
+ if (read <= 0)
+ break;
+
+ write = vringh_iov_push_iotlb(&rxq->vring, &rxq->iov,
+ vdpasim->buffer, read);
+ if (write <= 0)
+ break;
+
+ total_write += write;
+ }
+
+ /* Make sure data is wrote before advancing index */
+ smp_wmb();
+
+ vringh_complete_iotlb(&txq->vring, txq->head, 0);
+ vringh_complete_iotlb(&rxq->vring, rxq->head, total_write);
+
+ /* Make sure used is visible before rasing the interrupt. */
+ smp_wmb();
+
+ local_bh_disable();
+ if (txq->cb)
+ txq->cb(txq->private);
+ if (rxq->cb)
+ rxq->cb(rxq->private);
+ local_bh_enable();
+
+ if (++pkts > 4) {
+ schedule_work(&vdpasim->work);
+ goto out;
+ }
+ }
+
+out:
+ spin_unlock(&vdpasim->lock);
+}
+
+static int dir_to_perm(enum dma_data_direction dir)
+{
+ int perm = -EFAULT;
+
+ switch (dir) {
+ case DMA_FROM_DEVICE:
+ perm = VHOST_MAP_WO;
+ break;
+ case DMA_TO_DEVICE:
+ perm = VHOST_MAP_RO;
+ break;
+ case DMA_BIDIRECTIONAL:
+ perm = VHOST_MAP_RW;
+ break;
+ default:
+ break;
+ }
+
+ return perm;
+}
+
+static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ struct vdpa_device *vdpa = dev_to_vdpa(dev);
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vhost_iotlb *iommu = vdpasim->iommu;
+ u64 pa = (page_to_pfn(page) << PAGE_SHIFT) + offset;
+ int ret, perm = dir_to_perm(dir);
+
+ if (perm < 0)
+ return DMA_MAPPING_ERROR;
+
+ /* For simplicity, use identical mapping to avoid e.g iova
+ * allocator.
+ */
+ ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
+ pa, dir_to_perm(dir));
+ if (ret)
+ return DMA_MAPPING_ERROR;
+
+ return (dma_addr_t)(pa);
+}
+
+static void vdpasim_unmap_page(struct device *dev, dma_addr_t dma_addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ struct vdpa_device *vdpa = dev_to_vdpa(dev);
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vhost_iotlb *iommu = vdpasim->iommu;
+
+ vhost_iotlb_del_range(iommu, (u64)dma_addr,
+ (u64)dma_addr + size - 1);
+}
+
+static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_addr, gfp_t flag,
+ unsigned long attrs)
+{
+ struct vdpa_device *vdpa = dev_to_vdpa(dev);
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vhost_iotlb *iommu = vdpasim->iommu;
+ void *addr = kmalloc(size, flag);
+ int ret;
+
+ if (!addr)
+ *dma_addr = DMA_MAPPING_ERROR;
+ else {
+ u64 pa = virt_to_phys(addr);
+
+ ret = vhost_iotlb_add_range(iommu, (u64)pa,
+ (u64)pa + size - 1,
+ pa, VHOST_MAP_RW);
+ if (ret) {
+ *dma_addr = DMA_MAPPING_ERROR;
+ kfree(addr);
+ addr = NULL;
+ } else
+ *dma_addr = (dma_addr_t)pa;
+ }
+
+ return addr;
+}
+
+static void vdpasim_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_addr,
+ unsigned long attrs)
+{
+ struct vdpa_device *vdpa = dev_to_vdpa(dev);
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vhost_iotlb *iommu = vdpasim->iommu;
+
+ vhost_iotlb_del_range(iommu, (u64)dma_addr,
+ (u64)dma_addr + size - 1);
+ kfree(phys_to_virt((uintptr_t)dma_addr));
+}
+
+static const struct dma_map_ops vdpasim_dma_ops = {
+ .map_page = vdpasim_map_page,
+ .unmap_page = vdpasim_unmap_page,
+ .alloc = vdpasim_alloc_coherent,
+ .free = vdpasim_free_coherent,
+};
+
+static void vdpasim_release_dev(struct device *_d)
+{
+ struct vdpa_device *vdpa = dev_to_vdpa(_d);
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ kfree(vdpasim->buffer);
+ kfree(vdpasim);
+}
+
+static const struct vdpa_config_ops vdpasim_net_config_ops;
+
+static struct vdpasim *vdpasim_create(void)
+{
+ struct vdpasim *vdpasim;
+ struct virtio_net_config *config;
+ struct vdpa_device *vdpa;
+ struct device *dev;
+ int ret = -ENOMEM;
+
+ vdpasim = kzalloc(sizeof(*vdpasim), GFP_KERNEL);
+ if (!vdpasim)
+ goto err_vdpa_alloc;
+
+ vdpasim->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!vdpasim->buffer)
+ goto err_buffer_alloc;
+
+ vdpasim->iommu = vhost_iotlb_alloc(2048, 0);
+ if (!vdpasim->iommu)
+ goto err_iotlb;
+
+ config = &vdpasim->config;
+ config->mtu = 1500;
+ config->status = VIRTIO_NET_S_LINK_UP;
+ eth_random_addr(config->mac);
+
+ INIT_WORK(&vdpasim->work, vdpasim_work);
+ spin_lock_init(&vdpasim->lock);
+
+ vdpa = &vdpasim->vdpa;
+ vdpa->dev.release = vdpasim_release_dev;
+
+ vringh_set_iotlb(&vdpasim->vqs[0].vring, vdpasim->iommu);
+ vringh_set_iotlb(&vdpasim->vqs[1].vring, vdpasim->iommu);
+
+ dev = &vdpa->dev;
+ dev->coherent_dma_mask = DMA_BIT_MASK(64);
+ set_dma_ops(dev, &vdpasim_dma_ops);
+
+ ret = vdpa_init_device(vdpa, &vdpasim_dev->dev, dev,
+ &vdpasim_net_config_ops);
+ if (ret)
+ goto err_init;
+
+ ret = vdpa_register_device(vdpa);
+ if (ret)
+ goto err_register;
+
+ return vdpasim;
+
+err_register:
+ put_device(&vdpa->dev);
+err_init:
+ vhost_iotlb_free(vdpasim->iommu);
+err_iotlb:
+ kfree(vdpasim->buffer);
+err_buffer_alloc:
+ kfree(vdpasim);
+err_vdpa_alloc:
+ return ERR_PTR(ret);
+}
+
+static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
+ u64 desc_area, u64 driver_area,
+ u64 device_area)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+
+ vq->desc_addr = desc_area;
+ vq->driver_addr = driver_area;
+ vq->device_addr = device_area;
+
+ return 0;
+}
+
+static void vdpasim_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+
+ vq->num = num;
+}
+
+static void vdpasim_kick_vq(struct vdpa_device *vdpa, u16 idx)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+
+ if (vq->ready)
+ schedule_work(&vdpasim->work);
+}
+
+static void vdpasim_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
+ struct vdpa_callback *cb)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+
+ vq->cb = cb->callback;
+ vq->private = cb->private;
+}
+
+static void vdpasim_set_vq_ready(struct vdpa_device *vdpa, u16 idx, bool ready)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+
+ spin_lock(&vdpasim->lock);
+ vq->ready = ready;
+ if (vq->ready)
+ vdpasim_queue_ready(vdpasim, idx);
+ spin_unlock(&vdpasim->lock);
+}
+
+static bool vdpasim_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+
+ return vq->ready;
+}
+
+static int vdpasim_set_vq_state(struct vdpa_device *vdpa, u16 idx, u64 state)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+ struct vringh *vrh = &vq->vring;
+
+ spin_lock(&vdpasim->lock);
+ vrh->last_avail_idx = state;
+ spin_unlock(&vdpasim->lock);
+
+ return 0;
+}
+
+static u64 vdpasim_get_vq_state(struct vdpa_device *vdpa, u16 idx)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
+ struct vringh *vrh = &vq->vring;
+
+ return vrh->last_avail_idx;
+}
+
+static u16 vdpasim_get_vq_align(struct vdpa_device *vdpa)
+{
+ return VDPASIM_QUEUE_ALIGN;
+}
+
+static u64 vdpasim_get_features(struct vdpa_device *vdpa)
+{
+ return vdpasim_features;
+}
+
+static int vdpasim_set_features(struct vdpa_device *vdpa, u64 features)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ /* DMA mapping must be done by driver */
+ if (!(features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
+ return -EINVAL;
+
+ vdpasim->features = features & vdpasim_features;
+
+ return 0;
+}
+
+static void vdpasim_set_config_cb(struct vdpa_device *vdpa,
+ struct vdpa_callback *cb)
+{
+ /* We don't support config interrupt */
+}
+
+static u16 vdpasim_get_vq_num_max(struct vdpa_device *vdpa)
+{
+ return VDPASIM_QUEUE_MAX;
+}
+
+static u32 vdpasim_get_device_id(struct vdpa_device *vdpa)
+{
+ return VDPASIM_DEVICE_ID;
+}
+
+static u32 vdpasim_get_vendor_id(struct vdpa_device *vdpa)
+{
+ return VDPASIM_VENDOR_ID;
+}
+
+static u8 vdpasim_get_status(struct vdpa_device *vdpa)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ u8 status;
+
+ spin_lock(&vdpasim->lock);
+ status = vdpasim->status;
+ spin_unlock(&vdpasim->lock);
+
+ return vdpasim->status;
+}
+
+static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ spin_lock(&vdpasim->lock);
+ vdpasim->status = status;
+ if (status == 0)
+ vdpasim_reset(vdpasim);
+ spin_unlock(&vdpasim->lock);
+}
+
+static void vdpasim_get_config(struct vdpa_device *vdpa, unsigned int offset,
+ void *buf, unsigned int len)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ if (offset + len < sizeof(struct virtio_net_config))
+ memcpy(buf, &vdpasim->config + offset, len);
+}
+
+static void vdpasim_set_config(struct vdpa_device *vdpa, unsigned int offset,
+ const void *buf, unsigned int len)
+{
+ /* No writable config supportted by vdpasim */
+}
+
+static u32 vdpasim_get_generation(struct vdpa_device *vdpa)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ return vdpasim->generation;
+}
+
+static int vdpasim_set_map(struct vdpa_device *vdpa,
+ struct vhost_iotlb *iotlb)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+ struct vhost_iotlb_map *map;
+ u64 start = 0ULL, last = 0ULL - 1;
+ int ret;
+
+ vhost_iotlb_reset(vdpasim->iommu);
+
+ for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
+ map = vhost_iotlb_itree_next(map, start, last)) {
+ ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
+ map->last, map->addr, map->perm);
+ if (ret)
+ goto err;
+ }
+ return 0;
+
+err:
+ vhost_iotlb_reset(vdpasim->iommu);
+ return ret;
+}
+
+static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
+ u64 pa, u32 perm)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ return vhost_iotlb_add_range(vdpasim->iommu, iova,
+ iova + size - 1, pa, perm);
+}
+
+static int vdpasim_dma_unmap(struct vdpa_device *vdpa, u64 iova, u64 size)
+{
+ struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
+
+ vhost_iotlb_del_range(vdpasim->iommu, iova, iova + size - 1);
+
+ return 0;
+}
+
+static const struct vdpa_config_ops vdpasim_net_config_ops = {
+ .set_vq_address = vdpasim_set_vq_address,
+ .set_vq_num = vdpasim_set_vq_num,
+ .kick_vq = vdpasim_kick_vq,
+ .set_vq_cb = vdpasim_set_vq_cb,
+ .set_vq_ready = vdpasim_set_vq_ready,
+ .get_vq_ready = vdpasim_get_vq_ready,
+ .set_vq_state = vdpasim_set_vq_state,
+ .get_vq_state = vdpasim_get_vq_state,
+ .get_vq_align = vdpasim_get_vq_align,
+ .get_features = vdpasim_get_features,
+ .set_features = vdpasim_set_features,
+ .set_config_cb = vdpasim_set_config_cb,
+ .get_vq_num_max = vdpasim_get_vq_num_max,
+ .get_device_id = vdpasim_get_device_id,
+ .get_vendor_id = vdpasim_get_vendor_id,
+ .get_status = vdpasim_get_status,
+ .set_status = vdpasim_set_status,
+ .get_config = vdpasim_get_config,
+ .set_config = vdpasim_set_config,
+ .get_generation = vdpasim_get_generation,
+ .set_map = vdpasim_set_map,
+ .dma_map = vdpasim_dma_map,
+ .dma_unmap = vdpasim_dma_unmap,
+};
+
+static void vdpasim_device_release(struct device *dev)
+{
+ struct vdpasim_dev *vdpasim_dev =
+ container_of(dev, struct vdpasim_dev, dev);
+
+ vdpasim_dev->dev.bus = NULL;
+ kfree(vdpasim_dev);
+}
+
+static int __init vdpasim_dev_init(void)
+{
+ struct device *dev;
+ int ret = 0;
+
+ vdpasim_dev = kzalloc(sizeof(*vdpasim_dev), GFP_KERNEL);
+ if (!vdpasim_dev)
+ return -ENOMEM;
+
+ dev = &vdpasim_dev->dev;
+ dev->release = vdpasim_device_release;
+ dev_set_name(dev, "%s", VDPASIM_NAME);
+
+ ret = device_register(&vdpasim_dev->dev);
+ if (ret)
+ goto err_register;
+
+ if (!vdpasim_create())
+ goto err_register;
+
+ return 0;
+
+err_register:
+ kfree(vdpasim_dev);
+ vdpasim_dev = NULL;
+ return ret;
+}
+
+static int vdpasim_device_remove_cb(struct device *dev, void *data)
+{
+ struct vdpa_device *vdpa = dev_to_vdpa(dev);
+
+ vdpa_unregister_device(vdpa);
+
+ return 0;
+}
+
+static void __exit vdpasim_dev_exit(void)
+{
+ device_for_each_child(&vdpasim_dev->dev, NULL,
+ vdpasim_device_remove_cb);
+ device_unregister(&vdpasim_dev->dev);
+}
+
+module_init(vdpasim_dev_init)
+module_exit(vdpasim_dev_exit)
+
+MODULE_VERSION(DRV_VERSION);
+MODULE_LICENSE(DRV_LICENSE);
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
--
2.19.1

2020-02-10 11:24:10

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V2 5/5] vdpasim: vDPA device simulator

On Mon, Feb 10, 2020 at 11:56:08AM +0800, Jason Wang wrote:
> This patch implements a software vDPA networking device. The datapath
> is implemented through vringh and workqueue. The device has an on-chip
> IOMMU which translates IOVA to PA. For kernel virtio drivers, vDPA
> simulator driver provides dma_ops. For vhost driers, set_map() methods
> of vdpa_config_ops is implemented to accept mappings from vhost.
>
> Currently, vDPA device simulator will loopback TX traffic to RX. So
> the main use case for the device is vDPA feature testing, prototyping
> and development.
>
> Note, there's no management API implemented, a vDPA device will be
> registered once the module is probed. We need to handle this in the
> future development.
>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/virtio/vdpa/Kconfig | 17 +
> drivers/virtio/vdpa/Makefile | 1 +
> drivers/virtio/vdpa/vdpa_sim.c | 678 +++++++++++++++++++++++++++++++++
> 3 files changed, 696 insertions(+)
> create mode 100644 drivers/virtio/vdpa/vdpa_sim.c
>
> diff --git a/drivers/virtio/vdpa/Kconfig b/drivers/virtio/vdpa/Kconfig
> index 7a99170e6c30..a7888974dda8 100644
> --- a/drivers/virtio/vdpa/Kconfig
> +++ b/drivers/virtio/vdpa/Kconfig
> @@ -7,3 +7,20 @@ config VDPA
> datapath which complies with virtio specifications with
> vendor specific control path.
>
> +menuconfig VDPA_MENU
> + bool "VDPA drivers"
> + default n
> +
> +if VDPA_MENU
> +
> +config VDPA_SIM
> + tristate "vDPA device simulator"
> + select VDPA
> + default n
> + help
> + vDPA networking device simulator which loop TX traffic back
> + to RX. This device is used for testing, prototyping and
> + development of vDPA.

So how about we make this depend on RUNTIME_TESTING_MENU?


> +
> +endif # VDPA_MENU
> +
> diff --git a/drivers/virtio/vdpa/Makefile b/drivers/virtio/vdpa/Makefile
> index ee6a35e8a4fb..5ec0e6ae3c57 100644
> --- a/drivers/virtio/vdpa/Makefile
> +++ b/drivers/virtio/vdpa/Makefile
> @@ -1,2 +1,3 @@
> # SPDX-License-Identifier: GPL-2.0
> obj-$(CONFIG_VDPA) += vdpa.o
> +obj-$(CONFIG_VDPA_SIM) += vdpa_sim.o
> diff --git a/drivers/virtio/vdpa/vdpa_sim.c b/drivers/virtio/vdpa/vdpa_sim.c
> new file mode 100644
> index 000000000000..89783a2b9773
> --- /dev/null
> +++ b/drivers/virtio/vdpa/vdpa_sim.c
> @@ -0,0 +1,678 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDPA networking device simulator.
> + *
> + * Copyright (c) 2020, Red Hat Inc. All rights reserved.
> + * Author: Jason Wang <[email protected]>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/uuid.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/file.h>
> +#include <linux/etherdevice.h>
> +#include <linux/vringh.h>
> +#include <linux/vdpa.h>
> +#include <linux/vhost_iotlb.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <uapi/linux/virtio_net.h>
> +
> +#define DRV_VERSION "0.1"
> +#define DRV_AUTHOR "Jason Wang <[email protected]>"
> +#define DRV_DESC "vDPA Device Simulator"
> +#define DRV_LICENSE "GPL v2"
> +
> +struct vdpasim_dev {
> + struct device dev;
> +};
> +
> +struct vdpasim_dev *vdpasim_dev;
> +
> +struct vdpasim_virtqueue {
> + struct vringh vring;
> + struct vringh_kiov iov;
> + unsigned short head;
> + bool ready;
> + u64 desc_addr;
> + u64 device_addr;
> + u64 driver_addr;
> + u32 num;
> + void *private;
> + irqreturn_t (*cb)(void *data);
> +};
> +
> +#define VDPASIM_QUEUE_ALIGN PAGE_SIZE
> +#define VDPASIM_QUEUE_MAX 256
> +#define VDPASIM_DEVICE_ID 0x1
> +#define VDPASIM_VENDOR_ID 0
> +#define VDPASIM_VQ_NUM 0x2
> +#define VDPASIM_NAME "vdpasim-netdev"
> +
> +static u64 vdpasim_features = (1ULL << VIRTIO_F_ANY_LAYOUT) |
> + (1ULL << VIRTIO_F_VERSION_1) |
> + (1ULL << VIRTIO_F_IOMMU_PLATFORM);
> +
> +/* State of each vdpasim device */
> +struct vdpasim {
> + struct vdpasim_virtqueue vqs[2];
> + struct work_struct work;
> + /* spinlock to synchronize virtqueue state */
> + spinlock_t lock;
> + struct vdpa_device vdpa;
> + struct virtio_net_config config;
> + struct vhost_iotlb *iommu;
> + void *buffer;
> + u32 status;
> + u32 generation;
> + u64 features;
> +};
> +
> +struct vdpasim *vdpa_sim;
> +
> +static struct vdpasim *vdpa_to_sim(struct vdpa_device *vdpa)
> +{
> + return container_of(vdpa, struct vdpasim, vdpa);
> +}
> +
> +static void vdpasim_queue_ready(struct vdpasim *vdpasim, unsigned int idx)
> +{
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> + int ret;
> +
> + ret = vringh_init_iotlb(&vq->vring, vdpasim_features,
> + VDPASIM_QUEUE_MAX, false,
> + (struct vring_desc *)(uintptr_t)vq->desc_addr,
> + (struct vring_avail *)
> + (uintptr_t)vq->driver_addr,
> + (struct vring_used *)
> + (uintptr_t)vq->device_addr);
> +}
> +
> +static void vdpasim_vq_reset(struct vdpasim_virtqueue *vq)
> +{
> + vq->ready = 0;
> + vq->desc_addr = 0;
> + vq->driver_addr = 0;
> + vq->device_addr = 0;
> + vq->cb = NULL;
> + vq->private = NULL;
> + vringh_init_iotlb(&vq->vring, vdpasim_features, VDPASIM_QUEUE_MAX,
> + false, 0, 0, 0);
> +}
> +
> +static void vdpasim_reset(struct vdpasim *vdpasim)
> +{
> + int i;
> +
> + for (i = 0; i < VDPASIM_VQ_NUM; i++)
> + vdpasim_vq_reset(&vdpasim->vqs[i]);
> +
> + vhost_iotlb_reset(vdpasim->iommu);
> +
> + vdpasim->features = 0;
> + vdpasim->status = 0;
> + ++vdpasim->generation;
> +}
> +
> +static void vdpasim_work(struct work_struct *work)
> +{
> + struct vdpasim *vdpasim = container_of(work, struct
> + vdpasim, work);
> + struct vdpasim_virtqueue *txq = &vdpasim->vqs[1];
> + struct vdpasim_virtqueue *rxq = &vdpasim->vqs[0];
> + size_t read, write, total_write;
> + int err;
> + int pkts = 0;
> +
> + spin_lock(&vdpasim->lock);
> +
> + if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
> + goto out;
> +
> + if (!txq->ready || !rxq->ready)
> + goto out;
> +
> + while (true) {
> + total_write = 0;
> + err = vringh_getdesc_iotlb(&txq->vring, &txq->iov, NULL,
> + &txq->head, GFP_ATOMIC);
> + if (err <= 0)
> + break;
> +
> + err = vringh_getdesc_iotlb(&rxq->vring, NULL, &rxq->iov,
> + &rxq->head, GFP_ATOMIC);
> + if (err <= 0) {
> + vringh_complete_iotlb(&txq->vring, txq->head, 0);
> + break;
> + }
> +
> + while (true) {
> + read = vringh_iov_pull_iotlb(&txq->vring, &txq->iov,
> + vdpasim->buffer,
> + PAGE_SIZE);
> + if (read <= 0)
> + break;
> +
> + write = vringh_iov_push_iotlb(&rxq->vring, &rxq->iov,
> + vdpasim->buffer, read);
> + if (write <= 0)
> + break;
> +
> + total_write += write;
> + }
> +
> + /* Make sure data is wrote before advancing index */
> + smp_wmb();
> +
> + vringh_complete_iotlb(&txq->vring, txq->head, 0);
> + vringh_complete_iotlb(&rxq->vring, rxq->head, total_write);
> +
> + /* Make sure used is visible before rasing the interrupt. */
> + smp_wmb();
> +
> + local_bh_disable();
> + if (txq->cb)
> + txq->cb(txq->private);
> + if (rxq->cb)
> + rxq->cb(rxq->private);
> + local_bh_enable();
> +
> + if (++pkts > 4) {
> + schedule_work(&vdpasim->work);
> + goto out;
> + }
> + }
> +
> +out:
> + spin_unlock(&vdpasim->lock);
> +}
> +
> +static int dir_to_perm(enum dma_data_direction dir)
> +{
> + int perm = -EFAULT;
> +
> + switch (dir) {
> + case DMA_FROM_DEVICE:
> + perm = VHOST_MAP_WO;
> + break;
> + case DMA_TO_DEVICE:
> + perm = VHOST_MAP_RO;
> + break;
> + case DMA_BIDIRECTIONAL:
> + perm = VHOST_MAP_RW;
> + break;
> + default:
> + break;
> + }
> +
> + return perm;
> +}
> +
> +static dma_addr_t vdpasim_map_page(struct device *dev, struct page *page,
> + unsigned long offset, size_t size,
> + enum dma_data_direction dir,
> + unsigned long attrs)
> +{
> + struct vdpa_device *vdpa = dev_to_vdpa(dev);
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vhost_iotlb *iommu = vdpasim->iommu;
> + u64 pa = (page_to_pfn(page) << PAGE_SHIFT) + offset;
> + int ret, perm = dir_to_perm(dir);
> +
> + if (perm < 0)
> + return DMA_MAPPING_ERROR;
> +
> + /* For simplicity, use identical mapping to avoid e.g iova
> + * allocator.
> + */
> + ret = vhost_iotlb_add_range(iommu, pa, pa + size - 1,
> + pa, dir_to_perm(dir));
> + if (ret)
> + return DMA_MAPPING_ERROR;
> +
> + return (dma_addr_t)(pa);
> +}
> +
> +static void vdpasim_unmap_page(struct device *dev, dma_addr_t dma_addr,
> + size_t size, enum dma_data_direction dir,
> + unsigned long attrs)
> +{
> + struct vdpa_device *vdpa = dev_to_vdpa(dev);
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vhost_iotlb *iommu = vdpasim->iommu;
> +
> + vhost_iotlb_del_range(iommu, (u64)dma_addr,
> + (u64)dma_addr + size - 1);
> +}
> +
> +static void *vdpasim_alloc_coherent(struct device *dev, size_t size,
> + dma_addr_t *dma_addr, gfp_t flag,
> + unsigned long attrs)
> +{
> + struct vdpa_device *vdpa = dev_to_vdpa(dev);
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vhost_iotlb *iommu = vdpasim->iommu;
> + void *addr = kmalloc(size, flag);
> + int ret;
> +
> + if (!addr)
> + *dma_addr = DMA_MAPPING_ERROR;
> + else {
> + u64 pa = virt_to_phys(addr);
> +
> + ret = vhost_iotlb_add_range(iommu, (u64)pa,
> + (u64)pa + size - 1,
> + pa, VHOST_MAP_RW);
> + if (ret) {
> + *dma_addr = DMA_MAPPING_ERROR;
> + kfree(addr);
> + addr = NULL;
> + } else
> + *dma_addr = (dma_addr_t)pa;
> + }
> +
> + return addr;
> +}
> +
> +static void vdpasim_free_coherent(struct device *dev, size_t size,
> + void *vaddr, dma_addr_t dma_addr,
> + unsigned long attrs)
> +{
> + struct vdpa_device *vdpa = dev_to_vdpa(dev);
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vhost_iotlb *iommu = vdpasim->iommu;
> +
> + vhost_iotlb_del_range(iommu, (u64)dma_addr,
> + (u64)dma_addr + size - 1);
> + kfree(phys_to_virt((uintptr_t)dma_addr));
> +}
> +
> +static const struct dma_map_ops vdpasim_dma_ops = {
> + .map_page = vdpasim_map_page,
> + .unmap_page = vdpasim_unmap_page,
> + .alloc = vdpasim_alloc_coherent,
> + .free = vdpasim_free_coherent,
> +};
> +
> +static void vdpasim_release_dev(struct device *_d)
> +{
> + struct vdpa_device *vdpa = dev_to_vdpa(_d);
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + kfree(vdpasim->buffer);
> + kfree(vdpasim);
> +}
> +
> +static const struct vdpa_config_ops vdpasim_net_config_ops;
> +
> +static struct vdpasim *vdpasim_create(void)
> +{
> + struct vdpasim *vdpasim;
> + struct virtio_net_config *config;
> + struct vdpa_device *vdpa;
> + struct device *dev;
> + int ret = -ENOMEM;
> +
> + vdpasim = kzalloc(sizeof(*vdpasim), GFP_KERNEL);
> + if (!vdpasim)
> + goto err_vdpa_alloc;
> +
> + vdpasim->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!vdpasim->buffer)
> + goto err_buffer_alloc;
> +
> + vdpasim->iommu = vhost_iotlb_alloc(2048, 0);
> + if (!vdpasim->iommu)
> + goto err_iotlb;
> +
> + config = &vdpasim->config;
> + config->mtu = 1500;
> + config->status = VIRTIO_NET_S_LINK_UP;
> + eth_random_addr(config->mac);
> +
> + INIT_WORK(&vdpasim->work, vdpasim_work);
> + spin_lock_init(&vdpasim->lock);
> +
> + vdpa = &vdpasim->vdpa;
> + vdpa->dev.release = vdpasim_release_dev;
> +
> + vringh_set_iotlb(&vdpasim->vqs[0].vring, vdpasim->iommu);
> + vringh_set_iotlb(&vdpasim->vqs[1].vring, vdpasim->iommu);
> +
> + dev = &vdpa->dev;
> + dev->coherent_dma_mask = DMA_BIT_MASK(64);
> + set_dma_ops(dev, &vdpasim_dma_ops);
> +
> + ret = vdpa_init_device(vdpa, &vdpasim_dev->dev, dev,
> + &vdpasim_net_config_ops);
> + if (ret)
> + goto err_init;
> +
> + ret = vdpa_register_device(vdpa);
> + if (ret)
> + goto err_register;
> +
> + return vdpasim;
> +
> +err_register:
> + put_device(&vdpa->dev);
> +err_init:
> + vhost_iotlb_free(vdpasim->iommu);
> +err_iotlb:
> + kfree(vdpasim->buffer);
> +err_buffer_alloc:
> + kfree(vdpasim);
> +err_vdpa_alloc:
> + return ERR_PTR(ret);
> +}
> +
> +static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> + u64 desc_area, u64 driver_area,
> + u64 device_area)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> +
> + vq->desc_addr = desc_area;
> + vq->driver_addr = driver_area;
> + vq->device_addr = device_area;
> +
> + return 0;
> +}
> +
> +static void vdpasim_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> +
> + vq->num = num;
> +}
> +
> +static void vdpasim_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> +
> + if (vq->ready)
> + schedule_work(&vdpasim->work);
> +}
> +
> +static void vdpasim_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> + struct vdpa_callback *cb)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> +
> + vq->cb = cb->callback;
> + vq->private = cb->private;
> +}
> +
> +static void vdpasim_set_vq_ready(struct vdpa_device *vdpa, u16 idx, bool ready)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> +
> + spin_lock(&vdpasim->lock);
> + vq->ready = ready;
> + if (vq->ready)
> + vdpasim_queue_ready(vdpasim, idx);
> + spin_unlock(&vdpasim->lock);
> +}
> +
> +static bool vdpasim_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> +
> + return vq->ready;
> +}
> +
> +static int vdpasim_set_vq_state(struct vdpa_device *vdpa, u16 idx, u64 state)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> + struct vringh *vrh = &vq->vring;
> +
> + spin_lock(&vdpasim->lock);
> + vrh->last_avail_idx = state;
> + spin_unlock(&vdpasim->lock);
> +
> + return 0;
> +}
> +
> +static u64 vdpasim_get_vq_state(struct vdpa_device *vdpa, u16 idx)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> + struct vringh *vrh = &vq->vring;
> +
> + return vrh->last_avail_idx;
> +}
> +
> +static u16 vdpasim_get_vq_align(struct vdpa_device *vdpa)
> +{
> + return VDPASIM_QUEUE_ALIGN;
> +}
> +
> +static u64 vdpasim_get_features(struct vdpa_device *vdpa)
> +{
> + return vdpasim_features;
> +}
> +
> +static int vdpasim_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + /* DMA mapping must be done by driver */
> + if (!(features & (1ULL << VIRTIO_F_IOMMU_PLATFORM)))
> + return -EINVAL;
> +
> + vdpasim->features = features & vdpasim_features;
> +
> + return 0;
> +}
> +
> +static void vdpasim_set_config_cb(struct vdpa_device *vdpa,
> + struct vdpa_callback *cb)
> +{
> + /* We don't support config interrupt */
> +}
> +
> +static u16 vdpasim_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> + return VDPASIM_QUEUE_MAX;
> +}
> +
> +static u32 vdpasim_get_device_id(struct vdpa_device *vdpa)
> +{
> + return VDPASIM_DEVICE_ID;
> +}
> +
> +static u32 vdpasim_get_vendor_id(struct vdpa_device *vdpa)
> +{
> + return VDPASIM_VENDOR_ID;
> +}
> +
> +static u8 vdpasim_get_status(struct vdpa_device *vdpa)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + u8 status;
> +
> + spin_lock(&vdpasim->lock);
> + status = vdpasim->status;
> + spin_unlock(&vdpasim->lock);
> +
> + return vdpasim->status;
> +}
> +
> +static void vdpasim_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + spin_lock(&vdpasim->lock);
> + vdpasim->status = status;
> + if (status == 0)
> + vdpasim_reset(vdpasim);
> + spin_unlock(&vdpasim->lock);
> +}
> +
> +static void vdpasim_get_config(struct vdpa_device *vdpa, unsigned int offset,
> + void *buf, unsigned int len)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + if (offset + len < sizeof(struct virtio_net_config))
> + memcpy(buf, &vdpasim->config + offset, len);
> +}
> +
> +static void vdpasim_set_config(struct vdpa_device *vdpa, unsigned int offset,
> + const void *buf, unsigned int len)
> +{
> + /* No writable config supportted by vdpasim */
> +}
> +
> +static u32 vdpasim_get_generation(struct vdpa_device *vdpa)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + return vdpasim->generation;
> +}
> +
> +static int vdpasim_set_map(struct vdpa_device *vdpa,
> + struct vhost_iotlb *iotlb)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> + struct vhost_iotlb_map *map;
> + u64 start = 0ULL, last = 0ULL - 1;
> + int ret;
> +
> + vhost_iotlb_reset(vdpasim->iommu);
> +
> + for (map = vhost_iotlb_itree_first(iotlb, start, last); map;
> + map = vhost_iotlb_itree_next(map, start, last)) {
> + ret = vhost_iotlb_add_range(vdpasim->iommu, map->start,
> + map->last, map->addr, map->perm);
> + if (ret)
> + goto err;
> + }
> + return 0;
> +
> +err:
> + vhost_iotlb_reset(vdpasim->iommu);
> + return ret;
> +}
> +
> +static int vdpasim_dma_map(struct vdpa_device *vdpa, u64 iova, u64 size,
> + u64 pa, u32 perm)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + return vhost_iotlb_add_range(vdpasim->iommu, iova,
> + iova + size - 1, pa, perm);
> +}
> +
> +static int vdpasim_dma_unmap(struct vdpa_device *vdpa, u64 iova, u64 size)
> +{
> + struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> + vhost_iotlb_del_range(vdpasim->iommu, iova, iova + size - 1);
> +
> + return 0;
> +}
> +
> +static const struct vdpa_config_ops vdpasim_net_config_ops = {
> + .set_vq_address = vdpasim_set_vq_address,
> + .set_vq_num = vdpasim_set_vq_num,
> + .kick_vq = vdpasim_kick_vq,
> + .set_vq_cb = vdpasim_set_vq_cb,
> + .set_vq_ready = vdpasim_set_vq_ready,
> + .get_vq_ready = vdpasim_get_vq_ready,
> + .set_vq_state = vdpasim_set_vq_state,
> + .get_vq_state = vdpasim_get_vq_state,
> + .get_vq_align = vdpasim_get_vq_align,
> + .get_features = vdpasim_get_features,
> + .set_features = vdpasim_set_features,
> + .set_config_cb = vdpasim_set_config_cb,
> + .get_vq_num_max = vdpasim_get_vq_num_max,
> + .get_device_id = vdpasim_get_device_id,
> + .get_vendor_id = vdpasim_get_vendor_id,
> + .get_status = vdpasim_get_status,
> + .set_status = vdpasim_set_status,
> + .get_config = vdpasim_get_config,
> + .set_config = vdpasim_set_config,
> + .get_generation = vdpasim_get_generation,
> + .set_map = vdpasim_set_map,
> + .dma_map = vdpasim_dma_map,
> + .dma_unmap = vdpasim_dma_unmap,
> +};
> +
> +static void vdpasim_device_release(struct device *dev)
> +{
> + struct vdpasim_dev *vdpasim_dev =
> + container_of(dev, struct vdpasim_dev, dev);
> +
> + vdpasim_dev->dev.bus = NULL;
> + kfree(vdpasim_dev);
> +}
> +
> +static int __init vdpasim_dev_init(void)
> +{
> + struct device *dev;
> + int ret = 0;
> +
> + vdpasim_dev = kzalloc(sizeof(*vdpasim_dev), GFP_KERNEL);
> + if (!vdpasim_dev)
> + return -ENOMEM;
> +
> + dev = &vdpasim_dev->dev;
> + dev->release = vdpasim_device_release;
> + dev_set_name(dev, "%s", VDPASIM_NAME);
> +
> + ret = device_register(&vdpasim_dev->dev);
> + if (ret)
> + goto err_register;
> +
> + if (!vdpasim_create())
> + goto err_register;
> +
> + return 0;
> +
> +err_register:
> + kfree(vdpasim_dev);
> + vdpasim_dev = NULL;
> + return ret;
> +}
> +
> +static int vdpasim_device_remove_cb(struct device *dev, void *data)
> +{
> + struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> + vdpa_unregister_device(vdpa);
> +
> + return 0;
> +}
> +
> +static void __exit vdpasim_dev_exit(void)
> +{
> + device_for_each_child(&vdpasim_dev->dev, NULL,
> + vdpasim_device_remove_cb);
> + device_unregister(&vdpasim_dev->dev);
> +}
> +
> +module_init(vdpasim_dev_init)
> +module_exit(vdpasim_dev_exit)
> +
> +MODULE_VERSION(DRV_VERSION);
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> --
> 2.19.1

2020-02-10 13:36:49

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 4/5] virtio: introduce a vDPA based transport

On Mon, Feb 10, 2020 at 11:56:07AM +0800, Jason Wang wrote:
> This patch introduces a vDPA transport for virtio. This is used to
> use kernel virtio driver to drive the mediated device that is capable
> of populating virtqueue directly.

Is this comment still right? Is there a mediated device still?

Jason

2020-02-11 03:19:30

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 4/5] virtio: introduce a vDPA based transport


On 2020/2/10 下午9:34, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2020 at 11:56:07AM +0800, Jason Wang wrote:
>> This patch introduces a vDPA transport for virtio. This is used to
>> use kernel virtio driver to drive the mediated device that is capable
>> of populating virtqueue directly.
> Is this comment still right? Is there a mediated device still?
>
> Jason


No, will fix.

Thanks


>

2020-02-11 03:22:22

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 5/5] vdpasim: vDPA device simulator


On 2020/2/10 下午7:23, Michael S. Tsirkin wrote:
> On Mon, Feb 10, 2020 at 11:56:08AM +0800, Jason Wang wrote:
>> This patch implements a software vDPA networking device. The datapath
>> is implemented through vringh and workqueue. The device has an on-chip
>> IOMMU which translates IOVA to PA. For kernel virtio drivers, vDPA
>> simulator driver provides dma_ops. For vhost driers, set_map() methods
>> of vdpa_config_ops is implemented to accept mappings from vhost.
>>
>> Currently, vDPA device simulator will loopback TX traffic to RX. So
>> the main use case for the device is vDPA feature testing, prototyping
>> and development.
>>
>> Note, there's no management API implemented, a vDPA device will be
>> registered once the module is probed. We need to handle this in the
>> future development.
>>
>> Signed-off-by: Jason Wang<[email protected]>
>> ---
>> drivers/virtio/vdpa/Kconfig | 17 +
>> drivers/virtio/vdpa/Makefile | 1 +
>> drivers/virtio/vdpa/vdpa_sim.c | 678 +++++++++++++++++++++++++++++++++
>> 3 files changed, 696 insertions(+)
>> create mode 100644 drivers/virtio/vdpa/vdpa_sim.c
>>
>> diff --git a/drivers/virtio/vdpa/Kconfig b/drivers/virtio/vdpa/Kconfig
>> index 7a99170e6c30..a7888974dda8 100644
>> --- a/drivers/virtio/vdpa/Kconfig
>> +++ b/drivers/virtio/vdpa/Kconfig
>> @@ -7,3 +7,20 @@ config VDPA
>> datapath which complies with virtio specifications with
>> vendor specific control path.
>>
>> +menuconfig VDPA_MENU
>> + bool "VDPA drivers"
>> + default n
>> +
>> +if VDPA_MENU
>> +
>> +config VDPA_SIM
>> + tristate "vDPA device simulator"
>> + select VDPA
>> + default n
>> + help
>> + vDPA networking device simulator which loop TX traffic back
>> + to RX. This device is used for testing, prototyping and
>> + development of vDPA.
> So how about we make this depend on RUNTIME_TESTING_MENU?
>
>

I'm not sure how much it can help but I can do that in next version.

Thanks

RUNTIME_TESTING_MENU

2020-02-11 13:57:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Mon, Feb 10, 2020 at 11:56:06AM +0800, Jason Wang wrote:
> +/**
> + * vdpa_register_device - register a vDPA device
> + * Callers must have a succeed call of vdpa_init_device() before.
> + * @vdev: the vdpa device to be registered to vDPA bus
> + *
> + * Returns an error when fail to add to vDPA bus
> + */
> +int vdpa_register_device(struct vdpa_device *vdev)
> +{
> + int err = device_add(&vdev->dev);
> +
> + if (err) {
> + put_device(&vdev->dev);
> + ida_simple_remove(&vdpa_index_ida, vdev->index);
> + }

This is a very dangerous construction, I've seen it lead to driver
bugs. Better to require the driver to always do the put_device on
error unwind

The ida_simple_remove should probably be part of the class release
function to make everything work right

> +/**
> + * vdpa_unregister_driver - unregister a vDPA device driver
> + * @drv: the vdpa device driver to be unregistered
> + */
> +void vdpa_unregister_driver(struct vdpa_driver *drv)
> +{
> + driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL_GPL(vdpa_unregister_driver);
> +
> +static int vdpa_init(void)
> +{
> + if (bus_register(&vdpa_bus) != 0)
> + panic("virtio bus registration failed");
> + return 0;
> +}

Linus will tell you not to kill the kernel - return the error code and
propagate it up to the module init function.

> +/**
> + * vDPA device - representation of a vDPA device
> + * @dev: underlying device
> + * @dma_dev: the actual device that is performing DMA
> + * @config: the configuration ops for this device.
> + * @index: device index
> + */
> +struct vdpa_device {
> + struct device dev;
> + struct device *dma_dev;
> + const struct vdpa_config_ops *config;
> + int index;

unsigned values shuld be unsigned int

Jason

2020-02-11 13:58:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 5/5] vdpasim: vDPA device simulator

On Mon, Feb 10, 2020 at 11:56:08AM +0800, Jason Wang wrote:
> +
> +static struct vdpasim *vdpasim_create(void)
> +{
> + struct vdpasim *vdpasim;
> + struct virtio_net_config *config;
> + struct vdpa_device *vdpa;
> + struct device *dev;
> + int ret = -ENOMEM;
> +
> + vdpasim = kzalloc(sizeof(*vdpasim), GFP_KERNEL);
> + if (!vdpasim)
> + goto err_vdpa_alloc;
> +
> + vdpasim->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
> + if (!vdpasim->buffer)
> + goto err_buffer_alloc;
> +
> + vdpasim->iommu = vhost_iotlb_alloc(2048, 0);
> + if (!vdpasim->iommu)
> + goto err_iotlb;
> +
> + config = &vdpasim->config;
> + config->mtu = 1500;
> + config->status = VIRTIO_NET_S_LINK_UP;
> + eth_random_addr(config->mac);
> +
> + INIT_WORK(&vdpasim->work, vdpasim_work);
> + spin_lock_init(&vdpasim->lock);
> +
> + vdpa = &vdpasim->vdpa;
> + vdpa->dev.release = vdpasim_release_dev;

The driver should not provide the release function.

Again the safest model is 'vdpa_alloc_device' which combines the
kzalloc and the vdpa_init_device() and returns something that is
error unwound with put_device()

The subsystem owns the release and does the kfree and other cleanup
like releasing the IDA.

> + vringh_set_iotlb(&vdpasim->vqs[0].vring, vdpasim->iommu);
> + vringh_set_iotlb(&vdpasim->vqs[1].vring, vdpasim->iommu);
> +
> + dev = &vdpa->dev;
> + dev->coherent_dma_mask = DMA_BIT_MASK(64);
> + set_dma_ops(dev, &vdpasim_dma_ops);
> +
> + ret = vdpa_init_device(vdpa, &vdpasim_dev->dev, dev,
> + &vdpasim_net_config_ops);
> + if (ret)
> + goto err_init;
> +
> + ret = vdpa_register_device(vdpa);
> + if (ret)
> + goto err_register;

See? This error unwind is now all wrong:

> +
> + return vdpasim;
> +
> +err_register:
> + put_device(&vdpa->dev);

Double put_device

> +err_init:
> + vhost_iotlb_free(vdpasim->iommu);
> +err_iotlb:
> + kfree(vdpasim->buffer);
> +err_buffer_alloc:
> + kfree(vdpasim);

kfree after vdpa_init_device() is incorrect, as the put_device now
does kfree via release

> +static int __init vdpasim_dev_init(void)
> +{
> + struct device *dev;
> + int ret = 0;
> +
> + vdpasim_dev = kzalloc(sizeof(*vdpasim_dev), GFP_KERNEL);
> + if (!vdpasim_dev)
> + return -ENOMEM;
> +
> + dev = &vdpasim_dev->dev;
> + dev->release = vdpasim_device_release;
> + dev_set_name(dev, "%s", VDPASIM_NAME);
> +
> + ret = device_register(&vdpasim_dev->dev);
> + if (ret)
> + goto err_register;
> +
> + if (!vdpasim_create())
> + goto err_register;

Wrong error unwind here too

> + return 0;
> +
> +err_register:
> + kfree(vdpasim_dev);
> + vdpasim_dev = NULL;
> + return ret;
> +}

Jason

2020-02-12 07:57:17

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/11 下午9:47, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2020 at 11:56:06AM +0800, Jason Wang wrote:
>> +/**
>> + * vdpa_register_device - register a vDPA device
>> + * Callers must have a succeed call of vdpa_init_device() before.
>> + * @vdev: the vdpa device to be registered to vDPA bus
>> + *
>> + * Returns an error when fail to add to vDPA bus
>> + */
>> +int vdpa_register_device(struct vdpa_device *vdev)
>> +{
>> + int err = device_add(&vdev->dev);
>> +
>> + if (err) {
>> + put_device(&vdev->dev);
>> + ida_simple_remove(&vdpa_index_ida, vdev->index);
>> + }
> This is a very dangerous construction, I've seen it lead to driver
> bugs. Better to require the driver to always do the put_device on
> error unwind


Ok.


>
> The ida_simple_remove should probably be part of the class release
> function to make everything work right


It looks to me bus instead of class is the correct abstraction here
since the devices share a set of programming interface but not the
semantics.

Or do you actually mean type here?


>
>> +/**
>> + * vdpa_unregister_driver - unregister a vDPA device driver
>> + * @drv: the vdpa device driver to be unregistered
>> + */
>> +void vdpa_unregister_driver(struct vdpa_driver *drv)
>> +{
>> + driver_unregister(&drv->driver);
>> +}
>> +EXPORT_SYMBOL_GPL(vdpa_unregister_driver);
>> +
>> +static int vdpa_init(void)
>> +{
>> + if (bus_register(&vdpa_bus) != 0)
>> + panic("virtio bus registration failed");
>> + return 0;
>> +}
> Linus will tell you not to kill the kernel - return the error code and
> propagate it up to the module init function.


Yes, will fix.


>
>> +/**
>> + * vDPA device - representation of a vDPA device
>> + * @dev: underlying device
>> + * @dma_dev: the actual device that is performing DMA
>> + * @config: the configuration ops for this device.
>> + * @index: device index
>> + */
>> +struct vdpa_device {
>> + struct device dev;
>> + struct device *dma_dev;
>> + const struct vdpa_config_ops *config;
>> + int index;
> unsigned values shuld be unsigned int
>
> Jason


Will fix.

Thanks

>

2020-02-12 08:27:58

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 5/5] vdpasim: vDPA device simulator


On 2020/2/11 下午9:52, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2020 at 11:56:08AM +0800, Jason Wang wrote:
>> +
>> +static struct vdpasim *vdpasim_create(void)
>> +{
>> + struct vdpasim *vdpasim;
>> + struct virtio_net_config *config;
>> + struct vdpa_device *vdpa;
>> + struct device *dev;
>> + int ret = -ENOMEM;
>> +
>> + vdpasim = kzalloc(sizeof(*vdpasim), GFP_KERNEL);
>> + if (!vdpasim)
>> + goto err_vdpa_alloc;
>> +
>> + vdpasim->buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> + if (!vdpasim->buffer)
>> + goto err_buffer_alloc;
>> +
>> + vdpasim->iommu = vhost_iotlb_alloc(2048, 0);
>> + if (!vdpasim->iommu)
>> + goto err_iotlb;
>> +
>> + config = &vdpasim->config;
>> + config->mtu = 1500;
>> + config->status = VIRTIO_NET_S_LINK_UP;
>> + eth_random_addr(config->mac);
>> +
>> + INIT_WORK(&vdpasim->work, vdpasim_work);
>> + spin_lock_init(&vdpasim->lock);
>> +
>> + vdpa = &vdpasim->vdpa;
>> + vdpa->dev.release = vdpasim_release_dev;
> The driver should not provide the release function.
>
> Again the safest model is 'vdpa_alloc_device' which combines the
> kzalloc and the vdpa_init_device() and returns something that is
> error unwound with put_device()
>
> The subsystem owns the release and does the kfree and other cleanup
> like releasing the IDA.


So I think if we agree bus instead of class is used. vDPA bus can
provide a release function in vdpa_alloc_device()?


>
>> + vringh_set_iotlb(&vdpasim->vqs[0].vring, vdpasim->iommu);
>> + vringh_set_iotlb(&vdpasim->vqs[1].vring, vdpasim->iommu);
>> +
>> + dev = &vdpa->dev;
>> + dev->coherent_dma_mask = DMA_BIT_MASK(64);
>> + set_dma_ops(dev, &vdpasim_dma_ops);
>> +
>> + ret = vdpa_init_device(vdpa, &vdpasim_dev->dev, dev,
>> + &vdpasim_net_config_ops);
>> + if (ret)
>> + goto err_init;
>> +
>> + ret = vdpa_register_device(vdpa);
>> + if (ret)
>> + goto err_register;
> See? This error unwind is now all wrong:
>
>> +
>> + return vdpasim;
>> +
>> +err_register:
>> + put_device(&vdpa->dev);
> Double put_device


Yes.


>
>> +err_init:
>> + vhost_iotlb_free(vdpasim->iommu);
>> +err_iotlb:
>> + kfree(vdpasim->buffer);
>> +err_buffer_alloc:
>> + kfree(vdpasim);
> kfree after vdpa_init_device() is incorrect, as the put_device now
> does kfree via release


Ok, will fix.


>
>> +static int __init vdpasim_dev_init(void)
>> +{
>> + struct device *dev;
>> + int ret = 0;
>> +
>> + vdpasim_dev = kzalloc(sizeof(*vdpasim_dev), GFP_KERNEL);
>> + if (!vdpasim_dev)
>> + return -ENOMEM;
>> +
>> + dev = &vdpasim_dev->dev;
>> + dev->release = vdpasim_device_release;
>> + dev_set_name(dev, "%s", VDPASIM_NAME);
>> +
>> + ret = device_register(&vdpasim_dev->dev);
>> + if (ret)
>> + goto err_register;
>> +
>> + if (!vdpasim_create())
>> + goto err_register;
> Wrong error unwind here too


Will fix.

Thanks


>
>> + return 0;
>> +
>> +err_register:
>> + kfree(vdpasim_dev);
>> + vdpasim_dev = NULL;
>> + return ret;
>> +}
> Jason
>

2020-02-12 12:53:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Wed, Feb 12, 2020 at 03:55:31PM +0800, Jason Wang wrote:
> > The ida_simple_remove should probably be part of the class release
> > function to make everything work right
>
> It looks to me bus instead of class is the correct abstraction here since
> the devices share a set of programming interface but not the semantics.

device_release() doesn't call the bus release? You have dev, type or
class to choose from. Type is rarely used and doesn't seem to be used
by vdpa, so class seems the right choice

Jason

2020-02-13 03:34:55

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/12 下午8:51, Jason Gunthorpe wrote:
> On Wed, Feb 12, 2020 at 03:55:31PM +0800, Jason Wang wrote:
>>> The ida_simple_remove should probably be part of the class release
>>> function to make everything work right
>> It looks to me bus instead of class is the correct abstraction here since
>> the devices share a set of programming interface but not the semantics.
> device_release() doesn't call the bus release?


What it did is:

        if (dev->release)
                dev->release(dev);
        else if (dev->type && dev->type->release)
                dev->type->release(dev);
        else if (dev->class && dev->class->dev_release)
                dev->class->dev_release(dev);
        else
                WARN(1, KERN_ERR "Device '%s' does not have a release()
function, it is broken and must be fixed. See Documentation/kobject.txt.\n",
                        dev_name(dev));

So it looks not.


> You have dev, type or
> class to choose from. Type is rarely used and doesn't seem to be used
> by vdpa, so class seems the right choice
>
> Jason


Yes, but my understanding is class and bus are mutually exclusive. So we
can't add a class to a device which is already attached on a bus.

Thanks


2020-02-13 13:42:02

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 11:34:10AM +0800, Jason Wang wrote:

> > You have dev, type or
> > class to choose from. Type is rarely used and doesn't seem to be used
> > by vdpa, so class seems the right choice
> >
> > Jason
>
> Yes, but my understanding is class and bus are mutually exclusive. So we
> can't add a class to a device which is already attached on a bus.

While I suppose there are variations, typically 'class' devices are
user facing things and 'bus' devices are internal facing (ie like a
PCI device)

So why is this using a bus? VDPA is a user facing object, so the
driver should create a class vhost_vdpa device directly, and that
driver should live in the drivers/vhost/ directory.

For the PCI VF case this driver would bind to a PCI device like
everything else

For our future SF/ADI cases the driver would bind to some
SF/ADI/whatever device on a bus.

I don't see a reason for VDPA to be creating busses..

Jason

2020-02-13 15:00:37

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/13 下午9:41, Jason Gunthorpe wrote:
> On Thu, Feb 13, 2020 at 11:34:10AM +0800, Jason Wang wrote:
>
>>> You have dev, type or
>>> class to choose from. Type is rarely used and doesn't seem to be used
>>> by vdpa, so class seems the right choice
>>>
>>> Jason
>> Yes, but my understanding is class and bus are mutually exclusive. So we
>> can't add a class to a device which is already attached on a bus.
> While I suppose there are variations, typically 'class' devices are
> user facing things and 'bus' devices are internal facing (ie like a
> PCI device)


Though all vDPA devices have the same programming interface, but the
semantic is different. So it looks to me that use bus complies what
class.rst said:

"

Each device class defines a set of semantics and a programming interface
that devices of that class adhere to. Device drivers are the
implementation of that programming interface for a particular device on
a particular bus.

"


>
> So why is this using a bus? VDPA is a user facing object, so the
> driver should create a class vhost_vdpa device directly, and that
> driver should live in the drivers/vhost/ directory.


This is because we want vDPA to be generic for being used by different
drivers which is not limited to vhost-vdpa. E.g in this series, it
allows vDPA to be used by kernel virtio drivers. And in the future, we
will probably introduce more drivers in the future.


>
> For the PCI VF case this driver would bind to a PCI device like
> everything else
>
> For our future SF/ADI cases the driver would bind to some
> SF/ADI/whatever device on a bus.


All these driver will still be bound to their own bus (PCI or other).
And what the driver needs is to present a vDPA device to virtual vDPA
bus on top.

Thanks

>
> I don't see a reason for VDPA to be creating busses..
>
> Jason
>

2020-02-13 15:08:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 10:58:44PM +0800, Jason Wang wrote:
>
> On 2020/2/13 下午9:41, Jason Gunthorpe wrote:
> > On Thu, Feb 13, 2020 at 11:34:10AM +0800, Jason Wang wrote:
> >
> > > > You have dev, type or
> > > > class to choose from. Type is rarely used and doesn't seem to be used
> > > > by vdpa, so class seems the right choice
> > > >
> > > > Jason
> > > Yes, but my understanding is class and bus are mutually exclusive. So we
> > > can't add a class to a device which is already attached on a bus.
> > While I suppose there are variations, typically 'class' devices are
> > user facing things and 'bus' devices are internal facing (ie like a
> > PCI device)
>
>
> Though all vDPA devices have the same programming interface, but the
> semantic is different. So it looks to me that use bus complies what
> class.rst said:
>
> "
>
> Each device class defines a set of semantics and a programming interface
> that devices of that class adhere to. Device drivers are the
> implementation of that programming interface for a particular device on
> a particular bus.
>
> "

Here we are talking about the /dev/XX node that provides the
programming interface. All the vdpa devices have the same basic
chardev interface and discover any semantic variations 'in band'

> > So why is this using a bus? VDPA is a user facing object, so the
> > driver should create a class vhost_vdpa device directly, and that
> > driver should live in the drivers/vhost/ directory.
>
> This is because we want vDPA to be generic for being used by different
> drivers which is not limited to vhost-vdpa. E.g in this series, it allows
> vDPA to be used by kernel virtio drivers. And in the future, we will
> probably introduce more drivers in the future.

I don't see how that connects with using a bus.

Every class of virtio traffic is going to need a special HW driver to
enable VDPA, that special driver can create the correct vhost side
class device.

> > For the PCI VF case this driver would bind to a PCI device like
> > everything else
> >
> > For our future SF/ADI cases the driver would bind to some
> > SF/ADI/whatever device on a bus.
>
> All these driver will still be bound to their own bus (PCI or other). And
> what the driver needs is to present a vDPA device to virtual vDPA bus on
> top.

Again, I can't see any reason to inject a 'vdpa virtual bus' on
top. That seems like mis-using the driver core.

Jason

2020-02-13 15:41:55

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 11:05:42AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 13, 2020 at 10:58:44PM +0800, Jason Wang wrote:
> >
> > On 2020/2/13 下午9:41, Jason Gunthorpe wrote:
> > > On Thu, Feb 13, 2020 at 11:34:10AM +0800, Jason Wang wrote:
> > >
> > > > > You have dev, type or
> > > > > class to choose from. Type is rarely used and doesn't seem to be used
> > > > > by vdpa, so class seems the right choice
> > > > >
> > > > > Jason
> > > > Yes, but my understanding is class and bus are mutually exclusive. So we
> > > > can't add a class to a device which is already attached on a bus.
> > > While I suppose there are variations, typically 'class' devices are
> > > user facing things and 'bus' devices are internal facing (ie like a
> > > PCI device)
> >
> >
> > Though all vDPA devices have the same programming interface, but the
> > semantic is different. So it looks to me that use bus complies what
> > class.rst said:
> >
> > "
> >
> > Each device class defines a set of semantics and a programming interface
> > that devices of that class adhere to. Device drivers are the
> > implementation of that programming interface for a particular device on
> > a particular bus.
> >
> > "
>
> Here we are talking about the /dev/XX node that provides the
> programming interface. All the vdpa devices have the same basic
> chardev interface and discover any semantic variations 'in band'
>
> > > So why is this using a bus? VDPA is a user facing object, so the
> > > driver should create a class vhost_vdpa device directly, and that
> > > driver should live in the drivers/vhost/ directory.
> >
> > This is because we want vDPA to be generic for being used by different
> > drivers which is not limited to vhost-vdpa. E.g in this series, it allows
> > vDPA to be used by kernel virtio drivers. And in the future, we will
> > probably introduce more drivers in the future.
>
> I don't see how that connects with using a bus.
>
> Every class of virtio traffic is going to need a special HW driver to
> enable VDPA, that special driver can create the correct vhost side
> class device.


That's just a ton of useless code duplication, and a good chance
to have minor variations in implementations confusing
userspace.

Instead, each device implement the same interface, and then
vhost sits on top.

> > > For the PCI VF case this driver would bind to a PCI device like
> > > everything else
> > >
> > > For our future SF/ADI cases the driver would bind to some
> > > SF/ADI/whatever device on a bus.
> >
> > All these driver will still be bound to their own bus (PCI or other). And
> > what the driver needs is to present a vDPA device to virtual vDPA bus on
> > top.
>
> Again, I can't see any reason to inject a 'vdpa virtual bus' on
> top. That seems like mis-using the driver core.
>
> Jason

That bus is exactly what Greg KH proposed. There are other ways
to solve this I guess but this bikeshedding is getting tiring.
Come on it's an internal kernel interface, if we feel
it was a wrong direction to take we can change our minds later.
Main thing is getting UAPI right.

--
MST

2020-02-13 15:52:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 10:41:06AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 13, 2020 at 11:05:42AM -0400, Jason Gunthorpe wrote:
> > On Thu, Feb 13, 2020 at 10:58:44PM +0800, Jason Wang wrote:
> > >
> > > On 2020/2/13 下午9:41, Jason Gunthorpe wrote:
> > > > On Thu, Feb 13, 2020 at 11:34:10AM +0800, Jason Wang wrote:
> > > >
> > > > > > You have dev, type or
> > > > > > class to choose from. Type is rarely used and doesn't seem to be used
> > > > > > by vdpa, so class seems the right choice
> > > > > >
> > > > > > Jason
> > > > > Yes, but my understanding is class and bus are mutually exclusive. So we
> > > > > can't add a class to a device which is already attached on a bus.
> > > > While I suppose there are variations, typically 'class' devices are
> > > > user facing things and 'bus' devices are internal facing (ie like a
> > > > PCI device)
> > >
> > >
> > > Though all vDPA devices have the same programming interface, but the
> > > semantic is different. So it looks to me that use bus complies what
> > > class.rst said:
> > >
> > > "
> > >
> > > Each device class defines a set of semantics and a programming interface
> > > that devices of that class adhere to. Device drivers are the
> > > implementation of that programming interface for a particular device on
> > > a particular bus.
> > >
> > > "
> >
> > Here we are talking about the /dev/XX node that provides the
> > programming interface. All the vdpa devices have the same basic
> > chardev interface and discover any semantic variations 'in band'
> >
> > > > So why is this using a bus? VDPA is a user facing object, so the
> > > > driver should create a class vhost_vdpa device directly, and that
> > > > driver should live in the drivers/vhost/ directory.
> > >
> > > This is because we want vDPA to be generic for being used by different
> > > drivers which is not limited to vhost-vdpa. E.g in this series, it allows
> > > vDPA to be used by kernel virtio drivers. And in the future, we will
> > > probably introduce more drivers in the future.
> >
> > I don't see how that connects with using a bus.
> >
> > Every class of virtio traffic is going to need a special HW driver to
> > enable VDPA, that special driver can create the correct vhost side
> > class device.
>
> That's just a ton of useless code duplication, and a good chance
> to have minor variations in implementations confusing
> userspace.

What? Why? This is how almost every user of the driver core works.

I don't see how you get any duplication unless the subsystem core is
badly done wrong.

The 'class' is supposed to provide all the library functions to remove
this duplication. Instead of plugging the HW driver in via some bus
scheme every subsystem has its own 'ops' that the HW driver provides
to the subsystem's class via subsystem_register()

This is the *standard* pattern to use the driver core.

This is almost there, it just has this extra bus part to convey the HW
ops instead of directly.

> Instead, each device implement the same interface, and then
> vhost sits on top.

Sure, but plugging in via ops/etc not via a bus and another struct
device.

> That bus is exactly what Greg KH proposed. There are other ways
> to solve this I guess but this bikeshedding is getting tiring.

This discussion was for a different goal, IMHO.

> Come on it's an internal kernel interface, if we feel
> it was a wrong direction to take we can change our minds later.
> Main thing is getting UAPI right.

This discussion started because the refcounting has been busted up in
every version posted so far. It is not bikeshedding when these bugs
are actually being caused by trying to abuse the driver core and
shoehorn in a bus that isn't needed when a class is the correct thing
to use.

Jason

2020-02-13 15:57:36

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 11:51:54AM -0400, Jason Gunthorpe wrote:
> > That bus is exactly what Greg KH proposed. There are other ways
> > to solve this I guess but this bikeshedding is getting tiring.
>
> This discussion was for a different goal, IMHO.

Hmm couldn't find it anymore. What was the goal there in your opinion?

--
MST

2020-02-13 16:00:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 11:51:54AM -0400, Jason Gunthorpe wrote:
> The 'class' is supposed to provide all the library functions to remove
> this duplication. Instead of plugging the HW driver in via some bus
> scheme every subsystem has its own 'ops' that the HW driver provides
> to the subsystem's class via subsystem_register()

Hmm I'm not familiar with subsystem_register. A grep didn't find it
in the kernel either ...

--
MST

2020-02-13 16:14:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 10:59:34AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 13, 2020 at 11:51:54AM -0400, Jason Gunthorpe wrote:
> > The 'class' is supposed to provide all the library functions to remove
> > this duplication. Instead of plugging the HW driver in via some bus
> > scheme every subsystem has its own 'ops' that the HW driver provides
> > to the subsystem's class via subsystem_register()
>
> Hmm I'm not familiar with subsystem_register. A grep didn't find it
> in the kernel either ...

I mean it is the registration function provided by the subsystem that
owns the class, for instance tpm_chip_register(),
ib_register_device(), register_netdev(), rtc_register_device() etc

So if you have some vhost (vhost net?) class then you'd have some
vhost_vdpa_init/alloc(); vhost_vdpa_register(), sequence
presumably. (vs trying to do it with a bus matcher)

I recommend to look at rtc and tpm for fairly simple easy to follow
patterns for creating a subsystem in the kernel. A subsystem owns a class,
allows HW drivers to plug in to it, and provides a consistent user
API via a cdev/sysfs/etc.

The driver model class should revolve around the char dev and sysfs
uABI - if you enumerate the devices on the class then they should all
follow the char dev and sysfs interfaces contract of that class.

Those examples show how to do all the refcounting semi-sanely,
introduce sysfs, cdevs, etc.

I thought the latest proposal was to use the existing vhost class and
largely the existing vhost API, so it probably just needs to make sure
the common class-wide stuff is split from the 'driver' stuff of the
existing vhost to netdev.

Jason

2020-02-13 16:26:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Thu, Feb 13, 2020 at 10:56:00AM -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 13, 2020 at 11:51:54AM -0400, Jason Gunthorpe wrote:
> > > That bus is exactly what Greg KH proposed. There are other ways
> > > to solve this I guess but this bikeshedding is getting tiring.
> >
> > This discussion was for a different goal, IMHO.
>
> Hmm couldn't find it anymore. What was the goal there in your opinion?

I think it was largely talking about how to model things like
ADI/SF/etc, plus stuff got very confused when the discussion tried to
explain what mdev's role was vs the driver core.

The standard driver model is a 'bus' driver provides the HW access
(think PCI level things), and a 'hw driver' attaches to the bus
device, and instantiates a 'subsystem device' (think netdev, rdma,
etc) using some per-subsystem XXX_register(). The 'hw driver' pulls in
functions from the 'subsystem' using a combination of callbacks and
library-style calls so there is no code duplication.

As a subsystem, vhost&vdpa should expect its 'HW driver' to bind to
devices on busses, for instance I would expect:

- A future SF/ADI/'virtual bus' as a child of multi-functional PCI device
Exactly how this works is still under active discussion and is
one place where Greg said 'use a bus'.
- An existing PCI, platform, or other bus and device. No need for an
extra bus here, PCI is the bus.
- No bus, ie for a simulator or binding to a netdev. (existing vhost?)

They point is that the HW driver's job is to adapt from the bus level
interfaces (eg readl/writel) to the subsystem level (eg something like
the vdpa_ops).

For instance that Intel driver should be a pci_driver to bind to a
struct pci_device for its VF and then call some 'vhost&vdpa'
_register() function to pass its ops to the subsystem which in turn
creates the struct device of the subsystem calls, common char devices,
sysfs, etc and calls the driver's ops in response to uAPI calls.

This is already almost how things were setup in v2 of the patches,
near as I can see, just that a bus was inserted somehow instead of
having only the vhost class. So it iwas confusing and the lifetime
model becomes too complicated to implement correctly...

Jason

2020-02-14 03:25:37

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/13 下午11:05, Jason Gunthorpe wrote:
> On Thu, Feb 13, 2020 at 10:58:44PM +0800, Jason Wang wrote:
>> On 2020/2/13 下午9:41, Jason Gunthorpe wrote:
>>> On Thu, Feb 13, 2020 at 11:34:10AM +0800, Jason Wang wrote:
>>>
>>>>> You have dev, type or
>>>>> class to choose from. Type is rarely used and doesn't seem to be used
>>>>> by vdpa, so class seems the right choice
>>>>>
>>>>> Jason
>>>> Yes, but my understanding is class and bus are mutually exclusive. So we
>>>> can't add a class to a device which is already attached on a bus.
>>> While I suppose there are variations, typically 'class' devices are
>>> user facing things and 'bus' devices are internal facing (ie like a
>>> PCI device)
>>
>> Though all vDPA devices have the same programming interface, but the
>> semantic is different. So it looks to me that use bus complies what
>> class.rst said:
>>
>> "
>>
>> Each device class defines a set of semantics and a programming interface
>> that devices of that class adhere to. Device drivers are the
>> implementation of that programming interface for a particular device on
>> a particular bus.
>>
>> "
> Here we are talking about the /dev/XX node that provides the
> programming interface.


I'm confused here, are you suggesting to use class to create char device
in vhost-vdpa? That's fine but the comment should go for vhost-vdpa patch.


> All the vdpa devices have the same basic
> chardev interface and discover any semantic variations 'in band'


That's not true, char interface is only used for vhost. Kernel virtio
driver does not need char dev but a device on the virtio bus.


>
>>> So why is this using a bus? VDPA is a user facing object, so the
>>> driver should create a class vhost_vdpa device directly, and that
>>> driver should live in the drivers/vhost/ directory.
>>
>> This is because we want vDPA to be generic for being used by different
>> drivers which is not limited to vhost-vdpa. E.g in this series, it allows
>> vDPA to be used by kernel virtio drivers. And in the future, we will
>> probably introduce more drivers in the future.
> I don't see how that connects with using a bus.


This is demonstrated in the virito-vdpa driver. So if you want to use
kernel virito driver for vDPA device, a bus is most straight forward.


>
> Every class of virtio traffic is going to need a special HW driver to
> enable VDPA, that special driver can create the correct vhost side
> class device.


Are you saying, e.g it's the charge of IFCVF driver to create vhost char
dev and other stuffs?


>
>>> For the PCI VF case this driver would bind to a PCI device like
>>> everything else
>>>
>>> For our future SF/ADI cases the driver would bind to some
>>> SF/ADI/whatever device on a bus.
>> All these driver will still be bound to their own bus (PCI or other). And
>> what the driver needs is to present a vDPA device to virtual vDPA bus on
>> top.
> Again, I can't see any reason to inject a 'vdpa virtual bus' on
> top. That seems like mis-using the driver core.


I don't think so. Vhost is not the only programming interface for vDPA.
We don't want a device that can only work for userspace drivers and only
have a single set of userspace APIs.

Thanks


>
> Jason
>

2020-02-14 04:06:19

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/14 上午12:24, Jason Gunthorpe wrote:
> On Thu, Feb 13, 2020 at 10:56:00AM -0500, Michael S. Tsirkin wrote:
>> On Thu, Feb 13, 2020 at 11:51:54AM -0400, Jason Gunthorpe wrote:
>>>> That bus is exactly what Greg KH proposed. There are other ways
>>>> to solve this I guess but this bikeshedding is getting tiring.
>>> This discussion was for a different goal, IMHO.
>> Hmm couldn't find it anymore. What was the goal there in your opinion?
> I think it was largely talking about how to model things like
> ADI/SF/etc, plus stuff got very confused when the discussion tried to
> explain what mdev's role was vs the driver core.
>
> The standard driver model is a 'bus' driver provides the HW access
> (think PCI level things), and a 'hw driver' attaches to the bus
> device,


This is not true, kernel had already had plenty virtual bus where
virtual devices and drivers could be attached, besides mdev and virtio,
you can see vop, rpmsg, visorbus etc.


> and instantiates a 'subsystem device' (think netdev, rdma,
> etc) using some per-subsystem XXX_register().


Well, if you go through virtio spec, we support ~20 types of different
devices. Classes like netdev and rdma are correct since they have a
clear set of semantics their own. But grouping network and scsi into a
single class looks wrong, that's the work of a virtual bus.

The class should be done on top of vDPA device instead of vDPA device
itself:

- For kernel driver, netdev, blk dev could be done on top
- For userspace driver, the class could be done by the drivers inside VM
or userspace (dpdk)


> The 'hw driver' pulls in
> functions from the 'subsystem' using a combination of callbacks and
> library-style calls so there is no code duplication.


The point is we want vDPA devices to be used by different subsystems,
not only vhost, but also netdev, blk, crypto (every subsystem that can
use virtio devices). That's why we introduce vDPA bus and introduce
different drivers on top.


>
> As a subsystem, vhost&vdpa should expect its 'HW driver' to bind to
> devices on busses, for instance I would expect:
>
> - A future SF/ADI/'virtual bus' as a child of multi-functional PCI device
> Exactly how this works is still under active discussion and is
> one place where Greg said 'use a bus'.


That's ok but it's something that is not directly related to vDPA which
can be implemented by any kinds of devices/buses:

struct XXX_device {
struct vdpa_device vdpa;
struct adi_device/pci_device *lowerdev;
}
...


> - An existing PCI, platform, or other bus and device. No need for an
> extra bus here, PCI is the bus.


There're several examples that a bus is needed on top.

A good example is Mellanox TmFIFO driver which is a platform device
driver but register itself as a virtio device in order to be used by
virito-console driver on the virtio bus.

But it's a pity that the device can not be used by userspace driver due
to the limitation of virito bus which is designed for kernel driver.
That's why vDPA bus is introduced which abstract the common requirements
of both kernel and userspace drivers which allow the a single HW driver
to be used by kernel drivers (and the subsystems on top) and userspace
drivers.


> - No bus, ie for a simulator or binding to a netdev. (existing vhost?)


Note, simulator can have its own class (sysfs etc.).


>
> They point is that the HW driver's job is to adapt from the bus level
> interfaces (eg readl/writel) to the subsystem level (eg something like
> the vdpa_ops).
>
> For instance that Intel driver should be a pci_driver to bind to a
> struct pci_device for its VF and then call some 'vhost&vdpa'
> _register() function to pass its ops to the subsystem which in turn
> creates the struct device of the subsystem calls, common char devices,
> sysfs, etc and calls the driver's ops in response to uAPI calls.
>
> This is already almost how things were setup in v2 of the patches,
> near as I can see, just that a bus was inserted somehow instead of
> having only the vhost class.


Well the series (plus mdev part) uses a bus since day 0. It's not
something new.

Thanks


> So it iwas confusing and the lifetime
> model becomes too complicated to implement correctly...
>
> Jason
>

2020-02-14 04:40:33

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/14 上午12:13, Jason Gunthorpe wrote:
> On Thu, Feb 13, 2020 at 10:59:34AM -0500, Michael S. Tsirkin wrote:
>> On Thu, Feb 13, 2020 at 11:51:54AM -0400, Jason Gunthorpe wrote:
>>> The 'class' is supposed to provide all the library functions to remove
>>> this duplication. Instead of plugging the HW driver in via some bus
>>> scheme every subsystem has its own 'ops' that the HW driver provides
>>> to the subsystem's class via subsystem_register()
>> Hmm I'm not familiar with subsystem_register. A grep didn't find it
>> in the kernel either ...
> I mean it is the registration function provided by the subsystem that
> owns the class, for instance tpm_chip_register(),
> ib_register_device(), register_netdev(), rtc_register_device() etc
>
> So if you have some vhost (vhost net?) class then you'd have some
> vhost_vdpa_init/alloc(); vhost_vdpa_register(), sequence
> presumably. (vs trying to do it with a bus matcher)
>
> I recommend to look at rtc and tpm for fairly simple easy to follow
> patterns for creating a subsystem in the kernel. A subsystem owns a class,
> allows HW drivers to plug in to it, and provides a consistent user
> API via a cdev/sysfs/etc.
>
> The driver model class should revolve around the char dev and sysfs
> uABI - if you enumerate the devices on the class then they should all
> follow the char dev and sysfs interfaces contract of that class.
>
> Those examples show how to do all the refcounting semi-sanely,
> introduce sysfs, cdevs, etc.
>
> I thought the latest proposal was to use the existing vhost class and
> largely the existing vhost API, so it probably just needs to make sure
> the common class-wide stuff is split from the 'driver' stuff of the
> existing vhost to netdev.


Still, netdev is only one of the type we want to support. And we can not
guarantee or forecast that vhost is the only API that is used.

Let's take virtio as an example, it is implemented through a bus which
allows different subsystems on top. And it can provide a variety of
different uAPIs. For best performance, VFIO could be used for userspace
drivers, but it requires the bus has support from VFIO.

For vDPA devices, it's just the same logic. A bus allows different
drivers and subsystems on top. One of the subsystem could be vhost that
provides a unified API for userspace driver.

Thanks


>
> Jason
>

2020-02-14 13:53:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Fri, Feb 14, 2020 at 11:23:27AM +0800, Jason Wang wrote:

> > > Though all vDPA devices have the same programming interface, but the
> > > semantic is different. So it looks to me that use bus complies what
> > > class.rst said:
> > >
> > > "
> > >
> > > Each device class defines a set of semantics and a programming interface
> > > that devices of that class adhere to. Device drivers are the
> > > implementation of that programming interface for a particular device on
> > > a particular bus.
> > >
> > > "
> > Here we are talking about the /dev/XX node that provides the
> > programming interface.
>
>
> I'm confused here, are you suggesting to use class to create char device in
> vhost-vdpa? That's fine but the comment should go for vhost-vdpa patch.

Certainly yes, something creating many char devs should have a
class. That makes the sysfs work as expected

I suppose this is vhost user? I admit I don't really see how this
vhost stuff works, all I see are global misc devices? Very unusual for
a new subsystem to be using global misc devices..

I would have expected that a single VDPA device comes out as a single
char dev linked to only that VDPA device.

> > All the vdpa devices have the same basic
> > chardev interface and discover any semantic variations 'in band'
>
> That's not true, char interface is only used for vhost. Kernel virtio driver
> does not need char dev but a device on the virtio bus.

Okay, this is fine, but why do you need two busses to accomplish this?

Shouldn't the 'struct virito_device' be the plug in point for HW
drivers I was talking about - and from there a vhost-user can connect
to the struct virtio_device to give it a char dev or a kernel driver
can connect to link it to another subsystem?

It is easy to see something is going wrong with this design because
the drivers/virtio/virtio_vdpa.c mainly contains a bunch of trampoline
functions reflecting identical calls from one ops struct to a
different ops struct. This suggests the 'vdpa' is some subclass of
'virtio' and it is possibly better to model it by extending 'struct
virito_device' to include the vdpa specific stuff.

Where does the vhost-user char dev get invovled in with the v2 series?
Is that included?

> > Every class of virtio traffic is going to need a special HW driver to
> > enable VDPA, that special driver can create the correct vhost side
> > class device.
>
> Are you saying, e.g it's the charge of IFCVF driver to create vhost char dev
> and other stuffs?

No.

Jason

2020-02-14 14:05:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Fri, Feb 14, 2020 at 12:05:32PM +0800, Jason Wang wrote:

> > The standard driver model is a 'bus' driver provides the HW access
> > (think PCI level things), and a 'hw driver' attaches to the bus
> > device,
>
> This is not true, kernel had already had plenty virtual bus where virtual
> devices and drivers could be attached, besides mdev and virtio, you can see
> vop, rpmsg, visorbus etc.

Sure, but those are not connecting HW into the kernel..

> > and instantiates a 'subsystem device' (think netdev, rdma,
> > etc) using some per-subsystem XXX_register().
>
>
> Well, if you go through virtio spec, we support ~20 types of different
> devices. Classes like netdev and rdma are correct since they have a clear
> set of semantics their own. But grouping network and scsi into a single
> class looks wrong, that's the work of a virtual bus.

rdma also has about 20 different types of things it supports on top of
the generic ib_device.

The central point in RDMA is the 'struct ib_device' which is a device
class. You can discover all RDMA devices by looking in /sys/class/infiniband/

It has an internal bus like thing (which probably should have been an
actual bus, but this was done 15 years ago) which allows other
subsystems to have drivers to match and bind their own drivers to the
struct ib_device.

So you'd have a chain like:

struct pci_device -> struct ib_device -> [ib client bus thing] -> struct net_device

And the various char devs are created by clients connecting to the
ib_device and creating char devs on their own classes.

Since ib_devices are multi-queue we can have all 20 devices running
concurrently and there are various schemes to manage when the various
things are created.

> > The 'hw driver' pulls in
> > functions from the 'subsystem' using a combination of callbacks and
> > library-style calls so there is no code duplication.
>
> The point is we want vDPA devices to be used by different subsystems, not
> only vhost, but also netdev, blk, crypto (every subsystem that can use
> virtio devices). That's why we introduce vDPA bus and introduce different
> drivers on top.

See the other mail, it seems struct virtio_device serves this purpose
already, confused why a struct vdpa_device and another bus is being
introduced

> There're several examples that a bus is needed on top.
>
> A good example is Mellanox TmFIFO driver which is a platform device driver
> but register itself as a virtio device in order to be used by virito-console
> driver on the virtio bus.

How is that another bus? The platform bus is the HW bus, the TmFIFO is
the HW driver, and virtio_device is the subsystem.

This seems reasonable/normal so far..

> But it's a pity that the device can not be used by userspace driver due to
> the limitation of virito bus which is designed for kernel driver. That's why
> vDPA bus is introduced which abstract the common requirements of both kernel
> and userspace drivers which allow the a single HW driver to be used by
> kernel drivers (and the subsystems on top) and userspace drivers.

Ah! Maybe this is the source of all this strangeness - the userspace
driver is something parallel to the struct virtio_device instead of
being a consumer of it?? That certianly would mess up the driver model
quite a lot.

Then you want to add another bus to switch between vhost and struct
virtio_device? But only for vdpa?

But as you point out something like TmFIFO is left hanging. Seems like
the wrong abstraction point..

Jason

2020-02-17 06:08:54

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/14 下午10:04, Jason Gunthorpe wrote:
> On Fri, Feb 14, 2020 at 12:05:32PM +0800, Jason Wang wrote:
>
>>> The standard driver model is a 'bus' driver provides the HW access
>>> (think PCI level things), and a 'hw driver' attaches to the bus
>>> device,
>> This is not true, kernel had already had plenty virtual bus where virtual
>> devices and drivers could be attached, besides mdev and virtio, you can see
>> vop, rpmsg, visorbus etc.
> Sure, but those are not connecting HW into the kernel..


Well the virtual devices are normally implemented via a real HW driver.
E.g for virio bus, its transport driver could be driver of real hardware
(e.g PCI).


>
>>> and instantiates a 'subsystem device' (think netdev, rdma,
>>> etc) using some per-subsystem XXX_register().
>>
>> Well, if you go through virtio spec, we support ~20 types of different
>> devices. Classes like netdev and rdma are correct since they have a clear
>> set of semantics their own. But grouping network and scsi into a single
>> class looks wrong, that's the work of a virtual bus.
> rdma also has about 20 different types of things it supports on top of
> the generic ib_device.
>
> The central point in RDMA is the 'struct ib_device' which is a device
> class. You can discover all RDMA devices by looking in /sys/class/infiniband/
>
> It has an internal bus like thing (which probably should have been an
> actual bus, but this was done 15 years ago) which allows other
> subsystems to have drivers to match and bind their own drivers to the
> struct ib_device.


Right.


>
> So you'd have a chain like:
>
> struct pci_device -> struct ib_device -> [ib client bus thing] -> struct net_device


So for vDPA we want to have:

kernel datapath:

struct pci_device -> struct vDPA device -> [ vDPA bus] -> struct
virtio_device -> [virtio bus] -> struct net_device

userspace datapath:

struct pci_device -> struct vDPA device -> [ vDPA bus] -> struct
vhost_device -> UAPI -> userspace driver


>
> And the various char devs are created by clients connecting to the
> ib_device and creating char devs on their own classes.
>
> Since ib_devices are multi-queue we can have all 20 devices running
> concurrently and there are various schemes to manage when the various
> things are created.
>
>>> The 'hw driver' pulls in
>>> functions from the 'subsystem' using a combination of callbacks and
>>> library-style calls so there is no code duplication.
>> The point is we want vDPA devices to be used by different subsystems, not
>> only vhost, but also netdev, blk, crypto (every subsystem that can use
>> virtio devices). That's why we introduce vDPA bus and introduce different
>> drivers on top.
> See the other mail, it seems struct virtio_device serves this purpose
> already, confused why a struct vdpa_device and another bus is being
> introduced
>
>> There're several examples that a bus is needed on top.
>>
>> A good example is Mellanox TmFIFO driver which is a platform device driver
>> but register itself as a virtio device in order to be used by virito-console
>> driver on the virtio bus.
> How is that another bus? The platform bus is the HW bus, the TmFIFO is
> the HW driver, and virtio_device is the subsystem.
>
> This seems reasonable/normal so far..


Yes, that's reasonable. This example is to answer the question why bus
is used instead of class here.


>
>> But it's a pity that the device can not be used by userspace driver due to
>> the limitation of virito bus which is designed for kernel driver. That's why
>> vDPA bus is introduced which abstract the common requirements of both kernel
>> and userspace drivers which allow the a single HW driver to be used by
>> kernel drivers (and the subsystems on top) and userspace drivers.
> Ah! Maybe this is the source of all this strangeness - the userspace
> driver is something parallel to the struct virtio_device instead of
> being a consumer of it??


userspace driver is not parallel to virtio_device. The vhost_device is
parallel to virtio_device actually.


> That certianly would mess up the driver model
> quite a lot.
>
> Then you want to add another bus to switch between vhost and struct
> virtio_device? But only for vdpa?


Still, vhost works on top of vDPA bus directly (see the reply above).


>
> But as you point out something like TmFIFO is left hanging. Seems like
> the wrong abstraction point..


You know, even refactoring virtio-bus is not for free, TmFIFO driver
needs changes anyhow.

Thanks


>
> Jason
>

2020-02-17 06:10:22

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/14 下午9:52, Jason Gunthorpe wrote:
> On Fri, Feb 14, 2020 at 11:23:27AM +0800, Jason Wang wrote:
>
>>>> Though all vDPA devices have the same programming interface, but the
>>>> semantic is different. So it looks to me that use bus complies what
>>>> class.rst said:
>>>>
>>>> "
>>>>
>>>> Each device class defines a set of semantics and a programming interface
>>>> that devices of that class adhere to. Device drivers are the
>>>> implementation of that programming interface for a particular device on
>>>> a particular bus.
>>>>
>>>> "
>>> Here we are talking about the /dev/XX node that provides the
>>> programming interface.
>>
>> I'm confused here, are you suggesting to use class to create char device in
>> vhost-vdpa? That's fine but the comment should go for vhost-vdpa patch.
> Certainly yes, something creating many char devs should have a
> class. That makes the sysfs work as expected
>
> I suppose this is vhost user?


Actually not.

Vhost-user is the vhost protocol that is used for userspace vhost
backend (usually though a UNIX domain socket).

What's being done in the vhost-vpda is a new type of the vhost in kernel.


> I admit I don't really see how this
> vhost stuff works, all I see are global misc devices? Very unusual for
> a new subsystem to be using global misc devices..


Vhost is not a subsystem right now, e.g for it's net implementation, it
was loosely coupled with a socket.

I thought you were copied in the patch [1], maybe we can move vhost
related discussion there to avoid confusion.

[1] https://lwn.net/Articles/811210/


>
> I would have expected that a single VDPA device comes out as a single
> char dev linked to only that VDPA device.
>
>>> All the vdpa devices have the same basic
>>> chardev interface and discover any semantic variations 'in band'
>> That's not true, char interface is only used for vhost. Kernel virtio driver
>> does not need char dev but a device on the virtio bus.
> Okay, this is fine, but why do you need two busses to accomplish this?


The reasons are:

- vDPA ops is designed to be functional as a software assisted transport
(control path) for virtio, so it's fit for a new transport driver but
not directly into virtio bus. VOP use similar design.
- virtio bus is designed for kernel drivers but not userspace, and it
can not be easily extended to support userspace driver but requires some
major refactoring. E.g the virtio bus operations requires the virtqueue
to be allocated by the transport driver.

So it's cheaper and simpler to introduce a new bus instead of
refactoring a well known bus and API where brunches of drivers and
devices had been implemented for years.


>
> Shouldn't the 'struct virito_device' be the plug in point for HW
> drivers I was talking about - and from there a vhost-user can connect
> to the struct virtio_device to give it a char dev or a kernel driver
> can connect to link it to another subsystem?


From vhost point of view, it would only need to connect vDPA bus, no
need to go for virtio bus. Vhost device talks to vDPA device through
vDPA bus. Virito device talks to vDPA device through a new vDPA
transport driver.


>
> It is easy to see something is going wrong with this design because
> the drivers/virtio/virtio_vdpa.c mainly contains a bunch of trampoline
> functions reflecting identical calls from one ops struct to a
> different ops struct.


That's pretty normal, since part of the virtio ops could be 1:1 mapped
to some device function. If you see MMIO and PCI transport, you can see
something similar. The only difference is that in the case of VDPA the
function is assisted or emulated by hardware vDPA driver.


> This suggests the 'vdpa' is some subclass of
> 'virtio' and it is possibly better to model it by extending 'struct
> virito_device' to include the vdpa specific stuff.


Going for such kind of modeling, virtio-pci and virtio-mmio could be
also treated as a subclass of virtio as well, they were all implemented
via a dedicated transport driver.


>
> Where does the vhost-user char dev get invovled in with the v2 series?
> Is that included?


We're working on the a new version, but for the bus/driver part it
should be the same as version 1.

Thanks


>
>>> Every class of virtio traffic is going to need a special HW driver to
>>> enable VDPA, that special driver can create the correct vhost side
>>> class device.
>> Are you saying, e.g it's the charge of IFCVF driver to create vhost char dev
>> and other stuffs?
> No.
>
> Jason
>

2020-02-18 13:56:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Mon, Feb 17, 2020 at 02:08:03PM +0800, Jason Wang wrote:

> I thought you were copied in the patch [1], maybe we can move vhost related
> discussion there to avoid confusion.
>
> [1] https://lwn.net/Articles/811210/

Wow, that is .. confusing.

So this is supposed to duplicate the uAPI of vhost-user? But it is
open coded and duplicated because .. vdpa?

> So it's cheaper and simpler to introduce a new bus instead of refactoring a
> well known bus and API where brunches of drivers and devices had been
> implemented for years.

If you reason for this approach is to ease the implementation then you
should talk about it in the cover letters/etc

Maybe it is reasonable to do this because the rework is too great, I
don't know, but to me this whole thing looks rather messy.

Remember this stuff is all uAPI as it shows up in sysfs, so you can
easilly get stuck with it forever.

Jason

2020-02-19 03:00:45

by Tiwei Bie

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Tue, Feb 18, 2020 at 01:56:12PM +0000, Jason Gunthorpe wrote:
> On Mon, Feb 17, 2020 at 02:08:03PM +0800, Jason Wang wrote:
>
> > I thought you were copied in the patch [1], maybe we can move vhost related
> > discussion there to avoid confusion.
> >
> > [1] https://lwn.net/Articles/811210/
>
> Wow, that is .. confusing.
>
> So this is supposed to duplicate the uAPI of vhost-user? But it is
> open coded and duplicated because .. vdpa?

Do you mean the vhost-user in DPDK? There is no vhost-user
in Linux kernel.

Thanks,
Tiwei

>
> > So it's cheaper and simpler to introduce a new bus instead of refactoring a
> > well known bus and API where brunches of drivers and devices had been
> > implemented for years.
>
> If you reason for this approach is to ease the implementation then you
> should talk about it in the cover letters/etc
>
> Maybe it is reasonable to do this because the rework is too great, I
> don't know, but to me this whole thing looks rather messy.
>
> Remember this stuff is all uAPI as it shows up in sysfs, so you can
> easilly get stuck with it forever.
>
> Jason

2020-02-19 05:38:26

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus


On 2020/2/18 下午9:56, Jason Gunthorpe wrote:
> On Mon, Feb 17, 2020 at 02:08:03PM +0800, Jason Wang wrote:
>
>> I thought you were copied in the patch [1], maybe we can move vhost related
>> discussion there to avoid confusion.
>>
>> [1] https://lwn.net/Articles/811210/
> Wow, that is .. confusing.
>
> So this is supposed to duplicate the uAPI of vhost-user?


It tries to reuse the uAPI of vhost with some extension.


> But it is
> open coded and duplicated because .. vdpa?


I'm not sure I get here, vhost module is reused for vhost-vdpa and all
current vhost device (e.g net) uses their own char device.


>
>> So it's cheaper and simpler to introduce a new bus instead of refactoring a
>> well known bus and API where brunches of drivers and devices had been
>> implemented for years.
> If you reason for this approach is to ease the implementation then you
> should talk about it in the cover letters/etc


I will add more rationale in both cover letter and this patch.

Thanks


>
> Maybe it is reasonable to do this because the rework is too great, I
> don't know, but to me this whole thing looks rather messy.
>
> Remember this stuff is all uAPI as it shows up in sysfs, so you can
> easilly get stuck with it forever.
>
> Jason
>

2020-02-19 12:54:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] vDPA: introduce vDPA bus

On Wed, Feb 19, 2020 at 01:35:25PM +0800, Jason Wang wrote:
> > But it is
> > open coded and duplicated because .. vdpa?
>
>
> I'm not sure I get here, vhost module is reused for vhost-vdpa and all
> current vhost device (e.g net) uses their own char device.

I mean there shouldn't be two fops implementing the same uAPI

Jason