2019-01-23 09:58:06

by Jason Wang

Subject: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()

This series tries to access virtqueue metadata through a kernel virtual
address instead of the copy_user() friends, since the latter have too
much overhead from checks, speculation barriers, or even hardware
feature toggling (e.g. SMAP).

Tests show about a 24% improvement in TX PPS. It should benefit other
cases as well.
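
To give a flavour of the change, the accessors introduced in patch 5
keep an optional kernel mapping of each ring and fall back to the
uaccess path when it is absent. A simplified sketch (ignoring the
IOTLB case; see patch 5 for the real code):

struct vhost_vmap {
	void *addr;		/* kernel virtual address of the ring, or NULL */
	void *unmap_addr;	/* page-aligned address passed to vunmap() */
};

static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
				      __virtio16 *idx)
{
	struct vring_avail *avail = vq->avail_ring.addr;

	if (likely(avail)) {
		/* vmap established by the last metadata prefetch */
		*idx = avail->idx;
		return 0;
	}
	/* not mapped (or invalidated): fall back to copy_user() friends */
	return vhost_get_avail(vq, *idx, &vq->avail->idx);
}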

Changes from V3:
- don't try to use vmap for file-backed pages
- rebase to master
Changes from V2:
- fix the buggy range overlap check
- tear down the MMU notifier during vhost ioctl to make sure the
invalidation callback can read the metadata userspace addresses and
the vq size without holding the vq mutex.
Changes from V1:
- instead of pinning pages, use MMU notifiers to invalidate vmaps and
remap during metadata prefetch
- fix build warning on MIPS

Jason Wang (5):
vhost: generalize adding used elem
vhost: fine grain userspace memory accessors
vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
vhost: introduce helpers to get the size of metadata area
vhost: access vq metadata through kernel virtual address

drivers/vhost/net.c | 4 +-
drivers/vhost/vhost.c | 441 +++++++++++++++++++++++++++++++++++++-----
drivers/vhost/vhost.h | 15 +-
mm/shmem.c | 1 +
4 files changed, 410 insertions(+), 51 deletions(-)

--
2.17.1



2019-01-23 09:57:37

by Jason Wang

Subject: [PATCH net-next V4 1/5] vhost: generalize adding used elem

Use one generic vhost_copy_to_user() instead of two dedicated
accessors. This will simplify the conversion to fine-grain
accessors. An improvement of about 2% in PPS was seen during a
virtio-user txonly test.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/vhost.c | 11 +----------
1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 15a216cdd507..14fad2577df3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2250,16 +2250,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,

start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
- if (count == 1) {
- if (vhost_put_user(vq, heads[0].id, &used->id)) {
- vq_err(vq, "Failed to write used id");
- return -EFAULT;
- }
- if (vhost_put_user(vq, heads[0].len, &used->len)) {
- vq_err(vq, "Failed to write used len");
- return -EFAULT;
- }
- } else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+ if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
vq_err(vq, "Failed to write used");
return -EFAULT;
}
--
2.17.1


2019-01-23 09:57:45

by Jason Wang

Subject: [PATCH net-next V4 3/5] vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()

Rename the function to be more accurate, since it actually tries to
prefetch vq metadata addresses in the IOTLB. It will also be used by
the following patch to prefetch metadata virtual addresses.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/net.c | 4 ++--
drivers/vhost/vhost.c | 4 ++--
drivers/vhost/vhost.h | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index bca86bf7189f..9c83c1837464 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -971,7 +971,7 @@ static void handle_tx(struct vhost_net *net)
if (!sock)
goto out;

- if (!vq_iotlb_prefetch(vq))
+ if (!vq_meta_prefetch(vq))
goto out;

vhost_disable_notify(&net->dev, vq);
@@ -1140,7 +1140,7 @@ static void handle_rx(struct vhost_net *net)
if (!sock)
goto out;

- if (!vq_iotlb_prefetch(vq))
+ if (!vq_meta_prefetch(vq))
goto out;

vhost_disable_notify(&net->dev, vq);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 96dd87531ba0..24c74c60c093 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1308,7 +1308,7 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
return true;
}

-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_meta_prefetch(struct vhost_virtqueue *vq)
{
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
unsigned int num = vq->num;
@@ -1327,7 +1327,7 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
num * sizeof(*vq->used->ring) + s,
VHOST_ADDR_USED);
}
-EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
+EXPORT_SYMBOL_GPL(vq_meta_prefetch);

/* Can we log writes? */
/* Caller should have device mutex but not vq mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1b675dad5e05..4e21011b6628 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -207,7 +207,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len,
struct iovec *iov, int count);
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
+int vq_meta_prefetch(struct vhost_virtqueue *vq);

struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
void vhost_enqueue_msg(struct vhost_dev *dev,
--
2.17.1


2019-01-23 09:58:04

by Jason Wang

Subject: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

It was noticed that the copy_user() friends used to access virtqueue
metadata tend to be very expensive for a dataplane implementation like
vhost, since they involve lots of software checks, speculation
barriers, and hardware feature toggling (e.g. SMAP). The extra cost is
more obvious when transferring small packets, since the time spent on
metadata access becomes more significant.

This patch tries to eliminate those overheads by accessing the
metadata through a kernel virtual address obtained with vmap(). To
keep the pages migratable, instead of pinning them through GUP, we use
MMU notifiers to invalidate the vmaps and re-establish them during
each round of metadata prefetching if necessary. For devices that
don't use metadata prefetching, the memory accessors gracefully fall
back to the normal copy_user() implementation. The invalidation is
synchronized with the datapath through the vq mutex, and in order to
avoid holding the vq mutex during range checking, the MMU notifier is
torn down when trying to modify vq metadata.
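
As a rough illustration of that interlock (a simplified sketch with
made-up sketch_*() names, not the code in the diff below):

/* Notifier side: runs when the relevant userspace pages may change. */
static void sketch_invalidate(struct vhost_virtqueue *vq,
			      struct vhost_vmap *map)
{
	mutex_lock(&vq->mutex);		/* excludes handle_tx()/handle_rx() */
	if (map->addr)
		vunmap(map->unmap_addr); /* drop the stale kernel mapping */
	map->addr = NULL;		/* datapath falls back to copy_user() */
	mutex_unlock(&vq->mutex);
}

/* Datapath side: handle_tx()/handle_rx() call the prefetch helper with
 * vq->mutex held, so a torn-down mapping is simply re-established. */
static void sketch_prefetch(struct vhost_virtqueue *vq)
{
	if (unlikely(!vq->avail_ring.addr))
		vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
}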

Another issue is that the kernel lacks an efficient solution for
tracking pages dirtied through vmap(), which leads to problems if
vhost is using file-backed memory that needs writeback. This patch
solves that by simply skipping VMAs that are file backed and falling
back to the normal copy_user() friends. This might introduce some
overhead for file-backed users, but considering this use case is rare,
we can optimize it on top.

Note that this is only done when the device IOTLB is not enabled. We
could use a similar method to optimize that case in the future.

Tests show at most about a 22% improvement in TX PPS when using
virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell:

        SMAP on | SMAP off
Before: 5.0Mpps | 6.6Mpps
After:  6.1Mpps | 7.4Mpps

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/vhost.c | 288 +++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.h | 13 ++
mm/shmem.c | 1 +
3 files changed, 300 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 37e2cac8e8b0..096ae3298d62 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
vq->indirect = NULL;
vq->heads = NULL;
vq->dev = dev;
+ memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
+ memset(&vq->used_ring, 0, sizeof(vq->used_ring));
+ memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
mutex_init(&vq->mutex);
vhost_vq_reset(dev, vq);
if (vq->handle_kick)
@@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
return sizeof(*vq->desc) * num;
}

+static void vhost_uninit_vmap(struct vhost_vmap *map)
+{
+ if (map->addr)
+ vunmap(map->unmap_addr);
+
+ map->addr = NULL;
+ map->unmap_addr = NULL;
+}
+
+static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
+ struct vhost_vmap *map,
+ unsigned long ustart,
+ size_t size,
+ unsigned long start,
+ unsigned long end,
+ bool blockable)
+{
+ if (end < ustart || start > ustart - 1 + size)
+ return 0;
+
+ if (!blockable)
+ return -EAGAIN;
+
+ mutex_lock(&vq->mutex);
+ vhost_uninit_vmap(map);
+ mutex_unlock(&vq->mutex);
+
+ return 0;
+}
+
+static int vhost_invalidate_range_start(struct mmu_notifier *mn,
+ const struct mmu_notifier_range *range)
+{
+ struct vhost_dev *dev = container_of(mn, struct vhost_dev,
+ mmu_notifier);
+ int i;
+
+ for (i = 0; i < dev->nvqs; i++) {
+ struct vhost_virtqueue *vq = dev->vqs[i];
+
+ if (vhost_invalidate_vmap(vq, &vq->avail_ring,
+ (unsigned long)vq->avail,
+ vhost_get_avail_size(vq, vq->num),
+ range->start, range->end,
+ range->blockable))
+ return -EAGAIN;
+ if (vhost_invalidate_vmap(vq, &vq->desc_ring,
+ (unsigned long)vq->desc,
+ vhost_get_desc_size(vq, vq->num),
+ range->start, range->end,
+ range->blockable))
+ return -EAGAIN;
+ if (vhost_invalidate_vmap(vq, &vq->used_ring,
+ (unsigned long)vq->used,
+ vhost_get_used_size(vq, vq->num),
+ range->start, range->end,
+ range->blockable))
+ return -EAGAIN;
+ }
+
+ return 0;
+}
+
+static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
+ .invalidate_range_start = vhost_invalidate_range_start,
+};
+
/* Caller should have device mutex */
long vhost_dev_set_owner(struct vhost_dev *dev)
{
@@ -541,7 +611,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
if (err)
goto err_cgroup;

+ dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
+ err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
+ if (err)
+ goto err_mmu_notifier;
+
return 0;
+err_mmu_notifier:
+ vhost_dev_free_iovecs(dev);
err_cgroup:
kthread_stop(worker);
dev->worker = NULL;
@@ -632,6 +709,97 @@ static void vhost_clear_msg(struct vhost_dev *dev)
spin_unlock(&dev->iotlb_lock);
}

+/* Skip any vma that needs writeback since we cannot track dirty
+ * pages now.
+ */
+static bool vma_can_vmap(struct vm_area_struct *vma)
+{
+ return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+ vma_is_shmem(vma);
+}
+
+static int vhost_init_vmap(struct vhost_dev *dev,
+ struct vhost_vmap *map, unsigned long uaddr,
+ size_t size, int write)
+{
+ struct mm_struct *mm = dev->mm;
+ struct vm_area_struct *vma;
+ struct page **pages;
+ int npages = DIV_ROUND_UP(size, PAGE_SIZE);
+ int npinned;
+ void *vaddr;
+ int err = 0;
+
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, uaddr);
+ if (!vma || !vma_can_vmap(vma) ||
+ vma->vm_end < uaddr - 1 + size) {
+ err = -EINVAL;
+ goto err_vma;
+ }
+
+ pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages) {
+ err = -ENOMEM;
+ goto err_alloc;
+ }
+
+ npinned = get_user_pages_fast(uaddr, npages, write, pages);
+ if (npinned != npages) {
+ err = -EFAULT;
+ goto err_gup;
+ }
+
+ vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+ if (!vaddr) {
+ err = -EFAULT;
+ goto err_gup;
+ }
+
+ map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
+ map->unmap_addr = vaddr;
+
+err_gup:
+ /* Don't pin pages, mmu notifier will notify us about page
+ * migration.
+ */
+ if (npinned > 0)
+ release_pages(pages, npinned);
+err_alloc:
+ kfree(pages);
+err_vma:
+ up_read(&mm->mmap_sem);
+ return err;
+}
+
+static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
+{
+ vhost_uninit_vmap(&vq->avail_ring);
+ vhost_uninit_vmap(&vq->desc_ring);
+ vhost_uninit_vmap(&vq->used_ring);
+}
+
+static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
+ unsigned long avail)
+{
+ return vhost_init_vmap(vq->dev, &vq->avail_ring, avail,
+ vhost_get_avail_size(vq, vq->num), false);
+}
+
+static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
+ unsigned long desc)
+{
+ return vhost_init_vmap(vq->dev, &vq->desc_ring, desc,
+ vhost_get_desc_size(vq, vq->num), false);
+}
+
+static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
+ unsigned long used)
+{
+ return vhost_init_vmap(vq->dev, &vq->used_ring, used,
+ vhost_get_used_size(vq, vq->num), true);
+}
+
void vhost_dev_cleanup(struct vhost_dev *dev)
{
int i;
@@ -661,8 +829,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
kthread_stop(dev->worker);
dev->worker = NULL;
}
- if (dev->mm)
+ if (dev->mm) {
+ mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
mmput(dev->mm);
+ }
+ for (i = 0; i < dev->nvqs; i++)
+ vhost_clean_vmaps(dev->vqs[i]);
dev->mm = NULL;
}
EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
@@ -891,6 +1063,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,

static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ if (likely(used)) {
+ *((__virtio16 *)&used->ring[vq->num]) =
+ cpu_to_vhost16(vq, vq->avail_idx);
+ return 0;
+ }
+ }
+
return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
vhost_avail_event(vq));
}
@@ -899,6 +1081,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
struct vring_used_elem *head, int idx,
int count)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ if (likely(used)) {
+ memcpy(used->ring + idx, head,
+ count * sizeof(*head));
+ return 0;
+ }
+ }
+
return vhost_copy_to_user(vq, vq->used->ring + idx, head,
count * sizeof(*head));
}
@@ -906,6 +1098,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)

{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ if (likely(used)) {
+ used->flags = cpu_to_vhost16(vq, vq->used_flags);
+ return 0;
+ }
+ }
+
return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
&vq->used->flags);
}
@@ -913,6 +1114,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)

{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ if (likely(used)) {
+ used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
+ return 0;
+ }
+ }
+
return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
&vq->used->idx);
}
@@ -958,12 +1168,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
__virtio16 *idx)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ if (likely(avail)) {
+ *idx = avail->idx;
+ return 0;
+ }
+ }
+
return vhost_get_avail(vq, *idx, &vq->avail->idx);
}

static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
__virtio16 *head, int idx)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ if (likely(avail)) {
+ *head = avail->ring[idx & (vq->num - 1)];
+ return 0;
+ }
+ }
+
return vhost_get_avail(vq, *head,
&vq->avail->ring[idx & (vq->num - 1)]);
}
@@ -971,24 +1199,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
__virtio16 *flags)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ if (likely(avail)) {
+ *flags = avail->flags;
+ return 0;
+ }
+ }
+
return vhost_get_avail(vq, *flags, &vq->avail->flags);
}

static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
__virtio16 *event)
{
+ if (!vq->iotlb) {
+ struct vring_avail *avail = vq->avail_ring.addr;
+
+ if (likely(avail)) {
+ *event = (__virtio16)avail->ring[vq->num];
+ return 0;
+ }
+ }
+
return vhost_get_avail(vq, *event, vhost_used_event(vq));
}

static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
__virtio16 *idx)
{
+ if (!vq->iotlb) {
+ struct vring_used *used = vq->used_ring.addr;
+
+ if (likely(used)) {
+ *idx = used->idx;
+ return 0;
+ }
+ }
+
return vhost_get_used(vq, *idx, &vq->used->idx);
}

static inline int vhost_get_desc(struct vhost_virtqueue *vq,
struct vring_desc *desc, int idx)
{
+ if (!vq->iotlb) {
+ struct vring_desc *d = vq->desc_ring.addr;
+
+ if (likely(d)) {
+ *desc = *(d + idx);
+ return 0;
+ }
+ }
+
return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
}

@@ -1329,8 +1593,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
{
unsigned int num = vq->num;

- if (!vq->iotlb)
+ if (!vq->iotlb) {
+ if (unlikely(!vq->avail_ring.addr))
+ vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
+ if (unlikely(!vq->desc_ring.addr))
+ vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
+ if (unlikely(!vq->used_ring.addr))
+ vhost_setup_used_vmap(vq, (unsigned long)vq->used);
+
return 1;
+ }

return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
@@ -1482,6 +1754,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg

mutex_lock(&vq->mutex);

+ /* Unregister the MMU notifier so that the invalidation callback
+ * can access vq->avail, vq->desc, vq->used and vq->num
+ * without holding vq->mutex.
+ */
+ if (d->mm)
+ mmu_notifier_unregister(&d->mmu_notifier, d->mm);
+
switch (ioctl) {
case VHOST_SET_VRING_NUM:
/* Resizing ring with an active backend?
@@ -1498,6 +1777,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
r = -EINVAL;
break;
}
+ vhost_clean_vmaps(vq);
vq->num = s.num;
break;
case VHOST_SET_VRING_BASE:
@@ -1575,6 +1855,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
}
}

+ vhost_clean_vmaps(vq);
+
vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
@@ -1655,6 +1937,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
if (pollstart && vq->handle_kick)
r = vhost_poll_start(&vq->poll, vq->kick);

+ if (d->mm)
+ mmu_notifier_register(&d->mmu_notifier, d->mm);
mutex_unlock(&vq->mutex);

if (pollstop && vq->handle_kick)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4e21011b6628..c04bc327db9f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -12,6 +12,8 @@
#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>
#include <linux/atomic.h>
+#include <linux/pagemap.h>
+#include <linux/mmu_notifier.h>

struct vhost_work;
typedef void (*vhost_work_fn_t)(struct vhost_work *work);
@@ -80,6 +82,11 @@ enum vhost_uaddr_type {
VHOST_NUM_ADDRS = 3,
};

+struct vhost_vmap {
+ void *addr;
+ void *unmap_addr;
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -90,6 +97,11 @@ struct vhost_virtqueue {
struct vring_desc __user *desc;
struct vring_avail __user *avail;
struct vring_used __user *used;
+
+ struct vhost_vmap avail_ring;
+ struct vhost_vmap desc_ring;
+ struct vhost_vmap used_ring;
+
const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
struct file *kick;
struct eventfd_ctx *call_ctx;
@@ -158,6 +170,7 @@ struct vhost_msg_node {

struct vhost_dev {
struct mm_struct *mm;
+ struct mmu_notifier mmu_notifier;
struct mutex mutex;
struct vhost_virtqueue **vqs;
int nvqs;
diff --git a/mm/shmem.c b/mm/shmem.c
index 6ece1e2fe76e..745e7c7f7a6c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -237,6 +237,7 @@ bool vma_is_shmem(struct vm_area_struct *vma)
{
return vma->vm_ops == &shmem_vm_ops;
}
+EXPORT_SYMBOL_GPL(vma_is_shmem);

static LIST_HEAD(shmem_swaplist);
static DEFINE_MUTEX(shmem_swaplist_mutex);
--
2.17.1


2019-01-23 09:59:21

by Jason Wang

Subject: [PATCH net-next V4 2/5] vhost: fine grain userspace memory accessors

This is used to hide the metadata addresses from the virtqueue
helpers. This will allow implementing vmap-based fast access to the
metadata.
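
For example, after this change a caller such as vhost_get_vq_desc()
only sees the wrapper:

	if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
		vq_err(vq, "Failed to access avail idx at %p\n",
		       &vq->avail->idx);
		return -EFAULT;
	}

so a later patch can switch what the wrapper does underneath (uaccess
vs. a vmap()ed kernel address) without touching any caller.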

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/vhost.c | 94 +++++++++++++++++++++++++++++++++++--------
1 file changed, 77 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 14fad2577df3..96dd87531ba0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -868,6 +868,34 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
ret; \
})

+static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
+{
+ return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
+ vhost_avail_event(vq));
+}
+
+static inline int vhost_put_used(struct vhost_virtqueue *vq,
+ struct vring_used_elem *head, int idx,
+ int count)
+{
+ return vhost_copy_to_user(vq, vq->used->ring + idx, head,
+ count * sizeof(*head));
+}
+
+static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
+
+{
+ return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
+ &vq->used->flags);
+}
+
+static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
+
+{
+ return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
+ &vq->used->idx);
+}
+
#define vhost_get_user(vq, x, ptr, type) \
({ \
int ret; \
@@ -906,6 +934,43 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
}

+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
+ __virtio16 *idx)
+{
+ return vhost_get_avail(vq, *idx, &vq->avail->idx);
+}
+
+static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
+ __virtio16 *head, int idx)
+{
+ return vhost_get_avail(vq, *head,
+ &vq->avail->ring[idx & (vq->num - 1)]);
+}
+
+static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
+ __virtio16 *flags)
+{
+ return vhost_get_avail(vq, *flags, &vq->avail->flags);
+}
+
+static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
+ __virtio16 *event)
+{
+ return vhost_get_avail(vq, *event, vhost_used_event(vq));
+}
+
+static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
+ __virtio16 *idx)
+{
+ return vhost_get_used(vq, *idx, &vq->used->idx);
+}
+
+static inline int vhost_get_desc(struct vhost_virtqueue *vq,
+ struct vring_desc *desc, int idx)
+{
+ return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
+}
+
static int vhost_new_umem_range(struct vhost_umem *umem,
u64 start, u64 size, u64 end,
u64 userspace_addr, int perm)
@@ -1839,8 +1904,7 @@ EXPORT_SYMBOL_GPL(vhost_log_write);
static int vhost_update_used_flags(struct vhost_virtqueue *vq)
{
void __user *used;
- if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
- &vq->used->flags) < 0)
+ if (vhost_put_used_flags(vq))
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1857,8 +1921,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)

static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
{
- if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
- vhost_avail_event(vq)))
+ if (vhost_put_avail_event(vq))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -1894,7 +1957,7 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
r = -EFAULT;
goto err;
}
- r = vhost_get_used(vq, last_used_idx, &vq->used->idx);
+ r = vhost_get_used_idx(vq, &last_used_idx);
if (r) {
vq_err(vq, "Can't access used idx at %p\n",
&vq->used->idx);
@@ -2093,7 +2156,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
last_avail_idx = vq->last_avail_idx;

if (vq->avail_idx == vq->last_avail_idx) {
- if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
+ if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
return -EFAULT;
@@ -2120,8 +2183,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,

/* Grab the next descriptor number they're advertising, and increment
* the index we've seen. */
- if (unlikely(vhost_get_avail(vq, ring_head,
- &vq->avail->ring[last_avail_idx & (vq->num - 1)]))) {
+ if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
last_avail_idx,
&vq->avail->ring[last_avail_idx % vq->num]);
@@ -2156,8 +2218,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
i, vq->num, head);
return -EINVAL;
}
- ret = vhost_copy_from_user(vq, &desc, vq->desc + i,
- sizeof desc);
+ ret = vhost_get_desc(vq, &desc, i);
if (unlikely(ret)) {
vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
i, vq->desc + i);
@@ -2250,7 +2311,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,

start = vq->last_used_idx & (vq->num - 1);
used = vq->used->ring + start;
- if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
+ if (vhost_put_used(vq, heads, start, count)) {
vq_err(vq, "Failed to write used");
return -EFAULT;
}
@@ -2292,8 +2353,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,

/* Make sure buffer is written before we update index. */
smp_wmb();
- if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
- &vq->used->idx)) {
+ if (vhost_put_used_idx(vq)) {
vq_err(vq, "Failed to increment used idx");
return -EFAULT;
}
@@ -2326,7 +2386,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)

if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
__virtio16 flags;
- if (vhost_get_avail(vq, flags, &vq->avail->flags)) {
+ if (vhost_get_avail_flags(vq, &flags)) {
vq_err(vq, "Failed to get flags");
return true;
}
@@ -2340,7 +2400,7 @@ static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (unlikely(!v))
return true;

- if (vhost_get_avail(vq, event, vhost_used_event(vq))) {
+ if (vhost_get_used_event(vq, &event)) {
vq_err(vq, "Failed to get used event idx");
return true;
}
@@ -2385,7 +2445,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
if (vq->avail_idx != vq->last_avail_idx)
return false;

- r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+ r = vhost_get_avail_idx(vq, &avail_idx);
if (unlikely(r))
return false;
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
@@ -2421,7 +2481,7 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
/* They could have slipped one in as we were doing that: make
* sure it's written, then check again. */
smp_mb();
- r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
+ r = vhost_get_avail_idx(vq, &avail_idx);
if (r) {
vq_err(vq, "Failed to check avail idx at %p: %d\n",
&vq->avail->idx, r);
--
2.17.1


2019-01-23 10:12:47

by Jason Wang

Subject: [PATCH net-next V4 4/5] vhost: introduce helpers to get the size of metadata area

Signed-off-by: Jason Wang <[email protected]>
---
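These helpers just factor out the ring size computations that are
currently open-coded in vq_access_ok(), vq_meta_prefetch() and
vq_log_access_ok(). As a worked example (num = 256 is only an example
value), with VIRTIO_RING_F_EVENT_IDX negotiated the helpers return:

	vhost_get_desc_size()  = 16 * 256         = 4096 bytes
	vhost_get_avail_size() =  4 + 2 * 256 + 2 =  518 bytes
	vhost_get_used_size()  =  4 + 8 * 256 + 2 = 2054 bytes

i.e. a 4-byte flags/idx header plus 2 bytes per avail entry, 8 bytes
per used element, 16 bytes per descriptor, and a trailing 2-byte event
field when EVENT_IDX is used.
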
drivers/vhost/vhost.c | 46 ++++++++++++++++++++++++++-----------------
1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 24c74c60c093..37e2cac8e8b0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -489,6 +489,27 @@ bool vhost_dev_has_owner(struct vhost_dev *dev)
}
EXPORT_SYMBOL_GPL(vhost_dev_has_owner);

+static size_t vhost_get_avail_size(struct vhost_virtqueue *vq, int num)
+{
+ size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
+ return sizeof(*vq->avail) +
+ sizeof(*vq->avail->ring) * num + event;
+}
+
+static size_t vhost_get_used_size(struct vhost_virtqueue *vq, int num)
+{
+ size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
+
+ return sizeof(*vq->used) +
+ sizeof(*vq->used->ring) * num + event;
+}
+
+static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
+{
+ return sizeof(*vq->desc) * num;
+}
+
/* Caller should have device mutex */
long vhost_dev_set_owner(struct vhost_dev *dev)
{
@@ -1252,13 +1273,9 @@ static bool vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
struct vring_used __user *used)

{
- size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
-
- return access_ok(desc, num * sizeof *desc) &&
- access_ok(avail,
- sizeof *avail + num * sizeof *avail->ring + s) &&
- access_ok(used,
- sizeof *used + num * sizeof *used->ring + s);
+ return access_ok(desc, vhost_get_desc_size(vq, num)) &&
+ access_ok(avail, vhost_get_avail_size(vq, num)) &&
+ access_ok(used, vhost_get_used_size(vq, num));
}

static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
@@ -1310,22 +1327,18 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,

int vq_meta_prefetch(struct vhost_virtqueue *vq)
{
- size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
unsigned int num = vq->num;

if (!vq->iotlb)
return 1;

return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
- num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+ vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
- sizeof *vq->avail +
- num * sizeof(*vq->avail->ring) + s,
+ vhost_get_avail_size(vq, num),
VHOST_ADDR_AVAIL) &&
iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->used,
- sizeof *vq->used +
- num * sizeof(*vq->used->ring) + s,
- VHOST_ADDR_USED);
+ vhost_get_used_size(vq, num), VHOST_ADDR_USED);
}
EXPORT_SYMBOL_GPL(vq_meta_prefetch);

@@ -1342,13 +1355,10 @@ EXPORT_SYMBOL_GPL(vhost_log_access_ok);
static bool vq_log_access_ok(struct vhost_virtqueue *vq,
void __user *log_base)
{
- size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
-
return vq_memory_access_ok(log_base, vq->umem,
vhost_has_feature(vq, VHOST_F_LOG_ALL)) &&
(!vq->log_used || log_access_ok(log_base, vq->log_addr,
- sizeof *vq->used +
- vq->num * sizeof *vq->used->ring + s));
+ vhost_get_used_size(vq, vq->num)));
}

/* Can we start vq? */
--
2.17.1


2019-01-23 13:59:40

by Michael S. Tsirkin

Subject: Re: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()

On Wed, Jan 23, 2019 at 05:55:52PM +0800, Jason Wang wrote:
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
>
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.

ok I think this addresses most comments but it's a big change and we
just started 1.1 review so pls give me a week to review this ok?

> Changes from V3:
> - don't try to use vmap for file backed pages
> - rebase to master
> Changes from V2:
> - fix buggy range overlapping check
> - tear down MMU notifier during vhost ioctl to make sure invalidation
> request can read metadata userspace address and vq size without
> holding vq mutex.
> Changes from V1:
> - instead of pinning pages, use MMU notifier to invalidate vmaps and
> remap duing metadata prefetch
> - fix build warning on MIPS
>
> Jason Wang (5):
> vhost: generalize adding used elem
> vhost: fine grain userspace memory accessors
> vhost: rename vq_iotlb_prefetch() to vq_meta_prefetch()
> vhost: introduce helpers to get the size of metadata area
> vhost: access vq metadata through kernel virtual address
>
> drivers/vhost/net.c | 4 +-
> drivers/vhost/vhost.c | 441 +++++++++++++++++++++++++++++++++++++-----
> drivers/vhost/vhost.h | 15 +-
> mm/shmem.c | 1 +
> 4 files changed, 410 insertions(+), 51 deletions(-)
>
> --
> 2.17.1

2019-01-23 14:10:52

by Michael S. Tsirkin

Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
> It was noticed that the copy_user() friends that was used to access
> virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software checks,
> speculation barrier, hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata accessing become more significant.
>
> This patch tries to eliminate those overheads by accessing them
> through kernel virtual address by vmap(). To make the pages can be
> migrated, instead of pinning them through GUP, we use MMU notifiers to
> invalidate vmaps and re-establish vmaps during each round of metadata
> prefetching if necessary. For devices that doesn't use metadata
> prefetching, the memory accessors fallback to normal copy_user()
> implementation gracefully. The invalidation was synchronized with
> datapath through vq mutex, and in order to avoid hold vq mutex during
> range checking, MMU notifier was teared down when trying to modify vq
> metadata.
>
> Another thing is kernel lacks efficient solution for tracking dirty
> pages by vmap(), this will lead issues if vhost is using file backed
> memory which needs care of writeback. This patch solves this issue by
> just skipping the vma that is file backed and fallback to normal
> copy_user() friends. This might introduce some overheads for file
> backed users but consider this use case is rare we could do
> optimizations on top.
>
> Note that this was only done when device IOTLB is not enabled. We
> could use similar method to optimize it in the future.
>
> Tests shows at most about 22% improvement on TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
>
> SMAP on | SMAP off
> Before: 5.0Mpps | 6.6Mpps
> After: 6.1Mpps | 7.4Mpps
>
> Signed-off-by: Jason Wang <[email protected]>


So this is the bulk of the change.
Three things that I need to look into
- Are there any security issues with bypassing the speculation barrier
that is normally present after access_ok?
- How hard does the special handling for
file backed storage make testing?
On the one hand we could add a module parameter to
force copy to/from user. On the other, that's
another configuration we need to support.
But iotlb is not using vmap, so maybe that's enough
for testing.
- How hard is it to figure out which mode uses which code.



Meanwhile, could you pls post data comparing this last patch with the
below? This removes the speculation barrier replacing it with a
(useless but at least more lightweight) data dependency.

Thanks!


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bac939af8dbb..352ee7e14476 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -739,7 +739,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
int ret;

if (!vq->iotlb)
- return __copy_to_user(to, from, size);
+ return copy_to_user(to, from, size);
else {
/* This function should be called after iotlb
* prefetch, which means we're sure that all vq
@@ -752,7 +752,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
VHOST_ADDR_USED);

if (uaddr)
- return __copy_to_user(uaddr, from, size);
+ return copy_to_user(uaddr, from, size);

ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
ARRAY_SIZE(vq->iotlb_iov),
@@ -774,7 +774,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
int ret;

if (!vq->iotlb)
- return __copy_from_user(to, from, size);
+ return copy_from_user(to, from, size);
else {
/* This function should be called after iotlb
* prefetch, which means we're sure that vq
@@ -787,7 +787,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
struct iov_iter f;

if (uaddr)
- return __copy_from_user(to, uaddr, size);
+ return copy_from_user(to, uaddr, size);

ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
ARRAY_SIZE(vq->iotlb_iov),
@@ -855,13 +855,13 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
({ \
int ret = -EFAULT; \
if (!vq->iotlb) { \
- ret = __put_user(x, ptr); \
+ ret = put_user(x, ptr); \
} else { \
__typeof__(ptr) to = \
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
sizeof(*ptr), VHOST_ADDR_USED); \
if (to != NULL) \
- ret = __put_user(x, to); \
+ ret = put_user(x, to); \
else \
ret = -EFAULT; \
} \
@@ -872,14 +872,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
({ \
int ret; \
if (!vq->iotlb) { \
- ret = __get_user(x, ptr); \
+ ret = get_user(x, ptr); \
} else { \
__typeof__(ptr) from = \
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
sizeof(*ptr), \
type); \
if (from != NULL) \
- ret = __get_user(x, from); \
+ ret = get_user(x, from); \
else \
ret = -EFAULT; \
} \

2019-01-23 17:26:00

by David Miller

Subject: Re: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()

From: "Michael S. Tsirkin" <[email protected]>
Date: Wed, 23 Jan 2019 08:58:07 -0500

> On Wed, Jan 23, 2019 at 05:55:52PM +0800, Jason Wang wrote:
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
>>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>
> ok I think this addresses most comments but it's a big change and we
> just started 1.1 review so to pls give me a week to review this ok?

Ok. :)

2019-01-24 04:08:23

by Jason Wang

Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address


On 2019/1/23 10:08 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
>> It was noticed that the copy_user() friends that was used to access
>> virtqueue metdata tends to be very expensive for dataplane
>> implementation like vhost since it involves lots of software checks,
>> speculation barrier, hardware feature toggling (e.g SMAP). The
>> extra cost will be more obvious when transferring small packets since
>> the time spent on metadata accessing become more significant.
>>
>> This patch tries to eliminate those overheads by accessing them
>> through kernel virtual address by vmap(). To make the pages can be
>> migrated, instead of pinning them through GUP, we use MMU notifiers to
>> invalidate vmaps and re-establish vmaps during each round of metadata
>> prefetching if necessary. For devices that doesn't use metadata
>> prefetching, the memory accessors fallback to normal copy_user()
>> implementation gracefully. The invalidation was synchronized with
>> datapath through vq mutex, and in order to avoid hold vq mutex during
>> range checking, MMU notifier was teared down when trying to modify vq
>> metadata.
>>
>> Another thing is kernel lacks efficient solution for tracking dirty
>> pages by vmap(), this will lead issues if vhost is using file backed
>> memory which needs care of writeback. This patch solves this issue by
>> just skipping the vma that is file backed and fallback to normal
>> copy_user() friends. This might introduce some overheads for file
>> backed users but consider this use case is rare we could do
>> optimizations on top.
>>
>> Note that this was only done when device IOTLB is not enabled. We
>> could use similar method to optimize it in the future.
>>
>> Tests shows at most about 22% improvement on TX PPS when using
>> virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
>>
>> SMAP on | SMAP off
>> Before: 5.0Mpps | 6.6Mpps
>> After: 6.1Mpps | 7.4Mpps
>>
>> Signed-off-by: Jason Wang <[email protected]>
>
> So this is the bulk of the change.
> Threee things that I need to look into
> - Are there any security issues with bypassing the speculation barrier
> that is normally present after access_ok?


If we can make sure the bypassing is only used in a kthread (vhost), it
should be fine I think.


> - How hard does the special handling for
> file backed storage make testing?


It's as simple as un-commenting vhost_can_vmap()? Or I can try to hack
qemu or dpdk to test this.


> On the one hand we could add a module parameter to
> force copy to/from user. on the other that's
> another configuration we need to support.


That sounds sub-optimal since it leaves the choice to users.


> But iotlb is not using vmap, so maybe that's enough
> for testing.
> - How hard is it to figure out which mode uses which code.
>
>
>
> Meanwhile, could you pls post data comparing this last patch with the
> below? This removes the speculation barrier replacing it with a
> (useless but at least more lightweight) data dependency.


SMAP off

Your patch: 7.2MPPs

vmap: 7.4Mpps

I didn't test with SMAP on, since it will be much slower for sure.

Thanks


>
> Thanks!
>
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bac939af8dbb..352ee7e14476 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -739,7 +739,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> int ret;
>
> if (!vq->iotlb)
> - return __copy_to_user(to, from, size);
> + return copy_to_user(to, from, size);
> else {
> /* This function should be called after iotlb
> * prefetch, which means we're sure that all vq
> @@ -752,7 +752,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> VHOST_ADDR_USED);
>
> if (uaddr)
> - return __copy_to_user(uaddr, from, size);
> + return copy_to_user(uaddr, from, size);
>
> ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
> ARRAY_SIZE(vq->iotlb_iov),
> @@ -774,7 +774,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
> int ret;
>
> if (!vq->iotlb)
> - return __copy_from_user(to, from, size);
> + return copy_from_user(to, from, size);
> else {
> /* This function should be called after iotlb
> * prefetch, which means we're sure that vq
> @@ -787,7 +787,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
> struct iov_iter f;
>
> if (uaddr)
> - return __copy_from_user(to, uaddr, size);
> + return copy_from_user(to, uaddr, size);
>
> ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
> ARRAY_SIZE(vq->iotlb_iov),
> @@ -855,13 +855,13 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> ({ \
> int ret = -EFAULT; \
> if (!vq->iotlb) { \
> - ret = __put_user(x, ptr); \
> + ret = put_user(x, ptr); \
> } else { \
> __typeof__(ptr) to = \
> (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> sizeof(*ptr), VHOST_ADDR_USED); \
> if (to != NULL) \
> - ret = __put_user(x, to); \
> + ret = put_user(x, to); \
> else \
> ret = -EFAULT; \
> } \
> @@ -872,14 +872,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> ({ \
> int ret; \
> if (!vq->iotlb) { \
> - ret = __get_user(x, ptr); \
> + ret = get_user(x, ptr); \
> } else { \
> __typeof__(ptr) from = \
> (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> sizeof(*ptr), \
> type); \
> if (from != NULL) \
> - ret = __get_user(x, from); \
> + ret = get_user(x, from); \
> else \
> ret = -EFAULT; \
> } \

2019-01-24 04:12:00

by Jason Wang

Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address


On 2019/1/24 12:07 PM, Jason Wang wrote:
>
> On 2019/1/23 10:08 PM, Michael S. Tsirkin wrote:
>> On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
>>> It was noticed that the copy_user() friends that was used to access
>>> virtqueue metdata tends to be very expensive for dataplane
>>> implementation like vhost since it involves lots of software checks,
>>> speculation barrier, hardware feature toggling (e.g SMAP). The
>>> extra cost will be more obvious when transferring small packets since
>>> the time spent on metadata accessing become more significant.
>>>
>>> This patch tries to eliminate those overheads by accessing them
>>> through kernel virtual address by vmap(). To make the pages can be
>>> migrated, instead of pinning them through GUP, we use MMU notifiers to
>>> invalidate vmaps and re-establish vmaps during each round of metadata
>>> prefetching if necessary. For devices that doesn't use metadata
>>> prefetching, the memory accessors fallback to normal copy_user()
>>> implementation gracefully. The invalidation was synchronized with
>>> datapath through vq mutex, and in order to avoid hold vq mutex during
>>> range checking, MMU notifier was teared down when trying to modify vq
>>> metadata.
>>>
>>> Another thing is kernel lacks efficient solution for tracking dirty
>>> pages by vmap(), this will lead issues if vhost is using file backed
>>> memory which needs care of writeback. This patch solves this issue by
>>> just skipping the vma that is file backed and fallback to normal
>>> copy_user() friends. This might introduce some overheads for file
>>> backed users but consider this use case is rare we could do
>>> optimizations on top.
>>>
>>> Note that this was only done when device IOTLB is not enabled. We
>>> could use similar method to optimize it in the future.
>>>
>>> Tests shows at most about 22% improvement on TX PPS when using
>>> virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
>>>
>>>          SMAP on | SMAP off
>>> Before: 5.0Mpps | 6.6Mpps
>>> After:  6.1Mpps | 7.4Mpps
>>>
>>> Signed-off-by: Jason Wang <[email protected]>
>>
>> So this is the bulk of the change.
>> Threee things that I need to look into
>> - Are there any security issues with bypassing the speculation barrier
>>    that is normally present after access_ok?
>
>
> If we can make sure the bypassing was only used in a kthread (vhost),
> it should be fine I think.
>
>
>> - How hard does the special handling for
>>    file backed storage make testing?
>
>
> It's as simple as un-commenting vhost_can_vmap()? Or I can try to hack
> qemu or dpdk to test this.
>
>
>>    On the one hand we could add a module parameter to
>>    force copy to/from user. on the other that's
>>    another configuration we need to support.
>
>
> That sounds sub-optimal since it leave the choice to users.
>
>
>>    But iotlb is not using vmap, so maybe that's enough
>>    for testing.
>> - How hard is it to figure out which mode uses which code.


It's as simple as tracing __get_user() usage in the vhost process?

Thanks


>>
>>
>>
>> Meanwhile, could you pls post data comparing this last patch with the
>> below?  This removes the speculation barrier replacing it with a
>> (useless but at least more lightweight) data dependency.
>
>
> SMAP off
>
> Your patch: 7.2MPPs
>
> vmap: 7.4Mpps
>
> I don't test SMAP on, since it will be much slow for sure.
>
> Thanks
>
>
>>
>> Thanks!
>>
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index bac939af8dbb..352ee7e14476 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -739,7 +739,7 @@ static int vhost_copy_to_user(struct
>> vhost_virtqueue *vq, void __user *to,
>>       int ret;
>>         if (!vq->iotlb)
>> -        return __copy_to_user(to, from, size);
>> +        return copy_to_user(to, from, size);
>>       else {
>>           /* This function should be called after iotlb
>>            * prefetch, which means we're sure that all vq
>> @@ -752,7 +752,7 @@ static int vhost_copy_to_user(struct
>> vhost_virtqueue *vq, void __user *to,
>>                        VHOST_ADDR_USED);
>>             if (uaddr)
>> -            return __copy_to_user(uaddr, from, size);
>> +            return copy_to_user(uaddr, from, size);
>>             ret = translate_desc(vq, (u64)(uintptr_t)to, size,
>> vq->iotlb_iov,
>>                        ARRAY_SIZE(vq->iotlb_iov),
>> @@ -774,7 +774,7 @@ static int vhost_copy_from_user(struct
>> vhost_virtqueue *vq, void *to,
>>       int ret;
>>         if (!vq->iotlb)
>> -        return __copy_from_user(to, from, size);
>> +        return copy_from_user(to, from, size);
>>       else {
>>           /* This function should be called after iotlb
>>            * prefetch, which means we're sure that vq
>> @@ -787,7 +787,7 @@ static int vhost_copy_from_user(struct
>> vhost_virtqueue *vq, void *to,
>>           struct iov_iter f;
>>             if (uaddr)
>> -            return __copy_from_user(to, uaddr, size);
>> +            return copy_from_user(to, uaddr, size);
>>             ret = translate_desc(vq, (u64)(uintptr_t)from, size,
>> vq->iotlb_iov,
>>                        ARRAY_SIZE(vq->iotlb_iov),
>> @@ -855,13 +855,13 @@ static inline void __user
>> *__vhost_get_user(struct vhost_virtqueue *vq,
>>   ({ \
>>       int ret = -EFAULT; \
>>       if (!vq->iotlb) { \
>> -        ret = __put_user(x, ptr); \
>> +        ret = put_user(x, ptr); \
>>       } else { \
>>           __typeof__(ptr) to = \
>>               (__typeof__(ptr)) __vhost_get_user(vq, ptr,    \
>>                         sizeof(*ptr), VHOST_ADDR_USED); \
>>           if (to != NULL) \
>> -            ret = __put_user(x, to); \
>> +            ret = put_user(x, to); \
>>           else \
>>               ret = -EFAULT;    \
>>       } \
>> @@ -872,14 +872,14 @@ static inline void __user
>> *__vhost_get_user(struct vhost_virtqueue *vq,
>>   ({ \
>>       int ret; \
>>       if (!vq->iotlb) { \
>> -        ret = __get_user(x, ptr); \
>> +        ret = get_user(x, ptr); \
>>       } else { \
>>           __typeof__(ptr) from = \
>>               (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
>>                                  sizeof(*ptr), \
>>                                  type); \
>>           if (from != NULL) \
>> -            ret = __get_user(x, from); \
>> +            ret = get_user(x, from); \
>>           else \
>>               ret = -EFAULT; \
>>       } \

2019-01-24 04:54:46

by Michael S. Tsirkin

Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Jan 24, 2019 at 12:07:54PM +0800, Jason Wang wrote:
>
> On 2019/1/23 10:08 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
> > > It was noticed that the copy_user() friends that was used to access
> > > virtqueue metdata tends to be very expensive for dataplane
> > > implementation like vhost since it involves lots of software checks,
> > > speculation barrier, hardware feature toggling (e.g SMAP). The
> > > extra cost will be more obvious when transferring small packets since
> > > the time spent on metadata accessing become more significant.
> > >
> > > This patch tries to eliminate those overheads by accessing them
> > > through kernel virtual address by vmap(). To make the pages can be
> > > migrated, instead of pinning them through GUP, we use MMU notifiers to
> > > invalidate vmaps and re-establish vmaps during each round of metadata
> > > prefetching if necessary. For devices that doesn't use metadata
> > > prefetching, the memory accessors fallback to normal copy_user()
> > > implementation gracefully. The invalidation was synchronized with
> > > datapath through vq mutex, and in order to avoid hold vq mutex during
> > > range checking, MMU notifier was teared down when trying to modify vq
> > > metadata.
> > >
> > > Another thing is kernel lacks efficient solution for tracking dirty
> > > pages by vmap(), this will lead issues if vhost is using file backed
> > > memory which needs care of writeback. This patch solves this issue by
> > > just skipping the vma that is file backed and fallback to normal
> > > copy_user() friends. This might introduce some overheads for file
> > > backed users but consider this use case is rare we could do
> > > optimizations on top.
> > >
> > > Note that this was only done when device IOTLB is not enabled. We
> > > could use similar method to optimize it in the future.
> > >
> > > Tests shows at most about 22% improvement on TX PPS when using
> > > virtio-user + vhost_net + xdp1 + TAP on 2.6GHz Broadwell:
> > >
> > > SMAP on | SMAP off
> > > Before: 5.0Mpps | 6.6Mpps
> > > After: 6.1Mpps | 7.4Mpps
> > >
> > > Signed-off-by: Jason Wang <[email protected]>
> >
> > So this is the bulk of the change.
> > Threee things that I need to look into
> > - Are there any security issues with bypassing the speculation barrier
> > that is normally present after access_ok?
>
>
> If we can make sure the bypassing was only used in a kthread (vhost), it
> should be fine I think.
>
>
> > - How hard does the special handling for
> > file backed storage make testing?
>
>
> It's as simple as un-commenting vhost_can_vmap()? Or I can try to hack qemu
> or dpdk to test this.
>
>
> > On the one hand we could add a module parameter to
> > force copy to/from user. on the other that's
> > another configuration we need to support.
>
>
> That sounds sub-optimal since it leave the choice to users.
>
>
> > But iotlb is not using vmap, so maybe that's enough
> > for testing.
> > - How hard is it to figure out which mode uses which code.
> >
> >
> >
> > Meanwhile, could you pls post data comparing this last patch with the
> > below? This removes the speculation barrier replacing it with a
> > (useless but at least more lightweight) data dependency.
>
>
> SMAP off
>
> Your patch: 7.2MPPs
>
> vmap: 7.4Mpps
>

Sounds more or less as expected. Up to 3% gain with vmap - I think
that's a bit higher than what we saw previously when we switched from
get_user to __get_user and that's probably because of all the
array_index_nospec trickery.

> I don't test SMAP on, since it will be much slow for sure.

Right. So bypassing SMAP remains the main reason to do vmap tricks.

> Thanks

>
> >
> > Thanks!
> >
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index bac939af8dbb..352ee7e14476 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -739,7 +739,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> > int ret;
> > if (!vq->iotlb)
> > - return __copy_to_user(to, from, size);
> > + return copy_to_user(to, from, size);
> > else {
> > /* This function should be called after iotlb
> > * prefetch, which means we're sure that all vq
> > @@ -752,7 +752,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> > VHOST_ADDR_USED);
> > if (uaddr)
> > - return __copy_to_user(uaddr, from, size);
> > + return copy_to_user(uaddr, from, size);
> > ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
> > ARRAY_SIZE(vq->iotlb_iov),
> > @@ -774,7 +774,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
> > int ret;
> > if (!vq->iotlb)
> > - return __copy_from_user(to, from, size);
> > + return copy_from_user(to, from, size);
> > else {
> > /* This function should be called after iotlb
> > * prefetch, which means we're sure that vq
> > @@ -787,7 +787,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
> > struct iov_iter f;
> > if (uaddr)
> > - return __copy_from_user(to, uaddr, size);
> > + return copy_from_user(to, uaddr, size);
> > ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
> > ARRAY_SIZE(vq->iotlb_iov),
> > @@ -855,13 +855,13 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > ({ \
> > int ret = -EFAULT; \
> > if (!vq->iotlb) { \
> > - ret = __put_user(x, ptr); \
> > + ret = put_user(x, ptr); \
> > } else { \
> > __typeof__(ptr) to = \
> > (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> > sizeof(*ptr), VHOST_ADDR_USED); \
> > if (to != NULL) \
> > - ret = __put_user(x, to); \
> > + ret = put_user(x, to); \
> > else \
> > ret = -EFAULT; \
> > } \
> > @@ -872,14 +872,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > ({ \
> > int ret; \
> > if (!vq->iotlb) { \
> > - ret = __get_user(x, ptr); \
> > + ret = get_user(x, ptr); \
> > } else { \
> > __typeof__(ptr) from = \
> > (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> > sizeof(*ptr), \
> > type); \
> > if (from != NULL) \
> > - ret = __get_user(x, from); \
> > + ret = get_user(x, from); \
> > else \
> > ret = -EFAULT; \
> > } \

2019-01-24 04:56:23

by Michael S. Tsirkin

Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Jan 24, 2019 at 12:11:28PM +0800, Jason Wang wrote:
>
> On 2019/1/24 12:07 PM, Jason Wang wrote:
> >
> > On 2019/1/23 10:08 PM, Michael S. Tsirkin wrote:
> > > On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
> > > > It was noticed that the copy_user() friends used to access
> > > > virtqueue metadata tend to be very expensive for a dataplane
> > > > implementation like vhost since they involve lots of software checks,
> > > > speculation barriers and hardware feature toggling (e.g. SMAP). The
> > > > extra cost becomes more obvious when transferring small packets, since
> > > > the time spent on metadata access becomes more significant.
> > > >
> > > > This patch tries to eliminate those overheads by accessing the
> > > > metadata through a kernel virtual address obtained with vmap(). To
> > > > keep the pages migratable, instead of pinning them through GUP, we use
> > > > MMU notifiers to invalidate the vmaps and re-establish them during
> > > > each round of metadata prefetching if necessary. For devices that
> > > > don't use metadata prefetching, the memory accessors gracefully fall
> > > > back to the normal copy_user() implementation. Invalidation is
> > > > synchronized with the datapath through the vq mutex, and in order to
> > > > avoid holding the vq mutex during range checking, the MMU notifier is
> > > > torn down when trying to modify vq metadata.
> > > >
> > > > Another issue is that the kernel lacks an efficient way to track pages
> > > > dirtied through vmap(); this would cause problems if vhost were using
> > > > file-backed memory, which needs writeback. This patch solves the issue
> > > > by simply skipping file-backed vmas and falling back to the normal
> > > > copy_user() friends. This might introduce some overhead for
> > > > file-backed users, but since this use case is rare we can optimize it
> > > > on top.
> > > >
> > > > Note that this is only done when the device IOTLB is not enabled. We
> > > > could use a similar method to optimize that case in the future.
> > > >
> > > > Tests show at most about a 22% improvement in TX PPS when using
> > > > virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell:
> > > >
> > > >          SMAP on | SMAP off
> > > > Before: 5.0Mpps | 6.6Mpps
> > > > After:  6.1Mpps | 7.4Mpps
> > > >
> > > > Signed-off-by: Jason Wang <[email protected]>
> > >
> > > So this is the bulk of the change.
> > > Three things that I need to look into
> > > - Are there any security issues with bypassing the speculation barrier
> > >    that is normally present after access_ok?
> >
> >
> > If we can make sure the bypassing is only done from a kthread (vhost), it
> > should be fine, I think.
> >
> >
> > > - How hard does the special handling for
> > >    file backed storage make testing?
> >
> >
> > It's as simple as un-commenting vhost_can_vmap()? Or I can try to hack
> > qemu or dpdk to test this.
> >
> >
> > >    On the one hand we could add a module parameter to
> > >    force copy to/from user. on the other that's
> > >    another configuration we need to support.
> >
> >
> > That sounds sub-optimal since it leaves the choice to users.
> >
> >
> > >    But iotlb is not using vmap, so maybe that's enough
> > >    for testing.
> > > - How hard is it to figure out which mode uses which code.
>
>
> It's as simple as tracing __get_user() usage in the vhost process?
>
> Thanks

Well, there are now MMU notifiers etc. It's hardly as well
contained as that.


>
> > >
> > >
> > >
> > > Meanwhile, could you pls post data comparing this last patch with the
> > > below?  This removes the speculation barrier replacing it with a
> > > (useless but at least more lightweight) data dependency.
> >
> >
> > SMAP off
> >
> > Your patch: 7.2Mpps
> >
> > vmap: 7.4Mpps
> >
> > I didn't test with SMAP on, since it will be much slower for sure.
> >
> > Thanks
> >
> >
> > >
> > > Thanks!
> > >
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index bac939af8dbb..352ee7e14476 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -739,7 +739,7 @@ static int vhost_copy_to_user(struct
> > > vhost_virtqueue *vq, void __user *to,
> > >       int ret;
> > >         if (!vq->iotlb)
> > > -        return __copy_to_user(to, from, size);
> > > +        return copy_to_user(to, from, size);
> > >       else {
> > >           /* This function should be called after iotlb
> > >            * prefetch, which means we're sure that all vq
> > > @@ -752,7 +752,7 @@ static int vhost_copy_to_user(struct
> > > vhost_virtqueue *vq, void __user *to,
> > >                        VHOST_ADDR_USED);
> > >             if (uaddr)
> > > -            return __copy_to_user(uaddr, from, size);
> > > +            return copy_to_user(uaddr, from, size);
> > >             ret = translate_desc(vq, (u64)(uintptr_t)to, size,
> > > vq->iotlb_iov,
> > >                        ARRAY_SIZE(vq->iotlb_iov),
> > > @@ -774,7 +774,7 @@ static int vhost_copy_from_user(struct
> > > vhost_virtqueue *vq, void *to,
> > >       int ret;
> > >         if (!vq->iotlb)
> > > -        return __copy_from_user(to, from, size);
> > > +        return copy_from_user(to, from, size);
> > >       else {
> > >           /* This function should be called after iotlb
> > >            * prefetch, which means we're sure that vq
> > > @@ -787,7 +787,7 @@ static int vhost_copy_from_user(struct
> > > vhost_virtqueue *vq, void *to,
> > >           struct iov_iter f;
> > >             if (uaddr)
> > > -            return __copy_from_user(to, uaddr, size);
> > > +            return copy_from_user(to, uaddr, size);
> > >             ret = translate_desc(vq, (u64)(uintptr_t)from, size,
> > > vq->iotlb_iov,
> > >                        ARRAY_SIZE(vq->iotlb_iov),
> > > @@ -855,13 +855,13 @@ static inline void __user
> > > *__vhost_get_user(struct vhost_virtqueue *vq,
> > >   ({ \
> > >       int ret = -EFAULT; \
> > >       if (!vq->iotlb) { \
> > > -        ret = __put_user(x, ptr); \
> > > +        ret = put_user(x, ptr); \
> > >       } else { \
> > >           __typeof__(ptr) to = \
> > >               (__typeof__(ptr)) __vhost_get_user(vq, ptr,    \
> > >                         sizeof(*ptr), VHOST_ADDR_USED); \
> > >           if (to != NULL) \
> > > -            ret = __put_user(x, to); \
> > > +            ret = put_user(x, to); \
> > >           else \
> > >               ret = -EFAULT;    \
> > >       } \
> > > @@ -872,14 +872,14 @@ static inline void __user
> > > *__vhost_get_user(struct vhost_virtqueue *vq,
> > >   ({ \
> > >       int ret; \
> > >       if (!vq->iotlb) { \
> > > -        ret = __get_user(x, ptr); \
> > > +        ret = get_user(x, ptr); \
> > >       } else { \
> > >           __typeof__(ptr) from = \
> > >               (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> > >                                  sizeof(*ptr), \
> > >                                  type); \
> > >           if (from != NULL) \
> > > -            ret = __get_user(x, from); \
> > > +            ret = get_user(x, from); \
> > >           else \
> > >               ret = -EFAULT; \
> > >       } \

2019-01-25 02:33:50

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address


On 2019/1/24 12:53 PM, Michael S. Tsirkin wrote:
>>>> - How hard is it to figure out which mode uses which code.
>> It's as simple as tracing __get_user() usage in the vhost process?
>>
>> Thanks
> Well, there are now MMU notifiers etc. It's hardly as well
> contained as that.
>
>

We can set up filters for exactly the set of functions that we want to
trace. E.g. we can trace only the usage of __get_user() and
invalidate_range_start(). This should be sufficient.

In the long run, we may want to have some tracepoints for vhost_net.
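
Just to illustrate the shape this could take (a hypothetical sketch, not
something in this series), a TX tracepoint might look like:

/* This would live in a trace header (e.g. trace/events/vhost_net.h) with
 * the usual TRACE_SYSTEM / define_trace boilerplate around it.
 */
TRACE_EVENT(vhost_net_tx,
	TP_PROTO(unsigned int qnum, int head, unsigned int out, unsigned int in),
	TP_ARGS(qnum, head, out, in),

	TP_STRUCT__entry(
		__field(unsigned int, qnum)
		__field(int, head)
		__field(unsigned int, out)
		__field(unsigned int, in)
	),

	TP_fast_assign(
		__entry->qnum = qnum;
		__entry->head = head;
		__entry->out = out;
		__entry->in = in;
	),

	TP_printk("vq %u head %d out %u in %u",
		  __entry->qnum, __entry->head, __entry->out, __entry->in)
);

handle_tx() could then call trace_vhost_net_tx() after each descriptor is
fetched.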

Thanks


2019-01-25 03:02:29

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Jan 24, 2019 at 12:07:54PM +0800, Jason Wang wrote:
> > Meanwhile, could you pls post data comparing this last patch with the
> > below? This removes the speculation barrier replacing it with a
> > (useless but at least more lightweight) data dependency.
>
>
> SMAP off
>
> Your patch: 7.2Mpps
>
> vmap: 7.4Mpps

OK, so while we keep looking into vmap, why don't we merge something like
the below? Seems quite straightforward ...


> I didn't test with SMAP on, since it will be much slower for sure.
>
> Thanks
>
>
> >
> > Thanks!
> >
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index bac939af8dbb..352ee7e14476 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -739,7 +739,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> > int ret;
> > if (!vq->iotlb)
> > - return __copy_to_user(to, from, size);
> > + return copy_to_user(to, from, size);
> > else {
> > /* This function should be called after iotlb
> > * prefetch, which means we're sure that all vq
> > @@ -752,7 +752,7 @@ static int vhost_copy_to_user(struct vhost_virtqueue *vq, void __user *to,
> > VHOST_ADDR_USED);
> > if (uaddr)
> > - return __copy_to_user(uaddr, from, size);
> > + return copy_to_user(uaddr, from, size);
> > ret = translate_desc(vq, (u64)(uintptr_t)to, size, vq->iotlb_iov,
> > ARRAY_SIZE(vq->iotlb_iov),
> > @@ -774,7 +774,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
> > int ret;
> > if (!vq->iotlb)
> > - return __copy_from_user(to, from, size);
> > + return copy_from_user(to, from, size);
> > else {
> > /* This function should be called after iotlb
> > * prefetch, which means we're sure that vq
> > @@ -787,7 +787,7 @@ static int vhost_copy_from_user(struct vhost_virtqueue *vq, void *to,
> > struct iov_iter f;
> > if (uaddr)
> > - return __copy_from_user(to, uaddr, size);
> > + return copy_from_user(to, uaddr, size);
> > ret = translate_desc(vq, (u64)(uintptr_t)from, size, vq->iotlb_iov,
> > ARRAY_SIZE(vq->iotlb_iov),
> > @@ -855,13 +855,13 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > ({ \
> > int ret = -EFAULT; \
> > if (!vq->iotlb) { \
> > - ret = __put_user(x, ptr); \
> > + ret = put_user(x, ptr); \
> > } else { \
> > __typeof__(ptr) to = \
> > (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> > sizeof(*ptr), VHOST_ADDR_USED); \
> > if (to != NULL) \
> > - ret = __put_user(x, to); \
> > + ret = put_user(x, to); \
> > else \
> > ret = -EFAULT; \
> > } \
> > @@ -872,14 +872,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > ({ \
> > int ret; \
> > if (!vq->iotlb) { \
> > - ret = __get_user(x, ptr); \
> > + ret = get_user(x, ptr); \
> > } else { \
> > __typeof__(ptr) from = \
> > (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> > sizeof(*ptr), \
> > type); \
> > if (from != NULL) \
> > - ret = __get_user(x, from); \
> > + ret = get_user(x, from); \
> > else \
> > ret = -EFAULT; \
> > } \

2019-01-25 03:06:23

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address

On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
> It was noticed that the copy_user() friends used to access
> virtqueue metadata tend to be very expensive for a dataplane
> implementation like vhost since they involve lots of software checks,
> speculation barriers and hardware feature toggling (e.g. SMAP). The
> extra cost becomes more obvious when transferring small packets, since
> the time spent on metadata access becomes more significant.
>
> This patch tries to eliminate those overheads by accessing the
> metadata through a kernel virtual address obtained with vmap(). To
> keep the pages migratable, instead of pinning them through GUP, we use
> MMU notifiers to invalidate the vmaps and re-establish them during
> each round of metadata prefetching if necessary. For devices that
> don't use metadata prefetching, the memory accessors gracefully fall
> back to the normal copy_user() implementation. Invalidation is
> synchronized with the datapath through the vq mutex, and in order to
> avoid holding the vq mutex during range checking, the MMU notifier is
> torn down when trying to modify vq metadata.
>
> Another issue is that the kernel lacks an efficient way to track pages
> dirtied through vmap(); this would cause problems if vhost were using
> file-backed memory, which needs writeback. This patch solves the issue
> by simply skipping file-backed vmas and falling back to the normal
> copy_user() friends. This might introduce some overhead for
> file-backed users, but since this use case is rare we can optimize it
> on top.
>
> Note that this is only done when the device IOTLB is not enabled. We
> could use a similar method to optimize that case in the future.
>
> Tests show at most about a 22% improvement in TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell:
>
>         SMAP on | SMAP off
> Before: 5.0Mpps | 6.6Mpps
> After:  6.1Mpps | 7.4Mpps
>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/vhost/vhost.c | 288 +++++++++++++++++++++++++++++++++++++++++-
> drivers/vhost/vhost.h | 13 ++
> mm/shmem.c | 1 +
> 3 files changed, 300 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 37e2cac8e8b0..096ae3298d62 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> vq->indirect = NULL;
> vq->heads = NULL;
> vq->dev = dev;
> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> mutex_init(&vq->mutex);
> vhost_vq_reset(dev, vq);
> if (vq->handle_kick)
> @@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
> return sizeof(*vq->desc) * num;
> }
>
> +static void vhost_uninit_vmap(struct vhost_vmap *map)
> +{
> + if (map->addr)
> + vunmap(map->unmap_addr);
> +
> + map->addr = NULL;
> + map->unmap_addr = NULL;
> +}
> +
> +static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
> + struct vhost_vmap *map,
> + unsigned long ustart,
> + size_t size,
> + unsigned long start,
> + unsigned long end,
> + bool blockable)
> +{
> + if (end < ustart || start > ustart - 1 + size)
> + return 0;
> +
> + if (!blockable)
> + return -EAGAIN;
> +
> + mutex_lock(&vq->mutex);
> + vhost_uninit_vmap(map);
> + mutex_unlock(&vq->mutex);
> +
> + return 0;
> +}
> +
> +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> + const struct mmu_notifier_range *range)
> +{
> + struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> + mmu_notifier);
> + int i;
> +
> + for (i = 0; i < dev->nvqs; i++) {
> + struct vhost_virtqueue *vq = dev->vqs[i];
> +
> + if (vhost_invalidate_vmap(vq, &vq->avail_ring,
> + (unsigned long)vq->avail,
> + vhost_get_avail_size(vq, vq->num),
> + range->start, range->end,
> + range->blockable))
> + return -EAGAIN;
> + if (vhost_invalidate_vmap(vq, &vq->desc_ring,
> + (unsigned long)vq->desc,
> + vhost_get_desc_size(vq, vq->num),
> + range->start, range->end,
> + range->blockable))
> + return -EAGAIN;
> + if (vhost_invalidate_vmap(vq, &vq->used_ring,
> + (unsigned long)vq->used,
> + vhost_get_used_size(vq, vq->num),
> + range->start, range->end,
> + range->blockable))
> + return -EAGAIN;
> + }
> +
> + return 0;
> +}
> +
> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> + .invalidate_range_start = vhost_invalidate_range_start,
> +};
> +
> /* Caller should have device mutex */
> long vhost_dev_set_owner(struct vhost_dev *dev)
> {

It seems questionable to merely track .invalidate_range_start.
Don't we care about keeping pages young/accessed?
MMU will think they aren't and will penalize vhost by pushing
them out.

I note that MMU documentation says
* invalidate_range_start() and invalidate_range_end() must be
* paired
and it seems questionable that they are not paired here.


I also wonder about things like write-protecting the pages.
It does not look like a range is invalidated when page
is write-protected, even though I might have missed that.
If not we can be corrupting memory in a variety of ways
e.g. when using KSM, or with COW.




> @@ -541,7 +611,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> if (err)
> goto err_cgroup;
>
> + dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> + err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> + if (err)
> + goto err_mmu_notifier;
> +
> return 0;
> +err_mmu_notifier:
> + vhost_dev_free_iovecs(dev);
> err_cgroup:
> kthread_stop(worker);
> dev->worker = NULL;
> @@ -632,6 +709,97 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> spin_unlock(&dev->iotlb_lock);
> }
>
> +/* Suppress the vma that needs writeback since we can not track dirty
> + * pages now.
> + */
> +static bool vma_can_vmap(struct vm_area_struct *vma)
> +{
> + return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
> + vma_is_shmem(vma);
> +}
> +

IIUC, anonymous memory needs writeback too, just to swap.
I'm not an MM person so I might be off.


> +static int vhost_init_vmap(struct vhost_dev *dev,
> + struct vhost_vmap *map, unsigned long uaddr,
> + size_t size, int write)
> +{
> + struct mm_struct *mm = dev->mm;
> + struct vm_area_struct *vma;
> + struct page **pages;
> + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> + int npinned;
> + void *vaddr;
> + int err = 0;
> +
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, uaddr);
> + if (!vma || !vma_can_vmap(vma) ||
> + vma->vm_end < uaddr - 1 + size) {
> + err = -EINVAL;
> + goto err_vma;
> + }
> +
> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> + if (!pages) {
> + err = -ENOMEM;
> + goto err_alloc;
> + }
> +
> + npinned = get_user_pages_fast(uaddr, npages, write, pages);
> + if (npinned != npages) {
> + err = -EFAULT;
> + goto err_gup;
> + }
> +
> + vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> + if (!vaddr) {
> + err = EFAULT;
> + goto err_gup;
> + }
> +
> + map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
> + map->unmap_addr = vaddr;
> +
> +err_gup:
> + /* Don't pin pages, mmu notifier will notify us about page
> + * migration.
> + */
> + if (npinned > 0)
> + release_pages(pages, npinned);
> +err_alloc:
> + kfree(pages);
> +err_vma:
> + up_read(&mm->mmap_sem);
> + return err;
> +}
> +
> +static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
> +{
> + vhost_uninit_vmap(&vq->avail_ring);
> + vhost_uninit_vmap(&vq->desc_ring);
> + vhost_uninit_vmap(&vq->used_ring);
> +}
> +
> +static int vhost_setup_avail_vmap(struct vhost_virtqueue *vq,
> + unsigned long avail)
> +{
> + return vhost_init_vmap(vq->dev, &vq->avail_ring, avail,
> + vhost_get_avail_size(vq, vq->num), false);
> +}
> +
> +static int vhost_setup_desc_vmap(struct vhost_virtqueue *vq,
> + unsigned long desc)
> +{
> + return vhost_init_vmap(vq->dev, &vq->desc_ring, desc,
> + vhost_get_desc_size(vq, vq->num), false);
> +}
> +
> +static int vhost_setup_used_vmap(struct vhost_virtqueue *vq,
> + unsigned long used)
> +{
> + return vhost_init_vmap(vq->dev, &vq->used_ring, used,
> + vhost_get_used_size(vq, vq->num), true);
> +}
> +
> void vhost_dev_cleanup(struct vhost_dev *dev)
> {
> int i;
> @@ -661,8 +829,12 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> kthread_stop(dev->worker);
> dev->worker = NULL;
> }
> - if (dev->mm)
> + if (dev->mm) {
> + mmu_notifier_unregister(&dev->mmu_notifier, dev->mm);
> mmput(dev->mm);
> + }
> + for (i = 0; i < dev->nvqs; i++)
> + vhost_clean_vmaps(dev->vqs[i]);
> dev->mm = NULL;
> }
> EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -891,6 +1063,16 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>
> static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + if (likely(used)) {
> + *((__virtio16 *)&used->ring[vq->num]) =
> + cpu_to_vhost16(vq, vq->avail_idx);
> + return 0;
> + }
> + }
> +
> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> vhost_avail_event(vq));
> }
> @@ -899,6 +1081,16 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> struct vring_used_elem *head, int idx,
> int count)
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + if (likely(used)) {
> + memcpy(used->ring + idx, head,
> + count * sizeof(*head));
> + return 0;
> + }
> + }
> +
> return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> count * sizeof(*head));
> }
> @@ -906,6 +1098,15 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + if (likely(used)) {
> + used->flags = cpu_to_vhost16(vq, vq->used_flags);
> + return 0;
> + }
> + }
> +
> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> &vq->used->flags);
> }
> @@ -913,6 +1114,15 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + if (likely(used)) {
> + used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> + return 0;
> + }
> + }
> +
> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> &vq->used->idx);
> }
> @@ -958,12 +1168,30 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> __virtio16 *idx)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + if (likely(avail)) {
> + *idx = avail->idx;
> + return 0;
> + }
> + }
> +
> return vhost_get_avail(vq, *idx, &vq->avail->idx);
> }
>
> static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> __virtio16 *head, int idx)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + if (likely(avail)) {
> + *head = avail->ring[idx & (vq->num - 1)];
> + return 0;
> + }
> + }
> +
> return vhost_get_avail(vq, *head,
> &vq->avail->ring[idx & (vq->num - 1)]);
> }
> @@ -971,24 +1199,60 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> __virtio16 *flags)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + if (likely(avail)) {
> + *flags = avail->flags;
> + return 0;
> + }
> + }
> +
> return vhost_get_avail(vq, *flags, &vq->avail->flags);
> }
>
> static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> __virtio16 *event)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + if (likely(avail)) {
> + *event = (__virtio16)avail->ring[vq->num];
> + return 0;
> + }
> + }
> +
> return vhost_get_avail(vq, *event, vhost_used_event(vq));
> }
>
> static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> __virtio16 *idx)
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + if (likely(used)) {
> + *idx = used->idx;
> + return 0;
> + }
> + }
> +
> return vhost_get_used(vq, *idx, &vq->used->idx);
> }
>
> static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> struct vring_desc *desc, int idx)
> {
> + if (!vq->iotlb) {
> + struct vring_desc *d = vq->desc_ring.addr;
> +
> + if (likely(d)) {
> + *desc = *(d + idx);
> + return 0;
> + }
> + }
> +
> return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> }
>
> @@ -1329,8 +1593,16 @@ int vq_meta_prefetch(struct vhost_virtqueue *vq)
> {
> unsigned int num = vq->num;
>
> - if (!vq->iotlb)
> + if (!vq->iotlb) {
> + if (unlikely(!vq->avail_ring.addr))
> + vhost_setup_avail_vmap(vq, (unsigned long)vq->avail);
> + if (unlikely(!vq->desc_ring.addr))
> + vhost_setup_desc_vmap(vq, (unsigned long)vq->desc);
> + if (unlikely(!vq->used_ring.addr))
> + vhost_setup_used_vmap(vq, (unsigned long)vq->used);
> +
> return 1;
> + }
>
> return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
> vhost_get_desc_size(vq, num), VHOST_ADDR_DESC) &&
> @@ -1482,6 +1754,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
>
> mutex_lock(&vq->mutex);
>
> + /* Unregister MMU notifer to allow invalidation callback
> + * can access vq->avail, vq->desc , vq->used and vq->num
> + * without holding vq->mutex.
> + */
> + if (d->mm)
> + mmu_notifier_unregister(&d->mmu_notifier, d->mm);
> +
> switch (ioctl) {
> case VHOST_SET_VRING_NUM:
> /* Resizing ring with an active backend?
> @@ -1498,6 +1777,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> r = -EINVAL;
> break;
> }
> + vhost_clean_vmaps(vq);
> vq->num = s.num;
> break;
> case VHOST_SET_VRING_BASE:
> @@ -1575,6 +1855,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> }
> }
>
> + vhost_clean_vmaps(vq);
> +
> vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
> vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
> vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
> @@ -1655,6 +1937,8 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> if (pollstart && vq->handle_kick)
> r = vhost_poll_start(&vq->poll, vq->kick);
>
> + if (d->mm)
> + mmu_notifier_register(&d->mmu_notifier, d->mm);
> mutex_unlock(&vq->mutex);
>
> if (pollstop && vq->handle_kick)
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 4e21011b6628..c04bc327db9f 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -12,6 +12,8 @@
> #include <linux/virtio_config.h>
> #include <linux/virtio_ring.h>
> #include <linux/atomic.h>
> +#include <linux/pagemap.h>
> +#include <linux/mmu_notifier.h>
>
> struct vhost_work;
> typedef void (*vhost_work_fn_t)(struct vhost_work *work);
> @@ -80,6 +82,11 @@ enum vhost_uaddr_type {
> VHOST_NUM_ADDRS = 3,
> };
>
> +struct vhost_vmap {
> + void *addr;
> + void *unmap_addr;
> +};
> +
> /* The virtqueue structure describes a queue attached to a device. */
> struct vhost_virtqueue {
> struct vhost_dev *dev;
> @@ -90,6 +97,11 @@ struct vhost_virtqueue {
> struct vring_desc __user *desc;
> struct vring_avail __user *avail;
> struct vring_used __user *used;
> +
> + struct vhost_vmap avail_ring;
> + struct vhost_vmap desc_ring;
> + struct vhost_vmap used_ring;
> +
> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
> struct file *kick;
> struct eventfd_ctx *call_ctx;
> @@ -158,6 +170,7 @@ struct vhost_msg_node {
>
> struct vhost_dev {
> struct mm_struct *mm;
> + struct mmu_notifier mmu_notifier;
> struct mutex mutex;
> struct vhost_virtqueue **vqs;
> int nvqs;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 6ece1e2fe76e..745e7c7f7a6c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -237,6 +237,7 @@ bool vma_is_shmem(struct vm_area_struct *vma)
> {
> return vma->vm_ops == &shmem_vm_ops;
> }
> +EXPORT_SYMBOL_GPL(vma_is_shmem);
>
> static LIST_HEAD(shmem_swaplist);
> static DEFINE_MUTEX(shmem_swaplist_mutex);
> --
> 2.17.1

2019-01-25 09:19:22

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address


On 2019/1/25 11:00 AM, Michael S. Tsirkin wrote:
> On Thu, Jan 24, 2019 at 12:07:54PM +0800, Jason Wang wrote:
>>> Meanwhile, could you pls post data comparing this last patch with the
>>> below? This removes the speculation barrier replacing it with a
>>> (useless but at least more lightweight) data dependency.
>> SMAP off
>>
>> Your patch: 7.2Mpps
>>
>> vmap: 7.4Mpps
> OK, so while we keep looking into vmap, why don't we merge something like
> the below? Seems quite straightforward ...
>

The problem is it gives a ~8% regression in PPS when SMAP is on. This is
probably because the latency of lfence is hidden by the previous stac, as
mentioned in commit b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd.

Thanks


2019-01-25 09:23:05

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address


On 2019/1/25 11:03 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 23, 2019 at 05:55:57PM +0800, Jason Wang wrote:
>> It was noticed that the copy_user() friends used to access
>> virtqueue metadata tend to be very expensive for a dataplane
>> implementation like vhost since they involve lots of software checks,
>> speculation barriers and hardware feature toggling (e.g. SMAP). The
>> extra cost becomes more obvious when transferring small packets, since
>> the time spent on metadata access becomes more significant.
>>
>> This patch tries to eliminate those overheads by accessing the
>> metadata through a kernel virtual address obtained with vmap(). To
>> keep the pages migratable, instead of pinning them through GUP, we use
>> MMU notifiers to invalidate the vmaps and re-establish them during
>> each round of metadata prefetching if necessary. For devices that
>> don't use metadata prefetching, the memory accessors gracefully fall
>> back to the normal copy_user() implementation. Invalidation is
>> synchronized with the datapath through the vq mutex, and in order to
>> avoid holding the vq mutex during range checking, the MMU notifier is
>> torn down when trying to modify vq metadata.
>>
>> Another issue is that the kernel lacks an efficient way to track pages
>> dirtied through vmap(); this would cause problems if vhost were using
>> file-backed memory, which needs writeback. This patch solves the issue
>> by simply skipping file-backed vmas and falling back to the normal
>> copy_user() friends. This might introduce some overhead for
>> file-backed users, but since this use case is rare we can optimize it
>> on top.
>>
>> Note that this is only done when the device IOTLB is not enabled. We
>> could use a similar method to optimize that case in the future.
>>
>> Tests show at most about a 22% improvement in TX PPS when using
>> virtio-user + vhost_net + xdp1 + TAP on a 2.6GHz Broadwell:
>>
>>         SMAP on | SMAP off
>> Before: 5.0Mpps | 6.6Mpps
>> After:  6.1Mpps | 7.4Mpps
>>
>> Signed-off-by: Jason Wang <[email protected]>
>> ---
>> drivers/vhost/vhost.c | 288 +++++++++++++++++++++++++++++++++++++++++-
>> drivers/vhost/vhost.h | 13 ++
>> mm/shmem.c | 1 +
>> 3 files changed, 300 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 37e2cac8e8b0..096ae3298d62 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -440,6 +440,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>> vq->indirect = NULL;
>> vq->heads = NULL;
>> vq->dev = dev;
>> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>> mutex_init(&vq->mutex);
>> vhost_vq_reset(dev, vq);
>> if (vq->handle_kick)
>> @@ -510,6 +513,73 @@ static size_t vhost_get_desc_size(struct vhost_virtqueue *vq, int num)
>> return sizeof(*vq->desc) * num;
>> }
>>
>> +static void vhost_uninit_vmap(struct vhost_vmap *map)
>> +{
>> + if (map->addr)
>> + vunmap(map->unmap_addr);
>> +
>> + map->addr = NULL;
>> + map->unmap_addr = NULL;
>> +}
>> +
>> +static int vhost_invalidate_vmap(struct vhost_virtqueue *vq,
>> + struct vhost_vmap *map,
>> + unsigned long ustart,
>> + size_t size,
>> + unsigned long start,
>> + unsigned long end,
>> + bool blockable)
>> +{
>> + if (end < ustart || start > ustart - 1 + size)
>> + return 0;
>> +
>> + if (!blockable)
>> + return -EAGAIN;
>> +
>> + mutex_lock(&vq->mutex);
>> + vhost_uninit_vmap(map);
>> + mutex_unlock(&vq->mutex);
>> +
>> + return 0;
>> +}
>> +
>> +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
>> + const struct mmu_notifier_range *range)
>> +{
>> + struct vhost_dev *dev = container_of(mn, struct vhost_dev,
>> + mmu_notifier);
>> + int i;
>> +
>> + for (i = 0; i < dev->nvqs; i++) {
>> + struct vhost_virtqueue *vq = dev->vqs[i];
>> +
>> + if (vhost_invalidate_vmap(vq, &vq->avail_ring,
>> + (unsigned long)vq->avail,
>> + vhost_get_avail_size(vq, vq->num),
>> + range->start, range->end,
>> + range->blockable))
>> + return -EAGAIN;
>> + if (vhost_invalidate_vmap(vq, &vq->desc_ring,
>> + (unsigned long)vq->desc,
>> + vhost_get_desc_size(vq, vq->num),
>> + range->start, range->end,
>> + range->blockable))
>> + return -EAGAIN;
>> + if (vhost_invalidate_vmap(vq, &vq->used_ring,
>> + (unsigned long)vq->used,
>> + vhost_get_used_size(vq, vq->num),
>> + range->start, range->end,
>> + range->blockable))
>> + return -EAGAIN;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
>> + .invalidate_range_start = vhost_invalidate_range_start,
>> +};
>> +
>> /* Caller should have device mutex */
>> long vhost_dev_set_owner(struct vhost_dev *dev)
>> {
> It seems questionable to merely track .invalidate_range_start.
> Don't we care about keeping pages young/accessed?


My understanding is that the young/accessed tracking is only needed for a
secondary MMU where the hva is not used. This is not the case for vhost,
since the guest will access those pages through the userspace address anyway.


> MMU will think they aren't and will penalize vhost by pushing
> them out.
>
> I note that MMU documentation says
> * invalidate_range_start() and invalidate_range_end() must be
> * paired
> and it seems questionable that they are not paired here.


I can see some users with an unpaired invalidate_range_start(). Maybe I'm
missing something, but I cannot find anything that we need to do after the
page is unmapped.


>
>
> I also wonder about things like write-protecting the pages.
> It does not look like a range is invalidated when page
> is write-protected, even though I might have missed that.
> If not we can be corrupting memory in a variety of ways
> e.g. when using KSM, or with COW.


Yes, we probably need to implement the change_pte() method, which will do
the vunmap().

Thanks


2019-01-25 09:24:52

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH net-next V4 5/5] vhost: access vq metadata through kernel virtual address


On 2019/1/25 11:03 AM, Michael S. Tsirkin wrote:
>> +/* Suppress the vma that needs writeback since we can not track dirty
>> + * pages now.
>> + */
>> +static bool vma_can_vmap(struct vm_area_struct *vma)
>> +{
>> + return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
>> + vma_is_shmem(vma);
>> +}
>> +
> IIUC, anonymous memory needs writeback too, just to swap.
> I'm not an MM person so I might be off.


Right, my fault; I meant the vmas that need dirty page tracking.
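
Concretely, the comment I had in mind would read something like this (the
helper body is unchanged from the patch, only the comment differs):

/* Only vmap the vmas that don't need dirty page tracking for
 * writeback (for now: anonymous, hugetlb and shmem mappings); for
 * everything else we fall back to copy_user().
 */
static bool vma_can_vmap(struct vm_area_struct *vma)
{
	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
	       vma_is_shmem(vma);
}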

Thanks


2019-01-26 22:41:46

by David Miller

[permalink] [raw]
Subject: Re: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()

From: Jason Wang <[email protected]>
Date: Wed, 23 Jan 2019 17:55:52 +0800

> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
>
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.

I've read over the discussion of patch #5 a few times.

And it seems to me that, at a minimum, a few things still need to
be resolved:

1) More perf data added to commit message.

2) Whether invalidate_range_start() and invalidate_range_end() must
be paired.

Etc. So I am marking this series "Changes Requested".

2019-01-27 00:32:35

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()

On Sat, Jan 26, 2019 at 02:37:08PM -0800, David Miller wrote:
> From: Jason Wang <[email protected]>
> Date: Wed, 23 Jan 2019 17:55:52 +0800
>
> > This series tries to access virtqueue metadata through kernel virtual
> > address instead of copy_user() friends since they had too much
> > overheads like checks, spec barriers or even hardware feature
> > toggling.
> >
> > Test shows about 24% improvement on TX PPS. It should benefit other
> > cases as well.
>
> I've read over the discussion of patch #5 a few times.
>
> And it seems to me that, at a minimum, a few things still need to
> be resolved:
>
> 1) More perf data added to commit message.
>
> 2) Whether invalidate_range_start() and invalidate_range_end() must
> be paired.


Add dirty tracking.

> Etc. So I am marking this series "Changes Requested".

2019-01-29 02:35:30

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH net-next V4 0/5] vhost: accelerate metadata access through vmap()


On 2019/1/27 8:31 AM, Michael S. Tsirkin wrote:
> On Sat, Jan 26, 2019 at 02:37:08PM -0800, David Miller wrote:
>> From: Jason Wang <[email protected]>
>> Date: Wed, 23 Jan 2019 17:55:52 +0800
>>
>>> This series tries to access virtqueue metadata through kernel virtual
>>> address instead of copy_user() friends since they had too much
>>> overheads like checks, spec barriers or even hardware feature
>>> toggling.
>>>
>>> Test shows about 24% improvement on TX PPS. It should benefit other
>>> cases as well.
>> I've read over the discussion of patch #5 a few times.
>>
>> And it seems to me that, at a minimum, a few things still need to
>> be resolved:
>>
>> 1) More perf data added to commit message.


Ok.


>>
>> 2) Whether invalidate_range_start() and invalidate_range_end() must
>> be paired.


The reason vhost doesn't need an invalidate_range_end() is that we have a
fallback to the copy_to_user() friends. So there's no requirement to set up
the mapping in range_end() or to lock the vq between range_start() and
range_end(). We delay the setup of the vmap until it is really used in
vhost_meta_prefetch(), and we hold mmap_sem when setting up the vmap, which
guarantees there's no intermediate state at that point.
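
In other words, the accessors already degrade gracefully. Condensing one of
them from the patch (vhost_get_used_idx) just to show the pattern:

static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
				     __virtio16 *idx)
{
	struct vring_used *used = vq->used_ring.addr;

	/* Fast path: a valid vmap set up by the last prefetch round. */
	if (!vq->iotlb && likely(used)) {
		*idx = used->idx;
		return 0;
	}

	/* Slow path: the vmap was torn down by the MMU notifier (or was
	 * never set up), so go through the usual uaccess helpers.
	 */
	return vhost_get_used(vq, *idx, &vq->used->idx);
}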


>
> Add dirty tracking.


I think this could be solved by introducing e.g.
vhost_meta_prefetch_done() at the end of handle_tx()/handle_rx() and
calling set_page_dirty() for the used pages, instead of the trick of
classifying VMAs. (As I saw, hugetlbfs has its own set_page_dirty method.)
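
A rough sketch of the idea (assuming vhost_vmap is extended to keep the
pages/npages arrays it was built from, which the current patch releases
right after vmap()):

static void vhost_meta_prefetch_done(struct vhost_virtqueue *vq)
{
	struct vhost_vmap *map = &vq->used_ring;
	int i;

	if (!map->addr)
		return;

	/* The used ring is the only ring we write through the vmap, so
	 * mark its backing pages dirty before the worker goes to sleep.
	 */
	for (i = 0; i < map->npages; i++)
		set_page_dirty_lock(map->pages[i]);
}

handle_tx()/handle_rx() would call this right before dropping the vq mutex.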

Thanks


>
>> Etc. So I am marking this series "Changes Requested".