Subject: Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
To: "Michael S. Tsirkin"
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org, jgg@mellanox.com,
    aarcange@redhat.com, jglisse@redhat.com, linux-mm@kvack.org,
    James Bottomley, Christoph Hellwig, David Miller,
    linux-arm-kernel@lists.infradead.org, linux-parisc@vger.kernel.org
References: <20190905122736.19768-1-jasowang@redhat.com>
 <20190905122736.19768-3-jasowang@redhat.com>
 <20190908063618-mutt-send-email-mst@kernel.org>
From: Jason Wang
Message-ID: <1cb5aa8d-6213-5fce-5a77-fcada572c882@redhat.com>
Date: Mon, 9 Sep 2019 10:18:57 +0800
In-Reply-To: <20190908063618-mutt-send-email-mst@kernel.org>

On 2019/9/8 7:05 PM, Michael S. Tsirkin wrote:
> On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
>> This is a rework of commit 7f466032dc9e ("vhost: access vq
>> metadata through kernel virtual address").
>>
>> It was noticed that the copy_to/from_user() friends that are used to
>> access virtqueue metadata tend to be very expensive for a dataplane
>> implementation like vhost, since they involve lots of software checks,
>> speculation barriers,
> So if we drop the speculation barrier,
> there's a problem here in that accesses will now be speculated.
> This effectively disables the defence-in-depth effect of
> b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
> ("x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec")
>
> So now we need to sprinkle array_index_nospec or barrier_nospec over the
> code whenever we use an index we got from userspace.
> See below for some examples.
>
>> hardware feature toggling (e.g. SMAP). The
>> extra cost is more obvious when transferring small packets, since
>> the time spent on metadata access becomes more significant.
>>
>> This patch tries to eliminate those overheads by accessing the
>> metadata through a direct mapping of those pages. Invalidation
>> callbacks are implemented for co-operation with general VM management
>> (swap, KSM, THP or NUMA balancing). We try to get the direct mapping
>> of the vq metadata before each round of packet processing if it
>> doesn't exist. If we fail, we simply fall back to the
>> copy_to/from_user() friends.
>>
>> The invalidation, the direct-mapping access and its setup are
>> synchronized through a spinlock. This is a step back from the original
>> commit 7f466032dc9e ("vhost: access vq metadata through kernel virtual
>> address"), which tried to use RCU and was suspicious and hard to
>> review. This won't perform as well as RCU because of the atomic;
>> that could be addressed by future optimization.
>>
>> This method does not work for highmem pages, which require a
>> temporary mapping, so there we just fall back to the normal
>> copy_to/from_user(). It may also not work for archs with virtually
>> tagged caches, since extra cache flushing would be needed to
>> eliminate the alias, which would result in complex logic and bad
>> performance. For those archs, this patch simply goes for the
>> copy_to/from_user() friends.
>> This is done by ruling out
>> kernel mapping code through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
>>
>> Note that this is only done when a device IOTLB is not enabled. We
>> could use a similar method to optimize the IOTLB path in the future.
>>
>> Tests show at most about a 22% improvement in TX PPS when using
>> virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.
>>
>>         SMAP on | SMAP off
>> Before: 4.9Mpps | 6.9Mpps
>> After:  6.0Mpps | 7.5Mpps
>>
>> On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
>> any difference.
> Why is not Kaby Lake with SMAP off the same as Sandy Bridge?

I don't know, I guess it was because the atomic is l

>
>
>> Cc: Andrea Arcangeli
>> Cc: James Bottomley
>> Cc: Christoph Hellwig
>> Cc: David Miller
>> Cc: Jerome Glisse
>> Cc: Jason Gunthorpe
>> Cc: linux-mm@kvack.org
>> Cc: linux-arm-kernel@lists.infradead.org
>> Cc: linux-parisc@vger.kernel.org
>> Signed-off-by: Jason Wang
>> Signed-off-by: Michael S. Tsirkin
>> ---
>>  drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
>>  drivers/vhost/vhost.h |  41 ++++
>>  2 files changed, 589 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 791562e03fe0..f98155f28f02 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>>  		__vhost_vq_meta_reset(d->vqs[i]);
>>  }
>>
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +static void vhost_map_unprefetch(struct vhost_map *map)
>> +{
>> +	kfree(map->pages);
>> +	kfree(map);
>> +}
>> +
>> +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
>> +				struct vhost_map *map, int index)
>> +{
>> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
>> +	int i;
>> +
>> +	if (uaddr->write) {
>> +		for (i = 0; i < map->npages; i++)
>> +			set_page_dirty(map->pages[i]);
>> +	}
>> +}
>> +
>> +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>> +{
>> +	struct vhost_map *map[VHOST_NUM_ADDRS];
>> +	int i;
>> +
>> +	spin_lock(&vq->mmu_lock);
>> +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> +		map[i] = vq->maps[i];
>> +		if (map[i]) {
>> +			vhost_set_map_dirty(vq, map[i], i);
>> +			vq->maps[i] = NULL;
>> +		}
>> +	}
>> +	spin_unlock(&vq->mmu_lock);
>> +
>> +	/* No need for synchronization since we are serialized with
>> +	 * memory accessors (e.g vq mutex held).
>> + */ >> + >> + for (i = 0; i < VHOST_NUM_ADDRS; i++) >> + if (map[i]) >> + vhost_map_unprefetch(map[i]); >> + >> +} >> + >> +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq) >> +{ >> + int i; >> + >> + vhost_uninit_vq_maps(vq); >> + for (i = 0; i < VHOST_NUM_ADDRS; i++) >> + vq->uaddrs[i].size = 0; >> +} >> + >> +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr, >> + unsigned long start, >> + unsigned long end) >> +{ >> + if (unlikely(!uaddr->size)) >> + return false; >> + >> + return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size); >> +} >> + >> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq) >> +{ >> + spin_lock(&vq->mmu_lock); >> +} >> + >> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq) >> +{ >> + spin_unlock(&vq->mmu_lock); >> +} >> + >> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq, >> + int index, >> + unsigned long start, >> + unsigned long end, >> + bool blockable) >> +{ >> + struct vhost_uaddr *uaddr = &vq->uaddrs[index]; >> + struct vhost_map *map; >> + >> + if (!vhost_map_range_overlap(uaddr, start, end)) >> + return 0; >> + else if (!blockable) >> + return -EAGAIN; >> + >> + spin_lock(&vq->mmu_lock); >> + ++vq->invalidate_count; >> + >> + map = vq->maps[index]; >> + if (map) >> + vq->maps[index] = NULL; >> + spin_unlock(&vq->mmu_lock); >> + >> + if (map) { >> + vhost_set_map_dirty(vq, map, index); >> + vhost_map_unprefetch(map); >> + } >> + >> + return 0; >> +} >> + >> +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq, >> + int index, >> + unsigned long start, >> + unsigned long end) >> +{ >> + if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end)) >> + return; >> + >> + spin_lock(&vq->mmu_lock); >> + --vq->invalidate_count; >> + spin_unlock(&vq->mmu_lock); >> +} >> + >> +static int vhost_invalidate_range_start(struct mmu_notifier *mn, >> + const struct mmu_notifier_range *range) >> +{ >> + struct vhost_dev *dev = container_of(mn, struct vhost_dev, >> + mmu_notifier); >> + bool blockable = mmu_notifier_range_blockable(range); >> + int i, j, ret; >> + >> + for (i = 0; i < dev->nvqs; i++) { >> + struct vhost_virtqueue *vq = dev->vqs[i]; >> + >> + for (j = 0; j < VHOST_NUM_ADDRS; j++) { >> + ret = vhost_invalidate_vq_start(vq, j, >> + range->start, >> + range->end, blockable); >> + if (ret) >> + return ret; >> + } >> + } >> + >> + return 0; >> +} >> + >> +static void vhost_invalidate_range_end(struct mmu_notifier *mn, >> + const struct mmu_notifier_range *range) >> +{ >> + struct vhost_dev *dev = container_of(mn, struct vhost_dev, >> + mmu_notifier); >> + int i, j; >> + >> + for (i = 0; i < dev->nvqs; i++) { >> + struct vhost_virtqueue *vq = dev->vqs[i]; >> + >> + for (j = 0; j < VHOST_NUM_ADDRS; j++) >> + vhost_invalidate_vq_end(vq, j, >> + range->start, >> + range->end); >> + } >> +} >> + >> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = { >> + .invalidate_range_start = vhost_invalidate_range_start, >> + .invalidate_range_end = vhost_invalidate_range_end, >> +}; >> + >> +static void vhost_init_maps(struct vhost_dev *dev) >> +{ >> + struct vhost_virtqueue *vq; >> + int i, j; >> + >> + dev->mmu_notifier.ops = &vhost_mmu_notifier_ops; >> + >> + for (i = 0; i < dev->nvqs; ++i) { >> + vq = dev->vqs[i]; >> + for (j = 0; j < VHOST_NUM_ADDRS; j++) >> + vq->maps[j] = NULL; >> + } >> +} >> +#endif >> + >> static void vhost_vq_reset(struct vhost_dev *dev, >> struct vhost_virtqueue *vq) >> { >> @@ -326,7 +502,11 @@ static void 
vhost_vq_reset(struct vhost_dev *dev, >> vq->busyloop_timeout = 0; >> vq->umem = NULL; >> vq->iotlb = NULL; >> + vq->invalidate_count = 0; >> __vhost_vq_meta_reset(vq); >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + vhost_reset_vq_maps(vq); >> +#endif >> } >> >> static int vhost_worker(void *data) >> @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev, >> dev->iov_limit = iov_limit; >> dev->weight = weight; >> dev->byte_weight = byte_weight; >> + dev->has_notifier = false; >> init_llist_head(&dev->work_list); >> init_waitqueue_head(&dev->wait); >> INIT_LIST_HEAD(&dev->read_list); >> INIT_LIST_HEAD(&dev->pending_list); >> spin_lock_init(&dev->iotlb_lock); >> - >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + vhost_init_maps(dev); >> +#endif >> >> for (i = 0; i < dev->nvqs; ++i) { >> vq = dev->vqs[i]; >> @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev, >> vq->heads = NULL; >> vq->dev = dev; >> mutex_init(&vq->mutex); >> + spin_lock_init(&vq->mmu_lock); >> vhost_vq_reset(dev, vq); >> if (vq->handle_kick) >> vhost_poll_init(&vq->poll, vq->handle_kick, >> @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev) >> if (err) >> goto err_cgroup; >> >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + err = mmu_notifier_register(&dev->mmu_notifier, dev->mm); >> + if (err) >> + goto err_mmu_notifier; >> +#endif >> + dev->has_notifier = true; >> + >> return 0; >> + >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> +err_mmu_notifier: >> + vhost_dev_free_iovecs(dev); >> +#endif >> err_cgroup: >> kthread_stop(worker); >> dev->worker = NULL; >> @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev) >> spin_unlock(&dev->iotlb_lock); >> } >> >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> +static void vhost_setup_uaddr(struct vhost_virtqueue *vq, >> + int index, unsigned long uaddr, >> + size_t size, bool write) >> +{ >> + struct vhost_uaddr *addr = &vq->uaddrs[index]; >> + >> + addr->uaddr = uaddr; >> + addr->size = size; >> + addr->write = write; >> +} >> + >> +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq) >> +{ >> + vhost_setup_uaddr(vq, VHOST_ADDR_DESC, >> + (unsigned long)vq->desc, >> + vhost_get_desc_size(vq, vq->num), >> + false); >> + vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL, >> + (unsigned long)vq->avail, >> + vhost_get_avail_size(vq, vq->num), >> + false); >> + vhost_setup_uaddr(vq, VHOST_ADDR_USED, >> + (unsigned long)vq->used, >> + vhost_get_used_size(vq, vq->num), >> + true); >> +} >> + >> +static int vhost_map_prefetch(struct vhost_virtqueue *vq, >> + int index) >> +{ >> + struct vhost_map *map; >> + struct vhost_uaddr *uaddr = &vq->uaddrs[index]; >> + struct page **pages; >> + int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE); >> + int npinned; >> + void *vaddr, *v; >> + int err; >> + int i; >> + >> + spin_lock(&vq->mmu_lock); >> + >> + err = -EFAULT; >> + if (vq->invalidate_count) >> + goto err; >> + >> + err = -ENOMEM; >> + map = kmalloc(sizeof(*map), GFP_ATOMIC); >> + if (!map) >> + goto err; >> + >> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC); >> + if (!pages) >> + goto err_pages; >> + >> + err = EFAULT; >> + npinned = __get_user_pages_fast(uaddr->uaddr, npages, >> + uaddr->write, pages); >> + if (npinned > 0) >> + release_pages(pages, npinned); >> + if (npinned != npages) >> + goto err_gup; >> + >> + for (i = 0; i < npinned; i++) >> + if (PageHighMem(pages[i])) >> + goto err_gup; >> + >> + vaddr = v = page_address(pages[0]); >> + >> + /* For simplicity, fallback to userspace address if VA is not >> + * contigious. 
>> + */ >> + for (i = 1; i < npinned; i++) { >> + v += PAGE_SIZE; >> + if (v != page_address(pages[i])) >> + goto err_gup; >> + } >> + >> + map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1)); >> + map->npages = npages; >> + map->pages = pages; >> + >> + vq->maps[index] = map; >> + /* No need for a synchronize_rcu(). This function should be >> + * called by dev->worker so we are serialized with all >> + * readers. >> + */ >> + spin_unlock(&vq->mmu_lock); >> + >> + return 0; >> + >> +err_gup: >> + kfree(pages); >> +err_pages: >> + kfree(map); >> +err: >> + spin_unlock(&vq->mmu_lock); >> + return err; >> +} >> +#endif >> + >> void vhost_dev_cleanup(struct vhost_dev *dev) >> { >> int i; >> @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev) >> kthread_stop(dev->worker); >> dev->worker = NULL; >> } >> - if (dev->mm) >> + if (dev->mm) { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + if (dev->has_notifier) { >> + mmu_notifier_unregister(&dev->mmu_notifier, >> + dev->mm); >> + dev->has_notifier = false; >> + } >> +#endif >> mmput(dev->mm); >> + } >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + for (i = 0; i < dev->nvqs; i++) >> + vhost_uninit_vq_maps(dev->vqs[i]); >> +#endif >> dev->mm = NULL; >> } >> EXPORT_SYMBOL_GPL(vhost_dev_cleanup); >> @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq, >> >> static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_used *used; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_USED]; >> + if (likely(map)) { >> + used = map->addr; >> + *((__virtio16 *)&used->ring[vq->num]) = >> + cpu_to_vhost16(vq, vq->avail_idx); >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx), >> vhost_avail_event(vq)); >> } >> @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >> struct vring_used_elem *head, int idx, >> int count) >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_used *used; >> + size_t size; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_USED]; >> + if (likely(map)) { >> + used = map->addr; >> + size = count * sizeof(*head); >> + memcpy(used->ring + idx, head, size); >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_copy_to_user(vq, vq->used->ring + idx, head, >> count * sizeof(*head)); >> } >> @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq, >> static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >> >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_used *used; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_USED]; >> + if (likely(map)) { >> + used = map->addr; >> + used->flags = cpu_to_vhost16(vq, vq->used_flags); >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags), >> &vq->used->flags); >> } >> @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq) >> static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) >> >> { >> +#if 
VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_used *used; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_USED]; >> + if (likely(map)) { >> + used = map->addr; >> + used->idx = cpu_to_vhost16(vq, vq->last_used_idx); >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx), >> &vq->used->idx); >> } >> @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) >> static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, >> __virtio16 *idx) >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_avail *avail; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> + if (likely(map)) { >> + avail = map->addr; >> + *idx = avail->idx; > index can now be speculated. [...] > + vhost_vq_access_map_begin(vq); > + > + map = vq->maps[VHOST_ADDR_AVAIL]; > + if (likely(map)) { > + avail = map->addr; > + *head = avail->ring[idx & (vq->num - 1)]; > > Since idx can be speculated, I guess we need array_index_nospec here? So we have ACQUIRE(mmu_lock) get idx RELEASE(mmu_lock) ACQUIRE(mmu_lock) read array[idx] RELEASE(mmu_lock) Then I think idx can't be speculated consider we've passed RELEASE + ACQUIRE? > > >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_get_avail(vq, *head, >> &vq->avail->ring[idx & (vq->num - 1)]); >> } >> @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, >> static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, >> __virtio16 *flags) >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_avail *avail; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> + if (likely(map)) { >> + avail = map->addr; >> + *flags = avail->flags; >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_get_avail(vq, *flags, &vq->avail->flags); >> } >> >> static inline int vhost_get_used_event(struct vhost_virtqueue *vq, >> __virtio16 *event) >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_avail *avail; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + map = vq->maps[VHOST_ADDR_AVAIL]; >> + if (likely(map)) { >> + avail = map->addr; >> + *event = (__virtio16)avail->ring[vq->num]; >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_get_avail(vq, *event, vhost_used_event(vq)); >> } >> >> static inline int vhost_get_used_idx(struct vhost_virtqueue *vq, >> __virtio16 *idx) >> { >> +#if VHOST_ARCH_CAN_ACCEL_UACCESS >> + struct vhost_map *map; >> + struct vring_used *used; >> + >> + if (!vq->iotlb) { >> + vhost_vq_access_map_begin(vq); >> + >> + map = vq->maps[VHOST_ADDR_USED]; >> + if (likely(map)) { >> + used = map->addr; >> + *idx = used->idx; >> + vhost_vq_access_map_end(vq); >> + return 0; >> + } >> + >> + vhost_vq_access_map_end(vq); >> + } >> +#endif >> + >> return vhost_get_used(vq, *idx, &vq->used->idx); >> } > > This seems to be used during init. Why do we bother > accelerating this? Ok, I can remove this part in next version. 
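For reference, the clamping Michael is asking about above would look roughly like the sketch below. It is not part of the posted patch (the helper name here is made up), and whether it is needed at all is exactly the point under discussion: Jason's argument is that the index is read and used in two separate mmu_lock critical sections. A minimal sketch, reusing the structures introduced by the patch:

/* Illustrative only -- not in the posted patch. array_index_nospec()
 * comes from <linux/nospec.h>.
 */
static inline int vhost_get_avail_head_nospec(struct vhost_virtqueue *vq,
					      __virtio16 *head, int idx)
{
	struct vhost_map *map;
	struct vring_avail *avail;

	if (!vq->iotlb) {
		vhost_vq_access_map_begin(vq);

		map = vq->maps[VHOST_ADDR_AVAIL];
		if (likely(map)) {
			avail = map->addr;
			/* Clamp the guest-supplied index so a mispredicted
			 * bounds check cannot steer the load past the end
			 * of the avail ring.
			 */
			*head = avail->ring[array_index_nospec(idx & (vq->num - 1),
							       vq->num)];
			vhost_vq_access_map_end(vq);
			return 0;
		}

		vhost_vq_access_map_end(vq);
	}

	return vhost_get_avail(vq, *head,
			       &vq->avail->ring[idx & (vq->num - 1)]);
}

array_index_nospec(index, size) forces the index to 0 whenever it would be >= size, using a branchless mask, so the clamp also holds while the CPU is speculating.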
>
>
>>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>  				 struct vring_desc *desc, int idx)
>>  {
>> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
>> +	struct vhost_map *map;
>> +	struct vring_desc *d;
>> +
>> +	if (!vq->iotlb) {
>> +		vhost_vq_access_map_begin(vq);
>> +
>> +		map = vq->maps[VHOST_ADDR_DESC];
>> +		if (likely(map)) {
>> +			d = map->addr;
>> +			*desc = *(d + idx);
>
> Since idx can be speculated, I guess we need array_index_nospec here?

This is similar to the above avail idx case.

>
>
>> +			vhost_vq_access_map_end(vq);
>> +			return 0;
>> +		}
>> +
>> +		vhost_vq_access_map_end(vq);
>> +	}
>> +#endif
>> +
>>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>>  }
>>
> I also wonder about the userspace address we get eventually.
> It would seem that we need to prevent that from speculating -
> and that seems like a good idea even if this patch isn't
> applied. As you are playing with micro-benchmarks, maybe
> you could try the below patch?

Let me test it. Thanks

> It's unfortunately untested.
> Thanks a lot in advance!
>
> ===>
> vhost: block speculation of translated descriptors
>
> iovec addresses coming from vhost are assumed to be
> pre-validated, but in fact can be speculated to a value
> out of range.
>
> Userspace addresses are later validated with array_index_nospec so we can
> be sure kernel info does not leak through these addresses, but vhost
> must also not leak userspace info outside the allowed memory table to
> guests.
>
> Following the defence in depth principle, make sure
> the address is not speculated out of the node range.
>
> Signed-off-by: Michael S. Tsirkin
>
> ---
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5dc174ac8cac..863e25011ef6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
>  		size = node->size - addr + node->start;
>  		_iov->iov_len = min((u64)len - s, size);
>  		_iov->iov_base = (void __user *)(unsigned long)
> -			(node->userspace_addr + addr - node->start);
> +			(node->userspace_addr +
> +			 array_index_nospec(addr - node->start,
> +					    node->size));
>  		s += size;
>  		addr += size;
>  		++ret;
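For completeness, the property Michael's hunk relies on: array_index_nospec(index, size) from <linux/nospec.h> evaluates to index when index < size and to 0 otherwise, and does so with a branchless mask, so the result holds even while the CPU is executing speculatively. Applied in translate_desc() as above, the offset added to node->userspace_addr is confined to [0, node->size), so the resulting userspace pointer cannot point outside the region allowed by the memory table. A condensed, hypothetical helper (not in either patch) capturing just that step:

/* Illustrative only: the clamped address computation from the hunk above,
 * pulled out into a hypothetical helper for clarity.
 */
static inline void __user *vhost_node_uaddr_nospec(u64 userspace_addr,
						   u64 node_start,
						   u64 node_size,
						   u64 addr)
{
	/* The memory table lookup has already established that
	 * node_start <= addr < node_start + node_size; the clamp keeps
	 * the computed offset in range under speculation as well.
	 */
	u64 offset = array_index_nospec(addr - node_start, node_size);

	return (void __user *)(unsigned long)(userspace_addr + offset);
}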