Date: Mon, 9 Sep 2019 00:45:40 -0400
From: "Michael S. Tsirkin"
To: Jason Wang
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
 netdev@vger.kernel.org, linux-kernel@vger.kernel.org, jgg@mellanox.com,
 aarcange@redhat.com, jglisse@redhat.com, linux-mm@kvack.org,
 James Bottomley, Christoph Hellwig, David Miller,
 linux-arm-kernel@lists.infradead.org, linux-parisc@vger.kernel.org
Subject: Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through
 kernel virtual address
Message-ID: <20190909004504-mutt-send-email-mst@kernel.org>
References: <20190905122736.19768-1-jasowang@redhat.com>
 <20190905122736.19768-3-jasowang@redhat.com>
 <20190908063618-mutt-send-email-mst@kernel.org>
 <1cb5aa8d-6213-5fce-5a77-fcada572c882@redhat.com>
In-Reply-To: <1cb5aa8d-6213-5fce-5a77-fcada572c882@redhat.com>

On Mon, Sep 09, 2019 at 10:18:57AM +0800, Jason Wang wrote:
> 
> On 2019/9/8 7:05 PM, Michael S. Tsirkin wrote:
> > On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> > > This is a rework of commit 7f466032dc9e ("vhost: access vq
> > > metadata through kernel virtual address").
> > >
> > > It was noticed that the copy_to/from_user() friends used to
> > > access virtqueue metadata tend to be very expensive for dataplane
> > > implementations like vhost, since they involve lots of software checks,
> > > speculation barriers,
> > 
> > So if we drop the speculation barrier,
> > there's a problem here in that the access will now be speculated.
> > This effectively disables the defence-in-depth effect of
> > b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
> >     x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec
> > 
> > So now we need to sprinkle array_index_nospec or barrier_nospec over the
> > code whenever we use an index we got from userspace.
> > See below for some examples.
> > 
> > > hardware feature toggling (e.g. SMAP). The
> > > extra cost becomes more obvious when transferring small packets, since
> > > the time spent on metadata accesses becomes more significant.
> > >
> > > This patch tries to eliminate those overheads by accessing the metadata
> > > through a direct mapping of those pages. Invalidation callbacks are
> > > implemented for co-operation with general VM management (swap, KSM,
> > > THP or NUMA balancing). We will try to get the direct mapping of vq
> > > metadata before each round of packet processing if it doesn't
> > > exist.
> > > If we fail, we will simply fall back to the copy_to/from_user()
> > > friends.
> > >
> > > This invalidation, direct mapping access and set are synchronized
> > > through a spinlock. This takes a step back from the original commit
> > > 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> > > address"), which tried to use RCU, which is tricky and hard to
> > > review. This won't perform as well as RCU because of the atomics;
> > > that could be addressed by future optimization.
> > >
> > > This method does not work for highmem pages, which require a
> > > temporary mapping, so for those we just fall back to the normal
> > > copy_to/from_user(), nor does it work for arches with virtually
> > > tagged caches, since extra cache flushing would be needed to
> > > eliminate the alias, resulting in complex logic and bad performance.
> > > For those arches, this patch simply goes for the copy_to/from_user()
> > > friends. This is done by ruling out the kernel mapping code through
> > > ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> > >
> > > Note that this is only done when device IOTLB is not enabled. We
> > > could use a similar method to optimize IOTLB in the future.
> > >
> > > Tests show at most about a 22% improvement in TX PPS when using
> > > virtio-user + vhost_net + xdp1 + TAP on a 4.0GHz Kaby Lake.
> > >
> > >         SMAP on | SMAP off
> > > Before: 4.9Mpps | 6.9Mpps
> > > After:  6.0Mpps | 7.5Mpps
> > >
> > > On an older Sandy Bridge CPU without SMAP support, TX PPS doesn't see
> > > any difference.
> > 
> > Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?
> 
> I don't know, I guess it was because the atomic is l
> 
> > > Cc: Andrea Arcangeli
> > > Cc: James Bottomley
> > > Cc: Christoph Hellwig
> > > Cc: David Miller
> > > Cc: Jerome Glisse
> > > Cc: Jason Gunthorpe
> > > Cc: linux-mm@kvack.org
> > > Cc: linux-arm-kernel@lists.infradead.org
> > > Cc: linux-parisc@vger.kernel.org
> > > Signed-off-by: Jason Wang
> > > Signed-off-by: Michael S. Tsirkin
> > > ---
> > >  drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
> > >  drivers/vhost/vhost.h |  41 ++++
> > >  2 files changed, 589 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 791562e03fe0..f98155f28f02 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
> > >  		__vhost_vq_meta_reset(d->vqs[i]);
> > >  }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_map_unprefetch(struct vhost_map *map)
> > > +{
> > > +	kfree(map->pages);
> > > +	kfree(map);
> > > +}
> > > +
> > > +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> > > +				struct vhost_map *map, int index)
> > > +{
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	int i;
> > > +
> > > +	if (uaddr->write) {
> > > +		for (i = 0; i < map->npages; i++)
> > > +			set_page_dirty(map->pages[i]);
> > > +	}
> > > +}
> > > +
> > > +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > +	struct vhost_map *map[VHOST_NUM_ADDRS];
> > > +	int i;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> > > +		map[i] = vq->maps[i];
> > > +		if (map[i]) {
> > > +			vhost_set_map_dirty(vq, map[i], i);
> > > +			vq->maps[i] = NULL;
> > > +		}
> > > +	}
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	/* No need for synchronization since we are serialized with
> > > +	 * memory accessors (e.g vq mutex held).
> > > +	 */
> > > +
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > +		if (map[i])
> > > +			vhost_map_unprefetch(map[i]);
> > > +
> > > +}
> > > +
> > > +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> > > +{
> > > +	int i;
> > > +
> > > +	vhost_uninit_vq_maps(vq);
> > > +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> > > +		vq->uaddrs[i].size = 0;
> > > +}
> > > +
> > > +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> > > +				    unsigned long start,
> > > +				    unsigned long end)
> > > +{
> > > +	if (unlikely(!uaddr->size))
> > > +		return false;
> > > +
> > > +	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> > > +{
> > > +	spin_lock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> > > +{
> > > +	spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> > > +				     int index,
> > > +				     unsigned long start,
> > > +				     unsigned long end,
> > > +				     bool blockable)
> > > +{
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	struct vhost_map *map;
> > > +
> > > +	if (!vhost_map_range_overlap(uaddr, start, end))
> > > +		return 0;
> > > +	else if (!blockable)
> > > +		return -EAGAIN;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	++vq->invalidate_count;
> > > +
> > > +	map = vq->maps[index];
> > > +	if (map)
> > > +		vq->maps[index] = NULL;
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	if (map) {
> > > +		vhost_set_map_dirty(vq, map, index);
> > > +		vhost_map_unprefetch(map);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> > > +				    int index,
> > > +				    unsigned long start,
> > > +				    unsigned long end)
> > > +{
> > > +	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
> > > +		return;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +	--vq->invalidate_count;
> > > +	spin_unlock(&vq->mmu_lock);
> > > +}
> > > +
> > > +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> > > +					const struct mmu_notifier_range *range)
> > > +{
> > > +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > +					     mmu_notifier);
> > > +	bool blockable = mmu_notifier_range_blockable(range);
> > > +	int i, j, ret;
> > > +
> > > +	for (i = 0; i < dev->nvqs; i++) {
> > > +		struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> > > +			ret = vhost_invalidate_vq_start(vq, j,
> > > +							range->start,
> > > +							range->end, blockable);
> > > +			if (ret)
> > > +				return ret;
> > > +		}
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
> > > +				       const struct mmu_notifier_range *range)
> > > +{
> > > +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> > > +					     mmu_notifier);
> > > +	int i, j;
> > > +
> > > +	for (i = 0; i < dev->nvqs; i++) {
> > > +		struct vhost_virtqueue *vq = dev->vqs[i];
> > > +
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > +			vhost_invalidate_vq_end(vq, j,
> > > +						range->start,
> > > +						range->end);
> > > +	}
> > > +}
> > > +
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > +	.invalidate_range_start = vhost_invalidate_range_start,
> > > +	.invalidate_range_end = vhost_invalidate_range_end,
> > > +};
> > > +
> > > +static void vhost_init_maps(struct vhost_dev *dev)
> > > +{
> > > +	struct vhost_virtqueue *vq;
> > > +	int i, j;
> > > +
> > > +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> > > +
> > > +	for (i = 0; i < dev->nvqs; ++i) {
> > > +		vq = dev->vqs[i];
> > > +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> > > +			vq->maps[j] = NULL;
> > > +	}
> > > +}
> > > +#endif
> > > +
> > >  static void vhost_vq_reset(struct vhost_dev *dev,
> > >  			   struct vhost_virtqueue *vq)
> > >  {
> > > @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > >  	vq->busyloop_timeout = 0;
> > >  	vq->umem = NULL;
> > >  	vq->iotlb = NULL;
> > > +	vq->invalidate_count = 0;
> > >  	__vhost_vq_meta_reset(vq);
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	vhost_reset_vq_maps(vq);
> > > +#endif
> > >  }
> > >  static int vhost_worker(void *data)
> > > @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >  	dev->iov_limit = iov_limit;
> > >  	dev->weight = weight;
> > >  	dev->byte_weight = byte_weight;
> > > +	dev->has_notifier = false;
> > >  	init_llist_head(&dev->work_list);
> > >  	init_waitqueue_head(&dev->wait);
> > >  	INIT_LIST_HEAD(&dev->read_list);
> > >  	INIT_LIST_HEAD(&dev->pending_list);
> > >  	spin_lock_init(&dev->iotlb_lock);
> > > -
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	vhost_init_maps(dev);
> > > +#endif
> > >  	for (i = 0; i < dev->nvqs; ++i) {
> > >  		vq = dev->vqs[i];
> > > @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> > >  		vq->heads = NULL;
> > >  		vq->dev = dev;
> > >  		mutex_init(&vq->mutex);
> > > +		spin_lock_init(&vq->mmu_lock);
> > >  		vhost_vq_reset(dev, vq);
> > >  		if (vq->handle_kick)
> > >  			vhost_poll_init(&vq->poll, vq->handle_kick,
> > > @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > >  	if (err)
> > >  		goto err_cgroup;
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> > > +	if (err)
> > > +		goto err_mmu_notifier;
> > > +#endif
> > > +	dev->has_notifier = true;
> > > +
> > >  	return 0;
> > > +
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +err_mmu_notifier:
> > > +	vhost_dev_free_iovecs(dev);
> > > +#endif
> > >  err_cgroup:
> > >  	kthread_stop(worker);
> > >  	dev->worker = NULL;
> > > @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> > >  	spin_unlock(&dev->iotlb_lock);
> > >  }
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
> > > +			      int index, unsigned long uaddr,
> > > +			      size_t size, bool write)
> > > +{
> > > +	struct vhost_uaddr *addr = &vq->uaddrs[index];
> > > +
> > > +	addr->uaddr = uaddr;
> > > +	addr->size = size;
> > > +	addr->write = write;
> > > +}
> > > +
> > > +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> > > +{
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> > > +			  (unsigned long)vq->desc,
> > > +			  vhost_get_desc_size(vq, vq->num),
> > > +			  false);
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
> > > +			  (unsigned long)vq->avail,
> > > +			  vhost_get_avail_size(vq, vq->num),
> > > +			  false);
> > > +	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
> > > +			  (unsigned long)vq->used,
> > > +			  vhost_get_used_size(vq, vq->num),
> > > +			  true);
> > > +}
> > > +
> > > +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> > > +			      int index)
> > > +{
> > > +	struct vhost_map *map;
> > > +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> > > +	struct page **pages;
> > > +	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
> > > +	int npinned;
> > > +	void *vaddr, *v;
> > > +	int err;
> > > +	int i;
> > > +
> > > +	spin_lock(&vq->mmu_lock);
> > > +
> > > +	err = -EFAULT;
> > > +	if (vq->invalidate_count)
> > > +		goto err;
> > > +
> > > +	err = -ENOMEM;
> > > +	map = kmalloc(sizeof(*map), GFP_ATOMIC);
> > > +	if (!map)
> > > +		goto err;
> > > +
> > > +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
> > > +	if (!pages)
> > > +		goto err_pages;
> > > +
> > > +	err = EFAULT;
> > > +	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> > > +					uaddr->write, pages);
> > > +	if (npinned > 0)
> > > +		release_pages(pages, npinned);
> > > +	if (npinned != npages)
> > > +		goto err_gup;
> > > +
> > > +	for (i = 0; i < npinned; i++)
> > > +		if (PageHighMem(pages[i]))
> > > +			goto err_gup;
> > > +
> > > +	vaddr = v = page_address(pages[0]);
> > > +
> > > +	/* For simplicity, fallback to userspace address if VA is not
> > > +	 * contiguous.
> > > +	 */
> > > +	for (i = 1; i < npinned; i++) {
> > > +		v += PAGE_SIZE;
> > > +		if (v != page_address(pages[i]))
> > > +			goto err_gup;
> > > +	}
> > > +
> > > +	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
> > > +	map->npages = npages;
> > > +	map->pages = pages;
> > > +
> > > +	vq->maps[index] = map;
> > > +	/* No need for a synchronize_rcu(). This function should be
> > > +	 * called by dev->worker so we are serialized with all
> > > +	 * readers.
> > > +	 */
> > > +	spin_unlock(&vq->mmu_lock);
> > > +
> > > +	return 0;
> > > +
> > > +err_gup:
> > > +	kfree(pages);
> > > +err_pages:
> > > +	kfree(map);
> > > +err:
> > > +	spin_unlock(&vq->mmu_lock);
> > > +	return err;
> > > +}
> > > +#endif
> > > +
> > >  void vhost_dev_cleanup(struct vhost_dev *dev)
> > >  {
> > >  	int i;
> > > @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >  		kthread_stop(dev->worker);
> > >  		dev->worker = NULL;
> > >  	}
> > > -	if (dev->mm)
> > > +	if (dev->mm) {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +		if (dev->has_notifier) {
> > > +			mmu_notifier_unregister(&dev->mmu_notifier,
> > > +						dev->mm);
> > > +			dev->has_notifier = false;
> > > +		}
> > > +#endif
> > >  		mmput(dev->mm);
> > > +	}
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	for (i = 0; i < dev->nvqs; i++)
> > > +		vhost_uninit_vq_maps(dev->vqs[i]);
> > > +#endif
> > >  	dev->mm = NULL;
> > >  }
> > >  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> > > @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
> > >  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			*((__virtio16 *)&used->ring[vq->num]) =
> > > +				cpu_to_vhost16(vq, vq->avail_idx);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> > >  			      vhost_avail_event(vq));
> > >  }
> > > @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > >  				 struct vring_used_elem *head, int idx,
> > >  				 int count)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +	size_t size;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			size = count * sizeof(*head);
> > > +			memcpy(used->ring + idx, head, size);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> > >  				  count * sizeof(*head));
> > >  }
> > > @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> > >  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> > >  			      &vq->used->flags);
> > >  }
> > > @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> > >  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> > >  			      &vq->used->idx);
> > >  }
> > > @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> > >  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> > >  				      __virtio16 *idx)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*idx = avail->idx;
> > 
> > index can now be speculated.
> > 
> > [...]
> > 
> > +		vhost_vq_access_map_begin(vq);
> > +
> > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > +		if (likely(map)) {
> > +			avail = map->addr;
> > +			*head = avail->ring[idx & (vq->num - 1)];
> > 
> > Since idx can be speculated, I guess we need array_index_nospec here?
> 
> So we have
> 
> ACQUIRE(mmu_lock)
> 
> get idx
> 
> RELEASE(mmu_lock)
> 
> ACQUIRE(mmu_lock)
> 
> read array[idx]
> 
> RELEASE(mmu_lock)
> 
> Then I think idx can't be speculated, considering we've passed RELEASE +
> ACQUIRE?

I don't think memory barriers have anything to do with speculation, they
are architectural.
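
For illustration, the kind of clamp being suggested for the direct-map
path would look roughly like the sketch below. The helper name is made up
and this is not part of the posted patch; array_index_nospec() itself is
the real primitive from include/linux/nospec.h:

	#include <linux/nospec.h>
	#include <linux/virtio_ring.h>

	/* Hypothetical helper: clamp a guest-controlled ring index before it
	 * is used to address the direct-mapped avail ring, so a speculated
	 * out-of-bounds value cannot steer the dependent load.
	 */
	static inline __virtio16 demo_fetch_avail_head(struct vring_avail *avail,
						       unsigned int idx,
						       unsigned int num)
	{
		unsigned int i = idx & (num - 1);	/* architectural bound */

		i = array_index_nospec(i, num);		/* speculation bound */
		return avail->ring[i];
	}

The mask with (num - 1) already bounds the value architecturally;
array_index_nospec() additionally forces a mispredicted out-of-range index
to 0 before the dependent load can be issued speculatively.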
> 
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_get_avail(vq, *head,
> > >  			       &vq->avail->ring[idx & (vq->num - 1)]);
> > >  }
> > > @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> > >  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > >  					__virtio16 *flags)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*flags = avail->flags;
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
> > >  }
> > >  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> > >  				       __virtio16 *event)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_avail *avail;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +		map = vq->maps[VHOST_ADDR_AVAIL];
> > > +		if (likely(map)) {
> > > +			avail = map->addr;
> > > +			*event = (__virtio16)avail->ring[vq->num];
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
> > >  }
> > >  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> > >  				     __virtio16 *idx)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_used *used;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_USED];
> > > +		if (likely(map)) {
> > > +			used = map->addr;
> > > +			*idx = used->idx;
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_get_used(vq, *idx, &vq->used->idx);
> > >  }
> > 
> > This seems to be used during init. Why do we bother
> > accelerating this?
> 
> Ok, I can remove this part in the next version.
> 
> > >  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> > >  				 struct vring_desc *desc, int idx)
> > >  {
> > > +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> > > +	struct vhost_map *map;
> > > +	struct vring_desc *d;
> > > +
> > > +	if (!vq->iotlb) {
> > > +		vhost_vq_access_map_begin(vq);
> > > +
> > > +		map = vq->maps[VHOST_ADDR_DESC];
> > > +		if (likely(map)) {
> > > +			d = map->addr;
> > > +			*desc = *(d + idx);
> > 
> > Since idx can be speculated, I guess we need array_index_nospec here?
> 
> This is similar to the above avail idx case.
> 
> > > +			vhost_vq_access_map_end(vq);
> > > +			return 0;
> > > +		}
> > > +
> > > +		vhost_vq_access_map_end(vq);
> > > +	}
> > > +#endif
> > > +
> > >  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> > >  }
> > 
> > I also wonder about the userspace address we get eventually.
> > It would seem that we need to prevent that from speculating -
> > and that seems like a good idea even if this patch isn't
> > applied. As you are playing with micro-benchmarks, maybe
> > you could try the below patch?
> 
> Let me test it.
> 
> Thanks
> 
> > It's unfortunately untested.
> > Thanks a lot in advance!
> > 
> > ===>
> > 
> > vhost: block speculation of translated descriptors
> > 
> > iovec addresses coming from vhost are assumed to be
> > pre-validated, but in fact can be speculated to a value
> > out of range.
> > 
> > Userspace addresses are later validated with array_index_nospec so we can
> > be sure kernel info does not leak through these addresses, but vhost
> > must also not leak userspace info outside the allowed memory table to
> > guests.
> > 
> > Following the defence-in-depth principle, make sure
> > the address is not validated out of node range.
> > 
> > Signed-off-by: Michael S. Tsirkin
> > 
> > ---
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 5dc174ac8cac..863e25011ef6 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
> >  			size = node->size - addr + node->start;
> >  			_iov->iov_len = min((u64)len - s, size);
> >  			_iov->iov_base = (void __user *)(unsigned long)
> > -				(node->userspace_addr + addr - node->start);
> > +				(node->userspace_addr +
> > +				 array_index_nospec(addr - node->start,
> > +						    node->size));
> >  			s += size;
> >  			addr += size;
> >  			++ret;
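
For completeness: array_index_nospec(index, size) clamps index to the
range [0, size) with a branchless, data-dependent mask, so that even when
a preceding bounds check is mispredicted, the value forwarded to the
dependent load is 0 rather than something attacker-controlled. A rough
sketch of the generic fallback, written from memory (the helpers below are
demo names; include/linux/nospec.h and the arch-specific
array_index_mask_nospec() implementations are the authoritative code):

	#include <linux/bits.h>

	/* Evaluates to ~0UL when index < size and to 0UL otherwise,
	 * without a conditional branch the CPU could mispredict.
	 */
	static inline unsigned long demo_index_mask_nospec(unsigned long index,
							   unsigned long size)
	{
		return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
	}

	/* Clamp: index is unchanged when in range, forced to 0 otherwise. */
	static inline unsigned long demo_index_nospec(unsigned long index,
						      unsigned long size)
	{
		return index & demo_index_mask_nospec(index, size);
	}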