Date: Tue, 17 Apr 2018 05:17:41 +0300
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang
Cc: Tiwei Bie, wexu@redhat.com, virtualization@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org, jfreimann@redhat.com
Subject: Re: [RFC v2] virtio: support packed ring
Message-ID: <20180417051343-mutt-send-email-mst@kernel.org>
References: <20180401141216.8969-1-tiwei.bie@intel.com>
	<20180413071529.f4esh654dakodf4f@debian>
	<8dee7d62-ac0b-54ba-6bec-4bc4a6fb34e9@redhat.com>
In-Reply-To: <8dee7d62-ac0b-54ba-6bec-4bc4a6fb34e9@redhat.com>

On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> 
> 
> On 04/13/2018 15:15, Tiwei Bie wrote:
> > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > On 04/01/2018 22:12, Tiwei Bie wrote:
> > > > Hello everyone,
> > > >
> > > > This RFC implements packed ring support for the virtio driver.
> > > >
> > > > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > > > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > > > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > > >
> > > > TODO:
> > > > - Refinements and bug fixes;
> > > > - Split into small patches;
> > > > - Test indirect descriptor support;
> > > > - Test/fix event suppression support;
> > > > - Test devices other than net;
> > > >
> > > > RFC v1 -> RFC v2:
> > > > - Add indirect descriptor support - compile test only;
> > > > - Add event suppression support - compile test only;
> > > > - Move vring_packed_init() out of uapi (Jason, MST);
> > > > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > > > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > > > - Avoid using '%' operator (Jason);
> > > > - Rename free_head -> next_avail_idx (Jason);
> > > > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > > > - Some other refinements and bug fixes;
> > > >
> > > > Thanks!
> > > >
> > > > Signed-off-by: Tiwei Bie
> > > > ---
> > > >  drivers/virtio/virtio_ring.c       | 1094 +++++++++++++++++++++++++++++-------
> > > >  include/linux/virtio_ring.h        |    8 +-
> > > >  include/uapi/linux/virtio_config.h |   12 +-
> > > >  include/uapi/linux/virtio_ring.h   |   61 ++
> > > >  4 files changed, 980 insertions(+), 195 deletions(-)
> > [...]
> > > > +static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
> > > > +						       unsigned int total_sg,
> > > > +						       gfp_t gfp)
> > > > +{
> > > > +	struct vring_packed_desc *desc;
> > > > +
> > > > +	/*
> > > > +	 * We require lowmem mappings for the descriptors because
> > > > +	 * otherwise virt_to_phys will give us bogus addresses in the
> > > > +	 * virtqueue.
> > > > +	 */
> > > > +	gfp &= ~__GFP_HIGHMEM;
> > > > +
> > > > +	desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> > > Can we simply check vq->packed here to avoid duplicating helpers?
> > Then it would be something like this:
> >
> > static void *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg,
> > 			    gfp_t gfp)
> > {
> > 	struct vring_virtqueue *vq = to_vvq(_vq);
> > 	void *data;
> >
> > 	/*
> > 	 * We require lowmem mappings for the descriptors because
> > 	 * otherwise virt_to_phys will give us bogus addresses in the
> > 	 * virtqueue.
> > 	 */
> > 	gfp &= ~__GFP_HIGHMEM;
> >
> > 	if (vq->packed) {
> > 		data = kmalloc(total_sg * sizeof(struct vring_packed_desc),
> > 			       gfp);
> > 		if (!data)
> > 			return NULL;
> > 	} else {
> > 		struct vring_desc *desc;
> > 		unsigned int i;
> >
> > 		desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> > 		if (!desc)
> > 			return NULL;
> >
> > 		for (i = 0; i < total_sg; i++)
> > 			desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> >
> > 		data = desc;
> > 	}
> >
> > 	return data;
> > }
> >
> > I would prefer to have two simpler helpers (and to the callers,
> > it's already very clear which one they should call), i.e.
> > the current implementation:
> >
> > static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
> > 						       unsigned int total_sg,
> > 						       gfp_t gfp)
> > {
> > 	struct vring_packed_desc *desc;
> >
> > 	/*
> > 	 * We require lowmem mappings for the descriptors because
> > 	 * otherwise virt_to_phys will give us bogus addresses in the
> > 	 * virtqueue.
> > 	 */
> > 	gfp &= ~__GFP_HIGHMEM;
> >
> > 	desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
> >
> > 	return desc;
> > }
> >
> > static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> > 					       unsigned int total_sg,
> > 					       gfp_t gfp)
> > {
> > 	struct vring_desc *desc;
> > 	unsigned int i;
> >
> > 	/*
> > 	 * We require lowmem mappings for the descriptors because
> > 	 * otherwise virt_to_phys will give us bogus addresses in the
> > 	 * virtqueue.
> > 	 */
> > 	gfp &= ~__GFP_HIGHMEM;
> >
> > 	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> > 	if (!desc)
> > 		return NULL;
> >
> > 	for (i = 0; i < total_sg; i++)
> > 		desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> > 	return desc;
> > }
> 
> Yeah, I missed that the split version needs a desc list.
> 
> > > > +
> > > > +	return desc;
> > > > +}
> > [...]
> > > > +static inline int virtqueue_add_packed(struct virtqueue *_vq,
> > > > +				       struct scatterlist *sgs[],
> > > > +				       unsigned int total_sg,
> > > > +				       unsigned int out_sgs,
> > > > +				       unsigned int in_sgs,
> > > > +				       void *data,
> > > > +				       void *ctx,
> > > > +				       gfp_t gfp)
> > > > +{
> > > > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > > > +	struct vring_packed_desc *desc;
> > > > +	struct scatterlist *sg;
> > > > +	unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
> > > > +	__virtio16 uninitialized_var(head_flags), flags;
> > > > +	int head, wrap_counter;
> > > > +	bool indirect;
> > > > +
> > > > +	START_USE(vq);
> > > > +
> > > > +	BUG_ON(data == NULL);
> > > > +	BUG_ON(ctx && vq->indirect);
> > > > +
> > > > +	if (unlikely(vq->broken)) {
> > > > +		END_USE(vq);
> > > > +		return -EIO;
> > > > +	}
> > > > +
> > > > +#ifdef DEBUG
> > > > +	{
> > > > +		ktime_t now = ktime_get();
> > > > +
> > > > +		/* No kick or get, with .1 second between?  Warn. */
> > > > +		if (vq->last_add_time_valid)
> > > > +			WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
> > > > +				> 100);
> > > > +		vq->last_add_time = now;
> > > > +		vq->last_add_time_valid = true;
> > > > +	}
> > > > +#endif
> > > > +
> > > > +	BUG_ON(total_sg == 0);
> > > > +
> > > > +	head = vq->next_avail_idx;
> > > > +	wrap_counter = vq->wrap_counter;
> > > > +
> > > > +	/* If the host supports indirect descriptor tables, and we have multiple
> > > > +	 * buffers, then go indirect. FIXME: tune this threshold */
> > > > +	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> > > Let's introduce a helper like virtqueue_need_indirect() to avoid
> > > duplicating the code and the FIXME.
> > Okay.
> >
> > > > +		desc = alloc_indirect_packed(_vq, total_sg, gfp);
> > > > +	else {
> > > > +		desc = NULL;
> > > > +		WARN_ON_ONCE(total_sg > vq->vring_packed.num && !vq->indirect);
> > > > +	}
> > > > +
> > > > +	if (desc) {
> > > > +		/* Use a single buffer which doesn't continue */
> > > > +		indirect = true;
> > > > +		/* Set up rest to use this indirect table. */
> > > > +		i = 0;
> > > > +		descs_used = 1;
> > > > +	} else {
> > > > +		indirect = false;
> > > > +		desc = vq->vring_packed.desc;
> > > > +		i = head;
> > > > +		descs_used = total_sg;
> > > > +	}
> > > > +
> > > > +	if (vq->vq.num_free < descs_used) {
> > > > +		pr_debug("Can't add buf len %i - avail = %i\n",
> > > > +			 descs_used, vq->vq.num_free);
> > > > +		/* FIXME: for historical reasons, we force a notify here if
> > > > +		 * there are outgoing parts to the buffer.  Presumably the
> > > > +		 * host should service the ring ASAP. */
> > > > +		if (out_sgs)
> > > > +			vq->notify(&vq->vq);
> > > > +		if (indirect)
> > > > +			kfree(desc);
> > > > +		END_USE(vq);
> > > > +		return -ENOSPC;
> > > > +	}
> > > > +
> > > > +	for (n = 0; n < out_sgs + in_sgs; n++) {
> > > > +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> > > > +			dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> > > > +					DMA_TO_DEVICE : DMA_FROM_DEVICE);
> > > > +			if (vring_mapping_error(vq, addr))
> > > > +				goto unmap_release;
> > > > +
> > > > +			flags = cpu_to_virtio16(_vq->vdev, VRING_DESC_F_NEXT |
> > > > +				(n < out_sgs ? 0 : VRING_DESC_F_WRITE) |
> > > > +				VRING_DESC_F_AVAIL(vq->wrap_counter) |
> > > > +				VRING_DESC_F_USED(!vq->wrap_counter));
> > > > +			if (!indirect && i == head)
> > > > +				head_flags = flags;
> > > > +			else
> > > > +				desc[i].flags = flags;
> > > > +
> > > > +			desc[i].addr = cpu_to_virtio64(_vq->vdev, addr);
> > > > +			desc[i].len = cpu_to_virtio32(_vq->vdev, sg->length);
> > > > +			desc[i].id = cpu_to_virtio32(_vq->vdev, head);
> > > Similar to V1, we only need this for the last descriptor.
> > Okay, will just set it for the last desc.
> >
> > > > +			prev = i;
> > > It looks to me there's no need to track prev inside the loop here.
> > > > +			i++;
> > > > +			if (!indirect && i >= vq->vring_packed.num) {
> > > > +				i = 0;
> > > > +				vq->wrap_counter ^= 1;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +	/* Last one doesn't continue. */
> > > > +	if (total_sg == 1)
> > > > +		head_flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> > > > +	else
> > > > +		desc[prev].flags &= cpu_to_virtio16(_vq->vdev, ~VRING_DESC_F_NEXT);
> > > The only case when prev != i - 1 is i == 0, so we can add an if here.
> > It's just a mirror of the existing implementation in the split ring.
> > It seems the split ring implementation needs this just because
> > it's much harder for it to find the prev, which is not true for
> > the packed ring. So I'll take your suggestion. Thanks!
> >
> > [...]
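For reference, the helper Jason asks for could be as small as the sketch
below (virtqueue_need_indirect() is his suggested name; the body just
factors out the test that is currently duplicated along with its FIXME):

	/* Sketch only: hoists the duplicated "go indirect?" test,
	 * FIXME included, out of the add paths. */
	static inline bool virtqueue_need_indirect(struct vring_virtqueue *vq,
						   unsigned int total_sg)
	{
		/* FIXME: tune this threshold */
		return vq->indirect && total_sg > 1 && vq->vq.num_free;
	}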
> > > > +static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> > > > +{
> > > > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > > > +	u16 new, old, off_wrap;
> > > > +	bool needs_kick;
> > > > +
> > > > +	START_USE(vq);
> > > > +	/* We need to expose the new flags value before checking notification
> > > > +	 * suppressions. */
> > > > +	virtio_mb(vq->weak_barriers);
> > > > +
> > > > +	old = vq->next_avail_idx - vq->num_added;
> > > > +	new = vq->next_avail_idx;
> > > > +	vq->num_added = 0;
> > > > +
> > > > +#ifdef DEBUG
> > > > +	if (vq->last_add_time_valid) {
> > > > +		WARN_ON(ktime_to_ms(ktime_sub(ktime_get(),
> > > > +					      vq->last_add_time)) > 100);
> > > > +	}
> > > > +	vq->last_add_time_valid = false;
> > > > +#endif
> > > > +
> > > > +	off_wrap = virtio16_to_cpu(_vq->vdev, vq->vring_packed.device->off_wrap);
> > > > +
> > > > +	if (vq->event) {
> > > It looks to me we should examine RING_EVENT_FLAGS_DESC in desc_event_flags
> > > instead of vq->event here. The spec does not force the use of event_off and
> > > event_wrap if event index is enabled.
> > >
> > > > +		// FIXME: fix this!
> > > > +		needs_kick = ((off_wrap >> 15) == vq->wrap_counter) &&
> > > > +			     vring_need_event(off_wrap & ~(1<<15), new, old);
> > > Why is the & needed here?
> > Because wrap_counter (the most significant bit in off_wrap)
> > isn't part of the index.
> >
> > > > +	} else {
> > > Need an smp_rmb() to make sure desc_event_flags is checked before flags.
> > I don't get your point. If my understanding is correct,
> > desc_event_flags is vq->vring_packed.device->flags. So
> > what's the "flags"?
> 
> Sorry, I meant we need to check device.flags before off_wrap, so it needs
> an smp_rmb() in between.

It's best to just read all the flags atomically as a u32.

> It looks to me there's no guarantee that
> VRING_EVENT_F_DESC is set if event index is supported.
> 
> > > > +		needs_kick = (vq->vring_packed.device->flags !=
> > > > +			      cpu_to_virtio16(_vq->vdev, VRING_EVENT_F_DISABLE));
> > > > +	}
> > > > +	END_USE(vq);
> > > > +	return needs_kick;
> > > > +}
> > [...]
> > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > +			     void **ctx)
> > > > +{
> > > > +	struct vring_packed_desc *desc;
> > > > +	unsigned int i, j;
> > > > +
> > > > +	/* Clear data ptr. */
> > > > +	vq->desc_state[head].data = NULL;
> > > > +
> > > > +	i = head;
> > > > +
> > > > +	for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > +		desc = &vq->vring_packed.desc[i];
> > > > +		vring_unmap_one_packed(vq, desc);
> > > > +		desc->flags = 0x0;
> > > Looks like this is unnecessary.
> > It's safer to zero it. If we don't zero it, after we
> > call virtqueue_detach_unused_buf_packed() which calls
> > this function, the desc is still available to the
> > device.
> 
> Well, detach_unused_buf_packed() should be called after the device is
> stopped; otherwise, even if you try to clear it, there will still be a
> window in which the device may use it.
> 
> > > > +		i++;
> > > > +		if (i >= vq->vring_packed.num)
> > > > +			i = 0;
> > > > +	}
> > [...]
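To make the atomic-read point above concrete, here is a sketch (the field
order inside the device event structure is an assumption here; only the
"snapshot both 16-bit fields with one 32-bit load" idea matters):

	/* Sketch, not the RFC's code: if off_wrap and flags are two
	 * adjacent little-endian 16-bit fields, one aligned 32-bit load
	 * observes both consistently, so no smp_rmb() is needed between
	 * the two reads. */
	u32 snapshot = le32_to_cpu(READ_ONCE(*(__le32 *)vq->vring_packed.device));
	u16 off_wrap = snapshot & 0xffff;	/* assumed low half */
	u16 flags = snapshot >> 16;		/* assumed high half */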
> > > > +static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> > > > +{
> > > > +	struct vring_virtqueue *vq = to_vvq(_vq);
> > > > +	u16 last_used_idx, wrap_counter, off_wrap;
> > > > +
> > > > +	START_USE(vq);
> > > > +
> > > > +	last_used_idx = vq->last_used_idx;
> > > > +	wrap_counter = vq->wrap_counter;
> > > > +
> > > > +	if (last_used_idx > vq->next_avail_idx)
> > > > +		wrap_counter ^= 1;
> > > > +
> > > > +	off_wrap = last_used_idx | (wrap_counter << 15);
> > > > +
> > > > +	/* We optimistically turn back on interrupts, then check if there was
> > > > +	 * more to do. */
> > > > +	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
> > > > +	 * either clear the flags bit or point the event index at the next
> > > > +	 * entry. Always do both to keep code simple. */
> > > > +	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> > > > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > > > +						     VRING_EVENT_F_ENABLE;
> > > > +		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> > > > +							vq->event_flags_shadow);
> > > > +	}
> > > Is an smp_wmb() missing here?
> > >
> > > > +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev, off_wrap);
> > > And according to the spec, it looks to me that writing VRING_EVENT_F_ENABLE
> > > is sufficient here.
> > I didn't think it through when implementing the event suppression
> > for packed ring previously. After I saw your comments, I found
> > something new. Indeed, unlike the split ring, for the packed
> > ring, the spec doesn't say we must use VRING_EVENT_F_DESC when
> > EVENT_IDX is negotiated. So do you think the idea below is
> > right or makes sense?
> >
> > - For virtqueue_enable_cb_prepare(), we just need to enable
> >   the ring by setting flags to VRING_EVENT_F_ENABLE in any
> >   case.
> >
> > - We will try to use VRING_EVENT_F_DESC (if EVENT_IDX is
> >   negotiated) only when we want to delay the interrupts,
> >   i.e. in virtqueue_enable_cb_delayed().
> 
> This looks good to me.

I suspect this will lead to extra interrupts if the host is fast. So I
think for now we should always use VRING_EVENT_F_DESC if EVENT_IDX is
negotiated. VRING_EVENT_F_DISABLE makes more sense to me.

> > > > +	END_USE(vq);
> > > > +	return last_used_idx;
> > > > +}
> > > > +
> > [...]
> > > > @@ -1157,14 +1852,18 @@ void vring_transport_features(struct virtio_device *vdev)
> > > >  	for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
> > > >  		switch (i) {
> > > > -		case VIRTIO_RING_F_INDIRECT_DESC:
> > > > +#if 0
> > > > +		case VIRTIO_RING_F_INDIRECT_DESC: // FIXME: not tested yet.
> > > >  			break;
> > > > -		case VIRTIO_RING_F_EVENT_IDX:
> > > > +		case VIRTIO_RING_F_EVENT_IDX: // FIXME: probably does not work.
> > > >  			break;
> > > > +#endif
> > > It would be better if you could split EVENT_IDX and INDIRECT_DESC into
> > > separate patches too.
> > Sure. Will do it in the next version.
> >
> > Thanks for the review!
> 
> Thanks.
> 
> > Thanks
> 
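P.S. For anyone following the off_wrap arithmetic in the hunks above: the
encoding everybody assumes is that bit 15 carries the wrap counter and
bits 0-14 carry the ring index. As helpers (names invented here purely for
illustration):

	/* Sketch only: the off_wrap encoding the quoted code uses for
	 * driver->off_wrap and device->off_wrap. */
	static inline u16 off_wrap_pack(u16 idx, u16 wrap_counter)
	{
		return idx | (wrap_counter << 15);
	}

	static inline u16 off_wrap_idx(u16 off_wrap)
	{
		return off_wrap & ~(1 << 15);	/* strip the wrap bit */
	}

	static inline u16 off_wrap_counter(u16 off_wrap)
	{
		return off_wrap >> 15;
	}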