2014-06-26 02:09:05

by Ming Lei

Subject: [PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

Hi,

These patches add support for multiple virtual queues (multi-vq) in one
virtio-blk device, mapping each virtual queue (vq) to a blk-mq
hardware queue.

With this approach, both the scalability and the performance of a
virtio-blk device can be improved.

To verify the improvement, I implemented virtio-blk multi-vq on top of
qemu's dataplane feature; handling host notifications from each vq and
processing host I/O are both still kept in the per-device iothread
context. The change is based on the qemu v2.0.0 release and can be
accessed from the tree below:

git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1

To enable the multi-vq feature, 'num_queues=N' needs to be added to the
'-device virtio-blk-pci ...' part of the qemu command line, and I
suggest passing 'vectors=N+1' to keep one MSI irq vector per vq. The
feature depends on x-data-plane.
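
For example, with the patched qemu tree above (a sketch; the image
path and drive id are placeholders):

  -drive if=none,id=drive0,file=/path/to/disk.img,cache=none,aio=native \
  -device virtio-blk-pci,drive=drive0,x-data-plane=on,num_queues=2,vectors=3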

Fio (libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside the VM
to verify the improvement.
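
The fio invocation is roughly the following (a sketch; /dev/vdb is a
placeholder for the virtio-blk disk under test, and --numjobs matches
the jobs=N above):

  fio --name=randread --filename=/dev/vdb --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=64 --numjobs=2 --group_reporting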

I just created a small quad-core VM and ran fio inside it, with
num_queues of the virtio-blk device set to 2, but the improvement
still looks obvious.

1), about scalability
- without multi-vq feature
-- jobs=2, throughput: 145K iops
-- jobs=4, throughput: 100K iops
- with multi-vq feature
-- jobs=2, throughput: 193K iops
-- jobs=4, throughput: 202K iops

2), about throughput
- without multi-vq feature
-- throughput: 145K iops
- with multi-vq feature
-- throughput: 202K iops

So in my test, even for a quad-core VM, increasing the virtqueue
count from 1 to 2 improves both scalability and throughput
a lot.

TODO:
- adjust vq's irq smp_affinity according to blk-mq hw queue's cpumask
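
A rough sketch of what that TODO could look like (virtblk_vq_to_irq()
below is a hypothetical helper; virtio does not currently export a
vq-to-irq lookup):

/* Sketch only: bind each vq's irq to its blk-mq hw queue's CPUs.
 * virtblk_vq_to_irq() is hypothetical, not an existing API.
 */
static void virtblk_set_irq_affinity(struct virtio_blk *vblk)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int i;

	queue_for_each_hw_ctx(vblk->disk->queue, hctx, i)
		irq_set_affinity_hint(virtblk_vq_to_irq(vblk, i),
				      hctx->cpumask);
}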

V2: (suggestions from Michael and Dave Chinner)
- allocate virtqueues' pointers dynamically
- make sure the per-queue spinlock isn't kept in same cache line
- make each queue's name different

V1:
- remove RFC since no one objects
- add '__u8 unused' for pending as suggested by Rusty
- use virtio_cread_feature() directly, suggested by Rusty

Thanks,
--
Ming Lei


2014-06-26 02:09:17

by Ming Lei

Subject: [PATCH v2 1/2] include/uapi/linux/virtio_blk.h: introduce feature of VIRTIO_BLK_F_MQ

The current virtio-blk spec only supports one virtual queue for
transferring data between the VM and the host, and inside the VM every
kind of operation on that virtual queue needs to hold one lock, which
causes the problems below:

- bad scalability
- bad throughput

This patch introduces the VIRTIO_BLK_F_MQ feature so that more than
one virtual queue can be used by a virtio-blk device, which solves or
eases the problems above.

Signed-off-by: Ming Lei <[email protected]>
---
include/uapi/linux/virtio_blk.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 6d8e61c..9ad67b2 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -40,6 +40,7 @@
#define VIRTIO_BLK_F_WCE 9 /* Writeback mode enabled after reset */
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_CONFIG_WCE 11 /* Writeback mode available in config */
+#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */

#ifndef __KERNEL__
/* Old (deprecated) name for VIRTIO_BLK_F_WCE. */
@@ -77,6 +78,10 @@ struct virtio_blk_config {

/* writeback mode (if VIRTIO_BLK_F_CONFIG_WCE) */
__u8 wce;
+ __u8 unused;
+
+ /* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
+ __u16 num_queues;
} __attribute__((packed));

/*
--
1.7.9.5

2014-06-26 02:09:26

by Ming Lei

Subject: [PATCH v2 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

First, this patch supports more than one virtual queue per virtio-blk
device.

Second, it maps each virtual queue to a blk-mq hardware queue.

With this approach, both scalability and performance can be improved.

Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/virtio_blk.c | 109 ++++++++++++++++++++++++++++++++++++--------
1 file changed, 89 insertions(+), 20 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f63d358..b0a49a0 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -21,11 +21,14 @@ static DEFINE_IDA(vd_index_ida);

static struct workqueue_struct *virtblk_wq;

+struct virtio_blk_vq {
+ struct virtqueue *vq;
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
struct virtio_blk
{
struct virtio_device *vdev;
- struct virtqueue *vq;
- spinlock_t vq_lock;

/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -47,6 +50,10 @@ struct virtio_blk

/* Ida index - used to track minor number allocations. */
int index;
+
+ /* num of vqs */
+ int num_vqs;
+ struct virtio_blk_vq *vqs;
};

struct virtblk_req
@@ -133,14 +140,15 @@ static void virtblk_done(struct virtqueue *vq)
{
struct virtio_blk *vblk = vq->vdev->priv;
bool req_done = false;
+ int qid = vq->index;
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;

- spin_lock_irqsave(&vblk->vq_lock, flags);
+ spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
do {
virtqueue_disable_cb(vq);
- while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
+ while ((vbr = virtqueue_get_buf(vblk->vqs[qid].vq, &len)) != NULL) {
blk_mq_complete_request(vbr->req);
req_done = true;
}
@@ -151,7 +159,7 @@ static void virtblk_done(struct virtqueue *vq)
/* In case queue is stopped waiting for more buffers. */
if (req_done)
blk_mq_start_stopped_hw_queues(vblk->disk->queue, true);
- spin_unlock_irqrestore(&vblk->vq_lock, flags);
+ spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
}

static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
@@ -160,6 +168,7 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
unsigned long flags;
unsigned int num;
+ int qid = hctx->queue_num;
const bool last = (req->cmd_flags & REQ_END) != 0;
int err;
bool notify = false;
@@ -202,12 +211,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
}

- spin_lock_irqsave(&vblk->vq_lock, flags);
- err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+ spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
+ err = __virtblk_add_req(vblk->vqs[qid].vq, vbr, vbr->sg, num);
if (err) {
- virtqueue_kick(vblk->vq);
+ virtqueue_kick(vblk->vqs[qid].vq);
blk_mq_stop_hw_queue(hctx);
- spin_unlock_irqrestore(&vblk->vq_lock, flags);
+ spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
/* Out of mem doesn't actually happen, since we fall back
* to direct descriptors */
if (err == -ENOMEM || err == -ENOSPC)
@@ -215,12 +224,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
return BLK_MQ_RQ_QUEUE_ERROR;
}

- if (last && virtqueue_kick_prepare(vblk->vq))
+ if (last && virtqueue_kick_prepare(vblk->vqs[qid].vq))
notify = true;
- spin_unlock_irqrestore(&vblk->vq_lock, flags);
+ spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);

if (notify)
- virtqueue_notify(vblk->vq);
+ virtqueue_notify(vblk->vqs[qid].vq);
return BLK_MQ_RQ_QUEUE_OK;
}

@@ -377,12 +386,71 @@ static void virtblk_config_changed(struct virtio_device *vdev)
static int init_vq(struct virtio_blk *vblk)
{
int err = 0;
+ int i;
+ vq_callback_t **callbacks;
+ const char **names;
+ char *name_array;
+ struct virtqueue **vqs;
+ unsigned short num_vqs;
+ struct virtio_device *vdev = vblk->vdev;

- /* We expect one virtqueue, for output. */
- vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
- if (IS_ERR(vblk->vq))
- err = PTR_ERR(vblk->vq);
+ err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
+ struct virtio_blk_config, num_queues,
+ &num_vqs);
+ if (err)
+ num_vqs = 1;
+
+ vblk->vqs = kmalloc(sizeof(*vblk->vqs) * num_vqs, GFP_KERNEL);
+ if (!vblk->vqs) {
+ err = -ENOMEM;
+ goto out;
+ }

+ name_array = kmalloc(sizeof(char) * 32 * num_vqs, GFP_KERNEL);
+ if (!name_array)
+ goto err_name_array;
+
+ names = kmalloc(sizeof(*names) * num_vqs, GFP_KERNEL);
+ if (!names)
+ goto err_names;
+
+ callbacks = kmalloc(sizeof(*callbacks) * num_vqs, GFP_KERNEL);
+ if (!callbacks)
+ goto err_callbacks;
+
+ vqs = kmalloc(sizeof(*vqs) * num_vqs, GFP_KERNEL);
+ if (!vqs)
+ goto err_vqs;
+
+ for (i = 0; i < num_vqs; i++) {
+ callbacks[i] = virtblk_done;
+ snprintf(&name_array[i * 32], 32, "req.%d", i);
+ names[i] = &name_array[i * 32];
+ }
+
+ /* Discover virtqueues and write information to configuration. */
+ err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
+ if (err)
+ goto err_find_vqs;
+
+ for (i = 0; i < num_vqs; i++) {
+ spin_lock_init(&vblk->vqs[i].lock);
+ vblk->vqs[i].vq = vqs[i];
+ }
+ vblk->num_vqs = num_vqs;
+
+ err_find_vqs:
+ kfree(vqs);
+ err_vqs:
+ kfree(callbacks);
+ err_callbacks:
+ kfree(names);
+ err_names:
+ kfree(name_array);
+ err_name_array:
+ if (err)
+ kfree(vblk->vqs);
+ out:
return err;
}

@@ -551,7 +619,6 @@ static int virtblk_probe(struct virtio_device *vdev)
err = init_vq(vblk);
if (err)
goto out_free_vblk;
- spin_lock_init(&vblk->vq_lock);

/* FIXME: How many partitions? How long is a piece of string? */
vblk->disk = alloc_disk(1 << PART_BITS);
@@ -562,7 +629,7 @@ static int virtblk_probe(struct virtio_device *vdev)

/* Default queue sizing is to fill the ring. */
if (!virtblk_queue_depth) {
- virtblk_queue_depth = vblk->vq->num_free;
+ virtblk_queue_depth = vblk->vqs[0].vq->num_free;
/* ... but without indirect descs, we use 2 descs per req */
if (!virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC))
virtblk_queue_depth /= 2;
@@ -570,7 +637,6 @@ static int virtblk_probe(struct virtio_device *vdev)

memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
vblk->tag_set.ops = &virtio_mq_ops;
- vblk->tag_set.nr_hw_queues = 1;
vblk->tag_set.queue_depth = virtblk_queue_depth;
vblk->tag_set.numa_node = NUMA_NO_NODE;
vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
@@ -578,6 +644,7 @@ static int virtblk_probe(struct virtio_device *vdev)
sizeof(struct virtblk_req) +
sizeof(struct scatterlist) * sg_elems;
vblk->tag_set.driver_data = vblk;
+ vblk->tag_set.nr_hw_queues = vblk->num_vqs;

err = blk_mq_alloc_tag_set(&vblk->tag_set);
if (err)
@@ -727,6 +794,7 @@ static void virtblk_remove(struct virtio_device *vdev)
refc = atomic_read(&disk_to_dev(vblk->disk)->kobj.kref.refcount);
put_disk(vblk->disk);
vdev->config->del_vqs(vdev);
+ kfree(vblk->vqs);
kfree(vblk);

/* Only free device id if we don't have any users */
@@ -777,7 +845,8 @@ static const struct virtio_device_id id_table[] = {
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE, VIRTIO_BLK_F_SCSI,
- VIRTIO_BLK_F_WCE, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE
+ VIRTIO_BLK_F_WCE, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
+ VIRTIO_BLK_F_MQ,
};

static struct virtio_driver virtio_blk = {
--
1.7.9.5

2014-06-26 05:06:00

by Jens Axboe

Subject: Re: [PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

On 2014-06-25 20:08, Ming Lei wrote:
> Hi,
>
> These patches add support for multiple virtual queues (multi-vq) in one
> virtio-blk device, mapping each virtual queue (vq) to a blk-mq
> hardware queue.
>
> With this approach, both the scalability and the performance of a
> virtio-blk device can be improved.
>
> To verify the improvement, I implemented virtio-blk multi-vq on top of
> qemu's dataplane feature; handling host notifications from each vq and
> processing host I/O are both still kept in the per-device iothread
> context. The change is based on the qemu v2.0.0 release and can be
> accessed from the tree below:
>
> git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1
>
> To enable the multi-vq feature, 'num_queues=N' needs to be added to the
> '-device virtio-blk-pci ...' part of the qemu command line, and I
> suggest passing 'vectors=N+1' to keep one MSI irq vector per vq. The
> feature depends on x-data-plane.
>
> Fio (libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside the VM
> to verify the improvement.
>
> I just created a small quad-core VM and ran fio inside it, with
> num_queues of the virtio-blk device set to 2, but the improvement
> still looks obvious.
>
> 1), about scalability
> - without multi-vq feature
> -- jobs=2, throughput: 145K iops
> -- jobs=4, throughput: 100K iops
> - with multi-vq feature
> -- jobs=2, throughput: 193K iops
> -- jobs=4, throughput: 202K iops
>
> 2), about throughput
> - without multi-vq feature
> -- throughput: 145K iops
> - with multi-vq feature
> -- throughput: 202K iops

Of these numbers, I think it's important to highlight that the 2 thread
case is 33% faster and the 2 -> 4 thread case scales linearly (100%)
while the pre-patch case sees negative scaling going from 2 -> 4 threads
(-39%).

I haven't run your patches yet, but from looking at the code, it looks
good. It's pretty straightforward. So feel free to add my reviewed-by.

Rusty, do you want to ack this (and I'll slurp it up for 3.17) or take
this yourself? Or something else?


--
Jens Axboe

2014-06-26 05:28:18

by Ming Lei

Subject: Re: [PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

On Thu, Jun 26, 2014 at 1:05 PM, Jens Axboe <[email protected]> wrote:
> On 2014-06-25 20:08, Ming Lei wrote:
>>
>> Hi,
>>
>> These patches add support for multiple virtual queues (multi-vq) in one
>> virtio-blk device, mapping each virtual queue (vq) to a blk-mq
>> hardware queue.
>>
>> With this approach, both the scalability and the performance of a
>> virtio-blk device can be improved.
>>
>> To verify the improvement, I implemented virtio-blk multi-vq on top of
>> qemu's dataplane feature; handling host notifications from each vq and
>> processing host I/O are both still kept in the per-device iothread
>> context. The change is based on the qemu v2.0.0 release and can be
>> accessed from the tree below:
>>
>> git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1
>>
>> To enable the multi-vq feature, 'num_queues=N' needs to be added to the
>> '-device virtio-blk-pci ...' part of the qemu command line, and I
>> suggest passing 'vectors=N+1' to keep one MSI irq vector per vq. The
>> feature depends on x-data-plane.
>>
>> Fio (libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside the VM
>> to verify the improvement.
>>
>> I just created a small quad-core VM and ran fio inside it, with
>> num_queues of the virtio-blk device set to 2, but the improvement
>> still looks obvious.
>>
>> 1), about scalability
>> - without multi-vq feature
>> -- jobs=2, throughput: 145K iops
>> -- jobs=4, throughput: 100K iops
>> - with multi-vq feature
>> -- jobs=2, throughput: 193K iops
>> -- jobs=4, throughput: 202K iops
>>
>> 2), about throughput
>> - without multi-vq feature
>> -- throughput: 145K iops
>> - with multi-vq feature
>> -- throughput: 202K iops
>
>
> Of these numbers, I think it's important to highlight that the 2 thread case
> is 33% faster and the 2 -> 4 thread case scales linearly (100%) while the
> pre-patch case sees negative scaling going from 2 -> 4 threads (-39%).

This is because my qemu implementation of multi-vq uses only a
single iothread to handle requests from all vqs, and that iothread
is already at full load; notably, on the host side the same fio
test (single job) yields ~200K iops too.

>
> I haven't run your patches yet, but from looking at the code, it looks good.
> It's pretty straightforward. So feel free to add my reviewed-by.

Thanks a lot.

>
> Rusty, do you want to ack this (and I'll slurp it up for 3.17) or take this
> yourself? Or something else?

It would be great if this could be merged for 3.17.

Thanks,
--
Ming Lei

2014-06-26 07:45:41

by Michael S. Tsirkin

Subject: Re: [PATCH v2 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

On Thu, Jun 26, 2014 at 10:08:46AM +0800, Ming Lei wrote:
> First, this patch supports more than one virtual queue per virtio-blk
> device.
>
> Second, it maps each virtual queue to a blk-mq hardware queue.
>
> With this approach, both scalability and performance can be improved.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> drivers/block/virtio_blk.c | 109 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 89 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index f63d358..b0a49a0 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -21,11 +21,14 @@ static DEFINE_IDA(vd_index_ida);
>
> static struct workqueue_struct *virtblk_wq;
>
> +struct virtio_blk_vq {
> + struct virtqueue *vq;
> + spinlock_t lock;
> +} ____cacheline_aligned_in_smp;
> +

Padding wastes a hot cacheline here.
What about this patch I sent:

virtio-blk: move spinlock to vq itself

Signed-off-by: Michael S. Tsirkin <[email protected]>

Rusty didn't respond, try including it as 1/3 in your patchset
and we'll see if anyone objects?
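
Roughly, the idea is to let the core own a per-vq lock, so drivers
don't need a padded wrapper struct of their own (a sketch of the
concept, not the actual patch):

/* Sketch only: core-owned per-vq lock in struct virtqueue */
struct virtqueue {
	struct list_head list;
	void (*callback)(struct virtqueue *vq);
	const char *name;
	struct virtio_device *vdev;
	unsigned int index;
	unsigned int num_free;
	spinlock_t lock;	/* assumed: serializes add_buf/get_buf */
	void *priv;
};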


> + name_array = kmalloc(sizeof(char) * 32 * num_vqs, GFP_KERNEL);

sizeof(char) is 1, just drop it.
We don't do if (!NULL) just in case someone redefined it, either.

> + if (!name_array)
> + goto err_name_array;

You want vmalloc here, it will fail on high # of vqs, and speed
doesn't matter for names.

> +
> + names = kmalloc(sizeof(*names) * num_vqs, GFP_KERNEL);
> + if (!names)
> + goto err_names;
> +
> + callbacks = kmalloc(sizeof(*callbacks) * num_vqs, GFP_KERNEL);
> + if (!callbacks)
> + goto err_callbacks;
> +
> + vqs = kmalloc(sizeof(*vqs) * num_vqs, GFP_KERNEL);
> + if (!vqs)
> + goto err_vqs;
> +
> + for (i = 0; i < num_vqs; i++) {
> + callbacks[i] = virtblk_done;
> + snprintf(&name_array[i * 32], 32, "req.%d", i);

Eschew abbreviation. Call it requests.%d.

> + names[i] = &name_array[i * 32];
> + }

That 32 and pointer math hurts.
Please create a structure and use an array everywhere.


> +
> + /* Discover virtqueues and write information to configuration. */
> + err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
> + if (err)
> + goto err_find_vqs;
> +
> + for (i = 0; i < num_vqs; i++) {
> + spin_lock_init(&vblk->vqs[i].lock);
> + vblk->vqs[i].vq = vqs[i];
> + }
> + vblk->num_vqs = num_vqs;
> +
> + err_find_vqs:
> + kfree(vqs);
> + err_vqs:
> + kfree(callbacks);
> + err_callbacks:
> + kfree(names);
> + err_names:
> + kfree(name_array);

This one will cause use after free if vq names are later used, since
vring_new_virtqueue simply does
vq->vq.name = name;

You need to keep the memory around until unplug.



2014-06-26 07:46:29

by Michael S. Tsirkin

Subject: Re: [PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

On Wed, Jun 25, 2014 at 11:05:56PM -0600, Jens Axboe wrote:
> On 2014-06-25 20:08, Ming Lei wrote:
> >Hi,
> >
> >These patches add support for multiple virtual queues (multi-vq) in one
> >virtio-blk device, mapping each virtual queue (vq) to a blk-mq
> >hardware queue.
> >
> >With this approach, both the scalability and the performance of a
> >virtio-blk device can be improved.
> >
> >To verify the improvement, I implemented virtio-blk multi-vq on top of
> >qemu's dataplane feature; handling host notifications from each vq and
> >processing host I/O are both still kept in the per-device iothread
> >context. The change is based on the qemu v2.0.0 release and can be
> >accessed from the tree below:
> >
> > git://kernel.ubuntu.com/ming/qemu.git #v2.0.0-virtblk-mq.1
> >
> >To enable the multi-vq feature, 'num_queues=N' needs to be added to the
> >'-device virtio-blk-pci ...' part of the qemu command line, and I
> >suggest passing 'vectors=N+1' to keep one MSI irq vector per vq. The
> >feature depends on x-data-plane.
> >
> >Fio (libaio, randread, iodepth=64, bs=4K, jobs=N) is run inside the VM
> >to verify the improvement.
> >
> >I just created a small quad-core VM and ran fio inside it, with
> >num_queues of the virtio-blk device set to 2, but the improvement
> >still looks obvious.
> >
> >1), about scalability
> >- without multi-vq feature
> > -- jobs=2, throughput: 145K iops
> > -- jobs=4, throughput: 100K iops
> >- with multi-vq feature
> > -- jobs=2, throughput: 193K iops
> > -- jobs=4, throughput: 202K iops
> >
> >2), about throughput
> >- without multi-vq feature
> > -- throughput: 145K iops
> >- with multi-vq feature
> > -- throughput: 202K iops
>
> Of these numbers, I think it's important to highlight that the 2 thread case
> is 33% faster and the 2 -> 4 thread case scales linearly (100%) while the
> pre-patch case sees negative scaling going from 2 -> 4 threads (-39%).
>
> I haven't run your patches yet, but from looking at the code, it looks good.
> It's pretty straightforward. So feel free to add my reviewed-by.
>
> Rusty, do you want to ack this (and I'll slurp it up for 3.17)

Looks like I found some issues, so not yet pls.

> or take this
> yourself? Or something else?
>
>
> --
> Jens Axboe

2014-06-26 08:23:32

by Ming Lei

Subject: Re: [PATCH v2 2/2] block: virtio-blk: support multi virt queues per virtio-blk device

On Thu, Jun 26, 2014 at 3:45 PM, Michael S. Tsirkin <[email protected]> wrote:
> On Thu, Jun 26, 2014 at 10:08:46AM +0800, Ming Lei wrote:
>> First, this patch supports more than one virtual queue per virtio-blk
>> device.
>>
>> Second, it maps each virtual queue to a blk-mq hardware queue.
>>
>> With this approach, both scalability and performance can be improved.
>>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>> drivers/block/virtio_blk.c | 109 ++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 89 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index f63d358..b0a49a0 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -21,11 +21,14 @@ static DEFINE_IDA(vd_index_ida);
>>
>> static struct workqueue_struct *virtblk_wq;
>>
>> +struct virtio_blk_vq {
>> + struct virtqueue *vq;
>> + spinlock_t lock;
>> +} ____cacheline_aligned_in_smp;
>> +
>
> Padding wastes a hot cacheline here.
> What about this patch I sent:
>
> virtio-blk: move spinlock to vq itself
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
>
> Rusty didn't respond, try including it as 1/3 in your patchset
> and we'll see if anyone objects?

I think your patch is fine, but I'd like to follow the current virtio
vq locking convention here.

Your patch shouldn't be part of this patchset: since it introduces a
spinlock inside the vq itself, it also needs to convert the other
virtio devices' per-vq locks to the built-in lock, in your own
patchset.

>> + name_array = kmalloc(sizeof(char) * 32 * num_vqs, GFP_KERNEL);
>
> sizeof(char) is 1, just drop it.
> We don't do if (!NULL) just in case someone redefined it, either.
>
>> + if (!name_array)
>> + goto err_name_array;
>
> You want vmalloc here, it will fail on high # of vqs, and speed
> doesn't matter for names.

I don't think there should be lots of vqs:

- each virtio-blk device has only one disk, unlike virtio-scsi

- with aio, block I/O can easily reach its top throughput
with very few I/O threads

- for each I/O thread, just a few vqs can put it at full load
(in my test, 2 vqs can put one iothread at full load)

- more vqs mean more notifications and irqs, which hurt performance too

So I don't think we need to consider the huge-vq-count case until it
is proved necessary; after all, we have been running with one vq per
virtio-blk for a long time.

>
>> +
>> + names = kmalloc(sizeof(*names) * num_vqs, GFP_KERNEL);
>> + if (!names)
>> + goto err_names;
>> +
>> + callbacks = kmalloc(sizeof(*callbacks) * num_vqs, GFP_KERNEL);
>> + if (!callbacks)
>> + goto err_callbacks;
>> +
>> + vqs = kmalloc(sizeof(*vqs) * num_vqs, GFP_KERNEL);
>> + if (!vqs)
>> + goto err_vqs;
>> +
>> + for (i = 0; i < num_vqs; i++) {
>> + callbacks[i] = virtblk_done;
>> + snprintf(&name_array[i * 32], 32, "req.%d", i);
>
> Eschew abbreviation. Call it requests.%d.

I like short names because they fit in an 80-character column
when reading /proc/interrupts.

>
>> + names[i] = &name_array[i * 32];
>> + }
>
> That 32 and pointer math hurts.
> Please create a structure and use an array everywhere.

OK.

>
>
>> +
>> + /* Discover virtqueues and write information to configuration. */
>> + err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
>> + if (err)
>> + goto err_find_vqs;
>> +
>> + for (i = 0; i < num_vqs; i++) {
>> + spin_lock_init(&vblk->vqs[i].lock);
>> + vblk->vqs[i].vq = vqs[i];
>> + }
>> + vblk->num_vqs = num_vqs;
>> +
>> + err_find_vqs:
>> + kfree(vqs);
>> + err_vqs:
>> + kfree(callbacks);
>> + err_callbacks:
>> + kfree(names);
>> + err_names:
>> + kfree(name_array);
>
> This one will cause use after free if vq names are later used, since
> vring_new_virtqueue simply does
> vq->vq.name = name;
>
> You need to keep the memory around until unplug.

That is a bug; I will fix it.
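
Probably by embedding the name in the per-vq structure, so the storage
stays alive until remove; this also addresses your structure/array
comment (just a sketch of what v3 might do):

#define VQ_NAME_LEN 16

struct virtio_blk_vq {
	struct virtqueue *vq;
	spinlock_t lock;
	char name[VQ_NAME_LEN];	/* freed together with vblk->vqs */
} ____cacheline_aligned_in_smp;

	...
	for (i = 0; i < num_vqs; i++) {
		callbacks[i] = virtblk_done;
		snprintf(vblk->vqs[i].name, VQ_NAME_LEN, "req.%d", i);
		names[i] = vblk->vqs[i].name;
	}

That drops the separate name_array and the i * 32 pointer math.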


2014-07-09 01:13:22

by Rusty Russell

Subject: Re: [PATCH v2 0/2] block: virtio-blk: support multi vq per virtio-blk

Jens Axboe <[email protected]> writes:
> Rusty, do you want to ack this (and I'll slurp it up for 3.17) or take
> this yourself? Or something else?

I'm happy with the idea, and importantly, the new feature bit. So once
the implementation is tweaked, please add:

Acked-by: Rusty Russell <[email protected]>

The new feature bit takes us outside the current standard (which is
hopefully now frozen), but I'll just add it to the pile of things to
revisit for 1.1.

Thanks,
Rusty