2012-08-07 08:45:29

by Asias He

Subject: [PATCH V6 0/2] Improve virtio-blk performance

Hi, all

This version reworks the REQ_FLUSH and REQ_FUA support as suggested by
Christoph, and drops the block core bits since Jens has already picked them up.

Fio tests show that the bio-based IO path gives the following performance improvements:

1) Ramdisk device
With bio-based IO path, sequential read/write, random read/write
IOPS boost : 28%, 24%, 21%, 16%
Latency improvement: 32%, 17%, 21%, 16%
2) Fusion IO device
With bio-based IO path, sequential read/write, random read/write
IOPS boost : 11%, 11%, 13%, 10%
Latency improvement: 10%, 10%, 12%, 10%

Asias He (2):
virtio-blk: Add bio-based IO path for virtio-blk
virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

drivers/block/virtio_blk.c | 306 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 268 insertions(+), 38 deletions(-)

--
1.7.11.2


2012-08-07 08:45:36

by Asias He

Subject: [PATCH V6 1/2] virtio-blk: Add bio-based IO path for virtio-blk

This patch introduces a bio-based IO path for virtio-blk.

Compared to the request-based IO path, the bio-based IO path uses the
driver-provided ->make_request_fn() method to bypass the IO scheduler.
It hands the bio directly to the device without allocating a request in
the block layer. This shortens the IO path in the guest kernel, which
yields higher IOPS and lower latency. The downside is that the guest
cannot use the IO scheduler to merge and sort requests. However, this is
not a big problem if the backend disk on the host side is a fast device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, and no performance difference is observed.
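
Condensed, the probe-time switch between the two paths amounts to the
following sketch (distilled from the virtblk_probe() hunk in this patch;
example_setup_queue() is an illustrative wrapper only, and error handling
is trimmed):

static int example_setup_queue(struct virtio_blk *vblk)
{
	struct request_queue *q;

	/* Request-based path: IO passes through the scheduler and is
	 * drained from the queue by the request_fn, virtblk_request(). */
	q = blk_init_queue(virtblk_request, NULL);
	if (!q)
		return -ENOMEM;

	/* Bio-based path: override ->make_request_fn so that each bio
	 * is handed to the device directly, bypassing the scheduler. */
	if (use_bio)
		blk_queue_make_request(q, virtblk_make_request);

	q->queuedata = vblk;
	vblk->disk->queue = q;
	return 0;
}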

Performance evaluation:
-----------------------------
1) Fio test is performed in an 8-vcpu guest with a ramdisk-based backend,
using the kvm tool.

Short version:
With bio-based IO path, sequential read/write, random read/write
IOPS boost : 28%, 24%, 21%, 16%
Latency improvement: 32%, 17%, 21%, 16%

Long version:
With bio-based IO path:
seq-read : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
rand-write: io=3095.7MB, bw=96198KB/s, iops=192396 , runt= 32952msec
clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
With request-based IO path:
seq-read : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in an 8-vcpu guest with a Fusion-IO based
backend, using the kvm tool.

Short version:
With bio-based IO path, sequential read/write, random read/write
IOPS boost : 11%, 11%, 13%, 10%
Latency improvement: 10%, 10%, 12%, 10%

Long version:
With bio-based IO path:
read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
With request-based IO path:
read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

How to use:
-----------------------------
Add 'virtio_blk.use_bio=1' to the kernel cmdline, or load the module with
'modprobe virtio_blk use_bio=1', to enable the ->make_request_fn() based
I/O path.

Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Asias He <[email protected]>
Acked-by: Rusty Russell <[email protected]>
---
drivers/block/virtio_blk.c | 203 ++++++++++++++++++++++++++++++++++++---------
1 file changed, 163 insertions(+), 40 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index c0bbeb4..95cfeed 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -14,6 +14,9 @@

#define PART_BITS 4

+static bool use_bio;
+module_param(use_bio, bool, S_IRUGO);
+
static int major;
static DEFINE_IDA(vd_index_ida);

@@ -23,6 +26,7 @@ struct virtio_blk
{
struct virtio_device *vdev;
struct virtqueue *vq;
+ wait_queue_head_t queue_wait;

/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -51,53 +55,87 @@ struct virtio_blk
struct virtblk_req
{
struct request *req;
+ struct bio *bio;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
u8 status;
+ struct scatterlist sg[];
};

-static void blk_done(struct virtqueue *vq)
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+ switch (vbr->status) {
+ case VIRTIO_BLK_S_OK:
+ return 0;
+ case VIRTIO_BLK_S_UNSUPP:
+ return -ENOTTY;
+ default:
+ return -EIO;
+ }
+}
+
+static inline void virtblk_request_done(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
+{
+ struct request *req = vbr->req;
+ int error = virtblk_result(vbr);
+
+ if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
+ req->resid_len = vbr->in_hdr.residual;
+ req->sense_len = vbr->in_hdr.sense_len;
+ req->errors = vbr->in_hdr.errors;
+ } else if (req->cmd_type == REQ_TYPE_SPECIAL) {
+ req->errors = (error != 0);
+ }
+
+ __blk_end_request_all(req, error);
+ mempool_free(vbr, vblk->pool);
+}
+
+static inline void virtblk_bio_done(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
+{
+ bio_endio(vbr->bio, virtblk_result(vbr));
+ mempool_free(vbr, vblk->pool);
+}
+
+static void virtblk_done(struct virtqueue *vq)
{
struct virtio_blk *vblk = vq->vdev->priv;
+ unsigned long bio_done = 0, req_done = 0;
struct virtblk_req *vbr;
- unsigned int len;
unsigned long flags;
+ unsigned int len;

spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
- int error;
-
- switch (vbr->status) {
- case VIRTIO_BLK_S_OK:
- error = 0;
- break;
- case VIRTIO_BLK_S_UNSUPP:
- error = -ENOTTY;
- break;
- default:
- error = -EIO;
- break;
- }
-
- switch (vbr->req->cmd_type) {
- case REQ_TYPE_BLOCK_PC:
- vbr->req->resid_len = vbr->in_hdr.residual;
- vbr->req->sense_len = vbr->in_hdr.sense_len;
- vbr->req->errors = vbr->in_hdr.errors;
- break;
- case REQ_TYPE_SPECIAL:
- vbr->req->errors = (error != 0);
- break;
- default:
- break;
+ if (vbr->bio) {
+ virtblk_bio_done(vblk, vbr);
+ bio_done++;
+ } else {
+ virtblk_request_done(vblk, vbr);
+ req_done++;
}
-
- __blk_end_request_all(vbr->req, error);
- mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
- blk_start_queue(vblk->disk->queue);
+ if (req_done)
+ blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
+
+ if (bio_done)
+ wake_up(&vblk->queue_wait);
+}
+
+static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
+ gfp_t gfp_mask)
+{
+ struct virtblk_req *vbr;
+
+ vbr = mempool_alloc(vblk->pool, gfp_mask);
+ if (vbr && use_bio)
+ sg_init_table(vbr->sg, vblk->sg_elems);
+
+ return vbr;
}

static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -106,13 +144,13 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
unsigned long num, out = 0, in = 0;
struct virtblk_req *vbr;

- vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
+ vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
if (!vbr)
/* When another request finishes we'll try again. */
return false;

vbr->req = req;
-
+ vbr->bio = NULL;
if (req->cmd_flags & REQ_FLUSH) {
vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
vbr->out_hdr.sector = 0;
@@ -172,7 +210,8 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
}
}

- if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
+ if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
+ GFP_ATOMIC) < 0) {
mempool_free(vbr, vblk->pool);
return false;
}
@@ -180,7 +219,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
return true;
}

-static void do_virtblk_request(struct request_queue *q)
+static void virtblk_request(struct request_queue *q)
{
struct virtio_blk *vblk = q->queuedata;
struct request *req;
@@ -203,6 +242,82 @@ static void do_virtblk_request(struct request_queue *q)
virtqueue_kick(vblk->vq);
}

+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+ struct virtblk_req *vbr,
+ unsigned long out,
+ unsigned long in)
+{
+ DEFINE_WAIT(wait);
+
+ for (;;) {
+ prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ spin_lock_irq(vblk->disk->queue->queue_lock);
+ if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+ GFP_ATOMIC) < 0) {
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+ io_schedule();
+ } else {
+ virtqueue_kick(vblk->vq);
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+ break;
+ }
+
+ }
+
+ finish_wait(&vblk->queue_wait, &wait);
+}
+
+static void virtblk_make_request(struct request_queue *q, struct bio *bio)
+{
+ struct virtio_blk *vblk = q->queuedata;
+ unsigned int num, out = 0, in = 0;
+ struct virtblk_req *vbr;
+
+ BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
+ BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
+
+ vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+ if (!vbr) {
+ bio_endio(bio, -ENOMEM);
+ return;
+ }
+
+ vbr->bio = bio;
+ vbr->req = NULL;
+ vbr->out_hdr.type = 0;
+ vbr->out_hdr.sector = bio->bi_sector;
+ vbr->out_hdr.ioprio = bio_prio(bio);
+
+ sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+ num = blk_bio_map_sg(q, bio, vbr->sg + out);
+
+ sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+ sizeof(vbr->status));
+
+ if (num) {
+ if (bio->bi_rw & REQ_WRITE) {
+ vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+ out += num;
+ } else {
+ vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+ in += num;
+ }
+ }
+
+ spin_lock_irq(vblk->disk->queue->queue_lock);
+ if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+ GFP_ATOMIC) < 0)) {
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+ virtblk_add_buf_wait(vblk, vbr, out, in);
+ return;
+ }
+ virtqueue_kick(vblk->vq);
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+}
+
/* return id (s/n) string for *disk to *id_str
*/
static int virtblk_get_id(struct gendisk *disk, char *id_str)
@@ -360,7 +475,7 @@ static int init_vq(struct virtio_blk *vblk)
int err = 0;

/* We expect one virtqueue, for output. */
- vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
+ vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
if (IS_ERR(vblk->vq))
err = PTR_ERR(vblk->vq);

@@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
u8 writeback = virtblk_get_cache_mode(vdev);
struct virtio_blk *vblk = vdev->priv;

- if (writeback)
+ if (writeback && !use_bio)
blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
else
blk_queue_flush(vblk->disk->queue, 0);
@@ -477,6 +592,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
struct virtio_blk *vblk;
struct request_queue *q;
int err, index;
+ int pool_size;
+
u64 cap;
u32 v, blk_size, sg_elems, opt_io_size;
u16 min_io_size;
@@ -506,10 +623,12 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
goto out_free_index;
}

+ init_waitqueue_head(&vblk->queue_wait);
vblk->vdev = vdev;
vblk->sg_elems = sg_elems;
sg_init_table(vblk->sg, vblk->sg_elems);
mutex_init(&vblk->config_lock);
+
INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
vblk->config_enable = true;

@@ -517,7 +636,10 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
if (err)
goto out_free_vblk;

- vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+ pool_size = sizeof(struct virtblk_req);
+ if (use_bio)
+ pool_size += sizeof(struct scatterlist) * sg_elems;
+ vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
if (!vblk->pool) {
err = -ENOMEM;
goto out_free_vq;
@@ -530,12 +652,14 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
goto out_mempool;
}

- q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
+ q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
if (!q) {
err = -ENOMEM;
goto out_put_disk;
}

+ if (use_bio)
+ blk_queue_make_request(q, virtblk_make_request);
q->queuedata = vblk;

virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
@@ -620,7 +744,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);

-
add_disk(vblk->disk);
err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
if (err)
--
1.7.11.2

2012-08-07 08:45:35

by Asias He

Subject: [PATCH V6 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

We need to support both REQ_FLUSH and REQ_FUA in the bio-based path,
since it does not get the sequencing of REQ_FUA into REQ_FLUSH that
request-based drivers can request from the block layer.

REQ_FLUSH is emulated by:
A) If the bio has no data to write:
1. Send VIRTIO_BLK_T_FLUSH to the device.
2. In the flush I/O completion handler, finish the bio.

B) If the bio has data to write:
1. Send VIRTIO_BLK_T_FLUSH to the device.
2. In the flush I/O completion handler, send the actual write data to the device.
3. In the write I/O completion handler, finish the bio.

REQ_FUA is emulated by:
1. Send the actual write data to the device.
2. In the write I/O completion handler, send VIRTIO_BLK_T_FLUSH to the device.
3. In the flush I/O completion handler, finish the bio.
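
Condensed into a single completion handler, the sequencing above looks
like the sketch below. example_send_data() and example_send_flush() stand
in for this patch's virtblk_bio_send_data() and virtblk_bio_send_flush();
the real code defers the next step to a workqueue, since the completion
handler runs in interrupt context:

/* Sketch only: pick the next step when one stage of a bio completes. */
static void example_bio_stage_done(struct virtio_blk *vblk,
				   struct virtblk_req *vbr)
{
	if (vbr->is_flush && vbr->req_data) {
		/* Preflush finished (case B): send the actual write data. */
		example_send_data(vblk, vbr);
	} else if (!vbr->is_flush && vbr->req_fua) {
		/* FUA write finished: send the trailing flush. */
		example_send_flush(vblk, vbr);
	} else {
		/* No further stages: complete the original bio. */
		bio_endio(vbr->bio, virtblk_result(vbr));
	}
}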

Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: "Michael S. Tsirkin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Asias He <[email protected]>
---
drivers/block/virtio_blk.c | 259 ++++++++++++++++++++++++++++++++-------------
1 file changed, 183 insertions(+), 76 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 95cfeed..d33ea48 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -58,6 +58,12 @@ struct virtblk_req
struct bio *bio;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
+ struct work_struct work;
+ struct virtio_blk *vblk;
+ bool is_flush;
+ bool req_flush;
+ bool req_data;
+ bool req_fua;
u8 status;
struct scatterlist sg[];
};
@@ -74,6 +80,128 @@ static inline int virtblk_result(struct virtblk_req *vbr)
}
}

+static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
+ gfp_t gfp_mask)
+{
+ struct virtblk_req *vbr;
+
+ vbr = mempool_alloc(vblk->pool, gfp_mask);
+ if (vbr && use_bio)
+ sg_init_table(vbr->sg, vblk->sg_elems);
+
+ return vbr;
+}
+
+static void virtblk_add_buf_wait(struct virtio_blk *vblk,
+ struct virtblk_req *vbr,
+ unsigned long out,
+ unsigned long in)
+{
+ DEFINE_WAIT(wait);
+
+ for (;;) {
+ prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ spin_lock_irq(vblk->disk->queue->queue_lock);
+ if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+ GFP_ATOMIC) < 0) {
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+ io_schedule();
+ } else {
+ virtqueue_kick(vblk->vq);
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+ break;
+ }
+
+ }
+
+ finish_wait(&vblk->queue_wait, &wait);
+}
+
+static inline void virtblk_add_req(struct virtio_blk *vblk,
+ struct virtblk_req *vbr,
+ unsigned int out, unsigned int in)
+{
+ spin_lock_irq(vblk->disk->queue->queue_lock);
+ if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
+ GFP_ATOMIC) < 0)) {
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+ virtblk_add_buf_wait(vblk, vbr, out, in);
+ return;
+ }
+ virtqueue_kick(vblk->vq);
+ spin_unlock_irq(vblk->disk->queue->queue_lock);
+}
+
+static int virtblk_bio_send_flush(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
+{
+ unsigned int out = 0, in = 0;
+
+ vbr->is_flush = true;
+ vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
+ vbr->out_hdr.sector = 0;
+ vbr->out_hdr.ioprio = 0;
+ sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+ sg_set_buf(&vbr->sg[out + in++], &vbr->status, sizeof(vbr->status));
+
+ virtblk_add_req(vblk, vbr, out, in);
+
+ return 0;
+}
+
+static int virtblk_bio_send_data(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
+{
+ unsigned int num, out = 0, in = 0;
+ struct bio *bio = vbr->bio;
+
+ vbr->is_flush = false;
+ vbr->out_hdr.type = 0;
+ vbr->out_hdr.sector = bio->bi_sector;
+ vbr->out_hdr.ioprio = bio_prio(bio);
+
+ sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
+
+ num = blk_bio_map_sg(vblk->disk->queue, bio, vbr->sg + out);
+
+ sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
+ sizeof(vbr->status));
+
+ if (num) {
+ if (bio->bi_rw & REQ_WRITE) {
+ vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
+ out += num;
+ } else {
+ vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
+ in += num;
+ }
+ }
+
+ virtblk_add_req(vblk, vbr, out, in);
+
+ return 0;
+}
+
+static void virtblk_bio_send_data_work(struct work_struct *work)
+{
+ struct virtblk_req *vbr;
+
+ vbr = container_of(work, struct virtblk_req, work);
+
+ virtblk_bio_send_data(vbr->vblk, vbr);
+}
+
+static void virtblk_bio_send_flush_work(struct work_struct *work)
+{
+ struct virtblk_req *vbr;
+
+ vbr = container_of(work, struct virtblk_req, work);
+
+ virtblk_bio_send_flush(vbr->vblk, vbr);
+}
+
static inline void virtblk_request_done(struct virtio_blk *vblk,
struct virtblk_req *vbr)
{
@@ -92,13 +220,53 @@ static inline void virtblk_request_done(struct virtio_blk *vblk,
mempool_free(vbr, vblk->pool);
}

-static inline void virtblk_bio_done(struct virtio_blk *vblk,
- struct virtblk_req *vbr)
+static inline void virtblk_bio_done_flush(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
{
- bio_endio(vbr->bio, virtblk_result(vbr));
+ if (vbr->req_data) {
+ /* Send out the actual write data */
+ struct virtblk_req *_vbr;
+ _vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+ if (!_vbr) {
+ bio_endio(vbr->bio, -ENOMEM);
+ goto out;
+ }
+ _vbr->req_fua = vbr->req_fua;
+ _vbr->bio = vbr->bio;
+ _vbr->vblk = vblk;
+ INIT_WORK(&_vbr->work, virtblk_bio_send_data_work);
+ queue_work(virtblk_wq, &_vbr->work);
+ } else {
+ bio_endio(vbr->bio, virtblk_result(vbr));
+ }
+out:
mempool_free(vbr, vblk->pool);
}

+static inline void virtblk_bio_done_data(struct virtio_blk *vblk,
+ struct virtblk_req *vbr)
+{
+ if (unlikely(vbr->req_fua)) {
+ /* Send out a flush before end the bio */
+ struct virtblk_req *_vbr;
+ _vbr = virtblk_alloc_req(vblk, GFP_NOIO);
+ if (!_vbr) {
+ bio_endio(vbr->bio, -ENOMEM);
+ goto out;
+ }
+ _vbr->req_data = false;
+ _vbr->bio = vbr->bio;
+ _vbr->vblk = vblk;
+ INIT_WORK(&_vbr->work, virtblk_bio_send_flush_work);
+ queue_work(virtblk_wq, &_vbr->work);
+ } else {
+ bio_endio(vbr->bio, virtblk_result(vbr));
+ }
+out:
+ mempool_free(vbr, vblk->pool);
+}
+
+
static void virtblk_done(struct virtqueue *vq)
{
struct virtio_blk *vblk = vq->vdev->priv;
@@ -110,7 +278,10 @@ static void virtblk_done(struct virtqueue *vq)
spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
if (vbr->bio) {
- virtblk_bio_done(vblk, vbr);
+ if (unlikely(vbr->is_flush))
+ virtblk_bio_done_flush(vblk, vbr);
+ else
+ virtblk_bio_done_data(vblk, vbr);
bio_done++;
} else {
virtblk_request_done(vblk, vbr);
@@ -126,18 +297,6 @@ static void virtblk_done(struct virtqueue *vq)
wake_up(&vblk->queue_wait);
}

-static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
- gfp_t gfp_mask)
-{
- struct virtblk_req *vbr;
-
- vbr = mempool_alloc(vblk->pool, gfp_mask);
- if (vbr && use_bio)
- sg_init_table(vbr->sg, vblk->sg_elems);
-
- return vbr;
-}
-
static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
struct request *req)
{
@@ -242,41 +401,12 @@ static void virtblk_request(struct request_queue *q)
virtqueue_kick(vblk->vq);
}

-static void virtblk_add_buf_wait(struct virtio_blk *vblk,
- struct virtblk_req *vbr,
- unsigned long out,
- unsigned long in)
-{
- DEFINE_WAIT(wait);
-
- for (;;) {
- prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
- TASK_UNINTERRUPTIBLE);
-
- spin_lock_irq(vblk->disk->queue->queue_lock);
- if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
- GFP_ATOMIC) < 0) {
- spin_unlock_irq(vblk->disk->queue->queue_lock);
- io_schedule();
- } else {
- virtqueue_kick(vblk->vq);
- spin_unlock_irq(vblk->disk->queue->queue_lock);
- break;
- }
-
- }
-
- finish_wait(&vblk->queue_wait, &wait);
-}
-
static void virtblk_make_request(struct request_queue *q, struct bio *bio)
{
struct virtio_blk *vblk = q->queuedata;
- unsigned int num, out = 0, in = 0;
struct virtblk_req *vbr;

BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
- BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));

vbr = virtblk_alloc_req(vblk, GFP_NOIO);
if (!vbr) {
@@ -284,38 +414,15 @@ static void virtblk_make_request(struct request_queue *q, struct bio *bio)
return;
}

+ vbr->req_flush = !!(bio->bi_rw & REQ_FLUSH);
+ vbr->req_fua = !!(bio->bi_rw & REQ_FUA);
+ vbr->req_data = !!(bio->bi_size);
vbr->bio = bio;
- vbr->req = NULL;
- vbr->out_hdr.type = 0;
- vbr->out_hdr.sector = bio->bi_sector;
- vbr->out_hdr.ioprio = bio_prio(bio);
-
- sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
-
- num = blk_bio_map_sg(q, bio, vbr->sg + out);

- sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
- sizeof(vbr->status));
-
- if (num) {
- if (bio->bi_rw & REQ_WRITE) {
- vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
- out += num;
- } else {
- vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
- in += num;
- }
- }
-
- spin_lock_irq(vblk->disk->queue->queue_lock);
- if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
- GFP_ATOMIC) < 0)) {
- spin_unlock_irq(vblk->disk->queue->queue_lock);
- virtblk_add_buf_wait(vblk, vbr, out, in);
- return;
- }
- virtqueue_kick(vblk->vq);
- spin_unlock_irq(vblk->disk->queue->queue_lock);
+ if (unlikely(vbr->req_flush))
+ virtblk_bio_send_flush(vblk, vbr);
+ else
+ virtblk_bio_send_data(vblk, vbr);
}

/* return id (s/n) string for *disk to *id_str
@@ -529,7 +636,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
u8 writeback = virtblk_get_cache_mode(vdev);
struct virtio_blk *vblk = vdev->priv;

- if (writeback && !use_bio)
+ if (writeback)
blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
else
blk_queue_flush(vblk->disk->queue, 0);
--
1.7.11.2

2012-08-07 09:15:13

by Christoph Hellwig

Subject: Re: [PATCH V6 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

At least after review is done, I really think this patch should be folded
into the previous one.

Some more comments below:

> @@ -58,6 +58,12 @@ struct virtblk_req
> struct bio *bio;
> struct virtio_blk_outhdr out_hdr;
> struct virtio_scsi_inhdr in_hdr;
> + struct work_struct work;
> + struct virtio_blk *vblk;

I think using bio->bi_private for the virtio_blk pointer would
be cleaner.


> + bool is_flush;
> + bool req_flush;
> + bool req_data;
> + bool req_fua;

This could be a bitmap, or better even a single state field.
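
Presumably something like this (a hypothetical sketch of the suggested
state field, not code from the patch):

/* One state field instead of four bools; a request advances through
 * these states as each stage completes. */
enum virtblk_req_state {
	VBR_STATE_PREFLUSH,	/* flush in flight ahead of the data */
	VBR_STATE_DATA,		/* actual read/write in flight */
	VBR_STATE_POSTFLUSH,	/* flush in flight after the data (FUA) */
};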

> +static int virtblk_bio_send_flush(struct virtio_blk *vblk,
> + struct virtblk_req *vbr)

> +static int virtblk_bio_send_data(struct virtio_blk *vblk,
> + struct virtblk_req *vbr)

These should only get the struct virtblk_req * argument as the virtio_blk
structure is easily derivable from it.

> +static inline void virtblk_bio_done_flush(struct virtio_blk *vblk,
> + struct virtblk_req *vbr)
> {
> + if (vbr->req_data) {
> + /* Send out the actual write data */
> + struct virtblk_req *_vbr;
> + _vbr = virtblk_alloc_req(vblk, GFP_NOIO);
> + if (!_vbr) {
> + bio_endio(vbr->bio, -ENOMEM);
> + goto out;
> + }
> + _vbr->req_fua = vbr->req_fua;
> + _vbr->bio = vbr->bio;
> + _vbr->vblk = vblk;
> + INIT_WORK(&_vbr->work, virtblk_bio_send_data_work);
> + queue_work(virtblk_wq, &_vbr->work);

The _vbr naming isn't too nice. Also, can you explain in a comment here
why the original request can't be reused?

Also if using a state variable I think the whole code would be
a bit cleaner if the bio_done helpers are combined.

> - if (writeback && !use_bio)
> + if (writeback)
> blk_queue_flush(vblk->disk->queue, REQ_FLUSH);

Shouldn't this be REQ_FLUSH | REQ_FUA for the bio case?
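
Presumably something like this (a hypothetical sketch of the suggested
change, not code from the patch):

/* Sketch: advertise FUA support in addition to flush, so the block
 * layer can pass REQ_FUA through rather than emulating it. */
static void example_update_cache_mode(struct virtio_blk *vblk, u8 writeback)
{
	if (writeback)
		blk_queue_flush(vblk->disk->queue, REQ_FLUSH | REQ_FUA);
	else
		blk_queue_flush(vblk->disk->queue, 0);
}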

Btw, did you verify that flushes really work correctly for all cases
using tracing in qemu?

2012-08-07 09:16:33

by Christoph Hellwig

Subject: Re: [PATCH V6 0/2] Improve virtio-blk performance

On Tue, Aug 07, 2012 at 04:47:13PM +0800, Asias He wrote:
> 1) Ramdisk device
> With bio-based IO path, sequential read/write, random read/write
> IOPS boost : 28%, 24%, 21%, 16%
> Latency improvement: 32%, 17%, 21%, 16%
> 2) Fusion IO device
> With bio-based IO path, sequential read/write, random read/write
> IOPS boost : 11%, 11%, 13%, 10%
> Latency improvement: 10%, 10%, 12%, 10%

Do you also have numbers for normal SAS/SATA disks? The changelog should
have a really good explanation of why this is optional, and numbers are a
very important part of that.

2012-08-08 06:18:19

by Asias He

Subject: Re: [PATCH V6 2/2] virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path

On 08/07/2012 05:15 PM, Christoph Hellwig wrote:
> At least after review is done, I really think this patch should be folded
> into the previous one.

OK.

> Some more comments below:
>
>> @@ -58,6 +58,12 @@ struct virtblk_req
>> struct bio *bio;
>> struct virtio_blk_outhdr out_hdr;
>> struct virtio_scsi_inhdr in_hdr;
>> + struct work_struct work;
>> + struct virtio_blk *vblk;
>
> I think using bio->bi_private for the virtio_blk pointer would
> be cleaner.

I wish I could use bio->bi_private, but I am seeing this when using
bio->bi_private to store the virtio_blk pointer:


[ 1.100335] Call Trace:
[ 1.100335] <IRQ>
[ 1.100335] [<ffffffff811dd4b0>] ? end_bio_bh_io_sync+0x30/0x50
[ 1.100335] [<ffffffff811e167d>] bio_endio+0x1d/0x40
[ 1.100335] [<ffffffff81551fb2>] virtblk_done+0xa2/0x260
[ 1.100335] [<ffffffff813d714d>] vring_interrupt+0x2d/0x40
[ 1.100335] [<ffffffff81119c0d>] handle_irq_event_percpu+0x6d/0x210
[ 1.100335] [<ffffffff81119df1>] handle_irq_event+0x41/0x70
[ 1.100335] [<ffffffff8111d2c9>] handle_edge_irq+0x69/0x120
[ 1.100335] [<ffffffff810613a2>] handle_irq+0x22/0x30
[ 1.100335] [<ffffffff81aadccd>] do_IRQ+0x5d/0xe0
[ 1.100335] [<ffffffff81aa432f>] common_interrupt+0x6f/0x6f

end_bio_bh_io_sync() uses bio->bi_private:

struct buffer_head *bh = bio->bi_private;

>
>> + bool is_flush;
>> + bool req_flush;
>> + bool req_data;
>> + bool req_fua;
>
> This could be a bitmap, or better even a single state field.

Will use a bitmap for now.
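
For example (hypothetical names, sketch only; not the code that was
posted):

/* Flag bits packed into one word, replacing the four bools. */
#define VBR_IS_FLUSH	(1U << 0)	/* this descriptor is a flush */
#define VBR_REQ_FLUSH	(1U << 1)	/* original bio had REQ_FLUSH set */
#define VBR_REQ_DATA	(1U << 2)	/* original bio carries write data */
#define VBR_REQ_FUA	(1U << 3)	/* original bio had REQ_FUA set */

struct virtblk_req would then carry an "unsigned int flags" member and
test bits with, e.g., "if (vbr->flags & VBR_REQ_FUA)".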

>> +static int virtblk_bio_send_flush(struct virtio_blk *vblk,
>> + struct virtblk_req *vbr)
>
>> +static int virtblk_bio_send_data(struct virtio_blk *vblk,
>> + struct virtblk_req *vbr)
>
> These should only get the struct virtblk_req * argument as the virtio_blk
> structure is easily derivable from it.

Yes. Will clean it up.

>> +static inline void virtblk_bio_done_flush(struct virtio_blk *vblk,
>> + struct virtblk_req *vbr)
>> {
>> + if (vbr->req_data) {
>> + /* Send out the actual write data */
>> + struct virtblk_req *_vbr;
>> + _vbr = virtblk_alloc_req(vblk, GFP_NOIO);
>> + if (!_vbr) {
>> + bio_endio(vbr->bio, -ENOMEM);
>> + goto out;
>> + }
>> + _vbr->req_fua = vbr->req_fua;
>> + _vbr->bio = vbr->bio;
>> + _vbr->vblk = vblk;
>> + INIT_WORK(&_vbr->work, virtblk_bio_send_data_work);
>> + queue_work(virtblk_wq, &_vbr->work);
>
> The _vbr naming isn't too nice. Also, can you explain in a comment here
> why the original request can't be reused?

Ah, the original request can be reused. Will fix this.

> Also if using a state variable I think the whole code would be
> a bit cleaner if the bio_done helpers are combined.
>
>> - if (writeback && !use_bio)
>> + if (writeback)
>> blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
>
> Shouldn't this be REQ_FLUSH | REQ_FUA for the bio case?

Even without REQ_FUA here, I am seeing bios with the REQ_FUA flag set. Do
we need to set REQ_FUA explicitly?

> Btw, did you verify that flushes really work correctly for all cases
> using tracing in qemu?

I added some debug code in both the kernel and the kvm tool to verify the flush.

--
Asias

2012-08-08 06:22:11

by Asias He

Subject: Re: [PATCH V6 0/2] Improve virtio-blk performance

On 08/08/2012 11:09 AM, Asias He wrote:
> On 08/07/2012 05:16 PM, Christoph Hellwig wrote:
>> On Tue, Aug 07, 2012 at 04:47:13PM +0800, Asias He wrote:
>>> 1) Ramdisk device
>>> With bio-based IO path, sequential read/write, random read/write
>>> IOPS boost : 28%, 24%, 21%, 16%
>>> Latency improvement: 32%, 17%, 21%, 16%
>>> 2) Fusion IO device
>>> With bio-based IO path, sequential read/write, random read/write
>>> IOPS boost : 11%, 11%, 13%, 10%
>>> Latency improvement: 10%, 10%, 12%, 10%
>>
>> Do you also have numbers for normal SAS/SATA disks? The changelog should
>> have a reall good explanation of why this is optional, and numbers are a
>> very important part of that.

Yes. I posted the numbers for normal SATA disks a few days ago in the
thread with Sasha. I will add that data, and an explanation of why this
is optional, to the changelog.

--
Asias