2015-02-15 08:20:09

by Bob Liu

Subject: [RFC PATCH 00/10] Multi-queue support for xen-block driver

This patchset converts the Xen PV block driver to the multi-queue block layer API
by sharing and using multiple I/O rings between the frontend and backend.

History:
It's based on the result of Arianna's internship for GNOME's Outreach Program
for Women, in which she was mentored by Konrad Rzeszutek Wilk. I also worked on
this patchset with her at that time, and have now fully taken over the task.
I have her authorization to "change authorship or SoB to the patches as you
like."

A few words on the block multi-queue layer:
The multi-queue block layer greatly improves block-layer scalability by splitting
the single request queue into per-processor software queues and hardware dispatch
queues. The Linux blk-mq core handles the software queues, while each block
driver must manage its own hardware queues.
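
For illustration only (placeholder names such as my_queue_rq and my_tag_set,
not code from these patches), the skeleton a blk-mq driver provides against
this API level looks roughly like:

#include <linux/blk-mq.h>
#include <linux/numa.h>
#include <linux/string.h>

/* Hand one request to the hardware queue backing @hctx. */
static int my_queue_rq(struct blk_mq_hw_ctx *hctx,
                       const struct blk_mq_queue_data *qd)
{
        blk_mq_start_request(qd->rq);
        /* ... submit qd->rq to the ring backing this hctx ... */
        return BLK_MQ_RQ_QUEUE_OK;      /* or _BUSY / _ERROR */
}

static struct blk_mq_ops my_mq_ops = {
        .queue_rq  = my_queue_rq,
        .map_queue = blk_mq_map_queue,
};

static struct blk_mq_tag_set my_tag_set;

/* Called once at probe time to create the request queue. */
static struct request_queue *my_init_queue(unsigned int nr_hw_queues,
                                           unsigned int depth)
{
        memset(&my_tag_set, 0, sizeof(my_tag_set));
        my_tag_set.ops = &my_mq_ops;
        my_tag_set.nr_hw_queues = nr_hw_queues; /* e.g. one per I/O ring */
        my_tag_set.queue_depth = depth;
        my_tag_set.numa_node = NUMA_NO_NODE;
        my_tag_set.flags = BLK_MQ_F_SHOULD_MERGE;

        if (blk_mq_alloc_tag_set(&my_tag_set))
                return NULL;
        return blk_mq_init_queue(&my_tag_set);  /* IS_ERR() on failure */
}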

The xen/block implementation:
1) Convert to the blk-mq API with only one hardware queue.
2) Use more rings to act as multiple hardware queues.
3) Negotiate the number of hardware queues, the same way the xen-net driver
does: the backend advertises "multi-queue-max-queues" to the frontend, and the
frontend writes the final number back to "multi-queue-num-queues" (see the
sketch below).
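
For illustration only, the frontend side of that negotiation could look roughly
like the helper below; the xenstore keys are the ones named above, while the
helper name and error handling are placeholders rather than code from this
series:

#include <linux/kernel.h>
#include <xen/xenbus.h>

/* Pick how many rings (hardware queues) to use for one vbd. */
static unsigned int negotiate_nr_hw_queues(struct xenbus_device *dev,
                                           unsigned int nr_cpus)
{
        unsigned int backend_max = 1, nr_queues;
        int err;

        /* A missing key means the backend only supports a single ring. */
        err = xenbus_scanf(XBT_NIL, dev->otherend,
                           "multi-queue-max-queues", "%u", &backend_max);
        if (err != 1)
                backend_max = 1;

        nr_queues = min(backend_max, nr_cpus);

        /* Tell the backend how many rings the frontend will set up. */
        err = xenbus_printf(XBT_NIL, dev->nodename,
                            "multi-queue-num-queues", "%u", nr_queues);
        if (err)
                nr_queues = 1;

        return nr_queues;
}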

Test result:
fio's IOmeter emulation on a domU with 16 vCPUs and a null_blk device; the
hardware queue number was 16.

nr_fio_jobs  IOPS(before)  IOPS(after)   Diff
          1          57k           58k     0%
          4          95k          201k  +210%
          8          89k          372k  +410%
         16          68k          284k  +410%
         32          65k          196k  +300%
         64          63k          183k  +290%

More results are coming; there were also big improvements in both write IOPS
and latency.

Any comments or suggestions are welcome.
Thank you,
-Bob Liu

Bob Liu (10):
xen/blkfront: convert to blk-mq API
xen/blkfront: drop legacy block layer support
xen/blkfront: reorg info->io_lock after using blk-mq API
xen/blkfront: separate ring information to an new struct
xen/blkback: separate ring information out of struct xen_blkif
xen/blkfront: pseudo support for multi hardware queues
xen/blkback: pseudo support for multi hardware queues
xen/blkfront: negotiate hardware queue number with backend
xen/blkback: get hardware queue number from blkfront
xen/blkfront: use work queue to fast blkif interrupt return

drivers/block/xen-blkback/blkback.c | 370 ++++++++-------
drivers/block/xen-blkback/common.h | 54 ++-
drivers/block/xen-blkback/xenbus.c | 415 +++++++++++------
drivers/block/xen-blkfront.c | 894 +++++++++++++++++++++---------------
4 files changed, 1018 insertions(+), 715 deletions(-)

--
1.8.3.1


2015-02-15 08:20:07

by Bob Liu

Subject: [PATCH 01/10] xen/blkfront: convert to blk-mq API

This patch converts the xen-blkfront driver to use the block multi-queue API,
forcing it to use only one hardware queue at this time.

Signed-off-by: Arianna Avanzini <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 118 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 100 insertions(+), 18 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2236c6f..13e6178 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -37,6 +37,7 @@

#include <linux/interrupt.h>
#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
#include <linux/hdreg.h>
#include <linux/cdrom.h>
#include <linux/module.h>
@@ -133,6 +134,8 @@ struct blkfront_info
unsigned int feature_persistent:1;
unsigned int max_indirect_segments;
int is_ready;
+ struct blk_mq_tag_set tag_set;
+ int feature_multiqueue;
};

static unsigned int nr_minors;
@@ -651,6 +654,42 @@ wait:
flush_requests(info);
}

+static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
+ const struct blk_mq_queue_data *qd)
+{
+ struct blkfront_info *info = qd->rq->rq_disk->private_data;
+ int ret = BLK_MQ_RQ_QUEUE_OK;
+
+ blk_mq_start_request(qd->rq);
+ spin_lock_irq(&info->io_lock);
+ if (RING_FULL(&info->ring)) {
+ blk_mq_stop_hw_queue(hctx);
+ ret = BLK_MQ_RQ_QUEUE_BUSY;
+ goto out;
+ }
+
+ if (blkif_request_flush_invalid(qd->rq, info)) {
+ ret = BLK_MQ_RQ_QUEUE_ERROR;
+ goto out;
+ }
+
+ if (blkif_queue_request(qd->rq)) {
+ blk_mq_stop_hw_queue(hctx);
+ ret = BLK_MQ_RQ_QUEUE_BUSY;
+ goto out;
+ }
+
+ flush_requests(info);
+out:
+ spin_unlock_irq(&info->io_lock);
+ return ret;
+}
+
+static struct blk_mq_ops blkfront_mq_ops = {
+ .queue_rq = blk_mq_queue_rq,
+ .map_queue = blk_mq_map_queue,
+};
+
static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
unsigned int physical_sector_size,
unsigned int segments)
@@ -658,9 +697,28 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;

- rq = blk_init_queue(do_blkif_request, &info->io_lock);
- if (rq == NULL)
- return -1;
+ if (info->feature_multiqueue) {
+ memset(&info->tag_set, 0, sizeof(info->tag_set));
+ info->tag_set.ops = &blkfront_mq_ops;
+ info->tag_set.nr_hw_queues = 1;
+ info->tag_set.queue_depth = BLK_RING_SIZE;
+ info->tag_set.numa_node = NUMA_NO_NODE;
+ info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ info->tag_set.cmd_size = 0;
+ info->tag_set.driver_data = info;
+
+ if (blk_mq_alloc_tag_set(&info->tag_set))
+ return -1;
+ rq = blk_mq_init_queue(&info->tag_set);
+ if (IS_ERR(rq)) {
+ blk_mq_free_tag_set(&info->tag_set);
+ return -1;
+ }
+ } else {
+ rq = blk_init_queue(do_blkif_request, &info->io_lock);
+ if (rq == NULL)
+ return -1;
+ }

queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);

@@ -896,7 +954,10 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
spin_lock_irqsave(&info->io_lock, flags);

/* No more blkif_request(). */
- blk_stop_queue(info->rq);
+ if (info->feature_multiqueue)
+ blk_mq_stop_hw_queues(info->rq);
+ else
+ blk_stop_queue(info->rq);

/* No more gnttab callback work. */
gnttab_cancel_free_callback(&info->callback);
@@ -912,6 +973,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
xlbd_release_minors(minor, nr_minors);

blk_cleanup_queue(info->rq);
+ blk_mq_free_tag_set(&info->tag_set);
info->rq = NULL;

put_disk(info->gd);
@@ -921,10 +983,14 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
static void kick_pending_request_queues(struct blkfront_info *info)
{
if (!RING_FULL(&info->ring)) {
- /* Re-enable calldowns. */
- blk_start_queue(info->rq);
- /* Kick things off immediately. */
- do_blkif_request(info->rq);
+ if (info->feature_multiqueue) {
+ blk_mq_start_stopped_hw_queues(info->rq, true);
+ } else {
+ /* Re-enable calldowns. */
+ blk_start_queue(info->rq);
+ /* Kick things off immediately. */
+ do_blkif_request(info->rq);
+ }
}
}

@@ -949,8 +1015,12 @@ static void blkif_free(struct blkfront_info *info, int suspend)
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
/* No more blkif_request(). */
- if (info->rq)
- blk_stop_queue(info->rq);
+ if (info->rq) {
+ if (info->feature_multiqueue)
+ blk_mq_stop_hw_queues(info->rq);
+ else
+ blk_stop_queue(info->rq);
+ }

/* Remove all persistent grants */
if (!list_empty(&info->grants)) {
@@ -1175,37 +1245,40 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
continue;
}

- error = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
+ error = req->errors = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
switch (bret->operation) {
case BLKIF_OP_DISCARD:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
struct request_queue *rq = info->rq;
printk(KERN_WARNING "blkfront: %s: %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
- error = -EOPNOTSUPP;
+ error = req->errors = -EOPNOTSUPP;
info->feature_discard = 0;
info->feature_secdiscard = 0;
queue_flag_clear(QUEUE_FLAG_DISCARD, rq);
queue_flag_clear(QUEUE_FLAG_SECDISCARD, rq);
}
- __blk_end_request_all(req, error);
+ if (info->feature_multiqueue)
+ blk_mq_complete_request(req);
+ else
+ __blk_end_request_all(req, error);
break;
case BLKIF_OP_FLUSH_DISKCACHE:
case BLKIF_OP_WRITE_BARRIER:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
printk(KERN_WARNING "blkfront: %s: %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
- error = -EOPNOTSUPP;
+ error = req->errors = -EOPNOTSUPP;
}
if (unlikely(bret->status == BLKIF_RSP_ERROR &&
info->shadow[id].req.u.rw.nr_segments == 0)) {
printk(KERN_WARNING "blkfront: %s: empty %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
- error = -EOPNOTSUPP;
+ error = req->errors = -EOPNOTSUPP;
}
if (unlikely(error)) {
if (error == -EOPNOTSUPP)
- error = 0;
+ error = req->errors = 0;
info->feature_flush = 0;
xlvbd_flush(info);
}
@@ -1216,7 +1289,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
dev_dbg(&info->xbdev->dev, "Bad return from blkdev data "
"request: %x\n", bret->status);

- __blk_end_request_all(req, error);
+ if (info->feature_multiqueue)
+ blk_mq_complete_request(req);
+ else
+ __blk_end_request_all(req, error);
break;
default:
BUG();
@@ -1552,8 +1628,13 @@ static int blkif_recover(struct blkfront_info *info)
/* Requeue pending requests (flush or discard) */
list_del_init(&req->queuelist);
BUG_ON(req->nr_phys_segments > segs);
- blk_requeue_request(info->rq, req);
+ if (info->feature_multiqueue)
+ blk_mq_requeue_request(req);
+ else
+ blk_requeue_request(info->rq, req);
}
+ if (info->feature_multiqueue)
+ blk_mq_kick_requeue_list(info->rq);
spin_unlock_irq(&info->io_lock);

while ((bio = bio_list_pop(&bio_list)) != NULL) {
@@ -1873,6 +1954,7 @@ static void blkfront_connect(struct blkfront_info *info)
return;
}

+ info->feature_multiqueue = 1;
err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size,
physical_sector_size);
if (err) {
--
1.8.3.1

2015-02-15 08:20:10

by Bob Liu

Subject: [PATCH 02/10] xen/blkfront: drop legacy block layer support

As Christoph suggested, remove the legacy request-queue support, as most
converted drivers (virtio, mtip, and nvme) already have.

Signed-off-by: Arianna Avanzini <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 167 +++++++++----------------------------------
1 file changed, 32 insertions(+), 135 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 13e6178..3589436 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -135,7 +135,6 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
- int feature_multiqueue;
};

static unsigned int nr_minors;
@@ -606,54 +605,6 @@ static inline bool blkif_request_flush_invalid(struct request *req,
!(info->feature_flush & REQ_FUA)));
}

-/*
- * do_blkif_request
- * read a block; request is in a request queue
- */
-static void do_blkif_request(struct request_queue *rq)
-{
- struct blkfront_info *info = NULL;
- struct request *req;
- int queued;
-
- pr_debug("Entered do_blkif_request\n");
-
- queued = 0;
-
- while ((req = blk_peek_request(rq)) != NULL) {
- info = req->rq_disk->private_data;
-
- if (RING_FULL(&info->ring))
- goto wait;
-
- blk_start_request(req);
-
- if (blkif_request_flush_invalid(req, info)) {
- __blk_end_request_all(req, -EOPNOTSUPP);
- continue;
- }
-
- pr_debug("do_blk_req %p: cmd %p, sec %lx, "
- "(%u/%u) [%s]\n",
- req, req->cmd, (unsigned long)blk_rq_pos(req),
- blk_rq_cur_sectors(req), blk_rq_sectors(req),
- rq_data_dir(req) ? "write" : "read");
-
- if (blkif_queue_request(req)) {
- blk_requeue_request(rq, req);
-wait:
- /* Avoid pointless unplugs. */
- blk_stop_queue(rq);
- break;
- }
-
- queued++;
- }
-
- if (queued != 0)
- flush_requests(info);
-}
-
static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *qd)
{
@@ -697,27 +648,21 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;

- if (info->feature_multiqueue) {
- memset(&info->tag_set, 0, sizeof(info->tag_set));
- info->tag_set.ops = &blkfront_mq_ops;
- info->tag_set.nr_hw_queues = 1;
- info->tag_set.queue_depth = BLK_RING_SIZE;
- info->tag_set.numa_node = NUMA_NO_NODE;
- info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
- info->tag_set.cmd_size = 0;
- info->tag_set.driver_data = info;
-
- if (blk_mq_alloc_tag_set(&info->tag_set))
- return -1;
- rq = blk_mq_init_queue(&info->tag_set);
- if (IS_ERR(rq)) {
- blk_mq_free_tag_set(&info->tag_set);
- return -1;
- }
- } else {
- rq = blk_init_queue(do_blkif_request, &info->io_lock);
- if (rq == NULL)
- return -1;
+ memset(&info->tag_set, 0, sizeof(info->tag_set));
+ info->tag_set.ops = &blkfront_mq_ops;
+ info->tag_set.nr_hw_queues = 1;
+ info->tag_set.queue_depth = BLK_RING_SIZE;
+ info->tag_set.numa_node = NUMA_NO_NODE;
+ info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+ info->tag_set.cmd_size = 0;
+ info->tag_set.driver_data = info;
+
+ if (blk_mq_alloc_tag_set(&info->tag_set))
+ return -1;
+ rq = blk_mq_init_queue(&info->tag_set);
+ if (IS_ERR(rq)) {
+ blk_mq_free_tag_set(&info->tag_set);
+ return -1;
}

queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
@@ -954,10 +899,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
spin_lock_irqsave(&info->io_lock, flags);

/* No more blkif_request(). */
- if (info->feature_multiqueue)
- blk_mq_stop_hw_queues(info->rq);
- else
- blk_stop_queue(info->rq);
+ blk_mq_stop_hw_queues(info->rq);

/* No more gnttab callback work. */
gnttab_cancel_free_callback(&info->callback);
@@ -980,18 +922,11 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
info->gd = NULL;
}

+/* Called with info->io_lock held */
static void kick_pending_request_queues(struct blkfront_info *info)
{
- if (!RING_FULL(&info->ring)) {
- if (info->feature_multiqueue) {
- blk_mq_start_stopped_hw_queues(info->rq, true);
- } else {
- /* Re-enable calldowns. */
- blk_start_queue(info->rq);
- /* Kick things off immediately. */
- do_blkif_request(info->rq);
- }
- }
+ if (!RING_FULL(&info->ring))
+ blk_mq_start_stopped_hw_queues(info->rq, true);
}

static void blkif_restart_queue(struct work_struct *work)
@@ -1015,12 +950,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
/* No more blkif_request(). */
- if (info->rq) {
- if (info->feature_multiqueue)
- blk_mq_stop_hw_queues(info->rq);
- else
- blk_stop_queue(info->rq);
- }
+ if (info->rq)
+ blk_mq_stop_hw_queues(info->rq);

/* Remove all persistent grants */
if (!list_empty(&info->grants)) {
@@ -1204,7 +1135,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
RING_IDX i, rp;
unsigned long flags;
struct blkfront_info *info = (struct blkfront_info *)dev_id;
- int error;

spin_lock_irqsave(&info->io_lock, flags);

@@ -1245,40 +1175,37 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
continue;
}

- error = req->errors = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
+ req->errors = (bret->status == BLKIF_RSP_OKAY) ? 0 : -EIO;
switch (bret->operation) {
case BLKIF_OP_DISCARD:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
struct request_queue *rq = info->rq;
printk(KERN_WARNING "blkfront: %s: %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
- error = req->errors = -EOPNOTSUPP;
+ req->errors = -EOPNOTSUPP;
info->feature_discard = 0;
info->feature_secdiscard = 0;
queue_flag_clear(QUEUE_FLAG_DISCARD, rq);
queue_flag_clear(QUEUE_FLAG_SECDISCARD, rq);
}
- if (info->feature_multiqueue)
- blk_mq_complete_request(req);
- else
- __blk_end_request_all(req, error);
+ blk_mq_complete_request(req);
break;
case BLKIF_OP_FLUSH_DISKCACHE:
case BLKIF_OP_WRITE_BARRIER:
if (unlikely(bret->status == BLKIF_RSP_EOPNOTSUPP)) {
printk(KERN_WARNING "blkfront: %s: %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
- error = req->errors = -EOPNOTSUPP;
+ req->errors = -EOPNOTSUPP;
}
if (unlikely(bret->status == BLKIF_RSP_ERROR &&
info->shadow[id].req.u.rw.nr_segments == 0)) {
printk(KERN_WARNING "blkfront: %s: empty %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
- error = req->errors = -EOPNOTSUPP;
+ req->errors = -EOPNOTSUPP;
}
- if (unlikely(error)) {
- if (error == -EOPNOTSUPP)
- error = req->errors = 0;
+ if (unlikely(req->errors)) {
+ if (req->errors == -EOPNOTSUPP)
+ req->errors = 0;
info->feature_flush = 0;
xlvbd_flush(info);
}
@@ -1289,10 +1216,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
dev_dbg(&info->xbdev->dev, "Bad return from blkdev data "
"request: %x\n", bret->status);

- if (info->feature_multiqueue)
- blk_mq_complete_request(req);
- else
- __blk_end_request_all(req, error);
+ blk_mq_complete_request(req);
break;
default:
BUG();
@@ -1592,28 +1516,6 @@ static int blkif_recover(struct blkfront_info *info)

kfree(copy);

- /*
- * Empty the queue, this is important because we might have
- * requests in the queue with more segments than what we
- * can handle now.
- */
- spin_lock_irq(&info->io_lock);
- while ((req = blk_fetch_request(info->rq)) != NULL) {
- if (req->cmd_flags &
- (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
- list_add(&req->queuelist, &requests);
- continue;
- }
- merge_bio.head = req->bio;
- merge_bio.tail = req->biotail;
- bio_list_merge(&bio_list, &merge_bio);
- req->bio = NULL;
- if (req->cmd_flags & (REQ_FLUSH | REQ_FUA))
- pr_alert("diskcache flush request found!\n");
- __blk_put_request(info->rq, req);
- }
- spin_unlock_irq(&info->io_lock);
-
xenbus_switch_state(info->xbdev, XenbusStateConnected);

spin_lock_irq(&info->io_lock);
@@ -1628,13 +1530,9 @@ static int blkif_recover(struct blkfront_info *info)
/* Requeue pending requests (flush or discard) */
list_del_init(&req->queuelist);
BUG_ON(req->nr_phys_segments > segs);
- if (info->feature_multiqueue)
- blk_mq_requeue_request(req);
- else
- blk_requeue_request(info->rq, req);
+ blk_mq_requeue_request(req);
}
- if (info->feature_multiqueue)
- blk_mq_kick_requeue_list(info->rq);
+ blk_mq_kick_requeue_list(info->rq);
spin_unlock_irq(&info->io_lock);

while ((bio = bio_list_pop(&bio_list)) != NULL) {
@@ -1954,7 +1852,6 @@ static void blkfront_connect(struct blkfront_info *info)
return;
}

- info->feature_multiqueue = 1;
err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size,
physical_sector_size);
if (err) {
--
1.8.3.1

2015-02-15 08:22:19

by Bob Liu

Subject: [PATCH 03/10] xen/blkfront: reorg info->io_lock after using blk-mq API

Drop the unnecessary holding of info->io_lock when calling into the blk-mq APIs.

Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 38 ++++++++++++++++----------------------
1 file changed, 16 insertions(+), 22 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 3589436..5a90a51 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -614,25 +614,28 @@ static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
blk_mq_start_request(qd->rq);
spin_lock_irq(&info->io_lock);
if (RING_FULL(&info->ring)) {
+ spin_unlock_irq(&info->io_lock);
blk_mq_stop_hw_queue(hctx);
ret = BLK_MQ_RQ_QUEUE_BUSY;
goto out;
}

if (blkif_request_flush_invalid(qd->rq, info)) {
+ spin_unlock_irq(&info->io_lock);
ret = BLK_MQ_RQ_QUEUE_ERROR;
goto out;
}

if (blkif_queue_request(qd->rq)) {
+ spin_unlock_irq(&info->io_lock);
blk_mq_stop_hw_queue(hctx);
ret = BLK_MQ_RQ_QUEUE_BUSY;
goto out;
}

flush_requests(info);
-out:
spin_unlock_irq(&info->io_lock);
+out:
return ret;
}

@@ -891,19 +894,15 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
static void xlvbd_release_gendisk(struct blkfront_info *info)
{
unsigned int minor, nr_minors;
- unsigned long flags;

if (info->rq == NULL)
return;

- spin_lock_irqsave(&info->io_lock, flags);
-
/* No more blkif_request(). */
blk_mq_stop_hw_queues(info->rq);

/* No more gnttab callback work. */
gnttab_cancel_free_callback(&info->callback);
- spin_unlock_irqrestore(&info->io_lock, flags);

/* Flush gnttab callback work. Must be done with no locks held. */
flush_work(&info->work);
@@ -922,21 +921,25 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
info->gd = NULL;
}

-/* Called with info->io_lock held */
static void kick_pending_request_queues(struct blkfront_info *info)
{
- if (!RING_FULL(&info->ring))
+ unsigned long flags;
+
+ spin_lock_irqsave(&info->io_lock, flags);
+ if (!RING_FULL(&info->ring)) {
+ spin_unlock_irqrestore(&info->io_lock, flags);
blk_mq_start_stopped_hw_queues(info->rq, true);
+ return;
+ }
+ spin_unlock_irqrestore(&info->io_lock, flags);
}

static void blkif_restart_queue(struct work_struct *work)
{
struct blkfront_info *info = container_of(work, struct blkfront_info, work);

- spin_lock_irq(&info->io_lock);
if (info->connected == BLKIF_STATE_CONNECTED)
kick_pending_request_queues(info);
- spin_unlock_irq(&info->io_lock);
}

static void blkif_free(struct blkfront_info *info, int suspend)
@@ -946,13 +949,13 @@ static void blkif_free(struct blkfront_info *info, int suspend)
int i, j, segs;

/* Prevent new requests being issued until we fix things up. */
- spin_lock_irq(&info->io_lock);
info->connected = suspend ?
BLKIF_STATE_SUSPENDED : BLKIF_STATE_DISCONNECTED;
/* No more blkif_request(). */
if (info->rq)
blk_mq_stop_hw_queues(info->rq);

+ spin_lock_irq(&info->io_lock);
/* Remove all persistent grants */
if (!list_empty(&info->grants)) {
list_for_each_entry_safe(persistent_gnt, n,
@@ -1136,13 +1139,10 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
unsigned long flags;
struct blkfront_info *info = (struct blkfront_info *)dev_id;

- spin_lock_irqsave(&info->io_lock, flags);
-
- if (unlikely(info->connected != BLKIF_STATE_CONNECTED)) {
- spin_unlock_irqrestore(&info->io_lock, flags);
+ if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return IRQ_HANDLED;
- }

+ spin_lock_irqsave(&info->io_lock, flags);
again:
rp = info->ring.sring->rsp_prod;
rmb(); /* Ensure we see queued responses up to 'rp'. */
@@ -1233,9 +1233,8 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
} else
info->ring.sring->rsp_event = i + 1;

- kick_pending_request_queues(info);
-
spin_unlock_irqrestore(&info->io_lock, flags);
+ kick_pending_request_queues(info);

return IRQ_HANDLED;
}
@@ -1518,8 +1517,6 @@ static int blkif_recover(struct blkfront_info *info)

xenbus_switch_state(info->xbdev, XenbusStateConnected);

- spin_lock_irq(&info->io_lock);
-
/* Now safe for us to use the shared ring */
info->connected = BLKIF_STATE_CONNECTED;

@@ -1533,7 +1530,6 @@ static int blkif_recover(struct blkfront_info *info)
blk_mq_requeue_request(req);
}
blk_mq_kick_requeue_list(info->rq);
- spin_unlock_irq(&info->io_lock);

while ((bio = bio_list_pop(&bio_list)) != NULL) {
/* Traverse the list of pending bios and re-queue them */
@@ -1863,10 +1859,8 @@ static void blkfront_connect(struct blkfront_info *info)
xenbus_switch_state(info->xbdev, XenbusStateConnected);

/* Kick pending requests. */
- spin_lock_irq(&info->io_lock);
info->connected = BLKIF_STATE_CONNECTED;
kick_pending_request_queues(info);
- spin_unlock_irq(&info->io_lock);

add_disk(info->gd);

--
1.8.3.1

2015-02-15 08:20:16

by Bob Liu

Subject: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

A ring is the representation of a hardware queue. This patch separates the ring
information out of blkfront_info into a new struct blkfront_ring_info, in
preparation for real multi-hardware-queue support.

Signed-off-by: Arianna Avanzini <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 403 +++++++++++++++++++++++--------------------
1 file changed, 218 insertions(+), 185 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 5a90a51..aaa4a0e 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -102,23 +102,15 @@ MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default
#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)

/*
- * We have one of these per vbd, whether ide, scsi or 'other'. They
- * hang in private_data off the gendisk structure. We may end up
- * putting all kinds of interesting stuff here :-)
+ * Per-ring info.
+ * A blkfront_info structure can associate with one or more blkfront_ring_info,
+ * depending on how many hardware queues supported.
*/
-struct blkfront_info
-{
+struct blkfront_ring_info {
spinlock_t io_lock;
- struct mutex mutex;
- struct xenbus_device *xbdev;
- struct gendisk *gd;
- int vdevice;
- blkif_vdev_t handle;
- enum blkif_state connected;
int ring_ref;
struct blkif_front_ring ring;
unsigned int evtchn, irq;
- struct request_queue *rq;
struct work_struct work;
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_RING_SIZE];
@@ -126,6 +118,22 @@ struct blkfront_info
struct list_head indirect_pages;
unsigned int persistent_gnts_c;
unsigned long shadow_free;
+ struct blkfront_info *info;
+};
+
+/*
+ * We have one of these per vbd, whether ide, scsi or 'other'. They
+ * hang in private_data off the gendisk structure. We may end up
+ * putting all kinds of interesting stuff here :-)
+ */
+struct blkfront_info {
+ struct mutex mutex;
+ struct xenbus_device *xbdev;
+ struct gendisk *gd;
+ int vdevice;
+ blkif_vdev_t handle;
+ enum blkif_state connected;
+ struct request_queue *rq;
unsigned int feature_flush;
unsigned int feature_discard:1;
unsigned int feature_secdiscard:1;
@@ -135,6 +143,7 @@ struct blkfront_info
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
+ struct blkfront_ring_info rinfo;
};

static unsigned int nr_minors;
@@ -167,34 +176,35 @@ static DEFINE_SPINLOCK(minor_lock);
#define INDIRECT_GREFS(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)

-static int blkfront_setup_indirect(struct blkfront_info *info);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);

-static int get_id_from_freelist(struct blkfront_info *info)
+static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
{
- unsigned long free = info->shadow_free;
+ unsigned long free = rinfo->shadow_free;
BUG_ON(free >= BLK_RING_SIZE);
- info->shadow_free = info->shadow[free].req.u.rw.id;
- info->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
+ rinfo->shadow_free = rinfo->shadow[free].req.u.rw.id;
+ rinfo->shadow[free].req.u.rw.id = 0x0fffffee; /* debug */
return free;
}

-static int add_id_to_freelist(struct blkfront_info *info,
+static int add_id_to_freelist(struct blkfront_ring_info *rinfo,
unsigned long id)
{
- if (info->shadow[id].req.u.rw.id != id)
+ if (rinfo->shadow[id].req.u.rw.id != id)
return -EINVAL;
- if (info->shadow[id].request == NULL)
+ if (rinfo->shadow[id].request == NULL)
return -EINVAL;
- info->shadow[id].req.u.rw.id = info->shadow_free;
- info->shadow[id].request = NULL;
- info->shadow_free = id;
+ rinfo->shadow[id].req.u.rw.id = rinfo->shadow_free;
+ rinfo->shadow[id].request = NULL;
+ rinfo->shadow_free = id;
return 0;
}

-static int fill_grant_buffer(struct blkfront_info *info, int num)
+static int fill_grant_buffer(struct blkfront_ring_info *rinfo, int num)
{
struct page *granted_page;
struct grant *gnt_list_entry, *n;
+ struct blkfront_info *info = rinfo->info;
int i = 0;

while(i < num) {
@@ -212,7 +222,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
}

gnt_list_entry->gref = GRANT_INVALID_REF;
- list_add(&gnt_list_entry->node, &info->grants);
+ list_add(&gnt_list_entry->node, &rinfo->grants);
i++;
}

@@ -220,7 +230,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)

out_of_memory:
list_for_each_entry_safe(gnt_list_entry, n,
- &info->grants, node) {
+ &rinfo->grants, node) {
list_del(&gnt_list_entry->node);
if (info->feature_persistent)
__free_page(pfn_to_page(gnt_list_entry->pfn));
@@ -232,33 +242,33 @@ out_of_memory:
}

static struct grant *get_grant(grant_ref_t *gref_head,
- unsigned long pfn,
- struct blkfront_info *info)
+ unsigned long pfn,
+ struct blkfront_ring_info *rinfo)
{
struct grant *gnt_list_entry;
unsigned long buffer_mfn;

- BUG_ON(list_empty(&info->grants));
- gnt_list_entry = list_first_entry(&info->grants, struct grant,
+ BUG_ON(list_empty(&rinfo->grants));
+ gnt_list_entry = list_first_entry(&rinfo->grants, struct grant,
node);
list_del(&gnt_list_entry->node);

if (gnt_list_entry->gref != GRANT_INVALID_REF) {
- info->persistent_gnts_c--;
+ rinfo->persistent_gnts_c--;
return gnt_list_entry;
}

/* Assign a gref to this page */
gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
BUG_ON(gnt_list_entry->gref == -ENOSPC);
- if (!info->feature_persistent) {
+ if (!rinfo->info->feature_persistent) {
BUG_ON(!pfn);
gnt_list_entry->pfn = pfn;
}
buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
- info->xbdev->otherend_id,
- buffer_mfn, 0);
+ rinfo->info->xbdev->otherend_id,
+ buffer_mfn, 0);
return gnt_list_entry;
}

@@ -328,8 +338,9 @@ static void xlbd_release_minors(unsigned int minor, unsigned int nr)

static void blkif_restart_queue_callback(void *arg)
{
- struct blkfront_info *info = (struct blkfront_info *)arg;
- schedule_work(&info->work);
+ struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)arg;
+
+ schedule_work(&rinfo->work);
}

static int blkif_getgeo(struct block_device *bd, struct hd_geometry *hg)
@@ -387,7 +398,8 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
*
* @req: a request struct
*/
-static int blkif_queue_request(struct request *req)
+static int blkif_queue_request(struct request *req,
+ struct blkfront_ring_info *rinfo)
{
struct blkfront_info *info = req->rq_disk->private_data;
struct blkif_request *ring_req;
@@ -419,15 +431,15 @@ static int blkif_queue_request(struct request *req)
max_grefs += INDIRECT_GREFS(req->nr_phys_segments);

/* Check if we have enough grants to allocate a requests */
- if (info->persistent_gnts_c < max_grefs) {
+ if (rinfo->persistent_gnts_c < max_grefs) {
new_persistent_gnts = 1;
if (gnttab_alloc_grant_references(
- max_grefs - info->persistent_gnts_c,
+ max_grefs - rinfo->persistent_gnts_c,
&gref_head) < 0) {
gnttab_request_free_callback(
- &info->callback,
+ &rinfo->callback,
blkif_restart_queue_callback,
- info,
+ rinfo,
max_grefs);
return 1;
}
@@ -435,9 +447,9 @@ static int blkif_queue_request(struct request *req)
new_persistent_gnts = 0;

/* Fill out a communications ring structure. */
- ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
- id = get_id_from_freelist(info);
- info->shadow[id].request = req;
+ ring_req = RING_GET_REQUEST(&rinfo->ring, rinfo->ring.req_prod_pvt);
+ id = get_id_from_freelist(rinfo);
+ rinfo->shadow[id].request = req;

if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
ring_req->operation = BLKIF_OP_DISCARD;
@@ -453,7 +465,7 @@ static int blkif_queue_request(struct request *req)
req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
BUG_ON(info->max_indirect_segments &&
req->nr_phys_segments > info->max_indirect_segments);
- nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
+ nseg = blk_rq_map_sg(req->q, req, rinfo->shadow[id].sg);
ring_req->u.rw.id = id;
if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
/*
@@ -496,7 +508,7 @@ static int blkif_queue_request(struct request *req)
}
ring_req->u.rw.nr_segments = nseg;
}
- for_each_sg(info->shadow[id].sg, sg, nseg, i) {
+ for_each_sg(rinfo->shadow[id].sg, sg, nseg, i) {
fsect = sg->offset >> 9;
lsect = fsect + (sg->length >> 9) - 1;

@@ -512,22 +524,22 @@ static int blkif_queue_request(struct request *req)
struct page *indirect_page;

/* Fetch a pre-allocated page to use for indirect grefs */
- BUG_ON(list_empty(&info->indirect_pages));
- indirect_page = list_first_entry(&info->indirect_pages,
+ BUG_ON(list_empty(&rinfo->indirect_pages));
+ indirect_page = list_first_entry(&rinfo->indirect_pages,
struct page, lru);
list_del(&indirect_page->lru);
pfn = page_to_pfn(indirect_page);
}
- gnt_list_entry = get_grant(&gref_head, pfn, info);
- info->shadow[id].indirect_grants[n] = gnt_list_entry;
+ gnt_list_entry = get_grant(&gref_head, pfn, rinfo);
+ rinfo->shadow[id].indirect_grants[n] = gnt_list_entry;
segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
}

- gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
+ gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), rinfo);
ref = gnt_list_entry->gref;

- info->shadow[id].grants_used[i] = gnt_list_entry;
+ rinfo->shadow[id].grants_used[i] = gnt_list_entry;

if (rq_data_dir(req) && info->feature_persistent) {
char *bvec_data;
@@ -573,10 +585,10 @@ static int blkif_queue_request(struct request *req)
kunmap_atomic(segments);
}

- info->ring.req_prod_pvt++;
+ rinfo->ring.req_prod_pvt++;

/* Keep a private copy so we can reissue requests when recovering. */
- info->shadow[id].req = *ring_req;
+ rinfo->shadow[id].req = *ring_req;

if (new_persistent_gnts)
gnttab_free_grant_references(gref_head);
@@ -585,14 +597,14 @@ static int blkif_queue_request(struct request *req)
}


-static inline void flush_requests(struct blkfront_info *info)
+static inline void flush_requests(struct blkfront_ring_info *rinfo)
{
int notify;

- RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&info->ring, notify);
+ RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&rinfo->ring, notify);

if (notify)
- notify_remote_via_irq(info->irq);
+ notify_remote_via_irq(rinfo->irq);
}

static inline bool blkif_request_flush_invalid(struct request *req,
@@ -608,40 +620,50 @@ static inline bool blkif_request_flush_invalid(struct request *req,
static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *qd)
{
- struct blkfront_info *info = qd->rq->rq_disk->private_data;
+ struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)hctx->driver_data;
int ret = BLK_MQ_RQ_QUEUE_OK;

blk_mq_start_request(qd->rq);
- spin_lock_irq(&info->io_lock);
- if (RING_FULL(&info->ring)) {
- spin_unlock_irq(&info->io_lock);
+ spin_lock_irq(&rinfo->io_lock);
+ if (RING_FULL(&rinfo->ring)) {
+ spin_unlock_irq(&rinfo->io_lock);
blk_mq_stop_hw_queue(hctx);
ret = BLK_MQ_RQ_QUEUE_BUSY;
goto out;
}

- if (blkif_request_flush_invalid(qd->rq, info)) {
- spin_unlock_irq(&info->io_lock);
+ if (blkif_request_flush_invalid(qd->rq, rinfo->info)) {
+ spin_unlock_irq(&rinfo->io_lock);
ret = BLK_MQ_RQ_QUEUE_ERROR;
goto out;
}

- if (blkif_queue_request(qd->rq)) {
- spin_unlock_irq(&info->io_lock);
+ if (blkif_queue_request(qd->rq, rinfo)) {
+ spin_unlock_irq(&rinfo->io_lock);
blk_mq_stop_hw_queue(hctx);
ret = BLK_MQ_RQ_QUEUE_BUSY;
goto out;
}

- flush_requests(info);
- spin_unlock_irq(&info->io_lock);
+ flush_requests(rinfo);
+ spin_unlock_irq(&rinfo->io_lock);
out:
return ret;
}

+static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
+ unsigned int index)
+{
+ struct blkfront_info *info = (struct blkfront_info *)data;
+
+ hctx->driver_data = &info->rinfo;
+ return 0;
+}
+
static struct blk_mq_ops blkfront_mq_ops = {
.queue_rq = blk_mq_queue_rq,
.map_queue = blk_mq_map_queue,
+ .init_hctx = blk_mq_init_hctx,
};

static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
@@ -894,6 +916,7 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
static void xlvbd_release_gendisk(struct blkfront_info *info)
{
unsigned int minor, nr_minors;
+ struct blkfront_ring_info *rinfo = &info->rinfo;

if (info->rq == NULL)
return;
@@ -902,10 +925,10 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
blk_mq_stop_hw_queues(info->rq);

/* No more gnttab callback work. */
- gnttab_cancel_free_callback(&info->callback);
+ gnttab_cancel_free_callback(&rinfo->callback);

/* Flush gnttab callback work. Must be done with no locks held. */
- flush_work(&info->work);
+ flush_work(&rinfo->work);

del_gendisk(info->gd);

@@ -921,25 +944,25 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
info->gd = NULL;
}

-static void kick_pending_request_queues(struct blkfront_info *info)
+static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
{
unsigned long flags;

- spin_lock_irqsave(&info->io_lock, flags);
- if (!RING_FULL(&info->ring)) {
- spin_unlock_irqrestore(&info->io_lock, flags);
- blk_mq_start_stopped_hw_queues(info->rq, true);
+ spin_lock_irqsave(&rinfo->io_lock, flags);
+ if (!RING_FULL(&rinfo->ring)) {
+ spin_unlock_irqrestore(&rinfo->io_lock, flags);
+ blk_mq_start_stopped_hw_queues(rinfo->info->rq, true);
return;
}
- spin_unlock_irqrestore(&info->io_lock, flags);
+ spin_unlock_irqrestore(&rinfo->io_lock, flags);
}

static void blkif_restart_queue(struct work_struct *work)
{
- struct blkfront_info *info = container_of(work, struct blkfront_info, work);
+ struct blkfront_ring_info *rinfo = container_of(work, struct blkfront_ring_info, work);

- if (info->connected == BLKIF_STATE_CONNECTED)
- kick_pending_request_queues(info);
+ if (rinfo->info->connected == BLKIF_STATE_CONNECTED)
+ kick_pending_request_queues(rinfo);
}

static void blkif_free(struct blkfront_info *info, int suspend)
@@ -947,6 +970,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
struct grant *persistent_gnt;
struct grant *n;
int i, j, segs;
+ struct blkfront_ring_info *rinfo = &info->rinfo;

/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
@@ -955,33 +979,33 @@ static void blkif_free(struct blkfront_info *info, int suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);

- spin_lock_irq(&info->io_lock);
+ spin_lock_irq(&rinfo->io_lock);
/* Remove all persistent grants */
- if (!list_empty(&info->grants)) {
+ if (!list_empty(&rinfo->grants)) {
list_for_each_entry_safe(persistent_gnt, n,
- &info->grants, node) {
+ &rinfo->grants, node) {
list_del(&persistent_gnt->node);
if (persistent_gnt->gref != GRANT_INVALID_REF) {
gnttab_end_foreign_access(persistent_gnt->gref,
- 0, 0UL);
- info->persistent_gnts_c--;
+ 0, 0UL);
+ rinfo->persistent_gnts_c--;
}
if (info->feature_persistent)
__free_page(pfn_to_page(persistent_gnt->pfn));
kfree(persistent_gnt);
}
}
- BUG_ON(info->persistent_gnts_c != 0);
+ BUG_ON(rinfo->persistent_gnts_c != 0);

/*
* Remove indirect pages, this only happens when using indirect
* descriptors but not persistent grants
*/
- if (!list_empty(&info->indirect_pages)) {
+ if (!list_empty(&rinfo->indirect_pages)) {
struct page *indirect_page, *n;

BUG_ON(info->feature_persistent);
- list_for_each_entry_safe(indirect_page, n, &info->indirect_pages, lru) {
+ list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
list_del(&indirect_page->lru);
__free_page(indirect_page);
}
@@ -992,21 +1016,21 @@ static void blkif_free(struct blkfront_info *info, int suspend)
* Clear persistent grants present in requests already
* on the shared ring
*/
- if (!info->shadow[i].request)
+ if (!rinfo->shadow[i].request)
goto free_shadow;

- segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
- info->shadow[i].req.u.indirect.nr_segments :
- info->shadow[i].req.u.rw.nr_segments;
+ segs = rinfo->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
+ rinfo->shadow[i].req.u.indirect.nr_segments :
+ rinfo->shadow[i].req.u.rw.nr_segments;
for (j = 0; j < segs; j++) {
- persistent_gnt = info->shadow[i].grants_used[j];
+ persistent_gnt = rinfo->shadow[i].grants_used[j];
gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
if (info->feature_persistent)
__free_page(pfn_to_page(persistent_gnt->pfn));
kfree(persistent_gnt);
}

- if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+ if (rinfo->shadow[i].req.operation != BLKIF_OP_INDIRECT)
/*
* If this is not an indirect operation don't try to
* free indirect segments
@@ -1014,42 +1038,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
goto free_shadow;

for (j = 0; j < INDIRECT_GREFS(segs); j++) {
- persistent_gnt = info->shadow[i].indirect_grants[j];
+ persistent_gnt = rinfo->shadow[i].indirect_grants[j];
gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
__free_page(pfn_to_page(persistent_gnt->pfn));
kfree(persistent_gnt);
}

free_shadow:
- kfree(info->shadow[i].grants_used);
- info->shadow[i].grants_used = NULL;
- kfree(info->shadow[i].indirect_grants);
- info->shadow[i].indirect_grants = NULL;
- kfree(info->shadow[i].sg);
- info->shadow[i].sg = NULL;
+ kfree(rinfo->shadow[i].grants_used);
+ rinfo->shadow[i].grants_used = NULL;
+ kfree(rinfo->shadow[i].indirect_grants);
+ rinfo->shadow[i].indirect_grants = NULL;
+ kfree(rinfo->shadow[i].sg);
+ rinfo->shadow[i].sg = NULL;
}

/* No more gnttab callback work. */
- gnttab_cancel_free_callback(&info->callback);
- spin_unlock_irq(&info->io_lock);
+ gnttab_cancel_free_callback(&rinfo->callback);
+ spin_unlock_irq(&rinfo->io_lock);

/* Flush gnttab callback work. Must be done with no locks held. */
- flush_work(&info->work);
+ flush_work(&rinfo->work);

/* Free resources associated with old device channel. */
- if (info->ring_ref != GRANT_INVALID_REF) {
- gnttab_end_foreign_access(info->ring_ref, 0,
- (unsigned long)info->ring.sring);
- info->ring_ref = GRANT_INVALID_REF;
- info->ring.sring = NULL;
+ if (rinfo->ring_ref != GRANT_INVALID_REF) {
+ gnttab_end_foreign_access(rinfo->ring_ref, 0,
+ (unsigned long)rinfo->ring.sring);
+ rinfo->ring_ref = GRANT_INVALID_REF;
+ rinfo->ring.sring = NULL;
}
- if (info->irq)
- unbind_from_irqhandler(info->irq, info);
- info->evtchn = info->irq = 0;
+ if (rinfo->irq)
+ unbind_from_irqhandler(rinfo->irq, rinfo);
+ rinfo->evtchn = rinfo->irq = 0;

}

-static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
+static void blkif_completion(struct blk_shadow *s, struct blkfront_ring_info *rinfo,
struct blkif_response *bret)
{
int i = 0;
@@ -1057,6 +1081,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
char *bvec_data;
void *shared_data;
int nseg;
+ struct blkfront_info *info = rinfo->info;

nseg = s->req.operation == BLKIF_OP_INDIRECT ?
s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
@@ -1092,8 +1117,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
if (!info->feature_persistent)
pr_alert_ratelimited("backed has not unmapped grant: %u\n",
s->grants_used[i]->gref);
- list_add(&s->grants_used[i]->node, &info->grants);
- info->persistent_gnts_c++;
+ list_add(&s->grants_used[i]->node, &rinfo->grants);
+ rinfo->persistent_gnts_c++;
} else {
/*
* If the grant is not mapped by the backend we end the
@@ -1103,7 +1128,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
*/
gnttab_end_foreign_access(s->grants_used[i]->gref, 0, 0UL);
s->grants_used[i]->gref = GRANT_INVALID_REF;
- list_add_tail(&s->grants_used[i]->node, &info->grants);
+ list_add_tail(&s->grants_used[i]->node, &rinfo->grants);
}
}
if (s->req.operation == BLKIF_OP_INDIRECT) {
@@ -1112,8 +1137,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
if (!info->feature_persistent)
pr_alert_ratelimited("backed has not unmapped grant: %u\n",
s->indirect_grants[i]->gref);
- list_add(&s->indirect_grants[i]->node, &info->grants);
- info->persistent_gnts_c++;
+ list_add(&s->indirect_grants[i]->node, &rinfo->grants);
+ rinfo->persistent_gnts_c++;
} else {
struct page *indirect_page;

@@ -1123,9 +1148,9 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
* available pages for indirect grefs.
*/
indirect_page = pfn_to_page(s->indirect_grants[i]->pfn);
- list_add(&indirect_page->lru, &info->indirect_pages);
+ list_add(&indirect_page->lru, &rinfo->indirect_pages);
s->indirect_grants[i]->gref = GRANT_INVALID_REF;
- list_add_tail(&s->indirect_grants[i]->node, &info->grants);
+ list_add_tail(&s->indirect_grants[i]->node, &rinfo->grants);
}
}
}
@@ -1137,20 +1162,21 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
struct blkif_response *bret;
RING_IDX i, rp;
unsigned long flags;
- struct blkfront_info *info = (struct blkfront_info *)dev_id;
+ struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
+ struct blkfront_info *info = rinfo->info;

if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return IRQ_HANDLED;

- spin_lock_irqsave(&info->io_lock, flags);
+ spin_lock_irqsave(&rinfo->io_lock, flags);
again:
- rp = info->ring.sring->rsp_prod;
+ rp = rinfo->ring.sring->rsp_prod;
rmb(); /* Ensure we see queued responses up to 'rp'. */

- for (i = info->ring.rsp_cons; i != rp; i++) {
+ for (i = rinfo->ring.rsp_cons; i != rp; i++) {
unsigned long id;

- bret = RING_GET_RESPONSE(&info->ring, i);
+ bret = RING_GET_RESPONSE(&rinfo->ring, i);
id = bret->id;
/*
* The backend has messed up and given us an id that we would
@@ -1164,12 +1190,12 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
* the id is busted. */
continue;
}
- req = info->shadow[id].request;
+ req = rinfo->shadow[id].request;

if (bret->operation != BLKIF_OP_DISCARD)
- blkif_completion(&info->shadow[id], info, bret);
+ blkif_completion(&rinfo->shadow[id], rinfo, bret);

- if (add_id_to_freelist(info, id)) {
+ if (add_id_to_freelist(rinfo, id)) {
WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
info->gd->disk_name, op_name(bret->operation), id);
continue;
@@ -1198,7 +1224,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
req->errors = -EOPNOTSUPP;
}
if (unlikely(bret->status == BLKIF_RSP_ERROR &&
- info->shadow[id].req.u.rw.nr_segments == 0)) {
+ rinfo->shadow[id].req.u.rw.nr_segments == 0)) {
printk(KERN_WARNING "blkfront: %s: empty %s op failed\n",
info->gd->disk_name, op_name(bret->operation));
req->errors = -EOPNOTSUPP;
@@ -1223,30 +1249,30 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
}
}

- info->ring.rsp_cons = i;
+ rinfo->ring.rsp_cons = i;

- if (i != info->ring.req_prod_pvt) {
+ if (i != rinfo->ring.req_prod_pvt) {
int more_to_do;
- RING_FINAL_CHECK_FOR_RESPONSES(&info->ring, more_to_do);
+ RING_FINAL_CHECK_FOR_RESPONSES(&rinfo->ring, more_to_do);
if (more_to_do)
goto again;
} else
- info->ring.sring->rsp_event = i + 1;
+ rinfo->ring.sring->rsp_event = i + 1;

- spin_unlock_irqrestore(&info->io_lock, flags);
- kick_pending_request_queues(info);
+ spin_unlock_irqrestore(&rinfo->io_lock, flags);
+ kick_pending_request_queues(rinfo);

return IRQ_HANDLED;
}


static int setup_blkring(struct xenbus_device *dev,
- struct blkfront_info *info)
+ struct blkfront_ring_info *rinfo)
{
struct blkif_sring *sring;
int err;

- info->ring_ref = GRANT_INVALID_REF;
+ rinfo->ring_ref = GRANT_INVALID_REF;

sring = (struct blkif_sring *)__get_free_page(GFP_NOIO | __GFP_HIGH);
if (!sring) {
@@ -1254,32 +1280,32 @@ static int setup_blkring(struct xenbus_device *dev,
return -ENOMEM;
}
SHARED_RING_INIT(sring);
- FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);
+ FRONT_RING_INIT(&rinfo->ring, sring, PAGE_SIZE);

- err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
+ err = xenbus_grant_ring(dev, virt_to_mfn(rinfo->ring.sring));
if (err < 0) {
free_page((unsigned long)sring);
- info->ring.sring = NULL;
+ rinfo->ring.sring = NULL;
goto fail;
}
- info->ring_ref = err;
+ rinfo->ring_ref = err;

- err = xenbus_alloc_evtchn(dev, &info->evtchn);
+ err = xenbus_alloc_evtchn(dev, &rinfo->evtchn);
if (err)
goto fail;

- err = bind_evtchn_to_irqhandler(info->evtchn, blkif_interrupt, 0,
- "blkif", info);
+ err = bind_evtchn_to_irqhandler(rinfo->evtchn, blkif_interrupt, 0,
+ "blkif", rinfo);
if (err <= 0) {
xenbus_dev_fatal(dev, err,
"bind_evtchn_to_irqhandler failed");
goto fail;
}
- info->irq = err;
+ rinfo->irq = err;

return 0;
fail:
- blkif_free(info, 0);
+ blkif_free(rinfo->info, 0);
return err;
}

@@ -1291,9 +1317,10 @@ static int talk_to_blkback(struct xenbus_device *dev,
const char *message = NULL;
struct xenbus_transaction xbt;
int err;
+ struct blkfront_ring_info *rinfo = &info->rinfo;

/* Create shared ring, alloc event channel. */
- err = setup_blkring(dev, info);
+ err = setup_blkring(dev, rinfo);
if (err)
goto out;

@@ -1305,13 +1332,13 @@ again:
}

err = xenbus_printf(xbt, dev->nodename,
- "ring-ref", "%u", info->ring_ref);
+ "ring-ref", "%u", rinfo->ring_ref);
if (err) {
message = "writing ring-ref";
goto abort_transaction;
}
err = xenbus_printf(xbt, dev->nodename,
- "event-channel", "%u", info->evtchn);
+ "event-channel", "%u", rinfo->evtchn);
if (err) {
message = "writing event-channel";
goto abort_transaction;
@@ -1361,6 +1388,7 @@ static int blkfront_probe(struct xenbus_device *dev,
{
int err, vdevice, i;
struct blkfront_info *info;
+ struct blkfront_ring_info *rinfo;

/* FIXME: Use dynamic device id if this is not set. */
err = xenbus_scanf(XBT_NIL, dev->nodename,
@@ -1410,19 +1438,21 @@ static int blkfront_probe(struct xenbus_device *dev,
return -ENOMEM;
}

+ rinfo = &info->rinfo;
mutex_init(&info->mutex);
- spin_lock_init(&info->io_lock);
+ spin_lock_init(&rinfo->io_lock);
info->xbdev = dev;
info->vdevice = vdevice;
- INIT_LIST_HEAD(&info->grants);
- INIT_LIST_HEAD(&info->indirect_pages);
- info->persistent_gnts_c = 0;
+ INIT_LIST_HEAD(&rinfo->grants);
+ INIT_LIST_HEAD(&rinfo->indirect_pages);
+ rinfo->persistent_gnts_c = 0;
info->connected = BLKIF_STATE_DISCONNECTED;
- INIT_WORK(&info->work, blkif_restart_queue);
+ rinfo->info = info;
+ INIT_WORK(&rinfo->work, blkif_restart_queue);

for (i = 0; i < BLK_RING_SIZE; i++)
- info->shadow[i].req.u.rw.id = i+1;
- info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+ rinfo->shadow[i].req.u.rw.id = i+1;
+ rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;

/* Front end dir is a number, which is used as the id. */
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
@@ -1465,21 +1495,22 @@ static int blkif_recover(struct blkfront_info *info)
int pending, size;
struct split_bio *split_bio;
struct list_head requests;
+ struct blkfront_ring_info *rinfo = &info->rinfo;

/* Stage 1: Make a safe copy of the shadow state. */
- copy = kmemdup(info->shadow, sizeof(info->shadow),
+ copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
if (!copy)
return -ENOMEM;

/* Stage 2: Set up free list. */
- memset(&info->shadow, 0, sizeof(info->shadow));
+ memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
for (i = 0; i < BLK_RING_SIZE; i++)
- info->shadow[i].req.u.rw.id = i+1;
- info->shadow_free = info->ring.req_prod_pvt;
- info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+ rinfo->shadow[i].req.u.rw.id = i+1;
+ rinfo->shadow_free = rinfo->ring.req_prod_pvt;
+ rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;

- rc = blkfront_setup_indirect(info);
+ rc = blkfront_setup_indirect(rinfo);
if (rc) {
kfree(copy);
return rc;
@@ -1521,7 +1552,7 @@ static int blkif_recover(struct blkfront_info *info)
info->connected = BLKIF_STATE_CONNECTED;

/* Kick any other new requests queued since we resumed */
- kick_pending_request_queues(info);
+ kick_pending_request_queues(rinfo);

list_for_each_entry_safe(req, n, &requests, queuelist) {
/* Requeue pending requests (flush or discard) */
@@ -1654,10 +1685,11 @@ static void blkfront_setup_discard(struct blkfront_info *info)
info->feature_secdiscard = !!discard_secure;
}

-static int blkfront_setup_indirect(struct blkfront_info *info)
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo)
{
unsigned int indirect_segments, segs;
int err, i;
+ struct blkfront_info *info = rinfo->info;

err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
"feature-max-indirect-segments", "%u", &indirect_segments,
@@ -1671,7 +1703,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
segs = info->max_indirect_segments;
}

- err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
+ err = fill_grant_buffer(rinfo, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
if (err)
goto out_of_memory;

@@ -1683,31 +1715,31 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
*/
int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE;

- BUG_ON(!list_empty(&info->indirect_pages));
+ BUG_ON(!list_empty(&rinfo->indirect_pages));
for (i = 0; i < num; i++) {
struct page *indirect_page = alloc_page(GFP_NOIO);
if (!indirect_page)
goto out_of_memory;
- list_add(&indirect_page->lru, &info->indirect_pages);
+ list_add(&indirect_page->lru, &rinfo->indirect_pages);
}
}

for (i = 0; i < BLK_RING_SIZE; i++) {
- info->shadow[i].grants_used = kzalloc(
- sizeof(info->shadow[i].grants_used[0]) * segs,
+ rinfo->shadow[i].grants_used = kzalloc(
+ sizeof(rinfo->shadow[i].grants_used[0]) * segs,
GFP_NOIO);
- info->shadow[i].sg = kzalloc(sizeof(info->shadow[i].sg[0]) * segs, GFP_NOIO);
+ rinfo->shadow[i].sg = kzalloc(sizeof(rinfo->shadow[i].sg[0]) * segs, GFP_NOIO);
if (info->max_indirect_segments)
- info->shadow[i].indirect_grants = kzalloc(
- sizeof(info->shadow[i].indirect_grants[0]) *
+ rinfo->shadow[i].indirect_grants = kzalloc(
+ sizeof(rinfo->shadow[i].indirect_grants[0]) *
INDIRECT_GREFS(segs),
GFP_NOIO);
- if ((info->shadow[i].grants_used == NULL) ||
- (info->shadow[i].sg == NULL) ||
+ if ((rinfo->shadow[i].grants_used == NULL) ||
+ (rinfo->shadow[i].sg == NULL) ||
(info->max_indirect_segments &&
- (info->shadow[i].indirect_grants == NULL)))
+ (rinfo->shadow[i].indirect_grants == NULL)))
goto out_of_memory;
- sg_init_table(info->shadow[i].sg, segs);
+ sg_init_table(rinfo->shadow[i].sg, segs);
}


@@ -1715,16 +1747,16 @@ static int blkfront_setup_indirect(struct blkfront_info *info)

out_of_memory:
for (i = 0; i < BLK_RING_SIZE; i++) {
- kfree(info->shadow[i].grants_used);
- info->shadow[i].grants_used = NULL;
- kfree(info->shadow[i].sg);
- info->shadow[i].sg = NULL;
- kfree(info->shadow[i].indirect_grants);
- info->shadow[i].indirect_grants = NULL;
- }
- if (!list_empty(&info->indirect_pages)) {
+ kfree(rinfo->shadow[i].grants_used);
+ rinfo->shadow[i].grants_used = NULL;
+ kfree(rinfo->shadow[i].sg);
+ rinfo->shadow[i].sg = NULL;
+ kfree(rinfo->shadow[i].indirect_grants);
+ rinfo->shadow[i].indirect_grants = NULL;
+ }
+ if (!list_empty(&rinfo->indirect_pages)) {
struct page *indirect_page, *n;
- list_for_each_entry_safe(indirect_page, n, &info->indirect_pages, lru) {
+ list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
list_del(&indirect_page->lru);
__free_page(indirect_page);
}
@@ -1744,6 +1776,7 @@ static void blkfront_connect(struct blkfront_info *info)
unsigned int binfo;
int err;
int barrier, flush, discard, persistent;
+ struct blkfront_ring_info *rinfo = &info->rinfo;

switch (info->connected) {
case BLKIF_STATE_CONNECTED:
@@ -1841,7 +1874,7 @@ static void blkfront_connect(struct blkfront_info *info)
else
info->feature_persistent = persistent;

- err = blkfront_setup_indirect(info);
+ err = blkfront_setup_indirect(rinfo);
if (err) {
xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
info->xbdev->otherend);
@@ -1860,7 +1893,7 @@ static void blkfront_connect(struct blkfront_info *info)

/* Kick pending requests. */
info->connected = BLKIF_STATE_CONNECTED;
- kick_pending_request_queues(info);
+ kick_pending_request_queues(rinfo);

add_disk(info->gd);

--
1.8.3.1

2015-02-15 08:21:34

by Bob Liu

Subject: [PATCH 05/10] xen/blkback: separate ring information out of struct xen_blkif

This patch separates the ring information out of struct xen_blkif into a new
struct xen_blkif_ring, in preparation for real multi-hardware-queue support.

Signed-off-by: Arianna Avanzini <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkback/blkback.c | 362 ++++++++++++++++++------------------
drivers/block/xen-blkback/common.h | 53 ++++--
drivers/block/xen-blkback/xenbus.c | 129 +++++++------
3 files changed, 292 insertions(+), 252 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 63fc7f0..0969e7e 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -103,71 +103,71 @@ module_param(log_stats, int, 0644);
/* Number of free pages to remove on each call to free_xenballooned_pages */
#define NUM_BATCH_FREE_PAGES 10

-static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+static inline int get_free_page(struct xen_blkif_ring *ring, struct page **page)
{
unsigned long flags;

- spin_lock_irqsave(&blkif->free_pages_lock, flags);
- if (list_empty(&blkif->free_pages)) {
- BUG_ON(blkif->free_pages_num != 0);
- spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ spin_lock_irqsave(&ring->free_pages_lock, flags);
+ if (list_empty(&ring->free_pages)) {
+ BUG_ON(ring->free_pages_num != 0);
+ spin_unlock_irqrestore(&ring->free_pages_lock, flags);
return alloc_xenballooned_pages(1, page, false);
}
- BUG_ON(blkif->free_pages_num == 0);
- page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+ BUG_ON(ring->free_pages_num == 0);
+ page[0] = list_first_entry(&ring->free_pages, struct page, lru);
list_del(&page[0]->lru);
- blkif->free_pages_num--;
- spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ ring->free_pages_num--;
+ spin_unlock_irqrestore(&ring->free_pages_lock, flags);

return 0;
}

-static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
- int num)
+static inline void put_free_pages(struct xen_blkif_ring *ring,
+ struct page **page, int num)
{
unsigned long flags;
int i;

- spin_lock_irqsave(&blkif->free_pages_lock, flags);
+ spin_lock_irqsave(&ring->free_pages_lock, flags);
for (i = 0; i < num; i++)
- list_add(&page[i]->lru, &blkif->free_pages);
- blkif->free_pages_num += num;
- spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ list_add(&page[i]->lru, &ring->free_pages);
+ ring->free_pages_num += num;
+ spin_unlock_irqrestore(&ring->free_pages_lock, flags);
}

-static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+static inline void shrink_free_pagepool(struct xen_blkif_ring *ring, int num)
{
/* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
struct page *page[NUM_BATCH_FREE_PAGES];
unsigned int num_pages = 0;
unsigned long flags;

- spin_lock_irqsave(&blkif->free_pages_lock, flags);
- while (blkif->free_pages_num > num) {
- BUG_ON(list_empty(&blkif->free_pages));
- page[num_pages] = list_first_entry(&blkif->free_pages,
+ spin_lock_irqsave(&ring->free_pages_lock, flags);
+ while (ring->free_pages_num > num) {
+ BUG_ON(list_empty(&ring->free_pages));
+ page[num_pages] = list_first_entry(&ring->free_pages,
struct page, lru);
list_del(&page[num_pages]->lru);
- blkif->free_pages_num--;
+ ring->free_pages_num--;
if (++num_pages == NUM_BATCH_FREE_PAGES) {
- spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ spin_unlock_irqrestore(&ring->free_pages_lock, flags);
free_xenballooned_pages(num_pages, page);
- spin_lock_irqsave(&blkif->free_pages_lock, flags);
+ spin_lock_irqsave(&ring->free_pages_lock, flags);
num_pages = 0;
}
}
- spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ spin_unlock_irqrestore(&ring->free_pages_lock, flags);
if (num_pages != 0)
free_xenballooned_pages(num_pages, page);
}

#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))

-static int do_block_io_op(struct xen_blkif *blkif);
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int do_block_io_op(struct xen_blkif_ring *ring);
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
struct blkif_request *req,
struct pending_req *pending_req);
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
unsigned short op, int st);

#define foreach_grant_safe(pos, n, rbtree, node) \
@@ -188,19 +188,19 @@ static void make_response(struct xen_blkif *blkif, u64 id,
* bit operations to modify the flags of a persistent grant and to count
* the number of used grants.
*/
-static int add_persistent_gnt(struct xen_blkif *blkif,
+static int add_persistent_gnt(struct xen_blkif_ring *ring,
struct persistent_gnt *persistent_gnt)
{
struct rb_node **new = NULL, *parent = NULL;
struct persistent_gnt *this;

- if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
- if (!blkif->vbd.overflow_max_grants)
- blkif->vbd.overflow_max_grants = 1;
+ if (ring->persistent_gnt_c >= xen_blkif_max_pgrants) {
+ if (!ring->blkif->vbd.overflow_max_grants)
+ ring->blkif->vbd.overflow_max_grants = 1;
return -EBUSY;
}
/* Figure out where to put new node */
- new = &blkif->persistent_gnts.rb_node;
+ new = &ring->persistent_gnts.rb_node;
while (*new) {
this = container_of(*new, struct persistent_gnt, node);

@@ -219,19 +219,19 @@ static int add_persistent_gnt(struct xen_blkif *blkif,
set_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
/* Add new node and rebalance tree. */
rb_link_node(&(persistent_gnt->node), parent, new);
- rb_insert_color(&(persistent_gnt->node), &blkif->persistent_gnts);
- blkif->persistent_gnt_c++;
- atomic_inc(&blkif->persistent_gnt_in_use);
+ rb_insert_color(&(persistent_gnt->node), &ring->persistent_gnts);
+ ring->persistent_gnt_c++;
+ atomic_inc(&ring->persistent_gnt_in_use);
return 0;
}

-static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
- grant_ref_t gref)
+static struct persistent_gnt *get_persistent_gnt(struct xen_blkif_ring *ring,
+ grant_ref_t gref)
{
struct persistent_gnt *data;
struct rb_node *node = NULL;

- node = blkif->persistent_gnts.rb_node;
+ node = ring->persistent_gnts.rb_node;
while (node) {
data = container_of(node, struct persistent_gnt, node);

@@ -245,25 +245,25 @@ static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
return NULL;
}
set_bit(PERSISTENT_GNT_ACTIVE, data->flags);
- atomic_inc(&blkif->persistent_gnt_in_use);
+ atomic_inc(&ring->persistent_gnt_in_use);
return data;
}
}
return NULL;
}

-static void put_persistent_gnt(struct xen_blkif *blkif,
+static void put_persistent_gnt(struct xen_blkif_ring *ring,
struct persistent_gnt *persistent_gnt)
{
if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
pr_alert_ratelimited(DRV_PFX " freeing a grant already unused");
set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
clear_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
- atomic_dec(&blkif->persistent_gnt_in_use);
+ atomic_dec(&ring->persistent_gnt_in_use);
}

-static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
- unsigned int num)
+static void free_persistent_gnts(struct xen_blkif_ring *ring,
+ struct rb_root *root, unsigned int num)
{
struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
@@ -288,7 +288,7 @@ static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
ret = gnttab_unmap_refs(unmap, NULL, pages,
segs_to_unmap);
BUG_ON(ret);
- put_free_pages(blkif, pages, segs_to_unmap);
+ put_free_pages(ring, pages, segs_to_unmap);
segs_to_unmap = 0;
}

@@ -305,10 +305,11 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct persistent_gnt *persistent_gnt;
int ret, segs_to_unmap = 0;
- struct xen_blkif *blkif = container_of(work, typeof(*blkif), persistent_purge_work);
+ struct xen_blkif_ring *ring =
+ container_of(work, typeof(*ring), persistent_purge_work);

- while(!list_empty(&blkif->persistent_purge_list)) {
- persistent_gnt = list_first_entry(&blkif->persistent_purge_list,
+ while (!list_empty(&ring->persistent_purge_list)) {
+ persistent_gnt = list_first_entry(&ring->persistent_purge_list,
struct persistent_gnt,
remove_node);
list_del(&persistent_gnt->remove_node);
@@ -324,7 +325,7 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
ret = gnttab_unmap_refs(unmap, NULL, pages,
segs_to_unmap);
BUG_ON(ret);
- put_free_pages(blkif, pages, segs_to_unmap);
+ put_free_pages(ring, pages, segs_to_unmap);
segs_to_unmap = 0;
}
kfree(persistent_gnt);
@@ -332,34 +333,35 @@ void xen_blkbk_unmap_purged_grants(struct work_struct *work)
if (segs_to_unmap > 0) {
ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
BUG_ON(ret);
- put_free_pages(blkif, pages, segs_to_unmap);
+ put_free_pages(ring, pages, segs_to_unmap);
}
}

-static void purge_persistent_gnt(struct xen_blkif *blkif)
+static void purge_persistent_gnt(struct xen_blkif_ring *ring)
{
struct persistent_gnt *persistent_gnt;
struct rb_node *n;
unsigned int num_clean, total;
bool scan_used = false, clean_used = false;
struct rb_root *root;
+ struct xen_blkif *blkif = ring->blkif;

- if (blkif->persistent_gnt_c < xen_blkif_max_pgrants ||
- (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
+ if (ring->persistent_gnt_c < xen_blkif_max_pgrants ||
+ (ring->persistent_gnt_c == xen_blkif_max_pgrants &&
!blkif->vbd.overflow_max_grants)) {
return;
}

- if (work_pending(&blkif->persistent_purge_work)) {
+ if (work_pending(&ring->persistent_purge_work)) {
pr_alert_ratelimited(DRV_PFX "Scheduled work from previous purge is still pending, cannot purge list\n");
return;
}

num_clean = (xen_blkif_max_pgrants / 100) * LRU_PERCENT_CLEAN;
- num_clean = blkif->persistent_gnt_c - xen_blkif_max_pgrants + num_clean;
- num_clean = min(blkif->persistent_gnt_c, num_clean);
+ num_clean = ring->persistent_gnt_c - xen_blkif_max_pgrants + num_clean;
+ num_clean = min(ring->persistent_gnt_c, num_clean);
if ((num_clean == 0) ||
- (num_clean > (blkif->persistent_gnt_c - atomic_read(&blkif->persistent_gnt_in_use))))
+ (num_clean > (ring->persistent_gnt_c - atomic_read(&ring->persistent_gnt_in_use))))
return;

/*
@@ -375,8 +377,8 @@ static void purge_persistent_gnt(struct xen_blkif *blkif)

pr_debug(DRV_PFX "Going to purge %u persistent grants\n", num_clean);

- BUG_ON(!list_empty(&blkif->persistent_purge_list));
- root = &blkif->persistent_gnts;
+ BUG_ON(!list_empty(&ring->persistent_purge_list));
+ root = &ring->persistent_gnts;
purge_list:
foreach_grant_safe(persistent_gnt, n, root, node) {
BUG_ON(persistent_gnt->handle ==
@@ -395,7 +397,7 @@ purge_list:

rb_erase(&persistent_gnt->node, root);
list_add(&persistent_gnt->remove_node,
- &blkif->persistent_purge_list);
+ &ring->persistent_purge_list);
if (--num_clean == 0)
goto finished;
}
@@ -416,11 +418,11 @@ finished:
goto purge_list;
}

- blkif->persistent_gnt_c -= (total - num_clean);
+ ring->persistent_gnt_c -= (total - num_clean);
blkif->vbd.overflow_max_grants = 0;

/* We can defer this work */
- schedule_work(&blkif->persistent_purge_work);
+ schedule_work(&ring->persistent_purge_work);
pr_debug(DRV_PFX "Purged %u/%u\n", (total - num_clean), total);
return;
}
@@ -428,18 +430,18 @@ finished:
/*
* Retrieve from the 'pending_reqs' a free pending_req structure to be used.
*/
-static struct pending_req *alloc_req(struct xen_blkif *blkif)
+static struct pending_req *alloc_req(struct xen_blkif_ring *ring)
{
struct pending_req *req = NULL;
unsigned long flags;

- spin_lock_irqsave(&blkif->pending_free_lock, flags);
- if (!list_empty(&blkif->pending_free)) {
- req = list_entry(blkif->pending_free.next, struct pending_req,
+ spin_lock_irqsave(&ring->pending_free_lock, flags);
+ if (!list_empty(&ring->pending_free)) {
+ req = list_entry(ring->pending_free.next, struct pending_req,
free_list);
list_del(&req->free_list);
}
- spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
+ spin_unlock_irqrestore(&ring->pending_free_lock, flags);
return req;
}

@@ -447,17 +449,17 @@ static struct pending_req *alloc_req(struct xen_blkif *blkif)
* Return the 'pending_req' structure back to the freepool. We also
* wake up the thread if it was waiting for a free page.
*/
-static void free_req(struct xen_blkif *blkif, struct pending_req *req)
+static void free_req(struct xen_blkif_ring *ring, struct pending_req *req)
{
unsigned long flags;
int was_empty;

- spin_lock_irqsave(&blkif->pending_free_lock, flags);
- was_empty = list_empty(&blkif->pending_free);
- list_add(&req->free_list, &blkif->pending_free);
- spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
+ spin_lock_irqsave(&ring->pending_free_lock, flags);
+ was_empty = list_empty(&ring->pending_free);
+ list_add(&req->free_list, &ring->pending_free);
+ spin_unlock_irqrestore(&ring->pending_free_lock, flags);
if (was_empty)
- wake_up(&blkif->pending_free_wq);
+ wake_up(&ring->pending_free_wq);
}

/*
@@ -537,10 +539,10 @@ abort:
/*
* Notification from the guest OS.
*/
-static void blkif_notify_work(struct xen_blkif *blkif)
+static void blkif_notify_work(struct xen_blkif_ring *ring)
{
- blkif->waiting_reqs = 1;
- wake_up(&blkif->wq);
+ ring->waiting_reqs = 1;
+ wake_up(&ring->wq);
}

irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
@@ -553,25 +555,26 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
* SCHEDULER FUNCTIONS
*/

-static void print_stats(struct xen_blkif *blkif)
+static void print_stats(struct xen_blkif_ring *ring)
{
pr_info("xen-blkback (%s): oo %3llu | rd %4llu | wr %4llu | f %4llu"
" | ds %4llu | pg: %4u/%4d\n",
- current->comm, blkif->st_oo_req,
- blkif->st_rd_req, blkif->st_wr_req,
- blkif->st_f_req, blkif->st_ds_req,
- blkif->persistent_gnt_c,
+ current->comm, ring->st_oo_req,
+ ring->st_rd_req, ring->st_wr_req,
+ ring->st_f_req, ring->st_ds_req,
+ ring->persistent_gnt_c,
xen_blkif_max_pgrants);
- blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
- blkif->st_rd_req = 0;
- blkif->st_wr_req = 0;
- blkif->st_oo_req = 0;
- blkif->st_ds_req = 0;
+ ring->st_print = jiffies + msecs_to_jiffies(10 * 1000);
+ ring->st_rd_req = 0;
+ ring->st_wr_req = 0;
+ ring->st_oo_req = 0;
+ ring->st_ds_req = 0;
}

int xen_blkif_schedule(void *arg)
{
- struct xen_blkif *blkif = arg;
+ struct xen_blkif_ring *ring = arg;
+ struct xen_blkif *blkif = ring->blkif;
struct xen_vbd *vbd = &blkif->vbd;
unsigned long timeout;
int ret;
@@ -587,50 +590,50 @@ int xen_blkif_schedule(void *arg)
timeout = msecs_to_jiffies(LRU_INTERVAL);

timeout = wait_event_interruptible_timeout(
- blkif->wq,
- blkif->waiting_reqs || kthread_should_stop(),
+ ring->wq,
+ ring->waiting_reqs || kthread_should_stop(),
timeout);
if (timeout == 0)
goto purge_gnt_list;
timeout = wait_event_interruptible_timeout(
- blkif->pending_free_wq,
- !list_empty(&blkif->pending_free) ||
+ ring->pending_free_wq,
+ !list_empty(&ring->pending_free) ||
kthread_should_stop(),
timeout);
if (timeout == 0)
goto purge_gnt_list;

- blkif->waiting_reqs = 0;
+ ring->waiting_reqs = 0;
smp_mb(); /* clear flag *before* checking for work */

- ret = do_block_io_op(blkif);
+ ret = do_block_io_op(ring);
if (ret > 0)
- blkif->waiting_reqs = 1;
+ ring->waiting_reqs = 1;
if (ret == -EACCES)
- wait_event_interruptible(blkif->shutdown_wq,
+ wait_event_interruptible(ring->shutdown_wq,
kthread_should_stop());

purge_gnt_list:
if (blkif->vbd.feature_gnt_persistent &&
- time_after(jiffies, blkif->next_lru)) {
- purge_persistent_gnt(blkif);
- blkif->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
+ time_after(jiffies, ring->next_lru)) {
+ purge_persistent_gnt(ring);
+ ring->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
}

/* Shrink if we have more than xen_blkif_max_buffer_pages */
- shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
+ shrink_free_pagepool(ring, xen_blkif_max_buffer_pages);

- if (log_stats && time_after(jiffies, blkif->st_print))
- print_stats(blkif);
+ if (log_stats && time_after(jiffies, ring->st_print))
+ print_stats(ring);
}

/* Drain pending purge work */
- flush_work(&blkif->persistent_purge_work);
+ flush_work(&ring->persistent_purge_work);

if (log_stats)
- print_stats(blkif);
+ print_stats(ring);

- blkif->xenblkd = NULL;
+ ring->xenblkd = NULL;
xen_blkif_put(blkif);

return 0;
@@ -639,25 +642,25 @@ purge_gnt_list:
/*
* Remove persistent grants and empty the pool of free pages
*/
-void xen_blkbk_free_caches(struct xen_blkif *blkif)
+void xen_blkbk_free_caches(struct xen_blkif_ring *ring)
{
/* Free all persistent grant pages */
- if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
- free_persistent_gnts(blkif, &blkif->persistent_gnts,
- blkif->persistent_gnt_c);
+ if (!RB_EMPTY_ROOT(&ring->persistent_gnts))
+ free_persistent_gnts(ring, &ring->persistent_gnts,
+ ring->persistent_gnt_c);

- BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
- blkif->persistent_gnt_c = 0;
+ BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
+ ring->persistent_gnt_c = 0;

/* Since we are shutting down remove all pages from the buffer */
- shrink_free_pagepool(blkif, 0 /* All */);
+ shrink_free_pagepool(ring, 0 /* All */);
}

/*
* Unmap the grant references, and also remove the M2P over-rides
* used in the 'pending_req'.
*/
-static void xen_blkbk_unmap(struct xen_blkif *blkif,
+static void xen_blkbk_unmap(struct xen_blkif_ring *ring,
struct grant_page *pages[],
int num)
{
@@ -668,7 +671,7 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,

for (i = 0; i < num; i++) {
if (pages[i]->persistent_gnt != NULL) {
- put_persistent_gnt(blkif, pages[i]->persistent_gnt);
+ put_persistent_gnt(ring, pages[i]->persistent_gnt);
continue;
}
if (pages[i]->handle == BLKBACK_INVALID_HANDLE)
@@ -681,18 +684,18 @@ static void xen_blkbk_unmap(struct xen_blkif *blkif,
ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
invcount);
BUG_ON(ret);
- put_free_pages(blkif, unmap_pages, invcount);
+ put_free_pages(ring, unmap_pages, invcount);
invcount = 0;
}
}
if (invcount) {
ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
BUG_ON(ret);
- put_free_pages(blkif, unmap_pages, invcount);
+ put_free_pages(ring, unmap_pages, invcount);
}
}

-static int xen_blkbk_map(struct xen_blkif *blkif,
+static int xen_blkbk_map(struct xen_blkif_ring *ring,
struct grant_page *pages[],
int num, bool ro)
{
@@ -705,6 +708,7 @@ static int xen_blkbk_map(struct xen_blkif *blkif,
int ret = 0;
int last_map = 0, map_until = 0;
int use_persistent_gnts;
+ struct xen_blkif *blkif = ring->blkif;

use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);

@@ -719,7 +723,7 @@ again:

if (use_persistent_gnts)
persistent_gnt = get_persistent_gnt(
- blkif,
+ ring,
pages[i]->gref);

if (persistent_gnt) {
@@ -730,7 +734,7 @@ again:
pages[i]->page = persistent_gnt->page;
pages[i]->persistent_gnt = persistent_gnt;
} else {
- if (get_free_page(blkif, &pages[i]->page))
+ if (get_free_page(ring, &pages[i]->page))
goto out_of_memory;
addr = vaddr(pages[i]->page);
pages_to_gnt[segs_to_map] = pages[i]->page;
@@ -763,7 +767,7 @@ again:
BUG_ON(new_map_idx >= segs_to_map);
if (unlikely(map[new_map_idx].status != 0)) {
pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
- put_free_pages(blkif, &pages[seg_idx]->page, 1);
+ put_free_pages(ring, &pages[seg_idx]->page, 1);
pages[seg_idx]->handle = BLKBACK_INVALID_HANDLE;
ret |= 1;
goto next;
@@ -773,7 +777,7 @@ again:
continue;
}
if (use_persistent_gnts &&
- blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
+ ring->persistent_gnt_c < xen_blkif_max_pgrants) {
/*
* We are using persistent grants, the grant is
* not mapped but we might have room for it.
@@ -791,7 +795,7 @@ again:
persistent_gnt->gnt = map[new_map_idx].ref;
persistent_gnt->handle = map[new_map_idx].handle;
persistent_gnt->page = pages[seg_idx]->page;
- if (add_persistent_gnt(blkif,
+ if (add_persistent_gnt(ring,
persistent_gnt)) {
kfree(persistent_gnt);
persistent_gnt = NULL;
@@ -799,7 +803,7 @@ again:
}
pages[seg_idx]->persistent_gnt = persistent_gnt;
pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
- persistent_gnt->gnt, blkif->persistent_gnt_c,
+ persistent_gnt->gnt, ring->persistent_gnt_c,
xen_blkif_max_pgrants);
goto next;
}
@@ -824,7 +828,7 @@ next:

out_of_memory:
pr_alert(DRV_PFX "%s: out of memory\n", __func__);
- put_free_pages(blkif, pages_to_gnt, segs_to_map);
+ put_free_pages(ring, pages_to_gnt, segs_to_map);
return -ENOMEM;
}

@@ -832,7 +836,7 @@ static int xen_blkbk_map_seg(struct pending_req *pending_req)
{
int rc;

- rc = xen_blkbk_map(pending_req->blkif, pending_req->segments,
+ rc = xen_blkbk_map(pending_req->ring, pending_req->segments,
pending_req->nr_pages,
(pending_req->operation != BLKIF_OP_READ));

@@ -845,7 +849,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
struct phys_req *preq)
{
struct grant_page **pages = pending_req->indirect_pages;
- struct xen_blkif *blkif = pending_req->blkif;
+ struct xen_blkif_ring *ring = pending_req->ring;
int indirect_grefs, rc, n, nseg, i;
struct blkif_request_segment *segments = NULL;

@@ -856,7 +860,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
for (i = 0; i < indirect_grefs; i++)
pages[i]->gref = req->u.indirect.indirect_grefs[i];

- rc = xen_blkbk_map(blkif, pages, indirect_grefs, true);
+ rc = xen_blkbk_map(ring, pages, indirect_grefs, true);
if (rc)
goto unmap;

@@ -883,15 +887,16 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
unmap:
if (segments)
kunmap_atomic(segments);
- xen_blkbk_unmap(blkif, pages, indirect_grefs);
+ xen_blkbk_unmap(ring, pages, indirect_grefs);
return rc;
}

-static int dispatch_discard_io(struct xen_blkif *blkif,
+static int dispatch_discard_io(struct xen_blkif_ring *ring,
struct blkif_request *req)
{
int err = 0;
int status = BLKIF_RSP_OKAY;
+ struct xen_blkif *blkif = ring->blkif;
struct block_device *bdev = blkif->vbd.bdev;
unsigned long secure;
struct phys_req preq;
@@ -908,7 +913,7 @@ static int dispatch_discard_io(struct xen_blkif *blkif,
preq.sector_number + preq.nr_sects, blkif->vbd.pdevice);
goto fail_response;
}
- blkif->st_ds_req++;
+ ring->st_ds_req++;

secure = (blkif->vbd.discard_secure &&
(req->u.discard.flag & BLKIF_DISCARD_SECURE)) ?
@@ -924,26 +929,27 @@ fail_response:
} else if (err)
status = BLKIF_RSP_ERROR;

- make_response(blkif, req->u.discard.id, req->operation, status);
+ make_response(ring, req->u.discard.id, req->operation, status);
xen_blkif_put(blkif);
return err;
}

-static int dispatch_other_io(struct xen_blkif *blkif,
+static int dispatch_other_io(struct xen_blkif_ring *ring,
struct blkif_request *req,
struct pending_req *pending_req)
{
- free_req(blkif, pending_req);
- make_response(blkif, req->u.other.id, req->operation,
+ free_req(ring, pending_req);
+ make_response(ring, req->u.other.id, req->operation,
BLKIF_RSP_EOPNOTSUPP);
return -EIO;
}

-static void xen_blk_drain_io(struct xen_blkif *blkif)
+static void xen_blk_drain_io(struct xen_blkif_ring *ring)
{
+ struct xen_blkif *blkif = ring->blkif;
atomic_set(&blkif->drain, 1);
do {
- if (atomic_read(&blkif->inflight) == 0)
+ if (atomic_read(&ring->inflight) == 0)
break;
wait_for_completion_interruptible_timeout(
&blkif->drain_complete, HZ);
@@ -964,12 +970,12 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
if ((pending_req->operation == BLKIF_OP_FLUSH_DISKCACHE) &&
(error == -EOPNOTSUPP)) {
pr_debug(DRV_PFX "flush diskcache op failed, not supported\n");
- xen_blkbk_flush_diskcache(XBT_NIL, pending_req->blkif->be, 0);
+ xen_blkbk_flush_diskcache(XBT_NIL, pending_req->ring->blkif->be, 0);
pending_req->status = BLKIF_RSP_EOPNOTSUPP;
} else if ((pending_req->operation == BLKIF_OP_WRITE_BARRIER) &&
(error == -EOPNOTSUPP)) {
pr_debug(DRV_PFX "write barrier op failed, not supported\n");
- xen_blkbk_barrier(XBT_NIL, pending_req->blkif->be, 0);
+ xen_blkbk_barrier(XBT_NIL, pending_req->ring->blkif->be, 0);
pending_req->status = BLKIF_RSP_EOPNOTSUPP;
} else if (error) {
pr_debug(DRV_PFX "Buffer not up-to-date at end of operation,"
@@ -983,14 +989,15 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
* the proper response on the ring.
*/
if (atomic_dec_and_test(&pending_req->pendcnt)) {
- struct xen_blkif *blkif = pending_req->blkif;
+ struct xen_blkif_ring *ring = pending_req->ring;
+ struct xen_blkif *blkif = ring->blkif;

- xen_blkbk_unmap(blkif,
+ xen_blkbk_unmap(ring,
pending_req->segments,
pending_req->nr_pages);
- make_response(blkif, pending_req->id,
+ make_response(ring, pending_req->id,
pending_req->operation, pending_req->status);
- free_req(blkif, pending_req);
+ free_req(ring, pending_req);
/*
* Make sure the request is freed before releasing blkif,
* or there could be a race between free_req and the
@@ -1003,9 +1010,9 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
* pending_free_wq if there's a drain going on, but it has
* to be taken into account if the current model is changed.
*/
- if (atomic_dec_and_test(&blkif->inflight) && atomic_read(&blkif->drain)) {
+ if (atomic_dec_and_test(&ring->inflight) && atomic_read(&blkif->drain))
complete(&blkif->drain_complete);
- }
+
xen_blkif_put(blkif);
}
}
@@ -1027,9 +1034,9 @@ static void end_block_io_op(struct bio *bio, int error)
* and transmute it to the block API to hand it over to the proper block disk.
*/
static int
-__do_block_io_op(struct xen_blkif *blkif)
+__do_block_io_op(struct xen_blkif_ring *ring)
{
- union blkif_back_rings *blk_rings = &blkif->blk_rings;
+ union blkif_back_rings *blk_rings = &ring->blk_rings;
struct blkif_request req;
struct pending_req *pending_req;
RING_IDX rc, rp;
@@ -1042,7 +1049,7 @@ __do_block_io_op(struct xen_blkif *blkif)
if (RING_REQUEST_PROD_OVERFLOW(&blk_rings->common, rp)) {
rc = blk_rings->common.rsp_prod_pvt;
pr_warn(DRV_PFX "Frontend provided bogus ring requests (%d - %d = %d). Halting ring processing on dev=%04x\n",
- rp, rc, rp - rc, blkif->vbd.pdevice);
+ rp, rc, rp - rc, ring->blkif->vbd.pdevice);
return -EACCES;
}
while (rc != rp) {
@@ -1055,14 +1062,14 @@ __do_block_io_op(struct xen_blkif *blkif)
break;
}

- pending_req = alloc_req(blkif);
+ pending_req = alloc_req(ring);
if (NULL == pending_req) {
- blkif->st_oo_req++;
+ ring->st_oo_req++;
more_to_do = 1;
break;
}

- switch (blkif->blk_protocol) {
+ switch (ring->blkif->blk_protocol) {
case BLKIF_PROTOCOL_NATIVE:
memcpy(&req, RING_GET_REQUEST(&blk_rings->native, rc), sizeof(req));
break;
@@ -1086,16 +1093,16 @@ __do_block_io_op(struct xen_blkif *blkif)
case BLKIF_OP_WRITE_BARRIER:
case BLKIF_OP_FLUSH_DISKCACHE:
case BLKIF_OP_INDIRECT:
- if (dispatch_rw_block_io(blkif, &req, pending_req))
+ if (dispatch_rw_block_io(ring, &req, pending_req))
goto done;
break;
case BLKIF_OP_DISCARD:
- free_req(blkif, pending_req);
- if (dispatch_discard_io(blkif, &req))
+ free_req(ring, pending_req);
+ if (dispatch_discard_io(ring, &req))
goto done;
break;
default:
- if (dispatch_other_io(blkif, &req, pending_req))
+ if (dispatch_other_io(ring, &req, pending_req))
goto done;
break;
}
@@ -1108,13 +1115,13 @@ done:
}

static int
-do_block_io_op(struct xen_blkif *blkif)
+do_block_io_op(struct xen_blkif_ring *ring)
{
- union blkif_back_rings *blk_rings = &blkif->blk_rings;
+ union blkif_back_rings *blk_rings = &ring->blk_rings;
int more_to_do;

do {
- more_to_do = __do_block_io_op(blkif);
+ more_to_do = __do_block_io_op(ring);
if (more_to_do)
break;

@@ -1127,7 +1134,7 @@ do_block_io_op(struct xen_blkif *blkif)
* Transmutation of the 'struct blkif_request' to a proper 'struct bio'
* and call the 'submit_bio' to pass it to the underlying storage.
*/
-static int dispatch_rw_block_io(struct xen_blkif *blkif,
+static int dispatch_rw_block_io(struct xen_blkif_ring *ring,
struct blkif_request *req,
struct pending_req *pending_req)
{
@@ -1155,17 +1162,17 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,

switch (req_operation) {
case BLKIF_OP_READ:
- blkif->st_rd_req++;
+ ring->st_rd_req++;
operation = READ;
break;
case BLKIF_OP_WRITE:
- blkif->st_wr_req++;
+ ring->st_wr_req++;
operation = WRITE_ODIRECT;
break;
case BLKIF_OP_WRITE_BARRIER:
drain = true;
case BLKIF_OP_FLUSH_DISKCACHE:
- blkif->st_f_req++;
+ ring->st_f_req++;
operation = WRITE_FLUSH;
break;
default:
@@ -1191,7 +1198,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,

preq.nr_sects = 0;

- pending_req->blkif = blkif;
+ pending_req->ring = ring;
pending_req->id = req->u.rw.id;
pending_req->operation = req_operation;
pending_req->status = BLKIF_RSP_OKAY;
@@ -1218,12 +1225,12 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
goto fail_response;
}

- if (xen_vbd_translate(&preq, blkif, operation) != 0) {
+ if (xen_vbd_translate(&preq, ring->blkif, operation) != 0) {
pr_debug(DRV_PFX "access denied: %s of [%llu,%llu] on dev=%04x\n",
operation == READ ? "read" : "write",
preq.sector_number,
preq.sector_number + preq.nr_sects,
- blkif->vbd.pdevice);
+ ring->blkif->vbd.pdevice);
goto fail_response;
}

@@ -1235,7 +1242,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
if (((int)preq.sector_number|(int)seg[i].nsec) &
((bdev_logical_block_size(preq.bdev) >> 9) - 1)) {
pr_debug(DRV_PFX "Misaligned I/O request from domain %d",
- blkif->domid);
+ ring->blkif->domid);
goto fail_response;
}
}
@@ -1244,7 +1251,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
* issue the WRITE_FLUSH.
*/
if (drain)
- xen_blk_drain_io(pending_req->blkif);
+ xen_blk_drain_io(pending_req->ring);

/*
* If we have failed at this point, we need to undo the M2P override,
@@ -1259,8 +1266,8 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
* This corresponding xen_blkif_put is done in __end_block_io_op, or
* below (in "!bio") if we are handling a BLKIF_OP_DISCARD.
*/
- xen_blkif_get(blkif);
- atomic_inc(&blkif->inflight);
+ xen_blkif_get(ring->blkif);
+ atomic_inc(&ring->inflight);

for (i = 0; i < nseg; i++) {
while ((bio == NULL) ||
@@ -1308,19 +1315,19 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
blk_finish_plug(&plug);

if (operation == READ)
- blkif->st_rd_sect += preq.nr_sects;
+ ring->st_rd_sect += preq.nr_sects;
else if (operation & WRITE)
- blkif->st_wr_sect += preq.nr_sects;
+ ring->st_wr_sect += preq.nr_sects;

return 0;

fail_flush:
- xen_blkbk_unmap(blkif, pending_req->segments,
+ xen_blkbk_unmap(ring, pending_req->segments,
pending_req->nr_pages);
fail_response:
/* Haven't submitted any bio's yet. */
- make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
- free_req(blkif, pending_req);
+ make_response(ring, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
+ free_req(ring, pending_req);
msleep(1); /* back off a bit */
return -EIO;

@@ -1338,21 +1345,22 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
/*
* Put a response on the ring on how the operation fared.
*/
-static void make_response(struct xen_blkif *blkif, u64 id,
+static void make_response(struct xen_blkif_ring *ring, u64 id,
unsigned short op, int st)
{
struct blkif_response resp;
unsigned long flags;
- union blkif_back_rings *blk_rings = &blkif->blk_rings;
+ union blkif_back_rings *blk_rings;
int notify;

resp.id = id;
resp.operation = op;
resp.status = st;

- spin_lock_irqsave(&blkif->blk_ring_lock, flags);
+ spin_lock_irqsave(&ring->blk_ring_lock, flags);
+ blk_rings = &ring->blk_rings;
/* Place on the response ring for the relevant domain. */
- switch (blkif->blk_protocol) {
+ switch (ring->blkif->blk_protocol) {
case BLKIF_PROTOCOL_NATIVE:
memcpy(RING_GET_RESPONSE(&blk_rings->native, blk_rings->native.rsp_prod_pvt),
&resp, sizeof(resp));
@@ -1370,9 +1378,9 @@ static void make_response(struct xen_blkif *blkif, u64 id,
}
blk_rings->common.rsp_prod_pvt++;
RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(&blk_rings->common, notify);
- spin_unlock_irqrestore(&blkif->blk_ring_lock, flags);
+ spin_unlock_irqrestore(&ring->blk_ring_lock, flags);
if (notify)
- notify_remote_via_irq(blkif->irq);
+ notify_remote_via_irq(ring->irq);
}

static int __init xen_blkif_init(void)
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index f65b807..71863d4 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -256,30 +256,18 @@ struct persistent_gnt {
struct list_head remove_node;
};

-struct xen_blkif {
- /* Unique identifier for this interface. */
- domid_t domid;
- unsigned int handle;
+/* Per-ring structure */
+struct xen_blkif_ring {
/* Physical parameters of the comms window. */
unsigned int irq;
- /* Comms information. */
- enum blkif_protocol blk_protocol;
union blkif_back_rings blk_rings;
void *blk_ring;
- /* The VBD attached to this interface. */
- struct xen_vbd vbd;
- /* Back pointer to the backend_info. */
- struct backend_info *be;
/* Private fields. */
spinlock_t blk_ring_lock;
- atomic_t refcnt;

wait_queue_head_t wq;
- /* for barrier (drain) requests */
- struct completion drain_complete;
- atomic_t drain;
atomic_t inflight;
- /* One thread per one blkif. */
+ /* One thread per blkif ring. */
struct task_struct *xenblkd;
unsigned int waiting_reqs;

@@ -314,9 +302,38 @@ struct xen_blkif {
unsigned long long st_rd_sect;
unsigned long long st_wr_sect;

- struct work_struct free_work;
/* Thread shutdown wait queue. */
wait_queue_head_t shutdown_wq;
+ struct xen_blkif *blkif;
+};
+
+struct xen_blkif {
+ /* Unique identifier for this interface. */
+ domid_t domid;
+ unsigned int handle;
+ /* Comms information. */
+ enum blkif_protocol blk_protocol;
+ /* The VBD attached to this interface. */
+ struct xen_vbd vbd;
+ /* Back pointer to the backend_info. */
+ struct backend_info *be;
+ /* for barrier (drain) requests */
+ struct completion drain_complete;
+ atomic_t drain;
+ atomic_t refcnt;
+ struct work_struct free_work;
+
+ /* statistics */
+ unsigned long st_print;
+ unsigned long long st_rd_req;
+ unsigned long long st_wr_req;
+ unsigned long long st_oo_req;
+ unsigned long long st_f_req;
+ unsigned long long st_ds_req;
+ unsigned long long st_rd_sect;
+ unsigned long long st_wr_sect;
+ /* Rings for this device */
+ struct xen_blkif_ring ring;
};

struct seg_buf {
@@ -338,7 +355,7 @@ struct grant_page {
* response queued for it, with the saved 'id' passed back.
*/
struct pending_req {
- struct xen_blkif *blkif;
+ struct xen_blkif_ring *ring;
u64 id;
int nr_pages;
atomic_t pendcnt;
@@ -377,7 +394,7 @@ int xen_blkif_xenbus_init(void);
irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg);
int xen_blkif_purge_persistent(void *arg);
-void xen_blkbk_free_caches(struct xen_blkif *blkif);
+void xen_blkbk_free_caches(struct xen_blkif_ring *ring);

int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
struct backend_info *be, int state);
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 630a489..4b7bde6 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -82,7 +82,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
char name[TASK_COMM_LEN];

/* Not ready to connect? */
- if (!blkif->irq || !blkif->vbd.bdev)
+ if (!blkif->ring.irq || !blkif->vbd.bdev)
return;

/* Already connected? */
@@ -107,10 +107,10 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
}
invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);

- blkif->xenblkd = kthread_run(xen_blkif_schedule, blkif, "%s", name);
- if (IS_ERR(blkif->xenblkd)) {
- err = PTR_ERR(blkif->xenblkd);
- blkif->xenblkd = NULL;
+ blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, &blkif->ring, "%s", name);
+ if (IS_ERR(blkif->ring.xenblkd)) {
+ err = PTR_ERR(blkif->ring.xenblkd);
+ blkif->ring.xenblkd = NULL;
xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
return;
}
@@ -121,6 +121,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
struct xen_blkif *blkif;
struct pending_req *req, *n;
int i, j;
+ struct xen_blkif_ring *ring;

BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);

@@ -129,30 +130,30 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
return ERR_PTR(-ENOMEM);

blkif->domid = domid;
- spin_lock_init(&blkif->blk_ring_lock);
+ ring = &blkif->ring;
+ spin_lock_init(&ring->blk_ring_lock);
atomic_set(&blkif->refcnt, 1);
- init_waitqueue_head(&blkif->wq);
+ init_waitqueue_head(&ring->wq);
init_completion(&blkif->drain_complete);
atomic_set(&blkif->drain, 0);
- blkif->st_print = jiffies;
- blkif->persistent_gnts.rb_node = NULL;
- spin_lock_init(&blkif->free_pages_lock);
- INIT_LIST_HEAD(&blkif->free_pages);
- INIT_LIST_HEAD(&blkif->persistent_purge_list);
- blkif->free_pages_num = 0;
- atomic_set(&blkif->persistent_gnt_in_use, 0);
- atomic_set(&blkif->inflight, 0);
- INIT_WORK(&blkif->persistent_purge_work, xen_blkbk_unmap_purged_grants);
-
- INIT_LIST_HEAD(&blkif->pending_free);
+ ring->st_print = jiffies;
+ ring->persistent_gnts.rb_node = NULL;
+ spin_lock_init(&ring->free_pages_lock);
+ INIT_LIST_HEAD(&ring->free_pages);
+ INIT_LIST_HEAD(&ring->persistent_purge_list);
+ ring->free_pages_num = 0;
+ atomic_set(&ring->persistent_gnt_in_use, 0);
+ atomic_set(&ring->inflight, 0);
+ INIT_WORK(&ring->persistent_purge_work, xen_blkbk_unmap_purged_grants);
+
+ INIT_LIST_HEAD(&ring->pending_free);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);

for (i = 0; i < XEN_BLKIF_REQS; i++) {
req = kzalloc(sizeof(*req), GFP_KERNEL);
if (!req)
goto fail;
- list_add_tail(&req->free_list,
- &blkif->pending_free);
+ list_add_tail(&req->free_list, &ring->pending_free);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
req->segments[j] = kzalloc(sizeof(*req->segments[0]),
GFP_KERNEL);
@@ -166,14 +167,14 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
goto fail;
}
}
- spin_lock_init(&blkif->pending_free_lock);
- init_waitqueue_head(&blkif->pending_free_wq);
- init_waitqueue_head(&blkif->shutdown_wq);
+ spin_lock_init(&ring->pending_free_lock);
+ init_waitqueue_head(&ring->pending_free_wq);
+ init_waitqueue_head(&ring->shutdown_wq);

return blkif;

fail:
- list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
+ list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
list_del(&req->free_list);
for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
if (!req->segments[j])
@@ -193,16 +194,17 @@ fail:
return ERR_PTR(-ENOMEM);
}

-static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
+static int xen_blkif_map(struct xen_blkif_ring *ring, unsigned long shared_page,
unsigned int evtchn)
{
int err;
+ struct xen_blkif *blkif = ring->blkif;

/* Already connected through? */
- if (blkif->irq)
+ if (ring->irq)
return 0;

- err = xenbus_map_ring_valloc(blkif->be->dev, shared_page, &blkif->blk_ring);
+ err = xenbus_map_ring_valloc(blkif->be->dev, shared_page, &ring->blk_ring);
if (err < 0)
return err;

@@ -210,22 +212,22 @@ static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,
case BLKIF_PROTOCOL_NATIVE:
{
struct blkif_sring *sring;
- sring = (struct blkif_sring *)blkif->blk_ring;
- BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE);
+ sring = (struct blkif_sring *)ring->blk_ring;
+ BACK_RING_INIT(&ring->blk_rings.native, sring, PAGE_SIZE);
break;
}
case BLKIF_PROTOCOL_X86_32:
{
struct blkif_x86_32_sring *sring_x86_32;
- sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
- BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
+ sring_x86_32 = (struct blkif_x86_32_sring *)ring->blk_ring;
+ BACK_RING_INIT(&ring->blk_rings.x86_32, sring_x86_32, PAGE_SIZE);
break;
}
case BLKIF_PROTOCOL_X86_64:
{
struct blkif_x86_64_sring *sring_x86_64;
- sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
- BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
+ sring_x86_64 = (struct blkif_x86_64_sring *)ring->blk_ring;
+ BACK_RING_INIT(&ring->blk_rings.x86_64, sring_x86_64, PAGE_SIZE);
break;
}
default:
@@ -234,44 +236,46 @@ static int xen_blkif_map(struct xen_blkif *blkif, unsigned long shared_page,

err = bind_interdomain_evtchn_to_irqhandler(blkif->domid, evtchn,
xen_blkif_be_int, 0,
- "blkif-backend", blkif);
+ "blkif-backend", ring);
if (err < 0) {
- xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
- blkif->blk_rings.common.sring = NULL;
+ xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+ ring->blk_rings.common.sring = NULL;
return err;
}
- blkif->irq = err;
+ ring->irq = err;

return 0;
}

static int xen_blkif_disconnect(struct xen_blkif *blkif)
{
- if (blkif->xenblkd) {
- kthread_stop(blkif->xenblkd);
- wake_up(&blkif->shutdown_wq);
- blkif->xenblkd = NULL;
+ struct xen_blkif_ring *ring = &blkif->ring;
+
+ if (ring->xenblkd) {
+ kthread_stop(ring->xenblkd);
+ wake_up(&ring->shutdown_wq);
+ ring->xenblkd = NULL;
}

/* The above kthread_stop() guarantees that at this point we
* don't have any discard_io or other_io requests. So, checking
* for inflight IO is enough.
*/
- if (atomic_read(&blkif->inflight) > 0)
+ if (atomic_read(&ring->inflight) > 0)
return -EBUSY;

- if (blkif->irq) {
- unbind_from_irqhandler(blkif->irq, blkif);
- blkif->irq = 0;
+ if (ring->irq) {
+ unbind_from_irqhandler(ring->irq, ring);
+ ring->irq = 0;
}

- if (blkif->blk_rings.common.sring) {
- xenbus_unmap_ring_vfree(blkif->be->dev, blkif->blk_ring);
- blkif->blk_rings.common.sring = NULL;
+ if (ring->blk_rings.common.sring) {
+ xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+ ring->blk_rings.common.sring = NULL;
}

/* Remove all persistent grants and the cache of ballooned pages. */
- xen_blkbk_free_caches(blkif);
+ xen_blkbk_free_caches(ring);

return 0;
}
@@ -280,20 +284,21 @@ static void xen_blkif_free(struct xen_blkif *blkif)
{
struct pending_req *req, *n;
int i = 0, j;
+ struct xen_blkif_ring *ring = &blkif->ring;

xen_blkif_disconnect(blkif);
xen_vbd_free(&blkif->vbd);

/* Make sure everything is drained before shutting down */
- BUG_ON(blkif->persistent_gnt_c != 0);
- BUG_ON(atomic_read(&blkif->persistent_gnt_in_use) != 0);
- BUG_ON(blkif->free_pages_num != 0);
- BUG_ON(!list_empty(&blkif->persistent_purge_list));
- BUG_ON(!list_empty(&blkif->free_pages));
- BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
+ BUG_ON(ring->persistent_gnt_c != 0);
+ BUG_ON(atomic_read(&ring->persistent_gnt_in_use) != 0);
+ BUG_ON(ring->free_pages_num != 0);
+ BUG_ON(!list_empty(&ring->persistent_purge_list));
+ BUG_ON(!list_empty(&ring->free_pages));
+ BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));

/* Check that there is no request in use */
- list_for_each_entry_safe(req, n, &blkif->pending_free, free_list) {
+ list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
list_del(&req->free_list);

for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
@@ -333,6 +338,16 @@ int __init xen_blkif_interface_init(void)
{ \
struct xenbus_device *dev = to_xenbus_device(_dev); \
struct backend_info *be = dev_get_drvdata(&dev->dev); \
+ struct xen_blkif *blkif = be->blkif; \
+ struct xen_blkif_ring *ring = &blkif->ring; \
+ \
+ blkif->st_oo_req = ring->st_oo_req; \
+ blkif->st_rd_req = ring->st_rd_req; \
+ blkif->st_wr_req = ring->st_wr_req; \
+ blkif->st_f_req = ring->st_f_req; \
+ blkif->st_ds_req = ring->st_ds_req; \
+ blkif->st_rd_sect = ring->st_rd_sect; \
+ blkif->st_wr_sect = ring->st_wr_sect; \
\
return sprintf(buf, format, ##args); \
} \
@@ -897,7 +912,7 @@ static int connect_ring(struct backend_info *be)
pers_grants ? "persistent grants" : "");

/* Map the shared frame, irq etc. */
- err = xen_blkif_map(be->blkif, ring_ref, evtchn);
+ err = xen_blkif_map(&be->blkif->ring, ring_ref, evtchn);
if (err) {
xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
ring_ref, evtchn);
--
1.8.3.1

2015-02-15 08:20:19

by Bob Liu

[permalink] [raw]
Subject: [PATCH 06/10] xen/blkfront: pseudo support for multi hardware queues

Preparation patch for multi hardware queues; the number of rings is still hard-coded to 1
at this stage.

Signed-off-by: Arianna Avanzini <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 408 +++++++++++++++++++++++++------------------
1 file changed, 234 insertions(+), 174 deletions(-)
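
The sketch below is a standalone, compilable illustration (plain C, not the driver code) of
the shape this patch gives the frontend: blkfront_info now carries a dynamically allocated
array of blkfront_ring_info plus an nr_rings count, still hard-coded to 1, and the
probe/teardown paths loop over that array. The calloc()/printf() scaffolding and the -1
error return exist only to keep the example self-contained (the driver returns -ENOMEM).

#include <stdio.h>
#include <stdlib.h>

struct blkfront_info;                   /* per-device state (forward decl) */

struct blkfront_ring_info {
        unsigned int ring_ref;          /* one shared ring per entry   */
        struct blkfront_info *info;     /* back pointer to the device  */
};

struct blkfront_info {
        unsigned int nr_rings;
        struct blkfront_ring_info *rinfo;   /* array, was a single member */
};

static int blkfront_probe_rings(struct blkfront_info *info)
{
        unsigned int i;

        info->nr_rings = 1;             /* mandatory single ring at this stage */
        info->rinfo = calloc(info->nr_rings, sizeof(*info->rinfo));
        if (!info->rinfo)
                return -1;

        for (i = 0; i < info->nr_rings; i++)
                info->rinfo[i].info = info;     /* per-ring init loop */
        return 0;
}

int main(void)
{
        struct blkfront_info info = { 0 };

        if (blkfront_probe_rings(&info))
                return 1;
        printf("initialised %u ring(s)\n", info.nr_rings);
        free(info.rinfo);
        return 0;
}

Once every helper iterates over info->rinfo[] instead of touching a single embedded ring,
the later negotiation patch only has to change the value assigned to nr_rings.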

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index aaa4a0e..d551be0 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -143,7 +143,8 @@ struct blkfront_info {
unsigned int max_indirect_segments;
int is_ready;
struct blk_mq_tag_set tag_set;
- struct blkfront_ring_info rinfo;
+ struct blkfront_ring_info *rinfo;
+ unsigned int nr_rings;
};

static unsigned int nr_minors;
@@ -176,7 +177,8 @@ static DEFINE_SPINLOCK(minor_lock);
#define INDIRECT_GREFS(_segs) \
((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)

-static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo);
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo, unsigned int segs);
+static int blkfront_gather_indirect(struct blkfront_info *info);

static int get_id_from_freelist(struct blkfront_ring_info *rinfo)
{
@@ -656,7 +658,7 @@ static int blk_mq_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
{
struct blkfront_info *info = (struct blkfront_info *)data;

- hctx->driver_data = &info->rinfo;
+ hctx->driver_data = &info->rinfo[index];
return 0;
}

@@ -915,8 +917,8 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,

static void xlvbd_release_gendisk(struct blkfront_info *info)
{
- unsigned int minor, nr_minors;
- struct blkfront_ring_info *rinfo = &info->rinfo;
+ unsigned int minor, nr_minors, i;
+ struct blkfront_ring_info *rinfo;

if (info->rq == NULL)
return;
@@ -924,11 +926,14 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)
/* No more blkif_request(). */
blk_mq_stop_hw_queues(info->rq);

- /* No more gnttab callback work. */
- gnttab_cancel_free_callback(&rinfo->callback);
+ for (i = 0; i < info->nr_rings; i++) {
+ rinfo = &info->rinfo[i];
+ /* No more gnttab callback work. */
+ gnttab_cancel_free_callback(&rinfo->callback);

- /* Flush gnttab callback work. Must be done with no locks held. */
- flush_work(&rinfo->work);
+ /* Flush gnttab callback work. Must be done with no locks held. */
+ flush_work(&rinfo->work);
+ }

del_gendisk(info->gd);

@@ -969,8 +974,8 @@ static void blkif_free(struct blkfront_info *info, int suspend)
{
struct grant *persistent_gnt;
struct grant *n;
- int i, j, segs;
- struct blkfront_ring_info *rinfo = &info->rinfo;
+ int i, j, segs, rindex;
+ struct blkfront_ring_info *rinfo;

/* Prevent new requests being issued until we fix things up. */
info->connected = suspend ?
@@ -979,97 +984,100 @@ static void blkif_free(struct blkfront_info *info, int suspend)
if (info->rq)
blk_mq_stop_hw_queues(info->rq);

- spin_lock_irq(&rinfo->io_lock);
- /* Remove all persistent grants */
- if (!list_empty(&rinfo->grants)) {
- list_for_each_entry_safe(persistent_gnt, n,
- &rinfo->grants, node) {
- list_del(&persistent_gnt->node);
- if (persistent_gnt->gref != GRANT_INVALID_REF) {
- gnttab_end_foreign_access(persistent_gnt->gref,
- 0, 0UL);
- rinfo->persistent_gnts_c--;
+ for (rindex = 0; rindex < info->nr_rings; rindex++) {
+ rinfo = &info->rinfo[rindex];
+ spin_lock_irq(&rinfo->io_lock);
+ /* Remove all persistent grants */
+ if (!list_empty(&rinfo->grants)) {
+ list_for_each_entry_safe(persistent_gnt, n,
+ &rinfo->grants, node) {
+ list_del(&persistent_gnt->node);
+ if (persistent_gnt->gref != GRANT_INVALID_REF) {
+ gnttab_end_foreign_access(persistent_gnt->gref,
+ 0, 0UL);
+ rinfo->persistent_gnts_c--;
+ }
+ if (info->feature_persistent)
+ __free_page(pfn_to_page(persistent_gnt->pfn));
+ kfree(persistent_gnt);
}
- if (info->feature_persistent)
- __free_page(pfn_to_page(persistent_gnt->pfn));
- kfree(persistent_gnt);
}
- }
- BUG_ON(rinfo->persistent_gnts_c != 0);
+ BUG_ON(rinfo->persistent_gnts_c != 0);

- /*
- * Remove indirect pages, this only happens when using indirect
- * descriptors but not persistent grants
- */
- if (!list_empty(&rinfo->indirect_pages)) {
- struct page *indirect_page, *n;
-
- BUG_ON(info->feature_persistent);
- list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
- list_del(&indirect_page->lru);
- __free_page(indirect_page);
- }
- }
-
- for (i = 0; i < BLK_RING_SIZE; i++) {
/*
- * Clear persistent grants present in requests already
- * on the shared ring
+ * Remove indirect pages, this only happens when using indirect
+ * descriptors but not persistent grants
*/
- if (!rinfo->shadow[i].request)
- goto free_shadow;
-
- segs = rinfo->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
- rinfo->shadow[i].req.u.indirect.nr_segments :
- rinfo->shadow[i].req.u.rw.nr_segments;
- for (j = 0; j < segs; j++) {
- persistent_gnt = rinfo->shadow[i].grants_used[j];
- gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
- if (info->feature_persistent)
- __free_page(pfn_to_page(persistent_gnt->pfn));
- kfree(persistent_gnt);
+ if (!list_empty(&rinfo->indirect_pages)) {
+ struct page *indirect_page, *n;
+
+ BUG_ON(info->feature_persistent);
+ list_for_each_entry_safe(indirect_page, n, &rinfo->indirect_pages, lru) {
+ list_del(&indirect_page->lru);
+ __free_page(indirect_page);
+ }
}

- if (rinfo->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+ for (i = 0; i < BLK_RING_SIZE; i++) {
/*
- * If this is not an indirect operation don't try to
- * free indirect segments
+ * Clear persistent grants present in requests already
+ * on the shared ring
*/
- goto free_shadow;
+ if (!rinfo->shadow[i].request)
+ goto free_shadow;
+
+ segs = rinfo->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
+ rinfo->shadow[i].req.u.indirect.nr_segments :
+ rinfo->shadow[i].req.u.rw.nr_segments;
+ for (j = 0; j < segs; j++) {
+ persistent_gnt = rinfo->shadow[i].grants_used[j];
+ gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+ if (info->feature_persistent)
+ __free_page(pfn_to_page(persistent_gnt->pfn));
+ kfree(persistent_gnt);
+ }

- for (j = 0; j < INDIRECT_GREFS(segs); j++) {
- persistent_gnt = rinfo->shadow[i].indirect_grants[j];
- gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
- __free_page(pfn_to_page(persistent_gnt->pfn));
- kfree(persistent_gnt);
- }
+ if (rinfo->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+ /*
+ * If this is not an indirect operation don't try to
+ * free indirect segments
+ */
+ goto free_shadow;
+
+ for (j = 0; j < INDIRECT_GREFS(segs); j++) {
+ persistent_gnt = rinfo->shadow[i].indirect_grants[j];
+ gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+ __free_page(pfn_to_page(persistent_gnt->pfn));
+ kfree(persistent_gnt);
+ }

free_shadow:
- kfree(rinfo->shadow[i].grants_used);
- rinfo->shadow[i].grants_used = NULL;
- kfree(rinfo->shadow[i].indirect_grants);
- rinfo->shadow[i].indirect_grants = NULL;
- kfree(rinfo->shadow[i].sg);
- rinfo->shadow[i].sg = NULL;
- }
+ kfree(rinfo->shadow[i].grants_used);
+ rinfo->shadow[i].grants_used = NULL;
+ kfree(rinfo->shadow[i].indirect_grants);
+ rinfo->shadow[i].indirect_grants = NULL;
+ kfree(rinfo->shadow[i].sg);
+ rinfo->shadow[i].sg = NULL;
+ }

- /* No more gnttab callback work. */
- gnttab_cancel_free_callback(&rinfo->callback);
- spin_unlock_irq(&rinfo->io_lock);
+ /* No more gnttab callback work. */
+ gnttab_cancel_free_callback(&rinfo->callback);
+ spin_unlock_irq(&rinfo->io_lock);

- /* Flush gnttab callback work. Must be done with no locks held. */
- flush_work(&rinfo->work);
+ /* Flush gnttab callback work. Must be done with no locks held. */
+ flush_work(&rinfo->work);

- /* Free resources associated with old device channel. */
- if (rinfo->ring_ref != GRANT_INVALID_REF) {
- gnttab_end_foreign_access(rinfo->ring_ref, 0,
- (unsigned long)rinfo->ring.sring);
- rinfo->ring_ref = GRANT_INVALID_REF;
- rinfo->ring.sring = NULL;
+ /* Free resources associated with old device channel. */
+ if (rinfo->ring_ref != GRANT_INVALID_REF) {
+ gnttab_end_foreign_access(rinfo->ring_ref, 0,
+ (unsigned long)rinfo->ring.sring);
+ rinfo->ring_ref = GRANT_INVALID_REF;
+ rinfo->ring.sring = NULL;
+ }
+ if (rinfo->irq)
+ unbind_from_irqhandler(rinfo->irq, rinfo);
+ rinfo->evtchn = rinfo->irq = 0;
}
- if (rinfo->irq)
- unbind_from_irqhandler(rinfo->irq, rinfo);
- rinfo->evtchn = rinfo->irq = 0;

}

@@ -1265,6 +1273,18 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
return IRQ_HANDLED;
}

+static void destroy_blkring(struct xenbus_device *dev,
+ struct blkfront_ring_info *rinfo)
+{
+ if (rinfo->irq)
+ unbind_from_irqhandler(rinfo->irq, rinfo);
+ if (rinfo->evtchn)
+ xenbus_free_evtchn(dev, rinfo->evtchn);
+ if (rinfo->ring_ref != GRANT_INVALID_REF)
+ gnttab_end_foreign_access(rinfo->ring_ref, 0, (unsigned long)rinfo->ring.sring);
+ if (rinfo->ring.sring)
+ free_page((unsigned long)rinfo->ring.sring);
+}

static int setup_blkring(struct xenbus_device *dev,
struct blkfront_ring_info *rinfo)
@@ -1305,7 +1325,7 @@ static int setup_blkring(struct xenbus_device *dev,

return 0;
fail:
- blkif_free(rinfo->info, 0);
+ destroy_blkring(dev, rinfo);
return err;
}

@@ -1316,31 +1336,40 @@ static int talk_to_blkback(struct xenbus_device *dev,
{
const char *message = NULL;
struct xenbus_transaction xbt;
- int err;
- struct blkfront_ring_info *rinfo = &info->rinfo;
+ int err, i;
+ struct blkfront_ring_info *rinfo;

- /* Create shared ring, alloc event channel. */
- err = setup_blkring(dev, rinfo);
- if (err)
- goto out;
+ for (i = 0; i < info->nr_rings; i++) {
+ rinfo = &info->rinfo[i];
+ /* Create shared ring, alloc event channel. */
+ err = setup_blkring(dev, rinfo);
+ if (err)
+ goto out;
+ }

again:
err = xenbus_transaction_start(&xbt);
if (err) {
xenbus_dev_fatal(dev, err, "starting transaction");
- goto destroy_blkring;
+ goto out;
}

- err = xenbus_printf(xbt, dev->nodename,
- "ring-ref", "%u", rinfo->ring_ref);
- if (err) {
- message = "writing ring-ref";
- goto abort_transaction;
- }
- err = xenbus_printf(xbt, dev->nodename,
- "event-channel", "%u", rinfo->evtchn);
- if (err) {
- message = "writing event-channel";
+ if (info->nr_rings == 1) {
+ rinfo = &info->rinfo[0];
+ err = xenbus_printf(xbt, dev->nodename,
+ "ring-ref", "%u", rinfo->ring_ref);
+ if (err) {
+ message = "writing ring-ref";
+ goto abort_transaction;
+ }
+ err = xenbus_printf(xbt, dev->nodename,
+ "event-channel", "%u", rinfo->evtchn);
+ if (err) {
+ message = "writing event-channel";
+ goto abort_transaction;
+ }
+ } else {
+ /* Not supported at this stage */
goto abort_transaction;
}
err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
@@ -1360,7 +1389,7 @@ again:
if (err == -EAGAIN)
goto again;
xenbus_dev_fatal(dev, err, "completing transaction");
- goto destroy_blkring;
+ goto out;
}

xenbus_switch_state(dev, XenbusStateInitialised);
@@ -1371,9 +1400,11 @@ again:
xenbus_transaction_end(xbt, 1);
if (message)
xenbus_dev_fatal(dev, err, "%s", message);
- destroy_blkring:
- blkif_free(info, 0);
out:
+ while (--i >= 0) {
+ rinfo = &info->rinfo[i];
+ destroy_blkring(dev, rinfo);
+ }
return err;
}

@@ -1386,7 +1417,7 @@ again:
static int blkfront_probe(struct xenbus_device *dev,
const struct xenbus_device_id *id)
{
- int err, vdevice, i;
+ int err, vdevice, i, rindex;
struct blkfront_info *info;
struct blkfront_ring_info *rinfo;

@@ -1437,22 +1468,32 @@ static int blkfront_probe(struct xenbus_device *dev,
xenbus_dev_fatal(dev, -ENOMEM, "allocating info structure");
return -ENOMEM;
}
-
- rinfo = &info->rinfo;
mutex_init(&info->mutex);
- spin_lock_init(&rinfo->io_lock);
info->xbdev = dev;
info->vdevice = vdevice;
- INIT_LIST_HEAD(&rinfo->grants);
- INIT_LIST_HEAD(&rinfo->indirect_pages);
- rinfo->persistent_gnts_c = 0;
info->connected = BLKIF_STATE_DISCONNECTED;
- rinfo->info = info;
- INIT_WORK(&rinfo->work, blkif_restart_queue);

- for (i = 0; i < BLK_RING_SIZE; i++)
- rinfo->shadow[i].req.u.rw.id = i+1;
- rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+ info->nr_rings = 1;
+ info->rinfo = kzalloc(sizeof(*rinfo) * info->nr_rings, GFP_KERNEL);
+ if (!info->rinfo) {
+ xenbus_dev_fatal(dev, -ENOMEM, "allocating ring_info structure");
+ kfree(info);
+ return -ENOMEM;
+ }
+
+ for (rindex = 0; rindex < info->nr_rings; rindex++) {
+ rinfo = &info->rinfo[rindex];
+ spin_lock_init(&rinfo->io_lock);
+ INIT_LIST_HEAD(&rinfo->grants);
+ INIT_LIST_HEAD(&rinfo->indirect_pages);
+ rinfo->persistent_gnts_c = 0;
+ rinfo->info = info;
+ INIT_WORK(&rinfo->work, blkif_restart_queue);
+
+ for (i = 0; i < BLK_RING_SIZE; i++)
+ rinfo->shadow[i].req.u.rw.id = i+1;
+ rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+ }

/* Front end dir is a number, which is used as the id. */
info->handle = simple_strtoul(strrchr(dev->nodename, '/')+1, NULL, 0);
@@ -1485,7 +1526,7 @@ static void split_bio_end(struct bio *bio, int error)

static int blkif_recover(struct blkfront_info *info)
{
- int i;
+ int i, rindex;
struct request *req, *n;
struct blk_shadow *copy;
int rc;
@@ -1495,64 +1536,70 @@ static int blkif_recover(struct blkfront_info *info)
int pending, size;
struct split_bio *split_bio;
struct list_head requests;
- struct blkfront_ring_info *rinfo = &info->rinfo;
-
- /* Stage 1: Make a safe copy of the shadow state. */
- copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
- GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
- if (!copy)
- return -ENOMEM;
-
- /* Stage 2: Set up free list. */
- memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
- for (i = 0; i < BLK_RING_SIZE; i++)
- rinfo->shadow[i].req.u.rw.id = i+1;
- rinfo->shadow_free = rinfo->ring.req_prod_pvt;
- rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
-
- rc = blkfront_setup_indirect(rinfo);
- if (rc) {
- kfree(copy);
- return rc;
- }
+ struct blkfront_ring_info *rinfo;

- segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ segs = blkfront_gather_indirect(info);
blk_queue_max_segments(info->rq, segs);
bio_list_init(&bio_list);
INIT_LIST_HEAD(&requests);
- for (i = 0; i < BLK_RING_SIZE; i++) {
- /* Not in use? */
- if (!copy[i].request)
- continue;
+ for (rindex = 0; rindex < info->nr_rings; rindex++) {
+ rinfo = &info->rinfo[rindex];
+ /* Stage 1: Make a safe copy of the shadow state. */
+ copy = kmemdup(rinfo->shadow, sizeof(rinfo->shadow),
+ GFP_NOIO | __GFP_REPEAT | __GFP_HIGH);
+ if (!copy)
+ return -ENOMEM;
+
+ /* Stage 2: Set up free list. */
+ memset(&rinfo->shadow, 0, sizeof(rinfo->shadow));
+ for (i = 0; i < BLK_RING_SIZE; i++)
+ rinfo->shadow[i].req.u.rw.id = i+1;
+ rinfo->shadow_free = rinfo->ring.req_prod_pvt;
+ rinfo->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;
+
+ rc = blkfront_setup_indirect(rinfo, segs);
+ if (rc) {
+ kfree(copy);
+ return rc;
+ }
+
+ for (i = 0; i < BLK_RING_SIZE; i++) {
+ /* Not in use? */
+ if (!copy[i].request)
+ continue;

- /*
- * Get the bios in the request so we can re-queue them.
- */
- if (copy[i].request->cmd_flags &
- (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
/*
- * Flush operations don't contain bios, so
- * we need to requeue the whole request
+ * Get the bios in the request so we can re-queue them.
*/
- list_add(&copy[i].request->queuelist, &requests);
- continue;
+ if (copy[i].request->cmd_flags &
+ (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+ /*
+ * Flush operations don't contain bios, so
+ * we need to requeue the whole request
+ */
+ list_add(&copy[i].request->queuelist, &requests);
+ continue;
+ }
+ merge_bio.head = copy[i].request->bio;
+ merge_bio.tail = copy[i].request->biotail;
+ bio_list_merge(&bio_list, &merge_bio);
+ copy[i].request->bio = NULL;
+ blk_put_request(copy[i].request);
}
- merge_bio.head = copy[i].request->bio;
- merge_bio.tail = copy[i].request->biotail;
- bio_list_merge(&bio_list, &merge_bio);
- copy[i].request->bio = NULL;
- blk_put_request(copy[i].request);
- }

- kfree(copy);
+ kfree(copy);
+ }

xenbus_switch_state(info->xbdev, XenbusStateConnected);

/* Now safe for us to use the shared ring */
info->connected = BLKIF_STATE_CONNECTED;

- /* Kick any other new requests queued since we resumed */
- kick_pending_request_queues(rinfo);
+ for (rindex = 0; rindex < info->nr_rings; rindex++) {
+ rinfo = &info->rinfo[rindex];
+ /* Kick any other new requests queued since we resumed */
+ kick_pending_request_queues(rinfo);
+ }

list_for_each_entry_safe(req, n, &requests, queuelist) {
/* Requeue pending requests (flush or discard) */
@@ -1685,11 +1732,10 @@ static void blkfront_setup_discard(struct blkfront_info *info)
info->feature_secdiscard = !!discard_secure;
}

-static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo)
+static int blkfront_gather_indirect(struct blkfront_info *info)
{
unsigned int indirect_segments, segs;
- int err, i;
- struct blkfront_info *info = rinfo->info;
+ int err;

err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
"feature-max-indirect-segments", "%u", &indirect_segments,
@@ -1702,6 +1748,13 @@ static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo)
xen_blkif_max_segments);
segs = info->max_indirect_segments;
}
+ return segs;
+}
+
+static int blkfront_setup_indirect(struct blkfront_ring_info *rinfo, unsigned int segs)
+{
+ int err, i;
+ struct blkfront_info *info = rinfo->info;

err = fill_grant_buffer(rinfo, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
if (err)
@@ -1774,9 +1827,9 @@ static void blkfront_connect(struct blkfront_info *info)
unsigned long sector_size;
unsigned int physical_sector_size;
unsigned int binfo;
- int err;
+ int err, i;
int barrier, flush, discard, persistent;
- struct blkfront_ring_info *rinfo = &info->rinfo;
+ struct blkfront_ring_info *rinfo;

switch (info->connected) {
case BLKIF_STATE_CONNECTED:
@@ -1874,11 +1927,15 @@ static void blkfront_connect(struct blkfront_info *info)
else
info->feature_persistent = persistent;

- err = blkfront_setup_indirect(rinfo);
- if (err) {
- xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
- info->xbdev->otherend);
- return;
+ for (i = 0; i < info->nr_rings; i++) {
+ rinfo = &info->rinfo[i];
+ err = blkfront_setup_indirect(rinfo, blkfront_gather_indirect(info));
+ if (err) {
+ xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
+ info->xbdev->otherend);
+ blkif_free(info, 0);
+ return;
+ }
}

err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size,
@@ -1893,7 +1950,10 @@ static void blkfront_connect(struct blkfront_info *info)

/* Kick pending requests. */
info->connected = BLKIF_STATE_CONNECTED;
- kick_pending_request_queues(rinfo);
+ for (i = 0; i < info->nr_rings; i++) {
+ rinfo = &info->rinfo[i];
+ kick_pending_request_queues(rinfo);
+ }

add_disk(info->gd);

--
1.8.3.1

2015-02-15 08:21:01

by Bob Liu

[permalink] [raw]
Subject: [PATCH 07/10] xen/blkback: pseudo support for multi hardware queues

Preparatory patch for multiple hardware queues; for now the number of rings is hard-coded to 1.

Signed-off-by: Arianna Avanzini <[email protected]>
Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkback/common.h | 3 +-
drivers/block/xen-blkback/xenbus.c | 368 +++++++++++++++++++++++--------------
2 files changed, 233 insertions(+), 138 deletions(-)

diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 71863d4..4565deb 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -333,7 +333,8 @@ struct xen_blkif {
unsigned long long st_rd_sect;
unsigned long long st_wr_sect;
/* Rings for this device */
- struct xen_blkif_ring ring;
+ struct xen_blkif_ring *rings;
+ unsigned int nr_rings;
};

struct seg_buf {
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 4b7bde6..93e5f38 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -78,11 +78,14 @@ static int blkback_name(struct xen_blkif *blkif, char *buf)

static void xen_update_blkif_status(struct xen_blkif *blkif)
{
- int err;
+ int err, i;
char name[TASK_COMM_LEN];
+ char per_ring_name[TASK_COMM_LEN + 4];
+ struct xen_blkif_ring *ring;

- /* Not ready to connect? */
- if (!blkif->ring.irq || !blkif->vbd.bdev)
+ /* Not ready to connect? Check irq of first ring as the others
+ * should all be the same.*/
+ if (!blkif->rings || !blkif->rings[0].irq || !blkif->vbd.bdev)
return;

/* Already connected? */
@@ -107,21 +110,108 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
}
invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);

- blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, &blkif->ring, "%s", name);
- if (IS_ERR(blkif->ring.xenblkd)) {
- err = PTR_ERR(blkif->ring.xenblkd);
- blkif->ring.xenblkd = NULL;
- xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
- return;
+ if (blkif->nr_rings == 1) {
+ blkif->rings[0].xenblkd = kthread_run(xen_blkif_schedule, &blkif->rings[0], "%s", name);
+ if (IS_ERR(blkif->rings[0].xenblkd)) {
+ err = PTR_ERR(blkif->rings[0].xenblkd);
+ blkif->rings[0].xenblkd = NULL;
+ xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
+ return;
+ }
+ } else {
+ for (i = 0 ; i < blkif->nr_rings ; i++) {
+ snprintf(per_ring_name, TASK_COMM_LEN + 1, "%s-%d", name, i);
+ ring = &blkif->rings[i];
+ ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s", per_ring_name);
+ if (IS_ERR(ring->xenblkd)) {
+ err = PTR_ERR(ring->xenblkd);
+ ring->xenblkd = NULL;
+ xenbus_dev_error(blkif->be->dev, err,
+ "start %s xenblkd", per_ring_name);
+ return;
+ }
+ }
}
}

+static int xen_blkif_alloc_rings(struct xen_blkif *blkif)
+{
+ struct xen_blkif_ring *ring;
+ struct pending_req *req, *n;
+ int i, j, r;
+
+ blkif->rings = kzalloc(blkif->nr_rings * sizeof(struct xen_blkif_ring), GFP_KERNEL);
+ if (!blkif->rings)
+ return -ENOMEM;
+
+ for (r = 0; r < blkif->nr_rings; r++) {
+ ring = &blkif->rings[r];
+ spin_lock_init(&ring->blk_ring_lock);
+ init_waitqueue_head(&ring->wq);
+ ring->st_print = jiffies;
+ ring->persistent_gnts.rb_node = NULL;
+ spin_lock_init(&ring->free_pages_lock);
+ INIT_LIST_HEAD(&ring->free_pages);
+ INIT_LIST_HEAD(&ring->persistent_purge_list);
+ ring->free_pages_num = 0;
+ atomic_set(&ring->persistent_gnt_in_use, 0);
+ atomic_set(&ring->inflight, 0);
+ INIT_WORK(&ring->persistent_purge_work, xen_blkbk_unmap_purged_grants);
+ INIT_LIST_HEAD(&ring->pending_free);
+
+ for (i = 0; i < XEN_BLKIF_REQS; i++) {
+ req = kzalloc(sizeof(*req), GFP_KERNEL);
+ if (!req)
+ goto fail;
+ list_add_tail(&req->free_list,
+ &ring->pending_free);
+ for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+ req->segments[j] = kzalloc(sizeof(*req->segments[0]),
+ GFP_KERNEL);
+ if (!req->segments[j])
+ goto fail;
+ }
+ for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+ req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
+ GFP_KERNEL);
+ if (!req->indirect_pages[j])
+ goto fail;
+ }
+ }
+ spin_lock_init(&ring->pending_free_lock);
+ init_waitqueue_head(&ring->pending_free_wq);
+ init_waitqueue_head(&ring->shutdown_wq);
+ ring->blkif = blkif;
+ xen_blkif_get(blkif);
+ }
+ return 0;
+
+fail:
+ while (--r >= 0) {
+ ring = &blkif->rings[r];
+ list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
+ list_del(&req->free_list);
+ for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
+ if (!req->segments[j])
+ break;
+ kfree(req->segments[j]);
+ }
+ for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
+ if (!req->indirect_pages[j])
+ break;
+ kfree(req->indirect_pages[j]);
+ }
+ kfree(req);
+ }
+ xen_blkif_put(blkif);
+ }
+ kfree(blkif->rings);
+ return -ENOMEM;
+}
+
static struct xen_blkif *xen_blkif_alloc(domid_t domid)
{
struct xen_blkif *blkif;
- struct pending_req *req, *n;
- int i, j;
- struct xen_blkif_ring *ring;

BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);

@@ -130,68 +220,18 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
return ERR_PTR(-ENOMEM);

blkif->domid = domid;
- ring = &blkif->ring;
- spin_lock_init(&ring->blk_ring_lock);
atomic_set(&blkif->refcnt, 1);
- init_waitqueue_head(&ring->wq);
init_completion(&blkif->drain_complete);
atomic_set(&blkif->drain, 0);
- ring->st_print = jiffies;
- ring->persistent_gnts.rb_node = NULL;
- spin_lock_init(&ring->free_pages_lock);
- INIT_LIST_HEAD(&ring->free_pages);
- INIT_LIST_HEAD(&ring->persistent_purge_list);
- ring->free_pages_num = 0;
- atomic_set(&ring->persistent_gnt_in_use, 0);
- atomic_set(&ring->inflight, 0);
- INIT_WORK(&ring->persistent_purge_work, xen_blkbk_unmap_purged_grants);
-
- INIT_LIST_HEAD(&ring->pending_free);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);

- for (i = 0; i < XEN_BLKIF_REQS; i++) {
- req = kzalloc(sizeof(*req), GFP_KERNEL);
- if (!req)
- goto fail;
- list_add_tail(&req->free_list, &ring->pending_free);
- for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
- req->segments[j] = kzalloc(sizeof(*req->segments[0]),
- GFP_KERNEL);
- if (!req->segments[j])
- goto fail;
- }
- for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
- req->indirect_pages[j] = kzalloc(sizeof(*req->indirect_pages[0]),
- GFP_KERNEL);
- if (!req->indirect_pages[j])
- goto fail;
- }
+ blkif->nr_rings = 1;
+ if (xen_blkif_alloc_rings(blkif)) {
+ kmem_cache_free(xen_blkif_cachep, blkif);
+ return ERR_PTR(-ENOMEM);
}
- spin_lock_init(&ring->pending_free_lock);
- init_waitqueue_head(&ring->pending_free_wq);
- init_waitqueue_head(&ring->shutdown_wq);

return blkif;
-
-fail:
- list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
- list_del(&req->free_list);
- for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++) {
- if (!req->segments[j])
- break;
- kfree(req->segments[j]);
- }
- for (j = 0; j < MAX_INDIRECT_PAGES; j++) {
- if (!req->indirect_pages[j])
- break;
- kfree(req->indirect_pages[j]);
- }
- kfree(req);
- }
-
- kmem_cache_free(xen_blkif_cachep, blkif);
-
- return ERR_PTR(-ENOMEM);
}

static int xen_blkif_map(struct xen_blkif_ring *ring, unsigned long shared_page,
@@ -249,69 +289,76 @@ static int xen_blkif_map(struct xen_blkif_ring *ring, unsigned long shared_page,

static int xen_blkif_disconnect(struct xen_blkif *blkif)
{
- struct xen_blkif_ring *ring = &blkif->ring;
+ struct xen_blkif_ring *ring;
+ int i;
+
+ for (i = 0; i < blkif->nr_rings; i++) {
+ ring = &blkif->rings[i];
+ if (ring->xenblkd) {
+ kthread_stop(ring->xenblkd);
+ wake_up(&ring->shutdown_wq);
+ ring->xenblkd = NULL;
+ }

- if (ring->xenblkd) {
- kthread_stop(ring->xenblkd);
- wake_up(&ring->shutdown_wq);
- ring->xenblkd = NULL;
- }
+ /* The above kthread_stop() guarantees that at this point we
+ * don't have any discard_io or other_io requests. So, checking
+ * for inflight IO is enough.
+ */
+ if (atomic_read(&ring->inflight) > 0)
+ return -EBUSY;

- /* The above kthread_stop() guarantees that at this point we
- * don't have any discard_io or other_io requests. So, checking
- * for inflight IO is enough.
- */
- if (atomic_read(&ring->inflight) > 0)
- return -EBUSY;
+ if (ring->irq) {
+ unbind_from_irqhandler(ring->irq, ring);
+ ring->irq = 0;
+ }

- if (ring->irq) {
- unbind_from_irqhandler(ring->irq, ring);
- ring->irq = 0;
- }
+ if (ring->blk_rings.common.sring) {
+ xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
+ ring->blk_rings.common.sring = NULL;
+ }

- if (ring->blk_rings.common.sring) {
- xenbus_unmap_ring_vfree(blkif->be->dev, ring->blk_ring);
- ring->blk_rings.common.sring = NULL;
+ /* Remove all persistent grants and the cache of ballooned pages. */
+ xen_blkbk_free_caches(ring);
}

- /* Remove all persistent grants and the cache of ballooned pages. */
- xen_blkbk_free_caches(ring);
-
return 0;
}

static void xen_blkif_free(struct xen_blkif *blkif)
{
struct pending_req *req, *n;
- int i = 0, j;
- struct xen_blkif_ring *ring = &blkif->ring;
+ int i = 0, j, r;
+ struct xen_blkif_ring *ring;

xen_blkif_disconnect(blkif);
xen_vbd_free(&blkif->vbd);

- /* Make sure everything is drained before shutting down */
- BUG_ON(ring->persistent_gnt_c != 0);
- BUG_ON(atomic_read(&ring->persistent_gnt_in_use) != 0);
- BUG_ON(ring->free_pages_num != 0);
- BUG_ON(!list_empty(&ring->persistent_purge_list));
- BUG_ON(!list_empty(&ring->free_pages));
- BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));
+ for (r = 0; r < blkif->nr_rings; r++) {
+ ring = &blkif->rings[r];
+ /* Make sure everything is drained before shutting down */
+ BUG_ON(ring->persistent_gnt_c != 0);
+ BUG_ON(atomic_read(&ring->persistent_gnt_in_use) != 0);
+ BUG_ON(ring->free_pages_num != 0);
+ BUG_ON(!list_empty(&ring->persistent_purge_list));
+ BUG_ON(!list_empty(&ring->free_pages));
+ BUG_ON(!RB_EMPTY_ROOT(&ring->persistent_gnts));

- /* Check that there is no request in use */
- list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
- list_del(&req->free_list);
+ /* Check that there is no request in use */
+ list_for_each_entry_safe(req, n, &ring->pending_free, free_list) {
+ list_del(&req->free_list);

- for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
- kfree(req->segments[j]);
+ for (j = 0; j < MAX_INDIRECT_SEGMENTS; j++)
+ kfree(req->segments[j]);

- for (j = 0; j < MAX_INDIRECT_PAGES; j++)
- kfree(req->indirect_pages[j]);
+ for (j = 0; j < MAX_INDIRECT_PAGES; j++)
+ kfree(req->indirect_pages[j]);

- kfree(req);
- i++;
- }
+ kfree(req);
+ i++;
+ }

- WARN_ON(i != XEN_BLKIF_REQS);
+ WARN_ON(i != XEN_BLKIF_REQS);
+ }

kmem_cache_free(xen_blkif_cachep, blkif);
}
@@ -339,15 +386,19 @@ int __init xen_blkif_interface_init(void)
struct xenbus_device *dev = to_xenbus_device(_dev); \
struct backend_info *be = dev_get_drvdata(&dev->dev); \
struct xen_blkif *blkif = be->blkif; \
- struct xen_blkif_ring *ring = &blkif->ring; \
+ struct xen_blkif_ring *ring; \
+ int i; \
\
- blkif->st_oo_req = ring->st_oo_req; \
- blkif->st_rd_req = ring->st_rd_req; \
- blkif->st_wr_req = ring->st_wr_req; \
- blkif->st_f_req = ring->st_f_req; \
- blkif->st_ds_req = ring->st_ds_req; \
- blkif->st_rd_sect = ring->st_rd_sect; \
- blkif->st_wr_sect = ring->st_wr_sect; \
+ for (i = 0; i < blkif->nr_rings; i++) { \
+ ring = &blkif->rings[i]; \
+ blkif->st_oo_req += ring->st_oo_req; \
+ blkif->st_rd_req += ring->st_rd_req; \
+ blkif->st_wr_req += ring->st_wr_req; \
+ blkif->st_f_req += ring->st_f_req; \
+ blkif->st_ds_req += ring->st_ds_req; \
+ blkif->st_rd_sect += ring->st_rd_sect; \
+ blkif->st_wr_sect += ring->st_wr_sect; \
+ } \
\
return sprintf(buf, format, ##args); \
} \
@@ -471,6 +522,7 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
static int xen_blkbk_remove(struct xenbus_device *dev)
{
struct backend_info *be = dev_get_drvdata(&dev->dev);
+ int i;

DPRINTK("");

@@ -487,7 +539,8 @@ static int xen_blkbk_remove(struct xenbus_device *dev)

if (be->blkif) {
xen_blkif_disconnect(be->blkif);
- xen_blkif_put(be->blkif);
+ for (i = 0; i < be->blkif->nr_rings; i++)
+ xen_blkif_put(be->blkif);
}

kfree(be->mode);
@@ -870,19 +923,13 @@ static int connect_ring(struct backend_info *be)
unsigned int evtchn;
unsigned int pers_grants;
char protocol[64] = "";
- int err;
+ int err, i;
+ char *xspath;
+ size_t xspathsize;
+ const size_t xenstore_path_ext_size = 11; /* sufficient for "/queue-NNN" */

DPRINTK("%s", dev->otherend);

- err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
- &ring_ref, "event-channel", "%u", &evtchn, NULL);
- if (err) {
- xenbus_dev_fatal(dev, err,
- "reading %s/ring-ref and event-channel",
- dev->otherend);
- return err;
- }
-
be->blkif->blk_protocol = BLKIF_PROTOCOL_NATIVE;
err = xenbus_gather(XBT_NIL, dev->otherend, "protocol",
"%63s", protocol, NULL);
@@ -907,19 +954,66 @@ static int connect_ring(struct backend_info *be)
be->blkif->vbd.feature_gnt_persistent = pers_grants;
be->blkif->vbd.overflow_max_grants = 0;

- pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
- ring_ref, evtchn, be->blkif->blk_protocol, protocol,
- pers_grants ? "persistent grants" : "");
+ if (be->blkif->nr_rings == 1) {
+ err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
+ &ring_ref, "event-channel", "%u", &evtchn, NULL);
+ if (err) {
+ xenbus_dev_fatal(dev, err,
+ "reading %s/ring-ref and event-channel",
+ dev->otherend);
+ goto out;
+ }

- /* Map the shared frame, irq etc. */
- err = xen_blkif_map(&be->blkif->ring, ring_ref, evtchn);
- if (err) {
- xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
- ring_ref, evtchn);
- return err;
- }
+ pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
+ ring_ref, evtchn, be->blkif->blk_protocol, protocol,
+ pers_grants ? "persistent grants" : "");

+ /* Map the shared frame, irq etc. */
+ err = xen_blkif_map(&be->blkif->rings[0], ring_ref, evtchn);
+ if (err) {
+ xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
+ ring_ref, evtchn);
+ goto out;
+ }
+ } else {
+ xspathsize = strlen(dev->otherend) + xenstore_path_ext_size;
+ xspath = kzalloc(xspathsize, GFP_KERNEL);
+ if (!xspath) {
+ xenbus_dev_fatal(dev, -ENOMEM, "reading ring references");
+ err = -ENOMEM;
+ goto out;
+ }
+
+ for (i = 0; i < be->blkif->nr_rings; i++) {
+ memset(xspath, 0, xspathsize);
+ snprintf(xspath, xspathsize, "%s/queue-%u", dev->otherend, i);
+ err = xenbus_gather(XBT_NIL, xspath, "ring-ref", "%lu",
+ &ring_ref, "event-channel", "%u", &evtchn, NULL);
+ if (err) {
+ xenbus_dev_fatal(dev, err,
+ "reading %s %d/ring-ref and event-channel",
+ xspath, i);
+ kfree(xspath);
+ goto out;
+ }
+
+ pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
+ ring_ref, evtchn, be->blkif->blk_protocol, protocol,
+ pers_grants ? "persistent grants" : "");
+ /* Map the shared frame, irq etc. */
+ err = xen_blkif_map(&be->blkif->rings[i], ring_ref, evtchn);
+ if (err) {
+ xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u",
+ ring_ref, evtchn);
+ kfree(xspath);
+ goto out;
+ }
+ }
+ kfree(xspath);
+ }
return 0;
+out:
+ return err;
}

static const struct xenbus_device_id xen_blkbk_ids[] = {
--
1.8.3.1

2015-02-15 08:20:27

by Bob Liu

[permalink] [raw]
Subject: [PATCH 08/10] xen/blkfront: negotiate hardware queue number with backend

The maximum number of hardware queues for xen/blkfront is num_online_cpus(), or
the value set via a module parameter, while the number xen/blkback supports is
advertised through xenstore ("multi-queue-max-queues").
The negotiated number is the smaller of the two and is written back to
xen/blkback as "multi-queue-num-queues".

Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 71 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d551be0..32caf85 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -99,6 +99,10 @@ static unsigned int xen_blkif_max_segments = 32;
module_param_named(max, xen_blkif_max_segments, int, S_IRUGO);
MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default is 32)");

+static unsigned int xenblkif_max_queues;
+module_param_named(max_queues, xenblkif_max_queues, uint, 0644);
+MODULE_PARM_DESC(max_queues, "Maximum number of hardware queues per virtual disk");
+
#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)

/*
@@ -677,7 +681,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,

memset(&info->tag_set, 0, sizeof(info->tag_set));
info->tag_set.ops = &blkfront_mq_ops;
- info->tag_set.nr_hw_queues = 1;
+ info->tag_set.nr_hw_queues = info->nr_rings;
info->tag_set.queue_depth = BLK_RING_SIZE;
info->tag_set.numa_node = NUMA_NO_NODE;
info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
@@ -1338,6 +1342,8 @@ static int talk_to_blkback(struct xenbus_device *dev,
struct xenbus_transaction xbt;
int err, i;
struct blkfront_ring_info *rinfo;
+ char *path;
+ size_t pathsize;

for (i = 0; i < info->nr_rings; i++) {
rinfo = &info->rinfo[i];
@@ -1354,6 +1360,13 @@ again:
goto out;
}

+ /* Write the number of queues */
+ err = xenbus_printf(xbt, dev->nodename, "multi-queue-num-queues", "%u", info->nr_rings);
+ if (err) {
+ message = "writing multi-queue-num-queues";
+ goto abort_transaction;
+ }
+
if (info->nr_rings == 1) {
rinfo = &info->rinfo[0];
err = xenbus_printf(xbt, dev->nodename,
@@ -1369,8 +1382,33 @@ again:
goto abort_transaction;
}
} else {
- /* Not supported at this stage */
- goto abort_transaction;
+ pathsize = strlen(dev->nodename) + 12;
+ path = kzalloc(pathsize, GFP_KERNEL);
+ if (!path) {
+ err = -ENOMEM;
+ message = "ENOMEM while writing ring references";
+ goto abort_transaction;
+ }
+ for (i = 0; i < info->nr_rings; i++) {
+ memset(path, 0, pathsize);
+ snprintf(path, pathsize, "%s/queue-%u", dev->nodename, i);
+
+ err = xenbus_printf(xbt, path,
+ "ring-ref", "%u", info->rinfo[i].ring_ref);
+ if (err) {
+ message = "writing ring-ref";
+ kfree(path);
+ goto abort_transaction;
+ }
+ err = xenbus_printf(xbt, path,
+ "event-channel", "%u", info->rinfo[i].evtchn);
+ if (err) {
+ message = "writing event-channel";
+ kfree(path);
+ goto abort_transaction;
+ }
+ }
+ kfree(path);
}
err = xenbus_printf(xbt, dev->nodename, "protocol", "%s",
XEN_IO_PROTO_ABI_NATIVE);
@@ -1420,6 +1458,7 @@ static int blkfront_probe(struct xenbus_device *dev,
int err, vdevice, i, rindex;
struct blkfront_info *info;
struct blkfront_ring_info *rinfo;
+ unsigned int max_queues = 0;

/* FIXME: Use dynamic device id if this is not set. */
err = xenbus_scanf(XBT_NIL, dev->nodename,
@@ -1473,7 +1512,14 @@ static int blkfront_probe(struct xenbus_device *dev,
info->vdevice = vdevice;
info->connected = BLKIF_STATE_DISCONNECTED;

- info->nr_rings = 1;
+ /* Check if backend supports multiple queues */
+ err = xenbus_scanf(XBT_NIL, info->xbdev->otherend,
+ "multi-queue-max-queues", "%u", &max_queues);
+ if (err < 0)
+ max_queues = 1;
+
+ info->nr_rings = min(max_queues, xenblkif_max_queues);
+ printk("xen/blkfront probe info->nr_rings:%d, backend support:%d\n", info->nr_rings, max_queues);
info->rinfo = kzalloc(sizeof(*rinfo) * info->nr_rings, GFP_KERNEL);
if (!info->rinfo) {
xenbus_dev_fatal(dev, -ENOMEM, "allocating ring_info structure");
@@ -1654,12 +1700,24 @@ static int blkif_recover(struct blkfront_info *info)
static int blkfront_resume(struct xenbus_device *dev)
{
struct blkfront_info *info = dev_get_drvdata(&dev->dev);
- int err;
+ int err = 0;
+ unsigned int max_queues = 0;

dev_dbg(&dev->dev, "blkfront_resume: %s\n", dev->nodename);

blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);

+ err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+ "multi-queue-max-queues", "%u", &max_queues, NULL);
+ if (err)
+ max_queues = 1;
+
+ if (info->nr_rings != min(max_queues, xenblkif_max_queues)) {
+ /* Resuming with a different number of hardware queues is not
+ * supported at this stage. */
+ return -1;
+ }
+
err = talk_to_blkback(dev, info);

/*
@@ -2165,6 +2223,9 @@ static int __init xlblk_init(void)
return -ENODEV;
}

+ /* Allow as many queues as there are CPUs, by default */
+ xenblkif_max_queues = num_online_cpus();
+
ret = xenbus_register_frontend(&blkfront_driver);
if (ret) {
unregister_blkdev(XENVBD_MAJOR, DEV_NAME);
--
1.8.3.1

2015-02-15 08:20:29

by Bob Liu

[permalink] [raw]
Subject: [PATCH 09/10] xen/blkback: get hardware queue number from blkfront

The backend advertises the maximum number of queues it supports via
"multi-queue-max-queues", and reads the negotiated value back from
"multi-queue-num-queues".
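
A condensed sketch of the backend side of the handshake (identifiers as in the
hunks below; error paths and cleanup are trimmed):

	/* xen_blkbk_probe(): advertise how many queues we can handle. */
	xenbus_printf(XBT_NIL, dev->nodename,
		      "multi-queue-max-queues", "%u", xenblk_max_queues);

	/* connect_ring(): read what the frontend settled on. */
	if (xenbus_scanf(XBT_NIL, dev->otherend, "multi-queue-num-queues",
			 "%u", &requested_num_queues) < 0)
		requested_num_queues = 1;	/* legacy single-ring frontend */

	/* Reject a buggy or malicious guest asking for too many. */
	if (requested_num_queues > xenblk_max_queues)
		return -EINVAL;			/* the hunk below returns -1 */

	be->blkif->nr_rings = requested_num_queues;
	if (xen_blkif_alloc_rings(be->blkif))
		return -ENOMEM;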

Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkback/blkback.c | 8 ++++++++
drivers/block/xen-blkback/xenbus.c | 36 ++++++++++++++++++++++++++++++------
2 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 0969e7e..34d72b0 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -80,6 +80,11 @@ module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
MODULE_PARM_DESC(max_persistent_grants,
"Maximum number of grants to map persistently");

+unsigned int xenblk_max_queues;
+module_param_named(max_queues, xenblk_max_queues, uint, 0644);
+MODULE_PARM_DESC(max_queues,
+ "Maximum number of hardware queues per virtual disk");
+
/*
* The LRU mechanism to clean the lists of persistent grants needs to
* be executed periodically. The time interval between consecutive executions
@@ -1390,6 +1395,9 @@ static int __init xen_blkif_init(void)
if (!xen_domain())
return -ENODEV;

+ /* Allow as many queues as there are CPUs, by default */
+ xenblk_max_queues = num_online_cpus();
+
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 93e5f38..c33d8c9 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -21,6 +21,8 @@
#include <xen/grant_table.h>
#include "common.h"

+extern unsigned int xenblk_max_queues;
+
struct backend_info {
struct xenbus_device *dev;
struct xen_blkif *blkif;
@@ -225,12 +227,6 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
atomic_set(&blkif->drain, 0);
INIT_WORK(&blkif->free_work, xen_blkif_deferred_free);

- blkif->nr_rings = 1;
- if (xen_blkif_alloc_rings(blkif)) {
- kmem_cache_free(xen_blkif_cachep, blkif);
- return ERR_PTR(-ENOMEM);
- }
-
return blkif;
}

@@ -647,6 +643,14 @@ static int xen_blkbk_probe(struct xenbus_device *dev,
goto fail;
}

+ /* Multi-queue support: This is an optional feature. */
+ err = xenbus_printf(XBT_NIL, dev->nodename,
+ "multi-queue-max-queues", "%u", xenblk_max_queues);
+ if (err) {
+ pr_debug("Error writing multi-queue-num-queues\n");
+ goto fail;
+ }
+
/* setup back pointer */
be->blkif->be = be;

@@ -927,6 +931,7 @@ static int connect_ring(struct backend_info *be)
char *xspath;
size_t xspathsize;
const size_t xenstore_path_ext_size = 11; /* sufficient for "/queue-NNN" */
+ unsigned int requested_num_queues = 0;

DPRINTK("%s", dev->otherend);

@@ -954,6 +959,25 @@ static int connect_ring(struct backend_info *be)
be->blkif->vbd.feature_gnt_persistent = pers_grants;
be->blkif->vbd.overflow_max_grants = 0;

+ /*
+ * Read the number of hardware queues from the frontend.
+ */
+ err = xenbus_scanf(XBT_NIL, dev->otherend, "multi-queue-num-queues", "%u", &requested_num_queues);
+ if (err < 0) {
+ requested_num_queues = 1;
+ } else {
+ if (requested_num_queues > xenblk_max_queues) {
+ /* buggy or malicious guest */
+ xenbus_dev_fatal(dev, err,
+ "guest requested %u queues, exceeding the maximum of %u.",
+ requested_num_queues, xenblk_max_queues);
+ return -1;
+ }
+ }
+ be->blkif->nr_rings = requested_num_queues;
+ if (xen_blkif_alloc_rings(be->blkif))
+ return -ENOMEM;
+
if (be->blkif->nr_rings == 1) {
err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu",
&ring_ref, "event-channel", "%u", &evtchn, NULL);
--
1.8.3.1

2015-02-15 08:20:38

by Bob Liu

[permalink] [raw]
Subject: [PATCH 10/10] xen/blkfront: use work queue to fast blkif interrupt return

Move the request-completion logic out of blkif_interrupt() into a work queue.
After that, 'spin_lock_irq' can be replaced with 'spin_lock', so that interrupts
are no longer disabled for long stretches in blk_mq_queue_rq().
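
The interrupt handler is then reduced to scheduling the work item, roughly as
below (condensed from the hunks that follow):

	static irqreturn_t blkif_interrupt(int irq, void *dev_id)
	{
		struct blkfront_ring_info *rinfo = dev_id;

		if (unlikely(rinfo->info->connected != BLKIF_STATE_CONNECTED))
			return IRQ_HANDLED;

		/* Response processing now runs in process context ... */
		schedule_work(&rinfo->done_work);
		return IRQ_HANDLED;
	}

	/* ... so blkif_done_req() and blk_mq_queue_rq() only need the plain,
	 * non irq-disabling spin_lock(&rinfo->io_lock). */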

No more warnings like this:
INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 0,
t=15002 jiffies, g=1018, c=1017, q=0)
Task dump for CPU 7:
swapper/7 R running task 0 0 1 0x00080000
ffff88028f4edf50 0000000000000086 ffff88028f4ee330 ffff880283df3e18
ffffffff8108836a 0000000183f75438 0000000000000040 000000000000df50
0000008bde2dd600 ffff88028f4ee330 0000000000000086 ffff880283f75038
Call Trace:
[<ffffffff8108836a>] ? __hrtimer_start_range_ns+0x269/0x27b
[<ffffffff8108838f>] ? hrtimer_start+0x13/0x15
[<ffffffff81085298>] ? rcu_eqs_enter+0x66/0x79
[<ffffffff81013847>] ? default_idle+0x9/0xd
[<ffffffff81013f2d>] ? arch_cpu_idle+0xa/0xc
[<ffffffff810746ad>] ? cpu_startup_entry+0x118/0x253
[<ffffffff81030f57>] ? start_secondary+0x12e/0x132

Signed-off-by: Bob Liu <[email protected]>
---
drivers/block/xen-blkfront.c | 47 ++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 32caf85..bdd9a15 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -116,6 +116,7 @@ struct blkfront_ring_info {
struct blkif_front_ring ring;
unsigned int evtchn, irq;
struct work_struct work;
+ struct work_struct done_work;
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_RING_SIZE];
struct list_head grants;
@@ -630,29 +631,29 @@ static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
int ret = BLK_MQ_RQ_QUEUE_OK;

blk_mq_start_request(qd->rq);
- spin_lock_irq(&rinfo->io_lock);
+ spin_lock(&rinfo->io_lock);
if (RING_FULL(&rinfo->ring)) {
- spin_unlock_irq(&rinfo->io_lock);
+ spin_unlock(&rinfo->io_lock);
blk_mq_stop_hw_queue(hctx);
ret = BLK_MQ_RQ_QUEUE_BUSY;
goto out;
}

if (blkif_request_flush_invalid(qd->rq, rinfo->info)) {
- spin_unlock_irq(&rinfo->io_lock);
+ spin_unlock(&rinfo->io_lock);
ret = BLK_MQ_RQ_QUEUE_ERROR;
goto out;
}

if (blkif_queue_request(qd->rq, rinfo)) {
- spin_unlock_irq(&rinfo->io_lock);
+ spin_unlock(&rinfo->io_lock);
blk_mq_stop_hw_queue(hctx);
ret = BLK_MQ_RQ_QUEUE_BUSY;
goto out;
}

flush_requests(rinfo);
- spin_unlock_irq(&rinfo->io_lock);
+ spin_unlock(&rinfo->io_lock);
out:
return ret;
}
@@ -937,6 +938,7 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)

/* Flush gnttab callback work. Must be done with no locks held. */
flush_work(&rinfo->work);
+ flush_work(&rinfo->done_work);
}

del_gendisk(info->gd);
@@ -955,15 +957,13 @@ static void xlvbd_release_gendisk(struct blkfront_info *info)

static void kick_pending_request_queues(struct blkfront_ring_info *rinfo)
{
- unsigned long flags;
-
- spin_lock_irqsave(&rinfo->io_lock, flags);
+ spin_lock(&rinfo->io_lock);
if (!RING_FULL(&rinfo->ring)) {
- spin_unlock_irqrestore(&rinfo->io_lock, flags);
+ spin_unlock(&rinfo->io_lock);
blk_mq_start_stopped_hw_queues(rinfo->info->rq, true);
return;
}
- spin_unlock_irqrestore(&rinfo->io_lock, flags);
+ spin_unlock(&rinfo->io_lock);
}

static void blkif_restart_queue(struct work_struct *work)
@@ -1070,6 +1070,7 @@ free_shadow:

/* Flush gnttab callback work. Must be done with no locks held. */
flush_work(&rinfo->work);
+ flush_work(&rinfo->done_work);

/* Free resources associated with old device channel. */
if (rinfo->ring_ref != GRANT_INVALID_REF) {
@@ -1168,19 +1169,15 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_ring_info *ri
}
}

-static irqreturn_t blkif_interrupt(int irq, void *dev_id)
+static void blkif_done_req(struct work_struct *work)
{
+ struct blkfront_ring_info *rinfo = container_of(work, struct blkfront_ring_info, done_work);
struct request *req;
struct blkif_response *bret;
RING_IDX i, rp;
- unsigned long flags;
- struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
struct blkfront_info *info = rinfo->info;

- if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
- return IRQ_HANDLED;
-
- spin_lock_irqsave(&rinfo->io_lock, flags);
+ spin_lock(&rinfo->io_lock);
again:
rp = rinfo->ring.sring->rsp_prod;
rmb(); /* Ensure we see queued responses up to 'rp'. */
@@ -1271,9 +1268,20 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
} else
rinfo->ring.sring->rsp_event = i + 1;

- spin_unlock_irqrestore(&rinfo->io_lock, flags);
- kick_pending_request_queues(rinfo);
+ if (!RING_FULL(&rinfo->ring))
+ blk_mq_start_stopped_hw_queues(rinfo->info->rq, true);
+ spin_unlock(&rinfo->io_lock);
+}
+
+static irqreturn_t blkif_interrupt(int irq, void *dev_id)
+{
+ struct blkfront_ring_info *rinfo = (struct blkfront_ring_info *)dev_id;
+ struct blkfront_info *info = rinfo->info;
+
+ if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
+ return IRQ_HANDLED;

+ schedule_work(&rinfo->done_work);
return IRQ_HANDLED;
}

@@ -1535,6 +1543,7 @@ static int blkfront_probe(struct xenbus_device *dev,
rinfo->persistent_gnts_c = 0;
rinfo->info = info;
INIT_WORK(&rinfo->work, blkif_restart_queue);
+ INIT_WORK(&rinfo->done_work, blkif_done_req);

for (i = 0; i < BLK_RING_SIZE; i++)
rinfo->shadow[i].req.u.rw.id = i+1;
--
1.8.3.1

2015-02-18 17:01:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 00/10] Multi-queue support for xen-block driver

On Sun, Feb 15, 2015 at 04:18:55PM +0800, Bob Liu wrote:
> History:
> It's based on the result of Arianna's internship for GNOME's Outreach Program
> for Women, in which she was mentored by Konrad Rzeszutek Wilk. I also worked on
> this patchset with her at that time, and now fully take over this task.
> I've got her authorization to "change authorship or SoB to the patches as you
> like."

The standard way to credit this original author is:

- if the patch is unchanged just keep her as the From: and
Signed-off-by.
- if you add small changes add your Signed-off-by to hers, and
a note like [bob: fix foobarbaz] before it.
- if you substantially rewrite a patch add it as From and
Signed-off-by you, but add a note to the patch text that mentions
the original work it is based on.

2015-02-18 17:02:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 02/10] xen/blkfront: drop legacy block layer support

On Sun, Feb 15, 2015 at 04:18:57PM +0800, Bob Liu wrote:
> As Christoph suggested, remove the legacy block layer support, similar to
> most drivers already converted (virtio, mtip, and nvme).

Please merge this into the previous patch.

2015-02-18 17:05:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/10] xen/blkfront: reorg info->io_lock after using blk-mq API

On Sun, Feb 15, 2015 at 04:18:58PM +0800, Bob Liu wrote:
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 3589436..5a90a51 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -614,25 +614,28 @@ static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
> blk_mq_start_request(qd->rq);
> spin_lock_irq(&info->io_lock);
> if (RING_FULL(&info->ring)) {
> + spin_unlock_irq(&info->io_lock);
> blk_mq_stop_hw_queue(hctx);
> ret = BLK_MQ_RQ_QUEUE_BUSY;
> goto out;
> }
>
> if (blkif_request_flush_invalid(qd->rq, info)) {
> + spin_unlock_irq(&info->io_lock);
> ret = BLK_MQ_RQ_QUEUE_ERROR;
> goto out;
> }
>
> if (blkif_queue_request(qd->rq)) {
> + spin_unlock_irq(&info->io_lock);
> blk_mq_stop_hw_queue(hctx);
> ret = BLK_MQ_RQ_QUEUE_BUSY;
> goto out;
> }
>
> flush_requests(info);
> -out:
> spin_unlock_irq(&info->io_lock);
> +out:
> return ret;
> }

I'd rather write the function something like:

spin_lock_irq(&info->io_lock);
if (RING_FULL(&info->ring))
goto out_busy;
if (blkif_request_flush_invalid(qd->rq, info))
goto out_error;

if (blkif_queue_request(qd->rq))
goto out_busy;

flush_requests(info);
spin_unlock_irq(&info->io_lock);
return BLK_MQ_RQ_QUEUE_OK;
out_error:
spin_unlock_irq(&info->io_lock);
return BLK_MQ_RQ_QUEUE_ERROR;
out_busy:
spin_unlock_irq(&info->io_lock);
blk_mq_stop_hw_queue(hctx);
return BLK_MQ_RQ_QUEUE_BUSY;
}

Also this really should be merged into the first patch.

2015-02-18 17:33:07

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

On 15/02/15 at 9.18, Bob Liu wrote:
> A ring is the representation of a hardware queue, this patch separate ring
> information from blkfront_info to an new struct blkfront_ring_info to make
> preparation for real multi hardware queues supporting.
>
> Signed-off-by: Arianna Avanzini <[email protected]>
> Signed-off-by: Bob Liu <[email protected]>
> ---
> drivers/block/xen-blkfront.c | 403 +++++++++++++++++++++++--------------------
> 1 file changed, 218 insertions(+), 185 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 5a90a51..aaa4a0e 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -102,23 +102,15 @@ MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default
> #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
>
> /*
> - * We have one of these per vbd, whether ide, scsi or 'other'. They
> - * hang in private_data off the gendisk structure. We may end up
> - * putting all kinds of interesting stuff here :-)
> + * Per-ring info.
> + * A blkfront_info structure can associate with one or more blkfront_ring_info,
> + * depending on how many hardware queues supported.
> */
> -struct blkfront_info
> -{
> +struct blkfront_ring_info {
> spinlock_t io_lock;
> - struct mutex mutex;
> - struct xenbus_device *xbdev;
> - struct gendisk *gd;
> - int vdevice;
> - blkif_vdev_t handle;
> - enum blkif_state connected;
> int ring_ref;
> struct blkif_front_ring ring;
> unsigned int evtchn, irq;
> - struct request_queue *rq;
> struct work_struct work;
> struct gnttab_free_callback callback;
> struct blk_shadow shadow[BLK_RING_SIZE];
> @@ -126,6 +118,22 @@ struct blkfront_info
> struct list_head indirect_pages;
> unsigned int persistent_gnts_c;
> unsigned long shadow_free;
> + struct blkfront_info *info;

AFAICT you seem to have a list of persistent grants, indirect pages and
a grant table callback for each ring, isn't this supposed to be shared
between all rings?

I don't think we should be going down that route, or else we can hoard a
large amount of memory and grants.

2015-02-18 17:38:24

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
> On 15/02/15 at 9.18, Bob Liu wrote:
> > A ring is the representation of a hardware queue, this patch separate ring
> > information from blkfront_info to an new struct blkfront_ring_info to make
> > preparation for real multi hardware queues supporting.
> >
> > Signed-off-by: Arianna Avanzini <[email protected]>
> > Signed-off-by: Bob Liu <[email protected]>
> > ---
> > drivers/block/xen-blkfront.c | 403 +++++++++++++++++++++++--------------------
> > 1 file changed, 218 insertions(+), 185 deletions(-)
> >
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index 5a90a51..aaa4a0e 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -102,23 +102,15 @@ MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default
> > #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)
> >
> > /*
> > - * We have one of these per vbd, whether ide, scsi or 'other'. They
> > - * hang in private_data off the gendisk structure. We may end up
> > - * putting all kinds of interesting stuff here :-)
> > + * Per-ring info.
> > + * A blkfront_info structure can associate with one or more blkfront_ring_info,
> > + * depending on how many hardware queues supported.
> > */
> > -struct blkfront_info
> > -{
> > +struct blkfront_ring_info {
> > spinlock_t io_lock;
> > - struct mutex mutex;
> > - struct xenbus_device *xbdev;
> > - struct gendisk *gd;
> > - int vdevice;
> > - blkif_vdev_t handle;
> > - enum blkif_state connected;
> > int ring_ref;
> > struct blkif_front_ring ring;
> > unsigned int evtchn, irq;
> > - struct request_queue *rq;
> > struct work_struct work;
> > struct gnttab_free_callback callback;
> > struct blk_shadow shadow[BLK_RING_SIZE];
> > @@ -126,6 +118,22 @@ struct blkfront_info
> > struct list_head indirect_pages;
> > unsigned int persistent_gnts_c;
> > unsigned long shadow_free;
> > + struct blkfront_info *info;
>
> AFAICT you seem to have a list of persistent grants, indirect pages and
> a grant table callback for each ring, isn't this supposed to be shared
> between all rings?
>
> I don't think we should be going down that route, or else we can hoard a
> large amount of memory and grants.

It does remove the lock that would have to be accessed by each
ring thread to access those. Those values (grants) can be limited to be a smaller
value such that the overall number is the same as it was with the previous
version. As in: each ring has = MAX_GRANTS / nr_online_cpus().
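
In code terms that would be something like the hypothetical line below (neither
field name is from the posted patches; nr_rings defaults to the number of
online vCPUs):

	ring->max_pgrants = xen_blkif_max_pgrants / blkif->nr_rings;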
>

2015-02-18 18:08:12

by Felipe Franciosi

[permalink] [raw]
Subject: RE: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

> -----Original Message-----
> From: Konrad Rzeszutek Wilk [mailto:[email protected]]
> Sent: 18 February 2015 17:38
> To: Roger Pau Monne
> Cc: Bob Liu; [email protected]; David Vrabel; linux-
> [email protected]; Felipe Franciosi; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new
> struct
>
> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
> > On 15/02/15 at 9.18, Bob Liu wrote:
> > > A ring is the representation of a hardware queue, this patch
> > > separate ring information from blkfront_info to an new struct
> > > blkfront_ring_info to make preparation for real multi hardware queues
> supporting.
> > >
> > > Signed-off-by: Arianna Avanzini <[email protected]>
> > > Signed-off-by: Bob Liu <[email protected]>
> > > ---
> > > drivers/block/xen-blkfront.c | 403
> > > +++++++++++++++++++++++--------------------
> > > 1 file changed, 218 insertions(+), 185 deletions(-)
> > >
> > > diff --git a/drivers/block/xen-blkfront.c
> > > b/drivers/block/xen-blkfront.c index 5a90a51..aaa4a0e 100644
> > > --- a/drivers/block/xen-blkfront.c
> > > +++ b/drivers/block/xen-blkfront.c
> > > @@ -102,23 +102,15 @@ MODULE_PARM_DESC(max, "Maximum amount
> of
> > > segments in indirect requests (default #define BLK_RING_SIZE
> > > __CONST_RING_SIZE(blkif, PAGE_SIZE)
> > >
> > > /*
> > > - * We have one of these per vbd, whether ide, scsi or 'other'.
> > > They
> > > - * hang in private_data off the gendisk structure. We may end up
> > > - * putting all kinds of interesting stuff here :-)
> > > + * Per-ring info.
> > > + * A blkfront_info structure can associate with one or more
> > > + blkfront_ring_info,
> > > + * depending on how many hardware queues supported.
> > > */
> > > -struct blkfront_info
> > > -{
> > > +struct blkfront_ring_info {
> > > spinlock_t io_lock;
> > > - struct mutex mutex;
> > > - struct xenbus_device *xbdev;
> > > - struct gendisk *gd;
> > > - int vdevice;
> > > - blkif_vdev_t handle;
> > > - enum blkif_state connected;
> > > int ring_ref;
> > > struct blkif_front_ring ring;
> > > unsigned int evtchn, irq;
> > > - struct request_queue *rq;
> > > struct work_struct work;
> > > struct gnttab_free_callback callback;
> > > struct blk_shadow shadow[BLK_RING_SIZE]; @@ -126,6 +118,22 @@
> > > struct blkfront_info
> > > struct list_head indirect_pages;
> > > unsigned int persistent_gnts_c;
> > > unsigned long shadow_free;
> > > + struct blkfront_info *info;
> >
> > AFAICT you seem to have a list of persistent grants, indirect pages
> > and a grant table callback for each ring, isn't this supposed to be
> > shared between all rings?
> >
> > I don't think we should be going down that route, or else we can hoard
> > a large amount of memory and grants.
>
> It does remove the lock that would have to be accessed by each ring thread to
> access those. Those values (grants) can be limited to be a smaller value such
> that the overall number is the same as it was with the previous version. As in:
> each ring has = MAX_GRANTS / nr_online_cpus().
> >

We should definitely be concerned with the amount of memory consumed on the backend for each plugged virtual disk. We have faced several problems in XenServer around this area before; it drastically affects VBD scalability per host.

This makes me think that all the persistent grants work was done as a workaround while we were facing several performance problems around concurrent grant un/mapping operations. Given all the recent submissions made around this (grant ops) area, is this something we should perhaps revisit and discuss whether we want to continue offering persistent grants as a feature?

Thanks,
Felipe

2015-02-18 18:22:20

by Felipe Franciosi

[permalink] [raw]
Subject: RE: [RFC PATCH 00/10] Multi-queue support for xen-block driver



> -----Original Message-----
> From: Bob Liu [mailto:[email protected]]
> Sent: 15 February 2015 08:19
> To: [email protected]
> Cc: David Vrabel; [email protected]; Roger Pau Monne;
> [email protected]; Felipe Franciosi; [email protected]; [email protected];
> [email protected]; Bob Liu
> Subject: [RFC PATCH 00/10] Multi-queue support for xen-block driver
>
> This patchset convert the Xen PV block driver to the multi-queue block layer API
> by sharing and using multiple I/O rings between the frontend and backend.
>
> History:
> It's based on the result of Arianna's internship for GNOME's Outreach Program
> for Women, in which she was mentored by Konrad Rzeszutek Wilk. I also
> worked on this patchset with her at that time, and now fully take over this task.
> I've got her authorization to "change authorship or SoB to the patches as you
> like."
>
> A few words on block multi-queue layer:
> Multi-queue block layer improved block scalability a lot by split single request
> queue to per-processor software queues and hardware dispatch queues. The
> linux blk-mq API will handle software queues, while specific block driver must
> deal with hardware queues.

IIUC, the main motivation around the blk-mq work was around locking issues on a block device's request queue when accessed concurrently from different NUMA nodes. I believe we are not stressing enough on the main benefit of taking such approach on Xen.

Many modern storage systems (e.g. NVMe devices) will respond much better (especially when it comes to IOPS) to a high number of outstanding requests. That can be achieved by having a single thread sustaining a high IO depth _and/or_ several different threads issuing requests at the same time. The former approach is often limited by CPU capacity; that is, we can suffer from only being able to handle so many interrupts being delivered to the (v)CPU that the single thread is running on (also simply observable by 'top' showing the thread smoking at 100%). The latter approach is more flexible, given that many threads can run over several different (v)CPUs. I have a lot of data around this topic and am happy to share if people are interested.

We can therefore use the multi-queue block layer in a guest to have more than one request queue associated with block front. These can be mapped over several rings to the backend, making it very easy for us to run multiple threads on the backend for a single virtual disk. I believe this is why Bob is seeing massive improvements when running 'fio' in a guest with an increased number of jobs.
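
Concretely, the mapping the series establishes looks roughly like this (sketch
based on the patches earlier in the thread; the hctx-to-ring lookup shown here
is only illustrative):

	/* one blk-mq hardware context per shared I/O ring */
	info->tag_set.nr_hw_queues = info->nr_rings;

	/* inside .queue_rq, each hardware context drives its own ring,
	 * event channel and (on the backend) its own xenblkd kthread */
	struct blkfront_ring_info *rinfo = &info->rinfo[hctx->queue_num];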

In my opinion, this motivation should be highlighted behind the blk-mq adoption by Xen.

Thanks,
Felipe

>
> The xen/block implementation:
> 1) Convert to blk-mq api with only one hardware queue.
> 2) Use more rings to act as multi hardware queues.
> 3) Negotiate number of hardware queues, the same as xen-net driver. The
> backend notify "multi-queue-max-queues" to frontend, then the front write
> back final number to "multi-queue-num-queues".
>
> Test result:
> fio's IOmeter emulation on a 16 cpus domU with a null_blk device, hardware
> queue number was 16.
> nr_fio_jobs IOPS(before) IOPS(after) Diff
> 1 57k 58k 0%
> 4 95k 201k +210%
> 8 89k 372k +410%
> 16 68k 284k +410%
> 32 65k 196k +300%
> 64 63k 183k +290%
>
> More results are coming, there was also big improvement on both write-IOPS
> and latency.
>
> Any comments or suggestions are welcome.
> Thank you,
> -Bob Liu
>
> Bob Liu (10):
> xen/blkfront: convert to blk-mq API
> xen/blkfront: drop legacy block layer support
> xen/blkfront: reorg info->io_lock after using blk-mq API
> xen/blkfront: separate ring information to an new struct
> xen/blkback: separate ring information out of struct xen_blkif
> xen/blkfront: pseudo support for multi hardware queues
> xen/blkback: pseudo support for multi hardware queues
> xen/blkfront: negotiate hardware queue number with backend
> xen/blkback: get hardware queue number from blkfront
> xen/blkfront: use work queue to fast blkif interrupt return
>
> drivers/block/xen-blkback/blkback.c | 370 ++++++++------- drivers/block/xen-
> blkback/common.h | 54 ++- drivers/block/xen-blkback/xenbus.c | 415
> +++++++++++------
> drivers/block/xen-blkfront.c | 894 +++++++++++++++++++++---------------
> 4 files changed, 1018 insertions(+), 715 deletions(-)
>
> --
> 1.8.3.1

2015-02-18 18:30:00

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

> > > AFAICT you seem to have a list of persistent grants, indirect pages
> > > and a grant table callback for each ring, isn't this supposed to be
> > > shared between all rings?
> > >
> > > I don't think we should be going down that route, or else we can hoard
> > > a large amount of memory and grants.
> >
> > It does remove the lock that would have to be accessed by each ring thread to
> > access those. Those values (grants) can be limited to be a smaller value such
> > that the overall number is the same as it was with the previous version. As in:
> > each ring has = MAX_GRANTS / nr_online_cpus().
> > >
>
> We should definitely be concerned with the amount of memory consumed on the backend for each plugged virtual disk. We have faced several problems in XenServer around this area before; it drastically affects VBD scalability per host.
>
> This makes me think that all the persistent grants work was done as a workaround while we were facing several performance problems around concurrent grant un/mapping operations. Given all the recent submissions made around this (grant ops) area, is this something we should perhaps revisit and discuss whether we want to continue offering persistent grants as a feature?
>

Certainly. Perhaps as a talking point at XenHackathon?

> Thanks,
> Felipe

2015-02-19 02:05:32

by Bob Liu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/10] Multi-queue support for xen-block driver


On 02/19/2015 02:22 AM, Felipe Franciosi wrote:
>
>
>> -----Original Message-----
>> From: Bob Liu [mailto:[email protected]]
>> Sent: 15 February 2015 08:19
>> To: [email protected]
>> Cc: David Vrabel; [email protected]; Roger Pau Monne;
>> [email protected]; Felipe Franciosi; [email protected]; [email protected];
>> [email protected]; Bob Liu
>> Subject: [RFC PATCH 00/10] Multi-queue support for xen-block driver
>>
>> This patchset convert the Xen PV block driver to the multi-queue block layer API
>> by sharing and using multiple I/O rings between the frontend and backend.
>>
>> History:
>> It's based on the result of Arianna's internship for GNOME's Outreach Program
>> for Women, in which she was mentored by Konrad Rzeszutek Wilk. I also
>> worked on this patchset with her at that time, and now fully take over this task.
>> I've got her authorization to "change authorship or SoB to the patches as you
>> like."
>>
>> A few words on block multi-queue layer:
>> Multi-queue block layer improved block scalability a lot by split single request
>> queue to per-processor software queues and hardware dispatch queues. The
>> linux blk-mq API will handle software queues, while specific block driver must
>> deal with hardware queues.
>
> IIUC, the main motivation around the blk-mq work was around locking issues on a block device's request queue when accessed concurrently from different NUMA nodes. I believe we are not stressing enough on the main benefit of taking such approach on Xen.
>
> Many modern storage systems (e.g. NVMe devices) will respond much better (especially when it comes to IOPS) to a high number of outstanding requests. That can be achieved by having a single thread sustaining a high IO depth _and/or_ several different threads issuing requests at the same time. The former approach is often limited by CPU capacity; that is, we can suffer from only being able to handle so many interrupts being delivered to the (v)CPU that the single thread is running on (also simply observable by 'top' showing the thread smoking at 100%). The latter approach is more flexible, given that many threads can run over several different (v)CPUs. I have a lot of data around this topic and am happy to share if people are interested.
>
> We can therefore use the multi-queue block layer in a guest to have more than one request queue associated with block front. These can be mapped over several rings to the backend, making it very easy for us to run multiple threads on the backend for a single virtual disk. I believe this is why Bob is seeing massive improvements when running 'fio' in a guest with an increased number of jobs.
>

Yes, exactly. I will add this information to the commit log.

Thanks,
-Bob

> In my opinion, this motivation should be highlighted behind the blk-mq adoption by Xen.
>
> Thanks,
> Felipe
>
>>
>> The xen/block implementation:
>> 1) Convert to blk-mq api with only one hardware queue.
>> 2) Use more rings to act as multi hardware queues.
>> 3) Negotiate number of hardware queues, the same as xen-net driver. The
>> backend notify "multi-queue-max-queues" to frontend, then the front write
>> back final number to "multi-queue-num-queues".
>>
>> Test result:
>> fio's IOmeter emulation on a 16 cpus domU with a null_blk device, hardware
>> queue number was 16.
>> nr_fio_jobs IOPS(before) IOPS(after) Diff
>> 1 57k 58k 0%
>> 4 95k 201k +210%
>> 8 89k 372k +410%
>> 16 68k 284k +410%
>> 32 65k 196k +300%
>> 64 63k 183k +290%
>>
>> More results are coming, there was also big improvement on both write-IOPS
>> and latency.
>>
>> Any comments or suggestions are welcome.
>> Thank you,
>> -Bob Liu
>>
>> Bob Liu (10):
>> xen/blkfront: convert to blk-mq API
>> xen/blkfront: drop legacy block layer support
>> xen/blkfront: reorg info->io_lock after using blk-mq API
>> xen/blkfront: separate ring information to an new struct
>> xen/blkback: separate ring information out of struct xen_blkif
>> xen/blkfront: pseudo support for multi hardware queues
>> xen/blkback: pseudo support for multi hardware queues
>> xen/blkfront: negotiate hardware queue number with backend
>> xen/blkback: get hardware queue number from blkfront
>> xen/blkfront: use work queue to fast blkif interrupt return
>>
>> drivers/block/xen-blkback/blkback.c | 370 ++++++++------- drivers/block/xen-
>> blkback/common.h | 54 ++- drivers/block/xen-blkback/xenbus.c | 415
>> +++++++++++------
>> drivers/block/xen-blkfront.c | 894 +++++++++++++++++++++---------------
>> 4 files changed, 1018 insertions(+), 715 deletions(-)
>>
>> --
>> 1.8.3.1

2015-02-19 02:05:40

by Bob Liu

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct



On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
>> -----Original Message-----
>> From: Konrad Rzeszutek Wilk [mailto:[email protected]]
>> Sent: 18 February 2015 17:38
>> To: Roger Pau Monne
>> Cc: Bob Liu; [email protected]; David Vrabel; linux-
>> [email protected]; Felipe Franciosi; [email protected]; [email protected];
>> [email protected]
>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new
>> struct
>>
>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
>>> On 15/02/15 at 9.18, Bob Liu wrote:
>>>> A ring is the representation of a hardware queue, this patch
>>>> separate ring information from blkfront_info to an new struct
>>>> blkfront_ring_info to make preparation for real multi hardware queues
>> supporting.
>>>>
>>>> Signed-off-by: Arianna Avanzini <[email protected]>
>>>> Signed-off-by: Bob Liu <[email protected]>
>>>> ---
>>>> drivers/block/xen-blkfront.c | 403
>>>> +++++++++++++++++++++++--------------------
>>>> 1 file changed, 218 insertions(+), 185 deletions(-)
>>>>
>>>> diff --git a/drivers/block/xen-blkfront.c
>>>> b/drivers/block/xen-blkfront.c index 5a90a51..aaa4a0e 100644
>>>> --- a/drivers/block/xen-blkfront.c
>>>> +++ b/drivers/block/xen-blkfront.c
>>>> @@ -102,23 +102,15 @@ MODULE_PARM_DESC(max, "Maximum amount
>> of
>>>> segments in indirect requests (default #define BLK_RING_SIZE
>>>> __CONST_RING_SIZE(blkif, PAGE_SIZE)
>>>>
>>>> /*
>>>> - * We have one of these per vbd, whether ide, scsi or 'other'.
>>>> They
>>>> - * hang in private_data off the gendisk structure. We may end up
>>>> - * putting all kinds of interesting stuff here :-)
>>>> + * Per-ring info.
>>>> + * A blkfront_info structure can associate with one or more
>>>> + blkfront_ring_info,
>>>> + * depending on how many hardware queues supported.
>>>> */
>>>> -struct blkfront_info
>>>> -{
>>>> +struct blkfront_ring_info {
>>>> spinlock_t io_lock;
>>>> - struct mutex mutex;
>>>> - struct xenbus_device *xbdev;
>>>> - struct gendisk *gd;
>>>> - int vdevice;
>>>> - blkif_vdev_t handle;
>>>> - enum blkif_state connected;
>>>> int ring_ref;
>>>> struct blkif_front_ring ring;
>>>> unsigned int evtchn, irq;
>>>> - struct request_queue *rq;
>>>> struct work_struct work;
>>>> struct gnttab_free_callback callback;
>>>> struct blk_shadow shadow[BLK_RING_SIZE]; @@ -126,6 +118,22 @@
>>>> struct blkfront_info
>>>> struct list_head indirect_pages;
>>>> unsigned int persistent_gnts_c;
>>>> unsigned long shadow_free;
>>>> + struct blkfront_info *info;
>>>
>>> AFAICT you seem to have a list of persistent grants, indirect pages
>>> and a grant table callback for each ring, isn't this supposed to be
>>> shared between all rings?
>>>
>>> I don't think we should be going down that route, or else we can hoard
>>> a large amount of memory and grants.
>>
>> It does remove the lock that would have to be accessed by each ring thread to
>> access those. Those values (grants) can be limited to be a smaller value such
>> that the overall number is the same as it was with the previous version. As in:
>> each ring has = MAX_GRANTS / nr_online_cpus().
>>>
>
> We should definitely be concerned with the amount of memory consumed on the backend for each plugged virtual disk. We have faced several problems in XenServer around this area before; it drastically affects VBD scalability per host.
>

Right, so we have to keep both the lock and the amount of memory
consumed in mind.
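
For illustration, a per-ring cap along the lines Konrad suggested could
look roughly like this (only a sketch; BLKIF_MAX_PERSISTENT_GNTS, nr_rings
and max_persistent_gnts are illustrative names, not from the posted
patches):

	/* Split the old single-ring grant budget evenly across rings. */
	static void blkfront_set_ring_grant_limits(struct blkfront_info *info)
	{
		unsigned int per_ring = BLKIF_MAX_PERSISTENT_GNTS / info->nr_rings;
		unsigned int i;

		for (i = 0; i < info->nr_rings; i++)
			info->rinfo[i].max_persistent_gnts = per_ring;
	}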

> This makes me think that all the persistent grants work was done as a workaround while we were facing several performance problems around concurrent grant un/mapping operations. Given all the recent submissions made around this (grant ops) area, is this something we should perhaps revisit and discuss whether we want to continue offering persistent grants as a feature?
>

Agree, life would be easier if we could remove the persistent feature.

--
Regards,
-Bob

2015-02-19 02:08:27

by Bob Liu

[permalink] [raw]
Subject: Re: [PATCH 03/10] xen/blkfront: reorg info->io_lock after using blk-mq API


On 02/19/2015 01:05 AM, Christoph Hellwig wrote:
> On Sun, Feb 15, 2015 at 04:18:58PM +0800, Bob Liu wrote:
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 3589436..5a90a51 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -614,25 +614,28 @@ static int blk_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
>> blk_mq_start_request(qd->rq);
>> spin_lock_irq(&info->io_lock);
>> if (RING_FULL(&info->ring)) {
>> + spin_unlock_irq(&info->io_lock);
>> blk_mq_stop_hw_queue(hctx);
>> ret = BLK_MQ_RQ_QUEUE_BUSY;
>> goto out;
>> }
>>
>> if (blkif_request_flush_invalid(qd->rq, info)) {
>> + spin_unlock_irq(&info->io_lock);
>> ret = BLK_MQ_RQ_QUEUE_ERROR;
>> goto out;
>> }
>>
>> if (blkif_queue_request(qd->rq)) {
>> + spin_unlock_irq(&info->io_lock);
>> blk_mq_stop_hw_queue(hctx);
>> ret = BLK_MQ_RQ_QUEUE_BUSY;
>> goto out;
>> }
>>
>> flush_requests(info);
>> -out:
>> spin_unlock_irq(&info->io_lock);
>> +out:
>> return ret;
>> }
>
> I'd rather write the function something like:
>
> spin_lock_irq(&info->io_lock);
> if (RING_FULL(&info->ring))
> goto out_busy;
> if (blkif_request_flush_invalid(qd->rq, info))
> goto out_error;
>
> if (blkif_queue_request(qd->rq))
> goto out_busy;
>
> flush_requests(info);
> spin_unlock_irq(&info->io_lock);
> return BLK_MQ_RQ_QUEUE_OK;
> out_error:
> spin_unlock_irq(&info->io_lock);
> return BLK_MQ_RQ_QUEUE_ERROR;
> out_busy:
> spin_unlock_irq(&info->io_lock);
> blk_mq_stop_hw_queue(hctx);
> return BLK_MQ_RQ_QUEUE_BUSY;
> }
>

Thank you! Will be updated.

> Also this really should be merged into the first patch.
>

I thought it would be easier for people to review if it was split into
three patches.
Anyway, I can merge them into the first one.

--
Regards,
-Bob

2015-02-19 11:09:05

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

On 19/02/15 at 3.05, Bob Liu wrote:
>
>
> On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
>>> -----Original Message-----
>>> From: Konrad Rzeszutek Wilk [mailto:[email protected]]
>>> Sent: 18 February 2015 17:38
>>> To: Roger Pau Monne
>>> Cc: Bob Liu; [email protected]; David Vrabel; linux-
>>> [email protected]; Felipe Franciosi; [email protected]; [email protected];
>>> [email protected]
>>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new
>>> struct
>>>
>>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
>>>> El 15/02/15 a les 9.18, Bob Liu ha escrit:
>>>> AFAICT you seem to have a list of persistent grants, indirect pages
>>>> and a grant table callback for each ring, isn't this supposed to be
>>>> shared between all rings?
>>>>
>>>> I don't think we should be going down that route, or else we can hoard
>>>> a large amount of memory and grants.
>>>
>>> It does remove the lock that would have to be accessed by each ring thread to
>>> access those. Those values (grants) can be limited to be a smaller value such
>>> that the overall number is the same as it was with the previous version. As in:
>>> each ring has = MAX_GRANTS / nr_online_cpus().
>>>>
>>
>> We should definitely be concerned with the amount of memory consumed on the backend for each plugged virtual disk. We have faced several problems in XenServer around this area before; it drastically affects VBD scalability per host.
>>
>
> Right, so we have to keep both the lock and the amount of memory
> consumed in mind.
>
>> This makes me think that all the persistent grants work was done as a workaround while we were facing several performance problems around concurrent grant un/mapping operations. Given all the recent submissions made around this (grant ops) area, is this something we should perhaps revisit and discuss whether we want to continue offering persistent grants as a feature?
>>
>
> Agree, Life would be easier if we can remove the persistent feature.

I was thinking about this yesterday, and IMHO we should remove
persistent grants now while they are not too entangled; leaving it for
later will just make our lives more miserable.

While it's true that persistent grants provide a throughput increase by
preventing grant table operations and TLB flushes, they have several
problems that cannot be avoided:

- Memory/grants hoarding: we need to reserve the same amount of memory
as the amount of data that we want to have in flight. While this is not
so critical for memory, it is for grants, since using too many grants
can basically deadlock other PV interfaces. There's no way to avoid this
since it's the design behind persistent grants.

- Memcpy: the guest needs to perform a memcpy of all data that goes
through blkfront. While not so critical, Felipe found systems where
memcpy was more expensive than grant map/unmap in the backend (IIRC
those were AMD systems).

- Complexity/interactions: when persistent grants were designed, the
number of requests was limited to 32 and each request could only contain
11 pages, so at most 352 pages/grants were needed per ring, which was
fine. Now that we have indirect I/O and multiqueue on the horizon this
number has gone up by orders of magnitude; I don't think this is
viable/useful any more.
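
As a rough back-of-envelope illustration (the indirect-segment and queue
counts below are assumed typical values, not numbers taken from this
series):

  classic requests:   32 requests x 11 segments             =     352 grants
  indirect requests:  32 requests x 256 segments            =   8,192 grants
  with 16 hw queues:  16 rings x 32 requests x 256 segments = 131,072 grants

and that is before counting the grants for the indirect descriptor pages
themselves.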

If Konrad/Bob agree I would like to send a patch to remove persistent
grants and then have the multiqueue series rebased on top of that.

Roger.

2015-02-19 11:14:43

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

On 19/02/15 11:08, Roger Pau Monné wrote:
> El 19/02/15 a les 3.05, Bob Liu ha escrit:
>>
>>
>> On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
>>>> -----Original Message-----
>>>> From: Konrad Rzeszutek Wilk [mailto:[email protected]]
>>>> Sent: 18 February 2015 17:38
>>>> To: Roger Pau Monne
>>>> Cc: Bob Liu; [email protected]; David Vrabel; linux-
>>>> [email protected]; Felipe Franciosi; [email protected]; [email protected];
>>>> [email protected]
>>>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new
>>>> struct
>>>>
>>>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
>>>>> El 15/02/15 a les 9.18, Bob Liu ha escrit:
>>>>> AFAICT you seem to have a list of persistent grants, indirect pages
>>>>> and a grant table callback for each ring, isn't this supposed to be
>>>>> shared between all rings?
>>>>>
>>>>> I don't think we should be going down that route, or else we can hoard
>>>>> a large amount of memory and grants.
>>>>
>>>> It does remove the lock that would have to be accessed by each ring thread to
>>>> access those. Those values (grants) can be limited to be a smaller value such
>>>> that the overall number is the same as it was with the previous version. As in:
>>>> each ring has = MAX_GRANTS / nr_online_cpus().
>>>>>
>>>
>>> We should definitely be concerned with the amount of memory consumed on the backend for each plugged virtual disk. We have faced several problems in XenServer around this area before; it drastically affects VBD scalability per host.
>>>
>>
>> Right, so we have to keep both the lock and the amount of memory
>> consumed in mind.
>>
>>> This makes me think that all the persistent grants work was done as a workaround while we were facing several performance problems around concurrent grant un/mapping operations. Given all the recent submissions made around this (grant ops) area, is this something we should perhaps revisit and discuss whether we want to continue offering persistent grants as a feature?
>>>
>>
>> Agree, Life would be easier if we can remove the persistent feature.
>
> I was thinking about this yesterday, and IMHO I think we should remove
> persistent grants now while it's not too entangled, leaving it for later
> will just make our life more miserable.
>
> While it's true that persistent grants provide a throughput increase by
> preventing grant table operations and TLB flushes, it has several
> problems that cannot by avoided:
>
> - Memory/grants hoarding, we need to reserve the same amount of memory
> as the amount of data that we want to have in-flight. While this is not
> so critical for memory, it is for grants, since using too many grants
> can basically deadlock other PV interfaces. There's no way to avoid this
> since it's the design behind persistent grants.
>
> - Memcopy: guest needs to perform a memcopy of all data that goes
> through blkfront. While not so critical, Felipe found systems were
> memcopy was more expensive than grant map/unmap in the backend (IIRC
> those were AMD systems).
>
> - Complexity/interactions: when persistent grants was designed number
> of requests was limited to 32 and each request could only contain 11
> pages. This means we had to use 352 pages/grants which was fine. Now
> that we have indirect IO and multiqueue in the horizon this number has
> gone up by orders of magnitude, I don't think this is viable/useful any
> more.
>
> If Konrad/Bob agree I would like to send a patch to remove persistent
> grants and then have the multiqueue series rebased on top of that.

I agree with this.

I think we can get better performance/scalability gains with
improvements to grant table locking and TLB flush avoidance.

David

2015-02-19 12:06:28

by Felipe Franciosi

[permalink] [raw]
Subject: RE: [PATCH 04/10] xen/blkfront: separate ring information to an new struct



> -----Original Message-----
> From: David Vrabel
> Sent: 19 February 2015 11:15
> To: Roger Pau Monne; Bob Liu; Felipe Franciosi
> Cc: 'Konrad Rzeszutek Wilk'; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new
> struct
>
> On 19/02/15 11:08, Roger Pau Monné wrote:
> > El 19/02/15 a les 3.05, Bob Liu ha escrit:
> >>
> >>
> >> On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
> >>>> -----Original Message-----
> >>>> From: Konrad Rzeszutek Wilk [mailto:[email protected]]
> >>>> Sent: 18 February 2015 17:38
> >>>> To: Roger Pau Monne
> >>>> Cc: Bob Liu; [email protected]; David Vrabel; linux-
> >>>> [email protected]; Felipe Franciosi; [email protected];
> >>>> [email protected]; [email protected]
> >>>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information
> >>>> to an new struct
> >>>>
> >>>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
> >>>>> El 15/02/15 a les 9.18, Bob Liu ha escrit:
> >>>>> AFAICT you seem to have a list of persistent grants, indirect
> >>>>> pages and a grant table callback for each ring, isn't this
> >>>>> supposed to be shared between all rings?
> >>>>>
> >>>>> I don't think we should be going down that route, or else we can
> >>>>> hoard a large amount of memory and grants.
> >>>>
> >>>> It does remove the lock that would have to be accessed by each ring
> >>>> thread to access those. Those values (grants) can be limited to be
> >>>> a smaller value such that the overall number is the same as it was with
> the previous version. As in:
> >>>> each ring has = MAX_GRANTS / nr_online_cpus().
> >>>>>
> >>>
> >>> We should definitely be concerned with the amount of memory consumed
> on the backend for each plugged virtual disk. We have faced several problems
> in XenServer around this area before; it drastically affects VBD scalability per
> host.
> >>>
> >>
> >> Right, so we have to keep both the lock and the amount of memory
> >> consumed in mind.
> >>
> >>> This makes me think that all the persistent grants work was done as a
> workaround while we were facing several performance problems around
> concurrent grant un/mapping operations. Given all the recent submissions
> made around this (grant ops) area, is this something we should perhaps revisit
> and discuss whether we want to continue offering persistent grants as a feature?
> >>>
> >>
> >> Agree, Life would be easier if we can remove the persistent feature.
> >
> > I was thinking about this yesterday, and IMHO I think we should remove
> > persistent grants now while it's not too entangled, leaving it for
> > later will just make our life more miserable.
> >
> > While it's true that persistent grants provide a throughput increase
> > by preventing grant table operations and TLB flushes, it has several
> > problems that cannot by avoided:
> >
> > - Memory/grants hoarding, we need to reserve the same amount of
> > memory as the amount of data that we want to have in-flight. While
> > this is not so critical for memory, it is for grants, since using too
> > many grants can basically deadlock other PV interfaces. There's no way
> > to avoid this since it's the design behind persistent grants.
> >
> > - Memcopy: guest needs to perform a memcopy of all data that goes
> > through blkfront. While not so critical, Felipe found systems were
> > memcopy was more expensive than grant map/unmap in the backend (IIRC
> > those were AMD systems).
> >
> > - Complexity/interactions: when persistent grants was designed number
> > of requests was limited to 32 and each request could only contain 11
> > pages. This means we had to use 352 pages/grants which was fine. Now
> > that we have indirect IO and multiqueue in the horizon this number has
> > gone up by orders of magnitude, I don't think this is viable/useful
> > any more.
> >
> > If Konrad/Bob agree I would like to send a patch to remove persistent
> > grants and then have the multiqueue series rebased on top of that.
>
> I agree with this.
>
> I think we can get better performance/scalability gains of with improvements
> to grant table locking and TLB flush avoidance.
>
> David

It doesn't change the fact that persistent grants (as well as the grant copy implementation we did for tapdisk3) were alternatives that allowed aggregate storage performance to increase drastically. Before committing to removing something that allows Xen users to scale their deployments, I think we need to revisit whether the recent improvements to the whole grant mechanism (grant table locking, TLB flushing, batched calls, etc.) are performing as we would (now) expect.

What I think should be done prior to committing to either direction is a proper performance assessment of grant mapping vs. persistent grants vs. grant copy for single and aggregate workloads. We need to test a meaningful set of host architectures, workloads and storage types. Last year at the XenDevelSummit, for example, we showed how grant copy scaled better than persistent grants at the cost of doing the copy on the back end.

I don't mean to propose tests that will delay innovation by weeks or months. However, it is very easy to find changes that improve this or that synthetic workload while ignoring the fact that they might damage several (possibly very realistic) others. I think this is the time to run performance tests objectively, without trying to dig too much into debugging, and go from there.

Felipe

2015-02-19 13:12:49

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

On 19/02/15 at 13.06, Felipe Franciosi wrote:
>
>
>> -----Original Message-----
>> From: David Vrabel
>> Sent: 19 February 2015 11:15
>> To: Roger Pau Monne; Bob Liu; Felipe Franciosi
>> Cc: 'Konrad Rzeszutek Wilk'; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> [email protected]
>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new
>> struct
>>
>> On 19/02/15 11:08, Roger Pau Monné wrote:
>>> El 19/02/15 a les 3.05, Bob Liu ha escrit:
>>>>
>>>>
>>>> On 02/19/2015 02:08 AM, Felipe Franciosi wrote:
>>>>>> -----Original Message-----
>>>>>> From: Konrad Rzeszutek Wilk [mailto:[email protected]]
>>>>>> Sent: 18 February 2015 17:38
>>>>>> To: Roger Pau Monne
>>>>>> Cc: Bob Liu; [email protected]; David Vrabel; linux-
>>>>>> [email protected]; Felipe Franciosi; [email protected];
>>>>>> [email protected]; [email protected]
>>>>>> Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information
>>>>>> to an new struct
>>>>>>
>>>>>> On Wed, Feb 18, 2015 at 06:28:49PM +0100, Roger Pau Monné wrote:
>>>>>>> El 15/02/15 a les 9.18, Bob Liu ha escrit:
>>>>>>> AFAICT you seem to have a list of persistent grants, indirect
>>>>>>> pages and a grant table callback for each ring, isn't this
>>>>>>> supposed to be shared between all rings?
>>>>>>>
>>>>>>> I don't think we should be going down that route, or else we can
>>>>>>> hoard a large amount of memory and grants.
>>>>>>
>>>>>> It does remove the lock that would have to be accessed by each ring
>>>>>> thread to access those. Those values (grants) can be limited to be
>>>>>> a smaller value such that the overall number is the same as it was with
>> the previous version. As in:
>>>>>> each ring has = MAX_GRANTS / nr_online_cpus().
>>>>>>>
>>>>>
>>>>> We should definitely be concerned with the amount of memory consumed
>> on the backend for each plugged virtual disk. We have faced several problems
>> in XenServer around this area before; it drastically affects VBD scalability per
>> host.
>>>>>
>>>>
>>>> Right, so we have to keep both the lock and the amount of memory
>>>> consumed in mind.
>>>>
>>>>> This makes me think that all the persistent grants work was done as a
>> workaround while we were facing several performance problems around
>> concurrent grant un/mapping operations. Given all the recent submissions
>> made around this (grant ops) area, is this something we should perhaps revisit
>> and discuss whether we want to continue offering persistent grants as a feature?
>>>>>
>>>>
>>>> Agree, Life would be easier if we can remove the persistent feature.
>>>
>>> I was thinking about this yesterday, and IMHO I think we should remove
>>> persistent grants now while it's not too entangled, leaving it for
>>> later will just make our life more miserable.
>>>
>>> While it's true that persistent grants provide a throughput increase
>>> by preventing grant table operations and TLB flushes, it has several
>>> problems that cannot by avoided:
>>>
>>> - Memory/grants hoarding, we need to reserve the same amount of
>>> memory as the amount of data that we want to have in-flight. While
>>> this is not so critical for memory, it is for grants, since using too
>>> many grants can basically deadlock other PV interfaces. There's no way
>>> to avoid this since it's the design behind persistent grants.
>>>
>>> - Memcopy: guest needs to perform a memcopy of all data that goes
>>> through blkfront. While not so critical, Felipe found systems were
>>> memcopy was more expensive than grant map/unmap in the backend (IIRC
>>> those were AMD systems).
>>>
>>> - Complexity/interactions: when persistent grants was designed number
>>> of requests was limited to 32 and each request could only contain 11
>>> pages. This means we had to use 352 pages/grants which was fine. Now
>>> that we have indirect IO and multiqueue in the horizon this number has
>>> gone up by orders of magnitude, I don't think this is viable/useful
>>> any more.
>>>
>>> If Konrad/Bob agree I would like to send a patch to remove persistent
>>> grants and then have the multiqueue series rebased on top of that.
>>
>> I agree with this.
>>
>> I think we can get better performance/scalability gains of with improvements
>> to grant table locking and TLB flush avoidance.
>>
>> David
>
> It doesn't change the fact that persistent grants (as well as the grant copy implementation we did for tapdisk3) were alternatives that allowed aggregate storage performance to increase drastically. Before committing to removing something that allow Xen users to scale their deployments, I think we need to revisit whether the recent improvements to the whole grant mechanisms (grant table locking, TLB flushing, batched calls, etc) are performing as we would (now) expect.

The fact that this extension improved performance doesn't mean it's
right or desirable. So IMHO we should just remove it and take the
performance hit. Then we can figure out ways to deal with the limitations
properly instead of resorting to this kind of hack, because such hacks
prevent us from moving forward.

Roger.

2015-02-19 17:26:14

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 10/10] xen/blkfront: use work queue to fast blkif interrupt return



On 15/02/2015 08:19, Bob Liu wrote:
> Move the request complete logic out of blkif_interrupt() to a work queue,
> after that we can replace 'spin_lock_irq' with 'spin_lock' so that irq won't
> be disabled too long in blk_mq_queue_rq().

I think using a threaded interrupt (like scsifront) is better than a work queue.
Also, this seems orthogonal to the multiqueue support. Is it a useful
bug fix on its own?
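
For reference, a threaded handler along the lines of what scsifront does
might look roughly like the fragment below (untested sketch; the
blkif_irq_thread()/blkif_complete_requests() names are made up for
illustration):

	/* Threaded part: completion work runs in process context, so a
	 * plain spin_lock is enough and IRQs stay enabled. */
	static irqreturn_t blkif_irq_thread(int irq, void *dev_id)
	{
		struct blkfront_ring_info *rinfo = dev_id;

		blkif_complete_requests(rinfo);
		return IRQ_HANDLED;
	}

	...
		err = bind_evtchn_to_irq(rinfo->evtchn);
		if (err < 0)
			goto fail;
		rinfo->irq = err;
		/* No hard handler, so IRQF_ONESHOT is required. */
		err = request_threaded_irq(rinfo->irq, NULL, blkif_irq_thread,
					   IRQF_ONESHOT, "blkif", rinfo);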

David

2015-02-19 16:57:58

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH 07/10] xen/blkback: pseudo support for multi hardware queues



On 15/02/2015 08:19, Bob Liu wrote:
> Prepare patch for multi hardware queues, the ring number was mandatory set to 1.
[...]
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -107,21 +110,108 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
> }
> invalidate_inode_pages2(blkif->vbd.bdev->bd_inode->i_mapping);
>
> - blkif->ring.xenblkd = kthread_run(xen_blkif_schedule, &blkif->ring, "%s", name);
> - if (IS_ERR(blkif->ring.xenblkd)) {
> - err = PTR_ERR(blkif->ring.xenblkd);
> - blkif->ring.xenblkd = NULL;
> - xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
> - return;
> + if (blkif->nr_rings == 1) {
> + blkif->rings[0].xenblkd = kthread_run(xen_blkif_schedule, &blkif->rings[0], "%s", name);
> + if (IS_ERR(blkif->rings[0].xenblkd)) {
> + err = PTR_ERR(blkif->rings[0].xenblkd);
> + blkif->rings[0].xenblkd = NULL;
> + xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
> + return;
> + }

You don't need to special case 1 ring here.

> + } else {
> + for (i = 0 ; i < blkif->nr_rings ; i++) {
> + snprintf(per_ring_name, TASK_COMM_LEN + 1, "%s-%d", name, i);
> + ring = &blkif->rings[i];
> + ring->xenblkd = kthread_run(xen_blkif_schedule, ring, "%s", per_ring_name);

You don't need the snprintf since kthread_run() already takes a
printf-style name format.
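
Something like the following (untested sketch) should cover both points:

	for (i = 0; i < blkif->nr_rings; i++) {
		ring = &blkif->rings[i];
		ring->xenblkd = kthread_run(xen_blkif_schedule, ring,
					    "%s-%d", name, i);
		if (IS_ERR(ring->xenblkd)) {
			err = PTR_ERR(ring->xenblkd);
			ring->xenblkd = NULL;
			xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
			return;
		}
	}

(With a single ring the thread just ends up named "name-0", which seems
acceptable.)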

David

2015-02-20 19:00:15

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct

> >>>>
> >>>> Agree, Life would be easier if we can remove the persistent feature.

..snip..
> >>>
> >>> If Konrad/Bob agree I would like to send a patch to remove persistent
> >>> grants and then have the multiqueue series rebased on top of that.

..snip..
> >>
> >> I agree with this.
> >>
> >> I think we can get better performance/scalability gains of with improvements
> >> to grant table locking and TLB flush avoidance.
> >>
> >> David
> >
> > It doesn't change the fact that persistent grants (as well as the grant copy implementation we did for tapdisk3) were alternatives that allowed aggregate storage performance to increase drastically. Before committing to removing something that allow Xen users to scale their deployments, I think we need to revisit whether the recent improvements to the whole grant mechanisms (grant table locking, TLB flushing, batched calls, etc) are performing as we would (now) expect.
>
> The fact that this extension improved performance doesn't mean it's
> right or desirable. So IMHO we should just remove it and take the
> performance hit. Then we can figure ways to deal with the limitations

.. snip..

Removing code without a clear forward plan might lead to re-instating
said code later - if that forward plan is never achieved.

If the matter here is purely code complexity, I would stress that
cleanups can simplify this - for instance, the 'grant' ops (persistent
or not) could be moved into a separate file.

That ought to, in the short term, remove the problems with the
'if (persistent_grant)' checks.

David's assertion that better performance and scalability can be gained
with grant table locking and TLB flush avoidance is interesting - as:
1). The grant locking is going into Xen 4.6 but not earlier - so when running
on older hypervisors persistent grants still give a performance benefit.

2). I have not seen any prototype TLB flush avoidance code, so I do not
know when that would be available.

Perhaps a better choice is to remove the persistence support once the
changes in the Xen hypervisor are known?

2015-02-27 12:53:45

by Bob Liu

[permalink] [raw]
Subject: Re: [PATCH 04/10] xen/blkfront: separate ring information to an new struct


On 02/21/2015 02:59 AM, Konrad Rzeszutek Wilk wrote:
>>>>>>
>>>>>> Agree, Life would be easier if we can remove the persistent feature.
>
> ..snip..
>>>>>
>>>>> If Konrad/Bob agree I would like to send a patch to remove persistent
>>>>> grants and then have the multiqueue series rebased on top of that.
>
> ..snip..
>>>>
>>>> I agree with this.
>>>>
>>>> I think we can get better performance/scalability gains of with improvements
>>>> to grant table locking and TLB flush avoidance.
>>>>
>>>> David
>>>
>>> It doesn't change the fact that persistent grants (as well as the grant copy implementation we did for tapdisk3) were alternatives that allowed aggregate storage performance to increase drastically. Before committing to removing something that allow Xen users to scale their deployments, I think we need to revisit whether the recent improvements to the whole grant mechanisms (grant table locking, TLB flushing, batched calls, etc) are performing as we would (now) expect.
>>
>> The fact that this extension improved performance doesn't mean it's
>> right or desirable. So IMHO we should just remove it and take the
>> performance hit. Then we can figure ways to deal with the limitations
>
> .. snip..
>
> Removing code just because without a clear forward plan might lead to
> re-instating said code back again - if no forward plan has been achieved.
>
> If the matter here is purely code complication I would stress that doing
> cleanups in code can simplify this - as in the code can do with some
> moving of the 'grant' ops (persistent or not) in a different file.
>
> That ought to short-term remove the problems with the 'if (persistent_grant)'
> problem.
>
> David assertion that better performance and scalbility can be gained
> with grant table locking and TLB flush avoidance is interesting - as
> 1). The grant locking is going in Xen 4.6 but not earlier - so when running
> on older hypervisors this gives an performance benefit.
>
> 2). I have not seen any prototype TLB flush avoidance code so not know
> when that would be available.
>
> Perhaps a better choice is to do the removal of the persistence support
> when the changes in Xen hypervisor are known?
>

With the patch series "[PATCH v5 0/2] gnttab: Improve scaleability"
applied, I can get nearly the same performance without persistence
support as with it.

But I'm not sure about the benchmark described here:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/block/xen-blkfront.c?id=0a8704a51f386cab7394e38ff1d66eef924d8ab8

--
Regards,
-Bob