2013-04-17 18:20:30

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 0/7] xen-block: indirect descriptors

Implementation of indirect descriptors v2, addressing Konrad's
comments. A graph on performance can be found at:

http://xenbits.xen.org/people/royger/plot_indirect_read4k.png

Thanks for the review, Roger.

Roger Pau Monne (7):
xen-blkback: print stats about persistent grants
xen-blkback: use balloon pages for all mappings
xen-blkback: implement LRU mechanism for persistent grants
xen-blkback: move pending handles list from blkbk to pending_req
xen-blkback: make the queue of free requests per backend
xen-blkback: expand map/unmap functions
xen-block: implement indirect descriptors

Documentation/ABI/stable/sysfs-bus-xen-backend | 18 +
drivers/block/xen-blkback/blkback.c | 844 +++++++++++++++---------
drivers/block/xen-blkback/common.h | 145 ++++-
drivers/block/xen-blkback/xenbus.c | 39 ++
drivers/block/xen-blkfront.c | 491 ++++++++++++---
include/xen/interface/io/blkif.h | 48 ++
6 files changed, 1186 insertions(+), 399 deletions(-)

git://xenbits.xen.org/people/royger/linux.git xen-block-indirect-v2


2013-04-17 18:20:31

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 2/7] xen-blkback: use balloon pages for all mappings

Using balloon pages for all granted pages allows us to simplify the
logic in blkback, especially in the xen_blkbk_map function, since now
we can decide if we want to map a grant persistently or not after we
have actually mapped it. This could not be done before because
persistent grants used ballooned pages, whereas non-persistent grants
used pages from the kernel.

This patch also introduces several changes, the first one is that the
list of free pages is no longer global, now each blkback instance has
it's own list of free pages that can be used to map grants. Also, a
run time parameter (max_buffer_pages) has been added in order to tune
the maximum number of free pages each blkback instance will keep in
it's buffer.

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: [email protected]
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
Changes since RFC:
* Fix typos in commit message.
* Minor fixes in code.
Changes since v1:
* Added some comments
* Changed variable name in the grant ref map loop
---
Documentation/ABI/stable/sysfs-bus-xen-backend | 8 +
drivers/block/xen-blkback/blkback.c | 286 ++++++++++++++----------
drivers/block/xen-blkback/common.h | 5 +
drivers/block/xen-blkback/xenbus.c | 3 +
4 files changed, 181 insertions(+), 121 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-bus-xen-backend b/Documentation/ABI/stable/sysfs-bus-xen-backend
index 3d5951c..e04afe0 100644
--- a/Documentation/ABI/stable/sysfs-bus-xen-backend
+++ b/Documentation/ABI/stable/sysfs-bus-xen-backend
@@ -73,3 +73,11 @@ KernelVersion: 3.0
Contact: Konrad Rzeszutek Wilk <[email protected]>
Description:
Number of sectors written by the frontend.
+
+What: /sys/module/xen_blkback/parameters/max_buffer_pages
+Date: March 2013
+KernelVersion: 3.10
+Contact: Roger Pau Monné <[email protected]>
+Description:
+ Maximum number of free pages to keep in each block
+ backend buffer.
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index f7526db..8245c6b 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -63,6 +63,21 @@ static int xen_blkif_reqs = 64;
module_param_named(reqs, xen_blkif_reqs, int, 0);
MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");

+/*
+ * Maximum number of unused free pages to keep in the internal buffer.
+ * Setting this to a value too low will reduce memory used in each backend,
+ * but can have a performance penalty.
+ *
+ * A sane value is xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST, but can
+ * be set to a lower value that might degrade performance on some intensive
+ * IO workloads.
+ */
+
+static int xen_blkif_max_buffer_pages = 704;
+module_param_named(max_buffer_pages, xen_blkif_max_buffer_pages, int, 0644);
+MODULE_PARM_DESC(max_buffer_pages,
+"Maximum number of free pages to keep in each block backend buffer");
+
/* Run-time switchable: /sys/module/blkback/parameters/ */
static unsigned int log_stats;
module_param(log_stats, int, 0644);
@@ -82,10 +97,14 @@ struct pending_req {
int status;
struct list_head free_list;
DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

#define BLKBACK_INVALID_HANDLE (~0)

+/* Number of free pages to remove on each call to free_xenballooned_pages */
+#define NUM_BATCH_FREE_PAGES 10
+
struct xen_blkbk {
struct pending_req *pending_reqs;
/* List of all 'pending_req' available */
@@ -93,8 +112,6 @@ struct xen_blkbk {
/* And its spinlock. */
spinlock_t pending_free_lock;
wait_queue_head_t pending_free_wq;
- /* The list of all pages that are available. */
- struct page **pending_pages;
/* And the grant handles that are available. */
grant_handle_t *pending_grant_handles;
};
@@ -143,14 +160,66 @@ static inline int vaddr_pagenr(struct pending_req *req, int seg)
BLKIF_MAX_SEGMENTS_PER_REQUEST + seg;
}

-#define pending_page(req, seg) pending_pages[vaddr_pagenr(req, seg)]
+static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkif->free_pages_lock, flags);
+ if (list_empty(&blkif->free_pages)) {
+ BUG_ON(blkif->free_pages_num != 0);
+ spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ return alloc_xenballooned_pages(1, page, false);
+ }
+ BUG_ON(blkif->free_pages_num == 0);
+ page[0] = list_first_entry(&blkif->free_pages, struct page, lru);
+ list_del(&page[0]->lru);
+ blkif->free_pages_num--;
+ spin_unlock_irqrestore(&blkif->free_pages_lock, flags);

-static inline unsigned long vaddr(struct pending_req *req, int seg)
+ return 0;
+}
+
+static inline void put_free_pages(struct xen_blkif *blkif, struct page **page,
+ int num)
{
- unsigned long pfn = page_to_pfn(blkbk->pending_page(req, seg));
- return (unsigned long)pfn_to_kaddr(pfn);
+ unsigned long flags;
+ int i;
+
+ spin_lock_irqsave(&blkif->free_pages_lock, flags);
+ for (i = 0; i < num; i++)
+ list_add(&page[i]->lru, &blkif->free_pages);
+ blkif->free_pages_num += num;
+ spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+}
+
+static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
+{
+ /* Remove requested pages in batches of NUM_BATCH_FREE_PAGES */
+ struct page *page[NUM_BATCH_FREE_PAGES];
+ unsigned int num_pages = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkif->free_pages_lock, flags);
+ while (blkif->free_pages_num > num) {
+ BUG_ON(list_empty(&blkif->free_pages));
+ page[num_pages] = list_first_entry(&blkif->free_pages,
+ struct page, lru);
+ list_del(&page[num_pages]->lru);
+ blkif->free_pages_num--;
+ if (++num_pages == NUM_BATCH_FREE_PAGES) {
+ spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ free_xenballooned_pages(num_pages, page);
+ spin_lock_irqsave(&blkif->free_pages_lock, flags);
+ num_pages = 0;
+ }
+ }
+ spin_unlock_irqrestore(&blkif->free_pages_lock, flags);
+ if (num_pages != 0)
+ free_xenballooned_pages(num_pages, page);
}

+#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))
+
#define pending_handle(_req, _seg) \
(blkbk->pending_grant_handles[vaddr_pagenr(_req, _seg)])

@@ -170,7 +239,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,
(n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL)


-static void add_persistent_gnt(struct rb_root *root,
+static int add_persistent_gnt(struct rb_root *root,
struct persistent_gnt *persistent_gnt)
{
struct rb_node **new = &(root->rb_node), *parent = NULL;
@@ -186,14 +255,15 @@ static void add_persistent_gnt(struct rb_root *root,
else if (persistent_gnt->gnt > this->gnt)
new = &((*new)->rb_right);
else {
- pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
- BUG();
+ pr_alert_ratelimited(DRV_PFX " trying to add a gref that's already in the tree\n");
+ return -EINVAL;
}
}

/* Add new node and rebalance tree. */
rb_link_node(&(persistent_gnt->node), parent, new);
rb_insert_color(&(persistent_gnt->node), root);
+ return 0;
}

static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
@@ -215,7 +285,8 @@ static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
return NULL;
}

-static void free_persistent_gnts(struct rb_root *root, unsigned int num)
+static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
+ unsigned int num)
{
struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
@@ -240,7 +311,7 @@ static void free_persistent_gnts(struct rb_root *root, unsigned int num)
ret = gnttab_unmap_refs(unmap, NULL, pages,
segs_to_unmap);
BUG_ON(ret);
- free_xenballooned_pages(segs_to_unmap, pages);
+ put_free_pages(blkif, pages, segs_to_unmap);
segs_to_unmap = 0;
}

@@ -422,13 +493,19 @@ int xen_blkif_schedule(void *arg)
if (do_block_io_op(blkif))
blkif->waiting_reqs = 1;

+ /* Shrink if we have more than xen_blkif_max_buffer_pages */
+ shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);
+
if (log_stats && time_after(jiffies, blkif->st_print))
print_stats(blkif);
}

+ /* Since we are shutting down remove all pages from the buffer */
+ shrink_free_pagepool(blkif, 0 /* All */);
+
/* Free all persistent grant pages */
if (!RB_EMPTY_ROOT(&blkif->persistent_gnts))
- free_persistent_gnts(&blkif->persistent_gnts,
+ free_persistent_gnts(blkif, &blkif->persistent_gnts,
blkif->persistent_gnt_c);

BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
@@ -457,23 +534,25 @@ static void xen_blkbk_unmap(struct pending_req *req)
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
unsigned int i, invcount = 0;
grant_handle_t handle;
+ struct xen_blkif *blkif = req->blkif;
int ret;

for (i = 0; i < req->nr_pages; i++) {
if (!test_bit(i, req->unmap_seg))
continue;
handle = pending_handle(req, i);
+ pages[invcount] = req->pages[i];
if (handle == BLKBACK_INVALID_HANDLE)
continue;
- gnttab_set_unmap_op(&unmap[invcount], vaddr(req, i),
+ gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[invcount]),
GNTMAP_host_map, handle);
pending_handle(req, i) = BLKBACK_INVALID_HANDLE;
- pages[invcount] = virt_to_page(vaddr(req, i));
invcount++;
}

ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
BUG_ON(ret);
+ put_free_pages(blkif, pages, invcount);
}

static int xen_blkbk_map(struct blkif_request *req,
@@ -487,8 +566,7 @@ static int xen_blkbk_map(struct blkif_request *req,
struct persistent_gnt *persistent_gnt = NULL;
struct xen_blkif *blkif = pending_req->blkif;
phys_addr_t addr = 0;
- int i, j;
- bool new_map;
+ int i, seg_idx, new_map_idx;
int nseg = req->u.rw.nr_segments;
int segs_to_map = 0;
int ret = 0;
@@ -517,68 +595,16 @@ static int xen_blkbk_map(struct blkif_request *req,
* We are using persistent grants and
* the grant is already mapped
*/
- new_map = false;
- } else if (use_persistent_gnts &&
- blkif->persistent_gnt_c <
- max_mapped_grant_pages(blkif->blk_protocol)) {
- /*
- * We are using persistent grants, the grant is
- * not mapped but we have room for it
- */
- new_map = true;
- persistent_gnt = kmalloc(
- sizeof(struct persistent_gnt),
- GFP_KERNEL);
- if (!persistent_gnt)
- return -ENOMEM;
- if (alloc_xenballooned_pages(1, &persistent_gnt->page,
- false)) {
- kfree(persistent_gnt);
- return -ENOMEM;
- }
- persistent_gnt->gnt = req->u.rw.seg[i].gref;
- persistent_gnt->handle = BLKBACK_INVALID_HANDLE;
-
- pages_to_gnt[segs_to_map] =
- persistent_gnt->page;
- addr = (unsigned long) pfn_to_kaddr(
- page_to_pfn(persistent_gnt->page));
-
- add_persistent_gnt(&blkif->persistent_gnts,
- persistent_gnt);
- blkif->persistent_gnt_c++;
- pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
- persistent_gnt->gnt, blkif->persistent_gnt_c,
- max_mapped_grant_pages(blkif->blk_protocol));
- } else {
- /*
- * We are either using persistent grants and
- * hit the maximum limit of grants mapped,
- * or we are not using persistent grants.
- */
- if (use_persistent_gnts &&
- !blkif->vbd.overflow_max_grants) {
- blkif->vbd.overflow_max_grants = 1;
- pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
- blkif->domid, blkif->vbd.handle);
- }
- new_map = true;
- pages[i] = blkbk->pending_page(pending_req, i);
- addr = vaddr(pending_req, i);
- pages_to_gnt[segs_to_map] =
- blkbk->pending_page(pending_req, i);
- }
-
- if (persistent_gnt) {
pages[i] = persistent_gnt->page;
persistent_gnts[i] = persistent_gnt;
} else {
+ if (get_free_page(blkif, &pages[i]))
+ goto out_of_memory;
+ addr = vaddr(pages[i]);
+ pages_to_gnt[segs_to_map] = pages[i];
persistent_gnts[i] = NULL;
- }
-
- if (new_map) {
flags = GNTMAP_host_map;
- if (!persistent_gnt &&
+ if (!use_persistent_gnts &&
(pending_req->operation != BLKIF_OP_READ))
flags |= GNTMAP_readonly;
gnttab_set_map_op(&map[segs_to_map++], addr,
@@ -598,48 +624,81 @@ static int xen_blkbk_map(struct blkif_request *req,
* the page from the other domain.
*/
bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
- for (i = 0, j = 0; i < nseg; i++) {
- if (!persistent_gnts[i] ||
- persistent_gnts[i]->handle == BLKBACK_INVALID_HANDLE) {
+ for (seg_idx = 0, new_map_idx = 0; seg_idx < nseg; seg_idx++) {
+ if (!persistent_gnts[seg_idx]) {
/* This is a newly mapped grant */
- BUG_ON(j >= segs_to_map);
- if (unlikely(map[j].status != 0)) {
+ BUG_ON(new_map_idx >= segs_to_map);
+ if (unlikely(map[new_map_idx].status != 0)) {
pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
- map[j].handle = BLKBACK_INVALID_HANDLE;
+ pending_handle(pending_req, seg_idx) = BLKBACK_INVALID_HANDLE;
ret |= 1;
- if (persistent_gnts[i]) {
- rb_erase(&persistent_gnts[i]->node,
- &blkif->persistent_gnts);
- blkif->persistent_gnt_c--;
- kfree(persistent_gnts[i]);
- persistent_gnts[i] = NULL;
- }
+ new_map_idx++;
+ /*
+ * No need to set unmap_seg bit, since
+ * we can not unmap this grant because
+ * the handle is invalid.
+ */
+ continue;
}
+ pending_handle(pending_req, seg_idx) = map[new_map_idx].handle;
+ } else {
+ /* This grant is persistent and already mapped */
+ goto next;
}
- if (persistent_gnts[i]) {
- if (persistent_gnts[i]->handle ==
- BLKBACK_INVALID_HANDLE) {
+ if (use_persistent_gnts &&
+ blkif->persistent_gnt_c <
+ max_mapped_grant_pages(blkif->blk_protocol)) {
+ /*
+ * We are using persistent grants, the grant is
+ * not mapped but we have room for it
+ */
+ persistent_gnt = kmalloc(sizeof(struct persistent_gnt),
+ GFP_KERNEL);
+ if (!persistent_gnt) {
/*
- * If this is a new persistent grant
- * save the handler
+ * If we don't have enough memory to
+ * allocate the persistent_gnt struct
+ * map this grant non-persistenly
*/
- persistent_gnts[i]->handle = map[j++].handle;
+ goto next_unmap;
}
- pending_handle(pending_req, i) =
- persistent_gnts[i]->handle;
-
- if (ret)
- continue;
- } else {
- pending_handle(pending_req, i) = map[j++].handle;
- bitmap_set(pending_req->unmap_seg, i, 1);
-
- if (ret)
- continue;
+ persistent_gnt->gnt = map[new_map_idx].ref;
+ persistent_gnt->handle = map[new_map_idx].handle;
+ persistent_gnt->page = pages[seg_idx];
+ if (add_persistent_gnt(&blkif->persistent_gnts,
+ persistent_gnt)) {
+ kfree(persistent_gnt);
+ persistent_gnt = NULL;
+ goto next_unmap;
+ }
+ blkif->persistent_gnt_c++;
+ pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
+ persistent_gnt->gnt, blkif->persistent_gnt_c,
+ max_mapped_grant_pages(blkif->blk_protocol));
+ new_map_idx++;
+ goto next;
+ }
+ if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
+ blkif->vbd.overflow_max_grants = 1;
+ pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
+ blkif->domid, blkif->vbd.handle);
}
- seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
+next_unmap:
+ /*
+ * We could not map this grant persistently, so use it as
+ * a non-persistent grant.
+ */
+ bitmap_set(pending_req->unmap_seg, seg_idx, 1);
+ new_map_idx++;
+next:
+ seg[seg_idx].offset = (req->u.rw.seg[seg_idx].first_sect << 9);
}
return ret;
+
+out_of_memory:
+ pr_alert(DRV_PFX "%s: out of memory\n", __func__);
+ put_free_pages(blkif, pages_to_gnt, segs_to_map);
+ return -ENOMEM;
}

static int dispatch_discard_io(struct xen_blkif *blkif,
@@ -863,7 +922,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
int operation;
struct blk_plug plug;
bool drain = false;
- struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct page **pages = pending_req->pages;

switch (req->operation) {
case BLKIF_OP_READ:
@@ -1090,22 +1149,14 @@ static int __init xen_blkif_init(void)
xen_blkif_reqs, GFP_KERNEL);
blkbk->pending_grant_handles = kmalloc(sizeof(blkbk->pending_grant_handles[0]) *
mmap_pages, GFP_KERNEL);
- blkbk->pending_pages = kzalloc(sizeof(blkbk->pending_pages[0]) *
- mmap_pages, GFP_KERNEL);

- if (!blkbk->pending_reqs || !blkbk->pending_grant_handles ||
- !blkbk->pending_pages) {
+ if (!blkbk->pending_reqs || !blkbk->pending_grant_handles) {
rc = -ENOMEM;
goto out_of_memory;
}

for (i = 0; i < mmap_pages; i++) {
blkbk->pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
- blkbk->pending_pages[i] = alloc_page(GFP_KERNEL);
- if (blkbk->pending_pages[i] == NULL) {
- rc = -ENOMEM;
- goto out_of_memory;
- }
}
rc = xen_blkif_interface_init();
if (rc)
@@ -1130,13 +1181,6 @@ static int __init xen_blkif_init(void)
failed_init:
kfree(blkbk->pending_reqs);
kfree(blkbk->pending_grant_handles);
- if (blkbk->pending_pages) {
- for (i = 0; i < mmap_pages; i++) {
- if (blkbk->pending_pages[i])
- __free_page(blkbk->pending_pages[i]);
- }
- kfree(blkbk->pending_pages);
- }
kfree(blkbk);
blkbk = NULL;
return rc;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 60103e2..6c73c38 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -220,6 +220,11 @@ struct xen_blkif {
struct rb_root persistent_gnts;
unsigned int persistent_gnt_c;

+ /* buffer of free pages to map grant refs */
+ spinlock_t free_pages_lock;
+ int free_pages_num;
+ struct list_head free_pages;
+
/* statistics */
unsigned long st_print;
unsigned long long st_rd_req;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 8bfd1bc..24f7f6d 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -118,6 +118,9 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->st_print = jiffies;
init_waitqueue_head(&blkif->waiting_to_free);
blkif->persistent_gnts.rb_node = NULL;
+ spin_lock_init(&blkif->free_pages_lock);
+ INIT_LIST_HEAD(&blkif->free_pages);
+ blkif->free_pages_num = 0;

return blkif;
}
--
1.7.7.5 (Apple Git-26)

2013-04-17 18:20:27

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 4/7] xen-blkback: move pending handles list from blkbk to pending_req

Moving grant ref handles from blkbk to pending_req will allow us to
get rid of the shared blkbk structure.

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
---
drivers/block/xen-blkback/blkback.c | 16 ++++------------
1 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 17052f7..ae7dc92 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -128,6 +128,7 @@ struct pending_req {
struct list_head free_list;
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ grant_handle_t grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

#define BLKBACK_INVALID_HANDLE (~0)
@@ -142,8 +143,6 @@ struct xen_blkbk {
/* And its spinlock. */
spinlock_t pending_free_lock;
wait_queue_head_t pending_free_wq;
- /* And the grant handles that are available. */
- grant_handle_t *pending_grant_handles;
};

static struct xen_blkbk *blkbk;
@@ -221,7 +220,7 @@ static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)
#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))

#define pending_handle(_req, _seg) \
- (blkbk->pending_grant_handles[vaddr_pagenr(_req, _seg)])
+ (_req->grant_handles[_seg])


static int do_block_io_op(struct xen_blkif *blkif);
@@ -1304,7 +1303,7 @@ static void make_response(struct xen_blkif *blkif, u64 id,

static int __init xen_blkif_init(void)
{
- int i, mmap_pages;
+ int i;
int rc = 0;

if (!xen_domain())
@@ -1316,21 +1315,15 @@ static int __init xen_blkif_init(void)
return -ENOMEM;
}

- mmap_pages = xen_blkif_reqs * BLKIF_MAX_SEGMENTS_PER_REQUEST;

blkbk->pending_reqs = kzalloc(sizeof(blkbk->pending_reqs[0]) *
xen_blkif_reqs, GFP_KERNEL);
- blkbk->pending_grant_handles = kmalloc(sizeof(blkbk->pending_grant_handles[0]) *
- mmap_pages, GFP_KERNEL);

- if (!blkbk->pending_reqs || !blkbk->pending_grant_handles) {
+ if (!blkbk->pending_reqs) {
rc = -ENOMEM;
goto out_of_memory;
}

- for (i = 0; i < mmap_pages; i++) {
- blkbk->pending_grant_handles[i] = BLKBACK_INVALID_HANDLE;
- }
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;
@@ -1353,7 +1346,6 @@ static int __init xen_blkif_init(void)
pr_alert(DRV_PFX "%s: out of memory\n", __func__);
failed_init:
kfree(blkbk->pending_reqs);
- kfree(blkbk->pending_grant_handles);
kfree(blkbk);
blkbk = NULL;
return rc;
--
1.7.7.5 (Apple Git-26)

2013-04-17 18:20:25

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 1/7] xen-blkback: print stats about persistent grants

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: [email protected]
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
drivers/block/xen-blkback/blkback.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index dd5b2fe..f7526db 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -382,10 +382,12 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
static void print_stats(struct xen_blkif *blkif)
{
pr_info("xen-blkback (%s): oo %3llu | rd %4llu | wr %4llu | f %4llu"
- " | ds %4llu\n",
+ " | ds %4llu | pg: %4u/%4u\n",
current->comm, blkif->st_oo_req,
blkif->st_rd_req, blkif->st_wr_req,
- blkif->st_f_req, blkif->st_ds_req);
+ blkif->st_f_req, blkif->st_ds_req,
+ blkif->persistent_gnt_c,
+ max_mapped_grant_pages(blkif->blk_protocol));
blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
blkif->st_rd_req = 0;
blkif->st_wr_req = 0;
--
1.7.7.5 (Apple Git-26)

2013-04-17 18:21:20

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 7/7] xen-block: implement indirect descriptors

Indirect descriptors introduce a new block operation
(BLKIF_OP_INDIRECT) that passes grant references instead of segments
in the request. This grant references are filled with arrays of
blkif_request_segment_aligned, this way we can send more segments in a
request.

The proposed implementation sets the maximum number of indirect grefs
(frames filled with blkif_request_segment_aligned) to 256 in the
backend and 32 in the frontend. The value in the frontend has been
chosen experimentally, and the backend value has been set to a sane
value that allows expanding the maximum number of indirect descriptors
in the frontend if needed.

The migration code has changed from the previous implementation, in
which we simply remapped the segments on the shared ring. Now the
maximum number of segments allowed in a request can change depending
on the backend, so we have to requeue all the requests in the ring and
in the queue and split the bios in them if they are bigger than the
new maximum number of segments.

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
---
Changes since v1:
* Added padding to make the indirect request 64bit aligned.
* Added some BUGs.
* Added comments.
* Fixed number of indirect pages in blkif_get_x86_{32/64}_req.
* Added description about the indirect operation in blkif.h
---
drivers/block/xen-blkback/blkback.c | 135 +++++++---
drivers/block/xen-blkback/common.h | 98 +++++++-
drivers/block/xen-blkback/xenbus.c | 8 +
drivers/block/xen-blkfront.c | 491 +++++++++++++++++++++++++++++------
include/xen/interface/io/blkif.h | 48 ++++
5 files changed, 655 insertions(+), 125 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 356722f..998842e 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -59,7 +59,7 @@
* IO workloads.
*/

-static int xen_blkif_max_buffer_pages = 704;
+static int xen_blkif_max_buffer_pages = 1024;
module_param_named(max_buffer_pages, xen_blkif_max_buffer_pages, int, 0644);
MODULE_PARM_DESC(max_buffer_pages,
"Maximum number of free pages to keep in each block backend buffer");
@@ -75,7 +75,7 @@ MODULE_PARM_DESC(max_buffer_pages,
* algorithm.
*/

-static int xen_blkif_max_pgrants = 352;
+static int xen_blkif_max_pgrants = 1056;
module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
MODULE_PARM_DESC(max_persistent_grants,
"Maximum number of grants to map persistently");
@@ -636,10 +636,6 @@ purge_gnt_list:
return 0;
}

-struct seg_buf {
- unsigned int offset;
- unsigned int nsec;
-};
/*
* Unmap the grant references, and also remove the M2P over-rides
* used in the 'pending_req'.
@@ -818,29 +814,70 @@ out_of_memory:
return -ENOMEM;
}

-static int xen_blkbk_map_seg(struct blkif_request *req,
- struct pending_req *pending_req,
+static int xen_blkbk_map_seg(struct pending_req *pending_req,
struct seg_buf seg[],
struct page *pages[])
{
- int i, rc;
- grant_ref_t grefs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ int rc;

- for (i = 0; i < req->u.rw.nr_segments; i++)
- grefs[i] = req->u.rw.seg[i].gref;
-
- rc = xen_blkbk_map(pending_req->blkif, grefs,
+ rc = xen_blkbk_map(pending_req->blkif, pending_req->grefs,
pending_req->persistent_gnts,
pending_req->grant_handles, pending_req->pages,
- req->u.rw.nr_segments,
+ pending_req->nr_pages,
(pending_req->operation != BLKIF_OP_READ));
- if (rc)
- return rc;

- for (i = 0; i < req->u.rw.nr_segments; i++)
- seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
+ return rc;
+}

- return 0;
+static int xen_blkbk_parse_indirect(struct blkif_request *req,
+ struct pending_req *pending_req,
+ struct seg_buf seg[],
+ struct phys_req *preq)
+{
+ struct persistent_gnt **persistent =
+ pending_req->indirect_persistent_gnts;
+ struct page **pages = pending_req->indirect_pages;
+ struct xen_blkif *blkif = pending_req->blkif;
+ int indirect_grefs, rc, n, nseg, i;
+ struct blkif_request_segment_aligned *segments = NULL;
+
+ nseg = pending_req->nr_pages;
+ indirect_grefs = INDIRECT_PAGES(nseg);
+ BUG_ON(indirect_grefs > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
+
+ rc = xen_blkbk_map(blkif, req->u.indirect.indirect_grefs,
+ persistent, pending_req->indirect_handles,
+ pages, indirect_grefs, true);
+ if (rc)
+ goto unmap;
+
+ for (n = 0, i = 0; n < nseg; n++) {
+ if ((n % SEGS_PER_INDIRECT_FRAME) == 0) {
+ /* Map indirect segments */
+ if (segments)
+ kunmap_atomic(segments);
+ segments = kmap_atomic(pages[n/SEGS_PER_INDIRECT_FRAME]);
+ }
+ i = n % SEGS_PER_INDIRECT_FRAME;
+ pending_req->grefs[n] = segments[i].gref;
+ seg[n].nsec = segments[i].last_sect -
+ segments[i].first_sect + 1;
+ seg[n].offset = (segments[i].first_sect << 9);
+ if ((segments[i].last_sect >= (PAGE_SIZE >> 9)) ||
+ (segments[i].last_sect <
+ segments[i].first_sect)) {
+ rc = -EINVAL;
+ goto unmap;
+ }
+ preq->nr_sects += seg[n].nsec;
+ }
+
+unmap:
+ if (segments)
+ kunmap_atomic(segments);
+ xen_blkbk_unmap(blkif, pending_req->indirect_handles,
+ pages, persistent, indirect_grefs);
+ return rc;
}

static int dispatch_discard_io(struct xen_blkif *blkif,
@@ -1013,6 +1050,7 @@ __do_block_io_op(struct xen_blkif *blkif)
case BLKIF_OP_WRITE:
case BLKIF_OP_WRITE_BARRIER:
case BLKIF_OP_FLUSH_DISKCACHE:
+ case BLKIF_OP_INDIRECT:
if (dispatch_rw_block_io(blkif, &req, pending_req))
goto done;
break;
@@ -1059,17 +1097,28 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
struct pending_req *pending_req)
{
struct phys_req preq;
- struct seg_buf seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct seg_buf *seg = pending_req->seg;
unsigned int nseg;
struct bio *bio = NULL;
- struct bio *biolist[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct bio **biolist = pending_req->biolist;
int i, nbio = 0;
int operation;
struct blk_plug plug;
bool drain = false;
struct page **pages = pending_req->pages;
+ unsigned short req_operation;
+
+ req_operation = req->operation == BLKIF_OP_INDIRECT ?
+ req->u.indirect.indirect_op : req->operation;
+ if ((req->operation == BLKIF_OP_INDIRECT) &&
+ (req_operation != BLKIF_OP_READ) &&
+ (req_operation != BLKIF_OP_WRITE)) {
+ pr_debug(DRV_PFX "Invalid indirect operation (%u)\n",
+ req_operation);
+ goto fail_response;
+ }

- switch (req->operation) {
+ switch (req_operation) {
case BLKIF_OP_READ:
blkif->st_rd_req++;
operation = READ;
@@ -1091,33 +1140,47 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
}

/* Check that the number of segments is sane. */
- nseg = req->u.rw.nr_segments;
+ nseg = req->operation == BLKIF_OP_INDIRECT ?
+ req->u.indirect.nr_segments : req->u.rw.nr_segments;

if (unlikely(nseg == 0 && operation != WRITE_FLUSH) ||
- unlikely(nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
+ unlikely((req->operation != BLKIF_OP_INDIRECT) &&
+ (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST)) ||
+ unlikely((req->operation == BLKIF_OP_INDIRECT) &&
+ (nseg > MAX_INDIRECT_SEGMENTS))) {
pr_debug(DRV_PFX "Bad number of segments in request (%d)\n",
nseg);
/* Haven't submitted any bio's yet. */
goto fail_response;
}

- preq.sector_number = req->u.rw.sector_number;
preq.nr_sects = 0;

pending_req->blkif = blkif;
pending_req->id = req->u.rw.id;
- pending_req->operation = req->operation;
+ pending_req->operation = req_operation;
pending_req->status = BLKIF_RSP_OKAY;
pending_req->nr_pages = nseg;

- for (i = 0; i < nseg; i++) {
- seg[i].nsec = req->u.rw.seg[i].last_sect -
- req->u.rw.seg[i].first_sect + 1;
- if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
- (req->u.rw.seg[i].last_sect < req->u.rw.seg[i].first_sect))
+ if (req->operation != BLKIF_OP_INDIRECT) {
+ preq.dev = req->u.rw.handle;
+ preq.sector_number = req->u.rw.sector_number;
+ for (i = 0; i < nseg; i++) {
+ pending_req->grefs[i] = req->u.rw.seg[i].gref;
+ seg[i].nsec = req->u.rw.seg[i].last_sect -
+ req->u.rw.seg[i].first_sect + 1;
+ seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
+ if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
+ (req->u.rw.seg[i].last_sect <
+ req->u.rw.seg[i].first_sect))
+ goto fail_response;
+ preq.nr_sects += seg[i].nsec;
+ }
+ } else {
+ preq.dev = req->u.indirect.handle;
+ preq.sector_number = req->u.indirect.sector_number;
+ if (xen_blkbk_parse_indirect(req, pending_req, seg, &preq))
goto fail_response;
- preq.nr_sects += seg[i].nsec;
-
}

if (xen_vbd_translate(&preq, blkif, operation) != 0) {
@@ -1154,7 +1217,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
* the hypercall to unmap the grants - that is all done in
* xen_blkbk_unmap.
*/
- if (xen_blkbk_map_seg(req, pending_req, seg, pages))
+ if (xen_blkbk_map_seg(pending_req, seg, pages))
goto fail_flush;

/*
@@ -1220,7 +1283,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
pending_req->nr_pages);
fail_response:
/* Haven't submitted any bio's yet. */
- make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
+ make_response(blkif, req->u.rw.id, req_operation, BLKIF_RSP_ERROR);
free_req(blkif, pending_req);
msleep(1); /* back off a bit */
return -EIO;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index e33fafa..1ac53da 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -50,6 +50,19 @@
__func__, __LINE__, ##args)


+/*
+ * This is the maximum number of segments that would be allowed in indirect
+ * requests. This value will also be passed to the frontend.
+ */
+#define MAX_INDIRECT_SEGMENTS 256
+
+#define SEGS_PER_INDIRECT_FRAME \
+ (PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
+#define MAX_INDIRECT_PAGES \
+ ((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+#define INDIRECT_PAGES(_segs) \
+ ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+
/* Not a real protocol. Used to generate ring structs which contain
* the elements common to all protocols only. This way we get a
* compiler-checkable way to use common struct elements, so we can
@@ -83,12 +96,31 @@ struct blkif_x86_32_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));

+struct blkif_x86_32_request_indirect {
+ uint8_t indirect_op;
+ uint16_t nr_segments;
+ uint64_t id;
+ blkif_sector_t sector_number;
+ blkif_vdev_t handle;
+ uint16_t _pad1;
+ grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
+ /*
+ * The maximum number of indirect segments (and pages) that will
+ * be used is determined by MAX_INDIRECT_SEGMENTS, this value
+ * is also exported to the guest (via xenstore
+ * feature-max-indirect-segments entry), so the frontend knows how
+ * many indirect segments the backend supports.
+ */
+ uint64_t _pad2; /* make it 64 byte aligned */
+} __attribute__((__packed__));
+
struct blkif_x86_32_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_32_request_rw rw;
struct blkif_x86_32_request_discard discard;
struct blkif_x86_32_request_other other;
+ struct blkif_x86_32_request_indirect indirect;
} u;
} __attribute__((__packed__));

@@ -127,12 +159,32 @@ struct blkif_x86_64_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));

+struct blkif_x86_64_request_indirect {
+ uint8_t indirect_op;
+ uint16_t nr_segments;
+ uint32_t _pad1; /* offsetof(blkif_..,u.indirect.id)==8 */
+ uint64_t id;
+ blkif_sector_t sector_number;
+ blkif_vdev_t handle;
+ uint16_t _pad2;
+ grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
+ /*
+ * The maximum number of indirect segments (and pages) that will
+ * be used is determined by MAX_INDIRECT_SEGMENTS, this value
+ * is also exported to the guest (via xenstore
+ * feature-max-indirect-segments entry), so the frontend knows how
+ * many indirect segments the backend supports.
+ */
+ uint32_t _pad3; /* make it 64 byte aligned */
+} __attribute__((__packed__));
+
struct blkif_x86_64_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_x86_64_request_rw rw;
struct blkif_x86_64_request_discard discard;
struct blkif_x86_64_request_other other;
+ struct blkif_x86_64_request_indirect indirect;
} u;
} __attribute__((__packed__));

@@ -266,6 +318,11 @@ struct xen_blkif {
wait_queue_head_t waiting_to_free;
};

+struct seg_buf {
+ unsigned long offset;
+ unsigned int nsec;
+};
+
/*
* Each outstanding request that we've passed to the lower device layers has a
* 'pending_req' allocated to it. Each buffer_head that completes decrements
@@ -280,9 +337,16 @@ struct pending_req {
unsigned short operation;
int status;
struct list_head free_list;
- struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- grant_handle_t grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct page *pages[MAX_INDIRECT_SEGMENTS];
+ struct persistent_gnt *persistent_gnts[MAX_INDIRECT_SEGMENTS];
+ grant_handle_t grant_handles[MAX_INDIRECT_SEGMENTS];
+ grant_ref_t grefs[MAX_INDIRECT_SEGMENTS];
+ /* Indirect descriptors */
+ struct persistent_gnt *indirect_persistent_gnts[MAX_INDIRECT_PAGES];
+ struct page *indirect_pages[MAX_INDIRECT_PAGES];
+ grant_handle_t indirect_handles[MAX_INDIRECT_PAGES];
+ struct seg_buf seg[MAX_INDIRECT_SEGMENTS];
+ struct bio *biolist[MAX_INDIRECT_SEGMENTS];
};


@@ -321,7 +385,7 @@ struct xenbus_device *xen_blkbk_xenbus(struct backend_info *be);
static inline void blkif_get_x86_32_req(struct blkif_request *dst,
struct blkif_x86_32_request *src)
{
- int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
@@ -344,6 +408,18 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
+ case BLKIF_OP_INDIRECT:
+ dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
+ dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
+ dst->u.indirect.handle = src->u.indirect.handle;
+ dst->u.indirect.id = src->u.indirect.id;
+ dst->u.indirect.sector_number = src->u.indirect.sector_number;
+ barrier();
+ j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
+ for (i = 0; i < j; i++)
+ dst->u.indirect.indirect_grefs[i] =
+ src->u.indirect.indirect_grefs[i];
+ break;
default:
/*
* Don't know how to translate this op. Only get the
@@ -357,7 +433,7 @@ static inline void blkif_get_x86_32_req(struct blkif_request *dst,
static inline void blkif_get_x86_64_req(struct blkif_request *dst,
struct blkif_x86_64_request *src)
{
- int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ int i, n = BLKIF_MAX_SEGMENTS_PER_REQUEST, j;
dst->operation = src->operation;
switch (src->operation) {
case BLKIF_OP_READ:
@@ -380,6 +456,18 @@ static inline void blkif_get_x86_64_req(struct blkif_request *dst,
dst->u.discard.sector_number = src->u.discard.sector_number;
dst->u.discard.nr_sectors = src->u.discard.nr_sectors;
break;
+ case BLKIF_OP_INDIRECT:
+ dst->u.indirect.indirect_op = src->u.indirect.indirect_op;
+ dst->u.indirect.nr_segments = src->u.indirect.nr_segments;
+ dst->u.indirect.handle = src->u.indirect.handle;
+ dst->u.indirect.id = src->u.indirect.id;
+ dst->u.indirect.sector_number = src->u.indirect.sector_number;
+ barrier();
+ j = min(MAX_INDIRECT_PAGES, INDIRECT_PAGES(dst->u.indirect.nr_segments));
+ for (i = 0; i < j; i++)
+ dst->u.indirect.indirect_grefs[i] =
+ src->u.indirect.indirect_grefs[i];
+ break;
default:
/*
* Don't know how to translate this op. Only get the
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 1f1ade6..d404e10 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -107,6 +107,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
struct xen_blkif *blkif;
int i;

+ BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
+
blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
if (!blkif)
return ERR_PTR(-ENOMEM);
@@ -710,6 +712,12 @@ again:
goto abort;
}

+ err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
+ MAX_INDIRECT_SEGMENTS);
+ if (err)
+ xenbus_dev_fatal(dev, err, "writing %s/feature-max-indirect-segments",
+ dev->nodename);
+
err = xenbus_printf(xbt, dev->nodename, "sectors", "%llu",
(unsigned long long)vbd_sz(&be->blkif->vbd));
if (err) {
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index a894f88..5da172e 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -74,12 +74,27 @@ struct grant {
struct blk_shadow {
struct blkif_request req;
struct request *request;
- struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct grant **grants_used;
+ struct grant **indirect_grants;
+};
+
+struct split_bio {
+ struct bio *bio;
+ atomic_t pending;
+ int err;
};

static DEFINE_MUTEX(blkfront_mutex);
static const struct block_device_operations xlvbd_block_fops;

+/*
+ * Maximum number of segments in indirect requests, the actual value used by
+ * the frontend driver is the minimum of this value and the value provided
+ * by the backend driver.
+ */
+
+static unsigned int xen_blkif_max_segments = 32;
+
#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE)

/*
@@ -98,7 +113,7 @@ struct blkfront_info
enum blkif_state connected;
int ring_ref;
struct blkif_front_ring ring;
- struct scatterlist sg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct scatterlist *sg;
unsigned int evtchn, irq;
struct request_queue *rq;
struct work_struct work;
@@ -114,6 +129,7 @@ struct blkfront_info
unsigned int discard_granularity;
unsigned int discard_alignment;
unsigned int feature_persistent:1;
+ unsigned int max_indirect_segments;
int is_ready;
};

@@ -142,6 +158,13 @@ static DEFINE_SPINLOCK(minor_lock);

#define DEV_NAME "xvd" /* name in /dev */

+#define SEGS_PER_INDIRECT_FRAME \
+ (PAGE_SIZE/sizeof(struct blkif_request_segment_aligned))
+#define INDIRECT_GREFS(_segs) \
+ ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+
+static int blkfront_setup_indirect(struct blkfront_info *info);
+
static int get_id_from_freelist(struct blkfront_info *info)
{
unsigned long free = info->shadow_free;
@@ -358,7 +381,8 @@ static int blkif_queue_request(struct request *req)
struct blkif_request *ring_req;
unsigned long id;
unsigned int fsect, lsect;
- int i, ref;
+ int i, ref, n;
+ struct blkif_request_segment_aligned *segments = NULL;

/*
* Used to store if we are able to queue the request by just using
@@ -369,21 +393,27 @@ static int blkif_queue_request(struct request *req)
grant_ref_t gref_head;
struct grant *gnt_list_entry = NULL;
struct scatterlist *sg;
+ int nseg, max_grefs;

if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return 1;

- /* Check if we have enought grants to allocate a requests */
- if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ max_grefs = info->max_indirect_segments ?
+ info->max_indirect_segments +
+ INDIRECT_GREFS(info->max_indirect_segments) :
+ BLKIF_MAX_SEGMENTS_PER_REQUEST;
+
+ /* Check if we have enough grants to allocate a requests */
+ if (info->persistent_gnts_c < max_grefs) {
new_persistent_gnts = 1;
if (gnttab_alloc_grant_references(
- BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
+ max_grefs - info->persistent_gnts_c,
&gref_head) < 0) {
gnttab_request_free_callback(
&info->callback,
blkif_restart_queue_callback,
info,
- BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ max_grefs);
return 1;
}
} else
@@ -394,42 +424,67 @@ static int blkif_queue_request(struct request *req)
id = get_id_from_freelist(info);
info->shadow[id].request = req;

- ring_req->u.rw.id = id;
- ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
- ring_req->u.rw.handle = info->handle;
-
- ring_req->operation = rq_data_dir(req) ?
- BLKIF_OP_WRITE : BLKIF_OP_READ;
-
- if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
- /*
- * Ideally we can do an unordered flush-to-disk. In case the
- * backend onlysupports barriers, use that. A barrier request
- * a superset of FUA, so we can implement it the same
- * way. (It's also a FLUSH+FUA, since it is
- * guaranteed ordered WRT previous writes.)
- */
- ring_req->operation = info->flush_op;
- }
-
if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
- /* id, sector_number and handle are set above. */
ring_req->operation = BLKIF_OP_DISCARD;
ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
+ ring_req->u.discard.id = id;
+ ring_req->u.discard.sector_number = (blkif_sector_t)blk_rq_pos(req);
if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard)
ring_req->u.discard.flag = BLKIF_DISCARD_SECURE;
else
ring_req->u.discard.flag = 0;
} else {
- ring_req->u.rw.nr_segments = blk_rq_map_sg(req->q, req,
- info->sg);
- BUG_ON(ring_req->u.rw.nr_segments >
- BLKIF_MAX_SEGMENTS_PER_REQUEST);
-
- for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
+ BUG_ON(info->max_indirect_segments == 0 &&
+ req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ BUG_ON(info->max_indirect_segments &&
+ req->nr_phys_segments > info->max_indirect_segments);
+ nseg = blk_rq_map_sg(req->q, req, info->sg);
+ ring_req->u.rw.id = id;
+ if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ /*
+ * The indirect operation can only be a BLKIF_OP_READ or
+ * BLKIF_OP_WRITE
+ */
+ BUG_ON(req->cmd_flags & (REQ_FLUSH | REQ_FUA));
+ ring_req->operation = BLKIF_OP_INDIRECT;
+ ring_req->u.indirect.indirect_op = rq_data_dir(req) ?
+ BLKIF_OP_WRITE : BLKIF_OP_READ;
+ ring_req->u.indirect.sector_number = (blkif_sector_t)blk_rq_pos(req);
+ ring_req->u.indirect.handle = info->handle;
+ ring_req->u.indirect.nr_segments = nseg;
+ } else {
+ ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
+ ring_req->u.rw.handle = info->handle;
+ ring_req->operation = rq_data_dir(req) ?
+ BLKIF_OP_WRITE : BLKIF_OP_READ;
+ if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
+ /*
+ * Ideally we can do an unordered flush-to-disk. In case the
+ * backend onlysupports barriers, use that. A barrier request
+ * a superset of FUA, so we can implement it the same
+ * way. (It's also a FLUSH+FUA, since it is
+ * guaranteed ordered WRT previous writes.)
+ */
+ ring_req->operation = info->flush_op;
+ }
+ ring_req->u.rw.nr_segments = nseg;
+ }
+ for_each_sg(info->sg, sg, nseg, i) {
fsect = sg->offset >> 9;
lsect = fsect + (sg->length >> 9) - 1;

+ if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
+ (i % SEGS_PER_INDIRECT_FRAME == 0)) {
+ if (segments)
+ kunmap_atomic(segments);
+
+ n = i / SEGS_PER_INDIRECT_FRAME;
+ gnt_list_entry = get_grant(&gref_head, info);
+ info->shadow[id].indirect_grants[n] = gnt_list_entry;
+ segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
+ ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
+ }
+
gnt_list_entry = get_grant(&gref_head, info);
ref = gnt_list_entry->gref;

@@ -441,8 +496,7 @@ static int blkif_queue_request(struct request *req)

BUG_ON(sg->offset + sg->length > PAGE_SIZE);

- shared_data = kmap_atomic(
- pfn_to_page(gnt_list_entry->pfn));
+ shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
bvec_data = kmap_atomic(sg_page(sg));

/*
@@ -461,13 +515,23 @@ static int blkif_queue_request(struct request *req)
kunmap_atomic(bvec_data);
kunmap_atomic(shared_data);
}
-
- ring_req->u.rw.seg[i] =
- (struct blkif_request_segment) {
- .gref = ref,
- .first_sect = fsect,
- .last_sect = lsect };
+ if (ring_req->operation != BLKIF_OP_INDIRECT) {
+ ring_req->u.rw.seg[i] =
+ (struct blkif_request_segment) {
+ .gref = ref,
+ .first_sect = fsect,
+ .last_sect = lsect };
+ } else {
+ n = i % SEGS_PER_INDIRECT_FRAME;
+ segments[n] =
+ (struct blkif_request_segment_aligned) {
+ .gref = ref,
+ .first_sect = fsect,
+ .last_sect = lsect };
+ }
}
+ if (segments)
+ kunmap_atomic(segments);
}

info->ring.req_prod_pvt++;
@@ -542,7 +606,8 @@ wait:
flush_requests(info);
}

-static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
+static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
+ unsigned int segments)
{
struct request_queue *rq;
struct blkfront_info *info = gd->private_data;
@@ -571,7 +636,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
blk_queue_max_segment_size(rq, PAGE_SIZE);

/* Ensure a merged request will fit in a single I/O ring slot. */
- blk_queue_max_segments(rq, BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ blk_queue_max_segments(rq, segments);

/* Make sure buffer addresses are sector-aligned. */
blk_queue_dma_alignment(rq, 511);
@@ -588,13 +653,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
static void xlvbd_flush(struct blkfront_info *info)
{
blk_queue_flush(info->rq, info->feature_flush);
- printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
+ printk(KERN_INFO "blkfront: %s: %s: %s %s %s %s %s\n",
info->gd->disk_name,
info->flush_op == BLKIF_OP_WRITE_BARRIER ?
"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
"flush diskcache" : "barrier or flush"),
- info->feature_flush ? "enabled" : "disabled",
- info->feature_persistent ? "using persistent grants" : "");
+ info->feature_flush ? "enabled;" : "disabled;",
+ "persistent grants:",
+ info->feature_persistent ? "enabled;" : "disabled;",
+ "indirect descriptors:",
+ info->max_indirect_segments ? "enabled;" : "disabled;");
}

static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
@@ -734,7 +802,9 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
gd->driverfs_dev = &(info->xbdev->dev);
set_capacity(gd, capacity);

- if (xlvbd_init_blk_queue(gd, sector_size)) {
+ if (xlvbd_init_blk_queue(gd, sector_size,
+ info->max_indirect_segments ? :
+ BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
del_gendisk(gd);
goto release;
}
@@ -818,6 +888,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
{
struct grant *persistent_gnt;
struct grant *n;
+ int i, j, segs;

/* Prevent new requests being issued until we fix things up. */
spin_lock_irq(&info->io_lock);
@@ -843,6 +914,47 @@ static void blkif_free(struct blkfront_info *info, int suspend)
}
BUG_ON(info->persistent_gnts_c != 0);

+ kfree(info->sg);
+ info->sg = NULL;
+ for (i = 0; i < BLK_RING_SIZE; i++) {
+ /*
+ * Clear persistent grants present in requests already
+ * on the shared ring
+ */
+ if (!info->shadow[i].request)
+ goto free_shadow;
+
+ segs = info->shadow[i].req.operation == BLKIF_OP_INDIRECT ?
+ info->shadow[i].req.u.indirect.nr_segments :
+ info->shadow[i].req.u.rw.nr_segments;
+ for (j = 0; j < segs; j++) {
+ persistent_gnt = info->shadow[i].grants_used[j];
+ gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+ __free_page(pfn_to_page(persistent_gnt->pfn));
+ kfree(persistent_gnt);
+ }
+
+ if (info->shadow[i].req.operation != BLKIF_OP_INDIRECT)
+ /*
+ * If this is not an indirect operation don't try to
+ * free indirect segments
+ */
+ goto free_shadow;
+
+ for (j = 0; j < INDIRECT_GREFS(segs); j++) {
+ persistent_gnt = info->shadow[i].indirect_grants[j];
+ gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+ __free_page(pfn_to_page(persistent_gnt->pfn));
+ kfree(persistent_gnt);
+ }
+
+free_shadow:
+ kfree(info->shadow[i].grants_used);
+ info->shadow[i].grants_used = NULL;
+ kfree(info->shadow[i].indirect_grants);
+ info->shadow[i].indirect_grants = NULL;
+ }
+
/* No more gnttab callback work. */
gnttab_cancel_free_callback(&info->callback);
spin_unlock_irq(&info->io_lock);
@@ -873,6 +985,10 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
char *bvec_data;
void *shared_data;
unsigned int offset = 0;
+ int nseg;
+
+ nseg = s->req.operation == BLKIF_OP_INDIRECT ?
+ s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;

if (bret->operation == BLKIF_OP_READ) {
/*
@@ -885,7 +1001,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
if (bvec->bv_offset < offset)
i++;
- BUG_ON(i >= s->req.u.rw.nr_segments);
+ BUG_ON(i >= nseg);
shared_data = kmap_atomic(
pfn_to_page(s->grants_used[i]->pfn));
bvec_data = bvec_kmap_irq(bvec, &flags);
@@ -897,10 +1013,16 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
}
}
/* Add the persistent grant into the list of free grants */
- for (i = 0; i < s->req.u.rw.nr_segments; i++) {
+ for (i = 0; i < nseg; i++) {
list_add(&s->grants_used[i]->node, &info->persistent_gnts);
info->persistent_gnts_c++;
}
+ if (s->req.operation == BLKIF_OP_INDIRECT) {
+ for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
+ list_add(&s->indirect_grants[i]->node, &info->persistent_gnts);
+ info->persistent_gnts_c++;
+ }
+ }
}

static irqreturn_t blkif_interrupt(int irq, void *dev_id)
@@ -1034,14 +1156,6 @@ static int setup_blkring(struct xenbus_device *dev,
SHARED_RING_INIT(sring);
FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE);

- sg_init_table(info->sg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
-
- /* Allocate memory for grants */
- err = fill_grant_buffer(info, BLK_RING_SIZE *
- BLKIF_MAX_SEGMENTS_PER_REQUEST);
- if (err)
- goto fail;
-
err = xenbus_grant_ring(dev, virt_to_mfn(info->ring.sring));
if (err < 0) {
free_page((unsigned long)sring);
@@ -1223,13 +1337,84 @@ static int blkfront_probe(struct xenbus_device *dev,
return 0;
}

+/*
+ * This is a clone of md_trim_bio, used to split a bio into smaller ones
+ */
+static void trim_bio(struct bio *bio, int offset, int size)
+{
+ /* 'bio' is a cloned bio which we need to trim to match
+ * the given offset and size.
+ * This requires adjusting bi_sector, bi_size, and bi_io_vec
+ */
+ int i;
+ struct bio_vec *bvec;
+ int sofar = 0;
+
+ size <<= 9;
+ if (offset == 0 && size == bio->bi_size)
+ return;
+
+ bio->bi_sector += offset;
+ bio->bi_size = size;
+ offset <<= 9;
+ clear_bit(BIO_SEG_VALID, &bio->bi_flags);
+
+ while (bio->bi_idx < bio->bi_vcnt &&
+ bio->bi_io_vec[bio->bi_idx].bv_len <= offset) {
+ /* remove this whole bio_vec */
+ offset -= bio->bi_io_vec[bio->bi_idx].bv_len;
+ bio->bi_idx++;
+ }
+ if (bio->bi_idx < bio->bi_vcnt) {
+ bio->bi_io_vec[bio->bi_idx].bv_offset += offset;
+ bio->bi_io_vec[bio->bi_idx].bv_len -= offset;
+ }
+ /* avoid any complications with bi_idx being non-zero*/
+ if (bio->bi_idx) {
+ memmove(bio->bi_io_vec, bio->bi_io_vec+bio->bi_idx,
+ (bio->bi_vcnt - bio->bi_idx) * sizeof(struct bio_vec));
+ bio->bi_vcnt -= bio->bi_idx;
+ bio->bi_idx = 0;
+ }
+ /* Make sure vcnt and last bv are not too big */
+ bio_for_each_segment(bvec, bio, i) {
+ if (sofar + bvec->bv_len > size)
+ bvec->bv_len = size - sofar;
+ if (bvec->bv_len == 0) {
+ bio->bi_vcnt = i;
+ break;
+ }
+ sofar += bvec->bv_len;
+ }
+}
+
+static void split_bio_end(struct bio *bio, int error)
+{
+ struct split_bio *split_bio = bio->bi_private;
+
+ if (error)
+ split_bio->err = error;
+
+ if (atomic_dec_and_test(&split_bio->pending)) {
+ split_bio->bio->bi_phys_segments = 0;
+ bio_endio(split_bio->bio, split_bio->err);
+ kfree(split_bio);
+ }
+ bio_put(bio);
+}

static int blkif_recover(struct blkfront_info *info)
{
int i;
- struct blkif_request *req;
+ struct request *req, *n;
struct blk_shadow *copy;
- int j;
+ int rc;
+ struct bio *bio, *cloned_bio;
+ struct bio_list bio_list, merge_bio;
+ unsigned int segs, offset;
+ int pending, size;
+ struct split_bio *split_bio;
+ struct list_head requests;

/* Stage 1: Make a safe copy of the shadow state. */
copy = kmemdup(info->shadow, sizeof(info->shadow),
@@ -1244,36 +1429,64 @@ static int blkif_recover(struct blkfront_info *info)
info->shadow_free = info->ring.req_prod_pvt;
info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff;

- /* Stage 3: Find pending requests and requeue them. */
+ rc = blkfront_setup_indirect(info);
+ if (rc) {
+ kfree(copy);
+ return rc;
+ }
+
+ segs = info->max_indirect_segments ? : BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ blk_queue_max_segments(info->rq, segs);
+ bio_list_init(&bio_list);
+ INIT_LIST_HEAD(&requests);
for (i = 0; i < BLK_RING_SIZE; i++) {
/* Not in use? */
if (!copy[i].request)
continue;

- /* Grab a request slot and copy shadow state into it. */
- req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
- *req = copy[i].req;
-
- /* We get a new request id, and must reset the shadow state. */
- req->u.rw.id = get_id_from_freelist(info);
- memcpy(&info->shadow[req->u.rw.id], &copy[i], sizeof(copy[i]));
-
- if (req->operation != BLKIF_OP_DISCARD) {
- /* Rewrite any grant references invalidated by susp/resume. */
- for (j = 0; j < req->u.rw.nr_segments; j++)
- gnttab_grant_foreign_access_ref(
- req->u.rw.seg[j].gref,
- info->xbdev->otherend_id,
- pfn_to_mfn(copy[i].grants_used[j]->pfn),
- 0);
+ /*
+ * Get the bios in the request so we can re-queue them.
+ */
+ if (copy[i].request->cmd_flags &
+ (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+ /*
+ * Flush operations don't contain bios, so
+ * we need to requeue the whole request
+ */
+ list_add(&copy[i].request->queuelist, &requests);
+ continue;
}
- info->shadow[req->u.rw.id].req = *req;
-
- info->ring.req_prod_pvt++;
+ merge_bio.head = copy[i].request->bio;
+ merge_bio.tail = copy[i].request->biotail;
+ bio_list_merge(&bio_list, &merge_bio);
+ copy[i].request->bio = NULL;
+ blk_put_request(copy[i].request);
}

kfree(copy);

+ /*
+ * Empty the queue, this is important because we might have
+ * requests in the queue with more segments than what we
+ * can handle now.
+ */
+ spin_lock_irq(&info->io_lock);
+ while ((req = blk_fetch_request(info->rq)) != NULL) {
+ if (req->cmd_flags &
+ (REQ_FLUSH | REQ_FUA | REQ_DISCARD | REQ_SECURE)) {
+ list_add(&req->queuelist, &requests);
+ continue;
+ }
+ merge_bio.head = req->bio;
+ merge_bio.tail = req->biotail;
+ bio_list_merge(&bio_list, &merge_bio);
+ req->bio = NULL;
+ if (req->cmd_flags & (REQ_FLUSH | REQ_FUA))
+ pr_alert("diskcache flush request found!\n");
+ __blk_put_request(info->rq, req);
+ }
+ spin_unlock_irq(&info->io_lock);
+
xenbus_switch_state(info->xbdev, XenbusStateConnected);

spin_lock_irq(&info->io_lock);
@@ -1281,14 +1494,50 @@ static int blkif_recover(struct blkfront_info *info)
/* Now safe for us to use the shared ring */
info->connected = BLKIF_STATE_CONNECTED;

- /* Send off requeued requests */
- flush_requests(info);
-
/* Kick any other new requests queued since we resumed */
kick_pending_request_queues(info);

+ list_for_each_entry_safe(req, n, &requests, queuelist) {
+ /* Requeue pending requests (flush or discard) */
+ list_del_init(&req->queuelist);
+ BUG_ON(req->nr_phys_segments > segs);
+ blk_requeue_request(info->rq, req);
+ }
spin_unlock_irq(&info->io_lock);

+ while ((bio = bio_list_pop(&bio_list)) != NULL) {
+ /* Traverse the list of pending bios and re-queue them */
+ if (bio_segments(bio) > segs) {
+ /*
+ * This bio has more segments than what we can
+ * handle, we have to split it.
+ */
+ pending = (bio_segments(bio) + segs - 1) / segs;
+ split_bio = kzalloc(sizeof(*split_bio), GFP_NOIO);
+ BUG_ON(split_bio == NULL);
+ atomic_set(&split_bio->pending, pending);
+ split_bio->bio = bio;
+ for (i = 0; i < pending; i++) {
+ offset = (i * segs * PAGE_SIZE) >> 9;
+ size = min((unsigned int)(segs * PAGE_SIZE) >> 9,
+ (unsigned int)(bio->bi_size >> 9) - offset);
+ cloned_bio = bio_clone(bio, GFP_NOIO);
+ BUG_ON(cloned_bio == NULL);
+ trim_bio(cloned_bio, offset, size);
+ cloned_bio->bi_private = split_bio;
+ cloned_bio->bi_end_io = split_bio_end;
+ submit_bio(cloned_bio->bi_rw, cloned_bio);
+ }
+ /*
+ * Now we have to wait for all those smaller bios to
+ * end, so we can also end the "parent" bio.
+ */
+ continue;
+ }
+ /* We don't need to split this bio */
+ submit_bio(bio->bi_rw, bio);
+ }
+
return 0;
}

@@ -1308,8 +1557,12 @@ static int blkfront_resume(struct xenbus_device *dev)
blkif_free(info, info->connected == BLKIF_STATE_CONNECTED);

err = talk_to_blkback(dev, info);
- if (info->connected == BLKIF_STATE_SUSPENDED && !err)
- err = blkif_recover(info);
+
+ /*
+ * We have to wait for the backend to switch to
+ * connected state, since we want to read which
+ * features it supports.
+ */

return err;
}
@@ -1387,6 +1640,62 @@ static void blkfront_setup_discard(struct blkfront_info *info)
kfree(type);
}

+static int blkfront_setup_indirect(struct blkfront_info *info)
+{
+ unsigned int indirect_segments, segs;
+ int err, i;
+
+ err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
+ "feature-max-indirect-segments", "%u", &indirect_segments,
+ NULL);
+ if (err) {
+ info->max_indirect_segments = 0;
+ segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ } else {
+ info->max_indirect_segments = min(indirect_segments,
+ xen_blkif_max_segments);
+ segs = info->max_indirect_segments;
+ }
+ info->sg = kzalloc(sizeof(info->sg[0]) * segs, GFP_KERNEL);
+ if (info->sg == NULL)
+ goto out_of_memory;
+ sg_init_table(info->sg, segs);
+
+ err = fill_grant_buffer(info,
+ (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE);
+ if (err)
+ goto out_of_memory;
+
+ for (i = 0; i < BLK_RING_SIZE; i++) {
+ info->shadow[i].grants_used = kzalloc(
+ sizeof(info->shadow[i].grants_used[0]) * segs,
+ GFP_NOIO);
+ if (info->max_indirect_segments)
+ info->shadow[i].indirect_grants = kzalloc(
+ sizeof(info->shadow[i].indirect_grants[0]) *
+ INDIRECT_GREFS(segs),
+ GFP_NOIO);
+ if ((info->shadow[i].grants_used == NULL) ||
+ (info->max_indirect_segments &&
+ (info->shadow[i].indirect_grants == NULL)))
+ goto out_of_memory;
+ }
+
+
+ return 0;
+
+out_of_memory:
+ kfree(info->sg);
+ info->sg = NULL;
+ for (i = 0; i < BLK_RING_SIZE; i++) {
+ kfree(info->shadow[i].grants_used);
+ info->shadow[i].grants_used = NULL;
+ kfree(info->shadow[i].indirect_grants);
+ info->shadow[i].indirect_grants = NULL;
+ }
+ return -ENOMEM;
+}
+
/*
* Invoked when the backend is finally 'ready' (and has told produced
* the details about the physical device - #sectors, size, etc).
@@ -1414,8 +1723,15 @@ static void blkfront_connect(struct blkfront_info *info)
set_capacity(info->gd, sectors);
revalidate_disk(info->gd);

- /* fall through */
+ return;
case BLKIF_STATE_SUSPENDED:
+ /*
+ * If we are recovering from suspension, we need to wait
+ * for the backend to announce it's features before
+ * reconnecting, at least we need to know if the backend
+ * supports indirect descriptors, and how many.
+ */
+ blkif_recover(info);
return;

default:
@@ -1483,6 +1799,13 @@ static void blkfront_connect(struct blkfront_info *info)
else
info->feature_persistent = persistent;

+ err = blkfront_setup_indirect(info);
+ if (err) {
+ xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
+ info->xbdev->otherend);
+ return;
+ }
+
err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
if (err) {
xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
index ffd4652..7901e58 100644
--- a/include/xen/interface/io/blkif.h
+++ b/include/xen/interface/io/blkif.h
@@ -103,12 +103,41 @@ typedef uint64_t blkif_sector_t;
#define BLKIF_OP_DISCARD 5

/*
+ * Recognized if "feature-max-indirect-segments" in present in the backend
+ * xenbus info. The "feature-max-indirect-segments" node contains the maximum
+ * number of segments allowed by the backend. If the node is present,
+ * the frontend might use blkif_request_indirect in order to issue requests
+ * with more than BLKIF_MAX_SEGMENTS_PER_REQUEST. The maximum number
+ * of indirect segments is fixed by the backend, but the frontend can issue
+ * requests with any number of indirect segments as long as it's inferior
+ * than the number provided by the backend. The indirect_grefs field in
+ * blkif_request_indirect should be filled by the frontend with the
+ * grant references of the pages that are holding the indirect segments.
+ * This pages are filled with an array of blkif_request_segment_aligned
+ * that hold the information about the segments.
+ *
+ * If a backend does not recognize BLKIF_OP_INDIRECT, it should *not*
+ * create the "feature-max-indirect-segments" node!
+ */
+#define BLKIF_OP_INDIRECT 6
+
+/*
* Maximum scatter/gather segments per request.
* This is carefully chosen so that sizeof(struct blkif_ring) <= PAGE_SIZE.
* NB. This could be 12 if the ring indexes weren't stored in the same page.
*/
#define BLKIF_MAX_SEGMENTS_PER_REQUEST 11

+#define BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST 8
+
+struct blkif_request_segment_aligned {
+ grant_ref_t gref; /* reference to I/O buffer frame */
+ /* @first_sect: first sector in frame to transfer (inclusive). */
+ /* @last_sect: last sector in frame to transfer (inclusive). */
+ uint8_t first_sect, last_sect;
+ uint16_t _pad; /* padding to make it 8 bytes, so it's cache-aligned */
+} __attribute__((__packed__));
+
struct blkif_request_rw {
uint8_t nr_segments; /* number of segments */
blkif_vdev_t handle; /* only for read/write requests */
@@ -147,12 +176,31 @@ struct blkif_request_other {
uint64_t id; /* private guest value, echoed in resp */
} __attribute__((__packed__));

+struct blkif_request_indirect {
+ uint8_t indirect_op;
+ uint16_t nr_segments;
+#ifdef CONFIG_X86_64
+ uint32_t _pad1; /* offsetof(blkif_...,u.indirect.id) == 8 */
+#endif
+ uint64_t id;
+ blkif_sector_t sector_number;
+ blkif_vdev_t handle;
+ uint16_t _pad2;
+ grant_ref_t indirect_grefs[BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST];
+#ifdef CONFIG_X86_64
+ uint32_t _pad3; /* make it 64 byte aligned */
+#else
+ uint64_t _pad3; /* make it 64 byte aligned */
+#endif
+} __attribute__((__packed__));
+
struct blkif_request {
uint8_t operation; /* BLKIF_OP_??? */
union {
struct blkif_request_rw rw;
struct blkif_request_discard discard;
struct blkif_request_other other;
+ struct blkif_request_indirect indirect;
} u;
} __attribute__((__packed__));

--
1.7.7.5 (Apple Git-26)

2013-04-17 18:21:45

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 5/7] xen-blkback: make the queue of free requests per backend

Remove the last dependency from blkbk by moving the list of free
requests to blkif. This change reduces the contention on the list of
available requests.

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
---
Changes since RFC:
* Replace kzalloc with kcalloc.
Changes since v1:
* Changed xen_blkif_reqs to a define since it doesn't make sense to
keep it.
* Improve code in xen_blkif_free to make sure we don't have any
in-flight request when freeing pending_reqs
---
drivers/block/xen-blkback/blkback.c | 123 +++++-----------------------------
drivers/block/xen-blkback/common.h | 30 +++++++++
drivers/block/xen-blkback/xenbus.c | 26 +++++++
3 files changed, 74 insertions(+), 105 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index ae7dc92..90a5755 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -50,20 +50,6 @@
#include "common.h"

/*
- * These are rather arbitrary. They are fairly large because adjacent requests
- * pulled from a communication ring are quite likely to end up being part of
- * the same scatter/gather request at the disc.
- *
- * ** TRY INCREASING 'xen_blkif_reqs' IF WRITE SPEEDS SEEM TOO LOW **
- *
- * This will increase the chances of being able to write whole tracks.
- * 64 should be enough to keep us competitive with Linux.
- */
-static int xen_blkif_reqs = 64;
-module_param_named(reqs, xen_blkif_reqs, int, 0);
-MODULE_PARM_DESC(reqs, "Number of blkback requests to allocate");
-
-/*
* Maximum number of unused free pages to keep in the internal buffer.
* Setting this to a value too low will reduce memory used in each backend,
* but can have a performance penalty.
@@ -112,53 +98,11 @@ MODULE_PARM_DESC(max_persistent_grants,
static unsigned int log_stats;
module_param(log_stats, int, 0644);

-/*
- * Each outstanding request that we've passed to the lower device layers has a
- * 'pending_req' allocated to it. Each buffer_head that completes decrements
- * the pendcnt towards zero. When it hits zero, the specified domain has a
- * response queued for it, with the saved 'id' passed back.
- */
-struct pending_req {
- struct xen_blkif *blkif;
- u64 id;
- int nr_pages;
- atomic_t pendcnt;
- unsigned short operation;
- int status;
- struct list_head free_list;
- struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- grant_handle_t grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
-};
-
#define BLKBACK_INVALID_HANDLE (~0)

/* Number of free pages to remove on each call to free_xenballooned_pages */
#define NUM_BATCH_FREE_PAGES 10

-struct xen_blkbk {
- struct pending_req *pending_reqs;
- /* List of all 'pending_req' available */
- struct list_head pending_free;
- /* And its spinlock. */
- spinlock_t pending_free_lock;
- wait_queue_head_t pending_free_wq;
-};
-
-static struct xen_blkbk *blkbk;
-
-/*
- * Little helpful macro to figure out the index and virtual address of the
- * pending_pages[..]. For each 'pending_req' we have have up to
- * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
- * 10 and would index in the pending_pages[..].
- */
-static inline int vaddr_pagenr(struct pending_req *req, int seg)
-{
- return (req - blkbk->pending_reqs) *
- BLKIF_MAX_SEGMENTS_PER_REQUEST + seg;
-}
-
static inline int get_free_page(struct xen_blkif *blkif, struct page **page)
{
unsigned long flags;
@@ -485,18 +429,18 @@ finished:
/*
* Retrieve from the 'pending_reqs' a free pending_req structure to be used.
*/
-static struct pending_req *alloc_req(void)
+static struct pending_req *alloc_req(struct xen_blkif *blkif)
{
struct pending_req *req = NULL;
unsigned long flags;

- spin_lock_irqsave(&blkbk->pending_free_lock, flags);
- if (!list_empty(&blkbk->pending_free)) {
- req = list_entry(blkbk->pending_free.next, struct pending_req,
+ spin_lock_irqsave(&blkif->pending_free_lock, flags);
+ if (!list_empty(&blkif->pending_free)) {
+ req = list_entry(blkif->pending_free.next, struct pending_req,
free_list);
list_del(&req->free_list);
}
- spin_unlock_irqrestore(&blkbk->pending_free_lock, flags);
+ spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
return req;
}

@@ -504,17 +448,17 @@ static struct pending_req *alloc_req(void)
* Return the 'pending_req' structure back to the freepool. We also
* wake up the thread if it was waiting for a free page.
*/
-static void free_req(struct pending_req *req)
+static void free_req(struct xen_blkif *blkif, struct pending_req *req)
{
unsigned long flags;
int was_empty;

- spin_lock_irqsave(&blkbk->pending_free_lock, flags);
- was_empty = list_empty(&blkbk->pending_free);
- list_add(&req->free_list, &blkbk->pending_free);
- spin_unlock_irqrestore(&blkbk->pending_free_lock, flags);
+ spin_lock_irqsave(&blkif->pending_free_lock, flags);
+ was_empty = list_empty(&blkif->pending_free);
+ list_add(&req->free_list, &blkif->pending_free);
+ spin_unlock_irqrestore(&blkif->pending_free_lock, flags);
if (was_empty)
- wake_up(&blkbk->pending_free_wq);
+ wake_up(&blkif->pending_free_wq);
}

/*
@@ -649,8 +593,8 @@ int xen_blkif_schedule(void *arg)
if (timeout == 0)
goto purge_gnt_list;
timeout = wait_event_interruptible_timeout(
- blkbk->pending_free_wq,
- !list_empty(&blkbk->pending_free) ||
+ blkif->pending_free_wq,
+ !list_empty(&blkif->pending_free) ||
kthread_should_stop(),
timeout);
if (timeout == 0)
@@ -907,7 +851,7 @@ static int dispatch_other_io(struct xen_blkif *blkif,
struct blkif_request *req,
struct pending_req *pending_req)
{
- free_req(pending_req);
+ free_req(blkif, pending_req);
make_response(blkif, req->u.other.id, req->operation,
BLKIF_RSP_EOPNOTSUPP);
return -EIO;
@@ -967,7 +911,7 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
if (atomic_read(&pending_req->blkif->drain))
complete(&pending_req->blkif->drain_complete);
}
- free_req(pending_req);
+ free_req(pending_req->blkif, pending_req);
}
}

@@ -1010,7 +954,7 @@ __do_block_io_op(struct xen_blkif *blkif)
break;
}

- pending_req = alloc_req();
+ pending_req = alloc_req(blkif);
if (NULL == pending_req) {
blkif->st_oo_req++;
more_to_do = 1;
@@ -1044,7 +988,7 @@ __do_block_io_op(struct xen_blkif *blkif)
goto done;
break;
case BLKIF_OP_DISCARD:
- free_req(pending_req);
+ free_req(blkif, pending_req);
if (dispatch_discard_io(blkif, &req))
goto done;
break;
@@ -1246,7 +1190,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
fail_response:
/* Haven't submitted any bio's yet. */
make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
- free_req(pending_req);
+ free_req(blkif, pending_req);
msleep(1); /* back off a bit */
return -EIO;

@@ -1303,51 +1247,20 @@ static void make_response(struct xen_blkif *blkif, u64 id,

static int __init xen_blkif_init(void)
{
- int i;
int rc = 0;

if (!xen_domain())
return -ENODEV;

- blkbk = kzalloc(sizeof(struct xen_blkbk), GFP_KERNEL);
- if (!blkbk) {
- pr_alert(DRV_PFX "%s: out of memory!\n", __func__);
- return -ENOMEM;
- }
-
-
- blkbk->pending_reqs = kzalloc(sizeof(blkbk->pending_reqs[0]) *
- xen_blkif_reqs, GFP_KERNEL);
-
- if (!blkbk->pending_reqs) {
- rc = -ENOMEM;
- goto out_of_memory;
- }
-
rc = xen_blkif_interface_init();
if (rc)
goto failed_init;

- INIT_LIST_HEAD(&blkbk->pending_free);
- spin_lock_init(&blkbk->pending_free_lock);
- init_waitqueue_head(&blkbk->pending_free_wq);
-
- for (i = 0; i < xen_blkif_reqs; i++)
- list_add_tail(&blkbk->pending_reqs[i].free_list,
- &blkbk->pending_free);
-
rc = xen_blkif_xenbus_init();
if (rc)
goto failed_init;

- return 0;
-
- out_of_memory:
- pr_alert(DRV_PFX "%s: out of memory\n", __func__);
failed_init:
- kfree(blkbk->pending_reqs);
- kfree(blkbk);
- blkbk = NULL;
return rc;
}

diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index af9bed4..e33fafa 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -192,6 +192,9 @@ struct backend_info;
*/
#define PERSISTENT_GNT_WAS_ACTIVE 1

+/* Number of requests that we can fit in a ring */
+#define XEN_BLKIF_REQS 32
+
struct persistent_gnt {
struct page *page;
grant_ref_t gnt;
@@ -242,6 +245,14 @@ struct xen_blkif {
int free_pages_num;
struct list_head free_pages;

+ /* Allocation of pending_reqs */
+ struct pending_req *pending_reqs;
+ /* List of all 'pending_req' available */
+ struct list_head pending_free;
+ /* And its spinlock. */
+ spinlock_t pending_free_lock;
+ wait_queue_head_t pending_free_wq;
+
/* statistics */
unsigned long st_print;
unsigned long long st_rd_req;
@@ -255,6 +266,25 @@ struct xen_blkif {
wait_queue_head_t waiting_to_free;
};

+/*
+ * Each outstanding request that we've passed to the lower device layers has a
+ * 'pending_req' allocated to it. Each buffer_head that completes decrements
+ * the pendcnt towards zero. When it hits zero, the specified domain has a
+ * response queued for it, with the saved 'id' passed back.
+ */
+struct pending_req {
+ struct xen_blkif *blkif;
+ u64 id;
+ int nr_pages;
+ atomic_t pendcnt;
+ unsigned short operation;
+ int status;
+ struct list_head free_list;
+ struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ grant_handle_t grant_handles[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+};
+

#define vbd_sz(_v) ((_v)->bdev->bd_part ? \
(_v)->bdev->bd_part->nr_sects : \
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index e0fd92a..1f1ade6 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -105,6 +105,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
static struct xen_blkif *xen_blkif_alloc(domid_t domid)
{
struct xen_blkif *blkif;
+ int i;

blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
if (!blkif)
@@ -124,6 +125,21 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
blkif->free_pages_num = 0;
atomic_set(&blkif->persistent_gnt_in_use, 0);

+ blkif->pending_reqs = kcalloc(XEN_BLKIF_REQS,
+ sizeof(blkif->pending_reqs[0]),
+ GFP_KERNEL);
+ if (!blkif->pending_reqs) {
+ kmem_cache_free(xen_blkif_cachep, blkif);
+ return ERR_PTR(-ENOMEM);
+ }
+ INIT_LIST_HEAD(&blkif->pending_free);
+ spin_lock_init(&blkif->pending_free_lock);
+ init_waitqueue_head(&blkif->pending_free_wq);
+
+ for (i = 0; i < XEN_BLKIF_REQS; i++)
+ list_add_tail(&blkif->pending_reqs[i].free_list,
+ &blkif->pending_free);
+
return blkif;
}

@@ -203,8 +219,18 @@ static void xen_blkif_disconnect(struct xen_blkif *blkif)

static void xen_blkif_free(struct xen_blkif *blkif)
{
+ struct pending_req *req;
+ int i = 0;
+
if (!atomic_dec_and_test(&blkif->refcnt))
BUG();
+
+ /* Check that there is no request in use */
+ list_for_each_entry(req, &blkif->pending_free, free_list)
+ i++;
+ BUG_ON(i != XEN_BLKIF_REQS);
+
+ kfree(blkif->pending_reqs);
kmem_cache_free(xen_blkif_cachep, blkif);
}

--
1.7.7.5 (Apple Git-26)

2013-04-17 18:21:44

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 3/7] xen-blkback: implement LRU mechanism for persistent grants

This mechanism allows blkback to change the number of grants
persistently mapped at run time.

The algorithm uses a simple LRU mechanism that removes (if needed) the
persistent grants that have not been used since the last LRU run, or
if all grants have been used it removes the first grants in the list
(that are not in use).

The algorithm allows the user to change the maximum number of
persistent grants, by changing max_persistent_grants in sysfs.

Since we are storing the persistent grants used inside the request
struct (to be able to mark them as "unused" when unmapping), we no
longer need the bitmap (unmap_seg).

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
---
Changes since RFC:
* Offload part of the purge work to a kworker thread.
* Add documentation about sysfs tunable.
Changes since v1:
* Changed some module_param to defines.
* Replace a WARN_ON with a rate limited print if we found out we are
using the same grant in different segments concurrently and return
NULL.
* Rename PERSISTENT_GNT_USED to PERSISTENT_GNT_WAS_ACTIVE.
* Improve documentation by adding what condition makes the LRU kick
in, and the time interval and number of grants removed at each
iteration.
---
Documentation/ABI/stable/sysfs-bus-xen-backend | 10 +
drivers/block/xen-blkback/blkback.c | 287 +++++++++++++++++++-----
drivers/block/xen-blkback/common.h | 18 ++
drivers/block/xen-blkback/xenbus.c | 2 +
4 files changed, 260 insertions(+), 57 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-bus-xen-backend b/Documentation/ABI/stable/sysfs-bus-xen-backend
index e04afe0..947db11 100644
--- a/Documentation/ABI/stable/sysfs-bus-xen-backend
+++ b/Documentation/ABI/stable/sysfs-bus-xen-backend
@@ -81,3 +81,13 @@ Contact: Roger Pau Monné <[email protected]>
Description:
Maximum number of free pages to keep in each block
backend buffer.
+
+What: /sys/module/xen_blkback/parameters/max_persistent_grants
+Date: March 2013
+KernelVersion: 3.10
+Contact: Roger Pau Monné <[email protected]>
+Description:
+ Maximum number of grants to map persistently in
+ blkback. If the frontend tries to use more than
+ max_persistent_grants, the LRU kicks in and starts
+ removing 5% of max_persistent_grants every 100ms.
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 8245c6b..17052f7 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -78,6 +78,36 @@ module_param_named(max_buffer_pages, xen_blkif_max_buffer_pages, int, 0644);
MODULE_PARM_DESC(max_buffer_pages,
"Maximum number of free pages to keep in each block backend buffer");

+/*
+ * Maximum number of grants to map persistently in blkback. For maximum
+ * performance this should be the total numbers of grants that can be used
+ * to fill the ring, but since this might become too high, specially with
+ * the use of indirect descriptors, we set it to a value that provides good
+ * performance without using too much memory.
+ *
+ * When the list of persistent grants is full we clean it up using a LRU
+ * algorithm.
+ */
+
+static int xen_blkif_max_pgrants = 352;
+module_param_named(max_persistent_grants, xen_blkif_max_pgrants, int, 0644);
+MODULE_PARM_DESC(max_persistent_grants,
+ "Maximum number of grants to map persistently");
+
+/*
+ * The LRU mechanism to clean the lists of persistent grants needs to
+ * be executed periodically. The time interval between consecutive executions
+ * of the purge mechanism is set in ms.
+ */
+#define LRU_INTERVAL 100
+
+/*
+ * When the persistent grants list is full we will remove unused grants
+ * from the list. The percent number of grants to be removed at each LRU
+ * execution.
+ */
+#define LRU_PERCENT_CLEAN 5
+
/* Run-time switchable: /sys/module/blkback/parameters/ */
static unsigned int log_stats;
module_param(log_stats, int, 0644);
@@ -96,8 +126,8 @@ struct pending_req {
unsigned short operation;
int status;
struct list_head free_list;
- DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

#define BLKBACK_INVALID_HANDLE (~0)
@@ -119,36 +149,6 @@ struct xen_blkbk {
static struct xen_blkbk *blkbk;

/*
- * Maximum number of grant pages that can be mapped in blkback.
- * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
- * pages that blkback will persistently map.
- * Currently, this is:
- * RING_SIZE = 32 (for all known ring types)
- * BLKIF_MAX_SEGMENTS_PER_REQUEST = 11
- * sizeof(struct persistent_gnt) = 48
- * So the maximum memory used to store the grants is:
- * 32 * 11 * 48 = 16896 bytes
- */
-static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
-{
- switch (protocol) {
- case BLKIF_PROTOCOL_NATIVE:
- return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
- BLKIF_MAX_SEGMENTS_PER_REQUEST;
- case BLKIF_PROTOCOL_X86_32:
- return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
- BLKIF_MAX_SEGMENTS_PER_REQUEST;
- case BLKIF_PROTOCOL_X86_64:
- return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
- BLKIF_MAX_SEGMENTS_PER_REQUEST;
- default:
- BUG();
- }
- return 0;
-}
-
-
-/*
* Little helpful macro to figure out the index and virtual address of the
* pending_pages[..]. For each 'pending_req' we have have up to
* BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
@@ -239,13 +239,29 @@ static void make_response(struct xen_blkif *blkif, u64 id,
(n) = (&(pos)->node != NULL) ? rb_next(&(pos)->node) : NULL)


-static int add_persistent_gnt(struct rb_root *root,
+/*
+ * We don't need locking around the persistent grant helpers
+ * because blkback uses a single-thread for each backed, so we
+ * can be sure that this functions will never be called recursively.
+ *
+ * The only exception to that is put_persistent_grant, that can be called
+ * from interrupt context (by xen_blkbk_unmap), so we have to use atomic
+ * bit operations to modify the flags of a persistent grant and to count
+ * the number of used grants.
+ */
+static int add_persistent_gnt(struct xen_blkif *blkif,
struct persistent_gnt *persistent_gnt)
{
- struct rb_node **new = &(root->rb_node), *parent = NULL;
+ struct rb_node **new = NULL, *parent = NULL;
struct persistent_gnt *this;

+ if (blkif->persistent_gnt_c >= xen_blkif_max_pgrants) {
+ if (!blkif->vbd.overflow_max_grants)
+ blkif->vbd.overflow_max_grants = 1;
+ return -EBUSY;
+ }
/* Figure out where to put new node */
+ new = &blkif->persistent_gnts.rb_node;
while (*new) {
this = container_of(*new, struct persistent_gnt, node);

@@ -260,18 +276,23 @@ static int add_persistent_gnt(struct rb_root *root,
}
}

+ bitmap_zero(persistent_gnt->flags, PERSISTENT_GNT_FLAGS_SIZE);
+ set_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
/* Add new node and rebalance tree. */
rb_link_node(&(persistent_gnt->node), parent, new);
- rb_insert_color(&(persistent_gnt->node), root);
+ rb_insert_color(&(persistent_gnt->node), &blkif->persistent_gnts);
+ blkif->persistent_gnt_c++;
+ atomic_inc(&blkif->persistent_gnt_in_use);
return 0;
}

-static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
+static struct persistent_gnt *get_persistent_gnt(struct xen_blkif *blkif,
grant_ref_t gref)
{
struct persistent_gnt *data;
- struct rb_node *node = root->rb_node;
+ struct rb_node *node = NULL;

+ node = blkif->persistent_gnts.rb_node;
while (node) {
data = container_of(node, struct persistent_gnt, node);

@@ -279,12 +300,29 @@ static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
node = node->rb_left;
else if (gref > data->gnt)
node = node->rb_right;
- else
+ else {
+ if(test_bit(PERSISTENT_GNT_ACTIVE, data->flags)) {
+ pr_alert_ratelimited(DRV_PFX " requesting a grant already in use\n");
+ return NULL;
+ }
+ set_bit(PERSISTENT_GNT_ACTIVE, data->flags);
+ atomic_inc(&blkif->persistent_gnt_in_use);
return data;
+ }
}
return NULL;
}

+static void put_persistent_gnt(struct xen_blkif *blkif,
+ struct persistent_gnt *persistent_gnt)
+{
+ if(!test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
+ pr_alert_ratelimited(DRV_PFX " freeing a grant already unused");
+ set_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
+ clear_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags);
+ atomic_dec(&blkif->persistent_gnt_in_use);
+}
+
static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
unsigned int num)
{
@@ -322,6 +360,129 @@ static void free_persistent_gnts(struct xen_blkif *blkif, struct rb_root *root,
BUG_ON(num != 0);
}

+static void unmap_purged_grants(struct work_struct *work)
+{
+ struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct persistent_gnt *persistent_gnt;
+ int ret, segs_to_unmap = 0;
+ struct xen_blkif *blkif = container_of(work, typeof(*blkif), persistent_purge_work);
+
+ while(!list_empty(&blkif->persistent_purge_list)) {
+ persistent_gnt = list_first_entry(&blkif->persistent_purge_list,
+ struct persistent_gnt,
+ remove_node);
+ list_del(&persistent_gnt->remove_node);
+
+ gnttab_set_unmap_op(&unmap[segs_to_unmap],
+ vaddr(persistent_gnt->page),
+ GNTMAP_host_map,
+ persistent_gnt->handle);
+
+ pages[segs_to_unmap] = persistent_gnt->page;
+
+ if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ ret = gnttab_unmap_refs(unmap, NULL, pages,
+ segs_to_unmap);
+ BUG_ON(ret);
+ put_free_pages(blkif, pages, segs_to_unmap);
+ segs_to_unmap = 0;
+ }
+ kfree(persistent_gnt);
+ }
+ if (segs_to_unmap > 0) {
+ ret = gnttab_unmap_refs(unmap, NULL, pages, segs_to_unmap);
+ BUG_ON(ret);
+ put_free_pages(blkif, pages, segs_to_unmap);
+ }
+}
+
+static void purge_persistent_gnt(struct xen_blkif *blkif)
+{
+ struct persistent_gnt *persistent_gnt;
+ struct rb_node *n;
+ unsigned int num_clean, total;
+ bool scan_used = false;
+ struct rb_root *root;
+
+ if (blkif->persistent_gnt_c < xen_blkif_max_pgrants ||
+ (blkif->persistent_gnt_c == xen_blkif_max_pgrants &&
+ !blkif->vbd.overflow_max_grants)) {
+ return;
+ }
+
+ if (work_pending(&blkif->persistent_purge_work)) {
+ pr_alert_ratelimited(DRV_PFX "Scheduled work from previous purge is still pending, cannot purge list\n");
+ return;
+ }
+
+ num_clean = (xen_blkif_max_pgrants / 100) * LRU_PERCENT_CLEAN;
+ num_clean = blkif->persistent_gnt_c - xen_blkif_max_pgrants + num_clean;
+ num_clean = min(blkif->persistent_gnt_c, num_clean);
+ if (num_clean >
+ (blkif->persistent_gnt_c -
+ atomic_read(&blkif->persistent_gnt_in_use)))
+ return;
+
+ /*
+ * At this point, we can assure that there will be no calls
+ * to get_persistent_grant (because we are executing this code from
+ * xen_blkif_schedule), there can only be calls to put_persistent_gnt,
+ * which means that the number of currently used grants will go down,
+ * but never up, so we will always be able to remove the requested
+ * number of grants.
+ */
+
+ total = num_clean;
+
+ pr_debug(DRV_PFX "Going to purge %u persistent grants\n", num_clean);
+
+ INIT_LIST_HEAD(&blkif->persistent_purge_list);
+ root = &blkif->persistent_gnts;
+purge_list:
+ foreach_grant_safe(persistent_gnt, n, root, node) {
+ BUG_ON(persistent_gnt->handle ==
+ BLKBACK_INVALID_HANDLE);
+
+ if (test_bit(PERSISTENT_GNT_ACTIVE, persistent_gnt->flags))
+ continue;
+ if (!scan_used &&
+ (test_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags)))
+ continue;
+
+ rb_erase(&persistent_gnt->node, root);
+ list_add(&persistent_gnt->remove_node,
+ &blkif->persistent_purge_list);
+ if (--num_clean == 0)
+ goto finished;
+ }
+ /*
+ * If we get here it means we also need to start cleaning
+ * grants that were used since last purge in order to cope
+ * with the requested num
+ */
+ if (!scan_used) {
+ pr_debug(DRV_PFX "Still missing %u purged frames\n", num_clean);
+ scan_used = true;
+ goto purge_list;
+ }
+finished:
+ /* Remove the "used" flag from all the persistent grants */
+ foreach_grant_safe(persistent_gnt, n, root, node) {
+ BUG_ON(persistent_gnt->handle ==
+ BLKBACK_INVALID_HANDLE);
+ clear_bit(PERSISTENT_GNT_WAS_ACTIVE, persistent_gnt->flags);
+ }
+ blkif->persistent_gnt_c -= (total - num_clean);
+ blkif->vbd.overflow_max_grants = 0;
+
+ /* We can defer this work */
+ INIT_WORK(&blkif->persistent_purge_work, unmap_purged_grants);
+ schedule_work(&blkif->persistent_purge_work);
+ pr_debug(DRV_PFX "Purged %u/%u\n", (total - num_clean), total);
+ return;
+}
+
/*
* Retrieve from the 'pending_reqs' a free pending_req structure to be used.
*/
@@ -453,12 +614,12 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id)
static void print_stats(struct xen_blkif *blkif)
{
pr_info("xen-blkback (%s): oo %3llu | rd %4llu | wr %4llu | f %4llu"
- " | ds %4llu | pg: %4u/%4u\n",
+ " | ds %4llu | pg: %4u/%4d\n",
current->comm, blkif->st_oo_req,
blkif->st_rd_req, blkif->st_wr_req,
blkif->st_f_req, blkif->st_ds_req,
blkif->persistent_gnt_c,
- max_mapped_grant_pages(blkif->blk_protocol));
+ xen_blkif_max_pgrants);
blkif->st_print = jiffies + msecs_to_jiffies(10 * 1000);
blkif->st_rd_req = 0;
blkif->st_wr_req = 0;
@@ -470,6 +631,7 @@ int xen_blkif_schedule(void *arg)
{
struct xen_blkif *blkif = arg;
struct xen_vbd *vbd = &blkif->vbd;
+ unsigned long timeout;

xen_blkif_get(blkif);

@@ -479,13 +641,21 @@ int xen_blkif_schedule(void *arg)
if (unlikely(vbd->size != vbd_sz(vbd)))
xen_vbd_resize(blkif);

- wait_event_interruptible(
+ timeout = msecs_to_jiffies(LRU_INTERVAL);
+
+ timeout = wait_event_interruptible_timeout(
blkif->wq,
- blkif->waiting_reqs || kthread_should_stop());
- wait_event_interruptible(
+ blkif->waiting_reqs || kthread_should_stop(),
+ timeout);
+ if (timeout == 0)
+ goto purge_gnt_list;
+ timeout = wait_event_interruptible_timeout(
blkbk->pending_free_wq,
!list_empty(&blkbk->pending_free) ||
- kthread_should_stop());
+ kthread_should_stop(),
+ timeout);
+ if (timeout == 0)
+ goto purge_gnt_list;

blkif->waiting_reqs = 0;
smp_mb(); /* clear flag *before* checking for work */
@@ -493,6 +663,13 @@ int xen_blkif_schedule(void *arg)
if (do_block_io_op(blkif))
blkif->waiting_reqs = 1;

+purge_gnt_list:
+ if (blkif->vbd.feature_gnt_persistent &&
+ time_after(jiffies, blkif->next_lru)) {
+ purge_persistent_gnt(blkif);
+ blkif->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
+ }
+
/* Shrink if we have more than xen_blkif_max_buffer_pages */
shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages);

@@ -538,8 +715,10 @@ static void xen_blkbk_unmap(struct pending_req *req)
int ret;

for (i = 0; i < req->nr_pages; i++) {
- if (!test_bit(i, req->unmap_seg))
+ if (req->persistent_gnts[i] != NULL) {
+ put_persistent_gnt(blkif, req->persistent_gnts[i]);
continue;
+ }
handle = pending_handle(req, i);
pages[invcount] = req->pages[i];
if (handle == BLKBACK_INVALID_HANDLE)
@@ -561,8 +740,8 @@ static int xen_blkbk_map(struct blkif_request *req,
struct page *pages[])
{
struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct persistent_gnt **persistent_gnts = pending_req->persistent_gnts;
struct persistent_gnt *persistent_gnt = NULL;
struct xen_blkif *blkif = pending_req->blkif;
phys_addr_t addr = 0;
@@ -574,9 +753,6 @@ static int xen_blkbk_map(struct blkif_request *req,

use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);

- BUG_ON(blkif->persistent_gnt_c >
- max_mapped_grant_pages(pending_req->blkif->blk_protocol));
-
/*
* Fill out preq.nr_sects with proper amount of sectors, and setup
* assign map[..] with the PFN of the page in our domain with the
@@ -587,7 +763,7 @@ static int xen_blkbk_map(struct blkif_request *req,

if (use_persistent_gnts)
persistent_gnt = get_persistent_gnt(
- &blkif->persistent_gnts,
+ blkif,
req->u.rw.seg[i].gref);

if (persistent_gnt) {
@@ -623,7 +799,6 @@ static int xen_blkbk_map(struct blkif_request *req,
* so that when we access vaddr(pending_req,i) it has the contents of
* the page from the other domain.
*/
- bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
for (seg_idx = 0, new_map_idx = 0; seg_idx < nseg; seg_idx++) {
if (!persistent_gnts[seg_idx]) {
/* This is a newly mapped grant */
@@ -646,11 +821,10 @@ static int xen_blkbk_map(struct blkif_request *req,
goto next;
}
if (use_persistent_gnts &&
- blkif->persistent_gnt_c <
- max_mapped_grant_pages(blkif->blk_protocol)) {
+ blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
/*
* We are using persistent grants, the grant is
- * not mapped but we have room for it
+ * not mapped but we might have room for it.
*/
persistent_gnt = kmalloc(sizeof(struct persistent_gnt),
GFP_KERNEL);
@@ -665,16 +839,16 @@ static int xen_blkbk_map(struct blkif_request *req,
persistent_gnt->gnt = map[new_map_idx].ref;
persistent_gnt->handle = map[new_map_idx].handle;
persistent_gnt->page = pages[seg_idx];
- if (add_persistent_gnt(&blkif->persistent_gnts,
+ if (add_persistent_gnt(blkif,
persistent_gnt)) {
kfree(persistent_gnt);
persistent_gnt = NULL;
goto next_unmap;
}
- blkif->persistent_gnt_c++;
+ persistent_gnts[seg_idx] = persistent_gnt;
pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
persistent_gnt->gnt, blkif->persistent_gnt_c,
- max_mapped_grant_pages(blkif->blk_protocol));
+ xen_blkif_max_pgrants);
new_map_idx++;
goto next;
}
@@ -688,7 +862,6 @@ next_unmap:
* We could not map this grant persistently, so use it as
* a non-persistent grant.
*/
- bitmap_set(pending_req->unmap_seg, seg_idx, 1);
new_map_idx++;
next:
seg[seg_idx].offset = (req->u.rw.seg[seg_idx].first_sect << 9);
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 6c73c38..af9bed4 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -182,12 +182,23 @@ struct xen_vbd {

struct backend_info;

+/* Number of available flags */
+#define PERSISTENT_GNT_FLAGS_SIZE 2
+/* This persistent grant is currently in use */
+#define PERSISTENT_GNT_ACTIVE 0
+/*
+ * This persistent grant has been used, this flag is set when we remove the
+ * PERSISTENT_GNT_ACTIVE, to know that this grant has been used recently.
+ */
+#define PERSISTENT_GNT_WAS_ACTIVE 1

struct persistent_gnt {
struct page *page;
grant_ref_t gnt;
grant_handle_t handle;
+ DECLARE_BITMAP(flags, PERSISTENT_GNT_FLAGS_SIZE);
struct rb_node node;
+ struct list_head remove_node;
};

struct xen_blkif {
@@ -219,6 +230,12 @@ struct xen_blkif {
/* tree to store persistent grants */
struct rb_root persistent_gnts;
unsigned int persistent_gnt_c;
+ atomic_t persistent_gnt_in_use;
+ unsigned long next_lru;
+
+ /* used by the kworker that offload work from the persistent purge */
+ struct list_head persistent_purge_list;
+ struct work_struct persistent_purge_work;

/* buffer of free pages to map grant refs */
spinlock_t free_pages_lock;
@@ -262,6 +279,7 @@ int xen_blkif_xenbus_init(void);

irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
int xen_blkif_schedule(void *arg);
+int xen_blkif_purge_persistent(void *arg);

int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
struct backend_info *be, int state);
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 24f7f6d..e0fd92a 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -98,6 +98,7 @@ static void xen_update_blkif_status(struct xen_blkif *blkif)
err = PTR_ERR(blkif->xenblkd);
blkif->xenblkd = NULL;
xenbus_dev_error(blkif->be->dev, err, "start xenblkd");
+ return;
}
}

@@ -121,6 +122,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
spin_lock_init(&blkif->free_pages_lock);
INIT_LIST_HEAD(&blkif->free_pages);
blkif->free_pages_num = 0;
+ atomic_set(&blkif->persistent_gnt_in_use, 0);

return blkif;
}
--
1.7.7.5 (Apple Git-26)

2013-04-17 18:21:43

by Roger Pau Monné

[permalink] [raw]
Subject: [PATCH v2 6/7] xen-blkback: expand map/unmap functions

Preparatory change for implementing indirect descriptors. Change
xen_blkbk_{map/unmap} in order to be able to map/unmap a random amount
of grants (previously it was limited to
BLKIF_MAX_SEGMENTS_PER_REQUEST). Also, remove the usage of pending_req
in the map/unmap functions, so we can map/unmap grants without needing
to pass a pending_req.

Signed-off-by: Roger Pau Monné <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: [email protected]
---
drivers/block/xen-blkback/blkback.c | 141 +++++++++++++++++++++--------------
1 files changed, 86 insertions(+), 55 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index 90a5755..356722f 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -163,10 +163,6 @@ static inline void shrink_free_pagepool(struct xen_blkif *blkif, int num)

#define vaddr(page) ((unsigned long)pfn_to_kaddr(page_to_pfn(page)))

-#define pending_handle(_req, _seg) \
- (_req->grant_handles[_seg])
-
-
static int do_block_io_op(struct xen_blkif *blkif);
static int dispatch_rw_block_io(struct xen_blkif *blkif,
struct blkif_request *req,
@@ -648,50 +644,57 @@ struct seg_buf {
* Unmap the grant references, and also remove the M2P over-rides
* used in the 'pending_req'.
*/
-static void xen_blkbk_unmap(struct pending_req *req)
+static void xen_blkbk_unmap(struct xen_blkif *blkif,
+ grant_handle_t handles[],
+ struct page *pages[],
+ struct persistent_gnt *persistent_gnts[],
+ int num)
{
struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct page *unmap_pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
unsigned int i, invcount = 0;
- grant_handle_t handle;
- struct xen_blkif *blkif = req->blkif;
int ret;

- for (i = 0; i < req->nr_pages; i++) {
- if (req->persistent_gnts[i] != NULL) {
- put_persistent_gnt(blkif, req->persistent_gnts[i]);
+ for (i = 0; i < num; i++) {
+ if (persistent_gnts[i] != NULL) {
+ put_persistent_gnt(blkif, persistent_gnts[i]);
continue;
}
- handle = pending_handle(req, i);
- pages[invcount] = req->pages[i];
- if (handle == BLKBACK_INVALID_HANDLE)
+ if (handles[i] == BLKBACK_INVALID_HANDLE)
continue;
- gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[invcount]),
- GNTMAP_host_map, handle);
- pending_handle(req, i) = BLKBACK_INVALID_HANDLE;
- invcount++;
+ unmap_pages[invcount] = pages[i];
+ gnttab_set_unmap_op(&unmap[invcount], vaddr(pages[i]),
+ GNTMAP_host_map, handles[i]);
+ handles[i] = BLKBACK_INVALID_HANDLE;
+ if (++invcount == BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ ret = gnttab_unmap_refs(unmap, NULL, unmap_pages,
+ invcount);
+ BUG_ON(ret);
+ put_free_pages(blkif, unmap_pages, invcount);
+ invcount = 0;
+ }
+ }
+ if (invcount) {
+ ret = gnttab_unmap_refs(unmap, NULL, unmap_pages, invcount);
+ BUG_ON(ret);
+ put_free_pages(blkif, unmap_pages, invcount);
}
-
- ret = gnttab_unmap_refs(unmap, NULL, pages, invcount);
- BUG_ON(ret);
- put_free_pages(blkif, pages, invcount);
}

-static int xen_blkbk_map(struct blkif_request *req,
- struct pending_req *pending_req,
- struct seg_buf seg[],
- struct page *pages[])
+static int xen_blkbk_map(struct xen_blkif *blkif, grant_ref_t grefs[],
+ struct persistent_gnt *persistent_gnts[],
+ grant_handle_t handles[],
+ struct page *pages[],
+ int num, bool ro)
{
struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- struct persistent_gnt **persistent_gnts = pending_req->persistent_gnts;
struct persistent_gnt *persistent_gnt = NULL;
- struct xen_blkif *blkif = pending_req->blkif;
phys_addr_t addr = 0;
int i, seg_idx, new_map_idx;
- int nseg = req->u.rw.nr_segments;
int segs_to_map = 0;
int ret = 0;
+ int last_map = 0, map_until = 0;
int use_persistent_gnts;

use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
@@ -701,13 +704,14 @@ static int xen_blkbk_map(struct blkif_request *req,
* assign map[..] with the PFN of the page in our domain with the
* corresponding grant reference for each page.
*/
- for (i = 0; i < nseg; i++) {
+again:
+ for (i = map_until; i < num; i++) {
uint32_t flags;

if (use_persistent_gnts)
persistent_gnt = get_persistent_gnt(
blkif,
- req->u.rw.seg[i].gref);
+ grefs[i]);

if (persistent_gnt) {
/*
@@ -723,13 +727,15 @@ static int xen_blkbk_map(struct blkif_request *req,
pages_to_gnt[segs_to_map] = pages[i];
persistent_gnts[i] = NULL;
flags = GNTMAP_host_map;
- if (!use_persistent_gnts &&
- (pending_req->operation != BLKIF_OP_READ))
+ if (!use_persistent_gnts && ro)
flags |= GNTMAP_readonly;
gnttab_set_map_op(&map[segs_to_map++], addr,
- flags, req->u.rw.seg[i].gref,
+ flags, grefs[i],
blkif->domid);
}
+ map_until = i + 1;
+ if (segs_to_map == BLKIF_MAX_SEGMENTS_PER_REQUEST)
+ break;
}

if (segs_to_map) {
@@ -742,26 +748,19 @@ static int xen_blkbk_map(struct blkif_request *req,
* so that when we access vaddr(pending_req,i) it has the contents of
* the page from the other domain.
*/
- for (seg_idx = 0, new_map_idx = 0; seg_idx < nseg; seg_idx++) {
+ for (seg_idx = last_map, new_map_idx = 0; seg_idx < map_until; seg_idx++) {
if (!persistent_gnts[seg_idx]) {
/* This is a newly mapped grant */
BUG_ON(new_map_idx >= segs_to_map);
if (unlikely(map[new_map_idx].status != 0)) {
pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
- pending_handle(pending_req, seg_idx) = BLKBACK_INVALID_HANDLE;
+ handles[seg_idx] = BLKBACK_INVALID_HANDLE;
ret |= 1;
- new_map_idx++;
- /*
- * No need to set unmap_seg bit, since
- * we can not unmap this grant because
- * the handle is invalid.
- */
- continue;
+ goto next;
}
- pending_handle(pending_req, seg_idx) = map[new_map_idx].handle;
+ handles[seg_idx] = map[new_map_idx].handle;
} else {
- /* This grant is persistent and already mapped */
- goto next;
+ continue;
}
if (use_persistent_gnts &&
blkif->persistent_gnt_c < xen_blkif_max_pgrants) {
@@ -777,7 +776,7 @@ static int xen_blkbk_map(struct blkif_request *req,
* allocate the persistent_gnt struct
* map this grant non-persistenly
*/
- goto next_unmap;
+ goto next;
}
persistent_gnt->gnt = map[new_map_idx].ref;
persistent_gnt->handle = map[new_map_idx].handle;
@@ -786,13 +785,12 @@ static int xen_blkbk_map(struct blkif_request *req,
persistent_gnt)) {
kfree(persistent_gnt);
persistent_gnt = NULL;
- goto next_unmap;
+ goto next;
}
persistent_gnts[seg_idx] = persistent_gnt;
pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
persistent_gnt->gnt, blkif->persistent_gnt_c,
xen_blkif_max_pgrants);
- new_map_idx++;
goto next;
}
if (use_persistent_gnts && !blkif->vbd.overflow_max_grants) {
@@ -800,15 +798,18 @@ static int xen_blkbk_map(struct blkif_request *req,
pr_debug(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
blkif->domid, blkif->vbd.handle);
}
-next_unmap:
/*
* We could not map this grant persistently, so use it as
* a non-persistent grant.
*/
- new_map_idx++;
next:
- seg[seg_idx].offset = (req->u.rw.seg[seg_idx].first_sect << 9);
+ new_map_idx++;
}
+ segs_to_map = 0;
+ last_map = map_until;
+ if (map_until != num)
+ goto again;
+
return ret;

out_of_memory:
@@ -817,6 +818,31 @@ out_of_memory:
return -ENOMEM;
}

+static int xen_blkbk_map_seg(struct blkif_request *req,
+ struct pending_req *pending_req,
+ struct seg_buf seg[],
+ struct page *pages[])
+{
+ int i, rc;
+ grant_ref_t grefs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+
+ for (i = 0; i < req->u.rw.nr_segments; i++)
+ grefs[i] = req->u.rw.seg[i].gref;
+
+ rc = xen_blkbk_map(pending_req->blkif, grefs,
+ pending_req->persistent_gnts,
+ pending_req->grant_handles, pending_req->pages,
+ req->u.rw.nr_segments,
+ (pending_req->operation != BLKIF_OP_READ));
+ if (rc)
+ return rc;
+
+ for (i = 0; i < req->u.rw.nr_segments; i++)
+ seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
+
+ return 0;
+}
+
static int dispatch_discard_io(struct xen_blkif *blkif,
struct blkif_request *req)
{
@@ -903,7 +929,10 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
* the proper response on the ring.
*/
if (atomic_dec_and_test(&pending_req->pendcnt)) {
- xen_blkbk_unmap(pending_req);
+ xen_blkbk_unmap(pending_req->blkif, pending_req->grant_handles,
+ pending_req->pages,
+ pending_req->persistent_gnts,
+ pending_req->nr_pages);
make_response(pending_req->blkif, pending_req->id,
pending_req->operation, pending_req->status);
xen_blkif_put(pending_req->blkif);
@@ -1125,7 +1154,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
* the hypercall to unmap the grants - that is all done in
* xen_blkbk_unmap.
*/
- if (xen_blkbk_map(req, pending_req, seg, pages))
+ if (xen_blkbk_map_seg(req, pending_req, seg, pages))
goto fail_flush;

/*
@@ -1186,7 +1215,9 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
return 0;

fail_flush:
- xen_blkbk_unmap(pending_req);
+ xen_blkbk_unmap(blkif, pending_req->grant_handles,
+ pending_req->pages, pending_req->persistent_gnts,
+ pending_req->nr_pages);
fail_response:
/* Haven't submitted any bio's yet. */
make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
--
1.7.7.5 (Apple Git-26)

2013-04-18 12:21:28

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] xen-block: implement indirect descriptors

On Wed, Apr 17, 2013 at 08:19:01PM +0200, Roger Pau Monne wrote:
> Indirect descriptors introduce a new block operation
> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
> in the request. This grant references are filled with arrays of
> blkif_request_segment_aligned, this way we can send more segments in a
> request.
>
> The proposed implementation sets the maximum number of indirect grefs
> (frames filled with blkif_request_segment_aligned) to 256 in the
> backend and 32 in the frontend. The value in the frontend has been
> chosen experimentally, and the backend value has been set to a sane
> value that allows expanding the maximum number of indirect descriptors
> in the frontend if needed.
>
> The migration code has changed from the previous implementation, in
> which we simply remapped the segments on the shared ring. Now the
> maximum number of segments allowed in a request can change depending
> on the backend, so we have to requeue all the requests in the ring and
> in the queue and split the bios in them if they are bigger than the
> new maximum number of segments.
>
> Signed-off-by: Roger Pau Monn? <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: [email protected]
> ---
> Changes since v1:
> * Added padding to make the indirect request 64bit aligned.
> * Added some BUGs.
> * Added comments.
> * Fixed number of indirect pages in blkif_get_x86_{32/64}_req.
> * Added description about the indirect operation in blkif.h

Good. I only have three comments - and if you want to handle them later
as seperate patches that is OK too.

They are:

> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -107,6 +107,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
> struct xen_blkif *blkif;
> int i;
>
> + BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
> +
> blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
> if (!blkif)
> return ERR_PTR(-ENOMEM);
> @@ -710,6 +712,12 @@ again:
> goto abort;
> }
>
> + err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
> + MAX_INDIRECT_SEGMENTS);
> + if (err)
> + xenbus_dev_fatal(dev, err, "writing %s/feature-max-indirect-segments",
> + dev->nodename);

Not really fatal. We can continue on and just not use that feature.

.. snip..
> @@ -1483,6 +1799,13 @@ static void blkfront_connect(struct blkfront_info *info)
> else
> info->feature_persistent = persistent;
>
> + err = blkfront_setup_indirect(info);
> + if (err) {
> + xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
> + info->xbdev->otherend);
> + return;

I thought we decided not to make it fatal?

> + }
> +
> err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
> if (err) {
> xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
> diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
> index ffd4652..7901e58 100644
> --- a/include/xen/interface/io/blkif.h
> +++ b/include/xen/interface/io/blkif.h
> @@ -103,12 +103,41 @@ typedef uint64_t blkif_sector_t;
> #define BLKIF_OP_DISCARD 5
>
> /*
> + * Recognized if "feature-max-indirect-segments" in present in the backend
> + * xenbus info. The "feature-max-indirect-segments" node contains the maximum
> + * number of segments allowed by the backend. If the node is present,
^-'per request'.

> + * the frontend might use blkif_request_indirect in order to issue requests
^ structs
> + * with more than BLKIF_MAX_SEGMENTS_PER_REQUEST. The maximum number
^(11)

> + * of indirect segments is fixed by the backend, but the frontend can issue
> + * requests with any number of indirect segments as long as it's inferior
^^- less
> + * than the number provided by the backend. The indirect_grefs field in
> + * blkif_request_indirect should be filled by the frontend with the
> + * grant references of the pages that are holding the indirect segments.
> + * This pages are filled with an array of blkif_request_segment_aligned
> + * that hold the information about the segments.

You should probably explain where the counter for the amount of indirect_grefs
that are going to be used is computed.

> + *
> + * If a backend does not recognize BLKIF_OP_INDIRECT, it should *not*
> + * create the "feature-max-indirect-segments" node!
> + */
> +#define BLKIF_OP_INDIRECT 6

2013-04-18 13:51:21

by Roger Pau Monné

[permalink] [raw]
Subject: Re: [PATCH v2 7/7] xen-block: implement indirect descriptors

On 18/04/13 14:21, Konrad Rzeszutek Wilk wrote:
> On Wed, Apr 17, 2013 at 08:19:01PM +0200, Roger Pau Monne wrote:
>> Indirect descriptors introduce a new block operation
>> (BLKIF_OP_INDIRECT) that passes grant references instead of segments
>> in the request. This grant references are filled with arrays of
>> blkif_request_segment_aligned, this way we can send more segments in a
>> request.
>>
>> The proposed implementation sets the maximum number of indirect grefs
>> (frames filled with blkif_request_segment_aligned) to 256 in the
>> backend and 32 in the frontend. The value in the frontend has been
>> chosen experimentally, and the backend value has been set to a sane
>> value that allows expanding the maximum number of indirect descriptors
>> in the frontend if needed.
>>
>> The migration code has changed from the previous implementation, in
>> which we simply remapped the segments on the shared ring. Now the
>> maximum number of segments allowed in a request can change depending
>> on the backend, so we have to requeue all the requests in the ring and
>> in the queue and split the bios in them if they are bigger than the
>> new maximum number of segments.
>>
>> Signed-off-by: Roger Pau Monn? <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: [email protected]
>> ---
>> Changes since v1:
>> * Added padding to make the indirect request 64bit aligned.
>> * Added some BUGs.
>> * Added comments.
>> * Fixed number of indirect pages in blkif_get_x86_{32/64}_req.
>> * Added description about the indirect operation in blkif.h
>
> Good. I only have three comments - and if you want to handle them later
> as seperate patches that is OK too.
>
> They are:
>
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -107,6 +107,8 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>> struct xen_blkif *blkif;
>> int i;
>>
>> + BUILD_BUG_ON(MAX_INDIRECT_PAGES > BLKIF_MAX_INDIRECT_PAGES_PER_REQUEST);
>> +
>> blkif = kmem_cache_zalloc(xen_blkif_cachep, GFP_KERNEL);
>> if (!blkif)
>> return ERR_PTR(-ENOMEM);
>> @@ -710,6 +712,12 @@ again:
>> goto abort;
>> }
>>
>> + err = xenbus_printf(xbt, dev->nodename, "feature-max-indirect-segments", "%u",
>> + MAX_INDIRECT_SEGMENTS);
>> + if (err)
>> + xenbus_dev_fatal(dev, err, "writing %s/feature-max-indirect-segments",
>> + dev->nodename);
>
> Not really fatal. We can continue on and just not use that feature.

Fixed, changed the fatal to pr_alert.

>
> .. snip..
>> @@ -1483,6 +1799,13 @@ static void blkfront_connect(struct blkfront_info *info)
>> else
>> info->feature_persistent = persistent;
>>
>> + err = blkfront_setup_indirect(info);
>> + if (err) {
>> + xenbus_dev_fatal(info->xbdev, err, "setup_indirect at %s",
>> + info->xbdev->otherend);
>> + return;
>
> I thought we decided not to make it fatal?

If blkfront_setup_indirect returns an error, it doesn't mean it hasn't
been able to fetch the xenstore node, it means we are not able to
allocate necessary structures for blkfront to function properly (amongst
them the scatter-gather list), so it is fatal.

>
>> + }
>> +
>> err = xlvbd_alloc_gendisk(sectors, info, binfo, sector_size);
>> if (err) {
>> xenbus_dev_fatal(info->xbdev, err, "xlvbd_add at %s",
>> diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
>> index ffd4652..7901e58 100644
>> --- a/include/xen/interface/io/blkif.h
>> +++ b/include/xen/interface/io/blkif.h
>> @@ -103,12 +103,41 @@ typedef uint64_t blkif_sector_t;
>> #define BLKIF_OP_DISCARD 5
>>
>> /*
>> + * Recognized if "feature-max-indirect-segments" in present in the backend
>> + * xenbus info. The "feature-max-indirect-segments" node contains the maximum
>> + * number of segments allowed by the backend. If the node is present,
> ^-'per request'.
>
>> + * the frontend might use blkif_request_indirect in order to issue requests
> ^ structs
>> + * with more than BLKIF_MAX_SEGMENTS_PER_REQUEST. The maximum number
> ^(11)
>
>> + * of indirect segments is fixed by the backend, but the frontend can issue
>> + * requests with any number of indirect segments as long as it's inferior
> ^^- less
>> + * than the number provided by the backend. The indirect_grefs field in
>> + * blkif_request_indirect should be filled by the frontend with the
>> + * grant references of the pages that are holding the indirect segments.
>> + * This pages are filled with an array of blkif_request_segment_aligned
>> + * that hold the information about the segments.
>
> You should probably explain where the counter for the amount of indirect_grefs
> that are going to be used is computed.

Done :)

>> + *
>> + * If a backend does not recognize BLKIF_OP_INDIRECT, it should *not*
>> + * create the "feature-max-indirect-segments" node!
>> + */
>> +#define BLKIF_OP_INDIRECT 6