2012-10-18 11:22:21

by Roger Pau Monne

Subject: [PATCH RFC] Persistent grant maps for xen blk drivers

This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup, blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same set of grants,
since this is compatible with the previous protocol and keeps the
blkfront code simple.
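
A condensed sketch of that negotiation, as implemented in the xenbus
changes below (error handling trimmed):

    /* blkfront advertises the feature during the xenbus transaction */
    err = xenbus_printf(xbt, dev->nodename,
                        "feature-persistent-grants", "%d", 1);

    /* blkback probes for it in connect_ring(), defaulting to off */
    err = xenbus_gather(XBT_NIL, dev->otherend,
                        "feature-persistent-grants", "%u",
                        &pers_grants, NULL);
    if (err)
            pers_grants = 0;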

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpys from the
shared memory into the supplied bvec. A record is kept that the
{gref, page} tuple is mapped but not in flight.
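
A condensed view of that copy, as done in blkif_completion() in the
blkfront changes below (bv_len can be smaller than PAGE_SIZE, so a
running offset is used to pick the right shared page):

    rq_for_each_segment(bvec, s->request, iter) {
            i = offset >> PAGE_SHIFT;
            shared_data = kmap_atomic(pfn_to_page(s->grants_used[i]->pfn));
            bvec_data = bvec_kmap_irq(bvec, &flags);
            memcpy(bvec_data, shared_data + bvec->bv_offset, bvec->bv_len);
            bvec_kunmap_irq(bvec_data, &flags);
            kunmap_atomic(shared_data);
            offset += bvec->bv_len;
    }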

Writes are similar, except that the memcpy is performed from the
supplied bvecs into the shared pages, before the request is put onto
the ring.
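
The corresponding copy on the write path, condensed from
blkif_queue_request() below:

    shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
    bvec_data = kmap_atomic(sg_page(sg));
    /* copy only sg->offset..sg->offset+sg->length into the shared page */
    memcpy(shared_data + sg->offset, bvec_data + sg->offset, sg->length);
    kunmap_atomic(bvec_data);
    kunmap_atomic(shared_data);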

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known a priori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page for every gref we receive. This
operation takes O(log n) time in the worst case.
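
A condensed view of the lookup in xen_blkbk_map() (see the blkback
changes below; allocation failure handling and the overflow path are
omitted here):

    persistent_gnt = get_persistent_gnt(&blkif->persistent_gnts,
                                        req->u.rw.seg[i].gref);
    if (!persistent_gnt) {
            /* First time this gref is seen: allocate a backing page,
             * record the {gref, page} pair and schedule a map op. */
            persistent_gnt = kzalloc(sizeof(*persistent_gnt), GFP_KERNEL);
            persistent_gnt->gnt = req->u.rw.seg[i].gref;
            persistent_gnt->page = alloc_page(GFP_KERNEL);
            add_persistent_gnt(&blkif->persistent_gnts, persistent_gnt);
    }
    /* Either way, the gref now resolves to a local page. */
    pages[i] = persistent_gnt->page;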

The maximum number of grants that blkback will persistently map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicious guest from attempting a DoS by supplying fresh
grefs and causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistently, the newly passed grefs will be mapped and unmapped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
doing so will lead to unpredictable performance.
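
(For reference, assuming 4 KB pages: the native protocol ring holds 32
requests, each with up to BLKIF_MAX_SEGMENTS_PER_REQUEST = 11 segments,
so the default cap works out to 32 * 11 = 352 persistently mapped
pages, or roughly 1.4 MB of granted memory per device.)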

In writing this patch, the question arises as to whether the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweighs the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any,
additional cost to the guest of using persistent grants. There is
perhaps a small saving from the reduced number of hypercalls
performed in granting and ending foreign access.

Signed-off-by: Oliver Chick <[email protected]>
Signed-off-by: Roger Pau Monne <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
---
Benchmarks showing the impact of this patch on blk performance can be
found at:

http://xenbits.xensource.com/people/royger/persistent_grants/
---
drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
drivers/block/xen-blkback/common.h | 17 ++
drivers/block/xen-blkback/xenbus.c | 16 ++-
drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
4 files changed, 442 insertions(+), 53 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index c6decb9..2b982b2 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -78,6 +78,7 @@ struct pending_req {
unsigned short operation;
int status;
struct list_head free_list;
+ unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

#define BLKBACK_INVALID_HANDLE (~0)
@@ -98,6 +99,30 @@ struct xen_blkbk {
static struct xen_blkbk *blkbk;

/*
+ * Maximum number of grant pages that can be mapped in blkback.
+ * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
+ * pages that blkback will persistently map.
+ */
+static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
+{
+ switch (protocol) {
+ case BLKIF_PROTOCOL_NATIVE:
+ return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
+ BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ case BLKIF_PROTOCOL_X86_32:
+ return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
+ BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ case BLKIF_PROTOCOL_X86_64:
+ return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
+ BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ default:
+ BUG();
+ }
+ return 0;
+}
+
+
+/*
* Little helpful macro to figure out the index and virtual address of the
* pending_pages[..]. For each 'pending_req' we have have up to
* BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
@@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
static void make_response(struct xen_blkif *blkif, u64 id,
unsigned short op, int st);

+#define foreach_grant(pos, rbtree, node) \
+ for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
+ &(pos)->node != NULL; \
+ (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
+
+
+static void add_persistent_gnt(struct rb_root *root,
+ struct persistent_gnt *persistent_gnt)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+ struct persistent_gnt *this;
+
+ /* Figure out where to put new node */
+ while (*new) {
+ this = container_of(*new, struct persistent_gnt, node);
+
+ parent = *new;
+ if (persistent_gnt->gnt < this->gnt)
+ new = &((*new)->rb_left);
+ else if (persistent_gnt->gnt > this->gnt)
+ new = &((*new)->rb_right);
+ else {
+ pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
+ BUG();
+ }
+ }
+
+ /* Add new node and rebalance tree. */
+ rb_link_node(&(persistent_gnt->node), parent, new);
+ rb_insert_color(&(persistent_gnt->node), root);
+}
+
+static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
+ grant_ref_t gref)
+{
+ struct persistent_gnt *data;
+ struct rb_node *node = root->rb_node;
+
+ while (node) {
+ data = container_of(node, struct persistent_gnt, node);
+
+ if (gref < data->gnt)
+ node = node->rb_left;
+ else if (gref > data->gnt)
+ node = node->rb_right;
+ else
+ return data;
+ }
+ return NULL;
+}
+
/*
* Retrieve from the 'pending_reqs' a free pending_req structure to be used.
*/
@@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
{
struct xen_blkif *blkif = arg;
struct xen_vbd *vbd = &blkif->vbd;
+ struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct persistent_gnt *persistent_gnt;
+ int ret = 0;
+ int segs_to_unmap = 0;

xen_blkif_get(blkif);

@@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
print_stats(blkif);
}

+ /* Free all persistent grant pages */
+ foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
+ BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
+ gnttab_set_unmap_op(&unmap[segs_to_unmap],
+ (unsigned long) pfn_to_kaddr(page_to_pfn(
+ persistent_gnt->page)),
+ GNTMAP_host_map,
+ persistent_gnt->handle);
+
+ pages[segs_to_unmap] = persistent_gnt->page;
+ rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
+ kfree(persistent_gnt);
+ blkif->persistent_gnt_c--;
+
+ if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
+ !rb_next(&persistent_gnt->node)) {
+ ret = gnttab_unmap_refs(unmap, NULL, pages,
+ segs_to_unmap);
+ BUG_ON(ret);
+ segs_to_unmap = 0;
+ }
+ }
+
+ BUG_ON(blkif->persistent_gnt_c != 0);
+ BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
+
if (log_stats)
print_stats(blkif);

@@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
int ret;

for (i = 0; i < req->nr_pages; i++) {
+ if (!req->unmap_seg[i])
+ continue;
handle = pending_handle(req, i);
if (handle == BLKBACK_INVALID_HANDLE)
continue;
@@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)

static int xen_blkbk_map(struct blkif_request *req,
struct pending_req *pending_req,
- struct seg_buf seg[])
+ struct seg_buf seg[],
+ struct page *pages[])
{
struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
- int i;
+ struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct persistent_gnt *persistent_gnt = NULL;
+ struct xen_blkif *blkif = pending_req->blkif;
+ phys_addr_t addr = 0;
+ int i, j;
+ int new_map;
int nseg = req->u.rw.nr_segments;
+ int segs_to_map = 0;
int ret = 0;
+ int use_persistent_gnts;
+
+ use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
+
+ BUG_ON(blkif->persistent_gnt_c >
+ max_mapped_grant_pages(pending_req->blkif->blk_protocol));

/*
* Fill out preq.nr_sects with proper amount of sectors, and setup
@@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
for (i = 0; i < nseg; i++) {
uint32_t flags;

- flags = GNTMAP_host_map;
- if (pending_req->operation != BLKIF_OP_READ)
- flags |= GNTMAP_readonly;
- gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
- req->u.rw.seg[i].gref,
- pending_req->blkif->domid);
+ if (use_persistent_gnts)
+ persistent_gnt = get_persistent_gnt(
+ &blkif->persistent_gnts,
+ req->u.rw.seg[i].gref);
+
+ if (persistent_gnt) {
+ /*
+ * We are using persistent grants and
+ * the grant is already mapped
+ */
+ new_map = 0;
+ } else if (use_persistent_gnts &&
+ blkif->persistent_gnt_c <
+ max_mapped_grant_pages(blkif->blk_protocol)) {
+ /*
+ * We are using persistent grants, the grant is
+ * not mapped but we have room for it
+ */
+ new_map = 1;
+ persistent_gnt = kzalloc(
+ sizeof(struct persistent_gnt),
+ GFP_KERNEL);
+ if (!persistent_gnt)
+ return -ENOMEM;
+ persistent_gnt->page = alloc_page(GFP_KERNEL);
+ if (!persistent_gnt->page) {
+ kfree(persistent_gnt);
+ return -ENOMEM;
+ }
+ persistent_gnt->gnt = req->u.rw.seg[i].gref;
+
+ pages_to_gnt[segs_to_map] =
+ persistent_gnt->page;
+ addr = (unsigned long) pfn_to_kaddr(
+ page_to_pfn(persistent_gnt->page));
+
+ add_persistent_gnt(&blkif->persistent_gnts,
+ persistent_gnt);
+ blkif->persistent_gnt_c++;
+ pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
+ persistent_gnt->gnt, blkif->persistent_gnt_c,
+ max_mapped_grant_pages(blkif->blk_protocol));
+ } else {
+ /*
+ * We are either using persistent grants and
+ * hit the maximum limit of grants mapped,
+ * or we are not using persistent grants.
+ */
+ if (use_persistent_gnts &&
+ !blkif->vbd.overflow_max_grants) {
+ blkif->vbd.overflow_max_grants = 1;
+ pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
+ blkif->domid, blkif->vbd.handle);
+ }
+ new_map = 1;
+ pages[i] = blkbk->pending_page(pending_req, i);
+ addr = vaddr(pending_req, i);
+ pages_to_gnt[segs_to_map] =
+ blkbk->pending_page(pending_req, i);
+ }
+
+ if (persistent_gnt) {
+ pages[i] = persistent_gnt->page;
+ persistent_gnts[i] = persistent_gnt;
+ } else {
+ persistent_gnts[i] = NULL;
+ }
+
+ if (new_map) {
+ flags = GNTMAP_host_map;
+ if (!persistent_gnt &&
+ (pending_req->operation != BLKIF_OP_READ))
+ flags |= GNTMAP_readonly;
+ gnttab_set_map_op(&map[segs_to_map++], addr,
+ flags, req->u.rw.seg[i].gref,
+ blkif->domid);
+ }
}

- ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
- BUG_ON(ret);
+ if (segs_to_map) {
+ ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
+ BUG_ON(ret);
+ }

/*
* Now swizzle the MFN in our domain with the MFN from the other domain
* so that when we access vaddr(pending_req,i) it has the contents of
* the page from the other domain.
*/
- for (i = 0; i < nseg; i++) {
- if (unlikely(map[i].status != 0)) {
- pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
- map[i].handle = BLKBACK_INVALID_HANDLE;
- ret |= 1;
+ for (i = 0, j = 0; i < nseg; i++) {
+ if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
+ /* This is a newly mapped grant */
+ BUG_ON(j >= segs_to_map);
+ if (unlikely(map[j].status != 0)) {
+ pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
+ map[j].handle = BLKBACK_INVALID_HANDLE;
+ ret |= 1;
+ if (persistent_gnts[i]) {
+ rb_erase(&persistent_gnts[i]->node,
+ &blkif->persistent_gnts);
+ blkif->persistent_gnt_c--;
+ kfree(persistent_gnts[i]);
+ persistent_gnts[i] = NULL;
+ }
+ }
+ }
+ if (persistent_gnts[i]) {
+ if (!persistent_gnts[i]->handle) {
+ /*
+ * If this is a new persistent grant
+ * save the handle
+ */
+ persistent_gnts[i]->handle = map[j].handle;
+ persistent_gnts[i]->dev_bus_addr =
+ map[j++].dev_bus_addr;
+ }
+ pending_handle(pending_req, i) =
+ persistent_gnts[i]->handle;
+ pending_req->unmap_seg[i] = 0;
+
+ if (ret)
+ continue;
+
+ seg[i].buf = persistent_gnts[i]->dev_bus_addr |
+ (req->u.rw.seg[i].first_sect << 9);
+ } else {
+ pending_handle(pending_req, i) = map[j].handle;
+ pending_req->unmap_seg[i] = 1;
+
+ if (ret)
+ continue;
+
+ seg[i].buf = map[j++].dev_bus_addr |
+ (req->u.rw.seg[i].first_sect << 9);
}
-
- pending_handle(pending_req, i) = map[i].handle;
-
- if (ret)
- continue;
-
- seg[i].buf = map[i].dev_bus_addr |
- (req->u.rw.seg[i].first_sect << 9);
}
return ret;
}
@@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
int operation;
struct blk_plug plug;
bool drain = false;
+ struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];

switch (req->operation) {
case BLKIF_OP_READ:
@@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
* the hypercall to unmap the grants - that is all done in
* xen_blkbk_unmap.
*/
- if (xen_blkbk_map(req, pending_req, seg))
+ if (xen_blkbk_map(req, pending_req, seg, pages))
goto fail_flush;

/*
@@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
for (i = 0; i < nseg; i++) {
while ((bio == NULL) ||
(bio_add_page(bio,
- blkbk->pending_page(pending_req, i),
+ pages[i],
seg[i].nsec << 9,
seg[i].buf & ~PAGE_MASK) == 0)) {

diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 9ad3b5e..6c08ee9 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -34,6 +34,7 @@
#include <linux/vmalloc.h>
#include <linux/wait.h>
#include <linux/io.h>
+#include <linux/rbtree.h>
#include <asm/setup.h>
#include <asm/pgalloc.h>
#include <asm/hypervisor.h>
@@ -160,10 +161,22 @@ struct xen_vbd {
sector_t size;
bool flush_support;
bool discard_secure;
+
+ unsigned int feature_gnt_persistent:1;
+ unsigned int overflow_max_grants:1;
};

struct backend_info;

+
+struct persistent_gnt {
+ struct page *page;
+ grant_ref_t gnt;
+ grant_handle_t handle;
+ uint64_t dev_bus_addr;
+ struct rb_node node;
+};
+
struct xen_blkif {
/* Unique identifier for this interface. */
domid_t domid;
@@ -190,6 +203,10 @@ struct xen_blkif {
struct task_struct *xenblkd;
unsigned int waiting_reqs;

+ /* frontend feature information */
+ struct rb_root persistent_gnts;
+ unsigned int persistent_gnt_c;
+
/* statistics */
unsigned long st_print;
int st_rd_req;
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index 4f66171..9f88b4e 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
atomic_set(&blkif->drain, 0);
blkif->st_print = jiffies;
init_waitqueue_head(&blkif->waiting_to_free);
+ blkif->persistent_gnts.rb_node = NULL;

return blkif;
}
@@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
struct xenbus_device *dev = be->dev;
unsigned long ring_ref;
unsigned int evtchn;
+ unsigned int pers_grants;
char protocol[64] = "";
int err;

@@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
return -1;
}
- pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
- ring_ref, evtchn, be->blkif->blk_protocol, protocol);
+ err = xenbus_gather(XBT_NIL, dev->otherend,
+ "feature-persistent-grants", "%u",
+ &pers_grants, NULL);
+ if (err)
+ pers_grants = 0;
+
+ be->blkif->vbd.feature_gnt_persistent = pers_grants;
+ be->blkif->vbd.overflow_max_grants = 0;
+
+ pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
+ ring_ref, evtchn, be->blkif->blk_protocol, protocol,
+ pers_grants);

/* Map the shared frame, irq etc. */
err = xen_blkif_map(be->blkif, ring_ref, evtchn);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 2c2d2e5..206d422 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -44,6 +44,7 @@
#include <linux/mutex.h>
#include <linux/scatterlist.h>
#include <linux/bitmap.h>
+#include <linux/llist.h>

#include <xen/xen.h>
#include <xen/xenbus.h>
@@ -64,10 +65,17 @@ enum blkif_state {
BLKIF_STATE_SUSPENDED,
};

+struct grant {
+ grant_ref_t gref;
+ unsigned long pfn;
+ struct llist_node node;
+};
+
struct blk_shadow {
struct blkif_request req;
struct request *request;
unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
+ struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
};

static DEFINE_MUTEX(blkfront_mutex);
@@ -97,6 +105,8 @@ struct blkfront_info
struct work_struct work;
struct gnttab_free_callback callback;
struct blk_shadow shadow[BLK_RING_SIZE];
+ struct llist_head persistent_gnts;
+ unsigned int persistent_gnts_c;
unsigned long shadow_free;
unsigned int feature_flush;
unsigned int flush_op;
@@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
unsigned long id;
unsigned int fsect, lsect;
int i, ref;
+
+ /*
+ * Used to store if we are able to queue the request by just using
+ * existing persistent grants (0), or if we have to get new grants,
+ * as there are not sufficiently many free.
+ */
+ int new_persistent_gnts;
grant_ref_t gref_head;
+ struct page *granted_page;
+ struct grant *gnt_list_entry = NULL;
struct scatterlist *sg;

if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
return 1;

- if (gnttab_alloc_grant_references(
- BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
- gnttab_request_free_callback(
- &info->callback,
- blkif_restart_queue_callback,
- info,
- BLKIF_MAX_SEGMENTS_PER_REQUEST);
- return 1;
- }
+ /* Check if we have enough grants to allocate a request */
+ if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ new_persistent_gnts = 1;
+ if (gnttab_alloc_grant_references(
+ BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
+ &gref_head) < 0) {
+ gnttab_request_free_callback(
+ &info->callback,
+ blkif_restart_queue_callback,
+ info,
+ BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ return 1;
+ }
+ } else
+ new_persistent_gnts = 0;

/* Fill out a communications ring structure. */
ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
@@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
BLKIF_MAX_SEGMENTS_PER_REQUEST);

for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
- buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
fsect = sg->offset >> 9;
lsect = fsect + (sg->length >> 9) - 1;
- /* install a grant reference. */
- ref = gnttab_claim_grant_reference(&gref_head);
- BUG_ON(ref == -ENOSPC);

- gnttab_grant_foreign_access_ref(
- ref,
+ if (info->persistent_gnts_c) {
+ BUG_ON(llist_empty(&info->persistent_gnts));
+ gnt_list_entry = llist_entry(
+ llist_del_first(&info->persistent_gnts),
+ struct grant, node);
+
+ ref = gnt_list_entry->gref;
+ buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
+ info->persistent_gnts_c--;
+ } else {
+ ref = gnttab_claim_grant_reference(&gref_head);
+ BUG_ON(ref == -ENOSPC);
+
+ gnt_list_entry =
+ kmalloc(sizeof(struct grant),
+ GFP_ATOMIC);
+ if (!gnt_list_entry)
+ return -ENOMEM;
+
+ granted_page = alloc_page(GFP_ATOMIC);
+ if (!granted_page) {
+ kfree(gnt_list_entry);
+ return -ENOMEM;
+ }
+
+ gnt_list_entry->pfn =
+ page_to_pfn(granted_page);
+ gnt_list_entry->gref = ref;
+
+ buffer_mfn = pfn_to_mfn(page_to_pfn(
+ granted_page));
+ gnttab_grant_foreign_access_ref(ref,
info->xbdev->otherend_id,
- buffer_mfn,
- rq_data_dir(req));
+ buffer_mfn, 0);
+ }
+
+ info->shadow[id].grants_used[i] = gnt_list_entry;
+
+ if (rq_data_dir(req)) {
+ char *bvec_data;
+ void *shared_data;
+
+ BUG_ON(sg->offset + sg->length > PAGE_SIZE);
+
+ shared_data = kmap_atomic(
+ pfn_to_page(gnt_list_entry->pfn));
+ bvec_data = kmap_atomic(sg_page(sg));
+
+ /*
+ * this does not wipe data stored outside the
+ * range sg->offset..sg->offset+sg->length.
+ * Therefore, blkback *could* see data from
+ * previous requests. This is OK as long as
+ * persistent grants are shared with just one
+ * domain. It may need refactoring if This
+ * changes
+ */
+ memcpy(shared_data + sg->offset,
+ bvec_data + sg->offset,
+ sg->length);
+
+ kunmap_atomic(bvec_data);
+ kunmap_atomic(shared_data);
+ }

info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
ring_req->u.rw.seg[i] =
@@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
/* Keep a private copy so we can reissue requests when recovering. */
info->shadow[id].req = *ring_req;

- gnttab_free_grant_references(gref_head);
+ if (new_persistent_gnts)
+ gnttab_free_grant_references(gref_head);

return 0;
}
@@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
static void xlvbd_flush(struct blkfront_info *info)
{
blk_queue_flush(info->rq, info->feature_flush);
- printk(KERN_INFO "blkfront: %s: %s: %s\n",
+ printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
info->gd->disk_name,
info->flush_op == BLKIF_OP_WRITE_BARRIER ?
"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
@@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)

static void blkif_free(struct blkfront_info *info, int suspend)
{
+ struct llist_node *all_gnts;
+ struct grant *persistent_gnt;
+
/* Prevent new requests being issued until we fix things up. */
spin_lock_irq(&info->io_lock);
info->connected = suspend ?
@@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
/* No more blkif_request(). */
if (info->rq)
blk_stop_queue(info->rq);
+
+ /* Remove all persistent grants */
+ if (info->persistent_gnts_c) {
+ all_gnts = llist_del_all(&info->persistent_gnts);
+ llist_for_each_entry(persistent_gnt, all_gnts, node) {
+ gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
+ kfree(persistent_gnt);
+ }
+ info->persistent_gnts_c = 0;
+ }
+
/* No more gnttab callback work. */
gnttab_cancel_free_callback(&info->callback);
spin_unlock_irq(&info->io_lock);
@@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)

}

-static void blkif_completion(struct blk_shadow *s)
+static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
+ struct blkif_response *bret)
{
int i;
- /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
- * flag. */
- for (i = 0; i < s->req.u.rw.nr_segments; i++)
- gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
+ struct bio_vec *bvec;
+ struct req_iterator iter;
+ unsigned long flags;
+ char *bvec_data;
+ void *shared_data;
+ unsigned int offset = 0;
+
+ if (bret->operation == BLKIF_OP_READ) {
+ /*
+ * Copy the data received from the backend into the bvec.
+ * Since bv_len can be different from PAGE_SIZE, we need to
+ * be sure we are actually copying the data from the right
+ * shared page.
+ */
+ rq_for_each_segment(bvec, s->request, iter) {
+ BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
+ i = offset >> PAGE_SHIFT;
+ shared_data = kmap_atomic(
+ pfn_to_page(s->grants_used[i]->pfn));
+ bvec_data = bvec_kmap_irq(bvec, &flags);
+ memcpy(bvec_data, shared_data + bvec->bv_offset,
+ bvec->bv_len);
+ bvec_kunmap_irq(bvec_data, &flags);
+ kunmap_atomic(shared_data);
+ offset += bvec->bv_len;
+ }
+ }
+ /* Add the persistent grant into the list of free grants */
+ for (i = 0; i < s->req.u.rw.nr_segments; i++) {
+ llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
+ info->persistent_gnts_c++;
+ }
}

static irqreturn_t blkif_interrupt(int irq, void *dev_id)
@@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
req = info->shadow[id].request;

if (bret->operation != BLKIF_OP_DISCARD)
- blkif_completion(&info->shadow[id]);
+ blkif_completion(&info->shadow[id], info, bret);

if (add_id_to_freelist(info, id)) {
WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
@@ -942,6 +1066,11 @@ again:
message = "writing protocol";
goto abort_transaction;
}
+ err = xenbus_printf(xbt, dev->nodename,
+ "feature-persistent-grants", "%d", 1);
+ if (err)
+ dev_warn(&dev->dev,
+ "writing persistent grants feature to xenbus");

err = xenbus_transaction_end(xbt, 0);
if (err) {
@@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
spin_lock_init(&info->io_lock);
info->xbdev = dev;
info->vdevice = vdevice;
+ init_llist_head(&info->persistent_gnts);
+ info->persistent_gnts_c = 0;
info->connected = BLKIF_STATE_DISCONNECTED;
INIT_WORK(&info->work, blkif_restart_queue);

@@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
req->u.rw.seg[j].gref,
info->xbdev->otherend_id,
pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
- rq_data_dir(info->shadow[req->u.rw.id].request));
+ 0);
}
info->shadow[req->u.rw.id].req = *req;

--
1.7.7.5 (Apple Git-26)


2012-10-19 01:56:55

by James Harper

Subject: RE: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers

>
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This allows
> the I/O performance to scale better when a large number of VMs are
> performing I/O.
>
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O and wrote
> directly into the grant-mapped memory and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales badly
> with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the
> IPIs from performing unmaps are a bottleneck at 5 guests (at which point
> 650,000 IOPS are being performed in total). If more than 5 guests are used,
> the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
>
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.

I assume network drivers would suffer from the same affliction... Would a more general persistent map solution be worth considering (or be possible)? So a common interface to this persistent mapping allowing the persistent pool to be shared between all drivers in the DomU?

James

2012-10-19 08:26:37

by Roger Pau Monne

Subject: Re: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers

On 19/10/12 03:34, James Harper wrote:
>>
>> This patch implements persistent grants for the xen-blk{front,back}
>> mechanism. The effect of this change is to reduce the number of unmap
>> operations performed, since they cause a (costly) TLB shootdown. This allows
>> the I/O performance to scale better when a large number of VMs are
>> performing I/O.
>>
>> Previously, the blkfront driver was supplied a bvec[] from the request
>> queue. This was granted to dom0; dom0 performed the I/O and wrote
>> directly into the grant-mapped memory and unmapped it; blkfront then
>> removed foreign access for that grant. The cost of unmapping scales badly
>> with the number of CPUs in Dom0. An experiment showed that when
>> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a ramdisk, the
>> IPIs from performing unmap's is a bottleneck at 5 guests (at which point
>> 650,000 IOPS are being performed in total). If more than 5 guests are used,
>> the performance declines. By 10 guests, only
>> 400,000 IOPS are being performed.
>>
>> This patch improves performance by only unmapping when the connection
>> between blkfront and back is broken.
>
> I assume network drivers would suffer from the same affliction... Would a more general persistent map solution be worth considering (or be possible)? So a common interface to this persistent mapping allowing the persistent pool to be shared between all drivers in the DomU?

Yes, there are plans to implement the same for network drivers. I would
generally avoid having a shared pool of grants for all the devices of a
DomU, as explained in the description of the patch:

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known a priori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page for every gref we receive. This
operation takes O(log n) time in the worst case.

Having a shared pool with all grants would mean that n would become much
higher, and so the search time for a grant would increase. Also, if the
pool is shared, some kind of concurrency control would have to be added,
which would make it even slower.

What should be shared are the structures used to save the grants (struct
grant and struct persistent_gnt for the front and back drivers
respectively), and the red-black tree implementation (foreach_grant,
add_persistent_gnt and get_persistent_gnt). These functions should be
moved to a common header for the net implementation.
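
A hypothetical sketch of what such a header could look like (the path
and the exact split are illustrative, not part of this patch):

    /* include/xen/persistent_grants.h */
    struct persistent_gnt {
            struct page *page;
            grant_ref_t gnt;
            grant_handle_t handle;
            uint64_t dev_bus_addr;
            struct rb_node node;
    };

    void add_persistent_gnt(struct rb_root *root,
                            struct persistent_gnt *persistent_gnt);
    struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
                                              grant_ref_t gref);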

2012-10-22 13:47:30

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers

On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
> This patch implements persistent grants for the xen-blk{front,back}
> mechanism. The effect of this change is to reduce the number of unmap
> operations performed, since they cause a (costly) TLB shootdown. This
> allows the I/O performance to scale better when a large number of VMs
> are performing I/O.
>
> Previously, the blkfront driver was supplied a bvec[] from the request
> queue. This was granted to dom0; dom0 performed the I/O and wrote
> directly into the grant-mapped memory and unmapped it; blkfront then
> removed foreign access for that grant. The cost of unmapping scales
> badly with the number of CPUs in Dom0. An experiment showed that when
> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
> (at which point 650,000 IOPS are being performed in total). If more
> than 5 guests are used, the performance declines. By 10 guests, only
> 400,000 IOPS are being performed.
>
> This patch improves performance by only unmapping when the connection
> between blkfront and back is broken.
>
> On startup, blkfront notifies blkback that it is using persistent
> grants, and blkback will do the same. If blkback is not capable of
> persistent mapping, blkfront will still use the same set of grants,
> since this is compatible with the previous protocol and keeps the
> blkfront code simple.
>
> To perform a read, in persistent mode, blkfront uses a separate pool
> of pages that it maps to dom0. When a request comes in, blkfront
> transmutes the request so that blkback will write into one of these
> free pages. Blkback keeps note of which grefs it has already
> mapped. When a new ring request comes to blkback, it looks to see if
> it has already mapped that page. If so, it will not map it again. If
> the page hasn't been previously mapped, it is mapped now, and a record
> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> notified that blkback has completed a request, it memcpys from the
> shared memory into the supplied bvec. A record is kept that the
> {gref, page} tuple is mapped but not in flight.
>
> Writes are similar, except that the memcpy is performed from the
> supplied bvecs into the shared pages, before the request is put onto
> the ring.
>
> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> a red-black tree. As the grefs are not known a priori, and provide no
> guarantees on their ordering, we have to perform a search
> through this tree to find the page for every gref we receive. This
> operation takes O(log n) time in the worst case.

Might want to mention how blkfront stores it as well. It looks
to be using 'llist' instead of 'list'? Any particular reason?

>
> The maximum number of grants that blkback will persistently map is
> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> prevent a malicious guest from attempting a DoS by supplying fresh
> grefs and causing the Dom0 kernel to map excessively. If a guest
> is using persistent grants and exceeds the maximum number of grants to
> map persistently, the newly passed grefs will be mapped and unmapped.
> Using this approach, we can have requests that mix persistent and
> non-persistent grants, and we need to handle them correctly.
> This allows us to set the maximum number of persistent grants to a
> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
> doing so will lead to unpredictable performance.
>
> In writing this patch, the question arises as to whether the additional
> cost of performing memcpys in the guest (to/from the pool of granted
> pages) outweighs the gains of not performing TLB shootdowns. The answer
> to that question is `no'. There appears to be very little, if any,
> additional cost to the guest of using persistent grants. There is
> perhaps a small saving from the reduced number of hypercalls
> performed in granting and ending foreign access.
>
> Signed-off-by: Oliver Chick <[email protected]>
> Signed-off-by: Roger Pau Monne <[email protected]>
> Cc: <[email protected]>
> Cc: <[email protected]>
> ---
> Benchmarks showing the impact of this patch on blk performance can be
> found at:
>
> http://xenbits.xensource.com/people/royger/persistent_grants/
> ---
> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
> drivers/block/xen-blkback/common.h | 17 ++
> drivers/block/xen-blkback/xenbus.c | 16 ++-
> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
> 4 files changed, 442 insertions(+), 53 deletions(-)
>
> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> index c6decb9..2b982b2 100644
> --- a/drivers/block/xen-blkback/blkback.c
> +++ b/drivers/block/xen-blkback/blkback.c
> @@ -78,6 +78,7 @@ struct pending_req {
> unsigned short operation;
> int status;
> struct list_head free_list;
> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> };
>
> #define BLKBACK_INVALID_HANDLE (~0)
> @@ -98,6 +99,30 @@ struct xen_blkbk {
> static struct xen_blkbk *blkbk;
>
> /*
> + * Maximum number of grant pages that can be mapped in blkback.
> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> + * pages that blkback will persistently map.
> + */
> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
> +{
> + switch (protocol) {
> + case BLKIF_PROTOCOL_NATIVE:
> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + case BLKIF_PROTOCOL_X86_32:
> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + case BLKIF_PROTOCOL_X86_64:
> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
> + BLKIF_MAX_SEGMENTS_PER_REQUEST;

Could you include in the comments what the size (bytes) you expect it to be?
If that would require you re-doing some tests - then don't worry - but
I figured you have some notes scribbled away that have the exact values
down.

> + default:
> + BUG();
> + }
> + return 0;
> +}
> +
> +
> +/*
> * Little helpful macro to figure out the index and virtual address of the
> * pending_pages[..]. For each 'pending_req' we have have up to
> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> static void make_response(struct xen_blkif *blkif, u64 id,
> unsigned short op, int st);
>
> +#define foreach_grant(pos, rbtree, node) \
> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
> + &(pos)->node != NULL; \
> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
> +
> +
> +static void add_persistent_gnt(struct rb_root *root,
> + struct persistent_gnt *persistent_gnt)
> +{
> + struct rb_node **new = &(root->rb_node), *parent = NULL;
> + struct persistent_gnt *this;
> +
> + /* Figure out where to put new node */
> + while (*new) {
> + this = container_of(*new, struct persistent_gnt, node);
> +
> + parent = *new;
> + if (persistent_gnt->gnt < this->gnt)
> + new = &((*new)->rb_left);
> + else if (persistent_gnt->gnt > this->gnt)
> + new = &((*new)->rb_right);
> + else {
> + pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
> + BUG();
> + }
> + }
> +
> + /* Add new node and rebalance tree. */
> + rb_link_node(&(persistent_gnt->node), parent, new);
> + rb_insert_color(&(persistent_gnt->node), root);
> +}
> +
> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
> + grant_ref_t gref)
> +{
> + struct persistent_gnt *data;
> + struct rb_node *node = root->rb_node;
> +
> + while (node) {
> + data = container_of(node, struct persistent_gnt, node);
> +
> + if (gref < data->gnt)
> + node = node->rb_left;
> + else if (gref > data->gnt)
> + node = node->rb_right;
> + else
> + return data;
> + }
> + return NULL;
> +}
> +
> /*
> * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
> */
> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
> {
> struct xen_blkif *blkif = arg;
> struct xen_vbd *vbd = &blkif->vbd;
> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct persistent_gnt *persistent_gnt;
> + int ret = 0;
> + int segs_to_unmap = 0;
>
> xen_blkif_get(blkif);
>
> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
> print_stats(blkif);
> }
>
> + /* Free all persistent grant pages */
> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {

Just for sanity - you did check this with blkfronts that did not have
persistent grants enabled, right?

> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
> + gnttab_set_unmap_op(&unmap[segs_to_unmap],
> + (unsigned long) pfn_to_kaddr(page_to_pfn(
> + persistent_gnt->page)),
> + GNTMAP_host_map,
> + persistent_gnt->handle);
> +
> + pages[segs_to_unmap] = persistent_gnt->page;
> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
> + kfree(persistent_gnt);
> + blkif->persistent_gnt_c--;
> +
> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
> + !rb_next(&persistent_gnt->node)) {
> + ret = gnttab_unmap_refs(unmap, NULL, pages,
> + segs_to_unmap);
> + BUG_ON(ret);
> + segs_to_unmap = 0;
> + }
> + }
> +
> + BUG_ON(blkif->persistent_gnt_c != 0);
> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
> +
> if (log_stats)
> print_stats(blkif);
>
> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
> int ret;
>
> for (i = 0; i < req->nr_pages; i++) {
> + if (!req->unmap_seg[i])
> + continue;

Perhaps there should be a #define for that array..
> handle = pending_handle(req, i);
> if (handle == BLKBACK_INVALID_HANDLE)
> continue;
> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
>
> static int xen_blkbk_map(struct blkif_request *req,
> struct pending_req *pending_req,
> - struct seg_buf seg[])
> + struct seg_buf seg[],
> + struct page *pages[])
> {
> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> - int i;
> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct persistent_gnt *persistent_gnt = NULL;
> + struct xen_blkif *blkif = pending_req->blkif;
> + phys_addr_t addr = 0;
> + int i, j;
> + int new_map;

Just use a bool for this.
> int nseg = req->u.rw.nr_segments;
> + int segs_to_map = 0;
> int ret = 0;
> + int use_persistent_gnts;
> +
> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
> +
> + BUG_ON(blkif->persistent_gnt_c >
> + max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>
> /*
> * Fill out preq.nr_sects with proper amount of sectors, and setup
> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
> for (i = 0; i < nseg; i++) {
> uint32_t flags;
>
> - flags = GNTMAP_host_map;
> - if (pending_req->operation != BLKIF_OP_READ)
> - flags |= GNTMAP_readonly;
> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
> - req->u.rw.seg[i].gref,
> - pending_req->blkif->domid);
> + if (use_persistent_gnts)
> + persistent_gnt = get_persistent_gnt(
> + &blkif->persistent_gnts,
> + req->u.rw.seg[i].gref);
> +
> + if (persistent_gnt) {
> + /*
> + * We are using persistent grants and
> + * the grant is already mapped
> + */
> + new_map = 0;
> + } else if (use_persistent_gnts &&
> + blkif->persistent_gnt_c <
> + max_mapped_grant_pages(blkif->blk_protocol)) {
> + /*
> + * We are using persistent grants, the grant is
> + * not mapped but we have room for it
> + */
> + new_map = 1;
> + persistent_gnt = kzalloc(
> + sizeof(struct persistent_gnt),
> + GFP_KERNEL);
> + if (!persistent_gnt)
> + return -ENOMEM;
> + persistent_gnt->page = alloc_page(GFP_KERNEL);
> + if (!persistent_gnt->page) {
> + kfree(persistent_gnt);
> + return -ENOMEM;
> + }
> + persistent_gnt->gnt = req->u.rw.seg[i].gref;
> +
> + pages_to_gnt[segs_to_map] =
> + persistent_gnt->page;
> + addr = (unsigned long) pfn_to_kaddr(
> + page_to_pfn(persistent_gnt->page));
> +
> + add_persistent_gnt(&blkif->persistent_gnts,
> + persistent_gnt);
> + blkif->persistent_gnt_c++;
> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
> + persistent_gnt->gnt, blkif->persistent_gnt_c,
> + max_mapped_grant_pages(blkif->blk_protocol));
> + } else {
> + /*
> + * We are either using persistent grants and
> + * hit the maximum limit of grants mapped,
> + * or we are not using persistent grants.
> + */
> + if (use_persistent_gnts &&
> + !blkif->vbd.overflow_max_grants) {
> + blkif->vbd.overflow_max_grants = 1;
> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> + blkif->domid, blkif->vbd.handle);
> + }
> + new_map = 1;
> + pages[i] = blkbk->pending_page(pending_req, i);
> + addr = vaddr(pending_req, i);
> + pages_to_gnt[segs_to_map] =
> + blkbk->pending_page(pending_req, i);
> + }
> +
> + if (persistent_gnt) {
> + pages[i] = persistent_gnt->page;
> + persistent_gnts[i] = persistent_gnt;
> + } else {
> + persistent_gnts[i] = NULL;
> + }
> +
> + if (new_map) {
> + flags = GNTMAP_host_map;
> + if (!persistent_gnt &&
> + (pending_req->operation != BLKIF_OP_READ))
> + flags |= GNTMAP_readonly;
> + gnttab_set_map_op(&map[segs_to_map++], addr,
> + flags, req->u.rw.seg[i].gref,
> + blkif->domid);
> + }
> }
>
> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
> - BUG_ON(ret);
> + if (segs_to_map) {
> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
> + BUG_ON(ret);
> + }
>
> /*
> * Now swizzle the MFN in our domain with the MFN from the other domain
> * so that when we access vaddr(pending_req,i) it has the contents of
> * the page from the other domain.
> */
> - for (i = 0; i < nseg; i++) {
> - if (unlikely(map[i].status != 0)) {
> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> - map[i].handle = BLKBACK_INVALID_HANDLE;
> - ret |= 1;
> + for (i = 0, j = 0; i < nseg; i++) {
> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
> + /* This is a newly mapped grant */
> + BUG_ON(j >= segs_to_map);
> + if (unlikely(map[j].status != 0)) {

I am not seeing j being incremented anywhere? Should it?

> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> + map[j].handle = BLKBACK_INVALID_HANDLE;
> + ret |= 1;
> + if (persistent_gnts[i]) {
> + rb_erase(&persistent_gnts[i]->node,
> + &blkif->persistent_gnts);
> + blkif->persistent_gnt_c--;
> + kfree(persistent_gnts[i]);
> + persistent_gnts[i] = NULL;
> + }
> + }
> + }
> + if (persistent_gnts[i]) {
> + if (!persistent_gnts[i]->handle) {
> + /*
> + * If this is a new persistent grant
> + * save the handle
> + */
> + persistent_gnts[i]->handle = map[j].handle;
> + persistent_gnts[i]->dev_bus_addr =
> + map[j++].dev_bus_addr;
> + }
> + pending_handle(pending_req, i) =
> + persistent_gnts[i]->handle;
> + pending_req->unmap_seg[i] = 0;

Could we have a #define for that?
> +
> + if (ret)
> + continue;
> +
> + seg[i].buf = persistent_gnts[i]->dev_bus_addr |
> + (req->u.rw.seg[i].first_sect << 9);
> + } else {
> + pending_handle(pending_req, i) = map[j].handle;
> + pending_req->unmap_seg[i] = 1;

And here as well?
> +
> + if (ret)
> + continue;
> +
> + seg[i].buf = map[j++].dev_bus_addr |
> + (req->u.rw.seg[i].first_sect << 9);
> }
> -
> - pending_handle(pending_req, i) = map[i].handle;
> -
> - if (ret)
> - continue;
> -
> - seg[i].buf = map[i].dev_bus_addr |
> - (req->u.rw.seg[i].first_sect << 9);
> }
> return ret;
> }
> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> int operation;
> struct blk_plug plug;
> bool drain = false;
> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>
> switch (req->operation) {
> case BLKIF_OP_READ:
> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> * the hypercall to unmap the grants - that is all done in
> * xen_blkbk_unmap.
> */
> - if (xen_blkbk_map(req, pending_req, seg))
> + if (xen_blkbk_map(req, pending_req, seg, pages))
> goto fail_flush;
>
> /*
> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> for (i = 0; i < nseg; i++) {
> while ((bio == NULL) ||
> (bio_add_page(bio,
> - blkbk->pending_page(pending_req, i),
> + pages[i],
> seg[i].nsec << 9,
> seg[i].buf & ~PAGE_MASK) == 0)) {
>
> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> index 9ad3b5e..6c08ee9 100644
> --- a/drivers/block/xen-blkback/common.h
> +++ b/drivers/block/xen-blkback/common.h
> @@ -34,6 +34,7 @@
> #include <linux/vmalloc.h>
> #include <linux/wait.h>
> #include <linux/io.h>
> +#include <linux/rbtree.h>
> #include <asm/setup.h>
> #include <asm/pgalloc.h>
> #include <asm/hypervisor.h>
> @@ -160,10 +161,22 @@ struct xen_vbd {
> sector_t size;
> bool flush_support;
> bool discard_secure;
> +
> + unsigned int feature_gnt_persistent:1;
> + unsigned int overflow_max_grants:1;

I think the v3.7-rc1 has this structure changed just a tiny bit, so you
might want to rebase it on top of that.

> };
>
> struct backend_info;
>
> +
> +struct persistent_gnt {
> + struct page *page;
> + grant_ref_t gnt;
> + grant_handle_t handle;
> + uint64_t dev_bus_addr;
> + struct rb_node node;
> +};
> +
> struct xen_blkif {
> /* Unique identifier for this interface. */
> domid_t domid;
> @@ -190,6 +203,10 @@ struct xen_blkif {
> struct task_struct *xenblkd;
> unsigned int waiting_reqs;
>
> + /* frontend feature information */

Huh?

> + struct rb_root persistent_gnts;
> + unsigned int persistent_gnt_c;
> +
> /* statistics */
> unsigned long st_print;
> int st_rd_req;
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 4f66171..9f88b4e 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
> atomic_set(&blkif->drain, 0);
> blkif->st_print = jiffies;
> init_waitqueue_head(&blkif->waiting_to_free);
> + blkif->persistent_gnts.rb_node = NULL;
>
> return blkif;
> }
> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
> struct xenbus_device *dev = be->dev;
> unsigned long ring_ref;
> unsigned int evtchn;
> + unsigned int pers_grants;
> char protocol[64] = "";
> int err;
>
> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
> return -1;
> }
> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
> - ring_ref, evtchn, be->blkif->blk_protocol, protocol);
> + err = xenbus_gather(XBT_NIL, dev->otherend,
> + "feature-persistent-grants", "%u",
> + &pers_grants, NULL);
> + if (err)
> + pers_grants = 0;
> +
> + be->blkif->vbd.feature_gnt_persistent = pers_grants;
> + be->blkif->vbd.overflow_max_grants = 0;
> +
> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
> + ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> + pers_grants);

Can you make that a string? So it is
pers_grants ? "persistent grants" : ""

instead of a zero or one value pls?
>
> /* Map the shared frame, irq etc. */
> err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 2c2d2e5..206d422 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -44,6 +44,7 @@
> #include <linux/mutex.h>
> #include <linux/scatterlist.h>
> #include <linux/bitmap.h>
> +#include <linux/llist.h>
>
> #include <xen/xen.h>
> #include <xen/xenbus.h>
> @@ -64,10 +65,17 @@ enum blkif_state {
> BLKIF_STATE_SUSPENDED,
> };
>
> +struct grant {
> + grant_ref_t gref;
> + unsigned long pfn;
> + struct llist_node node;
> +};
> +
> struct blk_shadow {
> struct blkif_request req;
> struct request *request;
> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> };
>
> static DEFINE_MUTEX(blkfront_mutex);
> @@ -97,6 +105,8 @@ struct blkfront_info
> struct work_struct work;
> struct gnttab_free_callback callback;
> struct blk_shadow shadow[BLK_RING_SIZE];
> + struct llist_head persistent_gnts;
> + unsigned int persistent_gnts_c;
> unsigned long shadow_free;
> unsigned int feature_flush;
> unsigned int flush_op;
> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
> unsigned long id;
> unsigned int fsect, lsect;
> int i, ref;
> +
> + /*
> + * Used to store if we are able to queue the request by just using
> + * existing persistent grants (0), or if we have to get new grants,

What does the zero mean?
> + * as there are not sufficiently many free.
> + */
> + int new_persistent_gnts;

I think this can be a bool?
> grant_ref_t gref_head;
> + struct page *granted_page;
> + struct grant *gnt_list_entry = NULL;
> struct scatterlist *sg;
>
> if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> return 1;
>
> - if (gnttab_alloc_grant_references(
> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> - gnttab_request_free_callback(
> - &info->callback,
> - blkif_restart_queue_callback,
> - info,
> - BLKIF_MAX_SEGMENTS_PER_REQUEST);
> - return 1;
> - }
> + /* Check if we have enough grants to allocate a request */
> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> + new_persistent_gnts = 1;
> + if (gnttab_alloc_grant_references(
> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
> + &gref_head) < 0) {
> + gnttab_request_free_callback(
> + &info->callback,
> + blkif_restart_queue_callback,
> + info,
> + BLKIF_MAX_SEGMENTS_PER_REQUEST);
> + return 1;
> + }
> + } else
> + new_persistent_gnts = 0;
>
> /* Fill out a communications ring structure. */
> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
> BLKIF_MAX_SEGMENTS_PER_REQUEST);
>
> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
> fsect = sg->offset >> 9;
> lsect = fsect + (sg->length >> 9) - 1;
> - /* install a grant reference. */
> - ref = gnttab_claim_grant_reference(&gref_head);
> - BUG_ON(ref == -ENOSPC);
>
> - gnttab_grant_foreign_access_ref(
> - ref,
> + if (info->persistent_gnts_c) {
> + BUG_ON(llist_empty(&info->persistent_gnts));
> + gnt_list_entry = llist_entry(
> + llist_del_first(&info->persistent_gnts),
> + struct grant, node);
> +
> + ref = gnt_list_entry->gref;
> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> + info->persistent_gnts_c--;
> + } else {
> + ref = gnttab_claim_grant_reference(&gref_head);
> + BUG_ON(ref == -ENOSPC);
> +
> + gnt_list_entry =
> + kmalloc(sizeof(struct grant),
> + GFP_ATOMIC);
> + if (!gnt_list_entry)
> + return -ENOMEM;
> +
> + granted_page = alloc_page(GFP_ATOMIC);
> + if (!granted_page) {
> + kfree(gnt_list_entry);
> + return -ENOMEM;
> + }
> +
> + gnt_list_entry->pfn =
> + page_to_pfn(granted_page);
> + gnt_list_entry->gref = ref;
> +
> + buffer_mfn = pfn_to_mfn(page_to_pfn(
> + granted_page));
> + gnttab_grant_foreign_access_ref(ref,
> info->xbdev->otherend_id,
> - buffer_mfn,
> - rq_data_dir(req));
> + buffer_mfn, 0);
> + }
> +
> + info->shadow[id].grants_used[i] = gnt_list_entry;
> +
> + if (rq_data_dir(req)) {
> + char *bvec_data;
> + void *shared_data;
> +
> + BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> +
> + shared_data = kmap_atomic(
> + pfn_to_page(gnt_list_entry->pfn));
> + bvec_data = kmap_atomic(sg_page(sg));
> +
> + /*
> + * this does not wipe data stored outside the
> + * range sg->offset..sg->offset+sg->length.
> + * Therefore, blkback *could* see data from
> + * previous requests. This is OK as long as
> + * persistent grants are shared with just one
> + * domain. It may need refactoring if This
.. this (lowercase it pls)

> + * changes
> + */
> + memcpy(shared_data + sg->offset,
> + bvec_data + sg->offset,
> + sg->length);
> +
> + kunmap_atomic(bvec_data);
> + kunmap_atomic(shared_data);
> + }
>
> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> ring_req->u.rw.seg[i] =
> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
> /* Keep a private copy so we can reissue requests when recovering. */
> info->shadow[id].req = *ring_req;
>
> - gnttab_free_grant_references(gref_head);
> + if (new_persistent_gnts)
> + gnttab_free_grant_references(gref_head);
>
> return 0;
> }
> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> static void xlvbd_flush(struct blkfront_info *info)
> {
> blk_queue_flush(info->rq, info->feature_flush);
> - printk(KERN_INFO "blkfront: %s: %s: %s\n",
> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",

HA! By default, eh?
> info->gd->disk_name,
> info->flush_op == BLKIF_OP_WRITE_BARRIER ?
> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
>
> static void blkif_free(struct blkfront_info *info, int suspend)
> {
> + struct llist_node *all_gnts;
> + struct grant *persistent_gnt;
> +
> /* Prevent new requests being issued until we fix things up. */
> spin_lock_irq(&info->io_lock);
> info->connected = suspend ?
> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> /* No more blkif_request(). */
> if (info->rq)
> blk_stop_queue(info->rq);
> +
> + /* Remove all persistent grants */
> + if (info->persistent_gnts_c) {
> + all_gnts = llist_del_all(&info->persistent_gnts);
> + llist_for_each_entry(persistent_gnt, all_gnts, node) {
> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> + kfree(persistent_gnt);
> + }
> + info->persistent_gnts_c = 0;
> + }
> +
> /* No more gnttab callback work. */
> gnttab_cancel_free_callback(&info->callback);
> spin_unlock_irq(&info->io_lock);
> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>
> }
>
> -static void blkif_completion(struct blk_shadow *s)
> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> + struct blkif_response *bret)
> {
> int i;
> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> - * flag. */
> - for (i = 0; i < s->req.u.rw.nr_segments; i++)
> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> + struct bio_vec *bvec;
> + struct req_iterator iter;
> + unsigned long flags;
> + char *bvec_data;
> + void *shared_data;
> + unsigned int offset = 0;
> +
> + if (bret->operation == BLKIF_OP_READ) {
> + /*
> + * Copy the data received from the backend into the bvec.
> + * Since bv_len can be different from PAGE_SIZE, we need to
> + * be sure we are actually copying the data from the right
> + * shared page.
> + */
> + rq_for_each_segment(bvec, s->request, iter) {
> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
> + i = offset >> PAGE_SHIFT;

Could you also include a comment about the bug you found here, pls?
> + shared_data = kmap_atomic(
> + pfn_to_page(s->grants_used[i]->pfn));
> + bvec_data = bvec_kmap_irq(bvec, &flags);
> + memcpy(bvec_data, shared_data + bvec->bv_offset,
> + bvec->bv_len);
> + bvec_kunmap_irq(bvec_data, &flags);
> + kunmap_atomic(shared_data);
> + offset += bvec->bv_len;
> + }
> + }
> + /* Add the persistent grant into the list of free grants */
> + for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
> + info->persistent_gnts_c++;
> + }
> }
>
> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> req = info->shadow[id].request;
>
> if (bret->operation != BLKIF_OP_DISCARD)
> - blkif_completion(&info->shadow[id]);
> + blkif_completion(&info->shadow[id], info, bret);
>
> if (add_id_to_freelist(info, id)) {
> WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> @@ -942,6 +1066,11 @@ again:
> message = "writing protocol";
> goto abort_transaction;
> }
> + err = xenbus_printf(xbt, dev->nodename,
> + "feature-persistent-grants", "%d", 1);

So it's %u in blkback, but %d in here? Which one should it be?

> + if (err)
> + dev_warn(&dev->dev,
> + "writing persistent grants feature to xenbus");
>
> err = xenbus_transaction_end(xbt, 0);
> if (err) {
> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
> spin_lock_init(&info->io_lock);
> info->xbdev = dev;
> info->vdevice = vdevice;
> + init_llist_head(&info->persistent_gnts);
> + info->persistent_gnts_c = 0;
> info->connected = BLKIF_STATE_DISCONNECTED;
> INIT_WORK(&info->work, blkif_restart_queue);
>
> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
> req->u.rw.seg[j].gref,
> info->xbdev->otherend_id,
> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> - rq_data_dir(info->shadow[req->u.rw.id].request));
> + 0);
> }
> info->shadow[req->u.rw.id].req = *req;
>
> --
> 1.7.7.5 (Apple Git-26)
>

2012-10-23 16:07:42

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers

On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
>> This patch implements persistent grants for the xen-blk{front,back}
>> mechanism. The effect of this change is to reduce the number of unmap
>> operations performed, since they cause a (costly) TLB shootdown. This
>> allows the I/O performance to scale better when a large number of VMs
>> are performing I/O.
>>
>> Previously, the blkfront driver was supplied a bvec[] from the request
>> queue. This was granted to dom0; dom0 performed the I/O and wrote
>> directly into the grant-mapped memory and unmapped it; blkfront then
>> removed foreign access for that grant. The cost of unmapping scales
>> badly with the number of CPUs in Dom0. An experiment showed that when
>> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
>> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
>> (at which point 650,000 IOPS are being performed in total). If more
>> than 5 guests are used, the performance declines. By 10 guests, only
>> 400,000 IOPS are being performed.
>>
>> This patch improves performance by only unmapping when the connection
>> between blkfront and back is broken.
>>
>> On startup blkfront notifies blkback that it is using persistent
>> grants, and blkback will do the same. If blkback is not capable of
>> persistent mapping, blkfront will still use the same grants, since it
>> is compatible with the previous protocol, and simplifies the code
>> complexity in blkfront.
>>
>> To perform a read, in persistent mode, blkfront uses a separate pool
>> of pages that it maps to dom0. When a request comes in, blkfront
>> transmutes the request so that blkback will write into one of these
>> free pages. Blkback keeps note of which grefs it has already
>> mapped. When a new ring request comes to blkback, it looks to see if
>> it has already mapped that page. If so, it will not map it again. If
>> the page hasn't been previously mapped, it is mapped now, and a record
>> is kept of this mapping. Blkback proceeds as usual. When blkfront is
>> notified that blkback has completed a request, it memcpy's from the
>> shared memory, into the bvec supplied. A record that the {gref, page}
>> tuple is mapped, and not inflight is kept.
>>
>> Writes are similar, except that the memcpy is performed from the
>> supplied bvecs, into the shared pages, before the request is put onto
>> the ring.
>>
>> Blkback stores a mapping of grefs=>{page mapped to by gref} in
>> a red-black tree. As the grefs are not known a priori, and provide no
>> guarantees on their ordering, we have to perform a search
>> through this tree to find the page, for every gref we receive. This
>> operation takes O(log n) time in the worst case.
>
> Might want to mention how blkfront stores it as well. It looks
> to be using 'llist' instead of 'list'? Any particular reason?

Since we are just pushing and popping grant references, I went for what I
think is the simplest one, a singly linked list (list is a doubly linked
list). Oliver's previous version used something similar, but custom-made.
I think it's best to use the data structures provided by the kernel
itself.
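
The pattern in blkfront reduces to head pushes and head pops, which is
exactly what llist provides; a condensed sketch of the code already in
the patch (error handling elided):

    /* completion path: return a grant to the free pool */
    llist_add(&gnt->node, &info->persistent_gnts);
    info->persistent_gnts_c++;

    /* queueing path: take a grant; persistent_gnts_c guards emptiness */
    gnt = llist_entry(llist_del_first(&info->persistent_gnts),
                      struct grant, node);
    info->persistent_gnts_c--;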

>
>>
>> The maximum number of grants that blkback will persistently map is
>> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
>> prevent a malicious guest from attempting a DoS, by supplying fresh
>> grefs, causing the Dom0 kernel to map excessively. If a guest
>> is using persistent grants and exceeds the maximum number of grants to
>> map persistently, the newly passed grefs will be mapped and unmapped.
>> Using this approach, we can have requests that mix persistent and
>> non-persistent grants, and we need to handle them correctly.
>> This allows us to set the maximum number of persistent grants to a
>> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
>> setting it will lead to unpredictable performance.
>>
>> In writing this patch, the question arises as to whether the additional
>> cost of performing memcpys in the guest (to/from the pool of granted
>> pages) outweigh the gains of not performing TLB shootdowns. The answer
>> to that question is `no'. There appears to be very little, if any
>> additional cost to the guest of using persistent grants. There is
>> perhaps a small saving, from the reduced number of hypercalls
>> performed in granting, and ending foreign access.
>>
>> Signed-off-by: Oliver Chick <[email protected]>
>> Signed-off-by: Roger Pau Monne <[email protected]>
>> Cc: <[email protected]>
>> Cc: <[email protected]>
>> ---
>> Benchmarks showing the impact of this patch in blk performance can be
>> found at:
>>
>> http://xenbits.xensource.com/people/royger/persistent_grants/
>> ---
>> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
>> drivers/block/xen-blkback/common.h | 17 ++
>> drivers/block/xen-blkback/xenbus.c | 16 ++-
>> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
>> 4 files changed, 442 insertions(+), 53 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> index c6decb9..2b982b2 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -78,6 +78,7 @@ struct pending_req {
>> unsigned short operation;
>> int status;
>> struct list_head free_list;
>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> };
>>
>> #define BLKBACK_INVALID_HANDLE (~0)
>> @@ -98,6 +99,30 @@ struct xen_blkbk {
>> static struct xen_blkbk *blkbk;
>>
>> /*
>> + * Maximum number of grant pages that can be mapped in blkback.
>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
>> + * pages that blkback will persistently map.
>> + */
>> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
>> +{
>> + switch (protocol) {
>> + case BLKIF_PROTOCOL_NATIVE:
>> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> + case BLKIF_PROTOCOL_X86_32:
>> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> + case BLKIF_PROTOCOL_X86_64:
>> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
>
> Could you include in the comments what the size (bytes) you expect it to be?
> If that would require you re-doing some tests - then don't worry - but
> I figured you have some notes scribbled away that have the exact values
> down.

As far as I know and remember (I've checked the ring size in the past),
all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is
always 11, and sizeof(struct persistent_gnt) is 48, so that's 32 * 11 *
48 = 16896 bytes. I will add a comment with this calculation.
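
Something along these lines, assuming those numbers hold (a sketch of
the comment, not the final wording):

    /*
     * With a ring size of 32 and BLKIF_MAX_SEGMENTS_PER_REQUEST == 11,
     * at most 32 * 11 = 352 grants are persistently mapped per device.
     * Tracking them takes 352 * sizeof(struct persistent_gnt) =
     * 352 * 48 = 16896 bytes, plus the 352 granted pages themselves
     * (1408 KiB with 4 KiB pages).
     */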

>
>> + default:
>> + BUG();
>> + }
>> + return 0;
>> +}
>> +
>> +
>> +/*
>> * Little helpful macro to figure out the index and virtual address of the
>> * pending_pages[..]. For each 'pending_req' we have have up to
>> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
>> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> static void make_response(struct xen_blkif *blkif, u64 id,
>> unsigned short op, int st);
>>
>> +#define foreach_grant(pos, rbtree, node) \
>> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
>> + &(pos)->node != NULL; \
>> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
>> +
>> +
>> +static void add_persistent_gnt(struct rb_root *root,
>> + struct persistent_gnt *persistent_gnt)
>> +{
>> + struct rb_node **new = &(root->rb_node), *parent = NULL;
>> + struct persistent_gnt *this;
>> +
>> + /* Figure out where to put new node */
>> + while (*new) {
>> + this = container_of(*new, struct persistent_gnt, node);
>> +
>> + parent = *new;
>> + if (persistent_gnt->gnt < this->gnt)
>> + new = &((*new)->rb_left);
>> + else if (persistent_gnt->gnt > this->gnt)
>> + new = &((*new)->rb_right);
>> + else {
>> + pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
>> + BUG();
>> + }
>> + }
>> +
>> + /* Add new node and rebalance tree. */
>> + rb_link_node(&(persistent_gnt->node), parent, new);
>> + rb_insert_color(&(persistent_gnt->node), root);
>> +}
>> +
>> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
>> + grant_ref_t gref)
>> +{
>> + struct persistent_gnt *data;
>> + struct rb_node *node = root->rb_node;
>> +
>> + while (node) {
>> + data = container_of(node, struct persistent_gnt, node);
>> +
>> + if (gref < data->gnt)
>> + node = node->rb_left;
>> + else if (gref > data->gnt)
>> + node = node->rb_right;
>> + else
>> + return data;
>> + }
>> + return NULL;
>> +}
>> +
>> /*
>> * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>> */
>> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
>> {
>> struct xen_blkif *blkif = arg;
>> struct xen_vbd *vbd = &blkif->vbd;
>> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct persistent_gnt *persistent_gnt;
>> + int ret = 0;
>> + int segs_to_unmap = 0;
>>
>> xen_blkif_get(blkif);
>>
>> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
>> print_stats(blkif);
>> }
>>
>> + /* Free all persistent grant pages */
>> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
>
> Just for sanity - you did check this with blkfronts that did not have
> persistent grants enabled, right?

Yes, it doesn't crash, but looking at foreach_grant it seems like it
should. I've added a check before trying to iterate over the tree.
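
rb_first() returns NULL on an empty tree, and foreach_grant only escapes
that because container_of's pointer arithmetic happens to cancel back to
NULL in the loop condition, which is not something to rely on. The check
is along these lines (a sketch; the final version may differ):

    /* Free all persistent grant pages */
    if (!RB_EMPTY_ROOT(&blkif->persistent_gnts)) {
            foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
                    ...
            }
    }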

>
>> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
>> + gnttab_set_unmap_op(&unmap[segs_to_unmap],
>> + (unsigned long) pfn_to_kaddr(page_to_pfn(
>> + persistent_gnt->page)),
>> + GNTMAP_host_map,
>> + persistent_gnt->handle);
>> +
>> + pages[segs_to_unmap] = persistent_gnt->page;
>> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
>> + kfree(persistent_gnt);
>> + blkif->persistent_gnt_c--;
>> +
>> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
>> + !rb_next(&persistent_gnt->node)) {
>> + ret = gnttab_unmap_refs(unmap, NULL, pages,
>> + segs_to_unmap);
>> + BUG_ON(ret);
>> + segs_to_unmap = 0;
>> + }
>> + }
>> +
>> + BUG_ON(blkif->persistent_gnt_c != 0);
>> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
>> +
>> if (log_stats)
>> print_stats(blkif);
>>
>> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
>> int ret;
>>
>> for (i = 0; i < req->nr_pages; i++) {
>> + if (!req->unmap_seg[i])
>> + continue;
>
> Perhaps there should be a #define for that array..

Do you mean something like:

#define unmap(req, i) req->unmap_seg[i]

>> handle = pending_handle(req, i);
>> if (handle == BLKBACK_INVALID_HANDLE)
>> continue;
>> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>
>> static int xen_blkbk_map(struct blkif_request *req,
>> struct pending_req *pending_req,
>> - struct seg_buf seg[])
>> + struct seg_buf seg[],
>> + struct page *pages[])
>> {
>> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> - int i;
>> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct persistent_gnt *persistent_gnt = NULL;
>> + struct xen_blkif *blkif = pending_req->blkif;
>> + phys_addr_t addr = 0;
>> + int i, j;
>> + int new_map;
>
> Just use a bool for this.

Done

>> int nseg = req->u.rw.nr_segments;
>> + int segs_to_map = 0;
>> int ret = 0;
>> + int use_persistent_gnts;
>> +
>> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
>> +
>> + BUG_ON(blkif->persistent_gnt_c >
>> + max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>>
>> /*
>> * Fill out preq.nr_sects with proper amount of sectors, and setup
>> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
>> for (i = 0; i < nseg; i++) {
>> uint32_t flags;
>>
>> - flags = GNTMAP_host_map;
>> - if (pending_req->operation != BLKIF_OP_READ)
>> - flags |= GNTMAP_readonly;
>> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
>> - req->u.rw.seg[i].gref,
>> - pending_req->blkif->domid);
>> + if (use_persistent_gnts)
>> + persistent_gnt = get_persistent_gnt(
>> + &blkif->persistent_gnts,
>> + req->u.rw.seg[i].gref);
>> +
>> + if (persistent_gnt) {
>> + /*
>> + * We are using persistent grants and
>> + * the grant is already mapped
>> + */
>> + new_map = 0;
>> + } else if (use_persistent_gnts &&
>> + blkif->persistent_gnt_c <
>> + max_mapped_grant_pages(blkif->blk_protocol)) {
>> + /*
>> + * We are using persistent grants, the grant is
>> + * not mapped but we have room for it
>> + */
>> + new_map = 1;
>> + persistent_gnt = kzalloc(
>> + sizeof(struct persistent_gnt),
>> + GFP_KERNEL);
>> + if (!persistent_gnt)
>> + return -ENOMEM;
>> + persistent_gnt->page = alloc_page(GFP_KERNEL);
>> + if (!persistent_gnt->page) {
>> + kfree(persistent_gnt);
>> + return -ENOMEM;
>> + }
>> + persistent_gnt->gnt = req->u.rw.seg[i].gref;
>> +
>> + pages_to_gnt[segs_to_map] =
>> + persistent_gnt->page;
>> + addr = (unsigned long) pfn_to_kaddr(
>> + page_to_pfn(persistent_gnt->page));
>> +
>> + add_persistent_gnt(&blkif->persistent_gnts,
>> + persistent_gnt);
>> + blkif->persistent_gnt_c++;
>> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
>> + persistent_gnt->gnt, blkif->persistent_gnt_c,
>> + max_mapped_grant_pages(blkif->blk_protocol));
>> + } else {
>> + /*
>> + * We are either using persistent grants and
>> + * hit the maximum limit of grants mapped,
>> + * or we are not using persistent grants.
>> + */
>> + if (use_persistent_gnts &&
>> + !blkif->vbd.overflow_max_grants) {
>> + blkif->vbd.overflow_max_grants = 1;
>> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>> + blkif->domid, blkif->vbd.handle);
>> + }
>> + new_map = 1;
>> + pages[i] = blkbk->pending_page(pending_req, i);
>> + addr = vaddr(pending_req, i);
>> + pages_to_gnt[segs_to_map] =
>> + blkbk->pending_page(pending_req, i);
>> + }
>> +
>> + if (persistent_gnt) {
>> + pages[i] = persistent_gnt->page;
>> + persistent_gnts[i] = persistent_gnt;
>> + } else {
>> + persistent_gnts[i] = NULL;
>> + }
>> +
>> + if (new_map) {
>> + flags = GNTMAP_host_map;
>> + if (!persistent_gnt &&
>> + (pending_req->operation != BLKIF_OP_READ))
>> + flags |= GNTMAP_readonly;
>> + gnttab_set_map_op(&map[segs_to_map++], addr,
>> + flags, req->u.rw.seg[i].gref,
>> + blkif->domid);
>> + }
>> }
>>
>> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
>> - BUG_ON(ret);
>> + if (segs_to_map) {
>> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
>> + BUG_ON(ret);
>> + }
>>
>> /*
>> * Now swizzle the MFN in our domain with the MFN from the other domain
>> * so that when we access vaddr(pending_req,i) it has the contents of
>> * the page from the other domain.
>> */
>> - for (i = 0; i < nseg; i++) {
>> - if (unlikely(map[i].status != 0)) {
>> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> - map[i].handle = BLKBACK_INVALID_HANDLE;
>> - ret |= 1;
>> + for (i = 0, j = 0; i < nseg; i++) {
>> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
>> + /* This is a newly mapped grant */
>> + BUG_ON(j >= segs_to_map);
>> + if (unlikely(map[j].status != 0)) {
>
> I am not seeing j being incremented anywhere? Should it?

Yes, it should be incremented, but not here. See the comment below for
what I've changed.

>
>> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> + map[j].handle = BLKBACK_INVALID_HANDLE;
>> + ret |= 1;
>> + if (persistent_gnts[i]) {
>> + rb_erase(&persistent_gnts[i]->node,
>> + &blkif->persistent_gnts);
>> + blkif->persistent_gnt_c--;
>> + kfree(persistent_gnts[i]);
>> + persistent_gnts[i] = NULL;
>> + }
>> + }
>> + }
>> + if (persistent_gnts[i]) {
>> + if (!persistent_gnts[i]->handle) {
>> + /*
>> + * If this is a new persistent grant
>> + * save the handler
>> + */
>> + persistent_gnts[i]->handle = map[j].handle;
>> + persistent_gnts[i]->dev_bus_addr =
>> + map[j++].dev_bus_addr;
>> + }
>> + pending_handle(pending_req, i) =
>> + persistent_gnts[i]->handle;
>> + pending_req->unmap_seg[i] = 0;
>
> Could we have a #define for that?

Sure.

>> +
>> + if (ret)
>> + continue;

This should be

if (ret) {
        j++;
        continue;
}

>> +
>> + seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>> + (req->u.rw.seg[i].first_sect << 9);
>> + } else {
>> + pending_handle(pending_req, i) = map[j].handle;
>> + pending_req->unmap_seg[i] = 1;
>
> And here as well?

Done.

>> +
>> + if (ret)
>> + continue;
>> +
>> + seg[i].buf = map[j++].dev_bus_addr |
>> + (req->u.rw.seg[i].first_sect << 9);
>> }
>> -
>> - pending_handle(pending_req, i) = map[i].handle;
>> -
>> - if (ret)
>> - continue;
>> -
>> - seg[i].buf = map[i].dev_bus_addr |
>> - (req->u.rw.seg[i].first_sect << 9);
>> }
>> return ret;
>> }
>> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> int operation;
>> struct blk_plug plug;
>> bool drain = false;
>> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>> switch (req->operation) {
>> case BLKIF_OP_READ:
>> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> * the hypercall to unmap the grants - that is all done in
>> * xen_blkbk_unmap.
>> */
>> - if (xen_blkbk_map(req, pending_req, seg))
>> + if (xen_blkbk_map(req, pending_req, seg, pages))
>> goto fail_flush;
>>
>> /*
>> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>> for (i = 0; i < nseg; i++) {
>> while ((bio == NULL) ||
>> (bio_add_page(bio,
>> - blkbk->pending_page(pending_req, i),
>> + pages[i],
>> seg[i].nsec << 9,
>> seg[i].buf & ~PAGE_MASK) == 0)) {
>>
>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
>> index 9ad3b5e..6c08ee9 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -34,6 +34,7 @@
>> #include <linux/vmalloc.h>
>> #include <linux/wait.h>
>> #include <linux/io.h>
>> +#include <linux/rbtree.h>
>> #include <asm/setup.h>
>> #include <asm/pgalloc.h>
>> #include <asm/hypervisor.h>
>> @@ -160,10 +161,22 @@ struct xen_vbd {
>> sector_t size;
>> bool flush_support;
>> bool discard_secure;
>> +
>> + unsigned int feature_gnt_persistent:1;
>> + unsigned int overflow_max_grants:1;
>
> I think the v3.7-rc1 has this structure changed just a tiny bit, so you
> might want to rebase it on top of that.

I've done the rebase on top of your linux-next branch, commit
ad502612c173fff239250c9fe9bdfaaef70b9901.

>
>> };
>>
>> struct backend_info;
>>
>> +
>> +struct persistent_gnt {
>> + struct page *page;
>> + grant_ref_t gnt;
>> + grant_handle_t handle;
>> + uint64_t dev_bus_addr;
>> + struct rb_node node;
>> +};
>> +
>> struct xen_blkif {
>> /* Unique identifier for this interface. */
>> domid_t domid;
>> @@ -190,6 +203,10 @@ struct xen_blkif {
>> struct task_struct *xenblkd;
>> unsigned int waiting_reqs;
>>
>> + /* frontend feature information */
>
> Huh?

Changed it to:

/* tree to store persistent grants */

>> + struct rb_root persistent_gnts;
>> + unsigned int persistent_gnt_c;
>> +
>> /* statistics */
>> unsigned long st_print;
>> int st_rd_req;
>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
>> index 4f66171..9f88b4e 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>> atomic_set(&blkif->drain, 0);
>> blkif->st_print = jiffies;
>> init_waitqueue_head(&blkif->waiting_to_free);
>> + blkif->persistent_gnts.rb_node = NULL;
>>
>> return blkif;
>> }
>> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
>> struct xenbus_device *dev = be->dev;
>> unsigned long ring_ref;
>> unsigned int evtchn;
>> + unsigned int pers_grants;
>> char protocol[64] = "";
>> int err;
>>
>> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
>> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>> return -1;
>> }
>> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
>> - ring_ref, evtchn, be->blkif->blk_protocol, protocol);
>> + err = xenbus_gather(XBT_NIL, dev->otherend,
>> + "feature-persistent-grants", "%u",
>> + &pers_grants, NULL);
>> + if (err)
>> + pers_grants = 0;
>> +
>> + be->blkif->vbd.feature_gnt_persistent = pers_grants;
>> + be->blkif->vbd.overflow_max_grants = 0;
>> +
>> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
>> + ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>> + pers_grants);
>
> Can you make that a string? So it is
> pers_grants ? "persistent grants" : ""
>
> instead of a zero or one value pls?

Yes, done.
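
Something along these lines (a sketch of the reworked message, not the
final diff):

    pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
            ring_ref, evtchn, be->blkif->blk_protocol, protocol,
            pers_grants ? "persistent grants" : "");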

>>
>> /* Map the shared frame, irq etc. */
>> err = xen_blkif_map(be->blkif, ring_ref, evtchn);
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 2c2d2e5..206d422 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -44,6 +44,7 @@
>> #include <linux/mutex.h>
>> #include <linux/scatterlist.h>
>> #include <linux/bitmap.h>
>> +#include <linux/llist.h>
>>
>> #include <xen/xen.h>
>> #include <xen/xenbus.h>
>> @@ -64,10 +65,17 @@ enum blkif_state {
>> BLKIF_STATE_SUSPENDED,
>> };
>>
>> +struct grant {
>> + grant_ref_t gref;
>> + unsigned long pfn;
>> + struct llist_node node;
>> +};
>> +
>> struct blk_shadow {
>> struct blkif_request req;
>> struct request *request;
>> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> };
>>
>> static DEFINE_MUTEX(blkfront_mutex);
>> @@ -97,6 +105,8 @@ struct blkfront_info
>> struct work_struct work;
>> struct gnttab_free_callback callback;
>> struct blk_shadow shadow[BLK_RING_SIZE];
>> + struct llist_head persistent_gnts;
>> + unsigned int persistent_gnts_c;
>> unsigned long shadow_free;
>> unsigned int feature_flush;
>> unsigned int flush_op;
>> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
>> unsigned long id;
>> unsigned int fsect, lsect;
>> int i, ref;
>> +
>> + /*
>> + * Used to store if we are able to queue the request by just using
>> + * existing persistent grants (0), or if we have to get new grants,
>
> What does the zero mean?

Frankly, no idea; I guess it was in Oliver's patch and I failed to spot it.

>> + * as there are not sufficiently many free.
>> + */
>> + int new_persistent_gnts;
>
> I think this can be a bool?

I agree.

>> grant_ref_t gref_head;
>> + struct page *granted_page;
>> + struct grant *gnt_list_entry = NULL;
>> struct scatterlist *sg;
>>
>> if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>> return 1;
>>
>> - if (gnttab_alloc_grant_references(
>> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
>> - gnttab_request_free_callback(
>> - &info->callback,
>> - blkif_restart_queue_callback,
>> - info,
>> - BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> - return 1;
>> - }
>> + /* Check if we have enough grants to allocate a request */
>> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>> + new_persistent_gnts = 1;
>> + if (gnttab_alloc_grant_references(
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
>> + &gref_head) < 0) {
>> + gnttab_request_free_callback(
>> + &info->callback,
>> + blkif_restart_queue_callback,
>> + info,
>> + BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> + return 1;
>> + }
>> + } else
>> + new_persistent_gnts = 0;
>>
>> /* Fill out a communications ring structure. */
>> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
>> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
>> BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>
>> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
>> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
>> fsect = sg->offset >> 9;
>> lsect = fsect + (sg->length >> 9) - 1;
>> - /* install a grant reference. */
>> - ref = gnttab_claim_grant_reference(&gref_head);
>> - BUG_ON(ref == -ENOSPC);
>>
>> - gnttab_grant_foreign_access_ref(
>> - ref,
>> + if (info->persistent_gnts_c) {
>> + BUG_ON(llist_empty(&info->persistent_gnts));
>> + gnt_list_entry = llist_entry(
>> + llist_del_first(&info->persistent_gnts),
>> + struct grant, node);
>> +
>> + ref = gnt_list_entry->gref;
>> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>> + info->persistent_gnts_c--;
>> + } else {
>> + ref = gnttab_claim_grant_reference(&gref_head);
>> + BUG_ON(ref == -ENOSPC);
>> +
>> + gnt_list_entry =
>> + kmalloc(sizeof(struct grant),
>> + GFP_ATOMIC);
>> + if (!gnt_list_entry)
>> + return -ENOMEM;
>> +
>> + granted_page = alloc_page(GFP_ATOMIC);
>> + if (!granted_page) {
>> + kfree(gnt_list_entry);
>> + return -ENOMEM;
>> + }
>> +
>> + gnt_list_entry->pfn =
>> + page_to_pfn(granted_page);
>> + gnt_list_entry->gref = ref;
>> +
>> + buffer_mfn = pfn_to_mfn(page_to_pfn(
>> + granted_page));
>> + gnttab_grant_foreign_access_ref(ref,
>> info->xbdev->otherend_id,
>> - buffer_mfn,
>> - rq_data_dir(req));
>> + buffer_mfn, 0);
>> + }
>> +
>> + info->shadow[id].grants_used[i] = gnt_list_entry;
>> +
>> + if (rq_data_dir(req)) {
>> + char *bvec_data;
>> + void *shared_data;
>> +
>> + BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>> +
>> + shared_data = kmap_atomic(
>> + pfn_to_page(gnt_list_entry->pfn));
>> + bvec_data = kmap_atomic(sg_page(sg));
>> +
>> + /*
>> + * this does not wipe data stored outside the
>> + * range sg->offset..sg->offset+sg->length.
>> + * Therefore, blkback *could* see data from
>> + * previous requests. This is OK as long as
>> + * persistent grants are shared with just one
>> + * domain. It may need refactoring if This
> .. this (lowercase it pls)

Done.

>
>> + * changes
>> + */
>> + memcpy(shared_data + sg->offset,
>> + bvec_data + sg->offset,
>> + sg->length);
>> +
>> + kunmap_atomic(bvec_data);
>> + kunmap_atomic(shared_data);
>> + }
>>
>> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
>> ring_req->u.rw.seg[i] =
>> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
>> /* Keep a private copy so we can reissue requests when recovering. */
>> info->shadow[id].req = *ring_req;
>>
>> - gnttab_free_grant_references(gref_head);
>> + if (new_persistent_gnts)
>> + gnttab_free_grant_references(gref_head);
>>
>> return 0;
>> }
>> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>> static void xlvbd_flush(struct blkfront_info *info)
>> {
>> blk_queue_flush(info->rq, info->feature_flush);
>> - printk(KERN_INFO "blkfront: %s: %s: %s\n",
>> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
>
> HA! By default, eh?

Yes, you caught me: there's a paragraph in the commit message that
explains that we are using persistent grants in the frontend
unconditionally, since the protocol is compatible (you can have a
persistent blkfront and a non-persistent blkback). It simplifies the
logic in blkfront. Are you OK with it?

>> info->gd->disk_name,
>> info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
>>
>> static void blkif_free(struct blkfront_info *info, int suspend)
>> {
>> + struct llist_node *all_gnts;
>> + struct grant *persistent_gnt;
>> +
>> /* Prevent new requests being issued until we fix things up. */
>> spin_lock_irq(&info->io_lock);
>> info->connected = suspend ?
>> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>> /* No more blkif_request(). */
>> if (info->rq)
>> blk_stop_queue(info->rq);
>> +
>> + /* Remove all persistent grants */
>> + if (info->persistent_gnts_c) {
>> + all_gnts = llist_del_all(&info->persistent_gnts);
>> + llist_for_each_entry(persistent_gnt, all_gnts, node) {
>> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
>> + kfree(persistent_gnt);
>> + }
>> + info->persistent_gnts_c = 0;
>> + }
>> +
>> /* No more gnttab callback work. */
>> gnttab_cancel_free_callback(&info->callback);
>> spin_unlock_irq(&info->io_lock);
>> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>
>> }
>>
>> -static void blkif_completion(struct blk_shadow *s)
>> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>> + struct blkif_response *bret)
>> {
>> int i;
>> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
>> - * flag. */
>> - for (i = 0; i < s->req.u.rw.nr_segments; i++)
>> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
>> + struct bio_vec *bvec;
>> + struct req_iterator iter;
>> + unsigned long flags;
>> + char *bvec_data;
>> + void *shared_data;
>> + unsigned int offset = 0;
>> +
>> + if (bret->operation == BLKIF_OP_READ) {
>> + /*
>> + * Copy the data received from the backend into the bvec.
>> + * Since bv_len can be different from PAGE_SIZE, we need to
>> + * be sure we are actually copying the data from the right
>> + * shared page.
>> + */
>> + rq_for_each_segment(bvec, s->request, iter) {
>> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
>> + i = offset >> PAGE_SHIFT;
>
> Could you also include a comment about the bug you found here, pls?

There's a comment before the rq_for_each_segment loop that tries to
explain that; do you think the following is better?

/*
* Copy the data received from the backend into the bvec.
* Since bv_offset can be different from 0, and bv_len different
* from PAGE_SIZE, we have to keep track of the current offset,
* to be sure we are copying the data from the right shared page.
*/
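
To make the indexing concrete, assuming 4 KiB pages and bvecs of 2048,
2048 and 4096 bytes (a worked trace, not code from the patch):

    offset = 0    -> i = 0 >> PAGE_SHIFT = 0, copy from shared page 0
    offset = 2048 -> i = 0, still shared page 0
    offset = 4096 -> i = 1, move on to grants_used[1]

so computing i = offset >> PAGE_SHIFT before each copy keeps bvecs that
share a page reading from the same persistent grant.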

>> + shared_data = kmap_atomic(
>> + pfn_to_page(s->grants_used[i]->pfn));
>> + bvec_data = bvec_kmap_irq(bvec, &flags);
>> + memcpy(bvec_data, shared_data + bvec->bv_offset,
>> + bvec->bv_len);
>> + bvec_kunmap_irq(bvec_data, &flags);
>> + kunmap_atomic(shared_data);
>> + offset += bvec->bv_len;
>> + }
>> + }
>> + /* Add the persistent grant into the list of free grants */
>> + for (i = 0; i < s->req.u.rw.nr_segments; i++) {
>> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
>> + info->persistent_gnts_c++;
>> + }
>> }
>>
>> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>> req = info->shadow[id].request;
>>
>> if (bret->operation != BLKIF_OP_DISCARD)
>> - blkif_completion(&info->shadow[id]);
>> + blkif_completion(&info->shadow[id], info, bret);
>>
>> if (add_id_to_freelist(info, id)) {
>> WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
>> @@ -942,6 +1066,11 @@ again:
>> message = "writing protocol";
>> goto abort_transaction;
>> }
>> + err = xenbus_printf(xbt, dev->nodename,
>> + "feature-persistent-grants", "%d", 1);
>
> So it's %u in blkback, but %d in here? Which one should it be?

%u in both places.
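
I.e., the frontend write becomes something like:

    err = xenbus_printf(xbt, dev->nodename,
                        "feature-persistent-grants", "%u", 1);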

>> + if (err)
>> + dev_warn(&dev->dev,
>> + "writing persistent grants feature to xenbus");
>>
>> err = xenbus_transaction_end(xbt, 0);
>> if (err) {
>> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
>> spin_lock_init(&info->io_lock);
>> info->xbdev = dev;
>> info->vdevice = vdevice;
>> + init_llist_head(&info->persistent_gnts);
>> + info->persistent_gnts_c = 0;
>> info->connected = BLKIF_STATE_DISCONNECTED;
>> INIT_WORK(&info->work, blkif_restart_queue);
>>
>> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
>> req->u.rw.seg[j].gref,
>> info->xbdev->otherend_id,
>> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
>> - rq_data_dir(info->shadow[req->u.rw.id].request));
>> + 0);
>> }
>> info->shadow[req->u.rw.id].req = *req;
>>
>> --
>> 1.7.7.5 (Apple Git-26)
>>

2012-10-23 17:32:29

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers

On Tue, Oct 23, 2012 at 06:07:36PM +0200, Roger Pau Monné wrote:
> On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> > On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
> >> This patch implements persistent grants for the xen-blk{front,back}
> >> mechanism. The effect of this change is to reduce the number of unmap
> >> operations performed, since they cause a (costly) TLB shootdown. This
> >> allows the I/O performance to scale better when a large number of VMs
> >> are performing I/O.
> >>
> >> Previously, the blkfront driver was supplied a bvec[] from the request
> >> queue. This was granted to dom0; dom0 performed the I/O and wrote
> >> directly into the grant-mapped memory and unmapped it; blkfront then
> >> removed foreign access for that grant. The cost of unmapping scales
> >> badly with the number of CPUs in Dom0. An experiment showed that when
> >> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> >> ramdisk, the IPIs from performing unmaps are a bottleneck at 5 guests
> >> (at which point 650,000 IOPS are being performed in total). If more
> >> than 5 guests are used, the performance declines. By 10 guests, only
> >> 400,000 IOPS are being performed.
> >>
> >> This patch improves performance by only unmapping when the connection
> >> between blkfront and back is broken.
> >>
> >> On startup blkfront notifies blkback that it is using persistent
> >> grants, and blkback will do the same. If blkback is not capable of
> >> persistent mapping, blkfront will still use the same grants, since it
> >> is compatible with the previous protocol, and simplifies the code
> >> complexity in blkfront.
> >>
> >> To perform a read, in persistent mode, blkfront uses a separate pool
> >> of pages that it maps to dom0. When a request comes in, blkfront
> >> transmutes the request so that blkback will write into one of these
> >> free pages. Blkback keeps note of which grefs it has already
> >> mapped. When a new ring request comes to blkback, it looks to see if
> >> it has already mapped that page. If so, it will not map it again. If
> >> the page hasn't been previously mapped, it is mapped now, and a record
> >> is kept of this mapping. Blkback proceeds as usual. When blkfront is
> >> notified that blkback has completed a request, it memcpy's from the
> >> shared memory, into the bvec supplied. A record that the {gref, page}
> >> tuple is mapped, and not inflight is kept.
> >>
> >> Writes are similar, except that the memcpy is performed from the
> >> supplied bvecs, into the shared pages, before the request is put onto
> >> the ring.
> >>
> >> Blkback stores a mapping of grefs=>{page mapped to by gref} in
> >> a red-black tree. As the grefs are not known a priori, and provide no
> >> guarantees on their ordering, we have to perform a search
> >> through this tree to find the page, for every gref we receive. This
> >> operation takes O(log n) time in the worst case.
> >
> > Might want to mention how blkfront stores it as well. It looks
> > to be using 'llist' instead of 'list'? Any particular reason?
>
> Since we are just pushing and popping grant references, I went for what I
> think is the simplest one, a singly linked list (list is a doubly linked
> list). Oliver's previous version used something similar, but custom-made.
> I think it's best to use the data structures provided by the kernel
> itself.
>
> >
> >>
> >> The maximum number of grants that blkback will persistently map is
> >> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
> >> prevent a malicious guest from attempting a DoS, by supplying fresh
> >> grefs, causing the Dom0 kernel to map excessively. If a guest
> >> is using persistent grants and exceeds the maximum number of grants to
> >> map persistently, the newly passed grefs will be mapped and unmapped.
> >> Using this approach, we can have requests that mix persistent and
> >> non-persistent grants, and we need to handle them correctly.
> >> This allows us to set the maximum number of persistent grants to a
> >> lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
> >> setting it will lead to unpredictable performance.
> >>
> >> In writing this patch, the question arises as to whether the additional
> >> cost of performing memcpys in the guest (to/from the pool of granted
> >> pages) outweigh the gains of not performing TLB shootdowns. The answer
> >> to that question is `no'. There appears to be very little, if any
> >> additional cost to the guest of using persistent grants. There is
> >> perhaps a small saving, from the reduced number of hypercalls
> >> performed in granting, and ending foreign access.
> >>
> >> Signed-off-by: Oliver Chick <[email protected]>
> >> Signed-off-by: Roger Pau Monne <[email protected]>
> >> Cc: <[email protected]>
> >> Cc: <[email protected]>
> >> ---
> >> Benchmarks showing the impact of this patch in blk performance can be
> >> found at:
> >>
> >> http://xenbits.xensource.com/people/royger/persistent_grants/
> >> ---
> >> drivers/block/xen-blkback/blkback.c | 279 +++++++++++++++++++++++++++++++---
> >> drivers/block/xen-blkback/common.h | 17 ++
> >> drivers/block/xen-blkback/xenbus.c | 16 ++-
> >> drivers/block/xen-blkfront.c | 183 ++++++++++++++++++++----
> >> 4 files changed, 442 insertions(+), 53 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >> index c6decb9..2b982b2 100644
> >> --- a/drivers/block/xen-blkback/blkback.c
> >> +++ b/drivers/block/xen-blkback/blkback.c
> >> @@ -78,6 +78,7 @@ struct pending_req {
> >> unsigned short operation;
> >> int status;
> >> struct list_head free_list;
> >> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> };
> >>
> >> #define BLKBACK_INVALID_HANDLE (~0)
> >> @@ -98,6 +99,30 @@ struct xen_blkbk {
> >> static struct xen_blkbk *blkbk;
> >>
> >> /*
> >> + * Maximum number of grant pages that can be mapped in blkback.
> >> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
> >> + * pages that blkback will persistently map.
> >> + */
> >> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
> >> +{
> >> + switch (protocol) {
> >> + case BLKIF_PROTOCOL_NATIVE:
> >> + return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> + case BLKIF_PROTOCOL_X86_32:
> >> + return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >> + case BLKIF_PROTOCOL_X86_64:
> >> + return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST;
> >
> > Could you include in the comments what the size (bytes) you expect it to be?
> > If that would require you re-doing some tests - then don't worry - but
> > I figured you have some notes scribbled away that have the exact values
> > down.
>
> As far as I know and remember (I've checked the ring size in the past),
> all ring types have a size of 32, BLKIF_MAX_SEGMENTS_PER_REQUEST is
> always 11, and sizeof(struct persistent_gnt) is 48, so that's 32 * 11 *
> 48 = 16896 bytes. I will add a comment with this calculation.
>
> >
> >> + default:
> >> + BUG();
> >> + }
> >> + return 0;
> >> +}
> >> +
> >> +
> >> +/*
> >> * Little helpful macro to figure out the index and virtual address of the
> >> * pending_pages[..]. For each 'pending_req' we have have up to
> >> * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
> >> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> static void make_response(struct xen_blkif *blkif, u64 id,
> >> unsigned short op, int st);
> >>
> >> +#define foreach_grant(pos, rbtree, node) \
> >> + for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
> >> + &(pos)->node != NULL; \
> >> + (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
> >> +
> >> +
> >> +static void add_persistent_gnt(struct rb_root *root,
> >> + struct persistent_gnt *persistent_gnt)
> >> +{
> >> + struct rb_node **new = &(root->rb_node), *parent = NULL;
> >> + struct persistent_gnt *this;
> >> +
> >> + /* Figure out where to put new node */
> >> + while (*new) {
> >> + this = container_of(*new, struct persistent_gnt, node);
> >> +
> >> + parent = *new;
> >> + if (persistent_gnt->gnt < this->gnt)
> >> + new = &((*new)->rb_left);
> >> + else if (persistent_gnt->gnt > this->gnt)
> >> + new = &((*new)->rb_right);
> >> + else {
> >> + pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
> >> + BUG();
> >> + }
> >> + }
> >> +
> >> + /* Add new node and rebalance tree. */
> >> + rb_link_node(&(persistent_gnt->node), parent, new);
> >> + rb_insert_color(&(persistent_gnt->node), root);
> >> +}
> >> +
> >> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
> >> + grant_ref_t gref)
> >> +{
> >> + struct persistent_gnt *data;
> >> + struct rb_node *node = root->rb_node;
> >> +
> >> + while (node) {
> >> + data = container_of(node, struct persistent_gnt, node);
> >> +
> >> + if (gref < data->gnt)
> >> + node = node->rb_left;
> >> + else if (gref > data->gnt)
> >> + node = node->rb_right;
> >> + else
> >> + return data;
> >> + }
> >> + return NULL;
> >> +}
> >> +
> >> /*
> >> * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
> >> */
> >> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
> >> {
> >> struct xen_blkif *blkif = arg;
> >> struct xen_vbd *vbd = &blkif->vbd;
> >> + struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct persistent_gnt *persistent_gnt;
> >> + int ret = 0;
> >> + int segs_to_unmap = 0;
> >>
> >> xen_blkif_get(blkif);
> >>
> >> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
> >> print_stats(blkif);
> >> }
> >>
> >> + /* Free all persistent grant pages */
> >> + foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
> >
> > Just for sanity - you did check this with blkfronts that did not have
> > persistent grants enabled, right?
>
> Yes, it doesn't crash, but looking at foreach_grant it seems like it
> should. I've added a check before trying to iterate over the tree.
>
> >
> >> + BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
> >> + gnttab_set_unmap_op(&unmap[segs_to_unmap],
> >> + (unsigned long) pfn_to_kaddr(page_to_pfn(
> >> + persistent_gnt->page)),
> >> + GNTMAP_host_map,
> >> + persistent_gnt->handle);
> >> +
> >> + pages[segs_to_unmap] = persistent_gnt->page;
> >> + rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
> >> + kfree(persistent_gnt);
> >> + blkif->persistent_gnt_c--;
> >> +
> >> + if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
> >> + !rb_next(&persistent_gnt->node)) {
> >> + ret = gnttab_unmap_refs(unmap, NULL, pages,
> >> + segs_to_unmap);
> >> + BUG_ON(ret);
> >> + segs_to_unmap = 0;
> >> + }
> >> + }
> >> +
> >> + BUG_ON(blkif->persistent_gnt_c != 0);
> >> + BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
> >> +
> >> if (log_stats)
> >> print_stats(blkif);
> >>
> >> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
> >> int ret;
> >>
> >> for (i = 0; i < req->nr_pages; i++) {
> >> + if (!req->unmap_seg[i])
> >> + continue;
> >
> > Perhaps there should be a #define for that array..
>
> Do you mean something like:
>
> #define unmap(req, i) req->unmap_seg[i]

I was thinking that you just check for req->unmap_seg[i] to
have a non-zero value. But since that array is just used as a check
to see whether the functionality is enabled (or not), you might want
to declare the right values, so:
#define UNMAP_SG_ON 1
#define UNMAP_SG_OFF 0

or so.
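
Used roughly like this (the names are only suggestions until the next
revision):

    /* persistent grant: never unmap this segment */
    pending_req->unmap_seg[i] = UNMAP_SG_OFF;
    ...
    /* in xen_blkbk_unmap() */
    if (req->unmap_seg[i] == UNMAP_SG_OFF)
            continue;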

>
> >> handle = pending_handle(req, i);
> >> if (handle == BLKBACK_INVALID_HANDLE)
> >> continue;
> >> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
> >>
> >> static int xen_blkbk_map(struct blkif_request *req,
> >> struct pending_req *pending_req,
> >> - struct seg_buf seg[])
> >> + struct seg_buf seg[],
> >> + struct page *pages[])
> >> {
> >> struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> - int i;
> >> + struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct persistent_gnt *persistent_gnt = NULL;
> >> + struct xen_blkif *blkif = pending_req->blkif;
> >> + phys_addr_t addr = 0;
> >> + int i, j;
> >> + int new_map;
> >
> > Just use a bool for this.
>
> Done
>
> >> int nseg = req->u.rw.nr_segments;
> >> + int segs_to_map = 0;
> >> int ret = 0;
> >> + int use_persistent_gnts;
> >> +
> >> + use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
> >> +
> >> + BUG_ON(blkif->persistent_gnt_c >
> >> + max_mapped_grant_pages(pending_req->blkif->blk_protocol));
> >>
> >> /*
> >> * Fill out preq.nr_sects with proper amount of sectors, and setup
> >> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
> >> for (i = 0; i < nseg; i++) {
> >> uint32_t flags;
> >>
> >> - flags = GNTMAP_host_map;
> >> - if (pending_req->operation != BLKIF_OP_READ)
> >> - flags |= GNTMAP_readonly;
> >> - gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
> >> - req->u.rw.seg[i].gref,
> >> - pending_req->blkif->domid);
> >> + if (use_persistent_gnts)
> >> + persistent_gnt = get_persistent_gnt(
> >> + &blkif->persistent_gnts,
> >> + req->u.rw.seg[i].gref);
> >> +
> >> + if (persistent_gnt) {
> >> + /*
> >> + * We are using persistent grants and
> >> + * the grant is already mapped
> >> + */
> >> + new_map = 0;
> >> + } else if (use_persistent_gnts &&
> >> + blkif->persistent_gnt_c <
> >> + max_mapped_grant_pages(blkif->blk_protocol)) {
> >> + /*
> >> + * We are using persistent grants, the grant is
> >> + * not mapped but we have room for it
> >> + */
> >> + new_map = 1;
> >> + persistent_gnt = kzalloc(
> >> + sizeof(struct persistent_gnt),
> >> + GFP_KERNEL);
> >> + if (!persistent_gnt)
> >> + return -ENOMEM;
> >> + persistent_gnt->page = alloc_page(GFP_KERNEL);
> >> + if (!persistent_gnt->page) {
> >> + kfree(persistent_gnt);
> >> + return -ENOMEM;
> >> + }
> >> + persistent_gnt->gnt = req->u.rw.seg[i].gref;
> >> +
> >> + pages_to_gnt[segs_to_map] =
> >> + persistent_gnt->page;
> >> + addr = (unsigned long) pfn_to_kaddr(
> >> + page_to_pfn(persistent_gnt->page));
> >> +
> >> + add_persistent_gnt(&blkif->persistent_gnts,
> >> + persistent_gnt);
> >> + blkif->persistent_gnt_c++;
> >> + pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
> >> + persistent_gnt->gnt, blkif->persistent_gnt_c,
> >> + max_mapped_grant_pages(blkif->blk_protocol));
> >> + } else {
> >> + /*
> >> + * We are either using persistent grants and
> >> + * hit the maximum limit of grants mapped,
> >> + * or we are not using persistent grants.
> >> + */
> >> + if (use_persistent_gnts &&
> >> + !blkif->vbd.overflow_max_grants) {
> >> + blkif->vbd.overflow_max_grants = 1;
> >> + pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
> >> + blkif->domid, blkif->vbd.handle);
> >> + }
> >> + new_map = 1;
> >> + pages[i] = blkbk->pending_page(pending_req, i);
> >> + addr = vaddr(pending_req, i);
> >> + pages_to_gnt[segs_to_map] =
> >> + blkbk->pending_page(pending_req, i);
> >> + }
> >> +
> >> + if (persistent_gnt) {
> >> + pages[i] = persistent_gnt->page;
> >> + persistent_gnts[i] = persistent_gnt;
> >> + } else {
> >> + persistent_gnts[i] = NULL;
> >> + }
> >> +
> >> + if (new_map) {
> >> + flags = GNTMAP_host_map;
> >> + if (!persistent_gnt &&
> >> + (pending_req->operation != BLKIF_OP_READ))
> >> + flags |= GNTMAP_readonly;
> >> + gnttab_set_map_op(&map[segs_to_map++], addr,
> >> + flags, req->u.rw.seg[i].gref,
> >> + blkif->domid);
> >> + }
> >> }
> >>
> >> - ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
> >> - BUG_ON(ret);
> >> + if (segs_to_map) {
> >> + ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
> >> + BUG_ON(ret);
> >> + }
> >>
> >> /*
> >> * Now swizzle the MFN in our domain with the MFN from the other domain
> >> * so that when we access vaddr(pending_req,i) it has the contents of
> >> * the page from the other domain.
> >> */
> >> - for (i = 0; i < nseg; i++) {
> >> - if (unlikely(map[i].status != 0)) {
> >> - pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> >> - map[i].handle = BLKBACK_INVALID_HANDLE;
> >> - ret |= 1;
> >> + for (i = 0, j = 0; i < nseg; i++) {
> >> + if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
> >> + /* This is a newly mapped grant */
> >> + BUG_ON(j >= segs_to_map);
> >> + if (unlikely(map[j].status != 0)) {
> >
> > I am not seeing j being incremented anywhere? Should it?
>
> Yes, it should be incremented, but not here. See the comment below for
> what I've changed.
>
> >
> >> + pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
> >> + map[j].handle = BLKBACK_INVALID_HANDLE;
> >> + ret |= 1;
> >> + if (persistent_gnts[i]) {
> >> + rb_erase(&persistent_gnts[i]->node,
> >> + &blkif->persistent_gnts);
> >> + blkif->persistent_gnt_c--;
> >> + kfree(persistent_gnts[i]);
> >> + persistent_gnts[i] = NULL;
> >> + }
> >> + }
> >> + }
> >> + if (persistent_gnts[i]) {
> >> + if (!persistent_gnts[i]->handle) {
> >> + /*
> >> + * If this is a new persistent grant
> >> + * save the handler
> >> + */
> >> + persistent_gnts[i]->handle = map[j].handle;
> >> + persistent_gnts[i]->dev_bus_addr =
> >> + map[j++].dev_bus_addr;
> >> + }
> >> + pending_handle(pending_req, i) =
> >> + persistent_gnts[i]->handle;
> >> + pending_req->unmap_seg[i] = 0;
> >
> > Could we have a #define for that?
>
> Sure.
>
> >> +
> >> + if (ret)
> >> + continue;
>
> This should be
>
> if (ret) {
>         j++;
>         continue;
> }

<nods>
>
> >> +
> >> + seg[i].buf = persistent_gnts[i]->dev_bus_addr |
> >> + (req->u.rw.seg[i].first_sect << 9);
> >> + } else {
> >> + pending_handle(pending_req, i) = map[j].handle;
> >> + pending_req->unmap_seg[i] = 1;
> >
> > And here as well?
>
> Done.
>
> >> +
> >> + if (ret)
> >> + continue;
> >> +
> >> + seg[i].buf = map[j++].dev_bus_addr |
> >> + (req->u.rw.seg[i].first_sect << 9);
> >> }
> >> -
> >> - pending_handle(pending_req, i) = map[i].handle;
> >> -
> >> - if (ret)
> >> - continue;
> >> -
> >> - seg[i].buf = map[i].dev_bus_addr |
> >> - (req->u.rw.seg[i].first_sect << 9);
> >> }
> >> return ret;
> >> }
> >> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> int operation;
> >> struct blk_plug plug;
> >> bool drain = false;
> >> + struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >>
> >> switch (req->operation) {
> >> case BLKIF_OP_READ:
> >> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> * the hypercall to unmap the grants - that is all done in
> >> * xen_blkbk_unmap.
> >> */
> >> - if (xen_blkbk_map(req, pending_req, seg))
> >> + if (xen_blkbk_map(req, pending_req, seg, pages))
> >> goto fail_flush;
> >>
> >> /*
> >> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >> for (i = 0; i < nseg; i++) {
> >> while ((bio == NULL) ||
> >> (bio_add_page(bio,
> >> - blkbk->pending_page(pending_req, i),
> >> + pages[i],
> >> seg[i].nsec << 9,
> >> seg[i].buf & ~PAGE_MASK) == 0)) {
> >>
> >> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> >> index 9ad3b5e..6c08ee9 100644
> >> --- a/drivers/block/xen-blkback/common.h
> >> +++ b/drivers/block/xen-blkback/common.h
> >> @@ -34,6 +34,7 @@
> >> #include <linux/vmalloc.h>
> >> #include <linux/wait.h>
> >> #include <linux/io.h>
> >> +#include <linux/rbtree.h>
> >> #include <asm/setup.h>
> >> #include <asm/pgalloc.h>
> >> #include <asm/hypervisor.h>
> >> @@ -160,10 +161,22 @@ struct xen_vbd {
> >> sector_t size;
> >> bool flush_support;
> >> bool discard_secure;
> >> +
> >> + unsigned int feature_gnt_persistent:1;
> >> + unsigned int overflow_max_grants:1;
> >
> > I think the v3.7-rc1 has this structure changed just a tiny bit, so you
> > might want to rebase it on top of that.
>
> I've done the rebase on top of your linux-next branch, commit
> ad502612c173fff239250c9fe9bdfaaef70b9901.

Thx
>
> >
> >> };
> >>
> >> struct backend_info;
> >>
> >> +
> >> +struct persistent_gnt {
> >> + struct page *page;
> >> + grant_ref_t gnt;
> >> + grant_handle_t handle;
> >> + uint64_t dev_bus_addr;
> >> + struct rb_node node;
> >> +};
> >> +
> >> struct xen_blkif {
> >> /* Unique identifier for this interface. */
> >> domid_t domid;
> >> @@ -190,6 +203,10 @@ struct xen_blkif {
> >> struct task_struct *xenblkd;
> >> unsigned int waiting_reqs;
> >>
> >> + /* frontend feature information */
> >
> > Huh?
>
> Changed it to:
>
> /* tree to store persistent grants */
>
> >> + struct rb_root persistent_gnts;
> >> + unsigned int persistent_gnt_c;
> >> +
> >> /* statistics */
> >> unsigned long st_print;
> >> int st_rd_req;
> >> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> >> index 4f66171..9f88b4e 100644
> >> --- a/drivers/block/xen-blkback/xenbus.c
> >> +++ b/drivers/block/xen-blkback/xenbus.c
> >> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
> >> atomic_set(&blkif->drain, 0);
> >> blkif->st_print = jiffies;
> >> init_waitqueue_head(&blkif->waiting_to_free);
> >> + blkif->persistent_gnts.rb_node = NULL;
> >>
> >> return blkif;
> >> }
> >> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
> >> struct xenbus_device *dev = be->dev;
> >> unsigned long ring_ref;
> >> unsigned int evtchn;
> >> + unsigned int pers_grants;
> >> char protocol[64] = "";
> >> int err;
> >>
> >> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
> >> xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
> >> return -1;
> >> }
> >> - pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
> >> - ring_ref, evtchn, be->blkif->blk_protocol, protocol);
> >> + err = xenbus_gather(XBT_NIL, dev->otherend,
> >> + "feature-persistent-grants", "%u",
> >> + &pers_grants, NULL);
> >> + if (err)
> >> + pers_grants = 0;
> >> +
> >> + be->blkif->vbd.feature_gnt_persistent = pers_grants;
> >> + be->blkif->vbd.overflow_max_grants = 0;
> >> +
> >> + pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
> >> + ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> >> + pers_grants);
> >
> > Can you make that a string? So it is
> > pers_grants ? "persistent grants" : ""
> >
> > instead of a zero or one value pls?
>
> Yes, done.
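>
> For reference, the revised line looks roughly like this (a sketch;
> the exact wording in v2 may differ slightly):
>
>     pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) %s\n",
>             ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>             pers_grants ? "persistent grants" : "");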
>
> >>
> >> /* Map the shared frame, irq etc. */
> >> err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >> index 2c2d2e5..206d422 100644
> >> --- a/drivers/block/xen-blkfront.c
> >> +++ b/drivers/block/xen-blkfront.c
> >> @@ -44,6 +44,7 @@
> >> #include <linux/mutex.h>
> >> #include <linux/scatterlist.h>
> >> #include <linux/bitmap.h>
> >> +#include <linux/llist.h>
> >>
> >> #include <xen/xen.h>
> >> #include <xen/xenbus.h>
> >> @@ -64,10 +65,17 @@ enum blkif_state {
> >> BLKIF_STATE_SUSPENDED,
> >> };
> >>
> >> +struct grant {
> >> + grant_ref_t gref;
> >> + unsigned long pfn;
> >> + struct llist_node node;
> >> +};
> >> +
> >> struct blk_shadow {
> >> struct blkif_request req;
> >> struct request *request;
> >> unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> + struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >> };
> >>
> >> static DEFINE_MUTEX(blkfront_mutex);
> >> @@ -97,6 +105,8 @@ struct blkfront_info
> >> struct work_struct work;
> >> struct gnttab_free_callback callback;
> >> struct blk_shadow shadow[BLK_RING_SIZE];
> >> + struct llist_head persistent_gnts;
> >> + unsigned int persistent_gnts_c;
> >> unsigned long shadow_free;
> >> unsigned int feature_flush;
> >> unsigned int flush_op;
> >> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
> >> unsigned long id;
> >> unsigned int fsect, lsect;
> >> int i, ref;
> >> +
> >> + /*
> >> + * Used to store if we are able to queue the request by just using
> >> + * existing persistent grants (0), or if we have to get new grants,
> >
> > What does the zero mean?
>
> Frankly, no idea; I guess it was in Oliver's patch and I didn't spot it.
>
> >> + * as there are not sufficiently many free.
> >> + */
> >> + int new_persistent_gnts;
> >
> > I think this can be a bool?
>
> I agree.
>
> >> grant_ref_t gref_head;
> >> + struct page *granted_page;
> >> + struct grant *gnt_list_entry = NULL;
> >> struct scatterlist *sg;
> >>
> >> if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> >> return 1;
> >>
> >> - if (gnttab_alloc_grant_references(
> >> - BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> >> - gnttab_request_free_callback(
> >> - &info->callback,
> >> - blkif_restart_queue_callback,
> >> - info,
> >> - BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >> - return 1;
> >> - }
> >> + /* Check if we have enough grants to allocate a request */
> >> + if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> >> + new_persistent_gnts = 1;
> >> + if (gnttab_alloc_grant_references(
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
> >> + &gref_head) < 0) {
> >> + gnttab_request_free_callback(
> >> + &info->callback,
> >> + blkif_restart_queue_callback,
> >> + info,
> >> + BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >> + return 1;
> >> + }
> >> + } else
> >> + new_persistent_gnts = 0;
> >>
> >> /* Fill out a communications ring structure. */
> >> ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> >> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
> >> BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >>
> >> for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> >> - buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
> >> fsect = sg->offset >> 9;
> >> lsect = fsect + (sg->length >> 9) - 1;
> >> - /* install a grant reference. */
> >> - ref = gnttab_claim_grant_reference(&gref_head);
> >> - BUG_ON(ref == -ENOSPC);
> >>
> >> - gnttab_grant_foreign_access_ref(
> >> - ref,
> >> + if (info->persistent_gnts_c) {
> >> + BUG_ON(llist_empty(&info->persistent_gnts));
> >> + gnt_list_entry = llist_entry(
> >> + llist_del_first(&info->persistent_gnts),
> >> + struct grant, node);
> >> +
> >> + ref = gnt_list_entry->gref;
> >> + buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> >> + info->persistent_gnts_c--;
> >> + } else {
> >> + ref = gnttab_claim_grant_reference(&gref_head);
> >> + BUG_ON(ref == -ENOSPC);
> >> +
> >> + gnt_list_entry =
> >> + kmalloc(sizeof(struct grant),
> >> + GFP_ATOMIC);
> >> + if (!gnt_list_entry)
> >> + return -ENOMEM;
> >> +
> >> + granted_page = alloc_page(GFP_ATOMIC);
> >> + if (!granted_page) {
> >> + kfree(gnt_list_entry);
> >> + return -ENOMEM;
> >> + }
> >> +
> >> + gnt_list_entry->pfn =
> >> + page_to_pfn(granted_page);
> >> + gnt_list_entry->gref = ref;
> >> +
> >> + buffer_mfn = pfn_to_mfn(page_to_pfn(
> >> + granted_page));
> >> + gnttab_grant_foreign_access_ref(ref,
> >> info->xbdev->otherend_id,
> >> - buffer_mfn,
> >> - rq_data_dir(req));
> >> + buffer_mfn, 0);
> >> + }
> >> +
> >> + info->shadow[id].grants_used[i] = gnt_list_entry;
> >> +
> >> + if (rq_data_dir(req)) {
> >> + char *bvec_data;
> >> + void *shared_data;
> >> +
> >> + BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> >> +
> >> + shared_data = kmap_atomic(
> >> + pfn_to_page(gnt_list_entry->pfn));
> >> + bvec_data = kmap_atomic(sg_page(sg));
> >> +
> >> + /*
> >> + * this does not wipe data stored outside the
> >> + * range sg->offset..sg->offset+sg->length.
> >> + * Therefore, blkback *could* see data from
> >> + * previous requests. This is OK as long as
> >> + * persistent grants are shared with just one
> >> + * domain. It may need refactoring if This
> > .. this (lowercase it pls)
>
> Done.
>
> >
> >> + * changes
> >> + */
> >> + memcpy(shared_data + sg->offset,
> >> + bvec_data + sg->offset,
> >> + sg->length);
> >> +
> >> + kunmap_atomic(bvec_data);
> >> + kunmap_atomic(shared_data);
> >> + }
> >>
> >> info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> >> ring_req->u.rw.seg[i] =
> >> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
> >> /* Keep a private copy so we can reissue requests when recovering. */
> >> info->shadow[id].req = *ring_req;
> >>
> >> - gnttab_free_grant_references(gref_head);
> >> + if (new_persistent_gnts)
> >> + gnttab_free_grant_references(gref_head);
> >>
> >> return 0;
> >> }
> >> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
> >> static void xlvbd_flush(struct blkfront_info *info)
> >> {
> >> blk_queue_flush(info->rq, info->feature_flush);
> >> - printk(KERN_INFO "blkfront: %s: %s: %s\n",
> >> + printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
> >
> > HA! By default, eh?
>
> Yes, you caught me, there's a paragraph in the commit message that
> explains that we are using persistent grants in the frontend
> unconditionally, since the protocol is compatible (you can have a
> persistent blkfront and a non-persistent blkback). It simplifies the
> logic in blkfront. Are you OK with it?

It is OK, but you should be checking whether the backend supports it.
I don't see it checking info->feature_persistent_grant before printing
that.

>
> >> info->gd->disk_name,
> >> info->flush_op == BLKIF_OP_WRITE_BARRIER ?
> >> "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
> >> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
> >>
> >> static void blkif_free(struct blkfront_info *info, int suspend)
> >> {
> >> + struct llist_node *all_gnts;
> >> + struct grant *persistent_gnt;
> >> +
> >> /* Prevent new requests being issued until we fix things up. */
> >> spin_lock_irq(&info->io_lock);
> >> info->connected = suspend ?
> >> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >> /* No more blkif_request(). */
> >> if (info->rq)
> >> blk_stop_queue(info->rq);
> >> +
> >> + /* Remove all persistent grants */
> >> + if (info->persistent_gnts_c) {
> >> + all_gnts = llist_del_all(&info->persistent_gnts);
> >> + llist_for_each_entry(persistent_gnt, all_gnts, node) {
> >> + gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> >> + kfree(persistent_gnt);
> >> + }
> >> + info->persistent_gnts_c = 0;
> >> + }
> >> +
> >> /* No more gnttab callback work. */
> >> gnttab_cancel_free_callback(&info->callback);
> >> spin_unlock_irq(&info->io_lock);
> >> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> >>
> >> }
> >>
> >> -static void blkif_completion(struct blk_shadow *s)
> >> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> >> + struct blkif_response *bret)
> >> {
> >> int i;
> >> - /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> >> - * flag. */
> >> - for (i = 0; i < s->req.u.rw.nr_segments; i++)
> >> - gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> >> + struct bio_vec *bvec;
> >> + struct req_iterator iter;
> >> + unsigned long flags;
> >> + char *bvec_data;
> >> + void *shared_data;
> >> + unsigned int offset = 0;
> >> +
> >> + if (bret->operation == BLKIF_OP_READ) {
> >> + /*
> >> + * Copy the data received from the backend into the bvec.
> >> + * Since bv_len can be different from PAGE_SIZE, we need to
> >> + * be sure we are actually copying the data from the right
> >> + * shared page.
> >> + */
> >> + rq_for_each_segment(bvec, s->request, iter) {
> >> + BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
> >> + i = offset >> PAGE_SHIFT;
> >
> > Could you also include a comment about the bug you found here, pls?
>
> There's a comment before the rq_for_each_segment loop that tries to
> explain that; do you think the following is better?
>
> /*
> * Copy the data received from the backend into the bvec.
> * Since bv_offset can be different from 0, and bv_len different
> * from PAGE_SIZE, we have to keep track of the current offset,
> * to be sure we are copying the data from the right shared page.

Yes. That is good.
> */
>
> >> + shared_data = kmap_atomic(
> >> + pfn_to_page(s->grants_used[i]->pfn));
> >> + bvec_data = bvec_kmap_irq(bvec, &flags);
> >> + memcpy(bvec_data, shared_data + bvec->bv_offset,
> >> + bvec->bv_len);
> >> + bvec_kunmap_irq(bvec_data, &flags);
> >> + kunmap_atomic(shared_data);
> >> + offset += bvec->bv_len;
> >> + }
> >> + }
> >> + /* Add the persistent grant into the list of free grants */
> >> + for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> >> + llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
> >> + info->persistent_gnts_c++;
> >> + }
> >> }
> >>
> >> static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >> req = info->shadow[id].request;
> >>
> >> if (bret->operation != BLKIF_OP_DISCARD)
> >> - blkif_completion(&info->shadow[id]);
> >> + blkif_completion(&info->shadow[id], info, bret);
> >>
> >> if (add_id_to_freelist(info, id)) {
> >> WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> >> @@ -942,6 +1066,11 @@ again:
> >> message = "writing protocol";
> >> goto abort_transaction;
> >> }
> >> + err = xenbus_printf(xbt, dev->nodename,
> >> + "feature-persistent-grants", "%d", 1);
> >
> > So it's %u in blkback, but %d in here? Which one should it be?
>
> %u in both places.
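>
> I.e., a sketch of the fixed frontend write:
>
>     err = xenbus_printf(xbt, dev->nodename,
>                         "feature-persistent-grants", "%u", 1);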
>
> >> + if (err)
> >> + dev_warn(&dev->dev,
> >> + "writing persistent grants feature to xenbus");
> >>
> >> err = xenbus_transaction_end(xbt, 0);
> >> if (err) {
> >> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
> >> spin_lock_init(&info->io_lock);
> >> info->xbdev = dev;
> >> info->vdevice = vdevice;
> >> + init_llist_head(&info->persistent_gnts);
> >> + info->persistent_gnts_c = 0;
> >> info->connected = BLKIF_STATE_DISCONNECTED;
> >> INIT_WORK(&info->work, blkif_restart_queue);
> >>
> >> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
> >> req->u.rw.seg[j].gref,
> >> info->xbdev->otherend_id,
> >> pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> >> - rq_data_dir(info->shadow[req->u.rw.id].request));
> >> + 0);
> >> }
> >> info->shadow[req->u.rw.id].req = *req;
> >>
> >> --
> >> 1.7.7.5 (Apple Git-26)
> >>

2012-10-23 18:09:30

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers

On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
>>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>>>> index c6decb9..2b982b2 100644
>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>> @@ -78,6 +78,7 @@ struct pending_req {
>>>> unsigned short operation;
>>>> int status;
>>>> struct list_head free_list;
>>>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];

Should I change this to a bool, since we are only setting it to 0 or 1?

>>> Perhaps there should be a #define for that array..
>>
>> Do you mean something like:
>>
>> #define unmap(req, i) req->unmap_seg[i]
>
> I was thinking that you just check for req->unmap_seg[i] to
> have a non-zero value. But since that array is just used as a check
> to see whether the functionality is enabled (or not), you might want
> to declare the right values, so:
> #define UNMAP_SG_ON 1
> #define UNMAP_SG_OFF 0
>
> or so.

Agreed, will add the defines.

>>>> + if (persistent_gnts[i]) {
>>>> + if (!persistent_gnts[i]->handle) {
>>>> + /*
>>>> + * If this is a new persistent grant
>>>> + * save the handler
>>>> + */
>>>> + persistent_gnts[i]->handle = map[j].handle;
>>>> + persistent_gnts[i]->dev_bus_addr =
>>>> + map[j++].dev_bus_addr;
>>>> + }
>>>> + pending_handle(pending_req, i) =
>>>> + persistent_gnts[i]->handle;
>>>> + pending_req->unmap_seg[i] = 0;
>>>
>>> Could we have a #define for that?
>>
>> Sure.

I've used the previous macro, so it looks like:

unmap(req, i) = UNMAP_SG_OFF;

I'm not sure if this is what you meant, or if you were interested in
defining a set of macros like:

#define check_unmap(req, i) req->unmap_seg[i]
#define unset_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_OFF
#define set_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_ON

I would go for the first option (the unmap macro that can be used here
and in xen_blkbk_unmap).
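
To make this concrete, a rough sketch of what I have in mind (the
final form may still change for v2):

    #define UNMAP_SG_ON  1
    #define UNMAP_SG_OFF 0
    #define unmap(req, i) ((req)->unmap_seg[i])

    /* persistent grant: no unmap needed on completion */
    unmap(pending_req, i) = UNMAP_SG_OFF;

    /* and in xen_blkbk_unmap: */
    if (unmap(req, i) == UNMAP_SG_OFF)
            continue;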

>>> HA! By default, eh?
>>
>> Yes, you caught me, there's a paragraph in the commit message that
>> explains that we are using persistent grants in the frontend
>> unconditionally, since the protocol is compatible (you can have a
>> persistent blkfront and a non-persistent blkback). It simplifies the
>> logic in blkfront. Are you OK with it?
>
> It is OK, but you should be checking whether the backend supports it.
> I don't see it checking info->feature_persistent_grant before printing
> that.

I don't understand why blkfront needs to check if the backend supports
persistent grants; blkfront is going to use persistent grants anyway.

2012-10-23 19:03:05

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers

On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
> >>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> >>>> index c6decb9..2b982b2 100644
> >>>> --- a/drivers/block/xen-blkback/blkback.c
> >>>> +++ b/drivers/block/xen-blkback/blkback.c
> >>>> @@ -78,6 +78,7 @@ struct pending_req {
> >>>> unsigned short operation;
> >>>> int status;
> >>>> struct list_head free_list;
> >>>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>
> Should I change this to a bool, since we are only setting it to 0 or 1?

I would just keep it as 'int'. Eventually we can replace this with a
bit-map, but that can be done later.

>
> >>> Perhaps there should be a #define for that array..
> >>
> >> Do you mean something like:
> >>
> >> #define unmap(req, i) req->unmap_seg[i]
> >
> > I was thinking that you just check for req->unmap_seg[i] to
> > have a non-zero value. But since that array is just used as a check
> > to see whether the functionality is enabled (or not), you might want
> > to declare the right values, so:
> > #define UNMAP_SG_ON 1
> > #define UNMAP_SG_OFF 0
> >
> > or so.
>
> Agreed, will add the defines.
>
> >>>> + if (persistent_gnts[i]) {
> >>>> + if (!persistent_gnts[i]->handle) {
> >>>> + /*
> >>>> + * If this is a new persistent grant
> >>>> + * save the handler
> >>>> + */
> >>>> + persistent_gnts[i]->handle = map[j].handle;
> >>>> + persistent_gnts[i]->dev_bus_addr =
> >>>> + map[j++].dev_bus_addr;
> >>>> + }
> >>>> + pending_handle(pending_req, i) =
> >>>> + persistent_gnts[i]->handle;
> >>>> + pending_req->unmap_seg[i] = 0;
> >>>
> >>> Could we have a #define for that?
> >>
> >> Sure.
>
> I've used the previous macro, so it looks like:
>
> unmap(req, i) = UNMAP_SG_OFF;
>
> I'm not sure if this is what you meant, or if you were interested in
> defining a set of macros like:
>
> #define check_unmap(req, i) req->unmap_seg[i]
> #define unset_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_OFF
> #define set_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_ON
>
> I would go for the first option (the unmap macro that can be used here
> and in xen_blkbk_unmap).

I was just thinking something as simple as

> if (req->unmap_seg[i] == UNMAP_SG_OFF)
continue;

And the #defines are just for the hard-coded values of 0 or 1.

>
> >>> HA! By default, eh?
> >>
> >> Yes, you caught me, there's a paragraph in the commit message that
> >> explains that we are using persistent grants in the frontend
> >> unconditionally, since the protocol is compatible (you can have a
> >> persistent blkfront and a non-persistent blkback). It simplifies the
> >> logic in blkfront. Are you OK with it?
> >
> > It is OK, but you should be checking whether the backend supports it.
> > I don't see it checking info->feature_persistent_grant before printing
> > that.
>
> I don't understand why blkfront needs to check if the backend supports
> persistent grants; blkfront is going to use persistent grants anyway.

What if it does not (say this guest runs on an older xen-blkback)?
Then you would still be printing 'persistent grants' in blkfront.
>
>

2012-10-24 07:40:17

by Jan Beulich

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers

>>> On 23.10.12 at 20:50, Konrad Rzeszutek Wilk <[email protected]> wrote:
> On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
>> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
>> >>>> diff --git a/drivers/block/xen-blkback/blkback.c
> b/drivers/block/xen-blkback/blkback.c
>> >>>> index c6decb9..2b982b2 100644
>> >>>> --- a/drivers/block/xen-blkback/blkback.c
>> >>>> +++ b/drivers/block/xen-blkback/blkback.c
>> >>>> @@ -78,6 +78,7 @@ struct pending_req {
>> >>>> unsigned short operation;
>> >>>> int status;
>> >>>> struct list_head free_list;
>> >>>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>> Should I change this to a bool, since we are only setting it to 0 or 1?
>
> I would just keep it as 'int'. Eventually we can replace this with a
> bit-map, but that can be done later.

I think this should be a bitmap from the beginning - why would
you want to waste 44 bytes per request for something that fits
in a single unsigned long (and the picture would get worse with
the number-of-segments extension)?

Also - should I take the work being done here as silent agreement
not to invest in blkif2 to streamline the various extensions
floating around?

Jan

2012-10-24 10:45:22

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers

On 23/10/12 20:50, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
>> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
>>>>>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>>>>>> index c6decb9..2b982b2 100644
>>>>>> --- a/drivers/block/xen-blkback/blkback.c
>>>>>> +++ b/drivers/block/xen-blkback/blkback.c
>>>>>> @@ -78,6 +78,7 @@ struct pending_req {
>>>>>> unsigned short operation;
>>>>>> int status;
>>>>>> struct list_head free_list;
>>>>>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>> Should I change this to a bool, since we are only setting it to 0 or 1?
>
> I would just keep it as 'int'. Eventually we can replace this with a
> bit-map, but that can be done later.

I've already changed it to a bitmap.
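
Roughly, it looks like this in my tree (a sketch; assuming
BLKIF_MAX_SEGMENTS_PER_REQUEST is small enough for the generic bitmap
helpers, and the exact form may still change for v2):

    /* in struct pending_req */
    DECLARE_BITMAP(unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);

    /* when mapping: clear the bitmap, then flag only the segments
     * that need a real unmap on completion */
    bitmap_zero(pending_req->unmap_seg, BLKIF_MAX_SEGMENTS_PER_REQUEST);
    set_bit(i, pending_req->unmap_seg);

    /* in xen_blkbk_unmap: skip persistently mapped segments */
    if (!test_bit(i, pending_req->unmap_seg))
            continue;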

>>>>> Perhaps there should be a #define for that array..
>>>>
>>>> Do you mean something like:
>>>>
>>>> #define unmap(req, i) req->unmap_seg[i]
>>>
>>> I was thinking that you just check for req->unmap_seg[i] to
>>> have a non-zero value. But since that array is just used as a check
>>> to see whether the functionality is enabled (or not), you might want
>>> to declare the right values, so:
>>> #define UNMAP_SG_ON 1
>>> #define UNMAP_SG_OFF 0
>>>
>>> or so.
>>
>> Agreed, will add the defines.
>>
>>>>>> + if (persistent_gnts[i]) {
>>>>>> + if (!persistent_gnts[i]->handle) {
>>>>>> + /*
>>>>>> + * If this is a new persistent grant
>>>>>> + * save the handler
>>>>>> + */
>>>>>> + persistent_gnts[i]->handle = map[j].handle;
>>>>>> + persistent_gnts[i]->dev_bus_addr =
>>>>>> + map[j++].dev_bus_addr;
>>>>>> + }
>>>>>> + pending_handle(pending_req, i) =
>>>>>> + persistent_gnts[i]->handle;
>>>>>> + pending_req->unmap_seg[i] = 0;
>>>>>
>>>>> Could we have a #define for that?
>>>>
>>>> Sure.
>>
>> I've used the previous macro, so it looks like:
>>
>> unmap(req, i) = UNMAP_SG_OFF;
>>
>> I'm not sure if this is what you meant, or if you were interested in
>> defining a set of macros like:
>>
>> #define check_unmap(req, i) req->unmap_seg[i]
>> #define unset_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_OFF
>> #define set_unmap(req, i) req->unmap_seg[i] = UNMAP_SG_ON
>>
>> I would go for the first option (the unmap macro that can be used here
>> and in xen_blkbk_unmap).
>
> I was just thinking something as simple as
>
> > if (req->unmap_seg[i] == UNMAP_SG_OFF)
> continue;
>
> And the #defines are just for the hard-coded values of 0 or 1.
>
>>
>>>>> HA! By default, eh?
>>>>
>>>> Yes, you caught me, there's a paragraph in the commit message that
>>>> explains that we are using persistent grants in the frontend
>>>> unconditionally, since the protocol is compatible (you can have a
>>>> persistent blkfront and a non-persistent blkback). It simplifies the
>>>> logic in blkfront. Are you OK with it?
>>>
>>> It is OK, but you should be checking whether the backend supports it.
>>> I don't see it checking info->feature_persistent_grant before printing
>>> that.
>>
>> I don't understand why blkfront needs to check if the backend supports
>> persistent grants; blkfront is going to use persistent grants anyway.
>
> What if it does not (say this guest runs on an older xen-blkback)?
> Then you would still be printing 'persistent grants' in blkfront.

Ok, I get your point. Now blkfront will only report the use of
persistent grants if the backend supports it.
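
Something along these lines (a sketch; feature_persistent is the field
name I'm using in blkfront for the value gathered from the backend's
feature-persistent-grants key):

    printk(KERN_INFO "blkfront: %s: %s: %s %s\n",
           info->gd->disk_name,
           info->flush_op == BLKIF_OP_WRITE_BARRIER ?
           "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
           "flush diskcache" : "barrier or flush"),
           info->feature_flush ? "enabled" : "disabled",
           info->feature_persistent ? "using persistent grants" : "");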

If there are no further comments I will send v2 after doing some tests.
Thanks for the reviews.

2012-10-25 12:42:05

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC] Persistent grant maps for xen blk drivers

On Wed, Oct 24, 2012 at 08:40:10AM +0100, Jan Beulich wrote:
> >>> On 23.10.12 at 20:50, Konrad Rzeszutek Wilk <[email protected]> wrote:
> > On Tue, Oct 23, 2012 at 08:09:27PM +0200, Roger Pau Monné wrote:
> >> On 23/10/12 19:20, Konrad Rzeszutek Wilk wrote:
> >> >>>> diff --git a/drivers/block/xen-blkback/blkback.c
> > b/drivers/block/xen-blkback/blkback.c
> >> >>>> index c6decb9..2b982b2 100644
> >> >>>> --- a/drivers/block/xen-blkback/blkback.c
> >> >>>> +++ b/drivers/block/xen-blkback/blkback.c
> >> >>>> @@ -78,6 +78,7 @@ struct pending_req {
> >> >>>> unsigned short operation;
> >> >>>> int status;
> >> >>>> struct list_head free_list;
> >> >>>> + unsigned int unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >>
> >> Should I change this to a bool, since we are only setting it to 0 or 1?
> >
> > I would just keep it as 'int'. Eventually we can replace this with a
> > bit-map, but that can be done later.
>
> I think this should be a bitmap from the beginning - why would
> you want to waste 44 bytes per request for something that fits
> in a single unsigned long (and the picture would get worse with
> the number-of-segments extension)?
>
> Also - should I take the work being done here as silent agreement
> not to invest in blkif2 to streamline the various extensions
> floating around?

I haven't had the time to look at making blkif2 a possibility, and I
don't want to hinder this work, which provides great performance
benefits.