From: Roger Pau Monné
Date: Tue, 23 Oct 2012 18:07:36 +0200
To: Konrad Rzeszutek Wilk
CC: xen-devel@lists.xen.org, konrad.wilk@oracle.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers

On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
>> This patch implements persistent grants for the xen-blk{front,back}
>> mechanism. The effect of this change is to reduce the number of unmap
>> operations performed, since they cause a (costly) TLB shootdown. This
>> allows the I/O performance to scale better when a large number of VMs
>> are performing I/O.
>>
>> Previously, the blkfront driver was supplied a bvec[] from the request
>> queue. This was granted to dom0; dom0 performed the I/O and wrote
>> directly into the grant-mapped memory and unmapped it; blkfront then
>> removed foreign access for that grant. The cost of unmapping scales
>> badly with the number of CPUs in Dom0. An experiment showed that when
>> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
>> ramdisk, the IPIs from performing unmaps become a bottleneck at 5
>> guests (at which point 650,000 IOPS are being performed in total).
>> If more than 5 guests are used, the performance declines. By 10
>> guests, only 400,000 IOPS are being performed.
>>
>> This patch improves performance by only unmapping when the connection
>> between blkfront and back is broken.
>>
>> On startup blkfront notifies blkback that it is using persistent
>> grants, and blkback will do the same. If blkback is not capable of
>> persistent mapping, blkfront will still use the same grants, since
>> this is compatible with the previous protocol and reduces the code
>> complexity in blkfront.
>>
>> To perform a read, in persistent mode, blkfront uses a separate pool
>> of pages that it maps to dom0. When a request comes in, blkfront
>> transmutes the request so that blkback will write into one of these
>> free pages. Blkback keeps note of which grefs it has already
>> mapped. When a new ring request comes to blkback, it looks to see if
>> it has already mapped that page. If so, it will not map it again. If
>> the page hasn't been previously mapped, it is mapped now, and a record
>> is kept of this mapping. Blkback proceeds as usual. When blkfront is
>> notified that blkback has completed a request, it memcpys from the
>> shared memory into the supplied bvec. A record is kept that the
>> {gref, page} tuple is mapped and not in flight.
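To make the lookup step described above concrete, here is a minimal
sketch of the per-segment decision blkback takes. get_persistent_gnt()
and add_persistent_gnt() are the real helpers from the patch below;
get_seg_page() and map_new_grant() are made-up names used only for
illustration (the actual mapping goes through gnttab_map_refs() in
xen_blkbk_map()):

static struct page *get_seg_page(struct xen_blkif *blkif, grant_ref_t gref)
{
        struct persistent_gnt *pgnt;

        /* O(log n) search in the red-black tree of known grants */
        pgnt = get_persistent_gnt(&blkif->persistent_gnts, gref);
        if (pgnt)
                return pgnt->page;      /* already mapped: no hypercall */

        /* First time this gref is seen: map it and record the mapping */
        pgnt = map_new_grant(blkif, gref);      /* hypothetical helper */
        add_persistent_gnt(&blkif->persistent_gnts, pgnt);
        return pgnt->page;
}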
>>
>> Writes are similar, except that the memcpy is performed from the
>> supplied bvecs into the shared pages, before the request is put onto
>> the ring.
>>
>> Blkback stores a mapping of grefs=>{page mapped to by gref} in
>> a red-black tree. As the grefs are not known a priori, and provide no
>> guarantees on their ordering, we have to perform a search
>> through this tree to find the page, for every gref we receive. This
>> operation takes O(log n) time in the worst case.
>
> Might want to mention how blkfront stores it as well. It looks
> to be using 'llist' instead of 'list'? Any particular reason?

Since we are just pushing and popping grant references, I went for what
I think is the simplest one, a singly linked list ('list' is a doubly
linked list). Oliver was using something similar in the previous
version, but custom-made. I think it's best to use the data structures
provided by the kernel itself (a short sketch of the push/pop pattern
is at the end of this mail).

>
>>
>> The maximum number of grants that blkback will persistently map is
>> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
>> prevent a malicious guest from attempting a DoS by supplying fresh
>> grefs, causing the Dom0 kernel to map excessively. If a guest
>> is using persistent grants and exceeds the maximum number of grants
>> to map persistently, the newly passed grefs will be mapped and
>> unmapped. Using this approach, we can have requests that mix
>> persistent and non-persistent grants, and we need to handle them
>> correctly. This allows us to set the maximum number of persistent
>> grants to a lower value than
>> RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although doing so will
>> lead to unpredictable performance.
>>
>> In writing this patch, the question arises as to whether the
>> additional cost of performing memcpys in the guest (to/from the pool
>> of granted pages) outweighs the gains of not performing TLB
>> shootdowns. The answer to that question is `no'. There appears to be
>> very little, if any, additional cost to the guest of using persistent
>> grants. There is perhaps a small saving from the reduced number of
>> hypercalls performed in granting and ending foreign access.
>>
>> Signed-off-by: Oliver Chick
>> Signed-off-by: Roger Pau Monne
>> Cc:
>> Cc:
>> ---
>> Benchmarks showing the impact of this patch in blk performance can be
>> found at:
>>
>> http://xenbits.xensource.com/people/royger/persistent_grants/
>> ---
>>  drivers/block/xen-blkback/blkback.c |  279 +++++++++++++++++++++++++++++++---
>>  drivers/block/xen-blkback/common.h  |   17 ++
>>  drivers/block/xen-blkback/xenbus.c  |   16 ++-
>>  drivers/block/xen-blkfront.c        |  183 ++++++++++++++++++++----
>>  4 files changed, 442 insertions(+), 53 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> index c6decb9..2b982b2 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -78,6 +78,7 @@ struct pending_req {
>>  	unsigned short		operation;
>>  	int			status;
>>  	struct list_head	free_list;
>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>  };
>>
>>  #define BLKBACK_INVALID_HANDLE (~0)
>> @@ -98,6 +99,30 @@ struct xen_blkbk {
>>  static struct xen_blkbk *blkbk;
>>
>>  /*
>> + * Maximum number of grant pages that can be mapped in blkback.
>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
>> + * pages that blkback will persistently map.
>> + */
>> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
>> +{
>> +	switch (protocol) {
>> +	case BLKIF_PROTOCOL_NATIVE:
>> +		return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
>> +			BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +	case BLKIF_PROTOCOL_X86_32:
>> +		return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
>> +			BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +	case BLKIF_PROTOCOL_X86_64:
>> +		return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
>> +			BLKIF_MAX_SEGMENTS_PER_REQUEST;
>
> Could you include in the comments what the size (bytes) you expect it
> to be? If that would require you re-doing some tests - then don't
> worry - but I figured you have some notes scribbled away that have the
> exact values down.

As far as I know and remember (I've checked the ring size in the past),
all ring types have a size of 32 entries, BLKIF_MAX_SEGMENTS_PER_REQUEST
is always 11, and sizeof(struct persistent_gnt) is 48, so that's
32 * 11 * 48 = 16896 bytes. I will add a comment with this calculation.

>
>> +	default:
>> +		BUG();
>> +	}
>> +	return 0;
>> +}
>> +
>> +
>> +/*
>>   * Little helpful macro to figure out the index and virtual address of the
>>   * pending_pages[..]. For each 'pending_req' we have have up to
>>   * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
>> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  static void make_response(struct xen_blkif *blkif, u64 id,
>>  			  unsigned short op, int st);
>>
>> +#define foreach_grant(pos, rbtree, node) \
>> +	for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
>> +	     &(pos)->node != NULL; \
>> +	     (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
>> +
>> +
>> +static void add_persistent_gnt(struct rb_root *root,
>> +			       struct persistent_gnt *persistent_gnt)
>> +{
>> +	struct rb_node **new = &(root->rb_node), *parent = NULL;
>> +	struct persistent_gnt *this;
>> +
>> +	/* Figure out where to put new node */
>> +	while (*new) {
>> +		this = container_of(*new, struct persistent_gnt, node);
>> +
>> +		parent = *new;
>> +		if (persistent_gnt->gnt < this->gnt)
>> +			new = &((*new)->rb_left);
>> +		else if (persistent_gnt->gnt > this->gnt)
>> +			new = &((*new)->rb_right);
>> +		else {
>> +			pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
>> +			BUG();
>> +		}
>> +	}
>> +
>> +	/* Add new node and rebalance tree. */
>> +	rb_link_node(&(persistent_gnt->node), parent, new);
>> +	rb_insert_color(&(persistent_gnt->node), root);
>> +}
>> +
>> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
>> +						 grant_ref_t gref)
>> +{
>> +	struct persistent_gnt *data;
>> +	struct rb_node *node = root->rb_node;
>> +
>> +	while (node) {
>> +		data = container_of(node, struct persistent_gnt, node);
>> +
>> +		if (gref < data->gnt)
>> +			node = node->rb_left;
>> +		else if (gref > data->gnt)
>> +			node = node->rb_right;
>> +		else
>> +			return data;
>> +	}
>> +	return NULL;
>> +}
>> +
>>  /*
>>   * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>>   */
>> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
>>  {
>>  	struct xen_blkif *blkif = arg;
>>  	struct xen_vbd *vbd = &blkif->vbd;
>> +	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct persistent_gnt *persistent_gnt;
>> +	int ret = 0;
>> +	int segs_to_unmap = 0;
>>
>>  	xen_blkif_get(blkif);
>>
>> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
>>  			print_stats(blkif);
>>  	}
>>
>> +	/* Free all persistent grant pages */
>> +	foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
>
> Just for sanity - you did check this with blkfronts that did not have
> persistent grants enabled, right?

Yes, it doesn't crash, but looking at foreach_grant it seems like it
should. I've added a check before trying to iterate over the tree.

>
>> +		BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
>> +		gnttab_set_unmap_op(&unmap[segs_to_unmap],
>> +			(unsigned long) pfn_to_kaddr(page_to_pfn(
>> +				persistent_gnt->page)),
>> +			GNTMAP_host_map,
>> +			persistent_gnt->handle);
>> +
>> +		pages[segs_to_unmap] = persistent_gnt->page;
>> +		rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
>> +		kfree(persistent_gnt);
>> +		blkif->persistent_gnt_c--;
>> +
>> +		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
>> +		    !rb_next(&persistent_gnt->node)) {
>> +			ret = gnttab_unmap_refs(unmap, NULL, pages,
>> +				segs_to_unmap);
>> +			BUG_ON(ret);
>> +			segs_to_unmap = 0;
>> +		}
>> +	}
>> +
>> +	BUG_ON(blkif->persistent_gnt_c != 0);
>> +	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
>> +
>>  	if (log_stats)
>>  		print_stats(blkif);
>>
>> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>  	int ret;
>>
>>  	for (i = 0; i < req->nr_pages; i++) {
>> +		if (!req->unmap_seg[i])
>> +			continue;
>
> Perhaps there should be a #define for that array..

Do you mean something like:

#define unmap(req, i) req->unmap_seg[i]

>>  		handle = pending_handle(req, i);
>>  		if (handle == BLKBACK_INVALID_HANDLE)
>>  			continue;
>> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>
>>  static int xen_blkbk_map(struct blkif_request *req,
>>  			 struct pending_req *pending_req,
>> -			 struct seg_buf seg[])
>> +			 struct seg_buf seg[],
>> +			 struct page *pages[])
>>  {
>>  	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> -	int i;
>> +	struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct persistent_gnt *persistent_gnt = NULL;
>> +	struct xen_blkif *blkif = pending_req->blkif;
>> +	phys_addr_t addr = 0;
>> +	int i, j;
>> +	int new_map;
>
> Just use a bool for this.
Done

>>  	int nseg = req->u.rw.nr_segments;
>> +	int segs_to_map = 0;
>>  	int ret = 0;
>> +	int use_persistent_gnts;
>> +
>> +	use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
>> +
>> +	BUG_ON(blkif->persistent_gnt_c >
>> +	       max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>>
>>  	/*
>>  	 * Fill out preq.nr_sects with proper amount of sectors, and setup
>> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
>>  	for (i = 0; i < nseg; i++) {
>>  		uint32_t flags;
>>
>> -		flags = GNTMAP_host_map;
>> -		if (pending_req->operation != BLKIF_OP_READ)
>> -			flags |= GNTMAP_readonly;
>> -		gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
>> -				  req->u.rw.seg[i].gref,
>> -				  pending_req->blkif->domid);
>> +		if (use_persistent_gnts)
>> +			persistent_gnt = get_persistent_gnt(
>> +				&blkif->persistent_gnts,
>> +				req->u.rw.seg[i].gref);
>> +
>> +		if (persistent_gnt) {
>> +			/*
>> +			 * We are using persistent grants and
>> +			 * the grant is already mapped
>> +			 */
>> +			new_map = 0;
>> +		} else if (use_persistent_gnts &&
>> +			   blkif->persistent_gnt_c <
>> +			   max_mapped_grant_pages(blkif->blk_protocol)) {
>> +			/*
>> +			 * We are using persistent grants, the grant is
>> +			 * not mapped but we have room for it
>> +			 */
>> +			new_map = 1;
>> +			persistent_gnt = kzalloc(
>> +				sizeof(struct persistent_gnt),
>> +				GFP_KERNEL);
>> +			if (!persistent_gnt)
>> +				return -ENOMEM;
>> +			persistent_gnt->page = alloc_page(GFP_KERNEL);
>> +			if (!persistent_gnt->page) {
>> +				kfree(persistent_gnt);
>> +				return -ENOMEM;
>> +			}
>> +			persistent_gnt->gnt = req->u.rw.seg[i].gref;
>> +
>> +			pages_to_gnt[segs_to_map] =
>> +				persistent_gnt->page;
>> +			addr = (unsigned long) pfn_to_kaddr(
>> +				page_to_pfn(persistent_gnt->page));
>> +
>> +			add_persistent_gnt(&blkif->persistent_gnts,
>> +				persistent_gnt);
>> +			blkif->persistent_gnt_c++;
>> +			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
>> +				 persistent_gnt->gnt, blkif->persistent_gnt_c,
>> +				 max_mapped_grant_pages(blkif->blk_protocol));
>> +		} else {
>> +			/*
>> +			 * We are either using persistent grants and
>> +			 * hit the maximum limit of grants mapped,
>> +			 * or we are not using persistent grants.
>> +			 */
>> +			if (use_persistent_gnts &&
>> +			    !blkif->vbd.overflow_max_grants) {
>> +				blkif->vbd.overflow_max_grants = 1;
>> +				pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>> +					 blkif->domid, blkif->vbd.handle);
>> +			}
>> +			new_map = 1;
>> +			pages[i] = blkbk->pending_page(pending_req, i);
>> +			addr = vaddr(pending_req, i);
>> +			pages_to_gnt[segs_to_map] =
>> +				blkbk->pending_page(pending_req, i);
>> +		}
>> +
>> +		if (persistent_gnt) {
>> +			pages[i] = persistent_gnt->page;
>> +			persistent_gnts[i] = persistent_gnt;
>> +		} else {
>> +			persistent_gnts[i] = NULL;
>> +		}
>> +
>> +		if (new_map) {
>> +			flags = GNTMAP_host_map;
>> +			if (!persistent_gnt &&
>> +			    (pending_req->operation != BLKIF_OP_READ))
>> +				flags |= GNTMAP_readonly;
>> +			gnttab_set_map_op(&map[segs_to_map++], addr,
>> +					  flags, req->u.rw.seg[i].gref,
>> +					  blkif->domid);
>> +		}
>>  	}
>>
>> -	ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
>> -	BUG_ON(ret);
>> +	if (segs_to_map) {
>> +		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
>> +		BUG_ON(ret);
>> +	}
>>
>>  	/*
>>  	 * Now swizzle the MFN in our domain with the MFN from the other domain
>>  	 * so that when we access vaddr(pending_req,i) it has the contents of
>>  	 * the page from the other domain.
>>  	 */
>> -	for (i = 0; i < nseg; i++) {
>> -		if (unlikely(map[i].status != 0)) {
>> -			pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> -			map[i].handle = BLKBACK_INVALID_HANDLE;
>> -			ret |= 1;
>> +	for (i = 0, j = 0; i < nseg; i++) {
>> +		if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
>> +			/* This is a newly mapped grant */
>> +			BUG_ON(j >= segs_to_map);
>> +			if (unlikely(map[j].status != 0)) {
>
> I am not seeing j being incremented anywhere? Should it?

Yes, it should be incremented, but not here. See the comment below to
see what I've changed.

>
>> +				pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> +				map[j].handle = BLKBACK_INVALID_HANDLE;
>> +				ret |= 1;
>> +				if (persistent_gnts[i]) {
>> +					rb_erase(&persistent_gnts[i]->node,
>> +						 &blkif->persistent_gnts);
>> +					blkif->persistent_gnt_c--;
>> +					kfree(persistent_gnts[i]);
>> +					persistent_gnts[i] = NULL;
>> +				}
>> +			}
>> +		}
>> +		if (persistent_gnts[i]) {
>> +			if (!persistent_gnts[i]->handle) {
>> +				/*
>> +				 * If this is a new persistent grant
>> +				 * save the handler
>> +				 */
>> +				persistent_gnts[i]->handle = map[j].handle;
>> +				persistent_gnts[i]->dev_bus_addr =
>> +					map[j++].dev_bus_addr;
>> +			}
>> +			pending_handle(pending_req, i) =
>> +				persistent_gnts[i]->handle;
>> +			pending_req->unmap_seg[i] = 0;
>
> Could we have a #define for that?

Sure.

>> +
>> +			if (ret)
>> +				continue;

This should be:

if (ret) {
	j++;
	continue;
}

>> +
>> +			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>> +				(req->u.rw.seg[i].first_sect << 9);
>> +		} else {
>> +			pending_handle(pending_req, i) = map[j].handle;
>> +			pending_req->unmap_seg[i] = 1;
>
> And here as well?

Done.

>> +
>> +			if (ret)
>> +				continue;
>> +
>> +			seg[i].buf = map[j++].dev_bus_addr |
>> +				(req->u.rw.seg[i].first_sect << 9);
>>  		}
>> -
>> -		pending_handle(pending_req, i) = map[i].handle;
>> -
>> -		if (ret)
>> -			continue;
>> -
>> -		seg[i].buf = map[i].dev_bus_addr |
>> -			(req->u.rw.seg[i].first_sect << 9);
>>  	}
>>  	return ret;
>>  }
>> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  	int operation;
>>  	struct blk_plug plug;
>>  	bool drain = false;
>> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>>  	switch (req->operation) {
>>  	case BLKIF_OP_READ:
>> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  	 * the hypercall to unmap the grants - that is all done in
>>  	 * xen_blkbk_unmap.
>>  	 */
>> -	if (xen_blkbk_map(req, pending_req, seg))
>> +	if (xen_blkbk_map(req, pending_req, seg, pages))
>>  		goto fail_flush;
>>
>>  	/*
>> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  	for (i = 0; i < nseg; i++) {
>>  		while ((bio == NULL) ||
>>  		       (bio_add_page(bio,
>> -				     blkbk->pending_page(pending_req, i),
>> +				     pages[i],
>>  				     seg[i].nsec << 9,
>>  				     seg[i].buf & ~PAGE_MASK) == 0)) {
>>
>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
>> index 9ad3b5e..6c08ee9 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -34,6 +34,7 @@
>>  #include
>>  #include
>>  #include
>> +#include <linux/rbtree.h>
>>  #include
>>  #include
>>  #include
>> @@ -160,10 +161,22 @@ struct xen_vbd {
>>  	sector_t		size;
>>  	bool			flush_support;
>>  	bool			discard_secure;
>> +
>> +	unsigned int		feature_gnt_persistent:1;
>> +	unsigned int		overflow_max_grants:1;
>
> I think the v3.7-rc1 has this structure changed just a tiny bit, so you
> might want to rebase it on top of that.
I've done the rebase on top of your linux-next branch, commit
ad502612c173fff239250c9fe9bdfaaef70b9901.

>
>>  };
>>
>>  struct backend_info;
>>
>> +
>> +struct persistent_gnt {
>> +	struct page *page;
>> +	grant_ref_t gnt;
>> +	grant_handle_t handle;
>> +	uint64_t dev_bus_addr;
>> +	struct rb_node node;
>> +};
>> +
>>  struct xen_blkif {
>>  	/* Unique identifier for this interface. */
>>  	domid_t			domid;
>> @@ -190,6 +203,10 @@ struct xen_blkif {
>>  	struct task_struct	*xenblkd;
>>  	unsigned int		waiting_reqs;
>>
>> +	/* frontend feature information */
>
> Huh?

Changed it to:

/* tree to store persistent grants */

>> +	struct rb_root		persistent_gnts;
>> +	unsigned int		persistent_gnt_c;
>> +
>>  	/* statistics */
>>  	unsigned long		st_print;
>>  	int			st_rd_req;
>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
>> index 4f66171..9f88b4e 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>>  	atomic_set(&blkif->drain, 0);
>>  	blkif->st_print = jiffies;
>>  	init_waitqueue_head(&blkif->waiting_to_free);
>> +	blkif->persistent_gnts.rb_node = NULL;
>>
>>  	return blkif;
>>  }
>> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
>>  	struct xenbus_device *dev = be->dev;
>>  	unsigned long ring_ref;
>>  	unsigned int evtchn;
>> +	unsigned int pers_grants;
>>  	char protocol[64] = "";
>>  	int err;
>>
>> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
>>  		xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>>  		return -1;
>>  	}
>> -	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
>> -		ring_ref, evtchn, be->blkif->blk_protocol, protocol);
>> +	err = xenbus_gather(XBT_NIL, dev->otherend,
>> +			    "feature-persistent-grants", "%u",
>> +			    &pers_grants, NULL);
>> +	if (err)
>> +		pers_grants = 0;
>> +
>> +	be->blkif->vbd.feature_gnt_persistent = pers_grants;
>> +	be->blkif->vbd.overflow_max_grants = 0;
>> +
>> +	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
>> +		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>> +		pers_grants);
>
> Can you make that a string? So it is
>
> pers_grants ? "persistent grants" : ""
>
> instead of a zero or one value pls?

Yes, done.

>>
>>  	/* Map the shared frame, irq etc. */
>>  	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 2c2d2e5..206d422 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -44,6 +44,7 @@
>>  #include
>>  #include
>>  #include
>> +#include <linux/llist.h>
>>
>>  #include
>>  #include
>> @@ -64,10 +65,17 @@ enum blkif_state {
>>  	BLKIF_STATE_SUSPENDED,
>>  };
>>
>> +struct grant {
>> +	grant_ref_t gref;
>> +	unsigned long pfn;
>> +	struct llist_node node;
>> +};
>> +
>>  struct blk_shadow {
>>  	struct blkif_request req;
>>  	struct request *request;
>>  	unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>  };
>>
>>  static DEFINE_MUTEX(blkfront_mutex);
>> @@ -97,6 +105,8 @@ struct blkfront_info
>>  	struct work_struct work;
>>  	struct gnttab_free_callback callback;
>>  	struct blk_shadow shadow[BLK_RING_SIZE];
>> +	struct llist_head persistent_gnts;
>> +	unsigned int persistent_gnts_c;
>>  	unsigned long shadow_free;
>>  	unsigned int feature_flush;
>>  	unsigned int flush_op;
>> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
>>  	unsigned long id;
>>  	unsigned int fsect, lsect;
>>  	int i, ref;
>> +
>> +	/*
>> +	 * Used to store if we are able to queue the request by just using
>> +	 * existing persistent grants (0), or if we have to get new grants,
>
> What does the zero mean?

Frankly, no idea; I guess it was in Oliver's patch and I failed to spot
it.

>> +	 * as there are not sufficiently many free.
>> +	 */
>> +	int new_persistent_gnts;
>
> I think this can be a bool?

I agree.

>>  	grant_ref_t gref_head;
>> +	struct page *granted_page;
>> +	struct grant *gnt_list_entry = NULL;
>>  	struct scatterlist *sg;
>>
>>  	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>>  		return 1;
>>
>> -	if (gnttab_alloc_grant_references(
>> -		BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
>> -		gnttab_request_free_callback(
>> -			&info->callback,
>> -			blkif_restart_queue_callback,
>> -			info,
>> -			BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> -		return 1;
>> -	}
>> +	/* Check if we have enought grants to allocate a requests */
>> +	if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>> +		new_persistent_gnts = 1;
>> +		if (gnttab_alloc_grant_references(
>> +		    BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
>> +		    &gref_head) < 0) {
>> +			gnttab_request_free_callback(
>> +				&info->callback,
>> +				blkif_restart_queue_callback,
>> +				info,
>> +				BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> +			return 1;
>> +		}
>> +	} else
>> +		new_persistent_gnts = 0;
>>
>>  	/* Fill out a communications ring structure. */
>>  	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
>> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
>>  		       BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>
>>  	for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
>> -		buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
>>  		fsect = sg->offset >> 9;
>>  		lsect = fsect + (sg->length >> 9) - 1;
>> -		/* install a grant reference. */
>> -		ref = gnttab_claim_grant_reference(&gref_head);
>> -		BUG_ON(ref == -ENOSPC);
>>
>> -		gnttab_grant_foreign_access_ref(
>> -			ref,
>> +		if (info->persistent_gnts_c) {
>> +			BUG_ON(llist_empty(&info->persistent_gnts));
>> +			gnt_list_entry = llist_entry(
>> +				llist_del_first(&info->persistent_gnts),
>> +				struct grant, node);
>> +
>> +			ref = gnt_list_entry->gref;
>> +			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>> +			info->persistent_gnts_c--;
>> +		} else {
>> +			ref = gnttab_claim_grant_reference(&gref_head);
>> +			BUG_ON(ref == -ENOSPC);
>> +
>> +			gnt_list_entry =
>> +				kmalloc(sizeof(struct grant),
>> +					GFP_ATOMIC);
>> +			if (!gnt_list_entry)
>> +				return -ENOMEM;
>> +
>> +			granted_page = alloc_page(GFP_ATOMIC);
>> +			if (!granted_page) {
>> +				kfree(gnt_list_entry);
>> +				return -ENOMEM;
>> +			}
>> +
>> +			gnt_list_entry->pfn =
>> +				page_to_pfn(granted_page);
>> +			gnt_list_entry->gref = ref;
>> +
>> +			buffer_mfn = pfn_to_mfn(page_to_pfn(
>> +				granted_page));
>> +			gnttab_grant_foreign_access_ref(ref,
>>  				info->xbdev->otherend_id,
>> -				buffer_mfn,
>> -				rq_data_dir(req));
>> +				buffer_mfn, 0);
>> +		}
>> +
>> +		info->shadow[id].grants_used[i] = gnt_list_entry;
>> +
>> +		if (rq_data_dir(req)) {
>> +			char *bvec_data;
>> +			void *shared_data;
>> +
>> +			BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>> +
>> +			shared_data = kmap_atomic(
>> +				pfn_to_page(gnt_list_entry->pfn));
>> +			bvec_data = kmap_atomic(sg_page(sg));
>> +
>> +			/*
>> +			 * this does not wipe data stored outside the
>> +			 * range sg->offset..sg->offset+sg->length.
>> +			 * Therefore, blkback *could* see data from
>> +			 * previous requests. This is OK as long as
>> +			 * persistent grants are shared with just one
>> +			 * domain. It may need refactoring if This
>
> .. this (lowercase it pls)

Done.

>> +			 * changes
>> +			 */
>> +			memcpy(shared_data + sg->offset,
>> +			       bvec_data + sg->offset,
>> +			       sg->length);
>> +
>> +			kunmap_atomic(bvec_data);
>> +			kunmap_atomic(shared_data);
>> +		}
>>
>>  		info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
>>  		ring_req->u.rw.seg[i] =
>> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
>>  	/* Keep a private copy so we can reissue requests when recovering. */
>>  	info->shadow[id].req = *ring_req;
>>
>> -	gnttab_free_grant_references(gref_head);
>> +	if (new_persistent_gnts)
>> +		gnttab_free_grant_references(gref_head);
>>
>>  	return 0;
>>  }
>> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>>  static void xlvbd_flush(struct blkfront_info *info)
>>  {
>>  	blk_queue_flush(info->rq, info->feature_flush);
>> -	printk(KERN_INFO "blkfront: %s: %s: %s\n",
>> +	printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
>
> HA! By default, eh?

Yes, you caught me. There's a paragraph in the commit message that
explains that we are using persistent grants in the frontend
unconditionally, since the protocol is compatible (you can have a
persistent blkfront and a non-persistent blkback); it simplifies the
logic in blkfront. Are you OK with it?

>>  		info->gd->disk_name,
>>  		info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>>  		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
>>
>>  static void blkif_free(struct blkfront_info *info, int suspend)
>>  {
>> +	struct llist_node *all_gnts;
>> +	struct grant *persistent_gnt;
>> +
>>  	/* Prevent new requests being issued until we fix things up. */
>>  	spin_lock_irq(&info->io_lock);
>>  	info->connected = suspend ?
>> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>  	/* No more blkif_request(). */
>>  	if (info->rq)
>>  		blk_stop_queue(info->rq);
>> +
>> +	/* Remove all persistent grants */
>> +	if (info->persistent_gnts_c) {
>> +		all_gnts = llist_del_all(&info->persistent_gnts);
>> +		llist_for_each_entry(persistent_gnt, all_gnts, node) {
>> +			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
>> +			kfree(persistent_gnt);
>> +		}
>> +		info->persistent_gnts_c = 0;
>> +	}
>> +
>>  	/* No more gnttab callback work. */
>>  	gnttab_cancel_free_callback(&info->callback);
>>  	spin_unlock_irq(&info->io_lock);
>> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>
>>  }
>>
>> -static void blkif_completion(struct blk_shadow *s)
>> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>> +			     struct blkif_response *bret)
>>  {
>>  	int i;
>> -	/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
>> -	 * flag. */
>> -	for (i = 0; i < s->req.u.rw.nr_segments; i++)
>> -		gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
>> +	struct bio_vec *bvec;
>> +	struct req_iterator iter;
>> +	unsigned long flags;
>> +	char *bvec_data;
>> +	void *shared_data;
>> +	unsigned int offset = 0;
>> +
>> +	if (bret->operation == BLKIF_OP_READ) {
>> +		/*
>> +		 * Copy the data received from the backend into the bvec.
>> +		 * Since bv_len can be different from PAGE_SIZE, we need to
>> +		 * be sure we are actually copying the data from the right
>> +		 * shared page.
>> +		 */
>> +		rq_for_each_segment(bvec, s->request, iter) {
>> +			BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
>> +			i = offset >> PAGE_SHIFT;
>
> Could you also include a comment about the bug you found here, pls?

There's a comment before the rq_for_each_segment loop that tries to
explain that; do you think the following is better?

/*
 * Copy the data received from the backend into the bvec.
 * Since bv_offset can be different than 0, and bv_len different
 * than PAGE_SIZE, we have to keep track of the current offset,
 * to be sure we are copying the data from the right shared page.
 */

>> +			shared_data = kmap_atomic(
>> +				pfn_to_page(s->grants_used[i]->pfn));
>> +			bvec_data = bvec_kmap_irq(bvec, &flags);
>> +			memcpy(bvec_data, shared_data + bvec->bv_offset,
>> +			       bvec->bv_len);
>> +			bvec_kunmap_irq(bvec_data, &flags);
>> +			kunmap_atomic(shared_data);
>> +			offset += bvec->bv_len;
>> +		}
>> +	}
>> +	/* Add the persistent grant into the list of free grants */
>> +	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
>> +		llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
>> +		info->persistent_gnts_c++;
>> +	}
>>  }
>>
>>  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>>  		req = info->shadow[id].request;
>>
>>  		if (bret->operation != BLKIF_OP_DISCARD)
>> -			blkif_completion(&info->shadow[id]);
>> +			blkif_completion(&info->shadow[id], info, bret);
>>
>>  		if (add_id_to_freelist(info, id)) {
>>  			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
>> @@ -942,6 +1066,11 @@ again:
>>  		message = "writing protocol";
>>  		goto abort_transaction;
>>  	}
>> +	err = xenbus_printf(xbt, dev->nodename,
>> +			    "feature-persistent-grants", "%d", 1);
>
> So it's %u in blkback, but %d in here? Which one should it be?

%u in both places.
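Just so we are looking at the same thing, the agreed-upon pair would
then read as follows. This is only a sketch restating the two calls
already present in the patch, with the format string made consistent:

/* blkfront advertises the feature inside the xenbus transaction: */
err = xenbus_printf(xbt, dev->nodename,
		    "feature-persistent-grants", "%u", 1);

/* blkback reads it back in connect_ring(), defaulting to off: */
err = xenbus_gather(XBT_NIL, dev->otherend,
		    "feature-persistent-grants", "%u",
		    &pers_grants, NULL);
if (err)
	pers_grants = 0;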
>> +	if (err)
>> +		dev_warn(&dev->dev,
>> +			 "writing persistent grants feature to xenbus");
>>
>>  	err = xenbus_transaction_end(xbt, 0);
>>  	if (err) {
>> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
>>  	spin_lock_init(&info->io_lock);
>>  	info->xbdev = dev;
>>  	info->vdevice = vdevice;
>> +	init_llist_head(&info->persistent_gnts);
>> +	info->persistent_gnts_c = 0;
>>  	info->connected = BLKIF_STATE_DISCONNECTED;
>>  	INIT_WORK(&info->work, blkif_restart_queue);
>>
>> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
>>  			req->u.rw.seg[j].gref,
>>  			info->xbdev->otherend_id,
>>  			pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
>> -			rq_data_dir(info->shadow[req->u.rw.id].request));
>> +			0);
>>  		}
>>  		info->shadow[req->u.rw.id].req = *req;
>>
>> --
>> 1.7.7.5 (Apple Git-26)
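As for the llist sketch promised above, the push/pop pattern in
blkfront boils down to the following. Both fragments restate calls that
are already in the patch; nothing here is new code:

/* Pop a free grant from the pool (blkif_queue_request): */
gnt_list_entry = llist_entry(llist_del_first(&info->persistent_gnts),
			     struct grant, node);
info->persistent_gnts_c--;

/* Push it back once the request completes (blkif_completion): */
llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
info->persistent_gnts_c++;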