From: Roger Pau Monné
Date: Tue, 23 Oct 2012 18:07:36 +0200
To: Konrad Rzeszutek Wilk
CC: xen-devel@lists.xen.org, konrad.wilk@oracle.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC] Persistent grant maps for xen blk drivers

On 22/10/12 15:47, Konrad Rzeszutek Wilk wrote:
> On Thu, Oct 18, 2012 at 01:22:01PM +0200, Roger Pau Monne wrote:
>> This patch implements persistent grants for the xen-blk{front,back}
>> mechanism. The effect of this change is to reduce the number of unmap
>> operations performed, since they cause a (costly) TLB shootdown. This
>> allows the I/O performance to scale better when a large number of VMs
>> are performing I/O.
>>
>> Previously, the blkfront driver was supplied a bvec[] from the request
>> queue. This was granted to dom0; dom0 performed the I/O and wrote
>> directly into the grant-mapped memory and unmapped it; blkfront then
>> removed foreign access for that grant. The cost of unmapping scales
>> badly with the number of CPUs in Dom0. An experiment showed that when
>> Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
>> ramdisk, the IPIs from performing unmaps become a bottleneck at 5
>> guests (at which point 650,000 IOPS are being performed in total).
>> If more than 5 guests are used, the performance declines. By 10
>> guests, only 400,000 IOPS are being performed.
>>
>> This patch improves performance by only unmapping when the connection
>> between blkfront and back is broken.
>>
>> On startup blkfront notifies blkback that it is using persistent
>> grants, and blkback will do the same. If blkback is not capable of
>> persistent mapping, blkfront will still use the same grants, since
>> this is compatible with the previous protocol and reduces the code
>> complexity in blkfront.
>>
>> To perform a read, in persistent mode, blkfront uses a separate pool
>> of pages that it maps to dom0. When a request comes in, blkfront
>> transmutes the request so that blkback will write into one of these
>> free pages. Blkback keeps note of which grefs it has already
>> mapped. When a new ring request comes to blkback, it looks to see if
>> it has already mapped that page. If so, it will not map it again. If
>> the page hasn't been previously mapped, it is mapped now, and a record
>> is kept of this mapping. Blkback proceeds as usual. When blkfront is
>> notified that blkback has completed a request, it memcpys from the
>> shared memory into the supplied bvec. A record is kept that the
>> {gref, page} tuple is mapped and not in flight.
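To make the lookup step described above concrete, here is a minimal
sketch of the per-segment decision blkback takes. get_persistent_gnt()
and add_persistent_gnt() are the real helpers from the patch below;
get_seg_page() and map_new_grant() are made-up names used only for
illustration (the actual mapping goes through gnttab_map_refs() in
xen_blkbk_map()):

static struct page *get_seg_page(struct xen_blkif *blkif, grant_ref_t gref)
{
        struct persistent_gnt *pgnt;

        /* O(log n) search in the red-black tree of known grants */
        pgnt = get_persistent_gnt(&blkif->persistent_gnts, gref);
        if (pgnt)
                return pgnt->page;      /* already mapped: no hypercall */

        /* First time this gref is seen: map it and record the mapping */
        pgnt = map_new_grant(blkif, gref);      /* hypothetical helper */
        add_persistent_gnt(&blkif->persistent_gnts, pgnt);
        return pgnt->page;
}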
>>
>> Writes are similar, except that the memcpy is performed from the
>> supplied bvecs into the shared pages, before the request is put onto
>> the ring.
>>
>> Blkback stores a mapping of grefs=>{page mapped to by gref} in
>> a red-black tree. As the grefs are not known a priori, and provide no
>> guarantees on their ordering, we have to perform a search
>> through this tree to find the page, for every gref we receive. This
>> operation takes O(log n) time in the worst case.
>
> Might want to mention how blkfront stores it as well. It looks
> to be using 'llist' instead of 'list'? Any particular reason?

Since we are just pushing and popping grant references, I went for what
I think is the simplest one, a singly linked list ('list' is a doubly
linked list). Oliver was using something similar in the previous
version, but custom-made. I think it's best to use the data structures
provided by the kernel itself (a short sketch of the push/pop pattern
is at the end of this mail).

>
>>
>> The maximum number of grants that blkback will persistently map is
>> currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
>> prevent a malicious guest from attempting a DoS by supplying fresh
>> grefs, causing the Dom0 kernel to map excessively. If a guest
>> is using persistent grants and exceeds the maximum number of grants
>> to map persistently, the newly passed grefs will be mapped and
>> unmapped. Using this approach, we can have requests that mix
>> persistent and non-persistent grants, and we need to handle them
>> correctly. This allows us to set the maximum number of persistent
>> grants to a lower value than
>> RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although doing so will
>> lead to unpredictable performance.
>>
>> In writing this patch, the question arises as to whether the
>> additional cost of performing memcpys in the guest (to/from the pool
>> of granted pages) outweighs the gains of not performing TLB
>> shootdowns. The answer to that question is `no'. There appears to be
>> very little, if any, additional cost to the guest of using persistent
>> grants. There is perhaps a small saving from the reduced number of
>> hypercalls performed in granting and ending foreign access.
>>
>> Signed-off-by: Oliver Chick
>> Signed-off-by: Roger Pau Monne
>> Cc:
>> Cc:
>> ---
>> Benchmarks showing the impact of this patch in blk performance can be
>> found at:
>>
>> http://xenbits.xensource.com/people/royger/persistent_grants/
>> ---
>>  drivers/block/xen-blkback/blkback.c |  279 +++++++++++++++++++++++++++++++---
>>  drivers/block/xen-blkback/common.h  |   17 ++
>>  drivers/block/xen-blkback/xenbus.c  |   16 ++-
>>  drivers/block/xen-blkfront.c        |  183 ++++++++++++++++++++----
>>  4 files changed, 442 insertions(+), 53 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
>> index c6decb9..2b982b2 100644
>> --- a/drivers/block/xen-blkback/blkback.c
>> +++ b/drivers/block/xen-blkback/blkback.c
>> @@ -78,6 +78,7 @@ struct pending_req {
>>  	unsigned short		operation;
>>  	int			status;
>>  	struct list_head	free_list;
>> +	unsigned int		unmap_seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>  };
>>
>>  #define BLKBACK_INVALID_HANDLE (~0)
>> @@ -98,6 +99,30 @@ struct xen_blkbk {
>>  static struct xen_blkbk *blkbk;
>>
>>  /*
>> + * Maximum number of grant pages that can be mapped in blkback.
>> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * RING_SIZE is the maximum number of
>> + * pages that blkback will persistently map.
>> + */
>> +static inline unsigned int max_mapped_grant_pages(enum blkif_protocol protocol)
>> +{
>> +	switch (protocol) {
>> +	case BLKIF_PROTOCOL_NATIVE:
>> +		return __CONST_RING_SIZE(blkif, PAGE_SIZE) *
>> +			BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +	case BLKIF_PROTOCOL_X86_32:
>> +		return __CONST_RING_SIZE(blkif_x86_32, PAGE_SIZE) *
>> +			BLKIF_MAX_SEGMENTS_PER_REQUEST;
>> +	case BLKIF_PROTOCOL_X86_64:
>> +		return __CONST_RING_SIZE(blkif_x86_64, PAGE_SIZE) *
>> +			BLKIF_MAX_SEGMENTS_PER_REQUEST;
>
> Could you include in the comments what the size (bytes) you expect it
> to be? If that would require you re-doing some tests - then don't
> worry - but I figured you have some notes scribbled away that have the
> exact values down.

As far as I know and remember (I've checked the ring size in the past),
all ring types have a size of 32 entries, BLKIF_MAX_SEGMENTS_PER_REQUEST
is always 11, and sizeof(struct persistent_gnt) is 48, so that's
32 * 11 * 48 = 16896 bytes. I will add a comment with this calculation.

>
>> +	default:
>> +		BUG();
>> +	}
>> +	return 0;
>> +}
>> +
>> +
>> +/*
>>   * Little helpful macro to figure out the index and virtual address of the
>>   * pending_pages[..]. For each 'pending_req' we have have up to
>>   * BLKIF_MAX_SEGMENTS_PER_REQUEST (11) pages. The seg would be from 0 through
>> @@ -128,6 +153,57 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  static void make_response(struct xen_blkif *blkif, u64 id,
>>  			  unsigned short op, int st);
>>
>> +#define foreach_grant(pos, rbtree, node) \
>> +	for ((pos) = container_of(rb_first((rbtree)), typeof(*(pos)), node); \
>> +	     &(pos)->node != NULL; \
>> +	     (pos) = container_of(rb_next(&(pos)->node), typeof(*(pos)), node))
>> +
>> +
>> +static void add_persistent_gnt(struct rb_root *root,
>> +			       struct persistent_gnt *persistent_gnt)
>> +{
>> +	struct rb_node **new = &(root->rb_node), *parent = NULL;
>> +	struct persistent_gnt *this;
>> +
>> +	/* Figure out where to put new node */
>> +	while (*new) {
>> +		this = container_of(*new, struct persistent_gnt, node);
>> +
>> +		parent = *new;
>> +		if (persistent_gnt->gnt < this->gnt)
>> +			new = &((*new)->rb_left);
>> +		else if (persistent_gnt->gnt > this->gnt)
>> +			new = &((*new)->rb_right);
>> +		else {
>> +			pr_alert(DRV_PFX " trying to add a gref that's already in the tree\n");
>> +			BUG();
>> +		}
>> +	}
>> +
>> +	/* Add new node and rebalance tree. */
>> +	rb_link_node(&(persistent_gnt->node), parent, new);
>> +	rb_insert_color(&(persistent_gnt->node), root);
>> +}
>> +
>> +static struct persistent_gnt *get_persistent_gnt(struct rb_root *root,
>> +						 grant_ref_t gref)
>> +{
>> +	struct persistent_gnt *data;
>> +	struct rb_node *node = root->rb_node;
>> +
>> +	while (node) {
>> +		data = container_of(node, struct persistent_gnt, node);
>> +
>> +		if (gref < data->gnt)
>> +			node = node->rb_left;
>> +		else if (gref > data->gnt)
>> +			node = node->rb_right;
>> +		else
>> +			return data;
>> +	}
>> +	return NULL;
>> +}
>> +
>>  /*
>>   * Retrieve from the 'pending_reqs' a free pending_req structure to be used.
>>   */
>> @@ -274,6 +350,11 @@ int xen_blkif_schedule(void *arg)
>>  {
>>  	struct xen_blkif *blkif = arg;
>>  	struct xen_vbd *vbd = &blkif->vbd;
>> +	struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct persistent_gnt *persistent_gnt;
>> +	int ret = 0;
>> +	int segs_to_unmap = 0;
>>
>>  	xen_blkif_get(blkif);
>>
>> @@ -301,6 +382,32 @@ int xen_blkif_schedule(void *arg)
>>  			print_stats(blkif);
>>  	}
>>
>> +	/* Free all persistent grant pages */
>> +	foreach_grant(persistent_gnt, &blkif->persistent_gnts, node) {
>
> Just for sanity - you did check this with blkfronts that did not have
> persistent grants enabled, right?

Yes, it doesn't crash, but looking at foreach_grant it seems like it
should. I've added a check before trying to iterate over the tree.

>
>> +		BUG_ON(persistent_gnt->handle == BLKBACK_INVALID_HANDLE);
>> +		gnttab_set_unmap_op(&unmap[segs_to_unmap],
>> +			(unsigned long) pfn_to_kaddr(page_to_pfn(
>> +				persistent_gnt->page)),
>> +			GNTMAP_host_map,
>> +			persistent_gnt->handle);
>> +
>> +		pages[segs_to_unmap] = persistent_gnt->page;
>> +		rb_erase(&persistent_gnt->node, &blkif->persistent_gnts);
>> +		kfree(persistent_gnt);
>> +		blkif->persistent_gnt_c--;
>> +
>> +		if (++segs_to_unmap == BLKIF_MAX_SEGMENTS_PER_REQUEST ||
>> +		    !rb_next(&persistent_gnt->node)) {
>> +			ret = gnttab_unmap_refs(unmap, NULL, pages,
>> +				segs_to_unmap);
>> +			BUG_ON(ret);
>> +			segs_to_unmap = 0;
>> +		}
>> +	}
>> +
>> +	BUG_ON(blkif->persistent_gnt_c != 0);
>> +	BUG_ON(!RB_EMPTY_ROOT(&blkif->persistent_gnts));
>> +
>>  	if (log_stats)
>>  		print_stats(blkif);
>>
>> @@ -327,6 +434,8 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>  	int ret;
>>
>>  	for (i = 0; i < req->nr_pages; i++) {
>> +		if (!req->unmap_seg[i])
>> +			continue;
>
> Perhaps there should be a #define for that array..

Do you mean something like:

#define unmap(req, i) req->unmap_seg[i]

>>  		handle = pending_handle(req, i);
>>  		if (handle == BLKBACK_INVALID_HANDLE)
>>  			continue;
>> @@ -343,12 +452,26 @@ static void xen_blkbk_unmap(struct pending_req *req)
>>
>>  static int xen_blkbk_map(struct blkif_request *req,
>>  			 struct pending_req *pending_req,
>> -			 struct seg_buf seg[])
>> +			 struct seg_buf seg[],
>> +			 struct page *pages[])
>>  {
>>  	struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> -	int i;
>> +	struct persistent_gnt *persistent_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct persistent_gnt *persistent_gnt = NULL;
>> +	struct xen_blkif *blkif = pending_req->blkif;
>> +	phys_addr_t addr = 0;
>> +	int i, j;
>> +	int new_map;
>
> Just use a bool for this.
Done

>>  	int nseg = req->u.rw.nr_segments;
>> +	int segs_to_map = 0;
>>  	int ret = 0;
>> +	int use_persistent_gnts;
>> +
>> +	use_persistent_gnts = (blkif->vbd.feature_gnt_persistent);
>> +
>> +	BUG_ON(blkif->persistent_gnt_c >
>> +	       max_mapped_grant_pages(pending_req->blkif->blk_protocol));
>>
>>  	/*
>>  	 * Fill out preq.nr_sects with proper amount of sectors, and setup
>> @@ -358,36 +481,141 @@ static int xen_blkbk_map(struct blkif_request *req,
>>  	for (i = 0; i < nseg; i++) {
>>  		uint32_t flags;
>>
>> -		flags = GNTMAP_host_map;
>> -		if (pending_req->operation != BLKIF_OP_READ)
>> -			flags |= GNTMAP_readonly;
>> -		gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
>> -				  req->u.rw.seg[i].gref,
>> -				  pending_req->blkif->domid);
>> +		if (use_persistent_gnts)
>> +			persistent_gnt = get_persistent_gnt(
>> +				&blkif->persistent_gnts,
>> +				req->u.rw.seg[i].gref);
>> +
>> +		if (persistent_gnt) {
>> +			/*
>> +			 * We are using persistent grants and
>> +			 * the grant is already mapped
>> +			 */
>> +			new_map = 0;
>> +		} else if (use_persistent_gnts &&
>> +			   blkif->persistent_gnt_c <
>> +			   max_mapped_grant_pages(blkif->blk_protocol)) {
>> +			/*
>> +			 * We are using persistent grants, the grant is
>> +			 * not mapped but we have room for it
>> +			 */
>> +			new_map = 1;
>> +			persistent_gnt = kzalloc(
>> +				sizeof(struct persistent_gnt),
>> +				GFP_KERNEL);
>> +			if (!persistent_gnt)
>> +				return -ENOMEM;
>> +			persistent_gnt->page = alloc_page(GFP_KERNEL);
>> +			if (!persistent_gnt->page) {
>> +				kfree(persistent_gnt);
>> +				return -ENOMEM;
>> +			}
>> +			persistent_gnt->gnt = req->u.rw.seg[i].gref;
>> +
>> +			pages_to_gnt[segs_to_map] =
>> +				persistent_gnt->page;
>> +			addr = (unsigned long) pfn_to_kaddr(
>> +				page_to_pfn(persistent_gnt->page));
>> +
>> +			add_persistent_gnt(&blkif->persistent_gnts,
>> +				persistent_gnt);
>> +			blkif->persistent_gnt_c++;
>> +			pr_debug(DRV_PFX " grant %u added to the tree of persistent grants, using %u/%u\n",
>> +				 persistent_gnt->gnt, blkif->persistent_gnt_c,
>> +				 max_mapped_grant_pages(blkif->blk_protocol));
>> +		} else {
>> +			/*
>> +			 * We are either using persistent grants and
>> +			 * hit the maximum limit of grants mapped,
>> +			 * or we are not using persistent grants.
>> +			 */
>> +			if (use_persistent_gnts &&
>> +			    !blkif->vbd.overflow_max_grants) {
>> +				blkif->vbd.overflow_max_grants = 1;
>> +				pr_alert(DRV_PFX " domain %u, device %#x is using maximum number of persistent grants\n",
>> +					 blkif->domid, blkif->vbd.handle);
>> +			}
>> +			new_map = 1;
>> +			pages[i] = blkbk->pending_page(pending_req, i);
>> +			addr = vaddr(pending_req, i);
>> +			pages_to_gnt[segs_to_map] =
>> +				blkbk->pending_page(pending_req, i);
>> +		}
>> +
>> +		if (persistent_gnt) {
>> +			pages[i] = persistent_gnt->page;
>> +			persistent_gnts[i] = persistent_gnt;
>> +		} else {
>> +			persistent_gnts[i] = NULL;
>> +		}
>> +
>> +		if (new_map) {
>> +			flags = GNTMAP_host_map;
>> +			if (!persistent_gnt &&
>> +			    (pending_req->operation != BLKIF_OP_READ))
>> +				flags |= GNTMAP_readonly;
>> +			gnttab_set_map_op(&map[segs_to_map++], addr,
>> +					  flags, req->u.rw.seg[i].gref,
>> +					  blkif->domid);
>> +		}
>>  	}
>>
>> -	ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 0), nseg);
>> -	BUG_ON(ret);
>> +	if (segs_to_map) {
>> +		ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_map);
>> +		BUG_ON(ret);
>> +	}
>>
>>  	/*
>>  	 * Now swizzle the MFN in our domain with the MFN from the other domain
>>  	 * so that when we access vaddr(pending_req,i) it has the contents of
>>  	 * the page from the other domain.
>>  	 */
>> -	for (i = 0; i < nseg; i++) {
>> -		if (unlikely(map[i].status != 0)) {
>> -			pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> -			map[i].handle = BLKBACK_INVALID_HANDLE;
>> -			ret |= 1;
>> +	for (i = 0, j = 0; i < nseg; i++) {
>> +		if (!persistent_gnts[i] || !persistent_gnts[i]->handle) {
>> +			/* This is a newly mapped grant */
>> +			BUG_ON(j >= segs_to_map);
>> +			if (unlikely(map[j].status != 0)) {
>
> I am not seeing j being incremented anywhere? Should it?

Yes, it should be incremented, but not here. See the comment below to
see what I've changed.

>
>> +				pr_debug(DRV_PFX "invalid buffer -- could not remap it\n");
>> +				map[j].handle = BLKBACK_INVALID_HANDLE;
>> +				ret |= 1;
>> +				if (persistent_gnts[i]) {
>> +					rb_erase(&persistent_gnts[i]->node,
>> +						 &blkif->persistent_gnts);
>> +					blkif->persistent_gnt_c--;
>> +					kfree(persistent_gnts[i]);
>> +					persistent_gnts[i] = NULL;
>> +				}
>> +			}
>> +		}
>> +		if (persistent_gnts[i]) {
>> +			if (!persistent_gnts[i]->handle) {
>> +				/*
>> +				 * If this is a new persistent grant
>> +				 * save the handler
>> +				 */
>> +				persistent_gnts[i]->handle = map[j].handle;
>> +				persistent_gnts[i]->dev_bus_addr =
>> +					map[j++].dev_bus_addr;
>> +			}
>> +			pending_handle(pending_req, i) =
>> +				persistent_gnts[i]->handle;
>> +			pending_req->unmap_seg[i] = 0;
>
> Could we have a #define for that?

Sure.

>> +
>> +			if (ret)
>> +				continue;

This should be:

if (ret) {
	j++;
	continue;
}

>> +
>> +			seg[i].buf = persistent_gnts[i]->dev_bus_addr |
>> +				(req->u.rw.seg[i].first_sect << 9);
>> +		} else {
>> +			pending_handle(pending_req, i) = map[j].handle;
>> +			pending_req->unmap_seg[i] = 1;
>
> And here as well?

Done.

>> +
>> +			if (ret)
>> +				continue;
>> +
>> +			seg[i].buf = map[j++].dev_bus_addr |
>> +				(req->u.rw.seg[i].first_sect << 9);
>>  		}
>> -
>> -		pending_handle(pending_req, i) = map[i].handle;
>> -
>> -		if (ret)
>> -			continue;
>> -
>> -		seg[i].buf = map[i].dev_bus_addr |
>> -			(req->u.rw.seg[i].first_sect << 9);
>>  	}
>>  	return ret;
>>  }
>> @@ -590,6 +818,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  	int operation;
>>  	struct blk_plug plug;
>>  	bool drain = false;
>> +	struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>
>>  	switch (req->operation) {
>>  	case BLKIF_OP_READ:
>> @@ -676,7 +905,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  	 * the hypercall to unmap the grants - that is all done in
>>  	 * xen_blkbk_unmap.
>>  	 */
>> -	if (xen_blkbk_map(req, pending_req, seg))
>> +	if (xen_blkbk_map(req, pending_req, seg, pages))
>>  		goto fail_flush;
>>
>>  	/*
>> @@ -688,7 +917,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
>>  	for (i = 0; i < nseg; i++) {
>>  		while ((bio == NULL) ||
>>  		       (bio_add_page(bio,
>> -				     blkbk->pending_page(pending_req, i),
>> +				     pages[i],
>>  				     seg[i].nsec << 9,
>>  				     seg[i].buf & ~PAGE_MASK) == 0)) {
>>
>> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
>> index 9ad3b5e..6c08ee9 100644
>> --- a/drivers/block/xen-blkback/common.h
>> +++ b/drivers/block/xen-blkback/common.h
>> @@ -34,6 +34,7 @@
>>  #include
>>  #include
>>  #include
>> +#include <linux/rbtree.h>
>>  #include
>>  #include
>>  #include
>> @@ -160,10 +161,22 @@ struct xen_vbd {
>>  	sector_t		size;
>>  	bool			flush_support;
>>  	bool			discard_secure;
>> +
>> +	unsigned int		feature_gnt_persistent:1;
>> +	unsigned int		overflow_max_grants:1;
>
> I think the v3.7-rc1 has this structure changed just a tiny bit, so you
> might want to rebase it on top of that.
I've done the rebase on top of your linux-next branch, commit
ad502612c173fff239250c9fe9bdfaaef70b9901.

>
>>  };
>>
>>  struct backend_info;
>>
>> +
>> +struct persistent_gnt {
>> +	struct page *page;
>> +	grant_ref_t gnt;
>> +	grant_handle_t handle;
>> +	uint64_t dev_bus_addr;
>> +	struct rb_node node;
>> +};
>> +
>>  struct xen_blkif {
>>  	/* Unique identifier for this interface. */
>>  	domid_t			domid;
>> @@ -190,6 +203,10 @@ struct xen_blkif {
>>  	struct task_struct	*xenblkd;
>>  	unsigned int		waiting_reqs;
>>
>> +	/* frontend feature information */
>
> Huh?

Changed it to:

/* tree to store persistent grants */

>> +	struct rb_root		persistent_gnts;
>> +	unsigned int		persistent_gnt_c;
>> +
>>  	/* statistics */
>>  	unsigned long		st_print;
>>  	int			st_rd_req;
>> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
>> index 4f66171..9f88b4e 100644
>> --- a/drivers/block/xen-blkback/xenbus.c
>> +++ b/drivers/block/xen-blkback/xenbus.c
>> @@ -118,6 +118,7 @@ static struct xen_blkif *xen_blkif_alloc(domid_t domid)
>>  	atomic_set(&blkif->drain, 0);
>>  	blkif->st_print = jiffies;
>>  	init_waitqueue_head(&blkif->waiting_to_free);
>> +	blkif->persistent_gnts.rb_node = NULL;
>>
>>  	return blkif;
>>  }
>> @@ -721,6 +722,7 @@ static int connect_ring(struct backend_info *be)
>>  	struct xenbus_device *dev = be->dev;
>>  	unsigned long ring_ref;
>>  	unsigned int evtchn;
>> +	unsigned int pers_grants;
>>  	char protocol[64] = "";
>>  	int err;
>>
>> @@ -750,8 +752,18 @@ static int connect_ring(struct backend_info *be)
>>  		xenbus_dev_fatal(dev, err, "unknown fe protocol %s", protocol);
>>  		return -1;
>>  	}
>> -	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
>> -		ring_ref, evtchn, be->blkif->blk_protocol, protocol);
>> +	err = xenbus_gather(XBT_NIL, dev->otherend,
>> +			    "feature-persistent-grants", "%u",
>> +			    &pers_grants, NULL);
>> +	if (err)
>> +		pers_grants = 0;
>> +
>> +	be->blkif->vbd.feature_gnt_persistent = pers_grants;
>> +	be->blkif->vbd.overflow_max_grants = 0;
>> +
>> +	pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) persistent %d\n",
>> +		ring_ref, evtchn, be->blkif->blk_protocol, protocol,
>> +		pers_grants);
>
> Can you make that a string? So it is
>
> pers_grants ? "persistent grants" : ""
>
> instead of a zero or one value pls?

Yes, done.

>>
>>  	/* Map the shared frame, irq etc. */
>>  	err = xen_blkif_map(be->blkif, ring_ref, evtchn);
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 2c2d2e5..206d422 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -44,6 +44,7 @@
>>  #include
>>  #include
>>  #include
>> +#include <linux/llist.h>
>>
>>  #include
>>  #include
>> @@ -64,10 +65,17 @@ enum blkif_state {
>>  	BLKIF_STATE_SUSPENDED,
>>  };
>>
>> +struct grant {
>> +	grant_ref_t gref;
>> +	unsigned long pfn;
>> +	struct llist_node node;
>> +};
>> +
>>  struct blk_shadow {
>>  	struct blkif_request req;
>>  	struct request *request;
>>  	unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>> +	struct grant *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
>>  };
>>
>>  static DEFINE_MUTEX(blkfront_mutex);
>> @@ -97,6 +105,8 @@ struct blkfront_info
>>  	struct work_struct work;
>>  	struct gnttab_free_callback callback;
>>  	struct blk_shadow shadow[BLK_RING_SIZE];
>> +	struct llist_head persistent_gnts;
>> +	unsigned int persistent_gnts_c;
>>  	unsigned long shadow_free;
>>  	unsigned int feature_flush;
>>  	unsigned int flush_op;
>> @@ -287,21 +297,36 @@ static int blkif_queue_request(struct request *req)
>>  	unsigned long id;
>>  	unsigned int fsect, lsect;
>>  	int i, ref;
>> +
>> +	/*
>> +	 * Used to store if we are able to queue the request by just using
>> +	 * existing persistent grants (0), or if we have to get new grants,
>
> What does the zero mean?

Frankly, no idea; I guess it was in Oliver's patch and I failed to spot
it.

>> +	 * as there are not sufficiently many free.
>> +	 */
>> +	int new_persistent_gnts;
>
> I think this can be a bool?

I agree.

>>  	grant_ref_t gref_head;
>> +	struct page *granted_page;
>> +	struct grant *gnt_list_entry = NULL;
>>  	struct scatterlist *sg;
>>
>>  	if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>>  		return 1;
>>
>> -	if (gnttab_alloc_grant_references(
>> -		BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
>> -		gnttab_request_free_callback(
>> -			&info->callback,
>> -			blkif_restart_queue_callback,
>> -			info,
>> -			BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> -		return 1;
>> -	}
>> +	/* Check if we have enought grants to allocate a requests */
>> +	if (info->persistent_gnts_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
>> +		new_persistent_gnts = 1;
>> +		if (gnttab_alloc_grant_references(
>> +		    BLKIF_MAX_SEGMENTS_PER_REQUEST - info->persistent_gnts_c,
>> +		    &gref_head) < 0) {
>> +			gnttab_request_free_callback(
>> +				&info->callback,
>> +				blkif_restart_queue_callback,
>> +				info,
>> +				BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> +			return 1;
>> +		}
>> +	} else
>> +		new_persistent_gnts = 0;
>>
>>  	/* Fill out a communications ring structure. */
>>  	ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
>> @@ -341,18 +366,73 @@ static int blkif_queue_request(struct request *req)
>>  		       BLKIF_MAX_SEGMENTS_PER_REQUEST);
>>
>>  	for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
>> -		buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
>>  		fsect = sg->offset >> 9;
>>  		lsect = fsect + (sg->length >> 9) - 1;
>> -		/* install a grant reference. */
>> -		ref = gnttab_claim_grant_reference(&gref_head);
>> -		BUG_ON(ref == -ENOSPC);
>>
>> -		gnttab_grant_foreign_access_ref(
>> -			ref,
>> +		if (info->persistent_gnts_c) {
>> +			BUG_ON(llist_empty(&info->persistent_gnts));
>> +			gnt_list_entry = llist_entry(
>> +				llist_del_first(&info->persistent_gnts),
>> +				struct grant, node);
>> +
>> +			ref = gnt_list_entry->gref;
>> +			buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
>> +			info->persistent_gnts_c--;
>> +		} else {
>> +			ref = gnttab_claim_grant_reference(&gref_head);
>> +			BUG_ON(ref == -ENOSPC);
>> +
>> +			gnt_list_entry =
>> +				kmalloc(sizeof(struct grant),
>> +					GFP_ATOMIC);
>> +			if (!gnt_list_entry)
>> +				return -ENOMEM;
>> +
>> +			granted_page = alloc_page(GFP_ATOMIC);
>> +			if (!granted_page) {
>> +				kfree(gnt_list_entry);
>> +				return -ENOMEM;
>> +			}
>> +
>> +			gnt_list_entry->pfn =
>> +				page_to_pfn(granted_page);
>> +			gnt_list_entry->gref = ref;
>> +
>> +			buffer_mfn = pfn_to_mfn(page_to_pfn(
>> +				granted_page));
>> +			gnttab_grant_foreign_access_ref(ref,
>>  				info->xbdev->otherend_id,
>> -				buffer_mfn,
>> -				rq_data_dir(req));
>> +				buffer_mfn, 0);
>> +		}
>> +
>> +		info->shadow[id].grants_used[i] = gnt_list_entry;
>> +
>> +		if (rq_data_dir(req)) {
>> +			char *bvec_data;
>> +			void *shared_data;
>> +
>> +			BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>> +
>> +			shared_data = kmap_atomic(
>> +				pfn_to_page(gnt_list_entry->pfn));
>> +			bvec_data = kmap_atomic(sg_page(sg));
>> +
>> +			/*
>> +			 * this does not wipe data stored outside the
>> +			 * range sg->offset..sg->offset+sg->length.
>> +			 * Therefore, blkback *could* see data from
>> +			 * previous requests. This is OK as long as
>> +			 * persistent grants are shared with just one
>> +			 * domain. It may need refactoring if This
>
> .. this (lowercase it pls)

Done.

>> +			 * changes
>> +			 */
>> +			memcpy(shared_data + sg->offset,
>> +			       bvec_data + sg->offset,
>> +			       sg->length);
>> +
>> +			kunmap_atomic(bvec_data);
>> +			kunmap_atomic(shared_data);
>> +		}
>>
>>  		info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
>>  		ring_req->u.rw.seg[i] =
>> @@ -368,7 +448,8 @@ static int blkif_queue_request(struct request *req)
>>  	/* Keep a private copy so we can reissue requests when recovering. */
>>  	info->shadow[id].req = *ring_req;
>>
>> -	gnttab_free_grant_references(gref_head);
>> +	if (new_persistent_gnts)
>> +		gnttab_free_grant_references(gref_head);
>>
>>  	return 0;
>>  }
>> @@ -480,7 +561,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size)
>>  static void xlvbd_flush(struct blkfront_info *info)
>>  {
>>  	blk_queue_flush(info->rq, info->feature_flush);
>> -	printk(KERN_INFO "blkfront: %s: %s: %s\n",
>> +	printk(KERN_INFO "blkfront: %s: %s: %s, using persistent grants\n",
>
> HA! By default, eh?

Yes, you caught me. There's a paragraph in the commit message that
explains that we are using persistent grants in the frontend
unconditionally, since the protocol is compatible (you can have a
persistent blkfront and a non-persistent blkback); it simplifies the
logic in blkfront. Are you OK with it?

>>  		info->gd->disk_name,
>>  		info->flush_op == BLKIF_OP_WRITE_BARRIER ?
>>  		"barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
>> @@ -707,6 +788,9 @@ static void blkif_restart_queue(struct work_struct *work)
>>
>>  static void blkif_free(struct blkfront_info *info, int suspend)
>>  {
>> +	struct llist_node *all_gnts;
>> +	struct grant *persistent_gnt;
>> +
>>  	/* Prevent new requests being issued until we fix things up. */
>>  	spin_lock_irq(&info->io_lock);
>>  	info->connected = suspend ?
>> @@ -714,6 +798,17 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>  	/* No more blkif_request(). */
>>  	if (info->rq)
>>  		blk_stop_queue(info->rq);
>> +
>> +	/* Remove all persistent grants */
>> +	if (info->persistent_gnts_c) {
>> +		all_gnts = llist_del_all(&info->persistent_gnts);
>> +		llist_for_each_entry(persistent_gnt, all_gnts, node) {
>> +			gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
>> +			kfree(persistent_gnt);
>> +		}
>> +		info->persistent_gnts_c = 0;
>> +	}
>> +
>>  	/* No more gnttab callback work. */
>>  	gnttab_cancel_free_callback(&info->callback);
>>  	spin_unlock_irq(&info->io_lock);
>> @@ -734,13 +829,42 @@ static void blkif_free(struct blkfront_info *info, int suspend)
>>
>>  }
>>
>> -static void blkif_completion(struct blk_shadow *s)
>> +static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>> +			     struct blkif_response *bret)
>>  {
>>  	int i;
>> -	/* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
>> -	 * flag. */
>> -	for (i = 0; i < s->req.u.rw.nr_segments; i++)
>> -		gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
>> +	struct bio_vec *bvec;
>> +	struct req_iterator iter;
>> +	unsigned long flags;
>> +	char *bvec_data;
>> +	void *shared_data;
>> +	unsigned int offset = 0;
>> +
>> +	if (bret->operation == BLKIF_OP_READ) {
>> +		/*
>> +		 * Copy the data received from the backend into the bvec.
>> +		 * Since bv_len can be different from PAGE_SIZE, we need to
>> +		 * be sure we are actually copying the data from the right
>> +		 * shared page.
>> +		 */
>> +		rq_for_each_segment(bvec, s->request, iter) {
>> +			BUG_ON((bvec->bv_offset + bvec->bv_len) > PAGE_SIZE);
>> +			i = offset >> PAGE_SHIFT;
>
> Could you also include a comment about the bug you found here, pls?

There's a comment before the rq_for_each_segment loop that tries to
explain that; do you think the following is better?

/*
 * Copy the data received from the backend into the bvec.
 * Since bv_offset can be different than 0, and bv_len different
 * than PAGE_SIZE, we have to keep track of the current offset,
 * to be sure we are copying the data from the right shared page.
 */

>> +			shared_data = kmap_atomic(
>> +				pfn_to_page(s->grants_used[i]->pfn));
>> +			bvec_data = bvec_kmap_irq(bvec, &flags);
>> +			memcpy(bvec_data, shared_data + bvec->bv_offset,
>> +			       bvec->bv_len);
>> +			bvec_kunmap_irq(bvec_data, &flags);
>> +			kunmap_atomic(shared_data);
>> +			offset += bvec->bv_len;
>> +		}
>> +	}
>> +	/* Add the persistent grant into the list of free grants */
>> +	for (i = 0; i < s->req.u.rw.nr_segments; i++) {
>> +		llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
>> +		info->persistent_gnts_c++;
>> +	}
>>  }
>>
>>  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>> @@ -783,7 +907,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>>  		req = info->shadow[id].request;
>>
>>  		if (bret->operation != BLKIF_OP_DISCARD)
>> -			blkif_completion(&info->shadow[id]);
>> +			blkif_completion(&info->shadow[id], info, bret);
>>
>>  		if (add_id_to_freelist(info, id)) {
>>  			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
>> @@ -942,6 +1066,11 @@ again:
>>  		message = "writing protocol";
>>  		goto abort_transaction;
>>  	}
>> +	err = xenbus_printf(xbt, dev->nodename,
>> +			    "feature-persistent-grants", "%d", 1);
>
> So it's %u in blkback, but %d in here? Which one should it be?

%u in both places.
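Just so we are looking at the same thing, the agreed-upon pair would
then read as follows. This is only a sketch restating the two calls
already present in the patch, with the format string made consistent:

/* blkfront advertises the feature inside the xenbus transaction: */
err = xenbus_printf(xbt, dev->nodename,
		    "feature-persistent-grants", "%u", 1);

/* blkback reads it back in connect_ring(), defaulting to off: */
err = xenbus_gather(XBT_NIL, dev->otherend,
		    "feature-persistent-grants", "%u",
		    &pers_grants, NULL);
if (err)
	pers_grants = 0;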
>> +	if (err)
>> +		dev_warn(&dev->dev,
>> +			 "writing persistent grants feature to xenbus");
>>
>>  	err = xenbus_transaction_end(xbt, 0);
>>  	if (err) {
>> @@ -1029,6 +1158,8 @@ static int blkfront_probe(struct xenbus_device *dev,
>>  	spin_lock_init(&info->io_lock);
>>  	info->xbdev = dev;
>>  	info->vdevice = vdevice;
>> +	init_llist_head(&info->persistent_gnts);
>> +	info->persistent_gnts_c = 0;
>>  	info->connected = BLKIF_STATE_DISCONNECTED;
>>  	INIT_WORK(&info->work, blkif_restart_queue);
>>
>> @@ -1093,7 +1224,7 @@ static int blkif_recover(struct blkfront_info *info)
>>  			req->u.rw.seg[j].gref,
>>  			info->xbdev->otherend_id,
>>  			pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
>> -			rq_data_dir(info->shadow[req->u.rw.id].request));
>> +			0);
>>  		}
>>  		info->shadow[req->u.rw.id].req = *req;
>>
>> --
>> 1.7.7.5 (Apple Git-26)
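As for the llist sketch promised above, the push/pop pattern in
blkfront boils down to the following. Both fragments restate calls that
are already in the patch; nothing here is new code:

/* Pop a free grant from the pool (blkif_queue_request): */
gnt_list_entry = llist_entry(llist_del_first(&info->persistent_gnts),
			     struct grant, node);
info->persistent_gnts_c--;

/* Push it back once the request completes (blkif_completion): */
llist_add(&s->grants_used[i]->node, &info->persistent_gnts);
info->persistent_gnts_c++;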