Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1422786AbbEOMHG (ORCPT ); Fri, 15 May 2015 08:07:06 -0400 Received: from aserp1040.oracle.com ([141.146.126.69]:49432 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1422728AbbEOMHE (ORCPT ); Fri, 15 May 2015 08:07:04 -0400 Message-ID: <5555E15B.8000406@oracle.com> Date: Fri, 15 May 2015 20:06:51 +0800 From: Bob Liu User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: =?windows-1252?Q?Roger_Pau_Monn=E9?= CC: xen-devel@lists.xen.org, david.vrabel@citrix.com, justing@spectralogic.com, konrad.wilk@oracle.com, paul.durrant@citrix.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH v3 2/2] xen/block: add multi-page ring support References: <1431428504-6623-1-git-send-email-bob.liu@oracle.com> <1431428504-6623-2-git-send-email-bob.liu@oracle.com> <5555D4D1.1030705@citrix.com> In-Reply-To: <5555D4D1.1030705@citrix.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit X-Source-IP: aserv0021.oracle.com [141.146.126.233] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 18338 Lines: 506 On 05/15/2015 07:13 PM, Roger Pau Monn? wrote: > El 12/05/15 a les 13.01, Bob Liu ha escrit: >> Extend xen/block to support multi-page ring, so that more requests can be issued >> by using more than one pages as the request ring between blkfront and backend. >> As a result, the performance can get improved significantly. > ^ s/can get improved/improves/ > >> >> We got some impressive improvements on our highend iscsi storage cluster backend. >> If using 64 pages as the ring, the IOPS increased about 15 times for the >> throughput testing and above doubled for the latency testing. >> >> The reason was the limit on outstanding requests is 32 if use only one-page >> ring, but in our case the iscsi lun was spread across about 100 physical drives, >> 32 was really not enough to keep them busy. >> >> Changes in v2: >> - Rebased to 4.0-rc6 >> - Added description on how this protocol works into io/blkif.h > > I don't see any changes to io/blkif.h in this patch, is something missing? > Sorry, I should mention in v3 that these changed were removed because I followed the protocol already defined in XEN git tree: xen/include/public/io/blkif.h > Also you use XENBUS_MAX_RING_PAGES which AFAICT it's not defined anywhere. > It was defined in include/xen/xenbus.h. >> >> Changes in v3: >> - Follow the protocol defined in io/blkif.h on XEN tree >> >> Signed-off-by: Bob Liu >> --- >> drivers/block/xen-blkback/blkback.c | 14 ++++- >> drivers/block/xen-blkback/common.h | 4 +- >> drivers/block/xen-blkback/xenbus.c | 83 ++++++++++++++++++++++------- >> drivers/block/xen-blkfront.c | 102 +++++++++++++++++++++++++++--------- >> 4 files changed, 156 insertions(+), 47 deletions(-) >> >> diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c >> index 713fc9f..f191083 100644 >> --- a/drivers/block/xen-blkback/blkback.c >> +++ b/drivers/block/xen-blkback/blkback.c >> @@ -84,6 +84,12 @@ MODULE_PARM_DESC(max_persistent_grants, >> "Maximum number of grants to map persistently"); >> >> /* >> + * Maximum number of pages to be used as the ring between front and backend >> + */ >> +unsigned int xen_blkif_max_ring_pages = XENBUS_MAX_RING_PAGES; >> +module_param_named(max_ring_pages, xen_blkif_max_ring_pages, int, S_IRUGO); >> +MODULE_PARM_DESC(max_ring_pages, "Maximum amount of pages to be used as the ring"); >> +/* >> * The LRU mechanism to clean the lists of persistent grants needs to >> * be executed periodically. The time interval between consecutive executions >> * of the purge mechanism is set in ms. >> @@ -630,7 +636,7 @@ purge_gnt_list: >> } >> >> /* Shrink if we have more than xen_blkif_max_buffer_pages */ >> - shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages); >> + shrink_free_pagepool(blkif, xen_blkif_max_buffer_pages * blkif->nr_ring_pages); > > You are greatly increasing the buffer of free (ballooned) pages. > Possibly making it 32 times bigger than it used to be, is this really > needed? > Hmm.. it's a bit aggressive. How about (xen_blkif_max_buffer_pages * blkif->nr_ring_pages) / 2? >> >> if (log_stats && time_after(jiffies, blkif->st_print)) >> print_stats(blkif); >> @@ -1435,6 +1441,12 @@ static int __init xen_blkif_init(void) >> { >> int rc = 0; >> >> + if (xen_blkif_max_ring_pages > XENBUS_MAX_RING_PAGES) { >> + pr_info("Invalid max_ring_pages (%d), will use default max: %d.\n", >> + xen_blkif_max_ring_pages, XENBUS_MAX_RING_PAGES); >> + xen_blkif_max_ring_pages = XENBUS_MAX_RING_PAGES; >> + } >> + >> if (!xen_domain()) >> return -ENODEV; >> >> diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h >> index f620b5d..84a964c 100644 >> --- a/drivers/block/xen-blkback/common.h >> +++ b/drivers/block/xen-blkback/common.h >> @@ -44,6 +44,7 @@ >> #include >> #include >> >> +extern unsigned int xen_blkif_max_ring_pages; >> /* >> * This is the maximum number of segments that would be allowed in indirect >> * requests. This value will also be passed to the frontend. >> @@ -248,7 +249,7 @@ struct backend_info; >> #define PERSISTENT_GNT_WAS_ACTIVE 1 >> >> /* Number of requests that we can fit in a ring */ >> -#define XEN_BLKIF_REQS 32 >> +#define XEN_BLKIF_REQS (32 * XENBUS_MAX_RING_PAGES) > > This should be made a member of xen_blkif and set dynamically, or else > we are just wasting memory if the frontend doesn't support multipage > rings. We seem to be doing this for a bunch of features, but allocating > 32 times the needed amount of requests (if the frontend doesn't support > multipage rings) seems too much IMHO. > Right, the reason I use static is to simple the code. Else we should: 1) Move xen_blkif_alloc() from backend_probe() to some place until we read the negotiated 'num-ring-pages' written by front. 2) Deal with memory alloc/free when domU migrated and 'num-ring-pages' changed. I'll try to find a suitable place in frontend_changed() but may be in an new patch late. >> >> struct persistent_gnt { >> struct page *page; >> @@ -320,6 +321,7 @@ struct xen_blkif { >> struct work_struct free_work; >> /* Thread shutdown wait queue. */ >> wait_queue_head_t shutdown_wq; >> + int nr_ring_pages; >> }; >> >> struct seg_buf { >> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c >> index 6ab69ad..909babd 100644 >> --- a/drivers/block/xen-blkback/xenbus.c >> +++ b/drivers/block/xen-blkback/xenbus.c >> @@ -198,8 +198,8 @@ fail: >> return ERR_PTR(-ENOMEM); >> } >> >> -static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref, >> - unsigned int evtchn) >> +static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref, >> + unsigned int nr_grefs, unsigned int evtchn) >> { >> int err; >> >> @@ -207,7 +207,7 @@ static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref, >> if (blkif->irq) >> return 0; >> >> - err = xenbus_map_ring_valloc(blkif->be->dev, &gref, 1, >> + err = xenbus_map_ring_valloc(blkif->be->dev, gref, nr_grefs, >> &blkif->blk_ring); >> if (err < 0) >> return err; >> @@ -217,21 +217,21 @@ static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t gref, >> { >> struct blkif_sring *sring; >> sring = (struct blkif_sring *)blkif->blk_ring; >> - BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE); >> + BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE * nr_grefs); >> break; >> } >> case BLKIF_PROTOCOL_X86_32: >> { >> struct blkif_x86_32_sring *sring_x86_32; >> sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring; >> - BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE); >> + BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE * nr_grefs); >> break; >> } >> case BLKIF_PROTOCOL_X86_64: >> { >> struct blkif_x86_64_sring *sring_x86_64; >> sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring; >> - BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE); >> + BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE * nr_grefs); >> break; >> } >> default: >> @@ -597,6 +597,11 @@ static int xen_blkbk_probe(struct xenbus_device *dev, >> if (err) >> goto fail; >> >> + err = xenbus_printf(XBT_NIL, dev->nodename, "max-ring-pages", "%u", >> + xen_blkif_max_ring_pages); >> + if (err) >> + pr_warn("%s write out 'max-ring-pages' failed\n", __func__); >> + >> err = xenbus_switch_state(dev, XenbusStateInitWait); >> if (err) >> goto fail; >> @@ -860,22 +865,62 @@ again: >> static int connect_ring(struct backend_info *be) >> { >> struct xenbus_device *dev = be->dev; >> - unsigned long ring_ref; >> - unsigned int evtchn; >> + unsigned int ring_ref[XENBUS_MAX_RING_PAGES]; >> + unsigned int evtchn, nr_grefs; >> unsigned int pers_grants; >> char protocol[64] = ""; >> int err; >> >> pr_debug("%s %s\n", __func__, dev->otherend); >> >> - err = xenbus_gather(XBT_NIL, dev->otherend, "ring-ref", "%lu", >> - &ring_ref, "event-channel", "%u", &evtchn, NULL); >> - if (err) { >> - xenbus_dev_fatal(dev, err, >> - "reading %s/ring-ref and event-channel", >> + err = xenbus_scanf(XBT_NIL, dev->otherend, "event-channel", "%u", >> + &evtchn); >> + if (err != 1) { >> + err = -EINVAL; >> + xenbus_dev_fatal(dev, err, "reading %s/event-channel", >> dev->otherend); >> return err; >> } >> + pr_info("event-channel %u\n", evtchn); >> + >> + err = xenbus_scanf(XBT_NIL, dev->otherend, "num-ring-pages", "%u", >> + &nr_grefs); >> + if (err != 1) { >> + err = xenbus_scanf(XBT_NIL, dev->otherend, "ring-ref", >> + "%u", &ring_ref[0]); >> + if (err != 1) { >> + err = -EINVAL; >> + xenbus_dev_fatal(dev, err, "reading %s/ring-ref", >> + dev->otherend); >> + return err; >> + } >> + nr_grefs = 1; >> + pr_info("%s:using single page: ring-ref %d\n", dev->otherend, >> + ring_ref[0]); >> + } else { >> + unsigned int i; >> + >> + if (nr_grefs > xen_blkif_max_ring_pages) { >> + err = -EINVAL; >> + xenbus_dev_fatal(dev, err, "%s/request %d ring pages exceed max:%d", >> + dev->otherend, nr_grefs, xen_blkif_max_ring_pages); >> + return err; >> + } >> + for (i = 0; i < nr_grefs; i++) { >> + char ring_ref_name[BLKBACK_NAME_LEN]; > > I would add a new macro rather than reusing one which might change > (BLKBACK_RINGREF_LEN?). > Sure. >> + >> + snprintf(ring_ref_name, sizeof(ring_ref_name), "ring-ref%u", i); >> + err = xenbus_scanf(XBT_NIL, dev->otherend, >> + ring_ref_name, "%u", &ring_ref[i]); >> + if (err != 1) { >> + err = -EINVAL; >> + xenbus_dev_fatal(dev, err, "reading %s/%s", >> + dev->otherend, ring_ref_name); >> + return err; >> + } >> + pr_info("ring-ref%u: %u\n", i, ring_ref[i]); >> + } >> + } >> >> be->blkif->blk_protocol = BLKIF_PROTOCOL_DEFAULT; >> err = xenbus_gather(XBT_NIL, dev->otherend, "protocol", >> @@ -900,16 +945,16 @@ static int connect_ring(struct backend_info *be) >> >> be->blkif->vbd.feature_gnt_persistent = pers_grants; >> be->blkif->vbd.overflow_max_grants = 0; >> + be->blkif->nr_ring_pages = nr_grefs; >> >> - pr_info("ring-ref %ld, event-channel %d, protocol %d (%s) %s\n", >> - ring_ref, evtchn, be->blkif->blk_protocol, protocol, >> - pers_grants ? "persistent grants" : ""); >> + pr_info("ring-pages:%d, event-channel %d, protocol %d (%s) %s\n", >> + nr_grefs, evtchn, be->blkif->blk_protocol, protocol, >> + pers_grants ? "persistent grants" : ""); >> >> /* Map the shared frame, irq etc. */ >> - err = xen_blkif_map(be->blkif, ring_ref, evtchn); >> + err = xen_blkif_map(be->blkif, ring_ref, nr_grefs, evtchn); >> if (err) { >> - xenbus_dev_fatal(dev, err, "mapping ring-ref %lu port %u", >> - ring_ref, evtchn); >> + xenbus_dev_fatal(dev, err, "mapping ring-ref port %u", evtchn); >> return err; >> } >> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c >> index 88e23fd..c4b427f 100644 >> --- a/drivers/block/xen-blkfront.c >> +++ b/drivers/block/xen-blkfront.c >> @@ -98,7 +98,17 @@ static unsigned int xen_blkif_max_segments = 32; >> module_param_named(max, xen_blkif_max_segments, int, S_IRUGO); >> MODULE_PARM_DESC(max, "Maximum amount of segments in indirect requests (default is 32)"); >> >> -#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE) >> +static unsigned int xen_blkif_max_ring_pages = 1; >> +module_param_named(max_ring_pages, xen_blkif_max_ring_pages, int, S_IRUGO); >> +MODULE_PARM_DESC(max_ring_pages, "Maximum amount of pages to be used as the ring"); >> + >> +#define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * info->nr_ring_pages) > > I would prefer if this macro took a explicit info parameter instead on > realying that it is always info. > Sorry, I missed your idea here. >> +#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * XENBUS_MAX_RING_PAGES) >> +/* >> + * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19 characters >> + * are enough. Define to 20 to keep consist with backend. >> + */ >> +#define RINGREF_NAME_LEN (20) >> >> /* >> * We have one of these per vbd, whether ide, scsi or 'other'. They >> @@ -114,13 +124,14 @@ struct blkfront_info >> int vdevice; >> blkif_vdev_t handle; >> enum blkif_state connected; >> - int ring_ref; >> + int ring_ref[XENBUS_MAX_RING_PAGES]; >> + unsigned int nr_ring_pages; >> struct blkif_front_ring ring; >> unsigned int evtchn, irq; >> struct request_queue *rq; >> struct work_struct work; >> struct gnttab_free_callback callback; >> - struct blk_shadow shadow[BLK_RING_SIZE]; >> + struct blk_shadow shadow[BLK_MAX_RING_SIZE]; >> struct list_head grants; >> struct list_head indirect_pages; >> unsigned int persistent_gnts_c; >> @@ -139,8 +150,6 @@ static unsigned int nr_minors; >> static unsigned long *minors; >> static DEFINE_SPINLOCK(minor_lock); >> >> -#define MAXIMUM_OUTSTANDING_BLOCK_REQS \ >> - (BLKIF_MAX_SEGMENTS_PER_REQUEST * BLK_RING_SIZE) >> #define GRANT_INVALID_REF 0 >> >> #define PARTS_PER_DISK 16 >> @@ -1033,12 +1042,15 @@ free_shadow: >> flush_work(&info->work); >> >> /* Free resources associated with old device channel. */ >> - if (info->ring_ref != GRANT_INVALID_REF) { >> - gnttab_end_foreign_access(info->ring_ref, 0, >> - (unsigned long)info->ring.sring); >> - info->ring_ref = GRANT_INVALID_REF; >> - info->ring.sring = NULL; >> + for (i = 0; i < info->nr_ring_pages; i++) { >> + if (info->ring_ref[i] != GRANT_INVALID_REF) { >> + gnttab_end_foreign_access(info->ring_ref[i], 0, 0); >> + info->ring_ref[i] = GRANT_INVALID_REF; >> + } >> } >> + free_pages((unsigned long)info->ring.sring, get_order(info->nr_ring_pages * PAGE_SIZE)); >> + info->ring.sring = NULL; >> + >> if (info->irq) >> unbind_from_irqhandler(info->irq, info); >> info->evtchn = info->irq = 0; >> @@ -1245,26 +1257,30 @@ static int setup_blkring(struct xenbus_device *dev, >> struct blkfront_info *info) >> { >> struct blkif_sring *sring; >> - grant_ref_t gref; >> - int err; >> + int err, i; >> + unsigned long ring_size = info->nr_ring_pages * PAGE_SIZE; >> + grant_ref_t gref[XENBUS_MAX_RING_PAGES]; >> >> - info->ring_ref = GRANT_INVALID_REF; >> + for (i = 0; i < info->nr_ring_pages; i++) >> + info->ring_ref[i] = GRANT_INVALID_REF; >> >> - sring = (struct blkif_sring *)__get_free_page(GFP_NOIO | __GFP_HIGH); >> + sring = (struct blkif_sring *)__get_free_pages(GFP_NOIO | __GFP_HIGH, >> + get_order(ring_size)); >> if (!sring) { >> xenbus_dev_fatal(dev, -ENOMEM, "allocating shared ring"); >> return -ENOMEM; >> } >> SHARED_RING_INIT(sring); >> - FRONT_RING_INIT(&info->ring, sring, PAGE_SIZE); >> + FRONT_RING_INIT(&info->ring, sring, ring_size); >> >> - err = xenbus_grant_ring(dev, info->ring.sring, 1, &gref); >> + err = xenbus_grant_ring(dev, info->ring.sring, info->nr_ring_pages, gref); >> if (err < 0) { >> free_page((unsigned long)sring); >> info->ring.sring = NULL; >> goto fail; >> } >> - info->ring_ref = gref; >> + for (i = 0; i < info->nr_ring_pages; i++) >> + info->ring_ref[i] = gref[i]; >> >> err = xenbus_alloc_evtchn(dev, &info->evtchn); >> if (err) >> @@ -1292,7 +1308,15 @@ static int talk_to_blkback(struct xenbus_device *dev, >> { >> const char *message = NULL; >> struct xenbus_transaction xbt; >> - int err; >> + int err, i; >> + unsigned int max_pages = 0; >> + >> + err = xenbus_scanf(XBT_NIL, info->xbdev->otherend, >> + "max-ring-pages", "%u", &max_pages); >> + if (err != 1) >> + info->nr_ring_pages = 1; >> + else >> + info->nr_ring_pages = min(xen_blkif_max_ring_pages, max_pages); >> >> /* Create shared ring, alloc event channel. */ >> err = setup_blkring(dev, info); >> @@ -1306,11 +1330,31 @@ again: >> goto destroy_blkring; >> } >> >> - err = xenbus_printf(xbt, dev->nodename, >> - "ring-ref", "%u", info->ring_ref); >> - if (err) { >> - message = "writing ring-ref"; >> - goto abort_transaction; >> + if (info->nr_ring_pages == 1) { >> + err = xenbus_printf(xbt, dev->nodename, >> + "ring-ref", "%u", info->ring_ref[0]); >> + if (err) { >> + message = "writing ring-ref"; >> + goto abort_transaction; >> + } >> + } else { >> + err = xenbus_printf(xbt, dev->nodename, >> + "num-ring-pages", "%u", info->nr_ring_pages); >> + if (err) { >> + message = "writing num-ring-pages"; >> + goto abort_transaction; >> + } >> + >> + for (i = 0; i < info->nr_ring_pages; i++) { >> + char ring_ref_name[RINGREF_NAME_LEN]; >> + snprintf(ring_ref_name, sizeof(ring_ref_name), "ring-ref%u", i); >> + err = xenbus_printf(xbt, dev->nodename, >> + ring_ref_name, "%u", info->ring_ref[i]); >> + if (err) { >> + message = "writing ring-ref"; >> + goto abort_transaction; >> + } >> + } >> } >> err = xenbus_printf(xbt, dev->nodename, >> "event-channel", "%u", info->evtchn); >> @@ -1338,6 +1382,7 @@ again: >> goto destroy_blkring; >> } > > Why is this moved to here... > Else the kernel will panic. >> + info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff; >> xenbus_switch_state(dev, XenbusStateInitialised); >> >> return 0; >> @@ -1422,9 +1467,8 @@ static int blkfront_probe(struct xenbus_device *dev, >> info->connected = BLKIF_STATE_DISCONNECTED; >> INIT_WORK(&info->work, blkif_restart_queue); >> >> - for (i = 0; i < BLK_RING_SIZE; i++) >> + for (i = 0; i < BLK_MAX_RING_SIZE; i++) >> info->shadow[i].req.u.rw.id = i+1; >> - info->shadow[BLK_RING_SIZE-1].req.u.rw.id = 0x0fffffff; > > from here. Isn't the previous location suitable anymore? > See: #define BLK_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * info->nr_ring_pages) Because we haven't get info->nr_ring_pages yet here(in probe()), info->nr_ring_pages is read out in talk_to_blkback() which I already moved to blkback_changed(). Thanks for your review. -Bob -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/