Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753986AbbLAPag (ORCPT ); Tue, 1 Dec 2015 10:30:36 -0500 Received: from smtp02.citrix.com ([66.165.176.63]:33900 "EHLO SMTP02.CITRIX.COM" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752260AbbLAPac (ORCPT ); Tue, 1 Dec 2015 10:30:32 -0500 X-IronPort-AV: E=Sophos;i="5.20,369,1444694400"; d="scan'208";a="321726792" Subject: Re: [PATCH v3 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages To: Julien Grall , , References: <1447873045-14663-1-git-send-email-julien.grall@citrix.com> <1447873045-14663-3-git-send-email-julien.grall@citrix.com> CC: Konrad Rzeszutek Wilk , Boris Ostrovsky , David Vrabel , Bob Liu From: =?UTF-8?Q?Roger_Pau_Monn=c3=a9?= X-Enigmail-Draft-Status: N1110 Message-ID: <565DBD08.5030503@citrix.com> Date: Tue, 1 Dec 2015 16:30:16 +0100 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <1447873045-14663-3-git-send-email-julien.grall@citrix.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit X-DLP: MIA2 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 16363 Lines: 469 El 18/11/15 a les 19.57, Julien Grall ha escrit: > The minimal size of request in the block framework is always PAGE_SIZE. > It means that when 64KB guest is support, the request will at least be > 64KB. > > Although, if the backend doesn't support indirect descriptor (such as QDISK > in QEMU), a ring request is only able to accommodate 11 segments of 4KB > (i.e 44KB). > > The current frontend is assuming that an I/O request will always fit in > a ring request. This is not true any more when using 64KB page > granularity and will therefore crash during boot. > > On ARM64, the ABI is completely neutral to the page granularity used by > the domU. The guest has the choice between different page granularity > supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB). > This can't be enforced by the hypervisor and therefore it's possible to > run guests using different page granularity. > > So we can't mandate the block backend to support indirect descriptor > when the frontend is using 64KB page granularity and have to fix it > properly in the frontend. > > The solution exposed below is based on modifying directly the frontend > guest rather than asking the block framework to support smaller size > (i.e < PAGE_SIZE). This is because the change is the block framework are > not trivial as everything seems to relying on a struct *page (see [1]). > Although, it may be possible that someone succeed to do it in the future > and we would therefore be able to use it. > > Given that a block request may not fit in a single ring request, a > second request is introduced for the data that cannot fit in the first > one. This means that the second ring request should never be used on > Linux if the page size is smaller than 44KB. > > To achieve the support of the extra ring request, the block queue size > is divided by two. Therefore, the ring will always contain enough space > to accommodate 2 ring requests. While this will reduce the overall > performance, it will make the implementation more contained. The way > forward to get better performance is to implement in the backend either > indirect descriptor or multiple grants ring. > > Note that the parameters blk_queue_max_* helpers haven't been updated. > The block code will set the mimimum size supported and we may be able > to support directly any change in the block framework that lower down > the minimal size of a request. > > [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html > > Signed-off-by: Julien Grall Acked-by: Roger Pau Monné Just a couple of cosmetic comments below. Unless anyone else has real comments I don't think you need to resend. > --- > Cc: Konrad Rzeszutek Wilk > Cc: "Roger Pau Monné" > Cc: Boris Ostrovsky > Cc: David Vrabel > Cc: Bob Liu > > Roger, I haven't kept your acked-by because the update of the final > status. > > Changes in v3: > - Typoes > - Change the way the final status is computed to handle the > possibility of the request handled by different backend (could > happen during suspend/resume). > > Changes in v2: > - Update the commit message > - Typoes > - Rename ring_req2/id2 to extra_ring_req/extra_id > --- > drivers/block/xen-blkfront.c | 228 ++++++++++++++++++++++++++++++++++++++++--- > 1 file changed, 212 insertions(+), 16 deletions(-) > > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c > index 2248a47..33f6ff4 100644 > --- a/drivers/block/xen-blkfront.c > +++ b/drivers/block/xen-blkfront.c > @@ -60,6 +60,20 @@ > > #include > > +/* > + * The minimal size of segment supported by the block framework is PAGE_SIZE. > + * When Linux is using a different page size than xen, it may not be possible > + * to put all the data in a single segment. > + * This can happen when the backend doesn't support indirect descriptor and > + * therefore the maximum amount of data that a request can carry is > + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB > + * > + * Note that we only support one extra request. So the Linux page size > + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) = > + * 88KB. > + */ > +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE) > + > enum blkif_state { > BLKIF_STATE_DISCONNECTED, > BLKIF_STATE_CONNECTED, > @@ -72,6 +86,13 @@ struct grant { > struct list_head node; > }; > > +enum blk_req_status { > + REQ_WAITING, > + REQ_DONE, > + REQ_ERROR, > + REQ_EOPNOTSUPP, > +}; > + > struct blk_shadow { > struct blkif_request req; > struct request *request; > @@ -79,6 +100,14 @@ struct blk_shadow { > struct grant **indirect_grants; > struct scatterlist *sg; > unsigned int num_sg; > + enum blk_req_status status; > + > + #define NO_ASSOCIATED_ID ~0UL > + /* > + * Id of the sibling if we ever need 2 requests when handling a > + * block I/O request > + */ > + unsigned long associated_id; > }; > > struct split_bio { > @@ -467,6 +496,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info, > > id = get_id_from_freelist(info); > info->shadow[id].request = req; > + info->shadow[id].status = REQ_WAITING; > + info->shadow[id].associated_id = NO_ASSOCIATED_ID; > > (*ring_req)->u.rw.id = id; > > @@ -508,6 +539,9 @@ struct setup_rw_req { > bool need_copy; > unsigned int bvec_off; > char *bvec_data; > + > + bool require_extra_req; > + struct blkif_request *extra_ring_req; > }; > > static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > @@ -521,8 +555,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > unsigned int grant_idx = setup->grant_idx; > struct blkif_request *ring_req = setup->ring_req; > struct blkfront_info *info = setup->info; > + /* > + * We always use the shadow of the first request to store the list > + * of grant associated to the block I/O request. This made the > + * completion more easy to handle even if the block I/O request is > + * split. > + */ > struct blk_shadow *shadow = &info->shadow[setup->id]; > > + if (unlikely(setup->require_extra_req && > + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) { > + /* > + * We are using the second request, setup grant_idx > + * to be the index of the segment array > + */ > + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST; > + ring_req = setup->extra_ring_req; > + } > + > if ((ring_req->operation == BLKIF_OP_INDIRECT) && > (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) { > if (setup->segments) > @@ -537,7 +587,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > > gnt_list_entry = get_grant(&setup->gref_head, gfn, info); > ref = gnt_list_entry->gref; > - shadow->grants_used[grant_idx] = gnt_list_entry; > + /* > + * All the grants are stored in the shadow of the first > + * request. Therefore we have to use the global index > + */ > + shadow->grants_used[setup->grant_idx] = gnt_list_entry; > > if (setup->need_copy) { > void *shared_data; > @@ -579,11 +633,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset, > (setup->grant_idx)++; > } > > +static void blkif_setup_extra_req(struct blkif_request *first, > + struct blkif_request *second) > +{ > + uint16_t nr_segments = first->u.rw.nr_segments; > + > + /* > + * The second request is only present when the first request uses > + * all its segments. It's always the continuity of the first one. > + */ > + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + > + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST; > + second->u.rw.sector_number = first->u.rw.sector_number + > + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512; > + > + second->u.rw.handle = first->u.rw.handle; > + second->operation = first->operation; > +} > + > static int blkif_queue_rw_req(struct request *req) > { > struct blkfront_info *info = req->rq_disk->private_data; > - struct blkif_request *ring_req; > - unsigned long id; > + struct blkif_request *ring_req, *extra_ring_req = NULL; > + unsigned long id, extra_id = NO_ASSOCIATED_ID; > + bool require_extra_req = false; > int i; > struct setup_rw_req setup = { > .grant_idx = 0, > @@ -628,19 +702,19 @@ static int blkif_queue_rw_req(struct request *req) > /* Fill out a communications ring structure. */ > id = blkif_ring_get_request(info, req, &ring_req); > > - BUG_ON(info->max_indirect_segments == 0 && > - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST); > - BUG_ON(info->max_indirect_segments && > - GREFS(req->nr_phys_segments) > info->max_indirect_segments); > - > num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg); > num_grant = 0; > /* Calculate the number of grant used */ > for_each_sg(info->shadow[id].sg, sg, num_sg, i) > num_grant += gnttab_count_grant(sg->offset, sg->length); > > + require_extra_req = info->max_indirect_segments == 0 && > + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST; > + BUG_ON(!HAS_EXTRA_REQ && require_extra_req); > + > info->shadow[id].num_sg = num_sg; > - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) { > + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST && > + likely(!require_extra_req)) { > /* > * The indirect operation can only be a BLKIF_OP_READ or > * BLKIF_OP_WRITE > @@ -680,10 +754,30 @@ static int blkif_queue_rw_req(struct request *req) > } > } > ring_req->u.rw.nr_segments = num_grant; > + if (unlikely(require_extra_req)) { > + extra_id = blkif_ring_get_request(info, req, > + &extra_ring_req); > + /* > + * Only the first request contains the scatter-gather > + * list. > + */ > + info->shadow[extra_id].num_sg = 0; > + > + blkif_setup_extra_req(ring_req, extra_ring_req); > + > + /* Link the 2 requests together */ > + info->shadow[extra_id].associated_id = id; > + info->shadow[id].associated_id = extra_id; > + } > } > > setup.ring_req = ring_req; > setup.id = id; > + > + setup.require_extra_req = require_extra_req; > + if (unlikely(require_extra_req)) > + setup.extra_ring_req = extra_ring_req; > + > for_each_sg(info->shadow[id].sg, sg, num_sg, i) { > BUG_ON(sg->offset + sg->length > PAGE_SIZE); > > @@ -706,6 +800,8 @@ static int blkif_queue_rw_req(struct request *req) > > /* Keep a private copy so we can reissue requests when recovering. */ > info->shadow[id].req = *ring_req; > + if (unlikely(require_extra_req)) > + info->shadow[extra_id].req = *extra_ring_req; > > if (new_persistent_gnts) > gnttab_free_grant_references(setup.gref_head); > @@ -797,7 +893,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size, > memset(&info->tag_set, 0, sizeof(info->tag_set)); > info->tag_set.ops = &blkfront_mq_ops; > info->tag_set.nr_hw_queues = 1; > - info->tag_set.queue_depth = BLK_RING_SIZE(info); > + if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) { > + /* > + * When indirect descriptior is not supported, the I/O request ^ descriptor > + * will be split between multiple request in the ring. > + * To avoid problems when sending the request, divide by > + * 2 the depth of the queue. > + */ > + info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2; > + } else > + info->tag_set.queue_depth = BLK_RING_SIZE(info); > info->tag_set.numa_node = NUMA_NO_NODE; > info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE; > info->tag_set.cmd_size = 0; > @@ -1217,19 +1322,92 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset, > kunmap_atomic(shared_data); > } > > -static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, > +static enum blk_req_status blkif_rsp_to_req_status(int rsp) > +{ > + switch (rsp) > + { > + case BLKIF_RSP_OKAY: > + return REQ_DONE; > + case BLKIF_RSP_ERROR: > + return REQ_ERROR; This could be joined with the default case with a fall through: case BLKIF_RSP_ERROR: default: return REQ_ERROR; > + case BLKIF_RSP_EOPNOTSUPP: > + return REQ_EOPNOTSUPP; > + default: > + return REQ_ERROR; > + } > +} > + > +/* > + * Get the final status of the block request based on two ring response > + */ > +static int blkif_get_final_status(enum blk_req_status s1, > + enum blk_req_status s2) > +{ > + BUG_ON(s1 == REQ_WAITING); > + BUG_ON(s2 == REQ_WAITING); > + > + if (s1 == REQ_ERROR || s2 == REQ_ERROR) > + return BLKIF_RSP_ERROR; > + else if (s1 == REQ_EOPNOTSUPP || s2 == REQ_EOPNOTSUPP) > + return BLKIF_RSP_EOPNOTSUPP; > + else Unneeded else? > + return BLKIF_RSP_OKAY; > +} > + > +static bool blkif_completion(unsigned long *id, struct blkfront_info *info, > struct blkif_response *bret) > { > int i = 0; > struct scatterlist *sg; > int num_sg, num_grant; > + struct blk_shadow *s = &info->shadow[*id]; > struct copy_from_grant data = { > - .s = s, > .grant_idx = 0, > }; > > num_grant = s->req.operation == BLKIF_OP_INDIRECT ? > s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments; > + > + /* The I/O request may be split in two */ > + if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) { > + struct blk_shadow *s2 = &info->shadow[s->associated_id]; > + > + /* Keep the status of the current response in shadow */ > + s->status = blkif_rsp_to_req_status(bret->status); > + > + /* Wait the second response if not yet here */ > + if (s2->status == REQ_WAITING) > + return 0; > + > + bret->status = blkif_get_final_status(s->status, > + s2->status); > + > + /* > + * All the grants is stored in the first shadow in order > + * to make the completion code simpler. > + */ > + num_grant += s2->req.u.rw.nr_segments; > + > + /* > + * The two responses may not come in order. Only the > + * first request will store the scatter-gather list > + */ > + if (s2->num_sg != 0) { > + /* Update "id" with the ID of the first response */ > + *id = s->associated_id; > + s = s2; > + } > + > + /* > + * We don't need anymore the second request, so recycling > + * it now. > + */ > + if (add_id_to_freelist(info, s->associated_id)) > + WARN(1, "%s: can't recycle the second part (id = %ld) of the request\n", > + info->gd->disk_name, s->associated_id); > + } > + > + data.s = s; > num_sg = s->num_sg; > > if (bret->operation == BLKIF_OP_READ && info->feature_persistent) { > @@ -1299,6 +1477,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info, > } > } > } > + > + return 1; > } > > static irqreturn_t blkif_interrupt(int irq, void *dev_id) > @@ -1340,8 +1520,14 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id) > } > req = info->shadow[id].request; > > - if (bret->operation != BLKIF_OP_DISCARD) > - blkif_completion(&info->shadow[id], info, bret); > + if (bret->operation != BLKIF_OP_DISCARD) { > + /* > + * We may need to wait for an extra response if the > + * I/O request is split in 2 > + */ > + if (!blkif_completion(&id, info, bret)) > + continue; > + } > > if (add_id_to_freelist(info, id)) { > WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n", > @@ -1864,8 +2050,18 @@ static int blkfront_setup_indirect(struct blkfront_info *info) > unsigned int psegs, grants; > int err, i; > > - if (info->max_indirect_segments == 0) > - grants = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + if (info->max_indirect_segments == 0) { > + if (!HAS_EXTRA_REQ) > + grants = BLKIF_MAX_SEGMENTS_PER_REQUEST; > + else { > + /* > + * When an extra req is required, the maximum > + * grants supported is related to the size of the > + * Linux block segment > + */ > + grants = GRANTS_PER_PSEG; > + } > + } > else > grants = info->max_indirect_segments; > psegs = grants / GRANTS_PER_PSEG; > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/