2015-11-12 16:45:38

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] block/xen-blkfront: Handle non-indirect grant with 64KB pages

El 19/10/15 a les 16.19, Julien Grall ha escrit:
> The minimal size of request in the block framework is always PAGE_SIZE.
> It means that when 64KB guest is support, the request will at least be
> 64KB.
>
> Although, if the backend doesn't support indirect descriptor (such as QDISK
> in QEMU), a ring request is only able to accommodate 11 segments of 4KB
> (i.e 44KB).
>
> The current frontend is assuming that an I/O request will always fit in
> a ring request. This is not true any more when using 64KB page
> granularity and will therefore crash during boot.
>
> On ARM64, the ABI is completely neutral to the page granularity used by
> the domU. The guest has the choice between different page granularity
> supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
> This can't be enforced by the hypervisor and therefore it's possible to
> run guests using different page granularity.
>
> So we can't mandate the block backend to support indirect descriptor
> when the frontend is using 64KB page granularity and have to fix it
> properly in the frontend.
>
> The solution exposed below is based on modifying directly the frontend
> guest rather than asking the block framework to support smaller size
> (i.e < PAGE_SIZE). This is because the change is the block framework are
> not trivial as everything seems to relying on a struct *page (see [1]).
> Although, it may be possible that someone succeed to do it in the future
> and we would therefore be able to use it.
>
> Given that a block request may not fit in a single ring request, a
> second request is introduced for the data that cannot fit in the first
> one. This means that the second ring request should never be used on
> Linux if the page size is smaller than 44KB.
>
> To achieve the support of the extra ring request, the block queue size
> is divided by two. Therefore, the ring will always contain enough space
> to accommodate 2 ring requests. While this will reduce the overall
> performance, it will make the implementation more contained. The way
> forward to get better performance is to implement in the backend either
> indirect descriptor or multiple grants ring.
>
> Note that the parameters blk_queue_max_* helpers haven't been updated.
> The block code will set the mimimum size supported and we may be able
> to support directly any change in the block framework that lower down
> the minimal size of a request.
>
> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
>
> Signed-off-by: Julien Grall <[email protected]>

LGTM, only a couple of typos and a simplification:

Acked-by: Roger Pau MonnĂ© <[email protected]>

> ---
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: "Roger Pau MonnĂ©" <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
>
> Changes in v2:
> - Update the commit message
> - Typoes
> - Rename ring_req2/id2 to extra_ring_req/extra_id
> ---
> drivers/block/xen-blkfront.c | 200 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 184 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 5982768..8ea2c97 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -60,6 +60,20 @@
>
> #include <asm/xen/hypervisor.h>
>
> +/*
> + * The minimal size of segment supportd by the block framework is PAGE_SIZE.
^supported
> + * When Linux is using a different page size than xen, it may not be possible
> + * to put all the data in a single segment.
> + * This can happen when the backend doesn't support indirect descriptor and
> + * therefore the maximum amount of data that a request can carry is
> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
> + *
> + * Note that we only support one extra request. So the Linux page size
> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
> + * 88KB.
> + */
> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
> +
> enum blkif_state {
> BLKIF_STATE_DISCONNECTED,
> BLKIF_STATE_CONNECTED,
> @@ -79,6 +93,18 @@ struct blk_shadow {
> struct grant **indirect_grants;
> struct scatterlist *sg;
> unsigned int num_sg;
> + enum {
> + REQ_WAITING,
> + REQ_DONE,
> + REQ_FAIL
> + } status;
> +
> + #define NO_ASSOCIATED_ID ~0UL
> + /*
> + * Id of the sibling if we ever need 2 requests when handling a
> + * block I/O request
> + */
> + unsigned long associated_id;
> };
>
> struct split_bio {
> @@ -467,6 +493,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
>
> id = get_id_from_freelist(info);
> info->shadow[id].request = req;
> + info->shadow[id].status = REQ_WAITING;
> + info->shadow[id].associated_id = NO_ASSOCIATED_ID;
>
> (*ring_req)->u.rw.id = id;
>
> @@ -508,6 +536,9 @@ struct setup_rw_req {
> bool need_copy;
> unsigned int bvec_off;
> char *bvec_data;
> +
> + bool require_extra_req;
> + struct blkif_request *extra_ring_req;
> };
>
> static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
> @@ -521,8 +552,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
> unsigned int grant_idx = setup->grant_idx;
> struct blkif_request *ring_req = setup->ring_req;
> struct blkfront_info *info = setup->info;
> + /*
> + * We always use the shadow of the first request to store the list
> + * of grant associated to the block I/O request. This made the
> + * completion more easy to handle even if the block I/O request is
> + * split.
> + */
> struct blk_shadow *shadow = &info->shadow[setup->id];
>
> + if (unlikely(setup->require_extra_req &&
> + grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
> + /*
> + * We are using the second request, setup grant_idx
> + * to be the index of the segment array
> + */
> + grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + ring_req = setup->extra_ring_req;
> + }
> +
> if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
> (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
> if (setup->segments)
> @@ -537,7 +584,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>
> gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
> ref = gnt_list_entry->gref;
> - shadow->grants_used[grant_idx] = gnt_list_entry;
> + /*
> + * All the grants are stored in the shadow of the first
> + * request. Therefore we have to use the global index
> + */
> + shadow->grants_used[setup->grant_idx] = gnt_list_entry;
>
> if (setup->need_copy) {
> void *shared_data;
> @@ -579,11 +630,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
> (setup->grant_idx)++;
> }
>
> +static void blkif_setup_extra_req(struct blkif_request *first,
> + struct blkif_request *second)
> +{
> + uint16_t nr_segments = first->u.rw.nr_segments;
> +
> + /*
> + * The second request is only present when the first request uses
> + * all its segments. It's always the continuity of the first one.
> + */
> + first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +
> + second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + second->u.rw.sector_number = first->u.rw.sector_number +
> + (BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
> +
> + second->u.rw.handle = first->u.rw.handle;
> + second->operation = first->operation;
> +}
> +
> static int blkif_queue_rw_req(struct request *req)
> {
> struct blkfront_info *info = req->rq_disk->private_data;
> - struct blkif_request *ring_req;
> - unsigned long id;
> + struct blkif_request *ring_req, *extra_ring_req = NULL;
> + unsigned long id, extra_id = NO_ASSOCIATED_ID;
> + bool require_extra_req = false;
> int i;
> struct setup_rw_req setup = {
> .grant_idx = 0,
> @@ -628,19 +699,19 @@ static int blkif_queue_rw_req(struct request *req)
> /* Fill out a communications ring structure. */
> id = blkif_ring_get_request(info, req, &ring_req);
>
> - BUG_ON(info->max_indirect_segments == 0 &&
> - GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
> - BUG_ON(info->max_indirect_segments &&
> - GREFS(req->nr_phys_segments) > info->max_indirect_segments);
> -
> num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
> num_grant = 0;
> /* Calculate the number of grant used */
> for_each_sg(info->shadow[id].sg, sg, num_sg, i)
> num_grant += gnttab_count_grant(sg->offset, sg->length);
>
> + require_extra_req = info->max_indirect_segments == 0 &&
> + num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
> + BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
> +
> info->shadow[id].num_sg = num_sg;
> - if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> + if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
> + likely(!require_extra_req)) {
> /*
> * The indirect operation can only be a BLKIF_OP_READ or
> * BLKIF_OP_WRITE
> @@ -680,10 +751,30 @@ static int blkif_queue_rw_req(struct request *req)
> }
> }
> ring_req->u.rw.nr_segments = num_grant;
> + if (unlikely(require_extra_req)) {
> + extra_id = blkif_ring_get_request(info, req,
> + &extra_ring_req);
> + /*
> + * Only the first request contains the scatter-gather
> + * list.
> + */
> + info->shadow[extra_id].num_sg = 0;
> +
> + blkif_setup_extra_req(ring_req, extra_ring_req);
> +
> + /* Link the 2 requests together */
> + info->shadow[extra_id].associated_id = id;
> + info->shadow[id].associated_id = extra_id;
> + }
> }
>
> setup.ring_req = ring_req;
> setup.id = id;
> +
> + setup.require_extra_req = require_extra_req;
> + if (unlikely(require_extra_req))
> + setup.extra_ring_req = extra_ring_req;
> +
> for_each_sg(info->shadow[id].sg, sg, num_sg, i) {
> BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>
> @@ -706,6 +797,8 @@ static int blkif_queue_rw_req(struct request *req)
>
> /* Keep a private copy so we can reissue requests when recovering. */
> info->shadow[id].req = *ring_req;
> + if (unlikely(require_extra_req))
> + info->shadow[extra_id].req = *extra_ring_req;
>
> if (new_persistent_gnts)
> gnttab_free_grant_references(setup.gref_head);
> @@ -797,7 +890,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
> memset(&info->tag_set, 0, sizeof(info->tag_set));
> info->tag_set.ops = &blkfront_mq_ops;
> info->tag_set.nr_hw_queues = 1;
> - info->tag_set.queue_depth = BLK_RING_SIZE(info);
> + if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) {
> + /*
> + * When indirect descriptior is not supported, the I/O request
^descriptors are
> + * will be split between multiple request in the ring.
> + * To avoid problems when sending the request, divide by
> + * 2 the depth of the queue.
> + */
> + info->tag_set.queue_depth = BLK_RING_SIZE(info) / 2;
> + } else
> + info->tag_set.queue_depth = BLK_RING_SIZE(info);
> info->tag_set.numa_node = NUMA_NO_NODE;
> info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> info->tag_set.cmd_size = 0;
> @@ -1217,19 +1319,67 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset,
> kunmap_atomic(shared_data);
> }
>
> -static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> +static bool blkif_completion(unsigned long *id, struct blkfront_info *info,
> struct blkif_response *bret)
> {
> int i = 0;
> struct scatterlist *sg;
> int num_sg, num_grant;
> + struct blk_shadow *s = &info->shadow[*id];
> struct copy_from_grant data = {
> - .s = s,
> .grant_idx = 0,
> };
>
> num_grant = s->req.operation == BLKIF_OP_INDIRECT ?
> s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
> +
> + /* The I/O request may be split in two */
> + if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
> + struct blk_shadow *s2 = &info->shadow[s->associated_id];
> +
> + /* Keep the status of the current response in shadow */
> + s->status = (bret->status == BLKIF_RSP_OKAY) ?
> + REQ_DONE : REQ_FAIL;
> +
> + /* Wait the second response if not yet here */
> + if (s2->status == REQ_WAITING)
> + return 0;
> +
> + /*
> + * The status of the current response will be used in
> + * order to know if the request has failed.
> + * Update the current response status only if has not
> + * failed.
> + */
> + if (bret->status == BLKIF_RSP_OKAY && s2->status == REQ_FAIL)

This condition could be made simpler by just checking s2->status ==
REQ_FAIL and then setting bret->status unconditionally?

> + bret->status = BLKIF_RSP_ERROR;

Roger.