Subject: Re: [PATCH v3 2/2] block/xen-blkfront: Handle non-indirect grant with
 64KB pages
To: Julien Grall <julien.grall@citrix.com>, <xen-devel@lists.xenproject.org>,
        <linux-kernel@vger.kernel.org>
References: <1447873045-14663-1-git-send-email-julien.grall@citrix.com>
 <1447873045-14663-3-git-send-email-julien.grall@citrix.com>
CC: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
        Boris Ostrovsky <boris.ostrovsky@oracle.com>,
        David Vrabel <david.vrabel@citrix.com>, Bob Liu <bob.liu@oracle.com>
From: =?UTF-8?Q?Roger_Pau_Monn=c3=a9?= <roger.pau@citrix.com>
Message-ID: <565DBD08.5030503@citrix.com>
Date: Tue, 1 Dec 2015 16:30:16 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0)
 Gecko/20100101 Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <1447873045-14663-3-git-send-email-julien.grall@citrix.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 16363
Lines: 469

El 18/11/15 a les 19.57, Julien Grall ha escrit:
> The minimal size of request in the block framework is always PAGE_SIZE.
> It means that when 64KB guest is support, the request will at least be
> 64KB.
> 
> Although, if the backend doesn't support indirect descriptor (such as QDISK
> in QEMU), a ring request is only able to accommodate 11 segments of 4KB
> (i.e 44KB).
> 
> The current frontend is assuming that an I/O request will always fit in
> a ring request. This is not true any more when using 64KB page
> granularity and will therefore crash during boot.
> 
> On ARM64, the ABI is completely neutral to the page granularity used by
> the domU. The guest has the choice between different page granularity
> supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
> This can't be enforced by the hypervisor and therefore it's possible to
> run guests using different page granularity.
> 
> So we can't mandate the block backend to support indirect descriptor
> when the frontend is using 64KB page granularity and have to fix it
> properly in the frontend.
> 
> The solution exposed below is based on modifying directly the frontend
> guest rather than asking the block framework to support smaller size
> (i.e < PAGE_SIZE). This is because the change is the block framework are
> not trivial as everything seems to relying on a struct *page (see [1]).
> Although, it may be possible that someone succeed to do it in the future
> and we would therefore be able to use it.
> 
> Given that a block request may not fit in a single ring request, a
> second request is introduced for the data that cannot fit in the first
> one. This means that the second ring request should never be used on
> Linux if the page size is smaller than 44KB.
> 
> To achieve the support of the extra ring request, the block queue size
> is divided by two. Therefore, the ring will always contain enough space
> to accommodate 2 ring requests. While this will reduce the overall
> performance, it will make the implementation more contained. The way
> forward to get better performance is to implement in the backend either
> indirect descriptor or multiple grants ring.
> 
> Note that the parameters blk_queue_max_* helpers haven't been updated.
> The block code will set the mimimum size supported and we may be able
> to support directly any change in the block framework that lower down
> the minimal size of a request.
> 
> [1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html
> 
> Signed-off-by: Julien Grall <julien.grall@citrix.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Just a couple of cosmetic comments below. Unless anyone else has real
comments I don't think you need to resend.

> ---
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: "Roger Pau Monné" <roger.pau@citrix.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Cc: Bob Liu <bob.liu@oracle.com>
> 
>     Roger, I haven't kept your acked-by because the update of the final
>     status.
> 
>     Changes in v3:
>         - Typoes
>         - Change the way the final status is computed to handle the
>         possibility of the request handled by different backend (could
>         happen during suspend/resume).
> 
>     Changes in v2:
>         - Update the commit message
>         - Typoes
>         - Rename ring_req2/id2 to extra_ring_req/extra_id
> ---
>  drivers/block/xen-blkfront.c | 228 ++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 212 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 2248a47..33f6ff4 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -60,6 +60,20 @@
>  
>  #include <asm/xen/hypervisor.h>
>  
> +/*
> + * The minimal size of segment supported by the block framework is PAGE_SIZE.
> + * When Linux is using a different page size than xen, it may not be possible
> + * to put all the data in a single segment.
> + * This can happen when the backend doesn't support indirect descriptor and
> + * therefore the maximum amount of data that a request can carry is
> + * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE = 44KB
> + *
> + * Note that we only support one extra request. So the Linux page size
> + * should be <= ( 2 * BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) =
> + * 88KB.
> + */
> +#define HAS_EXTRA_REQ (BLKIF_MAX_SEGMENTS_PER_REQUEST < XEN_PFN_PER_PAGE)
> +
>  enum blkif_state {
>  	BLKIF_STATE_DISCONNECTED,
>  	BLKIF_STATE_CONNECTED,
> @@ -72,6 +86,13 @@ struct grant {
>  	struct list_head node;
>  };
>  
> +enum blk_req_status {
> +	REQ_WAITING,
> +	REQ_DONE,
> +	REQ_ERROR,
> +	REQ_EOPNOTSUPP,
> +};
> +
>  struct blk_shadow {
>  	struct blkif_request req;
>  	struct request *request;
> @@ -79,6 +100,14 @@ struct blk_shadow {
>  	struct grant **indirect_grants;
>  	struct scatterlist *sg;
>  	unsigned int num_sg;
> +	enum blk_req_status status;
> +
> +	#define NO_ASSOCIATED_ID ~0UL
> +	/*
> +	 * Id of the sibling if we ever need 2 requests when handling a
> +	 * block I/O request
> +	 */
> +	unsigned long associated_id;
>  };
>  
>  struct split_bio {
> @@ -467,6 +496,8 @@ static unsigned long blkif_ring_get_request(struct blkfront_info *info,
>  
>  	id = get_id_from_freelist(info);
>  	info->shadow[id].request = req;
> +	info->shadow[id].status = REQ_WAITING;
> +	info->shadow[id].associated_id = NO_ASSOCIATED_ID;
>  
>  	(*ring_req)->u.rw.id = id;
>  
> @@ -508,6 +539,9 @@ struct setup_rw_req {
>  	bool need_copy;
>  	unsigned int bvec_off;
>  	char *bvec_data;
> +
> +	bool require_extra_req;
> +	struct blkif_request *extra_ring_req;
>  };
>  
>  static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
> @@ -521,8 +555,24 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>  	unsigned int grant_idx = setup->grant_idx;
>  	struct blkif_request *ring_req = setup->ring_req;
>  	struct blkfront_info *info = setup->info;
> +	/*
> +	 * We always use the shadow of the first request to store the list
> +	 * of grant associated to the block I/O request. This made the
> +	 * completion more easy to handle even if the block I/O request is
> +	 * split.
> +	 */
>  	struct blk_shadow *shadow = &info->shadow[setup->id];
>  
> +	if (unlikely(setup->require_extra_req &&
> +		     grant_idx >= BLKIF_MAX_SEGMENTS_PER_REQUEST)) {
> +		/*
> +		 * We are using the second request, setup grant_idx
> +		 * to be the index of the segment array
> +		 */
> +		grant_idx -= BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +		ring_req = setup->extra_ring_req;
> +	}
> +
>  	if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
>  	    (grant_idx % GRANTS_PER_INDIRECT_FRAME == 0)) {
>  		if (setup->segments)
> @@ -537,7 +587,11 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>  
>  	gnt_list_entry = get_grant(&setup->gref_head, gfn, info);
>  	ref = gnt_list_entry->gref;
> -	shadow->grants_used[grant_idx] = gnt_list_entry;
> +	/*
> +	 * All the grants are stored in the shadow of the first
> +	 * request. Therefore we have to use the global index
> +	 */
> +	shadow->grants_used[setup->grant_idx] = gnt_list_entry;
>  
>  	if (setup->need_copy) {
>  		void *shared_data;
> @@ -579,11 +633,31 @@ static void blkif_setup_rw_req_grant(unsigned long gfn, unsigned int offset,
>  	(setup->grant_idx)++;
>  }
>  
> +static void blkif_setup_extra_req(struct blkif_request *first,
> +				  struct blkif_request *second)
> +{
> +	uint16_t nr_segments = first->u.rw.nr_segments;
> +
> +	/*
> +	 * The second request is only present when the first request uses
> +	 * all its segments. It's always the continuity of the first one.
> +	 */
> +	first->u.rw.nr_segments = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +
> +	second->u.rw.nr_segments = nr_segments - BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	second->u.rw.sector_number = first->u.rw.sector_number +
> +		(BLKIF_MAX_SEGMENTS_PER_REQUEST * XEN_PAGE_SIZE) / 512;
> +
> +	second->u.rw.handle = first->u.rw.handle;
> +	second->operation = first->operation;
> +}
> +
>  static int blkif_queue_rw_req(struct request *req)
>  {
>  	struct blkfront_info *info = req->rq_disk->private_data;
> -	struct blkif_request *ring_req;
> -	unsigned long id;
> +	struct blkif_request *ring_req, *extra_ring_req = NULL;
> +	unsigned long id, extra_id = NO_ASSOCIATED_ID;
> +	bool require_extra_req = false;
>  	int i;
>  	struct setup_rw_req setup = {
>  		.grant_idx = 0,
> @@ -628,19 +702,19 @@ static int blkif_queue_rw_req(struct request *req)
>  	/* Fill out a communications ring structure. */
>  	id = blkif_ring_get_request(info, req, &ring_req);
>  
> -	BUG_ON(info->max_indirect_segments == 0 &&
> -	       GREFS(req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
> -	BUG_ON(info->max_indirect_segments &&
> -	       GREFS(req->nr_phys_segments) > info->max_indirect_segments);
> -
>  	num_sg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
>  	num_grant = 0;
>  	/* Calculate the number of grant used */
>  	for_each_sg(info->shadow[id].sg, sg, num_sg, i)
>  	       num_grant += gnttab_count_grant(sg->offset, sg->length);
>  
> +	require_extra_req = info->max_indirect_segments == 0 &&
> +		num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	BUG_ON(!HAS_EXTRA_REQ && require_extra_req);
> +
>  	info->shadow[id].num_sg = num_sg;
> -	if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> +	if (num_grant > BLKIF_MAX_SEGMENTS_PER_REQUEST &&
> +	    likely(!require_extra_req)) {
>  		/*
>  		 * The indirect operation can only be a BLKIF_OP_READ or
>  		 * BLKIF_OP_WRITE
> @@ -680,10 +754,30 @@ static int blkif_queue_rw_req(struct request *req)
>  			}
>  		}
>  		ring_req->u.rw.nr_segments = num_grant;
> +		if (unlikely(require_extra_req)) {
> +			extra_id = blkif_ring_get_request(info, req,
> +							  &extra_ring_req);
> +			/*
> +			 * Only the first request contains the scatter-gather
> +			 * list.
> +			 */
> +			info->shadow[extra_id].num_sg = 0;
> +
> +			blkif_setup_extra_req(ring_req, extra_ring_req);
> +
> +			/* Link the 2 requests together */
> +			info->shadow[extra_id].associated_id = id;
> +			info->shadow[id].associated_id = extra_id;
> +		}
>  	}
>  
>  	setup.ring_req = ring_req;
>  	setup.id = id;
> +
> +	setup.require_extra_req = require_extra_req;
> +	if (unlikely(require_extra_req))
> +		setup.extra_ring_req = extra_ring_req;
> +
>  	for_each_sg(info->shadow[id].sg, sg, num_sg, i) {
>  		BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>  
> @@ -706,6 +800,8 @@ static int blkif_queue_rw_req(struct request *req)
>  
>  	/* Keep a private copy so we can reissue requests when recovering. */
>  	info->shadow[id].req = *ring_req;
> +	if (unlikely(require_extra_req))
> +		info->shadow[extra_id].req = *extra_ring_req;
>  
>  	if (new_persistent_gnts)
>  		gnttab_free_grant_references(setup.gref_head);
> @@ -797,7 +893,16 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
>  	memset(&info->tag_set, 0, sizeof(info->tag_set));
>  	info->tag_set.ops = &blkfront_mq_ops;
>  	info->tag_set.nr_hw_queues = 1;
> -	info->tag_set.queue_depth =  BLK_RING_SIZE(info);
> +	if (HAS_EXTRA_REQ && info->max_indirect_segments == 0) {
> +		/*
> +		 * When indirect descriptior is not supported, the I/O request
                                 ^ descriptor
> +		 * will be split between multiple request in the ring.
> +		 * To avoid problems when sending the request, divide by
> +		 * 2 the depth of the queue.
> +		 */
> +		info->tag_set.queue_depth =  BLK_RING_SIZE(info) / 2;
> +	} else
> +		info->tag_set.queue_depth = BLK_RING_SIZE(info);
>  	info->tag_set.numa_node = NUMA_NO_NODE;
>  	info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
>  	info->tag_set.cmd_size = 0;
> @@ -1217,19 +1322,92 @@ static void blkif_copy_from_grant(unsigned long gfn, unsigned int offset,
>  	kunmap_atomic(shared_data);
>  }
>  
> -static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> +static enum blk_req_status blkif_rsp_to_req_status(int rsp)
> +{
> +	switch (rsp)
> +	{
> +	case BLKIF_RSP_OKAY:
> +		return REQ_DONE;
> +	case BLKIF_RSP_ERROR:
> +		return REQ_ERROR;

This could be joined with the default case with a fall through:

case BLKIF_RSP_ERROR:
default:
	return REQ_ERROR;

> +	case BLKIF_RSP_EOPNOTSUPP:
> +		return REQ_EOPNOTSUPP;
> +	default:
> +		return REQ_ERROR;
> +	}
> +}
> +
> +/*
> + * Get the final status of the block request based on two ring response
> + */
> +static int blkif_get_final_status(enum blk_req_status s1,
> +				  enum blk_req_status s2)
> +{
> +	BUG_ON(s1 == REQ_WAITING);
> +	BUG_ON(s2 == REQ_WAITING);
> +
> +	if (s1 == REQ_ERROR || s2 == REQ_ERROR)
> +		return BLKIF_RSP_ERROR;
> +	else if (s1 == REQ_EOPNOTSUPP || s2 == REQ_EOPNOTSUPP)
> +		return BLKIF_RSP_EOPNOTSUPP;
> +	else

Unneeded else?

> +		return BLKIF_RSP_OKAY;
> +}
> +
> +static bool blkif_completion(unsigned long *id, struct blkfront_info *info,
>  			     struct blkif_response *bret)
>  {
>  	int i = 0;
>  	struct scatterlist *sg;
>  	int num_sg, num_grant;
> +	struct blk_shadow *s = &info->shadow[*id];
>  	struct copy_from_grant data = {
> -		.s = s,
>  		.grant_idx = 0,
>  	};
>  
>  	num_grant = s->req.operation == BLKIF_OP_INDIRECT ?
>  		s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
> +
> +	/* The I/O request may be split in two */
> +	if (unlikely(s->associated_id != NO_ASSOCIATED_ID)) {
> +		struct blk_shadow *s2 = &info->shadow[s->associated_id];
> +
> +		/* Keep the status of the current response in shadow */
> +		s->status = blkif_rsp_to_req_status(bret->status);
> +
> +		/* Wait the second response if not yet here */
> +		if (s2->status == REQ_WAITING)
> +			return 0;
> +
> +		bret->status = blkif_get_final_status(s->status,
> +						      s2->status);
> +
> +		/*
> +		 * All the grants is stored in the first shadow in order
> +		 * to make the completion code simpler.
> +		 */
> +		num_grant += s2->req.u.rw.nr_segments;
> +
> +		/*
> +		 * The two responses may not come in order. Only the
> +		 * first request will store the scatter-gather list
> +		 */
> +		if (s2->num_sg != 0) {
> +			/* Update "id" with the ID of the first response */
> +			*id = s->associated_id;
> +			s = s2;
> +		}
> +
> +		/*
> +		 * We don't need anymore the second request, so recycling
> +		 * it now.
> +		 */
> +		if (add_id_to_freelist(info, s->associated_id))
> +			WARN(1, "%s: can't recycle the second part (id = %ld) of the request\n",
> +			     info->gd->disk_name, s->associated_id);
> +	}
> +
> +	data.s = s;
>  	num_sg = s->num_sg;
>  
>  	if (bret->operation == BLKIF_OP_READ && info->feature_persistent) {
> @@ -1299,6 +1477,8 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
>  			}
>  		}
>  	}
> +
> +	return 1;
>  }
>  
>  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> @@ -1340,8 +1520,14 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  		}
>  		req  = info->shadow[id].request;
>  
> -		if (bret->operation != BLKIF_OP_DISCARD)
> -			blkif_completion(&info->shadow[id], info, bret);
> +		if (bret->operation != BLKIF_OP_DISCARD) {
> +			/*
> +			 * We may need to wait for an extra response if the
> +			 * I/O request is split in 2
> +			 */
> +			if (!blkif_completion(&id, info, bret))
> +				continue;
> +		}
>  
>  		if (add_id_to_freelist(info, id)) {
>  			WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> @@ -1864,8 +2050,18 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
>  	unsigned int psegs, grants;
>  	int err, i;
>  
> -	if (info->max_indirect_segments == 0)
> -		grants = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +	if (info->max_indirect_segments == 0) {
> +		if (!HAS_EXTRA_REQ)
> +			grants = BLKIF_MAX_SEGMENTS_PER_REQUEST;
> +		else {
> +			/*
> +			 * When an extra req is required, the maximum
> +			 * grants supported is related to the size of the
> +			 * Linux block segment
> +			 */
> +			grants = GRANTS_PER_PSEG;
> +		}
> +	}
>  	else
>  		grants = info->max_indirect_segments;
>  	psegs = grants / GRANTS_PER_PSEG;
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/