2015-07-09 20:43:43

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 00/20] xen/arm64: Add support for 64KB page

Hi all,

ARM64 Linux supports both 4KB and 64KB page granularity. However, the Xen
hypercall interface and PV protocol are always based on 4KB page granularity.

Any attempt to boot a Linux guest with 64KB pages enabled will result in a
guest crash.

This series is a first attempt to allow such a Linux guest to run on top of
the current hypercall interface and PV protocol.

This approach has been chosen because we want to run Linux with 64KB pages on
released Xen ARM versions and/or platforms using an old version of Linux DOM0.

There is room for improvement, such as support for 64KB grants or modifying
the PV protocols to support different page sizes. These will be explored in a
separate patch series later.

For this new version, new helpers have been added in order to split a Linux
page into multiple grants. This requires the caller to provide a callback.

So far I've only done a quick network performance test using iperf. The
server lives in DOM0 and the client in the guest.

Average over 10 iperf runs:

DOM0     Guest     Result

4KB-mod  64KB      3.176 Gbits/sec
4KB-mod  4KB-mod   3.245 Gbits/sec
4KB-mod  4KB       3.258 Gbits/sec
4KB      4KB       3.292 Gbits/sec
4KB      4KB-mod   3.265 Gbits/sec
4KB      64KB      3.189 Gbits/sec

4KB-mod: Linux with the 64KB patch series
4KB: linux/master

The network performance is slightly worse with this series (-0.15%). I suspect
this is because of the indirection used to set up the grants. This indirection
is necessary in order to ensure that the grants are correctly sized regardless
of the Linux page granularity. It could also be reused later in order to
support bigger grants.

TODO list:
- swiotlb not yet converted to 64KB pages
- it may be possible to move some common defines shared between
netback/netfront and blkfront/blkback into a header.

Note that the patches have only been build-tested on x86. They may also not
compile one by one; I will take a look at this for the next version.

A branch based on the latest linux/master can be found here:

git://xenbits.xen.org/people/julieng/linux-arm.git branch xen-64k-v2

Comments and suggestions are welcome.

Sincerely yours,

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Julien Grall (20):
xen: Add Xen specific page definition
xen: Introduce a function to split a Linux page into Xen page
xen/grant: Introduce helpers to split a page into grant
xen/grant: Add helper gnttab_page_grant_foreign_access_ref
block/xen-blkfront: Split blkif_queue_request in 2
block/xen-blkfront: Store a page rather a pfn in the grant structure
block/xen-blkfront: split get_grant in 2
net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop
xen/biomerge: Don't allow biovec to be merge when Linux is not using
4KB page
xen/xenbus: Use Xen page definition
tty/hvc: xen: Use xen page definition
xen/balloon: Don't rely on the page granularity is the same for Xen
and Linux
xen/events: fifo: Make it running on 64KB granularity
xen/grant-table: Make it running on 64KB granularity
block/xen-blkfront: Make it running on 64KB page granularity
block/xen-blkback: Make it running on 64KB page granularity
net/xen-netfront: Make it running on 64KB page granularity
net/xen-netback: Make it running on 64KB page granularity
xen/privcmd: Add support for Linux 64KB page granularity
arm/xen: Add support for 64KB page granularity

arch/arm/include/asm/xen/page.h | 12 +-
arch/arm/xen/enlighten.c | 6 +-
arch/arm/xen/p2m.c | 6 +-
arch/x86/include/asm/xen/page.h | 2 +-
drivers/block/xen-blkback/blkback.c | 5 +-
drivers/block/xen-blkback/common.h | 16 +-
drivers/block/xen-blkback/xenbus.c | 9 +-
drivers/block/xen-blkfront.c | 536 +++++++++++++++++++++++-------------
drivers/net/xen-netback/common.h | 15 +-
drivers/net/xen-netback/netback.c | 148 ++++++----
drivers/net/xen-netfront.c | 121 +++++---
drivers/tty/hvc/hvc_xen.c | 6 +-
drivers/xen/balloon.c | 147 +++++++---
drivers/xen/biomerge.c | 7 +
drivers/xen/events/events_base.c | 2 +-
drivers/xen/events/events_fifo.c | 2 +-
drivers/xen/grant-table.c | 32 ++-
drivers/xen/privcmd.c | 8 +-
drivers/xen/xenbus/xenbus_client.c | 6 +-
drivers/xen/xenbus/xenbus_probe.c | 4 +-
drivers/xen/xlate_mmu.c | 127 ++++++---
include/xen/grant_table.h | 50 ++++
include/xen/page.h | 41 ++-
23 files changed, 896 insertions(+), 412 deletions(-)

--
2.1.4


2015-07-09 20:45:19

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 01/20] xen: Add Xen specific page definition

The Xen hypercall interface always uses 4K page granularity on both the ARM
and x86 architectures.

With the incoming support of 64K page granularity for ARM64 guests, it
won't be possible to re-use the Linux page definitions in Xen drivers.

Introduce Xen page definition helpers based on the Linux page
definitions. They have exactly the same names, but with a XEN_/xen_
prefix.

Also modify page_to_mfn to use the new Xen page definitions.
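
For illustration, a minimal sketch of how the new helpers behave when Linux
uses 64KB pages (PAGE_SHIFT == 16); the helper name xen_pfn_of_byte is made
up for this example and is not part of the patch:

/*
 * With 64KB Linux pages:
 *   XEN_PFN_PER_PAGE   == 16  (one Linux page spans 16 Xen frames)
 *   xen_page_to_pfn(p) == page_to_pfn(p) * 16
 */
static inline unsigned long xen_pfn_of_byte(struct page *page,
					    unsigned int offset)
{
	/* 4KB Xen frame containing byte 'offset' of the Linux page */
	return xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT);
}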

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
I'm wondering if we should drop page_to_mfn, as the macro will likely be
misused when Linux is using 64KB page granularity.

Changes in v2:
- Add XEN_PFN_UP
- Add a comment describing the behavior of page_to_pfn
---
include/xen/page.h | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/xen/page.h b/include/xen/page.h
index c5ed20b..8ebd37b 100644
--- a/include/xen/page.h
+++ b/include/xen/page.h
@@ -1,11 +1,30 @@
#ifndef _XEN_PAGE_H
#define _XEN_PAGE_H

+#include <asm/page.h>
+
+/* The hypercall interface supports only 4KB page */
+#define XEN_PAGE_SHIFT 12
+#define XEN_PAGE_SIZE (_AC(1,UL) << XEN_PAGE_SHIFT)
+#define XEN_PAGE_MASK (~(XEN_PAGE_SIZE-1))
+#define xen_offset_in_page(p) ((unsigned long)(p) & ~XEN_PAGE_MASK)
+#define xen_pfn_to_page(pfn) \
+ ((pfn_to_page(((unsigned long)(pfn) << XEN_PAGE_SHIFT) >> PAGE_SHIFT)))
+#define xen_page_to_pfn(page) \
+ (((page_to_pfn(page)) << PAGE_SHIFT) >> XEN_PAGE_SHIFT)
+
+#define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
+
+#define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
+#define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
+#define XEN_PFN_PHYS(x) ((phys_addr_t)(x) << XEN_PAGE_SHIFT)
+
#include <asm/xen/page.h>

+/* Return the MFN associated to the first 4KB of the page */
static inline unsigned long page_to_mfn(struct page *page)
{
- return pfn_to_mfn(page_to_pfn(page));
+ return pfn_to_mfn(xen_page_to_pfn(page));
}

struct xen_memory_region {
--
2.1.4

2015-07-09 20:48:51

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

The Xen interface always uses 4KB pages. This means that a Linux page
may be split across multiple Xen pages when the page granularity is not
the same.

The new helper will break down a Linux page into 4KB chunks and call the
provided function on each of them.
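
As an illustration of how a caller is expected to use it (sketch only; the
pfn_collector structure and collect_xen_pfn callback are made-up names, not
part of the series):

struct pfn_collector {
	xen_pfn_t pfns[XEN_PFN_PER_PAGE];
	unsigned int nr;
};

/* Callback matching xen_pfn_fn_t: record each 4KB Xen pfn of the page */
static int collect_xen_pfn(struct page *page, unsigned long pfn, void *data)
{
	struct pfn_collector *c = data;

	c->pfns[c->nr++] = pfn;
	return 0;	/* a non-zero return stops xen_apply_to_page early */
}

/* Usage: xen_apply_to_page(page, collect_xen_pfn, &collector); */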

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Patch added
---
include/xen/page.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/include/xen/page.h b/include/xen/page.h
index 8ebd37b..b1f7722 100644
--- a/include/xen/page.h
+++ b/include/xen/page.h
@@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];

extern unsigned long xen_released_pages;

+typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
+
+/* Break down the page in 4KB granularity and call fn foreach xen pfn */
+static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
+ void *data)
+{
+ unsigned long pfn = xen_page_to_pfn(page);
+ int i;
+ int ret;
+
+ for (i = 0; i < XEN_PFN_PER_PAGE; i++, pfn++) {
+ ret = fn(page, pfn, data);
+ if (ret)
+ return ret;
+ }
+
+ return ret;
+}
+
+
#endif /* _XEN_PAGE_H */
--
2.1.4

2015-07-09 20:44:12

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 03/20] xen/grant: Introduce helpers to split a page into grant

Currently, a grant is always based on the Xen page granularity (i.e.
4KB). When Linux is using a different page granularity, a single page
will be split across multiple grants.

The new helpers will be in charge of splitting the Linux page into grants
and calling a function given by the caller on each grant.

In order to help some PV drivers, the callback is allowed to use less
data; in that case it must update the resulting length. This is useful for
netback.

Also provide a helper to count the number of grants within a given
contiguous region.
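
To make the intended calling convention concrete, here is a sketch (the
grant_setup structure and setup_one_grant callback are illustrative names,
not code from this patch):

struct grant_setup {
	grant_ref_t *refs;	/* gnttab_count_grant(offset, len) entries */
	unsigned int next;
	domid_t otherend;
};

static void setup_one_grant(unsigned long mfn, unsigned int offset,
			    unsigned int *len, void *data)
{
	struct grant_setup *setup = data;

	/* *len may be reduced here if less data is put in this grant */
	gnttab_grant_foreign_access_ref(setup->refs[setup->next++],
					setup->otherend, mfn, 0);
}

/*
 * nr = gnttab_count_grant(offset, len);
 * ...allocate nr grant references into setup.refs...
 * gnttab_foreach_grant(page, offset, len, setup_one_grant, &setup);
 */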

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Patch added
---
drivers/xen/grant-table.c | 26 ++++++++++++++++++++++++++
include/xen/grant_table.h | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 67 insertions(+)

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 62f591f..3679293 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -296,6 +296,32 @@ int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly)
}
EXPORT_SYMBOL_GPL(gnttab_end_foreign_access_ref);

+void gnttab_foreach_grant(struct page *page, unsigned int offset,
+ unsigned int len, xen_grant_fn_t fn,
+ void *data)
+{
+ unsigned int goffset;
+ unsigned int glen;
+ unsigned long pfn;
+
+ len = min_t(unsigned int, PAGE_SIZE - offset, len);
+ goffset = offset & ~XEN_PAGE_MASK;
+
+ pfn = xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT);
+
+ while (len) {
+ glen = min_t(unsigned int, XEN_PAGE_SIZE - goffset, len);
+ fn(pfn_to_mfn(pfn), goffset, &glen, data);
+
+ goffset += glen;
+ if (goffset == XEN_PAGE_SIZE) {
+ goffset = 0;
+ pfn++;
+ }
+ len -= glen;
+ }
+}
+
struct deferred_entry {
struct list_head list;
grant_ref_t ref;
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 4478f4b..6f77378 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -45,8 +45,10 @@
#include <asm/xen/hypervisor.h>

#include <xen/features.h>
+#include <xen/page.h>
#include <linux/mm_types.h>
#include <linux/page-flags.h>
+#include <linux/kernel.h>

#define GNTTAB_RESERVED_XENSTORE 1

@@ -224,4 +226,43 @@ static inline struct xen_page_foreign *xen_page_foreign(struct page *page)
#endif
}

+/* Split Linux page in chunk of the size of the grant and call fn
+ *
+ * Parameters of fn:
+ * mfn: machine frame number based on grant granularity
+ * offset: offset in the grant
+ * len: length of the data in the grant. If fn decides to use less data,
+ * it must update len.
+ * data: internal information
+ */
+typedef void (*xen_grant_fn_t)(unsigned long mfn, unsigned int offset,
+ unsigned int *len, void *data);
+
+void gnttab_foreach_grant(struct page *page, unsigned int offset,
+ unsigned int len, xen_grant_fn_t fn,
+ void *data);
+
+/* Helper to get to call fn only on the first "grant chunk" */
+static inline void gnttab_one_grant(struct page *page, unsigned int offset,
+ unsigned len, xen_grant_fn_t fn,
+ void *data)
+{
+ /* The first request is limited to the size of one grant */
+ len = min_t(unsigned int, XEN_PAGE_SIZE - (offset & ~XEN_PAGE_MASK),
+ len);
+
+ gnttab_foreach_grant(page, offset, len, fn, data);
+}
+
+/* Get the number of grant in a specified region
+ *
+ * offset: Offset in the first page
+ * len: total length of data (can cross multiple pages)
+ */
+static inline unsigned int gnttab_count_grant(unsigned int offset,
+ unsigned int len)
+{
+ return (XEN_PFN_UP((offset & ~XEN_PAGE_MASK) + len));
+}
+
#endif /* __ASM_GNTTAB_H__ */
--
2.1.4

2015-07-09 20:43:44

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 04/20] xen/grant: Add helper gnttab_page_grant_foreign_access_ref

Many PV drivers contain the idiom:

pfn = page_to_mfn(...) /* Or similar */
gnttab_grant_foreign_access_ref

Replace it with a new helper. Note that when Linux is using a different
page granularity than Xen, the helper only gives access to the first 4KB
grant.

This is useful for drivers that allocate a full Linux page for each
grant.
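
Concretely, the conversion in a frontend looks like this (sketch; the
variable names are illustrative):

	/* Before: the open-coded idiom */
	mfn = page_to_mfn(page);
	gnttab_grant_foreign_access_ref(ref, otherend_id, mfn, readonly);

	/* After: grants only the first 4KB of the Linux page */
	gnttab_page_grant_foreign_access_ref(ref, otherend_id, page, readonly);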

Also include xen/interface/grant_table.h rather than xen/grant_table.h in
asm/page.h for x86 to fix a compilation issue [1]. Only the former is
needed in order to get the structure definitions.

[1] Interdependency between asm/page.h and xen/grant_table.h, which results
in page_to_mfn not being defined when necessary.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Patch added
---
arch/x86/include/asm/xen/page.h | 2 +-
include/xen/grant_table.h | 9 +++++++++
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index c44a5d5..fb2e037 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -12,7 +12,7 @@
#include <asm/pgtable.h>

#include <xen/interface/xen.h>
-#include <xen/grant_table.h>
+#include <xen/interface/grant_table.h>
#include <xen/features.h>

/* Xen machine address */
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 6f77378..6a1ef86 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -131,6 +131,15 @@ void gnttab_cancel_free_callback(struct gnttab_free_callback *callback);
void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
unsigned long frame, int readonly);

+/* Give access to the first 4K of the page */
+static inline void gnttab_page_grant_foreign_access_ref(
+ grant_ref_t ref, domid_t domid,
+ struct page *page, int readonly)
+{
+ gnttab_grant_foreign_access_ref(ref, domid, page_to_mfn(page),
+ readonly);
+}
+
void gnttab_grant_foreign_transfer_ref(grant_ref_t, domid_t domid,
unsigned long pfn);

--
2.1.4

2015-07-09 20:44:18

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 05/20] block/xen-blkfront: Split blkif_queue_request in 2

Currently, blkif_queue_request has 2 distinct execution paths:
- Send a discard request
- Send a read/write request

The function also allocates grants to use for generating the request,
although these are only used for read/write requests.

Rather than having one function with 2 distinct execution paths, split it
in two. This will also remove one level of indentation.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Roger Pau Monné <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Patch added
---
drivers/block/xen-blkfront.c | 280 +++++++++++++++++++++++--------------------
1 file changed, 153 insertions(+), 127 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 6d89ed3..7107d58 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -392,13 +392,35 @@ static int blkif_ioctl(struct block_device *bdev, fmode_t mode,
return 0;
}

-/*
- * Generate a Xen blkfront IO request from a blk layer request. Reads
- * and writes are handled as expected.
- *
- * @req: a request struct
- */
-static int blkif_queue_request(struct request *req)
+static int blkif_queue_discard_req(struct request *req)
+{
+ struct blkfront_info *info = req->rq_disk->private_data;
+ struct blkif_request *ring_req;
+ unsigned long id;
+
+ /* Fill out a communications ring structure. */
+ ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
+ id = get_id_from_freelist(info);
+ info->shadow[id].request = req;
+
+ ring_req->operation = BLKIF_OP_DISCARD;
+ ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
+ ring_req->u.discard.id = id;
+ ring_req->u.discard.sector_number = (blkif_sector_t)blk_rq_pos(req);
+ if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard)
+ ring_req->u.discard.flag = BLKIF_DISCARD_SECURE;
+ else
+ ring_req->u.discard.flag = 0;
+
+ info->ring.req_prod_pvt++;
+
+ /* Keep a private copy so we can reissue requests when recovering. */
+ info->shadow[id].req = *ring_req;
+
+ return 0;
+}
+
+static int blkif_queue_rw_req(struct request *req)
{
struct blkfront_info *info = req->rq_disk->private_data;
struct blkif_request *ring_req;
@@ -418,9 +440,6 @@ static int blkif_queue_request(struct request *req)
struct scatterlist *sg;
int nseg, max_grefs;

- if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
- return 1;
-
max_grefs = req->nr_phys_segments;
if (max_grefs > BLKIF_MAX_SEGMENTS_PER_REQUEST)
/*
@@ -450,139 +469,128 @@ static int blkif_queue_request(struct request *req)
id = get_id_from_freelist(info);
info->shadow[id].request = req;

- if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE))) {
- ring_req->operation = BLKIF_OP_DISCARD;
- ring_req->u.discard.nr_sectors = blk_rq_sectors(req);
- ring_req->u.discard.id = id;
- ring_req->u.discard.sector_number = (blkif_sector_t)blk_rq_pos(req);
- if ((req->cmd_flags & REQ_SECURE) && info->feature_secdiscard)
- ring_req->u.discard.flag = BLKIF_DISCARD_SECURE;
- else
- ring_req->u.discard.flag = 0;
+ BUG_ON(info->max_indirect_segments == 0 &&
+ req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ BUG_ON(info->max_indirect_segments &&
+ req->nr_phys_segments > info->max_indirect_segments);
+ nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
+ ring_req->u.rw.id = id;
+ if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ /*
+ * The indirect operation can only be a BLKIF_OP_READ or
+ * BLKIF_OP_WRITE
+ */
+ BUG_ON(req->cmd_flags & (REQ_FLUSH | REQ_FUA));
+ ring_req->operation = BLKIF_OP_INDIRECT;
+ ring_req->u.indirect.indirect_op = rq_data_dir(req) ?
+ BLKIF_OP_WRITE : BLKIF_OP_READ;
+ ring_req->u.indirect.sector_number = (blkif_sector_t)blk_rq_pos(req);
+ ring_req->u.indirect.handle = info->handle;
+ ring_req->u.indirect.nr_segments = nseg;
} else {
- BUG_ON(info->max_indirect_segments == 0 &&
- req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
- BUG_ON(info->max_indirect_segments &&
- req->nr_phys_segments > info->max_indirect_segments);
- nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
- ring_req->u.rw.id = id;
- if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
+ ring_req->u.rw.handle = info->handle;
+ ring_req->operation = rq_data_dir(req) ?
+ BLKIF_OP_WRITE : BLKIF_OP_READ;
+ if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
/*
- * The indirect operation can only be a BLKIF_OP_READ or
- * BLKIF_OP_WRITE
+ * Ideally we can do an unordered flush-to-disk. In case the
+ * backend onlysupports barriers, use that. A barrier request
+ * a superset of FUA, so we can implement it the same
+ * way. (It's also a FLUSH+FUA, since it is
+ * guaranteed ordered WRT previous writes.)
*/
- BUG_ON(req->cmd_flags & (REQ_FLUSH | REQ_FUA));
- ring_req->operation = BLKIF_OP_INDIRECT;
- ring_req->u.indirect.indirect_op = rq_data_dir(req) ?
- BLKIF_OP_WRITE : BLKIF_OP_READ;
- ring_req->u.indirect.sector_number = (blkif_sector_t)blk_rq_pos(req);
- ring_req->u.indirect.handle = info->handle;
- ring_req->u.indirect.nr_segments = nseg;
- } else {
- ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
- ring_req->u.rw.handle = info->handle;
- ring_req->operation = rq_data_dir(req) ?
- BLKIF_OP_WRITE : BLKIF_OP_READ;
- if (req->cmd_flags & (REQ_FLUSH | REQ_FUA)) {
- /*
- * Ideally we can do an unordered flush-to-disk. In case the
- * backend onlysupports barriers, use that. A barrier request
- * a superset of FUA, so we can implement it the same
- * way. (It's also a FLUSH+FUA, since it is
- * guaranteed ordered WRT previous writes.)
- */
- switch (info->feature_flush &
- ((REQ_FLUSH|REQ_FUA))) {
- case REQ_FLUSH|REQ_FUA:
- ring_req->operation =
- BLKIF_OP_WRITE_BARRIER;
- break;
- case REQ_FLUSH:
- ring_req->operation =
- BLKIF_OP_FLUSH_DISKCACHE;
- break;
- default:
- ring_req->operation = 0;
- }
+ switch (info->feature_flush &
+ ((REQ_FLUSH|REQ_FUA))) {
+ case REQ_FLUSH|REQ_FUA:
+ ring_req->operation =
+ BLKIF_OP_WRITE_BARRIER;
+ break;
+ case REQ_FLUSH:
+ ring_req->operation =
+ BLKIF_OP_FLUSH_DISKCACHE;
+ break;
+ default:
+ ring_req->operation = 0;
}
- ring_req->u.rw.nr_segments = nseg;
}
- for_each_sg(info->shadow[id].sg, sg, nseg, i) {
- fsect = sg->offset >> 9;
- lsect = fsect + (sg->length >> 9) - 1;
-
- if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
- (i % SEGS_PER_INDIRECT_FRAME == 0)) {
- unsigned long uninitialized_var(pfn);
-
- if (segments)
- kunmap_atomic(segments);
-
- n = i / SEGS_PER_INDIRECT_FRAME;
- if (!info->feature_persistent) {
- struct page *indirect_page;
-
- /* Fetch a pre-allocated page to use for indirect grefs */
- BUG_ON(list_empty(&info->indirect_pages));
- indirect_page = list_first_entry(&info->indirect_pages,
- struct page, lru);
- list_del(&indirect_page->lru);
- pfn = page_to_pfn(indirect_page);
- }
- gnt_list_entry = get_grant(&gref_head, pfn, info);
- info->shadow[id].indirect_grants[n] = gnt_list_entry;
- segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
- ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
+ ring_req->u.rw.nr_segments = nseg;
+ }
+ for_each_sg(info->shadow[id].sg, sg, nseg, i) {
+ fsect = sg->offset >> 9;
+ lsect = fsect + (sg->length >> 9) - 1;
+
+ if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
+ (i % SEGS_PER_INDIRECT_FRAME == 0)) {
+ unsigned long uninitialized_var(pfn);
+
+ if (segments)
+ kunmap_atomic(segments);
+
+ n = i / SEGS_PER_INDIRECT_FRAME;
+ if (!info->feature_persistent) {
+ struct page *indirect_page;
+
+ /* Fetch a pre-allocated page to use for indirect grefs */
+ BUG_ON(list_empty(&info->indirect_pages));
+ indirect_page = list_first_entry(&info->indirect_pages,
+ struct page, lru);
+ list_del(&indirect_page->lru);
+ pfn = page_to_pfn(indirect_page);
}
+ gnt_list_entry = get_grant(&gref_head, pfn, info);
+ info->shadow[id].indirect_grants[n] = gnt_list_entry;
+ segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
+ ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
+ }

- gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
- ref = gnt_list_entry->gref;
+ gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
+ ref = gnt_list_entry->gref;

- info->shadow[id].grants_used[i] = gnt_list_entry;
+ info->shadow[id].grants_used[i] = gnt_list_entry;

- if (rq_data_dir(req) && info->feature_persistent) {
- char *bvec_data;
- void *shared_data;
+ if (rq_data_dir(req) && info->feature_persistent) {
+ char *bvec_data;
+ void *shared_data;

- BUG_ON(sg->offset + sg->length > PAGE_SIZE);
+ BUG_ON(sg->offset + sg->length > PAGE_SIZE);

- shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
- bvec_data = kmap_atomic(sg_page(sg));
+ shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
+ bvec_data = kmap_atomic(sg_page(sg));

- /*
- * this does not wipe data stored outside the
- * range sg->offset..sg->offset+sg->length.
- * Therefore, blkback *could* see data from
- * previous requests. This is OK as long as
- * persistent grants are shared with just one
- * domain. It may need refactoring if this
- * changes
- */
- memcpy(shared_data + sg->offset,
- bvec_data + sg->offset,
- sg->length);
+ /*
+ * this does not wipe data stored outside the
+ * range sg->offset..sg->offset+sg->length.
+ * Therefore, blkback *could* see data from
+ * previous requests. This is OK as long as
+ * persistent grants are shared with just one
+ * domain. It may need refactoring if this
+ * changes
+ */
+ memcpy(shared_data + sg->offset,
+ bvec_data + sg->offset,
+ sg->length);

- kunmap_atomic(bvec_data);
- kunmap_atomic(shared_data);
- }
- if (ring_req->operation != BLKIF_OP_INDIRECT) {
- ring_req->u.rw.seg[i] =
- (struct blkif_request_segment) {
- .gref = ref,
- .first_sect = fsect,
- .last_sect = lsect };
- } else {
- n = i % SEGS_PER_INDIRECT_FRAME;
- segments[n] =
+ kunmap_atomic(bvec_data);
+ kunmap_atomic(shared_data);
+ }
+ if (ring_req->operation != BLKIF_OP_INDIRECT) {
+ ring_req->u.rw.seg[i] =
(struct blkif_request_segment) {
- .gref = ref,
- .first_sect = fsect,
- .last_sect = lsect };
- }
+ .gref = ref,
+ .first_sect = fsect,
+ .last_sect = lsect };
+ } else {
+ n = i % SEGS_PER_INDIRECT_FRAME;
+ segments[n] =
+ (struct blkif_request_segment) {
+ .gref = ref,
+ .first_sect = fsect,
+ .last_sect = lsect };
}
- if (segments)
- kunmap_atomic(segments);
}
+ if (segments)
+ kunmap_atomic(segments);

info->ring.req_prod_pvt++;

@@ -595,6 +603,24 @@ static int blkif_queue_request(struct request *req)
return 0;
}

+/*
+ * Generate a Xen blkfront IO request from a blk layer request. Reads
+ * and writes are handled as expected.
+ *
+ * @req: a request struct
+ */
+static int blkif_queue_request(struct request *req)
+{
+ struct blkfront_info *info = req->rq_disk->private_data;
+
+ if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
+ return 1;
+
+ if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE)))
+ return blkif_queue_discard_req(req);
+ else
+ return blkif_queue_rw_req(req);
+}

static inline void flush_requests(struct blkfront_info *info)
{
--
2.1.4

2015-07-09 20:43:42

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 06/20] block/xen-blkfront: Store a page rather a pfn in the grant structure

All usages of the pfn field follow the same idiom:

pfn_to_page(grant->pfn)

This always returns the same page. Store the page directly in the grant
structure to clean up the code.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Roger Pau Monné <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Patch added
---
drivers/block/xen-blkfront.c | 37 ++++++++++++++++++-------------------
1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 7107d58..7b81d23 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -67,7 +67,7 @@ enum blkif_state {

struct grant {
grant_ref_t gref;
- unsigned long pfn;
+ struct page *page;
struct list_head node;
};

@@ -219,7 +219,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
kfree(gnt_list_entry);
goto out_of_memory;
}
- gnt_list_entry->pfn = page_to_pfn(granted_page);
+ gnt_list_entry->page = granted_page;
}

gnt_list_entry->gref = GRANT_INVALID_REF;
@@ -234,7 +234,7 @@ out_of_memory:
&info->grants, node) {
list_del(&gnt_list_entry->node);
if (info->feature_persistent)
- __free_page(pfn_to_page(gnt_list_entry->pfn));
+ __free_page(gnt_list_entry->page);
kfree(gnt_list_entry);
i--;
}
@@ -243,7 +243,7 @@ out_of_memory:
}

static struct grant *get_grant(grant_ref_t *gref_head,
- unsigned long pfn,
+ struct page *page,
struct blkfront_info *info)
{
struct grant *gnt_list_entry;
@@ -263,10 +263,10 @@ static struct grant *get_grant(grant_ref_t *gref_head,
gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
BUG_ON(gnt_list_entry->gref == -ENOSPC);
if (!info->feature_persistent) {
- BUG_ON(!pfn);
- gnt_list_entry->pfn = pfn;
+ BUG_ON(!page);
+ gnt_list_entry->page = page;
}
- buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
+ buffer_mfn = page_to_mfn(gnt_list_entry->page);
gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
info->xbdev->otherend_id,
buffer_mfn, 0);
@@ -522,7 +522,7 @@ static int blkif_queue_rw_req(struct request *req)

if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
(i % SEGS_PER_INDIRECT_FRAME == 0)) {
- unsigned long uninitialized_var(pfn);
+ struct page *uninitialized_var(page);

if (segments)
kunmap_atomic(segments);
@@ -536,15 +536,15 @@ static int blkif_queue_rw_req(struct request *req)
indirect_page = list_first_entry(&info->indirect_pages,
struct page, lru);
list_del(&indirect_page->lru);
- pfn = page_to_pfn(indirect_page);
+ page = indirect_page;
}
- gnt_list_entry = get_grant(&gref_head, pfn, info);
+ gnt_list_entry = get_grant(&gref_head, page, info);
info->shadow[id].indirect_grants[n] = gnt_list_entry;
- segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
+ segments = kmap_atomic(gnt_list_entry->page);
ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
}

- gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
+ gnt_list_entry = get_grant(&gref_head, sg_page(sg), info);
ref = gnt_list_entry->gref;

info->shadow[id].grants_used[i] = gnt_list_entry;
@@ -555,7 +555,7 @@ static int blkif_queue_rw_req(struct request *req)

BUG_ON(sg->offset + sg->length > PAGE_SIZE);

- shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
+ shared_data = kmap_atomic(gnt_list_entry->page);
bvec_data = kmap_atomic(sg_page(sg));

/*
@@ -1002,7 +1002,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
info->persistent_gnts_c--;
}
if (info->feature_persistent)
- __free_page(pfn_to_page(persistent_gnt->pfn));
+ __free_page(persistent_gnt->page);
kfree(persistent_gnt);
}
}
@@ -1037,7 +1037,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
persistent_gnt = info->shadow[i].grants_used[j];
gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
if (info->feature_persistent)
- __free_page(pfn_to_page(persistent_gnt->pfn));
+ __free_page(persistent_gnt->page);
kfree(persistent_gnt);
}

@@ -1051,7 +1051,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
for (j = 0; j < INDIRECT_GREFS(segs); j++) {
persistent_gnt = info->shadow[i].indirect_grants[j];
gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
- __free_page(pfn_to_page(persistent_gnt->pfn));
+ __free_page(persistent_gnt->page);
kfree(persistent_gnt);
}

@@ -1102,8 +1102,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
if (bret->operation == BLKIF_OP_READ && info->feature_persistent) {
for_each_sg(s->sg, sg, nseg, i) {
BUG_ON(sg->offset + sg->length > PAGE_SIZE);
- shared_data = kmap_atomic(
- pfn_to_page(s->grants_used[i]->pfn));
+ shared_data = kmap_atomic(s->grants_used[i]->page);
bvec_data = kmap_atomic(sg_page(sg));
memcpy(bvec_data + sg->offset,
shared_data + sg->offset,
@@ -1154,7 +1153,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
* Add the used indirect page back to the list of
* available pages for indirect grefs.
*/
- indirect_page = pfn_to_page(s->indirect_grants[i]->pfn);
+ indirect_page = s->indirect_grants[i]->page;
list_add(&indirect_page->lru, &info->indirect_pages);
s->indirect_grants[i]->gref = GRANT_INVALID_REF;
list_add_tail(&s->indirect_grants[i]->node, &info->grants);
--
2.1.4

2015-07-09 20:49:02

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 07/20] block/xen-blkfront: split get_grant in 2

Prepare the code to support 64KB page granularity. The first
implementation will use a full Linux page per indirect and persistent
grant. When non-persistent grants are used, each page of a bio request
may be split into multiple grants.

Furthermore, the page field of the grant structure is only used to copy
data from persistent grants or indirect grants. Avoid setting it for other
use cases, as it has no meaning there given that the page will be split
into multiple grants.

Provide 2 functions: one to set up indirect grants, the other for bio pages.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Roger Pau Monné <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Patch added
---
drivers/block/xen-blkfront.c | 85 ++++++++++++++++++++++++++++++--------------
1 file changed, 59 insertions(+), 26 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 7b81d23..95fd067 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -242,34 +242,77 @@ out_of_memory:
return -ENOMEM;
}

-static struct grant *get_grant(grant_ref_t *gref_head,
- struct page *page,
- struct blkfront_info *info)
+static struct grant *get_free_grant(struct blkfront_info *info)
{
struct grant *gnt_list_entry;
- unsigned long buffer_mfn;

BUG_ON(list_empty(&info->grants));
gnt_list_entry = list_first_entry(&info->grants, struct grant,
- node);
+ node);
list_del(&gnt_list_entry->node);

- if (gnt_list_entry->gref != GRANT_INVALID_REF) {
+ if (gnt_list_entry->gref != GRANT_INVALID_REF)
info->persistent_gnts_c--;
+
+ return gnt_list_entry;
+}
+
+static void grant_foreign_access(const struct grant *gnt_list_entry,
+ const struct blkfront_info *info)
+{
+ gnttab_page_grant_foreign_access_ref(gnt_list_entry->gref,
+ info->xbdev->otherend_id,
+ gnt_list_entry->page,
+ 0);
+}
+
+static struct grant *get_grant(grant_ref_t *gref_head,
+ unsigned long mfn,
+ struct blkfront_info *info)
+{
+ struct grant *gnt_list_entry = get_free_grant(info);
+
+ if (gnt_list_entry->gref != GRANT_INVALID_REF)
return gnt_list_entry;
+
+ /* Assign a gref to this page */
+ gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
+ BUG_ON(gnt_list_entry->gref == -ENOSPC);
+ if (info->feature_persistent)
+ grant_foreign_access(gnt_list_entry, info);
+ else {
+ /* Grant access to the MFN passed by the caller */
+ gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
+ info->xbdev->otherend_id,
+ mfn, 0);
}

+ return gnt_list_entry;
+}
+
+static struct grant *get_indirect_grant(grant_ref_t *gref_head,
+ struct blkfront_info *info)
+{
+ struct grant *gnt_list_entry = get_free_grant(info);
+
+ if (gnt_list_entry->gref != GRANT_INVALID_REF)
+ return gnt_list_entry;
+
/* Assign a gref to this page */
gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
BUG_ON(gnt_list_entry->gref == -ENOSPC);
if (!info->feature_persistent) {
- BUG_ON(!page);
- gnt_list_entry->page = page;
+ struct page *indirect_page;
+
+ /* Fetch a pre-allocated page to use for indirect grefs */
+ BUG_ON(list_empty(&info->indirect_pages));
+ indirect_page = list_first_entry(&info->indirect_pages,
+ struct page, lru);
+ list_del(&indirect_page->lru);
+ gnt_list_entry->page = indirect_page;
}
- buffer_mfn = page_to_mfn(gnt_list_entry->page);
- gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
- info->xbdev->otherend_id,
- buffer_mfn, 0);
+ grant_foreign_access(gnt_list_entry, info);
+
return gnt_list_entry;
}

@@ -522,29 +565,19 @@ static int blkif_queue_rw_req(struct request *req)

if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
(i % SEGS_PER_INDIRECT_FRAME == 0)) {
- struct page *uninitialized_var(page);
-
if (segments)
kunmap_atomic(segments);

n = i / SEGS_PER_INDIRECT_FRAME;
- if (!info->feature_persistent) {
- struct page *indirect_page;
-
- /* Fetch a pre-allocated page to use for indirect grefs */
- BUG_ON(list_empty(&info->indirect_pages));
- indirect_page = list_first_entry(&info->indirect_pages,
- struct page, lru);
- list_del(&indirect_page->lru);
- page = indirect_page;
- }
- gnt_list_entry = get_grant(&gref_head, page, info);
+ gnt_list_entry = get_indirect_grant(&gref_head, info);
info->shadow[id].indirect_grants[n] = gnt_list_entry;
segments = kmap_atomic(gnt_list_entry->page);
ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
}

- gnt_list_entry = get_grant(&gref_head, sg_page(sg), info);
+ gnt_list_entry = get_grant(&gref_head,
+ page_to_mfn(sg_page(sg)),
+ info);
ref = gnt_list_entry->gref;

info->shadow[id].grants_used[i] = gnt_list_entry;
--
2.1.4

2015-07-09 20:45:11

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 08/20] net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop

The skb doesn't change within the function. Therefore it's only
necessary to check if we need GSO once at the beginning.

Signed-off-by: Julien Grall <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: [email protected]
---
Changes in v2:
- Patch added
---
drivers/net/xen-netback/netback.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 880d0d6..3f77030 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -277,6 +277,13 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
unsigned long bytes;
int gso_type = XEN_NETIF_GSO_TYPE_NONE;

+ if (skb_is_gso(skb)) {
+ if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
+ gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
+ else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
+ gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
+ }
+
/* Data must not cross a page boundary. */
BUG_ON(size + offset > PAGE_SIZE<<compound_order(page));

@@ -336,13 +343,6 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
}

/* Leave a gap for the GSO descriptor. */
- if (skb_is_gso(skb)) {
- if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
- gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
- else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
- gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
- }
-
if (*head && ((1 << gso_type) & queue->vif->gso_mask))
queue->rx.req_cons++;

--
2.1.4

2015-07-09 20:49:47

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

When Linux is using 64K page granularity, every page will be split into
multiple non-contiguous 4K MFNs (the page granularity of Xen).

I'm not sure how to efficiently check whether we can merge 2 biovecs in
such a case. So for now, always say that biovecs are not mergeable.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Remove the workaround and check if the Linux page granularity
is the same as Xen or not
---
drivers/xen/biomerge.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
index 0edb91c..571567c 100644
--- a/drivers/xen/biomerge.c
+++ b/drivers/xen/biomerge.c
@@ -6,10 +6,17 @@
bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
const struct bio_vec *vec2)
{
+#if XEN_PAGE_SIZE == PAGE_SIZE
unsigned long mfn1 = pfn_to_mfn(page_to_pfn(vec1->bv_page));
unsigned long mfn2 = pfn_to_mfn(page_to_pfn(vec2->bv_page));

return __BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&
((mfn1 == mfn2) || ((mfn1+1) == mfn2));
+#else
+ /* XXX: bio_vec are not mergeable when using different page size in
+ * Xen and Linux
+ */
+ return 0;
+#endif
}
EXPORT_SYMBOL(xen_biovec_phys_mergeable);
--
2.1.4

2015-07-09 20:47:24

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 10/20] xen/xenbus: Use Xen page definition

All the rings (xenstore and PV rings) are always based on the page
granularity of Xen.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Also update the ring mapping function
---
drivers/xen/xenbus/xenbus_client.c | 6 +++---
drivers/xen/xenbus/xenbus_probe.c | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
index 9ad3272..80272f6 100644
--- a/drivers/xen/xenbus/xenbus_client.c
+++ b/drivers/xen/xenbus/xenbus_client.c
@@ -388,7 +388,7 @@ int xenbus_grant_ring(struct xenbus_device *dev, void *vaddr,
}
grefs[i] = err;

- vaddr = vaddr + PAGE_SIZE;
+ vaddr = vaddr + XEN_PAGE_SIZE;
}

return 0;
@@ -555,7 +555,7 @@ static int xenbus_map_ring_valloc_pv(struct xenbus_device *dev,
if (!node)
return -ENOMEM;

- area = alloc_vm_area(PAGE_SIZE * nr_grefs, ptes);
+ area = alloc_vm_area(XEN_PAGE_SIZE * nr_grefs, ptes);
if (!area) {
kfree(node);
return -ENOMEM;
@@ -750,7 +750,7 @@ static int xenbus_unmap_ring_vfree_pv(struct xenbus_device *dev, void *vaddr)
unsigned long addr;

memset(&unmap[i], 0, sizeof(unmap[i]));
- addr = (unsigned long)vaddr + (PAGE_SIZE * i);
+ addr = (unsigned long)vaddr + (XEN_PAGE_SIZE * i);
unmap[i].host_addr = arbitrary_virt_to_machine(
lookup_address(addr, &level)).maddr;
unmap[i].dev_bus_addr = 0;
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
index 4308fb3..c67e5ba 100644
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -713,7 +713,7 @@ static int __init xenstored_local_init(void)

xen_store_mfn = xen_start_info->store_mfn =
pfn_to_mfn(virt_to_phys((void *)page) >>
- PAGE_SHIFT);
+ XEN_PAGE_SHIFT);

/* Next allocate a local port which xenstored can bind to */
alloc_unbound.dom = DOMID_SELF;
@@ -804,7 +804,7 @@ static int __init xenbus_init(void)
goto out_error;
xen_store_mfn = (unsigned long)v;
xen_store_interface =
- xen_remap(xen_store_mfn << PAGE_SHIFT, PAGE_SIZE);
+ xen_remap(xen_store_mfn << XEN_PAGE_SHIFT, XEN_PAGE_SIZE);
break;
default:
pr_warn("Xenstore state unknown\n");
--
2.1.4

2015-07-09 20:45:38

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 11/20] tty/hvc: xen: Use xen page definition

The console ring is always based on the page granularity of Xen.

Signed-off-by: Julien Grall <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: [email protected]
---
drivers/tty/hvc/hvc_xen.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
index a9d837f..2135944 100644
--- a/drivers/tty/hvc/hvc_xen.c
+++ b/drivers/tty/hvc/hvc_xen.c
@@ -230,7 +230,7 @@ static int xen_hvm_console_init(void)
if (r < 0 || v == 0)
goto err;
mfn = v;
- info->intf = xen_remap(mfn << PAGE_SHIFT, PAGE_SIZE);
+ info->intf = xen_remap(mfn << XEN_PAGE_SHIFT, XEN_PAGE_SIZE);
if (info->intf == NULL)
goto err;
info->vtermno = HVC_COOKIE;
@@ -392,7 +392,7 @@ static int xencons_connect_backend(struct xenbus_device *dev,
if (xen_pv_domain())
mfn = virt_to_mfn(info->intf);
else
- mfn = __pa(info->intf) >> PAGE_SHIFT;
+ mfn = __pa(info->intf) >> XEN_PAGE_SHIFT;
ret = gnttab_alloc_grant_references(1, &gref_head);
if (ret < 0)
return ret;
@@ -476,7 +476,7 @@ static int xencons_resume(struct xenbus_device *dev)
struct xencons_info *info = dev_get_drvdata(&dev->dev);

xencons_disconnect_backend(info);
- memset(info->intf, 0, PAGE_SIZE);
+ memset(info->intf, 0, XEN_PAGE_SIZE);
return xencons_connect_backend(dev, info);
}

--
2.1.4

2015-07-09 20:46:45

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 12/20] xen/balloon: Don't rely on the page granularity is the same for Xen and Linux

For ARM64 guests, Linux is able to support either 64K or 4K page
granularity. However, the hypercall interface is always based on 4K
page granularity.

With 64K page granularity, a single Linux page will be spread over multiple
Xen frames.

When a driver requests/frees a balloon page, the balloon driver will have
to split the Linux page into 4K chunks before asking Xen to add/remove the
frames from the guest.

Note that this can work on any page granularity assuming it's a multiple
of 4K.
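
As a sketch of the resulting frame_list layout (not part of the patch;
fill_frames_for_pages is a made-up name), each 64KB Linux page contributes
XEN_PFN_PER_PAGE (16) consecutive 4KB frames:

static unsigned long fill_frames_for_pages(struct page **pages,
					   unsigned long nr_pages)
{
	unsigned long frame_idx = 0, i;

	for (i = 0; i < nr_pages; i++)
		/* set_frame() appends each 4KB Xen pfn to frame_list */
		xen_apply_to_page(pages[i], set_frame, &frame_idx);

	/* The hypercall then sees nr_pages * XEN_PFN_PER_PAGE extents */
	return frame_idx;
}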

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: Wei Liu <[email protected]>
---
Changes in v2:
- Use xen_apply_to_page to split a page in 4K chunk
- It's not necessary to have a smaller frame list. Re-use
PAGE_SIZE
- Convert reserve_additional_memory to use XEN_... macro
---
drivers/xen/balloon.c | 147 +++++++++++++++++++++++++++++++++++---------------
1 file changed, 105 insertions(+), 42 deletions(-)

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index fd93369..19a72b1 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -230,6 +230,7 @@ static enum bp_state reserve_additional_memory(long credit)
nid = memory_add_physaddr_to_nid(hotplug_start_paddr);

#ifdef CONFIG_XEN_HAVE_PVMMU
+ /* TODO */
/*
* add_memory() will build page tables for the new memory so
* the p2m must contain invalid entries so the correct
@@ -242,8 +243,8 @@ static enum bp_state reserve_additional_memory(long credit)
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
unsigned long pfn, i;

- pfn = PFN_DOWN(hotplug_start_paddr);
- for (i = 0; i < balloon_hotplug; i++) {
+ pfn = XEN_PFN_DOWN(hotplug_start_paddr);
+ for (i = 0; i < (balloon_hotplug * XEN_PFN_PER_PAGE); i++) {
if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
pr_warn("set_phys_to_machine() failed, no memory added\n");
return BP_ECANCELED;
@@ -323,10 +324,72 @@ static enum bp_state reserve_additional_memory(long credit)
}
#endif /* CONFIG_XEN_BALLOON_MEMORY_HOTPLUG */

+static int set_frame(struct page *page, unsigned long pfn, void *data)
+{
+ unsigned long *index = data;
+
+ frame_list[(*index)++] = pfn;
+
+ return 0;
+}
+
+#ifdef CONFIG_XEN_HAVE_PVMMU
+static int pvmmu_update_mapping(struct page *page, unsigned long pfn,
+ void *data)
+{
+ unsigned long *index = data;
+ xen_pfn_t frame = frame_list[*index];
+
+ set_phys_to_machine(pfn, frame);
+ /* Link back into the page tables if not highmem. */
+ if (!PageHighMem(page)) {
+ int ret;
+ ret = HYPERVISOR_update_va_mapping(
+ (unsigned long)__va(pfn << XEN_PAGE_SHIFT),
+ mfn_pte(frame, PAGE_KERNEL),
+ 0);
+ BUG_ON(ret);
+ }
+
+ (*index)++;
+
+ return 0;
+}
+#endif
+
+static int balloon_remove_mapping(struct page *page, unsigned long pfn,
+ void *data)
+{
+ unsigned long *index = data;
+
+ /* We expect the frame_list to contain the same pfn */
+ BUG_ON(pfn != frame_list[*index]);
+
+ frame_list[*index] = pfn_to_mfn(pfn);
+
+#ifdef CONFIG_XEN_HAVE_PVMMU
+ if (!xen_feature(XENFEAT_auto_translated_physmap)) {
+ if (!PageHighMem(page)) {
+ int ret;
+
+ ret = HYPERVISOR_update_va_mapping(
+ (unsigned long)__va(pfn << XEN_PAGE_SHIFT),
+ __pte_ma(0), 0);
+ BUG_ON(ret);
+ }
+ __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
+ }
+#endif
+
+ (*index)++;
+
+ return 0;
+}
+
static enum bp_state increase_reservation(unsigned long nr_pages)
{
int rc;
- unsigned long pfn, i;
+ unsigned long i, frame_idx;
struct page *page;
struct xen_memory_reservation reservation = {
.address_bits = 0,
@@ -343,44 +406,43 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
}
#endif

- if (nr_pages > ARRAY_SIZE(frame_list))
- nr_pages = ARRAY_SIZE(frame_list);
+ if (nr_pages > (ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE))
+ nr_pages = ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE;

+ frame_idx = 0;
page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
for (i = 0; i < nr_pages; i++) {
if (!page) {
nr_pages = i;
break;
}
- frame_list[i] = page_to_pfn(page);
+
+ rc = xen_apply_to_page(page, set_frame, &frame_idx);
+
page = balloon_next_page(page);
}

set_xen_guest_handle(reservation.extent_start, frame_list);
- reservation.nr_extents = nr_pages;
+ reservation.nr_extents = nr_pages * XEN_PFN_PER_PAGE;
rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &reservation);
if (rc <= 0)
return BP_EAGAIN;

- for (i = 0; i < rc; i++) {
+ /* rc is equal to the number of Xen page populated */
+ nr_pages = rc / XEN_PFN_PER_PAGE;
+
+ for (i = 0; i < nr_pages; i++) {
page = balloon_retrieve(false);
BUG_ON(page == NULL);

- pfn = page_to_pfn(page);
-
#ifdef CONFIG_XEN_HAVE_PVMMU
+ frame_idx = 0;
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
- set_phys_to_machine(pfn, frame_list[i]);
-
- /* Link back into the page tables if not highmem. */
- if (!PageHighMem(page)) {
- int ret;
- ret = HYPERVISOR_update_va_mapping(
- (unsigned long)__va(pfn << PAGE_SHIFT),
- mfn_pte(frame_list[i], PAGE_KERNEL),
- 0);
- BUG_ON(ret);
- }
+ int ret;
+
+ ret = xen_apply_to_page(page, pvmmu_update_mapping,
+ &frame_idx);
+ BUG_ON(ret);
}
#endif

@@ -388,7 +450,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
__free_reserved_page(page);
}

- balloon_stats.current_pages += rc;
+ balloon_stats.current_pages += nr_pages;

return BP_DONE;
}
@@ -396,7 +458,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
{
enum bp_state state = BP_DONE;
- unsigned long pfn, i;
+ unsigned long pfn, i, frame_idx, nr_frames;
struct page *page;
int ret;
struct xen_memory_reservation reservation = {
@@ -414,9 +476,10 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
}
#endif

- if (nr_pages > ARRAY_SIZE(frame_list))
- nr_pages = ARRAY_SIZE(frame_list);
+ if (nr_pages > (ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE))
+ nr_pages = ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE;

+ frame_idx = 0;
for (i = 0; i < nr_pages; i++) {
page = alloc_page(gfp);
if (page == NULL) {
@@ -426,9 +489,12 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
}
scrub_page(page);

- frame_list[i] = page_to_pfn(page);
+ ret = xen_apply_to_page(page, set_frame, &frame_idx);
+ BUG_ON(ret);
}

+ nr_frames = nr_pages * XEN_PFN_PER_PAGE;
+
/*
* Ensure that ballooned highmem pages don't have kmaps.
*
@@ -439,22 +505,19 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
kmap_flush_unused();

/* Update direct mapping, invalidate P2M, and add to balloon. */
+ frame_idx = 0;
for (i = 0; i < nr_pages; i++) {
- pfn = frame_list[i];
- frame_list[i] = pfn_to_mfn(pfn);
- page = pfn_to_page(pfn);
+ /*
+ * The Xen PFN for a given Linux Page are contiguous in
+ * frame_list
+ */
+ pfn = frame_list[frame_idx];
+ page = xen_pfn_to_page(pfn);

-#ifdef CONFIG_XEN_HAVE_PVMMU
- if (!xen_feature(XENFEAT_auto_translated_physmap)) {
- if (!PageHighMem(page)) {
- ret = HYPERVISOR_update_va_mapping(
- (unsigned long)__va(pfn << PAGE_SHIFT),
- __pte_ma(0), 0);
- BUG_ON(ret);
- }
- __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
- }
-#endif
+
+ ret = xen_apply_to_page(page, balloon_remove_mapping,
+ &frame_idx);
+ BUG_ON(ret);

balloon_append(page);
}
@@ -462,9 +525,9 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
flush_tlb_all();

set_xen_guest_handle(reservation.extent_start, frame_list);
- reservation.nr_extents = nr_pages;
+ reservation.nr_extents = nr_frames;
ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
- BUG_ON(ret != nr_pages);
+ BUG_ON(ret != nr_frames);

balloon_stats.current_pages -= nr_pages;

--
2.1.4

2015-07-09 20:47:34

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 13/20] xen/events: fifo: Make it running on 64KB granularity

Only use the first 4KB of the page to store the event channel info. This
means that we waste 60KB every time we allocate a page for:
* control block: a page is allocated per CPU
* event array: a page is allocated every time we need to expand it
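
For scale, a short worked example (not part of the patch; the numbers follow
from the constants used in events_fifo.c):

	/* event_word_t is a 32-bit word */
	EVENT_WORDS_PER_PAGE = XEN_PAGE_SIZE / sizeof(event_word_t)
			     = 4096 / 4 = 1024 events per 4KB chunk

	A 64KB Linux page could hold 16 such chunks, but the FIFO ABI
	expands the event array one 4KB frame at a time, so only the
	first chunk of each allocated page is handed to Xen.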

I think we can reduce the memory waste for the 2 areas by:

* control block: sharing it between multiple vCPUs, although that will
require some bookkeeping so that the page is not freed when a CPU
goes offline while other CPUs sharing the page are still online

* event array: always extending the event array by 64K (i.e. 16 4K
chunks). That would require more care when we fail to expand the
event channel.

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
drivers/xen/events/events_base.c | 2 +-
drivers/xen/events/events_fifo.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 96093ae..858d2f6 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -40,11 +40,11 @@
#include <asm/idle.h>
#include <asm/io_apic.h>
#include <asm/xen/pci.h>
-#include <xen/page.h>
#endif
#include <asm/sync_bitops.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>
+#include <xen/page.h>

#include <xen/xen.h>
#include <xen/hvm.h>
diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
index ed673e1..d53c297 100644
--- a/drivers/xen/events/events_fifo.c
+++ b/drivers/xen/events/events_fifo.c
@@ -54,7 +54,7 @@

#include "events_internal.h"

-#define EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
+#define EVENT_WORDS_PER_PAGE (XEN_PAGE_SIZE / sizeof(event_word_t))
#define MAX_EVENT_ARRAY_PAGES (EVTCHN_FIFO_NR_CHANNELS / EVENT_WORDS_PER_PAGE)

struct evtchn_fifo_queue {
--
2.1.4

2015-07-09 20:46:14

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 14/20] xen/grant-table: Make it running on 64KB granularity

The Xen interface uses 4KB page granularity. This means that each
grant is 4KB.

The current implementation allocates a Linux page per grant. On Linux
using 64KB page granularity, only the first 4KB of each page will be
used.

We could decrease the memory wasted by sharing the page between multiple
grants. It will require some care with the {Set,Clear}ForeignPage macros.

Note that no changes have been made in the x86 code because both Linux
and Xen will only use 4KB page granularity there.

Signed-off-by: Julien Grall <[email protected]>
Reviewed-by: David Vrabel <[email protected]>
Cc: Stefano Stabellini <[email protected]>
Cc: Russell King <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
---
Changes in v2
- Add David's reviewed-by
---
arch/arm/xen/p2m.c | 6 +++---
drivers/xen/grant-table.c | 6 +++---
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm/xen/p2m.c b/arch/arm/xen/p2m.c
index 887596c..0ed01f2 100644
--- a/arch/arm/xen/p2m.c
+++ b/arch/arm/xen/p2m.c
@@ -93,8 +93,8 @@ int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops,
for (i = 0; i < count; i++) {
if (map_ops[i].status)
continue;
- set_phys_to_machine(map_ops[i].host_addr >> PAGE_SHIFT,
- map_ops[i].dev_bus_addr >> PAGE_SHIFT);
+ set_phys_to_machine(map_ops[i].host_addr >> XEN_PAGE_SHIFT,
+ map_ops[i].dev_bus_addr >> XEN_PAGE_SHIFT);
}

return 0;
@@ -108,7 +108,7 @@ int clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
int i;

for (i = 0; i < count; i++) {
- set_phys_to_machine(unmap_ops[i].host_addr >> PAGE_SHIFT,
+ set_phys_to_machine(unmap_ops[i].host_addr >> XEN_PAGE_SHIFT,
INVALID_P2M_ENTRY);
}

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 3679293..0a1f903 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -668,7 +668,7 @@ int gnttab_setup_auto_xlat_frames(phys_addr_t addr)
if (xen_auto_xlat_grant_frames.count)
return -EINVAL;

- vaddr = xen_remap(addr, PAGE_SIZE * max_nr_gframes);
+ vaddr = xen_remap(addr, XEN_PAGE_SIZE * max_nr_gframes);
if (vaddr == NULL) {
pr_warn("Failed to ioremap gnttab share frames (addr=%pa)!\n",
&addr);
@@ -680,7 +680,7 @@ int gnttab_setup_auto_xlat_frames(phys_addr_t addr)
return -ENOMEM;
}
for (i = 0; i < max_nr_gframes; i++)
- pfn[i] = PFN_DOWN(addr) + i;
+ pfn[i] = XEN_PFN_DOWN(addr) + i;

xen_auto_xlat_grant_frames.vaddr = vaddr;
xen_auto_xlat_grant_frames.pfn = pfn;
@@ -1004,7 +1004,7 @@ static void gnttab_request_version(void)
{
/* Only version 1 is used, which will always be available. */
grant_table_version = 1;
- grefs_per_grant_frame = PAGE_SIZE / sizeof(struct grant_entry_v1);
+ grefs_per_grant_frame = XEN_PAGE_SIZE / sizeof(struct grant_entry_v1);
gnttab_interface = &gnttab_v1_ops;

pr_info("Grant tables using version %d layout\n", grant_table_version);
--
2.1.4

2015-07-09 20:46:50

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 15/20] block/xen-blkfront: Make it running on 64KB page granularity

From: Julien Grall <[email protected]>

The PV block protocol uses 4KB page granularity. The goal of this
patch is to allow a Linux guest using 64KB page granularity to use block
devices on a non-modified Xen.

The block API uses segments which should be at least the size of a
Linux page. Therefore, the driver will have to break each page into 4K
chunks before giving the page to the backend.

Breaking a 64KB segment into 4KB chunks may result in some chunks with
no data. As the PV protocol always requires chunks to contain data, we
have to count the number of Xen pages which will actually be in use and
avoid sending empty chunks.

Note that a pre-defined number of grants is reserved before preparing
the request. This pre-defined number is based on the number and the
maximum size of the segments. If each segment contains a very small
amount of data, the driver may reserve too many grants (16 grants are
reserved per segment with 64KB page granularity).

Furthermore, in the case of persistent grants we allocate one Linux page
per grant although only the first 4KB of the page will effectively be
used. This could be improved by sharing the page between multiple grants.
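
For reference, a sketch of the per-request grant arithmetic with 64KB pages
(max_grants_for_request is a made-up helper roughly mirroring the max_grefs
computation, not code from the patch):

/*
 * With PAGE_SIZE == 64KB and XEN_PAGE_SIZE == 4KB:
 *   XEN_PAGES_PER_SEGMENT        == 16  (4KB grants per 64KB segment)
 *   XEN_PAGES_PER_INDIRECT_FRAME == 512 (grant entries per indirect page)
 *   SEGS_PER_INDIRECT_FRAME      == 32  (64KB segments per indirect page)
 */
static inline unsigned int max_grants_for_request(unsigned int nr_segments)
{
	unsigned int grants = nr_segments * XEN_PAGES_PER_SEGMENT;

	/* Plus one grant per indirect page (when indirect descriptors are used) */
	return grants + INDIRECT_GREFS(grants);
}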

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Roger Pau Monné <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---

Improvements such as 64KB grant support are not taken into consideration
in this patch because we have the requirement to run a Linux using 64KB
pages on a non-modified Xen.

Changes in v2:
- Use gnttab_foreach_grant to split a Linux page into grants
---
drivers/block/xen-blkfront.c | 304 ++++++++++++++++++++++++++++---------------
1 file changed, 198 insertions(+), 106 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 95fd067..644ba76 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -77,6 +77,7 @@ struct blk_shadow {
struct grant **grants_used;
struct grant **indirect_grants;
struct scatterlist *sg;
+ unsigned int num_sg;
};

struct split_bio {
@@ -106,8 +107,8 @@ static unsigned int xen_blkif_max_ring_order;
module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, S_IRUGO);
MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for the shared ring");

-#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE * (info)->nr_ring_pages)
-#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * XENBUS_MAX_RING_PAGES)
+#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, XEN_PAGE_SIZE * (info)->nr_ring_pages)
+#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, XEN_PAGE_SIZE * XENBUS_MAX_RING_PAGES)
/*
* ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19
* characters are enough. Define to 20 to keep consist with backend.
@@ -146,6 +147,7 @@ struct blkfront_info
unsigned int discard_granularity;
unsigned int discard_alignment;
unsigned int feature_persistent:1;
+ /* Number of 4K segment handled */
unsigned int max_indirect_segments;
int is_ready;
};
@@ -173,10 +175,19 @@ static DEFINE_SPINLOCK(minor_lock);

#define DEV_NAME "xvd" /* name in /dev */

-#define SEGS_PER_INDIRECT_FRAME \
- (PAGE_SIZE/sizeof(struct blkif_request_segment))
-#define INDIRECT_GREFS(_segs) \
- ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+/*
+ * Xen use 4K pages. The guest may use different page size (4K or 64K)
+ * Number of Xen pages per segment
+ */
+#define XEN_PAGES_PER_SEGMENT (PAGE_SIZE / XEN_PAGE_SIZE)
+
+#define SEGS_PER_INDIRECT_FRAME \
+ (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment) / XEN_PAGES_PER_SEGMENT)
+#define XEN_PAGES_PER_INDIRECT_FRAME \
+ (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment))
+
+#define INDIRECT_GREFS(_pages) \
+ ((_pages + XEN_PAGES_PER_INDIRECT_FRAME - 1)/XEN_PAGES_PER_INDIRECT_FRAME)

static int blkfront_setup_indirect(struct blkfront_info *info);

@@ -463,14 +474,100 @@ static int blkif_queue_discard_req(struct request *req)
return 0;
}

+struct setup_rw_req {
+ unsigned int grant_idx;
+ struct blkif_request_segment *segments;
+ struct blkfront_info *info;
+ struct blkif_request *ring_req;
+ grant_ref_t gref_head;
+ unsigned int id;
+ /* Only used when persistent grant is used and it's a read request */
+ bool need_copy;
+ unsigned int bvec_off;
+ char *bvec_data;
+};
+
+static void blkif_setup_rw_req_grant(unsigned long mfn, unsigned int offset,
+ unsigned int *len, void *data)
+{
+ struct setup_rw_req *setup = data;
+ int n, ref;
+ struct grant *gnt_list_entry;
+ unsigned int fsect, lsect;
+ /* Convenient aliases */
+ unsigned int grant_idx = setup->grant_idx;
+ struct blkif_request *ring_req = setup->ring_req;
+ struct blkfront_info *info = setup->info;
+ struct blk_shadow *shadow = &info->shadow[setup->id];
+
+ if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
+ (grant_idx % XEN_PAGES_PER_INDIRECT_FRAME == 0)) {
+ if (setup->segments)
+ kunmap_atomic(setup->segments);
+
+ n = grant_idx / XEN_PAGES_PER_INDIRECT_FRAME;
+ gnt_list_entry = get_indirect_grant(&setup->gref_head, info);
+ shadow->indirect_grants[n] = gnt_list_entry;
+ setup->segments = kmap_atomic(gnt_list_entry->page);
+ ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
+ }
+
+ gnt_list_entry = get_grant(&setup->gref_head, mfn, info);
+ ref = gnt_list_entry->gref;
+ shadow->grants_used[grant_idx] = gnt_list_entry;
+
+ if (setup->need_copy) {
+ void *shared_data;
+
+ shared_data = kmap_atomic(gnt_list_entry->page);
+ /*
+ * this does not wipe data stored outside the
+ * range sg->offset..sg->offset+sg->length.
+ * Therefore, blkback *could* see data from
+ * previous requests. This is OK as long as
+ * persistent grants are shared with just one
+ * domain. It may need refactoring if this
+ * changes
+ */
+ memcpy(shared_data + offset,
+ setup->bvec_data + setup->bvec_off,
+ *len);
+
+ kunmap_atomic(shared_data);
+ setup->bvec_off += *len;
+ }
+
+ fsect = offset >> 9;
+ lsect = fsect + (*len >> 9) - 1;
+ if (ring_req->operation != BLKIF_OP_INDIRECT) {
+ ring_req->u.rw.seg[grant_idx] =
+ (struct blkif_request_segment) {
+ .gref = ref,
+ .first_sect = fsect,
+ .last_sect = lsect };
+ } else {
+ setup->segments[grant_idx % XEN_PAGES_PER_INDIRECT_FRAME] =
+ (struct blkif_request_segment) {
+ .gref = ref,
+ .first_sect = fsect,
+ .last_sect = lsect };
+ }
+
+ (setup->grant_idx)++;
+}
+
static int blkif_queue_rw_req(struct request *req)
{
struct blkfront_info *info = req->rq_disk->private_data;
struct blkif_request *ring_req;
unsigned long id;
- unsigned int fsect, lsect;
- int i, ref, n;
- struct blkif_request_segment *segments = NULL;
+ int i;
+ struct setup_rw_req setup = {
+ .grant_idx = 0,
+ .segments = NULL,
+ .info = info,
+ .need_copy = rq_data_dir(req) && info->feature_persistent,
+ };

/*
* Used to store if we are able to queue the request by just using
@@ -478,25 +575,23 @@ static int blkif_queue_rw_req(struct request *req)
* as there are not sufficiently many free.
*/
bool new_persistent_gnts;
- grant_ref_t gref_head;
- struct grant *gnt_list_entry = NULL;
struct scatterlist *sg;
- int nseg, max_grefs;
+ int nseg, max_grefs, nr_page;

- max_grefs = req->nr_phys_segments;
+ max_grefs = req->nr_phys_segments * XEN_PAGES_PER_SEGMENT;
if (max_grefs > BLKIF_MAX_SEGMENTS_PER_REQUEST)
/*
* If we are using indirect segments we need to account
* for the indirect grefs used in the request.
*/
- max_grefs += INDIRECT_GREFS(req->nr_phys_segments);
+ max_grefs += INDIRECT_GREFS(req->nr_phys_segments * XEN_PAGES_PER_SEGMENT);

/* Check if we have enough grants to allocate a requests */
if (info->persistent_gnts_c < max_grefs) {
new_persistent_gnts = 1;
if (gnttab_alloc_grant_references(
max_grefs - info->persistent_gnts_c,
- &gref_head) < 0) {
+ &setup.gref_head) < 0) {
gnttab_request_free_callback(
&info->callback,
blkif_restart_queue_callback,
@@ -513,12 +608,18 @@ static int blkif_queue_rw_req(struct request *req)
info->shadow[id].request = req;

BUG_ON(info->max_indirect_segments == 0 &&
- req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
+ (XEN_PAGES_PER_SEGMENT * req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
BUG_ON(info->max_indirect_segments &&
- req->nr_phys_segments > info->max_indirect_segments);
+ (req->nr_phys_segments * XEN_PAGES_PER_SEGMENT) > info->max_indirect_segments);
nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
+ nr_page = 0;
+ /* Calculate the number of Xen pages used */
+ for_each_sg(info->shadow[id].sg, sg, nseg, i) {
+ nr_page += (round_up(sg->offset + sg->length, XEN_PAGE_SIZE) - round_down(sg->offset, XEN_PAGE_SIZE)) >> XEN_PAGE_SHIFT;
+ }
ring_req->u.rw.id = id;
- if (nseg > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
+ info->shadow[id].num_sg = nseg;
+ if (nr_page > BLKIF_MAX_SEGMENTS_PER_REQUEST) {
/*
* The indirect operation can only be a BLKIF_OP_READ or
* BLKIF_OP_WRITE
@@ -529,7 +630,7 @@ static int blkif_queue_rw_req(struct request *req)
BLKIF_OP_WRITE : BLKIF_OP_READ;
ring_req->u.indirect.sector_number = (blkif_sector_t)blk_rq_pos(req);
ring_req->u.indirect.handle = info->handle;
- ring_req->u.indirect.nr_segments = nseg;
+ ring_req->u.indirect.nr_segments = nr_page;
} else {
ring_req->u.rw.sector_number = (blkif_sector_t)blk_rq_pos(req);
ring_req->u.rw.handle = info->handle;
@@ -557,73 +658,30 @@ static int blkif_queue_rw_req(struct request *req)
ring_req->operation = 0;
}
}
- ring_req->u.rw.nr_segments = nseg;
+ ring_req->u.rw.nr_segments = nr_page;
}
- for_each_sg(info->shadow[id].sg, sg, nseg, i) {
- fsect = sg->offset >> 9;
- lsect = fsect + (sg->length >> 9) - 1;
-
- if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
- (i % SEGS_PER_INDIRECT_FRAME == 0)) {
- if (segments)
- kunmap_atomic(segments);
-
- n = i / SEGS_PER_INDIRECT_FRAME;
- gnt_list_entry = get_indirect_grant(&gref_head, info);
- info->shadow[id].indirect_grants[n] = gnt_list_entry;
- segments = kmap_atomic(gnt_list_entry->page);
- ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
- }

- gnt_list_entry = get_grant(&gref_head,
- page_to_mfn(sg_page(sg)),
- info);
- ref = gnt_list_entry->gref;
-
- info->shadow[id].grants_used[i] = gnt_list_entry;
-
- if (rq_data_dir(req) && info->feature_persistent) {
- char *bvec_data;
- void *shared_data;
-
- BUG_ON(sg->offset + sg->length > PAGE_SIZE);
+ setup.ring_req = ring_req;
+ setup.id = id;
+ for_each_sg(info->shadow[id].sg, sg, nseg, i) {
+ BUG_ON(sg->offset + sg->length > PAGE_SIZE);

- shared_data = kmap_atomic(gnt_list_entry->page);
- bvec_data = kmap_atomic(sg_page(sg));
+ if (setup.need_copy) {
+ setup.bvec_off = sg->offset;
+ setup.bvec_data = kmap_atomic(sg_page(sg));
+ }

- /*
- * this does not wipe data stored outside the
- * range sg->offset..sg->offset+sg->length.
- * Therefore, blkback *could* see data from
- * previous requests. This is OK as long as
- * persistent grants are shared with just one
- * domain. It may need refactoring if this
- * changes
- */
- memcpy(shared_data + sg->offset,
- bvec_data + sg->offset,
- sg->length);
+ gnttab_foreach_grant(sg_page(sg),
+ sg->offset,
+ sg->length,
+ blkif_setup_rw_req_grant,
+ &setup);

- kunmap_atomic(bvec_data);
- kunmap_atomic(shared_data);
- }
- if (ring_req->operation != BLKIF_OP_INDIRECT) {
- ring_req->u.rw.seg[i] =
- (struct blkif_request_segment) {
- .gref = ref,
- .first_sect = fsect,
- .last_sect = lsect };
- } else {
- n = i % SEGS_PER_INDIRECT_FRAME;
- segments[n] =
- (struct blkif_request_segment) {
- .gref = ref,
- .first_sect = fsect,
- .last_sect = lsect };
- }
+ if (setup.need_copy)
+ kunmap_atomic(setup.bvec_data);
}
- if (segments)
- kunmap_atomic(segments);
+ if (setup.segments)
+ kunmap_atomic(setup.segments);

info->ring.req_prod_pvt++;

@@ -631,7 +689,7 @@ static int blkif_queue_rw_req(struct request *req)
info->shadow[id].req = *ring_req;

if (new_persistent_gnts)
- gnttab_free_grant_references(gref_head);
+ gnttab_free_grant_references(setup.gref_head);

return 0;
}
@@ -748,14 +806,14 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
/* Hard sector size and max sectors impersonate the equiv. hardware. */
blk_queue_logical_block_size(rq, sector_size);
blk_queue_physical_block_size(rq, physical_sector_size);
- blk_queue_max_hw_sectors(rq, (segments * PAGE_SIZE) / 512);
+ blk_queue_max_hw_sectors(rq, (segments * XEN_PAGE_SIZE) / 512);

/* Each segment in a request is up to an aligned page in size. */
blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
blk_queue_max_segment_size(rq, PAGE_SIZE);

/* Ensure a merged request will fit in a single I/O ring slot. */
- blk_queue_max_segments(rq, segments);
+ blk_queue_max_segments(rq, segments / XEN_PAGES_PER_SEGMENT);

/* Make sure buffer addresses are sector-aligned. */
blk_queue_dma_alignment(rq, 511);
@@ -1120,32 +1178,65 @@ free_shadow:

}

+struct copy_from_grant {
+ const struct blk_shadow *s;
+ unsigned int grant_idx;
+ unsigned int bvec_offset;
+ char *bvec_data;
+};
+
+static void blkif_copy_from_grant(unsigned long mfn, unsigned int offset,
+ unsigned int *len, void *data)
+{
+ struct copy_from_grant *info = data;
+ char *shared_data;
+ /* Convenient aliases */
+ const struct blk_shadow *s = info->s;
+
+ shared_data = kmap_atomic(s->grants_used[info->grant_idx]->page);
+
+ memcpy(info->bvec_data + info->bvec_offset,
+ shared_data + offset, *len);
+
+ info->bvec_offset += *len;
+ info->grant_idx++;
+
+ kunmap_atomic(shared_data);
+}
+
static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
struct blkif_response *bret)
{
int i = 0;
struct scatterlist *sg;
- char *bvec_data;
- void *shared_data;
- int nseg;
+ int nseg, nr_page;
+ struct copy_from_grant data = {
+ .s = s,
+ .grant_idx = 0,
+ };

- nseg = s->req.operation == BLKIF_OP_INDIRECT ?
+ nr_page = s->req.operation == BLKIF_OP_INDIRECT ?
s->req.u.indirect.nr_segments : s->req.u.rw.nr_segments;
+ nseg = s->num_sg;

if (bret->operation == BLKIF_OP_READ && info->feature_persistent) {
for_each_sg(s->sg, sg, nseg, i) {
BUG_ON(sg->offset + sg->length > PAGE_SIZE);
- shared_data = kmap_atomic(s->grants_used[i]->page);
- bvec_data = kmap_atomic(sg_page(sg));
- memcpy(bvec_data + sg->offset,
- shared_data + sg->offset,
- sg->length);
- kunmap_atomic(bvec_data);
- kunmap_atomic(shared_data);
+
+ data.bvec_offset = sg->offset;
+ data.bvec_data = kmap_atomic(sg_page(sg));
+
+ gnttab_foreach_grant(sg_page(sg),
+ sg->offset,
+ sg->length,
+ blkif_copy_from_grant,
+ &data);
+
+ kunmap_atomic(data.bvec_data);
}
}
/* Add the persistent grant into the list of free grants */
- for (i = 0; i < nseg; i++) {
+ for (i = 0; i < nr_page; i++) {
if (gnttab_query_foreign_access(s->grants_used[i]->gref)) {
/*
* If the grant is still mapped by the backend (the
@@ -1171,7 +1262,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
}
}
if (s->req.operation == BLKIF_OP_INDIRECT) {
- for (i = 0; i < INDIRECT_GREFS(nseg); i++) {
+ for (i = 0; i < INDIRECT_GREFS(nr_page); i++) {
if (gnttab_query_foreign_access(s->indirect_grants[i]->gref)) {
if (!info->feature_persistent)
pr_alert_ratelimited("backed has not unmapped grant: %u\n",
@@ -1314,7 +1405,7 @@ static int setup_blkring(struct xenbus_device *dev,
{
struct blkif_sring *sring;
int err, i;
- unsigned long ring_size = info->nr_ring_pages * PAGE_SIZE;
+ unsigned long ring_size = info->nr_ring_pages * XEN_PAGE_SIZE;
grant_ref_t gref[XENBUS_MAX_RING_PAGES];

for (i = 0; i < info->nr_ring_pages; i++)
@@ -1666,8 +1757,8 @@ static int blkif_recover(struct blkfront_info *info)
atomic_set(&split_bio->pending, pending);
split_bio->bio = bio;
for (i = 0; i < pending; i++) {
- offset = (i * segs * PAGE_SIZE) >> 9;
- size = min((unsigned int)(segs * PAGE_SIZE) >> 9,
+ offset = (i * segs * XEN_PAGE_SIZE) >> 9;
+ size = min((unsigned int)(segs * XEN_PAGE_SIZE) >> 9,
(unsigned int)bio_sectors(bio) - offset);
cloned_bio = bio_clone(bio, GFP_NOIO);
BUG_ON(cloned_bio == NULL);
@@ -1778,7 +1869,7 @@ static void blkfront_setup_discard(struct blkfront_info *info)

static int blkfront_setup_indirect(struct blkfront_info *info)
{
- unsigned int indirect_segments, segs;
+ unsigned int indirect_segments, segs, nr_page;
int err, i;

err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
@@ -1786,14 +1877,15 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
NULL);
if (err) {
info->max_indirect_segments = 0;
- segs = BLKIF_MAX_SEGMENTS_PER_REQUEST;
+ nr_page = BLKIF_MAX_SEGMENTS_PER_REQUEST;
} else {
info->max_indirect_segments = min(indirect_segments,
xen_blkif_max_segments);
- segs = info->max_indirect_segments;
+ nr_page = info->max_indirect_segments;
}
+ segs = nr_page / XEN_PAGES_PER_SEGMENT;

- err = fill_grant_buffer(info, (segs + INDIRECT_GREFS(segs)) * BLK_RING_SIZE(info));
+ err = fill_grant_buffer(info, (nr_page + INDIRECT_GREFS(nr_page)) * BLK_RING_SIZE(info));
if (err)
goto out_of_memory;

@@ -1803,7 +1895,7 @@ static int blkfront_setup_indirect(struct blkfront_info *info)
* grants, we need to allocate a set of pages that can be
* used for mapping indirect grefs
*/
- int num = INDIRECT_GREFS(segs) * BLK_RING_SIZE(info);
+ int num = INDIRECT_GREFS(nr_page) * BLK_RING_SIZE(info);

BUG_ON(!list_empty(&info->indirect_pages));
for (i = 0; i < num; i++) {
@@ -1816,13 +1908,13 @@ static int blkfront_setup_indirect(struct blkfront_info *info)

for (i = 0; i < BLK_RING_SIZE(info); i++) {
info->shadow[i].grants_used = kzalloc(
- sizeof(info->shadow[i].grants_used[0]) * segs,
+ sizeof(info->shadow[i].grants_used[0]) * nr_page,
GFP_NOIO);
info->shadow[i].sg = kzalloc(sizeof(info->shadow[i].sg[0]) * segs, GFP_NOIO);
if (info->max_indirect_segments)
info->shadow[i].indirect_grants = kzalloc(
sizeof(info->shadow[i].indirect_grants[0]) *
- INDIRECT_GREFS(segs),
+ INDIRECT_GREFS(nr_page),
GFP_NOIO);
if ((info->shadow[i].grants_used == NULL) ||
(info->shadow[i].sg == NULL) ||
--
2.1.4

2015-07-09 20:48:02

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 16/20] block/xen-blkback: Make it running on 64KB page granularity

The PV block protocol uses 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity to act as a
block backend on a non-modified Xen.

It's only necessary to adapt the ring size and the number of requests
per indirect frame. The rest of the code relies on the grant table
code.

Note that the grant table code allocates a Linux page per grant, which
wastes 60KB for every grant when Linux is using 64KB page granularity.
This could be improved by sharing the page between multiple grants.
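
For reference, the indirect-frame sizing this boils down to, sketched
with illustrative macros (the 8-byte size of struct
blkif_request_segment is an assumption here, not taken from this patch):

#define EX_XEN_PAGE_SIZE	4096UL
#define EX_LINUX_PAGE_SIZE	65536UL
#define EX_SEG_ENTRY_SIZE	8UL	/* assumed sizeof(struct blkif_request_segment) */

/* Grant entries held by one 4KB indirect frame: 4096 / 8 = 512 */
#define EX_ENTRIES_PER_FRAME	(EX_XEN_PAGE_SIZE / EX_SEG_ENTRY_SIZE)
/* 4KB entries needed per 64KB Linux-page segment: 65536 / 4096 = 16 */
#define EX_ENTRIES_PER_SEGMENT	(EX_LINUX_PAGE_SIZE / EX_XEN_PAGE_SIZE)
/* Linux-page segments described by one indirect frame: 512 / 16 = 32 */
#define EX_SEGS_PER_FRAME	(EX_ENTRIES_PER_FRAME / EX_ENTRIES_PER_SEGMENT)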

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: "Roger Pau Monné" <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---

Improvements such as support of 64KB grants are not taken into
consideration in this patch because we have the requirement to run a
Linux using 64KB pages on a non-modified Xen.

This has been tested only with a loop device. I plan to test passing a
hard drive partition, but I haven't yet converted the swiotlb code.
---
drivers/block/xen-blkback/blkback.c | 5 +++--
drivers/block/xen-blkback/common.h | 16 +++++++++++++---
drivers/block/xen-blkback/xenbus.c | 9 ++++++---
3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index ced9677..d5cce8c 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -961,7 +961,7 @@ static int xen_blkbk_parse_indirect(struct blkif_request *req,
seg[n].nsec = segments[i].last_sect -
segments[i].first_sect + 1;
seg[n].offset = (segments[i].first_sect << 9);
- if ((segments[i].last_sect >= (PAGE_SIZE >> 9)) ||
+ if ((segments[i].last_sect >= (XEN_PAGE_SIZE >> 9)) ||
(segments[i].last_sect < segments[i].first_sect)) {
rc = -EINVAL;
goto unmap;
@@ -1210,6 +1210,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,

req_operation = req->operation == BLKIF_OP_INDIRECT ?
req->u.indirect.indirect_op : req->operation;
+
if ((req->operation == BLKIF_OP_INDIRECT) &&
(req_operation != BLKIF_OP_READ) &&
(req_operation != BLKIF_OP_WRITE)) {
@@ -1268,7 +1269,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
seg[i].nsec = req->u.rw.seg[i].last_sect -
req->u.rw.seg[i].first_sect + 1;
seg[i].offset = (req->u.rw.seg[i].first_sect << 9);
- if ((req->u.rw.seg[i].last_sect >= (PAGE_SIZE >> 9)) ||
+ if ((req->u.rw.seg[i].last_sect >= (XEN_PAGE_SIZE >> 9)) ||
(req->u.rw.seg[i].last_sect <
req->u.rw.seg[i].first_sect))
goto fail_response;
diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
index 45a044a..33836bb 100644
--- a/drivers/block/xen-blkback/common.h
+++ b/drivers/block/xen-blkback/common.h
@@ -39,6 +39,7 @@
#include <asm/pgalloc.h>
#include <asm/hypervisor.h>
#include <xen/grant_table.h>
+#include <xen/page.h>
#include <xen/xenbus.h>
#include <xen/interface/io/ring.h>
#include <xen/interface/io/blkif.h>
@@ -51,12 +52,21 @@ extern unsigned int xen_blkif_max_ring_order;
*/
#define MAX_INDIRECT_SEGMENTS 256

-#define SEGS_PER_INDIRECT_FRAME \
- (PAGE_SIZE/sizeof(struct blkif_request_segment))
+/*
+ * Xen use 4K pages. The guest may use different page size (4K or 64K)
+ * Number of Xen pages per segment
+ */
+#define XEN_PAGES_PER_SEGMENT (PAGE_SIZE / XEN_PAGE_SIZE)
+
+#define SEGS_PER_INDIRECT_FRAME \
+ (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment) / XEN_PAGES_PER_SEGMENT)
+#define XEN_PAGES_PER_INDIRECT_FRAME \
+ (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment))
+
#define MAX_INDIRECT_PAGES \
((MAX_INDIRECT_SEGMENTS + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
#define INDIRECT_PAGES(_segs) \
- ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
+ ((_segs + XEN_PAGES_PER_INDIRECT_FRAME - 1)/XEN_PAGES_PER_INDIRECT_FRAME)

/* Not a real protocol. Used to generate ring structs which contain
* the elements common to all protocols only. This way we get a
diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
index deb3f00..edd27e4 100644
--- a/drivers/block/xen-blkback/xenbus.c
+++ b/drivers/block/xen-blkback/xenbus.c
@@ -176,21 +176,24 @@ static int xen_blkif_map(struct xen_blkif *blkif, grant_ref_t *gref,
{
struct blkif_sring *sring;
sring = (struct blkif_sring *)blkif->blk_ring;
- BACK_RING_INIT(&blkif->blk_rings.native, sring, PAGE_SIZE * nr_grefs);
+ BACK_RING_INIT(&blkif->blk_rings.native, sring,
+ XEN_PAGE_SIZE * nr_grefs);
break;
}
case BLKIF_PROTOCOL_X86_32:
{
struct blkif_x86_32_sring *sring_x86_32;
sring_x86_32 = (struct blkif_x86_32_sring *)blkif->blk_ring;
- BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32, PAGE_SIZE * nr_grefs);
+ BACK_RING_INIT(&blkif->blk_rings.x86_32, sring_x86_32,
+ XEN_PAGE_SIZE * nr_grefs);
break;
}
case BLKIF_PROTOCOL_X86_64:
{
struct blkif_x86_64_sring *sring_x86_64;
sring_x86_64 = (struct blkif_x86_64_sring *)blkif->blk_ring;
- BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64, PAGE_SIZE * nr_grefs);
+ BACK_RING_INIT(&blkif->blk_rings.x86_64, sring_x86_64,
+ XEN_PAGE_SIZE * nr_grefs);
break;
}
default:
--
2.1.4

2015-07-09 20:48:09

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 17/20] net/xen-netfront: Make it running on 64KB page granularity

The PV network protocol uses 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity to use a network
device on a non-modified Xen.

It's only necessary to adapt the ring size and break the skb data into
small chunks of 4KB. The rest of the code relies on the grant table code.

Note that we allocate a Linux page for each rx skb but only the first
4KB is used. We may improve the memory usage by extending the size of
the rx skb.
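
Conceptually the splitting relies on the callback pattern sketched below
(simplified and illustrative, not the driver code; the helper name
matches the one introduced earlier in the series, but the exact
signature may differ):

/* Invoked once per 4KB chunk of the Linux page backing the skb data */
static void ex_setup_one_grant(unsigned long mfn, unsigned int offset,
			       unsigned int *len, void *data)
{
	/* grant 'mfn' to the backend and fill one tx request of '*len' bytes */
}

static void ex_queue_linear_area(struct sk_buff *skb, void *ctx)
{
	/* walk the linear area in 4KB chunks; each chunk becomes one tx slot */
	gnttab_foreach_grant(virt_to_page(skb->data),
			     offset_in_page(skb->data),
			     skb_headlen(skb),
			     ex_setup_one_grant, ctx);
}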

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
Cc: [email protected]
---

Improvements such as support of 64KB grants are not taken into
consideration in this patch because we have the requirement to run a Linux
using 64KB pages on a non-modified Xen.

Tested with workloads such as ping, ssh, wget, git... I would be happy
if someone could give details on how to test all the paths.

Changes in v2:
- Use gnttab_foreach_grant to split a Linux page into grants
- Fix count slots
---
drivers/net/xen-netfront.c | 121 ++++++++++++++++++++++++++++++++-------------
1 file changed, 87 insertions(+), 34 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index f948c46..7233b09 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -74,8 +74,8 @@ struct netfront_cb {

#define GRANT_INVALID_REF 0

-#define NET_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
-#define NET_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, PAGE_SIZE)
+#define NET_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, XEN_PAGE_SIZE)
+#define NET_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, XEN_PAGE_SIZE)

/* Minimum number of Rx slots (includes slot for GSO metadata). */
#define NET_RX_SLOTS_MIN (XEN_NETIF_NR_SLOTS_MIN + 1)
@@ -291,7 +291,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
struct sk_buff *skb;
unsigned short id;
grant_ref_t ref;
- unsigned long pfn;
+ struct page *page;
struct xen_netif_rx_request *req;

skb = xennet_alloc_one_rx_buffer(queue);
@@ -307,14 +307,13 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
BUG_ON((signed short)ref < 0);
queue->grant_rx_ref[id] = ref;

- pfn = page_to_pfn(skb_frag_page(&skb_shinfo(skb)->frags[0]));
+ page = skb_frag_page(&skb_shinfo(skb)->frags[0]);

req = RING_GET_REQUEST(&queue->rx, req_prod);
- gnttab_grant_foreign_access_ref(ref,
- queue->info->xbdev->otherend_id,
- pfn_to_mfn(pfn),
- 0);
-
+ gnttab_page_grant_foreign_access_ref(ref,
+ queue->info->xbdev->otherend_id,
+ page,
+ 0);
req->id = id;
req->gref = ref;
}
@@ -415,15 +414,26 @@ static void xennet_tx_buf_gc(struct netfront_queue *queue)
xennet_maybe_wake_tx(queue);
}

-static struct xen_netif_tx_request *xennet_make_one_txreq(
- struct netfront_queue *queue, struct sk_buff *skb,
- struct page *page, unsigned int offset, unsigned int len)
+struct xennet_gnttab_make_txreq
+{
+ struct netfront_queue *queue;
+ struct sk_buff *skb;
+ struct page *page;
+ struct xen_netif_tx_request *tx; /* Last request */
+ unsigned int size;
+};
+
+static void xennet_tx_setup_grant(unsigned long mfn, unsigned int offset,
+ unsigned int *len, void *data)
{
+ struct xennet_gnttab_make_txreq *info = data;
unsigned int id;
struct xen_netif_tx_request *tx;
grant_ref_t ref;
-
- len = min_t(unsigned int, PAGE_SIZE - offset, len);
+ /* convenient aliases */
+ struct page *page = info->page;
+ struct netfront_queue *queue = info->queue;
+ struct sk_buff *skb = info->skb;

id = get_id_from_freelist(&queue->tx_skb_freelist, queue->tx_skbs);
tx = RING_GET_REQUEST(&queue->tx, queue->tx.req_prod_pvt++);
@@ -431,7 +441,7 @@ static struct xen_netif_tx_request *xennet_make_one_txreq(
BUG_ON((signed short)ref < 0);

gnttab_grant_foreign_access_ref(ref, queue->info->xbdev->otherend_id,
- page_to_mfn(page), GNTMAP_readonly);
+ mfn, GNTMAP_readonly);

queue->tx_skbs[id].skb = skb;
queue->grant_tx_page[id] = page;
@@ -440,10 +450,37 @@ static struct xen_netif_tx_request *xennet_make_one_txreq(
tx->id = id;
tx->gref = ref;
tx->offset = offset;
- tx->size = len;
+ tx->size = *len;
tx->flags = 0;

- return tx;
+ info->tx = tx;
+ info->size += tx->size;
+}
+
+static struct xen_netif_tx_request *xennet_make_first_txreq(
+ struct netfront_queue *queue, struct sk_buff *skb,
+ struct page *page, unsigned int offset, unsigned int len)
+{
+ struct xennet_gnttab_make_txreq info = {
+ .queue = queue,
+ .skb = skb,
+ .page = page,
+ .size = 0,
+ };
+
+ gnttab_one_grant(page, offset, len, xennet_tx_setup_grant, &info);
+
+ return info.tx;
+}
+
+static void xennet_make_one_txreq(unsigned long mfn, unsigned int offset,
+ unsigned int *len, void *data)
+{
+ struct xennet_gnttab_make_txreq *info = data;
+
+ info->tx->flags |= XEN_NETTXF_more_data;
+ skb_get(info->skb);
+ xennet_make_one_txreq(mfn, offset, len, data);
}

static struct xen_netif_tx_request *xennet_make_txreqs(
@@ -451,20 +488,30 @@ static struct xen_netif_tx_request *xennet_make_txreqs(
struct sk_buff *skb, struct page *page,
unsigned int offset, unsigned int len)
{
+ struct xennet_gnttab_make_txreq info = {
+ .queue = queue,
+ .skb = skb,
+ .tx = tx,
+ };
+
/* Skip unused frames from start of page */
page += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;

while (len) {
- tx->flags |= XEN_NETTXF_more_data;
- tx = xennet_make_one_txreq(queue, skb_get(skb),
- page, offset, len);
+ info.page = page;
+ info.size = 0;
+
+ gnttab_foreach_grant(page, offset, len,
+ xennet_make_one_txreq,
+ &info);
+
page++;
offset = 0;
- len -= tx->size;
+ len -= info.size;
}

- return tx;
+ return info.tx;
}

/*
@@ -474,9 +521,10 @@ static struct xen_netif_tx_request *xennet_make_txreqs(
static int xennet_count_skb_slots(struct sk_buff *skb)
{
int i, frags = skb_shinfo(skb)->nr_frags;
- int pages;
+ int slots;

- pages = PFN_UP(offset_in_page(skb->data) + skb_headlen(skb));
+ slots = gnttab_count_grant(offset_in_page(skb->data),
+ skb_headlen(skb));

for (i = 0; i < frags; i++) {
skb_frag_t *frag = skb_shinfo(skb)->frags + i;
@@ -486,10 +534,10 @@ static int xennet_count_skb_slots(struct sk_buff *skb)
/* Skip unused frames from start of page */
offset &= ~PAGE_MASK;

- pages += PFN_UP(offset + size);
+ slots += gnttab_count_grant(offset, size);
}

- return pages;
+ return slots;
}

static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb,
@@ -510,6 +558,8 @@ static u16 xennet_select_queue(struct net_device *dev, struct sk_buff *skb,
return queue_idx;
}

+#define MAX_XEN_SKB_FRAGS (65536 / XEN_PAGE_SIZE + 1)
+
static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct netfront_info *np = netdev_priv(dev);
@@ -544,7 +594,7 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
}

slots = xennet_count_skb_slots(skb);
- if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
+ if (unlikely(slots > MAX_XEN_SKB_FRAGS + 1)) {
net_dbg_ratelimited("xennet: skb rides the rocket: %d slots, %d bytes\n",
slots, skb->len);
if (skb_linearize(skb))
@@ -565,10 +615,13 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
}

/* First request for the linear area. */
- first_tx = tx = xennet_make_one_txreq(queue, skb,
- page, offset, len);
- page++;
- offset = 0;
+ first_tx = tx = xennet_make_first_txreq(queue, skb,
+ page, offset, len);
+ offset += tx->size;
+ if (offset == PAGE_SIZE) {
+ page++;
+ offset = 0;
+ }
len -= tx->size;

if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -730,7 +783,7 @@ static int xennet_get_responses(struct netfront_queue *queue,

for (;;) {
if (unlikely(rx->status < 0 ||
- rx->offset + rx->status > PAGE_SIZE)) {
+ rx->offset + rx->status > XEN_PAGE_SIZE)) {
if (net_ratelimit())
dev_warn(dev, "rx->offset: %u, size: %d\n",
rx->offset, rx->status);
@@ -1493,7 +1546,7 @@ static int setup_netfront(struct xenbus_device *dev,
goto fail;
}
SHARED_RING_INIT(txs);
- FRONT_RING_INIT(&queue->tx, txs, PAGE_SIZE);
+ FRONT_RING_INIT(&queue->tx, txs, XEN_PAGE_SIZE);

err = xenbus_grant_ring(dev, txs, 1, &gref);
if (err < 0)
@@ -1507,7 +1560,7 @@ static int setup_netfront(struct xenbus_device *dev,
goto alloc_rx_ring_fail;
}
SHARED_RING_INIT(rxs);
- FRONT_RING_INIT(&queue->rx, rxs, PAGE_SIZE);
+ FRONT_RING_INIT(&queue->rx, rxs, XEN_PAGE_SIZE);

err = xenbus_grant_ring(dev, rxs, 1, &gref);
if (err < 0)
--
2.1.4

2015-07-09 20:46:00

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 18/20] net/xen-netback: Make it running on 64KB page granularity

The PV network protocol uses 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity to work as a
network backend on a non-modified Xen.

It's only necessary to adapt the ring size and break the skb data into
small chunks of 4KB. The rest of the code relies on the grant table code.
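
One consequence of the 4KB slot size is that the worst-case number of
slots per skb grows. A rough sketch of the arithmetic (illustrative
constants, mirroring the new MAX_XEN_SKB_FRAGS and MAX_GRANT_COPY_OPS
definitions in the patch):

#define EX_XEN_PAGE_SIZE	4096
/* Up to 64KB of skb data in 4KB slots, plus one to allow for misalignment
 * (as in the MAX_SKB_FRAGS definition): 65536 / 4096 + 1 = 17 */
#define EX_MAX_XEN_SKB_FRAGS	(65536 / EX_XEN_PAGE_SIZE + 1)
/* Worst-case grant-copy operations for a full Rx ring */
#define EX_MAX_GRANT_COPY_OPS(ring_size) (EX_MAX_XEN_SKB_FRAGS * (ring_size))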

Signed-off-by: Julien Grall <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Wei Liu <[email protected]>
Cc: [email protected]
---

Improvements such as support of 64KB grants are not taken into
consideration in this patch because we have the requirement to run a
Linux using 64KB pages on a non-modified Xen.

Changes in v2:
- Correctly set MAX_GRANT_COPY_OPS and XEN_NETBK_RX_SLOTS_MAX
- Don't use XEN_PAGE_SIZE in handle_frag_list as we coalesce
fragments into a new skb
- Use gnttab_foreach_grant to split a Linux page into grants
---
drivers/net/xen-netback/common.h | 15 +++--
drivers/net/xen-netback/netback.c | 138 +++++++++++++++++++++++---------------
2 files changed, 93 insertions(+), 60 deletions(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 8a495b3..bb68211 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -44,6 +44,7 @@
#include <xen/interface/grant_table.h>
#include <xen/grant_table.h>
#include <xen/xenbus.h>
+#include <xen/page.h>
#include <linux/debugfs.h>

typedef unsigned int pending_ring_idx_t;
@@ -64,8 +65,8 @@ struct pending_tx_info {
struct ubuf_info callback_struct;
};

-#define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, PAGE_SIZE)
-#define XEN_NETIF_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, PAGE_SIZE)
+#define XEN_NETIF_TX_RING_SIZE __CONST_RING_SIZE(xen_netif_tx, XEN_PAGE_SIZE)
+#define XEN_NETIF_RX_RING_SIZE __CONST_RING_SIZE(xen_netif_rx, XEN_PAGE_SIZE)

struct xenvif_rx_meta {
int id;
@@ -80,16 +81,18 @@ struct xenvif_rx_meta {
/* Discriminate from any valid pending_idx value. */
#define INVALID_PENDING_IDX 0xFFFF

-#define MAX_BUFFER_OFFSET PAGE_SIZE
+#define MAX_BUFFER_OFFSET XEN_PAGE_SIZE

#define MAX_PENDING_REQS XEN_NETIF_TX_RING_SIZE

+#define MAX_XEN_SKB_FRAGS (65536 / XEN_PAGE_SIZE + 1)
+
/* It's possible for an skb to have a maximal number of frags
* but still be less than MAX_BUFFER_OFFSET in size. Thus the
- * worst-case number of copy operations is MAX_SKB_FRAGS per
+ * worst-case number of copy operations is MAX_XEN_SKB_FRAGS per
* ring slot.
*/
-#define MAX_GRANT_COPY_OPS (MAX_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE)
+#define MAX_GRANT_COPY_OPS (MAX_XEN_SKB_FRAGS * XEN_NETIF_RX_RING_SIZE)

#define NETBACK_INVALID_HANDLE -1

@@ -203,7 +206,7 @@ struct xenvif_queue { /* Per-queue data for xenvif */
/* Maximum number of Rx slots a to-guest packet may use, including the
* slot needed for GSO meta-data.
*/
-#define XEN_NETBK_RX_SLOTS_MAX (MAX_SKB_FRAGS + 1)
+#define XEN_NETBK_RX_SLOTS_MAX ((MAX_XEN_SKB_FRAGS + 1))

enum state_bit_shift {
/* This bit marks that the vif is connected */
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 3f77030..828085b 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -263,6 +263,65 @@ static struct xenvif_rx_meta *get_next_rx_buffer(struct xenvif_queue *queue,
return meta;
}

+struct gop_frag_copy
+{
+ struct xenvif_queue *queue;
+ struct netrx_pending_operations *npo;
+ struct xenvif_rx_meta *meta;
+ int head;
+ int gso_type;
+
+ struct page *page;
+};
+
+static void xenvif_gop_frag_copy_grant(unsigned long mfn, unsigned int offset,
+ unsigned int *len, void *data)
+{
+ struct gop_frag_copy *info = data;
+ struct gnttab_copy *copy_gop;
+ struct xen_page_foreign *foreign;
+ /* Convenient aliases */
+ struct xenvif_queue *queue = info->queue;
+ struct netrx_pending_operations *npo = info->npo;
+ struct page *page = info->page;
+
+ BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET);
+
+ if (npo->copy_off == MAX_BUFFER_OFFSET)
+ info->meta = get_next_rx_buffer(queue, npo);
+
+ if (npo->copy_off + *len > MAX_BUFFER_OFFSET)
+ *len = MAX_BUFFER_OFFSET - npo->copy_off;
+
+ copy_gop = npo->copy + npo->copy_prod++;
+ copy_gop->flags = GNTCOPY_dest_gref;
+ copy_gop->len = *len;
+
+ foreign = xen_page_foreign(page);
+ if (foreign) {
+ copy_gop->source.domid = foreign->domid;
+ copy_gop->source.u.ref = foreign->gref;
+ copy_gop->flags |= GNTCOPY_source_gref;
+ } else {
+ copy_gop->source.domid = DOMID_SELF;
+ copy_gop->source.u.gmfn = mfn;
+ }
+ copy_gop->source.offset = offset;
+
+ copy_gop->dest.domid = queue->vif->domid;
+ copy_gop->dest.offset = npo->copy_off;
+ copy_gop->dest.u.ref = npo->copy_gref;
+
+ npo->copy_off += *len;
+ info->meta->size += *len;
+
+ /* Leave a gap for the GSO descriptor. */
+ if (info->head && ((1 << info->gso_type) & queue->vif->gso_mask))
+ queue->rx.req_cons++;
+
+ info->head = 0; /* There must be something in this buffer now */
+}
+
/*
* Set up the grant operations for this fragment. If it's a flipping
* interface, we also set up the unmap request from here.
@@ -272,83 +331,54 @@ static void xenvif_gop_frag_copy(struct xenvif_queue *queue, struct sk_buff *skb
struct page *page, unsigned long size,
unsigned long offset, int *head)
{
- struct gnttab_copy *copy_gop;
- struct xenvif_rx_meta *meta;
+ struct gop_frag_copy info =
+ {
+ .queue = queue,
+ .npo = npo,
+ .head = *head,
+ .gso_type = XEN_NETIF_GSO_TYPE_NONE,
+ };
unsigned long bytes;
- int gso_type = XEN_NETIF_GSO_TYPE_NONE;

if (skb_is_gso(skb)) {
if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4)
- gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
+ info.gso_type = XEN_NETIF_GSO_TYPE_TCPV4;
else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
- gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
+ info.gso_type = XEN_NETIF_GSO_TYPE_TCPV6;
}

/* Data must not cross a page boundary. */
BUG_ON(size + offset > PAGE_SIZE<<compound_order(page));

- meta = npo->meta + npo->meta_prod - 1;
+ info.meta = npo->meta + npo->meta_prod - 1;

/* Skip unused frames from start of page */
page += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;

while (size > 0) {
- struct xen_page_foreign *foreign;
-
BUG_ON(offset >= PAGE_SIZE);
- BUG_ON(npo->copy_off > MAX_BUFFER_OFFSET);
-
- if (npo->copy_off == MAX_BUFFER_OFFSET)
- meta = get_next_rx_buffer(queue, npo);

bytes = PAGE_SIZE - offset;
if (bytes > size)
bytes = size;

- if (npo->copy_off + bytes > MAX_BUFFER_OFFSET)
- bytes = MAX_BUFFER_OFFSET - npo->copy_off;

- copy_gop = npo->copy + npo->copy_prod++;
- copy_gop->flags = GNTCOPY_dest_gref;
- copy_gop->len = bytes;
-
- foreign = xen_page_foreign(page);
- if (foreign) {
- copy_gop->source.domid = foreign->domid;
- copy_gop->source.u.ref = foreign->gref;
- copy_gop->flags |= GNTCOPY_source_gref;
- } else {
- copy_gop->source.domid = DOMID_SELF;
- copy_gop->source.u.gmfn =
- virt_to_mfn(page_address(page));
- }
- copy_gop->source.offset = offset;
-
- copy_gop->dest.domid = queue->vif->domid;
- copy_gop->dest.offset = npo->copy_off;
- copy_gop->dest.u.ref = npo->copy_gref;
-
- npo->copy_off += bytes;
- meta->size += bytes;
-
- offset += bytes;
+ info.page = page;
+ gnttab_foreach_grant(page, offset, bytes,
+ xenvif_gop_frag_copy_grant,
+ &info);
size -= bytes;
+ offset = 0;

- /* Next frame */
- if (offset == PAGE_SIZE && size) {
+ /* Next page */
+ if (size) {
BUG_ON(!PageCompound(page));
page++;
- offset = 0;
}
-
- /* Leave a gap for the GSO descriptor. */
- if (*head && ((1 << gso_type) & queue->vif->gso_mask))
- queue->rx.req_cons++;
-
- *head = 0; /* There must be something in this buffer now. */
-
}
+
+ *head = info.head;
}

/*
@@ -747,7 +777,7 @@ static int xenvif_count_requests(struct xenvif_queue *queue,
first->size -= txp->size;
slots++;

- if (unlikely((txp->offset + txp->size) > PAGE_SIZE)) {
+ if (unlikely((txp->offset + txp->size) > XEN_PAGE_SIZE)) {
netdev_err(queue->vif->dev, "Cross page boundary, txp->offset: %u, size: %u\n",
txp->offset, txp->size);
xenvif_fatal_tx_err(queue->vif);
@@ -1241,11 +1271,11 @@ static void xenvif_tx_build_gops(struct xenvif_queue *queue,
}

/* No crossing a page as the payload mustn't fragment. */
- if (unlikely((txreq.offset + txreq.size) > PAGE_SIZE)) {
+ if (unlikely((txreq.offset + txreq.size) > XEN_PAGE_SIZE)) {
netdev_err(queue->vif->dev,
"txreq.offset: %u, size: %u, end: %lu\n",
txreq.offset, txreq.size,
- (unsigned long)(txreq.offset&~PAGE_MASK) + txreq.size);
+ (unsigned long)(txreq.offset&~XEN_PAGE_MASK) + txreq.size);
xenvif_fatal_tx_err(queue->vif);
break;
}
@@ -1287,7 +1317,7 @@ static void xenvif_tx_build_gops(struct xenvif_queue *queue,
virt_to_mfn(skb->data);
queue->tx_copy_ops[*copy_ops].dest.domid = DOMID_SELF;
queue->tx_copy_ops[*copy_ops].dest.offset =
- offset_in_page(skb->data);
+ offset_in_page(skb->data) & ~XEN_PAGE_MASK;

queue->tx_copy_ops[*copy_ops].len = data_len;
queue->tx_copy_ops[*copy_ops].flags = GNTCOPY_source_gref;
@@ -1780,7 +1810,7 @@ int xenvif_map_frontend_rings(struct xenvif_queue *queue,
goto err;

txs = (struct xen_netif_tx_sring *)addr;
- BACK_RING_INIT(&queue->tx, txs, PAGE_SIZE);
+ BACK_RING_INIT(&queue->tx, txs, XEN_PAGE_SIZE);

err = xenbus_map_ring_valloc(xenvif_to_xenbus_device(queue->vif),
&rx_ring_ref, 1, &addr);
@@ -1788,7 +1818,7 @@ int xenvif_map_frontend_rings(struct xenvif_queue *queue,
goto err;

rxs = (struct xen_netif_rx_sring *)addr;
- BACK_RING_INIT(&queue->rx, rxs, PAGE_SIZE);
+ BACK_RING_INIT(&queue->rx, rxs, XEN_PAGE_SIZE);

return 0;

--
2.1.4

2015-07-09 20:46:10

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

The hypercall interface (as well as the toolstack) always uses 4KB
page granularity. When the toolstack asks to map a series of guest PFNs
in a batch, it expects the pages to be mapped contiguously in its
virtual memory.

When Linux is using 64KB page granularity, the privcmd driver has to
map multiple Xen PFNs in a single Linux page.

Note that this solution works for any Linux page granularity that is a
multiple of 4KB.
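
A small sketch of the sizing this implies (illustrative only, assuming
64KB Linux pages and 4KB Xen pages; the helper below is not from the
patch):

#define EX_XEN_PFN_PER_PAGE	(65536 / 4096)	/* 16 Xen PFNs per Linux page */

/* Linux pages needed to back 'num' 4KB guest frames requested by the toolstack */
static unsigned long ex_nr_linux_pages(unsigned long num)
{
	return (num + EX_XEN_PFN_PER_PAGE - 1) / EX_XEN_PFN_PER_PAGE;
}

For example, a batch of 100 guest frames needs 7 Linux pages, with the
last page only partially used, which is why the remap code carries
per-4KB-PFN hypercall parameters and error slots.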

Signed-off-by: Julien Grall <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: David Vrabel <[email protected]>
---
Changes in v2:
- Use xen_apply_to_page
---
drivers/xen/privcmd.c | 8 +--
drivers/xen/xlate_mmu.c | 127 +++++++++++++++++++++++++++++++++---------------
2 files changed, 92 insertions(+), 43 deletions(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 5a29616..e8714b4 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -446,7 +446,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
return -EINVAL;
}

- nr_pages = m.num;
+ nr_pages = DIV_ROUND_UP_ULL(m.num, PAGE_SIZE / XEN_PAGE_SIZE);
if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
return -EINVAL;

@@ -494,7 +494,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
goto out_unlock;
}
if (xen_feature(XENFEAT_auto_translated_physmap)) {
- ret = alloc_empty_pages(vma, m.num);
+ ret = alloc_empty_pages(vma, nr_pages);
if (ret < 0)
goto out_unlock;
} else
@@ -518,6 +518,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
state.global_error = 0;
state.version = version;

+ BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
/* mmap_batch_fn guarantees ret == 0 */
BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
&pagelist, mmap_batch_fn, &state));
@@ -582,12 +583,13 @@ static void privcmd_close(struct vm_area_struct *vma)
{
struct page **pages = vma->vm_private_data;
int numpgs = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ int nr_pfn = (vma->vm_end - vma->vm_start) >> XEN_PAGE_SHIFT;
int rc;

if (!xen_feature(XENFEAT_auto_translated_physmap) || !numpgs || !pages)
return;

- rc = xen_unmap_domain_mfn_range(vma, numpgs, pages);
+ rc = xen_unmap_domain_mfn_range(vma, nr_pfn, pages);
if (rc == 0)
free_xenballooned_pages(numpgs, pages);
else
diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
index 58a5389..1fac17c 100644
--- a/drivers/xen/xlate_mmu.c
+++ b/drivers/xen/xlate_mmu.c
@@ -38,31 +38,9 @@
#include <xen/interface/xen.h>
#include <xen/interface/memory.h>

-/* map fgmfn of domid to lpfn in the current domain */
-static int map_foreign_page(unsigned long lpfn, unsigned long fgmfn,
- unsigned int domid)
-{
- int rc;
- struct xen_add_to_physmap_range xatp = {
- .domid = DOMID_SELF,
- .foreign_domid = domid,
- .size = 1,
- .space = XENMAPSPACE_gmfn_foreign,
- };
- xen_ulong_t idx = fgmfn;
- xen_pfn_t gpfn = lpfn;
- int err = 0;
-
- set_xen_guest_handle(xatp.idxs, &idx);
- set_xen_guest_handle(xatp.gpfns, &gpfn);
- set_xen_guest_handle(xatp.errs, &err);
-
- rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp);
- return rc < 0 ? rc : err;
-}
-
struct remap_data {
xen_pfn_t *fgmfn; /* foreign domain's gmfn */
+ xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */
pgprot_t prot;
domid_t domid;
struct vm_area_struct *vma;
@@ -71,24 +49,75 @@ struct remap_data {
struct xen_remap_mfn_info *info;
int *err_ptr;
int mapped;
+
+ /* Hypercall parameters */
+ int h_errs[XEN_PFN_PER_PAGE];
+ xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
+ xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
+
+ int h_iter; /* Iterator */
};

+static int setup_hparams(struct page *page, unsigned long pfn, void *data)
+{
+ struct remap_data *info = data;
+
+ /* We may not have enough domain's gmfn to fill a Linux Page */
+ if (info->fgmfn == info->efgmfn)
+ return 0;
+
+ info->h_idxs[info->h_iter] = *info->fgmfn;
+ info->h_gpfns[info->h_iter] = pfn;
+ info->h_errs[info->h_iter] = 0;
+ info->h_iter++;
+
+ info->fgmfn++;
+
+ return 0;
+}
+
static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
void *data)
{
struct remap_data *info = data;
struct page *page = info->pages[info->index++];
- unsigned long pfn = page_to_pfn(page);
- pte_t pte = pte_mkspecial(pfn_pte(pfn, info->prot));
+ pte_t pte = pte_mkspecial(pfn_pte(page_to_pfn(page), info->prot));
int rc;
+ uint32_t i;
+ struct xen_add_to_physmap_range xatp = {
+ .domid = DOMID_SELF,
+ .foreign_domid = info->domid,
+ .space = XENMAPSPACE_gmfn_foreign,
+ };

- rc = map_foreign_page(pfn, *info->fgmfn, info->domid);
- *info->err_ptr++ = rc;
- if (!rc) {
- set_pte_at(info->vma->vm_mm, addr, ptep, pte);
- info->mapped++;
+ info->h_iter = 0;
+
+ /* setup_hparams guarantees ret == 0 */
+ BUG_ON(xen_apply_to_page(page, setup_hparams, info));
+
+ set_xen_guest_handle(xatp.idxs, info->h_idxs);
+ set_xen_guest_handle(xatp.gpfns, info->h_gpfns);
+ set_xen_guest_handle(xatp.errs, info->h_errs);
+ xatp.size = info->h_iter;
+
+ rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp);
+
+ /* info->err_ptr expect to have one error status per Xen PFN */
+ for (i = 0; i < info->h_iter; i++) {
+ int err = (rc < 0) ? rc : info->h_errs[i];
+
+ *(info->err_ptr++) = err;
+ if (!err)
+ info->mapped++;
}
- info->fgmfn++;
+
+ /*
+ * Note: The hypercall will return 0 in most of the case if even if
+ * all the fgmfn are not mapped. We still have to update the pte
+ * as the userspace may decide to continue.
+ */
+ if (!rc)
+ set_pte_at(info->vma->vm_mm, addr, ptep, pte);

return 0;
}
@@ -102,13 +131,14 @@ int xen_xlate_remap_gfn_array(struct vm_area_struct *vma,
{
int err;
struct remap_data data;
- unsigned long range = nr << PAGE_SHIFT;
+ unsigned long range = round_up(nr, XEN_PFN_PER_PAGE) << XEN_PAGE_SHIFT;

/* Kept here for the purpose of making sure code doesn't break
x86 PVOPS */
BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));

data.fgmfn = mfn;
+ data.efgmfn = mfn + nr;
data.prot = prot;
data.domid = domid;
data.vma = vma;
@@ -123,21 +153,38 @@ int xen_xlate_remap_gfn_array(struct vm_area_struct *vma,
}
EXPORT_SYMBOL_GPL(xen_xlate_remap_gfn_array);

+static int unmap_gfn(struct page *page, unsigned long pfn, void *data)
+{
+ int *nr = data;
+ struct xen_remove_from_physmap xrp;
+
+ /* The Linux Page may not have been fully mapped to Xen */
+ if (!*nr)
+ return 0;
+
+ xrp.domid = DOMID_SELF;
+ xrp.gpfn = pfn;
+ (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
+
+ (*nr)--;
+
+ return 0;
+}
+
int xen_xlate_unmap_gfn_range(struct vm_area_struct *vma,
int nr, struct page **pages)
{
int i;
+ int nr_page = round_up(nr, XEN_PFN_PER_PAGE);

- for (i = 0; i < nr; i++) {
- struct xen_remove_from_physmap xrp;
- unsigned long pfn;
+ for (i = 0; i < nr_page; i++) {
+ /* unmap_gfn guarantees ret == 0 */
+ BUG_ON(xen_apply_to_page(pages[i], unmap_gfn, &nr));
+ }

- pfn = page_to_pfn(pages[i]);
+ /* We should have consume every xen page */
+ BUG_ON(nr != 0);

- xrp.domid = DOMID_SELF;
- xrp.gpfn = pfn;
- (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
- }
return 0;
}
EXPORT_SYMBOL_GPL(xen_xlate_unmap_gfn_range);
--
2.1.4

2015-07-09 20:45:44

by Julien Grall

[permalink] [raw]
Subject: [PATCH v2 20/20] arm/xen: Add support for 64KB page granularity

The hypercall interface always uses 4KB page granularity. This requires
using the Xen page definition macros when we deal with hypercalls.

Note that pfn_to_mfn works with a Xen PFN (i.e. 4KB). We may want to
rename pfn_to_mfn to make this explicit.

We also allocate a 64KB page for the shared page even though only the
first 4KB is used. I don't think this is really important for now, as it
helps to keep the pointer 4KB-aligned (XENMEM_add_to_physmap takes a
Xen PFN).
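
For reference, the Xen page definitions relied on here look roughly like
the following (a sketch of what the first patch of the series
introduces; the exact definitions may differ):

#define XEN_PAGE_SHIFT		12
#define XEN_PAGE_SIZE		(1UL << XEN_PAGE_SHIFT)
#define XEN_PAGE_MASK		(~(XEN_PAGE_SIZE - 1))
#define xen_offset_in_page(p)	((unsigned long)(p) & ~XEN_PAGE_MASK)
#define XEN_PFN_DOWN(x)		((x) >> XEN_PAGE_SHIFT)
#define XEN_PFN_PHYS(x)		((phys_addr_t)(x) << XEN_PAGE_SHIFT)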

Signed-off-by: Julien Grall <[email protected]>
Reviewed-by: Stefano Stabellini <[email protected]>
Cc: Russell King <[email protected]>

---
Changes in v2
- Add Stefano's reviewed-by
---
arch/arm/include/asm/xen/page.h | 12 ++++++------
arch/arm/xen/enlighten.c | 6 +++---
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/arm/include/asm/xen/page.h b/arch/arm/include/asm/xen/page.h
index 1bee8ca..ab6eb9a 100644
--- a/arch/arm/include/asm/xen/page.h
+++ b/arch/arm/include/asm/xen/page.h
@@ -56,19 +56,19 @@ static inline unsigned long mfn_to_pfn(unsigned long mfn)

static inline xmaddr_t phys_to_machine(xpaddr_t phys)
{
- unsigned offset = phys.paddr & ~PAGE_MASK;
- return XMADDR(PFN_PHYS(pfn_to_mfn(PFN_DOWN(phys.paddr))) | offset);
+ unsigned offset = phys.paddr & ~XEN_PAGE_MASK;
+ return XMADDR(XEN_PFN_PHYS(pfn_to_mfn(XEN_PFN_DOWN(phys.paddr))) | offset);
}

static inline xpaddr_t machine_to_phys(xmaddr_t machine)
{
- unsigned offset = machine.maddr & ~PAGE_MASK;
- return XPADDR(PFN_PHYS(mfn_to_pfn(PFN_DOWN(machine.maddr))) | offset);
+ unsigned offset = machine.maddr & ~XEN_PAGE_MASK;
+ return XPADDR(XEN_PFN_PHYS(mfn_to_pfn(XEN_PFN_DOWN(machine.maddr))) | offset);
}
/* VIRT <-> MACHINE conversion */
#define virt_to_machine(v) (phys_to_machine(XPADDR(__pa(v))))
-#define virt_to_mfn(v) (pfn_to_mfn(virt_to_pfn(v)))
-#define mfn_to_virt(m) (__va(mfn_to_pfn(m) << PAGE_SHIFT))
+#define virt_to_mfn(v) (pfn_to_mfn(virt_to_phys(v) >> XEN_PAGE_SHIFT))
+#define mfn_to_virt(m) (__va(mfn_to_pfn(m) << XEN_PAGE_SHIFT))

static inline xmaddr_t arbitrary_virt_to_machine(void *vaddr)
{
diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 6c09cc4..c7d32af 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -96,8 +96,8 @@ static void xen_percpu_init(void)
pr_info("Xen: initializing cpu%d\n", cpu);
vcpup = per_cpu_ptr(xen_vcpu_info, cpu);

- info.mfn = __pa(vcpup) >> PAGE_SHIFT;
- info.offset = offset_in_page(vcpup);
+ info.mfn = __pa(vcpup) >> XEN_PAGE_SHIFT;
+ info.offset = xen_offset_in_page(vcpup);

err = HYPERVISOR_vcpu_op(VCPUOP_register_vcpu_info, cpu, &info);
BUG_ON(err);
@@ -220,7 +220,7 @@ static int __init xen_guest_init(void)
xatp.domid = DOMID_SELF;
xatp.idx = 0;
xatp.space = XENMAPSPACE_shared_info;
- xatp.gpfn = __pa(shared_info_page) >> PAGE_SHIFT;
+ xatp.gpfn = __pa(shared_info_page) >> XEN_PAGE_SHIFT;
if (HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp))
BUG();

--
2.1.4

2015-07-10 19:13:18

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On Thu, Jul 09, 2015 at 09:42:21PM +0100, Julien Grall wrote:
> When Linux is using 64K page granularity, every page will be slipt in
> multiple non-contiguous 4K MFN (page granularity of Xen).

But you don't care about that on the Linux layer I think?

As in, is there an SWIOTLB that does PFN to MFN and vice-versa
translation?

I thought that ARM guests are not exposed to the MFN<->PFN logic,
and I am trying to figure that out so as not to screw up the DMA engine
on a PCIe device slurping up contiguous MFNs which don't map
to contiguous PFNs?

>
> I'm not sure how to handle efficiently the check to know whether we can
> merge 2 biovec with a such case. So for now, always says that biovec are
> not mergeable.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Remove the workaround and check if the Linux page granularity
> is the same as Xen or not
> ---
> drivers/xen/biomerge.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
> index 0edb91c..571567c 100644
> --- a/drivers/xen/biomerge.c
> +++ b/drivers/xen/biomerge.c
> @@ -6,10 +6,17 @@
> bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
> const struct bio_vec *vec2)
> {
> +#if XEN_PAGE_SIZE == PAGE_SIZE
> unsigned long mfn1 = pfn_to_mfn(page_to_pfn(vec1->bv_page));
> unsigned long mfn2 = pfn_to_mfn(page_to_pfn(vec2->bv_page));
>
> return __BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&
> ((mfn1 == mfn2) || ((mfn1+1) == mfn2));
> +#else
> + /* XXX: bio_vec are not mergeable when using different page size in
> + * Xen and Linux
> + */
> + return 0;
> +#endif
> }
> EXPORT_SYMBOL(xen_biovec_phys_mergeable);
> --
> 2.1.4
>

2015-07-13 20:14:21

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

On 07/09/2015 04:42 PM, Julien Grall wrote:
> -
> struct remap_data {
> xen_pfn_t *fgmfn; /* foreign domain's gmfn */
> + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */

It might be better to keep the size of the fgmfn array instead.


>
> +static int unmap_gfn(struct page *page, unsigned long pfn, void *data)
> +{
> + int *nr = data;
> + struct xen_remove_from_physmap xrp;
> +
> + /* The Linux Page may not have been fully mapped to Xen */
> + if (!*nr)
> + return 0;
> +
> + xrp.domid = DOMID_SELF;
> + xrp.gpfn = pfn;
> + (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
> +
> + (*nr)--;
> +
> + return 0;
> +}
> +
> int xen_xlate_unmap_gfn_range(struct vm_area_struct *vma,
> int nr, struct page **pages)
> {
> int i;
> + int nr_page = round_up(nr, XEN_PFN_PER_PAGE);
>
> - for (i = 0; i < nr; i++) {
> - struct xen_remove_from_physmap xrp;
> - unsigned long pfn;
> + for (i = 0; i < nr_page; i++) {
> + /* unmap_gfn guarantees ret == 0 */
> + BUG_ON(xen_apply_to_page(pages[i], unmap_gfn, &nr));


TBH, I am not sure how useful the xen_apply_to_page() routine is. In this
patch especially, but also in others.

-boris

2015-07-13 22:06:13

by Julien Grall

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

Hi Boris,

On 13/07/2015 22:13, Boris Ostrovsky wrote:
> On 07/09/2015 04:42 PM, Julien Grall wrote:
>> -
>> struct remap_data {
>> xen_pfn_t *fgmfn; /* foreign domain's gmfn */
>> + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */
>
> It might be better to keep size of fgmfn array instead.

It would mean having another variable to check that we are at the end of
the array.

What about a variable which will be decremented?

>>
>> +static int unmap_gfn(struct page *page, unsigned long pfn, void *data)
>> +{
>> + int *nr = data;
>> + struct xen_remove_from_physmap xrp;
>> +
>> + /* The Linux Page may not have been fully mapped to Xen */
>> + if (!*nr)
>> + return 0;
>> +
>> + xrp.domid = DOMID_SELF;
>> + xrp.gpfn = pfn;
>> + (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
>> +
>> + (*nr)--;
>> +
>> + return 0;
>> +}
>> +
>> int xen_xlate_unmap_gfn_range(struct vm_area_struct *vma,
>> int nr, struct page **pages)
>> {
>> int i;
>> + int nr_page = round_up(nr, XEN_PFN_PER_PAGE);
>> - for (i = 0; i < nr; i++) {
>> - struct xen_remove_from_physmap xrp;
>> - unsigned long pfn;
>> + for (i = 0; i < nr_page; i++) {
>> + /* unmap_gfn guarantees ret == 0 */
>> + BUG_ON(xen_apply_to_page(pages[i], unmap_gfn, &nr));
>
>
> TBH, I am not sure how useful xen_apply_to_page() routine is. In this
> patch especially, but also in others.

It avoids an open-coded loop in each place where it's needed (here,
balloon...), which would mean another indentation layer. You can take a
look; it's quite ugly.
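
For context, the helper is roughly of this shape (a sketch, not the
exact code from the series; xen_page_to_pfn() is assumed here to return
the first 4KB Xen PFN backing the Linux page):

static inline int xen_apply_to_page(struct page *page,
				    int (*fn)(struct page *page,
					      unsigned long pfn, void *data),
				    void *data)
{
	unsigned long xen_pfn = xen_page_to_pfn(page);
	unsigned int i;
	int ret;

	for (i = 0; i < XEN_PFN_PER_PAGE; i++, xen_pfn++) {
		/* one callback per 4KB Xen PFN backing the Linux page */
		ret = fn(page, xen_pfn, data);
		if (ret)
			return ret;
	}

	return 0;
}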

Furthermore, the helper will avoid possible mistakes made by developers
who are working on PV drivers on x86. If you see code where the MFN
translation is done directly via virt_to_mfn or page_to_mfn... it likely
means that the code will be broken when 64KB page granularity is in use.
Though, there will still be some places where it's valid to use
virt_to_mfn and page_to_mfn.

Regards,

--
Julien Grall

2015-07-14 15:29:22

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

On 07/13/2015 06:05 PM, Julien Grall wrote:
> Hi Boris,
>
> On 13/07/2015 22:13, Boris Ostrovsky wrote:
>> On 07/09/2015 04:42 PM, Julien Grall wrote:
>>> -
>>> struct remap_data {
>>> xen_pfn_t *fgmfn; /* foreign domain's gmfn */
>>> + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */
>>
>> It might be better to keep size of fgmfn array instead.
>
> It would means to have an other variable to check that we are at the
> end the array.


I thought that's what h_iter is. Is it not?


>
> What about a variable which will be decremented?
>
>>>
>>> +static int unmap_gfn(struct page *page, unsigned long pfn, void *data)
>>> +{
>>> + int *nr = data;
>>> + struct xen_remove_from_physmap xrp;
>>> +
>>> + /* The Linux Page may not have been fully mapped to Xen */
>>> + if (!*nr)
>>> + return 0;
>>> +
>>> + xrp.domid = DOMID_SELF;
>>> + xrp.gpfn = pfn;
>>> + (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
>>> +
>>> + (*nr)--;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> int xen_xlate_unmap_gfn_range(struct vm_area_struct *vma,
>>> int nr, struct page **pages)
>>> {
>>> int i;
>>> + int nr_page = round_up(nr, XEN_PFN_PER_PAGE);
>>> - for (i = 0; i < nr; i++) {
>>> - struct xen_remove_from_physmap xrp;
>>> - unsigned long pfn;
>>> + for (i = 0; i < nr_page; i++) {
>>> + /* unmap_gfn guarantees ret == 0 */
>>> + BUG_ON(xen_apply_to_page(pages[i], unmap_gfn, &nr));
>>
>>
>> TBH, I am not sure how useful xen_apply_to_page() routine is. In this
>> patch especially, but also in others.
>
> It avoids an open-coded loop in each place where it's needed (here, the
> balloon driver...), which would mean another indentation layer. You can
> take a look; it's quite ugly.

I didn't notice that it was an inline, in which case it is indeed cleaner.

-boris

>
> Furthermore, the helper will avoid possible mistakes by developers who
> are working on PV drivers on x86. If you see code where the MFN
> translation is done directly via virt_to_mfn or page_to_mfn... it
> likely means that the code will be broken when 64KB page granularity
> is in use.
> Though, there will still be some places where it's valid to use
> virt_to_mfn and page_to_mfn.
>
> Regards,
>

2015-07-14 15:37:14

by Julien Grall

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

Hi Boris,

On 14/07/2015 17:28, Boris Ostrovsky wrote:
> On 07/13/2015 06:05 PM, Julien Grall wrote:
>> On 13/07/2015 22:13, Boris Ostrovsky wrote:
>>> On 07/09/2015 04:42 PM, Julien Grall wrote:
>>>> -
>>>> struct remap_data {
>>>> xen_pfn_t *fgmfn; /* foreign domain's gmfn */
>>>> + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */
>>>
>>> It might be better to keep size of fgmfn array instead.
>>
>> It would mean having another variable to check that we are at the end
>> of the array.
>
>
> I thought that's what h_iter is. Is it not?

h_iter counts the number of Xen PFNs within a single Linux page. This is
because the Linux privcmd interface works with 4KB pages, and there may
not be enough of them left to fill a 64KB page.

So we need another counter for the total number of foreign domain gmfns.

Regards,

--
Julien Grall
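
To make the two counters concrete, here is a small stand-alone sketch
(plain C; the 64KB/4KB sizes are hard-coded assumptions) of how a batch of
foreign gfns gets chopped into per-Linux-page hypercall batches: h_iter is
capped at XEN_PFN_PER_PAGE and resets for every Linux page, while a
separate count tracks how many gfns remain overall.

#include <stdio.h>

#define PAGE_SIZE        65536UL               /* assumed: 64KB Linux pages */
#define XEN_PAGE_SIZE    4096UL                /* Xen is always 4KB */
#define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)

int main(void)
{
        unsigned long nr_gfn = 20;        /* foreign gfns requested by the toolstack */
        unsigned long remaining = nr_gfn; /* the "end of array" / decrementing counter */
        unsigned long page = 0;

        while (remaining) {
                /* h_iter: how many 4KB slots of this Linux page are filled */
                unsigned long h_iter = remaining < XEN_PFN_PER_PAGE ?
                                       remaining : XEN_PFN_PER_PAGE;

                printf("Linux page %lu: hypercall batch of %lu gfns\n",
                       page++, h_iter);
                remaining -= h_iter;
        }
        return 0;
}

With 20 gfns this prints a batch of 16 for the first Linux page and 4 for
the second, which is the partially-filled case the end-of-array check
guards against.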

2015-07-15 08:56:56

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

Hi Konrad,

On 10/07/2015 21:12, Konrad Rzeszutek Wilk wrote:
> On Thu, Jul 09, 2015 at 09:42:21PM +0100, Julien Grall wrote:
>> When Linux is using 64K page granularity, every page will be split into
>> multiple non-contiguous 4K MFNs (page granularity of Xen).
>
> But you don't care about that on the Linux layer I think?

In the general use case (i.e. arch-agnostic) we care about it. We don't
want to merge 2 biovecs if they are not living on contiguous MFNs.

> As in, is there an SWIOTLB that does PFN to MFN and vice-versa
> translation?
>
> I thought that ARM guests are not exposed to the MFN<->PFN logic
> and trying to figure that out to not screw up the DMA engine
> on a PCIe device slurping up contiguous MFNs which don't map
> to contiguous PFNs?


I will leave these 2 questions to Stefano. He knows the ARM swiotlb
better than I do.

So far, I skipped swiotlb implementation for 64KB page granularity as
I'm not sure what to do when a page is split across multiple MFNs.

Although I don't think this can happen with this specific series as:
- The memory is a direct mapping, so any Linux page is using contiguous
MFNs.
- Foreign mapping is using the first 4KB of the Linux page. This is for an
easier implementation.

For the latter, I plan to work on using the Linux page to map multiple
foreign gfns. I have to talk with Stefano about how to handle it.

Regards,

--
Julien Grall

2015-07-16 14:20:57

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 01/20] xen: Add Xen specific page definition

On Thu, 9 Jul 2015, Julien Grall wrote:
> The Xen hypercall interface is always using 4K page granularity on the ARM
> and x86 architectures.
>
> With the incoming support of 64K page granularity for ARM64 guests, it
> won't be possible to re-use the Linux page definition in Xen drivers.
>
> Introduce Xen page definition helpers based on the Linux page
> definition. They have exactly the same names but with a XEN_/xen_
> prefix.
>
> Also modify page_to_pfn to use new Xen page definition.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> I'm wondering if we should drop page_to_pfn, as the macro will likely be
> misused when Linux is using 64KB page granularity.
>
> Changes in v2:
> - Add XEN_PFN_UP
> - Add a comment describing the behavior of page_to_pfn
> ---
> include/xen/page.h | 21 ++++++++++++++++++++-
> 1 file changed, 20 insertions(+), 1 deletion(-)
>
> diff --git a/include/xen/page.h b/include/xen/page.h
> index c5ed20b..8ebd37b 100644
> --- a/include/xen/page.h
> +++ b/include/xen/page.h
> @@ -1,11 +1,30 @@
> #ifndef _XEN_PAGE_H
> #define _XEN_PAGE_H
>
> +#include <asm/page.h>
> +
> +/* The hypercall interface supports only 4KB page */
> +#define XEN_PAGE_SHIFT 12
> +#define XEN_PAGE_SIZE (_AC(1,UL) << XEN_PAGE_SHIFT)
> +#define XEN_PAGE_MASK (~(XEN_PAGE_SIZE-1))
> +#define xen_offset_in_page(p) ((unsigned long)(p) & ~XEN_PAGE_MASK)
> +#define xen_pfn_to_page(pfn) \

I think it would be clearer if you called the parameter "xen_pfn"
instead of "pfn".


> + ((pfn_to_page(((unsigned long)(pfn) << XEN_PAGE_SHIFT) >> PAGE_SHIFT)))
> +#define xen_page_to_pfn(page) \
> + (((page_to_pfn(page)) << PAGE_SHIFT) >> XEN_PAGE_SHIFT)


It would be nice to have a comment:

/* assume PAGE_SIZE is a multiple of XEN_PAGE_SIZE */


> +#define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
> +
> +#define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
> +#define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
> +#define XEN_PFN_PHYS(x) ((phys_addr_t)(x) << XEN_PAGE_SHIFT)
> +
> #include <asm/xen/page.h>
>
> +/* Return the MFN associated to the first 4KB of the page */
> static inline unsigned long page_to_mfn(struct page *page)
> {
> - return pfn_to_mfn(page_to_pfn(page));
> + return pfn_to_mfn(xen_page_to_pfn(page));
> }
>
> struct xen_memory_region {


Aside from the two minor suggestions:

Reviewed-by: Stefano Stabellini <[email protected]>
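
For readers juggling the two granularities, a small stand-alone sketch of
the arithmetic behind xen_page_to_pfn()/xen_pfn_to_page() (PAGE_SHIFT is
hard-coded to 16 here as an assumption for a 64KB kernel; the real macros
operate on struct page rather than raw frame numbers):

#include <stdio.h>

#define PAGE_SHIFT     16   /* assumed: 64KB Linux pages */
#define XEN_PAGE_SHIFT 12   /* Xen is always 4KB */

int main(void)
{
        unsigned long linux_pfn = 5;

        /* xen_page_to_pfn(): Xen PFN of the first 4KB of the Linux page */
        unsigned long xen_pfn = (linux_pfn << PAGE_SHIFT) >> XEN_PAGE_SHIFT;

        /* xen_pfn_to_page(): back to the Linux frame containing that 4KB */
        unsigned long back = (xen_pfn << XEN_PAGE_SHIFT) >> PAGE_SHIFT;

        printf("linux pfn %lu -> xen pfn %lu -> linux pfn %lu\n",
               linux_pfn, xen_pfn, back);   /* prints 5 -> 80 -> 5 */
        return 0;
}

Each Linux frame therefore spans 16 consecutive Xen PFNs, and the
redefined page_to_mfn() above only refers to the first of them.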

2015-07-16 14:24:21

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On Thu, 9 Jul 2015, Julien Grall wrote:
> The Xen interface is always using 4KB pages. This means that a Linux page
> may be split across multiple Xen pages when the page granularity is not
> the same.
>
> This helper will break down a Linux page into 4KB chunks and call the
> given function on each of them.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Patch added
> ---
> include/xen/page.h | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/include/xen/page.h b/include/xen/page.h
> index 8ebd37b..b1f7722 100644
> --- a/include/xen/page.h
> +++ b/include/xen/page.h
> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>
> extern unsigned long xen_released_pages;
>
> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
> + void *data)
> +{
> + unsigned long pfn = xen_page_to_pfn(page);
> + int i;
> + int ret;

please initialize ret to 0


> + for (i = 0; i < XEN_PFN_PER_PAGE; i++, pfn++) {
> + ret = fn(page, pfn, data);
> + if (ret)
> + return ret;
> + }
> +
> + return ret;
> +}
> +
> +
> #endif /* _XEN_PAGE_H */


Reviewed-by: Stefano Stabellini <[email protected]>
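
As a usage sketch of the helper under review (hypothetical caller; the
context structure and function names are invented for illustration), a
callback that collects the Xen PFNs backing one Linux page into an array:

/* Hypothetical context, not part of the series */
struct collect_ctx {
        unsigned long xen_pfns[XEN_PFN_PER_PAGE];
        unsigned int count;
};

static int collect_xen_pfn(struct page *page, unsigned long pfn, void *data)
{
        struct collect_ctx *ctx = data;

        ctx->xen_pfns[ctx->count++] = pfn;
        return 0;       /* returning non-zero would abort the walk early */
}

/* Caller:
 *      struct collect_ctx ctx = { .count = 0 };
 *      int ret = xen_apply_to_page(page, collect_xen_pfn, &ctx);
 *      // on success ctx.count == XEN_PFN_PER_PAGE (16 with 64KB pages)
 */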

2015-07-16 14:53:12

by Julien Grall

[permalink] [raw]
Subject: Re: [PATCH v2 01/20] xen: Add Xen specific page definition

Hi Stefano,

On 16/07/2015 16:19, Stefano Stabellini wrote:
>> diff --git a/include/xen/page.h b/include/xen/page.h
>> index c5ed20b..8ebd37b 100644
>> --- a/include/xen/page.h
>> +++ b/include/xen/page.h
>> @@ -1,11 +1,30 @@
>> #ifndef _XEN_PAGE_H
>> #define _XEN_PAGE_H
>>
>> +#include <asm/page.h>
>> +
>> +/* The hypercall interface supports only 4KB page */
>> +#define XEN_PAGE_SHIFT 12
>> +#define XEN_PAGE_SIZE (_AC(1,UL) << XEN_PAGE_SHIFT)
>> +#define XEN_PAGE_MASK (~(XEN_PAGE_SIZE-1))
>> +#define xen_offset_in_page(p) ((unsigned long)(p) & ~XEN_PAGE_MASK)
>> +#define xen_pfn_to_page(pfn) \
>
> I think it would be clearer if you called the parameter "xen_pfn"
> instead of "pfn".

Good idea, I will do it in the next version.

>
>> + ((pfn_to_page(((unsigned long)(pfn) << XEN_PAGE_SHIFT) >> PAGE_SHIFT)))
>> +#define xen_page_to_pfn(page) \
>> + (((page_to_pfn(page)) << PAGE_SHIFT) >> XEN_PAGE_SHIFT)
>
>
> It would be nice to have a comment:
>
> /* assume PAGE_SIZE is a multiple of XEN_PAGE_SIZE */

Ok. FWIW, there is already a BUILD_BUG_ON the privcmd driver to check
this assumption (see patch #19).

I could move the BUILD_BUG_ON in page.h. Maybe inside xen_page_to_pfn?

>
>
>> +#define XEN_PFN_PER_PAGE (PAGE_SIZE / XEN_PAGE_SIZE)
>> +
>> +#define XEN_PFN_DOWN(x) ((x) >> XEN_PAGE_SHIFT)
>> +#define XEN_PFN_UP(x) (((x) + XEN_PAGE_SIZE-1) >> XEN_PAGE_SHIFT)
>> +#define XEN_PFN_PHYS(x) ((phys_addr_t)(x) << XEN_PAGE_SHIFT)
>> +
>> #include <asm/xen/page.h>
>>
>> +/* Return the MFN associated to the first 4KB of the page */
>> static inline unsigned long page_to_mfn(struct page *page)
>> {
>> - return pfn_to_mfn(page_to_pfn(page));
>> + return pfn_to_mfn(xen_page_to_pfn(page));
>> }
>>
>> struct xen_memory_region {
>
>
> Aside from the two minor suggestions:
>
> Reviewed-by: Stefano Stabellini <[email protected]>

Thank you,

--
Julien Grall
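
A sketch of what moving the check could look like, assuming
xen_page_to_pfn() becomes a static inline rather than a macro (this is
only one possible shape of the idea floated above, not what the posted
patch does):

/* include/xen/page.h -- sketch only */
static inline unsigned long xen_page_to_pfn(struct page *page)
{
        /* assume PAGE_SIZE is a multiple of XEN_PAGE_SIZE */
        BUILD_BUG_ON(PAGE_SIZE % XEN_PAGE_SIZE);

        return (page_to_pfn(page) << PAGE_SHIFT) >> XEN_PAGE_SHIFT;
}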

2015-07-16 14:54:33

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

Hi Stefano,

On 16/07/2015 16:23, Stefano Stabellini wrote:
>> diff --git a/include/xen/page.h b/include/xen/page.h
>> index 8ebd37b..b1f7722 100644
>> --- a/include/xen/page.h
>> +++ b/include/xen/page.h
>> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>>
>> extern unsigned long xen_released_pages;
>>
>> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
>> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
>> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
>> + void *data)
>> +{
>> + unsigned long pfn = xen_page_to_pfn(page);
>> + int i;
>> + int ret;
>
> please initialize ret to 0

Hmmm... right. I'm not sure why the compiler didn't catch it.

>
>> + for (i = 0; i < XEN_PFN_PER_PAGE; i++, pfn++) {
>> + ret = fn(page, pfn, data);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +
>> #endif /* _XEN_PAGE_H */
>
>
> Reviewed-by: Stefano Stabellini <[email protected]>

Thank you,

--
Julien Grall

2015-07-16 15:03:16

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 03/20] xen/grant: Introduce helpers to split a page into grant

On Thu, 9 Jul 2015, Julien Grall wrote:
> Currently, a grant is always based on the Xen page granularity (i.e
> 4KB). When Linux is using a different page granularity, a single page
> will be split between multiple grants.
>
> The new helpers will be in charge to split the Linux page into grant and
^
grants

> call a function given by the caller on each grant.
>
> In order to help some PV drivers, the callback is allowed to use less
> data and must update the resulting length. This is useful for netback.
>
> Also provide a helper to count the number of grants within a given
> contiguous region.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Patch added
> ---
> drivers/xen/grant-table.c | 26 ++++++++++++++++++++++++++
> include/xen/grant_table.h | 41 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 67 insertions(+)
>
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index 62f591f..3679293 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c
> @@ -296,6 +296,32 @@ int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly)
> }
> EXPORT_SYMBOL_GPL(gnttab_end_foreign_access_ref);
>
> +void gnttab_foreach_grant(struct page *page, unsigned int offset,
> + unsigned int len, xen_grant_fn_t fn,
> + void *data)
> +{
> + unsigned int goffset;
> + unsigned int glen;
> + unsigned long pfn;

I would s/pfn/xen_pfn/ inside this function for clarity


> + len = min_t(unsigned int, PAGE_SIZE - offset, len);
> + goffset = offset & ~XEN_PAGE_MASK;

I don't think we want to support cases where (offset & ~XEN_PAGE_MASK)
!= 0; we should just return an error.


> + pfn = xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT);
> +
> + while (len) {
> + glen = min_t(unsigned int, XEN_PAGE_SIZE - goffset, len);

Similarly I don't think we want to support glen != XEN_PAGE_SIZE here


> + fn(pfn_to_mfn(pfn), goffset, &glen, data);

Allowing the callee to change glen makes the interface more complex and
certainly doesn't match the gnttab_foreach_grant function name anymore.

If netback needs it, could it just do the work inside its own function?
I would rather keep gnttab_foreach_grant simple and move the complexity
there.


> + goffset += glen;
> + if (goffset == XEN_PAGE_SIZE) {
> + goffset = 0;
> + pfn++;
> + }

With the assumptions above you can simplify this


> + len -= glen;
> + }
> +}
> +
> struct deferred_entry {
> struct list_head list;
> grant_ref_t ref;
> diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
> index 4478f4b..6f77378 100644
> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -45,8 +45,10 @@
> #include <asm/xen/hypervisor.h>
>
> #include <xen/features.h>
> +#include <xen/page.h>
> #include <linux/mm_types.h>
> #include <linux/page-flags.h>
> +#include <linux/kernel.h>
>
> #define GNTTAB_RESERVED_XENSTORE 1
>
> @@ -224,4 +226,43 @@ static inline struct xen_page_foreign *xen_page_foreign(struct page *page)
> #endif
> }
>
> +/* Split Linux page in chunk of the size of the grant and call fn
> + *
> + * Parameters of fn:
> + * mfn: machine frame number based on grant granularity
> + * offset: offset in the grant
> + * len: length of the data in the grant. If fn decides to use less data,
> + * it must update len.
> + * data: internal information
> + */
> +typedef void (*xen_grant_fn_t)(unsigned long mfn, unsigned int offset,
> + unsigned int *len, void *data);
> +
> +void gnttab_foreach_grant(struct page *page, unsigned int offset,
> + unsigned int len, xen_grant_fn_t fn,
> + void *data);
> +
> +/* Helper to get to call fn only on the first "grant chunk" */
> +static inline void gnttab_one_grant(struct page *page, unsigned int offset,
> + unsigned len, xen_grant_fn_t fn,
> + void *data)
> +{
> + /* The first request is limited to the size of one grant */
> + len = min_t(unsigned int, XEN_PAGE_SIZE - (offset & ~XEN_PAGE_MASK),
> + len);

I would just BUG_ON(offset & ~XEN_PAGE_MASK) and simply len = XEN_PAGE_SIZE;


> + gnttab_foreach_grant(page, offset, len, fn, data);
> +}
> +
> +/* Get the number of grant in a specified region
> + *
> + * offset: Offset in the first page

I would generalize this function and support offset > PAGE_SIZE. At that
point you could rename offset to "start".


> + * len: total length of data (can cross multiple pages)
> + */
> +static inline unsigned int gnttab_count_grant(unsigned int offset,
> + unsigned int len)
> +{
> + return (XEN_PFN_UP((offset & ~XEN_PAGE_MASK) + len));
> +}
> +
> #endif /* __ASM_GNTTAB_H__ */
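
To illustrate the gnttab_count_grant() arithmetic being discussed, a
stand-alone sketch with the sizes hard-coded and an arbitrarily chosen
buffer:

#include <stdio.h>

#define XEN_PAGE_SHIFT 12
#define XEN_PAGE_SIZE  (1UL << XEN_PAGE_SHIFT)
#define XEN_PAGE_MASK  (~(XEN_PAGE_SIZE - 1))
#define XEN_PFN_UP(x)  (((x) + XEN_PAGE_SIZE - 1) >> XEN_PAGE_SHIFT)

int main(void)
{
        /* Buffer starting 6144 bytes into the Linux page (2KB into its
         * second 4KB chunk) and spanning 10000 bytes.
         */
        unsigned long offset = 6144;
        unsigned long len = 10000;

        /* Only the offset within the first grant matters for the count */
        unsigned long grants = XEN_PFN_UP((offset & ~XEN_PAGE_MASK) + len);

        printf("%lu grants needed\n", grants);  /* prints 3 */
        return 0;
}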

2015-07-16 15:08:41

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] xen/grant: Add helper gnttab_page_grant_foreign_access_ref

On Thu, 9 Jul 2015, Julien Grall wrote:
> Many PV drivers contain the idiom:
>
> pfn = page_to_mfn(...) /* Or similar */
> gnttab_grant_foreign_access_ref
>
> Replace it by a new helper. Note that when Linux is using a different
> page granularity than Xen, the helper only gives access to the first 4KB
> grant.
>
> This is useful where drivers are allocating a full Linux page for each
> grant.
>
> Also include xen/interface/grant_table.h rather than xen/grant_table.h in
> asm/page.h for x86 to fix a compilation issue [1]. Only the former is
> useful in order to get the structure definition.
>
> [1] Interdependency between asm/page.h and xen/grant_table.h which results
> in page_mfn not being defined when necessary.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Patch added
> ---
> arch/x86/include/asm/xen/page.h | 2 +-
> include/xen/grant_table.h | 9 +++++++++
> 2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
> index c44a5d5..fb2e037 100644
> --- a/arch/x86/include/asm/xen/page.h
> +++ b/arch/x86/include/asm/xen/page.h
> @@ -12,7 +12,7 @@
> #include <asm/pgtable.h>
>
> #include <xen/interface/xen.h>
> -#include <xen/grant_table.h>
> +#include <xen/interface/grant_table.h>
> #include <xen/features.h>
>
> /* Xen machine address */
> diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
> index 6f77378..6a1ef86 100644
> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -131,6 +131,15 @@ void gnttab_cancel_free_callback(struct gnttab_free_callback *callback);
> void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
> unsigned long frame, int readonly);
>
> +/* Give access to the first 4K of the page */
> +static inline void gnttab_page_grant_foreign_access_ref(
> + grant_ref_t ref, domid_t domid,
> + struct page *page, int readonly)
> +{

I like this. I think it might make sense to call it
gnttab_page_grant_foreign_access_ref_one to make clear that it is only
granting the first 4K.

In the future we could introduce a new function, called
gnttab_page_grant_foreign_access_ref, that grants all 4K in the page.

In any case

Reviewed-by: Stefano Stabellini <[email protected]>

> + gnttab_grant_foreign_access_ref(ref, domid, page_to_mfn(page),
> + readonly);
> +}
> +
> void gnttab_grant_foreign_transfer_ref(grant_ref_t, domid_t domid,
> unsigned long pfn);
>
> --
> 2.1.4
>
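
As a before/after sketch of the idiom the new helper replaces (a
hypothetical frontend function; 'ref' and 'otherend_id' are assumed to
come from the usual gref head and xenbus plumbing):

static void grant_one_page(grant_ref_t ref, domid_t otherend_id,
                           struct page *page, int readonly)
{
        /* Old idiom: translate to an MFN by hand, then grant it.
         * With 64KB pages this is exactly the spot where a stray
         * page_to_pfn() based translation would go wrong.
         *
         *      gnttab_grant_foreign_access_ref(ref, otherend_id,
         *                                      page_to_mfn(page), readonly);
         */

        /* New helper: same effect, translation hidden; note it only
         * covers the first 4KB of the Linux page, as discussed above.
         */
        gnttab_page_grant_foreign_access_ref(ref, otherend_id, page, readonly);
}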

2015-07-16 15:14:19

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] block/xen-blkfront: Store a page rather a pfn in the grant structure

On Thu, 9 Jul 2015, Julien Grall wrote:
> All the usages of the field pfn are done using the same idiom:
>
> pfn_to_page(grant->pfn)
>
> This will always return the same page. Store the page directly in the
> grant to clean up the code.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Roger Pau Monné <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Patch added
> ---
> drivers/block/xen-blkfront.c | 37 ++++++++++++++++++-------------------
> 1 file changed, 18 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 7107d58..7b81d23 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -67,7 +67,7 @@ enum blkif_state {
>
> struct grant {
> grant_ref_t gref;
> - unsigned long pfn;
> + struct page *page;
> struct list_head node;
> };
>
> @@ -219,7 +219,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
> kfree(gnt_list_entry);
> goto out_of_memory;
> }
> - gnt_list_entry->pfn = page_to_pfn(granted_page);
> + gnt_list_entry->page = granted_page;
> }
>
> gnt_list_entry->gref = GRANT_INVALID_REF;
> @@ -234,7 +234,7 @@ out_of_memory:
> &info->grants, node) {
> list_del(&gnt_list_entry->node);
> if (info->feature_persistent)
> - __free_page(pfn_to_page(gnt_list_entry->pfn));
> + __free_page(gnt_list_entry->page);
> kfree(gnt_list_entry);
> i--;
> }
> @@ -243,7 +243,7 @@ out_of_memory:
> }
>
> static struct grant *get_grant(grant_ref_t *gref_head,
> - unsigned long pfn,
> + struct page *page,
> struct blkfront_info *info)

indentation

Aside from this:

Reviewed-by: Stefano Stabellini <[email protected]>

> {
> struct grant *gnt_list_entry;
> @@ -263,10 +263,10 @@ static struct grant *get_grant(grant_ref_t *gref_head,
> gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
> BUG_ON(gnt_list_entry->gref == -ENOSPC);
> if (!info->feature_persistent) {
> - BUG_ON(!pfn);
> - gnt_list_entry->pfn = pfn;
> + BUG_ON(!page);
> + gnt_list_entry->page = page;
> }
> - buffer_mfn = pfn_to_mfn(gnt_list_entry->pfn);
> + buffer_mfn = page_to_mfn(gnt_list_entry->page);
> gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
> info->xbdev->otherend_id,
> buffer_mfn, 0);
> @@ -522,7 +522,7 @@ static int blkif_queue_rw_req(struct request *req)
>
> if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
> (i % SEGS_PER_INDIRECT_FRAME == 0)) {
> - unsigned long uninitialized_var(pfn);
> + struct page *uninitialized_var(page);
>
> if (segments)
> kunmap_atomic(segments);
> @@ -536,15 +536,15 @@ static int blkif_queue_rw_req(struct request *req)
> indirect_page = list_first_entry(&info->indirect_pages,
> struct page, lru);
> list_del(&indirect_page->lru);
> - pfn = page_to_pfn(indirect_page);
> + page = indirect_page;
> }
> - gnt_list_entry = get_grant(&gref_head, pfn, info);
> + gnt_list_entry = get_grant(&gref_head, page, info);
> info->shadow[id].indirect_grants[n] = gnt_list_entry;
> - segments = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
> + segments = kmap_atomic(gnt_list_entry->page);
> ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
> }
>
> - gnt_list_entry = get_grant(&gref_head, page_to_pfn(sg_page(sg)), info);
> + gnt_list_entry = get_grant(&gref_head, sg_page(sg), info);
> ref = gnt_list_entry->gref;
>
> info->shadow[id].grants_used[i] = gnt_list_entry;
> @@ -555,7 +555,7 @@ static int blkif_queue_rw_req(struct request *req)
>
> BUG_ON(sg->offset + sg->length > PAGE_SIZE);
>
> - shared_data = kmap_atomic(pfn_to_page(gnt_list_entry->pfn));
> + shared_data = kmap_atomic(gnt_list_entry->page);
> bvec_data = kmap_atomic(sg_page(sg));
>
> /*
> @@ -1002,7 +1002,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> info->persistent_gnts_c--;
> }
> if (info->feature_persistent)
> - __free_page(pfn_to_page(persistent_gnt->pfn));
> + __free_page(persistent_gnt->page);
> kfree(persistent_gnt);
> }
> }
> @@ -1037,7 +1037,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> persistent_gnt = info->shadow[i].grants_used[j];
> gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> if (info->feature_persistent)
> - __free_page(pfn_to_page(persistent_gnt->pfn));
> + __free_page(persistent_gnt->page);
> kfree(persistent_gnt);
> }
>
> @@ -1051,7 +1051,7 @@ static void blkif_free(struct blkfront_info *info, int suspend)
> for (j = 0; j < INDIRECT_GREFS(segs); j++) {
> persistent_gnt = info->shadow[i].indirect_grants[j];
> gnttab_end_foreign_access(persistent_gnt->gref, 0, 0UL);
> - __free_page(pfn_to_page(persistent_gnt->pfn));
> + __free_page(persistent_gnt->page);
> kfree(persistent_gnt);
> }
>
> @@ -1102,8 +1102,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> if (bret->operation == BLKIF_OP_READ && info->feature_persistent) {
> for_each_sg(s->sg, sg, nseg, i) {
> BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> - shared_data = kmap_atomic(
> - pfn_to_page(s->grants_used[i]->pfn));
> + shared_data = kmap_atomic(s->grants_used[i]->page);
> bvec_data = kmap_atomic(sg_page(sg));
> memcpy(bvec_data + sg->offset,
> shared_data + sg->offset,
> @@ -1154,7 +1153,7 @@ static void blkif_completion(struct blk_shadow *s, struct blkfront_info *info,
> * Add the used indirect page back to the list of
> * available pages for indirect grefs.
> */
> - indirect_page = pfn_to_page(s->indirect_grants[i]->pfn);
> + indirect_page = s->indirect_grants[i]->page;
> list_add(&indirect_page->lru, &info->indirect_pages);
> s->indirect_grants[i]->gref = GRANT_INVALID_REF;
> list_add_tail(&s->indirect_grants[i]->node, &info->grants);
> --
> 2.1.4
>

2015-07-16 15:34:41

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On Fri, 10 Jul 2015, Konrad Rzeszutek Wilk wrote:
> On Thu, Jul 09, 2015 at 09:42:21PM +0100, Julien Grall wrote:
> > When Linux is using 64K page granularity, every page will be split into
> > multiple non-contiguous 4K MFNs (page granularity of Xen).
>
> But you don't care about that on the Linux layer I think?
>
> As in, is there an SWIOTLB that does PFN to MFN and vice-versa
> translation?
>
> I thought that ARM guests are not exposed to the MFN<->PFN logic
> and trying to figure that out to not screw up the DMA engine
> on a PCIe device slurping up contiguous MFNs which don't map
> to contiguous PFNs?

Dom0 is mapped 1:1, so pfn == mfn normally, however grant maps
unavoidably screw up the 1:1, so the swiotlb jumps in to save the day
when a foreign granted page is involved in a dma operation.

Regarding xen_biovec_phys_mergeable, we could check that all the pfn ==
mfn and return true in that case.


> > I'm not sure how to efficiently handle the check to know whether we can
> > merge 2 biovecs in such a case. So for now, always say that biovecs are
> > not mergeable.
> >
> > Signed-off-by: Julien Grall <[email protected]>
> > Cc: Konrad Rzeszutek Wilk <[email protected]>
> > Cc: Boris Ostrovsky <[email protected]>
> > Cc: David Vrabel <[email protected]>
> > ---
> > Changes in v2:
> > - Remove the workaround and check if the Linux page granularity
> > is the same as Xen or not
> > ---
> > drivers/xen/biomerge.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > diff --git a/drivers/xen/biomerge.c b/drivers/xen/biomerge.c
> > index 0edb91c..571567c 100644
> > --- a/drivers/xen/biomerge.c
> > +++ b/drivers/xen/biomerge.c
> > @@ -6,10 +6,17 @@
> > bool xen_biovec_phys_mergeable(const struct bio_vec *vec1,
> > const struct bio_vec *vec2)
> > {
> > +#if XEN_PAGE_SIZE == PAGE_SIZE
> > unsigned long mfn1 = pfn_to_mfn(page_to_pfn(vec1->bv_page));
> > unsigned long mfn2 = pfn_to_mfn(page_to_pfn(vec2->bv_page));
> >
> > return __BIOVEC_PHYS_MERGEABLE(vec1, vec2) &&
> > ((mfn1 == mfn2) || ((mfn1+1) == mfn2));
> > +#else
> > + /* XXX: bio_vec are not mergeable when using different page size in
> > + * Xen and Linux
> > + */
> > + return 0;
> > +#endif
> > }
> > EXPORT_SYMBOL(xen_biovec_phys_mergeable);
> > --
> > 2.1.4
> >
>

2015-07-16 15:37:28

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 10/20] xen/xenbus: Use Xen page definition

On Thu, 9 Jul 2015, Julien Grall wrote:
> All the rings (xenstore and PV rings) are always based on the page
> granularity of Xen.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>

Reviewed-by: Stefano Stabellini <[email protected]>


> Changes in v2:
> - Also update the ring mapping function
> ---
> drivers/xen/xenbus/xenbus_client.c | 6 +++---
> drivers/xen/xenbus/xenbus_probe.c | 4 ++--
> 2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/xen/xenbus/xenbus_client.c b/drivers/xen/xenbus/xenbus_client.c
> index 9ad3272..80272f6 100644
> --- a/drivers/xen/xenbus/xenbus_client.c
> +++ b/drivers/xen/xenbus/xenbus_client.c
> @@ -388,7 +388,7 @@ int xenbus_grant_ring(struct xenbus_device *dev, void *vaddr,
> }
> grefs[i] = err;
>
> - vaddr = vaddr + PAGE_SIZE;
> + vaddr = vaddr + XEN_PAGE_SIZE;
> }
>
> return 0;
> @@ -555,7 +555,7 @@ static int xenbus_map_ring_valloc_pv(struct xenbus_device *dev,
> if (!node)
> return -ENOMEM;
>
> - area = alloc_vm_area(PAGE_SIZE * nr_grefs, ptes);
> + area = alloc_vm_area(XEN_PAGE_SIZE * nr_grefs, ptes);
> if (!area) {
> kfree(node);
> return -ENOMEM;
> @@ -750,7 +750,7 @@ static int xenbus_unmap_ring_vfree_pv(struct xenbus_device *dev, void *vaddr)
> unsigned long addr;
>
> memset(&unmap[i], 0, sizeof(unmap[i]));
> - addr = (unsigned long)vaddr + (PAGE_SIZE * i);
> + addr = (unsigned long)vaddr + (XEN_PAGE_SIZE * i);
> unmap[i].host_addr = arbitrary_virt_to_machine(
> lookup_address(addr, &level)).maddr;
> unmap[i].dev_bus_addr = 0;
> diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
> index 4308fb3..c67e5ba 100644
> --- a/drivers/xen/xenbus/xenbus_probe.c
> +++ b/drivers/xen/xenbus/xenbus_probe.c
> @@ -713,7 +713,7 @@ static int __init xenstored_local_init(void)
>
> xen_store_mfn = xen_start_info->store_mfn =
> pfn_to_mfn(virt_to_phys((void *)page) >>
> - PAGE_SHIFT);
> + XEN_PAGE_SHIFT);
>
> /* Next allocate a local port which xenstored can bind to */
> alloc_unbound.dom = DOMID_SELF;
> @@ -804,7 +804,7 @@ static int __init xenbus_init(void)
> goto out_error;
> xen_store_mfn = (unsigned long)v;
> xen_store_interface =
> - xen_remap(xen_store_mfn << PAGE_SHIFT, PAGE_SIZE);
> + xen_remap(xen_store_mfn << XEN_PAGE_SHIFT, XEN_PAGE_SIZE);
> break;
> default:
> pr_warn("Xenstore state unknown\n");
> --
> 2.1.4
>

2015-07-16 15:38:03

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 11/20] tty/hvc: xen: Use xen page definition

On Thu, 9 Jul 2015, Julien Grall wrote:
> The console ring is always based on the page granularity of Xen.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Jiri Slaby <[email protected]>
> Cc: David Vrabel <[email protected]>
> Cc: Stefano Stabellini <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: [email protected]

Reviewed-by: Stefano Stabellini <[email protected]>

> drivers/tty/hvc/hvc_xen.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/tty/hvc/hvc_xen.c b/drivers/tty/hvc/hvc_xen.c
> index a9d837f..2135944 100644
> --- a/drivers/tty/hvc/hvc_xen.c
> +++ b/drivers/tty/hvc/hvc_xen.c
> @@ -230,7 +230,7 @@ static int xen_hvm_console_init(void)
> if (r < 0 || v == 0)
> goto err;
> mfn = v;
> - info->intf = xen_remap(mfn << PAGE_SHIFT, PAGE_SIZE);
> + info->intf = xen_remap(mfn << XEN_PAGE_SHIFT, XEN_PAGE_SIZE);
> if (info->intf == NULL)
> goto err;
> info->vtermno = HVC_COOKIE;
> @@ -392,7 +392,7 @@ static int xencons_connect_backend(struct xenbus_device *dev,
> if (xen_pv_domain())
> mfn = virt_to_mfn(info->intf);
> else
> - mfn = __pa(info->intf) >> PAGE_SHIFT;
> + mfn = __pa(info->intf) >> XEN_PAGE_SHIFT;
> ret = gnttab_alloc_grant_references(1, &gref_head);
> if (ret < 0)
> return ret;
> @@ -476,7 +476,7 @@ static int xencons_resume(struct xenbus_device *dev)
> struct xencons_info *info = dev_get_drvdata(&dev->dev);
>
> xencons_disconnect_backend(info);
> - memset(info->intf, 0, PAGE_SIZE);
> + memset(info->intf, 0, XEN_PAGE_SIZE);
> return xencons_connect_backend(dev, info);
> }
>
> --
> 2.1.4
>

2015-07-16 15:44:12

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 13/20] xen/events: fifo: Make it running on 64KB granularity

On Thu, 9 Jul 2015, Julien Grall wrote:
> Only use the first 4KB of the page to store the events channel info. It
> means that we will wast 60KB every time we allocate page for:
^ waste

> * control block: a page is allocating per CPU
> * event array: a page is allocating everytime we need to expand it
>
> I think we can reduce the memory waste for the 2 areas by:
>
> * control block: sharing between multiple vCPUs. Although it will
> require some bookkeeping in order to not free the page when the CPU
> goes offline and the other CPUs sharing the page still there
>
> * event array: always extend the array event by 64K (i.e 16 4K
> chunk). That would require more care when we fail to expand the
> event channel.

But this is not implemented in this series, right?


> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> drivers/xen/events/events_base.c | 2 +-
> drivers/xen/events/events_fifo.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index 96093ae..858d2f6 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -40,11 +40,11 @@
> #include <asm/idle.h>
> #include <asm/io_apic.h>
> #include <asm/xen/pci.h>
> -#include <xen/page.h>
> #endif
> #include <asm/sync_bitops.h>
> #include <asm/xen/hypercall.h>
> #include <asm/xen/hypervisor.h>
> +#include <xen/page.h>
>
> #include <xen/xen.h>
> #include <xen/hvm.h>

Spurious change?


> diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
> index ed673e1..d53c297 100644
> --- a/drivers/xen/events/events_fifo.c
> +++ b/drivers/xen/events/events_fifo.c
> @@ -54,7 +54,7 @@
>
> #include "events_internal.h"
>
> -#define EVENT_WORDS_PER_PAGE (PAGE_SIZE / sizeof(event_word_t))
> +#define EVENT_WORDS_PER_PAGE (XEN_PAGE_SIZE / sizeof(event_word_t))
> #define MAX_EVENT_ARRAY_PAGES (EVTCHN_FIFO_NR_CHANNELS / EVENT_WORDS_PER_PAGE)
>
> struct evtchn_fifo_queue {
> --
> 2.1.4
>
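
Putting numbers on the waste described in the commit message quoted above,
a stand-alone sketch (sizeof(event_word_t) is assumed to be 4 bytes):

#include <stdio.h>

int main(void)
{
        unsigned long page_size     = 65536;  /* assumed 64KB Linux page */
        unsigned long xen_page_size = 4096;   /* only this much is used */
        unsigned long event_word    = 4;      /* assumed sizeof(event_word_t) */

        unsigned long words_used  = xen_page_size / event_word;
        unsigned long words_avail = page_size / event_word;

        printf("%lu of %lu event words used per allocation, %lu KB wasted\n",
               words_used, words_avail, (page_size - xen_page_size) / 1024);
        /* prints: 1024 of 16384 event words used per allocation, 60 KB wasted */
        return 0;
}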

2015-07-16 15:48:17

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 14/20] xen/grant-table: Make it running on 64KB granularity

On Thu, 9 Jul 2015, Julien Grall wrote:
> The Xen interface is using 4KB page granularity. This means that each
> grant is 4KB.
>
> The current implementation allocates a Linux page per grant. On Linux
> using 64KB page granularity, only the first 4KB of the page will be
> used.
>
> We could decrease the memory wasted by sharing the page with multiple
> grants. It will require some care with the {Set,Clear}ForeignPage macro.
>
> Note that no changes have been made in the x86 code because both Linux
> and Xen will only use 4KB page granularity.
>
> Signed-off-by: Julien Grall <[email protected]>
> Reviewed-by: David Vrabel <[email protected]>
> Cc: Stefano Stabellini <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> ---
> Changes in v2
> - Add David's reviewed-by
> ---
> arch/arm/xen/p2m.c | 6 +++---
> drivers/xen/grant-table.c | 6 +++---
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/xen/p2m.c b/arch/arm/xen/p2m.c
> index 887596c..0ed01f2 100644
> --- a/arch/arm/xen/p2m.c
> +++ b/arch/arm/xen/p2m.c
> @@ -93,8 +93,8 @@ int set_foreign_p2m_mapping(struct gnttab_map_grant_ref *map_ops,
> for (i = 0; i < count; i++) {
> if (map_ops[i].status)
> continue;
> - set_phys_to_machine(map_ops[i].host_addr >> PAGE_SHIFT,
> - map_ops[i].dev_bus_addr >> PAGE_SHIFT);
> + set_phys_to_machine(map_ops[i].host_addr >> XEN_PAGE_SHIFT,
> + map_ops[i].dev_bus_addr >> XEN_PAGE_SHIFT);
> }
>
> return 0;
> @@ -108,7 +108,7 @@ int clear_foreign_p2m_mapping(struct gnttab_unmap_grant_ref *unmap_ops,
> int i;
>
> for (i = 0; i < count; i++) {
> - set_phys_to_machine(unmap_ops[i].host_addr >> PAGE_SHIFT,
> + set_phys_to_machine(unmap_ops[i].host_addr >> XEN_PAGE_SHIFT,
> INVALID_P2M_ENTRY);
> }
>
> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> index 3679293..0a1f903 100644
> --- a/drivers/xen/grant-table.c
> +++ b/drivers/xen/grant-table.c

The arm part is fine, but aren't you missing the change to RPP and SPP?


> @@ -668,7 +668,7 @@ int gnttab_setup_auto_xlat_frames(phys_addr_t addr)
> if (xen_auto_xlat_grant_frames.count)
> return -EINVAL;
>
> - vaddr = xen_remap(addr, PAGE_SIZE * max_nr_gframes);
> + vaddr = xen_remap(addr, XEN_PAGE_SIZE * max_nr_gframes);
> if (vaddr == NULL) {
> pr_warn("Failed to ioremap gnttab share frames (addr=%pa)!\n",
> &addr);
> @@ -680,7 +680,7 @@ int gnttab_setup_auto_xlat_frames(phys_addr_t addr)
> return -ENOMEM;
> }
> for (i = 0; i < max_nr_gframes; i++)
> - pfn[i] = PFN_DOWN(addr) + i;
> + pfn[i] = XEN_PFN_DOWN(addr) + i;
>
> xen_auto_xlat_grant_frames.vaddr = vaddr;
> xen_auto_xlat_grant_frames.pfn = pfn;
> @@ -1004,7 +1004,7 @@ static void gnttab_request_version(void)
> {
> /* Only version 1 is used, which will always be available. */
> grant_table_version = 1;
> - grefs_per_grant_frame = PAGE_SIZE / sizeof(struct grant_entry_v1);
> + grefs_per_grant_frame = XEN_PAGE_SIZE / sizeof(struct grant_entry_v1);
> gnttab_interface = &gnttab_v1_ops;
>
> pr_info("Grant tables using version %d layout\n", grant_table_version);
> --
> 2.1.4
>

2015-07-16 16:08:27

by Julien Grall

[permalink] [raw]
Subject: Re: [PATCH v2 03/20] xen/grant: Introduce helpers to split a page into grant

Hi Stefano,

On 16/07/2015 17:01, Stefano Stabellini wrote:
> On Thu, 9 Jul 2015, Julien Grall wrote:
>> Currently, a grant is always based on the Xen page granularity (i.e
>> 4KB). When Linux is using a different page granularity, a single page
>> will be split between multiple grants.
>>
>> The new helpers will be in charge to split the Linux page into grant and
> ^
> grants

Will fix it.

>> call a function given by the caller on each grant.
>>
>> In order to help some PV drivers, the callback is allowed to use less
>> data and must update the resulting length. This is useful for netback.
>>
>> Also provide a helper to count the number of grants within a given
>> contiguous region.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>> ---
>> Changes in v2:
>> - Patch added
>> ---
>> drivers/xen/grant-table.c | 26 ++++++++++++++++++++++++++
>> include/xen/grant_table.h | 41 +++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 67 insertions(+)
>>
>> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
>> index 62f591f..3679293 100644
>> --- a/drivers/xen/grant-table.c
>> +++ b/drivers/xen/grant-table.c
>> @@ -296,6 +296,32 @@ int gnttab_end_foreign_access_ref(grant_ref_t ref, int readonly)
>> }
>> EXPORT_SYMBOL_GPL(gnttab_end_foreign_access_ref);
>>
>> +void gnttab_foreach_grant(struct page *page, unsigned int offset,
>> + unsigned int len, xen_grant_fn_t fn,
>> + void *data)
>> +{
>> + unsigned int goffset;
>> + unsigned int glen;
>> + unsigned long pfn;
>
> I would s/pfn/xen_pfn/ inside this function for clarity

Ok.

>
>
>> + len = min_t(unsigned int, PAGE_SIZE - offset, len);
>> + goffset = offset & ~XEN_PAGE_MASK;
>
> I don't think we want to support cases where (offset & ~XEN_PAGE_MASK)
> != 0, we should just return error.

We have to support offset != 0. The buffer received by the PV drivers
may start in the middle of the page and finish before the end of the page.

For instance, blkfront is using biovecs, which contain 3 pieces of
information: the page, the offset in the page and the total length.

>
>> + pfn = xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT);
>> +
>> + while (len) {
>> + glen = min_t(unsigned int, XEN_PAGE_SIZE - goffset, len);
>
> Similarly I don't think we want to support glen != XEN_PAGE_SIZE here

See my answer above.

>
>
>> + fn(pfn_to_mfn(pfn), goffset, &glen, data);
>
> Allowing the callee to change glen makes the interface more complex and
> certainly doesn't match the gnttab_foreach_grant function name anymore.

Why? Each time the callback is called, there is a new grant allocated.

> If netback needs it, could it just do the work inside its own function?
> I would rather keep gnttab_foreach_grant simple and move the complexity
> there.

Moving the complexity into netback means adding a loop in the callback
which will do exactly the same as this loop.

That also means using XEN_PAGE_SIZE & co, which I'm trying to avoid in
order not to confuse developers. If those are hidden, it likely means
fewer problems on 64KB when a developer is working on 4KB.

IMHO, the complexity is not so bad and will be more legible than yet
another loop.

[...]


>> +/* Helper to get to call fn only on the first "grant chunk" */
>> +static inline void gnttab_one_grant(struct page *page, unsigned int offset,
>> + unsigned len, xen_grant_fn_t fn,
>> + void *data)
>> +{
>> + /* The first request is limited to the size of one grant */
>> + len = min_t(unsigned int, XEN_PAGE_SIZE - (offset & ~XEN_PAGE_MASK),
>> + len);
>
> I would just BUG_ON(offset & ~XEN_PAGE_MASK) and simply len = XEN_PAGE_SIZE;

See my remark above.

>
>
>> + gnttab_foreach_grant(page, offset, len, fn, data);
>> +}
>> +
>> +/* Get the number of grant in a specified region
>> + *
>> + * offset: Offset in the first page
>
> I would generalize this function and support offset > PAGE_SIZE. At that
> point you could rename offset to "start".

It's actually supported, maybe it's not clear enough.

Regards,

--
Julien Grall
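
To see why the mutable length matters for netback, here is a sketch of a
callback that fills a request and may consume less than one full grant
(the context structure is hypothetical; the real code is in the netback
patch later in the series, and the caller is assumed to size things so
the remaining room never hits zero mid-page):

/* Hypothetical per-request context, for illustration only */
struct copy_ctx {
        unsigned int room;      /* bytes still available in the request */
};

static void fill_copy_op(unsigned long mfn, unsigned int offset,
                         unsigned int *len, void *data)
{
        struct copy_ctx *ctx = data;

        if (*len > ctx->room)
                *len = ctx->room;       /* take less than the grant offers... */

        /* ...and report it back, so gnttab_foreach_grant() only advances
         * by the amount actually consumed.
         */
        ctx->room -= *len;

        /* The real callback would emit a GNTTABOP_copy covering
         * (mfn, offset, *len) here.
         */
}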

2015-07-16 16:12:55

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 04/20] xen/grant: Add helper gnttab_page_grant_foreign_access_ref

Hi Stefano,

On 16/07/2015 16:05, Stefano Stabellini wrote:
> I like this. I think it might make sense to call it
> gnttab_page_grant_foreign_access_ref_one to make clear that it is only
> granting the first 4K.

Will do.

> In the future we could introduce a new function, called
> gnttab_page_grant_foreign_access_ref, that grants all 4K in the page.

Unless it has a different prototype, it won't be possible to do it. This
is because one ref = one grant. We would need a list of grants.

> In any case
>
> Reviewed-by: Stefano Stabellini <[email protected]>

Thank you,

--
Julien Grall
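
A sketch of what the whole-page variant suggested above would need to look
like, given that one grant reference covers only one 4KB chunk (the
prototype is invented for illustration and is not part of the series):

/* Hypothetical: grant every 4KB chunk of one Linux page.  The caller must
 * supply XEN_PFN_PER_PAGE grant references, one per chunk (16 with 64KB
 * pages).
 */
static void gnttab_page_grant_foreign_access_refs(
                const grant_ref_t refs[XEN_PFN_PER_PAGE], domid_t domid,
                struct page *page, int readonly)
{
        unsigned long xen_pfn = xen_page_to_pfn(page);
        unsigned int i;

        for (i = 0; i < XEN_PFN_PER_PAGE; i++, xen_pfn++)
                gnttab_grant_foreign_access_ref(refs[i], domid,
                                                pfn_to_mfn(xen_pfn), readonly);
}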

2015-07-16 16:16:12

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

Hi Stefano,

On 16/07/2015 16:33, Stefano Stabellini wrote:
> On Fri, 10 Jul 2015, Konrad Rzeszutek Wilk wrote:
>> On Thu, Jul 09, 2015 at 09:42:21PM +0100, Julien Grall wrote:
>>> When Linux is using 64K page granularity, every page will be split into
>>> multiple non-contiguous 4K MFNs (page granularity of Xen).
>>
>> But you don't care about that on the Linux layer I think?
>>
>> As in, is there an SWIOTLB that does PFN to MFN and vice-versa
>> translation?
>>
>> I thought that ARM guests are not exposed to the MFN<->PFN logic
>> and trying to figure that out to not screw up the DMA engine
>> on a PCIe device slurping up contiguous MFNs which don't map
>> to contiguous PFNs?
>
> Dom0 is mapped 1:1, so pfn == mfn normally, however grant maps
> unavoidably screw up the 1:1, so the swiotlb jumps in to save the day
> when a foreign granted page is involved in a dma operation.
>
> Regarding xen_biovec_phys_mergeable, we could check that all the pfn ==
> mfn and return true in that case.

I mentioned it in the commit message. However, we would have to loop over
every pfn, which is slow with 64KB pages (16 iterations for every page).
Given that the biovec merge check is called often, I don't think we can do
such a thing.

Regards,

--
Julien Grall
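
For reference, the check being discussed would look roughly like the
sketch below (assuming the XEN_ helpers from patch #1); with 64KB pages
the loop body runs 16 times per page, which is the cost mentioned above:

/* Sketch: true only if every 4KB chunk of the Linux page is identity
 * mapped, i.e. pfn == mfn for each Xen-sized frame.
 */
static bool xen_page_is_identity_mapped(struct page *page)
{
        unsigned long xen_pfn = xen_page_to_pfn(page);
        unsigned int i;

        for (i = 0; i < XEN_PFN_PER_PAGE; i++, xen_pfn++)
                if (pfn_to_mfn(xen_pfn) != xen_pfn)
                        return false;

        return true;
}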

2015-07-16 16:20:31

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 13/20] xen/events: fifo: Make it running on 64KB granularity

Hi Stefano,

On 16/07/2015 16:43, Stefano Stabellini wrote:
> On Thu, 9 Jul 2015, Julien Grall wrote:
>> Only use the first 4KB of the page to store the events channel info. It
>> means that we will wast 60KB every time we allocate page for:
> ^ waste
>
>> * control block: a page is allocating per CPU
>> * event array: a page is allocating everytime we need to expand it
>>
>> I think we can reduce the memory waste for the 2 areas by:
>>
>> * control block: sharing between multiple vCPUs. Although it will
>> require some bookkeeping in order to not free the page when the CPU
>> goes offline and the other CPUs sharing the page still there
>>
>> * event array: always extend the array event by 64K (i.e 16 4K
>> chunk). That would require more care when we fail to expand the
>> event channel.
>
> But this is not implemented in this series, right?

Yes, these are just some ideas to improve the code.

>
>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>> ---
>> drivers/xen/events/events_base.c | 2 +-
>> drivers/xen/events/events_fifo.c | 2 +-
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> index 96093ae..858d2f6 100644
>> --- a/drivers/xen/events/events_base.c
>> +++ b/drivers/xen/events/events_base.c
>> @@ -40,11 +40,11 @@
>> #include <asm/idle.h>
>> #include <asm/io_apic.h>
>> #include <asm/xen/pci.h>
>> -#include <xen/page.h>
>> #endif
>> #include <asm/sync_bitops.h>
>> #include <asm/xen/hypercall.h>
>> #include <asm/xen/hypervisor.h>
>> +#include <xen/page.h>
>>
>> #include <xen/xen.h>
>> #include <xen/hvm.h>
>
> Spurious change?

No, xen/page.h was only included for x86 before. Now, it's included for
every architecture.

This is required in order to get XEN_PAGE_SIZE.

Regards,

--
Julien Grall

2015-07-16 16:23:49

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 14/20] xen/grant-table: Make it running on 64KB granularity

Hi Stefano,

On 16/07/2015 16:47, Stefano Stabellini wrote:
>> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
>> index 3679293..0a1f903 100644
>> --- a/drivers/xen/grant-table.c
>> +++ b/drivers/xen/grant-table.c
>
> The arm part is fine, but aren't you missing the change to RPP and SPP?

SPP has been removed by commit 548f7c94759ac58d4744ef2663e2a66a106e21c5
as it was unused.

For RPP, it's used internally so there is no need to switch to
XEN_PAGE_SIZE. Otherwise we will waste 60KB for each internal page
allocated (see gnttab_init).

Regards,

--
Julien Grall

2015-07-16 17:14:02

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

On Thu, 9 Jul 2015, Julien Grall wrote:
> The hypercall interface (as well as the toolstack) is always using 4KB
> page granularity. When the toolstack asks to map a series of guest PFNs
> in a batch, it expects to have the pages mapped contiguously in its
> virtual memory.
>
> When Linux is using 64KB page granularity, the privcmd driver will have
> to map multiple Xen PFNs in a single Linux page.
>
> Note that this solution works for any page granularity which is a
> multiple of 4KB.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Use xen_apply_to_page
> ---
> drivers/xen/privcmd.c | 8 +--
> drivers/xen/xlate_mmu.c | 127 +++++++++++++++++++++++++++++++++---------------
> 2 files changed, 92 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 5a29616..e8714b4 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -446,7 +446,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
> return -EINVAL;
> }
>
> - nr_pages = m.num;
> + nr_pages = DIV_ROUND_UP_ULL(m.num, PAGE_SIZE / XEN_PAGE_SIZE);
> if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
> return -EINVAL;

DIV_ROUND_UP is enough; neither argument is unsigned long long


> @@ -494,7 +494,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
> goto out_unlock;
> }
> if (xen_feature(XENFEAT_auto_translated_physmap)) {
> - ret = alloc_empty_pages(vma, m.num);
> + ret = alloc_empty_pages(vma, nr_pages);
> if (ret < 0)
> goto out_unlock;
> } else
> @@ -518,6 +518,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
> state.global_error = 0;
> state.version = version;
>
> + BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
> /* mmap_batch_fn guarantees ret == 0 */
> BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
> &pagelist, mmap_batch_fn, &state));
> @@ -582,12 +583,13 @@ static void privcmd_close(struct vm_area_struct *vma)
> {
> struct page **pages = vma->vm_private_data;
> int numpgs = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> + int nr_pfn = (vma->vm_end - vma->vm_start) >> XEN_PAGE_SHIFT;
> int rc;
>
> if (!xen_feature(XENFEAT_auto_translated_physmap) || !numpgs || !pages)
> return;
>
> - rc = xen_unmap_domain_mfn_range(vma, numpgs, pages);
> + rc = xen_unmap_domain_mfn_range(vma, nr_pfn, pages);
> if (rc == 0)
> free_xenballooned_pages(numpgs, pages);

If you intend to pass the number of xen pages as nr argument to
xen_unmap_domain_mfn_range, then I think that the changes to
xen_xlate_unmap_gfn_range below are wrong.


> else
> diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
> index 58a5389..1fac17c 100644
> --- a/drivers/xen/xlate_mmu.c
> +++ b/drivers/xen/xlate_mmu.c
> @@ -38,31 +38,9 @@
> #include <xen/interface/xen.h>
> #include <xen/interface/memory.h>
>
> -/* map fgmfn of domid to lpfn in the current domain */
> -static int map_foreign_page(unsigned long lpfn, unsigned long fgmfn,
> - unsigned int domid)
> -{
> - int rc;
> - struct xen_add_to_physmap_range xatp = {
> - .domid = DOMID_SELF,
> - .foreign_domid = domid,
> - .size = 1,
> - .space = XENMAPSPACE_gmfn_foreign,
> - };
> - xen_ulong_t idx = fgmfn;
> - xen_pfn_t gpfn = lpfn;
> - int err = 0;
> -
> - set_xen_guest_handle(xatp.idxs, &idx);
> - set_xen_guest_handle(xatp.gpfns, &gpfn);
> - set_xen_guest_handle(xatp.errs, &err);
> -
> - rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp);
> - return rc < 0 ? rc : err;
> -}
> -
> struct remap_data {
> xen_pfn_t *fgmfn; /* foreign domain's gmfn */
> + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */
> pgprot_t prot;
> domid_t domid;
> struct vm_area_struct *vma;
> @@ -71,24 +49,75 @@ struct remap_data {
> struct xen_remap_mfn_info *info;
> int *err_ptr;
> int mapped;
> +
> + /* Hypercall parameters */
> + int h_errs[XEN_PFN_PER_PAGE];
> + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
> + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];

I don't think you should be adding these fields to struct remap_data:
struct remap_data is used to pass multi-page arguments from
xen_xlate_remap_gfn_array to remap_pte_fn.

I think you need to introduce a different struct to pass per-Linux-page
arguments from remap_pte_fn to setup_hparams.


> + int h_iter; /* Iterator */
> };
>
> +static int setup_hparams(struct page *page, unsigned long pfn, void *data)
> +{
> + struct remap_data *info = data;
> +
> + /* We may not have enough domain's gmfn to fill a Linux Page */
> + if (info->fgmfn == info->efgmfn)
> + return 0;
> +
> + info->h_idxs[info->h_iter] = *info->fgmfn;
> + info->h_gpfns[info->h_iter] = pfn;
> + info->h_errs[info->h_iter] = 0;
> + info->h_iter++;
> +
> + info->fgmfn++;
> +
> + return 0;
> +}
> +
> static int remap_pte_fn(pte_t *ptep, pgtable_t token, unsigned long addr,
> void *data)
> {
> struct remap_data *info = data;
> struct page *page = info->pages[info->index++];
> - unsigned long pfn = page_to_pfn(page);
> - pte_t pte = pte_mkspecial(pfn_pte(pfn, info->prot));
> + pte_t pte = pte_mkspecial(pfn_pte(page_to_pfn(page), info->prot));
> int rc;
> + uint32_t i;
> + struct xen_add_to_physmap_range xatp = {
> + .domid = DOMID_SELF,
> + .foreign_domid = info->domid,
> + .space = XENMAPSPACE_gmfn_foreign,
> + };
>
> - rc = map_foreign_page(pfn, *info->fgmfn, info->domid);
> - *info->err_ptr++ = rc;
> - if (!rc) {
> - set_pte_at(info->vma->vm_mm, addr, ptep, pte);
> - info->mapped++;
> + info->h_iter = 0;
> +
> + /* setup_hparams guarantees ret == 0 */
> + BUG_ON(xen_apply_to_page(page, setup_hparams, info));
> +
> + set_xen_guest_handle(xatp.idxs, info->h_idxs);
> + set_xen_guest_handle(xatp.gpfns, info->h_gpfns);
> + set_xen_guest_handle(xatp.errs, info->h_errs);
> + xatp.size = info->h_iter;
> +
> + rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp);

I would have thought that XENMEM_add_to_physmap_range operates at 4K
granularity, regardless of how the guest decides to layout its
pagetables. If so, the call to XENMEM_add_to_physmap_range needs to be
moved within the function passed to xen_apply_to_page.


> + /* info->err_ptr expect to have one error status per Xen PFN */
> + for (i = 0; i < info->h_iter; i++) {
> + int err = (rc < 0) ? rc : info->h_errs[i];
> +
> + *(info->err_ptr++) = err;
> + if (!err)
> + info->mapped++;
> }
> - info->fgmfn++;
> +
> + /*
> + * Note: The hypercall will return 0 in most of the case if even if
^ in most cases

> + * all the fgmfn are not mapped. We still have to update the pte
^ not all the fgmfn are mapped.

> + * as the userspace may decide to continue.
> + */
> + if (!rc)
> + set_pte_at(info->vma->vm_mm, addr, ptep, pte);
>
> return 0;
> }
> @@ -102,13 +131,14 @@ int xen_xlate_remap_gfn_array(struct vm_area_struct *vma,
> {
> int err;
> struct remap_data data;
> - unsigned long range = nr << PAGE_SHIFT;
> + unsigned long range = round_up(nr, XEN_PFN_PER_PAGE) << XEN_PAGE_SHIFT;

I would just BUG_ON(nr % XEN_PFN_PER_PAGE) and avoid the round_up;



> /* Kept here for the purpose of making sure code doesn't break
> x86 PVOPS */
> BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));
>
> data.fgmfn = mfn;
> + data.efgmfn = mfn + nr;

If we assume that nr is a multiple of XEN_PFN_PER_PAGE, then we can get
rid of efgmfn.


> data.prot = prot;
> data.domid = domid;
> data.vma = vma;
> @@ -123,21 +153,38 @@ int xen_xlate_remap_gfn_array(struct vm_area_struct *vma,
> }
> EXPORT_SYMBOL_GPL(xen_xlate_remap_gfn_array);
>
> +static int unmap_gfn(struct page *page, unsigned long pfn, void *data)
> +{
> + int *nr = data;
> + struct xen_remove_from_physmap xrp;
> +
> + /* The Linux Page may not have been fully mapped to Xen */
> + if (!*nr)
> + return 0;
> +
> + xrp.domid = DOMID_SELF;
> + xrp.gpfn = pfn;
> + (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
> +
> + (*nr)--;

I don't understand why you are passing nr as the private argument. I would
just call XENMEM_remove_from_physmap unconditionally here. Am I missing
something? After all, XENMEM_remove_from_physmap is just unmapping
at 4K granularity, right?


> + return 0;
> +}
> +
> int xen_xlate_unmap_gfn_range(struct vm_area_struct *vma,
> int nr, struct page **pages)
> {
> int i;
> + int nr_page = round_up(nr, XEN_PFN_PER_PAGE);

If nr is the number of xen pages, then this should be:

int nr_pages = DIV_ROUND_UP(nr, XEN_PFN_PER_PAGE);



> - for (i = 0; i < nr; i++) {
> - struct xen_remove_from_physmap xrp;
> - unsigned long pfn;
> + for (i = 0; i < nr_page; i++) {
> + /* unmap_gfn guarantees ret == 0 */
> + BUG_ON(xen_apply_to_page(pages[i], unmap_gfn, &nr));
> + }
>
> - pfn = page_to_pfn(pages[i]);
> + /* We should have consume every xen page */
^ consumed


> + BUG_ON(nr != 0);
>
> - xrp.domid = DOMID_SELF;
> - xrp.gpfn = pfn;
> - (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
> - }
> return 0;
> }
> EXPORT_SYMBOL_GPL(xen_xlate_unmap_gfn_range);
> --
> 2.1.4
>

2015-07-16 17:17:39

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

On Thu, 16 Jul 2015, Stefano Stabellini wrote:
> > + /* setup_hparams guarantees ret == 0 */
> > + BUG_ON(xen_apply_to_page(page, setup_hparams, info));
> > +
> > + set_xen_guest_handle(xatp.idxs, info->h_idxs);
> > + set_xen_guest_handle(xatp.gpfns, info->h_gpfns);
> > + set_xen_guest_handle(xatp.errs, info->h_errs);
> > + xatp.size = info->h_iter;
> > +
> > + rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp);
>
> I would have thought that XENMEM_add_to_physmap_range operates at 4K
> granularity, regardless of how the guest decides to layout its
> pagetables. If so, the call to XENMEM_add_to_physmap_range needs to be
> moved within the function passed to xen_apply_to_page.

Sorry, this comment is wrong, please ignore it. The others are OK.

2015-07-16 18:31:40

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On Thu, Jul 16, 2015 at 05:15:41PM +0100, Julien Grall wrote:
> Hi Stefano,
>
> On 16/07/2015 16:33, Stefano Stabellini wrote:
> >On Fri, 10 Jul 2015, Konrad Rzeszutek Wilk wrote:
> >>On Thu, Jul 09, 2015 at 09:42:21PM +0100, Julien Grall wrote:
> >>>When Linux is using 64K page granularity, every page will be split into
> >>>multiple non-contiguous 4K MFNs (page granularity of Xen).
> >>
> >>But you don't care about that on the Linux layer I think?
> >>
> >>As in, is there an SWIOTLB that does PFN to MFN and vice-versa
> >>translation?
> >>
> >>I thought that ARM guests are not exposed to the MFN<->PFN logic
> >>and trying to figure that out to not screw up the DMA engine
> >>on a PCIe device slurping up contiguous MFNs which don't map
> >>to contiguous PFNs?
> >
> >Dom0 is mapped 1:1, so pfn == mfn normally, however grant maps
> >unavoidably screw up the 1:1, so the swiotlb jumps in to save the day
> >when a foreign granted page is involved in a dma operation.
> >
> >Regarding xen_biovec_phys_mergeable, we could check that all the pfn ==
> >mfn and return true in that case.
>
> I mentioned it in the commit message. Although, we would have to loop on
> every pfn, which is slow on 64KB (16 times for every page). Given the biovec
> is called often, I don't think we can do such a thing.

OK - it would be good to have the gist of this email thread in the
commit message. Thanks.
>
> Regards,
>
> --
> Julien Grall

2015-07-17 12:57:43

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 19/20] xen/privcmd: Add support for Linux 64KB page granularity

Hi Stefano,

On 16/07/15 18:12, Stefano Stabellini wrote:
> On Thu, 9 Jul 2015, Julien Grall wrote:
>> The hypercall interface (as well as the toolstack) is always using 4KB
>> page granularity. When the toolstack is asking for mapping a series of
>> guest PFN in a batch, it expects to have the page map contiguously in
>> its virtual memory.
>>
>> When Linux is using 64KB page granularity, the privcmd driver will have
>> to map multiple Xen PFN in a single Linux page.
>>
>> Note that this solution works on page granularity which is a multiple of
>> 4KB.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>> ---
>> Changes in v2:
>> - Use xen_apply_to_page
>> ---
>> drivers/xen/privcmd.c | 8 +--
>> drivers/xen/xlate_mmu.c | 127 +++++++++++++++++++++++++++++++++---------------
>> 2 files changed, 92 insertions(+), 43 deletions(-)
>>
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 5a29616..e8714b4 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
>> @@ -446,7 +446,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>> return -EINVAL;
>> }
>>
>> - nr_pages = m.num;
>> + nr_pages = DIV_ROUND_UP_ULL(m.num, PAGE_SIZE / XEN_PAGE_SIZE);
>> if ((m.num <= 0) || (nr_pages > (LONG_MAX >> PAGE_SHIFT)))
>> return -EINVAL;
>
> DIV_ROUND_UP is enough, neither arguments are unsigned long long

I'm not sure why I use DIV_ROUND_UP_ULL here... I will switch to
DIV_ROUND_UP in the next version.

>
>> @@ -494,7 +494,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>> goto out_unlock;
>> }
>> if (xen_feature(XENFEAT_auto_translated_physmap)) {
>> - ret = alloc_empty_pages(vma, m.num);
>> + ret = alloc_empty_pages(vma, nr_pages);
>> if (ret < 0)
>> goto out_unlock;
>> } else
>> @@ -518,6 +518,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
>> state.global_error = 0;
>> state.version = version;
>>
>> + BUILD_BUG_ON(((PAGE_SIZE / sizeof(xen_pfn_t)) % XEN_PFN_PER_PAGE) != 0);
>> /* mmap_batch_fn guarantees ret == 0 */
>> BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
>> &pagelist, mmap_batch_fn, &state));
>> @@ -582,12 +583,13 @@ static void privcmd_close(struct vm_area_struct *vma)
>> {
>> struct page **pages = vma->vm_private_data;
>> int numpgs = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>> + int nr_pfn = (vma->vm_end - vma->vm_start) >> XEN_PAGE_SHIFT;
>> int rc;
>>
>> if (!xen_feature(XENFEAT_auto_translated_physmap) || !numpgs || !pages)
>> return;
>>
>> - rc = xen_unmap_domain_mfn_range(vma, numpgs, pages);
>> + rc = xen_unmap_domain_mfn_range(vma, nr_pfn, pages);
>> if (rc == 0)
>> free_xenballooned_pages(numpgs, pages);
>
> If you intend to pass the number of xen pages as nr argument to
> xen_unmap_domain_mfn_range, then I think that the changes to
> xen_xlate_unmap_gfn_range below are wrong.

Hmmm... right. I will fix it.

>
>
>> else
>> diff --git a/drivers/xen/xlate_mmu.c b/drivers/xen/xlate_mmu.c
>> index 58a5389..1fac17c 100644
>> --- a/drivers/xen/xlate_mmu.c
>> +++ b/drivers/xen/xlate_mmu.c
>> @@ -38,31 +38,9 @@
>> #include <xen/interface/xen.h>
>> #include <xen/interface/memory.h>
>>
>> -/* map fgmfn of domid to lpfn in the current domain */
>> -static int map_foreign_page(unsigned long lpfn, unsigned long fgmfn,
>> - unsigned int domid)
>> -{
>> - int rc;
>> - struct xen_add_to_physmap_range xatp = {
>> - .domid = DOMID_SELF,
>> - .foreign_domid = domid,
>> - .size = 1,
>> - .space = XENMAPSPACE_gmfn_foreign,
>> - };
>> - xen_ulong_t idx = fgmfn;
>> - xen_pfn_t gpfn = lpfn;
>> - int err = 0;
>> -
>> - set_xen_guest_handle(xatp.idxs, &idx);
>> - set_xen_guest_handle(xatp.gpfns, &gpfn);
>> - set_xen_guest_handle(xatp.errs, &err);
>> -
>> - rc = HYPERVISOR_memory_op(XENMEM_add_to_physmap_range, &xatp);
>> - return rc < 0 ? rc : err;
>> -}
>> -
>> struct remap_data {
>> xen_pfn_t *fgmfn; /* foreign domain's gmfn */
>> + xen_pfn_t *efgmfn; /* pointer to the end of the fgmfn array */
>> pgprot_t prot;
>> domid_t domid;
>> struct vm_area_struct *vma;
>> @@ -71,24 +49,75 @@ struct remap_data {
>> struct xen_remap_mfn_info *info;
>> int *err_ptr;
>> int mapped;
>> +
>> + /* Hypercall parameters */
>> + int h_errs[XEN_PFN_PER_PAGE];
>> + xen_ulong_t h_idxs[XEN_PFN_PER_PAGE];
>> + xen_pfn_t h_gpfns[XEN_PFN_PER_PAGE];
>
> I don't think you should be adding these fields to struct remap_data:
> struct remap_data is used to pass multi pages arguments from
> xen_xlate_remap_gfn_array to remap_pte_fn.
>
> I think you need to introduce a different struct to pass per linux page
> arguments from remap_pte_fn to setup_hparams.

I didn't introduce a new structure in order to avoid allocating it on
the stack every time remap_pte_fn is called.

Maybe that is an optimization for nothing?
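
(For reference, a minimal sketch of what such a per-Linux-page structure
might look like; the name remap_hparams is made up here and not part of
the posted series. Assuming arm64, where xen_ulong_t and xen_pfn_t are
64-bit and XEN_PFN_PER_PAGE == 16 for 64KB pages, it is roughly
16 * (4 + 8 + 8) + 4 = 324 bytes of stack per remap_pte_fn call, which is
the cost being weighed above.)

/* Hypothetical per-page hypercall argument block passed from
 * remap_pte_fn() to setup_hparams(); sketch only. */
struct remap_hparams {
	int errs[XEN_PFN_PER_PAGE];
	xen_ulong_t idxs[XEN_PFN_PER_PAGE];
	xen_pfn_t gpfns[XEN_PFN_PER_PAGE];
	unsigned int iter;
};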

[...]

>> + /* info->err_ptr expect to have one error status per Xen PFN */
>> + for (i = 0; i < info->h_iter; i++) {
>> + int err = (rc < 0) ? rc : info->h_errs[i];
>> +
>> + *(info->err_ptr++) = err;
>> + if (!err)
>> + info->mapped++;
>> }
>> - info->fgmfn++;
>> +
>> + /*
>> + * Note: The hypercall will return 0 in most of the case if even if
> ^ in most cases

Will fix it.

>> + * all the fgmfn are not mapped. We still have to update the pte
> ^ not all the fgmfn are mapped.
>
>> + * as the userspace may decide to continue.
>> + */
>> + if (!rc)
>> + set_pte_at(info->vma->vm_mm, addr, ptep, pte);
>>
>> return 0;
>> }
>> @@ -102,13 +131,14 @@ int xen_xlate_remap_gfn_array(struct vm_area_struct *vma,
>> {
>> int err;
>> struct remap_data data;
>> - unsigned long range = nr << PAGE_SHIFT;
>> + unsigned long range = round_up(nr, XEN_PFN_PER_PAGE) << XEN_PAGE_SHIFT;
>
> I would just BUG_ON(nr % XEN_PFN_PER_PAGE) and avoid the round_up;

As discussed IRL, the toolstack can request to map only 1 Xen page. So
the BUG_ON would always be hit.

Anyway, as you suggested IRL, I will replace the round_up by
DIV_ROUND_UP in the next version.

>> data.prot = prot;
>> data.domid = domid;
>> data.vma = vma;
>> @@ -123,21 +153,38 @@ int xen_xlate_remap_gfn_array(struct vm_area_struct *vma,
>> }
>> EXPORT_SYMBOL_GPL(xen_xlate_remap_gfn_array);
>>
>> +static int unmap_gfn(struct page *page, unsigned long pfn, void *data)
>> +{
>> + int *nr = data;
>> + struct xen_remove_from_physmap xrp;
>> +
>> + /* The Linux Page may not have been fully mapped to Xen */
>> + if (!*nr)
>> + return 0;
>> +
>> + xrp.domid = DOMID_SELF;
>> + xrp.gpfn = pfn;
>> + (void)HYPERVISOR_memory_op(XENMEM_remove_from_physmap, &xrp);
>> +
>> + (*nr)--;
>
> I don't understand why you are passing nr as private argument. I would
> just call XENMEM_remove_from_physmap unconditionally here. Am I missing
> something? After all XENMEM_remove_from_physmap is just unmapping
> at 4K granularity, right?

Yes, but you may ask to remove only one 4KB page. When 64KB is in use, that
would mean calling the hypervisor 16 times for only one useful removal.

This is because the hypervisor doesn't provide a hypercall to remove a
list of PFNs, which is very unfortunate.

Although, as discussed IRL, I can look at providing a new function
xen_apply_to_page_range which would handle the counter internally.
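
(A rough sketch of what such a helper could look like; xen_apply_to_page_range
is not part of the posted series, so the name and signature below are
guesses. The point is only that the countdown would live in the iterator
rather than being threaded through every callback:)

/* Call fn() on the first @nr Xen PFNs backing the Linux pages in @pages. */
static int xen_apply_to_page_range(struct page **pages, unsigned int nr,
				   int (*fn)(struct page *page,
					     unsigned long pfn, void *data),
				   void *data)
{
	unsigned int i, j;
	int ret;

	for (i = 0; nr; i++) {
		unsigned long pfn = xen_page_to_pfn(pages[i]);

		for (j = 0; j < XEN_PFN_PER_PAGE && nr; j++, nr--) {
			ret = fn(pages[i], pfn + j, data);
			if (ret)
				return ret;
		}
	}

	return 0;
}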

>
>
>> + return 0;
>> +}
>> +
>> int xen_xlate_unmap_gfn_range(struct vm_area_struct *vma,
>> int nr, struct page **pages)
>> {
>> int i;
>> + int nr_page = round_up(nr, XEN_PFN_PER_PAGE);
>
> If nr is the number of xen pages, then this should be:
>
> int nr_pages = DIV_ROUND_UP(nr, XEN_PFN_PER_PAGE);

Correct, I will fix it.
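
(To make the difference concrete: with 64KB Linux pages, XEN_PFN_PER_PAGE
is 16. Unmapping nr = 20 Xen pages touches 2 entries of pages[], and
DIV_ROUND_UP(20, 16) == 2 is the right loop bound, whereas
round_up(20, 16) == 32 would walk far past the end of the array.)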

>> - for (i = 0; i < nr; i++) {
>> - struct xen_remove_from_physmap xrp;
>> - unsigned long pfn;
>> + for (i = 0; i < nr_page; i++) {
>> + /* unmap_gfn guarantees ret == 0 */
>> + BUG_ON(xen_apply_to_page(pages[i], unmap_gfn, &nr));
>> + }
>>
>> - pfn = page_to_pfn(pages[i]);
>> + /* We should have consume every xen page */
> ^ consumed

I will fix it.

Regards,

--
Julien Grall

2015-07-17 13:07:49

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 13/20] xen/events: fifo: Make it running on 64KB granularity

On Thu, 16 Jul 2015, Julien Grall wrote:
> Hi Stefano,
>
> On 16/07/2015 16:43, Stefano Stabellini wrote:
> > On Thu, 9 Jul 2015, Julien Grall wrote:
> > > Only use the first 4KB of the page to store the events channel info. It
> > > means that we will wast 60KB every time we allocate page for:
> > ^ waste
> >
> > > * control block: a page is allocating per CPU
> > > * event array: a page is allocating everytime we need to expand it
> > >
> > > I think we can reduce the memory waste for the 2 areas by:
> > >
> > > * control block: sharing between multiple vCPUs. Although it will
> > > require some bookkeeping in order to not free the page when the CPU
> > > goes offline and the other CPUs sharing the page still there
> > >
> > > * event array: always extend the array event by 64K (i.e 16 4K
> > > chunk). That would require more care when we fail to expand the
> > > event channel.
> >
> > But this is not implemented in this series, right?
>
> Yes, these are just ideas to improve the code.
>
> >
> >
> > > Signed-off-by: Julien Grall <[email protected]>
> > > Cc: Konrad Rzeszutek Wilk <[email protected]>
> > > Cc: Boris Ostrovsky <[email protected]>
> > > Cc: David Vrabel <[email protected]>
> > > ---
> > > drivers/xen/events/events_base.c | 2 +-
> > > drivers/xen/events/events_fifo.c | 2 +-
> > > 2 files changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/xen/events/events_base.c
> > > b/drivers/xen/events/events_base.c
> > > index 96093ae..858d2f6 100644
> > > --- a/drivers/xen/events/events_base.c
> > > +++ b/drivers/xen/events/events_base.c
> > > @@ -40,11 +40,11 @@
> > > #include <asm/idle.h>
> > > #include <asm/io_apic.h>
> > > #include <asm/xen/pci.h>
> > > -#include <xen/page.h>
> > > #endif
> > > #include <asm/sync_bitops.h>
> > > #include <asm/xen/hypercall.h>
> > > #include <asm/xen/hypervisor.h>
> > > +#include <xen/page.h>
> > >
> > > #include <xen/xen.h>
> > > #include <xen/hvm.h>
> >
> > Spurious change?
>
> No, xen/page.h was only included for x86 before. Now, it's included for every
> architecture.
>
> This is required in order to get XEN_PAGE_SIZE.

Ah, right.

Reviewed-by: Stefano Stabellini <[email protected]>

2015-07-17 13:11:26

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 03/20] xen/grant: Introduce helpers to split a page into grant

On 16/07/15 17:07, Julien Grall wrote:
>>> + pfn = xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT);
>>> +
>>> + while (len) {
>>> + glen = min_t(unsigned int, XEN_PAGE_SIZE - goffset, len);
>>
>> Similarly I don't think we want to support glen != XEN_PAGE_SIZE here
>
> See my answer above.
>
>>
>>
>>> + fn(pfn_to_mfn(pfn), goffset, &glen, data);
>>
>> Allowing the callee to change glen makes the interface more complex and
>> certainly doesn't match the gnttab_foreach_grant function name anymore.
>
> Why? Each time the callback is called, there is a new grant allocated.

As discussed IRL, I will rename this function to
gnttab_foreach_grant_in_range.
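
(For context, a sketch of what the renamed helper could look like, pieced
together from the fragments quoted above; the exact prototype in the next
version may differ:)

/* Split the byte range [offset, offset + len) of one Linux page into
 * 4KB (Xen page sized) chunks and invoke fn() once per chunk/grant. */
static inline void gnttab_foreach_grant_in_range(struct page *page,
						 unsigned int offset,
						 unsigned int len,
						 void (*fn)(unsigned long mfn,
							    unsigned int goffset,
							    unsigned int *glen,
							    void *data),
						 void *data)
{
	unsigned long pfn = xen_page_to_pfn(page) + (offset >> XEN_PAGE_SHIFT);
	unsigned int goffset = offset & (XEN_PAGE_SIZE - 1);
	unsigned int glen;

	while (len) {
		glen = min_t(unsigned int, XEN_PAGE_SIZE - goffset, len);
		/* The callback may shrink glen to ask for a smaller grant */
		fn(pfn_to_mfn(pfn), goffset, &glen, data);

		goffset += glen;
		if (goffset == XEN_PAGE_SIZE) {
			goffset = 0;
			pfn++;
		}
		len -= glen;
	}
}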

>
>> If netback needs it, could it just do the work inside its own function?
>> I would rather keep gnttab_foreach_grant simple and move the complexity
>> there.
>
> Moving the complexity into netback means adding a loop in the callback
> which will do exactly the same as this loop.
>
> That also means using XEN_PAGE_SIZE & co, which I'm trying to avoid in
> order not to confuse the developer. If they are hidden, it likely means
> fewer problems on 64KB when the developer is working on 4KB.
>
> IMHO, the complexity is not so bad and will be more readable than yet
> another loop.

After the IRL talk, I will look at whether I can move the "netback
problem" into a different helper. We can see later if we can improve the
code.

Regards,

--
Julien Grall

2015-07-17 13:22:06

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On Thu, 16 Jul 2015, Julien Grall wrote:
> Hi Stefano,
>
> On 16/07/2015 16:33, Stefano Stabellini wrote:
> > On Fri, 10 Jul 2015, Konrad Rzeszutek Wilk wrote:
> > > On Thu, Jul 09, 2015 at 09:42:21PM +0100, Julien Grall wrote:
> > > > When Linux is using 64K page granularity, every page will be split into
> > > > multiple non-contiguous 4K MFNs (page granularity of Xen).
> > >
> > > But you don't care about that on the Linux layer I think?
> > >
> > > As in, is there an SWIOTLB that does PFN to MFN and vice-versa
> > > translation?
> > >
> > > I thought that ARM guests are not exposed to the MFN<->PFN logic
> > > and trying to figure that out to not screw up the DMA engine
> > > on a PCIe device slurping up contiguous MFNs which don't map
> > > to contiguous PFNs?
> >
> > Dom0 is mapped 1:1, so pfn == mfn normally, however grant maps
> > unavoidably screw up the 1:1, so the swiotlb jumps in to save the day
> > when a foreign granted page is involved in a dma operation.
> >
> > Regarding xen_biovec_phys_mergeable, we could check that all the pfn ==
> > mfn and return true in that case.
>
> I mentioned it in the commit message. Although, we would have to loop on every
> pfn, which is slow on 64KB (16 times for every page). Given the biovec is
> called often, I don't think we can do such a thing.

We would have to run some benchmarks, but I think it would still be a
win. We should write an ad-hoc __pfn_to_mfn translation function that
operates on a range of pfns and simply checks whether an entry is
present in that range. It should be just as fast as __pfn_to_mfn. I
would definitely recommend it.
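
(For illustration, the interface being suggested might look something like
the following; xen_pfn_range_is_foreign is made up here, nothing of the
sort exists in the posted series, and __BIOVEC_PHYS_MERGEABLE is the
existing generic merge check:)

/* Return true if any Xen PFN in [xen_pfn, xen_pfn + nr) is backed by a
 * foreign grant mapping, walking the p2m once for the whole range
 * instead of doing nr separate __pfn_to_mfn() lookups. */
bool xen_pfn_range_is_foreign(unsigned long xen_pfn, unsigned long nr);

/* xen_biovec_phys_mergeable() could then keep allowing merges whenever
 * no foreign frame is involved, even with 64KB Linux pages: */
	if (!xen_pfn_range_is_foreign(xen_page_to_pfn(vec1->bv_page),
				      XEN_PFN_PER_PAGE) &&
	    !xen_pfn_range_is_foreign(xen_page_to_pfn(vec2->bv_page),
				      XEN_PFN_PER_PAGE))
		return __BIOVEC_PHYS_MERGEABLE(vec1, vec2);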

2015-07-17 13:39:12

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 14/20] xen/grant-table: Make it running on 64KB granularity

On Thu, 16 Jul 2015, Julien Grall wrote:
> Hi Stefano,
>
> On 16/07/2015 16:47, Stefano Stabellini wrote:
> >> diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
> > > index 3679293..0a1f903 100644
> > > --- a/drivers/xen/grant-table.c
> > > +++ b/drivers/xen/grant-table.c
> >
> > The arm part is fine, but aren't you missing the change to RPP and SPP?
>
> SPP has been removed by commit 548f7c94759ac58d4744ef2663e2a66a106e21c5 as it
> was unused.
>
> For RPP, it's used internally so there is no need to switch to XEN_PAGE_SIZE.
> Otherwise we will waste 60KB for each internal page allocated (see
> gnttab_init).

I see now that RPP is specifically for internal data structures in
grant-table.c and it is used consistently.

Reviewed-by: Stefano Stabellini <[email protected]>

2015-07-17 14:05:01

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH v2 12/20] xen/balloon: Don't rely on the page granularity is the same for Xen and Linux

On Thu, 9 Jul 2015, Julien Grall wrote:
> For ARM64 guests, Linux is able to support either 64K or 4K page
> granularity. Although, the hypercall interface is always based on 4K
> page granularity.
>
> With 64K page granularity, a single page will be spread over multiple
> Xen frames.
>
> When a driver requests/frees a balloon page, the balloon driver will have
> to split the Linux page in 4K chunks before asking Xen to add/remove the
> frames from the guest.
>
> Note that this can work on any page granularity assuming it's a multiple
> of 4K.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> Cc: Wei Liu <[email protected]>
> ---
> Changes in v2:
> - Use xen_apply_to_page to split a page in 4K chunk
> - It's not necessary to have a smaller frame list. Re-use
> PAGE_SIZE
> - Convert reserve_additional_memory to use XEN_... macro
> ---
> drivers/xen/balloon.c | 147 +++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 105 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index fd93369..19a72b1 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -230,6 +230,7 @@ static enum bp_state reserve_additional_memory(long credit)
> nid = memory_add_physaddr_to_nid(hotplug_start_paddr);
>
> #ifdef CONFIG_XEN_HAVE_PVMMU
> + /* TODO */

I think you need to be more verbose than that: TODO what?


> /*
> * add_memory() will build page tables for the new memory so
> * the p2m must contain invalid entries so the correct
> @@ -242,8 +243,8 @@ static enum bp_state reserve_additional_memory(long credit)
> if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> unsigned long pfn, i;
>
> - pfn = PFN_DOWN(hotplug_start_paddr);
> - for (i = 0; i < balloon_hotplug; i++) {
> + pfn = XEN_PFN_DOWN(hotplug_start_paddr);
> + for (i = 0; i < (balloon_hotplug * XEN_PFN_PER_PAGE); i++) {
> if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
> pr_warn("set_phys_to_machine() failed, no memory added\n");
> return BP_ECANCELED;
> @@ -323,10 +324,72 @@ static enum bp_state reserve_additional_memory(long credit)
> }
> #endif /* CONFIG_XEN_BALLOON_MEMORY_HOTPLUG */
>
> +static int set_frame(struct page *page, unsigned long pfn, void *data)
> +{
> + unsigned long *index = data;
> +
> + frame_list[(*index)++] = pfn;
> +
> + return 0;
> +}
> +
> +#ifdef CONFIG_XEN_HAVE_PVMMU
> +static int pvmmu_update_mapping(struct page *page, unsigned long pfn,
> + void *data)
> +{
> + unsigned long *index = data;
> + xen_pfn_t frame = frame_list[*index];
> +
> + set_phys_to_machine(pfn, frame);
> + /* Link back into the page tables if not highmem. */
> + if (!PageHighMem(page)) {
> + int ret;
> + ret = HYPERVISOR_update_va_mapping(
> + (unsigned long)__va(pfn << XEN_PAGE_SHIFT),
> + mfn_pte(frame, PAGE_KERNEL),
> + 0);
> + BUG_ON(ret);
> + }
> +
> + (*index)++;
> +
> + return 0;
> +}
> +#endif
> +
> +static int balloon_remove_mapping(struct page *page, unsigned long pfn,
> + void *data)
> +{
> + unsigned long *index = data;
> +
> + /* We expect the frame_list to contain the same pfn */
> + BUG_ON(pfn != frame_list[*index]);
> +
> + frame_list[*index] = pfn_to_mfn(pfn);
> +
> +#ifdef CONFIG_XEN_HAVE_PVMMU
> + if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> + if (!PageHighMem(page)) {
> + int ret;
> +
> + ret = HYPERVISOR_update_va_mapping(
> + (unsigned long)__va(pfn << XEN_PAGE_SHIFT),
> + __pte_ma(0), 0);
> + BUG_ON(ret);
> + }
> + __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> + }
> +#endif
> +
> + (*index)++;
> +
> + return 0;
> +}
> +
> static enum bp_state increase_reservation(unsigned long nr_pages)
> {
> int rc;
> - unsigned long pfn, i;
> + unsigned long i, frame_idx;
> struct page *page;
> struct xen_memory_reservation reservation = {
> .address_bits = 0,
> @@ -343,44 +406,43 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
> }
> #endif
>
> - if (nr_pages > ARRAY_SIZE(frame_list))
> - nr_pages = ARRAY_SIZE(frame_list);
> + if (nr_pages > (ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE))
> + nr_pages = ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE;
>
> + frame_idx = 0;
> page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
> for (i = 0; i < nr_pages; i++) {
> if (!page) {
> nr_pages = i;
> break;
> }
> - frame_list[i] = page_to_pfn(page);
> +
> + rc = xen_apply_to_page(page, set_frame, &frame_idx);
> +
> page = balloon_next_page(page);
> }
>
> set_xen_guest_handle(reservation.extent_start, frame_list);
> - reservation.nr_extents = nr_pages;
> + reservation.nr_extents = nr_pages * XEN_PFN_PER_PAGE;
> rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &reservation);
> if (rc <= 0)
> return BP_EAGAIN;
>
> - for (i = 0; i < rc; i++) {
> + /* rc is equal to the number of Xen page populated */
> + nr_pages = rc / XEN_PFN_PER_PAGE;

Here we are purposely ignoring any spares (rc % XEN_PFN_PER_PAGE).
Instead of leaking them, maybe we should give them back to Xen since we
cannot use them?



> + for (i = 0; i < nr_pages; i++) {
> page = balloon_retrieve(false);
> BUG_ON(page == NULL);
>
> - pfn = page_to_pfn(page);
> -
> #ifdef CONFIG_XEN_HAVE_PVMMU
> + frame_idx = 0;

Shouldn't this be before the beginning of the loop above?


> if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> - set_phys_to_machine(pfn, frame_list[i]);
> -
> - /* Link back into the page tables if not highmem. */
> - if (!PageHighMem(page)) {
> - int ret;
> - ret = HYPERVISOR_update_va_mapping(
> - (unsigned long)__va(pfn << PAGE_SHIFT),
> - mfn_pte(frame_list[i], PAGE_KERNEL),
> - 0);
> - BUG_ON(ret);
> - }
> + int ret;
> +
> + ret = xen_apply_to_page(page, pvmmu_update_mapping,
> + &frame_idx);
> + BUG_ON(ret);
> }
> #endif
>
> @@ -388,7 +450,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
> __free_reserved_page(page);
> }
>
> - balloon_stats.current_pages += rc;
> + balloon_stats.current_pages += nr_pages;
>
> return BP_DONE;
> }
> @@ -396,7 +458,7 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
> static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> {
> enum bp_state state = BP_DONE;
> - unsigned long pfn, i;
> + unsigned long pfn, i, frame_idx, nr_frames;
> struct page *page;
> int ret;
> struct xen_memory_reservation reservation = {
> @@ -414,9 +476,10 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> }
> #endif
>
> - if (nr_pages > ARRAY_SIZE(frame_list))
> - nr_pages = ARRAY_SIZE(frame_list);
> + if (nr_pages > (ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE))
> + nr_pages = ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE;
>
> + frame_idx = 0;
> for (i = 0; i < nr_pages; i++) {
> page = alloc_page(gfp);
> if (page == NULL) {
> @@ -426,9 +489,12 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> }
> scrub_page(page);
>
> - frame_list[i] = page_to_pfn(page);
> + ret = xen_apply_to_page(page, set_frame, &frame_idx);
> + BUG_ON(ret);
> }
>
> + nr_frames = nr_pages * XEN_PFN_PER_PAGE;
> +
> /*
> * Ensure that ballooned highmem pages don't have kmaps.
> *
> @@ -439,22 +505,19 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> kmap_flush_unused();
>
> /* Update direct mapping, invalidate P2M, and add to balloon. */
> + frame_idx = 0;
> for (i = 0; i < nr_pages; i++) {
> - pfn = frame_list[i];
> - frame_list[i] = pfn_to_mfn(pfn);
> - page = pfn_to_page(pfn);
> + /*
> + * The Xen PFN for a given Linux Page are contiguous in
> + * frame_list
> + */
> + pfn = frame_list[frame_idx];
> + page = xen_pfn_to_page(pfn);
>
> -#ifdef CONFIG_XEN_HAVE_PVMMU
> - if (!xen_feature(XENFEAT_auto_translated_physmap)) {
> - if (!PageHighMem(page)) {
> - ret = HYPERVISOR_update_va_mapping(
> - (unsigned long)__va(pfn << PAGE_SHIFT),
> - __pte_ma(0), 0);
> - BUG_ON(ret);
> - }
> - __set_phys_to_machine(pfn, INVALID_P2M_ENTRY);
> - }
> -#endif
> +
> + ret = xen_apply_to_page(page, balloon_remove_mapping,
> + &frame_idx);
> + BUG_ON(ret);
>
> balloon_append(page);
> }
> @@ -462,9 +525,9 @@ static enum bp_state decrease_reservation(unsigned long nr_pages, gfp_t gfp)
> flush_tlb_all();
>
> set_xen_guest_handle(reservation.extent_start, frame_list);
> - reservation.nr_extents = nr_pages;
> + reservation.nr_extents = nr_frames;
> ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
> - BUG_ON(ret != nr_pages);
> + BUG_ON(ret != nr_frames);
>
> balloon_stats.current_pages -= nr_pages;
>
> --
> 2.1.4
>

2015-07-17 14:33:34

by Julien Grall

[permalink] [raw]
Subject: Re: [PATCH v2 12/20] xen/balloon: Don't rely on the page granularity is the same for Xen and Linux

Hi Stefano,

On 17/07/15 15:03, Stefano Stabellini wrote:
>> ---
>> drivers/xen/balloon.c | 147 +++++++++++++++++++++++++++++++++++---------------
>> 1 file changed, 105 insertions(+), 42 deletions(-)
>>
>> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
>> index fd93369..19a72b1 100644
>> --- a/drivers/xen/balloon.c
>> +++ b/drivers/xen/balloon.c
>> @@ -230,6 +230,7 @@ static enum bp_state reserve_additional_memory(long credit)
>> nid = memory_add_physaddr_to_nid(hotplug_start_paddr);
>>
>> #ifdef CONFIG_XEN_HAVE_PVMMU
>> + /* TODO */
>
> I think you need to be more verbose than that: TODO what?

It was there to remind me to fix reserve_additional_memory. I did it and
forgot to remove the TODO when I cleaned up.

I will drop it in the next version.

[...]

>> static enum bp_state increase_reservation(unsigned long nr_pages)
>> {
>> int rc;
>> - unsigned long pfn, i;
>> + unsigned long i, frame_idx;
>> struct page *page;
>> struct xen_memory_reservation reservation = {
>> .address_bits = 0,
>> @@ -343,44 +406,43 @@ static enum bp_state increase_reservation(unsigned long nr_pages)
>> }
>> #endif
>>
>> - if (nr_pages > ARRAY_SIZE(frame_list))
>> - nr_pages = ARRAY_SIZE(frame_list);
>> + if (nr_pages > (ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE))
>> + nr_pages = ARRAY_SIZE(frame_list) / XEN_PFN_PER_PAGE;
>>
>> + frame_idx = 0;
>> page = list_first_entry_or_null(&ballooned_pages, struct page, lru);
>> for (i = 0; i < nr_pages; i++) {
>> if (!page) {
>> nr_pages = i;
>> break;
>> }
>> - frame_list[i] = page_to_pfn(page);
>> +
>> + rc = xen_apply_to_page(page, set_frame, &frame_idx);
>> +
>> page = balloon_next_page(page);
>> }
>>
>> set_xen_guest_handle(reservation.extent_start, frame_list);
>> - reservation.nr_extents = nr_pages;
>> + reservation.nr_extents = nr_pages * XEN_PFN_PER_PAGE;
>> rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &reservation);
>> if (rc <= 0)
>> return BP_EAGAIN;
>>
>> - for (i = 0; i < rc; i++) {
>> + /* rc is equal to the number of Xen page populated */
>> + nr_pages = rc / XEN_PFN_PER_PAGE;
>
> Here we are purposely ignoring any spares (rc % XEN_PFN_PER_PAGE).
> Instead of leaking them, maybe we should give them back to Xen since we
> cannot use them?

I will have a look at doing it.
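
(A rough sketch of what giving the spares back could look like, assuming
an auto-translated guest where the gpfns in frame_list[] can be handed
straight to XENMEM_decrease_reservation; the PV case would also need the
usual p2m handling:)

	/* rc 4KB frames were populated; only whole Linux pages are usable */
	nr_pages = rc / XEN_PFN_PER_PAGE;

	if (rc % XEN_PFN_PER_PAGE) {
		struct xen_memory_reservation giveback = {
			.address_bits = 0,
			.extent_order = 0,
			.domid        = DOMID_SELF,
		};
		int spare = rc % XEN_PFN_PER_PAGE;

		/* The spare frames are the tail of frame_list[] */
		set_xen_guest_handle(giveback.extent_start,
				     &frame_list[rc - spare]);
		giveback.nr_extents = spare;
		(void)HYPERVISOR_memory_op(XENMEM_decrease_reservation,
					   &giveback);
	}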

>> + for (i = 0; i < nr_pages; i++) {
>> page = balloon_retrieve(false);
>> BUG_ON(page == NULL);
>>
>> - pfn = page_to_pfn(page);
>> -
>> #ifdef CONFIG_XEN_HAVE_PVMMU
>> + frame_idx = 0;
>
> Shouldn't this be before the beginning of the loop above?

Hmmmm... Yes. Note that I only compile tested on x86; it would be good
if someone tests on real hardware at some point (I don't have any x86 Xen
setup).

Regards,

--
Julien Grall

2015-07-17 14:45:59

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On 17/07/15 14:20, Stefano Stabellini wrote:
> We would have to run some benchmarks, but I think it would still be a
> win. We should write an ad-hoc __pfn_to_mfn translation function that
> operates on a range of pfns and simply checks whether an entry is
> present in that range. It should be just as fast as __pfn_to_mfn. I
> would definitely recommend it.

I'd like to see basic 64KB support on Xen pushed into Linux upstream
before looking at possible improvements in the code. Can we defer this
as a follow-up to this series?

Regards,

--
Julien Grall

2015-07-17 14:46:42

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On Fri, 17 Jul 2015, Julien Grall wrote:
> On 17/07/15 14:20, Stefano Stabellini wrote:
> > We would have to run some benchmarks, but I think it would still be a
> > win. We should write an ad-hoc __pfn_to_mfn translation function that
> > operates on a range of pfns and simply checks whether an entry is
> > present in that range. It should be just as fast as __pfn_to_mfn. I
> > would definitely recommend it.
>
> I'd like to see basic 64KB support on Xen pushed into Linux upstream
> before looking at possible improvements in the code. Can we defer this
> as a follow-up to this series?

Yes, maybe add a TODO comment in the code.

2015-07-17 14:47:12

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 09/20] xen/biomerge: Don't allow biovec to be merge when Linux is not using 4KB page

On 17/07/15 15:45, Stefano Stabellini wrote:
> On Fri, 17 Jul 2015, Julien Grall wrote:
>> On 17/07/15 14:20, Stefano Stabellini wrote:
>>> We would have to run some benchmarks, but I think it would still be a
>>> win. We should write an ad-hoc __pfn_to_mfn translation function that
>>> operates on a range of pfns and simply checks whether an entry is
>>> present in that range. It should be just as fast as __pfn_to_mfn. I
>>> would definitely recommend it.
>>
>> I'd like to see basic 64KB support on Xen pushed into Linux upstream
>> before looking at possible improvements in the code. Can we defer this
>> as a follow-up to this series?
>
> Yes, maybe add a TODO comment in the code.

Will do.

Regards,

--
Julien Grall

2015-07-20 17:27:08

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 17/20] net/xen-netfront: Make it running on 64KB page granularity

Hi,

On 09/07/15 21:42, Julien Grall wrote:
> +static void xennet_make_one_txreq(unsigned long mfn, unsigned int offset,
> + unsigned int *len, void *data)
> +{
> + struct xennet_gnttab_make_txreq *info = data;
> +
> + info->tx->flags |= XEN_NETTXF_more_data;
> + skb_get(info->skb);
> + xennet_make_one_txreq(mfn, offset, len, data);

This should be xennet_tx_setup_grant rather than calling itself. I made
the mistake while cleaning up the code, sorry.
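
(In other words, the hunk was meant to read as below; only the last call
changes, the rest is as quoted above:)

static void xennet_make_one_txreq(unsigned long mfn, unsigned int offset,
				  unsigned int *len, void *data)
{
	struct xennet_gnttab_make_txreq *info = data;

	info->tx->flags |= XEN_NETTXF_more_data;
	skb_get(info->skb);
	xennet_tx_setup_grant(mfn, offset, len, data);
}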

Regards,

--
Julien Grall

2015-07-20 17:56:23

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 00/20] xen/arm64: Add support for 64KB page

On 09/07/15 21:42, Julien Grall wrote:
> Average betwen 10 iperf :
>
> DOM0 Guest Result
>
> 4KB-mod 64KB 3.176 Gbits/sec
> 4KB-mod 4KB-mod 3.245 Gbits/sec
> 4KB-mod 4KB 3.258 Gbits/sec
> 4KB 4KB 3.292 Gbits/sec
> 4KB 4KB-mod 3.265 Gbits/sec
> 4KB 64KB 3.189 Gbits/sec
>
> 4KB-mod: Linux with the 64KB patch series
> 4KB: linux/master
>
> The network performance is slightly worst with this series (-0.15%). I suspect,
> this is because of using an indirection to setup the grant. This is necessary
> in order to ensure that the grant will be correctly sized no matter of the
> Linux page granularity. This could be used later in order to support bigger
> grant.

I didn't compute the result correctly. It's -1.5% and not -0.15%, sorry.
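
(Assuming the comparison is the modified kernel on both sides against the
unmodified one on both sides: (3.245 - 3.292) / 3.292 is about -1.4%,
i.e. roughly the -1.5% quoted here; a -0.15% drop would have meant a gap
of only about 0.005 Gbits/sec.)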

--
Julien Grall

2015-07-21 09:54:58

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH v2 05/20] block/xen-blkfront: Split blkif_queue_request in 2

El 09/07/15 a les 22.42, Julien Grall ha escrit:
> Currently, blkif_queue_request has 2 distinct execution path:
> - Send a discard request
> - Send a read/write request
>
> The function is also allocating grants to use for generating the
> request. Although, this is only used for read/write request.
>
> Rather than having a function with 2 distinct execution path, separate
> the function in 2. This will also remove one level of tabulation.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Roger Pau Monné <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>

Patch looks fine, although with so many indentation changes it's kind of
hard to review.

Acked-by: Roger Pau Monné <[email protected]>

Just one minor change below.

[...]

> @@ -595,6 +603,24 @@ static int blkif_queue_request(struct request *req)
> return 0;
> }
>
> +/*
> + * Generate a Xen blkfront IO request from a blk layer request. Reads
> + * and writes are handled as expected.
> + *
> + * @req: a request struct
> + */
> +static int blkif_queue_request(struct request *req)
> +{
> + struct blkfront_info *info = req->rq_disk->private_data;
> +
> + if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> + return 1;
> +
> + if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE)))
> + return blkif_queue_discard_req(req);
> + else
> + return blkif_queue_rw_req(req);

There's no need for the else clause.

Roger.

2015-07-21 10:16:14

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] block/xen-blkfront: Store a page rather a pfn in the grant structure

El 09/07/15 a les 22.42, Julien Grall ha escrit:
> All the usage of the field pfn are done using the same idiom:
>
> pfn_to_page(grant->pfn)
>
> This will return always the same page. Store directly the page in the
> grant to clean up the code.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Roger Pau Monné <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>

Acked-by: Roger Pau Monné <[email protected]>

With one style fix.

[...]

> static struct grant *get_grant(grant_ref_t *gref_head,
> - unsigned long pfn,
> + struct page *page,

Indentation.

Roger.

2015-07-21 10:30:12

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] block/xen-blkfront: split get_grant in 2

El 09/07/15 a les 22.42, Julien Grall ha escrit:
> Prepare the code to support 64KB page granularity. The first
> implementation will use a full Linux page per indirect and persistent
> grant. When non-persistent grant is used, each page of a bio request
> may be split in multiple grant.
>
> Furthermore, the field page of the grant structure is only used to copy
> data from persistent grant or indirect grant. Avoid to set it for other
> use case as it will have no meaning given the page will be split in
> multiple grant.
>
> Provide 2 functions, to setup indirect grant, the other for bio page.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Roger Pau Monné <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> Changes in v2:
> - Patch added
> ---
> drivers/block/xen-blkfront.c | 85 ++++++++++++++++++++++++++++++--------------
> 1 file changed, 59 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 7b81d23..95fd067 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -242,34 +242,77 @@ out_of_memory:
> return -ENOMEM;
> }
>
> -static struct grant *get_grant(grant_ref_t *gref_head,
> - struct page *page,
> - struct blkfront_info *info)
> +static struct grant *get_free_grant(struct blkfront_info *info)
> {
> struct grant *gnt_list_entry;
> - unsigned long buffer_mfn;
>
> BUG_ON(list_empty(&info->grants));
> gnt_list_entry = list_first_entry(&info->grants, struct grant,
> - node);
> + node);

Stray change?

> list_del(&gnt_list_entry->node);
>
> - if (gnt_list_entry->gref != GRANT_INVALID_REF) {
> + if (gnt_list_entry->gref != GRANT_INVALID_REF)
> info->persistent_gnts_c--;
> +
> + return gnt_list_entry;
> +}
> +
> +static void grant_foreign_access(const struct grant *gnt_list_entry,
> + const struct blkfront_info *info)

Given that this is just a wrapper I would make it an inline function, or
even consider removing it and just call
gnttab_page_grant_foreign_access_ref directly.

> +{
> + gnttab_page_grant_foreign_access_ref(gnt_list_entry->gref,
> + info->xbdev->otherend_id,
> + gnt_list_entry->page,
> + 0);
> +}
> +
> +static struct grant *get_grant(grant_ref_t *gref_head,
> + unsigned long mfn,
> + struct blkfront_info *info)

Indentation.

> +{
> + struct grant *gnt_list_entry = get_free_grant(info);
> +
> + if (gnt_list_entry->gref != GRANT_INVALID_REF)
> return gnt_list_entry;
> +
> + /* Assign a gref to this page */
> + gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
> + BUG_ON(gnt_list_entry->gref == -ENOSPC);
> + if (info->feature_persistent)
> + grant_foreign_access(gnt_list_entry, info);
> + else {
> + /* Grant access to the MFN passed by the caller */
> + gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
> + info->xbdev->otherend_id,
> + mfn, 0);
> }
>
> + return gnt_list_entry;
> +}
> +
> +static struct grant *get_indirect_grant(grant_ref_t *gref_head,
> + struct blkfront_info *info)

Indentation.

Roger.

2015-07-21 11:06:49

by Roger Pau Monne

[permalink] [raw]
Subject: Re: [PATCH v2 15/20] block/xen-blkfront: Make it running on 64KB page granularity

El 09/07/15 a les 22.42, Julien Grall ha escrit:
> From: Julien Grall <[email protected]>
>
> The PV block protocol is using 4KB page granularity. The goal of this
> patch is to allow a Linux using 64KB page granularity using block
> device on a non-modified Xen.
>
> The block API is using segment which should at least be the size of a
^ segments
> Linux page. Therefore, the driver will have to break the page in chunk
chunks ^
> of 4K before giving the page to the backend.
>
> Breaking a 64KB segment in 4KB chunk will result to have some chunk with
> no data.

I would rewrite this as:

Breaking a 64KB page into 4KB chunks can result in chunks with no data.

> As the PV protocol always require to have data in the chunk, we
> have to count the number of Xen page which will be in use and avoid to
^ pages remove the "to" ^
> sent empty chunk.
^ sending empty chunks
>
> Note that, a pre-defined number of grant is reserved before preparing
^ no coma ^ grants are
> the request. This pre-defined number is based on the number and the
> maximum size of the segments. If each segment contain a very small
^ contains
> amount of data, the driver may reserve too much grant (16 grant is
^ many grants? ^ grants are
> reserved per segment with 64KB page granularity).
>
> Futhermore, in the case of persistent grant we allocate one Linux page
^ Furthermore ^ case of using persistent grants
> per grant although only the 4KB of the page will be effectively use.
^ initial ^ used.
> This could be improved by share the page with multiple grants.
^ sharing the page between
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Roger Pau Monné <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>

This looks much better now, thanks.

Acked-by: Roger Pau Monné <[email protected]>

> ---
>
> Improvement such as support 64KB grant is not taken into consideration in
> this patch because we have the requirement to run a Linux using 64KB page
> on a non-modified Xen.
>
> Changes in v2:
> - Use gnttab_foreach_grant to split a Linux page into grant
> ---
> drivers/block/xen-blkfront.c | 304 ++++++++++++++++++++++++++++---------------
> 1 file changed, 198 insertions(+), 106 deletions(-)
>
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 95fd067..644ba76 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -77,6 +77,7 @@ struct blk_shadow {
> struct grant **grants_used;
> struct grant **indirect_grants;
> struct scatterlist *sg;
> + unsigned int num_sg;
> };
>
> struct split_bio {
> @@ -106,8 +107,8 @@ static unsigned int xen_blkif_max_ring_order;
> module_param_named(max_ring_page_order, xen_blkif_max_ring_order, int, S_IRUGO);
> MODULE_PARM_DESC(max_ring_page_order, "Maximum order of pages to be used for the shared ring");
>
> -#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, PAGE_SIZE * (info)->nr_ring_pages)
> -#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, PAGE_SIZE * XENBUS_MAX_RING_PAGES)
> +#define BLK_RING_SIZE(info) __CONST_RING_SIZE(blkif, XEN_PAGE_SIZE * (info)->nr_ring_pages)
> +#define BLK_MAX_RING_SIZE __CONST_RING_SIZE(blkif, XEN_PAGE_SIZE * XENBUS_MAX_RING_PAGES)
> /*
> * ring-ref%i i=(-1UL) would take 11 characters + 'ring-ref' is 8, so 19
> * characters are enough. Define to 20 to keep consist with backend.
> @@ -146,6 +147,7 @@ struct blkfront_info
> unsigned int discard_granularity;
> unsigned int discard_alignment;
> unsigned int feature_persistent:1;
> + /* Number of 4K segment handled */
^ segments
> unsigned int max_indirect_segments;
> int is_ready;
> };
> @@ -173,10 +175,19 @@ static DEFINE_SPINLOCK(minor_lock);
>
> #define DEV_NAME "xvd" /* name in /dev */
>
> -#define SEGS_PER_INDIRECT_FRAME \
> - (PAGE_SIZE/sizeof(struct blkif_request_segment))
> -#define INDIRECT_GREFS(_segs) \
> - ((_segs + SEGS_PER_INDIRECT_FRAME - 1)/SEGS_PER_INDIRECT_FRAME)
> +/*
> + * Xen use 4K pages. The guest may use different page size (4K or 64K)
> + * Number of Xen pages per segment
> + */
> +#define XEN_PAGES_PER_SEGMENT (PAGE_SIZE / XEN_PAGE_SIZE)
> +
> +#define SEGS_PER_INDIRECT_FRAME \
> + (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment) / XEN_PAGES_PER_SEGMENT)
> +#define XEN_PAGES_PER_INDIRECT_FRAME \
> + (XEN_PAGE_SIZE/sizeof(struct blkif_request_segment))
> +
> +#define INDIRECT_GREFS(_pages) \
> + ((_pages + XEN_PAGES_PER_INDIRECT_FRAME - 1)/XEN_PAGES_PER_INDIRECT_FRAME)
>
> static int blkfront_setup_indirect(struct blkfront_info *info);
>
> @@ -463,14 +474,100 @@ static int blkif_queue_discard_req(struct request *req)
> return 0;
> }
>
> +struct setup_rw_req {
> + unsigned int grant_idx;
> + struct blkif_request_segment *segments;
> + struct blkfront_info *info;
> + struct blkif_request *ring_req;
> + grant_ref_t gref_head;
> + unsigned int id;
> + /* Only used when persistent grant is used and it's a read request */
> + bool need_copy;
> + unsigned int bvec_off;
> + char *bvec_data;
> +};
> +
> +static void blkif_setup_rw_req_grant(unsigned long mfn, unsigned int offset,
> + unsigned int *len, void *data)
> +{
> + struct setup_rw_req *setup = data;
> + int n, ref;
> + struct grant *gnt_list_entry;
> + unsigned int fsect, lsect;
> + /* Convenient aliases */
> + unsigned int grant_idx = setup->grant_idx;
> + struct blkif_request *ring_req = setup->ring_req;
> + struct blkfront_info *info = setup->info;
> + struct blk_shadow *shadow = &info->shadow[setup->id];
> +
> + if ((ring_req->operation == BLKIF_OP_INDIRECT) &&
> + (grant_idx % XEN_PAGES_PER_INDIRECT_FRAME == 0)) {
> + if (setup->segments)
> + kunmap_atomic(setup->segments);
> +
> + n = grant_idx / XEN_PAGES_PER_INDIRECT_FRAME;
> + gnt_list_entry = get_indirect_grant(&setup->gref_head, info);
> + shadow->indirect_grants[n] = gnt_list_entry;
> + setup->segments = kmap_atomic(gnt_list_entry->page);
> + ring_req->u.indirect.indirect_grefs[n] = gnt_list_entry->gref;
> + }
> +
> + gnt_list_entry = get_grant(&setup->gref_head, mfn, info);
> + ref = gnt_list_entry->gref;
> + shadow->grants_used[grant_idx] = gnt_list_entry;
> +
> + if (setup->need_copy) {
> + void *shared_data;
> +
> + shared_data = kmap_atomic(gnt_list_entry->page);
> + /*
> + * this does not wipe data stored outside the
> + * range sg->offset..sg->offset+sg->length.
> + * Therefore, blkback *could* see data from
> + * previous requests. This is OK as long as
> + * persistent grants are shared with just one
> + * domain. It may need refactoring if this
> + * changes
> + */
> + memcpy(shared_data + offset,
> + setup->bvec_data + setup->bvec_off,
> + *len);
> +
> + kunmap_atomic(shared_data);
> + setup->bvec_off += *len;
> + }
> +
> + fsect = offset >> 9;
> + lsect = fsect + (*len >> 9) - 1;
> + if (ring_req->operation != BLKIF_OP_INDIRECT) {
> + ring_req->u.rw.seg[grant_idx] =
> + (struct blkif_request_segment) {
> + .gref = ref,
> + .first_sect = fsect,
> + .last_sect = lsect };
> + } else {
> + setup->segments[grant_idx % XEN_PAGES_PER_INDIRECT_FRAME] =
> + (struct blkif_request_segment) {
> + .gref = ref,
> + .first_sect = fsect,
> + .last_sect = lsect };
> + }
> +
> + (setup->grant_idx)++;
> +}
> +
> static int blkif_queue_rw_req(struct request *req)
> {
> struct blkfront_info *info = req->rq_disk->private_data;
> struct blkif_request *ring_req;
> unsigned long id;
> - unsigned int fsect, lsect;
> - int i, ref, n;
> - struct blkif_request_segment *segments = NULL;
> + int i;
> + struct setup_rw_req setup = {
> + .grant_idx = 0,
> + .segments = NULL,
> + .info = info,
> + .need_copy = rq_data_dir(req) && info->feature_persistent,
> + };
>
> /*
> * Used to store if we are able to queue the request by just using
> @@ -478,25 +575,23 @@ static int blkif_queue_rw_req(struct request *req)
> * as there are not sufficiently many free.
> */
> bool new_persistent_gnts;
> - grant_ref_t gref_head;
> - struct grant *gnt_list_entry = NULL;
> struct scatterlist *sg;
> - int nseg, max_grefs;
> + int nseg, max_grefs, nr_page;
>
> - max_grefs = req->nr_phys_segments;
> + max_grefs = req->nr_phys_segments * XEN_PAGES_PER_SEGMENT;
> if (max_grefs > BLKIF_MAX_SEGMENTS_PER_REQUEST)
> /*
> * If we are using indirect segments we need to account
> * for the indirect grefs used in the request.
> */
> - max_grefs += INDIRECT_GREFS(req->nr_phys_segments);
> + max_grefs += INDIRECT_GREFS(req->nr_phys_segments * XEN_PAGES_PER_SEGMENT);
>
> /* Check if we have enough grants to allocate a requests */
> if (info->persistent_gnts_c < max_grefs) {
> new_persistent_gnts = 1;
> if (gnttab_alloc_grant_references(
> max_grefs - info->persistent_gnts_c,
> - &gref_head) < 0) {
> + &setup.gref_head) < 0) {
> gnttab_request_free_callback(
> &info->callback,
> blkif_restart_queue_callback,
> @@ -513,12 +608,18 @@ static int blkif_queue_rw_req(struct request *req)
> info->shadow[id].request = req;
>
> BUG_ON(info->max_indirect_segments == 0 &&
> - req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
> + (XEN_PAGES_PER_SEGMENT * req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
> BUG_ON(info->max_indirect_segments &&
> - req->nr_phys_segments > info->max_indirect_segments);
> + (req->nr_phys_segments * XEN_PAGES_PER_SEGMENT) > info->max_indirect_segments);
> nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
> + nr_page = 0;
> + /* Calculate the number of Xen pages used */
> + for_each_sg(info->shadow[id].sg, sg, nseg, i) {
> + nr_page += (round_up(sg->offset + sg->length, XEN_PAGE_SIZE) - round_down(sg->offset, XEN_PAGE_SIZE)) >> XEN_PAGE_SHIFT;

I haven't counted the characters, but this line looks too long, also you
can get rid of the braces since it's a single line statement.

Roger.

2015-07-21 11:13:04

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 05/20] block/xen-blkfront: Split blkif_queue_request in 2

Hi Roger,

On 21/07/15 10:54, Roger Pau Monné wrote:
> El 09/07/15 a les 22.42, Julien Grall ha escrit:
>> Currently, blkif_queue_request has 2 distinct execution path:
>> - Send a discard request
>> - Send a read/write request
>>
>> The function is also allocating grants to use for generating the
>> request. Although, this is only used for read/write request.
>>
>> Rather than having a function with 2 distinct execution path, separate
>> the function in 2. This will also remove one level of tabulation.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Roger Pau Monné <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>
> Patch looks fine, although with so many indentation changes it's kind of
> hard to review.

I wasn't sure how to make this patch easier to review and it seems
like diff is getting confused.

It's mostly removing one indentation layer (the if (req->cmd_flags ...))
and moving the discard code into a separate function.

> Acked-by: Roger Pau Monné <[email protected]>

Thank you.

> Just one minor change below.
>
> [...]
>
>> @@ -595,6 +603,24 @@ static int blkif_queue_request(struct request *req)
>> return 0;
>> }
>>
>> +/*
>> + * Generate a Xen blkfront IO request from a blk layer request. Reads
>> + * and writes are handled as expected.
>> + *
>> + * @req: a request struct
>> + */
>> +static int blkif_queue_request(struct request *req)
>> +{
>> + struct blkfront_info *info = req->rq_disk->private_data;
>> +
>> + if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
>> + return 1;
>> +
>> + if (unlikely(req->cmd_flags & (REQ_DISCARD | REQ_SECURE)))
>> + return blkif_queue_discard_req(req);
>> + else
>> + return blkif_queue_rw_req(req);
>
> There's no need for the else clause.

I find it more readable and easier to understand than:

	if ( ... )
		return ...;
	return ...;

when there is only one line in the else. IIRC, the resulting assembly
will be the same.

Anyway, I can drop the else if you really want.

Regards,

--
Julien Grall

2015-07-21 11:20:43

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 06/20] block/xen-blkfront: Store a page rather a pfn in the grant structure

Hi Roger,

On 21/07/15 11:16, Roger Pau Monné wrote:
> El 09/07/15 a les 22.42, Julien Grall ha escrit:
>> All the usage of the field pfn are done using the same idiom:
>>
>> pfn_to_page(grant->pfn)
>>
>> This will return always the same page. Store directly the page in the
>> grant to clean up the code.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Roger Pau Monné <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>
> Acked-by: Roger Pau Monné <[email protected]>
>
> With one style fix.
>
> [...]
>
>> static struct grant *get_grant(grant_ref_t *gref_head,
>> - unsigned long pfn,
>> + struct page *page,
>
> Indentation.

The indentation for the parameters of this function wasn't correct:

static struct grant *get_grant(grant_ref_t *gref_head,
- unsigned long pfn,
+^I^I^I struct page *page,
struct blkfront_info *info)

So "struct page *page" is correctly indent but not the remaining
parameter ("struct blkfront_info *info"). I will indent it correctly.

Regards,

--
Julien Grall

2015-07-21 13:05:13

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 07/20] block/xen-blkfront: split get_grant in 2

Hi,

On 21/07/15 11:30, Roger Pau Monné wrote:
> El 09/07/15 a les 22.42, Julien Grall ha escrit:
>> Prepare the code to support 64KB page granularity. The first
>> implementation will use a full Linux page per indirect and persistent
>> grant. When non-persistent grant is used, each page of a bio request
>> may be split in multiple grant.
>>
>> Furthermore, the field page of the grant structure is only used to copy
>> data from persistent grant or indirect grant. Avoid to set it for other
>> use case as it will have no meaning given the page will be split in
>> multiple grant.
>>
>> Provide 2 functions, to setup indirect grant, the other for bio page.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Roger Pau Monné <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>> ---
>> Changes in v2:
>> - Patch added
>> ---
>> drivers/block/xen-blkfront.c | 85 ++++++++++++++++++++++++++++++--------------
>> 1 file changed, 59 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 7b81d23..95fd067 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -242,34 +242,77 @@ out_of_memory:
>> return -ENOMEM;
>> }
>>
>> -static struct grant *get_grant(grant_ref_t *gref_head,
>> - struct page *page,
>> - struct blkfront_info *info)
>> +static struct grant *get_free_grant(struct blkfront_info *info)
>> {
>> struct grant *gnt_list_entry;
>> - unsigned long buffer_mfn;
>>
>> BUG_ON(list_empty(&info->grants));
>> gnt_list_entry = list_first_entry(&info->grants, struct grant,
>> - node);
>> + node);
>
> Stray change?

No, the indentation was wrong before.

>
>> list_del(&gnt_list_entry->node);
>>
>> - if (gnt_list_entry->gref != GRANT_INVALID_REF) {
>> + if (gnt_list_entry->gref != GRANT_INVALID_REF)
>> info->persistent_gnts_c--;
>> +
>> + return gnt_list_entry;
>> +}
>> +
>> +static void grant_foreign_access(const struct grant *gnt_list_entry,
>> + const struct blkfront_info *info)
>
> Given that this is just a wrapper I would make it an inline function, or
> even consider removing it and just call
> gnttab_page_grant_foreign_access_ref directly.

I prefer to keep the helper as it's used in two places and makes the
caller function more readable.

Most compilers will try to inline even without the inline keyword. This
is the case for gcc, where both grant_foreign_access and even get_grant
are in fact inlined. If you really want, I can add the inline.

>> +{
>> + gnttab_page_grant_foreign_access_ref(gnt_list_entry->gref,
>> + info->xbdev->otherend_id,
>> + gnt_list_entry->page,
>> + 0);
>> +}
>> +
>> +static struct grant *get_grant(grant_ref_t *gref_head,
>> + unsigned long mfn,
>> + struct blkfront_info *info)
>
> Indentation.

Why? The indentation is valid, using tabs of 8 as requested by the
coding style, and will be aligned correctly in your editor. This may not
be the case in your mail reader.

>
>> +{
>> + struct grant *gnt_list_entry = get_free_grant(info);
>> +
>> + if (gnt_list_entry->gref != GRANT_INVALID_REF)
>> return gnt_list_entry;
>> +
>> + /* Assign a gref to this page */
>> + gnt_list_entry->gref = gnttab_claim_grant_reference(gref_head);
>> + BUG_ON(gnt_list_entry->gref == -ENOSPC);
>> + if (info->feature_persistent)
>> + grant_foreign_access(gnt_list_entry, info);
>> + else {
>> + /* Grant access to the MFN passed by the caller */
>> + gnttab_grant_foreign_access_ref(gnt_list_entry->gref,
>> + info->xbdev->otherend_id,
>> + mfn, 0);
>> }
>>
>> + return gnt_list_entry;
>> +}
>> +
>> +static struct grant *get_indirect_grant(grant_ref_t *gref_head,
>> + struct blkfront_info *info)
>
> Indentation.

Ditto.

Regards,

--
Julien Grall

2015-07-21 13:09:18

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 15/20] block/xen-blkfront: Make it running on 64KB page granularity

Hi Roger,

On 21/07/15 12:06, Roger Pau Monné wrote:
> El 09/07/15 a les 22.42, Julien Grall ha escrit:
>> From: Julien Grall <[email protected]>
>>
>> The PV block protocol is using 4KB page granularity. The goal of this
>> patch is to allow a Linux using 64KB page granularity using block
>> device on a non-modified Xen.
>>
>> The block API is using segment which should at least be the size of a
> ^ segments

Will fix it.

>> Linux page. Therefore, the driver will have to break the page in chunk
> chunks ^
>> of 4K before giving the page to the backend.
>>
>> Breaking a 64KB segment in 4KB chunk will result to have some chunk with
>> no data.
>
> I would rewrite this as:
>
> Breaking a 64KB page into 4KB chunks can result in chunks with no data.

Sounds good.

>> As the PV protocol always require to have data in the chunk, we
>> have to count the number of Xen page which will be in use and avoid to
> ^ pages remove the "to" ^
>> sent empty chunk.
> ^ sending empty chunks
>>
>> Note that, a pre-defined number of grant is reserved before preparing
> ^ no coma ^ grants are
>> the request. This pre-defined number is based on the number and the
>> maximum size of the segments. If each segment contain a very small
> ^ contains
>> amount of data, the driver may reserve too much grant (16 grant is
> ^ many grants? ^ grants are
>> reserved per segment with 64KB page granularity).
>>
>> Futhermore, in the case of persistent grant we allocate one Linux page
> ^ Furthermore ^ case of using persistent grants
>> per grant although only the 4KB of the page will be effectively use.
> ^ initial ^ used.
>> This could be improved by share the page with multiple grants.
> ^ sharing the page between

Will fix all the typos and grammatical errors in the next version.

>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Roger Pau Monné <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>
> This looks much better now, thanks.
>
> Acked-by: Roger Pau Monné <[email protected]>

Thank you!

>> BUG_ON(info->max_indirect_segments == 0 &&
>> - req->nr_phys_segments > BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> + (XEN_PAGES_PER_SEGMENT * req->nr_phys_segments) > BLKIF_MAX_SEGMENTS_PER_REQUEST);
>> BUG_ON(info->max_indirect_segments &&
>> - req->nr_phys_segments > info->max_indirect_segments);
>> + (req->nr_phys_segments * XEN_PAGES_PER_SEGMENT) > info->max_indirect_segments);
>> nseg = blk_rq_map_sg(req->q, req, info->shadow[id].sg);
>> + nr_page = 0;
>> + /* Calculate the number of Xen pages used */
>> + for_each_sg(info->shadow[id].sg, sg, nseg, i) {
>> + nr_page += (round_up(sg->offset + sg->length, XEN_PAGE_SIZE) - round_down(sg->offset, XEN_PAGE_SIZE)) >> XEN_PAGE_SHIFT;
>
> I haven't counted the characters, but this line looks too long, also you
> can get rid of the braces since it's a single line statement.

I was planning to use gnttab_count_grant, added in patch #3, but forgot to
do it. I will do it in the next version.
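
For reference, a minimal sketch of what that could look like (the exact
signature of gnttab_count_grant is assumed here for illustration; it simply
counts the 4KB Xen pages covered by an offset/length pair, like the
round_up()/round_down() computation above):

        /* Number of 4KB grants needed to cover [start, start + len) */
        static inline unsigned int gnttab_count_grant(unsigned int start,
                                                      unsigned int len)
        {
                return (round_up(start + len, XEN_PAGE_SIZE) -
                        round_down(start, XEN_PAGE_SIZE)) >> XEN_PAGE_SHIFT;
        }

so the loop body would become something like:

                nr_page += gnttab_count_grant(sg->offset, sg->length);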

Regards,

--
Julien Grall

2015-07-23 17:19:02

by Julien Grall

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] block/xen-blkfront: Store a page rather a pfn in the grant structure

Hi Stefano,

On 16/07/15 16:11, Stefano Stabellini wrote:
> On Thu, 9 Jul 2015, Julien Grall wrote:
>> All the usage of the field pfn are done using the same idiom:
>>
>> pfn_to_page(grant->pfn)
>>
>> This will return always the same page. Store directly the page in the
>> grant to clean up the code.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Roger Pau Monné <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>> ---
>> Changes in v2:
>> - Patch added
>> ---
>> drivers/block/xen-blkfront.c | 37 ++++++++++++++++++-------------------
>> 1 file changed, 18 insertions(+), 19 deletions(-)
>>
>> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
>> index 7107d58..7b81d23 100644
>> --- a/drivers/block/xen-blkfront.c
>> +++ b/drivers/block/xen-blkfront.c
>> @@ -67,7 +67,7 @@ enum blkif_state {
>>
>> struct grant {
>> grant_ref_t gref;
>> - unsigned long pfn;
>> + struct page *page;
>> struct list_head node;
>> };
>>
>> @@ -219,7 +219,7 @@ static int fill_grant_buffer(struct blkfront_info *info, int num)
>> kfree(gnt_list_entry);
>> goto out_of_memory;
>> }
>> - gnt_list_entry->pfn = page_to_pfn(granted_page);
>> + gnt_list_entry->page = granted_page;
>> }
>>
>> gnt_list_entry->gref = GRANT_INVALID_REF;
>> @@ -234,7 +234,7 @@ out_of_memory:
>> &info->grants, node) {
>> list_del(&gnt_list_entry->node);
>> if (info->feature_persistent)
>> - __free_page(pfn_to_page(gnt_list_entry->pfn));
>> + __free_page(gnt_list_entry->page);
>> kfree(gnt_list_entry);
>> i--;
>> }
>> @@ -243,7 +243,7 @@ out_of_memory:
>> }
>>
>> static struct grant *get_grant(grant_ref_t *gref_head,
>> - unsigned long pfn,
>> + struct page *page,
>> struct blkfront_info *info)
>
> indentation

The indentation of the parameters wasn't valid. I will indent
"struct blkfront_info *info" correctly rather than keep the invalid
indentation.


> Aside from this:
>
> Reviewed-by: Stefano Stabellini <[email protected]>

Thank you,

Regards,

--
Julien Grall

2015-07-24 09:28:40

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

On 09/07/15 21:42, Julien Grall wrote:
> The Xen hypercall interface is always using 4K page granularity on ARM
> and x86 architecture.
>
> With the incoming support of 64K page granularity for ARM64 guest, it
> won't be possible to re-use the Linux page definition in Xen drivers.
>
> Introduce Xen page definition helpers based on the Linux page
> definition. They have exactly the same name but prefixed with
> XEN_/xen_ prefix.
>
> Also modify page_to_pfn to use new Xen page definition.
>
> Signed-off-by: Julien Grall <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: David Vrabel <[email protected]>
> ---
> I'm wondering if we should drop page_to_pfn has the macro will likely
> misuse when Linux is using 64KB page granularity.

I think we want xen_gfn_to_page() and xen_page_to_gfn() and Xen
front/back drivers never deal with PFNs only GFNs.

David

2015-07-24 09:31:51

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On 09/07/15 21:42, Julien Grall wrote:
> The Xen interface is always using 4KB page. This means that a Linux page
> may be split across multiple Xen page when the page granularity is not
> the same.
>
> This helper will break down a Linux page into 4KB chunk and call the
> helper on each of them.
[...]
> --- a/include/xen/page.h
> +++ b/include/xen/page.h
> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>
> extern unsigned long xen_released_pages;
>
> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
> +
> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
> + void *data)

I think this should be outlined (unless you have measurements that
support making it inlined).

Also perhaps make it

int xen_for_each_gfn(struct page *page,
xen_gfn_fn_t fn, void *data);

or

int xen_for_each_gfn(struct page **page, unsigned int count,
xen_gfn_fn_t fn, void *data);

?

David

2015-07-24 09:36:17

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 04/20] xen/grant: Add helper gnttab_page_grant_foreign_access_ref

On 09/07/15 21:42, Julien Grall wrote:
> Many PV drivers contain the idiom:
>
> pfn = page_to_mfn(...) /* Or similar */
> gnttab_grant_foreign_access_ref
>
> Replace it by a new helper. Note that when Linux is using a different
> page granularity than Xen, the helper only gives access to the first 4KB
> grant.
>
> This is useful where drivers are allocating a full Linux page for each
> grant.
>
> Also include xen/interface/grant_table.h rather than xen/grant_table.h in
> asm/page.h for x86 to fix a compilation issue [1]. Only the former is
> useful in order to get the structure definition.
>
> [1] Interpendency between asm/page.h and xen/grant_table.h which result
> to page_mfn not being defined when necessary.
[...]
> --- a/include/xen/grant_table.h
> +++ b/include/xen/grant_table.h
> @@ -131,6 +131,15 @@ void gnttab_cancel_free_callback(struct gnttab_free_callback *callback);
> void gnttab_grant_foreign_access_ref(grant_ref_t ref, domid_t domid,
> unsigned long frame, int readonly);
>
> +/* Give access to the first 4K of the page */
> +static inline void gnttab_page_grant_foreign_access_ref(
> + grant_ref_t ref, domid_t domid,
> + struct page *page, int readonly)
> +{
> + gnttab_grant_foreign_access_ref(ref, domid, page_to_mfn(page),

Obviously this would use the new xen_page_to_gfn() macro here.

Otherwise,

Reviewed-by: David Vrabel <[email protected]>

David

2015-07-24 09:40:13

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

Hi David,

On 24/07/15 10:28, David Vrabel wrote:
> On 09/07/15 21:42, Julien Grall wrote:
>> The Xen hypercall interface is always using 4K page granularity on ARM
>> and x86 architecture.
>>
>> With the incoming support of 64K page granularity for ARM64 guest, it
>> won't be possible to re-use the Linux page definition in Xen drivers.
>>
>> Introduce Xen page definition helpers based on the Linux page
>> definition. They have exactly the same name but prefixed with
>> XEN_/xen_ prefix.
>>
>> Also modify page_to_pfn to use new Xen page definition.
>>
>> Signed-off-by: Julien Grall <[email protected]>
>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>> Cc: Boris Ostrovsky <[email protected]>
>> Cc: David Vrabel <[email protected]>
>> ---
>> I'm wondering if we should drop page_to_pfn has the macro will likely
>> misuse when Linux is using 64KB page granularity.
>
> I think we want xen_gfn_to_page() and xen_page_to_gfn() and Xen
> front/back drivers never deal with PFNs only GFNs.

What is xen_gfn_to_page and xen_page_to_gfn? Neither Linux, nor my
series have them.

Regards,

--
Julien Grall

2015-07-24 09:48:15

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

On 24/07/15 10:39, Julien Grall wrote:
> Hi David,
>
> On 24/07/15 10:28, David Vrabel wrote:
>> On 09/07/15 21:42, Julien Grall wrote:
>>> The Xen hypercall interface is always using 4K page granularity on ARM
>>> and x86 architecture.
>>>
>>> With the incoming support of 64K page granularity for ARM64 guest, it
>>> won't be possible to re-use the Linux page definition in Xen drivers.
>>>
>>> Introduce Xen page definition helpers based on the Linux page
>>> definition. They have exactly the same name but prefixed with
>>> XEN_/xen_ prefix.
>>>
>>> Also modify page_to_pfn to use new Xen page definition.
>>>
>>> Signed-off-by: Julien Grall <[email protected]>
>>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>>> Cc: Boris Ostrovsky <[email protected]>
>>> Cc: David Vrabel <[email protected]>
>>> ---
>>> I'm wondering if we should drop page_to_pfn has the macro will likely
>>> misuse when Linux is using 64KB page granularity.
>>
>> I think we want xen_gfn_to_page() and xen_page_to_gfn() and Xen
>> front/back drivers never deal with PFNs only GFNs.
>
> What is xen_gfn_to_page and xen_page_to_gfn? Neither Linux, nor my
> series have them.

I suggesting that you introduce these.

David

2015-07-24 09:49:57

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH v2 10/20] xen/xenbus: Use Xen page definition

On 09/07/15 21:42, Julien Grall wrote:
> All the ring (xenstore, and PV rings) are always based on the page
> granularity of Xen.
[...]
> --- a/drivers/xen/xenbus/xenbus_probe.c
> +++ b/drivers/xen/xenbus/xenbus_probe.c
> @@ -713,7 +713,7 @@ static int __init xenstored_local_init(void)
>
> xen_store_mfn = xen_start_info->store_mfn =
> pfn_to_mfn(virt_to_phys((void *)page) >>
> - PAGE_SHIFT);
> + XEN_PAGE_SHIFT);

xen_store_pfn = xen_page_to_gfn(page);

Otherwise,

Reviewed-by: David Vrabel <[email protected]>

David

2015-07-24 09:52:52

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

On 24/07/15 10:48, David Vrabel wrote:
> On 24/07/15 10:39, Julien Grall wrote:
>> Hi David,
>>
>> On 24/07/15 10:28, David Vrabel wrote:
>>> On 09/07/15 21:42, Julien Grall wrote:
>>>> The Xen hypercall interface is always using 4K page granularity on ARM
>>>> and x86 architecture.
>>>>
>>>> With the incoming support of 64K page granularity for ARM64 guest, it
>>>> won't be possible to re-use the Linux page definition in Xen drivers.
>>>>
>>>> Introduce Xen page definition helpers based on the Linux page
>>>> definition. They have exactly the same name but prefixed with
>>>> XEN_/xen_ prefix.
>>>>
>>>> Also modify page_to_pfn to use new Xen page definition.
>>>>
>>>> Signed-off-by: Julien Grall <[email protected]>
>>>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>>>> Cc: Boris Ostrovsky <[email protected]>
>>>> Cc: David Vrabel <[email protected]>
>>>> ---
>>>> I'm wondering if we should drop page_to_pfn has the macro will likely
>>>> misuse when Linux is using 64KB page granularity.
>>>
>>> I think we want xen_gfn_to_page() and xen_page_to_gfn() and Xen
>>> front/back drivers never deal with PFNs only GFNs.
>>
>> What is xen_gfn_to_page and xen_page_to_gfn? Neither Linux, nor my
>> series have them.
>
> I suggesting that you introduce these.

It's still not clear to me what you are suggesting here... Do you
suggest renaming xen_pfn_to_page and xen_page_to_pfn to xen_gfn_to_page
and xen_page_to_gfn?

Regards,

--
Julien Grall

2015-07-24 09:52:44

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH v2 11/20] tty/hvc: xen: Use xen page definition

On 09/07/15 21:42, Julien Grall wrote:
> The console ring is always based on the page granularity of Xen.
[...]
> --- a/drivers/tty/hvc/hvc_xen.c
> +++ b/drivers/tty/hvc/hvc_xen.c
> @@ -392,7 +392,7 @@ static int xencons_connect_backend(struct xenbus_device *dev,
> if (xen_pv_domain())
> mfn = virt_to_mfn(info->intf);
> else
> - mfn = __pa(info->intf) >> PAGE_SHIFT;
> + mfn = __pa(info->intf) >> XEN_PAGE_SHIFT;

Change this to

gfn = xen_page_to_gfn(virt_to_page(info->intf));

and drop the if()?

David

2015-07-24 09:55:24

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On 24/07/15 10:31, David Vrabel wrote:
> On 09/07/15 21:42, Julien Grall wrote:
>> The Xen interface is always using 4KB page. This means that a Linux page
>> may be split across multiple Xen page when the page granularity is not
>> the same.
>>
>> This helper will break down a Linux page into 4KB chunk and call the
>> helper on each of them.
> [...]
>> --- a/include/xen/page.h
>> +++ b/include/xen/page.h
>> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>>
>> extern unsigned long xen_released_pages;
>>
>> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
>> +
>> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
>> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
>> + void *data)
>
> I think this should be outlined (unless you have measurements that
> support making it inlined).

I don't have any performance measurements. However, when Linux is using
4KB page granularity, the loop in this helper is dropped by the compiler.
The resulting code would look like:

unsigned long pfn = xen_page_to_pfn(page);

ret = fn(page, pfn, data);
if (ret)
        return ret;

The compiler could even inline the callback (fn), so it drops 2
function calls.
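
For reference, a sketch of the helper body being discussed
(XEN_PFN_PER_PAGE, i.e. PAGE_SIZE / XEN_PAGE_SIZE, and xen_page_to_pfn are
assumed here from the earlier patches of the series):

        static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
                                            void *data)
        {
                /* First 4KB Xen frame backing this Linux page */
                unsigned long pfn = xen_page_to_pfn(page);
                int i, ret;

                /* One iteration per 4KB chunk; a single one with 4KB Linux pages */
                for (i = 0; i < XEN_PFN_PER_PAGE; i++, pfn++) {
                        ret = fn(page, pfn, data);
                        if (ret)
                                return ret;
                }

                return 0;
        }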

>
> Also perhaps make it
>
> int xen_for_each_gfn(struct page *page,
> xen_gfn_fn_t fn, void *data);

gfn standing for Guest Frame Number right?

> or
>
> int xen_for_each_gfn(struct page **page, unsigned int count,
> xen_gfn_fn_t fn, void *data);

count being the number of Linux pages or Xen pages? We have some code (see
xlate_mmu, patch #19) which needs to iterate over a specific number of Xen
pages. I was thinking of introducing a separate function for iterating over
a specific number of Xen PFNs.

We don't want to introduce it for everyone, as we need to hide this
complexity from most of the callers.

In the general case, the 2 suggestions would not be very useful. Most of the
time we have some actions to do per Linux page (see the balloon code for
instance).

Regards,

--
Julien Grall

2015-07-24 10:10:44

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On 24/07/15 10:54, Julien Grall wrote:
> On 24/07/15 10:31, David Vrabel wrote:
>> On 09/07/15 21:42, Julien Grall wrote:
>>> The Xen interface is always using 4KB page. This means that a Linux page
>>> may be split across multiple Xen page when the page granularity is not
>>> the same.
>>>
>>> This helper will break down a Linux page into 4KB chunk and call the
>>> helper on each of them.
>> [...]
>>> --- a/include/xen/page.h
>>> +++ b/include/xen/page.h
>>> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>>>
>>> extern unsigned long xen_released_pages;
>>>
>>> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
>>> +
>>> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
>>> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
>>> + void *data)
>>
>> I think this should be outlined (unless you have measurements that
>> support making it inlined).
>
> I don't have any performance measurements. Although, when Linux is using
> 4KB page granularity, the loop in this helper will be dropped by the
> helper. The code would look like:
>
> unsigned long pfn = xen_page_to_pfn(page);
>
> ret = fn(page, fn, data);
> if (ret)
> return ret;
>
> The compiler could even inline the callback (fn). So it drops 2
> functions call.

Ok, keep it inlined.

>> Also perhaps make it
>>
>> int xen_for_each_gfn(struct page *page,
>> xen_gfn_fn_t fn, void *data);
>
> gfn standing for Guest Frame Number right?

Yes. This suggestion is just changing the name to make it more obvious
what it does.

David

2015-07-24 10:21:50

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On 24/07/15 11:10, David Vrabel wrote:
> On 24/07/15 10:54, Julien Grall wrote:
>> On 24/07/15 10:31, David Vrabel wrote:
>>> On 09/07/15 21:42, Julien Grall wrote:
>>>> The Xen interface is always using 4KB page. This means that a Linux page
>>>> may be split across multiple Xen page when the page granularity is not
>>>> the same.
>>>>
>>>> This helper will break down a Linux page into 4KB chunk and call the
>>>> helper on each of them.
>>> [...]
>>>> --- a/include/xen/page.h
>>>> +++ b/include/xen/page.h
>>>> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>>>>
>>>> extern unsigned long xen_released_pages;
>>>>
>>>> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
>>>> +
>>>> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
>>>> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
>>>> + void *data)
>>>
>>> I think this should be outlined (unless you have measurements that
>>> support making it inlined).
>>
>> I don't have any performance measurements. Although, when Linux is using
>> 4KB page granularity, the loop in this helper will be dropped by the
>> helper. The code would look like:
>>
>> unsigned long pfn = xen_page_to_pfn(page);
>>
>> ret = fn(page, fn, data);
>> if (ret)
>> return ret;
>>
>> The compiler could even inline the callback (fn). So it drops 2
>> functions call.
>
> Ok, keep it inlined.
>
>>> Also perhaps make it
>>>
>>> int xen_for_each_gfn(struct page *page,
>>> xen_gfn_fn_t fn, void *data);
>>
>> gfn standing for Guest Frame Number right?
>
> Yes. This suggestion is just changing the name to make it more obvious
> what it does.

I will change the name.

Regards,

--
Julien Grall

2015-07-24 10:34:53

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

On 24/07/15 10:51, Julien Grall wrote:
> On 24/07/15 10:48, David Vrabel wrote:
>> On 24/07/15 10:39, Julien Grall wrote:
>>> Hi David,
>>>
>>> On 24/07/15 10:28, David Vrabel wrote:
>>>> On 09/07/15 21:42, Julien Grall wrote:
>>>>> The Xen hypercall interface is always using 4K page granularity on ARM
>>>>> and x86 architecture.
>>>>>
>>>>> With the incoming support of 64K page granularity for ARM64 guest, it
>>>>> won't be possible to re-use the Linux page definition in Xen drivers.
>>>>>
>>>>> Introduce Xen page definition helpers based on the Linux page
>>>>> definition. They have exactly the same name but prefixed with
>>>>> XEN_/xen_ prefix.
>>>>>
>>>>> Also modify page_to_pfn to use new Xen page definition.
>>>>>
>>>>> Signed-off-by: Julien Grall <[email protected]>
>>>>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>>>>> Cc: Boris Ostrovsky <[email protected]>
>>>>> Cc: David Vrabel <[email protected]>
>>>>> ---
>>>>> I'm wondering if we should drop page_to_pfn has the macro will likely
>>>>> misuse when Linux is using 64KB page granularity.
>>>>
>>>> I think we want xen_gfn_to_page() and xen_page_to_gfn() and Xen
>>>> front/back drivers never deal with PFNs only GFNs.
>>>
>>> What is xen_gfn_to_page and xen_page_to_gfn? Neither Linux, nor my
>>> series have them.
>>
>> I suggesting that you introduce these.
>
> It's still not clear to me what you are suggesting here... Do you
> suggest to rename xen_pfn_to_page and xen_page_to_pfn by xen_gfn_to_page
> and xen_page_to_gfn?

Effectively, yes but it would be better to think that:

PFNs index guest-sized pages (which may be 64 KiB).

GFNs index Xen-sized pages (which is always 4 KiB).

David

2015-07-24 10:36:43

by David Vrabel

[permalink] [raw]
Subject: Re: [PATCH v2 13/20] xen/events: fifo: Make it running on 64KB granularity

On 09/07/15 21:42, Julien Grall wrote:
> Only use the first 4KB of the page to store the events channel info. It
> means that we will wast 60KB every time we allocate page for:
> * control block: a page is allocating per CPU
> * event array: a page is allocating everytime we need to expand it

Reviewed-by: David Vrabel <[email protected]>
>
> I think we can reduce the memory waste for the 2 areas by:
>
> * control block: sharing between multiple vCPUs. Although it will
> require some bookkeeping in order to not free the page when the CPU
> goes offline and the other CPUs sharing the page still there
>
> * event array: always extend the array event by 64K (i.e 16 4K
> chunk). That would require more care when we fail to expand the
> event channel.

I would extend it by 4 KiB each time but only allocate a new page every
16 times. This minimizes the resources used in Xen.
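
A minimal sketch of that allocation pattern (hypothetical names, not the
actual fifo code; XEN_PFN_PER_PAGE is PAGE_SIZE / XEN_PAGE_SIZE, i.e. 16
with 64KB Linux pages):

        struct evt_array_alloc {
                void *page;             /* current Linux page backing the array */
                unsigned int nr_chunks; /* 4KB chunks handed out so far */
        };

        static void *next_4k_chunk(struct evt_array_alloc *arr)
        {
                unsigned int idx = arr->nr_chunks % XEN_PFN_PER_PAGE;

                /* Only grab a fresh Linux page once the previous one is used up */
                if (idx == 0) {
                        arr->page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
                        if (!arr->page)
                                return NULL;
                }

                arr->nr_chunks++;
                return arr->page + idx * XEN_PAGE_SIZE;
        }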

David

2015-07-24 10:43:49

by Ian Campbell

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

On Fri, 2015-07-24 at 11:34 +0100, David Vrabel wrote:
> it would be better to think that:
>
> PFNs index guest-sized pages (which may be 64 KiB).
>
> GFNs index Xen-sized pages (which is always 4 KiB).

This concept could be usefully added to the comment in
xen/include/xen/mm.h IMHO.


>
> David

2015-07-24 13:04:56

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 01/20] xen: Add Xen specific page definition

On 24/07/15 11:34, David Vrabel wrote:
> On 24/07/15 10:51, Julien Grall wrote:
>> On 24/07/15 10:48, David Vrabel wrote:
>>> On 24/07/15 10:39, Julien Grall wrote:
>>>> Hi David,
>>>>
>>>> On 24/07/15 10:28, David Vrabel wrote:
>>>>> On 09/07/15 21:42, Julien Grall wrote:
>>>>>> The Xen hypercall interface is always using 4K page granularity on ARM
>>>>>> and x86 architecture.
>>>>>>
>>>>>> With the incoming support of 64K page granularity for ARM64 guest, it
>>>>>> won't be possible to re-use the Linux page definition in Xen drivers.
>>>>>>
>>>>>> Introduce Xen page definition helpers based on the Linux page
>>>>>> definition. They have exactly the same name but prefixed with
>>>>>> XEN_/xen_ prefix.
>>>>>>
>>>>>> Also modify page_to_pfn to use new Xen page definition.
>>>>>>
>>>>>> Signed-off-by: Julien Grall <[email protected]>
>>>>>> Cc: Konrad Rzeszutek Wilk <[email protected]>
>>>>>> Cc: Boris Ostrovsky <[email protected]>
>>>>>> Cc: David Vrabel <[email protected]>
>>>>>> ---
>>>>>> I'm wondering if we should drop page_to_pfn has the macro will likely
>>>>>> misuse when Linux is using 64KB page granularity.
>>>>>
>>>>> I think we want xen_gfn_to_page() and xen_page_to_gfn() and Xen
>>>>> front/back drivers never deal with PFNs only GFNs.
>>>>
>>>> What is xen_gfn_to_page and xen_page_to_gfn? Neither Linux, nor my
>>>> series have them.
>>>
>>> I suggesting that you introduce these.
>>
>> It's still not clear to me what you are suggesting here... Do you
>> suggest to rename xen_pfn_to_page and xen_page_to_pfn by xen_gfn_to_page
>> and xen_page_to_gfn?
>
> Effectively, yes but it would be better to think that:
>
> PFNs index guest-sized pages (which may be 64 KiB).
>
> GFNs index Xen-sized pages (which is always 4 KiB).

If I'm understanding correctly you mean:

#define xen_page_to_gfn(page) \
        ((page_to_pfn(page) << PAGE_SHIFT) >> XEN_PAGE_SHIFT)

static unsigned long page_to_mfn(struct page *page)
{
        return pfn_to_mfn(xen_page_to_gfn(page));
}

Although in some places you are suggesting to use
xen_page_to_gfn(virt_to_page(info->intf)) (see patch #11), which amounts to
renaming page_to_mfn to xen_page_to_gfn.

I think it would make more sense to use the latter one. We would also
need a name to describe a PFN (pseudo-physical frame number, based on
xen/include/xen/mm.h) but with 4K granularity rather than the Linux
granularity.

It's useful to have one in some places in order to iterate over the 4K PFNs
(see gnttab_foreach_grant and xen_apply_to_page). Maybe xpfn, for Xen
pseudo-physical frame number?

I will prepend some patches to this series to rename the functions with
their correct naming.

I have in mind pfn_to_mfn, which should be renamed to pfn_to_gfn given its
usage. Similarly, this function is misused on ARM because it may return an
MFN where we expect a GFN.

Regards,

--
Julien Grall

2015-08-05 14:31:18

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

Hi David,

On 24/07/15 11:10, David Vrabel wrote:
> On 24/07/15 10:54, Julien Grall wrote:
>> On 24/07/15 10:31, David Vrabel wrote:
>>> On 09/07/15 21:42, Julien Grall wrote:
>>>> The Xen interface is always using 4KB page. This means that a Linux page
>>>> may be split across multiple Xen page when the page granularity is not
>>>> the same.
>>>>
>>>> This helper will break down a Linux page into 4KB chunk and call the
>>>> helper on each of them.
>>> [...]
>>>> --- a/include/xen/page.h
>>>> +++ b/include/xen/page.h
>>>> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>>>>
>>>> extern unsigned long xen_released_pages;
>>>>
>>>> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
>>>> +
>>>> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
>>>> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
>>>> + void *data)
>>>
>>> I think this should be outlined (unless you have measurements that
>>> support making it inlined).
>>
>> I don't have any performance measurements. Although, when Linux is using
>> 4KB page granularity, the loop in this helper will be dropped by the
>> helper. The code would look like:
>>
>> unsigned long pfn = xen_page_to_pfn(page);
>>
>> ret = fn(page, fn, data);
>> if (ret)
>> return ret;
>>
>> The compiler could even inline the callback (fn). So it drops 2
>> functions call.
>
> Ok, keep it inlined.
>
>>> Also perhaps make it
>>>
>>> int xen_for_each_gfn(struct page *page,
>>> xen_gfn_fn_t fn, void *data);
>>
>> gfn standing for Guest Frame Number right?
>
> Yes. This suggestion is just changing the name to make it more obvious
> what it does.

Thinking more about this suggestion: the callback (fn) gets a 4K PFN as a
parameter and not a GFN.

This is because the balloon code seems to require having a 4K PFN in hand
in a few places, for instance XENMEM_populate_physmap and
HYPERVISOR_update_va_mapping.

Although, I'm not sure I understand the difference between GMFN and GPFN
in the hypercall doc.

Regards,

--
Julien Grall

2015-08-05 15:50:39

by David Vrabel

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On 05/08/15 15:30, Julien Grall wrote:
> Hi David,
>
> On 24/07/15 11:10, David Vrabel wrote:
>> On 24/07/15 10:54, Julien Grall wrote:
>>> On 24/07/15 10:31, David Vrabel wrote:
>>>> On 09/07/15 21:42, Julien Grall wrote:
>>>>> The Xen interface is always using 4KB page. This means that a Linux page
>>>>> may be split across multiple Xen page when the page granularity is not
>>>>> the same.
>>>>>
>>>>> This helper will break down a Linux page into 4KB chunk and call the
>>>>> helper on each of them.
>>>> [...]
>>>>> --- a/include/xen/page.h
>>>>> +++ b/include/xen/page.h
>>>>> @@ -39,4 +39,24 @@ struct xen_memory_region xen_extra_mem[XEN_EXTRA_MEM_MAX_REGIONS];
>>>>>
>>>>> extern unsigned long xen_released_pages;
>>>>>
>>>>> +typedef int (*xen_pfn_fn_t)(struct page *page, unsigned long pfn, void *data);
>>>>> +
>>>>> +/* Break down the page in 4KB granularity and call fn foreach xen pfn */
>>>>> +static inline int xen_apply_to_page(struct page *page, xen_pfn_fn_t fn,
>>>>> + void *data)
>>>>
>>>> I think this should be outlined (unless you have measurements that
>>>> support making it inlined).
>>>
>>> I don't have any performance measurements. Although, when Linux is using
>>> 4KB page granularity, the loop in this helper will be dropped by the
>>> helper. The code would look like:
>>>
>>> unsigned long pfn = xen_page_to_pfn(page);
>>>
>>> ret = fn(page, fn, data);
>>> if (ret)
>>> return ret;
>>>
>>> The compiler could even inline the callback (fn). So it drops 2
>>> functions call.
>>
>> Ok, keep it inlined.
>>
>>>> Also perhaps make it
>>>>
>>>> int xen_for_each_gfn(struct page *page,
>>>> xen_gfn_fn_t fn, void *data);
>>>
>>> gfn standing for Guest Frame Number right?
>>
>> Yes. This suggestion is just changing the name to make it more obvious
>> what it does.
>
> Thinking more about this suggestion. The callback (fn) is getting a 4K
> PFN in parameter and not a GFN.

I would like only APIs that deal with 64 KiB PFNs and 4 KiB GFNs. I
think having a 4 KiB "PFN" is confusing.

Can you rework this xen_for_each_gfn() to pass GFNs to fn, instead?

> This is because the balloon code seems to require having a 4K PFN in
> hand in few places. For instance XENMEM_populate_physmap and
> HYPERVISOR_update_va_mapping.

Ugh. For an auto-xlate guest the frame list needs GFNs; for a PV guest,
XENMEM_populate_physmap does want PFNs (so it can fill in the M2P).

Perhaps in increase_reservation:

if (auto-xlate)
frame_list[i] = page_to_gfn(page);
/* Or whatever per-GFN loop you need. */
else
frame_list[i] = page_to_pfn(page);

update_va_mapping takes VAs (e.g., __va(pfn << PAGE_SHIFT) could be
page_to_virt(page)).

Sorry for being so picky here, but the inconsistency of terminology and
API misuse is already confusing and I don't want to see it get worse.

David

>
> Although, I'm not sure to understand the difference between GMFN, and
> GPFN in the hypercall doc.
>
> Regards,
>

2015-08-05 16:10:51

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 02/20] xen: Introduce a function to split a Linux page into Xen page

On 05/08/15 16:50, David Vrabel wrote:
>>>>> Also perhaps make it
>>>>>
>>>>> int xen_for_each_gfn(struct page *page,
>>>>> xen_gfn_fn_t fn, void *data);
>>>>
>>>> gfn standing for Guest Frame Number right?
>>>
>>> Yes. This suggestion is just changing the name to make it more obvious
>>> what it does.
>>
>> Thinking more about this suggestion. The callback (fn) is getting a 4K
>> PFN in parameter and not a GFN.
>
> I would like only APIs that deal with 64 KiB PFNs and 4 KiB GFNs. I
> think having a 4 KiB "PFN" is confusing.

I agree with that. Note that helpers splitting a Linux page into 4K chunks,
such as gnttab_for_each_grant (see patch #3) and this one, may still use the
concept of a 4K "PFN" for internal purposes.

> Can you rework this xen_for_each_gfn() to pass GFNs to fn, instead?

I will do.
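
A rough sketch of how the reworked iterator could end up looking (helper
names like page_to_xen_pfn and pfn_to_gfn are assumptions here, not the
code from the series):

        typedef int (*xen_gfn_fn_t)(unsigned long gfn, void *data);

        static inline int xen_for_each_gfn(struct page *page, xen_gfn_fn_t fn,
                                           void *data)
        {
                /* First 4KB frame of the Linux page */
                unsigned long xen_pfn = page_to_xen_pfn(page);
                int i, ret;

                for (i = 0; i < XEN_PFN_PER_PAGE; i++, xen_pfn++) {
                        /* Hand a 4KB GFN to the callback rather than a 4KB "PFN" */
                        ret = fn(pfn_to_gfn(xen_pfn), data);
                        if (ret)
                                return ret;
                }

                return 0;
        }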

>
>> This is because the balloon code seems to require having a 4K PFN in
>> hand in few places. For instance XENMEM_populate_physmap and
>> HYPERVISOR_update_va_mapping.
>
> Ug. For an auto-xlate guest frame-list needs GFNs, for a PV guest
> XENMEM_populate_physmap does want PFNs (so it can fill in the M2P).
>
> Perhaps in increase_reservation:
>
> if (auto-xlate)
> frame_list[i] = page_to_gfn(page);
> /* Or whatever per-GFN loop you need. */
> else
> frame_list[i] = page_to_pfn(page);
>
> update_va_mapping takes VAs (e.g, __va(pfn << PAGE_SHIFT) could be
> page_to_virt(page).

I thought about a similar approach, but I wanted to keep the code generic,
i.e. support the splitting even for non-auto-translated guests, hence the
xen_apply_to_page implementation.

Anyway, I will see about doing what you suggest.

> Sorry for being so picky here, but the inconsistency of terminology and
> API misuse is already confusing and I don't want to see it get worse.

No worries, I'm happy to rework this code to be able to drop the 4K PFN
concept.

Regards,

--
Julien Grall

2015-08-06 15:44:45

by Julien Grall

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH v2 13/20] xen/events: fifo: Make it running on 64KB granularity

Hi David,

On 24/07/15 11:36, David Vrabel wrote:
> On 09/07/15 21:42, Julien Grall wrote:
>> Only use the first 4KB of the page to store the events channel info. It
>> means that we will wast 60KB every time we allocate page for:
>> * control block: a page is allocating per CPU
>> * event array: a page is allocating everytime we need to expand it
>
> Reviewed-by: David Vrabel <[email protected]>

Thank you!

>>
>> I think we can reduce the memory waste for the 2 areas by:
>>
>> * control block: sharing between multiple vCPUs. Although it will
>> require some bookkeeping in order to not free the page when the CPU
>> goes offline and the other CPUs sharing the page still there
>>
>> * event array: always extend the array event by 64K (i.e 16 4K
>> chunk). That would require more care when we fail to expand the
>> event channel.
>
> I would extend it by 4 KiB each time but only allocate a new page every
> 16 times. This minimizes the resources used in Xen.

I will keep it in mind when I send a patch to reduce the memory waste.

Regards,

--
Julien Grall