2018-06-15 05:12:20

by Wang, Wei W

Subject: [PATCH v33 0/4] Virtio-balloon: support free page reporting

This patch series is separated from the previous "Virtio-balloon
Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
implemented by this series enables the virtio-balloon driver to report
hints of guest free pages to the host. It can be used to accelerate live
migration of VMs. Here is an introduction to this usage:

Live migration transfers the VM's memory from the source machine to the
destination in rounds. In the 1st round, all the VM's memory is
transferred. From the 2nd round on, only the pieces of memory that were
written by the guest (after the previous round) are transferred. One
method commonly used by the hypervisor to track which parts of memory
are written is to write-protect all the guest memory.

This feature enables the optimization of skipping the transfer of guest
free pages during VM live migration. It is not a concern that the memory
pages may be used after they are given to the hypervisor as free page
hints, because such pages will be tracked by the hypervisor and
transferred in a subsequent round if they are used and written.
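
To make the safety argument concrete, here is a small user-space sketch
(illustrative only, not part of this series) of round-based transfer
with dirty tracking, where a page written after being hinted free is
still caught in the next round:

#include <stdio.h>
#include <string.h>

#define NPAGES 8

static unsigned char dirty[NPAGES];

/* Transfer every page currently tracked as dirty, then rearm tracking. */
static void send_round(int round)
{
        int page;

        printf("round %d sends:", round);
        for (page = 0; page < NPAGES; page++) {
                if (dirty[page]) {
                        printf(" %d", page);
                        dirty[page] = 0;
                }
        }
        printf("\n");
}

int main(void)
{
        memset(dirty, 1, sizeof(dirty));        /* round 1 normally sends all */
        dirty[2] = dirty[5] = 0;                /* pages 2 and 5 hinted as free */

        send_round(1);  /* pages 2 and 5 are skipped */

        dirty[5] = 1;   /* guest writes page 5; write-protection re-dirties it */
        send_round(2);  /* page 5 is transferred after all, nothing is lost */
        return 0;
}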

* Tests
- Test Environment
    Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
    Guest: 8G RAM, 4 vCPU
    Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 seconds

- Test Results
    - Idle Guest Live Migration Time (results are averaged over 10 runs):
        - Optimization vs. Legacy = 278ms vs. 1757ms --> ~84% reduction
    - Guest with Linux Compilation Workload (make bzImage -j4):
        - Live Migration Time (average)
          Optimization vs. Legacy = 1408ms vs. 2528ms --> ~44% reduction
        - Linux Compilation Time
          Optimization vs. Legacy = 5min3s vs. 5min12s
          --> no obvious difference

ChangeLog:
v32->v33:
- mm/get_from_free_page_list: new implementation to get free page
  hints, based on the suggestion from Linus:
  https://lkml.org/lkml/2018/6/11/764
  This avoids the complex call chain, and looks more prudent.
- virtio-balloon:
  - use a fixed-size buffer to get free page hints;
  - remove the cmd id related interface. Now host can just send a free
    page hint command to the guest (via the host_cmd config register)
    to start the reporting. Currently the guest reports only max order
    free page hints to host, which has generated similar good results
    as before. But the interface used by virtio-balloon to report can
    support reporting more orders in the future when there is a need.
v31->v32:
- virtio-balloon:
  - rename cmd_id_use to cmd_id_active;
  - report_free_page_func: detach used buffers after host sends a vq
    interrupt, instead of busy waiting for used buffers.
v30->v31:
- virtio-balloon:
  - virtio_balloon_send_free_pages: return -EINTR rather than 1 to
    indicate an active stop requested by host; and add more comments
    to explain about access to cmd_id_received without locks;
  - add_one_sg: add TODO to comments about possible improvement.
v29->v30:
- mm/walk_free_mem_block: add cond_resched() for each order
v28->v29:
- mm/page_poison: only expose page_poisoning_enabled(), rather than
  the more extensive changes made in v28, as we are not 100% confident
  about those for now.
- virtio-balloon: use a separate buffer for the stop cmd, instead of
  having the start and stop cmd use the same buffer. This avoids the
  corner case that the start cmd is overridden by the stop cmd when
  the host has a delay in reading the start cmd.
v27->v28:
- mm/page_poison: move PAGE_POISON to page_poison.c and add a function
  to expose the page poison value to kernel modules.
v26->v27:
- add a new patch to expose page_poisoning_enabled to kernel modules
- virtio-balloon: set poison_val to 0xaaaaaaaa, instead of 0xaa
v25->v26: virtio-balloon changes only
- remove kicking the free page vq since the host now polls the vq
  after initiating the reporting
- report_free_page_func: detach all the used buffers after sending
  the stop cmd id. This avoids leaving the detaching burden (i.e.
  overhead) to the next cmd id. Detaching here isn't considered
  overhead since the stop cmd id has been sent, and host has already
  moved forward.
v24->v25:
- mm: change walk_free_mem_block to return 0 (instead of true) on
  completing the report, and return a non-zero value from the
  callback, which stops the reporting.
- virtio-balloon:
  - use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.;
  - avoid __virtio_clear_bit when bailing out;
  - a new method to avoid reporting the same cmd id to host twice;
  - destroy_workqueue can cancel free page work when the feature is
    negotiated;
  - fail probe when the free page vq size is less than 2.
v23->v24:
- change the feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
  VIRTIO_BALLOON_F_FREE_PAGE_HINT
- kick when vq->num_free < half full, instead of "= half full"
- replace BUG_ON with bailing out
- check vb->balloon_wq in probe(); if null, bail out
- add a new feature bit for page poisoning
- solve the corner case of one cmd id being sent to host twice
v22->v23:
- change to kick the device when the vq is half-way full;
- open-code batch_free_page_sg into add_one_sg;
- change cmd_id from "uint32_t" to "__virtio32";
- reserve one entry in the vq for the driver to send cmd_id, instead
  of busy waiting for an available entry;
- add a "stop_update" check before queue_work, for prudence for now;
  a separate patch will discuss this flag check later;
- init_vqs: change to put some variables on the stack for a simpler
  implementation;
- add destroy_workqueue(vb->balloon_wq);
v21->v22:
- add_one_sg: some code and comment re-arrangement
- send_cmd_id: handle a corner case

For previous ChangeLog, please reference
https://lwn.net/Articles/743660/

Wei Wang (4):
mm: add a function to get free page blocks
virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
mm/page_poison: expose page_poisoning_enabled to kernel modules
virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

 drivers/virtio/virtio_balloon.c     | 197 +++++++++++++++++++++++++++++-------
 include/linux/mm.h                  |   1 +
 include/uapi/linux/virtio_balloon.h |  16 +++
 mm/page_alloc.c                     |  52 ++++++++++
 mm/page_poison.c                    |   6 ++
 5 files changed, 235 insertions(+), 37 deletions(-)

--
2.7.4



2018-06-15 05:14:08

by Wang, Wei W

Subject: [PATCH v33 1/4] mm: add a function to get free page blocks

This patch adds a function to get free page blocks from a free page
list. The obtained free page blocks are hints about free pages, because
there is no guarantee that they are still on the free page list after
the function returns.

One example use of this patch is to accelerate live migration by
skipping the transfer of free pages reported by the guest. A popular
method used by the hypervisor to track which parts of memory are
written during live migration is to write-protect all the guest memory.
So, pages that are hinted as free but are written after this function
returns will be captured by the hypervisor, and they will be added to
the next round of memory transfer.
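
For illustration, a hypothetical module-side caller might look roughly
like this (a sketch only; the function name, array size, and logging
are made up):

#include <linux/mm.h>
#include <linux/printk.h>
#include <linux/slab.h>

static int report_max_order_hints(void)
{
        uint32_t i, n;
        /* room for 512 hint addresses; the size is arbitrary here */
        __le64 *buf = kmalloc_array(512, sizeof(__le64), GFP_KERNEL);

        if (!buf)
                return -ENOMEM;

        /* ask for hints from the highest order (MAX_ORDER - 1) free list */
        n = get_from_free_page_list(MAX_ORDER - 1, buf, 512);
        for (i = 0; i < n; i++)
                pr_debug("free page block at 0x%llx\n",
                         (unsigned long long)le64_to_cpu(buf[i]));

        kfree(buf);
        return 0;
}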

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Liang Li <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Linus Torvalds <[email protected]>
---
 include/linux/mm.h |  1 +
 mm/page_alloc.c    | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e49388..c58b4e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2002,6 +2002,7 @@ extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
                unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size);
 
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07b3c23..7c816d9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5043,6 +5043,58 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
         show_swap_cache_info();
 }

+/**
+ * get_from_free_page_list - get free page blocks from a free page list
+ * @order: the order of the free page list to check
+ * @buf: the array to store the physical addresses of the free page blocks
+ * @size: the array size
+ *
+ * This function offers hints about free pages. There is no guarantee that
+ * the obtained free pages are still on the free page list after the function
+ * returns. pfn_to_page on the obtained free pages is strongly discouraged
+ * and if there is an absolute need for that, make sure to contact MM people
+ * to discuss potential problems.
+ *
+ * The addresses are currently stored to the array in little endian. This
+ * avoids the overhead of converting endianness by the caller who needs data
+ * in the little endian format. Big endian support can be added on demand in
+ * the future.
+ *
+ * Return the number of free page blocks obtained from the free page list.
+ * The maximum number of free page blocks that can be obtained is limited to
+ * the caller's array size.
+ */
+uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
+{
+        struct zone *zone;
+        enum migratetype mt;
+        struct page *page;
+        struct list_head *list;
+        unsigned long addr, flags;
+        uint32_t index = 0;
+
+        for_each_populated_zone(zone) {
+                spin_lock_irqsave(&zone->lock, flags);
+                for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+                        list = &zone->free_area[order].free_list[mt];
+                        list_for_each_entry(page, list, lru) {
+                                addr = page_to_pfn(page) << PAGE_SHIFT;
+                                if (likely(index < size)) {
+                                        buf[index++] = cpu_to_le64(addr);
+                                } else {
+                                        spin_unlock_irqrestore(&zone->lock,
+                                                               flags);
+                                        return index;
+                                }
+                        }
+                }
+                spin_unlock_irqrestore(&zone->lock, flags);
+        }
+
+        return index;
+}
+EXPORT_SYMBOL_GPL(get_from_free_page_list);
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
         zoneref->zone = zone;
--
2.7.4


2018-06-15 05:14:35

by Wang, Wei W

Subject: [PATCH v33 4/4] virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON

The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate whether
the guest is using page poisoning. The guest writes to the poison_val
config field to tell the host the page poisoning value in use.

Suggested-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
---
 drivers/virtio/virtio_balloon.c     | 10 ++++++++++
 include/uapi/linux/virtio_balloon.h |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 582a03b..c59bb380 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -634,6 +634,7 @@ static struct file_system_type balloon_fs = {
 static int virtballoon_probe(struct virtio_device *vdev)
 {
         struct virtio_balloon *vb;
+        __u32 poison_val;
         int err;
 
         if (!vdev->config->get) {
@@ -671,6 +672,11 @@ static int virtballoon_probe(struct virtio_device *vdev)
                 goto out_del_vqs;
         }
         INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+                memset(&poison_val, PAGE_POISON, sizeof(poison_val));
+                virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+                              poison_val, &poison_val);
+        }
         vb->hints = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL);
         if (!vb->hints) {
                 err = -ENOMEM;
@@ -796,6 +802,9 @@ static int virtballoon_restore(struct virtio_device *vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
+        if (!page_poisoning_enabled())
+                __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+
         __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
         return 0;
 }
@@ -805,6 +814,7 @@ static unsigned int features[] = {
         VIRTIO_BALLOON_F_STATS_VQ,
         VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
         VIRTIO_BALLOON_F_FREE_PAGE_HINT,
+        VIRTIO_BALLOON_F_PAGE_POISON,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 99b8416..f3b6191 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -35,6 +35,7 @@
 #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
 #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
@@ -47,6 +48,8 @@ struct virtio_balloon_config {
         __u32 actual;
         /* Command sent from host */
         __u32 host_cmd;
+        /* Stores PAGE_POISON if page poisoning is in use */
+        __u32 poison_val;
 };
 
 struct virtio_balloon_free_page_hints {
--
2.7.4


2018-06-15 05:14:47

by Wang, Wei W

Subject: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
support for reporting guest free page hints to the host via
virtio-balloon.

The host requests the guest to report free page hints by setting the
VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit of the host_cmd config
register.

As a first step, virtio-balloon only reports free page hints from the
max order (10) free page list to the host. In our tests this has
generated results similar to reporting free page hints of all orders.

TODO:
- support reporting free page hints from smaller order free page lists
when there is a need/request from users.
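
For a sense of scale, a back-of-the-envelope sketch (user-space C,
assuming 4KB pages; the constants mirror FREE_PAGE_HINT_MEM_SIZE and
the hint header used by this patch):

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096;                 /* assumed PAGE_SIZE */
        unsigned long buf_bytes = page_size * 16;       /* FREE_PAGE_HINT_MEM_SIZE */
        unsigned long hdr_bytes = 4 + 4;                /* num_hints + size */
        unsigned long entries = (buf_bytes - hdr_bytes) / 8;  /* __le64 hints */
        unsigned long block_mb = 4;                     /* one order-10 block */

        /* prints: 8191 hints, 32764 MB covered (~32GB) */
        printf("%lu hints, %lu MB covered\n", entries, entries * block_mb);
        return 0;
}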

Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Liang Li <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
---
 drivers/virtio/virtio_balloon.c     | 187 +++++++++++++++++++++++++++++-------
 include/uapi/linux/virtio_balloon.h |  13 +++
 2 files changed, 163 insertions(+), 37 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 6b237e3..582a03b 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -43,6 +43,9 @@
 #define OOM_VBALLOON_DEFAULT_PAGES 256
 #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
 
+/* The size of memory in bytes allocated for reporting free page hints */
+#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
+
 static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
@@ -51,9 +54,22 @@ MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 static struct vfsmount *balloon_mnt;
 #endif

+enum virtio_balloon_vq {
+        VIRTIO_BALLOON_VQ_INFLATE,
+        VIRTIO_BALLOON_VQ_DEFLATE,
+        VIRTIO_BALLOON_VQ_STATS,
+        VIRTIO_BALLOON_VQ_FREE_PAGE,
+        VIRTIO_BALLOON_VQ_MAX
+};
+
 struct virtio_balloon {
         struct virtio_device *vdev;
-        struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+        struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+
+        /* Balloon's own wq for cpu-intensive work items */
+        struct workqueue_struct *balloon_wq;
+        /* The free page reporting work item submitted to the balloon wq */
+        struct work_struct report_free_page_work;
 
         /* The balloon servicing is delegated to a freezable workqueue. */
         struct work_struct update_balloon_stats_work;
@@ -63,6 +79,8 @@ struct virtio_balloon {
         spinlock_t stop_update_lock;
         bool stop_update;
 
+        struct virtio_balloon_free_page_hints *hints;
+
         /* Waiting for host to ack the pages we released. */
         wait_queue_head_t acked;

@@ -326,17 +344,6 @@ static void stats_handle_request(struct virtio_balloon *vb)
         virtqueue_kick(vq);
 }
 
-static void virtballoon_changed(struct virtio_device *vdev)
-{
-        struct virtio_balloon *vb = vdev->priv;
-        unsigned long flags;
-
-        spin_lock_irqsave(&vb->stop_update_lock, flags);
-        if (!vb->stop_update)
-                queue_work(system_freezable_wq, &vb->update_balloon_size_work);
-        spin_unlock_irqrestore(&vb->stop_update_lock, flags);
-}
-
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
         s64 target;
@@ -353,6 +360,32 @@ static inline s64 towards_target(struct virtio_balloon *vb)
         return target - vb->num_pages;
 }

+static void virtballoon_changed(struct virtio_device *vdev)
+{
+        struct virtio_balloon *vb = vdev->priv;
+        unsigned long flags;
+        uint32_t host_cmd;
+        s64 diff = towards_target(vb);
+
+        if (diff) {
+                spin_lock_irqsave(&vb->stop_update_lock, flags);
+                if (!vb->stop_update)
+                        queue_work(system_freezable_wq,
+                                   &vb->update_balloon_size_work);
+                spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+        }
+
+        virtio_cread(vdev, struct virtio_balloon_config, host_cmd, &host_cmd);
+        if ((host_cmd & VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT) &&
+            virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+                spin_lock_irqsave(&vb->stop_update_lock, flags);
+                if (!vb->stop_update)
+                        queue_work(vb->balloon_wq,
+                                   &vb->report_free_page_work);
+                spin_unlock_irqrestore(&vb->stop_update_lock, flags);
+        }
+}
+
 static void update_balloon_size(struct virtio_balloon *vb)
 {
         u32 actual = vb->num_pages;
@@ -425,44 +458,98 @@ static void update_balloon_size_func(struct work_struct *work)
         queue_work(system_freezable_wq, work);
 }

+static void free_page_vq_cb(struct virtqueue *vq)
+{
+        unsigned int unused;
+
+        while (virtqueue_get_buf(vq, &unused))
+                ;
+}
+
 static int init_vqs(struct virtio_balloon *vb)
 {
-        struct virtqueue *vqs[3];
-        vq_callback_t *callbacks[] = { balloon_ack, balloon_ack, stats_request };
-        static const char * const names[] = { "inflate", "deflate", "stats" };
-        int err, nvqs;
+        struct virtqueue *vqs[VIRTIO_BALLOON_VQ_MAX];
+        vq_callback_t *callbacks[VIRTIO_BALLOON_VQ_MAX];
+        const char *names[VIRTIO_BALLOON_VQ_MAX];
+        struct scatterlist sg;
+        int ret;
 
         /*
-         * We expect two virtqueues: inflate and deflate, and
-         * optionally stat.
+         * Inflateq and deflateq are used unconditionally. The names[]
+         * will be NULL if the related feature is not enabled, which will
+         * cause no allocation for the corresponding virtqueue in find_vqs.
          */
-        nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
-        err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
-        if (err)
-                return err;
+        callbacks[VIRTIO_BALLOON_VQ_INFLATE] = balloon_ack;
+        names[VIRTIO_BALLOON_VQ_INFLATE] = "inflate";
+        callbacks[VIRTIO_BALLOON_VQ_DEFLATE] = balloon_ack;
+        names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
+        names[VIRTIO_BALLOON_VQ_STATS] = NULL;
+        names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
 
-        vb->inflate_vq = vqs[0];
-        vb->deflate_vq = vqs[1];
         if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
-                struct scatterlist sg;
-                unsigned int num_stats;
-                vb->stats_vq = vqs[2];
+                names[VIRTIO_BALLOON_VQ_STATS] = "stats";
+                callbacks[VIRTIO_BALLOON_VQ_STATS] = stats_request;
+        }
 
+        if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+                names[VIRTIO_BALLOON_VQ_FREE_PAGE] = "free_page_vq";
+                callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = free_page_vq_cb;
+        }
+
+        ret = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
+                                         vqs, callbacks, names, NULL, NULL);
+        if (ret)
+                return ret;
+
+        vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
+        vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
+        if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
+                vb->stats_vq = vqs[VIRTIO_BALLOON_VQ_STATS];
                 /*
                  * Prime this virtqueue with one buffer so the hypervisor can
                  * use it to signal us later (it can't be broken yet!).
                  */
-                num_stats = update_balloon_stats(vb);
-
-                sg_init_one(&sg, vb->stats, sizeof(vb->stats[0]) * num_stats);
-                if (virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb, GFP_KERNEL)
-                    < 0)
-                        BUG();
+                sg_init_one(&sg, vb->stats, sizeof(vb->stats));
+                ret = virtqueue_add_outbuf(vb->stats_vq, &sg, 1, vb,
+                                           GFP_KERNEL);
+                if (ret) {
+                        dev_warn(&vb->vdev->dev, "%s: add stat_vq failed\n",
+                                 __func__);
+                        return ret;
+                }
                 virtqueue_kick(vb->stats_vq);
         }
+
+        if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+                vb->free_page_vq = vqs[VIRTIO_BALLOON_VQ_FREE_PAGE];
+
         return 0;
 }
 
+static void report_free_page_func(struct work_struct *work)
+{
+        struct virtio_balloon *vb;
+        struct virtqueue *vq;
+        struct virtio_balloon_free_page_hints *hints;
+        struct scatterlist sg;
+        uint32_t hdr_size, avail_entries, added_entries;
+
+        vb = container_of(work, struct virtio_balloon, report_free_page_work);
+        vq = vb->free_page_vq;
+        hints = vb->hints;
+        hdr_size = sizeof(hints->num_hints) + sizeof(hints->size);
+        avail_entries = (FREE_PAGE_HINT_MEM_SIZE - hdr_size) / sizeof(__le64);
+
+        added_entries = get_from_free_page_list(MAX_ORDER - 1, hints->buf,
+                                                avail_entries);
+        hints->num_hints = cpu_to_le32(added_entries);
+        hints->size = cpu_to_le32((1 << (MAX_ORDER - 1)) << PAGE_SHIFT);
+
+        sg_init_one(&sg, vb->hints, FREE_PAGE_HINT_MEM_SIZE);
+        virtqueue_add_outbuf(vq, &sg, 1, vb->hints, GFP_KERNEL);
+        virtqueue_kick(vb->free_page_vq);
+}
+
 #ifdef CONFIG_BALLOON_COMPACTION
 /*
  * virtballoon_migratepage - perform the balloon page migration on behalf of
@@ -576,18 +663,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
         if (err)
                 goto out_free_vb;
 
+        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+                vb->balloon_wq = alloc_workqueue("balloon-wq",
+                                WQ_FREEZABLE | WQ_CPU_INTENSIVE, 0);
+                if (!vb->balloon_wq) {
+                        err = -ENOMEM;
+                        goto out_del_vqs;
+                }
+                INIT_WORK(&vb->report_free_page_work, report_free_page_func);
+                vb->hints = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL);
+                if (!vb->hints) {
+                        err = -ENOMEM;
+                        goto out_del_balloon_wq;
+                }
+        }
+
         vb->nb.notifier_call = virtballoon_oom_notify;
         vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
         err = register_oom_notifier(&vb->nb);
         if (err < 0)
-                goto out_del_vqs;
+                goto out_del_free_page_hint;
 
 #ifdef CONFIG_BALLOON_COMPACTION
         balloon_mnt = kern_mount(&balloon_fs);
         if (IS_ERR(balloon_mnt)) {
                 err = PTR_ERR(balloon_mnt);
                 unregister_oom_notifier(&vb->nb);
-                goto out_del_vqs;
+                goto out_del_free_page_hint;
         }
 
         vb->vb_dev_info.migratepage = virtballoon_migratepage;
@@ -597,7 +699,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
                 kern_unmount(balloon_mnt);
                 unregister_oom_notifier(&vb->nb);
                 vb->vb_dev_info.inode = NULL;
-                goto out_del_vqs;
+                goto out_del_free_page_hint;
         }
         vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
 #endif
@@ -607,7 +709,12 @@ static int virtballoon_probe(struct virtio_device *vdev)
         if (towards_target(vb))
                 virtballoon_changed(vdev);
         return 0;
-
+out_del_free_page_hint:
+        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+                kfree(vb->hints);
+out_del_balloon_wq:
+        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
+                destroy_workqueue(vb->balloon_wq);
 out_del_vqs:
         vdev->config->del_vqs(vdev);
 out_free_vb:
@@ -641,6 +748,11 @@ static void virtballoon_remove(struct virtio_device *vdev)
         cancel_work_sync(&vb->update_balloon_size_work);
         cancel_work_sync(&vb->update_balloon_stats_work);
 
+        if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+                cancel_work_sync(&vb->report_free_page_work);
+                destroy_workqueue(vb->balloon_wq);
+        }
+
         remove_common(vb);
 #ifdef CONFIG_BALLOON_COMPACTION
         if (vb->vb_dev_info.inode)
@@ -692,6 +804,7 @@ static unsigned int features[] = {
         VIRTIO_BALLOON_F_MUST_TELL_HOST,
         VIRTIO_BALLOON_F_STATS_VQ,
         VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
+        VIRTIO_BALLOON_F_FREE_PAGE_HINT,
 };
 
 static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index 13b8cb5..99b8416 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,15 +34,28 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
 #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
 
 /* Size of a PFN in the balloon interface. */
 #define VIRTIO_BALLOON_PFN_SHIFT 12
 
+#define VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT (1 << 0)
 struct virtio_balloon_config {
         /* Number of pages host wants Guest to give up. */
         __u32 num_pages;
         /* Number of pages we've actually got in balloon. */
         __u32 actual;
+        /* Command sent from host */
+        __u32 host_cmd;
+};
+
+struct virtio_balloon_free_page_hints {
+        /* Number of hints in the array below */
+        __le32 num_hints;
+        /* The size of each hint in bytes */
+        __le32 size;
+        /* Buffer for the hints */
+        __le64 buf[];
 };
 
 #define VIRTIO_BALLOON_S_SWAP_IN 0 /* Amount of memory swapped in */
--
2.7.4


2018-06-15 05:15:15

by Wang, Wei W

Subject: [PATCH v33 3/4] mm/page_poison: expose page_poisoning_enabled to kernel modules

In some usages, e.g. virtio-balloon, a kernel module needs to know if
page poisoning is in use. This patch exposes the page_poisoning_enabled
function to kernel modules.
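
For example, a module could then guard a poison-sensitive path like
this (a hypothetical sketch; the helper and policy are made up):

#include <linux/mm.h>

static bool free_page_hinting_allowed(void)
{
        /* freed pages get overwritten while poisoning is on, so a user of
         * free page hints may need to know the poison value or back off */
        return !page_poisoning_enabled();
}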

Signed-off-by: Wei Wang <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Acked-by: Andrew Morton <[email protected]>
---
mm/page_poison.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d3..830f604 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -17,6 +17,11 @@ static int __init early_page_poison_param(char *buf)
 }
 early_param("page_poison", early_page_poison_param);
 
+/**
+ * page_poisoning_enabled - check if page poisoning is enabled
+ *
+ * Return true if page poisoning is enabled, or false if not.
+ */
 bool page_poisoning_enabled(void)
 {
         /*
@@ -29,6 +34,7 @@ bool page_poisoning_enabled(void)
                 (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC) &&
                 debug_pagealloc_enabled()));
 }
+EXPORT_SYMBOL_GPL(page_poisoning_enabled);
 
 static void poison_page(struct page *page)
 {
--
2.7.4


2018-06-15 11:31:09

by Michael S. Tsirkin

Subject: Re: [PATCH v33 0/4] Virtio-balloon: support free page reporting

On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
> - remove the cmd id related interface. Now host can just send a free
> page hint command to the guest (via the host_cmd config register)
> to start the reporting.

Here we go again. And what if reporting was already started previously?
I don't think it's a good idea to tweak the host/guest interface yet
again.

--
MST

2018-06-15 11:47:37

by Michael S. Tsirkin

Subject: Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates the
> support of reporting hints of guest free pages to host via virtio-balloon.
>
> Host requests the guest to report free page hints by sending a command
> to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit
> of the host_cmd config register.
>
> As the first step here, virtio-balloon only reports free page hints from
> the max order (10) free page list to host. This has generated similar good
> results as reporting all free page hints during our tests.
>
> TODO:
> - support reporting free page hints from smaller order free page lists
> when there is a need/request from users.
>
> Signed-off-by: Wei Wang <[email protected]>
> Signed-off-by: Liang Li <[email protected]>
> Cc: Michael S. Tsirkin <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> drivers/virtio/virtio_balloon.c | 187 +++++++++++++++++++++++++++++-------
> include/uapi/linux/virtio_balloon.h | 13 +++
> 2 files changed, 163 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 6b237e3..582a03b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -43,6 +43,9 @@
> #define OOM_VBALLOON_DEFAULT_PAGES 256
> #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>
> +/* The size of memory in bytes allocated for reporting free page hints */
> +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> +
> static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> MODULE_PARM_DESC(oom_pages, "pages to free on OOM");

Doesn't this limit the memory size of the guest we can report?
Apparently to several gigabytes ...
OTOH huge guests with lots of free memory are exactly
where we would gain the most ...

--
MST

2018-06-15 14:12:22

by Wang, Wei W

Subject: RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
> > the support of reporting hints of guest free pages to host via virtio-balloon.
> >
> > Host requests the guest to report free page hints by sending a command
> > to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > bit of the host_cmd config register.
> >
> > As the first step here, virtio-balloon only reports free page hints
> > from the max order (10) free page list to host. This has generated
> > similar good results as reporting all free page hints during our tests.
> >
> > TODO:
> > - support reporting free page hints from smaller order free page lists
> > when there is a need/request from users.
> >
> > Signed-off-by: Wei Wang <[email protected]>
> > Signed-off-by: Liang Li <[email protected]>
> > Cc: Michael S. Tsirkin <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > ---
> > drivers/virtio/virtio_balloon.c | 187 +++++++++++++++++++++++++++++-------
> > include/uapi/linux/virtio_balloon.h | 13 +++
> > 2 files changed, 163 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 6b237e3..582a03b 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -43,6 +43,9 @@
> > #define OOM_VBALLOON_DEFAULT_PAGES 256
> > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> >
> > +/* The size of memory in bytes allocated for reporting free page hints */
> > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > +
> > static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
>
> Doesn't this limit memory size of the guest we can report?
> Apparently to several gigabytes ...
> OTOH huge guests with lots of free memory is exactly where we would gain
> the most ...

Yes, the 16-page array can report up to 32GB of free memory to host (each page can hold 512 addresses of 4MB free page blocks, i.e. 2GB of free memory per page). It is not flexible.

How about allocating the buffer according to the guest memory size (proportional)? That is,

/* Calculates the maximum number of 4MB (equals to 1024 pages) free page blocks that the system can have */
4m_page_blocks = totalram_pages / 1024;

/* One page can hold 512 free page block addresses, so calculates the number of pages needed to hold those 4MB blocks. This allocation should not exceed 1024 pages */
pages_to_allocate = min(4m_page_blocks / 512, 1024);

For a 2TB guest, which has 2^19 page blocks (4MB each), we will allocate 1024 pages as the buffer.

When the guest has large memory, it should be easier to succeed in allocating a large buffer. If that allocation fails, it implies that nothing could be obtained from the 4MB free page list anyway.

I think the proportional allocation is simpler compared to other approaches like
- a scattered buffer, which would complicate the get_from_free_page_list implementation;
- one buffer used to call get_from_free_page_list multiple times, which would need get_from_free_page_list to maintain state... also too complicated.
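
In C, the sizing above might look roughly like this (a sketch only; the
helper name is made up and 4m_page_blocks is renamed to a valid
identifier):

#include <linux/kernel.h>
#include <linux/mm.h>

#define MAX_HINT_BUF_PAGES 1024         /* cap the buffer at 4MB */

static unsigned long hint_buf_pages(void)
{
        /* maximum number of 4MB (order-10) blocks the system can have */
        unsigned long blocks_4m = totalram_pages / 1024;

        /* one 4KB page stores 512 __le64 hints, i.e. covers 2GB */
        return min_t(unsigned long, DIV_ROUND_UP(blocks_4m, 512),
                     MAX_HINT_BUF_PAGES);
}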

Best,
Wei




2018-06-15 14:30:40

by Michael S. Tsirkin

Subject: Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates
> > > the support of reporting hints of guest free pages to host via virtio-balloon.
> > >
> > > Host requests the guest to report free page hints by sending a command
> > > > to the guest via setting the VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > bit of the host_cmd config register.
> > >
> > > As the first step here, virtio-balloon only reports free page hints
> > > from the max order (10) free page list to host. This has generated
> > > similar good results as reporting all free page hints during our tests.
> > >
> > > TODO:
> > > - support reporting free page hints from smaller order free page lists
> > > when there is a need/request from users.
> > >
> > > Signed-off-by: Wei Wang <[email protected]>
> > > Signed-off-by: Liang Li <[email protected]>
> > > Cc: Michael S. Tsirkin <[email protected]>
> > > Cc: Michal Hocko <[email protected]>
> > > Cc: Andrew Morton <[email protected]>
> > > ---
> > > drivers/virtio/virtio_balloon.c | 187 +++++++++++++++++++++++++++++-------
> > > include/uapi/linux/virtio_balloon.h | 13 +++
> > > 2 files changed, 163 insertions(+), 37 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 6b237e3..582a03b 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -43,6 +43,9 @@
> > > #define OOM_VBALLOON_DEFAULT_PAGES 256
> > > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > >
> > > +/* The size of memory in bytes allocated for reporting free page hints */
> > > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > +
> > > static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> >
> > Doesn't this limit memory size of the guest we can report?
> > Apparently to several gigabytes ...
> > OTOH huge guests with lots of free memory is exactly where we would gain
> > the most ...
>
> Yes, the 16-page array can report up to 32GB (each page can hold 512 addresses of 4MB free page blocks, i.e. 2GB free memory per page) free memory to host. It is not flexible.
>
> How about allocating the buffer according to the guest memory size (proportional)? That is,
>
> /* Calculates the maximum number of 4MB (equals to 1024 pages) free pages blocks that the system can have */
> 4m_page_blocks = totalram_pages / 1024;
>
> /* Allocating one page can hold 512 free page blocks, so calculates the number of pages that can hold those 4MB blocks. And this allocation should not exceed 1024 pages */
> pages_to_allocate = min(4m_page_blocks / 512, 1024);
>
> For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate 1024 pages as the buffer.
>
> When the guest has large memory, it should be easier to succeed in allocation of large buffer. If that allocation fails, that implies that nothing would be got from the 4MB free page list.
>
> I think the proportional allocation is simpler compared to other approaches like
> - scattered buffer, which will complicate the get_from_free_page_list implementation;
> - one buffer to call get_from_free_page_list multiple times, which needs get_from_free_page_list to maintain states.. also too complicated.
>
> Best,
> Wei
>

That's more reasonable, but the question remains what to do if that
value exceeds MAX_ORDER. I'd say maybe tell the host we can't report it.

Also allocating it with GFP_KERNEL is out. You only want to take
it off the free list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.

Also, you can't allocate this at device start. First, totalram_pages can
change. Second, that's too much memory to tie up forever.
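
Concretely, the allocation might then look something like this (a
sketch of the suggestion, untested, with a made-up helper name):

#include <linux/gfp.h>

static struct page *alloc_hint_buf(unsigned int order)
{
        /* no reclaim, no emergency reserves, no warning on failure:
         * either the free list has a block of this order or we skip */
        return alloc_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN,
                           order);
}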

--
MST

2018-06-15 14:30:57

by Wang, Wei W

Subject: RE: [PATCH v33 0/4] Virtio-balloon: support free page reporting

On Friday, June 15, 2018 7:30 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
> > - remove the cmd id related interface. Now host can just send a free
> > page hint command to the guest (via the host_cmd config register)
> > to start the reporting.
>
> Here we go again. And what if reporting was already started previously?
> I don't think it's a good idea to tweak the host/guest interface yet again.

This interface is much simpler, and I'm not sure that would be an issue
here now, because the guest now delivers the whole buffer of hints to
the host at once, instead of hint by hint as before, and notifies the
host after the buffer is delivered. In any case, the host doorbell
handler will be invoked; if the host doesn't need the hints at that
time, it will just give back the buffer. No stale hints remain in the
ring now.

Best,
Wei

2018-06-15 14:39:14

by Michael S. Tsirkin

Subject: Re: [PATCH v33 0/4] Virtio-balloon: support free page reporting

On Fri, Jun 15, 2018 at 02:28:49PM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 7:30 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 12:43:09PM +0800, Wei Wang wrote:
> > > - remove the cmd id related interface. Now host can just send a free
> > > page hint command to the guest (via the host_cmd config register)
> > > to start the reporting.
> >
> > Here we go again. And what if reporting was already started previously?
> > I don't think it's a good idea to tweak the host/guest interface yet again.
>
> This interface is much simpler, and I'm not sure if that would be an
> issue here now, because now the guest delivers the whole buffer of
> hints to host once, instead of hint by hint as before. And the guest
> notifies host after the buffer is delivered. In any case, the host
> doorbell handler will be invoked, if host doesn't need the hints at
> that time, it will just give back the buffer. There will be no stale
> hints remained in the ring now.
>
> Best,
> Wei

I still think all the old arguments for cmd id apply.

--
MST

2018-06-15 19:22:06

by Luiz Capitulino

Subject: Re: [PATCH v33 0/4] Virtio-balloon: support free page reporting

On Fri, 15 Jun 2018 12:43:09 +0800
Wei Wang <[email protected]> wrote:

> This patch series is separated from the previous "Virtio-balloon
> Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> implemented by this series enables the virtio-balloon driver to report
> hints of guest free pages to the host. It can be used to accelerate live
> migration of VMs. Here is an introduction of this usage:

So, we have two page hinting solutions being proposed. One is this
series; the other, by Nitesh, is intended to improve host memory
utilization by letting the host use unused guest memory [1].

Instead of merging two similar solutions, do we want a more generic
one that solves both problems? Or maybe a unified solution?

[1] https://www.spinics.net/lists/kvm/msg170113.html

> Live migration needs to transfer the VM's memory from the source machine
> to the destination round by round. For the 1st round, all the VM's memory
> is transferred. From the 2nd round, only the pieces of memory that were
> written by the guest (after the 1st round) are transferred. One method
> that is popularly used by the hypervisor to track which part of memory is
> written is to write-protect all the guest memory.
>
> This feature enables the optimization by skipping the transfer of guest
> free pages during VM live migration. It is not concerned that the memory
> pages are used after they are given to the hypervisor as a hint of the
> free pages, because they will be tracked by the hypervisor and transferred
> in the subsequent round if they are used and written.
>
> * Tests
> - Test Environment
> Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Guest: 8G RAM, 4 vCPU
> Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second
>
> - Test Results
> - Idle Guest Live Migration Time (results are averaged over 10 runs):
> - Optimization v.s. Legacy = 278ms vs 1757ms --> ~84% reduction
> - Guest with Linux Compilation Workload (make bzImage -j4):
> - Live Migration Time (average)
> Optimization v.s. Legacy = 1408ms v.s. 2528ms --> ~44% reduction
> - Linux Compilation Time
> Optimization v.s. Legacy = 5min3s v.s. 5min12s
> --> no obvious difference
>
> ChangeLog:
> v32->v33:
> - mm/get_from_free_page_list: The new implementation to get free page
> hints based on the suggestions from Linus:
> https://lkml.org/lkml/2018/6/11/764
> This avoids the complex call chain, and looks more prudent.
> - virtio-balloon:
> - use a fix-sized buffer to get free page hints;
> - remove the cmd id related interface. Now host can just send a free
> page hint command to the guest (via the host_cmd config register)
> to start the reporting. Currentlty the guest reports only the max
> order free page hints to host, which has generated similar good
> results as before. But the interface used by virtio-balloon to
> report can support reporting more orders in the future when there
> is a need.
> v31->v32:
> - virtio-balloon:
> - rename cmd_id_use to cmd_id_active;
> - report_free_page_func: detach used buffers after host sends a vq
> interrupt, instead of busy waiting for used buffers.
> v30->v31:
> - virtio-balloon:
> - virtio_balloon_send_free_pages: return -EINTR rather than 1 to
> indicate an active stop requested by host; and add more
> comments to explain about access to cmd_id_received without
> locks;
> - add_one_sg: add TODO to comments about possible improvement.
> v29->v30:
> - mm/walk_free_mem_block: add cond_sched() for each order
> v28->v29:
> - mm/page_poison: only expose page_poison_enabled(), rather than more
> changes did in v28, as we are not 100% confident about that for now.
> - virtio-balloon: use a separate buffer for the stop cmd, instead of
> having the start and stop cmd use the same buffer. This avoids the
> corner case that the start cmd is overridden by the stop cmd when
> the host has a delay in reading the start cmd.
> v27->v28:
> - mm/page_poison: Move PAGE_POISON to page_poison.c and add a function
> to expose page poison val to kernel modules.
> v26->v27:
> - add a new patch to expose page_poisoning_enabled to kernel modules
> - virtio-balloon: set poison_val to 0xaaaaaaaa, instead of 0xaa
> v25->v26: virtio-balloon changes only
> - remove kicking free page vq since the host now polls the vq after
> initiating the reporting
> - report_free_page_func: detach all the used buffers after sending
> the stop cmd id. This avoids leaving the detaching burden (i.e.
> overhead) to the next cmd id. Detaching here isn't considered
> overhead since the stop cmd id has been sent, and host has already
> moved formard.
> v24->v25:
> - mm: change walk_free_mem_block to return 0 (instead of true) on
> completing the report, and return a non-zero value from the
> callabck, which stops the reporting.
> - virtio-balloon:
> - use enum instead of define for VIRTIO_BALLOON_VQ_INFLATE etc.
> - avoid __virtio_clear_bit when bailing out;
> - a new method to avoid reporting the some cmd id to host twice
> - destroy_workqueue can cancel free page work when the feature is
> negotiated;
> - fail probe when the free page vq size is less than 2.
> v23->v24:
> - change feature name VIRTIO_BALLOON_F_FREE_PAGE_VQ to
> VIRTIO_BALLOON_F_FREE_PAGE_HINT
> - kick when vq->num_free < half full, instead of "= half full"
> - replace BUG_ON with bailing out
> - check vb->balloon_wq in probe(), if null, bail out
> - add a new feature bit for page poisoning
> - solve the corner case that one cmd id being sent to host twice
> v22->v23:
> - change to kick the device when the vq is half-way full;
> - open-code batch_free_page_sg into add_one_sg;
> - change cmd_id from "uint32_t" to "__virtio32";
> - reserver one entry in the vq for the driver to send cmd_id, instead
> of busywaiting for an available entry;
> - add "stop_update" check before queue_work for prudence purpose for
> now, will have a separate patch to discuss this flag check later;
> - init_vqs: change to put some variables on stack to have simpler
> implementation;
> - add destroy_workqueue(vb->balloon_wq);
> v21->v22:
> - add_one_sg: some code and comment re-arrangement
> - send_cmd_id: handle a cornercase
>
> For previous ChangeLog, please reference
> https://lwn.net/Articles/743660/
>
> Wei Wang (4):
> mm: add a function to get free page blocks
> virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
> mm/page_poison: expose page_poisoning_enabled to kernel modules
> virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
>
> drivers/virtio/virtio_balloon.c | 197 +++++++++++++++++++++++++++++-------
> include/linux/mm.h | 1 +
> include/uapi/linux/virtio_balloon.h | 16 +++
> mm/page_alloc.c | 52 ++++++++++
> mm/page_poison.c | 6 ++
> 5 files changed, 235 insertions(+), 37 deletions(-)
>


2018-06-15 23:09:49

by Linus Torvalds

Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <[email protected]> wrote:
>
> This patch adds a function to get free pages blocks from a free page
> list. The obtained free page blocks are hints about free pages, because
> there is no guarantee that they are still on the free page list after
> the function returns.

Ack. This is the kind of simple interface where I don't need to worry
about the MM code calling out to random drivers or subsystems.

I think that "order" should be checked for validity, but from a MM
standpoint I think this is fine.

Linus

2018-06-16 01:12:30

by Wang, Wei W

Subject: RE: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > indicates the support of reporting hints of guest free pages to
> > > > host via virtio-balloon.
> > > >
> > > > Host requests the guest to report free page hints by sending a
> > > > command to the guest via setting the
> > > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT bit of the host_cmd
> > > > config register.
> > > >
> > > > As the first step here, virtio-balloon only reports free page
> > > > hints from the max order (10) free page list to host. This has
> > > > generated similar good results as reporting all free page hints
> > > > during our tests.
> > > >
> > > > TODO:
> > > > - support reporting free page hints from smaller order free page lists
> > > > when there is a need/request from users.
> > > >
> > > > Signed-off-by: Wei Wang <[email protected]>
> > > > Signed-off-by: Liang Li <[email protected]>
> > > > Cc: Michael S. Tsirkin <[email protected]>
> > > > Cc: Michal Hocko <[email protected]>
> > > > Cc: Andrew Morton <[email protected]>
> > > > ---
> > > > drivers/virtio/virtio_balloon.c | 187 +++++++++++++++++++++++++++++-------
> > > > include/uapi/linux/virtio_balloon.h | 13 +++
> > > > 2 files changed, 163 insertions(+), 37 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > > index 6b237e3..582a03b 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -43,6 +43,9 @@
> > > > #define OOM_VBALLOON_DEFAULT_PAGES 256
> > > > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > +/* The size of memory in bytes allocated for reporting free page hints */
> > > > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > +
> > > > static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > >
> > > Doesn't this limit memory size of the guest we can report?
> > > Apparently to several gigabytes ...
> > > OTOH huge guests with lots of free memory is exactly where we would
> > > gain the most ...
> >
> > Yes, the 16-page array can report up to 32GB (each page can hold 512
> > addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> > memory to host. It is not flexible.
> >
> > How about allocating the buffer according to the guest memory size
> > (proportional)? That is,
> >
> > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> >    page blocks that the system can have */
> > 4m_page_blocks = totalram_pages / 1024;
> >
> > /* Allocating one page can hold 512 free page blocks, so calculates
> >    the number of pages that can hold those 4MB blocks. And this
> >    allocation should not exceed 1024 pages */
> > pages_to_allocate = min(4m_page_blocks / 512, 1024);
> >
> > For a 2TB guest, which has 2^19 page blocks (4MB each), we will
> > allocate 1024 pages as the buffer.
> >
> > When the guest has large memory, it should be easier to succeed in
> > allocation of a large buffer. If that allocation fails, that implies
> > that nothing would be got from the 4MB free page list.
> >
> > I think the proportional allocation is simpler compared to other
> > approaches like
> > - scattered buffer, which will complicate the get_from_free_page_list
> >   implementation;
> > - one buffer to call get_from_free_page_list multiple times, which
> >   needs get_from_free_page_list to maintain states.. also too complicated.
> >
> > Best,
> > Wei
> >
>
> That's more reasonable, but question remains what to do if that value
> exceeds MAX_ORDER. I'd say maybe tell host we can't report it.

Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the optimization can still offer 2TB of free memory hints (better than no optimization).

On the other hand, large guests are usually large because they need to use a lot of memory. In that case, they won't have that much free memory to report anyway.

>
> Also allocating it with GFP_KERNEL is out. You only want to take it off the free
> list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.

Sounds good, thanks.

> Also you can't allocate this on device start. First totalram_pages can change.
> Second that's too much memory to tie up forever.

Yes, makes sense.

Best,
Wei

2018-06-16 01:20:20

by Wang, Wei W

Subject: RE: [PATCH v33 1/4] mm: add a function to get free page blocks

On Saturday, June 16, 2018 7:09 AM, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <[email protected]> wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages,
> > because there is no guarantee that they are still on the free page
> > list after the function returns.
>
> Ack. This is the kind of simple interface where I don't need to worry about
> the MM code calling out to random drivers or subsystems.
>
> I think that "order" should be checked for validity, but from a MM standpoint
> I think this is fine.
>

Thanks, will add a validity check for "order".

Best,
Wei

2018-06-16 04:50:47

by Matthew Wilcox

Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> +/**
> + * get_from_free_page_list - get free page blocks from a free page list
> + * @order: the order of the free page list to check
> + * @buf: the array to store the physical addresses of the free page blocks
> + * @size: the array size
> + *
> + * This function offers hints about free pages. There is no guarantee that
> + * the obtained free pages are still on the free page list after the function
> + * returns. pfn_to_page on the obtained free pages is strongly discouraged
> + * and if there is an absolute need for that, make sure to contact MM people
> + * to discuss potential problems.
> + *
> + * The addresses are currently stored to the array in little endian. This
> + * avoids the overhead of converting endianness by the caller who needs data
> + * in the little endian format. Big endian support can be added on demand in
> + * the future.
> + *
> + * Return the number of free page blocks obtained from the free page list.
> + * The maximum number of free page blocks that can be obtained is limited to
> + * the caller's array size.
> + */

Please use:

* Return: The number of free page blocks obtained from the free page list.

Also, please include a

* Context: Any context.

or

* Context: Process context.

or whatever other context this function can be called from. Since you're
taking the lock irqsafe, I assume this can be called from any context, but
I wonder if it makes sense to have this function callable from interrupt
context. Maybe this should be callable from process context only.
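
Put together, the kernel-doc could read, for instance (one possible
form, assuming the process-context restriction):

/**
 * get_from_free_page_list - get free page blocks from a free page list
 * @order: the order of the free page list to check
 * @buf: the array to store the physical addresses of the free page blocks
 * @size: the array size
 *
 * Context: Process context. Takes and releases zone->lock with interrupts
 *          disabled.
 * Return: The number of free page blocks obtained from the free page list,
 *         limited to @size.
 */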

> +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
> +{
> +        struct zone *zone;
> +        enum migratetype mt;
> +        struct page *page;
> +        struct list_head *list;
> +        unsigned long addr, flags;
> +        uint32_t index = 0;
> +
> +        for_each_populated_zone(zone) {
> +                spin_lock_irqsave(&zone->lock, flags);
> +                for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> +                        list = &zone->free_area[order].free_list[mt];
> +                        list_for_each_entry(page, list, lru) {
> +                                addr = page_to_pfn(page) << PAGE_SHIFT;
> +                                if (likely(index < size)) {
> +                                        buf[index++] = cpu_to_le64(addr);
> +                                } else {
> +                                        spin_unlock_irqrestore(&zone->lock,
> +                                                               flags);
> +                                        return index;
> +                                }
> +                        }
> +                }
> +                spin_unlock_irqrestore(&zone->lock, flags);
> +        }
> +
> +        return index;
> +}

I wonder if (to address Michael's concern) you shouldn't instead use
the first free chunk of pages to return the addresses of all the pages,
i.e. something like this:

__le64 *ret = NULL;
unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);

for_each_populated_zone(zone) {
        spin_lock_irq(&zone->lock);
        for (mt = 0; mt < MIGRATE_TYPES; mt++) {
                list = &zone->free_area[order].free_list[mt];
                list_for_each_entry_safe(page, list, lru, ...) {
                        if (index == size)
                                break;
                        addr = page_to_pfn(page) << PAGE_SHIFT;
                        if (!ret) {
                                list_del(...);
                                ret = addr;
                        }
                        ret[index++] = cpu_to_le64(addr);
                }
        }
        spin_unlock_irq(&zone->lock);
}

return ret;
}

You'll need to return the page to the freelist afterwards, but free_pages()
should take care of that.

2018-06-17 00:08:27

by Wang, Wei W

Subject: RE: [PATCH v33 1/4] mm: add a function to get free page blocks

On Saturday, June 16, 2018 12:50 PM, Matthew Wilcox wrote:
> On Fri, Jun 15, 2018 at 12:43:10PM +0800, Wei Wang wrote:
> > +/**
> > + * get_from_free_page_list - get free page blocks from a free page list
> > + * @order: the order of the free page list to check
> > + * @buf: the array to store the physical addresses of the free page blocks
> > + * @size: the array size
> > + *
> > + * This function offers hints about free pages. There is no guarantee that
> > + * the obtained free pages are still on the free page list after the function
> > + * returns. pfn_to_page on the obtained free pages is strongly discouraged
> > + * and if there is an absolute need for that, make sure to contact MM people
> > + * to discuss potential problems.
> > + *
> > + * The addresses are currently stored to the array in little endian. This
> > + * avoids the overhead of converting endianness by the caller who needs data
> > + * in the little endian format. Big endian support can be added on demand in
> > + * the future.
> > + *
> > + * Return the number of free page blocks obtained from the free page list.
> > + * The maximum number of free page blocks that can be obtained is limited to
> > + * the caller's array size.
> > + */
>
> Please use:
>
> * Return: The number of free page blocks obtained from the free page list.
>
> Also, please include a
>
> * Context: Any context.
>
> or
>
> * Context: Process context.
>
> or whatever other conetext this function can be called from. Since you're
> taking the lock irqsafe, I assume this can be called from any context, but I
> wonder if it makes sense to have this function callable from interrupt context.
> Maybe this should be callable from process context only.

Thanks, sounds better to make it process context only.

>
> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)
> > +{
> > +        struct zone *zone;
> > +        enum migratetype mt;
> > +        struct page *page;
> > +        struct list_head *list;
> > +        unsigned long addr, flags;
> > +        uint32_t index = 0;
> > +
> > +        for_each_populated_zone(zone) {
> > +                spin_lock_irqsave(&zone->lock, flags);
> > +                for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +                        list = &zone->free_area[order].free_list[mt];
> > +                        list_for_each_entry(page, list, lru) {
> > +                                addr = page_to_pfn(page) << PAGE_SHIFT;
> > +                                if (likely(index < size)) {
> > +                                        buf[index++] = cpu_to_le64(addr);
> > +                                } else {
> > +                                        spin_unlock_irqrestore(&zone->lock,
> > +                                                               flags);
> > +                                        return index;
> > +                                }
> > +                        }
> > +                }
> > +                spin_unlock_irqrestore(&zone->lock, flags);
> > +        }
> > +
> > +        return index;
> > +}
>
> I wonder if (to address Michael's concern), you shouldn't instead use the first
> free chunk of pages to return the addresses of all the pages.
> ie something like this:
>
> __le64 *ret = NULL;
> unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
>
> for_each_populated_zone(zone) {
> spin_lock_irq(&zone->lock);
> for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> list = &zone->free_area[order].free_list[mt];
> list_for_each_entry_safe(page, list, lru, ...) {
> if (index == size)
> break;
> addr = page_to_pfn(page) << PAGE_SHIFT;
> if (!ret) {
> list_del(...);

Thanks for sharing. But we would probably not take this approach, for the reasons below:

1) I'm not sure that getting a block of free pages to use can be as simple as just plucking it from the list as above. I think it is more prudent to let callers allocate the array via the regular allocation functions.

2) Callers may need to use this with their own protocols, and they want the header and the payload (i.e. the obtained hints) to be located in physically contiguous memory (there are tricks to make this work with non-contiguous memory, but they would just complicate everything). In this case, it is better to have callers allocate the memory on their own and pass the payload part of that memory to this API to be filled.
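For example, a caller following 2) might lay out its command like this and pass only the payload part to the API (the struct and field names here are illustrative, not part of the series):

struct hint_cmd {
	__le32 id;		/* caller's protocol header */
	__le32 nr_hints;	/* number of valid entries in hints[] */
	__le64 hints[];		/* payload filled by the mm API */
};

struct hint_cmd *cmd;
uint32_t size = 4096, nr;

/* A single kmalloc keeps header and payload physically contiguous. */
cmd = kmalloc(sizeof(*cmd) + size * sizeof(__le64), GFP_KERNEL);
if (cmd) {
	nr = get_from_free_page_list(MAX_ORDER - 1, cmd->hints, size);
	cmd->nr_hints = cpu_to_le32(nr);
}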

Best,
Wei


2018-06-18 02:17:34

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

On Fri, Jun 15, 2018 at 09:50:05PM -0700, Matthew Wilcox wrote:
> I wonder if (to address Michael's concern), you shouldn't instead use
> the first free chunk of pages to return the addresses of all the pages.
> ie something like this:
>
> __le64 *ret = NULL;
> unsigned int max = (PAGE_SIZE << order) / sizeof(__le64);
>
> for_each_populated_zone(zone) {
> spin_lock_irq(&zone->lock);
> for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> list = &zone->free_area[order].free_list[mt];
> list_for_each_entry_safe(page, list, lru, ...) {
> if (index == size)
> break;
> addr = page_to_pfn(page) << PAGE_SHIFT;
> if (!ret) {
> list_del(...);
> ret = addr;
> }
> ret[index++] = cpu_to_le64(addr);
> }
> }
> spin_unlock_irq(&zone->lock);
> }
>
> return ret;
> }
>
> You'll need to return the page to the freelist afterwards, but free_pages()
> should take care of that.

Yes, Wei already came up with the idea to stick this data into a
MAX_ORDER allocation. Are you sure that just taking an entry off
the list like that has no bad side effects?
I have a vague memory of someone complaining that everyone
must go through get_free_pages/kmalloc, but I can't
find that anymore.


--
MST

2018-06-18 02:29:45

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> On Friday, June 15, 2018 10:29 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 15, 2018 at 02:11:23PM +0000, Wang, Wei W wrote:
> > > On Friday, June 15, 2018 7:42 PM, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 15, 2018 at 12:43:11PM +0800, Wei Wang wrote:
> > > > > Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature
> > > > > indicates the support of reporting hints of guest free pages to host via
> > virtio-balloon.
> > > > >
> > > > > Host requests the guest to report free page hints by sending a
> > > > > command to the guest via setting the
> > > > VIRTIO_BALLOON_HOST_CMD_FREE_PAGE_HINT
> > > > > bit of the host_cmd config register.
> > > > >
> > > > > As the first step here, virtio-balloon only reports free page
> > > > > hints from the max order (10) free page list to host. This has
> > > > > generated similar good results as reporting all free page hints during
> > our tests.
> > > > >
> > > > > TODO:
> > > > > - support reporting free page hints from smaller order free page lists
> > > > > when there is a need/request from users.
> > > > >
> > > > > Signed-off-by: Wei Wang <[email protected]>
> > > > > Signed-off-by: Liang Li <[email protected]>
> > > > > Cc: Michael S. Tsirkin <[email protected]>
> > > > > Cc: Michal Hocko <[email protected]>
> > > > > Cc: Andrew Morton <[email protected]>
> > > > > ---
> > > > >  drivers/virtio/virtio_balloon.c     | 187 ++++++++++++++++++++++++++++++++-------
> > > > >  include/uapi/linux/virtio_balloon.h |  13 +++
> > > > >  2 files changed, 163 insertions(+), 37 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > > > index 6b237e3..582a03b 100644
> > > > > --- a/drivers/virtio/virtio_balloon.c
> > > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > > @@ -43,6 +43,9 @@
> > > > >  #define OOM_VBALLOON_DEFAULT_PAGES 256
> > > > >  #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > > >
> > > > > +/* The size of memory in bytes allocated for reporting free page hints */
> > > > > +#define FREE_PAGE_HINT_MEM_SIZE (PAGE_SIZE * 16)
> > > > > +
> > > > > static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
> > > > > module_param(oom_pages, int, S_IRUSR | S_IWUSR);
> > > > > MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
> > > >
> > > > Doesn't this limit memory size of the guest we can report?
> > > > Apparently to several gigabytes ...
> > > > OTOH huge guests with lots of free memory is exactly where we would
> > > > gain the most ...
> > >
> > > Yes, the 16-page array can report up to 32GB (each page can hold 512
> > addresses of 4MB free page blocks, i.e. 2GB free memory per page) free
> > memory to host. It is not flexible.
> > >
> > > How about allocating the buffer according to the guest memory size
> > > (proportional)? That is,
> > >
> > > /* Calculates the maximum number of 4MB (equals to 1024 pages) free
> > > pages blocks that the system can have */ 4m_page_blocks =
> > > totalram_pages / 1024;
> > >
> > > /* Allocating one page can hold 512 free page blocks, so calculates
> > > the number of pages that can hold those 4MB blocks. And this
> > > allocation should not exceed 1024 pages */ pages_to_allocate =
> > > min(4m_page_blocks / 512, 1024);
> > >
> > > For a 2TB guests, which has 2^19 page blocks (4MB each), we will allocate
> > 1024 pages as the buffer.
> > >
> > > When the guest has large memory, it should be easier to succeed in
> > allocation of large buffer. If that allocation fails, that implies that nothing
> > would be got from the 4MB free page list.
> > >
> > > I think the proportional allocation is simpler compared to other
> > > approaches like
> > > - scattered buffer, which will complicate the get_from_free_page_list
> > > implementation;
> > > - one buffer to call get_from_free_page_list multiple times, which needs
> > get_from_free_page_list to maintain states.. also too complicated.
> > >
> > > Best,
> > > Wei
> > >
> >
> > That's more reasonable, but question remains what to do if that value
> > exceeds MAX_ORDER. I'd say maybe tell host we can't report it.
>
> Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above, so the maximum memory that can be reported is 2TB. For larger guests, e.g. 4TB, the optimization can still offer 2TB free memory (better than no optimization).

Maybe it's better, maybe it isn't. It certainly muddies the waters even
more. I'd rather we had a better plan. From that POV I like what
Matthew Wilcox suggested for this which is to steal the necessary #
of entries off the list.

If that doesn't fly, we can allocate out of the loop and just retry with more
pages.

> On the other hand, large guests being large mostly because the guests need to use large memory. In that case, they usually won't have that much free memory to report.

And following this logic, small guests don't have a lot of memory to report at all.
Could you remind me why we are considering this optimization, then?

> >
> > Also allocating it with GFP_KERNEL is out. You only want to take it off the free
> > list. So I guess __GFP_NOMEMALLOC and __GFP_ATOMIC.
>
> Sounds good, thanks.
>
> > Also you can't allocate this on device start. First totalram_pages can change.
> > Second that's too much memory to tie up forever.
>
> Yes, makes sense.
>
> Best,
> Wei
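Written out in C, the proportional sizing Wei proposes above comes to just a few lines (using the totalram_pages global as it existed at the time):

/*
 * One MAX_ORDER - 1 block spans MAX_ORDER_NR_PAGES pages (1024 on x86),
 * and one page of buffer holds PAGE_SIZE / sizeof(__le64) = 512 hints.
 */
unsigned long max_blocks = totalram_pages / MAX_ORDER_NR_PAGES;
unsigned long pages_to_allocate = min_t(unsigned long, max_blocks / 512, 1024);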

2018-06-19 01:08:36

by Wang, Wei W

[permalink] [raw]
Subject: RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> 4TB, the optimization can still offer 2TB free memory (better than no
> optimization).
>
> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> suggested for this which is to steal the necessary # of entries off the list.

Actually, what Matthew suggested doesn't make a difference here. That method always steals the first free page blocks, and it can of course be changed to take more. But all of this can be achieved via kmalloc by the caller, which is more prudent and makes the code more straightforward. I don't think we need to take that risk unless the MM folks strongly endorse that approach.

The max size of the kmalloc-ed memory is 4MB, which limits the max free memory we can report to 2TB. Back to the motivation of this work: the cloud guys want to use this optimization to accelerate their guest live migration, and 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffers (e.g. several 4MB arrays passed to this API) instead of a single one as in this version.

This limitation doesn't cause any issue from a functionality perspective. For an extreme case like a 100TB guest live migration, which is theoretically possible today, this optimization still helps skip 2TB of its free memory. The result is that it may reduce live migration time by only 2%, but that is still better than not skipping the 2TB at all (i.e. not using the feature).

So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
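(For reference, the arithmetic behind the 2TB bound: a MAX_ORDER - 1 allocation is 4MB; 4MB divided by 8 bytes per __le64 gives 524,288 hint slots; and 524,288 blocks of 4MB each cover 2TB of free memory.)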



> If that doesn't fly, we can allocate out of the loop and just retry with more
> pages.
>
> > On the other hand, large guests being large mostly because the guests need
> to use large memory. In that case, they usually won't have that much free
> memory to report.
>
> And following this logic small guests don't have a lot of memory to report at
> all.
> Could you remind me why are we considering this optimization then?

If there is a 3TB guest, it is 3TB rather than 2TB mostly because it needs to use, e.g., 2.5TB of memory from time to time. In the worst case it only has 0.5TB of free memory to report, but reporting that 0.5TB with this optimization is still better than no optimization. (And the current 2TB limitation isn't a limitation for the 3TB guest in this case.)

Best,
Wei

2018-06-19 03:07:16

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > 4TB, the optimization can still offer 2TB free memory (better than no
> > optimization).
> >
> > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > suggested for this which is to steal the necessary # of entries off the list.
>
> Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc

I'd do get_user_pages really. You don't want pages split, etc.

> by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
>
> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
>
> This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).

Not clearly better, no, since you are slowing the guest.


> So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.

No one has the time to read documentation about how an internal flag
within a device works. Come on, getting two pages isn't much harder
than a single one.

>
>
> > If that doesn't fly, we can allocate out of the loop and just retry with more
> > pages.
> >
> > > On the other hand, large guests being large mostly because the guests need
> > to use large memory. In that case, they usually won't have that much free
> > memory to report.
> >
> > And following this logic small guests don't have a lot of memory to report at
> > all.
> > Could you remind me why are we considering this optimization then?
>
> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)

I'd rather not spend time writing up random limitations.


> Best,
> Wei

2018-06-19 12:10:40

by Wang, Wei W

[permalink] [raw]
Subject: Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
>> On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
>>> On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
>>>> Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
>>> so the maximum memory that can be reported is 2TB. For larger guests, e.g.
>>> 4TB, the optimization can still offer 2TB free memory (better than no
>>> optimization).
>>>
>>> Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
>>> I'd rather we had a better plan. From that POV I like what Matthew Wilcox
>>> suggested for this which is to steal the necessary # of entries off the list.
>> Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc
> I'd do get_user_pages really. You don't want pages split, etc.

>> by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
>>
>> The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
>>
>> This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).
> Not clearly better, no, since you are slowing the guest.

Not really. Live migration slows down the guest by itself. The guest
spends a little extra time reporting free pages, but in return the
live migration time is reduced a lot, so the guest suffers less from
the migration overall. (There was no drop in workload performance
when using the optimization in our tests.)



>
>
>> So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
> No one has the time to read documentation about how an internal flag
> within a device works. Come on, getting two pages isn't much harder
> than a single one.

>>
>>> If that doesn't fly, we can allocate out of the loop and just retry with more
>>> pages.
>>>
>>>> On the other hand, large guests being large mostly because the guests need
>>> to use large memory. In that case, they usually won't have that much free
>>> memory to report.
>>>
>>> And following this logic small guests don't have a lot of memory to report at
>>> all.
>>> Could you remind me why are we considering this optimization then?
>> If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)
> I'd rather not spend time writing up random limitations.

This is not a random limitation. It will be clearer once you see the code.
Also, I'm not sure how get_user_pages could be used in our case, or what
you meant by "getting two pages". I'll post a new version, and we can
discuss the code.


Best,
Wei

2018-06-19 14:44:19

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > Not necessarily, I think. We have min(4m_page_blocks / 512, 1024) above,
> > > > so the maximum memory that can be reported is 2TB. For larger guests, e.g.
> > > > 4TB, the optimization can still offer 2TB free memory (better than no
> > > > optimization).
> > > >
> > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even more.
> > > > I'd rather we had a better plan. From that POV I like what Matthew Wilcox
> > > > suggested for this which is to steal the necessary # of entries off the list.
> > > Actually what Matthew suggested doesn't make a difference here. That method always steal the first free page blocks, and sure can be changed to take more. But all these can be achieved via kmalloc
> > I'd do get_user_pages really. You don't want pages split, etc.

Oops sorry. I meant get_free_pages .

>
> > > by the caller which is more prudent and makes the code more straightforward. I think we don't need to take that risk unless the MM folks strongly endorse that approach.
> > >
> > > The max size of the kmalloc-ed memory is 4MB, which gives us the limitation that the max free memory to report is 2TB. Back to the motivation of this work, the cloud guys want to use this optimization to accelerate their guest live migration. 2TB guests are not common in today's clouds. When huge guests become common in the future, we can easily tweak this API to fill hints into scattered buffer (e.g. several 4MB arrays passed to this API) instead of one as in this version.
> > >
> > > This limitation doesn't cause any issue from functionality perspective. For the extreme case like a 100TB guest live migration which is theoretically possible today, this optimization helps skip 2TB of its free memory. This result is that it may reduce only 2% live migration time, but still better than not skipping the 2TB (if not using the feature).
> > Not clearly better, no, since you are slowing the guest.
>
> Not really. Live migration slows down the guest itself. It seems that the
> guest spends a little extra time reporting free pages, but in return the
> live migration time gets reduced a lot, which makes the guest endure less
> from live migration. (there is no drop of the workload performance when
> using the optimization in the tests)

My point was you can't say what is better without measuring.
Without special limitations you have hint overhead vs migration
overhead. I think we need to build to scale to huge guests.
We might discover scalability problems down the road,
but no sense in building in limitations straight away.

>
>
> >
> >
> > > So, for the first release of this feature, I think it is better to have the simpler and more straightforward solution as we have now, and clearly document why it can report up to 2TB free memory.
> > No one has the time to read documentation about how an internal flag
> > within a device works. Come on, getting two pages isn't much harder
> > than a single one.
>
> > > > If that doesn't fly, we can allocate out of the loop and just retry with more
> > > > pages.
> > > >
> > > > > On the other hand, large guests being large mostly because the guests need
> > > > to use large memory. In that case, they usually won't have that much free
> > > > memory to report.
> > > >
> > > > And following this logic small guests don't have a lot of memory to report at
> > > > all.
> > > > Could you remind me why are we considering this optimization then?
> > > If there is a 3TB guest, it is 3TB not 2TB mostly because it would need to use e.g. 2.5TB memory from time to time. In the worst case, it only has 0.5TB free memory to report, but reporting 0.5TB with this optimization is better than no optimization. (and the current 2TB limitation isn't a limitation for the 3TB guest in this case)
> > I'd rather not spend time writing up random limitations.
>
> This is not a random limitation. It would be more clear to see the code.

Users don't see code though, that's the point.

Exporting internal limitations from code to users isn't great.


> Also I'm not sure how get_user_pages could be used in our case, and what you
> meant by "getting two pages". I'll post out a new version, and we can
> discuss on the code.

Sorry, I meant get_free_pages.

>
> Best,
> Wei

2018-06-20 09:12:53

by Wang, Wei W

[permalink] [raw]
Subject: RE: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > 1024) above,
> > > > > so the maximum memory that can be reported is 2TB. For larger
> guests, e.g.
> > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > than no optimization).
> > > > >
> > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even
> more.
> > > > > I'd rather we had a better plan. From that POV I like what
> > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> entries off the list.
> > > > Actually what Matthew suggested doesn't make a difference here.
> > > > That method always steal the first free page blocks, and sure can
> > > > be changed to take more. But all these can be achieved via kmalloc
> > > I'd do get_user_pages really. You don't want pages split, etc.
>
> Oops sorry. I meant get_free_pages .

Yes, we can use __get_free_pages, and the max allocation order is MAX_ORDER - 1, which can report up to 2TB of free memory.

"getting two pages isn't harder", do you mean passing two arrays (two allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Please see if the following logic aligns with what you think:

uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
unsigned long *arrays;

/*
 * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
 * store all the hints, we need to allocate multiple arrays.
 * max_hints: the max number of 4MB free page blocks
 * hints_per_page: the number of hints each page can store
 * hints_per_array: the number of hints an array can store
 * total_arrays: the number of arrays we need
 */
max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
hints_per_page = PAGE_SIZE / sizeof(__le64);
hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
total_arrays = max_hints / hints_per_array +
	       !!(max_hints % hints_per_array);
arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
if (!arrays)
	return -ENOMEM;
for (i = 0; i < total_arrays; i++) {
	arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC,
				     MAX_ORDER - 1);
	if (!arrays[i])
		goto out;
}


- the mm API needs to be changed to support storing hints into multiple separate arrays offered by the caller.
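For completeness, the error path behind the goto out above would have to undo the partial allocations, roughly as follows (sketch only; the enclosing function is elided):

out:
	while (i--)
		free_pages(arrays[i], MAX_ORDER - 1);
	kfree(arrays);
	return -ENOMEM;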

Best,
Wei

2018-06-20 14:15:51

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [virtio-dev] Re: [PATCH v33 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

On Wed, Jun 20, 2018 at 09:11:39AM +0000, Wang, Wei W wrote:
> On Tuesday, June 19, 2018 10:43 PM, Michael S. Tsirkin wrote:
> > On Tue, Jun 19, 2018 at 08:13:37PM +0800, Wei Wang wrote:
> > > On 06/19/2018 11:05 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 19, 2018 at 01:06:48AM +0000, Wang, Wei W wrote:
> > > > > On Monday, June 18, 2018 10:29 AM, Michael S. Tsirkin wrote:
> > > > > > On Sat, Jun 16, 2018 at 01:09:44AM +0000, Wang, Wei W wrote:
> > > > > > > Not necessarily, I think. We have min(4m_page_blocks / 512,
> > > > > > > 1024) above,
> > > > > > so the maximum memory that can be reported is 2TB. For larger
> > guests, e.g.
> > > > > > 4TB, the optimization can still offer 2TB free memory (better
> > > > > > than no optimization).
> > > > > >
> > > > > > Maybe it's better, maybe it isn't. It certainly muddies the waters even
> > more.
> > > > > > I'd rather we had a better plan. From that POV I like what
> > > > > > Matthew Wilcox suggested for this which is to steal the necessary # of
> > entries off the list.
> > > > > Actually what Matthew suggested doesn't make a difference here.
> > > > > That method always steal the first free page blocks, and sure can
> > > > > be changed to take more. But all these can be achieved via kmalloc
> > > > I'd do get_user_pages really. You don't want pages split, etc.
> >
> > Oops sorry. I meant get_free_pages .
>
> Yes, we can use __get_free_pages, and the max allocation is MAX_ORDER - 1, which can report up to 2TB free memory.
>
> "getting two pages isn't harder", do you mean passing two arrays (two allocations by get_free_pages(,MAX_ORDER -1)) to the mm API?

Yes, or generally a list of pages with as many as needed.


> Please see if the following logic aligns to what you think:
>
> uint32_t i, max_hints, hints_per_page, hints_per_array, total_arrays;
> unsigned long *arrays;
>
> /*
> * Each array size is MAX_ORDER_NR_PAGES. If one array is not enough to
> * store all the hints, we need to allocate multiple arrays.
> * max_hints: the max number of 4MB free page blocks
> * hints_per_page: the number of hints each page can store
> * hints_per_array: the number of hints an array can store
> * total_arrays: the number of arrays we need
> */
> max_hints = totalram_pages / MAX_ORDER_NR_PAGES;
> hints_per_page = PAGE_SIZE / sizeof(__le64);
> hints_per_array = hints_per_page * MAX_ORDER_NR_PAGES;
> total_arrays = max_hints / hints_per_array +
> !!(max_hints % hints_per_array);
> arrays = kmalloc(total_arrays * sizeof(unsigned long), GFP_KERNEL);
> for (i = 0; i < total_arrays; i++) {
> arrays[i] = __get_free_pages(__GFP_ATOMIC | __GFP_NOMEMALLOC, MAX_ORDER - 1);
>
> if (!arrays[i])
> goto out;
> }
>
>
> - the mm API needs to be changed to support storing hints to multiple separated arrays offered by the caller.
>
> Best,
> Wei

Yes. And add an API to just count entries so we know how many arrays to allocate.

--
MST
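Such a counting API could be as small as the following sketch (the name is hypothetical; reading nr_free without zone->lock only gives an estimate, which is all the sizing decision needs):

unsigned long count_free_page_blocks(int order)
{
	struct zone *zone;
	unsigned long count = 0;

	for_each_populated_zone(zone)
		count += zone->free_area[order].nr_free;

	return count;
}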

2018-06-26 01:56:21

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

On Sat, Jun 16, 2018 at 08:08:53AM +0900, Linus Torvalds wrote:
> On Fri, Jun 15, 2018 at 2:08 PM Wei Wang <[email protected]> wrote:
> >
> > This patch adds a function to get free pages blocks from a free page
> > list. The obtained free page blocks are hints about free pages, because
> > there is no guarantee that they are still on the free page list after
> > the function returns.

...

> > +uint32_t get_from_free_page_list(int order, __le64 buf[], uint32_t size)

...

>
> Ack. This is the kind of simple interface where I don't need to worry
> about the MM code calling out to random drivers or subsystems.
>
> I think that "order" should be checked for validity, but from a MM
> standpoint I think this is fine.
>
> Linus

The only issue seems to be getting hold of a buf that's large enough -
and we don't really know what the size should be, or whether one buf
would be enough.

Linus, do you think it would be ok to have get_from_free_page_list
actually pop entries from the free list and use them as the buffer
to store PAs?

Caller would be responsible for freeing the returned entries.

--
MST

2018-06-27 16:07:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

[ Sorry for slow reply, my travels have made a mess of my inbox ]

On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin <[email protected]> wrote:
>
> Linus, do you think it would be ok to have get_from_free_page_list
> actually pop entries from the free list and use them as the buffer
> to store PAs?

Honestly, what I think the best option would be is to get rid of this
interface *entirely*, and just have the balloon code do

#define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN | \
		      __GFP_THISNODE | __GFP_NOMEMALLOC)

struct page *page = alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);

which is not a new interface, and simply removes the max-order page
from the list if at all possible.

The above has the advantage of "just working", and not having any races.

Now, because you don't want to necessarily *entirely* deplete the max
order, I'd suggest that the *one* new interface you add is just a "how
many max-order pages are there" interface. So then you can query
(either before or after getting the max-order page) just how many of
them there were and whether you want to give that page back.

Notice? No need for any page lists or physical addresses. No races. No
complex new functions.

The physical address you can just get from the "struct page" you got.

And if you run out of memory because of getting a page, you get all
the usual "hey, we ran out of memory" responses..

Wouldn't the above be sufficient?

Linus
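In balloon terms, the allocate-report-free cycle described above would look roughly like this sketch, where report_free_page_to_host() stands in for whatever virtqueue reporting the driver ends up doing:

#define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN | \
		      __GFP_THISNODE | __GFP_NOMEMALLOC)

static void hint_max_order_pages(void)
{
	LIST_HEAD(pages);
	struct page *page, *next;

	/*
	 * Drain the max-order free lists; every successful allocation is
	 * known-free memory, so its address can be reported as a hint.
	 */
	while ((page = alloc_pages(GFP_MINFLAGS, MAX_ORDER - 1))) {
		list_add(&page->lru, &pages);
		report_free_page_to_host(page_to_pfn(page), MAX_ORDER - 1);
	}

	/* Hand everything back to the buddy allocator when done. */
	list_for_each_entry_safe(page, next, &pages, lru) {
		list_del(&page->lru);
		__free_pages(page, MAX_ORDER - 1);
	}
}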

2018-06-27 21:17:23

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

On Wed, Jun 27, 2018 at 09:05:39AM -0700, Linus Torvalds wrote:
> [ Sorry for slow reply, my travels have made a mess of my inbox ]
>
> On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin <[email protected]> wrote:
> >
> > Linus, do you think it would be ok to have get_from_free_page_list
> > actually pop entries from the free list and use them as the buffer
> > to store PAs?
>
> Honestly, what I think the best option would be is to get rid of this
> interface *entirely*, and just have the balloon code do
>
> #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN | \
> 		      __GFP_THISNODE | __GFP_NOMEMALLOC)
>
> struct page *page = alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);
>
> which is not a new interface, and simply removes the max-order page
> from the list if at all possible.
>
> The above has the advantage of "just working", and not having any races.
>
> Now, because you don't want to necessarily *entirely* deplete the max
> order, I'd suggest that the *one* new interface you add is just a "how
> many max-order pages are there" interface. So then you can query
> (either before or after getting the max-order page) just how many of
> them there were and whether you want to give that page back.
>
> Notice? No need for any page lists or physical addresses. No races. No
> complex new functions.
>
> The physical address you can just get from the "struct page" you got.
>
> And if you run out of memory because of getting a page, you get all
> the usual "hey, we ran out of memory" responses..
>
> Wouldn't the above be sufficient?
>
> Linus

I think so, thanks!

Wei, to put it in balloon terms, I think there's one thing we missed: if
you do manage to allocate a page, and you don't have a use for it, then
hey, you can just give it to the host because you know it's free - you
are going to return it to the free list.

--
MST

2018-06-28 11:21:42

by Wang, Wei W

[permalink] [raw]
Subject: Re: [PATCH v33 1/4] mm: add a function to get free page blocks

On 06/28/2018 03:07 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2018 at 09:05:39AM -0700, Linus Torvalds wrote:
>> [ Sorry for slow reply, my travels have made a mess of my inbox ]
>>
>> On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin <[email protected]> wrote:
>>> Linus, do you think it would be ok to have get_from_free_page_list
>>> actually pop entries from the free list and use them as the buffer
>>> to store PAs?
>> Honestly, what I think the best option would be is to get rid of this
>> interface *entirely*, and just have the balloon code do
>>
>> #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN | \
>> 		      __GFP_THISNODE | __GFP_NOMEMALLOC)
>>
>> struct page *page = alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);
>>
>> which is not a new interface, and simply removes the max-order page
>> from the list if at all possible.
>>
>> The above has the advantage of "just working", and not having any races.
>>
>> Now, because you don't want to necessarily *entirely* deplete the max
>> order, I'd suggest that the *one* new interface you add is just a "how
>> many max-order pages are there" interface. So then you can query
>> (either before or after getting the max-order page) just how many of
>> them there were and whether you want to give that page back.
>>
>> Notice? No need for any page lists or physical addresses. No races. No
>> complex new functions.
>>
>> The physical address you can just get from the "struct page" you got.
>>
>> And if you run out of memory because of getting a page, you get all
>> the usual "hey, we ran out of memory" responses..
>>
>> Wouldn't the above be sufficient?
>>
>> Linus


Thanks for the elaboration.

> I think so, thanks!
>
> Wei, to put it in balloon terms, I think there's one thing we missed: if
> you do manage to allocate a page, and you don't have a use for it, then
> hey, you can just give it to the host because you know it's free - you
> are going to return it to the free list.
>

I'm not sure this would be better than Linus' previous suggestion,
because live migration is expected to be performed without disturbing
the guest. If we allocate in order to grab all the free pages wherever
possible, guest applications would be seriously affected. For example,
the network would become very slow, as sk_buff allocations would often
trigger OOM during live migration. And if live migration happens from
time to time and users run memory-related tools like "free -h" in the
guest, the reported statistics (e.g. free memory dropping abruptly
because of the balloon allocations) would confuse them.

With the previous suggestion, we only get hints about the free pages
(i.e. we just report the addresses of free pages to the host without
taking them off the list).
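In outline, this hint-only path is (FREE_PAGE_HINT_MEM_SIZE being the fixed buffer size from the patch quoted earlier):

__le64 *buf = kmalloc(FREE_PAGE_HINT_MEM_SIZE, GFP_KERNEL);
uint32_t nr;

if (buf) {
	/* Record addresses only; the pages never leave the free list. */
	nr = get_from_free_page_list(MAX_ORDER - 1, buf,
				     FREE_PAGE_HINT_MEM_SIZE / sizeof(__le64));
	/* hand buf to the host via the report vq, then kfree(buf) */
}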

Best,
Wei