The current QEMU live migration implementation marks all of the
guest's RAM pages as dirty in the ram bulk stage. All of these pages
are then processed, which takes quite a lot of CPU cycles.
From the guest's point of view, the content of free pages does not
matter. We can make use of this fact and skip processing the free
pages in the ram bulk stage. This saves a lot of CPU cycles, reduces
the network traffic significantly, and speeds up the live migration
process.
This patch set is the QEMU side implementation.
The virtio-balloon is extended so that QEMU can get the free pages
information from the guest through virtio.
After getting the free pages information (a bitmap), QEMU can use it
to filter out the guest's free pages in the ram bulk stage. This makes
the live migration process much more efficient.
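
As a rough illustration of the filtering idea (not the code in this
series, which complements the free pages bitmap into the dirty memory
blocks), the sketch below clears every dirty bit whose page the guest
reported as free. The names skip_free_pages and BITS_PER_WORD are
made up for this example.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Clear every bit in 'dirty' whose counterpart in 'free_pages' is set,
 * i.e. dirty &= ~free_pages, for the first 'nbits' bits only.
 * Returns the number of pages that are still dirty afterwards.
 */
static size_t skip_free_pages(unsigned long *dirty,
                              const unsigned long *free_pages,
                              size_t nbits)
{
    size_t nwords = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
    size_t remaining = 0;

    for (size_t i = 0; i < nwords; i++) {
        /* Bits of this word that fall inside the first 'nbits' bits. */
        size_t tail = nbits - i * BITS_PER_WORD;
        unsigned long mask = tail < BITS_PER_WORD ? (1UL << tail) - 1 : ~0UL;
        unsigned long w;

        dirty[i] &= ~(free_pages[i] & mask);

        /* Count the surviving dirty bits, one set bit per iteration. */
        for (w = dirty[i] & mask; w; w &= w - 1) {
            remaining++;
        }
    }
    return remaining;
}
```

With eight pages all marked dirty and the low four reported free, only
the upper four pages remain to be sent in the bulk stage.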
This RFC version doesn't take post-copy and RDMA into
consideration; both of them may be able to benefit from this PV
solution with some extra modifications.
Performance data
================
Test environment:
CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Host RAM: 64GB
Host Linux Kernel: 4.2.0 Host OS: CentOS 7.1
Guest Linux Kernel: 4.5-rc6 Guest OS: CentOS 6.6
Network: X540-AT2 with 10 Gigabit connection
Guest RAM: 8GB
Case 1: Idle guest that has just booted:
============================================
                    | original |    pv
--------------------------------------------
total time(ms)      |    1894  |    421
--------------------------------------------
transferred ram(KB) |  398017  | 353242
============================================
Case 2: The guest has run a memory-consuming workload, which was
terminated just before live migration.
============================================
                    | original |    pv
--------------------------------------------
total time(ms)      |    7436  |    552
--------------------------------------------
transferred ram(KB) | 8146291  | 361375
============================================
Liang Li (4):
pc: Add code to get the lowmem from PCMachineState
virtio-balloon: Add a new feature to balloon device
migration: not set migration bitmap in setup stage
migration: filter out guest's free pages in ram bulk stage
balloon.c | 30 ++++++++-
hw/i386/pc.c | 5 ++
hw/i386/pc_piix.c | 1 +
hw/i386/pc_q35.c | 1 +
hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
include/hw/i386/pc.h | 3 +-
include/hw/virtio/virtio-balloon.h | 17 +++++-
include/standard-headers/linux/virtio_balloon.h | 1 +
include/sysemu/balloon.h | 10 ++-
migration/ram.c | 64 +++++++++++++++----
10 files changed, 195 insertions(+), 18 deletions(-)
--
1.8.3.1
Set ram_list.dirty_memory instead of the migration bitmap; the
migration bitmap will be updated when doing migration_bitmap_sync().
Set migration_dirty_pages to 0; it will be updated by
migration_bitmap_sync() too.
The following patch is based on this change.
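
To illustrate why starting from an empty migration bitmap and a zero
page count still works, here is a simplified stand-in for the sync
step, assuming flat bitmaps rather than QEMU's real dirty memory
structures. The function name sync_dirty_log is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the sync step: fold 'dirty_log' into
 * 'migration_bmap', clear the log, and return how many pages became
 * newly dirty in the migration bitmap. The caller would add the
 * return value to its running dirty page counter.
 */
static size_t sync_dirty_log(unsigned long *migration_bmap,
                             unsigned long *dirty_log, size_t nwords)
{
    size_t newly_dirty = 0;

    for (size_t i = 0; i < nwords; i++) {
        /* Pages dirty in the log but not yet in the migration bitmap. */
        unsigned long fresh = dirty_log[i] & ~migration_bmap[i];

        migration_bmap[i] |= dirty_log[i];
        dirty_log[i] = 0;

        for (; fresh; fresh &= fresh - 1) {
            newly_dirty++;
        }
    }
    return newly_dirty;
}
```

Because the first sync moves every bit set in the log into the
migration bitmap, anything that clears bits in the log before that
sync is never counted or sent.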
Signed-off-by: Liang Li <[email protected]>
---
migration/ram.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index 704f6a9..ee2547d 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1931,19 +1931,19 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
ram_bitmap_pages = last_ram_offset() >> TARGET_PAGE_BITS;
migration_bitmap_rcu = g_new0(struct BitmapRcu, 1);
migration_bitmap_rcu->bmap = bitmap_new(ram_bitmap_pages);
- bitmap_set(migration_bitmap_rcu->bmap, 0, ram_bitmap_pages);
if (migrate_postcopy_ram()) {
migration_bitmap_rcu->unsentmap = bitmap_new(ram_bitmap_pages);
bitmap_set(migration_bitmap_rcu->unsentmap, 0, ram_bitmap_pages);
}
- /*
- * Count the total number of pages used by ram blocks not including any
- * gaps due to alignment or unplugs.
- */
- migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
+ migration_dirty_pages = 0;
+ QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+ cpu_physical_memory_set_dirty_range(block->offset,
+ block->used_length,
+ DIRTY_MEMORY_MIGRATION);
+ }
memory_global_dirty_log_start();
migration_bitmap_sync();
qemu_mutex_unlock_ramlist();
--
1.8.3.1
The lowmem will be used by the following patch to get
a correct free pages bitmap.
Signed-off-by: Liang Li <[email protected]>
---
hw/i386/pc.c | 5 +++++
hw/i386/pc_piix.c | 1 +
hw/i386/pc_q35.c | 1 +
include/hw/i386/pc.h | 3 ++-
4 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 0aeefd2..f794a84 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1115,6 +1115,11 @@ void pc_hot_add_cpu(const int64_t id, Error **errp)
object_unref(OBJECT(cpu));
}
+ram_addr_t pc_get_lowmem(PCMachineState *pcms)
+{
+ return pcms->lowmem;
+}
+
void pc_cpus_init(PCMachineState *pcms)
{
int i;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 6f8c2cd..268a08c 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -113,6 +113,7 @@ static void pc_init1(MachineState *machine,
}
}
+ pcms->lowmem = lowmem;
if (machine->ram_size >= lowmem) {
pcms->above_4g_mem_size = machine->ram_size - lowmem;
pcms->below_4g_mem_size = lowmem;
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 46522c9..8d9bd39 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -101,6 +101,7 @@ static void pc_q35_init(MachineState *machine)
}
}
+ pcms->lowmem = lowmem;
if (machine->ram_size >= lowmem) {
pcms->above_4g_mem_size = machine->ram_size - lowmem;
pcms->below_4g_mem_size = lowmem;
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 8b3546e..3694c91 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -60,7 +60,7 @@ struct PCMachineState {
bool nvdimm;
/* RAM information (sizes, addresses, configuration): */
- ram_addr_t below_4g_mem_size, above_4g_mem_size;
+ ram_addr_t below_4g_mem_size, above_4g_mem_size, lowmem;
/* CPU and apic information: */
bool apic_xrupt_override;
@@ -229,6 +229,7 @@ void pc_hot_add_cpu(const int64_t id, Error **errp);
void pc_acpi_init(const char *default_dsdt);
void pc_guest_info_init(PCMachineState *pcms);
+ram_addr_t pc_get_lowmem(PCMachineState *pcms);
#define PCI_HOST_PROP_PCI_HOLE_START "pci-hole-start"
#define PCI_HOST_PROP_PCI_HOLE_END "pci-hole-end"
--
1.8.3.1
Get the free pages information through virtio and filter out the free
pages in the ram bulk stage. This can significantly reduce the total
live migration time as well as network traffic.
Signed-off-by: Liang Li <[email protected]>
---
migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 46 insertions(+), 6 deletions(-)
diff --git a/migration/ram.c b/migration/ram.c
index ee2547d..819553b 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -40,6 +40,7 @@
#include "trace.h"
#include "exec/ram_addr.h"
#include "qemu/rcu_queue.h"
+#include "sysemu/balloon.h"
#ifdef DEBUG_MIGRATION_RAM
#define DPRINTF(fmt, ...) \
@@ -241,6 +242,7 @@ static struct BitmapRcu {
struct rcu_head rcu;
/* Main migration bitmap */
unsigned long *bmap;
+ unsigned long *free_pages_bmap;
/* bitmap of pages that haven't been sent even once
* only maintained and used in postcopy at the moment
* where it's used to send the dirtymap at the start
@@ -561,12 +563,7 @@ ram_addr_t migration_bitmap_find_dirty(RAMBlock *rb,
unsigned long next;
bitmap = atomic_rcu_read(&migration_bitmap_rcu)->bmap;
- if (ram_bulk_stage && nr > base) {
- next = nr + 1;
- } else {
- next = find_next_bit(bitmap, size, nr);
- }
-
+ next = find_next_bit(bitmap, size, nr);
*ram_addr_abs = next << TARGET_PAGE_BITS;
return (next - base) << TARGET_PAGE_BITS;
}
@@ -1415,6 +1412,9 @@ void free_xbzrle_decoded_buf(void)
static void migration_bitmap_free(struct BitmapRcu *bmap)
{
g_free(bmap->bmap);
+ if (balloon_free_pages_support()) {
+ g_free(bmap->free_pages_bmap);
+ }
g_free(bmap->unsentmap);
g_free(bmap);
}
@@ -1873,6 +1873,28 @@ err:
return ret;
}
+static void filter_out_guest_free_pages(unsigned long *free_pages_bmap)
+{
+ RAMBlock *block;
+ DirtyMemoryBlocks *blocks;
+ unsigned long end, page;
+
+ blocks = atomic_rcu_read(&ram_list.dirty_memory[DIRTY_MEMORY_MIGRATION]);
+ block = QLIST_FIRST_RCU(&ram_list.blocks);
+ end = TARGET_PAGE_ALIGN(block->offset +
+ block->used_length) >> TARGET_PAGE_BITS;
+ page = block->offset >> TARGET_PAGE_BITS;
+
+ while (page < end) {
+ unsigned long idx = page / DIRTY_MEMORY_BLOCK_SIZE;
+ unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
+ unsigned long num = MIN(end - page, DIRTY_MEMORY_BLOCK_SIZE - offset);
+ unsigned long *p = free_pages_bmap + BIT_WORD(page);
+
+ slow_bitmap_complement(blocks->blocks[idx], p, num);
+ page += num;
+ }
+}
/* Each of ram_save_setup, ram_save_iterate and ram_save_complete has
* long-running RCU critical section. When rcu-reclaims in the code
@@ -1884,6 +1906,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
{
RAMBlock *block;
int64_t ram_bitmap_pages; /* Size of bitmap in pages, including gaps */
+ uint64_t free_pages_count = 0;
dirty_rate_high_cnt = 0;
bitmap_sync_count = 0;
@@ -1931,6 +1954,9 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
ram_bitmap_pages = last_ram_offset() >> TARGET_PAGE_BITS;
migration_bitmap_rcu = g_new0(struct BitmapRcu, 1);
migration_bitmap_rcu->bmap = bitmap_new(ram_bitmap_pages);
+ if (balloon_free_pages_support()) {
+ migration_bitmap_rcu->free_pages_bmap = bitmap_new(ram_bitmap_pages);
+ }
if (migrate_postcopy_ram()) {
migration_bitmap_rcu->unsentmap = bitmap_new(ram_bitmap_pages);
@@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
DIRTY_MEMORY_MIGRATION);
}
memory_global_dirty_log_start();
+
+ if (balloon_free_pages_support() &&
+ balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
+ &free_pages_count) == 0) {
+ qemu_mutex_unlock_iothread();
+ while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
+ &free_pages_count) == 0) {
+ usleep(1000);
+ }
+ qemu_mutex_lock_iothread();
+
+ filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
+ }
+
migration_bitmap_sync();
qemu_mutex_unlock_ramlist();
qemu_mutex_unlock_iothread();
--
1.8.3.1
Extend the virtio balloon device to support a new feature. This
new feature makes it possible to get the guest's free pages
information, which can be used for live migration optimization.
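
The guest's reply, as read by the handler in this patch, is a 64-bit
free page count, a 64-bit bitmap length in bytes, and then the bitmap
itself. The sketch below parses that layout from a flat buffer with
explicit length checks. It is only an illustration: the struct and
function names are invented here, host byte order is assumed, and the
real handler reads from the virtqueue element's scatter-gather list.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header layout assumed from the patch: two 64-bit fields, then the bitmap. */
struct free_pages_reply {
    uint64_t free_pages_count;
    uint64_t bitmap_bytes;
};

/*
 * Copy the bitmap out of 'buf' into 'bitmap' (capacity 'bitmap_cap' bytes).
 * Returns 0 on success, or -1 if the reply is truncated or the
 * guest-supplied length does not fit the source or the destination.
 */
static int parse_free_pages_reply(const uint8_t *buf, size_t buf_len,
                                  uint8_t *bitmap, size_t bitmap_cap,
                                  uint64_t *free_pages_count)
{
    struct free_pages_reply hdr;

    if (buf_len < sizeof(hdr)) {
        return -1;
    }
    memcpy(&hdr, buf, sizeof(hdr));

    if (hdr.bitmap_bytes > bitmap_cap ||
        hdr.bitmap_bytes > buf_len - sizeof(hdr)) {
        return -1;
    }
    memcpy(bitmap, buf + sizeof(hdr), (size_t)hdr.bitmap_bytes);
    *free_pages_count = hdr.free_pages_count;
    return 0;
}
```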
Signed-off-by: Liang Li <[email protected]>
---
balloon.c | 30 ++++++++-
hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
include/hw/virtio/virtio-balloon.h | 17 +++++-
include/standard-headers/linux/virtio_balloon.h | 1 +
include/sysemu/balloon.h | 10 ++-
5 files changed, 134 insertions(+), 5 deletions(-)
diff --git a/balloon.c b/balloon.c
index f2ef50c..a37717e 100644
--- a/balloon.c
+++ b/balloon.c
@@ -36,6 +36,7 @@
static QEMUBalloonEvent *balloon_event_fn;
static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonFreePages *balloon_free_pages_fn;
static void *balloon_opaque;
static bool balloon_inhibited;
@@ -65,9 +66,12 @@ static bool have_balloon(Error **errp)
}
int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque)
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonFreePages *free_pages_func,
+ void *opaque)
{
- if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+ if (balloon_event_fn || balloon_stat_fn || balloon_free_pages_fn
+ || balloon_opaque) {
/* We're already registered one balloon handler. How many can
* a guest really have?
*/
@@ -75,6 +79,7 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
}
balloon_event_fn = event_func;
balloon_stat_fn = stat_func;
+ balloon_free_pages_fn = free_pages_func;
balloon_opaque = opaque;
return 0;
}
@@ -86,6 +91,7 @@ void qemu_remove_balloon_handler(void *opaque)
}
balloon_event_fn = NULL;
balloon_stat_fn = NULL;
+ balloon_free_pages_fn = NULL;
balloon_opaque = NULL;
}
@@ -116,3 +122,23 @@ void qmp_balloon(int64_t target, Error **errp)
trace_balloon_event(balloon_opaque, target);
balloon_event_fn(balloon_opaque, target);
}
+
+bool balloon_free_pages_support(void)
+{
+ return balloon_free_pages_fn ? true : false;
+}
+
+int balloon_get_free_pages(unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count)
+{
+ if (!balloon_free_pages_fn) {
+ return -1;
+ }
+
+ if (!free_pages_bitmap || !free_pages_count) {
+ return -1;
+ }
+
+ return balloon_free_pages_fn(balloon_opaque,
+ free_pages_bitmap, free_pages_count);
+ }
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index e9c30e9..a5b9d08 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -76,6 +76,12 @@ static bool balloon_stats_supported(const VirtIOBalloon *s)
return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_STATS_VQ);
}
+static bool balloon_free_pages_supported(const VirtIOBalloon *s)
+{
+ VirtIODevice *vdev = VIRTIO_DEVICE(s);
+ return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_GET_FREE_PAGES);
+}
+
static bool balloon_stats_enabled(const VirtIOBalloon *s)
{
return s->stats_poll_interval > 0;
@@ -293,6 +299,37 @@ out:
}
}
+static void virtio_balloon_get_free_pages(VirtIODevice *vdev, VirtQueue *vq)
+{
+ VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+ VirtQueueElement *elem;
+ size_t offset = 0;
+ uint64_t bitmap_bytes = 0, free_pages_count = 0;
+
+ elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+ if (!elem) {
+ return;
+ }
+ s->free_pages_vq_elem = elem;
+
+ if (!elem->out_num) {
+ return;
+ }
+
+ iov_to_buf(elem->out_sg, elem->out_num, offset,
+ &free_pages_count, sizeof(uint64_t));
+
+ offset += sizeof(uint64_t);
+ iov_to_buf(elem->out_sg, elem->out_num, offset,
+ &bitmap_bytes, sizeof(uint64_t));
+
+ offset += sizeof(uint64_t);
+ iov_to_buf(elem->out_sg, elem->out_num, offset,
+ s->free_pages_bitmap, bitmap_bytes);
+ s->req_status = DONE;
+ s->free_pages_count = free_pages_count;
+}
+
static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
{
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
@@ -362,6 +399,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
f |= dev->host_features;
virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+ virtio_add_feature(&f, VIRTIO_BALLOON_F_GET_FREE_PAGES);
return f;
}
@@ -372,6 +410,45 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
VIRTIO_BALLOON_PFN_SHIFT);
}
+static int virtio_balloon_free_pages(void *opaque,
+ unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count)
+{
+ VirtIOBalloon *s = opaque;
+ VirtIODevice *vdev = VIRTIO_DEVICE(s);
+ VirtQueueElement *elem = s->free_pages_vq_elem;
+ int len;
+
+ if (!balloon_free_pages_supported(s)) {
+ return -1;
+ }
+
+ if (s->req_status == NOT_STARTED) {
+ s->free_pages_bitmap = free_pages_bitmap;
+ s->req_status = STARTED;
+ s->mem_layout.low_mem = pc_get_lowmem(PC_MACHINE(current_machine));
+ if (!elem->in_num) {
+ elem = virtqueue_pop(s->fvq, sizeof(VirtQueueElement));
+ if (!elem) {
+ return 0;
+ }
+ s->free_pages_vq_elem = elem;
+ }
+ len = iov_from_buf(elem->in_sg, elem->in_num, 0, &s->mem_layout,
+ sizeof(s->mem_layout));
+ virtqueue_push(s->fvq, elem, len);
+ virtio_notify(vdev, s->fvq);
+ return 0;
+ } else if (s->req_status == STARTED) {
+ return 0;
+ } else if (s->req_status == DONE) {
+ *free_pages_count = s->free_pages_count;
+ s->req_status = NOT_STARTED;
+ }
+
+ return 1;
+}
+
static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
{
VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
@@ -429,7 +506,8 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
sizeof(struct virtio_balloon_config));
ret = qemu_add_balloon_handler(virtio_balloon_to_target,
- virtio_balloon_stat, s);
+ virtio_balloon_stat,
+ virtio_balloon_free_pages, s);
if (ret < 0) {
error_setg(errp, "Only one balloon device is supported");
@@ -440,6 +518,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+ s->fvq = virtio_add_queue(vdev, 128, virtio_balloon_get_free_pages);
reset_stats(s);
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..fc173e4 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -23,6 +23,16 @@
#define VIRTIO_BALLOON(obj) \
OBJECT_CHECK(VirtIOBalloon, (obj), TYPE_VIRTIO_BALLOON)
+typedef enum virtio_req_status {
+ NOT_STARTED,
+ STARTED,
+ DONE,
+} VIRTIO_REQ_STATUS;
+
+typedef struct MemLayout {
+ uint64_t low_mem;
+} MemLayout;
+
typedef struct virtio_balloon_stat VirtIOBalloonStat;
typedef struct virtio_balloon_stat_modern {
@@ -33,16 +43,21 @@ typedef struct virtio_balloon_stat_modern {
typedef struct VirtIOBalloon {
VirtIODevice parent_obj;
- VirtQueue *ivq, *dvq, *svq;
+ VirtQueue *ivq, *dvq, *svq, *fvq;
uint32_t num_pages;
uint32_t actual;
uint64_t stats[VIRTIO_BALLOON_S_NR];
VirtQueueElement *stats_vq_elem;
+ VirtQueueElement *free_pages_vq_elem;
size_t stats_vq_offset;
QEMUTimer *stats_timer;
int64_t stats_last_update;
int64_t stats_poll_interval;
uint32_t host_features;
+ uint64_t *free_pages_bitmap;
+ uint64_t free_pages_count;
+ MemLayout mem_layout;
+ VIRTIO_REQ_STATUS req_status;
} VirtIOBalloon;
#endif
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 2e2a6dc..95b7d0c 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -34,6 +34,7 @@
#define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
#define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
+#define VIRTIO_BALLOON_F_GET_FREE_PAGES 3 /* Get the free pages bitmap */
/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12
diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
index 3f976b4..205b272 100644
--- a/include/sysemu/balloon.h
+++ b/include/sysemu/balloon.h
@@ -18,11 +18,19 @@
typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
+typedef int (QEMUBalloonFreePages)(void *opaque,
+ unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count);
int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
- QEMUBalloonStatus *stat_func, void *opaque);
+ QEMUBalloonStatus *stat_func,
+ QEMUBalloonFreePages *free_pages_func,
+ void *opaque);
void qemu_remove_balloon_handler(void *opaque);
bool qemu_balloon_is_inhibited(void);
void qemu_balloon_inhibit(bool state);
+bool balloon_free_pages_support(void);
+int balloon_get_free_pages(unsigned long *free_pages_bitmap,
+ unsigned long *free_pages_count);
#endif
--
1.8.3.1
On Thu, 3 Mar 2016 18:44:28 +0800
Liang Li <[email protected]> wrote:
> Get the free pages information through virtio and filter out the free
> pages in the ram bulk stage. This can significantly reduce the total
> live migration time as well as network traffic.
>
> Signed-off-by: Liang Li <[email protected]>
> ---
> migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 46 insertions(+), 6 deletions(-)
>
> @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> DIRTY_MEMORY_MIGRATION);
> }
> memory_global_dirty_log_start();
> +
> + if (balloon_free_pages_support() &&
> + balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + qemu_mutex_unlock_iothread();
> + while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + usleep(1000);
> + }
> + qemu_mutex_lock_iothread();
> +
> + filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
A general comment: Using the ballooner to get information about pages
that can be filtered out is too limited (there may be other ways to do
this; we might be able to use cmma on s390, for example), and I don't
like hardcoding to a specific method.
What about the reverse approach: Code may register a handler that
populates the free_pages_bitmap which is called during this stage?
<I like the idea of filtering in general, but I haven't looked at the
code yet>
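
To make the suggestion concrete, something along these lines is what
I have in mind. This is only a sketch; none of these names exist in
QEMU, and the demo provider just stands in for a balloon, cmma or
other backend.

```c
#include <stddef.h>

/* Hypothetical callback: fill 'bitmap' (nbits bits), return 0 on success. */
typedef int (*FreePageHintFn)(void *opaque, unsigned long *bitmap,
                              size_t nbits);

static FreePageHintFn hint_fn;
static void *hint_opaque;

/* Register a provider; only one at a time, like the balloon handler. */
static int free_page_hint_register(FreePageHintFn fn, void *opaque)
{
    if (hint_fn) {
        return -1;
    }
    hint_fn = fn;
    hint_opaque = opaque;
    return 0;
}

/* Called by the migration code; -1 means nobody registered a provider. */
static int free_page_hint_query(unsigned long *bitmap, size_t nbits)
{
    if (!hint_fn) {
        return -1;
    }
    return hint_fn(hint_opaque, bitmap, nbits);
}

/* Demo provider standing in for a real backend: reports page 0 as free. */
static int demo_provider(void *opaque, unsigned long *bitmap, size_t nbits)
{
    (void)opaque;
    if (nbits > 0) {
        bitmap[0] |= 1UL;
    }
    return 0;
}
```

The migration code would then only call the query function and would
not need to know which mechanism produced the bitmap.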
> + }
> +
> migration_bitmap_sync();
> qemu_mutex_unlock_ramlist();
> qemu_mutex_unlock_iothread();
On Thu, 3 Mar 2016 18:44:26 +0800
Liang Li <[email protected]> wrote:
> Extend the virtio balloon device to support a new feature, this
> new feature can help to get guest's free pages information, which
> can be used for live migration optimzation.
Do you have a spec for this, e.g. as a patch to the virtio spec?
>
> Signed-off-by: Liang Li <[email protected]>
> ---
> balloon.c | 30 ++++++++-
> hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> include/hw/virtio/virtio-balloon.h | 17 +++++-
> include/standard-headers/linux/virtio_balloon.h | 1 +
> include/sysemu/balloon.h | 10 ++-
> 5 files changed, 134 insertions(+), 5 deletions(-)
> +static int virtio_balloon_free_pages(void *opaque,
> + unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count)
> +{
> + VirtIOBalloon *s = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> + VirtQueueElement *elem = s->free_pages_vq_elem;
> + int len;
> +
> + if (!balloon_free_pages_supported(s)) {
> + return -1;
> + }
> +
> + if (s->req_status == NOT_STARTED) {
> + s->free_pages_bitmap = free_pages_bitmap;
> + s->req_status = STARTED;
> + s->mem_layout.low_mem = pc_get_lowmem(PC_MACHINE(current_machine));
Please don't leak pc-specific information into generic code.
On Thu, Mar 03, 2016 at 06:44:28PM +0800, Liang Li wrote:
> Get the free pages information through virtio and filter out the free
> pages in the ram bulk stage. This can significantly reduce the total
> live migration time as well as network traffic.
>
> Signed-off-by: Liang Li <[email protected]>
> ---
> migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 46 insertions(+), 6 deletions(-)
> @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> DIRTY_MEMORY_MIGRATION);
> }
> memory_global_dirty_log_start();
> +
> + if (balloon_free_pages_support() &&
> + balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + qemu_mutex_unlock_iothread();
> + while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> + &free_pages_count) == 0) {
> + usleep(1000);
> + }
> + qemu_mutex_lock_iothread();
> +
> + filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
> + }
IIUC, this code is synchronous wrt the guest OS balloon driver, i.e. it
is asking the guest for free pages and waiting for a response. If the
guest OS has crashed, this is going to mean QEMU waits forever and thus
migration won't complete. Similarly you need to consider that the guest
OS may be malicious and simply never respond.
So if the migration code is going to use the guest balloon driver to get
info about free pages it has to be done in an asynchronous manner so that
migration can never be stalled by a slow/crashed/malicious guest driver.
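
As a minimal sketch of the property I'm asking for, assuming made-up
names throughout: the wait for the guest must be bounded, and on
timeout migration simply proceeds without the hint. A proper
asynchronous design would hook the completion into the event loop
rather than poll, but even a bounded poll avoids an unbounded stall.

```c
#include <stdbool.h>

enum hint_result { HINT_READY, HINT_TIMED_OUT };

/* Hypothetical poll callback: returns true once the guest has replied. */
typedef bool (*HintPollFn)(void *opaque);

/*
 * Poll at most 'max_polls' times. On HINT_TIMED_OUT the caller is
 * expected to fall back to sending all pages.
 */
static enum hint_result wait_for_hint(HintPollFn poll, void *opaque,
                                      unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (poll(opaque)) {
            return HINT_READY;
        }
        /* Real code would sleep briefly or yield to the event loop here. */
    }
    return HINT_TIMED_OUT;
}

/* Demo poll: reports ready once the counter behind 'opaque' reaches zero. */
static bool countdown_poll(void *opaque)
{
    unsigned *remaining = opaque;

    if (*remaining == 0) {
        return true;
    }
    (*remaining)--;
    return false;
}
```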
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
On Thu, Mar 03, 2016 at 06:44:26PM +0800, Liang Li wrote:
> Extend the virtio balloon device to support a new feature, this
> new feature can help to get guest's free pages information, which
> can be used for live migration optimzation.
>
> Signed-off-by: Liang Li <[email protected]>
I don't understand why we need a new interface.
Balloon already sends free pages to host.
Just teach host to skip these pages.
Maybe instead of starting with code, you
should send a high level description to the
virtio tc for consideration?
You can do it through the mailing list or
using the web form:
http://www.oasis-open.org/committees/comments/form.php?wg_abbrev=virtio
> ---
> balloon.c | 30 ++++++++-
> hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> include/hw/virtio/virtio-balloon.h | 17 +++++-
> include/standard-headers/linux/virtio_balloon.h | 1 +
> include/sysemu/balloon.h | 10 ++-
> 5 files changed, 134 insertions(+), 5 deletions(-)
>
> diff --git a/balloon.c b/balloon.c
> index f2ef50c..a37717e 100644
> --- a/balloon.c
> +++ b/balloon.c
> @@ -36,6 +36,7 @@
>
> static QEMUBalloonEvent *balloon_event_fn;
> static QEMUBalloonStatus *balloon_stat_fn;
> +static QEMUBalloonFreePages *balloon_free_pages_fn;
> static void *balloon_opaque;
> static bool balloon_inhibited;
>
> @@ -65,9 +66,12 @@ static bool have_balloon(Error **errp)
> }
>
> int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> - QEMUBalloonStatus *stat_func, void *opaque)
> + QEMUBalloonStatus *stat_func,
> + QEMUBalloonFreePages *free_pages_func,
> + void *opaque)
> {
> - if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
> + if (balloon_event_fn || balloon_stat_fn || balloon_free_pages_fn
> + || balloon_opaque) {
> /* We're already registered one balloon handler. How many can
> * a guest really have?
> */
> @@ -75,6 +79,7 @@ int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> }
> balloon_event_fn = event_func;
> balloon_stat_fn = stat_func;
> + balloon_free_pages_fn = free_pages_func;
> balloon_opaque = opaque;
> return 0;
> }
> @@ -86,6 +91,7 @@ void qemu_remove_balloon_handler(void *opaque)
> }
> balloon_event_fn = NULL;
> balloon_stat_fn = NULL;
> + balloon_free_pages_fn = NULL;
> balloon_opaque = NULL;
> }
>
> @@ -116,3 +122,23 @@ void qmp_balloon(int64_t target, Error **errp)
> trace_balloon_event(balloon_opaque, target);
> balloon_event_fn(balloon_opaque, target);
> }
> +
> +bool balloon_free_pages_support(void)
> +{
> + return balloon_free_pages_fn ? true : false;
> +}
> +
> +int balloon_get_free_pages(unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count)
> +{
> + if (!balloon_free_pages_fn) {
> + return -1;
> + }
> +
> + if (!free_pages_bitmap || !free_pages_count) {
> + return -1;
> + }
> +
> + return balloon_free_pages_fn(balloon_opaque,
> + free_pages_bitmap, free_pages_count);
> + }
> diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> index e9c30e9..a5b9d08 100644
> --- a/hw/virtio/virtio-balloon.c
> +++ b/hw/virtio/virtio-balloon.c
> @@ -76,6 +76,12 @@ static bool balloon_stats_supported(const VirtIOBalloon *s)
> return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_STATS_VQ);
> }
>
> +static bool balloon_free_pages_supported(const VirtIOBalloon *s)
> +{
> + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> + return virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_GET_FREE_PAGES);
> +}
> +
> static bool balloon_stats_enabled(const VirtIOBalloon *s)
> {
> return s->stats_poll_interval > 0;
> @@ -293,6 +299,37 @@ out:
> }
> }
>
> +static void virtio_balloon_get_free_pages(VirtIODevice *vdev, VirtQueue *vq)
> +{
> + VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
> + VirtQueueElement *elem;
> + size_t offset = 0;
> + uint64_t bitmap_bytes = 0, free_pages_count = 0;
> +
> + elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
> + if (!elem) {
> + return;
> + }
> + s->free_pages_vq_elem = elem;
> +
> + if (!elem->out_num) {
> + return;
> + }
> +
> + iov_to_buf(elem->out_sg, elem->out_num, offset,
> + &free_pages_count, sizeof(uint64_t));
> +
> + offset += sizeof(uint64_t);
> + iov_to_buf(elem->out_sg, elem->out_num, offset,
> + &bitmap_bytes, sizeof(uint64_t));
> +
> + offset += sizeof(uint64_t);
> + iov_to_buf(elem->out_sg, elem->out_num, offset,
> + s->free_pages_bitmap, bitmap_bytes);
> + s->req_status = DONE;
> + s->free_pages_count = free_pages_count;
> +}
> +
> static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
> {
> VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> @@ -362,6 +399,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
> VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
> f |= dev->host_features;
> virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
> + virtio_add_feature(&f, VIRTIO_BALLOON_F_GET_FREE_PAGES);
> return f;
> }
>
> @@ -372,6 +410,45 @@ static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
> VIRTIO_BALLOON_PFN_SHIFT);
> }
>
> +static int virtio_balloon_free_pages(void *opaque,
> + unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count)
> +{
> + VirtIOBalloon *s = opaque;
> + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> + VirtQueueElement *elem = s->free_pages_vq_elem;
> + int len;
> +
> + if (!balloon_free_pages_supported(s)) {
> + return -1;
> + }
> +
> + if (s->req_status == NOT_STARTED) {
> + s->free_pages_bitmap = free_pages_bitmap;
> + s->req_status = STARTED;
> + s->mem_layout.low_mem = pc_get_lowmem(PC_MACHINE(current_machine));
> + if (!elem->in_num) {
> + elem = virtqueue_pop(s->fvq, sizeof(VirtQueueElement));
> + if (!elem) {
> + return 0;
> + }
> + s->free_pages_vq_elem = elem;
> + }
> + len = iov_from_buf(elem->in_sg, elem->in_num, 0, &s->mem_layout,
> + sizeof(s->mem_layout));
> + virtqueue_push(s->fvq, elem, len);
> + virtio_notify(vdev, s->fvq);
> + return 0;
> + } else if (s->req_status == STARTED) {
> + return 0;
> + } else if (s->req_status == DONE) {
> + *free_pages_count = s->free_pages_count;
> + s->req_status = NOT_STARTED;
> + }
> +
> + return 1;
> +}
> +
> static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
> {
> VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
> @@ -429,7 +506,8 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> sizeof(struct virtio_balloon_config));
>
> ret = qemu_add_balloon_handler(virtio_balloon_to_target,
> - virtio_balloon_stat, s);
> + virtio_balloon_stat,
> + virtio_balloon_free_pages, s);
>
> if (ret < 0) {
> error_setg(errp, "Only one balloon device is supported");
> @@ -440,6 +518,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
> s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
> s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
> + s->fvq = virtio_add_queue(vdev, 128, virtio_balloon_get_free_pages);
>
> reset_stats(s);
>
> diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
> index 35f62ac..fc173e4 100644
> --- a/include/hw/virtio/virtio-balloon.h
> +++ b/include/hw/virtio/virtio-balloon.h
> @@ -23,6 +23,16 @@
> #define VIRTIO_BALLOON(obj) \
> OBJECT_CHECK(VirtIOBalloon, (obj), TYPE_VIRTIO_BALLOON)
>
> +typedef enum virtio_req_status {
> + NOT_STARTED,
> + STARTED,
> + DONE,
> +} VIRTIO_REQ_STATUS;
> +
> +typedef struct MemLayout {
> + uint64_t low_mem;
> +} MemLayout;
> +
> typedef struct virtio_balloon_stat VirtIOBalloonStat;
>
> typedef struct virtio_balloon_stat_modern {
> @@ -33,16 +43,21 @@ typedef struct virtio_balloon_stat_modern {
>
> typedef struct VirtIOBalloon {
> VirtIODevice parent_obj;
> - VirtQueue *ivq, *dvq, *svq;
> + VirtQueue *ivq, *dvq, *svq, *fvq;
> uint32_t num_pages;
> uint32_t actual;
> uint64_t stats[VIRTIO_BALLOON_S_NR];
> VirtQueueElement *stats_vq_elem;
> + VirtQueueElement *free_pages_vq_elem;
> size_t stats_vq_offset;
> QEMUTimer *stats_timer;
> int64_t stats_last_update;
> int64_t stats_poll_interval;
> uint32_t host_features;
> + uint64_t *free_pages_bitmap;
> + uint64_t free_pages_count;
> + MemLayout mem_layout;
> + VIRTIO_REQ_STATUS req_status;
> } VirtIOBalloon;
>
> #endif
> diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
> index 2e2a6dc..95b7d0c 100644
> --- a/include/standard-headers/linux/virtio_balloon.h
> +++ b/include/standard-headers/linux/virtio_balloon.h
> @@ -34,6 +34,7 @@
> #define VIRTIO_BALLOON_F_MUST_TELL_HOST 0 /* Tell before reclaiming pages */
> #define VIRTIO_BALLOON_F_STATS_VQ 1 /* Memory Stats virtqueue */
> #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> +#define VIRTIO_BALLOON_F_GET_FREE_PAGES 3 /* Get the free pages bitmap */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12
> diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
> index 3f976b4..205b272 100644
> --- a/include/sysemu/balloon.h
> +++ b/include/sysemu/balloon.h
> @@ -18,11 +18,19 @@
>
> typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
> typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
> +typedef int (QEMUBalloonFreePages)(void *opaque,
> + unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count);
>
> int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
> - QEMUBalloonStatus *stat_func, void *opaque);
> + QEMUBalloonStatus *stat_func,
> + QEMUBalloonFreePages *free_pages_func,
> + void *opaque);
> void qemu_remove_balloon_handler(void *opaque);
> bool qemu_balloon_is_inhibited(void);
> void qemu_balloon_inhibit(bool state);
> +bool balloon_free_pages_support(void);
> +int balloon_get_free_pages(unsigned long *free_pages_bitmap,
> + unsigned long *free_pages_count);
>
> #endif
> --
> 1.8.3.1
On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
>
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
>
> This patch set is the QEMU side implementation.
>
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
>
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.
>
> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.
>
> Performance data
> ================
>
> Test environment:
>
> CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel: 4.2.0 Host OS: CentOS 7.1
> Guest Linux Kernel: 4.5.rc6 Guest OS: CentOS 6.6
> Network: X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
>
> Case 1: Idle guest just boots:
> ============================================
> | original | pv
> -------------------------------------------
> total time(ms) | 1894 | 421
> --------------------------------------------
> transferred ram(KB) | 398017 | 353242
> ============================================
>
>
> Case 2: The guest has ever run some memory consuming workload, the
> workload is terminated just before live migration.
> ============================================
> | original | pv
> -------------------------------------------
> total time(ms) | 7436 | 552
> --------------------------------------------
> transferred ram(KB) | 8146291 | 361375
> ============================================
Both cases look very artificial to me. Normally you migrate VMs which
have started long ago and which can't have their services terminated
before the migration, so I wouldn't expect any useful amount of free
pages obtained this way.
OTOH I don't see why you can't just inflate the balloon before the
migration, and really optimize the amount of transferred data this way?
With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
good estimate of the optimal balloon size, and with the recently merged
balloon deflation on OOM it's a safe thing to do without exposing the
guest workloads to OOM risks.
Roman.
* Liang Li ([email protected]) wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
>
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
>
> This patch set is the QEMU side implementation.
>
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
>
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.
Hi,
An interesting solution; I know a few different people have been looking
at how to speed up ballooned VM migration.
I wonder if it would be possible to avoid the kernel changes by
parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
mapped pages in the guest ram, would it achieve the same result?
> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.
For postcopy to be safe, you would still need to send a message to the
destination telling it that there were zero pages, otherwise the destination
can't tell if it's supposed to request the page from the source or
treat the page as zero.
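As a rough illustration of that point, here is a minimal sketch in C of the bookkeeping a destination would need. The enum, the function names and the idea of a dedicated "these pages are zero" message are all hypothetical; none of them come from this patch set or from QEMU's postcopy code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state kept on the destination. */
typedef enum {
    PAGE_UNKNOWN = 0,   /* nothing heard yet: must be requested from the source */
    PAGE_ZERO,          /* source said "skipped, treat as zero" */
    PAGE_RECEIVED,      /* contents already arrived */
} DestPageState;

/* Record a (hypothetical) "pages first..first+count-1 are zero" message. */
static void dest_mark_zero(DestPageState *state, size_t first, size_t count)
{
    assert(state);
    for (size_t i = 0; i < count; i++) {
        /* Never downgrade a page whose contents already arrived. */
        if (state[first + i] == PAGE_UNKNOWN) {
            state[first + i] = PAGE_ZERO;
        }
    }
}

/* On a fault, only pages the destination knows nothing about need a request. */
static bool dest_must_request(const DestPageState *state, size_t page)
{
    assert(state);
    return state[page] == PAGE_UNKNOWN;
}
```

Without the PAGE_ZERO state, a skipped page and a not-yet-sent page would look identical to the destination, which is the ambiguity described above.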
Dave
>
> Performance data
> ================
>
> Test environment:
>
> CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel: 4.2.0 Host OS: CentOS 7.1
> Guest Linux Kernel: 4.5.rc6 Guest OS: CentOS 6.6
> Network: X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
>
> Case 1: Idle guest just boots:
> ============================================
> | original | pv
> -------------------------------------------
> total time(ms) | 1894 | 421
> --------------------------------------------
> transferred ram(KB) | 398017 | 353242
> ============================================
>
>
> Case 2: The guest has ever run some memory consuming workload, the
> workload is terminated just before live migration.
> ============================================
> | original | pv
> -------------------------------------------
> total time(ms) | 7436 | 552
> --------------------------------------------
> transferred ram(KB) | 8146291 | 361375
> ============================================
>
> Liang Li (4):
> pc: Add code to get the lowmem form PCMachineState
> virtio-balloon: Add a new feature to balloon device
> migration: not set migration bitmap in setup stage
> migration: filter out guest's free pages in ram bulk stage
>
> balloon.c | 30 ++++++++-
> hw/i386/pc.c | 5 ++
> hw/i386/pc_piix.c | 1 +
> hw/i386/pc_q35.c | 1 +
> hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> include/hw/i386/pc.h | 3 +-
> include/hw/virtio/virtio-balloon.h | 17 +++++-
> include/standard-headers/linux/virtio_balloon.h | 1 +
> include/sysemu/balloon.h | 10 ++-
> migration/ram.c | 64 +++++++++++++++----
> 10 files changed, 195 insertions(+), 18 deletions(-)
>
> --
> 1.8.3.1
>
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
> On Thu, Mar 03, 2016 at 06:44:24PM +0800, Liang Li wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free pages
> > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
> >
> > Performance data
> > ================
> >
> > Test environment:
> >
> > CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz Host RAM: 64GB
> > Host Linux Kernel: 4.2.0 Host OS: CentOS 7.1
> > Guest Linux Kernel: 4.5.rc6 Guest OS: CentOS 6.6
> > Network: X540-AT2 with 10 Gigabit connection Guest RAM: 8GB
> >
> > Case 1: Idle guest just boots:
> > ============================================
> > | original | pv
> > -------------------------------------------
> > total time(ms) | 1894 | 421
> > --------------------------------------------
> > transferred ram(KB) | 398017 | 353242
> > ============================================
> >
> >
> > Case 2: The guest has ever run some memory consuming workload, the
> > workload is terminated just before live migration.
> > ============================================
> > | original | pv
> > -------------------------------------------
> > total time(ms) | 7436 | 552
> > --------------------------------------------
> > transferred ram(KB) | 8146291 | 361375
> > ============================================
>
> Both cases look very artificial to me. Normally you migrate VMs which have
> started long ago and which can't have their services terminated before the
> migration, so I wouldn't expect any useful amount of free pages obtained
> this way.
>
Yes, they are somewhat artificial; they were chosen to emphasize the effect, and both
cases are very easy to reproduce. Testing with a real workload in a production
environment would be more convincing.
We can predict that as long as the guest doesn't use up all of its memory, this solution
will still take effect and shorten the total live migration time. (Of course, we should
also consider the time cost of the virtio communication.)
> OTOH I don't see why you can't just inflate the balloon before the migration,
> and really optimize the amount of transferred data this way?
> With the recently proposed VIRTIO_BALLOON_S_AVAIL you can have a fairly
> good estimate of the optimal balloon size, and with the recently merged
> balloon deflation on OOM it's a safe thing to do without exposing the guest
> workloads to OOM risks.
>
> Roman.
Thanks for the information. The free page bitmap is not very large: for a
guest with 8GB of RAM, only 256KB of extra memory is required.
Compared to this solution, inflating the balloon is more expensive. If the balloon size
is not optimal and the guest requests more memory during live migration, the guest's
performance will be impacted.
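The 256KB figure follows from one bit per 4KB page (VIRTIO_BALLOON_PFN_SHIFT is 12): 8GB / 4KB = 2M pages, and 2M bits / 8 = 256KB. A small sketch of the same arithmetic in C; the helper name is made up for illustration and is not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for a bitmap with one bit per guest page.
 * page_shift is log2 of the page size (12 for 4KB pages).
 */
static uint64_t free_bitmap_bytes(uint64_t guest_ram_bytes, unsigned page_shift)
{
    uint64_t pages;

    assert(page_shift < 64);
    pages = guest_ram_bytes >> page_shift;
    return (pages + 7) / 8;   /* round up to whole bytes */
}
```

For an 8GB guest, free_bitmap_bytes(8ULL << 30, 12) gives 262144 bytes, i.e. 256KB.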
Liang
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
>
> * Liang Li ([email protected]) wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free pages
> > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
>
> Hi,
> An interesting solution; I know a few different people have been looking at
> how to speed up ballooned VM migration.
>
Ooh, different solutions for the same purpose, and both based on the balloon.
> I wonder if it would be possible to avoid the kernel changes by parsing
> /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> pages in the guest ram, would it achieve the same result?
>
Detecting only the unmapped/zero mapped pages is not enough. In a
situation like case 2, it can't achieve the same result.
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
>
> For postcopy to be safe, you would still need to send a message to the
> destination telling it that there were zero pages, otherwise the destination
> can't tell if it's supposed to request the page from the source or treat the
> page as zero.
>
> Dave
I will consider this later, thanks, Dave.
Liang
> Subject: Re: [RFC qemu 2/4] virtio-balloon: Add a new feature to balloon
> device
>
> On Thu, Mar 03, 2016 at 06:44:26PM +0800, Liang Li wrote:
> > Extend the virtio balloon device to support a new feature, this new
> > feature can help to get guest's free pages information, which can be
> > used for live migration optimzation.
> >
> > Signed-off-by: Liang Li <[email protected]>
>
> I don't understand why we need a new interface.
> Balloon already sends free pages to host.
> Just teach host to skip these pages.
>
I just make use of the current virtio-balloon implementation; it would be more complicated to
invent a new virtio device...
Actually, there is no need to inflate the balloon before live migration, so the host has
no information about the guest's free pages. That's why I added a new one.
> Maybe instead of starting with code, you should send a high level description
> to the virtio tc for consideration?
>
> You can do it through the mailing list or using the web form:
> http://www.oasis-open.org/committees/comments/form.php?wg_abbrev=virtio
>
Thanks for your information and suggestion.
Liang
> On Thu, 3 Mar 2016 18:44:28 +0800
> Liang Li <[email protected]> wrote:
>
> > Get the free pages information through virtio and filter out the free
> > pages in the ram bulk stage. This can significantly reduce the total
> > live migration time as well as network traffic.
> >
> > Signed-off-by: Liang Li <[email protected]>
> > ---
> > migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 46 insertions(+), 6 deletions(-)
> >
>
> > @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> > DIRTY_MEMORY_MIGRATION);
> > }
> > memory_global_dirty_log_start();
> > +
> > + if (balloon_free_pages_support() &&
> > + balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > + &free_pages_count) == 0) {
> > + qemu_mutex_unlock_iothread();
> > + while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > + &free_pages_count) == 0) {
> > + usleep(1000);
> > + }
> > + qemu_mutex_lock_iothread();
> > +
> > +
> > + filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
>
> A general comment: Using the ballooner to get information about pages that
> can be filtered out is too limited (there may be other ways to do this; we
> might be able to use cmma on s390, for example), and I don't like hardcoding
> to a specific method.
>
> What about the reverse approach: Code may register a handler that
> populates the free_pages_bitmap which is called during this stage?
Good suggestion, thanks!
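One way to read that suggestion is a small registry that the migration code queries without knowing who fills the bitmap. The following is only a sketch under that reading; the type, the function names and the fixed-size table are hypothetical and do not exist in QEMU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: fill `bitmap` (one bit per page), return 0 on success. */
typedef int (*FreePageProvider)(void *opaque, unsigned long *bitmap,
                                size_t nwords);

#define MAX_PROVIDERS 4

static struct {
    FreePageProvider fn;
    void *opaque;
} providers[MAX_PROVIDERS];
static size_t nr_providers;

/* Register a provider; returns 0, or -1 if the table is full. */
static int free_page_provider_register(FreePageProvider fn, void *opaque)
{
    assert(fn);
    if (nr_providers == MAX_PROVIDERS) {
        return -1;
    }
    providers[nr_providers].fn = fn;
    providers[nr_providers].opaque = opaque;
    nr_providers++;
    return 0;
}

/* Ask each provider in turn; the first success wins. -1 if none succeeds. */
static int free_page_bitmap_populate(unsigned long *bitmap, size_t nwords)
{
    for (size_t i = 0; i < nr_providers; i++) {
        if (providers[i].fn(providers[i].opaque, bitmap, nwords) == 0) {
            return 0;
        }
    }
    return -1;
}

/* Example provider, for illustration only: copies a fixed pattern. */
static int pattern_provider(void *opaque, unsigned long *bitmap, size_t nwords)
{
    const unsigned long *pattern = opaque;

    for (size_t i = 0; i < nwords; i++) {
        bitmap[i] = *pattern;
    }
    return 0;
}
```

With this shape, the balloon device or an s390 cmma backend would each register their own callback, and the migration code would only call free_page_bitmap_populate().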
Liang
> <I like the idea of filtering in general, but I haven't looked at the code yet>
>
> On Thu, 3 Mar 2016 18:44:26 +0800
> Liang Li <[email protected]> wrote:
>
> > Extend the virtio balloon device to support a new feature, this new
> > feature can help to get guest's free pages information, which can be
> > used for live migration optimzation.
>
> Do you have a spec for this, e.g. as a patch to the virtio spec?
Not yet.
>
> >
> > Signed-off-by: Liang Li <[email protected]>
> > ---
> > balloon.c | 30 ++++++++-
> > hw/virtio/virtio-balloon.c | 81 ++++++++++++++++++++++++-
> > include/hw/virtio/virtio-balloon.h | 17 +++++-
> > include/standard-headers/linux/virtio_balloon.h | 1 +
> > include/sysemu/balloon.h | 10 ++-
> > 5 files changed, 134 insertions(+), 5 deletions(-)
>
> > +static int virtio_balloon_free_pages(void *opaque,
> > + unsigned long *free_pages_bitmap,
> > + unsigned long *free_pages_count)
> > +{
> > + VirtIOBalloon *s = opaque;
> > + VirtIODevice *vdev = VIRTIO_DEVICE(s);
> > + VirtQueueElement *elem = s->free_pages_vq_elem;
> > + int len;
> > +
> > + if (!balloon_free_pages_supported(s)) {
> > + return -1;
> > + }
> > +
> > + if (s->req_status == NOT_STARTED) {
> > + s->free_pages_bitmap = free_pages_bitmap;
> > + s->req_status = STARTED;
> > + s->mem_layout.low_mem =
> > + pc_get_lowmem(PC_MACHINE(current_machine));
>
> Please don't leak pc-specific information into generic code.
I had already noticed that and just left it here in this initial RFC version;
the hard part of this solution is how to handle different architectures ...
Thanks!
Liang
> On Thu, Mar 03, 2016 at 06:44:28PM +0800, Liang Li wrote:
> > Get the free pages information through virtio and filter out the free
> > pages in the ram bulk stage. This can significantly reduce the total
> > live migration time as well as network traffic.
> >
> > Signed-off-by: Liang Li <[email protected]>
> > ---
> > migration/ram.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++------
> > 1 file changed, 46 insertions(+), 6 deletions(-)
>
> > @@ -1945,6 +1971,20 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
> > DIRTY_MEMORY_MIGRATION);
> > }
> > memory_global_dirty_log_start();
> > +
> > + if (balloon_free_pages_support() &&
> > + balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > + &free_pages_count) == 0) {
> > + qemu_mutex_unlock_iothread();
> > + while (balloon_get_free_pages(migration_bitmap_rcu->free_pages_bmap,
> > + &free_pages_count) == 0) {
> > + usleep(1000);
> > + }
> > + qemu_mutex_lock_iothread();
> > +
> > + filter_out_guest_free_pages(migration_bitmap_rcu->free_pages_bmap);
> > + }
>
> IIUC, this code is synchronous wrt the guest OS balloon driver, i.e. it is asking
> the guest for free pages and waiting for a response. If the guest OS has
> crashed this is going to mean QEMU waits forever and thus migration won't
> complete. Similarly you need to consider that the guest OS may be malicious
> and simply never respond.
>
> So if the migration code is going to use the guest balloon driver to get info
> about free pages it has to be done in an asynchronous manner so that
> migration can never be stalled by a slow/crashed/malicious guest driver.
>
> Regards,
> Daniel
Really, thanks a lot!
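A minimal sketch of what a non-blocking variant could look like: the request is started once, and each migration iteration merely checks whether the reply has arrived. The names below are hypothetical (the REQ_* states only mirror NOT_STARTED/STARTED/DONE from the patch), and this is not the design that was eventually adopted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REQ_NOT_STARTED, REQ_STARTED, REQ_DONE } ReqStatus;

typedef struct {
    ReqStatus status;   /* set to REQ_DONE by the device when the guest replies */
    bool applied;       /* true once the bitmap has been consumed */
} FreePageRequest;

/* Kick off the request without waiting for the guest. */
static void free_page_request_start(FreePageRequest *req)
{
    assert(req);
    if (req->status == REQ_NOT_STARTED) {
        req->status = REQ_STARTED;
    }
}

/*
 * Called once per migration iteration. Never blocks: returns true exactly
 * once, on the first call after the reply arrived, so the caller can apply
 * the bitmap at that point. A guest that never answers simply means this
 * keeps returning false and migration proceeds without the optimization.
 */
static bool free_page_request_poll(FreePageRequest *req)
{
    assert(req);
    if (req->status == REQ_DONE && !req->applied) {
        req->applied = true;
        return true;
    }
    return false;
}
```

Whether a bitmap that arrives late is still worth applying is a separate question that this sketch does not address.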
Liang
On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> * Liang Li ([email protected]) wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free
> > pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> > the network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
>
> Hi,
> An interesting solution; I know a few different people have been looking
> at how to speed up ballooned VM migration.
>
> I wonder if it would be possible to avoid the kernel changes by
> parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
> mapped pages in the guest ram, would it achieve the same result?
Yes I was about to suggest the same thing: it's simple and makes use of
the existing infrastructure. And you wouldn't need to care if the pages
were unmapped by ballooning or anything else (alternative balloon
implementations, not yet touched by the guest, etc.). Besides, you
wouldn't need to synchronize with the guest.
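For reference, a minimal sketch in C of how a process can check whether one of its own pages is backed at all. It relies on the layout described in the Linux kernel pagemap documentation (one 64-bit entry per virtual page, bit 63 = present, bit 62 = swapped); the function names are made up for illustration, and detecting mappings of the shared zero page is not covered.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT (1ULL << 63)   /* page is present in RAM */
#define PM_SWAPPED (1ULL << 62)   /* page is in swap */

/* A page with neither bit set has no backing: never touched, or discarded. */
static bool pagemap_entry_is_unmapped(uint64_t entry)
{
    return (entry & (PM_PRESENT | PM_SWAPPED)) == 0;
}

/*
 * Read the pagemap entry for virtual address `vaddr` of the calling process.
 * Returns 0 on success, -1 on any error.
 */
static int pagemap_read_entry(const void *vaddr, uint64_t *entry)
{
    long page_size = sysconf(_SC_PAGESIZE);
    off_t offset;
    ssize_t n;
    int fd;

    assert(entry);
    if (page_size <= 0) {
        return -1;
    }
    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    offset = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
             (off_t)sizeof(*entry);
    if (lseek(fd, offset, SEEK_SET) != offset) {
        close(fd);
        return -1;
    }
    n = read(fd, entry, sizeof(*entry));
    close(fd);
    return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```

A real implementation would keep the file open and read entries for a whole RAM block in one call rather than one page at a time.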
Roman.
On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > I wonder if it would be possible to avoid the kernel changes by parsing
> > /proc/self/pagemap - if that can be used to detect unmapped/zero mapped
> > pages in the guest ram, would it achieve the same result?
>
> Only detect the unmapped/zero mapped pages is not enough. Consider the
> situation like case 2, it can't achieve the same result.
Your case 2 doesn't exist in the real world. If people could stop their
main memory consumer in the guest prior to migration they wouldn't need
live migration at all.
I tend to think you can safely assume there's no free memory in the
guest, so there's little point optimizing for it.
OTOH it makes perfect sense optimizing for the unmapped memory that's
made up, in particular, by the balloon, and consider inflating the
balloon right before migration unless you already maintain it at the
optimal size for other reasons (like e.g. a global resource manager
optimizing the VM density).
Roman.
> On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> > * Liang Li ([email protected]) wrote:
> > > The current QEMU live migration implementation mark the all the
> > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > will be processed and that takes quit a lot of CPU cycles.
> > >
> > > From guest's point of view, it doesn't care about the content in
> > > free pages. We can make use of this fact and skip processing the
> > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > reduce the network traffic significantly while speed up the live
> > > migration process obviously.
> > >
> > > This patch set is the QEMU side implementation.
> > >
> > > The virtio-balloon is extended so that QEMU can get the free pages
> > > information from the guest through virtio.
> > >
> > > After getting the free pages information (a bitmap), QEMU can use it
> > > to filter out the guest's free pages in the ram bulk stage. This
> > > make the live migration process much more efficient.
> >
> > Hi,
> > An interesting solution; I know a few different people have been
> > looking at how to speed up ballooned VM migration.
> >
> > I wonder if it would be possible to avoid the kernel changes by
> > parsing /proc/self/pagemap - if that can be used to detect
> > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
>
> Yes I was about to suggest the same thing: it's simple and makes use of the
> existing infrastructure. And you wouldn't need to care if the pages were
> unmapped by ballooning or anything else (alternative balloon
> implementations, not yet touched by the guest, etc.). Besides, you wouldn't
> need to synchronize with the guest.
>
> Roman.
The unmapped/zero mapped pages can be detected by parsing /proc/self/pagemap,
but the free pages can't be detected this way. Imagine an application allocates a large amount
of memory, frees it after use, and then live migration happens. All these free pages
will be processed and sent to the destination, which is not optimal.
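The filtering itself amounts to clearing the guest-reported free pages from the set of pages still to be sent. A minimal sketch in C, assuming both bitmaps use one bit per guest page with the same numbering; the function name is hypothetical and this is not the patch's filter_out_guest_free_pages():

```c
#include <assert.h>
#include <stddef.h>

/*
 * Clear every page marked free from the set of pages still to be sent.
 * Returns the number of pages removed from the to-send set.
 */
static size_t skip_free_pages(unsigned long *to_send,
                              const unsigned long *free_pages,
                              size_t nwords)
{
    size_t skipped = 0;

    assert(to_send && free_pages);
    for (size_t i = 0; i < nwords; i++) {
        unsigned long hit = to_send[i] & free_pages[i];

        to_send[i] &= ~free_pages[i];
        while (hit) {           /* count the bits cleared in this word */
            hit &= hit - 1;
            skipped++;
        }
    }
    return skipped;
}
```

Pages that the guest reuses after the bitmap was taken would have to be picked up again by dirty logging, which is why the ordering relative to memory_global_dirty_log_start() matters.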
Liang
On Fri, Mar 04, 2016 at 08:23:09AM +0000, Li, Liang Z wrote:
> > On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> > > * Liang Li ([email protected]) wrote:
> > > > The current QEMU live migration implementation mark the all the
> > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > will be processed and that takes quit a lot of CPU cycles.
> > > >
> > > > From guest's point of view, it doesn't care about the content in
> > > > free pages. We can make use of this fact and skip processing the
> > > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > > reduce the network traffic significantly while speed up the live
> > > > migration process obviously.
> > > >
> > > > This patch set is the QEMU side implementation.
> > > >
> > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > information from the guest through virtio.
> > > >
> > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > to filter out the guest's free pages in the ram bulk stage. This
> > > > make the live migration process much more efficient.
> > >
> > > Hi,
> > > An interesting solution; I know a few different people have been
> > > looking at how to speed up ballooned VM migration.
> > >
> > > I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > same result?
> >
> > Yes I was about to suggest the same thing: it's simple and makes use of the
> > existing infrastructure. And you wouldn't need to care if the pages were
> > unmapped by ballooning or anything else (alternative balloon
> > implementations, not yet touched by the guest, etc.). Besides, you wouldn't
> > need to synchronize with the guest.
> >
> > Roman.
>
> The unmapped/zero mapped pages can be detected by parsing /proc/self/pagemap,
> but the free pages can't be detected by this. Imaging an application allocates a large amount
> of memory , after using, it frees the memory, then live migration happens. All these free pages
> will be process and sent to the destination, it's not optimal.
First, the likelihood of such a situation is marginal, there's no point
optimizing for it specifically.
And second, even if that happens, you inflate the balloon right before
the migration and the free memory will get unmapped very quickly, so this
case is covered nicely by the same technique that works for more
realistic cases, too.
Roman.
* Roman Kagan ([email protected]) wrote:
> On Fri, Mar 04, 2016 at 08:23:09AM +0000, Li, Liang Z wrote:
> > > On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Liang Li ([email protected]) wrote:
> > > > > The current QEMU live migration implementation mark the all the
> > > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > > will be processed and that takes quit a lot of CPU cycles.
> > > > >
> > > > > From guest's point of view, it doesn't care about the content in
> > > > > free pages. We can make use of this fact and skip processing the
> > > > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > > > reduce the network traffic significantly while speed up the live
> > > > > migration process obviously.
> > > > >
> > > > > This patch set is the QEMU side implementation.
> > > > >
> > > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > > information from the guest through virtio.
> > > > >
> > > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > > to filter out the guest's free pages in the ram bulk stage. This
> > > > > make the live migration process much more efficient.
> > > >
> > > > Hi,
> > > > An interesting solution; I know a few different people have been
> > > > looking at how to speed up ballooned VM migration.
> > > >
> > > > I wonder if it would be possible to avoid the kernel changes by
> > > > parsing /proc/self/pagemap - if that can be used to detect
> > > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > > same result?
> > >
> > > Yes I was about to suggest the same thing: it's simple and makes use of the
> > > existing infrastructure. And you wouldn't need to care if the pages were
> > > unmapped by ballooning or anything else (alternative balloon
> > > implementations, not yet touched by the guest, etc.). Besides, you wouldn't
> > > need to synchronize with the guest.
> > >
> > > Roman.
> >
> > The unmapped/zero mapped pages can be detected by parsing /proc/self/pagemap,
> > but the free pages can't be detected by this. Imaging an application allocates a large amount
> > of memory , after using, it frees the memory, then live migration happens. All these free pages
> > will be process and sent to the destination, it's not optimal.
>
> First, the likelihood of such a situation is marginal, there's no point
> optimizing for it specifically.
>
> And second, even if that happens, you inflate the balloon right before
> the migration and the free memory will get umapped very quickly, so this
> case is covered nicely by the same technique that works for more
> realistic cases, too.
Although I wonder which is cheaper; that would be fairly expensive for
the guest wouldn't it? And you'd somehow have to kick the guest
before migration to do the ballooning - and how long would you wait
for it to finish?
Dave
>
> Roman.
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
> On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > > I wonder if it would be possible to avoid the kernel changes by
> > > parsing /proc/self/pagemap - if that can be used to detect
> > > unmapped/zero mapped pages in the guest ram, would it achieve the
> same result?
> >
> > Only detect the unmapped/zero mapped pages is not enough. Consider
> the
> > situation like case 2, it can't achieve the same result.
>
> Your case 2 doesn't exist in the real world. If people could stop their main
> memory consumer in the guest prior to migration they wouldn't need live
> migration at all.
Case 2 is just a simplified scenario, not a real case.
As long as the guest's memory usage does not keep increasing, or the guest does not always
run out of memory, it is covered by case 2.
> I tend to think you can safely assume there's no free memory in the guest, so
> there's little point optimizing for it.
If this is true, we should not inflate the balloon either.
> OTOH it makes perfect sense optimizing for the unmapped memory that's
> made up, in particular, by the ballon, and consider inflating the balloon right
> before migration unless you already maintain it at the optimal size for other
> reasons (like e.g. a global resource manager optimizing the VM density).
>
Yes, I believe the current balloon works and it's simple. Have you taken the performance impact into consideration?
For an 8G guest, it takes about 5s to inflate the balloon, but it only takes 20ms to traverse the free_list and
construct the free pages bitmap. During this period, the guest is very busy.
Even with the balloon inflated, all the guest's pages still have to be processed (zero page checking).
The only advantage of 'inflating the balloon before live migration' is simplicity, nothing more.
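To give an idea of why the traversal is cheap, here is a sketch in C of turning a list of free blocks into a bitmap. It assumes each block covers 2^order contiguous pages, as in a buddy allocator, and that blocks do not overlap; the FreeBlock type and the function name are hypothetical and are not the guest kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical free block: 2^order contiguous pages starting at start_pfn. */
typedef struct {
    unsigned long start_pfn;
    unsigned int order;
} FreeBlock;

/*
 * Set one bit per free page. Blocks reaching past max_pfn are clipped.
 * `bitmap` must hold at least max_pfn bits. Returns the pages marked.
 */
static size_t build_free_bitmap(const FreeBlock *blocks, size_t nblocks,
                                unsigned long *bitmap, unsigned long max_pfn)
{
    size_t marked = 0;

    assert(blocks && bitmap);
    for (size_t i = 0; i < nblocks; i++) {
        unsigned long npages = 1UL << blocks[i].order;

        for (unsigned long p = 0; p < npages; p++) {
            unsigned long pfn = blocks[i].start_pfn + p;

            if (pfn >= max_pfn) {
                break;
            }
            bitmap[pfn / BITS_PER_WORD] |= 1UL << (pfn % BITS_PER_WORD);
            marked++;
        }
    }
    return marked;
}
```

The work is proportional to the number of free pages and touches no page contents, whereas inflating the balloon has to allocate every page and report it to the host.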
Liang
> Roman.
> * Roman Kagan ([email protected]) wrote:
> > On Fri, Mar 04, 2016 at 08:23:09AM +0000, Li, Liang Z wrote:
> > > > On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> > > > > * Liang Li ([email protected]) wrote:
> > > > > > The current QEMU live migration implementation mark the all
> > > > > > the guest's RAM pages as dirtied in the ram bulk stage, all
> > > > > > these pages will be processed and that takes quit a lot of CPU cycles.
> > > > > >
> > > > > > From guest's point of view, it doesn't care about the content
> > > > > > in free pages. We can make use of this fact and skip
> > > > > > processing the free pages in the ram bulk stage, it can save a
> > > > > > lot CPU cycles and reduce the network traffic significantly
> > > > > > while speed up the live migration process obviously.
> > > > > >
> > > > > > This patch set is the QEMU side implementation.
> > > > > >
> > > > > > The virtio-balloon is extended so that QEMU can get the free
> > > > > > pages information from the guest through virtio.
> > > > > >
> > > > > > After getting the free pages information (a bitmap), QEMU can
> > > > > > use it to filter out the guest's free pages in the ram bulk
> > > > > > stage. This make the live migration process much more efficient.
> > > > >
> > > > > Hi,
> > > > > An interesting solution; I know a few different people have
> > > > > been looking at how to speed up ballooned VM migration.
> > > > >
> > > > > I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > > same result?
> > > >
> > > > Yes I was about to suggest the same thing: it's simple and makes
> > > > use of the existing infrastructure. And you wouldn't need to care
> > > > if the pages were unmapped by ballooning or anything else
> > > > (alternative balloon implementations, not yet touched by the
> > > > guest, etc.). Besides, you wouldn't need to synchronize with the guest.
> > > >
> > > > Roman.
> > >
> > > The unmapped/zero mapped pages can be detected by parsing
> > > /proc/self/pagemap, but the free pages can't be detected by this.
> > > Imaging an application allocates a large amount of memory , after
> > > using, it frees the memory, then live migration happens. All these free
> pages will be process and sent to the destination, it's not optimal.
> >
> > First, the likelihood of such a situation is marginal, there's no
> > point optimizing for it specifically.
> >
> > And second, even if that happens, you inflate the balloon right before
> > the migration and the free memory will get umapped very quickly, so
> > this case is covered nicely by the same technique that works for more
> > realistic cases, too.
>
> Although I wonder which is cheaper; that would be fairly expensive for the
> guest wouldn't it? And you'd somehow have to kick the guest before
> migration to do the ballooning - and how long would you wait for it to finish?
About 5 seconds for an 8G guest ballooned down to 1G. Getting the free pages bitmap takes about 20ms
for an 8G idle guest.
Liang
>
> Dave
>
> >
> > Roman.
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Fri, Mar 04, 2016 at 09:08:20AM +0000, Dr. David Alan Gilbert wrote:
> * Roman Kagan ([email protected]) wrote:
> > On Fri, Mar 04, 2016 at 08:23:09AM +0000, Li, Liang Z wrote:
> > > The unmapped/zero mapped pages can be detected by parsing /proc/self/pagemap,
> > > but the free pages can't be detected by this. Imaging an application allocates a large amount
> > > of memory , after using, it frees the memory, then live migration happens. All these free pages
> > > will be process and sent to the destination, it's not optimal.
> >
> > First, the likelihood of such a situation is marginal, there's no point
> > optimizing for it specifically.
> >
> > And second, even if that happens, you inflate the balloon right before
> > the migration and the free memory will get umapped very quickly, so this
> > case is covered nicely by the same technique that works for more
> > realistic cases, too.
>
> Although I wonder which is cheaper; that would be fairly expensive for
> the guest wouldn't it?
For the guest -- generally it wouldn't if you have a good estimate of
available memory (i.e. the amount you can balloon out without forcing
the guest to swap).
And yes you need certain cost estimates for choosing the best migration
strategy: e.g. if your network bandwidth is unlimited you may be better
off transferring the zeros to the destination rather than optimizing
them away.
> And you'd somehow have to kick the guest
> before migration to do the ballooning - and how long would you wait
> for it to finish?
It's a matter for fine-tuning with all the inputs at hand, like network
bandwidth, costs of delaying the migration, etc. And you don't need to
wait for it to finish, i.e. reach the balloon size target: you can start
the migration as soon as it's good enough (for whatever definition of
"enough" is found appropriate by that fine-tuning).
Roman.
On Fri, Mar 04, 2016 at 09:12:12AM +0000, Li, Liang Z wrote:
> > Although I wonder which is cheaper; that would be fairly expensive for the
> > guest wouldn't it? And you'd somehow have to kick the guest before
> > migration to do the ballooning - and how long would you wait for it to finish?
>
> About 5 seconds for an 8G guest, balloon to 1G. Get the free pages bitmap take about 20ms
> for an 8G idle guest.
>
> Liang
Where is the time spent though? Allocating within the guest?
Or passing the info to the host?
If the former, we can use the existing inflate/deflate vqs:
have the guest put each free page on the inflate vq, then on the deflate vq.
--
MST
> On Fri, Mar 04, 2016 at 09:12:12AM +0000, Li, Liang Z wrote:
> > > Although I wonder which is cheaper; that would be fairly expensive
> > > for the guest wouldn't it? And you'd somehow have to kick the guest
> > > before migration to do the ballooning - and how long would you wait for
> it to finish?
> >
> > About 5 seconds for an 8G guest, balloon to 1G. Get the free pages
> > bitmap take about 20ms for an 8G idle guest.
> >
> > Liang
>
> Where is the time spent though? allocating within guest?
> Or passing the info to host?
> If the former, we can use existing inflate/deflate vqs:
> Have guest put each free page on inflate vq, then on deflate vq.
>
Maybe I was not clear enough.
I mean that if we inflate the balloon before live migration, for an 8GB guest it takes about 5 seconds for the inflate operation to finish.
With the PV solution there is no need to inflate the balloon before live migration. The only cost is traversing the free_list to
construct the free pages bitmap, which takes about 20ms for an 8GB idle guest (less if there are fewer free pages).
Passing the free pages info to the host takes about 3ms extra.
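For a rough sense of how much data the bitmap carries, here is a back-of-the-envelope sketch. It assumes one bit per guest page and a 4 KiB page size, neither of which is stated above; free_bitmap_bytes() is a hypothetical helper, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size in bytes of a one-bit-per-page bitmap
 * covering ram_bytes of guest RAM. The page size is an assumption
 * supplied by the caller.
 */
uint64_t free_bitmap_bytes(uint64_t ram_bytes, uint64_t page_size)
{
    uint64_t pages;

    assert(page_size > 0);
    pages = (ram_bytes + page_size - 1) / page_size;  /* round up to whole pages */
    return (pages + 7) / 8;                           /* round up to whole bytes */
}
```

Under these assumptions, free_bitmap_bytes(8ULL << 30, 4096) gives 262144 bytes, i.e. 256 KiB for an 8 GiB guest.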
Liang
> --
> MST
On Fri, Mar 04, 2016 at 09:08:44AM +0000, Li, Liang Z wrote:
> > On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > > > I wonder if it would be possible to avoid the kernel changes by
> > > > parsing /proc/self/pagemap - if that can be used to detect
> > > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > same result?
> > >
> > > Only detect the unmapped/zero mapped pages is not enough. Consider
> > the
> > > situation like case 2, it can't achieve the same result.
> >
> > Your case 2 doesn't exist in the real world. If people could stop their main
> > memory consumer in the guest prior to migration they wouldn't need live
> > migration at all.
>
> The case 2 is just a simplified scenario, not a real case.
> As long as the guest's memory usage does not keep increasing, or not always run out,
> it can be covered by the case 2.
The memory usage will keep increasing due to ever growing caches, etc,
so you'll be left with very little free memory fairly soon.
> > I tend to think you can safely assume there's no free memory in the guest, so
> > there's little point optimizing for it.
>
> If this is true, we should not inflate the balloon either.
We certainly should if there's "available" memory, i.e. not free but
cheap to reclaim.
> > OTOH it makes perfect sense optimizing for the unmapped memory that's
> > made up, in particular, by the ballon, and consider inflating the balloon right
> > before migration unless you already maintain it at the optimal size for other
> > reasons (like e.g. a global resource manager optimizing the VM density).
> >
>
> Yes, I believe the current balloon works and it's simple. Do you take the performance impact for consideration?
> For and 8G guest, it takes about 5s to inflating the balloon. But it only takes 20ms to traverse the free_list and
> construct the free pages bitmap.
I don't have any feeling of how important the difference is. And if the
limiting factor for balloon inflation speed is the granularity of
communication it may be worth optimizing that, because quick balloon
reaction may be important in certain resource management scenarios.
> By inflating the balloon, all the guest's pages are still be processed (zero page checking).
Not sure what you mean. If you describe the current state of affairs
that's exactly the suggested optimization point: skip unmapped pages.
> The only advantage of ' inflating the balloon before live migration' is simple, nothing more.
That's a big advantage. Another one is that it does something useful in
real-world scenarios.
Roman.
On Fri, Mar 04, 2016 at 10:11:00AM +0000, Li, Liang Z wrote:
> > On Fri, Mar 04, 2016 at 09:12:12AM +0000, Li, Liang Z wrote:
> > > > Although I wonder which is cheaper; that would be fairly expensive
> > > > for the guest wouldn't it? And you'd somehow have to kick the guest
> > > > before migration to do the ballooning - and how long would you wait for
> > it to finish?
> > >
> > > About 5 seconds for an 8G guest, balloon to 1G. Get the free pages
> > > bitmap take about 20ms for an 8G idle guest.
> > >
> > > Liang
> >
> > Where is the time spent though? allocating within guest?
> > Or passing the info to host?
> > If the former, we can use existing inflate/deflate vqs:
> > Have guest put each free page on inflate vq, then on deflate vq.
> >
>
> Maybe I am not clear enough.
>
> I mean if we inflate balloon before live migration, for a 8GB guest, it takes about 5 Seconds for the inflating operation to finish.
And these 5 seconds are spent where?
> For the PV solution, there is no need to inflate balloon before live migration, the only cost is to traversing the free_list to
> construct the free pages bitmap, and it takes about 20ms for a 8GB idle guest( less if there is less free pages),
> passing the free pages info to host will take about extra 3ms.
>
>
> Liang
So now let's please stop talking about solutions at a high level and
discuss the interface changes you make in detail.
What makes it faster? A better host/guest interface? No need to go through
the buddy allocator within the guest? Fewer interrupts? Something else?
> > --
> > MST
> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
>
> On Fri, Mar 04, 2016 at 09:08:44AM +0000, Li, Liang Z wrote:
> > > On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > > > > I wonder if it would be possible to avoid the kernel changes
> > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > the
> > > same result?
> > > >
> > > > Only detect the unmapped/zero mapped pages is not enough.
> Consider
> > > the
> > > > situation like case 2, it can't achieve the same result.
> > >
> > > Your case 2 doesn't exist in the real world. If people could stop
> > > their main memory consumer in the guest prior to migration they
> > > wouldn't need live migration at all.
> >
> > The case 2 is just a simplified scenario, not a real case.
> > As long as the guest's memory usage does not keep increasing, or not
> > always run out, it can be covered by the case 2.
>
> The memory usage will keep increasing due to ever growing caches, etc, so
> you'll be left with very little free memory fairly soon.
>
I don't think so.
> > > I tend to think you can safely assume there's no free memory in the
> > > guest, so there's little point optimizing for it.
> >
> > If this is true, we should not inflate the balloon either.
>
> We certainly should if there's "available" memory, i.e. not free but cheap to
> reclaim.
>
What do you mean by "available" memory? If the pages are not free, I don't think reclaiming them is cheap.
> > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > that's made up, in particular, by the ballon, and consider inflating
> > > the balloon right before migration unless you already maintain it at
> > > the optimal size for other reasons (like e.g. a global resource manager
> optimizing the VM density).
> > >
> >
> > Yes, I believe the current balloon works and it's simple. Do you take the
> performance impact for consideration?
> > For and 8G guest, it takes about 5s to inflating the balloon. But it
> > only takes 20ms to traverse the free_list and construct the free pages
> bitmap.
>
> I don't have any feeling of how important the difference is. And if the
> limiting factor for balloon inflation speed is the granularity of communication
> it may be worth optimizing that, because quick balloon reaction may be
> important in certain resource management scenarios.
>
> > By inflating the balloon, all the guest's pages are still be processed (zero
> page checking).
>
> Not sure what you mean. If you describe the current state of affairs that's
> exactly the suggested optimization point: skip unmapped pages.
>
You'd better check the live migration code.
> > The only advantage of ' inflating the balloon before live migration' is simple,
> nothing more.
>
> That's a big advantage. Another one is that it does something useful in real-
> world scenarios.
>
I don't think a heavy performance impact is something useful in real-world scenarios.
Liang
> Roman.
On Fri, Mar 04, 2016 at 02:26:49PM +0000, Li, Liang Z wrote:
> > Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> > optimization
> >
> > On Fri, Mar 04, 2016 at 09:08:44AM +0000, Li, Liang Z wrote:
> > > > On Fri, Mar 04, 2016 at 01:52:53AM +0000, Li, Liang Z wrote:
> > > > > > I wonder if it would be possible to avoid the kernel changes
> > > > > > by parsing /proc/self/pagemap - if that can be used to detect
> > > > > > unmapped/zero mapped pages in the guest ram, would it achieve
> > > > > > the
> > > > same result?
> > > > >
> > > > > Only detect the unmapped/zero mapped pages is not enough.
> > Consider
> > > > the
> > > > > situation like case 2, it can't achieve the same result.
> > > >
> > > > Your case 2 doesn't exist in the real world. If people could stop
> > > > their main memory consumer in the guest prior to migration they
> > > > wouldn't need live migration at all.
> > >
> > > The case 2 is just a simplified scenario, not a real case.
> > > As long as the guest's memory usage does not keep increasing, or not
> > > always run out, it can be covered by the case 2.
> >
> > The memory usage will keep increasing due to ever growing caches, etc, so
> > you'll be left with very little free memory fairly soon.
> >
>
> I don't think so.
Here's my laptop:
KiB Mem : 16048560 total, 8574956 free, 3360532 used, 4113072 buff/cache
But here's a server:
KiB Mem: 32892768 total, 20092812 used, 12799956 free, 368704 buffers
What is the difference? A ton of tiny daemons not doing anything,
staying resident in memory.
> > > > I tend to think you can safely assume there's no free memory in the
> > > > guest, so there's little point optimizing for it.
> > >
> > > If this is true, we should not inflate the balloon either.
> >
> > We certainly should if there's "available" memory, i.e. not free but cheap to
> > reclaim.
> >
>
> What's your mean by "available" memory? if they are not free, I don't think it's cheap.
Clean pages are cheap to drop as they don't have to be written.
Whether they will ever be used is another matter.
> > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > that's made up, in particular, by the ballon, and consider inflating
> > > > the balloon right before migration unless you already maintain it at
> > > > the optimal size for other reasons (like e.g. a global resource manager
> > optimizing the VM density).
> > > >
> > >
> > > Yes, I believe the current balloon works and it's simple. Do you take the
> > performance impact for consideration?
> > > For and 8G guest, it takes about 5s to inflating the balloon. But it
> > > only takes 20ms to traverse the free_list and construct the free pages
> > bitmap.
> >
> > I don't have any feeling of how important the difference is. And if the
> > limiting factor for balloon inflation speed is the granularity of communication
> > it may be worth optimizing that, because quick balloon reaction may be
> > important in certain resource management scenarios.
> >
> > > By inflating the balloon, all the guest's pages are still be processed (zero
> > page checking).
> >
> > Not sure what you mean. If you describe the current state of affairs that's
> > exactly the suggested optimization point: skip unmapped pages.
> >
>
> You'd better check the live migration code.
What's there to check in migration code?
Here's the extent of what balloon does on output:
    while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
        ram_addr_t pa;
        ram_addr_t addr;
        int p = virtio_ldl_p(vdev, &pfn);

        pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
        offset += 4;

        /* FIXME: remove get_system_memory(), but how? */
        section = memory_region_find(get_system_memory(), pa, 1);
        if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
            continue;

        trace_virtio_balloon_handle_output(memory_region_name(section.mr),
                                           pa);
        /* Using memory_region_get_ram_ptr is bending the rules a bit, but
           should be OK because we only want a single page. */
        addr = section.offset_within_region;
        balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
                     !!(vq == s->dvq));
        memory_region_unref(section.mr);
    }
so all that happens when we get a page is balloon_page.
and
static void balloon_page(void *addr, int deflate)
{
#if defined(__linux__)
    if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
                                         kvm_has_sync_mmu())) {
        qemu_madvise(addr, TARGET_PAGE_SIZE,
                     deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
    }
#endif
}
Do you see anything that tracks pages to help migration skip
the ballooned memory? I don't.
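To make the effect of that madvise call concrete, here is a small stand-alone sketch for Linux. It is not QEMU code: discard_and_read_back() is a hypothetical helper that maps private anonymous memory, dirties it, drops it with MADV_DONTNEED as the inflate path does, and reads it back. On Linux the read returns zero-filled memory, so anything that still scans the range sees zero pages rather than "no page".

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from QEMU): map len bytes of private
 * anonymous memory, dirty it, discard it with MADV_DONTNEED, then
 * return the first byte read back, or -1 on error.
 */
int discard_and_read_back(size_t len)
{
    unsigned char *mem;
    int value;

    assert(len > 0);
    mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    memset(mem, 0xAB, len);              /* the "guest" uses the memory */

    if (madvise(mem, len, MADV_DONTNEED) != 0) {
        munmap(mem, len);
        return -1;
    }

    value = mem[0];                      /* touching it again faults in a zero page */
    munmap(mem, len);
    return value;
}
```

Calling discard_and_read_back((size_t)sysconf(_SC_PAGESIZE)) exercises a single page, which matches the per-page granularity of balloon_page().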
> > > The only advantage of ' inflating the balloon before live migration' is simple,
> > nothing more.
> >
> > That's a big advantage. Another one is that it does something useful in real-
> > world scenarios.
> >
>
> I don't think the heave performance impaction is something useful in real world scenarios.
>
> Liang
> > Roman.
So fix the performance then. You will have to try harder if you want to
convince people that the performance is due to bad host/guest interface,
and so we have to change *that*.
--
MST
> > Maybe I am not clear enough.
> >
> > I mean if we inflate balloon before live migration, for a 8GB guest, it takes
> about 5 Seconds for the inflating operation to finish.
>
> And these 5 seconds are spent where?
>
The time is spent allocating the pages and sending the PFNs of the allocated pages to QEMU
through virtio.
> > For the PV solution, there is no need to inflate balloon before live
> > migration, the only cost is to traversing the free_list to construct
> > the free pages bitmap, and it takes about 20ms for a 8GB idle guest( less if
> there is less free pages), passing the free pages info to host will take about
> extra 3ms.
> >
> >
> > Liang
>
> So now let's please stop talking about solutions at a high level and discuss the
> interface changes you make in detail.
> What makes it faster? Better host/guest interface? No need to go through
> buddy allocator within guest? Less interrupts? Something else?
>
I assume you are familiar with the current virtio-balloon and how it works.
The new interface is very simple: QEMU sends a request to the virtio-balloon driver,
the driver traverses 'zone->free_area[order].free_list[t]' to
construct a 'free_page_bitmap', and then the driver sends the content
of 'free_page_bitmap' back to QEMU. That is all the new interface does, and
no 'alloc_page' calls are involved, so it's faster.
A code snippet:
----------------------------------------------
+static void mark_free_pages_bitmap(struct zone *zone,
+		unsigned long *free_page_bitmap, unsigned long pfn_gap)
+{
+	unsigned long pfn, flags, i;
+	unsigned int order, t;
+	struct list_head *curr;
+
+	if (zone_is_empty(zone))
+		return;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	for_each_migratetype_order(order, t) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
+
+			pfn = page_to_pfn(list_entry(curr, struct page, lru));
+			for (i = 0; i < (1UL << order); i++) {
+				if ((pfn + i) >= PFN_4G)
+					set_bit_le(pfn + i - pfn_gap,
+						   free_page_bitmap);
+				else
+					set_bit_le(pfn + i, free_page_bitmap);
+			}
+		}
+	}
+
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
----------------------------------------------------
Sorry for my poor English and expression. If it is still unclear,
you could glance at the patch; it is about 400 lines in total.
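For readers who want to try the marking logic outside the kernel, here is a user-space sketch of the same idea. It is an illustration under assumptions, not the patch: struct free_block stands in for an entry on a buddy free list, the bitmap helpers use native word order rather than set_bit_le(), and pfn_4g is passed as a parameter in place of the PFN_4G constant.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for one free-list entry: 2^order pages at pfn. */
struct free_block {
    unsigned long pfn;
    unsigned int order;
};

static void bitmap_set(unsigned long nr, unsigned long *bitmap)
{
    bitmap[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

int bitmap_test(unsigned long nr, const unsigned long *bitmap)
{
    return (int)((bitmap[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}

/*
 * Mark every page of every free block. PFNs at or above pfn_4g are
 * shifted down by pfn_gap, mirroring the snippet above. The caller
 * must provide a bitmap large enough for the highest resulting bit.
 */
void mark_free_blocks(const struct free_block *blocks, size_t nr_blocks,
                      unsigned long *bitmap, unsigned long pfn_4g,
                      unsigned long pfn_gap)
{
    size_t k;
    unsigned long i, pfn;

    assert(bitmap != NULL);
    for (k = 0; k < nr_blocks; k++) {
        for (i = 0; i < (1UL << blocks[k].order); i++) {
            pfn = blocks[k].pfn + i;
            if (pfn >= pfn_4g)
                pfn -= pfn_gap;          /* skip over the hole below 4G */
            bitmap_set(pfn, bitmap);
        }
    }
}
```

The work is proportional to the number of free pages and involves no allocation, which is consistent with the point that fewer free pages means a shorter traversal.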
>
> > > --
> > > MST
> > > > > > Only detect the unmapped/zero mapped pages is not enough.
> > > Consider
> > > > > the
> > > > > > situation like case 2, it can't achieve the same result.
> > > > >
> > > > > Your case 2 doesn't exist in the real world. If people could
> > > > > stop their main memory consumer in the guest prior to migration
> > > > > they wouldn't need live migration at all.
> > > >
> > > > The case 2 is just a simplified scenario, not a real case.
> > > > As long as the guest's memory usage does not keep increasing, or
> > > > not always run out, it can be covered by the case 2.
> > >
> > > The memory usage will keep increasing due to ever growing caches,
> > > etc, so you'll be left with very little free memory fairly soon.
> > >
> >
> > I don't think so.
>
> Here's my laptop:
> KiB Mem : 16048560 total, 8574956 free, 3360532 used, 4113072 buff/cache
>
> But here's a server:
> KiB Mem: 32892768 total, 20092812 used, 12799956 free, 368704 buffers
>
> What is the difference? A ton of tiny daemons not doing anything, staying
> resident in memory.
>
> > > > > I tend to think you can safely assume there's no free memory in
> > > > > the guest, so there's little point optimizing for it.
> > > >
> > > > If this is true, we should not inflate the balloon either.
> > >
> > > We certainly should if there's "available" memory, i.e. not free but
> > > cheap to reclaim.
> > >
> >
> > What's your mean by "available" memory? if they are not free, I don't think
> it's cheap.
>
> clean pages are cheap to drop as they don't have to be written.
> whether they will be ever be used is another matter.
>
> > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > that's made up, in particular, by the ballon, and consider
> > > > > inflating the balloon right before migration unless you already
> > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > global resource manager
> > > optimizing the VM density).
> > > > >
> > > >
> > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > take the
> > > performance impact for consideration?
> > > > For and 8G guest, it takes about 5s to inflating the balloon. But
> > > > it only takes 20ms to traverse the free_list and construct the
> > > > free pages
> > > bitmap.
> > >
> > > I don't have any feeling of how important the difference is. And if
> > > the limiting factor for balloon inflation speed is the granularity
> > > of communication it may be worth optimizing that, because quick
> > > balloon reaction may be important in certain resource management
> scenarios.
> > >
> > > > By inflating the balloon, all the guest's pages are still be
> > > > processed (zero
> > > page checking).
> > >
> > > Not sure what you mean. If you describe the current state of
> > > affairs that's exactly the suggested optimization point: skip unmapped
> pages.
> > >
> >
> > You'd better check the live migration code.
>
> What's there to check in migration code?
> Here's the extent of what balloon does on output:
>
>
> while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4)
> {
> ram_addr_t pa;
> ram_addr_t addr;
> int p = virtio_ldl_p(vdev, &pfn);
>
> pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
> offset += 4;
>
> /* FIXME: remove get_system_memory(), but how? */
> section = memory_region_find(get_system_memory(), pa, 1);
> if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
> continue;
>
>
> trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> pa);
> /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> should be OK because we only want a single page. */
> addr = section.offset_within_region;
> balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> !!(vq == s->dvq));
> memory_region_unref(section.mr);
> }
>
> so all that happens when we get a page is balloon_page.
> and
>
> static void balloon_page(void *addr, int deflate) { #if defined(__linux__)
> if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> kvm_has_sync_mmu())) {
> qemu_madvise(addr, TARGET_PAGE_SIZE,
> deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> }
> #endif
> }
>
>
> Do you see anything that tracks pages to help migration skip the ballooned
> memory? I don't.
>
No, and that's exactly what I mean. The ballooned memory is still processed during
live migration; it is not skipped. The live migration code is in migration/ram.c.
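As an illustration of what "processed" costs, here is a simplified sketch of a per-page zero check. It is not the code in migration/ram.c: page_is_zero() and count_nonzero_pages() are hypothetical names, and the 4 KiB page size is an assumption. The point is only that deciding "this page is zero" requires reading the page, whereas a free-page bitmap lets the sender skip the page without touching it.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Return 1 if every byte of the page is zero, 0 otherwise. */
int page_is_zero(const unsigned char *page)
{
    size_t i;

    assert(page != NULL);
    for (i = 0; i < SKETCH_PAGE_SIZE; i++) {
        if (page[i] != 0)
            return 0;
    }
    return 1;
}

/* Count the pages that would need their full content sent. */
size_t count_nonzero_pages(const unsigned char *ram, size_t nr_pages)
{
    size_t n, count = 0;

    for (n = 0; n < nr_pages; n++) {
        if (!page_is_zero(ram + n * SKETCH_PAGE_SIZE))
            count++;
    }
    return count;
}
```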
>
> > > > The only advantage of ' inflating the balloon before live
> > > > migration' is simple,
> > > nothing more.
> > >
> > > That's a big advantage. Another one is that it does something
> > > useful in real- world scenarios.
> > >
> >
> > I don't think the heave performance impaction is something useful in real
> world scenarios.
> >
> > Liang
> > > Roman.
>
> So fix the performance then. You will have to try harder if you want to
> convince people that the performance is due to bad host/guest interface,
> and so we have to change *that*.
>
Actually, the PV solution is independent of the balloon mechanism. I just use the balloon device
to transfer information between host and guest.
I am not sure whether I should implement a new virtio device instead, and I would like the community's
opinion on that.
In this RFC patch, to keep things simple, I chose to extend virtio-balloon and use the
extended interface to transfer the request and the free_page_bitmap content.
I do not intend to change the current virtio-balloon implementation.
Liang
> --
> MST
On 04/03/2016 15:26, Li, Liang Z wrote:
>> >
>> > The memory usage will keep increasing due to ever growing caches, etc, so
>> > you'll be left with very little free memory fairly soon.
>> >
> I don't think so.
>
Roman is right. For example, here I am looking at a 64 GB (physical)
machine which was booted about 30 minutes ago, and which is running
disk-heavy workloads (installing VMs).
Since I have started writing this email (2 minutes?), the amount of free
memory has already gone down from 37 GB to 33 GB. I expect that by the
time I have finished running the workload, in two hours, it will not
have any free memory.
Paolo
* Paolo Bonzini ([email protected]) wrote:
>
>
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> >
> >> > The memory usage will keep increasing due to ever growing caches, etc, so
> >> > you'll be left with very little free memory fairly soon.
> >> >
> > I don't think so.
> >
>
> Roman is right. For example, here I am looking at a 64 GB (physical)
> machine which was booted about 30 minutes ago, and which is running
> disk-heavy workloads (installing VMs).
>
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB. I expect that by the
> time I have finished running the workload, in two hours, it will not
> have any free memory.
But what about a VM sitting idle, or one that just has more RAM assigned to it
than it is currently using?
I've got a host here that's been up for 46 days and has been doing some
heavy VM debugging a few days ago, but today:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          96536        1146       44834         184       50555       94735
I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
it's got a big chunk of cache as well.
Dave
>
> Paolo
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Fri, Mar 04, 2016 at 03:49:37PM +0000, Li, Liang Z wrote:
> > > > > > > Only detect the unmapped/zero mapped pages is not enough.
> > > > Consider
> > > > > > the
> > > > > > > situation like case 2, it can't achieve the same result.
> > > > > >
> > > > > > Your case 2 doesn't exist in the real world. If people could
> > > > > > stop their main memory consumer in the guest prior to migration
> > > > > > they wouldn't need live migration at all.
> > > > >
> > > > > The case 2 is just a simplified scenario, not a real case.
> > > > > As long as the guest's memory usage does not keep increasing, or
> > > > > not always run out, it can be covered by the case 2.
> > > >
> > > > The memory usage will keep increasing due to ever growing caches,
> > > > etc, so you'll be left with very little free memory fairly soon.
> > > >
> > >
> > > I don't think so.
> >
> > Here's my laptop:
> > KiB Mem : 16048560 total, 8574956 free, 3360532 used, 4113072 buff/cache
> >
> > But here's a server:
> > KiB Mem: 32892768 total, 20092812 used, 12799956 free, 368704 buffers
> >
> > What is the difference? A ton of tiny daemons not doing anything, staying
> > resident in memory.
> >
> > > > > > I tend to think you can safely assume there's no free memory in
> > > > > > the guest, so there's little point optimizing for it.
> > > > >
> > > > > If this is true, we should not inflate the balloon either.
> > > >
> > > > We certainly should if there's "available" memory, i.e. not free but
> > > > cheap to reclaim.
> > > >
> > >
> > > What's your mean by "available" memory? if they are not free, I don't think
> > it's cheap.
> >
> > clean pages are cheap to drop as they don't have to be written.
> > whether they will be ever be used is another matter.
> >
> > > > > > OTOH it makes perfect sense optimizing for the unmapped memory
> > > > > > that's made up, in particular, by the ballon, and consider
> > > > > > inflating the balloon right before migration unless you already
> > > > > > maintain it at the optimal size for other reasons (like e.g. a
> > > > > > global resource manager
> > > > optimizing the VM density).
> > > > > >
> > > > >
> > > > > Yes, I believe the current balloon works and it's simple. Do you
> > > > > take the
> > > > performance impact for consideration?
> > > > > For and 8G guest, it takes about 5s to inflating the balloon. But
> > > > > it only takes 20ms to traverse the free_list and construct the
> > > > > free pages
> > > > bitmap.
> > > >
> > > > I don't have any feeling of how important the difference is. And if
> > > > the limiting factor for balloon inflation speed is the granularity
> > > > of communication it may be worth optimizing that, because quick
> > > > balloon reaction may be important in certain resource management
> > scenarios.
> > > >
> > > > > By inflating the balloon, all the guest's pages are still be
> > > > > processed (zero
> > > > page checking).
> > > >
> > > > Not sure what you mean. If you describe the current state of
> > > > affairs that's exactly the suggested optimization point: skip unmapped
> > pages.
> > > >
> > >
> > > You'd better check the live migration code.
> >
> > What's there to check in migration code?
> > Here's the extent of what balloon does on output:
> >
> >
> > while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
> >     ram_addr_t pa;
> >     ram_addr_t addr;
> >     int p = virtio_ldl_p(vdev, &pfn);
> >
> >     pa = (ram_addr_t) p << VIRTIO_BALLOON_PFN_SHIFT;
> >     offset += 4;
> >
> >     /* FIXME: remove get_system_memory(), but how? */
> >     section = memory_region_find(get_system_memory(), pa, 1);
> >     if (!int128_nz(section.size) || !memory_region_is_ram(section.mr))
> >         continue;
> >
> >     trace_virtio_balloon_handle_output(memory_region_name(section.mr),
> >                                        pa);
> >     /* Using memory_region_get_ram_ptr is bending the rules a bit, but
> >        should be OK because we only want a single page.  */
> >     addr = section.offset_within_region;
> >     balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
> >                  !!(vq == s->dvq));
> >     memory_region_unref(section.mr);
> > }
> >
> > so all that happens when we get a page is balloon_page.
> > and
> >
> > static void balloon_page(void *addr, int deflate)
> > {
> > #if defined(__linux__)
> >     if (!qemu_balloon_is_inhibited() && (!kvm_enabled() ||
> >                                          kvm_has_sync_mmu())) {
> >         qemu_madvise(addr, TARGET_PAGE_SIZE,
> >                 deflate ? QEMU_MADV_WILLNEED : QEMU_MADV_DONTNEED);
> >     }
> > #endif
> > }
> >
> >
> > Do you see anything that tracks pages to help migration skip the ballooned
> > memory? I don't.
> >
>
> No. And it's exactly what I mean. The ballooned memory is still processed during
> live migration without skipping. The live migration code is in migration/ram.c.
So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST,
we can teach qemu to skip these pages.
Want to write a patch to do this?
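
To make the suggestion above concrete, here is a minimal sketch of the
bookkeeping it would need, assuming the guest has acknowledged
VIRTIO_BALLOON_F_MUST_TELL_HOST so that the host sees every inflate and
deflate before the guest reuses a page. The struct balloon_bitmap type and
the balloon_bitmap_* helpers are hypothetical names for illustration, not
existing QEMU code; in QEMU the update would be driven from the balloon
output handler and the query from the page-sending loop in migration/ram.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical tracker: one bit per guest page, set while the page
 * sits in the balloon. */
struct balloon_bitmap {
    unsigned long *bits;
    uint64_t nr_pages;
};

/* Allocate a zeroed bitmap covering nr_pages guest pages. */
bool balloon_bitmap_init(struct balloon_bitmap *bm, uint64_t nr_pages)
{
    size_t words = (nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;

    bm->bits = calloc(words ? words : 1, sizeof(unsigned long));
    bm->nr_pages = nr_pages;
    return bm->bits != NULL;
}

void balloon_bitmap_free(struct balloon_bitmap *bm)
{
    free(bm->bits);
    bm->bits = NULL;
    bm->nr_pages = 0;
}

/* Balloon side: inflate sets the bit, deflate clears it.
 * Out-of-range PFNs are ignored. */
void balloon_bitmap_update(struct balloon_bitmap *bm, uint64_t pfn,
                           bool deflate)
{
    unsigned long mask;

    if (pfn >= bm->nr_pages)
        return;
    mask = 1UL << (pfn % BITS_PER_WORD);
    if (deflate)
        bm->bits[pfn / BITS_PER_WORD] &= ~mask;
    else
        bm->bits[pfn / BITS_PER_WORD] |= mask;
}

/* Migration side: a page may be skipped only while it is ballooned. */
bool balloon_bitmap_can_skip(const struct balloon_bitmap *bm, uint64_t pfn)
{
    if (pfn >= bm->nr_pages)
        return false;
    return (bm->bits[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL;
}
```

A real patch would also have to deal with locking between the balloon
handler and the migration thread, and with pages deflated while migration
is in progress; the sketch leaves both out.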
> >
> > > > > The only advantage of ' inflating the balloon before live
> > > > > migration' is simple,
> > > > nothing more.
> > > >
> > > > That's a big advantage. Another one is that it does something
> > > > useful in real- world scenarios.
> > > >
> > >
> > > I don't think the heave performance impaction is something useful in real
> > world scenarios.
> > >
> > > Liang
> > > > Roman.
> >
> > So fix the performance then. You will have to try harder if you want to
> > convince people that the performance is due to bad host/guest interface,
> > and so we have to change *that*.
> >
>
> Actually, the PV solution is irrelevant with the balloon mechanism, I just use it
> to transfer information between host and guest.
> I am not sure if I should implement a new virtio device, and I want to get the answer from
> the community.
> In this RFC patch, to make things simple, I choose to extend the virtio-balloon and use the
> extended interface to transfer the request and free_page_bimap content.
>
> I am not intend to change the current virtio-balloon implementation.
>
> Liang
And the answer would depend on the answer to my question above.
Does balloon need an interface passing page bitmaps around?
Does this speed up any operations?
OTOH what if you use the regular balloon interface with your patches?
> > --
> > MST
> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> >
> > >> > The memory usage will keep increasing due to ever growing caches,
> > >> > etc, so you'll be left with very little free memory fairly soon.
> > >> >
> > > I don't think so.
> > >
> >
> > Roman is right. For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> >
> > Since I have started writing this email (2 minutes?), the amount of
> > free memory has already gone down from 37 GB to 33 GB. I expect that
> > by the time I have finished running the workload, in two hours, it
> > will not have any free memory.
>
> But what about a VM sitting idle, or that just has more RAM assigned to it
> than is currently using.
> I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
>
> # free -m
> total used free shared buff/cache available
> Mem: 96536 1146 44834 184 50555 94735
>
> I very rarely use all it's RAM, so it's got a big chunk of free RAM, and yes it's
> got a big chunk of cache as well.
>
> Dave
>
> >
> > Paolo
I am beginning to understand Roman's point. The PV solution can't handle cache memory, while inflating the balloon can.
However, inflating the balloon in order to skip the cache memory is bad for the guest's performance.
How much free memory the guest has depends on the workload in the VM and on how long the VM has been running
before live migration. Even if memory usage keeps increasing due to ever-growing caches, we don't know
when the live migration will happen, so assuming there are no or very few free pages in the guest is not quite right.
The advantage of the PV solution is its smaller performance impact compared with inflating the balloon.
Liang
> > No. And it's exactly what I mean. The ballooned memory is still
> > processed during live migration without skipping. The live migration code is
> in migration/ram.c.
>
> So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> teach qemu to skip these pages.
> Want to write a patch to do this?
>
Yes, we really can teach QEMU to skip these pages, and it's not hard.
The problem is the poor performance; this PV solution aims to make it more
efficient and to reduce the performance impact on the guest.
> > >
> > > > > > The only advantage of ' inflating the balloon before live
> > > > > > migration' is simple,
> > > > > nothing more.
> > > > >
> > > > > That's a big advantage. Another one is that it does something
> > > > > useful in real- world scenarios.
> > > > >
> > > >
> > > > I don't think the heave performance impaction is something useful
> > > > in real
> > > world scenarios.
> > > >
> > > > Liang
> > > > > Roman.
> > >
> > > So fix the performance then. You will have to try harder if you want
> > > to convince people that the performance is due to bad host/guest
> > > interface, and so we have to change *that*.
> > >
> >
> > Actually, the PV solution is irrelevant with the balloon mechanism, I
> > just use it to transfer information between host and guest.
> > I am not sure if I should implement a new virtio device, and I want to
> > get the answer from the community.
> > In this RFC patch, to make things simple, I choose to extend the
> > virtio-balloon and use the extended interface to transfer the request and
> free_page_bimap content.
> >
> > I am not intend to change the current virtio-balloon implementation.
> >
> > Liang
>
> And the answer would depend on the answer to my question above.
> Does balloon need an interface passing page bitmaps around?
Yes, I need a new interface.
> Does this speed up any operations?
No, a new interface will not speed up anything, but it is the easiest way to solve the compatibility issue.
> OTOH what if you use the regular balloon interface with your patches?
>
The regular balloon interfaces each have a specific function, and I can't use them in my patches.
If I used these regular interfaces, I would have to make a lot of changes to keep compatibility.
>
> > > --
> > > MST
On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > No. And it's exactly what I mean. The ballooned memory is still
> > > processed during live migration without skipping. The live migration code is
> > in migration/ram.c.
> >
> > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > teach qemu to skip these pages.
> > Want to write a patch to do this?
> >
>
> Yes, we really can teach qemu to skip these pages and it's not hard.
> The problem is the poor performance, this PV solution
Balloon is always PV. And do not call patches solutions please.
> is aimed to make it more
> efficient and reduce the performance impact on guest.
We need to get a bit beyond this. You are making multiple
changes, it seems to make sense to split it all up, and analyse each
change separately. If you don't, this patchset will be stuck: as you
have seen people aren't convinced it actually helps with real workloads.
> > > >
> > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > migration' is simple,
> > > > > > nothing more.
> > > > > >
> > > > > > That's a big advantage. Another one is that it does something
> > > > > > useful in real- world scenarios.
> > > > > >
> > > > >
> > > > > I don't think the heave performance impaction is something useful
> > > > > in real
> > > > world scenarios.
> > > > >
> > > > > Liang
> > > > > > Roman.
> > > >
> > > > So fix the performance then. You will have to try harder if you want
> > > > to convince people that the performance is due to bad host/guest
> > > > interface, and so we have to change *that*.
> > > >
> > >
> > > Actually, the PV solution is irrelevant with the balloon mechanism, I
> > > just use it to transfer information between host and guest.
> > > I am not sure if I should implement a new virtio device, and I want to
> > > get the answer from the community.
> > > In this RFC patch, to make things simple, I choose to extend the
> > > virtio-balloon and use the extended interface to transfer the request and
> > free_page_bimap content.
> > >
> > > I am not intend to change the current virtio-balloon implementation.
> > >
> > > Liang
> >
> > And the answer would depend on the answer to my question above.
> > Does balloon need an interface passing page bitmaps around?
>
> Yes, I need a new interface.
Possibly, but you will need to justify this at some level if you care
about upstreaming your patches.
> > Does this speed up any operations?
>
> No, a new interface will not speed up anything, but it is the easiest way to solve the compatibility issue.
A bunch of new code is often easier to write than to figure
out the old one, but if we keep piling it up we'll end up
with an unmaintainable mess. So we are rather careful
about adding new interfaces, and we try to make them generic
sometimes even at the cost of slight inefficiencies.
> > OTOH what if you use the regular balloon interface with your patches?
> >
>
> The regular balloon interfaces have their specific function and I can't use them in my patches.
> If using these regular interface, I have to do a lot of changes to keep the compatibility.
Why can't you?
What exactly do we need to change?
If we put things in terms of the balloon, that supports
adding and removing pages.
Using these terms, let's enumerate:
- a new method (e.g. a new virtqueue) that adds and immediately removes a page in the balloon:
  clearly, you can add then remove using the existing interfaces;
  is a single command significantly faster than using the existing two vqs?
- a new kind of request that says "add (and immediately remove?) as many pages as you can":
  sounds rather benign
- a new kind of message that adds multiple pages using a bitmap
  (instead of an address list):
  again, is this significantly faster?
Does not look like compatibility is an issue, to me.
At some level, your patches look like page hints.
If we have more patches in mind that use page hints,
then a new hint device might make sense.
However, people experimented with page hints in the past, so far this
always went nowhere. E.g. I CC Rick who saw some problems when page
hints interact with huge pages. Rick, could you elaborate please?
--
MST
> Subject: Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration
> optimization
>
> On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live
> > > > migration code is
> > > in migration/ram.c.
> > >
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we
> can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > >
> >
> > Yes, we really can teach qemu to skip these pages and it's not hard.
> > The problem is the poor performance, this PV solution
>
> Balloon is always PV. And do not call patches solutions please.
>
OK.
> > is aimed to make it more
> > efficient and reduce the performance impact on guest.
>
> We need to get a bit beyond this. You are making multiple changes, it seems
> to make sense to split it all up, and analyse each change separately. If you
> don't this patchset will be stuck: as you have seen people aren't convinced it
> actually helps with real workloads.
>
Indeed, changing the virtio spec requires good reasons.
> > > > >
> > > > > > > > The only advantage of ' inflating the balloon before live
> > > > > > > > migration' is simple,
> > > > > > > nothing more.
> > > > > > >
> > > > > > > That's a big advantage. Another one is that it does
> > > > > > > something useful in real- world scenarios.
> > > > > > >
> > > > > >
> > > > > > I don't think the heave performance impaction is something
> > > > > > useful in real
> > > > > world scenarios.
> > > > > >
> > > > > > Liang
> > > > > > > Roman.
> > > > >
> > > > > So fix the performance then. You will have to try harder if you
> > > > > want to convince people that the performance is due to bad
> > > > > host/guest interface, and so we have to change *that*.
> > > > >
> > > >
> > > > Actually, the PV solution is irrelevant with the balloon
> > > > mechanism, I just use it to transfer information between host and
> guest.
> > > > I am not sure if I should implement a new virtio device, and I
> > > > want to get the answer from the community.
> > > > In this RFC patch, to make things simple, I choose to extend the
> > > > virtio-balloon and use the extended interface to transfer the
> > > > request and
> > > free_page_bimap content.
> > > >
> > > > I am not intend to change the current virtio-balloon implementation.
> > > >
> > > > Liang
> > >
> > > And the answer would depend on the answer to my question above.
> > > Does balloon need an interface passing page bitmaps around?
> >
> > Yes, I need a new interface.
>
> Possibly, but you will need to justify this at some level if you care about
> upstreaming your patches.
>
> > > Does this speed up any operations?
> >
> > No, a new interface will not speed up anything, but it is the easiest way to
> solve the compatibility issue.
>
> A bunch of new code is often easier to write than to figure out the old one,
> but if we keep piling it up we'll end up with an unmaintainable mess. So we
> are rather careful about adding new interfaces, and we try to make them
> generic sometimes even at cost of slight inefficiencies.
>
> > > OTOH what if you use the regular balloon interface with your patches?
> > >
> >
> > The regular balloon interfaces have their specific function and I can't use
> them in my patches.
> > If using these regular interface, I have to do a lot of changes to keep the
> compatibility.
>
> Why can't you?
>
> What exactly do we need to change?
>
> If we put things in terms of the balloon, that supports adding and removing
> pages.
>
> Using these terms, let's enumerate:
> - a new method (e.g. new virtqueue) that adds and immediately removes
> page in a balloon
> clearly, you can add then remove using the existing interfaces
> is a single command significantly faster than using existing two vqs?
> - a new kind of request that says "add (and immediately remove?) as many
> pages as you can"
> sounds rather benign
> - a new kind of message that adds multiple pages using a bitmap
> (instead of an address list)
> again, is this significantly faster?
More or less faster, because of less data traffic. I haven't measured this; I will do so and take a close look
at the approach you suggest if we choose to make use of the virtio-balloon interface.
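
As a rough estimate of the payload sizes only (not a measurement), the two
encodings can be compared as below. The sketch assumes 4KB guest pages and
the 4-byte PFN entries that the existing balloon handler reads; the
function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_PAGE_SHIFT 12   /* assumption: 4KB guest pages */
#define PFN_ENTRY_BYTES  4    /* the balloon handler reads 4 bytes per PFN */

/* Bitmap payload: one bit per guest page, whether the page is free or not. */
uint64_t free_bitmap_bytes(uint64_t ram_bytes)
{
    uint64_t pages = ram_bytes >> GUEST_PAGE_SHIFT;

    return (pages + 7) / 8;
}

/* PFN list payload: one fixed-size entry per reported page. */
uint64_t pfn_list_bytes(uint64_t reported_pages)
{
    return reported_pages * PFN_ENTRY_BYTES;
}

/* Number of reported pages at which both payloads have the same size;
 * above this count the bitmap is the smaller one. */
uint64_t break_even_pages(uint64_t ram_bytes)
{
    return free_bitmap_bytes(ram_bytes) / PFN_ENTRY_BYTES;
}
```

For an 8GB guest this gives a 256KB bitmap no matter how much memory is
free, against up to 8MB of PFNs when nearly all memory is free. The bitmap
payload is the smaller one once more than 65536 pages (256MB) are reported.
This says nothing about virtqueue or per-request overhead, which still has
to be measured.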
>
> Does not look like compatibility is an issue, to me.
>
>
> At some level, your patches look like page hints.
> If we have more patches in mind that use page hints, then a new hint device
> might make sense.
>
Yes, I have considered implementing a new device. Using virtio-balloon to
transfer the free pages information, which is unrelated to the balloon mechanism,
is more or less confusing.
> However, people experimented with page hints in the past, so far this always
> went nowhere. E.g. I CC Rick who saw some problems when page hints
> interact with huge pages. Rick, could you elaborate please?
>
Thanks a lot. Can't wait to know the problems.
Liang
>
> --
> MST
On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
>
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
>
> This patch set is the QEMU side implementation.
>
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
>
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.
>
> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.
I like the idea, just have to prove (review) and test it a lot to
ensure we don't end up skipping pages that matter.
However, there are a couple of points:
In my opinion, the information that's exchanged between the guest and
the host should be exchanged over a virtio-serial channel rather than
virtio-balloon. First, there's nothing related to the balloon here.
It just happens to be memory info. Second, I would never enable
balloon in a guest that I want to be performance-sensitive. So even
if you add this as part of balloon, you'll find no one is using this
solution.
I suggest virtio-serial instead, because it's meant exactly to
exchange free-flowing information between a host and a guest, and you
don't need to extend any part of the protocol for it (hence no changes
necessary to the spec). You can see how spice, vnc, etc., use
virtio-serial to exchange data.
Amit
> Subject: Re: [RFC qemu 0/4] A PV solution for live migration optimization
>
> On (Thu) 03 Mar 2016 [18:44:24], Liang Li wrote:
> > The current QEMU live migration implementation mark the all the
> > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > will be processed and that takes quit a lot of CPU cycles.
> >
> > From guest's point of view, it doesn't care about the content in free
> > pages. We can make use of this fact and skip processing the free pages
> > in the ram bulk stage, it can save a lot CPU cycles and reduce the
> > network traffic significantly while speed up the live migration
> > process obviously.
> >
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
>
> I like the idea, just have to prove (review) and test it a lot to ensure we don't
> end up skipping pages that matter.
>
> However, there are a couple of points:
>
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than virtio-
> balloon. First, there's nothing related to the balloon here.
> It just happens to be memory info. Second, I would never enable balloon in
> a guest that I want to be performance-sensitive. So even if you add this as
> part of balloon, you'll find no one is using this solution.
>
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange free-
> flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec). You can see how spice, vnc, etc., use virtio-serial to exchange data.
>
>
> Amit
I don't like using virtio-balloon either; it's confusing.
It would be great if virtio-serial can be used; I will take a look at it.
Thanks for your suggestion!
Liang
On Fri, Mar 04, 2016 at 03:13:03PM +0000, Li, Liang Z wrote:
> > > Maybe I am not clear enough.
> > >
> > > I mean if we inflate balloon before live migration, for a 8GB guest, it takes
> > about 5 Seconds for the inflating operation to finish.
> >
> > And these 5 seconds are spent where?
> >
>
> The time is spent on allocating the pages and send the allocated pages pfns to QEMU
> through virtio.
What if we skip allocating pages but use the existing interface to send pfns
to QEMU?
> > > For the PV solution, there is no need to inflate balloon before live
> > > migration, the only cost is to traversing the free_list to construct
> > > the free pages bitmap, and it takes about 20ms for a 8GB idle guest( less if
> > there is less free pages), passing the free pages info to host will take about
> > extra 3ms.
> > >
> > >
> > > Liang
> >
> > So now let's please stop talking about solutions at a high level and discuss the
> > interface changes you make in detail.
> > What makes it faster? Better host/guest interface? No need to go through
> > buddy allocator within guest? Less interrupts? Something else?
> >
>
> I assume you are familiar with the current virtio-balloon and how it works.
> The new interface is very simple: a request is sent to the virtio-balloon driver,
> the driver traverses '&zone->free_area[order].free_list[t]' to
> construct a 'free_page_bitmap', and then the driver sends the content
> of 'free_page_bitmap' back to QEMU. That is all the new interface does, and
> there is no 'alloc_page'-related work, so it's faster.
>
>
> Some code snippet:
> ----------------------------------------------
> +static void mark_free_pages_bitmap(struct zone *zone,
> +        unsigned long *free_page_bitmap, unsigned long pfn_gap)
> +{
> +    unsigned long pfn, flags, i;
> +    unsigned int order, t;
> +    struct list_head *curr;
> +
> +    if (zone_is_empty(zone))
> +        return;
> +
> +    spin_lock_irqsave(&zone->lock, flags);
> +
> +    for_each_migratetype_order(order, t) {
> +        list_for_each(curr, &zone->free_area[order].free_list[t]) {
> +
> +            pfn = page_to_pfn(list_entry(curr, struct page, lru));
> +            for (i = 0; i < (1UL << order); i++) {
> +                if ((pfn + i) >= PFN_4G)
> +                    set_bit_le(pfn + i - pfn_gap,
> +                               free_page_bitmap);
> +                else
> +                    set_bit_le(pfn + i, free_page_bitmap);
> +            }
> +        }
> +    }
> +
> +    spin_unlock_irqrestore(&zone->lock, flags);
> +}
> ----------------------------------------------------
> Sorry for my poor English and expression, if you still can't understand,
> you could glance at the patch, total about 400 lines.
> >
> > > > --
> > > > MST
> On Fri, Mar 04, 2016 at 03:13:03PM +0000, Li, Liang Z wrote:
> > > > Maybe I am not clear enough.
> > > >
> > > > I mean if we inflate balloon before live migration, for a 8GB
> > > > guest, it takes
> > > about 5 Seconds for the inflating operation to finish.
> > >
> > > And these 5 seconds are spent where?
> > >
> >
> > The time is spent on allocating the pages and send the allocated pages
> > pfns to QEMU through virtio.
>
> What if we skip allocating pages but use the existing interface to send pfns to
> QEMU?
>
I think it will be much faster; allocating pages is the main reason the operation takes so long.
An experiment is needed to get the exact time spent on sending the PFNs.
Liang
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> >
> >> > The memory usage will keep increasing due to ever growing caches,
> >> > etc, so you'll be left with very little free memory fairly soon.
> >> >
> > I don't think so.
> >
>
> Roman is right. For example, here I am looking at a 64 GB (physical) machine
> which was booted about 30 minutes ago, and which is running disk-heavy
> workloads (installing VMs).
>
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB. I expect that by the
> time I have finished running the workload, in two hours, it will not have any
> free memory.
>
> Paolo
I have a VM with 2GB of RAM. When the guest had just booted, there were about 1.4GB of free pages.
Then I downloaded a large file from the internet with the browser; after the download finished,
there were only 72MB of free pages left. As Roman pointed out, there was quite a lot of cached memory.
Then I compiled QEMU; after the compilation finished, there were about 1.3GB of free pages.
So even though the cache can grow to a large size, it will be freed under some other workloads.
The cache memory is a big issue that should be taken into consideration.
How about reclaiming some cache before getting the free pages information?
Liang
On Fri, Mar 04, 2016 at 06:51:21PM +0000, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini ([email protected]) wrote:
> >
> >
> > On 04/03/2016 15:26, Li, Liang Z wrote:
> > >> >
> > >> > The memory usage will keep increasing due to ever growing caches, etc, so
> > >> > you'll be left with very little free memory fairly soon.
> > >> >
> > > I don't think so.
> > >
> >
> > Roman is right. For example, here I am looking at a 64 GB (physical)
> > machine which was booted about 30 minutes ago, and which is running
> > disk-heavy workloads (installing VMs).
> >
> > Since I have started writing this email (2 minutes?), the amount of free
> > memory has already gone down from 37 GB to 33 GB. I expect that by the
> > time I have finished running the workload, in two hours, it will not
> > have any free memory.
>
> But what about a VM sitting idle, or that just has more RAM assigned to it
> than is currently using.
> I've got a host here that's been up for 46 days and has been doing some
> heavy VM debugging a few days ago, but today:
>
> # free -m
> total used free shared buff/cache available
> Mem: 96536 1146 44834 184 50555 94735
>
> I very rarely use all it's RAM, so it's got a big chunk of free RAM, and yes
> it's got a big chunk of cache as well.
One of the promises of virtualization is better resource utilization.
People tend to avoid purchasing VMs so oversized that they never
touch a significant amount of their RAM. (Well, at least this is how
things stand in the hosting market; I guess the enterprise market is similar in
this regard).
That said, I'm not at all opposed to optimizing the migration of free
memory; what I'm trying to say is that creating brand new infrastructure
specifically for that case doesn't look justified when the existing one
can cover it in addition to much more common scenarios.
Roman.
> On Fri, Mar 04, 2016 at 06:51:21PM +0000, Dr. David Alan Gilbert wrote:
> > * Paolo Bonzini ([email protected]) wrote:
> > >
> > >
> > > On 04/03/2016 15:26, Li, Liang Z wrote:
> > > >> >
> > > >> > The memory usage will keep increasing due to ever growing
> > > >> > caches, etc, so you'll be left with very little free memory fairly soon.
> > > >> >
> > > > I don't think so.
> > > >
> > >
> > > Roman is right. For example, here I am looking at a 64 GB
> > > (physical) machine which was booted about 30 minutes ago, and which
> > > is running disk-heavy workloads (installing VMs).
> > >
> > > Since I have started writing this email (2 minutes?), the amount of
> > > free memory has already gone down from 37 GB to 33 GB. I expect
> > > that by the time I have finished running the workload, in two hours,
> > > it will not have any free memory.
> >
> > But what about a VM sitting idle, or that just has more RAM assigned
> > to it than is currently using.
> > I've got a host here that's been up for 46 days and has been doing
> > some heavy VM debugging a few days ago, but today:
> >
> > # free -m
> > total used free shared buff/cache available
> > Mem: 96536 1146 44834 184 50555 94735
> >
> > I very rarely use all it's RAM, so it's got a big chunk of free RAM,
> > and yes it's got a big chunk of cache as well.
>
> One of the promises of virtualization is better resource utilization.
> People tend to avoid purchasing VMs so much oversized that they never
> touch a significant amount of their RAM. (Well, at least this is how things
> stand in hosting market; I guess enterprize market is similar in this regard).
>
> That said, I'm not at all opposed to optimizing the migration of free memory;
> what I'm trying to say is that creating brand new infrastructure specifically for
> that case doesn't look justified when the existing one can cover it in addition
> to much more common scenarios.
>
> Roman.
Even though the existing one can cover more common scenarios, it has a performance issue.
That's why I created a new one.
Liang
On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > processed during live migration without skipping. The live migration code is
> > > in migration/ram.c.
> > >
> > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > teach qemu to skip these pages.
> > > Want to write a patch to do this?
> > >
> >
> > Yes, we really can teach qemu to skip these pages and it's not hard.
> > The problem is the poor performance, this PV solution
>
> Balloon is always PV. And do not call patches solutions please.
>
> > is aimed to make it more
> > efficient and reduce the performance impact on guest.
>
> We need to get a bit beyond this. You are making multiple
> changes, it seems to make sense to split it all up, and analyse each
> change separately.
Couldn't agree more.
There are three stages in this optimization:
1) choosing which pages to skip
2) communicating them from guest to host
3) skip transferring uninteresting pages to the remote side on migration
For (3) there seems to be a low-hanging fruit to amend
migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This
would work for guest RAM that hasn't been touched yet or which has been
ballooned out.
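
A minimal sketch of such a check, under the assumption that guest RAM is
private anonymous memory (for file-backed or shared RAM a missing mapping
does not imply zero content). It relies on the documented pagemap layout:
one 64-bit entry per virtual page, bit 63 meaning "present" and bit 62
meaning "swapped". The function name page_is_unbacked() is made up here
and is not part of QEMU.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_PRESENT (1ULL << 63)  /* page is resident in RAM */
#define PM_SWAPPED (1ULL << 62)  /* page is in swap */

/* Returns 1 if the page containing addr has neither a resident nor a
 * swapped-out copy, 0 if it is backed, -1 if pagemap cannot be read. */
int page_is_unbacked(const void *addr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t offset;
    int fd, ret = -1;

    if (page_size <= 0)
        return -1;

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size)
             * (off_t)sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) == (ssize_t)sizeof(entry))
        ret = !(entry & (PM_PRESENT | PM_SWAPPED));

    close(fd);
    return ret;
}
```

A real implementation would keep the file descriptor open and read the
entries for a whole range in one pread() rather than opening the file for
every page.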
For (1) I've been trying to make a point that skipping clean pages is
much more likely to result in noticeable benefit than free pages only.
As for (2), we do seem to have a problem with the existing balloon:
according to your measurements it's very slow; besides, I guess it plays
badly with transparent huge pages (as both the guest and the host work
with one 4k page at a time). This is a problem for other use cases of
balloon (e.g. as a facility for resource management); tackling that
appears a more natural application for optimization efforts.
Thanks,
Roman.
> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live
> > > > > migration code is
> > > > in migration/ram.c.
> > > >
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we
> can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > >
> > >
> > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > The problem is the poor performance, this PV solution
> >
> > Balloon is always PV. And do not call patches solutions please.
> >
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> >
> > We need to get a bit beyond this. You are making multiple changes, it
> > seems to make sense to split it all up, and analyse each change
> > separately.
>
> Couldn't agree more.
>
> There are three stages in this optimization:
>
> 1) choosing which pages to skip
>
> 2) communicating them from guest to host
>
> 3) skip transferring uninteresting pages to the remote side on migration
>
> For (3) there seems to be a low-hanging fruit to amend
> migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This would
> work for guest RAM that hasn't been touched yet or which has been
> ballooned out.
>
> For (1) I've been trying to make a point that skipping clean pages is much
> more likely to result in noticeable benefit than free pages only.
>
I am considering dropping the page cache before getting the free pages.
> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays badly
I didn't say that communicating is slow. Even if it is very slow, my solution uses a bitmap instead of
PFNs, so there is less data traffic and it is faster than the existing balloon, which uses PFNs.
> with transparent huge pages (as both the guest and the host work with one
> 4k page at a time). This is a problem for other use cases of balloon (e.g. as a
> facility for resource management); tackling that appears a more natural
> application for optimization efforts.
>
> Thanks,
> Roman.
On Wed, Mar 09, 2016 at 03:27:54PM +0000, Li, Liang Z wrote:
> > On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > > processed during live migration without skipping. The live
> > > > > > migration code is in migration/ram.c.
> > > > >
> > > > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > > > teach qemu to skip these pages.
> > > > > Want to write a patch to do this?
> > > > >
> > > >
> > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > The problem is the poor performance, this PV solution
> > >
> > > Balloon is always PV. And do not call patches solutions please.
> > >
> > > > is aimed to make it more
> > > > efficient and reduce the performance impact on guest.
> > >
> > > We need to get a bit beyond this. You are making multiple changes, it
> > > seems to make sense to split it all up, and analyse each change
> > > separately.
> >
> > Couldn't agree more.
> >
> > There are three stages in this optimization:
> >
> > 1) choosing which pages to skip
> >
> > 2) communicating them from guest to host
> >
> > 3) skip transferring uninteresting pages to the remote side on migration
> >
> > For (3) there seems to be a low-hanging fruit to amend
> > migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This would
> > work for guest RAM that hasn't been touched yet or which has been
> > ballooned out.
> >
> > For (1) I've been trying to make a point that skipping clean pages is much
> > more likely to result in noticeable benefit than free pages only.
> >
>
> I am considering dropping the page cache before getting the free pages.
>
> > As for (2), we do seem to have a problem with the existing balloon:
> > according to your measurements it's very slow; besides, I guess it plays badly
>
> I didn't say that communicating is slow. Even if it is very slow, my solution uses a bitmap instead of
> PFNs, so there is less data traffic and it is faster than the existing balloon, which uses PFNs.
By how much?
> > with transparent huge pages (as both the guest and the host work with one
> > 4k page at a time). This is a problem for other use cases of balloon (e.g. as a
> > facility for resource management); tackling that appears a more natural
> > application for optimization efforts.
> >
> > Thanks,
> > Roman.
On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> On Mon, Mar 07, 2016 at 01:40:06PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2016 at 06:49:19AM +0000, Li, Liang Z wrote:
> > > > > No. And it's exactly what I mean. The ballooned memory is still
> > > > > processed during live migration without skipping. The live migration code is
> > > > > in migration/ram.c.
> > > >
> > > > So if guest acknowledged VIRTIO_BALLOON_F_MUST_TELL_HOST, we can
> > > > teach qemu to skip these pages.
> > > > Want to write a patch to do this?
> > > >
> > >
> > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > The problem is the poor performance, this PV solution
> >
> > Balloon is always PV. And do not call patches solutions please.
> >
> > > is aimed to make it more
> > > efficient and reduce the performance impact on guest.
> >
> > We need to get a bit beyond this. You are making multiple
> > changes, it seems to make sense to split it all up, and analyse each
> > change separately.
>
> Couldn't agree more.
>
> There are three stages in this optimization:
>
> 1) choosing which pages to skip
>
> 2) communicating them from guest to host
>
> 3) skip transferring uninteresting pages to the remote side on migration
>
> For (3) there seems to be a low-hanging fruit to amend
> migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This
> would work for guest RAM that hasn't been touched yet or which has been
> ballooned out.
>
> For (1) I've been trying to make a point that skipping clean pages is
> much more likely to result in noticeable benefit than free pages only.
I guess when you say clean you mean zero?
Yea. In fact, one can zero out any number of pages
quickly by putting them in balloon and immediately
taking them out.
Access will fault a zero page in, then COW kicks in.
We could have a new zero VQ (or some other option)
to pass these pages guest to host, but this only
works well if page size matches the host page size.
> As for (2), we do seem to have a problem with the existing balloon:
> according to your measurements it's very slow; besides, I guess it plays
> badly with transparent huge pages (as both the guest and the host work
> with one 4k page at a time). This is a problem for other use cases of
> balloon (e.g. as a facility for resource management); tackling that
> appears a more natural application for optimization efforts.
>
> Thanks,
> Roman.
On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > For (1) I've been trying to make a point that skipping clean pages is
> > much more likely to result in noticeable benefit than free pages only.
>
> I guess when you say clean you mean zero?
No I meant clean, i.e. those that could be evicted from RAM without
causing I/O.
> Yea. In fact, one can zero out any number of pages
> quickly by putting them in balloon and immediately
> taking them out.
>
> Access will fault a zero page in, then COW kicks in.
I must be missing something obvious, but how is that different from
inflating and then immediately deflating the balloon?
> We could have a new zero VQ (or some other option)
> to pass these pages guest to host, but this only
> works well if page size matches the host page size.
I'm afraid I don't yet understand what kind of pages that would be and
how they are different from ballooned pages.
I still tend to think that ballooning is a sensible solution to the
problem at hand; it's just the granularity that makes things slow and
stands in the way.
Roman.
On Wed, Mar 09, 2016 at 08:04:39PM +0300, Roman Kagan wrote:
> On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > For (1) I've been trying to make a point that skipping clean pages is
> > > much more likely to result in noticeable benefit than free pages only.
> >
> > I guess when you say clean you mean zero?
>
> No I meant clean, i.e. those that could be evicted from RAM without
> causing I/O.
They must be migrated unless guest actually evicts them.
It's not at all clear to me that it's always preferable
to drop all clean pages from pagecache. It is clearly
going to slow the guest down significantly.
> > Yea. In fact, one can zero out any number of pages
> > quickly by putting them in balloon and immediately
> > taking them out.
> >
> > Access will fault a zero page in, then COW kicks in.
>
> I must be missing something obvious, but how is that different from
> inflating and then immediately deflating the balloon?
It's exactly the same except
- we do not initiate this from host - it's guest doing
things for its own reasons
- a bit less guest/host interaction this way
> > We could have a new zero VQ (or some other option)
> > to pass these pages guest to host, but this only
> > works well if page size matches the host page size.
>
> I'm afraid I don't yet understand what kind of pages that would be and
> how they are different from ballooned pages.
>
> I still tend to think that ballooning is a sensible solution to the
> problem at hand;
I think it is, too. This does not mean we can't improve things though.
This patchset is reported to improve things, it should be
split up so we improve them for everyone and not just
one specific workload.
> it's just the granularity that makes things slow and
> stands in the way.
So we could request a specific page size/alignment from guest.
Send guest request to give us memory in aligned units of 2Mbytes,
and then host can treat each of these as a single huge page.
> Roman.
--
MST
On Wed, 2016-03-09 at 20:04 +0300, Roman Kagan wrote:
> On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > For (1) I've been trying to make a point that skipping clean
> > > pages is much more likely to result in noticeable benefit than
> > > free pages only.
> >
> > I guess when you say clean you mean zero?
>
> No I meant clean, i.e. those that could be evicted from RAM without
> causing I/O.
>
Programs in the guest may have that memory mmapped.
This could include things like libraries and executables.
How do you deal with the guest page cache containing
references to now non-existent memory?
How do you re-populate the memory on the destination
host?
--
All rights reversed
> > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > The problem is the poor performance, this PV solution
> > > >
> > > > Balloon is always PV. And do not call patches solutions please.
> > > >
> > > > > is aimed to make it more
> > > > > efficient and reduce the performance impact on guest.
> > > >
> > > > We need to get a bit beyond this. You are making multiple
> > > > changes, it seems to make sense to split it all up, and analyse
> > > > each change separately.
> > >
> > > Couldn't agree more.
> > >
> > > There are three stages in this optimization:
> > >
> > > 1) choosing which pages to skip
> > >
> > > 2) communicating them from guest to host
> > >
> > > 3) skip transferring uninteresting pages to the remote side on
> > > migration
> > >
> > > For (3) there seems to be a low-hanging fruit to amend
> > > migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This
> > > would work for guest RAM that hasn't been touched yet or which has
> > > been ballooned out.
> > >
> > > For (1) I've been trying to make a point that skipping clean pages
> > > is much more likely to result in noticeable benefit than free pages only.
> > >
> >
> > I am considering dropping the page cache before getting the free pages.
> >
> > > As for (2), we do seem to have a problem with the existing balloon:
> > > according to your measurements it's very slow; besides, I guess it
> > > plays badly
> >
> > I didn't say that communicating is slow. Even if it is very slow, my
> > solution uses a bitmap instead of PFNs, so there is less data traffic and
> > it is faster than the existing balloon, which uses PFNs.
>
> By how much?
>
Haven't measured yet.
To identify a page, 1 bit is needed with a bitmap, while 4 bytes (32 bits) are needed with a PFN.
For a guest with 8GB RAM, the corresponding free page bitmap size is 256KB,
and the corresponding total size of the PFNs is 8192KB. Assuming the inflating size
is 7GB, the total size of the PFNs is 7168KB.
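The figures above can be reproduced with a few lines of arithmetic. This is only a worked check, assuming 4KB guest pages and 4-byte PFNs as stated in the estimate; the function names are made up for the illustration.

```python
PAGE_SIZE = 4096   # 4KB guest pages, as assumed in the estimate
PFN_BYTES = 4      # one 32-bit PFN per page
KB = 1024
GB = 1024 ** 3


def bitmap_kb(ram_bytes):
    """Size in KB of a bitmap with one bit per guest page."""
    pages = ram_bytes // PAGE_SIZE
    return pages // 8 // KB


def pfn_list_kb(ram_bytes):
    """Size in KB of a list with one 4-byte PFN per guest page."""
    pages = ram_bytes // PAGE_SIZE
    return pages * PFN_BYTES // KB


print(bitmap_kb(8 * GB))     # 256
print(pfn_list_kb(8 * GB))   # 8192
print(pfn_list_kb(7 * GB))   # 7168
```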
Maybe this is not the point.
Liang
> > > with transparent huge pages (as both the guest and the host work
> > > with one 4k page at a time). This is a problem for other use cases
> > > of balloon (e.g. as a facility for resource management); tackling
> > > that appears a more natural application for optimization efforts.
> > >
> > > Thanks,
> > > Roman.
> > This patch set is the QEMU side implementation.
> >
> > The virtio-balloon is extended so that QEMU can get the free pages
> > information from the guest through virtio.
> >
> > After getting the free pages information (a bitmap), QEMU can use it
> > to filter out the guest's free pages in the ram bulk stage. This make
> > the live migration process much more efficient.
> >
> > This RFC version doesn't take the post-copy and RDMA into
> > consideration, maybe both of them can benefit from this PV solution by
> > with some extra modifications.
>
> I like the idea, just have to prove (review) and test it a lot to ensure we don't
> end up skipping pages that matter.
>
> However, there are a couple of points:
>
> In my opinion, the information that's exchanged between the guest and the
> host should be exchanged over a virtio-serial channel rather than
> virtio-balloon. First, there's nothing related to the balloon here.
> It just happens to be memory info. Second, I would never enable balloon in
> a guest that I want to be performance-sensitive. So even if you add this as
> part of balloon, you'll find no one is using this solution.
>
> Secondly, I suggest virtio-serial, because it's meant exactly to exchange
> free-flowing information between a host and a guest, and you don't need to
> extend any part of the protocol for it (hence no changes necessary to the
> spec). You can see how spice, vnc, etc., use virtio-serial to exchange data.
>
>
> Amit
Hi Amit,
Could you provide more information on how to use virtio-serial to exchange data? A thread, wiki page, or code would all be OK.
I have not found any useful information yet.
Thanks
Liang
On (Thu) 10 Mar 2016 [07:44:19], Li, Liang Z wrote:
>
> Hi Amit,
>
> Could you provide more information on how to use virtio-serial to exchange data? A thread, wiki page, or code would all be OK.
> I have not found any useful information yet.
See this commit in the Linux sources:
108fc82596e3b66b819df9d28c1ebbc9ab5de14c
that adds a way to send guest trace data over to the host. I think
that's the most relevant to your use-case. However, you'll have to
add an in-kernel user of virtio-serial (like the virtio-console code
-- the code that deals with tty and hvc currently). There's no other
non-tty user right now, and this is the right kind of use-case to add
one for!
For many other (userspace) use-cases, see the qemu-guest-agent in the
qemu sources.
The API is documented in the wiki:
http://www.linux-kvm.org/page/Virtio-serial_API
and the feature pages have some information that may help as well:
https://fedoraproject.org/wiki/Features/VirtioSerial
There are some links in here too:
http://log.amitshah.net/2010/09/communication-between-guests-and-hosts/
Hope this helps.
Amit
> > Could you provide more information on how to use virtio-serial to exchange
> > data? A thread, wiki page, or code would all be OK.
> > I have not found any useful information yet.
>
> See this commit in the Linux sources:
>
> 108fc82596e3b66b819df9d28c1ebbc9ab5de14c
>
> that adds a way to send guest trace data over to the host. I think that's the
> most relevant to your use-case. However, you'll have to add an in-kernel
> user of virtio-serial (like the virtio-console code
> -- the code that deals with tty and hvc currently). There's no other non-tty
> user right now, and this is the right kind of use-case to add one for!
>
> For many other (userspace) use-cases, see the qemu-guest-agent in the
> qemu sources.
>
> The API is documented in the wiki:
>
> http://www.linux-kvm.org/page/Virtio-serial_API
>
> and the feature pages have some information that may help as well:
>
> https://fedoraproject.org/wiki/Features/VirtioSerial
>
> There are some links in here too:
>
> http://log.amitshah.net/2010/09/communication-between-guests-and-hosts/
>
> Hope this helps.
>
>
> Amit
Thanks a lot !!
Liang
On Wed, Mar 09, 2016 at 02:38:52PM -0500, Rik van Riel wrote:
> On Wed, 2016-03-09 at 20:04 +0300, Roman Kagan wrote:
> > On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > > For (1) I've been trying to make a point that skipping clean
> > > > pages is much more likely to result in noticeable benefit than
> > > > free pages only.
> > >
> > > I guess when you say clean you mean zero?
> >
> > No I meant clean, i.e. those that could be evicted from RAM without
> > causing I/O.
> >
>
> Programs in the guest may have that memory mmapped.
> This could include things like libraries and executables.
>
> How do you deal with the guest page cache containing
> references to now non-existent memory?
>
> How do you re-populate the memory on the destination
> host?
I guess the confusion is due to the context I stripped from the previous
messages... Actually I've been talking about doing full-fledged balloon
inflation before the migration, so, when it's deflated the guest will
fault in that data from the filesystem as usual.
Roman.
On Wed, Mar 09, 2016 at 07:39:18PM +0200, Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2016 at 08:04:39PM +0300, Roman Kagan wrote:
> > On Wed, Mar 09, 2016 at 05:41:39PM +0200, Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2016 at 05:28:54PM +0300, Roman Kagan wrote:
> > > > For (1) I've been trying to make a point that skipping clean pages is
> > > > much more likely to result in noticeable benefit than free pages only.
> > >
> > > I guess when you say clean you mean zero?
> >
> > No I meant clean, i.e. those that could be evicted from RAM without
> > causing I/O.
>
> They must be migrated unless guest actually evicts them.
If the balloon is inflated the guest will.
> It's not at all clear to me that it's always preferable
> to drop all clean pages from pagecache. It is clearly
> going to slow the guest down significantly.
That's a matter for optimization. The current value for
/proc/meminfo:MemAvailable (which is being proposed as a member of
balloon stats, too) is a conservative estimate which will probably cover
a good deal of cases.
> > I must be missing something obvious, but how is that different from
> > inflating and then immediately deflating the balloon?
>
> It's exactly the same except
> - we do not initiate this from host - it's guest doing
> things for its own reasons
> - a bit less guest/host interaction this way
I don't quite understand why you need to deflate the balloon until the
VM is on the destination host. deflate_on_oom will do it if the guest
is really tight on memory; otherwise there appears to be no reason for
it. But then inflation followed immediately by deflation doubles the
guest/host interactions rather than reduces them, no?
> > it's just the granularity that makes things slow and
> > stands in the way.
>
> So we could request a specific page size/alignment from guest.
> Send guest request to give us memory in aligned units of 2Mbytes,
> and then host can treat each of these as a single huge page.
I'd guess just coalescing contiguous pages would already speed things
up. I'll try to find some time to experiment with it.
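As a rough illustration of the coalescing idea, the sketch below merges individual page frame numbers into contiguous runs, so that each run could be handled with one operation instead of one per 4k page. The function name and the (first_pfn, page_count) representation are assumptions made for the example, not part of the balloon protocol.

```python
def coalesce_pfns(pfns):
    """Merge page frame numbers into sorted (first_pfn, page_count) runs."""
    runs = []
    for pfn in sorted(set(pfns)):
        if runs and pfn == runs[-1][0] + runs[-1][1]:
            # pfn directly follows the last run, so extend that run by one page
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            # gap before pfn, so start a new run
            runs.append((pfn, 1))
    return runs


print(coalesce_pfns([5, 3, 4, 10, 11, 20]))   # [(3, 3), (10, 2), (20, 1)]
```

How much this helps depends on how well ballooned pages clump in guest physical memory, which would have to be measured.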
Roman.
Hi,
I'm just catching back up on this thread; so without reference to any
particular previous mail in the thread.
1) How many of the free pages do we tell the host about?
Your main change is telling the host about all the
free pages.
If we tell the host about all the free pages, then we might
end up needing to allocate more pages and update the host
with pages we now want to use; that would have to wait for the
host to acknowledge that use of these pages, since if we don't
wait for it then it might have skipped migrating a page we
just started using (I don't understand how your series solves that).
So the guest probably needs to keep some free pages - how many?
2) Clearing out caches
Does it make sense to clean caches? They're apparently useful data
so if we clean them it's likely to slow the guest down; I guess
they're also likely to be fairly static data - so at least fairly
easy to migrate.
The answer here partially depends on what you want from your migration;
if you're after the fastest possible migration time it might make
sense to clean the caches and avoid migrating them; but that might
be at the cost of more disruption to the guest - there's a trade off
somewhere and it's not clear to me how you set that depending on your
guest/network/requirements.
3) Why is ballooning slow?
You've got a figure of 5s to balloon on an 8GB VM - but an
8GB VM isn't huge; so I worry about how long it would take
on a big VM. We need to understand why it's slow
  * is it due to the guest shuffling pages around?
  * is it due to the virtio-balloon protocol sending one page
    at a time?
    + Do balloon pages normally clump in physical memory
      - i.e. would a 'large balloon' message help
      - or do we need a bitmap because it tends not to clump?
  * is it due to the madvise on the host?
    If we were using the normal balloon messages, then we
    could, during migration, just route those to the migration
    code rather than bothering with the madvise.
    If they're clumping together we could just turn that into
    one big madvise; if they're not then would we benefit from
    a call that lets us madvise lots of areas?
4) Speeding up the migration of those free pages
You're using the bitmap to avoid migrating those free pages; HPe's
patchset is reconstructing a bitmap from the balloon data; OK, so
this all makes sense to avoid migrating them - I'd also been thinking
of using pagemap to spot zero pages that would help find other zero'd
pages, but perhaps ballooned is enough?
5) Second-migrate
Given a VM on which you've done all those tricks, what happens when
you migrate it a second time? I guess you're aiming for the guest
to update its bitmap; HPe's solution is to migrate its balloon
bitmap along with the migration data.
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Thu, Mar 10, 2016 at 01:41:16AM +0000, Li, Liang Z wrote:
> > > > > > Yes, we really can teach qemu to skip these pages and it's not hard.
> > > > > > The problem is the poor performance, this PV solution
> > > > >
> > > > > Balloon is always PV. And do not call patches solutions please.
> > > > >
> > > > > > is aimed to make it more
> > > > > > efficient and reduce the performance impact on guest.
> > > > >
> > > > > We need to get a bit beyond this. You are making multiple
> > > > > changes, it seems to make sense to split it all up, and analyse
> > > > > each change separately.
> > > >
> > > > Couldn't agree more.
> > > >
> > > > There are three stages in this optimization:
> > > >
> > > > 1) choosing which pages to skip
> > > >
> > > > 2) communicating them from guest to host
> > > >
> > > > 3) skip transferring uninteresting pages to the remote side on
> > > > migration
> > > >
> > > > For (3) there seems to be a low-hanging fruit to amend
> > > > migration/ram.c:is_zero_range() to consult /proc/self/pagemap. This
> > > > would work for guest RAM that hasn't been touched yet or which has
> > > > been ballooned out.
> > > >
> > > > For (1) I've been trying to make a point that skipping clean pages
> > > > is much more likely to result in noticeable benefit than free pages only.
> > > >
> > >
> > > I am considering dropping the page cache before getting the free pages.
> > >
> > > > As for (2), we do seem to have a problem with the existing balloon:
> > > > according to your measurements it's very slow; besides, I guess it
> > > > plays badly
> > >
> > > I didn't say that communicating is slow. Even if it is very slow, my
> > > solution uses a bitmap instead of PFNs, so there is less data traffic and
> > > it is faster than the existing balloon, which uses PFNs.
> >
> > By how much?
> >
>
> Haven't measured yet.
> To identify a page, 1 bit is needed with a bitmap, while 4 bytes (32 bits) are needed with a PFN.
>
> For a guest with 8GB RAM, the corresponding free page bitmap size is 256KB,
> and the corresponding total size of the PFNs is 8192KB. Assuming the inflating size
> is 7GB, the total size of the PFNs is 7168KB.
Yes but this is not how balloon works, instead, it will reuse a single
4K page multiple times. We can also trade off more memory for speed
if we want to, it's completely up to guest.
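To make the buffer reuse point concrete, here is a small worked calculation. It assumes a single 4K buffer that is filled completely with 4-byte PFNs on every round; the batch size a real guest driver uses may well differ, and the names below are invented for the example.

```python
PAGE_SIZE = 4096        # guest page size assumed in the estimate
PFN_BYTES = 4           # one 32-bit PFN per ballooned page
BUFFER_BYTES = 4096     # a single reusable 4K buffer (assumption taken from the reply)
GB = 1024 ** 3


def buffer_fills(inflate_bytes, buffer_bytes=BUFFER_BYTES):
    """How many times one buffer is filled and sent to cover an inflation."""
    pages = inflate_bytes // PAGE_SIZE
    pfns_per_buffer = buffer_bytes // PFN_BYTES
    return -(-pages // pfns_per_buffer)   # ceiling division


print(buffer_fills(7 * GB))   # 1792
```

Under these assumptions the 7168KB of PFNs never exists in memory at once: the guest needs 4KB of buffer and 1792 round trips, and a larger buffer trades memory for fewer round trips.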
>
> Maybe this is not the point.
>
> Liang
> > > > with transparent huge pages (as both the guest and the host work
> > > > with one 4k page at a time). This is a problem for other use cases
> > > > of balloon (e.g. as a facility for resource management); tackling
> > > > that appears a more natural application for optimization efforts.
> > > >
> > > > Thanks,
> > > > Roman.
>
> Hi,
> I'm just catching back up on this thread; so without reference to any
> particular previous mail in the thread.
>
> 1) How many of the free pages do we tell the host about?
> Your main change is telling the host about all the
> free pages.
Yes, all the guest's free pages.
> If we tell the host about all the free pages, then we might
> end up needing to allocate more pages and update the host
> with pages we now want to use; that would have to wait for the
> host to acknowledge that use of these pages, since if we don't
> wait for it then it might have skipped migrating a page we
> just started using (I don't understand how your series solves that).
> So the guest probably needs to keep some free pages - how many?
Actually, there is no need to care about whether the free pages will be used by the host.
We only care about whether some of the free pages we got are reused by the guest, right?
Dirty page logging can be used to solve this: start dirty page logging before getting
the free pages information from the guest. Even if some of the free pages are modified by the guest
while the free pages information is being collected, these modified pages will be tracked
by the dirty page logging mechanism. So, in the following migration_bitmap_sync() call,
the pages that are in the free pages bitmap but were later modified will be reset to dirty. We won't
omit any dirtied pages.
So, the guest doesn't need to keep any free pages.
> 2) Clearing out caches
> Does it make sense to clean caches? They're apparently useful data
> so if we clean them it's likely to slow the guest down; I guess
> they're also likely to be fairly static data - so at least fairly
> easy to migrate.
> The answer here partially depends on what you want from your migration;
> if you're after the fastest possible migration time it might make
> sense to clean the caches and avoid migrating them; but that might
> be at the cost of more disruption to the guest - there's a trade off
> somewhere and it's not clear to me how you set that depending on your
> guest/network/requirements.
>
Yes, cleaning the caches is an option. Let the users decide whether to use it or not.
> 3) Why is ballooning slow?
> You've got a figure of 5s to balloon on an 8GB VM - but an
> 8GB VM isn't huge; so I worry about how long it would take
> on a big VM. We need to understand why it's slow
> * is it due to the guest shuffling pages around?
> * is it due to the virtio-balloon protocol sending one page
> at a time?
> + Do balloon pages normally clump in physical memory
> - i.e. would a 'large balloon' message help
> - or do we need a bitmap because it tends not to clump?
>
I didn't do a comprehensive test. But I found that most of the time is spent
on allocating the pages and sending the PFNs to guest; I don't know which is
the most time-consuming operation, allocating the pages or sending the PFNs.
> * is it due to the madvise on the host?
> If we were using the normal balloon messages, then we
> could, during migration, just route those to the migration
> code rather than bothering with the madvise.
> If they're clumping together we could just turn that into
> one big madvise; if they're not then would we benefit from
> a call that lets us madvise lots of areas?
>
My test showed that madvise() is not the main reason for the long time; it takes only
10% of the total balloon inflation time.
A big madvise can more or less improve the performance.
> 4) Speeding up the migration of those free pages
> You're using the bitmap to avoid migrating those free pages; HPe's
> patchset is reconstructing a bitmap from the balloon data; OK, so
> this all makes sense to avoid migrating them - I'd also been thinking
> of using pagemap to spot zero pages that would help find other zero'd
> pages, but perhaps ballooned is enough?
>
Could you describe your idea in more detail?
> 5) Second-migrate
> Given a VM on which you've done all those tricks, what happens when
> you migrate it a second time? I guess you're aiming for the guest
> to update its bitmap; HPe's solution is to migrate its balloon
> bitmap along with the migration data.
Nothing is special in the second migration: QEMU will ask the guest for the free pages
information, and the guest will traverse its current free page list to construct a
new free page bitmap and send it to QEMU, just like in the first migration.
Liang
>
> Dave
>
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
* Li, Liang Z ([email protected]) wrote:
> >
> > Hi,
> > I'm just catching back up on this thread; so without reference to any
> > particular previous mail in the thread.
> >
> > 1) How many of the free pages do we tell the host about?
> > Your main change is telling the host about all the
> > free pages.
>
> Yes, all the guest's free pages.
>
> > If we tell the host about all the free pages, then we might
> > end up needing to allocate more pages and update the host
> > with pages we now want to use; that would have to wait for the
> > host to acknowledge that use of these pages, since if we don't
> > wait for it then it might have skipped migrating a page we
> > just started using (I don't understand how your series solves that).
> > So the guest probably needs to keep some free pages - how many?
>
> Actually, there is no need to care about whether the free pages will be used by the host.
> We only care about whether some of the free pages we got are reused by the guest, right?
>
> Dirty page logging can be used to solve this: start dirty page logging before getting
> the free pages information from the guest. Even if some of the free pages are modified by the guest
> while the free pages information is being collected, these modified pages will be tracked
> by the dirty page logging mechanism. So, in the following migration_bitmap_sync() call,
> the pages that are in the free pages bitmap but were later modified will be reset to dirty. We won't
> omit any dirtied pages.
>
> So, the guest doesn't need to keep any free pages.
OK, yes, that works; so we do:
* enable dirty logging
* ask guest for free pages
* initialise the migration bitmap as everything-free
* then later we do the normal sync-dirty bitmap stuff and it all just works.
That's nice and simple.
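The sequence above can be sketched as a toy model. It reads "everything-free" as "every page marked for transfer except the ones the guest reported free", which is an interpretation rather than something the thread spells out, and it uses plain Python lists in place of QEMU's bitmaps; the function names are hypothetical.

```python
def initial_migration_bitmap(num_pages, free_pages):
    """Mark every page for transfer, then clear the pages the guest reported free."""
    bitmap = [True] * num_pages
    for pfn in free_pages:
        bitmap[pfn] = False
    return bitmap


def sync_dirty(bitmap, dirty_log):
    """Re-mark every page caught by dirty logging, including reused free pages."""
    for pfn in dirty_log:
        bitmap[pfn] = True
    return bitmap


# Dirty logging is enabled first, so a write to page 3 after the guest
# reported it free is still recorded and the page is sent after all.
bitmap = initial_migration_bitmap(8, free_pages={2, 3, 5})
bitmap = sync_dirty(bitmap, dirty_log={3})
pages_to_send = [pfn for pfn, marked in enumerate(bitmap) if marked]
print(pages_to_send)   # [0, 1, 3, 4, 6, 7]
```

The ordering is what makes this safe: because logging starts before the free page list is collected, a stale entry in that list can only delay a page, never drop it.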
> > 2) Clearing out caches
> > Does it make sense to clean caches? They're apparently useful data
> > so if we clean them it's likely to slow the guest down; I guess
> > they're also likely to be fairly static data - so at least fairly
> > easy to migrate.
> > The answer here partially depends on what you want from your migration;
> > if you're after the fastest possible migration time it might make
> > sense to clean the caches and avoid migrating them; but that might
> > be at the cost of more disruption to the guest - there's a trade off
> > somewhere and it's not clear to me how you set that depending on your
> > guest/network/requirements.
> >
>
> Yes, cleaning the caches is an option. Let the users decide whether to use it or not.
>
> > 3) Why is ballooning slow?
> > You've got a figure of 5s to balloon on an 8GB VM - but an
> > 8GB VM isn't huge; so I worry about how long it would take
> > on a big VM. We need to understand why it's slow
> > * is it due to the guest shuffling pages around?
> > * is it due to the virtio-balloon protocol sending one page
> > at a time?
> > + Do balloon pages normally clump in physical memory
> > - i.e. would a 'large balloon' message help
> > - or do we need a bitmap because it tends not to clump?
> >
>
> I didn't do a comprehensive test, but I found that most of the time is spent
> on allocating the pages and sending the PFNs. I don't know which of the two
> is the more time-consuming operation.
It might be a good idea to analyse it a bit more to convince people where
the problem is.
> > * is it due to the madvise on the host?
> > If we were using the normal balloon messages, then we
> > could, during migration, just route those to the migration
> > code rather than bothering with the madvise.
> > If they're clumping together we could just turn that into
> > one big madvise; if they're not then would we benefit from
> > a call that lets us madvise lots of areas?
> >
>
> My test showed that madvise() is not the main reason for the long time; it takes
> only 10% of the total balloon inflation time.
> A bigger madvise() can more or less improve the performance.
OK; 10% of the total is still pretty big even for your 8GB VM.
> > 4) Speeding up the migration of those free pages
> > You're using the bitmap to avoid migrating those free pages; HPe's
> > patchset is reconstructing a bitmap from the balloon data; OK, so
> > this all makes sense to avoid migrating them - I'd also been thinking
> > of using pagemap to spot zero pages that would help find other zero'd
> > pages, but perhaps ballooned is enough?
> >
> Could you describe your idea in more detail?
At the moment the migration code spends a fair amount of time checking if a page
is zero; I was thinking perhaps QEMU could just open /proc/self/pagemap
and check if the page was mapped; that would seem cheap if we're checking big
ranges; and that would find all the balloon pages.
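
A rough sketch of the pagemap check, under the assumption of a Linux host where /proc/self/pagemap holds one 64-bit entry per virtual page, with bit 63 meaning "present in RAM" and bit 62 meaning "swapped". The helpers `entry_is_backed` and `range_backed` are hypothetical names for illustration, not anything in QEMU.

```python
import os
import struct

PAGE_PRESENT = 1 << 63  # pagemap bit 63: page is present in RAM
PAGE_SWAPPED = 1 << 62  # pagemap bit 62: page is in swap


def entry_is_backed(entry):
    """True if the pagemap entry says the page has backing (RAM or swap)."""
    return bool(entry & (PAGE_PRESENT | PAGE_SWAPPED))


def range_backed(start_addr, num_pages, pagemap_path="/proc/self/pagemap"):
    """Return one bool per page in [start_addr, start_addr + num_pages pages).

    One seek and one read cover the whole range, which is why checking
    big ranges should be cheap compared with scanning page contents.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(pagemap_path, "rb") as f:
        f.seek((start_addr // page_size) * 8)  # 8 bytes per page entry
        raw = f.read(num_pages * 8)
    entries = struct.unpack("=%dQ" % (len(raw) // 8), raw)
    return [entry_is_backed(e) for e in entries]
```

Pages that were never touched, or that were dropped when the balloon inflated, would be expected to show up as not backed and could be treated as zero without reading them.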
> > 5) Second-migrate
> > Given a VM on which you've done all those tricks, what happens when
> > you migrate it a second time? I guess you're aiming for the guest
> > to update its bitmap; HPe's solution is to migrate its balloon
> > bitmap along with the migration data.
>
> Nothing is special about the second migration: QEMU will ask the guest for its free page
> information, and the guest will traverse its current free page list to construct a
> new free page bitmap and send it to QEMU, just like in the first migration.
Right.
Dave
> Liang
> >
> > Dave
> >
> > --
> > Dr. David Alan Gilbert / [email protected] / Manchester, UK
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
> > > Hi,
> > > I'm just catching back up on this thread; so without reference to
> > > any particular previous mail in the thread.
> > >
> > > 1) How many of the free pages do we tell the host about?
> > > Your main change is telling the host about all the
> > > free pages.
> >
> > Yes, all the guest's free pages.
> >
> > > If we tell the host about all the free pages, then we might
> > > end up needing to allocate more pages and update the host
> > > with pages we now want to use; that would have to wait for the
> > > host to acknowledge that use of these pages, since if we don't
> > > wait for it then it might have skipped migrating a page we
> > > just started using (I don't understand how your series solves that).
> > > So the guest probably needs to keep some free pages - how many?
> >
> > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > We only care about some of the free pages we get reused by the guest,
> right?
> >
> > The dirty page logging can be used to solve this, starting the dirty
> > page logging before getting the free pages informant from guest. Even
> > some of the free pages are modified by the guest during the process of
> > getting the free pages information, these modified pages will be traced by
> the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > The pages in the free pages bitmap, but latter was modified, will be
> > reset to dirty. We won't omit any dirtied pages.
> >
> > So, guest doesn't need to keep any free pages.
>
> OK, yes, that works; so we do:
> * enable dirty logging
> * ask guest for free pages
> * initialise the migration bitmap as everything-free
> * then later we do the normal sync-dirty bitmap stuff and it all just works.
>
> That's nice and simple.
>
> > > 2) Clearing out caches
> > > Does it make sense to clean caches? They're apparently useful data
> > > so if we clean them it's likely to slow the guest down; I guess
> > > they're also likely to be fairly static data - so at least fairly
> > > easy to migrate.
> > > The answer here partially depends on what you want from your
> migration;
> > > if you're after the fastest possible migration time it might make
> > > sense to clean the caches and avoid migrating them; but that might
> > > be at the cost of more disruption to the guest - there's a trade off
> > > somewhere and it's not clear to me how you set that depending on
> your
> > > guest/network/reqirements.
> > >
> >
> > Yes, clean the caches is an option. Let the users decide using it or not.
> >
> > > 3) Why is ballooning slow?
> > > You've got a figure of 5s to balloon on an 8GB VM - but an
> > > 8GB VM isn't huge; so I worry about how long it would take
> > > on a big VM. We need to understand why it's slow
> > > * is it due to the guest shuffling pages around?
> > > * is it due to the virtio-balloon protocol sending one page
> > > at a time?
> > > + Do balloon pages normally clump in physical memory
> > > - i.e. would a 'large balloon' message help
> > > - or do we need a bitmap because it tends not to clump?
> > >
> >
> > I didn't do a comprehensive test. But I found most of the time
> > spending on allocating the pages and sending the PFNs to guest, I
> > don't know that's the most time consuming operation, allocating the pages
> or sending the PFNs.
>
> It might be a good idea to analyse it a bit more to convince people where the
> problem is.
>
Yes, I will try to measure the time spent in the different parts.
> > > * is it due to the madvise on the host?
> > > If we were using the normal balloon messages, then we
> > > could, during migration, just route those to the migration
> > > code rather than bothering with the madvise.
> > > If they're clumping together we could just turn that into
> > > one big madvise; if they're not then would we benefit from
> > > a call that lets us madvise lots of areas?
> > >
> >
> > My test showed madvise() is not the main reason for the long time,
> > only taken 10% of the total inflating balloon operation time.
> > Big madvise can more or less improve the performance.
>
> OK; 10% of the total is still pretty big even for your 8GB VM.
>
> > > 4) Speeding up the migration of those free pages
> > > You're using the bitmap to avoid migrating those free pages; HPe's
> > > patchset is reconstructing a bitmap from the balloon data; OK, so
> > > this all makes sense to avoid migrating them - I'd also been thinking
> > > of using pagemap to spot zero pages that would help find other zero'd
> > > pages, but perhaps ballooned is enough?
> > >
> > Could you describe your ideal with more details?
>
> At the moment the migration code spends a fair amount of time checking if a
> page is zero; I was thinking perhaps the qemu could just open
> /proc/self/pagemap and check if the page was mapped; that would seem
> cheap if we're checking big ranges; and that would find all the balloon pages.
>
Even if virtio-balloon is not enabled, it can be used to find the pages that have never
been used by the guest.
> > > 5) Second-migrate
> > > Given a VM where you've done all those tricks on, what happens when
> > > you migrate it a second time? I guess you're aiming for the guest
> > > to update it's bitmap; HPe's solution is to migrate it's balloon
> > > bitmap along with the migration data.
> >
> > Nothing is special in the second migration, QEMU will request the
> > guest for free pages Information, and the guest will traverse it's
> > current free page list to construct a new free page bitmap and send it to
> QEMU. Just like in the first migration.
>
> Right.
>
> Dave
>
> > Liang
> > >
> > > Dave
> > >
> > > --
> > > Dr. David Alan Gilbert / [email protected] / Manchester, UK
> --
> Dr. David Alan Gilbert / [email protected] / Manchester, UK
On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> * Li, Liang Z ([email protected]) wrote:
> > >
> > > Hi,
> > > I'm just catching back up on this thread; so without reference to any
> > > particular previous mail in the thread.
> > >
> > > 1) How many of the free pages do we tell the host about?
> > > Your main change is telling the host about all the
> > > free pages.
> >
> > Yes, all the guest's free pages.
> >
> > > If we tell the host about all the free pages, then we might
> > > end up needing to allocate more pages and update the host
> > > with pages we now want to use; that would have to wait for the
> > > host to acknowledge that use of these pages, since if we don't
> > > wait for it then it might have skipped migrating a page we
> > > just started using (I don't understand how your series solves that).
> > > So the guest probably needs to keep some free pages - how many?
> >
> > Actually, there is no need to care about whether the free pages will be used by the host.
> > We only care about some of the free pages we get reused by the guest, right?
> >
> > The dirty page logging can be used to solve this, starting the dirty page logging before getting
> > the free pages informant from guest. Even some of the free pages are modified by the guest
> > during the process of getting the free pages information, these modified pages will be traced
> > by the dirty page logging mechanism. So in the following migration_bitmap_sync() function.
> > The pages in the free pages bitmap, but latter was modified, will be reset to dirty. We won't
> > omit any dirtied pages.
> >
> > So, guest doesn't need to keep any free pages.
>
> OK, yes, that works; so we do:
> * enable dirty logging
> * ask guest for free pages
> * initialise the migration bitmap as everything-free
> * then later we do the normal sync-dirty bitmap stuff and it all just works.
>
> That's nice and simple.
This works once, sure. But the issue is that you have
to defer migration until you get the free page list,
and this only works once. So you end up with heuristics
about how long to wait.
Instead I propose:
- mark all pages dirty as we do now.
- at start of migration, start tracking dirty
  pages in kvm, and tell guest to start tracking free pages.
  We can now introduce any kind of delay, for
  example wait for an ack from the guest, or do whatever else,
  or even just start migrating pages.
- repeatedly:
    - get list of free pages from guest
    - clear them in migration bitmap
    - get dirty list from kvm
- at end of migration, stop tracking writes in kvm,
  and tell guest to stop tracking free pages
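
The proposal above can be sketched as a loop. This is a toy model, not QEMU or KVM code: `run_rounds` and its arguments are invented for the example, each round's free list and dirty list are simply supplied as sets, and "sending" a page just means moving it from the pending set to a log.

```python
def run_rounds(num_pages, rounds, pages_per_round):
    """Toy model of the iterative scheme.

    rounds is a list of (free_pages, dirty_pages) pairs standing in for
    what the guest and KVM would report in each iteration.
    """
    pending = set(range(num_pages))  # mark all pages dirty, as today
    sent_log = []
    for free_pages, dirty_pages in rounds:
        pending -= set(free_pages)   # clear reported free pages
        pending |= set(dirty_pages)  # then merge the KVM dirty list
        batch = sorted(pending)[:pages_per_round]
        pending -= set(batch)        # "send" a limited batch
        sent_log.append(batch)
    return sent_log, pending


rounds = [
    ({4, 5, 6, 7}, set()),  # guest reports 4..7 free, nothing dirtied
    ({6, 7}, {5}),          # page 5 was reused and written by the guest
    (set(), set()),
]
sent_log, pending = run_rounds(8, rounds, pages_per_round=3)
print(sent_log)  # prints [[0, 1, 2], [3, 5], []]
```

Migration can start sending immediately in the first round; the free list only ever removes work, and applying the dirty list after the free list keeps a reused page from being dropped.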
> On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> > * Li, Liang Z ([email protected]) wrote:
> > > >
> > > > Hi,
> > > > I'm just catching back up on this thread; so without reference
> > > > to any particular previous mail in the thread.
> > > >
> > > > 1) How many of the free pages do we tell the host about?
> > > > Your main change is telling the host about all the
> > > > free pages.
> > >
> > > Yes, all the guest's free pages.
> > >
> > > > If we tell the host about all the free pages, then we might
> > > > end up needing to allocate more pages and update the host
> > > > with pages we now want to use; that would have to wait for the
> > > > host to acknowledge that use of these pages, since if we don't
> > > > wait for it then it might have skipped migrating a page we
> > > > just started using (I don't understand how your series solves that).
> > > > So the guest probably needs to keep some free pages - how many?
> > >
> > > Actually, there is no need to care about whether the free pages will be
> used by the host.
> > > We only care about some of the free pages we get reused by the guest,
> right?
> > >
> > > The dirty page logging can be used to solve this, starting the dirty
> > > page logging before getting the free pages informant from guest.
> > > Even some of the free pages are modified by the guest during the
> > > process of getting the free pages information, these modified pages will
> be traced by the dirty page logging mechanism. So in the following
> migration_bitmap_sync() function.
> > > The pages in the free pages bitmap, but latter was modified, will be
> > > reset to dirty. We won't omit any dirtied pages.
> > >
> > > So, guest doesn't need to keep any free pages.
> >
> > OK, yes, that works; so we do:
> > * enable dirty logging
> > * ask guest for free pages
> > * initialise the migration bitmap as everything-free
> > * then later we do the normal sync-dirty bitmap stuff and it all just works.
> >
> > That's nice and simple.
>
> This works once, sure. But there's an issue is that you have to defer migration
> until you get the free page list, and this only works once. So you end up with
> heuristics about how long to wait.
>
> Instead I propose:
>
> - mark all pages dirty as we do now.
>
> - at start of migration, start tracking dirty
> pages in kvm, and tell guest to start tracking free pages
>
> we can now introduce any kind of delay, for example wait for ack from guest,
> or do whatever else, or even just start migrating pages
>
> - repeatedly:
> - get list of free pages from guest
> - clear them in migration bitmap
> - get dirty list from kvm
>
> - at end of migration, stop tracking writes in kvm,
> and tell guest to stop tracking free pages
I had thought of filtering out the free pages in each migration bitmap synchronization.
The advantage is that we can skip processing as many free pages as possible, not just once.
The disadvantage is that we would have to change the current memory management code to track
the free pages, instead of traversing the free page list to construct the free page bitmap,
in order to reduce the overhead of getting the bitmap.
I am not sure whether the kernel people would like that.
If we keep the traversal mechanism, then because of its overhead it may not be worth
filtering out the free pages repeatedly.
Liang
* Li, Liang Z ([email protected]) wrote:
> > On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> > > * Li, Liang Z ([email protected]) wrote:
> > > > >
> > > > > Hi,
> > > > > I'm just catching back up on this thread; so without reference
> > > > > to any particular previous mail in the thread.
> > > > >
> > > > > 1) How many of the free pages do we tell the host about?
> > > > > Your main change is telling the host about all the
> > > > > free pages.
> > > >
> > > > Yes, all the guest's free pages.
> > > >
> > > > > If we tell the host about all the free pages, then we might
> > > > > end up needing to allocate more pages and update the host
> > > > > with pages we now want to use; that would have to wait for the
> > > > > host to acknowledge that use of these pages, since if we don't
> > > > > wait for it then it might have skipped migrating a page we
> > > > > just started using (I don't understand how your series solves that).
> > > > > So the guest probably needs to keep some free pages - how many?
> > > >
> > > > Actually, there is no need to care about whether the free pages will be
> > used by the host.
> > > > We only care about some of the free pages we get reused by the guest,
> > right?
> > > >
> > > > The dirty page logging can be used to solve this, starting the dirty
> > > > page logging before getting the free pages informant from guest.
> > > > Even some of the free pages are modified by the guest during the
> > > > process of getting the free pages information, these modified pages will
> > be traced by the dirty page logging mechanism. So in the following
> > migration_bitmap_sync() function.
> > > > The pages in the free pages bitmap, but latter was modified, will be
> > > > reset to dirty. We won't omit any dirtied pages.
> > > >
> > > > So, guest doesn't need to keep any free pages.
> > >
> > > OK, yes, that works; so we do:
> > > * enable dirty logging
> > > * ask guest for free pages
> > > * initialise the migration bitmap as everything-free
> > > * then later we do the normal sync-dirty bitmap stuff and it all just works.
> > >
> > > That's nice and simple.
> >
> > This works once, sure. But there's an issue is that you have to defer migration
> > until you get the free page list, and this only works once. So you end up with
> > heuristics about how long to wait.
> >
> > Instead I propose:
> >
> > - mark all pages dirty as we do now.
> >
> > - at start of migration, start tracking dirty
> > pages in kvm, and tell guest to start tracking free pages
> >
> > we can now introduce any kind of delay, for example wait for ack from guest,
> > or do whatever else, or even just start migrating pages
> >
> > - repeatedly:
> > - get list of free pages from guest
> > - clear them in migration bitmap
> > - get dirty list from kvm
> >
> > - at end of migration, stop tracking writes in kvm,
> > and tell guest to stop tracking free pages
>
> I had thought of filtering out the free pages in each migration bitmap synchronization.
> The advantage is we can skip process as many free pages as possible. Not just once.
> The disadvantage is that we should change the current memory management code to track the free pages,
> instead of traversing the free page list to construct the free pages bitmap, to reduce the overhead to get the free pages bitmap.
> I am not sure the if the Kernel people would like it.
>
> If keeping the traversing mechanism, because of the overhead, maybe it's not worth to filter out the free pages repeatedly.
Well, Michael's idea of not waiting for the dirty
bitmap to be filled does make that idea of constantly
using the free-bitmap better.
In that case, is it easier if something (guest/host?)
allocates some memory in the guest's physical RAM space
and just points the host to it, rather than having an
explicit 'send'?
Dave
--
Dr. David Alan Gilbert / [email protected] / Manchester, UK
> > > > > > I'm just catching back up on this thread; so without
> > > > > > reference to any particular previous mail in the thread.
> > > > > >
> > > > > > 1) How many of the free pages do we tell the host about?
> > > > > > Your main change is telling the host about all the
> > > > > > free pages.
> > > > >
> > > > > Yes, all the guest's free pages.
> > > > >
> > > > > > If we tell the host about all the free pages, then we might
> > > > > > end up needing to allocate more pages and update the host
> > > > > > with pages we now want to use; that would have to wait for the
> > > > > > host to acknowledge that use of these pages, since if we don't
> > > > > > wait for it then it might have skipped migrating a page we
> > > > > > just started using (I don't understand how your series solves that).
> > > > > > So the guest probably needs to keep some free pages - how
> many?
> > > > >
> > > > > Actually, there is no need to care about whether the free pages
> > > > > will be
> > > used by the host.
> > > > > We only care about some of the free pages we get reused by the
> > > > > guest,
> > > right?
> > > > >
> > > > > The dirty page logging can be used to solve this, starting the
> > > > > dirty page logging before getting the free pages informant from guest.
> > > > > Even some of the free pages are modified by the guest during the
> > > > > process of getting the free pages information, these modified
> > > > > pages will
> > > be traced by the dirty page logging mechanism. So in the following
> > > migration_bitmap_sync() function.
> > > > > The pages in the free pages bitmap, but latter was modified,
> > > > > will be reset to dirty. We won't omit any dirtied pages.
> > > > >
> > > > > So, guest doesn't need to keep any free pages.
> > > >
> > > > OK, yes, that works; so we do:
> > > > * enable dirty logging
> > > > * ask guest for free pages
> > > > * initialise the migration bitmap as everything-free
> > > > * then later we do the normal sync-dirty bitmap stuff and it all just
> works.
> > > >
> > > > That's nice and simple.
> > >
> > > This works once, sure. But there's an issue is that you have to
> > > defer migration until you get the free page list, and this only
> > > works once. So you end up with heuristics about how long to wait.
> > >
> > > Instead I propose:
> > >
> > > - mark all pages dirty as we do now.
> > >
> > > - at start of migration, start tracking dirty
> > > pages in kvm, and tell guest to start tracking free pages
> > >
> > > we can now introduce any kind of delay, for example wait for ack
> > > from guest, or do whatever else, or even just start migrating pages
> > >
> > > - repeatedly:
> > > - get list of free pages from guest
> > > - clear them in migration bitmap
> > > - get dirty list from kvm
> > >
> > > - at end of migration, stop tracking writes in kvm,
> > > and tell guest to stop tracking free pages
> >
> > I had thought of filtering out the free pages in each migration bitmap
> synchronization.
> > The advantage is we can skip process as many free pages as possible. Not
> just once.
> > The disadvantage is that we should change the current memory
> > management code to track the free pages, instead of traversing the free
> page list to construct the free pages bitmap, to reduce the overhead to get
> the free pages bitmap.
> > I am not sure the if the Kernel people would like it.
> >
> > If keeping the traversing mechanism, because of the overhead, maybe it's
> not worth to filter out the free pages repeatedly.
>
> Well, Michael's idea of not waiting for the dirty bitmap to be filled does make
> that idea of constnatly using the free-bitmap better.
>
Not waiting is a good idea.
Actually, we could shorten the waiting time by pre-allocating the free page bitmap
and updating it when the guest allocates or frees pages. It requires modifying the
mm-related code; I don't know whether the kernel people would like this.
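
To make the two options concrete, here is a small sketch comparing a bitmap that is updated on every allocation and free with one that is built on demand by walking a free list. It is a model only: `FreePageBitmap` and `build_by_traversal` are invented names, and the size calculation assumes 4 KiB pages for the 8GB guest used in the tests.

```python
class FreePageBitmap:
    """One bit per page frame; a set bit means the page is free."""

    def __init__(self, num_pages):
        self.bits = bytearray((num_pages + 7) // 8)

    def mark_free(self, pfn):
        self.bits[pfn >> 3] |= 1 << (pfn & 7)

    def mark_used(self, pfn):
        self.bits[pfn >> 3] &= ~(1 << (pfn & 7)) & 0xFF

    def is_free(self, pfn):
        return bool(self.bits[pfn >> 3] & (1 << (pfn & 7)))


def build_by_traversal(num_pages, free_list):
    """On-demand variant: cost grows with the length of the free list."""
    bitmap = FreePageBitmap(num_pages)
    for pfn in free_list:
        bitmap.mark_free(pfn)
    return bitmap


NUM_PAGES = 16

# Pre-allocated variant: updated as pages are allocated and freed.
tracked = FreePageBitmap(NUM_PAGES)
for pfn in range(NUM_PAGES):
    tracked.mark_free(pfn)       # everything starts out free
for pfn in (0, 1, 2, 9):
    tracked.mark_used(pfn)       # allocations
tracked.mark_free(9)             # page 9 is freed again

# Traversal variant: walk the current free list when the host asks.
free_list = [pfn for pfn in range(NUM_PAGES) if pfn not in (0, 1, 2)]
walked = build_by_traversal(NUM_PAGES, free_list)

same = tracked.bits == walked.bits

# Bitmap size for an 8GB guest, assuming 4 KiB pages.
bitmap_bytes_8g = len(FreePageBitmap((8 << 30) // 4096).bits)
```

Both variants produce the same bitmap; the difference is where the cost lands. The tracked bitmap is ready as soon as the host asks, at the price of a hook in every allocation and free path, which is the mm change in question. Under the 4 KiB assumption the bitmap for an 8GB guest is 256 KiB.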
> In that case, is it easier if something (guest/host?) allocates some memory in
> the guests physical RAM space and just points the host to it, rather than
> having an explicit 'send'.
>
Good idea too.
Liang
> Dave