2019-06-19 22:33:42

by Alexander Duyck

Subject: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

This series provides an asynchronous means of hinting to a hypervisor
that a guest page is no longer in use and can have the data associated
with it dropped. To do this I have implemented functionality that allows
for what I am referring to as waste page treatment.

I have based many of the terms and functionality on waste water
treatment. The similarity occurred to me once I had started referring to
the hints as "bubbles", since the hints use the same approach as the
balloon functionality but disappear if they are touched; as a result I
began to think of the virtio device as an aerator. The general idea is
that the guest should be treating the unused pages so that when they head
"downstream" to either another guest, or back to the host, they will not
need to be written to swap.

When the number of "dirty" pages in a given free_area exceeds our high
water mark, which is currently 32, we schedule the aeration task to start
scrubbing the zone. While the scrubbing is taking place a boundary is
defined that we use to separate the "aerated" pages from the "dirty"
ones. We use the ZONE_AERATION_ACTIVE bit to flag when these boundaries
are in place.
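As a userspace sketch of that trigger (the names AERATOR_HWM, nr_free,
and nr_free_aerated mirror the patch, but this helper is purely
illustrative, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the high water mark trigger; not kernel code. */
#define AERATOR_HWM 32

struct free_area_stats {
	unsigned long nr_free;		/* all free pages in the free_area */
	unsigned long nr_free_aerated;	/* pages already hinted ("aerated") */
};

/* "dirty" pages are those not yet aerated; once they reach the high
 * water mark the aeration task should be scheduled for the zone. */
static bool aeration_needed(const struct free_area_stats *area)
{
	return (area->nr_free - area->nr_free_aerated) >= AERATOR_HWM;
}

static bool demo_aeration_needed(unsigned long nr_free, unsigned long aerated)
{
	struct free_area_stats area = { nr_free, aerated };

	return aeration_needed(&area);
}
```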

I am leaving a number of things hard-coded, such as limiting the lowest
order processed to pageblock_order, and have left it up to the guest to
determine the batch size it wants to allocate for processing the hints.

My primary testing has been to verify that memory is freed after
allocation by running memhog 32g in the guest and watching the total free
memory via /proc/meminfo on the host. With this I have verified that most
of the memory is freed after each iteration. As far as performance goes,
I have mainly focused on the will-it-scale/page_fault1 test running with
16 vcpus. With that I have seen less than a 1% difference between the
base kernel without these patches, the patches applied with
virtio-balloon disabled, and the patches applied with virtio-balloon
enabled with hinting.

Changes from the RFC:
Moved aeration requested flag out of aerator and into zone->flags.
Moved boundary out of free_area and into local variables for aeration.
Moved aeration cycle out of interrupt and into workqueue.
Left nr_free as total pages instead of splitting it between raw and aerated.
Combined size and physical address values in virtio ring into one 64b value.
Restructured the patch set to reduce patches from 11 to 6.

---

Alexander Duyck (6):
mm: Adjust shuffle code to allow for future coalescing
mm: Move set/get_pcppage_migratetype to mmzone.h
mm: Use zone and order instead of free area in free_list manipulators
mm: Introduce "aerated" pages
mm: Add logic for separating "aerated" pages from "raw" pages
virtio-balloon: Add support for aerating memory via hinting


drivers/virtio/Kconfig | 1
drivers/virtio/virtio_balloon.c | 110 ++++++++++++++
include/linux/memory_aeration.h | 118 +++++++++++++++
include/linux/mmzone.h | 113 +++++++++------
include/linux/page-flags.h | 8 +
include/uapi/linux/virtio_balloon.h | 1
mm/Kconfig | 5 +
mm/Makefile | 1
mm/aeration.c | 270 +++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 203 ++++++++++++++++++--------
mm/shuffle.c | 24 ---
mm/shuffle.h | 35 +++++
12 files changed, 753 insertions(+), 136 deletions(-)
create mode 100644 include/linux/memory_aeration.h
create mode 100644 mm/aeration.c

--


2019-06-19 22:33:44

by Alexander Duyck

Subject: [PATCH v1 1/6] mm: Adjust shuffle code to allow for future coalescing

From: Alexander Duyck <[email protected]>

This patch is meant to move the head/tail adding logic out of the shuffle
code and into the __free_one_page function, since ultimately that is where
it is really needed anyway. By doing this we should be able to reduce the
overhead and consolidate all of the list addition bits in one spot.
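For reference, the buddy pfn arithmetic that __free_one_page relies on
can be modeled in userspace: the kernel's __find_buddy_pfn() simply flips
bit 'order' of the pfn, and the combined (merged) pfn is the bitwise AND
of a pfn with its buddy. The helpers below are illustrative only:

```c
#include <assert.h>

/* Userspace model of the buddy pfn arithmetic used in __free_one_page. */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	/* the buddy of a block at 'order' differs only in bit 'order' */
	return pfn ^ (1UL << order);
}

static unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
	/* the lower of the two pfns: the start of the merged block */
	return pfn & find_buddy_pfn(pfn, order);
}
```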

Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/mmzone.h | 12 --------
mm/page_alloc.c | 70 +++++++++++++++++++++++++++---------------------
mm/shuffle.c | 24 ----------------
mm/shuffle.h | 35 ++++++++++++++++++++++++
4 files changed, 74 insertions(+), 67 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 427b79c39b3c..4c07af2cfc2f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,18 +116,6 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
area->nr_free++;
}

-#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
-/* Used to preserve page allocation order entropy */
-void add_to_free_area_random(struct page *page, struct free_area *area,
- int migratetype);
-#else
-static inline void add_to_free_area_random(struct page *page,
- struct free_area *area, int migratetype)
-{
- add_to_free_area(page, area, migratetype);
-}
-#endif
-
/* Used for pages which are on another list */
static inline void move_to_free_area(struct page *page, struct free_area *area,
int migratetype)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f4651a09948c..ec344ce46587 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -830,6 +830,36 @@ static inline struct capture_control *task_capc(struct zone *zone)
#endif /* CONFIG_COMPACTION */

/*
+ * If this is not the largest possible page, check if the buddy
+ * of the next-highest order is free. If it is, it's possible
+ * that pages are being freed that will coalesce soon. In case,
+ * that is happening, add the free page to the tail of the list
+ * so it's less likely to be used soon and more likely to be merged
+ * as a higher order page
+ */
+static inline bool
+buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
+ struct page *page, unsigned int order)
+{
+ struct page *higher_page, *higher_buddy;
+ unsigned long combined_pfn;
+
+ if (is_shuffle_order(order) || order >= (MAX_ORDER - 2))
+ return false;
+
+ if (!pfn_valid_within(buddy_pfn))
+ return false;
+
+ combined_pfn = buddy_pfn & pfn;
+ higher_page = page + (combined_pfn - pfn);
+ buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
+ higher_buddy = higher_page + (buddy_pfn - combined_pfn);
+
+ return pfn_valid_within(buddy_pfn) &&
+ page_is_buddy(higher_page, higher_buddy, order + 1);
+}
+
+/*
* Freeing function for a buddy system allocator.
*
* The concept of a buddy system is to maintain direct-mapped table
@@ -858,11 +888,12 @@ static inline void __free_one_page(struct page *page,
struct zone *zone, unsigned int order,
int migratetype)
{
- unsigned long combined_pfn;
+ struct capture_control *capc = task_capc(zone);
unsigned long uninitialized_var(buddy_pfn);
- struct page *buddy;
+ unsigned long combined_pfn;
+ struct free_area *area;
unsigned int max_order;
- struct capture_control *capc = task_capc(zone);
+ struct page *buddy;

max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);

@@ -931,35 +962,12 @@ static inline void __free_one_page(struct page *page,
done_merging:
set_page_order(page, order);

- /*
- * If this is not the largest possible page, check if the buddy
- * of the next-highest order is free. If it is, it's possible
- * that pages are being freed that will coalesce soon. In case,
- * that is happening, add the free page to the tail of the list
- * so it's less likely to be used soon and more likely to be merged
- * as a higher order page
- */
- if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
- && !is_shuffle_order(order)) {
- struct page *higher_page, *higher_buddy;
- combined_pfn = buddy_pfn & pfn;
- higher_page = page + (combined_pfn - pfn);
- buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
- higher_buddy = higher_page + (buddy_pfn - combined_pfn);
- if (pfn_valid_within(buddy_pfn) &&
- page_is_buddy(higher_page, higher_buddy, order + 1)) {
- add_to_free_area_tail(page, &zone->free_area[order],
- migratetype);
- return;
- }
- }
-
- if (is_shuffle_order(order))
- add_to_free_area_random(page, &zone->free_area[order],
- migratetype);
+ area = &zone->free_area[order];
+ if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
+ is_shuffle_tail_page(order))
+ add_to_free_area_tail(page, area, migratetype);
else
- add_to_free_area(page, &zone->free_area[order], migratetype);
-
+ add_to_free_area(page, area, migratetype);
}

/*
diff --git a/mm/shuffle.c b/mm/shuffle.c
index 3ce12481b1dc..55d592e62526 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -4,7 +4,6 @@
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/mmzone.h>
-#include <linux/random.h>
#include <linux/moduleparam.h>
#include "internal.h"
#include "shuffle.h"
@@ -182,26 +181,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
shuffle_zone(z);
}
-
-void add_to_free_area_random(struct page *page, struct free_area *area,
- int migratetype)
-{
- static u64 rand;
- static u8 rand_bits;
-
- /*
- * The lack of locking is deliberate. If 2 threads race to
- * update the rand state it just adds to the entropy.
- */
- if (rand_bits == 0) {
- rand_bits = 64;
- rand = get_random_u64();
- }
-
- if (rand & 1)
- add_to_free_area(page, area, migratetype);
- else
- add_to_free_area_tail(page, area, migratetype);
- rand_bits--;
- rand >>= 1;
-}
diff --git a/mm/shuffle.h b/mm/shuffle.h
index 777a257a0d2f..3f4edb60a453 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -3,6 +3,7 @@
#ifndef _MM_SHUFFLE_H
#define _MM_SHUFFLE_H
#include <linux/jump_label.h>
+#include <linux/random.h>

/*
* SHUFFLE_ENABLE is called from the command line enabling path, or by
@@ -43,6 +44,35 @@ static inline bool is_shuffle_order(int order)
return false;
return order >= SHUFFLE_ORDER;
}
+
+static inline bool is_shuffle_tail_page(int order)
+{
+ static u64 rand;
+ static u8 rand_bits;
+ u64 rand_old;
+
+ if (!is_shuffle_order(order))
+ return false;
+
+ /*
+ * The lack of locking is deliberate. If 2 threads race to
+ * update the rand state it just adds to the entropy.
+ */
+ if (rand_bits-- == 0) {
+ rand_bits = 64;
+ rand = get_random_u64();
+ }
+
+ /*
+ * Test highest order bit while shifting our random value. This
+ * should result in us testing for the carry flag following the
+ * shift.
+ */
+ rand_old = rand;
+ rand <<= 1;
+
+ return rand < rand_old;
+}
#else
static inline void shuffle_free_memory(pg_data_t *pgdat)
{
@@ -60,5 +90,10 @@ static inline bool is_shuffle_order(int order)
{
return false;
}
+
+static inline bool is_shuffle_tail_page(int order)
+{
+ return false;
+}
#endif
#endif /* _MM_SHUFFLE_H */

2019-06-19 22:33:46

by Alexander Duyck

Subject: [PATCH v1 2/6] mm: Move set/get_pcppage_migratetype to mmzone.h

From: Alexander Duyck <[email protected]>

In order to support page aeration it will be necessary to store and
retrieve the migratetype of a page. To enable that I am moving the set and
get operations for pcppage_migratetype into the mmzone header so that they
can be used when adding or removing pages from the free lists.
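A trimmed-down userspace model of the accessors being moved, with struct
page reduced to just the index field used for the cache (illustrative
only, not the real struct page):

```c
#include <assert.h>

/* Trimmed model: struct page reduced to the index field that caches
 * the pageblock migratetype while the page sits on a pcplist. */
struct page { unsigned long index; };

static int get_pcppage_migratetype(struct page *page)
{
	return (int)page->index;
}

static void set_pcppage_migratetype(struct page *page, int migratetype)
{
	page->index = (unsigned long)migratetype;
}

static int demo_migratetype(void)
{
	struct page page = { 0 };

	set_pcppage_migratetype(&page, 3);
	return get_pcppage_migratetype(&page);
}
```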

Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/mmzone.h | 18 ++++++++++++++++++
mm/page_alloc.c | 18 ------------------
2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4c07af2cfc2f..6f8fd5c1a286 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -95,6 +95,24 @@ static inline bool is_migrate_movable(int mt)
get_pfnblock_flags_mask(page, page_to_pfn(page), \
PB_migrate_end, MIGRATETYPE_MASK)

+/*
+ * A cached value of the page's pageblock's migratetype, used when the page is
+ * put on a pcplist. Used to avoid the pageblock migratetype lookup when
+ * freeing from pcplists in most cases, at the cost of possibly becoming stale.
+ * Also the migratetype set in the page does not necessarily match the pcplist
+ * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
+ * other index - this ensures that it will be put on the correct CMA freelist.
+ */
+static inline int get_pcppage_migratetype(struct page *page)
+{
+ return page->index;
+}
+
+static inline void set_pcppage_migratetype(struct page *page, int migratetype)
+{
+ page->index = migratetype;
+}
+
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec344ce46587..3e21e01f6165 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -136,24 +136,6 @@ struct pcpu_drain {
int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;

-/*
- * A cached value of the page's pageblock's migratetype, used when the page is
- * put on a pcplist. Used to avoid the pageblock migratetype lookup when
- * freeing from pcplists in most cases, at the cost of possibly becoming stale.
- * Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
- */
-static inline int get_pcppage_migratetype(struct page *page)
-{
- return page->index;
-}
-
-static inline void set_pcppage_migratetype(struct page *page, int migratetype)
-{
- page->index = migratetype;
-}
-
#ifdef CONFIG_PM_SLEEP
/*
* The following functions are used by the suspend/hibernate code to temporarily

2019-06-19 22:34:04

by Alexander Duyck

Subject: [PATCH v1 3/6] mm: Use zone and order instead of free area in free_list manipulators

From: Alexander Duyck <[email protected]>

In order to enable the use of the zone from the list manipulator functions
I will need access to the zone pointer. As it turns out, most of the
accessors were always just being passed &zone->free_area[order] directly
anyway, so it makes sense to fold that lookup into the function itself and
pass the zone and order as arguments instead of the free area.

In addition, in order to be able to reference the zone we need to move the
definitions of the functions down so that the zone is defined before we
define the list manipulation functions.
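As an illustrative model of the refactor, with the structures trimmed to
bare counters: the helpers take (zone, order) and perform the
&zone->free_area[order] lookup internally instead of every caller doing
it.

```c
#include <assert.h>

#define MAX_ORDER 11

/* Trimmed structures: only the nr_free counter is modeled here. */
struct free_area { unsigned long nr_free; };
struct zone { struct free_area free_area[MAX_ORDER]; };

static void add_to_free_area(struct zone *zone, unsigned int order)
{
	zone->free_area[order].nr_free++;	/* lookup lives here now */
}

static void del_page_from_free_area(struct zone *zone, unsigned int order)
{
	zone->free_area[order].nr_free--;
}

static unsigned long demo_zone_helpers(void)
{
	struct zone zone = { 0 };

	add_to_free_area(&zone, 4);
	add_to_free_area(&zone, 4);
	del_page_from_free_area(&zone, 4);
	return zone.free_area[4].nr_free;
}
```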

Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/mmzone.h | 72 +++++++++++++++++++++++++++---------------------
mm/page_alloc.c | 30 +++++++-------------
2 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f8fd5c1a286..c3597920a155 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -118,29 +118,6 @@ struct free_area {
unsigned long nr_free;
};

-/* Used for pages not on another list */
-static inline void add_to_free_area(struct page *page, struct free_area *area,
- int migratetype)
-{
- list_add(&page->lru, &area->free_list[migratetype]);
- area->nr_free++;
-}
-
-/* Used for pages not on another list */
-static inline void add_to_free_area_tail(struct page *page, struct free_area *area,
- int migratetype)
-{
- list_add_tail(&page->lru, &area->free_list[migratetype]);
- area->nr_free++;
-}
-
-/* Used for pages which are on another list */
-static inline void move_to_free_area(struct page *page, struct free_area *area,
- int migratetype)
-{
- list_move(&page->lru, &area->free_list[migratetype]);
-}
-
static inline struct page *get_page_from_free_area(struct free_area *area,
int migratetype)
{
@@ -148,15 +125,6 @@ static inline struct page *get_page_from_free_area(struct free_area *area,
struct page, lru);
}

-static inline void del_page_from_free_area(struct page *page,
- struct free_area *area)
-{
- list_del(&page->lru);
- __ClearPageBuddy(page);
- set_page_private(page, 0);
- area->nr_free--;
-}
-
static inline bool free_area_empty(struct free_area *area, int migratetype)
{
return list_empty(&area->free_list[migratetype]);
@@ -805,6 +773,46 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
}

+/* Used for pages not on another list */
+static inline void add_to_free_area(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype)
+{
+ struct free_area *area = &zone->free_area[order];
+
+ list_add(&page->lru, &area->free_list[migratetype]);
+ area->nr_free++;
+}
+
+/* Used for pages not on another list */
+static inline void add_to_free_area_tail(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype)
+{
+ struct free_area *area = &zone->free_area[order];
+
+ list_add_tail(&page->lru, &area->free_list[migratetype]);
+ area->nr_free++;
+}
+
+/* Used for pages which are on another list */
+static inline void move_to_free_area(struct page *page, struct zone *zone,
+ unsigned int order, int migratetype)
+{
+ struct free_area *area = &zone->free_area[order];
+
+ list_move(&page->lru, &area->free_list[migratetype]);
+}
+
+static inline void del_page_from_free_area(struct page *page, struct zone *zone,
+ unsigned int order)
+{
+ struct free_area *area = &zone->free_area[order];
+
+ list_del(&page->lru);
+ __ClearPageBuddy(page);
+ set_page_private(page, 0);
+ area->nr_free--;
+}
+
#include <linux/memory_hotplug.h>

void build_all_zonelists(pg_data_t *pgdat);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e21e01f6165..aad2b2529ab7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -873,7 +873,6 @@ static inline void __free_one_page(struct page *page,
struct capture_control *capc = task_capc(zone);
unsigned long uninitialized_var(buddy_pfn);
unsigned long combined_pfn;
- struct free_area *area;
unsigned int max_order;
struct page *buddy;

@@ -910,7 +909,7 @@ static inline void __free_one_page(struct page *page,
if (page_is_guard(buddy))
clear_page_guard(zone, buddy, order, migratetype);
else
- del_page_from_free_area(buddy, &zone->free_area[order]);
+ del_page_from_free_area(buddy, zone, order);
combined_pfn = buddy_pfn & pfn;
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
@@ -944,12 +943,11 @@ static inline void __free_one_page(struct page *page,
done_merging:
set_page_order(page, order);

- area = &zone->free_area[order];
if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
is_shuffle_tail_page(order))
- add_to_free_area_tail(page, area, migratetype);
+ add_to_free_area_tail(page, zone, order, migratetype);
else
- add_to_free_area(page, area, migratetype);
+ add_to_free_area(page, zone, order, migratetype);
}

/*
@@ -1941,13 +1939,11 @@ void __init init_cma_reserved_pageblock(struct page *page)
* -- nyc
*/
static inline void expand(struct zone *zone, struct page *page,
- int low, int high, struct free_area *area,
- int migratetype)
+ int low, int high, int migratetype)
{
unsigned long size = 1 << high;

while (high > low) {
- area--;
high--;
size >>= 1;
VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
@@ -1961,7 +1957,7 @@ static inline void expand(struct zone *zone, struct page *page,
if (set_page_guard(zone, &page[size], high, migratetype))
continue;

- add_to_free_area(&page[size], area, migratetype);
+ add_to_free_area(&page[size], zone, high, migratetype);
set_page_order(&page[size], high);
}
}
@@ -2122,8 +2118,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
page = get_page_from_free_area(area, migratetype);
if (!page)
continue;
- del_page_from_free_area(page, area);
- expand(zone, page, order, current_order, area, migratetype);
+ del_page_from_free_area(page, zone, current_order);
+ expand(zone, page, order, current_order, migratetype);
set_pcppage_migratetype(page, migratetype);
return page;
}
@@ -2131,7 +2127,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
return NULL;
}

-
/*
* This array describes the order lists are fallen back to when
* the free lists for the desirable migrate type are depleted
@@ -2208,7 +2203,7 @@ static int move_freepages(struct zone *zone,
}

order = page_order(page);
- move_to_free_area(page, &zone->free_area[order], migratetype);
+ move_to_free_area(page, zone, order, migratetype);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -2324,7 +2319,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
unsigned int alloc_flags, int start_type, bool whole_block)
{
unsigned int current_order = page_order(page);
- struct free_area *area;
int free_pages, movable_pages, alike_pages;
int old_block_type;

@@ -2395,8 +2389,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
return;

single_page:
- area = &zone->free_area[current_order];
- move_to_free_area(page, area, start_type);
+ move_to_free_area(page, zone, current_order, start_type);
}

/*
@@ -3067,7 +3060,6 @@ void split_page(struct page *page, unsigned int order)

int __isolate_free_page(struct page *page, unsigned int order)
{
- struct free_area *area = &page_zone(page)->free_area[order];
unsigned long watermark;
struct zone *zone;
int mt;
@@ -3093,7 +3085,7 @@ int __isolate_free_page(struct page *page, unsigned int order)

/* Remove page from free list */

- del_page_from_free_area(page, area);
+ del_page_from_free_area(page, zone, order);

/*
* Set the pageblock if the isolated page is at least half of a
@@ -8513,7 +8505,7 @@ void zone_pcp_reset(struct zone *zone)
pr_info("remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
#endif
- del_page_from_free_area(page, &zone->free_area[order]);
+ del_page_from_free_area(page, zone, order);
for (i = 0; i < (1 << order); i++)
SetPageReserved((page+i));
pfn += (1 << order);

2019-06-19 22:34:35

by Alexander Duyck

Subject: [PATCH v1 5/6] mm: Add logic for separating "aerated" pages from "raw" pages

From: Alexander Duyck <[email protected]>

Add a set of pointers we shall call "boundary" that represent the upper
boundary between the "raw" and "aerated" pages. The general idea is that in
order for a page to cross from one side of the boundary to the other it
will need to go through the aeration treatment.

By doing this we should be able to make certain that the aerated pages are
kept as one contiguous block at the end of each free list. This will allow
us to efficiently walk the free lists whenever we need to start processing
hints to the hypervisor that the pages are no longer in use.

An added advantage of this approach is that it should reduce the overall
memory footprint of the guest, as the guest will be more likely to recycle
warm pages than the aerated pages, which are likely to be cache cold.

Since we will only be aerating one zone at a time, we keep the boundary
limited to the zone we are currently placing aerated pages into. This
keeps the number of additional pointers needed quite small.
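As a userspace sketch of the boundary mechanics (a minimal list_head
stand-in, illustrative only): raw frees land at the head of the list,
while aerated pages are tail-added at the boundary, which then moves to
the new page, so the aerated block stays contiguous at the end.

```c
#include <assert.h>

/* Minimal stand-in for the kernel's list_head; illustrative only. */
struct lnode { struct lnode *prev, *next; };

static void list_init(struct lnode *head)
{
	head->prev = head->next = head;
}

/* insert n just before pos (like the kernel's list_add_tail(n, pos)) */
static void list_add_before(struct lnode *n, struct lnode *pos)
{
	n->prev = pos->prev;
	n->next = pos;
	pos->prev->next = n;
	pos->prev = n;
}

/* insert n right after the head (like the kernel's list_add) */
static void list_add_head(struct lnode *n, struct lnode *head)
{
	list_add_before(n, head->next);
}

static int demo_boundary(void)
{
	struct lnode head, raw1, raw2, aer1, aer2;
	struct lnode *boundary;

	list_init(&head);
	boundary = &head;		/* no aerated pages yet */

	list_add_head(&raw1, &head);	/* raw frees go to the head */

	/* aerated pages are tail-added at the boundary, which then moves
	 * to the new page so the aerated block grows while staying
	 * contiguous at the end of the list */
	list_add_before(&aer1, boundary);
	boundary = &aer1;
	list_add_before(&aer2, boundary);
	boundary = &aer2;

	list_add_head(&raw2, &head);	/* later raw frees stay above it */

	/* expected order from the head: raw2, raw1, aer2, aer1 */
	return head.next == &raw2 && raw2.next == &raw1 &&
	       raw1.next == &aer2 && aer2.next == &aer1 &&
	       aer1.next == &head;
}
```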

Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/memory_aeration.h | 57 ++++++++
include/linux/mmzone.h | 8 +
include/linux/page-flags.h | 8 +
mm/Makefile | 1
mm/aeration.c | 270 +++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 4 -
6 files changed, 347 insertions(+), 1 deletion(-)
create mode 100644 mm/aeration.c

diff --git a/include/linux/memory_aeration.h b/include/linux/memory_aeration.h
index 44cfbc259778..2f45196218b1 100644
--- a/include/linux/memory_aeration.h
+++ b/include/linux/memory_aeration.h
@@ -3,19 +3,50 @@
#define _LINUX_MEMORY_AERATION_H

#include <linux/mmzone.h>
+#include <linux/jump_label.h>
#include <linux/pageblock-flags.h>
+#include <asm/pgtable_types.h>

+#define AERATOR_MIN_ORDER pageblock_order
+#define AERATOR_HWM 32
+
+struct aerator_dev_info {
+ void (*react)(struct aerator_dev_info *a_dev_info);
+ struct list_head batch;
+ unsigned long capacity;
+ atomic_t refcnt;
+};
+
+extern struct static_key aerator_notify_enabled;
+
+void __aerator_notify(struct zone *zone);
struct page *get_aeration_page(struct zone *zone, unsigned int order,
int migratetype);
void put_aeration_page(struct zone *zone, struct page *page);

+void __aerator_del_from_boundary(struct page *page, struct zone *zone);
+void aerator_add_to_boundary(struct page *page, struct zone *zone);
+
+struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
static inline struct list_head *aerator_get_tail(struct zone *zone,
unsigned int order,
int migratetype)
{
+#ifdef CONFIG_AERATION
+ if (order >= AERATOR_MIN_ORDER &&
+ test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
+ return __aerator_get_tail(order, migratetype);
+#endif
return &zone->free_area[order].free_list[migratetype];
}

+static inline void aerator_del_from_boundary(struct page *page,
+ struct zone *zone)
+{
+ if (PageAerated(page) && test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
+ __aerator_del_from_boundary(page, zone);
+}
+
static inline void set_page_aerated(struct page *page,
struct zone *zone,
unsigned int order,
@@ -28,6 +59,9 @@ static inline void set_page_aerated(struct page *page,
/* record migratetype and flag page as aerated */
set_pcppage_migratetype(page, migratetype);
__SetPageAerated(page);
+
+ /* update boundary of new migratetype and record it */
+ aerator_add_to_boundary(page, zone);
#endif
}

@@ -39,11 +73,19 @@ static inline void clear_page_aerated(struct page *page,
if (likely(!PageAerated(page)))
return;

+ /* push boundary back if we removed the upper boundary */
+ aerator_del_from_boundary(page, zone);
+
__ClearPageAerated(page);
area->nr_free_aerated--;
#endif
}

+static inline unsigned long aerator_raw_pages(struct free_area *area)
+{
+ return area->nr_free - area->nr_free_aerated;
+}
+
/**
* aerator_notify_free - Free page notification that will start page processing
* @zone: Pointer to current zone of last page processed
@@ -57,5 +99,20 @@ static inline void clear_page_aerated(struct page *page,
*/
static inline void aerator_notify_free(struct zone *zone, int order)
{
+#ifdef CONFIG_AERATION
+ if (!static_key_false(&aerator_notify_enabled))
+ return;
+ if (order < AERATOR_MIN_ORDER)
+ return;
+ if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
+ return;
+ if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
+ return;
+
+ __aerator_notify(zone);
+#endif
}
+
+void aerator_shutdown(void);
+int aerator_startup(struct aerator_dev_info *sdev);
#endif /*_LINUX_MEMORY_AERATION_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7d89722ae9eb..52190a791e63 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -554,6 +554,14 @@ enum zone_flags {
ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
* Cleared when kswapd is woken.
*/
+ ZONE_AERATION_REQUESTED, /* zone enabled aeration and is
+ * requesting scrubbing the data out of
+ * higher order pages.
+ */
+ ZONE_AERATION_ACTIVE, /* zone enabled aeration and is
+ * actively cleaning the data out of
+ * higher order pages.
+ */
};

static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b848517da64c..f16e73318d49 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -745,6 +745,14 @@ static inline int page_has_type(struct page *page)
PAGE_TYPE_OPS(Offline, offline)

/*
+ * PageAerated() is an alias for Offline, however it is not meant to be an
+ * exclusive value. It should be combined with PageBuddy() when seen as it
+ * is meant to indicate that the page has been scrubbed while waiting in
+ * the buddy system.
+ */
+PAGE_TYPE_OPS(Aerated, offline)
+
+/*
* If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
* pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
*/
diff --git a/mm/Makefile b/mm/Makefile
index ac5e5ba78874..26c2fcd2b89d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -104,3 +104,4 @@ obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_HMM) += hmm.o
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_AERATION) += aeration.o
diff --git a/mm/aeration.c b/mm/aeration.c
new file mode 100644
index 000000000000..720dc51cb215
--- /dev/null
+++ b/mm/aeration.c
@@ -0,0 +1,270 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/page-isolation.h>
+#include <linux/gfp.h>
+#include <linux/export.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+
+static struct aerator_dev_info *a_dev_info;
+struct static_key aerator_notify_enabled;
+
+struct list_head *boundary[MAX_ORDER - AERATOR_MIN_ORDER][MIGRATE_TYPES];
+
+static void aerator_reset_boundary(struct zone *zone, unsigned int order,
+ unsigned int migratetype)
+{
+ boundary[order - AERATOR_MIN_ORDER][migratetype] =
+ &zone->free_area[order].free_list[migratetype];
+}
+
+#define for_each_aerate_migratetype_order(_order, _type) \
+ for (_order = MAX_ORDER; _order-- != AERATOR_MIN_ORDER;) \
+ for (_type = MIGRATE_TYPES; _type--;)
+
+static void aerator_populate_boundaries(struct zone *zone)
+{
+ unsigned int order, mt;
+
+ if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
+ return;
+
+ for_each_aerate_migratetype_order(order, mt)
+ aerator_reset_boundary(zone, order, mt);
+
+ set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
+}
+
+struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
+{
+ return boundary[order - AERATOR_MIN_ORDER][migratetype];
+}
+
+void __aerator_del_from_boundary(struct page *page, struct zone *zone)
+{
+ unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
+ int mt = get_pcppage_migratetype(page);
+ struct list_head **tail = &boundary[order][mt];
+
+ if (*tail == &page->lru)
+ *tail = page->lru.next;
+}
+
+void aerator_add_to_boundary(struct page *page, struct zone *zone)
+{
+ unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
+ int mt = get_pcppage_migratetype(page);
+ struct list_head **tail = &boundary[order][mt];
+
+ *tail = &page->lru;
+}
+
+void aerator_shutdown(void)
+{
+ static_key_slow_dec(&aerator_notify_enabled);
+
+ while (atomic_read(&a_dev_info->refcnt))
+ msleep(20);
+
+ WARN_ON(!list_empty(&a_dev_info->batch));
+
+ a_dev_info = NULL;
+}
+EXPORT_SYMBOL_GPL(aerator_shutdown);
+
+static void aerator_schedule_initial_aeration(void)
+{
+ struct zone *zone;
+
+ for_each_populated_zone(zone) {
+ spin_lock(&zone->lock);
+ __aerator_notify(zone);
+ spin_unlock(&zone->lock);
+ }
+}
+
+int aerator_startup(struct aerator_dev_info *sdev)
+{
+ if (a_dev_info)
+ return -EBUSY;
+
+ INIT_LIST_HEAD(&sdev->batch);
+ atomic_set(&sdev->refcnt, 0);
+
+ a_dev_info = sdev;
+ aerator_schedule_initial_aeration();
+
+ static_key_slow_inc(&aerator_notify_enabled);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(aerator_startup);
+
+static void aerator_fill(struct zone *zone)
+{
+ struct list_head *batch = &a_dev_info->batch;
+ int budget = a_dev_info->capacity;
+ unsigned int order, mt;
+
+ for_each_aerate_migratetype_order(order, mt) {
+ struct page *page;
+
+ /*
+ * Pull pages from free list until we have drained
+ * it or we have filled the batch reactor.
+ */
+ while ((page = get_aeration_page(zone, order, mt))) {
+ list_add_tail(&page->lru, batch);
+
+ if (!--budget)
+ return;
+ }
+ }
+
+ /*
+ * If there are no longer enough free pages to fully populate
+ * the aerator, then we can just shut it down for this zone.
+ */
+ clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
+ atomic_dec(&a_dev_info->refcnt);
+}
+
+static void aerator_drain(struct zone *zone)
+{
+ struct list_head *list = &a_dev_info->batch;
+ struct page *page;
+
+ /*
+ * Drain the now aerated pages back into their respective
+ * free lists/areas.
+ */
+ while ((page = list_first_entry_or_null(list, struct page, lru))) {
+ list_del(&page->lru);
+ put_aeration_page(zone, page);
+ }
+}
+
+static void aerator_scrub_zone(struct zone *zone)
+{
+ /* See if there are any pages to pull */
+ if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
+ return;
+
+ spin_lock(&zone->lock);
+
+ do {
+ aerator_fill(zone);
+
+ if (list_empty(&a_dev_info->batch))
+ break;
+
+ spin_unlock(&zone->lock);
+
+ /*
+ * Start aerating the pages in the batch, and then
+ * once that is completed we can drain the reactor
+ * and refill the reactor, restarting the cycle.
+ */
+ a_dev_info->react(a_dev_info);
+
+ spin_lock(&zone->lock);
+
+ /*
+ * Guarantee boundaries are populated before we
+ * start placing aerated pages in the zone.
+ */
+ aerator_populate_boundaries(zone);
+
+ /*
+ * We should have a list of pages that have been
+ * processed. Return them to their original free lists.
+ */
+ aerator_drain(zone);
+
+ /* keep pulling pages till there are none to pull */
+ } while (test_bit(ZONE_AERATION_REQUESTED, &zone->flags));
+
+ clear_bit(ZONE_AERATION_ACTIVE, &zone->flags);
+
+ spin_unlock(&zone->lock);
+}
+
+/**
+ * aerator_cycle - start aerating a batch of pages, drain, and refill
+ *
+ * The aerator cycle consists of 4 stages: fill, react, drain, and idle.
+ * We will cycle through the first 3 stages until we fail to obtain any
+ * pages; in that case we will switch to idle and the thread will go back
+ * to sleep awaiting the next request for aeration.
+ */
+static void aerator_cycle(struct work_struct *work)
+{
+ struct zone *zone = first_online_pgdat()->node_zones;
+ int refcnt;
+
+ /*
+ * We want to hold one additional reference against the number of
+ * active hints as we may clear the hint that originally brought us
+ * here. We will clear it after we have either vaporized the content
+ * of the pages, or if we discover all pages were stolen out from
+ * under us.
+ */
+ atomic_inc(&a_dev_info->refcnt);
+
+ for (;;) {
+ aerator_scrub_zone(zone);
+
+ /*
+ * Move to next zone, if at the end of the list
+ * test to see if we can just go into idle.
+ */
+ zone = next_zone(zone);
+ if (zone)
+ continue;
+ zone = first_online_pgdat()->node_zones;
+
+ /*
+ * If we never generated any pages and we are
+ * holding the only remaining reference to active
+ * hints then we can just let this go for now and
+ * go idle.
+ */
+ refcnt = atomic_read(&a_dev_info->refcnt);
+ if (refcnt != 1)
+ continue;
+ if (atomic_try_cmpxchg(&a_dev_info->refcnt, &refcnt, 0))
+ break;
+ }
+}
+
+static DECLARE_DELAYED_WORK(aerator_work, &aerator_cycle);
+
+void __aerator_notify(struct zone *zone)
+{
+ /*
+ * We can use separate test and set operations here as there
+ * is nothing else that can set or clear this bit while we are
+ * holding the zone lock. The advantage to doing it this way is
+ * that we don't have to dirty the cacheline unless we are
+ * changing the value.
+ */
+ set_bit(ZONE_AERATION_REQUESTED, &zone->flags);
+
+ if (atomic_fetch_inc(&a_dev_info->refcnt))
+ return;
+
+ /*
+ * We should never be calling this function while there are already
+ * pages in the list being aerated. If we are called under such a
+ * circumstance report an error.
+ */
+ WARN_ON(!list_empty(&a_dev_info->batch));
+
+ /*
+ * Delay the start of work to allow a sizable queue to build. For
+ * now we are limiting this to running no more than 10 times per
+ * second.
+ */
+ schedule_delayed_work(&aerator_work, HZ / 10);
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb7ba8385374..45269c46c662 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2168,8 +2168,10 @@ struct page *get_aeration_page(struct zone *zone, unsigned int order,
list_for_each_entry_from_reverse(page, list, lru) {
if (PageAerated(page)) {
page = list_first_entry(list, struct page, lru);
- if (PageAerated(page))
+ if (PageAerated(page)) {
+ aerator_add_to_boundary(page, zone);
break;
+ }
}

del_page_from_free_area(page, zone, order);

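The fill/react/drain loop above can be pictured with a small self-contained sketch. This is an illustrative model only (plain integers standing in for the kernel's free lists and `aerator_dev_info`, not the actual API): each pass pulls at most one batch of "dirty" pages, hints the batch, returns the pages, and repeats until nothing is left to pull.

```c
#include <assert.h>

#define CAPACITY 32  /* batch size, mirrors a_dev_info->capacity */

/* Hypothetical model of the aerator cycle: pull up to CAPACITY
 * dirty pages per pass (fill), hint the batch to the host (react),
 * then return the pages as "aerated" (drain), looping until the
 * zone has nothing left to treat. Returns the number of pages
 * treated and reports the number of passes taken. */
static int aerate_zone(int dirty_pages, int *passes)
{
    int treated = 0;

    *passes = 0;
    while (dirty_pages > 0) {
        /* fill: drain up to one batch from the free lists */
        int batch = dirty_pages < CAPACITY ? dirty_pages : CAPACITY;

        dirty_pages -= batch;
        /* react: the device processes the batch here */
        treated += batch;
        /* drain: pages go back to the free lists as "aerated" */
        (*passes)++;
    }
    return treated;
}
```

With 100 dirty pages and a batch size of 32, the cycle runs four times (32 + 32 + 32 + 4), which matches how aerator_scrub_zone keeps looping while ZONE_AERATION_REQUESTED stays set.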
2019-06-19 22:35:30

by Alexander Duyck

Subject: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

From: Alexander Duyck <[email protected]>

Add support for aerating memory using the hinting feature provided by
virtio-balloon. Hinting differs from the regular balloon functionality in
that it is much less durable than a standard memory balloon. Instead of
creating a list of pages that cannot be accessed, the pages are only
inaccessible while they are being indicated to the virtio interface. Once
the interface has acknowledged them they are placed back into their
respective free lists and are once again accessible by the guest system.

Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/virtio/Kconfig | 1
drivers/virtio/virtio_balloon.c | 110 ++++++++++++++++++++++++++++++++++-
include/uapi/linux/virtio_balloon.h | 1
3 files changed, 108 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 023fc3bc01c6..9cdaccf92c3a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -47,6 +47,7 @@ config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
select MEMORY_BALLOON
+ select AERATION
---help---
This driver supports increasing and decreasing the amount
of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 44339fc87cc7..91f1e8c9017d 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -18,6 +18,7 @@
#include <linux/mm.h>
#include <linux/mount.h>
#include <linux/magic.h>
+#include <linux/memory_aeration.h>

/*
* Balloon device works in 4K page units. So each page is pointed to by
@@ -26,6 +27,7 @@
*/
#define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
#define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
+#define VIRTIO_BALLOON_ARRAY_HINTS_MAX 32
#define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80

#define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
@@ -45,6 +47,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_DEFLATE,
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
+ VIRTIO_BALLOON_VQ_HINTING,
VIRTIO_BALLOON_VQ_MAX
};

@@ -54,7 +57,8 @@ enum virtio_balloon_config_read {

struct virtio_balloon {
struct virtio_device *vdev;
- struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
+ struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
+ *hinting_vq;

/* Balloon's own wq for cpu-intensive work items */
struct workqueue_struct *balloon_wq;
@@ -103,9 +107,21 @@ struct virtio_balloon {
/* Synchronize access/update to this struct virtio_balloon elements */
struct mutex balloon_lock;

- /* The array of pfns we tell the Host about. */
- unsigned int num_pfns;
- __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
+
+ union {
+ /* The array of pfns we tell the Host about. */
+ struct {
+ unsigned int num_pfns;
+ __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
+ };
+ /* The array of physical addresses we are hinting on */
+ struct {
+ unsigned int num_hints;
+ __virtio64 hints[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
+ };
+ };
+
+ struct aerator_dev_info a_dev_info;

/* Memory statistics */
struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
@@ -151,6 +167,68 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)

}

+static u64 page_to_hints_pa_order(struct page *page)
+{
+ unsigned char order;
+ dma_addr_t pa;
+
+ BUILD_BUG_ON((64 - VIRTIO_BALLOON_PFN_SHIFT) >=
+ (1 << VIRTIO_BALLOON_PFN_SHIFT));
+
+ /*
+ * Record physical page address combined with page order.
+ * Order will never exceed 64 - VIRTIO_BALLOON_PFN_SHIFT
+ * since the size has to fit into a 64b value. So as long
+ * as VIRTIO_BALLOON_PFN_SHIFT is greater than this,
+ * combining the two values should be safe.
+ */
+ pa = page_to_phys(page);
+ order = page_private(page) +
+ PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT;
+
+ return (u64)(pa | order);
+}
+
+void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
+{
+ struct virtio_balloon *vb = container_of(a_dev_info,
+ struct virtio_balloon,
+ a_dev_info);
+ struct virtqueue *vq = vb->hinting_vq;
+ struct scatterlist sg;
+ unsigned int unused;
+ struct page *page;
+
+ mutex_lock(&vb->balloon_lock);
+
+ vb->num_hints = 0;
+
+ list_for_each_entry(page, &a_dev_info->batch, lru) {
+ vb->hints[vb->num_hints++] =
+ cpu_to_virtio64(vb->vdev,
+ page_to_hints_pa_order(page));
+ }
+
+ /* We shouldn't have been called if there is nothing to process */
+ if (WARN_ON(vb->num_hints == 0))
+ goto out;
+
+ sg_init_one(&sg, vb->hints,
+ sizeof(vb->hints[0]) * vb->num_hints);
+
+ /*
+ * We should always be able to add one buffer to an
+ * empty queue.
+ */
+ virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
+ virtqueue_kick(vq);
+
+ /* When host has read buffer, this completes via balloon_ack */
+ wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+out:
+ mutex_unlock(&vb->balloon_lock);
+}
+
static void set_page_pfns(struct virtio_balloon *vb,
__virtio32 pfns[], struct page *page)
{
@@ -475,6 +553,7 @@ static int init_vqs(struct virtio_balloon *vb)
names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
names[VIRTIO_BALLOON_VQ_STATS] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+ names[VIRTIO_BALLOON_VQ_HINTING] = NULL;

if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -486,11 +565,19 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
}

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+ names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
+ callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
+ }
+
err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
vqs, callbacks, names, NULL, NULL);
if (err)
return err;

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+ vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
+
vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -929,12 +1016,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_del_balloon_wq;
}
+
+ vb->a_dev_info.react = virtballoon_aerator_react;
+ vb->a_dev_info.capacity = VIRTIO_BALLOON_ARRAY_HINTS_MAX;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
+ err = aerator_startup(&vb->a_dev_info);
+ if (err)
+ goto out_unregister_shrinker;
+ }
+
virtio_device_ready(vdev);

if (towards_target(vb))
virtballoon_changed(vdev);
return 0;

+out_unregister_shrinker:
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+ virtio_balloon_unregister_shrinker(vb);
out_del_balloon_wq:
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
destroy_workqueue(vb->balloon_wq);
@@ -963,6 +1062,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
{
struct virtio_balloon *vb = vdev->priv;

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
+ aerator_shutdown();
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
virtio_balloon_unregister_shrinker(vb);
spin_lock_irq(&vb->stop_update_lock);
@@ -1032,6 +1133,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
VIRTIO_BALLOON_F_PAGE_POISON,
+ VIRTIO_BALLOON_F_HINTING,
};

static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..2b0f62814e22 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */

/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12

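As an aside (illustrative C, not part of the patch), the 64-bit hint layout built by page_to_hints_pa_order relies on a page of order N being aligned to 1 << (N + PFN_SHIFT) bytes, so the order can be stored in the otherwise-zero low bits of the physical address. A minimal sketch of the encode, and of the matching host-side decode used by the QEMU handler later in the series:

```c
#include <assert.h>
#include <stdint.h>

#define PFN_SHIFT 12 /* stands in for VIRTIO_BALLOON_PFN_SHIFT */

/* Pack a physical address and a (4K-relative) page order into one
 * 64-bit hint. Safe because a page of order N is aligned to
 * 1 << (N + PFN_SHIFT) bytes, so the low bits of pa are zero. */
static uint64_t pack_hint(uint64_t pa, unsigned int order)
{
    return pa | order;
}

/* Host-side decode: the order lives in the low byte of the hint,
 * and clearing that byte recovers the physical address. */
static void unpack_hint(uint64_t hint, uint64_t *pa, uint64_t *size)
{
    unsigned int order = hint & 0xFF;

    *size = 1ull << (PFN_SHIFT + order);
    *pa = hint - (hint & 0xFF);
}
```

For example, a 2MB pageblock at physical address 0x200000 round-trips as order 9 (2MB / 4K = 2^9): the packed value is 0x200009, and decoding yields the original address and a 2MB size.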
2019-06-19 22:35:50

by Alexander Duyck

Subject: [PATCH v1 4/6] mm: Introduce "aerated" pages

From: Alexander Duyck <[email protected]>

In order to pave the way for free page hinting in virtualized environments
we will need a way to get pages out of the free lists and identify those
pages after they have been returned. To accomplish this, the patch adds the
concept of an "aerated" flag, which is essentially meant to just be the
Offline page type used in conjunction with the Buddy page type bit.

For now we can just add the basic logic to set the flag and track the
number of aerated pages per free area.

Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/memory_aeration.h | 61 +++++++++++++++++++++++++
include/linux/mmzone.h | 13 ++++-
mm/Kconfig | 5 ++
mm/page_alloc.c | 97 +++++++++++++++++++++++++++++++++++++--
4 files changed, 168 insertions(+), 8 deletions(-)
create mode 100644 include/linux/memory_aeration.h

diff --git a/include/linux/memory_aeration.h b/include/linux/memory_aeration.h
new file mode 100644
index 000000000000..44cfbc259778
--- /dev/null
+++ b/include/linux/memory_aeration.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_AERATION_H
+#define _LINUX_MEMORY_AERATION_H
+
+#include <linux/mmzone.h>
+#include <linux/pageblock-flags.h>
+
+struct page *get_aeration_page(struct zone *zone, unsigned int order,
+ int migratetype);
+void put_aeration_page(struct zone *zone, struct page *page);
+
+static inline struct list_head *aerator_get_tail(struct zone *zone,
+ unsigned int order,
+ int migratetype)
+{
+ return &zone->free_area[order].free_list[migratetype];
+}
+
+static inline void set_page_aerated(struct page *page,
+ struct zone *zone,
+ unsigned int order,
+ int migratetype)
+{
+#ifdef CONFIG_AERATION
+ /* update aerated page accounting */
+ zone->free_area[order].nr_free_aerated++;
+
+ /* record migratetype and flag page as aerated */
+ set_pcppage_migratetype(page, migratetype);
+ __SetPageAerated(page);
+#endif
+}
+
+static inline void clear_page_aerated(struct page *page,
+ struct zone *zone,
+ struct free_area *area)
+{
+#ifdef CONFIG_AERATION
+ if (likely(!PageAerated(page)))
+ return;
+
+ __ClearPageAerated(page);
+ area->nr_free_aerated--;
+#endif
+}
+
+/**
+ * aerator_notify_free - Free page notification that will start page processing
+ * @zone: Pointer to current zone of last page processed
+ * @order: Order of last page added to zone
+ *
+ * This function is meant to act as a screener for __aerator_notify which
+ * will determine if a given zone has crossed over the high-water mark that
+ * will justify us beginning page treatment. If we have crossed that
+ * threshold then it will start the process of pulling some pages and
+ * placing them in the batch list for treatment.
+ */
+static inline void aerator_notify_free(struct zone *zone, int order)
+{
+}
+#endif /*_LINUX_MEMORY_AERATION_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c3597920a155..7d89722ae9eb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -116,6 +116,7 @@ static inline void set_pcppage_migratetype(struct page *page, int migratetype)
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
+ unsigned long nr_free_aerated;
};

static inline struct page *get_page_from_free_area(struct free_area *area,
@@ -773,6 +774,8 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
return !pgdat->node_start_pfn && !pgdat->node_spanned_pages;
}

+#include <linux/memory_aeration.h>
+
/* Used for pages not on another list */
static inline void add_to_free_area(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
@@ -787,10 +790,10 @@ static inline void add_to_free_area(struct page *page, struct zone *zone,
static inline void add_to_free_area_tail(struct page *page, struct zone *zone,
unsigned int order, int migratetype)
{
- struct free_area *area = &zone->free_area[order];
+ struct list_head *tail = aerator_get_tail(zone, order, migratetype);

- list_add_tail(&page->lru, &area->free_list[migratetype]);
- area->nr_free++;
+ list_add_tail(&page->lru, tail);
+ zone->free_area[order].nr_free++;
}

/* Used for pages which are on another list */
@@ -799,6 +802,8 @@ static inline void move_to_free_area(struct page *page, struct zone *zone,
{
struct free_area *area = &zone->free_area[order];

+ clear_page_aerated(page, zone, area);
+
list_move(&page->lru, &area->free_list[migratetype]);
}

@@ -807,6 +812,8 @@ static inline void del_page_from_free_area(struct page *page, struct zone *zone,
{
struct free_area *area = &zone->free_area[order];

+ clear_page_aerated(page, zone, area);
+
list_del(&page->lru);
__ClearPageBuddy(page);
set_page_private(page, 0);
diff --git a/mm/Kconfig b/mm/Kconfig
index 7c41d2300e07..209dc4bea481 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -236,6 +236,11 @@ config COMPACTION
[email protected].

#
+# support for memory aeration
+config AERATION
+ bool
+
+#
# support for page migration
#
config MIGRATION
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aad2b2529ab7..eb7ba8385374 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
#include <linux/lockdep.h>
#include <linux/nmi.h>
#include <linux/psi.h>
+#include <linux/memory_aeration.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -868,10 +869,11 @@ static inline struct capture_control *task_capc(struct zone *zone)
static inline void __free_one_page(struct page *page,
unsigned long pfn,
struct zone *zone, unsigned int order,
- int migratetype)
+ int migratetype, bool aerated)
{
struct capture_control *capc = task_capc(zone);
unsigned long uninitialized_var(buddy_pfn);
+ bool fully_aerated = aerated;
unsigned long combined_pfn;
unsigned int max_order;
struct page *buddy;
@@ -902,6 +904,11 @@ static inline void __free_one_page(struct page *page,
goto done_merging;
if (!page_is_buddy(page, buddy, order))
goto done_merging;
+
+ /* assume buddy is not aerated */
+ if (aerated)
+ fully_aerated = false;
+
/*
* Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
* merge with it and move up one order.
@@ -943,11 +950,17 @@ static inline void __free_one_page(struct page *page,
done_merging:
set_page_order(page, order);

- if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
+ if (aerated ||
+ buddy_merge_likely(pfn, buddy_pfn, page, order) ||
is_shuffle_tail_page(order))
add_to_free_area_tail(page, zone, order, migratetype);
else
add_to_free_area(page, zone, order, migratetype);
+
+ if (fully_aerated)
+ set_page_aerated(page, zone, order, migratetype);
+ else
+ aerator_notify_free(zone, order);
}

/*
@@ -1247,7 +1260,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);

- __free_one_page(page, page_to_pfn(page), zone, 0, mt);
+ __free_one_page(page, page_to_pfn(page), zone, 0, mt, false);
trace_mm_page_pcpu_drain(page, 0, mt);
}
spin_unlock(&zone->lock);
@@ -1263,7 +1276,7 @@ static void free_one_page(struct zone *zone,
is_migrate_isolate(migratetype))) {
migratetype = get_pfnblock_migratetype(page, pfn);
}
- __free_one_page(page, pfn, zone, order, migratetype);
+ __free_one_page(page, pfn, zone, order, migratetype, false);
spin_unlock(&zone->lock);
}

@@ -2127,6 +2140,77 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
return NULL;
}

+#ifdef CONFIG_AERATION
+/**
+ * get_aeration_page - Provide a "raw" page for aeration by the aerator
+ * @zone: Zone to draw pages from
+ * @order: Order to draw pages from
+ * @migratetype: Migratetype to draw pages from
+ *
+ * This function will obtain a page from above the boundary. As a result
+ * we can guarantee the page has not been aerated.
+ *
+ * The page will have the migrate type and order stored in the page
+ * metadata.
+ *
+ * Return: page pointer if raw page found, otherwise NULL
+ */
+struct page *get_aeration_page(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ struct free_area *area = &(zone->free_area[order]);
+ struct list_head *list = &area->free_list[migratetype];
+ struct page *page;
+
+ /* Find a page of the appropriate size in the preferred list */
+ page = list_last_entry(aerator_get_tail(zone, order, migratetype),
+ struct page, lru);
+ list_for_each_entry_from_reverse(page, list, lru) {
+ if (PageAerated(page)) {
+ page = list_first_entry(list, struct page, lru);
+ if (PageAerated(page))
+ break;
+ }
+
+ del_page_from_free_area(page, zone, order);
+
+ /* record migratetype and order within page */
+ set_pcppage_migratetype(page, migratetype);
+ set_page_private(page, order);
+ __mod_zone_freepage_state(zone, -(1 << order), migratetype);
+
+ return page;
+ }
+
+ return NULL;
+}
+
+/**
+ * put_aeration_page - Return a now-aerated "raw" page back where we got it
+ * @zone: Zone to return pages to
+ * @page: Previously "raw" page that can now be returned after aeration
+ *
+ * This function will pull the migratetype and order information out
+ * of the page and attempt to return it where it found it.
+ */
+void put_aeration_page(struct zone *zone, struct page *page)
+{
+ unsigned int order, mt;
+ unsigned long pfn;
+
+ mt = get_pcppage_migratetype(page);
+ pfn = page_to_pfn(page);
+
+ if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
+ mt = get_pfnblock_migratetype(page, pfn);
+
+ order = page_private(page);
+ set_page_private(page, 0);
+
+ __free_one_page(page, pfn, zone, order, mt, true);
+}
+#endif /* CONFIG_AERATION */
+
/*
* This array describes the order lists are fallen back to when
* the free lists for the desirable migrate type are depleted
@@ -5929,9 +6013,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
static void __meminit zone_init_free_lists(struct zone *zone)
{
unsigned int order, t;
- for_each_migratetype_order(order, t) {
+ for_each_migratetype_order(order, t)
INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+
+ for (order = MAX_ORDER; order--; ) {
zone->free_area[order].nr_free = 0;
+ zone->free_area[order].nr_free_aerated = 0;
}
}


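The tail-insertion scheme above is what get_aeration_page later depends on: aerated pages congregate at the tail of each free list, so scanning from the tail, the first raw page found is a candidate, and hitting an aerated page all the way at the head means the whole list has been treated. A hypothetical flat model (an array of flags, not the kernel's list API) makes the invariant easy to see:

```c
#include <assert.h>

/* Hypothetical flat model of one free list: entry i is 1 if that
 * page is already aerated, 0 if still raw. Aerated pages are kept
 * at the tail (high indices), so scanning from the tail the first
 * raw page found is the candidate to pull; if every entry is
 * aerated (or the list is empty) there is nothing left to treat. */
static int find_raw_page(const int *aerated, int n)
{
    int i;

    for (i = n - 1; i >= 0; i--) {
        if (!aerated[i])
            return i; /* last raw page before the aerated tail */
    }
    return -1; /* list fully aerated or empty */
}
```

The real code walks the list in the same direction with list_for_each_entry_from_reverse, starting from the boundary rather than the absolute tail.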
2019-06-19 22:37:53

by Alexander Duyck

Subject: [PATCH v1 QEMU] QEMU: Provide an interface for hinting based off of the balloon infrastructure

From: Alexander Duyck <[email protected]>

Add support for what I am referring to as "bubble hinting". Basically the
idea is to function very similarly to how the balloon works, in that we
end up madvising the page as not being used. However we don't
really need to bother with any deflate type logic since the page will be
faulted back into the guest when it is read or written to.

This is meant to be a simplification of the existing balloon interface,
to be used for providing hints about what memory can be freed. I am assuming
this is safe to do as the deflate logic does not actually appear to do very
much other than tracking what subpages have been released and which ones
haven't.

Signed-off-by: Alexander Duyck <[email protected]>
---
hw/virtio/trace-events | 1
hw/virtio/virtio-balloon.c | 73 +++++++++++++++++++++++
include/hw/virtio/virtio-balloon.h | 2 -
include/standard-headers/linux/virtio_balloon.h | 1
4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index e28ba48da621..b56daf460769 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -46,6 +46,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: 0x%"PRIx64" num_pages: %d"
+virtio_bubble_handle_output(const char *name, uint64_t gpa, uint64_t size) "section name: %s gpa: 0x%" PRIx64 " size: %" PRIx64

# virtio-mmio.c
virtio_mmio_read(uint64_t offset) "virtio_mmio_read offset 0x%" PRIx64
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 2112874055fb..93ee165d2db2 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -328,6 +328,75 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
balloon_stats_change_timer(s, 0);
}

+static void bubble_inflate_page(VirtIOBalloon *balloon,
+ MemoryRegion *mr, hwaddr offset, size_t size)
+{
+ void *addr = memory_region_get_ram_ptr(mr) + offset;
+ ram_addr_t ram_offset;
+ size_t rb_page_size;
+ RAMBlock *rb;
+
+ rb = qemu_ram_block_from_host(addr, false, &ram_offset);
+ rb_page_size = qemu_ram_pagesize(rb);
+
+ /* For now we will simply ignore unaligned memory regions */
+ if ((ram_offset | size) & (rb_page_size - 1))
+ return;
+
+ ram_block_discard_range(rb, ram_offset, size);
+}
+
+static void virtio_bubble_handle_output(VirtIODevice *vdev, VirtQueue *vq)
+{
+ VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
+ VirtQueueElement *elem;
+ MemoryRegionSection section;
+
+ for (;;) {
+ size_t offset = 0;
+ uint64_t pa_order;
+
+ elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+ if (!elem) {
+ return;
+ }
+
+ while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pa_order, 8) == 8) {
+ hwaddr pa = virtio_ldq_p(vdev, &pa_order);
+ size_t size = 1ul << (VIRTIO_BALLOON_PFN_SHIFT + (pa & 0xFF));
+
+ pa -= pa & 0xFF;
+ offset += 8;
+
+ if (qemu_balloon_is_inhibited())
+ continue;
+
+ section = memory_region_find(get_system_memory(), pa, size);
+ if (!section.mr) {
+ trace_virtio_balloon_bad_addr(pa);
+ continue;
+ }
+
+ if (!memory_region_is_ram(section.mr) ||
+ memory_region_is_rom(section.mr) ||
+ memory_region_is_romd(section.mr)) {
+ trace_virtio_balloon_bad_addr(pa);
+ } else {
+ trace_virtio_bubble_handle_output(memory_region_name(section.mr),
+ pa, size);
+ bubble_inflate_page(s, section.mr,
+ section.offset_within_region, size);
+ }
+
+ memory_region_unref(section.mr);
+ }
+
+ virtqueue_push(vq, elem, offset);
+ virtio_notify(vdev, vq);
+ g_free(elem);
+ }
+}
+
static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -694,6 +763,7 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
f |= dev->host_features;
virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+ virtio_add_feature(&f, VIRTIO_BALLOON_F_HINTING);

return f;
}
@@ -780,6 +850,7 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
s->ivq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);
+ s->hvq = virtio_add_queue(vdev, 128, virtio_bubble_handle_output);

if (virtio_has_feature(s->host_features,
VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
@@ -875,6 +946,8 @@ static void virtio_balloon_instance_init(Object *obj)

object_property_add(obj, "guest-stats", "guest statistics",
balloon_stats_get_all, NULL, NULL, s, NULL);
+ object_property_add(obj, "guest-page-hinting", "guest page hinting",
+ NULL, NULL, NULL, s, NULL);

object_property_add(obj, "guest-stats-polling-interval", "int",
balloon_stats_get_poll_interval,
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 1afafb12f6bc..dd6d4d0e45fd 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -44,7 +44,7 @@ enum virtio_balloon_free_page_report_status {

typedef struct VirtIOBalloon {
VirtIODevice parent_obj;
- VirtQueue *ivq, *dvq, *svq, *free_page_vq;
+ VirtQueue *ivq, *dvq, *svq, *hvq, *free_page_vq;
uint32_t free_page_report_status;
uint32_t num_pages;
uint32_t actual;
diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 9375ca2a70de..f9e3e8256261 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */

/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12

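One small idiom worth noting from bubble_inflate_page above, sketched here for illustration: OR-ing the offset and size before masking checks the alignment of both values against a power-of-two page size in a single test, since any low bit set in either operand survives the OR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Both offset and size must be multiples of page_size (assumed to
 * be a power of two). OR-ing them first means one mask operation
 * covers both values, mirroring the
 * "(ram_offset | size) & (rb_page_size - 1)" check in the patch. */
static bool is_aligned(uint64_t offset, uint64_t size, uint64_t page_size)
{
    return ((offset | size) & (page_size - 1)) == 0;
}
```

Regions failing this check are simply skipped by the handler rather than partially discarded.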
2019-06-25 07:54:18

by David Hildenbrand

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On 20.06.19 00:32, Alexander Duyck wrote:
> This series provides an asynchronous means of hinting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as waste page treatment.
>
> I have based many of the terms and functionality off of waste water
> treatment, the idea for the similarity occurred to me after I had reached
> the point of referring to the hints as "bubbles", as the hints used the
> same approach as the balloon functionality but would disappear if they
> were touched, as a result I started to think of the virtio device as an
> aerator. The general idea with all of this is that the guest should be
> treating the unused pages so that when they end up heading "downstream"
> to either another guest, or back at the host they will not need to be
> written to swap.
>
> When the number of "dirty" pages in a given free_area exceeds our high
> water mark, which is currently 32, we will schedule the aeration task to
> start going through and scrubbing the zone. While the scrubbing is taking
> place a boundary will be defined that we use to separate the "aerated"
> pages from the "dirty" ones. We use the ZONE_AERATION_ACTIVE bit to flag
> when these boundaries are in place.

I still *detest* the terminology, sorry. Can't you come up with a
simpler terminology that makes more sense in the context of operating
systems and pages we want to hint to the hypervisor? (that is the only
use case you are using it for so far)

>
> I am leaving a number of things hard-coded such as limiting the lowest
> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> determine what batch size it wants to allocate to process the hints.
>
> My primary testing has just been to verify the memory is being freed after
> allocation by running memhog 32g in the guest and watching the total free
> memory via /proc/meminfo on the host. With this I have verified most of
> the memory is freed after each iteration. As far as performance I have
> been mainly focusing on the will-it-scale/page_fault1 test running with
> 16 vcpus. With that I have seen a less than 1% difference between the

1% throughout all benchmarks? Guess that is quite good.

> base kernel without these patches, with the patches and virtio-balloon
> disabled, and with the patches and virtio-balloon enabled with hinting.
>
> Changes from the RFC:
> Moved aeration requested flag out of aerator and into zone->flags.
> Moved boundary out of free_area and into local variables for aeration.
> Moved aeration cycle out of interrupt and into workqueue.
> Left nr_free as total pages instead of splitting it between raw and aerated.
> Combined size and physical address values in virtio ring into one 64b value.
> Restructured the patch set to reduce patches from 11 to 6.
>

I'm planning to look into the details, but will be on PTO for two weeks
starting this Saturday (and still have other things to finish first :/ ).

> ---
>
> Alexander Duyck (6):
> mm: Adjust shuffle code to allow for future coalescing
> mm: Move set/get_pcppage_migratetype to mmzone.h
> mm: Use zone and order instead of free area in free_list manipulators
> mm: Introduce "aerated" pages
> mm: Add logic for separating "aerated" pages from "raw" pages
> virtio-balloon: Add support for aerating memory via hinting
>
>
> drivers/virtio/Kconfig | 1
> drivers/virtio/virtio_balloon.c | 110 ++++++++++++++
> include/linux/memory_aeration.h | 118 +++++++++++++++
> include/linux/mmzone.h | 113 +++++++++------
> include/linux/page-flags.h | 8 +
> include/uapi/linux/virtio_balloon.h | 1
> mm/Kconfig | 5 +
> mm/Makefile | 1
> mm/aeration.c | 270 +++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 203 ++++++++++++++++++--------
> mm/shuffle.c | 24 ---
> mm/shuffle.h | 35 +++++
> 12 files changed, 753 insertions(+), 136 deletions(-)
> create mode 100644 include/linux/memory_aeration.h
> create mode 100644 mm/aeration.c

Compared to

17 files changed, 838 insertions(+), 86 deletions(-)
create mode 100644 include/linux/memory_aeration.h
create mode 100644 mm/aeration.c

this looks like a good improvement :)

--

Thanks,

David / dhildenb

2019-06-25 07:58:13

by David Hildenbrand

Subject: Re: [PATCH v1 1/6] mm: Adjust shuffle code to allow for future coalescing

On 20.06.19 00:33, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> This patch is meant to move the head/tail adding logic out of the shuffle
> code and into the __free_one_page function since ultimately that is where
> it is really needed anyway. By doing this we should be able to reduce the
> overhead and can consolidate all of the list addition bits in one spot.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> include/linux/mmzone.h | 12 --------
> mm/page_alloc.c | 70 +++++++++++++++++++++++++++---------------------
> mm/shuffle.c | 24 ----------------
> mm/shuffle.h | 35 ++++++++++++++++++++++++
> 4 files changed, 74 insertions(+), 67 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 427b79c39b3c..4c07af2cfc2f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -116,18 +116,6 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
> area->nr_free++;
> }
>
> -#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> -/* Used to preserve page allocation order entropy */
> -void add_to_free_area_random(struct page *page, struct free_area *area,
> - int migratetype);
> -#else
> -static inline void add_to_free_area_random(struct page *page,
> - struct free_area *area, int migratetype)
> -{
> - add_to_free_area(page, area, migratetype);
> -}
> -#endif
> -
> /* Used for pages which are on another list */
> static inline void move_to_free_area(struct page *page, struct free_area *area,
> int migratetype)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f4651a09948c..ec344ce46587 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -830,6 +830,36 @@ static inline struct capture_control *task_capc(struct zone *zone)
> #endif /* CONFIG_COMPACTION */
>
> /*
> + * If this is not the largest possible page, check if the buddy
> + * of the next-highest order is free. If it is, it's possible
> + * that pages are being freed that will coalesce soon. In case,
> + * that is happening, add the free page to the tail of the list
> + * so it's less likely to be used soon and more likely to be merged
> + * as a higher order page
> + */
> +static inline bool
> +buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> + struct page *page, unsigned int order)
> +{
> + struct page *higher_page, *higher_buddy;
> + unsigned long combined_pfn;
> +
> + if (is_shuffle_order(order) || order >= (MAX_ORDER - 2))

My intuition tells me you can drop the () around "MAX_ORDER - 2"

> + return false;

Guess the "is_shuffle_order(order)" check should rather be performed by
the caller, before calling this function.

> +
> + if (!pfn_valid_within(buddy_pfn))
> + return false;
> +
> + combined_pfn = buddy_pfn & pfn;
> + higher_page = page + (combined_pfn - pfn);
> + buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
> + higher_buddy = higher_page + (buddy_pfn - combined_pfn);
> +
> + return pfn_valid_within(buddy_pfn) &&
> + page_is_buddy(higher_page, higher_buddy, order + 1);
> +}
> +
> +/*
> * Freeing function for a buddy system allocator.
> *
> * The concept of a buddy system is to maintain direct-mapped table
> @@ -858,11 +888,12 @@ static inline void __free_one_page(struct page *page,
> struct zone *zone, unsigned int order,
> int migratetype)
> {
> - unsigned long combined_pfn;
> + struct capture_control *capc = task_capc(zone);
> unsigned long uninitialized_var(buddy_pfn);
> - struct page *buddy;
> + unsigned long combined_pfn;
> + struct free_area *area;
> unsigned int max_order;
> - struct capture_control *capc = task_capc(zone);
> + struct page *buddy;
>
> max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
>
> @@ -931,35 +962,12 @@ static inline void __free_one_page(struct page *page,
> done_merging:
> set_page_order(page, order);
>
> - /*
> - * If this is not the largest possible page, check if the buddy
> - * of the next-highest order is free. If it is, it's possible
> - * that pages are being freed that will coalesce soon. In case,
> - * that is happening, add the free page to the tail of the list
> - * so it's less likely to be used soon and more likely to be merged
> - * as a higher order page
> - */
> - if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
> - && !is_shuffle_order(order)) {
> - struct page *higher_page, *higher_buddy;
> - combined_pfn = buddy_pfn & pfn;
> - higher_page = page + (combined_pfn - pfn);
> - buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
> - higher_buddy = higher_page + (buddy_pfn - combined_pfn);
> - if (pfn_valid_within(buddy_pfn) &&
> - page_is_buddy(higher_page, higher_buddy, order + 1)) {
> - add_to_free_area_tail(page, &zone->free_area[order],
> - migratetype);
> - return;
> - }
> - }
> -
> - if (is_shuffle_order(order))
> - add_to_free_area_random(page, &zone->free_area[order],
> - migratetype);
> + area = &zone->free_area[order];
> + if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
> + is_shuffle_tail_page(order))
> + add_to_free_area_tail(page, area, migratetype);

I would prefer here something like

if (is_shuffle_order(order)) {
if (add_shuffle_order_to_tail(order))
add_to_free_area_tail(page, area, migratetype);
else
add_to_free_area(page, area, migratetype);
} else if (buddy_merge_likely(pfn, buddy_pfn, page, order)) {
add_to_free_area_tail(page, area, migratetype);
} else {
add_to_free_area(page, area, migratetype);
}

dropping "is_shuffle_order()" from buddy_merge_likely()

Especially, the name "is_shuffle_tail_page(order)" suggests that you are
passing a page.

> else
> - add_to_free_area(page, &zone->free_area[order], migratetype);
> -
> + add_to_free_area(page, area, migratetype);
> }
>
> /*
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 3ce12481b1dc..55d592e62526 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -4,7 +4,6 @@
> #include <linux/mm.h>
> #include <linux/init.h>
> #include <linux/mmzone.h>
> -#include <linux/random.h>
> #include <linux/moduleparam.h>
> #include "internal.h"
> #include "shuffle.h"
> @@ -182,26 +181,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> shuffle_zone(z);
> }
> -
> -void add_to_free_area_random(struct page *page, struct free_area *area,
> - int migratetype)
> -{
> - static u64 rand;
> - static u8 rand_bits;
> -
> - /*
> - * The lack of locking is deliberate. If 2 threads race to
> - * update the rand state it just adds to the entropy.
> - */
> - if (rand_bits == 0) {
> - rand_bits = 64;
> - rand = get_random_u64();
> - }
> -
> - if (rand & 1)
> - add_to_free_area(page, area, migratetype);
> - else
> - add_to_free_area_tail(page, area, migratetype);
> - rand_bits--;
> - rand >>= 1;
> -}
> diff --git a/mm/shuffle.h b/mm/shuffle.h
> index 777a257a0d2f..3f4edb60a453 100644
> --- a/mm/shuffle.h
> +++ b/mm/shuffle.h
> @@ -3,6 +3,7 @@
> #ifndef _MM_SHUFFLE_H
> #define _MM_SHUFFLE_H
> #include <linux/jump_label.h>
> +#include <linux/random.h>
>
> /*
> * SHUFFLE_ENABLE is called from the command line enabling path, or by
> @@ -43,6 +44,35 @@ static inline bool is_shuffle_order(int order)
> return false;
> return order >= SHUFFLE_ORDER;
> }
> +
> +static inline bool is_shuffle_tail_page(int order)
> +{
> + static u64 rand;
> + static u8 rand_bits;
> + u64 rand_old;
> +
> + if (!is_shuffle_order(order))
> + return false;
> +
> + /*
> + * The lack of locking is deliberate. If 2 threads race to
> + * update the rand state it just adds to the entropy.
> + */
> + if (rand_bits-- == 0) {
> + rand_bits = 64;
> + rand = get_random_u64();
> + }
> +
> + /*
> + * Test highest order bit while shifting our random value. This
> + * should result in us testing for the carry flag following the
> + * shift.
> + */
> + rand_old = rand;
> + rand <<= 1;
> +
> + return rand < rand_old;
> +}
> #else
> static inline void shuffle_free_memory(pg_data_t *pgdat)
> {
> @@ -60,5 +90,10 @@ static inline bool is_shuffle_order(int order)
> {
> return false;
> }
> +
> +static inline bool is_shuffle_tail_page(int order)
> +{
> + return false;
> +}
> #endif
> #endif /* _MM_SHUFFLE_H */
>


--

Thanks,

David / dhildenb

2019-06-25 14:11:35

by Dave Hansen

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On 6/25/19 12:42 AM, David Hildenbrand wrote:
> On 20.06.19 00:32, Alexander Duyck wrote:
> I still *detest* the terminology, sorry. Can't you come up with a
> simpler terminology that makes more sense in the context of operating
> systems and pages we want to hint to the hypervisor? (that is the only
> use case you are using it for so far)

It's a wee bit too cute for my taste as well. I could probably live
with it in the data structures, but having it show up out in places like
Kconfig and filenames goes too far.

For instance, someone seeing memory_aeration.c will have no concept
what's in the file. Could we call it something like memory_paravirt.c?
Or even mm/paravirt.c.

Could you talk for a minute about why the straightforward naming like
"hinted/unhinted" wasn't used? Is there something else we could ever
use this infrastructure for that is not related to paravirtualized free
page hinting?

2019-06-25 19:20:16

by Alexander Duyck

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On Tue, Jun 25, 2019 at 12:42 AM David Hildenbrand <[email protected]> wrote:
>
> On 20.06.19 00:32, Alexander Duyck wrote:
> > This series provides an asynchronous means of hinting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as waste page treatment.
> >
> > I have based many of the terms and functionality off of waste water
> > treatment, the idea for the similarity occurred to me after I had reached
> > the point of referring to the hints as "bubbles", as the hints used the
> > same approach as the balloon functionality but would disappear if they
> > were touched, as a result I started to think of the virtio device as an
> > aerator. The general idea with all of this is that the guest should be
> > treating the unused pages so that when they end up heading "downstream"
> > to either another guest, or back at the host they will not need to be
> > written to swap.
> >
> > When the number of "dirty" pages in a given free_area exceeds our high
> > water mark, which is currently 32, we will schedule the aeration task to
> > start going through and scrubbing the zone. While the scrubbing is taking
> > place a boundary will be defined that we use to separate the "aerated"
> > pages from the "dirty" ones. We use the ZONE_AERATION_ACTIVE bit to flag
> > when these boundaries are in place.
>
> I still *detest* the terminology, sorry. Can't you come up with a
> simpler terminology that makes more sense in the context of operating
> systems and pages we want to hint to the hypervisor? (that is the only
> use case you are using it for so far)

I'm open to suggestions. The terminology is just what I went with as I
had gone from balloon to thinking of this as a bubble since it was a
balloon without the deflate logic. From there I got to aeration since
it is filling the buddy allocator with those bubbles.

> >
> > I am leaving a number of things hard-coded such as limiting the lowest
> > order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> > determine what batch size it wants to allocate to process the hints.
> >
> > My primary testing has just been to verify the memory is being freed after
> > allocation by running memhog 32g in the guest and watching the total free
> > memory via /proc/meminfo on the host. With this I have verified most of
> > the memory is freed after each iteration. As far as performance I have
> > been mainly focusing on the will-it-scale/page_fault1 test running with
> > 16 vcpus. With that I have seen a less than 1% difference between the
>
> 1% throughout all benchmarks? Guess that is quite good.

That is the general idea. What I wanted to avoid was this introducing
any significant slowdown, especially in the case where we weren't
using it.

> > base kernel without these patches, with the patches and virtio-balloon
> > disabled, and with the patches and virtio-balloon enabled with hinting.
> >
> > Changes from the RFC:
> > Moved aeration requested flag out of aerator and into zone->flags.
> > Moved boundary out of free_area and into local variables for aeration.
> > Moved aeration cycle out of interrupt and into workqueue.
> > Left nr_free as total pages instead of splitting it between raw and aerated.
> > Combined size and physical address values in virtio ring into one 64b value.
> > Restructured the patch set to reduce patches from 11 to 6.
> >
>
> I'm planning to look into the details, but will be on PTO for two weeks
> starting this Saturday (and still have other things to finish first :/ ).

Thanks. No rush. I will be on PTO for the next couple of weeks myself.

> > ---
> >
> > Alexander Duyck (6):
> > mm: Adjust shuffle code to allow for future coalescing
> > mm: Move set/get_pcppage_migratetype to mmzone.h
> > mm: Use zone and order instead of free area in free_list manipulators
> > mm: Introduce "aerated" pages
> > mm: Add logic for separating "aerated" pages from "raw" pages
> > virtio-balloon: Add support for aerating memory via hinting
> >
> >
> > drivers/virtio/Kconfig | 1
> > drivers/virtio/virtio_balloon.c | 110 ++++++++++++++
> > include/linux/memory_aeration.h | 118 +++++++++++++++
> > include/linux/mmzone.h | 113 +++++++++------
> > include/linux/page-flags.h | 8 +
> > include/uapi/linux/virtio_balloon.h | 1
> > mm/Kconfig | 5 +
> > mm/Makefile | 1
> > mm/aeration.c | 270 +++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 203 ++++++++++++++++++--------
> > mm/shuffle.c | 24 ---
> > mm/shuffle.h | 35 +++++
> > 12 files changed, 753 insertions(+), 136 deletions(-)
> > create mode 100644 include/linux/memory_aeration.h
> > create mode 100644 mm/aeration.c
>
> Compared to
>
> 17 files changed, 838 insertions(+), 86 deletions(-)
> create mode 100644 include/linux/memory_aeration.h
> create mode 100644 mm/aeration.c
>
> this looks like a good improvement :)

Thanks.

- Alex

2019-06-25 19:41:18

by Alexander Duyck

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On Tue, Jun 25, 2019 at 7:10 AM Dave Hansen <[email protected]> wrote:
>
> On 6/25/19 12:42 AM, David Hildenbrand wrote:
> > On 20.06.19 00:32, Alexander Duyck wrote:
> > I still *detest* the terminology, sorry. Can't you come up with a
> > simpler terminology that makes more sense in the context of operating
> > systems and pages we want to hint to the hypervisor? (that is the only
> > use case you are using it for so far)
>
> It's a wee bit too cute for my taste as well. I could probably live
> with it in the data structures, but having it show up out in places like
> Kconfig and filenames goes too far.
>
> For instance, someone seeing memory_aeration.c will have no concept
> what's in the file. Could we call it something like memory_paravirt.c?
> Or even mm/paravirt.c.

Well, I couldn't come up with a better explanation of what this was
doing. I also wanted to avoid mentioning hinting specifically, because
a few series that have already been committed upstream use that term
for slightly different purposes, such as the one by Wei Wang that was
doing free memory tracking for migration purposes:
https://lkml.org/lkml/2018/7/10/211.

Basically what we are doing is inflating the memory size we can report
by inserting voids into the free memory areas. In my mind that matches
up very well with what "aeration" is. It is similar to balloon in
functionality, however instead of inflating the balloon we are
inflating the free_list for higher order free areas by creating voids
where the madvised pages were.

> Could you talk for a minute about why the straightforward naming like
> "hinted/unhinted" wasn't used? Is there something else we could ever
> use this infrastructure for that is not related to paravirtualized free
> page hinting?

I was hoping there might be something in the future that could use the
infrastructure if it needed to go through and sort out used versus
unused memory. For instance, the way things are designed right now,
there is only a define limiting the lowest-order pages
that are processed. So if we wanted to use this for another purpose we
could replace the AERATOR_MIN_ORDER define with something that is
specific to that use case.

2019-06-25 19:54:07

by David Hildenbrand

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On 25.06.19 19:00, Alexander Duyck wrote:
> On Tue, Jun 25, 2019 at 7:10 AM Dave Hansen <[email protected]> wrote:
>>
>> On 6/25/19 12:42 AM, David Hildenbrand wrote:
>>> On 20.06.19 00:32, Alexander Duyck wrote:
>>> I still *detest* the terminology, sorry. Can't you come up with a
>>> simpler terminology that makes more sense in the context of operating
>>> systems and pages we want to hint to the hypervisor? (that is the only
>>> use case you are using it for so far)
>>
>> It's a wee bit too cute for my taste as well. I could probably live
>> with it in the data structures, but having it show up out in places like
>> Kconfig and filenames goes too far.
>>
>> For instance, someone seeing memory_aeration.c will have no concept
>> what's in the file. Could we call it something like memory_paravirt.c?
>> Or even mm/paravirt.c.
>
> Well, I couldn't come up with a better explanation of what this was
> doing. I also wanted to avoid mentioning hinting specifically, because
> a few series that have already been committed upstream use that term
> for slightly different purposes, such as the one by Wei Wang that was
> doing free memory tracking for migration purposes:
> https://lkml.org/lkml/2018/7/10/211.

That one we referred to rather as "free page reporting".

>
> Basically what we are doing is inflating the memory size we can report
> by inserting voids into the free memory areas. In my mind that matches
> up very well with what "aeration" is. It is similar to balloon in
> functionality, however instead of inflating the balloon we are
> inflating the free_list for higher order free areas by creating voids
> where the madvised pages were.
>
>> Could you talk for a minute about why the straightforward naming like
>> "hinted/unhinted" wasn't used? Is there something else we could ever
>> use this infrastructure for that is not related to paravirtualized free
>> page hinting?
>
> I was hoping there might be something in the future that could use the
> infrastructure if it needed to go through and sort out used versus
> unused memory. For instance, the way things are designed right now,
> there is only a define limiting the lowest-order pages
> that are processed. So if we wanted to use this for another purpose we
> could replace the AERATOR_MIN_ORDER define with something that is
> specific to that use case.


I'd still vote to call this "hinting" in some form. Whenever a new use
case eventually pops up, we could generalize this approach. But well,
that's just my opinion :)


--

Thanks,

David / dhildenb

2019-06-25 19:55:15

by Dave Hansen

Subject: Re: [PATCH v1 1/6] mm: Adjust shuffle code to allow for future coalescing

On 6/19/19 3:33 PM, Alexander Duyck wrote:
> This patch is meant to move the head/tail adding logic out of the shuffle
> code and into the __free_one_page function since ultimately that is where
> it is really needed anyway. By doing this we should be able to reduce the
> overhead and can consolidate all of the list addition bits in one spot.

This looks like a sane cleanup that can stand on its own. It gives nice
names (buddy_merge_likely()) to things that were just code blobs before.

Reviewed-by: Dave Hansen <[email protected]>

2019-06-25 19:55:32

by Dave Hansen

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On 6/25/19 10:00 AM, Alexander Duyck wrote:
> Basically what we are doing is inflating the memory size we can report
> by inserting voids into the free memory areas. In my mind that matches
> up very well with what "aeration" is. It is similar to balloon in
> functionality, however instead of inflating the balloon we are
> inflating the free_list for higher order free areas by creating voids
> where the madvised pages were.

OK, then call it "free page auto ballooning" or "auto ballooning" or
"allocator ballooning". s390 calls them "unused pages".

Any of those things are clearer and more meaningful than "page aeration"
to me.

2019-06-25 19:55:39

by Dave Hansen

Subject: Re: [PATCH v1 2/6] mm: Move set/get_pcppage_migratetype to mmzone.h

On 6/19/19 3:33 PM, Alexander Duyck wrote:
> In order to support page aeration it will be necessary to store and
> retrieve the migratetype of a page. To enable that I am moving the set and
> get operations for pcppage_migratetype into the mmzone header so that they
> can be used when adding or removing pages from the free lists.
...
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4c07af2cfc2f..6f8fd5c1a286 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h

Not mm/internal.h?

2019-06-25 19:56:18

by Dave Hansen

Subject: Re: [PATCH v1 1/6] mm: Adjust shuffle code to allow for future coalescing

On 6/19/19 3:33 PM, Alexander Duyck wrote:
>
> This patch is meant to move the head/tail adding logic out of the shuffle
> code and into the __free_one_page function since ultimately that is where
> it is really needed anyway. By doing this we should be able to reduce the
> overhead and can consolidate all of the list addition bits in one spot.

2019-06-25 19:56:52

by Dave Hansen

Subject: Re: [PATCH v1 3/6] mm: Use zone and order instead of free area in free_list manipulators

On 6/19/19 3:33 PM, Alexander Duyck wrote:
> - move_to_free_area(page, &zone->free_area[order], migratetype);
> + move_to_free_area(page, zone, order, migratetype);

This certainly looks nicer. But the naming is a bit goofy now because
you're talking about free areas, but there's no free area to be seen.
If anything, isn't it moving to a free_list[]? It's actually going to
zone->free_area[]->free_list[], so the free area seems rather
inconsequential in the entire thing. The (zone/order/migratetype)
combination specifies a free_list[] not a free area anyway.

2019-06-25 20:04:17

by Dave Hansen

Subject: Re: [PATCH v1 4/6] mm: Introduce "aerated" pages

> +static inline void set_page_aerated(struct page *page,
> + struct zone *zone,
> + unsigned int order,
> + int migratetype)
> +{
> +#ifdef CONFIG_AERATION
> + /* update aerated page accounting */
> + zone->free_area[order].nr_free_aerated++;
> +
> + /* record migratetype and flag page as aerated */
> + set_pcppage_migratetype(page, migratetype);
> + __SetPageAerated(page);
> +#endif
> +}

Please don't refer to code before you introduce it, even if you #ifdef
it. I went looking back in the series for the PageAerated() definition,
but didn't think to look forward.

Also, it doesn't make any sense to me that you would need to set the
migratetype here. Isn't it set earlier in the allocator? Also, when
can this function be called? There's obviously some locking in place
because of the __Set, but what are they?

> +static inline void clear_page_aerated(struct page *page,
> + struct zone *zone,
> + struct free_area *area)
> +{
> +#ifdef CONFIG_AERATION
> + if (likely(!PageAerated(page)))
> + return;

Logically, why would you ever clear_page_aerated() on a page that's not
aerated? Comments needed.

BTW, I already hate typing aerated. :)

> + __ClearPageAerated(page);
> + area->nr_free_aerated--;
> +#endif
> +}

More non-atomic flag clears. Still no comments.


> @@ -787,10 +790,10 @@ static inline void add_to_free_area(struct page *page, struct zone *zone,
> static inline void add_to_free_area_tail(struct page *page, struct zone *zone,
> unsigned int order, int migratetype)
> {
> - struct free_area *area = &zone->free_area[order];
> + struct list_head *tail = aerator_get_tail(zone, order, migratetype);

There is no logical change in this patch from this line. That's
unfortunate because I can't see the change in logic that's presumably
coming. You'll presumably change aerator_get_tail(), but then I'll have
to remember that this line is here and come back to it from a later patch.

If it *doesn't* change behavior, it has no business being called
aerator_...().

This series seems rather suboptimal for reviewing.

> - list_add_tail(&page->lru, &area->free_list[migratetype]);
> - area->nr_free++;
> + list_add_tail(&page->lru, tail);
> + zone->free_area[order].nr_free++;
> }
>
> /* Used for pages which are on another list */
> @@ -799,6 +802,8 @@ static inline void move_to_free_area(struct page *page, struct zone *zone,
> {
> struct free_area *area = &zone->free_area[order];
>
> + clear_page_aerated(page, zone, area);
> +
> list_move(&page->lru, &area->free_list[migratetype]);
> }

It's not immediately clear to me why moving a page should clear
aeration. A comment would help make it clear.

> @@ -868,10 +869,11 @@ static inline struct capture_control *task_capc(struct zone *zone)
> static inline void __free_one_page(struct page *page,
> unsigned long pfn,
> struct zone *zone, unsigned int order,
> - int migratetype)
> + int migratetype, bool aerated)
> {
> struct capture_control *capc = task_capc(zone);
> unsigned long uninitialized_var(buddy_pfn);
> + bool fully_aerated = aerated;
> unsigned long combined_pfn;
> unsigned int max_order;
> struct page *buddy;
> @@ -902,6 +904,11 @@ static inline void __free_one_page(struct page *page,
> goto done_merging;
> if (!page_is_buddy(page, buddy, order))
> goto done_merging;
> +
> + /* assume buddy is not aerated */
> + if (aerated)
> + fully_aerated = false;

So, "full" vs. "partial" is with respect to high-order pages? Why not
just check the page flag on the buddy?

> /*
> * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
> * merge with it and move up one order.
> @@ -943,11 +950,17 @@ static inline void __free_one_page(struct page *page,
> done_merging:
> set_page_order(page, order);
>
> - if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
> + if (aerated ||
> + buddy_merge_likely(pfn, buddy_pfn, page, order) ||
> is_shuffle_tail_page(order))
> add_to_free_area_tail(page, zone, order, migratetype);
> else
> add_to_free_area(page, zone, order, migratetype);

Aerated pages always go to the tail? Ahh, so they don't get consumed
quickly and have to be undone? Comments, please.

> + if (fully_aerated)
> + set_page_aerated(page, zone, order, migratetype);
> + else
> + aerator_notify_free(zone, order);
> }

What is this notifying for? It's not like this is some opaque
registration interface. What does this *do*?

> @@ -2127,6 +2140,77 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> return NULL;
> }
>
> +#ifdef CONFIG_AERATION
> +/**
> + * get_aeration_page - Provide a "raw" page for aeration by the aerator
> + * @zone: Zone to draw pages from
> + * @order: Order to draw pages from
> + * @migratetype: Migratetype to draw pages from

FWIW, kerneldoc is a waste of bytes here. Please use it sparingly.

> + * This function will obtain a page from above the boundary. As a result
> + * we can guarantee the page has not been aerated.

This is the first mention of a boundary. That's not good since I have
no idea at this point what the boundary is for or between.


> + * The page will have the migrate type and order stored in the page
> + * metadata.
> + *
> + * Return: page pointer if raw page found, otherwise NULL
> + */
> +struct page *get_aeration_page(struct zone *zone, unsigned int order,
> + int migratetype)
> +{
> + struct free_area *area = &(zone->free_area[order]);
> + struct list_head *list = &area->free_list[migratetype];
> + struct page *page;
> +
> + /* Find a page of the appropriate size in the preferred list */

I don't get the size comment. Hasn't this already been given an order?

> + page = list_last_entry(aerator_get_tail(zone, order, migratetype),
> + struct page, lru);
> + list_for_each_entry_from_reverse(page, list, lru) {
> + if (PageAerated(page)) {
> + page = list_first_entry(list, struct page, lru);
> + if (PageAerated(page))
> + break;
> + }

This confuses me. It looks for a page, then goes to the next page and
checks again? Why check twice? Why is a function looking for an
aerated page that finds *two* pages returning NULL?

I'm stumped.

> + del_page_from_free_area(page, zone, order);
> +
> + /* record migratetype and order within page */
> + set_pcppage_migratetype(page, migratetype);
> + set_page_private(page, order);
> + __mod_zone_freepage_state(zone, -(1 << order), migratetype);
> +
> + return page;
> + }
> +
> + return NULL;
> +}

Oh, so this is trying to find a page _for_ aerating.
"get_aeration_page()" does not convey that. Can that improved?
get_page_for_aeration()?

Rather than talk about boundaries, wouldn't a better description have been:

Similar to allocation, this function removes a page from the
free lists. However, it only removes unaerated pages.

> +/**
> + * put_aeration_page - Return a now-aerated "raw" page back where we got it
> + * @zone: Zone to return pages to
> + * @page: Previously "raw" page that can now be returned after aeration
> + *
> + * This function will pull the migratetype and order information out
> + * of the page and attempt to return it where it found it.
> + */
> +void put_aeration_page(struct zone *zone, struct page *page)
> +{
> + unsigned int order, mt;
> + unsigned long pfn;
> +
> + mt = get_pcppage_migratetype(page);
> + pfn = page_to_pfn(page);
> +
> + if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
> + mt = get_pfnblock_migratetype(page, pfn);
> +
> + order = page_private(page);
> + set_page_private(page, 0);
> +
> + __free_one_page(page, pfn, zone, order, mt, true);
> +}
> +#endif /* CONFIG_AERATION */

Yikes. This seems to have glossed over some pretty big aspects here.
Pages which are being aerated are not free. Pages which are freed are
diverted to be aerated before becoming free. Right? That sounds like
two really important things to add to a changelog.

> /*
> * This array describes the order lists are fallen back to when
> * the free lists for the desirable migrate type are depleted
> @@ -5929,9 +6013,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
> static void __meminit zone_init_free_lists(struct zone *zone)
> {
> unsigned int order, t;
> - for_each_migratetype_order(order, t) {
> + for_each_migratetype_order(order, t)
> INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> +
> + for (order = MAX_ORDER; order--; ) {
> zone->free_area[order].nr_free = 0;
> + zone->free_area[order].nr_free_aerated = 0;
> }
> }
>
>

2019-06-25 20:25:34

by Dave Hansen

Subject: Re: [PATCH v1 5/6] mm: Add logic for separating "aerated" pages from "raw" pages

On 6/19/19 3:33 PM, Alexander Duyck wrote:
> Add a set of pointers we shall call "boundary" which represents the upper
> boundary between the "raw" and "aerated" pages. The general idea is that in
> order for a page to cross from one side of the boundary to the other it
> will need to go through the aeration treatment.

Aha! The mysterious "boundary"!

But, how can you introduce code that deals with boundaries before
introducing the boundary itself? Or was that comment misplaced?

FWIW, I'm not a fan of these commit messages. They are really hard to
map to the data structures.

One goal in this set is to avoid creating new data structures.
We accomplish that by reusing the free lists to hold aerated and
non-aerated pages. But, in order to use the existing free list,
we need a boundary to separate aerated from raw.

Further:

Pages are temporarily removed from the free lists while aerating
them.

This needs a justification why you chose this path, and also what the
larger implications are.

> By doing this we should be able to make certain that we keep the aerated
> pages as one contiguous block on the end of each free list. This will allow
> us to efficiently walk the free lists whenever we need to go in and start
> processing hints to the hypervisor that the pages are no longer in use.

You don't really walk them though, right? It *keeps* you from having to
ever walk the lists.

I also don't see what the boundary has to do with aerated pages being on
the tail of the list. If you want them on the tail, you just always
list_add_tail() them.
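The interaction Dave is poking at can be sketched in userspace: the boundary is what lets a *raw* page be "tail"-added while still landing in front of the contiguous aerated block at the true tail. This is a minimal sketch with hypothetical names, not the patch's code; the kernel version keeps one boundary per order/migratetype.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list, mimicking the kernel's list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

/* Add n just before head, i.e. at the tail of the list head anchors. */
static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* One free list plus a boundary: everything from *boundary to the end
 * is "aerated"; raw tail-adds must land just before that block. */
struct free_list {
	struct list_head head;
	struct list_head *boundary;	/* plays the __aerator_get_tail() role */
};

static void free_list_init(struct free_list *fl)
{
	list_init(&fl->head);
	fl->boundary = &fl->head;	/* no aerated pages yet */
}

/* Raw pages are tail-added *before* the aerated block. */
static void add_raw_tail(struct free_list *fl, struct list_head *page)
{
	list_add_tail(page, fl->boundary);
}

/* Aerated pages are added in front of the aerated block and then
 * become the new boundary, growing the block backward from the tail. */
static void add_aerated(struct free_list *fl, struct list_head *page)
{
	list_add_tail(page, fl->boundary);
	fl->boundary = page;
}

/* Demo: one aerated node, then one raw tail-add; the raw node must
 * end up in front of the aerated one even though both were "tail" adds. */
static int demo_raw_precedes_aerated(void)
{
	struct free_list fl;
	struct list_head a, r;

	free_list_init(&fl);
	add_aerated(&fl, &a);
	add_raw_tail(&fl, &r);
	return fl.head.next == &r && r.next == &a && a.next == &fl.head;
}
```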

> An added advantage to this approach is that we should be reducing the
> overall memory footprint of the guest as it will be more likely to recycle
> warm pages versus the aerated pages that are likely to be cache cold.

I'm confused. Isn't an aerated page non-present on the guest? That's
worse than cache cold. It costs a VMEXIT to bring back in.

> Since we will only be aerating one zone at a time we keep the boundary
> limited to being defined for just the zone we are currently placing aerated
> pages into. Doing this we can keep the number of additional poitners needed
> quite small.

pointers ^

> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
> static inline struct list_head *aerator_get_tail(struct zone *zone,
> unsigned int order,
> int migratetype)
> {
> +#ifdef CONFIG_AERATION
> + if (order >= AERATOR_MIN_ORDER &&
> + test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> + return __aerator_get_tail(order, migratetype);
> +#endif
> return &zone->free_area[order].free_list[migratetype];
> }

Logically, I have no idea what this is doing. "Go get pages out of the
aerated list?" "raw list"? Needs comments.

> +static inline void aerator_del_from_boundary(struct page *page,
> + struct zone *zone)
> +{
> + if (PageAerated(page) && test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> + __aerator_del_from_boundary(page, zone);
> +}
> +
> static inline void set_page_aerated(struct page *page,
> struct zone *zone,
> unsigned int order,
> @@ -28,6 +59,9 @@ static inline void set_page_aerated(struct page *page,
> /* record migratetype and flag page as aerated */
> set_pcppage_migratetype(page, migratetype);
> __SetPageAerated(page);
> +
> + /* update boundary of new migratetype and record it */
> + aerator_add_to_boundary(page, zone);
> #endif
> }
>
> @@ -39,11 +73,19 @@ static inline void clear_page_aerated(struct page *page,
> if (likely(!PageAerated(page)))
> return;
>
> + /* push boundary back if we removed the upper boundary */
> + aerator_del_from_boundary(page, zone);
> +
> __ClearPageAerated(page);
> area->nr_free_aerated--;
> #endif
> }
>
> +static inline unsigned long aerator_raw_pages(struct free_area *area)
> +{
> + return area->nr_free - area->nr_free_aerated;
> +}
> +
> /**
> * aerator_notify_free - Free page notification that will start page processing
> * @zone: Pointer to current zone of last page processed
> @@ -57,5 +99,20 @@ static inline void clear_page_aerated(struct page *page,
> */
> static inline void aerator_notify_free(struct zone *zone, int order)
> {
> +#ifdef CONFIG_AERATION
> + if (!static_key_false(&aerator_notify_enabled))
> + return;
> + if (order < AERATOR_MIN_ORDER)
> + return;
> + if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> + return;
> + if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> + return;
> +
> + __aerator_notify(zone);
> +#endif
> }

Again, this is really hard to review. I see some possible overhead in a
fast path here, but only if aerator_notify_free() is called in a fast
path. Is it? I have to go digging in the previous patches to figure
that out.

> +static struct aerator_dev_info *a_dev_info;
> +struct static_key aerator_notify_enabled;
> +
> +struct list_head *boundary[MAX_ORDER - AERATOR_MIN_ORDER][MIGRATE_TYPES];
> +
> +static void aerator_reset_boundary(struct zone *zone, unsigned int order,
> + unsigned int migratetype)
> +{
> + boundary[order - AERATOR_MIN_ORDER][migratetype] =
> + &zone->free_area[order].free_list[migratetype];
> +}
> +
> +#define for_each_aerate_migratetype_order(_order, _type) \
> + for (_order = MAX_ORDER; _order-- != AERATOR_MIN_ORDER;) \
> + for (_type = MIGRATE_TYPES; _type--;)
> +
> +static void aerator_populate_boundaries(struct zone *zone)
> +{
> + unsigned int order, mt;
> +
> + if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> + return;
> +
> + for_each_aerate_migratetype_order(order, mt)
> + aerator_reset_boundary(zone, order, mt);
> +
> + set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
> +}

This function appears misnamed as it's doing more than boundary
manipulation.

> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
> +{
> + return boundary[order - AERATOR_MIN_ORDER][migratetype];
> +}
> +
> +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
> +{
> + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> + int mt = get_pcppage_migratetype(page);
> + struct list_head **tail = &boundary[order][mt];
> +
> + if (*tail == &page->lru)
> + *tail = page->lru.next;
> +}

Ewww. Please just track the page that's the boundary, not the list head
inside the page that's the boundary.

This also at least needs one comment along the lines of: Move the
boundary if the page representing the boundary is being removed.


> +void aerator_add_to_boundary(struct page *page, struct zone *zone)
> +{
> + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> + int mt = get_pcppage_migratetype(page);
> + struct list_head **tail = &boundary[order][mt];
> +
> + *tail = &page->lru;
> +}
> +
> +void aerator_shutdown(void)
> +{
> + static_key_slow_dec(&aerator_notify_enabled);
> +
> + while (atomic_read(&a_dev_info->refcnt))
> + msleep(20);

We generally frown on open-coded check/sleep loops. What is this for?

> + WARN_ON(!list_empty(&a_dev_info->batch));
> +
> + a_dev_info = NULL;
> +}
> +EXPORT_SYMBOL_GPL(aerator_shutdown);
> +
> +static void aerator_schedule_initial_aeration(void)
> +{
> + struct zone *zone;
> +
> + for_each_populated_zone(zone) {
> + spin_lock(&zone->lock);
> + __aerator_notify(zone);
> + spin_unlock(&zone->lock);
> + }
> +}

Why do we need an initial aeration?

> +int aerator_startup(struct aerator_dev_info *sdev)
> +{
> + if (a_dev_info)
> + return -EBUSY;
> +
> + INIT_LIST_HEAD(&sdev->batch);
> + atomic_set(&sdev->refcnt, 0);
> +
> + a_dev_info = sdev;
> + aerator_schedule_initial_aeration();
> +
> + static_key_slow_inc(&aerator_notify_enabled);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(aerator_startup);
> +
> +static void aerator_fill(struct zone *zone)
> +{
> + struct list_head *batch = &a_dev_info->batch;
> + int budget = a_dev_info->capacity;

Where does capacity come from?

> + unsigned int order, mt;
> +
> + for_each_aerate_migratetype_order(order, mt) {
> + struct page *page;
> +
> + /*
> + * Pull pages from free list until we have drained
> + * it or we have filled the batch reactor.
> + */

What's a reactor?

> + while ((page = get_aeration_page(zone, order, mt))) {
> + list_add_tail(&page->lru, batch);
> +
> + if (!--budget)
> + return;
> + }
> + }
> +
> + /*
> + * If there are no longer enough free pages to fully populate
> + * the aerator, then we can just shut it down for this zone.
> + */
> + clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
> + atomic_dec(&a_dev_info->refcnt);
> +}

Huh, so this is the number of threads doing aeration? Didn't we just
make a big deal about there only being one zone being aerated at a time?
Or, did I misunderstand what refcnt is from its lack of clear
documentation?

> +static void aerator_drain(struct zone *zone)
> +{
> + struct list_head *list = &a_dev_info->batch;
> + struct page *page;
> +
> + /*
> + * Drain the now aerated pages back into their respective
> + * free lists/areas.
> + */
> + while ((page = list_first_entry_or_null(list, struct page, lru))) {
> + list_del(&page->lru);
> + put_aeration_page(zone, page);
> + }
> +}
> +
> +static void aerator_scrub_zone(struct zone *zone)
> +{
> + /* See if there are any pages to pull */
> + if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> + return;

How would someone ask for the zone to be scrubbed when aeration has not
been requested?

> + spin_lock(&zone->lock);
> +
> + do {
> + aerator_fill(zone);

Should this say:

/* Pull pages out of the allocator into a local list */

?

> + if (list_empty(&a_dev_info->batch))
> + break;

/* no pages were acquired, give up */

> + spin_unlock(&zone->lock);
> +
> + /*
> + * Start aerating the pages in the batch, and then
> + * once that is completed we can drain the reactor
> + * and refill the reactor, restarting the cycle.
> + */
> + a_dev_info->react(a_dev_info);

After reading (most of) this set, I'm going to reiterate my suggestion:
please find new nomenclature. I can't parse that comment and I don't
know whether that's because it's a bad comment or whether you really
mean "cycle" the english word or "cycle" referring to some new
definition relating to this patch set.

I've asked quite nicely a few times now.

> + spin_lock(&zone->lock);
> +
> + /*
> + * Guarantee boundaries are populated before we
> + * start placing aerated pages in the zone.
> + */
> + aerator_populate_boundaries(zone);

aerator_populate_boundaries() has apparent concurrency checks via
ZONE_AERATION_ACTIVE. Why are those needed when this is called under a
spinlock?
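The fill/react/drain cycle being reviewed above can be sketched in userspace (hypothetical names and a stubbed hint step; the kernel version runs under zone->lock and drops it around the hinting call):

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY 4	/* stand-in for a_dev_info->capacity */

struct batch {
	int pages[CAPACITY];
	size_t count;
};

/* Pull up to CAPACITY raw "pages" from a free array into the batch. */
static size_t fill(struct batch *b, int *free_pages, size_t *nr_free)
{
	b->count = 0;
	while (*nr_free && b->count < CAPACITY)
		b->pages[b->count++] = free_pages[--(*nr_free)];
	return b->count;
}

/* "React": hint each batched page to the hypervisor (stubbed out;
 * a virtio request or madvise() would happen here). */
static void react(struct batch *b)
{
	(void)b;
}

/* Drain hinted pages back, counting them as aerated. */
static size_t drain(struct batch *b, size_t *nr_aerated)
{
	size_t n = b->count;

	*nr_aerated += n;
	b->count = 0;
	return n;
}

/* One full scrub: cycle fill/react/drain until fill() comes up empty. */
static size_t scrub(int *free_pages, size_t nr_free)
{
	struct batch b;
	size_t aerated = 0;

	while (fill(&b, free_pages, &nr_free)) {
		react(&b);
		drain(&b, &aerated);
	}
	return aerated;
}

/* Demo: ten raw pages should all come out the other end aerated. */
static size_t demo_scrub_ten(void)
{
	int pages[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

	return scrub(pages, 10);
}
```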

2019-06-26 09:03:36

by Christophe de Dinechin

[permalink] [raw]
Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment


David Hildenbrand writes:

> On 20.06.19 00:32, Alexander Duyck wrote:
>> This series provides an asynchronous means of hinting to a hypervisor
>> that a guest page is no longer in use and can have the data associated
>> with it dropped. To do this I have implemented functionality that allows
>> for what I am referring to as waste page treatment.
>>
>> I have based many of the terms and functionality off of waste water
>> treatment, the idea for the similarity occurred to me after I had reached
>> the point of referring to the hints as "bubbles", as the hints used the
>> same approach as the balloon functionality but would disappear if they
>> were touched, as a result I started to think of the virtio device as an
>> aerator. The general idea with all of this is that the guest should be
>> treating the unused pages so that when they end up heading "downstream"
>> to either another guest, or back at the host they will not need to be
>> written to swap.
>>
>> When the number of "dirty" pages in a given free_area exceeds our high
>> water mark, which is currently 32, we will schedule the aeration task to
>> start going through and scrubbing the zone. While the scrubbing is taking
>> place a boundary will be defined that we use to separate the "aerated"
>> pages from the "dirty" ones. We use the ZONE_AERATION_ACTIVE bit to flag
>> when these boundaries are in place.
>
> I still *detest* the terminology, sorry. Can't you come up with a
> simpler terminology that makes more sense in the context of operating
> systems and pages we want to hint to the hypervisor? (that is the only
> use case you are using it for so far)

FWIW, I thought the terminology made sense, in particular given the analogy
with the balloon driver. Operating systems in general, and Linux in
particular, already use tons of analogy-supported terminology. In
particular, a "waste page treatment" terminology is not very far from
the very common "garbage collection" or "scrubbing" wordings. I would find
"hinting" much less specific, for example.

Usually, the phrases that stick are somewhat unique while providing a
useful analogy to serve as a reminder of what the thing actually
does. IMHO, it's the case here on both fronts, so I like it.

>
>>
>> I am leaving a number of things hard-coded such as limiting the lowest
>> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
>> determine what batch size it wants to allocate to process the hints.
>>
>> My primary testing has just been to verify the memory is being freed after
>> allocation by running memhog 32g in the guest and watching the total free
>> memory via /proc/meminfo on the host. With this I have verified most of
>> the memory is freed after each iteration. As far as performance I have
>> been mainly focusing on the will-it-scale/page_fault1 test running with
>> 16 vcpus. With that I have seen a less than 1% difference between the
>
> 1% throughout all benchmarks? Guess that is quite good.
>
>> base kernel without these patches, with the patches and virtio-balloon
>> disabled, and with the patches and virtio-balloon enabled with hinting.
>>
>> Changes from the RFC:
>> Moved aeration requested flag out of aerator and into zone->flags.
>> Moved boundary out of free_area and into local variables for aeration.
>> Moved aeration cycle out of interrupt and into workqueue.
>> Left nr_free as total pages instead of splitting it between raw and aerated.
>> Combined size and physical address values in virtio ring into one 64b value.
>> Restructured the patch set to reduce patches from 11 to 6.
>>
>
> I'm planning to look into the details, but will be on PTO for two weeks
> starting this Saturday (and still have other things to finish first :/ ).
>
>> ---
>>
>> Alexander Duyck (6):
>> mm: Adjust shuffle code to allow for future coalescing
>> mm: Move set/get_pcppage_migratetype to mmzone.h
>> mm: Use zone and order instead of free area in free_list manipulators
>> mm: Introduce "aerated" pages
>> mm: Add logic for separating "aerated" pages from "raw" pages
>> virtio-balloon: Add support for aerating memory via hinting
>>
>>
>> drivers/virtio/Kconfig | 1
>> drivers/virtio/virtio_balloon.c | 110 ++++++++++++++
>> include/linux/memory_aeration.h | 118 +++++++++++++++
>> include/linux/mmzone.h | 113 +++++++++------
>> include/linux/page-flags.h | 8 +
>> include/uapi/linux/virtio_balloon.h | 1
>> mm/Kconfig | 5 +
>> mm/Makefile | 1
>> mm/aeration.c | 270 +++++++++++++++++++++++++++++++++++
>> mm/page_alloc.c | 203 ++++++++++++++++++--------
>> mm/shuffle.c | 24 ---
>> mm/shuffle.h | 35 +++++
>> 12 files changed, 753 insertions(+), 136 deletions(-)
>> create mode 100644 include/linux/memory_aeration.h
>> create mode 100644 mm/aeration.c
>
> Compared to
>
> 17 files changed, 838 insertions(+), 86 deletions(-)
> create mode 100644 include/linux/memory_aeration.h
> create mode 100644 mm/aeration.c
>
> this looks like a good improvement :)


--
Cheers,
Christophe de Dinechin (IRC c3d)

2019-06-26 09:14:41

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On 26.06.19 11:01, Christophe de Dinechin wrote:
>
> David Hildenbrand writes:
>
>> On 20.06.19 00:32, Alexander Duyck wrote:
>>> This series provides an asynchronous means of hinting to a hypervisor
>>> that a guest page is no longer in use and can have the data associated
>>> with it dropped. To do this I have implemented functionality that allows
>>> for what I am referring to as waste page treatment.
>>>
>>> I have based many of the terms and functionality off of waste water
>>> treatment, the idea for the similarity occurred to me after I had reached
>>> the point of referring to the hints as "bubbles", as the hints used the
>>> same approach as the balloon functionality but would disappear if they
>>> were touched, as a result I started to think of the virtio device as an
>>> aerator. The general idea with all of this is that the guest should be
>>> treating the unused pages so that when they end up heading "downstream"
>>> to either another guest, or back at the host they will not need to be
>>> written to swap.
>>>
>>> When the number of "dirty" pages in a given free_area exceeds our high
>>> water mark, which is currently 32, we will schedule the aeration task to
>>> start going through and scrubbing the zone. While the scrubbing is taking
>>> place a boundary will be defined that we use to separate the "aerated"
>>> pages from the "dirty" ones. We use the ZONE_AERATION_ACTIVE bit to flag
>>> when these boundaries are in place.
>>
>> I still *detest* the terminology, sorry. Can't you come up with a
>> simpler terminology that makes more sense in the context of operating
>> systems and pages we want to hint to the hypervisor? (that is the only
>> use case you are using it for so far)
>
> FWIW, I thought the terminology made sense, in particular given the analogy
> with the balloon driver. Operating systems in general, and Linux in
> particular, already use tons of analogy-supported terminology. In
> particular, a "waste page treatment" terminology is not very far from
> the very common "garbage collection" or "scrubbing" wordings. I would find
> "hinting" much less specific, for example.
>
> Usually, the phrases that stick are somewhat unique while providing a
> useful analogy to serve as a reminder of what the thing actually
> does. IMHO, it's the case here on both fronts, so I like it.

While something like "waste pages" makes sense, "aeration" is far out of
my comfort zone.

An analogy is like a joke. If you have to explain it, it's not that
good. (see, that was a good analogy ;) ).

--

Thanks,

David / dhildenb

2019-06-28 19:49:55

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] mm: Adjust shuffle code to allow for future coalescing

On Tue, Jun 25, 2019 at 12:56 AM David Hildenbrand <[email protected]> wrote:
>
> On 20.06.19 00:33, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > This patch is meant to move the head/tail adding logic out of the shuffle
> > code and into the __free_one_page function since ultimately that is where
> > it is really needed anyway. By doing this we should be able to reduce the
> > overhead and can consolidate all of the list addition bits in one spot.
> >
> > Signed-off-by: Alexander Duyck <[email protected]>
> > ---
> > include/linux/mmzone.h | 12 --------
> > mm/page_alloc.c | 70 +++++++++++++++++++++++++++---------------------
> > mm/shuffle.c | 24 ----------------
> > mm/shuffle.h | 35 ++++++++++++++++++++++++
> > 4 files changed, 74 insertions(+), 67 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 427b79c39b3c..4c07af2cfc2f 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -116,18 +116,6 @@ static inline void add_to_free_area_tail(struct page *page, struct free_area *ar
> > area->nr_free++;
> > }
> >
> > -#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
> > -/* Used to preserve page allocation order entropy */
> > -void add_to_free_area_random(struct page *page, struct free_area *area,
> > - int migratetype);
> > -#else
> > -static inline void add_to_free_area_random(struct page *page,
> > - struct free_area *area, int migratetype)
> > -{
> > - add_to_free_area(page, area, migratetype);
> > -}
> > -#endif
> > -
> > /* Used for pages which are on another list */
> > static inline void move_to_free_area(struct page *page, struct free_area *area,
> > int migratetype)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f4651a09948c..ec344ce46587 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -830,6 +830,36 @@ static inline struct capture_control *task_capc(struct zone *zone)
> > #endif /* CONFIG_COMPACTION */
> >
> > /*
> > + * If this is not the largest possible page, check if the buddy
> > + * of the next-highest order is free. If it is, it's possible
> > + * that pages are being freed that will coalesce soon. In case,
> > + * that is happening, add the free page to the tail of the list
> > + * so it's less likely to be used soon and more likely to be merged
> > + * as a higher order page
> > + */
> > +static inline bool
> > +buddy_merge_likely(unsigned long pfn, unsigned long buddy_pfn,
> > + struct page *page, unsigned int order)
> > +{
> > + struct page *higher_page, *higher_buddy;
> > + unsigned long combined_pfn;
> > +
> > + if (is_shuffle_order(order) || order >= (MAX_ORDER - 2))
>
> My intuition tells me you can drop the () around "MAX_ORDER - 2"

I dunno, I always kind of prefer to use the parentheses in these cases
for readability. I suppose I can drop it though.

> > + return false;
>
> Guess the "is_shuffle_order(order)" check should rather be performed by
> the caller, before calling this function.

I could do that, however I am not sure it adds much. I am pretty sure
the resultant code would be the same. Where things would be a bit more
complicated is that I would then have to probably look at adding a
variable to trap the output of is_shuffle_tail_page or
buddy_merge_likely.

> > +
> > + if (!pfn_valid_within(buddy_pfn))
> > + return false;
> > +
> > + combined_pfn = buddy_pfn & pfn;
> > + higher_page = page + (combined_pfn - pfn);
> > + buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
> > + higher_buddy = higher_page + (buddy_pfn - combined_pfn);
> > +
> > + return pfn_valid_within(buddy_pfn) &&
> > + page_is_buddy(higher_page, higher_buddy, order + 1);
> > +}
> > +
> > +/*
> > * Freeing function for a buddy system allocator.
> > *
> > * The concept of a buddy system is to maintain direct-mapped table
> > @@ -858,11 +888,12 @@ static inline void __free_one_page(struct page *page,
> > struct zone *zone, unsigned int order,
> > int migratetype)
> > {
> > - unsigned long combined_pfn;
> > + struct capture_control *capc = task_capc(zone);
> > unsigned long uninitialized_var(buddy_pfn);
> > - struct page *buddy;
> > + unsigned long combined_pfn;
> > + struct free_area *area;
> > unsigned int max_order;
> > - struct capture_control *capc = task_capc(zone);
> > + struct page *buddy;
> >
> > max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
> >
> > @@ -931,35 +962,12 @@ static inline void __free_one_page(struct page *page,
> > done_merging:
> > set_page_order(page, order);
> >
> > - /*
> > - * If this is not the largest possible page, check if the buddy
> > - * of the next-highest order is free. If it is, it's possible
> > - * that pages are being freed that will coalesce soon. In case,
> > - * that is happening, add the free page to the tail of the list
> > - * so it's less likely to be used soon and more likely to be merged
> > - * as a higher order page
> > - */
> > - if ((order < MAX_ORDER-2) && pfn_valid_within(buddy_pfn)
> > - && !is_shuffle_order(order)) {
> > - struct page *higher_page, *higher_buddy;
> > - combined_pfn = buddy_pfn & pfn;
> > - higher_page = page + (combined_pfn - pfn);
> > - buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
> > - higher_buddy = higher_page + (buddy_pfn - combined_pfn);
> > - if (pfn_valid_within(buddy_pfn) &&
> > - page_is_buddy(higher_page, higher_buddy, order + 1)) {
> > - add_to_free_area_tail(page, &zone->free_area[order],
> > - migratetype);
> > - return;
> > - }
> > - }
> > -
> > - if (is_shuffle_order(order))
> > - add_to_free_area_random(page, &zone->free_area[order],
> > - migratetype);
> > + area = &zone->free_area[order];
> > + if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
> > + is_shuffle_tail_page(order))
> > + add_to_free_area_tail(page, area, migratetype);
>
> I would prefer here something like
>
> if (is_shuffle_order(order)) {
> if (add_shuffle_order_to_tail(order))
> add_to_free_area_tail(page, area, migratetype);
> else
> add_to_free_area(page, area, migratetype);
> } else if (buddy_merge_likely(pfn, buddy_pfn, page, order)) {
> add_to_free_area_tail(page, area, migratetype);
> } else {
> add_to_free_area(page, area, migratetype);
> }
>
> dropping "is_shuffle_order()" from buddy_merge_likely()
>
> Especially, the name "is_shuffle_tail_page(order)" suggests that you are
> passing a page.

Okay, I can look at renaming that. I will probably look at just using
a variable instead of nesting the if statements like that. Otherwise
that is going to get pretty messy fairly quickly.

> > else
> > - add_to_free_area(page, &zone->free_area[order], migratetype);
> > -
> > + add_to_free_area(page, area, migratetype);
> > }
> >
> > /*
> > diff --git a/mm/shuffle.c b/mm/shuffle.c
> > index 3ce12481b1dc..55d592e62526 100644
> > --- a/mm/shuffle.c
> > +++ b/mm/shuffle.c
> > @@ -4,7 +4,6 @@
> > #include <linux/mm.h>
> > #include <linux/init.h>
> > #include <linux/mmzone.h>
> > -#include <linux/random.h>
> > #include <linux/moduleparam.h>
> > #include "internal.h"
> > #include "shuffle.h"
> > @@ -182,26 +181,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
> > for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> > shuffle_zone(z);
> > }
> > -
> > -void add_to_free_area_random(struct page *page, struct free_area *area,
> > - int migratetype)
> > -{
> > - static u64 rand;
> > - static u8 rand_bits;
> > -
> > - /*
> > - * The lack of locking is deliberate. If 2 threads race to
> > - * update the rand state it just adds to the entropy.
> > - */
> > - if (rand_bits == 0) {
> > - rand_bits = 64;
> > - rand = get_random_u64();
> > - }
> > -
> > - if (rand & 1)
> > - add_to_free_area(page, area, migratetype);
> > - else
> > - add_to_free_area_tail(page, area, migratetype);
> > - rand_bits--;
> > - rand >>= 1;
> > -}
> > diff --git a/mm/shuffle.h b/mm/shuffle.h
> > index 777a257a0d2f..3f4edb60a453 100644
> > --- a/mm/shuffle.h
> > +++ b/mm/shuffle.h
> > @@ -3,6 +3,7 @@
> > #ifndef _MM_SHUFFLE_H
> > #define _MM_SHUFFLE_H
> > #include <linux/jump_label.h>
> > +#include <linux/random.h>
> >
> > /*
> > * SHUFFLE_ENABLE is called from the command line enabling path, or by
> > @@ -43,6 +44,35 @@ static inline bool is_shuffle_order(int order)
> > return false;
> > return order >= SHUFFLE_ORDER;
> > }
> > +
> > +static inline bool is_shuffle_tail_page(int order)
> > +{
> > + static u64 rand;
> > + static u8 rand_bits;
> > + u64 rand_old;
> > +
> > + if (!is_shuffle_order(order))
> > + return false;
> > +
> > + /*
> > + * The lack of locking is deliberate. If 2 threads race to
> > + * update the rand state it just adds to the entropy.
> > + */
> > + if (rand_bits-- == 0) {
> > + rand_bits = 64;
> > + rand = get_random_u64();
> > + }
> > +
> > + /*
> > + * Test highest order bit while shifting our random value. This
> > + * should result in us testing for the carry flag following the
> > + * shift.
> > + */
> > + rand_old = rand;
> > + rand <<= 1;
> > +
> > + return rand < rand_old;
> > +}
> > #else
> > static inline void shuffle_free_memory(pg_data_t *pgdat)
> > {
> > @@ -60,5 +90,10 @@ static inline bool is_shuffle_order(int order)
> > {
> > return false;
> > }
> > +
> > +static inline bool is_shuffle_tail_page(int order)
> > +{
> > + return false;
> > +}
> > #endif
> > #endif /* _MM_SHUFFLE_H */
> >
>
>
> --
>
> Thanks,
>
> David / dhildenb
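The bit-at-a-time coin flip in is_shuffle_tail_page() discussed above can be sketched in userspace: one 64-bit random value supplies 64 head-or-tail decisions before a refill is needed, and the "carry" test works by comparing the value before and after a left shift. A fixed alternating-bit seed stands in for get_random_u64() here purely to make the sketch deterministic; this is an illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t rand_val;
static uint8_t rand_bits;

/* Deterministic stand-in for get_random_u64(): bits 1010... */
static uint64_t fake_random_u64(void)
{
	return 0xAAAAAAAAAAAAAAAAull;
}

/* Shift the random value left and report whether the bit that fell off
 * the top (the "carry") was set; refill when the bit budget runs out. */
static bool shuffle_to_tail(void)
{
	uint64_t old;

	if (rand_bits-- == 0) {
		rand_bits = 64;
		rand_val = fake_random_u64();
	}

	old = rand_val;
	rand_val <<= 1;

	return rand_val < old;	/* true iff the top bit was set */
}
```

With the alternating seed, successive calls alternate tail/head decisions, which makes the carry trick easy to verify.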

2019-06-28 19:55:55

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 2/6] mm: Move set/get_pcppage_migratetype to mmzone.h

On Tue, Jun 25, 2019 at 11:28 AM Dave Hansen <[email protected]> wrote:
>
> On 6/19/19 3:33 PM, Alexander Duyck wrote:
> > In order to support page aeration it will be necessary to store and
> > retrieve the migratetype of a page. To enable that I am moving the set and
> > get operations for pcppage_migratetype into the mmzone header so that they
> > can be used when adding or removing pages from the free lists.
> ...
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 4c07af2cfc2f..6f8fd5c1a286 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
>
> Not mm/internal.h?

Yeah, I can probably move those there. I just need to pull the call
to set_pcppage_migratetype out of set_page_aerated and place it in
aerator_add_to_boundary.

2019-07-08 22:44:43

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 4/6] mm: Introduce "aerated" pages

On Tue, 2019-06-25 at 12:45 -0700, Dave Hansen wrote:
> > +static inline void set_page_aerated(struct page *page,
> > + struct zone *zone,
> > + unsigned int order,
> > + int migratetype)
> > +{
> > +#ifdef CONFIG_AERATION
> > + /* update aerated page accounting */
> > + zone->free_area[order].nr_free_aerated++;
> > +
> > + /* record migratetype and flag page as aerated */
> > + set_pcppage_migratetype(page, migratetype);
> > + __SetPageAerated(page);
> > +#endif
> > +}
>
> Please don't refer to code before you introduce it, even if you #ifdef
> it. I went looking back in the series for the PageAerated() definition,
> but didn't think to look forward.

Yeah, I had split this code out from patch 5, but I realized after I
submitted it I had a number of issues. The kconfig option also ended up in
patch 5 instead of showing up in patch 4.

I'll have to work on cleaning up patches 4 and 5 so that the split between
them is cleaner.

> Also, it doesn't make any sense to me that you would need to set the
> migratetype here. Isn't it set earlier in the allocator? Also, when
> can this function be called? There's obviously some locking in place
> because of the __Set, but what are they?

Generally this function is only called from inside __free_one_page so yes,
the zone lock is expected to be held.

> > +static inline void clear_page_aerated(struct page *page,
> > + struct zone *zone,
> > + struct free_area *area)
> > +{
> > +#ifdef CONFIG_AERATION
> > + if (likely(!PageAerated(page)))
> > + return;
>
> Logically, why would you ever clear_page_aerated() on a page that's not
> aerated? Comments needed.
>
> BTW, I already hate typing aerated. :)

Well I am always open to other suggestions. I could just default to
offline which is what is used by the balloon drivers. Suggestions for a
better name are always welcome.

> > + __ClearPageAerated(page);
> > + area->nr_free_aerated--;
> > +#endif
> > +}
>
> More non-atomic flag clears. Still no comments.

Yes, it is the same kind of deal as the function above. Basically we only
call this just before we clear the buddy flag in the allocator so once
again the zone lock is expected to be held at this point.

>
> > @@ -787,10 +790,10 @@ static inline void add_to_free_area(struct page *page, struct zone *zone,
> > static inline void add_to_free_area_tail(struct page *page, struct zone *zone,
> > unsigned int order, int migratetype)
> > {
> > - struct free_area *area = &zone->free_area[order];
> > + struct list_head *tail = aerator_get_tail(zone, order, migratetype);
>
> There is no logical change in this patch from this line. That's
> unfortunate because I can't see the change in logic that's presumably
> coming. You'll presumably change aerator_get_tail(), but then I'll have
> to remember that this line is here and come back to it from a later patch.
>
> If it *doesn't* change behavior, it has no business being called
> aerator_...().
>
> This series seems rather suboptimal for reviewing.

I can move that into patch 5. That would make more sense anyway since that
is where I introduce the change that adds the boundaries.

> > - list_add_tail(&page->lru, &area->free_list[migratetype]);
> > - area->nr_free++;
> > + list_add_tail(&page->lru, tail);
> > + zone->free_area[order].nr_free++;
> > }
> >
> > /* Used for pages which are on another list */
> > @@ -799,6 +802,8 @@ static inline void move_to_free_area(struct page *page, struct zone *zone,
> > {
> > struct free_area *area = &zone->free_area[order];
> >
> > + clear_page_aerated(page, zone, area);
> > +
> > list_move(&page->lru, &area->free_list[migratetype]);
> > }
>
> It's not immediately clear to me why moving a page should clear
> aeration. A comment would help make it clear.

I will do that. The main reason for clearing it is that when we move the
page there is no guarantee that the boundaries will still be in place. We
are pulling the page and placing it on the head of the free_list we are
moving it to, so in order to avoid creating an island of unprocessed pages
we need to clear the flag and just reprocess the page later.

> > @@ -868,10 +869,11 @@ static inline struct capture_control *task_capc(struct zone *zone)
> > static inline void __free_one_page(struct page *page,
> > unsigned long pfn,
> > struct zone *zone, unsigned int order,
> > - int migratetype)
> > + int migratetype, bool aerated)
> > {
> > struct capture_control *capc = task_capc(zone);
> > unsigned long uninitialized_var(buddy_pfn);
> > + bool fully_aerated = aerated;
> > unsigned long combined_pfn;
> > unsigned int max_order;
> > struct page *buddy;
> > @@ -902,6 +904,11 @@ static inline void __free_one_page(struct page *page,
> > goto done_merging;
> > if (!page_is_buddy(page, buddy, order))
> > goto done_merging;
> > +
> > + /* assume buddy is not aerated */
> > + if (aerated)
> > + fully_aerated = false;
>
> So, "full" vs. "partial" is with respect to high-order pages? Why not
> just check the page flag on the buddy?

The buddy will never have the aerated flag set. If we are hinting on a
given page then the assumption is it was the highest order version of the
page available when we processed the pages in the previous pass. So if the
buddy is available when we are returning the processed page then the buddy
is non-aerated and will invalidate the aeration when merged with the
aerated page. What we will then do is hint on the higher-order page that
is created as a result of merging the two pages.

I'll make the comment more robust on that.

> > /*
> > * Our buddy is free or it is CONFIG_DEBUG_PAGEALLOC guard page,
> > * merge with it and move up one order.
> > @@ -943,11 +950,17 @@ static inline void __free_one_page(struct page *page,
> > done_merging:
> > set_page_order(page, order);
> >
> > - if (buddy_merge_likely(pfn, buddy_pfn, page, order) ||
> > + if (aerated ||
> > + buddy_merge_likely(pfn, buddy_pfn, page, order) ||
> > is_shuffle_tail_page(order))
> > add_to_free_area_tail(page, zone, order, migratetype);
> > else
> > add_to_free_area(page, zone, order, migratetype);
>
> Aerated pages always go to the tail? Ahh, so they don't get consumed
> quickly and have to be undone? Comments, please.

Will do.

> > + if (fully_aerated)
> > + set_page_aerated(page, zone, order, migratetype);
> > + else
> > + aerator_notify_free(zone, order);
> > }
>
> What is this notifying for? It's not like this is some opaque
> registration interface. What does this *do*?

This is updating the count of non-treated pages and comparing that to the
high water mark. If it crosses a certain threshold it will then set the
bits requesting the zone be processed and wake up the thread if it is not
active.

> > @@ -2127,6 +2140,77 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> > return NULL;
> > }
> >
> > +#ifdef CONFIG_AERATION
> > +/**
> > + * get_aeration_page - Provide a "raw" page for aeration by the aerator
> > + * @zone: Zone to draw pages from
> > + * @order: Order to draw pages from
> > + * @migratetype: Migratetype to draw pages from
>
> FWIW, kerneldoc is a waste of bytes here. Please use it sparingly.
>
> > + * This function will obtain a page from above the boundary. As a result
> > + * we can guarantee the page has not been aerated.
>
> This is the first mention of a boundary. That's not good since I have
> no idea at this point what the boundary is for or between.

Boundary doesn't get added until the next patch so this is another foul up
with the splitting of the patches.

>
> > + * The page will have the migrate type and order stored in the page
> > + * metadata.
> > + *
> > + * Return: page pointer if raw page found, otherwise NULL
> > + */
> > +struct page *get_aeration_page(struct zone *zone, unsigned int order,
> > + int migratetype)
> > +{
> > + struct free_area *area = &(zone->free_area[order]);
> > + struct list_head *list = &area->free_list[migratetype];
> > + struct page *page;
> > +
> > + /* Find a page of the appropriate size in the preferred list */
>
> I don't get the size comment. Hasn't this already been given an order?

That is a holdover from a previous version of this patch. Originally this
code was getting anything of that order or greater. Now it only grabs
pages of the requested order, as I pulled some logic out of here and moved
it into the aeration logic.

> > + page = list_last_entry(aerator_get_tail(zone, order, migratetype),
> > + struct page, lru);
> > + list_for_each_entry_from_reverse(page, list, lru) {
> > + if (PageAerated(page)) {
> > + page = list_first_entry(list, struct page, lru);
> > + if (PageAerated(page))
> > + break;
> > + }
>
> This confuses me. It looks for a page, then goes to the next page and
> checks again? Why check twice? Why is a function looking for an
> aerated page that finds *two* pages returning NULL?
>
> I'm stumped.

So the logic here gets confusing because the boundary hasn't been defined
yet. Specifically, the boundary ends up being a secondary tail that applies
only to the non-processed pages. What we are doing is getting the last
non-processed page, and then using the "from" version of the list iterator
to make certain that the list isn't actually empty.

From there we check whether the list we are looking at is actually empty;
if it is we return NULL. If the list is not empty we check whether the
page was already processed. If it was, we try grabbing the page from the
head of the list, since we may have hit the bottom of the last batch we
processed. If that one is also processed then we just exit.

I will try to do a better job of documenting all this. Basically I used a
bunch of list manipulation that is hiding things too well.

> > + del_page_from_free_area(page, zone, order);
> > +
> > + /* record migratetype and order within page */
> > + set_pcppage_migratetype(page, migratetype);
> > + set_page_private(page, order);
> > + __mod_zone_freepage_state(zone, -(1 << order), migratetype);
> > +
> > + return page;
> > + }
> > +
> > + return NULL;
> > +}
>
> Oh, so this is trying to find a page _for_ aerating.
> "get_aeration_page()" does not convey that. Can that improved?
> get_page_for_aeration()?
>
> Rather than talk about boundaries, wouldn't a better description have been:
>
> Similar to allocation, this function removes a page from the
> free lists. However, it only removes unaerated pages.

I will update the kerneldoc and comments.

> > +/**
> > + * put_aeration_page - Return a now-aerated "raw" page back where we got it
> > + * @zone: Zone to return pages to
> > + * @page: Previously "raw" page that can now be returned after aeration
> > + *
> > + * This function will pull the migratetype and order information out
> > + * of the page and attempt to return it where it found it.
> > + */
> > +void put_aeration_page(struct zone *zone, struct page *page)
> > +{
> > + unsigned int order, mt;
> > + unsigned long pfn;
> > +
> > + mt = get_pcppage_migratetype(page);
> > + pfn = page_to_pfn(page);
> > +
> > + if (unlikely(has_isolate_pageblock(zone) || is_migrate_isolate(mt)))
> > + mt = get_pfnblock_migratetype(page, pfn);
> > +
> > + order = page_private(page);
> > + set_page_private(page, 0);
> > +
> > + __free_one_page(page, pfn, zone, order, mt, true);
> > +}
> > +#endif /* CONFIG_AERATION */
>
> Yikes. This seems to have glossed over some pretty big aspects here.
> Pages which are being aerated are not free. Pages which are freed are
> diverted to be aerated before becoming free. Right? That sounds like
> two really important things to add to a changelog.

Right. The aerated pages are not free. They go into the free_list and once
enough pages are in there the aerator will start pulling some out and
processing them. The idea is that any pages not being actively aerated
will stay in the free_list so it isn't as if we are shunting them off to a
side-queue. They are just sitting in the free_list either to be reused or
aerated.

I'll make some updates to the changelog to clarify that.

> > /*
> > * This array describes the order lists are fallen back to when
> > * the free lists for the desirable migrate type are depleted
> > @@ -5929,9 +6013,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
> > static void __meminit zone_init_free_lists(struct zone *zone)
> > {
> > unsigned int order, t;
> > - for_each_migratetype_order(order, t) {
> > + for_each_migratetype_order(order, t)
> > INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> > +
> > + for (order = MAX_ORDER; order--; ) {
> > zone->free_area[order].nr_free = 0;
> > + zone->free_area[order].nr_free_aerated = 0;
> > }
> > }
> >
> >


2019-07-08 22:46:47

by Alexander Duyck

Subject: Re: [PATCH v1 5/6] mm: Add logic for separating "aerated" pages from "raw" pages

On Tue, 2019-06-25 at 13:24 -0700, Dave Hansen wrote:
> On 6/19/19 3:33 PM, Alexander Duyck wrote:
> > Add a set of pointers we shall call "boundary" which represents the upper
> > boundary between the "raw" and "aerated" pages. The general idea is that in
> > order for a page to cross from one side of the boundary to the other it
> > will need to go through the aeration treatment.
>
> Aha! The mysterious "boundary"!
>
> But, how can you introduce code that deals with boundaries before
> introducing the boundary itself? Or was that comment misplaced?

The comment in the earlier patch was misplaced. Basically the logic before
this patch would just add the aerated pages directly to the tail of the
free_list, however if it had to leave and come back there was nothing to
prevent us from creating a mess of interleaved "raw" and "aerated" pages.
With this patch we are guaranteed that any "raw" pages are added above the
"aerated" pages and will be pulled for processing.

> FWIW, I'm not a fan of these commit messages. They are really hard to
> map to the data structures.
>
> One goal in this set is to avoid creating new data structures.
> We accomplish that by reusing the free lists to hold aerated and
> non-aerated pages. But, in order to use the existing free list,
> we need a boundary to separate aerated from raw.
>
> Further:
>
> Pages are temporarily removed from the free lists while aerating
> them.
>
> This needs a justification why you chose this path, and also what the
> larger implications are.

Well the big advantage is that we aren't messing with the individual
free_area or free_list structures. My initial implementation was adding a
third pointer to do the work, and it actually had performance implications
as it increased the size of the free_area and zone.

> > By doing this we should be able to make certain that we keep the aerated
> > pages as one contiguous block on the end of each free list. This will allow
> > us to efficiently walk the free lists whenever we need to go in and start
> > processing hints to the hypervisor that the pages are no longer in use.
>
> You don't really walk them though, right? It *keeps* you from having to
> ever walk the lists.

It all depends on your definition of "walk". In the case of this logic we
will have to ultimately do 1 pass over all the "raw" pages to process
them. So I consider that a walk through the free_list. However we can
avoid all of the already processed pages since we have the flag and the
pointer to what should be the top of the list for the "aerated" pages.

> I also don't see what the boundary has to do with aerated pages being on
> the tail of the list. If you want them on the tail, you just always
> list_add_tail() them.

The issue is that there are multiple things that can add to the tail of
the list. For example the shuffle code or the lower order buddy expecting
its buddy to be freed. In those cases I don't want to add to tail so
instead I am adding those to the boundary. By doing that I can avoid
having the tail of the list becoming interleaved with raw and aerated
pages.

> > And added advantage to this approach is that we should be reducing the
> > overall memory footprint of the guest as it will be more likely to recycle
> > warm pages versus the aerated pages that are likely to be cache cold.
>
> I'm confused. Isn't an aerated page non-present on the guest? That's
> worse than cache cold. It costs a VMEXIT to bring back in.

I suppose so, it would be worse than being cache cold.

> > Since we will only be aerating one zone at a time we keep the boundary
> > limited to being defined for just the zone we are currently placing aerated
> > pages into. Doing this we can keep the number of additional poitners needed
> > quite small.
>
> pointers ^
>
> > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
> > static inline struct list_head *aerator_get_tail(struct zone *zone,
> > unsigned int order,
> > int migratetype)
> > {
> > +#ifdef CONFIG_AERATION
> > + if (order >= AERATOR_MIN_ORDER &&
> > + test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > + return __aerator_get_tail(order, migratetype);
> > +#endif
> > return &zone->free_area[order].free_list[migratetype];
> > }
>
> Logically, I have no idea what this is doing. "Go get pages out of the
> aerated list?" "raw list"? Needs comments.

I'll add comments. Really now that I think about it I should probably
change the name for this anyway. What is really being returned is the tail
for the non-aerated list. Specifically if ZONE_AERATION_ACTIVE is set we
want to prevent any insertions below the list of aerated pages, so we are
returning the first entry in the aerated list and using that as the
tail/head of a list tail insertion.

Ugh. I really need to go back and name this better.

> > +static inline void aerator_del_from_boundary(struct page *page,
> > + struct zone *zone)
> > +{
> > + if (PageAerated(page) && test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > + __aerator_del_from_boundary(page, zone);
> > +}
> > +
> > static inline void set_page_aerated(struct page *page,
> > struct zone *zone,
> > unsigned int order,
> > @@ -28,6 +59,9 @@ static inline void set_page_aerated(struct page *page,
> > /* record migratetype and flag page as aerated */
> > set_pcppage_migratetype(page, migratetype);
> > __SetPageAerated(page);
> > +
> > + /* update boundary of new migratetype and record it */
> > + aerator_add_to_boundary(page, zone);
> > #endif
> > }
> >
> > @@ -39,11 +73,19 @@ static inline void clear_page_aerated(struct page *page,
> > if (likely(!PageAerated(page)))
> > return;
> >
> > + /* push boundary back if we removed the upper boundary */
> > + aerator_del_from_boundary(page, zone);
> > +
> > __ClearPageAerated(page);
> > area->nr_free_aerated--;
> > #endif
> > }
> >
> > +static inline unsigned long aerator_raw_pages(struct free_area *area)
> > +{
> > + return area->nr_free - area->nr_free_aerated;
> > +}
> > +
> > /**
> > * aerator_notify_free - Free page notification that will start page processing
> > * @zone: Pointer to current zone of last page processed
> > @@ -57,5 +99,20 @@ static inline void clear_page_aerated(struct page *page,
> > */
> > static inline void aerator_notify_free(struct zone *zone, int order)
> > {
> > +#ifdef CONFIG_AERATION
> > + if (!static_key_false(&aerator_notify_enabled))
> > + return;
> > + if (order < AERATOR_MIN_ORDER)
> > + return;
> > + if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > + return;
> > + if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> > + return;
> > +
> > + __aerator_notify(zone);
> > +#endif
> > }
>
> Again, this is really hard to review. I see some possible overhead in a
> fast path here, but only if aerator_notify_free() is called in a fast
> path. Is it? I have to go digging in the previous patches to figure
> that out.

This is called at the end of __free_one_page().

I tried to limit the impact as much as possible by ordering the checks the
way I did. The order check should limit the impact pretty significantly as
that is the only one that will be triggered for every page, then the
higher order pages are left to deal with the test_bit and
aerator_raw_pages checks.

> > +static struct aerator_dev_info *a_dev_info;
> > +struct static_key aerator_notify_enabled;
> > +
> > +struct list_head *boundary[MAX_ORDER - AERATOR_MIN_ORDER][MIGRATE_TYPES];
> > +
> > +static void aerator_reset_boundary(struct zone *zone, unsigned int order,
> > + unsigned int migratetype)
> > +{
> > + boundary[order - AERATOR_MIN_ORDER][migratetype] =
> > + &zone->free_area[order].free_list[migratetype];
> > +}
> > +
> > +#define for_each_aerate_migratetype_order(_order, _type) \
> > + for (_order = MAX_ORDER; _order-- != AERATOR_MIN_ORDER;) \
> > + for (_type = MIGRATE_TYPES; _type--;)
> > +
> > +static void aerator_populate_boundaries(struct zone *zone)
> > +{
> > + unsigned int order, mt;
> > +
> > + if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > + return;
> > +
> > + for_each_aerate_migratetype_order(order, mt)
> > + aerator_reset_boundary(zone, order, mt);
> > +
> > + set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
> > +}
>
> This function appears misnamed as it's doing more than boundary
> manipulation.

The ZONE_AERATION_ACTIVE flag is what is used to indicate that the
boundaries are being tracked. Without that we just fall back to using the
free_list tail.

> > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
> > +{
> > + return boundary[order - AERATOR_MIN_ORDER][migratetype];
> > +}
> > +
> > +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
> > +{
> > + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > + int mt = get_pcppage_migratetype(page);
> > + struct list_head **tail = &boundary[order][mt];
> > +
> > + if (*tail == &page->lru)
> > + *tail = page->lru.next;
> > +}
>
> Ewww. Please just track the page that's the boundary, not the list head
> inside the page that's the boundary.
>
> This also at least needs one comment along the lines of: Move the
> boundary if the page representing the boundary is being removed.

So the reason for using the list_head is because we can end up with a
boundary for an empty list. In that case we don't have a page to point to
but just the list_head for the list itself. It actually makes things quite
a bit simpler, otherwise I have to perform extra checks to see if the list
is empty.

I'll work on updating the comments.

>
> > +void aerator_add_to_boundary(struct page *page, struct zone *zone)
> > +{
> > + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > + int mt = get_pcppage_migratetype(page);
> > + struct list_head **tail = &boundary[order][mt];
> > +
> > + *tail = &page->lru;
> > +}
> > +
> > +void aerator_shutdown(void)
> > +{
> > + static_key_slow_dec(&aerator_notify_enabled);
> > +
> > + while (atomic_read(&a_dev_info->refcnt))
> > + msleep(20);
>
> We generally frown on open-coded check/sleep loops. What is this for?

We are waiting on the aerator to finish processing the list it had active.
With the static key disabled we should see the refcount wind down to 0.
Once that occurs we can safely free the a_dev_info structure since there
will be no other uses of it.

> > + WARN_ON(!list_empty(&a_dev_info->batch));
> > +
> > + a_dev_info = NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(aerator_shutdown);
> > +
> > +static void aerator_schedule_initial_aeration(void)
> > +{
> > + struct zone *zone;
> > +
> > + for_each_populated_zone(zone) {
> > + spin_lock(&zone->lock);
> > + __aerator_notify(zone);
> > + spin_unlock(&zone->lock);
> > + }
> > +}
>
> Why do we need an initial aeration?

This is mostly about avoiding any possible races while we are bringing up
the aerator. If we assume we are just going to start a cycle of aeration
for all zones when the aerator is brought up, it makes it easier to be sure
we have gone through and checked all of the zones after initialization is
complete.

> > +int aerator_startup(struct aerator_dev_info *sdev)
> > +{
> > + if (a_dev_info)
> > + return -EBUSY;
> > +
> > + INIT_LIST_HEAD(&sdev->batch);
> > + atomic_set(&sdev->refcnt, 0);
> > +
> > + a_dev_info = sdev;
> > + aerator_schedule_initial_aeration();
> > +
> > + static_key_slow_inc(&aerator_notify_enabled);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(aerator_startup);
> > +
> > +static void aerator_fill(struct zone *zone)
> > +{
> > + struct list_head *batch = &a_dev_info->batch;
> > + int budget = a_dev_info->capacity;
>
> Where does capacity come from?

It is the limit on how many pages we can process at a time. The value is
set in a_dev_info before the call to aerator_startup.

> > + unsigned int order, mt;
> > +
> > + for_each_aerate_migratetype_order(order, mt) {
> > + struct page *page;
> > +
> > + /*
> > + * Pull pages from free list until we have drained
> > + * it or we have filled the batch reactor.
> > + */
>
> What's a reactor?

A hold-over from an earlier patch. Basically the batch reactor was the
list containing the pages to be processed. It was a chemistry term in
regards to aeration. I should update that to instead say we have reached
the capacity of the aeration device.

> > + while ((page = get_aeration_page(zone, order, mt))) {
> > + list_add_tail(&page->lru, batch);
> > +
> > + if (!--budget)
> > + return;
> > + }
> > + }
> > +
> > + /*
> > + * If there are no longer enough free pages to fully populate
> > + * the aerator, then we can just shut it down for this zone.
> > + */
> > + clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
> > + atomic_dec(&a_dev_info->refcnt);
> > +}
>
> Huh, so this is the number of threads doing aeration? Didn't we just
> make a big deal about there only being one zone being aerated at a time?
> Or, did I misunderstand what refcnt is from its lack of clear
> documentation?

The refcnt is the number of zones requesting aeration plus one additional
if the thread is active. We are limited to only having pages from one zone
in the aerator at a time. That is to prevent us from having to maintain
multiple boundaries.

> > +static void aerator_drain(struct zone *zone)
> > +{
> > + struct list_head *list = &a_dev_info->batch;
> > + struct page *page;
> > +
> > + /*
> > + * Drain the now aerated pages back into their respective
> > + * free lists/areas.
> > + */
> > + while ((page = list_first_entry_or_null(list, struct page, lru))) {
> > + list_del(&page->lru);
> > + put_aeration_page(zone, page);
> > + }
> > +}
> > +
> > +static void aerator_scrub_zone(struct zone *zone)
> > +{
> > + /* See if there are any pages to pull */
> > + if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > + return;
>
> How would someone ask for the zone to be scrubbed when aeration has not
> been requested?

I'm not sure what you are asking here. Basically this function is called
per zone by aerator_cycle. Now that I think about it, I should probably
swap the names around so that we perform a cycle per zone and just scrub
memory generically.

> > + spin_lock(&zone->lock);
> > +
> > + do {
> > + aerator_fill(zone);
>
> Should this say:
>
> /* Pull pages out of the allocator into a local list */
>
> ?

Yes, we are filling the local list with "raw" pages from the zone.

>
> > + if (list_empty(&a_dev_info->batch))
> > + break;
>
> /* no pages were acquired, give up */

Correct.

>
> > + spin_unlock(&zone->lock);
> > +
> > + /*
> > + * Start aerating the pages in the batch, and then
> > + * once that is completed we can drain the reactor
> > + * and refill the reactor, restarting the cycle.
> > + */
> > + a_dev_info->react(a_dev_info);
>
> After reading (most of) this set, I'm going to reiterate my suggestion:
> please find new nomenclature. I can't parse that comment and I don't
> know whether that's because it's a bad comment or whether you really
> mean "cycle" the english word or "cycle" referring to some new
> definition relating to this patch set.
>
> I've asked quite nicely a few times now.

The "cycle" in this case refers to fill, react, drain, and idle or repeat.

> > + spin_lock(&zone->lock);
> > +
> > + /*
> > + * Guarantee boundaries are populated before we
> > + * start placing aerated pages in the zone.
> > + */
> > + aerator_populate_boundaries(zone);
>
> aerator_populate_boundaries() has apparent concurrency checks via
> ZONE_AERATION_ACTIVE. Why are those needed when this is called under a
> spinlock?

I probably could move the spin_lock down. It isn't really needed for the
population of the boundaries, it was needed for the draining of the
aerator into the free_lists. I'll move the lock, although I might need to
add a smp_mb__before_atomic to make sure that any callers see the boundary
values before they see the updated bit.

2019-07-08 22:49:01

by Dave Hansen

Subject: Re: [PATCH v1 5/6] mm: Add logic for separating "aerated" pages from "raw" pages

On 7/8/19 12:02 PM, Alexander Duyck wrote:
> On Tue, 2019-06-25 at 13:24 -0700, Dave Hansen wrote:
>> I also don't see what the boundary has to do with aerated pages being on
>> the tail of the list. If you want them on the tail, you just always
>> list_add_tail() them.
>
> The issue is that there are multiple things that can add to the tail of
> the list. For example the shuffle code or the lower order buddy expecting
> its buddy to be freed. In those cases I don't want to add to tail so
> instead I am adding those to the boundary. By doing that I can avoid
> having the tail of the list becoming interleaved with raw and aerated
> pages.

So, it sounds like we've got the following data structure rules:

1. We have one list_head and one list of pages
2. For the purposes of allocation, the list is treated the same as
before these patches
3. For a "free()", the behavior changes and we now have two "tails":
3a. Aerated pages are freed into the tail of the list
3b. Cold pages are freed at the boundary between aerated and non.
This serves to... This is also referred to as a "tail".
3c. Hot pages are never aerated and are still freed into the head
of the list.

Did I miss any? Could you please spell it out this way in future
changelogs?


>>> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
>>> static inline struct list_head *aerator_get_tail(struct zone *zone,
>>> unsigned int order,
>>> int migratetype)
>>> {
>>> +#ifdef CONFIG_AERATION
>>> + if (order >= AERATOR_MIN_ORDER &&
>>> + test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
>>> + return __aerator_get_tail(order, migratetype);
>>> +#endif
>>> return &zone->free_area[order].free_list[migratetype];
>>> }
>>
>> Logically, I have no idea what this is doing. "Go get pages out of the
>> aerated list?" "raw list"? Needs comments.
>
> I'll add comments. Really now that I think about it I should probably
> change the name for this anyway. What is really being returned is the tail
> for the non-aerated list. Specifically if ZONE_AERATION_ACTIVE is set we
> want to prevent any insertions below the list of aerated pages, so we are
> returning the first entry in the aerated list and using that as the
> tail/head of a list tail insertion.
>
> Ugh. I really need to go back and name this better.

OK, so we now have two tails? One that's called both a boundary and a
tail at different parts of the code?

>>> static inline void aerator_notify_free(struct zone *zone, int order)
>>> {
>>> +#ifdef CONFIG_AERATION
>>> + if (!static_key_false(&aerator_notify_enabled))
>>> + return;
>>> + if (order < AERATOR_MIN_ORDER)
>>> + return;
>>> + if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
>>> + return;
>>> + if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
>>> + return;
>>> +
>>> + __aerator_notify(zone);
>>> +#endif
>>> }
>>
>> Again, this is really hard to review. I see some possible overhead in a
>> fast path here, but only if aerator_notify_free() is called in a fast
>> path. Is it? I have to go digging in the previous patches to figure
>> that out.
>
> This is called at the end of __free_one_page().
>
> I tried to limit the impact as much as possible by ordering the checks the
> way I did. The order check should limit the impact pretty significantly as
> that is the only one that will be triggered for every page, then the
> higher order pages are left to deal with the test_bit and
> aerator_raw_pages checks.

That sounds like a good idea. But, that good idea is very hard to
distill from the code in the patch.

Imagine if the function started being commented with:

/* Called from a hot path in __free_one_page() */

And said:


if (!static_key_false(&aerator_notify_enabled))
return;

/* Avoid (slow) notifications when no aeration is performed: */
if (order < AERATOR_MIN_ORDER)
return;
if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
return;

/* Some other relevant comment: */
if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
return;

/* This is slow, but should happen very rarely: */
__aerator_notify(zone);

>>> +static void aerator_populate_boundaries(struct zone *zone)
>>> +{
>>> + unsigned int order, mt;
>>> +
>>> + if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
>>> + return;
>>> +
>>> + for_each_aerate_migratetype_order(order, mt)
>>> + aerator_reset_boundary(zone, order, mt);
>>> +
>>> + set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
>>> +}
>>
>> This function appears misnamed as it's doing more than boundary
>> manipulation.
>
> The ZONE_AERATION_ACTIVE flag is what is used to indicate that the
> boundaries are being tracked. Without that we just fall back to using the
> free_list tail.

Is the flag used for other things? Or just to indicate that boundaries
are being tracked?

>>> +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
>>> +{
>>> + return boundary[order - AERATOR_MIN_ORDER][migratetype];
>>> +}
>>> +
>>> +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
>>> +{
>>> + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
>>> + int mt = get_pcppage_migratetype(page);
>>> + struct list_head **tail = &boundary[order][mt];
>>> +
>>> + if (*tail == &page->lru)
>>> + *tail = page->lru.next;
>>> +}
>>
>> Ewww. Please just track the page that's the boundary, not the list head
>> inside the page that's the boundary.
>>
>> This also at least needs one comment along the lines of: Move the
>> boundary if the page representing the boundary is being removed.
>
> So the reason for using the list_head is because we can end up with a
> boundary for an empty list. In that case we don't have a page to point to
> but just the list_head for the list itself. It actually makes things quite
> a bit simpler, otherwise I have to perform extra checks to see if the list
> is empty.

Could you please double-check that keeping a 'struct page *' is truly
more messy?

>>> +void aerator_add_to_boundary(struct page *page, struct zone *zone)
>>> +{
>>> + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
>>> + int mt = get_pcppage_migratetype(page);
>>> + struct list_head **tail = &boundary[order][mt];
>>> +
>>> + *tail = &page->lru;
>>> +}
>>> +
>>> +void aerator_shutdown(void)
>>> +{
>>> + static_key_slow_dec(&aerator_notify_enabled);
>>> +
>>> + while (atomic_read(&a_dev_info->refcnt))
>>> + msleep(20);
>>
>> We generally frown on open-coded check/sleep loops. What is this for?
>
> We are waiting on the aerator to finish processing the list it had active.
> With the static key disabled we should see the refcount wind down to 0.
> Once that occurs we can safely free the a_dev_info structure since there
> will be no other uses of it.

That's fine, but we still don't open-code sleep loops. Please remove this.

"Wait until we can free the thing" sounds to me like RCU. Do you want
to use RCU here? A synchronize_rcu() call can be a very powerful thing
if the read-side critical sections are amenable to it.

>>> +static void aerator_schedule_initial_aeration(void)
>>> +{
>>> + struct zone *zone;
>>> +
>>> + for_each_populated_zone(zone) {
>>> + spin_lock(&zone->lock);
>>> + __aerator_notify(zone);
>>> + spin_unlock(&zone->lock);
>>> + }
>>> +}
>>
>> Why do we need an initial aeration?
>
> This is mostly about avoiding any possible races while we are bringing up
> the aerator. If we assume we are just going to start a cycle of aeration
> for all zones when the aerator is brought up it makes it easier to be sure
> we have gone through and checked all of the zones after initialization is
> complete.

Let me ask a different way: What will happen if we don't have this?
Will things crash? Will they be slow? Do we not know?

>>> +{
>>> + struct list_head *batch = &a_dev_info->batch;
>>> + int budget = a_dev_info->capacity;
>>
>> Where does capacity come from?
>
> It is the limit on how many pages we can process at a time. The value is
> set in a_dev_info before the call to aerator_startup.

Let me ask another way: Does it come from the user? Or is it
automatically determined by some in-kernel heuristic?

>>> + while ((page = get_aeration_page(zone, order, mt))) {
>>> + list_add_tail(&page->lru, batch);
>>> +
>>> + if (!--budget)
>>> + return;
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * If there are no longer enough free pages to fully populate
>>> + * the aerator, then we can just shut it down for this zone.
>>> + */
>>> + clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
>>> + atomic_dec(&a_dev_info->refcnt);
>>> +}
>>
>> Huh, so this is the number of threads doing aeration? Didn't we just
>> make a big deal about there only being one zone being aerated at a time?
>> Or, did I misunderstand what refcnt is from its lack of clear
>> documentation?
>
> The refcnt is the number of zones requesting aeration plus one additional
> if the thread is active. We are limited to only having pages from one zone
> in the aerator at a time. That is to prevent us from having to maintain
> multiple boundaries.

That sounds like excellent documentation to add to 'refcnt's definition.

>>> +static void aerator_drain(struct zone *zone)
>>> +{
>>> + struct list_head *list = &a_dev_info->batch;
>>> + struct page *page;
>>> +
>>> + /*
>>> + * Drain the now aerated pages back into their respective
>>> + * free lists/areas.
>>> + */
>>> + while ((page = list_first_entry_or_null(list, struct page, lru))) {
>>> + list_del(&page->lru);
>>> + put_aeration_page(zone, page);
>>> + }
>>> +}
>>> +
>>> +static void aerator_scrub_zone(struct zone *zone)
>>> +{
>>> + /* See if there are any pages to pull */
>>> + if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
>>> + return;
>>
>> How would someone ask for the zone to be scrubbed when aeration has not
>> been requested?
>
> I'm not sure what you are asking here. Basically this function is called
> per zone by aerator_cycle. Now that I think about it, I should probably
> swap the names around so that we perform a cycle per zone and just scrub
> memory generically.

It looks like aerator_cycle() calls aerator_scrub_zone() on all zones
all the time. This is the code responsible for ensuring that we don't
do any aeration work on zones that do not need it.

2019-07-08 22:54:26

by Alexander Duyck

Subject: Re: [PATCH v1 5/6] mm: Add logic for separating "aerated" pages from "raw" pages

On Mon, 2019-07-08 at 12:36 -0700, Dave Hansen wrote:
> On 7/8/19 12:02 PM, Alexander Duyck wrote:
> > On Tue, 2019-06-25 at 13:24 -0700, Dave Hansen wrote:
> > > I also don't see what the boundary has to do with aerated pages being on
> > > the tail of the list. If you want them on the tail, you just always
> > > list_add_tail() them.
> >
> > The issue is that there are multiple things that can add to the tail of
> > the list. For example the shuffle code or the lower order buddy expecting
> > its buddy to be freed. In those cases I don't want to add to tail so
> > instead I am adding those to the boundary. By doing that I can avoid
> > having the tail of the list becoming interleaved with raw and aerated
> > pages.
>
> So, it sounds like we've got the following data structure rules:
>
> 1. We have one list_head and one list of pages
> 2. For the purposes of allocation, the list is treated the same as
> before these patches

So these 2 points are correct.

> 3. For a "free()", the behavior changes and we now have two "tails":
> 3a. Aerated pages are freed into the tail of the list
> 3b. Cold pages are freed at the boundary between aerated and non.
> This serves to... This is also referred to as a "tail".
> 3c. Hot pages are never aerated and are still freed into the head
> of the list.
>
> Did I miss any? Could you please spell it out this way in future
> changelogs?

So the logic for 3a and 3b is actually the same location. The difference
is that the boundary pointer will move up to the page in the case of 3a,
and will not move in the case of 3b. That was why I was kind of annoyed
with myself as I was calling it the aerator "tail" when it is really the
head of the aeration list.

So the naming change I am planning is to rename the function below to
__aerator_get_boundary. Boundary makes more sense in
my mind anyway because it is the head of one list and the tail of the
other.

>
> > > > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype);
> > > > static inline struct list_head *aerator_get_tail(struct zone *zone,
> > > > unsigned int order,
> > > > int migratetype)
> > > > {
> > > > +#ifdef CONFIG_AERATION
> > > > + if (order >= AERATOR_MIN_ORDER &&
> > > > + test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > > > + return __aerator_get_tail(order, migratetype);
> > > > +#endif
> > > > return &zone->free_area[order].free_list[migratetype];
> > > > }
> > >
> > > Logically, I have no idea what this is doing. "Go get pages out of the
> > > aerated list?" "raw list"? Needs comments.
> >
> > I'll add comments. Really now that I think about it I should probably
> > change the name for this anyway. What is really being returned is the tail
> > for the non-aerated list. Specifically if ZONE_AERATION_ACTIVE is set we
> > want to prevent any insertions below the list of aerated pages, so we are
> > returning the first entry in the aerated list and using that as the
> > tail/head of a list tail insertion.
> >
> > Ugh. I really need to go back and name this better.
>
> OK, so we now have two tails? One that's called both a boundary and a
> tail at different parts of the code?

Yes, that is the naming issue I was getting at. I would prefer to go with
boundary where I can since it is both a head of one list and the tail of
the other.

I will try to clean this all up before I submit this again.

> > > > static inline void aerator_notify_free(struct zone *zone, int order)
> > > > {
> > > > +#ifdef CONFIG_AERATION
> > > > + if (!static_key_false(&aerator_notify_enabled))
> > > > + return;
> > > > + if (order < AERATOR_MIN_ORDER)
> > > > + return;
> > > > + if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > > > + return;
> > > > + if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> > > > + return;
> > > > +
> > > > + __aerator_notify(zone);
> > > > +#endif
> > > > }
> > >
> > > Again, this is really hard to review. I see some possible overhead in a
> > > fast path here, but only if aerator_notify_free() is called in a fast
> > > path. Is it? I have to go digging in the previous patches to figure
> > > that out.
> >
> > This is called at the end of __free_one_page().
> >
> > I tried to limit the impact as much as possible by ordering the checks the
> > way I did. The order check should limit the impact pretty significantly as
> > that is the only one that will be triggered for every page, then the
> > higher order pages are left to deal with the test_bit and
> > aerator_raw_pages checks.
>
> That sounds like a good idea. But, that good idea is very hard to
> distill from the code in the patch.
>
> Imagine if the function stared being commented with:
>
> /* Called from a hot path in __free_one_page() */
>
> And said:
>
>
> if (!static_key_false(&aerator_notify_enabled))
> return;
>
> /* Avoid (slow) notifications when no aeration is performed: */
> if (order < AERATOR_MIN_ORDER)
> return;
> if (test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> return;
>
> /* Some other relevant comment: */
> if (aerator_raw_pages(&zone->free_area[order]) < AERATOR_HWM)
> return;
>
> /* This is slow, but should happen very rarely: */
> __aerator_notify(zone);
>

I'll go through and work on cleaning up the comments.

> > > > +static void aerator_populate_boundaries(struct zone *zone)
> > > > +{
> > > > + unsigned int order, mt;
> > > > +
> > > > + if (test_bit(ZONE_AERATION_ACTIVE, &zone->flags))
> > > > + return;
> > > > +
> > > > + for_each_aerate_migratetype_order(order, mt)
> > > > + aerator_reset_boundary(zone, order, mt);
> > > > +
> > > > + set_bit(ZONE_AERATION_ACTIVE, &zone->flags);
> > > > +}
> > >
> > > This function appears misnamed as it's doing more than boundary
> > > manipulation.
> >
> > The ZONE_AERATION_ACTIVE flag is what is used to indicate that the
> > boundaries are being tracked. Without that we just fall back to using the
> > free_list tail.
>
> Is the flag used for other things? Or just to indicate that boundaries
> are being tracked?

Just the boundaries. It gets set before the first time we have to flush
out a batch of pages, and is cleared after we have determined that there
are no longer any pages to pull and our local list is empty.

> > > > +struct list_head *__aerator_get_tail(unsigned int order, int migratetype)
> > > > +{
> > > > + return boundary[order - AERATOR_MIN_ORDER][migratetype];
> > > > +}
> > > > +
> > > > +void __aerator_del_from_boundary(struct page *page, struct zone *zone)
> > > > +{
> > > > + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > > > + int mt = get_pcppage_migratetype(page);
> > > > + struct list_head **tail = &boundary[order][mt];
> > > > +
> > > > + if (*tail == &page->lru)
> > > > + *tail = page->lru.next;
> > > > +}
> > >
> > > Ewww. Please just track the page that's the boundary, not the list head
> > > inside the page that's the boundary.
> > >
> > > This also at least needs one comment along the lines of: Move the
> > > boundary if the page representing the boundary is being removed.
> >
> > So the reason for using the list_head is because we can end up with a
> > boundary for an empty list. In that case we don't have a page to point to
> > but just the list_head for the list itself. It actually makes things quite
> > a bit simpler; otherwise I have to perform extra checks to see if the list
> > is empty.
>
> Could you please double-check that keeping a 'struct page *' is truly
> more messy?

Well there are a few places I am using this where using a page pointer
would be an issue.

1. add_to_free_area_tail
Using a page pointer here would be difficult since we are adding a
page to a list, not to another page.
2. aerator_populate_boundaries
We were initializing the boundary to the list head for each of the
free_lists that we could possibly be placing pages into. Translating
to a page would require additional overhead.
3. __aerator_del_from_boundary
What we can end up with here if we aren't careful is a page pointer
that isn't to a page in the case that the free_list is actually
empty.

In my mind, to handle this correctly I would have to start using NULL when
the list is empty, and add a check to __aerator_del_from_boundary that
grabs the free_list for the page and tests against the head of the free
list, to make certain that removing the page will not leave us pointing to
something that isn't a page.
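To make the trade-off concrete, here is a minimal userspace sketch (not the kernel code; the list primitives and the `boundary`/`free_list` names are modeled on the discussion above) showing why a `struct list_head *` boundary naturally covers the empty-list case, where a `struct page *` boundary would need a NULL sentinel and the extra checks described:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the kernel's circular doubly-linked list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/*
 * The boundary is a pointer to a list_head.  When the free list is
 * empty it simply points at the list head itself, so no NULL checks
 * are needed anywhere.
 */
static struct list_head free_list;
static struct list_head *boundary;

/* Models aerator_reset_boundary(): empty list, boundary == list head. */
static void reset_boundary(void) { boundary = &free_list; }

/* Models aerator_add_to_boundary(). */
static void add_to_boundary(struct list_head *entry) { boundary = entry; }

/* Models __aerator_del_from_boundary(): move the boundary along if the
 * entry being removed is the boundary itself. */
static void del_from_boundary(struct list_head *entry)
{
	if (boundary == entry)
		boundary = entry->next;
}

/* Exercise the cases discussed above; returns 1 on success. */
int boundary_model_ok(void)
{
	struct list_head a, b;

	list_init(&free_list);
	reset_boundary();
	/* Empty list: boundary is the list head itself, not NULL. */
	if (boundary != &free_list)
		return 0;

	list_add_tail(&a, &free_list);
	list_add_tail(&b, &free_list);
	add_to_boundary(&a);

	/* Deleting the boundary entry slides the boundary to its next. */
	del_from_boundary(&a);
	list_del(&a);
	if (boundary != &b)
		return 0;

	del_from_boundary(&b);
	list_del(&b);
	/* Back to the empty-list case without any special handling. */
	return boundary == &free_list;
}
```

The point of the sketch is that the empty list and the "boundary entry removed" cases fall out of the same pointer update, which is the simplification being argued for.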


> > > > +void aerator_add_to_boundary(struct page *page, struct zone *zone)
> > > > +{
> > > > + unsigned int order = page_private(page) - AERATOR_MIN_ORDER;
> > > > + int mt = get_pcppage_migratetype(page);
> > > > + struct list_head **tail = &boundary[order][mt];
> > > > +
> > > > + *tail = &page->lru;
> > > > +}
> > > > +
> > > > +void aerator_shutdown(void)
> > > > +{
> > > > + static_key_slow_dec(&aerator_notify_enabled);
> > > > +
> > > > + while (atomic_read(&a_dev_info->refcnt))
> > > > + msleep(20);
> > >
> > > We generally frown on open-coded check/sleep loops. What is this for?
> >
> > We are waiting on the aerator to finish processing the list it had active.
> > With the static key disabled we should see the refcount wind down to 0.
> > Once that occurs we can safely free the a_dev_info structure since there
> > will be no other uses of it.
>
> That's fine, but we still don't open-code sleep loops. Please remove this.
>
> "Wait until we can free the thing" sounds to me like RCU. Do you want
> to use RCU here? A synchronize_rcu() call can be a very powerful thing
> if the read-side critical sections are amenable to it.

So the issue is I am not entirely sure RCU would be a good fit here. I
could handle the __aerator_notify call via an RCU setup; however, the call
to aerator_cycle probably wouldn't work well with it, since it would be
holding onto a_dev_info for an extended period of time and we wouldn't
want to stall RCU because the system is busy aerating a big section of
memory.

I'll have to think about this some more. As it currently stands I don't
think this completely solves what it is meant to anyway since I think it
is possible to race and end up with a scenario where another CPU might be
able to get past the static key check before we disable it, and then we
could free a_dev_info before it has a chance to take a reference to it.
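For illustration, here is a userspace sketch of the "wait until the refcount drains" shutdown without the open-coded msleep() loop, using a condition variable (the kernel analogue would be a completion or a wait_event()/wake_up() pair, not these pthread primitives; the structure and function names are invented for the model):

```c
#include <assert.h>
#include <pthread.h>

/* Models a_dev_info->refcnt plus a wakeup for the shutdown path. */
struct aerator_refs {
	pthread_mutex_t lock;
	pthread_cond_t  zero;   /* signaled when refcnt reaches 0 */
	int             refcnt;
};

static void refs_init(struct aerator_refs *r, int initial)
{
	pthread_mutex_init(&r->lock, NULL);
	pthread_cond_init(&r->zero, NULL);
	r->refcnt = initial;
}

/* Drop one reference; the last dropper wakes the shutdown waiter. */
static void refs_put(struct aerator_refs *r)
{
	pthread_mutex_lock(&r->lock);
	if (--r->refcnt == 0)
		pthread_cond_signal(&r->zero);
	pthread_mutex_unlock(&r->lock);
}

/* Replaces "while (atomic_read(...)) msleep(20);" with a real wait. */
static void refs_wait_for_zero(struct aerator_refs *r)
{
	pthread_mutex_lock(&r->lock);
	while (r->refcnt)
		pthread_cond_wait(&r->zero, &r->lock);
	pthread_mutex_unlock(&r->lock);
}

static void *worker(void *arg)
{
	refs_put(arg);          /* a "zone" finishing its aeration pass */
	return NULL;
}

/* Spawn a few "zones", then shut down; returns 0 on clean teardown. */
int run_shutdown_demo(void)
{
	struct aerator_refs r;
	pthread_t t[3];
	int i;

	refs_init(&r, 3);
	for (i = 0; i < 3; i++)
		pthread_create(&t[i], NULL, worker, &r);

	refs_wait_for_zero(&r); /* safe to free the structure after this */

	for (i = 0; i < 3; i++)
		pthread_join(t[i], NULL);
	return r.refcnt;
}
```

Note this sketch only removes the polling; it does not by itself close the static-key race described above, which needs the reference to be taken before the key check can pass.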

> > > > +static void aerator_schedule_initial_aeration(void)
> > > > +{
> > > > + struct zone *zone;
> > > > +
> > > > + for_each_populated_zone(zone) {
> > > > + spin_lock(&zone->lock);
> > > > + __aerator_notify(zone);
> > > > + spin_unlock(&zone->lock);
> > > > + }
> > > > +}
> > >
> > > Why do we need an initial aeration?
> >
> > This is mostly about avoiding any possible races while we are bringing up
> > the aerator. If we assume we are just going to start a cycle of aeration
> > for all zones when the aerator is brought up it makes it easier to be sure
> > we have gone through and checked all of the zones after initialization is
> > complete.
>
> Let me ask a different way: What will happen if we don't have this?
> Will things crash? Will they be slow? Do we not know?

I wouldn't expect any crashes. We may just not end up with the memory
being freed for some time if all the pages are freed before the aerator
device is registered, and there isn't any memory activity after that.

This was mostly about just making sure we flush the memory after the
device has been initialized.

> > > > +{
> > > > + struct list_head *batch = &a_dev_info->batch;
> > > > + int budget = a_dev_info->capacity;
> > >
> > > Where does capacity come from?
> >
> > It is the limit on how many pages we can process at a time. The value is
> > set in a_dev_info before the call to aerator_startup.
>
> Let me ask another way: Does it come from the user? Or is it
> automatically determined by some in-kernel heuristic?

It is being provided by the module that registers the aeration device. So
in patch 6 of the series we determined that we wanted to process 32 pages
at a time. So we set that as the limit since that is the number of hints
we had allocated in the virtio-balloon driver.
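The budget/capacity interaction can be sketched as a simple bounded drain (a userspace model of the quoted loop, not the kernel code; the counts stand in for list operations):

```c
#include <assert.h>

/*
 * Model of filling a_dev_info->batch: pull at most 'capacity' pages
 * from a zone that currently has 'available' raw free pages.
 * Returns how many pages were actually batched.
 */
int fill_batch(int available, int capacity)
{
	int budget = capacity;
	int batched = 0;

	while (available > 0 && budget > 0) {
		available--;	/* get_aeration_page() succeeded */
		budget--;
		batched++;	/* list_add_tail(&page->lru, batch) */
	}
	return batched;
}
```

With the virtio-balloon module registering a capacity of 32, a zone with plenty of raw pages yields full 32-page batches, while a nearly-drained zone yields a short final batch and the request bit is cleared.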

> > > > + while ((page = get_aeration_page(zone, order, mt))) {
> > > > + list_add_tail(&page->lru, batch);
> > > > +
> > > > + if (!--budget)
> > > > + return;
> > > > + }
> > > > + }
> > > > +
> > > > + /*
> > > > + * If there are no longer enough free pages to fully populate
> > > > + * the aerator, then we can just shut it down for this zone.
> > > > + */
> > > > + clear_bit(ZONE_AERATION_REQUESTED, &zone->flags);
> > > > + atomic_dec(&a_dev_info->refcnt);
> > > > +}
> > >
> > > Huh, so this is the number of threads doing aeration? Didn't we just
> > > make a big deal about there only being one zone being aerated at a time?
> > > Or, did I misunderstand what refcnt is from its lack of clear
> > > documentation?
> >
> > The refcnt is the number of zones requesting aeration plus one additional
> > if the thread is active. We are limited to only having pages from one zone
> > in the aerator at a time. That is to prevent us from having to maintain
> > multiple boundaries.
>
> That sounds like excellent documentation to add to 'refcnt's definition.

Will do.

> > > > +static void aerator_drain(struct zone *zone)
> > > > +{
> > > > + struct list_head *list = &a_dev_info->batch;
> > > > + struct page *page;
> > > > +
> > > > + /*
> > > > + * Drain the now aerated pages back into their respective
> > > > + * free lists/areas.
> > > > + */
> > > > + while ((page = list_first_entry_or_null(list, struct page, lru))) {
> > > > + list_del(&page->lru);
> > > > + put_aeration_page(zone, page);
> > > > + }
> > > > +}
> > > > +
> > > > +static void aerator_scrub_zone(struct zone *zone)
> > > > +{
> > > > + /* See if there are any pages to pull */
> > > > + if (!test_bit(ZONE_AERATION_REQUESTED, &zone->flags))
> > > > + return;
> > >
> > > How would someone ask for the zone to be scrubbed when aeration has not
> > > been requested?
> >
> > I'm not sure what you are asking here. Basically this function is called
> > per zone by aerator_cycle. Now that I think about it, I should probably
> > swap the names around so that we perform a cycle per zone and just scrub
> > memory generically.
>
> It looks like aerator_cycle() calls aerator_scrub_zone() on all zones
> all the time. This is the code responsible for ensuring that we don't
> do any aeration work on zones that do not need it.

Yes, that is correct.

Based on your comment here and a few other spots, I am assuming you would
prefer to see these sorts of tests pulled out and done before we call the
function. I started to notice the pattern, so I will update that for the
next patch set.


2019-07-15 09:44:34

by David Hildenbrand

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On 25.06.19 20:22, Dave Hansen wrote:
> On 6/25/19 10:00 AM, Alexander Duyck wrote:
>> Basically what we are doing is inflating the memory size we can report
>> by inserting voids into the free memory areas. In my mind that matches
>> up very well with what "aeration" is. It is similar to balloon in
>> functionality, however instead of inflating the balloon we are
>> inflating the free_list for higher order free areas by creating voids
>> where the madvised pages were.
>
> OK, then call it "free page auto ballooning" or "auto ballooning" or
> "allocator ballooning". s390 calls them "unused pages".
>
> Any of those things are clearer and more meaningful than "page aeration"
> to me.
>

Alex, if you want to generalize the approach, and not call it "hinting",
what about something similar to "page recycling".

Would also fit the "waste" example and would be clearer - at least to
me. Well, "bubble" does not apply anymore ...

--

Thanks,

David / dhildenb

2019-07-15 14:59:22

by Alexander Duyck

Subject: Re: [PATCH v1 0/6] mm / virtio: Provide support for paravirtual waste page treatment

On Mon, 2019-07-15 at 11:41 +0200, David Hildenbrand wrote:
> On 25.06.19 20:22, Dave Hansen wrote:
> > On 6/25/19 10:00 AM, Alexander Duyck wrote:
> > > Basically what we are doing is inflating the memory size we can report
> > > by inserting voids into the free memory areas. In my mind that matches
> > > up very well with what "aeration" is. It is similar to balloon in
> > > functionality, however instead of inflating the balloon we are
> > > inflating the free_list for higher order free areas by creating voids
> > > where the madvised pages were.
> >
> > OK, then call it "free page auto ballooning" or "auto ballooning" or
> > "allocator ballooning". s390 calls them "unused pages".
> >
> > Any of those things are clearer and more meaningful than "page aeration"
> > to me.
> >
>
> Alex, if you want to generalize the approach, and not call it "hinting",
> what about something similar to "page recycling".
>
> Would also fit the "waste" example and would be clearer - at least to
> me. Well, "bubble" does not apply anymore ...
>

I am fine with "page hinting". I have already gone through and started the
rename. The problem with "page recycling" is that it is actually pretty
similar to the name we had in the networking space for how the NICs will
recycle the Rx buffers.

For now I am going through and replacing instances of Aerated with Hinted,
and aeration with page_hinting. I should have a new patch set ready in a
couple days assuming no unforeseen issues.

Thanks.

- Alex

2019-07-16 09:58:16

by Michael S. Tsirkin

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Wed, Jun 19, 2019 at 03:33:38PM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Add support for aerating memory using the hinting feature provided by
> virtio-balloon. Hinting differs from the regular balloon functionality in
> that it is much less durable than a standard memory balloon. Instead of
> creating a list of pages that cannot be accessed the pages are only
> inaccessible while they are being indicated to the virtio interface. Once
> the interface has acknowledged them they are placed back into their
> respective free lists and are once again accessible by the guest system.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> drivers/virtio/Kconfig | 1
> drivers/virtio/virtio_balloon.c | 110 ++++++++++++++++++++++++++++++++++-
> include/uapi/linux/virtio_balloon.h | 1
> 3 files changed, 108 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index 023fc3bc01c6..9cdaccf92c3a 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
> tristate "Virtio balloon driver"
> depends on VIRTIO
> select MEMORY_BALLOON
> + select AERATION
> ---help---
> This driver supports increasing and decreasing the amount
> of memory within a KVM guest.
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 44339fc87cc7..91f1e8c9017d 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -18,6 +18,7 @@
> #include <linux/mm.h>
> #include <linux/mount.h>
> #include <linux/magic.h>
> +#include <linux/memory_aeration.h>
>
> /*
> * Balloon device works in 4K page units. So each page is pointed to by
> @@ -26,6 +27,7 @@
> */
> #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
> #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
> +#define VIRTIO_BALLOON_ARRAY_HINTS_MAX 32
> #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
>
> #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
> @@ -45,6 +47,7 @@ enum virtio_balloon_vq {
> VIRTIO_BALLOON_VQ_DEFLATE,
> VIRTIO_BALLOON_VQ_STATS,
> VIRTIO_BALLOON_VQ_FREE_PAGE,
> + VIRTIO_BALLOON_VQ_HINTING,
> VIRTIO_BALLOON_VQ_MAX
> };
>
> @@ -54,7 +57,8 @@ enum virtio_balloon_config_read {
>
> struct virtio_balloon {
> struct virtio_device *vdev;
> - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> + *hinting_vq;
>
> /* Balloon's own wq for cpu-intensive work items */
> struct workqueue_struct *balloon_wq;
> @@ -103,9 +107,21 @@ struct virtio_balloon {
> /* Synchronize access/update to this struct virtio_balloon elements */
> struct mutex balloon_lock;
>
> - /* The array of pfns we tell the Host about. */
> - unsigned int num_pfns;
> - __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> +
> + union {
> + /* The array of pfns we tell the Host about. */
> + struct {
> + unsigned int num_pfns;
> + __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> + };
> + /* The array of physical addresses we are hinting on */
> + struct {
> + unsigned int num_hints;
> + __virtio64 hints[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
> + };
> + };
> +
> + struct aerator_dev_info a_dev_info;
>
> /* Memory statistics */
> struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> @@ -151,6 +167,68 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>
> }
>
> +static u64 page_to_hints_pa_order(struct page *page)
> +{
> + unsigned char order;
> + dma_addr_t pa;
> +
> + BUILD_BUG_ON((64 - VIRTIO_BALLOON_PFN_SHIFT) >=
> + (1 << VIRTIO_BALLOON_PFN_SHIFT));
> +
> + /*
> + * Record physical page address combined with page order.
> > + * Order will never exceed 64 - VIRTIO_BALLOON_PFN_SHIFT
> > + * since the size has to fit into a 64b value. So as long
> > + * as VIRTIO_BALLOON_PFN_SHIFT is greater than this, combining
> > + * the two values should be safe.
> + */
> + pa = page_to_phys(page);
> + order = page_private(page) +
> + PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT;
> +
> + return (u64)(pa | order);
> +}
> +
> +void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
> +{
> + struct virtio_balloon *vb = container_of(a_dev_info,
> + struct virtio_balloon,
> + a_dev_info);
> + struct virtqueue *vq = vb->hinting_vq;
> + struct scatterlist sg;
> + unsigned int unused;
> + struct page *page;
> +
> + mutex_lock(&vb->balloon_lock);
> +
> + vb->num_hints = 0;
> +
> + list_for_each_entry(page, &a_dev_info->batch, lru) {
> + vb->hints[vb->num_hints++] =
> + cpu_to_virtio64(vb->vdev,
> + page_to_hints_pa_order(page));
> + }
> +
> + /* We shouldn't have been called if there is nothing to process */
> + if (WARN_ON(vb->num_hints == 0))
> + goto out;
> +
> + sg_init_one(&sg, vb->hints,
> + sizeof(vb->hints[0]) * vb->num_hints);
> +
> + /*
> + * We should always be able to add one buffer to an
> + * empty queue.
> + */
> + virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> + virtqueue_kick(vq);
> +
> + /* When host has read buffer, this completes via balloon_ack */
> + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> +out:
> + mutex_unlock(&vb->balloon_lock);
> +}
> +
> static void set_page_pfns(struct virtio_balloon *vb,
> __virtio32 pfns[], struct page *page)
> {
> @@ -475,6 +553,7 @@ static int init_vqs(struct virtio_balloon *vb)
> names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> + names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
>
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -486,11 +565,19 @@ static int init_vqs(struct virtio_balloon *vb)
> callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> }
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> + names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> + callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
> + }
> +
> err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> vqs, callbacks, names, NULL, NULL);
> if (err)
> return err;
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> + vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> +
> vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -929,12 +1016,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
> if (err)
> goto out_del_balloon_wq;
> }
> +
> + vb->a_dev_info.react = virtballoon_aerator_react;
> + vb->a_dev_info.capacity = VIRTIO_BALLOON_ARRAY_HINTS_MAX;
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> + err = aerator_startup(&vb->a_dev_info);
> + if (err)
> + goto out_unregister_shrinker;
> + }
> +
> virtio_device_ready(vdev);
>
> if (towards_target(vb))
> virtballoon_changed(vdev);
> return 0;
>
> +out_unregister_shrinker:
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> + virtio_balloon_unregister_shrinker(vb);
> out_del_balloon_wq:
> if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> destroy_workqueue(vb->balloon_wq);
> @@ -963,6 +1062,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> {
> struct virtio_balloon *vb = vdev->priv;
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> + aerator_shutdown();
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> virtio_balloon_unregister_shrinker(vb);
> spin_lock_irq(&vb->stop_update_lock);
> @@ -1032,6 +1133,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> VIRTIO_BALLOON_F_PAGE_POISON,
> + VIRTIO_BALLOON_F_HINTING,
> };
>
> static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..2b0f62814e22 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
> #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12



The approach here is very close to what on-demand hinting that is
already upstream does.

This should have resulted in most of the code being shared,
but that does not seem to happen here.

Can we unify the code in some way?
It can still use a separate feature flag, but there are things
I like very much about current hinting code, such as
using s/g instead of passing PFNs in a buffer.

If this doesn't work could you elaborate on why?

--
MST
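As an aside, the pa|order packing done by page_to_hints_pa_order() in the quoted patch can be modeled in isolation: with VIRTIO_BALLOON_PFN_SHIFT of 12, the low 12 bits of a page-aligned physical address are zero and free to carry the order. The helper names below are illustrative, not from the patch, and the sketch omits the PAGE_SHIFT adjustment the driver applies to the stored order:

```c
#include <assert.h>
#include <stdint.h>

#define PFN_SHIFT 12  /* VIRTIO_BALLOON_PFN_SHIFT: 4K balloon page units */

/*
 * A physical address of a page aligned to at least 4K has its low
 * 12 bits clear, so an order value below (1 << 12) can be OR'd in.
 */
static uint64_t pack_hint(uint64_t pa, unsigned int order)
{
	return pa | order;
}

/* Recover the physical address: mask off the order bits. */
static uint64_t hint_pa(uint64_t hint)
{
	return hint & ~((UINT64_C(1) << PFN_SHIFT) - 1);
}

/* Recover the order: keep only the low PFN_SHIFT bits. */
static unsigned int hint_order(uint64_t hint)
{
	return (unsigned int)(hint & ((UINT64_C(1) << PFN_SHIFT) - 1));
}

/* Round-trip check for a 2MB-aligned address hinted at order 9. */
int hint_roundtrip_ok(void)
{
	uint64_t h = pack_hint(UINT64_C(0x200000), 9);

	return hint_pa(h) == UINT64_C(0x200000) && hint_order(h) == 9;
}
```

The encoding works only because the page's alignment guarantees the low bits are zero, which is exactly what the BUILD_BUG_ON in the patch is asserting.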

2019-07-16 14:01:25

by Dave Hansen

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On 7/16/19 2:55 AM, Michael S. Tsirkin wrote:
> The approach here is very close to what on-demand hinting that is
> already upstream does.

Are you referring to the s390 (and powerpc) stuff that is hidden behind
arch_free_page()?

2019-07-16 14:13:24

by David Hildenbrand

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On 16.07.19 16:00, Dave Hansen wrote:
> On 7/16/19 2:55 AM, Michael S. Tsirkin wrote:
>> The approach here is very close to what on-demand hinting that is
>> already upstream does.
>
> Are you referring to the s390 (and powerpc) stuff that is hidden behind
> arch_free_page()?
>

I assume Michael meant "free page reporting".

--

Thanks,

David / dhildenb

2019-07-16 14:18:58

by David Hildenbrand

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On 16.07.19 16:12, David Hildenbrand wrote:
> On 16.07.19 16:00, Dave Hansen wrote:
>> On 7/16/19 2:55 AM, Michael S. Tsirkin wrote:
>>> The approach here is very close to what on-demand hinting that is
>>> already upstream does.
>>
>> Are you referring to the s390 (and powerpc) stuff that is hidden behind
>> arch_free_page()?
>>
>
> I assume Michael meant "free page reporting".
>

(https://lwn.net/Articles/759413/)

--

Thanks,

David / dhildenb

2019-07-16 14:42:30

by Dave Hansen

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On 7/16/19 7:12 AM, David Hildenbrand wrote:
> On 16.07.19 16:00, Dave Hansen wrote:
>> On 7/16/19 2:55 AM, Michael S. Tsirkin wrote:
>>> The approach here is very close to what on-demand hinting that is
>>> already upstream does.
>> Are you referring to the s390 (and powerpc) stuff that is hidden behind
>> arch_free_page()?
>>
> I assume Michael meant "free page reporting".

Where is the page allocator integration? The set you linked to has 5
patches, but only 4 were merged. This one is missing:

https://lore.kernel.org/patchwork/patch/961038/

2019-07-16 15:04:03

by David Hildenbrand

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On 16.07.19 16:41, Dave Hansen wrote:
> On 7/16/19 7:12 AM, David Hildenbrand wrote:
>> On 16.07.19 16:00, Dave Hansen wrote:
>>> On 7/16/19 2:55 AM, Michael S. Tsirkin wrote:
>>>> The approach here is very close to what on-demand hinting that is
>>>> already upstream does.
>>> Are you referring to the s390 (and powerpc) stuff that is hidden behind
>>> arch_free_page()?
>>>
>> I assume Michael meant "free page reporting".
>
> Where is the page allocator integration? The set you linked to has 5
> patches, but only 4 were merged. This one is missing:
>
> https://lore.kernel.org/patchwork/patch/961038/
>

I don't recall which version was actually merged (there were too many :)
). I think it was v37:

https://lore.kernel.org/patchwork/cover/977804/

And I remember there was a comment from Linus that led to the patch
you mentioned being dropped.

--

Thanks,

David / dhildenb

2019-07-16 15:04:50

by Wang, Wei W

[permalink] [raw]
Subject: RE: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tuesday, July 16, 2019 10:41 PM, Hansen, Dave wrote:
> Where is the page allocator integration? The set you linked to has 5 patches,
> but only 4 were merged. This one is missing:
>
> https://lore.kernel.org/patchwork/patch/961038/

For some reason, we used the regular page allocation to get pages
from the free list at that stage. This part could be improved by Alex
or Nitesh's approach.

The page address transmission from the balloon driver to the host
device could reuse what's upstreamed there. I think you could add a
new VIRTIO_BALLOON_CMD_xx for your usages.

Best,
Wei

2019-07-16 15:07:01

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 04:17:13PM +0200, David Hildenbrand wrote:
> On 16.07.19 16:12, David Hildenbrand wrote:
> > On 16.07.19 16:00, Dave Hansen wrote:
> >> On 7/16/19 2:55 AM, Michael S. Tsirkin wrote:
> >>> The approach here is very close to what on-demand hinting that is
> >>> already upstream does.
> >>
> >> Are you referring to the s390 (and powerpc) stuff that is hidden behind
> >> arch_free_page()?
> >>
> >
> > I assume Michael meant "free page reporting".
> >
>
> (https://lwn.net/Articles/759413/)


Yes - VIRTIO_BALLOON_F_FREE_PAGE_HINT.

> --
>
> Thanks,
>
> David / dhildenb

2019-07-16 15:38:03

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 2:55 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Wed, Jun 19, 2019 at 03:33:38PM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Add support for aerating memory using the hinting feature provided by
> > virtio-balloon. Hinting differs from the regular balloon functionality in
> > that it is much less durable than a standard memory balloon. Instead of
> > creating a list of pages that cannot be accessed the pages are only
> > inaccessible while they are being indicated to the virtio interface. Once
> > the interface has acknowledged them they are placed back into their
> > respective free lists and are once again accessible by the guest system.
> >
> > Signed-off-by: Alexander Duyck <[email protected]>
> > ---
> > drivers/virtio/Kconfig | 1
> > drivers/virtio/virtio_balloon.c | 110 ++++++++++++++++++++++++++++++++++-
> > include/uapi/linux/virtio_balloon.h | 1
> > 3 files changed, 108 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > index 023fc3bc01c6..9cdaccf92c3a 100644
> > --- a/drivers/virtio/Kconfig
> > +++ b/drivers/virtio/Kconfig
> > @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
> > tristate "Virtio balloon driver"
> > depends on VIRTIO
> > select MEMORY_BALLOON
> > + select AERATION
> > ---help---
> > This driver supports increasing and decreasing the amount
> > of memory within a KVM guest.
> > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > index 44339fc87cc7..91f1e8c9017d 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -18,6 +18,7 @@
> > #include <linux/mm.h>
> > #include <linux/mount.h>
> > #include <linux/magic.h>
> > +#include <linux/memory_aeration.h>
> >
> > /*
> > * Balloon device works in 4K page units. So each page is pointed to by
> > @@ -26,6 +27,7 @@
> > */
> > #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
> > #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
> > +#define VIRTIO_BALLOON_ARRAY_HINTS_MAX 32
> > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> >
> > #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
> > @@ -45,6 +47,7 @@ enum virtio_balloon_vq {
> > VIRTIO_BALLOON_VQ_DEFLATE,
> > VIRTIO_BALLOON_VQ_STATS,
> > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > + VIRTIO_BALLOON_VQ_HINTING,
> > VIRTIO_BALLOON_VQ_MAX
> > };
> >
> > @@ -54,7 +57,8 @@ enum virtio_balloon_config_read {
> >
> > struct virtio_balloon {
> > struct virtio_device *vdev;
> > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> > + *hinting_vq;
> >
> > /* Balloon's own wq for cpu-intensive work items */
> > struct workqueue_struct *balloon_wq;
> > @@ -103,9 +107,21 @@ struct virtio_balloon {
> > /* Synchronize access/update to this struct virtio_balloon elements */
> > struct mutex balloon_lock;
> >
> > - /* The array of pfns we tell the Host about. */
> > - unsigned int num_pfns;
> > - __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > +
> > + union {
> > + /* The array of pfns we tell the Host about. */
> > + struct {
> > + unsigned int num_pfns;
> > + __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > + };
> > + /* The array of physical addresses we are hinting on */
> > + struct {
> > + unsigned int num_hints;
> > + __virtio64 hints[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
> > + };
> > + };
> > +
> > + struct aerator_dev_info a_dev_info;
> >
> > /* Memory statistics */
> > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > @@ -151,6 +167,68 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >
> > }
> >
> > +static u64 page_to_hints_pa_order(struct page *page)
> > +{
> > + unsigned char order;
> > + dma_addr_t pa;
> > +
> > + BUILD_BUG_ON((64 - VIRTIO_BALLOON_PFN_SHIFT) >=
> > + (1 << VIRTIO_BALLOON_PFN_SHIFT));
> > +
> > + /*
> > + * Record physical page address combined with page order.
> > + * Order will never exceed 64 - VIRTIO_BALLON_PFN_SHIFT
> > + * since the size has to fit into a 64b value. So as long
> > + * as VIRTIO_BALLOON_SHIFT is greater than this combining
> > + * the two values should be safe.
> > + */
> > + pa = page_to_phys(page);
> > + order = page_private(page) +
> > + PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT;
> > +
> > + return (u64)(pa | order);
> > +}
> > +
> > +void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
> > +{
> > + struct virtio_balloon *vb = container_of(a_dev_info,
> > + struct virtio_balloon,
> > + a_dev_info);
> > + struct virtqueue *vq = vb->hinting_vq;
> > + struct scatterlist sg;
> > + unsigned int unused;
> > + struct page *page;
> > +
> > + mutex_lock(&vb->balloon_lock);
> > +
> > + vb->num_hints = 0;
> > +
> > + list_for_each_entry(page, &a_dev_info->batch, lru) {
> > + vb->hints[vb->num_hints++] =
> > + cpu_to_virtio64(vb->vdev,
> > + page_to_hints_pa_order(page));
> > + }
> > +
> > + /* We shouldn't have been called if there is nothing to process */
> > + if (WARN_ON(vb->num_hints == 0))
> > + goto out;
> > +
> > + sg_init_one(&sg, vb->hints,
> > + sizeof(vb->hints[0]) * vb->num_hints);
> > +
> > + /*
> > + * We should always be able to add one buffer to an
> > + * empty queue.
> > + */
> > + virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > + virtqueue_kick(vq);
> > +
> > + /* When host has read buffer, this completes via balloon_ack */
> > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> > +out:
> > + mutex_unlock(&vb->balloon_lock);
> > +}
> > +
> > static void set_page_pfns(struct virtio_balloon *vb,
> > __virtio32 pfns[], struct page *page)
> > {
> > @@ -475,6 +553,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > + names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
> >
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > @@ -486,11 +565,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > }
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > + names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> > + callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
> > + }
> > +
> > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > vqs, callbacks, names, NULL, NULL);
> > if (err)
> > return err;
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > + vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> > +
> > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > @@ -929,12 +1016,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > if (err)
> > goto out_del_balloon_wq;
> > }
> > +
> > + vb->a_dev_info.react = virtballoon_aerator_react;
> > + vb->a_dev_info.capacity = VIRTIO_BALLOON_ARRAY_HINTS_MAX;
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > + err = aerator_startup(&vb->a_dev_info);
> > + if (err)
> > + goto out_unregister_shrinker;
> > + }
> > +
> > virtio_device_ready(vdev);
> >
> > if (towards_target(vb))
> > virtballoon_changed(vdev);
> > return 0;
> >
> > +out_unregister_shrinker:
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > + virtio_balloon_unregister_shrinker(vb);
> > out_del_balloon_wq:
> > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > destroy_workqueue(vb->balloon_wq);
> > @@ -963,6 +1062,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > {
> > struct virtio_balloon *vb = vdev->priv;
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > + aerator_shutdown();
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > virtio_balloon_unregister_shrinker(vb);
> > spin_lock_irq(&vb->stop_update_lock);
> > @@ -1032,6 +1133,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > VIRTIO_BALLOON_F_PAGE_POISON,
> > + VIRTIO_BALLOON_F_HINTING,
> > };
> >
> > static struct virtio_driver virtio_balloon_driver = {
> > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > index a1966cd7b677..2b0f62814e22 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -36,6 +36,7 @@
> > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > +#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */
> >
> > /* Size of a PFN in the balloon interface. */
> > #define VIRTIO_BALLOON_PFN_SHIFT 12
>
>
>
> The approach here is very close to what on-demand hinting that is
> already upstream does.
>
> This should have resulted in most of the code being shared,
> but that does not seem to have happened here.
>
> Can we unify the code in some way?
> It can still use a separate feature flag, but there are things
> I like very much about current hinting code, such as
> using s/g instead of passing PFNs in a buffer.
>
> If this doesn't work could you elaborate on why?

As far as sending a scatter-gather goes, that shouldn't be too much of
an issue; however, I need to double-check that I will still be able to
keep the completions as a single block.

There is one significant spot where the "VIRTIO_BALLOON_F_FREE_PAGE_HINT"
code and my code differ: my code processes a fixed, discrete block of
pages at a time, whereas the FREE_PAGE_HINT code slurps up all available
high-order memory, stuffs it into a giant balloon, and has more of a
streaming setup, as it doesn't return anything until it is either forced
to by the shrinker or has processed all available memory.

The basic idea with the bubble hinting was to essentially create mini
balloons. As such I had based the code off of the balloon inflation
code. The only spot where it really differs is that I needed the
ability to pass higher-order pages, so I tweaked things and passed
"hints" instead of "pfns".

2019-07-16 16:08:46

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 08:37:06AM -0700, Alexander Duyck wrote:
> On Tue, Jul 16, 2019 at 2:55 AM Michael S. Tsirkin <[email protected]> wrote:
> >
> > On Wed, Jun 19, 2019 at 03:33:38PM -0700, Alexander Duyck wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > Add support for aerating memory using the hinting feature provided by
> > > virtio-balloon. Hinting differs from the regular balloon functionality in
> > > that it is much less durable than a standard memory balloon. Instead of
> > > creating a list of pages that cannot be accessed the pages are only
> > > inaccessible while they are being indicated to the virtio interface. Once
> > > the interface has acknowledged them they are placed back into their
> > > respective free lists and are once again accessible by the guest system.
> > >
> > > Signed-off-by: Alexander Duyck <[email protected]>
> > > ---
> > > drivers/virtio/Kconfig | 1
> > > drivers/virtio/virtio_balloon.c | 110 ++++++++++++++++++++++++++++++++++-
> > > include/uapi/linux/virtio_balloon.h | 1
> > > 3 files changed, 108 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > index 023fc3bc01c6..9cdaccf92c3a 100644
> > > --- a/drivers/virtio/Kconfig
> > > +++ b/drivers/virtio/Kconfig
> > > @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
> > > tristate "Virtio balloon driver"
> > > depends on VIRTIO
> > > select MEMORY_BALLOON
> > > + select AERATION
> > > ---help---
> > > This driver supports increasing and decreasing the amount
> > > of memory within a KVM guest.
> > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > index 44339fc87cc7..91f1e8c9017d 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -18,6 +18,7 @@
> > > #include <linux/mm.h>
> > > #include <linux/mount.h>
> > > #include <linux/magic.h>
> > > +#include <linux/memory_aeration.h>
> > >
> > > /*
> > > * Balloon device works in 4K page units. So each page is pointed to by
> > > @@ -26,6 +27,7 @@
> > > */
> > > #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
> > > #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
> > > +#define VIRTIO_BALLOON_ARRAY_HINTS_MAX 32
> > > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > >
> > > #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
> > > @@ -45,6 +47,7 @@ enum virtio_balloon_vq {
> > > VIRTIO_BALLOON_VQ_DEFLATE,
> > > VIRTIO_BALLOON_VQ_STATS,
> > > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > > + VIRTIO_BALLOON_VQ_HINTING,
> > > VIRTIO_BALLOON_VQ_MAX
> > > };
> > >
> > > @@ -54,7 +57,8 @@ enum virtio_balloon_config_read {
> > >
> > > struct virtio_balloon {
> > > struct virtio_device *vdev;
> > > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> > > + *hinting_vq;
> > >
> > > /* Balloon's own wq for cpu-intensive work items */
> > > struct workqueue_struct *balloon_wq;
> > > @@ -103,9 +107,21 @@ struct virtio_balloon {
> > > /* Synchronize access/update to this struct virtio_balloon elements */
> > > struct mutex balloon_lock;
> > >
> > > - /* The array of pfns we tell the Host about. */
> > > - unsigned int num_pfns;
> > > - __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > +
> > > + union {
> > > + /* The array of pfns we tell the Host about. */
> > > + struct {
> > > + unsigned int num_pfns;
> > > + __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > + };
> > > + /* The array of physical addresses we are hinting on */
> > > + struct {
> > > + unsigned int num_hints;
> > > + __virtio64 hints[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
> > > + };
> > > + };
> > > +
> > > + struct aerator_dev_info a_dev_info;
> > >
> > > /* Memory statistics */
> > > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > @@ -151,6 +167,68 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > >
> > > }
> > >
> > > +static u64 page_to_hints_pa_order(struct page *page)
> > > +{
> > > + unsigned char order;
> > > + dma_addr_t pa;
> > > +
> > > + BUILD_BUG_ON((64 - VIRTIO_BALLOON_PFN_SHIFT) >=
> > > + (1 << VIRTIO_BALLOON_PFN_SHIFT));
> > > +
> > > + /*
> > > + * Record physical page address combined with page order.
> > > + * Order will never exceed 64 - VIRTIO_BALLON_PFN_SHIFT
> > > + * since the size has to fit into a 64b value. So as long
> > > + * as VIRTIO_BALLOON_SHIFT is greater than this combining
> > > + * the two values should be safe.
> > > + */
> > > + pa = page_to_phys(page);
> > > + order = page_private(page) +
> > > + PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT;
> > > +
> > > + return (u64)(pa | order);
> > > +}
> > > +
> > > +void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
> > > +{
> > > + struct virtio_balloon *vb = container_of(a_dev_info,
> > > + struct virtio_balloon,
> > > + a_dev_info);
> > > + struct virtqueue *vq = vb->hinting_vq;
> > > + struct scatterlist sg;
> > > + unsigned int unused;
> > > + struct page *page;
> > > +
> > > + mutex_lock(&vb->balloon_lock);
> > > +
> > > + vb->num_hints = 0;
> > > +
> > > + list_for_each_entry(page, &a_dev_info->batch, lru) {
> > > + vb->hints[vb->num_hints++] =
> > > + cpu_to_virtio64(vb->vdev,
> > > + page_to_hints_pa_order(page));
> > > + }
> > > +
> > > + /* We shouldn't have been called if there is nothing to process */
> > > + if (WARN_ON(vb->num_hints == 0))
> > > + goto out;
> > > +
> > > + sg_init_one(&sg, vb->hints,
> > > + sizeof(vb->hints[0]) * vb->num_hints);
> > > +
> > > + /*
> > > + * We should always be able to add one buffer to an
> > > + * empty queue.
> > > + */
> > > + virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > + virtqueue_kick(vq);
> > > +
> > > + /* When host has read buffer, this completes via balloon_ack */
> > > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> > > +out:
> > > + mutex_unlock(&vb->balloon_lock);
> > > +}
> > > +
> > > static void set_page_pfns(struct virtio_balloon *vb,
> > > __virtio32 pfns[], struct page *page)
> > > {
> > > @@ -475,6 +553,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > + names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
> > >
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > > @@ -486,11 +565,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > }
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > > + names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> > > + callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
> > > + }
> > > +
> > > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > > vqs, callbacks, names, NULL, NULL);
> > > if (err)
> > > return err;
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > > + vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> > > +
> > > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > @@ -929,12 +1016,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > > if (err)
> > > goto out_del_balloon_wq;
> > > }
> > > +
> > > + vb->a_dev_info.react = virtballoon_aerator_react;
> > > + vb->a_dev_info.capacity = VIRTIO_BALLOON_ARRAY_HINTS_MAX;
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > > + err = aerator_startup(&vb->a_dev_info);
> > > + if (err)
> > > + goto out_unregister_shrinker;
> > > + }
> > > +
> > > virtio_device_ready(vdev);
> > >
> > > if (towards_target(vb))
> > > virtballoon_changed(vdev);
> > > return 0;
> > >
> > > +out_unregister_shrinker:
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > + virtio_balloon_unregister_shrinker(vb);
> > > out_del_balloon_wq:
> > > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > > destroy_workqueue(vb->balloon_wq);
> > > @@ -963,6 +1062,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > > {
> > > struct virtio_balloon *vb = vdev->priv;
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > > + aerator_shutdown();
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > virtio_balloon_unregister_shrinker(vb);
> > > spin_lock_irq(&vb->stop_update_lock);
> > > @@ -1032,6 +1133,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > > VIRTIO_BALLOON_F_PAGE_POISON,
> > > + VIRTIO_BALLOON_F_HINTING,
> > > };
> > >
> > > static struct virtio_driver virtio_balloon_driver = {
> > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > > index a1966cd7b677..2b0f62814e22 100644
> > > --- a/include/uapi/linux/virtio_balloon.h
> > > +++ b/include/uapi/linux/virtio_balloon.h
> > > @@ -36,6 +36,7 @@
> > > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > > +#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */
> > >
> > > /* Size of a PFN in the balloon interface. */
> > > #define VIRTIO_BALLOON_PFN_SHIFT 12
> >
> >
> >
> > The approach here is very close to what on-demand hinting that is
> > already upstream does.
> >
> > This should have resulted in most of the code being shared,
> > but that does not seem to have happened here.
> >
> > Can we unify the code in some way?
> > It can still use a separate feature flag, but there are things
> > I like very much about current hinting code, such as
> > using s/g instead of passing PFNs in a buffer.
> >
> > If this doesn't work could you elaborate on why?
>
> As far as sending a scatter-gather goes, that shouldn't be too much of
> an issue; however, I need to double-check that I will still be able to
> keep the completions as a single block.
>
> There is one significant spot where the "VIRTIO_BALLOON_F_FREE_PAGE_HINT"
> code and my code differ: my code processes a fixed, discrete block of
> pages at a time, whereas the FREE_PAGE_HINT code slurps up all available
> high-order memory, stuffs it into a giant balloon, and has more of a
> streaming setup, as it doesn't return anything until it is either forced
> to by the shrinker or has processed all available memory.

This is what I am saying. Having watched that patchset being developed,
I think that's simply because processing blocks required mm core
changes, which Wei was not up to pushing through.


If we did

	while (1) {
		alloc_pages
		add_buf
		get_buf
		free_pages
	}

We'd end up passing the same page to the balloon again and again.

So we end up reserving lots of memory with alloc_pages instead.

What I am saying is that now that you are developing
infrastructure to iterate over free pages,
FREE_PAGE_HINT should be able to use it too.
Whether that's possible might be a good indication of
whether the new mm APIs make sense.

> The basic idea with the bubble hinting was to essentially create mini
> balloons. As such I had based the code off of the balloon inflation
> code. The only spot where it really differs is that I needed the
> ability to pass higher-order pages, so I tweaked things and passed
> "hints" instead of "pfns".

And that is fine. But there isn't really such a big difference from
FREE_PAGE_HINT, except that FREE_PAGE_HINT triggers upon host request
rather than in response to guest load.

--
MST

2019-07-16 16:13:48

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 03:01:52PM +0000, Wang, Wei W wrote:
> On Tuesday, July 16, 2019 10:41 PM, Hansen, Dave wrote:
> > Where is the page allocator integration? The set you linked to has 5 patches,
> > but only 4 were merged. This one is missing:
> >
> > https://lore.kernel.org/patchwork/patch/961038/
>
> For some reason, we used the regular page allocation to get pages
> from the free list at that stage.


This is what Linus suggested, that is why:

https://lkml.org/lkml/2018/6/27/461

and

https://lkml.org/lkml/2018/7/11/795


See also

https://lkml.org/lkml/2018/7/10/1157

for some failed attempts to upstream mm core changes
related to this.

> This part could be improved by Alex
> or Nitesh's approach.
>
> The page address transmission from the balloon driver to the host
> device could reuse what's upstreamed there. I think you could add a
> new VIRTIO_BALLOON_CMD_xx for your usages.
>
> Best,
> Wei

2019-07-16 16:55:40

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 9:08 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Tue, Jul 16, 2019 at 08:37:06AM -0700, Alexander Duyck wrote:
> > On Tue, Jul 16, 2019 at 2:55 AM Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > On Wed, Jun 19, 2019 at 03:33:38PM -0700, Alexander Duyck wrote:
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > Add support for aerating memory using the hinting feature provided by
> > > > virtio-balloon. Hinting differs from the regular balloon functionality in
> > > > that it is much less durable than a standard memory balloon. Instead of
> > > > creating a list of pages that cannot be accessed the pages are only
> > > > inaccessible while they are being indicated to the virtio interface. Once
> > > > the interface has acknowledged them they are placed back into their
> > > > respective free lists and are once again accessible by the guest system.
> > > >
> > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > ---
> > > > drivers/virtio/Kconfig | 1
> > > > drivers/virtio/virtio_balloon.c | 110 ++++++++++++++++++++++++++++++++++-
> > > > include/uapi/linux/virtio_balloon.h | 1
> > > > 3 files changed, 108 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > index 023fc3bc01c6..9cdaccf92c3a 100644
> > > > --- a/drivers/virtio/Kconfig
> > > > +++ b/drivers/virtio/Kconfig
> > > > @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
> > > > tristate "Virtio balloon driver"
> > > > depends on VIRTIO
> > > > select MEMORY_BALLOON
> > > > + select AERATION
> > > > ---help---
> > > > This driver supports increasing and decreasing the amount
> > > > of memory within a KVM guest.
> > > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > > index 44339fc87cc7..91f1e8c9017d 100644
> > > > --- a/drivers/virtio/virtio_balloon.c
> > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > @@ -18,6 +18,7 @@
> > > > #include <linux/mm.h>
> > > > #include <linux/mount.h>
> > > > #include <linux/magic.h>
> > > > +#include <linux/memory_aeration.h>
> > > >
> > > > /*
> > > > * Balloon device works in 4K page units. So each page is pointed to by
> > > > @@ -26,6 +27,7 @@
> > > > */
> > > > #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
> > > > #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
> > > > +#define VIRTIO_BALLOON_ARRAY_HINTS_MAX 32
> > > > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > >
> > > > #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
> > > > @@ -45,6 +47,7 @@ enum virtio_balloon_vq {
> > > > VIRTIO_BALLOON_VQ_DEFLATE,
> > > > VIRTIO_BALLOON_VQ_STATS,
> > > > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > > > + VIRTIO_BALLOON_VQ_HINTING,
> > > > VIRTIO_BALLOON_VQ_MAX
> > > > };
> > > >
> > > > @@ -54,7 +57,8 @@ enum virtio_balloon_config_read {
> > > >
> > > > struct virtio_balloon {
> > > > struct virtio_device *vdev;
> > > > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > > > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> > > > + *hinting_vq;
> > > >
> > > > /* Balloon's own wq for cpu-intensive work items */
> > > > struct workqueue_struct *balloon_wq;
> > > > @@ -103,9 +107,21 @@ struct virtio_balloon {
> > > > /* Synchronize access/update to this struct virtio_balloon elements */
> > > > struct mutex balloon_lock;
> > > >
> > > > - /* The array of pfns we tell the Host about. */
> > > > - unsigned int num_pfns;
> > > > - __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > > +
> > > > + union {
> > > > + /* The array of pfns we tell the Host about. */
> > > > + struct {
> > > > + unsigned int num_pfns;
> > > > + __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > > + };
> > > > + /* The array of physical addresses we are hinting on */
> > > > + struct {
> > > > + unsigned int num_hints;
> > > > + __virtio64 hints[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
> > > > + };
> > > > + };
> > > > +
> > > > + struct aerator_dev_info a_dev_info;
> > > >
> > > > /* Memory statistics */
> > > > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > > @@ -151,6 +167,68 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > >
> > > > }
> > > >
> > > > +static u64 page_to_hints_pa_order(struct page *page)
> > > > +{
> > > > + unsigned char order;
> > > > + dma_addr_t pa;
> > > > +
> > > > + BUILD_BUG_ON((64 - VIRTIO_BALLOON_PFN_SHIFT) >=
> > > > + (1 << VIRTIO_BALLOON_PFN_SHIFT));
> > > > +
> > > > + /*
> > > > + * Record physical page address combined with page order.
> > > > + * Order will never exceed 64 - VIRTIO_BALLON_PFN_SHIFT
> > > > + * since the size has to fit into a 64b value. So as long
> > > > + * as VIRTIO_BALLOON_SHIFT is greater than this combining
> > > > + * the two values should be safe.
> > > > + */
> > > > + pa = page_to_phys(page);
> > > > + order = page_private(page) +
> > > > + PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT;
> > > > +
> > > > + return (u64)(pa | order);
> > > > +}
> > > > +
> > > > +void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
> > > > +{
> > > > + struct virtio_balloon *vb = container_of(a_dev_info,
> > > > + struct virtio_balloon,
> > > > + a_dev_info);
> > > > + struct virtqueue *vq = vb->hinting_vq;
> > > > + struct scatterlist sg;
> > > > + unsigned int unused;
> > > > + struct page *page;
> > > > +
> > > > + mutex_lock(&vb->balloon_lock);
> > > > +
> > > > + vb->num_hints = 0;
> > > > +
> > > > + list_for_each_entry(page, &a_dev_info->batch, lru) {
> > > > + vb->hints[vb->num_hints++] =
> > > > + cpu_to_virtio64(vb->vdev,
> > > > + page_to_hints_pa_order(page));
> > > > + }
> > > > +
> > > > + /* We shouldn't have been called if there is nothing to process */
> > > > + if (WARN_ON(vb->num_hints == 0))
> > > > + goto out;
> > > > +
> > > > + sg_init_one(&sg, vb->hints,
> > > > + sizeof(vb->hints[0]) * vb->num_hints);
> > > > +
> > > > + /*
> > > > + * We should always be able to add one buffer to an
> > > > + * empty queue.
> > > > + */
> > > > + virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > > + virtqueue_kick(vq);
> > > > +
> > > > + /* When host has read buffer, this completes via balloon_ack */
> > > > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> > > > +out:
> > > > + mutex_unlock(&vb->balloon_lock);
> > > > +}
> > > > +
> > > > static void set_page_pfns(struct virtio_balloon *vb,
> > > > __virtio32 pfns[], struct page *page)
> > > > {
> > > > @@ -475,6 +553,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > > > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > > > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > > > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > > + names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
> > > >
> > > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > > > @@ -486,11 +565,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > > > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > > }
> > > >
> > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > > > + names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> > > > + callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
> > > > + }
> > > > +
> > > > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > > > vqs, callbacks, names, NULL, NULL);
> > > > if (err)
> > > > return err;
> > > >
> > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > > > + vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> > > > +
> > > > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > > > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > > @@ -929,12 +1016,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > > > if (err)
> > > > goto out_del_balloon_wq;
> > > > }
> > > > +
> > > > + vb->a_dev_info.react = virtballoon_aerator_react;
> > > > + vb->a_dev_info.capacity = VIRTIO_BALLOON_ARRAY_HINTS_MAX;
> > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > > > + err = aerator_startup(&vb->a_dev_info);
> > > > + if (err)
> > > > + goto out_unregister_shrinker;
> > > > + }
> > > > +
> > > > virtio_device_ready(vdev);
> > > >
> > > > if (towards_target(vb))
> > > > virtballoon_changed(vdev);
> > > > return 0;
> > > >
> > > > +out_unregister_shrinker:
> > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > > + virtio_balloon_unregister_shrinker(vb);
> > > > out_del_balloon_wq:
> > > > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > > > destroy_workqueue(vb->balloon_wq);
> > > > @@ -963,6 +1062,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > > > {
> > > > struct virtio_balloon *vb = vdev->priv;
> > > >
> > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > > > + aerator_shutdown();
> > > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > > virtio_balloon_unregister_shrinker(vb);
> > > > spin_lock_irq(&vb->stop_update_lock);
> > > > @@ -1032,6 +1133,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > > > VIRTIO_BALLOON_F_PAGE_POISON,
> > > > + VIRTIO_BALLOON_F_HINTING,
> > > > };
> > > >
> > > > static struct virtio_driver virtio_balloon_driver = {
> > > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > > > index a1966cd7b677..2b0f62814e22 100644
> > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > @@ -36,6 +36,7 @@
> > > > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > > > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > > > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > > > +#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */
> > > >
> > > > /* Size of a PFN in the balloon interface. */
> > > > #define VIRTIO_BALLOON_PFN_SHIFT 12
> > >
> > >
> > >
> > > The approach here is very close to what the on-demand hinting that
> > > is already upstream does.
> > >
> > > This should have resulted in most of the code being shared,
> > > but this does not seem to happen here.
> > >
> > > Can we unify the code in some way?
> > > It can still use a separate feature flag, but there are things
> > > I like very much about current hinting code, such as
> > > using s/g instead of passing PFNs in a buffer.
> > >
> > > If this doesn't work could you elaborate on why?
> >
> > As far as sending a scatter gather that shouldn't be too much of an
> > issue, however I need to double check that I will still be able to
> > keep the completions as a single block.
> >
> > One significant spot where the "VIRTIO_BALLOON_F_FREE_PAGE_HINT" code
> > and my code differ: my code is processing a fixed discrete block of
> > pages at a time, whereas the FREE_PAGE_HINT code is slurping up all
> > available high-order memory and stuffing it into a giant balloon and
> > has more of a streaming setup, as it doesn't return things until it is
> > either forced to by the shrinker or has processed all available
> > memory.
>
> This is what I am saying. Having watched that patchset being developed,
> I think that's simply because processing blocks required mm core
> changes, which Wei was not up to pushing through.
>
>
> If we did
>
> while (1) {
> alloc_pages
> add_buf
> get_buf
> free_pages
> }
>
> We'd end up passing the same page to balloon again and again.
>
> So we end up reserving lots of memory with alloc_pages instead.
>
> What I am saying is that now that you are developing
> infrastructure to iterate over free pages,
> FREE_PAGE_HINT should be able to use it too.
> Whether that's possible might be a good indication of
> whether the new mm APIs make sense.

The problem is the infrastructure as implemented isn't designed to do
that. I am pretty certain this interface will have issues with being
given small blocks to process at a time.

Basically the design for the FREE_PAGE_HINT feature doesn't really
have the concept of doing things a bit at a time. It is either
filling, stopped, or done. From what I can tell it requires a
configuration change for the virtio balloon interface to toggle
between those states.

> > The basic idea with the bubble hinting was to essentially create mini
> > balloons. As such I had based the code off of the balloon inflation
> > code. The only spot where it really differs is that I needed the
> > ability to pass higher order pages so I tweaked things and passed
> > "hints" instead of "pfns".
>
> And that is fine. But there isn't really such a big difference with
> FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> in response to guest load.

I disagree, I believe there is a significant difference. The
FREE_PAGE_HINT code was implemented to be more of a streaming
interface. This is one of the things Linus kept complaining about in
his comments. This code attempts to pull in ALL of the higher order
pages, not just a smaller block of them. Honestly, the difference lies
more in the hypervisor interface than in what is needed for the kernel
interface; however, the design of the hypervisor interface would make
doing things more incrementally much more difficult.

With that said I will take a look into at least using the scatter
gather interface directly rather than sending the list. I think I can
probably do that much. However it will actually reduce code reuse as I
have to check and verify the pages have been processed before I can
free them back to the host.

- Alex

2019-07-16 17:42:21

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 09:54:37AM -0700, Alexander Duyck wrote:
> On Tue, Jul 16, 2019 at 9:08 AM Michael S. Tsirkin <[email protected]> wrote:
> >
> > On Tue, Jul 16, 2019 at 08:37:06AM -0700, Alexander Duyck wrote:
> > > On Tue, Jul 16, 2019 at 2:55 AM Michael S. Tsirkin <[email protected]> wrote:
> > > >
> > > > On Wed, Jun 19, 2019 at 03:33:38PM -0700, Alexander Duyck wrote:
> > > > > From: Alexander Duyck <[email protected]>
> > > > >
> > > > > Add support for aerating memory using the hinting feature provided by
> > > > > virtio-balloon. Hinting differs from the regular balloon functionality in
> > > > > that it is much less durable than a standard memory balloon. Instead of
> > > > > creating a list of pages that cannot be accessed the pages are only
> > > > > inaccessible while they are being indicated to the virtio interface. Once
> > > > > the interface has acknowledged them they are placed back into their
> > > > > respective free lists and are once again accessible by the guest system.
> > > > >
> > > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > > ---
> > > > > drivers/virtio/Kconfig | 1
> > > > > drivers/virtio/virtio_balloon.c | 110 ++++++++++++++++++++++++++++++++++-
> > > > > include/uapi/linux/virtio_balloon.h | 1
> > > > > 3 files changed, 108 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> > > > > index 023fc3bc01c6..9cdaccf92c3a 100644
> > > > > --- a/drivers/virtio/Kconfig
> > > > > +++ b/drivers/virtio/Kconfig
> > > > > @@ -47,6 +47,7 @@ config VIRTIO_BALLOON
> > > > > tristate "Virtio balloon driver"
> > > > > depends on VIRTIO
> > > > > select MEMORY_BALLOON
> > > > > + select AERATION
> > > > > ---help---
> > > > > This driver supports increasing and decreasing the amount
> > > > > of memory within a KVM guest.
> > > > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> > > > > index 44339fc87cc7..91f1e8c9017d 100644
> > > > > --- a/drivers/virtio/virtio_balloon.c
> > > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > > @@ -18,6 +18,7 @@
> > > > > #include <linux/mm.h>
> > > > > #include <linux/mount.h>
> > > > > #include <linux/magic.h>
> > > > > +#include <linux/memory_aeration.h>
> > > > >
> > > > > /*
> > > > > * Balloon device works in 4K page units. So each page is pointed to by
> > > > > @@ -26,6 +27,7 @@
> > > > > */
> > > > > #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
> > > > > #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
> > > > > +#define VIRTIO_BALLOON_ARRAY_HINTS_MAX 32
> > > > > #define VIRTBALLOON_OOM_NOTIFY_PRIORITY 80
> > > > >
> > > > > #define VIRTIO_BALLOON_FREE_PAGE_ALLOC_FLAG (__GFP_NORETRY | __GFP_NOWARN | \
> > > > > @@ -45,6 +47,7 @@ enum virtio_balloon_vq {
> > > > > VIRTIO_BALLOON_VQ_DEFLATE,
> > > > > VIRTIO_BALLOON_VQ_STATS,
> > > > > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > > > > + VIRTIO_BALLOON_VQ_HINTING,
> > > > > VIRTIO_BALLOON_VQ_MAX
> > > > > };
> > > > >
> > > > > @@ -54,7 +57,8 @@ enum virtio_balloon_config_read {
> > > > >
> > > > > struct virtio_balloon {
> > > > > struct virtio_device *vdev;
> > > > > - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq;
> > > > > + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *free_page_vq,
> > > > > + *hinting_vq;
> > > > >
> > > > > /* Balloon's own wq for cpu-intensive work items */
> > > > > struct workqueue_struct *balloon_wq;
> > > > > @@ -103,9 +107,21 @@ struct virtio_balloon {
> > > > > /* Synchronize access/update to this struct virtio_balloon elements */
> > > > > struct mutex balloon_lock;
> > > > >
> > > > > - /* The array of pfns we tell the Host about. */
> > > > > - unsigned int num_pfns;
> > > > > - __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > > > +
> > > > > + union {
> > > > > + /* The array of pfns we tell the Host about. */
> > > > > + struct {
> > > > > + unsigned int num_pfns;
> > > > > + __virtio32 pfns[VIRTIO_BALLOON_ARRAY_PFNS_MAX];
> > > > > + };
> > > > > + /* The array of physical addresses we are hinting on */
> > > > > + struct {
> > > > > + unsigned int num_hints;
> > > > > + __virtio64 hints[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
> > > > > + };
> > > > > + };
> > > > > +
> > > > > + struct aerator_dev_info a_dev_info;
> > > > >
> > > > > /* Memory statistics */
> > > > > struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
> > > > > @@ -151,6 +167,68 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > > >
> > > > > }
> > > > >
> > > > > +static u64 page_to_hints_pa_order(struct page *page)
> > > > > +{
> > > > > + unsigned char order;
> > > > > + dma_addr_t pa;
> > > > > +
> > > > > + BUILD_BUG_ON((64 - VIRTIO_BALLOON_PFN_SHIFT) >=
> > > > > + (1 << VIRTIO_BALLOON_PFN_SHIFT));
> > > > > +
> > > > > + /*
> > > > > + * Record physical page address combined with page order.
> > > > > + * Order will never exceed 64 - VIRTIO_BALLOON_PFN_SHIFT
> > > > > + * since the size has to fit into a 64b value. So as long
> > > > > + * as VIRTIO_BALLOON_PFN_SHIFT is greater than this, combining
> > > > > + * the two values should be safe.
> > > > > + */
> > > > > + pa = page_to_phys(page);
> > > > > + order = page_private(page) +
> > > > > + PAGE_SHIFT - VIRTIO_BALLOON_PFN_SHIFT;
> > > > > +
> > > > > + return (u64)(pa | order);
> > > > > +}
> > > > > +
> > > > > +void virtballoon_aerator_react(struct aerator_dev_info *a_dev_info)
> > > > > +{
> > > > > + struct virtio_balloon *vb = container_of(a_dev_info,
> > > > > + struct virtio_balloon,
> > > > > + a_dev_info);
> > > > > + struct virtqueue *vq = vb->hinting_vq;
> > > > > + struct scatterlist sg;
> > > > > + unsigned int unused;
> > > > > + struct page *page;
> > > > > +
> > > > > + mutex_lock(&vb->balloon_lock);
> > > > > +
> > > > > + vb->num_hints = 0;
> > > > > +
> > > > > + list_for_each_entry(page, &a_dev_info->batch, lru) {
> > > > > + vb->hints[vb->num_hints++] =
> > > > > + cpu_to_virtio64(vb->vdev,
> > > > > + page_to_hints_pa_order(page));
> > > > > + }
> > > > > +
> > > > > + /* We shouldn't have been called if there is nothing to process */
> > > > > + if (WARN_ON(vb->num_hints == 0))
> > > > > + goto out;
> > > > > +
> > > > > + sg_init_one(&sg, vb->hints,
> > > > > + sizeof(vb->hints[0]) * vb->num_hints);
> > > > > +
> > > > > + /*
> > > > > + * We should always be able to add one buffer to an
> > > > > + * empty queue.
> > > > > + */
> > > > > + virtqueue_add_outbuf(vq, &sg, 1, vb, GFP_KERNEL);
> > > > > + virtqueue_kick(vq);
> > > > > +
> > > > > + /* When host has read buffer, this completes via balloon_ack */
> > > > > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> > > > > +out:
> > > > > + mutex_unlock(&vb->balloon_lock);
> > > > > +}
> > > > > +
> > > > > static void set_page_pfns(struct virtio_balloon *vb,
> > > > > __virtio32 pfns[], struct page *page)
> > > > > {
> > > > > @@ -475,6 +553,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > > > > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > > > > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > > > > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > > > + names[VIRTIO_BALLOON_VQ_HINTING] = NULL;
> > > > >
> > > > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > > > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > > > > @@ -486,11 +565,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > > > > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > > > }
> > > > >
> > > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > > > > + names[VIRTIO_BALLOON_VQ_HINTING] = "hinting_vq";
> > > > > + callbacks[VIRTIO_BALLOON_VQ_HINTING] = balloon_ack;
> > > > > + }
> > > > > +
> > > > > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > > > > vqs, callbacks, names, NULL, NULL);
> > > > > if (err)
> > > > > return err;
> > > > >
> > > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > > > > + vb->hinting_vq = vqs[VIRTIO_BALLOON_VQ_HINTING];
> > > > > +
> > > > > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > > > > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > > > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > > > @@ -929,12 +1016,24 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > > > > if (err)
> > > > > goto out_del_balloon_wq;
> > > > > }
> > > > > +
> > > > > + vb->a_dev_info.react = virtballoon_aerator_react;
> > > > > + vb->a_dev_info.capacity = VIRTIO_BALLOON_ARRAY_HINTS_MAX;
> > > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING)) {
> > > > > + err = aerator_startup(&vb->a_dev_info);
> > > > > + if (err)
> > > > > + goto out_unregister_shrinker;
> > > > > + }
> > > > > +
> > > > > virtio_device_ready(vdev);
> > > > >
> > > > > if (towards_target(vb))
> > > > > virtballoon_changed(vdev);
> > > > > return 0;
> > > > >
> > > > > +out_unregister_shrinker:
> > > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > > > + virtio_balloon_unregister_shrinker(vb);
> > > > > out_del_balloon_wq:
> > > > > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > > > > destroy_workqueue(vb->balloon_wq);
> > > > > @@ -963,6 +1062,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > > > > {
> > > > > struct virtio_balloon *vb = vdev->priv;
> > > > >
> > > > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_HINTING))
> > > > > + aerator_shutdown();
> > > > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > > > virtio_balloon_unregister_shrinker(vb);
> > > > > spin_lock_irq(&vb->stop_update_lock);
> > > > > @@ -1032,6 +1133,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > > > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > > > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > > > > VIRTIO_BALLOON_F_PAGE_POISON,
> > > > > + VIRTIO_BALLOON_F_HINTING,
> > > > > };
> > > > >
> > > > > static struct virtio_driver virtio_balloon_driver = {
> > > > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > > > > index a1966cd7b677..2b0f62814e22 100644
> > > > > --- a/include/uapi/linux/virtio_balloon.h
> > > > > +++ b/include/uapi/linux/virtio_balloon.h
> > > > > @@ -36,6 +36,7 @@
> > > > > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > > > > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > > > > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > > > > +#define VIRTIO_BALLOON_F_HINTING 5 /* Page hinting virtqueue */
> > > > >
> > > > > /* Size of a PFN in the balloon interface. */
> > > > > #define VIRTIO_BALLOON_PFN_SHIFT 12
> > > >
> > > >
> > > >
> > > > The approach here is very close to what the on-demand hinting that
> > > > is already upstream does.
> > > >
> > > > This should have resulted in most of the code being shared,
> > > > but this does not seem to happen here.
> > > >
> > > > Can we unify the code in some way?
> > > > It can still use a separate feature flag, but there are things
> > > > I like very much about current hinting code, such as
> > > > using s/g instead of passing PFNs in a buffer.
> > > >
> > > > If this doesn't work could you elaborate on why?
> > >
> > > As far as sending a scatter gather that shouldn't be too much of an
> > > issue, however I need to double check that I will still be able to
> > > keep the completions as a single block.
> > >
> > > One significant spot where the "VIRTIO_BALLOON_F_FREE_PAGE_HINT" code
> > > and my code differ: my code is processing a fixed discrete block of
> > > pages at a time, whereas the FREE_PAGE_HINT code is slurping up all
> > > available high-order memory and stuffing it into a giant balloon and
> > > has more of a streaming setup, as it doesn't return things until it is
> > > either forced to by the shrinker or has processed all available
> > > memory.
> >
> > This is what I am saying. Having watched that patchset being developed,
> > I think that's simply because processing blocks required mm core
> > changes, which Wei was not up to pushing through.
> >
> >
> > If we did
> >
> > while (1) {
> > alloc_pages
> > add_buf
> > get_buf
> > free_pages
> > }
> >
> > We'd end up passing the same page to balloon again and again.
> >
> > So we end up reserving lots of memory with alloc_pages instead.
> >
> > What I am saying is that now that you are developing
> > infrastructure to iterate over free pages,
> > FREE_PAGE_HINT should be able to use it too.
> > Whether that's possible might be a good indication of
> > whether the new mm APIs make sense.
>
> The problem is the infrastructure as implemented isn't designed to do
> that. I am pretty certain this interface will have issues with being
> given small blocks to process at a time.
>
> Basically the design for the FREE_PAGE_HINT feature doesn't really
> have the concept of doing things a bit at a time. It is either
> filling, stopped, or done. From what I can tell it requires a
> configuration change for the virtio balloon interface to toggle
> between those states.

Maybe I misunderstand what you are saying.

Filling state can definitely report things
a bit at a time. It does not assume that
all of guest free memory can fit in a VQ.



> > > The basic idea with the bubble hinting was to essentially create mini
> > > balloons. As such I had based the code off of the balloon inflation
> > > code. The only spot where it really differs is that I needed the
> > > ability to pass higher order pages so I tweaked things and passed
> > > "hints" instead of "pfns".
> >
> > And that is fine. But there isn't really such a big difference with
> > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > in response to guest load.
>
> I disagree, I believe there is a significant difference.

Yes there is, I just don't think it's in the iteration.
The iteration seems to be useful to hinting.

> The
> FREE_PAGE_HINT code was implemented to be more of a streaming
> interface.

It's implemented like this but it does not follow from
the interface. The implementation is a combination of
attempts to minimize # of exits and minimize mm core changes.

> This is one of the things Linus kept complaining about in
> his comments. This code attempts to pull in ALL of the higher order
> pages, not just a smaller block of them.

It wants to report all higher order pages eventually, yes.
But it's absolutely fine to report a chunk and then wait
for host to process the chunk before reporting more.

However, interfaces we came up with for this would call
into virtio with a bunch of locks taken.
The solution was to take pages off the free list completely.
That in turn means we can't return them until
we have processed all free memory.


> Honestly, the difference lies
> more in the hypervisor interface than in what is needed for the kernel
> interface; however, the design of the hypervisor interface would make
> doing things more incrementally much more difficult.

OK that's interesting. The hypervisor interface is not
documented in the spec yet. Let me take a stab at a writeup now. So:



- hypervisor requests reporting by modifying command ID
field in config space, and interrupting guest

- in response, guest sends the command ID value on a special
free page hinting VQ,
followed by any number of buffers. Each buffer is assumed
to be the address and length of memory that was
unused *at some point after the time when command ID was sent*.

Note that hypervisor takes pains to handle the case
where memory is actually no longer free by the time
it gets the memory.
This allows the guest driver to take more liberties
and free pages without waiting for the host to
use the buffers.

This is also one of the reasons we call this a free page hint -
the guarantee that a page is free is a weak one;
in that sense it's more of a hint than a promise.
That helps guarantee we don't create an OOM out of the blue.

- guest eventually sends a special buffer signalling to
host that it's done sending free pages.
It then stops reporting until command id changes.

- host can restart the process at any time by
updating command ID. That will make guest stop
and start from the beginning.

- host can also stop the process by specifying a special
command ID value.
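
Putting those steps together, the guest side would look roughly like
this (pseudocode only; the names are illustrative, not the actual
driver):

```
on config-change interrupt:
    cmd_id = read_config(free_page_report_cmd_id)
    if cmd_id == STOP:                 /* special "stop" value */
        stop reporting
    else:
        add_buf(cmd_id)                /* announce which request follows */
        while free pages remain and cmd_id unchanged:
            add_buf(addr, len)         /* unused at some point after cmd_id */
        add_buf(done_marker)           /* special buffer: done reporting */
```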


=========


Now let's compare to what you have here:

- At any time after boot, guest walks over free memory and sends
addresses as buffers to the host

- Memory reported is then guaranteed to be unused
until host has used the buffers


Is above a fair summary?

So yes there's a difference, but the specific bit of chunking is the
same imho.

> With that said I will take a look into at least using the scatter
> gather interface directly rather than sending the list. I think I can
> probably do that much. However it will actually reduce code reuse as I
> have to check and verify the pages have been processed before I can
> free them back to the host.
>
> - Alex

2019-07-16 21:08:46

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:

<snip>

> > > This is what I am saying. Having watched that patchset being developed,
> > > I think that's simply because processing blocks required mm core
> > > changes, which Wei was not up to pushing through.
> > >
> > >
> > > If we did
> > >
> > > while (1) {
> > > alloc_pages
> > > add_buf
> > > get_buf
> > > free_pages
> > > }
> > >
> > > We'd end up passing the same page to balloon again and again.
> > >
> > > So we end up reserving lots of memory with alloc_pages instead.
> > >
> > > What I am saying is that now that you are developing
> > > infrastructure to iterate over free pages,
> > > FREE_PAGE_HINT should be able to use it too.
> > > Whether that's possible might be a good indication of
> > > whether the new mm APIs make sense.
> >
> > The problem is the infrastructure as implemented isn't designed to do
> > that. I am pretty certain this interface will have issues with being
> > given small blocks to process at a time.
> >
> > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > have the concept of doing things a bit at a time. It is either
> > filling, stopped, or done. From what I can tell it requires a
> > configuration change for the virtio balloon interface to toggle
> > between those states.
>
> Maybe I misunderstand what you are saying.
>
> Filling state can definitely report things
> a bit at a time. It does not assume that
> all of guest free memory can fit in a VQ.

I think where you and I may differ is that you are okay with just
pulling pages until you hit OOM, or allocation failures. Do I have
that right? In my mind I am wanting to perform the hinting on a small
block at a time and work through things iteratively.

The problem is the FREE_PAGE_HINT code doesn't have the option of
returning pages until all pages have been pulled. It is run to
completion and will keep filling the balloon until an allocation fails
and the host says it is done. I would prefer to avoid that and instead
simply notify the host of a fixed block of pages at a time and let it
process them, without having to have a thread on each side actively
pushing pages or listening for the incoming pages.
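
Roughly, the contrast I am describing (pseudocode, not actual code from
either series):

```
/* FREE_PAGE_HINT: run to completion */
while ((page = alloc_pages(...)))     /* pull until allocation fails */
        add_buf(page);                /* nothing returned until host says DONE */

/* aeration: fixed block at a time */
while (zone has untreated pages) {
        batch = pull_free_pages(32);  /* fixed block size */
        report(batch);                /* one request */
        wait_for_ack();               /* one response */
        return_to_free_lists(batch);  /* pages usable again immediately */
}
```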

> > > > The basic idea with the bubble hinting was to essentially create mini
> > > > balloons. As such I had based the code off of the balloon inflation
> > > > code. The only spot where it really differs is that I needed the
> > > > ability to pass higher order pages so I tweaked things and passed
> > > > "hints" instead of "pfns".
> > >
> > > And that is fine. But there isn't really such a big difference with
> > > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > > in response to guest load.
> >
> > I disagree, I believe there is a significant difference.
>
> Yes there is, I just don't think it's in the iteration.
> The iteration seems to be useful to hinting.

I agree that iteration is useful to hinting. The problem is the
FREE_PAGE_HINT code isn't really designed to be iterative. It is
designed to run with a polling thread on each side and it is meant to
be run to completion.

> > The
> > FREE_PAGE_HINT code was implemented to be more of a streaming
> > interface.
>
> It's implemented like this but it does not follow from
> the interface. The implementation is a combination of
> attempts to minimize # of exits and minimize mm core changes.

The problem is the interface doesn't have a good way of indicating
that it is done with a block of pages.

So what I am probably looking at if I do a sg implementation for my
hinting is to provide one large sg block for all 32 of the pages I
might be holding. I'm assuming that will still be processed as one
contiguous block. With that I can then at least maintain a single
response per request.
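
As a sketch, and assuming the batch list and hints array from the patch
above (untested, illustrative only):

```
/* One sg entry per page, queued as a single outbuf so the host can
 * still complete the whole batch with a single used-ring entry. */
struct scatterlist sg[VIRTIO_BALLOON_ARRAY_HINTS_MAX];
unsigned int n = 0;

sg_init_table(sg, vb->num_hints);
list_for_each_entry(page, &a_dev_info->batch, lru)
	sg_set_page(&sg[n++], page, PAGE_SIZE << page_private(page), 0);

virtqueue_add_outbuf(vq, sg, n, vb, GFP_KERNEL);
virtqueue_kick(vq);
```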

> > This is one of the things Linus kept complaining about in
> > his comments. This code attempts to pull in ALL of the higher order
> > pages, not just a smaller block of them.
>
> It wants to report all higher order pages eventually, yes.
> But it's absolutely fine to report a chunk and then wait
> for host to process the chunk before reporting more.
>
> However, interfaces we came up with for this would call
> into virtio with a bunch of locks taken.
> The solution was to take pages off the free list completely.
> That in turn means we can't return them until
> we have processed all free memory.

I get that. The problem is the interface is designed around run to
completion. For example it will sit there in a busy loop waiting for a
free buffer because it knows the other side is supposed to be
processing the pages already.

> > Honestly, the difference lies
> > more in the hypervisor interface than in what is needed for the kernel
> > interface; however, the design of the hypervisor interface would make
> > doing things more incrementally much more difficult.
>
> OK that's interesting. The hypervisor interface is not
> documented in the spec yet. Let me take a stab at a writeup now. So:
>
>
>
> - hypervisor requests reporting by modifying command ID
> field in config space, and interrupting guest
>
> - in response, guest sends the command ID value on a special
> free page hinting VQ,
> followed by any number of buffers. Each buffer is assumed
> to be the address and length of memory that was
> unused *at some point after the time when command ID was sent*.
>
> Note that hypervisor takes pains to handle the case
> where memory is actually no longer free by the time
> it gets the memory.
> This allows the guest driver to take more liberties
> and free pages without waiting for the host to
> use the buffers.
>
> This is also one of the reasons we call this a free page hint -
> the guarantee that a page is free is a weak one;
> in that sense it's more of a hint than a promise.
> That helps guarantee we don't create an OOM out of the blue.
>
> - guest eventually sends a special buffer signalling to
> host that it's done sending free pages.
> It then stops reporting until command id changes.

The pages are not freed back to the guest until the host reports that
it is "DONE" via a configuration change. Doing that stops any further
progress, and attempting to resume will just restart from the
beginning.

The big piece this design is missing is the incremental notification
pages have been processed. The existing code just fills the vq with
pages and keeps doing it until it cannot allocate any more pages. We
would have to add logic to stop, flush, and resume to the existing
framework.

> - host can restart the process at any time by
> updating command ID. That will make guest stop
> and start from the beginning.
>
> - host can also stop the process by specifying a special
> command ID value.
>
>
> =========
>
>
> Now let's compare to what you have here:
>
> - At any time after boot, guest walks over free memory and sends
> addresses as buffers to the host
>
> - Memory reported is then guaranteed to be unused
> until host has used the buffers
>
>
> Is above a fair summary?
>
> So yes there's a difference, but the specific bit of chunking is the
> same imho.

The big difference is that I am returning the pages after they are
processed, while FREE_PAGE_HINT doesn't and isn't designed to. The
problem is the interface doesn't allow for a good way to identify that
any given block of pages has been processed and can be returned.
Instead pages go in, but they don't come out until the configuration
is changed and "DONE" is reported. The act of reporting "DONE" will
reset things and start them all over, which kind of defeats the point.

2019-07-17 10:29:56

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
> On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
>
> <snip>
>
> > > > This is what I am saying. Having watched that patchset being developed,
> > > > I think that's simply because processing blocks required mm core
> > > > changes, which Wei was not up to pushing through.
> > > >
> > > >
> > > > If we did
> > > >
> > > > while (1) {
> > > > alloc_pages
> > > > add_buf
> > > > get_buf
> > > > free_pages
> > > > }
> > > >
> > > > We'd end up passing the same page to balloon again and again.
> > > >
> > > > So we end up reserving lots of memory with alloc_pages instead.
> > > >
> > > > What I am saying is that now that you are developing
> > > > infrastructure to iterate over free pages,
> > > > FREE_PAGE_HINT should be able to use it too.
> > > > Whether that's possible might be a good indication of
> > > > whether the new mm APIs make sense.
> > >
> > > The problem is the infrastructure as implemented isn't designed to do
> > > that. I am pretty certain this interface will have issues with being
> > > given small blocks to process at a time.
> > >
> > > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > > have the concept of doing things a bit at a time. It is either
> > > filling, stopped, or done. From what I can tell it requires a
> > > configuration change for the virtio balloon interface to toggle
> > > between those states.
> >
> > Maybe I misunderstand what you are saying.
> >
> > Filling state can definitely report things
> > a bit at a time. It does not assume that
> > all of guest free memory can fit in a VQ.
>
> I think where you and I may differ is that you are okay with just
> pulling pages until you hit OOM, or allocation failures. Do I have
> that right?

This is exactly what the current code does. But that's an implementation
detail which came about because we failed to find any other way to
iterate over free blocks.

> In my mind I am wanting to perform the hinting on a small
> block at a time and work through things iteratively.
>
> The problem is the FREE_PAGE_HINT doesn't have the option of returning
> pages until all pages have been pulled. It is run to completion and
> will keep filling the balloon until an allocation fails and the host
> says it is done.

OK so there are two points. One is that FREE_PAGE_HINT does not
need to allocate a page at all. It really just wants to
iterate over free pages.
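The contrast being drawn here, iterating over free pages versus pulling them out with alloc_pages, can be sketched with a toy model. Everything below is invented for illustration; it is not kernel or virtio-balloon code:

```c
#include <assert.h>

/* Toy model: reporting free pages by iteration, without removing
 * them from the free list the way alloc_pages would. */
#define MAX_FREE 16

struct free_list {
    int pfn[MAX_FREE];  /* page frame numbers currently free */
    int count;
};

/* Walk the free list and report each pfn through a callback.
 * The list is left untouched, so the pages remain allocatable;
 * returns the number of hints sent. */
static int report_by_iteration(const struct free_list *fl,
                               void (*report)(int pfn, void *ctx),
                               void *ctx)
{
    for (int i = 0; i < fl->count; i++)
        report(fl->pfn[i], ctx);
    return fl->count;
}
```

The key property is that the iterator never removes entries, which is the behavior being argued for here, as opposed to the alloc_pages loop quoted earlier that reserves the memory while hinting.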


The reason FREE_PAGE_HINT does not free up pages until we have finished
iterating over the free list is not a hypervisor API. The reason is we
don't want to keep getting the same address over and over again.

> I would prefer to avoid that as I prefer to simply
> notify the host of a fixed block of pages at a time and let it process
> without having to have a thread on each side actively pushing pages,
> or listening for the incoming pages.

Right. And FREE_PAGE_HINT can go even further. It can push a page and
let linux use it immediately. It does not even need to wait for host to
process anything unless the VQ gets full.

>
> > > > > The basic idea with the bubble hinting was to essentially create mini
> > > > > balloons. As such I had based the code off of the balloon inflation
> > > > > code. The only spot where it really differs is that I needed the
> > > > > ability to pass higher order pages so I tweaked things and passed
> > > > > "hints" instead of "pfns".
> > > >
> > > > And that is fine. But there isn't really such a big difference with
> > > > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > > > in response to guest load.
> > >
> > > I disagree, I believe there is a significant difference.
> >
> > Yes there is, I just don't think it's in the iteration.
> > The iteration seems to be useful to hinting.
>
> I agree that iteration is useful to hinting. The problem is the
> FREE_PAGE_HINT code isn't really designed to be iterative. It is
> designed to run with a polling thread on each side and it is meant to
> be run to completion.

Absolutely. But that's a bug I think.

> > > The
> > > FREE_PAGE_HINT code was implemented to be more of a streaming
> > > interface.
> >
> > It's implemented like this but it does not follow from
> > the interface. The implementation is a combination of
> > attempts to minimize # of exits and minimize mm core changes.
>
> The problem is the interface doesn't have a good way of indicating
> that it is done with a block of pages.
>
> So what I am probably looking at if I do a sg implementation for my
> hinting is to provide one large sg block for all 32 of the pages I
> might be holding.

Right now if you pass an sg it will try to allocate a buffer
on demand for you. If this is a problem I could come up
with a new API that lets caller allocate the buffer.
Let me know.

> I'm assuming that will still be processed as one
> contiguous block. With that I can then at least maintain a single
> response per request.

Why do you care? Won't a counter of outstanding pages be enough?
Down the road maybe we could actually try to pipeline
things a bit: send 32 pages, and once you get 16 of these back,
send 16 more. Better for SMP configs and does not hurt
non-SMP too much. I am not saying we need to do it right away though.
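The pipelining idea above, a window of outstanding pages that gets topped up once half of them come back, could be sketched roughly as follows. The names and the window size of 32 are illustrative, not taken from the virtio-balloon code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical window counter for pipelined hinting: cap the number
 * of pages outstanding at the host, and refill once half the window
 * has been completed. */
#define HINT_WINDOW 32

struct hint_window {
    int outstanding;  /* pages sent to host, not yet returned */
};

/* How many pages may be sent right now without exceeding the window. */
static int hint_can_send(const struct hint_window *w)
{
    return HINT_WINDOW - w->outstanding;
}

static void hint_sent(struct hint_window *w, int n)
{
    w->outstanding += n;
}

static void hint_completed(struct hint_window *w, int n)
{
    w->outstanding -= n;
}

/* Refill policy: send more once at least half the window is free. */
static bool hint_should_refill(const struct hint_window *w)
{
    return hint_can_send(w) >= HINT_WINDOW / 2;
}
```

A simple counter like this is all the bookkeeping the "send 32, refill at 16" scheme needs; no per-page tracking is required on this side.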

> > > This is one of the things Linus kept complaining about in
> > > his comments. This code attempts to pull in ALL of the higher order
> > > pages, not just a smaller block of them.
> >
> > It wants to report all higher order pages eventually, yes.
> > But it's absolutely fine to report a chunk and then wait
> > for host to process the chunk before reporting more.
> >
> > However, interfaces we came up with for this would call
> > into virtio with a bunch of locks taken.
> > The solution was to take pages off the free list completely.
> > That in turn means we can't return them until
> > we have processed all free memory.
>
> I get that. The problem is the interface is designed around run to
> completion. For example it will sit there in a busy loop waiting for a
> free buffer because it knows the other side is supposed to be
> processing the pages already.

I didn't get this part.

> > > Honestly the difference is
> > > mostly in the hypervisor interface than what is needed for the kernel
> > > interface, however the design of the hypervisor interface would make
> > > doing things more incrementally much more difficult.
> >
> > OK that's interesting. The hypervisor interface is not
> > documented in the spec yet. Let me take a stab at a writeup now. So:
> >
> >
> >
> > - hypervisor requests reporting by modifying command ID
> > field in config space, and interrupting guest
> >
> > - in response, guest sends the command ID value on a special
> > free page hinting VQ,
> > followed by any number of buffers. Each buffer is assumed
> > to be the address and length of memory that was
> > unused *at some point after the time when command ID was sent*.
> >
> > Note that hypervisor takes pains to handle the case
> > where memory is actually no longer free by the time
> > it gets the memory.
> > This allows guest driver to take more liberties
> > and free pages without waiting for guest to
> > use the buffers.
> >
> > This is also one of the reasons we call this a free page hint -
> > the guarantee that the page is free is a weak one,
> > in that sense it's more of a hint than a promise.
> > That helps guarantee we don't create an OOM out of the blue.

I would like to stress the last paragraph above.


> >
> > - guest eventually sends a special buffer signalling to
> > host that it's done sending free pages.
> > It then stops reporting until command id changes.
>
> The pages are not freed back to the guest until the host reports that
> it is "DONE" via a configuration change. Doing that stops any further
> progress, and attempting to resume will just restart from the
> beginning.

Right but it's not a requirement. Host does not assume this at all.
It's done like this simply because we can't iterate over pages
with the existing API.

> The big piece this design is missing is the incremental notification
> pages have been processed. The existing code just fills the vq with
> pages and keeps doing it until it cannot allocate any more pages. We
> would have to add logic to stop, flush, and resume to the existing
> framework.

But not to the hypervisor interface. Hypervisor is fine
with pages being reused immediately. In fact, even before they
are processed.

> > - host can restart the process at any time by
> > updating command ID. That will make guest stop
> > and start from the beginning.
> >
> > - host can also stop the process by specifying a special
> > command ID value.
> >
> >
> > =========
> >
> >
> > Now let's compare to what you have here:
> >
> > - At any time after boot, guest walks over free memory and sends
> > addresses as buffers to the host
> >
> > - Memory reported is then guaranteed to be unused
> > until host has used the buffers
> >
> >
> > Is above a fair summary?
> >
> > So yes there's a difference but the specific bit of chunking is same
> > imho.
>
> The big difference is that I am returning the pages after they are
> processed, while FREE_PAGE_HINT doesn't and isn't designed to.

It doesn't but the hypervisor *is* designed to support that.

> The
> problem is the interface doesn't allow for a good way to identify that
> any given block of pages has been processed and can be returned.

And that's because FREE_PAGE_HINT does not care.
It can return any page at any point even before hypervisor
saw it.

> Instead pages go in, but they don't come out until the configuration
> is changed and "DONE" is reported. The act of reporting "DONE" will
> reset things and start them all over which kind of defeats the point.

Right.

But if you consider how we are using the shrinker you will
see that it's kind of broken.
For example not keeping track of allocated
pages means the count we return is broken
while reporting is active.

I looked at fixing it but really if we can just
stop allocating memory that would be way cleaner.


For example we allocate pages until the shrinker kicks in.
Fair enough, but in fact maybe it would be better to
do the reverse: trigger the shrinker and then send as many
free pages as we can to the host.

--
MST

2019-07-17 16:44:49

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
> > On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
> >
> > <snip>
> >
> > > > > This is what I am saying. Having watched that patchset being developed,
> > > > > I think that's simply because processing blocks required mm core
> > > > > changes, which Wei was not up to pushing through.
> > > > >
> > > > >
> > > > > If we did
> > > > >
> > > > > while (1) {
> > > > > alloc_pages
> > > > > add_buf
> > > > > get_buf
> > > > > free_pages
> > > > > }
> > > > >
> > > > > We'd end up passing the same page to balloon again and again.
> > > > >
> > > > > So we end up reserving lots of memory with alloc_pages instead.
> > > > >
> > > > > What I am saying is that now that you are developing
> > > > > infrastructure to iterate over free pages,
> > > > > FREE_PAGE_HINT should be able to use it too.
> > > > > Whether that's possible might be a good indication of
> > > > > whether the new mm APIs make sense.
> > > >
> > > > The problem is the infrastructure as implemented isn't designed to do
> > > > that. I am pretty certain this interface will have issues with being
> > > > given small blocks to process at a time.
> > > >
> > > > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > > > have the concept of doing things a bit at a time. It is either
> > > > filling, stopped, or done. From what I can tell it requires a
> > > > configuration change for the virtio balloon interface to toggle
> > > > between those states.
> > >
> > > Maybe I misunderstand what you are saying.
> > >
> > > Filling state can definitely report things
> > > a bit at a time. It does not assume that
> > > all of guest free memory can fit in a VQ.
> >
> > I think where you and I may differ is that you are okay with just
> > pulling pages until you hit OOM, or allocation failures. Do I have
> > that right?
>
> This is exactly what the current code does. But that's an implementation
> detail which came about because we failed to find any other way to
> iterate over free blocks.

I get that. However my concern is that it has permeated other areas of the
implementation, which makes taking another approach much more difficult
than it needs to be.

> > In my mind I am wanting to perform the hinting on a small
> > block at a time and work through things iteratively.
> >
> > The problem is the FREE_PAGE_HINT doesn't have the option of returning
> > pages until all pages have been pulled. It is run to completion and
> > will keep filling the balloon until an allocation fails and the host
> > says it is done.
>
> OK so there are two points. One is that FREE_PAGE_HINT does not
> need to allocate a page at all. It really just wants to
> iterate over free pages.

I agree that it should just want to iterate over pages. However the
issue I am trying to point out is that it doesn't have any guarantees
on ordering and that is my concern. What I want to avoid is
potentially corrupting memory.

So for example with my current hinting approach I am using the list of
hints because I get back one completion indicating all of the hints
have been processed. It is only at that point that I can go back and
make the memory available for allocation again.

So one big issue right now with the FREE_PAGE_HINT approach is that it
is designed to be all or nothing. Using the balloon makes it
impossible for us to be incremental as all the pages are contained in
one spot. What we would need is some way to associate a page with a
given vq buffer. Ultimately in order to really make the FREE_PAGE_HINT
logic work with something like my page hinting logic it would need to
work more like a network Rx ring in that we would associate a page per
buffer and have some way of knowing the two are associated.
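The Rx-ring style association described above can be modeled as a small ring where each slot remembers the page backing it, so a completion identifies exactly which page may be returned. This is a toy sketch with hypothetical names, not the proposed kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Toy Rx-ring association: each ring slot records which page backs
 * it, so completing slot i releases exactly page[i]. */
#define RING_SIZE 32

struct page_ring {
    void *page[RING_SIZE];  /* page associated with each buffer slot */
    unsigned head, tail;    /* head: next slot to post, tail: next to complete */
};

/* Post a page as a hint buffer; returns the slot used, or -1 if full. */
static int ring_post(struct page_ring *r, void *page)
{
    if (r->head - r->tail == RING_SIZE)
        return -1;
    unsigned slot = r->head++ % RING_SIZE;
    r->page[slot] = page;
    return (int)slot;
}

/* Host completed the oldest buffer: return the page now safe to free
 * back to the allocator, or NULL if nothing is outstanding. */
static void *ring_complete(struct page_ring *r)
{
    if (r->head == r->tail)
        return NULL;
    return r->page[r->tail++ % RING_SIZE];
}
```

The point of the 1:1 association is that pages can be returned incrementally as completions arrive, instead of all at once when the whole run finishes.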

> The reason FREE_PAGE_HINT does not free up pages until we have finished
> iterating over the free list is not a hypervisor API. The reason is we
> don't want to keep getting the same address over and over again.
>
> > I would prefer to avoid that as I prefer to simply
> > notify the host of a fixed block of pages at a time and let it process
> > without having to have a thread on each side actively pushing pages,
> > or listening for the incoming pages.
>
> Right. And FREE_PAGE_HINT can go even further. It can push a page and
> let linux use it immediately. It does not even need to wait for host to
> process anything unless the VQ gets full.

If it is doing what you are saying it will be corrupting memory. At a
minimum it has to wait until the page has been processed and the dirty
bit cleared before it can let linux use it again. It is all a matter
of keeping the dirty bit coherent. If we let linux use it again
immediately and then cleared the dirty bit we would open up a possible
data corruption race during migration as a dirty page might not be
marked as such.

> >
> > > > > > The basic idea with the bubble hinting was to essentially create mini
> > > > > > balloons. As such I had based the code off of the balloon inflation
> > > > > > code. The only spot where it really differs is that I needed the
> > > > > > ability to pass higher order pages so I tweaked things and passed
> > > > > > "hints" instead of "pfns".
> > > > >
> > > > > And that is fine. But there isn't really such a big difference with
> > > > > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > > > > in response to guest load.
> > > >
> > > > I disagree, I believe there is a significant difference.
> > >
> > > Yes there is, I just don't think it's in the iteration.
> > > The iteration seems to be useful to hinting.
> >
> > I agree that iteration is useful to hinting. The problem is the
> > FREE_PAGE_HINT code isn't really designed to be iterative. It is
> > designed to run with a polling thread on each side and it is meant to
> > be run to completion.
>
> Absolutely. But that's a bug I think.

I think it is a part of the design. Basically in order to avoid
corrupting memory it cannot return the page to the guest kernel until
it has finished clearing the dirty bits associated with the pages.

> > > > The
> > > > FREE_PAGE_HINT code was implemented to be more of a streaming
> > > > interface.
> > >
> > > It's implemented like this but it does not follow from
> > > the interface. The implementation is a combination of
> > > attempts to minimize # of exits and minimize mm core changes.
> >
> > The problem is the interface doesn't have a good way of indicating
> > that it is done with a block of pages.
> >
> > So what I am probably looking at if I do a sg implementation for my
> > hinting is to provide one large sg block for all 32 of the pages I
> > might be holding.
>
> Right now if you pass an sg it will try to allocate a buffer
> on demand for you. If this is a problem I could come up
> with a new API that lets caller allocate the buffer.
> Let me know.
>
> > I'm assuming that will still be processed as one
> > contiguous block. With that I can then at least maintain a single
> > response per request.
>
> Why do you care? Won't a counter of outstanding pages be enough?
> Down the road maybe we could actually try to pipeline
> things a bit. So send 32 pages once you get 16 of these back
> send 16 more. Better for SMP configs and does not hurt
> non-SMP too much. I am not saying we need to do it right away though.

So the big thing is we cannot give the page back to the guest kernel
until we know the processing has been completed. In the case of the
MADV_DONTNEED call it will zero out the entire page on the next
access. If the guest kernel had already written data by the time we
get to that, it would cause data corruption and kill the whole guest.

> > > > This is one of the things Linus kept complaining about in
> > > > his comments. This code attempts to pull in ALL of the higher order
> > > > pages, not just a smaller block of them.
> > >
> > > It wants to report all higher order pages eventually, yes.
> > > But it's absolutely fine to report a chunk and then wait
> > > for host to process the chunk before reporting more.
> > >
> > > However, interfaces we came up with for this would call
> > > into virtio with a bunch of locks taken.
> > > The solution was to take pages off the free list completely.
> > > That in turn means we can't return them until
> > > we have processed all free memory.
> >
> > I get that. The problem is the interface is designed around run to
> > completion. For example it will sit there in a busy loop waiting for a
> > free buffer because it knows the other side is supposed to be
> > processing the pages already.
>
> I didn't get this part.

I think the part you may not be getting is that we cannot let the
guest use the page until the hint has been processed. Otherwise we
risk corrupting memory. That is the piece that has me paranoid. If we
end up performing a hint on a page that is in use somewhere in the kernel
it will corrupt memory one way or another. That is the thing I have to
avoid at all cost.

That is why I have to have a way to know exactly which pages have been
processed and which haven't before I return pages to the guest.
Otherwise I am just corrupting memory.

> > > > Honestly the difference is
> > > > mostly in the hypervisor interface than what is needed for the kernel
> > > > interface, however the design of the hypervisor interface would make
> > > > doing things more incrementally much more difficult.
> > >
> > > OK that's interesting. The hypervisor interface is not
> > > documented in the spec yet. Let me take a stab at a writeup now. So:
> > >
> > >
> > >
> > > - hypervisor requests reporting by modifying command ID
> > > field in config space, and interrupting guest
> > >
> > > - in response, guest sends the command ID value on a special
> > > free page hinting VQ,
> > > followed by any number of buffers. Each buffer is assumed
> > > to be the address and length of memory that was
> > > unused *at some point after the time when command ID was sent*.
> > >
> > > Note that hypervisor takes pains to handle the case
> > > where memory is actually no longer free by the time
> > > it gets the memory.
> > > This allows guest driver to take more liberties
> > > and free pages without waiting for guest to
> > > use the buffers.
> > >
> > > This is also one of the reasons we call this a free page hint -
> > > the guarantee that the page is free is a weak one,
> > > in that sense it's more of a hint than a promise.
> > > That helps guarantee we don't create an OOM out of the blue.
>
> I would like to stress the last paragraph above.

The problem is we don't want to give bad hints. What we do based on
the hint is clear the dirty bit. If we clear it in error when the page
is actually in use, it will lead to data corruption after migration.

The idea with the hint is that you are saying the page is currently
not in use, however if you send that hint late and have already freed
the page back you can corrupt memory.

> > >
> > > - guest eventually sends a special buffer signalling to
> > > host that it's done sending free pages.
> > > It then stops reporting until command id changes.
> >
> > The pages are not freed back to the guest until the host reports that
> > it is "DONE" via a configuration change. Doing that stops any further
> > progress, and attempting to resume will just restart from the
> > beginning.
>
> Right but it's not a requirement. Host does not assume this at all.
> It's done like this simply because we can't iterate over pages
> with the existing API.

The problem is nothing about the implementation was designed for
iteration. What I would have to do is likely gut and rewrite the
entire guest side of the FREE_PAGE_HINT code in order to make it work
iteratively. As I mentioned it would probably have to look more like a
NIC Rx ring in handling because we would have to have some sort of way
to associate the pages 1:1 to the buffers.

> > The big piece this design is missing is the incremental notification
> > pages have been processed. The existing code just fills the vq with
> > pages and keeps doing it until it cannot allocate any more pages. We
> > would have to add logic to stop, flush, and resume to the existing
> > framework.
>
> But not to the hypervisor interface. Hypervisor is fine
> with pages being reused immediately. In fact, even before they
> are processed.

I don't think that is actually the case. If it does that I am pretty
sure it will corrupt memory during migration.

Take a look at qemu_guest_free_page_hint:
https://github.com/qemu/qemu/blob/master/migration/ram.c#L3342

I'm pretty sure that code is going in and clearing the dirty bitmap
for memory. If we were to allow a page to be allocated and used and
then perform the hint it is going to introduce a race where the page
might be missed for migration and could result in memory corruption.
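The race described here reduces to an ordering question on the migration dirty bit. A toy timeline, illustrative only and not QEMU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the migration dirty-bit race: if the guest reuses a
 * page before the hint is processed, the stale hint can wipe the
 * dirty bit and the new write is skipped by migration. */
struct mig_page {
    bool dirty;  /* migration dirty bit for this page */
};

static void guest_write(struct mig_page *p) { p->dirty = true; }
static void hint_clear(struct mig_page *p)  { p->dirty = false; }

/* Safe order: the hint is processed while the page is still held
 * back from the guest, so a later write is tracked. */
static bool safe_order(struct mig_page *p)
{
    hint_clear(p);   /* hint processed first */
    guest_write(p);  /* guest reuses the page afterwards */
    return p->dirty; /* true: the new data will be migrated */
}

/* Racy order: the guest already wrote new data when the stale hint
 * arrives and clears the dirty bit. */
static bool racy_order(struct mig_page *p)
{
    guest_write(p);  /* guest reused the page early */
    hint_clear(p);   /* stale hint wipes the dirty bit */
    return p->dirty; /* false: the write is lost by migration */
}
```

This is why the pages have to stay quarantined until the completion comes back: only the first ordering is safe.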

> > > - host can restart the process at any time by
> > > updating command ID. That will make guest stop
> > > and start from the beginning.
> > >
> > > - host can also stop the process by specifying a special
> > > command ID value.
> > >
> > >
> > > =========
> > >
> > >
> > > Now let's compare to what you have here:
> > >
> > > - At any time after boot, guest walks over free memory and sends
> > > addresses as buffers to the host
> > >
> > > - Memory reported is then guaranteed to be unused
> > > until host has used the buffers
> > >
> > >
> > > Is above a fair summary?
> > >
> > > So yes there's a difference but the specific bit of chunking is same
> > > imho.
> >
> > The big difference is that I am returning the pages after they are
> > processed, while FREE_PAGE_HINT doesn't and isn't designed to.
>
> It doesn't but the hypervisor *is* designed to support that.

Not really, it seems like it is more just a side effect of things.
Also as I mentioned before I am also not a huge fan of polling on both
sides as it is just going to burn through CPU. If we are iterative and
polling it is going to end up with us potentially pushing one CPU at
100%, and if the one CPU doing the polling cannot keep up with the
page updates coming from the other CPUs we would be stuck in that
state for a while. I would have preferred to see something where the
CPU would at least allow other tasks to occur while it is waiting for
buffers to be returned by the host.

> > The
> > problem is the interface doesn't allow for a good way to identify that
> > any given block of pages has been processed and can be returned.
>
> And that's because FREE_PAGE_HINT does not care.
> It can return any page at any point even before hypervisor
> saw it.

I disagree, see my comment above.

> > Instead pages go in, but they don't come out until the configuration
> > is changed and "DONE" is reported. The act of reporting "DONE" will
> > reset things and start them all over which kind of defeats the point.
>
> Right.
>
> But if you consider how we are using the shrinker you will
> see that it's kind of broken.
> For example not keeping track of allocated
> pages means the count we return is broken
> while reporting is active.
>
> I looked at fixing it but really if we can just
> stop allocating memory that would be way cleaner.

Agreed. If we hit an OOM we should probably just stop the free page
hinting and treat that as the equivalent to an allocation failure.

As-is I think this also has the potential for corrupting memory since
it will likely be returning the most recent pages added to the balloon
so the pages are likely still on the processing queue.

> For example we allocate pages until the shrinker kicks in.
> Fair enough, but in fact maybe it would be better to
> do the reverse: trigger the shrinker and then send as many
> free pages as we can to the host.

I'm not sure I understand this last part.

2019-07-18 05:16:09

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Wed, Jul 17, 2019 at 09:43:52AM -0700, Alexander Duyck wrote:
> On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <[email protected]> wrote:
> >
> > On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
> > > On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > <snip>
> > >
> > > > > > This is what I am saying. Having watched that patchset being developed,
> > > > > > I think that's simply because processing blocks required mm core
> > > > > > changes, which Wei was not up to pushing through.
> > > > > >
> > > > > >
> > > > > > If we did
> > > > > >
> > > > > > while (1) {
> > > > > > alloc_pages
> > > > > > add_buf
> > > > > > get_buf
> > > > > > free_pages
> > > > > > }
> > > > > >
> > > > > > We'd end up passing the same page to balloon again and again.
> > > > > >
> > > > > > So we end up reserving lots of memory with alloc_pages instead.
> > > > > >
> > > > > > What I am saying is that now that you are developing
> > > > > > infrastructure to iterate over free pages,
> > > > > > FREE_PAGE_HINT should be able to use it too.
> > > > > > Whether that's possible might be a good indication of
> > > > > > whether the new mm APIs make sense.
> > > > >
> > > > > The problem is the infrastructure as implemented isn't designed to do
> > > > > that. I am pretty certain this interface will have issues with being
> > > > > given small blocks to process at a time.
> > > > >
> > > > > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > > > > have the concept of doing things a bit at a time. It is either
> > > > > filling, stopped, or done. From what I can tell it requires a
> > > > > configuration change for the virtio balloon interface to toggle
> > > > > between those states.
> > > >
> > > > Maybe I misunderstand what you are saying.
> > > >
> > > > Filling state can definitely report things
> > > > a bit at a time. It does not assume that
> > > > all of guest free memory can fit in a VQ.
> > >
> > > I think where you and I may differ is that you are okay with just
> > > pulling pages until you hit OOM, or allocation failures. Do I have
> > > that right?
> >
> > This is exactly what the current code does. But that's an implementation
> > detail which came about because we failed to find any other way to
> > iterate over free blocks.
>
> I get that. However my concern is that it has permeated other areas of the
> implementation, which makes taking another approach much more difficult
> than it needs to be.

Implementation would have to change to use an iterator, obviously. But I don't see
that it has leaked out to the hypervisor interface.

In fact take a look at virtio_balloon_shrinker_scan
and you will see that it calls shrink_free_pages
without waiting for the device at all.

> > > In my mind I am wanting to perform the hinting on a small
> > > block at a time and work through things iteratively.
> > >
> > > The problem is the FREE_PAGE_HINT doesn't have the option of returning
> > > pages until all pages have been pulled. It is run to completion and
> > > will keep filling the balloon until an allocation fails and the host
> > > says it is done.
> >
> > OK so there are two points. One is that FREE_PAGE_HINT does not
> > need to allocate a page at all. It really just wants to
> > iterate over free pages.
>
> I agree that it should just want to iterate over pages. However the
> issue I am trying to point out is that it doesn't have any guarantees
> on ordering and that is my concern. What I want to avoid is
> potentially corrupting memory.

I get that. I am just trying to make sure you are aware that for
FREE_PAGE_HINT specifically, ordering does not matter because it does not
care when the hypervisor used the buffers. It only cares that the page was
free after it got the request. Used buffers are only tracked to avoid
overflowing the VQ. This is different from your hinting, where you make
it the responsibility of the guest not to allocate a page before it was
used.

>
> So for example with my current hinting approach I am using the list of
> hints because I get back one completion indicating all of the hints
> have been processed. It is only at that point that I can go back and
> make the memory available for allocation again.

Right. But just counting them would work just as well, no?
At least as long as you wait for everything to complete...
If you want to pipeline, see below


>
> So one big issue right now with the FREE_PAGE_HINT approach is that it
> is designed to be all or nothing. Using the balloon makes it
> impossible for us to be incremental as all the pages are contained in
> one spot. What we would need is some way to associate a page with a
> given vq buffer.

Sorry if I'm belaboring the obvious, but isn't this what 'void *data' in
virtqueue_add_inbuf is designed for? And if you only ever use
virtqueue_add_inbuf and virtqueue_add_outbuf on a given VQ, then you can
track two pointers using virtqueue_add_inbuf_ctx.

> Ultimately in order to really make the FREE_PAGE_HINT
> logic work with something like my page hinting logic it would need to
> work more like a network Rx ring in that we would associate a page per
> buffer and have some way of knowing the two are associated.

Right. That's exactly how virtio net does it btw.

> > The reason FREE_PAGE_HINT does not free up pages until we have finished
> > iterating over the free list is not a hypervisor API requirement. The reason
> > is we don't want to keep getting the same address over and over again.
> >
> > > I would prefer to avoid that as I prefer to simply
> > > notify the host of a fixed block of pages at a time and let it process
> > > without having to have a thread on each side actively pushing pages,
> > > or listening for the incoming pages.
> >
> > Right. And FREE_PAGE_HINT can go even further. It can push a page and
> > let linux use it immediately. It does not even need to wait for host to
> > process anything unless the VQ gets full.
>
> If it is doing what you are saying it will be corrupting memory.

No, and that is the hypervisor's responsibility.

I think you are missing part of the picture here.

Here is a valid implementation:

Before asking for hints, the hypervisor write-protects all memory and logs
all write faults. When the hypervisor gets the hint, if the page has since
been modified, the hint is ignored.
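A toy userspace model of that valid implementation, with plain arrays standing in for the write-protection fault log (all names invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 16

static bool written_since_request[NPAGES]; /* write-fault log */
static bool must_migrate[NPAGES];          /* migration dirty state */

/* Hypervisor starts a hint round: write-protect everything and
 * conservatively treat every page as needing migration. */
static void start_hint_round(void)
{
    for (int i = 0; i < NPAGES; i++) {
        written_since_request[i] = false;  /* re-arm write protection */
        must_migrate[i] = true;
    }
}

/* A guest write faults and gets logged by the hypervisor. */
static void guest_writes(int pfn)
{
    written_since_request[pfn] = true;
}

/* Hint arrives: honor it only if the page was not touched meanwhile. */
static void process_hint(int pfn)
{
    if (!written_since_request[pfn])
        must_migrate[pfn] = false;
}
```

A stale hint for a page that was rewritten in the meantime is simply a no-op, which is why the guest does not have to serialize against the host here.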





> At a
> minimum it has to wait until the page has been processed and the dirty
> bit cleared before it can let linux use it again. It is all a matter
> of keeping the dirty bit coherent. If we let linux use it again
> immediately and then cleared the dirty bit we would open up a possible
> data corruption race during migration as a dirty page might not be
> marked as such.

I think you are talking about the dirty bit on the host, right?

The implication is that calling MADV_FREE from qemu would
not be a good implementation of FREE_PAGE_HINT.
And indeed, as far as I can see it does nothing of the sort.



> > >
> > > > > > > The basic idea with the bubble hinting was to essentially create mini
> > > > > > > balloons. As such I had based the code off of the balloon inflation
> > > > > > > code. The only spot where it really differs is that I needed the
> > > > > > > ability to pass higher order pages so I tweaked things and passed
> > > > > > > "hints" instead of "pfns".
> > > > > >
> > > > > > And that is fine. But there isn't really such a big difference with
> > > > > > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > > > > > in response to guest load.
> > > > >
> > > > > I disagree, I believe there is a significant difference.
> > > >
> > > > Yes there is, I just don't think it's in the iteration.
> > > > The iteration seems to be useful to hinting.
> > >
> > > I agree that iteration is useful to hinting. The problem is the
> > > FREE_PAGE_HINT code isn't really designed to be iterative. It is
> > > designed to run with a polling thread on each side and it is meant to
> > > be run to completion.
> >
> > Absolutely. But that's a bug I think.
>
> I think it is a part of the design. Basically in order to avoid
> corrupting memory it cannot return the page to the guest kernel until
> it has finished clearing the dirty bits associated with the pages.

OK, I hope I clarified why that's not supposed to be the case.

> > > > > The
> > > > > FREE_PAGE_HINT code was implemented to be more of a streaming
> > > > > interface.
> > > >
> > > > It's implemented like this but it does not follow from
> > > > the interface. The implementation is a combination of
> > > > attempts to minimize # of exits and minimize mm core changes.
> > >
> > > The problem is the interface doesn't have a good way of indicating
> > > that it is done with a block of pages.
> > >
> > > So what I am probably looking at if I do a sg implementation for my
> > > hinting is to provide one large sg block for all 32 of the pages I
> > > might be holding.
> >
> > Right now if you pass an sg it will try to allocate a buffer
> > on demand for you. If this is a problem I could come up
> > with a new API that lets caller allocate the buffer.
> > Let me know.
> >
> > > I'm assuming that will still be processed as one
> > > contiguous block. With that I can then at least maintain a single
> > > response per request.
> >
> > Why do you care? Won't a counter of outstanding pages be enough?
> > Down the road maybe we could actually try to pipeline
> > things a bit. So send 32 pages once you get 16 of these back
> > send 16 more. Better for SMP configs and does not hurt
> > non-SMP too much. I am not saying we need to do it right away though.
>
> So the big thing is we cannot give the page back to the guest kernel
> until we know the processing has been completed. In the case of the
> > MADV_DONTNEED call it will zero out the entire page on the next
> access. If the guest kernel had already written data by the time we
> get to that it would cause a data corruption and kill the whole guest.


Exactly, but FREE_PAGE_HINT does not cause qemu to call MADV_DONTNEED.

> > > > > This is one of the things Linus kept complaining about in
> > > > > his comments. This code attempts to pull in ALL of the higher order
> > > > > pages, not just a smaller block of them.
> > > >
> > > > It wants to report all higher order pages eventually, yes.
> > > > But it's absolutely fine to report a chunk and then wait
> > > > for host to process the chunk before reporting more.
> > > >
> > > > However, interfaces we came up with for this would call
> > > > into virtio with a bunch of locks taken.
> > > > The solution was to take pages off the free list completely.
> > > > That in turn means we can't return them until
> > > > we have processed all free memory.
> > >
> > > I get that. The problem is the interface is designed around run to
> > > completion. For example it will sit there in a busy loop waiting for a
> > > free buffer because it knows the other side is supposed to be
> > > processing the pages already.
> >
> > I didn't get this part.
>
> I think the part you may not be getting is that we cannot let the
> guest use the page until the hint has been processed. Otherwise we
> risk corrupting memory. That is the piece that has me paranoid. If we
> end up performing a hint on a page that is in use somewhere in the kernel
> it will corrupt memory one way or another. That is the thing I have to
> avoid at all cost.

You have to do it, sure. And that is because you do not
assume that the hypervisor does it for you. But FREE_PAGE_HINT doesn't have
to, because the hypervisor takes care of that.

> That is why I have to have a way to know exactly which pages have been
> processed and which haven't before I return pages to the guest.
> Otherwise I am just corrupting memory.

Sure. That isn't really hard though.

>
> > > > > Honestly the difference is
> > > > > mostly in the hypervisor interface than what is needed for the kernel
> > > > > interface, however the design of the hypervisor interface would make
> > > > > doing things more incrementally much more difficult.
> > > >
> > > > OK that's interesting. The hypervisor interface is not
> > > > documented in the spec yet. Let me take a stab at a writeup now. So:
> > > >
> > > >
> > > >
> > > > - hypervisor requests reporting by modifying command ID
> > > > field in config space, and interrupting guest
> > > >
> > > > - in response, guest sends the command ID value on a special
> > > > free page hinting VQ,
> > > > followed by any number of buffers. Each buffer is assumed
> > > > to be the address and length of memory that was
> > > > unused *at some point after the time when command ID was sent*.
> > > >
> > > > Note that hypervisor takes pains to handle the case
> > > > where memory is actually no longer free by the time
> > > > it gets the memory.
> > > > This allows guest driver to take more liberties
> > > > and free pages without waiting for guest to
> > > > use the buffers.
> > > >
> > > > This is also one of the reason we call this a free page hint -
> > > > the guarantee that page is free is a weak one,
> > > > in that sense it's more of a hint than a promise.
> > > > That helps guarantee we don't create OOM out of the blue.
> >
> > I would like to stress the last paragraph above.
>
> The problem is we don't want to give bad hints. What we do based on
> the hint is clear the dirty bit. If we clear it in err when the page
> is actually in use it will lead to data corruption after migration.

That's true for your patches. I get that.

> The idea with the hint is that you are saying the page is currently
> not in use, however if you send that hint late and have already freed
> the page back you can corrupt memory.


That part is I think wrong - assuming "you" means upstream code.

> > > >
> > > > - guest eventually sends a special buffer signalling to
> > > > host that it's done sending free pages.
> > > > It then stops reporting until command id changes.
> > >
> > > The pages are not freed back to the guest until the host reports that
> > > it is "DONE" via a configuration change. Doing that stops any further
> > > progress, and attempting to resume will just restart from the
> > > beginning.
> >
> > Right but it's not a requirement. Host does not assume this at all.
> > It's done like this simply because we can't iterate over pages
> > with the existing API.
>
> The problem is nothing about the implementation was designed for
> iteration. What I would have to do is likely gut and rewrite the
> entire guest side of the FREE_PAGE_HINT code in order to make it work
> iteratively.


Right. I agree.

> As I mentioned it would probably have to look more like a
> NIC Rx ring in handling because we would have to have some sort of way
> to associate the pages 1:1 to the buffers.
>
> > > The big piece this design is missing is the incremental notification
> > > pages have been processed. The existing code just fills the vq with
> > > pages and keeps doing it until it cannot allocate any more pages. We
> > > would have to add logic to stop, flush, and resume to the existing
> > > framework.
> >
> > But not to the hypervisor interface. Hypervisor is fine
> > with pages being reused immediately. In fact, even before they
> > are processed.
>
> I don't think that is actually the case. If it does that I am pretty
> sure it will corrupt memory during migration.
>
> Take a look at qemu_guest_free_page_hint:
> https://github.com/qemu/qemu/blob/master/migration/ram.c#L3342
>
> I'm pretty sure that code is going in and clearing the dirty bitmap
> for memory.

Yes, it does. However, the trick is that meanwhile
kvm is logging new writes. So the bitmap that
is being cleared is the bitmap that was logged before the request
was sent to the guest.
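To spell out the trick, here is a toy model of the two logs (plain arrays stand in for kvm's dirty logs; all names are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 16

static bool snapshot[NPAGES];   /* bitmap synced before the hint request */
static bool new_writes[NPAGES]; /* writes kvm logs after the request */

/* Hints only ever clear the pre-request snapshot. */
static void hint_clears(int pfn)
{
    snapshot[pfn] = false;
}

/* Meanwhile kvm keeps logging fresh writes separately. */
static void guest_writes(int pfn)
{
    new_writes[pfn] = true;
}

/* At the next bitmap sync the two logs are merged, so a write that
 * landed after the hint still forces migration of the page. */
static bool needs_migration(int pfn)
{
    return snapshot[pfn] || new_writes[pfn];
}
```

This is why a page reallocated and dirtied after being hinted is not lost: the stale hint only touches the old snapshot, never the live log.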

> If we were to allow a page to be allocated and used and
> then perform the hint it is going to introduce a race where the page
> might be missed for migration and could result in memory corruption.

commit c13c4153f76db23cac06a12044bf4dd346764059 has this explanation:

Note: balloon will report pages which were free at the time of this call.
As the reporting happens asynchronously, dirty bit logging must be
enabled before this free_page_start call is made. Guest reporting must be
disabled before the migration dirty bitmap is synchronized.

but over multiple iterations this seems to have been dropped
from code comments. Wei, would you mind going back
and documenting the APIs you used?
They seem to be causing confusion ...

>
> > > > - host can restart the process at any time by
> > > > updating command ID. That will make guest stop
> > > > and start from the beginning.
> > > >
> > > > - host can also stop the process by specifying a special
> > > > command ID value.
> > > >
> > > >
> > > > =========
> > > >
> > > >
> > > > Now let's compare to what you have here:
> > > >
> > > > - At any time after boot, guest walks over free memory and sends
> > > > addresses as buffers to the host
> > > >
> > > > - Memory reported is then guaranteed to be unused
> > > > until host has used the buffers
> > > >
> > > >
> > > > Is above a fair summary?
> > > >
> > > > So yes there's a difference but the specific bit of chunking is same
> > > > imho.
> > >
> > > The big difference is that I am returning the pages after they are
> > > processed, while FREE_PAGE_HINT doesn't and isn't designed to.
> >
> > It doesn't but the hypervisor *is* designed to support that.
>
> Not really, it seems like it is more just a side effect of things.

I hope the commit log above is enough to convince you we did
think about this.

> Also as I mentioned before I am also not a huge fan of polling on both
> sides as it is just going to burn through CPU. If we are iterative and
> polling it is going to end up with us potentially pushing one CPU at
> 100%, and if the one CPU doing the polling cannot keep up with the
> page updates coming from the other CPUs we would be stuck in that
> state for a while. I would have preferred to see something where the
> CPU would at least allow other tasks to occur while it is waiting for
> buffers to be returned by the host.

You lost me here. What does polling have to do with it?


> > > The
> > > problem is the interface doesn't allow for a good way to identify that
> > > any given block of pages has been processed and can be returned.
> >
> > And that's because FREE_PAGE_HINT does not care.
> > It can return any page at any point even before hypervisor
> > saw it.
>
> I disagree, see my comment above.

OK, let's see if the above is enough to convince you. Or maybe we
have a bug when the shrinker is invoked :) But I don't think so.

> > > Instead pages go in, but they don't come out until the configuration
> > > is changed and "DONE" is reported. The act of reporting "DONE" will
> > > reset things and start them all over which kind of defeats the point.
> >
> > Right.
> >
> > But if you consider how we are using the shrinker you will
> > see that it's kind of broken.
> > For example not keeping track of allocated
> > pages means the count we return is broken
> > while reporting is active.
> >
> > I looked at fixing it but really if we can just
> > stop allocating memory that would be way cleaner.
>
> Agreed. If we hit an OOM we should probably just stop the free page
> hinting and treat that as the equivalent to an allocation failure.

And fix the shrinker count to include the pages in the vq. Yea.

>
> As-is I think this also has the potential for corrupting memory since
> it will likely be returning the most recent pages added to the balloon
> so the pages are likely still on the processing queue.

That part is fine I think because of the above.

>
> > For example we allocate pages until the shrinker kicks in.
> > Fair enough, but in fact maybe it would be better to
> > do the reverse: trigger the shrinker and then send as many
> > free pages as we can to the host.
>
> I'm not sure I understand this last part.

Oh, basically what I am saying is this: one of the reasons to use page
hinting is when the host is short on memory. In that case, why don't we use
the shrinker to ask kernel drivers to free up memory? Any memory freed could
then be reported to the host.
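As a rough userspace sketch of that idea (the real kernel interface would be a struct shrinker with count_objects/scan_objects callbacks; everything here is invented for illustration):

```c
#include <assert.h>

#define MAX_REPORT 64

/* Pages freed by drivers become immediately reportable to the host
 * as free-page hints. */
static int reported[MAX_REPORT];
static int nreported;

/* Stand-in for a driver's shrinker callback: release up to 'want'
 * cached pages and return how many were actually freed. Each freed
 * page is queued for reporting. */
static int driver_shrink(int want, int *cache, int *cached)
{
    int freed = 0;

    while (freed < want && *cached > 0) {
        reported[nreported++] = cache[--(*cached)]; /* now hintable */
        freed++;
    }
    return freed;
}
```

The ordering is the point: shrink first, so the host gets hints for memory that would otherwise have sat in driver caches.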

--
MST

2019-07-18 15:35:28

by Alexander Duyck

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Wed, Jul 17, 2019 at 10:14 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Wed, Jul 17, 2019 at 09:43:52AM -0700, Alexander Duyck wrote:
> > On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
> > > > On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
> > > >
> > > > <snip>
> > > >
> > > > > > > This is what I am saying. Having watched that patchset being developed,
> > > > > > > I think that's simply because processing blocks required mm core
> > > > > > > changes, which Wei was not up to pushing through.
> > > > > > >
> > > > > > >
> > > > > > > If we did
> > > > > > >
> > > > > > > while (1) {
> > > > > > > alloc_pages
> > > > > > > add_buf
> > > > > > > get_buf
> > > > > > > free_pages
> > > > > > > }
> > > > > > >
> > > > > > > We'd end up passing the same page to balloon again and again.
> > > > > > >
> > > > > > > So we end up reserving lots of memory with alloc_pages instead.
> > > > > > >
> > > > > > > What I am saying is that now that you are developing
> > > > > > > infrastructure to iterate over free pages,
> > > > > > > FREE_PAGE_HINT should be able to use it too.
> > > > > > > Whether that's possible might be a good indication of
> > > > > > > whether the new mm APIs make sense.
> > > > > >
> > > > > > The problem is the infrastructure as implemented isn't designed to do
> > > > > > that. I am pretty certain this interface will have issues with being
> > > > > > given small blocks to process at a time.
> > > > > >
> > > > > > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > > > > > have the concept of doing things a bit at a time. It is either
> > > > > > filling, stopped, or done. From what I can tell it requires a
> > > > > > configuration change for the virtio balloon interface to toggle
> > > > > > between those states.
> > > > >
> > > > > Maybe I misunderstand what you are saying.
> > > > >
> > > > > Filling state can definitely report things
> > > > > a bit at a time. It does not assume that
> > > > > all of guest free memory can fit in a VQ.
> > > >
> > > > I think where you and I may differ is that you are okay with just
> > > > pulling pages until you hit OOM, or allocation failures. Do I have
> > > > that right?
> > >
> > > This is exactly what the current code does. But that's an implementation
> > > detail which came about because we failed to find any other way to
> > > iterate over free blocks.
> >
> > I get that. However my concern is that permeated other areas of the
> > implementation that make taking another approach much more difficult
> > than it needs to be.
>
> Implementation would have to change to use an iterator obviously. But I don't see
> that it leaked out to a hypervisor interface.
>
> In fact take a look at virtio_balloon_shrinker_scan
> and you will see that it calls shrink_free_pages
> without waiting for the device at all.

Yes, and in case you missed it earlier I am pretty sure that leads to
possible memory corruption. I don't think it was tested enough to be
able to say that is safe.

Specifically we cannot be clearing the dirty flag on pages that are in
use. We should only be clearing that flag for pages that are
guaranteed to not be in use.

> > > > In my mind I am wanting to perform the hinting on a small
> > > > block at a time and work through things iteratively.
> > > >
> > > > The problem is the FREE_PAGE_HINT doesn't have the option of returning
> > > > pages until all pages have been pulled. It is run to completion and
> > > > will keep filling the balloon until an allocation fails and the host
> > > > says it is done.
> > >
> > > OK so there are two points. One is that FREE_PAGE_HINT does not
> > > need to allocate a page at all. It really just wants to
> > > iterate over free pages.
> >
> > I agree that it should just want to iterate over pages. However the
> > issue I am trying to point out is that it doesn't have any guarantees
> > on ordering and that is my concern. What I want to avoid is
> > potentially corrupting memory.
>
> I get that. I am just trying to make sure you are aware that for
> FREE_PAGE_HINT specifically ordering does not matter, because it does not
> care when the hypervisor used the buffers. It only cares that the page was
> free after it got the request. Used buffers are only tracked to avoid
> overflowing the VQ. This is different from your hinting, where you make
> it the responsibility of the guest to not allocate a page before it was
> used.

Prove to me that the ordering does not matter. As far as I can tell it
should since this is being used to clear the bitmap and will affect
migration. I'm pretty certain the page should not be freed until it
has been processed. Otherwise I believe there is a risk of the page
not being migrated and leading to a memory corruption when the VM is
finally migrated.

> >
> > So for example with my current hinting approach I am using the list of
> > hints because I get back one completion indicating all of the hints
> > have been processed. It is only at that point that I can go back and
> > make the memory available for allocation again.
>
> Right. But just counting them would work just as well, no?
> At least as long as you wait for everything to complete...
> If you want to pipeline, see below

Yes, but if possible I would also want to try and keep the batch
behavior that I have. We could count the descriptors processed,
however that is still essentially done all via busy waiting in the
FREE_PAGE_HINT logic.

> >
> > So one big issue right now with the FREE_PAGE_HINT approach is that it
> > is designed to be all or nothing. Using the balloon makes it
> > impossible for us to be incremental as all the pages are contained in
> > one spot. What we would need is some way to associate a page with a
> > given vq buffer.
>
> Sorry if I'm belaboring the obvious, but isn't this what 'void *data' in
> virtqueue_add_inbuf is designed for? And if you only ever use
> virtqueue_add_inbuf and virtqueue_add_outbuf on a given VQ, then you can
> track two pointers using virtqueue_add_inbuf_ctx.

I am still learning virtio so I wasn't aware of this piece until
yesterday. For FREE_PAGE_HINT it would probably work as we would then
have that association. For my page hinting I am still thinking I would
prefer to just pass around a scatterlist since that is the structure I
would likely fill and then later drain of pages versus just
maintaining a list.

> > Ultimately in order to really make the FREE_PAGE_HINT
> > logic work with something like my page hinting logic it would need to
> > work more like a network Rx ring in that we would associate a page per
> > buffer and have some way of knowing the two are associated.
>
> Right. That's exactly how virtio net does it btw.

Yeah, I saw that after reviewing the code yesterday.

> > > The reason FREE_PAGE_HINT does not free up pages until we have finished
> > > iterating over the free list is not a hypervisor API requirement. The reason
> > > is we don't want to keep getting the same address over and over again.
> > >
> > > > I would prefer to avoid that as I prefer to simply
> > > > notify the host of a fixed block of pages at a time and let it process
> > > > without having to have a thread on each side actively pushing pages,
> > > > or listening for the incoming pages.
> > >
> > > Right. And FREE_PAGE_HINT can go even further. It can push a page and
> > > let linux use it immediately. It does not even need to wait for host to
> > > process anything unless the VQ gets full.
> >
> > If it is doing what you are saying it will be corrupting memory.
>
> No and that is hypervisor's responsibility.
>
> I think you are missing part of the picture here.
>
> Here is a valid implementation:
>
> Before asking for hints, hypervisor write-protects all memory, and logs
> all write faults. When hypervisor gets the hint, if page has since been
> modified, the hint is ignored.

No, here is the part where I think you missed the point. I was already
aware of this. So my concern is this scenario.

If you put a hint on the VQ and then free the memory back to the
guest, what about the scenario where another process could allocate
the memory and dirty it before we process the hint request on the
host? In that case the page was dirtied, the hypervisor will have
correctly write-faulted and dirtied it, and then we came through and
incorrectly marked it as being free. That is the scenario I am worried
about, as I am pretty certain that leads to memory corruption.
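To make the worry concrete, here is a toy model of that scenario assuming a naive hypervisor with a single dirty bitmap, i.e. without the snapshot-and-merge scheme qemu actually uses (all names invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 16

/* Single dirty bitmap, shared between the fault logger and the
 * hint processor -- the naive design being argued against. */
static bool dirty[NPAGES];

/* Guest write faults, and the hypervisor marks the page dirty. */
static void guest_writes(int pfn)
{
    dirty[pfn] = true;
}

/* Naive hint handling: unconditionally clear the dirty bit, with
 * no check of whether the page was rewritten meanwhile. */
static void naive_process_hint(int pfn)
{
    dirty[pfn] = false;
}
```

With this design a stale hint erases a real write, which is exactly the lost-migration corruption being described; the snapshot scheme discussed elsewhere in the thread is what prevents it.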


>
> > At a
> > minimum it has to wait until the page has been processed and the dirty
> > bit cleared before it can let linux use it again. It is all a matter
> > of keeping the dirty bit coherent. If we let linux use it again
> > immediately and then cleared the dirty bit we would open up a possible
> > data corruption race during migration as a dirty page might not be
> > marked as such.
>
> I think you are talking about the dirty bit on the host, right?
>
> The implication is that calling MADV_FREE from qemu would
> not be a good implementation of FREE_PAGE_HINT.
> And indeed, as far as I can see it does nothing of the sort.

I don't mean the dirty bit on the host, I am talking about the bitmap
used to determine which pages need to be migrated. That is what this
hint is updating and it is also being tracked via the write protection
of the pages at the start of migration.

My concern is that we can end up losing track of pages that are
updated if we are hinting after they have been freed back to the guest
for reallocation.

> > > >
> > > > > > > > The basic idea with the bubble hinting was to essentially create mini
> > > > > > > > balloons. As such I had based the code off of the balloon inflation
> > > > > > > > code. The only spot where it really differs is that I needed the
> > > > > > > > ability to pass higher order pages so I tweaked things and passed
> > > > > > > > "hints" instead of "pfns".
> > > > > > >
> > > > > > > And that is fine. But there isn't really such a big difference with
> > > > > > > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > > > > > > in response to guest load.
> > > > > >
> > > > > > I disagree, I believe there is a significant difference.
> > > > >
> > > > > Yes there is, I just don't think it's in the iteration.
> > > > > The iteration seems to be useful to hinting.
> > > >
> > > > I agree that iteration is useful to hinting. The problem is the
> > > > FREE_PAGE_HINT code isn't really designed to be iterative. It is
> > > > designed to run with a polling thread on each side and it is meant to
> > > > be run to completion.
> > >
> > > Absolutely. But that's a bug I think.
> >
> > I think it is a part of the design. Basically in order to avoid
> > corrupting memory it cannot return the page to the guest kernel until
> > it has finished clearing the dirty bits associated with the pages.
>
> OK, I hope I clarified why that's not supposed to be the case.

I think you might have missed something. I am pretty certain issues
are still present.

> > > > > > The
> > > > > > FREE_PAGE_HINT code was implemented to be more of a streaming
> > > > > > interface.
> > > > >
> > > > > It's implemented like this but it does not follow from
> > > > > the interface. The implementation is a combination of
> > > > > attempts to minimize # of exits and minimize mm core changes.
> > > >
> > > > The problem is the interface doesn't have a good way of indicating
> > > > that it is done with a block of pages.
> > > >
> > > > So what I am probably looking at if I do a sg implementation for my
> > > > hinting is to provide one large sg block for all 32 of the pages I
> > > > might be holding.
> > >
> > > Right now if you pass an sg it will try to allocate a buffer
> > > on demand for you. If this is a problem I could come up
> > > with a new API that lets caller allocate the buffer.
> > > Let me know.
> > >
> > > > I'm assuming that will still be processed as one
> > > > contiguous block. With that I can then at least maintain a single
> > > > response per request.
> > >
> > > Why do you care? Won't a counter of outstanding pages be enough?
> > > Down the road maybe we could actually try to pipeline
> > > things a bit. So send 32 pages once you get 16 of these back
> > > send 16 more. Better for SMP configs and does not hurt
> > > non-SMP too much. I am not saying we need to do it right away though.
> >
> > So the big thing is we cannot give the page back to the guest kernel
> > until we know the processing has been completed. In the case of the
> > MADV_DONTNEED call it will zero out the entire page on the next
> > access. If the guest kernel had already written data by the time we
> > get to that it would cause a data corruption and kill the whole guest.
>
>
> Exactly, but FREE_PAGE_HINT does not cause qemu to call MADV_DONTNEED.

No, instead it clears the bit indicating that the page is supposed to
be migrated. The effect will not be all that different, just delayed
until the VM is actually migrated.

> > > > > > This is one of the things Linus kept complaining about in
> > > > > > his comments. This code attempts to pull in ALL of the higher order
> > > > > > pages, not just a smaller block of them.
> > > > >
> > > > > It wants to report all higher order pages eventually, yes.
> > > > > But it's absolutely fine to report a chunk and then wait
> > > > > for host to process the chunk before reporting more.
> > > > >
> > > > > However, interfaces we came up with for this would call
> > > > > into virtio with a bunch of locks taken.
> > > > > The solution was to take pages off the free list completely.
> > > > > That in turn means we can't return them until
> > > > > we have processed all free memory.
> > > >
> > > > I get that. The problem is the interface is designed around run to
> > > > completion. For example it will sit there in a busy loop waiting for a
> > > > free buffer because it knows the other side is supposed to be
> > > > processing the pages already.
> > >
> > > I didn't get this part.
> >
> > I think the part you may not be getting is that we cannot let the
> > guest use the page until the hint has been processed. Otherwise we
> > risk corrupting memory. That is the piece that has me paranoid. If we
> > end up performing a hint on a page that is in use somewhere in the kernel
> > it will corrupt memory one way or another. That is the thing I have to
> > avoid at all cost.
>
> You have to do it, sure. And that is because you do not
> assume that the hypervisor does it for you. But FREE_PAGE_HINT doesn't have
> to, because the hypervisor takes care of that.

Sort of. The hypervisor is trying to do dirty page tracking, however
the FREE_PAGE_HINT interferes with that. That is the problem. If we
get that out of order then the hypervisor work will be undone and we
just make a mess of memory.

> > That is why I have to have a way to know exactly which pages have been
> > processed and which haven't before I return pages to the guest.
> > Otherwise I am just corrupting memory.
>
> Sure. That isn't really hard though.

Agreed.

> >
> > > > > > Honestly the difference is
> > > > > > mostly in the hypervisor interface than what is needed for the kernel
> > > > > > interface, however the design of the hypervisor interface would make
> > > > > > doing things more incrementally much more difficult.
> > > > >
> > > > > OK that's interesting. The hypervisor interface is not
> > > > > documented in the spec yet. Let me take a stab at a writeup now. So:
> > > > >
> > > > >
> > > > >
> > > > > - hypervisor requests reporting by modifying command ID
> > > > > field in config space, and interrupting guest
> > > > >
> > > > > - in response, guest sends the command ID value on a special
> > > > > free page hinting VQ,
> > > > > followed by any number of buffers. Each buffer is assumed
> > > > > to be the address and length of memory that was
> > > > > unused *at some point after the time when command ID was sent*.
> > > > >
> > > > > Note that hypervisor takes pains to handle the case
> > > > > where memory is actually no longer free by the time
> > > > > it gets the memory.
> > > > > This allows guest driver to take more liberties
> > > > > and free pages without waiting for guest to
> > > > > use the buffers.
> > > > >
> > > > > This is also one of the reason we call this a free page hint -
> > > > > the guarantee that page is free is a weak one,
> > > > > in that sense it's more of a hint than a promise.
> > > > > That helps guarantee we don't create OOM out of the blue.
> > >
> > > I would like to stress the last paragraph above.
> >
> > The problem is we don't want to give bad hints. What we do based on
> > the hint is clear the dirty bit. If we clear it in error when the page
> > is actually in use it will lead to data corruption after migration.
>
> That's true for your patches. I get that.

No, it should be true for FREE_PAGE_HINT as well. The fact that it
isn't is a bug as far as I am concerned. If you are doing dirty page
tracking in the hypervisor you cannot expect it to behave well if the
guest is providing it with bad data.

> > The idea with the hint is that you are saying the page is currently
> > not in use, however if you send that hint late and have already freed
> > the page back you can corrupt memory.
>
>
> That part is I think wrong - assuming "you" means upstream code.

Yes, I am referring to someone running FREE_PAGE_HINT code. I usually
try to replace "you" with "we" to make it clear I am not talking about
someone personally; it is a bad habit.

> > > > >
> > > > > - guest eventually sends a special buffer signalling to
> > > > > host that it's done sending free pages.
> > > > > It then stops reporting until command id changes.
> > > >
> > > > The pages are not freed back to the guest until the host reports that
> > > > it is "DONE" via a configuration change. Doing that stops any further
> > > > progress, and attempting to resume will just restart from the
> > > > beginning.
> > >
> > > Right but it's not a requirement. Host does not assume this at all.
> > > It's done like this simply because we can't iterate over pages
> > > with the existing API.
> >
> > The problem is nothing about the implementation was designed for
> > iteration. What I would have to do is likely gut and rewrite the
> > entire guest side of the FREE_PAGE_HINT code in order to make it work
> > iteratively.
>
>
> Right. I agree.
>
> > As I mentioned it would probably have to look more like a
> > NIC Rx ring in handling because we would have to have some sort of way
> > to associate the pages 1:1 to the buffers.
> >
> > > > The big piece this design is missing is the incremental notification
> > > > pages have been processed. The existing code just fills the vq with
> > > > pages and keeps doing it until it cannot allocate any more pages. We
> > > > would have to add logic to stop, flush, and resume to the existing
> > > > framework.
> > >
> > > But not to the hypervisor interface. Hypervisor is fine
> > > with pages being reused immediately. In fact, even before they
> > > are processed.
> >
> > I don't think that is actually the case. If it does that I am pretty
> > sure it will corrupt memory during migration.
> >
> > Take a look at qemu_guest_free_page_hint:
> > https://github.com/qemu/qemu/blob/master/migration/ram.c#L3342
> >
> > I'm pretty sure that code is going in and clearing the dirty bitmap
> > for memory.
>
> Yes it does. However the trick is that meanwhile
> kvm is logging new writes. So the bitmap that
> is being cleared is the bitmap that was logged before the request
> was sent to guest.
>
> > If we were to allow a page to be allocated and used and
> > then perform the hint it is going to introduce a race where the page
> > might be missed for migration and could result in memory corruption.
>
> commit c13c4153f76db23cac06a12044bf4dd346764059 has this explanation:
>
> Note: balloon will report pages which were free at the time of this call.
> As the reporting happens asynchronously, dirty bit logging must be
> enabled before this free_page_start call is made. Guest reporting must be
> disabled before the migration dirty bitmap is synchronized.
>
> but over multiple iterations this seems to have been dropped
> from code comments. Wei, would you mind going back
> and documenting the APIs you used?
> They seem to be causing confusion ...

The "Note" is the behavior I am seeing. Specifically there is nothing
in place to prevent the freed pages from causing corruption if they
are freed before being hinted. The requirement should be that they
cannot be freed until after they are hinted that way the dirty bit
logging will mark the page as dirty if it is accessed AFTER being
hinted.

If you do not guarantee the hinting has happened first you could end
up logging the dirty bit before the hint is processed and then clear
the dirty bit due to the hint. It is pretty straightforward to
resolve by just not putting the page into the balloon until after the
hint has been processed.
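The ordering argument above can be captured in a small toy model. This is not QEMU or kernel code; the names and the single-bit bitmap are purely illustrative, but the two functions show why a hint processed before the page is reused is harmless, while a stale hint erases a dirty bit that migration needed:

```c
/* Toy model of the hint/dirty-bit ordering discussed above.
 * Hypothetical names; not actual QEMU/kernel code. */
#include <assert.h>
#include <stdbool.h>

static bool dirty[4];                 /* per-page migration dirty bitmap  */

static void guest_write(int pfn)  { dirty[pfn] = true;  } /* logged write */
static void process_hint(int pfn) { dirty[pfn] = false; } /* host clears  */

/* Correct order: the hint is processed while the guest still holds the
 * page, so any later reuse is logged and the page gets migrated. */
static bool safe_order(int pfn)
{
    process_hint(pfn);   /* page still unused: clearing the bit is fine  */
    guest_write(pfn);    /* page freed back and reused: write is logged  */
    return dirty[pfn];   /* true -> page will be migrated                */
}

/* Broken order: the page is freed and dirtied before the host sees the
 * hint, so the stale hint wipes out a dirty bit that had to survive. */
static bool broken_order(int pfn)
{
    guest_write(pfn);    /* page already reused and dirtied              */
    process_hint(pfn);   /* stale hint clears the bit                    */
    return dirty[pfn];   /* false -> page skipped, memory corruption     */
}
```

The model is deliberately minimal: it ignores write protection and bitmap snapshots, and only demonstrates that the relative order of the write and the hint decides whether the dirty bit survives.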

> >
> > > > > - host can restart the process at any time by
> > > > > updating command ID. That will make guest stop
> > > > > and start from the beginning.
> > > > >
> > > > > - host can also stop the process by specifying a special
> > > > > command ID value.
> > > > >
> > > > >
> > > > > =========
> > > > >
> > > > >
> > > > > Now let's compare to what you have here:
> > > > >
> > > > > - At any time after boot, guest walks over free memory and sends
> > > > > addresses as buffers to the host
> > > > >
> > > > > - Memory reported is then guaranteed to be unused
> > > > > until host has used the buffers
> > > > >
> > > > >
> > > > > Is above a fair summary?
> > > > >
> > > > > So yes there's a difference but the specific bit of chunking is same
> > > > > imho.
> > > >
> > > > The big difference is that I am returning the pages after they are
> > > > processed, while FREE_PAGE_HINT doesn't and isn't designed to.
> > >
> > > It doesn't but the hypervisor *is* designed to support that.
> >
> > Not really, it seems like it is more just a side effect of things.
>
> I hope the commit log above is enough to convince you we did
> think about this.

Sorry, but no. I think the "note" convinced me there is a race
condition, specifically in the shrinker case. We cannot free the page
back to host memory until the hint has been processed, otherwise we
will race with the dirty bit logging.

> > Also as I mentioned before I am also not a huge fan of polling on both
> > sides as it is just going to burn through CPU. If we are iterative and
> > polling it is going to end up with us potentially pushing one CPU at
> > 100%, and if the one CPU doing the polling cannot keep up with the
> > page updates coming from the other CPUs we would be stuck in that
> > state for a while. I would have preferred to see something where the
> > CPU would at least allow other tasks to occur while it is waiting for
> > buffers to be returned by the host.
>
> You lost me here. What does polling have to do with it?

This is just another issue I found. Specifically busy polling while
waiting on the host to process the hints. I'm not a fan of it and was
just pointing it out.
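The busy-polling complaint, and the pipelining idea floated later in the thread (send 32, top up as completions return), suggest a completion-driven ring with a cap on outstanding hints. The sketch below is illustrative only; the struct and function names are hypothetical and not taken from any driver:

```c
/* Sketch of capped, pipelined hinting instead of busy polling: keep at
 * most MAX_INFLIGHT hints outstanding and only queue more as the host
 * returns used buffers.  Illustrative names, not driver code. */
#include <assert.h>

#define MAX_INFLIGHT 32

struct hint_ring {
    int inflight;     /* hints handed to the host, not yet returned */
    int sent;         /* total hints queued                         */
    int completed;    /* total completions seen                     */
};

/* Queue up to 'want' new hints without exceeding the inflight cap;
 * pages that don't fit are simply freed back through the normal path. */
static int hint_send(struct hint_ring *r, int want)
{
    int room = MAX_INFLIGHT - r->inflight;
    int n = want < room ? want : room;

    r->sent += n;
    r->inflight += n;
    return n;
}

/* Called from the VQ completion callback: the host finished 'n'
 * buffers, so the guest can release those pages and top the ring back
 * up from the callback instead of spinning on one CPU. */
static void hint_complete(struct hint_ring *r, int n)
{
    r->completed += n;
    r->inflight -= n;
}
```

With this shape the hinting CPU blocks (or does other work) between completions rather than polling, while the cap bounds how much memory is tied up in flight.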

> > > > The
> > > > problem is the interface doesn't allow for a good way to identify that
> > > > any given block of pages has been processed and can be returned.
> > >
> > > And that's because FREE_PAGE_HINT does not care.
> > > It can return any page at any point even before hypervisor
> > > saw it.
> >
> > I disagree, see my comment above.
>
> OK let's see if the above is enough to convince you. Or maybe we
> have a bug when shrinker is invoked :) But I don't think so.

I'm pretty sure there is a bug.

> > > > Instead pages go in, but they don't come out until the configuration
> > > > is changed and "DONE" is reported. The act of reporting "DONE" will
> > > > reset things and start them all over which kind of defeats the point.
> > >
> > > Right.
> > >
> > > But if you consider how we are using the shrinker you will
> > > see that it's kind of broken.
> > > For example not keeping track of allocated
> > > pages means the count we return is broken
> > > while reporting is active.
> > >
> > > I looked at fixing it but really if we can just
> > > stop allocating memory that would be way cleaner.
> >
> > Agreed. If we hit an OOM we should probably just stop the free page
> > hinting and treat that as the equivalent to an allocation failure.
>
> And fix the shrinker count to include the pages in the vq. Yea.

I don't know if we really want to touch the pages in the VQ. I would
say that we should leave them alone.

> >
> > As-is I think this also has the potential for corrupting memory since
> > it will likely be returning the most recent pages added to the balloon
> > so the pages are likely still on the processing queue.
>
> That part is fine I think because of the above.
>
> >
> > > For example we allocate pages until shrinker kicks in.
> > > Fair enough, but in fact maybe it would be better to
> > > do the reverse: trigger shrinker and then send as many
> > > free pages as we can to host.
> >
> > I'm not sure I understand this last part.
>
> Oh basically what I am saying is this: one of the reasons to use page
> hinting is when host is short on memory. In that case, why don't we use
> shrinker to ask kernel drivers to free up memory? Any memory freed could
> then be reported to host.

Didn't the balloon driver already have a feature like that, where it
could start shrinking memory if the host was under memory pressure? If
so, how would adding another one add much value?

The idea here is if the memory is free we just mark it as such. As
long as we can do so with no noticeable overhead on the guest or host
why not just do it?
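The guest-side flow Alexander argues for throughout the thread — isolate a batch of free pages so nothing can allocate them, hint the batch as one request, and only make the pages allocatable again after the single completion — can be modeled in a few lines. The state names and functions below are invented for illustration and do not correspond to the actual patch series:

```c
/* Minimal model of batched hinting as described in this thread:
 * isolate a fixed batch of free pages, hint them as one request, and
 * only return them for reuse after the completion.  Illustrative
 * names only. */
#include <assert.h>
#include <stddef.h>

#define BATCH 32

enum page_state { PAGE_FREE, PAGE_HINTING, PAGE_HINTED };

static enum page_state pages[128];    /* all PAGE_FREE initially */

/* Pull up to BATCH free pages off the free list so nothing can
 * allocate them while the hint is outstanding. */
static size_t batch_isolate(size_t total)
{
    size_t n = 0;
    for (size_t i = 0; i < total && n < BATCH; i++) {
        if (pages[i] == PAGE_FREE) {
            pages[i] = PAGE_HINTING;
            n++;
        }
    }
    return n;
}

/* One completion covers the whole batch: only now is it safe to make
 * the pages allocatable again, because any write after this point is
 * caught by the host's dirty logging. */
static size_t batch_complete(size_t total)
{
    size_t n = 0;
    for (size_t i = 0; i < total; i++) {
        if (pages[i] == PAGE_HINTING) {
            pages[i] = PAGE_HINTED;
            n++;
        }
    }
    return n;
}
```

The key property is that a page is never simultaneously allocatable and sitting in an unprocessed hint, which is exactly the invariant the FREE_PAGE_HINT shrinker path is accused of violating.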

2019-07-18 16:05:01

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting


On 7/18/19 11:34 AM, Alexander Duyck wrote:
> On Wed, Jul 17, 2019 at 10:14 PM Michael S. Tsirkin <[email protected]> wrote:
>> On Wed, Jul 17, 2019 at 09:43:52AM -0700, Alexander Duyck wrote:
>>> On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <[email protected]> wrote:
>>>> On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
>>>>> On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
>>>>>
>>>>> <snip>
>>>>>
>>>>>>>> This is what I am saying. Having watched that patchset being developed,
>>>>>>>> I think that's simply because processing blocks required mm core
>>>>>>>> changes, which Wei was not up to pushing through.
>>>>>>>>
>>>>>>>>
>>>>>>>> If we did
>>>>>>>>
>>>>>>>> while (1) {
>>>>>>>> alloc_pages
>>>>>>>> add_buf
>>>>>>>> get_buf
>>>>>>>> free_pages
>>>>>>>> }
>>>>>>>>
>>>>>>>> We'd end up passing the same page to balloon again and again.
>>>>>>>>
>>>>>>>> So we end up reserving lots of memory with alloc_pages instead.
>>>>>>>>
>>>>>>>> What I am saying is that now that you are developing
>>>>>>>> infrastructure to iterate over free pages,
>>>>>>>> FREE_PAGE_HINT should be able to use it too.
>>>>>>>> Whether that's possible might be a good indication of
>>>>>>>> whether the new mm APIs make sense.
>>>>>>> The problem is the infrastructure as implemented isn't designed to do
>>>>>>> that. I am pretty certain this interface will have issues with being
>>>>>>> given small blocks to process at a time.
>>>>>>>
>>>>>>> Basically the design for the FREE_PAGE_HINT feature doesn't really
>>>>>>> have the concept of doing things a bit at a time. It is either
>>>>>>> filling, stopped, or done. From what I can tell it requires a
>>>>>>> configuration change for the virtio balloon interface to toggle
>>>>>>> between those states.
>>>>>> Maybe I misunderstand what you are saying.
>>>>>>
>>>>>> Filling state can definitely report things
>>>>>> a bit at a time. It does not assume that
>>>>>> all of guest free memory can fit in a VQ.
>>>>> I think where you and I may differ is that you are okay with just
>>>>> pulling pages until you hit OOM, or allocation failures. Do I have
>>>>> that right?
>>>> This is exactly what the current code does. But that's an implementation
>>>> detail which came about because we failed to find any other way to
>>>> iterate over free blocks.
>>> I get that. However my concern is that it permeated other areas of the
>>> implementation that make taking another approach much more difficult
>>> than it needs to be.
>> Implementation would have to change to use an iterator obviously. But I don't see
>> that it leaked out to a hypervisor interface.
>>
>> In fact take a look at virtio_balloon_shrinker_scan
>> and you will see that it calls shrink_free_pages
>> without waiting for the device at all.
> Yes, and in case you missed it earlier I am pretty sure that leads to
> possible memory corruption. I don't think it was tested enough to be
> able to say that is safe.
>
> Specifically we cannot be clearing the dirty flag on pages that are in
> use. We should only be clearing that flag for pages that are
> guaranteed to not be in use.
>
>>>>> In my mind I am wanting to perform the hinting on a small
>>>>> block at a time and work through things iteratively.
>>>>>
>>>>> The problem is the FREE_PAGE_HINT doesn't have the option of returning
>>>>> pages until all pages have been pulled. It is run to completion and
>>>>> will keep filling the balloon until an allocation fails and the host
>>>>> says it is done.
>>>> OK so there are two points. One is that FREE_PAGE_HINT does not
>>>> need to allocate a page at all. It really just wants to
>>>> iterate over free pages.
>>> I agree that it should just want to iterate over pages. However the
>>> issue I am trying to point out is that it doesn't have any guarantees
>>> on ordering and that is my concern. What I want to avoid is
>>> potentially corrupting memory.
>> I get that. I am just trying to make sure you are aware that for
>> FREE_PAGE_HINT specifically ordering does not matter because it does not
>> care when hypervisor used the buffers. It only cares that page was
>> free after it got the request. used buffers are only tracked to avoid
>> overflowing the VQ. This is different from your hinting where you make
>> it the responsibility of the guest to not allocate page before it was
>> used.
> Prove to me that the ordering does not matter. As far as I can tell it
> should since this is being used to clear the bitmap and will affect
> migration. I'm pretty certain the page should not be freed until it
> has been processed. Otherwise I believe there is a risk of the page
> not being migrated and leading to a memory corruption when the VM is
> finally migrated.
>
>>> So for example with my current hinting approach I am using the list of
>>> hints because I get back one completion indicating all of the hints
>>> have been processed. It is only at that point that I can go back and
>>> make the memory available for allocation again.
>> Right. But just counting them would work just as well, no?
>> At least as long as you wait for everything to complete...
>> If you want to pipeline, see below
> Yes, but if possible I would also want to try and keep the batch
> behavior that I have. We could count the descriptors processed,
> however that is still essentially done all via busy waiting in the
> FREE_PAGE_HINT logic.
>
>>> So one big issue right now with the FREE_PAGE_HINT approach is that it
>>> is designed to be all or nothing. Using the balloon makes it
>>> impossible for us to be incremental as all the pages are contained in
>>> one spot. What we would need is some way to associate a page with a
>>> given vq buffer.
>> Sorry if I'm belaboring the obvious, but isn't this what 'void *data' in
>> virtqueue_add_inbuf is designed for? And if you only ever use
>> virtqueue_add_inbuf and virtqueue_add_outbuf on a given VQ, then you can
>> track two pointers using virtqueue_add_inbuf_ctx.
> I am still learning virtio so I wasn't aware of this piece until
> yesterday. For FREE_PAGE_HINT it would probably work as we would then
> have that association. For my page hinting I am still thinking I would
> prefer to just pass around a scatterlist since that is the structure I
> would likely fill and then later drain of pages versus just
> maintaining a list.
>
>>> Ultimately in order to really make the FREE_PAGE_HINT
>>> logic work with something like my page hinting logic it would need to
>>> work more like a network Rx ring in that we would associate a page per
>>> buffer and have some way of knowing the two are associated.
>> Right. That's exactly how virtio net does it btw.
> Yeah, I saw that after reviewing the code yesterday.
>
>>>> The reason FREE_PAGE_HINT does not free up pages until we finished
>>>> iterating over the free list it not a hypervisor API. The reason is we
>>>> don't want to keep getting the same address over and over again.
>>>>
>>>>> I would prefer to avoid that as I prefer to simply
>>>>> notify the host of a fixed block of pages at a time and let it process
>>>>> without having to have a thread on each side actively pushing pages,
>>>>> or listening for the incoming pages.
>>>> Right. And FREE_PAGE_HINT can go even further. It can push a page and
>>>> let linux use it immediately. It does not even need to wait for host to
>>>> process anything unless the VQ gets full.
>>> If it is doing what you are saying it will be corrupting memory.
>> No and that is hypervisor's responsibility.
>>
>> I think you are missing part of the picture here.
>>
>> Here is a valid implementation:
>>
>> Before asking for hints, hypervisor write-protects all memory, and logs
>> all write faults. When hypervisor gets the hint, if page has since been
>> modified, the hint is ignored.
> No, here is the part where I think you missed the point. I was already
> aware of this. So my concern is this scenario.
>
> If you put a hint on the VQ and then free the memory back to the
> guest, what about the scenario where another process could allocate
> the memory and dirty it before we process the hint request on the
> host? In that case the page was dirtied, the hypervisor will have
> correctly write faulted and dirtied it, and then we came though and
> incorrectly marked it as being free. That is the scenario I am worried
> about as I am pretty certain that leads to memory corruption.
>
>
>>> At a
>>> minimum it has to wait until the page has been processed and the dirty
>>> bit cleared before it can let linux use it again. It is all a matter
>>> of keeping the dirty bit coherent. If we let linux use it again
>>> immediately and then cleared the dirty bit we would open up a possible
>>> data corruption race during migration as a dirty page might not be
>>> marked as such.
>> I think you are talking about the dirty bit on the host, right?
>>
>> The implication is that calling MADV_FREE from qemu would
>> not be a good implementation of FREE_PAGE_HINT.
>> And indeed, as far as I can see it does nothing of the sort.
> I don't mean the dirty bit on the host, I am talking about the bitmap
> used to determine which pages need to be migrated. That is what this
> hint is updating and it is also being tracked via the write protection
> of the pages at the start of migration.
>
> My concern is that we can end up losing track of pages that are
> updated if we are hinting after they have been freed back to the guest
> for reallocation.
>
>>>>>>>>> The basic idea with the bubble hinting was to essentially create mini
>>>>>>>>> balloons. As such I had based the code off of the balloon inflation
>>>>>>>>> code. The only spot where it really differs is that I needed the
>>>>>>>>> ability to pass higher order pages so I tweaked things and passed
>>>>>>>>> "hints" instead of "pfns".
>>>>>>>> And that is fine. But there isn't really such a big difference with
>>>>>>>> FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
>>>>>>>> in response to guest load.
>>>>>>> I disagree, I believe there is a significant difference.
>>>>>> Yes there is, I just don't think it's in the iteration.
>>>>>> The iteration seems to be useful to hinting.
>>>>> I agree that iteration is useful to hinting. The problem is the
>>>>> FREE_PAGE_HINT code isn't really designed to be iterative. It is
>>>>> designed to run with a polling thread on each side and it is meant to
>>>>> be run to completion.
>>>> Absolutely. But that's a bug I think.
>>> I think it is a part of the design. Basically in order to avoid
>>> corrupting memory it cannot return the page to the guest kernel until
>>> it has finished clearing the dirty bits associated with the pages.
>> OK I hope I clarified by that's not supposed to be the case.
> I think you might have missed something. I am pretty certain issues
> are still present.
>
>>>>>>> The
>>>>>>> FREE_PAGE_HINT code was implemented to be more of a streaming
>>>>>>> interface.
>>>>>> It's implemented like this but it does not follow from
>>>>>> the interface. The implementation is a combination of
>>>>>> attempts to minimize # of exits and minimize mm core changes.
>>>>> The problem is the interface doesn't have a good way of indicating
>>>>> that it is done with a block of pages.
>>>>>
>>>>> So what I am probably looking at if I do a sg implementation for my
>>>>> hinting is to provide one large sg block for all 32 of the pages I
>>>>> might be holding.
>>>> Right now if you pass an sg it will try to allocate a buffer
>>>> on demand for you. If this is a problem I could come up
>>>> with a new API that lets caller allocate the buffer.
>>>> Let me know.
>>>>
>>>>> I'm assuming that will still be processed as one
>>>>> contiguous block. With that I can then at least maintain a single
>>>>> response per request.
>>>> Why do you care? Won't a counter of outstanding pages be enough?
>>>> Down the road maybe we could actually try to pipeline
>>>> things a bit. So send 32 pages once you get 16 of these back
>>>> send 16 more. Better for SMP configs and does not hurt
>>>> non-SMP too much. I am not saying we need to do it right away though.
>>> So the big thing is we cannot give the page back to the guest kernel
>>> until we know the processing has been completed. In the case of the
>>> MADV_DONTNEED call it will zero out the entire page on the next
>>> access. If the guest kernel had already written data by the time we
>>> get to that it would cause a data corruption and kill the whole guest.
>>
>> Exactly but FREE_PAGE_HINT does not cause qemu to call MADV_DONTNEED.
> No, instead it clears the bit indicating that the page is supposed to
> be migrated. The effect will not be all that different, just delayed
> until the VM is actually migrated.
>
>>>>>>> This is one of the things Linus kept complaining about in
>>>>>>> his comments. This code attempts to pull in ALL of the higher order
>>>>>>> pages, not just a smaller block of them.
>>>>>> It wants to report all higher order pages eventually, yes.
>>>>>> But it's absolutely fine to report a chunk and then wait
>>>>>> for host to process the chunk before reporting more.
>>>>>>
>>>>>> However, interfaces we came up with for this would call
>>>>>> into virtio with a bunch of locks taken.
>>>>>> The solution was to take pages off the free list completely.
>>>>>> That in turn means we can't return them until
>>>>>> we have processed all free memory.
>>>>> I get that. The problem is the interface is designed around run to
>>>>> completion. For example it will sit there in a busy loop waiting for a
>>>>> free buffer because it knows the other side is supposed to be
>>>>> processing the pages already.
>>>> I didn't get this part.
>>> I think the part you may not be getting is that we cannot let the
>>> guest use the page until the hint has been processed. Otherwise we
>>> risk corrupting memory. That is the piece that has me paranoid. If we
>>> end up performing a hint on a page that is in use somewhere in the kernel
>>> it will corrupt memory one way or another. That is the thing I have to
>>> avoid at all cost.
>> You have to do it, sure. And that is because you do not
>> assume that hypervisor does it for you. But FREE_PAGE_HINT doesn't,
>> hypervisor takes care of that.
> Sort of. The hypervisor is trying to do dirty page tracking, however
> the FREE_PAGE_HINT interferes with that. That is the problem. If we
> get that out of order then the hypervisor work will be undone and we
> just make a mess of memory.
>
>>> That is why I have to have a way to know exactly which pages have been
>>> processed and which haven't before I return pages to the guest.
>>> Otherwise I am just corrupting memory.
>> Sure. That isn't really hard though.
> Agreed.
>
>>>>>>> Honestly the difference is
>>>>>>> mostly in the hypervisor interface than what is needed for the kernel
>>>>>>> interface, however the design of the hypervisor interface would make
>>>>>>> doing things more incrementally much more difficult.
>>>>>> OK that's interesting. The hypervisor interface is not
>>>>>> documented in the spec yet. Let me take a stub at a writeup now. So:
>>>>>>
>>>>>>
>>>>>>
>>>>>> - hypervisor requests reporting by modifying command ID
>>>>>> field in config space, and interrupting guest
>>>>>>
>>>>>> - in response, guest sends the command ID value on a special
>>>>>> free page hinting VQ,
>>>>>> followed by any number of buffers. Each buffer is assumed
>>>>>> to be the address and length of memory that was
>>>>>> unused *at some point after the time when command ID was sent*.
>>>>>>
>>>>>> Note that hypervisor takes pains to handle the case
>>>>>> where memory is actually no longer free by the time
>>>>>> it gets the memory.
>>>>>> This allows guest driver to take more liberties
>>>>>> and free pages without waiting for guest to
>>>>>> use the buffers.
>>>>>>
>>>>>> This is also one of the reason we call this a free page hint -
>>>>>> the guarantee that page is free is a weak one,
>>>>>> in that sense it's more of a hint than a promise.
>>>>>> That helps guarantee we don't create OOM out of blue.
>>>> I would like to stress the last paragraph above.
>>> The problem is we don't want to give bad hints. What we do based on
>>> the hint is clear the dirty bit. If we clear it in err when the page
>>> is actually in use it will lead to data corruption after migration.
>> That's true for your patches. I get that.
> No, it should be true for FREE_PAGE_HINT as well. The fact that it
> isn't is a bug as far as I am concerned. If you are doing dirty page
> tracking in the hypervisor you cannot expect it to behave well if the
> guest is providing it with bad data.
>
>>> The idea with the hint is that you are saying the page is currently
>>> not in use, however if you send that hint late and have already freed
>>> the page back you can corrupt memory.
>>
>> That part is I think wrong - assuming "you" means upstream code.
> Yes, I am referring to someone running FREE_PAGE_HINT code. I usually
> try to replace them with "we" to make it clear I am not talking about
> someone personally, it is a bad habit.
>
>>>>>> - guest eventually sends a special buffer signalling to
>>>>>> host that it's done sending free pages.
>>>>>> It then stops reporting until command id changes.
>>>>> The pages are not freed back to the guest until the host reports that
>>>>> it is "DONE" via a configuration change. Doing that stops any further
>>>>> progress, and attempting to resume will just restart from the
>>>>> beginning.
>>>> Right but it's not a requirement. Host does not assume this at all.
>>>> It's done like this simply because we can't iterate over pages
>>>> with the existing API.
>>> The problem is nothing about the implementation was designed for
>>> iteration. What I would have to do is likely gut and rewrite the
>>> entire guest side of the FREE_PAGE_HINT code in order to make it work
>>> iteratively.
>>
>> Right. I agree.
>>
>>> As I mentioned it would probably have to look more like a
>>> NIC Rx ring in handling because we would have to have some sort of way
>>> to associate the pages 1:1 to the buffers.
>>>
>>>>> The big piece this design is missing is the incremental notification
>>>>> pages have been processed. The existing code just fills the vq with
>>>>> pages and keeps doing it until it cannot allocate any more pages. We
>>>>> would have to add logic to stop, flush, and resume to the existing
>>>>> framework.
>>>> But not to the hypervisor interface. Hypervisor is fine
>>>> with pages being reused immediately. In fact, even before they
>>>> are processed.
>>> I don't think that is actually the case. If it does that I am pretty
>>> sure it will corrupt memory during migration.
>>>
>>> Take a look at qemu_guest_free_page_hint:
>>> https://github.com/qemu/qemu/blob/master/migration/ram.c#L3342
>>>
>>> I'm pretty sure that code is going in and clearing the dirty bitmap
>>> for memory.
>> Yes it does. However the trick is that meanwhile
>> kvm is logging new writes. So the bitmap that
>> is being cleared is the bitmap that was logged before the request
>> was sent to guest.
>>
>>> If we were to allow a page to be allocated and used and
>>> then perform the hint it is going to introduce a race where the page
>>> might be missed for migration and could result in memory corruption.
>> commit c13c4153f76db23cac06a12044bf4dd346764059 has this explanation:
>>
>> Note: balloon will report pages which were free at the time of this call.
>> As the reporting happens asynchronously, dirty bit logging must be
>> enabled before this free_page_start call is made. Guest reporting must be
>> disabled before the migration dirty bitmap is synchronized.
>>
>> but over multiple iterations this seems to have been dropped
>> from code comments. Wei, would you mind going back
>> and documenting the APIs you used?
>> They seem to be causing confusion ...
> The "Note" is the behavior I am seeing. Specifically there is nothing
> in place to prevent the freed pages from causing corruption if they
> are freed before being hinted. The requirement should be that they
> cannot be freed until after they are hinted that way the dirty bit
> logging will mark the page as dirty if it is accessed AFTER being
> hinted.
>
> If you do not guarantee the hinting has happened first you could end
> up logging the dirty bit before the hint is processed and then clear
> the dirty bit due to the hint. It is pretty straight forward to
> resolve by just not putting the page into the balloon until after the
> hint has been processed.
>
>>>>>> - host can restart the process at any time by
>>>>>> updating command ID. That will make guest stop
>>>>>> and start from the beginning.
>>>>>>
>>>>>> - host can also stop the process by specifying a special
>>>>>> command ID value.
>>>>>>
>>>>>>
>>>>>> =========
>>>>>>
>>>>>>
>>>>>> Now let's compare to what you have here:
>>>>>>
>>>>>> - At any time after boot, guest walks over free memory and sends
>>>>>> addresses as buffers to the host
>>>>>>
>>>>>> - Memory reported is then guaranteed to be unused
>>>>>> until host has used the buffers
>>>>>>
>>>>>>
>>>>>> Is above a fair summary?
>>>>>>
>>>>>> So yes there's a difference but the specific bit of chunking is same
>>>>>> imho.
>>>>> The big difference is that I am returning the pages after they are
>>>>> processed, while FREE_PAGE_HINT doesn't and isn't designed to.
>>>> It doesn't but the hypervisor *is* designed to support that.
>>> Not really, it seems like it is more just a side effect of things.
>> I hope the commit log above is enough to convice you we did
>> think about this.
> Sorry, but no. I think the "note" convinced me there is a race
> condition, specifically in the shrinker case. We cannot free the page
> back to host memory until the hint has been processed, otherwise we
> will race with the dirty bit logging.
>
>>> Also as I mentioned before I am also not a huge fan of polling on both
>>> sides as it is just going to burn through CPU. If we are iterative and
>>> polling it is going to end up with us potentially pushing one CPU at
>>> 100%, and if the one CPU doing the polling cannot keep up with the
>>> page updates coming from the other CPUs we would be stuck in that
>>> state for a while. I would have preferred to see something where the
>>> CPU would at least allow other tasks to occur while it is waiting for
>>> buffers to be returned by the host.
>> You lost me here. What does polling have to do with it?
> This is just another issue I found. Specifically busy polling while
> waiting on the host to process the hints. I'm not a fan of it and was
> just pointing it out.
>
>>>>> The
>>>>> problem is the interface doesn't allow for a good way to identify that
>>>>> any given block of pages has been processed and can be returned.
>>>> And that's because FREE_PAGE_HINT does not care.
>>>> It can return any page at any point even before hypervisor
>>>> saw it.
>>> I disagree, see my comment above.
>> OK let's see if above is enough to convince you. Or maybe we
>> have a bug when shrinker is invoked :) But I don't think so.
> I'm pretty sure there is a bug.
>
>>>>> Instead pages go in, but they don't come out until the configuration
>>>>> is changed and "DONE" is reported. The act of reporting "DONE" will
>>>>> reset things and start them all over which kind of defeats the point.
>>>> Right.
>>>>
>>>> But if you consider how we are using the shrinker you will
>>>> see that it's kind of broken.
>>>> For example not keeping track of allocated
>>>> pages means the count we return is broken
>>>> while reporting is active.
>>>>
>>>> I looked at fixing it but really if we can just
>>>> stop allocating memory that would be way cleaner.
>>> Agreed. If we hit an OOM we should probably just stop the free page
>>> hinting and treat that as the equivalent to an allocation failure.
>> And fix the shrinker count to include the pages in the vq. Yea.
> I don't know if we really want to touch the pages in the VQ. I would
> say that we should leave them alone.
>
>>> As-is I think this also has the potential for corrupting memory since
>>> it will likely be returning the most recent pages added to the balloon
>>> so the pages are likely still on the processing queue.
>> That part is fine I think because of the above.
>>
>>>> For example we allocate pages until shrinker kicks in.
>>>> Fair enough but in fact maybe it would be better to
>>>> do the reverse: trigger shrinker and then send as many
>>>> free pages as we can to host.
>>> I'm not sure I understand this last part.
>> Oh basically what I am saying is this: one of the reasons to use page
>> hinting is when host is short on memory. In that case, why don't we use
>> shrinker to ask kernel drivers to free up memory? Any memory freed could
>> then be reported to host.
> Didn't the balloon driver already have a feature like that where it
> could start shrinking memory if the host was under memory pressure?
If you are referring to auto-ballooning (I don't think it is merged), it
has its own set of disadvantages, such as easily leading to OOM, memory
corruption and so on.
VIRTIO_BALLOON_F_FREE_PAGE_HINT does address some of those issues.
However, it still requires external control to initiate/stop the memory
transaction.
> If
> so how would adding another one add much value.
> The idea here is if the memory is free we just mark it as such. As
> long as we can do so with no noticeable overhead on the guest or host
> why not just do it?
+1. This is the advantage that both of the hinting solutions are trying to
provide.
--
Thanks
Nitesh

2019-07-18 16:08:21

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 08:34:37AM -0700, Alexander Duyck wrote:
> On Wed, Jul 17, 2019 at 10:14 PM Michael S. Tsirkin <[email protected]> wrote:
> >
> > On Wed, Jul 17, 2019 at 09:43:52AM -0700, Alexander Duyck wrote:
> > > On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <[email protected]> wrote:
> > > >
> > > > On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
> > > > > On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
> > > > >
> > > > > <snip>
> > > > >
> > > > > > > > This is what I am saying. Having watched that patchset being developed,
> > > > > > > > I think that's simply because processing blocks required mm core
> > > > > > > > changes, which Wei was not up to pushing through.
> > > > > > > >
> > > > > > > >
> > > > > > > > If we did
> > > > > > > >
> > > > > > > > while (1) {
> > > > > > > > alloc_pages
> > > > > > > > add_buf
> > > > > > > > get_buf
> > > > > > > > free_pages
> > > > > > > > }
> > > > > > > >
> > > > > > > > We'd end up passing the same page to balloon again and again.
> > > > > > > >
> > > > > > > > So we end up reserving lots of memory with alloc_pages instead.
> > > > > > > >
> > > > > > > > What I am saying is that now that you are developing
> > > > > > > > infrastructure to iterate over free pages,
> > > > > > > > FREE_PAGE_HINT should be able to use it too.
> > > > > > > > Whether that's possible might be a good indication of
> > > > > > > > whether the new mm APIs make sense.
> > > > > > >
> > > > > > > The problem is the infrastructure as implemented isn't designed to do
> > > > > > > that. I am pretty certain this interface will have issues with being
> > > > > > > given small blocks to process at a time.
> > > > > > >
> > > > > > > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > > > > > > have the concept of doing things a bit at a time. It is either
> > > > > > > filling, stopped, or done. From what I can tell it requires a
> > > > > > > configuration change for the virtio balloon interface to toggle
> > > > > > > between those states.
> > > > > >
> > > > > > Maybe I misunderstand what you are saying.
> > > > > >
> > > > > > Filling state can definitely report things
> > > > > > a bit at a time. It does not assume that
> > > > > > all of guest free memory can fit in a VQ.
> > > > >
> > > > > I think where you and I may differ is that you are okay with just
> > > > > pulling pages until you hit OOM, or allocation failures. Do I have
> > > > > that right?
> > > >
> > > > This is exactly what the current code does. But that's an implementation
> > > > detail which came about because we failed to find any other way to
> > > > iterate over free blocks.
> > >
> > > I get that. However my concern is that permeated other areas of the
> > > implementation that make taking another approach much more difficult
> > > than it needs to be.
> >
> > Implementation would have to change to use an iterator obviously. But I don't see
> > that it leaked out to a hypervisor interface.
> >
> > In fact take a look at virtio_balloon_shrinker_scan
> > and you will see that it calls shrink_free_pages
> > without waiting for the device at all.
>
> Yes, and in case you missed it earlier I am pretty sure that leads to
> possible memory corruption. I don't think it was tested enough to be
> able to say that is safe.

More testing would be good, for sure.

> Specifically we cannot be clearing the dirty flag on pages that are in
> use. We should only be clearing that flag for pages that are
> guaranteed to not be in use.

I think that clearing the dirty flag is safe if the flag was originally
set and the page was write-protected before reporting was requested.
In that case we know the page has not been changed.
Right?

> > > > > In my mind I am wanting to perform the hinting on a small
> > > > > block at a time and work through things iteratively.
> > > > >
> > > > > The problem is the FREE_PAGE_HINT doesn't have the option of returning
> > > > > pages until all pages have been pulled. It is run to completion and
> > > > > will keep filling the balloon until an allocation fails and the host
> > > > > says it is done.
> > > >
> > > > OK so there are two points. One is that FREE_PAGE_HINT does not
> > > > need to allocate a page at all. It really just wants to
> > > > iterate over free pages.
> > >
> > > I agree that it should just want to iterate over pages. However the
> > > issue I am trying to point out is that it doesn't have any guarantees
> > > on ordering and that is my concern. What I want to avoid is
> > > potentially corrupting memory.
> >
> > I get that. I am just trying to make sure you are aware that for
> > FREE_PAGE_HINT specifically ordering does not matter because it does not
> > care when hypervisor used the buffers. It only cares that page was
> > free after it got the request. used buffers are only tracked to avoid
> > overflowing the VQ. This is different from your hinting where you make
> > it the responsibility of the guest to not allocate page before it was
> > used.
>
> Prove to me that the ordering does not matter. As far as I can tell it
> should since this is being used to clear the bitmap and will affect
> migration.

OK I will try.

Imagine a page that is used by Linux.
It has been write protected by sync dirty bitmap.
Note how that does not happen while reporting
is active: it happens before reporting starts, and
next after reporting is done.

Now what are the bits that will be cleared by hinting?
These are dirty bits from page use from before hinting was
requested. We do not care about these because we know
that page was free at some point afterwards.
So any data it had can be safely discarded.



All this should have been documented in qemu source but
unfortunately wasn't :(



Is the above convincing?




> I'm pretty certain the page should not be freed until it
> has been processed. Otherwise I believe there is a risk of the page
> not being migrated and leading to a memory corruption when the VM is
> finally migrated.

I understand the concern, it was definitely on my mind
and I think it was addressed. But do let me know.

> > >
> > > So for example with my current hinting approach I am using the list of
> > > hints because I get back one completion indicating all of the hints
> > > have been processed. It is only at that point that I can go back and
> > > make the memory available for allocation again.
> >
> > Right. But just counting them would work just as well, no?
> > At least as long as you wait for everything to complete...
> > If you want to pipeline, see below
>
> Yes, but if possible I would also want to try and keep the batch
> behavior that I have.

As in pass a batch to host at once? Sure I think it's a good idea.

> We could count the descriptors processed,
> however that is still essentially done all via busy waiting in the
> FREE_PAGE_HINT logic.

OK let's discuss FREE_PAGE_HINT separately above. Until we
agree on whether it's safe to free up pages before they
are used for that usecase, we are just going in circles.


> > >
> > > So one big issue right now with the FREE_PAGE_HINT approach is that it
> > > is designed to be all or nothing. Using the balloon makes it
> > > impossible for us to be incremental as all the pages are contained in
> > > one spot. What we would need is some way to associate a page with a
> > > given vq buffer.
> >
> > Sorry if I'm belaboring the obvious, but isn't this what 'void *data' in
> > virtqueue_add_inbuf is designed for? And if you only ever use
> > virtqueue_add_inbuf and virtqueue_add_outbuf on a given VQ, then you can
> > track two pointers using virtqueue_add_inbuf_ctx.
>
> I am still learning virtio so I wasn't aware of this piece until
> yesterday. For FREE_PAGE_HINT it would probably work as we would then
> have that association. For my page hinting I am still thinking I would
> prefer to just pass around a scatterlist since that is the structure I
> would likely fill and then later drain of pages versus just
> maintaining a list.

OK. That might need an API extension. We do support scatter lists
but ATM they allocate memory internally. Not something you
want to do when you are playing with free lists I think.
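
The "NIC Rx ring" style association discussed here could look something
like this toy model (Python, illustrative only; the real kernel API being
referenced is virtqueue_add_inbuf_ctx(), and none of this is driver code):

```python
class HintVQ:
    """Toy model: each buffer on the VQ carries an opaque context, so when
    the host returns a buffer we know exactly which pages it covered and
    can release just those - rather than all-or-nothing."""

    def __init__(self):
        self.in_flight = {}   # buffer id -> list of page addresses (the "ctx")
        self.next_id = 0

    def add_inbuf(self, pages):
        # analogous to the 'data'/'ctx' pointers of virtqueue_add_inbuf_ctx()
        buf_id = self.next_id
        self.next_id += 1
        self.in_flight[buf_id] = list(pages)
        return buf_id

    def get_buf(self, buf_id):
        # host consumed this buffer: only now are these pages safe to hand
        # back to the allocator (their dirty bits have been processed)
        return self.in_flight.pop(buf_id)


vq = HintVQ()
b0 = vq.add_inbuf([0x1000, 0x2000])
b1 = vq.add_inbuf([0x3000])
done = vq.get_buf(b0)               # release only what the host processed
assert done == [0x1000, 0x2000]
assert 0x3000 in vq.in_flight[b1]   # still in flight: unsafe to reuse
```

This is the incremental behavior being asked for: pages come back per
completed buffer instead of only after the whole reporting run finishes.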

> > > Ultimately in order to really make the FREE_PAGE_HINT
> > > logic work with something like my page hinting logic it would need to
> > > work more like a network Rx ring in that we would associate a page per
> > > buffer and have some way of knowing the two are associated.
> >
> > Right. That's exactly how virtio net does it btw.
>
> Yeah, I saw that after reviewing the code yesterday.
>
> > > > The reason FREE_PAGE_HINT does not free up pages until we have finished
> > > > iterating over the free list is not a hypervisor API requirement. The reason is we
> > > > don't want to keep getting the same address over and over again.
> > > >
> > > > > I would prefer to avoid that as I prefer to simply
> > > > > notify the host of a fixed block of pages at a time and let it process
> > > > > without having to have a thread on each side actively pushing pages,
> > > > > or listening for the incoming pages.
> > > >
> > > > Right. And FREE_PAGE_HINT can go even further. It can push a page and
> > > > let linux use it immediately. It does not even need to wait for host to
> > > > process anything unless the VQ gets full.
> > >
> > > If it is doing what you are saying it will be corrupting memory.
> >
> > No and that is hypervisor's responsibility.
> >
> > I think you are missing part of the picture here.
> >
> > Here is a valid implementation:
> >
> > Before asking for hints, hypervisor write-protects all memory, and logs
> > all write faults. When hypervisor gets the hint, if page has since been
> > modified, the hint is ignored.
>
> No here is the part where I think you missed the point. I was already
> aware of this. So my concern is this scenario.
>
> If you put a hint on the VQ and then free the memory back to the
> guest, what about the scenario where another process could allocate
> the memory and dirty it before we process the hint request on the
> host? In that case the page was dirtied, the hypervisor will have
> correctly write faulted and dirtied it, and then we came though and
> incorrectly marked it as being free. That is the scenario I am worried
> about as I am pretty certain that leads to memory corruption.

It would for sure. There are actually two dirty bit data structures.
One is maintained by KVM, I'd like to call it a "write log" here.
the other is maintained by qemu, that's the "dirty bitmap".

sync is the step where we atomically copy write log to dirty
bitmap and write-protect memory.
It works like this in theory:

sync

command id ++

request hints from guest with command id

XXX->

get hint - if command id matches - clear dirty bitmap bit

sync


The code underwent enough changes that I couldn't
easily verify that's still the case, but it was
very clear originally :)

Can you see how, if a hint crosses a sync, then
it has a different command id and so is ignored?
And if not, then writes are logged.

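
The ordering argument above can be checked with a toy model (Python,
illustrative only; all names here are mine, not QEMU's or KVM's):

```python
class Migration:
    """Toy model of the write-log / dirty-bitmap / command-id dance."""

    def __init__(self):
        self.write_log = set()     # "KVM" side: pages written since last sync
        self.dirty_bitmap = set()  # "qemu" side: pages still to migrate
        self.cmd_id = 0

    def guest_write(self, pfn):
        # memory is write-protected, so every write faults and is logged
        self.write_log.add(pfn)

    def sync(self):
        # atomically copy the write log into the dirty bitmap, re-protect
        self.dirty_bitmap |= self.write_log
        self.write_log.clear()

    def request_hints(self):
        # each reporting round gets a fresh command id
        self.cmd_id += 1
        return self.cmd_id

    def get_hint(self, pfn, cmd_id):
        # a hint only clears the bitmap bit if its command id is current
        if cmd_id == self.cmd_id:
            self.dirty_bitmap.discard(pfn)


m = Migration()
m.guest_write(7)
m.sync()                 # pfn 7 is now in the dirty bitmap
cid = m.request_hints()
m.get_hint(7, cid)       # page was free after the request: safe to drop
assert 7 not in m.dirty_bitmap

m.guest_write(7)         # guest reuses the page and writes to it
m.sync()                 # the next sync re-dirties it
m.request_hints()        # the next reporting round bumps the command id
m.get_hint(7, cid)       # stale hint crossing the sync: ignored
assert 7 in m.dirty_bitmap
```

The second half is the key case from the discussion: a hint that crosses
a sync carries an old command id, so it cannot wrongly clear a bit that a
later write re-set.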

>
> >
> > > At a
> > > minimum it has to wait until the page has been processed and the dirty
> > > bit cleared before it can let linux use it again. It is all a matter
> > > of keeping the dirty bit coherent. If we let linux use it again
> > > immediately and then cleared the dirty bit we would open up a possible
> > > data corruption race during migration as a dirty page might not be
> > > marked as such.
> >
> > I think you are talking about the dirty bit on the host, right?
> >
> > The implication is that calling MADV_FREE from qemu would
> > not be a good implementation of FREE_PAGE_HINT.
> > And indeed, as far as I can see it does nothing of the sort.
>
> I don't mean the dirty bit on the host, I am talking about the bitmap
> used to determine which pages need to be migrated. That is what this
> hint is updating and it is also being tracked via the write protection
> of the pages at the start of migration.
>
> My concern is that we can end up losing track of pages that are
> updated if we are hinting after they have been freed back to the guest
> for reallocation.
>
> > > > >
> > > > > > > > > The basic idea with the bubble hinting was to essentially create mini
> > > > > > > > > balloons. As such I had based the code off of the balloon inflation
> > > > > > > > > code. The only spot where it really differs is that I needed the
> > > > > > > > > ability to pass higher order pages so I tweaked things and passed
> > > > > > > > > "hints" instead of "pfns".
> > > > > > > >
> > > > > > > > And that is fine. But there isn't really such a big difference with
> > > > > > > > FREE_PAGE_HINT except FREE_PAGE_HINT triggers upon host request and not
> > > > > > > > in response to guest load.
> > > > > > >
> > > > > > > I disagree, I believe there is a significant difference.
> > > > > >
> > > > > > Yes there is, I just don't think it's in the iteration.
> > > > > > The iteration seems to be useful to hinting.
> > > > >
> > > > > I agree that iteration is useful to hinting. The problem is the
> > > > > FREE_PAGE_HINT code isn't really designed to be iterative. It is
> > > > > designed to run with a polling thread on each side and it is meant to
> > > > > be run to completion.
> > > >
> > > > Absolutely. But that's a bug I think.
> > >
> > > I think it is a part of the design. Basically in order to avoid
> > > corrupting memory it cannot return the page to the guest kernel until
> > > it has finished clearing the dirty bits associated with the pages.
> >
> > OK I hope I clarified by that's not supposed to be the case.
>
> I think you might have missed something. I am pretty certain issues
> are still present.
>
> > > > > > > The
> > > > > > > FREE_PAGE_HINT code was implemented to be more of a streaming
> > > > > > > interface.
> > > > > >
> > > > > > It's implemented like this but it does not follow from
> > > > > > the interface. The implementation is a combination of
> > > > > > attempts to minimize # of exits and minimize mm core changes.
> > > > >
> > > > > The problem is the interface doesn't have a good way of indicating
> > > > > that it is done with a block of pages.
> > > > >
> > > > > So what I am probably looking at if I do a sg implementation for my
> > > > > hinting is to provide one large sg block for all 32 of the pages I
> > > > > might be holding.
> > > >
> > > > Right now if you pass an sg it will try to allocate a buffer
> > > > on demand for you. If this is a problem I could come up
> > > > with a new API that lets caller allocate the buffer.
> > > > Let me know.
> > > >
> > > > > I'm assuming that will still be processed as one
> > > > > contiguous block. With that I can then at least maintain a single
> > > > > response per request.
> > > >
> > > > Why do you care? Won't a counter of outstanding pages be enough?
> > > > Down the road maybe we could actually try to pipeline
> > > > things a bit. So send 32 pages; once you get 16 of these back,
> > > > send 16 more. Better for SMP configs and does not hurt
> > > > non-SMP too much. I am not saying we need to do it right away though.
> > >
> > > So the big thing is we cannot give the page back to the guest kernel
> > > until we know the processing has been completed. In the case of the
> > > MADV_DONTNEED call it will zero out the entire page on the next
> > > access. If the guest kernel had already written data by the time we
> > > get to that it would cause a data corruption and kill the whole guest.
> >
> >
> > Exactly but FREE_PAGE_HINT does not cause qemu to call MADV_DONTNEED.
>
> No, instead it clears the bit indicating that the page is supposed to
> be migrated. The effect will not be all that different, just delayed
> until the VM is actually migrated.
>
> > > > > > > This is one of the things Linus kept complaining about in
> > > > > > > his comments. This code attempts to pull in ALL of the higher order
> > > > > > > pages, not just a smaller block of them.
> > > > > >
> > > > > > It wants to report all higher order pages eventually, yes.
> > > > > > But it's absolutely fine to report a chunk and then wait
> > > > > > for host to process the chunk before reporting more.
> > > > > >
> > > > > > However, interfaces we came up with for this would call
> > > > > > into virtio with a bunch of locks taken.
> > > > > > The solution was to take pages off the free list completely.
> > > > > > That in turn means we can't return them until
> > > > > > we have processed all free memory.
> > > > >
> > > > > I get that. The problem is the interface is designed around run to
> > > > > completion. For example it will sit there in a busy loop waiting for a
> > > > > free buffer because it knows the other side is supposed to be
> > > > > processing the pages already.
> > > >
> > > > I didn't get this part.
> > >
> > > I think the part you may not be getting is that we cannot let the
> > > guest use the page until the hint has been processed. Otherwise we
> > > risk corrupting memory. That is the piece that has me paranoid. If we
> > > end up performing a hint on a page that is in use somewhere in the kernel
> > > it will corrupt memory one way or another. That is the thing I have to
> > > avoid at all cost.
> >
> > You have to do it, sure. And that is because you do not
> > assume that hypervisor does it for you. But FREE_PAGE_HINT doesn't,
> > hypervisor takes care of that.
>
> Sort of. The hypervisor is trying to do dirty page tracking, however
> the FREE_PAGE_HINT interferes with that. That is the problem. If we
> get that out of order then the hypervisor work will be undone and we
> just make a mess of memory.
>
> > > That is why I have to have a way to know exactly which pages have been
> > > processed and which haven't before I return pages to the guest.
> > > Otherwise I am just corrupting memory.
> >
> > Sure. That isn't really hard though.
>
> Agreed.
>
> > >
> > > > > > > Honestly the difference is
> > > > > > > mostly in the hypervisor interface than what is needed for the kernel
> > > > > > > interface, however the design of the hypervisor interface would make
> > > > > > > doing things more incrementally much more difficult.
> > > > > >
> > > > > > OK that's interesting. The hypervisor interface is not
> > > > > > documented in the spec yet. Let me take a stab at a writeup now. So:
> > > > > >
> > > > > >
> > > > > >
> > > > > > - hypervisor requests reporting by modifying command ID
> > > > > > field in config space, and interrupting guest
> > > > > >
> > > > > > - in response, guest sends the command ID value on a special
> > > > > > free page hinting VQ,
> > > > > > followed by any number of buffers. Each buffer is assumed
> > > > > > to be the address and length of memory that was
> > > > > > unused *at some point after the time when command ID was sent*.
> > > > > >
> > > > > > Note that hypervisor takes pains to handle the case
> > > > > > where memory is actually no longer free by the time
> > > > > > it gets the memory.
> > > > > > This allows guest driver to take more liberties
> > > > > > and free pages without waiting for guest to
> > > > > > use the buffers.
> > > > > >
> > > > > > This is also one of the reason we call this a free page hint -
> > > > > > the guarantee that page is free is a weak one,
> > > > > > in that sense it's more of a hint than a promise.
> > > > > > That helps guarantee we don't create OOM out of the blue.
> > > >
> > > > I would like to stress the last paragraph above.
> > >
> > > The problem is we don't want to give bad hints. What we do based on
> > > the hint is clear the dirty bit. If we clear it in err when the page
> > > is actually in use it will lead to data corruption after migration.
> >
> > That's true for your patches. I get that.
>
> No, it should be true for FREE_PAGE_HINT as well. The fact that it
> isn't is a bug as far as I am concerned. If you are doing dirty page
> tracking in the hypervisor you cannot expect it to behave well if the
> guest is providing it with bad data.
>
> > > The idea with the hint is that you are saying the page is currently
> > > not in use, however if you send that hint late and have already freed
> > > the page back you can corrupt memory.
> >
> >
> > That part is I think wrong - assuming "you" means upstream code.
>
> Yes, I am referring to someone running FREE_PAGE_HINT code. I usually
> try to replace them with "we" to make it clear I am not talking about
> someone personally, it is a bad habit.
>
> > > > > >
> > > > > > - guest eventually sends a special buffer signalling to
> > > > > > host that it's done sending free pages.
> > > > > > It then stops reporting until command id changes.
> > > > >
> > > > > The pages are not freed back to the guest until the host reports that
> > > > > it is "DONE" via a configuration change. Doing that stops any further
> > > > > progress, and attempting to resume will just restart from the
> > > > > beginning.
> > > >
> > > > Right but it's not a requirement. Host does not assume this at all.
> > > > It's done like this simply because we can't iterate over pages
> > > > with the existing API.
> > >
> > > The problem is nothing about the implementation was designed for
> > > iteration. What I would have to do is likely gut and rewrite the
> > > entire guest side of the FREE_PAGE_HINT code in order to make it work
> > > iteratively.
> >
> >
> > Right. I agree.
> >
> > > As I mentioned it would probably have to look more like a
> > > NIC Rx ring in handling because we would have to have some sort of way
> > > to associate the pages 1:1 to the buffers.
> > >
> > > > > The big piece this design is missing is the incremental notification
> > > > > pages have been processed. The existing code just fills the vq with
> > > > > pages and keeps doing it until it cannot allocate any more pages. We
> > > > > would have to add logic to stop, flush, and resume to the existing
> > > > > framework.
> > > >
> > > > But not to the hypervisor interface. Hypervisor is fine
> > > > with pages being reused immediately. In fact, even before they
> > > > are processed.
> > >
> > > I don't think that is actually the case. If it does that I am pretty
> > > sure it will corrupt memory during migration.
> > >
> > > Take a look at qemu_guest_free_page_hint:
> > > https://github.com/qemu/qemu/blob/master/migration/ram.c#L3342
> > >
> > > I'm pretty sure that code is going in and clearing the dirty bitmap
> > > for memory.
> >
> > Yes it does. However the trick is that meanwhile
> > kvm is logging new writes. So the bitmap that
> > is being cleared is the bitmap that was logged before the request
> > was sent to guest.
> >
> > > If we were to allow a page to be allocated and used and
> > > then perform the hint it is going to introduce a race where the page
> > > might be missed for migration and could result in memory corruption.
> >
> > commit c13c4153f76db23cac06a12044bf4dd346764059 has this explanation:
> >
> > Note: balloon will report pages which were free at the time of this call.
> > As the reporting happens asynchronously, dirty bit logging must be
> > enabled before this free_page_start call is made. Guest reporting must be
> > disabled before the migration dirty bitmap is synchronized.
> >
> > but over multiple iterations this seems to have been dropped
> > from code comments. Wei, would you mind going back
> > and documenting the APIs you used?
> > They seem to be causing confusion ...
>
> The "Note" is the behavior I am seeing. Specifically there is nothing
> in place to prevent the freed pages from causing corruption if they
> are freed before being hinted. The requirement should be that they
> cannot be freed until after they are hinted that way the dirty bit
> logging will mark the page as dirty if it is accessed AFTER being
> hinted.
>
> If you do not guarantee the hinting has happened first you could end
> up logging the dirty bit before the hint is processed and then clear
> the dirty bit due to the hint. It is pretty straightforward to
> resolve by just not putting the page into the balloon until after the
> hint has been processed.
>
> > >
> > > > > > - host can restart the process at any time by
> > > > > > updating command ID. That will make guest stop
> > > > > > and start from the beginning.
> > > > > >
> > > > > > - host can also stop the process by specifying a special
> > > > > > command ID value.
> > > > > >
> > > > > >
> > > > > > =========
> > > > > >
> > > > > >
> > > > > > Now let's compare to what you have here:
> > > > > >
> > > > > > - At any time after boot, guest walks over free memory and sends
> > > > > > addresses as buffers to the host
> > > > > >
> > > > > > - Memory reported is then guaranteed to be unused
> > > > > > until host has used the buffers
> > > > > >
> > > > > >
> > > > > > Is above a fair summary?
> > > > > >
> > > > > > So yes there's a difference but the specific bit of chunking is same
> > > > > > imho.
> > > > >
> > > > > The big difference is that I am returning the pages after they are
> > > > > processed, while FREE_PAGE_HINT doesn't and isn't designed to.
> > > >
> > > > It doesn't but the hypervisor *is* designed to support that.
> > >
> > > Not really, it seems like it is more just a side effect of things.
> >
> > I hope the commit log above is enough to convince you we did
> > think about this.
>
> Sorry, but no. I think the "note" convinced me there is a race
> condition, specifically in the shrinker case. We cannot free the page
> back to host memory until the hint has been processed, otherwise we
> will race with the dirty bit logging.
>
> > > Also as I mentioned before I am also not a huge fan of polling on both
> > > sides as it is just going to burn through CPU. If we are iterative and
> > > polling it is going to end up with us potentially pushing one CPU at
> > > 100%, and if the one CPU doing the polling cannot keep up with the
> > > page updates coming from the other CPUs we would be stuck in that
> > > state for a while. I would have preferred to see something where the
> > > CPU would at least allow other tasks to occur while it is waiting for
> > > buffers to be returned by the host.
> >
> > You lost me here. What does polling have to do with it?
>
> This is just another issue I found. Specifically busy polling while
> waiting on the host to process the hints. I'm not a fan of it and was
> just pointing it out.
>
> > > > > The
> > > > > problem is the interface doesn't allow for a good way to identify that
> > > > > any given block of pages has been processed and can be returned.
> > > >
> > > > And that's because FREE_PAGE_HINT does not care.
> > > > It can return any page at any point even before hypervisor
> > > > saw it.
> > >
> > > I disagree, see my comment above.
> >
> > OK let's see if above is enough to convince you. Or maybe we
> > have a bug when shrinker is invoked :) But I don't think so.
>
> I'm pretty sure there is a bug.
>
> > > > > Instead pages go in, but they don't come out until the configuration
> > > > > is changed and "DONE" is reported. The act of reporting "DONE" will
> > > > > reset things and start them all over which kind of defeats the point.
> > > >
> > > > Right.
> > > >
> > > > But if you consider how we are using the shrinker you will
> > > > see that it's kind of broken.
> > > > For example not keeping track of allocated
> > > > pages means the count we return is broken
> > > > while reporting is active.
> > > >
> > > > I looked at fixing it but really if we can just
> > > > stop allocating memory that would be way cleaner.
> > >
> > > Agreed. If we hit an OOM we should probably just stop the free page
> > > hinting and treat that as the equivalent to an allocation failure.
> >
> > And fix the shrinker count to include the pages in the vq. Yea.
>
> I don't know if we really want to touch the pages in the VQ. I would
> say that we should leave them alone.
>
> > >
> > > As-is I think this also has the potential for corrupting memory since
> > > it will likely be returning the most recent pages added to the balloon
> > > so the pages are likely still on the processing queue.
> >
> > That part is fine I think because of the above.
> >
> > >
> > > > For example we allocate pages until shrinker kicks in.
> > > > Fair enough but in fact maybe it would be better to
> > > > do the reverse: trigger shrinker and then send as many
> > > > free pages as we can to host.
> > >
> > > I'm not sure I understand this last part.
> >
> > Oh basically what I am saying is this: one of the reasons to use page
> > hinting is when host is short on memory. In that case, why don't we use
> > shrinker to ask kernel drivers to free up memory? Any memory freed could
> > then be reported to host.
>
> Didn't the balloon driver already have a feature like that where it
> could start shrinking memory if the host was under memory pressure? If
> so how would adding another one add much value.
>
> The idea here is if the memory is free we just mark it as such. As
> long as we can do so with no noticeable overhead on the guest or host
> why not just do it?

2019-07-18 20:25:05

by Michael S. Tsirkin

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 08:34:37AM -0700, Alexander Duyck wrote:
> > > > For example we allocate pages until shrinker kicks in.
> > > > Fair enough but in fact maybe it would be better to
> > > > do the reverse: trigger shrinker and then send as many
> > > > free pages as we can to host.
> > >
> > > I'm not sure I understand this last part.
> >
> > Oh basically what I am saying is this: one of the reasons to use page
> > hinting is when host is short on memory. In that case, why don't we use
> > shrinker to ask kernel drivers to free up memory? Any memory freed could
> > then be reported to host.
>
> Didn't the balloon driver already have a feature like that where it
> could start shrinking memory if the host was under memory pressure? If
> > so how would adding another one add much value?

Well fundamentally the basic balloon inflate kind of does this, yes :)

The difference with what I am suggesting is that balloon inflate tries
to aggressively achieve a specific goal of freed memory. We could have a
weaker "free as much as you can" that is still stronger than free page
hint which as you point out below does not try to free at all, just
hints what is already free.


> The idea here is if the memory is free we just mark it as such. As
> long as we can do so with no noticeable overhead on the guest or host
> why not just do it?

2019-07-18 20:29:30

by Michael S. Tsirkin

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 12:03:23PM -0400, Nitesh Narayan Lal wrote:
> >>>> For example we allocate pages until shrinker kicks in.
> >>>> Fair enough but in fact maybe it would be better to
> >>>> do the reverse: trigger shrinker and then send as many
> >>>> free pages as we can to host.
> >>> I'm not sure I understand this last part.
> >> Oh basically what I am saying is this: one of the reasons to use page
> >> hinting is when host is short on memory. In that case, why don't we use
> >> shrinker to ask kernel drivers to free up memory? Any memory freed could
> >> then be reported to host.
> > Didn't the balloon driver already have a feature like that where it
> > could start shrinking memory if the host was under memory pressure?
> If you are referring to auto-ballooning (I don't think it is merged), it
> has its own set of disadvantages: it could easily lead to OOM,
> memory corruption, and so on.

Right. So what I am saying is: we could have a flag that triggers a
shrinker once before sending memory hints.
Worth considering.

--
MST

2019-07-18 20:31:24

by Alexander Duyck

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 9:07 AM Michael S. Tsirkin <[email protected]> wrote:
>
> On Thu, Jul 18, 2019 at 08:34:37AM -0700, Alexander Duyck wrote:
> > On Wed, Jul 17, 2019 at 10:14 PM Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > On Wed, Jul 17, 2019 at 09:43:52AM -0700, Alexander Duyck wrote:
> > > > On Wed, Jul 17, 2019 at 3:28 AM Michael S. Tsirkin <[email protected]> wrote:
> > > > >
> > > > > On Tue, Jul 16, 2019 at 02:06:59PM -0700, Alexander Duyck wrote:
> > > > > > On Tue, Jul 16, 2019 at 10:41 AM Michael S. Tsirkin <[email protected]> wrote:
> > > > > >
> > > > > > <snip>
> > > > > >
> > > > > > > > > This is what I am saying. Having watched that patchset being developed,
> > > > > > > > > I think that's simply because processing blocks required mm core
> > > > > > > > > changes, which Wei was not up to pushing through.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If we did
> > > > > > > > >
> > > > > > > > > while (1) {
> > > > > > > > > alloc_pages
> > > > > > > > > add_buf
> > > > > > > > > get_buf
> > > > > > > > > free_pages
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > We'd end up passing the same page to balloon again and again.
> > > > > > > > >
> > > > > > > > > So we end up reserving lots of memory with alloc_pages instead.
> > > > > > > > >
> > > > > > > > > What I am saying is that now that you are developing
> > > > > > > > > infrastructure to iterate over free pages,
> > > > > > > > > FREE_PAGE_HINT should be able to use it too.
> > > > > > > > > Whether that's possible might be a good indication of
> > > > > > > > > whether the new mm APIs make sense.
> > > > > > > >
> > > > > > > > The problem is the infrastructure as implemented isn't designed to do
> > > > > > > > that. I am pretty certain this interface will have issues with being
> > > > > > > > given small blocks to process at a time.
> > > > > > > >
> > > > > > > > Basically the design for the FREE_PAGE_HINT feature doesn't really
> > > > > > > > have the concept of doing things a bit at a time. It is either
> > > > > > > > filling, stopped, or done. From what I can tell it requires a
> > > > > > > > configuration change for the virtio balloon interface to toggle
> > > > > > > > between those states.
> > > > > > >
> > > > > > > Maybe I misunderstand what you are saying.
> > > > > > >
> > > > > > > Filling state can definitely report things
> > > > > > > a bit at a time. It does not assume that
> > > > > > > all of guest free memory can fit in a VQ.
> > > > > >
> > > > > > I think where you and I may differ is that you are okay with just
> > > > > > pulling pages until you hit OOM, or allocation failures. Do I have
> > > > > > that right?
> > > > >
> > > > > This is exactly what the current code does. But that's an implementation
> > > > > detail which came about because we failed to find any other way to
> > > > > iterate over free blocks.
> > > >
> > > > I get that. However my concern is that it permeated other areas of the
> > > > implementation and makes taking another approach much more difficult
> > > > than it needs to be.
> > >
> > > Implementation would have to change to use an iterator obviously. But I don't see
> > > that it leaked out to a hypervisor interface.
> > >
> > > In fact take a look at virtio_balloon_shrinker_scan
> > > and you will see that it calls shrink_free_pages
> > > without waiting for the device at all.
> >
> > Yes, and in case you missed it earlier I am pretty sure that leads to
> > possible memory corruption. I don't think it was tested enough to be
> > able to say that is safe.
>
> More testing would be good, for sure.
>
> > Specifically we cannot be clearing the dirty flag on pages that are in
> > use. We should only be clearing that flag for pages that are
> > guaranteed to not be in use.
>
> I think that clearing the dirty flag is safe if the flag was originally
> set and the page has been
> write-protected before reporting was requested.
> In that case we know that page has not been changed.
> Right?

I am just going to drop the rest of this thread as I agree we have
been running ourselves around in circles. The part I had missed was
the part where there are 2 bitmaps and that you are using
migration_bitmap_sync_precopy() to align the two.

This is just running at the same time as the precopy code and is only
really meant to try and clear the bit before the precopy gets to it
from what I can tell.

So one thing that is still an issue then is that my approach would
only work on the first migration. The problem is the logic I have
implemented assumes that once we have hinted on a page we don't need
to do it again. However in order to support migration you would need
to reset the hinting entirely and start over again after doing a
migration.

2019-07-18 20:35:01

by Alexander Duyck

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 1:24 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Thu, Jul 18, 2019 at 08:34:37AM -0700, Alexander Duyck wrote:
> > > > > For example we allocate pages until shrinker kicks in.
> > > > > Fair enough but in fact maybe it would be better to
> > > > > do the reverse: trigger shrinker and then send as many
> > > > > free pages as we can to host.
> > > >
> > > > I'm not sure I understand this last part.
> > >
> > > Oh basically what I am saying is this: one of the reasons to use page
> > > hinting is when host is short on memory. In that case, why don't we use
> > > shrinker to ask kernel drivers to free up memory? Any memory freed could
> > > then be reported to host.
> >
> > Didn't the balloon driver already have a feature like that where it
> > could start shrinking memory if the host was under memory pressure? If
> > so how would adding another one add much value?
>
> Well fundamentally the basic balloon inflate kind of does this, yes :)
>
> The difference with what I am suggesting is that balloon inflate tries
> to aggressively achieve a specific goal of freed memory. We could have a
> weaker "free as much as you can" that is still stronger than free page
> hint which as you point out below does not try to free at all, just
> hints what is already free.

Yes, but why wait until the host is low on memory? With my
implementation we can perform the hints in the background for a low
cost already. So why should we wait to free up memory when we could do
it immediately? Why let things get to the state where the host is
under memory pressure when the guests can proactively free up
the pages and improve performance as a result by reducing swap
usage?

2019-07-18 20:39:52

by Michael S. Tsirkin

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 01:29:14PM -0700, Alexander Duyck wrote:
> So one thing that is still an issue then is that my approach would
> only work on the first migration. The problem is the logic I have
> implemented assumes that once we have hinted on a page we don't need
> to do it again. However in order to support migration you would need
> to reset the hinting entirely and start over again after doing a
> migration.

Well with precopy at least it's simple: just clear the
dirty bit, it won't be sent, and then on destination
you get a zero page and later COW on first write.
Right?

With postcopy it is trickier as destination waits until it gets
all of memory. I think we could use some trick to
make source pretend it's a zero page, that is cheap to send.

--
MST

2019-07-18 20:51:10

by Michael S. Tsirkin

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 01:34:03PM -0700, Alexander Duyck wrote:
> On Thu, Jul 18, 2019 at 1:24 PM Michael S. Tsirkin <[email protected]> wrote:
> >
> > On Thu, Jul 18, 2019 at 08:34:37AM -0700, Alexander Duyck wrote:
> > > > > > For example we allocate pages until shrinker kicks in.
> > > > > > Fair enough but in fact maybe it would be better to
> > > > > > do the reverse: trigger shrinker and then send as many
> > > > > > free pages as we can to host.
> > > > >
> > > > > I'm not sure I understand this last part.
> > > >
> > > > Oh basically what I am saying is this: one of the reasons to use page
> > > > hinting is when host is short on memory. In that case, why don't we use
> > > > shrinker to ask kernel drivers to free up memory? Any memory freed could
> > > > then be reported to host.
> > >
> > > Didn't the balloon driver already have a feature like that where it
> > > could start shrinking memory if the host was under memory pressure? If
> > > so how would adding another one add much value?
> >
> > Well fundamentally the basic balloon inflate kind of does this, yes :)
> >
> > The difference with what I am suggesting is that balloon inflate tries
> > to aggressively achieve a specific goal of freed memory. We could have a
> > weaker "free as much as you can" that is still stronger than free page
> > hint which as you point out below does not try to free at all, just
> > hints what is already free.
>
> Yes, but why wait until the host is low on memory?

It can come about for a variety of reasons, such as
other VMs being aggressive, or ours aggressively caching
stuff in memory.

> With my
> implementation we can perform the hints in the background for a low
> cost already. So why should we wait to free up memory when we could do
> it immediately? Why let things get to the state where the host is
> under memory pressure when the guests can proactively free up
> the pages and improve performance as a result by reducing swap
> usage?

You are talking about sending free memory to host.
Fair enough but if you have drivers that aggressively
allocate memory then there won't be that much free guest
memory without invoking a shrinker.

--
MST

2019-07-18 20:55:04

by Alexander Duyck

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 1:37 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Thu, Jul 18, 2019 at 01:29:14PM -0700, Alexander Duyck wrote:
> > So one thing that is still an issue then is that my approach would
> > only work on the first migration. The problem is the logic I have
> > implemented assumes that once we have hinted on a page we don't need
> > to do it again. However in order to support migration you would need
> > to reset the hinting entirely and start over again after doing a
> > migration.
>
> Well with precopy at least it's simple: just clear the
> dirty bit, it won't be sent, and then on destination
> you get a zero page and later COW on first write.
> Right?

Are you talking about adding MADV_DONTNEED functionality to FREE_PAGE_HINTS?

> With postcopy it is trickier as destination waits until it gets
> all of memory. I think we could use some trick to
> make source pretend it's a zero page, that is cheap to send.

So I am confused again.

What I was getting at is that if I am not mistaken block->bmap is set
to all 1s for each page in ram_list_init_bitmaps(). After that the
precopy starts and begins moving memory over. We need to be able to go
in and hint away all the free pages from that initial bitmap. To do
that we would need to have the "Hinted" flag I added in the patch set
cleared for all pages, and then go through all free memory and start
over in order to hint on which pages are actually free. Otherwise all
we are doing is hinting on which pages have been freed since the last
round of hints.

Essentially this is another case where being incremental is
problematic for this design. What I would need to do is reset the
"Hinted" flag in all of the free pages after the migration has been
completed.

2019-07-18 21:09:58

by Alexander Duyck

Subject: Re: [PATCH v1 6/6] virtio-balloon: Add support for aerating memory via hinting

On Thu, Jul 18, 2019 at 1:49 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Thu, Jul 18, 2019 at 01:34:03PM -0700, Alexander Duyck wrote:
> > On Thu, Jul 18, 2019 at 1:24 PM Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > On Thu, Jul 18, 2019 at 08:34:37AM -0700, Alexander Duyck wrote:
> > > > > > > For example we allocate pages until shrinker kicks in.
> > > > > > > Fair enough but in fact maybe it would be better to
> > > > > > > do the reverse: trigger shrinker and then send as many
> > > > > > > free pages as we can to host.
> > > > > >
> > > > > > I'm not sure I understand this last part.
> > > > >
> > > > > Oh basically what I am saying is this: one of the reasons to use page
> > > > > hinting is when host is short on memory. In that case, why don't we use
> > > > > shrinker to ask kernel drivers to free up memory? Any memory freed could
> > > > > then be reported to host.
> > > >
> > > > Didn't the balloon driver already have a feature like that where it
> > > > could start shrinking memory if the host was under memory pressure? If
> > > > so how would adding another one add much value?
> > >
> > > Well fundamentally the basic balloon inflate kind of does this, yes :)
> > >
> > > The difference with what I am suggesting is that balloon inflate tries
> > > to aggressively achieve a specific goal of freed memory. We could have a
> > > weaker "free as much as you can" that is still stronger than free page
> > > hint which as you point out below does not try to free at all, just
> > > hints what is already free.
> >
> > Yes, but why wait until the host is low on memory?
>
> It can come about for a variety of reasons, such as
> other VMs being aggressive, or ours aggressively caching
> stuff in memory.
>
> > With my
> > implementation we can perform the hints in the background for a low
> > cost already. So why should we wait to free up memory when we could do
> > it immediately? Why let things get to the state where the host is
> > under memory pressure when the guests can proactively free up
> > the pages and improve performance as a result by reducing swap
> > usage?
>
> You are talking about sending free memory to host.
> Fair enough but if you have drivers that aggressively
> allocate memory then there won't be that much free guest
> memory without invoking a shrinker.

So then what we really need is a way for the host to trigger the
shrinker via a call to drop_slab() on the guest, don't we? Then we
could automatically hint the free pages to the host.