LinuxLists.cc - [PATCH v9 0/8] stg mail -e --version=v9 \

2019-09-09 03:11:03

Subject: [PATCH v9 0/8] stg mail -e --version=v9 \

[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
[email protected] \
01-mm-add-per-cpu-logic-to-page..08-virtio-balloon-add-support-for

mm / virtio: Provide support for unused page reporting

This series provides an asynchronous means of reporting to a hypervisor
that a guest page is no longer in use and can have the data associated
with it dropped. To do this I have implemented functionality that allows
for what I am referring to as unused page reporting

The functionality for this is fairly simple. When enabled it will allocate
statistics to track the number of reported pages in a given free area.
When the number of free pages exceeds this value plus a high water value,
currently 32, it will begin performing page reporting which consists of
pulling pages off of free list and placing them into a scatter list. The
scatterlist is then given to the page reporting device and it will perform
the required action to make the pages "reported", in the case of
virtio-balloon this results in the pages being madvised as MADV_DONTNEED
and as such they are forced out of the guest. After this they are placed
back on the free list, and an additional bit is added if they are not
merged indicating that they are a reported buddy page instead of a
standard buddy page. The cycle then repeats with additional non-reported
pages being pulled until the free areas all consist of reported pages.

I am leaving a number of things hard-coded such as limiting the lowest
order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
determine what the limit is on how many pages it wants to allocate to
process the hints. The upper limit for this is based on the size of the
queue used to store the scattergather list.

My primary testing has just been to verify the memory is being freed after
allocation by running memhog 40g on a 40g guest and watching the total
free memory via /proc/meminfo on the host. With this I have verified most
of the memory is freed after each iteration. As far as performance I have
been mainly focusing on the will-it-scale/page_fault1 test running with
16 vcpus. I have modified it to use Transparent Huge Pages. With this I
see almost no difference, -0.08%, with the patches applied and the feature
disabled. I see a regression of -0.86% with the feature enabled, but the
madvise disabled in the hypervisor due to a device being assigned. With
the feature fully enabled I see a regression of -3.27% versus the baseline
without these patches applied. In my testing I found that most of the
overhead was due to the page zeroing that comes as a result of the pages
having to be faulted back into the guest.

One side effect of these patches is that the guest becomes much more
resilient in terms of NUMA locality. With the pages being freed and then
reallocated when used it allows for the pages to be much closer to the
active thread, and as a result there can be situations where this patch
set will out-perform the stock kernel when the guest memory is not local
to the guest vCPUs. To avoid that in my testing I set the affinity of all
the vCPUs and QEMU instance to the same node.

I have not included the QEMU patches with this set as they haven't really
changed in the last several revisions. If needed they can be located with
the v8 patchset.

Changes from the RFC:
https://lore.kernel.org/lkml/[email protected]/
Moved aeration requested flag out of aerator and into zone->flags.
Moved boundary out of free_area and into local variables for aeration.
Moved aeration cycle out of interrupt and into workqueue.
Left nr_free as total pages instead of splitting it between raw and aerated.
Combined size and physical address values in virtio ring into one 64b value.

Changes from v1:
https://lore.kernel.org/lkml/[email protected]/
Dropped "waste page treatment" in favor of "page hinting"
Renamed files and functions from "aeration" to "page_hinting"
Moved from page->lru list to scatterlist
Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
Virtio now uses scatterlist directly instead of intermediate array
Moved stats out of free_area, now in separate area and pointed to from zone
Merged patch 5 into patch 4 to improve review-ability
Updated various code comments throughout

Changes from v2:
https://lore.kernel.org/lkml/[email protected]/
Dropped "page hinting" in favor of "page reporting"
Renamed files from "hinting" to "reporting"
Replaced "Hinted" page type with "Reported" page flag
Added support for page poisoning while hinting is active
Add QEMU patch that implements PAGE_POISON feature

Changes from v3:
https://lore.kernel.org/lkml/[email protected]/
Added mutex lock around page reporting startup and shutdown
Fixed reference to "page aeration" in patch 2
Split page reporting function bit out into separate QEMU patch
Limited capacity of scatterlist to vq size - 1 instead of vq size
Added exception handling for case of virtio descriptor allocation failure

Changes from v4:
https://lore.kernel.org/lkml/[email protected]/
Replaced spin_(un)lock with spin_(un)lock_irq in page_reporting_cycle()
Dropped if/continue for ternary operator in page_reporting_process()
Added checks for isolate and cma types to for_each_reporting_migratetype_order
Added virtio-dev, Michal Hocko, and Oscar Salvador to to:/cc:
Rebased on latest linux-next and QEMU git trees

Changes from v5:
https://lore.kernel.org/lkml/[email protected]/
Replaced spin_(un)lock with spin_(un)lock_irq in page_reporting_startup()
Updated shuffle code to use "shuffle_pick_tail" and updated patch description
Dropped storage of order and migratettype while page is being reported
Used get_pfnblock_migratetype to determine migratetype of page
Renamed put_reported_page to free_reported_page, added order as argument
Dropped check for CMA type as I believe we should be reporting those
Added code to allow moving of reported pages into and out of isolation
Defined page reporting order as minimum of Huge Page size vs MAX_ORDER - 1
Cleaned up use of static branch usage for page_reporting_notify_enabled

Changes from v6:
https://lore.kernel.org/lkml/[email protected]/
Rebased on linux-next for 20190903
Added jump label to __page_reporting_request so we release RCU read lock
Removed "- 1" from capacity limit based on virtio ring
Added code to verify capacity is non-zero or return error on startup

Changes from v7:
https://lore.kernel.org/lkml/[email protected]/
Updated poison fixes to clear flag if "nosanity" is enabled in kernel config
Split shuffle per-cpu optimization into separate patch
Moved check for !phdev->capacity into reporting patch where it belongs
Added Reviewed-by tags received for v7

Changes from v8:
https://lore.kernel.org/lkml/[email protected]/
Added patch that moves HPAGE_SIZE definition for ARM64 to match other archs
Switch back to using pageblock_order instead of HUGETLB_ORDER and MAX_ORDER - 1
Boundary allocation now dynamic to support HUGETLB_PAGE_SIZE_VARIABLE option
Made use of existing code/functions to reduce size of move_to_boundary function
Dropped unused zone pointer from add_to/del_from_boundary functions
Added additional possible mm and arm64 people as reviewers to Cc
Added Reviewed-by tags received for v8
Fixed missing parameter in kerneldoc

---

Alexander Duyck (8):
mm: Add per-cpu logic to page shuffling
mm: Adjust shuffle code to allow for future coalescing
mm: Move set/get_pcppage_migratetype to mmzone.h
mm: Use zone and order instead of free area in free_list manipulators
arm64: Move hugetlb related definitions out of pgtable.h to page-defs.h
mm: Introduce Reported pages
virtio-balloon: Pull page poisoning config out of free page hinting
virtio-balloon: Add support for providing unused page reports to host

arch/arm64/include/asm/page-def.h | 9 +
arch/arm64/include/asm/pgtable.h | 9 -
drivers/virtio/Kconfig | 1
drivers/virtio/virtio_balloon.c | 87 ++++++++-
include/linux/mmzone.h | 124 ++++++++----
include/linux/page-flags.h | 11 +
include/linux/page_reporting.h | 178 +++++++++++++++++
include/uapi/linux/virtio_balloon.h | 1
mm/Kconfig | 5
mm/Makefile | 1
mm/internal.h | 18 ++
mm/memory_hotplug.c | 1
mm/page_alloc.c | 217 +++++++++++++++------
mm/page_reporting.c | 358 +++++++++++++++++++++++++++++++++++
mm/shuffle.c | 40 ++--
mm/shuffle.h | 12 +
16 files changed, 931 insertions(+), 141 deletions(-)
create mode 100644 include/linux/page_reporting.h
create mode 100644 mm/page_reporting.c

--

2019-09-09 03:26:31

by Alexander Duyck

[permalink] [raw]

Subject: [PATCH v9 3/8] mm: Move set/get_pcppage_migratetype to mmzone.h

From: Alexander Duyck <[email protected]>

In order to support page reporting it will be necessary to store and
retrieve the migratetype of a page. To enable that I am moving the set and
get operations for pcppage_migratetype into the mm/internal.h header so
that they can be used outside of the page_alloc.c file.

Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
mm/internal.h | 18 ++++++++++++++++++
mm/page_alloc.c | 18 ------------------
2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 0d5f720c75ab..e4a1a57bbd40 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -549,6 +549,24 @@ static inline bool is_migrate_highatomic_page(struct page *page)
return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
}

+/*
+ * A cached value of the page's pageblock's migratetype, used when the page is
+ * put on a pcplist. Used to avoid the pageblock migratetype lookup when
+ * freeing from pcplists in most cases, at the cost of possibly becoming stale.
+ * Also the migratetype set in the page does not necessarily match the pcplist
+ * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
+ * other index - this ensures that it will be put on the correct CMA freelist.
+ */
+static inline int get_pcppage_migratetype(struct page *page)
+{
+ return page->index;
+}
+
+static inline void set_pcppage_migratetype(struct page *page, int migratetype)
+{
+ page->index = migratetype;
+}
+
void setup_zone_pageset(struct zone *zone);
extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
#endif /* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4e4356ba66c7..a791f2baeeeb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -185,24 +185,6 @@ static int __init early_init_on_free(char *buf)
}
early_param("init_on_free", early_init_on_free);

-/*
- * A cached value of the page's pageblock's migratetype, used when the page is
- * put on a pcplist. Used to avoid the pageblock migratetype lookup when
- * freeing from pcplists in most cases, at the cost of possibly becoming stale.
- * Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
- */
-static inline int get_pcppage_migratetype(struct page *page)
-{
- return page->index;
-}
-
-static inline void set_pcppage_migratetype(struct page *page, int migratetype)
-{
- page->index = migratetype;
-}
-
#ifdef CONFIG_PM_SLEEP
/*
* The following functions are used by the suspend/hibernate code to temporarily

2019-09-09 03:35:59

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH v9 0/8] mm / virtio: Provide support for unused page reporting

Sorry about that. Looks like I fat fingered things and copied the
command line into the cover page. I corrected the subject here, and
pulled the command line out of the message below.

- Alex

On Sat, Sep 7, 2019 at 10:25 AM Alexander Duyck
<[email protected]> wrote:
> This series provides an asynchronous means of reporting to a hypervisor
> that a guest page is no longer in use and can have the data associated
> with it dropped. To do this I have implemented functionality that allows
> for what I am referring to as unused page reporting
>
> The functionality for this is fairly simple. When enabled it will allocate
> statistics to track the number of reported pages in a given free area.
> When the number of free pages exceeds this value plus a high water value,
> currently 32, it will begin performing page reporting which consists of
> pulling pages off of free list and placing them into a scatter list. The
> scatterlist is then given to the page reporting device and it will perform
> the required action to make the pages "reported", in the case of
> virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> and as such they are forced out of the guest. After this they are placed
> back on the free list, and an additional bit is added if they are not
> merged indicating that they are a reported buddy page instead of a
> standard buddy page. The cycle then repeats with additional non-reported
> pages being pulled until the free areas all consist of reported pages.
>
> I am leaving a number of things hard-coded such as limiting the lowest
> order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> determine what the limit is on how many pages it wants to allocate to
> process the hints. The upper limit for this is based on the size of the
> queue used to store the scattergather list.
>
> My primary testing has just been to verify the memory is being freed after
> allocation by running memhog 40g on a 40g guest and watching the total
> free memory via /proc/meminfo on the host. With this I have verified most
> of the memory is freed after each iteration. As far as performance I have
> been mainly focusing on the will-it-scale/page_fault1 test running with
> 16 vcpus. I have modified it to use Transparent Huge Pages. With this I
> see almost no difference, -0.08%, with the patches applied and the feature
> disabled. I see a regression of -0.86% with the feature enabled, but the
> madvise disabled in the hypervisor due to a device being assigned. With
> the feature fully enabled I see a regression of -3.27% versus the baseline
> without these patches applied. In my testing I found that most of the
> overhead was due to the page zeroing that comes as a result of the pages
> having to be faulted back into the guest.
>
> One side effect of these patches is that the guest becomes much more
> resilient in terms of NUMA locality. With the pages being freed and then
> reallocated when used it allows for the pages to be much closer to the
> active thread, and as a result there can be situations where this patch
> set will out-perform the stock kernel when the guest memory is not local
> to the guest vCPUs. To avoid that in my testing I set the affinity of all
> the vCPUs and QEMU instance to the same node.
>
> I have not included the QEMU patches with this set as they haven't really
> changed in the last several revisions. If needed they can be located with
> the v8 patchset.
>
> Changes from the RFC:
> https://lore.kernel.org/lkml/[email protected]/
> Moved aeration requested flag out of aerator and into zone->flags.
> Moved boundary out of free_area and into local variables for aeration.
> Moved aeration cycle out of interrupt and into workqueue.
> Left nr_free as total pages instead of splitting it between raw and aerated.
> Combined size and physical address values in virtio ring into one 64b value.
>
> Changes from v1:
> https://lore.kernel.org/lkml/[email protected]/
> Dropped "waste page treatment" in favor of "page hinting"
> Renamed files and functions from "aeration" to "page_hinting"
> Moved from page->lru list to scatterlist
> Replaced wait on refcnt in shutdown with RCU and cancel_delayed_work_sync
> Virtio now uses scatterlist directly instead of intermediate array
> Moved stats out of free_area, now in separate area and pointed to from zone
> Merged patch 5 into patch 4 to improve review-ability
> Updated various code comments throughout
>
> Changes from v2:
> https://lore.kernel.org/lkml/[email protected]/
> Dropped "page hinting" in favor of "page reporting"
> Renamed files from "hinting" to "reporting"
> Replaced "Hinted" page type with "Reported" page flag
> Added support for page poisoning while hinting is active
> Add QEMU patch that implements PAGE_POISON feature
>
> Changes from v3:
> https://lore.kernel.org/lkml/[email protected]/
> Added mutex lock around page reporting startup and shutdown
> Fixed reference to "page aeration" in patch 2
> Split page reporting function bit out into separate QEMU patch
> Limited capacity of scatterlist to vq size - 1 instead of vq size
> Added exception handling for case of virtio descriptor allocation failure
>
> Changes from v4:
> https://lore.kernel.org/lkml/[email protected]/
> Replaced spin_(un)lock with spin_(un)lock_irq in page_reporting_cycle()
> Dropped if/continue for ternary operator in page_reporting_process()
> Added checks for isolate and cma types to for_each_reporting_migratetype_order
> Added virtio-dev, Michal Hocko, and Oscar Salvador to to:/cc:
> Rebased on latest linux-next and QEMU git trees
>
> Changes from v5:
> https://lore.kernel.org/lkml/[email protected]/
> Replaced spin_(un)lock with spin_(un)lock_irq in page_reporting_startup()
> Updated shuffle code to use "shuffle_pick_tail" and updated patch description
> Dropped storage of order and migratettype while page is being reported
> Used get_pfnblock_migratetype to determine migratetype of page
> Renamed put_reported_page to free_reported_page, added order as argument
> Dropped check for CMA type as I believe we should be reporting those
> Added code to allow moving of reported pages into and out of isolation
> Defined page reporting order as minimum of Huge Page size vs MAX_ORDER - 1
> Cleaned up use of static branch usage for page_reporting_notify_enabled
>
> Changes from v6:
> https://lore.kernel.org/lkml/[email protected]/
> Rebased on linux-next for 20190903
> Added jump label to __page_reporting_request so we release RCU read lock
> Removed "- 1" from capacity limit based on virtio ring
> Added code to verify capacity is non-zero or return error on startup
>
> Changes from v7:
> https://lore.kernel.org/lkml/[email protected]/
> Updated poison fixes to clear flag if "nosanity" is enabled in kernel config
> Split shuffle per-cpu optimization into separate patch
> Moved check for !phdev->capacity into reporting patch where it belongs
> Added Reviewed-by tags received for v7
>
> Changes from v8:
> https://lore.kernel.org/lkml/[email protected]/
> Added patch that moves HPAGE_SIZE definition for ARM64 to match other archs
> Switch back to using pageblock_order instead of HUGETLB_ORDER and MAX_ORDER - 1
> Boundary allocation now dynamic to support HUGETLB_PAGE_SIZE_VARIABLE option
> Made use of existing code/functions to reduce size of move_to_boundary function
> Dropped unused zone pointer from add_to/del_from_boundary functions
> Added additional possible mm and arm64 people as reviewers to Cc
> Added Reviewed-by tags received for v8
> Fixed missing parameter in kerneldoc
>
> ---
>
> Alexander Duyck (8):
> mm: Add per-cpu logic to page shuffling
> mm: Adjust shuffle code to allow for future coalescing
> mm: Move set/get_pcppage_migratetype to mmzone.h
> mm: Use zone and order instead of free area in free_list manipulators
> arm64: Move hugetlb related definitions out of pgtable.h to page-defs.h
> mm: Introduce Reported pages
> virtio-balloon: Pull page poisoning config out of free page hinting
> virtio-balloon: Add support for providing unused page reports to host
>
>
> arch/arm64/include/asm/page-def.h | 9 +
> arch/arm64/include/asm/pgtable.h | 9 -
> drivers/virtio/Kconfig | 1
> drivers/virtio/virtio_balloon.c | 87 ++++++++-
> include/linux/mmzone.h | 124 ++++++++----
> include/linux/page-flags.h | 11 +
> include/linux/page_reporting.h | 178 +++++++++++++++++
> include/uapi/linux/virtio_balloon.h | 1
> mm/Kconfig | 5
> mm/Makefile | 1
> mm/internal.h | 18 ++
> mm/memory_hotplug.c | 1
> mm/page_alloc.c | 217 +++++++++++++++------
> mm/page_reporting.c | 358 +++++++++++++++++++++++++++++++++++
> mm/shuffle.c | 40 ++--
> mm/shuffle.h | 12 +
> 16 files changed, 931 insertions(+), 141 deletions(-)
> create mode 100644 include/linux/page_reporting.h
> create mode 100644 mm/page_reporting.c
>
> --

2019-09-09 04:31:25

by Alexander Duyck

[permalink] [raw]

Subject: [PATCH v9 8/8] virtio-balloon: Add support for providing unused page reports to host

From: Alexander Duyck <[email protected]>

Add support for the page reporting feature provided by virtio-balloon.
Reporting differs from the regular balloon functionality in that is is
much less durable than a standard memory balloon. Instead of creating a
list of pages that cannot be accessed the pages are only inaccessible
while they are being indicated to the virtio interface. Once the
interface has acknowledged them they are placed back into their respective
free lists and are once again accessible by the guest system.

Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/virtio/Kconfig | 1 +
drivers/virtio/virtio_balloon.c | 65 +++++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_balloon.h | 1 +
3 files changed, 67 insertions(+)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..4b2dd8259ff5 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -58,6 +58,7 @@ config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
select MEMORY_BALLOON
+ select PAGE_REPORTING
---help---
This driver supports increasing and decreasing the amount
of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index d2547df7de93..62bdb0afd022 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,7 @@
#include <linux/mount.h>
#include <linux/magic.h>
#include <linux/pseudo_fs.h>
+#include <linux/page_reporting.h>

/*
* Balloon device works in 4K page units. So each page is pointed to by
@@ -37,6 +38,9 @@
#define VIRTIO_BALLOON_FREE_PAGE_SIZE \
(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))

+/* limit on the number of pages that can be on the reporting vq */
+#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
+
#ifdef CONFIG_BALLOON_COMPACTION
static struct vfsmount *balloon_mnt;
#endif
@@ -46,6 +50,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_DEFLATE,
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
+ VIRTIO_BALLOON_VQ_REPORTING,
VIRTIO_BALLOON_VQ_MAX
};

@@ -113,6 +118,10 @@ struct virtio_balloon {

/* To register a shrinker to shrink memory upon memory pressure */
struct shrinker shrinker;
+
+ /* Unused page reporting device */
+ struct virtqueue *reporting_vq;
+ struct page_reporting_dev_info ph_dev_info;
};

static struct virtio_device_id id_table[] = {
@@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)

}

+void virtballoon_unused_page_report(struct page_reporting_dev_info *ph_dev_info,
+ unsigned int nents)
+{
+ struct virtio_balloon *vb =
+ container_of(ph_dev_info, struct virtio_balloon, ph_dev_info);
+ struct virtqueue *vq = vb->reporting_vq;
+ unsigned int unused, err;
+
+ /* We should always be able to add these buffers to an empty queue. */
+ err = virtqueue_add_inbuf(vq, ph_dev_info->sg, nents, vb,
+ GFP_NOWAIT | __GFP_NOWARN);
+
+ /*
+ * In the extremely unlikely case that something has changed and we
+ * are able to trigger an error we will simply display a warning
+ * and exit without actually processing the pages.
+ */
+ if (WARN_ON(err))
+ return;
+
+ virtqueue_kick(vq);
+
+ /* When host has read buffer, this completes via balloon_ack */
+ wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+}
+
static void set_page_pfns(struct virtio_balloon *vb,
__virtio32 pfns[], struct page *page)
{
@@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
names[VIRTIO_BALLOON_VQ_STATS] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+ names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;

if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
}

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+ names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
+ callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
+ }
+
err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
vqs, callbacks, names, NULL, NULL);
if (err)
return err;

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+ vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
+
vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -930,12 +974,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_del_balloon_wq;
}
+
+ vb->ph_dev_info.report = virtballoon_unused_page_report;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+ unsigned int capacity;
+
+ capacity = min_t(unsigned int,
+ virtqueue_get_vring_size(vb->reporting_vq),
+ VIRTIO_BALLOON_VRING_HINTS_MAX);
+ vb->ph_dev_info.capacity = capacity;
+
+ err = page_reporting_startup(&vb->ph_dev_info);
+ if (err)
+ goto out_unregister_shrinker;
+ }
+
virtio_device_ready(vdev);

if (towards_target(vb))
virtballoon_changed(vdev);
return 0;

+out_unregister_shrinker:
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+ virtio_balloon_unregister_shrinker(vb);
out_del_balloon_wq:
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
destroy_workqueue(vb->balloon_wq);
@@ -964,6 +1026,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
{
struct virtio_balloon *vb = vdev->priv;

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+ page_reporting_shutdown(&vb->ph_dev_info);
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
virtio_balloon_unregister_shrinker(vb);
spin_lock_irq(&vb->stop_update_lock);
@@ -1035,6 +1099,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
VIRTIO_BALLOON_F_PAGE_POISON,
+ VIRTIO_BALLOON_F_REPORTING,
};

static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..19974392d324 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */

/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12

2019-09-09 15:04:50

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v9 3/8] mm: Move set/get_pcppage_migratetype to mmzone.h

On 07.09.19 19:25, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> In order to support page reporting it will be necessary to store and
> retrieve the migratetype of a page. To enable that I am moving the set and
> get operations for pcppage_migratetype into the mm/internal.h header so
> that they can be used outside of the page_alloc.c file.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> mm/internal.h | 18 ++++++++++++++++++
> mm/page_alloc.c | 18 ------------------
> 2 files changed, 18 insertions(+), 18 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 0d5f720c75ab..e4a1a57bbd40 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -549,6 +549,24 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
> }
>
> +/*
> + * A cached value of the page's pageblock's migratetype, used when the page is
> + * put on a pcplist. Used to avoid the pageblock migratetype lookup when
> + * freeing from pcplists in most cases, at the cost of possibly becoming stale.
> + * Also the migratetype set in the page does not necessarily match the pcplist
> + * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
> + * other index - this ensures that it will be put on the correct CMA freelist.
> + */
> +static inline int get_pcppage_migratetype(struct page *page)
> +{
> + return page->index;
> +}
> +
> +static inline void set_pcppage_migratetype(struct page *page, int migratetype)
> +{
> + page->index = migratetype;
> +}
> +
> void setup_zone_pageset(struct zone *zone);
> extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> #endif /* __MM_INTERNAL_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4e4356ba66c7..a791f2baeeeb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -185,24 +185,6 @@ static int __init early_init_on_free(char *buf)
> }
> early_param("init_on_free", early_init_on_free);
>
> -/*
> - * A cached value of the page's pageblock's migratetype, used when the page is
> - * put on a pcplist. Used to avoid the pageblock migratetype lookup when
> - * freeing from pcplists in most cases, at the cost of possibly becoming stale.
> - * Also the migratetype set in the page does not necessarily match the pcplist
> - * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
> - * other index - this ensures that it will be put on the correct CMA freelist.
> - */
> -static inline int get_pcppage_migratetype(struct page *page)
> -{
> - return page->index;
> -}
> -
> -static inline void set_pcppage_migratetype(struct page *page, int migratetype)
> -{
> - page->index = migratetype;
> -}
> -
> #ifdef CONFIG_PM_SLEEP
> /*
> * The following functions are used by the suspend/hibernate code to temporarily
>
>

Still have to understand in detail how this will be used, but the change
certainly looks ok :)

Acked-by: David Hildenbrand <[email protected]>

--

Thanks,

David / dhildenb

2019-09-09 15:35:16

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: [PATCH v9 3/8] mm: Move set/get_pcppage_migratetype to mmzone.h

On Sat, Sep 07, 2019 at 10:25:28AM -0700, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> In order to support page reporting it will be necessary to store and
> retrieve the migratetype of a page. To enable that I am moving the set and
> get operations for pcppage_migratetype into the mm/internal.h header so
> that they can be used outside of the page_alloc.c file.
>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>

I'm not sure that it's great idea to export this functionality beyond
mm/page_alloc.c without any additional safeguards. How would we avoid to
messing with ->index when the page is not in the right state of its
life-cycle. Can we add some VM_BUG_ON()s here?

--
Kirill A. Shutemov

2019-09-10 09:19:41

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH v9 3/8] mm: Move set/get_pcppage_migratetype to mmzone.h

On Mon, 2019-09-09 at 11:01 -0700, Alexander Duyck wrote:
> On Mon, 2019-09-09 at 12:56 +0300, Kirill A. Shutemov wrote:
> > On Sat, Sep 07, 2019 at 10:25:28AM -0700, Alexander Duyck wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > In order to support page reporting it will be necessary to store and
> > > retrieve the migratetype of a page. To enable that I am moving the set and
> > > get operations for pcppage_migratetype into the mm/internal.h header so
> > > that they can be used outside of the page_alloc.c file.
> > >
> > > Reviewed-by: Dan Williams <[email protected]>
> > > Signed-off-by: Alexander Duyck <[email protected]>
> >
> > I'm not sure that it's great idea to export this functionality beyond
> > mm/page_alloc.c without any additional safeguards. How would we avoid to
> > messing with ->index when the page is not in the right state of its
> > life-cycle. Can we add some VM_BUG_ON()s here?
>
> I am not sure what we would need to check on though. There are essentially
> 2 cases where we are using this. The first is the percpu page lists so the
> value is set either as a result of __rmqueue_smallest or
> free_unref_page_prepare. The second one which hasn't been added yet is for
> the Reported pages case which I add with patch 6.
>
> When I use it for page reporting I am essentially using the Reported flag
> to identify what pages in the buddy list will have this value set versus
> those that may not. I didn't explicitly define it that way, but that is
> how I am using it so that the value cannot be trusted unless the Reported
> flag is set.

I guess the alternative would be to just treat the ->index value as the
index within the boundary array, and not use the per-cpu list functions.
Doing that might make things a bit more clear since all we are really
doing is storing the index into the boundary list the page is contained
in. I could probably combine the value of order and migratetype and save
myself a few cycles in the process by just saving the index into the array
directly.

2019-09-10 14:48:27

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH v9 0/8] stg mail -e --version=v9 \

On Tue, Sep 10, 2019 at 5:42 AM Michal Hocko <[email protected]> wrote:
>
> I wanted to review "mm: Introduce Reported pages" just realize that I
> have no clue on what is going on so returned to the cover and it didn't
> really help much. I am completely unfamiliar with virtio so please bear
> with me.
>
> On Sat 07-09-19 10:25:03, Alexander Duyck wrote:
> [...]
> > This series provides an asynchronous means of reporting to a hypervisor
> > that a guest page is no longer in use and can have the data associated
> > with it dropped. To do this I have implemented functionality that allows
> > for what I am referring to as unused page reporting
> >
> > The functionality for this is fairly simple. When enabled it will allocate
> > statistics to track the number of reported pages in a given free area.
> > When the number of free pages exceeds this value plus a high water value,
> > currently 32, it will begin performing page reporting which consists of
> > pulling pages off of free list and placing them into a scatter list. The
> > scatterlist is then given to the page reporting device and it will perform
> > the required action to make the pages "reported", in the case of
> > virtio-balloon this results in the pages being madvised as MADV_DONTNEED
> > and as such they are forced out of the guest. After this they are placed
> > back on the free list,
>
> And here I am reallly lost because "forced out of the guest" makes me
> feel that those pages are no longer usable by the guest. So how come you
> can add them back to the free list. I suspect understanding this part
> will allow me to understand why we have to mark those pages and prevent
> merging.

Basically as the paragraph above mentions "forced out of the guest"
really is just the hypervisor calling MADV_DONTNEED on the page in
question. So the behavior is the same as any userspace application
that calls MADV_DONTNEED where the contents are no longer accessible
from userspace and attempting to access them will result in a fault
and the page being populated with a zero fill on-demand page, or a
copy of the file contents if the memory is file backed.

2019-09-10 14:51:55