2019-11-19 21:49:50

by Alexander Duyck

Subject: [PATCH v14 0/6] mm / virtio: Provide support for unused page reporting

This series provides an asynchronous means of reporting unused guest
pages to a hypervisor so that the memory associated with those pages can
be dropped and reused by other processes and/or guests on the host. Using
this it is possible to avoid unnecessary I/O to disk and greatly improve
performance in the case of memory overcommit on the host.

When enabled, it will allocate a set of statistics to track the number of
reported pages. When the nr_free for a given free area exceeds the number
of reported pages by the high water mark, we schedule a worker to begin
pulling the non-reported memory and provide it to the reporting interface
via a scatterlist.
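
As a rough illustration of the trigger described above, the check can be
modeled as a comparison of the unreported page count against a threshold.
This is only a user-space sketch: the struct, the function name, and the
threshold value of 32 are assumptions made for illustration and are not
taken from the patches.

```c
#include <stdbool.h>

/* Hypothetical threshold; the real value is defined by the patch set. */
#define EXAMPLE_REPORTING_HWM 32UL

/* Minimal stand-in for the per-free-area counters (illustrative only). */
struct example_free_area {
	unsigned long nr_free;		/* pages currently in the free area */
	unsigned long nr_reported;	/* of those, pages already reported */
};

/*
 * Returns true once enough unreported pages have accumulated to justify
 * scheduling a reporting pass. Assumes nr_reported <= nr_free.
 */
bool example_should_schedule_report(const struct example_free_area *area)
{
	unsigned long unreported = area->nr_free - area->nr_reported;

	return unreported >= EXAMPLE_REPORTING_HWM;
}
```

With nr_free = 40 and nr_reported = 10 the sketch does not request a pass
(30 unreported pages); at nr_free = 42 it does.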

Currently this is only in use by virtio-balloon; however, there is the hope
that at some point in the future other hypervisors might be able to make
use of it. In the virtio-balloon/QEMU implementation the hypervisor
currently uses MADV_DONTNEED to indicate to the host kernel that the page
is unused. It will be zeroed and faulted back into the guest the next time
the page is accessed.

To track whether a page is reported, the Uptodate flag was repurposed and
used as a Reported flag for Buddy pages. We walk through the free list
isolating pages and adding them to the scatterlist until we either
encounter the end of the list or have filled the scatterlist with pages to
be reported. If we fill the scatterlist before we reach the end of the
list we rotate the list so that the first unreported page we encounter is
moved to the head of the list as that is where we will resume after we
have freed the reported pages back into the tail of the list.
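
The walk described above can be modeled in user space roughly as follows.
This is a simplified sketch, not the code from the patches: it uses a flat
array instead of the kernel's linked free list, assumes unreported pages
sit ahead of reported ones, and all names (example_page, example_free_list,
example_report_cycle) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define EXAMPLE_LIST_MAX 16

struct example_page {
	unsigned long pfn;
	bool reported;
};

/* Index 0 is the head of the free list, index len - 1 the tail. */
struct example_free_list {
	struct example_page e[EXAMPLE_LIST_MAX];
	size_t len;
};

/*
 * Run one reporting pass: isolate up to 'capacity' unreported pages from
 * the head, then return them to the tail marked as reported. Returns the
 * number of pages reported in this pass.
 */
size_t example_report_cycle(struct example_free_list *fl, size_t capacity)
{
	struct example_page batch[EXAMPLE_LIST_MAX];
	size_t n = 0, i;

	/* Stop when the batch is full, the list ends, or we hit a reported page. */
	while (n < capacity && n < fl->len && !fl->e[n].reported) {
		batch[n] = fl->e[n];
		n++;
	}
	if (n == 0)
		return 0;

	/* The first page we did not take becomes the new head of the list. */
	memmove(&fl->e[0], &fl->e[n], (fl->len - n) * sizeof(fl->e[0]));

	/* "Report" the batch and free it back into the tail of the list. */
	for (i = 0; i < n; i++) {
		batch[i].reported = true;
		fl->e[fl->len - n + i] = batch[i];
	}
	return n;
}
```

Starting from five unreported pages and a capacity of two, successive
passes report 2, 2, and 1 pages, after which a further pass finds nothing
left to report.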

Below are the results from various benchmarks. I primarily focused on two
tests. The first is the will-it-scale/page_fault2 test, and the other is
a modified version of will-it-scale/page_fault1 that was enabled to use
THP. I did this as it allows for better visibility into different parts
of the memory subsystem. The guest is running with 32G of RAM on one
node of an E5-2630 v3. The host has had some power saving features disabled
by setting the /dev/cpu_dma_latency value to 10ms.

Test                   page_fault1 (THP)      page_fault2
Name            tasks  Process Iter  STDEV    Process Iter  STDEV
Baseline            1    1203934.75  0.04%       379940.75  0.11%
                   16    8828217.00  0.85%      3178653.00  1.28%

Patches applied     1    1207961.25  0.10%       380852.25  0.25%
                   16    8862373.00  0.98%      3246397.25  0.68%

Patches enabled     1    1207758.75  0.17%       373079.25  0.60%
MADV disabled      16    8870373.75  0.29%      3204989.75  1.08%

Patches enabled     1    1261183.75  0.39%       373201.50  0.50%
                   16    8371359.75  0.65%      3233665.50  0.84%

Patches enabled     1    1090201.50  0.25%       376967.25  0.29%
page shuffle       16    8108719.75  0.58%      3218450.25  1.07%

The results above cover five configurations: a baseline linux-next-20191115
kernel; that kernel with this patch set applied but page reporting disabled
in virtio-balloon; the patches enabled but with the madvise disabled by
direct-assigning a device; the patches applied with page reporting fully
enabled; and the patches enabled with page shuffling also enabled. The
STDEV columns give the deviation seen between the average value reported
here and the high and/or low value. During the tests, memory usage for the
first three configurations never dropped, whereas with the patches fully
enabled the VM would drop to using only a few GB of the host's memory when
switching from memhog to the page fault tests.

Most of the overhead seen with this patch set enabled seems due to page
faults caused by accessing the reported pages and the host zeroing the page
before giving it back to the guest. This overhead is much more visible when
using THP than with standard 4K pages. In addition page shuffling seemed to
increase the amount of faults generated due to an increase in memory churn.

The overall guest size is kept fairly small to only a few GB while the test
is running. If the host memory were oversubscribed this patch set should
result in a performance improvement as swapping memory in the host can be
avoided.

A brief history on the background of unused page reporting can be found at:
https://lore.kernel.org/lkml/[email protected]/

Changes from v12:
https://lore.kernel.org/lkml/[email protected]/
Rebased on linux-next 20191031
Renamed page_is_reported to page_reported
Renamed add_page_to_reported_list to mark_page_reported
Dropped unused definition of add_page_to_reported_list for non-reporting case
Split free_area_reporting out from get_unreported_tail
Minor updates to cover page

Changes from v13:
https://lore.kernel.org/lkml/[email protected]/
Rewrote core reporting functionality
Merged patches 3 & 4
Dropped boundary list and related code
Folded get_reported_page into page_reporting_fill
Folded page_reporting_fill into page_reporting_cycle
Pulled reporting functionality out of free_reported_page
Renamed it to __free_isolated_page
Moved page reporting specific bits to page_reporting_drain
Renamed phdev to prdev since we aren't "hinting" we are "reporting"
Added documentation to describe the usage of unused page reporting
Updated cover page and patch descriptions to avoid mention of boundary


---

Alexander Duyck (6):
mm: Adjust shuffle code to allow for future coalescing
mm: Use zone and order instead of free area in free_list manipulators
mm: Introduce Reported pages
mm: Add unused page reporting documentation
virtio-balloon: Pull page poisoning config out of free page hinting
virtio-balloon: Add support for providing unused page reports to host


Documentation/vm/unused_page_reporting.rst | 44 ++++
drivers/virtio/Kconfig | 1
drivers/virtio/virtio_balloon.c | 88 +++++++
include/linux/mmzone.h | 56 +----
include/linux/page-flags.h | 11 +
include/linux/page_reporting.h | 31 +++
include/uapi/linux/virtio_balloon.h | 1
mm/Kconfig | 11 +
mm/Makefile | 1
mm/memory_hotplug.c | 2
mm/page_alloc.c | 181 +++++++++++----
mm/page_reporting.c | 337 ++++++++++++++++++++++++++++
mm/page_reporting.h | 125 ++++++++++
mm/shuffle.c | 12 -
mm/shuffle.h | 6
15 files changed, 805 insertions(+), 102 deletions(-)
create mode 100644 Documentation/vm/unused_page_reporting.rst
create mode 100644 include/linux/page_reporting.h
create mode 100644 mm/page_reporting.c
create mode 100644 mm/page_reporting.h

--


2019-11-19 21:50:48

by Alexander Duyck

Subject: [PATCH v14 4/6] mm: Add unused page reporting documentation

From: Alexander Duyck <[email protected]>

Add documentation for unused page reporting. Currently the only consumer is
virtio-balloon; however, it is possible that other drivers might make use of
this, so it is best to add a bit of documentation explaining at a high level
how to use the API.

Signed-off-by: Alexander Duyck <[email protected]>
---
Documentation/vm/unused_page_reporting.rst | 44 ++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
create mode 100644 Documentation/vm/unused_page_reporting.rst

diff --git a/Documentation/vm/unused_page_reporting.rst b/Documentation/vm/unused_page_reporting.rst
new file mode 100644
index 000000000000..932406f48842
--- /dev/null
+++ b/Documentation/vm/unused_page_reporting.rst
@@ -0,0 +1,44 @@
+.. _unused_page_reporting:
+
+=====================
+Unused Page Reporting
+=====================
+
+Unused page reporting is an API by which a device can register to receive
+lists of pages that are currently unused by the system. This is useful in
+the case of virtualization where a guest is then able to use this data to
+notify the hypervisor that it is no longer using certain pages in memory.
+
+For the driver, typically a balloon driver, to make use of this functionality
+it will allocate and initialize a page_reporting_dev_info structure. The
+fields within the structure it will populate are the "report" function
+pointer used to process the scatterlist and "capacity" representing the
+number of entries that the device can support in a single request. Once
+those are populated a call to page_reporting_register will allocate the
+scatterlist and register the device with the reporting framework assuming
+no other page reporting devices are already registered.
+
+Once registered the page reporting API will begin reporting batches of
+pages to the driver. The API determines that it needs to start reporting by
+measuring the number of pages in a given free area versus the number of
+reported pages for that free area. If the value meets or exceeds the value
+defined by PAGE_REPORTING_HWM then the zone is flagged as requesting
+reporting and a worker is scheduled to process the zone requesting reporting.
+
+Pages reported will be stored in the scatterlist pointed to in the
+page_reporting_dev_info with the final entry having the end bit set in
+entry nent - 1. While pages are being processed by the report function they
+will not be accessible to the allocator. Once the report function has been
+completed the pages will be returned to the free area from which they were
+obtained.
+
+Prior to removing a driver that is making use of unused page reporting it
+is necessary to call page_reporting_unregister to have the
+page_reporting_dev_info structure that is currently in use by unused page
+reporting removed. Doing this will prevent further reports from being
+issued via the interface. If another driver or the same driver is
+registered it is possible for it to resume where the previous driver had
+left off in terms of reporting unused pages.
+
+Alexander Duyck, Nov 15, 2019
+
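
The registration flow that the documentation describes can be sketched
outside the kernel roughly as below. This is a user-space model only: every
name prefixed with example_ is hypothetical, the fixed-size array stands in
for the scatterlist that the framework allocates, and the -EBUSY return for
a second registration is an assumption rather than something stated in the
patch.

```c
#include <errno.h>
#include <stddef.h>

#define EXAMPLE_SG_MAX 16

struct example_sg_entry {
	unsigned long pfn;
	unsigned int order;
};

/* Stand-in for page_reporting_dev_info: a callback plus a batch limit. */
struct example_reporting_dev {
	void (*report)(struct example_reporting_dev *dev, unsigned int nents);
	unsigned int capacity;
	struct example_sg_entry sg[EXAMPLE_SG_MAX];
};

/* Only one reporting device may be registered at a time. */
struct example_reporting_dev *example_registered;

/* Running total kept by the sample driver callback below. */
unsigned int example_pages_seen;

/* Sample driver callback: just counts the entries it was handed. */
void example_driver_report(struct example_reporting_dev *dev,
			   unsigned int nents)
{
	(void)dev;
	example_pages_seen += nents;
}

int example_reporting_register(struct example_reporting_dev *dev)
{
	if (!dev || !dev->report || !dev->capacity ||
	    dev->capacity > EXAMPLE_SG_MAX)
		return -EINVAL;
	if (example_registered)
		return -EBUSY;	/* assumed error code */
	example_registered = dev;
	return 0;
}

void example_reporting_unregister(struct example_reporting_dev *dev)
{
	if (example_registered == dev)
		example_registered = NULL;
}

/* Hand one batch to the registered driver; returns entries delivered. */
unsigned int example_reporting_deliver(const struct example_sg_entry *pages,
				       unsigned int count)
{
	struct example_reporting_dev *dev = example_registered;
	unsigned int nents, i;

	if (!dev)
		return 0;

	/* Never exceed what the driver said it can take in one request. */
	nents = count < dev->capacity ? count : dev->capacity;
	for (i = 0; i < nents; i++)
		dev->sg[i] = pages[i];

	dev->report(dev, nents);
	return nents;
}
```

In this model a driver with a capacity of 4 receives at most four entries
per call, a second driver cannot register until the first has unregistered,
and nothing is delivered while no driver is registered.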


2019-11-19 21:50:59

by Alexander Duyck

Subject: [PATCH v14 5/6] virtio-balloon: Pull page poisoning config out of free page hinting

From: Alexander Duyck <[email protected]>

Currently the page poisoning setting is not enabled unless free page
hinting is enabled. However, we will need the page poisoning tracking logic
for unused page reporting as well. As such, pull it out and make it a
separate bit of config in the probe function.

In addition we need to add support for the more recent init_on_free feature
which expects a behavior similar to page poisoning in that we expect the
page to be pre-zeroed.

Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/virtio/virtio_balloon.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 226fbb995fb0..92099298bc16 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -842,7 +842,6 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb)
static int virtballoon_probe(struct virtio_device *vdev)
{
struct virtio_balloon *vb;
- __u32 poison_val;
int err;

if (!vdev->config->get) {
@@ -909,11 +908,20 @@ static int virtballoon_probe(struct virtio_device *vdev)
VIRTIO_BALLOON_CMD_ID_STOP);
spin_lock_init(&vb->free_page_list_lock);
INIT_LIST_HEAD(&vb->free_page_list);
- if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+ }
+ if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+ /* Start with poison val of 0 representing general init */
+ __u32 poison_val = 0;
+
+ /*
+ * Let the hypervisor know that we are expecting a
+ * specific value to be written back in unused pages.
+ */
+ if (!want_init_on_free())
memset(&poison_val, PAGE_POISON, sizeof(poison_val));
- virtio_cwrite(vb->vdev, struct virtio_balloon_config,
- poison_val, &poison_val);
- }
+
+ virtio_cwrite(vb->vdev, struct virtio_balloon_config,
+ poison_val, &poison_val);
}
/*
* We continue to use VIRTIO_BALLOON_F_DEFLATE_ON_OOM to decide if a
@@ -1014,7 +1022,10 @@ static int virtballoon_restore(struct virtio_device *vdev)

static int virtballoon_validate(struct virtio_device *vdev)
{
- if (!page_poisoning_enabled())
+ /* Tell the host whether we care about poisoned pages. */
+ if (!want_init_on_free() &&
+ (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
+ !page_poisoning_enabled()))
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);

__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);


2019-11-19 21:51:42

by Alexander Duyck

Subject: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

From: Alexander Duyck <[email protected]>

Add support for the page reporting feature provided by virtio-balloon.
Reporting differs from the regular balloon functionality in that it is
much less durable than a standard memory balloon. Instead of creating a
list of pages that cannot be accessed, the pages are only inaccessible
while they are being indicated to the virtio interface. Once the
interface has acknowledged them they are placed back into their respective
free lists and are once again accessible by the guest system.

Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/virtio/Kconfig | 1 +
drivers/virtio/virtio_balloon.c | 65 +++++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_balloon.h | 1 +
3 files changed, 67 insertions(+)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 078615cf2afc..4b2dd8259ff5 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -58,6 +58,7 @@ config VIRTIO_BALLOON
tristate "Virtio balloon driver"
depends on VIRTIO
select MEMORY_BALLOON
+ select PAGE_REPORTING
---help---
This driver supports increasing and decreasing the amount
of memory within a KVM guest.
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 92099298bc16..6f5c6555765a 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -19,6 +19,7 @@
#include <linux/mount.h>
#include <linux/magic.h>
#include <linux/pseudo_fs.h>
+#include <linux/page_reporting.h>

/*
* Balloon device works in 4K page units. So each page is pointed to by
@@ -37,6 +38,9 @@
#define VIRTIO_BALLOON_FREE_PAGE_SIZE \
(1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))

+/* limit on the number of pages that can be on the reporting vq */
+#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
+
#ifdef CONFIG_BALLOON_COMPACTION
static struct vfsmount *balloon_mnt;
#endif
@@ -46,6 +50,7 @@ enum virtio_balloon_vq {
VIRTIO_BALLOON_VQ_DEFLATE,
VIRTIO_BALLOON_VQ_STATS,
VIRTIO_BALLOON_VQ_FREE_PAGE,
+ VIRTIO_BALLOON_VQ_REPORTING,
VIRTIO_BALLOON_VQ_MAX
};

@@ -113,6 +118,10 @@ struct virtio_balloon {

/* To register a shrinker to shrink memory upon memory pressure */
struct shrinker shrinker;
+
+ /* Unused page reporting device */
+ struct virtqueue *reporting_vq;
+ struct page_reporting_dev_info pr_dev_info;
};

static struct virtio_device_id id_table[] = {
@@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)

}

+void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
+ unsigned int nents)
+{
+ struct virtio_balloon *vb =
+ container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
+ struct virtqueue *vq = vb->reporting_vq;
+ unsigned int unused, err;
+
+ /* We should always be able to add these buffers to an empty queue. */
+ err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
+ GFP_NOWAIT | __GFP_NOWARN);
+
+ /*
+ * In the extremely unlikely case that something has changed and we
+ * are able to trigger an error we will simply display a warning
+ * and exit without actually processing the pages.
+ */
+ if (WARN_ON(err))
+ return;
+
+ virtqueue_kick(vq);
+
+ /* When host has read buffer, this completes via balloon_ack */
+ wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
+}
+
static void set_page_pfns(struct virtio_balloon *vb,
__virtio32 pfns[], struct page *page)
{
@@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
names[VIRTIO_BALLOON_VQ_STATS] = NULL;
names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
+ names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;

if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
names[VIRTIO_BALLOON_VQ_STATS] = "stats";
@@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
}

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+ names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
+ callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
+ }
+
err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
vqs, callbacks, names, NULL, NULL);
if (err)
return err;

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+ vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
+
vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
@@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
if (err)
goto out_del_balloon_wq;
}
+
+ vb->pr_dev_info.report = virtballoon_unused_page_report;
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
+ unsigned int capacity;
+
+ capacity = min_t(unsigned int,
+ virtqueue_get_vring_size(vb->reporting_vq),
+ VIRTIO_BALLOON_VRING_HINTS_MAX);
+ vb->pr_dev_info.capacity = capacity;
+
+ err = page_reporting_register(&vb->pr_dev_info);
+ if (err)
+ goto out_unregister_shrinker;
+ }
+
virtio_device_ready(vdev);

if (towards_target(vb))
virtballoon_changed(vdev);
return 0;

+out_unregister_shrinker:
+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
+ virtio_balloon_unregister_shrinker(vb);
out_del_balloon_wq:
if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
destroy_workqueue(vb->balloon_wq);
@@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
{
struct virtio_balloon *vb = vdev->priv;

+ if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
+ page_reporting_unregister(&vb->pr_dev_info);
if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
virtio_balloon_unregister_shrinker(vb);
spin_lock_irq(&vb->stop_update_lock);
@@ -1038,6 +1102,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
VIRTIO_BALLOON_F_FREE_PAGE_HINT,
VIRTIO_BALLOON_F_PAGE_POISON,
+ VIRTIO_BALLOON_F_REPORTING,
};

static struct virtio_driver virtio_balloon_driver = {
diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
index a1966cd7b677..19974392d324 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */

/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12


2019-11-19 21:56:23

by Alexander Duyck

Subject: [PATCH v14 QEMU 1/3] virtio-balloon: Implement support for page poison tracking feature

From: Alexander Duyck <[email protected]>

We need to make certain to advertise support for page poison tracking if
we want to actually get data on if the guest will be poisoning pages. So
if free page hinting is active we should add page poisoning support and
let the guest disable it if it isn't using it.

Page poisoning will result in a page being dirtied on free. As such we
cannot really avoid having to copy the page at least one more time since
we will need to write the poison value to the destination. As such we can
just ignore free page hinting if page poisoning is enabled as it will
actually reduce the work we have to do.

Signed-off-by: Alexander Duyck <[email protected]>
---
hw/virtio/virtio-balloon.c | 25 +++++++++++++++++++++----
include/hw/virtio/virtio-balloon.h | 1 +
2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 40b04f518028..6ecfec422309 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -531,6 +531,15 @@ static void virtio_balloon_free_page_start(VirtIOBalloon *s)
return;
}

+ /*
+ * If page poisoning is enabled then we probably shouldn't bother with
+ * the hinting since the poisoning will dirty the page and invalidate
+ * the work we are doing anyway.
+ */
+ if (virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+ return;
+ }
+
if (s->free_page_report_cmd_id == UINT_MAX) {
s->free_page_report_cmd_id =
VIRTIO_BALLOON_FREE_PAGE_REPORT_CMD_ID_MIN;
@@ -618,12 +627,10 @@ static size_t virtio_balloon_config_size(VirtIOBalloon *s)
if (s->qemu_4_0_config_size) {
return sizeof(struct virtio_balloon_config);
}
- if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON)) {
+ if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON) ||
+ virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
return sizeof(struct virtio_balloon_config);
}
- if (virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
- return offsetof(struct virtio_balloon_config, poison_val);
- }
return offsetof(struct virtio_balloon_config, free_page_report_cmd_id);
}

@@ -634,6 +641,7 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)

config.num_pages = cpu_to_le32(dev->num_pages);
config.actual = cpu_to_le32(dev->actual);
+ config.poison_val = cpu_to_le32(dev->poison_val);

if (dev->free_page_report_status == FREE_PAGE_REPORT_S_REQUESTED) {
config.free_page_report_cmd_id =
@@ -697,6 +705,8 @@ static void virtio_balloon_set_config(VirtIODevice *vdev,
qapi_event_send_balloon_change(vm_ram_size -
((ram_addr_t) dev->actual << VIRTIO_BALLOON_PFN_SHIFT));
}
+ dev->poison_val = virtio_vdev_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON) ?
+ le32_to_cpu(config.poison_val) : 0;
trace_virtio_balloon_set_config(dev->actual, oldactual);
}

@@ -706,6 +716,9 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
f |= dev->host_features;
virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
+ if (virtio_has_feature(f, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ virtio_add_feature(&f, VIRTIO_BALLOON_F_PAGE_POISON);
+ }

return f;
}
@@ -847,6 +860,8 @@ static void virtio_balloon_device_reset(VirtIODevice *vdev)
g_free(s->stats_vq_elem);
s->stats_vq_elem = NULL;
}
+
+ s->poison_val = 0;
}

static void virtio_balloon_set_status(VirtIODevice *vdev, uint8_t status)
@@ -909,6 +924,8 @@ static Property virtio_balloon_properties[] = {
VIRTIO_BALLOON_F_DEFLATE_ON_OOM, false),
DEFINE_PROP_BIT("free-page-hint", VirtIOBalloon, host_features,
VIRTIO_BALLOON_F_FREE_PAGE_HINT, false),
+ DEFINE_PROP_BIT("x-page-poison", VirtIOBalloon, host_features,
+ VIRTIO_BALLOON_F_PAGE_POISON, false),
/* QEMU 4.0 accidentally changed the config size even when free-page-hint
* is disabled, resulting in QEMU 3.1 migration incompatibility. This
* property retains this quirk for QEMU 4.1 machine types.
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index d1c968d2376e..7fe78e5c14d7 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -70,6 +70,7 @@ typedef struct VirtIOBalloon {
uint32_t host_features;

bool qemu_4_0_config_size;
+ uint32_t poison_val;
} VirtIOBalloon;

#endif


2019-11-19 21:57:23

by Alexander Duyck

Subject: [PATCH v14 QEMU 3/3] virtio-balloon: Provide an interface for unused page reporting

From: Alexander Duyck <[email protected]>

Add support for what I am referring to as "unused page reporting". The
idea is to function much like the balloon does, in that we end up
madvising the page as not being used. However, we don't need to bother
with any deflate-type logic since the page will be faulted back into the
guest when it is read or written to.

This is meant to be a simplification of the existing balloon interface
to use for providing hints to what memory needs to be freed. I am assuming
this is safe to do as the deflate logic does not actually appear to do very
much other than tracking what subpages have been released and which ones
haven't.

Signed-off-by: Alexander Duyck <[email protected]>
---
hw/virtio/virtio-balloon.c | 46 ++++++++++++++++++++++++++++++++++--
include/hw/virtio/virtio-balloon.h | 2 +-
2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 6ecfec422309..47f253d016db 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -321,6 +321,40 @@ static void balloon_stats_set_poll_interval(Object *obj, Visitor *v,
balloon_stats_change_timer(s, 0);
}

+static void virtio_balloon_handle_report(VirtIODevice *vdev, VirtQueue *vq)
+{
+ VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
+ VirtQueueElement *elem;
+
+ while ((elem = virtqueue_pop(vq, sizeof(VirtQueueElement)))) {
+ unsigned int i;
+
+ for (i = 0; i < elem->in_num; i++) {
+ void *addr = elem->in_sg[i].iov_base;
+ size_t size = elem->in_sg[i].iov_len;
+ ram_addr_t ram_offset;
+ size_t rb_page_size;
+ RAMBlock *rb;
+
+ if (qemu_balloon_is_inhibited() || dev->poison_val)
+ continue;
+
+ rb = qemu_ram_block_from_host(addr, false, &ram_offset);
+ rb_page_size = qemu_ram_pagesize(rb);
+
+ /* For now we will simply ignore unaligned memory regions */
+ if ((ram_offset | size) & (rb_page_size - 1))
+ continue;
+
+ ram_block_discard_range(rb, ram_offset, size);
+ }
+
+ virtqueue_push(vq, elem, 0);
+ virtio_notify(vdev, vq);
+ g_free(elem);
+ }
+}
+
static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
@@ -628,7 +662,8 @@ static size_t virtio_balloon_config_size(VirtIOBalloon *s)
return sizeof(struct virtio_balloon_config);
}
if (virtio_has_feature(features, VIRTIO_BALLOON_F_PAGE_POISON) ||
- virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ virtio_has_feature(features, VIRTIO_BALLOON_F_FREE_PAGE_HINT) ||
+ virtio_has_feature(features, VIRTIO_BALLOON_F_REPORTING)) {
return sizeof(struct virtio_balloon_config);
}
return offsetof(struct virtio_balloon_config, free_page_report_cmd_id);
@@ -716,7 +751,8 @@ static uint64_t virtio_balloon_get_features(VirtIODevice *vdev, uint64_t f,
VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
f |= dev->host_features;
virtio_add_feature(&f, VIRTIO_BALLOON_F_STATS_VQ);
- if (virtio_has_feature(f, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
+ if (virtio_has_feature(f, VIRTIO_BALLOON_F_FREE_PAGE_HINT) ||
+ virtio_has_feature(f, VIRTIO_BALLOON_F_REPORTING)) {
virtio_add_feature(&f, VIRTIO_BALLOON_F_PAGE_POISON);
}

@@ -806,6 +842,10 @@ static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
s->dvq = virtio_add_queue(vdev, 128, virtio_balloon_handle_output);
s->svq = virtio_add_queue(vdev, 128, virtio_balloon_receive_stats);

+ if (virtio_has_feature(s->host_features, VIRTIO_BALLOON_F_REPORTING)) {
+ s->rvq = virtio_add_queue(vdev, 32, virtio_balloon_handle_report);
+ }
+
if (virtio_has_feature(s->host_features,
VIRTIO_BALLOON_F_FREE_PAGE_HINT)) {
s->free_page_vq = virtio_add_queue(vdev, VIRTQUEUE_MAX_SIZE,
@@ -932,6 +972,8 @@ static Property virtio_balloon_properties[] = {
*/
DEFINE_PROP_BOOL("qemu-4-0-config-size", VirtIOBalloon,
qemu_4_0_config_size, false),
+ DEFINE_PROP_BIT("unused-page-reporting", VirtIOBalloon, host_features,
+ VIRTIO_BALLOON_F_REPORTING, true),
DEFINE_PROP_LINK("iothread", VirtIOBalloon, iothread, TYPE_IOTHREAD,
IOThread *),
DEFINE_PROP_END_OF_LIST(),
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 7fe78e5c14d7..db5bf7127112 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -42,7 +42,7 @@ enum virtio_balloon_free_page_report_status {

typedef struct VirtIOBalloon {
VirtIODevice parent_obj;
- VirtQueue *ivq, *dvq, *svq, *free_page_vq;
+ VirtQueue *ivq, *dvq, *svq, *free_page_vq, *rvq;
uint32_t free_page_report_status;
uint32_t num_pages;
uint32_t actual;


2019-11-19 21:58:57

by Alexander Duyck

Subject: [PATCH v14 QEMU 2/3] virtio-balloon: Add bit to notify guest of unused page reporting

From: Alexander Duyck <[email protected]>

Add a bit for the page reporting feature provided by virtio-balloon.

This patch should be replaced once the feature is added to the Linux kernel
and the bit is backported into this exported kernel header.

Signed-off-by: Alexander Duyck <[email protected]>
---
include/standard-headers/linux/virtio_balloon.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/standard-headers/linux/virtio_balloon.h b/include/standard-headers/linux/virtio_balloon.h
index 9375ca2a70de..1c5f6d6f2de6 100644
--- a/include/standard-headers/linux/virtio_balloon.h
+++ b/include/standard-headers/linux/virtio_balloon.h
@@ -36,6 +36,7 @@
#define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
#define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
#define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
+#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */

/* Size of a PFN in the balloon interface. */
#define VIRTIO_BALLOON_PFN_SHIFT 12


2019-11-26 12:52:34

by David Hildenbrand

Subject: Re: [PATCH v14 0/6] mm / virtio: Provide support for unused page reporting

On 19.11.19 22:46, Alexander Duyck wrote:
> This series provides an asynchronous means of reporting unused guest
> pages to a hypervisor so that the memory associated with those pages can
> be dropped and reused by other processes and/or guests on the host. Using
> this it is possible to avoid unnecessary I/O to disk and greatly improve
> performance in the case of memory overcommit on the host.
>
> When enabled it will allocate a set of statistics to track the number of
> reported pages. When the nr_free for a given free area is greater than
> this by the high water mark we will schedule a worker to begin pulling the
> non-reported memory and to provide it to the reporting interface via a
> scatterlist.
>
> Currently this is only in use by virtio-balloon however there is the hope
> that at some point in the future other hypervisors might be able to make
> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
> currently using MADV_DONTNEED to indicate to the host kernel that the page
> is currently unused. It will be zeroed and faulted back into the guest the
> next time the page is accessed.

Remind me why we are using MADV_DONTNEED? Mostly for debugging purposes
right now, right? Did you do any measurements with MADV_FREE? I guess
there should be quite a performance increase in some scenarios.

>
> To track if a page is reported or not the Uptodate flag was repurposed and
> used as a Reported flag for Buddy pages. We walk through the free list
> isolating pages and adding them to the scatterlist until we either
> encounter the end of the list or have filled the scatterlist with pages to
> be reported. If we fill the scatterlist before we reach the end of the
> list we rotate the list so that the first unreported page we encounter is
> moved to the head of the list as that is where we will resume after we
> have freed the reported pages back into the tail of the list.

So the boundary pointer didn't actually provide that big of a benefit I
assume (IOW, worst thing is you have to re-scan the whole list)?

>
> Below are the results from various benchmarks. I primarily focused on two
> tests. The first is the will-it-scale/page_fault2 test, and the other is
> a modified version of will-it-scale/page_fault1 that was enabled to use
> THP. I did this as it allows for better visibility into different parts
> of the memory subsystem. The guest is running with 32G for RAM on one
> node of a E5-2630 v3. The host has had some power saving features disabled
> by setting the /dev/cpu_dma_latency value to 10ms.
>
> Test                   page_fault1 (THP)      page_fault2
> Name            tasks  Process Iter  STDEV    Process Iter  STDEV
> Baseline            1    1203934.75  0.04%       379940.75  0.11%
>                    16    8828217.00  0.85%      3178653.00  1.28%
>
> Patches applied     1    1207961.25  0.10%       380852.25  0.25%
>                    16    8862373.00  0.98%      3246397.25  0.68%
>
> Patches enabled     1    1207758.75  0.17%       373079.25  0.60%
> MADV disabled      16    8870373.75  0.29%      3204989.75  1.08%
>
> Patches enabled     1    1261183.75  0.39%       373201.50  0.50%
>                    16    8371359.75  0.65%      3233665.50  0.84%
>
> Patches enabled     1    1090201.50  0.25%       376967.25  0.29%
> page shuffle       16    8108719.75  0.58%      3218450.25  1.07%
>
> The results above are for a baseline with a linux-next-20191115 kernel,
> that kernel with this patch set applied but page reporting disabled in
> virtio-balloon, patches applied but the madvise disabled by direct
> assigning a device, the patches applied and page reporting fully
> enabled, and the patches enabled with page shuffling enabled. These
> results include the deviation seen between the average value reported here
> versus the high and/or low value. I observed that during the test memory
> usage for the first three tests never dropped whereas with the patches
> fully enabled the VM would drop to using only a few GB of the host's
> memory when switching from memhog to page fault tests.
>
> Most of the overhead seen with this patch set enabled seems due to page
> faults caused by accessing the reported pages and the host zeroing the page
> before giving it back to the guest. This overhead is much more visible when
> using THP than with standard 4K pages. In addition page shuffling seemed to
> increase the amount of faults generated due to an increase in memory churn.

MADV_FREE would be interesting.

>
> The overall guest size is kept fairly small to only a few GB while the test
> is running. If the host memory were oversubscribed this patch set should
> result in a performance improvement as swapping memory in the host can be
> avoided.
>
> A brief history on the background of unused page reporting can be found at:
> https://lore.kernel.org/lkml/[email protected]/
>
> Changes from v12:
> https://lore.kernel.org/lkml/[email protected]/
> Rebased on linux-next 20191031
> Renamed page_is_reported to page_reported
> Renamed add_page_to_reported_list to mark_page_reported
> Dropped unused definition of add_page_to_reported_list for non-reporting case
> Split free_area_reporting out from get_unreported_tail
> Minor updates to cover page
>
> Changes from v13:
> https://lore.kernel.org/lkml/[email protected]/
> Rewrote core reporting functionality
> Merged patches 3 & 4
> Dropped boundary list and related code
> Folded get_reported_page into page_reporting_fill
> Folded page_reporting_fill into page_reporting_cycle
> Pulled reporting functionality out of free_reported_page
> Renamed it to __free_isolated_page
> Moved page reporting specific bits to page_reporting_drain
> Renamed phdev to prdev since we aren't "hinting" we are "reporting"
> Added documentation to describe the usage of unused page reporting
> Updated cover page and patch descriptions to avoid mention of boundary
>
>
> ---
>
> Alexander Duyck (6):
> mm: Adjust shuffle code to allow for future coalescing
> mm: Use zone and order instead of free area in free_list manipulators
> mm: Introduce Reported pages
> mm: Add unused page reporting documentation
> virtio-balloon: Pull page poisoning config out of free page hinting
> virtio-balloon: Add support for providing unused page reports to host
>
>
> Documentation/vm/unused_page_reporting.rst | 44 ++++
> drivers/virtio/Kconfig | 1
> drivers/virtio/virtio_balloon.c | 88 +++++++
> include/linux/mmzone.h | 56 +----
> include/linux/page-flags.h | 11 +
> include/linux/page_reporting.h | 31 +++
> include/uapi/linux/virtio_balloon.h | 1
> mm/Kconfig | 11 +
> mm/Makefile | 1
> mm/memory_hotplug.c | 2
> mm/page_alloc.c | 181 +++++++++++----
> mm/page_reporting.c | 337 ++++++++++++++++++++++++++++
> mm/page_reporting.h | 125 ++++++++++
> mm/shuffle.c | 12 -
> mm/shuffle.h | 6
> 15 files changed, 805 insertions(+), 102 deletions(-)

So roughly 100 LOC less added, that's nice to see :)

I'm planning to look into the details soon, just fairly busy lately. I
hope Mel et al. can also comment.

--
Thanks,

David / dhildenb

2019-11-26 16:50:15

by Alexander Duyck

Subject: Re: [PATCH v14 0/6] mm / virtio: Provide support for unused page reporting

On Tue, 2019-11-26 at 13:20 +0100, David Hildenbrand wrote:
> On 19.11.19 22:46, Alexander Duyck wrote:
> > This series provides an asynchronous means of reporting unused guest
> > pages to a hypervisor so that the memory associated with those pages can
> > be dropped and reused by other processes and/or guests on the host. Using
> > this it is possible to avoid unnecessary I/O to disk and greatly improve
> > performance in the case of memory overcommit on the host.
> >
> > When enabled it will allocate a set of statistics to track the number of
> > reported pages. When the nr_free for a given free area is greater than
> > this by the high water mark we will schedule a worker to begin pulling the
> > non-reported memory and to provide it to the reporting interface via a
> > scatterlist.
> >
> > Currently this is only in use by virtio-balloon however there is the hope
> > that at some point in the future other hypervisors might be able to make
> > use of it. In the virtio-balloon/QEMU implementation the hypervisor is
> > currently using MADV_DONTNEED to indicate to the host kernel that the page
> > is currently unused. It will be zeroed and faulted back into the guest the
> > next time the page is accessed.
>
> Remind me why we are using MADV_DONTNEED? Mostly for debugging purposes
> right now, right? Did you do any measurements with MADV_FREE? I guess
> there should be quite a performance increase in some scenarios.

There are actually a few reasons for not using MADV_FREE.

The first one was debugging as I could visibly see how much memory had
been freed by just checking the memory consumption by the guest. I didn't
have to wait for memory pressure to trigger the memory freeing. In
addition it would force the pages out of the guest so it was much easier
to see if I was freeing the wrong pages.

The second reason is because it is much more portable. The MADV_FREE has
only been a part of the Linux kernel since about 4.5. So if you are
running on an older kernel the option might not be available.

The third reason is simply effort involved. If I used MADV_DONTNEED then I
can just use ram_block_discard_range which is the same function used by
other parts of the balloon driver.

Finally, it is my understanding that MADV_FREE only works on anonymous
memory (https://elixir.bootlin.com/linux/v5.4/source/mm/madvise.c#L700). I
was concerned that using MADV_FREE wouldn't work if used on file backed
memory such as hugetlbfs which is an option for QEMU if I am not mistaken.
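
The debugging point above (discarded pages leave the process immediately and come back zeroed) can be seen from userspace. What follows is a minimal sketch, assuming a Linux host and a private anonymous mapping; `dontneed_first_byte` is a hypothetical demo helper, not a QEMU or kernel function.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical demo helper (not QEMU code): dirty a private anonymous
 * mapping, discard it with MADV_DONTNEED, and return the first byte
 * read afterwards. Returns -1 on failure.
 */
int dontneed_first_byte(size_t len)
{
    assert(len > 0);

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Fault the pages in and dirty them. */
    memset(buf, 0xAA, len);

    /* The kernel drops the backing pages right away. */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* This read faults in a fresh zero-filled page. */
    int first = buf[0];

    munmap(buf, len);
    return first;
}
```

On Linux the helper is expected to return 0 rather than 0xAA, which is why a wrongly reported page shows up quickly as corrupted guest data.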

> > To track if a page is reported or not the Uptodate flag was repurposed and
> > used as a Reported flag for Buddy pages. We walk through the free list
> > isolating pages and adding them to the scatterlist until we either
> > encounter the end of the list or have filled the scatterlist with pages to
> > be reported. If we fill the scatterlist before we reach the end of the
> > list we rotate the list so that the first unreported page we encounter is
> > moved to the head of the list as that is where we will resume after we
> > have freed the reported pages back into the tail of the list.
>
> So the boundary pointer didn't actually provide that big of a benefit I
> assume (IOW, worst thing is you have to re-scan the whole list)?

I rewrote the code quite a bit to get rid of the disadvantages.
Specifically what the boundary pointer was doing was saving our place in
the list when we left. Even without that we still had to re-scan the
entire list with each zone processed anyway. With these changes we end up
potentially having to perform one additional rescan per free list.

Where things differ now is that the fetching function doesn't bail out of
the list and start over per page. Instead it fills the entire scatterlist
before it exits, and before doing so it will advance the head to the next
non-reported page in the list. In addition instead of walking all of the
orders and migrate types looking for each page the code is now more
methodical and will only work one free list at a time and does not revisit
it until we have processed the entire zone.
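
As a rough illustration of the fill-then-advance behaviour described above, here is a toy model in plain C. It is only a sketch under stated assumptions: the free list is modelled as a circular array rather than a `list_head`, the `toy_*` names are hypothetical, and pages are marked reported as soon as they are picked, whereas the real code isolates pages under the zone lock and only flags them once the report completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a free page; 'reported' mimics the Reported flag. */
struct toy_page {
    int id;
    bool reported;
};

/* Toy stand-in for one free list, treated as circular. */
struct toy_free_list {
    struct toy_page *pages;
    size_t count;
    size_t head;    /* index where the next scan starts */
};

/*
 * Collect up to 'capacity' unreported pages starting at the head,
 * marking each one reported as it is taken. If the batch fills before
 * the scan wraps around, move the head to the next unreported page so
 * that a later call resumes there. Returns the number of ids stored in
 * 'batch'.
 */
size_t toy_fill_batch(struct toy_free_list *list, int *batch, size_t capacity)
{
    assert(list && batch && capacity > 0);

    size_t filled = 0;

    for (size_t i = 0; i < list->count; i++) {
        size_t idx = (list->head + i) % list->count;
        struct toy_page *page = &list->pages[idx];

        if (page->reported)
            continue;

        if (filled == capacity) {
            /* Batch is full: "rotate" so this page becomes the head. */
            list->head = idx;
            return filled;
        }

        page->reported = true;
        batch[filled++] = page->id;
    }

    /* Scanned the whole list; there is nothing left to resume from. */
    return filled;
}
```

Each call does at most one pass over the list, and a partially processed list costs at most one extra pass to confirm it is empty, which is the "one additional rescan per free list" mentioned above.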

Even with all that we still take a pretty significant performance hit in
the page shuffling case, however I am willing to give that up for the sake
of being less intrusive.

> > Below are the results from various benchmarks. I primarily focused on two
> > tests. The first is the will-it-scale/page_fault2 test, and the other is
> > a modified version of will-it-scale/page_fault1 that was enabled to use
> > THP. I did this as it allows for better visibility into different parts
> > of the memory subsystem. The guest is running with 32G for RAM on one
> > node of a E5-2630 v3. The host has had some power saving features disabled
> > by setting the /dev/cpu_dma_latency value to 10ms.
> >
> > Test                    page_fault1 (THP)     page_fault2
> > Name             tasks  Process Iter   STDEV  Process Iter   STDEV
> > Baseline             1    1203934.75   0.04%     379940.75   0.11%
> >                     16    8828217.00   0.85%    3178653.00   1.28%
> >
> > Patches applied      1    1207961.25   0.10%     380852.25   0.25%
> >                     16    8862373.00   0.98%    3246397.25   0.68%
> >
> > Patches enabled      1    1207758.75   0.17%     373079.25   0.60%
> >  MADV disabled      16    8870373.75   0.29%    3204989.75   1.08%
> >
> > Patches enabled      1    1261183.75   0.39%     373201.50   0.50%
> >                     16    8371359.75   0.65%    3233665.50   0.84%
> >
> > Patches enabled      1    1090201.50   0.25%     376967.25   0.29%
> >  page shuffle       16    8108719.75   0.58%    3218450.25   1.07%
> >
> > The results above are for a baseline with a linux-next-20191115 kernel,
> > that kernel with this patch set applied but page reporting disabled in
> > virtio-balloon, patches applied but the madvise disabled by direct
> > assigning a device, the patches applied and page reporting fully
> > enabled, and the patches enabled with page shuffling enabled. These
> > results include the deviation seen between the average value reported here
> > versus the high and/or low value. I observed that during the test memory
> > usage for the first three tests never dropped whereas with the patches
> > fully enabled the VM would drop to using only a few GB of the host's
> > memory when switching from memhog to page fault tests.
> >
> > Most of the overhead seen with this patch set enabled seems due to page
> > faults caused by accessing the reported pages and the host zeroing the page
> > before giving it back to the guest. This overhead is much more visible when
> > using THP than with standard 4K pages. In addition page shuffling seemed to
> > increase the amount of faults generated due to an increase in memory churn.
>
> MADV_FREE would be interesting.

I can probably code something up. However that is going to push a bunch of
complexity into the QEMU code and doesn't really mean much to the kernel
code. I can probably add it as another QEMU patch to the set since it is
just a matter of having a function similar to ram_block_discard_range that
uses MADV_FREE instead of MADV_DONTNEED.

> > The overall guest size is kept fairly small to only a few GB while the test
> > is running. If the host memory were oversubscribed this patch set should
> > result in a performance improvement as swapping memory in the host can be
> > avoided.
> >
> > A brief history on the background of unused page reporting can be found at:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Changes from v12:
> > https://lore.kernel.org/lkml/[email protected]/
> > Rebased on linux-next 20191031
> > Renamed page_is_reported to page_reported
> > Renamed add_page_to_reported_list to mark_page_reported
> > Dropped unused definition of add_page_to_reported_list for non-reporting case
> > Split free_area_reporting out from get_unreported_tail
> > Minor updates to cover page
> >
> > Changes from v13:
> > https://lore.kernel.org/lkml/[email protected]/
> > Rewrote core reporting functionality
> > Merged patches 3 & 4
> > Dropped boundary list and related code
> > Folded get_reported_page into page_reporting_fill
> > Folded page_reporting_fill into page_reporting_cycle
> > Pulled reporting functionality out of free_reported_page
> > Renamed it to __free_isolated_page
> > Moved page reporting specific bits to page_reporting_drain
> > Renamed phdev to prdev since we aren't "hinting" we are "reporting"
> > Added documentation to describe the usage of unused page reporting
> > Updated cover page and patch descriptions to avoid mention of boundary
> >
> >
> > ---
> >
> > Alexander Duyck (6):
> > mm: Adjust shuffle code to allow for future coalescing
> > mm: Use zone and order instead of free area in free_list manipulators
> > mm: Introduce Reported pages
> > mm: Add unused page reporting documentation
> > virtio-balloon: Pull page poisoning config out of free page hinting
> > virtio-balloon: Add support for providing unused page reports to host
> >
> >
> > Documentation/vm/unused_page_reporting.rst | 44 ++++
> > drivers/virtio/Kconfig | 1
> > drivers/virtio/virtio_balloon.c | 88 +++++++
> > include/linux/mmzone.h | 56 +----
> > include/linux/page-flags.h | 11 +
> > include/linux/page_reporting.h | 31 +++
> > include/uapi/linux/virtio_balloon.h | 1
> > mm/Kconfig | 11 +
> > mm/Makefile | 1
> > mm/memory_hotplug.c | 2
> > mm/page_alloc.c | 181 +++++++++++----
> > mm/page_reporting.c | 337 ++++++++++++++++++++++++++++
> > mm/page_reporting.h | 125 ++++++++++
> > mm/shuffle.c | 12 -
> > mm/shuffle.h | 6
> > 15 files changed, 805 insertions(+), 102 deletions(-)
>
> So roughly 100 LOC less added, that's nice to see :)
>
> I'm planning to look into the details soon, just fairly busy lately. I
> hope Mel et al. can also comment.

Agreed. I can see if I can generate something to get the MADV_FREE
numbers. I suspect they will probably be somewhere between the MADV
disabled and fully enabled case, since we will still be taking the page
faults but not doing the zeroing.

2019-11-27 10:03:19

by David Hildenbrand

Subject: Re: [PATCH v14 0/6] mm / virtio: Provide support for unused page reporting

On 26.11.19 17:45, Alexander Duyck wrote:
> On Tue, 2019-11-26 at 13:20 +0100, David Hildenbrand wrote:
>> On 19.11.19 22:46, Alexander Duyck wrote:
>>> This series provides an asynchronous means of reporting unused guest
>>> pages to a hypervisor so that the memory associated with those pages can
>>> be dropped and reused by other processes and/or guests on the host. Using
>>> this it is possible to avoid unnecessary I/O to disk and greatly improve
>>> performance in the case of memory overcommit on the host.
>>>
>>> When enabled it will allocate a set of statistics to track the number of
>>> reported pages. When the nr_free for a given free area is greater than
>>> this by the high water mark we will schedule a worker to begin pulling the
>>> non-reported memory and to provide it to the reporting interface via a
>>> scatterlist.
>>>
>>> Currently this is only in use by virtio-balloon however there is the hope
>>> that at some point in the future other hypervisors might be able to make
>>> use of it. In the virtio-balloon/QEMU implementation the hypervisor is
>>> currently using MADV_DONTNEED to indicate to the host kernel that the page
>>> is currently unused. It will be zeroed and faulted back into the guest the
>>> next time the page is accessed.
>>
>> Remind me why we are using MADV_DONTNEED? Mostly for debugging purposes
>> right now, right? Did you do any measurements with MADV_FREE? I guess
>> there should be quite a performance increase in some scenarios.
>
> There are actually a few reasons for not using MADV_FREE.
>
> The first one was debugging as I could visibly see how much memory had
> been freed by just checking the memory consumption by the guest. I didn't
> have to wait for memory pressure to trigger the memory freeing. In
> addition it would force the pages out of the guest so it was much easier
> to see if I was freeing the wrong pages.
>
> The second reason is because it is much more portable. The MADV_FREE has
> only been a part of the Linux kernel since about 4.5. So if you are
> running on an older kernel the option might not be available.

I guess optionally enabling it (for !filebacked and !huge pages) in QEMU
after sensing would be possible. Fall back to ram_block_discard_range()
otherwise.

>
> The third reason is simply effort involved. If I used MADV_DONTNEED then I
> can just use ram_block_discard_range which is the same function used by
> other parts of the balloon driver.

Yes, that makes perfect sense.

>
> Finally, it is my understanding that MADV_FREE only works on anonymous
> memory (https://elixir.bootlin.com/linux/v5.4/source/mm/madvise.c#L700). I
> was concerned that using MADV_FREE wouldn't work if used on file backed
> memory such as hugetlbfs which is an option for QEMU if I am not mistaken.

Yes, MADV_FREE works just like MADV_DONTNEED only on anonymous memory.
In case of files/hugetlbfs you have to use

fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, ...).

E.g., see qemu/exec.c:ram_block_discard_range. You can do something
similar to this:


    static bool madv_free_sensed, madv_free_available;
    int ret = -EINVAL;

    /*
     * MADV_FREE only works on anonymous memory, and especially not on
     * hugetlbfs. Older kernels don't support it. A block without a
     * backing file has rb->fd == -1.
     */
    if (rb->page_size == qemu_host_page_size && rb->fd == -1 &&
        (!madv_free_sensed || madv_free_available)) {
        ret = madvise(start, length, MADV_FREE);
        if (ret) {
            /* Remember the failure so we don't retry on every call. */
            madv_free_sensed = true;
            madv_free_available = false;
        } else if (!madv_free_sensed) {
            madv_free_sensed = true;
            madv_free_available = true;
        }
    }

    /* fallback to MADV_DONTNEED / FALLOC_FL_PUNCH_HOLE */
    if (ret) {
        ram_block_discard_range(rb, start, length);
    }


I agree that something like this should be an addon to the current patch set.

>
>>> To track if a page is reported or not the Uptodate flag was repurposed and
>>> used as a Reported flag for Buddy pages. We walk through the free list
>>> isolating pages and adding them to the scatterlist until we either
>>> encounter the end of the list or have filled the scatterlist with pages to
>>> be reported. If we fill the scatterlist before we reach the end of the
>>> list we rotate the list so that the first unreported page we encounter is
>>> moved to the head of the list as that is where we will resume after we
>>> have freed the reported pages back into the tail of the list.
>>
>> So the boundary pointer didn't actually provide that big of a benefit I
>> assume (IOW, worst thing is you have to re-scan the whole list)?
>
> I rewrote the code quite a bit to get rid of the disadvantages.
> Specifically what the boundary pointer was doing was saving our place in
> the list when we left. Even without that we still had to re-scan the
> entire list with each zone processed anyway. With these changes we end up
> potentially having to perform one additional rescan per free list.
>
> Where things differ now is that the fetching function doesn't bail out of
> the list and start over per page. Instead it fills the entire scatterlist
> before it exits, and before doing so it will advance the head to the next
> non-reported page in the list. In addition instead of walking all of the
> orders and migrate types looking for each page the code is now more
> methodical and will only work one free list at a time and does not revisit
> it until we have processed the entire zone.

Makes perfect sense to me.

>
> Even with all that we still take a pretty significant performance hit in
> the page shuffling case, however I am willing to give that up for the sake
> of being less intrusive.

Makes sense as well, especially for a first version.

>
>>> Below are the results from various benchmarks. I primarily focused on two
>>> tests. The first is the will-it-scale/page_fault2 test, and the other is
>>> a modified version of will-it-scale/page_fault1 that was enabled to use
>>> THP. I did this as it allows for better visibility into different parts
>>> of the memory subsystem. The guest is running with 32G for RAM on one
>>> node of a E5-2630 v3. The host has had some power saving features disabled
>>> by setting the /dev/cpu_dma_latency value to 10ms.
>>>
>>> Test                    page_fault1 (THP)     page_fault2
>>> Name             tasks  Process Iter   STDEV  Process Iter   STDEV
>>> Baseline             1    1203934.75   0.04%     379940.75   0.11%
>>>                     16    8828217.00   0.85%    3178653.00   1.28%
>>>
>>> Patches applied      1    1207961.25   0.10%     380852.25   0.25%
>>>                     16    8862373.00   0.98%    3246397.25   0.68%
>>>
>>> Patches enabled      1    1207758.75   0.17%     373079.25   0.60%
>>>  MADV disabled      16    8870373.75   0.29%    3204989.75   1.08%
>>>
>>> Patches enabled      1    1261183.75   0.39%     373201.50   0.50%
>>>                     16    8371359.75   0.65%    3233665.50   0.84%
>>>
>>> Patches enabled      1    1090201.50   0.25%     376967.25   0.29%
>>>  page shuffle       16    8108719.75   0.58%    3218450.25   1.07%
>>>
>>> The results above are for a baseline with a linux-next-20191115 kernel,
>>> that kernel with this patch set applied but page reporting disabled in
>>> virtio-balloon, patches applied but the madvise disabled by direct
>>> assigning a device, the patches applied and page reporting fully
>>> enabled, and the patches enabled with page shuffling enabled. These
>>> results include the deviation seen between the average value reported here
>>> versus the high and/or low value. I observed that during the test memory
>>> usage for the first three tests never dropped whereas with the patches
>>> fully enabled the VM would drop to using only a few GB of the host's
>>> memory when switching from memhog to page fault tests.
>>>
>>> Most of the overhead seen with this patch set enabled seems due to page
>>> faults caused by accessing the reported pages and the host zeroing the page
>>> before giving it back to the guest. This overhead is much more visible when
>>> using THP than with standard 4K pages. In addition page shuffling seemed to
>>> increase the amount of faults generated due to an increase in memory churn.
>>
>> MADV_FREE would be interesting.
>
> I can probably code something up. However that is going to push a bunch of
> complexity into the QEMU code and doesn't really mean much to the kernel
> code. I can probably add it as another QEMU patch to the set since it is
> just a matter of having a function similar to ram_block_discard_range that
> uses MADV_FREE instead of MADV_DONTNEED.

Yes, addon patch makes perfect sense. The nice thing about MADV_FREE is
that you only take back pages from a process when really under memory
pressure (before going to SWAP). You will still get a pagefault on the
next access (to identify that the page is still in use after all), but
don't have to fault in a fresh page.
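
The behaviour described above (the page is only a candidate for reclaim, and touching it again keeps it in use) can be sketched from userspace. This is a minimal example under stated assumptions: a Linux host, a private anonymous mapping, and a hypothetical helper name `madv_free_then_rewrite`. The fallback `#define` is an assumption for libc headers that predate the constant; 8 is the asm-generic Linux value.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Older libc headers may lack the constant (assumed asm-generic value). */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

/*
 * Hypothetical demo helper (not QEMU code). Returns 1 if MADV_FREE was
 * accepted and data written afterwards reads back intact, 0 if the
 * running kernel rejects MADV_FREE with EINVAL, and -1 on any other
 * failure.
 */
int madv_free_then_rewrite(size_t len)
{
    assert(len > 0);

    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    /* Fault the pages in and dirty them. */
    memset(buf, 0xAA, len);

    if (madvise(buf, len, MADV_FREE) != 0) {
        int unsupported = (errno == EINVAL);

        munmap(buf, len);
        return unsupported ? 0 : -1;
    }

    /* Writing again tells the kernel the pages are in use after all. */
    memset(buf, 0x55, len);

    int ok = (buf[0] == 0x55 && buf[len - 1] == 0x55);

    munmap(buf, len);
    return ok ? 1 : -1;
}
```

A return value of 0 corresponds to the "older kernel" case, where a caller would fall back to MADV_DONTNEED.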

>
>>> The overall guest size is kept fairly small to only a few GB while the test
>>> is running. If the host memory were oversubscribed this patch set should
>>> result in a performance improvement as swapping memory in the host can be
>>> avoided.
>>>
>>> A brief history on the background of unused page reporting can be found at:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Changes from v12:
>>> https://lore.kernel.org/lkml/[email protected]/
>>> Rebased on linux-next 20191031
>>> Renamed page_is_reported to page_reported
>>> Renamed add_page_to_reported_list to mark_page_reported
>>> Dropped unused definition of add_page_to_reported_list for non-reporting case
>>> Split free_area_reporting out from get_unreported_tail
>>> Minor updates to cover page
>>>
>>> Changes from v13:
>>> https://lore.kernel.org/lkml/[email protected]/
>>> Rewrote core reporting functionality
>>> Merged patches 3 & 4
>>> Dropped boundary list and related code
>>> Folded get_reported_page into page_reporting_fill
>>> Folded page_reporting_fill into page_reporting_cycle
>>> Pulled reporting functionality out of free_reported_page
>>> Renamed it to __free_isolated_page
>>> Moved page reporting specific bits to page_reporting_drain
>>> Renamed phdev to prdev since we aren't "hinting" we are "reporting"
>>> Added documentation to describe the usage of unused page reporting
>>> Updated cover page and patch descriptions to avoid mention of boundary
>>>
>>>
>>> ---
>>>
>>> Alexander Duyck (6):
>>> mm: Adjust shuffle code to allow for future coalescing
>>> mm: Use zone and order instead of free area in free_list manipulators
>>> mm: Introduce Reported pages
>>> mm: Add unused page reporting documentation
>>> virtio-balloon: Pull page poisoning config out of free page hinting
>>> virtio-balloon: Add support for providing unused page reports to host
>>>
>>>
>>> Documentation/vm/unused_page_reporting.rst | 44 ++++
>>> drivers/virtio/Kconfig | 1
>>> drivers/virtio/virtio_balloon.c | 88 +++++++
>>> include/linux/mmzone.h | 56 +----
>>> include/linux/page-flags.h | 11 +
>>> include/linux/page_reporting.h | 31 +++
>>> include/uapi/linux/virtio_balloon.h | 1
>>> mm/Kconfig | 11 +
>>> mm/Makefile | 1
>>> mm/memory_hotplug.c | 2
>>> mm/page_alloc.c | 181 +++++++++++----
>>> mm/page_reporting.c | 337 ++++++++++++++++++++++++++++
>>> mm/page_reporting.h | 125 ++++++++++
>>> mm/shuffle.c | 12 -
>>> mm/shuffle.h | 6
>>> 15 files changed, 805 insertions(+), 102 deletions(-)
>>
>> So roughly 100 LOC less added, that's nice to see :)
>>
>> I'm planning to look into the details soon, just fairly busy lately. I
>> hope Mel et al. can also comment.
>
> Agreed. I can see if I can generate something to get the MADV_FREE
> numbers. I suspect they will probably be somewhere between the MADV
> disabled and fully enabled case, since we will still be taking the page
> faults but not doing the zeroing.

Exactly.

--
Thanks,

David / dhildenb

2019-11-27 17:37:45

by Alexander Duyck

Subject: Re: [PATCH v14 0/6] mm / virtio: Provide support for unused page reporting

On Wed, 2019-11-27 at 11:01 +0100, David Hildenbrand wrote:
> On 26.11.19 17:45, Alexander Duyck wrote:
> > On Tue, 2019-11-26 at 13:20 +0100, David Hildenbrand wrote:
> > > On 19.11.19 22:46, Alexander Duyck wrote:

<snip>

> > > > Below are the results from various benchmarks. I primarily focused on two
> > > > tests. The first is the will-it-scale/page_fault2 test, and the other is
> > > > a modified version of will-it-scale/page_fault1 that was enabled to use
> > > > THP. I did this as it allows for better visibility into different parts
> > > > of the memory subsystem. The guest is running with 32G for RAM on one
> > > > node of a E5-2630 v3. The host has had some power saving features disabled
> > > > by setting the /dev/cpu_dma_latency value to 10ms.
> > > >
> > > > Test                    page_fault1 (THP)     page_fault2
> > > > Name             tasks  Process Iter   STDEV  Process Iter   STDEV
> > > > Baseline             1    1203934.75   0.04%     379940.75   0.11%
> > > >                     16    8828217.00   0.85%    3178653.00   1.28%
> > > >
> > > > Patches applied      1    1207961.25   0.10%     380852.25   0.25%
> > > >                     16    8862373.00   0.98%    3246397.25   0.68%
> > > >
> > > > Patches enabled      1    1207758.75   0.17%     373079.25   0.60%
> > > >  MADV disabled      16    8870373.75   0.29%    3204989.75   1.08%
> > > >
> > > > Patches enabled      1    1261183.75   0.39%     373201.50   0.50%
> > > >                     16    8371359.75   0.65%    3233665.50   0.84%
> > > >
> > > > Patches enabled      1    1090201.50   0.25%     376967.25   0.29%
> > > >  page shuffle       16    8108719.75   0.58%    3218450.25   1.07%
> > > >
> > > > The results above are for a baseline with a linux-next-20191115 kernel,
> > > > that kernel with this patch set applied but page reporting disabled in
> > > > virtio-balloon, patches applied but the madvise disabled by direct
> > > > assigning a device, the patches applied and page reporting fully
> > > > enabled, and the patches enabled with page shuffling enabled. These
> > > > results include the deviation seen between the average value reported here
> > > > versus the high and/or low value. I observed that during the test memory
> > > > usage for the first three tests never dropped whereas with the patches
> > > > fully enabled the VM would drop to using only a few GB of the host's
> > > > memory when switching from memhog to page fault tests.
> > > >
> > > > Most of the overhead seen with this patch set enabled seems due to page
> > > > faults caused by accessing the reported pages and the host zeroing the page
> > > > before giving it back to the guest. This overhead is much more visible when
> > > > using THP than with standard 4K pages. In addition page shuffling seemed to
> > > > increase the amount of faults generated due to an increase in memory churn.
> > >
> > > MADV_FREE would be interesting.
> >
> > I can probably code something up. However that is going to push a bunch of
> > complexity into the QEMU code and doesn't really mean much to the kernel
> > code. I can probably add it as another QEMU patch to the set since it is
> > just a matter of having a function similar to ram_block_discard_range that
> > uses MADV_FREE instead of MADV_DONTNEED.
>
> Yes, addon patch makes perfect sense. The nice thing about MADV_FREE is
> that you only take back pages from a process when really under memory
> pressure (before going to SWAP). You will still get a pagefault on the
> next access (to identify that the page is still in use after all), but
> don't have to fault in a fresh page.

So I got things running with a proof of concept using MADV_FREE.
Apparently another roadblock I hadn't realized is that you have to have
the right version of glibc for MADV_FREE to be present.

Anyway with MADV_FREE the numbers actually look pretty close to the
numbers with the madvise disabled. Apparently the page fault overhead
isn't all that significant. When I push the next patch set I will include
the actual numbers, but even with shuffling enabled the results were in
the 8.7 to 8.8 million iteration range.
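
For scale, the 16-task page_fault1 (THP) figures from the cover letter table can be turned into relative changes against the baseline. The sketch below only does that arithmetic; `percent_change` and the `thp16_*` constant names are hypothetical, and the values are copied from the table.

```c
#include <assert.h>

/* Relative change of 'value' against 'baseline', in percent. */
double percent_change(double baseline, double value)
{
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}

/* 16-task page_fault1 (THP) results quoted from the cover letter table. */
const double thp16_baseline      = 8828217.00;
const double thp16_madv_disabled = 8870373.75;
const double thp16_enabled       = 8371359.75;
const double thp16_shuffle       = 8108719.75;
```

With these inputs, `percent_change(thp16_baseline, thp16_enabled)` comes out at roughly -5% and the page shuffle case at roughly -8%, while a result in the 8.7 to 8.8 million range sits within about 1.5% of the baseline.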

2019-11-27 17:39:16

by David Hildenbrand

Subject: Re: [PATCH v14 0/6] mm / virtio: Provide support for unused page reporting

On 27.11.19 18:36, Alexander Duyck wrote:
> On Wed, 2019-11-27 at 11:01 +0100, David Hildenbrand wrote:
>> On 26.11.19 17:45, Alexander Duyck wrote:
>>> On Tue, 2019-11-26 at 13:20 +0100, David Hildenbrand wrote:
>>>> On 19.11.19 22:46, Alexander Duyck wrote:
>
> <snip>
>
>>>>> Below are the results from various benchmarks. I primarily focused on two
>>>>> tests. The first is the will-it-scale/page_fault2 test, and the other is
>>>>> a modified version of will-it-scale/page_fault1 that was enabled to use
>>>>> THP. I did this as it allows for better visibility into different parts
>>>>> of the memory subsystem. The guest is running with 32G for RAM on one
>>>>> node of a E5-2630 v3. The host has had some power saving features disabled
>>>>> by setting the /dev/cpu_dma_latency value to 10ms.
>>>>>
>>>>> Test                    page_fault1 (THP)     page_fault2
>>>>> Name             tasks  Process Iter   STDEV  Process Iter   STDEV
>>>>> Baseline             1    1203934.75   0.04%     379940.75   0.11%
>>>>>                     16    8828217.00   0.85%    3178653.00   1.28%
>>>>>
>>>>> Patches applied      1    1207961.25   0.10%     380852.25   0.25%
>>>>>                     16    8862373.00   0.98%    3246397.25   0.68%
>>>>>
>>>>> Patches enabled      1    1207758.75   0.17%     373079.25   0.60%
>>>>>  MADV disabled      16    8870373.75   0.29%    3204989.75   1.08%
>>>>>
>>>>> Patches enabled      1    1261183.75   0.39%     373201.50   0.50%
>>>>>                     16    8371359.75   0.65%    3233665.50   0.84%
>>>>>
>>>>> Patches enabled      1    1090201.50   0.25%     376967.25   0.29%
>>>>>  page shuffle       16    8108719.75   0.58%    3218450.25   1.07%
>>>>>
>>>>> The results above are for a baseline with a linux-next-20191115 kernel,
>>>>> that kernel with this patch set applied but page reporting disabled in
>>>>> virtio-balloon, patches applied but the madvise disabled by direct
>>>>> assigning a device, the patches applied and page reporting fully
>>>>> enabled, and the patches enabled with page shuffling enabled. These
>>>>> results include the deviation seen between the average value reported here
>>>>> versus the high and/or low value. I observed that during the test memory
>>>>> usage for the first three tests never dropped whereas with the patches
>>>>> fully enabled the VM would drop to using only a few GB of the host's
>>>>> memory when switching from memhog to page fault tests.
>>>>>
>>>>> Most of the overhead seen with this patch set enabled seems due to page
>>>>> faults caused by accessing the reported pages and the host zeroing the page
>>>>> before giving it back to the guest. This overhead is much more visible when
>>>>> using THP than with standard 4K pages. In addition page shuffling seemed to
>>>>> increase the amount of faults generated due to an increase in memory churn.
>>>>
>>>> MADV_FREE would be interesting.
>>>
>>> I can probably code something up. However that is going to push a bunch of
>>> complexity into the QEMU code and doesn't really mean much to the kernel
>>> code. I can probably add it as another QEMU patch to the set since it is
>>> just a matter of having a function similar to ram_block_discard_range that
>>> uses MADV_FREE instead of MADV_DONTNEED.
>>
>> Yes, addon patch makes perfect sense. The nice thing about MADV_FREE is
>> that you only take back pages from a process when really under memory
>> pressure (before going to SWAP). You will still get a pagefault on the
>> next access (to identify that the page is still in use after all), but
>> don't have to fault in a fresh page.
>
> So I got things running with a proof of concept using MADV_FREE.
> Apparently another roadblock I hadn't realized is that you have to have
> the right version of glibc for MADV_FREE to be present.
>
> Anyway with MADV_FREE the numbers actually look pretty close to the
> numbers with the madvise disabled. Apparently the page fault overhead
> isn't all that significant. When I push the next patch set I will include
> the actual numbers, but even with shuffling enabled the results were in
> the 8.7 to 8.8 million iteration range.
>

Cool, thanks for evaluating!
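
For reference, the difference between the two hints can be sketched in
userspace. This is only an illustration under stated assumptions:
discard_hint() and first_byte_after_discard() are hypothetical names, not
QEMU's ram_block_discard_range(), and the #ifdef covers the case mentioned
above where the libc headers do not define MADV_FREE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: tell the kernel that [addr, addr + len) is unused.
 * With lazy != 0 it tries MADV_FREE first and falls back to MADV_DONTNEED
 * if MADV_FREE is unavailable or rejected.
 */
static int discard_hint(void *addr, size_t len, int lazy)
{
#ifdef MADV_FREE	/* only defined by sufficiently new libc headers */
	if (lazy && madvise(addr, len, MADV_FREE) == 0)
		return 0;
#else
	(void)lazy;
#endif
	return madvise(addr, len, MADV_DONTNEED);
}

/*
 * Dirty a private anonymous mapping, discard it, and return the first
 * byte read back afterwards (or -1 on error).
 */
static int first_byte_after_discard(int lazy)
{
	const size_t len = 1 << 20;
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int byte;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, len);
	if (discard_hint(p, len, lazy) != 0) {
		munmap(p, len);
		return -1;
	}

	byte = p[0];	/* faults the page back in if it was dropped */
	munmap(p, len);
	return byte;
}
```

With MADV_DONTNEED the read after the discard always sees a zeroed page.
With MADV_FREE the old contents may still be there if the host was not
under memory pressure, which is why no fresh page has to be faulted in.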

--
Thanks,

David / dhildenb

2019-11-28 15:28:02

by David Hildenbrand

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On 19.11.19 22:46, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Add support for the page reporting feature provided by virtio-balloon.
> Reporting differs from the regular balloon functionality in that it is
> much less durable than a standard memory balloon. Instead of creating a
> list of pages that cannot be accessed the pages are only inaccessible
> while they are being indicated to the virtio interface. Once the
> interface has acknowledged them they are placed back into their respective
> free lists and are once again accessible by the guest system.

Maybe add something like "In contrast to ordinary balloon
inflation/deflation, the guest can reuse all reported pages immediately
after reporting has finished, without having to notify the hypervisor
about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."

[...]

> /*
> * Balloon device works in 4K page units. So each page is pointed to by
> @@ -37,6 +38,9 @@
> #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
>
> +/* limit on the number of pages that can be on the reporting vq */
> +#define VIRTIO_BALLOON_VRING_HINTS_MAX 16

Maybe rename that from HINTS to REPORTS

> +
> #ifdef CONFIG_BALLOON_COMPACTION
> static struct vfsmount *balloon_mnt;
> #endif
> @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> VIRTIO_BALLOON_VQ_DEFLATE,
> VIRTIO_BALLOON_VQ_STATS,
> VIRTIO_BALLOON_VQ_FREE_PAGE,
> + VIRTIO_BALLOON_VQ_REPORTING,
> VIRTIO_BALLOON_VQ_MAX
> };
>
> @@ -113,6 +118,10 @@ struct virtio_balloon {
>
> /* To register a shrinker to shrink memory upon memory pressure */
> struct shrinker shrinker;
> +
> + /* Unused page reporting device */

Sounds like the device is unused :D

"Device info for reporting unused pages" ?

I am in general wondering, should we rename "unused" to "free". I.e.,
"free page reporting" instead of "unused page reporting"? Or what was
the motivation behind using "unused" ?

> + struct virtqueue *reporting_vq;
> + struct page_reporting_dev_info pr_dev_info;
> };
>
> static struct virtio_device_id id_table[] = {
> @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
>
> }
>
> +void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
> + unsigned int nents)
> +{
> + struct virtio_balloon *vb =
> + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> + struct virtqueue *vq = vb->reporting_vq;
> + unsigned int unused, err;
> +
> + /* We should always be able to add these buffers to an empty queue. */

This comment somewhat contradicts the error handling (and comment)
below. Maybe just drop it?

> + err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
> + GFP_NOWAIT | __GFP_NOWARN);
> +
> + /*
> + * In the extremely unlikely case that something has changed and we
> + * are able to trigger an error we will simply display a warning
> + * and exit without actually processing the pages.
> + */
> + if (WARN_ON(err))
> + return;

Maybe WARN_ON_ONCE? (to not flood the log on recurring errors)

> +
> + virtqueue_kick(vq);
> +
> + /* When host has read buffer, this completes via balloon_ack */
> + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));

Is it safe to rely on the same ack-ing mechanism as the inflate/deflate
queue? What if both mechanisms are used concurrently and race/both wait
for the hypervisor?

Maybe we need a separate vb->acked + callback function.

> +}
> +
> static void set_page_pfns(struct virtio_balloon *vb,
> __virtio32 pfns[], struct page *page)
> {
> @@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
> names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> + names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
>
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> @@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
> callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> }
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> + names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> + callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> + }
> +
> err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> vqs, callbacks, names, NULL, NULL);
> if (err)
> return err;
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> + vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> +

I'd register these in the same order they are defined (IOW, move this
further down)

> vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> @@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
> if (err)
> goto out_del_balloon_wq;
> }
> +
> + vb->pr_dev_info.report = virtballoon_unused_page_report;
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> + unsigned int capacity;
> +
> + capacity = min_t(unsigned int,
> + virtqueue_get_vring_size(vb->reporting_vq),
> + VIRTIO_BALLOON_VRING_HINTS_MAX);
> + vb->pr_dev_info.capacity = capacity;
> +
> + err = page_reporting_register(&vb->pr_dev_info);
> + if (err)
> + goto out_unregister_shrinker;
> + }

It can happen here that we start reporting before marking the device
ready. Can that be problematic?

Maybe we have to ignore any reports in virtballoon_unused_page_report()
until ready...

> +
> virtio_device_ready(vdev);
>
> if (towards_target(vb))
> virtballoon_changed(vdev);
> return 0;
>
> +out_unregister_shrinker:
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> + virtio_balloon_unregister_shrinker(vb);

A sync is done implicitly, right? So after this call, we won't get any
new callbacks/are stuck in a callback.

> out_del_balloon_wq:
> if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> destroy_workqueue(vb->balloon_wq);
> @@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> {
> struct virtio_balloon *vb = vdev->priv;
>
> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> + page_reporting_unregister(&vb->pr_dev_info);

Ditto, same question regarding syncs.

> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> virtio_balloon_unregister_shrinker(vb);
> spin_lock_irq(&vb->stop_update_lock);
> @@ -1038,6 +1102,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> VIRTIO_BALLOON_F_PAGE_POISON,
> + VIRTIO_BALLOON_F_REPORTING,
> };
>
> static struct virtio_driver virtio_balloon_driver = {
> diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> index a1966cd7b677..19974392d324 100644
> --- a/include/uapi/linux/virtio_balloon.h
> +++ b/include/uapi/linux/virtio_balloon.h
> @@ -36,6 +36,7 @@
> #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> +#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
>
> /* Size of a PFN in the balloon interface. */
> #define VIRTIO_BALLOON_PFN_SHIFT 12
>
>

Small and powerful patch :)

--
Thanks,

David / dhildenb

2019-11-28 17:01:57

by Michael S. Tsirkin

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On Thu, Nov 28, 2019 at 04:25:54PM +0100, David Hildenbrand wrote:
> On 19.11.19 22:46, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Add support for the page reporting feature provided by virtio-balloon.
> > Reporting differs from the regular balloon functionality in that it is
> > much less durable than a standard memory balloon. Instead of creating a
> > list of pages that cannot be accessed the pages are only inaccessible
> > while they are being indicated to the virtio interface. Once the
> > interface has acknowledged them they are placed back into their respective
> > free lists and are once again accessible by the guest system.
>
> Maybe add something like "In contrast to ordinary balloon
> inflation/deflation, the guest can reuse all reported pages immediately
> after reporting has finished, without having to notify the hypervisor
> about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."

Maybe we can make it apply. The effect of reporting a page is effectively
putting it in a balloon then immediately taking it out. Maybe without
VIRTIO_BALLOON_F_MUST_TELL_HOST the pages can be reused before host
marked buffers used?

We didn't teach existing page hinting to behave like this, but maybe we
should, and maybe it's not too late: not much time has passed
since it was merged, and the whole shrinker based thing
seems to have been broken ...


BTW generally UAPI patches will have to be sent to virtio-dev
mailing list before they are merged.

> [...]
>
> > /*
> > * Balloon device works in 4K page units. So each page is pointed to by
> > @@ -37,6 +38,9 @@
> > #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> > (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> >
> > +/* limit on the number of pages that can be on the reporting vq */
> > +#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
>
> Maybe rename that from HINTS to REPORTS
>
> > +
> > #ifdef CONFIG_BALLOON_COMPACTION
> > static struct vfsmount *balloon_mnt;
> > #endif
> > @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> > VIRTIO_BALLOON_VQ_DEFLATE,
> > VIRTIO_BALLOON_VQ_STATS,
> > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > + VIRTIO_BALLOON_VQ_REPORTING,
> > VIRTIO_BALLOON_VQ_MAX
> > };
> >
> > @@ -113,6 +118,10 @@ struct virtio_balloon {
> >
> > /* To register a shrinker to shrink memory upon memory pressure */
> > struct shrinker shrinker;
> > +
> > + /* Unused page reporting device */
>
> Sounds like the device is unused :D
>
> "Device info for reporting unused pages" ?
>
> I am in general wondering, should we rename "unused" to "free". I.e.,
> "free page reporting" instead of "unused page reporting"? Or what was
> the motivation behind using "unused" ?
>
> > + struct virtqueue *reporting_vq;
> > + struct page_reporting_dev_info pr_dev_info;
> > };
> >
> > static struct virtio_device_id id_table[] = {
> > @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >
> > }
> >
> > +void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
> > + unsigned int nents)
> > +{
> > + struct virtio_balloon *vb =
> > + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> > + struct virtqueue *vq = vb->reporting_vq;
> > + unsigned int unused, err;
> > +
> > + /* We should always be able to add these buffers to an empty queue. */
>
> This comment somewhat contradicts the error handling (and comment)
> below. Maybe just drop it?
>
> > + err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
> > + GFP_NOWAIT | __GFP_NOWARN);
> > +
> > + /*
> > + * In the extremely unlikely case that something has changed and we
> > + * are able to trigger an error we will simply display a warning
> > + * and exit without actually processing the pages.
> > + */
> > + if (WARN_ON(err))
> > + return;
>
> Maybe WARN_ON_ONCE? (to not flood the log on recurring errors)
>
> > +
> > + virtqueue_kick(vq);
> > +
> > + /* When host has read buffer, this completes via balloon_ack */
> > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
>
> Is it safe to rely on the same ack-ing mechanism as the inflate/deflate
> queue? What if both mechanisms are used concurrently and race/both wait
> for the hypervisor?
>
> Maybe we need a separate vb->acked + callback function.
>
> > +}
> > +
> > static void set_page_pfns(struct virtio_balloon *vb,
> > __virtio32 pfns[], struct page *page)
> > {
> > @@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > + names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
> >
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > @@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > }
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > + names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> > + callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> > + }
> > +
> > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > vqs, callbacks, names, NULL, NULL);
> > if (err)
> > return err;
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > + vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> > +
>
> I'd register these in the same order they are defined (IOW, move this
> further down)
>
> > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > @@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > if (err)
> > goto out_del_balloon_wq;
> > }
> > +
> > + vb->pr_dev_info.report = virtballoon_unused_page_report;
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > + unsigned int capacity;
> > +
> > + capacity = min_t(unsigned int,
> > + virtqueue_get_vring_size(vb->reporting_vq),
> > + VIRTIO_BALLOON_VRING_HINTS_MAX);
> > + vb->pr_dev_info.capacity = capacity;
> > +
> > + err = page_reporting_register(&vb->pr_dev_info);
> > + if (err)
> > + goto out_unregister_shrinker;
> > + }
>
> It can happen here that we start reporting before marking the device
> ready. Can that be problematic?
>
> Maybe we have to ignore any reports in virtballoon_unused_page_report()
> until ready...
>
> > +
> > virtio_device_ready(vdev);
> >
> > if (towards_target(vb))
> > virtballoon_changed(vdev);
> > return 0;
> >
> > +out_unregister_shrinker:
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > + virtio_balloon_unregister_shrinker(vb);
>
> A sync is done implicitly, right? So after this call, we won't get any
> new callbacks/are stuck in a callback.
>
> > out_del_balloon_wq:
> > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > destroy_workqueue(vb->balloon_wq);
> > @@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > {
> > struct virtio_balloon *vb = vdev->priv;
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > + page_reporting_unregister(&vb->pr_dev_info);
>
> Ditto, same question regarding syncs.
>
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > virtio_balloon_unregister_shrinker(vb);
> > spin_lock_irq(&vb->stop_update_lock);
> > @@ -1038,6 +1102,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > VIRTIO_BALLOON_F_PAGE_POISON,
> > + VIRTIO_BALLOON_F_REPORTING,
> > };
> >
> > static struct virtio_driver virtio_balloon_driver = {
> > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > index a1966cd7b677..19974392d324 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -36,6 +36,7 @@
> > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > +#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
> >
> > /* Size of a PFN in the balloon interface. */
> > #define VIRTIO_BALLOON_PFN_SHIFT 12
> >
> >
>
> Small and powerful patch :)
>
> --
> Thanks,
>
> David / dhildenb

2019-11-29 21:15:31

by Alexander Duyck

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On Thu, Nov 28, 2019 at 7:26 AM David Hildenbrand <[email protected]> wrote:
>
> On 19.11.19 22:46, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Add support for the page reporting feature provided by virtio-balloon.
> > Reporting differs from the regular balloon functionality in that it is
> > much less durable than a standard memory balloon. Instead of creating a
> > list of pages that cannot be accessed the pages are only inaccessible
> > while they are being indicated to the virtio interface. Once the
> > interface has acknowledged them they are placed back into their respective
> > free lists and are once again accessible by the guest system.
>
> Maybe add something like "In contrast to ordinary balloon
> inflation/deflation, the guest can reuse all reported pages immediately
> after reporting has finished, without having to notify the hypervisor
> about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."

Okay. I'll make a note of it for next version.

> [...]
>
> > /*
> > * Balloon device works in 4K page units. So each page is pointed to by
> > @@ -37,6 +38,9 @@
> > #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> > (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> >
> > +/* limit on the number of pages that can be on the reporting vq */
> > +#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
>
> Maybe rename that from HINTS to REPORTS

I'll fix it for the next version.

> > +
> > #ifdef CONFIG_BALLOON_COMPACTION
> > static struct vfsmount *balloon_mnt;
> > #endif
> > @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> > VIRTIO_BALLOON_VQ_DEFLATE,
> > VIRTIO_BALLOON_VQ_STATS,
> > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > + VIRTIO_BALLOON_VQ_REPORTING,
> > VIRTIO_BALLOON_VQ_MAX
> > };
> >
> > @@ -113,6 +118,10 @@ struct virtio_balloon {
> >
> > /* To register a shrinker to shrink memory upon memory pressure */
> > struct shrinker shrinker;
> > +
> > + /* Unused page reporting device */
>
> Sounds like the device is unused :D
>
> "Device info for reporting unused pages" ?
>
> I am in general wondering, should we rename "unused" to "free". I.e.,
> "free page reporting" instead of "unused page reporting"? Or what was
> the motivation behind using "unused" ?

I honestly don't remember why I chose "unused" at this point. I can
switch over to "free" if that is what is preferred.

Looking over the code a bit more I suspect the reason for avoiding it
is because free page hinting also mentioned reporting in a few spots.

> > + struct virtqueue *reporting_vq;
> > + struct page_reporting_dev_info pr_dev_info;
> > };
> >
> > static struct virtio_device_id id_table[] = {
> > @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> >
> > }
> >
> > +void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
> > + unsigned int nents)
> > +{
> > + struct virtio_balloon *vb =
> > + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> > + struct virtqueue *vq = vb->reporting_vq;
> > + unsigned int unused, err;
> > +
> > + /* We should always be able to add these buffers to an empty queue. */
>
> This comment somewhat contradicts the error handling (and comment)
> below. Maybe just drop it?
>
> > + err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
> > + GFP_NOWAIT | __GFP_NOWARN);
> > +
> > + /*
> > + * In the extremely unlikely case that something has changed and we
> > + * are able to trigger an error we will simply display a warning
> > + * and exit without actually processing the pages.
> > + */
> > + if (WARN_ON(err))
> > + return;
>
> Maybe WARN_ON_ONCE? (to not flood the log on recurring errors)

Actually I might need to tweak things here a bit. It occurs to me that
this can fail for more than just there not being space in the ring. I
forgot that DMA mapping needs to also occur so in the case of a DMA
mapping failure we would also see an error.

I probably will switch it to a WARN_ON_ONCE. I may also need to add a
return value to the function so that we can indicate that an entire
batch has failed and that we need to abort.
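
A rough userspace model of that idea, purely as a sketch: warn_on_once()
stands in for the kernel's WARN_ON_ONCE(), report_batch() is a hypothetical
version of the report callback that returns an error instead of void, and
add_err models whatever virtqueue_add_inbuf() returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

static int warnings_printed;

/* Stand-in for WARN_ON_ONCE(): returns cond, but logs only the first time. */
static bool warn_on_once(bool cond, bool *already_warned, const char *what)
{
	if (cond && !*already_warned) {
		*already_warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Hypothetical report callback: propagate the queueing error to the caller. */
static int report_batch(int add_err)
{
	static bool warned;

	if (warn_on_once(add_err != 0, &warned, "failed to queue page report"))
		return add_err;

	/* kicking the queue and waiting for the host would go here */
	return 0;
}

/* Caller side: abort at the first failed batch, return batches reported. */
static size_t report_all(const int *add_errs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (report_batch(add_errs[i]))
			break;
	return i;
}

/* Third batch fails (e.g. ring full or mapping failure), so two succeed. */
static size_t demo_reported(void)
{
	static const int errs[] = { 0, 0, -ENOMEM, 0 };

	return report_all(errs, sizeof(errs) / sizeof(errs[0]));
}
```

Calling demo_reported() repeatedly reports two batches each time, but the
warning is only printed on the first failure, so a recurring error does not
flood the log.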

> > +
> > + virtqueue_kick(vq);
> > +
> > + /* When host has read buffer, this completes via balloon_ack */
> > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
>
> Is it safe to rely on the same ack-ing mechanism as the inflate/deflate
> queue? What if both mechanisms are used concurrently and race/both wait
> for the hypervisor?
>
> Maybe we need a separate vb->acked + callback function.

So if I understand correctly what is actually happening is that the
wait event is simply a trigger that will wake us up, and at that point
we check to see if the buffer we submitted is done. If not we go back
to sleep. As such all we are really waiting on is the notification
that the buffers we submitted have been processed. So it is using the
same function but on a different virtual queue.
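
To illustrate the point, here is a small single-threaded model of that
behaviour. It is a sketch only: struct balloon_model and wait_for_queue()
are made-up names, and the acks[] array simply scripts the order in which
the host completes buffers, with every completion waking all sleepers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { Q_INFLATE, Q_REPORTING, Q_MAX };

/* used[q] is true once the "host" has returned a buffer on queue q. */
struct balloon_model {
	bool used[Q_MAX];
};

/*
 * Model of wait_event(vb->acked, virtqueue_get_buf(vq, &unused)).
 * Returns how many wake-ups were needed before our own queue had a
 * completed buffer, or -1 if it never completes.
 */
static int wait_for_queue(struct balloon_model *m, int q,
			  const int *acks, size_t n, size_t *pos)
{
	int wakeups = 0;

	while (!m->used[q]) {	/* condition is re-checked on each wake-up */
		if (*pos == n)
			return -1;
		m->used[acks[(*pos)++]] = true;	/* host completes a buffer */
		wakeups++;
	}
	m->used[q] = false;	/* get_buf consumes the completion */
	return wakeups;
}

/* Host acks the inflate queue first; the reporting waiter sleeps again. */
static int demo_reporting_wakeups(void)
{
	static const int acks[] = { Q_INFLATE, Q_REPORTING };
	struct balloon_model m = { { false } };
	size_t pos = 0;

	return wait_for_queue(&m, Q_REPORTING, acks, 2, &pos);
}

/* The inflate completion is not lost while the reporting waiter runs. */
static int demo_inflate_wakeups_afterwards(void)
{
	static const int acks[] = { Q_INFLATE, Q_REPORTING };
	struct balloon_model m = { { false } };
	size_t pos = 0;

	wait_for_queue(&m, Q_REPORTING, acks, 2, &pos);
	return wait_for_queue(&m, Q_INFLATE, acks, 2, &pos);
}
```

In this model the reporting waiter is woken twice and only proceeds on the
second wake-up, while the inflate waiter finds its buffer already completed
and does not need to sleep at all.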

> > +}
> > +
> > static void set_page_pfns(struct virtio_balloon *vb,
> > __virtio32 pfns[], struct page *page)
> > {
> > @@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > + names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
> >
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > @@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > }
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > + names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> > + callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> > + }
> > +
> > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > vqs, callbacks, names, NULL, NULL);
> > if (err)
> > return err;
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > + vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> > +
>
> I'd register these in the same order they are defined (IOW, move this
> further down)

done.

> > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > @@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > if (err)
> > goto out_del_balloon_wq;
> > }
> > +
> > + vb->pr_dev_info.report = virtballoon_unused_page_report;
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > + unsigned int capacity;
> > +
> > + capacity = min_t(unsigned int,
> > + virtqueue_get_vring_size(vb->reporting_vq),
> > + VIRTIO_BALLOON_VRING_HINTS_MAX);
> > + vb->pr_dev_info.capacity = capacity;
> > +
> > + err = page_reporting_register(&vb->pr_dev_info);
> > + if (err)
> > + goto out_unregister_shrinker;
> > + }
>
> It can happen here that we start reporting before marking the device
> ready. Can that be problematic?
>
> Maybe we have to ignore any reports in virtballoon_unused_page_report()
> until ready...

I don't think there is an issue with us putting buffers on the ring
before it is ready. I think it will just cause our function to sleep.

I'm guessing that is the case since init_vqs will add a buffer to the
stats vq and that happens even earlier in virtballoon_probe.

> > +
> > virtio_device_ready(vdev);
> >
> > if (towards_target(vb))
> > virtballoon_changed(vdev);
> > return 0;
> >
> > +out_unregister_shrinker:
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > + virtio_balloon_unregister_shrinker(vb);
>
> A sync is done implicitly, right? So after this call, we won't get any
> new callbacks/are stuck in a callback.

From what I can tell a read/write semaphore is used in
unregister_shrinker when we delete it from the list so it shouldn't be
an issue.

> > out_del_balloon_wq:
> > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > destroy_workqueue(vb->balloon_wq);
> > @@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > {
> > struct virtio_balloon *vb = vdev->priv;
> >
> > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > + page_reporting_unregister(&vb->pr_dev_info);
>
> Ditto, same question regarding syncs.

Yes, although for that one I was using pointer deletion, a barrier,
and a cancel_work_sync since I didn't support a list.

> > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > virtio_balloon_unregister_shrinker(vb);
> > spin_lock_irq(&vb->stop_update_lock);
> > @@ -1038,6 +1102,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > VIRTIO_BALLOON_F_PAGE_POISON,
> > + VIRTIO_BALLOON_F_REPORTING,
> > };
> >
> > static struct virtio_driver virtio_balloon_driver = {
> > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > index a1966cd7b677..19974392d324 100644
> > --- a/include/uapi/linux/virtio_balloon.h
> > +++ b/include/uapi/linux/virtio_balloon.h
> > @@ -36,6 +36,7 @@
> > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > +#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
> >
> > /* Size of a PFN in the balloon interface. */
> > #define VIRTIO_BALLOON_PFN_SHIFT 12
> >
> >
>
> Small and powerful patch :)

Agreed. Although we will have to see if we can keep it that way.
Ideally I want to leave this with the ability to specify what size
scatterlist we receive. However if we have to flip it around then it
will force us to add logic for chopping up the scatterlist for
processing in chunks.
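
For what it is worth, the chunking logic itself would be small. A sketch,
assuming a hypothetical report_in_chunks() helper and callback type that do
not exist in the patch set; only the min(vring size, 16) capacity rule is
taken from virtballoon_probe() above.

```c
#include <assert.h>
#include <stddef.h>

#define REPORTS_MAX 16	/* mirrors the 16-entry limit used in the patch */

typedef void (*report_fn)(unsigned int first, unsigned int count, void *ctx);

static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* capacity = min(vring size, REPORTS_MAX), as computed in the probe path */
static unsigned int reporting_capacity(unsigned int vring_size)
{
	return min_uint(vring_size, REPORTS_MAX);
}

/*
 * Hypothetical splitter: hand nents scatterlist entries to report() in
 * chunks of at most capacity entries. Returns the number of chunks.
 */
static unsigned int report_in_chunks(unsigned int nents, unsigned int capacity,
				     report_fn report, void *ctx)
{
	unsigned int first = 0, chunks = 0;

	if (!capacity)
		return 0;

	while (nents) {
		unsigned int count = min_uint(capacity, nents);

		if (report)
			report(first, count, ctx);
		first += count;
		nents -= count;
		chunks++;
	}
	return chunks;
}

/* Example callback: add up how many entries were reported in total. */
static void count_entries(unsigned int first, unsigned int count, void *ctx)
{
	(void)first;
	*(unsigned int *)ctx += count;
}

static unsigned int demo_total_entries(void)
{
	unsigned int total = 0;

	report_in_chunks(40, reporting_capacity(128), count_entries, &total);
	return total;
}
```

With a capacity of 16, a 40-entry list would be processed as chunks of 16,
16 and 8 entries.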

Thanks for the review.

- Alex

2019-12-01 11:48:19

by Michael S. Tsirkin

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On Fri, Nov 29, 2019 at 01:13:32PM -0800, Alexander Duyck wrote:
> On Thu, Nov 28, 2019 at 7:26 AM David Hildenbrand <[email protected]> wrote:
> >
> > On 19.11.19 22:46, Alexander Duyck wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > Add support for the page reporting feature provided by virtio-balloon.
> > > Reporting differs from the regular balloon functionality in that it is
> > > much less durable than a standard memory balloon. Instead of creating a
> > > list of pages that cannot be accessed the pages are only inaccessible
> > > while they are being indicated to the virtio interface. Once the
> > > interface has acknowledged them they are placed back into their respective
> > > free lists and are once again accessible by the guest system.
> >
> > Maybe add something like "In contrast to ordinary balloon
> > inflation/deflation, the guest can reuse all reported pages immediately
> > after reporting has finished, without having to notify the hypervisor
> > about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."
>
> Okay. I'll make a note of it for next version.


VIRTIO_BALLOON_F_MUST_TELL_HOST is IMHO misdocumented.
It states:
    VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host has to be told before
    pages from the balloon are used.
but really the balloon always told the host. The difference is in timing:
historically the balloon gave up pages before sending the
message and before waiting for the buffer to be used by the host.

I think this feature can be the same if we want.


> > [...]
> >
> > > /*
> > > * Balloon device works in 4K page units. So each page is pointed to by
> > > @@ -37,6 +38,9 @@
> > > #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> > > (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> > >
> > > +/* limit on the number of pages that can be on the reporting vq */
> > > +#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
> >
> > Maybe rename that from HINTS to REPORTS
>
> I'll fix it for the next version.
>
> > > +
> > > #ifdef CONFIG_BALLOON_COMPACTION
> > > static struct vfsmount *balloon_mnt;
> > > #endif
> > > @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> > > VIRTIO_BALLOON_VQ_DEFLATE,
> > > VIRTIO_BALLOON_VQ_STATS,
> > > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > > + VIRTIO_BALLOON_VQ_REPORTING,
> > > VIRTIO_BALLOON_VQ_MAX
> > > };
> > >
> > > @@ -113,6 +118,10 @@ struct virtio_balloon {
> > >
> > > /* To register a shrinker to shrink memory upon memory pressure */
> > > struct shrinker shrinker;
> > > +
> > > + /* Unused page reporting device */
> >
> > Sounds like the device is unused :D
> >
> > "Device info for reporting unused pages" ?
> >
> > I am in general wondering, should we rename "unused" to "free". I.e.,
> > "free page reporting" instead of "unused page reporting"? Or what was
> > the motivation behind using "unused" ?
>
> I honestly don't remember why I chose "unused" at this point. I can
> switch over to "free" if that is what is preferred.
>
> Looking over the code a bit more I suspect the reason for avoiding it
> is because free page hinting also mentioned reporting in a few spots.
>
> > > + struct virtqueue *reporting_vq;
> > > + struct page_reporting_dev_info pr_dev_info;
> > > };
> > >
> > > static struct virtio_device_id id_table[] = {
> > > @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > >
> > > }
> > >
> > > +void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
> > > + unsigned int nents)
> > > +{
> > > + struct virtio_balloon *vb =
> > > + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> > > + struct virtqueue *vq = vb->reporting_vq;
> > > + unsigned int unused, err;
> > > +
> > > + /* We should always be able to add these buffers to an empty queue. */
> >
> > This comment somewhat contradicts the error handling (and comment)
> > below. Maybe just drop it?
> >
> > > + err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
> > > + GFP_NOWAIT | __GFP_NOWARN);
> > > +
> > > + /*
> > > + * In the extremely unlikely case that something has changed and we
> > > + * are able to trigger an error we will simply display a warning
> > > + * and exit without actually processing the pages.
> > > + */
> > > + if (WARN_ON(err))
> > > + return;
> >
> > Maybe WARN_ON_ONCE? (to not flood the log on recurring errors)
>
> Actually I might need to tweak things here a bit. It occurs to me that
> this can fail for more than just there not being space in the ring. I
> forgot that DMA mapping needs to also occur so in the case of a DMA
> mapping failure we would also see an error.

Balloon assumes DMA mapping is bypassed right now:

static int virtballoon_validate(struct virtio_device *vdev)
{
if (!page_poisoning_enabled())
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);

__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);

^^^^^^^^


return 0;
}

I don't think it can work with things like a bounce buffer.

> I probably will switch it to a WARN_ON_ONCE. I may also need to add a
> return value to the function so that we can indicate that an entire
> batch has failed and that we need to abort.
>
> > > +
> > > + virtqueue_kick(vq);
> > > +
> > > + /* When host has read buffer, this completes via balloon_ack */
> > > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> >
> > Is it safe to rely on the same ack-ing mechanism as the inflate/deflate
> > queue? What if both mechanisms are used concurrently and race/both wait
> > for the hypervisor?
> >
> > Maybe we need a separate vb->acked + callback function.
>
> So if I understand correctly what is actually happening is that the
> wait event is simply a trigger that will wake us up, and at that point
> we check to see if the buffer we submitted is done. If not we go back
> to sleep. As such all we are really waiting on is the notification
> that the buffers we submitted have been processed. So it is using the
> same function but on a different virtual queue.
>
> > > +}
> > > +
> > > static void set_page_pfns(struct virtio_balloon *vb,
> > > __virtio32 pfns[], struct page *page)
> > > {
> > > @@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > + names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
> > >
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > > @@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > }
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > > + names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> > > + callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> > > + }
> > > +
> > > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > > vqs, callbacks, names, NULL, NULL);
> > > if (err)
> > > return err;
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > > + vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> > > +
> >
> > I'd register these in the same order they are defined (IOW, move this
> > further down)
>
> done.
>
> > > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > @@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > > if (err)
> > > goto out_del_balloon_wq;
> > > }
> > > +
> > > + vb->pr_dev_info.report = virtballoon_unused_page_report;
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > > + unsigned int capacity;
> > > +
> > > + capacity = min_t(unsigned int,
> > > + virtqueue_get_vring_size(vb->reporting_vq),
> > > + VIRTIO_BALLOON_VRING_HINTS_MAX);
> > > + vb->pr_dev_info.capacity = capacity;
> > > +
> > > + err = page_reporting_register(&vb->pr_dev_info);
> > > + if (err)
> > > + goto out_unregister_shrinker;
> > > + }
> >
> > It can happen here that we start reporting before marking the device
> > ready. Can that be problematic?
> >
> > Maybe we have to ignore any reports in virtballoon_unused_page_report()
> > until ready...
>
> I don't think there is an issue with us putting buffers on the ring
> before it is ready. I think it will just cause our function to sleep.
>
> I'm guessing that is the case since init_vqs will add a buffer to the
> stats vq and that happens even earlier in virtballoon_probe.
>
> > > +
> > > virtio_device_ready(vdev);
> > >
> > > if (towards_target(vb))
> > > virtballoon_changed(vdev);
> > > return 0;
> > >
> > > +out_unregister_shrinker:
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > + virtio_balloon_unregister_shrinker(vb);
> >
> > A sync is done implicitly, right? So after this call, we won't get any
> > new callbacks/are stuck in a callback.
>
> From what I can tell a read/write semaphore is used in
> unregister_shrinker when we delete it from the list so it shouldn't be
> an issue.
>
> > > out_del_balloon_wq:
> > > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > > destroy_workqueue(vb->balloon_wq);
> > > @@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > > {
> > > struct virtio_balloon *vb = vdev->priv;
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > > + page_reporting_unregister(&vb->pr_dev_info);
> >
> > Ditto, same question regarding syncs.
>
> Yes, although for that one I was using pointer deletion, a barrier,
> and a cancel_work_sync since I didn't support a list.
>
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > virtio_balloon_unregister_shrinker(vb);
> > > spin_lock_irq(&vb->stop_update_lock);
> > > @@ -1038,6 +1102,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > > VIRTIO_BALLOON_F_PAGE_POISON,
> > > + VIRTIO_BALLOON_F_REPORTING,
> > > };
> > >
> > > static struct virtio_driver virtio_balloon_driver = {
> > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > > index a1966cd7b677..19974392d324 100644
> > > --- a/include/uapi/linux/virtio_balloon.h
> > > +++ b/include/uapi/linux/virtio_balloon.h
> > > @@ -36,6 +36,7 @@
> > > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > > +#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
> > >
> > > /* Size of a PFN in the balloon interface. */
> > > #define VIRTIO_BALLOON_PFN_SHIFT 12
> > >
> > >
> >
> > Small and powerful patch :)
>
> Agreed. Although we will have to see if we can keep it that way.
> Ideally I want to leave this with the ability to specify what size
> scatterlist we receive. However if we have to flip it around then it
> will force us to add logic for chopping up the scatterlist for
> processing in chunks.
>
> Thanks for the review.
>
> - Alex

2019-12-01 18:49:35

by Alexander Duyck

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On Sun, Dec 1, 2019 at 3:46 AM Michael S. Tsirkin wrote:
>
> On Fri, Nov 29, 2019 at 01:13:32PM -0800, Alexander Duyck wrote:
> > On Thu, Nov 28, 2019 at 7:26 AM David Hildenbrand wrote:
> > >
> > > On 19.11.19 22:46, Alexander Duyck wrote:
> > > > From: Alexander Duyck
> > > >
> > > > Add support for the page reporting feature provided by virtio-balloon.
> > > > Reporting differs from the regular balloon functionality in that it is
> > > > much less durable than a standard memory balloon. Instead of creating a
> > > > list of pages that cannot be accessed the pages are only inaccessible
> > > > while they are being indicated to the virtio interface. Once the
> > > > interface has acknowledged them they are placed back into their respective
> > > > free lists and are once again accessible by the guest system.
> > >
> > > Maybe add something like "In contrast to ordinary balloon
> > > inflation/deflation, the guest can reuse all reported pages immediately
> > > after reporting has finished, without having to notify the hypervisor
> > > about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."
> >
> > Okay. I'll make a note of it for next version.
>
>
> VIRTIO_BALLOON_F_MUST_TELL_HOST is IMHO misdocumented.
> It states:
> VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host has to be told before pages from the balloon are
> used.
> but really balloon always told host. The difference is in timing,
> historically balloon gave up pages before sending the
> message and before waiting for the buffer to be used by host.
>
> I think this feature can be the same if we want.

Okay. I'll still probably try to document the behavior a bit better though.

> > > [...]
> > >
> > > > /*
> > > > * Balloon device works in 4K page units. So each page is pointed to by
> > > > @@ -37,6 +38,9 @@
> > > > #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> > > > (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> > > >
> > > > +/* limit on the number of pages that can be on the reporting vq */
> > > > +#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
> > >
> > > Maybe rename that from HINTS to REPORTS
> >
> > I'll fix it for the next version.
> >
> > > > +
> > > > #ifdef CONFIG_BALLOON_COMPACTION
> > > > static struct vfsmount *balloon_mnt;
> > > > #endif
> > > > @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> > > > VIRTIO_BALLOON_VQ_DEFLATE,
> > > > VIRTIO_BALLOON_VQ_STATS,
> > > > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > > > + VIRTIO_BALLOON_VQ_REPORTING,
> > > > VIRTIO_BALLOON_VQ_MAX
> > > > };
> > > >
> > > > @@ -113,6 +118,10 @@ struct virtio_balloon {
> > > >
> > > > /* To register a shrinker to shrink memory upon memory pressure */
> > > > struct shrinker shrinker;
> > > > +
> > > > + /* Unused page reporting device */
> > >
> > > Sounds like the device is unused :D
> > >
> > > "Device info for reporting unused pages" ?
> > >
> > > I am in general wondering, should we rename "unused" to "free". I.e.,
> > > "free page reporting" instead of "unused page reporting"? Or what was
> > > the motivation behind using "unused" ?
> >
> > I honestly don't remember why I chose "unused" at this point. I can
> > switch over to "free" if that is what is preferred.
> >
> > Looking over the code a bit more I suspect the reason for avoiding it
> > is because free page hinting also mentioned reporting in a few spots.
> >
> > > > + struct virtqueue *reporting_vq;
> > > > + struct page_reporting_dev_info pr_dev_info;
> > > > };
> > > >
> > > > static struct virtio_device_id id_table[] = {
> > > > @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > > >
> > > > }
> > > >
> > > > +void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
> > > > + unsigned int nents)
> > > > +{
> > > > + struct virtio_balloon *vb =
> > > > + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> > > > + struct virtqueue *vq = vb->reporting_vq;
> > > > + unsigned int unused, err;
> > > > +
> > > > + /* We should always be able to add these buffers to an empty queue. */
> > >
> > > This comment somewhat contradicts the error handling (and comment)
> > > below. Maybe just drop it?
> > >
> > > > + err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
> > > > + GFP_NOWAIT | __GFP_NOWARN);
> > > > +
> > > > + /*
> > > > + * In the extremely unlikely case that something has changed and we
> > > > + * are able to trigger an error we will simply display a warning
> > > > + * and exit without actually processing the pages.
> > > > + */
> > > > + if (WARN_ON(err))
> > > > + return;
> > >
> > > Maybe WARN_ON_ONCE? (to not flood the log on recurring errors)
> >
> > Actually I might need to tweak things here a bit. It occurs to me that
> > this can fail for more than just there not being space in the ring. I
> > forgot that DMA mapping needs to also occur so in the case of a DMA
> > mapping failure we would also see an error.
>
> Balloon assumes DMA mapping is bypassed right now:
>
> static int virtballoon_validate(struct virtio_device *vdev)
> {
> if (!page_poisoning_enabled())
> __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
>
> __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
>
> ^^^^^^^^
>
>
> return 0;
> }
>
> I don't think it can work with things like a bounce buffer.

Right. It wouldn't work with a bounce buffer. I was thinking more of
something like an IOMMU. So it sounds like the device always uses a
direct mapping anyway.

In any case I will add some logic so that if we encounter an error we
will just abort the reporting. That way if another user has some issue
like that it can be dealt with sooner and we can avoid flagging pages
as reported when they were not.
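
A minimal userspace sketch of that abort-on-error behavior follows. The
names mock_page, report_fn, report_all and fail_on_second_call are
hypothetical and are not part of the patch; the sketch only assumes that
the report callback is changed to return an error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a free page tracked by the reporting core. */
struct mock_page {
	bool reported;
};

/* Hypothetical report callback: 0 on success, negative value on failure. */
typedef int (*report_fn)(struct mock_page *batch, size_t nents);

/*
 * Hand the pages to the callback in batches of at most `capacity`.
 * Pages are flagged as reported only after the callback succeeded for
 * their batch; the first failure aborts the whole pass.
 */
static int report_all(struct mock_page *pages, size_t npages,
		      size_t capacity, report_fn report)
{
	size_t off, i;

	assert(capacity > 0);

	for (off = 0; off < npages; off += capacity) {
		size_t left = npages - off;
		size_t nents = left < capacity ? left : capacity;
		int err = report(&pages[off], nents);

		if (err)
			return err;	/* abort: this batch stays unflagged */

		for (i = 0; i < nents; i++)
			pages[off + i].reported = true;
	}

	return 0;
}

/* Mock callback that fails on its second call (-5 stands in for -EIO). */
static int calls;

static int fail_on_second_call(struct mock_page *batch, size_t nents)
{
	(void)batch;
	(void)nents;
	return ++calls == 2 ? -5 : 0;
}
```

With six pages, a capacity of two and a callback that fails on its
second call, only the first two pages end up flagged as reported.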

- Alex

2019-12-02 10:45:58

by David Hildenbrand

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

[...]

>> Sounds like the device is unused :D
>>
>> "Device info for reporting unused pages" ?
>>
>> I am in general wondering, should we rename "unused" to "free". I.e.,
>> "free page reporting" instead of "unused page reporting"? Or what was
>> the motivation behind using "unused" ?
>
> I honestly don't remember why I chose "unused" at this point. I can
> switch over to "free" if that is what is preferred.
>
> Looking over the code a bit more I suspect the reason for avoiding it
> is because free page hinting also mentioned reporting in a few spots.

Maybe we should fix these cases. FWIW, I'd prefer "free page reporting".
(e.g., pairs nicely with "free page hinting").

>>> +
>>> + virtqueue_kick(vq);
>>> +
>>> + /* When host has read buffer, this completes via balloon_ack */
>>> + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
>>
>> Is it safe to rely on the same ack-ing mechanism as the inflate/deflate
>> queue? What if both mechanisms are used concurrently and race/both wait
>> for the hypervisor?
>>
>> Maybe we need a separate vb->acked + callback function.
>
> So if I understand correctly what is actually happening is that the
> wait event is simply a trigger that will wake us up, and at that point
> we check to see if the buffer we submitted is done. If not we go back
> to sleep. As such all we are really waiting on is the notification
> that the buffers we submitted have been processed. So it is using the
> same function but on a different virtual queue.

Very right, this is just a waitqueue (was only looking at this patch,
not the full code). This should indeed be fine.
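
To make the point above concrete, here is a single-threaded model of the
condition check, not the driver code itself. mock_vq, mock_host_complete
and mock_get_buf are hypothetical names; mock_get_buf plays the role that
virtqueue_get_buf() plays as the wait_event() condition, and each call
stands for one wake-up of the shared wait queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a virtqueue: counts buffers the host has used. */
struct mock_vq {
	unsigned int used;
};

/* Host side: mark one buffer on this queue as used. */
static void mock_host_complete(struct mock_vq *vq)
{
	assert(vq != NULL);
	vq->used++;
}

/*
 * Condition a waiter re-evaluates on every wake-up of the shared wait
 * queue. Returns true (and consumes the buffer) only if the waiter's own
 * queue has a used buffer; otherwise the waiter would go back to sleep.
 */
static bool mock_get_buf(struct mock_vq *vq)
{
	assert(vq != NULL);
	if (vq->used == 0)
		return false;	/* wake-up was for another queue */
	vq->used--;
	return true;
}
```

If the host completes a buffer on the inflate queue only, a waiter on the
reporting queue sees false and keeps waiting, while the inflate waiter
proceeds. Sharing one wait queue costs a spurious wake-up at worst.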

>>> vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
>>> vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
>>> if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
>>> @@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
>>> if (err)
>>> goto out_del_balloon_wq;
>>> }
>>> +
>>> + vb->pr_dev_info.report = virtballoon_unused_page_report;
>>> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
>>> + unsigned int capacity;
>>> +
>>> + capacity = min_t(unsigned int,
>>> + virtqueue_get_vring_size(vb->reporting_vq),
>>> + VIRTIO_BALLOON_VRING_HINTS_MAX);
>>> + vb->pr_dev_info.capacity = capacity;
>>> +
>>> + err = page_reporting_register(&vb->pr_dev_info);
>>> + if (err)
>>> + goto out_unregister_shrinker;
>>> + }
>>
>> It can happen here that we start reporting before marking the device
>> ready. Can that be problematic?
>>
>> Maybe we have to ignore any reports in virtballoon_unused_page_report()
>> until ready...
>
> I don't think there is an issue with us putting buffers on the ring
> before it is ready. I think it will just cause our function to sleep.
>
> I'm guessing that is the case since init_vqs will add a buffer to the
> stats vq and that happens even earlier in virtballoon_probe.
>

Interesting: "Note: vqs are enabled automatically after probe returns.".
Learned something new.

The virtballoon_changed(vdev) *after* virtio_device_ready(vdev) made me
wonder, because that could also fill the queues.

Maybe Michael can clarify.

>>> +
>>> virtio_device_ready(vdev);
>>>
>>> if (towards_target(vb))
>>> virtballoon_changed(vdev);
>>> return 0;
>>>
>>> +out_unregister_shrinker:
>>> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
>>> + virtio_balloon_unregister_shrinker(vb);
>>
>> A sync is done implicitly, right? So after this call, we won't get any
>> new callbacks/are stuck in a callback.
>
> From what I can tell a read/write semaphore is used in
> unregister_shrinker when we delete it from the list so it shouldn't be
> an issue.

Yes, makes sense.

>
>>> out_del_balloon_wq:
>>> if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
>>> destroy_workqueue(vb->balloon_wq);
>>> @@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
>>> {
>>> struct virtio_balloon *vb = vdev->priv;
>>>
>>> + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
>>> + page_reporting_unregister(&vb->pr_dev_info);
>>
>> Ditto, same question regarding syncs.
>
> Yes, although for that one I was using pointer deletion, a barrier,
> and a cancel_work_sync since I didn't support a list.

Okay, perfect.

[...]
>>
>> Small and powerful patch :)
>
> Agreed. Although we will have to see if we can keep it that way.
> Ideally I want to leave this with the ability to specify what size
> scatterlist we receive. However if we have to flip it around then it
> will force us to add logic for chopping up the scatterlist for
> processing in chunks.

I hope we can keep it like that. Otherwise each and every driver has to
implement this chopping-up (e.g., a hypervisor that can only send one
hint at a time - e.g., via a simple hypercall - would have to implement
that).
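
For illustration, this is roughly the chopping-up logic a driver would
have to carry if it could not choose the scatterlist size itself. It is
a userspace sketch under assumptions: mock_range, send_fn,
report_in_chunks, mock_stats and count_send are hypothetical names, and
max_per_call models a backend limit such as one hint per hypercall.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one reported range of free memory. */
struct mock_range {
	unsigned long pfn;
	unsigned int order;
};

/* Hypothetical backend hook: send up to max_per_call ranges at once. */
typedef int (*send_fn)(const struct mock_range *chunk, size_t n, void *ctx);

/* Split a batch of nents ranges into chunks the backend can accept. */
static int report_in_chunks(const struct mock_range *sg, size_t nents,
			    size_t max_per_call, send_fn send, void *ctx)
{
	size_t off = 0;

	assert(max_per_call > 0);

	while (off < nents) {
		size_t n = nents - off;
		int err;

		if (n > max_per_call)
			n = max_per_call;

		err = send(sg + off, n, ctx);
		if (err)
			return err;

		off += n;
	}

	return 0;
}

/* Example backend that only counts calls and entries. */
struct mock_stats {
	size_t calls;
	size_t entries;
};

static int count_send(const struct mock_range *chunk, size_t n, void *ctx)
{
	struct mock_stats *stats = ctx;

	(void)chunk;
	stats->calls++;
	stats->entries += n;
	return 0;
}
```

A batch of 16 ranges with max_per_call set to 1 results in 16 backend
calls; letting the driver declare its capacity up front avoids the loop.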


--
Thanks,

David / dhildenb

2019-12-04 17:50:53

by Alexander Duyck

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On Thu, Nov 28, 2019 at 9:00 AM Michael S. Tsirkin wrote:
>
> On Thu, Nov 28, 2019 at 04:25:54PM +0100, David Hildenbrand wrote:
> > On 19.11.19 22:46, Alexander Duyck wrote:
> > > From: Alexander Duyck
> > >
> > > Add support for the page reporting feature provided by virtio-balloon.
> > > Reporting differs from the regular balloon functionality in that it is
> > > much less durable than a standard memory balloon. Instead of creating a
> > > list of pages that cannot be accessed the pages are only inaccessible
> > > while they are being indicated to the virtio interface. Once the
> > > interface has acknowledged them they are placed back into their respective
> > > free lists and are once again accessible by the guest system.
> >
> > Maybe add something like "In contrast to ordinary balloon
> > inflation/deflation, the guest can reuse all reported pages immediately
> > after reporting has finished, without having to notify the hypervisor
> > about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."
>
> Maybe we can make it apply. The effect of reporting a page is effectively
> putting it in a balloon then immediately taking it out. Maybe without
> VIRTIO_BALLOON_F_MUST_TELL_HOST the pages can be reused before host
> marked buffers used?
>
> We didn't teach existing page hinting to behave like this, but maybe we
> should, and maybe it's not too late, not a long time passed
> since it was merged, and the whole shrinker based thing
> seems to have been broken ...
>
>
> BTW generally UAPI patches will have to be sent to virtio-dev
> mailing list before they are merged.
>
> > [...]
> >
> > > /*
> > > * Balloon device works in 4K page units. So each page is pointed to by
> > > @@ -37,6 +38,9 @@
> > > #define VIRTIO_BALLOON_FREE_PAGE_SIZE \
> > > (1 << (VIRTIO_BALLOON_FREE_PAGE_ORDER + PAGE_SHIFT))
> > >
> > > +/* limit on the number of pages that can be on the reporting vq */
> > > +#define VIRTIO_BALLOON_VRING_HINTS_MAX 16
> >
> > Maybe rename that from HINTS to REPORTS
> >
> > > +
> > > #ifdef CONFIG_BALLOON_COMPACTION
> > > static struct vfsmount *balloon_mnt;
> > > #endif
> > > @@ -46,6 +50,7 @@ enum virtio_balloon_vq {
> > > VIRTIO_BALLOON_VQ_DEFLATE,
> > > VIRTIO_BALLOON_VQ_STATS,
> > > VIRTIO_BALLOON_VQ_FREE_PAGE,
> > > + VIRTIO_BALLOON_VQ_REPORTING,
> > > VIRTIO_BALLOON_VQ_MAX
> > > };
> > >
> > > @@ -113,6 +118,10 @@ struct virtio_balloon {
> > >
> > > /* To register a shrinker to shrink memory upon memory pressure */
> > > struct shrinker shrinker;
> > > +
> > > + /* Unused page reporting device */
> >
> > Sounds like the device is unused :D
> >
> > "Device info for reporting unused pages" ?
> >
> > I am in general wondering, should we rename "unused" to "free". I.e.,
> > "free page reporting" instead of "unused page reporting"? Or what was
> > the motivation behind using "unused" ?
> >
> > > + struct virtqueue *reporting_vq;
> > > + struct page_reporting_dev_info pr_dev_info;
> > > };
> > >
> > > static struct virtio_device_id id_table[] = {
> > > @@ -152,6 +161,32 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq)
> > >
> > > }
> > >
> > > +void virtballoon_unused_page_report(struct page_reporting_dev_info *pr_dev_info,
> > > + unsigned int nents)
> > > +{
> > > + struct virtio_balloon *vb =
> > > + container_of(pr_dev_info, struct virtio_balloon, pr_dev_info);
> > > + struct virtqueue *vq = vb->reporting_vq;
> > > + unsigned int unused, err;
> > > +
> > > + /* We should always be able to add these buffers to an empty queue. */
> >
> > This comment somewhat contradicts the error handling (and comment)
> > below. Maybe just drop it?
> >
> > > + err = virtqueue_add_inbuf(vq, pr_dev_info->sg, nents, vb,
> > > + GFP_NOWAIT | __GFP_NOWARN);
> > > +
> > > + /*
> > > + * In the extremely unlikely case that something has changed and we
> > > + * are able to trigger an error we will simply display a warning
> > > + * and exit without actually processing the pages.
> > > + */
> > > + if (WARN_ON(err))
> > > + return;
> >
> > Maybe WARN_ON_ONCE? (to not flood the log on recurring errors)
> >
> > > +
> > > + virtqueue_kick(vq);
> > > +
> > > + /* When host has read buffer, this completes via balloon_ack */
> > > + wait_event(vb->acked, virtqueue_get_buf(vq, &unused));
> >
> > Is it safe to rely on the same ack-ing mechanism as the inflate/deflate
> > queue? What if both mechanisms are used concurrently and race/both wait
> > for the hypervisor?
> >
> > Maybe we need a separate vb->acked + callback function.
> >
> > > +}
> > > +
> > > static void set_page_pfns(struct virtio_balloon *vb,
> > > __virtio32 pfns[], struct page *page)
> > > {
> > > @@ -476,6 +511,7 @@ static int init_vqs(struct virtio_balloon *vb)
> > > names[VIRTIO_BALLOON_VQ_DEFLATE] = "deflate";
> > > names[VIRTIO_BALLOON_VQ_STATS] = NULL;
> > > names[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > + names[VIRTIO_BALLOON_VQ_REPORTING] = NULL;
> > >
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > names[VIRTIO_BALLOON_VQ_STATS] = "stats";
> > > @@ -487,11 +523,19 @@ static int init_vqs(struct virtio_balloon *vb)
> > > callbacks[VIRTIO_BALLOON_VQ_FREE_PAGE] = NULL;
> > > }
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > > + names[VIRTIO_BALLOON_VQ_REPORTING] = "reporting_vq";
> > > + callbacks[VIRTIO_BALLOON_VQ_REPORTING] = balloon_ack;
> > > + }
> > > +
> > > err = vb->vdev->config->find_vqs(vb->vdev, VIRTIO_BALLOON_VQ_MAX,
> > > vqs, callbacks, names, NULL, NULL);
> > > if (err)
> > > return err;
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > > + vb->reporting_vq = vqs[VIRTIO_BALLOON_VQ_REPORTING];
> > > +
> >
> > I'd register these in the same order they are defined (IOW, move this
> > further down)
> >
> > > vb->inflate_vq = vqs[VIRTIO_BALLOON_VQ_INFLATE];
> > > vb->deflate_vq = vqs[VIRTIO_BALLOON_VQ_DEFLATE];
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ)) {
> > > @@ -932,12 +976,30 @@ static int virtballoon_probe(struct virtio_device *vdev)
> > > if (err)
> > > goto out_del_balloon_wq;
> > > }
> > > +
> > > + vb->pr_dev_info.report = virtballoon_unused_page_report;
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) {
> > > + unsigned int capacity;
> > > +
> > > + capacity = min_t(unsigned int,
> > > + virtqueue_get_vring_size(vb->reporting_vq),
> > > + VIRTIO_BALLOON_VRING_HINTS_MAX);
> > > + vb->pr_dev_info.capacity = capacity;
> > > +
> > > + err = page_reporting_register(&vb->pr_dev_info);
> > > + if (err)
> > > + goto out_unregister_shrinker;
> > > + }
> >
> > It can happen here that we start reporting before marking the device
> > ready. Can that be problematic?
> >
> > Maybe we have to ignore any reports in virtballoon_unused_page_report()
> > until ready...
> >
> > > +
> > > virtio_device_ready(vdev);
> > >
> > > if (towards_target(vb))
> > > virtballoon_changed(vdev);
> > > return 0;
> > >
> > > +out_unregister_shrinker:
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > + virtio_balloon_unregister_shrinker(vb);
> >
> > A sync is done implicitly, right? So after this call, we won't get any
> > new callbacks/are stuck in a callback.
> >
> > > out_del_balloon_wq:
> > > if (virtio_has_feature(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> > > destroy_workqueue(vb->balloon_wq);
> > > @@ -966,6 +1028,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
> > > {
> > > struct virtio_balloon *vb = vdev->priv;
> > >
> > > + if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
> > > + page_reporting_unregister(&vb->pr_dev_info);
> >
> > Ditto, same question regarding syncs.
> >
> > > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> > > virtio_balloon_unregister_shrinker(vb);
> > > spin_lock_irq(&vb->stop_update_lock);
> > > @@ -1038,6 +1102,7 @@ static int virtballoon_validate(struct virtio_device *vdev)
> > > VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> > > VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > > VIRTIO_BALLOON_F_PAGE_POISON,
> > > + VIRTIO_BALLOON_F_REPORTING,
> > > };
> > >
> > > static struct virtio_driver virtio_balloon_driver = {
> > > diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h
> > > index a1966cd7b677..19974392d324 100644
> > > --- a/include/uapi/linux/virtio_balloon.h
> > > +++ b/include/uapi/linux/virtio_balloon.h
> > > @@ -36,6 +36,7 @@
> > > #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM 2 /* Deflate balloon on OOM */
> > > #define VIRTIO_BALLOON_F_FREE_PAGE_HINT 3 /* VQ to report free pages */
> > > #define VIRTIO_BALLOON_F_PAGE_POISON 4 /* Guest is using page poisoning */
> > > +#define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
> > >
> > > /* Size of a PFN in the balloon interface. */
> > > #define VIRTIO_BALLOON_PFN_SHIFT 12
> > >
> > >
> >
> > Small and powerful patch :)
> >
> > --
> > Thanks,
> >
> > David / dhildenb
>
>

2019-12-04 17:55:09

by Alexander Duyck

Subject: Re: [PATCH v14 6/6] virtio-balloon: Add support for providing unused page reports to host

On Thu, Nov 28, 2019 at 9:00 AM Michael S. Tsirkin wrote:
>
> On Thu, Nov 28, 2019 at 04:25:54PM +0100, David Hildenbrand wrote:
> > On 19.11.19 22:46, Alexander Duyck wrote:
> > > From: Alexander Duyck
> > >
> > > Add support for the page reporting feature provided by virtio-balloon.
> > > Reporting differs from the regular balloon functionality in that it is
> > > much less durable than a standard memory balloon. Instead of creating a
> > > list of pages that cannot be accessed the pages are only inaccessible
> > > while they are being indicated to the virtio interface. Once the
> > > interface has acknowledged them they are placed back into their respective
> > > free lists and are once again accessible by the guest system.
> >
> > Maybe add something like "In contrast to ordinary balloon
> > inflation/deflation, the guest can reuse all reported pages immediately
> > after reporting has finished, without having to notify the hypervisor
> > about it (e.g., VIRTIO_BALLOON_F_MUST_TELL_HOST does not apply)."
>
> Maybe we can make it apply. The effect of reporting a page is effectively
> putting it in a balloon then immediately taking it out. Maybe without
> VIRTIO_BALLOON_F_MUST_TELL_HOST the pages can be reused before host
> marked buffers used?
>
> We didn't teach existing page hinting to behave like this, but maybe we
> should, and maybe it's not too late, not a long time passed
> since it was merged, and the whole shrinker based thing
> seems to have been broken ...

The problem is that the existing hinting implementation relies on pushing
memory to the point of OOM in order to avoid having to re-hint pages.
What it is looking for is a snapshot rather than a running tally. The
page reporting bit approach would only work for the first migration.
Because the bit is persistent, unused pages would still be flagged as
reported when another migration starts, so those pages would not be
re-reported.
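
A small userspace model of that mismatch, assuming the reported flag is
never cleared between migrations. tracked_page and report_unreported are
hypothetical names and only mimic the "report what is free and not yet
flagged" behavior described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state: is it free, and was it already reported? */
struct tracked_page {
	bool free;
	bool reported;
};

/*
 * One reporting pass: report every free page that is not yet flagged,
 * flag it, and return how many pages were reported in this pass.
 */
static size_t report_unreported(struct tracked_page *pages, size_t npages)
{
	size_t i, n = 0;

	assert(pages != NULL || npages == 0);

	for (i = 0; i < npages; i++) {
		if (pages[i].free && !pages[i].reported) {
			pages[i].reported = true;	/* flag persists */
			n++;
		}
	}

	return n;
}
```

The first pass reports every free page, which is what a first migration
wants. A second pass over the same pages reports nothing, even though
they are still free, so a second migration would not get its snapshot.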

> BTW generally UAPI patches will have to be sent to virtio-dev
> mailing list before they are merged.

Do you need just the QEMU patches submitted to virtio-dev or both the
virtio kernel patches and the QEMU patches?

One piece of feedback I got was that it was annoying that I was
including virtio-dev since it requires a subscription to send to it.
If you would like I could apply it on the QEMU patches which would
make the changes more visible at least.

Thanks.

- Alex