2019-02-04 18:16:10

by Alexander Duyck

Subject: [RFC PATCH 0/4] kvm: Report unused guest pages to host

This patch set provides a mechanism by which guests can notify the host of
pages that are not currently in use. Using this data, a KVM host can more
easily balance memory workloads between guests and improve overall system
performance by avoiding unnecessary writes of unused pages to swap.

In order to support this I have added a new hypercall to provide unused
page hints and made use of mechanisms currently used by the PowerPC and
s390 architectures to provide those hints. To reduce the overhead of this
call I am only issuing it per huge page instead of doing a notification per
4K page. By doing this we can avoid the expense of fragmenting higher order
pages, and reduce the overall cost of the hypercall as it is only performed
once per huge page.

Because we are limiting this to huge pages, it was necessary to add a
secondary location for the call, since the buddy allocator can merge
smaller pages into a higher order huge page.

This approach is not usable in all cases. Specifically, when KVM direct
device assignment is used, the memory for a guest is permanently assigned
to physical pages in order to support DMA from the assigned device. In
this case we cannot give the pages back, so the hypercall is disabled by
the host.

Another situation that can lead to issues is if a page is accessed
immediately after being freed. For example, if page poisoning is enabled
the guest will write to the page *after* freeing it. In that case it does
not make sense to hint that the page is free, so we do not perform the
hypercalls from the guest when this functionality is enabled.

My testing up to now has consisted of setting up four 8GB VMs on a system
with 32GB of memory and 4GB of swap. To stress the memory on the system I
ran "memhog 8G" sequentially on each of the guests and observed how long it
took to complete the run. On systems with these patches applied in both the
guest and the host I was able to complete the test in 5 to 7 seconds per
guest. On a system without these patches the time ranged from 7 to 49
seconds per guest. I am assuming the variability is due to time spent
writing pages out to disk to free up space for the guest.

---

Alexander Duyck (4):
madvise: Expose ability to set dontneed from kernel
kvm: Add host side support for free memory hints
kvm: Add guest side support for free memory hints
mm: Add merge page notifier


Documentation/virtual/kvm/cpuid.txt | 4 ++
Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
arch/x86/include/asm/page.h | 25 +++++++++++++++
arch/x86/include/uapi/asm/kvm_para.h | 3 ++
arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
arch/x86/kvm/cpuid.c | 6 +++-
arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
include/linux/gfp.h | 4 ++
include/linux/mm.h | 2 +
include/uapi/linux/kvm_para.h | 1 +
mm/madvise.c | 13 +++++++-
mm/page_alloc.c | 2 +
12 files changed, 158 insertions(+), 2 deletions(-)

--


2019-02-04 18:16:22

by Alexander Duyck

Subject: [RFC PATCH 1/4] madvise: Expose ability to set dontneed from kernel

From: Alexander Duyck <[email protected]>

In order for a KVM host to act on notifications that a guest has freed
pages, it needs a mechanism to update the virtual memory associated with
the guest. To expose this functionality I am adding a new function,
do_madvise_dontneed, that can be used to indicate a region that a given VM
is done with.

Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/mm.h | 2 ++
mm/madvise.c | 13 ++++++++++++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e04396375cf9..eb668a5b4b4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2840,5 +2840,7 @@ static inline bool page_is_guard(struct page *page)
static inline void setup_nr_node_ids(void) {}
#endif

+int do_madvise_dontneed(unsigned long start, size_t len_in);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 21a7881a2db4..8730f7e0081a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -799,7 +799,7 @@ static int madvise_inject_error(int behavior,
* -EBADF - map exists, but area maps something that isn't a file.
* -EAGAIN - a kernel resource was temporarily unavailable.
*/
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int do_madvise(unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -894,3 +894,14 @@ static int madvise_inject_error(int behavior,

return error;
}
+
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+ return do_madvise(start, len_in, behavior);
+}
+
+int do_madvise_dontneed(unsigned long start, size_t len_in)
+{
+ return do_madvise(start, len_in, MADV_DONTNEED);
+}
+EXPORT_SYMBOL_GPL(do_madvise_dontneed);


2019-02-04 18:16:30

by Alexander Duyck

Subject: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

From: Alexander Duyck <[email protected]>

Add the host side of the KVM memory hinting support. With this we expose a
feature bit indicating that the host will pass the messages along to the
new madvise function.

This functionality is mutually exclusive with device assignment. If a
device is assigned we disable the feature, as it could lead to memory
corruption if the device writes to a page after KVM has flagged it as not
being used.

The logic as currently defined limits hints to hugepage-sized or larger
notifications. This is meant to help prevent us from breaking up huge
pages on the host by hinting that only a portion of one is not needed.

Signed-off-by: Alexander Duyck <[email protected]>
---
Documentation/virtual/kvm/cpuid.txt | 4 +++
Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++
arch/x86/include/uapi/asm/kvm_para.h | 3 +++
arch/x86/kvm/cpuid.c | 6 ++++-
arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++
include/uapi/linux/kvm_para.h | 1 +
6 files changed, 62 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 97ca1940a0dc..fe3395a58b7e 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -66,6 +66,10 @@ KVM_FEATURE_PV_SEND_IPI || 11 || guest checks this feature bit
|| || before using paravirtualized
|| || send IPIs.
------------------------------------------------------------------------------
+KVM_FEATURE_PV_UNUSED_PAGE_HINT || 12 || guest checks this feature bit
+ || || before using paravirtualized
+ || || unused page hints.
+------------------------------------------------------------------------------
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
|| || per-cpu warps are expected in
|| || kvmclock.
diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
index da24c138c8d1..b374678ac1f9 100644
--- a/Documentation/virtual/kvm/hypercalls.txt
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -141,3 +141,17 @@ a0 corresponds to the APIC ID in the third argument (a2), bit 1
corresponds to the APIC ID a2+1, and so on.

Returns the number of CPUs to which the IPIs were delivered successfully.
+
+7. KVM_HC_UNUSED_PAGE_HINT
+--------------------------
+Architecture: x86
+Status: active
+Purpose: Send unused page hint to host
+
+a0: physical address of region unused, page aligned
+a1: size of unused region, page aligned
+
+The hypercall lets a guest notify the host that it will no longer be
+using a given page in memory. Multiple pages can be hinted at once by
+using the size field to specify a higher order page size; the hint then
+covers the entire aligned region of that size.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 19980ec1a316..f066c23060df 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -29,6 +29,7 @@
#define KVM_FEATURE_PV_TLB_FLUSH 9
#define KVM_FEATURE_ASYNC_PF_VMEXIT 10
#define KVM_FEATURE_PV_SEND_IPI 11
+#define KVM_FEATURE_PV_UNUSED_PAGE_HINT 12

#define KVM_HINTS_REALTIME 0

@@ -119,4 +120,6 @@ struct kvm_vcpu_pv_apf_data {
#define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
#define KVM_PV_EOI_DISABLED 0x0

+#define KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER HUGETLB_PAGE_ORDER
+
#endif /* _UAPI_ASM_X86_KVM_PARA_H */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bbffa6c54697..b82bcbfbc420 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -136,6 +136,9 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
if (kvm_hlt_in_guest(vcpu->kvm) && best &&
(best->eax & (1 << KVM_FEATURE_PV_UNHALT)))
best->eax &= ~(1 << KVM_FEATURE_PV_UNHALT);
+ if (kvm_arch_has_assigned_device(vcpu->kvm) && best &&
+ (best->eax & (1 << KVM_FEATURE_PV_UNUSED_PAGE_HINT)))
+ best->eax &= ~(1 << KVM_FEATURE_PV_UNUSED_PAGE_HINT);

/* Update physical-address width */
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
@@ -637,7 +640,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_PV_UNHALT) |
(1 << KVM_FEATURE_PV_TLB_FLUSH) |
(1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
- (1 << KVM_FEATURE_PV_SEND_IPI);
+ (1 << KVM_FEATURE_PV_SEND_IPI) |
+ (1 << KVM_FEATURE_PV_UNUSED_PAGE_HINT);

if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d27206f6c01..3ec75ab849e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -55,6 +55,7 @@
#include <linux/irqbypass.h>
#include <linux/sched/stat.h>
#include <linux/mem_encrypt.h>
+#include <linux/mm.h>

#include <trace/events/kvm.h>

@@ -7052,6 +7053,37 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
}

+static int kvm_pv_unused_page_hint_op(struct kvm *kvm, gpa_t gpa, size_t len)
+{
+ unsigned long start;
+
+ /*
+ * Guarantee the following:
+ * len meets minimum size
+ * len is a power of 2
+ * gpa is aligned to len
+ */
+ if (len < (PAGE_SIZE << KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER))
+ return -KVM_EINVAL;
+ if (!is_power_of_2(len) || !IS_ALIGNED(gpa, len))
+ return -KVM_EINVAL;
+
+ /*
+ * If a device is assigned we cannot use madvise as the memory
+ * is shared with the device, and could be corrupted if the
+ * device writes to it after it is freed.
+ */
+ if (kvm_arch_has_assigned_device(kvm))
+ return -KVM_EOPNOTSUPP;
+
+ start = gfn_to_hva(kvm, gpa_to_gfn(gpa));
+
+ if (kvm_is_error_hva(start + len))
+ return -KVM_EFAULT;
+
+ return do_madvise_dontneed(start, len);
+}
+
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
unsigned long nr, a0, a1, a2, a3, ret;
@@ -7098,6 +7130,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_SEND_IPI:
ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
break;
+ case KVM_HC_UNUSED_PAGE_HINT:
+ ret = kvm_pv_unused_page_hint_op(vcpu->kvm, a0, a1);
+ break;
default:
ret = -KVM_ENOSYS;
break;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index 6c0ce49931e5..75643b862a4e 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -28,6 +28,7 @@
#define KVM_HC_MIPS_CONSOLE_OUTPUT 8
#define KVM_HC_CLOCK_PAIRING 9
#define KVM_HC_SEND_IPI 10
+#define KVM_HC_UNUSED_PAGE_HINT 11

/*
* hypercalls use architecture specific


2019-02-04 18:17:10

by Alexander Duyck

Subject: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

From: Alexander Duyck <[email protected]>

Add guest support for providing free memory hints to the KVM hypervisor
for freed pages of huge TLB size or larger. I am restricting the size to
huge TLB order and larger because the hypercalls are too expensive to
perform one per 4K page. The huge TLB order is the obvious choice for the
threshold as it allows us to avoid fragmenting higher order memory on the
host.

I have limited the functionality so that it doesn't work when page
poisoning is enabled. I did this because a write to the page after doing
an MADV_DONTNEED would effectively negate the hint, so it would be a waste
of cycles to do so.

Signed-off-by: Alexander Duyck <[email protected]>
---
arch/x86/include/asm/page.h | 13 +++++++++++++
arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
2 files changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 7555b48803a8..4487ad7a3385 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -18,6 +18,19 @@

struct page;

+#ifdef CONFIG_KVM_GUEST
+#include <linux/jump_label.h>
+extern struct static_key_false pv_free_page_hint_enabled;
+
+#define HAVE_ARCH_FREE_PAGE
+void __arch_free_page(struct page *page, unsigned int order);
+static inline void arch_free_page(struct page *page, unsigned int order)
+{
+ if (static_branch_unlikely(&pv_free_page_hint_enabled))
+ __arch_free_page(page, order);
+}
+#endif
+
#include <linux/range.h>
extern struct range pfn_mapped[];
extern int nr_pfn_mapped;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5c93a65ee1e5..09c91641c36c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -48,6 +48,7 @@
#include <asm/tlb.h>

static int kvmapf = 1;
+DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);

static int __init parse_no_kvmapf(char *arg)
{
@@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);

+ /*
+ * The free page hinting doesn't add much value if page poisoning
+ * is enabled. So we only enable the feature if page poisoning is
+ * not present.
+ */
+ if (!page_poisoning_enabled() &&
+ kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
+ static_branch_enable(&pv_free_page_hint_enabled);
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
@@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
}
arch_initcall(kvm_setup_pv_tlb_flush);

+void __arch_free_page(struct page *page, unsigned int order)
+{
+ /*
+ * Limit hints to blocks no smaller than the minimum hint
+ * order, to limit the cost of the hypercalls.
+ */
+ if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
+ return;
+
+ kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
+ PAGE_SIZE << order);
+}
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS

/* Kick a cpu by its apicid. Used to wake up a halted vcpu */


2019-02-04 18:17:21

by Alexander Duyck

Subject: [RFC PATCH 4/4] mm: Add merge page notifier

From: Alexander Duyck <[email protected]>

Because the implementation limits itself to providing hints only on pages
of huge TLB order or larger, free pages can slip past us when they are
freed as something smaller than huge TLB size and aggregated with buddies
later.

To address that I am adding a new call, arch_merge_page, which is invoked
after __free_one_page has merged a pair of pages to create a higher order
page. By doing this I am able to fill the gap and provide full coverage
for all of the pages of huge TLB order or larger.

Signed-off-by: Alexander Duyck <[email protected]>
---
arch/x86/include/asm/page.h | 12 ++++++++++++
arch/x86/kernel/kvm.c | 28 ++++++++++++++++++++++++++++
include/linux/gfp.h | 4 ++++
mm/page_alloc.c | 2 ++
4 files changed, 46 insertions(+)

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 4487ad7a3385..9540a97c9997 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
if (static_branch_unlikely(&pv_free_page_hint_enabled))
__arch_free_page(page, order);
}
+
+struct zone;
+
+#define HAVE_ARCH_MERGE_PAGE
+void __arch_merge_page(struct zone *zone, struct page *page,
+ unsigned int order);
+static inline void arch_merge_page(struct zone *zone, struct page *page,
+ unsigned int order)
+{
+ if (static_branch_unlikely(&pv_free_page_hint_enabled))
+ __arch_merge_page(zone, page, order);
+}
#endif

#include <linux/range.h>
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 09c91641c36c..957bb4f427bb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
PAGE_SIZE << order);
}

+void __arch_merge_page(struct zone *zone, struct page *page,
+ unsigned int order)
+{
+ /*
+ * The merging logic has merged a set of buddies up to the
+ * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
+ * advantage of this moment to notify the hypervisor of the free
+ * memory.
+ */
+ if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
+ return;
+
+ /*
+ * Drop zone lock while processing the hypercall. This
+ * should be safe as the page has not yet been added
+ * to the buddy list and all of the pages that were
+ * merged have had their buddy/guard flags cleared
+ * and their order reset to 0.
+ */
+ spin_unlock(&zone->lock);
+
+ kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
+ PAGE_SIZE << order);
+
+ /* reacquire lock and resume freeing memory */
+ spin_lock(&zone->lock);
+}
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS

/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de7490d..4746d5560193 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
#ifndef HAVE_ARCH_FREE_PAGE
static inline void arch_free_page(struct page *page, int order) { }
#endif
+#ifndef HAVE_ARCH_MERGE_PAGE
+static inline void
+arch_merge_page(struct zone *zone, struct page *page, int order) { }
+#endif
#ifndef HAVE_ARCH_ALLOC_PAGE
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c954f8c1fbc4..7a1309b0b7c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
page = page + (combined_pfn - pfn);
pfn = combined_pfn;
order++;
+
+ arch_merge_page(zone, page, order);
}
if (max_order < MAX_ORDER) {
/* If we are here, it means order is >= pageblock_order.


2019-02-04 18:20:14

by Alexander Duyck

Subject: [RFC PATCH QEMU] i386/kvm: Enable paravirtual unused page hint mechanism

From: Alexander Duyck <[email protected]>

This patch adds a flag named kvm-pv-unused-page-hint. The functionality
is enabled by KVM for x86 and provides a mechanism by which the guest can
indicate to the host which pages it is no longer using. By providing these
hints the guest can help reduce memory pressure on the host, since pages
marked as unused can simply be dropped rather than written out to swap.

Signed-off-by: Alexander Duyck <[email protected]>
---
target/i386/cpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2f5412592d30..0d19a9dc64f1 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -900,7 +900,7 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
"kvmclock", "kvm-nopiodelay", "kvm-mmu", "kvmclock",
"kvm-asyncpf", "kvm-steal-time", "kvm-pv-eoi", "kvm-pv-unhalt",
NULL, "kvm-pv-tlb-flush", NULL, "kvm-pv-ipi",
- NULL, NULL, NULL, NULL,
+ "kvm-pv-unused-page-hint", NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
"kvmclock-stable-bit", NULL, NULL, NULL,


2019-02-04 21:30:21

by Dave Hansen

Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

> +void __arch_merge_page(struct zone *zone, struct page *page,
> + unsigned int order)
> +{
> + /*
> + * The merging logic has merged a set of buddies up to the
> + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> + * advantage of this moment to notify the hypervisor of the free
> + * memory.
> + */
> + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> + return;
> +
> + /*
> + * Drop zone lock while processing the hypercall. This
> + * should be safe as the page has not yet been added
> + * to the buddy list as of yet and all the pages that
> + * were merged have had their buddy/guard flags cleared
> + * and their order reset to 0.
> + */
> + spin_unlock(&zone->lock);
> +
> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> + PAGE_SIZE << order);
> +
> + /* reacquire lock and resume freeing memory */
> + spin_lock(&zone->lock);
> +}

Why do the lock-dropping on merge but not free? What's the difference?

This makes me really nervous. You at *least* want to document this at
the arch_merge_page() call-site, and perhaps even the __free_one_page()
call-sites because they're near where the zone lock is taken.

The place you are calling arch_merge_page() looks OK to me, today. But,
it can't get moved around without careful consideration. That also
needs to be documented to warn off folks who might move code around.

The interaction between the free and merge hooks is also really
implementation-specific. If an architecture is getting order-0
arch_free_page() notifications, it's probably worth at least documenting
that they'll *also* get arch_merge_page() notifications.

The reason x86 doesn't double-hypercall on those is not broached in the
descriptions. That seems to be problematic.

2019-02-04 21:33:06

by Dave Hansen

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On 2/4/19 10:15 AM, Alexander Duyck wrote:
> +#ifdef CONFIG_KVM_GUEST
> +#include <linux/jump_label.h>
> +extern struct static_key_false pv_free_page_hint_enabled;
> +
> +#define HAVE_ARCH_FREE_PAGE
> +void __arch_free_page(struct page *page, unsigned int order);
> +static inline void arch_free_page(struct page *page, unsigned int order)
> +{
> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> + __arch_free_page(page, order);
> +}
> +#endif

So, this ends up with at least a call, a branch and a ret added to the
order-0 paths, including freeing pages to the per-cpu-pageset lists.
That seems worrisome.

What performance testing has been performed to look into the overhead
added to those paths?

2019-02-04 21:33:13

by Alexander Duyck

Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On Mon, 2019-02-04 at 11:40 -0800, Dave Hansen wrote:
> > +void __arch_merge_page(struct zone *zone, struct page *page,
> > + unsigned int order)
> > +{
> > + /*
> > + * The merging logic has merged a set of buddies up to the
> > + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> > + * advantage of this moment to notify the hypervisor of the free
> > + * memory.
> > + */
> > + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > + return;
> > +
> > + /*
> > + * Drop zone lock while processing the hypercall. This
> > + * should be safe as the page has not yet been added
> > + * to the buddy list as of yet and all the pages that
> > + * were merged have had their buddy/guard flags cleared
> > + * and their order reset to 0.
> > + */
> > + spin_unlock(&zone->lock);
> > +
> > + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > + PAGE_SIZE << order);
> > +
> > + /* reacquire lock and resume freeing memory */
> > + spin_lock(&zone->lock);
> > +}
>
> Why do the lock-dropping on merge but not free? What's the difference?

The lock has not yet been acquired in the free path. The arch_free_page
call is made from free_pages_prepare, whereas the arch_merge_page call
is made from within __free_one_page which has the requirement that the
zone lock be taken before calling the function.

> This makes me really nervous. You at *least* want to document this at
> the arch_merge_page() call-site, and perhaps even the __free_one_page()
> call-sites because they're near where the zone lock is taken.

Okay, that makes sense. I would probably look at adding the
documentation to the arch_merge_page call-site.

> The place you are calling arch_merge_page() looks OK to me, today. But,
> it can't get moved around without careful consideration. That also
> needs to be documented to warn off folks who might move code around.

Agreed.

> The interaction between the free and merge hooks is also really
> implementation-specific. If an architecture is getting order-0
> arch_free_page() notifications, it's probably worth at least documenting
> that they'll *also* get arch_merge_page() notifications.

If an architecture is getting order-0 notifications then the merge
notifications would be pointless since all the pages would be already
hinted.

I can add documentation that explains that in the case where we are
only hinting on non-zero order pages then arch_merge_page should
provide hints for when a page is merged above that threshold.

> The reason x86 doesn't double-hypercall on those is not broached in the
> descriptions. That seems to be problematic.

I will add more documentation to address that.




2019-02-04 21:50:06

by Alexander Duyck

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, 2019-02-04 at 11:44 -0800, Dave Hansen wrote:
> On 2/4/19 10:15 AM, Alexander Duyck wrote:
> > +#ifdef CONFIG_KVM_GUEST
> > +#include <linux/jump_label.h>
> > +extern struct static_key_false pv_free_page_hint_enabled;
> > +
> > +#define HAVE_ARCH_FREE_PAGE
> > +void __arch_free_page(struct page *page, unsigned int order);
> > +static inline void arch_free_page(struct page *page, unsigned int order)
> > +{
> > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > + __arch_free_page(page, order);
> > +}
> > +#endif
>
> So, this ends up with at least a call, a branch and a ret added to the
> order-0 paths, including freeing pages to the per-cpu-pageset lists.
> That seems worrisome.
>
> What performance testing has been performed to look into the overhead
> added to those paths?

So far I haven't done much in the way of actual performance testing.
Most of my tests have been focused on "is this doing what I think it is
supposed to be doing".

I have been debating if I want to just move the order checks to include
them in the inline functions. In that case we would end up essentially
just jumping over the call code.


2019-02-04 23:20:16

by Nadav Amit

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
>
> Add guest support for providing free memory hints to the KVM hypervisor for
> freed pages huge TLB size or larger. I am restricting the size to
> huge TLB order and larger because the hypercalls are too expensive to be
> performing one per 4K page. Using the huge TLB order became the obvious
> choice for the order to use as it allows us to avoid fragmentation of higher
> order memory on the host.
>
> I have limited the functionality so that it doesn't work when page
> poisoning is enabled. I did this because a write to the page after doing an
> MADV_DONTNEED would effectively negate the hint, so it would be wasting
> cycles to do so.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> arch/x86/include/asm/page.h | 13 +++++++++++++
> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 7555b48803a8..4487ad7a3385 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -18,6 +18,19 @@
>
> struct page;
>
> +#ifdef CONFIG_KVM_GUEST
> +#include <linux/jump_label.h>
> +extern struct static_key_false pv_free_page_hint_enabled;
> +
> +#define HAVE_ARCH_FREE_PAGE
> +void __arch_free_page(struct page *page, unsigned int order);
> +static inline void arch_free_page(struct page *page, unsigned int order)
> +{
> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> + __arch_free_page(page, order);
> +}
> +#endif

This patch and the following one assume that only KVM should be able to hook
to these events. I do not think it is appropriate for __arch_free_page() to
effectively mean “kvm_guest_free_page()”.

Is it possible to use the paravirt infrastructure for this feature,
similarly to other PV features? It is not the best infrastructure, but at least
it is hypervisor-neutral.


2019-02-04 23:40:21

by Alexander Duyck

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
> > On Feb 4, 2019, at 10:15 AM, Alexander Duyck <[email protected]> wrote:
> >
> > From: Alexander Duyck <[email protected]>
> >
> > Add guest support for providing free memory hints to the KVM hypervisor for
> > freed pages huge TLB size or larger. I am restricting the size to
> > huge TLB order and larger because the hypercalls are too expensive to be
> > performing one per 4K page. Using the huge TLB order became the obvious
> > choice for the order to use as it allows us to avoid fragmentation of higher
> > order memory on the host.
> >
> > I have limited the functionality so that it doesn't work when page
> > poisoning is enabled. I did this because a write to the page after doing an
> > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > cycles to do so.
> >
> > Signed-off-by: Alexander Duyck <[email protected]>
> > ---
> > arch/x86/include/asm/page.h | 13 +++++++++++++
> > arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> > 2 files changed, 36 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > index 7555b48803a8..4487ad7a3385 100644
> > --- a/arch/x86/include/asm/page.h
> > +++ b/arch/x86/include/asm/page.h
> > @@ -18,6 +18,19 @@
> >
> > struct page;
> >
> > +#ifdef CONFIG_KVM_GUEST
> > +#include <linux/jump_label.h>
> > +extern struct static_key_false pv_free_page_hint_enabled;
> > +
> > +#define HAVE_ARCH_FREE_PAGE
> > +void __arch_free_page(struct page *page, unsigned int order);
> > +static inline void arch_free_page(struct page *page, unsigned int order)
> > +{
> > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > + __arch_free_page(page, order);
> > +}
> > +#endif
>
> This patch and the following one assume that only KVM should be able to hook
> to these events. I do not think it is appropriate for __arch_free_page() to
> effectively mean “kvm_guest_free_page()”.
>
> Is it possible to use the paravirt infrastructure for this feature,
> similarly to other PV features? It is not the best infrastructure, but at least
> it is hypervisor-neutral.

I could probably tie this into the paravirt infrastructure, but if I
did so I would probably want to pull the checks for the page order out
of the KVM specific bits and make it something we handle in the inline.
Doing that I would probably make it a paravirtual hint that only
operates at the PMD level. That way we wouldn't incur the cost of the
paravirt infrastructure at the per 4K page level.


2019-02-05 00:04:12

by Nadav Amit

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

> On Feb 4, 2019, at 3:37 PM, Alexander Duyck <[email protected]> wrote:
>
> On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
>>> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <[email protected]> wrote:
>>>
>>> From: Alexander Duyck <[email protected]>
>>>
>>> Add guest support for providing free memory hints to the KVM hypervisor for
>>> freed pages huge TLB size or larger. I am restricting the size to
>>> huge TLB order and larger because the hypercalls are too expensive to be
>>> performing one per 4K page. Using the huge TLB order became the obvious
>>> choice for the order to use as it allows us to avoid fragmentation of higher
>>> order memory on the host.
>>>
>>> I have limited the functionality so that it doesn't work when page
>>> poisoning is enabled. I did this because a write to the page after doing an
>>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
>>> cycles to do so.
>>>
>>> Signed-off-by: Alexander Duyck <[email protected]>
>>> ---
>>> arch/x86/include/asm/page.h | 13 +++++++++++++
>>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
>>> 2 files changed, 36 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>> index 7555b48803a8..4487ad7a3385 100644
>>> --- a/arch/x86/include/asm/page.h
>>> +++ b/arch/x86/include/asm/page.h
>>> @@ -18,6 +18,19 @@
>>>
>>> struct page;
>>>
>>> +#ifdef CONFIG_KVM_GUEST
>>> +#include <linux/jump_label.h>
>>> +extern struct static_key_false pv_free_page_hint_enabled;
>>> +
>>> +#define HAVE_ARCH_FREE_PAGE
>>> +void __arch_free_page(struct page *page, unsigned int order);
>>> +static inline void arch_free_page(struct page *page, unsigned int order)
>>> +{
>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> + __arch_free_page(page, order);
>>> +}
>>> +#endif
>>
>> This patch and the following one assume that only KVM should be able to hook
>> to these events. I do not think it is appropriate for __arch_free_page() to
>> effectively mean “kvm_guest_free_page()”.
>>
>> Is it possible to use the paravirt infrastructure for this feature,
>> similarly to other PV features? It is not the best infrastructure, but at least
>> it is hypervisor-neutral.
>
> I could probably tie this into the paravirt infrastructure, but if I
> did so I would probably want to pull the checks for the page order out
> of the KVM specific bits and make it something we handle in the inline.
> Doing that I would probably make it a paravirtual hint that only
> operates at the PMD level. That way we wouldn't incur the cost of the
> paravirt infrastructure at the per 4K page level.

If I understand you correctly, you “complain” that this would affect
performance.

While it might be, you may want to check whether the already available
tools can solve the problem:

1. You can use a combination of static-key and pv-ops - see for example
steal_account_process_time()

2. You can use callee-saved pv-ops.

The latter may be necessary anyway since, IIUC, you are changing a very hot
path. So you may want to have a look at the assembly code of free_pcp_prepare()
(or at least its code size) before and after your changes. If the changes are
too big, a callee-saved function might be necessary.
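
[Editor's note: option 1 above — a static key guarding a pv-op, in the style of steal_account_process_time() — might look roughly like this hypothetical sketch; pv_ops.mmu.page_free_hint is an invented slot, not a real kernel member.]

```c
/*
 * Hypothetical sketch of option 1: a static key combined with a pv-op,
 * in the style of steal_account_process_time().  When the key is off
 * (bare metal, or a hypervisor without the feature) the branch is
 * patched out; when it is on, the pv-op keeps the hook
 * hypervisor-neutral.  pv_ops.mmu.page_free_hint is invented for
 * illustration.
 */
static inline void arch_free_page(struct page *page, unsigned int order)
{
	if (static_branch_unlikely(&pv_free_page_hint_enabled))
		pv_ops.mmu.page_free_hint(page, order);	/* KVM, Xen, ... */
}
```

[For option 2, the invented slot could presumably be declared with PV_CALLEE_SAVE so the call clobbers fewer registers on the hot free path.]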


2019-02-05 01:31:30

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 4, 2019 at 4:03 PM Nadav Amit <[email protected]> wrote:
>
> > On Feb 4, 2019, at 3:37 PM, Alexander Duyck <[email protected]> wrote:
> >
> > On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
> >>> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <[email protected]> wrote:
> >>>
> >>> From: Alexander Duyck <[email protected]>
> >>>
> >>> Add guest support for providing free memory hints to the KVM hypervisor for
> >>> freed pages huge TLB size or larger. I am restricting the size to
> >>> huge TLB order and larger because the hypercalls are too expensive to be
> >>> performing one per 4K page. Using the huge TLB order became the obvious
> >>> choice for the order to use as it allows us to avoid fragmentation of higher
> >>> order memory on the host.
> >>>
> >>> I have limited the functionality so that it doesn't work when page
> >>> poisoning is enabled. I did this because a write to the page after doing an
> >>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
> >>> cycles to do so.
> >>>
> >>> Signed-off-by: Alexander Duyck <[email protected]>
> >>> ---
> >>> arch/x86/include/asm/page.h | 13 +++++++++++++
> >>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> >>> 2 files changed, 36 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> >>> index 7555b48803a8..4487ad7a3385 100644
> >>> --- a/arch/x86/include/asm/page.h
> >>> +++ b/arch/x86/include/asm/page.h
> >>> @@ -18,6 +18,19 @@
> >>>
> >>> struct page;
> >>>
> >>> +#ifdef CONFIG_KVM_GUEST
> >>> +#include <linux/jump_label.h>
> >>> +extern struct static_key_false pv_free_page_hint_enabled;
> >>> +
> >>> +#define HAVE_ARCH_FREE_PAGE
> >>> +void __arch_free_page(struct page *page, unsigned int order);
> >>> +static inline void arch_free_page(struct page *page, unsigned int order)
> >>> +{
> >>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> >>> + __arch_free_page(page, order);
> >>> +}
> >>> +#endif
> >>
> >> This patch and the following one assume that only KVM should be able to hook
> >> to these events. I do not think it is appropriate for __arch_free_page() to
> >> effectively mean “kvm_guest_free_page()”.
> >>
> >> Is it possible to use the paravirt infrastructure for this feature,
> >> similarly to other PV features? It is not the best infrastructure, but at least
> >> it is hypervisor-neutral.
> >
> > I could probably tie this into the paravirt infrastructure, but if I
> > did so I would probably want to pull the checks for the page order out
> > of the KVM specific bits and make it something we handle in the inline.
> > Doing that I would probably make it a paravirtual hint that only
> > operates at the PMD level. That way we wouldn't incur the cost of the
> > paravirt infrastructure at the per 4K page level.
>
> If I understand you correctly, you “complain” that this would affect
> performance.

It wasn't so much a "complaint" as an "observation". What I was
getting at is that if I am going to make it a PV operation I might set
a hard limit on it so that it will specifically only apply to huge
pages and larger. By doing that I can justify performing the screening
based on page order in the inline path and avoid any PV infrastructure
overhead unless I have to incur it.

> While it might be, you may want to check whether the already available
> tools can solve the problem:
>
> 1. You can use a combination of static-key and pv-ops - see for example
> steal_account_process_time()

Okay, I was kind of already heading in this direction. The static key
I am using now would probably stay put.

> 2. You can use callee-saved pv-ops.
>
> The latter may be necessary anyway since, IIUC, you are changing a very hot
> path. So you may want to have a look at the assembly code of free_pcp_prepare()
> (or at least its code size) before and after your changes. If the changes are
> too big, a callee-saved function might be necessary.

I'll have to take a look. I will spend the next couple of days
familiarizing myself with the pv-ops infrastructure.

2019-02-05 01:48:48

by Nadav Amit

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

> On Feb 4, 2019, at 4:16 PM, Alexander Duyck <[email protected]> wrote:
>
> On Mon, Feb 4, 2019 at 4:03 PM Nadav Amit <[email protected]> wrote:
>>> On Feb 4, 2019, at 3:37 PM, Alexander Duyck <[email protected]> wrote:
>>>
>>> On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
>>>>> On Feb 4, 2019, at 10:15 AM, Alexander Duyck <[email protected]> wrote:
>>>>>
>>>>> From: Alexander Duyck <[email protected]>
>>>>>
>>>>> Add guest support for providing free memory hints to the KVM hypervisor for
>>>>> freed pages huge TLB size or larger. I am restricting the size to
>>>>> huge TLB order and larger because the hypercalls are too expensive to be
>>>>> performing one per 4K page. Using the huge TLB order became the obvious
>>>>> choice for the order to use as it allows us to avoid fragmentation of higher
>>>>> order memory on the host.
>>>>>
>>>>> I have limited the functionality so that it doesn't work when page
>>>>> poisoning is enabled. I did this because a write to the page after doing an
>>>>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
>>>>> cycles to do so.
>>>>>
>>>>> Signed-off-by: Alexander Duyck <[email protected]>
>>>>> ---
>>>>> arch/x86/include/asm/page.h | 13 +++++++++++++
>>>>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
>>>>> 2 files changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>>>> index 7555b48803a8..4487ad7a3385 100644
>>>>> --- a/arch/x86/include/asm/page.h
>>>>> +++ b/arch/x86/include/asm/page.h
>>>>> @@ -18,6 +18,19 @@
>>>>>
>>>>> struct page;
>>>>>
>>>>> +#ifdef CONFIG_KVM_GUEST
>>>>> +#include <linux/jump_label.h>
>>>>> +extern struct static_key_false pv_free_page_hint_enabled;
>>>>> +
>>>>> +#define HAVE_ARCH_FREE_PAGE
>>>>> +void __arch_free_page(struct page *page, unsigned int order);
>>>>> +static inline void arch_free_page(struct page *page, unsigned int order)
>>>>> +{
>>>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>>>> + __arch_free_page(page, order);
>>>>> +}
>>>>> +#endif
>>>>
>>>> This patch and the following one assume that only KVM should be able to hook
>>>> to these events. I do not think it is appropriate for __arch_free_page() to
>>>> effectively mean “kvm_guest_free_page()”.
>>>>
>>>> Is it possible to use the paravirt infrastructure for this feature,
>>>> similarly to other PV features? It is not the best infrastructure, but at least
>>>> it is hypervisor-neutral.
>>>
>>> I could probably tie this into the paravirt infrastructure, but if I
>>> did so I would probably want to pull the checks for the page order out
>>> of the KVM specific bits and make it something we handle in the inline.
>>> Doing that I would probably make it a paravirtual hint that only
>>> operates at the PMD level. That way we wouldn't incur the cost of the
>>> paravirt infrastructure at the per 4K page level.
>>
>> If I understand you correctly, you “complain” that this would affect
>> performance.
>
> It wasn't so much a "complaint" as an "observation". What I was
> getting at is that if I am going to make it a PV operation I might set
> a hard limit on it so that it will specifically only apply to huge
> pages and larger. By doing that I can justify performing the screening
> based on page order in the inline path and avoid any PV infrastructure
> overhead unless I have to incur it.

I understood. I guess my use of “double quotes” was lost in translation. ;-)

One more point regarding [2/4] - you may want to consider using MADV_FREE
instead of MADV_DONTNEED to avoid unnecessary EPT violations.


2019-02-05 17:46:40

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host


On 2/4/19 1:15 PM, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
>
> In order to support this I have added a new hypercall to provided unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.

Hi Alexander,

Can you share the host memory usage before and after your run (in both
cases, with and without your patch set)?

>
> ---
>
> Alexander Duyck (4):
> madvise: Expose ability to set dontneed from kernel
> kvm: Add host side support for free memory hints
> kvm: Add guest side support for free memory hints
> mm: Add merge page notifier
>
>
> Documentation/virtual/kvm/cpuid.txt | 4 ++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
> arch/x86/include/asm/page.h | 25 +++++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 ++
> arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
> arch/x86/kvm/cpuid.c | 6 +++-
> arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
> include/linux/gfp.h | 4 ++
> include/linux/mm.h | 2 +
> include/uapi/linux/kvm_para.h | 1 +
> mm/madvise.c | 13 +++++++-
> mm/page_alloc.c | 2 +
> 12 files changed, 158 insertions(+), 2 deletions(-)
>
> --
--
Regards
Nitesh



2019-02-05 18:10:18

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, 2019-02-04 at 17:46 -0800, Nadav Amit wrote:
> > On Feb 4, 2019, at 4:16 PM, Alexander Duyck <[email protected]> wrote:
> >
> > On Mon, Feb 4, 2019 at 4:03 PM Nadav Amit <[email protected]> wrote:
> > > > On Feb 4, 2019, at 3:37 PM, Alexander Duyck <[email protected]> wrote:
> > > >
> > > > On Mon, 2019-02-04 at 15:00 -0800, Nadav Amit wrote:
> > > > > > On Feb 4, 2019, at 10:15 AM, Alexander Duyck <[email protected]> wrote:
> > > > > >
> > > > > > From: Alexander Duyck <[email protected]>
> > > > > >
> > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > performing one per 4K page. Using the huge TLB order became the obvious
> > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > order memory on the host.
> > > > > >
> > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > cycles to do so.
> > > > > >
> > > > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > > > ---
> > > > > > arch/x86/include/asm/page.h | 13 +++++++++++++
> > > > > > arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> > > > > > 2 files changed, 36 insertions(+)
> > > > > >
> > > > > > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > > > > > index 7555b48803a8..4487ad7a3385 100644
> > > > > > --- a/arch/x86/include/asm/page.h
> > > > > > +++ b/arch/x86/include/asm/page.h
> > > > > > @@ -18,6 +18,19 @@
> > > > > >
> > > > > > struct page;
> > > > > >
> > > > > > +#ifdef CONFIG_KVM_GUEST
> > > > > > +#include <linux/jump_label.h>
> > > > > > +extern struct static_key_false pv_free_page_hint_enabled;
> > > > > > +
> > > > > > +#define HAVE_ARCH_FREE_PAGE
> > > > > > +void __arch_free_page(struct page *page, unsigned int order);
> > > > > > +static inline void arch_free_page(struct page *page, unsigned int order)
> > > > > > +{
> > > > > > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > > > > + __arch_free_page(page, order);
> > > > > > +}
> > > > > > +#endif
> > > > >
> > > > > This patch and the following one assume that only KVM should be able to hook
> > > > > to these events. I do not think it is appropriate for __arch_free_page() to
> > > > > effectively mean “kvm_guest_free_page()”.
> > > > >
> > > > > Is it possible to use the paravirt infrastructure for this feature,
> > > > > similarly to other PV features? It is not the best infrastructure, but at least
> > > > > it is hypervisor-neutral.
> > > >
> > > > I could probably tie this into the paravirt infrastructure, but if I
> > > > did so I would probably want to pull the checks for the page order out
> > > > of the KVM specific bits and make it something we handle in the inline.
> > > > Doing that I would probably make it a paravirtual hint that only
> > > > operates at the PMD level. That way we wouldn't incur the cost of the
> > > > paravirt infrastructure at the per 4K page level.
> > >
> > > If I understand you correctly, you “complain” that this would affect
> > > performance.
> >
> > It wasn't so much a "complaint" as an "observation". What I was
> > getting at is that if I am going to make it a PV operation I might set
> > a hard limit on it so that it will specifically only apply to huge
> > pages and larger. By doing that I can justify performing the screening
> > based on page order in the inline path and avoid any PV infrastructure
> > overhead unless I have to incur it.
>
> I understood. I guess my use of “double quotes” was lost in translation. ;-)

Yeah, I just figured I would restate it to make sure we were "on the
same page". ;-)

> One more point regarding [2/4] - you may want to consider using MADV_FREE
> instead of MADV_DONTNEED to avoid unnecessary EPT violations.

For now I am using MADV_DONTNEED because it reduces the complexity.
I have been working on a proof of concept with MADV_FREE; however, we
would then have to add some additional checks, as MADV_FREE only works
with anonymous memory if I am not mistaken.


2019-02-05 19:03:11

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host

On Tue, 2019-02-05 at 12:25 -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 1:15 PM, Alexander Duyck wrote:
> > This patch set provides a mechanism by which guests can notify the host of
> > pages that are not currently in use. Using this data a KVM host can more
> > easily balance memory workloads between guests and improve overall system
> > performance by avoiding unnecessary writing of unused pages to swap.
> >
> > In order to support this I have added a new hypercall to provided unused
> > page hints and made use of mechanisms currently used by PowerPC and s390
> > architectures to provide those hints. To reduce the overhead of this call
> > I am only using it per huge page instead of of doing a notification per 4K
> > page. By doing this we can avoid the expense of fragmenting higher order
> > pages, and reduce overall cost for the hypercall as it will only be
> > performed once per huge page.
> >
> > Because we are limiting this to huge pages it was necessary to add a
> > secondary location where we make the call as the buddy allocator can merge
> > smaller pages into a higher order huge page.
> >
> > This approach is not usable in all cases. Specifically, when KVM direct
> > device assignment is used, the memory for a guest is permanently assigned
> > to physical pages in order to support DMA from the assigned device. In
> > this case we cannot give the pages back, so the hypercall is disabled by
> > the host.
> >
> > Another situation that can lead to issues is if the page were accessed
> > immediately after free. For example, if page poisoning is enabled the
> > guest will populate the page *after* freeing it. In this case it does not
> > make sense to provide a hint about the page being freed so we do not
> > perform the hypercalls from the guest if this functionality is enabled.
> >
> > My testing up till now has consisted of setting up 4 8GB VMs on a system
> > with 32GB of memory and 4GB of swap. To stress the memory on the system I
> > would run "memhog 8G" sequentially on each of the guests and observe how
> > long it took to complete the run. The observed behavior is that on the
> > systems with these patches applied in both the guest and on the host I was
> > able to complete the test with a time of 5 to 7 seconds per guest. On a
> > system without these patches the time ranged from 7 to 49 seconds per
> > guest. I am assuming the variability is due to time being spent writing
> > pages out to disk in order to free up space for the guest.
>
> Hi Alexander,
>
> Can you share the host memory usage before and after your run (in both
> cases, with and without your patch set)?

Here are some snippets from /proc/meminfo on the system, both
before and after the test.

W/O patch
-- Before --
MemTotal: 32881396 kB
MemFree: 21363724 kB
MemAvailable: 25891228 kB
Buffers: 2276 kB
Cached: 4760280 kB
SwapCached: 0 kB
Active: 7166952 kB
Inactive: 1474980 kB
Active(anon): 3893308 kB
Inactive(anon): 8776 kB
Active(file): 3273644 kB
Inactive(file): 1466204 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Dirty: 29812 kB
Writeback: 0 kB
AnonPages: 3896540 kB
Mapped: 75568 kB
Shmem: 10044 kB

-- After --
MemTotal: 32881396 kB
MemFree: 194668 kB
MemAvailable: 51356 kB
Buffers: 24 kB
Cached: 129036 kB
SwapCached: 224396 kB
Active: 27223304 kB
Inactive: 2589736 kB
Active(anon): 27220360 kB
Inactive(anon): 2481592 kB
Active(file): 2944 kB
Inactive(file): 108144 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 35616 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 29476628 kB
Mapped: 22820 kB
Shmem: 5516 kB

W/ patch
-- Before --
MemTotal: 32881396 kB
MemFree: 26618880 kB
MemAvailable: 27056004 kB
Buffers: 2276 kB
Cached: 781496 kB
SwapCached: 0 kB
Active: 3309056 kB
Inactive: 393796 kB
Active(anon): 2932728 kB
Inactive(anon): 8776 kB
Active(file): 376328 kB
Inactive(file): 385020 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Dirty: 96 kB
Writeback: 0 kB
AnonPages: 2935964 kB
Mapped: 75428 kB
Shmem: 10048 kB

-- After --
MemTotal: 32881396 kB
MemFree: 22677904 kB
MemAvailable: 26543092 kB
Buffers: 2276 kB
Cached: 4205908 kB
SwapCached: 0 kB
Active: 3863016 kB
Inactive: 3768596 kB
Active(anon): 3437368 kB
Inactive(anon): 8772 kB
Active(file): 425648 kB
Inactive(file): 3759824 kB
Unevictable: 16756 kB
Mlocked: 16756 kB
SwapTotal: 4194300 kB
SwapFree: 4194300 kB
Dirty: 1336180 kB
Writeback: 0 kB
AnonPages: 3440528 kB
Mapped: 74992 kB
Shmem: 10044 kB


2019-02-07 14:49:00

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host


On 2/4/19 1:15 PM, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.
>
> In order to support this I have added a new hypercall to provided unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
Hi Alexander,

Did you get a chance to look at my v8 posting of Guest Free Page Hinting
[1]?
Since both solutions are trying to solve the same problem, it would be
great if we could collaborate and come up with a unified solution.

[1] https://lkml.org/lkml/2019/2/4/993
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
>
> ---
>
> Alexander Duyck (4):
> madvise: Expose ability to set dontneed from kernel
> kvm: Add host side support for free memory hints
> kvm: Add guest side support for free memory hints
> mm: Add merge page notifier
>
>
> Documentation/virtual/kvm/cpuid.txt | 4 ++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
> arch/x86/include/asm/page.h | 25 +++++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 ++
> arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
> arch/x86/kvm/cpuid.c | 6 +++-
> arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
> include/linux/gfp.h | 4 ++
> include/linux/mm.h | 2 +
> include/uapi/linux/kvm_para.h | 1 +
> mm/madvise.c | 13 +++++++-
> mm/page_alloc.c | 2 +
> 12 files changed, 158 insertions(+), 2 deletions(-)
>
> --
--
Regards
Nitesh



2019-02-07 16:56:44

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host

On Thu, 2019-02-07 at 09:48 -0500, Nitesh Narayan Lal wrote:
> On 2/4/19 1:15 PM, Alexander Duyck wrote:
> > This patch set provides a mechanism by which guests can notify the host of
> > pages that are not currently in use. Using this data a KVM host can more
> > easily balance memory workloads between guests and improve overall system
> > performance by avoiding unnecessary writing of unused pages to swap.
> >
> > In order to support this I have added a new hypercall to provided unused
> > page hints and made use of mechanisms currently used by PowerPC and s390
> > architectures to provide those hints. To reduce the overhead of this call
> > I am only using it per huge page instead of of doing a notification per 4K
> > page. By doing this we can avoid the expense of fragmenting higher order
> > pages, and reduce overall cost for the hypercall as it will only be
> > performed once per huge page.
> >
> > Because we are limiting this to huge pages it was necessary to add a
> > secondary location where we make the call as the buddy allocator can merge
> > smaller pages into a higher order huge page.
> >
> > This approach is not usable in all cases. Specifically, when KVM direct
> > device assignment is used, the memory for a guest is permanently assigned
> > to physical pages in order to support DMA from the assigned device. In
> > this case we cannot give the pages back, so the hypercall is disabled by
> > the host.
> >
> > Another situation that can lead to issues is if the page were accessed
> > immediately after free. For example, if page poisoning is enabled the
> > guest will populate the page *after* freeing it. In this case it does not
> > make sense to provide a hint about the page being freed so we do not
> > perform the hypercalls from the guest if this functionality is enabled.
>
> Hi Alexander,
>
> Did you get a chance to look at my v8 posting of Guest Free Page Hinting
> [1]?
> Since both solutions are trying to solve the same problem, it would be
> great if we could collaborate and come up with a unified solution.
>
> [1] https://lkml.org/lkml/2019/2/4/993

I haven't had a chance to review these yet.

I'll try to take a look later today and provide review notes based on
what I find.

Thanks.

- Alex


2019-02-07 18:21:33

by Luiz Capitulino

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, 04 Feb 2019 10:15:52 -0800
Alexander Duyck <[email protected]> wrote:

> From: Alexander Duyck <[email protected]>
>
> Add guest support for providing free memory hints to the KVM hypervisor for
> freed pages huge TLB size or larger. I am restricting the size to
> huge TLB order and larger because the hypercalls are too expensive to be
> performing one per 4K page. Using the huge TLB order became the obvious
> choice for the order to use as it allows us to avoid fragmentation of higher
> order memory on the host.
>
> I have limited the functionality so that it doesn't work when page
> poisoning is enabled. I did this because a write to the page after doing an
> MADV_DONTNEED would effectively negate the hint, so it would be wasting
> cycles to do so.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> arch/x86/include/asm/page.h | 13 +++++++++++++
> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 7555b48803a8..4487ad7a3385 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -18,6 +18,19 @@
>
> struct page;
>
> +#ifdef CONFIG_KVM_GUEST
> +#include <linux/jump_label.h>
> +extern struct static_key_false pv_free_page_hint_enabled;
> +
> +#define HAVE_ARCH_FREE_PAGE
> +void __arch_free_page(struct page *page, unsigned int order);
> +static inline void arch_free_page(struct page *page, unsigned int order)
> +{
> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> + __arch_free_page(page, order);
> +}
> +#endif
> +
> #include <linux/range.h>
> extern struct range pfn_mapped[];
> extern int nr_pfn_mapped;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 5c93a65ee1e5..09c91641c36c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -48,6 +48,7 @@
> #include <asm/tlb.h>
>
> static int kvmapf = 1;
> +DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);
>
> static int __init parse_no_kvmapf(char *arg)
> {
> @@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> apic_set_eoi_write(kvm_guest_apic_eoi_write);
>
> + /*
> + * The free page hinting doesn't add much value if page poisoning
> + * is enabled. So we only enable the feature if page poisoning is
> > + * not present.
> + */
> + if (!page_poisoning_enabled() &&
> + kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
> + static_branch_enable(&pv_free_page_hint_enabled);
> +
> #ifdef CONFIG_SMP
> smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
> smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> @@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
> }
> arch_initcall(kvm_setup_pv_tlb_flush);
>
> +void __arch_free_page(struct page *page, unsigned int order)
> +{
> + /*
> + * Limit hints to blocks no smaller than pageblock in
> + * size to limit the cost for the hypercalls.
> + */
> + if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> + return;
> +
> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> + PAGE_SIZE << order);

Does this mean that the vCPU executing this will get stuck
here for the duration of the hypercall? Isn't that too long,
considering that the zone lock is taken and madvise in the
host blocks on semaphores?

> +}
> +
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>
> /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>


2019-02-07 18:45:59

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Thu, 2019-02-07 at 13:21 -0500, Luiz Capitulino wrote:
> On Mon, 04 Feb 2019 10:15:52 -0800
> Alexander Duyck <[email protected]> wrote:
>
> > From: Alexander Duyck <[email protected]>
> >
> > Add guest support for providing free memory hints to the KVM hypervisor for
> > freed pages of huge TLB size or larger. I am restricting the size to
> > huge TLB order and larger because the hypercalls are too expensive to be
> > performing one per 4K page. Using the huge TLB order became the obvious
> > choice for the order to use as it allows us to avoid fragmentation of higher
> > order memory on the host.
> >
> > I have limited the functionality so that it doesn't work when page
> > poisoning is enabled. I did this because a write to the page after doing an
> > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > cycles to do so.
> >
> > Signed-off-by: Alexander Duyck <[email protected]>
> > ---
> > arch/x86/include/asm/page.h | 13 +++++++++++++
> > arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> > 2 files changed, 36 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > index 7555b48803a8..4487ad7a3385 100644
> > --- a/arch/x86/include/asm/page.h
> > +++ b/arch/x86/include/asm/page.h
> > @@ -18,6 +18,19 @@
> >
> > struct page;
> >
> > +#ifdef CONFIG_KVM_GUEST
> > +#include <linux/jump_label.h>
> > +extern struct static_key_false pv_free_page_hint_enabled;
> > +
> > +#define HAVE_ARCH_FREE_PAGE
> > +void __arch_free_page(struct page *page, unsigned int order);
> > +static inline void arch_free_page(struct page *page, unsigned int order)
> > +{
> > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > + __arch_free_page(page, order);
> > +}
> > +#endif
> > +
> > #include <linux/range.h>
> > extern struct range pfn_mapped[];
> > extern int nr_pfn_mapped;
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 5c93a65ee1e5..09c91641c36c 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -48,6 +48,7 @@
> > #include <asm/tlb.h>
> >
> > static int kvmapf = 1;
> > +DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);
> >
> > static int __init parse_no_kvmapf(char *arg)
> > {
> > @@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
> > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> >
> > + /*
> > + * The free page hinting doesn't add much value if page poisoning
> > + * is enabled. So we only enable the feature if page poisoning is
> > + * not present.
> > + */
> > + if (!page_poisoning_enabled() &&
> > + kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
> > + static_branch_enable(&pv_free_page_hint_enabled);
> > +
> > #ifdef CONFIG_SMP
> > smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
> > smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> > @@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
> > }
> > arch_initcall(kvm_setup_pv_tlb_flush);
> >
> > +void __arch_free_page(struct page *page, unsigned int order)
> > +{
> > + /*
> > + * Limit hints to blocks no smaller than pageblock in
> > + * size to limit the cost for the hypercalls.
> > + */
> > + if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > + return;
> > +
> > + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > + PAGE_SIZE << order);
>
> Does this mean that the vCPU executing this will get stuck
> here for the duration of the hypercall? Isn't that too long,
> considering that the zone lock is taken and madvise in the
> host blocks on semaphores?

I'm pretty sure the zone lock isn't held when this is called. The lock
isn't acquired until later in the path. This gets executed just before
the page poisoning call which would take time as well since it would
have to memset an entire page. This function is called as a part of
free_pages_prepare, the zone locks aren't acquired until we are calling
into either free_one_page or a few spots before calling
__free_one_page.

My other function in patch 4 which does this from inside of
__free_one_page does have to release the zone lock since it is taken
there.


2019-02-07 20:03:29

by Luiz Capitulino

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Thu, 07 Feb 2019 10:44:11 -0800
Alexander Duyck <[email protected]> wrote:

> On Thu, 2019-02-07 at 13:21 -0500, Luiz Capitulino wrote:
> > On Mon, 04 Feb 2019 10:15:52 -0800
> > Alexander Duyck <[email protected]> wrote:
> >
> > > From: Alexander Duyck <[email protected]>
> > >
> > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > freed pages of huge TLB size or larger. I am restricting the size to
> > > huge TLB order and larger because the hypercalls are too expensive to be
> > > performing one per 4K page. Using the huge TLB order became the obvious
> > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > order memory on the host.
> > >
> > > I have limited the functionality so that it doesn't work when page
> > > poisoning is enabled. I did this because a write to the page after doing an
> > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > cycles to do so.
> > >
> > > Signed-off-by: Alexander Duyck <[email protected]>
> > > ---
> > > arch/x86/include/asm/page.h | 13 +++++++++++++
> > > arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> > > 2 files changed, 36 insertions(+)
> > >
> > > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > > index 7555b48803a8..4487ad7a3385 100644
> > > --- a/arch/x86/include/asm/page.h
> > > +++ b/arch/x86/include/asm/page.h
> > > @@ -18,6 +18,19 @@
> > >
> > > struct page;
> > >
> > > +#ifdef CONFIG_KVM_GUEST
> > > +#include <linux/jump_label.h>
> > > +extern struct static_key_false pv_free_page_hint_enabled;
> > > +
> > > +#define HAVE_ARCH_FREE_PAGE
> > > +void __arch_free_page(struct page *page, unsigned int order);
> > > +static inline void arch_free_page(struct page *page, unsigned int order)
> > > +{
> > > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > + __arch_free_page(page, order);
> > > +}
> > > +#endif
> > > +
> > > #include <linux/range.h>
> > > extern struct range pfn_mapped[];
> > > extern int nr_pfn_mapped;
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index 5c93a65ee1e5..09c91641c36c 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > > @@ -48,6 +48,7 @@
> > > #include <asm/tlb.h>
> > >
> > > static int kvmapf = 1;
> > > +DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);
> > >
> > > static int __init parse_no_kvmapf(char *arg)
> > > {
> > > @@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
> > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> > >
> > > + /*
> > > + * The free page hinting doesn't add much value if page poisoning
> > > + * is enabled. So we only enable the feature if page poisoning is
> > > + * not present.
> > > + */
> > > + if (!page_poisoning_enabled() &&
> > > + kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
> > > + static_branch_enable(&pv_free_page_hint_enabled);
> > > +
> > > #ifdef CONFIG_SMP
> > > smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
> > > smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> > > @@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
> > > }
> > > arch_initcall(kvm_setup_pv_tlb_flush);
> > >
> > > +void __arch_free_page(struct page *page, unsigned int order)
> > > +{
> > > + /*
> > > + * Limit hints to blocks no smaller than pageblock in
> > > + * size to limit the cost for the hypercalls.
> > > + */
> > > + if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > > + return;
> > > +
> > > + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > > + PAGE_SIZE << order);
> >
> > Does this mean that the vCPU executing this will get stuck
> > here for the duration of the hypercall? Isn't that too long,
> > considering that the zone lock is taken and madvise in the
> > host blocks on semaphores?
>
> I'm pretty sure the zone lock isn't held when this is called. The lock
> isn't acquired until later in the path. This gets executed just before
> the page poisoning call which would take time as well since it would
> have to memset an entire page. This function is called as a part of
> free_pages_prepare, the zone locks aren't acquired until we are calling
> into either free_one_page or a few spots before calling
> __free_one_page.

Yeah, you're right of course! I think mixed up __arch_free_page()
and __free_one_page()... free_pages() code path won't take any
locks up to calling __arch_free_page(). Sorry for the noise.

> My other function in patch 4 which does this from inside of
> __free_one_page does have to release the zone lock since it is taken
> there.

I haven't checked that one yet, I'll let you know if I have comments.

2019-02-08 21:08:25

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints


On 2/7/19 1:44 PM, Alexander Duyck wrote:
> On Thu, 2019-02-07 at 13:21 -0500, Luiz Capitulino wrote:
>> On Mon, 04 Feb 2019 10:15:52 -0800
>> Alexander Duyck <[email protected]> wrote:
>>
>>> From: Alexander Duyck <[email protected]>
>>>
>>> Add guest support for providing free memory hints to the KVM hypervisor for
>>> freed pages of huge TLB size or larger. I am restricting the size to
>>> huge TLB order and larger because the hypercalls are too expensive to be
>>> performing one per 4K page. Using the huge TLB order became the obvious
>>> choice for the order to use as it allows us to avoid fragmentation of higher
>>> order memory on the host.
>>>
>>> I have limited the functionality so that it doesn't work when page
>>> poisoning is enabled. I did this because a write to the page after doing an
>>> MADV_DONTNEED would effectively negate the hint, so it would be wasting
>>> cycles to do so.
>>>
>>> Signed-off-by: Alexander Duyck <[email protected]>
>>> ---
>>> arch/x86/include/asm/page.h | 13 +++++++++++++
>>> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
>>> 2 files changed, 36 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>> index 7555b48803a8..4487ad7a3385 100644
>>> --- a/arch/x86/include/asm/page.h
>>> +++ b/arch/x86/include/asm/page.h
>>> @@ -18,6 +18,19 @@
>>>
>>> struct page;
>>>
>>> +#ifdef CONFIG_KVM_GUEST
>>> +#include <linux/jump_label.h>
>>> +extern struct static_key_false pv_free_page_hint_enabled;
>>> +
>>> +#define HAVE_ARCH_FREE_PAGE
>>> +void __arch_free_page(struct page *page, unsigned int order);
>>> +static inline void arch_free_page(struct page *page, unsigned int order)
>>> +{
>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> + __arch_free_page(page, order);
>>> +}
>>> +#endif
>>> +
>>> #include <linux/range.h>
>>> extern struct range pfn_mapped[];
>>> extern int nr_pfn_mapped;
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index 5c93a65ee1e5..09c91641c36c 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -48,6 +48,7 @@
>>> #include <asm/tlb.h>
>>>
>>> static int kvmapf = 1;
>>> +DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);
>>>
>>> static int __init parse_no_kvmapf(char *arg)
>>> {
>>> @@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
>>> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
>>> apic_set_eoi_write(kvm_guest_apic_eoi_write);
>>>
>>> + /*
>>> + * The free page hinting doesn't add much value if page poisoning
>>> + * is enabled. So we only enable the feature if page poisoning is
>>> + * not present.
>>> + */
>>> + if (!page_poisoning_enabled() &&
>>> + kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
>>> + static_branch_enable(&pv_free_page_hint_enabled);
>>> +
>>> #ifdef CONFIG_SMP
>>> smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
>>> smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
>>> @@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
>>> }
>>> arch_initcall(kvm_setup_pv_tlb_flush);
>>>
>>> +void __arch_free_page(struct page *page, unsigned int order)
>>> +{
>>> + /*
>>> + * Limit hints to blocks no smaller than pageblock in
>>> + * size to limit the cost for the hypercalls.
>>> + */
>>> + if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>>> + return;
>>> +
>>> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>>> + PAGE_SIZE << order);
>> Does this mean that the vCPU executing this will get stuck
>> here for the duration of the hypercall? Isn't that too long,
>> considering that the zone lock is taken and madvise in the
>> host blocks on semaphores?
> I'm pretty sure the zone lock isn't held when this is called. The lock
> isn't acquired until later in the path. This gets executed just before
> the page poisoning call which would take time as well since it would
> have to memset an entire page. This function is called as a part of
> free_pages_prepare, the zone locks aren't acquired until we are calling
> into either free_one_page or a few spots before calling
> __free_one_page.
>
> My other function in patch 4 which does this from inside of
> __free_one_page does have to release the zone lock since it is taken
> there.
>
Considering hypercall's are costly, will it not make sense to coalesce
the pages you are reporting and make a single hypercall for a bunch of
pages?

--
Nitesh


2019-02-08 21:32:04

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Fri, 2019-02-08 at 16:05 -0500, Nitesh Narayan Lal wrote:
> On 2/7/19 1:44 PM, Alexander Duyck wrote:
> > On Thu, 2019-02-07 at 13:21 -0500, Luiz Capitulino wrote:
> > > On Mon, 04 Feb 2019 10:15:52 -0800
> > > Alexander Duyck <[email protected]> wrote:
> > >
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > freed pages of huge TLB size or larger. I am restricting the size to
> > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > performing one per 4K page. Using the huge TLB order became the obvious
> > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > order memory on the host.
> > > >
> > > > I have limited the functionality so that it doesn't work when page
> > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > cycles to do so.
> > > >
> > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/page.h | 13 +++++++++++++
> > > > arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> > > > 2 files changed, 36 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > > > index 7555b48803a8..4487ad7a3385 100644
> > > > --- a/arch/x86/include/asm/page.h
> > > > +++ b/arch/x86/include/asm/page.h
> > > > @@ -18,6 +18,19 @@
> > > >
> > > > struct page;
> > > >
> > > > +#ifdef CONFIG_KVM_GUEST
> > > > +#include <linux/jump_label.h>
> > > > +extern struct static_key_false pv_free_page_hint_enabled;
> > > > +
> > > > +#define HAVE_ARCH_FREE_PAGE
> > > > +void __arch_free_page(struct page *page, unsigned int order);
> > > > +static inline void arch_free_page(struct page *page, unsigned int order)
> > > > +{
> > > > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > > + __arch_free_page(page, order);
> > > > +}
> > > > +#endif
> > > > +
> > > > #include <linux/range.h>
> > > > extern struct range pfn_mapped[];
> > > > extern int nr_pfn_mapped;
> > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > > index 5c93a65ee1e5..09c91641c36c 100644
> > > > --- a/arch/x86/kernel/kvm.c
> > > > +++ b/arch/x86/kernel/kvm.c
> > > > @@ -48,6 +48,7 @@
> > > > #include <asm/tlb.h>
> > > >
> > > > static int kvmapf = 1;
> > > > +DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);
> > > >
> > > > static int __init parse_no_kvmapf(char *arg)
> > > > {
> > > > @@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
> > > > if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > > > apic_set_eoi_write(kvm_guest_apic_eoi_write);
> > > >
> > > > + /*
> > > > + * The free page hinting doesn't add much value if page poisoning
> > > > + * is enabled. So we only enable the feature if page poisoning is
> > > > + * not present.
> > > > + */
> > > > + if (!page_poisoning_enabled() &&
> > > > + kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
> > > > + static_branch_enable(&pv_free_page_hint_enabled);
> > > > +
> > > > #ifdef CONFIG_SMP
> > > > smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
> > > > smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> > > > @@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
> > > > }
> > > > arch_initcall(kvm_setup_pv_tlb_flush);
> > > >
> > > > +void __arch_free_page(struct page *page, unsigned int order)
> > > > +{
> > > > + /*
> > > > + * Limit hints to blocks no smaller than pageblock in
> > > > + * size to limit the cost for the hypercalls.
> > > > + */
> > > > + if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > > > + return;
> > > > +
> > > > + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > > > + PAGE_SIZE << order);
> > >
> > > Does this mean that the vCPU executing this will get stuck
> > > here for the duration of the hypercall? Isn't that too long,
> > > considering that the zone lock is taken and madvise in the
> > > host blocks on semaphores?
> >
> > I'm pretty sure the zone lock isn't held when this is called. The lock
> > isn't acquired until later in the path. This gets executed just before
> > the page poisoning call which would take time as well since it would
> > have to memset an entire page. This function is called as a part of
> > free_pages_prepare, the zone locks aren't acquired until we are calling
> > into either free_one_page or a few spots before calling
> > __free_one_page.
> >
> > My other function in patch 4 which does this from inside of
> > __free_one_page does have to release the zone lock since it is taken
> > there.
> >
>
> Considering hypercall's are costly, will it not make sense to coalesce
> the pages you are reporting and make a single hypercall for a bunch of
> pages?

That is what I am doing with this code and patch 4. I am only making
the call when I have been given a page that is 2M or larger. As such I
am only making one hypercall for every 512 4K pages.

So for example on my test VMs with 8G of RAM I see only about 3K calls
when it ends up freeing all of the application memory which is about 6G
after my test has ended.


2019-02-10 00:44:47

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On Mon, Feb 04, 2019 at 10:15:46AM -0800, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Add the host side of the KVM memory hinting support. With this we expose a
> feature bit indicating that the host will pass the messages along to the
> new madvise function.
>
> This functionality is mutually exclusive with device assignment. If a
> device is assigned we will disable the functionality as it could lead to a
> potential memory corruption if a device writes to a page after KVM has
> flagged it as not being used.

I really dislike this kind of tie-in.

Yes right now assignment is not smart enough but generally
you can protect the unused page in the IOMMU and that's it,
it's safe.

So the policy should not leak into host/guest interface.
Instead it is better to just keep the pages pinned and
ignore the hint for now.



> The logic as it is currently defined limits the hint to supporting only
> hugepage or larger notifications. This is meant to help prevent us from
> potentially breaking up huge pages by hinting that only a portion of the
> page is not needed.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> Documentation/virtual/kvm/cpuid.txt | 4 +++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 +++
> arch/x86/kvm/cpuid.c | 6 ++++-
> arch/x86/kvm/x86.c | 35 ++++++++++++++++++++++++++++++
> include/uapi/linux/kvm_para.h | 1 +
> 6 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
> index 97ca1940a0dc..fe3395a58b7e 100644
> --- a/Documentation/virtual/kvm/cpuid.txt
> +++ b/Documentation/virtual/kvm/cpuid.txt
> @@ -66,6 +66,10 @@ KVM_FEATURE_PV_SEND_IPI || 11 || guest checks this feature bit
> || || before using paravirtualized
> || || send IPIs.
> ------------------------------------------------------------------------------
> +KVM_FEATURE_PV_UNUSED_PAGE_HINT || 12 || guest checks this feature bit
> + || || before using paravirtualized
> + || || unused page hints.
> +------------------------------------------------------------------------------
> KVM_FEATURE_CLOCKSOURCE_STABLE_BIT || 24 || host will warn if no guest-side
> || || per-cpu warps are expected in
> || || kvmclock.
> diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
> index da24c138c8d1..b374678ac1f9 100644
> --- a/Documentation/virtual/kvm/hypercalls.txt
> +++ b/Documentation/virtual/kvm/hypercalls.txt
> @@ -141,3 +141,17 @@ a0 corresponds to the APIC ID in the third argument (a2), bit 1
> corresponds to the APIC ID a2+1, and so on.
>
> Returns the number of CPUs to which the IPIs were delivered successfully.
> +
> +7. KVM_HC_UNUSED_PAGE_HINT
> +------------------------
> +Architecture: x86
> +Status: active
> +Purpose: Send unused page hint to host
> +
> +a0: physical address of region unused, page aligned
> +a1: size of unused region, page aligned
> +
> +The hypercall lets a guest send notifications to the host that it will no
> +longer be using a given page in memory. Multiple pages can be hinted at
> +once by specifying a higher order page size in the size field.
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 19980ec1a316..f066c23060df 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -29,6 +29,7 @@
> #define KVM_FEATURE_PV_TLB_FLUSH 9
> #define KVM_FEATURE_ASYNC_PF_VMEXIT 10
> #define KVM_FEATURE_PV_SEND_IPI 11
> +#define KVM_FEATURE_PV_UNUSED_PAGE_HINT 12
>
> #define KVM_HINTS_REALTIME 0
>
> @@ -119,4 +120,6 @@ struct kvm_vcpu_pv_apf_data {
> #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
> #define KVM_PV_EOI_DISABLED 0x0
>
> +#define KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER HUGETLB_PAGE_ORDER
> +
> #endif /* _UAPI_ASM_X86_KVM_PARA_H */
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index bbffa6c54697..b82bcbfbc420 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -136,6 +136,9 @@ int kvm_update_cpuid(struct kvm_vcpu *vcpu)
> if (kvm_hlt_in_guest(vcpu->kvm) && best &&
> (best->eax & (1 << KVM_FEATURE_PV_UNHALT)))
> best->eax &= ~(1 << KVM_FEATURE_PV_UNHALT);
> + if (kvm_arch_has_assigned_device(vcpu->kvm) && best &&
> + (best->eax & (1 << KVM_FEATURE_PV_UNUSED_PAGE_HINT)))
> + best->eax &= ~(1 << KVM_FEATURE_PV_UNUSED_PAGE_HINT);
>
> /* Update physical-address width */
> vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> @@ -637,7 +640,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> (1 << KVM_FEATURE_PV_UNHALT) |
> (1 << KVM_FEATURE_PV_TLB_FLUSH) |
> (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
> - (1 << KVM_FEATURE_PV_SEND_IPI);
> + (1 << KVM_FEATURE_PV_SEND_IPI) |
> + (1 << KVM_FEATURE_PV_UNUSED_PAGE_HINT);
>
> if (sched_info_on())
> entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3d27206f6c01..3ec75ab849e2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -55,6 +55,7 @@
> #include <linux/irqbypass.h>
> #include <linux/sched/stat.h>
> #include <linux/mem_encrypt.h>
> +#include <linux/mm.h>
>
> #include <trace/events/kvm.h>
>
> @@ -7052,6 +7053,37 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
> kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
> }
>
> +static int kvm_pv_unused_page_hint_op(struct kvm *kvm, gpa_t gpa, size_t len)
> +{
> + unsigned long start;
> +
> + /*
> + * Guarantee the following:
> + * len meets minimum size
> + * len is a power of 2
> + * gpa is aligned to len
> + */
> + if (len < (PAGE_SIZE << KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER))
> + return -KVM_EINVAL;
> + if (!is_power_of_2(len) || !IS_ALIGNED(gpa, len))
> + return -KVM_EINVAL;
> +
> + /*
> + * If a device is assigned we cannot use madvise as memory
> + * is shared with the device and could lead to memory corruption
> + * if the device writes to it after free.
> + */
> + if (kvm_arch_has_assigned_device(kvm))
> + return -KVM_EOPNOTSUPP;
> +
> + start = gfn_to_hva(kvm, gpa_to_gfn(gpa));
> +
> + if (kvm_is_error_hva(start + len))
> + return -KVM_EFAULT;
> +
> + return do_madvise_dontneed(start, len);
> +}
> +
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> unsigned long nr, a0, a1, a2, a3, ret;
> @@ -7098,6 +7130,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> case KVM_HC_SEND_IPI:
> ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
> break;
> + case KVM_HC_UNUSED_PAGE_HINT:
> + ret = kvm_pv_unused_page_hint_op(vcpu->kvm, a0, a1);
> + break;
> default:
> ret = -KVM_ENOSYS;
> break;
> diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
> index 6c0ce49931e5..75643b862a4e 100644
> --- a/include/uapi/linux/kvm_para.h
> +++ b/include/uapi/linux/kvm_para.h
> @@ -28,6 +28,7 @@
> #define KVM_HC_MIPS_CONSOLE_OUTPUT 8
> #define KVM_HC_CLOCK_PAIRING 9
> #define KVM_HC_SEND_IPI 10
> +#define KVM_HC_UNUSED_PAGE_HINT 11
>
> /*
> * hypercalls use architecture specific

2019-02-10 00:49:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Add guest support for providing free memory hints to the KVM hypervisor for
> freed pages of huge TLB size or larger. I am restricting the size to
> huge TLB order and larger because the hypercalls are too expensive to be
> performing one per 4K page.

Even 2M pages start to get expensive with a TB guest.

Really it seems we want a virtio ring so we can pass a batch of these.
E.g. 256 entries, 2M each - that's more like it.

> Using the huge TLB order became the obvious
> choice for the order to use as it allows us to avoid fragmentation of higher
> order memory on the host.
>
> I have limited the functionality so that it doesn't work when page
> poisoning is enabled. I did this because a write to the page after doing an
> MADV_DONTNEED would effectively negate the hint, so it would be wasting
> cycles to do so.

Again that's leaking host implementation detail into guest interface.

We are giving guest page hints to host that makes sense,
weird interactions with other features due to host
implementation details should be handled by host.




> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> arch/x86/include/asm/page.h | 13 +++++++++++++
> arch/x86/kernel/kvm.c | 23 +++++++++++++++++++++++
> 2 files changed, 36 insertions(+)
>
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 7555b48803a8..4487ad7a3385 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -18,6 +18,19 @@
>
> struct page;
>
> +#ifdef CONFIG_KVM_GUEST
> +#include <linux/jump_label.h>
> +extern struct static_key_false pv_free_page_hint_enabled;
> +
> +#define HAVE_ARCH_FREE_PAGE
> +void __arch_free_page(struct page *page, unsigned int order);
> +static inline void arch_free_page(struct page *page, unsigned int order)
> +{
> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> + __arch_free_page(page, order);
> +}
> +#endif
> +
> #include <linux/range.h>
> extern struct range pfn_mapped[];
> extern int nr_pfn_mapped;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 5c93a65ee1e5..09c91641c36c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -48,6 +48,7 @@
> #include <asm/tlb.h>
>
> static int kvmapf = 1;
> +DEFINE_STATIC_KEY_FALSE(pv_free_page_hint_enabled);
>
> static int __init parse_no_kvmapf(char *arg)
> {
> @@ -648,6 +649,15 @@ static void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> apic_set_eoi_write(kvm_guest_apic_eoi_write);
>
> + /*
> + * The free page hinting doesn't add much value if page poisoning
> + * is enabled. So we only enable the feature if page poisoning is
> + * not present.
> + */
> + if (!page_poisoning_enabled() &&
> + kvm_para_has_feature(KVM_FEATURE_PV_UNUSED_PAGE_HINT))
> + static_branch_enable(&pv_free_page_hint_enabled);
> +
> #ifdef CONFIG_SMP
> smp_ops.smp_prepare_cpus = kvm_smp_prepare_cpus;
> smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> @@ -762,6 +772,19 @@ static __init int kvm_setup_pv_tlb_flush(void)
> }
> arch_initcall(kvm_setup_pv_tlb_flush);
>
> +void __arch_free_page(struct page *page, unsigned int order)
> +{
> + /*
> + * Limit hints to blocks no smaller than pageblock in
> + * size to limit the cost for the hypercalls.
> + */
> + if (order < KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> + return;
> +
> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> + PAGE_SIZE << order);
> +}
> +
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>
> /* Kick a cpu by its apicid. Used to wake up a halted vcpu */

2019-02-10 00:52:29

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] kvm: Report unused guest pages to host

On Mon, Feb 04, 2019 at 10:15:33AM -0800, Alexander Duyck wrote:
> This patch set provides a mechanism by which guests can notify the host of
> pages that are not currently in use. Using this data a KVM host can more
> easily balance memory workloads between guests and improve overall system
> performance by avoiding unnecessary writing of unused pages to swap.

There's an obvious overlap with Nilal's work and already merged Wei's
work here. So please Cc people reviewing Nilal's and Wei's
patches.


> In order to support this I have added a new hypercall to provided unused
> page hints and made use of mechanisms currently used by PowerPC and s390
> architectures to provide those hints. To reduce the overhead of this call
> I am only using it per huge page instead of doing a notification per 4K
> page. By doing this we can avoid the expense of fragmenting higher order
> pages, and reduce overall cost for the hypercall as it will only be
> performed once per huge page.
>
> Because we are limiting this to huge pages it was necessary to add a
> secondary location where we make the call as the buddy allocator can merge
> smaller pages into a higher order huge page.
>
> This approach is not usable in all cases. Specifically, when KVM direct
> device assignment is used, the memory for a guest is permanently assigned
> to physical pages in order to support DMA from the assigned device. In
> this case we cannot give the pages back, so the hypercall is disabled by
> the host.
>
> Another situation that can lead to issues is if the page were accessed
> immediately after free. For example, if page poisoning is enabled the
> guest will populate the page *after* freeing it. In this case it does not
> make sense to provide a hint about the page being freed so we do not
> perform the hypercalls from the guest if this functionality is enabled.
>
> My testing up till now has consisted of setting up 4 8GB VMs on a system
> with 32GB of memory and 4GB of swap. To stress the memory on the system I
> would run "memhog 8G" sequentially on each of the guests and observe how
> long it took to complete the run. The observed behavior is that on the
> systems with these patches applied in both the guest and on the host I was
> able to complete the test with a time of 5 to 7 seconds per guest. On a
> system without these patches the time ranged from 7 to 49 seconds per
> guest. I am assuming the variability is due to time being spent writing
> pages out to disk in order to free up space for the guest.
>
> ---
>
> Alexander Duyck (4):
> madvise: Expose ability to set dontneed from kernel
> kvm: Add host side support for free memory hints
> kvm: Add guest side support for free memory hints
> mm: Add merge page notifier
>
>
> Documentation/virtual/kvm/cpuid.txt | 4 ++
> Documentation/virtual/kvm/hypercalls.txt | 14 ++++++++
> arch/x86/include/asm/page.h | 25 +++++++++++++++
> arch/x86/include/uapi/asm/kvm_para.h | 3 ++
> arch/x86/kernel/kvm.c | 51 ++++++++++++++++++++++++++++++
> arch/x86/kvm/cpuid.c | 6 +++-
> arch/x86/kvm/x86.c | 35 +++++++++++++++++++++
> include/linux/gfp.h | 4 ++
> include/linux/mm.h | 2 +
> include/uapi/linux/kvm_para.h | 1 +
> mm/madvise.c | 13 +++++++-
> mm/page_alloc.c | 2 +
> 12 files changed, 158 insertions(+), 2 deletions(-)
>
> --

2019-02-10 00:57:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> Because the implementation was limiting itself to only providing hints on
> pages huge TLB order sized or larger we introduced the possibility for free
> pages to slip past us because they are freed as something less than
> huge TLB in size and aggregated with buddies later.
>
> To address that I am adding a new call arch_merge_page which is called
> after __free_one_page has merged a pair of pages to create a higher order
> page. By doing this I am able to fill the gap and provide full coverage for
> all of the pages huge TLB order or larger.
>
> Signed-off-by: Alexander Duyck <[email protected]>

Looks like this will be helpful whenever active free page
hints are added. So I think it's a good idea to
add a hook.

However, could you split adding the hook to a separate
patch from the KVM hypercall based implementation?

Then e.g. Nilal's patches could reuse it too.



> ---
> arch/x86/include/asm/page.h | 12 ++++++++++++
> arch/x86/kernel/kvm.c | 28 ++++++++++++++++++++++++++++
> include/linux/gfp.h | 4 ++++
> mm/page_alloc.c | 2 ++
> 4 files changed, 46 insertions(+)
>
> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> index 4487ad7a3385..9540a97c9997 100644
> --- a/arch/x86/include/asm/page.h
> +++ b/arch/x86/include/asm/page.h
> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
> if (static_branch_unlikely(&pv_free_page_hint_enabled))
> __arch_free_page(page, order);
> }
> +
> +struct zone;
> +
> +#define HAVE_ARCH_MERGE_PAGE
> +void __arch_merge_page(struct zone *zone, struct page *page,
> + unsigned int order);
> +static inline void arch_merge_page(struct zone *zone, struct page *page,
> + unsigned int order)
> +{
> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> + __arch_merge_page(zone, page, order);
> +}
> #endif
>
> #include <linux/range.h>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 09c91641c36c..957bb4f427bb 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
> PAGE_SIZE << order);
> }
>
> +void __arch_merge_page(struct zone *zone, struct page *page,
> + unsigned int order)
> +{
> + /*
> + * The merging logic has merged a set of buddies up to the
> + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> + * advantage of this moment to notify the hypervisor of the free
> + * memory.
> + */
> + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> + return;
> +
> + /*
> + * Drop zone lock while processing the hypercall. This
> + * should be safe as the page has not yet been added
> + * to the buddy list as of yet and all the pages that
> + * were merged have had their buddy/guard flags cleared
> + * and their order reset to 0.
> + */
> + spin_unlock(&zone->lock);
> +
> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> + PAGE_SIZE << order);
> +
> + /* reacquire lock and resume freeing memory */
> + spin_lock(&zone->lock);
> +}
> +
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>
> /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index fdab7de7490d..4746d5560193 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> #ifndef HAVE_ARCH_FREE_PAGE
> static inline void arch_free_page(struct page *page, int order) { }
> #endif
> +#ifndef HAVE_ARCH_MERGE_PAGE
> +static inline void
> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> +#endif
> #ifndef HAVE_ARCH_ALLOC_PAGE
> static inline void arch_alloc_page(struct page *page, int order) { }
> #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c954f8c1fbc4..7a1309b0b7c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> page = page + (combined_pfn - pfn);
> pfn = combined_pfn;
> order++;
> +
> + arch_merge_page(zone, page, order);
> }
> if (max_order < MAX_ORDER) {
> /* If we are here, it means order is >= pageblock_order.

2019-02-11 06:41:20

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On 2019/2/5 2:15, Alexander Duyck wrote:
> From: Alexander Duyck <[email protected]>
>
> [...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c954f8c1fbc4..7a1309b0b7c5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> page = page + (combined_pfn - pfn);
> pfn = combined_pfn;
> order++;
> +
> + arch_merge_page(zone, page, order);

Not a proper place AFAICS.

Assume we have an order-8 page being sent here for merging and its order-8
buddy is also free; then order++ becomes 9 and arch_merge_page() will
hint to the host on this page as an order-9 page, no problem so far.
Then in the next round, assume the now order-9 page's buddy is also free:
order++ becomes 10 and arch_merge_page() will again hint to the host on
this page as an order-10 page. The first hint to the host became redundant.

I think the proper place is after the done_merging tag.

BTW, with arch_merge_page() at the proper place, I don't think patch3/4
is necessary - any freed page will go through merge anyway, we won't
lose any hint opportunity. Or do I miss anything?

> }
> if (max_order < MAX_ORDER) {
> /* If we are here, it means order is >= pageblock_order.
>

2019-02-11 13:31:11

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier


On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
>> From: Alexander Duyck <[email protected]>
>>
>> [...]
> Looks like this will be helpful whenever active free page
> hints are added. So I think it's a good idea to
> add a hook.
>
> However, could you split adding the hook to a separate
> patch from the KVM hypercall based implementation?
>
> Then e.g. Nilal's patches could reuse it too.
With the current design of my patch-set, if I use this hook to report
free pages, I will end up making redundant hints for the same pfns.

This is because the pages, once freed by the host, are returned back to
the buddy.

>
>
>> [...]
--
Regards
Nitesh



2019-02-11 14:25:41

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
>
> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
> >> From: Alexander Duyck <[email protected]>
> >>
> >> [...]
> > Looks like this will be helpful whenever active free page
> > hints are added. So I think it's a good idea to
> > add a hook.
> >
> > However, could you split adding the hook to a separate
> > patch from the KVM hypercall based implementation?
> >
> > Then e.g. Nilal's patches could reuse it too.
> With the current design of my patch-set, if I use this hook to report
> free pages, I will end up making redundant hints for the same pfns.
>
> This is because the pages, once freed by the host, are returned back to
> the buddy.

Suggestions on how you'd like to fix this? You do need this if
you introduce a size cut-off right?

> >
> >
> >> [...]
> --
> Regards
> Nitesh
>




2019-02-11 16:01:01

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
> On 2019/2/5 2:15, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c954f8c1fbc4..7a1309b0b7c5 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> > page = page + (combined_pfn - pfn);
> > pfn = combined_pfn;
> > order++;
> > +
> > + arch_merge_page(zone, page, order);
>
> Not a proper place AFAICS.
>
> Assume we have an order-8 page being sent here for merging and its order-8
> buddy is also free; then order++ becomes 9 and arch_merge_page() will
> hint to the host on this page as an order-9 page, no problem so far.
> Then in the next round, assume the now order-9 page's buddy is also free:
> order++ becomes 10 and arch_merge_page() will again hint to the host on
> this page as an order-10 page. The first hint to the host became redundant.

Actually the problem is even worse the other way around. My concern was
pages being incrementally freed.

With this setup I can catch when we have crossed the threshold from
order 8 to 9, and specifically for that case provide the hint. This
allows me to ignore orders above and below 9.

If I move the hint to the spot after the merging, I have no way of
telling whether I have already hinted the page at a lower order. As such I
would hint whenever it is merged up to order 9 or greater. So, for example, if
it merges up to order 9 and stops there, done_merging will report
an order-9 page; then if another page is freed and merged with this up
to order 10, you would be hinting on order 10 as well. By placing the function
here I can guarantee that no more than one hint is provided per 2MB page.

> I think the proper place is after the done_merging tag.
>
> BTW, with arch_merge_page() at the proper place, I don't think patch3/4
> is necessary - any freed page will go through merge anyway, we won't
> lose any hint opportunity. Or do I miss anything?

You can refer to my comment above. What I want to avoid is us hinting a
page multiple times if we aren't using MAX_ORDER - 1 as the limit. What
I am avoiding by placing this where I did is us doing a hint on orders
greater than our target hint order. So with this way I only perform one
hint per 2MB page, otherwise I would be performing multiple hints per
2MB page as every order above that would also trigger hints.


2019-02-11 16:26:44

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier


On 2/11/19 9:17 AM, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
>> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
>>>> From: Alexander Duyck <[email protected]>
>>>>
>>>> [...]
>>> Looks like this will be helpful whenever active free page
>>> hints are added. So I think it's a good idea to
>>> add a hook.
>>>
>>> However, could you split adding the hook to a separate
>>> patch from the KVM hypercall based implementation?
>>>
>>> Then e.g. Nilal's patches could reuse it too.
>> With the current design of my patch-set, if I use this hook to report
>> free pages, I will end up making redundant hints for the same pfns.
>>
>> This is because the pages, once freed by the host, are returned back to
>> the buddy.
> Suggestions on how you'd like to fix this? You do need this if
> you introduce a size cut-off right?

I do; there are two ways to go about it.

One is to use this hook and have a flag in the page structure indicating
whether that page has been freed or is in use, though I am not sure whether
this will be acceptable upstream.
The second is to find another place to invoke guest_free_page() after buddy
merging.

>
>>>
>>>> [...]
>> --
>> Regards
>> Nitesh
>>
>
>
--
Regards
Nitesh



2019-02-11 16:33:45

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Add guest support for providing free memory hints to the KVM hypervisor for
> > freed pages huge TLB size or larger. I am restricting the size to
> > huge TLB order and larger because the hypercalls are too expensive to be
> > performing one per 4K page.
>
> Even 2M pages start to get expensive with a TB guest.

Agreed.

> Really it seems we want a virtio ring so we can pass a batch of these.
> E.g. 256 entries, 2M each - that's more like it.

The only issue I see with doing that is that we then have to defer the
freeing. Doing that is going to introduce issues in the guest as we are
going to have pages going unused for some period of time while we wait
for the hint to complete, and we cannot just pull said pages back. I'm
not really a fan of the asynchronous nature of Nitesh's patches for
this reason.
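A minimal sketch of the compromise being weighed here (all names are hypothetical, not from the patches): accumulate hints in a fixed-size batch and flush them synchronously with a single exit when the batch fills, so the guest never has pages "in flight" that it cannot pull back.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of synchronous hint batching: instead of one
 * hypercall per huge page, queue up to HINT_BATCH (phys, len) entries
 * and flush them all with one (simulated) exit.  Because the flush is
 * synchronous, freeing is never deferred.
 */
#define HINT_BATCH 256

struct hint_entry {
	uint64_t phys;
	uint64_t len;
};

static struct hint_entry batch[HINT_BATCH];
static size_t batch_count;
static unsigned int exits;	/* counts simulated hypercalls */

static void hint_flush(void)
{
	if (batch_count) {
		exits++;	/* one exit covers the whole batch */
		batch_count = 0;
	}
}

static void hint_page(uint64_t phys, uint64_t len)
{
	batch[batch_count].phys = phys;
	batch[batch_count].len = len;
	if (++batch_count == HINT_BATCH)
		hint_flush();
}

/* Free npages 2M pages, hinting each; returns the exits taken. */
static unsigned int simulate_frees(unsigned int npages)
{
	unsigned int i;

	exits = 0;
	batch_count = 0;
	for (i = 0; i < npages; i++)
		hint_page((uint64_t)i << 21, 2UL << 20);
	hint_flush();	/* drain the partial batch */
	return exits;
}
```

With 256 entries of 2M each, one full batch covers 512MB per exit, which is the ratio Michael is pointing at above.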

> > Using the huge TLB order became the obvious
> > choice for the order to use as it allows us to avoid fragmentation of higher
> > order memory on the host.
> >
> > I have limited the functionality so that it doesn't work when page
> > poisoning is enabled. I did this because a write to the page after doing an
> > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > cycles to do so.
>
> Again that's leaking host implementation detail into guest interface.
>
> We are giving guest page hints to host that makes sense,
> weird interactions with other features due to host
> implementation details should be handled by host.

I don't view this as a host implementation detail; this is a guest
feature making use of all pages for debugging. If we are placing poison
values in the page then I wouldn't consider it an unused page; it is
being actively used to store the poison value. If we can achieve this
and free the page back to the host then even better, but until the
features can coexist we should not use page hinting while page
poisoning is enabled.

This is one of the reasons why I was opposed to just disabling page
poisoning when this feature was enabled in Nitesh's patches. If the
guest has page poisoning enabled it is doing something with the page.
It shouldn't be prevented from doing that because the host wants to
have the option to free the pages.


2019-02-11 20:34:04

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > freed pages huge TLB size or larger. I am restricting the size to
> > > huge TLB order and larger because the hypercalls are too expensive to be
> > > performing one per 4K page.
> >
> > Even 2M pages start to get expensive with a TB guest.
>
> Agreed.
>
> > Really it seems we want a virtio ring so we can pass a batch of these.
> > E.g. 256 entries, 2M each - that's more like it.
>
> The only issue I see with doing that is that we then have to defer the
> freeing. Doing that is going to introduce issues in the guest as we are
> going to have pages going unused for some period of time while we wait
> for the hint to complete, and we cannot just pull said pages back. I'm
> not really a fan of the asynchronous nature of Nitesh's patches for
> this reason.

Well, nothing prevents us from doing an extra exit to the hypervisor if
we want. The asynchronous nature is there as an optimization,
to allow the hypervisor to do its thing on a separate CPU.
Why not proceed with other things meanwhile?
And if the reason is that we are short on memory, then
maybe we should be less aggressive in hinting?

E.g. if we just have 2 pages:

hint page 1
page 1 hint processed?
yes - proceed to page 2
no - wait for interrupt

get interrupt that page 1 hint is processed
hint page 2


If the hypervisor happens to be running on the same CPU, it
can process things synchronously and we never enter
the "no" branch.
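The hint-then-wait control flow above can be sketched as follows (the host side is simulated and all names are illustrative): hint each page in turn, and only wait when the host reports the hint is still pending.

```c
#include <stdbool.h>

/*
 * Simulated host: a synchronous host completes every hint before
 * returning; an asynchronous one leaves it pending and would raise a
 * completion interrupt later.  Illustrative sketch only.
 */
struct sim_host {
	bool completes_synchronously;
};

/* Returns true if the hint for @page was processed before returning. */
static bool host_hint(struct sim_host *h, unsigned long page)
{
	(void)page;
	return h->completes_synchronously;
}

/*
 * Hint @npages pages one at a time, waiting for the completion
 * interrupt only when a hint is still pending.  Returns how many
 * times the guest had to wait.
 */
static unsigned int guest_hint_pages(struct sim_host *h, unsigned long npages)
{
	unsigned int waits = 0;
	unsigned long page;

	for (page = 0; page < npages; page++) {
		if (!host_hint(h, page))
			waits++;	/* would block here for the interrupt */
	}
	return waits;
}

/* Convenience wrapper for exercising both host behaviors. */
static unsigned int run(bool synchronous, unsigned long npages)
{
	struct sim_host h = { synchronous };

	return guest_hint_pages(&h, npages);
}
```

When the host processes hints on the same CPU, every hint completes synchronously and the wait path is never taken.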





> > > Using the huge TLB order became the obvious
> > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > order memory on the host.
> > >
> > > I have limited the functionality so that it doesn't work when page
> > > poisoning is enabled. I did this because a write to the page after doing an
> > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > cycles to do so.
> >
> > Again that's leaking host implementation detail into guest interface.
> >
> > We are giving guest page hints to host that makes sense,
> > weird interactions with other features due to host
> > implementation details should be handled by host.
>
> I don't view this as a host implementation detail, this is guest
> feature making use of all pages for debugging. If we are placing poison
> values in the page then I wouldn't consider them an unused page, it is
> being actively used to store the poison value.

Well, I guess that's a valid point of view for a kernel hacker, but the
pages are unused from the application's point of view.
However, poisoning is transparent to users, and most distro users
are not even aware it is going on. They just know that debug kernels
are slower.
A user loading a debug kernel and immediately breaking overcommit
is an unpleasant experience.

> If we can achieve this
> and free the page back to the host then even better, but until the
> features can coexist we should not use the page hinting while page
> poisoning is enabled.

Existing hinting in balloon allows them to coexist so I think we
need to set the bar just as high for any new variant.

> This is one of the reasons why I was opposed to just disabling page
> poisoning when this feature was enabled in Nitesh's patches. If the
> guest has page poisoning enabled it is doing something with the page.
> It shouldn't be prevented from doing that because the host wants to
> have the option to free the pages.

I agree but I think the decision belongs on the host. I.e.
hint the page but tell the host it needs to be careful
about the poison value. It might also mean we
need to make sure poisoning happens after the hinting, not before.

--
MST

2019-02-11 20:34:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On Mon, Feb 11, 2019 at 09:34:25AM -0800, Alexander Duyck wrote:
> On Sat, 2019-02-09 at 19:44 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:46AM -0800, Alexander Duyck wrote:
> > > From: Alexander Duyck <[email protected]>
> > >
> > > Add the host side of the KVM memory hinting support. With this we expose a
> > > feature bit indicating that the host will pass the messages along to the
> > > new madvise function.
> > >
> > > This functionality is mutually exclusive with device assignment. If a
> > > device is assigned we will disable the functionality as it could lead to a
> > > potential memory corruption if a device writes to a page after KVM has
> > > flagged it as not being used.
> >
> > I really dislike this kind of tie-in.
> >
> > Yes right now assignment is not smart enough but generally
> > you can protect the unused page in the IOMMU and that's it,
> > it's safe.
> >
> > So the policy should not leak into host/guest interface.
> > Instead it is better to just keep the pages pinned and
> > ignore the hint for now.
>
> Okay, I can do that. It also gives me a means of benchmarking just the
> hypercall cost versus the extra page faults and zeroing.

Good point. Same goes for poisoning :)

2019-02-11 20:34:30

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On Sat, 2019-02-09 at 19:44 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 10:15:46AM -0800, Alexander Duyck wrote:
> > From: Alexander Duyck <[email protected]>
> >
> > Add the host side of the KVM memory hinting support. With this we expose a
> > feature bit indicating that the host will pass the messages along to the
> > new madvise function.
> >
> > This functionality is mutually exclusive with device assignment. If a
> > device is assigned we will disable the functionality as it could lead to a
> > potential memory corruption if a device writes to a page after KVM has
> > flagged it as not being used.
>
> I really dislike this kind of tie-in.
>
> Yes right now assignment is not smart enough but generally
> you can protect the unused page in the IOMMU and that's it,
> it's safe.
>
> So the policy should not leak into host/guest interface.
> Instead it is better to just keep the pages pinned and
> ignore the hint for now.

Okay, I can do that. It also gives me a means of benchmarking just the
hypercall cost versus the extra page faults and zeroing.


2019-02-11 20:34:52

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On Mon, Feb 11, 2019 at 09:41:19AM -0800, Dave Hansen wrote:
> On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> > So the policy should not leak into host/guest interface.
> > Instead it is better to just keep the pages pinned and
> > ignore the hint for now.
>
> It does seems a bit silly to have guests forever hinting about freed
> memory when the host never has a hope of doing anything about it.
>
> Is that part fixable?


Yes just not with existing IOMMU APIs.

It's in the paragraph just above that you cut out:
Yes right now assignment is not smart enough but generally
you can protect the unused page in the IOMMU and that's it,
it's safe.

So e.g.

    extern int iommu_remap(struct iommu_domain *domain, unsigned long iova,
                           phys_addr_t paddr, size_t size, int prot);


I can elaborate if you like but generally we would need an API that
allows you to atomically update a mapping for a specific page without
perturbing the mapping for other pages.
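No iommu_remap() exists in the kernel today; the following is only a sketch of the semantics being proposed, using a toy array-backed "domain" and hypothetical names. The point is that one IOVA's translation is retargeted with a single in-place update, so neighboring mappings are untouched and the slot is never observed unmapped, unlike an unmap/map pair.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_ENTRIES 16
#define SIM_UNMAPPED UINT64_MAX

/* Toy stand-in for an IOMMU domain: one translation per IOVA slot. */
struct sim_domain {
	uint64_t paddr[SIM_ENTRIES];
};

/*
 * Atomically retarget one existing mapping.  Returns 0 on success,
 * -1 if the slot was not mapped.  Other slots are never modified --
 * the property a real iommu_remap() would need so that a hinted page
 * can be swapped out from under an assigned device safely.
 */
static int sim_iommu_remap(struct sim_domain *d, size_t iova, uint64_t paddr)
{
	if (iova >= SIM_ENTRIES || d->paddr[iova] == SIM_UNMAPPED)
		return -1;
	d->paddr[iova] = paddr;	/* single in-place update */
	return 0;
}
```

A host could use such a call to point a hinted page's IOVA at a protective dummy page without ever leaving a window in which the device sees no mapping.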

--
MST

2019-02-11 20:35:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> So the policy should not leak into host/guest interface.
> Instead it is better to just keep the pages pinned and
> ignore the hint for now.

It does seem a bit silly to have guests forever hinting about freed
memory when the host never has a hope of doing anything about it.

Is that part fixable?

2019-02-11 20:35:17

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 11, 2019 at 09:48:11AM -0800, Dave Hansen wrote:
> On 2/9/19 4:49 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> >> From: Alexander Duyck <[email protected]>
> >>
> >> Add guest support for providing free memory hints to the KVM hypervisor for
> >> freed pages huge TLB size or larger. I am restricting the size to
> >> huge TLB order and larger because the hypercalls are too expensive to be
> >> performing one per 4K page.
> > Even 2M pages start to get expensive with a TB guest.
>
> Yeah, but we don't allocate and free TB's of memory at a high frequency.
>
> > Really it seems we want a virtio ring so we can pass a batch of these.
> > E.g. 256 entries, 2M each - that's more like it.
>
> That only makes sense for a system that's doing high-frequency,
> discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
> (THP or hugetlb) is *not* super-high frequency just because of the
> latency for zeroing the page.

Heh but with a ton of free memory, and a thread zeroing some of
it out in the background, will this still be the case?
It could be that we'll be able to find clean pages
at all times.


> A virtio ring seems like an overblown solution to a non-existent problem.

It would be nice to see some traces to help us decide one way or the other.

--
MST

2019-02-11 20:35:51

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On Mon, Feb 11, 2019 at 11:24:02AM -0500, Nitesh Narayan Lal wrote:
>
> On 2/11/19 9:17 AM, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
> >> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
> >>> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
> >>>> From: Alexander Duyck <[email protected]>
> >>>>
> >>>> Because the implementation was limiting itself to only providing hints on
> >>>> pages huge TLB order sized or larger we introduced the possibility for free
> >>>> pages to slip past us because they are freed as something less then
> >>>> huge TLB in size and aggregated with buddies later.
> >>>>
> >>>> To address that I am adding a new call arch_merge_page which is called
> >>>> after __free_one_page has merged a pair of pages to create a higher order
> >>>> page. By doing this I am able to fill the gap and provide full coverage for
> >>>> all of the pages huge TLB order or larger.
> >>>>
> >>>> Signed-off-by: Alexander Duyck <[email protected]>
> >>> Looks like this will be helpful whenever active free page
> >>> hints are added. So I think it's a good idea to
> >>> add a hook.
> >>>
> >>> However, could you split adding the hook to a separate
> >>> patch from the KVM hypercall based implementation?
> >>>
> >>> Then e.g. Nilal's patches could reuse it too.
> >> With the current design of my patch-set, if I use this hook to report
> >> free pages. I will end up making redundant hints for the same pfns.
> >>
> >> This is because the pages once freed by the host, are returned back to
> >> the buddy.
> > Suggestions on how you'd like to fix this? You do need this if
> > you introduce a size cut-off right?
>
> I do, there are two ways to go about it.
>
> One is to use this and have a flag in the page structure indicating
> whether that page has been freed/used or not.

Not sure what you mean. The refcount does this, right?

> Though I am not sure if
> this will be acceptable upstream.
> Second is to find another place to invoke guest_free_page() post buddy
> merging.

Might be easier.





2019-02-11 20:35:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On 2/9/19 4:49 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
>> From: Alexander Duyck <[email protected]>
>>
>> Add guest support for providing free memory hints to the KVM hypervisor for
>> freed pages huge TLB size or larger. I am restricting the size to
>> huge TLB order and larger because the hypercalls are too expensive to be
>> performing one per 4K page.
> Even 2M pages start to get expensive with a TB guest.

Yeah, but we don't allocate and free TB's of memory at a high frequency.

> Really it seems we want a virtio ring so we can pass a batch of these.
> E.g. 256 entries, 2M each - that's more like it.

That only makes sense for a system that's doing high-frequency,
discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
(THP or hugetlb) is *not* super-high frequency just because of the
latency for zeroing the page.

A virtio ring seems like an overblown solution to a non-existent problem.

2019-02-11 20:36:23

by Nitesh Narayan Lal

[permalink] [raw]
Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On 2/11/19 12:41 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 11:24:02AM -0500, Nitesh Narayan Lal wrote:
>> On 2/11/19 9:17 AM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 11, 2019 at 08:30:03AM -0500, Nitesh Narayan Lal wrote:
>>>> On 2/9/19 7:57 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 04, 2019 at 10:15:58AM -0800, Alexander Duyck wrote:
>>>>>> From: Alexander Duyck <[email protected]>
>>>>>>
>>>>>> Because the implementation was limiting itself to only providing hints on
>>>>>> pages huge TLB order sized or larger we introduced the possibility for free
>>>>>> pages to slip past us because they are freed as something less then
>>>>>> huge TLB in size and aggregated with buddies later.
>>>>>>
>>>>>> To address that I am adding a new call arch_merge_page which is called
>>>>>> after __free_one_page has merged a pair of pages to create a higher order
>>>>>> page. By doing this I am able to fill the gap and provide full coverage for
>>>>>> all of the pages huge TLB order or larger.
>>>>>>
>>>>>> Signed-off-by: Alexander Duyck <[email protected]>
>>>>> Looks like this will be helpful whenever active free page
>>>>> hints are added. So I think it's a good idea to
>>>>> add a hook.
>>>>>
>>>>> However, could you split adding the hook to a separate
>>>>> patch from the KVM hypercall based implementation?
>>>>>
>>>>> Then e.g. Nilal's patches could reuse it too.
>>>> With the current design of my patch-set, if I use this hook to report
>>>> free pages. I will end up making redundant hints for the same pfns.
>>>>
>>>> This is because the pages once freed by the host, are returned back to
>>>> the buddy.
>>> Suggestions on how you'd like to fix this? You do need this if
>>> you introduce a size cut-off right?
>> I do, there are two ways to go about it.
>>
>> One is to  use this and have a flag in the page structure indicating
>> whether that page has been freed/used or not.
> Not sure what do you mean. The refcount does this right?
I meant a flag with which I could determine whether a PFN has already
been freed by the host or not. This is to avoid hinting the same page
repeatedly.
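One way to sketch that flag idea (a hypothetical userspace model using a pfn bitmap rather than a real page flag; all names are illustrative): hint only on the clear-to-set transition, and clear the bit when the allocator hands the page back out.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Toy model of "has this pfn already been hinted?" tracking, to avoid
 * repeating a hint for a page the host has already freed.  A real
 * implementation would use a page flag or similar.
 */
#define MAX_PFN 4096UL
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static unsigned long hinted_map[MAX_PFN / BITS_PER_LONG];
static unsigned int hints_sent;

static bool test_and_set_hinted(unsigned long pfn)
{
	unsigned long mask = 1UL << (pfn % BITS_PER_LONG);
	bool was_set = hinted_map[pfn / BITS_PER_LONG] & mask;

	hinted_map[pfn / BITS_PER_LONG] |= mask;
	return was_set;
}

/* Called when the page is handed back out by the allocator. */
static void page_reallocated(unsigned long pfn)
{
	hinted_map[pfn / BITS_PER_LONG] &= ~(1UL << (pfn % BITS_PER_LONG));
}

/* Hint the pfn only if it has not been hinted since its last use. */
static void guest_free_page(unsigned long pfn)
{
	if (!test_and_set_hinted(pfn))
		hints_sent++;	/* would issue the real hint here */
}
```

Freeing the same page twice without an intervening allocation then produces a single hint instead of two.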
--
Regards
Nitesh



2019-02-11 20:36:27

by Alexander Duyck

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > performing one per 4K page.
> > >
> > > Even 2M pages start to get expensive with a TB guest.
> >
> > Agreed.
> >
> > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > E.g. 256 entries, 2M each - that's more like it.
> >
> > The only issue I see with doing that is that we then have to defer the
> > freeing. Doing that is going to introduce issues in the guest as we are
> > going to have pages going unused for some period of time while we wait
> > for the hint to complete, and we cannot just pull said pages back. I'm
> > not really a fan of the asynchronous nature of Nitesh's patches for
> > this reason.
>
> Well nothing prevents us from doing an extra exit to the hypervisor if
> we want. The asynchronous nature is there as an optimization
> to allow hypervisor to do its thing on a separate CPU.
> Why not proceed doing other things meanwhile?
> And if the reason is that we are short on memory, then
> maybe we should be less aggressive in hinting?
>
> E.g. if we just have 2 pages:
>
> hint page 1
> page 1 hint processed?
> yes - proceed to page 2
> no - wait for interrupt
>
> get interrupt that page 1 hint is processed
> hint page 2
>
>
> If hypervisor happens to be running on same CPU it
> can process things synchronously and we never enter
> the no branch.
>

Another concern I would have about processing this asynchronously is
that we have the potential for multiple guest CPUs to become
bottlenecked by a single host CPU. I am not sure if that is something
that would be desirable.

> > > > Using the huge TLB order became the obvious
> > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > order memory on the host.
> > > >
> > > > I have limited the functionality so that it doesn't work when page
> > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > cycles to do so.
> > >
> > > Again that's leaking host implementation detail into guest interface.
> > >
> > > We are giving guest page hints to host that makes sense,
> > > weird interactions with other features due to host
> > > implementation details should be handled by host.
> >
> > I don't view this as a host implementation detail, this is guest
> > feature making use of all pages for debugging. If we are placing poison
> > values in the page then I wouldn't consider them an unused page, it is
> > being actively used to store the poison value.
>
> Well I guess it's a valid point of view for a kernel hacker, but they are
> unused from application's point of view.
> However poisoning is transparent to users and most distro users
> are not aware of it going on. They just know that debug kernels
> are slower.
> User loading a debug kernel and immediately breaking overcommit
> is an unpleasant experience.

How would that be any different than a user loading an older kernel
that doesn't have this feature and breaking overcommit as a result?

I still think it would be better if we left the poisoning enabled in
such a case and, if nothing else, just displayed a warning message that
hinting is disabled because of page poisoning.

One other thought I had on this is that one side effect of page
poisoning is probably that KSM would be able to merge all of the poison
pages together into a single page since they are all set to the same
values. So even with the poisoned pages it would be possible to reduce
total memory overhead.

> > If we can achieve this
> > and free the page back to the host then even better, but until the
> > features can coexist we should not use the page hinting while page
> > poisoning is enabled.
>
> Existing hinting in balloon allows them to coexist so I think we
> need to set the bar just as high for any new variant.

That is what I heard. I will have to look into this.

> > This is one of the reasons why I was opposed to just disabling page
> > poisoning when this feature was enabled in Nitesh's patches. If the
> > guest has page poisoning enabled it is doing something with the page.
> > It shouldn't be prevented from doing that because the host wants to
> > have the option to free the pages.
>
> I agree but I think the decision belongs on the host. I.e.
> hint the page but tell the host it needs to be careful
> about the poison value. It might also mean we
> need to make sure poisoning happens after the hinting, not before.

The only issue with poisoning after instead of before is that the hint
is ignored and we end up triggering a page fault and getting a zeroed
page as a result. It might make more sense to have an architecture
specific call that can be paravirtualized to handle poisoning the page
for us if we have the unused page hint enabled. Otherwise the write to
the page is guaranteed to invalidate the hint.


2019-02-11 20:36:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On 2/11/19 9:58 AM, Michael S. Tsirkin wrote:
>>> Really it seems we want a virtio ring so we can pass a batch of these.
>>> E.g. 256 entries, 2M each - that's more like it.
>> That only makes sense for a system that's doing high-frequency,
>> discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
>> (THP or hugetlb) is *not* super-high frequency just because of the
>> latency for zeroing the page.
> Heh but with a ton of free memory, and a thread zeroing some of
> it out in the background, will this still be the case?
> It could be that we'll be able to find clean pages
> at all times.

In a system where we have some asynchronous zeroing of memory, where
freed, non-zeroed memory is sequestered out of the allocator, yeah, that
could make sense.

But, that's not what we have today.

>> A virtio ring seems like an overblown solution to a non-existent problem.
> It would be nice to see some traces to help us decide one way or the other.

Yeah, agreed. Sounds like we need some more testing to see if these
approaches hit bottlenecks anywhere.

2019-02-11 20:37:52

by Alexander Duyck

Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On Mon, 2019-02-11 at 12:48 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 09:41:19AM -0800, Dave Hansen wrote:
> > On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> > > So the policy should not leak into host/guest interface.
> > > Instead it is better to just keep the pages pinned and
> > > ignore the hint for now.
> >
> > It does seem a bit silly to have guests forever hinting about freed
> > memory when the host never has a hope of doing anything about it.
> >
> > Is that part fixable?
>
>
> Yes just not with existing IOMMU APIs.
>
> It's in the paragraph just above that you cut out:
> Yes right now assignment is not smart enough but generally
> you can protect the unused page in the IOMMU and that's it,
> it's safe.
>
> So e.g.
> extern int iommu_remap(struct iommu_domain *domain, unsigned long iova,
> phys_addr_t paddr, size_t size, int prot);
>
>
> I can elaborate if you like but generally we would need an API that
> allows you to atomically update a mapping for a specific page without
> perturbing the mapping for other pages.
>

I still don't see how this would solve anything unless you have the
guest somehow hinting on what pages it is providing to the devices.
You would have to have the host invalidating the pages when the hint is
provided, and have a new hint tied to arch_alloc_page that would
rebuild the IOMMU mapping when a page is allocated.

I'm pretty certain that the added cost of that would make the hinting
pretty pointless as my experience has been that the IOMMU is too much
of a bottleneck to have multiple CPUs trying to create and invalidate
mappings simultaneously.


2019-02-11 20:41:54

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 2/4] kvm: Add host side support for free memory hints

On Mon, Feb 11, 2019 at 10:30:10AM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 12:48 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 09:41:19AM -0800, Dave Hansen wrote:
> > > On 2/9/19 4:44 PM, Michael S. Tsirkin wrote:
> > > > So the policy should not leak into host/guest interface.
> > > > Instead it is better to just keep the pages pinned and
> > > > ignore the hint for now.
> > >
> > > It does seem a bit silly to have guests forever hinting about freed
> > > memory when the host never has a hope of doing anything about it.
> > >
> > > Is that part fixable?
> >
> >
> > Yes just not with existing IOMMU APIs.
> >
> > It's in the paragraph just above that you cut out:
> > Yes right now assignment is not smart enough but generally
> > you can protect the unused page in the IOMMU and that's it,
> > it's safe.
> >
> > So e.g.
> > extern int iommu_remap(struct iommu_domain *domain, unsigned long iova,
> > phys_addr_t paddr, size_t size, int prot);
> >
> >
> > I can elaborate if you like but generally we would need an API that
> > allows you to atomically update a mapping for a specific page without
> > perturbing the mapping for other pages.
> >
>
> I still don't see how this would solve anything unless you have the
> guest somehow hinting on what pages it is providing to the devices.
>
> You would have to have the host invalidating the pages when the hint is
> provided, and have a new hint tied to arch_alloc_page that would
> rebuild the IOMMU mapping when a page is allocated.
>
> I'm pretty certain that the added cost of that would make the hinting
> pretty pointless as my experience has been that the IOMMU is too much
> of a bottleneck to have multiple CPUs trying to create and invalidate
> mappings simultaneously.

I agree it's a concern.

Another option would involve passing these hints in the DMA API.

How about the option of removing the device by hotplug when the
host needs to overcommit? That would involve either buffering
on the host, or requesting free pages after the device is removed
along the lines of the existing balloon code. That btw seems to
be an argument for making this hinting part of the balloon.


--
MST

2019-02-11 20:44:45

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > From: Alexander Duyck <[email protected]>
> > > > >
> > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > performing one per 4K page.
> > > >
> > > > Even 2M pages start to get expensive with a TB guest.
> > >
> > > Agreed.
> > >
> > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > E.g. 256 entries, 2M each - that's more like it.
> > >
> > > The only issue I see with doing that is that we then have to defer the
> > > freeing. Doing that is going to introduce issues in the guest as we are
> > > going to have pages going unused for some period of time while we wait
> > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > this reason.
> >
> > Well nothing prevents us from doing an extra exit to the hypervisor if
> > we want. The asynchronous nature is there as an optimization
> > to allow hypervisor to do its thing on a separate CPU.
> > Why not proceed doing other things meanwhile?
> > And if the reason is that we are short on memory, then
> > maybe we should be less aggressive in hinting?
> >
> > E.g. if we just have 2 pages:
> >
> > hint page 1
> > page 1 hint processed?
> > yes - proceed to page 2
> > no - wait for interrupt
> >
> > get interrupt that page 1 hint is processed
> > hint page 2
> >
> >
> > If hypervisor happens to be running on same CPU it
> > can process things synchronously and we never enter
> > the no branch.
> >
>
> Another concern I would have about processing this asynchronously is
> that we have the potential for multiple guest CPUs to become
> bottlenecked by a single host CPU. I am not sure if that is something
> that would be desirable.

Well, with a hypercall per page the fix is to block the VCPU
completely, which is also not for everyone.

If you can't push a free page hint to the host, then
ideally you just won't. That's a nice property of
the hinting we have upstream right now:
host too busy - hinting is just skipped.


> > > > > Using the huge TLB order became the obvious
> > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > order memory on the host.
> > > > >
> > > > > I have limited the functionality so that it doesn't work when page
> > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > cycles to do so.
> > > >
> > > > Again that's leaking host implementation detail into guest interface.
> > > >
> > > > We are giving guest page hints to host that makes sense,
> > > > weird interactions with other features due to host
> > > > implementation details should be handled by host.
> > >
> > > I don't view this as a host implementation detail, this is guest
> > > feature making use of all pages for debugging. If we are placing poison
> > > values in the page then I wouldn't consider them an unused page, it is
> > > being actively used to store the poison value.
> >
> > Well I guess it's a valid point of view for a kernel hacker, but they are
> > unused from application's point of view.
> > However poisoning is transparent to users and most distro users
> > are not aware of it going on. They just know that debug kernels
> > are slower.
> > User loading a debug kernel and immediately breaking overcommit
> > is an unpleasant experience.
>
> How would that be any different than a user loading an older kernel
> that doesn't have this feature and breaking overcommit as a result?

Well, an old kernel does not have the feature, so there is nothing to
debug. When we have a new feature that goes away in the debug kernel,
that's a big support problem, since it leads to heisenbugs.

> I still think it would be better if we left the poisoning enabled in
> such a case and just displayed a warning message if nothing else that
> hinting is disabled because of page poisoning.
>
> One other thought I had on this is that one side effect of page
> poisoning is probably that KSM would be able to merge all of the poison
> pages together into a single page since they are all set to the same
> values. So even with the poisoned pages it would be possible to reduce
> total memory overhead.

Right. And BTW one thing that host can do is pass
the hinted area to KSM for merging.
That requires an alloc hook to free it though.

Or we could add a per-VMA byte with the poison
value and use that on host to populate pages on fault.


> > > If we can achieve this
> > > and free the page back to the host then even better, but until the
> > > features can coexist we should not use the page hinting while page
> > > poisoning is enabled.
> >
> > Existing hinting in balloon allows them to coexist so I think we
> > need to set the bar just as high for any new variant.
>
> That is what I heard. I will have to look into this.

It's not doing anything smart right now, just checks
that poison == 0 and skips freeing if not.
But it can be enhanced transparently to guests.

> > > This is one of the reasons why I was opposed to just disabling page
> > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > guest has page poisoning enabled it is doing something with the page.
> > > It shouldn't be prevented from doing that because the host wants to
> > > have the option to free the pages.
> >
> > I agree but I think the decision belongs on the host. I.e.
> > hint the page but tell the host it needs to be careful
> > about the poison value. It might also mean we
> > need to make sure poisoning happens after the hinting, not before.
>
> The only issue with poisoning after instead of before is that the hint
> is ignored and we end up triggering a page fault and zero as a result.
> It might make more sense to have an architecture specific call that can
> be paravirtualized to handle the case of poisoning the page for us if
> we have the unused page hint enabled. Otherwise the write to the page
> is guaranteed to invalidate the hint.

Sounds interesting. So the arch hook will first poison and
then pass the page to the host?

Or we can also ask the host to poison for us; the problem is this
forces the host to either always write into the page or call
MADV_DONTNEED, whereas without it, it could do MADV_FREE. Maybe that is
not a big issue.

--
MST

2019-02-11 20:44:56

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 11, 2019 at 10:19:17AM -0800, Dave Hansen wrote:
> On 2/11/19 9:58 AM, Michael S. Tsirkin wrote:
> >>> Really it seems we want a virtio ring so we can pass a batch of these.
> >>> E.g. 256 entries, 2M each - that's more like it.
> >> That only makes sense for a system that's doing high-frequency,
> >> discontiguous frees of 2M pages. Right now, a 2M free/realloc cycle
> >> (THP or hugetlb) is *not* super-high frequency just because of the
> >> latency for zeroing the page.
> > Heh but with a ton of free memory, and a thread zeroing some of
> > it out in the background, will this still be the case?
> > It could be that we'll be able to find clean pages
> > at all times.
>
> In a system where we have some asynchronous zeroing of memory, where
> freed, non-zeroed memory is sequestered out of the allocator, yeah, that
> could make sense.
>
> But, that's not what we have today.

Right. I wonder whether it's smart to build this assumption
into a host/guest interface though.

> >> A virtio ring seems like an overblown solution to a non-existent problem.
> > It would be nice to see some traces to help us decide one way or the other.
>
> Yeah, agreed. Sounds like we need some more testing to see if these
> approaches hit bottlenecks anywhere.

2019-02-11 21:02:40

by Alexander Duyck

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > From: Alexander Duyck <[email protected]>
> > > > > >
> > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > performing one per 4K page.
> > > > >
> > > > > Even 2M pages start to get expensive with a TB guest.
> > > >
> > > > Agreed.
> > > >
> > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > E.g. 256 entries, 2M each - that's more like it.
> > > >
> > > > The only issue I see with doing that is that we then have to defer the
> > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > going to have pages going unused for some period of time while we wait
> > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > this reason.
> > >
> > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > we want. The asynchronous nature is there as an optimization
> > > to allow hypervisor to do its thing on a separate CPU.
> > > Why not proceed doing other things meanwhile?
> > > And if the reason is that we are short on memory, then
> > > maybe we should be less aggressive in hinting?
> > >
> > > E.g. if we just have 2 pages:
> > >
> > > hint page 1
> > > page 1 hint processed?
> > > yes - proceed to page 2
> > > no - wait for interrupt
> > >
> > > get interrupt that page 1 hint is processed
> > > hint page 2
> > >
> > >
> > > If hypervisor happens to be running on same CPU it
> > > can process things synchronously and we never enter
> > > the no branch.
> > >
> >
> > Another concern I would have about processing this asynchronously is
> > that we have the potential for multiple guest CPUs to become
> > bottlenecked by a single host CPU. I am not sure if that is something
> > that would be desirable.
>
> Well with a hypercall per page the fix is to block VCPU
> completely which is also not for everyone.
>
> If you can't push a free page hint to host, then
> ideally you just won't. That's a nice property of
> hinting we have upstream right now.
> Host too busy - hinting is just skipped.

Right, but if you do that then there is a potential to end up missing
hints for a large portion of memory. It seems like you would end up
with even bigger issues since then at that point you have essentially
leaked memory.

I would think you would need a way to resync the host and the guest
after something like that. Otherwise you can have memory that will just
go unused for an extended period if a guest just goes idle.

> > > > > > Using the huge TLB order became the obvious
> > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > order memory on the host.
> > > > > >
> > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > cycles to do so.
> > > > >
> > > > > Again that's leaking host implementation detail into guest interface.
> > > > >
> > > > > We are giving guest page hints to host that makes sense,
> > > > > weird interactions with other features due to host
> > > > > implementation details should be handled by host.
> > > >
> > > > I don't view this as a host implementation detail, this is guest
> > > > feature making use of all pages for debugging. If we are placing poison
> > > > values in the page then I wouldn't consider them an unused page, it is
> > > > being actively used to store the poison value.
> > >
> > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > unused from application's point of view.
> > > However poisoning is transparent to users and most distro users
> > > are not aware of it going on. They just know that debug kernels
> > > are slower.
> > > User loading a debug kernel and immediately breaking overcommit
> > > is an unpleasant experience.
> >
> > How would that be any different than a user loading an older kernel
> > that doesn't have this feature and breaking overcommit as a result?
>
> Well old kernel does not have the feature so nothing to debug.
> When we have a new feature that goes away in the debug kernel,
> that's a big support problem since this leads to heisenbugs.

Trying to debug host features from the guest would be a pain anyway as
a guest shouldn't even really know what the underlying setup of the
guest is supposed to be.

> > I still think it would be better if we left the poisoning enabled in
> > such a case and just displayed a warning message if nothing else that
> > hinting is disabled because of page poisoning.
> >
> > One other thought I had on this is that one side effect of page
> > poisoning is probably that KSM would be able to merge all of the poison
> > pages together into a single page since they are all set to the same
> > values. So even with the poisoned pages it would be possible to reduce
> > total memory overhead.
>
> Right. And BTW one thing that host can do is pass
> the hinted area to KSM for merging.
> That requires an alloc hook to free it though.
>
> Or we could add a per-VMA byte with the poison
> value and use that on host to populate pages on fault.
>
>
> > > > If we can achieve this
> > > > and free the page back to the host then even better, but until the
> > > > features can coexist we should not use the page hinting while page
> > > > poisoning is enabled.
> > >
> > > Existing hinting in balloon allows them to coexist so I think we
> > > need to set the bar just as high for any new variant.
> >
> > That is what I heard. I will have to look into this.
>
> It's not doing anything smart right now, just checks
> that poison == 0 and skips freeing if not.
> But it can be enhanced transparently to guests.

Okay, so it probably should be extended to add something like a poison
page that could replace the zero page for reads to a page that has been
unmapped.

> > > > This is one of the reasons why I was opposed to just disabling page
> > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > guest has page poisoning enabled it is doing something with the page.
> > > > It shouldn't be prevented from doing that because the host wants to
> > > > have the option to free the pages.
> > >
> > > I agree but I think the decision belongs on the host. I.e.
> > > hint the page but tell the host it needs to be careful
> > > about the poison value. It might also mean we
> > > need to make sure poisoning happens after the hinting, not before.
> >
> > The only issue with poisoning after instead of before is that the hint
> > is ignored and we end up triggering a page fault and zero as a result.
> > It might make more sense to have an architecture specific call that can
> > be paravirtualized to handle the case of poisoning the page for us if
> > we have the unused page hint enabled. Otherwise the write to the page
> > is guaranteed to invalidate the hint.
>
> Sounds interesting. So the arch hook will first poison and
> then pass the page to the host?
>
> Or we can also ask the host to poison for us; the problem is this
> forces the host to either always write into the page or call
> MADV_DONTNEED, whereas without it, it could do MADV_FREE. Maybe that is
> not a big issue.

I would think we would ask the host to poison for us. If I am not
mistaken both solutions right now are using MADV_DONTNEED. I would tend
to lean that way if we are doing page poisoning since the cost for
zeroing/poisoning the page on the host could be canceled out by
dropping the page poisoning on the guest.

Then again since we are doing higher order pages only, and the
poisoning is supposed to happen before we get into __free_one_page we
would probably have to do both the poisoning, and the poison on fault.


2019-02-11 22:53:46

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 11, 2019 at 01:00:53PM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > > From: Alexander Duyck <[email protected]>
> > > > > > >
> > > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > > performing one per 4K page.
> > > > > >
> > > > > > Even 2M pages start to get expensive with a TB guest.
> > > > >
> > > > > Agreed.
> > > > >
> > > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > > E.g. 256 entries, 2M each - that's more like it.
> > > > >
> > > > > The only issue I see with doing that is that we then have to defer the
> > > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > > going to have pages going unused for some period of time while we wait
> > > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > > this reason.
> > > >
> > > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > > we want. The asynchronous nature is there as an optimization
> > > > to allow hypervisor to do its thing on a separate CPU.
> > > > Why not proceed doing other things meanwhile?
> > > > And if the reason is that we are short on memory, then
> > > > maybe we should be less aggressive in hinting?
> > > >
> > > > E.g. if we just have 2 pages:
> > > >
> > > > hint page 1
> > > > page 1 hint processed?
> > > > yes - proceed to page 2
> > > > no - wait for interrupt
> > > >
> > > > get interrupt that page 1 hint is processed
> > > > hint page 2
> > > >
> > > >
> > > > If hypervisor happens to be running on same CPU it
> > > > can process things synchronously and we never enter
> > > > the no branch.
> > > >
> > >
> > > Another concern I would have about processing this asynchronously is
> > > that we have the potential for multiple guest CPUs to become
> > > bottlenecked by a single host CPU. I am not sure if that is something
> > > that would be desirable.
> >
> > Well with a hypercall per page the fix is to block VCPU
> > completely which is also not for everyone.
> >
> > If you can't push a free page hint to host, then
> > ideally you just won't. That's a nice property of
> > hinting we have upstream right now.
> > Host too busy - hinting is just skipped.
>
> Right, but if you do that then there is a potential to end up missing
> hints for a large portion of memory. It seems like you would end up
> with even bigger issues since then at that point you have essentially
> leaked memory.
> I would think you would need a way to resync the host and the guest
> after something like that. Otherwise you can have memory that will just
> go unused for an extended period if a guest just goes idle.

Yes, and that is my point. The existing hints code will just take a page
off the free list in that case, so it resyncs using the free list.

Something like this could work then: mark up
hinted pages with a flag (it's easy to find unused
flags for free pages), then when you get an interrupt
because outstanding hints have been consumed,
get unflagged/unhinted pages from the buddy and pass
them to the host.


>
> > > > > > > Using the huge TLB order became the obvious
> > > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > > order memory on the host.
> > > > > > >
> > > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > > cycles to do so.
> > > > > >
> > > > > > Again that's leaking host implementation detail into guest interface.
> > > > > >
> > > > > > We are giving guest page hints to host that makes sense,
> > > > > > weird interactions with other features due to host
> > > > > > implementation details should be handled by host.
> > > > >
> > > > > I don't view this as a host implementation detail, this is guest
> > > > > feature making use of all pages for debugging. If we are placing poison
> > > > > values in the page then I wouldn't consider them an unused page, it is
> > > > > being actively used to store the poison value.
> > > >
> > > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > > unused from application's point of view.
> > > > However poisoning is transparent to users and most distro users
> > > > are not aware of it going on. They just know that debug kernels
> > > > are slower.
> > > > User loading a debug kernel and immediately breaking overcommit
> > > > is an unpleasant experience.
> > >
> > How would that be any different than a user loading an older kernel
> > > that doesn't have this feature and breaking overcommit as a result?
> >
> > Well old kernel does not have the feature so nothing to debug.
> > When we have a new feature that goes away in the debug kernel,
> > that's a big support problem since this leads to heisenbugs.
>
> Trying to debug host features from the guest would be a pain anyway as
> a guest shouldn't even really know what the underlying setup of the
> guest is supposed to be.

I'm talking about debugging the guest though.

> > > I still think it would be better if we left the poisoning enabled in
> > > such a case and just displayed a warning message if nothing else that
> > > hinting is disabled because of page poisoning.
> > >
> > > One other thought I had on this is that one side effect of page
> > > poisoning is probably that KSM would be able to merge all of the poison
> > > pages together into a single page since they are all set to the same
> > > values. So even with the poisoned pages it would be possible to reduce
> > > total memory overhead.
> >
> > Right. And BTW one thing that host can do is pass
> > the hinted area to KSM for merging.
> > That requires an alloc hook to free it though.
> >
> > Or we could add a per-VMA byte with the poison
> > value and use that on host to populate pages on fault.
> >
> >
> > > > > If we can achieve this
> > > > > and free the page back to the host then even better, but until the
> > > > > features can coexist we should not use the page hinting while page
> > > > > poisoning is enabled.
> > > >
> > > > Existing hinting in balloon allows them to coexist so I think we
> > > > need to set the bar just as high for any new variant.
> > >
> > > That is what I heard. I will have to look into this.
> >
> > It's not doing anything smart right now, just checks
> > that poison == 0 and skips freeing if not.
> > But it can be enhanced transparently to guests.
>
> Okay, so it probably should be extended to add something like poison
> page that could replace the zero page for reads to a page that has been
> unmapped.
>
> > > > > This is one of the reasons why I was opposed to just disabling page
> > > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > > guest has page poisoning enabled it is doing something with the page.
> > > > > It shouldn't be prevented from doing that because the host wants to
> > > > > have the option to free the pages.
> > > >
> > > > I agree but I think the decision belongs on the host. I.e.
> > > > hint the page but tell the host it needs to be careful
> > > > about the poison value. It might also mean we
> > > > need to make sure poisoning happens after the hinting, not before.
> > >
> > > The only issue with poisoning after instead of before is that the hint
> > > is ignored and we end up triggering a page fault and zero as a result.
> > > It might make more sense to have an architecture specific call that can
> > > be paravirtualized to handle the case of poisoning the page for us if
> > > we have the unused page hint enabled. Otherwise the write to the page
> > > is guaranteed to invalidate the hint.
> >
> > Sounds interesting. So the arch hook will first poison and
> > then pass the page to the host?
> >
> > Or we can also ask the host to poison for us; the problem is this
> > forces the host to either always write into the page or call
> > MADV_DONTNEED, whereas without it, it could do MADV_FREE. Maybe that is
> > not a big issue.
>
> I would think we would ask the host to poison for us. If I am not
> mistaken both solutions right now are using MADV_DONTNEED. I would tend
> to lean that way if we are doing page poisoning since the cost for
> zeroing/poisoning the page on the host could be canceled out by
> dropping the page poisoning on the guest.
>
> Then again since we are doing higher order pages only, and the
> poisoning is supposed to happen before we get into __free_one_page we
> would probably have to do both the poisoning, and the poison on fault.


Oh that's a nice trick. So in fact if we just make sure
we never report PAGE_SIZE pages then poisoning will
automatically happen before reporting?
So we just need to teach host to poison on fault.
Sounds cool and we can always optimize further later.

--
MST

2019-02-12 00:34:51

by Michael S. Tsirkin

Subject: Re: [RFC PATCH 3/4] kvm: Add guest side support for free memory hints

On Mon, Feb 11, 2019 at 04:09:53PM -0800, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 17:52 -0500, Michael S. Tsirkin wrote:
> > On Mon, Feb 11, 2019 at 01:00:53PM -0800, Alexander Duyck wrote:
> > > On Mon, 2019-02-11 at 14:54 -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 11, 2019 at 10:10:06AM -0800, Alexander Duyck wrote:
> > > > > On Mon, 2019-02-11 at 12:36 -0500, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 11, 2019 at 08:31:34AM -0800, Alexander Duyck wrote:
> > > > > > > On Sat, 2019-02-09 at 19:49 -0500, Michael S. Tsirkin wrote:
> > > > > > > > On Mon, Feb 04, 2019 at 10:15:52AM -0800, Alexander Duyck wrote:
> > > > > > > > > From: Alexander Duyck <[email protected]>
> > > > > > > > >
> > > > > > > > > Add guest support for providing free memory hints to the KVM hypervisor for
> > > > > > > > > freed pages huge TLB size or larger. I am restricting the size to
> > > > > > > > > huge TLB order and larger because the hypercalls are too expensive to be
> > > > > > > > > performing one per 4K page.
> > > > > > > >
> > > > > > > > Even 2M pages start to get expensive with a TB guest.
> > > > > > >
> > > > > > > Agreed.
> > > > > > >
> > > > > > > > Really it seems we want a virtio ring so we can pass a batch of these.
> > > > > > > > E.g. 256 entries, 2M each - that's more like it.
> > > > > > >
> > > > > > > The only issue I see with doing that is that we then have to defer the
> > > > > > > freeing. Doing that is going to introduce issues in the guest as we are
> > > > > > > going to have pages going unused for some period of time while we wait
> > > > > > > for the hint to complete, and we cannot just pull said pages back. I'm
> > > > > > > not really a fan of the asynchronous nature of Nitesh's patches for
> > > > > > > this reason.
> > > > > >
> > > > > > Well nothing prevents us from doing an extra exit to the hypervisor if
> > > > > > we want. The asynchronous nature is there as an optimization
> > > > > > to allow hypervisor to do its thing on a separate CPU.
> > > > > > Why not proceed doing other things meanwhile?
> > > > > > And if the reason is that we are short on memory, then
> > > > > > maybe we should be less aggressive in hinting?
> > > > > >
> > > > > > E.g. if we just have 2 pages:
> > > > > >
> > > > > > hint page 1
> > > > > > page 1 hint processed?
> > > > > > yes - proceed to page 2
> > > > > > no - wait for interrupt
> > > > > >
> > > > > > get interrupt that page 1 hint is processed
> > > > > > hint page 2
> > > > > >
> > > > > >
> > > > > > If hypervisor happens to be running on same CPU it
> > > > > > can process things synchronously and we never enter
> > > > > > the no branch.
> > > > > >
> > > > >
> > > > > Another concern I would have about processing this asynchronously is
> > > > > that we have the potential for multiple guest CPUs to become
> > > > > bottlenecked by a single host CPU. I am not sure if that is something
> > > > > that would be desirable.
> > > >
> > > > Well with a hypercall per page the fix is to block VCPU
> > > > completely which is also not for everyone.
> > > >
> > > > If you can't push a free page hint to host, then
> > > > ideally you just won't. That's a nice property of
> > > > hinting we have upstream right now.
> > > > Host too busy - hinting is just skipped.
> > >
> > > Right, but if you do that then there is a potential to end up missing
> > > hints for a large portion of memory. It seems like you would end up
> > > with even bigger issues since then at that point you have essentially
> > > leaked memory.
> > > I would think you would need a way to resync the host and the guest
> > > after something like that. Otherwise you can have memory that will just
> > > go unused for an extended period if a guest just goes idle.
> >
> > Yes and that is my point. Existing hints code will just take a page off
> > the free list in that case so it resyncs using the free list.
> >
> > Something like this could work then: mark up
> > hinted pages with a flag (it's easy to find unused
> > flags for free pages) then when you get an interrupt
> > because outstanding hints have been consumed,
> > get unflagged/unhinted pages from buddy and pass
> > them to host.
>
> Ugh. This is beginning to sound like yet another daemon that will have
> to be running to handle missed sync events.

Why a daemon? Not at all. You get an interrupt, you schedule
a wq immediately or just do it from the interrupt handler.

> I really think that taking an async approach for this will be nothing
> but trouble. You are going to have a difficult time maintaining any
> sort of coherency on the freelist without the daemon having to take the
> zone lock and then notify the host of what is free and what isn't.

We seem to be doing fine without zone lock for now.
Just plain alloc_pages.

> > >
> > > > > > > > > Using the huge TLB order became the obvious
> > > > > > > > > choice for the order to use as it allows us to avoid fragmentation of higher
> > > > > > > > > order memory on the host.
> > > > > > > > >
> > > > > > > > > I have limited the functionality so that it doesn't work when page
> > > > > > > > > poisoning is enabled. I did this because a write to the page after doing an
> > > > > > > > > MADV_DONTNEED would effectively negate the hint, so it would be wasting
> > > > > > > > > cycles to do so.
> > > > > > > >
> > > > > > > > Again that's leaking host implementation detail into guest interface.
> > > > > > > >
> > > > > > > > We are giving guest page hints to host that makes sense,
> > > > > > > > weird interactions with other features due to host
> > > > > > > > implementation details should be handled by host.
> > > > > > >
> > > > > > > I don't view this as a host implementation detail, this is guest
> > > > > > > feature making use of all pages for debugging. If we are placing poison
> > > > > > > values in the page then I wouldn't consider them an unused page, it is
> > > > > > > being actively used to store the poison value.
> > > > > >
> > > > > > Well I guess it's a valid point of view for a kernel hacker, but they are
> > > > > > unused from application's point of view.
> > > > > > However poisoning is transparent to users and most distro users
> > > > > > are not aware of it going on. They just know that debug kernels
> > > > > > are slower.
> > > > > > User loading a debug kernel and immediately breaking overcommit
> > > > > > is an unpleasant experience.
> > > > >
> > > > > How would that be any different than a user loading an older kernel
> > > > > that doesn't have this feature and breaking overcommit as a result?
> > > >
> > > > Well old kernel does not have the feature so nothing to debug.
> > > > When we have a new feature that goes away in the debug kernel,
> > > > that's a big support problem since this leads to heisenbugs.
> > >
> > > Trying to debug host features from the guest would be a pain anyway as
> > > a guest shouldn't even really know what the underlying setup of the
> > > guest is supposed to be.
> >
> > I'm talking about debugging the guest though.
>
> Right. But my point is if it is a guest feature related to memory that
> you are debugging, then disabling the page hinting would probably be an
> advisable step anyway since it would have the potential for memory
> corruptions itself due to its nature.

Oh absolutely. So that's why I wanted debug kernel to be
as close as possible to non-debug one in that respect.
If one gets a corruption we want it reproducible on debug too.

> > > > > I still think it would be better if we left the poisoning enabled in
> > > > > such a case and just displayed a warning message if nothing else that
> > > > > hinting is disabled because of page poisoning.
> > > > >
> > > > > One other thought I had on this is that one side effect of page
> > > > > poisoning is probably that KSM would be able to merge all of the poison
> > > > > pages together into a single page since they are all set to the same
> > > > > values. So even with the poisoned pages it would be possible to reduce
> > > > > total memory overhead.
> > > >
> > > > Right. And BTW one thing that host can do is pass
> > > > the hinted area to KSM for merging.
> > > > That requires an alloc hook to free it though.
> > > >
> > > > Or we could add a per-VMA byte with the poison
> > > > value and use that on host to populate pages on fault.
> > > >
> > > >
> > > > > > > If we can achieve this
> > > > > > > and free the page back to the host then even better, but until the
> > > > > > > features can coexist we should not use the page hinting while page
> > > > > > > poisoning is enabled.
> > > > > >
> > > > > > Existing hinting in balloon allows them to coexist so I think we
> > > > > > need to set the bar just as high for any new variant.
> > > > >
> > > > > That is what I heard. I will have to look into this.
> > > >
> > > > It's not doing anything smart right now, just checks
> > > > that poison == 0 and skips freeing if not.
> > > > But it can be enhanced transparently to guests.
> > >
> > > Okay, so it probably should be extended to add something like poison
> > > page that could replace the zero page for reads to a page that has been
> > > unmapped.
> > >
> > > > > > > This is one of the reasons why I was opposed to just disabling page
> > > > > > > poisoning when this feature was enabled in Nitesh's patches. If the
> > > > > > > guest has page poisoning enabled it is doing something with the page.
> > > > > > > It shouldn't be prevented from doing that because the host wants to
> > > > > > > have the option to free the pages.
> > > > > >
> > > > > > I agree but I think the decision belongs on the host. I.e.
> > > > > > hint the page but tell the host it needs to be careful
> > > > > > about the poison value. It might also mean we
> > > > > > need to make sure poisoning happens after the hinting, not before.
> > > > >
> > > > > The only issue with poisoning after instead of before is that the hint
> > > > > is ignored and we end up triggering a page fault and zero as a result.
> > > > > It might make more sense to have an architecture specific call that can
> > > > > be paravirtualized to handle the case of poisoning the page for us if
> > > > > we have the unused page hint enabled. Otherwise the write to the page
> > > > > is a given to invalidate the hint.
> > > >
> > > > Sounds interesting. So the arch hook will first poison and
> > > > then pass the page to the host?
> > > >
> > > > Or we can also ask the host to poison for us, problem is this forces
> > > > host to either always write into page, or call MADV_DONTNEED,
> > > > without it could do MADV_FREE. Maybe that is not a big issue.
> > >
> > > I would think we would ask the host to poison for us. If I am not
> > > mistaken both solutions right now are using MADV_DONTNEED. I would tend
> > > to lean that way if we are doing page poisoning since the cost for
> > > zeroing/poisoning the page on the host could be canceled out by
> > > dropping the page poisoning on the guest.
> > >
> > > Then again since we are doing higher order pages only, and the
> > > poisoning is supposed to happen before we get into __free_one_page we
> > > would probably have to do both the poisoning, and the poison on fault.
> >
> >
> > Oh that's a nice trick. So in fact if we just make sure
> > we never report PAGE_SIZE pages then poisoning will
> > automatically happen before reporting?
> > So we just need to teach host to poison on fault.
> > Sounds cool and we can always optimize further later.
>
> That is kind of what I was thinking. In the grand scheme of things I
> figure most of the expense is in the fault and page zeroing bits of the
> code path. I have done a bit of testing today with the patch that just
> drops the messages if a device is assigned, and just the hypercall bits
> are only causing about a 2.5% regression in performance on a
> will-it-scale/page-fault1 test. However if I commit to the full setup with the
> madvise, page fault, and zeroing then I am seeing an 11.5% drop in
> performance.
>
> I think in order to really make this pay off we may need to look into
> avoiding zeroing or poisoning the page in both the host and the guest.
> I will have to look into some things as it looks like there was
> somebody from Intel may have been working on doing some work to address
> that based on the presentation I found at the link below:
>
> https://www.lfasiallc.com/wp-content/uploads/2017/11/Use-Hyper-V-Enlightenments-to-Increase-KVM-VM-Performance_Density_Chao-Peng.pdf
>

2019-02-12 02:11:03

by Aaron Lu

Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On 2019/2/11 23:58, Alexander Duyck wrote:
> On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
>> On 2019/2/5 2:15, Alexander Duyck wrote:
>>> From: Alexander Duyck <[email protected]>
>>>
>>> Because the implementation was limiting itself to only providing hints on
>>> pages huge TLB order sized or larger we introduced the possibility for free
>>> pages to slip past us because they are freed as something less than
>>> huge TLB in size and aggregated with buddies later.
>>>
>>> To address that I am adding a new call arch_merge_page which is called
>>> after __free_one_page has merged a pair of pages to create a higher order
>>> page. By doing this I am able to fill the gap and provide full coverage for
>>> all of the pages huge TLB order or larger.
>>>
>>> Signed-off-by: Alexander Duyck <[email protected]>
>>> ---
>>> arch/x86/include/asm/page.h | 12 ++++++++++++
>>> arch/x86/kernel/kvm.c | 28 ++++++++++++++++++++++++++++
>>> include/linux/gfp.h | 4 ++++
>>> mm/page_alloc.c | 2 ++
>>> 4 files changed, 46 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
>>> index 4487ad7a3385..9540a97c9997 100644
>>> --- a/arch/x86/include/asm/page.h
>>> +++ b/arch/x86/include/asm/page.h
>>> @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
>>> if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> __arch_free_page(page, order);
>>> }
>>> +
>>> +struct zone;
>>> +
>>> +#define HAVE_ARCH_MERGE_PAGE
>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>> + unsigned int order);
>>> +static inline void arch_merge_page(struct zone *zone, struct page *page,
>>> + unsigned int order)
>>> +{
>>> + if (static_branch_unlikely(&pv_free_page_hint_enabled))
>>> + __arch_merge_page(zone, page, order);
>>> +}
>>> #endif
>>>
>>> #include <linux/range.h>
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index 09c91641c36c..957bb4f427bb 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
>>> PAGE_SIZE << order);
>>> }
>>>
>>> +void __arch_merge_page(struct zone *zone, struct page *page,
>>> + unsigned int order)
>>> +{
>>> + /*
>>> + * The merging logic has merged a set of buddies up to the
>>> + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
>>> + * advantage of this moment to notify the hypervisor of the free
>>> + * memory.
>>> + */
>>> + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
>>> + return;
>>> +
>>> + /*
>>> + * Drop zone lock while processing the hypercall. This
>>> + * should be safe as the page has not yet been added
>>> + * to the buddy list and all the pages that
>>> + * were merged have had their buddy/guard flags cleared
>>> + * and their order reset to 0.
>>> + */
>>> + spin_unlock(&zone->lock);
>>> +
>>> + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
>>> + PAGE_SIZE << order);
>>> +
>>> + /* reacquire lock and resume freeing memory */
>>> + spin_lock(&zone->lock);
>>> +}
>>> +
>>> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>
>>> /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index fdab7de7490d..4746d5560193 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
>>> #ifndef HAVE_ARCH_FREE_PAGE
>>> static inline void arch_free_page(struct page *page, int order) { }
>>> #endif
>>> +#ifndef HAVE_ARCH_MERGE_PAGE
>>> +static inline void
>>> +arch_merge_page(struct zone *zone, struct page *page, int order) { }
>>> +#endif
>>> #ifndef HAVE_ARCH_ALLOC_PAGE
>>> static inline void arch_alloc_page(struct page *page, int order) { }
>>> #endif
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index c954f8c1fbc4..7a1309b0b7c5 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
>>> page = page + (combined_pfn - pfn);
>>> pfn = combined_pfn;
>>> order++;
>>> +
>>> + arch_merge_page(zone, page, order);
>>
>> Not a proper place AFAICS.
>>
>> Assume we have an order-8 page being sent here for merge and its order-8
>> buddy is also free, then order++ became 9 and arch_merge_page() will do
>> the hint to host on this page as an order-9 page, no problem so far.
>> Then the next round, assume the now order-9 page's buddy is also free,
>> order++ will become 10 and arch_merge_page() will again hint to host on
>> this page as an order-10 page. The first hint to host became redundant.
>
> Actually the problem is even worse the other way around. My concern was
> pages being incrementally freed.
>
> With this setup I can catch when we have crossed the threshold from
> order 8 to 9, and specifically for that case provide the hint. This
> allows me to ignore orders above and below 9.

OK, I see, you are now only hinting for pages with order 9, not above.

> If I move the hint to the spot after the merging I have no way of
> telling if I have hinted the page as a lower order or not. As such I
> will hint if it is merged up to orders 9 or greater. So for example if
> it merges up to order 9 and stops there then done_merging will report
> an order 9 page, then if another page is freed and merged with this up
> to order 10 you would be hinting on order 10. By placing the function
> here I can guarantee that no more than 1 hint is provided per 2MB page.

So what's the downside of hinting the page as order-10 after merge
compared to as order-9 before the merge? I can see the same physical
range can be hinted multiple times, but the total hint number is the
same: both are 2 - in your current implementation, we hint once for
each of the 2 order-9 pages; alternatively, we can provide a hint for one
order-9 page and the merged order-10 page. I think the cost of
hypercalls are the same? Is it that we want to ease the host side
madvise(DONTNEED) since we can avoid operating the same range multiple
times?

The reason I asked is, if we can move the arch_merge_page() after
the done_merging tag, we can theoretically make fewer function calls on the free
path for the guest. Maybe not a big deal, I don't know...

>> I think the proper place is after the done_merging tag.
>>
>> BTW, with arch_merge_page() at the proper place, I don't think patch3/4
>> is necessary - any freed page will go through merge anyway, we won't
>> lose any hint opportunity. Or do I miss anything?
>
> You can refer to my comment above. What I want to avoid is us hinting a
> page multiple times if we aren't using MAX_ORDER - 1 as the limit. What

Yeah that's a good point. But is this going to happen?

> I am avoiding by placing this where I did is us doing a hint on orders
> greater than our target hint order. So with this way I only perform one
> hint per 2MB page, otherwise I would be performing multiple hints per
> 2MB page as every order above that would also trigger hints.
>

2019-02-12 17:46:27

by Alexander Duyck

Subject: Re: [RFC PATCH 4/4] mm: Add merge page notifier

On Tue, 2019-02-12 at 10:09 +0800, Aaron Lu wrote:
> On 2019/2/11 23:58, Alexander Duyck wrote:
> > On Mon, 2019-02-11 at 14:40 +0800, Aaron Lu wrote:
> > > On 2019/2/5 2:15, Alexander Duyck wrote:
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > Because the implementation was limiting itself to only providing hints on
> > > > pages huge TLB order sized or larger we introduced the possibility for free
> > > > pages to slip past us because they are freed as something less than
> > > > huge TLB in size and aggregated with buddies later.
> > > >
> > > > To address that I am adding a new call arch_merge_page which is called
> > > > after __free_one_page has merged a pair of pages to create a higher order
> > > > page. By doing this I am able to fill the gap and provide full coverage for
> > > > all of the pages huge TLB order or larger.
> > > >
> > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > ---
> > > > arch/x86/include/asm/page.h | 12 ++++++++++++
> > > > arch/x86/kernel/kvm.c | 28 ++++++++++++++++++++++++++++
> > > > include/linux/gfp.h | 4 ++++
> > > > mm/page_alloc.c | 2 ++
> > > > 4 files changed, 46 insertions(+)
> > > >
> > > > diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
> > > > index 4487ad7a3385..9540a97c9997 100644
> > > > --- a/arch/x86/include/asm/page.h
> > > > +++ b/arch/x86/include/asm/page.h
> > > > @@ -29,6 +29,18 @@ static inline void arch_free_page(struct page *page, unsigned int order)
> > > > if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > > __arch_free_page(page, order);
> > > > }
> > > > +
> > > > +struct zone;
> > > > +
> > > > +#define HAVE_ARCH_MERGE_PAGE
> > > > +void __arch_merge_page(struct zone *zone, struct page *page,
> > > > + unsigned int order);
> > > > +static inline void arch_merge_page(struct zone *zone, struct page *page,
> > > > + unsigned int order)
> > > > +{
> > > > + if (static_branch_unlikely(&pv_free_page_hint_enabled))
> > > > + __arch_merge_page(zone, page, order);
> > > > +}
> > > > #endif
> > > >
> > > > #include <linux/range.h>
> > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > > index 09c91641c36c..957bb4f427bb 100644
> > > > --- a/arch/x86/kernel/kvm.c
> > > > +++ b/arch/x86/kernel/kvm.c
> > > > @@ -785,6 +785,34 @@ void __arch_free_page(struct page *page, unsigned int order)
> > > > PAGE_SIZE << order);
> > > > }
> > > >
> > > > +void __arch_merge_page(struct zone *zone, struct page *page,
> > > > + unsigned int order)
> > > > +{
> > > > + /*
> > > > + * The merging logic has merged a set of buddies up to the
> > > > + * KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER. Since that is the case, take
> > > > + * advantage of this moment to notify the hypervisor of the free
> > > > + * memory.
> > > > + */
> > > > + if (order != KVM_PV_UNUSED_PAGE_HINT_MIN_ORDER)
> > > > + return;
> > > > +
> > > > + /*
> > > > + * Drop zone lock while processing the hypercall. This
> > > > + * should be safe as the page has not yet been added
> > > > + * to the buddy list and all the pages that
> > > > + * were merged have had their buddy/guard flags cleared
> > > > + * and their order reset to 0.
> > > > + */
> > > > + spin_unlock(&zone->lock);
> > > > +
> > > > + kvm_hypercall2(KVM_HC_UNUSED_PAGE_HINT, page_to_phys(page),
> > > > + PAGE_SIZE << order);
> > > > +
> > > > + /* reacquire lock and resume freeing memory */
> > > > + spin_lock(&zone->lock);
> > > > +}
> > > > +
> > > > #ifdef CONFIG_PARAVIRT_SPINLOCKS
> > > >
> > > > /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
> > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index fdab7de7490d..4746d5560193 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -459,6 +459,10 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > > > #ifndef HAVE_ARCH_FREE_PAGE
> > > > static inline void arch_free_page(struct page *page, int order) { }
> > > > #endif
> > > > +#ifndef HAVE_ARCH_MERGE_PAGE
> > > > +static inline void
> > > > +arch_merge_page(struct zone *zone, struct page *page, int order) { }
> > > > +#endif
> > > > #ifndef HAVE_ARCH_ALLOC_PAGE
> > > > static inline void arch_alloc_page(struct page *page, int order) { }
> > > > #endif
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index c954f8c1fbc4..7a1309b0b7c5 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -913,6 +913,8 @@ static inline void __free_one_page(struct page *page,
> > > > page = page + (combined_pfn - pfn);
> > > > pfn = combined_pfn;
> > > > order++;
> > > > +
> > > > + arch_merge_page(zone, page, order);
> > >
> > > Not a proper place AFAICS.
> > >
> > > Assume we have an order-8 page being sent here for merge and its order-8
> > > buddy is also free, then order++ became 9 and arch_merge_page() will do
> > > the hint to host on this page as an order-9 page, no problem so far.
> > > Then the next round, assume the now order-9 page's buddy is also free,
> > > order++ will become 10 and arch_merge_page() will again hint to host on
> > > this page as an order-10 page. The first hint to host became redundant.
> >
> > Actually the problem is even worse the other way around. My concern was
> > pages being incrementally freed.
> >
> > With this setup I can catch when we have crossed the threshold from
> > order 8 to 9, and specifically for that case provide the hint. This
> > allows me to ignore orders above and below 9.
>
> OK, I see, you are now only hinting for pages with order 9, not above.

Right.

> > If I move the hint to the spot after the merging I have no way of
> > telling if I have hinted the page as a lower order or not. As such I
> > will hint if it is merged up to orders 9 or greater. So for example if
> > it merges up to order 9 and stops there then done_merging will report
> > an order 9 page, then if another page is freed and merged with this up
> > to order 10 you would be hinting on order 10. By placing the function
> > here I can guarantee that no more than 1 hint is provided per 2MB page.
>
> So what's the downside of hinting the page as order-10 after merge
> compared to as order-9 before the merge? I can see the same physical
> range can be hinted multiple times, but the total hint number is the
> same: both are 2 - in your current implementation, we hint once for
> each of the 2 order-9 pages; alternatively, we can provide a hint for one
> order-9 page and the merged order-10 page. I think the cost of
> hypercalls is the same? Is it that we want to ease the host side
> madvise(DONTNEED) since we can avoid operating the same range multiple
> times?

The cost for the hypercall overhead is the same, but I would think you
are in the hypercall a bit longer for the order 10 page because you are
having to process both order-9 pages in order to clear them. In my mind
doing it that way you end up having to do 50% more madvise work. For a
THP based setup it probably isn't an issue, but I would think if we are
having to invalidate things at the 4K page level that cost could add up
real quick.

I could probably try launching the guest with THP disabled in QEMU to
verify if the difference is visible or not.

> The reason I asked is, if we can move the arch_merge_page() after
> the done_merging tag, we can theoretically make fewer function calls on the free
> path for the guest. Maybe not a big deal, I don't know...

I suspect it really isn't that big a deal. The two functions are
essentially inline and only one will ever make use of the hypercall.

> > > I think the proper place is after the done_merging tag.
> > >
> > > BTW, with arch_merge_page() at the proper place, I don't think patch3/4
> > > is necessary - any freed page will go through merge anyway, we won't
> > > lose any hint opportunity. Or do I miss anything?
> >
> > You can refer to my comment above. What I want to avoid is us hinting a
> > page multiple times if we aren't using MAX_ORDER - 1 as the limit. What
>
> Yeah that's a good point. But is this going to happen?

One of the advantages I have from splitting things out the way I did is
that I have been able to add some debug counters to track what is freed
as a higher order page and what isn't. From what I am seeing after boot
essentially all of the calls are coming from the merge logic.

I'm suspecting the typical use case is that pages are likely going to
either be freed as something THP or larger, or they will be freed in 4K
increments. By splitting things up the way I did we end up getting the
most efficient performance out of the 4K case since we avoid performing
madvise 1.5 times per page and keep it to once per page.