2021-05-19 03:52:58

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

Previously, when a protected VM was rebooted or when it was shut down,
its memory was made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
considering the overhead of the various Ultravisor Calls (UVCs). This
means that a reboot or a shutdown would take a potentially long amount
of time, depending on the amount of used memory.

This patchseries implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
background, allowing the guest to restart or terminate significantly
faster than before.

There are 2 possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)

For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread went
through the whole address space, the protected VM is actually
destroyed.

For the shutdown case, a list of pages to be destroyed is formed when
the mm is torn down. Instead of just unmapping the pages when the
address space is being torn down, they are also set aside. Later when
KVM cleans up the VM, a thread is started to clean up the pages from
the list.

This means that the same address space can have memory belonging to
more than one protected guest, although only one will be running, the
others will in fact not even have any CPUs.

Claudio Imbrenda (11):
KVM: s390: pv: leak the ASCE page when destroy fails
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: lazy destroy for reboot
KVM: s390: pv: extend lazy destroy to handle shutdown
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: add support for UV feature bits

arch/s390/boot/uv.c | 1 +
arch/s390/include/asm/gmap.h | 5 +-
arch/s390/include/asm/mmu.h | 3 +
arch/s390/include/asm/mmu_context.h | 2 +
arch/s390/include/asm/pgtable.h | 16 +-
arch/s390/include/asm/uv.h | 35 ++++-
arch/s390/kernel/uv.c | 133 +++++++++++++++-
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 230 ++++++++++++++++++++++++++--
arch/s390/mm/fault.c | 22 ++-
arch/s390/mm/gmap.c | 86 +++++++----
12 files changed, 490 insertions(+), 51 deletions(-)

--
2.31.1



2021-05-19 03:53:09

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot

Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clean up the address
space from protected pages.

This patch implements a lazy destroy mechanism, that allows a protected
guest to reboot significantly faster than previously.

This is achieved by clearing the pages of the old guest in background.
In case of reboot, the new guest will be able to run in the same
address space almost immediately.

The old protected guest is then only destroyed when all of its memory has
been destroyed or otherwise made non protected.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 118 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 120 insertions(+), 6 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2f09e9d7dc95..db25aa1fb6a6 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2248,7 +2248,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)

r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
if (r)
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);

/* we need to block service interrupts from now on */
set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2267,7 +2267,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
*/
if (r)
break;
- r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
+ r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);

/* no need to block service interrupts any more */
clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2796,7 +2796,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
* complaining we do not use kvm_s390_pv_is_protected.
*/
if (kvm_s390_pv_get_handle(kvm))
- kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
debug_unregister(kvm->arch.dbf);
free_page((unsigned long)kvm->arch.sie_page2);
if (!kvm_is_ucontrol(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 79dcd647b378..b3c0796a3cc1 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 59039b8a7be7..9a3547966e18 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -14,8 +14,17 @@
#include <asm/mman.h>
#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
+#include <linux/kthread.h>
#include "kvm-s390.h"

+struct deferred_priv {
+ struct mm_struct *mm;
+ unsigned long old_table;
+ u64 handle;
+ void *virt;
+ unsigned long real;
+};
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc = 0;
@@ -202,7 +211,7 @@ static int kvm_s390_pv_replace_asce(struct kvm *kvm)
}

/* this should not fail, but if it does, we must not free the donated memory */
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
+static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;

@@ -230,6 +239,111 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}

+static int kvm_s390_pv_destroy_vm_thread(void *priv)
+{
+ struct deferred_priv *p = priv;
+ u16 rc, rrc;
+ int r = 1;
+
+ /* Exit early if we end up being the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ mmput(p->mm);
+
+ r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+ WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+ if (r)
+ return r;
+ atomic_dec(&p->mm->context.is_protected);
+
+ /*
+ * Intentional leak in case the destroy secure VM call fails. The
+ * call should never fail if the hardware is not broken.
+ */
+ free_pages(p->real, get_order(uv_info.guest_base_stor_len));
+ free_pages(p->old_table, CRST_ALLOC_ORDER);
+ vfree(p->virt);
+ kfree(p);
+ return 0;
+}
+
+static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
+{
+ struct task_struct *t;
+
+ priv->virt = kvm->arch.pv.stor_var;
+ priv->real = kvm->arch.pv.stor_base;
+ priv->handle = kvm_s390_pv_get_handle(kvm);
+ priv->old_table = (unsigned long)kvm->arch.gmap->table;
+ WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+
+ if (kvm_s390_pv_replace_asce(kvm))
+ goto fail;
+
+ t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
+ "kvm_s390_pv_destroy_vm_thread");
+ if (IS_ERR_OR_NULL(t))
+ goto fail;
+
+ memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
+ wake_up_process(t);
+ /*
+ * no actual UVC is performed at this point, just return a successful
+ * rc value to make userspace happy, and an arbitrary rrc
+ */
+ *rc = 1;
+ *rrc = 42;
+
+ return 0;
+
+fail:
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+}
+
+/* Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ unsigned long lim;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ slots = kvm_memslots(kvm);
+ kvm_for_each_memslot(slot, slots) {
+ if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
+ continue;
+ if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
+ lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+ else
+ lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *priv;
+
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+ if (!priv)
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
+ if (mmget_not_zero(kvm->mm)) {
+ kvm_s390_clear_2g(kvm);
+ } else {
+ /* No deferred work to do */
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+ }
+ priv->mm = kvm->mm;
+ return deferred_destroy(kvm, priv, rc, rrc);
+}
+
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct uv_cb_cgc uvcb = {
@@ -263,7 +377,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
} else {
atomic_dec(&kvm->mm->context.is_protected);
kvm_s390_pv_dealloc_vm(kvm);
--
2.31.1


2021-05-19 04:17:26

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v1 09/11] KVM: s390: pv: extend lazy destroy to handle shutdown

Extend the lazy destroy infrastructure to handle almost all kinds of
exit of the userspace process. The only case not handled is if the
process is killed by the OOM, in which case the old behaviour will
still be in effect.

Add the uv_destroy_page_lazy function to set aside pages when unmapping
them during mm teardown; the pages will be processed and freed when the
protected VM is torn down.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/mmu.h | 3 ++
arch/s390/include/asm/mmu_context.h | 2 ++
arch/s390/include/asm/pgtable.h | 9 ++++--
arch/s390/include/asm/uv.h | 15 +++++++++
arch/s390/kernel/uv.c | 47 +++++++++++++++++++++++++++++
arch/s390/kvm/pv.c | 34 +++++++++++++++++----
6 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index e12ff0f29d1a..e2250dbe3d9d 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -18,6 +18,7 @@ typedef struct {
unsigned long vdso_base;
/* The mmu context belongs to a secure guest. */
atomic_t is_protected;
+ struct list_head deferred_list;
/*
* The following bitfields need a down_write on the mm
* semaphore when they are written to. As they are only
@@ -34,6 +35,8 @@ typedef struct {
unsigned int uses_cmm:1;
/* The gmaps associated with this context are allowed to use huge pages. */
unsigned int allow_gmap_hpage_1m:1;
+ /* The mmu context should be destroyed synchronously */
+ unsigned int pv_sync_destroy:1;
} mm_context_t;

#define INIT_MM_CONTEXT(name) \
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index e7cffc7b5c2f..da62140404e8 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -27,8 +27,10 @@ static inline int init_new_context(struct task_struct *tsk,
cpumask_clear(&mm->context.cpu_attach_mask);
atomic_set(&mm->context.flush_count, 0);
atomic_set(&mm->context.is_protected, 0);
+ mm->context.pv_sync_destroy = 0;
mm->context.gmap_asce = 0;
mm->context.flush_mm = 0;
+ INIT_LIST_HEAD(&mm->context.deferred_list);
#ifdef CONFIG_PGSTE
mm->context.alloc_pgste = page_table_allocate_pgste ||
test_thread_flag(TIF_PGSTE) ||
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 141a8aaacd6d..05c68d3c2256 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1106,9 +1106,14 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+
/* At this point the reference through the mapping is still present */
- if (mm_is_protected(mm) && pte_present(res))
- uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ if (pte_present(res) && mm_is_protected(mm)) {
+ if (full && !mm->context.pv_sync_destroy)
+ uv_destroy_page_lazy(mm, pte_val(res) & PAGE_MASK);
+ else
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ }
return res;
}

diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index c6fe6a42e79b..1de5ff5e192a 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -339,6 +339,15 @@ static inline int uv_remove_shared(unsigned long addr) { return 0; }
#if IS_ENABLED(CONFIG_KVM)
extern int prot_virt_host;

+struct destroy_page_lazy {
+ struct list_head list;
+ unsigned short count;
+ unsigned long pfns[];
+};
+
+/* This guarantees that up to PV_MAX_LAZY_COUNT can fit in a page */
+#define PV_MAX_LAZY_COUNT ((PAGE_SIZE - sizeof(struct destroy_page_lazy)) / sizeof(long))
+
static inline int is_prot_virt_host(void)
{
return prot_virt_host;
@@ -347,6 +356,7 @@ static inline int is_prot_virt_host(void)
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
@@ -363,6 +373,11 @@ static inline int uv_destroy_owned_page(unsigned long paddr)
return 0;
}

+static inline int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+ return 0;
+}
+
static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index dbcf4434eb53..434d81baceed 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -192,6 +192,53 @@ int uv_convert_owned_from_secure(unsigned long paddr)
return rc;
}

+/*
+ * Set aside the given page and put it in the list of pages to be cleared in
+ * background. The caller must already hold a reference to the page.
+ */
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+ struct list_head *head = &mm->context.deferred_list;
+ struct destroy_page_lazy *lazy;
+ struct page *page;
+ int rc;
+
+ /* get an extra reference here */
+ get_page(phys_to_page(paddr));
+
+ lazy = list_first_entry(head, struct destroy_page_lazy, list);
+ /*
+ * We need a fresh page to store more pointers. The current page
+ * might be shared, so it cannot be used directly. Instead, make it
+ * accessible and release it, and let the normal unmap code free it
+ * later, if needed.
+ * Afterwards, try to allocate a new page, but not very hard. If the
+ * allocation fails, we simply return. The next call to this
+ * function will attempt to do the same again, until enough pages
+ * have been freed.
+ */
+ if (list_empty(head) || lazy->count >= PV_MAX_LAZY_COUNT) {
+ rc = uv_convert_owned_from_secure(paddr);
+ /* in case of failure, we intentionally leak the page */
+ if (rc)
+ return rc;
+ /* release the extra reference */
+ put_page(phys_to_page(paddr));
+
+ /* try to allocate a new page quickly, but allow failures */
+ page = alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!page)
+ return -ENOMEM;
+ lazy = page_to_virt(page);
+ lazy->count = 0;
+ list_add(&lazy->list, head);
+ return 0;
+ }
+ /* the array of pointers has space, just add this entry */
+ lazy->pfns[lazy->count++] = phys_to_pfn(paddr);
+ return 0;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 9a3547966e18..4333d3e54ef0 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -19,6 +19,7 @@

struct deferred_priv {
struct mm_struct *mm;
+ bool has_mm;
unsigned long old_table;
u64 handle;
void *virt;
@@ -241,13 +242,29 @@ static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)

static int kvm_s390_pv_destroy_vm_thread(void *priv)
{
+ struct destroy_page_lazy *lazy, *next;
struct deferred_priv *p = priv;
u16 rc, rrc;
- int r = 1;
+ int r;

- /* Exit early if we end up being the only users of the mm */
- s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
- mmput(p->mm);
+ list_for_each_entry_safe(lazy, next, &p->mm->context.deferred_list, list) {
+ list_del(&lazy->list);
+ s390_uv_destroy_pfns(lazy->count, lazy->pfns);
+ free_page(__pa(lazy));
+ }
+
+ if (p->has_mm) {
+ /* Exit early if we end up being the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ if (atomic_read(&p->mm->mm_users) == 1) {
+ mmap_write_lock(p->mm);
+ /* destroy synchronously if there are no other users */
+ p->mm->context.pv_sync_destroy = 1;
+ mmap_write_unlock(p->mm);
+ }
+ mmput(p->mm);
+ }
+ mmdrop(p->mm);

r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
@@ -276,7 +293,9 @@ static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc
priv->old_table = (unsigned long)kvm->arch.gmap->table;
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);

- if (kvm_s390_pv_replace_asce(kvm))
+ if (!priv->has_mm)
+ kvm_s390_pv_remove_old_asce(kvm);
+ else if (kvm_s390_pv_replace_asce(kvm))
goto fail;

t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
@@ -333,10 +352,13 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
if (!priv)
return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);

+ mmgrab(kvm->mm);
if (mmget_not_zero(kvm->mm)) {
+ priv->has_mm = true;
kvm_s390_clear_2g(kvm);
- } else {
+ } else if (list_empty(&kvm->mm->context.deferred_list)) {
/* No deferred work to do */
+ mmdrop(kvm->mm);
kfree(priv);
return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
}
--
2.31.1


2021-05-19 04:17:29

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy

Add the module parameter "lazy_destroy", to allow the lazy destroy
mechanism to be switched off. This might be useful for debugging
purposes.

The parameter is enabled by default.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/pv.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 4333d3e54ef0..00c14406205f 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -26,6 +26,10 @@ struct deferred_priv {
unsigned long real;
};

+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc = 0;
@@ -348,6 +352,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct deferred_priv *priv;

+ if (!lazy_destroy)
+ kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
if (!priv)
return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
@@ -396,6 +403,12 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
/* Outputs */
kvm->arch.pv.handle = uvcb.guest_handle;

+ if (!lazy_destroy) {
+ mmap_write_lock(kvm->mm);
+ kvm->mm->context.pv_sync_destroy = 1;
+ mmap_write_unlock(kvm->mm);
+ }
+
atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
--
2.31.1


2021-05-19 04:17:30

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v1 11/11] KVM: s390: pv: add support for UV feature bits

Add support for Ultravisor feature bits, and take advantage of the
functionality advertised to speed up the lazy destroy mechanism.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/boot/uv.c | 1 +
arch/s390/include/asm/uv.h | 9 ++++++++-
arch/s390/kernel/uv.c | 3 ++-
3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/s390/boot/uv.c b/arch/s390/boot/uv.c
index 87641dd65ccf..efdb55364919 100644
--- a/arch/s390/boot/uv.c
+++ b/arch/s390/boot/uv.c
@@ -36,6 +36,7 @@ void uv_query_info(void)
uv_info.max_sec_stor_addr = ALIGN(uvcb.max_guest_stor_addr, PAGE_SIZE);
uv_info.max_num_sec_conf = uvcb.max_num_sec_conf;
uv_info.max_guest_cpu_id = uvcb.max_guest_cpu_id;
+ uv_info.feature_bits = uvcb.uv_feature_bits;
}

#ifdef CONFIG_PROTECTED_VIRTUALIZATION_GUEST
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 1de5ff5e192a..808c2f0e0d7c 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -73,6 +73,11 @@ enum uv_cmds_inst {
BIT_UVC_CMD_UNPIN_PAGE_SHARED = 22,
};

+/* Bits in uv features field */
+enum uv_features {
+ BIT_UVC_FEAT_MISC_0 = 0,
+};
+
struct uv_cb_header {
u16 len;
u16 cmd; /* Command Code */
@@ -97,7 +102,8 @@ struct uv_cb_qui {
u64 max_guest_stor_addr;
u8 reserved88[158 - 136];
u16 max_guest_cpu_id;
- u8 reserveda0[200 - 160];
+ u64 uv_feature_bits;
+ u8 reserveda0[200 - 168];
} __packed __aligned(8);

/* Initialize Ultravisor */
@@ -274,6 +280,7 @@ struct uv_info {
unsigned long max_sec_stor_addr;
unsigned int max_num_sec_conf;
unsigned short max_guest_cpu_id;
+ unsigned long feature_bits;
};

extern struct uv_info uv_info;
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 434d81baceed..b1fa38d5c388 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -291,7 +291,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,

static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
{
- return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ return !test_bit_inv(BIT_UVC_FEAT_MISC_0, &uv_info.feature_bits) &&
+ uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
atomic_read(&mm->context.is_protected) > 1;
}

--
2.31.1


2021-05-19 18:19:25

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Mon, 17 May 2021 22:07:47 +0200
Claudio Imbrenda <[email protected]> wrote:

> Previously, when a protected VM was rebooted or when it was shut down,
> its memory was made unprotected, and then the protected VM itself was
> destroyed. Looping over the whole address space can take some time,
> considering the overhead of the various Ultravisor Calls (UVCs). This
> means that a reboot or a shutdown would take a potentially long amount
> of time, depending on the amount of used memory.
>
> This patchseries implements a deferred destroy mechanism for protected
> guests. When a protected guest is destroyed, its memory is cleared in
> background, allowing the guest to restart or terminate significantly
> faster than before.
>
> There are 2 possibilities when a protected VM is torn down:
> * it still has an address space associated (reboot case)
> * it does not have an address space anymore (shutdown case)
>
> For the reboot case, the reference count of the mm is increased, and
> then a background thread is started to clean up. Once the thread went
> through the whole address space, the protected VM is actually
> destroyed.
>
> For the shutdown case, a list of pages to be destroyed is formed when
> the mm is torn down. Instead of just unmapping the pages when the
> address space is being torn down, they are also set aside. Later when
> KVM cleans up the VM, a thread is started to clean up the pages from
> the list.

Just to make sure, 'clean up' includes doing uv calls?

>
> This means that the same address space can have memory belonging to
> more than one protected guest, although only one will be running, the
> others will in fact not even have any CPUs.

Are those set-aside-but-not-yet-cleaned-up pages still possibly
accessible in any way? I would assume that they only belong to the
'zombie' guests, and any new or rebooted guest is a new entity that
needs to get new pages?

Can too many not-yet-cleaned-up pages lead to a (temporary) memory
exhaustion?


2021-05-19 18:20:58

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 17:05:37 +0200
Cornelia Huck <[email protected]> wrote:

> On Mon, 17 May 2021 22:07:47 +0200
> Claudio Imbrenda <[email protected]> wrote:
>
> > Previously, when a protected VM was rebooted or when it was shut
> > down, its memory was made unprotected, and then the protected VM
> > itself was destroyed. Looping over the whole address space can take
> > some time, considering the overhead of the various Ultravisor Calls
> > (UVCs). This means that a reboot or a shutdown would take a
> > potentially long amount of time, depending on the amount of used
> > memory.
> >
> > This patchseries implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in background, allowing the guest to restart or
> > terminate significantly faster than before.
> >
> > There are 2 possibilities when a protected VM is torn down:
> > * it still has an address space associated (reboot case)
> > * it does not have an address space anymore (shutdown case)
> >
> > For the reboot case, the reference count of the mm is increased, and
> > then a background thread is started to clean up. Once the thread
> > went through the whole address space, the protected VM is actually
> > destroyed.
> >
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just unmapping the pages when
> > the address space is being torn down, they are also set aside.
> > Later when KVM cleans up the VM, a thread is started to clean up
> > the pages from the list.
>
> Just to make sure, 'clean up' includes doing uv calls?

yes

> >
> > This means that the same address space can have memory belonging to
> > more than one protected guest, although only one will be running,
> > the others will in fact not even have any CPUs.
>
> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> accessible in any way? I would assume that they only belong to the

in case of reboot: yes, they are still in the address space of the
guest, and can be swapped if needed

> 'zombie' guests, and any new or rebooted guest is a new entity that
> needs to get new pages?

the rebooted guest (normal or secure) will re-use the same pages of the
old guest (before or after cleanup, which is the reason of patches 3
and 4)

the KVM guest is not affected in case of reboot, so the userspace
address space is not touched.

> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> exhaustion?

in case of reboot, not much; the pages were in use are still in use
after the reboot, and they can be swapped.

in case of a shutdown, yes, because the pages are really taken aside
and cleared/destroyed in background. they cannot be swapped. they are
freed immediately as they are processed, to try to mitigate memory
exhaustion scenarios.

in the end, this patchseries is a tradeoff between speed and memory
consumption. the memory needs to be cleared up at some point, and that
requires time.

in cases where this might be an issue, I introduced a new KVM flag to
disable lazy destroy (patch 10)



2021-05-19 18:22:42

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy



On 18.05.21 18:13, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:45:18 +0200
> Christian Borntraeger <[email protected]> wrote:
>
>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <[email protected]> wrote:
>>>
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <[email protected]> wrote:
>>>>
>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>> down, its memory was made unprotected, and then the protected VM
>>>>> itself was destroyed. Looping over the whole address space can
>>>>> take some time, considering the overhead of the various
>>>>> Ultravisor Calls (UVCs). This means that a reboot or a shutdown
>>>>> would take a potentially long amount of time, depending on the
>>>>> amount of used memory.
>>>>>
>>>>> This patchseries implements a deferred destroy mechanism for
>>>>> protected guests. When a protected guest is destroyed, its memory
>>>>> is cleared in background, allowing the guest to restart or
>>>>> terminate significantly faster than before.
>>>>>
>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>> * it still has an address space associated (reboot case)
>>>>> * it does not have an address space anymore (shutdown case)
>>>>>
>>>>> For the reboot case, the reference count of the mm is increased,
>>>>> and then a background thread is started to clean up. Once the
>>>>> thread went through the whole address space, the protected VM is
>>>>> actually destroyed.
>>>>>
>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>> when the mm is torn down. Instead of just unmapping the pages when
>>>>> the address space is being torn down, they are also set aside.
>>>>> Later when KVM cleans up the VM, a thread is started to clean up
>>>>> the pages from the list.
>>>>
>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>
>>> yes
>>>
>>>>>
>>>>> This means that the same address space can have memory belonging
>>>>> to more than one protected guest, although only one will be
>>>>> running, the others will in fact not even have any CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to the
>>>>
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>
>>>> 'zombie' guests, and any new or rebooted guest is a new entity that
>>>> needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>>
>>>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>>>> exhaustion?
>>>
>>> in case of reboot, not much; the pages were in use are still in use
>>> after the reboot, and they can be swapped.
>>>
>>> in case of a shutdown, yes, because the pages are really taken aside
>>> and cleared/destroyed in background. they cannot be swapped. they
>>> are freed immediately as they are processed, to try to mitigate
>>> memory exhaustion scenarios.
>>>
>>> in the end, this patchseries is a tradeoff between speed and memory
>>> consumption. the memory needs to be cleared up at some point, and
>>> that requires time.
>>>
>>> in cases where this might be an issue, I introduced a new KVM flag
>>> to disable lazy destroy (patch 10)
>>
>> Maybe we could piggy-back on the OOM-kill notifier and then fall back
>> to synchronous freeing for some pages?
>
> I'm not sure I follow
>
> once the pages have been set aside, it's too late
>
> while the pages are being set aside, every now and then some memory
> needs to be allocated. the allocation is atomic, not allowed to use
> emergency reserves, and can fail without warning. if the allocation
> fails, we clean up one page and continue, without setting aside
> anything (patch 9)
>
> so if the system is low on memory, the lazy destroy should not make the
> situation too much worse.
>
> the only issue here is starting a normal process in the host (maybe
> a non secure guest) that uses a lot of memory very quickly, right after
> a large secure guest has terminated.

I think page cache page allocations do not need to be atomic.
In that case the kernel might stil l decide to trigger the oom killer. We can
let it notify ourselves free 256 pages synchronously and avoid the oom kill.
Have a look at the virtio-balloon virtio_balloon_oom_notify

2021-05-19 18:23:07

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 17:36:24 +0200
Claudio Imbrenda <[email protected]> wrote:

> On Tue, 18 May 2021 17:05:37 +0200
> Cornelia Huck <[email protected]> wrote:
>
> > On Mon, 17 May 2021 22:07:47 +0200
> > Claudio Imbrenda <[email protected]> wrote:

> > > This means that the same address space can have memory belonging to
> > > more than one protected guest, although only one will be running,
> > > the others will in fact not even have any CPUs.
> >
> > Are those set-aside-but-not-yet-cleaned-up pages still possibly
> > accessible in any way? I would assume that they only belong to the
>
> in case of reboot: yes, they are still in the address space of the
> guest, and can be swapped if needed
>
> > 'zombie' guests, and any new or rebooted guest is a new entity that
> > needs to get new pages?
>
> the rebooted guest (normal or secure) will re-use the same pages of the
> old guest (before or after cleanup, which is the reason of patches 3
> and 4)

Took a look at those patches, makes sense.

>
> the KVM guest is not affected in case of reboot, so the userspace
> address space is not touched.

'guest' is a bit ambiguous here -- do you mean the vm here, and the
actual guest above?


2021-05-19 18:23:11

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy



On 18.05.21 17:36, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 17:05:37 +0200
> Cornelia Huck <[email protected]> wrote:
>
>> On Mon, 17 May 2021 22:07:47 +0200
>> Claudio Imbrenda <[email protected]> wrote:
>>
>>> Previously, when a protected VM was rebooted or when it was shut
>>> down, its memory was made unprotected, and then the protected VM
>>> itself was destroyed. Looping over the whole address space can take
>>> some time, considering the overhead of the various Ultravisor Calls
>>> (UVCs). This means that a reboot or a shutdown would take a
>>> potentially long amount of time, depending on the amount of used
>>> memory.
>>>
>>> This patchseries implements a deferred destroy mechanism for
>>> protected guests. When a protected guest is destroyed, its memory
>>> is cleared in background, allowing the guest to restart or
>>> terminate significantly faster than before.
>>>
>>> There are 2 possibilities when a protected VM is torn down:
>>> * it still has an address space associated (reboot case)
>>> * it does not have an address space anymore (shutdown case)
>>>
>>> For the reboot case, the reference count of the mm is increased, and
>>> then a background thread is started to clean up. Once the thread
>>> went through the whole address space, the protected VM is actually
>>> destroyed.
>>>
>>> For the shutdown case, a list of pages to be destroyed is formed
>>> when the mm is torn down. Instead of just unmapping the pages when
>>> the address space is being torn down, they are also set aside.
>>> Later when KVM cleans up the VM, a thread is started to clean up
>>> the pages from the list.
>>
>> Just to make sure, 'clean up' includes doing uv calls?
>
> yes
>
>>>
>>> This means that the same address space can have memory belonging to
>>> more than one protected guest, although only one will be running,
>>> the others will in fact not even have any CPUs.
>>
>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>> accessible in any way? I would assume that they only belong to the
>
> in case of reboot: yes, they are still in the address space of the
> guest, and can be swapped if needed
>
>> 'zombie' guests, and any new or rebooted guest is a new entity that
>> needs to get new pages?
>
> the rebooted guest (normal or secure) will re-use the same pages of the
> old guest (before or after cleanup, which is the reason of patches 3
> and 4)
>
> the KVM guest is not affected in case of reboot, so the userspace
> address space is not touched.
>
>> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
>> exhaustion?
>
> in case of reboot, not much; the pages were in use are still in use
> after the reboot, and they can be swapped.
>
> in case of a shutdown, yes, because the pages are really taken aside
> and cleared/destroyed in background. they cannot be swapped. they are
> freed immediately as they are processed, to try to mitigate memory
> exhaustion scenarios.
>
> in the end, this patchseries is a tradeoff between speed and memory
> consumption. the memory needs to be cleared up at some point, and that
> requires time.
>
> in cases where this might be an issue, I introduced a new KVM flag to
> disable lazy destroy (patch 10)

Maybe we could piggy-back on the OOM-kill notifier and then fall back to
synchronous freeing for some pages?

2021-05-19 18:23:37

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 18:22:42 +0200
David Hildenbrand <[email protected]> wrote:

> On 18.05.21 18:19, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 18:04:11 +0200
> > Cornelia Huck <[email protected]> wrote:
> >
> >> On Tue, 18 May 2021 17:36:24 +0200
> >> Claudio Imbrenda <[email protected]> wrote:
> >>
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <[email protected]> wrote:
> >>>
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <[email protected]> wrote:
> >>
> >>>>> This means that the same address space can have memory
> >>>>> belonging to more than one protected guest, although only one
> >>>>> will be running, the others will in fact not even have any
> >>>>> CPUs.
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)
> >>
> >> Took a look at those patches, makes sense.
> >>
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.
> >>
> >> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
> >> actual guest above?
> >>
> >
> > yes this is tricky, because there is the guest OS, which terminates
> > or reboots, then there is the "secure configuration" entity,
> > handled by the Ultravisor, and then the KVM VM
> >
> > when a secure guest reboots, the "secure configuration" is
> > dismantled (in this case, in a deferred way), and the KVM VM (and
> > its memory) is not directly affected
> >
> > what happened before was that the secure configuration was
> > dismantled synchronously, and then re-created.
> >
> > now instead, a new secure configuration is created using the same
> > KVM VM (and thus the same mm), before the old secure configuration
> > has been completely dismantled. hence the same KVM VM can have
> > multiple secure configurations associated, sharing the same address
> > space.
> >
> > of course, only the newest one is actually running, the other ones
> > are "zombies", without CPUs.
> >
>
> Can a guest trigger a DoS?

I don't see how

a guest can fill its memory and then reboot, and then fill its memory
again and then reboot... but that will take time, filling the memory
will itself clean up leftover pages from previous boots.

"normal" reboot loops will be fast, because there won't be much memory
to process

I have actually tested mixed reboot/shutdown loops, and the system
behaved as you would expect when under load.


2021-05-19 18:23:43

by Cornelia Huck

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <[email protected]> wrote:

> On 18.05.21 17:36, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <[email protected]> wrote:

> >> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> >> exhaustion?
> >
> > in case of reboot, not much; the pages were in use are still in use
> > after the reboot, and they can be swapped.
> >
> > in case of a shutdown, yes, because the pages are really taken aside
> > and cleared/destroyed in background. they cannot be swapped. they are
> > freed immediately as they are processed, to try to mitigate memory
> > exhaustion scenarios.
> >
> > in the end, this patchseries is a tradeoff between speed and memory
> > consumption. the memory needs to be cleared up at some point, and that
> > requires time.
> >
> > in cases where this might be an issue, I introduced a new KVM flag to
> > disable lazy destroy (patch 10)
>
> Maybe we could piggy-back on the OOM-kill notifier and then fall back to
> synchronous freeing for some pages?

Sounds like a good idea. If delayed cleanup is safe, you probably want
to have the fast shutdown behaviour.


2021-05-19 18:25:05

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 17:45:18 +0200
Christian Borntraeger <[email protected]> wrote:

> On 18.05.21 17:36, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <[email protected]> wrote:
> >
> >> On Mon, 17 May 2021 22:07:47 +0200
> >> Claudio Imbrenda <[email protected]> wrote:
> >>
> >>> Previously, when a protected VM was rebooted or when it was shut
> >>> down, its memory was made unprotected, and then the protected VM
> >>> itself was destroyed. Looping over the whole address space can
> >>> take some time, considering the overhead of the various
> >>> Ultravisor Calls (UVCs). This means that a reboot or a shutdown
> >>> would take a potentially long amount of time, depending on the
> >>> amount of used memory.
> >>>
> >>> This patchseries implements a deferred destroy mechanism for
> >>> protected guests. When a protected guest is destroyed, its memory
> >>> is cleared in background, allowing the guest to restart or
> >>> terminate significantly faster than before.
> >>>
> >>> There are 2 possibilities when a protected VM is torn down:
> >>> * it still has an address space associated (reboot case)
> >>> * it does not have an address space anymore (shutdown case)
> >>>
> >>> For the reboot case, the reference count of the mm is increased,
> >>> and then a background thread is started to clean up. Once the
> >>> thread went through the whole address space, the protected VM is
> >>> actually destroyed.
> >>>
> >>> For the shutdown case, a list of pages to be destroyed is formed
> >>> when the mm is torn down. Instead of just unmapping the pages when
> >>> the address space is being torn down, they are also set aside.
> >>> Later when KVM cleans up the VM, a thread is started to clean up
> >>> the pages from the list.
> >>
> >> Just to make sure, 'clean up' includes doing uv calls?
> >
> > yes
> >
> >>>
> >>> This means that the same address space can have memory belonging
> >>> to more than one protected guest, although only one will be
> >>> running, the others will in fact not even have any CPUs.
> >>
> >> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >> accessible in any way? I would assume that they only belong to the
> >>
> >
> > in case of reboot: yes, they are still in the address space of the
> > guest, and can be swapped if needed
> >
> >> 'zombie' guests, and any new or rebooted guest is a new entity that
> >> needs to get new pages?
> >
> > the rebooted guest (normal or secure) will re-use the same pages of
> > the old guest (before or after cleanup, which is the reason of
> > patches 3 and 4)
> >
> > the KVM guest is not affected in case of reboot, so the userspace
> > address space is not touched.
> >
> >> Can too many not-yet-cleaned-up pages lead to a (temporary) memory
> >> exhaustion?
> >
> > in case of reboot, not much; the pages were in use are still in use
> > after the reboot, and they can be swapped.
> >
> > in case of a shutdown, yes, because the pages are really taken aside
> > and cleared/destroyed in background. they cannot be swapped. they
> > are freed immediately as they are processed, to try to mitigate
> > memory exhaustion scenarios.
> >
> > in the end, this patchseries is a tradeoff between speed and memory
> > consumption. the memory needs to be cleared up at some point, and
> > that requires time.
> >
> > in cases where this might be an issue, I introduced a new KVM flag
> > to disable lazy destroy (patch 10)
>
> Maybe we could piggy-back on the OOM-kill notifier and then fall back
> to synchronous freeing for some pages?

I'm not sure I follow

once the pages have been set aside, it's too late

while the pages are being set aside, every now and then some memory
needs to be allocated. the allocation is atomic, not allowed to use
emergency reserves, and can fail without warning. if the allocation
fails, we clean up one page and continue, without setting aside
anything (patch 9)

so if the system is low on memory, the lazy destroy should not make the
situation too much worse.

the only issue here is starting a normal process in the host (maybe
a non secure guest) that uses a lot of memory very quickly, right after
a large secure guest has terminated.


2021-05-19 18:25:08

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 18:04:11 +0200
Cornelia Huck <[email protected]> wrote:

> On Tue, 18 May 2021 17:36:24 +0200
> Claudio Imbrenda <[email protected]> wrote:
>
> > On Tue, 18 May 2021 17:05:37 +0200
> > Cornelia Huck <[email protected]> wrote:
> >
> > > On Mon, 17 May 2021 22:07:47 +0200
> > > Claudio Imbrenda <[email protected]> wrote:
>
> > > > This means that the same address space can have memory
> > > > belonging to more than one protected guest, although only one
> > > > will be running, the others will in fact not even have any
> > > > CPUs.
> > >
> > > Are those set-aside-but-not-yet-cleaned-up pages still possibly
> > > accessible in any way? I would assume that they only belong to
> > > the
> >
> > in case of reboot: yes, they are still in the address space of the
> > guest, and can be swapped if needed
> >
> > > 'zombie' guests, and any new or rebooted guest is a new entity
> > > that needs to get new pages?
> >
> > the rebooted guest (normal or secure) will re-use the same pages of
> > the old guest (before or after cleanup, which is the reason of
> > patches 3 and 4)
>
> Took a look at those patches, makes sense.
>
> >
> > the KVM guest is not affected in case of reboot, so the userspace
> > address space is not touched.
>
> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
> actual guest above?
>

yes this is tricky, because there is the guest OS, which terminates or
reboots, then there is the "secure configuration" entity, handled by the
Ultravisor, and then the KVM VM

when a secure guest reboots, the "secure configuration" is dismantled
(in this case, in a deferred way), and the KVM VM (and its memory) is
not directly affected

what happened before was that the secure configuration was dismantled
synchronously, and then re-created.

now instead, a new secure configuration is created using the same KVM
VM (and thus the same mm), before the old secure configuration has been
completely dismantled. hence the same KVM VM can have multiple secure
configurations associated, sharing the same address space.

of course, only the newest one is actually running, the other ones are
"zombies", without CPUs.


2021-05-19 18:25:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On 18.05.21 18:19, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:04:11 +0200
> Cornelia Huck <[email protected]> wrote:
>
>> On Tue, 18 May 2021 17:36:24 +0200
>> Claudio Imbrenda <[email protected]> wrote:
>>
>>> On Tue, 18 May 2021 17:05:37 +0200
>>> Cornelia Huck <[email protected]> wrote:
>>>
>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>> Claudio Imbrenda <[email protected]> wrote:
>>
>>>>> This means that the same address space can have memory
>>>>> belonging to more than one protected guest, although only one
>>>>> will be running, the others will in fact not even have any
>>>>> CPUs.
>>>>
>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>> accessible in any way? I would assume that they only belong to
>>>> the
>>>
>>> in case of reboot: yes, they are still in the address space of the
>>> guest, and can be swapped if needed
>>>
>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>> that needs to get new pages?
>>>
>>> the rebooted guest (normal or secure) will re-use the same pages of
>>> the old guest (before or after cleanup, which is the reason of
>>> patches 3 and 4)
>>
>> Took a look at those patches, makes sense.
>>
>>>
>>> the KVM guest is not affected in case of reboot, so the userspace
>>> address space is not touched.
>>
>> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
>> actual guest above?
>>
>
> yes this is tricky, because there is the guest OS, which terminates or
> reboots, then there is the "secure configuration" entity, handled by the
> Ultravisor, and then the KVM VM
>
> when a secure guest reboots, the "secure configuration" is dismantled
> (in this case, in a deferred way), and the KVM VM (and its memory) is
> not directly affected
>
> what happened before was that the secure configuration was dismantled
> synchronously, and then re-created.
>
> now instead, a new secure configuration is created using the same KVM
> VM (and thus the same mm), before the old secure configuration has been
> completely dismantled. hence the same KVM VM can have multiple secure
> configurations associated, sharing the same address space.
>
> of course, only the newest one is actually running, the other ones are
> "zombies", without CPUs.
>

Can a guest trigger a DoS?

--
Thanks,

David / dhildenb


2021-05-19 18:25:38

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy



On 18.05.21 18:34, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:20:22 +0200
> Christian Borntraeger <[email protected]> wrote:
>
>> On 18.05.21 18:13, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 17:45:18 +0200
>>> Christian Borntraeger <[email protected]> wrote:
>>>
>>>> On 18.05.21 17:36, Claudio Imbrenda wrote:
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <[email protected]> wrote:
>>>>>
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <[email protected]> wrote:
>>>>>>
>>>>>>> Previously, when a protected VM was rebooted or when it was shut
>>>>>>> down, its memory was made unprotected, and then the protected VM
>>>>>>> itself was destroyed. Looping over the whole address space can
>>>>>>> take some time, considering the overhead of the various
>>>>>>> Ultravisor Calls (UVCs). This means that a reboot or a shutdown
>>>>>>> would take a potentially long amount of time, depending on the
>>>>>>> amount of used memory.
>>>>>>>
>>>>>>> This patchseries implements a deferred destroy mechanism for
>>>>>>> protected guests. When a protected guest is destroyed, its
>>>>>>> memory is cleared in background, allowing the guest to restart
>>>>>>> or terminate significantly faster than before.
>>>>>>>
>>>>>>> There are 2 possibilities when a protected VM is torn down:
>>>>>>> * it still has an address space associated (reboot case)
>>>>>>> * it does not have an address space anymore (shutdown case)
>>>>>>>
>>>>>>> For the reboot case, the reference count of the mm is increased,
>>>>>>> and then a background thread is started to clean up. Once the
>>>>>>> thread went through the whole address space, the protected VM is
>>>>>>> actually destroyed.
>>>>>>>
>>>>>>> For the shutdown case, a list of pages to be destroyed is formed
>>>>>>> when the mm is torn down. Instead of just unmapping the pages
>>>>>>> when the address space is being torn down, they are also set
>>>>>>> aside. Later when KVM cleans up the VM, a thread is started to
>>>>>>> clean up the pages from the list.
>>>>>>
>>>>>> Just to make sure, 'clean up' includes doing uv calls?
>>>>>
>>>>> yes
>>>>>
>>>>>>>
>>>>>>> This means that the same address space can have memory belonging
>>>>>>> to more than one protected guest, although only one will be
>>>>>>> running, the others will in fact not even have any CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>>
>>>>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
>>>>>> memory exhaustion?
>>>>>
>>>>> in case of reboot, not much; the pages were in use are still in
>>>>> use after the reboot, and they can be swapped.
>>>>>
>>>>> in case of a shutdown, yes, because the pages are really taken
>>>>> aside and cleared/destroyed in background. they cannot be
>>>>> swapped. they are freed immediately as they are processed, to try
>>>>> to mitigate memory exhaustion scenarios.
>>>>>
>>>>> in the end, this patchseries is a tradeoff between speed and
>>>>> memory consumption. the memory needs to be cleared up at some
>>>>> point, and that requires time.
>>>>>
>>>>> in cases where this might be an issue, I introduced a new KVM flag
>>>>> to disable lazy destroy (patch 10)
>>>>
>>>> Maybe we could piggy-back on the OOM-kill notifier and then fall
>>>> back to synchronous freeing for some pages?
>>>
>>> I'm not sure I follow
>>>
>>> once the pages have been set aside, it's too late
>>>
>>> while the pages are being set aside, every now and then some memory
>>> needs to be allocated. the allocation is atomic, not allowed to use
>>> emergency reserves, and can fail without warning. if the allocation
>>> fails, we clean up one page and continue, without setting aside
>>> anything (patch 9)
>>>
>>> so if the system is low on memory, the lazy destroy should not make
>>> the situation too much worse.
>>>
>>> the only issue here is starting a normal process in the host (maybe
>>> a non secure guest) that uses a lot of memory very quickly, right
>>> after a large secure guest has terminated.
>>
>> I think page cache page allocations do not need to be atomic.
>> In that case the kernel might stil l decide to trigger the oom
>> killer. We can let it notify ourselves free 256 pages synchronously
>> and avoid the oom kill. Have a look at the virtio-balloon
>> virtio_balloon_oom_notify
>
> the issue is that once the pages have been set aside, it's too late.
> the OOM notifier would only be useful if we get notified of the OOM
> situation _while_ setting aside the pages.
>
> unless you mean that the notifier should simply wait until the thread
> has done (some of) its work?

Exactly. Let the notifier wait until you have freed 256pages and return
256 to the oom notifier.

2021-05-19 18:26:07

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 18:55:56 +0200
Christian Borntraeger <[email protected]> wrote:

> On 18.05.21 18:31, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 18:22:42 +0200
> > David Hildenbrand <[email protected]> wrote:
> >
> >> On 18.05.21 18:19, Claudio Imbrenda wrote:
> >>> On Tue, 18 May 2021 18:04:11 +0200
> >>> Cornelia Huck <[email protected]> wrote:
> >>>
> >>>> On Tue, 18 May 2021 17:36:24 +0200
> >>>> Claudio Imbrenda <[email protected]> wrote:
> >>>>
> >>>>> On Tue, 18 May 2021 17:05:37 +0200
> >>>>> Cornelia Huck <[email protected]> wrote:
> >>>>>
> >>>>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>>>> Claudio Imbrenda <[email protected]> wrote:
> >>>>
> >>>>>>> This means that the same address space can have memory
> >>>>>>> belonging to more than one protected guest, although only one
> >>>>>>> will be running, the others will in fact not even have any
> >>>>>>> CPUs.
> >>>>>>
> >>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>>>> accessible in any way? I would assume that they only belong to
> >>>>>> the
> >>>>>
> >>>>> in case of reboot: yes, they are still in the address space of
> >>>>> the guest, and can be swapped if needed
> >>>>>
> >>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>>>> that needs to get new pages?
> >>>>>
> >>>>> the rebooted guest (normal or secure) will re-use the same pages
> >>>>> of the old guest (before or after cleanup, which is the reason
> >>>>> of patches 3 and 4)
> >>>>
> >>>> Took a look at those patches, makes sense.
> >>>>
> >>>>>
> >>>>> the KVM guest is not affected in case of reboot, so the
> >>>>> userspace address space is not touched.
> >>>>
> >>>> 'guest' is a bit ambiguous here -- do you mean the vm here, and
> >>>> the actual guest above?
> >>>>
> >>>
> >>> yes this is tricky, because there is the guest OS, which
> >>> terminates or reboots, then there is the "secure configuration"
> >>> entity, handled by the Ultravisor, and then the KVM VM
> >>>
> >>> when a secure guest reboots, the "secure configuration" is
> >>> dismantled (in this case, in a deferred way), and the KVM VM (and
> >>> its memory) is not directly affected
> >>>
> >>> what happened before was that the secure configuration was
> >>> dismantled synchronously, and then re-created.
> >>>
> >>> now instead, a new secure configuration is created using the same
> >>> KVM VM (and thus the same mm), before the old secure configuration
> >>> has been completely dismantled. hence the same KVM VM can have
> >>> multiple secure configurations associated, sharing the same
> >>> address space.
> >>>
> >>> of course, only the newest one is actually running, the other ones
> >>> are "zombies", without CPUs.
> >>>
> >>
> >> Can a guest trigger a DoS?
> >
> > I don't see how
> >
> > a guest can fill its memory and then reboot, and then fill its
> > memory again and then reboot... but that will take time, filling
> > the memory will itself clean up leftover pages from previous boots.
> >
>
> In essence this guest will then synchronously wait for the page to be
> exported and reimported, correct?

correct

> > "normal" reboot loops will be fast, because there won't be much
> > memory to process
> >
> > I have actually tested mixed reboot/shutdown loops, and the system
> > behaved as you would expect when under load.
>
> I guess the memory will continue to be accounted to the memcg?
> Correct?

for the reboot case, yes, since the mm is not directly affected.
for the shutdown case, I'm not sure.

2021-05-19 18:26:37

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy

On Tue, 18 May 2021 18:20:22 +0200
Christian Borntraeger <[email protected]> wrote:

> On 18.05.21 18:13, Claudio Imbrenda wrote:
> > On Tue, 18 May 2021 17:45:18 +0200
> > Christian Borntraeger <[email protected]> wrote:
> >
> >> On 18.05.21 17:36, Claudio Imbrenda wrote:
> >>> On Tue, 18 May 2021 17:05:37 +0200
> >>> Cornelia Huck <[email protected]> wrote:
> >>>
> >>>> On Mon, 17 May 2021 22:07:47 +0200
> >>>> Claudio Imbrenda <[email protected]> wrote:
> >>>>
> >>>>> Previously, when a protected VM was rebooted or when it was shut
> >>>>> down, its memory was made unprotected, and then the protected VM
> >>>>> itself was destroyed. Looping over the whole address space can
> >>>>> take some time, considering the overhead of the various
> >>>>> Ultravisor Calls (UVCs). This means that a reboot or a shutdown
> >>>>> would take a potentially long amount of time, depending on the
> >>>>> amount of used memory.
> >>>>>
> >>>>> This patchseries implements a deferred destroy mechanism for
> >>>>> protected guests. When a protected guest is destroyed, its
> >>>>> memory is cleared in background, allowing the guest to restart
> >>>>> or terminate significantly faster than before.
> >>>>>
> >>>>> There are 2 possibilities when a protected VM is torn down:
> >>>>> * it still has an address space associated (reboot case)
> >>>>> * it does not have an address space anymore (shutdown case)
> >>>>>
> >>>>> For the reboot case, the reference count of the mm is increased,
> >>>>> and then a background thread is started to clean up. Once the
> >>>>> thread went through the whole address space, the protected VM is
> >>>>> actually destroyed.
> >>>>>
> >>>>> For the shutdown case, a list of pages to be destroyed is formed
> >>>>> when the mm is torn down. Instead of just unmapping the pages
> >>>>> when the address space is being torn down, they are also set
> >>>>> aside. Later when KVM cleans up the VM, a thread is started to
> >>>>> clean up the pages from the list.
> >>>>
> >>>> Just to make sure, 'clean up' includes doing uv calls?
> >>>
> >>> yes
> >>>
> >>>>>
> >>>>> This means that the same address space can have memory belonging
> >>>>> to more than one protected guest, although only one will be
> >>>>> running, the others will in fact not even have any CPUs.
> >>>>
> >>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
> >>>> accessible in any way? I would assume that they only belong to
> >>>> the
> >>>
> >>> in case of reboot: yes, they are still in the address space of the
> >>> guest, and can be swapped if needed
> >>>
> >>>> 'zombie' guests, and any new or rebooted guest is a new entity
> >>>> that needs to get new pages?
> >>>
> >>> the rebooted guest (normal or secure) will re-use the same pages
> >>> of the old guest (before or after cleanup, which is the reason of
> >>> patches 3 and 4)
> >>>
> >>> the KVM guest is not affected in case of reboot, so the userspace
> >>> address space is not touched.
> >>>
> >>>> Can too many not-yet-cleaned-up pages lead to a (temporary)
> >>>> memory exhaustion?
> >>>
> >>> in case of reboot, not much; the pages were in use are still in
> >>> use after the reboot, and they can be swapped.
> >>>
> >>> in case of a shutdown, yes, because the pages are really taken
> >>> aside and cleared/destroyed in background. they cannot be
> >>> swapped. they are freed immediately as they are processed, to try
> >>> to mitigate memory exhaustion scenarios.
> >>>
> >>> in the end, this patchseries is a tradeoff between speed and
> >>> memory consumption. the memory needs to be cleared up at some
> >>> point, and that requires time.
> >>>
> >>> in cases where this might be an issue, I introduced a new KVM flag
> >>> to disable lazy destroy (patch 10)
> >>
> >> Maybe we could piggy-back on the OOM-kill notifier and then fall
> >> back to synchronous freeing for some pages?
> >
> > I'm not sure I follow
> >
> > once the pages have been set aside, it's too late
> >
> > while the pages are being set aside, every now and then some memory
> > needs to be allocated. the allocation is atomic, not allowed to use
> > emergency reserves, and can fail without warning. if the allocation
> > fails, we clean up one page and continue, without setting aside
> > anything (patch 9)
> >
> > so if the system is low on memory, the lazy destroy should not make
> > the situation too much worse.
> >
> > the only issue here is starting a normal process in the host (maybe
> > a non secure guest) that uses a lot of memory very quickly, right
> > after a large secure guest has terminated.
>
> I think page cache page allocations do not need to be atomic.
> In that case the kernel might stil l decide to trigger the oom
> killer. We can let it notify ourselves free 256 pages synchronously
> and avoid the oom kill. Have a look at the virtio-balloon
> virtio_balloon_oom_notify

the issue is that once the pages have been set aside, it's too late.
the OOM notifier would only be useful if we get notified of the OOM
situation _while_ setting aside the pages.

unless you mean that the notifier should simply wait until the thread
has done (some of) its work?

2021-05-19 18:27:23

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] KVM: s390: pv: implement lazy destroy



On 18.05.21 18:31, Claudio Imbrenda wrote:
> On Tue, 18 May 2021 18:22:42 +0200
> David Hildenbrand <[email protected]> wrote:
>
>> On 18.05.21 18:19, Claudio Imbrenda wrote:
>>> On Tue, 18 May 2021 18:04:11 +0200
>>> Cornelia Huck <[email protected]> wrote:
>>>
>>>> On Tue, 18 May 2021 17:36:24 +0200
>>>> Claudio Imbrenda <[email protected]> wrote:
>>>>
>>>>> On Tue, 18 May 2021 17:05:37 +0200
>>>>> Cornelia Huck <[email protected]> wrote:
>>>>>
>>>>>> On Mon, 17 May 2021 22:07:47 +0200
>>>>>> Claudio Imbrenda <[email protected]> wrote:
>>>>
>>>>>>> This means that the same address space can have memory
>>>>>>> belonging to more than one protected guest, although only one
>>>>>>> will be running, the others will in fact not even have any
>>>>>>> CPUs.
>>>>>>
>>>>>> Are those set-aside-but-not-yet-cleaned-up pages still possibly
>>>>>> accessible in any way? I would assume that they only belong to
>>>>>> the
>>>>>
>>>>> in case of reboot: yes, they are still in the address space of the
>>>>> guest, and can be swapped if needed
>>>>>
>>>>>> 'zombie' guests, and any new or rebooted guest is a new entity
>>>>>> that needs to get new pages?
>>>>>
>>>>> the rebooted guest (normal or secure) will re-use the same pages
>>>>> of the old guest (before or after cleanup, which is the reason of
>>>>> patches 3 and 4)
>>>>
>>>> Took a look at those patches, makes sense.
>>>>
>>>>>
>>>>> the KVM guest is not affected in case of reboot, so the userspace
>>>>> address space is not touched.
>>>>
>>>> 'guest' is a bit ambiguous here -- do you mean the vm here, and the
>>>> actual guest above?
>>>>
>>>
>>> yes this is tricky, because there is the guest OS, which terminates
>>> or reboots, then there is the "secure configuration" entity,
>>> handled by the Ultravisor, and then the KVM VM
>>>
>>> when a secure guest reboots, the "secure configuration" is
>>> dismantled (in this case, in a deferred way), and the KVM VM (and
>>> its memory) is not directly affected
>>>
>>> what happened before was that the secure configuration was
>>> dismantled synchronously, and then re-created.
>>>
>>> now instead, a new secure configuration is created using the same
>>> KVM VM (and thus the same mm), before the old secure configuration
>>> has been completely dismantled. hence the same KVM VM can have
>>> multiple secure configurations associated, sharing the same address
>>> space.
>>>
>>> of course, only the newest one is actually running, the other ones
>>> are "zombies", without CPUs.
>>>
>>
>> Can a guest trigger a DoS?
>
> I don't see how
>
> a guest can fill its memory and then reboot, and then fill its memory
> again and then reboot... but that will take time, filling the memory
> will itself clean up leftover pages from previous boots.

In essence this guest will then synchronously wait for the page to be
exported and reimported, correct?
>
> "normal" reboot loops will be fast, because there won't be much memory
> to process
>
> I have actually tested mixed reboot/shutdown loops, and the system
> behaved as you would expect when under load.

I guess the memory will continue to be accounted to the memcg? Correct?

2021-05-27 09:45:15

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v1 08/11] KVM: s390: pv: lazy destroy for reboot

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Until now, destroying a protected guest was an entirely synchronous
> operation that could potentially take a very long time, depending on
> the size of the guest, due to the time needed to clean up the address
> space from protected pages.
>
> This patch implements a lazy destroy mechanism, that allows a protected
> guest to reboot significantly faster than previously.
>
> This is achieved by clearing the pages of the old guest in background.
> In case of reboot, the new guest will be able to run in the same
> address space almost immediately.
>
> The old protected guest is then only destroyed when all of its memory has
> been destroyed or otherwise made non protected.

LGTM some comments below

>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kvm/kvm-s390.c | 6 +-
> arch/s390/kvm/kvm-s390.h | 2 +-
> arch/s390/kvm/pv.c | 118 ++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 120 insertions(+), 6 deletions(-)
>
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 2f09e9d7dc95..db25aa1fb6a6 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2248,7 +2248,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
>
> r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
> if (r)
> - kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
> + kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);
>
> /* we need to block service interrupts from now on */
> set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
> @@ -2267,7 +2267,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
> */
> if (r)
> break;
> - r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
> + r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);
>
> /* no need to block service interrupts any more */
> clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
> @@ -2796,7 +2796,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> * complaining we do not use kvm_s390_pv_is_protected.
> */
> if (kvm_s390_pv_get_handle(kvm))
> - kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
> + kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
> debug_unregister(kvm->arch.dbf);
> free_page((unsigned long)kvm->arch.sie_page2);
> if (!kvm_is_ucontrol(kvm))
> diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
> index 79dcd647b378..b3c0796a3cc1 100644
> --- a/arch/s390/kvm/kvm-s390.h
> +++ b/arch/s390/kvm/kvm-s390.h
> @@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
> /* implemented in pv.c */
> int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
> int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
> -int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
> +int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
> int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
> int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
> u16 *rrc);
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 59039b8a7be7..9a3547966e18 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -14,8 +14,17 @@
> #include <asm/mman.h>
> #include <linux/pagewalk.h>
> #include <linux/sched/mm.h>
> +#include <linux/kthread.h>
> #include "kvm-s390.h"
>
> +struct deferred_priv {
> + struct mm_struct *mm;
> + unsigned long old_table;
> + u64 handle;
> + void *virt;
> + unsigned long real;
> +};
> +
> int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> {
> int cc = 0;
> @@ -202,7 +211,7 @@ static int kvm_s390_pv_replace_asce(struct kvm *kvm)
> }
>
> /* this should not fail, but if it does, we must not free the donated memory */
> -int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> +static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
> {
> int cc;
>
> @@ -230,6 +239,111 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> return cc ? -EIO : 0;
> }
>
> +static int kvm_s390_pv_destroy_vm_thread(void *priv)
> +{
> + struct deferred_priv *p = priv;
> + u16 rc, rrc;
> + int r = 1;
> +
> + /* Exit early if we end up being the only users of the mm */
> + s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
> + mmput(p->mm);

Where do we exit early?

> +
> + r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
> + WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
> + if (r)
> + return r;
> + atomic_dec(&p->mm->context.is_protected);
> +
> + /*
> + * Intentional leak in case the destroy secure VM call fails. The
> + * call should never fail if the hardware is not broken.
> + */
> + free_pages(p->real, get_order(uv_info.guest_base_stor_len));
> + free_pages(p->old_table, CRST_ALLOC_ORDER);
> + vfree(p->virt);
> + kfree(p);
> + return 0;
> +}
> +
> +static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
> +{
> + struct task_struct *t;
> +
> + priv->virt = kvm->arch.pv.stor_var;
> + priv->real = kvm->arch.pv.stor_base;

Now I know what the struct members actually mean...
Is there a reson you didn't like config_stor_* as a name or something
similar?

> + priv->handle = kvm_s390_pv_get_handle(kvm);
> + priv->old_table = (unsigned long)kvm->arch.gmap->table;
> + WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
> +
> + if (kvm_s390_pv_replace_asce(kvm))
> + goto fail;
> +
> + t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
> + "kvm_s390_pv_destroy_vm_thread");
> + if (IS_ERR_OR_NULL(t))
> + goto fail;
> +
> + memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
> + KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
> + wake_up_process(t);
> + /*
> + * no actual UVC is performed at this point, just return a successful
> + * rc value to make userspace happy, and an arbitrary rrc
> + */
> + *rc = 1;
> + *rrc = 42;
> +
> + return 0;
> +
> +fail:
> + kfree(priv);
> + return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +}
> +
> +/* Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
> +static void kvm_s390_clear_2g(struct kvm *kvm)
> +{
> + struct kvm_memory_slot *slot;
> + struct kvm_memslots *slots;
> + unsigned long lim;
> + int idx;
> +
> + idx = srcu_read_lock(&kvm->srcu);
> +
> + slots = kvm_memslots(kvm);
> + kvm_for_each_memslot(slot, slots) {
> + if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
> + continue;
> + if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
> + lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
> + else
> + lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
> + s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
> + }
> +
> + srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
> +int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
> +{
> + struct deferred_priv *priv;
> +
> + priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
> + if (!priv)
> + return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +
> + if (mmget_not_zero(kvm->mm)) {
> + kvm_s390_clear_2g(kvm);
> + } else {
> + /* No deferred work to do */
> + kfree(priv);
> + return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> + }
> + priv->mm = kvm->mm;
> + return deferred_destroy(kvm, priv, rc, rrc);
> +}
> +
> int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> {
> struct uv_cb_cgc uvcb = {
> @@ -263,7 +377,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> atomic_inc(&kvm->mm->context.is_protected);
> if (cc) {
> if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
> - kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
> + kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
> } else {
> atomic_dec(&kvm->mm->context.is_protected);
> kvm_s390_pv_dealloc_vm(kvm);
>

2021-05-27 10:39:22

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v1 10/11] KVM: s390: pv: module parameter to fence lazy destroy

On 5/17/21 10:07 PM, Claudio Imbrenda wrote:
> Add the module parameter "lazy_destroy", to allow the lazy destroy
> mechanism to be switched off. This might be useful for debugging
> purposes.
>
> The parameter is enabled by default.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kvm/pv.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 4333d3e54ef0..00c14406205f 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -26,6 +26,10 @@ struct deferred_priv {
> unsigned long real;
> };
>
> +static int lazy_destroy = 1;
> +module_param(lazy_destroy, int, 0444);

I'm pondering if we want to make that writable or not.

Reviewed-by: Janosch Frank <[email protected]>

> +MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
> +
> int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> {
> int cc = 0;
> @@ -348,6 +352,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
> {
> struct deferred_priv *priv;
>
> + if (!lazy_destroy)
> + kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> +
> priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
> if (!priv)
> return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
> @@ -396,6 +403,12 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> /* Outputs */
> kvm->arch.pv.handle = uvcb.guest_handle;
>
> + if (!lazy_destroy) {
> + mmap_write_lock(kvm->mm);
> + kvm->mm->context.pv_sync_destroy = 1;
> + mmap_write_unlock(kvm->mm);
> + }
> +
> atomic_inc(&kvm->mm->context.is_protected);
> if (cc) {
> if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
>