Previously, when a protected VM was rebooted or shut down, its memory
was first made unprotected, and only then was the protected VM itself
destroyed. Looping over the whole address space can take some time,
considering the overhead of the various Ultravisor Calls (UVCs). This
means that a reboot or a shutdown could take a long time, depending on
the amount of memory used.
This patch series implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
the background, allowing the guest to restart or terminate
significantly faster than before.
There are two possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)
For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread has
gone through the whole address space, the protected VM is actually
destroyed.
For the shutdown case, a list of pages to be destroyed is built while
the mm is torn down: instead of just being unmapped, the pages are also
set aside. Later, when KVM cleans up the VM, a thread is started to
clean up the pages on the list.
This means that the same address space can contain memory belonging to
more than one protected guest; only one of them will be running, and
the others will not even have any CPUs.
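The page walk used by the cleanup thread later in this series gathers
pages in small batches precisely so that the mmap lock is not held
across the slow destroy UVCs. A rough userspace sketch of that
gather-then-process pattern (the names, the flat table standing in for
page tables, and the batch size are illustrative only, not the kernel
API):

```c
#include <assert.h>
#include <stddef.h>

#define BATCH 32 /* mirrors DESTROY_LOOP_THRESHOLD in the series */

struct batch_state {
	size_t next;  /* index to resume the walk from */
	size_t count; /* entries gathered into this batch */
	unsigned long pfns[BATCH];
};

/*
 * Gather up to BATCH present entries starting at s->next. In the
 * kernel, gathering happens under mmap_read_lock, while the slow
 * destroy UVC runs on the gathered batch only after the lock is
 * dropped, so the lock is never held for long.
 */
static size_t gather_batch(const unsigned long *tbl, size_t n,
			   struct batch_state *s)
{
	s->count = 0;
	while (s->next < n && s->count < BATCH) {
		if (tbl[s->next]) /* stand-in for pte_present() */
			s->pfns[s->count++] = tbl[s->next];
		s->next++;
	}
	return s->count;
}
```

The caller simply loops: gather a batch, process it lock-free, repeat
until gather_batch() returns 0.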
When a guest is destroyed, its memory still counts towards its memory
control group until it is actually freed (I verified this
experimentally).
When the system runs out of memory while a terminated guest's memory is
still being cleaned up asynchronously, the OOM killer will wait a
little and then check whether memory has been freed. In practice this
slows down memory allocations when the system is out of memory, giving
the cleanup thread time to free memory and avoiding an actual OOM
situation.
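The throttling logic can be sketched in plain userspace C. The
1024-page threshold and the before/after tally mirror the OOM notifier
added later in this series; oom_notify(), wait_fn and the pointer
plumbing are illustrative stand-ins, not the kernel interface:

```c
#define OOM_FREED_ENOUGH 1024 /* threshold used by the series */

/*
 * Simplified model of the per-VM notifier: "freed" accumulates across
 * all notifiers during one OOM event; n_pages_freed is the running
 * tally updated by the cleanup thread; wait_fn stands in for the
 * short sleep during which the cleanup thread may free more pages.
 */
static void oom_notify(unsigned long *freed,
		       const unsigned long *n_pages_freed,
		       void (*wait_fn)(void))
{
	unsigned long before;

	if (*freed > OOM_FREED_ENOUGH)
		return; /* enough reclaimed already: skip the wait */
	before = *n_pages_freed;
	wait_fn(); /* give the cleanup thread some time */
	*freed += *n_pages_freed - before;
}
```

Once the notifiers have collectively reported more than the threshold,
further notifier calls for the same OOM event return immediately, which
is what keeps the wait bounded when many VMs are torn down at once.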
v2->v3
* added definitions for CC return codes for the UVC instruction
* improved make_secure_pte:
- renamed rc to cc
- added comments to explain why returning -EAGAIN is ok
* fixed kvm_s390_pv_replace_asce and kvm_s390_pv_remove_old_asce:
- renamed
- added locking
- moved to gmap.c
* do proper error management in do_secure_storage_access instead of
trying again hoping to get a different exception
* fix outdated patch descriptions
v1->v2
* rebased on a more recent kernel
* improved/expanded some patch descriptions
* improved/expanded some comments
* added patch 1, which prevents stall notification when the system is
under heavy load.
* renamed some members of struct deferred_priv to improve readability
* avoided a use-after-free bug of the struct mm in case of shutdown
* added a missing return when lazy destroy is disabled
* added support for OOM notifier
Claudio Imbrenda (14):
KVM: s390: pv: add macros for UVC CC values
KVM: s390: pv: avoid stall notifications for some UVCs
KVM: s390: pv: leak the ASCE page when destroy fails
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: lazy destroy for reboot
KVM: s390: pv: extend lazy destroy to handle shutdown
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: add OOM notifier for lazy destroy
KVM: s390: pv: avoid export before import if possible
arch/s390/include/asm/gmap.h | 6 +-
arch/s390/include/asm/mmu.h | 3 +
arch/s390/include/asm/mmu_context.h | 2 +
arch/s390/include/asm/pgtable.h | 16 +-
arch/s390/include/asm/uv.h | 31 +++-
arch/s390/kernel/uv.c | 162 +++++++++++++++++++-
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 223 ++++++++++++++++++++++++++--
arch/s390/mm/fault.c | 20 ++-
arch/s390/mm/gmap.c | 141 ++++++++++++++----
11 files changed, 555 insertions(+), 57 deletions(-)
--
2.31.1
Refactor s390_reset_acc so that its pieces can be reused in upcoming
patches; the users parameter of the new s390_uv_destroy_range will also
be needed there.
We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/gmap.h | 4 +-
arch/s390/kvm/pv.c | 12 ++++-
arch/s390/mm/gmap.c | 88 ++++++++++++++++++++++++------------
3 files changed, 72 insertions(+), 32 deletions(-)
diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 746e18bf8984..559a0c18579e 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -147,7 +147,9 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
-void s390_reset_acc(struct mm_struct *mm);
void s390_remove_old_asce(struct gmap *gmap);
int s390_replace_asce(struct gmap *gmap);
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end);
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index f3dd5dd00013..1e7f63c06a63 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -12,6 +12,8 @@
#include <asm/gmap.h>
#include <asm/uv.h>
#include <asm/mman.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
#include "kvm-s390.h"
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
@@ -160,8 +162,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;
- /* make all pages accessible before destroying the guest */
- s390_reset_acc(kvm->mm);
+ /*
+ * if the mm still has a mapping, make all its pages accessible
+ * before destroying the guest
+ */
+ if (mmget_not_zero(kvm->mm)) {
+ s390_uv_destroy_range(kvm->mm, 0, 0, TASK_SIZE);
+ mmput(kvm->mm);
+ }
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 38b792ab57f7..b87892986836 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2670,44 +2670,74 @@ void s390_reset_cmma(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(s390_reset_cmma);
-/*
- * make inaccessible pages accessible again
- */
-static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
+#define DESTROY_LOOP_THRESHOLD 32
+
+struct reset_walk_state {
+ unsigned long next;
+ unsigned long count;
+ unsigned long pfns[DESTROY_LOOP_THRESHOLD];
+};
+
+static int s390_gather_pages(pte_t *ptep, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
+ struct reset_walk_state *p = walk->private;
pte_t pte = READ_ONCE(*ptep);
- /* There is a reference through the mapping */
- if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
-
- return 0;
+ if (pte_present(pte)) {
+ /* we have a reference from the mapping, take an extra one */
+ get_page(phys_to_page(pte_val(pte)));
+ p->pfns[p->count] = phys_to_pfn(pte_val(pte));
+ p->next = next;
+ p->count++;
+ }
+ return p->count >= DESTROY_LOOP_THRESHOLD;
}
-static const struct mm_walk_ops reset_acc_walk_ops = {
- .pte_entry = __s390_reset_acc,
+static const struct mm_walk_ops gather_pages_ops = {
+ .pte_entry = s390_gather_pages,
};
-#include <linux/sched/mm.h>
-void s390_reset_acc(struct mm_struct *mm)
+/*
+ * Call the Destroy secure page UVC on each page in the given array of PFNs.
+ * Each page needs to have an extra reference, which will be released here.
+ */
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
{
- if (!mm_is_protected(mm))
- return;
- /*
- * we might be called during
- * reset: we walk the pages and clear
- * close of all kvm file descriptors: we walk the pages and clear
- * exit of process on fd closure: vma already gone, do nothing
- */
- if (!mmget_not_zero(mm))
- return;
- mmap_read_lock(mm);
- walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
- mmap_read_unlock(mm);
- mmput(mm);
+ unsigned long i;
+
+ for (i = 0; i < count; i++) {
+ /* we always have an extra reference */
+ uv_destroy_owned_page(pfn_to_phys(pfns[i]));
+ /* get rid of the extra reference */
+ put_page(pfn_to_page(pfns[i]));
+ cond_resched();
+ }
+}
+EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
+
+/*
+ * Walk the given range of the given address space, and call the destroy
+ * secure page UVC on each page.
+ * Exit early if the number of users of the mm drops to (or below) the given
+ * value.
+ */
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end)
+{
+ struct reset_walk_state state = { .next = start };
+ int r = 1;
+
+ while ((r > 0) && (atomic_read(&mm->mm_users) > users)) {
+ state.count = 0;
+ mmap_read_lock(mm);
+ r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+ mmap_read_unlock(mm);
+ cond_resched();
+ s390_uv_destroy_pfns(state.count, state.pfns);
+ }
}
-EXPORT_SYMBOL_GPL(s390_reset_acc);
+EXPORT_SYMBOL_GPL(s390_uv_destroy_range);
/*
* Remove the topmost level of page tables from the list of page tables of
--
2.31.1
Add a per-VM OOM notifier for lazy destroy.
When a protected VM is undergoing deferred teardown, register an OOM
notifier. This allows an OOM situation to be handled by just waiting a
little.
The background deferred destroy process now keeps a running tally of
the number of pages freed. The OOM notifier checks that tally before
and after waiting: it waits 20ms and then reports how many pages the
deferred destroy mechanism freed during that time.
If at least 1024 pages have already been freed during the current OOM
situation, the OOM notifier takes no action and does not wait. This
avoids excessive waiting when many VMs are being destroyed at the same
time, once enough memory has been freed.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/pv.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index c570d4ce29c3..e3dfd5c73d3f 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -15,8 +15,12 @@
#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/oom.h>
#include "kvm-s390.h"
+#define KVM_S390_PV_LAZY_DESTROY_OOM_NOTIFY_PRIORITY 70
+
struct deferred_priv {
struct mm_struct *mm;
bool has_mm;
@@ -24,6 +28,8 @@ struct deferred_priv {
u64 handle;
void *stor_var;
unsigned long stor_base;
+ unsigned long n_pages_freed;
+ struct notifier_block oom_nb;
};
static int lazy_destroy = 1;
@@ -200,6 +206,24 @@ static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}
+static int kvm_s390_pv_oom_notify(struct notifier_block *nb,
+ unsigned long dummy, void *parm)
+{
+ unsigned long *freed = parm;
+ unsigned long free_before;
+ struct deferred_priv *p;
+
+ if (*freed > 1024)
+ return NOTIFY_OK;
+
+ p = container_of(nb, struct deferred_priv, oom_nb);
+ free_before = READ_ONCE(p->n_pages_freed);
+ msleep(20);
+ *freed += READ_ONCE(p->n_pages_freed) - free_before;
+
+ return NOTIFY_OK;
+}
+
static int kvm_s390_pv_destroy_vm_thread(void *priv)
{
struct destroy_page_lazy *lazy, *next;
@@ -207,12 +231,20 @@ static int kvm_s390_pv_destroy_vm_thread(void *priv)
u16 rc, rrc;
int r;
+ p->oom_nb.priority = KVM_S390_PV_LAZY_DESTROY_OOM_NOTIFY_PRIORITY;
+ p->oom_nb.notifier_call = kvm_s390_pv_oom_notify;
+ r = register_oom_notifier(&p->oom_nb);
+
list_for_each_entry_safe(lazy, next, &p->mm->context.deferred_list, list) {
list_del(&lazy->list);
s390_uv_destroy_pfns(lazy->count, lazy->pfns);
+ WRITE_ONCE(p->n_pages_freed, p->n_pages_freed + lazy->count + 1);
free_page(__pa(lazy));
}
+ if (!r)
+ unregister_oom_notifier(&p->oom_nb);
+
if (p->has_mm) {
/* Clear all the pages as long as we are not the only users of the mm */
s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
--
2.31.1
With upcoming patches, normal guests might touch secure pages.
This patch extends the existing exception handler to convert pages to
non-secure also when the exception is triggered by a normal guest.
This can happen, for example, when a secure guest reboots: the first
stage of a secure guest is non-secure, and in general a secure guest
can reboot into non-secure mode.
If the secure memory of the previous boot has not yet been cleared up
completely, a non-secure guest might touch secure memory, which then
needs to be handled properly.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/mm/fault.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index eb68b4f36927..74784581f42d 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
struct vm_area_struct *vma;
struct mm_struct *mm;
struct page *page;
+ struct gmap *gmap;
int rc;
/*
@@ -796,6 +797,14 @@ void do_secure_storage_access(struct pt_regs *regs)
}
switch (get_fault_type(regs)) {
+ case GMAP_FAULT:
+ gmap = (struct gmap *)S390_lowcore.gmap;
+ addr = __gmap_translate(gmap, addr);
+ if (IS_ERR_VALUE(addr)) {
+ do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
+ break;
+ }
+ fallthrough;
case USER_FAULT:
mm = current->mm;
mmap_read_lock(mm);
@@ -824,7 +833,6 @@ void do_secure_storage_access(struct pt_regs *regs)
if (rc)
BUG();
break;
- case GMAP_FAULT:
default:
do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
WARN_ON_ONCE(1);
--
2.31.1
Add macros to describe the four possible CC values returned by the UVC
instruction.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/uv.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 12c5f006c136..b35add51b967 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -18,6 +18,11 @@
#include <asm/page.h>
#include <asm/gmap.h>
+#define UVC_CC_OK 0
+#define UVC_CC_ERROR 1
+#define UVC_CC_BUSY 2
+#define UVC_CC_PARTIAL 3
+
#define UVC_RC_EXECUTED 0x0001
#define UVC_RC_INV_CMD 0x0002
#define UVC_RC_INV_STATE 0x0003
--
2.31.1
Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark pages as secure.
These new functions may only be called on pages for which a reference
is already held.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/pgtable.h | 9 ++++++---
arch/s390/include/asm/uv.h | 10 ++++++++--
arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
arch/s390/mm/gmap.c | 4 +++-
4 files changed, 50 insertions(+), 7 deletions(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dcac7b2df72c..0f1af2232ebe 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
pte_t res;
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
@@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
pte_t res;
res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(vma->vm_mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
@@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index b35add51b967..3236293d5a31 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -356,8 +356,9 @@ static inline int is_prot_virt_host(void)
}
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
-int uv_destroy_page(unsigned long paddr);
+int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
+int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
void setup_uv(void);
@@ -367,7 +368,7 @@ void adjust_to_uv_max(unsigned long *vmax);
static inline void setup_uv(void) {}
static inline void adjust_to_uv_max(unsigned long *vmax) {}
-static inline int uv_destroy_page(unsigned long paddr)
+static inline int uv_destroy_owned_page(unsigned long paddr)
{
return 0;
}
@@ -376,6 +377,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
}
+
+static inline int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ return 0;
+}
#endif
#if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 68a8fbafcb9c..05f8bf61d20a 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
*
* @paddr: Absolute host address of page to be destroyed
*/
-int uv_destroy_page(unsigned long paddr)
+static int uv_destroy_page(unsigned long paddr)
{
struct uv_cb_cfs uvcb = {
.header.cmd = UVC_CMD_DESTR_SEC_STOR,
@@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
return 0;
}
+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_destroy_owned_page(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_destroy_page(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Requests the Ultravisor to encrypt a guest page and make it
* accessible to the host for paging (export).
@@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
return 0;
}
+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_convert_from_secure(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5a138f6220c4..38b792ab57f7 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
{
pte_t pte = READ_ONCE(*ptep);
+ /* There is a reference through the mapping */
if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
+ WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
+
return 0;
}
--
2.31.1
Until now, destroying a protected guest was an entirely synchronous
operation that could take a very long time, depending on the size of
the guest, due to the time needed to clean up the protected pages in
its address space.
This patch implements a lazy destroy mechanism, which allows a
protected guest to reboot significantly faster than before.
This is achieved by clearing the pages of the old guest in the
background. In case of reboot, the new guest will be able to run in the
same address space almost immediately.
The old protected guest is then destroyed only once all of its memory
has been destroyed or otherwise made non-protected.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 129 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 131 insertions(+), 6 deletions(-)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index b655a7d82bf0..238297a7bb46 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2281,7 +2281,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
if (r)
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);
/* we need to block service interrupts from now on */
set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2300,7 +2300,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
*/
if (r)
break;
- r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
+ r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);
/* no need to block service interrupts any more */
clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2829,7 +2829,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
* complaining we do not use kvm_s390_pv_is_protected.
*/
if (kvm_s390_pv_get_handle(kvm))
- kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
debug_unregister(kvm->arch.dbf);
free_page((unsigned long)kvm->arch.sie_page2);
if (!kvm_is_ucontrol(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..d2380f5e7e1f 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index fd37e97a26b3..8267376a9a0d 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -14,8 +14,17 @@
#include <asm/mman.h>
#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
+#include <linux/kthread.h>
#include "kvm-s390.h"
+struct deferred_priv {
+ struct mm_struct *mm;
+ unsigned long old_table;
+ u64 handle;
+ void *stor_var;
+ unsigned long stor_base;
+};
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc = 0;
@@ -158,7 +167,7 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
}
/* this should not fail, but if it does, we must not free the donated memory */
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
+static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;
@@ -186,6 +195,122 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}
+static int kvm_s390_pv_destroy_vm_thread(void *priv)
+{
+ struct deferred_priv *p = priv;
+ u16 rc, rrc;
+ int r;
+
+ /* Clear all the pages as long as we are not the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ /*
+ * If we were the last user of the mm, synchronously free (and clear
+ * if needed) all pages.
+ * Otherwise simply decrease the reference counter; in this case we
+ * have already cleared all pages.
+ */
+ mmput(p->mm);
+
+ r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+ WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+ if (r) {
+ mmdrop(p->mm);
+ return r;
+ }
+ atomic_dec(&p->mm->context.is_protected);
+ mmdrop(p->mm);
+
+ /*
+ * Intentional leak in case the destroy secure VM call fails. The
+ * call should never fail if the hardware is not broken.
+ */
+ free_pages(p->stor_base, get_order(uv_info.guest_base_stor_len));
+ free_pages(p->old_table, CRST_ALLOC_ORDER);
+ vfree(p->stor_var);
+ kfree(p);
+ return 0;
+}
+
+static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
+{
+ struct task_struct *t;
+
+ priv->stor_var = kvm->arch.pv.stor_var;
+ priv->stor_base = kvm->arch.pv.stor_base;
+ priv->handle = kvm_s390_pv_get_handle(kvm);
+ priv->old_table = (unsigned long)kvm->arch.gmap->table;
+ WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+
+ if (s390_replace_asce(kvm->arch.gmap))
+ goto fail;
+
+ t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
+ "kvm_s390_pv_destroy_vm_thread");
+ if (IS_ERR_OR_NULL(t))
+ goto fail;
+
+ memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
+ wake_up_process(t);
+ /*
+ * no actual UVC is performed at this point, just return a successful
+ * rc value to make userspace happy, and an arbitrary rrc
+ */
+ *rc = 1;
+ *rrc = 42;
+
+ return 0;
+
+fail:
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+}
+
+/* Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ unsigned long lim;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ slots = kvm_memslots(kvm);
+ kvm_for_each_memslot(slot, slots) {
+ if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
+ continue;
+ if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
+ lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+ else
+ lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *priv;
+
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+ if (!priv)
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
+ mmgrab(kvm->mm);
+ if (mmget_not_zero(kvm->mm)) {
+ kvm_s390_clear_2g(kvm);
+ } else {
+ /* No deferred work to do */
+ mmdrop(kvm->mm);
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+ }
+ priv->mm = kvm->mm;
+ return deferred_destroy(kvm, priv, rc, rrc);
+}
+
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct uv_cb_cgc uvcb = {
@@ -219,7 +344,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
} else {
atomic_dec(&kvm->mm->context.is_protected);
kvm_s390_pv_dealloc_vm(kvm);
--
2.31.1
Extend the lazy destroy infrastructure to handle almost all ways in
which the userspace process can exit. The only case not handled is the
process being killed by the OOM killer, in which case the old behaviour
still applies.
Add the uv_destroy_page_lazy function to set aside pages when they are
unmapped during mm teardown; the pages will then be processed and freed
when the protected VM is torn down.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/mmu.h | 3 ++
arch/s390/include/asm/mmu_context.h | 2 ++
arch/s390/include/asm/pgtable.h | 9 ++++--
arch/s390/include/asm/uv.h | 15 +++++++++
arch/s390/kernel/uv.c | 47 +++++++++++++++++++++++++++++
arch/s390/kvm/pv.c | 41 ++++++++++++++++++-------
6 files changed, 104 insertions(+), 13 deletions(-)
diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index e12ff0f29d1a..e2250dbe3d9d 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -18,6 +18,7 @@ typedef struct {
unsigned long vdso_base;
/* The mmu context belongs to a secure guest. */
atomic_t is_protected;
+ struct list_head deferred_list;
/*
* The following bitfields need a down_write on the mm
* semaphore when they are written to. As they are only
@@ -34,6 +35,8 @@ typedef struct {
unsigned int uses_cmm:1;
/* The gmaps associated with this context are allowed to use huge pages. */
unsigned int allow_gmap_hpage_1m:1;
+ /* The mmu context should be destroyed synchronously */
+ unsigned int pv_sync_destroy:1;
} mm_context_t;
#define INIT_MM_CONTEXT(name) \
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index c7937f369e62..5c598afe07a8 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -27,8 +27,10 @@ static inline int init_new_context(struct task_struct *tsk,
cpumask_clear(&mm->context.cpu_attach_mask);
atomic_set(&mm->context.flush_count, 0);
atomic_set(&mm->context.is_protected, 0);
+ mm->context.pv_sync_destroy = 0;
mm->context.gmap_asce = 0;
mm->context.flush_mm = 0;
+ INIT_LIST_HEAD(&mm->context.deferred_list);
#ifdef CONFIG_PGSTE
mm->context.alloc_pgste = page_table_allocate_pgste ||
test_thread_flag(TIF_PGSTE) ||
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0f1af2232ebe..c311503edc3b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1118,9 +1118,14 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+
/* At this point the reference through the mapping is still present */
- if (mm_is_protected(mm) && pte_present(res))
- uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ if (pte_present(res) && mm_is_protected(mm)) {
+ if (full && !mm->context.pv_sync_destroy)
+ uv_destroy_page_lazy(mm, pte_val(res) & PAGE_MASK);
+ else
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ }
return res;
}
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index c8bd72be8974..934227611773 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -350,6 +350,15 @@ static inline int uv_remove_shared(unsigned long addr) { return 0; }
#if IS_ENABLED(CONFIG_KVM)
extern int prot_virt_host;
+struct destroy_page_lazy {
+ struct list_head list;
+ unsigned short count;
+ unsigned long pfns[];
+};
+
+/* This guarantees that up to PV_MAX_LAZY_COUNT can fit in a page */
+#define PV_MAX_LAZY_COUNT ((PAGE_SIZE - sizeof(struct destroy_page_lazy)) / sizeof(long))
+
static inline int is_prot_virt_host(void)
{
return prot_virt_host;
@@ -358,6 +367,7 @@ static inline int is_prot_virt_host(void)
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
@@ -374,6 +384,11 @@ static inline int uv_destroy_owned_page(unsigned long paddr)
return 0;
}
+static inline int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+ return 0;
+}
+
static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index bc79085d7152..5b632846c4a4 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -186,6 +186,53 @@ int uv_convert_owned_from_secure(unsigned long paddr)
return rc;
}
+/*
+ * Set aside the given page and put it in the list of pages to be cleared in
+ * background. The caller must already hold a reference to the page.
+ */
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+ struct list_head *head = &mm->context.deferred_list;
+ struct destroy_page_lazy *lazy;
+ struct page *page;
+ int rc;
+
+ /* get an extra reference here */
+ get_page(phys_to_page(paddr));
+
+ lazy = list_first_entry(head, struct destroy_page_lazy, list);
+ /*
+ * We need a fresh page to store more pointers. The current page
+ * might be shared, so it cannot be used directly. Instead, make it
+ * accessible and release it, and let the normal unmap code free it
+ * later, if needed.
+ * Afterwards, try to allocate a new page, but not very hard. If the
+ * allocation fails, we simply return. The next call to this
+ * function will attempt to do the same again, until enough pages
+ * have been freed.
+ */
+ if (list_empty(head) || lazy->count >= PV_MAX_LAZY_COUNT) {
+ rc = uv_convert_owned_from_secure(paddr);
+ /* in case of failure, we intentionally leak the page */
+ if (rc)
+ return rc;
+ /* release the extra reference */
+ put_page(phys_to_page(paddr));
+
+ /* try to allocate a new page quickly, but allow failures */
+ page = alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!page)
+ return -ENOMEM;
+ lazy = page_to_virt(page);
+ lazy->count = 0;
+ list_add(&lazy->list, head);
+ return 0;
+ }
+ /* the array of pointers has space, just add this entry */
+ lazy->pfns[lazy->count++] = phys_to_pfn(paddr);
+ return 0;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 8267376a9a0d..e10392a8b1ae 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -19,6 +19,7 @@
struct deferred_priv {
struct mm_struct *mm;
+ bool has_mm;
unsigned long old_table;
u64 handle;
void *stor_var;
@@ -197,19 +198,34 @@ static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
static int kvm_s390_pv_destroy_vm_thread(void *priv)
{
+ struct destroy_page_lazy *lazy, *next;
struct deferred_priv *p = priv;
u16 rc, rrc;
int r;
- /* Clear all the pages as long as we are not the only users of the mm */
- s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
- /*
- * If we were the last user of the mm, synchronously free (and clear
- * if needed) all pages.
- * Otherwise simply decrease the reference counter; in this case we
- * have already cleared all pages.
- */
- mmput(p->mm);
+ list_for_each_entry_safe(lazy, next, &p->mm->context.deferred_list, list) {
+ list_del(&lazy->list);
+ s390_uv_destroy_pfns(lazy->count, lazy->pfns);
+ free_page(__pa(lazy));
+ }
+
+ if (p->has_mm) {
+ /* Clear all the pages as long as we are not the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ if (atomic_read(&p->mm->mm_users) == 1) {
+ mmap_write_lock(p->mm);
+ /* destroy synchronously if there are no other users */
+ p->mm->context.pv_sync_destroy = 1;
+ mmap_write_unlock(p->mm);
+ }
+ /*
+ * If we were the last user of the mm, synchronously free
+ * (and clear if needed) all pages.
+ * Otherwise simply decrease the reference counter; in this
+ * case we have already cleared all pages.
+ */
+ mmput(p->mm);
+ }
r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
@@ -241,7 +257,9 @@ static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc
priv->old_table = (unsigned long)kvm->arch.gmap->table;
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
- if (s390_replace_asce(kvm->arch.gmap))
+ if (!priv->has_mm)
+ s390_remove_old_asce(kvm->arch.gmap);
+ else if (s390_replace_asce(kvm->arch.gmap))
goto fail;
t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
@@ -300,8 +318,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
mmgrab(kvm->mm);
if (mmget_not_zero(kvm->mm)) {
+ priv->has_mm = true;
kvm_s390_clear_2g(kvm);
- } else {
+ } else if (list_empty(&kvm->mm->context.deferred_list)) {
/* No deferred work to do */
mmdrop(kvm->mm);
kfree(priv);
--
2.31.1
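The tracking structure in the patch sizes its flexible pfns[] array so that the whole struct fills exactly one page, and appends entries until the page is full. The capacity formula and the append path can be modeled in plain userspace C (a sketch with hypothetical stand-ins for struct list_head and PAGE_SIZE; this is not the kernel code):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for kernel types/constants */
#define MODEL_PAGE_SIZE 4096UL

struct list_head { struct list_head *next, *prev; };

struct destroy_page_lazy {
	struct list_head list;
	unsigned short count;
	unsigned long pfns[];
};

/* Same formula as PV_MAX_LAZY_COUNT in the patch: as many pfn slots
 * as fit in one page after the header */
#define PV_MAX_LAZY_COUNT \
	((MODEL_PAGE_SIZE - sizeof(struct destroy_page_lazy)) / sizeof(long))

/*
 * Append a pfn to the current tracking page. Returns 0 on success and
 * -1 when the page is full; in the kernel code that is the point where
 * the page is converted synchronously instead and a fresh tracking
 * page is allocated (GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN).
 */
static int lazy_append(struct destroy_page_lazy *lazy, unsigned long pfn)
{
	if (lazy->count >= PV_MAX_LAZY_COUNT)
		return -1;
	lazy->pfns[lazy->count++] = pfn;
	return 0;
}
```

On a typical 64-bit build the header is 24 bytes, giving 509 slots per tracking page, which matches the "every ~500 pages" figure mentioned later in the thread.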
On 04.08.21 17:40, Claudio Imbrenda wrote:
> Previously, when a protected VM was rebooted or when it was shut down,
> its memory was made unprotected, and then the protected VM itself was
> destroyed. Looping over the whole address space can take some time,
> considering the overhead of the various Ultravisor Calls (UVCs). This
> means that a reboot or a shutdown would take a potentially long amount
> of time, depending on the amount of used memory.
>
> This patchseries implements a deferred destroy mechanism for protected
> guests. When a protected guest is destroyed, its memory is cleared in
> background, allowing the guest to restart or terminate significantly
> faster than before.
>
> There are 2 possibilities when a protected VM is torn down:
> * it still has an address space associated (reboot case)
> * it does not have an address space anymore (shutdown case)
>
> For the reboot case, the reference count of the mm is increased, and
> then a background thread is started to clean up. Once the thread went
> through the whole address space, the protected VM is actually
> destroyed.
That doesn't sound too hacky to me, and actually sounds like a good
idea, doing what the guest would do either way but speeding it up
asynchronously, but ...
>
> For the shutdown case, a list of pages to be destroyed is formed when
> the mm is torn down. Instead of just unmapping the pages when the
> address space is being torn down, they are also set aside. Later when
> KVM cleans up the VM, a thread is started to clean up the pages from
> the list.
... this ...
>
> This means that the same address space can have memory belonging to
> more than one protected guest, although only one will be running, the
> others will in fact not even have any CPUs.
... this ...
>
> When a guest is destroyed, its memory still counts towards its memory
> control group until it's actually freed (I tested this experimentally)
>
> When the system runs out of memory, if a guest has terminated and its
> memory is being cleaned asynchronously, the OOM killer will wait a
> little and then see if memory has been freed. This has the practical
> effect of slowing down memory allocations when the system is out of
> memory to give the cleanup thread time to cleanup and free memory, and
> avoid an actual OOM situation.
... and this sound like the kind of arch MM hacks that will bite us in
the long run. Of course, I might be wrong, but already doing excessive
GFP_ATOMIC allocations or messing with the OOM killer that way for a
pure (shutdown) optimization is an alarm signal. Of course, I might be
wrong.
You should at least CC linux-mm. I'll do that right now and also CC
Michal. He might have time to have a quick glimpse at patch #11 and #13.
https://lkml.kernel.org/r/[email protected]
https://lkml.kernel.org/r/[email protected]
IMHO, we should proceed with patch 1-10, as they solve a really
important problem ("slow reboots") in a nice way, whereby patch 11
handles a case that can be worked around comparatively easily by
management tools -- my 2 cents.
--
Thanks,
David / dhildenb
On 04.08.21 17:40, Claudio Imbrenda wrote:
> Add macros to describe the 4 possible CC values returned by the UVC
> instruction.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/include/asm/uv.h | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index 12c5f006c136..b35add51b967 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -18,6 +18,11 @@
> #include <asm/page.h>
> #include <asm/gmap.h>
>
> +#define UVC_CC_OK 0
> +#define UVC_CC_ERROR 1
> +#define UVC_CC_BUSY 2
> +#define UVC_CC_PARTIAL 3
> +
> #define UVC_RC_EXECUTED 0x0001
> #define UVC_RC_INV_CMD 0x0002
> #define UVC_RC_INV_STATE 0x0003
>
Do we have any users we could directly fix up? AFAIKs, most users don't
really care about the cc value, only about cc vs !cc.
The only instances I was able to spot quickly:
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 12c5f006c136..dd72d325f9e8 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -233,7 +233,7 @@ static inline int uv_call(unsigned long r1, unsigned
long r2)
do {
cc = __uv_call(r1, r2);
- } while (cc > 1);
+ } while (cc >= UVC_CC_BUSY);
return cc;
}
@@ -245,7 +245,7 @@ static inline int uv_call_sched(unsigned long r1,
unsigned long r2)
do {
cc = __uv_call(r1, r2);
cond_resched();
- } while (cc > 1);
+ } while (cc >= UVC_CC_BUSY);
return cc;
}
Of course, we could replace all checks for cc vs !cc with "cc !=
UVC_CC_OK" vs "cc == UVC_CC_OK".
--
Thanks,
David / dhildenb
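To make the retry semantics concrete: uv_call() keeps reissuing the UVC instruction while the Ultravisor reports cc 2 (busy) or cc 3 (partial completion), and returns once it gets cc 0 or 1. A small userspace model (hypothetical; driven by a canned sequence of condition codes instead of the real instruction):

```c
#include <assert.h>

/* Condition codes from the patch */
#define UVC_CC_OK      0
#define UVC_CC_ERROR   1
#define UVC_CC_BUSY    2
#define UVC_CC_PARTIAL 3

/*
 * Model of the uv_call() loop: 'ccs' is a canned sequence of condition
 * codes standing in for successive executions of the UVC instruction.
 * Retries while busy/partial, returns the final cc; '*calls' reports
 * how many times the "instruction" ran.
 */
static int model_uv_call(const int *ccs, int n, int *calls)
{
	int i = 0, cc;

	do {
		cc = ccs[i < n ? i : n - 1];
		i++;
	} while (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL);
	if (calls)
		*calls = i;
	return cc;
}
```

The explicit `cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL` spelling matches the objection raised later in the thread that `cc >= UVC_CC_BUSY` is harder to read.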
On Fri, 6 Aug 2021 09:26:11 +0200
David Hildenbrand <[email protected]> wrote:
> On 04.08.21 17:40, Claudio Imbrenda wrote:
> > Add macros to describe the 4 possible CC values returned by the UVC
> > instruction.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/include/asm/uv.h | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> > index 12c5f006c136..b35add51b967 100644
> > --- a/arch/s390/include/asm/uv.h
> > +++ b/arch/s390/include/asm/uv.h
> > @@ -18,6 +18,11 @@
> > #include <asm/page.h>
> > #include <asm/gmap.h>
> >
> > +#define UVC_CC_OK 0
> > +#define UVC_CC_ERROR 1
> > +#define UVC_CC_BUSY 2
> > +#define UVC_CC_PARTIAL 3
> > +
> > #define UVC_RC_EXECUTED 0x0001
> > #define UVC_RC_INV_CMD 0x0002
> > #define UVC_RC_INV_STATE 0x0003
> >
>
> Do we have any users we could directly fix up? AFAIKs, most users
> don't really care about the cc value, only about cc vs !cc.
maybe there will be in the future.
I wanted to split away this generic change from the patch that uses it,
to improve readability
> The only instances I was able to spot quickly:
>
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index 12c5f006c136..dd72d325f9e8 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -233,7 +233,7 @@ static inline int uv_call(unsigned long r1,
> unsigned long r2)
>
> do {
> cc = __uv_call(r1, r2);
> - } while (cc > 1);
> + } while (cc >= UVC_CC_BUSY);
> return cc;
> }
>
> @@ -245,7 +245,7 @@ static inline int uv_call_sched(unsigned long r1,
> unsigned long r2)
> do {
> cc = __uv_call(r1, r2);
> cond_resched();
> - } while (cc > 1);
> + } while (cc >= UVC_CC_BUSY);
> return cc;
> }
>
>
> Of course, we could replace all checks for cc vs !cc with "cc !=
> UVC_CC_OK" vs "cc == UVC_CC_OK".
>
On Fri, 6 Aug 2021 09:10:28 +0200
David Hildenbrand <[email protected]> wrote:
> On 04.08.21 17:40, Claudio Imbrenda wrote:
> > Previously, when a protected VM was rebooted or when it was shut
> > down, its memory was made unprotected, and then the protected VM
> > itself was destroyed. Looping over the whole address space can take
> > some time, considering the overhead of the various Ultravisor Calls
> > (UVCs). This means that a reboot or a shutdown would take a
> > potentially long amount of time, depending on the amount of used
> > memory.
> >
> > This patchseries implements a deferred destroy mechanism for
> > protected guests. When a protected guest is destroyed, its memory
> > is cleared in background, allowing the guest to restart or
> > terminate significantly faster than before.
> >
> > There are 2 possibilities when a protected VM is torn down:
> > * it still has an address space associated (reboot case)
> > * it does not have an address space anymore (shutdown case)
> >
> > For the reboot case, the reference count of the mm is increased, and
> > then a background thread is started to clean up. Once the thread
> > went through the whole address space, the protected VM is actually
> > destroyed.
>
> That doesn't sound too hacky to me, and actually sounds like a good
> idea, doing what the guest would do either way but speeding it up
> asynchronously, but ...
>
> >
> > For the shutdown case, a list of pages to be destroyed is formed
> > when the mm is torn down. Instead of just unmapping the pages when
> > the address space is being torn down, they are also set aside.
> > Later when KVM cleans up the VM, a thread is started to clean up
> > the pages from the list.
>
> ... this ...
>
> >
> > This means that the same address space can have memory belonging to
> > more than one protected guest, although only one will be running,
> > the others will in fact not even have any CPUs.
>
> ... this ...
this ^ is exactly the reboot case.
> > When a guest is destroyed, its memory still counts towards its
> > memory control group until it's actually freed (I tested this
> > experimentally)
> >
> > When the system runs out of memory, if a guest has terminated and
> > its memory is being cleaned asynchronously, the OOM killer will
> > wait a little and then see if memory has been freed. This has the
> > practical effect of slowing down memory allocations when the system
> > is out of memory to give the cleanup thread time to cleanup and
> > free memory, and avoid an actual OOM situation.
>
> ... and this sound like the kind of arch MM hacks that will bite us
> in the long run. Of course, I might be wrong, but already doing
> excessive GFP_ATOMIC allocations or messing with the OOM killer that
they are GFP_ATOMIC, but they should not put too much pressure on the
memory and can also fail without consequences; I used:
GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
also notice that after every page allocation a page gets freed, so this
is only temporary.
I would not call it "messing with the OOM killer"; I'm using the same
interface used by virtio-balloon
> way for a pure (shutdown) optimization is an alarm signal. Of course,
> I might be wrong.
>
> You should at least CC linux-mm. I'll do that right now and also CC
> Michal. He might have time to have a quick glimpse at patch #11 and
> #13.
>
> https://lkml.kernel.org/r/[email protected]
> https://lkml.kernel.org/r/[email protected]
>
> IMHO, we should proceed with patch 1-10, as they solve a really
> important problem ("slow reboots") in a nice way, whereby patch 11
> handles a case that can be worked around comparatively easily by
> management tools -- my 2 cents.
how would management tools work around the issue that a shutdown can
take very long?
also, without my patches, the shutdown case would use export instead of
destroy, making it even slower.
>>> This means that the same address space can have memory belonging to
>>> more than one protected guest, although only one will be running,
>>> the others will in fact not even have any CPUs.
>>
>> ... this ...
>
> this ^ is exactly the reboot case.
Ah, right, we're having more than one protected guest per process, so
it's all handled within the same process.
>
>>> When a guest is destroyed, its memory still counts towards its
>>> memory control group until it's actually freed (I tested this
>>> experimentally)
>>>
>>> When the system runs out of memory, if a guest has terminated and
>>> its memory is being cleaned asynchronously, the OOM killer will
>>> wait a little and then see if memory has been freed. This has the
>>> practical effect of slowing down memory allocations when the system
>>> is out of memory to give the cleanup thread time to cleanup and
>>> free memory, and avoid an actual OOM situation.
>>
>> ... and this sound like the kind of arch MM hacks that will bite us
>> in the long run. Of course, I might be wrong, but already doing
>> excessive GFP_ATOMIC allocations or messing with the OOM killer that
>
> they are GFP_ATOMIC but they should not put too much weight on the
> memory and can also fail without consequences, I used:
>
> GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
>
> also notice that after every page allocation a page gets freed, so this
> is only temporary.
Correct me if I'm wrong: you're allocating unmovable pages for tracking
(e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will free a
movable process page, correct? Or which page will you be freeing?
>
> I would not call it "messing with the OOM killer", I'm using the same
> interface used by virtio-balloon
Right, and for virtio-balloon it's actually a workaround to restore the
original behavior of a rarely used feature: deflate-on-oom. Commit
da10329cb057 ("virtio-balloon: switch back to OOM handler for
VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched back
from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
"The name "deflate on OOM" makes it pretty clear when deflation should
happen - after other approaches to reclaim memory failed, not while
reclaiming. This allows to minimize the footprint of a guest - memory
will only be taken out of the balloon when really needed."
Note some subtle differences:
a) IIRC, before running into the OOM killer, the kernel will try reclaiming
anything else. This is what we want for deflate-on-oom, it might not
be what you want for your feature (e.g., flushing other processes/VMs
to disk/swap instead of waiting for a single process to stop).
b) Migration of movable balloon inflated pages continues working because
we are dealing with non-lru page migration.
Will page reclaim, page migration, compaction, ... of these movable LRU
pages still continue working while they are sitting around waiting to be
cleaned up? I can see that we're grabbing an extra reference when we put
them onto the list, that might be a problem: for example, we can most
certainly not swap out these pages or write them back to disk on memory
pressure.
>
>> way for a pure (shutdown) optimization is an alarm signal. Of course,
>> I might be wrong.
>>
>> You should at least CC linux-mm. I'll do that right now and also CC
>> Michal. He might have time to have a quick glimpse at patch #11 and
>> #13.
>>
>> https://lkml.kernel.org/r/[email protected]
>> https://lkml.kernel.org/r/[email protected]
>>
>> IMHO, we should proceed with patch 1-10, as they solve a really
>> important problem ("slow reboots") in a nice way, whereby patch 11
>> handles a case that can be worked around comparatively easily by
>> management tools -- my 2 cents.
>
> how would management tools work around the issue that a shutdown can
> take very long?
The traditional approach is to wait until memory has been freed up
before starting a new VM on the same hypervisor, or to start it on
another hypervisor instead. That raises the question about the target use case.
What I don't get is why we have to pay the price for freeing up that
memory. Why isn't it sufficient to keep the process running and let
ordinary MM do its thing?
Maybe you should clearly spell out what the target use case for the fast
shutdown (fast quitting of the process?) is. I assume it is starting a
new VM / process / whatsoever on the same host immediately, and then
a) Eventually slowing down other processes due heavy reclaim.
b) Slowing down the new process because you have to pay the price of
cleaning up memory.
I think I am missing why we need the lazy destroy at all when killing a
process. Couldn't you instead teach the OOM killer "hey, we're currently
quitting a heavy process that is just *very* slow to free up memory,
please wait for that before starting shooting around" ?
--
Thanks,
David / dhildenb
On Fri, 6 Aug 2021 13:30:21 +0200
David Hildenbrand <[email protected]> wrote:
[...]
> >>> When the system runs out of memory, if a guest has terminated and
> >>> its memory is being cleaned asynchronously, the OOM killer will
> >>> wait a little and then see if memory has been freed. This has the
> >>> practical effect of slowing down memory allocations when the
> >>> system is out of memory to give the cleanup thread time to
> >>> cleanup and free memory, and avoid an actual OOM situation.
> >>
> >> ... and this sound like the kind of arch MM hacks that will bite us
> >> in the long run. Of course, I might be wrong, but already doing
> >> excessive GFP_ATOMIC allocations or messing with the OOM killer
> >> that
> >
> > they are GFP_ATOMIC but they should not put too much weight on the
> > memory and can also fail without consequences, I used:
> >
> > GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
> >
> > also notice that after every page allocation a page gets freed, so
> > this is only temporary.
>
> Correct me if I'm wrong: you're allocating unmovable pages for
> tracking (e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will
> free a movable process page, correct? Or which page will you be
> freeing?
we are transforming ALL movable pages belonging to userspace into
unmovable pages. Every ~500 pages, one page gets actually
allocated (unmovable), and another (movable) one gets freed.
> >
> > I would not call it "messing with the OOM killer", I'm using the
> > same interface used by virtio-balloon
>
> Right, and for virtio-balloon it's actually a workaround to restore
> the original behavior of a rarely used feature: deflate-on-oom.
> Commit da10329cb057 ("virtio-balloon: switch back to OOM handler for
> VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched
> back from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
>
> "The name "deflate on OOM" makes it pretty clear when deflation should
> happen - after other approaches to reclaim memory failed, not while
> reclaiming. This allows to minimize the footprint of a guest -
> memory will only be taken out of the balloon when really needed."
>
> Note some subtle differences:
>
> a) IIRC, before running into the OOM killer, will try reclaiming
> anything else. This is what we want for deflate-on-oom, it might
> not be what you want for your feature (e.g., flushing other
> processes/VMs to disk/swap instead of waiting for a single process to
> stop).
we are already reclaiming the memory of the dead secure guest.
> b) Migration of movable balloon inflated pages continues working
> because we are dealing with non-lru page migration.
>
> Will page reclaim, page migration, compaction, ... of these movable
> LRU pages still continue working while they are sitting around
> waiting to be cleaned up? I can see that we're grabbing an extra
> reference when we put them onto the list, that might be a problem:
> for example, we can most certainly not swap out these pages or write
> them back to disk on memory pressure.
this is true. on the other hand, swapping a moveable page would be even
slower, because those pages would need to be exported and not destroyed.
> >
> >> way for a pure (shutdown) optimization is an alarm signal. Of
> >> course, I might be wrong.
> >>
> >> You should at least CC linux-mm. I'll do that right now and also CC
> >> Michal. He might have time to have a quick glimpse at patch #11 and
> >> #13.
> >>
> >> https://lkml.kernel.org/r/[email protected]
> >> https://lkml.kernel.org/r/[email protected]
> >>
> >> IMHO, we should proceed with patch 1-10, as they solve a really
> >> important problem ("slow reboots") in a nice way, whereby patch 11
> >> handles a case that can be worked around comparatively easily by
> >> management tools -- my 2 cents.
> >
> > how would management tools work around the issue that a shutdown can
> > take very long?
>
> The traditional approach is to wait starting a new VM on another
> hypervisor instead until memory has been freed up, or start it on
> another hypervisor. That raises the question about the target use
> case.
>
> What I don't get is that we have to pay the price for freeing up that
> memory. Why isn't it sufficient to keep the process running and let
> ordinary MM do its thing?
what price?
you mean let mm do the slowest possible thing when tearing down a dead
guest?
without this, the dying guest would still take up all the memory. and
swapping it would not be any faster (it would be slower, in fact). the
system would OOM anyway.
> Maybe you should clearly spell out what the target use case for the
> fast shutdown (fast quitting of the process?) is?. I assume it is,
> starting a new VM / process / whatsoever on the same host
> immediately, and then
>
> a) Eventually slowing down other processes due heavy reclaim.
for each dying guest, only one CPU is used by the reclaim; depending on
the total load of the system, this might not even be noticeable
> b) Slowing down the new process because you have to pay the price of
> cleaning up memory.
do you prefer to OOM because the dying guest will need ages to clean up
its memory?
> I think I am missing why we need the lazy destroy at all when killing
> a process. Couldn't you instead teach the OOM killer "hey, we're
> currently quitting a heavy process that is just *very* slow to free
> up memory, please wait for that before starting shooting around" ?
isn't this ^ exactly what the OOM notifier does?
another note here:
when the process quits, the mm starts the tear down. at this point, the
mm has no idea that this is a dying KVM guest, so the best it can do is
exporting (which is significantly slower than destroy page)
kvm comes into play long after the mm is gone, and at this point it
can't do anything anymore. the memory is already gone (very slowly).
if I kill -9 qemu (or if qemu segfaults), KVM will never notice until
the mm is gone.
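The behavior being discussed, the OOM path polling registered notifiers and holding off the killer as long as they keep freeing memory, can be sketched as a userspace model (hypothetical; the real kernel interface is register_oom_notifier() and notifier chains, this is only an illustration of the waiting logic):

```c
#include <assert.h>

#define MAX_NOTIFIERS 4

/* Each notifier returns the number of pages it freed since last call */
struct oom_model {
	unsigned long (*notifiers[MAX_NOTIFIERS])(void *ctx);
	void *ctx[MAX_NOTIFIERS];
	int n;
};

/*
 * Poll the notifiers once per round; keep waiting as long as at least
 * one page was freed. Returns the number of rounds the OOM killer was
 * held off before a round freed nothing (capped at max_rounds).
 */
static int oom_rounds_until_kill(struct oom_model *m, int max_rounds)
{
	for (int round = 0; round < max_rounds; round++) {
		unsigned long freed = 0;

		for (int i = 0; i < m->n; i++)
			freed += m->notifiers[i](m->ctx[i]);
		if (!freed)
			return round; /* nothing reclaimed: proceed to kill */
	}
	return max_rounds;
}

/* Sample notifier: drains a backlog of lazily-destroyed pages,
 * at most 100 per invocation (arbitrary illustrative batch size) */
static unsigned long drain_notifier(void *ctx)
{
	unsigned long *pending = ctx;
	unsigned long freed = *pending > 100 ? 100 : *pending;

	*pending -= freed;
	return freed;
}
```

As long as the cleanup thread keeps handing back pages, each notifier round reports progress and the killer waits another round; once the backlog is drained, the next round frees nothing and the OOM killer proceeds.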
On 8/6/21 9:26 AM, David Hildenbrand wrote:
> On 04.08.21 17:40, Claudio Imbrenda wrote:
>> Add macros to describe the 4 possible CC values returned by the UVC
>> instruction.
>>
>> Signed-off-by: Claudio Imbrenda <[email protected]>
>> ---
>> arch/s390/include/asm/uv.h | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
>> index 12c5f006c136..b35add51b967 100644
>> --- a/arch/s390/include/asm/uv.h
>> +++ b/arch/s390/include/asm/uv.h
>> @@ -18,6 +18,11 @@
>> #include <asm/page.h>
>> #include <asm/gmap.h>
>>
>> +#define UVC_CC_OK 0
>> +#define UVC_CC_ERROR 1
>> +#define UVC_CC_BUSY 2
>> +#define UVC_CC_PARTIAL 3
>> +
>> #define UVC_RC_EXECUTED 0x0001
>> #define UVC_RC_INV_CMD 0x0002
>> #define UVC_RC_INV_STATE 0x0003
>>
>
> Do we have any users we could directly fix up? AFAIKs, most users don't
> really care about the cc value, only about cc vs !cc.
>
> The only instances I was able to spot quickly:
The only fix for the functions below that I would accept would be to
check for cc 2 and 3. A cc >= UVC_CC_BUSY confuses me way too much when
reading.
But honestly for those two I'd just keep the code as is. I only asked
Claudio to fix the code in the next patch and add this patch as it was
not clearly visible he was dealing with a CC.
>
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index 12c5f006c136..dd72d325f9e8 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -233,7 +233,7 @@ static inline int uv_call(unsigned long r1, unsigned
> long r2)
>
> do {
> cc = __uv_call(r1, r2);
> - } while (cc > 1);
> + } while (cc >= UVC_CC_BUSY);
> return cc;
> }
>
> @@ -245,7 +245,7 @@ static inline int uv_call_sched(unsigned long r1,
> unsigned long r2)
> do {
> cc = __uv_call(r1, r2);
> cond_resched();
> - } while (cc > 1);
> + } while (cc >= UVC_CC_BUSY);
> return cc;
> }
>
>
> Of course, we could replace all checks for cc vs !cc with "cc !=
> UVC_CC_OK" vs "cc == UVC_CC_OK".
>
On 06.08.21 15:44, Claudio Imbrenda wrote:
> On Fri, 6 Aug 2021 13:30:21 +0200
> David Hildenbrand <[email protected]> wrote:
>
> [...]
>
>>>>> When the system runs out of memory, if a guest has terminated and
>>>>> its memory is being cleaned asynchronously, the OOM killer will
>>>>> wait a little and then see if memory has been freed. This has the
>>>>> practical effect of slowing down memory allocations when the
>>>>> system is out of memory to give the cleanup thread time to
>>>>> cleanup and free memory, and avoid an actual OOM situation.
>>>>
>>>> ... and this sound like the kind of arch MM hacks that will bite us
>>>> in the long run. Of course, I might be wrong, but already doing
>>>> excessive GFP_ATOMIC allocations or messing with the OOM killer
>>>> that
>>>
>>> they are GFP_ATOMIC but they should not put too much weight on the
>>> memory and can also fail without consequences, I used:
>>>
>>> GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN
>>>
>>> also notice that after every page allocation a page gets freed, so
>>> this is only temporary.
>>
>> Correct me if I'm wrong: you're allocating unmovable pages for
>> tracking (e.g., ZONE_DMA, ZONE_NORMAL) from atomic reserves and will
>> free a movable process page, correct? Or which page will you be
>> freeing?
>
> we are transforming ALL moveable pages belonging to userspace into
> unmoveable pages. every ~500 pages one page gets actually
> allocated (unmoveable), and another (moveable) one gets freed.
>
>>>
>>> I would not call it "messing with the OOM killer", I'm using the
>>> same interface used by virtio-balloon
>>
>> Right, and for virtio-balloon it's actually a workaround to restore
>> the original behavior of a rarely used feature: deflate-on-oom.
>> Commit da10329cb057 ("virtio-balloon: switch back to OOM handler for
>> VIRTIO_BALLOON_F_DEFLATE_ON_OOM") tried to document why we switched
>> back from a shrinker to VIRTIO_BALLOON_F_DEFLATE_ON_OOM:
>>
>> "The name "deflate on OOM" makes it pretty clear when deflation should
>> happen - after other approaches to reclaim memory failed, not while
>> reclaiming. This allows to minimize the footprint of a guest -
>> memory will only be taken out of the balloon when really needed."
>>
>> Note some subtle differences:
>>
>> a) IIRC, before running into the OOM killer, will try reclaiming
>> anything else. This is what we want for deflate-on-oom, it might
>> not be what you want for your feature (e.g., flushing other
>> processes/VMs to disk/swap instead of waiting for a single process to
>> stop).
>
> we are already reclaiming the memory of the dead secure guest.
>
>> b) Migration of movable balloon inflated pages continues working
>> because we are dealing with non-lru page migration.
>>
>> Will page reclaim, page migration, compaction, ... of these movable
>> LRU pages still continue working while they are sitting around
>> waiting to be cleaned up? I can see that we're grabbing an extra
>> reference when we put them onto the list, that might be a problem:
>> for example, we can most certainly not swap out these pages or write
>> them back to disk on memory pressure.
>
> this is true. on the other hand, swapping a moveable page would be even
> slower, because those pages would need to be exported and not destroyed.
>
>>>
>>>> way for a pure (shutdown) optimization is an alarm signal. Of
>>>> course, I might be wrong.
>>>>
>>>> You should at least CC linux-mm. I'll do that right now and also CC
>>>> Michal. He might have time to have a quick glimpse at patch #11 and
>>>> #13.
>>>>
>>>> https://lkml.kernel.org/r/[email protected]
>>>> https://lkml.kernel.org/r/[email protected]
>>>>
>>>> IMHO, we should proceed with patch 1-10, as they solve a really
>>>> important problem ("slow reboots") in a nice way, whereby patch 11
>>>> handles a case that can be worked around comparatively easily by
>>>> management tools -- my 2 cents.
>>>
>>> how would management tools work around the issue that a shutdown can
>>> take very long?
>>
>> The traditional approach is to wait starting a new VM on another
>> hypervisor instead until memory has been freed up, or start it on
>> another hypervisor. That raises the question about the target use
>> case.
>>
>> What I don't get is that we have to pay the price for freeing up that
>> memory. Why isn't it sufficient to keep the process running and let
>> ordinary MM do its thing?
>
> what price?
>
> you mean let mm do the slowest possible thing when tearing down a dead
> guest?
>
> without this, the dying guest would still take up all the memory. and
> swapping it would not be any faster (it would be slower, in fact). the
> system would OOM anyway.
>
>> Maybe you should clearly spell out what the target use case for the
>> fast shutdown (fast quitting of the process?) is. I assume it is
>> starting a new VM / process / whatever on the same host immediately,
>> and then
>>
>> a) Eventually slowing down other processes due to heavy reclaim.
>
> for each dying guest, only one CPU is used by the reclaim; depending on
> the total load of the system, this might not even be noticeable.
>
>> b) Slowing down the new process because you have to pay the price of
>> cleaning up memory.
>
> do you prefer to OOM because the dying guest will need ages to clean up
> its memory?
>
>> I think I am missing why we need the lazy destroy at all when killing
>> a process. Couldn't you instead teach the OOM killer "hey, we're
>> currently quitting a heavy process that is just *very* slow to free
>> up memory, please wait for it before you start shooting around"?
>
> isn't this ^ exactly what the OOM notifier does?
>
>
> another note here:
>
> when the process quits, the mm starts the teardown. at this point, the
> mm has no idea that this is a dying KVM guest, so the best it can do is
> export each page (and export is significantly slower than destroy page)
>
> kvm comes into play long after the mm is gone, and at that point it
> can't do anything anymore: the memory is already being freed (very slowly).
>
> if I kill -9 qemu (or if qemu segfaults), KVM will never notice until
> the mm is gone.
>
Summarizing what we discussed offline:
1. We should optimize for proper shutdowns first; this is the most
important use case. We should look into letting QEMU tear down the KVM
secure context so that we can just let the MM teardown do its thing ->
destroy instead of export secure pages. If no kernel changes are
required to get that implemented, even better.
2. If we want to optimize "there is a big process dying horribly slowly,
OOM killer, please wait a bit instead of starting to kill other
processes", we might want to do that in a more generic way (if that is
not already in place; I'm no expert).
3. If we really want to go down the path of optimizing "kill -9" and
friends to e.g., take 40min instead of 20min on a huge VM (who cares?
especially, the OOM handler will struggle already if memory is getting
freed that slowly, no matter if 40 or 20 minutes), we should look into
being able to release the relevant KVM secure context before tearing
down MM. We should avoid any arch specific hacks.
--
Thanks,
David / dhildenb