2021-09-20 23:15:13

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 00/14] KVM: s390: pv: implement lazy destroy for reboot

Previously, when a protected VM was rebooted or when it was shut down,
its memory was made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
considering the overhead of the various Ultravisor Calls (UVCs). This
means that a reboot or a shutdown would take a potentially long amount
of time, depending on the amount of used memory.

This patchseries implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
background, allowing the guest to restart or terminate significantly
faster than before.

There are 2 possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)

For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread went
through the whole address space, the protected VM is actually
destroyed.

This means that the same address space can have memory belonging to
more than one protected guest, although only one will be running, the
others will in fact not even have any CPUs.

The shutdown case is more controversial, and it will be dealt with in a
future patchseries.

When a guest is destroyed, its memory still counts towards its memory
control group until it's actually freed (I tested this experimentally)

v4->v5
* fixed and improved some patch descriptions
* added some comments to better explain what's going on
* use vma_lookup instead of find_vma
* rename is_protected to protected_count since now it's used as a counter

v3->v4
* added patch 2
* split patch 3
* removed the shutdown part -- will be a separate patchseries
* moved the patch introducing the module parameter

v2->v3
* added definitions for CC return codes for the UVC instruction
* improved make_secure_pte:
- renamed rc to cc
- added comments to explain why returning -EAGAIN is ok
* fixed kvm_s390_pv_replace_asce and kvm_s390_pv_remove_old_asce:
- renamed
- added locking
- moved to gmap.c
* do proper error management in do_secure_storage_access instead of
trying again hoping to get a different exception
* fix outdated patch descriptions

v1->v2
* rebased on a more recent kernel
* improved/expanded some patch descriptions
* improves/expanded some comments
* added patch 1, which prevents stall notification when the system is
under heavy load.
* rename some members of struct deferred_priv to improve readability
* avoid an use-after-free bug of the struct mm in case of shutdown
* add missing return when lazy destroy is disabled
* add support for OOM notifier

Claudio Imbrenda (14):
KVM: s390: pv: add macros for UVC CC values
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: leak the topmost page table when destroy fails
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: lazy destroy for reboot
KVM: s390: pv: avoid export before import if possible

arch/s390/include/asm/gmap.h | 6 +-
arch/s390/include/asm/mmu.h | 2 +-
arch/s390/include/asm/mmu_context.h | 2 +-
arch/s390/include/asm/pgtable.h | 11 +-
arch/s390/include/asm/uv.h | 16 ++-
arch/s390/kernel/uv.c | 118 ++++++++++++++++--
arch/s390/kvm/intercept.c | 5 +
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 187 ++++++++++++++++++++++++----
arch/s390/mm/fault.c | 20 ++-
arch/s390/mm/gmap.c | 141 +++++++++++++++++----
12 files changed, 447 insertions(+), 69 deletions(-)

--
2.31.1


2021-09-20 23:33:19

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 03/14] KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm

When the system is heavily overcommitted, kvm_s390_pv_init_vm might
generate stall notifications.

Fix this by using uv_call_sched instead of just uv_call. This is ok because
we are not holding spinlocks.

Signed-off-by: Claudio Imbrenda <[email protected]>
Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
Reviewed-by: Christian Borntraeger <[email protected]>
---
arch/s390/kvm/pv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 0a854115100b..00d272d134c2 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -195,7 +195,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;

- cc = uv_call(0, (u64)&uvcb);
+ cc = uv_call_sched(0, (u64)&uvcb);
*rc = uvcb.header.rc;
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
--
2.31.1

2021-09-20 23:35:21

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 08/14] KVM: s390: pv: handle secure storage exceptions for normal guests

With upcoming patches, normal guests might touch secure pages.

This patch extends the existing exception handler to convert the pages
to non secure also when the exception is triggered by a normal guest.

This can happen for example when a secure guest reboots; the first
stage of a secure guest is non secure, and in general a secure guest
can reboot into non-secure mode.

If the secure memory of the previous boot has not been cleared up
completely yet, a non-secure guest might touch secure memory, which
will need to be handled properly.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/mm/fault.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index eb68b4f36927..74784581f42d 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
struct vm_area_struct *vma;
struct mm_struct *mm;
struct page *page;
+ struct gmap *gmap;
int rc;

/*
@@ -796,6 +797,14 @@ void do_secure_storage_access(struct pt_regs *regs)
}

switch (get_fault_type(regs)) {
+ case GMAP_FAULT:
+ gmap = (struct gmap *)S390_lowcore.gmap;
+ addr = __gmap_translate(gmap, addr);
+ if (IS_ERR_VALUE(addr)) {
+ do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
+ break;
+ }
+ fallthrough;
case USER_FAULT:
mm = current->mm;
mmap_read_lock(mm);
@@ -824,7 +833,6 @@ void do_secure_storage_access(struct pt_regs *regs)
if (rc)
BUG();
break;
- case GMAP_FAULT:
default:
do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
WARN_ON_ONCE(1);
--
2.31.1

2021-09-20 23:35:50

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 09/14] KVM: s390: pv: refactor s390_reset_acc

Refactor s390_reset_acc so that its pieces can be reused in upcoming
patches. The users parameter for s390_destroy_range will be needed in
upcoming patches.

We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/gmap.h | 4 +-
arch/s390/kvm/pv.c | 12 ++++-
arch/s390/mm/gmap.c | 88 ++++++++++++++++++++++++------------
3 files changed, 72 insertions(+), 32 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 746e18bf8984..559a0c18579e 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -147,7 +147,9 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
-void s390_reset_acc(struct mm_struct *mm);
void s390_remove_old_asce(struct gmap *gmap);
int s390_replace_asce(struct gmap *gmap);
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end);
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 76b0d64ce8fa..47db80003ea0 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -12,6 +12,8 @@
#include <asm/gmap.h>
#include <asm/uv.h>
#include <asm/mman.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
#include "kvm-s390.h"

int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
@@ -159,8 +161,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;

- /* make all pages accessible before destroying the guest */
- s390_reset_acc(kvm->mm);
+ /*
+ * if the mm still has a mapping, make all its pages accessible
+ * before destroying the guest
+ */
+ if (mmget_not_zero(kvm->mm)) {
+ s390_uv_destroy_range(kvm->mm, 0, 0, TASK_SIZE);
+ mmput(kvm->mm);
+ }

cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 38b792ab57f7..b87892986836 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2670,44 +2670,74 @@ void s390_reset_cmma(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(s390_reset_cmma);

-/*
- * make inaccessible pages accessible again
- */
-static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
+#define DESTROY_LOOP_THRESHOLD 32
+
+struct reset_walk_state {
+ unsigned long next;
+ unsigned long count;
+ unsigned long pfns[DESTROY_LOOP_THRESHOLD];
+};
+
+static int s390_gather_pages(pte_t *ptep, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
+ struct reset_walk_state *p = walk->private;
pte_t pte = READ_ONCE(*ptep);

- /* There is a reference through the mapping */
- if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
-
- return 0;
+ if (pte_present(pte)) {
+ /* we have a reference from the mapping, take an extra one */
+ get_page(phys_to_page(pte_val(pte)));
+ p->pfns[p->count] = phys_to_pfn(pte_val(pte));
+ p->next = next;
+ p->count++;
+ }
+ return p->count >= DESTROY_LOOP_THRESHOLD;
}

-static const struct mm_walk_ops reset_acc_walk_ops = {
- .pte_entry = __s390_reset_acc,
+static const struct mm_walk_ops gather_pages_ops = {
+ .pte_entry = s390_gather_pages,
};

-#include <linux/sched/mm.h>
-void s390_reset_acc(struct mm_struct *mm)
+/*
+ * Call the Destroy secure page UVC on each page in the given array of PFNs.
+ * Each page needs to have an extra reference, which will be released here.
+ */
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
{
- if (!mm_is_protected(mm))
- return;
- /*
- * we might be called during
- * reset: we walk the pages and clear
- * close of all kvm file descriptors: we walk the pages and clear
- * exit of process on fd closure: vma already gone, do nothing
- */
- if (!mmget_not_zero(mm))
- return;
- mmap_read_lock(mm);
- walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
- mmap_read_unlock(mm);
- mmput(mm);
+ unsigned long i;
+
+ for (i = 0; i < count; i++) {
+ /* we always have an extra reference */
+ uv_destroy_owned_page(pfn_to_phys(pfns[i]));
+ /* get rid of the extra reference */
+ put_page(pfn_to_page(pfns[i]));
+ cond_resched();
+ }
+}
+EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
+
+/*
+ * Walk the given range of the given address space, and call the destroy
+ * secure page UVC on each page.
+ * Exit early if the number of users of the mm drops to (or below) the given
+ * value.
+ */
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end)
+{
+ struct reset_walk_state state = { .next = start };
+ int r = 1;
+
+ while ((r > 0) && (atomic_read(&mm->mm_users) > users)) {
+ state.count = 0;
+ mmap_read_lock(mm);
+ r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+ mmap_read_unlock(mm);
+ cond_resched();
+ s390_uv_destroy_pfns(state.count, state.pfns);
+ }
}
-EXPORT_SYMBOL_GPL(s390_reset_acc);
+EXPORT_SYMBOL_GPL(s390_uv_destroy_range);

/*
* Remove the topmost level of page tables from the list of page tables of
--
2.31.1

2021-09-20 23:37:00

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 12/14] KVM: s390: pv: module parameter to fence lazy destroy

Add the module parameter "lazy_destroy", to allow the lazy destroy
mechanism to be switched off. This might be useful for debugging
purposes.

The parameter is enabled by default.

Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/pv.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index eb13382c3d36..77a972187033 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -16,6 +16,10 @@
#include <linux/sched/mm.h>
#include "kvm-s390.h"

+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc;
--
2.31.1

2021-09-20 23:37:00

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 10/14] KVM: s390: pv: usage counter instead of flag

Use the is_protected field as a counter instead of a flag. This will
be used in upcoming patches.

Increment the counter when a secure configuration is created, and
decrement it when it is destroyed. Previously the flag was set when the
set secure parameters UVC was performed.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/mmu.h | 2 +-
arch/s390/include/asm/mmu_context.h | 2 +-
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/kvm/pv.c | 12 +++++++-----
4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index e12ff0f29d1a..2c4cbc254604 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -17,7 +17,7 @@ typedef struct {
unsigned long asce_limit;
unsigned long vdso_base;
/* The mmu context belongs to a secure guest. */
- atomic_t is_protected;
+ atomic_t protected_count;
/*
* The following bitfields need a down_write on the mm
* semaphore when they are written to. As they are only
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index c7937f369e62..2a38af5a00c2 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -26,7 +26,7 @@ static inline int init_new_context(struct task_struct *tsk,
INIT_LIST_HEAD(&mm->context.gmap_list);
cpumask_clear(&mm->context.cpu_attach_mask);
atomic_set(&mm->context.flush_count, 0);
- atomic_set(&mm->context.is_protected, 0);
+ atomic_set(&mm->context.protected_count, 0);
mm->context.gmap_asce = 0;
mm->context.flush_mm = 0;
#ifdef CONFIG_PGSTE
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0f1af2232ebe..5d06b6befb1f 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -523,7 +523,7 @@ static inline int mm_has_pgste(struct mm_struct *mm)
static inline int mm_is_protected(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
- if (unlikely(atomic_read(&mm->context.is_protected)))
+ if (unlikely(atomic_read(&mm->context.protected_count)))
return 1;
#endif
return 0;
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 47db80003ea0..eb13382c3d36 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -173,7 +173,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
- atomic_set(&kvm->mm->context.is_protected, 0);
+ if (!cc)
+ atomic_dec(&kvm->mm->context.protected_count);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
/* Intended memory leak on "impossible" error */
@@ -214,11 +215,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
/* Outputs */
kvm->arch.pv.handle = uvcb.guest_handle;

+ atomic_inc(&kvm->mm->context.protected_count);
if (cc) {
- if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
+ if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
- else
+ } else {
+ atomic_dec(&kvm->mm->context.protected_count);
kvm_s390_pv_dealloc_vm(kvm);
+ }
return -EIO;
}
kvm->arch.gmap->guest_handle = uvcb.guest_handle;
@@ -241,8 +245,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
*rc, *rrc);
- if (!cc)
- atomic_set(&kvm->mm->context.is_protected, 1);
return cc ? -EINVAL : 0;
}

--
2.31.1

2021-09-20 23:46:28

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v5 06/14] KVM: s390: pv: properly handle page flags for protected guests

Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark them as secure pages.

The PG_arch_1 flag is always allowed to overindicate; using the new
functions introduced here allows to reduce the extent of overindication
and thus improve performance.

These new functions can only be called on pages for which a reference
is already being held.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
---
arch/s390/include/asm/pgtable.h | 9 ++++++---
arch/s390/include/asm/uv.h | 10 ++++++++--
arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
arch/s390/mm/gmap.c | 4 +++-
4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dcac7b2df72c..0f1af2232ebe 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
pte_t res;

res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}

@@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
pte_t res;

res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(vma->vm_mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}

@@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}

diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index b35add51b967..3236293d5a31 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -356,8 +356,9 @@ static inline int is_prot_virt_host(void)
}

int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
-int uv_destroy_page(unsigned long paddr);
+int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
+int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);

void setup_uv(void);
@@ -367,7 +368,7 @@ void adjust_to_uv_max(unsigned long *vmax);
static inline void setup_uv(void) {}
static inline void adjust_to_uv_max(unsigned long *vmax) {}

-static inline int uv_destroy_page(unsigned long paddr)
+static inline int uv_destroy_owned_page(unsigned long paddr)
{
return 0;
}
@@ -376,6 +377,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
}
+
+static inline int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ return 0;
+}
#endif

#if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 68a8fbafcb9c..05f8bf61d20a 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
*
* @paddr: Absolute host address of page to be destroyed
*/
-int uv_destroy_page(unsigned long paddr)
+static int uv_destroy_page(unsigned long paddr)
{
struct uv_cb_cfs uvcb = {
.header.cmd = UVC_CMD_DESTR_SEC_STOR,
@@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
return 0;
}

+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_destroy_owned_page(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_destroy_page(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Requests the Ultravisor to encrypt a guest page and make it
* accessible to the host for paging (export).
@@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
return 0;
}

+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_convert_from_secure(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5a138f6220c4..38b792ab57f7 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
{
pte_t pte = READ_ONCE(*ptep);

+ /* There is a reference through the mapping */
if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
+ WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
+
return 0;
}

--
2.31.1

2021-10-05 13:21:48

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v5 03/14] KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm

On 9/20/21 15:24, Claudio Imbrenda wrote:
> When the system is heavily overcommitted, kvm_s390_pv_init_vm might
> generate stall notifications.
>
> Fix this by using uv_call_sched instead of just uv_call. This is ok because
> we are not holding spinlocks.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
> Reviewed-by: Christian Borntraeger <[email protected]>

Reviewed-by: Janosch Frank <[email protected]>
Oops :)

> ---
> arch/s390/kvm/pv.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 0a854115100b..00d272d134c2 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -195,7 +195,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
> uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;
>
> - cc = uv_call(0, (u64)&uvcb);
> + cc = uv_call_sched(0, (u64)&uvcb);
> *rc = uvcb.header.rc;
> *rrc = uvcb.header.rrc;
> KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
>

2021-10-05 13:28:07

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v5 00/14] KVM: s390: pv: implement lazy destroy for reboot

On 9/20/21 15:24, Claudio Imbrenda wrote:
> Previously, when a protected VM was rebooted or when it was shut down,
> its memory was made unprotected, and then the protected VM itself was
> destroyed. Looping over the whole address space can take some time,
> considering the overhead of the various Ultravisor Calls (UVCs). This
> means that a reboot or a shutdown would take a potentially long amount
> of time, depending on the amount of used memory.
>
> This patchseries implements a deferred destroy mechanism for protected
> guests. When a protected guest is destroyed, its memory is cleared in
> background, allowing the guest to restart or terminate significantly
> faster than before.
>
> There are 2 possibilities when a protected VM is torn down:
> * it still has an address space associated (reboot case)
> * it does not have an address space anymore (shutdown case)
>
> For the reboot case, the reference count of the mm is increased, and
> then a background thread is started to clean up. Once the thread went
> through the whole address space, the protected VM is actually
> destroyed.
>
> This means that the same address space can have memory belonging to
> more than one protected guest, although only one will be running, the
> others will in fact not even have any CPUs.
>
> The shutdown case is more controversial, and it will be dealt with in a
> future patchseries.
>
> When a guest is destroyed, its memory still counts towards its memory
> control group until it's actually freed (I tested this experimentally)


@Christian: I'd like to have #1-3 in early so we can focus on the more
complicated stuff.

2021-10-05 14:18:01

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v5 00/14] KVM: s390: pv: implement lazy destroy for reboot



Am 05.10.21 um 15:26 schrieb Janosch Frank:
> On 9/20/21 15:24, Claudio Imbrenda wrote:
>> Previously, when a protected VM was rebooted or when it was shut down,
>> its memory was made unprotected, and then the protected VM itself was
>> destroyed. Looping over the whole address space can take some time,
>> considering the overhead of the various Ultravisor Calls (UVCs). This
>> means that a reboot or a shutdown would take a potentially long amount
>> of time, depending on the amount of used memory.
>>
>> This patchseries implements a deferred destroy mechanism for protected
>> guests. When a protected guest is destroyed, its memory is cleared in
>> background, allowing the guest to restart or terminate significantly
>> faster than before.
>>
>> There are 2 possibilities when a protected VM is torn down:
>> * it still has an address space associated (reboot case)
>> * it does not have an address space anymore (shutdown case)
>>
>> For the reboot case, the reference count of the mm is increased, and
>> then a background thread is started to clean up. Once the thread went
>> through the whole address space, the protected VM is actually
>> destroyed.
>>
>> This means that the same address space can have memory belonging to
>> more than one protected guest, although only one will be running, the
>> others will in fact not even have any CPUs.
>>
>> The shutdown case is more controversial, and it will be dealt with in a
>> future patchseries.
>>
>> When a guest is destroyed, its memory still counts towards its memory
>> control group until it's actually freed (I tested this experimentally)
>
>
> @Christian: I'd like to have #1-3 in early so we can focus on the more complicated stuff.

Yes, makes perfect sense.

2021-10-12 08:05:17

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v5 06/14] KVM: s390: pv: properly handle page flags for protected guests

On 9/20/21 15:24, Claudio Imbrenda wrote:
> Introduce variants of the convert and destroy page functions that also
> clear the PG_arch_1 bit used to mark them as secure pages.
>
> The PG_arch_1 flag is always allowed to overindicate; using the new
> functions introduced here allows to reduce the extent of overindication
> and thus improve performance.
>
> These new functions can only be called on pages for which a reference
> is already being held.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Acked-by: Janosch Frank <[email protected]>

You can make this one a Reviewed-by

> Reviewed-by: Christian Borntraeger <[email protected]>
> ---
> arch/s390/include/asm/pgtable.h | 9 ++++++---
> arch/s390/include/asm/uv.h | 10 ++++++++--
> arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
> arch/s390/mm/gmap.c | 4 +++-
> 4 files changed, 50 insertions(+), 7 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index dcac7b2df72c..0f1af2232ebe 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> pte_t res;
>
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> @@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
> pte_t res;
>
> res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(vma->vm_mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> @@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> } else {
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> }
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index b35add51b967..3236293d5a31 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -356,8 +356,9 @@ static inline int is_prot_virt_host(void)
> }
>
> int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
> -int uv_destroy_page(unsigned long paddr);
> +int uv_destroy_owned_page(unsigned long paddr);
> int uv_convert_from_secure(unsigned long paddr);
> +int uv_convert_owned_from_secure(unsigned long paddr);
> int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
>
> void setup_uv(void);
> @@ -367,7 +368,7 @@ void adjust_to_uv_max(unsigned long *vmax);
> static inline void setup_uv(void) {}
> static inline void adjust_to_uv_max(unsigned long *vmax) {}
>
> -static inline int uv_destroy_page(unsigned long paddr)
> +static inline int uv_destroy_owned_page(unsigned long paddr)
> {
> return 0;
> }
> @@ -376,6 +377,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
> {
> return 0;
> }
> +
> +static inline int uv_convert_owned_from_secure(unsigned long paddr)
> +{
> + return 0;
> +}
> #endif
>
> #if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 68a8fbafcb9c..05f8bf61d20a 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
> *
> * @paddr: Absolute host address of page to be destroyed
> */
> -int uv_destroy_page(unsigned long paddr)
> +static int uv_destroy_page(unsigned long paddr)
> {
> struct uv_cb_cfs uvcb = {
> .header.cmd = UVC_CMD_DESTR_SEC_STOR,
> @@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
> return 0;
> }
>
> +/*
> + * The caller must already hold a reference to the page
> + */
> +int uv_destroy_owned_page(unsigned long paddr)
> +{
> + struct page *page = phys_to_page(paddr);
> + int rc;
> +
> + get_page(page);
> + rc = uv_destroy_page(paddr);
> + if (!rc)
> + clear_bit(PG_arch_1, &page->flags);
> + put_page(page);
> + return rc;
> +}
> +
> /*
> * Requests the Ultravisor to encrypt a guest page and make it
> * accessible to the host for paging (export).
> @@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
> return 0;
> }
>
> +/*
> + * The caller must already hold a reference to the page
> + */
> +int uv_convert_owned_from_secure(unsigned long paddr)
> +{
> + struct page *page = phys_to_page(paddr);
> + int rc;
> +
> + get_page(page);
> + rc = uv_convert_from_secure(paddr);
> + if (!rc)
> + clear_bit(PG_arch_1, &page->flags);
> + put_page(page);
> + return rc;
> +}
> +
> /*
> * Calculate the expected ref_count for a page that would otherwise have no
> * further pins. This was cribbed from similar functions in other places in
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 5a138f6220c4..38b792ab57f7 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
> {
> pte_t pte = READ_ONCE(*ptep);
>
> + /* There is a reference through the mapping */
> if (pte_present(pte))
> - WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
> + WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
> +
> return 0;
> }
>
>

2021-10-12 08:20:34

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v5 08/14] KVM: s390: pv: handle secure storage exceptions for normal guests

On 9/20/21 15:24, Claudio Imbrenda wrote:
> With upcoming patches, normal guests might touch secure pages.
>
> This patch extends the existing exception handler to convert the pages
> to non secure also when the exception is triggered by a normal guest.
>
> This can happen for example when a secure guest reboots; the first
> stage of a secure guest is non secure, and in general a secure guest
> can reboot into non-secure mode.
>
> If the secure memory of the previous boot has not been cleared up
> completely yet, a non-secure guest might touch secure memory, which
> will need to be handled properly.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/mm/fault.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index eb68b4f36927..74784581f42d 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
> struct vm_area_struct *vma;
> struct mm_struct *mm;
> struct page *page;
> + struct gmap *gmap;
> int rc;
>
> /*
> @@ -796,6 +797,14 @@ void do_secure_storage_access(struct pt_regs *regs)
> }
>
> switch (get_fault_type(regs)) {
> + case GMAP_FAULT:
> + gmap = (struct gmap *)S390_lowcore.gmap;
> + addr = __gmap_translate(gmap, addr);
> + if (IS_ERR_VALUE(addr)) {
> + do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
> + break;
> + }
> + fallthrough;

This would trigger an export and not a destroy, right?

> case USER_FAULT:
> mm = current->mm;
> mmap_read_lock(mm);
> @@ -824,7 +833,6 @@ void do_secure_storage_access(struct pt_regs *regs)
> if (rc)
> BUG();
> break;
> - case GMAP_FAULT:
> default:
> do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
> WARN_ON_ONCE(1);
>

2021-10-12 08:38:21

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v5 08/14] KVM: s390: pv: handle secure storage exceptions for normal guests

On Tue, 12 Oct 2021 10:16:26 +0200
Janosch Frank <[email protected]> wrote:

> On 9/20/21 15:24, Claudio Imbrenda wrote:
> > With upcoming patches, normal guests might touch secure pages.
> >
> > This patch extends the existing exception handler to convert the pages
> > to non secure also when the exception is triggered by a normal guest.
> >
> > This can happen for example when a secure guest reboots; the first
> > stage of a secure guest is non secure, and in general a secure guest
> > can reboot into non-secure mode.
> >
> > If the secure memory of the previous boot has not been cleared up
> > completely yet, a non-secure guest might touch secure memory, which
> > will need to be handled properly.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/mm/fault.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > index eb68b4f36927..74784581f42d 100644
> > --- a/arch/s390/mm/fault.c
> > +++ b/arch/s390/mm/fault.c
> > @@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
> > struct vm_area_struct *vma;
> > struct mm_struct *mm;
> > struct page *page;
> > + struct gmap *gmap;
> > int rc;
> >
> > /*
> > @@ -796,6 +797,14 @@ void do_secure_storage_access(struct pt_regs *regs)
> > }
> >
> > switch (get_fault_type(regs)) {
> > + case GMAP_FAULT:
> > + gmap = (struct gmap *)S390_lowcore.gmap;
> > + addr = __gmap_translate(gmap, addr);
> > + if (IS_ERR_VALUE(addr)) {
> > + do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
> > + break;
> > + }
> > + fallthrough;
>
> This would trigger an export and not a destroy, right?

correct. but this would only happen for leftover secure pages touched
by non-secure guests, before the background thread could clean them up.

>
> > case USER_FAULT:
> > mm = current->mm;
> > mmap_read_lock(mm);
> > @@ -824,7 +833,6 @@ void do_secure_storage_access(struct pt_regs *regs)
> > if (rc)
> > BUG();
> > break;
> > - case GMAP_FAULT:
> > default:
> > do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
> > WARN_ON_ONCE(1);
> >
>

2021-10-12 12:36:19

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v5 08/14] KVM: s390: pv: handle secure storage exceptions for normal guests

On 10/12/21 10:35, Claudio Imbrenda wrote:
> On Tue, 12 Oct 2021 10:16:26 +0200
> Janosch Frank <[email protected]> wrote:
>
>> On 9/20/21 15:24, Claudio Imbrenda wrote:
>>> With upcoming patches, normal guests might touch secure pages.
>>>
>>> This patch extends the existing exception handler to convert the pages
>>> to non secure also when the exception is triggered by a normal guest.
>>>
>>> This can happen for example when a secure guest reboots; the first
>>> stage of a secure guest is non secure, and in general a secure guest
>>> can reboot into non-secure mode.
>>>
>>> If the secure memory of the previous boot has not been cleared up
>>> completely yet, a non-secure guest might touch secure memory, which
>>> will need to be handled properly.
>>>
>>> Signed-off-by: Claudio Imbrenda <[email protected]>
>>> ---
>>> arch/s390/mm/fault.c | 10 +++++++++-
>>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
>>> index eb68b4f36927..74784581f42d 100644
>>> --- a/arch/s390/mm/fault.c
>>> +++ b/arch/s390/mm/fault.c
>>> @@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
>>> struct vm_area_struct *vma;
>>> struct mm_struct *mm;
>>> struct page *page;
>>> + struct gmap *gmap;
>>> int rc;
>>>
>>> /*
>>> @@ -796,6 +797,14 @@ void do_secure_storage_access(struct pt_regs *regs)
>>> }
>>>
>>> switch (get_fault_type(regs)) {
>>> + case GMAP_FAULT:
>>> + gmap = (struct gmap *)S390_lowcore.gmap;
>>> + addr = __gmap_translate(gmap, addr);
>>> + if (IS_ERR_VALUE(addr)) {
>>> + do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
>>> + break;
>>> + }
>>> + fallthrough;
>>
>> This would trigger an export and not a destroy, right?
>
> correct. but this would only happen for leftover secure pages touched
> by non-secure guests, before the background thread could clean them up.

I.e. we don't expect to need the destroy speed boost?

>
>>
>>> case USER_FAULT:
>>> mm = current->mm;
>>> mmap_read_lock(mm);
>>> @@ -824,7 +833,6 @@ void do_secure_storage_access(struct pt_regs *regs)
>>> if (rc)
>>> BUG();
>>> break;
>>> - case GMAP_FAULT:
>>> default:
>>> do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
>>> WARN_ON_ONCE(1);
>>>
>>
>

2021-10-26 15:49:31

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v5 06/14] KVM: s390: pv: properly handle page flags for protected guests

Am 20.09.21 um 15:24 schrieb Claudio Imbrenda:
> Introduce variants of the convert and destroy page functions that also
> clear the PG_arch_1 bit used to mark them as secure pages.
>
> The PG_arch_1 flag is always allowed to overindicate; using the new
> functions introduced here allows to reduce the extent of overindication
> and thus improve performance.
>
> These new functions can only be called on pages for which a reference
> is already being held.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Acked-by: Janosch Frank <[email protected]>
> Reviewed-by: Christian Borntraeger <[email protected]>

applied.