2021-07-28 14:28:48

by Claudio Imbrenda

Subject: [PATCH v2 00/13] KVM: s390: pv: implement lazy destroy

Previously, when a protected VM was rebooted or shut down, its memory
was first made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
given the overhead of the various Ultravisor Calls (UVCs). A reboot or
shutdown could therefore take a long time, depending on the amount of
memory used.

This patch series implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
the background, allowing the guest to restart or terminate
significantly faster than before.

There are 2 possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)

For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread went
through the whole address space, the protected VM is actually
destroyed.

For the shutdown case, a list of pages to be destroyed is formed when
the mm is torn down. Instead of just unmapping the pages when the
address space is being torn down, they are also set aside. Later when
KVM cleans up the VM, a thread is started to clean up the pages from
the list.
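The set-aside mechanism for the shutdown case can be sketched in plain
userspace C (all names and the batch size here are illustrative, not
the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of the shutdown path: instead of destroying each
 * page while the address space is torn down, page frame numbers are
 * queued in fixed-size batches and processed later by a cleanup thread.
 * BATCH is deliberately tiny here; the kernel sizes a batch to a page. */
#define BATCH 4

struct lazy_batch {
	unsigned short count;
	unsigned long pfns[BATCH];
};

/* Queue one pfn; returns 0 on success, -1 when a fresh batch is needed. */
static int lazy_queue(struct lazy_batch *b, unsigned long pfn)
{
	if (b->count >= BATCH)
		return -1;
	b->pfns[b->count++] = pfn;
	return 0;
}

/* Drain a batch as the cleanup thread would; returns pages "destroyed". */
static unsigned long lazy_drain(struct lazy_batch *b)
{
	unsigned long n = b->count;

	b->count = 0;
	return n;
}
```

The real code additionally takes page references while pages sit on the
list and issues the destroy UVC for each entry when draining.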

This means that the same address space can have memory belonging to
more than one protected guest; only one of them will be running, while
the others will not even have any CPUs.

When a guest is destroyed, its memory still counts towards its memory
control group until it is actually freed (I tested this experimentally).

When the system runs out of memory, if a guest has terminated and its
memory is being cleaned asynchronously, the OOM killer will wait a
little and then check whether memory has been freed. In practice this
slows down memory allocations when the system is out of memory, giving
the cleanup thread time to clean up and free memory, and avoiding an
actual OOM situation.

v1->v2
* rebased on a more recent kernel
* improved/expanded some patch descriptions
* improved/expanded some comments
* added patch 1, which prevents stall notifications when the system is
under heavy load.
* rename some members of struct deferred_priv to improve readability
* avoid a use-after-free bug of the struct mm in case of shutdown
* add missing return when lazy destroy is disabled
* add support for OOM notifier

Claudio Imbrenda (13):
KVM: s390: pv: avoid stall notifications for some UVCs
KVM: s390: pv: leak the ASCE page when destroy fails
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: lazy destroy for reboot
KVM: s390: pv: extend lazy destroy to handle shutdown
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: add OOM notifier for lazy destroy
KVM: s390: pv: add support for UV feature bits

arch/s390/include/asm/gmap.h | 5 +-
arch/s390/include/asm/mmu.h | 3 +
arch/s390/include/asm/mmu_context.h | 2 +
arch/s390/include/asm/pgtable.h | 16 +-
arch/s390/include/asm/uv.h | 26 ++-
arch/s390/kernel/uv.c | 144 ++++++++++++++-
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 272 ++++++++++++++++++++++++++--
arch/s390/mm/fault.c | 22 ++-
arch/s390/mm/gmap.c | 86 ++++++---
11 files changed, 530 insertions(+), 54 deletions(-)

--
2.31.1



2021-07-28 14:29:10

by Claudio Imbrenda

Subject: [PATCH v2 10/13] KVM: s390: pv: extend lazy destroy to handle shutdown

Extend the lazy destroy infrastructure to handle almost all kinds of
exit of the userspace process. The only case not handled is when the
process is killed by the OOM killer, in which case the old behaviour
still applies.

Add the uv_destroy_page_lazy function to set aside pages when unmapping
them during mm teardown; the pages will be processed and freed when the
protected VM is torn down.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/mmu.h | 3 ++
arch/s390/include/asm/mmu_context.h | 2 ++
arch/s390/include/asm/pgtable.h | 9 ++++--
arch/s390/include/asm/uv.h | 15 +++++++++
arch/s390/kernel/uv.c | 47 +++++++++++++++++++++++++++++
arch/s390/kvm/pv.c | 41 ++++++++++++++++++-------
6 files changed, 104 insertions(+), 13 deletions(-)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index e12ff0f29d1a..e2250dbe3d9d 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -18,6 +18,7 @@ typedef struct {
unsigned long vdso_base;
/* The mmu context belongs to a secure guest. */
atomic_t is_protected;
+ struct list_head deferred_list;
/*
* The following bitfields need a down_write on the mm
* semaphore when they are written to. As they are only
@@ -34,6 +35,8 @@ typedef struct {
unsigned int uses_cmm:1;
/* The gmaps associated with this context are allowed to use huge pages. */
unsigned int allow_gmap_hpage_1m:1;
+ /* The mmu context should be destroyed synchronously */
+ unsigned int pv_sync_destroy:1;
} mm_context_t;

#define INIT_MM_CONTEXT(name) \
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index c7937f369e62..5c598afe07a8 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -27,8 +27,10 @@ static inline int init_new_context(struct task_struct *tsk,
cpumask_clear(&mm->context.cpu_attach_mask);
atomic_set(&mm->context.flush_count, 0);
atomic_set(&mm->context.is_protected, 0);
+ mm->context.pv_sync_destroy = 0;
mm->context.gmap_asce = 0;
mm->context.flush_mm = 0;
+ INIT_LIST_HEAD(&mm->context.deferred_list);
#ifdef CONFIG_PGSTE
mm->context.alloc_pgste = page_table_allocate_pgste ||
test_thread_flag(TIF_PGSTE) ||
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0f1af2232ebe..c311503edc3b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1118,9 +1118,14 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+
/* At this point the reference through the mapping is still present */
- if (mm_is_protected(mm) && pte_present(res))
- uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ if (pte_present(res) && mm_is_protected(mm)) {
+ if (full && !mm->context.pv_sync_destroy)
+ uv_destroy_page_lazy(mm, pte_val(res) & PAGE_MASK);
+ else
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ }
return res;
}

diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 7722cde51912..eb39e530a9c6 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -345,6 +345,15 @@ static inline int uv_remove_shared(unsigned long addr) { return 0; }
#if IS_ENABLED(CONFIG_KVM)
extern int prot_virt_host;

+struct destroy_page_lazy {
+ struct list_head list;
+ unsigned short count;
+ unsigned long pfns[];
+};
+
+/* This guarantees that up to PV_MAX_LAZY_COUNT can fit in a page */
+#define PV_MAX_LAZY_COUNT ((PAGE_SIZE - sizeof(struct destroy_page_lazy)) / sizeof(long))
+
static inline int is_prot_virt_host(void)
{
return prot_virt_host;
@@ -353,6 +362,7 @@ static inline int is_prot_virt_host(void)
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
@@ -369,6 +379,11 @@ static inline int uv_destroy_owned_page(unsigned long paddr)
return 0;
}

+static inline int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+ return 0;
+}
+
static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 1e67a26184b6..f0af49b09a91 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -186,6 +186,53 @@ int uv_convert_owned_from_secure(unsigned long paddr)
return rc;
}

+/*
+ * Set aside the given page and put it in the list of pages to be cleared in
+ * background. The caller must already hold a reference to the page.
+ */
+int uv_destroy_page_lazy(struct mm_struct *mm, unsigned long paddr)
+{
+ struct list_head *head = &mm->context.deferred_list;
+ struct destroy_page_lazy *lazy;
+ struct page *page;
+ int rc;
+
+ /* get an extra reference here */
+ get_page(phys_to_page(paddr));
+
+ lazy = list_first_entry(head, struct destroy_page_lazy, list);
+ /*
+ * We need a fresh page to store more pointers. The current page
+ * might be shared, so it cannot be used directly. Instead, make it
+ * accessible and release it, and let the normal unmap code free it
+ * later, if needed.
+ * Afterwards, try to allocate a new page, but not very hard. If the
+ * allocation fails, we simply return. The next call to this
+ * function will attempt to do the same again, until enough pages
+ * have been freed.
+ */
+ if (list_empty(head) || lazy->count >= PV_MAX_LAZY_COUNT) {
+ rc = uv_convert_owned_from_secure(paddr);
+ /* in case of failure, we intentionally leak the page */
+ if (rc)
+ return rc;
+ /* release the extra reference */
+ put_page(phys_to_page(paddr));
+
+ /* try to allocate a new page quickly, but allow failures */
+ page = alloc_page(GFP_ATOMIC | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (!page)
+ return -ENOMEM;
+ lazy = page_to_virt(page);
+ lazy->count = 0;
+ list_add(&lazy->list, head);
+ return 0;
+ }
+ /* the array of pointers has space, just add this entry */
+ lazy->pfns[lazy->count++] = phys_to_pfn(paddr);
+ return 0;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 8c37850dbbcf..950053a4efeb 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -19,6 +19,7 @@

struct deferred_priv {
struct mm_struct *mm;
+ bool has_mm;
unsigned long old_table;
u64 handle;
void *stor_var;
@@ -246,19 +247,34 @@ static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)

static int kvm_s390_pv_destroy_vm_thread(void *priv)
{
+ struct destroy_page_lazy *lazy, *next;
struct deferred_priv *p = priv;
u16 rc, rrc;
int r;

- /* Clear all the pages as long as we are not the only users of the mm */
- s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
- /*
- * If we were the last user of the mm, synchronously free (and clear
- * if needed) all pages.
- * Otherwise simply decrease the reference counter; in this case we
- * have already cleared all pages.
- */
- mmput(p->mm);
+ list_for_each_entry_safe(lazy, next, &p->mm->context.deferred_list, list) {
+ list_del(&lazy->list);
+ s390_uv_destroy_pfns(lazy->count, lazy->pfns);
+ free_page(__pa(lazy));
+ }
+
+ if (p->has_mm) {
+ /* Clear all the pages as long as we are not the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ if (atomic_read(&p->mm->mm_users) == 1) {
+ mmap_write_lock(p->mm);
+ /* destroy synchronously if there are no other users */
+ p->mm->context.pv_sync_destroy = 1;
+ mmap_write_unlock(p->mm);
+ }
+ /*
+ * If we were the last user of the mm, synchronously free
+ * (and clear if needed) all pages.
+ * Otherwise simply decrease the reference counter; in this
+ * case we have already cleared all pages.
+ */
+ mmput(p->mm);
+ }

r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
@@ -290,7 +306,9 @@ static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc
priv->old_table = (unsigned long)kvm->arch.gmap->table;
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);

- if (kvm_s390_pv_replace_asce(kvm))
+ if (!priv->has_mm)
+ kvm_s390_pv_remove_old_asce(kvm);
+ else if (kvm_s390_pv_replace_asce(kvm))
goto fail;

t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
@@ -349,8 +367,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)

mmgrab(kvm->mm);
if (mmget_not_zero(kvm->mm)) {
+ priv->has_mm = true;
kvm_s390_clear_2g(kvm);
- } else {
+ } else if (list_empty(&kvm->mm->context.deferred_list)) {
/* No deferred work to do */
mmdrop(kvm->mm);
kfree(priv);
--
2.31.1
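The PV_MAX_LAZY_COUNT definition in the hunk above relies on the
flexible array member contributing nothing to sizeof. The arithmetic
can be checked with a small userspace sketch (assuming 4 KiB pages,
64-bit longs, and a two-pointer stand-in for struct list_head):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages, as on s390 */

/* Userspace stand-in for the kernel struct: two pointers model
 * struct list_head; the flexible array member adds nothing to sizeof. */
struct destroy_page_lazy {
	void *list_prev, *list_next;
	unsigned short count;
	unsigned long pfns[];
};

#define PV_MAX_LAZY_COUNT \
	((PAGE_SIZE - sizeof(struct destroy_page_lazy)) / sizeof(long))

/* Bytes needed for the header plus n pfn slots. */
static unsigned long lazy_bytes(unsigned long n)
{
	return sizeof(struct destroy_page_lazy) + n * sizeof(unsigned long);
}
```

With these assumptions a full batch of PV_MAX_LAZY_COUNT pfns exactly
fits in one page, and one more entry would overflow it.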


2021-07-28 14:29:47

by Claudio Imbrenda

Subject: [PATCH v2 07/13] KVM: s390: pv: usage counter instead of flag

Use the is_protected field as a counter instead of a flag. This will
be used in upcoming patches.

Increment the counter when a secure configuration is created, and
decrement it when it is destroyed. Previously the flag was set when the
set secure parameters UVC was performed.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/pv.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index ebb1c8d7bcb3..52e7d15a6dc7 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -223,7 +223,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
- atomic_set(&kvm->mm->context.is_protected, 0);
+ if (!cc)
+ atomic_dec(&kvm->mm->context.is_protected);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
/* Intended memory leak on "impossible" error */
@@ -264,11 +265,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
/* Outputs */
kvm->arch.pv.handle = uvcb.guest_handle;

+ atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
- if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
+ if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
- else
+ } else {
+ atomic_dec(&kvm->mm->context.is_protected);
kvm_s390_pv_dealloc_vm(kvm);
+ }
return -EIO;
}
kvm->arch.gmap->guest_handle = uvcb.guest_handle;
@@ -291,8 +295,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
*rc, *rrc);
- if (!cc)
- atomic_set(&kvm->mm->context.is_protected, 1);
return cc ? -EINVAL : 0;
}

--
2.31.1


2021-07-28 14:32:20

by Claudio Imbrenda

Subject: [PATCH v2 11/13] KVM: s390: pv: module parameter to fence lazy destroy

Add the "lazy_destroy" module parameter to allow the lazy destroy
mechanism to be switched off. This might be useful for debugging
purposes.

The parameter is enabled by default.

Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/pv.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 950053a4efeb..088b94512af3 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -26,6 +26,10 @@ struct deferred_priv {
unsigned long stor_base;
};

+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc = 0;
@@ -361,6 +365,9 @@ int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct deferred_priv *priv;

+ if (!lazy_destroy)
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
if (!priv)
return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
@@ -409,6 +416,12 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
/* Outputs */
kvm->arch.pv.handle = uvcb.guest_handle;

+ if (!lazy_destroy) {
+ mmap_write_lock(kvm->mm);
+ kvm->mm->context.pv_sync_destroy = 1;
+ mmap_write_unlock(kvm->mm);
+ }
+
atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
--
2.31.1


2021-07-28 14:32:38

by Claudio Imbrenda

Subject: [PATCH v2 05/13] KVM: s390: pv: handle secure storage exceptions for normal guests

With upcoming patches, normal guests might touch secure pages.

This patch extends the existing exception handler to convert the pages
to non-secure also when the exception is triggered by a normal guest.

This can happen, for example, when a secure guest reboots; the first
stage of a secure guest is non-secure, and in general a secure guest
can reboot into non-secure mode.

If the secure memory of the previous boot has not been cleared up
completely yet, a non-secure guest might touch secure memory, which
will need to be handled properly.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/mm/fault.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index eb68b4f36927..b89d625ea2ec 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
struct vm_area_struct *vma;
struct mm_struct *mm;
struct page *page;
+ struct gmap *gmap;
int rc;

/*
@@ -796,6 +797,16 @@ void do_secure_storage_access(struct pt_regs *regs)
}

switch (get_fault_type(regs)) {
+ case GMAP_FAULT:
+ gmap = (struct gmap *)S390_lowcore.gmap;
+ /*
+ * Very unlikely, but if it happens, simply try again.
+ * The next attempt will trigger a different exception.
+ */
+ addr = __gmap_translate(gmap, addr);
+ if (addr == -EFAULT)
+ break;
+ fallthrough;
case USER_FAULT:
mm = current->mm;
mmap_read_lock(mm);
@@ -824,7 +835,6 @@ void do_secure_storage_access(struct pt_regs *regs)
if (rc)
BUG();
break;
- case GMAP_FAULT:
default:
do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
WARN_ON_ONCE(1);
--
2.31.1
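The control flow of the new GMAP_FAULT case — translate, retry on
translation failure, otherwise fall through to the user-fault path —
can be sketched as a pure function (names and the error value are
stand-ins, not the kernel's):

```c
#include <assert.h>

#define EFAULT 14	/* stand-in for the kernel's EFAULT */

enum fault_action { FAULT_RETRY, FAULT_HANDLE_AS_USER };

/* Sketch of the switch above: a guest-mapping fault is first
 * translated to a host address; if translation fails the handler
 * simply returns (break) and lets the retried access raise a
 * different, already-handled exception, otherwise it falls
 * through to the normal USER_FAULT handling. */
static enum fault_action classify_secure_fault(int is_gmap_fault,
					       long translate_rc)
{
	if (is_gmap_fault && translate_rc == -EFAULT)
		return FAULT_RETRY;
	return FAULT_HANDLE_AS_USER;
}
```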


2021-07-28 14:32:38

by Claudio Imbrenda

Subject: [PATCH v2 12/13] KVM: s390: pv: add OOM notifier for lazy destroy

Add a per-VM OOM notifier for lazy destroy.

When a protected VM is undergoing deferred teardown, register an OOM
notifier. This allows an OOM situation to be handled by just waiting a
little.

The background deferred destroy process now keeps a running tally of
the number of pages it has freed. The asynchronous OOM notifier checks
this tally before and after waiting: it waits 20ms, and then reports
the number of pages freed by the deferred destroy mechanism during
that time.

If more than 1024 pages have already been freed in the current OOM
situation, the OOM notifier takes no action and performs no wait. This
avoids excessive waiting times once enough memory has been freed, in
case many VMs are being destroyed at the same time.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/pv.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 088b94512af3..390b57307f24 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -15,8 +15,12 @@
#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/oom.h>
#include "kvm-s390.h"

+#define KVM_S390_PV_LAZY_DESTROY_OOM_NOTIFY_PRIORITY 70
+
struct deferred_priv {
struct mm_struct *mm;
bool has_mm;
@@ -24,6 +28,8 @@ struct deferred_priv {
u64 handle;
void *stor_var;
unsigned long stor_base;
+ unsigned long n_pages_freed;
+ struct notifier_block oom_nb;
};

static int lazy_destroy = 1;
@@ -249,6 +255,24 @@ static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}

+static int kvm_s390_pv_oom_notify(struct notifier_block *nb,
+ unsigned long dummy, void *parm)
+{
+ unsigned long *freed = parm;
+ unsigned long free_before;
+ struct deferred_priv *p;
+
+ if (*freed > 1024)
+ return NOTIFY_OK;
+
+ p = container_of(nb, struct deferred_priv, oom_nb);
+ free_before = READ_ONCE(p->n_pages_freed);
+ msleep(20);
+ *freed += READ_ONCE(p->n_pages_freed) - free_before;
+
+ return NOTIFY_OK;
+}
+
static int kvm_s390_pv_destroy_vm_thread(void *priv)
{
struct destroy_page_lazy *lazy, *next;
@@ -256,12 +280,20 @@ static int kvm_s390_pv_destroy_vm_thread(void *priv)
u16 rc, rrc;
int r;

+ p->oom_nb.priority = KVM_S390_PV_LAZY_DESTROY_OOM_NOTIFY_PRIORITY;
+ p->oom_nb.notifier_call = kvm_s390_pv_oom_notify;
+ r = register_oom_notifier(&p->oom_nb);
+
list_for_each_entry_safe(lazy, next, &p->mm->context.deferred_list, list) {
list_del(&lazy->list);
s390_uv_destroy_pfns(lazy->count, lazy->pfns);
+ WRITE_ONCE(p->n_pages_freed, p->n_pages_freed + lazy->count + 1);
free_page(__pa(lazy));
}

+ if (!r)
+ unregister_oom_notifier(&p->oom_nb);
+
if (p->has_mm) {
/* Clear all the pages as long as we are not the only users of the mm */
s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
--
2.31.1
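The notifier's throttling policy — skip the wait once the current OOM
event has already been credited with enough freed pages, otherwise
sleep and credit the delta — can be sketched as pure functions
(threshold as in the hunk above, the sleep itself elided, names
hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Once this many pages have been credited to the current OOM event,
 * further notifier invocations stop waiting. */
#define OOM_FREED_ENOUGH 1024

/* Should this notifier invocation sleep and sample the counter? */
static bool oom_should_wait(unsigned long freed_so_far)
{
	return freed_so_far <= OOM_FREED_ENOUGH;
}

/* One notifier pass: if waiting is warranted, credit the pages the
 * background thread freed while we slept (after - before). */
static unsigned long oom_pass(unsigned long freed_so_far,
			      unsigned long before, unsigned long after)
{
	if (!oom_should_wait(freed_so_far))
		return freed_so_far;
	return freed_so_far + (after - before);
}
```

Sampling the tally before and after the sleep means the notifier only
reports progress actually made by the cleanup thread, never a guess.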


2021-07-28 14:33:01

by Claudio Imbrenda

Subject: [PATCH v2 13/13] KVM: s390: pv: add support for UV feature bits

Add support for Ultravisor feature bits, and take advantage of the
functionality advertised to speed up the lazy destroy mechanism.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kernel/uv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index f0af49b09a91..6ec3d7338ec8 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -290,7 +290,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,

static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
{
- return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ return !test_bit_inv(BIT_UV_FEAT_MISC, &uv_info.uv_feature_indications) &&
+ uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
atomic_read(&mm->context.is_protected) > 1;
}

--
2.31.1


2021-07-28 14:33:05

by Claudio Imbrenda

Subject: [PATCH v2 04/13] KVM: s390: pv: handle secure storage violations for protected guests

With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation.

This patch adds handling of secure storage violations for protected
guests.

Pages that trigger the exception will be made non-secure before
attempting to use them again.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/uv.h | 1 +
arch/s390/kernel/uv.c | 43 ++++++++++++++++++++++++++++++++++++++
arch/s390/mm/fault.c | 10 +++++++++
3 files changed, 54 insertions(+)

diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index bbd51aa94d05..7722cde51912 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -351,6 +351,7 @@ static inline int is_prot_virt_host(void)
}

int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 5a6ac965f379..9cfd7979648c 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -337,6 +337,49 @@ int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr)
}
EXPORT_SYMBOL_GPL(gmap_convert_to_secure);

+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
+{
+ struct vm_area_struct *vma;
+ unsigned long uaddr;
+ struct page *page;
+ int rc;
+
+ rc = -EFAULT;
+ mmap_read_lock(gmap->mm);
+
+ uaddr = __gmap_translate(gmap, gaddr);
+ if (IS_ERR_VALUE(uaddr))
+ goto out;
+ vma = find_vma(gmap->mm, uaddr);
+ if (!vma)
+ goto out;
+ /*
+ * Huge pages should not be able to become secure
+ */
+ if (is_vm_hugetlb_page(vma))
+ goto out;
+
+ rc = 0;
+ /* we take an extra reference here */
+ page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
+ if (IS_ERR_OR_NULL(page))
+ goto out;
+ rc = uv_destroy_owned_page(page_to_phys(page));
+ /*
+ * Fault handlers can race; it is possible that one CPU will destroy
+ * and import the page, at which point the second CPU handling the
+ * same fault will not be able to destroy. In that case we do not
+ * want to terminate the process, we instead try to export the page.
+ */
+ if (rc)
+ rc = uv_convert_owned_from_secure(page_to_phys(page));
+ put_page(page);
+out:
+ mmap_read_unlock(gmap->mm);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_destroy_page);
+
/*
* To be called with the page locked or with an extra reference! This will
* prevent gmap_make_secure from touching the page concurrently. Having 2
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index e33c43b38afe..eb68b4f36927 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -850,6 +850,16 @@ NOKPROBE_SYMBOL(do_non_secure_storage_access);

void do_secure_storage_violation(struct pt_regs *regs)
{
+ unsigned long gaddr = regs->int_parm_long & __FAIL_ADDR_MASK;
+ struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
+
+ /*
+ * If the VM has been rebooted, its address space might still contain
+ * secure pages from the previous boot.
+ * Clear the page so it can be reused.
+ */
+ if (!gmap_destroy_page(gmap, gaddr))
+ return;
/*
* Either KVM messed up the secure guest mapping or the same
* page is mapped into multiple secure guests.
--
2.31.1


2021-07-28 14:33:11

by Claudio Imbrenda

Subject: [PATCH v2 08/13] KVM: s390: pv: add export before import

Due to upcoming changes, it will be possible to temporarily have
multiple protected VMs in the same address space, although only one
will actually be active.

In that scenario, it is necessary to perform an export of every page
that is to be imported, since the hardware does not allow a page
belonging to a protected guest to be imported into a different
protected guest.

This also applies to pages that are shared, and thus accessible by the
host.

Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kernel/uv.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 9cfd7979648c..1e67a26184b6 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -241,6 +241,12 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
return rc;
}

+static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
+{
+ return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ atomic_read(&mm->context.is_protected) > 1;
+}
+
/*
* Requests the Ultravisor to make a page accessible to a guest.
* If it's brought in the first time, it will be cleared. If
@@ -284,6 +290,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)

lock_page(page);
ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+ if (should_export_before_import(uvcb, gmap->mm))
+ uv_convert_from_secure(page_to_phys(page));
rc = make_secure_pte(ptep, uaddr, page, uvcb);
pte_unmap_unlock(ptep, ptelock);
unlock_page(page);
--
2.31.1
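The predicate added above exports a page only when more than one secure
configuration shares the mm and the request is not an unpin of a shared
page. A userspace sketch of the same decision (the opcode values below
are placeholders, not the real UVC command codes):

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder opcodes standing in for the real UVC command codes. */
#define UVC_CMD_UNPIN_PAGE_SHARED 0x1
#define UVC_CMD_CONV_TO_SEC_STOR  0x2

/* Export first only if another protected configuration may still own
 * the page (counter > 1) and the request is not an unpin of a page
 * that is shared, and thus accessible, anyway. */
static bool should_export_before_import(unsigned int uvc_cmd,
					int is_protected_count)
{
	return uvc_cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
	       is_protected_count > 1;
}
```

With a single protected configuration (the common case) the export is
skipped entirely, so the fast path is unaffected.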


2021-07-28 14:33:12

by Claudio Imbrenda

Subject: [PATCH v2 06/13] KVM: s390: pv: refactor s390_reset_acc

Refactor s390_reset_acc so that its pieces can be reused in upcoming
patches, which will also need the users parameter of the new
s390_uv_destroy_range.

We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/gmap.h | 5 +-
arch/s390/kvm/pv.c | 12 ++++-
arch/s390/mm/gmap.c | 88 ++++++++++++++++++++++++------------
3 files changed, 73 insertions(+), 32 deletions(-)

diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 40264f60b0da..618ddc455867 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -147,5 +147,8 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
-void s390_reset_acc(struct mm_struct *mm);
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end);
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
+
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 1ecdc1769ed9..ebb1c8d7bcb3 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -12,6 +12,8 @@
#include <asm/gmap.h>
#include <asm/uv.h>
#include <asm/mman.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
#include "kvm-s390.h"

int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
@@ -209,8 +211,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;

- /* make all pages accessible before destroying the guest */
- s390_reset_acc(kvm->mm);
+ /*
+ * if the mm still has a mapping, make all its pages accessible
+ * before destroying the guest
+ */
+ if (mmget_not_zero(kvm->mm)) {
+ s390_uv_destroy_range(kvm->mm, 0, 0, TASK_SIZE);
+ mmput(kvm->mm);
+ }

cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index de679facc720..ad210a6e2c41 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2670,41 +2670,71 @@ void s390_reset_cmma(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(s390_reset_cmma);

-/*
- * make inaccessible pages accessible again
- */
-static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
+#define DESTROY_LOOP_THRESHOLD 32
+
+struct reset_walk_state {
+ unsigned long next;
+ unsigned long count;
+ unsigned long pfns[DESTROY_LOOP_THRESHOLD];
+};
+
+static int s390_gather_pages(pte_t *ptep, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
+ struct reset_walk_state *p = walk->private;
pte_t pte = READ_ONCE(*ptep);

- /* There is a reference through the mapping */
- if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
-
- return 0;
+ if (pte_present(pte)) {
+ /* we have a reference from the mapping, take an extra one */
+ get_page(phys_to_page(pte_val(pte)));
+ p->pfns[p->count] = phys_to_pfn(pte_val(pte));
+ p->next = next;
+ p->count++;
+ }
+ return p->count >= DESTROY_LOOP_THRESHOLD;
}

-static const struct mm_walk_ops reset_acc_walk_ops = {
- .pte_entry = __s390_reset_acc,
+static const struct mm_walk_ops gather_pages_ops = {
+ .pte_entry = s390_gather_pages,
};

-#include <linux/sched/mm.h>
-void s390_reset_acc(struct mm_struct *mm)
+/*
+ * Call the Destroy secure page UVC on each page in the given array of PFNs.
+ * Each page needs to have an extra reference, which will be released here.
+ */
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
{
- if (!mm_is_protected(mm))
- return;
- /*
- * we might be called during
- * reset: we walk the pages and clear
- * close of all kvm file descriptors: we walk the pages and clear
- * exit of process on fd closure: vma already gone, do nothing
- */
- if (!mmget_not_zero(mm))
- return;
- mmap_read_lock(mm);
- walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
- mmap_read_unlock(mm);
- mmput(mm);
+ unsigned long i;
+
+ for (i = 0; i < count; i++) {
+ /* we always have an extra reference */
+ uv_destroy_owned_page(pfn_to_phys(pfns[i]));
+ /* get rid of the extra reference */
+ put_page(pfn_to_page(pfns[i]));
+ cond_resched();
+ }
+}
+EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
+
+/*
+ * Walk the given range of the given address space, and call the destroy
+ * secure page UVC on each page.
+ * Exit early if the number of users of the mm drops to (or below) the given
+ * value.
+ */
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end)
+{
+ struct reset_walk_state state = { .next = start };
+ int r = 1;
+
+ while ((r > 0) && (atomic_read(&mm->mm_users) > users)) {
+ state.count = 0;
+ mmap_read_lock(mm);
+ r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+ mmap_read_unlock(mm);
+ cond_resched();
+ s390_uv_destroy_pfns(state.count, state.pfns);
+ }
}
-EXPORT_SYMBOL_GPL(s390_reset_acc);
+EXPORT_SYMBOL_GPL(s390_uv_destroy_range);
--
2.31.1

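The gather/destroy loop above batches the work: the page-table walk collects up to DESTROY_LOOP_THRESHOLD pages, stops, lets the batch be destroyed (with a rescheduling point), then resumes from where it left off. A minimal, self-contained userspace sketch of that batching pattern follows; the names mirror the patch, but none of this is kernel code and the per-batch UVC work is elided:

```c
#define DESTROY_LOOP_THRESHOLD 32

struct reset_walk_state {
	unsigned long next;   /* address/pfn at which to resume the walk */
	unsigned long count;  /* entries gathered in the current batch */
	unsigned long pfns[DESTROY_LOOP_THRESHOLD];
};

/* Models the pte_entry callback: a nonzero return stops the walk. */
static int gather_one(unsigned long pfn, unsigned long next,
		      struct reset_walk_state *p)
{
	p->pfns[p->count++] = pfn;
	p->next = next;
	return p->count >= DESTROY_LOOP_THRESHOLD;
}

/* Walk [start, end) in batches; returns the total number processed. */
unsigned long destroy_range(unsigned long start, unsigned long end)
{
	struct reset_walk_state state = { .next = start };
	unsigned long processed = 0;
	int more = 1;  /* nonzero while the last walk stopped on a full batch */

	while (more && state.next < end) {
		state.count = 0;
		more = 0;
		for (unsigned long pfn = state.next; pfn < end; pfn++) {
			more = gather_one(pfn, pfn + 1, &state);
			if (more)
				break;
		}
		/* the real code calls s390_uv_destroy_pfns() on the batch here */
		processed += state.count;
	}
	return processed;
}
```

This mirrors how s390_uv_destroy_range() drops mmap_lock and calls cond_resched() between batches, bounding both the lock hold time and the on-stack PFN array (32 entries).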

2021-07-28 14:33:13

by Claudio Imbrenda

Subject: [PATCH v2 02/13] KVM: s390: pv: leak the ASCE page when destroy fails

When a protected VM is created, the topmost level of page tables of its
ASCE is marked by the Ultravisor; any attempt to use that memory for
protected virtualization will result in failure.

Only a successful Destroy Configuration UVC will remove the marking.

When the Destroy Configuration UVC fails, the topmost level of page
tables of the VM does not get its marking cleared; to avoid issues it
must not be used again.

Since the page becomes unusable in practice, we set it aside and leak it.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/pv.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index e007df11a2fe..1ecdc1769ed9 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -155,6 +155,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
return -ENOMEM;
}

+/*
+ * Remove the topmost level of page tables from the list of page tables of
+ * the gmap.
+ * This means that it will not be freed when the VM is torn down, and needs
+ * to be handled separately by the caller, unless an intentional leak is
+ * desired.
+ */
+static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
+{
+ struct page *old;
+
+ old = virt_to_page(kvm->arch.gmap->table);
+ list_del(&old->lru);
+ /* in case the ASCE needs to be "removed" multiple times */
+ INIT_LIST_HEAD(&old->lru);
+}
+
+/*
+ * Try to replace the current ASCE with another equivalent one.
+ * If the allocation of the new top level page table fails, the ASCE is not
+ * replaced.
+ * In any case, the old ASCE is removed from the list, therefore the caller
+ * has to make sure to save a pointer to it beforehand, unless an
+ * intentional leak is desired.
+ */
+static int kvm_s390_pv_replace_asce(struct kvm *kvm)
+{
+ unsigned long asce;
+ struct page *page;
+ void *table;
+
+ kvm_s390_pv_remove_old_asce(kvm);
+
+ page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+ if (!page)
+ return -ENOMEM;
+ list_add(&page->lru, &kvm->arch.gmap->crst_list);
+
+ table = page_to_virt(page);
+ memcpy(table, kvm->arch.gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
+
+ asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
+ WRITE_ONCE(kvm->arch.gmap->asce, asce);
+ WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
+ WRITE_ONCE(kvm->arch.gmap->table, table);
+
+ return 0;
+}
+
/* this should not fail, but if it does, we must not free the donated memory */
int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
@@ -169,9 +218,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_set(&kvm->mm->context.is_protected, 0);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
- /* Inteded memory leak on "impossible" error */
+ /* Intended memory leak on "impossible" error */
if (!cc)
kvm_s390_pv_dealloc_vm(kvm);
+ else
+ kvm_s390_pv_replace_asce(kvm);
return cc ? -EIO : 0;
}

--
2.31.1

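The comment "in case the ASCE needs to be 'removed' multiple times" relies on a property of the kernel's circular doubly-linked lists: after deleting an entry, re-initializing it to point at itself turns any further deletion into a harmless self-unlink. Below is a small userspace model of that idea (not the kernel's list.h, just a sketch with the same invariants):

```c
struct list_head {
	struct list_head *next, *prev;
};

void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Delete and re-initialize, so that a second deletion is a no-op. */
void list_del_reinit(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

int list_empty(const struct list_head *h)
{
	return h->next == h;
}
```

Without the re-initialization, a second list_del() would dereference stale pointers into the list the entry used to belong to.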

2021-07-28 14:33:48

by Claudio Imbrenda

Subject: [PATCH v2 01/13] KVM: s390: pv: avoid stall notifications for some UVCs

Improve make_secure_pte to avoid stalls when the system is heavily
overcommitted. This was especially problematic in kvm_s390_pv_unpack,
because of the loop over all pages that needed unpacking.

Also fix kvm_s390_pv_init_vm to avoid stalls when the system is heavily
overcommitted.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kernel/uv.c | 11 ++++++++---
arch/s390/kvm/pv.c | 2 +-
2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index aeb0a15bcbb7..fd0faa51c1bb 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -196,11 +196,16 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
if (!page_ref_freeze(page, expected))
return -EBUSY;
set_bit(PG_arch_1, &page->flags);
- rc = uv_call(0, (u64)uvcb);
+ rc = __uv_call(0, (u64)uvcb);
page_ref_unfreeze(page, expected);
- /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
- if (rc)
+ /*
+ * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
+ * If busy or partially completed, return -EAGAIN.
+ */
+ if (rc == 1)
rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
+ else if (rc > 1)
+ rc = -EAGAIN;
return rc;
}

diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index c8841f476e91..e007df11a2fe 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -196,7 +196,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;

- cc = uv_call(0, (u64)&uvcb);
+ cc = uv_call_sched(0, (u64)&uvcb);
*rc = uvcb.header.rc;
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
--
2.31.1

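The new error handling distinguishes three outcomes of the UV call by condition code. That mapping can be summarized as a small translation function; the sketch below is a hypothetical userspace model (the 0x10a rc value comes from the patch, but the constant name and the function itself are invented for illustration):

```c
#include <errno.h>

#define UVC_RC_PAGE_NOT_MAPPED 0x10a  /* header rc checked in the patch */

/* Translate a UV condition code plus header rc into a negative errno. */
int uvc_cc_to_errno(int cc, unsigned short rc)
{
	if (cc == 0)
		return 0;                /* success */
	if (cc == 1)                     /* completed, but with an error rc */
		return rc == UVC_RC_PAGE_NOT_MAPPED ? -ENXIO : -EINVAL;
	return -EAGAIN;                  /* cc 2/3: busy or partially completed */
}
```

Mapping cc > 1 to -EAGAIN lets the caller retry instead of treating a busy Ultravisor as a hard error.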

2021-07-28 14:35:51

by Claudio Imbrenda

Subject: [PATCH v2 03/13] KVM: s390: pv: properly handle page flags for protected guests

Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark them as secure pages.

These new functions can only be called on pages for which a reference
is already being held.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/pgtable.h | 9 ++++++---
arch/s390/include/asm/uv.h | 10 ++++++++--
arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
arch/s390/mm/gmap.c | 4 +++-
4 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dcac7b2df72c..0f1af2232ebe 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
pte_t res;

res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}

@@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
pte_t res;

res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(vma->vm_mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}

@@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}

diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 12c5f006c136..bbd51aa94d05 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -351,8 +351,9 @@ static inline int is_prot_virt_host(void)
}

int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
-int uv_destroy_page(unsigned long paddr);
+int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
+int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);

void setup_uv(void);
@@ -362,7 +363,7 @@ void adjust_to_uv_max(unsigned long *vmax);
static inline void setup_uv(void) {}
static inline void adjust_to_uv_max(unsigned long *vmax) {}

-static inline int uv_destroy_page(unsigned long paddr)
+static inline int uv_destroy_owned_page(unsigned long paddr)
{
return 0;
}
@@ -371,6 +372,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
}
+
+static inline int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ return 0;
+}
#endif

#if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index fd0faa51c1bb..5a6ac965f379 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
*
* @paddr: Absolute host address of page to be destroyed
*/
-int uv_destroy_page(unsigned long paddr)
+static int uv_destroy_page(unsigned long paddr)
{
struct uv_cb_cfs uvcb = {
.header.cmd = UVC_CMD_DESTR_SEC_STOR,
@@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
return 0;
}

+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_destroy_owned_page(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_destroy_page(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Requests the Ultravisor to encrypt a guest page and make it
* accessible to the host for paging (export).
@@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
return 0;
}

+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_convert_from_secure(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..de679facc720 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
{
pte_t pte = READ_ONCE(*ptep);

+ /* There is a reference through the mapping */
if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
+ WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
+
return 0;
}

--
2.31.1

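The new *_owned_* helpers wrap the UV operation in a get_page()/put_page() pair and clear PG_arch_1 only on success. The shape of that pattern, modeled in userspace (struct fake_page and the op_rc parameter are invented stand-ins for struct page and the UVC result):

```c
struct fake_page {
	int refcount;
	int secure;  /* models the PG_arch_1 "secure" bit */
};

static void get_page(struct fake_page *p) { p->refcount++; }
static void put_page(struct fake_page *p) { p->refcount--; }

/*
 * op_rc stands in for the result of uv_convert_from_secure() or
 * uv_destroy_page(): the secure flag is cleared only when the operation
 * succeeded, and the extra reference pins the page across the operation.
 */
int convert_owned_from_secure(struct fake_page *p, int op_rc)
{
	get_page(p);
	if (!op_rc)
		p->secure = 0;
	put_page(p);
	return op_rc;
}
```

On failure the page stays marked secure and the reference count is restored, matching the caller-holds-a-reference contract stated in the patch.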

2021-07-28 15:46:38

by Claudio Imbrenda

Subject: [PATCH v2 09/13] KVM: s390: pv: lazy destroy for reboot

Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clean up the address
space from protected pages.

This patch implements a lazy destroy mechanism that allows a protected
guest to reboot significantly faster than before.

This is achieved by clearing the pages of the old guest in the background.
In case of reboot, the new guest will be able to run in the same
address space almost immediately.

The old protected guest is then destroyed only when all of its memory has
been destroyed or otherwise made non-protected.

Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 129 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 131 insertions(+), 6 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index b655a7d82bf0..238297a7bb46 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2281,7 +2281,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)

r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
if (r)
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);

/* we need to block service interrupts from now on */
set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2300,7 +2300,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
*/
if (r)
break;
- r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
+ r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);

/* no need to block service interrupts any more */
clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2829,7 +2829,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
* complaining we do not use kvm_s390_pv_is_protected.
*/
if (kvm_s390_pv_get_handle(kvm))
- kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
debug_unregister(kvm->arch.dbf);
free_page((unsigned long)kvm->arch.sie_page2);
if (!kvm_is_ucontrol(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..d2380f5e7e1f 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 52e7d15a6dc7..8c37850dbbcf 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -14,8 +14,17 @@
#include <asm/mman.h>
#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
+#include <linux/kthread.h>
#include "kvm-s390.h"

+struct deferred_priv {
+ struct mm_struct *mm;
+ unsigned long old_table;
+ u64 handle;
+ void *stor_var;
+ unsigned long stor_base;
+};
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc = 0;
@@ -207,7 +216,7 @@ static int kvm_s390_pv_replace_asce(struct kvm *kvm)
}

/* this should not fail, but if it does, we must not free the donated memory */
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
+static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;

@@ -235,6 +244,122 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}

+static int kvm_s390_pv_destroy_vm_thread(void *priv)
+{
+ struct deferred_priv *p = priv;
+ u16 rc, rrc;
+ int r;
+
+ /* Clear all the pages as long as we are not the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ /*
+ * If we were the last user of the mm, synchronously free (and clear
+ * if needed) all pages.
+ * Otherwise simply decrease the reference counter; in this case we
+ * have already cleared all pages.
+ */
+ mmput(p->mm);
+
+ r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+ WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+ if (r) {
+ mmdrop(p->mm);
+ return r;
+ }
+ atomic_dec(&p->mm->context.is_protected);
+ mmdrop(p->mm);
+
+ /*
+ * Intentional leak in case the destroy secure VM call fails. The
+ * call should never fail if the hardware is not broken.
+ */
+ free_pages(p->stor_base, get_order(uv_info.guest_base_stor_len));
+ free_pages(p->old_table, CRST_ALLOC_ORDER);
+ vfree(p->stor_var);
+ kfree(p);
+ return 0;
+}
+
+static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
+{
+ struct task_struct *t;
+
+ priv->stor_var = kvm->arch.pv.stor_var;
+ priv->stor_base = kvm->arch.pv.stor_base;
+ priv->handle = kvm_s390_pv_get_handle(kvm);
+ priv->old_table = (unsigned long)kvm->arch.gmap->table;
+ WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+
+ if (kvm_s390_pv_replace_asce(kvm))
+ goto fail;
+
+ t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
+ "kvm_s390_pv_destroy_vm_thread");
+ if (IS_ERR_OR_NULL(t))
+ goto fail;
+
+ memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
+ wake_up_process(t);
+ /*
+ * no actual UVC is performed at this point, just return a successful
+ * rc value to make userspace happy, and an arbitrary rrc
+ */
+ *rc = 1;
+ *rrc = 42;
+
+ return 0;
+
+fail:
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+}
+
+/* Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ unsigned long lim;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ slots = kvm_memslots(kvm);
+ kvm_for_each_memslot(slot, slots) {
+ if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
+ continue;
+ if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
+ lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+ else
+ lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *priv;
+
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+ if (!priv)
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
+ mmgrab(kvm->mm);
+ if (mmget_not_zero(kvm->mm)) {
+ kvm_s390_clear_2g(kvm);
+ } else {
+ /* No deferred work to do */
+ mmdrop(kvm->mm);
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+ }
+ priv->mm = kvm->mm;
+ return deferred_destroy(kvm, priv, rc, rrc);
+}
+
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct uv_cb_cgc uvcb = {
@@ -268,7 +393,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
} else {
atomic_dec(&kvm->mm->context.is_protected);
kvm_s390_pv_dealloc_vm(kvm);
--
2.31.1

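Stripped of the kernel machinery (kthread_create(), mmgrab()/mmput(), the UVC itself), the deferred flow is: queue the expensive teardown, return success to userspace immediately, and let a background worker drain the queue later. A single-threaded userspace model of that control flow; everything below is invented for illustration:

```c
#define MAX_DEFERRED 8

static int *pending[MAX_DEFERRED];  /* completion flags of queued teardowns */
static int npending;

/* Fast path: queue the work and return at once (rc/rrc would be faked). */
int defer_destroy(int *done)
{
	if (npending >= MAX_DEFERRED)
		return -1;  /* would fall back to synchronous destroy */
	pending[npending++] = done;
	return 0;
}

/* Models the background thread: perform each queued teardown. */
void run_deferred(void)
{
	while (npending > 0)
		*pending[--npending] = 1;
}
```

This is also why the patch fakes *rc = 1 and an arbitrary *rrc on the fast path: no UVC has actually been issued yet when the ioctl returns.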

2021-07-29 09:54:03

by Janosch Frank

Subject: Re: [PATCH v2 13/13] KVM: s390: pv: add support for UV feature bits

On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> Add support for Ultravisor feature bits, and take advantage of the
> functionality advertised to speed up the lazy destroy mechanism.

UV feature bit support is already merged; please fix the description and
subject.

>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kernel/uv.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index f0af49b09a91..6ec3d7338ec8 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -290,7 +290,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
>
> static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
> {
> - return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
> + return !test_bit_inv(BIT_UV_FEAT_MISC, &uv_info.uv_feature_indications) &&
> + uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
> atomic_read(&mm->context.is_protected) > 1;
> }
>
>


2021-07-29 10:01:16

by Janosch Frank

Subject: Re: [PATCH v2 01/13] KVM: s390: pv: avoid stall notifications for some UVCs

On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> Improve make_secure_pte to avoid stalls when the system is heavily
> overcommitted. This was especially problematic in kvm_s390_pv_unpack,
> because of the loop over all pages that needed unpacking.
>
> Also fix kvm_s390_pv_init_vm to avoid stalls when the system is heavily
> overcommitted.

Fixes tag?

>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kernel/uv.c | 11 ++++++++---
> arch/s390/kvm/pv.c | 2 +-
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index aeb0a15bcbb7..fd0faa51c1bb 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -196,11 +196,16 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> if (!page_ref_freeze(page, expected))
> return -EBUSY;
> set_bit(PG_arch_1, &page->flags);
> - rc = uv_call(0, (u64)uvcb);
> + rc = __uv_call(0, (u64)uvcb);

We should exchange rc with cc since that's what we get back from
__uv_call(). Technically we always get a cc but for the other functions
it's only ever 0/1, which translates to success/error, so rc is ok.

> page_ref_unfreeze(page, expected);
> - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
> - if (rc)
> + /*
> + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
> + * If busy or partially completed, return -EAGAIN.
> + */
> + if (rc == 1)
> rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> + else if (rc > 1)
> + rc = -EAGAIN;
> return rc;

Could you define the CCs in uv.h and check against the constants here so
it's easier to understand that the rc > 1 checks against a "UV was busy
please re-issue the call again" cc?

Maybe also make it explicit for cc 2 and 3 instead of cc > 1

> }
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index c8841f476e91..e007df11a2fe 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -196,7 +196,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
> uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;
>
> - cc = uv_call(0, (u64)&uvcb);
> + cc = uv_call_sched(0, (u64)&uvcb);
> *rc = uvcb.header.rc;
> *rrc = uvcb.header.rrc;
> KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
>


2021-07-29 10:43:42

by Janosch Frank

Subject: Re: [PATCH v2 02/13] KVM: s390: pv: leak the ASCE page when destroy fails

On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> When a protected VM is created, the topmost level of page tables of its
> ASCE is marked by the Ultravisor; any attempt to use that memory for
> protected virtualization will result in failure.
>
> Only a successful Destroy Configuration UVC will remove the marking.
>
> When the Destroy Configuration UVC fails, the topmost level of page
> tables of the VM does not get its marking cleared; to avoid issues it
> must not be used again.
>
> Since the page becomes in practice unusable, we set it aside and leak it.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kvm/pv.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 52 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index e007df11a2fe..1ecdc1769ed9 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -155,6 +155,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
> return -ENOMEM;
> }
>
> +/*
> + * Remove the topmost level of page tables from the list of page tables of
> + * the gmap.
> + * This means that it will not be freed when the VM is torn down, and needs
> + * to be handled separately by the caller, unless an intentional leak is
> + * intended.
> + */
> +static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
> +{
> + struct page *old;
> +
> + old = virt_to_page(kvm->arch.gmap->table);
> + list_del(&old->lru);
> + /* in case the ASCE needs to be "removed" multiple times */
> + INIT_LIST_HEAD(&old->lru);
> +}
> +
> +/*
> + * Try to replace the current ASCE with another equivalent one.
> + * If the allocation of the new top level page table fails, the ASCE is not
> + * replaced.
> + * In any case, the old ASCE is removed from the list, therefore the caller
> + * has to make sure to save a pointer to it beforehands, unless an
> + * intentional leak is intended.
> + */
> +static int kvm_s390_pv_replace_asce(struct kvm *kvm)
> +{
> + unsigned long asce;
> + struct page *page;
> + void *table;
> +
> + kvm_s390_pv_remove_old_asce(kvm);
> +
> + page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
> + if (!page)
> + return -ENOMEM;
> + list_add(&page->lru, &kvm->arch.gmap->crst_list);
> +
> + table = page_to_virt(page);
> + memcpy(table, kvm->arch.gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));

Don't we want to memcpy first and then add it to the list?
The gmap is still active per se, so I think we want to take the
guest_table_lock for the list_add here.

> +
> + asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
> + WRITE_ONCE(kvm->arch.gmap->asce, asce);
> + WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
> + WRITE_ONCE(kvm->arch.gmap->table, table);

If I remember correctly those won't need locks but I'm not 100% sure so
please have a look at that.

> +
> + return 0;
> +}

That should both be in gmap.c

> +
> /* this should not fail, but if it does, we must not free the donated memory */
> int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> {
> @@ -169,9 +218,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> atomic_set(&kvm->mm->context.is_protected, 0);
> KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> - /* Inteded memory leak on "impossible" error */
> + /* Intended memory leak on "impossible" error */
> if (!cc)
> kvm_s390_pv_dealloc_vm(kvm);
> + else
> + kvm_s390_pv_replace_asce(kvm);
> return cc ? -EIO : 0;
> }
>
>


2021-07-29 10:50:57

by Cornelia Huck

Subject: Re: [PATCH v2 01/13] KVM: s390: pv: avoid stall notifications for some UVCs

On Wed, Jul 28 2021, Claudio Imbrenda <[email protected]> wrote:

> Improve make_secure_pte to avoid stalls when the system is heavily
> overcommitted. This was especially problematic in kvm_s390_pv_unpack,
> because of the loop over all pages that needed unpacking.
>
> Also fix kvm_s390_pv_init_vm to avoid stalls when the system is heavily
> overcommitted.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kernel/uv.c | 11 ++++++++---
> arch/s390/kvm/pv.c | 2 +-
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index aeb0a15bcbb7..fd0faa51c1bb 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -196,11 +196,16 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> if (!page_ref_freeze(page, expected))
> return -EBUSY;
> set_bit(PG_arch_1, &page->flags);
> - rc = uv_call(0, (u64)uvcb);
> + rc = __uv_call(0, (u64)uvcb);
> page_ref_unfreeze(page, expected);
> - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
> - if (rc)
> + /*
> + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
> + * If busy or partially completed, return -EAGAIN.
> + */
> + if (rc == 1)
> rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> + else if (rc > 1)
> + rc = -EAGAIN;
> return rc;
> }

Possibly dumb question: when does the call return > 1?
gmap_make_secure() will do a wait_on_page_writeback() for -EAGAIN; is
that always the right thing to do?


2021-07-29 11:46:27

by Janosch Frank

Subject: Re: [PATCH v2 03/13] KVM: s390: pv: properly handle page flags for protected guests

On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> Introduce variants of the convert and destroy page functions that also
> clear the PG_arch_1 bit used to mark them as secure pages.
>
> These new functions can only be called on pages for which a reference
> is already being held.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>

Acked-by: Janosch Frank <[email protected]>

> ---
> arch/s390/include/asm/pgtable.h | 9 ++++++---
> arch/s390/include/asm/uv.h | 10 ++++++++--
> arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
> arch/s390/mm/gmap.c | 4 +++-
> 4 files changed, 50 insertions(+), 7 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index dcac7b2df72c..0f1af2232ebe 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> pte_t res;
>
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> @@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
> pte_t res;
>
> res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(vma->vm_mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> @@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> } else {
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> }
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index 12c5f006c136..bbd51aa94d05 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -351,8 +351,9 @@ static inline int is_prot_virt_host(void)
> }
>
> int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
> -int uv_destroy_page(unsigned long paddr);
> +int uv_destroy_owned_page(unsigned long paddr);
> int uv_convert_from_secure(unsigned long paddr);
> +int uv_convert_owned_from_secure(unsigned long paddr);
> int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
>
> void setup_uv(void);
> @@ -362,7 +363,7 @@ void adjust_to_uv_max(unsigned long *vmax);
> static inline void setup_uv(void) {}
> static inline void adjust_to_uv_max(unsigned long *vmax) {}
>
> -static inline int uv_destroy_page(unsigned long paddr)
> +static inline int uv_destroy_owned_page(unsigned long paddr)
> {
> return 0;
> }
> @@ -371,6 +372,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
> {
> return 0;
> }
> +
> +static inline int uv_convert_owned_from_secure(unsigned long paddr)
> +{
> + return 0;
> +}
> #endif
>
> #if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index fd0faa51c1bb..5a6ac965f379 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
> *
> * @paddr: Absolute host address of page to be destroyed
> */
> -int uv_destroy_page(unsigned long paddr)
> +static int uv_destroy_page(unsigned long paddr)
> {
> struct uv_cb_cfs uvcb = {
> .header.cmd = UVC_CMD_DESTR_SEC_STOR,
> @@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
> return 0;
> }
>
> +/*
> + * The caller must already hold a reference to the page
> + */
> +int uv_destroy_owned_page(unsigned long paddr)
> +{
> + struct page *page = phys_to_page(paddr);
> + int rc;
> +
> + get_page(page);
> + rc = uv_destroy_page(paddr);
> + if (!rc)
> + clear_bit(PG_arch_1, &page->flags);
> + put_page(page);
> + return rc;
> +}
> +
> /*
> * Requests the Ultravisor to encrypt a guest page and make it
> * accessible to the host for paging (export).
> @@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
> return 0;
> }
>
> +/*
> + * The caller must already hold a reference to the page
> + */
> +int uv_convert_owned_from_secure(unsigned long paddr)
> +{
> + struct page *page = phys_to_page(paddr);
> + int rc;
> +
> + get_page(page);
> + rc = uv_convert_from_secure(paddr);
> + if (!rc)
> + clear_bit(PG_arch_1, &page->flags);
> + put_page(page);
> + return rc;
> +}
> +
> /*
> * Calculate the expected ref_count for a page that would otherwise have no
> * further pins. This was cribbed from similar functions in other places in
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 9bb2c7512cd5..de679facc720 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
> {
> pte_t pte = READ_ONCE(*ptep);
>
> + /* There is a reference through the mapping */
> if (pte_present(pte))
> - WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
> + WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
> +
> return 0;
> }
>
>


2021-07-29 12:18:24

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v2 05/13] KVM: s390: pv: handle secure storage exceptions for normal guests

On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> With upcoming patches, normal guests might touch secure pages.
>
> This patch extends the existing exception handler to convert the pages
> to non secure also when the exception is triggered by a normal guest.
>
> This can happen for example when a secure guest reboots; the first
> stage of a secure guest is non secure, and in general a secure guest
> can reboot into non-secure mode.
>
> If the secure memory of the previous boot has not been cleared up
> completely yet, a non-secure guest might touch secure memory, which
> will need to be handled properly.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/mm/fault.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index eb68b4f36927..b89d625ea2ec 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
> struct vm_area_struct *vma;
> struct mm_struct *mm;
> struct page *page;
> + struct gmap *gmap;
> int rc;
>
> /*
> @@ -796,6 +797,16 @@ void do_secure_storage_access(struct pt_regs *regs)
> }
>
> switch (get_fault_type(regs)) {
> + case GMAP_FAULT:
> + gmap = (struct gmap *)S390_lowcore.gmap;
> + /*
> + * Very unlikely, but if it happens, simply try again.
> + * The next attempt will trigger a different exception.
> + */

If we keep this the way it currently is then the comment needs to go to
the EFAULT check since it makes no sense above the gmap_translate().

> + addr = __gmap_translate(gmap, addr);

So we had a valid gmap PTE to end up here where the guest touched a
secure page and triggered the exception. But we suddenly can't translate
the gaddr to a vmaddr because the gmap tracking doesn't have an entry
for the address.

My first instinct is to SIGSEGV the process since I can't come up with a
way out of this situation except for the process to map this back in.
The only reasons I can think of for it being removed from the mapping
are malicious intent or a bug.

I think this needs a VM_FAULT_BADMAP and a do_fault_error() call.

> + if (addr == -EFAULT)
> + break;
> + fallthrough;
> case USER_FAULT:
> mm = current->mm;
> mmap_read_lock(mm);
> @@ -824,7 +835,6 @@ void do_secure_storage_access(struct pt_regs *regs)
> if (rc)
> BUG();
> break;
> - case GMAP_FAULT:
> default:
> do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
> WARN_ON_ONCE(1);
>


2021-07-29 13:31:21

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 05/13] KVM: s390: pv: handle secure storage exceptions for normal guests

On Thu, 29 Jul 2021 14:17:11 +0200
Janosch Frank <[email protected]> wrote:

> On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> > With upcoming patches, normal guests might touch secure pages.
> >
> > This patch extends the existing exception handler to convert the
> > pages to non secure also when the exception is triggered by a
> > normal guest.
> >
> > This can happen for example when a secure guest reboots; the first
> > stage of a secure guest is non secure, and in general a secure guest
> > can reboot into non-secure mode.
> >
> > If the secure memory of the previous boot has not been cleared up
> > completely yet, a non-secure guest might touch secure memory, which
> > will need to be handled properly.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/mm/fault.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> > index eb68b4f36927..b89d625ea2ec 100644
> > --- a/arch/s390/mm/fault.c
> > +++ b/arch/s390/mm/fault.c
> > @@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
> > struct vm_area_struct *vma;
> > struct mm_struct *mm;
> > struct page *page;
> > + struct gmap *gmap;
> > int rc;
> >
> > /*
> > @@ -796,6 +797,16 @@ void do_secure_storage_access(struct pt_regs *regs)
> > }
> >
> > switch (get_fault_type(regs)) {
> > + case GMAP_FAULT:
> > + gmap = (struct gmap *)S390_lowcore.gmap;
> > + /*
> > + * Very unlikely, but if it happens, simply try again.
> > + * The next attempt will trigger a different exception.
> > + */
>
> If we keep this the way it currently is then the comment needs to go
> to the EFAULT check since it makes no sense above the
> gmap_translate().
>
> > + addr = __gmap_translate(gmap, addr);
>
> So we had a valid gmap PTE to end up here where the guest touched a
> secure page and triggered the exception. But we suddenly can't
> translate the gaddr to a vmaddr because the gmap tracking doesn't
> have an entry for the address.
>
> My first instinct is to SIGSEGV the process since I can't come up
> with a way out of this situation except for the process to map this
> back in. The only reasons I can think of for it being removed from the
> mapping are malicious intent or a bug.
>
> I think this needs a VM_FAULT_BADMAP and a do_fault_error() call.

fair enough, the next version will have that

> > + if (addr == -EFAULT)
> > + break;
> > + fallthrough;
> > case USER_FAULT:
> > mm = current->mm;
> > mmap_read_lock(mm);
> > @@ -824,7 +835,6 @@ void do_secure_storage_access(struct pt_regs *regs)
> > if (rc)
> > BUG();
> > break;
> > - case GMAP_FAULT:
> > default:
> > do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
> > WARN_ON_ONCE(1);
> >
>


2021-07-29 13:32:01

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] KVM: s390: pv: avoid stall notifications for some UVCs

On Thu, 29 Jul 2021 12:49:03 +0200
Cornelia Huck <[email protected]> wrote:

> On Wed, Jul 28 2021, Claudio Imbrenda <[email protected]> wrote:
>
> > Improve make_secure_pte to avoid stalls when the system is heavily
> > overcommitted. This was especially problematic in
> > kvm_s390_pv_unpack, because of the loop over all pages that needed
> > unpacking.
> >
> > Also fix kvm_s390_pv_init_vm to avoid stalls when the system is
> > heavily overcommitted.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/kernel/uv.c | 11 ++++++++---
> > arch/s390/kvm/pv.c | 2 +-
> > 2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index aeb0a15bcbb7..fd0faa51c1bb 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -196,11 +196,16 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > if (!page_ref_freeze(page, expected))
> > return -EBUSY;
> > set_bit(PG_arch_1, &page->flags);
> > - rc = uv_call(0, (u64)uvcb);
> > + rc = __uv_call(0, (u64)uvcb);
> > page_ref_unfreeze(page, expected);
> > - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
> > - if (rc)
> > + /*
> > + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
> > + * If busy or partially completed, return -EAGAIN.
> > + */
> > + if (rc == 1)
> > rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> > + else if (rc > 1)
> > + rc = -EAGAIN;
> > return rc;
> > }
>
> Possibly dumb question: when does the call return > 1?

this is exactly what Janosch meant :)

the next version will have #defines for the 4 possible CC values.

in short:
0 OK
1 error
2 busy (nothing done, try again)
3 partial (something done but not all, try again)

> gmap_make_secure() will do a wait_on_page_writeback() for -EAGAIN, is
> that always the right thing to do?

it's the easiest way to get to a place where we will be able to
reschedule if needed.

wait_on_page_writeback will probably do nothing in that case because
the page is not in writeback.

(a few minutes later)

actually I have checked, and it seems that the -EAGAIN eventually gets
propagated to places where it's not checked properly!

this will need some more fixing


2021-07-29 13:32:30

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 13/13] KVM: s390: pv: add support for UV feature bits

On Thu, 29 Jul 2021 11:52:58 +0200
Janosch Frank <[email protected]> wrote:

> On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> > Add support for Ultravisor feature bits, and take advantage of the
> > functionality advertised to speed up the lazy destroy mechanism.
>
> UV feature bit support is already merged; please fix the description
> and subject.

oops, will be fixed

> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/kernel/uv.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index f0af49b09a91..6ec3d7338ec8 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -290,7 +290,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > static bool should_export_before_import(struct uv_cb_header *uvcb,
> > struct mm_struct *mm)
> > {
> > - return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
> > + return !test_bit_inv(BIT_UV_FEAT_MISC, &uv_info.uv_feature_indications) &&
> > + uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
> > atomic_read(&mm->context.is_protected) > 1;
> > }
> >
> >
>


2021-07-29 13:33:11

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 02/13] KVM: s390: pv: leak the ASCE page when destroy fails

On Thu, 29 Jul 2021 12:41:05 +0200
Janosch Frank <[email protected]> wrote:

> On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> > When a protected VM is created, the topmost level of page tables of
> > its ASCE is marked by the Ultravisor; any attempt to use that
> > memory for protected virtualization will result in failure.
> >
> > Only a successful Destroy Configuration UVC will remove the marking.
> >
> > When the Destroy Configuration UVC fails, the topmost level of page
> > tables of the VM does not get its marking cleared; to avoid issues
> > it must not be used again.
> >
> > Since the page becomes in practice unusable, we set it aside and
> > leak it.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/kvm/pv.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 52 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> > index e007df11a2fe..1ecdc1769ed9 100644
> > --- a/arch/s390/kvm/pv.c
> > +++ b/arch/s390/kvm/pv.c
> > @@ -155,6 +155,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
> > return -ENOMEM;
> > }
> >
> > +/*
> > + * Remove the topmost level of page tables from the list of page
> > + * tables of the gmap. This means that it will not be freed when
> > + * the VM is torn down, and needs to be handled separately by the
> > + * caller, unless the leak is intentional.
> > + */
> > +static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
> > +{
> > + struct page *old;
> > +
> > + old = virt_to_page(kvm->arch.gmap->table);
> > + list_del(&old->lru);
> > + /* in case the ASCE needs to be "removed" multiple times */
> > + INIT_LIST_HEAD(&old->lru);
> > +}
> > +
> > +/*
> > + * Try to replace the current ASCE with another equivalent one.
> > + * If the allocation of the new top-level page table fails, the
> > + * ASCE is not replaced.
> > + * In any case, the old ASCE is removed from the list, therefore
> > + * the caller has to make sure to save a pointer to it beforehand,
> > + * unless the leak is intentional.
> > + */
> > +static int kvm_s390_pv_replace_asce(struct kvm *kvm)
> > +{
> > + unsigned long asce;
> > + struct page *page;
> > + void *table;
> > +
> > + kvm_s390_pv_remove_old_asce(kvm);
> > +
> > + page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
> > + if (!page)
> > + return -ENOMEM;
> > + list_add(&page->lru, &kvm->arch.gmap->crst_list);
> > +
> > + table = page_to_virt(page);
> > + memcpy(table, kvm->arch.gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
>
> Don't we want to memcpy first and then add it to the list?
> The gmap is still active per-se so I think we want to take the
> guest_table_lock for the list_add here.

it doesn't really make a difference; the table is not actually used
until a few lines later

also, the list is only ever touched here, during guest creation and
destruction; IIRC in all those cases we hold kvm->lock

> > +
> > + asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
> > + WRITE_ONCE(kvm->arch.gmap->asce, asce);
> > + WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
> > + WRITE_ONCE(kvm->arch.gmap->table, table);
>
> If I remember correctly those won't need locks but I'm not 100% sure
> so please have a look at that.

it should not need locks, the VM is in use, so it can't disappear under
our feet.

> > +
> > + return 0;
> > +}
>
> That should both be in gmap.c

why?

> > +
> > /* this should not fail, but if it does, we must not free the donated memory */
> > int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> > {
> > @@ -169,9 +218,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> > atomic_set(&kvm->mm->context.is_protected, 0);
> > KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> > WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> > - /* Inteded memory leak on "impossible" error */
> > + /* Intended memory leak on "impossible" error */
> > if (!cc)
> > kvm_s390_pv_dealloc_vm(kvm);
> > + else
> > + kvm_s390_pv_replace_asce(kvm);
> > return cc ? -EIO : 0;
> > }
> >
> >
>


2021-07-29 13:34:13

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] KVM: s390: pv: avoid stall notifications for some UVCs

On Thu, 29 Jul 2021 11:58:39 +0200
Janosch Frank <[email protected]> wrote:

> On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
> > Improve make_secure_pte to avoid stalls when the system is heavily
> > overcommitted. This was especially problematic in
> > kvm_s390_pv_unpack, because of the loop over all pages that needed
> > unpacking.
> >
> > Also fix kvm_s390_pv_init_vm to avoid stalls when the system is
> > heavily overcommitted.
>
> Fixes tag?

will be in the next version

> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/kernel/uv.c | 11 ++++++++---
> > arch/s390/kvm/pv.c | 2 +-
> > 2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index aeb0a15bcbb7..fd0faa51c1bb 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -196,11 +196,16 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > if (!page_ref_freeze(page, expected))
> > return -EBUSY;
> > set_bit(PG_arch_1, &page->flags);
> > - rc = uv_call(0, (u64)uvcb);
> > + rc = __uv_call(0, (u64)uvcb);
>
> We should exchange rc with cc since that's what we get back from
> __uv_call(). Technically we always get a cc but for the other
> functions it's only ever 0/1 which translates to success/error so rc
> is ok.

will be in the next version

> > page_ref_unfreeze(page, expected);
> > - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
> > - if (rc)
> > + /*
> > + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
> > + * If busy or partially completed, return -EAGAIN.
> > + */
> > + if (rc == 1)
> > rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> > + else if (rc > 1)
> > + rc = -EAGAIN;
> > return rc;
>
> Could you define the CCs in uv.h and check against the constants here
> so it's easier to understand that the rc > 1 checks against a "UV was
> busy please re-issue the call again" cc?
>
> Maybe also make it explicit for cc 2 and 3 instead of cc > 1

will be in the next version

> > }
> >
> > diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> > index c8841f476e91..e007df11a2fe 100644
> > --- a/arch/s390/kvm/pv.c
> > +++ b/arch/s390/kvm/pv.c
> > @@ -196,7 +196,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> > uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
> > uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;
> > - cc = uv_call(0, (u64)&uvcb);
> > + cc = uv_call_sched(0, (u64)&uvcb);
> > *rc = uvcb.header.rc;
> > *rrc = uvcb.header.rrc;
> > KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
>


2021-07-29 13:47:15

by Janosch Frank

[permalink] [raw]
Subject: Re: [PATCH v2 02/13] KVM: s390: pv: leak the ASCE page when destroy fails

On 7/29/21 2:54 PM, Claudio Imbrenda wrote:
> On Thu, 29 Jul 2021 12:41:05 +0200
> Janosch Frank <[email protected]> wrote:
>
>> On 7/28/21 4:26 PM, Claudio Imbrenda wrote:
>>> When a protected VM is created, the topmost level of page tables of
>>> its ASCE is marked by the Ultravisor; any attempt to use that
>>> memory for protected virtualization will result in failure.
>>>
>>> Only a successful Destroy Configuration UVC will remove the marking.
>>>
>>> When the Destroy Configuration UVC fails, the topmost level of page
>>> tables of the VM does not get its marking cleared; to avoid issues
>>> it must not be used again.
>>>
>>> Since the page becomes in practice unusable, we set it aside and
>>> leak it.
>>>
>>> Signed-off-by: Claudio Imbrenda <[email protected]>
>>> ---
>>> arch/s390/kvm/pv.c | 53 +++++++++++++++++++++++++++++++++++++++++++++-
>>> 1 file changed, 52 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
>>> index e007df11a2fe..1ecdc1769ed9 100644
>>> --- a/arch/s390/kvm/pv.c
>>> +++ b/arch/s390/kvm/pv.c
>>> @@ -155,6 +155,55 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
>>> return -ENOMEM;
>>> }
>>>
>>> +/*
>>> + * Remove the topmost level of page tables from the list of page
>>> + * tables of the gmap. This means that it will not be freed when
>>> + * the VM is torn down, and needs to be handled separately by the
>>> + * caller, unless the leak is intentional.
>>> + */
>>> +static void kvm_s390_pv_remove_old_asce(struct kvm *kvm)
>>> +{
>>> + struct page *old;
>>> +
>>> + old = virt_to_page(kvm->arch.gmap->table);
>>> + list_del(&old->lru);
>>> + /* in case the ASCE needs to be "removed" multiple times */
>>> + INIT_LIST_HEAD(&old->lru);
>>> +}
>>> +
>>> +/*
>>> + * Try to replace the current ASCE with another equivalent one.
>>> + * If the allocation of the new top-level page table fails, the
>>> + * ASCE is not replaced.
>>> + * In any case, the old ASCE is removed from the list, therefore
>>> + * the caller has to make sure to save a pointer to it beforehand,
>>> + * unless the leak is intentional.
>>> + */
>>> +static int kvm_s390_pv_replace_asce(struct kvm *kvm)
>>> +{
>>> + unsigned long asce;
>>> + struct page *page;
>>> + void *table;
>>> +
>>> + kvm_s390_pv_remove_old_asce(kvm);
>>> +
>>> + page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
>>> + if (!page)
>>> + return -ENOMEM;
>>> + list_add(&page->lru, &kvm->arch.gmap->crst_list);
>>> +
>>> + table = page_to_virt(page);
>>> + memcpy(table, kvm->arch.gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
>>
>> Don't we want to memcpy first and then add it to the list?
>> The gmap is still active per-se so I think we want to take the
>> guest_table_lock for the list_add here.
>
> it doesn't really make a difference; the table is not actually used
> until a few lines later

Still when you touch the patch, then just move that around please :)

>
> also, the list is only ever touched here, during guest creation and
> destruction; IIRC in all those cases we hold kvm->lock

The crst_list is also used in gmap_alloc_table() coming from
__gmap_link() when touching the top-level entries. To be fair, there
shouldn't be asynchronous things that trigger a fault/link, but we have
valid references to the gmap, so at the very least I want a comment
explaining why this is totally fine, like we have for gmap_free().

Hrm, I'll need to brood over this a bit more, it has been too long.

>
>>> +
>>> + asce = (kvm->arch.gmap->asce & ~PAGE_MASK) | __pa(table);
>>> + WRITE_ONCE(kvm->arch.gmap->asce, asce);
>>> + WRITE_ONCE(kvm->mm->context.gmap_asce, asce);
>>> + WRITE_ONCE(kvm->arch.gmap->table, table);
>>
>> If I remember correctly those won't need locks but I'm not 100% sure
>> so please have a look at that.
>
> it should not need locks, the VM is in use, so it can't disappear under
> our feet.
>
>>> +
>>> + return 0;
>>> +}
>>
>> That should both be in gmap.c
>
> why?

Because we do gmap stuff and I don't want to pollute pv.c with memory
things if we don't have to.

Btw. s390_reset_acc() is also in gmap.c

>
>>> +
>>> /* this should not fail, but if it does, we must not free the donated memory */
>>> int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>>> {
>>> @@ -169,9 +218,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
>>> atomic_set(&kvm->mm->context.is_protected, 0);
>>> KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
>>> WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
>>> - /* Inteded memory leak on "impossible" error */
>>> + /* Intended memory leak on "impossible" error */
>>> if (!cc)
>>> kvm_s390_pv_dealloc_vm(kvm);
>>> + else
>>> + kvm_s390_pv_replace_asce(kvm);
>>> return cc ? -EIO : 0;
>>> }
>>>
>>>
>>
>