Previously, when a protected VM was rebooted or when it was shut down,
its memory was made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
considering the overhead of the various Ultravisor Calls (UVCs). This
means that a reboot or a shutdown could take a long time, depending on
the amount of memory used.
This patch series implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory is cleared in
the background, allowing the guest to restart or terminate significantly
faster than before.
There are two possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)
For the reboot case, the reference count of the mm is increased, and
then a background thread is started to clean up. Once the thread has
gone through the whole address space, the protected VM is actually
destroyed.
This means that the same address space can have memory belonging to
more than one protected guest, although only one of them will be
running; the others will not even have any CPUs.
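As a rough sketch of the reboot flow (illustrative only: mmgrab,
mmget_not_zero and mmdrop are the real mm refcounting helpers, while
sync_teardown and start_cleanup_thread are placeholder names for what
the later patches in this series actually implement):

static int deferred_teardown(struct kvm *kvm)
{
	mmgrab(kvm->mm);                    /* keep the mm struct itself alive */
	if (!mmget_not_zero(kvm->mm)) {     /* try to pin the address space */
		mmdrop(kvm->mm);
		return sync_teardown(kvm);  /* no users left: tear down now */
	}
	/* the background thread drops both references once it is done */
	start_cleanup_thread(kvm);          /* clears guest pages in background */
	return 0;                           /* a new guest can start right away */
}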
The shutdown case is more controversial and will be dealt with in a
future patch series.
When a guest is destroyed, its memory still counts towards its memory
control group until it's actually freed (I tested this experimentally).
v3->v4
* added patch 2
* split patch 3
* removed the shutdown part -- will be a separate patch series
* moved the patch introducing the module parameter
v2->v3
* added definitions for CC return codes for the UVC instruction
* improved make_secure_pte:
- renamed rc to cc
- added comments to explain why returning -EAGAIN is ok
* fixed kvm_s390_pv_replace_asce and kvm_s390_pv_remove_old_asce:
- renamed
- added locking
- moved to gmap.c
* do proper error management in do_secure_storage_access instead of
trying again hoping to get a different exception
* fix outdated patch descriptions
v1->v2
* rebased on a more recent kernel
* improved/expanded some patch descriptions
* improved/expanded some comments
* added patch 1, which prevents stall notifications when the system is
under heavy load.
* rename some members of struct deferred_priv to improve readability
* avoid a use-after-free bug of the struct mm in case of shutdown
* add missing return when lazy destroy is disabled
* add support for OOM notifier
Claudio Imbrenda (14):
KVM: s390: pv: add macros for UVC CC values
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: leak the ASCE page when destroy fails
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: lazy destroy for reboot
KVM: s390: pv: avoid export before import if possible
arch/s390/include/asm/gmap.h | 6 +-
arch/s390/include/asm/pgtable.h | 9 +-
arch/s390/include/asm/uv.h | 16 ++-
arch/s390/kernel/uv.c | 115 ++++++++++++++++++--
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 185 ++++++++++++++++++++++++++++----
arch/s390/mm/fault.c | 20 +++-
arch/s390/mm/gmap.c | 141 +++++++++++++++++++-----
9 files changed, 434 insertions(+), 66 deletions(-)
--
2.31.1
Improve make_secure_pte to avoid stalls when the system is heavily
overcommitted. This was especially problematic in kvm_s390_pv_unpack,
because of the loop over all pages that needed unpacking.
Due to the locks being held, it was not possible to simply replace
uv_call with uv_call_sched. A more complex approach was
needed, in which uv_call is replaced with __uv_call, which does not
loop. When the UVC needs to be executed again, -EAGAIN is returned, and
the caller (or its caller) will try again.
When -EAGAIN is returned, the path is the same as when the page is in
writeback (and the writeback check is also performed, which is
harmless).
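For reference, the relationship between the three helpers looks roughly
like this (simplified from arch/s390/include/asm/uv.h; condition codes
greater than 1 indicate busy or partial completion):

static inline int uv_call(unsigned long r1, unsigned long r2)
{
	int cc;

	do {
		cc = __uv_call(r1, r2);	/* a single UVC, no looping */
	} while (cc > 1);		/* busy-loop on busy/partial completion */
	return cc;
}

/* Like uv_call, but may schedule; not usable while holding spinlocks */
static inline int uv_call_sched(unsigned long r1, unsigned long r2)
{
	int cc;

	do {
		cc = __uv_call(r1, r2);
		cond_resched();
	} while (cc > 1);
	return cc;
}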
Signed-off-by: Claudio Imbrenda <[email protected]>
Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
---
arch/s390/kernel/uv.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index aeb0a15bcbb7..68a8fbafcb9c 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -180,7 +180,7 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
{
pte_t entry = READ_ONCE(*ptep);
struct page *page;
- int expected, rc = 0;
+ int expected, cc = 0;
if (!pte_present(entry))
return -ENXIO;
@@ -196,12 +196,25 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
if (!page_ref_freeze(page, expected))
return -EBUSY;
set_bit(PG_arch_1, &page->flags);
- rc = uv_call(0, (u64)uvcb);
+ /*
+ * If the UVC does not succeed or fail immediately, we don't want to
+ * loop for long, or we might get stall notifications.
+ * On the other hand, this is a complex scenario and we are holding a lot of
+ * locks, so we can't easily sleep and reschedule. We try only once,
+ * and if the UVC returned busy or partial completion, we return
+ * -EAGAIN and we let the callers deal with it.
+ */
+ cc = __uv_call(0, (u64)uvcb);
page_ref_unfreeze(page, expected);
- /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
- if (rc)
- rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
- return rc;
+ /*
+ * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
+ * If busy or partially completed, return -EAGAIN.
+ */
+ if (cc == UVC_CC_OK)
+ return 0;
+ else if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
+ return -EAGAIN;
+ return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
}
/*
@@ -254,6 +267,10 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
mmap_read_unlock(gmap->mm);
if (rc == -EAGAIN) {
+ /*
+ * If we are here because the UVC returned busy or partial
+ * completion, this is just a useless check, but it is safe.
+ */
wait_on_page_writeback(page);
} else if (rc == -EBUSY) {
/*
--
2.31.1
When the system is heavily overcommitted, kvm_s390_pv_init_vm might
generate stall notifications.
Fix this by using uv_call_sched instead of just uv_call. This is ok because
we are not holding spinlocks.
Signed-off-by: Claudio Imbrenda <[email protected]>
Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
---
arch/s390/kvm/pv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 0a854115100b..00d272d134c2 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -195,7 +195,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;
- cc = uv_call(0, (u64)&uvcb);
+ cc = uv_call_sched(0, (u64)&uvcb);
*rc = uvcb.header.rc;
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
--
2.31.1
With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation.
This patch adds handling of secure storage violations for protected
guests.
Pages that trigger the exception will be made non-secure before they
are used again.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/uv.h | 1 +
arch/s390/kernel/uv.c | 43 ++++++++++++++++++++++++++++++++++++++
arch/s390/mm/fault.c | 10 +++++++++
3 files changed, 54 insertions(+)
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 3236293d5a31..c8bd72be8974 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -356,6 +356,7 @@ static inline int is_prot_virt_host(void)
}
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 05f8bf61d20a..89ba36a5b4bb 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -349,6 +349,49 @@ int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr)
}
EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
+{
+ struct vm_area_struct *vma;
+ unsigned long uaddr;
+ struct page *page;
+ int rc;
+
+ rc = -EFAULT;
+ mmap_read_lock(gmap->mm);
+
+ uaddr = __gmap_translate(gmap, gaddr);
+ if (IS_ERR_VALUE(uaddr))
+ goto out;
+ vma = find_vma(gmap->mm, uaddr);
+ if (!vma)
+ goto out;
+ /*
+ * Huge pages should not be able to become secure
+ */
+ if (is_vm_hugetlb_page(vma))
+ goto out;
+
+ rc = 0;
+ /* we take an extra reference here */
+ page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
+ if (IS_ERR_OR_NULL(page))
+ goto out;
+ rc = uv_destroy_owned_page(page_to_phys(page));
+ /*
+ * Fault handlers can race; it is possible that one CPU will destroy
+ * and import the page, at which point the second CPU handling the
+ * same fault will not be able to destroy it. In that case we do not
+ * want to terminate the process, we instead try to export the page.
+ */
+ if (rc)
+ rc = uv_convert_owned_from_secure(page_to_phys(page));
+ put_page(page);
+out:
+ mmap_read_unlock(gmap->mm);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_destroy_page);
+
/*
* To be called with the page locked or with an extra reference! This will
* prevent gmap_make_secure from touching the page concurrently. Having 2
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index e33c43b38afe..eb68b4f36927 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -850,6 +850,16 @@ NOKPROBE_SYMBOL(do_non_secure_storage_access);
void do_secure_storage_violation(struct pt_regs *regs)
{
+ unsigned long gaddr = regs->int_parm_long & __FAIL_ADDR_MASK;
+ struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
+
+ /*
+ * If the VM has been rebooted, its address space might still contain
+ * secure pages from the previous boot.
+ * Clear the page so it can be reused.
+ */
+ if (!gmap_destroy_page(gmap, gaddr))
+ return;
/*
* Either KVM messed up the secure guest mapping or the same
* page is mapped into multiple secure guests.
--
2.31.1
When a protected VM is created, the topmost level of page tables of its
ASCE is marked by the Ultravisor; any attempt to use that memory for
protected virtualization will result in failure.
Only a successful Destroy Configuration UVC will remove the marking.
When the Destroy Configuration UVC fails, the topmost level of page
tables of the VM does not get its marking cleared; to avoid issues it
must not be used again.
Since this is a permanent error, the page is in practice unusable, so
we set it aside and leak it.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/gmap.h | 2 ++
arch/s390/kvm/pv.c | 4 ++-
arch/s390/mm/gmap.c | 55 ++++++++++++++++++++++++++++++++++++
3 files changed, 60 insertions(+), 1 deletion(-)
diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 40264f60b0da..746e18bf8984 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -148,4 +148,6 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
void s390_reset_acc(struct mm_struct *mm);
+void s390_remove_old_asce(struct gmap *gmap);
+int s390_replace_asce(struct gmap *gmap);
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 00d272d134c2..76b0d64ce8fa 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -168,9 +168,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_set(&kvm->mm->context.is_protected, 0);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
- /* Inteded memory leak on "impossible" error */
+ /* Intended memory leak on "impossible" error */
if (!cc)
kvm_s390_pv_dealloc_vm(kvm);
+ else
+ s390_replace_asce(kvm->arch.gmap);
return cc ? -EIO : 0;
}
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9bb2c7512cd5..5a138f6220c4 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2706,3 +2706,58 @@ void s390_reset_acc(struct mm_struct *mm)
mmput(mm);
}
EXPORT_SYMBOL_GPL(s390_reset_acc);
+
+/*
+ * Remove the topmost level of page tables from the list of page tables of
+ * the gmap.
+ * This means that it will not be freed when the VM is torn down, and needs
+ * to be handled separately by the caller, unless the leak is
+ * intentional.
+ */
+void s390_remove_old_asce(struct gmap *gmap)
+{
+ struct page *old;
+
+ old = virt_to_page(gmap->table);
+ spin_lock(&gmap->guest_table_lock);
+ list_del(&old->lru);
+ spin_unlock(&gmap->guest_table_lock);
+ /* in case the ASCE needs to be "removed" multiple times */
+ INIT_LIST_HEAD(&old->lru);
+}
+EXPORT_SYMBOL_GPL(s390_remove_old_asce);
+
+/*
+ * Try to replace the current ASCE with another equivalent one.
+ * If the allocation of the new top level page table fails, the ASCE is not
+ * replaced.
+ * In any case, the old ASCE is removed from the list, therefore the caller
+ * has to make sure to save a pointer to it beforehand, unless the
+ * leak is intentional.
+ */
+int s390_replace_asce(struct gmap *gmap)
+{
+ unsigned long asce;
+ struct page *page;
+ void *table;
+
+ s390_remove_old_asce(gmap);
+
+ page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+ if (!page)
+ return -ENOMEM;
+ table = page_to_virt(page);
+ memcpy(table, gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
+
+ spin_lock(&gmap->guest_table_lock);
+ list_add(&page->lru, &gmap->crst_list);
+ spin_unlock(&gmap->guest_table_lock);
+
+ asce = (gmap->asce & ~PAGE_MASK) | __pa(table);
+ WRITE_ONCE(gmap->asce, asce);
+ WRITE_ONCE(gmap->mm->context.gmap_asce, asce);
+ WRITE_ONCE(gmap->table, table);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(s390_replace_asce);
--
2.31.1
With upcoming patches, normal guests might touch secure pages.
This patch extends the existing exception handler to convert the pages
to non-secure also when the exception is triggered by a normal guest.
This can happen for example when a secure guest reboots; the first
stage of a secure guest is non-secure, and in general a secure guest
can reboot into non-secure mode.
If the secure memory of the previous boot has not yet been completely
cleared, a non-secure guest might touch secure memory, which
will need to be handled properly.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/mm/fault.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index eb68b4f36927..74784581f42d 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -767,6 +767,7 @@ void do_secure_storage_access(struct pt_regs *regs)
struct vm_area_struct *vma;
struct mm_struct *mm;
struct page *page;
+ struct gmap *gmap;
int rc;
/*
@@ -796,6 +797,14 @@ void do_secure_storage_access(struct pt_regs *regs)
}
switch (get_fault_type(regs)) {
+ case GMAP_FAULT:
+ gmap = (struct gmap *)S390_lowcore.gmap;
+ addr = __gmap_translate(gmap, addr);
+ if (IS_ERR_VALUE(addr)) {
+ do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
+ break;
+ }
+ fallthrough;
case USER_FAULT:
mm = current->mm;
mmap_read_lock(mm);
@@ -824,7 +833,6 @@ void do_secure_storage_access(struct pt_regs *regs)
if (rc)
BUG();
break;
- case GMAP_FAULT:
default:
do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
WARN_ON_ONCE(1);
--
2.31.1
Refactor s390_reset_acc so that its pieces can be reused in upcoming
patches, which will also need the users parameter of the new
s390_uv_destroy_range.
We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/gmap.h | 4 +-
arch/s390/kvm/pv.c | 12 ++++-
arch/s390/mm/gmap.c | 88 ++++++++++++++++++++++++------------
3 files changed, 72 insertions(+), 32 deletions(-)
diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 746e18bf8984..559a0c18579e 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -147,7 +147,9 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
-void s390_reset_acc(struct mm_struct *mm);
void s390_remove_old_asce(struct gmap *gmap);
int s390_replace_asce(struct gmap *gmap);
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end);
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 76b0d64ce8fa..47db80003ea0 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -12,6 +12,8 @@
#include <asm/gmap.h>
#include <asm/uv.h>
#include <asm/mman.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
#include "kvm-s390.h"
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
@@ -159,8 +161,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;
- /* make all pages accessible before destroying the guest */
- s390_reset_acc(kvm->mm);
+ /*
+ * if the mm still has a mapping, make all its pages accessible
+ * before destroying the guest
+ */
+ if (mmget_not_zero(kvm->mm)) {
+ s390_uv_destroy_range(kvm->mm, 0, 0, TASK_SIZE);
+ mmput(kvm->mm);
+ }
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 38b792ab57f7..b87892986836 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2670,44 +2670,74 @@ void s390_reset_cmma(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(s390_reset_cmma);
-/*
- * make inaccessible pages accessible again
- */
-static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
+#define DESTROY_LOOP_THRESHOLD 32
+
+struct reset_walk_state {
+ unsigned long next;
+ unsigned long count;
+ unsigned long pfns[DESTROY_LOOP_THRESHOLD];
+};
+
+static int s390_gather_pages(pte_t *ptep, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
+ struct reset_walk_state *p = walk->private;
pte_t pte = READ_ONCE(*ptep);
- /* There is a reference through the mapping */
- if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
-
- return 0;
+ if (pte_present(pte)) {
+ /* we have a reference from the mapping, take an extra one */
+ get_page(phys_to_page(pte_val(pte)));
+ p->pfns[p->count] = phys_to_pfn(pte_val(pte));
+ p->next = next;
+ p->count++;
+ }
+ return p->count >= DESTROY_LOOP_THRESHOLD;
}
-static const struct mm_walk_ops reset_acc_walk_ops = {
- .pte_entry = __s390_reset_acc,
+static const struct mm_walk_ops gather_pages_ops = {
+ .pte_entry = s390_gather_pages,
};
-#include <linux/sched/mm.h>
-void s390_reset_acc(struct mm_struct *mm)
+/*
+ * Call the Destroy secure page UVC on each page in the given array of PFNs.
+ * Each page needs to have an extra reference, which will be released here.
+ */
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
{
- if (!mm_is_protected(mm))
- return;
- /*
- * we might be called during
- * reset: we walk the pages and clear
- * close of all kvm file descriptors: we walk the pages and clear
- * exit of process on fd closure: vma already gone, do nothing
- */
- if (!mmget_not_zero(mm))
- return;
- mmap_read_lock(mm);
- walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
- mmap_read_unlock(mm);
- mmput(mm);
+ unsigned long i;
+
+ for (i = 0; i < count; i++) {
+ /* we always have an extra reference */
+ uv_destroy_owned_page(pfn_to_phys(pfns[i]));
+ /* get rid of the extra reference */
+ put_page(pfn_to_page(pfns[i]));
+ cond_resched();
+ }
+}
+EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
+
+/*
+ * Walk the given range of the given address space, and call the destroy
+ * secure page UVC on each page.
+ * Exit early if the number of users of the mm drops to (or below) the given
+ * value.
+ */
+void s390_uv_destroy_range(struct mm_struct *mm, unsigned int users,
+ unsigned long start, unsigned long end)
+{
+ struct reset_walk_state state = { .next = start };
+ int r = 1;
+
+ while ((r > 0) && (atomic_read(&mm->mm_users) > users)) {
+ state.count = 0;
+ mmap_read_lock(mm);
+ r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+ mmap_read_unlock(mm);
+ cond_resched();
+ s390_uv_destroy_pfns(state.count, state.pfns);
+ }
}
-EXPORT_SYMBOL_GPL(s390_reset_acc);
+EXPORT_SYMBOL_GPL(s390_uv_destroy_range);
/*
* Remove the topmost level of page tables from the list of page tables of
--
2.31.1
If the appropriate UV feature bit is set, there is no need to perform
an export before import.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kernel/uv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index bc79085d7152..6f8aef5609c0 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -251,7 +251,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
{
- return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ return !test_bit_inv(BIT_UV_FEAT_MISC, &uv_info.uv_feature_indications) &&
+ uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
atomic_read(&mm->context.is_protected) > 1;
}
--
2.31.1
Use the is_protected field as a counter instead of a flag. This will
be used in upcoming patches.
Increment the counter when a secure configuration is created, and
decrement it when it is destroyed. Previously the flag was set when the
set secure parameters UVC was performed.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/pv.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 47db80003ea0..ee11ff6afc4f 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -173,7 +173,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
- atomic_set(&kvm->mm->context.is_protected, 0);
+ if (!cc)
+ atomic_dec(&kvm->mm->context.is_protected);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
/* Intended memory leak on "impossible" error */
@@ -214,11 +215,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
/* Outputs */
kvm->arch.pv.handle = uvcb.guest_handle;
+ atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
- if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
+ if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
- else
+ } else {
+ atomic_dec(&kvm->mm->context.is_protected);
kvm_s390_pv_dealloc_vm(kvm);
+ }
return -EIO;
}
kvm->arch.gmap->guest_handle = uvcb.guest_handle;
@@ -241,8 +245,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
*rc, *rrc);
- if (!cc)
- atomic_set(&kvm->mm->context.is_protected, 1);
return cc ? -EINVAL : 0;
}
--
2.31.1
If kvm_s390_pv_destroy_cpu is called more than once, we risk calling
free_page on a random page, since the sidad field is aliased with the
gbea, which is not guaranteed to be zero.
The solution is to simply return success immediately if the vCPU was
already non-secure.
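For reference, the aliasing comes from a union in the SIE control block,
roughly like this (abridged from struct kvm_s390_sie_block in
arch/s390/include/asm/kvm_host.h):

	union {
		__u64	gbea;	/* guest breaking-event address */
		__u64	sidad;	/* sida address, protected guests only */
	};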
Signed-off-by: Claudio Imbrenda <[email protected]>
Fixes: 19e1227768863a1469797c13ef8fea1af7beac2c ("KVM: S390: protvirt: Introduce instruction data area bounce buffer")
---
arch/s390/kvm/pv.c | 19 +++++++++----------
1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index c8841f476e91..0a854115100b 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -16,18 +16,17 @@
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
- int cc = 0;
+ int cc;
- if (kvm_s390_pv_cpu_get_handle(vcpu)) {
- cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu),
- UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
+ if (!kvm_s390_pv_cpu_get_handle(vcpu))
+ return 0;
+
+ cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu), UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
+
+ KVM_UV_EVENT(vcpu->kvm, 3, "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
+ vcpu->vcpu_id, *rc, *rrc);
+ WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x", *rc, *rrc);
- KVM_UV_EVENT(vcpu->kvm, 3,
- "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
- vcpu->vcpu_id, *rc, *rrc);
- WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x",
- *rc, *rrc);
- }
/* Intended memory leak for something that should never happen. */
if (!cc)
free_pages(vcpu->arch.pv.stor_base,
--
2.31.1
Due to upcoming changes, it will be possible to temporarily have
multiple protected VMs in the same address space, although only one
will actually be active.
In that scenario, it is necessary to perform an export of every page
that is to be imported, since the hardware does not allow a page
belonging to a protected guest to be imported into a different
protected guest.
This also applies to pages that are shared, and thus accessible by the
host.
Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kernel/uv.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 89ba36a5b4bb..bc79085d7152 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -249,6 +249,12 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
}
+static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
+{
+ return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ atomic_read(&mm->context.is_protected) > 1;
+}
+
/*
* Requests the Ultravisor to make a page accessible to a guest.
* If it's brought in the first time, it will be cleared. If
@@ -292,6 +298,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
lock_page(page);
ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+ if (should_export_before_import(uvcb, gmap->mm))
+ uv_convert_from_secure(page_to_phys(page));
rc = make_secure_pte(ptep, uaddr, page, uvcb);
pte_unmap_unlock(ptep, ptelock);
unlock_page(page);
--
2.31.1
Introduce variants of the convert and destroy page functions that also
clear the PG_arch_1 bit used to mark them as secure pages.
These new functions can only be called on pages for which a reference
is already held.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/pgtable.h | 9 ++++++---
arch/s390/include/asm/uv.h | 10 ++++++++--
arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
arch/s390/mm/gmap.c | 4 +++-
4 files changed, 50 insertions(+), 7 deletions(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dcac7b2df72c..0f1af2232ebe 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
pte_t res;
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
@@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
pte_t res;
res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(vma->vm_mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
@@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+ /* At this point the reference through the mapping is still present */
if (mm_is_protected(mm) && pte_present(res))
- uv_convert_from_secure(pte_val(res) & PAGE_MASK);
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index b35add51b967..3236293d5a31 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -356,8 +356,9 @@ static inline int is_prot_virt_host(void)
}
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
-int uv_destroy_page(unsigned long paddr);
+int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
+int uv_convert_owned_from_secure(unsigned long paddr);
int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
void setup_uv(void);
@@ -367,7 +368,7 @@ void adjust_to_uv_max(unsigned long *vmax);
static inline void setup_uv(void) {}
static inline void adjust_to_uv_max(unsigned long *vmax) {}
-static inline int uv_destroy_page(unsigned long paddr)
+static inline int uv_destroy_owned_page(unsigned long paddr)
{
return 0;
}
@@ -376,6 +377,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
{
return 0;
}
+
+static inline int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ return 0;
+}
#endif
#if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 68a8fbafcb9c..05f8bf61d20a 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
*
* @paddr: Absolute host address of page to be destroyed
*/
-int uv_destroy_page(unsigned long paddr)
+static int uv_destroy_page(unsigned long paddr)
{
struct uv_cb_cfs uvcb = {
.header.cmd = UVC_CMD_DESTR_SEC_STOR,
@@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
return 0;
}
+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_destroy_owned_page(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_destroy_page(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Requests the Ultravisor to encrypt a guest page and make it
* accessible to the host for paging (export).
@@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
return 0;
}
+/*
+ * The caller must already hold a reference to the page
+ */
+int uv_convert_owned_from_secure(unsigned long paddr)
+{
+ struct page *page = phys_to_page(paddr);
+ int rc;
+
+ get_page(page);
+ rc = uv_convert_from_secure(paddr);
+ if (!rc)
+ clear_bit(PG_arch_1, &page->flags);
+ put_page(page);
+ return rc;
+}
+
/*
* Calculate the expected ref_count for a page that would otherwise have no
* further pins. This was cribbed from similar functions in other places in
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5a138f6220c4..38b792ab57f7 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
{
pte_t pte = READ_ONCE(*ptep);
+ /* There is a reference through the mapping */
if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
+ WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
+
return 0;
}
--
2.31.1
Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clear the protected
pages from its address space.
This patch implements a lazy destroy mechanism that allows a protected
guest to reboot significantly faster than before.
This is achieved by clearing the pages of the old guest in the
background.
In case of reboot, the new guest will be able to run in the same
address space almost immediately.
The old protected guest is then only destroyed when all of its memory has
been destroyed or otherwise made non-protected.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 6 +-
arch/s390/kvm/kvm-s390.h | 2 +-
arch/s390/kvm/pv.c | 132 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 134 insertions(+), 6 deletions(-)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index b655a7d82bf0..238297a7bb46 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2281,7 +2281,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
r = kvm_s390_cpus_to_pv(kvm, &cmd->rc, &cmd->rrc);
if (r)
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &dummy, &dummy);
/* we need to block service interrupts from now on */
set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2300,7 +2300,7 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
*/
if (r)
break;
- r = kvm_s390_pv_deinit_vm(kvm, &cmd->rc, &cmd->rrc);
+ r = kvm_s390_pv_deinit_vm_deferred(kvm, &cmd->rc, &cmd->rrc);
/* no need to block service interrupts any more */
clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
@@ -2829,7 +2829,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
* complaining we do not use kvm_s390_pv_is_protected.
*/
if (kvm_s390_pv_get_handle(kvm))
- kvm_s390_pv_deinit_vm(kvm, &rc, &rrc);
+ kvm_s390_pv_deinit_vm_deferred(kvm, &rc, &rrc);
debug_unregister(kvm->arch.dbf);
free_page((unsigned long)kvm->arch.sie_page2);
if (!kvm_is_ucontrol(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9fad25109b0d..d2380f5e7e1f 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -211,7 +211,7 @@ static inline int kvm_s390_user_cpu_state_ctrl(struct kvm *kvm)
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
u16 *rrc);
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index e29109af7956..e6672937abef 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -14,12 +14,21 @@
#include <asm/mman.h>
#include <linux/pagewalk.h>
#include <linux/sched/mm.h>
+#include <linux/kthread.h>
#include "kvm-s390.h"
static int lazy_destroy = 1;
module_param(lazy_destroy, int, 0444);
MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
+struct deferred_priv {
+ struct mm_struct *mm;
+ unsigned long old_table;
+ u64 handle;
+ void *stor_var;
+ unsigned long stor_base;
+};
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc;
@@ -161,7 +170,7 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
}
/* this should not fail, but if it does, we must not free the donated memory */
-int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
+static int kvm_s390_pv_deinit_vm_now(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;
@@ -189,6 +198,125 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}
+static int kvm_s390_pv_destroy_vm_thread(void *priv)
+{
+ struct deferred_priv *p = priv;
+ u16 rc, rrc;
+ int r;
+
+ /* Clear all the pages as long as we are not the only users of the mm */
+ s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
+ /*
+ * If we were the last user of the mm, synchronously free (and clear
+ * if needed) all pages.
+ * Otherwise simply decrease the reference counter; in this case we
+ * have already cleared all pages.
+ */
+ mmput(p->mm);
+
+ r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+ WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+ if (r) {
+ mmdrop(p->mm);
+ return r;
+ }
+ atomic_dec(&p->mm->context.is_protected);
+ mmdrop(p->mm);
+
+ /*
+ * Intentional leak in case the destroy secure VM call fails. The
+ * call should never fail if the hardware is not broken.
+ */
+ free_pages(p->stor_base, get_order(uv_info.guest_base_stor_len));
+ free_pages(p->old_table, CRST_ALLOC_ORDER);
+ vfree(p->stor_var);
+ kfree(p);
+ return 0;
+}
+
+static int deferred_destroy(struct kvm *kvm, struct deferred_priv *priv, u16 *rc, u16 *rrc)
+{
+ struct task_struct *t;
+
+ priv->stor_var = kvm->arch.pv.stor_var;
+ priv->stor_base = kvm->arch.pv.stor_base;
+ priv->handle = kvm_s390_pv_get_handle(kvm);
+ priv->old_table = (unsigned long)kvm->arch.gmap->table;
+ WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+
+ if (s390_replace_asce(kvm->arch.gmap))
+ goto fail;
+
+ t = kthread_create(kvm_s390_pv_destroy_vm_thread, priv,
+ "kvm_s390_pv_destroy_vm_thread");
+ if (IS_ERR_OR_NULL(t))
+ goto fail;
+
+ memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM DEFERRED %d", t->pid);
+ wake_up_process(t);
+ /*
+ * no actual UVC is performed at this point, just return a successful
+ * rc value to make userspace happy, and an arbitrary rrc
+ */
+ *rc = 1;
+ *rrc = 42;
+
+ return 0;
+
+fail:
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+}
+
+/* Clear the first 2GB of guest memory, to avoid prefix issues after reboot */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ unsigned long lim;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ slots = kvm_memslots(kvm);
+ kvm_for_each_memslot(slot, slots) {
+ if (slot->base_gfn >= (SZ_2G / PAGE_SIZE))
+ continue;
+ if (slot->base_gfn + slot->npages > (SZ_2G / PAGE_SIZE))
+ lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+ else
+ lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, 1, slot->userspace_addr, lim);
+ }
+
+ srcu_read_unlock(&kvm->srcu, idx);
+}
+
+int kvm_s390_pv_deinit_vm_deferred(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *priv;
+
+ if (!lazy_destroy)
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+ if (!priv)
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+
+ mmgrab(kvm->mm);
+ if (mmget_not_zero(kvm->mm)) {
+ kvm_s390_clear_2g(kvm);
+ } else {
+ /* No deferred work to do */
+ mmdrop(kvm->mm);
+ kfree(priv);
+ return kvm_s390_pv_deinit_vm_now(kvm, rc, rrc);
+ }
+ priv->mm = kvm->mm;
+ return deferred_destroy(kvm, priv, rc, rrc);
+}
+
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct uv_cb_cgc uvcb = {
@@ -222,7 +350,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_inc(&kvm->mm->context.is_protected);
if (cc) {
if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
- kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
+ kvm_s390_pv_deinit_vm_now(kvm, &dummy, &dummy);
} else {
atomic_dec(&kvm->mm->context.is_protected);
kvm_s390_pv_dealloc_vm(kvm);
--
2.31.1
Add the module parameter "lazy_destroy", to allow the lazy destroy
mechanism to be switched off. This might be useful for debugging
purposes.
The parameter is enabled by default.
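For example (assuming the option ends up in the kvm module, as in this
series), lazy destroy can be switched off with kvm.lazy_destroy=0 on the
kernel command line, or with "modprobe kvm lazy_destroy=0" when KVM is
built as a module.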
Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/pv.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index ee11ff6afc4f..e29109af7956 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -16,6 +16,10 @@
#include <linux/sched/mm.h>
#include "kvm-s390.h"
+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Deferred destroy for protected guests");
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc;
--
2.31.1
Am 18.08.21 um 15:26 schrieb Claudio Imbrenda:
> Use the is_protected field as a counter instead of a flag. This will
> be used in upcoming patches.
Maybe it should also be renamed to reflect that?
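For instance, something like protected_count would make the counter
semantics obvious (just a naming sketch):

	-	atomic_t is_protected;
	+	atomic_t protected_count;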
>
> Increment the counter when a secure configuration is created, and
> decrement it when it is destroyed. Previously the flag was set when the
> set secure parameters UVC was performed.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Acked-by: Janosch Frank <[email protected]>
> ---
> arch/s390/kvm/pv.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 47db80003ea0..ee11ff6afc4f 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -173,7 +173,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
> UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
> WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
> - atomic_set(&kvm->mm->context.is_protected, 0);
> + if (!cc)
> + atomic_dec(&kvm->mm->context.is_protected);
> KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> /* Intended memory leak on "impossible" error */
> @@ -214,11 +215,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> /* Outputs */
> kvm->arch.pv.handle = uvcb.guest_handle;
>
> + atomic_inc(&kvm->mm->context.is_protected);
> if (cc) {
> - if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
> + if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
> kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
> - else
> + } else {
> + atomic_dec(&kvm->mm->context.is_protected);
> kvm_s390_pv_dealloc_vm(kvm);
> + }
> return -EIO;
> }
> kvm->arch.gmap->guest_handle = uvcb.guest_handle;
> @@ -241,8 +245,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
> *rrc = uvcb.header.rrc;
> KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
> *rc, *rrc);
> - if (!cc)
> - atomic_set(&kvm->mm->context.is_protected, 1);
> return cc ? -EINVAL : 0;
> }
>
Am 18.08.21 um 15:26 schrieb Claudio Imbrenda:
> Until now, destroying a protected guest was an entirely synchronous
> operation that could potentially take a very long time, depending on
> the size of the guest, due to the time needed to clear the protected
> pages from its address space.
>
> This patch implements a lazy destroy mechanism that allows a protected
> guest to reboot significantly faster than before.
>
> This is achieved by clearing the pages of the old guest in the background.
> In case of reboot, the new guest will be able to run in the same
> address space almost immediately.
>
> The old protected guest is then only destroyed when all of its memory has
> been destroyed or otherwise made non-protected.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kvm/kvm-s390.c | 6 +-
> arch/s390/kvm/kvm-s390.h | 2 +-
> arch/s390/kvm/pv.c | 132 ++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 134 insertions(+), 6 deletions(-)
>
[...]
>
> +static int kvm_s390_pv_destroy_vm_thread(void *priv)
> +{
> + struct deferred_priv *p = priv;
> + u16 rc, rrc;
> + int r;
> +
> + /* Clear all the pages as long as we are not the only users of the mm */
> + s390_uv_destroy_range(p->mm, 1, 0, TASK_SIZE_MAX);
> + /*
> + * If we were the last user of the mm, synchronously free (and clear
> + * if needed) all pages.
> + * Otherwise simply decrease the reference counter; in this case we
> + * have already cleared all pages.
> + */
> + mmput(p->mm);
> +
> + r = uv_cmd_nodata(p->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
> + WARN_ONCE(r, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
> + if (r) {
> + mmdrop(p->mm);
The comment about leaking makes more sense here, no?
Also
goto out_dont_free;
> + return r;
> + }
> + atomic_dec(&p->mm->context.is_protected);
> + mmdrop(p->mm);
> +
> + /*
> + * Intentional leak in case the destroy secure VM call fails. The
> + * call should never fail if the hardware is not broken.
> + */
> + free_pages(p->stor_base, get_order(uv_info.guest_base_stor_len));
> + free_pages(p->old_table, CRST_ALLOC_ORDER);
> + vfree(p->stor_var);
out_dont_free:
> + kfree(p);
> + return 0;
> +}
> +
[...]
On 18.08.21 15:26, Claudio Imbrenda wrote:
> If kvm_s390_pv_destroy_cpu is called more than once, we risk calling
> free_page on a random page, since the sidad field is aliased with the
> gbea, which is not guaranteed to be zero.
>
> The solution is to simply return success immediately if the vCPU
> was already non-secure.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Fixes: 19e1227768863a1469797c13ef8fea1af7beac2c ("KVM: S390: protvirt: Introduce instruction data area bounce buffer")
Patch looks good. Do we have any potential case where we call this twice? In other words,
do we need the Fixes tag with the code as of today or not?
> ---
> arch/s390/kvm/pv.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index c8841f476e91..0a854115100b 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -16,18 +16,17 @@
>
> int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> {
> - int cc = 0;
> + int cc;
>
> - if (kvm_s390_pv_cpu_get_handle(vcpu)) {
> - cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu),
> - UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
> + if (!kvm_s390_pv_cpu_get_handle(vcpu))
> + return 0;
> +
> + cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu), UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
> +
> + KVM_UV_EVENT(vcpu->kvm, 3, "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
> + vcpu->vcpu_id, *rc, *rrc);
> + WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x", *rc, *rrc);
>
> - KVM_UV_EVENT(vcpu->kvm, 3,
> - "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
> - vcpu->vcpu_id, *rc, *rrc);
> - WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x",
> - *rc, *rrc);
> - }
> /* Intended memory leak for something that should never happen. */
> if (!cc)
> free_pages(vcpu->arch.pv.stor_base,
>
On 8/18/21 3:26 PM, Claudio Imbrenda wrote:
> If kvm_s390_pv_destroy_cpu is called more than once, we risk calling
> free_page on a random page, since the sidad field is aliased with the
> gbea, which is not guaranteed to be zero.
>
> The solution is to simply return success immediately if the vCPU
> was already non-secure.
I have the feeling this has been completely inconsistent from the start.
If we can't destroy a cpu we also won't take the VM out of PV state
because that's only allowed if we have removed all PV CPUs. That means
KVM thinks the VM is in PV mode but most of the CPUs are not.
Granted if the destroy CPU fails it makes more sense to just kill the VM
but looking at QEMU we don't even check the return value of that call.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Fixes: 19e1227768863a1469797c13ef8fea1af7beac2c ("KVM: S390: protvirt: Introduce instruction data area bounce buffer")
> ---
> arch/s390/kvm/pv.c | 19 +++++++++----------
> 1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index c8841f476e91..0a854115100b 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -16,18 +16,17 @@
>
> int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> {
> - int cc = 0;
> + int cc;
>
> - if (kvm_s390_pv_cpu_get_handle(vcpu)) {
> - cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu),
> - UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
> + if (!kvm_s390_pv_cpu_get_handle(vcpu))
> + return 0;
> +
> + cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu), UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
> +
> + KVM_UV_EVENT(vcpu->kvm, 3, "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
> + vcpu->vcpu_id, *rc, *rrc);
> + WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x", *rc, *rrc);
>
> - KVM_UV_EVENT(vcpu->kvm, 3,
> - "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
> - vcpu->vcpu_id, *rc, *rrc);
> - WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x",
> - *rc, *rrc);
> - }
> /* Intended memory leak for something that should never happen. */
> if (!cc)
> free_pages(vcpu->arch.pv.stor_base,
>
On 18.08.21 15:26, Claudio Imbrenda wrote:
> When the system is heavily overcommitted, kvm_s390_pv_init_vm might
> generate stall notifications.
>
> Fix this by using uv_call_sched instead of just uv_call. This is ok because
> we are not holding spinlocks.
I guess this should only happen for really large guests where the donated memory is also large?
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
Reviewed-by: Christian Borntraeger <[email protected]>
> ---
> arch/s390/kvm/pv.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 0a854115100b..00d272d134c2 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -195,7 +195,7 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
> uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;
>
> - cc = uv_call(0, (u64)&uvcb);
> + cc = uv_call_sched(0, (u64)&uvcb);
> *rc = uvcb.header.rc;
> *rrc = uvcb.header.rrc;
> KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
>
On 18.08.21 15:26, Claudio Imbrenda wrote:
> Improve make_secure_pte to avoid stalls when the system is heavily
> overcommitted. This was especially problematic in kvm_s390_pv_unpack,
> because of the loop over all pages that needed unpacking.
>
> Due to the locks being held, it was not possible to simply replace
> uv_call with uv_call_sched. A more complex approach was
> needed, in which uv_call is replaced with __uv_call, which does not
> loop. When the UVC needs to be executed again, -EAGAIN is returned, and
> the caller (or its caller) will try again.
>
> When -EAGAIN is returned, the path is the same as when the page is in
> writeback (and the writeback check is also performed, which is
> harmless).
To me it looks like
handle_pv_uvc does not handle EAGAIN but also calls into this code. Is this code
path ok or do we need to change something here?
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
> ---
> arch/s390/kernel/uv.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index aeb0a15bcbb7..68a8fbafcb9c 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -180,7 +180,7 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> {
> pte_t entry = READ_ONCE(*ptep);
> struct page *page;
> - int expected, rc = 0;
> + int expected, cc = 0;
>
> if (!pte_present(entry))
> return -ENXIO;
> @@ -196,12 +196,25 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> if (!page_ref_freeze(page, expected))
> return -EBUSY;
> set_bit(PG_arch_1, &page->flags);
> - rc = uv_call(0, (u64)uvcb);
> + /*
> + * If the UVC does not succeed or fail immediately, we don't want to
> + * loop for long, or we might get stall notifications.
> + * On the other hand, this is a complex scenario and we are holding a lot of
> + * locks, so we can't easily sleep and reschedule. We try only once,
> + * and if the UVC returned busy or partial completion, we return
> + * -EAGAIN and we let the callers deal with it.
> + */
> + cc = __uv_call(0, (u64)uvcb);
> page_ref_unfreeze(page, expected);
> - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
> - if (rc)
> - rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> - return rc;
> + /*
> + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
> + * If busy or partially completed, return -EAGAIN.
> + */
> + if (cc == UVC_CC_OK)
> + return 0;
> + else if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
> + return -EAGAIN;
> + return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> }
>
> /*
> @@ -254,6 +267,10 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> mmap_read_unlock(gmap->mm);
>
> if (rc == -EAGAIN) {
> + /*
> + * If we are here because the UVC returned busy or partial
> + * completion, this is just a useless check, but it is safe.
> + */
> wait_on_page_writeback(page);
> } else if (rc == -EBUSY) {
> /*
>
On Tue, 31 Aug 2021 16:32:24 +0200
Christian Borntraeger <[email protected]> wrote:
> On 18.08.21 15:26, Claudio Imbrenda wrote:
> > Improve make_secure_pte to avoid stalls when the system is heavily
> > overcommitted. This was especially problematic in kvm_s390_pv_unpack,
> > because of the loop over all pages that needed unpacking.
> >
> > Due to the locks being held, it was not possible to simply replace
> > uv_call with uv_call_sched. A more complex approach was
> > needed, in which uv_call is replaced with __uv_call, which does not
> > loop. When the UVC needs to be executed again, -EAGAIN is returned, and
> > the caller (or its caller) will try again.
> >
> > When -EAGAIN is returned, the path is the same as when the page is in
> > writeback (and the writeback check is also performed, which is
> > harmless).
>
> To me it looks like
> handle_pv_uvc does not handle EAGAIN but also calls into this code. Is this code
> path ok or do we need to change something here?
EAGAIN will be propagated all the way to userspace, which will retry.
If the UVC fails, the page does not get unpinned, and the next attempt
to run the UVC in the guest will trigger this same path.
If you don't like it, I can change handle_pv_uvc like this:
if (rc == -EINVAL || rc == -EAGAIN)
which will save a trip to userspace.
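In context, the change would look roughly like this (a sketch around the
existing gmap_make_secure call in handle_pv_uvc, untested):

	rc = gmap_make_secure(vcpu->arch.gmap, uvcb.gaddr, &uvcb);
	/*
	 * If the unpin did not succeed, or the UVC was busy and needs to
	 * be retried, let the guest retry the UVC itself instead of
	 * exiting to userspace.
	 */
	if (rc == -EINVAL || rc == -EAGAIN)
		return 0;
	return rc;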
>
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
> > ---
> > arch/s390/kernel/uv.c | 29 +++++++++++++++++++++++------
> > 1 file changed, 23 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index aeb0a15bcbb7..68a8fbafcb9c 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -180,7 +180,7 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > {
> > pte_t entry = READ_ONCE(*ptep);
> > struct page *page;
> > - int expected, rc = 0;
> > + int expected, cc = 0;
> >
> > if (!pte_present(entry))
> > return -ENXIO;
> > @@ -196,12 +196,25 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > if (!page_ref_freeze(page, expected))
> > return -EBUSY;
> > set_bit(PG_arch_1, &page->flags);
> > - rc = uv_call(0, (u64)uvcb);
> > + /*
> > + * If the UVC does not succeed or fail immediately, we don't want to
> > + * loop for long, or we might get stall notifications.
> > + * On the other hand, this is a complex scenario and we are holding a lot of
> > + * locks, so we can't easily sleep and reschedule. We try only once,
> > + * and if the UVC returned busy or partial completion, we return
> > + * -EAGAIN and we let the callers deal with it.
> > + */
> > + cc = __uv_call(0, (u64)uvcb);
> > page_ref_unfreeze(page, expected);
> > - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
> > - if (rc)
> > - rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> > - return rc;
> > + /*
> > + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
> > + * If busy or partially completed, return -EAGAIN.
> > + */
> > + if (cc == UVC_CC_OK)
> > + return 0;
> > + else if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
> > + return -EAGAIN;
> > + return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
> > }
> >
> > /*
> > @@ -254,6 +267,10 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> > mmap_read_unlock(gmap->mm);
> >
> > if (rc == -EAGAIN) {
> > + /*
> > + * If we are here because the UVC returned busy or partial
> > + * completion, this is just a useless check, but it is safe.
> > + */
> > wait_on_page_writeback(page);
> > } else if (rc == -EBUSY) {
> > /*
> >
On 31.08.21 17:00, Claudio Imbrenda wrote:
> On Tue, 31 Aug 2021 16:32:24 +0200
> Christian Borntraeger <[email protected]> wrote:
>
>> On 18.08.21 15:26, Claudio Imbrenda wrote:
>>> Improve make_secure_pte to avoid stalls when the system is heavily
>>> overcommitted. This was especially problematic in kvm_s390_pv_unpack,
>>> because of the loop over all pages that needed unpacking.
>>>
>>> Due to the locks being held, it was not possible to simply replace
>>> uv_call with uv_call_sched. A more complex approach was
>>> needed, in which uv_call is replaced with __uv_call, which does not
>>> loop. When the UVC needs to be executed again, -EAGAIN is returned, and
>>> the caller (or its caller) will try again.
>>>
>>> When -EAGAIN is returned, the path is the same as when the page is in
>>> writeback (and the writeback check is also performed, which is
>>> harmless).
>>
>> To me it looks like
>> handle_pv_uvc does not handle EAGAIN but also calls into this code. Is this code
>> path ok or do we need to change something here?
>
> EAGAIN will be propagated all the way to userspace, which will retry.
>
> If the UVC fails, the page does not get unpinned, and the next attempt
> to run the UVC in the guest will trigger this same path.
>
> If you don't like it, I can change handle_pv_uvc like this:
>
> if (rc == -EINVAL || rc == -EAGAIN)
>
> which will save a trip to userspace.
I think a comment would be good.
>
>>
>>>
>>> Signed-off-by: Claudio Imbrenda <[email protected]>
>>> Fixes: 214d9bbcd3a672 ("s390/mm: provide memory management functions for protected KVM guests")
>>> ---
>>> arch/s390/kernel/uv.c | 29 +++++++++++++++++++++++------
>>> 1 file changed, 23 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
>>> index aeb0a15bcbb7..68a8fbafcb9c 100644
>>> --- a/arch/s390/kernel/uv.c
>>> +++ b/arch/s390/kernel/uv.c
>>> @@ -180,7 +180,7 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
>>> {
>>> pte_t entry = READ_ONCE(*ptep);
>>> struct page *page;
>>> - int expected, rc = 0;
>>> + int expected, cc = 0;
>>>
>>> if (!pte_present(entry))
>>> return -ENXIO;
>>> @@ -196,12 +196,25 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
>>> if (!page_ref_freeze(page, expected))
>>> return -EBUSY;
>>> set_bit(PG_arch_1, &page->flags);
>>> - rc = uv_call(0, (u64)uvcb);
>>> + /*
>>> + * If the UVC does not succeed or fail immediately, we don't want to
>>> + * loop for long, or we might get stall notifications.
>>> + * On the other hand, this is a complex scenario and we are holding a lot of
>>> + * locks, so we can't easily sleep and reschedule. We try only once,
>>> + * and if the UVC returned busy or partial completion, we return
>>> + * -EAGAIN and we let the callers deal with it.
>>> + */
>>> + cc = __uv_call(0, (u64)uvcb);
>>> page_ref_unfreeze(page, expected);
>>> - /* Return -ENXIO if the page was not mapped, -EINVAL otherwise */
>>> - if (rc)
>>> - rc = uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
>>> - return rc;
>>> + /*
>>> + * Return -ENXIO if the page was not mapped, -EINVAL for other errors.
>>> + * If busy or partially completed, return -EAGAIN.
>>> + */
>>> + if (cc == UVC_CC_OK)
>>> + return 0;
>>> + else if (cc == UVC_CC_BUSY || cc == UVC_CC_PARTIAL)
>>> + return -EAGAIN;
>>> + return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
>>> }
>>>
>>> /*
>>> @@ -254,6 +267,10 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
>>> mmap_read_unlock(gmap->mm);
>>>
>>> if (rc == -EAGAIN) {
>>> + /*
>>> + * If we are here because the UVC returned busy or partial
>>> + * completion, this is just a useless check, but it is safe.
>>> + */
>>> wait_on_page_writeback(page);
>>> } else if (rc == -EBUSY) {
>>> /*
>>>
>
The subject should say
KVM: s390: pv: leak the topmost page table when destroy fails
On 18.08.21 15:26, Claudio Imbrenda wrote:
> When a protected VM is created, the topmost level of page tables of its
> ASCE is marked by the Ultravisor; any attempt to use that memory for
> protected virtualization will result in failure.
Maybe rephrase that to:
Each secure guest must have a unique address space control element (ASCE),
and we must make sure that new guests cannot use the same ASCE, to avoid
errors. As the ASCE mostly consists of the topmost page table address
(and flags), we must not return that memory to the pool unless the ASCE
is no longer used.

Only a successful Destroy Configuration UVC will make the ASCE no longer
collide.
When the Destroy Configuration UVC fails, the ASCE cannot be reused for a
secure guest ASCE. To avoid a collision, it must not be used again.
> Only a successful Destroy Configuration UVC will remove the marking.
>
> When the Destroy Configuration UVC fails, the topmost level of page
> tables of the VM does not get its marking cleared; to avoid issues it
> must not be used again.
>
> This is a permanent error and the page becomes in practice unusable, so
> we set it aside and leak it.
Maybe add: on failure we already leak other memory that has ultravisor marking (the
variable and base storage for a guest) and not setting the ASCE aside (by
leaking the topmost page table) was an oversight.
Or something like that
Maybe also add that we usually do not expect to see such an error under
normal circumstances.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/include/asm/gmap.h | 2 ++
> arch/s390/kvm/pv.c | 4 ++-
> arch/s390/mm/gmap.c | 55 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 60 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> index 40264f60b0da..746e18bf8984 100644
> --- a/arch/s390/include/asm/gmap.h
> +++ b/arch/s390/include/asm/gmap.h
> @@ -148,4 +148,6 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
> unsigned long gaddr, unsigned long vmaddr);
> int gmap_mark_unmergeable(void);
> void s390_reset_acc(struct mm_struct *mm);
> +void s390_remove_old_asce(struct gmap *gmap);
> +int s390_replace_asce(struct gmap *gmap);
> #endif /* _ASM_S390_GMAP_H */
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 00d272d134c2..76b0d64ce8fa 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -168,9 +168,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> atomic_set(&kvm->mm->context.is_protected, 0);
> KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> - /* Inteded memory leak on "impossible" error */
> + /* Intended memory leak on "impossible" error */
> if (!cc)
> kvm_s390_pv_dealloc_vm(kvm);
> + else
> + s390_replace_asce(kvm->arch.gmap);
> return cc ? -EIO : 0;
> }
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 9bb2c7512cd5..5a138f6220c4 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2706,3 +2706,58 @@ void s390_reset_acc(struct mm_struct *mm)
> mmput(mm);
> }
> EXPORT_SYMBOL_GPL(s390_reset_acc);
> +
> +/*
> + * Remove the topmost level of page tables from the list of page tables of
> + * the gmap.
> + * This means that it will not be freed when the VM is torn down, and needs
> + * to be handled separately by the caller, unless the leak is
> + * intentional.
> + */
> +void s390_remove_old_asce(struct gmap *gmap)
> +{
> + struct page *old;
> +
> + old = virt_to_page(gmap->table);
> + spin_lock(&gmap->guest_table_lock);
> + list_del(&old->lru);
> + spin_unlock(&gmap->guest_table_lock);
> + /* in case the ASCE needs to be "removed" multiple times */
> + INIT_LIST_HEAD(&old->lru);
shouldn't that also be under the spin_lock?
> +}
> +EXPORT_SYMBOL_GPL(s390_remove_old_asce);
> +
> +/*
> + * Try to replace the current ASCE with another equivalent one.
> + * If the allocation of the new top level page table fails, the ASCE is not
> + * replaced.
> + * In any case, the old ASCE is removed from the list, therefore the caller
> + * has to make sure to save a pointer to it beforehand, unless the
> + * leak is intentional.
> + */
> +int s390_replace_asce(struct gmap *gmap)
> +{
> + unsigned long asce;
> + struct page *page;
> + void *table;
> +
> + s390_remove_old_asce(gmap);
> +
> + page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
> + if (!page)
> + return -ENOMEM;
It seems that we do not handle errors in our caller?
> + table = page_to_virt(page);
> + memcpy(table, gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
> +
> + spin_lock(&gmap->guest_table_lock);
> + list_add(&page->lru, &gmap->crst_list);
> + spin_unlock(&gmap->guest_table_lock);
> +
> + asce = (gmap->asce & ~PAGE_MASK) | __pa(table);
Instead of PAGE_MASK better use _ASCE_ORIGIN ?
> + WRITE_ONCE(gmap->asce, asce);
> + WRITE_ONCE(gmap->mm->context.gmap_asce, asce);
> + WRITE_ONCE(gmap->table, table);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(s390_replace_asce);
>
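
On the spin_lock question above: folding the re-initialization into the
critical section would presumably look like this (a sketch only; same
semantics, just a narrower race window):

	void s390_remove_old_asce(struct gmap *gmap)
	{
		struct page *old = virt_to_page(gmap->table);

		spin_lock(&gmap->guest_table_lock);
		list_del(&old->lru);
		/*
		 * Re-init under the same lock, so that nobody can observe
		 * the stale list pointers between the del and the re-init.
		 */
		INIT_LIST_HEAD(&old->lru);
		spin_unlock(&gmap->guest_table_lock);
	}
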
On 18.08.21 15:26, Claudio Imbrenda wrote:
> Introduce variants of the convert and destroy page functions that also
> clear the PG_arch_1 bit used to mark them as secure pages.
>
> These new functions can only be called on pages for which a reference
> is already being held.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Acked-by: Janosch Frank <[email protected]>
Can you refresh my mind? We do have over-indication of PG_arch_1 and this
might result in spending some unneeded cycles but in the end this will be
correct. Right?
And this patch will fix some unnecessary places that add overindication.
> ---
> arch/s390/include/asm/pgtable.h | 9 ++++++---
> arch/s390/include/asm/uv.h | 10 ++++++++--
> arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
> arch/s390/mm/gmap.c | 4 +++-
> 4 files changed, 50 insertions(+), 7 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index dcac7b2df72c..0f1af2232ebe 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> pte_t res;
>
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> @@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
> pte_t res;
>
> res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(vma->vm_mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> @@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> } else {
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> }
> + /* At this point the reference through the mapping is still present */
> if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index b35add51b967..3236293d5a31 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -356,8 +356,9 @@ static inline int is_prot_virt_host(void)
> }
>
> int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
> -int uv_destroy_page(unsigned long paddr);
> +int uv_destroy_owned_page(unsigned long paddr);
> int uv_convert_from_secure(unsigned long paddr);
> +int uv_convert_owned_from_secure(unsigned long paddr);
> int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
>
> void setup_uv(void);
> @@ -367,7 +368,7 @@ void adjust_to_uv_max(unsigned long *vmax);
> static inline void setup_uv(void) {}
> static inline void adjust_to_uv_max(unsigned long *vmax) {}
>
> -static inline int uv_destroy_page(unsigned long paddr)
> +static inline int uv_destroy_owned_page(unsigned long paddr)
> {
> return 0;
> }
> @@ -376,6 +377,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
> {
> return 0;
> }
> +
> +static inline int uv_convert_owned_from_secure(unsigned long paddr)
> +{
> + return 0;
> +}
> #endif
>
> #if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 68a8fbafcb9c..05f8bf61d20a 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
> *
> * @paddr: Absolute host address of page to be destroyed
> */
> -int uv_destroy_page(unsigned long paddr)
> +static int uv_destroy_page(unsigned long paddr)
> {
> struct uv_cb_cfs uvcb = {
> .header.cmd = UVC_CMD_DESTR_SEC_STOR,
> @@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
> return 0;
> }
>
> +/*
> + * The caller must already hold a reference to the page
> + */
> +int uv_destroy_owned_page(unsigned long paddr)
> +{
> + struct page *page = phys_to_page(paddr);
> + int rc;
> +
> + get_page(page);
> + rc = uv_destroy_page(paddr);
> + if (!rc)
> + clear_bit(PG_arch_1, &page->flags);
> + put_page(page);
> + return rc;
> +}
> +
> /*
> * Requests the Ultravisor to encrypt a guest page and make it
> * accessible to the host for paging (export).
> @@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
> return 0;
> }
>
> +/*
> + * The caller must already hold a reference to the page
> + */
> +int uv_convert_owned_from_secure(unsigned long paddr)
> +{
> + struct page *page = phys_to_page(paddr);
> + int rc;
> +
> + get_page(page);
> + rc = uv_convert_from_secure(paddr);
> + if (!rc)
> + clear_bit(PG_arch_1, &page->flags);
> + put_page(page);
> + return rc;
> +}
> +
> /*
> * Calculate the expected ref_count for a page that would otherwise have no
> * further pins. This was cribbed from similar functions in other places in
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 5a138f6220c4..38b792ab57f7 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
> {
> pte_t pte = READ_ONCE(*ptep);
>
> + /* There is a reference through the mapping */
> if (pte_present(pte))
> - WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
> + WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
> +
> return 0;
> }
>
>
On Mon, 6 Sep 2021 17:46:40 +0200
Christian Borntraeger <[email protected]> wrote:
> On 18.08.21 15:26, Claudio Imbrenda wrote:
> > Introduce variants of the convert and destroy page functions that also
> > clear the PG_arch_1 bit used to mark them as secure pages.
> >
> > These new functions can only be called on pages for which a reference
> > is already being held.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > Acked-by: Janosch Frank <[email protected]>
>
> Can you refresh my mind? We do have over-indication of PG_arch_1 and this
> might result in spending some unneeded cycles but in the end this will be
> correct. Right?
> And this patch will fix some unnecessary places that add overindication.
Correct, PG_arch_1 will still overindicate, but with this patch it will
happen less.

And PG_arch_1 overindication is perfectly fine from a correctness point
of view.
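
For context, the reason over-indication is harmless is the fast path in
arch_make_page_accessible() (a paraphrased sketch, not the exact code):

	if (!test_bit(PG_arch_1, &page->flags))
		return 0;	/* definitely not secure, nothing to do */
	/*
	 * The bit may over-indicate: the page might already be
	 * non-secure. The export below is then superfluous but
	 * harmless, and the bit is cleared on success.
	 */
	rc = uv_convert_from_secure(page_to_phys(page));
	if (!rc)
		clear_bit(PG_arch_1, &page->flags);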
> > ---
> > arch/s390/include/asm/pgtable.h | 9 ++++++---
> > arch/s390/include/asm/uv.h | 10 ++++++++--
> > arch/s390/kernel/uv.c | 34 ++++++++++++++++++++++++++++++++-
> > arch/s390/mm/gmap.c | 4 +++-
> > 4 files changed, 50 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> > index dcac7b2df72c..0f1af2232ebe 100644
> > --- a/arch/s390/include/asm/pgtable.h
> > +++ b/arch/s390/include/asm/pgtable.h
> > @@ -1074,8 +1074,9 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> > pte_t res;
> >
> > res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> > + /* At this point the reference through the mapping is still present */
> > if (mm_is_protected(mm) && pte_present(res))
> > - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> > + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> > return res;
> > }
> >
> > @@ -1091,8 +1092,9 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
> > pte_t res;
> >
> > res = ptep_xchg_direct(vma->vm_mm, addr, ptep, __pte(_PAGE_INVALID));
> > + /* At this point the reference through the mapping is still present */
> > if (mm_is_protected(vma->vm_mm) && pte_present(res))
> > - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> > + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> > return res;
> > }
> >
> > @@ -1116,8 +1118,9 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > } else {
> > res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> > }
> > + /* At this point the reference through the mapping is still present */
> > if (mm_is_protected(mm) && pte_present(res))
> > - uv_convert_from_secure(pte_val(res) & PAGE_MASK);
> > + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> > return res;
> > }
> >
> > diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> > index b35add51b967..3236293d5a31 100644
> > --- a/arch/s390/include/asm/uv.h
> > +++ b/arch/s390/include/asm/uv.h
> > @@ -356,8 +356,9 @@ static inline int is_prot_virt_host(void)
> > }
> >
> > int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
> > -int uv_destroy_page(unsigned long paddr);
> > +int uv_destroy_owned_page(unsigned long paddr);
> > int uv_convert_from_secure(unsigned long paddr);
> > +int uv_convert_owned_from_secure(unsigned long paddr);
> > int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr);
> >
> > void setup_uv(void);
> > @@ -367,7 +368,7 @@ void adjust_to_uv_max(unsigned long *vmax);
> > static inline void setup_uv(void) {}
> > static inline void adjust_to_uv_max(unsigned long *vmax) {}
> >
> > -static inline int uv_destroy_page(unsigned long paddr)
> > +static inline int uv_destroy_owned_page(unsigned long paddr)
> > {
> > return 0;
> > }
> > @@ -376,6 +377,11 @@ static inline int uv_convert_from_secure(unsigned long paddr)
> > {
> > return 0;
> > }
> > +
> > +static inline int uv_convert_owned_from_secure(unsigned long paddr)
> > +{
> > + return 0;
> > +}
> > #endif
> >
> > #if defined(CONFIG_PROTECTED_VIRTUALIZATION_GUEST) || IS_ENABLED(CONFIG_KVM)
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index 68a8fbafcb9c..05f8bf61d20a 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -115,7 +115,7 @@ static int uv_pin_shared(unsigned long paddr)
> > *
> > * @paddr: Absolute host address of page to be destroyed
> > */
> > -int uv_destroy_page(unsigned long paddr)
> > +static int uv_destroy_page(unsigned long paddr)
> > {
> > struct uv_cb_cfs uvcb = {
> > .header.cmd = UVC_CMD_DESTR_SEC_STOR,
> > @@ -135,6 +135,22 @@ int uv_destroy_page(unsigned long paddr)
> > return 0;
> > }
> >
> > +/*
> > + * The caller must already hold a reference to the page
> > + */
> > +int uv_destroy_owned_page(unsigned long paddr)
> > +{
> > + struct page *page = phys_to_page(paddr);
> > + int rc;
> > +
> > + get_page(page);
> > + rc = uv_destroy_page(paddr);
> > + if (!rc)
> > + clear_bit(PG_arch_1, &page->flags);
> > + put_page(page);
> > + return rc;
> > +}
> > +
> > /*
> > * Requests the Ultravisor to encrypt a guest page and make it
> > * accessible to the host for paging (export).
> > @@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
> > return 0;
> > }
> >
> > +/*
> > + * The caller must already hold a reference to the page
> > + */
> > +int uv_convert_owned_from_secure(unsigned long paddr)
> > +{
> > + struct page *page = phys_to_page(paddr);
> > + int rc;
> > +
> > + get_page(page);
> > + rc = uv_convert_from_secure(paddr);
> > + if (!rc)
> > + clear_bit(PG_arch_1, &page->flags);
> > + put_page(page);
> > + return rc;
> > +}
> > +
> > /*
> > * Calculate the expected ref_count for a page that would otherwise have no
> > * further pins. This was cribbed from similar functions in other places in
> > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > index 5a138f6220c4..38b792ab57f7 100644
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -2678,8 +2678,10 @@ static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
> > {
> > pte_t pte = READ_ONCE(*ptep);
> >
> > + /* There is a reference through the mapping */
> > if (pte_present(pte))
> > - WARN_ON_ONCE(uv_destroy_page(pte_val(pte) & PAGE_MASK));
> > + WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
> > +
> > return 0;
> > }
> >
> >
On 06.09.21 17:56, Claudio Imbrenda wrote:
> On Mon, 6 Sep 2021 17:46:40 +0200
> Christian Borntraeger <[email protected]> wrote:
>
>> On 18.08.21 15:26, Claudio Imbrenda wrote:
>>> Introduce variants of the convert and destroy page functions that also
>>> clear the PG_arch_1 bit used to mark them as secure pages.
>>>
>>> These new functions can only be called on pages for which a reference
>>> is already being held.
>>>
>>> Signed-off-by: Claudio Imbrenda <[email protected]>
>>> Acked-by: Janosch Frank <[email protected]>
>>
>> Can you refresh my mind? We do have over-indication of PG_arch_1 and this
>> might result in spending some unneeded cycles but in the end this will be
>> correct. Right?
>> And this patch will fix some unnecessary places that add overindication.
>
> Correct, PG_arch_1 will still overindicate, but with this patch it will
> happen less.
>
> And PG_arch_1 overindication is perfectly fine from a correctness point
> of view.
Maybe add something like this to the patch description then.
>>> +/*
>>> + * The caller must already hold a reference to the page
>>> + */
>>> +int uv_destroy_owned_page(unsigned long paddr)
>>> +{
>>> + struct page *page = phys_to_page(paddr);
Do we have to protect against weird mappings without struct page here? I have not
followed the discussion about this topic. Maybe Gerald knows if we can have memory
without struct pages.
>>> + int rc;
>>> +
>>> + get_page(page);
>>> + rc = uv_destroy_page(paddr);
>>> + if (!rc)
>>> + clear_bit(PG_arch_1, &page->flags);
>>> + put_page(page);
>>> + return rc;
>>> +}
>>> +
>>> /*
>>> * Requests the Ultravisor to encrypt a guest page and make it
>>> * accessible to the host for paging (export).
>>> @@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
>>> return 0;
>>> }
>>>
>>> +/*
>>> + * The caller must already hold a reference to the page
>>> + */
>>> +int uv_convert_owned_from_secure(unsigned long paddr)
>>> +{
>>> + struct page *page = phys_to_page(paddr);
Same here. If this is not an issue (and you add something to the patch description as
outlined above)
Reviewed-by: Christian Borntraeger <[email protected]>
On Mon, 6 Sep 2021 17:32:36 +0200
Christian Borntraeger <[email protected]> wrote:
> The subject should say
>
> KVM: s390: pv: leak the topmost page table when destroy fails
>
>
> On 18.08.21 15:26, Claudio Imbrenda wrote:
> > When a protected VM is created, the topmost level of page tables of its
> > ASCE is marked by the Ultravisor; any attempt to use that memory for
> > protected virtualization will result in failure.
>
>
> Maybe rephrase that to:
> Each secure guest must have a unique address space control element (ASCE),
> and we must make sure that new guests cannot use the same ASCE, to avoid
> errors. As the ASCE mostly consists of the topmost page table address
> (and flags), we must not return that memory to the pool unless the ASCE
> is no longer used.
>
> Only a successful Destroy Configuration UVC will make the ASCE no longer
> collide.
> When the Destroy Configuration UVC fails, the ASCE cannot be reused for a
> secure guest ASCE. To avoid a collision, it must not be used again.
>
ok
>
> > Only a successful Destroy Configuration UVC will remove the marking.
> >
> > When the Destroy Configuration UVC fails, the topmost level of page
> > tables of the VM does not get its marking cleared; to avoid issues it
> > must not be used again.
> >
> > This is a permanent error and the page becomes in practice unusable, so
> > we set it aside and leak it.
>
> Maybe add: on failure we already leak other memory that has ultravisor marking (the
> variable and base storage for a guest) and not setting the ASCE aside (by
> leaking the topmost page table) was an oversight.
>
> Or something like that
>
> Maybe also add that we usually do not expect to see such an error under
> normal circumstances.
>
makes sense
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/include/asm/gmap.h | 2 ++
> > arch/s390/kvm/pv.c | 4 ++-
> > arch/s390/mm/gmap.c | 55 ++++++++++++++++++++++++++++++++++++
> > 3 files changed, 60 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
> > index 40264f60b0da..746e18bf8984 100644
> > --- a/arch/s390/include/asm/gmap.h
> > +++ b/arch/s390/include/asm/gmap.h
> > @@ -148,4 +148,6 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
> > unsigned long gaddr, unsigned long vmaddr);
> > int gmap_mark_unmergeable(void);
> > void s390_reset_acc(struct mm_struct *mm);
> > +void s390_remove_old_asce(struct gmap *gmap);
> > +int s390_replace_asce(struct gmap *gmap);
> > #endif /* _ASM_S390_GMAP_H */
> > diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> > index 00d272d134c2..76b0d64ce8fa 100644
> > --- a/arch/s390/kvm/pv.c
> > +++ b/arch/s390/kvm/pv.c
> > @@ -168,9 +168,11 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> > atomic_set(&kvm->mm->context.is_protected, 0);
> > KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> > WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> > - /* Inteded memory leak on "impossible" error */
> > + /* Intended memory leak on "impossible" error */
> > if (!cc)
> > kvm_s390_pv_dealloc_vm(kvm);
> > + else
> > + s390_replace_asce(kvm->arch.gmap);
> > return cc ? -EIO : 0;
> > }
> >
> > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > index 9bb2c7512cd5..5a138f6220c4 100644
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -2706,3 +2706,58 @@ void s390_reset_acc(struct mm_struct *mm)
> > mmput(mm);
> > }
> > EXPORT_SYMBOL_GPL(s390_reset_acc);
> > +
> > +/*
> > + * Remove the topmost level of page tables from the list of page tables of
> > + * the gmap.
> > + * This means that it will not be freed when the VM is torn down, and needs
> > + * to be handled separately by the caller, unless the leak is
> > + * intentional.
> > + */
> > +void s390_remove_old_asce(struct gmap *gmap)
> > +{
> > + struct page *old;
> > +
> > + old = virt_to_page(gmap->table);
> > + spin_lock(&gmap->guest_table_lock);
> > + list_del(&old->lru);
> > + spin_unlock(&gmap->guest_table_lock);
> > + /* in case the ASCE needs to be "removed" multiple times */
> > + INIT_LIST_HEAD(&old->lru);
> shouldn't that also be under the spin_lock?
>
> > +}
> > +EXPORT_SYMBOL_GPL(s390_remove_old_asce);
> > +
> > +/*
> > + * Try to replace the current ASCE with another equivalent one.
> > + * If the allocation of the new top level page table fails, the ASCE is not
> > + * replaced.
> > + * In any case, the old ASCE is removed from the list, therefore the caller
> > + * has to make sure to save a pointer to it beforehand, unless the
> > + * leak is intentional.
> > + */
> > +int s390_replace_asce(struct gmap *gmap)
> > +{
> > + unsigned long asce;
> > + struct page *page;
> > + void *table;
> > +
> > + s390_remove_old_asce(gmap);
> > +
> > + page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
> > + if (!page)
> > + return -ENOMEM;
>
> It seems that we do not handle errors in our caller?
For now we don't, but it doesn't hurt to report an error.
>
> > + table = page_to_virt(page);
> > + memcpy(table, gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
> > +
> > + spin_lock(&gmap->guest_table_lock);
> > + list_add(&page->lru, &gmap->crst_list);
> > + spin_unlock(&gmap->guest_table_lock);
> > +
> > + asce = (gmap->asce & ~PAGE_MASK) | __pa(table);
>
> Instead of PAGE_MASK better use _ASCE_ORIGIN ?
ok
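
i.e. presumably:

	/* keep the flag bits, replace only the table origin */
	asce = (gmap->asce & ~_ASCE_ORIGIN) | __pa(table);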
> > + WRITE_ONCE(gmap->asce, asce);
> > + WRITE_ONCE(gmap->mm->context.gmap_asce, asce);
> > + WRITE_ONCE(gmap->table, table);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(s390_replace_asce);
> >
On Tue, 31 Aug 2021 15:55:07 +0200
Christian Borntraeger <[email protected]> wrote:
> On 18.08.21 15:26, Claudio Imbrenda wrote:
> > If kvm_s390_pv_destroy_cpu is called more than once, we risk calling
> > free_page on a random page, since the sidad field is aliased with the
> > gbea, which is not guaranteed to be zero.
> >
> > The solution is to simply return successfully immediately if the vCPU
> > was already non secure.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > Fixes: 19e1227768863a1469797c13ef8fea1af7beac2c ("KVM: S390: protvirt: Introduce instruction data area bounce buffer")
>
> Patch looks good. Do we have any potential case where we call this twice? In other words,
> do we need the Fixes tag with the code as of today or not?
I think so.

If QEMU calls KVM_PV_DISABLE and it fails, some vCPUs might have been
made non-secure, but the VM itself still counts as secure. QEMU can
then call KVM_PV_DISABLE again, which will try to convert all vCPUs to
non-secure again, triggering this bug.

This scenario will not happen in practice (unless the hardware is
broken).
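
To illustrate, a hypothetical userspace sequence that could hit this
(vm_fd is an assumed KVM VM file descriptor, error handling elided):

	struct kvm_pv_cmd cmd = { .cmd = KVM_PV_DISABLE };

	/* first attempt fails partway: some vCPUs are already non-secure */
	if (ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd) < 0)
		/*
		 * The retry destroys all vCPUs again; without the fix, an
		 * already non-secure vCPU would free_page() whatever the
		 * aliased gbea field happens to contain.
		 */
		ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);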
> > ---
> > arch/s390/kvm/pv.c | 19 +++++++++----------
> > 1 file changed, 9 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> > index c8841f476e91..0a854115100b 100644
> > --- a/arch/s390/kvm/pv.c
> > +++ b/arch/s390/kvm/pv.c
> > @@ -16,18 +16,17 @@
> >
> > int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> > {
> > - int cc = 0;
> > + int cc;
> >
> > - if (kvm_s390_pv_cpu_get_handle(vcpu)) {
> > - cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu),
> > - UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
> > + if (!kvm_s390_pv_cpu_get_handle(vcpu))
> > + return 0;
> > +
> > + cc = uv_cmd_nodata(kvm_s390_pv_cpu_get_handle(vcpu), UVC_CMD_DESTROY_SEC_CPU, rc, rrc);
> > +
> > + KVM_UV_EVENT(vcpu->kvm, 3, "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
> > + vcpu->vcpu_id, *rc, *rrc);
> > + WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x", *rc, *rrc);
> >
> > - KVM_UV_EVENT(vcpu->kvm, 3,
> > - "PROTVIRT DESTROY VCPU %d: rc %x rrc %x",
> > - vcpu->vcpu_id, *rc, *rrc);
> > - WARN_ONCE(cc, "protvirt destroy cpu failed rc %x rrc %x",
> > - *rc, *rrc);
> > - }
> > /* Intended memory leak for something that should never happen. */
> > if (!cc)
> > free_pages(vcpu->arch.pv.stor_base,
> >
On Mon, 6 Sep 2021 18:16:10 +0200
Christian Borntraeger <[email protected]> wrote:
> On 06.09.21 17:56, Claudio Imbrenda wrote:
> > On Mon, 6 Sep 2021 17:46:40 +0200
> > Christian Borntraeger <[email protected]> wrote:
> >
> >> On 18.08.21 15:26, Claudio Imbrenda wrote:
> >>> Introduce variants of the convert and destroy page functions that also
> >>> clear the PG_arch_1 bit used to mark them as secure pages.
> >>>
> >>> These new functions can only be called on pages for which a reference
> >>> is already being held.
> >>>
> >>> Signed-off-by: Claudio Imbrenda <[email protected]>
> >>> Acked-by: Janosch Frank <[email protected]>
> >>
> >> Can you refresh my mind? We do have over-indication of PG_arch_1 and this
> >> might result in spending some unneeded cycles but in the end this will be
> >> correct. Right?
> >> And this patch will fix some unnecessary places that add overindication.
> >
> > Correct, PG_arch_1 will still overindicate, but with this patch it will
> > happen less.
> >
> > And PG_arch_1 overindication is perfectly fine from a correctness point
> > of view.
>
> Maybe add something like this to the patch description then.
>
> >>> +/*
> >>> + * The caller must already hold a reference to the page
> >>> + */
> >>> +int uv_destroy_owned_page(unsigned long paddr)
> >>> +{
> >>> + struct page *page = phys_to_page(paddr);
>
> Do we have to protect against weird mappings without struct page here? I have not
> followed the discussion about this topic. Maybe Gerald knows if we can have memory
> without struct pages.
At first glance, it seems we can't have mappings without a struct page.
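
If a guard were ever wanted, a defensive variant (purely hypothetical,
not part of the patch) could check pfn_valid() before the lookup:

	/* hypothetical guard; callers are expected to pass backed memory */
	if (!pfn_valid(paddr >> PAGE_SHIFT))
		return -EFAULT;
	page = pfn_to_page(paddr >> PAGE_SHIFT);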
>
> >>> + int rc;
> >>> +
> >>> + get_page(page);
> >>> + rc = uv_destroy_page(paddr);
> >>> + if (!rc)
> >>> + clear_bit(PG_arch_1, &page->flags);
> >>> + put_page(page);
> >>> + return rc;
> >>> +}
> >>> +
> >>> /*
> >>> * Requests the Ultravisor to encrypt a guest page and make it
> >>> * accessible to the host for paging (export).
> >>> @@ -154,6 +170,22 @@ int uv_convert_from_secure(unsigned long paddr)
> >>> return 0;
> >>> }
> >>>
> >>> +/*
> >>> + * The caller must already hold a reference to the page
> >>> + */
> >>> +int uv_convert_owned_from_secure(unsigned long paddr)
> >>> +{
> >>> + struct page *page = phys_to_page(paddr);
>
> Same here. If this is not an issue (and you add something to the patch description as
> outlined above)
>
> Reviewed-by: Christian Borntraeger <[email protected]>
>
>