Previously, when a protected VM was rebooted or when it was shut down,
its memory was made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
considering the overhead of the various Ultravisor Calls (UVCs). This
means that a reboot or a shutdown could take a long time, depending on
the amount of memory in use.
This patch series implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory can be cleared
in the background, allowing the guest to restart or terminate significantly
faster than before.
There are two possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)
For the reboot case, two new commands are available for the
KVM_S390_PV_COMMAND ioctl:
KVM_PV_ASYNC_DISABLE_PREPARE: prepares the current protected VM for
asynchronous teardown. The current VM will then continue immediately
as non-protected. If a protected VM had already been set aside without
starting the teardown process, this call will fail. In this case the
userspace process should issue a normal KVM_PV_DISABLE.
KVM_PV_ASYNC_DISABLE: tears down the protected VM previously set aside
for asynchronous teardown. This PV command should ideally be issued by
userspace from a separate thread. If a fatal signal is received (or
the process terminates naturally), the command will terminate
immediately without completing.
The idea is that userspace should first issue the
KVM_PV_ASYNC_DISABLE_PREPARE command, and in case of success, create a
new thread and issue KVM_PV_ASYNC_DISABLE from there. This also allows
for proper accounting of the CPU time needed for the asynchronous
teardown.
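To illustrate the intended flow, here is a minimal userspace sketch. It
is purely illustrative: the helper names are made up, most error
handling is omitted, and it assumes the uapi additions from this series:

#include <stdint.h>
#include <pthread.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void *async_teardown(void *arg)
{
	int vm_fd = (int)(intptr_t)arg;
	struct kvm_pv_cmd cmd = { .cmd = KVM_PV_ASYNC_DISABLE };

	/* potentially long-running; CPU time is accounted to this thread */
	ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);
	return NULL;
}

static int reboot_protected_vm(int vm_fd)
{
	struct kvm_pv_cmd prep = { .cmd = KVM_PV_ASYNC_DISABLE_PREPARE };
	pthread_t tid;

	if (ioctl(vm_fd, KVM_S390_PV_COMMAND, &prep)) {
		/* a VM was already set aside: fall back to the synchronous path */
		struct kvm_pv_cmd dis = { .cmd = KVM_PV_DISABLE };

		return ioctl(vm_fd, KVM_S390_PV_COMMAND, &dis);
	}
	/* the VM continues as non-protected; teardown proceeds in background */
	return pthread_create(&tid, NULL, async_teardown, (void *)(intptr_t)vm_fd);
}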
This means that the same address space can have memory belonging to
more than one protected guest; however, only one of them will be
running, and the others will not even have any CPUs.
The shutdown case should be dealt with in userspace (e.g. using
clone(CLONE_VM)).
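For example (again a purely hypothetical sketch, building on the one
above), the teardown could be handed to a child created with
clone(CLONE_VM), which shares the address space and can therefore
finish the cleanup even after the main process has exited:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

static char teardown_stack[1 << 16];

static int teardown_fn(void *arg)
{
	int vm_fd = (int)(intptr_t)arg;
	struct kvm_pv_cmd cmd = { .cmd = KVM_PV_ASYNC_DISABLE };

	return ioctl(vm_fd, KVM_S390_PV_COMMAND, &cmd);
}

static void shutdown_protected_vm(int vm_fd)
{
	/* hand the cleanup to a child that shares the mm */
	clone(teardown_fn, teardown_stack + sizeof(teardown_stack),
	      CLONE_VM | CLONE_FILES | SIGCHLD, (void *)(intptr_t)vm_fd);
}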
A module parameter is also provided to disable the new functionality,
which is otherwise enabled by default. This should not be an issue,
since the new functionality is opt-in anyway. The parameter is mainly
intended to aid debugging.
v5->v6
* completely reworked the series
* removed kernel thread for asynchronous teardown
* added new commands to KVM_S390_PV_COMMAND ioctl
v4->v5
* fixed and improved some patch descriptions
* added some comments to better explain what's going on
* use vma_lookup instead of find_vma
* rename is_protected to protected_count, since it is now used as a counter
v3->v4
* added patch 2
* split patch 3
* removed the shutdown part -- will be a separate patch series
* moved the patch introducing the module parameter
v2->v3
* added definitions for CC return codes for the UVC instruction
* improved make_secure_pte:
- renamed rc to cc
- added comments to explain why returning -EAGAIN is ok
* fixed kvm_s390_pv_replace_asce and kvm_s390_pv_remove_old_asce:
- renamed
- added locking
- moved to gmap.c
* do proper error management in do_secure_storage_access instead of
trying again hoping to get a different exception
* fix outdated patch descriptions
v1->v2
* rebased on a more recent kernel
* improved/expanded some patch descriptions
* improved/expanded some comments
* added patch 1, which prevents stall notification when the system is
under heavy load.
* rename some members of struct deferred_priv to improve readability
* avoid a use-after-free bug of the struct mm in case of shutdown
* add missing return when lazy destroy is disabled
* add support for OOM notifier
Claudio Imbrenda (17):
KVM: s390: pv: leak the topmost page table when destroy fails
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: make kvm_s390_cpus_from_pv global
KVM: s390: pv: clear the state without memset
KVM: s390: pv: add mmu_notifier
s390/mm: KVM: pv: when tearing down, try to destroy protected pages
KVM: s390: pv: refactoring of kvm_s390_pv_deinit_vm
KVM: s390: pv: cleanup leftover protected VMs if needed
KVM: s390: pv: asynchronous destroy for reboot
KVM: s390: pv: api documentation for asynchronous destroy
KVM: s390: pv: add KVM_CAP_S390_PROT_REBOOT_ASYNC
KVM: s390: pv: avoid export before import if possible
Documentation/virt/kvm/api.rst | 21 ++-
arch/s390/include/asm/gmap.h | 38 +++-
arch/s390/include/asm/kvm_host.h | 3 +
arch/s390/include/asm/mmu.h | 2 +-
arch/s390/include/asm/mmu_context.h | 2 +-
arch/s390/include/asm/pgtable.h | 11 +-
arch/s390/include/asm/uv.h | 1 +
arch/s390/kernel/uv.c | 64 +++++++
arch/s390/kvm/kvm-s390.c | 59 ++++++-
arch/s390/kvm/kvm-s390.h | 3 +
arch/s390/kvm/pv.c | 259 ++++++++++++++++++++++++++--
arch/s390/mm/fault.c | 20 ++-
arch/s390/mm/gmap.c | 152 +++++++++++++---
include/uapi/linux/kvm.h | 3 +
14 files changed, 591 insertions(+), 47 deletions(-)
--
2.31.1
With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation.
This patch adds handling of secure storage violations for protected
guests.
Pages that trigger the exception will be made non-secure before
attempting to use them again for a different secure guest.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/uv.h | 1 +
arch/s390/kernel/uv.c | 55 ++++++++++++++++++++++++++++++++++++++
arch/s390/mm/fault.c | 10 +++++++
3 files changed, 66 insertions(+)
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 72d3e49c2860..cdbd340188ab 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -356,6 +356,7 @@ static inline int is_prot_virt_host(void)
}
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 386d4e42b8d3..f706456f6261 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -334,6 +334,61 @@ int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr)
}
EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
+/**
+ * gmap_destroy_page - Destroy a guest page.
+ * @gmap the gmap of the guest
+ * @gaddr the guest address to destroy
+ *
+ * An attempt will be made to destroy the given guest page. If the attempt
+ * fails, an attempt is made to export the page. If both attempts fail, an
+ * appropriate error is returned.
+ */
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
+{
+ struct vm_area_struct *vma;
+ unsigned long uaddr;
+ struct page *page;
+ int rc;
+
+ rc = -EFAULT;
+ mmap_read_lock(gmap->mm);
+
+ uaddr = __gmap_translate(gmap, gaddr);
+ if (IS_ERR_VALUE(uaddr))
+ goto out;
+ vma = vma_lookup(gmap->mm, uaddr);
+ if (!vma)
+ goto out;
+ /*
+ * Huge pages should not be able to become secure
+ */
+ if (is_vm_hugetlb_page(vma))
+ goto out;
+
+ rc = 0;
+ /* we take an extra reference here */
+ page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
+ if (IS_ERR_OR_NULL(page))
+ goto out;
+ rc = uv_destroy_owned_page(page_to_phys(page));
+ /*
+ * Fault handlers can race; it is possible that two CPUs will fault
+ * on the same secure page. One CPU can destroy the page, reboot,
+ * re-enter secure mode and import it, while the second CPU was
+ * stuck at the beginning of the handler. At some point the second
+ * CPU will be able to progress, and it will not be able to destroy
+ * the page. In that case we do not want to terminate the process,
+ * we instead try to export the page.
+ */
+ if (rc)
+ rc = uv_convert_owned_from_secure(page_to_phys(page));
+ put_page(page);
+out:
+ mmap_read_unlock(gmap->mm);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_destroy_page);
+
/*
* To be called with the page locked or with an extra reference! This will
* prevent gmap_make_secure from touching the page concurrently. Having 2
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index d30f5986fa85..a1928c89bbfa 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -853,6 +853,16 @@ NOKPROBE_SYMBOL(do_non_secure_storage_access);
void do_secure_storage_violation(struct pt_regs *regs)
{
+ unsigned long gaddr = regs->int_parm_long & __FAIL_ADDR_MASK;
+ struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
+
+ /*
+ * If the VM has been rebooted, its address space might still contain
+ * secure pages from the previous boot.
+ * Clear the page so it can be reused.
+ */
+ if (!gmap_destroy_page(gmap, gaddr))
+ return;
/*
* Either KVM messed up the secure guest mapping or the same
* page is mapped into multiple secure guests.
--
2.31.1
Each secure guest must have a unique ASCE (address space control
element); we must prevent new guests from using the same page for
their ASCE, to avoid errors.
Since the ASCE mostly consists of the address of the topmost page table
(plus some flags), we must not return that memory to the pool unless
the ASCE is no longer in use.
Only a successful Destroy Secure Configuration UVC will make the ASCE
reusable again.
If the Destroy Configuration UVC fails, the page cannot be reused for a
secure guest (either for the ASCE or for other memory areas). To avoid
a collision, it must not be used again. This is a permanent error and
the page becomes in practice unusable, so we set it aside and leak it.
On failure we already leak other memory that belongs to the ultravisor
(i.e. the variable and base storage for a guest) and not leaking the
topmost page table was an oversight.
This error (and thus the leakage) should not happen unless the hardware
is broken or KVM has some unknown serious bug.
Signed-off-by: Claudio Imbrenda <[email protected]>
Fixes: 29b40f105ec8d55 ("KVM: s390: protvirt: Add initial vm and cpu lifecycle handling")
---
arch/s390/include/asm/gmap.h | 2 ++
arch/s390/kvm/pv.c | 9 +++--
arch/s390/mm/gmap.c | 65 ++++++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 3 deletions(-)
diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 40264f60b0da..746e18bf8984 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -148,4 +148,6 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
void s390_reset_acc(struct mm_struct *mm);
+void s390_remove_old_asce(struct gmap *gmap);
+int s390_replace_asce(struct gmap *gmap);
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 00d272d134c2..b906658ffc2e 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -168,10 +168,13 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
atomic_set(&kvm->mm->context.is_protected, 0);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
- /* Inteded memory leak on "impossible" error */
- if (!cc)
+ /* Intended memory leak on "impossible" error */
+ if (!cc) {
kvm_s390_pv_dealloc_vm(kvm);
- return cc ? -EIO : 0;
+ return 0;
+ }
+ s390_replace_asce(kvm->arch.gmap);
+ return -EIO;
}
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index dfee0ebb2fac..5d6618016263 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2714,3 +2714,68 @@ void s390_reset_acc(struct mm_struct *mm)
mmput(mm);
}
EXPORT_SYMBOL_GPL(s390_reset_acc);
+
+/**
+ * s390_remove_old_asce - Remove the topmost level of page tables from the
+ * list of page tables of the gmap.
+ * @gmap the gmap whose table is to be removed
+ *
+ * This means that it will not be freed when the VM is torn down, and needs
+ * to be handled separately by the caller, unless the leak is intentional.
+ */
+void s390_remove_old_asce(struct gmap *gmap)
+{
+ struct page *old;
+
+ old = virt_to_page(gmap->table);
+ spin_lock(&gmap->guest_table_lock);
+ list_del(&old->lru);
+ spin_unlock(&gmap->guest_table_lock);
+ /* in case the ASCE needs to be "removed" multiple times */
+ INIT_LIST_HEAD(&old->lru);
+}
+EXPORT_SYMBOL_GPL(s390_remove_old_asce);
+
+/**
+ * s390_replace_asce - Try to replace the current ASCE of a gmap with
+ * another equivalent one.
+ * @gmap the gmap
+ *
+ * If the allocation of the new top level page table fails, the ASCE is not
+ * replaced.
+ * In any case, the old ASCE is always removed from the list. Therefore the
+ * caller has to make sure to save a pointer to it beforehand, unless the
+ * leak is intentional.
+ */
+int s390_replace_asce(struct gmap *gmap)
+{
+ unsigned long asce;
+ struct page *page;
+ void *table;
+
+ s390_remove_old_asce(gmap);
+
+ page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+ if (!page)
+ return -ENOMEM;
+ table = page_to_virt(page);
+ memcpy(table, gmap->table, 1UL << (CRST_ALLOC_ORDER + PAGE_SHIFT));
+
+ /*
+ * The caller has to deal with the old ASCE, but here we make sure
+ * the new one is properly added to the list of page tables, so that
+ * it will be freed when the VM is torn down.
+ */
+ spin_lock(&gmap->guest_table_lock);
+ list_add(&page->lru, &gmap->crst_list);
+ spin_unlock(&gmap->guest_table_lock);
+
+ asce = (gmap->asce & ~PAGE_MASK) | __pa(table);
+ WRITE_ONCE(gmap->asce, asce);
+ WRITE_ONCE(gmap->mm->context.gmap_asce, asce);
+ WRITE_ONCE(gmap->table, table);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(s390_replace_asce);
--
2.31.1
With upcoming patches, normal guests might touch secure pages.
This patch extends the existing exception handler to also convert the
pages to non-secure when the exception is triggered by a normal guest.
This can happen, for example, when a secure guest reboots; the first
stage of a secure guest is non-secure, and in general a secure guest
can reboot into non-secure mode.
If the secure memory of the previous boot has not been cleared up
completely yet (which will be allowed to happen in an upcoming patch),
a non-secure guest might touch secure memory, which will need to be
handled properly.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/mm/fault.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index a1928c89bbfa..a644e593eef9 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -770,6 +770,7 @@ void do_secure_storage_access(struct pt_regs *regs)
struct vm_area_struct *vma;
struct mm_struct *mm;
struct page *page;
+ struct gmap *gmap;
int rc;
/*
@@ -799,6 +800,14 @@ void do_secure_storage_access(struct pt_regs *regs)
}
switch (get_fault_type(regs)) {
+ case GMAP_FAULT:
+ gmap = (struct gmap *)S390_lowcore.gmap;
+ addr = __gmap_translate(gmap, addr);
+ if (IS_ERR_VALUE(addr)) {
+ do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
+ break;
+ }
+ fallthrough;
case USER_FAULT:
mm = current->mm;
mmap_read_lock(mm);
@@ -827,7 +836,6 @@ void do_secure_storage_access(struct pt_regs *regs)
if (rc)
BUG();
break;
- case GMAP_FAULT:
default:
do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
WARN_ON_ONCE(1);
--
2.31.1
The function kvm_s390_cpus_from_pv needs to be called from pv.c, so
make it global.
Take the opportunity to add documentation.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 26 +++++++++++++++++++++++++-
arch/s390/kvm/kvm-s390.h | 1 +
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 0c08066067e8..850d86995083 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2206,7 +2206,20 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
return r;
}
-static int kvm_s390_cpus_from_pv(struct kvm *kvm, u16 *rcp, u16 *rrcp)
+/**
+ * kvm_s390_cpus_from_pv - Convert all protected vCPUs in a protected VM to
+ * non-protected.
+ * @kvm the VM whose protected vCPUs are to be converted
+ * @rcp return value for the RC field of the UVC (in case of error)
+ * @rrcp return value for the RRC field of the UVC (in case of error)
+ *
+ * Does not stop in case of error, tries to convert as many
+ * CPUs as possible. In case of error, the RC and RRC of the last error are
+ * returned.
+ *
+ * Return: 0 in case of success, otherwise -EIO
+ */
+int kvm_s390_cpus_from_pv(struct kvm *kvm, u16 *rcp, u16 *rrcp)
{
struct kvm_vcpu *vcpu;
u16 rc, rrc;
@@ -2233,6 +2246,17 @@ static int kvm_s390_cpus_from_pv(struct kvm *kvm, u16 *rcp, u16 *rrcp)
return ret;
}
+/**
+ * kvm_s390_cpus_to_pv - Convert all non-protected vCPUs in a protected VM
+ * to protected.
+ * @kvm the VM whose protected vCPUs are to be converted
+ * @rcp return value for the RC field of the UVC (in case of error)
+ * @rrcp return value for the RRC field of the UVC (in case of error)
+ *
+ * Tries to undo the conversion in case of error.
+ *
+ * Return: 0 in case of success, otherwise -EIO
+ */
static int kvm_s390_cpus_to_pv(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int i, r = 0;
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index c07a050d757d..ef9429cb0307 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -351,6 +351,7 @@ int kvm_s390_vcpu_setup_cmma(struct kvm_vcpu *vcpu);
void kvm_s390_vcpu_unsetup_cmma(struct kvm_vcpu *vcpu);
void kvm_s390_set_cpu_timer(struct kvm_vcpu *vcpu, __u64 cputm);
__u64 kvm_s390_get_cpu_timer(struct kvm_vcpu *vcpu);
+int kvm_s390_cpus_from_pv(struct kvm *kvm, u16 *rcp, u16 *rrcp);
/* implemented in diag.c */
int kvm_s390_handle_diag(struct kvm_vcpu *vcpu);
--
2.31.1
Add the module parameter "lazy_destroy" to allow the asynchronous destroy
mechanism to be switched off. This might be useful for debugging purposes.
The parameter is enabled by default.
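Since the parameter permissions are 0444, it is read-only at runtime
and can only be set at load time, e.g. with "modprobe kvm
lazy_destroy=0", or with "kvm.lazy_destroy=0" on the kernel command
line if KVM is built in.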
Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 14a18ba5ff2c..0c08066067e8 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -207,6 +207,11 @@ unsigned int diag9c_forwarding_hz;
module_param(diag9c_forwarding_hz, uint, 0644);
MODULE_PARM_DESC(diag9c_forwarding_hz, "Maximum diag9c forwarding per second, 0 to turn off");
+/* allow asynchronous deinit for protected guests */
+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Asynchronous destroy for protected guests");
+
/*
* For now we handle at most 16 double words as this is what the s390 base
* kernel handles and stores in the prefix page. If we ever need to go beyond
--
2.31.1
Due to upcoming changes, it will be possible to temporarily have
multiple protected VMs in the same address space, although only one
will actually be active.
In that scenario, it is necessary to perform an export of every page
that is to be imported, since the hardware does not allow a page
belonging to a protected guest to be imported into a different
protected guest.
This also applies to pages that are shared, and thus accessible by the
host.
Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kernel/uv.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index f706456f6261..ec01c3f8b13c 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -234,6 +234,12 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
return uvcb->rc == 0x10a ? -ENXIO : -EINVAL;
}
+static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
+{
+ return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ atomic_read(&mm->context.protected_count) > 1;
+}
+
/*
* Requests the Ultravisor to make a page accessible to a guest.
* If it's brought in the first time, it will be cleared. If
@@ -277,6 +283,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
lock_page(page);
ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+ if (should_export_before_import(uvcb, gmap->mm))
+ uv_convert_from_secure(page_to_phys(page));
rc = make_secure_pte(ptep, uaddr, page, uvcb);
pte_unmap_unlock(ptep, ptelock);
unlock_page(page);
--
2.31.1
Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to remove the protected
pages from its address space.
This patch implements an asynchronous destroy mechanism, that allows a
protected guest to reboot significantly faster than previously.
This is achieved by clearing the pages of the old guest in the background.
In case of reboot, the new guest will be able to run in the same
address space almost immediately.
The old protected guest is then only destroyed when all of its memory has
been destroyed or otherwise made non-protected.
Two new PV commands are added for the KVM_S390_PV_COMMAND ioctl:
KVM_PV_ASYNC_DISABLE_PREPARE: prepares the current protected VM for
asynchronous teardown. The current VM will then continue immediately
as non-protected. If a protected VM had already been set aside without
starting the teardown process, this call will fail.
KVM_PV_ASYNC_DISABLE: tears down the protected VM previously set aside
for asynchronous teardown. This PV command should ideally be issued by
userspace from a separate thread. If a fatal signal is received (or the
process terminates naturally), the command will terminate immediately
without completing.
Leftover protected VMs are cleaned up when a KVM VM is torn down
normally (either via IOCTL or when the process terminates); this
cleanup has been implemented in a previous patch.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 24 ++++++++
arch/s390/kvm/kvm-s390.h | 2 +
arch/s390/kvm/pv.c | 126 +++++++++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 2 +
4 files changed, 154 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 1b83fd3b1322..a3710e21099f 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2314,6 +2314,30 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
break;
}
+ case KVM_PV_ASYNC_DISABLE_PREPARE:
+ r = -EINVAL;
+ if (!kvm_s390_pv_is_protected(kvm) || !lazy_destroy)
+ break;
+
+ r = kvm_s390_cpus_from_pv(kvm, &cmd->rc, &cmd->rrc);
+ /*
+ * If a CPU could not be destroyed, destroy VM will also fail.
+ * There is no point in trying to destroy it. Instead return
+ * the rc and rrc from the first CPU that failed destroying.
+ */
+ if (r)
+ break;
+ r = kvm_s390_pv_deinit_vm_async_prepare(kvm, &cmd->rc, &cmd->rrc);
+
+ /* no need to block service interrupts any more */
+ clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
+ break;
+ case KVM_PV_ASYNC_DISABLE:
+ r = -EINVAL;
+ if (!kvm->arch.pv.async_deinit)
+ break;
+ r = kvm_s390_pv_deinit_vm_async(kvm, &cmd->rc, &cmd->rrc);
+ break;
case KVM_PV_DISABLE: {
r = -EINVAL;
if (!kvm_s390_pv_is_protected(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index ef9429cb0307..0168db8cd2c2 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -220,6 +220,8 @@ static inline void kvm_s390_set_user_cpu_state_ctrl(struct kvm *kvm)
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_async_prepare(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_async(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 825d69601fe5..929ffea6a8cc 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -264,6 +264,132 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}
+/**
+ * kvm_s390_clear_2g - Clear the first 2GB of guest memory.
+ * @kvm the VM whose memory is to be cleared.
+ * Clear the first 2GB of guest memory, to avoid prefix issues after reboot.
+ */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ unsigned long lim;
+ int srcu_idx;
+
+ srcu_idx = srcu_read_lock(&kvm->srcu);
+
+ slot = gfn_to_memslot(kvm, 0);
+ /* Clear all slots that are completely below 2GB */
+ while (slot && slot->base_gfn + slot->npages < SZ_2G / PAGE_SIZE) {
+ lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, slot->userspace_addr, lim);
+ slot = gfn_to_memslot(kvm, slot->base_gfn + slot->npages);
+ }
+ /* Last slot crosses the 2G boundary, clear only up to 2GB */
+ if (slot && slot->base_gfn < SZ_2G / PAGE_SIZE) {
+ lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, slot->userspace_addr, lim);
+ }
+
+ srcu_read_unlock(&kvm->srcu, srcu_idx);
+}
+
+/**
+ * kvm_s390_pv_deinit_vm_async_prepare - Prepare a protected VM for
+ * asynchronous teardown.
+ * @kvm the VM
+ * @rc return value for the RC field of the UVCB
+ * @rrc return value for the RRC field of the UVCB
+ *
+ * Prepare the protected VM for asynchronous teardown. The VM will be able
+ * to continue immediately as a non-secure VM, and the information needed to
+ * properly tear down the protected VM is set aside. If another protected VM
+ * was already set aside without starting a teardown, the function will
+ * fail.
+ *
+ * Context: kvm->lock needs to be held
+ *
+ * Return: 0 in case of success, -EINVAL if another protected VM was already set
+ * aside, -ENOMEM if the system ran out of memory.
+ */
+int kvm_s390_pv_deinit_vm_async_prepare(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *priv;
+
+ /*
+ * If an asynchronous deinitialization is already pending, refuse.
+ * A synchronous deinitialization has to be performed instead.
+ */
+ if (kvm->arch.pv.async_deinit)
+ return -EINVAL;
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->stor_var = kvm->arch.pv.stor_var;
+ priv->stor_base = kvm->arch.pv.stor_base;
+ priv->handle = kvm_s390_pv_get_handle(kvm);
+ priv->old_table = (unsigned long)kvm->arch.gmap->table;
+ WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+ if (s390_replace_asce(kvm->arch.gmap)) {
+ kfree(priv);
+ return -ENOMEM;
+ }
+
+ kvm_s390_clear_2g(kvm);
+ kvm_s390_clear_pv_state(kvm);
+ kvm->arch.pv.async_deinit = priv;
+
+ *rc = 1;
+ *rrc = 42;
+ return 0;
+}
+
+/**
+ * kvm_s390_pv_deinit_vm_async - Perform an asynchronous teardown of a
+ * protected VM.
+ * @kvm the VM previously associated with the protected VM
+ * @rc return value for the RC field of the UVCB
+ * @rrc return value for the RRC field of the UVCB
+ *
+ * Tear down the protected VM that had previously been set aside using
+ * kvm_s390_pv_deinit_vm_async_prepare.
+ *
+ * Context: kvm->lock needs to be held
+ *
+ * Return: 0 in case of success, -EINVAL if no protected VM had been
+ * prepared for asynchronous teardown, -EIO in case of other errors.
+ */
+int kvm_s390_pv_deinit_vm_async(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *p = kvm->arch.pv.async_deinit;
+ int ret = 0;
+
+ if (!p)
+ return -EINVAL;
+ kvm->arch.pv.async_deinit = NULL;
+ mutex_unlock(&kvm->lock);
+
+ /* When a fatal signal is received, stop immediately */
+ if (s390_uv_destroy_range_interruptible(kvm->mm, 0, TASK_SIZE_MAX))
+ goto done;
+ if (kvm_s390_pv_cleanup_deferred(kvm, p))
+ ret = -EIO;
+ else
+ atomic_dec(&kvm->mm->context.protected_count);
+ kfree(p);
+ p = NULL;
+done:
+ /* The caller expects the lock to be held */
+ mutex_lock(&kvm->lock);
+ /*
+ * p is not NULL if we aborted because of a fatal signal, in which
+ * case queue the leftover for later cleanup.
+ */
+ if (p)
+ list_add(&p->list, &kvm->arch.pv.need_cleanup);
+ return ret;
+}
+
static void kvm_s390_pv_mmu_notifier_release(struct mmu_notifier *subscription,
struct mm_struct *mm)
{
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..5166c90d1a1e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1629,6 +1629,8 @@ enum pv_cmd_id {
KVM_PV_VERIFY,
KVM_PV_PREP_RESET,
KVM_PV_UNSHARE_ALL,
+ KVM_PV_ASYNC_DISABLE_PREPARE,
+ KVM_PV_ASYNC_DISABLE,
};
struct kvm_pv_cmd {
--
2.31.1
Add documentation for the new commands added to the KVM_S390_PV_COMMAND
ioctl.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
Documentation/virt/kvm/api.rst | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..f4d981cd58fa 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4975,11 +4975,13 @@ KVM_PV_ENABLE
===== =============================
KVM_PV_DISABLE
-
Deregister the VM from the Ultravisor and reclaim the memory that
had been donated to the Ultravisor, making it usable by the kernel
- again. All registered VCPUs are converted back to non-protected
- ones.
+ again. All registered VCPUs are converted back to non-protected
+ ones. If a previous VM had been prepared for asynchronous teardown
+ with KVM_PV_ASYNC_DISABLE_PREPARE and not actually torn down with
+ KVM_PV_ASYNC_DISABLE, it will be torn down in this call together with
+ the current VM.
KVM_PV_VM_SET_SEC_PARMS
Pass the image header from VM memory to the Ultravisor in
@@ -4992,6 +4994,19 @@ KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.
+KVM_PV_ASYNC_DISABLE_PREPARE
+ Prepare the current protected VM for asynchronous teardown. The current
+ VM will then continue immediately as non-protected. If a protected VM had
+ already been set aside without starting the teardown process, this call
+ will fail. In this case the userspace process should issue a normal
+ KVM_PV_DISABLE.
+
+KVM_PV_ASYNC_DISABLE
+ Tear down the protected VM previously set aside for asynchronous teardown.
+ This PV command should ideally be issued by userspace from a separate
+ thread. If a fatal signal is received (or the process terminates
+ naturally), the command will terminate immediately without completing.
+
4.126 KVM_X86_SET_MSR_FILTER
----------------------------
--
2.31.1
If the appropriate UV feature bit is set, there is no need to perform
an export before import.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kernel/uv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index ec01c3f8b13c..a29c7b3085b1 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -236,7 +236,8 @@ static int make_secure_pte(pte_t *ptep, unsigned long addr,
static bool should_export_before_import(struct uv_cb_header *uvcb, struct mm_struct *mm)
{
- return uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
+ return !test_bit_inv(BIT_UV_FEAT_MISC, &uv_info.uv_feature_indications) &&
+ uvcb->cmd != UVC_CMD_UNPIN_PAGE_SHARED &&
atomic_read(&mm->context.protected_count) > 1;
}
--
2.31.1
Add KVM_CAP_S390_PROT_REBOOT_ASYNC to signal that the
KVM_PV_ASYNC_DISABLE and KVM_PV_ASYNC_DISABLE_PREPARE commands for the
KVM_S390_PV_COMMAND ioctl are available.
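For reference, userspace can probe for the new commands in the usual
way; a minimal sketch, assuming kvm_fd is an open file descriptor to
/dev/kvm:

	if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_S390_PROT_REBOOT_ASYNC) > 0) {
		/* KVM_PV_ASYNC_DISABLE{,_PREPARE} are available */
	}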
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 3 +++
include/uapi/linux/kvm.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index a3710e21099f..8aa274483a99 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -608,6 +608,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_S390_BPB:
r = test_facility(82);
break;
+ case KVM_CAP_S390_PROT_REBOOT_ASYNC:
+ r = lazy_destroy && is_prot_virt_host();
+ break;
case KVM_CAP_S390_PROTECTED:
r = is_prot_virt_host();
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5166c90d1a1e..ac3b7797a263 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1131,6 +1131,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
#define KVM_CAP_ARM_MTE 205
#define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
+#define KVM_CAP_S390_PROT_REBOOT_ASYNC 207
#ifdef KVM_CAP_IRQ_ROUTING
--
2.31.1
In upcoming patches it will be possible to start tearing down a
protected VM, and finish the teardown concurrently in a different
thread.
Protected VMs that are pending teardown ("leftover" VMs) need to be
cleaned up properly when the userspace process (e.g. qemu) terminates.
This patch makes sure that all "leftover" protected VMs are always
properly torn down.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/kvm_host.h | 2 +
arch/s390/kvm/kvm-s390.c | 1 +
arch/s390/kvm/pv.c | 69 ++++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+)
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 8b288cff2a3a..948c9d4b0b51 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -921,6 +921,8 @@ struct kvm_s390_pv {
u64 guest_len;
unsigned long stor_base;
void *stor_var;
+ void *async_deinit;
+ struct list_head need_cleanup;
struct mmu_notifier mmu_notifier;
};
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 850d86995083..1b83fd3b1322 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2817,6 +2817,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_s390_vsie_init(kvm);
if (use_gisa)
kvm_s390_gisa_init(kvm);
+ INIT_LIST_HEAD(&kvm->arch.pv.need_cleanup);
KVM_EVENT(3, "vm 0x%pK created by pid %u", kvm, current->pid);
return 0;
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index a54f6e98fa6f..825d69601fe5 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -16,6 +16,19 @@
#include <linux/sched/mm.h>
#include "kvm-s390.h"
+/**
+ * @struct deferred_priv
+ * Represents a "leftover" protected VM that does not correspond to any
+ * active KVM VM.
+ */
+struct deferred_priv {
+ struct list_head list;
+ unsigned long old_table;
+ u64 handle;
+ void *stor_var;
+ unsigned long stor_base;
+};
+
static void kvm_s390_clear_pv_state(struct kvm *kvm)
{
kvm->arch.pv.handle = 0;
@@ -165,6 +178,60 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
return -ENOMEM;
}
+/**
+ * kvm_s390_pv_cleanup_deferred - Clean up one leftover protected VM.
+ * @kvm the KVM that was associated with this leftover protected VM
+ * @deferred details about the leftover protected VM that needs a clean up
+ * Return: 0 in case of success, otherwise 1
+ */
+static int kvm_s390_pv_cleanup_deferred(struct kvm *kvm, struct deferred_priv *deferred)
+{
+ u16 rc, rrc;
+ int cc;
+
+ cc = uv_cmd_nodata(deferred->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", rc, rrc);
+ WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+ if (cc)
+ return cc;
+ /*
+ * Intentionally leak unusable memory. If the UVC fails, the memory
+ * used for the VM and its metadata is permanently unusable.
+ * This can only happen in case of a serious KVM or hardware bug; it
+ * is not expected to happen in normal operation.
+ */
+ free_pages(deferred->stor_base, get_order(uv_info.guest_base_stor_len));
+ free_pages(deferred->old_table, CRST_ALLOC_ORDER);
+ vfree(deferred->stor_var);
+ return 0;
+}
+
+/**
+ * kvm_s390_pv_cleanup_leftovers - Clean up all leftover protected VMs.
+ * @kvm the KVM whose leftover protected VMs are to be cleaned up
+ * Return: 0 in case of success, otherwise 1
+ */
+static int kvm_s390_pv_cleanup_leftovers(struct kvm *kvm)
+{
+ struct deferred_priv *deferred;
+ int cc = 0;
+
+ if (kvm->arch.pv.async_deinit)
+ list_add(kvm->arch.pv.async_deinit, &kvm->arch.pv.need_cleanup);
+
+ while (!list_empty(&kvm->arch.pv.need_cleanup)) {
+ deferred = list_first_entry(&kvm->arch.pv.need_cleanup, typeof(*deferred), list);
+ if (kvm_s390_pv_cleanup_deferred(kvm, deferred))
+ cc = 1;
+ else
+ atomic_dec(&kvm->mm->context.protected_count);
+ list_del(&deferred->list);
+ kfree(deferred);
+ }
+ kvm->arch.pv.async_deinit = NULL;
+ return cc;
+}
+
/* this should not fail, but if it does, we must not free the donated memory */
int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
@@ -192,6 +259,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
+ cc |= kvm_s390_pv_cleanup_leftovers(kvm);
+
return cc ? -EIO : 0;
}
--
2.31.1
Refactor kvm_s390_pv_deinit_vm to improve readability and simplify the
improvements that are coming in subsequent patches.
No functional change intended.
[note: this can potentially be squashed into the next patch, I factored
it out to simplify the review process]
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/pv.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 111f85158cce..a54f6e98fa6f 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -182,17 +182,17 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
- if (!cc)
- atomic_dec(&kvm->mm->context.protected_count);
- KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
- WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
- /* Intended memory leak on "impossible" error */
if (!cc) {
+ atomic_dec(&kvm->mm->context.protected_count);
kvm_s390_pv_dealloc_vm(kvm);
- return 0;
+ } else {
+ /* Intended memory leak on "impossible" error */
+ s390_replace_asce(kvm->arch.gmap);
}
- s390_replace_asce(kvm->arch.gmap);
- return -EIO;
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
+ WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
+
+ return cc ? -EIO : 0;
}
static void kvm_s390_pv_mmu_notifier_release(struct mmu_notifier *subscription,
--
2.31.1
Add an mmu_notifier for protected VMs. The callback function is
triggered when the mm is torn down, and will attempt to convert all
protected vCPUs to non-protected. This allows the mm teardown to use
the destroy page UVC instead of export.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/kvm_host.h | 1 +
arch/s390/kvm/pv.c | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index a604d51acfc8..8b288cff2a3a 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -921,6 +921,7 @@ struct kvm_s390_pv {
u64 guest_len;
unsigned long stor_base;
void *stor_var;
+ struct mmu_notifier mmu_notifier;
};
struct kvm_arch{
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 59da54bd6b21..111f85158cce 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -195,6 +195,21 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return -EIO;
}
+static void kvm_s390_pv_mmu_notifier_release(struct mmu_notifier *subscription,
+ struct mm_struct *mm)
+{
+ struct kvm *kvm = container_of(subscription, struct kvm, arch.pv.mmu_notifier);
+ u16 dummy;
+
+ mutex_lock(&kvm->lock);
+ kvm_s390_cpus_from_pv(kvm, &dummy, &dummy);
+ mutex_unlock(&kvm->lock);
+}
+
+static const struct mmu_notifier_ops kvm_s390_pv_mmu_notifier_ops = {
+ .release = kvm_s390_pv_mmu_notifier_release,
+};
+
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
struct uv_cb_cgc uvcb = {
@@ -236,6 +251,11 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return -EIO;
}
kvm->arch.gmap->guest_handle = uvcb.guest_handle;
+ /* Add the notifier only once. No races because we hold kvm->lock */
+ if (kvm->arch.pv.mmu_notifier.ops != &kvm_s390_pv_mmu_notifier_ops) {
+ kvm->arch.pv.mmu_notifier.ops = &kvm_s390_pv_mmu_notifier_ops;
+ mmu_notifier_register(&kvm->arch.pv.mmu_notifier, kvm->mm);
+ }
return 0;
}
--
2.31.1
When ptep_get_and_clear_full is called for an mm teardown, we will now
attempt to destroy the secure pages. This is faster than exporting them.
In case it was not a teardown, or if for some reason the destroy page
UVC failed, we fall back to exporting the page, as before.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/pgtable.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 23ca0d8e058a..c008b354573e 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1118,9 +1118,14 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
} else {
res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
}
+ /* Nothing to do */
+ if (!mm_is_protected(mm) || !pte_present(res))
+ return res;
/* At this point the reference through the mapping is still present */
- if (mm_is_protected(mm) && pte_present(res))
- uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
+ if (full && !uv_destroy_owned_page(pte_val(res) & PAGE_MASK))
+ return res;
+ /* If could not destroy, we try export */
+ uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
return res;
}
--
2.31.1
Do not use memset to clean the whole struct kvm_s390_pv; instead,
explicitly clear the fields that need to be cleared.
Upcoming patches will introduce new fields in the struct kvm_s390_pv
that will not need to be cleared.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/pv.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 04aa4a20260b..59da54bd6b21 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -16,6 +16,15 @@
#include <linux/sched/mm.h>
#include "kvm-s390.h"
+static void kvm_s390_clear_pv_state(struct kvm *kvm)
+{
+ kvm->arch.pv.handle = 0;
+ kvm->arch.pv.guest_len = 0;
+ kvm->arch.pv.stor_base = 0;
+ kvm->arch.pv.stor_var = NULL;
+}
+
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
{
int cc;
@@ -110,7 +119,7 @@ static void kvm_s390_pv_dealloc_vm(struct kvm *kvm)
vfree(kvm->arch.pv.stor_var);
free_pages(kvm->arch.pv.stor_base,
get_order(uv_info.guest_base_stor_len));
- memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
+ kvm_s390_clear_pv_state(kvm);
}
static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
--
2.31.1
Use the new protected_count field as a counter instead of the old
is_protected flag. This will be used in upcoming patches.
Increment the counter when a secure configuration is created, and
decrement it when it is destroyed. Previously the flag was set when the
set secure parameters UVC was performed.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/mmu.h | 2 +-
arch/s390/include/asm/mmu_context.h | 2 +-
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/kvm/pv.c | 12 +++++++-----
4 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index e12ff0f29d1a..2c4cbc254604 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -17,7 +17,7 @@ typedef struct {
unsigned long asce_limit;
unsigned long vdso_base;
/* The mmu context belongs to a secure guest. */
- atomic_t is_protected;
+ atomic_t protected_count;
/*
* The following bitfields need a down_write on the mm
* semaphore when they are written to. As they are only
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index c7937f369e62..2a38af5a00c2 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -26,7 +26,7 @@ static inline int init_new_context(struct task_struct *tsk,
INIT_LIST_HEAD(&mm->context.gmap_list);
cpumask_clear(&mm->context.cpu_attach_mask);
atomic_set(&mm->context.flush_count, 0);
- atomic_set(&mm->context.is_protected, 0);
+ atomic_set(&mm->context.protected_count, 0);
mm->context.gmap_asce = 0;
mm->context.flush_mm = 0;
#ifdef CONFIG_PGSTE
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 008a6c856fa4..23ca0d8e058a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -523,7 +523,7 @@ static inline int mm_has_pgste(struct mm_struct *mm)
static inline int mm_is_protected(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
- if (unlikely(atomic_read(&mm->context.is_protected)))
+ if (unlikely(atomic_read(&mm->context.protected_count)))
return 1;
#endif
return 0;
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 19f67207caa9..04aa4a20260b 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -173,7 +173,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
- atomic_set(&kvm->mm->context.is_protected, 0);
+ if (!cc)
+ atomic_dec(&kvm->mm->context.protected_count);
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
/* Intended memory leak on "impossible" error */
@@ -215,11 +216,14 @@ int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
/* Outputs */
kvm->arch.pv.handle = uvcb.guest_handle;
+ atomic_inc(&kvm->mm->context.protected_count);
if (cc) {
- if (uvcb.header.rc & UVC_RC_NEED_DESTROY)
+ if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
- else
+ } else {
+ atomic_dec(&kvm->mm->context.protected_count);
kvm_s390_pv_dealloc_vm(kvm);
+ }
return -EIO;
}
kvm->arch.gmap->guest_handle = uvcb.guest_handle;
@@ -242,8 +246,6 @@ int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
*rrc = uvcb.header.rrc;
KVM_UV_EVENT(kvm, 3, "PROTVIRT VM SET PARMS: rc %x rrc %x",
*rc, *rrc);
- if (!cc)
- atomic_set(&kvm->mm->context.is_protected, 1);
return cc ? -EINVAL : 0;
}
--
2.31.1
Refactor s390_reset_acc so that it can be reused in upcoming patches.
We don't want to hold all the locks used in a walk_page_range for too
long, and the destroy page UVC does take some time to complete.
Therefore we quickly gather the pages to destroy, and then destroy them
without holding all the locks.
The new refactored function optionally allows returning early, without
completing, if a fatal signal is pending (returning an appropriate
error code). Two wrappers are provided to call the new function.
Signed-off-by: Claudio Imbrenda <[email protected]>
(dropping Janosch's Ack because of major changes to the patch)
---
arch/s390/include/asm/gmap.h | 36 +++++++++++++-
arch/s390/kvm/pv.c | 12 ++++-
arch/s390/mm/gmap.c | 95 +++++++++++++++++++++++++-----------
3 files changed, 111 insertions(+), 32 deletions(-)
diff --git a/arch/s390/include/asm/gmap.h b/arch/s390/include/asm/gmap.h
index 746e18bf8984..2a913014f63c 100644
--- a/arch/s390/include/asm/gmap.h
+++ b/arch/s390/include/asm/gmap.h
@@ -147,7 +147,41 @@ int gmap_mprotect_notify(struct gmap *, unsigned long start,
void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long dirty_bitmap[4],
unsigned long gaddr, unsigned long vmaddr);
int gmap_mark_unmergeable(void);
-void s390_reset_acc(struct mm_struct *mm);
void s390_remove_old_asce(struct gmap *gmap);
int s390_replace_asce(struct gmap *gmap);
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns);
+int __s390_uv_destroy_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int interruptible);
+
+/**
+ * s390_uv_destroy_range - Destroy a range of pages in the given mm.
+ * @mm the mm on which to operate
+ * @start the start of the range
+ * @end the end of the range
+ *
+ * This call will call cond_resched, so it should not generate stalls, but it
+ * will otherwise only return when it has completed.
+ */
+static inline void s390_uv_destroy_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
+{
+ (void)__s390_uv_destroy_range(mm, start, end, 0);
+}
+
+/**
+ * s390_uv_destroy_range_interruptible - Destroy a range of pages in the
+ * given mm, but stop when a fatal signal is received.
+ * @mm the mm on which to operate
+ * @start the start of the range
+ * @end the end of the range
+ *
+ * This call will call cond_resched, so it should not generate stalls. It
+ * will return -EINTR if a fatal signal is received, or 0 if the whole range
+ * has been destroyed.
+ */
+static inline int s390_uv_destroy_range_interruptible(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
+{
+ return __s390_uv_destroy_range(mm, start, end, 1);
+}
#endif /* _ASM_S390_GMAP_H */
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index b906658ffc2e..19f67207caa9 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -12,6 +12,8 @@
#include <asm/gmap.h>
#include <asm/uv.h>
#include <asm/mman.h>
+#include <linux/pagewalk.h>
+#include <linux/sched/mm.h>
#include "kvm-s390.h"
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
@@ -159,8 +161,14 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
int cc;
- /* make all pages accessible before destroying the guest */
- s390_reset_acc(kvm->mm);
+ /*
+ * if the mm still has a mapping, make all its pages accessible
+ * before destroying the guest
+ */
+ if (mmget_not_zero(kvm->mm)) {
+ s390_uv_destroy_range(kvm->mm, 0, TASK_SIZE);
+ mmput(kvm->mm);
+ }
cc = uv_cmd_nodata(kvm_s390_pv_get_handle(kvm),
UVC_CMD_DESTROY_SEC_CONF, rc, rrc);
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 5d6618016263..84096ac72e14 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2676,44 +2676,81 @@ void s390_reset_cmma(struct mm_struct *mm)
}
EXPORT_SYMBOL_GPL(s390_reset_cmma);
-/*
- * make inaccessible pages accessible again
- */
-static int __s390_reset_acc(pte_t *ptep, unsigned long addr,
- unsigned long next, struct mm_walk *walk)
+#define DESTROY_LOOP_THRESHOLD 32
+
+struct reset_walk_state {
+ unsigned long next;
+ unsigned long count;
+ unsigned long pfns[DESTROY_LOOP_THRESHOLD];
+};
+
+static int s390_gather_pages(pte_t *ptep, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
+ struct reset_walk_state *p = walk->private;
pte_t pte = READ_ONCE(*ptep);
- /* There is a reference through the mapping */
- if (pte_present(pte))
- WARN_ON_ONCE(uv_destroy_owned_page(pte_val(pte) & PAGE_MASK));
-
- return 0;
+ if (pte_present(pte)) {
+ /* we have a reference from the mapping, take an extra one */
+ get_page(phys_to_page(pte_val(pte)));
+ p->pfns[p->count] = phys_to_pfn(pte_val(pte));
+ p->next = next;
+ p->count++;
+ }
+ return p->count >= DESTROY_LOOP_THRESHOLD;
}
-static const struct mm_walk_ops reset_acc_walk_ops = {
- .pte_entry = __s390_reset_acc,
+static const struct mm_walk_ops gather_pages_ops = {
+ .pte_entry = s390_gather_pages,
};
-#include <linux/sched/mm.h>
-void s390_reset_acc(struct mm_struct *mm)
+/*
+ * Call the Destroy secure page UVC on each page in the given array of PFNs.
+ * Each page needs to have an extra reference, which will be released here.
+ */
+void s390_uv_destroy_pfns(unsigned long count, unsigned long *pfns)
{
- if (!mm_is_protected(mm))
- return;
- /*
- * we might be called during
- * reset: we walk the pages and clear
- * close of all kvm file descriptors: we walk the pages and clear
- * exit of process on fd closure: vma already gone, do nothing
- */
- if (!mmget_not_zero(mm))
- return;
- mmap_read_lock(mm);
- walk_page_range(mm, 0, TASK_SIZE, &reset_acc_walk_ops, NULL);
- mmap_read_unlock(mm);
- mmput(mm);
+ unsigned long i;
+
+ for (i = 0; i < count; i++) {
+ /* we always have an extra reference */
+ uv_destroy_owned_page(pfn_to_phys(pfns[i]));
+ /* get rid of the extra reference */
+ put_page(pfn_to_page(pfns[i]));
+ cond_resched();
+ }
+}
+EXPORT_SYMBOL_GPL(s390_uv_destroy_pfns);
+
+/**
+ * __s390_uv_destroy_range - Walk the given range of the given address
+ * space, and call the destroy secure page UVC on each page.
+ * Optionally exit early if a fatal signal is pending.
+ * @mm the mm to operate on
+ * @start the start of the range
+ * @end the end of the range
+ * @interruptible if not 0, stop when a fatal signal is received
+ * Return: 0 on success, -EINTR if the function stopped before completing
+ */
+int __s390_uv_destroy_range(struct mm_struct *mm, unsigned long start,
+ unsigned long end, int interruptible)
+{
+ struct reset_walk_state state = { .next = start };
+ int r = 1;
+
+ while (r > 0) {
+ state.count = 0;
+ mmap_read_lock(mm);
+ r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+ mmap_read_unlock(mm);
+ cond_resched();
+ s390_uv_destroy_pfns(state.count, state.pfns);
+ if (interruptible && fatal_signal_pending(current))
+ return -EINTR;
+ }
+ return 0;
}
-EXPORT_SYMBOL_GPL(s390_reset_acc);
+EXPORT_SYMBOL_GPL(__s390_uv_destroy_range);
/**
* s390_remove_old_asce - Remove the topmost level of page tables from the
--
2.31.1
Hi Claudio,
I love your patch! Yet something to improve:
[auto build test ERROR on s390/features]
[also build test ERROR on kvm/queue v5.16-rc3 next-20211203]
[cannot apply to kvms390/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Claudio-Imbrenda/KVM-s390-pv-implement-lazy-destroy-for-reboot/20211204-010121
base: https://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features
config: s390-randconfig-r034-20211203 (https://download.01.org/0day-ci/archive/20211204/[email protected]/config)
compiler: s390-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/c2d394c61e53062a1e7ba537029dce61f7a75cff
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Claudio-Imbrenda/KVM-s390-pv-implement-lazy-destroy-for-reboot/20211204-010121
git checkout c2d394c61e53062a1e7ba537029dce61f7a75cff
# save the config file to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=s390 SHELL=/bin/bash arch/s390/kvm/
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
arch/s390/kvm/pv.c: In function 'kvm_s390_pv_init_vm':
>> arch/s390/kvm/pv.c:257:17: error: implicit declaration of function 'mmu_notifier_register'; did you mean 'mmu_notifier_release'? [-Werror=implicit-function-declaration]
257 | mmu_notifier_register(&kvm->arch.pv.mmu_notifier, kvm->mm);
| ^~~~~~~~~~~~~~~~~~~~~
| mmu_notifier_release
cc1: some warnings being treated as errors
vim +257 arch/s390/kvm/pv.c
212
213 int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
214 {
215 struct uv_cb_cgc uvcb = {
216 .header.cmd = UVC_CMD_CREATE_SEC_CONF,
217 .header.len = sizeof(uvcb)
218 };
219 int cc, ret;
220 u16 dummy;
221
222 ret = kvm_s390_pv_alloc_vm(kvm);
223 if (ret)
224 return ret;
225
226 /* Inputs */
227 uvcb.guest_stor_origin = 0; /* MSO is 0 for KVM */
228 uvcb.guest_stor_len = kvm->arch.pv.guest_len;
229 uvcb.guest_asce = kvm->arch.gmap->asce;
230 uvcb.guest_sca = (unsigned long)kvm->arch.sca;
231 uvcb.conf_base_stor_origin = (u64)kvm->arch.pv.stor_base;
232 uvcb.conf_virt_stor_origin = (u64)kvm->arch.pv.stor_var;
233
234 cc = uv_call_sched(0, (u64)&uvcb);
235 *rc = uvcb.header.rc;
236 *rrc = uvcb.header.rrc;
237 KVM_UV_EVENT(kvm, 3, "PROTVIRT CREATE VM: handle %llx len %llx rc %x rrc %x",
238 uvcb.guest_handle, uvcb.guest_stor_len, *rc, *rrc);
239
240 /* Outputs */
241 kvm->arch.pv.handle = uvcb.guest_handle;
242
243 atomic_inc(&kvm->mm->context.protected_count);
244 if (cc) {
245 if (uvcb.header.rc & UVC_RC_NEED_DESTROY) {
246 kvm_s390_pv_deinit_vm(kvm, &dummy, &dummy);
247 } else {
248 atomic_dec(&kvm->mm->context.protected_count);
249 kvm_s390_pv_dealloc_vm(kvm);
250 }
251 return -EIO;
252 }
253 kvm->arch.gmap->guest_handle = uvcb.guest_handle;
254 /* Add the notifier only once. No races because we hold kvm->lock */
255 if (kvm->arch.pv.mmu_notifier.ops != &kvm_s390_pv_mmu_notifier_ops) {
256 kvm->arch.pv.mmu_notifier.ops = &kvm_s390_pv_mmu_notifier_ops;
> 257 mmu_notifier_register(&kvm->arch.pv.mmu_notifier, kvm->mm);
258 }
259 return 0;
260 }
261
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
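The error above comes from a randconfig where CONFIG_MMU_NOTIFIER is not
set: <linux/mmu_notifier.h> only declares mmu_notifier_register() when
that option is enabled. The likely fix (an assumption here, not something
the report states) is to have the KVM entry in arch/s390/kvm/Kconfig
select MMU_NOTIFIER, rather than guarding the call site.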
On 12/3/21 17:57, Claudio Imbrenda wrote:
> With upcoming patches, protected guests will be able to trigger secure
> storage violations in normal operation.
>
> This patch adds handling of secure storage violations for protected
> guests.
>
> Pages that trigger the exception will be made non-secure before
> attempting to use them again for a different secure guest.
I think we should extend this a bit.
With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation. This happens if e.g. a protected
guest is rebooted with lazy destroy enabled and the new guest is also
protected.
When the new protected guest touches pages that have not yet been
destroyed, and thus are still accounted to the previous protected guest,
we will see the violation exception.
We handle this exception by first trying to destroy the page because we
expect it to belong to a defunct protected guest where a destroy should
be possible. If that fails, we will try to do a normal export of the page.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Acked-by: Janosch Frank <[email protected]>
> ---
> arch/s390/include/asm/uv.h | 1 +
> arch/s390/kernel/uv.c | 55 ++++++++++++++++++++++++++++++++++++++
> arch/s390/mm/fault.c | 10 +++++++
> 3 files changed, 66 insertions(+)
>
> diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
> index 72d3e49c2860..cdbd340188ab 100644
> --- a/arch/s390/include/asm/uv.h
> +++ b/arch/s390/include/asm/uv.h
> @@ -356,6 +356,7 @@ static inline int is_prot_virt_host(void)
> }
>
> int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
> +int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
> int uv_destroy_owned_page(unsigned long paddr);
> int uv_convert_from_secure(unsigned long paddr);
> int uv_convert_owned_from_secure(unsigned long paddr);
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 386d4e42b8d3..f706456f6261 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -334,6 +334,61 @@ int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr)
> }
> EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
>
> +/**
> + * gmap_destroy_page - Destroy a guest page.
> + * @gmap: the gmap of the guest
> + * @gaddr: the guest address to destroy
> + *
> + * An attempt will be made to destroy the given guest page. If the attempt
> + * fails, an attempt is made to export the page. If both attempts fail, an
> + * appropriate error is returned.
> + */
> +int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
> +{
> + struct vm_area_struct *vma;
> + unsigned long uaddr;
> + struct page *page;
> + int rc;
> +
> + rc = -EFAULT;
> + mmap_read_lock(gmap->mm);
> +
> + uaddr = __gmap_translate(gmap, gaddr);
> + if (IS_ERR_VALUE(uaddr))
> + goto out;
> + vma = vma_lookup(gmap->mm, uaddr);
> + if (!vma)
> + goto out;
> + /*
> + * Huge pages should not be able to become secure
> + */
Could be one line
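i.e.:

	/* Huge pages should not be able to become secure */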
> + if (is_vm_hugetlb_page(vma))
> + goto out;
> +
> + rc = 0;
> + /* we take an extra reference here */
Because?
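(The comment quoted further down, "To be called with the page locked or
with an extra reference!", suggests the answer: the extra reference
prevents gmap_make_secure() from touching the page concurrently. Worth
spelling that out here.)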
> + page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
> + if (IS_ERR_OR_NULL(page))
> + goto out;
> + rc = uv_destroy_owned_page(page_to_phys(page));
> + /*
> + * Fault handlers can race; it is possible that two CPUs will fault
> + * on the same secure page. One CPU can destroy the page, reboot,
> + * re-enter secure mode and import it, while the second CPU was
> + * stuck at the beginning of the handler. At some point the second
> + * CPU will be able to progress, and it will not be able to destroy
> + * the page. In that case we do not want to terminate the process,
> + * we instead try to export the page.
> + */
So when we export we always export a page that's owned by the new guest,
do I get that right?
> + if (rc)
> + rc = uv_convert_owned_from_secure(page_to_phys(page));
> + put_page(page);
> +out:
> + mmap_read_unlock(gmap->mm);
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(gmap_destroy_page);
> +
> /*
> * To be called with the page locked or with an extra reference! This will
> * prevent gmap_make_secure from touching the page concurrently. Having 2
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index d30f5986fa85..a1928c89bbfa 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -853,6 +853,16 @@ NOKPROBE_SYMBOL(do_non_secure_storage_access);
>
> void do_secure_storage_violation(struct pt_regs *regs)
> {
> + unsigned long gaddr = regs->int_parm_long & __FAIL_ADDR_MASK;
> + struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
> +
> + /*
> + * If the VM has been rebooted, its address space might still contain
> + * secure pages from the previous boot.
> + * Clear the page so it can be reused.
> + */
> + if (!gmap_destroy_page(gmap, gaddr))
> + return;
> /*
> * Either KVM messed up the secure guest mapping or the same
> * page is mapped into multiple secure guests.
>
On 12/3/21 17:58, Claudio Imbrenda wrote:
> With upcoming patches, normal guests might touch secure pages.
>
> This patch extends the existing exception handler to convert the pages
> to non secure also when the exception is triggered by a normal guest.
>
> This can happen for example when a secure guest reboots; the first
> stage of a secure guest is non secure, and in general a secure guest
> can reboot into non-secure mode.
>
> If the secure memory of the previous boot has not been cleared up
> completely yet (which will be allowed to happen in an upcoming patch),
> a non-secure guest might touch secure memory, which will need to be
> handled properly.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/mm/fault.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index a1928c89bbfa..a644e593eef9 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -770,6 +770,7 @@ void do_secure_storage_access(struct pt_regs *regs)
> struct vm_area_struct *vma;
> struct mm_struct *mm;
> struct page *page;
> + struct gmap *gmap;
> int rc;
>
> /*
> @@ -799,6 +800,14 @@ void do_secure_storage_access(struct pt_regs *regs)
> }
>
> switch (get_fault_type(regs)) {
> + case GMAP_FAULT:
> + gmap = (struct gmap *)S390_lowcore.gmap;
> + addr = __gmap_translate(gmap, addr);
__gmap_translate() needs the mmap_read_lock(mm), no?
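If it does, a minimal locked variant would look like this (a sketch to
illustrate the locking rule only; dropping the lock again right away
would reopen the window before the later vma walk):

	mmap_read_lock(gmap->mm);
	addr = __gmap_translate(gmap, addr);
	mmap_read_unlock(gmap->mm);

A cleaner fix would probably take the lock once before the switch and
hold it across the fallthrough into USER_FAULT.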
> + if (IS_ERR_VALUE(addr)) {
> + do_fault_error(regs, VM_ACCESS_FLAGS, VM_FAULT_BADMAP);
> + break;
> + }
> + fallthrough;
> case USER_FAULT:
> mm = current->mm;
> mmap_read_lock(mm);
> @@ -827,7 +836,6 @@ void do_secure_storage_access(struct pt_regs *regs)
> if (rc)
> BUG();
> break;
> - case GMAP_FAULT:
> default:
> do_fault_error(regs, VM_READ | VM_WRITE, VM_FAULT_BADMAP);
> WARN_ON_ONCE(1);
>
On 12/3/21 17:58, Claudio Imbrenda wrote:
> Do not use memset to clean the whole struct kvm_s390_pv; instead,
> explicitly clear the fields that need to be cleared.
>
> Upcoming patches will introduce new fields in the struct kvm_s390_pv
> that will not need to be cleared.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/kvm/pv.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index 04aa4a20260b..59da54bd6b21 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -16,6 +16,15 @@
> #include <linux/sched/mm.h>
> #include "kvm-s390.h"
>
> +static void kvm_s390_clear_pv_state(struct kvm *kvm)
> +{
> + kvm->arch.pv.handle = 0;
> + kvm->arch.pv.guest_len = 0;
> + kvm->arch.pv.stor_base = 0;
> + kvm->arch.pv.stor_var = NULL;
We could use the same approach that we use in QEMU to do the CPU resets.
But maybe that would be overkill for a few struct members.
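For illustration, such a reset-by-assignment might look like the
following (hypothetical: .set_aside stands in for whichever upcoming
members must survive the clear):

	static void kvm_s390_clear_pv_state(struct kvm *kvm)
	{
		/* clear everything, listing only the fields to preserve */
		kvm->arch.pv = (struct kvm_s390_pv) {
			.set_aside = kvm->arch.pv.set_aside,
		};
	}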
> +}
> +
> int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc)
> {
> int cc;
> @@ -110,7 +119,7 @@ static void kvm_s390_pv_dealloc_vm(struct kvm *kvm)
> vfree(kvm->arch.pv.stor_var);
> free_pages(kvm->arch.pv.stor_base,
> get_order(uv_info.guest_base_stor_len));
> - memset(&kvm->arch.pv, 0, sizeof(kvm->arch.pv));
> + kvm_s390_clear_pv_state(kvm);
> }
>
> static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
>
On 12/3/21 17:58, Claudio Imbrenda wrote:
> When ptep_get_and_clear_full is called for a mm teardown, we will now
> attempt to destroy the secure pages. This will be faster than export.
>
> In case it was not a teardown, or if for some reason the destroy page
> UVC failed, we try with an export page, like before.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
> ---
> arch/s390/include/asm/pgtable.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 23ca0d8e058a..c008b354573e 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1118,9 +1118,14 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> } else {
> res = ptep_xchg_lazy(mm, addr, ptep, __pte(_PAGE_INVALID));
> }
> + /* Nothing to do */
> + if (!mm_is_protected(mm) || !pte_present(res))
> + return res;
> /* At this point the reference through the mapping is still present */
> - if (mm_is_protected(mm) && pte_present(res))
> - uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
Add a comment:
The notifier should have tried to destroy the CPUs, which allows us to
destroy pages. So here we'll try to destroy the pages, but if that fails
we fall back to a normal but slower export.
> + if (full && !uv_destroy_owned_page(pte_val(res) & PAGE_MASK))
> + return res;
> + /* If could not destroy, we try export */
> + uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
> return res;
> }
>
>
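With the suggested comment folded in, the tail of the function would
read roughly like this (the reviewed hunk restated, no new logic):

	/* Nothing to do */
	if (!mm_is_protected(mm) || !pte_present(res))
		return res;
	/*
	 * The notifier should have tried to destroy the CPUs, which
	 * allows us to destroy pages. Try the destroy first; if it
	 * fails, fall back to the normal but slower export.
	 */
	if (full && !uv_destroy_owned_page(pte_val(res) & PAGE_MASK))
		return res;
	uv_convert_owned_from_secure(pte_val(res) & PAGE_MASK);
	return res;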