Previously, when a protected VM was rebooted or when it was shut down,
its memory was made unprotected, and then the protected VM itself was
destroyed. Looping over the whole address space can take some time,
given the overhead of the various Ultravisor Calls (UVCs). This means
that a reboot or a shutdown could take a long time, depending on the
amount of memory in use.
This patch series implements a deferred destroy mechanism for protected
guests. When a protected guest is destroyed, its memory can be cleared
in the background, allowing the guest to restart or terminate
significantly faster than before.
There are two possibilities when a protected VM is torn down:
* it still has an address space associated (reboot case)
* it does not have an address space anymore (shutdown case)
For the reboot case, two new commands are available for the
KVM_S390_PV_COMMAND:
KVM_PV_ASYNC_DISABLE_PREPARE: prepares the current protected VM for
asynchronous teardown. The current VM will then continue immediately
as non-protected. If a protected VM had already been set aside without
starting the teardown process, this call will fail. In this case the
userspace process should issue a normal KVM_PV_DISABLE.
KVM_PV_ASYNC_DISABLE: tears down the protected VM previously set aside
for asynchronous teardown. This PV command should ideally be issued by
userspace from a separate thread. If a fatal signal is received (or
the process terminates naturally), the command will terminate
immediately without completing.
The idea is that userspace should first issue the
KVM_PV_ASYNC_DISABLE_PREPARE command, and in case of success, create a
new thread and issue KVM_PV_ASYNC_DISABLE from there. This also allows
for proper accounting of the CPU time needed for the asynchronous
teardown.
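The expected command sequencing can be sketched with a small userspace-side model. All the names below are invented for the sketch; the real interface is the KVM_S390_PV_COMMAND ioctl with the KVM_PV_ASYNC_DISABLE_PREPARE and KVM_PV_ASYNC_DISABLE command codes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of the two-step teardown protocol described above. */
struct pv_vm {
	void *async_deinit;	/* teardown state set aside, NULL if none */
	int prot;		/* 1 while the VM runs protected */
};

static int dummy_state;	/* stands in for the real deferred state */

/* Model of KVM_PV_ASYNC_DISABLE_PREPARE: set the protected VM aside. */
static int pv_async_disable_prepare(struct pv_vm *vm)
{
	/* fails if not protected, or if a VM was already set aside */
	if (!vm->prot || vm->async_deinit)
		return -EINVAL;
	vm->async_deinit = &dummy_state;
	vm->prot = 0;	/* the VM continues immediately as non-protected */
	return 0;
}

/*
 * Model of KVM_PV_ASYNC_DISABLE: tear down the set-aside VM. In
 * practice this is issued from a separate thread, so the teardown CPU
 * time is accounted to that thread while the guest keeps running.
 */
static int pv_async_disable(struct pv_vm *vm)
{
	if (!vm->async_deinit)
		return -EINVAL;
	/* the slow page-clearing work happens here in the real code */
	vm->async_deinit = NULL;
	return 0;
}

static struct pv_vm demo_vm = { .async_deinit = NULL, .prot = 1 };
```

The model reproduces the documented failure modes: a second prepare while a teardown is pending fails with -EINVAL, as does a disable with nothing set aside.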
This means that the same address space can have memory belonging to
more than one protected guest, although only one will be running; the
others will in fact not even have any CPUs.
The shutdown case should be dealt with in userspace (e.g. using
clone(CLONE_VM)).
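The clone(CLONE_VM) idea for the shutdown case can be illustrated with a minimal, Linux-only sketch: a helper created with CLONE_VM shares the address space, so it can keep the mm alive while the cleanup finishes after the main process exits. The helper and flag names here are invented; the child only flips a flag to show that the memory really is shared.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static volatile int shared_flag;

static int helper_fn(void *arg)
{
	(void)arg;
	shared_flag = 1;	/* write is visible to the parent: shared mm */
	return 0;
}

static int run_clone_vm_demo(void)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	pid_t pid;

	if (!stack)
		return -1;
	/* the child stack grows downwards, so pass the top of the buffer */
	pid = clone(helper_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0) {
		free(stack);
		return -1;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return shared_flag == 1 ? 0 : -1;
}
```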
A module parameter is also provided to disable the new functionality,
which is otherwise enabled by default. This should not be an issue,
since the new functionality is opt-in anyway; the parameter is mainly
intended to aid debugging.
v8->v9
* rebased
* added dependency on MMU_NOTIFIER for KVM in arch/s390/kvm/Kconfig
* add support for the Destroy Secure Configuration Fast UVC
* minor fixes
v7->v8
* switched patches 8 and 9
* improved comments, documentation and patch descriptions
* remove mm notifier when the struct kvm is torn down
* removed useless locks in the mm notifier
* use _ASCE_ORIGIN instead of PAGE_MASK for ASCEs
* cleanup of some compiler warnings
* remove some harmless but useless duplicate code
* the last parameter of __s390_uv_destroy_range is now bool
* rename the KVM capability to KVM_CAP_S390_PROTECTED_ASYNC_DISABLE
v6->v7
* moved INIT_LIST_HEAD inside spinlock in patch 1
* improved commit messages in patch 2
* added missing locks in patch 3
* added and expanded some comments in patch 11
* rebased
v5->v6
* completely reworked the series
* removed kernel thread for asynchronous teardown
* added new commands to KVM_S390_PV_COMMAND ioctl
v4->v5
* fixed and improved some patch descriptions
* added some comments to better explain what's going on
* use vma_lookup instead of find_vma
* rename is_protected to protected_count since now it's used as a counter
v3->v4
* added patch 2
* split patch 3
* removed the shutdown part -- will be a separate patch series
* moved the patch introducing the module parameter
v2->v3
* added definitions for CC return codes for the UVC instruction
* improved make_secure_pte:
- renamed rc to cc
- added comments to explain why returning -EAGAIN is ok
* fixed kvm_s390_pv_replace_asce and kvm_s390_pv_remove_old_asce:
- renamed
- added locking
- moved to gmap.c
* do proper error management in do_secure_storage_access instead of
trying again hoping to get a different exception
* fix outdated patch descriptions
v1->v2
* rebased on a more recent kernel
* improved/expanded some patch descriptions
* improved/expanded some comments
* added patch 1, which prevents stall notification when the system is
under heavy load.
* rename some members of struct deferred_priv to improve readability
* avoid a use-after-free bug of the struct mm in case of shutdown
* add missing return when lazy destroy is disabled
* add support for OOM notifier
Claudio Imbrenda (18):
KVM: s390: pv: leak the topmost page table when destroy fails
KVM: s390: pv: handle secure storage violations for protected guests
KVM: s390: pv: handle secure storage exceptions for normal guests
KVM: s390: pv: refactor s390_reset_acc
KVM: s390: pv: usage counter instead of flag
KVM: s390: pv: add export before import
KVM: s390: pv: module parameter to fence lazy destroy
KVM: s390: pv: clear the state without memset
KVM: s390: pv: Add kvm_s390_cpus_from_pv to kvm-s390.h and add
documentation
KVM: s390: pv: add mmu_notifier
s390/mm: KVM: pv: when tearing down, try to destroy protected pages
KVM: s390: pv: refactoring of kvm_s390_pv_deinit_vm
KVM: s390: pv: cleanup leftover protected VMs if needed
KVM: s390: pv: asynchronous destroy for reboot
KVM: s390: pv: api documentation for asynchronous destroy
KVM: s390: pv: add KVM_CAP_S390_PROTECTED_ASYNC_DISABLE
KVM: s390: pv: avoid export before import if possible
KVM: s390: pv: support for Destroy fast UVC
Documentation/virt/kvm/api.rst | 25 ++-
arch/s390/include/asm/gmap.h | 39 +++-
arch/s390/include/asm/kvm_host.h | 4 +
arch/s390/include/asm/mmu.h | 2 +-
arch/s390/include/asm/mmu_context.h | 2 +-
arch/s390/include/asm/pgtable.h | 20 +-
arch/s390/include/asm/uv.h | 11 ++
arch/s390/kernel/uv.c | 64 ++++++
arch/s390/kvm/Kconfig | 1 +
arch/s390/kvm/kvm-s390.c | 64 +++++-
arch/s390/kvm/kvm-s390.h | 3 +
arch/s390/kvm/pv.c | 297 +++++++++++++++++++++++++++-
arch/s390/mm/fault.c | 23 ++-
arch/s390/mm/gmap.c | 158 ++++++++++++---
include/uapi/linux/kvm.h | 3 +
15 files changed, 670 insertions(+), 46 deletions(-)
--
2.34.1
Add documentation for the new commands added to the KVM_S390_PV_COMMAND
ioctl.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
Documentation/virt/kvm/api.rst | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f3172376ec3..52ba1c52ae3c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5010,11 +5010,13 @@ KVM_PV_ENABLE
===== =============================
KVM_PV_DISABLE
-
Deregister the VM from the Ultravisor and reclaim the memory that
had been donated to the Ultravisor, making it usable by the kernel
- again. All registered VCPUs are converted back to non-protected
- ones.
+ again. All registered VCPUs are converted back to non-protected
+ ones. If a previous VM had been prepared for asynchronous teardown
+ with KVM_PV_ASYNC_DISABLE_PREPARE and not actually torn down with
+ KVM_PV_ASYNC_DISABLE, it will be torn down in this call together with
+ the current VM.
KVM_PV_VM_SET_SEC_PARMS
Pass the image header from VM memory to the Ultravisor in
@@ -5027,6 +5029,23 @@ KVM_PV_VM_VERIFY
Verify the integrity of the unpacked image. Only if this succeeds,
KVM is allowed to start protected VCPUs.
+KVM_PV_ASYNC_DISABLE_PREPARE
+ Prepare the current protected VM for asynchronous teardown. Most
+ resources used by the current protected VM will be set aside for a
+ subsequent asynchronous teardown. The current protected VM will then
+ resume execution immediately as non-protected. If a protected VM had
+ already been prepared without starting the asynchronous teardown process,
+ this call will fail. In that case, the userspace process should issue a
+ normal KVM_PV_DISABLE.
+
+KVM_PV_ASYNC_DISABLE
+ Tear down the protected VM previously prepared for asynchronous teardown.
+ The resources that had been set aside will be freed asynchronously during
+ the execution of this command.
+ This PV command should ideally be issued by userspace from a separate
+ thread. If a fatal signal is received (or the process terminates
+ naturally), the command will terminate immediately without completing.
+
4.126 KVM_X86_SET_MSR_FILTER
----------------------------
--
2.34.1
Until now, destroying a protected guest was an entirely synchronous
operation that could potentially take a very long time, depending on
the size of the guest, due to the time needed to clean up the address
space from protected pages.
This patch implements an asynchronous destroy mechanism that allows a
protected guest to reboot significantly faster than before.
This is achieved by clearing the pages of the old guest in the
background. In case of reboot, the new guest will be able to run in the
same address space almost immediately.
The old protected guest is then only destroyed when all of its memory
has been destroyed or otherwise made non-protected.
Two new PV commands are added for the KVM_S390_PV_COMMAND ioctl:
KVM_PV_ASYNC_DISABLE_PREPARE: prepares the current protected VM for
asynchronous teardown. The current VM will then continue immediately
as non-protected. If a protected VM had already been set aside without
starting the teardown process, this call will fail.
KVM_PV_ASYNC_DISABLE: tears down the protected VM previously set aside
for asynchronous teardown. This PV command should ideally be issued by
userspace from a separate thread. If a fatal signal is received (or the
process terminates naturally), the command will terminate immediately
without completing.
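The prepare step also clears the first 2GB of the guest address space (in the patch below, kvm_s390_clear_2g, to avoid prefix issues after reboot). The boundary arithmetic can be sketched as follows; the function name is invented, and the real code walks memslots and calls s390_uv_destroy_range on each range.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2G	(2ULL * 1024 * 1024 * 1024)
#define PG	4096ULL	/* s390 page size */

/*
 * Given a memslot starting at guest frame base_gfn with npages pages,
 * return how many bytes of it lie below the 2GB boundary and thus
 * need to be cleared by the prepare step.
 */
static uint64_t bytes_to_clear(uint64_t base_gfn, uint64_t npages)
{
	uint64_t start = base_gfn * PG;
	uint64_t end = start + npages * PG;

	if (start >= SZ_2G)
		return 0;		/* slot entirely above 2GB */
	if (end <= SZ_2G)
		return end - start;	/* slot completely below 2GB */
	return SZ_2G - start;		/* slot crosses the boundary */
}
```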
Leftover protected VMs are cleaned up when a KVM VM is torn down
normally (either via IOCTL or when the process terminates); this
cleanup has been implemented in a previous patch.
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 24 ++++++++
arch/s390/kvm/kvm-s390.h | 2 +
arch/s390/kvm/pv.c | 126 +++++++++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 2 +
4 files changed, 154 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 3637f556ff33..2453d2d90d6c 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2285,6 +2285,30 @@ static int kvm_s390_handle_pv(struct kvm *kvm, struct kvm_pv_cmd *cmd)
set_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
break;
}
+ case KVM_PV_ASYNC_DISABLE_PREPARE:
+ r = -EINVAL;
+ if (!kvm_s390_pv_is_protected(kvm) || !lazy_destroy)
+ break;
+
+ r = kvm_s390_cpus_from_pv(kvm, &cmd->rc, &cmd->rrc);
+ /*
+ * If a CPU could not be destroyed, destroy VM will also fail.
+ * There is no point in trying to destroy it. Instead return
+ * the rc and rrc from the first CPU that failed destroying.
+ */
+ if (r)
+ break;
+ r = kvm_s390_pv_deinit_vm_async_prepare(kvm, &cmd->rc, &cmd->rrc);
+
+ /* no need to block service interrupts any more */
+ clear_bit(IRQ_PEND_EXT_SERVICE, &kvm->arch.float_int.masked_irqs);
+ break;
+ case KVM_PV_ASYNC_DISABLE:
+ r = -EINVAL;
+ if (!kvm->arch.pv.async_deinit)
+ break;
+ r = kvm_s390_pv_deinit_vm_async(kvm, &cmd->rc, &cmd->rrc);
+ break;
case KVM_PV_DISABLE: {
r = -EINVAL;
if (!kvm_s390_pv_is_protected(kvm))
diff --git a/arch/s390/kvm/kvm-s390.h b/arch/s390/kvm/kvm-s390.h
index 9276d910631b..be53c7750248 100644
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -234,6 +234,8 @@ static inline unsigned long kvm_s390_get_gfn_end(struct kvm_memslots *slots)
/* implemented in pv.c */
int kvm_s390_pv_destroy_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
int kvm_s390_pv_create_cpu(struct kvm_vcpu *vcpu, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_async_prepare(struct kvm *kvm, u16 *rc, u16 *rrc);
+int kvm_s390_pv_deinit_vm_async(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_init_vm(struct kvm *kvm, u16 *rc, u16 *rrc);
int kvm_s390_pv_set_sec_parms(struct kvm *kvm, void *hdr, u64 length, u16 *rc,
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index 56412617dd01..5111f1fc64ab 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -262,6 +262,132 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
return cc ? -EIO : 0;
}
+/**
+ * kvm_s390_clear_2g - Clear the first 2GB of guest memory.
+ * @kvm: the VM whose memory is to be cleared.
+ * Clear the first 2GB of guest memory, to avoid prefix issues after reboot.
+ */
+static void kvm_s390_clear_2g(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ unsigned long lim;
+ int srcu_idx;
+
+ srcu_idx = srcu_read_lock(&kvm->srcu);
+
+ slot = gfn_to_memslot(kvm, 0);
+ /* Clear all slots that are completely below 2GB */
+ while (slot && slot->base_gfn + slot->npages < SZ_2G / PAGE_SIZE) {
+ lim = slot->userspace_addr + slot->npages * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, slot->userspace_addr, lim);
+ slot = gfn_to_memslot(kvm, slot->base_gfn + slot->npages);
+ }
+ /* Last slot crosses the 2G boundary, clear only up to 2GB */
+ if (slot && slot->base_gfn < SZ_2G / PAGE_SIZE) {
+ lim = slot->userspace_addr + SZ_2G - slot->base_gfn * PAGE_SIZE;
+ s390_uv_destroy_range(kvm->mm, slot->userspace_addr, lim);
+ }
+
+ srcu_read_unlock(&kvm->srcu, srcu_idx);
+}
+
+/**
+ * kvm_s390_pv_deinit_vm_async_prepare - Prepare a protected VM for
+ * asynchronous teardown.
+ * @kvm: the VM
+ * @rc: return value for the RC field of the UVCB
+ * @rrc: return value for the RRC field of the UVCB
+ *
+ * Prepare the protected VM for asynchronous teardown. The VM will be able
+ * to continue immediately as a non-secure VM, and the information needed to
+ * properly tear down the protected VM is set aside. If another protected VM
+ * was already set aside without starting a teardown, the function will
+ * fail.
+ *
+ * Context: kvm->lock needs to be held
+ *
+ * Return: 0 in case of success, -EINVAL if another protected VM was already set
+ * aside, -ENOMEM if the system ran out of memory.
+ */
+int kvm_s390_pv_deinit_vm_async_prepare(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *priv;
+
+ /*
+ * If an asynchronous deinitialization is already pending, refuse.
+ * A synchronous deinitialization has to be performed instead.
+ */
+ if (kvm->arch.pv.async_deinit)
+ return -EINVAL;
+ priv = kmalloc(sizeof(*priv), GFP_KERNEL | __GFP_ZERO);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->stor_var = kvm->arch.pv.stor_var;
+ priv->stor_base = kvm->arch.pv.stor_base;
+ priv->handle = kvm_s390_pv_get_handle(kvm);
+ priv->old_table = (unsigned long)kvm->arch.gmap->table;
+ WRITE_ONCE(kvm->arch.gmap->guest_handle, 0);
+ if (s390_replace_asce(kvm->arch.gmap)) {
+ kfree(priv);
+ return -ENOMEM;
+ }
+
+ kvm_s390_clear_2g(kvm);
+ kvm_s390_clear_pv_state(kvm);
+ kvm->arch.pv.async_deinit = priv;
+
+ *rc = 1;
+ *rrc = 42;
+ return 0;
+}
+
+/**
+ * kvm_s390_pv_deinit_vm_async - Perform an asynchronous teardown of a
+ * protected VM.
+ * @kvm: the VM previously associated with the protected VM
+ * @rc: return value for the RC field of the UVCB
+ * @rrc: return value for the RRC field of the UVCB
+ *
+ * Tear down the protected VM that had previously been set aside using
+ * kvm_s390_pv_deinit_vm_async_prepare.
+ *
+ * Context: kvm->lock needs to be held
+ *
+ * Return: 0 in case of success, -EINVAL if no protected VM had been
+ * prepared for asynchronous teardown, -EIO in case of other errors.
+ */
+int kvm_s390_pv_deinit_vm_async(struct kvm *kvm, u16 *rc, u16 *rrc)
+{
+ struct deferred_priv *p = kvm->arch.pv.async_deinit;
+ int ret = 0;
+
+ if (!p)
+ return -EINVAL;
+ kvm->arch.pv.async_deinit = NULL;
+ mutex_unlock(&kvm->lock);
+
+ /* When a fatal signal is received, stop immediately */
+ if (s390_uv_destroy_range_interruptible(kvm->mm, 0, TASK_SIZE_MAX))
+ goto done;
+ if (kvm_s390_pv_cleanup_deferred(kvm, p))
+ ret = -EIO;
+ else
+ atomic_dec(&kvm->mm->context.protected_count);
+ kfree(p);
+ p = NULL;
+done:
+ /* The caller expects the lock to be held */
+ mutex_lock(&kvm->lock);
+ /*
+ * p is not NULL if we aborted because of a fatal signal, in which
+ * case queue the leftover for later cleanup.
+ */
+ if (p)
+ list_add(&p->list, &kvm->arch.pv.need_cleanup);
+ return ret;
+}
+
static void kvm_s390_pv_mmu_notifier_release(struct mmu_notifier *subscription,
struct mm_struct *mm)
{
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 507ee1f2aa96..d150610e7a4b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1644,6 +1644,8 @@ enum pv_cmd_id {
KVM_PV_VERIFY,
KVM_PV_PREP_RESET,
KVM_PV_UNSHARE_ALL,
+ KVM_PV_ASYNC_DISABLE_PREPARE,
+ KVM_PV_ASYNC_DISABLE,
};
struct kvm_pv_cmd {
--
2.34.1
In upcoming patches it will be possible to start tearing down a
protected VM, and finish the teardown concurrently in a different
thread.
Protected VMs that are pending teardown ("leftover") need to be
cleaned up properly when the userspace process (e.g. qemu) terminates.
This patch makes sure that all "leftover" protected VMs are always
properly torn down.
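The bookkeeping can be pictured as a simple queue of deferred teardowns drained at VM destruction. A toy model of the "need_cleanup" list, with invented names (the real code uses struct list_head and issues the Destroy Secure Configuration UVC for each entry before freeing its donated memory):

```c
#include <assert.h>
#include <stdlib.h>

struct leftover {
	struct leftover *next;
	unsigned long handle;	/* UV configuration handle */
};

static struct leftover *need_cleanup;

/* Queue one leftover protected VM for later cleanup. */
static void leftover_add(unsigned long handle)
{
	struct leftover *p = malloc(sizeof(*p));

	if (!p)
		return;
	p->handle = handle;
	p->next = need_cleanup;
	need_cleanup = p;
}

/* Drain the list, returning how many leftovers were cleaned up. */
static int leftover_drain(void)
{
	int n = 0;

	while (need_cleanup) {
		struct leftover *p = need_cleanup;

		need_cleanup = p->next;
		/* the Destroy Secure Configuration UVC would go here */
		free(p);
		n++;
	}
	return n;
}
```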
Signed-off-by: Claudio Imbrenda <[email protected]>
---
arch/s390/include/asm/kvm_host.h | 2 +
arch/s390/kvm/kvm-s390.c | 1 +
arch/s390/kvm/pv.c | 69 ++++++++++++++++++++++++++++++++
3 files changed, 72 insertions(+)
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 1bccb8561ba9..50e3516cbc03 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -922,6 +922,8 @@ struct kvm_s390_pv {
u64 guest_len;
unsigned long stor_base;
void *stor_var;
+ void *async_deinit;
+ struct list_head need_cleanup;
struct mmu_notifier mmu_notifier;
};
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 446f89db93a1..3637f556ff33 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2788,6 +2788,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
kvm_s390_vsie_init(kvm);
if (use_gisa)
kvm_s390_gisa_init(kvm);
+ INIT_LIST_HEAD(&kvm->arch.pv.need_cleanup);
KVM_EVENT(3, "vm 0x%pK created by pid %u", kvm, current->pid);
return 0;
diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
index be3b467f8feb..56412617dd01 100644
--- a/arch/s390/kvm/pv.c
+++ b/arch/s390/kvm/pv.c
@@ -17,6 +17,19 @@
#include <linux/mmu_notifier.h>
#include "kvm-s390.h"
+/**
+ * struct deferred_priv - a "leftover" protected VM
+ * Represents a "leftover" protected VM that does not correspond to any
+ * active KVM VM.
+ */
+struct deferred_priv {
+ struct list_head list;
+ unsigned long old_table;
+ u64 handle;
+ void *stor_var;
+ unsigned long stor_base;
+};
+
static void kvm_s390_clear_pv_state(struct kvm *kvm)
{
kvm->arch.pv.handle = 0;
@@ -163,6 +176,60 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
return -ENOMEM;
}
+/**
+ * kvm_s390_pv_cleanup_deferred - Clean up one leftover protected VM.
+ * @kvm: the KVM that was associated with this leftover protected VM
+ * @deferred: details about the leftover protected VM that needs a clean up
+ * Return: 0 in case of success, otherwise 1
+ */
+static int kvm_s390_pv_cleanup_deferred(struct kvm *kvm, struct deferred_priv *deferred)
+{
+ u16 rc, rrc;
+ int cc;
+
+ cc = uv_cmd_nodata(deferred->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
+ KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", rc, rrc);
+ WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
+ if (cc)
+ return cc;
+ /*
+ * Intentionally leak unusable memory. If the UVC fails, the memory
+ * used for the VM and its metadata is permanently unusable.
+ * This can only happen in case of a serious KVM or hardware bug; it
+ * is not expected to happen in normal operation.
+ */
+ free_pages(deferred->stor_base, get_order(uv_info.guest_base_stor_len));
+ free_pages(deferred->old_table, CRST_ALLOC_ORDER);
+ vfree(deferred->stor_var);
+ return 0;
+}
+
+/**
+ * kvm_s390_pv_cleanup_leftovers - Clean up all leftover protected VMs.
+ * @kvm: the KVM whose leftover protected VMs are to be cleaned up
+ * Return: 0 in case of success, otherwise 1
+ */
+static int kvm_s390_pv_cleanup_leftovers(struct kvm *kvm)
+{
+ struct deferred_priv *deferred;
+ int cc = 0;
+
+ if (kvm->arch.pv.async_deinit)
+ list_add(kvm->arch.pv.async_deinit, &kvm->arch.pv.need_cleanup);
+
+ while (!list_empty(&kvm->arch.pv.need_cleanup)) {
+ deferred = list_first_entry(&kvm->arch.pv.need_cleanup, typeof(*deferred), list);
+ if (kvm_s390_pv_cleanup_deferred(kvm, deferred))
+ cc = 1;
+ else
+ atomic_dec(&kvm->mm->context.protected_count);
+ list_del(&deferred->list);
+ kfree(deferred);
+ }
+ kvm->arch.pv.async_deinit = NULL;
+ return cc;
+}
+
/* this should not fail, but if it does, we must not free the donated memory */
int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
{
@@ -190,6 +257,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
+ cc |= kvm_s390_pv_cleanup_leftovers(kvm);
+
return cc ? -EIO : 0;
}
--
2.34.1
Add the module parameter "lazy_destroy" to allow the asynchronous destroy
mechanism to be switched off. This might be useful for debugging purposes.
The parameter is enabled by default.
Signed-off-by: Claudio Imbrenda <[email protected]>
Reviewed-by: Janosch Frank <[email protected]>
---
arch/s390/kvm/kvm-s390.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 2296b1ff1e02..702696189505 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -207,6 +207,11 @@ unsigned int diag9c_forwarding_hz;
module_param(diag9c_forwarding_hz, uint, 0644);
MODULE_PARM_DESC(diag9c_forwarding_hz, "Maximum diag9c forwarding per second, 0 to turn off");
+/* allow asynchronous deinit for protected guests */
+static int lazy_destroy = 1;
+module_param(lazy_destroy, int, 0444);
+MODULE_PARM_DESC(lazy_destroy, "Asynchronous destroy for protected guests");
+
/*
* For now we handle at most 16 double words as this is what the s390 base
* kernel handles and stores in the prefix page. If we ever need to go beyond
--
2.34.1
With upcoming patches, protected guests will be able to trigger secure
storage violations in normal operation.
A secure storage violation is triggered when a protected guest tries to
access secure memory that has been mapped erroneously, or that belongs
to a different protected guest or to the Ultravisor.
This happens, for example, if a protected guest is rebooted with lazy
destroy enabled and the new guest is also protected.
When the new protected guest touches pages that have not yet been
destroyed, and thus are accounted to the previous protected guest, a
secure storage violation is raised.
This patch adds handling of secure storage violations for protected
guests.
This exception is handled by first trying to destroy the page, because
it is expected to belong to a defunct protected guest where a destroy
should be possible. If that fails, a normal export of the page is
attempted.
Therefore, pages that trigger the exception will be made non-secure
before attempting to use them again for a different secure guest.
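The destroy-then-export fallback described above can be sketched with mocked UVCs. The mock_* helpers below are stand-ins for uv_destroy_owned_page() and uv_convert_owned_from_secure(); the names are invented for the example.

```c
#include <assert.h>

static int mock_destroy_rc;	/* simulated result of the destroy UVC */
static int mock_export_rc;	/* simulated result of the export UVC */

static int mock_destroy_page(void) { return mock_destroy_rc; }
static int mock_export_page(void) { return mock_export_rc; }

/* Try the fast path (destroy) first, then fall back to exporting. */
static int handle_violation(void)
{
	int rc = mock_destroy_page();

	/*
	 * Destroy can legitimately fail, e.g. when two CPUs raced on the
	 * same page and it was re-imported in between; in that case fall
	 * back to exporting the page instead of failing hard.
	 */
	if (rc)
		rc = mock_export_page();
	return rc;
}
```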
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Janosch Frank <[email protected]>
---
arch/s390/include/asm/uv.h | 1 +
arch/s390/kernel/uv.c | 55 ++++++++++++++++++++++++++++++++++++++
arch/s390/mm/fault.c | 10 +++++++
3 files changed, 66 insertions(+)
diff --git a/arch/s390/include/asm/uv.h b/arch/s390/include/asm/uv.h
index 86218382d29c..6b2b33f19abe 100644
--- a/arch/s390/include/asm/uv.h
+++ b/arch/s390/include/asm/uv.h
@@ -356,6 +356,7 @@ static inline int is_prot_virt_host(void)
}
int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb);
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr);
int uv_destroy_owned_page(unsigned long paddr);
int uv_convert_from_secure(unsigned long paddr);
int uv_convert_owned_from_secure(unsigned long paddr);
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index a5425075dd25..2754471cc789 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -334,6 +334,61 @@ int gmap_convert_to_secure(struct gmap *gmap, unsigned long gaddr)
}
EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
+/**
+ * gmap_destroy_page - Destroy a guest page.
+ * @gmap: the gmap of the guest
+ * @gaddr: the guest address to destroy
+ *
+ * An attempt will be made to destroy the given guest page. If the attempt
+ * fails, an attempt is made to export the page. If both attempts fail, an
+ * appropriate error is returned.
+ */
+int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
+{
+ struct vm_area_struct *vma;
+ unsigned long uaddr;
+ struct page *page;
+ int rc;
+
+ rc = -EFAULT;
+ mmap_read_lock(gmap->mm);
+
+ uaddr = __gmap_translate(gmap, gaddr);
+ if (IS_ERR_VALUE(uaddr))
+ goto out;
+ vma = vma_lookup(gmap->mm, uaddr);
+ if (!vma)
+ goto out;
+ /*
+ * Huge pages should not be able to become secure
+ */
+ if (is_vm_hugetlb_page(vma))
+ goto out;
+
+ rc = 0;
+ /* we take an extra reference here */
+ page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
+ if (IS_ERR_OR_NULL(page))
+ goto out;
+ rc = uv_destroy_owned_page(page_to_phys(page));
+ /*
+ * Fault handlers can race; it is possible that two CPUs will fault
+ * on the same secure page. One CPU can destroy the page, reboot,
+ * re-enter secure mode and import it, while the second CPU was
+ * stuck at the beginning of the handler. At some point the second
+ * CPU will be able to progress, and it will not be able to destroy
+ * the page. In that case we do not want to terminate the process,
+ * we instead try to export the page.
+ */
+ if (rc)
+ rc = uv_convert_owned_from_secure(page_to_phys(page));
+ put_page(page);
+out:
+ mmap_read_unlock(gmap->mm);
+ return rc;
+}
+EXPORT_SYMBOL_GPL(gmap_destroy_page);
+
/*
* To be called with the page locked or with an extra reference! This will
* prevent gmap_make_secure from touching the page concurrently. Having 2
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index ff16ce0d04ee..47b52e5384f8 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -853,6 +853,16 @@ NOKPROBE_SYMBOL(do_non_secure_storage_access);
void do_secure_storage_violation(struct pt_regs *regs)
{
+ unsigned long gaddr = regs->int_parm_long & __FAIL_ADDR_MASK;
+ struct gmap *gmap = (struct gmap *)S390_lowcore.gmap;
+
+ /*
+ * If the VM has been rebooted, its address space might still contain
+ * secure pages from the previous boot.
+ * Clear the page so it can be reused.
+ */
+ if (!gmap_destroy_page(gmap, gaddr))
+ return;
/*
* Either KVM messed up the secure guest mapping or the same
* page is mapped into multiple secure guests.
--
2.34.1
On 3/30/22 14:26, Claudio Imbrenda wrote:
> In upcoming patches it will be possible to start tearing down a
> protected VM, and finish the teardown concurrently in a different
> thread.
>
> Protected VMs that are pending for tear down ("leftover") need to be
> cleaned properly when the userspace process (e.g. qemu) terminates.
>
> This patch makes sure that all "leftover" protected VMs are always
> properly torn down.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> arch/s390/include/asm/kvm_host.h | 2 +
> arch/s390/kvm/kvm-s390.c | 1 +
> arch/s390/kvm/pv.c | 69 ++++++++++++++++++++++++++++++++
> 3 files changed, 72 insertions(+)
>
> diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> index 1bccb8561ba9..50e3516cbc03 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -922,6 +922,8 @@ struct kvm_s390_pv {
> u64 guest_len;
> unsigned long stor_base;
> void *stor_var;
> + void *async_deinit;
> + struct list_head need_cleanup;
> struct mmu_notifier mmu_notifier;
> };
>
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 446f89db93a1..3637f556ff33 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2788,6 +2788,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> kvm_s390_vsie_init(kvm);
> if (use_gisa)
> kvm_s390_gisa_init(kvm);
> + INIT_LIST_HEAD(&kvm->arch.pv.need_cleanup);
kvm->arch.pv.sync_deinit = NULL;
> KVM_EVENT(3, "vm 0x%pK created by pid %u", kvm, current->pid);
>
> return 0;
> diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> index be3b467f8feb..56412617dd01 100644
> --- a/arch/s390/kvm/pv.c
> +++ b/arch/s390/kvm/pv.c
> @@ -17,6 +17,19 @@
> #include <linux/mmu_notifier.h>
> #include "kvm-s390.h"
>
> +/**
> + * @struct deferred_priv
> + * Represents a "leftover" protected VM that does not correspond to any
> + * active KVM VM.
Maybe something like:
...that is still registered with the Ultravisor but isn't registered
with KVM anymore.
> + */
> +struct deferred_priv {
> + struct list_head list;
> + unsigned long old_table;
> + u64 handle;
> + void *stor_var;
> + unsigned long stor_base;
> +};
> +
> static void kvm_s390_clear_pv_state(struct kvm *kvm)
> {
> kvm->arch.pv.handle = 0;
> @@ -163,6 +176,60 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
> return -ENOMEM;
> }
>
> +/**
> + * kvm_s390_pv_cleanup_deferred - Clean up one leftover protected VM.
> + * @kvm the KVM that was associated with this leftover protected VM
> + * @deferred details about the leftover protected VM that needs a clean up
> + * Return: 0 in case of success, otherwise 1
> + */
> +static int kvm_s390_pv_cleanup_deferred(struct kvm *kvm, struct deferred_priv *deferred)
> +{
> + u16 rc, rrc;
> + int cc;
> +
> + cc = uv_cmd_nodata(deferred->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
> + KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", rc, rrc);
> + WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
> + if (cc)
> + return cc;
> + /*
> + * Intentionally leak unusable memory. If the UVC fails, the memory
> + * used for the VM and its metadata is permanently unusable.
> + * This can only happen in case of a serious KVM or hardware bug; it
> + * is not expected to happen in normal operation.
> + */
> + free_pages(deferred->stor_base, get_order(uv_info.guest_base_stor_len));
> + free_pages(deferred->old_table, CRST_ALLOC_ORDER);
> + vfree(deferred->stor_var);
> + return 0;
> +}
> +
> +/**
> + * kvm_s390_pv_cleanup_leftovers - Clean up all leftover protected VMs.
> + * @kvm the KVM whose leftover protected VMs are to be cleaned up
> + * Return: 0 in case of success, otherwise 1
> + */
> +static int kvm_s390_pv_cleanup_leftovers(struct kvm *kvm)
> +{
> + struct deferred_priv *deferred;
> + int cc = 0;
> +
> + if (kvm->arch.pv.async_deinit)
> + list_add(kvm->arch.pv.async_deinit, &kvm->arch.pv.need_cleanup);
> +
> + while (!list_empty(&kvm->arch.pv.need_cleanup)) {
> + deferred = list_first_entry(&kvm->arch.pv.need_cleanup, typeof(*deferred), list);
> + if (kvm_s390_pv_cleanup_deferred(kvm, deferred))
> + cc = 1;
> + else
> + atomic_dec(&kvm->mm->context.protected_count);
> + list_del(&deferred->list);
> + kfree(deferred);
> + }
> + kvm->arch.pv.async_deinit = NULL;
> + return cc;
> +}
> +
> /* this should not fail, but if it does, we must not free the donated memory */
> int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> {
> @@ -190,6 +257,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
>
> + cc |= kvm_s390_pv_cleanup_leftovers(kvm);
> +
> return cc ? -EIO : 0;
> }
>
On Thu, 31 Mar 2022 16:02:55 +0200
Janosch Frank <[email protected]> wrote:
> On 3/30/22 14:26, Claudio Imbrenda wrote:
> > In upcoming patches it will be possible to start tearing down a
> > protected VM, and finish the teardown concurrently in a different
> > thread.
> >
> > Protected VMs that are pending for tear down ("leftover") need to be
> > cleaned properly when the userspace process (e.g. qemu) terminates.
> >
> > This patch makes sure that all "leftover" protected VMs are always
> > properly torn down.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > ---
> > arch/s390/include/asm/kvm_host.h | 2 +
> > arch/s390/kvm/kvm-s390.c | 1 +
> > arch/s390/kvm/pv.c | 69 ++++++++++++++++++++++++++++++++
> > 3 files changed, 72 insertions(+)
> >
> > diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
> > index 1bccb8561ba9..50e3516cbc03 100644
> > --- a/arch/s390/include/asm/kvm_host.h
> > +++ b/arch/s390/include/asm/kvm_host.h
> > @@ -922,6 +922,8 @@ struct kvm_s390_pv {
> > u64 guest_len;
> > unsigned long stor_base;
> > void *stor_var;
> > + void *async_deinit;
> > + struct list_head need_cleanup;
> > struct mmu_notifier mmu_notifier;
> > };
> >
> > diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> > index 446f89db93a1..3637f556ff33 100644
> > --- a/arch/s390/kvm/kvm-s390.c
> > +++ b/arch/s390/kvm/kvm-s390.c
> > @@ -2788,6 +2788,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> > kvm_s390_vsie_init(kvm);
> > if (use_gisa)
> > kvm_s390_gisa_init(kvm);
> > + INIT_LIST_HEAD(&kvm->arch.pv.need_cleanup);
>
> kvm->arch.pv.async_deinit = NULL;
isn't the struct allocated with __GFP_ZERO?
>
> > KVM_EVENT(3, "vm 0x%pK created by pid %u", kvm, current->pid);
> >
> > return 0;
> > diff --git a/arch/s390/kvm/pv.c b/arch/s390/kvm/pv.c
> > index be3b467f8feb..56412617dd01 100644
> > --- a/arch/s390/kvm/pv.c
> > +++ b/arch/s390/kvm/pv.c
> > @@ -17,6 +17,19 @@
> > #include <linux/mmu_notifier.h>
> > #include "kvm-s390.h"
> >
> > +/**
> > + * struct deferred_priv - Represents a "leftover" protected VM that does not
> > + * correspond to any active KVM VM.
>
> Maybe something like:
> ...that is still registered with the Ultravisor but isn't registered
> with KVM anymore.
will fix
>
> > + */
> > +struct deferred_priv {
> > + struct list_head list;
> > + unsigned long old_table;
> > + u64 handle;
> > + void *stor_var;
> > + unsigned long stor_base;
> > +};
> > +
> > static void kvm_s390_clear_pv_state(struct kvm *kvm)
> > {
> > kvm->arch.pv.handle = 0;
> > @@ -163,6 +176,60 @@ static int kvm_s390_pv_alloc_vm(struct kvm *kvm)
> > return -ENOMEM;
> > }
> >
> > +/**
> > + * kvm_s390_pv_cleanup_deferred - Clean up one leftover protected VM.
> > + * @kvm: the KVM that was associated with this leftover protected VM
> > + * @deferred: details about the leftover protected VM that needs to be cleaned up
> > + * Return: 0 in case of success, otherwise 1
> > + */
> > +static int kvm_s390_pv_cleanup_deferred(struct kvm *kvm, struct deferred_priv *deferred)
> > +{
> > + u16 rc, rrc;
> > + int cc;
> > +
> > + cc = uv_cmd_nodata(deferred->handle, UVC_CMD_DESTROY_SEC_CONF, &rc, &rrc);
> > + KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", rc, rrc);
> > + WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", rc, rrc);
> > + if (cc)
> > + return cc;
> > + /*
> > + * Intentionally leak unusable memory. If the UVC fails, the memory
> > + * used for the VM and its metadata is permanently unusable.
> > + * This can only happen in case of a serious KVM or hardware bug; it
> > + * is not expected to happen in normal operation.
> > + */
> > + free_pages(deferred->stor_base, get_order(uv_info.guest_base_stor_len));
> > + free_pages(deferred->old_table, CRST_ALLOC_ORDER);
> > + vfree(deferred->stor_var);
> > + return 0;
> > +}
> > +
> > +/**
> > + * kvm_s390_pv_cleanup_leftovers - Clean up all leftover protected VMs.
> > + * @kvm: the KVM whose leftover protected VMs are to be cleaned up
> > + * Return: 0 in case of success, otherwise 1
> > + */
> > +static int kvm_s390_pv_cleanup_leftovers(struct kvm *kvm)
> > +{
> > + struct deferred_priv *deferred;
> > + int cc = 0;
> > +
> > + if (kvm->arch.pv.async_deinit)
> > + list_add(kvm->arch.pv.async_deinit, &kvm->arch.pv.need_cleanup);
> > +
> > + while (!list_empty(&kvm->arch.pv.need_cleanup)) {
> > + deferred = list_first_entry(&kvm->arch.pv.need_cleanup, typeof(*deferred), list);
> > + if (kvm_s390_pv_cleanup_deferred(kvm, deferred))
> > + cc = 1;
> > + else
> > + atomic_dec(&kvm->mm->context.protected_count);
> > + list_del(&deferred->list);
> > + kfree(deferred);
> > + }
> > + kvm->arch.pv.async_deinit = NULL;
> > + return cc;
> > +}
> > +
> > /* this should not fail, but if it does, we must not free the donated memory */
> > int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> > {
> > @@ -190,6 +257,8 @@ int kvm_s390_pv_deinit_vm(struct kvm *kvm, u16 *rc, u16 *rrc)
> > KVM_UV_EVENT(kvm, 3, "PROTVIRT DESTROY VM: rc %x rrc %x", *rc, *rrc);
> > WARN_ONCE(cc, "protvirt destroy vm failed rc %x rrc %x", *rc, *rrc);
> >
> > + cc |= kvm_s390_pv_cleanup_leftovers(kvm);
> > +
> > return cc ? -EIO : 0;
> > }
> >
>