This series adds support for paravirtualized time for arm64 guests and
KVM hosts following the specification in Arm's document DEN 0057A:
https://developer.arm.com/docs/den0057/a
It implements support for stolen time, allowing the guest to
identify time when it is forcibly not executing.
It doesn't implement support for Live Physical Time (LPT) as there are
some concerns about the overheads and approach in the above
specification, and I expect an updated version of the specification to
be released soon with just the stolen time parts.
I previously posted a series including LPT (as well as stolen time):
https://lore.kernel.org/kvmarm/[email protected]/
Patches 2, 5, 7 and 8 are cleanup patches and could be taken separately.
Christoffer Dall (1):
KVM: arm/arm64: Factor out hypercall handling from PSCI code
Steven Price (8):
KVM: arm64: Document PV-time interface
KVM: arm64: Implement PV_FEATURES call
KVM: arm64: Support stolen time reporting via shared structure
KVM: Allow kvm_device_ops to be const
KVM: arm64: Provide a PV_TIME device to user space
arm/arm64: Provide a wrapper for SMCCC 1.1 calls
arm/arm64: Make use of the SMCCC 1.1 wrapper
arm64: Retrieve stolen time as paravirtualized guest
Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++
arch/arm/kvm/Makefile | 2 +-
arch/arm/kvm/handle_exit.c | 2 +-
arch/arm/mm/proc-v7-bugs.c | 13 +-
arch/arm64/include/asm/kvm_host.h | 13 +-
arch/arm64/include/asm/kvm_mmu.h | 2 +
arch/arm64/include/asm/pvclock-abi.h | 20 +++
arch/arm64/include/uapi/asm/kvm.h | 6 +
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/cpu_errata.c | 80 ++++------
arch/arm64/kernel/kvm.c | 155 ++++++++++++++++++
arch/arm64/kvm/Kconfig | 1 +
arch/arm64/kvm/Makefile | 2 +
arch/arm64/kvm/handle_exit.c | 4 +-
include/kvm/arm_hypercalls.h | 44 ++++++
include/kvm/arm_psci.h | 2 +-
include/linux/arm-smccc.h | 58 +++++++
include/linux/cpuhotplug.h | 1 +
include/linux/kvm_host.h | 4 +-
include/linux/kvm_types.h | 2 +
include/uapi/linux/kvm.h | 2 +
virt/kvm/arm/arm.c | 18 +++
virt/kvm/arm/hypercalls.c | 138 ++++++++++++++++
virt/kvm/arm/mmu.c | 44 ++++++
virt/kvm/arm/psci.c | 84 +---------
virt/kvm/arm/pvtime.c | 190 +++++++++++++++++++++++
virt/kvm/kvm_main.c | 6 +-
27 files changed, 848 insertions(+), 153 deletions(-)
create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
create mode 100644 arch/arm64/include/asm/pvclock-abi.h
create mode 100644 arch/arm64/kernel/kvm.c
create mode 100644 include/kvm/arm_hypercalls.h
create mode 100644 virt/kvm/arm/hypercalls.c
create mode 100644 virt/kvm/arm/pvtime.c
--
2.20.1
Allow user space to inform the KVM host where in the physical memory
map the paravirtualized time structures should be located.
A device is created which provides the base address of an array of
Stolen Time (ST) structures, one for each VCPU. There must be (64 *
total number of VCPUs) bytes of memory available at this location.
The address is given in terms of the physical address visible to
the guest and must be 64 byte aligned. The memory should be marked as
reserved to the guest to stop it from being allocated for other purposes.
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_mmu.h | 2 +
arch/arm64/include/uapi/asm/kvm.h | 6 +
arch/arm64/kvm/Makefile | 1 +
include/uapi/linux/kvm.h | 2 +
virt/kvm/arm/mmu.c | 44 +++++++
virt/kvm/arm/pvtime.c | 190 ++++++++++++++++++++++++++++++
6 files changed, 245 insertions(+)
create mode 100644 virt/kvm/arm/pvtime.c
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index befe37d4bc0e..88c8a4b2836f 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
void kvm_free_stage2_pgd(struct kvm *kvm);
int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
phys_addr_t pa, unsigned long size, bool writable);
+int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
+ phys_addr_t pa, unsigned long size, bool writable);
int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 9a507716ae2f..95516a4198ea 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -367,6 +367,12 @@ struct kvm_vcpu_events {
#define KVM_PSCI_RET_INVAL PSCI_RET_INVALID_PARAMS
#define KVM_PSCI_RET_DENIED PSCI_RET_DENIED
+/* Device Control API: PV_TIME */
+#define KVM_DEV_ARM_PV_TIME_PADDR 0
+#define KVM_DEV_ARM_PV_TIME_ST 0
+#define KVM_DEV_ARM_PV_TIME_STATE_SIZE 1
+#define KVM_DEV_ARM_PV_TIME_STATE 2
+
#endif
#endif /* __ARM_KVM_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 73dce4d47d47..5ffbdc39e780 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
+kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a7c19540ce21..04bffafa0708 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1222,6 +1222,8 @@ enum kvm_device_type {
#define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS
KVM_DEV_TYPE_XIVE,
#define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE
+ KVM_DEV_TYPE_ARM_PV_TIME,
+#define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME
KVM_DEV_TYPE_MAX,
};
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 38b4c910b6c3..be28a4aee451 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
return ret;
}
+/**
+ * kvm_phys_addr_memremap - map a memory range to guest IPA
+ *
+ * @kvm: The KVM pointer
+ * @guest_ipa: The IPA at which to insert the mapping
+ * @pa: The physical address of the memory
+ * @size: The size of the mapping
+ */
+int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
+ phys_addr_t pa, unsigned long size, bool writable)
+{
+ phys_addr_t addr, end;
+ int ret = 0;
+ unsigned long pfn;
+ struct kvm_mmu_memory_cache cache = { 0, };
+
+ end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
+ pfn = __phys_to_pfn(pa);
+
+ for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
+ pte_t pte = pfn_pte(pfn, PAGE_S2);
+
+ if (writable)
+ pte = kvm_s2pte_mkwrite(pte);
+
+ ret = mmu_topup_memory_cache(&cache,
+ kvm_mmu_cache_min_pages(kvm),
+ KVM_NR_MEM_OBJS);
+ if (ret)
+ goto out;
+ spin_lock(&kvm->mmu_lock);
+ ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
+ spin_unlock(&kvm->mmu_lock);
+ if (ret)
+ goto out;
+
+ pfn++;
+ }
+
+out:
+ mmu_free_memory_cache(&cache);
+ return ret;
+}
+
static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
{
kvm_pfn_t pfn = *pfnp;
diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
new file mode 100644
index 000000000000..9051bc07eae1
--- /dev/null
+++ b/virt/kvm/arm/pvtime.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2019 Arm Ltd.
+
+#include <linux/kvm_host.h>
+#include <asm/kvm_mmu.h>
+
+/* We currently only support PV time on ARM64 */
+#ifdef CONFIG_ARM64
+
+#include <asm/pvclock-abi.h>
+
+static int max_stolen_size(void)
+{
+ int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);
+
+ return ALIGN(size, PAGE_SIZE);
+}
+
+static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
+{
+ struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
+
+ pvtime->st = alloc_pages_exact(max_stolen_size(),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!pvtime->st)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
+{
+ struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
+
+ pvtime->st_base = GPA_INVALID;
+ free_pages_exact(pvtime->st, max_stolen_size());
+ kfree(dev);
+}
+
+static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
+ void *kaddr, int size)
+{
+ return kvm_phys_addr_memremap(kvm, guest_paddr,
+ virt_to_phys(kaddr),
+ size, false);
+}
+
+static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
+{
+ void *source;
+ size_t size;
+
+ switch (type) {
+ case KVM_DEV_ARM_PV_TIME_ST:
+ source = kvm->arch.pvtime.st;
+ size = sizeof(struct pvclock_vcpu_stolen_time_info) *
+ atomic_read(&kvm->online_vcpus);
+ break;
+ default:
+ return -ENXIO;
+ }
+
+ if (copy_to_user(user, source, size))
+ return -EFAULT;
+ return 0;
+}
+
+static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
+{
+ void *dest;
+ size_t size;
+
+ switch (type) {
+ case KVM_DEV_ARM_PV_TIME_ST:
+ dest = kvm->arch.pvtime.st;
+ size = sizeof(struct pvclock_vcpu_stolen_time_info) *
+ atomic_read(&kvm->online_vcpus);
+ break;
+ default:
+ return -ENXIO;
+ }
+
+ if (copy_from_user(dest, user, size))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
+ u64 __user *user = (u64 __user *)attr->addr;
+ u64 paddr;
+ int ret;
+
+ switch (attr->group) {
+ case KVM_DEV_ARM_PV_TIME_PADDR:
+ if (get_user(paddr, user))
+ return -EFAULT;
+ if (paddr & 63)
+ return -EINVAL;
+ switch (attr->attr) {
+ case KVM_DEV_ARM_PV_TIME_ST:
+ if (pvtime->st_base != GPA_INVALID)
+ return -EEXIST;
+ ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
+ max_stolen_size());
+ if (ret)
+ return ret;
+ pvtime->st_base = paddr;
+ return 0;
+ }
+ break;
+ case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
+ return -EPERM;
+ case KVM_DEV_ARM_PV_TIME_STATE:
+ return pvtime_restore_state(dev->kvm, attr->attr, user);
+ }
+ return -ENXIO;
+}
+
+static int kvm_arm_pvtime_get_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ u64 __user *user = (u64 __user *)attr->addr;
+ u32 size;
+
+ switch (attr->group) {
+ case KVM_DEV_ARM_PV_TIME_PADDR:
+ switch (attr->attr) {
+ case KVM_DEV_ARM_PV_TIME_ST:
+ if (put_user(dev->kvm->arch.pvtime.st_base, user))
+ return -EFAULT;
+ return 0;
+ }
+ break;
+ case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
+ switch (attr->attr) {
+ case KVM_DEV_ARM_PV_TIME_ST:
+ size = sizeof(struct pvclock_vcpu_stolen_time_info);
+ size *= atomic_read(&dev->kvm->online_vcpus);
+ break;
+ default:
+ return -ENXIO;
+ }
+ if (put_user(size, user))
+ return -EFAULT;
+ return 0;
+ case KVM_DEV_ARM_PV_TIME_STATE:
+ return pvtime_save_state(dev->kvm, attr->attr, user);
+ }
+ return -ENXIO;
+}
+
+static int kvm_arm_pvtime_has_attr(struct kvm_device *dev,
+ struct kvm_device_attr *attr)
+{
+ switch (attr->group) {
+ case KVM_DEV_ARM_PV_TIME_PADDR:
+ case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
+ case KVM_DEV_ARM_PV_TIME_STATE:
+ switch (attr->attr) {
+ case KVM_DEV_ARM_PV_TIME_ST:
+ return 0;
+ }
+ break;
+ }
+ return -ENXIO;
+}
+
+static const struct kvm_device_ops pvtime_ops = {
+	.name = "Arm PV time",
+ .create = kvm_arm_pvtime_create,
+ .destroy = kvm_arm_pvtime_destroy,
+ .set_attr = kvm_arm_pvtime_set_attr,
+ .get_attr = kvm_arm_pvtime_get_attr,
+ .has_attr = kvm_arm_pvtime_has_attr
+};
+
+static int __init kvm_pvtime_init(void)
+{
+ kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
+
+ return 0;
+}
+
+late_initcall(kvm_pvtime_init);
+
+#endif
--
2.20.1
On Fri, 2 Aug 2019 15:50:14 +0100
Steven Price <[email protected]> wrote:
> Allow user space to inform the KVM host where in the physical memory
> map the paravirtualized time structures should be located.
>
> A device is created which provides the base address of an array of
> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
> total number of VCPUs) bytes of memory available at this location.
>
> The address is given in terms of the physical address visible to
> the guest and must be 64 byte aligned. The memory should be marked as
> reserved to the guest to stop it allocating it for other purposes.
Why? You seem to be allocating the memory from the kernel, so as far as
the guest is concerned, this isn't generally usable memory.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_mmu.h | 2 +
> arch/arm64/include/uapi/asm/kvm.h | 6 +
> arch/arm64/kvm/Makefile | 1 +
> include/uapi/linux/kvm.h | 2 +
> virt/kvm/arm/mmu.c | 44 +++++++
> virt/kvm/arm/pvtime.c | 190 ++++++++++++++++++++++++++++++
> 6 files changed, 245 insertions(+)
> create mode 100644 virt/kvm/arm/pvtime.c
>
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index befe37d4bc0e..88c8a4b2836f 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
> void kvm_free_stage2_pgd(struct kvm *kvm);
> int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> phys_addr_t pa, unsigned long size, bool writable);
> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
> + phys_addr_t pa, unsigned long size, bool writable);
>
> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index 9a507716ae2f..95516a4198ea 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -367,6 +367,12 @@ struct kvm_vcpu_events {
> #define KVM_PSCI_RET_INVAL PSCI_RET_INVALID_PARAMS
> #define KVM_PSCI_RET_DENIED PSCI_RET_DENIED
>
> +/* Device Control API: PV_TIME */
> +#define KVM_DEV_ARM_PV_TIME_PADDR 0
> +#define KVM_DEV_ARM_PV_TIME_ST 0
> +#define KVM_DEV_ARM_PV_TIME_STATE_SIZE 1
> +#define KVM_DEV_ARM_PV_TIME_STATE 2
> +
> #endif
>
> #endif /* __ARM_KVM_H__ */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 73dce4d47d47..5ffbdc39e780 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
> kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
> kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
> kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
> +kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
>
> kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
> kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a7c19540ce21..04bffafa0708 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1222,6 +1222,8 @@ enum kvm_device_type {
> #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS
> KVM_DEV_TYPE_XIVE,
> #define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE
> + KVM_DEV_TYPE_ARM_PV_TIME,
> +#define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME
> KVM_DEV_TYPE_MAX,
> };
>
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 38b4c910b6c3..be28a4aee451 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
> return ret;
> }
>
> +/**
> + * kvm_phys_addr_memremap - map a memory range to guest IPA
> + *
> + * @kvm: The KVM pointer
> + * @guest_ipa: The IPA at which to insert the mapping
> + * @pa: The physical address of the memory
> + * @size: The size of the mapping
> + */
> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
> + phys_addr_t pa, unsigned long size, bool writable)
> +{
> + phys_addr_t addr, end;
> + int ret = 0;
> + unsigned long pfn;
> + struct kvm_mmu_memory_cache cache = { 0, };
> +
> + end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
> + pfn = __phys_to_pfn(pa);
> +
> + for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
> + pte_t pte = pfn_pte(pfn, PAGE_S2);
> +
> + if (writable)
> + pte = kvm_s2pte_mkwrite(pte);
> +
> + ret = mmu_topup_memory_cache(&cache,
> + kvm_mmu_cache_min_pages(kvm),
> + KVM_NR_MEM_OBJS);
> + if (ret)
> + goto out;
> + spin_lock(&kvm->mmu_lock);
> + ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
> + spin_unlock(&kvm->mmu_lock);
> + if (ret)
> + goto out;
> +
> + pfn++;
> + }
> +
> +out:
> + mmu_free_memory_cache(&cache);
> + return ret;
> +}
This is an exact copy of kvm_phys_addr_ioremap(), with only the memory
attributes changing. Surely we can have a shared implementation that
takes the memory attribute as a parameter.
> +
> static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
> {
> kvm_pfn_t pfn = *pfnp;
> diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
> new file mode 100644
> index 000000000000..9051bc07eae1
> --- /dev/null
> +++ b/virt/kvm/arm/pvtime.c
> @@ -0,0 +1,190 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (C) 2019 Arm Ltd.
> +
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_mmu.h>
> +
> +/* We currently only support PV time on ARM64 */
> +#ifdef CONFIG_ARM64
And we're only compiling it on arm64, so why the #ifdef?
> +
> +#include <asm/pvclock-abi.h>
> +
> +static int max_stolen_size(void)
> +{
> + int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);
So we're always allocating enough memory for 512 CPUs? That's an
additional 32kB of contiguous memory...
> +
> + return ALIGN(size, PAGE_SIZE);
> +}
> +
> +static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
> +{
> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
> +
> + pvtime->st = alloc_pages_exact(max_stolen_size(),
> + GFP_KERNEL | __GFP_ZERO);
> + if (!pvtime->st)
> + return -ENOMEM;
Is there any chance we could use a vmalloc allocation instead? This
would lift the requirement on having physically contiguous memory.
> +
> + return 0;
> +}
> +
> +static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
> +{
> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
> +
> + pvtime->st_base = GPA_INVALID;
> + free_pages_exact(pvtime->st, max_stolen_size());
> + kfree(dev);
> +}
> +
> +static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
> + void *kaddr, int size)
> +{
> + return kvm_phys_addr_memremap(kvm, guest_paddr,
> + virt_to_phys(kaddr),
> + size, false);
> +}
> +
> +static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
> +{
> + void *source;
> + size_t size;
> +
> + switch (type) {
> + case KVM_DEV_ARM_PV_TIME_ST:
> + source = kvm->arch.pvtime.st;
> + size = sizeof(struct pvclock_vcpu_stolen_time_info) *
> + atomic_read(&kvm->online_vcpus);
> + break;
> + default:
> + return -ENXIO;
> + }
> +
> + if (copy_to_user(user, source, size))
> + return -EFAULT;
> + return 0;
> +}
> +
> +static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
> +{
> + void *dest;
> + size_t size;
> +
> + switch (type) {
> + case KVM_DEV_ARM_PV_TIME_ST:
> + dest = kvm->arch.pvtime.st;
> + size = sizeof(struct pvclock_vcpu_stolen_time_info) *
> + atomic_read(&kvm->online_vcpus);
> + break;
> + default:
> + return -ENXIO;
> + }
> +
> + if (copy_from_user(dest, user, size))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
> + u64 __user *user = (u64 __user *)attr->addr;
> + u64 paddr;
> + int ret;
> +
> + switch (attr->group) {
> + case KVM_DEV_ARM_PV_TIME_PADDR:
> + if (get_user(paddr, user))
> + return -EFAULT;
> + if (paddr & 63)
> + return -EINVAL;
You should check whether the device fits into the IPA space for this
guest, and whether it overlaps with anything else.
> + switch (attr->attr) {
> + case KVM_DEV_ARM_PV_TIME_ST:
> + if (pvtime->st_base != GPA_INVALID)
> + return -EEXIST;
> + ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
> + max_stolen_size());
Consider moving the size directly into pvtime_map_pages(), and dropping
the pvtime->st parameter. All you need is kvm and paddr.
> + if (ret)
> + return ret;
> + pvtime->st_base = paddr;
> + return 0;
> + }
> + break;
> + case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
> + return -EPERM;
> + case KVM_DEV_ARM_PV_TIME_STATE:
> + return pvtime_restore_state(dev->kvm, attr->attr, user);
> + }
> + return -ENXIO;
> +}
> +
> +static int kvm_arm_pvtime_get_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + u64 __user *user = (u64 __user *)attr->addr;
> + u32 size;
> +
> + switch (attr->group) {
> + case KVM_DEV_ARM_PV_TIME_PADDR:
> + switch (attr->attr) {
> + case KVM_DEV_ARM_PV_TIME_ST:
> + if (put_user(dev->kvm->arch.pvtime.st_base, user))
> + return -EFAULT;
> + return 0;
> + }
> + break;
> + case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
> + switch (attr->attr) {
> + case KVM_DEV_ARM_PV_TIME_ST:
> + size = sizeof(struct pvclock_vcpu_stolen_time_info);
> + size *= atomic_read(&dev->kvm->online_vcpus);
> + break;
> + default:
> + return -ENXIO;
> + }
> + if (put_user(size, user))
> + return -EFAULT;
> + return 0;
> + case KVM_DEV_ARM_PV_TIME_STATE:
> + return pvtime_save_state(dev->kvm, attr->attr, user);
> + }
> + return -ENXIO;
> +}
> +
> +static int kvm_arm_pvtime_has_attr(struct kvm_device *dev,
> + struct kvm_device_attr *attr)
> +{
> + switch (attr->group) {
> + case KVM_DEV_ARM_PV_TIME_PADDR:
> + case KVM_DEV_ARM_PV_TIME_STATE_SIZE:
> + case KVM_DEV_ARM_PV_TIME_STATE:
> + switch (attr->attr) {
> + case KVM_DEV_ARM_PV_TIME_ST:
> + return 0;
> + }
> + break;
> + }
> + return -ENXIO;
> +}
> +
> +static const struct kvm_device_ops pvtime_ops = {
> + "Arm PV time",
> + .create = kvm_arm_pvtime_create,
> + .destroy = kvm_arm_pvtime_destroy,
> + .set_attr = kvm_arm_pvtime_set_attr,
> + .get_attr = kvm_arm_pvtime_get_attr,
> + .has_attr = kvm_arm_pvtime_has_attr
> +};
> +
> +static int __init kvm_pvtime_init(void)
> +{
> + kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
> +
> + return 0;
> +}
> +
> +late_initcall(kvm_pvtime_init);
> +
> +#endif
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Sat, 3 Aug 2019 13:51:13 +0100
Marc Zyngier <[email protected]> wrote:
[forgot that one]
> On Fri, 2 Aug 2019 15:50:14 +0100
> Steven Price <[email protected]> wrote:
[...]
> > +static int __init kvm_pvtime_init(void)
> > +{
> > + kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
> > +
> > + return 0;
> > +}
> > +
> > +late_initcall(kvm_pvtime_init);
Why is it an initcall? So far, the only initcall we've used is the one
that initializes KVM itself. Can't we just register the device_ops like we
do for the vgic?
M.
--
Without deviation from the norm, progress is not possible.
On Fri, 2 Aug 2019 15:50:08 +0100
Steven Price <[email protected]> wrote:
Hi Steven,
> This series add support for paravirtualized time for arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
>
> https://developer.arm.com/docs/den0057/a
>
> It implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.
>
> It doesn't implement support for Live Physical Time (LPT) as there are
> some concerns about the overheads and approach in the above
> specification, and I expect an updated version of the specification to
> be released soon with just the stolen time parts.
Thanks for posting this.
My current concern with this series is around the fact that we allocate
memory from the kernel on behalf of the guest. It is the first example
of such a thing in the ARM port, and I can't really say I'm fond of it.
x86 seems to get away with it by having the memory allocated from
userspace, which I tend to like more. Yes, put_user is more
expensive than a straight store, but this isn't done too often either.
What is the rationale for your current approach?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On 03/08/2019 19:05, Marc Zyngier wrote:
> On Fri, 2 Aug 2019 15:50:08 +0100
> Steven Price <[email protected]> wrote:
>
> Hi Steven,
>
>> This series add support for paravirtualized time for arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
>>
>> It doesn't implement support for Live Physical Time (LPT) as there are
>> some concerns about the overheads and approach in the above
>> specification, and I expect an updated version of the specification to
>> be released soon with just the stolen time parts.
>
> Thanks for posting this.
>
> My current concern with this series is around the fact that we allocate
> memory from the kernel on behalf of the guest. It is the first example
> of such thing in the ARM port, and I can't really say I'm fond of it.
>
> x86 seems to get away with it by having the memory allocated from
> userspace, why I tend to like more. Yes, put_user is more
> expensive than a straight store, but this isn't done too often either.
>
> What is the rational for your current approach?
As I see it there are 3 approaches that can be taken here:
1. Hypervisor allocates memory and adds it to the virtual machine. This
means that everything to do with the 'device' is encapsulated behind the
KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
stolen time structure to be fast it cannot be a trapping region and has
to be backed by real memory - in this case allocated by the host kernel.
2. Host user space allocates memory. Similar to above, but this time
user space needs to manage the memory region as well as the usual
KVM_CREATE_DEVICE dance. I've no objection to this, but it means
kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
to size the memory region).
3. Guest kernel "donates" the memory to the hypervisor for the
structure. As far as I'm aware this is what x86 does. The problems I see
with this approach are:
a) kexec becomes much more tricky - there needs to be a disabling
mechanism for the guest to stop the hypervisor scribbling on memory
before starting the new kernel.
b) If there is more than one entity that is interested in the
information (e.g. firmware and kernel) then this requires some form of
arbitration in the guest because the hypervisor doesn't want to have to
track an arbitrary number of regions to update.
c) Performance can suffer if the host kernel doesn't have a suitably
aligned/sized area to use. As you say - put_user() is more expensive.
The structure is updated on every return to the VM.
Of course x86 does prove the third approach can work, but I'm not sure
which is actually better. Avoiding the kexec cancellation requirements was
the main driver of the current approach. Although many of the
conversations about this were also tied up with Live Physical Time which
adds its own complications.
Steve
On 05/08/2019 14:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri, 2 Aug 2019 15:50:08 +0100
>> Steven Price <[email protected]> wrote:
>>
>> Hi Steven,
>>
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, why I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
>
> As I see it there are 3 approaches that can be taken here:
>
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
>
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).
>
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> this approach are:
>
> a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.
>
> b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.
>
> c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.
> The structure is updated on every return to the VM.
>
>
> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoid the kexec cancellation requirements was
> the main driver of the current approach. Although many of the
> conversations about this were also tied up with Live Physical Time which
> adds its own complications.
My current train of thought is around (2):
- We don't need a new mechanism to track pages or deal with overlapping
IPA ranges
- We can get rid of the save/restore interface
The drawback is that the amount of memory required per vcpu becomes ABI.
I don't think that's a huge deal, as the hypervisor has the same
contract with the guest.
We also take a small hit with put_user(), but this is only done as a
consequence of vcpu_load() (and not on every entry as you suggest
above). It'd be worth quantifying this overhead before making any
decision one way or another.
Thanks,
M.
--
Jazz is not dead, it just smells funny...
On 03/08/2019 13:51, Marc Zyngier wrote:
> On Fri, 2 Aug 2019 15:50:14 +0100
> Steven Price <[email protected]> wrote:
>
>> Allow user space to inform the KVM host where in the physical memory
>> map the paravirtualized time structures should be located.
>>
>> A device is created which provides the base address of an array of
>> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
>> total number of VCPUs) bytes of memory available at this location.
>>
>> The address is given in terms of the physical address visible to
>> the guest and must be 64 byte aligned. The memory should be marked as
>> reserved to the guest to stop it allocating it for other purposes.
>
> Why? You seem to be allocating the memory from the kernel, so as far as
> the guest is concerned, this isn't generally usable memory.
I obviously didn't word it very well - that's what I meant. The "memory"
that represents the stolen time structure shouldn't be shown to the
guest as normal memory, but "reserved" for the purpose of stolen time.
To be honest it looks like I forgot to rewrite this commit message -
while 64 byte alignment is all that the guest can rely on (because each
vCPU has its own structure), the actual array of structures needs to be
page aligned to ensure we can safely map it into the guest.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_mmu.h | 2 +
>> arch/arm64/include/uapi/asm/kvm.h | 6 +
>> arch/arm64/kvm/Makefile | 1 +
>> include/uapi/linux/kvm.h | 2 +
>> virt/kvm/arm/mmu.c | 44 +++++++
>> virt/kvm/arm/pvtime.c | 190 ++++++++++++++++++++++++++++++
>> 6 files changed, 245 insertions(+)
>> create mode 100644 virt/kvm/arm/pvtime.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index befe37d4bc0e..88c8a4b2836f 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -157,6 +157,8 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm);
>> void kvm_free_stage2_pgd(struct kvm *kvm);
>> int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> phys_addr_t pa, unsigned long size, bool writable);
>> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> + phys_addr_t pa, unsigned long size, bool writable);
>>
>> int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run);
>>
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 9a507716ae2f..95516a4198ea 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -367,6 +367,12 @@ struct kvm_vcpu_events {
>> #define KVM_PSCI_RET_INVAL PSCI_RET_INVALID_PARAMS
>> #define KVM_PSCI_RET_DENIED PSCI_RET_DENIED
>>
>> +/* Device Control API: PV_TIME */
>> +#define KVM_DEV_ARM_PV_TIME_PADDR 0
>> +#define KVM_DEV_ARM_PV_TIME_ST 0
>> +#define KVM_DEV_ARM_PV_TIME_STATE_SIZE 1
>> +#define KVM_DEV_ARM_PV_TIME_STATE 2
>> +
>> #endif
>>
>> #endif /* __ARM_KVM_H__ */
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 73dce4d47d47..5ffbdc39e780 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -14,6 +14,7 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/e
>> kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arm.o $(KVM)/arm/mmu.o $(KVM)/arm/mmio.o
>> kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/psci.o $(KVM)/arm/perf.o
>> kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/hypercalls.o
>> +kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/pvtime.o
>>
>> kvm-$(CONFIG_KVM_ARM_HOST) += inject_fault.o regmap.o va_layout.o
>> kvm-$(CONFIG_KVM_ARM_HOST) += hyp.o hyp-init.o handle_exit.o
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index a7c19540ce21..04bffafa0708 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1222,6 +1222,8 @@ enum kvm_device_type {
>> #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS
>> KVM_DEV_TYPE_XIVE,
>> #define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE
>> + KVM_DEV_TYPE_ARM_PV_TIME,
>> +#define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME
>> KVM_DEV_TYPE_MAX,
>> };
>>
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 38b4c910b6c3..be28a4aee451 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1368,6 +1368,50 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> return ret;
>> }
>>
>> +/**
>> + * kvm_phys_addr_memremap - map a memory range to guest IPA
>> + *
>> + * @kvm: The KVM pointer
>> + * @guest_ipa: The IPA at which to insert the mapping
>> + * @pa: The physical address of the memory
>> + * @size: The size of the mapping
>> + */
>> +int kvm_phys_addr_memremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> + phys_addr_t pa, unsigned long size, bool writable)
>> +{
>> + phys_addr_t addr, end;
>> + int ret = 0;
>> + unsigned long pfn;
>> + struct kvm_mmu_memory_cache cache = { 0, };
>> +
>> + end = (guest_ipa + size + PAGE_SIZE - 1) & PAGE_MASK;
>> + pfn = __phys_to_pfn(pa);
>> +
>> + for (addr = guest_ipa; addr < end; addr += PAGE_SIZE) {
>> + pte_t pte = pfn_pte(pfn, PAGE_S2);
>> +
>> + if (writable)
>> + pte = kvm_s2pte_mkwrite(pte);
>> +
>> + ret = mmu_topup_memory_cache(&cache,
>> + kvm_mmu_cache_min_pages(kvm),
>> + KVM_NR_MEM_OBJS);
>> + if (ret)
>> + goto out;
>> + spin_lock(&kvm->mmu_lock);
>> + ret = stage2_set_pte(kvm, &cache, addr, &pte, 0);
>> + spin_unlock(&kvm->mmu_lock);
>> + if (ret)
>> + goto out;
>> +
>> + pfn++;
>> + }
>> +
>> +out:
>> + mmu_free_memory_cache(&cache);
>> + return ret;
>> +}
>
> This is an exact copy of kvm_phys_addr_ioremap(), with only the memory
> attributes changing. Surely we can have a shared implementation that
> takes the memory attribute as a parameter.
Good point - although, given the below, I'm going to need something which
can deal with non-contiguous memory...
>> +
>> static bool transparent_hugepage_adjust(kvm_pfn_t *pfnp, phys_addr_t *ipap)
>> {
>> kvm_pfn_t pfn = *pfnp;
>> diff --git a/virt/kvm/arm/pvtime.c b/virt/kvm/arm/pvtime.c
>> new file mode 100644
>> index 000000000000..9051bc07eae1
>> --- /dev/null
>> +++ b/virt/kvm/arm/pvtime.c
>> @@ -0,0 +1,190 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright (C) 2019 Arm Ltd.
>> +
>> +#include <linux/kvm_host.h>
>> +#include <asm/kvm_mmu.h>
>> +
>> +/* We currently only support PV time on ARM64 */
>> +#ifdef CONFIG_ARM64
>
> And we're only compiling it on arm64, so why the #ifdef?
Another good point - will remove.
>> +
>> +#include <asm/pvclock-abi.h>
>> +
>> +static int max_stolen_size(void)
>> +{
>> + int size = KVM_MAX_VCPUS * sizeof(struct pvclock_vcpu_stolen_time_info);
>
> So we're always allocating enough memory for 512 CPUs? That's an
> additional 32kB of contiguous memory...
>
>> +
>> + return ALIGN(size, PAGE_SIZE);
>> +}
>> +
>> +static int kvm_arm_pvtime_create(struct kvm_device *dev, u32 type)
>> +{
>> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +
>> + pvtime->st = alloc_pages_exact(max_stolen_size(),
>> + GFP_KERNEL | __GFP_ZERO);
>> + if (!pvtime->st)
>> + return -ENOMEM;
>
> Is there any chance we could use a vmalloc allocation instead? This
> would lift the requirement on having physically contiguous memory.
Yes I think it should be possible to use non-contiguous memory and vmalloc.
>> +
>> + return 0;
>> +}
>> +
>> +static void kvm_arm_pvtime_destroy(struct kvm_device *dev)
>> +{
>> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> +
>> + pvtime->st_base = GPA_INVALID;
>> + free_pages_exact(pvtime->st, max_stolen_size());
>> + kfree(dev);
>> +}
>> +
>> +static int pvtime_map_pages(struct kvm *kvm, gpa_t guest_paddr,
>> + void *kaddr, int size)
>> +{
>> + return kvm_phys_addr_memremap(kvm, guest_paddr,
>> + virt_to_phys(kaddr),
>> + size, false);
>> +}
>> +
>> +static int pvtime_save_state(struct kvm *kvm, u64 type, void __user *user)
>> +{
>> + void *source;
>> + size_t size;
>> +
>> + switch (type) {
>> + case KVM_DEV_ARM_PV_TIME_ST:
>> + source = kvm->arch.pvtime.st;
>> + size = sizeof(struct pvclock_vcpu_stolen_time_info) *
>> + atomic_read(&kvm->online_vcpus);
>> + break;
>> + default:
>> + return -ENXIO;
>> + }
>> +
>> + if (copy_to_user(user, source, size))
>> + return -EFAULT;
>> + return 0;
>> +}
>> +
>> +static int pvtime_restore_state(struct kvm *kvm, u64 type, void __user *user)
>> +{
>> + void *dest;
>> + size_t size;
>> +
>> + switch (type) {
>> + case KVM_DEV_ARM_PV_TIME_ST:
>> + dest = kvm->arch.pvtime.st;
>> + size = sizeof(struct pvclock_vcpu_stolen_time_info) *
>> + atomic_read(&kvm->online_vcpus);
>> + break;
>> + default:
>> + return -ENXIO;
>> + }
>> +
>> + if (copy_from_user(dest, user, size))
>> + return -EFAULT;
>> +
>> + return 0;
>> +}
>> +
>> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
>> + struct kvm_device_attr *attr)
>> +{
>> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>> + u64 __user *user = (u64 __user *)attr->addr;
>> + u64 paddr;
>> + int ret;
>> +
>> + switch (attr->group) {
>> + case KVM_DEV_ARM_PV_TIME_PADDR:
>> + if (get_user(paddr, user))
>> + return -EFAULT;
>> + if (paddr & 63)
>> + return -EINVAL;
>
> You should check whether the device fits into the IPA space for this
> guest, and whether it overlaps with anything else.
pvtime_map_pages() should fail in the case of overlap. That seems
sufficient to me - do you think we need something stronger?
>> + switch (attr->attr) {
>> + case KVM_DEV_ARM_PV_TIME_ST:
>> + if (pvtime->st_base != GPA_INVALID)
>> + return -EEXIST;
>> + ret = pvtime_map_pages(dev->kvm, paddr, pvtime->st,
>> + max_stolen_size());
>
> Consider moving the size directly into pvtime_map_pages(), and dropping
> the pvtime->st parameter. All you need is kvm and paddr.
Will do.
Thanks,
Steve
On 05/08/2019 17:10, Steven Price wrote:
> On 03/08/2019 13:51, Marc Zyngier wrote:
>> On Fri, 2 Aug 2019 15:50:14 +0100
>> Steven Price <[email protected]> wrote:
>>
>>> Allow user space to inform the KVM host where in the physical memory
>>> map the paravirtualized time structures should be located.
>>>
>>> A device is created which provides the base address of an array of
>>> Stolen Time (ST) structures, one for each VCPU. There must be (64 *
>>> total number of VCPUs) bytes of memory available at this location.
>>>
>>> The address is given in terms of the physical address visible to
>>> the guest and must be 64 byte aligned. The memory should be marked as
>>> reserved to the guest to stop it allocating it for other purposes.
>>
>> Why? You seem to be allocating the memory from the kernel, so as far as
>> the guest is concerned, this isn't generally usable memory.
>
> I obviously didn't word it very well - that's what I meant. The "memory"
> that represents the stolen time structure shouldn't be shown to the
> guest as normal memory, but "reserved" for the purpose of stolen time.
>
> To be honest it looks like I forgot to rewrite this commit message -
> 64 byte alignment is all that the guest can rely on (because each
> vCPU has its own structure); the actual array of structures needs to be
> page aligned to ensure we can safely map it into the guest.
>
>>>
>>> Signed-off-by: Steven Price <[email protected]>
>>> ---
>>> arch/arm64/include/asm/kvm_mmu.h | 2 +
>>> arch/arm64/include/uapi/asm/kvm.h | 6 +
>>> arch/arm64/kvm/Makefile | 1 +
>>> include/uapi/linux/kvm.h | 2 +
>>> virt/kvm/arm/mmu.c | 44 +++++++
>>> virt/kvm/arm/pvtime.c | 190 ++++++++++++++++++++++++++++++
>>> 6 files changed, 245 insertions(+)
>>> create mode 100644 virt/kvm/arm/pvtime.c
[...]
>>> +static int kvm_arm_pvtime_set_attr(struct kvm_device *dev,
>>> + struct kvm_device_attr *attr)
>>> +{
>>> + struct kvm_arch_pvtime *pvtime = &dev->kvm->arch.pvtime;
>>> + u64 __user *user = (u64 __user *)attr->addr;
>>> + u64 paddr;
>>> + int ret;
>>> +
>>> + switch (attr->group) {
>>> + case KVM_DEV_ARM_PV_TIME_PADDR:
>>> + if (get_user(paddr, user))
>>> + return -EFAULT;
>>> + if (paddr & 63)
>>> + return -EINVAL;
>>
>> You should check whether the device fits into the IPA space for this
>> guest, and whether it overlaps with anything else.
>
> pvtime_map_pages() should fail in the case of overlap. That seems
> sufficient to me - do you think we need something stronger?
Definitely. stage2_set_pte() won't fail for a non-IO overlapping
mapping, and will just treat it as guest memory. If this overlaps with a
memslot, we'll never be able to fault that page in, ending up with
interesting memory corruption... :-/
That's one of the reasons why I think option (2) in your earlier email
is an interesting one, as it sidesteps a whole lot of ugly and hard to
test corner cases.
Thanks,
M.
--
Jazz is not dead, it just smells funny...
On 03/08/2019 18:34, Marc Zyngier wrote:
> On Sat, 3 Aug 2019 13:51:13 +0100
> Marc Zyngier <[email protected]> wrote:
>
> [forgot that one]
>
>> On Fri, 2 Aug 2019 15:50:14 +0100
>> Steven Price <[email protected]> wrote:
>
> [...]
>
>>> +static int __init kvm_pvtime_init(void)
>>> +{
>>> + kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +late_initcall(kvm_pvtime_init);
>
> Why is it an initcall? So far, the only initcall we've used is the one
> that initializes KVM itself. Can't we just register the device_ops like
> we do for the vgic?
So would you prefer a direct call from init_subsystems() in
virt/kvm/arm/arm.c?
The benefit of initcall is just that it keeps the code self-contained.
In init_subsystems() I'd either need a #ifdef CONFIG_ARM64 or a dummy
function for arm.
Steve
On 07/08/2019 14:39, Steven Price wrote:
> On 03/08/2019 18:34, Marc Zyngier wrote:
>> On Sat, 3 Aug 2019 13:51:13 +0100
>> Marc Zyngier <[email protected]> wrote:
>>
>> [forgot that one]
>>
>>> On Fri, 2 Aug 2019 15:50:14 +0100
>>> Steven Price <[email protected]> wrote:
>>
>> [...]
>>
>>>> +static int __init kvm_pvtime_init(void)
>>>> +{
>>>> + kvm_register_device_ops(&pvtime_ops, KVM_DEV_TYPE_ARM_PV_TIME);
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +late_initcall(kvm_pvtime_init);
>>
>> Why is it an initcall? So far, the only initcall we've used is the one
>> that initializes KVM itself. Can't we just register the device_ops like
>> we do for the vgic?
>
> So would you prefer a direct call from init_subsystems() in
> virt/kvm/arm/arm.c?
Yes. Consistency is important.
> The benefit of initcall is just that it keeps the code self-contained.
> In init_subsystems() I'd either need a #ifdef CONFIG_ARM64 or a dummy
> function for arm.
Having a dummy function for 32bit ARM is fine. Most of the code we add
to the 32bit port is made of empty stubs anyway.
Thanks,
M.
--
Jazz is not dead, it just smells funny...
On 05.08.19 15:06, Steven Price wrote:
> On 03/08/2019 19:05, Marc Zyngier wrote:
>> On Fri, 2 Aug 2019 15:50:08 +0100
>> Steven Price <[email protected]> wrote:
>>
>> Hi Steven,
>>
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>>> specification, and I expect an updated version of the specification to
>>> be released soon with just the stolen time parts.
>>
>> Thanks for posting this.
>>
>> My current concern with this series is around the fact that we allocate
>> memory from the kernel on behalf of the guest. It is the first example
>> of such thing in the ARM port, and I can't really say I'm fond of it.
>>
>> x86 seems to get away with it by having the memory allocated from
>> userspace, which I tend to like more. Yes, put_user is more
>> expensive than a straight store, but this isn't done too often either.
>>
>> What is the rational for your current approach?
>
> As I see it there are 3 approaches that can be taken here:
>
> 1. Hypervisor allocates memory and adds it to the virtual machine. This
> means that everything to do with the 'device' is encapsulated behind the
> KVM_CREATE_DEVICE / KVM_[GS]ET_DEVICE_ATTR ioctls. But since we want the
> stolen time structure to be fast it cannot be a trapping region and has
> to be backed by real memory - in this case allocated by the host kernel.
>
> 2. Host user space allocates memory. Similar to above, but this time
> user space needs to manage the memory region as well as the usual
> KVM_CREATE_DEVICE dance. I've no objection to this, but it means
> kvmtool/QEMU needs to be much more aware of what is going on (e.g. how
> to size the memory region).
You ideally want to keep the host overhead for a VM as low as you
can. I'm not terribly fond of the idea of reserving a full page just
because we're too afraid of having the guest donate memory.
>
> 3. Guest kernel "donates" the memory to the hypervisor for the
> structure. As far as I'm aware this is what x86 does. The problems I see
> this approach are:
>
> a) kexec becomes much more tricky - there needs to be a disabling
> mechanism for the guest to stop the hypervisor scribbling on memory
> before starting the new kernel.
I wouldn't call "quiesce a device" much more tricky. We have to do that
for other devices as well today.
> b) If there is more than one entity that is interested in the
> information (e.g. firmware and kernel) then this requires some form of
> arbitration in the guest because the hypervisor doesn't want to have to
> track an arbitrary number of regions to update.
Why would FW care?
> c) Performance can suffer if the host kernel doesn't have a suitably
> aligned/sized area to use. As you say - put_user() is more expensive.
Just define the interface to always require natural alignment when
donating a memory location?
> The structure is updated on every return to the VM.
If you really do suffer from put_user(), there are alternatives. You
could just map the page on the registration hcall and then leave it
pinned until the vcpu gets destroyed again.
> Of course x86 does prove the third approach can work, but I'm not sure
> which is actually better. Avoiding the kexec cancellation requirements
> was the main driver of the current approach. Although many of the
I really don't understand the problem with kexec cancellation. Worst
case, let guest FW set it up for you and propagate only the address down
via ACPI/DT. That way you can mark the respective memory as reserved too.
But even with a Linux only mechanism, just take a look at
arch/x86/kernel/kvmclock.c. All they do to remove the map is to hook
into machine_crash_shutdown() and machine_shutdown().
Alex
> conversations about this were also tied up with Live Physical Time which
> adds its own complications.
>
> Steve
> _______________________________________________
> kvmarm mailing list
> [email protected]
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>
Hi Steven,
On 2019/8/2 22:50, Steven Price wrote:
> This series add support for paravirtualized time for arm64 guests and
> KVM hosts following the specification in Arm's document DEN 0057A:
>
> https://developer.arm.com/docs/den0057/a
>
> It implements support for stolen time, allowing the guest to
> identify time when it is forcibly not executing.
>
> It doesn't implement support for Live Physical Time (LPT) as there are
> some concerns about the overheads and approach in the above
Do you plan to pick up LPT support? There is demand for cross-frequency migration
(from older platforms to newer platforms).
I am not clear about the overheads and the problems with the approach here - could you
please give some more detail? Maybe we can work together to solve these concerns. :-)
Thanks,
Keqian
> specification, and I expect an updated version of the specification to
> be released soon with just the stolen time parts.
>
> I previously posted a series including LPT (as well as stolen time):
> https://lore.kernel.org/kvmarm/[email protected]/
>
> Patches 2, 5, 7 and 8 are cleanup patches and could be taken separately.
>
> Christoffer Dall (1):
> KVM: arm/arm64: Factor out hypercall handling from PSCI code
>
> Steven Price (8):
> KVM: arm64: Document PV-time interface
> KVM: arm64: Implement PV_FEATURES call
> KVM: arm64: Support stolen time reporting via shared structure
> KVM: Allow kvm_device_ops to be const
> KVM: arm64: Provide a PV_TIME device to user space
> arm/arm64: Provide a wrapper for SMCCC 1.1 calls
> arm/arm64: Make use of the SMCCC 1.1 wrapper
> arm64: Retrieve stolen time as paravirtualized guest
>
> Documentation/virtual/kvm/arm/pvtime.txt | 107 +++++++++++++
> arch/arm/kvm/Makefile | 2 +-
> arch/arm/kvm/handle_exit.c | 2 +-
> arch/arm/mm/proc-v7-bugs.c | 13 +-
> arch/arm64/include/asm/kvm_host.h | 13 +-
> arch/arm64/include/asm/kvm_mmu.h | 2 +
> arch/arm64/include/asm/pvclock-abi.h | 20 +++
> arch/arm64/include/uapi/asm/kvm.h | 6 +
> arch/arm64/kernel/Makefile | 1 +
> arch/arm64/kernel/cpu_errata.c | 80 ++++------
> arch/arm64/kernel/kvm.c | 155 ++++++++++++++++++
> arch/arm64/kvm/Kconfig | 1 +
> arch/arm64/kvm/Makefile | 2 +
> arch/arm64/kvm/handle_exit.c | 4 +-
> include/kvm/arm_hypercalls.h | 44 ++++++
> include/kvm/arm_psci.h | 2 +-
> include/linux/arm-smccc.h | 58 +++++++
> include/linux/cpuhotplug.h | 1 +
> include/linux/kvm_host.h | 4 +-
> include/linux/kvm_types.h | 2 +
> include/uapi/linux/kvm.h | 2 +
> virt/kvm/arm/arm.c | 18 +++
> virt/kvm/arm/hypercalls.c | 138 ++++++++++++++++
> virt/kvm/arm/mmu.c | 44 ++++++
> virt/kvm/arm/psci.c | 84 +---------
> virt/kvm/arm/pvtime.c | 190 +++++++++++++++++++++++
> virt/kvm/kvm_main.c | 6 +-
> 27 files changed, 848 insertions(+), 153 deletions(-)
> create mode 100644 Documentation/virtual/kvm/arm/pvtime.txt
> create mode 100644 arch/arm64/include/asm/pvclock-abi.h
> create mode 100644 arch/arm64/kernel/kvm.c
> create mode 100644 include/kvm/arm_hypercalls.h
> create mode 100644 virt/kvm/arm/hypercalls.c
> create mode 100644 virt/kvm/arm/pvtime.c
>
On 21/07/2020 04:26, zhukeqian wrote:
> Hi Steven,
Hi Keqian,
> On 2019/8/2 22:50, Steven Price wrote:
>> This series add support for paravirtualized time for arm64 guests and
>> KVM hosts following the specification in Arm's document DEN 0057A:
>>
>> https://developer.arm.com/docs/den0057/a
>>
>> It implements support for stolen time, allowing the guest to
>> identify time when it is forcibly not executing.
>>
>> It doesn't implement support for Live Physical Time (LPT) as there are
>> some concerns about the overheads and approach in the above
> Do you plan to pick up LPT support? As there is demand of cross-frequency migration
> (from older platform to newer platform).
I don't have any plans to pick up the LPT support at the moment - feel
free to pick it up! ;)
> I am not clear about the overheads and approach problem here, could you please
> give some detail information? Maybe we can work together to solve these concerns. :-)
Fundamentally the issue here is that LPT only solves one small part of
migration between different hosts. To successfully migrate between hosts
with different CPU implementations it is also necessary to be able to
virtualise various ID registers (e.g. MIDR_EL1, REVIDR_EL1, AIDR_EL1)
which we have no support for currently.
The problem with just virtualising the registers is how you handle
errata. The guest will currently use those (and other) ID registers to
decide whether to enable specific errata workarounds. But what errata
should be enabled for a guest which might migrate to another host?
What we ideally need is a mechanism to communicate to the guest what
workarounds are required to successfully run on any of the hosts that
the guest may be migrated to. You may also have the situation where the
workarounds required for two hosts are mutually incompatible - something
needs to understand this and do the "right thing" (most likely just
reject this situation, i.e. prevent the migration).
There are various options here: e.g. a para-virtualised interface to
describe the workarounds (but this is hard to do in an OS-agnostic way),
or virtual-ID registers describing an idealised environment where no
workarounds are required (and only hosts that have no errata affecting a
guest would be able to provide this).
Given the above complexity and the fact that Armv8.6-A standardises the
frequency to 1GHz this didn't seem worth continuing with. So LPT was
dropped from the spec and patches to avoid holding up the stolen time
support.
However, if you have a use case which doesn't require such a generic
migration (e.g. perhaps old and new platforms are based on the same IP)
then it might be worth looking at bringing this back. But to make the
problem solvable it either needs to be restricted to platforms which are
substantially the same (so the errata list will be identical), or
there's work to be done in preparation to deal with migrating a guest
successfully between hosts with potentially different errata requirements.
Can you share more details about the hosts that you are interested in
migrating between?
Thanks,
Steve
Hi Steven,
On 2020/7/27 18:48, Steven Price wrote:
> On 21/07/2020 04:26, zhukeqian wrote:
>> Hi Steven,
>
> Hi Keqian,
>
>> On 2019/8/2 22:50, Steven Price wrote:
>>> This series add support for paravirtualized time for arm64 guests and
>>> KVM hosts following the specification in Arm's document DEN 0057A:
>>>
>>> https://developer.arm.com/docs/den0057/a
>>>
>>> It implements support for stolen time, allowing the guest to
>>> identify time when it is forcibly not executing.
>>>
>>> It doesn't implement support for Live Physical Time (LPT) as there are
>>> some concerns about the overheads and approach in the above
>> Do you plan to pick up LPT support? As there is demand of cross-frequency migration
>> (from older platform to newer platform).
>
> I don't have any plans to pick up the LPT support at the moment - feel free to pick it up! ;)
>
>> I am not clear about the overheads and approach problem here, could you please
>> give some detail information? Maybe we can work together to solve these concerns. :-)
>
> Fundamentally the issue here is that LPT only solves one small part of migration between different hosts. To successfully migrate between hosts with different CPU implementations it is also necessary to be able to virtualise various ID registers (e.g. MIDR_EL1, REVIDR_EL1, AIDR_EL1) which we have no support for currently.
>
Yeah, currently we are trying to do both timer freq virtualization and CPU feature virtualization.
> The problem with just virtualising the registers is how you handle errata. The guest will currently use those (and other) ID registers to decide whether to enable specific errata workarounds. But what errata should be enabled for a guest which might migrate to another host?
>
Thanks for pointing this out.
I think the most important thing is that we should introduce a concept named CPU baseline which represents a standard platform.
If we bring up a guest with a specific CPU baseline, then this guest can only run on a platform that is compatible with this CPU baseline.
So "baseline" and "compatible" are the key points that make successful cross-platform migration possible.
> What we ideally need is a mechanism to communicate to the guest what workarounds are required to successfully run on any of the hosts that the guest may be migrated to. You may also have the situation where the workarounds required for two hosts are mutually incompatible - something needs to understand this and do the "right thing" (most likely just reject this situation, i.e. prevent the migration).
>
> There are various options here: e.g. a para-virtualised interface to describe the workarounds (but this is hard to do in an OS-agnostic way), or virtual-ID registers describing an idealised environment where no workarounds are required (and only hosts that have no errata affecting a guest would be able to provide this).
>
My idea is similar to the "idealised environment", but errata workarounds still exist.
We would not provide a para-virtualised interface, and migration is restricted to platforms that are compatible with the baseline.
The baseline should have two aspects: CPU features and errata. Platforms that are compatible with a specific baseline should have the corresponding CPU features and errata.
> Given the above complexity and the fact that Armv8.6-A standardises the frequency to 1GHz this didn't seem worth continuing with. So LPT was dropped from the spec and patches to avoid holding up the stolen time support.
>
> However, if you have a use case which doesn't require such a generic migration (e.g. perhaps old and new platforms are based on the same IP) then it might be worth looking at bring this back. But to make the problem solvable it either needs to be restricted to platforms which are substantially the same (so the errata list will be identical), or there's work to be done in preparation to deal with migrating a guest successfully between hosts with potentially different errata requirements.
>
> Can you share more details about the hosts that you are interested in migrating between?
Here we have a new platform with a 1GHz timer, and the old platform is 100MHz, so we want to solve cross-platform migration first.
Thanks,
Keqian
>
> Thanks,
>
> Steve
> .
>