2021-08-25 20:26:42

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling Extension (SPE) support

This is v4 of the SPE series; v3 was posted at [1], v2 can be found at [2],
and the original series at [3].

Statistical Profiling Extension (SPE) is an optional feature added in
ARMv8.2. It allows sampling at regular intervals of the operations executed
by the PE and storing a record of each operation in a memory buffer. A high
level overview of the extension is presented in an article on arm.com [4].

This is another complete rewrite of the series, and nothing is set in
stone. If you think of a better way to do things, please suggest it.


Features added
==============

The rewrite enabled me to add support for several features not
present in the previous iteration:

- Support for heterogeneous systems, where only some of the CPUs support SPE.
This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.

- Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
VCPU ioctl.

- The requirement for userspace to mlock() the guest memory has been removed,
and now userspace can make changes to memory contents after the memory is
mapped at stage 2.

- Better debugging of guest memory pinning by printing a warning when we
get an unexpected read or write fault. This helped me catch several bugs
during development and has already proven very useful. Many thanks to
James, who suggested it when reviewing v3.


Missing features
================

I've tried to keep the series as small as possible to make it easier to review,
while implementing the core functionality needed for the SPE emulation. As such,
I've chosen not to implement several features:

- Host profiling a guest which has the SPE feature bit set (see open
questions).

- No errata workarounds have been implemented yet, and there are quite a few of
them for Neoverse N1 and Neoverse V1.

- Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work, and I am
investigating other ways to get around automatic NUMA balancing, like
requiring userspace to disable it via set_mempolicy(). I am also going to
look at how VFIO gets around it. Suggestions welcome.

- There's plenty of room for optimization. Off the top of my head: using
block mappings at stage 2, batch pinning of pages (similar to what VFIO
does), optimizing the way KVM keeps track of pinned pages (using a linked
list triples the memory usage), context-switching the SPE registers on
vcpu_load/vcpu_put on VHE if the host is not profiling, locking
optimizations, etc.

- ...and others. I'm sure I'm missing at least a few things which are
important for someone.


Known issues
============

This is an RFC, so keep in mind that there will almost certainly be scary
bugs. For example, below is a list of known issues which don't affect the
correctness of the emulation, and which I'm planning to fix in a future
iteration:

- With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
the VCPU executes the dcache clean pending ops.

- With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
VCPUs.

This BUG statement can also be triggered with mainline. To reproduce it,
compile kvmtool from this branch [5] and follow the instructions in the
kvmtool commit message.

One workaround could be to stop trying to lock all VCPUs when locking a
memslot and instead document that no VCPUs must be run before the ioctl
completes, otherwise bad things might happen to the VM.


Open questions
==============

1. Implementing support for host profiling a guest with the SPE feature
means setting the profiling buffer owning regime to EL2. While that is in
effect, PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
probes SPE during this time, the driver will fail; and the guest will be
able to determine when it is profiled. I see two options here:

- Do not allow the host's userspace to profile a guest where at least one
VCPU has SPE.

- Document the effects somewhere and let userspace do whatever it likes.

No preference for either.

2. Userspace is not allowed to profile a CPU event (one not bound to a task)
if !perf_allow_cpu(). My understanding is that this is for security
reasons, as we don't want a task to profile another task. Because a VM
will only be able to profile itself, I don't think it's necessary to
restrict the VM in any way based on perf_allow_cpu(), like we do with
perfmon_capable() and physical timer timestamps.

3. How to handle guest-triggered SPE SErrors. Right now the Linux SPE driver
doesn't do anything special for SErrors reported by SPE, and I've done the same when
a guest manages to trigger one. Should I do something more? Disabling SPE
emulation for the entire VM is one option.


Summary of the patches
======================

Below is a short summary of the patches. For a more detailed explanation of
how SPE works, please see version 3 of the series [1].

Patches 1-11 implement the memslot locking functionality.
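
As a rough illustration of how userspace is expected to drive the locking
API (a minimal sketch, not taken from the series: I'm assuming that args[0]
carries the memslot id and args[1] the lock flags, and that a read flag
named KVM_ARM_LOCK_MEM_READ exists alongside the KVM_ARM_LOCK_MEM_WRITE
flag used in patch 3; see the api.rst changes for the authoritative layout):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hedged sketch: lock a memslot before any VCPU has run. */
static void lock_memslot_example(int vm_fd, __u64 slot_id)
{
	struct kvm_enable_cap cap = {
		.cap  = KVM_CAP_ARM_LOCK_USER_MEMORY_REGION,
		/* Assumed layout: args[0] = memslot, args[1] = lock flags. */
		.args = { slot_id,
			  KVM_ARM_LOCK_MEM_READ | KVM_ARM_LOCK_MEM_WRITE },
	};

	/* VM-scoped ioctl; the pinned pages count towards RLIMIT_MEMLOCK. */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
		err(1, "KVM_ENABLE_CAP(ARM_LOCK_USER_MEMORY_REGION)");
}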

Patch 12 implements the KVM_ARM_VCPU_SUPPORTED_CPUS ioctl.

Patches 13-14 are preparatory for KVM SPE.

Patches 15-19 make it possible for KVM to refuse to run an SPE-enabled VCPU
on CPUs which don't have the SPE hardware.

Patches 20-22 add the userspace interface to configure SPE.
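
For reference, a minimal sketch of the configuration sequence, pieced
together from the vcpu.rst changes in patches 21 and 22 (the PPI number is
illustrative, and I'm assuming the usual KVM_SET_DEVICE_ATTR VCPU ioctl is
the transport, as for the other KVM_ARM_VCPU_*_CTRL groups):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hedged sketch: configure SPE for a VCPU before it is first run. */
static void configure_spe_example(int vcpu_fd)
{
	int spe_ppi = 21;	/* illustrative; must be a PPI, identical on all VCPUs */
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_SPE_CTRL,
		.attr  = KVM_ARM_VCPU_SPE_IRQ,
		.addr  = (__u64)(unsigned long)&spe_ppi,
	};

	/* Set the Profiling Buffer management interrupt number. */
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		err(1, "KVM_ARM_VCPU_SPE_IRQ");

	/* Finalize SPE; must happen after the vGIC has been initialized. */
	attr.attr = KVM_ARM_VCPU_SPE_INIT;
	attr.addr = 0;
	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		err(1, "KVM_ARM_VCPU_SPE_INIT");
}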

Patches 23-32 implement context switching of the SPE registers.

Patch 33 allows a guest to use physical timestamps only if the VMM is
perfmon_capable().

Patch 34 adds the emulation of the SPE buffer management interrupt.

Patches 35-37 add the userspace API to temporarily stop profiling so
memory can be unlocked and the VM migrated.
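
A sketch of the intended migration flow (heavily hedged: I'm assuming the
stop command, e.g. KVM_ARM_VCPU_SPE_STOP_TRAP, is passed as an int through
kvm_device_attr.addr; patches 35-37 define the actual encoding):

#include <err.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hedged sketch: stop profiling so memory can be unlocked for migration. */
static void stop_spe_example(int vcpu_fd)
{
	int cmd = KVM_ARM_VCPU_SPE_STOP_TRAP;	/* assumed int payload */
	struct kvm_device_attr attr = {
		.group = KVM_ARM_VCPU_SPE_CTRL,
		.attr  = KVM_ARM_VCPU_SPE_STOP,
		.addr  = (__u64)(unsigned long)&cmd,
	};

	if (ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr) < 0)
		err(1, "KVM_ARM_VCPU_SPE_STOP");

	/*
	 * Memslots can now be unlocked and the VM migrated; profiling is
	 * resumed on the destination with KVM_ARM_VCPU_SPE_STOP_RESUME.
	 */
}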

Patch 38 is a hack to get KVM SPE emulation going on NUMA systems (like the
Altra server which I used for testing).

Patch 39 finally enables SPE.


Testing
=======

Testing was done on an Altra server with two sockets, both populated.

The Linux patches are based on v5.14-rc5 and can also be found on gitlab
[6].

For testing, I've used an SPE enabled version of kvmtool, which can be
found at [7]; the kvmtool patches will also be sent upstream. To test the
SPE_STOP API, I used a special version of kvmtool which starts the guest in
one of the stopped states; that can be found at [8] (compile from a
different commit if a different state and/or transition is desired).

Finally, in the VM I used a defconfig Linux guest compiled from v5.15-rc5 and
some kvm-unit-tests patches which I wrote to test SPE [9].

All tests were run three times: once with VHE enabled, once in NVHE mode
(kvm-arm.mode=nvhe) and once in protected mode (kvm-arm.mode=protected).

The first test that I ran was the kvm-unit-tests SPE test. This is also the
test that I used to check that the KVM_ARM_VCPU_SPE_STOP_{TRAP,EXIT,RESUME}
states work correctly with kvmtool.

Then I profiled iperf3 in the guest (16 VCPUs to limit the size of perf.data,
32GiB memory), while concurrently profiling in the host. This is the command
that I used:

# perf record -ae arm_spe/ts_enable=1,pa_enable=1,pct_enable=1/ -- iperf3 -c 127.0.0.1 -t 30

Everything looked right to me and I didn't see any kernel warnings or bugs.

[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
[2] https://www.spinics.net/lists/arm-kernel/msg776228.html
[3] https://lists.cs.columbia.edu/pipermail/kvmarm/2019-February/034887.html
[4] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/statistical-profiling-extension-for-armv8-a
[5] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/vgic-lock-all-vcpus-lockdep-bug-v1
[6] https://gitlab.arm.com/linux-arm/linux-ae/-/tree/kvm-spe-v4
[7] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/kvm-spe-v4
[8] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/kvm-spe-v4-spe-stop-tests
[9] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/kvm-spe-v4

Alexandru Elisei (35):
KVM: arm64: Make lock_all_vcpus() available to the rest of KVM
KVM: arm64: Add lock/unlock memslot user API
KVM: arm64: Implement the memslot lock/unlock functionality
KVM: arm64: Defer CMOs for locked memslots until a VCPU is run
KVM: arm64: Perform CMOs on locked memslots when userspace resets
VCPUs
KVM: arm64: Delay tag scrubbing for locked memslots until a VCPU runs
KVM: arm64: Unlock memslots after stage 2 tables are freed
KVM: arm64: Deny changes to locked memslots
KVM: Add kvm_warn{,_ratelimited} macros
KVM: arm64: Print a warning for unexpected faults on locked memslots
KVM: arm64: Allow userspace to lock and unlock memslots
KVM: arm64: Add the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl
KVM: arm64: Add CONFIG_KVM_ARM_SPE Kconfig option
KVM: arm64: Add SPE capability and VCPU feature
drivers/perf: Expose the cpumask of CPUs that support SPE
KVM: arm64: Make SPE available when at least one CPU supports it
KVM: arm64: Set the VCPU SPE feature bit when SPE is available
KVM: arm64: Expose SPE version to guests
KVM: arm64: Do not emulate SPE on CPUs which don't have SPE
KVM: arm64: debug: Configure MDCR_EL2 when a VCPU has SPE
KVM: arm64: Move the write to MDCR_EL2 out of
__activate_traps_common()
KVM: arm64: VHE: Change MDCR_EL2 at world switch if VCPU has SPE
KVM: arm64: Add SPE system registers to VCPU context
KVM: arm64: nVHE: Save PMSCR_EL1 to the host context
KVM: arm64: Rename DEBUG_STATE_SAVE_SPE -> DEBUG_SAVE_SPE_BUFFER flags
KVM: arm64: nVHE: Context switch SPE state if VCPU has SPE
KVM: arm64: VHE: Context switch SPE state if VCPU has SPE
KVM: arm64: Save/restore PMSNEVFR_EL1 on VCPU put/load
KVM: arm64: Allow guest to use physical timestamps if
perfmon_capable()
KVM: arm64: Emulate SPE buffer management interrupt
KVM: arm64: Add an userspace API to stop a VCPU profiling
KVM: arm64: Implement userspace API to stop a VCPU profiling
KVM: arm64: Add PMSIDR_EL1 to the SPE register context
KVM: arm64: Make CONFIG_KVM_ARM_SPE depend on !CONFIG_NUMA_BALANCING
KVM: arm64: Allow userspace to enable SPE for guests

Sudeep Holla (4):
KVM: arm64: Add a new VCPU device control group for SPE
KVM: arm64: Add SPE VCPU device attribute to set the interrupt number
KVM: arm64: Add SPE VCPU device attribute to initialize SPE
KVM: arm64: VHE: Clear MDCR_EL2.E2PB in vcpu_put()

Documentation/virt/kvm/api.rst | 87 ++++-
Documentation/virt/kvm/devices/vcpu.rst | 76 ++++
arch/arm64/include/asm/kvm_arm.h | 1 +
arch/arm64/include/asm/kvm_host.h | 75 +++-
arch/arm64/include/asm/kvm_hyp.h | 51 ++-
arch/arm64/include/asm/kvm_mmu.h | 8 +
arch/arm64/include/asm/kvm_spe.h | 96 ++++++
arch/arm64/include/asm/sysreg.h | 3 +
arch/arm64/include/uapi/asm/kvm.h | 11 +
arch/arm64/kvm/Kconfig | 8 +
arch/arm64/kvm/Makefile | 1 +
arch/arm64/kvm/arm.c | 140 +++++++-
arch/arm64/kvm/debug.c | 55 ++-
arch/arm64/kvm/guest.c | 10 +
arch/arm64/kvm/hyp/include/hyp/spe-sr.h | 32 ++
arch/arm64/kvm/hyp/include/hyp/switch.h | 1 -
arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 24 +-
arch/arm64/kvm/hyp/nvhe/spe-sr.c | 133 +++++++
arch/arm64/kvm/hyp/nvhe/switch.c | 35 +-
arch/arm64/kvm/hyp/vhe/Makefile | 1 +
arch/arm64/kvm/hyp/vhe/spe-sr.c | 193 +++++++++++
arch/arm64/kvm/hyp/vhe/switch.c | 43 ++-
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 2 +-
arch/arm64/kvm/mmu.c | 441 +++++++++++++++++++++++-
arch/arm64/kvm/reset.c | 23 ++
arch/arm64/kvm/spe.c | 381 ++++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 77 ++++-
arch/arm64/kvm/vgic/vgic-init.c | 4 +-
arch/arm64/kvm/vgic/vgic-its.c | 8 +-
arch/arm64/kvm/vgic/vgic-kvm-device.c | 50 +--
arch/arm64/kvm/vgic/vgic.h | 3 -
drivers/perf/arm_spe_pmu.c | 30 +-
include/linux/kvm_host.h | 4 +
include/linux/perf/arm_pmu.h | 7 +
include/uapi/linux/kvm.h | 13 +
36 files changed, 1983 insertions(+), 145 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_spe.h
create mode 100644 arch/arm64/kvm/hyp/include/hyp/spe-sr.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/spe-sr.c
create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c
create mode 100644 arch/arm64/kvm/spe.c

--
2.33.0


2021-08-25 20:28:08

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 04/39] KVM: arm64: Defer CMOs for locked memslots until a VCPU is run

KVM relies on doing dcache maintenance on stage 2 faults to present to a
gueste running with the MMU off the same view of memory as userspace. For
locked memslots, KVM so far has done the dcache maintenance when a memslot
is locked, but that leaves KVM in a rather awkward position: what userspace
writes to guest memory after the memslot is locked, but before a VCPU is
run, might not be visible to the guest.

Fix this by deferring the dcache maintenance until the first VCPU is run.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 7 ++++
arch/arm64/include/asm/kvm_mmu.h | 5 +++
arch/arm64/kvm/arm.c | 3 ++
arch/arm64/kvm/mmu.c | 56 ++++++++++++++++++++++++++++---
4 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 97ff3ed5d4b7..ed67f914d169 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -112,6 +112,10 @@ struct kvm_arch_memory_slot {
u32 flags;
};

+/* kvm->arch.mmu_pending_ops flags */
+#define KVM_LOCKED_MEMSLOT_FLUSH_DCACHE 0
+#define KVM_MAX_MMU_PENDING_OPS 1
+
struct kvm_arch {
struct kvm_s2_mmu mmu;

@@ -135,6 +139,9 @@ struct kvm_arch {
*/
bool return_nisv_io_abort_to_user;

+ /* Defer MMU operations until a VCPU is run. */
+ unsigned long mmu_pending_ops;
+
/*
* VM-wide PMU filter, implemented as a bitmap and big enough for
* up to 2^10 events (ARMv8.0) or 2^16 events (ARMv8.1+).
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index ef079b5eb475..525c223e769f 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -219,6 +219,11 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags);
int kvm_mmu_unlock_memslot(struct kvm *kvm, u64 slot, u64 flags);

+#define kvm_mmu_has_pending_ops(kvm) \
+ (!bitmap_empty(&(kvm)->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS))
+
+void kvm_mmu_perform_pending_ops(struct kvm *kvm);
+
static inline unsigned int kvm_get_vmid_bits(void)
{
int reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index efb3501c6016..144c982912d8 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -829,6 +829,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
if (unlikely(!kvm_vcpu_initialized(vcpu)))
return -ENOEXEC;

+ if (unlikely(kvm_mmu_has_pending_ops(vcpu->kvm)))
+ kvm_mmu_perform_pending_ops(vcpu->kvm);
+
ret = kvm_vcpu_first_run_init(vcpu);
if (ret)
return ret;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 59c2bfef2fd1..94fa08f3d9d3 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1253,6 +1253,41 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
return ret;
}

+/*
+ * It's safe to do the CMOs when the first VCPU is run because:
+ * - VCPUs cannot run until mmu_pending_ops is cleared.
+ * - Memslots cannot be modified because we hold the kvm->slots_lock.
+ *
+ * It's safe to periodically release the mmu_lock because:
+ * - VCPUs cannot run.
+ * - Any changes to the stage 2 tables triggered by the MMU notifiers also take
+ * the mmu_lock, which means accesses will be serialized.
+ * - Stage 2 tables cannot be freed from under us as long as at least one VCPU
+ * is live, which means that the VM will be live.
+ */
+void kvm_mmu_perform_pending_ops(struct kvm *kvm)
+{
+ struct kvm_memory_slot *memslot;
+
+ mutex_lock(&kvm->slots_lock);
+ if (!kvm_mmu_has_pending_ops(kvm))
+ goto out_unlock;
+
+ if (test_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops)) {
+ kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
+ if (!memslot_is_locked(memslot))
+ continue;
+ stage2_flush_memslot(kvm, memslot);
+ }
+ }
+
+ bitmap_zero(&kvm->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS);
+
+out_unlock:
+ mutex_unlock(&kvm->slots_lock);
+ return;
+}
+
static int try_rlimit_memlock(unsigned long npages)
{
unsigned long lock_limit;
@@ -1293,7 +1328,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
struct kvm_memory_slot_page *page_entry;
bool writable = flags & KVM_ARM_LOCK_MEM_WRITE;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
- struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
+ struct kvm_pgtable pgt;
+ struct kvm_pgtable_mm_ops mm_ops;
struct vm_area_struct *vma;
unsigned long npages = memslot->npages;
unsigned int pin_flags = FOLL_LONGTERM;
@@ -1311,6 +1347,16 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
pin_flags |= FOLL_WRITE;
}

+ /*
+ * Make a copy of the stage 2 translation table struct to remove the
+ * dcache callback so we can postpone the cache maintenance operations
+ * until the first VCPU is run.
+ */
+ mm_ops = *kvm->arch.mmu.pgt->mm_ops;
+ mm_ops.dcache_clean_inval_poc = NULL;
+ pgt = *kvm->arch.mmu.pgt;
+ pgt.mm_ops = &mm_ops;
+
hva = memslot->userspace_addr;
ipa = memslot->base_gfn << PAGE_SHIFT;

@@ -1362,13 +1408,13 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
goto out_err;
}

- ret = kvm_pgtable_stage2_map(pgt, ipa, PAGE_SIZE,
+ ret = kvm_pgtable_stage2_map(&pgt, ipa, PAGE_SIZE,
page_to_phys(page_entry->page),
prot, &cache);
spin_unlock(&kvm->mmu_lock);

if (ret) {
- kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
+ kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
i << PAGE_SHIFT);
unpin_memslot_pages(memslot, writable);
goto out_err;
@@ -1387,7 +1433,7 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
*/
ret = account_locked_vm(current->mm, npages, true);
if (ret) {
- kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
+ kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
npages << PAGE_SHIFT);
unpin_memslot_pages(memslot, writable);
goto out_err;
@@ -1397,6 +1443,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (writable)
memslot->arch.flags |= KVM_MEMSLOT_LOCK_WRITE;

+ set_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops);
+
kvm_mmu_free_memory_cache(&cache);

return 0;
--
2.33.0

2021-08-25 20:28:08

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 03/39] KVM: arm64: Implement the memslot lock/unlock functionality

Pin memory in the process address space and map it in the stage 2 tables as
a result of userspace enabling the KVM_CAP_ARM_LOCK_USER_MEMORY_REGION
capability; and unpin it from the process address space when the capability
is used with the KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_UNLOCK flag.

The current implementation has two drawbacks which will be fixed in future
patches:

- The dcache maintenance is done when the memslot is locked, which means
that it is possible that memory changes made by userspace after the ioctl
completes won't be visible to a guest running with the MMU off.

- Tag scrubbing is done when the memslot is locked. If the MTE capability
is enabled after the ioctl, the guest will be able to access unsanitised
pages. This is prevented by forbidding userspace to enable the MTE
capability if any memslots are locked.

Only PAGE_SIZE mappings are supported at stage 2.

Signed-off-by: Alexandru Elisei <[email protected]>
---
Documentation/virt/kvm/api.rst | 8 +-
arch/arm64/include/asm/kvm_host.h | 11 ++
arch/arm64/kvm/arm.c | 22 ++++
arch/arm64/kvm/mmu.c | 211 +++++++++++++++++++++++++++++-
4 files changed, 245 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 741327ef06b0..5aa251df7077 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6717,10 +6717,10 @@ mapped at stage 2. The permissions specified in args[1] apply to both
mappings. The memory pinned with this capability counts towards the max
locked memory limit for the current process.

-The capability must be enabled before any VCPUs have run. The virtual memory
-range described by the memslot must be mapped in the userspace process without
-any gaps. It is considered an error if write permissions are specified for a
-memslot which logs dirty pages.
+The capability must be enabled before any VCPUs have run. The entire virtual
+memory range described by the memslot must be mapped by the userspace process.
+It is considered an error if write permissions are specified for a memslot which
+logs dirty pages.

7.29.2 KVM_ARM_LOCK_USER_MEMORY_REGION_FLAGS_UNLOCK
---------------------------------------------------
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 797083203603..97ff3ed5d4b7 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -98,7 +98,18 @@ struct kvm_s2_mmu {
struct kvm_arch *arch;
};

+#define KVM_MEMSLOT_LOCK_READ (1 << 0)
+#define KVM_MEMSLOT_LOCK_WRITE (1 << 1)
+#define KVM_MEMSLOT_LOCK_MASK 0x3
+
+struct kvm_memory_slot_page {
+ struct list_head list;
+ struct page *page;
+};
+
struct kvm_arch_memory_slot {
+ struct kvm_memory_slot_page pages;
+ u32 flags;
};

struct kvm_arch {
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 57ac97b30b3d..efb3501c6016 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -108,6 +108,25 @@ static int kvm_lock_user_memory_region_ioctl(struct kvm *kvm,
}
}

+static bool kvm_arm_has_locked_memslots(struct kvm *kvm)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ struct kvm_memory_slot *memslot;
+ bool has_locked_memslots = false;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+ kvm_for_each_memslot(memslot, slots) {
+ if (memslot->arch.flags & KVM_MEMSLOT_LOCK_MASK) {
+ has_locked_memslots = true;
+ break;
+ }
+ }
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return has_locked_memslots;
+}
+
int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
struct kvm_enable_cap *cap)
{
@@ -123,6 +142,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
case KVM_CAP_ARM_MTE:
if (!system_supports_mte() || kvm->created_vcpus)
return -EINVAL;
+ if (kvm_arm_lock_memslot_supported() &&
+ kvm_arm_has_locked_memslots(kvm))
+ return -EPERM;
r = 0;
kvm->arch.mte_enabled = true;
break;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 689b24cb0f10..59c2bfef2fd1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -72,6 +72,11 @@ static bool memslot_is_logging(struct kvm_memory_slot *memslot)
return memslot->dirty_bitmap && !(memslot->flags & KVM_MEM_READONLY);
}

+static bool memslot_is_locked(struct kvm_memory_slot *memslot)
+{
+ return memslot->arch.flags & KVM_MEMSLOT_LOCK_MASK;
+}
+
/**
* kvm_flush_remote_tlbs() - flush all VM TLB entries for v7/8
* @kvm: pointer to kvm structure.
@@ -722,6 +727,10 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
if (map_size == PAGE_SIZE)
return true;

+ /* Allow only PAGE_SIZE mappings for locked memslots */
+ if (memslot_is_locked(memslot))
+ return false;
+
size = memslot->npages * PAGE_SIZE;

gpa_start = memslot->base_gfn << PAGE_SHIFT;
@@ -1244,6 +1253,159 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
return ret;
}

+static int try_rlimit_memlock(unsigned long npages)
+{
+ unsigned long lock_limit;
+ bool has_lock_cap;
+ int ret = 0;
+
+ has_lock_cap = capable(CAP_IPC_LOCK);
+ if (has_lock_cap)
+ goto out;
+
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+ mmap_read_lock(current->mm);
+ if (npages + current->mm->locked_vm > lock_limit)
+ ret = -ENOMEM;
+ mmap_read_unlock(current->mm);
+
+out:
+ return ret;
+}
+
+static void unpin_memslot_pages(struct kvm_memory_slot *memslot, bool writable)
+{
+ struct kvm_memory_slot_page *entry, *tmp;
+
+ list_for_each_entry_safe(entry, tmp, &memslot->arch.pages.list, list) {
+ if (writable)
+ set_page_dirty_lock(entry->page);
+ unpin_user_page(entry->page);
+ kfree(entry);
+ }
+}
+
+static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
+ u64 flags)
+{
+ struct kvm_mmu_memory_cache cache = { 0, __GFP_ZERO, NULL, };
+ struct kvm_memory_slot_page *page_entry;
+ bool writable = flags & KVM_ARM_LOCK_MEM_WRITE;
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
+ struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
+ struct vm_area_struct *vma;
+ unsigned long npages = memslot->npages;
+ unsigned int pin_flags = FOLL_LONGTERM;
+ unsigned long i, hva, ipa, mmu_seq;
+ int ret;
+
+ ret = try_rlimit_memlock(npages);
+ if (ret)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&memslot->arch.pages.list);
+
+ if (writable) {
+ prot |= KVM_PGTABLE_PROT_W;
+ pin_flags |= FOLL_WRITE;
+ }
+
+ hva = memslot->userspace_addr;
+ ipa = memslot->base_gfn << PAGE_SHIFT;
+
+ mmu_seq = kvm->mmu_notifier_seq;
+ smp_rmb();
+
+ for (i = 0; i < npages; i++) {
+ page_entry = kzalloc(sizeof(*page_entry), GFP_KERNEL);
+ if (!page_entry) {
+ unpin_memslot_pages(memslot, writable);
+ ret = -ENOMEM;
+ goto out_err;
+ }
+
+ mmap_read_lock(current->mm);
+ ret = pin_user_pages(hva, 1, pin_flags, &page_entry->page, &vma);
+ if (ret != 1) {
+ mmap_read_unlock(current->mm);
+ unpin_memslot_pages(memslot, writable);
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ if (kvm_has_mte(kvm)) {
+ if (vma->vm_flags & VM_SHARED) {
+ ret = -EFAULT;
+ } else {
+ ret = sanitise_mte_tags(kvm,
+ page_to_pfn(page_entry->page),
+ PAGE_SIZE);
+ }
+ if (ret) {
+ mmap_read_unlock(current->mm);
+ goto out_err;
+ }
+ }
+ mmap_read_unlock(current->mm);
+
+ ret = kvm_mmu_topup_memory_cache(&cache, kvm_mmu_cache_min_pages(kvm));
+ if (ret) {
+ unpin_memslot_pages(memslot, writable);
+ goto out_err;
+ }
+
+ spin_lock(&kvm->mmu_lock);
+ if (mmu_notifier_retry(kvm, mmu_seq)) {
+ spin_unlock(&kvm->mmu_lock);
+ unpin_memslot_pages(memslot, writable);
+ ret = -EAGAIN;
+ goto out_err;
+ }
+
+ ret = kvm_pgtable_stage2_map(pgt, ipa, PAGE_SIZE,
+ page_to_phys(page_entry->page),
+ prot, &cache);
+ spin_unlock(&kvm->mmu_lock);
+
+ if (ret) {
+ kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
+ i << PAGE_SHIFT);
+ unpin_memslot_pages(memslot, writable);
+ goto out_err;
+ }
+ list_add(&page_entry->list, &memslot->arch.pages.list);
+
+ hva += PAGE_SIZE;
+ ipa += PAGE_SIZE;
+ }
+
+
+ /*
+ * Even though we've checked the limit at the start, we can still exceed
+ * it if userspace locked other pages in the meantime or if the
+ * CAP_IPC_LOCK capability has been revoked.
+ */
+ ret = account_locked_vm(current->mm, npages, true);
+ if (ret) {
+ kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
+ npages << PAGE_SHIFT);
+ unpin_memslot_pages(memslot, writable);
+ goto out_err;
+ }
+
+ memslot->arch.flags = KVM_MEMSLOT_LOCK_READ;
+ if (writable)
+ memslot->arch.flags |= KVM_MEMSLOT_LOCK_WRITE;
+
+ kvm_mmu_free_memory_cache(&cache);
+
+ return 0;
+
+out_err:
+ kvm_mmu_free_memory_cache(&cache);
+ return ret;
+}
+
int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags)
{
struct kvm_memory_slot *memslot;
@@ -1283,7 +1445,12 @@ int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags)
goto out_unlock_slots;
}

- ret = -EINVAL;
+ if (memslot_is_locked(memslot)) {
+ ret = -EBUSY;
+ goto out_unlock_slots;
+ }
+
+ ret = lock_memslot(kvm, memslot, flags);

out_unlock_slots:
mutex_unlock(&kvm->slots_lock);
@@ -1294,13 +1461,51 @@ int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags)
return ret;
}

+static int unlock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
+{
+ bool writable = memslot->arch.flags & KVM_MEMSLOT_LOCK_WRITE;
+ unsigned long npages = memslot->npages;
+
+ unpin_memslot_pages(memslot, writable);
+ account_locked_vm(current->mm, npages, false);
+
+ memslot->arch.flags &= ~KVM_MEMSLOT_LOCK_MASK;
+
+ return 0;
+}
+
+static int unlock_all_memslots(struct kvm *kvm)
+{
+ struct kvm_memory_slot *memslot;
+ int err, ret;
+
+ mutex_lock(&kvm->slots_lock);
+
+ ret = 0;
+ kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
+ if (!memslot_is_locked(memslot))
+ continue;
+ err = unlock_memslot(kvm, memslot);
+ if (err) {
+ kvm_err("error unlocking memslot %u: %d\n",
+ memslot->id, err);
+ /* Continue to try unlocking the rest of the slots */
+ ret = err;
+ }
+ }
+
+ mutex_unlock(&kvm->slots_lock);
+
+ return ret;
+}
+
int kvm_mmu_unlock_memslot(struct kvm *kvm, u64 slot, u64 flags)
{
struct kvm_memory_slot *memslot;
int ret;

if (flags & KVM_ARM_UNLOCK_MEM_ALL)
- return -EINVAL;
+ return unlock_all_memslots(kvm);

if (slot >= KVM_MEM_SLOTS_NUM)
return -EINVAL;
@@ -1313,7 +1518,7 @@ int kvm_mmu_unlock_memslot(struct kvm *kvm, u64 slot, u64 flags)
goto out_unlock_slots;
}

- ret = -EINVAL;
+ ret = unlock_memslot(kvm, memslot);

out_unlock_slots:
mutex_unlock(&kvm->slots_lock);
--
2.33.0

2021-08-25 20:28:46

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 16/39] KVM: arm64: Make SPE available when at least one CPU supports it

KVM SPE emulation requires at least one physical CPU to support SPE.
Initialize the cpumask of PEs that support SPE the first time
KVM_CAP_ARM_SPE is queried or when the first virtual machine is created,
whichever comes first.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_spe.h | 26 ++++++++++++++++++++++++++
arch/arm64/kvm/Makefile | 1 +
arch/arm64/kvm/arm.c | 4 ++++
arch/arm64/kvm/spe.c | 32 ++++++++++++++++++++++++++++++++
4 files changed, 63 insertions(+)
create mode 100644 arch/arm64/include/asm/kvm_spe.h
create mode 100644 arch/arm64/kvm/spe.c

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
new file mode 100644
index 000000000000..328115ce0b48
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ */
+
+#ifndef __ARM64_KVM_SPE_H__
+#define __ARM64_KVM_SPE_H__
+
+#ifdef CONFIG_KVM_ARM_SPE
+DECLARE_STATIC_KEY_FALSE(kvm_spe_available);
+
+static __always_inline bool kvm_supports_spe(void)
+{
+ return static_branch_likely(&kvm_spe_available);
+}
+
+void kvm_spe_init_supported_cpus(void);
+void kvm_spe_vm_init(struct kvm *kvm);
+#else
+#define kvm_supports_spe() (false)
+
+static inline void kvm_spe_init_supported_cpus(void) {}
+static inline void kvm_spe_vm_init(struct kvm *kvm) {}
+#endif /* CONFIG_KVM_ARM_SPE */
+
+#endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..86092a0f8367 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -25,3 +25,4 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
vgic/vgic-its.o vgic/vgic-debug.o

kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o
+kvm-$(CONFIG_KVM_ARM_SPE) += spe.o
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 22544eb367f3..82cb7b5b3b45 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -35,6 +35,7 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_asm.h>
#include <asm/kvm_mmu.h>
+#include <asm/kvm_spe.h>
#include <asm/kvm_emulate.h>
#include <asm/sections.h>

@@ -180,6 +181,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)

set_default_spectre(kvm);

+ kvm_spe_vm_init(kvm);
+
return ret;
out_free_stage2_pgd:
kvm_free_stage2_pgd(&kvm->arch.mmu);
@@ -305,6 +308,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = 1;
break;
case KVM_CAP_ARM_SPE:
+ kvm_spe_init_supported_cpus();
r = 0;
break;
default:
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
new file mode 100644
index 000000000000..83f92245f881
--- /dev/null
+++ b/arch/arm64/kvm/spe.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ */
+
+#include <linux/cpumask.h>
+#include <linux/kvm_host.h>
+#include <linux/perf/arm_pmu.h>
+
+#include <asm/kvm_spe.h>
+
+DEFINE_STATIC_KEY_FALSE(kvm_spe_available);
+
+static const cpumask_t *supported_cpus;
+
+void kvm_spe_init_supported_cpus(void)
+{
+ if (likely(supported_cpus))
+ return;
+
+ supported_cpus = arm_spe_pmu_supported_cpus();
+ BUG_ON(!supported_cpus);
+
+ if (!cpumask_empty(supported_cpus))
+ static_branch_enable(&kvm_spe_available);
+}
+
+void kvm_spe_vm_init(struct kvm *kvm)
+{
+ /* Set supported_cpus if it isn't already initialized. */
+ kvm_spe_init_supported_cpus();
+}
--
2.33.0

2021-08-25 20:28:46

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 21/39] KVM: arm64: Add SPE VCPU device attribute to set the interrupt number

From: Sudeep Holla <[email protected]>

Add KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_IRQ) to allow the user to set
the interrupt number for the buffer management interrupt.

[ Alexandru E: Split from "KVM: arm64: Add a new VCPU device control group
for SPE" ]

Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
---
Documentation/virt/kvm/devices/vcpu.rst | 19 ++++++
arch/arm64/include/asm/kvm_host.h | 2 +
arch/arm64/include/asm/kvm_spe.h | 10 +++
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/spe.c | 86 +++++++++++++++++++++++++
5 files changed, 118 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 85399c005197..05821d40849f 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -166,3 +166,22 @@ including the layout of the stolen time structure.
===============================

:Architectures: ARM64
+
+4.1 ATTRIBUTE: KVM_ARM_VCPU_SPE_IRQ
+-----------------------------------
+
+:Parameters: in kvm_device_attr.addr the address for the Profiling Buffer
+ management interrupt is a pointer to an int
+
+Returns:
+
+ ======= ======================================================
+ -EBUSY Interrupt number already set for this VCPU
+ -EFAULT Error accessing the buffer management interrupt number
+ -EINVAL Invalid interrupt number
+ -ENXIO SPE not supported or not properly configured
+ ======= ======================================================
+
+Specifies the Profiling Buffer management interrupt number. The interrupt must
+be a PPI and must be the same for each VCPU. SPE emulation requires an
+in-kernel vGIC implementation.
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 948adb152104..7b957e439b3d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
#include <asm/fpsimd.h>
#include <asm/kvm.h>
#include <asm/kvm_asm.h>
+#include <asm/kvm_spe.h>
#include <asm/thread_info.h>

#define __KVM_HAVE_ARCH_INTC_INITIALIZED
@@ -353,6 +354,7 @@ struct kvm_vcpu_arch {
struct vgic_cpu vgic_cpu;
struct arch_timer_cpu timer_cpu;
struct kvm_pmu pmu;
+ struct kvm_vcpu_spe spe;

/*
* Anything that is not used directly from assembly code goes
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index ce0f5b3f2027..2fe11868719d 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -6,6 +6,8 @@
#ifndef __ARM64_KVM_SPE_H__
#define __ARM64_KVM_SPE_H__

+#include <linux/kvm.h>
+
#ifdef CONFIG_KVM_ARM_SPE
DECLARE_STATIC_KEY_FALSE(kvm_spe_available);

@@ -14,6 +16,11 @@ static __always_inline bool kvm_supports_spe(void)
return static_branch_likely(&kvm_spe_available);
}

+struct kvm_vcpu_spe {
+ bool initialized; /* SPE initialized for the VCPU */
+ int irq_num; /* Buffer management interrupt number */
+};
+
void kvm_spe_init_supported_cpus(void);
void kvm_spe_vm_init(struct kvm *kvm);
int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu);
@@ -24,6 +31,9 @@ int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
#else
#define kvm_supports_spe() (false)

+struct kvm_vcpu_spe {
+};
+
static inline void kvm_spe_init_supported_cpus(void) {}
static inline void kvm_spe_vm_init(struct kvm *kvm) {}
static inline int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu) { return -ENOEXEC; }
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 7159a1e23da2..c55d94a1a8f5 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -369,6 +369,7 @@ struct kvm_arm_copy_mte_tags {
#define KVM_ARM_VCPU_PVTIME_CTRL 2
#define KVM_ARM_VCPU_PVTIME_IPA 0
#define KVM_ARM_VCPU_SPE_CTRL 3
+#define KVM_ARM_VCPU_SPE_IRQ 0

/* KVM_IRQ_LINE irq field index values */
#define KVM_ARM_IRQ_VCPU2_SHIFT 28
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 56a3fdb35623..2fdb42e27675 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -43,17 +43,103 @@ int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu)
return 0;
}

+static bool kvm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_supports_spe())
+ return false;
+
+ if (!kvm_vcpu_has_spe(vcpu))
+ return false;
+
+ if (!irqchip_in_kernel(vcpu->kvm))
+ return false;
+
+ return true;
+}
+
+static bool kvm_spe_irq_is_valid(struct kvm *kvm, int irq)
+{
+ struct kvm_vcpu *vcpu;
+ int i;
+
+ if (!irq_is_ppi(irq))
+ return false;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!vcpu->arch.spe.irq_num)
+ continue;
+
+ if (vcpu->arch.spe.irq_num != irq)
+ return false;
+ }
+
+ return true;
+}
+
int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
{
+ if (!kvm_vcpu_supports_spe(vcpu))
+ return -ENXIO;
+
+ if (vcpu->arch.spe.initialized)
+ return -EBUSY;
+
+ switch (attr->attr) {
+ case KVM_ARM_VCPU_SPE_IRQ: {
+ int __user *uaddr = (int __user *)(long)attr->addr;
+ int irq;
+
+ if (vcpu->arch.spe.irq_num)
+ return -EBUSY;
+
+ if (get_user(irq, uaddr))
+ return -EFAULT;
+
+ if (!kvm_spe_irq_is_valid(vcpu->kvm, irq))
+ return -EINVAL;
+
+ kvm_debug("Set KVM_ARM_VCPU_SPE_IRQ: %d\n", irq);
+ vcpu->arch.spe.irq_num = irq;
+ return 0;
+ }
+ }
+
return -ENXIO;
}

int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
{
+ if (!kvm_vcpu_supports_spe(vcpu))
+ return -ENXIO;
+
+ switch (attr->attr) {
+ case KVM_ARM_VCPU_SPE_IRQ: {
+ int __user *uaddr = (int __user *)(long)attr->addr;
+ int irq;
+
+ if (!vcpu->arch.spe.irq_num)
+ return -ENXIO;
+
+ irq = vcpu->arch.spe.irq_num;
+ if (put_user(irq, uaddr))
+ return -EFAULT;
+
+ return 0;
+ }
+ }
+
return -ENXIO;
}

int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
{
+ if (!kvm_vcpu_supports_spe(vcpu))
+ return -ENXIO;
+
+ switch(attr->attr) {
+ case KVM_ARM_VCPU_SPE_IRQ:
+ return 0;
+ }
+
return -ENXIO;
}
--
2.33.0

2021-08-25 20:31:30

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 26/39] KVM: arm64: VHE: Change MDCR_EL2 at world switch if VCPU has SPE

When a VCPU has the SPE feature, MDCR_EL2 sets the buffer owning regime to
EL1&0. Write the guest's MDCR_EL2 value as late as possible and restore the
host's value as soon as possible at each world switch to make the profiling
blackout window as small as possible for the host.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_hyp.h | 2 +-
arch/arm64/kvm/debug.c | 14 +++++++++++--
arch/arm64/kvm/hyp/vhe/switch.c | 33 +++++++++++++++++++++++-------
arch/arm64/kvm/hyp/vhe/sysreg-sr.c | 2 +-
4 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 9d60b3006efc..657d0c94cf82 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -95,7 +95,7 @@ void __sve_restore_state(void *sve_pffr, u32 *fpsr);

#ifndef __KVM_NVHE_HYPERVISOR__
void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
-void deactivate_traps_vhe_put(void);
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);
#endif

u64 __guest_enter(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 64e8211366b6..70712cd85f32 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -249,9 +249,19 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu)
vcpu->arch.flags |= KVM_ARM64_DEBUG_DIRTY;

/* Write mdcr_el2 changes since vcpu_load on VHE systems */
- if (has_vhe() && orig_mdcr_el2 != vcpu->arch.mdcr_el2)
- write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+ if (has_vhe()) {
+ /*
+ * MDCR_EL2 can modify the SPE buffer owning regime, defer the
+ * write until the VCPU is run.
+ */
+ if (kvm_vcpu_has_spe(vcpu))
+ goto out;
+
+ if (orig_mdcr_el2 != vcpu->arch.mdcr_el2)
+ write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+ }

+out:
trace_kvm_arm_set_dreg32("MDSCR_EL1", vcpu_read_sys_reg(vcpu, MDSCR_EL1));
}

diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index 983ba1570d72..ec4e179d56ae 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -31,12 +31,29 @@ DEFINE_PER_CPU(struct kvm_host_data, kvm_host_data);
DEFINE_PER_CPU(struct kvm_cpu_context, kvm_hyp_ctxt);
DEFINE_PER_CPU(unsigned long, kvm_hyp_vector);

+static void __restore_host_mdcr_el2(struct kvm_vcpu *vcpu)
+{
+ u64 mdcr_el2;
+
+ mdcr_el2 = read_sysreg(mdcr_el2);
+ mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
+ write_sysreg(mdcr_el2, mdcr_el2);
+}
+
+static void __restore_guest_mdcr_el2(struct kvm_vcpu *vcpu)
+{
+ write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+}
+
static void __activate_traps(struct kvm_vcpu *vcpu)
{
u64 val;

___activate_traps(vcpu);

+ if (kvm_vcpu_has_spe(vcpu))
+ __restore_guest_mdcr_el2(vcpu);
+
val = read_sysreg(cpacr_el1);
val |= CPACR_EL1_TTA;
val &= ~CPACR_EL1_ZEN;
@@ -81,7 +98,11 @@ static void __deactivate_traps(struct kvm_vcpu *vcpu)
*/
asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));

+ if (kvm_vcpu_has_spe(vcpu))
+ __restore_host_mdcr_el2(vcpu);
+
write_sysreg(CPACR_EL1_DEFAULT, cpacr_el1);
+
write_sysreg(vectors, vbar_el1);
}
NOKPROBE_SYMBOL(__deactivate_traps);
@@ -90,16 +111,14 @@ void activate_traps_vhe_load(struct kvm_vcpu *vcpu)
{
__activate_traps_common(vcpu);

- write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
+ if (!kvm_vcpu_has_spe(vcpu))
+ __restore_guest_mdcr_el2(vcpu);
}

-void deactivate_traps_vhe_put(void)
+void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu)
{
- u64 mdcr_el2 = read_sysreg(mdcr_el2);
-
- mdcr_el2 &= MDCR_EL2_HPMN_MASK | MDCR_EL2_TPMS;
-
- write_sysreg(mdcr_el2, mdcr_el2);
+ if (!kvm_vcpu_has_spe(vcpu))
+ __restore_host_mdcr_el2(vcpu);

__deactivate_traps_common();
}
diff --git a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
index 2a0b8c88d74f..007a12dd4351 100644
--- a/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
+++ b/arch/arm64/kvm/hyp/vhe/sysreg-sr.c
@@ -101,7 +101,7 @@ void kvm_vcpu_put_sysregs_vhe(struct kvm_vcpu *vcpu)
struct kvm_cpu_context *host_ctxt;

host_ctxt = &this_cpu_ptr(&kvm_host_data)->host_ctxt;
- deactivate_traps_vhe_put();
+ deactivate_traps_vhe_put(vcpu);

__sysreg_save_el1_state(guest_ctxt);
__sysreg_save_user_state(guest_ctxt);
--
2.33.0

2021-08-25 20:31:55

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 19/39] KVM: arm64: Do not emulate SPE on CPUs which don't have SPE

The kernel allows heterogeneous systems where FEAT_SPE is not present on
all CPUs. This presents a challenge for KVM, as it will have to touch the
SPE registers when emulating SPE for a guest, and those accesses will cause
an undefined exception if SPE is not present on the CPU.

Avoid this situation by comparing the cpumask of the physical CPUs that
support SPE with the cpu list provided by userspace via the
KVM_ARM_VCPU_SUPPORTED_CPUS ioctl and refusing to run the VCPU if there is
a mismatch.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_spe.h | 2 ++
arch/arm64/kvm/arm.c | 3 +++
arch/arm64/kvm/spe.c | 12 ++++++++++++
3 files changed, 17 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 328115ce0b48..ed67ddbf8132 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -16,11 +16,13 @@ static __always_inline bool kvm_supports_spe(void)

void kvm_spe_init_supported_cpus(void);
void kvm_spe_vm_init(struct kvm *kvm);
+int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu);
#else
#define kvm_supports_spe() (false)

static inline void kvm_spe_init_supported_cpus(void) {}
static inline void kvm_spe_vm_init(struct kvm *kvm) {}
+static inline int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu) { return -ENOEXEC; }
#endif /* CONFIG_KVM_ARM_SPE */

#endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 82cb7b5b3b45..8f7025f2e4a0 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -633,6 +633,9 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
if (!kvm_arm_vcpu_is_finalized(vcpu))
return -EPERM;

+ if (kvm_vcpu_has_spe(vcpu) && kvm_spe_check_supported_cpus(vcpu))
+ return -EPERM;
+
vcpu->arch.has_run_once = true;

kvm_arm_vcpu_init_debug(vcpu);
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 83f92245f881..8d2afc137151 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -30,3 +30,15 @@ void kvm_spe_vm_init(struct kvm *kvm)
/* Set supported_cpus if it isn't already initialized. */
kvm_spe_init_supported_cpus();
}
+
+int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu)
+{
+ /* SPE is supported on all CPUs, we don't care about the VCPU mask */
+ if (cpumask_equal(supported_cpus, cpu_possible_mask))
+ return 0;
+
+ if (!cpumask_subset(&vcpu->arch.supported_cpus, supported_cpus))
+ return -ENOEXEC;
+
+ return 0;
+}
--
2.33.0

2021-08-25 20:31:55

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 22/39] KVM: arm64: Add SPE VCPU device attribute to initialize SPE

From: Sudeep Holla <[email protected]>

Add KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_INIT) VCPU ioctl to initialize
SPE. Initialization can only be done once for a VCPU. If the feature bit is
set, then SPE must be initialized before the VCPU can be run.

[ Alexandru E: Split from "KVM: arm64: Add a new VCPU device control group
for SPE" ]

Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
---
Documentation/virt/kvm/devices/vcpu.rst | 16 ++++++++++++++
arch/arm64/include/asm/kvm_spe.h | 4 ++--
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/arm.c | 7 ++++--
arch/arm64/kvm/spe.c | 29 ++++++++++++++++++++++++-
5 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 05821d40849f..c275c320e500 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -185,3 +185,19 @@ Returns:
Specifies the Profiling Buffer management interrupt number. The interrupt number
must be a PPI and the interrupt number must be the same for each VCPU. SPE
emulation requires an in-kernel vGIC implementation.
+
+4.2 ATTRIBUTE: KVM_ARM_VCPU_SPE_INIT
+-----------------------------------
+
+:Parameters: no additional parameter in kvm_device_attr.addr
+
+Returns:
+
+ ======= ============================================
+ -EBUSY SPE already initialized for this VCPU
+ -ENXIO SPE not supported or not properly configured
+ ======= ============================================
+
+Request initialization of the Statistical Profiling Extension for this VCPU.
+Must be done after initializing the in-kernel irqchip and after setting the
+Profiling Buffer management interrupt number for the VCPU.
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index 2fe11868719d..2217b821ab37 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -23,7 +23,7 @@ struct kvm_vcpu_spe {

void kvm_spe_init_supported_cpus(void);
void kvm_spe_vm_init(struct kvm *kvm);
-int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu);
+int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu);

int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
@@ -36,7 +36,7 @@ struct kvm_vcpu_spe {

static inline void kvm_spe_init_supported_cpus(void) {}
static inline void kvm_spe_vm_init(struct kvm *kvm) {}
-static inline int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu) { return -ENOEXEC; }
+static inline int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu) { return -ENOEXEC; }

static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu,
struct kvm_device_attr *attr)
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index c55d94a1a8f5..d4c0b53a5fb2 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -370,6 +370,7 @@ struct kvm_arm_copy_mte_tags {
#define KVM_ARM_VCPU_PVTIME_IPA 0
#define KVM_ARM_VCPU_SPE_CTRL 3
#define KVM_ARM_VCPU_SPE_IRQ 0
+#define KVM_ARM_VCPU_SPE_INIT 1

/* KVM_IRQ_LINE irq field index values */
#define KVM_ARM_IRQ_VCPU2_SHIFT 28
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 8f7025f2e4a0..6af7ef26d2c1 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -633,8 +633,11 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
if (!kvm_arm_vcpu_is_finalized(vcpu))
return -EPERM;

- if (kvm_vcpu_has_spe(vcpu) && kvm_spe_check_supported_cpus(vcpu))
- return -EPERM;
+ if (kvm_vcpu_has_spe(vcpu)) {
+ ret = kvm_spe_vcpu_first_run_init(vcpu);
+ if (ret)
+ return ret;
+ }

vcpu->arch.has_run_once = true;

diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 2fdb42e27675..801ceb66a3d0 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -31,7 +31,7 @@ void kvm_spe_vm_init(struct kvm *kvm)
kvm_spe_init_supported_cpus();
}

-int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu)
+static int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu)
{
/* SPE is supported on all CPUs, we don't care about the VCPU mask */
if (cpumask_equal(supported_cpus, cpu_possible_mask))
@@ -43,6 +43,20 @@ int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu)
return 0;
}

+int kvm_spe_vcpu_first_run_init(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ ret = kvm_spe_check_supported_cpus(vcpu);
+ if (ret)
+ return ret;
+
+ if (!vcpu->arch.spe.initialized)
+ return -EPERM;
+
+ return 0;
+}
+
static bool kvm_vcpu_supports_spe(struct kvm_vcpu *vcpu)
{
if (!kvm_supports_spe())
@@ -102,6 +116,18 @@ int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
vcpu->arch.spe.irq_num = irq;
return 0;
}
+ case KVM_ARM_VCPU_SPE_INIT:
+ if (!vcpu->arch.spe.irq_num)
+ return -ENXIO;
+
+ if (!vgic_initialized(vcpu->kvm))
+ return -ENXIO;
+
+ if (kvm_vgic_set_owner(vcpu, vcpu->arch.spe.irq_num, &vcpu->arch.spe))
+ return -ENXIO;
+
+ vcpu->arch.spe.initialized = true;
+ return 0;
}

return -ENXIO;
@@ -138,6 +164,7 @@ int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)

switch(attr->attr) {
case KVM_ARM_VCPU_SPE_IRQ:
+ case KVM_ARM_VCPU_SPE_INIT:
return 0;
}

--
2.33.0

2021-08-25 20:35:44

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 29/39] KVM: arm64: Rename DEBUG_STATE_SAVE_SPE -> DEBUG_SAVE_SPE_BUFFER flags

Setting the KVM_ARM64_DEBUG_STATE_SAVE_SPE flag will stop profiling to
drain the buffer, if the buffer is enabled when switching to the guest, and
then re-enable profiling on the return to the host. Rename it to
KVM_ARM64_DEBUG_SAVE_SPE_BUFFER to avoid confusion with what an SPE-enabled
VCPU will do, which is to save and restore the full SPE state on a world
switch, not just part of it some of the time. This also matches the name of
the function __debug_save_host_buffers_nvhe(), which makes use of the flag
to decide if the buffer should be drained.

Similar treatment was applied to KVM_ARM64_DEBUG_STATE_SAVE_TRBE, which was
renamed to KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER, for consistency and to better
reflect what it is doing.

CC: Suzuki K Poulose <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 24 ++++++++++++------------
arch/arm64/kvm/debug.c | 11 ++++++-----
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 8 ++++----
3 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 351b77dc7732..e704847a7645 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -439,18 +439,18 @@ struct kvm_vcpu_arch {
})

/* vcpu_arch flags field values: */
-#define KVM_ARM64_DEBUG_DIRTY (1 << 0)
-#define KVM_ARM64_FP_ENABLED (1 << 1) /* guest FP regs loaded */
-#define KVM_ARM64_FP_HOST (1 << 2) /* host FP regs loaded */
-#define KVM_ARM64_HOST_SVE_IN_USE (1 << 3) /* backup for host TIF_SVE */
-#define KVM_ARM64_HOST_SVE_ENABLED (1 << 4) /* SVE enabled for EL0 */
-#define KVM_ARM64_GUEST_HAS_SVE (1 << 5) /* SVE exposed to guest */
-#define KVM_ARM64_VCPU_SVE_FINALIZED (1 << 6) /* SVE config completed */
-#define KVM_ARM64_GUEST_HAS_PTRAUTH (1 << 7) /* PTRAUTH exposed to guest */
-#define KVM_ARM64_PENDING_EXCEPTION (1 << 8) /* Exception pending */
-#define KVM_ARM64_EXCEPT_MASK (7 << 9) /* Target EL/MODE */
-#define KVM_ARM64_DEBUG_STATE_SAVE_SPE (1 << 12) /* Save SPE context if active */
-#define KVM_ARM64_DEBUG_STATE_SAVE_TRBE (1 << 13) /* Save TRBE context if active */
+#define KVM_ARM64_DEBUG_DIRTY (1 << 0)
+#define KVM_ARM64_FP_ENABLED (1 << 1) /* guest FP regs loaded */
+#define KVM_ARM64_FP_HOST (1 << 2) /* host FP regs loaded */
+#define KVM_ARM64_HOST_SVE_IN_USE (1 << 3) /* backup for host TIF_SVE */
+#define KVM_ARM64_HOST_SVE_ENABLED (1 << 4) /* SVE enabled for EL0 */
+#define KVM_ARM64_GUEST_HAS_SVE (1 << 5) /* SVE exposed to guest */
+#define KVM_ARM64_VCPU_SVE_FINALIZED (1 << 6) /* SVE config completed */
+#define KVM_ARM64_GUEST_HAS_PTRAUTH (1 << 7) /* PTRAUTH exposed to guest */
+#define KVM_ARM64_PENDING_EXCEPTION (1 << 8) /* Exception pending */
+#define KVM_ARM64_EXCEPT_MASK (7 << 9) /* Target EL/MODE */
+#define KVM_ARM64_DEBUG_SAVE_SPE_BUFFER (1 << 12) /* Save SPE buffer if active */
+#define KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER (1 << 13) /* Save TRBE buffer if active */

#define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE | \
KVM_GUESTDBG_USE_SW_BP | \
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 70712cd85f32..6e5fc1887215 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -299,22 +299,23 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
return;

dfr0 = read_sysreg(id_aa64dfr0_el1);
+
/*
* If SPE is present on this CPU and is available at current EL,
- * we may need to check if the host state needs to be saved.
+ * we may need to check if the host buffer needs to be drained.
*/
if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_PMSVER_SHIFT) &&
!(read_sysreg_s(SYS_PMBIDR_EL1) & BIT(SYS_PMBIDR_EL1_P_SHIFT)))
- vcpu->arch.flags |= KVM_ARM64_DEBUG_STATE_SAVE_SPE;
+ vcpu->arch.flags |= KVM_ARM64_DEBUG_SAVE_SPE_BUFFER;

/* Check if we have TRBE implemented and available at the host */
if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_TRBE_SHIFT) &&
!(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_PROG))
- vcpu->arch.flags |= KVM_ARM64_DEBUG_STATE_SAVE_TRBE;
+ vcpu->arch.flags |= KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER;
}

void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
{
- vcpu->arch.flags &= ~(KVM_ARM64_DEBUG_STATE_SAVE_SPE |
- KVM_ARM64_DEBUG_STATE_SAVE_TRBE);
+ vcpu->arch.flags &= ~(KVM_ARM64_DEBUG_SAVE_SPE_BUFFER |
+ KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER);
}
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 6db58722f045..186b90b5fd20 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -85,10 +85,10 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt)
{
/* Disable and flush SPE data generation */
- if (vcpu->arch.flags & KVM_ARM64_DEBUG_STATE_SAVE_SPE)
+ if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_SPE_BUFFER)
__debug_save_spe(__ctxt_sys_reg(host_ctxt, PMSCR_EL1));
/* Disable and flush Self-Hosted Trace generation */
- if (vcpu->arch.flags & KVM_ARM64_DEBUG_STATE_SAVE_TRBE)
+ if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER)
__debug_save_trace(&vcpu->arch.host_debug_state.trfcr_el1);
}

@@ -100,9 +100,9 @@ void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt)
{
- if (vcpu->arch.flags & KVM_ARM64_DEBUG_STATE_SAVE_SPE)
+ if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_SPE_BUFFER)
__debug_restore_spe(ctxt_sys_reg(host_ctxt, PMSCR_EL1));
- if (vcpu->arch.flags & KVM_ARM64_DEBUG_STATE_SAVE_TRBE)
+ if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER)
__debug_restore_trace(vcpu->arch.host_debug_state.trfcr_el1);
}

--
2.33.0

2021-08-25 20:44:44

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 30/39] KVM: arm64: nVHE: Context switch SPE state if VCPU has SPE

For non-VHE systems, make the SPE register state part of the context that
is saved and restored at each world switch. The SPE buffer management
interrupt will be handled in a later patch.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_hyp.h | 19 ++++++
arch/arm64/kvm/hyp/include/hyp/spe-sr.h | 32 +++++++++
arch/arm64/kvm/hyp/nvhe/Makefile | 1 +
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 6 +-
arch/arm64/kvm/hyp/nvhe/spe-sr.c | 87 +++++++++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/switch.c | 29 +++++++--
6 files changed, 165 insertions(+), 9 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/include/hyp/spe-sr.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/spe-sr.c

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 48619c2c0dc6..06e77a739458 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -88,6 +88,25 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt);
void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt);
+#ifdef CONFIG_KVM_ARM_SPE
+void __spe_save_host_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt);
+void __spe_save_guest_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt);
+void __spe_restore_host_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt);
+void __spe_restore_guest_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt);
+#else
+static inline void __spe_save_host_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt) {}
+static inline void __spe_save_guest_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt) {}
+static inline void __spe_restore_host_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt) {}
+static inline void __spe_restore_guest_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt) {}
+#endif
#endif

void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
diff --git a/arch/arm64/kvm/hyp/include/hyp/spe-sr.h b/arch/arm64/kvm/hyp/include/hyp/spe-sr.h
new file mode 100644
index 000000000000..d5f8f3ffc7d4
--- /dev/null
+++ b/arch/arm64/kvm/hyp/include/hyp/spe-sr.h
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ * Author: Alexandru Elisei <[email protected]>
+ */
+
+#ifndef __ARM64_KVM_HYP_SPE_SR_H__
+#define __ARM64_KVM_HYP_SPE_SR_H__
+
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_asm.h>
+
+static inline void __spe_save_common_state(struct kvm_cpu_context *ctxt)
+{
+ ctxt_sys_reg(ctxt, PMSICR_EL1) = read_sysreg_s(SYS_PMSICR_EL1);
+ ctxt_sys_reg(ctxt, PMSIRR_EL1) = read_sysreg_s(SYS_PMSIRR_EL1);
+ ctxt_sys_reg(ctxt, PMSFCR_EL1) = read_sysreg_s(SYS_PMSFCR_EL1);
+ ctxt_sys_reg(ctxt, PMSEVFR_EL1) = read_sysreg_s(SYS_PMSEVFR_EL1);
+ ctxt_sys_reg(ctxt, PMSLATFR_EL1) = read_sysreg_s(SYS_PMSLATFR_EL1);
+}
+
+static inline void __spe_restore_common_state(struct kvm_cpu_context *ctxt)
+{
+ write_sysreg_s(ctxt_sys_reg(ctxt, PMSICR_EL1), SYS_PMSICR_EL1);
+ write_sysreg_s(ctxt_sys_reg(ctxt, PMSIRR_EL1), SYS_PMSIRR_EL1);
+ write_sysreg_s(ctxt_sys_reg(ctxt, PMSFCR_EL1), SYS_PMSFCR_EL1);
+ write_sysreg_s(ctxt_sys_reg(ctxt, PMSEVFR_EL1), SYS_PMSEVFR_EL1);
+ write_sysreg_s(ctxt_sys_reg(ctxt, PMSLATFR_EL1), SYS_PMSLATFR_EL1);
+}
+
+#endif /* __ARM64_KVM_HYP_SPE_SR_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 5df6193fc430..37dca45d85d5 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -15,6 +15,7 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))
obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
cache.o setup.o mm.o mem_protect.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
obj-y += $(lib-objs)
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 186b90b5fd20..1622615954b2 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -85,7 +85,8 @@ void __debug_save_host_buffers_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt)
{
/* Disable and flush SPE data generation */
- if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_SPE_BUFFER)
+ if (!kvm_vcpu_has_spe(vcpu) &&
+ vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_SPE_BUFFER)
__debug_save_spe(__ctxt_sys_reg(host_ctxt, PMSCR_EL1));
/* Disable and flush Self-Hosted Trace generation */
if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER)
@@ -100,7 +101,8 @@ void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
void __debug_restore_host_buffers_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt)
{
- if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_SPE_BUFFER)
+ if (!kvm_vcpu_has_spe(vcpu) &&
+ vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_SPE_BUFFER)
__debug_restore_spe(ctxt_sys_reg(host_ctxt, PMSCR_EL1));
if (vcpu->arch.flags & KVM_ARM64_DEBUG_SAVE_TRBE_BUFFER)
__debug_restore_trace(vcpu->arch.host_debug_state.trfcr_el1);
diff --git a/arch/arm64/kvm/hyp/nvhe/spe-sr.c b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
new file mode 100644
index 000000000000..46e47c9fd08f
--- /dev/null
+++ b/arch/arm64/kvm/hyp/nvhe/spe-sr.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ * Author: Alexandru Elisei <[email protected]>
+ */
+
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_hyp.h>
+
+#include <hyp/spe-sr.h>
+
+/*
+ * The owning exception level remains unchanged from EL1 during the world switch,
+ * which means that profiling is disabled for as long as we execute at EL2. KVM
+ * does not need to explicitly disable profiling, like it does when the VCPU
+ * does not have SPE and we change the buffer owning exception level, nor does it
+ * need to do any synchronization around sysreg save/restore.
+ */
+
+void __spe_save_host_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt)
+{
+ u64 pmblimitr;
+
+ pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+ if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+ psb_csync();
+ dsb(nsh);
+ /*
+ * The buffer performs indirect writes to system registers, a
+ * context synchronization event is needed before the new
+ * PMBPTR_EL1 value is visible to subsequent direct reads.
+ */
+ isb();
+ }
+
+ ctxt_sys_reg(host_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+ ctxt_sys_reg(host_ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+ ctxt_sys_reg(host_ctxt, PMBLIMITR_EL1) = pmblimitr;
+ ctxt_sys_reg(host_ctxt, PMSCR_EL1) = read_sysreg_s(SYS_PMSCR_EL1);
+ ctxt_sys_reg(host_ctxt, PMSCR_EL2) = read_sysreg_el2(SYS_PMSCR);
+
+ __spe_save_common_state(host_ctxt);
+}
+
+void __spe_save_guest_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt)
+{
+ if (read_sysreg_s(SYS_PMBLIMITR_EL1) & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+ psb_csync();
+ dsb(nsh);
+ /* Ensure hardware updates to PMBPTR_EL1 are visible. */
+ isb();
+ }
+
+ ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+ ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+ /* PMBLIMITR_EL1 is updated only on a trapped write. */
+ ctxt_sys_reg(guest_ctxt, PMSCR_EL1) = read_sysreg_s(SYS_PMSCR_EL1);
+
+ __spe_save_common_state(guest_ctxt);
+}
+
+void __spe_restore_host_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt)
+{
+ __spe_restore_common_state(host_ctxt);
+
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
+ write_sysreg_el2(ctxt_sys_reg(host_ctxt, PMSCR_EL2), SYS_PMSCR);
+}
+
+void __spe_restore_guest_state_nvhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt)
+{
+ __spe_restore_common_state(guest_ctxt);
+
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR_EL1);
+ write_sysreg_el2(0, SYS_PMSCR);
+}
diff --git a/arch/arm64/kvm/hyp/nvhe/switch.c b/arch/arm64/kvm/hyp/nvhe/switch.c
index 04d654e71a6e..62ef2a5789ba 100644
--- a/arch/arm64/kvm/hyp/nvhe/switch.c
+++ b/arch/arm64/kvm/hyp/nvhe/switch.c
@@ -194,12 +194,16 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)

__sysreg_save_state_nvhe(host_ctxt);
/*
- * We must flush and disable the SPE buffer for nVHE, as
- * the translation regime(EL1&0) is going to be loaded with
- * that of the guest. And we must do this before we change the
- * translation regime to EL2 (via MDCR_EL2_E2PB == 0) and
- * before we load guest Stage1.
+ * If the VCPU has the SPE feature bit set, then we save the host's SPE
+ * context.
+ *
+ * Otherwise, we only flush and disable the SPE buffer for nVHE, as the
+ * translation regime(EL1&0) is going to be loaded with that of the
+ * guest. And we must do this before we change the translation regime to
+ * EL2 (via MDCR_EL2_E2PB == 0) and before we load guest Stage1.
*/
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_save_host_state_nvhe(vcpu, host_ctxt);
__debug_save_host_buffers_nvhe(vcpu, host_ctxt);

__kvm_adjust_pc(vcpu);
@@ -218,6 +222,9 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
__load_guest_stage2(kern_hyp_va(vcpu->arch.hw_mmu));
__activate_traps(vcpu);

+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_restore_guest_state_nvhe(vcpu, guest_ctxt);
+
__hyp_vgic_restore_state(vcpu);
__timer_enable_traps(vcpu);

@@ -232,6 +239,10 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)

__sysreg_save_state_nvhe(guest_ctxt);
__sysreg32_save_state(vcpu);
+
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_save_guest_state_nvhe(vcpu, guest_ctxt);
+
__timer_disable_traps(vcpu);
__hyp_vgic_save_state(vcpu);

@@ -244,10 +255,14 @@ int __kvm_vcpu_run(struct kvm_vcpu *vcpu)
__fpsimd_save_fpexc32(vcpu);

__debug_switch_to_host(vcpu);
+
/*
- * This must come after restoring the host sysregs, since a non-VHE
- * system may enable SPE here and make use of the TTBRs.
+ * Restoring the host context must come after restoring the host
+ * sysregs, since a non-VHE system may enable SPE here and make use of
+ * the TTBRs.
*/
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_restore_host_state_nvhe(vcpu, host_ctxt);
__debug_restore_host_buffers_nvhe(vcpu, host_ctxt);

if (pmu_switch_needed)
--
2.33.0

2021-08-25 20:51:21

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 18/39] KVM: arm64: Expose SPE version to guests

Set the ID_AA64DFR0_EL1.PMSVer field to a non-zero value if the VCPU SPE
feature is set. SPE version is capped at FEAT_SPEv1p1 because KVM doesn't
yet implement freezing of PMU event counters on a SPE buffer management
event.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/kvm/sys_regs.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index f6f126eb6ac1..ab7370b7a44b 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1070,8 +1070,10 @@ static u64 read_id_reg(const struct kvm_vcpu *vcpu,
val = cpuid_feature_cap_perfmon_field(val,
ID_AA64DFR0_PMUVER_SHIFT,
kvm_vcpu_has_pmu(vcpu) ? ID_AA64DFR0_PMUVER_8_4 : 0);
- /* Hide SPE from guests */
- val &= ~FEATURE(ID_AA64DFR0_PMSVER);
+ /* Limit guests to SPE for ARMv8.3 */
+ val = cpuid_feature_cap_perfmon_field(val,
+ ID_AA64DFR0_PMSVER_SHIFT,
+ kvm_vcpu_has_spe(vcpu) ? ID_AA64DFR0_PMSVER_8_3 : 0);
break;
case SYS_ID_DFR0_EL1:
/* Limit guests to PMUv3 for ARMv8.4 */
--
2.33.0

2021-08-25 20:51:21

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 20/39] KVM: arm64: Add a new VCPU device control group for SPE

From: Sudeep Holla <[email protected]>

Add a new VCPU device control group to control various aspects of KVM's SPE
emulation. Functionality will be added in later patches.
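
Since no attributes are defined yet, here is a rough sketch of how userspace
is expected to use the group (KVM_ARM_VCPU_SPE_EXAMPLE_ATTR is a made-up
placeholder; with this patch alone every attribute still returns ENXIO):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Placeholder, no attributes are defined by this patch. */
#define KVM_ARM_VCPU_SPE_EXAMPLE_ATTR	0

static int vcpu_has_spe_attr(int vcpu_fd)
{
	struct kvm_device_attr attr = {
		.group	= KVM_ARM_VCPU_SPE_CTRL,
		.attr	= KVM_ARM_VCPU_SPE_EXAMPLE_ATTR,
	};

	/* 0 if the attribute exists, -1 with errno set to ENXIO otherwise. */
	return ioctl(vcpu_fd, KVM_HAS_DEVICE_ATTR, &attr);
}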

[ Alexandru E: Rewrote patch ]

Signed-off-by: Sudeep Holla <[email protected]>
Signed-off-by: Alexandru Elisei <[email protected]>
---
Documentation/virt/kvm/devices/vcpu.rst | 5 +++++
arch/arm64/include/asm/kvm_spe.h | 20 ++++++++++++++++++++
arch/arm64/include/uapi/asm/kvm.h | 1 +
arch/arm64/kvm/guest.c | 10 ++++++++++
arch/arm64/kvm/spe.c | 15 +++++++++++++++
5 files changed, 51 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..85399c005197 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,8 @@ Specifies the base address of the stolen time structure for this VCPU. The
base address must be 64 byte aligned and exist within a valid guest memory
region. See Documentation/virt/kvm/arm/pvtime.rst for more information
including the layout of the stolen time structure.
+
+4. GROUP: KVM_ARM_VCPU_SPE_CTRL
+===============================
+
+:Architectures: ARM64
diff --git a/arch/arm64/include/asm/kvm_spe.h b/arch/arm64/include/asm/kvm_spe.h
index ed67ddbf8132..ce0f5b3f2027 100644
--- a/arch/arm64/include/asm/kvm_spe.h
+++ b/arch/arm64/include/asm/kvm_spe.h
@@ -17,12 +17,32 @@ static __always_inline bool kvm_supports_spe(void)
void kvm_spe_init_supported_cpus(void);
void kvm_spe_vm_init(struct kvm *kvm);
int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu);
+
+int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
+int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
#else
#define kvm_supports_spe() (false)

static inline void kvm_spe_init_supported_cpus(void) {}
static inline void kvm_spe_vm_init(struct kvm *kvm) {}
static inline int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu) { return -ENOEXEC; }
+
+static inline int kvm_spe_set_attr(struct kvm_vcpu *vcpu,
+ struct kvm_device_attr *attr)
+{
+ return -ENXIO;
+}
+static inline int kvm_spe_get_attr(struct kvm_vcpu *vcpu,
+ struct kvm_device_attr *attr)
+{
+ return -ENXIO;
+}
+static inline int kvm_spe_has_attr(struct kvm_vcpu *vcpu,
+ struct kvm_device_attr *attr)
+{
+ return -ENXIO;
+}
#endif /* CONFIG_KVM_ARM_SPE */

#endif /* __ARM64_KVM_SPE_H__ */
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 9f0a8ea50ea9..7159a1e23da2 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -368,6 +368,7 @@ struct kvm_arm_copy_mte_tags {
#define KVM_ARM_VCPU_TIMER_IRQ_PTIMER 1
#define KVM_ARM_VCPU_PVTIME_CTRL 2
#define KVM_ARM_VCPU_PVTIME_IPA 0
+#define KVM_ARM_VCPU_SPE_CTRL 3

/* KVM_IRQ_LINE irq field index values */
#define KVM_ARM_IRQ_VCPU2_SHIFT 28
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 1dfb83578277..316110b5dd95 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -24,6 +24,7 @@
#include <asm/fpsimd.h>
#include <asm/kvm.h>
#include <asm/kvm_emulate.h>
+#include <asm/kvm_spe.h>
#include <asm/sigcontext.h>

#include "trace.h"
@@ -962,6 +963,9 @@ int kvm_arm_vcpu_arch_set_attr(struct kvm_vcpu *vcpu,
case KVM_ARM_VCPU_PVTIME_CTRL:
ret = kvm_arm_pvtime_set_attr(vcpu, attr);
break;
+ case KVM_ARM_VCPU_SPE_CTRL:
+ ret = kvm_spe_set_attr(vcpu, attr);
+ break;
default:
ret = -ENXIO;
break;
@@ -985,6 +989,9 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
case KVM_ARM_VCPU_PVTIME_CTRL:
ret = kvm_arm_pvtime_get_attr(vcpu, attr);
break;
+ case KVM_ARM_VCPU_SPE_CTRL:
+ ret = kvm_spe_get_attr(vcpu, attr);
+ break;
default:
ret = -ENXIO;
break;
@@ -1008,6 +1015,9 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
case KVM_ARM_VCPU_PVTIME_CTRL:
ret = kvm_arm_pvtime_has_attr(vcpu, attr);
break;
+ case KVM_ARM_VCPU_SPE_CTRL:
+ ret = kvm_spe_has_attr(vcpu, attr);
+ break;
default:
ret = -ENXIO;
break;
diff --git a/arch/arm64/kvm/spe.c b/arch/arm64/kvm/spe.c
index 8d2afc137151..56a3fdb35623 100644
--- a/arch/arm64/kvm/spe.c
+++ b/arch/arm64/kvm/spe.c
@@ -42,3 +42,18 @@ int kvm_spe_check_supported_cpus(struct kvm_vcpu *vcpu)

return 0;
}
+
+int kvm_spe_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+ return -ENXIO;
+}
+
+int kvm_spe_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+ return -ENXIO;
+}
+
+int kvm_spe_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr)
+{
+ return -ENXIO;
+}
--
2.33.0

2021-08-25 20:51:22

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 17/39] KVM: arm64: Set the VCPU SPE feature bit when SPE is available

Check that KVM SPE emulation is supported before allowing userspace to set
the KVM_ARM_VCPU_SPE feature.

According to ARM DDI 0487G.a, page D9-2946, the Profiling Buffer is disabled
if the owning Exception level is 32 bit. Because of this, reject the SPE
feature if the VCPU's execution state at EL1 is AArch32.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 3 +++
arch/arm64/kvm/reset.c | 23 +++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1f3b46a6df81..948adb152104 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -807,6 +807,9 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
#define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))

+#define kvm_vcpu_has_spe(vcpu) \
+ (test_bit(KVM_ARM_VCPU_SPE, (vcpu)->arch.features))
+
int kvm_trng_call(struct kvm_vcpu *vcpu);
#ifdef CONFIG_KVM
extern phys_addr_t hyp_mem_base;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index cba7872d69a8..17b9f1b29c24 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -27,6 +27,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_emulate.h>
#include <asm/kvm_mmu.h>
+#include <asm/kvm_spe.h>
#include <asm/virt.h>

/* Maximum phys_shift supported for any VM on this host */
@@ -189,6 +190,21 @@ static bool vcpu_allowed_register_width(struct kvm_vcpu *vcpu)
return true;
}

+static int kvm_vcpu_enable_spe(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_supports_spe())
+ return -EINVAL;
+
+ /*
+ * The Profiling Buffer is disabled if the owning Exception level is
+ * aarch32.
+ */
+ if (vcpu_has_feature(vcpu, KVM_ARM_VCPU_EL1_32BIT))
+ return -EINVAL;
+
+ return 0;
+}
+
/**
* kvm_reset_vcpu - sets core registers and sys_regs to reset value
* @vcpu: The VCPU pointer
@@ -245,6 +261,13 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
goto out;
}

+ if (kvm_vcpu_has_spe(vcpu)) {
+ if (kvm_vcpu_enable_spe(vcpu)) {
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
switch (vcpu->arch.target) {
default:
if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) {
--
2.33.0

2021-08-25 20:52:35

by Alexandru Elisei

[permalink] [raw]
Subject: [RFC PATCH v4 31/39] KVM: arm64: VHE: Context switch SPE state if VCPU has SPE

Similar to the non-VHE case, save and restore the SPE register state at
each world switch for VHE enabled systems if the VCPU has the SPE
feature.

Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/kvm_hyp.h | 24 +++++-
arch/arm64/include/asm/sysreg.h | 2 +
arch/arm64/kvm/hyp/vhe/Makefile | 1 +
arch/arm64/kvm/hyp/vhe/spe-sr.c | 128 +++++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/vhe/switch.c | 8 ++
5 files changed, 161 insertions(+), 2 deletions(-)
create mode 100644 arch/arm64/kvm/hyp/vhe/spe-sr.c

diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
index 06e77a739458..03bc51049996 100644
--- a/arch/arm64/include/asm/kvm_hyp.h
+++ b/arch/arm64/include/asm/kvm_hyp.h
@@ -106,8 +106,28 @@ static inline void __spe_restore_host_state_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *host_ctxt) {}
static inline void __spe_restore_guest_state_nvhe(struct kvm_vcpu *vcpu,
struct kvm_cpu_context *guest_ctxt) {}
-#endif
-#endif
+#endif /* CONFIG_KVM_ARM_SPE */
+#else
+#ifdef CONFIG_KVM_ARM_SPE
+void __spe_save_host_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt);
+void __spe_save_guest_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt);
+void __spe_restore_host_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt);
+void __spe_restore_guest_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt);
+#else
+static inline void __spe_save_host_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt) {}
+static inline void __spe_save_guest_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt) {}
+static inline void __spe_restore_host_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt) {}
+static inline void __spe_restore_guest_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt) {}
+#endif /* CONFIG_KVM_ARM_SPE */
+#endif /* __KVM_NVHE_HYPERVISOR__ */

void __fpsimd_save_state(struct user_fpsimd_state *fp_regs);
void __fpsimd_restore_state(struct user_fpsimd_state *fp_regs);
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 7b9c3acba684..b2d691bc1049 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -267,6 +267,8 @@
#define SYS_PMSCR_EL1_TS_SHIFT 5
#define SYS_PMSCR_EL1_PCT_SHIFT 6

+#define SYS_PMSCR_EL12 sys_reg(3, 5, 9, 9, 0)
+
#define SYS_PMSCR_EL2 sys_reg(3, 4, 9, 9, 0)
#define SYS_PMSCR_EL2_E0HSPE_SHIFT 0
#define SYS_PMSCR_EL2_E2SPE_SHIFT 1
diff --git a/arch/arm64/kvm/hyp/vhe/Makefile b/arch/arm64/kvm/hyp/vhe/Makefile
index 96bec0ecf9dd..7cb4a9e5ceb0 100644
--- a/arch/arm64/kvm/hyp/vhe/Makefile
+++ b/arch/arm64/kvm/hyp/vhe/Makefile
@@ -7,5 +7,6 @@ asflags-y := -D__KVM_VHE_HYPERVISOR__
ccflags-y := -D__KVM_VHE_HYPERVISOR__

obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o
+obj-$(CONFIG_KVM_ARM_SPE) += spe-sr.o
obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
../fpsimd.o ../hyp-entry.o ../exception.o
diff --git a/arch/arm64/kvm/hyp/vhe/spe-sr.c b/arch/arm64/kvm/hyp/vhe/spe-sr.c
new file mode 100644
index 000000000000..00eab9e2ec60
--- /dev/null
+++ b/arch/arm64/kvm/hyp/vhe/spe-sr.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 - ARM Ltd
+ */
+
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_hyp.h>
+#include <asm/kprobes.h>
+
+#include <hyp/spe-sr.h>
+
+/*
+ * Disable host profiling, drain the buffer and save the host SPE context.
+ * Extra care must be taken because profiling might be in progress.
+ */
+void __spe_save_host_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt)
+{
+ u64 pmblimitr, pmscr_el2;
+
+ /* Disable profiling while the SPE context is being switched. */
+ pmscr_el2 = read_sysreg_el2(SYS_PMSCR);
+ write_sysreg_el2(0, SYS_PMSCR);
+ isb();
+
+ pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+ if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+ psb_csync();
+ dsb(nsh);
+ /* Ensure hardware updates to PMBPTR_EL1 are visible. */
+ isb();
+ }
+
+ ctxt_sys_reg(host_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+ ctxt_sys_reg(host_ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+ ctxt_sys_reg(host_ctxt, PMBLIMITR_EL1) = pmblimitr;
+ ctxt_sys_reg(host_ctxt, PMSCR_EL2) = pmscr_el2;
+
+ __spe_save_common_state(host_ctxt);
+}
+NOKPROBE_SYMBOL(__spe_save_host_state_vhe);
+
+/*
+ * Drain the guest's buffer and save the SPE state. Profiling is disabled
+ * because we're at EL2 and the buffer owning exception level is EL1.
+ */
+void __spe_save_guest_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt)
+{
+ u64 pmblimitr;
+
+ /*
+ * We're at EL2 and the buffer owning regime is EL1, which means that
+ * profiling is disabled. After we disable traps and restore the host's
+ * MDCR_EL2, profiling will remain disabled because we've disabled it
+ * via PMSCR_EL2 when we saved the host's SPE state. All that's needed
+ * here is to drain the buffer.
+ */
+ pmblimitr = read_sysreg_s(SYS_PMBLIMITR_EL1);
+ if (pmblimitr & BIT(SYS_PMBLIMITR_EL1_E_SHIFT)) {
+ psb_csync();
+ dsb(nsh);
+ /* Ensure hardware updates to PMBPTR_EL1 are visible. */
+ isb();
+ }
+
+ ctxt_sys_reg(guest_ctxt, PMBPTR_EL1) = read_sysreg_s(SYS_PMBPTR_EL1);
+ ctxt_sys_reg(guest_ctxt, PMBSR_EL1) = read_sysreg_s(SYS_PMBSR_EL1);
+ /* PMBLIMITR_EL1 is updated only on a trapped write. */
+ ctxt_sys_reg(guest_ctxt, PMSCR_EL1) = read_sysreg_el1(SYS_PMSCR);
+
+ __spe_save_common_state(guest_ctxt);
+}
+NOKPROBE_SYMBOL(__spe_save_guest_state_vhe);
+
+/*
+ * Restore the host SPE context. Special care must be taken because we're
+ * potentially resuming a profiling session which was stopped when we saved the
+ * host SPE register state.
+ */
+void __spe_restore_host_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *host_ctxt)
+{
+ __spe_restore_common_state(host_ctxt);
+
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+
+ /*
+ * Make sure the buffer pointer and limit are updated first, so we don't end
+ * up in a situation where profiling is enabled and the buffer uses the
+ * values programmed by the guest.
+ *
+ * This also serves to make sure the write to MDCR_EL2 which changes the
+ * buffer owning Exception level is visible.
+ *
+ * After the synchronization, profiling is still disabled at EL2,
+ * because we cleared PMSCR_EL2 when we saved the host context.
+ */
+ isb();
+
+ write_sysreg_el2(ctxt_sys_reg(host_ctxt, PMSCR_EL2), SYS_PMSCR);
+}
+NOKPROBE_SYMBOL(__spe_restore_host_state_vhe);
+
+/*
+ * Restore the guest SPE context while profiling is disabled at EL2.
+ */
+void __spe_restore_guest_state_vhe(struct kvm_vcpu *vcpu,
+ struct kvm_cpu_context *guest_ctxt)
+{
+ __spe_restore_common_state(guest_ctxt);
+
+ /*
+ * No synchronization needed here. Profiling is disabled at EL2 because
+ * PMSCR_EL2 has been cleared when saving the host's context, and the
+ * buffer has already been drained.
+ */
+
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBPTR_EL1), SYS_PMBPTR_EL1);
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBSR_EL1), SYS_PMBSR_EL1);
+ write_sysreg_s(ctxt_sys_reg(guest_ctxt, PMBLIMITR_EL1), SYS_PMBLIMITR_EL1);
+ write_sysreg_el1(ctxt_sys_reg(guest_ctxt, PMSCR_EL1), SYS_PMSCR);
+ /* PMSCR_EL2 has been cleared when saving the host state. */
+}
+NOKPROBE_SYMBOL(__spe_restore_guest_state_vhe);
diff --git a/arch/arm64/kvm/hyp/vhe/switch.c b/arch/arm64/kvm/hyp/vhe/switch.c
index ec4e179d56ae..46da018f4a5a 100644
--- a/arch/arm64/kvm/hyp/vhe/switch.c
+++ b/arch/arm64/kvm/hyp/vhe/switch.c
@@ -135,6 +135,8 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
guest_ctxt = &vcpu->arch.ctxt;

sysreg_save_host_state_vhe(host_ctxt);
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_save_host_state_vhe(vcpu, host_ctxt);

/*
* ARM erratum 1165522 requires us to configure both stage 1 and
@@ -153,6 +155,8 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
__kvm_adjust_pc(vcpu);

sysreg_restore_guest_state_vhe(guest_ctxt);
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_restore_guest_state_vhe(vcpu, guest_ctxt);
__debug_switch_to_guest(vcpu);

do {
@@ -163,10 +167,14 @@ static int __kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
} while (fixup_guest_exit(vcpu, &exit_code));

sysreg_save_guest_state_vhe(guest_ctxt);
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_save_guest_state_vhe(vcpu, guest_ctxt);

__deactivate_traps(vcpu);

sysreg_restore_host_state_vhe(host_ctxt);
+ if (kvm_vcpu_has_spe(vcpu))
+ __spe_restore_host_state_vhe(vcpu, host_ctxt);

if (vcpu->arch.flags & KVM_ARM64_FP_ENABLED)
__fpsimd_save_fpexc32(vcpu);
--
2.33.0

2021-09-22 10:13:45

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling Extension (SPE) support

On 25/08/2021 17:17, Alexandru Elisei wrote:
> This is v4 of the SPE series posted at [1]. v2 can be found at [2], and the
> original series at [3].
>
> Statistical Profiling Extension (SPE) is an optional feature added in
> ARMv8.2. It allows sampling at regular intervals of the operations executed
> by the PE and storing a record of each operation in a memory buffer. A high
> level overview of the extension is presented in an article on arm.com [4].
>
> This is another complete rewrite of the series, and nothing is set in
> stone. If you think of a better way to do things, please suggest it.
>
>
> Features added
> ==============
>
> The rewrite enabled me to add support for several features not
> present in the previous iteration:
>
> - Support for heterogeneous systems, where only some of the CPUs support SPE.
> This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.
>
> - Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
> VCPU ioctl.
>
> - The requirement for userspace to mlock() the guest memory has been removed,
> and now userspace can make changes to memory contents after the memory is
> mapped at stage 2.
>
> - Better debugging of guest memory pinning by printing a warning when we
> get an unexpected read or write fault. This helped me catch several bugs
> during development, it has already proven very useful. Many thanks to
> James who suggested when reviewing v3.
>
>
> Missing features
> ================
>
> I've tried to keep the series as small as possible to make it easier to review,
> while implementing the core functionality needed for the SPE emulation. As such,
> I've chosen to not implement several features:
>
> - Host profiling a guest which has the SPE feature bit set (see open
> questions).
>
> - No errata workarounds have been implemented yet, and there are quite a few of
> them for Neoverse N1 and Neoverse V1.
>
> - Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work and I am
> investigating other ways to get around automatic numa balancing, like
> requiring userspace to disable it via set_mempolicy(). I am also going to
> look at how VFIO gets around it. Suggestions welcome.
>
> - There's plenty of room for optimization. Off the top of my head, using
> block mappings at stage 2, batch pinning of pages (similar to what VFIO
> does), optimize the way KVM keeps track of pinned pages (using a linked
> list triples the memory usage), context-switch the SPE registers on
> vcpu_load/vcpu_put on VHE if the host is not profiling, locking
> optimizations, etc, etc.
>
> - ...and others. I'm sure I'm missing at least a few things which are
> important for someone.
>
>
> Known issues
> ============
>
> This is an RFC, so keep in mind that almost definitely there will be scary
> bugs. For example, below is a list of known issues which don't affect the
> correctness of the emulation, and which I'm planning to fix in a future
> iteration:
>
> - With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
> the VCPU executes the dcache clean pending ops.
>
> - With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
> kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
> VCPUs.
>
> This BUG statement can also be triggered with mainline. To reproduce it,
> compile kvmtool from this branch [5] and follow the instruction in the
> kvmtool commit message.
>
> One workaround could be to stop trying to lock all VCPUs when locking a
> memslot and document the fact that it is required that no VCPUs are run
> before the ioctl completes, otherwise bad things might happen to the VM.
>
>
> Open questions
> ==============
>
> 1. Implementing support for host profiling a guest with the SPE feature
> means setting the profiling buffer owning regime to EL2. While that is in
> effect, PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
> probes SPE during this time, the driver will fail; and the guest will be
> able to determine when it is profiled. I see two options here:

This doesn't mean that EL2 is owning the SPE. It only tells you that a
higher level EL is owning the SPE. It could just as well be EL3 (e.g.,
MDCR_EL3.NSPB == 0 or 1). So I think this is architecturally correct,
as long as we trap the guest accesses to the other SPE registers and
inject an UNDEF.
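
Something like the below is what I was thinking of (completely untested,
and vcpu_spe_regs_available() is just a made-up placeholder for however
KVM ends up tracking who owns the buffer):

static bool trap_spe_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
			 const struct sys_reg_desc *r)
{
	/* Made-up check: a higher EL owns the profiling buffer. */
	if (!vcpu_spe_regs_available(vcpu)) {
		kvm_inject_undefined(vcpu);
		return false;
	}

	if (p->is_write)
		vcpu_write_sys_reg(vcpu, p->regval, r->reg);
	else
		p->regval = vcpu_read_sys_reg(vcpu, r->reg);

	return true;
}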


Thanks
Suzuki

2021-09-23 15:15:40

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling Extension (SPE) support

Hi Suzuki,

Thank you for having a look!

On 9/22/21 11:11, Suzuki K Poulose wrote:
> On 25/08/2021 17:17, Alexandru Elisei wrote:
>> This is v4 of the SPE series posted at [1]. v2 can be found at [2], and the
>> original series at [3].
>>
>> Statistical Profiling Extension (SPE) is an optional feature added in
>> ARMv8.2. It allows sampling at regular intervals of the operations executed
>> by the PE and storing a record of each operation in a memory buffer. A high
>> level overview of the extension is presented in an article on arm.com [4].
>>
>> This is another complete rewrite of the series, and nothing is set in
>> stone. If you think of a better way to do things, please suggest it.
>>
>>
>> Features added
>> ==============
>>
>> The rewrite enabled me to add support for several features not
>> present in the previous iteration:
>>
>> - Support for heterogeneous systems, where only some of the CPUs support SPE.
>>    This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.
>>
>> - Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
>>    VCPU ioctl.
>>
>> - The requirement for userspace to mlock() the guest memory has been removed,
>>    and now userspace can make changes to memory contents after the memory is
>>    mapped at stage 2.
>>
>> - Better debugging of guest memory pinning by printing a warning when we
>>    get an unexpected read or write fault. This helped me catch several bugs
>>    during development, it has already proven very useful. Many thanks to
>>    James who suggested when reviewing v3.
>>
>>
>> Missing features
>> ================
>>
>> I've tried to keep the series as small as possible to make it easier to review,
>> while implementing the core functionality needed for the SPE emulation. As such,
>> I've chosen to not implement several features:
>>
>> - Host profiling a guest which has the SPE feature bit set (see open
>>    questions).
>>
>> - No errata workarounds have been implemented yet, and there are quite a few of
>>    them for Neoverse N1 and Neoverse V1.
>>
>> - Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work and I am
>>    investigating other ways to get around automatic numa balancing, like
>>    requiring userspace to disable it via set_mempolicy(). I am also going to
>>    look at how VFIO gets around it. Suggestions welcome.
>>
>> - There's plenty of room for optimization. Off the top of my head, using
>>    block mappings at stage 2, batch pinning of pages (similar to what VFIO
>>    does), optimize the way KVM keeps track of pinned pages (using a linked
>>    list triples the memory usage), context-switch the SPE registers on
>>    vcpu_load/vcpu_put on VHE if the host is not profiling, locking
>>    optimizations, etc, etc.
>>
>> - ...and others. I'm sure I'm missing at least a few things which are
>>    important for someone.
>>
>>
>> Known issues
>> ============
>>
>> This is an RFC, so keep in mind that almost definitely there will be scary
>> bugs. For example, below is a list of known issues which don't affect the
>> correctness of the emulation, and which I'm planning to fix in a future
>> iteration:
>>
>> - With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
>>    the VCPU executes the dcache clean pending ops.
>>
>> - With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
>>    kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
>>    VCPUs.
>>
>> This BUG statement can also be triggered with mainline. To reproduce it,
>> compile kvmtool from this branch [5] and follow the instruction in the
>> kvmtool commit message.
>>
>> One workaround could be to stop trying to lock all VCPUs when locking a
>> memslot and document the fact that it is required that no VCPUs are run
>> before the ioctl completes, otherwise bad things might happen to the VM.
>>
>>
>> Open questions
>> ==============
>>
>> 1. Implementing support for host profiling a guest with the SPE feature
>> means setting the profiling buffer owning regime to EL2. While that is in
>> effect,  PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
>> probes SPE during this time, the driver will fail; and the guest will be
>> able to determine when it is profiled. I see two options here:
>
> This doesn't mean the EL2 is owning the SPE. It only tells you that a
> higher level EL is owning the SPE. It could as well be EL3. (e.g, MDCR_EL3.NSPB
> == 0 or 1). So I think this is architecturally correct,
> as long as we trap the guest access to other SPE registers and inject
> and UNDEF.

You are right, I was wrong about the guest being able to determine when it is
profiled; I forgot that PMBIDR_EL1.P can also be 1 if EL3 is owning the
buffer.

But I don't understand why you are suggesting that KVM injects UNDEF in this case.
I was thinking that when the host is profiling the guest, KVM can store guest
writes to the SPE registers to the shadow copy of the registers. On a world switch
to guest, KVM will not restore the registers onto the hardware as to not interfere
with the profiling operation performed by the host. When profiling ends, KVM can
then let the guest use SPE again by restoring the guest register values onto the
hardware at every world switch and setting the buffer owning regime to EL1 again.
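
Roughly what I have in mind (completely untested, and the
spe_owned_by_host flag below is made up):

static void kvm_spe_restore_guest(struct kvm_vcpu *vcpu)
{
	/*
	 * Hypothetical flag, set while the host profiles this guest.
	 * Guest writes still land in the shadow registers via the
	 * sys_regs.c handlers, they just aren't propagated to the
	 * hardware until the host is done profiling.
	 */
	if (vcpu->kvm->arch.spe_owned_by_host)
		return;

	__spe_restore_guest_state_nvhe(vcpu, &vcpu->arch.ctxt);
}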

I put this as an open question because if the guest sees PMBIDR_EL1.P set when the
guest SPE driver probes, the driver will likely decide that SPE is not available.
But the VM has been created with SPE enabled because the host userspace wants the
guest to have access to SPE, and the host userspace might not realize that the act
of profiling a guest can make SPE unusable by that guest. I am inclined now to let
the host's userspace do whatever it wishes and allow it to profile a guest if it
so desires, and mention this possible side effect in the KVM documentation, as it
might be surprising for someone who isn't familiar with the inner workings of
KVM's SPE emulation.

Thanks,

Alex

>
>
> Thanks
> Suzuki

2021-10-18 03:44:35

by Marc Zyngier

[permalink] [raw]
Subject: Re: [RFC PATCH v4 04/39] KVM: arm64: Defer CMOs for locked memslots until a VCPU is run

On Wed, 25 Aug 2021 17:17:40 +0100,
Alexandru Elisei <[email protected]> wrote:
>
> KVM relies on doing dcache maintenance on stage 2 faults to present to a
> guest running with the MMU off the same view of memory as userspace. For
> locked memslots, KVM so far has done the dcache maintenance when a memslot
> is locked, but that leaves KVM in a rather awkward position: what userspace
> writes to guest memory after the memslot is locked, but before a VCPU is
> run, might not be visible to the guest.
>
> Fix this by deferring the dcache maintenance until the first VCPU is run.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 7 ++++
> arch/arm64/include/asm/kvm_mmu.h | 5 +++
> arch/arm64/kvm/arm.c | 3 ++
> arch/arm64/kvm/mmu.c | 56 ++++++++++++++++++++++++++++---
> 4 files changed, 67 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 97ff3ed5d4b7..ed67f914d169 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -112,6 +112,10 @@ struct kvm_arch_memory_slot {
> u32 flags;
> };
>
> +/* kvm->arch.mmu_pending_ops flags */
> +#define KVM_LOCKED_MEMSLOT_FLUSH_DCACHE 0
> +#define KVM_MAX_MMU_PENDING_OPS 1
> +
> struct kvm_arch {
> struct kvm_s2_mmu mmu;
>
> @@ -135,6 +139,9 @@ struct kvm_arch {
> */
> bool return_nisv_io_abort_to_user;
>
> + /* Defer MMU operations until a VCPU is run. */
> + unsigned long mmu_pending_ops;

This has a funny taste of VM-wide requests...

> +
> /*
> * VM-wide PMU filter, implemented as a bitmap and big enough for
> * up to 2^10 events (ARMv8.0) or 2^16 events (ARMv8.1+).
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index ef079b5eb475..525c223e769f 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -219,6 +219,11 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
> int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags);
> int kvm_mmu_unlock_memslot(struct kvm *kvm, u64 slot, u64 flags);
>
> +#define kvm_mmu_has_pending_ops(kvm) \
> + (!bitmap_empty(&(kvm)->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS))
> +
> +void kvm_mmu_perform_pending_ops(struct kvm *kvm);
> +
> static inline unsigned int kvm_get_vmid_bits(void)
> {
> int reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index efb3501c6016..144c982912d8 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -829,6 +829,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> if (unlikely(!kvm_vcpu_initialized(vcpu)))
> return -ENOEXEC;
>
> + if (unlikely(kvm_mmu_has_pending_ops(vcpu->kvm)))
> + kvm_mmu_perform_pending_ops(vcpu->kvm);
> +
> ret = kvm_vcpu_first_run_init(vcpu);

Is there any reason why this isn't done as part of the 'first run'
handling? I am refactoring that part to remove as many things as
possible from the fast path, and would love not to go back to piling
more stuff here.

Or do you expect this to happen more than once per VM (despite what
the comment says)?

> if (ret)
> return ret;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 59c2bfef2fd1..94fa08f3d9d3 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1253,6 +1253,41 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> return ret;
> }
>
> +/*
> + * It's safe to do the CMOs when the first VCPU is run because:
> + * - VCPUs cannot run until mmu_cmo_needed is cleared.
> + * - Memslots cannot be modified because we hold the kvm->slots_lock.

It would be good to document the expected locking order for this kind
of stuff.

> + *
> + * It's safe to periodically release the mmu_lock because:
> + * - VCPUs cannot run.
> + * - Any changes to the stage 2 tables triggered by the MMU notifiers also take
> + * the mmu_lock, which means accesses will be serialized.
> + * - Stage 2 tables cannot be freed from under us as long as at least one VCPU
> + * is live, which means that the VM will be live.
> + */
> +void kvm_mmu_perform_pending_ops(struct kvm *kvm)
> +{
> + struct kvm_memory_slot *memslot;
> +
> + mutex_lock(&kvm->slots_lock);
> + if (!kvm_mmu_has_pending_ops(kvm))
> + goto out_unlock;
> +
> + if (test_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops)) {
> + kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> + if (!memslot_is_locked(memslot))
> + continue;
> + stage2_flush_memslot(kvm, memslot);
> + }
> + }
> +
> + bitmap_zero(&kvm->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS);

clear_bit() instead? I understand that you want to support multiple
ops, but this looks odd.

> +
> +out_unlock:
> + mutex_unlock(&kvm->slots_lock);
> + return;
> +}
> +
> static int try_rlimit_memlock(unsigned long npages)
> {
> unsigned long lock_limit;
> @@ -1293,7 +1328,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> struct kvm_memory_slot_page *page_entry;
> bool writable = flags & KVM_ARM_LOCK_MEM_WRITE;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> - struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + struct kvm_pgtable pgt;
> + struct kvm_pgtable_mm_ops mm_ops;
> struct vm_area_struct *vma;
> unsigned long npages = memslot->npages;
> unsigned int pin_flags = FOLL_LONGTERM;
> @@ -1311,6 +1347,16 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> pin_flags |= FOLL_WRITE;
> }
>
> + /*
> + * Make a copy of the stage 2 translation table struct to remove the
> + * dcache callback so we can postpone the cache maintenance operations
> + * until the first VCPU is run.
> + */
> + mm_ops = *kvm->arch.mmu.pgt->mm_ops;
> + mm_ops.dcache_clean_inval_poc = NULL;
> + pgt = *kvm->arch.mmu.pgt;
> + pgt.mm_ops = &mm_ops;

Huhuh... Can't really say I'm in love with this. Are you trying to
avoid a double dcache clean to PoC? Is this a performance or a
correctness issue?

> +
> hva = memslot->userspace_addr;
> ipa = memslot->base_gfn << PAGE_SHIFT;
>
> @@ -1362,13 +1408,13 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> goto out_err;
> }
>
> - ret = kvm_pgtable_stage2_map(pgt, ipa, PAGE_SIZE,
> + ret = kvm_pgtable_stage2_map(&pgt, ipa, PAGE_SIZE,
> page_to_phys(page_entry->page),
> prot, &cache);
> spin_unlock(&kvm->mmu_lock);
>
> if (ret) {
> - kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
> + kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
> i << PAGE_SHIFT);
> unpin_memslot_pages(memslot, writable);
> goto out_err;
> @@ -1387,7 +1433,7 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> */
> ret = account_locked_vm(current->mm, npages, true);
> if (ret) {
> - kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
> + kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
> npages << PAGE_SHIFT);
> unpin_memslot_pages(memslot, writable);
> goto out_err;
> @@ -1397,6 +1443,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> if (writable)
> memslot->arch.flags |= KVM_MEMSLOT_LOCK_WRITE;
>
> + set_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops);
> +
> kvm_mmu_free_memory_cache(&cache);
>
> return 0;

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-10-18 12:55:43

by Alexandru Elisei

[permalink] [raw]
Subject: Re: [RFC PATCH v4 04/39] KVM: arm64: Defer CMOs for locked memslots until a VCPU is run

Hi Marc,

On Sun, Oct 17, 2021 at 12:43:29PM +0100, Marc Zyngier wrote:
> On Wed, 25 Aug 2021 17:17:40 +0100,
> Alexandru Elisei <[email protected]> wrote:
> >
> > KVM relies on doing dcache maintenance on stage 2 faults to present to a
> > guest running with the MMU off the same view of memory as userspace. For
> > locked memslots, KVM so far has done the dcache maintenance when a memslot
> > is locked, but that leaves KVM in a rather awkward position: what userspace
> > writes to guest memory after the memslot is locked, but before a VCPU is
> > run, might not be visible to the guest.
> >
> > Fix this by deferring the dcache maintenance until the first VCPU is run.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 7 ++++
> > arch/arm64/include/asm/kvm_mmu.h | 5 +++
> > arch/arm64/kvm/arm.c | 3 ++
> > arch/arm64/kvm/mmu.c | 56 ++++++++++++++++++++++++++++---
> > 4 files changed, 67 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 97ff3ed5d4b7..ed67f914d169 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -112,6 +112,10 @@ struct kvm_arch_memory_slot {
> > u32 flags;
> > };
> >
> > +/* kvm->arch.mmu_pending_ops flags */
> > +#define KVM_LOCKED_MEMSLOT_FLUSH_DCACHE 0
> > +#define KVM_MAX_MMU_PENDING_OPS 1
> > +
> > struct kvm_arch {
> > struct kvm_s2_mmu mmu;
> >
> > @@ -135,6 +139,9 @@ struct kvm_arch {
> > */
> > bool return_nisv_io_abort_to_user;
> >
> > + /* Defer MMU operations until a VCPU is run. */
> > + unsigned long mmu_pending_ops;
>
> This has a funny taste of VM-wide requests...

I guess you can look at it that way, although the exact semantics of a VM-wide
request are very vague to me.

>
> > +
> > /*
> > * VM-wide PMU filter, implemented as a bitmap and big enough for
> > * up to 2^10 events (ARMv8.0) or 2^16 events (ARMv8.1+).
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index ef079b5eb475..525c223e769f 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -219,6 +219,11 @@ void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
> > int kvm_mmu_lock_memslot(struct kvm *kvm, u64 slot, u64 flags);
> > int kvm_mmu_unlock_memslot(struct kvm *kvm, u64 slot, u64 flags);
> >
> > +#define kvm_mmu_has_pending_ops(kvm) \
> > + (!bitmap_empty(&(kvm)->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS))
> > +
> > +void kvm_mmu_perform_pending_ops(struct kvm *kvm);
> > +
> > static inline unsigned int kvm_get_vmid_bits(void)
> > {
> > int reg = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index efb3501c6016..144c982912d8 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -829,6 +829,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > if (unlikely(!kvm_vcpu_initialized(vcpu)))
> > return -ENOEXEC;
> >
> > + if (unlikely(kvm_mmu_has_pending_ops(vcpu->kvm)))
> > + kvm_mmu_perform_pending_ops(vcpu->kvm);
> > +
> > ret = kvm_vcpu_first_run_init(vcpu);
>
> Is there any reason why this isn't done as part of the 'first run'
> handling? I am refactoring that part to remove as many things as
> possible from the fast path, and would love not to go back to piling
> more stuff here.
>
> Or do you expect this to happen more than once per VM (despite what
> the comment says)?

In theory, it can happen more than once per VM and KVM should allow it to
happen. I believe there is at least one valid use case for it: userspace
unlocks a memslot to perform migration (so KVM can write-protect stage 2),
decides to cancel migration, then locks the memslot again without destroying the
VM.

However, I didn't take this scenario into account in this series
(kvm_mmu_lock_memslot returns an error if one VCPU has vcpu->arch.has_run_once
true). I'll put it on my todo list for the next iteration.

>
> > if (ret)
> > return ret;
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 59c2bfef2fd1..94fa08f3d9d3 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1253,6 +1253,41 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> > return ret;
> > }
> >
> > +/*
> > + * It's safe to do the CMOs when the first VCPU is run because:
> > + * - VCPUs cannot run until mmu_cmo_needed is cleared.
> > + * - Memslots cannot be modified because we hold the kvm->slots_lock.
>
> It would be good to document the expected locking order for this kind
> of stuff.

I created this locking scheme with the assumption that a memslot is locked
before any VCPUs have run. I need to reconsider that for the use case I
mentioned above. I do agree that this needs to be very well documented.

I also want to try to get rid of lock_all_vcpus (which was
renamed to kvm_lock_all_vcpus in this series), because it triggers a lockdep
bug, which is also present in upstream KVM. Copied and pasted from the cover
letter (I got the number of VCPUs wrong, 47 is enough to trigger it):

"With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
VCPUs.

This BUG statement can also be triggered with mainline. To reproduce it,
compile kvmtool from this branch [1] and follow the instruction in the
kvmtool commit message".

[1] https://gitlab.arm.com/linux-arm/kvmtool-ae/-/tree/vgic-lock-all-vcpus-lockdep-bug-v1

Here's the dmesg on my rockpro64, with mainline Linux v5.15-rc6, with the
kvmtool patch above:

$ ./vm run -c47 -m256 -k ../kvm-unit-tests/arm/selftest.flat --nodefaults -p "setup smp=47 mem=256"

[..]
[ 136.888002] BUG: MAX_LOCK_DEPTH too low!
[ 136.888372] turning off the locking correctness validator.
[ 136.888859] depth: 48 max: 48!
[ 136.889140] 48 locks held by vm/401:
[ 136.889461] #0: ffff00000d621248 (&kvm->lock){+.+.}-{3:3}, at: vgic_v3_attr_regs_access+0x9c/0x250
[ 136.890288] #1: ffff0000098d80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.891032] #2: ffff00000b9980c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.891774] #3: ffff00000b8440c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.892516] #4: ffff00000d6c00c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.893259] #5: ffff00000aa600c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.894001] #6: ffff0000099540c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.894743] #7: ffff00000b8f00c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.895486] #8: ffff00000b9e00c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.896228] #9: ffff00000b8e40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.896970] #10: ffff00000b8dc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.897719] #11: ffff00000b8d40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.898468] #12: ffff00000b9cc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.899219] #13: ffff00000ca940c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.899968] #14: ffff00000c8f40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.900717] #15: ffff00000baa40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.901465] #16: ffff00000e98c0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.902214] #17: ffff00000e9dc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.902965] #18: ffff00000aacc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.903713] #19: ffff0000098c80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.904462] #20: ffff00000ab940c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.905211] #21: ffff00000aba40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.905961] #22: ffff00000e9e80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.906710] #23: ffff00000e9ec0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.907460] #24: ffff00000d4d80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.908209] #25: ffff00000d4dc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.908959] #26: ffff00000ab980c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.909708] #27: ffff00000ab9c0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.910458] #28: ffff00000ba9c0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.911206] #29: ffff00000b9dc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.911955] #30: ffff00000abb40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.912704] #31: ffff00000aab80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.913454] #32: ffff00000b9a00c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.914203] #33: ffff00000b9e80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.914952] #34: ffff00000ba000c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.915701] #35: ffff00000b8a80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.916451] #36: ffff00000ca080c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.917199] #37: ffff00000aba80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.917950] #38: ffff00000abac0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.918700] #39: ffff00000c9e80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.919450] #40: ffff00000c9ec0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.920199] #41: ffff00000d4e00c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.920948] #42: ffff00000d4e40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.921697] #43: ffff00000ca0c0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.922446] #44: ffff00000abf00c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.923195] #45: ffff00000abf40c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.923945] #46: ffff00000abf80c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.924693] #47: ffff00000abfc0c8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x2c/0xf0
[ 136.925443] INFO: lockdep is turned off.
[ 136.925795] CPU: 5 PID: 401 Comm: vm Not tainted 5.15.0-rc6 #103
[ 136.926331] Hardware name: Pine64 RockPro64 v2.0 (DT)
[ 136.926781] Call trace:
[ 136.927000] dump_backtrace+0x0/0x1a0
[ 136.927332] show_stack+0x18/0x24
[ 136.927630] dump_stack_lvl+0x8c/0xb8
[ 136.927960] dump_stack+0x18/0x34
[ 136.928258] __lock_acquire+0xbc8/0x20cc
[ 136.928611] lock_acquire.part.0+0xe0/0x230
[ 136.928987] lock_acquire+0x68/0x84
[ 136.929301] mutex_trylock+0xa8/0xf0
[ 136.929623] lock_all_vcpus+0x2c/0xf0
[ 136.929953] vgic_v3_attr_regs_access+0xb4/0x250
[ 136.930366] vgic_v3_get_attr+0x128/0x1e4
[ 136.930726] kvm_device_ioctl+0xd8/0x120
[ 136.931078] __arm64_sys_ioctl+0xa8/0xf0
[ 136.931432] invoke_syscall+0x48/0x114
[ 136.931771] el0_svc_common.constprop.0+0xfc/0x11c
[ 136.932200] do_el0_svc+0x28/0x90
[ 136.932499] el0_svc+0x54/0x130
[ 136.932782] el0t_64_sync_handler+0xa4/0x130
[ 136.933164] el0t_64_sync+0x1a0/0x1a4

The test completes correctly, and the lockdep splat is benign: lockdep can
only track MAX_LOCK_DEPTH (48) held locks per task, and lock_all_vcpus() ends
up holding more than that, so bumping MAX_LOCK_DEPTH in include/linux/sched.h
makes the splat go away. I would still like to avoid it if possible, though.
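
For the record, the debug-only change I used to check that was roughly the
below (the new value is arbitrary, it just needs to be larger than the number
of locks held while all the vcpu mutexes are taken):

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ ... @@
-#define MAX_LOCK_DEPTH			48UL
+#define MAX_LOCK_DEPTH			96UL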

>
> > + *
> > + * It's safe to periodically release the mmu_lock because:
> > + * - VCPUs cannot run.
> > + * - Any changes to the stage 2 tables triggered by the MMU notifiers also take
> > + * the mmu_lock, which means accesses will be serialized.
> > + * - Stage 2 tables cannot be freed from under us as long as at least one VCPU
> > + * is live, which means that the VM will be live.
> > + */
> > +void kvm_mmu_perform_pending_ops(struct kvm *kvm)
> > +{
> > + struct kvm_memory_slot *memslot;
> > +
> > + mutex_lock(&kvm->slots_lock);
> > + if (!kvm_mmu_has_pending_ops(kvm))
> > + goto out_unlock;
> > +
> > + if (test_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops)) {
> > + kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
> > + if (!memslot_is_locked(memslot))
> > + continue;
> > + stage2_flush_memslot(kvm, memslot);
> > + }
> > + }
> > +
> > + bitmap_zero(&kvm->arch.mmu_pending_ops, KVM_MAX_MMU_PENDING_OPS);
>
> clear_bit() instead? I understand that you want to support multiple
> ops, but this looks odd.

I can do that.
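
Something like this (untested, same names as in this patch):

	if (test_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops)) {
		kvm_for_each_memslot(memslot, kvm_memslots(kvm)) {
			if (!memslot_is_locked(memslot))
				continue;
			stage2_flush_memslot(kvm, memslot);
		}
		/* Clear only the op that was handled instead of the whole bitmap. */
		clear_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops);
	}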

>
> > +
> > +out_unlock:
> > + mutex_unlock(&kvm->slots_lock);
> > + return;
> > +}
> > +
> > static int try_rlimit_memlock(unsigned long npages)
> > {
> > unsigned long lock_limit;
> > @@ -1293,7 +1328,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > struct kvm_memory_slot_page *page_entry;
> > bool writable = flags & KVM_ARM_LOCK_MEM_WRITE;
> > enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > - struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> > + struct kvm_pgtable pgt;
> > + struct kvm_pgtable_mm_ops mm_ops;
> > struct vm_area_struct *vma;
> > unsigned long npages = memslot->npages;
> > unsigned int pin_flags = FOLL_LONGTERM;
> > @@ -1311,6 +1347,16 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > pin_flags |= FOLL_WRITE;
> > }
> >
> > + /*
> > + * Make a copy of the stage 2 translation table struct to remove the
> > + * dcache callback so we can postpone the cache maintenance operations
> > + * until the first VCPU is run.
> > + */
> > + mm_ops = *kvm->arch.mmu.pgt->mm_ops;
> > + mm_ops.dcache_clean_inval_poc = NULL;
> > + pgt = *kvm->arch.mmu.pgt;
> > + pgt.mm_ops = &mm_ops;
>
> Huhuh... Can't really say I'm in love with this. Are you trying to
> avoid a double dcache clean to PoC? Is this a performance or a
> correctness issue?

I am trying to avoid a double dcache clean to PoC for performance reasons. I
haven't measured it yet, but I expect that cleaning to PoC twice for VMs with
large amounts of memory will have a noticeable impact on performance. I can
run some tests and come back with a number if you wish.

My intention here was to create a copy of kvm->arch.mmu.pgt without the dcache
clean to PoC function pointer, and I agree that it doesn't look too pleasing.
I could instead temporarily set
kvm->arch.mmu.pgt->mm_ops->dcache_clean_inval_poc to NULL while the memslot is
being mapped, if that makes it more palatable (rough sketch below). I'm open
to suggestions.
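
Something along these lines is what I have in mind, assuming the callback
keeps its void (void *addr, size_t size) prototype (untested, and it modifies
the mm_ops structure in place, so any other stage 2 user would also skip the
dcache clean while the callback is NULL):

	struct kvm_pgtable_mm_ops *mm_ops = kvm->arch.mmu.pgt->mm_ops;
	void (*dcache_cb)(void *addr, size_t size) = mm_ops->dcache_clean_inval_poc;

	mm_ops->dcache_clean_inval_poc = NULL;

	/* ... pin the pages and map them at stage 2 ... */

	/* Restore the callback so regular stage 2 faults get the dcache clean. */
	mm_ops->dcache_clean_inval_poc = dcache_cb;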

One other option, if this approach looks too unpleasant or too convoluted, is
to leave the double dcache clean in place for now and figure out a way to
optimize it once the series has progressed some more.

Thanks for having a look!

Alex

>
> > +
> > hva = memslot->userspace_addr;
> > ipa = memslot->base_gfn << PAGE_SHIFT;
> >
> > @@ -1362,13 +1408,13 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > goto out_err;
> > }
> >
> > - ret = kvm_pgtable_stage2_map(pgt, ipa, PAGE_SIZE,
> > + ret = kvm_pgtable_stage2_map(&pgt, ipa, PAGE_SIZE,
> > page_to_phys(page_entry->page),
> > prot, &cache);
> > spin_unlock(&kvm->mmu_lock);
> >
> > if (ret) {
> > - kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
> > + kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
> > i << PAGE_SHIFT);
> > unpin_memslot_pages(memslot, writable);
> > goto out_err;
> > @@ -1387,7 +1433,7 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > */
> > ret = account_locked_vm(current->mm, npages, true);
> > if (ret) {
> > - kvm_pgtable_stage2_unmap(pgt, memslot->base_gfn << PAGE_SHIFT,
> > + kvm_pgtable_stage2_unmap(&pgt, memslot->base_gfn << PAGE_SHIFT,
> > npages << PAGE_SHIFT);
> > unpin_memslot_pages(memslot, writable);
> > goto out_err;
> > @@ -1397,6 +1443,8 @@ static int lock_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot,
> > if (writable)
> > memslot->arch.flags |= KVM_MEMSLOT_LOCK_WRITE;
> >
> > + set_bit(KVM_LOCKED_MEMSLOT_FLUSH_DCACHE, &kvm->arch.mmu_pending_ops);
> > +
> > kvm_mmu_free_memory_cache(&cache);
> >
> > return 0;
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.