2023-01-27 11:31:34

by Steven Price

Subject: [RFC PATCH 00/28] arm64: Support for Arm CCA in KVM

This series is an RFC adding support for running protected VMs using KVM
under the new Arm Confidential Compute Architecture (CCA). The purpose
of this series is to gather feedback on the proposed changes to the
architecture code for CCA.

The user ABI is not in its final form; we plan to make use of the
memfd_restricted() allocator[1] and associated infrastructure, which
will avoid problems in the current user ABI where a malicious VMM may be
able to cause a Granule Protection Fault in the kernel (which is fatal).

The ABI to the RMM (the RMI) is based on the Beta 0 specification[2] and
will be updated in the future when a final version of the specification
is published.

This series is based on v6.2-rc1. It is also available as a git
repository:

https://gitlab.arm.com/linux-arm/linux-cca cca-host/rfc-v1

Introduction
============
A more general introduction to Arm CCA is available on the Arm
website[3], and links to the other components involved are available in
the overall cover letter[4].

Arm Confidential Compute Architecture adds two new 'worlds' to the
architecture: Root and Realm. A new software component known as the RMM
(Realm Management Monitor) runs in Realm EL2 and is trusted by both the
Normal World and VMs running within Realms. This enables mutual
distrust between the Realm VMs and the Normal World.

Virtual machines running within a Realm can decide on a (4k)
page-by-page granularity whether to share a page with the (Normal World)
host or to keep it private (protected). This protection is provided by
the hardware, and attempts by the Normal World to access a page which
isn't shared will trigger a Granule Protection Fault. The series starts
by adding handling for these faults: faults within user space are
handled by killing the process, while faults within kernel space are
considered fatal.

The Normal World host can communicate with the RMM via an SMC interface
known as RMI (Realm Management Interface), and Realm VMs can communicate
with the RMM via another SMC interface known as RSI (Realm Services
Interface). This series adds wrappers for the full set of RMI commands
and uses them to manage the realm guests.

The Normal World can use RMI commands to delegate pages to the Realm
world and to create, manage and run Realm VMs. Once delegated, the pages
are inaccessible to the Normal World (unless explicitly shared by the
guest). However, the Normal World may destroy the Realm VM at any time
in order to reclaim (undelegate) the pages.
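
As a rough sketch of the host-side lifecycle, using the RMI wrapper
names introduced later in this series (error handling, the RTT/REC setup
steps and the rd_phys/params_phys/rec_phys/run_phys addresses are
omitted or assumed for brevity; this is an illustration, not the exact
code from the patches):

	/* Delegate a granule to hold the Realm Descriptor (RD) */
	rmi_granule_delegate(rd_phys);

	/* Create the realm, populate it, then activate it */
	rmi_realm_create(rd_phys, params_phys);
	/* ... rmi_rtt_create(), rmi_rec_create(), rmi_data_create() ... */
	rmi_realm_activate(rd_phys);

	/* Run a vCPU (REC); exit details come back via the rec_run structure */
	rmi_rec_enter(rec_phys, run_phys);

	/* Tear down and hand the memory back to the Normal World */
	rmi_realm_destroy(rd_phys);
	rmi_granule_undelegate(rd_phys);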

Entry/exit of a Realm VM reuses the existing KVM infrastructure where
possible, but ultimately the final mechanism is different, so this
series has a number of commits handling the differences. As much as
possible is placed in two new files: rme.c and rme-exit.c.

The RMM specification provides a new mechanism for a guest to
communicate with the host, which goes by the name "Host Call". For now
this is simply hooked up to the existing support for HVC calls from a
normal guest.

[1] https://lore.kernel.org/r/20221202061347.1070246-1-chao.p.peng%40linux.intel.com
[2] https://developer.arm.com/documentation/den0137/1-0bet0/
[3] https://www.arm.com/architecture/security-features/arm-confidential-compute-architecture
[4] .. cover letter ..

Joey Gouly (2):
arm64: rme: allow userspace to inject aborts
arm64: rme: support RSI_HOST_CALL

Steven Price (25):
arm64: RME: Handle Granule Protection Faults (GPFs)
arm64: RME: Add SMC definitions for calling the RMM
arm64: RME: Add wrappers for RMI calls
arm64: RME: Check for RME support at KVM init
arm64: RME: Define the user ABI
arm64: RME: ioctls to create and configure realms
arm64: kvm: Allow passing machine type in KVM creation
arm64: RME: Keep a spare page delegated to the RMM
arm64: RME: RTT handling
arm64: RME: Allocate/free RECs to match vCPUs
arm64: RME: Support for the VGIC in realms
KVM: arm64: Support timers in realm RECs
arm64: RME: Allow VMM to set RIPAS
arm64: RME: Handle realm enter/exit
KVM: arm64: Handle realm MMIO emulation
arm64: RME: Allow populating initial contents
arm64: RME: Runtime faulting of memory
KVM: arm64: Handle realm VCPU load
KVM: arm64: Validate register access for a Realm VM
KVM: arm64: Handle Realm PSCI requests
KVM: arm64: WARN on injected undef exceptions
arm64: Don't expose stolen time for realm guests
KVM: arm64: Allow activating realms
arm64: RME: Always use 4k pages for realms
HACK: Accept prototype RMI versions

Suzuki K Poulose (1):
arm64: rme: Allow checking SVE on VM instance

Documentation/virt/kvm/api.rst | 3 +
arch/arm64/include/asm/kvm_emulate.h | 29 +
arch/arm64/include/asm/kvm_host.h | 7 +
arch/arm64/include/asm/kvm_rme.h | 98 ++
arch/arm64/include/asm/rmi_cmds.h | 259 +++++
arch/arm64/include/asm/rmi_smc.h | 242 +++++
arch/arm64/include/asm/virt.h | 1 +
arch/arm64/include/uapi/asm/kvm.h | 63 ++
arch/arm64/kvm/Kconfig | 8 +
arch/arm64/kvm/Makefile | 3 +-
arch/arm64/kvm/arch_timer.c | 53 +-
arch/arm64/kvm/arm.c | 105 +-
arch/arm64/kvm/guest.c | 50 +
arch/arm64/kvm/inject_fault.c | 2 +
arch/arm64/kvm/mmio.c | 7 +
arch/arm64/kvm/mmu.c | 80 +-
arch/arm64/kvm/psci.c | 23 +
arch/arm64/kvm/reset.c | 41 +
arch/arm64/kvm/rme-exit.c | 194 ++++
arch/arm64/kvm/rme.c | 1453 ++++++++++++++++++++++++++
arch/arm64/kvm/vgic/vgic-v3.c | 9 +-
arch/arm64/kvm/vgic/vgic.c | 37 +-
arch/arm64/mm/fault.c | 29 +-
include/kvm/arm_arch_timer.h | 2 +
include/uapi/linux/kvm.h | 21 +-
25 files changed, 2772 insertions(+), 47 deletions(-)
create mode 100644 arch/arm64/include/asm/kvm_rme.h
create mode 100644 arch/arm64/include/asm/rmi_cmds.h
create mode 100644 arch/arm64/include/asm/rmi_smc.h
create mode 100644 arch/arm64/kvm/rme-exit.c
create mode 100644 arch/arm64/kvm/rme.c

--
2.34.1



2023-01-27 11:31:48

by Steven Price

Subject: [RFC PATCH 01/28] arm64: RME: Handle Granule Protection Faults (GPFs)

If the host attempts to access granules that have been delegated for use
in a realm, these accesses will be caught and will trigger a Granule
Protection Fault (GPF).

A fault during a page walk signals a bug in the kernel and is handled by
oopsing the kernel. A non-page-walk fault could be caused by user space
still having access to a page which the kernel has delegated; this
triggers a SIGBUS to allow debugging of why user space is trying to
access a delegated page.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/mm/fault.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 596f46dabe4e..fd84be115657 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -756,6 +756,25 @@ static int do_tag_check_fault(unsigned long far, unsigned long esr,
return 0;
}

+static int do_gpf_ptw(unsigned long far, unsigned long esr, struct pt_regs *regs)
+{
+ const struct fault_info *inf = esr_to_fault_info(esr);
+
+ die_kernel_fault(inf->name, far, esr, regs);
+ return 0;
+}
+
+static int do_gpf(unsigned long far, unsigned long esr, struct pt_regs *regs)
+{
+ const struct fault_info *inf = esr_to_fault_info(esr);
+
+ if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
+ return 0;
+
+ arm64_notify_die(inf->name, regs, inf->sig, inf->code, far, esr);
+ return 0;
+}
+
static const struct fault_info fault_info[] = {
{ do_bad, SIGKILL, SI_KERNEL, "ttbr address size fault" },
{ do_bad, SIGKILL, SI_KERNEL, "level 1 address size fault" },
@@ -793,11 +812,11 @@ static const struct fault_info fault_info[] = {
{ do_alignment_fault, SIGBUS, BUS_ADRALN, "alignment fault" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 34" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 35" },
- { do_bad, SIGKILL, SI_KERNEL, "unknown 36" },
- { do_bad, SIGKILL, SI_KERNEL, "unknown 37" },
- { do_bad, SIGKILL, SI_KERNEL, "unknown 38" },
- { do_bad, SIGKILL, SI_KERNEL, "unknown 39" },
- { do_bad, SIGKILL, SI_KERNEL, "unknown 40" },
+ { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 0" },
+ { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 1" },
+ { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 2" },
+ { do_gpf_ptw, SIGKILL, SI_KERNEL, "Granule Protection Fault at level 3" },
+ { do_gpf, SIGBUS, SI_KERNEL, "Granule Protection Fault not on table walk" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 41" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 42" },
{ do_bad, SIGKILL, SI_KERNEL, "unknown 43" },
--
2.34.1


2023-01-27 11:31:51

by Steven Price

Subject: [RFC PATCH 03/28] arm64: RME: Add wrappers for RMI calls

The wrappers make the call sites easier to read and deal with the
boilerplate of handling the error codes from the RMM.
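
As an illustration (example_delegate() is a hypothetical helper, not
part of this patch), a call site can decode the packed return value with
the RMI_RETURN_STATUS()/RMI_RETURN_INDEX() macros from rmi_smc.h:

	/* Hypothetical call site: delegate a granule and report failures */
	static int example_delegate(phys_addr_t phys)
	{
		unsigned long ret = rmi_granule_delegate(phys);

		if (RMI_RETURN_STATUS(ret) != RMI_SUCCESS) {
			pr_err("delegate failed: status %lu index %lu\n",
			       RMI_RETURN_STATUS(ret), RMI_RETURN_INDEX(ret));
			return -ENXIO;
		}

		return 0;
	}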

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/rmi_cmds.h | 259 ++++++++++++++++++++++++++++++
1 file changed, 259 insertions(+)
create mode 100644 arch/arm64/include/asm/rmi_cmds.h

diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
new file mode 100644
index 000000000000..d5468ee46f35
--- /dev/null
+++ b/arch/arm64/include/asm/rmi_cmds.h
@@ -0,0 +1,259 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#ifndef __ASM_RMI_CMDS_H
+#define __ASM_RMI_CMDS_H
+
+#include <linux/arm-smccc.h>
+
+#include <asm/rmi_smc.h>
+
+struct rtt_entry {
+ unsigned long walk_level;
+ unsigned long desc;
+ int state;
+ bool ripas;
+};
+
+static inline int rmi_data_create(unsigned long data, unsigned long rd,
+ unsigned long map_addr, unsigned long src,
+ unsigned long flags)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE, data, rd, map_addr, src,
+ flags, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_data_create_unknown(unsigned long data,
+ unsigned long rd,
+ unsigned long map_addr)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE_UNKNOWN, data, rd, map_addr,
+ &res);
+
+ return res.a0;
+}
+
+static inline int rmi_data_destroy(unsigned long rd, unsigned long map_addr)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_DATA_DESTROY, rd, map_addr, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_features(unsigned long index, unsigned long *out)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_FEATURES, index, &res);
+
+ *out = res.a1;
+ return res.a0;
+}
+
+static inline int rmi_granule_delegate(unsigned long phys)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_GRANULE_DELEGATE, phys, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_granule_undelegate(unsigned long phys)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_GRANULE_UNDELEGATE, phys, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_psci_complete(unsigned long calling_rec,
+ unsigned long target_rec)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_PSCI_COMPLETE, calling_rec, target_rec,
+ &res);
+
+ return res.a0;
+}
+
+static inline int rmi_realm_activate(unsigned long rd)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REALM_ACTIVATE, rd, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_realm_create(unsigned long rd, unsigned long params_ptr)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REALM_CREATE, rd, params_ptr, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_realm_destroy(unsigned long rd)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REALM_DESTROY, rd, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rec_aux_count(unsigned long rd, unsigned long *aux_count)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REC_AUX_COUNT, rd, &res);
+
+ *aux_count = res.a1;
+ return res.a0;
+}
+
+static inline int rmi_rec_create(unsigned long rec, unsigned long rd,
+ unsigned long params_ptr)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REC_CREATE, rec, rd, params_ptr, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rec_destroy(unsigned long rec)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REC_DESTROY, rec, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rec_enter(unsigned long rec, unsigned long run_ptr)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_REC_ENTER, rec, run_ptr, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_create(unsigned long rtt, unsigned long rd,
+ unsigned long map_addr, unsigned long level)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_CREATE, rtt, rd, map_addr, level,
+ &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_destroy(unsigned long rtt, unsigned long rd,
+ unsigned long map_addr, unsigned long level)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_DESTROY, rtt, rd, map_addr, level,
+ &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_fold(unsigned long rtt, unsigned long rd,
+ unsigned long map_addr, unsigned long level)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_FOLD, rtt, rd, map_addr, level, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_init_ripas(unsigned long rd, unsigned long map_addr,
+ unsigned long level)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_INIT_RIPAS, rd, map_addr, level, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_map_unprotected(unsigned long rd,
+ unsigned long map_addr,
+ unsigned long level,
+ unsigned long desc)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_MAP_UNPROTECTED, rd, map_addr, level,
+ desc, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_read_entry(unsigned long rd, unsigned long map_addr,
+ unsigned long level, struct rtt_entry *rtt)
+{
+ struct arm_smccc_1_2_regs regs = {
+ SMC_RMI_RTT_READ_ENTRY,
+ rd, map_addr, level
+ };
+
+ arm_smccc_1_2_smc(&regs, &regs);
+
+ rtt->walk_level = regs.a1;
+ rtt->state = regs.a2 & 0xFF;
+ rtt->desc = regs.a3;
+ rtt->ripas = regs.a4 & 1;
+
+ return regs.a0;
+}
+
+static inline int rmi_rtt_set_ripas(unsigned long rd, unsigned long rec,
+ unsigned long map_addr, unsigned long level,
+ unsigned long ripas)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_SET_RIPAS, rd, rec, map_addr, level,
+ ripas, &res);
+
+ return res.a0;
+}
+
+static inline int rmi_rtt_unmap_unprotected(unsigned long rd,
+ unsigned long map_addr,
+ unsigned long level)
+{
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_invoke(SMC_RMI_RTT_UNMAP_UNPROTECTED, rd, map_addr,
+ level, &res);
+
+ return res.a0;
+}
+
+static inline phys_addr_t rmi_rtt_get_phys(struct rtt_entry *rtt)
+{
+ return rtt->desc & GENMASK(47, 12);
+}
+
+#endif
--
2.34.1


2023-01-27 11:31:56

by Steven Price

Subject: [RFC PATCH 02/28] arm64: RME: Add SMC definitions for calling the RMM

The RMM (Realm Management Monitor) provides functionality that can be
accessed by SMC calls from the host.

The SMC definitions are based on DEN0137[1] version A-bet0.

[1] https://developer.arm.com/documentation/den0137/latest

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/rmi_smc.h | 235 +++++++++++++++++++++++++++++++
1 file changed, 235 insertions(+)
create mode 100644 arch/arm64/include/asm/rmi_smc.h

diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
new file mode 100644
index 000000000000..16ff65090f3a
--- /dev/null
+++ b/arch/arm64/include/asm/rmi_smc.h
@@ -0,0 +1,235 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#ifndef __ASM_RME_SMC_H
+#define __ASM_RME_SMC_H
+
+#include <linux/arm-smccc.h>
+
+#define SMC_RxI_CALL(func) \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_STANDARD, \
+ (func))
+
+/* FID numbers from alp10 specification */
+
+#define SMC_RMI_DATA_CREATE SMC_RxI_CALL(0x0153)
+#define SMC_RMI_DATA_CREATE_UNKNOWN SMC_RxI_CALL(0x0154)
+#define SMC_RMI_DATA_DESTROY SMC_RxI_CALL(0x0155)
+#define SMC_RMI_FEATURES SMC_RxI_CALL(0x0165)
+#define SMC_RMI_GRANULE_DELEGATE SMC_RxI_CALL(0x0151)
+#define SMC_RMI_GRANULE_UNDELEGATE SMC_RxI_CALL(0x0152)
+#define SMC_RMI_PSCI_COMPLETE SMC_RxI_CALL(0x0164)
+#define SMC_RMI_REALM_ACTIVATE SMC_RxI_CALL(0x0157)
+#define SMC_RMI_REALM_CREATE SMC_RxI_CALL(0x0158)
+#define SMC_RMI_REALM_DESTROY SMC_RxI_CALL(0x0159)
+#define SMC_RMI_REC_AUX_COUNT SMC_RxI_CALL(0x0167)
+#define SMC_RMI_REC_CREATE SMC_RxI_CALL(0x015a)
+#define SMC_RMI_REC_DESTROY SMC_RxI_CALL(0x015b)
+#define SMC_RMI_REC_ENTER SMC_RxI_CALL(0x015c)
+#define SMC_RMI_RTT_CREATE SMC_RxI_CALL(0x015d)
+#define SMC_RMI_RTT_DESTROY SMC_RxI_CALL(0x015e)
+#define SMC_RMI_RTT_FOLD SMC_RxI_CALL(0x0166)
+#define SMC_RMI_RTT_INIT_RIPAS SMC_RxI_CALL(0x0168)
+#define SMC_RMI_RTT_MAP_UNPROTECTED SMC_RxI_CALL(0x015f)
+#define SMC_RMI_RTT_READ_ENTRY SMC_RxI_CALL(0x0161)
+#define SMC_RMI_RTT_SET_RIPAS SMC_RxI_CALL(0x0169)
+#define SMC_RMI_RTT_UNMAP_UNPROTECTED SMC_RxI_CALL(0x0162)
+#define SMC_RMI_VERSION SMC_RxI_CALL(0x0150)
+
+#define RMI_ABI_MAJOR_VERSION 1
+#define RMI_ABI_MINOR_VERSION 0
+
+#define RMI_UNASSIGNED 0
+#define RMI_DESTROYED 1
+#define RMI_ASSIGNED 2
+#define RMI_TABLE 3
+#define RMI_VALID_NS 4
+
+#define RMI_ABI_VERSION_GET_MAJOR(version) ((version) >> 16)
+#define RMI_ABI_VERSION_GET_MINOR(version) ((version) & 0xFFFF)
+
+#define RMI_RETURN_STATUS(ret) ((ret) & 0xFF)
+#define RMI_RETURN_INDEX(ret) (((ret) >> 8) & 0xFF)
+
+#define RMI_SUCCESS 0
+#define RMI_ERROR_INPUT 1
+#define RMI_ERROR_REALM 2
+#define RMI_ERROR_REC 3
+#define RMI_ERROR_RTT 4
+#define RMI_ERROR_IN_USE 5
+
+#define RMI_EMPTY 0
+#define RMI_RAM 1
+
+#define RMI_NO_MEASURE_CONTENT 0
+#define RMI_MEASURE_CONTENT 1
+
+#define RMI_FEATURE_REGISTER_0_S2SZ GENMASK(7, 0)
+#define RMI_FEATURE_REGISTER_0_LPA2 BIT(8)
+#define RMI_FEATURE_REGISTER_0_SVE_EN BIT(9)
+#define RMI_FEATURE_REGISTER_0_SVE_VL GENMASK(13, 10)
+#define RMI_FEATURE_REGISTER_0_NUM_BPS GENMASK(17, 14)
+#define RMI_FEATURE_REGISTER_0_NUM_WPS GENMASK(21, 18)
+#define RMI_FEATURE_REGISTER_0_PMU_EN BIT(22)
+#define RMI_FEATURE_REGISTER_0_PMU_NUM_CTRS GENMASK(27, 23)
+#define RMI_FEATURE_REGISTER_0_HASH_SHA_256 BIT(28)
+#define RMI_FEATURE_REGISTER_0_HASH_SHA_512 BIT(29)
+
+struct realm_params {
+ union {
+ u64 features_0;
+ u8 padding_1[0x100];
+ };
+ union {
+ u8 measurement_algo;
+ u8 padding_2[0x300];
+ };
+ union {
+ u8 rpv[64];
+ u8 padding_3[0x400];
+ };
+ union {
+ struct {
+ u16 vmid;
+ u8 padding_4[6];
+ u64 rtt_base;
+ u64 rtt_level_start;
+ u32 rtt_num_start;
+ };
+ u8 padding_5[0x800];
+ };
+};
+
+/*
+ * The number of GPRs (starting from X0) that are
+ * configured by the host when a REC is created.
+ */
+#define REC_CREATE_NR_GPRS 8
+
+#define REC_PARAMS_FLAG_RUNNABLE BIT_ULL(0)
+
+#define REC_PARAMS_AUX_GRANULES 16
+
+struct rec_params {
+ union {
+ u64 flags;
+ u8 padding1[0x100];
+ };
+ union {
+ u64 mpidr;
+ u8 padding2[0x100];
+ };
+ union {
+ u64 pc;
+ u8 padding3[0x100];
+ };
+ union {
+ u64 gprs[REC_CREATE_NR_GPRS];
+ u8 padding4[0x500];
+ };
+ u64 num_rec_aux;
+ u64 aux[REC_PARAMS_AUX_GRANULES];
+};
+
+#define RMI_EMULATED_MMIO BIT(0)
+#define RMI_INJECT_SEA BIT(1)
+#define RMI_TRAP_WFI BIT(2)
+#define RMI_TRAP_WFE BIT(3)
+
+#define REC_RUN_GPRS 31
+#define REC_GIC_NUM_LRS 16
+
+struct rec_entry {
+ union { /* 0x000 */
+ u64 flags;
+ u8 padding0[0x200];
+ };
+ union { /* 0x200 */
+ u64 gprs[REC_RUN_GPRS];
+ u8 padding2[0x100];
+ };
+ union { /* 0x300 */
+ struct {
+ u64 gicv3_hcr;
+ u64 gicv3_lrs[REC_GIC_NUM_LRS];
+ };
+ u8 padding3[0x100];
+ };
+ u8 padding4[0x400];
+};
+
+struct rec_exit {
+ union { /* 0x000 */
+ u8 exit_reason;
+ u8 padding0[0x100];
+ };
+ union { /* 0x100 */
+ struct {
+ u64 esr;
+ u64 far;
+ u64 hpfar;
+ };
+ u8 padding1[0x100];
+ };
+ union { /* 0x200 */
+ u64 gprs[REC_RUN_GPRS];
+ u8 padding2[0x100];
+ };
+ union { /* 0x300 */
+ struct {
+ u64 gicv3_hcr;
+ u64 gicv3_lrs[REC_GIC_NUM_LRS];
+ u64 gicv3_misr;
+ u64 gicv3_vmcr;
+ };
+ u8 padding3[0x100];
+ };
+ union { /* 0x400 */
+ struct {
+ u64 cntp_ctl;
+ u64 cntp_cval;
+ u64 cntv_ctl;
+ u64 cntv_cval;
+ };
+ u8 padding4[0x100];
+ };
+ union { /* 0x500 */
+ struct {
+ u64 ripas_base;
+ u64 ripas_size;
+ u64 ripas_value; /* Only lowest bit */
+ };
+ u8 padding5[0x100];
+ };
+ union { /* 0x600 */
+ u16 imm;
+ u8 padding6[0x100];
+ };
+ union { /* 0x700 */
+ struct {
+ u64 pmu_ovf;
+ u64 pmu_intr_en;
+ u64 pmu_cntr_en;
+ };
+ u8 padding7[0x100];
+ };
+};
+
+struct rec_run {
+ struct rec_entry entry;
+ struct rec_exit exit;
+};
+
+#define RMI_EXIT_SYNC 0x00
+#define RMI_EXIT_IRQ 0x01
+#define RMI_EXIT_FIQ 0x02
+#define RMI_EXIT_PSCI 0x03
+#define RMI_EXIT_RIPAS_CHANGE 0x04
+#define RMI_EXIT_HOST_CALL 0x05
+#define RMI_EXIT_SERROR 0x06
+
+#endif
--
2.34.1


2023-01-27 11:31:59

by Steven Price

Subject: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init

Query the RMI version number and check if it is a compatible version. A
static key is also provided to signal that a supported RMM is available.

Functions are provided to query if a VM or VCPU is a realm (or REC);
these currently always return false.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
arch/arm64/include/asm/kvm_host.h | 4 +++
arch/arm64/include/asm/kvm_rme.h | 22 +++++++++++++
arch/arm64/include/asm/virt.h | 1 +
arch/arm64/kvm/Makefile | 3 +-
arch/arm64/kvm/arm.c | 8 +++++
arch/arm64/kvm/rme.c | 49 ++++++++++++++++++++++++++++
7 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 arch/arm64/include/asm/kvm_rme.h
create mode 100644 arch/arm64/kvm/rme.c

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 9bdba47f7e14..5a2b7229e83f 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -490,4 +490,21 @@ static inline bool vcpu_has_feature(struct kvm_vcpu *vcpu, int feature)
return test_bit(feature, vcpu->arch.features);
}

+static inline bool kvm_is_realm(struct kvm *kvm)
+{
+ if (static_branch_unlikely(&kvm_rme_is_available))
+ return kvm->arch.is_realm;
+ return false;
+}
+
+static inline enum realm_state kvm_realm_state(struct kvm *kvm)
+{
+ return READ_ONCE(kvm->arch.realm.state);
+}
+
+static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+
#endif /* __ARM64_KVM_EMULATE_H__ */
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 35a159d131b5..04347c3a8c6b 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -26,6 +26,7 @@
#include <asm/fpsimd.h>
#include <asm/kvm.h>
#include <asm/kvm_asm.h>
+#include <asm/kvm_rme.h>

#define __KVM_HAVE_ARCH_INTC_INITIALIZED

@@ -240,6 +241,9 @@ struct kvm_arch {
* the associated pKVM instance in the hypervisor.
*/
struct kvm_protected_vm pkvm;
+
+ bool is_realm;
+ struct realm realm;
};

struct kvm_vcpu_fault_info {
diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
new file mode 100644
index 000000000000..c26bc2c6770d
--- /dev/null
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#ifndef __ASM_KVM_RME_H
+#define __ASM_KVM_RME_H
+
+enum realm_state {
+ REALM_STATE_NONE,
+ REALM_STATE_NEW,
+ REALM_STATE_ACTIVE,
+ REALM_STATE_DYING
+};
+
+struct realm {
+ enum realm_state state;
+};
+
+int kvm_init_rme(void);
+
+#endif
diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
index 4eb601e7de50..be1383e26626 100644
--- a/arch/arm64/include/asm/virt.h
+++ b/arch/arm64/include/asm/virt.h
@@ -80,6 +80,7 @@ void __hyp_set_vectors(phys_addr_t phys_vector_base);
void __hyp_reset_vectors(void);

DECLARE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
+DECLARE_STATIC_KEY_FALSE(kvm_rme_is_available);

/* Reports the availability of HYP mode */
static inline bool is_hyp_mode_available(void)
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 5e33c2d4645a..d2f0400c50da 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -20,7 +20,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-v3.o vgic/vgic-v4.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
- vgic/vgic-its.o vgic/vgic-debug.o
+ vgic/vgic-its.o vgic/vgic-debug.o \
+ rme.o

kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 9c5573bc4614..d97b39d042ab 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -38,6 +38,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_mmu.h>
#include <asm/kvm_pkvm.h>
+#include <asm/kvm_rme.h>
#include <asm/kvm_emulate.h>
#include <asm/sections.h>

@@ -47,6 +48,7 @@

static enum kvm_mode kvm_mode = KVM_MODE_DEFAULT;
DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
+DEFINE_STATIC_KEY_FALSE(kvm_rme_is_available);

DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);

@@ -2213,6 +2215,12 @@ int kvm_arch_init(void *opaque)

in_hyp_mode = is_kernel_in_hyp_mode();

+ if (in_hyp_mode) {
+ err = kvm_init_rme();
+ if (err)
+ return err;
+ }
+
if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
cpus_have_final_cap(ARM64_WORKAROUND_1508412))
kvm_info("Guests without required CPU erratum workarounds can deadlock system!\n" \
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
new file mode 100644
index 000000000000..f6b587bc116e
--- /dev/null
+++ b/arch/arm64/kvm/rme.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/kvm_host.h>
+
+#include <asm/rmi_cmds.h>
+#include <asm/virt.h>
+
+static int rmi_check_version(void)
+{
+ struct arm_smccc_res res;
+ int version_major, version_minor;
+
+ arm_smccc_1_1_invoke(SMC_RMI_VERSION, &res);
+
+ if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
+ return -ENXIO;
+
+ version_major = RMI_ABI_VERSION_GET_MAJOR(res.a0);
+ version_minor = RMI_ABI_VERSION_GET_MINOR(res.a0);
+
+ if (version_major != RMI_ABI_MAJOR_VERSION) {
+ kvm_err("Unsupported RMI ABI (version %d.%d) we support %d\n",
+ version_major, version_minor,
+ RMI_ABI_MAJOR_VERSION);
+ return -ENXIO;
+ }
+
+ kvm_info("RMI ABI version %d.%d\n", version_major, version_minor);
+
+ return 0;
+}
+
+int kvm_init_rme(void)
+{
+ if (PAGE_SIZE != SZ_4K)
+ /* Only 4k page size on the host is supported */
+ return 0;
+
+ if (rmi_check_version())
+ /* Continue without realm support */
+ return 0;
+
+ /* Future patch will enable static branch kvm_rme_is_available */
+
+ return 0;
+}
--
2.34.1


2023-01-27 11:32:30

by Steven Price

Subject: [RFC PATCH 05/28] arm64: RME: Define the user ABI

There is one (multiplexed) CAP which can be used to configure, create,
populate and then activate the realm.
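
As a sketch of how a VMM might drive the multiplexed CAP from user
space (the ordering and vm_fd are assumptions for illustration; args[1]
carries a pointer to the config item, as handled in a later patch):

	/* Needs <sys/ioctl.h> and <linux/kvm.h> on the VMM side */
	struct kvm_cap_arm_rme_config_item cfg = {
		.cfg = KVM_CAP_ARM_RME_CFG_HASH_ALGO,
		.hash_algo = KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256,
	};
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ARM_RME,
		.args[0] = KVM_CAP_ARM_RME_CONFIG_REALM,
		.args[1] = (__u64)&cfg,
	};

	/* Configure the realm before it is created */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* Create the Realm Descriptor */
	cap.args[0] = KVM_CAP_ARM_RME_CREATE_RD;
	cap.args[1] = 0;
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

	/* ... KVM_CAP_ARM_RME_INIT_IPA_REALM / _POPULATE_REALM as needed ... */

	/* Finally activate the realm so vCPUs can run */
	cap.args[0] = KVM_CAP_ARM_RME_ACTIVATE_REALM;
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);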

Signed-off-by: Steven Price <[email protected]>
---
Documentation/virt/kvm/api.rst | 1 +
arch/arm64/include/uapi/asm/kvm.h | 63 +++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 2 +
3 files changed, 66 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0dd5d8733dd5..f1a59d6fb7fc 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4965,6 +4965,7 @@ Recognised values for feature:

===== ===========================================
arm64 KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)
+ arm64 KVM_ARM_VCPU_REC (requires KVM_CAP_ARM_RME)
===== ===========================================

Finalizes the configuration of the specified vcpu feature.
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index a7a857f1784d..fcc0b8dce29b 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -109,6 +109,7 @@ struct kvm_regs {
#define KVM_ARM_VCPU_SVE 4 /* enable SVE for this CPU */
#define KVM_ARM_VCPU_PTRAUTH_ADDRESS 5 /* VCPU uses address authentication */
#define KVM_ARM_VCPU_PTRAUTH_GENERIC 6 /* VCPU uses generic authentication */
+#define KVM_ARM_VCPU_REC 7 /* VCPU REC state as part of Realm */

struct kvm_vcpu_init {
__u32 target;
@@ -401,6 +402,68 @@ enum {
#define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
#define KVM_DEV_ARM_ITS_CTRL_RESET 4

+/* KVM_CAP_ARM_RME on VM fd */
+#define KVM_CAP_ARM_RME_CONFIG_REALM 0
+#define KVM_CAP_ARM_RME_CREATE_RD 1
+#define KVM_CAP_ARM_RME_INIT_IPA_REALM 2
+#define KVM_CAP_ARM_RME_POPULATE_REALM 3
+#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4
+
+#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256 0
+#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512 1
+
+#define KVM_CAP_ARM_RME_RPV_SIZE 64
+
+/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */
+#define KVM_CAP_ARM_RME_CFG_RPV 0
+#define KVM_CAP_ARM_RME_CFG_HASH_ALGO 1
+#define KVM_CAP_ARM_RME_CFG_SVE 2
+#define KVM_CAP_ARM_RME_CFG_DBG 3
+#define KVM_CAP_ARM_RME_CFG_PMU 4
+
+struct kvm_cap_arm_rme_config_item {
+ __u32 cfg;
+ union {
+ /* cfg == KVM_CAP_ARM_RME_CFG_RPV */
+ struct {
+ __u8 rpv[KVM_CAP_ARM_RME_RPV_SIZE];
+ };
+
+ /* cfg == KVM_CAP_ARM_RME_CFG_HASH_ALGO */
+ struct {
+ __u32 hash_algo;
+ };
+
+ /* cfg == KVM_CAP_ARM_RME_CFG_SVE */
+ struct {
+ __u32 sve_vq;
+ };
+
+ /* cfg == KVM_CAP_ARM_RME_CFG_DBG */
+ struct {
+ __u32 num_brps;
+ __u32 num_wrps;
+ };
+
+ /* cfg == KVM_CAP_ARM_RME_CFG_PMU */
+ struct {
+ __u32 num_pmu_cntrs;
+ };
+ /* Fix the size of the union */
+ __u8 reserved[256];
+ };
+};
+
+struct kvm_cap_arm_rme_populate_realm_args {
+ __u64 populate_ipa_base;
+ __u64 populate_ipa_size;
+};
+
+struct kvm_cap_arm_rme_init_ipa_args {
+ __u64 init_ipa_base;
+ __u64 init_ipa_size;
+};
+
/* Device Control API on vcpu fd */
#define KVM_ARM_VCPU_PMU_V3_CTRL 0
#define KVM_ARM_VCPU_PMU_V3_IRQ 0
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 20522d4ba1e0..fec1909e8b73 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1176,6 +1176,8 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
#define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225

+#define KVM_CAP_ARM_RME 300 // FIXME: Large number to prevent conflicts
+
#ifdef KVM_CAP_IRQ_ROUTING

struct kvm_irq_routing_irqchip {
--
2.34.1


2023-01-27 11:32:31

by Steven Price

Subject: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

Add the KVM_CAP_ARM_RME_CREATE_RD ioctl to create a realm. This involves
delegating pages to the RMM to hold the Realm Descriptor (RD) and the
base level of the Realm Translation Tables (RTT). A VMID also needs to
be picked; since the RMM has a separate VMID address space, a dedicated
allocator is added for this purpose.

KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
before it is created.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 14 ++
arch/arm64/kvm/arm.c | 19 ++
arch/arm64/kvm/mmu.c | 6 +
arch/arm64/kvm/reset.c | 33 +++
arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
5 files changed, 429 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index c26bc2c6770d..055a22accc08 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -6,6 +6,8 @@
#ifndef __ASM_KVM_RME_H
#define __ASM_KVM_RME_H

+#include <uapi/linux/kvm.h>
+
enum realm_state {
REALM_STATE_NONE,
REALM_STATE_NEW,
@@ -15,8 +17,20 @@ enum realm_state {

struct realm {
enum realm_state state;
+
+ void *rd;
+ struct realm_params *params;
+
+ unsigned long num_aux;
+ unsigned int vmid;
+ unsigned int ia_bits;
};

int kvm_init_rme(void);
+u32 kvm_realm_ipa_limit(void);
+
+int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
+int kvm_init_realm_vm(struct kvm *kvm);
+void kvm_destroy_realm(struct kvm *kvm);

#endif
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index d97b39d042ab..50f54a63732a 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -103,6 +103,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = 0;
set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
break;
+ case KVM_CAP_ARM_RME:
+ if (!static_branch_unlikely(&kvm_rme_is_available))
+ return -EINVAL;
+ mutex_lock(&kvm->lock);
+ r = kvm_realm_enable_cap(kvm, cap);
+ mutex_unlock(&kvm->lock);
+ break;
default:
r = -EINVAL;
break;
@@ -172,6 +179,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
*/
kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit();

+ /* Initialise the realm bits after the generic bits are enabled */
+ if (kvm_is_realm(kvm)) {
+ ret = kvm_init_realm_vm(kvm);
+ if (ret)
+ goto err_free_cpumask;
+ }
+
return 0;

err_free_cpumask:
@@ -204,6 +218,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
kvm_destroy_vcpus(kvm);

kvm_unshare_hyp(kvm, kvm + 1);
+
+ kvm_destroy_realm(kvm);
}

int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
@@ -300,6 +316,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_ARM_PTRAUTH_GENERIC:
r = system_has_full_ptr_auth();
break;
+ case KVM_CAP_ARM_RME:
+ r = static_key_enabled(&kvm_rme_is_available);
+ break;
default:
r = 0;
}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 31d7fa4c7c14..d0f707767d05 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -840,6 +840,12 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
struct kvm_pgtable *pgt = NULL;

write_lock(&kvm->mmu_lock);
+ if (kvm_is_realm(kvm) &&
+ kvm_realm_state(kvm) != REALM_STATE_DYING) {
+ /* TODO: teardown rtts */
+ write_unlock(&kvm->mmu_lock);
+ return;
+ }
pgt = mmu->pgt;
if (pgt) {
mmu->pgd_phys = 0;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index e0267f672b8a..c165df174737 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -395,3 +395,36 @@ int kvm_set_ipa_limit(void)

return 0;
}
+
+int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
+{
+ u64 mmfr0, mmfr1;
+ u32 phys_shift;
+ u32 ipa_limit = kvm_ipa_limit;
+
+ if (kvm_is_realm(kvm))
+ ipa_limit = kvm_realm_ipa_limit();
+
+ if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+ return -EINVAL;
+
+ phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
+ if (phys_shift) {
+ if (phys_shift > ipa_limit ||
+ phys_shift < ARM64_MIN_PARANGE_BITS)
+ return -EINVAL;
+ } else {
+ phys_shift = KVM_PHYS_SHIFT;
+ if (phys_shift > ipa_limit) {
+ pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n",
+ current->comm);
+ return -EINVAL;
+ }
+ }
+
+ mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
+
+ return 0;
+}
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index f6b587bc116e..9f8c5a91b8fc 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -5,9 +5,49 @@

#include <linux/kvm_host.h>

+#include <asm/kvm_emulate.h>
+#include <asm/kvm_mmu.h>
#include <asm/rmi_cmds.h>
#include <asm/virt.h>

+/************ FIXME: Copied from kvm/hyp/pgtable.c **********/
+#include <asm/kvm_pgtable.h>
+
+struct kvm_pgtable_walk_data {
+ struct kvm_pgtable *pgt;
+ struct kvm_pgtable_walker *walker;
+
+ u64 addr;
+ u64 end;
+};
+
+static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
+{
+ u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
+ u64 mask = BIT(pgt->ia_bits) - 1;
+
+ return (addr & mask) >> shift;
+}
+
+static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
+{
+ struct kvm_pgtable pgt = {
+ .ia_bits = ia_bits,
+ .start_level = start_level,
+ };
+
+ return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
+}
+
+/******************/
+
+static unsigned long rmm_feat_reg0;
+
+static bool rme_supports(unsigned long feature)
+{
+ return !!u64_get_bits(rmm_feat_reg0, feature);
+}
+
static int rmi_check_version(void)
{
struct arm_smccc_res res;
@@ -33,8 +73,319 @@ static int rmi_check_version(void)
return 0;
}

+static unsigned long create_realm_feat_reg0(struct kvm *kvm)
+{
+ unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
+ u64 feat_reg0 = 0;
+
+ int num_bps = u64_get_bits(rmm_feat_reg0,
+ RMI_FEATURE_REGISTER_0_NUM_BPS);
+ int num_wps = u64_get_bits(rmm_feat_reg0,
+ RMI_FEATURE_REGISTER_0_NUM_WPS);
+
+ feat_reg0 |= u64_encode_bits(ia_bits, RMI_FEATURE_REGISTER_0_S2SZ);
+ feat_reg0 |= u64_encode_bits(num_bps, RMI_FEATURE_REGISTER_0_NUM_BPS);
+ feat_reg0 |= u64_encode_bits(num_wps, RMI_FEATURE_REGISTER_0_NUM_WPS);
+
+ return feat_reg0;
+}
+
+u32 kvm_realm_ipa_limit(void)
+{
+ return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
+}
+
+static u32 get_start_level(struct kvm *kvm)
+{
+ long sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, kvm->arch.vtcr);
+
+ return VTCR_EL2_TGRAN_SL0_BASE - sl0;
+}
+
+static int realm_create_rd(struct kvm *kvm)
+{
+ struct realm *realm = &kvm->arch.realm;
+ struct realm_params *params = realm->params;
+ void *rd = NULL;
+ phys_addr_t rd_phys, params_phys;
+ struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
+ unsigned int pgd_sz;
+ int i, r;
+
+ if (WARN_ON(realm->rd) || WARN_ON(!realm->params))
+ return -EEXIST;
+
+ rd = (void *)__get_free_page(GFP_KERNEL);
+ if (!rd)
+ return -ENOMEM;
+
+ rd_phys = virt_to_phys(rd);
+ if (rmi_granule_delegate(rd_phys)) {
+ r = -ENXIO;
+ goto out;
+ }
+
+ pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
+ for (i = 0; i < pgd_sz; i++) {
+ phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
+
+ if (rmi_granule_delegate(pgd_phys)) {
+ r = -ENXIO;
+ goto out_undelegate_tables;
+ }
+ }
+
+ params->rtt_level_start = get_start_level(kvm);
+ params->rtt_num_start = pgd_sz;
+ params->rtt_base = kvm->arch.mmu.pgd_phys;
+ params->vmid = realm->vmid;
+
+ params_phys = virt_to_phys(params);
+
+ if (rmi_realm_create(rd_phys, params_phys)) {
+ r = -ENXIO;
+ goto out_undelegate_tables;
+ }
+
+ realm->rd = rd;
+ realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
+
+ if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
+ WARN_ON(rmi_realm_destroy(rd_phys));
+ goto out_undelegate_tables;
+ }
+
+ return 0;
+
+out_undelegate_tables:
+ while (--i >= 0) {
+ phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
+
+ WARN_ON(rmi_granule_undelegate(pgd_phys));
+ }
+ WARN_ON(rmi_granule_undelegate(rd_phys));
+out:
+ free_page((unsigned long)rd);
+ return r;
+}
+
+/* Protects access to rme_vmid_bitmap */
+static DEFINE_SPINLOCK(rme_vmid_lock);
+static unsigned long *rme_vmid_bitmap;
+
+static int rme_vmid_init(void)
+{
+ unsigned int vmid_count = 1 << kvm_get_vmid_bits();
+
+ rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
+ if (!rme_vmid_bitmap) {
+ kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static int rme_vmid_reserve(void)
+{
+ int ret;
+ unsigned int vmid_count = 1 << kvm_get_vmid_bits();
+
+ spin_lock(&rme_vmid_lock);
+ ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
+ spin_unlock(&rme_vmid_lock);
+
+ return ret;
+}
+
+static void rme_vmid_release(unsigned int vmid)
+{
+ spin_lock(&rme_vmid_lock);
+ bitmap_release_region(rme_vmid_bitmap, vmid, 0);
+ spin_unlock(&rme_vmid_lock);
+}
+
+static int kvm_create_realm(struct kvm *kvm)
+{
+ struct realm *realm = &kvm->arch.realm;
+ int ret;
+
+ if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
+ return -EEXIST;
+
+ ret = rme_vmid_reserve();
+ if (ret < 0)
+ return ret;
+ realm->vmid = ret;
+
+ ret = realm_create_rd(kvm);
+ if (ret) {
+ rme_vmid_release(realm->vmid);
+ return ret;
+ }
+
+ WRITE_ONCE(realm->state, REALM_STATE_NEW);
+
+ /* The realm is up, free the parameters. */
+ free_page((unsigned long)realm->params);
+ realm->params = NULL;
+
+ return 0;
+}
+
+static int config_realm_hash_algo(struct realm *realm,
+ struct kvm_cap_arm_rme_config_item *cfg)
+{
+ switch (cfg->hash_algo) {
+ case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
+ if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
+ return -EINVAL;
+ break;
+ case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
+ if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ realm->params->measurement_algo = cfg->hash_algo;
+ return 0;
+}
+
+static int config_realm_sve(struct realm *realm,
+ struct kvm_cap_arm_rme_config_item *cfg)
+{
+ u64 features_0 = realm->params->features_0;
+ int max_sve_vq = u64_get_bits(rmm_feat_reg0,
+ RMI_FEATURE_REGISTER_0_SVE_VL);
+
+ if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
+ return -EINVAL;
+
+ if (cfg->sve_vq > max_sve_vq)
+ return -EINVAL;
+
+ features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
+ RMI_FEATURE_REGISTER_0_SVE_VL);
+ features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
+ features_0 |= u64_encode_bits(cfg->sve_vq,
+ RMI_FEATURE_REGISTER_0_SVE_VL);
+
+ realm->params->features_0 = features_0;
+ return 0;
+}
+
+static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ struct kvm_cap_arm_rme_config_item cfg;
+ struct realm *realm = &kvm->arch.realm;
+ int r = 0;
+
+ if (kvm_realm_state(kvm) != REALM_STATE_NONE)
+ return -EBUSY;
+
+ if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
+ return -EFAULT;
+
+ switch (cfg.cfg) {
+ case KVM_CAP_ARM_RME_CFG_RPV:
+ memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
+ break;
+ case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
+ r = config_realm_hash_algo(realm, &cfg);
+ break;
+ case KVM_CAP_ARM_RME_CFG_SVE:
+ r = config_realm_sve(realm, &cfg);
+ break;
+ default:
+ r = -EINVAL;
+ }
+
+ return r;
+}
+
+int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
+{
+ int r = 0;
+
+ switch (cap->args[0]) {
+ case KVM_CAP_ARM_RME_CONFIG_REALM:
+ r = kvm_rme_config_realm(kvm, cap);
+ break;
+ case KVM_CAP_ARM_RME_CREATE_RD:
+ if (kvm->created_vcpus) {
+ r = -EBUSY;
+ break;
+ }
+
+ r = kvm_create_realm(kvm);
+ break;
+ default:
+ r = -EINVAL;
+ break;
+ }
+
+ return r;
+}
+
+void kvm_destroy_realm(struct kvm *kvm)
+{
+ struct realm *realm = &kvm->arch.realm;
+ struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
+ unsigned int pgd_sz;
+ int i;
+
+ if (realm->params) {
+ free_page((unsigned long)realm->params);
+ realm->params = NULL;
+ }
+
+ if (kvm_realm_state(kvm) == REALM_STATE_NONE)
+ return;
+
+ WRITE_ONCE(realm->state, REALM_STATE_DYING);
+
+ rme_vmid_release(realm->vmid);
+
+ if (realm->rd) {
+ phys_addr_t rd_phys = virt_to_phys(realm->rd);
+
+ if (WARN_ON(rmi_realm_destroy(rd_phys)))
+ return;
+ if (WARN_ON(rmi_granule_undelegate(rd_phys)))
+ return;
+ free_page((unsigned long)realm->rd);
+ realm->rd = NULL;
+ }
+
+ pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
+ for (i = 0; i < pgd_sz; i++) {
+ phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
+
+ if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
+ return;
+ }
+
+ kvm_free_stage2_pgd(&kvm->arch.mmu);
+}
+
+int kvm_init_realm_vm(struct kvm *kvm)
+{
+ struct realm_params *params;
+
+ params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
+ if (!params)
+ return -ENOMEM;
+
+ params->features_0 = create_realm_feat_reg0(kvm);
+ kvm->arch.realm.params = params;
+ return 0;
+}
+
int kvm_init_rme(void)
{
+ int ret;
+
if (PAGE_SIZE != SZ_4K)
/* Only 4k page size on the host is supported */
return 0;
@@ -43,6 +394,12 @@ int kvm_init_rme(void)
/* Continue without realm support */
return 0;

+ ret = rme_vmid_init();
+ if (ret)
+ return ret;
+
+ WARN_ON(rmi_features(0, &rmm_feat_reg0));
+
/* Future patch will enable static branch kvm_rme_is_available */

return 0;
--
2.34.1


2023-01-27 11:32:39

by Steven Price

Subject: [RFC PATCH 12/28] KVM: arm64: Support timers in realm RECs

The RMM keeps track of the timer while the realm REC is running, but on
exit to the normal world KVM is responsible for handling the timers.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/arch_timer.c | 53 ++++++++++++++++++++++++++++++++----
include/kvm/arm_arch_timer.h | 2 ++
2 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
index bb24a76b4224..d4af9ee58550 100644
--- a/arch/arm64/kvm/arch_timer.c
+++ b/arch/arm64/kvm/arch_timer.c
@@ -130,6 +130,11 @@ static void timer_set_offset(struct arch_timer_context *ctxt, u64 offset)
{
struct kvm_vcpu *vcpu = ctxt->vcpu;

+ if (kvm_is_realm(vcpu->kvm)) {
+ WARN_ON(offset);
+ return;
+ }
+
switch(arch_timer_ctx_index(ctxt)) {
case TIMER_VTIMER:
__vcpu_sys_reg(vcpu, CNTVOFF_EL2) = offset;
@@ -411,6 +416,21 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
}
}

+void kvm_realm_timers_update(struct kvm_vcpu *vcpu)
+{
+ struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu;
+ int i;
+
+ for (i = 0; i < NR_KVM_TIMERS; i++) {
+ struct arch_timer_context *timer = &arch_timer->timers[i];
+ bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT;
+ bool level = kvm_timer_irq_can_fire(timer) && status;
+
+ if (level != timer->irq.level)
+ kvm_timer_update_irq(vcpu, level, timer);
+ }
+}
+
/* Only called for a fully emulated timer */
static void timer_emulate(struct arch_timer_context *ctx)
{
@@ -621,6 +641,11 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
if (unlikely(!timer->enabled))
return;

+ kvm_timer_unblocking(vcpu);
+
+ if (vcpu_is_rec(vcpu))
+ return;
+
get_timer_map(vcpu, &map);

if (static_branch_likely(&has_gic_active_state)) {
@@ -633,8 +658,6 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)

set_cntvoff(timer_get_offset(map.direct_vtimer));

- kvm_timer_unblocking(vcpu);
-
timer_restore_state(map.direct_vtimer);
if (map.direct_ptimer)
timer_restore_state(map.direct_ptimer);
@@ -668,6 +691,9 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
if (unlikely(!timer->enabled))
return;

+ if (vcpu_is_rec(vcpu))
+ goto out;
+
get_timer_map(vcpu, &map);

timer_save_state(map.direct_vtimer);
@@ -686,9 +712,6 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
if (map.emul_ptimer)
soft_timer_cancel(&map.emul_ptimer->hrtimer);

- if (kvm_vcpu_is_blocking(vcpu))
- kvm_timer_blocking(vcpu);
-
/*
* The kernel may decide to run userspace after calling vcpu_put, so
* we reset cntvoff to 0 to ensure a consistent read between user
@@ -697,6 +720,11 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
* virtual offset of zero, so no need to zero CNTVOFF_EL2 register.
*/
set_cntvoff(0);
+
+out:
+ if (kvm_vcpu_is_blocking(vcpu))
+ kvm_timer_blocking(vcpu);
+
}

/*
@@ -785,12 +813,18 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
struct arch_timer_cpu *timer = vcpu_timer(vcpu);
struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
+ u64 cntvoff;

vtimer->vcpu = vcpu;
ptimer->vcpu = vcpu;

+ if (kvm_is_realm(vcpu->kvm))
+ cntvoff = 0;
+ else
+ cntvoff = kvm_phys_timer_read();
+
/* Synchronize cntvoff across all vtimers of a VM. */
- update_vtimer_cntvoff(vcpu, kvm_phys_timer_read());
+ update_vtimer_cntvoff(vcpu, cntvoff);
timer_set_offset(ptimer, 0);

hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
@@ -1265,6 +1299,13 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
return -EINVAL;
}

+ /*
+ * We don't use mapped IRQs for Realms because the RMI doesn't allow
+ * us setting the LR.HW bit in the VGIC.
+ */
+ if (vcpu_is_rec(vcpu))
+ return 0;
+
get_timer_map(vcpu, &map);

ret = kvm_vgic_map_phys_irq(vcpu,
diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
index cd6d8f260eab..158280e15a33 100644
--- a/include/kvm/arm_arch_timer.h
+++ b/include/kvm/arm_arch_timer.h
@@ -76,6 +76,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);

+void kvm_realm_timers_update(struct kvm_vcpu *vcpu);
+
u64 kvm_phys_timer_read(void);

void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
--
2.34.1


2023-01-27 11:32:41

by Steven Price

Subject: [RFC PATCH 07/28] arm64: kvm: Allow passing machine type in KVM creation

Previously the machine type was used purely for specifying the physical
address size of the guest. Reserve the higher bits to specify an
ARM-specific machine type and declare a new type, 'KVM_VM_TYPE_ARM_REALM',
used to create a realm guest.
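
A sketch of the VMM side (kvm_fd and ipa_bits are assumed to have been
set up elsewhere; shown only for illustration):

	/* Needs <sys/ioctl.h> and <linux/kvm.h> on the VMM side */
	/* Request a realm VM, optionally with an explicit IPA size */
	unsigned long vm_type = KVM_VM_TYPE_ARM_REALM |
				KVM_VM_TYPE_ARM_IPA_SIZE(ipa_bits);
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, vm_type);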

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/arm.c | 13 +++++++++++++
arch/arm64/kvm/mmu.c | 3 ---
arch/arm64/kvm/reset.c | 3 ---
include/uapi/linux/kvm.h | 19 +++++++++++++++----
4 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 50f54a63732a..badd775547b8 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -147,6 +147,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
{
int ret;

+ if (type & ~(KVM_VM_TYPE_ARM_MASK | KVM_VM_TYPE_ARM_IPA_SIZE_MASK))
+ return -EINVAL;
+
+ switch (type & KVM_VM_TYPE_ARM_MASK) {
+ case KVM_VM_TYPE_ARM_NORMAL:
+ break;
+ case KVM_VM_TYPE_ARM_REALM:
+ kvm->arch.is_realm = true;
+ break;
+ default:
+ return -EINVAL;
+ }
+
ret = kvm_share_hyp(kvm, kvm + 1);
if (ret)
return ret;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d0f707767d05..22c00274884a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -709,9 +709,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
u64 mmfr0, mmfr1;
u32 phys_shift;

- if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
- return -EINVAL;
-
phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
if (is_protected_kvm_enabled()) {
phys_shift = kvm_ipa_limit;
diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index c165df174737..9e71d69e051f 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -405,9 +405,6 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
if (kvm_is_realm(kvm))
ipa_limit = kvm_realm_ipa_limit();

- if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
- return -EINVAL;
-
phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
if (phys_shift) {
if (phys_shift > ipa_limit ||
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index fec1909e8b73..bcfc4d58dc19 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -898,14 +898,25 @@ struct kvm_ppc_resize_hpt {
#define KVM_S390_SIE_PAGE_OFFSET 1

/*
- * On arm64, machine type can be used to request the physical
- * address size for the VM. Bits[7-0] are reserved for the guest
- * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
- * value 0 implies the default IPA size, 40bits.
+ * On arm64, machine type can be used to request both the machine type and
+ * the physical address size for the VM.
+ *
+ * Bits[11-8] are reserved for the ARM specific machine type.
+ *
+ * Bits[7-0] are reserved for the guest PA size shift (i.e, log2(PA_Size)).
+ * For backward compatibility, value 0 implies the default IPA size, 40bits.
*/
+#define KVM_VM_TYPE_ARM_SHIFT 8
+#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
+#define KVM_VM_TYPE_ARM(_type) \
+ (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
+#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0)
+#define KVM_VM_TYPE_ARM_REALM KVM_VM_TYPE_ARM(1)
+
#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
+
/*
* ioctls for /dev/kvm fds:
*/
--
2.34.1


2023-01-27 11:32:50

by Steven Price

Subject: [RFC PATCH 15/28] KVM: arm64: Handle realm MMIO emulation

MMIO emulation for a realm cannot be done directly with the VM's
registers as they are protected from the host. However, the RMM
interface provides a structure member for passing the read/written
value; we can transfer this to the appropriate VCPU's register entry and
then rely on the generic MMIO handling code in KVM.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/mmio.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
index 3dd38a151d2a..c4879fa3a8d3 100644
--- a/arch/arm64/kvm/mmio.c
+++ b/arch/arm64/kvm/mmio.c
@@ -6,6 +6,7 @@

#include <linux/kvm_host.h>
#include <asm/kvm_emulate.h>
+#include <asm/rmi_smc.h>
#include <trace/events/kvm.h>

#include "trace.h"
@@ -109,6 +110,9 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
&data);
data = vcpu_data_host_to_guest(vcpu, data, len);
vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
+
+ if (vcpu_is_rec(vcpu))
+ vcpu->arch.rec.run->entry.gprs[0] = data;
}

/*
@@ -179,6 +183,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
run->mmio.len = len;
vcpu->mmio_needed = 1;

+ if (vcpu_is_rec(vcpu))
+ vcpu->arch.rec.run->entry.flags |= RMI_EMULATED_MMIO;
+
if (!ret) {
/* We handled the access successfully in the kernel. */
if (!is_write)
--
2.34.1


2023-01-27 11:32:55

by Steven Price

Subject: [RFC PATCH 08/28] arm64: RME: Keep a spare page delegated to the RMM

Pages can only be populated/destroyed on the RMM at the 4KB granule;
this requires creating the full depth of RTTs. However, if the pages are
going to be combined into a 2MB huge page the last level of RTT is only
temporarily needed. Similarly, when freeing memory the huge page must be
temporarily split, requiring temporary usage of the full depth of RTTs.

To avoid needing to perform a temporary allocation and delegation of a
page for this purpose we keep a spare delegated page around. In
particular this avoids the need for memory allocation while destroying
the realm guest.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 3 +++
arch/arm64/kvm/rme.c | 6 ++++++
2 files changed, 9 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index 055a22accc08..a6318af3ed11 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -21,6 +21,9 @@ struct realm {
void *rd;
struct realm_params *params;

+ /* A spare already delegated page */
+ phys_addr_t spare_page;
+
unsigned long num_aux;
unsigned int vmid;
unsigned int ia_bits;
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 9f8c5a91b8fc..0c9d70e4d9e6 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -148,6 +148,7 @@ static int realm_create_rd(struct kvm *kvm)
}

realm->rd = rd;
+ realm->spare_page = PHYS_ADDR_MAX;
realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);

if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
@@ -357,6 +358,11 @@ void kvm_destroy_realm(struct kvm *kvm)
free_page((unsigned long)realm->rd);
realm->rd = NULL;
}
+ if (realm->spare_page != PHYS_ADDR_MAX) {
+ if (!WARN_ON(rmi_granule_undelegate(realm->spare_page)))
+ free_page((unsigned long)phys_to_virt(realm->spare_page));
+ realm->spare_page = PHYS_ADDR_MAX;
+ }

pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
for (i = 0; i < pgd_sz; i++) {
--
2.34.1


2023-01-27 11:32:58

by Steven Price

Subject: [RFC PATCH 11/28] arm64: RME: Support for the VGIC in realms

The RMM provides emulation of a VGIC to the realm guest but delegates
much of the handling to the host. Implement support in KVM for
saving/restoring state to/from the REC structure.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/arm.c | 15 +++++++++++---
arch/arm64/kvm/vgic/vgic-v3.c | 9 +++++++--
arch/arm64/kvm/vgic/vgic.c | 37 +++++++++++++++++++++++++++++++++--
3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 52affed2f3cf..1b2547516f62 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -475,17 +475,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ kvm_timer_vcpu_put(vcpu);
+ kvm_vgic_put(vcpu);
+
+ vcpu->cpu = -1;
+
+ if (vcpu_is_rec(vcpu))
+ return;
+
kvm_arch_vcpu_put_debug_state_flags(vcpu);
kvm_arch_vcpu_put_fp(vcpu);
if (has_vhe())
kvm_vcpu_put_sysregs_vhe(vcpu);
- kvm_timer_vcpu_put(vcpu);
- kvm_vgic_put(vcpu);
kvm_vcpu_pmu_restore_host(vcpu);
kvm_arm_vmid_clear_active();

vcpu_clear_on_unsupported_cpu(vcpu);
- vcpu->cpu = -1;
}

void kvm_arm_vcpu_power_off(struct kvm_vcpu *vcpu)
@@ -623,6 +628,10 @@ int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
}

if (!irqchip_in_kernel(kvm)) {
+ /* Userspace irqchip not yet supported with Realms */
+ if (kvm_is_realm(vcpu->kvm))
+ return -EOPNOTSUPP;
+
/*
* Tell the rest of the code that there are userspace irqchip
* VMs in the wild.
diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
index 826ff6f2a4e7..121c7a68c397 100644
--- a/arch/arm64/kvm/vgic/vgic-v3.c
+++ b/arch/arm64/kvm/vgic/vgic-v3.c
@@ -6,9 +6,11 @@
#include <linux/kvm.h>
#include <linux/kvm_host.h>
#include <kvm/arm_vgic.h>
+#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
#include <asm/kvm_mmu.h>
#include <asm/kvm_asm.h>
+#include <asm/rmi_smc.h>

#include "vgic.h"

@@ -669,7 +671,8 @@ int vgic_v3_probe(const struct gic_kvm_info *info)
(unsigned long long)info->vcpu.start);
} else if (kvm_get_mode() != KVM_MODE_PROTECTED) {
kvm_vgic_global_state.vcpu_base = info->vcpu.start;
- kvm_vgic_global_state.can_emulate_gicv2 = true;
+ if (!static_branch_unlikely(&kvm_rme_is_available))
+ kvm_vgic_global_state.can_emulate_gicv2 = true;
ret = kvm_register_vgic_device(KVM_DEV_TYPE_ARM_VGIC_V2);
if (ret) {
kvm_err("Cannot register GICv2 KVM device.\n");
@@ -744,7 +747,9 @@ void vgic_v3_vmcr_sync(struct kvm_vcpu *vcpu)
{
struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;

- if (likely(cpu_if->vgic_sre))
+ if (vcpu_is_rec(vcpu))
+ cpu_if->vgic_vmcr = vcpu->arch.rec.run->exit.gicv3_vmcr;
+ else if (likely(cpu_if->vgic_sre))
cpu_if->vgic_vmcr = kvm_call_hyp_ret(__vgic_v3_read_vmcr);
}

diff --git a/arch/arm64/kvm/vgic/vgic.c b/arch/arm64/kvm/vgic/vgic.c
index d97e6080b421..bc77660f7051 100644
--- a/arch/arm64/kvm/vgic/vgic.c
+++ b/arch/arm64/kvm/vgic/vgic.c
@@ -10,7 +10,9 @@
#include <linux/list_sort.h>
#include <linux/nospec.h>

+#include <asm/kvm_emulate.h>
#include <asm/kvm_hyp.h>
+#include <asm/rmi_smc.h>

#include "vgic.h"

@@ -848,10 +850,23 @@ static inline bool can_access_vgic_from_kernel(void)
return !static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif) || has_vhe();
}

+static inline void vgic_rmm_save_state(struct kvm_vcpu *vcpu)
+{
+ struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
+ int i;
+
+ for (i = 0; i < kvm_vgic_global_state.nr_lr; i++) {
+ cpu_if->vgic_lr[i] = vcpu->arch.rec.run->exit.gicv3_lrs[i];
+ vcpu->arch.rec.run->entry.gicv3_lrs[i] = 0;
+ }
+}
+
static inline void vgic_save_state(struct kvm_vcpu *vcpu)
{
if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif))
vgic_v2_save_state(vcpu);
+ else if (vcpu_is_rec(vcpu))
+ vgic_rmm_save_state(vcpu);
else
__vgic_v3_save_state(&vcpu->arch.vgic_cpu.vgic_v3);
}
@@ -878,10 +893,28 @@ void kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu)
vgic_prune_ap_list(vcpu);
}

+static inline void vgic_rmm_restore_state(struct kvm_vcpu *vcpu)
+{
+ struct vgic_v3_cpu_if *cpu_if = &vcpu->arch.vgic_cpu.vgic_v3;
+ int i;
+
+ for (i = 0; i < kvm_vgic_global_state.nr_lr; i++) {
+ vcpu->arch.rec.run->entry.gicv3_lrs[i] = cpu_if->vgic_lr[i];
+ /*
+ * Also populate the rec.run->exit copies so that a late
+ * decision to back out from entering the realm doesn't cause
+ * the state to be lost
+ */
+ vcpu->arch.rec.run->exit.gicv3_lrs[i] = cpu_if->vgic_lr[i];
+ }
+}
+
static inline void vgic_restore_state(struct kvm_vcpu *vcpu)
{
if (!static_branch_unlikely(&kvm_vgic_global_state.gicv3_cpuif))
vgic_v2_restore_state(vcpu);
+ else if (vcpu_is_rec(vcpu))
+ vgic_rmm_restore_state(vcpu);
else
__vgic_v3_restore_state(&vcpu->arch.vgic_cpu.vgic_v3);
}
@@ -922,7 +955,7 @@ void kvm_vgic_flush_hwstate(struct kvm_vcpu *vcpu)

void kvm_vgic_load(struct kvm_vcpu *vcpu)
{
- if (unlikely(!vgic_initialized(vcpu->kvm)))
+ if (unlikely(!vgic_initialized(vcpu->kvm)) || vcpu_is_rec(vcpu))
return;

if (kvm_vgic_global_state.type == VGIC_V2)
@@ -933,7 +966,7 @@ void kvm_vgic_load(struct kvm_vcpu *vcpu)

void kvm_vgic_put(struct kvm_vcpu *vcpu)
{
- if (unlikely(!vgic_initialized(vcpu->kvm)))
+ if (unlikely(!vgic_initialized(vcpu->kvm)) || vcpu_is_rec(vcpu))
return;

if (kvm_vgic_global_state.type == VGIC_V2)
--
2.34.1


2023-01-27 11:33:03

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 17/28] arm64: RME: Runtime faulting of memory

At runtime, if the realm guest accesses memory which hasn't yet been
mapped, KVM needs to either populate the region or fault the guest.

For memory in the lower (protected) region of IPA, a fresh page is
provided to the RMM, which will zero its contents. For memory in the
upper (shared) region of IPA, the memory from the memslot is mapped
into the realm VM as non-secure.
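
As a rough illustration of the IPA split (a sketch, not part of the
patch): the top bit of the realm's IPA space acts as the
protected/shared discriminator, so KVM masks that "stolen" bit before
looking up the memslot. The EXAMPLE_* names below are made up and
assume a 48-bit IPA space.

/*
 * Sketch only: with ia_bits = 48 the stolen bit is bit 47, so an
 * access to the shared alias of the page at IPA 0x8000_0000 faults
 * at IPA 0x8000_8000_0000.  Masking the stolen bit recovers the same
 * gfn that is used for the memslot lookup.
 */
#define EXAMPLE_IA_BITS		48
#define EXAMPLE_GPA_STOLEN_MASK	BIT(EXAMPLE_IA_BITS - 1)	/* bit 47 */

static inline gfn_t example_ipa_to_gfn(phys_addr_t fault_ipa)
{
	return (fault_ipa & ~EXAMPLE_GPA_STOLEN_MASK) >> PAGE_SHIFT;
}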

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_emulate.h | 10 +++++
arch/arm64/include/asm/kvm_rme.h | 12 ++++++
arch/arm64/kvm/mmu.c | 64 +++++++++++++++++++++++++---
arch/arm64/kvm/rme.c | 48 +++++++++++++++++++++
4 files changed, 128 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 285e62914ca4..3a71b3d2e10a 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -502,6 +502,16 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
return READ_ONCE(kvm->arch.realm.state);
}

+static inline gpa_t kvm_gpa_stolen_bits(struct kvm *kvm)
+{
+ if (kvm_is_realm(kvm)) {
+ struct realm *realm = &kvm->arch.realm;
+
+ return BIT(realm->ia_bits - 1);
+ }
+ return 0;
+}
+
static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
{
if (static_branch_unlikely(&kvm_rme_is_available))
diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index 9d1583c44a99..303e4a5e5704 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -50,6 +50,18 @@ void kvm_destroy_rec(struct kvm_vcpu *vcpu);
int kvm_rec_enter(struct kvm_vcpu *vcpu);
int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_status);

+void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size);
+int realm_map_protected(struct realm *realm,
+ unsigned long hva,
+ unsigned long base_ipa,
+ struct page *dst_page,
+ unsigned long map_size,
+ struct kvm_mmu_memory_cache *memcache);
+int realm_map_non_secure(struct realm *realm,
+ unsigned long ipa,
+ struct page *page,
+ unsigned long map_size,
+ struct kvm_mmu_memory_cache *memcache);
int realm_set_ipa_state(struct kvm_vcpu *vcpu,
unsigned long addr, unsigned long end,
unsigned long ripas);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index f29558c5dcbc..5417c273861b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -235,8 +235,13 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64

lockdep_assert_held_write(&kvm->mmu_lock);
WARN_ON(size & ~PAGE_MASK);
- WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
- may_block));
+
+ if (kvm_is_realm(kvm))
+ kvm_realm_unmap_range(kvm, start, size);
+ else
+ WARN_ON(stage2_apply_range(kvm, start, end,
+ kvm_pgtable_stage2_unmap,
+ may_block));
}

static void unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
@@ -250,7 +255,11 @@ static void stage2_flush_memslot(struct kvm *kvm,
phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
phys_addr_t end = addr + PAGE_SIZE * memslot->npages;

- stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_flush);
+ if (kvm_is_realm(kvm))
+ kvm_realm_unmap_range(kvm, addr, end - addr);
+ else
+ stage2_apply_range_resched(kvm, addr, end,
+ kvm_pgtable_stage2_flush);
}

/**
@@ -818,6 +827,10 @@ void stage2_unmap_vm(struct kvm *kvm)
struct kvm_memory_slot *memslot;
int idx, bkt;

+ /* For realms this is handled by the RMM so nothing to do here */
+ if (kvm_is_realm(kvm))
+ return;
+
idx = srcu_read_lock(&kvm->srcu);
mmap_read_lock(current->mm);
write_lock(&kvm->mmu_lock);
@@ -840,6 +853,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
pgt = mmu->pgt;
if (kvm_is_realm(kvm) &&
kvm_realm_state(kvm) != REALM_STATE_DYING) {
+ unmap_stage2_range(mmu, 0, (~0ULL) & PAGE_MASK);
write_unlock(&kvm->mmu_lock);
kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
pgt->start_level);
@@ -1190,6 +1204,24 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}

+static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa, unsigned long hva,
+ kvm_pfn_t pfn, unsigned long map_size,
+ enum kvm_pgtable_prot prot,
+ struct kvm_mmu_memory_cache *memcache)
+{
+ struct realm *realm = &kvm->arch.realm;
+ struct page *page = pfn_to_page(pfn);
+
+ if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
+ return -EFAULT;
+
+ if (!realm_is_addr_protected(realm, ipa))
+ return realm_map_non_secure(realm, ipa, page, map_size,
+ memcache);
+
+ return realm_map_protected(realm, hva, ipa, page, map_size, memcache);
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot, unsigned long hva,
unsigned long fault_status)
@@ -1210,9 +1242,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
unsigned long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+ gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);

fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
write_fault = kvm_is_write_fault(vcpu);
+
+ /* Realms cannot map read-only */
+ if (vcpu_is_rec(vcpu))
+ write_fault = true;
+
exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
VM_BUG_ON(write_fault && exec_fault);

@@ -1272,7 +1310,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
fault_ipa &= ~(vma_pagesize - 1);

- gfn = fault_ipa >> PAGE_SHIFT;
+ gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
mmap_read_unlock(current->mm);

/*
@@ -1345,7 +1383,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* If we are not forced to use page mapping, check if we are
* backed by a THP and thus use block mapping if possible.
*/
- if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
+ /* FIXME: We shouldn't need to disable this for realms */
+ if (vma_pagesize == PAGE_SIZE && !(force_pte || device || kvm_is_realm(kvm))) {
if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
vma_pagesize = fault_granule;
else
@@ -1382,6 +1421,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
*/
if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
+ else if (kvm_is_realm(kvm))
+ ret = realm_map_ipa(kvm, fault_ipa, hva, pfn, vma_pagesize,
+ prot, memcache);
else
ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
__pfn_to_phys(pfn), prot,
@@ -1437,6 +1479,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
struct kvm_memory_slot *memslot;
unsigned long hva;
bool is_iabt, write_fault, writable;
+ gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
gfn_t gfn;
int ret, idx;

@@ -1491,7 +1534,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)

idx = srcu_read_lock(&vcpu->kvm->srcu);

- gfn = fault_ipa >> PAGE_SHIFT;
+ gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
memslot = gfn_to_memslot(vcpu->kvm, gfn);
hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
write_fault = kvm_is_write_fault(vcpu);
@@ -1536,6 +1579,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
* of the page size.
*/
fault_ipa |= kvm_vcpu_get_hfar(vcpu) & ((1 << 12) - 1);
+ fault_ipa &= ~gpa_stolen_mask;
ret = io_mem_abort(vcpu, fault_ipa);
goto out_unlock;
}
@@ -1617,6 +1661,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
if (!kvm->arch.mmu.pgt)
return false;

+ /* We don't support aging for Realms */
+ if (kvm_is_realm(kvm))
+ return true;
+
WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);

kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
@@ -1630,6 +1678,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
if (!kvm->arch.mmu.pgt)
return false;

+ /* We don't support aging for Realms */
+ if (kvm_is_realm(kvm))
+ return true;
+
return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
range->start << PAGE_SHIFT);
}
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 3405b43e1421..3d46191798e5 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -608,6 +608,54 @@ int realm_map_protected(struct realm *realm,
return -ENXIO;
}

+int realm_map_non_secure(struct realm *realm,
+ unsigned long ipa,
+ struct page *page,
+ unsigned long map_size,
+ struct kvm_mmu_memory_cache *memcache)
+{
+ phys_addr_t rd = virt_to_phys(realm->rd);
+ int map_level;
+ int ret = 0;
+ unsigned long desc = page_to_phys(page) |
+ PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) |
+ /* FIXME: Read+Write permissions for now */
+ (3 << 6) |
+ PTE_SHARED;
+
+ if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
+ return -EINVAL;
+
+ switch (map_size) {
+ case PAGE_SIZE:
+ map_level = 3;
+ break;
+ case RME_L2_BLOCK_SIZE:
+ map_level = 2;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
+
+ if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+ /* Create missing RTTs and retry */
+ int level = RMI_RETURN_INDEX(ret);
+
+ ret = realm_create_rtt_levels(realm, ipa, level, map_level,
+ memcache);
+ if (WARN_ON(ret))
+ return -ENXIO;
+
+ ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
+ }
+ if (WARN_ON(ret))
+ return -ENXIO;
+
+ return 0;
+}
+
static int populate_par_region(struct kvm *kvm,
phys_addr_t ipa_base,
phys_addr_t ipa_end)
--
2.34.1


2023-01-27 11:33:16

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 18/28] KVM: arm64: Handle realm VCPU load

When loading a realm VCPU, much of the work is handled by the RMM, so
only some of the actions are required. Rearrange kvm_arch_vcpu_load()
slightly so we can bail out early for a realm guest.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/arm.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fd9e28f48903..46c152a9a150 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -451,19 +451,25 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)

vcpu->cpu = cpu;

+ if (single_task_running())
+ vcpu_clear_wfx_traps(vcpu);
+ else
+ vcpu_set_wfx_traps(vcpu);
+
kvm_vgic_load(vcpu);
kvm_timer_vcpu_load(vcpu);
+
+ if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
+ kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
+
+ /* No additional state needs to be loaded for Realm VMs */
+ if (vcpu_is_rec(vcpu))
+ return;
+
if (has_vhe())
kvm_vcpu_load_sysregs_vhe(vcpu);
kvm_arch_vcpu_load_fp(vcpu);
kvm_vcpu_pmu_restore_guest(vcpu);
- if (kvm_arm_is_pvtime_enabled(&vcpu->arch))
- kvm_make_request(KVM_REQ_RECORD_STEAL, vcpu);
-
- if (single_task_running())
- vcpu_clear_wfx_traps(vcpu);
- else
- vcpu_set_wfx_traps(vcpu);

if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_disable(vcpu);
--
2.34.1


2023-01-27 11:33:20

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 19/28] KVM: arm64: Validate register access for a Realm VM

The RMM only allows setting the lower GPRs (x0-x7) and PC for a realm
guest. Check this in kvm_arm_set_reg() so that the VMM receives a
suitable error return if any other register is accessed.
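
For reference, a hedged VMM-side sketch of the resulting behaviour (the
helper name is made up; the ioctl and register ID encoding are the
standard KVM ones): setting x0 on a realm VCPU succeeds, while the same
call for x8 or any other register now fails with EINVAL.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical helper: set one core register on a realm VCPU fd */
static int set_core_reg(int vcpu_fd, __u64 core_reg_offset, __u64 value)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_ARM64 | KVM_REG_SIZE_U64 |
			KVM_REG_ARM_CORE | core_reg_offset,
		.addr = (__u64)&value,
	};

	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}

/*
 * set_core_reg(vcpu_fd, KVM_REG_ARM_CORE_REG(regs.regs[0]), 0x1234)
 * is accepted; the same call for regs.regs[8] fails (errno == EINVAL).
 */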

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/guest.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 5626ddb540ce..93468bbfb50e 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -768,12 +768,38 @@ int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
return kvm_arm_sys_reg_get_reg(vcpu, reg);
}

+/*
+ * The RMI ABI only enables setting the lower GPRs (x0-x7) and PC.
+ * All other registers are reset to architectural or otherwise defined reset
+ * values by the RMM
+ */
+static bool validate_realm_set_reg(struct kvm_vcpu *vcpu,
+ const struct kvm_one_reg *reg)
+{
+ u64 off = core_reg_offset_from_id(reg->id);
+
+ if ((reg->id & KVM_REG_ARM_COPROC_MASK) != KVM_REG_ARM_CORE)
+ return false;
+
+ switch (off) {
+ case KVM_REG_ARM_CORE_REG(regs.regs[0]) ...
+ KVM_REG_ARM_CORE_REG(regs.regs[7]):
+ case KVM_REG_ARM_CORE_REG(regs.pc):
+ return true;
+ }
+
+ return false;
+}
+
int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
{
/* We currently use nothing arch-specific in upper 32 bits */
if ((reg->id & ~KVM_REG_SIZE_MASK) >> 32 != KVM_REG_ARM64 >> 32)
return -EINVAL;

+ if (kvm_is_realm(vcpu->kvm) && !validate_realm_set_reg(vcpu, reg))
+ return -EINVAL;
+
switch (reg->id & KVM_REG_ARM_COPROC_MASK) {
case KVM_REG_ARM_CORE: return set_core_reg(vcpu, reg);
case KVM_REG_ARM_FW:
--
2.34.1


2023-01-27 11:33:25

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 21/28] KVM: arm64: WARN on injected undef exceptions

The RMM doesn't allow injection of an undefined exception into a realm
guest. Add a WARN to catch if this ever happens.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/inject_fault.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index f32f4a2a347f..29966a3e5a71 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -175,6 +175,8 @@ void kvm_inject_size_fault(struct kvm_vcpu *vcpu)
*/
void kvm_inject_undefined(struct kvm_vcpu *vcpu)
{
+ if (vcpu_is_rec(vcpu))
+ WARN(1, "Cannot inject undefined exception into REC. Continuing with unknown behaviour");
if (vcpu_el1_is_32bit(vcpu))
inject_undef32(vcpu);
else
--
2.34.1


2023-01-27 11:33:31

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 22/28] arm64: Don't expose stolen time for realm guests

Exposing stolen time doesn't make much sense for a realm guest, and
with the ABI as it stands it is a footgun for the VMM, making fatal
granule protection faults easy to trigger.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/arm.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 46c152a9a150..645df5968e1e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -302,7 +302,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = system_supports_mte();
break;
case KVM_CAP_STEAL_TIME:
- r = kvm_arm_pvtime_supported();
+ if (kvm && kvm_is_realm(kvm))
+ r = 0;
+ else
+ r = kvm_arm_pvtime_supported();
break;
case KVM_CAP_ARM_EL1_32BIT:
r = cpus_have_const_cap(ARM64_HAS_32BIT_EL1);
--
2.34.1


2023-01-27 11:33:39

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 26/28] arm64: rme: Allow checking SVE on VM instance

From: Suzuki K Poulose <[email protected]>

Now that different types of VMs are supported, check the SVE
support for the given instance of the VM so that the status is
reported accurately.
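
As a hedged illustration of what this means for userspace:
KVM_CHECK_EXTENSION can be issued on the VM file descriptor, so a VMM
can query SVE support for the specific (realm) VM instance rather than
the system as a whole. The fd and helper names below are hypothetical.

	/* VMM-side sketch: per-VM query of SVE support */
	int sve = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_SVE);

	if (!sve)
		/* For a realm VM this now reflects the RMM's SVE support */
		disable_sve_in_vcpu_init();	/* hypothetical helper */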

Signed-off-by: Suzuki K Poulose <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 2 ++
arch/arm64/kvm/arm.c | 5 ++++-
arch/arm64/kvm/rme.c | 7 ++++++-
3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index 2254e28c855e..68e99e5107bc 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -40,6 +40,8 @@ struct rec {
int kvm_init_rme(void);
u32 kvm_realm_ipa_limit(void);

+bool kvm_rme_supports_sve(void);
+
int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int kvm_init_realm_vm(struct kvm *kvm);
void kvm_destroy_realm(struct kvm *kvm);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 645df5968e1e..1d0b8ac7314f 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -326,7 +326,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = get_kvm_ipa_limit();
break;
case KVM_CAP_ARM_SVE:
- r = system_supports_sve();
+ if (kvm && kvm_is_realm(kvm))
+ r = kvm_rme_supports_sve();
+ else
+ r = system_supports_sve();
break;
case KVM_CAP_ARM_PTRAUTH_ADDRESS:
case KVM_CAP_ARM_PTRAUTH_GENERIC:
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 543e8d10f532..6ae7871aa6ed 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -49,6 +49,11 @@ static bool rme_supports(unsigned long feature)
return !!u64_get_bits(rmm_feat_reg0, feature);
}

+bool kvm_rme_supports_sve(void)
+{
+ return rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN);
+}
+
static int rmi_check_version(void)
{
struct arm_smccc_res res;
@@ -1104,7 +1109,7 @@ static int config_realm_sve(struct realm *realm,
int max_sve_vq = u64_get_bits(rmm_feat_reg0,
RMI_FEATURE_REGISTER_0_SVE_VL);

- if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
+ if (!kvm_rme_supports_sve())
return -EINVAL;

if (cfg->sve_vq > max_sve_vq)
--
2.34.1


2023-01-27 11:33:50

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 28/28] HACK: Accept prototype RMI versions

The upstream RMM currently advertises the major version of an internal
prototype (v56.0) rather than the expected version from the RMM
architecture specification (v1.0).

Add a config option to enable support for the prototype RMI v56.0.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/rmi_smc.h | 7 +++++++
arch/arm64/kvm/Kconfig | 8 ++++++++
arch/arm64/kvm/rme.c | 8 ++++++++
3 files changed, 23 insertions(+)

diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
index 16ff65090f3a..d6bbd7d92b8f 100644
--- a/arch/arm64/include/asm/rmi_smc.h
+++ b/arch/arm64/include/asm/rmi_smc.h
@@ -6,6 +6,13 @@
#ifndef __ASM_RME_SMC_H
#define __ASM_RME_SMC_H

+#ifdef CONFIG_RME_USE_PROTOTYPE_HACKS
+
+// Allow the prototype RMI version
+#define PROTOTYPE_RMI_ABI_MAJOR_VERSION 56
+
+#endif /* CONFIG_RME_USE_PROTOTYPE_HACKS */
+
#include <linux/arm-smccc.h>

#define SMC_RxI_CALL(func) \
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 05da3c8f7e88..13858a5047fd 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -58,6 +58,14 @@ config NVHE_EL2_DEBUG

If unsure, say N.

+config RME_USE_PROTOTYPE_HACKS
+ bool "Allow RMM prototype version numbers"
+ default y
+ help
+ For compatibility with the current RMM code, allow version
+ numbers from a prototype implementation as well as the expected
+ version number from the RMM specification.
+
config PROTECTED_NVHE_STACKTRACE
bool "Protected KVM hypervisor stacktraces"
depends on NVHE_EL2_DEBUG
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 1eb76cbee267..894060635226 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -67,6 +67,14 @@ static int rmi_check_version(void)
version_major = RMI_ABI_VERSION_GET_MAJOR(res.a0);
version_minor = RMI_ABI_VERSION_GET_MINOR(res.a0);

+#ifdef PROTOTYPE_RMI_ABI_MAJOR_VERSION
+ // Support the prototype
+ if (version_major == PROTOTYPE_RMI_ABI_MAJOR_VERSION) {
+ kvm_err("Using prototype RMM support (version %d.%d)\n",
+ version_major, version_minor);
+ return 0;
+ }
+#endif
if (version_major != RMI_ABI_MAJOR_VERSION) {
kvm_err("Unsupported RMI ABI (version %d.%d) we support %d\n",
version_major, version_minor,
--
2.34.1


2023-01-27 11:34:12

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 16/28] arm64: RME: Allow populating initial contents

The VMM needs to populate the realm with some data before starting it
(e.g. a kernel and initrd). This content is measured by the RMM and
used as part of the attestation later on.
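
A hedged sketch of the VMM side, based only on what this patch shows:
the args structure is passed via cap->args[1] of KVM_ENABLE_CAP on the
VM fd. The top-level capability name and the use of args[0] as the
sub-command slot are assumptions, the IPA values are made up, and
includes/error handling are omitted.

	struct kvm_cap_arm_rme_populate_realm_args populate = {
		.populate_ipa_base = 0x80000000,  /* made-up kernel load IPA */
		.populate_ipa_size = 0x200000,    /* made-up, page aligned */
	};
	struct kvm_enable_cap cap = {
		.cap     = KVM_CAP_ARM_RME,                 /* assumed cap name */
		.args[0] = KVM_CAP_ARM_RME_POPULATE_REALM,  /* assumed sub-command */
		.args[1] = (__u64)&populate,
	};

	/* Must be issued while the realm is still in the NEW state */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);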

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/rme.c | 366 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 366 insertions(+)

diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 16e0bfea98b1..3405b43e1421 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -4,6 +4,7 @@
*/

#include <linux/kvm_host.h>
+#include <linux/hugetlb.h>

#include <asm/kvm_emulate.h>
#include <asm/kvm_mmu.h>
@@ -426,6 +427,359 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
}
}

+static int realm_create_protected_data_page(struct realm *realm,
+ unsigned long ipa,
+ struct page *dst_page,
+ struct page *tmp_page)
+{
+ phys_addr_t dst_phys, tmp_phys;
+ int ret;
+
+ copy_page(page_address(tmp_page), page_address(dst_page));
+
+ dst_phys = page_to_phys(dst_page);
+ tmp_phys = page_to_phys(tmp_page);
+
+ if (rmi_granule_delegate(dst_phys))
+ return -ENXIO;
+
+ ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa, tmp_phys,
+ RMI_MEASURE_CONTENT);
+
+ if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+ /* Create missing RTTs and retry */
+ int level = RMI_RETURN_INDEX(ret);
+
+ ret = realm_create_rtt_levels(realm, ipa, level,
+ RME_RTT_MAX_LEVEL, NULL);
+ if (ret)
+ goto err;
+
+ ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa,
+ tmp_phys, RMI_MEASURE_CONTENT);
+ }
+
+ if (ret)
+ goto err;
+
+ return 0;
+
+err:
+ if (WARN_ON(rmi_granule_undelegate(dst_phys))) {
+ /* Page can't be returned to NS world so is lost */
+ get_page(dst_page);
+ }
+ return -ENXIO;
+}
+
+static int fold_rtt(phys_addr_t rd, unsigned long addr, int level,
+ struct realm *realm)
+{
+ struct rtt_entry rtt;
+ phys_addr_t rtt_addr;
+
+ if (rmi_rtt_read_entry(rd, addr, level, &rtt))
+ return -ENXIO;
+
+ if (rtt.state != RMI_TABLE)
+ return -EINVAL;
+
+ rtt_addr = rmi_rtt_get_phys(&rtt);
+ if (rmi_rtt_fold(rtt_addr, rd, addr, level + 1))
+ return -ENXIO;
+
+ free_delegated_page(realm, rtt_addr);
+
+ return 0;
+}
+
+int realm_map_protected(struct realm *realm,
+ unsigned long hva,
+ unsigned long base_ipa,
+ struct page *dst_page,
+ unsigned long map_size,
+ struct kvm_mmu_memory_cache *memcache)
+{
+ phys_addr_t dst_phys = page_to_phys(dst_page);
+ phys_addr_t rd = virt_to_phys(realm->rd);
+ unsigned long phys = dst_phys;
+ unsigned long ipa = base_ipa;
+ unsigned long size;
+ int map_level;
+ int ret = 0;
+
+ if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
+ return -EINVAL;
+
+ switch (map_size) {
+ case PAGE_SIZE:
+ map_level = 3;
+ break;
+ case RME_L2_BLOCK_SIZE:
+ map_level = 2;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (map_level < RME_RTT_MAX_LEVEL) {
+ /*
+ * A temporary RTT is needed during the map, precreate it,
+ * however if there is an error (e.g. missing parent tables)
+ * this will be handled below.
+ */
+ realm_create_rtt_levels(realm, ipa, map_level,
+ RME_RTT_MAX_LEVEL, memcache);
+ }
+
+ for (size = 0; size < map_size; size += PAGE_SIZE) {
+ if (rmi_granule_delegate(phys)) {
+ struct rtt_entry rtt;
+
+ /*
+ * It's possible we raced with another VCPU on the same
+ * fault. If the entry exists and matches then exit
+ * early and assume the other VCPU will handle the
+ * mapping.
+ */
+ if (rmi_rtt_read_entry(rd, ipa, RME_RTT_MAX_LEVEL, &rtt))
+ goto err;
+
+ // FIXME: For a block mapping this could race at level
+ // 2 or 3...
+ if (WARN_ON((rtt.walk_level != RME_RTT_MAX_LEVEL ||
+ rtt.state != RMI_ASSIGNED ||
+ rtt.desc != phys))) {
+ goto err;
+ }
+
+ return 0;
+ }
+
+ ret = rmi_data_create_unknown(phys, rd, ipa);
+
+ if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+ /* Create missing RTTs and retry */
+ int level = RMI_RETURN_INDEX(ret);
+
+ ret = realm_create_rtt_levels(realm, ipa, level,
+ RME_RTT_MAX_LEVEL,
+ memcache);
+ WARN_ON(ret);
+ if (ret)
+ goto err_undelegate;
+
+ ret = rmi_data_create_unknown(phys, rd, ipa);
+ }
+ WARN_ON(ret);
+
+ if (ret)
+ goto err_undelegate;
+
+ phys += PAGE_SIZE;
+ ipa += PAGE_SIZE;
+ }
+
+ if (map_size == RME_L2_BLOCK_SIZE)
+ ret = fold_rtt(rd, base_ipa, map_level, realm);
+ if (WARN_ON(ret))
+ goto err;
+
+ return 0;
+
+err_undelegate:
+ if (WARN_ON(rmi_granule_undelegate(phys))) {
+ /* Page can't be returned to NS world so is lost */
+ get_page(phys_to_page(phys));
+ }
+err:
+ while (size > 0) {
+ phys -= PAGE_SIZE;
+ size -= PAGE_SIZE;
+ ipa -= PAGE_SIZE;
+
+ rmi_data_destroy(rd, ipa);
+
+ if (WARN_ON(rmi_granule_undelegate(phys))) {
+ /* Page can't be returned to NS world so is lost */
+ get_page(phys_to_page(phys));
+ }
+ }
+ return -ENXIO;
+}
+
+static int populate_par_region(struct kvm *kvm,
+ phys_addr_t ipa_base,
+ phys_addr_t ipa_end)
+{
+ struct realm *realm = &kvm->arch.realm;
+ struct kvm_memory_slot *memslot;
+ gfn_t base_gfn, end_gfn;
+ int idx;
+ phys_addr_t ipa;
+ int ret = 0;
+ struct page *tmp_page;
+ phys_addr_t rd = virt_to_phys(realm->rd);
+
+ base_gfn = gpa_to_gfn(ipa_base);
+ end_gfn = gpa_to_gfn(ipa_end);
+
+ idx = srcu_read_lock(&kvm->srcu);
+ memslot = gfn_to_memslot(kvm, base_gfn);
+ if (!memslot) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ /* We require the region to be contained within a single memslot */
+ if (memslot->base_gfn + memslot->npages < end_gfn) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ tmp_page = alloc_page(GFP_KERNEL);
+ if (!tmp_page) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ mmap_read_lock(current->mm);
+
+ ipa = ipa_base;
+
+ while (ipa < ipa_end) {
+ struct vm_area_struct *vma;
+ unsigned long map_size;
+ unsigned int vma_shift;
+ unsigned long offset;
+ unsigned long hva;
+ struct page *page;
+ kvm_pfn_t pfn;
+ int level;
+
+ hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa));
+ vma = vma_lookup(current->mm, hva);
+ if (!vma) {
+ ret = -EFAULT;
+ break;
+ }
+
+ if (is_vm_hugetlb_page(vma))
+ vma_shift = huge_page_shift(hstate_vma(vma));
+ else
+ vma_shift = PAGE_SHIFT;
+
+ map_size = 1 << vma_shift;
+
+ /*
+ * FIXME: This causes over mapping, but there's no good
+ * solution here with the ABI as it stands
+ */
+ ipa = ALIGN_DOWN(ipa, map_size);
+
+ switch (map_size) {
+ case RME_L2_BLOCK_SIZE:
+ level = 2;
+ break;
+ case PAGE_SIZE:
+ level = 3;
+ break;
+ default:
+ WARN_ONCE(1, "Unsupported vma_shift %d", vma_shift);
+ ret = -EFAULT;
+ break;
+ }
+
+ pfn = gfn_to_pfn_memslot(memslot, gpa_to_gfn(ipa));
+
+ if (is_error_pfn(pfn)) {
+ ret = -EFAULT;
+ break;
+ }
+
+ ret = rmi_rtt_init_ripas(rd, ipa, level);
+ if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+ ret = realm_create_rtt_levels(realm, ipa,
+ RMI_RETURN_INDEX(ret),
+ level, NULL);
+ if (ret)
+ break;
+ ret = rmi_rtt_init_ripas(rd, ipa, level);
+ if (ret) {
+ ret = -ENXIO;
+ break;
+ }
+ }
+
+ if (level < RME_RTT_MAX_LEVEL) {
+ /*
+ * A temporary RTT is needed during the map, precreate
+ * it, however if there is an error (e.g. missing
+ * parent tables) this will be handled in the
+ * realm_create_protected_data_page() call.
+ */
+ realm_create_rtt_levels(realm, ipa, level,
+ RME_RTT_MAX_LEVEL, NULL);
+ }
+
+ page = pfn_to_page(pfn);
+
+ for (offset = 0; offset < map_size && !ret;
+ offset += PAGE_SIZE, page++) {
+ phys_addr_t page_ipa = ipa + offset;
+
+ ret = realm_create_protected_data_page(realm, page_ipa,
+ page, tmp_page);
+ }
+ if (ret)
+ goto err_release_pfn;
+
+ if (level == 2) {
+ ret = fold_rtt(rd, ipa, level, realm);
+ if (ret)
+ goto err_release_pfn;
+ }
+
+ ipa += map_size;
+ kvm_set_pfn_accessed(pfn);
+ kvm_set_pfn_dirty(pfn);
+ kvm_release_pfn_dirty(pfn);
+err_release_pfn:
+ if (ret) {
+ kvm_release_pfn_clean(pfn);
+ break;
+ }
+ }
+
+ mmap_read_unlock(current->mm);
+ __free_page(tmp_page);
+
+out:
+ srcu_read_unlock(&kvm->srcu, idx);
+ return ret;
+}
+
+static int kvm_populate_realm(struct kvm *kvm,
+ struct kvm_cap_arm_rme_populate_realm_args *args)
+{
+ phys_addr_t ipa_base, ipa_end;
+
+ if (kvm_realm_state(kvm) != REALM_STATE_NEW)
+ return -EBUSY;
+
+ if (!IS_ALIGNED(args->populate_ipa_base, PAGE_SIZE) ||
+ !IS_ALIGNED(args->populate_ipa_size, PAGE_SIZE))
+ return -EINVAL;
+
+ ipa_base = args->populate_ipa_base;
+ ipa_end = ipa_base + args->populate_ipa_size;
+
+ if (ipa_end < ipa_base)
+ return -EINVAL;
+
+ return populate_par_region(kvm, ipa_base, ipa_end);
+}
+
static int set_ipa_state(struct kvm_vcpu *vcpu,
unsigned long ipa,
unsigned long end,
@@ -748,6 +1102,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
r = kvm_init_ipa_range_realm(kvm, &args);
break;
}
+ case KVM_CAP_ARM_RME_POPULATE_REALM: {
+ struct kvm_cap_arm_rme_populate_realm_args args;
+ void __user *argp = u64_to_user_ptr(cap->args[1]);
+
+ if (copy_from_user(&args, argp, sizeof(args))) {
+ r = -EFAULT;
+ break;
+ }
+
+ r = kvm_populate_realm(kvm, &args);
+ break;
+ }
default:
r = -EINVAL;
break;
--
2.34.1


2023-01-27 11:42:16

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 14/28] arm64: RME: Handle realm enter/exit

Entering a realm is done using an SMC call to the RMM. On exit the
exit codes need to be handled slightly differently from the normal KVM
path, so define our own functions for realm enter/exit and hook them
in if the guest is a realm guest.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 11 ++
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 19 +++-
arch/arm64/kvm/rme-exit.c | 168 +++++++++++++++++++++++++++++++
arch/arm64/kvm/rme.c | 11 ++
5 files changed, 205 insertions(+), 6 deletions(-)
create mode 100644 arch/arm64/kvm/rme-exit.c

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index 3e75cedaad18..9d1583c44a99 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -47,6 +47,9 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
int kvm_create_rec(struct kvm_vcpu *vcpu);
void kvm_destroy_rec(struct kvm_vcpu *vcpu);

+int kvm_rec_enter(struct kvm_vcpu *vcpu);
+int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_status);
+
int realm_set_ipa_state(struct kvm_vcpu *vcpu,
unsigned long addr, unsigned long end,
unsigned long ripas);
@@ -69,4 +72,12 @@ static inline unsigned long rme_rtt_level_mapsize(int level)
return (1UL << RME_RTT_LEVEL_SHIFT(level));
}

+static inline bool realm_is_addr_protected(struct realm *realm,
+ unsigned long addr)
+{
+ unsigned int ia_bits = realm->ia_bits;
+
+ return !(addr & ~(BIT(ia_bits - 1) - 1));
+}
+
#endif
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index d2f0400c50da..884c7c44439f 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -21,7 +21,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
vgic/vgic-its.o vgic/vgic-debug.o \
- rme.o
+ rme.o rme-exit.o

kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 1b2547516f62..fd9e28f48903 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -985,7 +985,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
trace_kvm_entry(*vcpu_pc(vcpu));
guest_timing_enter_irqoff();

- ret = kvm_arm_vcpu_enter_exit(vcpu);
+ if (vcpu_is_rec(vcpu))
+ ret = kvm_rec_enter(vcpu);
+ else
+ ret = kvm_arm_vcpu_enter_exit(vcpu);

vcpu->mode = OUTSIDE_GUEST_MODE;
vcpu->stat.exits++;
@@ -1039,10 +1042,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)

local_irq_enable();

- trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
-
/* Exit types that need handling before we can be preempted */
- handle_exit_early(vcpu, ret);
+ if (!vcpu_is_rec(vcpu)) {
+ trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu),
+ *vcpu_pc(vcpu));
+
+ handle_exit_early(vcpu, ret);
+ }

preempt_enable();

@@ -1065,7 +1071,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
ret = ARM_EXCEPTION_IL;
}

- ret = handle_exit(vcpu, ret);
+ if (vcpu_is_rec(vcpu))
+ ret = handle_rme_exit(vcpu, ret);
+ else
+ ret = handle_exit(vcpu, ret);
}

/* Tell userspace about in-kernel device output levels */
diff --git a/arch/arm64/kvm/rme-exit.c b/arch/arm64/kvm/rme-exit.c
new file mode 100644
index 000000000000..15a4ff3517db
--- /dev/null
+++ b/arch/arm64/kvm/rme-exit.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/kvm_host.h>
+#include <kvm/arm_psci.h>
+
+#include <asm/rmi_smc.h>
+#include <asm/kvm_emulate.h>
+#include <asm/kvm_rme.h>
+#include <asm/kvm_mmu.h>
+
+typedef int (*exit_handler_fn)(struct kvm_vcpu *vcpu);
+
+static int rec_exit_reason_notimpl(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+
+ pr_err("[vcpu %d] Unhandled exit reason from realm (ESR: %#llx)\n",
+ vcpu->vcpu_id, rec->run->exit.esr);
+ return -ENXIO;
+}
+
+static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+
+ if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu))
+ vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu),
+ rec->run->exit.gprs[0]);
+
+ return kvm_handle_guest_abort(vcpu);
+}
+
+static int rec_exit_sync_iabt(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+
+ pr_err("[vcpu %d] Unhandled instruction abort (ESR: %#llx).\n",
+ vcpu->vcpu_id, rec->run->exit.esr);
+ return -ENXIO;
+}
+
+static int rec_exit_sys_reg(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+ unsigned long esr = kvm_vcpu_get_esr(vcpu);
+ int rt = kvm_vcpu_sys_get_rt(vcpu);
+ bool is_write = !(esr & 1);
+ int ret;
+
+ if (is_write)
+ vcpu_set_reg(vcpu, rt, rec->run->exit.gprs[0]);
+
+ ret = kvm_handle_sys_reg(vcpu);
+
+ if (ret >= 0 && !is_write)
+ rec->run->entry.gprs[0] = vcpu_get_reg(vcpu, rt);
+
+ return ret;
+}
+
+static exit_handler_fn rec_exit_handlers[] = {
+ [0 ... ESR_ELx_EC_MAX] = rec_exit_reason_notimpl,
+ [ESR_ELx_EC_SYS64] = rec_exit_sys_reg,
+ [ESR_ELx_EC_DABT_LOW] = rec_exit_sync_dabt,
+ [ESR_ELx_EC_IABT_LOW] = rec_exit_sync_iabt
+};
+
+static int rec_exit_psci(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+ int i;
+
+ for (i = 0; i < REC_RUN_GPRS; i++)
+ vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]);
+
+ return kvm_psci_call(vcpu);
+}
+
+static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
+{
+ struct realm *realm = &vcpu->kvm->arch.realm;
+ struct rec *rec = &vcpu->arch.rec;
+ unsigned long base = rec->run->exit.ripas_base;
+ unsigned long size = rec->run->exit.ripas_size;
+ unsigned long ripas = rec->run->exit.ripas_value & 1;
+ int ret = -EINVAL;
+
+ if (realm_is_addr_protected(realm, base) &&
+ realm_is_addr_protected(realm, base + size))
+ ret = realm_set_ipa_state(vcpu, base, base + size, ripas);
+
+ WARN(ret, "Unable to satisfy SET_IPAS for %#lx - %#lx, ripas: %#lx\n",
+ base, base + size, ripas);
+
+ return 1;
+}
+
+static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+
+ __vcpu_sys_reg(vcpu, CNTV_CTL_EL0) = rec->run->exit.cntv_ctl;
+ __vcpu_sys_reg(vcpu, CNTV_CVAL_EL0) = rec->run->exit.cntv_cval;
+ __vcpu_sys_reg(vcpu, CNTP_CTL_EL0) = rec->run->exit.cntp_ctl;
+ __vcpu_sys_reg(vcpu, CNTP_CVAL_EL0) = rec->run->exit.cntp_cval;
+
+ kvm_realm_timers_update(vcpu);
+}
+
+/*
+ * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
+ * proper exit to userspace.
+ */
+int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
+{
+ struct rec *rec = &vcpu->arch.rec;
+ u8 esr_ec = ESR_ELx_EC(rec->run->exit.esr);
+ unsigned long status, index;
+
+ status = RMI_RETURN_STATUS(rec_run_ret);
+ index = RMI_RETURN_INDEX(rec_run_ret);
+
+ /*
+ * If a PSCI_SYSTEM_OFF request raced with a vcpu executing, we might
+ * see the following status code and index indicating an attempt to run
+ * a REC when the RD state is SYSTEM_OFF. In this case, we just need to
+ * return to user space which can deal with the system event or will try
+ * to run the KVM VCPU again, at which point we will no longer attempt
+ * to enter the Realm because we will have a sleep request pending on
+ * the VCPU as a result of KVM's PSCI handling.
+ */
+ if (status == RMI_ERROR_REALM && index == 1) {
+ vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+ return 0;
+ }
+
+ if (rec_run_ret)
+ return -ENXIO;
+
+ vcpu->arch.fault.esr_el2 = rec->run->exit.esr;
+ vcpu->arch.fault.far_el2 = rec->run->exit.far;
+ vcpu->arch.fault.hpfar_el2 = rec->run->exit.hpfar;
+
+ update_arch_timer_irq_lines(vcpu);
+
+ /* Reset the emulation flags for the next run of the REC */
+ rec->run->entry.flags = 0;
+
+ switch (rec->run->exit.exit_reason) {
+ case RMI_EXIT_SYNC:
+ return rec_exit_handlers[esr_ec](vcpu);
+ case RMI_EXIT_IRQ:
+ case RMI_EXIT_FIQ:
+ return 1;
+ case RMI_EXIT_PSCI:
+ return rec_exit_psci(vcpu);
+ case RMI_EXIT_RIPAS_CHANGE:
+ return rec_exit_ripas_change(vcpu);
+ }
+
+ kvm_pr_unimpl("Unsupported exit reason: %u\n",
+ rec->run->exit.exit_reason);
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ return 0;
+}
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index b3ea79189839..16e0bfea98b1 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -802,6 +802,17 @@ void kvm_destroy_realm(struct kvm *kvm)
kvm_free_stage2_pgd(&kvm->arch.mmu);
}

+int kvm_rec_enter(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+
+ if (kvm_realm_state(vcpu->kvm) != REALM_STATE_ACTIVE)
+ return -EINVAL;
+
+ return rmi_rec_enter(virt_to_phys(rec->rec_page),
+ virt_to_phys(rec->run));
+}
+
static void free_rec_aux(struct page **aux_pages,
unsigned int num_aux)
{
--
2.34.1


2023-01-27 11:42:19

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 24/28] arm64: rme: allow userspace to inject aborts

From: Joey Gouly <[email protected]>

Extend KVM_SET_VCPU_EVENTS to support realms, where KVM cannot set the
system registers itself and the RMM must perform the injection on the
next REC entry.
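
A hedged VMM-side sketch of how this is used: after an emulated-MMIO
exit it cannot handle, the VMM requests a synchronous external abort
via the standard KVM_SET_VCPU_EVENTS ioctl and the RMM injects it on
the next REC entry (includes and full error handling omitted).

	struct kvm_vcpu_events events = { 0 };

	events.exception.ext_dabt_pending = 1;

	if (ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events) < 0)
		/* EINVAL unless the previous exit was an Unprotected IPA data abort */
		perror("KVM_SET_VCPU_EVENTS");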

Signed-off-by: Joey Gouly <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
Documentation/virt/kvm/api.rst | 2 ++
arch/arm64/kvm/guest.c | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f1a59d6fb7fc..18a8ddaf31d8 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1238,6 +1238,8 @@ User space may need to inject several types of events to the guest.
Set the pending SError exception state for this VCPU. It is not possible to
'cancel' an Serror that has been made pending.

+User space cannot inject SErrors into Realms.
+
If the guest performed an access to I/O memory which could not be handled by
userspace, for example because of missing instruction syndrome decode
information or because there is no device mapped at the accessed IPA, then
diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 93468bbfb50e..6e53e0ef2fba 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -851,6 +851,30 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
bool has_esr = events->exception.serror_has_esr;
bool ext_dabt_pending = events->exception.ext_dabt_pending;

+ if (vcpu_is_rec(vcpu)) {
+ /* Cannot inject SError into a Realm. */
+ if (serror_pending)
+ return -EINVAL;
+
+ /*
+ * If a data abort is pending, set the flag and let the RMM
+ * inject an SEA when the REC is scheduled to be run.
+ */
+ if (ext_dabt_pending) {
+ /*
+ * Can only inject SEA into a Realm if the previous exit
+ * was due to a data abort of an Unprotected IPA.
+ */
+ if (!(vcpu->arch.rec.run->entry.flags & RMI_EMULATED_MMIO))
+ return -EINVAL;
+
+ vcpu->arch.rec.run->entry.flags &= ~RMI_EMULATED_MMIO;
+ vcpu->arch.rec.run->entry.flags |= RMI_INJECT_SEA;
+ }
+
+ return 0;
+ }
+
if (serror_pending && has_esr) {
if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
return -EINVAL;
--
2.34.1


2023-01-27 11:42:21

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 20/28] KVM: arm64: Handle Realm PSCI requests

The RMM needs to be informed of the target REC when a PSCI call is made
with an MPIDR argument.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 1 +
arch/arm64/kvm/psci.c | 23 +++++++++++++++++++++++
arch/arm64/kvm/rme.c | 13 +++++++++++++
3 files changed, 37 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index 303e4a5e5704..2254e28c855e 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -65,6 +65,7 @@ int realm_map_non_secure(struct realm *realm,
int realm_set_ipa_state(struct kvm_vcpu *vcpu,
unsigned long addr, unsigned long end,
unsigned long ripas);
+int realm_psci_complete(struct kvm_vcpu *calling, struct kvm_vcpu *target);

#define RME_RTT_BLOCK_LEVEL 2
#define RME_RTT_MAX_LEVEL 3
diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
index 7fbc4c1b9df0..e2061cab9b26 100644
--- a/arch/arm64/kvm/psci.c
+++ b/arch/arm64/kvm/psci.c
@@ -76,6 +76,10 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
*/
if (!vcpu)
return PSCI_RET_INVALID_PARAMS;
+
+ if (vcpu_is_rec(vcpu))
+ realm_psci_complete(source_vcpu, vcpu);
+
if (!kvm_arm_vcpu_stopped(vcpu)) {
if (kvm_psci_version(source_vcpu) != KVM_ARM_PSCI_0_1)
return PSCI_RET_ALREADY_ON;
@@ -135,6 +139,25 @@ static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu)
/* Ignore other bits of target affinity */
target_affinity &= target_affinity_mask;

+ if (vcpu_is_rec(vcpu)) {
+ struct kvm_vcpu *target_vcpu;
+
+ /* RMM supports only zero affinity level */
+ if (lowest_affinity_level != 0)
+ return PSCI_RET_INVALID_PARAMS;
+
+ target_vcpu = kvm_mpidr_to_vcpu(kvm, target_affinity);
+ if (!target_vcpu)
+ return PSCI_RET_INVALID_PARAMS;
+
+ /*
+ * Provide the references of running and target RECs to the RMM
+ * so that the RMM can complete the PSCI request.
+ */
+ realm_psci_complete(vcpu, target_vcpu);
+ return PSCI_RET_SUCCESS;
+ }
+
/*
* If one or more VCPU matching target affinity are running
* then ON else OFF
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 3d46191798e5..6ac50481a138 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -126,6 +126,19 @@ static void free_delegated_page(struct realm *realm, phys_addr_t phys)
free_page((unsigned long)phys_to_virt(phys));
}

+int realm_psci_complete(struct kvm_vcpu *calling, struct kvm_vcpu *target)
+{
+ int ret;
+
+ ret = rmi_psci_complete(virt_to_phys(calling->arch.rec.rec_page),
+ virt_to_phys(target->arch.rec.rec_page));
+
+ if (ret)
+ return -EINVAL;
+
+ return 0;
+}
+
static void realm_destroy_undelegate_range(struct realm *realm,
unsigned long ipa,
unsigned long addr,
--
2.34.1


2023-01-27 11:42:31

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 13/28] arm64: RME: Allow VMM to set RIPAS

Each page within the protected region of the realm guest can be marked
as either RAM or EMPTY. Allow the VMM to control this before the guest
has started, and provide the equivalent functions to change it (with
the guest's approval) at runtime.
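
A hedged sketch of the pre-boot VMM flow, mirroring the populate
example elsewhere in the series: mark an IPA range as RAM before the
realm is activated. As before, the top-level capability name and the
args[0] sub-command slot are assumptions (only args[1] as a user
pointer is visible in the patch) and the addresses are made up.

	struct kvm_cap_arm_rme_init_ipa_args init_ipa = {
		.init_ipa_base = 0x80000000,  /* made-up base address */
		.init_ipa_size = 0x10000000,  /* made-up size (256MB) */
	};
	struct kvm_enable_cap cap = {
		.cap     = KVM_CAP_ARM_RME,                 /* assumed cap name */
		.args[0] = KVM_CAP_ARM_RME_INIT_IPA_REALM,  /* assumed sub-command */
		.args[1] = (__u64)&init_ipa,
	};

	/* Only valid while the realm is in the NEW state (-EBUSY otherwise) */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);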

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 4 +
arch/arm64/kvm/rme.c | 288 +++++++++++++++++++++++++++++++
2 files changed, 292 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index 4b219ebe1400..3e75cedaad18 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -47,6 +47,10 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
int kvm_create_rec(struct kvm_vcpu *vcpu);
void kvm_destroy_rec(struct kvm_vcpu *vcpu);

+int realm_set_ipa_state(struct kvm_vcpu *vcpu,
+ unsigned long addr, unsigned long end,
+ unsigned long ripas);
+
#define RME_RTT_BLOCK_LEVEL 2
#define RME_RTT_MAX_LEVEL 3

diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index d79ed889ca4d..b3ea79189839 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -73,6 +73,58 @@ static int rmi_check_version(void)
return 0;
}

+static phys_addr_t __alloc_delegated_page(struct realm *realm,
+ struct kvm_mmu_memory_cache *mc, gfp_t flags)
+{
+ phys_addr_t phys = PHYS_ADDR_MAX;
+ void *virt;
+
+ if (realm->spare_page != PHYS_ADDR_MAX) {
+ swap(realm->spare_page, phys);
+ goto out;
+ }
+
+ if (mc)
+ virt = kvm_mmu_memory_cache_alloc(mc);
+ else
+ virt = (void *)__get_free_page(flags);
+
+ if (!virt)
+ goto out;
+
+ phys = virt_to_phys(virt);
+
+ if (rmi_granule_delegate(phys)) {
+ free_page((unsigned long)virt);
+
+ phys = PHYS_ADDR_MAX;
+ }
+
+out:
+ return phys;
+}
+
+static phys_addr_t alloc_delegated_page(struct realm *realm,
+ struct kvm_mmu_memory_cache *mc)
+{
+ return __alloc_delegated_page(realm, mc, GFP_KERNEL);
+}
+
+static void free_delegated_page(struct realm *realm, phys_addr_t phys)
+{
+ if (realm->spare_page == PHYS_ADDR_MAX) {
+ realm->spare_page = phys;
+ return;
+ }
+
+ if (WARN_ON(rmi_granule_undelegate(phys))) {
+ /* Undelegate failed: leak the page */
+ return;
+ }
+
+ free_page((unsigned long)phys_to_virt(phys));
+}
+
static void realm_destroy_undelegate_range(struct realm *realm,
unsigned long ipa,
unsigned long addr,
@@ -220,6 +272,30 @@ static int realm_rtt_create(struct realm *realm,
return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
}

+static int realm_create_rtt_levels(struct realm *realm,
+ unsigned long ipa,
+ int level,
+ int max_level,
+ struct kvm_mmu_memory_cache *mc)
+{
+ if (WARN_ON(level == max_level))
+ return 0;
+
+ while (level++ < max_level) {
+ phys_addr_t rtt = alloc_delegated_page(realm, mc);
+
+ if (rtt == PHYS_ADDR_MAX)
+ return -ENOMEM;
+
+ if (realm_rtt_create(realm, ipa, level, rtt)) {
+ free_delegated_page(realm, rtt);
+ return -ENXIO;
+ }
+ }
+
+ return 0;
+}
+
static int realm_tear_down_rtt_range(struct realm *realm, int level,
unsigned long start, unsigned long end)
{
@@ -309,6 +385,206 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
}

+void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
+{
+ u32 ia_bits = kvm->arch.mmu.pgt->ia_bits;
+ u32 start_level = kvm->arch.mmu.pgt->start_level;
+ unsigned long end = ipa + size;
+ struct realm *realm = &kvm->arch.realm;
+ phys_addr_t tmp_rtt = PHYS_ADDR_MAX;
+
+ if (end > (1UL << ia_bits))
+ end = 1UL << ia_bits;
+ /*
+ * Make sure we have a spare delegated page for tearing down the
+ * block mappings. We must use Atomic allocations as we are called
+ * with kvm->mmu_lock held.
+ */
+ if (realm->spare_page == PHYS_ADDR_MAX) {
+ tmp_rtt = __alloc_delegated_page(realm, NULL, GFP_ATOMIC);
+ /*
+ * We don't have to check the status here, as we may not
+ * have a block level mapping. Delay any error to the point
+ * where we need it.
+ */
+ realm->spare_page = tmp_rtt;
+ }
+
+ realm_tear_down_rtt_range(&kvm->arch.realm, start_level, ipa, end);
+
+ /* Free up the atomic page, if there were any */
+ if (tmp_rtt != PHYS_ADDR_MAX) {
+ free_delegated_page(realm, tmp_rtt);
+ /*
+ * Update the spare_page after we have freed the
+ * above page to make sure it doesn't get cached
+ * in spare_page.
+ * We should re-write this part and always have
+ * a dedicated page for handling block mappings.
+ */
+ realm->spare_page = PHYS_ADDR_MAX;
+ }
+}
+
+static int set_ipa_state(struct kvm_vcpu *vcpu,
+ unsigned long ipa,
+ unsigned long end,
+ int level,
+ unsigned long ripas)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct realm *realm = &kvm->arch.realm;
+ struct rec *rec = &vcpu->arch.rec;
+ phys_addr_t rd_phys = virt_to_phys(realm->rd);
+ phys_addr_t rec_phys = virt_to_phys(rec->rec_page);
+ unsigned long map_size = rme_rtt_level_mapsize(level);
+ int ret;
+
+ while (ipa < end) {
+ ret = rmi_rtt_set_ripas(rd_phys, rec_phys, ipa, level, ripas);
+
+ if (!ret) {
+ if (!ripas)
+ kvm_realm_unmap_range(kvm, ipa, map_size);
+ } else if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+ int walk_level = RMI_RETURN_INDEX(ret);
+
+ if (walk_level < level) {
+ ret = realm_create_rtt_levels(realm, ipa,
+ walk_level,
+ level, NULL);
+ if (ret)
+ return ret;
+ continue;
+ }
+
+ if (WARN_ON(level >= RME_RTT_MAX_LEVEL))
+ return -EINVAL;
+
+ /* Recurse one level lower */
+ ret = set_ipa_state(vcpu, ipa, ipa + map_size,
+ level + 1, ripas);
+ if (ret)
+ return ret;
+ } else {
+ WARN(1, "Unexpected error in %s: %#x\n", __func__,
+ ret);
+ return -EINVAL;
+ }
+ ipa += map_size;
+ }
+
+ return 0;
+}
+
+static int realm_init_ipa_state(struct realm *realm,
+ unsigned long ipa,
+ unsigned long end,
+ int level)
+{
+ unsigned long map_size = rme_rtt_level_mapsize(level);
+ phys_addr_t rd_phys = virt_to_phys(realm->rd);
+ int ret;
+
+ while (ipa < end) {
+ ret = rmi_rtt_init_ripas(rd_phys, ipa, level);
+
+ if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
+ int cur_level = RMI_RETURN_INDEX(ret);
+
+ if (cur_level < level) {
+ ret = realm_create_rtt_levels(realm, ipa,
+ cur_level,
+ level, NULL);
+ if (ret)
+ return ret;
+ /* Retry with the RTT levels in place */
+ continue;
+ }
+
+ /* There's an entry at a lower level, recurse */
+ if (WARN_ON(level >= RME_RTT_MAX_LEVEL))
+ return -EINVAL;
+
+ realm_init_ipa_state(realm, ipa, ipa + map_size,
+ level + 1);
+ } else if (WARN_ON(ret)) {
+ return -ENXIO;
+ }
+
+ ipa += map_size;
+ }
+
+ return 0;
+}
+
+static int find_map_level(struct kvm *kvm, unsigned long start, unsigned long end)
+{
+ int level = RME_RTT_MAX_LEVEL;
+
+ while (level > get_start_level(kvm) + 1) {
+ unsigned long map_size = rme_rtt_level_mapsize(level - 1);
+
+ if (!IS_ALIGNED(start, map_size) ||
+ (start + map_size) > end)
+ break;
+
+ level--;
+ }
+
+ return level;
+}
+
+int realm_set_ipa_state(struct kvm_vcpu *vcpu,
+ unsigned long addr, unsigned long end,
+ unsigned long ripas)
+{
+ int ret = 0;
+
+ while (addr < end) {
+ int level = find_map_level(vcpu->kvm, addr, end);
+ unsigned long map_size = rme_rtt_level_mapsize(level);
+
+ ret = set_ipa_state(vcpu, addr, addr + map_size, level, ripas);
+ if (ret)
+ break;
+
+ addr += map_size;
+ }
+
+ return ret;
+}
+
+static int kvm_init_ipa_range_realm(struct kvm *kvm,
+ struct kvm_cap_arm_rme_init_ipa_args *args)
+{
+ int ret = 0;
+ gpa_t addr, end;
+ struct realm *realm = &kvm->arch.realm;
+
+ addr = args->init_ipa_base;
+ end = addr + args->init_ipa_size;
+
+ if (end < addr)
+ return -EINVAL;
+
+ if (kvm_realm_state(kvm) != REALM_STATE_NEW)
+ return -EBUSY;
+
+ while (addr < end) {
+ int level = find_map_level(kvm, addr, end);
+ unsigned long map_size = rme_rtt_level_mapsize(level);
+
+ ret = realm_init_ipa_state(realm, addr, addr + map_size, level);
+ if (ret)
+ break;
+
+ addr += map_size;
+ }
+
+ return ret;
+}
+
/* Protects access to rme_vmid_bitmap */
static DEFINE_SPINLOCK(rme_vmid_lock);
static unsigned long *rme_vmid_bitmap;
@@ -460,6 +736,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)

r = kvm_create_realm(kvm);
break;
+ case KVM_CAP_ARM_RME_INIT_IPA_REALM: {
+ struct kvm_cap_arm_rme_init_ipa_args args;
+ void __user *argp = u64_to_user_ptr(cap->args[1]);
+
+ if (copy_from_user(&args, argp, sizeof(args))) {
+ r = -EFAULT;
+ break;
+ }
+
+ r = kvm_init_ipa_range_realm(kvm, &args);
+ break;
+ }
default:
r = -EINVAL;
break;
--
2.34.1


2023-01-27 11:42:32

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 25/28] arm64: rme: support RSI_HOST_CALL

From: Joey Gouly <[email protected]>

Forward RSI_HOST_CALL exits to KVM's HVC handler.

Signed-off-by: Joey Gouly <[email protected]>
Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/rme-exit.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/arm64/kvm/rme-exit.c b/arch/arm64/kvm/rme-exit.c
index 15a4ff3517db..fcdc87e8f6bc 100644
--- a/arch/arm64/kvm/rme-exit.c
+++ b/arch/arm64/kvm/rme-exit.c
@@ -4,6 +4,7 @@
*/

#include <linux/kvm_host.h>
+#include <kvm/arm_hypercalls.h>
#include <kvm/arm_psci.h>

#include <asm/rmi_smc.h>
@@ -98,6 +99,29 @@ static int rec_exit_ripas_change(struct kvm_vcpu *vcpu)
return 1;
}

+static int rec_exit_host_call(struct kvm_vcpu *vcpu)
+{
+ int ret, i;
+ struct rec *rec = &vcpu->arch.rec;
+
+ vcpu->stat.hvc_exit_stat++;
+
+ for (i = 0; i < REC_RUN_GPRS; i++)
+ vcpu_set_reg(vcpu, i, rec->run->exit.gprs[i]);
+
+ ret = kvm_hvc_call_handler(vcpu);
+
+ if (ret < 0) {
+ vcpu_set_reg(vcpu, 0, ~0UL);
+ ret = 1;
+ }
+
+ for (i = 0; i < REC_RUN_GPRS; i++)
+ rec->run->entry.gprs[i] = vcpu_get_reg(vcpu, i);
+
+ return ret;
+}
+
static void update_arch_timer_irq_lines(struct kvm_vcpu *vcpu)
{
struct rec *rec = &vcpu->arch.rec;
@@ -159,6 +183,8 @@ int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_ret)
return rec_exit_psci(vcpu);
case RMI_EXIT_RIPAS_CHANGE:
return rec_exit_ripas_change(vcpu);
+ case RMI_EXIT_HOST_CALL:
+ return rec_exit_host_call(vcpu);
}

kvm_pr_unimpl("Unsupported exit reason: %u\n",
--
2.34.1


2023-01-27 11:44:09

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 23/28] KVM: arm64: Allow activating realms

Add the ioctl to activate a realm, and set the static branch that
enables access to the realm functionality when the RMM is detected.
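
A hedged sketch of the final VMM step before running VCPUs; as with the
other RME sub-commands, the top-level capability name and the args[0]
slot are assumptions.

	struct kvm_enable_cap cap = {
		.cap     = KVM_CAP_ARM_RME,                /* assumed cap name */
		.args[0] = KVM_CAP_ARM_RME_ACTIVATE_REALM, /* assumed sub-command */
	};

	/* Fails with -EBUSY unless the realm is still in the NEW state */
	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);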

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/rme.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 6ac50481a138..543e8d10f532 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -1000,6 +1000,20 @@ static int kvm_init_ipa_range_realm(struct kvm *kvm,
return ret;
}

+static int kvm_activate_realm(struct kvm *kvm)
+{
+ struct realm *realm = &kvm->arch.realm;
+
+ if (kvm_realm_state(kvm) != REALM_STATE_NEW)
+ return -EBUSY;
+
+ if (rmi_realm_activate(virt_to_phys(realm->rd)))
+ return -ENXIO;
+
+ WRITE_ONCE(realm->state, REALM_STATE_ACTIVE);
+ return 0;
+}
+
/* Protects access to rme_vmid_bitmap */
static DEFINE_SPINLOCK(rme_vmid_lock);
static unsigned long *rme_vmid_bitmap;
@@ -1175,6 +1189,9 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
r = kvm_populate_realm(kvm, &args);
break;
}
+ case KVM_CAP_ARM_RME_ACTIVATE_REALM:
+ r = kvm_activate_realm(kvm);
+ break;
default:
r = -EINVAL;
break;
@@ -1415,7 +1432,7 @@ int kvm_init_rme(void)

WARN_ON(rmi_features(0, &rmm_feat_reg0));

- /* Future patch will enable static branch kvm_rme_is_available */
+ static_branch_enable(&kvm_rme_is_available);

return 0;
}
--
2.34.1


2023-01-27 12:24:18

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 27/28] arm64: RME: Always use 4k pages for realms

Always split up huge pages to avoid the problems of managing them. There
are currently two issues:

1. The uABI for the VMM allows populating memory on 4k boundaries even
if the underlying allocator (e.g. hugetlbfs) is using a larger page
size. Using a memfd for private allocations will push this issue onto
the VMM as it will need to respect the granularity of the allocator.

2. The guest is able to request arbitrary ranges to be remapped as
shared. Again with a memfd approach it will be up to the VMM to deal
with the complexity and either overmap (keep the huge mapping and add
an additional 'overlapping' shared mapping) or reject the request as
invalid due to the use of a huge page allocator.

For now just break everything down to 4k pages in the RMM controlled
stage 2.

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/kvm/mmu.c | 4 ++++
arch/arm64/kvm/rme.c | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5417c273861b..b5fc8d8f7049 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1278,6 +1278,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (logging_active) {
force_pte = true;
vma_shift = PAGE_SHIFT;
+ } else if (kvm_is_realm(kvm)) {
+ // Force PTE level mappings for realms
+ force_pte = true;
+ vma_shift = PAGE_SHIFT;
} else {
vma_shift = get_vma_page_shift(vma, hva);
}
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 6ae7871aa6ed..1eb76cbee267 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -730,7 +730,9 @@ static int populate_par_region(struct kvm *kvm,
break;
}

- if (is_vm_hugetlb_page(vma))
+ // FIXME: To avoid the overmapping issue (see below comment)
+ // force the use of 4k pages
+ if (is_vm_hugetlb_page(vma) && 0)
vma_shift = huge_page_shift(hstate_vma(vma));
else
vma_shift = PAGE_SHIFT;
--
2.34.1


2023-01-27 12:27:37

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 09/28] arm64: RME: RTT handling

The RMM owns the stage 2 page tables for a realm, and KVM must request
that the RMM creates/destroys entries as necessary. The physical pages
to store the page tables are delegated to the realm as required, and can
be undelegated when no longer used.
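
For reference, with RME_PAGE_SHIFT of 12 the level shift introduced below
works out to the usual 4k stage 2 granule sizes (an illustrative calculation,
not part of the change itself):

        RME_RTT_LEVEL_SHIFT(3) = (12 - 3) * (4 - 3) + 3 = 12   /* 4 KiB */
        RME_RTT_LEVEL_SHIFT(2) = (12 - 3) * (4 - 2) + 3 = 21   /* 2 MiB */
        RME_RTT_LEVEL_SHIFT(1) = (12 - 3) * (4 - 1) + 3 = 30   /* 1 GiB */

so rme_rtt_level_mapsize(2) gives RME_L2_BLOCK_SIZE (2 MiB) and level 3
entries map single pages.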

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_rme.h | 19 +++++
arch/arm64/kvm/mmu.c | 7 +-
arch/arm64/kvm/rme.c | 139 +++++++++++++++++++++++++++++++
3 files changed, 162 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index a6318af3ed11..eea5118dfa8a 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -35,5 +35,24 @@ u32 kvm_realm_ipa_limit(void);
int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int kvm_init_realm_vm(struct kvm *kvm);
void kvm_destroy_realm(struct kvm *kvm);
+void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
+
+#define RME_RTT_BLOCK_LEVEL 2
+#define RME_RTT_MAX_LEVEL 3
+
+#define RME_PAGE_SHIFT 12
+#define RME_PAGE_SIZE BIT(RME_PAGE_SHIFT)
+/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */
+#define RME_RTT_LEVEL_SHIFT(l) \
+ ((RME_PAGE_SHIFT - 3) * (4 - (l)) + 3)
+#define RME_L2_BLOCK_SIZE BIT(RME_RTT_LEVEL_SHIFT(2))
+
+static inline unsigned long rme_rtt_level_mapsize(int level)
+{
+ if (WARN_ON(level > RME_RTT_MAX_LEVEL))
+ return RME_PAGE_SIZE;
+
+ return (1UL << RME_RTT_LEVEL_SHIFT(level));
+}

#endif
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 22c00274884a..f29558c5dcbc 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -834,16 +834,17 @@ void stage2_unmap_vm(struct kvm *kvm)
void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
{
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
- struct kvm_pgtable *pgt = NULL;
+ struct kvm_pgtable *pgt;

write_lock(&kvm->mmu_lock);
+ pgt = mmu->pgt;
if (kvm_is_realm(kvm) &&
kvm_realm_state(kvm) != REALM_STATE_DYING) {
- /* TODO: teardown rtts */
write_unlock(&kvm->mmu_lock);
+ kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
+ pgt->start_level);
return;
}
- pgt = mmu->pgt;
if (pgt) {
mmu->pgd_phys = 0;
mmu->pgt = NULL;
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index 0c9d70e4d9e6..f7b0e5a779f8 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -73,6 +73,28 @@ static int rmi_check_version(void)
return 0;
}

+static void realm_destroy_undelegate_range(struct realm *realm,
+ unsigned long ipa,
+ unsigned long addr,
+ ssize_t size)
+{
+ unsigned long rd = virt_to_phys(realm->rd);
+ int ret;
+
+ while (size > 0) {
+ ret = rmi_data_destroy(rd, ipa);
+ WARN_ON(ret);
+ ret = rmi_granule_undelegate(addr);
+
+ if (ret)
+ get_page(phys_to_page(addr));
+
+ addr += PAGE_SIZE;
+ ipa += PAGE_SIZE;
+ size -= PAGE_SIZE;
+ }
+}
+
static unsigned long create_realm_feat_reg0(struct kvm *kvm)
{
unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
@@ -170,6 +192,123 @@ static int realm_create_rd(struct kvm *kvm)
return r;
}

+static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
+ int level, phys_addr_t rtt_granule)
+{
+ addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
+ return rmi_rtt_destroy(rtt_granule, virt_to_phys(realm->rd), addr,
+ level);
+}
+
+static int realm_destroy_free_rtt(struct realm *realm, unsigned long addr,
+ int level, phys_addr_t rtt_granule)
+{
+ if (realm_rtt_destroy(realm, addr, level, rtt_granule))
+ return -ENXIO;
+ if (!WARN_ON(rmi_granule_undelegate(rtt_granule)))
+ put_page(phys_to_page(rtt_granule));
+
+ return 0;
+}
+
+static int realm_rtt_create(struct realm *realm,
+ unsigned long addr,
+ int level,
+ phys_addr_t phys)
+{
+ addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
+ return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
+}
+
+static int realm_tear_down_rtt_range(struct realm *realm, int level,
+ unsigned long start, unsigned long end)
+{
+ phys_addr_t rd = virt_to_phys(realm->rd);
+ ssize_t map_size = rme_rtt_level_mapsize(level);
+ unsigned long addr, next_addr;
+ bool failed = false;
+
+ for (addr = start; addr < end; addr = next_addr) {
+ phys_addr_t rtt_addr, tmp_rtt;
+ struct rtt_entry rtt;
+ unsigned long end_addr;
+
+ next_addr = ALIGN(addr + 1, map_size);
+
+ end_addr = min(next_addr, end);
+
+ if (rmi_rtt_read_entry(rd, ALIGN_DOWN(addr, map_size),
+ level, &rtt)) {
+ failed = true;
+ continue;
+ }
+
+ rtt_addr = rmi_rtt_get_phys(&rtt);
+ WARN_ON(level != rtt.walk_level);
+
+ switch (rtt.state) {
+ case RMI_UNASSIGNED:
+ case RMI_DESTROYED:
+ break;
+ case RMI_TABLE:
+ if (realm_tear_down_rtt_range(realm, level + 1,
+ addr, end_addr)) {
+ failed = true;
+ break;
+ }
+ if (IS_ALIGNED(addr, map_size) &&
+ next_addr <= end &&
+ realm_destroy_free_rtt(realm, addr, level + 1,
+ rtt_addr))
+ failed = true;
+ break;
+ case RMI_ASSIGNED:
+ WARN_ON(!rtt_addr);
+ /*
+ * If there is a block mapping, break it now, using the
+ * spare_page. We are sure to have a valid delegated
+ * page at spare_page before we enter here, otherwise
+ * WARN once, which will be followed by further
+ * warnings.
+ */
+ tmp_rtt = realm->spare_page;
+ if (level == 2 &&
+ !WARN_ON_ONCE(tmp_rtt == PHYS_ADDR_MAX) &&
+ realm_rtt_create(realm, addr,
+ RME_RTT_MAX_LEVEL, tmp_rtt)) {
+ WARN_ON(1);
+ failed = true;
+ break;
+ }
+ realm_destroy_undelegate_range(realm, addr,
+ rtt_addr, map_size);
+ /*
+ * Collapse the last level table and make the spare page
+ * reusable again.
+ */
+ if (level == 2 &&
+ realm_rtt_destroy(realm, addr, RME_RTT_MAX_LEVEL,
+ tmp_rtt))
+ failed = true;
+ break;
+ case RMI_VALID_NS:
+ WARN_ON(rmi_rtt_unmap_unprotected(rd, addr, level));
+ break;
+ default:
+ WARN_ON(1);
+ failed = true;
+ break;
+ }
+ }
+
+ return failed ? -EINVAL : 0;
+}
+
+void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
+{
+ realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
+}
+
/* Protects access to rme_vmid_bitmap */
static DEFINE_SPINLOCK(rme_vmid_lock);
static unsigned long *rme_vmid_bitmap;
--
2.34.1


2023-01-27 12:27:40

by Steven Price

[permalink] [raw]
Subject: [RFC PATCH 10/28] arm64: RME: Allocate/free RECs to match vCPUs

The RMM maintains a data structure known as the Realm Execution Context
(or REC). It is similar to struct kvm_vcpu and tracks the state of the
virtual CPUs. KVM must delegate memory and request that the structures are
created when vCPUs are created, and tear them down when the vCPUs are
destroyed.
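
For illustration, the VMM is expected to create the REC by finalizing the
vCPU. A rough userspace sketch (not part of this patch, and subject to the
usual ABI caveats from the cover letter):

        int feature = KVM_ARM_VCPU_REC;

        /*
         * vcpu_fd is the fd returned by KVM_CREATE_VCPU; the vCPU must
         * already have been initialised with KVM_ARM_VCPU_INIT.
         */
        if (ioctl(vcpu_fd, KVM_ARM_VCPU_FINALIZE, &feature) < 0)
                err(1, "KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_REC)");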

Signed-off-by: Steven Price <[email protected]>
---
arch/arm64/include/asm/kvm_emulate.h | 2 +
arch/arm64/include/asm/kvm_host.h | 3 +
arch/arm64/include/asm/kvm_rme.h | 10 ++
arch/arm64/kvm/arm.c | 1 +
arch/arm64/kvm/reset.c | 11 ++
arch/arm64/kvm/rme.c | 144 +++++++++++++++++++++++++++
6 files changed, 171 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 5a2b7229e83f..285e62914ca4 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -504,6 +504,8 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)

static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
{
+ if (static_branch_unlikely(&kvm_rme_is_available))
+ return vcpu->arch.rec.mpidr != INVALID_HWID;
return false;
}

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 04347c3a8c6b..ef497b718cdb 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -505,6 +505,9 @@ struct kvm_vcpu_arch {
u64 last_steal;
gpa_t base;
} steal;
+
+ /* Realm meta data */
+ struct rec rec;
};

/*
diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
index eea5118dfa8a..4b219ebe1400 100644
--- a/arch/arm64/include/asm/kvm_rme.h
+++ b/arch/arm64/include/asm/kvm_rme.h
@@ -6,6 +6,7 @@
#ifndef __ASM_KVM_RME_H
#define __ASM_KVM_RME_H

+#include <asm/rmi_smc.h>
#include <uapi/linux/kvm.h>

enum realm_state {
@@ -29,6 +30,13 @@ struct realm {
unsigned int ia_bits;
};

+struct rec {
+ unsigned long mpidr;
+ void *rec_page;
+ struct page *aux_pages[REC_PARAMS_AUX_GRANULES];
+ struct rec_run *run;
+};
+
int kvm_init_rme(void);
u32 kvm_realm_ipa_limit(void);

@@ -36,6 +44,8 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
int kvm_init_realm_vm(struct kvm *kvm);
void kvm_destroy_realm(struct kvm *kvm);
void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
+int kvm_create_rec(struct kvm_vcpu *vcpu);
+void kvm_destroy_rec(struct kvm_vcpu *vcpu);

#define RME_RTT_BLOCK_LEVEL 2
#define RME_RTT_MAX_LEVEL 3
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index badd775547b8..52affed2f3cf 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -373,6 +373,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
/* Force users to call KVM_ARM_VCPU_INIT */
vcpu->arch.target = -1;
bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
+ vcpu->arch.rec.mpidr = INVALID_HWID;

vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;

diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
index 9e71d69e051f..0c84392a4bf2 100644
--- a/arch/arm64/kvm/reset.c
+++ b/arch/arm64/kvm/reset.c
@@ -135,6 +135,11 @@ int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
return -EPERM;

return kvm_vcpu_finalize_sve(vcpu);
+ case KVM_ARM_VCPU_REC:
+ if (!kvm_is_realm(vcpu->kvm))
+ return -EINVAL;
+
+ return kvm_create_rec(vcpu);
}

return -EINVAL;
@@ -145,6 +150,11 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu)
if (vcpu_has_sve(vcpu) && !kvm_arm_vcpu_sve_finalized(vcpu))
return false;

+ if (kvm_is_realm(vcpu->kvm) &&
+ !(vcpu_is_rec(vcpu) &&
+ READ_ONCE(vcpu->kvm->arch.realm.state) == REALM_STATE_ACTIVE))
+ return false;
+
return true;
}

@@ -157,6 +167,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
if (sve_state)
kvm_unshare_hyp(sve_state, sve_state + vcpu_sve_state_size(vcpu));
kfree(sve_state);
+ kvm_destroy_rec(vcpu);
}

static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
index f7b0e5a779f8..d79ed889ca4d 100644
--- a/arch/arm64/kvm/rme.c
+++ b/arch/arm64/kvm/rme.c
@@ -514,6 +514,150 @@ void kvm_destroy_realm(struct kvm *kvm)
kvm_free_stage2_pgd(&kvm->arch.mmu);
}

+static void free_rec_aux(struct page **aux_pages,
+ unsigned int num_aux)
+{
+ unsigned int i;
+
+ for (i = 0; i < num_aux; i++) {
+ phys_addr_t aux_page_phys = page_to_phys(aux_pages[i]);
+
+ if (WARN_ON(rmi_granule_undelegate(aux_page_phys)))
+ continue;
+
+ __free_page(aux_pages[i]);
+ }
+}
+
+static int alloc_rec_aux(struct page **aux_pages,
+ u64 *aux_phys_pages,
+ unsigned int num_aux)
+{
+ int ret;
+ unsigned int i;
+
+ for (i = 0; i < num_aux; i++) {
+ struct page *aux_page;
+ phys_addr_t aux_page_phys;
+
+ aux_page = alloc_page(GFP_KERNEL);
+ if (!aux_page) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ aux_page_phys = page_to_phys(aux_page);
+ if (rmi_granule_delegate(aux_page_phys)) {
+ __free_page(aux_page);
+ ret = -ENXIO;
+ goto out_err;
+ }
+ aux_pages[i] = aux_page;
+ aux_phys_pages[i] = aux_page_phys;
+ }
+
+ return 0;
+out_err:
+ free_rec_aux(aux_pages, i);
+ return ret;
+}
+
+int kvm_create_rec(struct kvm_vcpu *vcpu)
+{
+ struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
+ unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
+ struct realm *realm = &vcpu->kvm->arch.realm;
+ struct rec *rec = &vcpu->arch.rec;
+ unsigned long rec_page_phys;
+ struct rec_params *params;
+ int r, i;
+
+ if (kvm_realm_state(vcpu->kvm) != REALM_STATE_NEW)
+ return -ENOENT;
+
+ /*
+ * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2
+ * flag covers v0.2 and onwards.
+ */
+ if (!test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
+ return -EINVAL;
+
+ BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE);
+ BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE);
+
+ params = (struct rec_params *)get_zeroed_page(GFP_KERNEL);
+ rec->rec_page = (void *)__get_free_page(GFP_KERNEL);
+ rec->run = (void *)get_zeroed_page(GFP_KERNEL);
+ if (!params || !rec->rec_page || !rec->run) {
+ r = -ENOMEM;
+ goto out_free_pages;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(params->gprs); i++)
+ params->gprs[i] = vcpu_regs->regs[i];
+
+ params->pc = vcpu_regs->pc;
+
+ if (vcpu->vcpu_id == 0)
+ params->flags |= REC_PARAMS_FLAG_RUNNABLE;
+
+ rec_page_phys = virt_to_phys(rec->rec_page);
+
+ if (rmi_granule_delegate(rec_page_phys)) {
+ r = -ENXIO;
+ goto out_free_pages;
+ }
+
+ r = alloc_rec_aux(rec->aux_pages, params->aux, realm->num_aux);
+ if (r)
+ goto out_undelegate_rmm_rec;
+
+ params->num_rec_aux = realm->num_aux;
+ params->mpidr = mpidr;
+
+ if (rmi_rec_create(rec_page_phys,
+ virt_to_phys(realm->rd),
+ virt_to_phys(params))) {
+ r = -ENXIO;
+ goto out_free_rec_aux;
+ }
+
+ rec->mpidr = mpidr;
+
+ free_page((unsigned long)params);
+ return 0;
+
+out_free_rec_aux:
+ free_rec_aux(rec->aux_pages, realm->num_aux);
+out_undelegate_rmm_rec:
+ if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
+ rec->rec_page = NULL;
+out_free_pages:
+ free_page((unsigned long)rec->run);
+ free_page((unsigned long)rec->rec_page);
+ free_page((unsigned long)params);
+ return r;
+}
+
+void kvm_destroy_rec(struct kvm_vcpu *vcpu)
+{
+ struct realm *realm = &vcpu->kvm->arch.realm;
+ struct rec *rec = &vcpu->arch.rec;
+ unsigned long rec_page_phys;
+
+ if (!vcpu_is_rec(vcpu))
+ return;
+
+ rec_page_phys = virt_to_phys(rec->rec_page);
+
+ if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
+ return;
+ if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
+ return;
+
+ free_rec_aux(rec->aux_pages, realm->num_aux);
+ free_page((unsigned long)rec->rec_page);
+}
+
int kvm_init_realm_vm(struct kvm *kvm)
{
struct realm_params *params;
--
2.34.1


2023-02-07 12:25:42

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On Fri, Jan 27, 2023 at 11:29:10AM +0000, Steven Price wrote:
> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + struct kvm_cap_arm_rme_config_item cfg;
> + struct realm *realm = &kvm->arch.realm;
> + int r = 0;
> +
> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EBUSY;

This should also check kvm_is_realm() (otherwise we dereference a NULL
realm).

I was wondering about fuzzing the API to find more of this kind of issue,
but don't know anything about it. Is there a recommended way to fuzz KVM?

Thanks,
Jean


2023-02-07 12:56:02

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On 07/02/2023 12:25, Jean-Philippe Brucker wrote:
> On Fri, Jan 27, 2023 at 11:29:10AM +0000, Steven Price wrote:
>> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
>> +{
>> + struct kvm_cap_arm_rme_config_item cfg;
>> + struct realm *realm = &kvm->arch.realm;
>> + int r = 0;
>> +
>> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
>> + return -EBUSY;
>
> This should also check kvm_is_realm() (otherwise we dereference a NULL
> realm).

Correct, I think this should be done further up the stack, at:

kvm_vm_ioctl_enable_cap() for KVM_CAP_ARM_RME.
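
i.e. something along these lines (untested sketch on top of the hunk in
that patch):

        case KVM_CAP_ARM_RME:
                if (!static_branch_unlikely(&kvm_rme_is_available))
                        return -EINVAL;
                if (!kvm_is_realm(kvm))
                        return -EINVAL;
                mutex_lock(&kvm->lock);
                r = kvm_realm_enable_cap(kvm, cap);
                mutex_unlock(&kvm->lock);
                break;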

>
> I was wondering about fuzzing the API to find more of this kind of issue,
> but don't know anything about it. Is there a recommended way to fuzz KVM?

Not sure either. kselftests is one possible way to drive these tests, at
least for unit-testing the new ABIs. This is something we plan to add.

Thanks for catching this.

Suzuki



> Thanks,
> Jean
>


2023-02-13 15:49:08

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init

On Fri, 27 Jan 2023 11:29:08 +0000
Steven Price <[email protected]> wrote:

> Query the RMI version number and check if it is a compatible version. A
> static key is also provided to signal that a supported RMM is available.
>
> Functions are provided to query if a VM or VCPU is a realm (or rec)
> which currently will always return false.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
> arch/arm64/include/asm/kvm_host.h | 4 +++
> arch/arm64/include/asm/kvm_rme.h | 22 +++++++++++++
> arch/arm64/include/asm/virt.h | 1 +
> arch/arm64/kvm/Makefile | 3 +-
> arch/arm64/kvm/arm.c | 8 +++++
> arch/arm64/kvm/rme.c | 49 ++++++++++++++++++++++++++++
> 7 files changed, 103 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/include/asm/kvm_rme.h
> create mode 100644 arch/arm64/kvm/rme.c
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 9bdba47f7e14..5a2b7229e83f 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -490,4 +490,21 @@ static inline bool vcpu_has_feature(struct kvm_vcpu *vcpu, int feature)
> return test_bit(feature, vcpu->arch.features);
> }
>
> +static inline bool kvm_is_realm(struct kvm *kvm)
> +{
> + if (static_branch_unlikely(&kvm_rme_is_available))
> + return kvm->arch.is_realm;
> + return false;
> +}
> +
> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> +{
> + return READ_ONCE(kvm->arch.realm.state);
> +}
> +
> +static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> +{
> + return false;
> +}
> +
> #endif /* __ARM64_KVM_EMULATE_H__ */
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..04347c3a8c6b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -26,6 +26,7 @@
> #include <asm/fpsimd.h>
> #include <asm/kvm.h>
> #include <asm/kvm_asm.h>
> +#include <asm/kvm_rme.h>
>
> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
>
> @@ -240,6 +241,9 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + bool is_realm;
^
It would be better to add a comment here; that really helps with review.

I was looking for the user of this member to see when it is set. It seems
it is not set in this patch. It would have been nice to get a quick answer
from the comments.
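
For instance, just a sketch of the kind of comment that would help (assuming
it is indeed set from the VM type, as a later patch appears to do):

        /*
         * True for VMs created with the KVM_VM_TYPE_ARM_REALM machine
         * type; set once at VM creation and never cleared.
         */
        bool is_realm;
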
> + struct realm realm;
> };
>
> struct kvm_vcpu_fault_info {
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> new file mode 100644
> index 000000000000..c26bc2c6770d
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#ifndef __ASM_KVM_RME_H
> +#define __ASM_KVM_RME_H
> +
> +enum realm_state {
> + REALM_STATE_NONE,
> + REALM_STATE_NEW,
> + REALM_STATE_ACTIVE,
> + REALM_STATE_DYING
> +};
> +
> +struct realm {
> + enum realm_state state;
> +};
> +
> +int kvm_init_rme(void);
> +
> +#endif
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 4eb601e7de50..be1383e26626 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -80,6 +80,7 @@ void __hyp_set_vectors(phys_addr_t phys_vector_base);
> void __hyp_reset_vectors(void);
>
> DECLARE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> +DECLARE_STATIC_KEY_FALSE(kvm_rme_is_available);
>
> /* Reports the availability of HYP mode */
> static inline bool is_hyp_mode_available(void)
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 5e33c2d4645a..d2f0400c50da 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -20,7 +20,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-v3.o vgic/vgic-v4.o \
> vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> - vgic/vgic-its.o vgic/vgic-debug.o
> + vgic/vgic-its.o vgic/vgic-debug.o \
> + rme.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..d97b39d042ab 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -38,6 +38,7 @@
> #include <asm/kvm_asm.h>
> #include <asm/kvm_mmu.h>
> #include <asm/kvm_pkvm.h>
> +#include <asm/kvm_rme.h>
> #include <asm/kvm_emulate.h>
> #include <asm/sections.h>
>
> @@ -47,6 +48,7 @@
>
> static enum kvm_mode kvm_mode = KVM_MODE_DEFAULT;
> DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> +DEFINE_STATIC_KEY_FALSE(kvm_rme_is_available);
>
> DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
>
> @@ -2213,6 +2215,12 @@ int kvm_arch_init(void *opaque)
>
> in_hyp_mode = is_kernel_in_hyp_mode();
>
> + if (in_hyp_mode) {
> + err = kvm_init_rme();
> + if (err)
> + return err;
> + }
> +
> if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
> cpus_have_final_cap(ARM64_WORKAROUND_1508412))
> kvm_info("Guests without required CPU erratum workarounds can deadlock system!\n" \
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> new file mode 100644
> index 000000000000..f6b587bc116e
> --- /dev/null
> +++ b/arch/arm64/kvm/rme.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/kvm_host.h>
> +
> +#include <asm/rmi_cmds.h>
> +#include <asm/virt.h>
> +
> +static int rmi_check_version(void)
> +{
> + struct arm_smccc_res res;
> + int version_major, version_minor;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_VERSION, &res);
> +
> + if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
> + return -ENXIO;
> +
> + version_major = RMI_ABI_VERSION_GET_MAJOR(res.a0);
> + version_minor = RMI_ABI_VERSION_GET_MINOR(res.a0);
> +
> + if (version_major != RMI_ABI_MAJOR_VERSION) {
> + kvm_err("Unsupported RMI ABI (version %d.%d) we support %d\n",
> + version_major, version_minor,
> + RMI_ABI_MAJOR_VERSION);
> + return -ENXIO;
> + }
> +
> + kvm_info("RMI ABI version %d.%d\n", version_major, version_minor);
> +
> + return 0;
> +}
> +
> +int kvm_init_rme(void)
> +{
> + if (PAGE_SIZE != SZ_4K)
> + /* Only 4k page size on the host is supported */
> + return 0;
> +
> + if (rmi_check_version())
> + /* Continue without realm support */
> + return 0;
> +
> + /* Future patch will enable static branch kvm_rme_is_available */
> +
> + return 0;
> +}


2023-02-13 15:56:10

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init

On Fri, 27 Jan 2023 11:29:08 +0000
Steven Price <[email protected]> wrote:

> Query the RMI version number and check if it is a compatible version. A
> static key is also provided to signal that a supported RMM is available.
>
> Functions are provided to query if a VM or VCPU is a realm (or rec)
> which currently will always return false.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
> arch/arm64/include/asm/kvm_host.h | 4 +++
> arch/arm64/include/asm/kvm_rme.h | 22 +++++++++++++
> arch/arm64/include/asm/virt.h | 1 +
> arch/arm64/kvm/Makefile | 3 +-
> arch/arm64/kvm/arm.c | 8 +++++
> arch/arm64/kvm/rme.c | 49 ++++++++++++++++++++++++++++
> 7 files changed, 103 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/include/asm/kvm_rme.h
> create mode 100644 arch/arm64/kvm/rme.c
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 9bdba47f7e14..5a2b7229e83f 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -490,4 +490,21 @@ static inline bool vcpu_has_feature(struct kvm_vcpu *vcpu, int feature)
> return test_bit(feature, vcpu->arch.features);
> }
>
> +static inline bool kvm_is_realm(struct kvm *kvm)
> +{
> + if (static_branch_unlikely(&kvm_rme_is_available))
> + return kvm->arch.is_realm;
> + return false;
> +}
> +
> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> +{
> + return READ_ONCE(kvm->arch.realm.state);
> +}
> +
> +static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> +{
> + return false;
> +}
> +
> #endif /* __ARM64_KVM_EMULATE_H__ */
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..04347c3a8c6b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -26,6 +26,7 @@
> #include <asm/fpsimd.h>
> #include <asm/kvm.h>
> #include <asm/kvm_asm.h>
> +#include <asm/kvm_rme.h>
>
> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
>
> @@ -240,6 +241,9 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + bool is_realm;
> + struct realm realm;
> };
>
> struct kvm_vcpu_fault_info {
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> new file mode 100644
> index 000000000000..c26bc2c6770d
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#ifndef __ASM_KVM_RME_H
> +#define __ASM_KVM_RME_H
> +
> +enum realm_state {
> + REALM_STATE_NONE,
> + REALM_STATE_NEW,
> + REALM_STATE_ACTIVE,
> + REALM_STATE_DYING
> +};
> +

By the way, it would be better to add some comments here introducing the
states.
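
Something like the following would already help (only a sketch, based on how
the states are used later in the series):

        enum realm_state {
                /* No realm resources allocated yet */
                REALM_STATE_NONE,
                /* Realm descriptor created, realm not yet activated */
                REALM_STATE_NEW,
                /* Realm has been activated and can run RECs */
                REALM_STATE_ACTIVE,
                /* Realm is being torn down */
                REALM_STATE_DYING
        };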

> +struct realm {
> + enum realm_state state;
> +};
> +
> +int kvm_init_rme(void);
> +
> +#endif
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 4eb601e7de50..be1383e26626 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -80,6 +80,7 @@ void __hyp_set_vectors(phys_addr_t phys_vector_base);
> void __hyp_reset_vectors(void);
>
> DECLARE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> +DECLARE_STATIC_KEY_FALSE(kvm_rme_is_available);
>
> /* Reports the availability of HYP mode */
> static inline bool is_hyp_mode_available(void)
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 5e33c2d4645a..d2f0400c50da 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -20,7 +20,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-v3.o vgic/vgic-v4.o \
> vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> - vgic/vgic-its.o vgic/vgic-debug.o
> + vgic/vgic-its.o vgic/vgic-debug.o \
> + rme.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..d97b39d042ab 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -38,6 +38,7 @@
> #include <asm/kvm_asm.h>
> #include <asm/kvm_mmu.h>
> #include <asm/kvm_pkvm.h>
> +#include <asm/kvm_rme.h>
> #include <asm/kvm_emulate.h>
> #include <asm/sections.h>
>
> @@ -47,6 +48,7 @@
>
> static enum kvm_mode kvm_mode = KVM_MODE_DEFAULT;
> DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> +DEFINE_STATIC_KEY_FALSE(kvm_rme_is_available);
>
> DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
>
> @@ -2213,6 +2215,12 @@ int kvm_arch_init(void *opaque)
>
> in_hyp_mode = is_kernel_in_hyp_mode();
>
> + if (in_hyp_mode) {
> + err = kvm_init_rme();
> + if (err)
> + return err;
> + }
> +
> if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
> cpus_have_final_cap(ARM64_WORKAROUND_1508412))
> kvm_info("Guests without required CPU erratum workarounds can deadlock system!\n" \
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> new file mode 100644
> index 000000000000..f6b587bc116e
> --- /dev/null
> +++ b/arch/arm64/kvm/rme.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/kvm_host.h>
> +
> +#include <asm/rmi_cmds.h>
> +#include <asm/virt.h>
> +
> +static int rmi_check_version(void)
> +{
> + struct arm_smccc_res res;
> + int version_major, version_minor;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_VERSION, &res);
> +
> + if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
> + return -ENXIO;
> +
> + version_major = RMI_ABI_VERSION_GET_MAJOR(res.a0);
> + version_minor = RMI_ABI_VERSION_GET_MINOR(res.a0);
> +
> + if (version_major != RMI_ABI_MAJOR_VERSION) {
> + kvm_err("Unsupported RMI ABI (version %d.%d) we support %d\n",
> + version_major, version_minor,
> + RMI_ABI_MAJOR_VERSION);
> + return -ENXIO;
> + }
> +
> + kvm_info("RMI ABI version %d.%d\n", version_major, version_minor);
> +
> + return 0;
> +}
> +
> +int kvm_init_rme(void)
> +{
> + if (PAGE_SIZE != SZ_4K)
> + /* Only 4k page size on the host is supported */
> + return 0;
> +
> + if (rmi_check_version())
> + /* Continue without realm support */
> + return 0;
> +
> + /* Future patch will enable static branch kvm_rme_is_available */
> +
> + return 0;
> +}


2023-02-13 15:59:15

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init

On 13/02/2023 15:48, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:08 +0000
> Steven Price <[email protected]> wrote:
>
>> Query the RMI version number and check if it is a compatible version. A
>> static key is also provided to signal that a supported RMM is available.
>>
>> Functions are provided to query if a VM or VCPU is a realm (or rec)
>> which currently will always return false.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
>> arch/arm64/include/asm/kvm_host.h | 4 +++
>> arch/arm64/include/asm/kvm_rme.h | 22 +++++++++++++
>> arch/arm64/include/asm/virt.h | 1 +
>> arch/arm64/kvm/Makefile | 3 +-
>> arch/arm64/kvm/arm.c | 8 +++++
>> arch/arm64/kvm/rme.c | 49 ++++++++++++++++++++++++++++
>> 7 files changed, 103 insertions(+), 1 deletion(-)
>> create mode 100644 arch/arm64/include/asm/kvm_rme.h
>> create mode 100644 arch/arm64/kvm/rme.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 9bdba47f7e14..5a2b7229e83f 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -490,4 +490,21 @@ static inline bool vcpu_has_feature(struct kvm_vcpu *vcpu, int feature)
>> return test_bit(feature, vcpu->arch.features);
>> }
>>
>> +static inline bool kvm_is_realm(struct kvm *kvm)
>> +{
>> + if (static_branch_unlikely(&kvm_rme_is_available))
>> + return kvm->arch.is_realm;
>> + return false;
>> +}
>> +
>> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>> +{
>> + return READ_ONCE(kvm->arch.realm.state);
>> +}
>> +
>> +static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
>> +{
>> + return false;
>> +}
>> +
>> #endif /* __ARM64_KVM_EMULATE_H__ */
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 35a159d131b5..04347c3a8c6b 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -26,6 +26,7 @@
>> #include <asm/fpsimd.h>
>> #include <asm/kvm.h>
>> #include <asm/kvm_asm.h>
>> +#include <asm/kvm_rme.h>
>>
>> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
>>
>> @@ -240,6 +241,9 @@ struct kvm_arch {
>> * the associated pKVM instance in the hypervisor.
>> */
>> struct kvm_protected_vm pkvm;
>> +
>> + bool is_realm;
> ^
> It would be better to add a comment here; that really helps with review.

Thanks for the feedback - I had thought "is realm" was fairly
self-documenting, but perhaps I've just spent too much time with this code.

> I was looking for the user of this member to see when it is set. It seems
> it is not set in this patch. It would have been nice to get a quick answer
> from the comments.

The usage is in the kvm_is_realm() function which is used in several of
the later patches as a way to detect this kvm guest is a realm guest.

I think the main issue is that I've got the patches in the wrong order.
Patch 7 "arm64: kvm: Allow passing machine type in KVM creation" should
probably be before this one, then I could add the assignment of is_realm
into this patch (potentially splitting out the is_realm parts into
another patch).

Thanks,

Steve


2023-02-13 16:04:23

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 05/28] arm64: RME: Define the user ABI

On Fri, 27 Jan 2023 11:29:09 +0000
Steven Price <[email protected]> wrote:

> There is one (multiplexed) CAP which can be used to create, populate and
> then activate the realm.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 1 +
> arch/arm64/include/uapi/asm/kvm.h | 63 +++++++++++++++++++++++++++++++
> include/uapi/linux/kvm.h | 2 +
> 3 files changed, 66 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0dd5d8733dd5..f1a59d6fb7fc 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4965,6 +4965,7 @@ Recognised values for feature:
>
> ===== ===========================================
> arm64 KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)
> + arm64 KVM_ARM_VCPU_REC (requires KVM_CAP_ARM_RME)
> ===== ===========================================
>
> Finalizes the configuration of the specified vcpu feature.
> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> index a7a857f1784d..fcc0b8dce29b 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -109,6 +109,7 @@ struct kvm_regs {
> #define KVM_ARM_VCPU_SVE 4 /* enable SVE for this CPU */
> #define KVM_ARM_VCPU_PTRAUTH_ADDRESS 5 /* VCPU uses address authentication */
> #define KVM_ARM_VCPU_PTRAUTH_GENERIC 6 /* VCPU uses generic authentication */
> +#define KVM_ARM_VCPU_REC 7 /* VCPU REC state as part of Realm */
>
> struct kvm_vcpu_init {
> __u32 target;
> @@ -401,6 +402,68 @@ enum {
> #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
> #define KVM_DEV_ARM_ITS_CTRL_RESET 4
>
> +/* KVM_CAP_ARM_RME on VM fd */
> +#define KVM_CAP_ARM_RME_CONFIG_REALM 0
> +#define KVM_CAP_ARM_RME_CREATE_RD 1
> +#define KVM_CAP_ARM_RME_INIT_IPA_REALM 2
> +#define KVM_CAP_ARM_RME_POPULATE_REALM 3
> +#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4
> +

It is a little bit confusing here. These seem more like 'commands' than caps.
I will leave more comments after reviewing the later patches.

> +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256 0
> +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512 1
> +
> +#define KVM_CAP_ARM_RME_RPV_SIZE 64
> +
> +/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */
> +#define KVM_CAP_ARM_RME_CFG_RPV 0
> +#define KVM_CAP_ARM_RME_CFG_HASH_ALGO 1
> +#define KVM_CAP_ARM_RME_CFG_SVE 2
> +#define KVM_CAP_ARM_RME_CFG_DBG 3
> +#define KVM_CAP_ARM_RME_CFG_PMU 4
> +
> +struct kvm_cap_arm_rme_config_item {
> + __u32 cfg;
> + union {
> + /* cfg == KVM_CAP_ARM_RME_CFG_RPV */
> + struct {
> + __u8 rpv[KVM_CAP_ARM_RME_RPV_SIZE];
> + };
> +
> + /* cfg == KVM_CAP_ARM_RME_CFG_HASH_ALGO */
> + struct {
> + __u32 hash_algo;
> + };
> +
> + /* cfg == KVM_CAP_ARM_RME_CFG_SVE */
> + struct {
> + __u32 sve_vq;
> + };
> +
> + /* cfg == KVM_CAP_ARM_RME_CFG_DBG */
> + struct {
> + __u32 num_brps;
> + __u32 num_wrps;
> + };
> +
> + /* cfg == KVM_CAP_ARM_RME_CFG_PMU */
> + struct {
> + __u32 num_pmu_cntrs;
> + };
> + /* Fix the size of the union */
> + __u8 reserved[256];
> + };
> +};
> +
> +struct kvm_cap_arm_rme_populate_realm_args {
> + __u64 populate_ipa_base;
> + __u64 populate_ipa_size;
> +};
> +
> +struct kvm_cap_arm_rme_init_ipa_args {
> + __u64 init_ipa_base;
> + __u64 init_ipa_size;
> +};
> +
> /* Device Control API on vcpu fd */
> #define KVM_ARM_VCPU_PMU_V3_CTRL 0
> #define KVM_ARM_VCPU_PMU_V3_IRQ 0
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 20522d4ba1e0..fec1909e8b73 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1176,6 +1176,8 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>
> +#define KVM_CAP_ARM_RME 300 // FIXME: Large number to prevent conflicts
> +
> #ifdef KVM_CAP_IRQ_ROUTING
>
> struct kvm_irq_routing_irqchip {


2023-02-13 16:10:55

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On Fri, 27 Jan 2023 11:29:10 +0000
Steven Price <[email protected]> wrote:

> Add the KVM_CAP_ARM_RME_CREATE_FD ioctl to create a realm. This involves
> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
> the base level of the Realm Translation Tables (RTT). A VMID also need
> to be picked, since the RMM has a separate VMID address space a
> dedicated allocator is added for this purpose.
>
> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
> before it is created.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 14 ++
> arch/arm64/kvm/arm.c | 19 ++
> arch/arm64/kvm/mmu.c | 6 +
> arch/arm64/kvm/reset.c | 33 +++
> arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
> 5 files changed, 429 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index c26bc2c6770d..055a22accc08 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -6,6 +6,8 @@
> #ifndef __ASM_KVM_RME_H
> #define __ASM_KVM_RME_H
>
> +#include <uapi/linux/kvm.h>
> +
> enum realm_state {
> REALM_STATE_NONE,
> REALM_STATE_NEW,
> @@ -15,8 +17,20 @@ enum realm_state {
>
> struct realm {
> enum realm_state state;
> +
> + void *rd;
> + struct realm_params *params;
> +
> + unsigned long num_aux;
> + unsigned int vmid;
> + unsigned int ia_bits;
> };
>

Maybe more comments for this structure?
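
For example, something like this (a sketch based on how the fields are used
in this patch):

        struct realm {
                enum realm_state state;

                /* Kernel mapping of the Realm Descriptor (RD) granule */
                void *rd;
                /* Parameters page, only valid until the realm is created */
                struct realm_params *params;

                /* Number of auxiliary granules needed per REC */
                unsigned long num_aux;
                /* VMID, allocated from the separate RMM VMID space */
                unsigned int vmid;
                /* IPA width of the realm */
                unsigned int ia_bits;
        };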

> int kvm_init_rme(void);
> +u32 kvm_realm_ipa_limit(void);
> +
> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> +int kvm_init_realm_vm(struct kvm *kvm);
> +void kvm_destroy_realm(struct kvm *kvm);
>
> #endif
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index d97b39d042ab..50f54a63732a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -103,6 +103,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> r = 0;
> set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
> break;
> + case KVM_CAP_ARM_RME:
> + if (!static_branch_unlikely(&kvm_rme_is_available))
> + return -EINVAL;
> + mutex_lock(&kvm->lock);
> + r = kvm_realm_enable_cap(kvm, cap);
> + mutex_unlock(&kvm->lock);
> + break;
> default:
> r = -EINVAL;
> break;
> @@ -172,6 +179,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> */
> kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit();
>
> + /* Initialise the realm bits after the generic bits are enabled */
> + if (kvm_is_realm(kvm)) {
> + ret = kvm_init_realm_vm(kvm);
> + if (ret)
> + goto err_free_cpumask;
> + }
> +
> return 0;
>
> err_free_cpumask:
> @@ -204,6 +218,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> kvm_destroy_vcpus(kvm);
>
> kvm_unshare_hyp(kvm, kvm + 1);
> +
> + kvm_destroy_realm(kvm);
> }
>
> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> @@ -300,6 +316,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ARM_PTRAUTH_GENERIC:
> r = system_has_full_ptr_auth();
> break;
> + case KVM_CAP_ARM_RME:
> + r = static_key_enabled(&kvm_rme_is_available);
> + break;
> default:
> r = 0;
> }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..d0f707767d05 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -840,6 +840,12 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> struct kvm_pgtable *pgt = NULL;
>
> write_lock(&kvm->mmu_lock);
> + if (kvm_is_realm(kvm) &&
> + kvm_realm_state(kvm) != REALM_STATE_DYING) {
> + /* TODO: teardown rtts */
> + write_unlock(&kvm->mmu_lock);
> + return;
> + }
> pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index e0267f672b8a..c165df174737 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -395,3 +395,36 @@ int kvm_set_ipa_limit(void)
>
> return 0;
> }
> +
> +int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
> +{
> + u64 mmfr0, mmfr1;
> + u32 phys_shift;
> + u32 ipa_limit = kvm_ipa_limit;
> +
> + if (kvm_is_realm(kvm))
> + ipa_limit = kvm_realm_ipa_limit();
> +
> + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> + return -EINVAL;
> +
> + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> + if (phys_shift) {
> + if (phys_shift > ipa_limit ||
> + phys_shift < ARM64_MIN_PARANGE_BITS)
> + return -EINVAL;
> + } else {
> + phys_shift = KVM_PHYS_SHIFT;
> + if (phys_shift > ipa_limit) {
> + pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n",
> + current->comm);
> + return -EINVAL;
> + }
> + }
> +
> + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> + kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
> +
> + return 0;
> +}
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index f6b587bc116e..9f8c5a91b8fc 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -5,9 +5,49 @@
>
> #include <linux/kvm_host.h>
>
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
> #include <asm/rmi_cmds.h>
> #include <asm/virt.h>
>
> +/************ FIXME: Copied from kvm/hyp/pgtable.c **********/
> +#include <asm/kvm_pgtable.h>
> +
> +struct kvm_pgtable_walk_data {
> + struct kvm_pgtable *pgt;
> + struct kvm_pgtable_walker *walker;
> +
> + u64 addr;
> + u64 end;
> +};
> +
> +static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
> +{
> + u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
> + u64 mask = BIT(pgt->ia_bits) - 1;
> +
> + return (addr & mask) >> shift;
> +}
> +
> +static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
> +{
> + struct kvm_pgtable pgt = {
> + .ia_bits = ia_bits,
> + .start_level = start_level,
> + };
> +
> + return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
> +}
> +
> +/******************/
> +
> +static unsigned long rmm_feat_reg0;
> +
> +static bool rme_supports(unsigned long feature)
> +{
> + return !!u64_get_bits(rmm_feat_reg0, feature);
> +}
> +
> static int rmi_check_version(void)
> {
> struct arm_smccc_res res;
> @@ -33,8 +73,319 @@ static int rmi_check_version(void)
> return 0;
> }
>
> +static unsigned long create_realm_feat_reg0(struct kvm *kvm)
> +{
> + unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> + u64 feat_reg0 = 0;
> +
> + int num_bps = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_NUM_BPS);
> + int num_wps = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_NUM_WPS);
> +
> + feat_reg0 |= u64_encode_bits(ia_bits, RMI_FEATURE_REGISTER_0_S2SZ);
> + feat_reg0 |= u64_encode_bits(num_bps, RMI_FEATURE_REGISTER_0_NUM_BPS);
> + feat_reg0 |= u64_encode_bits(num_wps, RMI_FEATURE_REGISTER_0_NUM_WPS);
> +
> + return feat_reg0;
> +}
> +
> +u32 kvm_realm_ipa_limit(void)
> +{
> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
> +}
> +
> +static u32 get_start_level(struct kvm *kvm)
> +{
> + long sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, kvm->arch.vtcr);
> +
> + return VTCR_EL2_TGRAN_SL0_BASE - sl0;
> +}
> +
> +static int realm_create_rd(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct realm_params *params = realm->params;
> + void *rd = NULL;
> + phys_addr_t rd_phys, params_phys;
> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + unsigned int pgd_sz;
> + int i, r;
> +
> + if (WARN_ON(realm->rd) || WARN_ON(!realm->params))
> + return -EEXIST;
> +
> + rd = (void *)__get_free_page(GFP_KERNEL);
> + if (!rd)
> + return -ENOMEM;
> +
> + rd_phys = virt_to_phys(rd);
> + if (rmi_granule_delegate(rd_phys)) {
> + r = -ENXIO;
> + goto out;
> + }
> +
> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> + for (i = 0; i < pgd_sz; i++) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + if (rmi_granule_delegate(pgd_phys)) {
> + r = -ENXIO;
> + goto out_undelegate_tables;
> + }
> + }
> +
> + params->rtt_level_start = get_start_level(kvm);
> + params->rtt_num_start = pgd_sz;
> + params->rtt_base = kvm->arch.mmu.pgd_phys;
> + params->vmid = realm->vmid;
> +
> + params_phys = virt_to_phys(params);
> +
> + if (rmi_realm_create(rd_phys, params_phys)) {
> + r = -ENXIO;
> + goto out_undelegate_tables;
> + }
> +
> + realm->rd = rd;
> + realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> +
> + if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
> + WARN_ON(rmi_realm_destroy(rd_phys));
> + goto out_undelegate_tables;
> + }
> +
> + return 0;
> +
> +out_undelegate_tables:
> + while (--i >= 0) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + WARN_ON(rmi_granule_undelegate(pgd_phys));
> + }
> + WARN_ON(rmi_granule_undelegate(rd_phys));
> +out:
> + free_page((unsigned long)rd);
> + return r;
> +}
> +

Just curious. Wouldn't it be better to use IDR as this is ID allocation? There
were some efforts to change the use of bitmap allocation to IDR before.
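
e.g. an ida would avoid both the explicit bitmap and the lock (untested
sketch, needs <linux/idr.h>):

        static DEFINE_IDA(rme_vmid_ida);

        static int rme_vmid_reserve(void)
        {
                return ida_alloc_max(&rme_vmid_ida,
                                     (1 << kvm_get_vmid_bits()) - 1,
                                     GFP_KERNEL);
        }

        static void rme_vmid_release(unsigned int vmid)
        {
                ida_free(&rme_vmid_ida, vmid);
        }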

> +/* Protects access to rme_vmid_bitmap */
> +static DEFINE_SPINLOCK(rme_vmid_lock);
> +static unsigned long *rme_vmid_bitmap;
> +
> +static int rme_vmid_init(void)
> +{
> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> +
> + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
> + if (!rme_vmid_bitmap) {
> + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +static int rme_vmid_reserve(void)
> +{
> + int ret;
> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> +
> + spin_lock(&rme_vmid_lock);
> + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
> + spin_unlock(&rme_vmid_lock);
> +
> + return ret;
> +}
> +
> +static void rme_vmid_release(unsigned int vmid)
> +{
> + spin_lock(&rme_vmid_lock);
> + bitmap_release_region(rme_vmid_bitmap, vmid, 0);
> + spin_unlock(&rme_vmid_lock);
> +}
> +
> +static int kvm_create_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + int ret;
> +
> + if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EEXIST;
> +
> + ret = rme_vmid_reserve();
> + if (ret < 0)
> + return ret;
> + realm->vmid = ret;
> +
> + ret = realm_create_rd(kvm);
> + if (ret) {
> + rme_vmid_release(realm->vmid);
> + return ret;
> + }
> +
> + WRITE_ONCE(realm->state, REALM_STATE_NEW);
> +
> + /* The realm is up, free the parameters. */
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> +
> + return 0;
> +}
> +
> +static int config_realm_hash_algo(struct realm *realm,
> + struct kvm_cap_arm_rme_config_item *cfg)
> +{
> + switch (cfg->hash_algo) {
> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
> + return -EINVAL;
> + break;
> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + realm->params->measurement_algo = cfg->hash_algo;
> + return 0;
> +}
> +
> +static int config_realm_sve(struct realm *realm,
> + struct kvm_cap_arm_rme_config_item *cfg)
> +{
> + u64 features_0 = realm->params->features_0;
> + int max_sve_vq = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> +
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
> + return -EINVAL;
> +
> + if (cfg->sve_vq > max_sve_vq)
> + return -EINVAL;
> +
> + features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> + features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
> + features_0 |= u64_encode_bits(cfg->sve_vq,
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> +
> + realm->params->features_0 = features_0;
> + return 0;
> +}
> +
> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + struct kvm_cap_arm_rme_config_item cfg;
> + struct realm *realm = &kvm->arch.realm;
> + int r = 0;
> +
> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EBUSY;
> +
> + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
> + return -EFAULT;
> +
> + switch (cfg.cfg) {
> + case KVM_CAP_ARM_RME_CFG_RPV:
> + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
> + break;
> + case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
> + r = config_realm_hash_algo(realm, &cfg);
> + break;
> + case KVM_CAP_ARM_RME_CFG_SVE:
> + r = config_realm_sve(realm, &cfg);
> + break;
> + default:
> + r = -EINVAL;
> + }
> +
> + return r;
> +}
> +
> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + int r = 0;
> +
> + switch (cap->args[0]) {
> + case KVM_CAP_ARM_RME_CONFIG_REALM:
> + r = kvm_rme_config_realm(kvm, cap);
> + break;
> + case KVM_CAP_ARM_RME_CREATE_RD:
> + if (kvm->created_vcpus) {
> + r = -EBUSY;
> + break;
> + }
> +
> + r = kvm_create_realm(kvm);
> + break;
> + default:
> + r = -EINVAL;
> + break;
> + }
> +
> + return r;
> +}
> +
> +void kvm_destroy_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + unsigned int pgd_sz;
> + int i;
> +
> + if (realm->params) {
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> + }
> +
> + if (kvm_realm_state(kvm) == REALM_STATE_NONE)
> + return;
> +
> + WRITE_ONCE(realm->state, REALM_STATE_DYING);
> +
> + rme_vmid_release(realm->vmid);
> +
> + if (realm->rd) {
> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> +
> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
> + return;
> + if (WARN_ON(rmi_granule_undelegate(rd_phys)))
> + return;
> + free_page((unsigned long)realm->rd);
> + realm->rd = NULL;
> + }
> +
> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> + for (i = 0; i < pgd_sz; i++) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
> + return;
> + }
> +
> + kvm_free_stage2_pgd(&kvm->arch.mmu);
> +}
> +
> +int kvm_init_realm_vm(struct kvm *kvm)
> +{
> + struct realm_params *params;
> +
> + params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
> + if (!params)
> + return -ENOMEM;
> +
> + params->features_0 = create_realm_feat_reg0(kvm);
> + kvm->arch.realm.params = params;
> + return 0;
> +}
> +
> int kvm_init_rme(void)
> {
> + int ret;
> +
> if (PAGE_SIZE != SZ_4K)
> /* Only 4k page size on the host is supported */
> return 0;
> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
> /* Continue without realm support */
> return 0;
>
> + ret = rme_vmid_init();
> + if (ret)
> + return ret;
> +
> + WARN_ON(rmi_features(0, &rmm_feat_reg0));
> +
> /* Future patch will enable static branch kvm_rme_is_available */
>
> return 0;


2023-02-13 16:35:43

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 07/28] arm64: kvm: Allow passing machine type in KVM creation

On Fri, 27 Jan 2023 11:29:11 +0000
Steven Price <[email protected]> wrote:

> Previously machine type was used purely for specifying the physical
> address size of the guest. Reserve the higher bits to specify an ARM
> specific machine type and declare a new type 'KVM_VM_TYPE_ARM_REALM'
> used to create a realm guest.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 13 +++++++++++++
> arch/arm64/kvm/mmu.c | 3 ---
> arch/arm64/kvm/reset.c | 3 ---
> include/uapi/linux/kvm.h | 19 +++++++++++++++----
> 4 files changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 50f54a63732a..badd775547b8 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -147,6 +147,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> {
> int ret;
>
> + if (type & ~(KVM_VM_TYPE_ARM_MASK | KVM_VM_TYPE_ARM_IPA_SIZE_MASK))
> + return -EINVAL;
> +
> + switch (type & KVM_VM_TYPE_ARM_MASK) {
> + case KVM_VM_TYPE_ARM_NORMAL:
> + break;
> + case KVM_VM_TYPE_ARM_REALM:
> + kvm->arch.is_realm = true;

Would it be better to let this call fail when !kvm_rme_is_available? It is
strange to be able to create a VM with the REALM type on a system that
doesn't support RME.
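
i.e. something like (untested):

        case KVM_VM_TYPE_ARM_REALM:
                if (!static_branch_unlikely(&kvm_rme_is_available))
                        return -EINVAL;
                kvm->arch.is_realm = true;
                break;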

> + break;
> + default:
> + return -EINVAL;
> + }
> +
> ret = kvm_share_hyp(kvm, kvm + 1);
> if (ret)
> return ret;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d0f707767d05..22c00274884a 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -709,9 +709,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> u64 mmfr0, mmfr1;
> u32 phys_shift;
>
> - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> - return -EINVAL;
> -
> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> if (is_protected_kvm_enabled()) {
> phys_shift = kvm_ipa_limit;
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index c165df174737..9e71d69e051f 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -405,9 +405,6 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
> if (kvm_is_realm(kvm))
> ipa_limit = kvm_realm_ipa_limit();
>
> - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> - return -EINVAL;
> -
> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> if (phys_shift) {
> if (phys_shift > ipa_limit ||
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index fec1909e8b73..bcfc4d58dc19 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -898,14 +898,25 @@ struct kvm_ppc_resize_hpt {
> #define KVM_S390_SIE_PAGE_OFFSET 1
>
> /*
> - * On arm64, machine type can be used to request the physical
> - * address size for the VM. Bits[7-0] are reserved for the guest
> - * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
> - * value 0 implies the default IPA size, 40bits.
> + * On arm64, machine type can be used to request both the machine type and
> + * the physical address size for the VM.
> + *
> + * Bits[11-8] are reserved for the ARM specific machine type.
> + *
> + * Bits[7-0] are reserved for the guest PA size shift (i.e, log2(PA_Size)).
> + * For backward compatibility, value 0 implies the default IPA size, 40bits.
> */
> +#define KVM_VM_TYPE_ARM_SHIFT 8
> +#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
> +#define KVM_VM_TYPE_ARM(_type) \
> + (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
> +#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0)
> +#define KVM_VM_TYPE_ARM_REALM KVM_VM_TYPE_ARM(1)
> +
> #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
> #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
> ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> +
> /*
> * ioctls for /dev/kvm fds:
> */


2023-02-13 16:43:46

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 03/28] arm64: RME: Add wrappers for RMI calls

On Fri, 27 Jan 2023 11:29:07 +0000
Steven Price <[email protected]> wrote:

> The wrappers make the call sites easier to read and deal with the
> boiler plate of handling the error codes from the RMM.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/rmi_cmds.h | 259 ++++++++++++++++++++++++++++++
> 1 file changed, 259 insertions(+)
> create mode 100644 arch/arm64/include/asm/rmi_cmds.h
>
> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
> new file mode 100644
> index 000000000000..d5468ee46f35
> --- /dev/null
> +++ b/arch/arm64/include/asm/rmi_cmds.h
> @@ -0,0 +1,259 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#ifndef __ASM_RMI_CMDS_H
> +#define __ASM_RMI_CMDS_H
> +
> +#include <linux/arm-smccc.h>
> +
> +#include <asm/rmi_smc.h>
> +
> +struct rtt_entry {
> + unsigned long walk_level;
> + unsigned long desc;
> + int state;
> + bool ripas;
> +};
> +

It would be nice to have some documentation for the following wrappers, e.g.
the meaning of the return values. That would be quite helpful when reviewing
the later patches.
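
For example, a kernel-doc style comment on each wrapper. The parameter and
return descriptions below are only my guesses from the spec, so treat this as
a sketch of the format rather than the actual semantics:

/**
 * rmi_data_create() - Create a Data Granule
 * @data:	PA of the delegated granule that will hold the data
 * @rd:		PA of the Realm Descriptor
 * @map_addr:	IPA at which the granule will be mapped in the realm
 * @src:	PA of the Non-secure page whose contents are copied in
 * @flags:	Flags, e.g. whether the contents are measured
 *
 * Return: RMI_SUCCESS (0) on success, otherwise an RMI error code.
 */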

> +static inline int rmi_data_create(unsigned long data, unsigned long rd,
> + unsigned long map_addr, unsigned long src,
> + unsigned long flags)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE, data, rd, map_addr, src,
> + flags, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_data_create_unknown(unsigned long data,
> + unsigned long rd,
> + unsigned long map_addr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE_UNKNOWN, data, rd, map_addr,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_data_destroy(unsigned long rd, unsigned long map_addr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_DATA_DESTROY, rd, map_addr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_features(unsigned long index, unsigned long *out)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_FEATURES, index, &res);
> +
> + *out = res.a1;
> + return res.a0;
> +}
> +
> +static inline int rmi_granule_delegate(unsigned long phys)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_GRANULE_DELEGATE, phys, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_granule_undelegate(unsigned long phys)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_GRANULE_UNDELEGATE, phys, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_psci_complete(unsigned long calling_rec,
> + unsigned long target_rec)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_PSCI_COMPLETE, calling_rec, target_rec,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_realm_activate(unsigned long rd)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REALM_ACTIVATE, rd, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_realm_create(unsigned long rd, unsigned long params_ptr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REALM_CREATE, rd, params_ptr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_realm_destroy(unsigned long rd)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REALM_DESTROY, rd, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_aux_count(unsigned long rd, unsigned long *aux_count)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_AUX_COUNT, rd, &res);
> +
> + *aux_count = res.a1;
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_create(unsigned long rec, unsigned long rd,
> + unsigned long params_ptr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_CREATE, rec, rd, params_ptr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_destroy(unsigned long rec)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_DESTROY, rec, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_enter(unsigned long rec, unsigned long run_ptr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_ENTER, rec, run_ptr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_create(unsigned long rtt, unsigned long rd,
> + unsigned long map_addr, unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_CREATE, rtt, rd, map_addr, level,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_destroy(unsigned long rtt, unsigned long rd,
> + unsigned long map_addr, unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_DESTROY, rtt, rd, map_addr, level,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_fold(unsigned long rtt, unsigned long rd,
> + unsigned long map_addr, unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_FOLD, rtt, rd, map_addr, level, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_init_ripas(unsigned long rd, unsigned long map_addr,
> + unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_INIT_RIPAS, rd, map_addr, level, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_map_unprotected(unsigned long rd,
> + unsigned long map_addr,
> + unsigned long level,
> + unsigned long desc)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_MAP_UNPROTECTED, rd, map_addr, level,
> + desc, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_read_entry(unsigned long rd, unsigned long map_addr,
> + unsigned long level, struct rtt_entry *rtt)
> +{
> + struct arm_smccc_1_2_regs regs = {
> + SMC_RMI_RTT_READ_ENTRY,
> + rd, map_addr, level
> + };
> +
> + arm_smccc_1_2_smc(&regs, &regs);
> +
> + rtt->walk_level = regs.a1;
> + rtt->state = regs.a2 & 0xFF;
> + rtt->desc = regs.a3;
> + rtt->ripas = regs.a4 & 1;
> +
> + return regs.a0;
> +}
> +
> +static inline int rmi_rtt_set_ripas(unsigned long rd, unsigned long rec,
> + unsigned long map_addr, unsigned long level,
> + unsigned long ripas)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_SET_RIPAS, rd, rec, map_addr, level,
> + ripas, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_unmap_unprotected(unsigned long rd,
> + unsigned long map_addr,
> + unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_UNMAP_UNPROTECTED, rd, map_addr,
> + level, &res);
> +
> + return res.a0;
> +}
> +
> +static inline phys_addr_t rmi_rtt_get_phys(struct rtt_entry *rtt)
> +{
> + return rtt->desc & GENMASK(47, 12);
> +}
> +
> +#endif


2023-02-13 16:47:13

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 08/28] arm64: RME: Keep a spare page delegated to the RMM

On Fri, 27 Jan 2023 11:29:12 +0000
Steven Price <[email protected]> wrote:

> Pages can only be populated/destroyed on the RMM at the 4KB granule,
> this requires creating the full depth of RTTs. However if the pages are
> going to be combined into a 2MB huge page, the last RTT is only
> temporarily needed. Similarly when freeing memory the huge page must be
> temporarily split, requiring temporary usage of the full depth of RTTs.
>
> To avoid needing to perform a temporary allocation and delegation of a
> page for this purpose we keep a spare delegated page around. In
> particular this avoids the need for memory allocation while destroying
> the realm guest.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 3 +++
> arch/arm64/kvm/rme.c | 6 ++++++
> 2 files changed, 9 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index 055a22accc08..a6318af3ed11 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -21,6 +21,9 @@ struct realm {
> void *rd;
> struct realm_params *params;
>
> + /* A spare already delegated page */
> + phys_addr_t spare_page;
> +
> unsigned long num_aux;
> unsigned int vmid;
> unsigned int ia_bits;
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index 9f8c5a91b8fc..0c9d70e4d9e6 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -148,6 +148,7 @@ static int realm_create_rd(struct kvm *kvm)
> }
>
> realm->rd = rd;
> + realm->spare_page = PHYS_ADDR_MAX;
> realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
>
> if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
> @@ -357,6 +358,11 @@ void kvm_destroy_realm(struct kvm *kvm)
> free_page((unsigned long)realm->rd);
> realm->rd = NULL;
> }
> + if (realm->spare_page != PHYS_ADDR_MAX) {
> + if (!WARN_ON(rmi_granule_undelegate(realm->spare_page)))
> + free_page((unsigned long)phys_to_virt(realm->spare_page));

Will the page be leaked (not usable by either the host or realms) if the
undelegate fails? If so, it would be better to at least add a comment.

> + realm->spare_page = PHYS_ADDR_MAX;
> + }
>
> pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> for (i = 0; i < pgd_sz; i++) {


2023-02-13 17:44:36

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 09/28] arm64: RME: RTT handling

On Fri, 27 Jan 2023 11:29:13 +0000
Steven Price <[email protected]> wrote:

> The RMM owns the stage 2 page tables for a realm, and KVM must request
> that the RMM creates/destroys entries as necessary. The physical pages
> to store the page tables are delegated to the realm as required, and can
> be undelegated when no longer used.
>

This is only an introduction to RTT handling. Since this patch is mostly
about RTT teardown, it would be better to add more of an introduction to the
commit message. Also maybe refine the title to reflect what this patch is
actually doing.

> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 19 +++++
> arch/arm64/kvm/mmu.c | 7 +-
> arch/arm64/kvm/rme.c | 139 +++++++++++++++++++++++++++++++
> 3 files changed, 162 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index a6318af3ed11..eea5118dfa8a 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -35,5 +35,24 @@ u32 kvm_realm_ipa_limit(void);
> int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> int kvm_init_realm_vm(struct kvm *kvm);
> void kvm_destroy_realm(struct kvm *kvm);
> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
> +
> +#define RME_RTT_BLOCK_LEVEL 2
> +#define RME_RTT_MAX_LEVEL 3
> +
> +#define RME_PAGE_SHIFT 12
> +#define RME_PAGE_SIZE BIT(RME_PAGE_SHIFT)
> +/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */
> +#define RME_RTT_LEVEL_SHIFT(l) \
> + ((RME_PAGE_SHIFT - 3) * (4 - (l)) + 3)
> +#define RME_L2_BLOCK_SIZE BIT(RME_RTT_LEVEL_SHIFT(2))
> +
> +static inline unsigned long rme_rtt_level_mapsize(int level)
> +{
> + if (WARN_ON(level > RME_RTT_MAX_LEVEL))
> + return RME_PAGE_SIZE;
> +
> + return (1UL << RME_RTT_LEVEL_SHIFT(level));
> +}
>
> #endif
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 22c00274884a..f29558c5dcbc 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -834,16 +834,17 @@ void stage2_unmap_vm(struct kvm *kvm)
> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> {
> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> - struct kvm_pgtable *pgt = NULL;
> + struct kvm_pgtable *pgt;
>
> write_lock(&kvm->mmu_lock);
> + pgt = mmu->pgt;
> if (kvm_is_realm(kvm) &&
> kvm_realm_state(kvm) != REALM_STATE_DYING) {
> - /* TODO: teardown rtts */
> write_unlock(&kvm->mmu_lock);
> + kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
> + pgt->start_level);
> return;
> }
> - pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> mmu->pgt = NULL;
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index 0c9d70e4d9e6..f7b0e5a779f8 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -73,6 +73,28 @@ static int rmi_check_version(void)
> return 0;
> }
>
> +static void realm_destroy_undelegate_range(struct realm *realm,
> + unsigned long ipa,
> + unsigned long addr,
> + ssize_t size)
> +{
> + unsigned long rd = virt_to_phys(realm->rd);
> + int ret;
> +
> + while (size > 0) {
> + ret = rmi_data_destroy(rd, ipa);
> + WARN_ON(ret);
> + ret = rmi_granule_undelegate(addr);
> +
As the return value is not documented, what happens if a page undelegate
fails? Is the page leaked? Some explanation is required here.
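
If my reading is right that the get_page() below is there to deliberately
leak the page when the undelegate fails, a comment along these lines would
help (just a sketch):

		/*
		 * Undelegate failed: the page can no longer be safely used by
		 * the host, so take an extra reference to make sure it is
		 * never freed and reused.
		 */
		if (ret)
			get_page(phys_to_page(addr));
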
> + if (ret)
> + get_page(phys_to_page(addr));
> +
> + addr += PAGE_SIZE;
> + ipa += PAGE_SIZE;
> + size -= PAGE_SIZE;
> + }
> +}
> +
> static unsigned long create_realm_feat_reg0(struct kvm *kvm)
> {
> unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> @@ -170,6 +192,123 @@ static int realm_create_rd(struct kvm *kvm)
> return r;
> }
>
> +static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
> + int level, phys_addr_t rtt_granule)
> +{
> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
> + return rmi_rtt_destroy(rtt_granule, virt_to_phys(realm->rd), addr,
> + level);
> +}
> +
> +static int realm_destroy_free_rtt(struct realm *realm, unsigned long addr,
> + int level, phys_addr_t rtt_granule)
> +{
> + if (realm_rtt_destroy(realm, addr, level, rtt_granule))
> + return -ENXIO;
> + if (!WARN_ON(rmi_granule_undelegate(rtt_granule)))
> + put_page(phys_to_page(rtt_granule));
> +
> + return 0;
> +}
> +
> +static int realm_rtt_create(struct realm *realm,
> + unsigned long addr,
> + int level,
> + phys_addr_t phys)
> +{
> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
> + return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
> +}
> +
> +static int realm_tear_down_rtt_range(struct realm *realm, int level,
> + unsigned long start, unsigned long end)
> +{
> + phys_addr_t rd = virt_to_phys(realm->rd);
> + ssize_t map_size = rme_rtt_level_mapsize(level);
> + unsigned long addr, next_addr;
> + bool failed = false;
> +
> + for (addr = start; addr < end; addr = next_addr) {
> + phys_addr_t rtt_addr, tmp_rtt;
> + struct rtt_entry rtt;
> + unsigned long end_addr;
> +
> + next_addr = ALIGN(addr + 1, map_size);
> +
> + end_addr = min(next_addr, end);
> +
> + if (rmi_rtt_read_entry(rd, ALIGN_DOWN(addr, map_size),
> + level, &rtt)) {
> + failed = true;
> + continue;
> + }
> +
> + rtt_addr = rmi_rtt_get_phys(&rtt);
> + WARN_ON(level != rtt.walk_level);
> +
> + switch (rtt.state) {
> + case RMI_UNASSIGNED:
> + case RMI_DESTROYED:
> + break;
> + case RMI_TABLE:
> + if (realm_tear_down_rtt_range(realm, level + 1,
> + addr, end_addr)) {
> + failed = true;
> + break;
> + }
> + if (IS_ALIGNED(addr, map_size) &&
> + next_addr <= end &&
> + realm_destroy_free_rtt(realm, addr, level + 1,
> + rtt_addr))
> + failed = true;
> + break;
> + case RMI_ASSIGNED:
> + WARN_ON(!rtt_addr);
> + /*
> + * If there is a block mapping, break it now, using the
> + * spare_page. We are sure to have a valid delegated
> + * page at spare_page before we enter here, otherwise
> + * WARN once, which will be followed by further
> + * warnings.
> + */
> + tmp_rtt = realm->spare_page;
> + if (level == 2 &&
> + !WARN_ON_ONCE(tmp_rtt == PHYS_ADDR_MAX) &&
> + realm_rtt_create(realm, addr,
> + RME_RTT_MAX_LEVEL, tmp_rtt)) {
> + WARN_ON(1);
> + failed = true;
> + break;
> + }
> + realm_destroy_undelegate_range(realm, addr,
> + rtt_addr, map_size);
> + /*
> + * Collapse the last level table and make the spare page
> + * reusable again.
> + */
> + if (level == 2 &&
> + realm_rtt_destroy(realm, addr, RME_RTT_MAX_LEVEL,
> + tmp_rtt))
> + failed = true;
> + break;
> + case RMI_VALID_NS:
> + WARN_ON(rmi_rtt_unmap_unprotected(rd, addr, level));
> + break;
> + default:
> + WARN_ON(1);
> + failed = true;
> + break;
> + }
> + }
> +
> + return failed ? -EINVAL : 0;
> +}
> +
> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
> +{
> + realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
> +}
> +
> /* Protects access to rme_vmid_bitmap */
> static DEFINE_SPINLOCK(rme_vmid_lock);
> static unsigned long *rme_vmid_bitmap;


2023-02-13 18:09:31

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 10/28] arm64: RME: Allocate/free RECs to match vCPUs

On Fri, 27 Jan 2023 11:29:14 +0000
Steven Price <[email protected]> wrote:

> The RMM maintains a data structure known as the Realm Execution Context
> (or REC). It is similar to struct kvm_vcpu and tracks the state of the
> virtual CPUs. KVM must delegate memory and request the structures are
> created when vCPUs are created, and suitably tear down on destruction.
>

It would be better to leave some pointers to the spec here. It really saves
time for reviewers.

> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_emulate.h | 2 +
> arch/arm64/include/asm/kvm_host.h | 3 +
> arch/arm64/include/asm/kvm_rme.h | 10 ++
> arch/arm64/kvm/arm.c | 1 +
> arch/arm64/kvm/reset.c | 11 ++
> arch/arm64/kvm/rme.c | 144 +++++++++++++++++++++++++++
> 6 files changed, 171 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 5a2b7229e83f..285e62914ca4 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -504,6 +504,8 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>
> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> {
> + if (static_branch_unlikely(&kvm_rme_is_available))
> + return vcpu->arch.rec.mpidr != INVALID_HWID;
> return false;
> }
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 04347c3a8c6b..ef497b718cdb 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -505,6 +505,9 @@ struct kvm_vcpu_arch {
> u64 last_steal;
> gpa_t base;
> } steal;
> +
> + /* Realm meta data */
> + struct rec rec;

I think the name of the data structure "rec" needs a prefix; it is too common
and might conflict with private data structures in other modules. Maybe
rme_rec or realm_rec?
> };
>
> /*
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index eea5118dfa8a..4b219ebe1400 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -6,6 +6,7 @@
> #ifndef __ASM_KVM_RME_H
> #define __ASM_KVM_RME_H
>
> +#include <asm/rmi_smc.h>
> #include <uapi/linux/kvm.h>
>
> enum realm_state {
> @@ -29,6 +30,13 @@ struct realm {
> unsigned int ia_bits;
> };
>
> +struct rec {
> + unsigned long mpidr;
> + void *rec_page;
> + struct page *aux_pages[REC_PARAMS_AUX_GRANULES];
> + struct rec_run *run;
> +};
> +

It would be better to add some comments for the above members, or pointers to
the spec; that saves a lot of time for review.
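
Something like the below would already help a lot (the member descriptions
are my guesses, so please correct them as needed):

struct rec {
	/* MPIDR of the vCPU; INVALID_HWID until the REC has been created */
	unsigned long mpidr;
	/* Page delegated to the RMM holding the REC itself */
	void *rec_page;
	/* Auxiliary granules delegated to the RMM for REC-private state */
	struct page *aux_pages[REC_PARAMS_AUX_GRANULES];
	/* Shared page used to pass entry/exit state to/from the RMM */
	struct rec_run *run;
};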

> int kvm_init_rme(void);
> u32 kvm_realm_ipa_limit(void);
>
> @@ -36,6 +44,8 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> int kvm_init_realm_vm(struct kvm *kvm);
> void kvm_destroy_realm(struct kvm *kvm);
> void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
> +int kvm_create_rec(struct kvm_vcpu *vcpu);
> +void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>
> #define RME_RTT_BLOCK_LEVEL 2
> #define RME_RTT_MAX_LEVEL 3
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index badd775547b8..52affed2f3cf 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -373,6 +373,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> /* Force users to call KVM_ARM_VCPU_INIT */
> vcpu->arch.target = -1;
> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> + vcpu->arch.rec.mpidr = INVALID_HWID;
>
> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 9e71d69e051f..0c84392a4bf2 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -135,6 +135,11 @@ int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
> return -EPERM;
>
> return kvm_vcpu_finalize_sve(vcpu);
> + case KVM_ARM_VCPU_REC:
> + if (!kvm_is_realm(vcpu->kvm))
> + return -EINVAL;
> +
> + return kvm_create_rec(vcpu);
> }
>
> return -EINVAL;
> @@ -145,6 +150,11 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu)
> if (vcpu_has_sve(vcpu) && !kvm_arm_vcpu_sve_finalized(vcpu))
> return false;
>
> + if (kvm_is_realm(vcpu->kvm) &&
> + !(vcpu_is_rec(vcpu) &&
> + READ_ONCE(vcpu->kvm->arch.realm.state) == REALM_STATE_ACTIVE))
> + return false;

That's why it would be better to introduce the realm states in an earlier
patch, so that people already have an idea of the state machine by this stage.

> +
> return true;
> }
>
> @@ -157,6 +167,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> if (sve_state)
> kvm_unshare_hyp(sve_state, sve_state + vcpu_sve_state_size(vcpu));
> kfree(sve_state);
> + kvm_destroy_rec(vcpu);
> }
>
> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index f7b0e5a779f8..d79ed889ca4d 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -514,6 +514,150 @@ void kvm_destroy_realm(struct kvm *kvm)
> kvm_free_stage2_pgd(&kvm->arch.mmu);
> }
>
> +static void free_rec_aux(struct page **aux_pages,
> + unsigned int num_aux)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < num_aux; i++) {
> + phys_addr_t aux_page_phys = page_to_phys(aux_pages[i]);
> +
> + if (WARN_ON(rmi_granule_undelegate(aux_page_phys)))
> + continue;
> +
> + __free_page(aux_pages[i]);
> + }
> +}
> +
> +static int alloc_rec_aux(struct page **aux_pages,
> + u64 *aux_phys_pages,
> + unsigned int num_aux)
> +{
> + int ret;
> + unsigned int i;
> +
> + for (i = 0; i < num_aux; i++) {
> + struct page *aux_page;
> + phys_addr_t aux_page_phys;
> +
> + aux_page = alloc_page(GFP_KERNEL);
> + if (!aux_page) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> + aux_page_phys = page_to_phys(aux_page);
> + if (rmi_granule_delegate(aux_page_phys)) {
> + __free_page(aux_page);
> + ret = -ENXIO;
> + goto out_err;
> + }
> + aux_pages[i] = aux_page;
> + aux_phys_pages[i] = aux_page_phys;
> + }
> +
> + return 0;
> +out_err:
> + free_rec_aux(aux_pages, i);
> + return ret;
> +}
> +
> +int kvm_create_rec(struct kvm_vcpu *vcpu)
> +{
> + struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
> + unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
> + struct realm *realm = &vcpu->kvm->arch.realm;
> + struct rec *rec = &vcpu->arch.rec;
> + unsigned long rec_page_phys;
> + struct rec_params *params;
> + int r, i;
> +
> + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_NEW)
> + return -ENOENT;
> +
> + /*
> + * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2
> + * flag covers v0.2 and onwards.
> + */
> + if (!test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
> + return -EINVAL;
> +
> + BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE);
> + BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE);
> +
> + params = (struct rec_params *)get_zeroed_page(GFP_KERNEL);
> + rec->rec_page = (void *)__get_free_page(GFP_KERNEL);
> + rec->run = (void *)get_zeroed_page(GFP_KERNEL);
> + if (!params || !rec->rec_page || !rec->run) {
> + r = -ENOMEM;
> + goto out_free_pages;
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(params->gprs); i++)
> + params->gprs[i] = vcpu_regs->regs[i];
> +
> + params->pc = vcpu_regs->pc;
> +
> + if (vcpu->vcpu_id == 0)
> + params->flags |= REC_PARAMS_FLAG_RUNNABLE;
> +
> + rec_page_phys = virt_to_phys(rec->rec_page);
> +
> + if (rmi_granule_delegate(rec_page_phys)) {
> + r = -ENXIO;
> + goto out_free_pages;
> + }
> +

Wouldn't it be better to extend alloc_rec_aux() to also allocate and delegate
the REC page above? That way you could avoid duplicating the page allocation
and rmi_granule_delegate() handling.
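
I.e. factor the allocate+delegate pattern into one helper used both for the
aux pages and for the REC page itself, something like (untested sketch, name
made up):

static phys_addr_t alloc_delegated_rec_granule(void)
{
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return PHYS_ADDR_MAX;

	/* Hand the granule over to the RMM before it can be used */
	if (rmi_granule_delegate(page_to_phys(page))) {
		__free_page(page);
		return PHYS_ADDR_MAX;
	}

	return page_to_phys(page);
}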

> + r = alloc_rec_aux(rec->aux_pages, params->aux, realm->num_aux);
> + if (r)
> + goto out_undelegate_rmm_rec;
> +
> + params->num_rec_aux = realm->num_aux;
> + params->mpidr = mpidr;
> +
> + if (rmi_rec_create(rec_page_phys,
> + virt_to_phys(realm->rd),
> + virt_to_phys(params))) {
> + r = -ENXIO;
> + goto out_free_rec_aux;
> + }
> +
> + rec->mpidr = mpidr;
> +
> + free_page((unsigned long)params);
> + return 0;
> +
> +out_free_rec_aux:
> + free_rec_aux(rec->aux_pages, realm->num_aux);
> +out_undelegate_rmm_rec:
> + if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
> + rec->rec_page = NULL;
> +out_free_pages:
> + free_page((unsigned long)rec->run);
> + free_page((unsigned long)rec->rec_page);
> + free_page((unsigned long)params);
> + return r;
> +}
> +
> +void kvm_destroy_rec(struct kvm_vcpu *vcpu)
> +{
> + struct realm *realm = &vcpu->kvm->arch.realm;
> + struct rec *rec = &vcpu->arch.rec;
> + unsigned long rec_page_phys;
> +
> + if (!vcpu_is_rec(vcpu))
> + return;
> +
> + rec_page_phys = virt_to_phys(rec->rec_page);
> +
> + if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
> + return;
> + if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
> + return;
> +

The two early returns above feel off. What is the reason for skipping the
cleanup below (the aux page undelegates and frees)?
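
Based on my (possibly wrong) reading, if RMI_REC_DESTROY fails the aux
granules are indeed still in use, but if only the undelegate of the REC page
fails the aux pages could still be reclaimed. Something like the below is
what I would have expected (sketch only):

	if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
		return;		/* REC still live: nothing can be reclaimed */

	/* The aux granules can be reclaimed independently of the REC page */
	free_rec_aux(rec->aux_pages, realm->num_aux);

	if (!WARN_ON(rmi_granule_undelegate(rec_page_phys)))
		free_page((unsigned long)rec->rec_page);
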

> + free_rec_aux(rec->aux_pages, realm->num_aux);
> + free_page((unsigned long)rec->rec_page);
> +}
> +
> int kvm_init_realm_vm(struct kvm *kvm)
> {
> struct realm_params *params;


2023-02-17 13:07:33

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 13/28] arm64: RME: Allow VMM to set RIPAS

On Fri, 27 Jan 2023 11:29:17 +0000
Steven Price <[email protected]> wrote:

> Each page within the protected region of the realm guest can be marked
> as either RAM or EMPTY. Allow the VMM to control this before the guest
> has started and provide the equivalent functions to change this (with
> the guest's approval) at runtime.
>

The above only states the purpose of this patch. It would be better to have
one more paragraph describing what the patch actually does (building the RTTs
and setting the IPA state in them) and explaining anything that might confuse
people, for example the spare page.

The spare page is really confusing. When reading __alloc_delegated_page()
it looks like a mechanism to cache a delegated page for the realm. But later,
in the teardown path, it looks like a workaround. What happens if the
allocation of the spare page fails in the RTT teardown path?

I understand this must be a temporary solution. It would be really nice to
have a big picture or some basic introduction to the future plan.

> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 4 +
> arch/arm64/kvm/rme.c | 288 +++++++++++++++++++++++++++++++
> 2 files changed, 292 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index 4b219ebe1400..3e75cedaad18 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -47,6 +47,10 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
> int kvm_create_rec(struct kvm_vcpu *vcpu);
> void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>
> +int realm_set_ipa_state(struct kvm_vcpu *vcpu,
> + unsigned long addr, unsigned long end,
> + unsigned long ripas);
> +
> #define RME_RTT_BLOCK_LEVEL 2
> #define RME_RTT_MAX_LEVEL 3
>
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index d79ed889ca4d..b3ea79189839 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -73,6 +73,58 @@ static int rmi_check_version(void)
> return 0;
> }
>
> +static phys_addr_t __alloc_delegated_page(struct realm *realm,
> + struct kvm_mmu_memory_cache *mc, gfp_t flags)
> +{
> + phys_addr_t phys = PHYS_ADDR_MAX;
> + void *virt;
> +
> + if (realm->spare_page != PHYS_ADDR_MAX) {
> + swap(realm->spare_page, phys);
> + goto out;
> + }
> +
> + if (mc)
> + virt = kvm_mmu_memory_cache_alloc(mc);
> + else
> + virt = (void *)__get_free_page(flags);
> +
> + if (!virt)
> + goto out;
> +
> + phys = virt_to_phys(virt);
> +
> + if (rmi_granule_delegate(phys)) {
> + free_page((unsigned long)virt);
> +
> + phys = PHYS_ADDR_MAX;
> + }
> +
> +out:
> + return phys;
> +}
> +
> +static phys_addr_t alloc_delegated_page(struct realm *realm,
> + struct kvm_mmu_memory_cache *mc)
> +{
> + return __alloc_delegated_page(realm, mc, GFP_KERNEL);
> +}
> +
> +static void free_delegated_page(struct realm *realm, phys_addr_t phys)
> +{
> + if (realm->spare_page == PHYS_ADDR_MAX) {
> + realm->spare_page = phys;
> + return;
> + }
> +
> + if (WARN_ON(rmi_granule_undelegate(phys))) {
> + /* Undelegate failed: leak the page */
> + return;
> + }
> +
> + free_page((unsigned long)phys_to_virt(phys));
> +}
> +
> static void realm_destroy_undelegate_range(struct realm *realm,
> unsigned long ipa,
> unsigned long addr,
> @@ -220,6 +272,30 @@ static int realm_rtt_create(struct realm *realm,
> return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
> }
>
> +static int realm_create_rtt_levels(struct realm *realm,
> + unsigned long ipa,
> + int level,
> + int max_level,
> + struct kvm_mmu_memory_cache *mc)
> +{
> + if (WARN_ON(level == max_level))
> + return 0;
> +
> + while (level++ < max_level) {
> + phys_addr_t rtt = alloc_delegated_page(realm, mc);
> +
> + if (rtt == PHYS_ADDR_MAX)
> + return -ENOMEM;
> +
> + if (realm_rtt_create(realm, ipa, level, rtt)) {
> + free_delegated_page(realm, rtt);
> + return -ENXIO;
> + }
> + }
> +
> + return 0;
> +}
> +
> static int realm_tear_down_rtt_range(struct realm *realm, int level,
> unsigned long start, unsigned long end)
> {
> @@ -309,6 +385,206 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
> realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
> }
>
> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
> +{
> + u32 ia_bits = kvm->arch.mmu.pgt->ia_bits;
> + u32 start_level = kvm->arch.mmu.pgt->start_level;
> + unsigned long end = ipa + size;
> + struct realm *realm = &kvm->arch.realm;
> + phys_addr_t tmp_rtt = PHYS_ADDR_MAX;
> +
> + if (end > (1UL << ia_bits))
> + end = 1UL << ia_bits;
> + /*
> + * Make sure we have a spare delegated page for tearing down the
> + * block mappings. We must use Atomic allocations as we are called
> + * with kvm->mmu_lock held.
> + */
> + if (realm->spare_page == PHYS_ADDR_MAX) {
> + tmp_rtt = __alloc_delegated_page(realm, NULL, GFP_ATOMIC);
> + /*
> + * We don't have to check the status here, as we may not
> + * have a block level mapping. Delay any error to the point
> + * where we need it.
> + */
> + realm->spare_page = tmp_rtt;
> + }
> +
> + realm_tear_down_rtt_range(&kvm->arch.realm, start_level, ipa, end);
> +
> + /* Free up the atomic page, if there were any */
> + if (tmp_rtt != PHYS_ADDR_MAX) {
> + free_delegated_page(realm, tmp_rtt);
> + /*
> + * Update the spare_page after we have freed the
> + * above page to make sure it doesn't get cached
> + * in spare_page.
> + * We should re-write this part and always have
> + * a dedicated page for handling block mappings.
> + */
> + realm->spare_page = PHYS_ADDR_MAX;
> + }
> +}
> +
> +static int set_ipa_state(struct kvm_vcpu *vcpu,
> + unsigned long ipa,
> + unsigned long end,
> + int level,
> + unsigned long ripas)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + struct realm *realm = &kvm->arch.realm;
> + struct rec *rec = &vcpu->arch.rec;
> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> + phys_addr_t rec_phys = virt_to_phys(rec->rec_page);
> + unsigned long map_size = rme_rtt_level_mapsize(level);
> + int ret;
> +
> + while (ipa < end) {
> + ret = rmi_rtt_set_ripas(rd_phys, rec_phys, ipa, level, ripas);
> +
> + if (!ret) {
> + if (!ripas)
> + kvm_realm_unmap_range(kvm, ipa, map_size);
> + } else if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + int walk_level = RMI_RETURN_INDEX(ret);
> +
> + if (walk_level < level) {
> + ret = realm_create_rtt_levels(realm, ipa,
> + walk_level,
> + level, NULL);
> + if (ret)
> + return ret;
> + continue;
> + }
> +
> + if (WARN_ON(level >= RME_RTT_MAX_LEVEL))
> + return -EINVAL;
> +
> + /* Recurse one level lower */
> + ret = set_ipa_state(vcpu, ipa, ipa + map_size,
> + level + 1, ripas);
> + if (ret)
> + return ret;
> + } else {
> + WARN(1, "Unexpected error in %s: %#x\n", __func__,
> + ret);
> + return -EINVAL;
> + }
> + ipa += map_size;
> + }
> +
> + return 0;
> +}
> +
> +static int realm_init_ipa_state(struct realm *realm,
> + unsigned long ipa,
> + unsigned long end,
> + int level)
> +{
> + unsigned long map_size = rme_rtt_level_mapsize(level);
> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> + int ret;
> +
> + while (ipa < end) {
> + ret = rmi_rtt_init_ripas(rd_phys, ipa, level);
> +
> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + int cur_level = RMI_RETURN_INDEX(ret);
> +
> + if (cur_level < level) {
> + ret = realm_create_rtt_levels(realm, ipa,
> + cur_level,
> + level, NULL);
> + if (ret)
> + return ret;
> + /* Retry with the RTT levels in place */
> + continue;
> + }
> +
> + /* There's an entry at a lower level, recurse */
> + if (WARN_ON(level >= RME_RTT_MAX_LEVEL))
> + return -EINVAL;
> +
> + realm_init_ipa_state(realm, ipa, ipa + map_size,
> + level + 1);
> + } else if (WARN_ON(ret)) {
> + return -ENXIO;
> + }
> +
> + ipa += map_size;
> + }
> +
> + return 0;
> +}
> +
> +static int find_map_level(struct kvm *kvm, unsigned long start, unsigned long end)
> +{
> + int level = RME_RTT_MAX_LEVEL;
> +
> + while (level > get_start_level(kvm) + 1) {
> + unsigned long map_size = rme_rtt_level_mapsize(level - 1);
> +
> + if (!IS_ALIGNED(start, map_size) ||
> + (start + map_size) > end)
> + break;
> +
> + level--;
> + }
> +
> + return level;
> +}
> +
> +int realm_set_ipa_state(struct kvm_vcpu *vcpu,
> + unsigned long addr, unsigned long end,
> + unsigned long ripas)
> +{
> + int ret = 0;
> +
> + while (addr < end) {
> + int level = find_map_level(vcpu->kvm, addr, end);
> + unsigned long map_size = rme_rtt_level_mapsize(level);
> +
> + ret = set_ipa_state(vcpu, addr, addr + map_size, level, ripas);
> + if (ret)
> + break;
> +
> + addr += map_size;
> + }
> +
> + return ret;
> +}
> +
> +static int kvm_init_ipa_range_realm(struct kvm *kvm,
> + struct kvm_cap_arm_rme_init_ipa_args *args)
> +{
> + int ret = 0;
> + gpa_t addr, end;
> + struct realm *realm = &kvm->arch.realm;
> +
> + addr = args->init_ipa_base;
> + end = addr + args->init_ipa_size;
> +
> + if (end < addr)
> + return -EINVAL;
> +
> + if (kvm_realm_state(kvm) != REALM_STATE_NEW)
> + return -EBUSY;
> +
> + while (addr < end) {
> + int level = find_map_level(kvm, addr, end);
> + unsigned long map_size = rme_rtt_level_mapsize(level);
> +
> + ret = realm_init_ipa_state(realm, addr, addr + map_size, level);
> + if (ret)
> + break;
> +
> + addr += map_size;
> + }
> +
> + return ret;
> +}
> +
> /* Protects access to rme_vmid_bitmap */
> static DEFINE_SPINLOCK(rme_vmid_lock);
> static unsigned long *rme_vmid_bitmap;
> @@ -460,6 +736,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>
> r = kvm_create_realm(kvm);
> break;
> + case KVM_CAP_ARM_RME_INIT_IPA_REALM: {
> + struct kvm_cap_arm_rme_init_ipa_args args;
> + void __user *argp = u64_to_user_ptr(cap->args[1]);
> +
> + if (copy_from_user(&args, argp, sizeof(args))) {
> + r = -EFAULT;
> + break;
> + }
> +
> + r = kvm_init_ipa_range_realm(kvm, &args);
> + break;
> + }
> default:
> r = -EINVAL;
> break;


2023-03-01 11:54:44

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 05/28] arm64: RME: Define the user ABI

On 13/02/2023 16:04, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:09 +0000
> Steven Price <[email protected]> wrote:
>
>> There is one (multiplexed) CAP which can be used to create, populate and
>> then activate the realm.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> Documentation/virt/kvm/api.rst | 1 +
>> arch/arm64/include/uapi/asm/kvm.h | 63 +++++++++++++++++++++++++++++++
>> include/uapi/linux/kvm.h | 2 +
>> 3 files changed, 66 insertions(+)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 0dd5d8733dd5..f1a59d6fb7fc 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -4965,6 +4965,7 @@ Recognised values for feature:
>>
>> ===== ===========================================
>> arm64 KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)
>> + arm64 KVM_ARM_VCPU_REC (requires KVM_CAP_ARM_RME)
>> ===== ===========================================
>>
>> Finalizes the configuration of the specified vcpu feature.
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index a7a857f1784d..fcc0b8dce29b 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -109,6 +109,7 @@ struct kvm_regs {
>> #define KVM_ARM_VCPU_SVE 4 /* enable SVE for this CPU */
>> #define KVM_ARM_VCPU_PTRAUTH_ADDRESS 5 /* VCPU uses address authentication */
>> #define KVM_ARM_VCPU_PTRAUTH_GENERIC 6 /* VCPU uses generic authentication */
>> +#define KVM_ARM_VCPU_REC 7 /* VCPU REC state as part of Realm */
>>
>> struct kvm_vcpu_init {
>> __u32 target;
>> @@ -401,6 +402,68 @@ enum {
>> #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
>> #define KVM_DEV_ARM_ITS_CTRL_RESET 4
>>
>> +/* KVM_CAP_ARM_RME on VM fd */
>> +#define KVM_CAP_ARM_RME_CONFIG_REALM 0
>> +#define KVM_CAP_ARM_RME_CREATE_RD 1
>> +#define KVM_CAP_ARM_RME_INIT_IPA_REALM 2
>> +#define KVM_CAP_ARM_RME_POPULATE_REALM 3
>> +#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4
>> +
>
> It is a little bit confusing here. These seems more like 'commands' not caps.
> Will leave more comments after reviewing the later patches.

Sorry for the slow response. Thank you for your review - I hope to post
a new version of this series (rebased on 6.3-rc1) in the coming weeks
with your comments addressed.

They are indeed commands - and using caps is a bit of a hack. The
benefit here is that all the Realm commands are behind the one
KVM_CAP_ARM_RME.

The options I can see are:

a) What I've got here - (ab)using KVM_ENABLE_CAP to perform commands.

b) Add new ioctls for each of the above stages (so 5 new ioctls on top
of the CAP for discovery). With any future extensions requiring new ioctls.

c) Add a single new multiplexing ioctl (along with the CAP for discovery).

I'm not massively keen on defining a new multiplexing scheme (c), but
equally (b) seems like it's burning through ioctl numbers. That led me to
stick with (a), which at least keeps the rebasing simple (there's only the
one CAP number which could conflict) and reuses an existing multiplexing
scheme.

But I'm happy to change if there's consensus a different approach would
be preferable.

Thanks,

Steve

>> +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256 0
>> +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512 1
>> +
>> +#define KVM_CAP_ARM_RME_RPV_SIZE 64
>> +
>> +/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */
>> +#define KVM_CAP_ARM_RME_CFG_RPV 0
>> +#define KVM_CAP_ARM_RME_CFG_HASH_ALGO 1
>> +#define KVM_CAP_ARM_RME_CFG_SVE 2
>> +#define KVM_CAP_ARM_RME_CFG_DBG 3
>> +#define KVM_CAP_ARM_RME_CFG_PMU 4
>> +
>> +struct kvm_cap_arm_rme_config_item {
>> + __u32 cfg;
>> + union {
>> + /* cfg == KVM_CAP_ARM_RME_CFG_RPV */
>> + struct {
>> + __u8 rpv[KVM_CAP_ARM_RME_RPV_SIZE];
>> + };
>> +
>> + /* cfg == KVM_CAP_ARM_RME_CFG_HASH_ALGO */
>> + struct {
>> + __u32 hash_algo;
>> + };
>> +
>> + /* cfg == KVM_CAP_ARM_RME_CFG_SVE */
>> + struct {
>> + __u32 sve_vq;
>> + };
>> +
>> + /* cfg == KVM_CAP_ARM_RME_CFG_DBG */
>> + struct {
>> + __u32 num_brps;
>> + __u32 num_wrps;
>> + };
>> +
>> + /* cfg == KVM_CAP_ARM_RME_CFG_PMU */
>> + struct {
>> + __u32 num_pmu_cntrs;
>> + };
>> + /* Fix the size of the union */
>> + __u8 reserved[256];
>> + };
>> +};
>> +
>> +struct kvm_cap_arm_rme_populate_realm_args {
>> + __u64 populate_ipa_base;
>> + __u64 populate_ipa_size;
>> +};
>> +
>> +struct kvm_cap_arm_rme_init_ipa_args {
>> + __u64 init_ipa_base;
>> + __u64 init_ipa_size;
>> +};
>> +
>> /* Device Control API on vcpu fd */
>> #define KVM_ARM_VCPU_PMU_V3_CTRL 0
>> #define KVM_ARM_VCPU_PMU_V3_IRQ 0
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 20522d4ba1e0..fec1909e8b73 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -1176,6 +1176,8 @@ struct kvm_ppc_resize_hpt {
>> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>> #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>>
>> +#define KVM_CAP_ARM_RME 300 // FIXME: Large number to prevent conflicts
>> +
>> #ifdef KVM_CAP_IRQ_ROUTING
>>
>> struct kvm_irq_routing_irqchip {
>


2023-03-01 11:55:27

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On 13/02/2023 16:10, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:10 +0000
> Steven Price <[email protected]> wrote:
>
>> Add the KVM_CAP_ARM_RME_CREATE_FD ioctl to create a realm. This involves
>> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
>> the base level of the Realm Translation Tables (RTT). A VMID also need
>> to be picked, since the RMM has a separate VMID address space a
>> dedicated allocator is added for this purpose.
>>
>> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
>> before it is created.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_rme.h | 14 ++
>> arch/arm64/kvm/arm.c | 19 ++
>> arch/arm64/kvm/mmu.c | 6 +
>> arch/arm64/kvm/reset.c | 33 +++
>> arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
>> 5 files changed, 429 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index c26bc2c6770d..055a22accc08 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -6,6 +6,8 @@
>> #ifndef __ASM_KVM_RME_H
>> #define __ASM_KVM_RME_H
>>
>> +#include <uapi/linux/kvm.h>
>> +
>> enum realm_state {
>> REALM_STATE_NONE,
>> REALM_STATE_NEW,
>> @@ -15,8 +17,20 @@ enum realm_state {
>>
>> struct realm {
>> enum realm_state state;
>> +
>> + void *rd;
>> + struct realm_params *params;
>> +
>> + unsigned long num_aux;
>> + unsigned int vmid;
>> + unsigned int ia_bits;
>> };
>>
>
> Maybe more comments for this structure?

Agreed, this series is a bit light on comments. I'll try to improve for v2.

<snip>

>
> Just curious. Wouldn't it be better to use IDR as this is ID allocation? There
> were some efforts to change the use of bitmap allocation to IDR before.

I'm not sure it makes much difference really. This matches KVM's vmid_map,
but here things are much simpler as there's no support for the likes of VMID
rollover (the number of Realm VMs is simply capped at the number of VMIDs).

IDR provides a lot of functionality we don't need, but equally I don't
think performance or memory usage are really a concern here.
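
For comparison, the IDA equivalent would be roughly the following (sketch
only, ignoring GFP/locking considerations):

static DEFINE_IDA(rme_vmid_ida);

static int rme_vmid_reserve(void)
{
	return ida_alloc_max(&rme_vmid_ida, (1 << kvm_get_vmid_bits()) - 1,
			     GFP_KERNEL);
}

static void rme_vmid_release(unsigned int vmid)
{
	ida_free(&rme_vmid_ida, vmid);
}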

Steve

>> +/* Protects access to rme_vmid_bitmap */
>> +static DEFINE_SPINLOCK(rme_vmid_lock);
>> +static unsigned long *rme_vmid_bitmap;
>> +
>> +static int rme_vmid_init(void)
>> +{
>> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
>> +
>> + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
>> + if (!rme_vmid_bitmap) {
>> + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
>> + return -ENOMEM;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int rme_vmid_reserve(void)
>> +{
>> + int ret;
>> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
>> +
>> + spin_lock(&rme_vmid_lock);
>> + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
>> + spin_unlock(&rme_vmid_lock);
>> +
>> + return ret;
>> +}
>> +
>> +static void rme_vmid_release(unsigned int vmid)
>> +{
>> + spin_lock(&rme_vmid_lock);
>> + bitmap_release_region(rme_vmid_bitmap, vmid, 0);
>> + spin_unlock(&rme_vmid_lock);
>> +}
>> +
>> +static int kvm_create_realm(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + int ret;
>> +
>> + if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
>> + return -EEXIST;
>> +
>> + ret = rme_vmid_reserve();
>> + if (ret < 0)
>> + return ret;
>> + realm->vmid = ret;
>> +
>> + ret = realm_create_rd(kvm);
>> + if (ret) {
>> + rme_vmid_release(realm->vmid);
>> + return ret;
>> + }
>> +
>> + WRITE_ONCE(realm->state, REALM_STATE_NEW);
>> +
>> + /* The realm is up, free the parameters. */
>> + free_page((unsigned long)realm->params);
>> + realm->params = NULL;
>> +
>> + return 0;
>> +}
>> +
>> +static int config_realm_hash_algo(struct realm *realm,
>> + struct kvm_cap_arm_rme_config_item *cfg)
>> +{
>> + switch (cfg->hash_algo) {
>> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
>> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
>> + return -EINVAL;
>> + break;
>> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
>> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
>> + return -EINVAL;
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + realm->params->measurement_algo = cfg->hash_algo;
>> + return 0;
>> +}
>> +
>> +static int config_realm_sve(struct realm *realm,
>> + struct kvm_cap_arm_rme_config_item *cfg)
>> +{
>> + u64 features_0 = realm->params->features_0;
>> + int max_sve_vq = u64_get_bits(rmm_feat_reg0,
>> + RMI_FEATURE_REGISTER_0_SVE_VL);
>> +
>> + if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
>> + return -EINVAL;
>> +
>> + if (cfg->sve_vq > max_sve_vq)
>> + return -EINVAL;
>> +
>> + features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
>> + RMI_FEATURE_REGISTER_0_SVE_VL);
>> + features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
>> + features_0 |= u64_encode_bits(cfg->sve_vq,
>> + RMI_FEATURE_REGISTER_0_SVE_VL);
>> +
>> + realm->params->features_0 = features_0;
>> + return 0;
>> +}
>> +
>> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
>> +{
>> + struct kvm_cap_arm_rme_config_item cfg;
>> + struct realm *realm = &kvm->arch.realm;
>> + int r = 0;
>> +
>> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
>> + return -EBUSY;
>> +
>> + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
>> + return -EFAULT;
>> +
>> + switch (cfg.cfg) {
>> + case KVM_CAP_ARM_RME_CFG_RPV:
>> + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
>> + break;
>> + case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
>> + r = config_realm_hash_algo(realm, &cfg);
>> + break;
>> + case KVM_CAP_ARM_RME_CFG_SVE:
>> + r = config_realm_sve(realm, &cfg);
>> + break;
>> + default:
>> + r = -EINVAL;
>> + }
>> +
>> + return r;
>> +}
>> +
>> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>> +{
>> + int r = 0;
>> +
>> + switch (cap->args[0]) {
>> + case KVM_CAP_ARM_RME_CONFIG_REALM:
>> + r = kvm_rme_config_realm(kvm, cap);
>> + break;
>> + case KVM_CAP_ARM_RME_CREATE_RD:
>> + if (kvm->created_vcpus) {
>> + r = -EBUSY;
>> + break;
>> + }
>> +
>> + r = kvm_create_realm(kvm);
>> + break;
>> + default:
>> + r = -EINVAL;
>> + break;
>> + }
>> +
>> + return r;
>> +}
>> +
>> +void kvm_destroy_realm(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> + unsigned int pgd_sz;
>> + int i;
>> +
>> + if (realm->params) {
>> + free_page((unsigned long)realm->params);
>> + realm->params = NULL;
>> + }
>> +
>> + if (kvm_realm_state(kvm) == REALM_STATE_NONE)
>> + return;
>> +
>> + WRITE_ONCE(realm->state, REALM_STATE_DYING);
>> +
>> + rme_vmid_release(realm->vmid);
>> +
>> + if (realm->rd) {
>> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
>> +
>> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
>> + return;
>> + if (WARN_ON(rmi_granule_undelegate(rd_phys)))
>> + return;
>> + free_page((unsigned long)realm->rd);
>> + realm->rd = NULL;
>> + }
>> +
>> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
>> + for (i = 0; i < pgd_sz; i++) {
>> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
>> +
>> + if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
>> + return;
>> + }
>> +
>> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>> +}
>> +
>> +int kvm_init_realm_vm(struct kvm *kvm)
>> +{
>> + struct realm_params *params;
>> +
>> + params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
>> + if (!params)
>> + return -ENOMEM;
>> +
>> + params->features_0 = create_realm_feat_reg0(kvm);
>> + kvm->arch.realm.params = params;
>> + return 0;
>> +}
>> +
>> int kvm_init_rme(void)
>> {
>> + int ret;
>> +
>> if (PAGE_SIZE != SZ_4K)
>> /* Only 4k page size on the host is supported */
>> return 0;
>> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
>> /* Continue without realm support */
>> return 0;
>>
>> + ret = rme_vmid_init();
>> + if (ret)
>> + return ret;
>> +
>> + WARN_ON(rmi_features(0, &rmm_feat_reg0));
>> +
>> /* Future patch will enable static branch kvm_rme_is_available */
>>
>> return 0;
>


2023-03-01 11:55:41

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 07/28] arm64: kvm: Allow passing machine type in KVM creation

On 13/02/2023 16:35, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:11 +0000
> Steven Price <[email protected]> wrote:
>
>> Previously machine type was used purely for specifying the physical
>> address size of the guest. Reserve the higher bits to specify an ARM
>> specific machine type and declare a new type 'KVM_VM_TYPE_ARM_REALM'
>> used to create a realm guest.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/kvm/arm.c | 13 +++++++++++++
>> arch/arm64/kvm/mmu.c | 3 ---
>> arch/arm64/kvm/reset.c | 3 ---
>> include/uapi/linux/kvm.h | 19 +++++++++++++++----
>> 4 files changed, 28 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 50f54a63732a..badd775547b8 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -147,6 +147,19 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>> {
>> int ret;
>>
>> + if (type & ~(KVM_VM_TYPE_ARM_MASK | KVM_VM_TYPE_ARM_IPA_SIZE_MASK))
>> + return -EINVAL;
>> +
>> + switch (type & KVM_VM_TYPE_ARM_MASK) {
>> + case KVM_VM_TYPE_ARM_NORMAL:
>> + break;
>> + case KVM_VM_TYPE_ARM_REALM:
>> + kvm->arch.is_realm = true;
>
> It is better to let this call fail when !kvm_rme_is_available? It is
> strange to be able to create a VM with REALM type in a system doesn't
> support RME.

Good point - I'll add a check here.

Thanks,

Steve

>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> +
>> ret = kvm_share_hyp(kvm, kvm + 1);
>> if (ret)
>> return ret;
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index d0f707767d05..22c00274884a 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -709,9 +709,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>> u64 mmfr0, mmfr1;
>> u32 phys_shift;
>>
>> - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>> - return -EINVAL;
>> -
>> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>> if (is_protected_kvm_enabled()) {
>> phys_shift = kvm_ipa_limit;
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index c165df174737..9e71d69e051f 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -405,9 +405,6 @@ int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
>> if (kvm_is_realm(kvm))
>> ipa_limit = kvm_realm_ipa_limit();
>>
>> - if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>> - return -EINVAL;
>> -
>> phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>> if (phys_shift) {
>> if (phys_shift > ipa_limit ||
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index fec1909e8b73..bcfc4d58dc19 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -898,14 +898,25 @@ struct kvm_ppc_resize_hpt {
>> #define KVM_S390_SIE_PAGE_OFFSET 1
>>
>> /*
>> - * On arm64, machine type can be used to request the physical
>> - * address size for the VM. Bits[7-0] are reserved for the guest
>> - * PA size shift (i.e, log2(PA_Size)). For backward compatibility,
>> - * value 0 implies the default IPA size, 40bits.
>> + * On arm64, machine type can be used to request both the machine type and
>> + * the physical address size for the VM.
>> + *
>> + * Bits[11-8] are reserved for the ARM specific machine type.
>> + *
>> + * Bits[7-0] are reserved for the guest PA size shift (i.e, log2(PA_Size)).
>> + * For backward compatibility, value 0 implies the default IPA size, 40bits.
>> */
>> +#define KVM_VM_TYPE_ARM_SHIFT 8
>> +#define KVM_VM_TYPE_ARM_MASK (0xfULL << KVM_VM_TYPE_ARM_SHIFT)
>> +#define KVM_VM_TYPE_ARM(_type) \
>> + (((_type) << KVM_VM_TYPE_ARM_SHIFT) & KVM_VM_TYPE_ARM_MASK)
>> +#define KVM_VM_TYPE_ARM_NORMAL KVM_VM_TYPE_ARM(0)
>> +#define KVM_VM_TYPE_ARM_REALM KVM_VM_TYPE_ARM(1)
>> +
>> #define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL
>> #define KVM_VM_TYPE_ARM_IPA_SIZE(x) \
>> ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>> +
>> /*
>> * ioctls for /dev/kvm fds:
>> */
>


2023-03-01 11:56:03

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 08/28] arm64: RME: Keep a spare page delegated to the RMM

On 13/02/2023 16:47, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:12 +0000
> Steven Price <[email protected]> wrote:
>
>> Pages can only be populated/destroyed on the RMM at the 4KB granule,
>> this requires creating the full depth of RTTs. However if the pages are
>> going to be combined into a 2MB huge page, the last RTT is only
>> temporarily needed. Similarly when freeing memory the huge page must be
>> temporarily split, requiring temporary usage of the full depth of RTTs.
>>
>> To avoid needing to perform a temporary allocation and delegation of a
>> page for this purpose we keep a spare delegated page around. In
>> particular this avoids the need for memory allocation while destroying
>> the realm guest.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_rme.h | 3 +++
>> arch/arm64/kvm/rme.c | 6 ++++++
>> 2 files changed, 9 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index 055a22accc08..a6318af3ed11 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -21,6 +21,9 @@ struct realm {
>> void *rd;
>> struct realm_params *params;
>>
>> + /* A spare already delegated page */
>> + phys_addr_t spare_page;
>> +
>> unsigned long num_aux;
>> unsigned int vmid;
>> unsigned int ia_bits;
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 9f8c5a91b8fc..0c9d70e4d9e6 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -148,6 +148,7 @@ static int realm_create_rd(struct kvm *kvm)
>> }
>>
>> realm->rd = rd;
>> + realm->spare_page = PHYS_ADDR_MAX;
>> realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
>>
>> if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
>> @@ -357,6 +358,11 @@ void kvm_destroy_realm(struct kvm *kvm)
>> free_page((unsigned long)realm->rd);
>> realm->rd = NULL;
>> }
>> + if (realm->spare_page != PHYS_ADDR_MAX) {
>> + if (!WARN_ON(rmi_granule_undelegate(realm->spare_page)))
>> + free_page((unsigned long)phys_to_virt(realm->spare_page));
>
> Will the page be leaked (not usable for host and realms) if the undelegate
> failed? If yes, better at least put a comment.

Yes - I'll add a comment.

In general, being unable to undelegate a page points to a programming error
in the host. The only reason the RMM should refuse the request is if the page
is in use by a Realm which the host has configured. So the WARN() is correct
(there's a kernel bug) and the only sensible course of action is to leak the
page and limp on.
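
Something along these lines for the spare_page case (exact wording to be
polished):

	if (realm->spare_page != PHYS_ADDR_MAX) {
		/*
		 * Leak the page if the undelegate fails: the RMM still
		 * considers it in use, so it cannot safely be returned to
		 * the host.
		 */
		if (!WARN_ON(rmi_granule_undelegate(realm->spare_page)))
			free_page((unsigned long)phys_to_virt(realm->spare_page));
		realm->spare_page = PHYS_ADDR_MAX;
	}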

Thanks,

Steve

>> + realm->spare_page = PHYS_ADDR_MAX;
>> + }
>>
>> pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
>> for (i = 0; i < pgd_sz; i++) {
>


2023-03-01 20:21:39

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 05/28] arm64: RME: Define the user ABI

On Wed, 1 Mar 2023 11:54:34 +0000
Steven Price <[email protected]> wrote:

> On 13/02/2023 16:04, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:09 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> There is one (multiplexed) CAP which can be used to create, populate and
> >> then activate the realm.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> Documentation/virt/kvm/api.rst | 1 +
> >> arch/arm64/include/uapi/asm/kvm.h | 63 +++++++++++++++++++++++++++++++
> >> include/uapi/linux/kvm.h | 2 +
> >> 3 files changed, 66 insertions(+)
> >>
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index 0dd5d8733dd5..f1a59d6fb7fc 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -4965,6 +4965,7 @@ Recognised values for feature:
> >>
> >> ===== ===========================================
> >> arm64 KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)
> >> + arm64 KVM_ARM_VCPU_REC (requires KVM_CAP_ARM_RME)
> >> ===== ===========================================
> >>
> >> Finalizes the configuration of the specified vcpu feature.
> >> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> >> index a7a857f1784d..fcc0b8dce29b 100644
> >> --- a/arch/arm64/include/uapi/asm/kvm.h
> >> +++ b/arch/arm64/include/uapi/asm/kvm.h
> >> @@ -109,6 +109,7 @@ struct kvm_regs {
> >> #define KVM_ARM_VCPU_SVE 4 /* enable SVE for this CPU */
> >> #define KVM_ARM_VCPU_PTRAUTH_ADDRESS 5 /* VCPU uses address authentication */
> >> #define KVM_ARM_VCPU_PTRAUTH_GENERIC 6 /* VCPU uses generic authentication */
> >> +#define KVM_ARM_VCPU_REC 7 /* VCPU REC state as part of Realm */
> >>
> >> struct kvm_vcpu_init {
> >> __u32 target;
> >> @@ -401,6 +402,68 @@ enum {
> >> #define KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES 3
> >> #define KVM_DEV_ARM_ITS_CTRL_RESET 4
> >>
> >> +/* KVM_CAP_ARM_RME on VM fd */
> >> +#define KVM_CAP_ARM_RME_CONFIG_REALM 0
> >> +#define KVM_CAP_ARM_RME_CREATE_RD 1
> >> +#define KVM_CAP_ARM_RME_INIT_IPA_REALM 2
> >> +#define KVM_CAP_ARM_RME_POPULATE_REALM 3
> >> +#define KVM_CAP_ARM_RME_ACTIVATE_REALM 4
> >> +
> >
> > It is a little bit confusing here. These seem more like 'commands', not caps.
> > Will leave more comments after reviewing the later patches.
>
> Sorry for the slow response. Thank you for your review - I hope to post
> a new version of this series (rebased on 6.3-rc1) in the coming weeks
> with your comments addressed.
>

Hi:

No worries. I have spent most of my time recently closing out the review of
the TDX/SNP series and stopped at patch 16 of this series. I will try to
finish the rest of this series this week.

I am glad if my efforts help and more reviewers can jump in smoothly
later.

> They are indeed commands - and using caps is a bit of a hack. The
> benefit here is that all the Realm commands are behind the one
> KVM_CAP_ARM_RME.
>
> The options I can see are:
>
> a) What I've got here - (ab)using KVM_ENABLE_CAP to perform commands.
>
> b) Add new ioctls for each of the above stages (so 5 new ioctls on top
> of the CAP for discovery). With any future extensions requiring new ioctls.
>
> c) Add a single new multiplexing ioctl (along with the CAP for discovery).
>
> I'm not massively keen on defining a new multiplexing scheme (c), but
> equally (b) seems like it's burning through ioctl numbers. That led me
> to stick with (a), which at least keeps the rebasing simple (there's only
> the one CAP number which could conflict) and there's already a
> multiplexing scheme.
>
> But I'm happy to change if there's consensus a different approach would
> be preferable.
>

Let's see if others have different opinions.

My vote goes to (b) as it is better to respect "what it is, make it explicit
and clear" when it comes to defining the UABI. The ioctl number space is part
of the UABI. If it is going to run out, IMHO we need to find another way,
perhaps another fd to group those ioctls, like KVM does.

1. (a) seems to abuse the usage of the cap; for sure, the benefit is obvious.
2. (c) seems to hide the details, which saves ioctl numbers, but it doesn't
actually help much with the complexity and might end up with another bunch
of "command codes".

> Thanks,
>
> Steve
>
> >> +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256 0
> >> +#define KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512 1
> >> +
> >> +#define KVM_CAP_ARM_RME_RPV_SIZE 64
> >> +
> >> +/* List of configuration items accepted for KVM_CAP_ARM_RME_CONFIG_REALM */
> >> +#define KVM_CAP_ARM_RME_CFG_RPV 0
> >> +#define KVM_CAP_ARM_RME_CFG_HASH_ALGO 1
> >> +#define KVM_CAP_ARM_RME_CFG_SVE 2
> >> +#define KVM_CAP_ARM_RME_CFG_DBG 3
> >> +#define KVM_CAP_ARM_RME_CFG_PMU 4
> >> +
> >> +struct kvm_cap_arm_rme_config_item {
> >> + __u32 cfg;
> >> + union {
> >> + /* cfg == KVM_CAP_ARM_RME_CFG_RPV */
> >> + struct {
> >> + __u8 rpv[KVM_CAP_ARM_RME_RPV_SIZE];
> >> + };
> >> +
> >> + /* cfg == KVM_CAP_ARM_RME_CFG_HASH_ALGO */
> >> + struct {
> >> + __u32 hash_algo;
> >> + };
> >> +
> >> + /* cfg == KVM_CAP_ARM_RME_CFG_SVE */
> >> + struct {
> >> + __u32 sve_vq;
> >> + };
> >> +
> >> + /* cfg == KVM_CAP_ARM_RME_CFG_DBG */
> >> + struct {
> >> + __u32 num_brps;
> >> + __u32 num_wrps;
> >> + };
> >> +
> >> + /* cfg == KVM_CAP_ARM_RME_CFG_PMU */
> >> + struct {
> >> + __u32 num_pmu_cntrs;
> >> + };
> >> + /* Fix the size of the union */
> >> + __u8 reserved[256];
> >> + };
> >> +};
> >> +
> >> +struct kvm_cap_arm_rme_populate_realm_args {
> >> + __u64 populate_ipa_base;
> >> + __u64 populate_ipa_size;
> >> +};
> >> +
> >> +struct kvm_cap_arm_rme_init_ipa_args {
> >> + __u64 init_ipa_base;
> >> + __u64 init_ipa_size;
> >> +};
> >> +
> >> /* Device Control API on vcpu fd */
> >> #define KVM_ARM_VCPU_PMU_V3_CTRL 0
> >> #define KVM_ARM_VCPU_PMU_V3_IRQ 0
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 20522d4ba1e0..fec1909e8b73 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -1176,6 +1176,8 @@ struct kvm_ppc_resize_hpt {
> >> #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
> >> #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> >>
> >> +#define KVM_CAP_ARM_RME 300 // FIXME: Large number to prevent conflicts
> >> +
> >> #ifdef KVM_CAP_IRQ_ROUTING
> >>
> >> struct kvm_irq_routing_irqchip {
> >
>


2023-03-01 20:34:04

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On Wed, 1 Mar 2023 11:55:17 +0000
Steven Price <[email protected]> wrote:

> On 13/02/2023 16:10, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:10 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> Add the KVM_CAP_ARM_RME_CREATE_RD ioctl to create a realm. This involves
> >> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
> >> the base level of the Realm Translation Tables (RTT). A VMID also needs
> >> to be picked; since the RMM has a separate VMID address space, a
> >> dedicated allocator is added for this purpose.
> >>
> >> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
> >> before it is created.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/include/asm/kvm_rme.h | 14 ++
> >> arch/arm64/kvm/arm.c | 19 ++
> >> arch/arm64/kvm/mmu.c | 6 +
> >> arch/arm64/kvm/reset.c | 33 +++
> >> arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
> >> 5 files changed, 429 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> >> index c26bc2c6770d..055a22accc08 100644
> >> --- a/arch/arm64/include/asm/kvm_rme.h
> >> +++ b/arch/arm64/include/asm/kvm_rme.h
> >> @@ -6,6 +6,8 @@
> >> #ifndef __ASM_KVM_RME_H
> >> #define __ASM_KVM_RME_H
> >>
> >> +#include <uapi/linux/kvm.h>
> >> +
> >> enum realm_state {
> >> REALM_STATE_NONE,
> >> REALM_STATE_NEW,
> >> @@ -15,8 +17,20 @@ enum realm_state {
> >>
> >> struct realm {
> >> enum realm_state state;
> >> +
> >> + void *rd;
> >> + struct realm_params *params;
> >> +
> >> + unsigned long num_aux;
> >> + unsigned int vmid;
> >> + unsigned int ia_bits;
> >> };
> >>
> >
> > Maybe more comments for this structure?
>
> Agreed, this series is a bit light on comments. I'll try to improve for v2.
>
> <snip>
>
> >
> > Just curious. Wouldn't it be better to use IDR as this is ID allocation? There
> > were some efforts to change the use of bitmap allocation to IDR before.
>
> I'm not sure it makes much difference really. This matches KVM's
> vmid_map, but here things are much more simple as there's no support for
> the likes of VMID rollover (the number of Realm VMs is just capped at
> the number of VMIDs).
>
> IDR provides a lot of functionality we don't need, but equally I don't
> think performance or memory usage are really a concern here.

Agreed. I am not opposed to the current approach. I gave this comment because
I vaguely remember there were some patch series to convert bitmap allocators
to IDR in the kernel before, so I think it is better to raise it and reach a
conclusion. It would save some effort for the people who jump into the
review later.
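
For reference, the IDA variant would look something like this (just a sketch
keeping the current semantics, not a request to change anything):

	static DEFINE_IDA(rme_vmid_ida);

	static int rme_vmid_reserve(void)
	{
		/* Allocate any free VMID in [0, 1 << kvm_get_vmid_bits()) */
		return ida_alloc_max(&rme_vmid_ida,
				     (1 << kvm_get_vmid_bits()) - 1,
				     GFP_KERNEL);
	}

	static void rme_vmid_release(unsigned int vmid)
	{
		ida_free(&rme_vmid_ida, vmid);
	}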

>
> Steve
>
> >> +/* Protects access to rme_vmid_bitmap */
> >> +static DEFINE_SPINLOCK(rme_vmid_lock);
> >> +static unsigned long *rme_vmid_bitmap;
> >> +
> >> +static int rme_vmid_init(void)
> >> +{
> >> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> >> +
> >> + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
> >> + if (!rme_vmid_bitmap) {
> >> + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
> >> + return -ENOMEM;
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int rme_vmid_reserve(void)
> >> +{
> >> + int ret;
> >> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> >> +
> >> + spin_lock(&rme_vmid_lock);
> >> + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
> >> + spin_unlock(&rme_vmid_lock);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static void rme_vmid_release(unsigned int vmid)
> >> +{
> >> + spin_lock(&rme_vmid_lock);
> >> + bitmap_release_region(rme_vmid_bitmap, vmid, 0);
> >> + spin_unlock(&rme_vmid_lock);
> >> +}
> >> +
> >> +static int kvm_create_realm(struct kvm *kvm)
> >> +{
> >> + struct realm *realm = &kvm->arch.realm;
> >> + int ret;
> >> +
> >> + if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
> >> + return -EEXIST;
> >> +
> >> + ret = rme_vmid_reserve();
> >> + if (ret < 0)
> >> + return ret;
> >> + realm->vmid = ret;
> >> +
> >> + ret = realm_create_rd(kvm);
> >> + if (ret) {
> >> + rme_vmid_release(realm->vmid);
> >> + return ret;
> >> + }
> >> +
> >> + WRITE_ONCE(realm->state, REALM_STATE_NEW);
> >> +
> >> + /* The realm is up, free the parameters. */
> >> + free_page((unsigned long)realm->params);
> >> + realm->params = NULL;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int config_realm_hash_algo(struct realm *realm,
> >> + struct kvm_cap_arm_rme_config_item *cfg)
> >> +{
> >> + switch (cfg->hash_algo) {
> >> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
> >> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
> >> + return -EINVAL;
> >> + break;
> >> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
> >> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
> >> + return -EINVAL;
> >> + break;
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> + realm->params->measurement_algo = cfg->hash_algo;
> >> + return 0;
> >> +}
> >> +
> >> +static int config_realm_sve(struct realm *realm,
> >> + struct kvm_cap_arm_rme_config_item *cfg)
> >> +{
> >> + u64 features_0 = realm->params->features_0;
> >> + int max_sve_vq = u64_get_bits(rmm_feat_reg0,
> >> + RMI_FEATURE_REGISTER_0_SVE_VL);
> >> +
> >> + if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
> >> + return -EINVAL;
> >> +
> >> + if (cfg->sve_vq > max_sve_vq)
> >> + return -EINVAL;
> >> +
> >> + features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
> >> + RMI_FEATURE_REGISTER_0_SVE_VL);
> >> + features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
> >> + features_0 |= u64_encode_bits(cfg->sve_vq,
> >> + RMI_FEATURE_REGISTER_0_SVE_VL);
> >> +
> >> + realm->params->features_0 = features_0;
> >> + return 0;
> >> +}
> >> +
> >> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
> >> +{
> >> + struct kvm_cap_arm_rme_config_item cfg;
> >> + struct realm *realm = &kvm->arch.realm;
> >> + int r = 0;
> >> +
> >> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
> >> + return -EBUSY;
> >> +
> >> + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
> >> + return -EFAULT;
> >> +
> >> + switch (cfg.cfg) {
> >> + case KVM_CAP_ARM_RME_CFG_RPV:
> >> + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
> >> + break;
> >> + case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
> >> + r = config_realm_hash_algo(realm, &cfg);
> >> + break;
> >> + case KVM_CAP_ARM_RME_CFG_SVE:
> >> + r = config_realm_sve(realm, &cfg);
> >> + break;
> >> + default:
> >> + r = -EINVAL;
> >> + }
> >> +
> >> + return r;
> >> +}
> >> +
> >> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> >> +{
> >> + int r = 0;
> >> +
> >> + switch (cap->args[0]) {
> >> + case KVM_CAP_ARM_RME_CONFIG_REALM:
> >> + r = kvm_rme_config_realm(kvm, cap);
> >> + break;
> >> + case KVM_CAP_ARM_RME_CREATE_RD:
> >> + if (kvm->created_vcpus) {
> >> + r = -EBUSY;
> >> + break;
> >> + }
> >> +
> >> + r = kvm_create_realm(kvm);
> >> + break;
> >> + default:
> >> + r = -EINVAL;
> >> + break;
> >> + }
> >> +
> >> + return r;
> >> +}
> >> +
> >> +void kvm_destroy_realm(struct kvm *kvm)
> >> +{
> >> + struct realm *realm = &kvm->arch.realm;
> >> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> >> + unsigned int pgd_sz;
> >> + int i;
> >> +
> >> + if (realm->params) {
> >> + free_page((unsigned long)realm->params);
> >> + realm->params = NULL;
> >> + }
> >> +
> >> + if (kvm_realm_state(kvm) == REALM_STATE_NONE)
> >> + return;
> >> +
> >> + WRITE_ONCE(realm->state, REALM_STATE_DYING);
> >> +
> >> + rme_vmid_release(realm->vmid);
> >> +
> >> + if (realm->rd) {
> >> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> >> +
> >> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
> >> + return;
> >> + if (WARN_ON(rmi_granule_undelegate(rd_phys)))
> >> + return;
> >> + free_page((unsigned long)realm->rd);
> >> + realm->rd = NULL;
> >> + }
> >> +
> >> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> >> + for (i = 0; i < pgd_sz; i++) {
> >> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> >> +
> >> + if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
> >> + return;
> >> + }
> >> +
> >> + kvm_free_stage2_pgd(&kvm->arch.mmu);
> >> +}
> >> +
> >> +int kvm_init_realm_vm(struct kvm *kvm)
> >> +{
> >> + struct realm_params *params;
> >> +
> >> + params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
> >> + if (!params)
> >> + return -ENOMEM;
> >> +
> >> + params->features_0 = create_realm_feat_reg0(kvm);
> >> + kvm->arch.realm.params = params;
> >> + return 0;
> >> +}
> >> +
> >> int kvm_init_rme(void)
> >> {
> >> + int ret;
> >> +
> >> if (PAGE_SIZE != SZ_4K)
> >> /* Only 4k page size on the host is supported */
> >> return 0;
> >> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
> >> /* Continue without realm support */
> >> return 0;
> >>
> >> + ret = rme_vmid_init();
> >> + if (ret)
> >> + return ret;
> >> +
> >> + WARN_ON(rmi_features(0, &rmm_feat_reg0));
> >> +
> >> /* Future patch will enable static branch kvm_rme_is_available */
> >>
> >> return 0;
> >
>


2023-03-01 20:50:35

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 08/28] arm64: RME: Keep a spare page delegated to the RMM

On Wed, 1 Mar 2023 11:55:37 +0000
Steven Price <[email protected]> wrote:

> On 13/02/2023 16:47, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:12 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> Pages can only be populated/destroyed on the RMM at the 4KB granule;
> >> this requires creating the full depth of RTTs. However if the pages are
> >> going to be combined into a 4MB huge page the last RTT is only
> >> temporarily needed. Similarly when freeing memory the huge page must be
> >> temporarily split, requiring temporary usage of the full depth of RTTs.
> >>
> >> To avoid needing to perform a temporary allocation and delegation of a
> >> page for this purpose we keep a spare delegated page around. In
> >> particular this avoids the need for memory allocation while destroying
> >> the realm guest.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/include/asm/kvm_rme.h | 3 +++
> >> arch/arm64/kvm/rme.c | 6 ++++++
> >> 2 files changed, 9 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> >> index 055a22accc08..a6318af3ed11 100644
> >> --- a/arch/arm64/include/asm/kvm_rme.h
> >> +++ b/arch/arm64/include/asm/kvm_rme.h
> >> @@ -21,6 +21,9 @@ struct realm {
> >> void *rd;
> >> struct realm_params *params;
> >>
> >> + /* A spare already delegated page */
> >> + phys_addr_t spare_page;
> >> +
> >> unsigned long num_aux;
> >> unsigned int vmid;
> >> unsigned int ia_bits;
> >> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> >> index 9f8c5a91b8fc..0c9d70e4d9e6 100644
> >> --- a/arch/arm64/kvm/rme.c
> >> +++ b/arch/arm64/kvm/rme.c
> >> @@ -148,6 +148,7 @@ static int realm_create_rd(struct kvm *kvm)
> >> }
> >>
> >> realm->rd = rd;
> >> + realm->spare_page = PHYS_ADDR_MAX;
> >> realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> >>
> >> if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
> >> @@ -357,6 +358,11 @@ void kvm_destroy_realm(struct kvm *kvm)
> >> free_page((unsigned long)realm->rd);
> >> realm->rd = NULL;
> >> }
> >> + if (realm->spare_page != PHYS_ADDR_MAX) {
> >> + if (!WARN_ON(rmi_granule_undelegate(realm->spare_page)))
> >> + free_page((unsigned long)phys_to_virt(realm->spare_page));
> >
> > Will the page be leaked (not usable for the host or realms) if the undelegate
> > fails? If yes, better to at least put a comment.
>
> Yes - I'll add a comment.
>
> In general being unable to undelegate a page points to a programming
> error in the host. The only reason the RMM should refuse the request is
> if the page is in use by a Realm which the host has configured. So the
> WARN() is correct (there's a kernel bug) and the only sensible course of
> action is to leak the page and limp on.
>

It would be nice to add a summary of the above to the patch description.

Having a comment when leaking a page (which mostly means the page cannot be
reclaimed by the VMM or used in a Realm any more) is nice. TDX/SNP also have
the problem of leaking pages for mysterious reasons.

Imagine the leaking getting worse bit by bit on a long-running server; KVM
will definitely need a generic accounting interface for reporting the
numbers to userspace later. Having an explicit comment at this time
really makes that easier.
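
Even a trivial counter would be a useful starting point, something like the
sketch below (the field and its placement are hypothetical, just to
illustrate):

	/* hypothetical member of struct realm */
	atomic64_t leaked_granules;

	/* at each spot where an undelegate fails and the page is abandoned */
	if (WARN_ON(rmi_granule_undelegate(phys))) {
		atomic64_inc(&realm->leaked_granules);
		return;
	}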

> Thanks,
>
> Steve
>
> >> + realm->spare_page = PHYS_ADDR_MAX;
> >> + }
> >>
> >> pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> >> for (i = 0; i < pgd_sz; i++) {
> >
>


2023-03-03 14:05:06

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 09/28] arm64: RME: RTT handling

On 13/02/2023 17:44, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:13 +0000
> Steven Price <[email protected]> wrote:
>
>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>> that the RMM creates/destroys entries as necessary. The physical pages
>> to store the page tables are delegated to the realm as required, and can
>> be undelegated when no longer used.
>>
>
> This is only an introduction to RTT handling. While this patch is mostly about
> RTT teardown, better to add more introduction to this patch. Also maybe refine
> the title to reflect what this patch is actually doing.

You've a definite point that this patch is mostly about RTT teardown.
Technically it also adds the RTT creation path (realm_rtt_create) -
hence the generic patch title.

But I'll definitely expand the commit message to mention the complexity
of tear down which is the bulk of the patch.

>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_rme.h | 19 +++++
>> arch/arm64/kvm/mmu.c | 7 +-
>> arch/arm64/kvm/rme.c | 139 +++++++++++++++++++++++++++++++
>> 3 files changed, 162 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index a6318af3ed11..eea5118dfa8a 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -35,5 +35,24 @@ u32 kvm_realm_ipa_limit(void);
>> int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>> int kvm_init_realm_vm(struct kvm *kvm);
>> void kvm_destroy_realm(struct kvm *kvm);
>> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
>> +
>> +#define RME_RTT_BLOCK_LEVEL 2
>> +#define RME_RTT_MAX_LEVEL 3
>> +
>> +#define RME_PAGE_SHIFT 12
>> +#define RME_PAGE_SIZE BIT(RME_PAGE_SHIFT)
>> +/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */
>> +#define RME_RTT_LEVEL_SHIFT(l) \
>> + ((RME_PAGE_SHIFT - 3) * (4 - (l)) + 3)
>> +#define RME_L2_BLOCK_SIZE BIT(RME_RTT_LEVEL_SHIFT(2))
>> +
>> +static inline unsigned long rme_rtt_level_mapsize(int level)
>> +{
>> + if (WARN_ON(level > RME_RTT_MAX_LEVEL))
>> + return RME_PAGE_SIZE;
>> +
>> + return (1UL << RME_RTT_LEVEL_SHIFT(level));
>> +}
>>
>> #endif
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 22c00274884a..f29558c5dcbc 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -834,16 +834,17 @@ void stage2_unmap_vm(struct kvm *kvm)
>> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> {
>> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>> - struct kvm_pgtable *pgt = NULL;
>> + struct kvm_pgtable *pgt;
>>
>> write_lock(&kvm->mmu_lock);
>> + pgt = mmu->pgt;
>> if (kvm_is_realm(kvm) &&
>> kvm_realm_state(kvm) != REALM_STATE_DYING) {
>> - /* TODO: teardown rtts */
>> write_unlock(&kvm->mmu_lock);
>> + kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
>> + pgt->start_level);
>> return;
>> }
>> - pgt = mmu->pgt;
>> if (pgt) {
>> mmu->pgd_phys = 0;
>> mmu->pgt = NULL;
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 0c9d70e4d9e6..f7b0e5a779f8 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -73,6 +73,28 @@ static int rmi_check_version(void)
>> return 0;
>> }
>>
>> +static void realm_destroy_undelegate_range(struct realm *realm,
>> + unsigned long ipa,
>> + unsigned long addr,
>> + ssize_t size)
>> +{
>> + unsigned long rd = virt_to_phys(realm->rd);
>> + int ret;
>> +
>> + while (size > 0) {
>> + ret = rmi_data_destroy(rd, ipa);
>> + WARN_ON(ret);
>> + ret = rmi_granule_undelegate(addr);
>> +
> As the return value is not documented, what will happen if a page undelegate
> failed? Leaked? Some explanation is required here.

Yes - it's leaked. I'll add a comment to explain the get_page() call.
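
Roughly along these lines (exact wording to be decided):

	ret = rmi_granule_undelegate(addr);
	/*
	 * If the undelegate fails the page stays delegated and can never be
	 * given back to the host. Take an extra reference so the page is
	 * never returned to the page allocator, i.e. it is deliberately
	 * leaked.
	 */
	if (ret)
		get_page(phys_to_page(addr));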

Thanks,

Steve

>> + if (ret)
>> + get_page(phys_to_page(addr));
>> +
>> + addr += PAGE_SIZE;
>> + ipa += PAGE_SIZE;
>> + size -= PAGE_SIZE;
>> + }
>> +}
>> +
>> static unsigned long create_realm_feat_reg0(struct kvm *kvm)
>> {
>> unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
>> @@ -170,6 +192,123 @@ static int realm_create_rd(struct kvm *kvm)
>> return r;
>> }
>>
>> +static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
>> + int level, phys_addr_t rtt_granule)
>> +{
>> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
>> + return rmi_rtt_destroy(rtt_granule, virt_to_phys(realm->rd), addr,
>> + level);
>> +}
>> +
>> +static int realm_destroy_free_rtt(struct realm *realm, unsigned long addr,
>> + int level, phys_addr_t rtt_granule)
>> +{
>> + if (realm_rtt_destroy(realm, addr, level, rtt_granule))
>> + return -ENXIO;
>> + if (!WARN_ON(rmi_granule_undelegate(rtt_granule)))
>> + put_page(phys_to_page(rtt_granule));
>> +
>> + return 0;
>> +}
>> +
>> +static int realm_rtt_create(struct realm *realm,
>> + unsigned long addr,
>> + int level,
>> + phys_addr_t phys)
>> +{
>> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
>> + return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
>> +}
>> +
>> +static int realm_tear_down_rtt_range(struct realm *realm, int level,
>> + unsigned long start, unsigned long end)
>> +{
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + ssize_t map_size = rme_rtt_level_mapsize(level);
>> + unsigned long addr, next_addr;
>> + bool failed = false;
>> +
>> + for (addr = start; addr < end; addr = next_addr) {
>> + phys_addr_t rtt_addr, tmp_rtt;
>> + struct rtt_entry rtt;
>> + unsigned long end_addr;
>> +
>> + next_addr = ALIGN(addr + 1, map_size);
>> +
>> + end_addr = min(next_addr, end);
>> +
>> + if (rmi_rtt_read_entry(rd, ALIGN_DOWN(addr, map_size),
>> + level, &rtt)) {
>> + failed = true;
>> + continue;
>> + }
>> +
>> + rtt_addr = rmi_rtt_get_phys(&rtt);
>> + WARN_ON(level != rtt.walk_level);
>> +
>> + switch (rtt.state) {
>> + case RMI_UNASSIGNED:
>> + case RMI_DESTROYED:
>> + break;
>> + case RMI_TABLE:
>> + if (realm_tear_down_rtt_range(realm, level + 1,
>> + addr, end_addr)) {
>> + failed = true;
>> + break;
>> + }
>> + if (IS_ALIGNED(addr, map_size) &&
>> + next_addr <= end &&
>> + realm_destroy_free_rtt(realm, addr, level + 1,
>> + rtt_addr))
>> + failed = true;
>> + break;
>> + case RMI_ASSIGNED:
>> + WARN_ON(!rtt_addr);
>> + /*
>> + * If there is a block mapping, break it now, using the
>> + * spare_page. We are sure to have a valid delegated
>> + * page at spare_page before we enter here, otherwise
>> + * WARN once, which will be followed by further
>> + * warnings.
>> + */
>> + tmp_rtt = realm->spare_page;
>> + if (level == 2 &&
>> + !WARN_ON_ONCE(tmp_rtt == PHYS_ADDR_MAX) &&
>> + realm_rtt_create(realm, addr,
>> + RME_RTT_MAX_LEVEL, tmp_rtt)) {
>> + WARN_ON(1);
>> + failed = true;
>> + break;
>> + }
>> + realm_destroy_undelegate_range(realm, addr,
>> + rtt_addr, map_size);
>> + /*
>> + * Collapse the last level table and make the spare page
>> + * reusable again.
>> + */
>> + if (level == 2 &&
>> + realm_rtt_destroy(realm, addr, RME_RTT_MAX_LEVEL,
>> + tmp_rtt))
>> + failed = true;
>> + break;
>> + case RMI_VALID_NS:
>> + WARN_ON(rmi_rtt_unmap_unprotected(rd, addr, level));
>> + break;
>> + default:
>> + WARN_ON(1);
>> + failed = true;
>> + break;
>> + }
>> + }
>> +
>> + return failed ? -EINVAL : 0;
>> +}
>> +
>> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
>> +{
>> + realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
>> +}
>> +
>> /* Protects access to rme_vmid_bitmap */
>> static DEFINE_SPINLOCK(rme_vmid_lock);
>> static unsigned long *rme_vmid_bitmap;
>


2023-03-03 14:05:14

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 10/28] arm64: RME: Allocate/free RECs to match vCPUs

On 13/02/2023 18:08, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:14 +0000
> Steven Price <[email protected]> wrote:
>
>> The RMM maintains a data structure known as the Realm Execution Context
>> (or REC). It is similar to struct kvm_vcpu and tracks the state of the
>> virtual CPUs. KVM must delegate memory and request the structures are
>> created when vCPUs are created, and suitably tear down on destruction.
>>
>
> It would be better to leave some pointers to the spec here. It really saves
> time for reviewers.

Fair enough. I wasn't sure how often to repeat the link to the spec, but
a few more times wouldn't hurt ;)

>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 2 +
>> arch/arm64/include/asm/kvm_host.h | 3 +
>> arch/arm64/include/asm/kvm_rme.h | 10 ++
>> arch/arm64/kvm/arm.c | 1 +
>> arch/arm64/kvm/reset.c | 11 ++
>> arch/arm64/kvm/rme.c | 144 +++++++++++++++++++++++++++
>> 6 files changed, 171 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 5a2b7229e83f..285e62914ca4 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -504,6 +504,8 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>>
>> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
>> {
>> + if (static_branch_unlikely(&kvm_rme_is_available))
>> + return vcpu->arch.rec.mpidr != INVALID_HWID;
>> return false;
>> }
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 04347c3a8c6b..ef497b718cdb 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -505,6 +505,9 @@ struct kvm_vcpu_arch {
>> u64 last_steal;
>> gpa_t base;
>> } steal;
>> +
>> + /* Realm meta data */
>> + struct rec rec;
>
> I think the name of the data structure "rec" needs a prefix; it is too common
> and might conflict with private data structures in other modules. Maybe
> rme_rec or realm_rec?

struct realm_rec seems like a good choice. I agree 'rec' without context
is somewhat ambiguous.

>> };
>>
>> /*
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index eea5118dfa8a..4b219ebe1400 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -6,6 +6,7 @@
>> #ifndef __ASM_KVM_RME_H
>> #define __ASM_KVM_RME_H
>>
>> +#include <asm/rmi_smc.h>
>> #include <uapi/linux/kvm.h>
>>
>> enum realm_state {
>> @@ -29,6 +30,13 @@ struct realm {
>> unsigned int ia_bits;
>> };
>>
>> +struct rec {
>> + unsigned long mpidr;
>> + void *rec_page;
>> + struct page *aux_pages[REC_PARAMS_AUX_GRANULES];
>> + struct rec_run *run;
>> +};
>> +
>
> It is better to leave some comments for the above members, or pointers to the
> spec; that saves a lot of time for review.

Will add comments.

>> int kvm_init_rme(void);
>> u32 kvm_realm_ipa_limit(void);
>>
>> @@ -36,6 +44,8 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>> int kvm_init_realm_vm(struct kvm *kvm);
>> void kvm_destroy_realm(struct kvm *kvm);
>> void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
>> +int kvm_create_rec(struct kvm_vcpu *vcpu);
>> +void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>>
>> #define RME_RTT_BLOCK_LEVEL 2
>> #define RME_RTT_MAX_LEVEL 3
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index badd775547b8..52affed2f3cf 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -373,6 +373,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>> /* Force users to call KVM_ARM_VCPU_INIT */
>> vcpu->arch.target = -1;
>> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
>> + vcpu->arch.rec.mpidr = INVALID_HWID;
>>
>> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
>>
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index 9e71d69e051f..0c84392a4bf2 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -135,6 +135,11 @@ int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
>> return -EPERM;
>>
>> return kvm_vcpu_finalize_sve(vcpu);
>> + case KVM_ARM_VCPU_REC:
>> + if (!kvm_is_realm(vcpu->kvm))
>> + return -EINVAL;
>> +
>> + return kvm_create_rec(vcpu);
>> }
>>
>> return -EINVAL;
>> @@ -145,6 +150,11 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu)
>> if (vcpu_has_sve(vcpu) && !kvm_arm_vcpu_sve_finalized(vcpu))
>> return false;
>>
>> + if (kvm_is_realm(vcpu->kvm) &&
>> + !(vcpu_is_rec(vcpu) &&
>> + READ_ONCE(vcpu->kvm->arch.realm.state) == REALM_STATE_ACTIVE))
>> + return false;
>
> That's why it is better to introduce the realm state in the previous patches so
> that people can really get the idea of the states at this stage.
>
>> +
>> return true;
>> }
>>
>> @@ -157,6 +167,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
>> if (sve_state)
>> kvm_unshare_hyp(sve_state, sve_state + vcpu_sve_state_size(vcpu));
>> kfree(sve_state);
>> + kvm_destroy_rec(vcpu);
>> }
>>
>> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index f7b0e5a779f8..d79ed889ca4d 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -514,6 +514,150 @@ void kvm_destroy_realm(struct kvm *kvm)
>> kvm_free_stage2_pgd(&kvm->arch.mmu);
>> }
>>
>> +static void free_rec_aux(struct page **aux_pages,
>> + unsigned int num_aux)
>> +{
>> + unsigned int i;
>> +
>> + for (i = 0; i < num_aux; i++) {
>> + phys_addr_t aux_page_phys = page_to_phys(aux_pages[i]);
>> +
>> + if (WARN_ON(rmi_granule_undelegate(aux_page_phys)))
>> + continue;
>> +
>> + __free_page(aux_pages[i]);
>> + }
>> +}
>> +
>> +static int alloc_rec_aux(struct page **aux_pages,
>> + u64 *aux_phys_pages,
>> + unsigned int num_aux)
>> +{
>> + int ret;
>> + unsigned int i;
>> +
>> + for (i = 0; i < num_aux; i++) {
>> + struct page *aux_page;
>> + phys_addr_t aux_page_phys;
>> +
>> + aux_page = alloc_page(GFP_KERNEL);
>> + if (!aux_page) {
>> + ret = -ENOMEM;
>> + goto out_err;
>> + }
>> + aux_page_phys = page_to_phys(aux_page);
>> + if (rmi_granule_delegate(aux_page_phys)) {
>> + __free_page(aux_page);
>> + ret = -ENXIO;
>> + goto out_err;
>> + }
>> + aux_pages[i] = aux_page;
>> + aux_phys_pages[i] = aux_page_phys;
>> + }
>> +
>> + return 0;
>> +out_err:
>> + free_rec_aux(aux_pages, i);
>> + return ret;
>> +}
>> +
>> +int kvm_create_rec(struct kvm_vcpu *vcpu)
>> +{
>> + struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
>> + unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
>> + struct realm *realm = &vcpu->kvm->arch.realm;
>> + struct rec *rec = &vcpu->arch.rec;
>> + unsigned long rec_page_phys;
>> + struct rec_params *params;
>> + int r, i;
>> +
>> + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_NEW)
>> + return -ENOENT;
>> +
>> + /*
>> + * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2
>> + * flag covers v0.2 and onwards.
>> + */
>> + if (!test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
>> + return -EINVAL;
>> +
>> + BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE);
>> + BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE);
>> +
>> + params = (struct rec_params *)get_zeroed_page(GFP_KERNEL);
>> + rec->rec_page = (void *)__get_free_page(GFP_KERNEL);
>> + rec->run = (void *)get_zeroed_page(GFP_KERNEL);
>> + if (!params || !rec->rec_page || !rec->run) {
>> + r = -ENOMEM;
>> + goto out_free_pages;
>> + }
>> +
>> + for (i = 0; i < ARRAY_SIZE(params->gprs); i++)
>> + params->gprs[i] = vcpu_regs->regs[i];
>> +
>> + params->pc = vcpu_regs->pc;
>> +
>> + if (vcpu->vcpu_id == 0)
>> + params->flags |= REC_PARAMS_FLAG_RUNNABLE;
>> +
>> + rec_page_phys = virt_to_phys(rec->rec_page);
>> +
>> + if (rmi_granule_delegate(rec_page_phys)) {
>> + r = -ENXIO;
>> + goto out_free_pages;
>> + }
>> +
>
> Wouldn't it be better to extend alloc_rec_aux() to allocate and delegate the
> pages above, so that you can save some allocations and rmi_granule_delegate()
> calls?

I don't think it's really an improvement. There's only the one
rmi_granule_delegate() call (for the REC page itself). The RecParams and
RecRun pages are not delegated because they are shared with the host. It
would also hide the structure setup within the new
alloc_rec_aux_and_rec() function.

>> + r = alloc_rec_aux(rec->aux_pages, params->aux, realm->num_aux);
>> + if (r)
>> + goto out_undelegate_rmm_rec;
>> +
>> + params->num_rec_aux = realm->num_aux;
>> + params->mpidr = mpidr;
>> +
>> + if (rmi_rec_create(rec_page_phys,
>> + virt_to_phys(realm->rd),
>> + virt_to_phys(params))) {
>> + r = -ENXIO;
>> + goto out_free_rec_aux;
>> + }
>> +
>> + rec->mpidr = mpidr;
>> +
>> + free_page((unsigned long)params);
>> + return 0;
>> +
>> +out_free_rec_aux:
>> + free_rec_aux(rec->aux_pages, realm->num_aux);
>> +out_undelegate_rmm_rec:
>> + if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
>> + rec->rec_page = NULL;
>> +out_free_pages:
>> + free_page((unsigned long)rec->run);
>> + free_page((unsigned long)rec->rec_page);
>> + free_page((unsigned long)params);
>> + return r;
>> +}
>> +
>> +void kvm_destroy_rec(struct kvm_vcpu *vcpu)
>> +{
>> + struct realm *realm = &vcpu->kvm->arch.realm;
>> + struct rec *rec = &vcpu->arch.rec;
>> + unsigned long rec_page_phys;
>> +
>> + if (!vcpu_is_rec(vcpu))
>> + return;
>> +
>> + rec_page_phys = virt_to_phys(rec->rec_page);
>> +
>> + if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
>> + return;
>> + if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
>> + return;
>> +
>
> The two returns above feel off. What is the reason to skip the page
> undelegates below?

The reason is the usual one: if we fail to undelegate then the pages have to
be leaked. I'll add some comments. However it does look like I've got
the order wrong here: if the undelegate fails for rec_page_phys it's
possible that we might still be able to free the rec_aux pages (although
something has gone terribly wrong for that to be the case).

I'll change the order to:

/* If the REC destroy fails, leak all pages relating to the REC */
if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
return;

free_rec_aux(rec->aux_pages, realm->num_aux);

/* If the undelegate fails then leak the REC page */
if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
return;

free_page((unsigned long)rec->rec_page);

If the rmi_rec_destroy() call has failed then the RMM should prevent the
undelegate so there's little point trying.

Steve

>> + free_rec_aux(rec->aux_pages, realm->num_aux);
>> + free_page((unsigned long)rec->rec_page);
>> +}
>> +
>> int kvm_init_realm_vm(struct kvm *kvm)
>> {
>> struct realm_params *params;
>


2023-03-03 14:05:53

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 13/28] arm64: RME: Allow VMM to set RIPAS

On 17/02/2023 13:07, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:17 +0000
> Steven Price <[email protected]> wrote:
>
>> Each page within the protected region of the realm guest can be marked
>> as either RAM or EMPTY. Allow the VMM to control this before the guest
>> has started and provide the equivalent functions to change this (with
>> the guest's approval) at runtime.
>>
>
> The above is just the purpose of this patch. It would be better to have one
> more paragraph describing what this patch does (building RTTs and setting the
> IPA state in them) and explaining the things that might confuse people, for
> example the spare page.

I'll improve the commit message.

> The spare page is really confusing. When reading __alloc_delegated_page()
> , it looks like a mechanism to cache a delegated page for the realm. But later
> in the teardown path, it looks like a workaround. What if the allocation of
> the spare page failed in the RTT tear down path?

Yeah the spare_page is a bit messy. Ultimately the idea is that rather
than having to delegate a page to the RMM temporarily just for breaking
up a block mapping which is going to be freed, we keep one spare for the
purpose. This also reduces the chance of an allocation failure while
trying to free memory.

One area of confusion, and something that might be worth revisiting, is
that the spare_page is also used opportunistically in
realm_create_rtt_levels(). Again this makes sense in the case where a
temporary page is needed when creating a block mapping, but the code
doesn't distinguish between this and just creating RTTs for normal mappings.

This leads to the rather unfortunate situation that it's not actually
possible to rely on there being a spare_page and therefore this is
pre-populated in kvm_realm_unmap_range(), but with a possibility that
allocation failure could occur resulting in the function failing (which
is 'handled' by a WARN_ON).

> I understand this must be a temporary solution. It would be really nice to
> have a big picture or some basic introduction to the future plan.

Sadly I don't have a "big picture" plan at the moment. I am
quite tempted to split spare_page into two (rough sketch after the list):

* A 'guaranteed' spare page purely for destroying block mappings. This
would be allocated when the realm is created and only used for the
purpose of tearing down mappings.

* A temporary spare page used as a minor optimisation during block
mapping creation - rather than immediately freeing the page back when
folding we can hold on to it with the assumption that it's likely to be
useful for creating further mappings in the same call.
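
Roughly, that would mean something like the following in struct realm (the
field names are far from final):

	/*
	 * Delegated page reserved when the realm is created, used only for
	 * breaking block mappings during unmap/teardown so that path never
	 * depends on a memory allocation succeeding.
	 */
	phys_addr_t teardown_rtt_page;

	/*
	 * Opportunistic cache of a delegated page to speed up RTT creation;
	 * may be PHYS_ADDR_MAX at any time.
	 */
	phys_addr_t spare_page;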

However, to be honest, this is all a bit academic because as it stands
block mappings can't really be used. But when we switch over to using
the memfd approach hopefully huge pages can be translated to block mappings.

Steve

>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_rme.h | 4 +
>> arch/arm64/kvm/rme.c | 288 +++++++++++++++++++++++++++++++
>> 2 files changed, 292 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index 4b219ebe1400..3e75cedaad18 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -47,6 +47,10 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
>> int kvm_create_rec(struct kvm_vcpu *vcpu);
>> void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>>
>> +int realm_set_ipa_state(struct kvm_vcpu *vcpu,
>> + unsigned long addr, unsigned long end,
>> + unsigned long ripas);
>> +
>> #define RME_RTT_BLOCK_LEVEL 2
>> #define RME_RTT_MAX_LEVEL 3
>>
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index d79ed889ca4d..b3ea79189839 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -73,6 +73,58 @@ static int rmi_check_version(void)
>> return 0;
>> }
>>
>> +static phys_addr_t __alloc_delegated_page(struct realm *realm,
>> + struct kvm_mmu_memory_cache *mc, gfp_t flags)
>> +{
>> + phys_addr_t phys = PHYS_ADDR_MAX;
>> + void *virt;
>> +
>> + if (realm->spare_page != PHYS_ADDR_MAX) {
>> + swap(realm->spare_page, phys);
>> + goto out;
>> + }
>> +
>> + if (mc)
>> + virt = kvm_mmu_memory_cache_alloc(mc);
>> + else
>> + virt = (void *)__get_free_page(flags);
>> +
>> + if (!virt)
>> + goto out;
>> +
>> + phys = virt_to_phys(virt);
>> +
>> + if (rmi_granule_delegate(phys)) {
>> + free_page((unsigned long)virt);
>> +
>> + phys = PHYS_ADDR_MAX;
>> + }
>> +
>> +out:
>> + return phys;
>> +}
>> +
>> +static phys_addr_t alloc_delegated_page(struct realm *realm,
>> + struct kvm_mmu_memory_cache *mc)
>> +{
>> + return __alloc_delegated_page(realm, mc, GFP_KERNEL);
>> +}
>> +
>> +static void free_delegated_page(struct realm *realm, phys_addr_t phys)
>> +{
>> + if (realm->spare_page == PHYS_ADDR_MAX) {
>> + realm->spare_page = phys;
>> + return;
>> + }
>> +
>> + if (WARN_ON(rmi_granule_undelegate(phys))) {
>> + /* Undelegate failed: leak the page */
>> + return;
>> + }
>> +
>> + free_page((unsigned long)phys_to_virt(phys));
>> +}
>> +
>> static void realm_destroy_undelegate_range(struct realm *realm,
>> unsigned long ipa,
>> unsigned long addr,
>> @@ -220,6 +272,30 @@ static int realm_rtt_create(struct realm *realm,
>> return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
>> }
>>
>> +static int realm_create_rtt_levels(struct realm *realm,
>> + unsigned long ipa,
>> + int level,
>> + int max_level,
>> + struct kvm_mmu_memory_cache *mc)
>> +{
>> + if (WARN_ON(level == max_level))
>> + return 0;
>> +
>> + while (level++ < max_level) {
>> + phys_addr_t rtt = alloc_delegated_page(realm, mc);
>> +
>> + if (rtt == PHYS_ADDR_MAX)
>> + return -ENOMEM;
>> +
>> + if (realm_rtt_create(realm, ipa, level, rtt)) {
>> + free_delegated_page(realm, rtt);
>> + return -ENXIO;
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static int realm_tear_down_rtt_range(struct realm *realm, int level,
>> unsigned long start, unsigned long end)
>> {
>> @@ -309,6 +385,206 @@ void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
>> realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
>> }
>>
>> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
>> +{
>> + u32 ia_bits = kvm->arch.mmu.pgt->ia_bits;
>> + u32 start_level = kvm->arch.mmu.pgt->start_level;
>> + unsigned long end = ipa + size;
>> + struct realm *realm = &kvm->arch.realm;
>> + phys_addr_t tmp_rtt = PHYS_ADDR_MAX;
>> +
>> + if (end > (1UL << ia_bits))
>> + end = 1UL << ia_bits;
>> + /*
>> + * Make sure we have a spare delegated page for tearing down the
>> + * block mappings. We must use Atomic allocations as we are called
>> + * with kvm->mmu_lock held.
>> + */
>> + if (realm->spare_page == PHYS_ADDR_MAX) {
>> + tmp_rtt = __alloc_delegated_page(realm, NULL, GFP_ATOMIC);
>> + /*
>> + * We don't have to check the status here, as we may not
>> + * have a block level mapping. Delay any error to the point
>> + * where we need it.
>> + */
>> + realm->spare_page = tmp_rtt;
>> + }
>> +
>> + realm_tear_down_rtt_range(&kvm->arch.realm, start_level, ipa, end);
>> +
>> + /* Free up the atomic page, if there were any */
>> + if (tmp_rtt != PHYS_ADDR_MAX) {
>> + free_delegated_page(realm, tmp_rtt);
>> + /*
>> + * Update the spare_page after we have freed the
>> + * above page to make sure it doesn't get cached
>> + * in spare_page.
>> + * We should re-write this part and always have
>> + * a dedicated page for handling block mappings.
>> + */
>> + realm->spare_page = PHYS_ADDR_MAX;
>> + }
>> +}
>> +
>> +static int set_ipa_state(struct kvm_vcpu *vcpu,
>> + unsigned long ipa,
>> + unsigned long end,
>> + int level,
>> + unsigned long ripas)
>> +{
>> + struct kvm *kvm = vcpu->kvm;
>> + struct realm *realm = &kvm->arch.realm;
>> + struct rec *rec = &vcpu->arch.rec;
>> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
>> + phys_addr_t rec_phys = virt_to_phys(rec->rec_page);
>> + unsigned long map_size = rme_rtt_level_mapsize(level);
>> + int ret;
>> +
>> + while (ipa < end) {
>> + ret = rmi_rtt_set_ripas(rd_phys, rec_phys, ipa, level, ripas);
>> +
>> + if (!ret) {
>> + if (!ripas)
>> + kvm_realm_unmap_range(kvm, ipa, map_size);
>> + } else if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + int walk_level = RMI_RETURN_INDEX(ret);
>> +
>> + if (walk_level < level) {
>> + ret = realm_create_rtt_levels(realm, ipa,
>> + walk_level,
>> + level, NULL);
>> + if (ret)
>> + return ret;
>> + continue;
>> + }
>> +
>> + if (WARN_ON(level >= RME_RTT_MAX_LEVEL))
>> + return -EINVAL;
>> +
>> + /* Recurse one level lower */
>> + ret = set_ipa_state(vcpu, ipa, ipa + map_size,
>> + level + 1, ripas);
>> + if (ret)
>> + return ret;
>> + } else {
>> + WARN(1, "Unexpected error in %s: %#x\n", __func__,
>> + ret);
>> + return -EINVAL;
>> + }
>> + ipa += map_size;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int realm_init_ipa_state(struct realm *realm,
>> + unsigned long ipa,
>> + unsigned long end,
>> + int level)
>> +{
>> + unsigned long map_size = rme_rtt_level_mapsize(level);
>> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
>> + int ret;
>> +
>> + while (ipa < end) {
>> + ret = rmi_rtt_init_ripas(rd_phys, ipa, level);
>> +
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + int cur_level = RMI_RETURN_INDEX(ret);
>> +
>> + if (cur_level < level) {
>> + ret = realm_create_rtt_levels(realm, ipa,
>> + cur_level,
>> + level, NULL);
>> + if (ret)
>> + return ret;
>> + /* Retry with the RTT levels in place */
>> + continue;
>> + }
>> +
>> + /* There's an entry at a lower level, recurse */
>> + if (WARN_ON(level >= RME_RTT_MAX_LEVEL))
>> + return -EINVAL;
>> +
>> + realm_init_ipa_state(realm, ipa, ipa + map_size,
>> + level + 1);
>> + } else if (WARN_ON(ret)) {
>> + return -ENXIO;
>> + }
>> +
>> + ipa += map_size;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int find_map_level(struct kvm *kvm, unsigned long start, unsigned long end)
>> +{
>> + int level = RME_RTT_MAX_LEVEL;
>> +
>> + while (level > get_start_level(kvm) + 1) {
>> + unsigned long map_size = rme_rtt_level_mapsize(level - 1);
>> +
>> + if (!IS_ALIGNED(start, map_size) ||
>> + (start + map_size) > end)
>> + break;
>> +
>> + level--;
>> + }
>> +
>> + return level;
>> +}
>> +
>> +int realm_set_ipa_state(struct kvm_vcpu *vcpu,
>> + unsigned long addr, unsigned long end,
>> + unsigned long ripas)
>> +{
>> + int ret = 0;
>> +
>> + while (addr < end) {
>> + int level = find_map_level(vcpu->kvm, addr, end);
>> + unsigned long map_size = rme_rtt_level_mapsize(level);
>> +
>> + ret = set_ipa_state(vcpu, addr, addr + map_size, level, ripas);
>> + if (ret)
>> + break;
>> +
>> + addr += map_size;
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int kvm_init_ipa_range_realm(struct kvm *kvm,
>> + struct kvm_cap_arm_rme_init_ipa_args *args)
>> +{
>> + int ret = 0;
>> + gpa_t addr, end;
>> + struct realm *realm = &kvm->arch.realm;
>> +
>> + addr = args->init_ipa_base;
>> + end = addr + args->init_ipa_size;
>> +
>> + if (end < addr)
>> + return -EINVAL;
>> +
>> + if (kvm_realm_state(kvm) != REALM_STATE_NEW)
>> + return -EBUSY;
>> +
>> + while (addr < end) {
>> + int level = find_map_level(kvm, addr, end);
>> + unsigned long map_size = rme_rtt_level_mapsize(level);
>> +
>> + ret = realm_init_ipa_state(realm, addr, addr + map_size, level);
>> + if (ret)
>> + break;
>> +
>> + addr += map_size;
>> + }
>> +
>> + return ret;
>> +}
>> +
>> /* Protects access to rme_vmid_bitmap */
>> static DEFINE_SPINLOCK(rme_vmid_lock);
>> static unsigned long *rme_vmid_bitmap;
>> @@ -460,6 +736,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>>
>> r = kvm_create_realm(kvm);
>> break;
>> + case KVM_CAP_ARM_RME_INIT_IPA_REALM: {
>> + struct kvm_cap_arm_rme_init_ipa_args args;
>> + void __user *argp = u64_to_user_ptr(cap->args[1]);
>> +
>> + if (copy_from_user(&args, argp, sizeof(args))) {
>> + r = -EFAULT;
>> + break;
>> + }
>> +
>> + r = kvm_init_ipa_range_realm(kvm, &args);
>> + break;
>> + }
>> default:
>> r = -EINVAL;
>> break;
>


2023-03-04 12:07:18

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init

On Mon, 13 Feb 2023 15:59:05 +0000
Steven Price <[email protected]> wrote:

> On 13/02/2023 15:48, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:08 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> Query the RMI version number and check if it is a compatible version. A
> >> static key is also provided to signal that a supported RMM is available.
> >>
> >> Functions are provided to query if a VM or VCPU is a realm (or rec)
> >> which currently will always return false.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
> >> arch/arm64/include/asm/kvm_host.h | 4 +++
> >> arch/arm64/include/asm/kvm_rme.h | 22 +++++++++++++
> >> arch/arm64/include/asm/virt.h | 1 +
> >> arch/arm64/kvm/Makefile | 3 +-
> >> arch/arm64/kvm/arm.c | 8 +++++
> >> arch/arm64/kvm/rme.c | 49 ++++++++++++++++++++++++++++
> >> 7 files changed, 103 insertions(+), 1 deletion(-)
> >> create mode 100644 arch/arm64/include/asm/kvm_rme.h
> >> create mode 100644 arch/arm64/kvm/rme.c
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> >> index 9bdba47f7e14..5a2b7229e83f 100644
> >> --- a/arch/arm64/include/asm/kvm_emulate.h
> >> +++ b/arch/arm64/include/asm/kvm_emulate.h
> >> @@ -490,4 +490,21 @@ static inline bool vcpu_has_feature(struct kvm_vcpu *vcpu, int feature)
> >> return test_bit(feature, vcpu->arch.features);
> >> }
> >>
> >> +static inline bool kvm_is_realm(struct kvm *kvm)
> >> +{
> >> + if (static_branch_unlikely(&kvm_rme_is_available))
> >> + return kvm->arch.is_realm;
> >> + return false;
> >> +}
> >> +
> >> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> >> +{
> >> + return READ_ONCE(kvm->arch.realm.state);
> >> +}
> >> +
> >> +static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> >> +{
> >> + return false;
> >> +}
> >> +
> >> #endif /* __ARM64_KVM_EMULATE_H__ */
> >> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >> index 35a159d131b5..04347c3a8c6b 100644
> >> --- a/arch/arm64/include/asm/kvm_host.h
> >> +++ b/arch/arm64/include/asm/kvm_host.h
> >> @@ -26,6 +26,7 @@
> >> #include <asm/fpsimd.h>
> >> #include <asm/kvm.h>
> >> #include <asm/kvm_asm.h>
> >> +#include <asm/kvm_rme.h>
> >>
> >> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> >>
> >> @@ -240,6 +241,9 @@ struct kvm_arch {
> >> * the associated pKVM instance in the hypervisor.
> >> */
> >> struct kvm_protected_vm pkvm;
> >> +
> >> + bool is_realm;
> > ^
> > It would be better to put more comments, which would really help the review.
>
> Thanks for the feedback - I had thought "is realm" was fairly
> self-documenting, but perhaps I've just spent too much time with this code.
>
> > I was looking for the user of this member to see when it is set. It seems
> > it is not in this patch. It would have been nice to have a quick answer from the
> > comments.
>
> The usage is in the kvm_is_realm() function which is used in several of
> the later patches as a way to detect this kvm guest is a realm guest.
>
> I think the main issue is that I've got the patches in the wrong order.
> Patch 7 "arm64: kvm: Allow passing machine type in KVM creation" should
> probably be before this one, then I could add the assignment of is_realm
> into this patch (potentially splitting out the is_realm parts into
> another patch).
>

I agree the patch order seems to be a problem here. The name is self-documenting,
but if the user of the variable is not in this patch, one still needs to jump
to the related patch to confirm the variable is used as expected. In that
situation, a comment would help to avoid jumping between patches (sometimes
finding the user of a variable in a patch bundle really slows down
the review, and eventually you have to open a terminal and check
it in the git tree).

> Thanks,
>
> Steve
>


2023-03-04 12:32:29

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 09/28] arm64: RME: RTT handling

On Fri, 3 Mar 2023 14:04:56 +0000
Steven Price <[email protected]> wrote:

> On 13/02/2023 17:44, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:13 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> The RMM owns the stage 2 page tables for a realm, and KVM must request
> >> that the RMM creates/destroys entries as necessary. The physical pages
> >> to store the page tables are delegated to the realm as required, and can
> >> be undelegated when no longer used.
> >>
> >
> > This is only an introduction to RTT handling. While this patch is mostly about
> > RTT teardown, better to add more introduction to this patch. Also maybe refine
> > the title to reflect what this patch is actually doing.
>
> You've a definite point that this patch is mostly about RTT teardown.
> Technically it also adds the RTT creation path (realm_rtt_create) -
> hence the generic patch title.
>

But realm_rtt_create() seems to be only used in realm_tear_down_rtt_range().
That makes me wonder where the real RTT creation path is.

> But I'll definitely expand the commit message to mention the complexity
> of tear down which is the bulk of the patch.

It is also a good place to explain more about the RTT.

>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/include/asm/kvm_rme.h | 19 +++++
> >> arch/arm64/kvm/mmu.c | 7 +-
> >> arch/arm64/kvm/rme.c | 139 +++++++++++++++++++++++++++++++
> >> 3 files changed, 162 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> >> index a6318af3ed11..eea5118dfa8a 100644
> >> --- a/arch/arm64/include/asm/kvm_rme.h
> >> +++ b/arch/arm64/include/asm/kvm_rme.h
> >> @@ -35,5 +35,24 @@ u32 kvm_realm_ipa_limit(void);
> >> int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> >> int kvm_init_realm_vm(struct kvm *kvm);
> >> void kvm_destroy_realm(struct kvm *kvm);
> >> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
> >> +
> >> +#define RME_RTT_BLOCK_LEVEL 2
> >> +#define RME_RTT_MAX_LEVEL 3
> >> +
> >> +#define RME_PAGE_SHIFT 12
> >> +#define RME_PAGE_SIZE BIT(RME_PAGE_SHIFT)
> >> +/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */
> >> +#define RME_RTT_LEVEL_SHIFT(l) \
> >> + ((RME_PAGE_SHIFT - 3) * (4 - (l)) + 3)
> >> +#define RME_L2_BLOCK_SIZE BIT(RME_RTT_LEVEL_SHIFT(2))
> >> +
> >> +static inline unsigned long rme_rtt_level_mapsize(int level)
> >> +{
> >> + if (WARN_ON(level > RME_RTT_MAX_LEVEL))
> >> + return RME_PAGE_SIZE;
> >> +
> >> + return (1UL << RME_RTT_LEVEL_SHIFT(level));
> >> +}
> >>
> >> #endif
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index 22c00274884a..f29558c5dcbc 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -834,16 +834,17 @@ void stage2_unmap_vm(struct kvm *kvm)
> >> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> >> {
> >> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> >> - struct kvm_pgtable *pgt = NULL;
> >> + struct kvm_pgtable *pgt;
> >>
> >> write_lock(&kvm->mmu_lock);
> >> + pgt = mmu->pgt;
> >> if (kvm_is_realm(kvm) &&
> >> kvm_realm_state(kvm) != REALM_STATE_DYING) {
> >> - /* TODO: teardown rtts */
> >> write_unlock(&kvm->mmu_lock);
> >> + kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
> >> + pgt->start_level);
> >> return;
> >> }
> >> - pgt = mmu->pgt;
> >> if (pgt) {
> >> mmu->pgd_phys = 0;
> >> mmu->pgt = NULL;
> >> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> >> index 0c9d70e4d9e6..f7b0e5a779f8 100644
> >> --- a/arch/arm64/kvm/rme.c
> >> +++ b/arch/arm64/kvm/rme.c
> >> @@ -73,6 +73,28 @@ static int rmi_check_version(void)
> >> return 0;
> >> }
> >>
> >> +static void realm_destroy_undelegate_range(struct realm *realm,
> >> + unsigned long ipa,
> >> + unsigned long addr,
> >> + ssize_t size)
> >> +{
> >> + unsigned long rd = virt_to_phys(realm->rd);
> >> + int ret;
> >> +
> >> + while (size > 0) {
> >> + ret = rmi_data_destroy(rd, ipa);
> >> + WARN_ON(ret);
> >> + ret = rmi_granule_undelegate(addr);
> >> +
> > As the return value is not documented, what will happen if a page undelegate
> > failed? Leaked? Some explanation is required here.
>
> Yes - it's leaked. I'll add a comment to explain the get_page() call.
>
> Thanks,
>
> Steve
>
> >> + if (ret)
> >> + get_page(phys_to_page(addr));
> >> +
> >> + addr += PAGE_SIZE;
> >> + ipa += PAGE_SIZE;
> >> + size -= PAGE_SIZE;
> >> + }
> >> +}
> >> +
> >> static unsigned long create_realm_feat_reg0(struct kvm *kvm)
> >> {
> >> unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> >> @@ -170,6 +192,123 @@ static int realm_create_rd(struct kvm *kvm)
> >> return r;
> >> }
> >>
> >> +static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
> >> + int level, phys_addr_t rtt_granule)
> >> +{
> >> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
> >> + return rmi_rtt_destroy(rtt_granule, virt_to_phys(realm->rd), addr,
> >> + level);
> >> +}
> >> +
> >> +static int realm_destroy_free_rtt(struct realm *realm, unsigned long addr,
> >> + int level, phys_addr_t rtt_granule)
> >> +{
> >> + if (realm_rtt_destroy(realm, addr, level, rtt_granule))
> >> + return -ENXIO;
> >> + if (!WARN_ON(rmi_granule_undelegate(rtt_granule)))
> >> + put_page(phys_to_page(rtt_granule));
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static int realm_rtt_create(struct realm *realm,
> >> + unsigned long addr,
> >> + int level,
> >> + phys_addr_t phys)
> >> +{
> >> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
> >> + return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
> >> +}
> >> +
> >> +static int realm_tear_down_rtt_range(struct realm *realm, int level,
> >> + unsigned long start, unsigned long end)
> >> +{
> >> + phys_addr_t rd = virt_to_phys(realm->rd);
> >> + ssize_t map_size = rme_rtt_level_mapsize(level);
> >> + unsigned long addr, next_addr;
> >> + bool failed = false;
> >> +
> >> + for (addr = start; addr < end; addr = next_addr) {
> >> + phys_addr_t rtt_addr, tmp_rtt;
> >> + struct rtt_entry rtt;
> >> + unsigned long end_addr;
> >> +
> >> + next_addr = ALIGN(addr + 1, map_size);
> >> +
> >> + end_addr = min(next_addr, end);
> >> +
> >> + if (rmi_rtt_read_entry(rd, ALIGN_DOWN(addr, map_size),
> >> + level, &rtt)) {
> >> + failed = true;
> >> + continue;
> >> + }
> >> +
> >> + rtt_addr = rmi_rtt_get_phys(&rtt);
> >> + WARN_ON(level != rtt.walk_level);
> >> +
> >> + switch (rtt.state) {
> >> + case RMI_UNASSIGNED:
> >> + case RMI_DESTROYED:
> >> + break;
> >> + case RMI_TABLE:
> >> + if (realm_tear_down_rtt_range(realm, level + 1,
> >> + addr, end_addr)) {
> >> + failed = true;
> >> + break;
> >> + }
> >> + if (IS_ALIGNED(addr, map_size) &&
> >> + next_addr <= end &&
> >> + realm_destroy_free_rtt(realm, addr, level + 1,
> >> + rtt_addr))
> >> + failed = true;
> >> + break;
> >> + case RMI_ASSIGNED:
> >> + WARN_ON(!rtt_addr);
> >> + /*
> >> + * If there is a block mapping, break it now, using the
> >> + * spare_page. We are sure to have a valid delegated
> >> + * page at spare_page before we enter here, otherwise
> >> + * WARN once, which will be followed by further
> >> + * warnings.
> >> + */
> >> + tmp_rtt = realm->spare_page;
> >> + if (level == 2 &&
> >> + !WARN_ON_ONCE(tmp_rtt == PHYS_ADDR_MAX) &&
> >> + realm_rtt_create(realm, addr,
> >> + RME_RTT_MAX_LEVEL, tmp_rtt)) {
> >> + WARN_ON(1);
> >> + failed = true;
> >> + break;
> >> + }
> >> + realm_destroy_undelegate_range(realm, addr,
> >> + rtt_addr, map_size);
> >> + /*
> >> + * Collapse the last level table and make the spare page
> >> + * reusable again.
> >> + */
> >> + if (level == 2 &&
> >> + realm_rtt_destroy(realm, addr, RME_RTT_MAX_LEVEL,
> >> + tmp_rtt))
> >> + failed = true;
> >> + break;
> >> + case RMI_VALID_NS:
> >> + WARN_ON(rmi_rtt_unmap_unprotected(rd, addr, level));
> >> + break;
> >> + default:
> >> + WARN_ON(1);
> >> + failed = true;
> >> + break;
> >> + }
> >> + }
> >> +
> >> + return failed ? -EINVAL : 0;
> >> +}
> >> +
> >> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
> >> +{
> >> + realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
> >> +}
> >> +
> >> /* Protects access to rme_vmid_bitmap */
> >> static DEFINE_SPINLOCK(rme_vmid_lock);
> >> static unsigned long *rme_vmid_bitmap;
> >
>


2023-03-04 12:47:09

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 10/28] arm64: RME: Allocate/free RECs to match vCPUs

On Fri, 3 Mar 2023 14:05:02 +0000
Steven Price <[email protected]> wrote:

> On 13/02/2023 18:08, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:14 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> The RMM maintains a data structure known as the Realm Execution Context
> >> (or REC). It is similar to struct kvm_vcpu and tracks the state of the
> >> virtual CPUs. KVM must delegate memory and request the structures are
> >> created when vCPUs are created, and suitably tear down on destruction.
> >>
> >
> > It would be better to leave some pointers to the spec here. It really saves
> > time for reviewers.
>
> Fair enough. I wasn't sure how often to repeat the link to the spec, but
> a few more times wouldn't hurt ;)

Based on my review experience, the right time is when a new concept first
appears in the series.

For example, the concept of a REC appears in this patch series for the first
time, so it would be nice to have the following in the comments:

1) A basic summary of the concept. A few sentences explaining what it is, what
it is used for, when it is required (mostly to show where it touches the
existing flows), and eventually how it plugs into those flows. That is enough
for people who don't have time to dig into the concept itself but want to
review its interaction with the component they are familiar with or the area
they are working on.

2) A single sentence giving the spec name and chapter. That is enough for
people who would like to dig deeper, and it is also a good opportunity to point
interested readers at the details of the concept. A rough illustration of the
suggested style is sketched below.
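
For instance, for the REC structure something like this (the field descriptions
and the chapter reference are my guesses, not taken from the series):

/**
 * struct realm_rec - Realm Execution Context, the RMM-side analogue of a vCPU.
 *                    See the RMM specification (DEN0137), chapter on Realm
 *                    Execution Contexts.
 * @mpidr:     MPIDR of the vCPU backed by this REC; INVALID_HWID until created.
 * @rec_page:  page delegated to the RMM to hold the REC itself.
 * @aux_pages: auxiliary granules the RMM requires for this REC.
 * @run:       shared (undelegated) RecRun page used to pass REC entry/exit data.
 */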

>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/include/asm/kvm_emulate.h | 2 +
> >> arch/arm64/include/asm/kvm_host.h | 3 +
> >> arch/arm64/include/asm/kvm_rme.h | 10 ++
> >> arch/arm64/kvm/arm.c | 1 +
> >> arch/arm64/kvm/reset.c | 11 ++
> >> arch/arm64/kvm/rme.c | 144 +++++++++++++++++++++++++++
> >> 6 files changed, 171 insertions(+)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> >> index 5a2b7229e83f..285e62914ca4 100644
> >> --- a/arch/arm64/include/asm/kvm_emulate.h
> >> +++ b/arch/arm64/include/asm/kvm_emulate.h
> >> @@ -504,6 +504,8 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> >>
> >> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> >> {
> >> + if (static_branch_unlikely(&kvm_rme_is_available))
> >> + return vcpu->arch.rec.mpidr != INVALID_HWID;
> >> return false;
> >> }
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> >> index 04347c3a8c6b..ef497b718cdb 100644
> >> --- a/arch/arm64/include/asm/kvm_host.h
> >> +++ b/arch/arm64/include/asm/kvm_host.h
> >> @@ -505,6 +505,9 @@ struct kvm_vcpu_arch {
> >> u64 last_steal;
> >> gpa_t base;
> >> } steal;
> >> +
> >> + /* Realm meta data */
> >> + struct rec rec;
> >
> > I think the name of the data structure "rec" needs a prefix; it is too common
> > and might conflict with private data structures in other modules. Maybe
> > rme_rec or realm_rec?
>
> struct realm_rec seems like a good choice. I agree 'rec' without context
> is somewhat ambiguous.
>
> >> };
> >>
> >> /*
> >> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> >> index eea5118dfa8a..4b219ebe1400 100644
> >> --- a/arch/arm64/include/asm/kvm_rme.h
> >> +++ b/arch/arm64/include/asm/kvm_rme.h
> >> @@ -6,6 +6,7 @@
> >> #ifndef __ASM_KVM_RME_H
> >> #define __ASM_KVM_RME_H
> >>
> >> +#include <asm/rmi_smc.h>
> >> #include <uapi/linux/kvm.h>
> >>
> >> enum realm_state {
> >> @@ -29,6 +30,13 @@ struct realm {
> >> unsigned int ia_bits;
> >> };
> >>
> >> +struct rec {
> >> + unsigned long mpidr;
> >> + void *rec_page;
> >> + struct page *aux_pages[REC_PARAMS_AUX_GRANULES];
> >> + struct rec_run *run;
> >> +};
> >> +
> >
> > It would be better to add some comments for the members above, or pointers to
> > the spec; that would save a lot of time in review.
>
> Will add comments.
>
> >> int kvm_init_rme(void);
> >> u32 kvm_realm_ipa_limit(void);
> >>
> >> @@ -36,6 +44,8 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> >> int kvm_init_realm_vm(struct kvm *kvm);
> >> void kvm_destroy_realm(struct kvm *kvm);
> >> void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
> >> +int kvm_create_rec(struct kvm_vcpu *vcpu);
> >> +void kvm_destroy_rec(struct kvm_vcpu *vcpu);
> >>
> >> #define RME_RTT_BLOCK_LEVEL 2
> >> #define RME_RTT_MAX_LEVEL 3
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index badd775547b8..52affed2f3cf 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -373,6 +373,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >> /* Force users to call KVM_ARM_VCPU_INIT */
> >> vcpu->arch.target = -1;
> >> bitmap_zero(vcpu->arch.features, KVM_VCPU_MAX_FEATURES);
> >> + vcpu->arch.rec.mpidr = INVALID_HWID;
> >>
> >> vcpu->arch.mmu_page_cache.gfp_zero = __GFP_ZERO;
> >>
> >> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> >> index 9e71d69e051f..0c84392a4bf2 100644
> >> --- a/arch/arm64/kvm/reset.c
> >> +++ b/arch/arm64/kvm/reset.c
> >> @@ -135,6 +135,11 @@ int kvm_arm_vcpu_finalize(struct kvm_vcpu *vcpu, int feature)
> >> return -EPERM;
> >>
> >> return kvm_vcpu_finalize_sve(vcpu);
> >> + case KVM_ARM_VCPU_REC:
> >> + if (!kvm_is_realm(vcpu->kvm))
> >> + return -EINVAL;
> >> +
> >> + return kvm_create_rec(vcpu);
> >> }
> >>
> >> return -EINVAL;
> >> @@ -145,6 +150,11 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu)
> >> if (vcpu_has_sve(vcpu) && !kvm_arm_vcpu_sve_finalized(vcpu))
> >> return false;
> >>
> >> + if (kvm_is_realm(vcpu->kvm) &&
> >> + !(vcpu_is_rec(vcpu) &&
> >> + READ_ONCE(vcpu->kvm->arch.realm.state) == REALM_STATE_ACTIVE))
> >> + return false;
> >
> > That's why it would be better to introduce the realm states in the earlier
> > patches, so that people already have an idea of the states by this stage.
> >
> >> +
> >> return true;
> >> }
> >>
> >> @@ -157,6 +167,7 @@ void kvm_arm_vcpu_destroy(struct kvm_vcpu *vcpu)
> >> if (sve_state)
> >> kvm_unshare_hyp(sve_state, sve_state + vcpu_sve_state_size(vcpu));
> >> kfree(sve_state);
> >> + kvm_destroy_rec(vcpu);
> >> }
> >>
> >> static void kvm_vcpu_reset_sve(struct kvm_vcpu *vcpu)
> >> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> >> index f7b0e5a779f8..d79ed889ca4d 100644
> >> --- a/arch/arm64/kvm/rme.c
> >> +++ b/arch/arm64/kvm/rme.c
> >> @@ -514,6 +514,150 @@ void kvm_destroy_realm(struct kvm *kvm)
> >> kvm_free_stage2_pgd(&kvm->arch.mmu);
> >> }
> >>
> >> +static void free_rec_aux(struct page **aux_pages,
> >> + unsigned int num_aux)
> >> +{
> >> + unsigned int i;
> >> +
> >> + for (i = 0; i < num_aux; i++) {
> >> + phys_addr_t aux_page_phys = page_to_phys(aux_pages[i]);
> >> +
> >> + if (WARN_ON(rmi_granule_undelegate(aux_page_phys)))
> >> + continue;
> >> +
> >> + __free_page(aux_pages[i]);
> >> + }
> >> +}
> >> +
> >> +static int alloc_rec_aux(struct page **aux_pages,
> >> + u64 *aux_phys_pages,
> >> + unsigned int num_aux)
> >> +{
> >> + int ret;
> >> + unsigned int i;
> >> +
> >> + for (i = 0; i < num_aux; i++) {
> >> + struct page *aux_page;
> >> + phys_addr_t aux_page_phys;
> >> +
> >> + aux_page = alloc_page(GFP_KERNEL);
> >> + if (!aux_page) {
> >> + ret = -ENOMEM;
> >> + goto out_err;
> >> + }
> >> + aux_page_phys = page_to_phys(aux_page);
> >> + if (rmi_granule_delegate(aux_page_phys)) {
> >> + __free_page(aux_page);
> >> + ret = -ENXIO;
> >> + goto out_err;
> >> + }
> >> + aux_pages[i] = aux_page;
> >> + aux_phys_pages[i] = aux_page_phys;
> >> + }
> >> +
> >> + return 0;
> >> +out_err:
> >> + free_rec_aux(aux_pages, i);
> >> + return ret;
> >> +}
> >> +
> >> +int kvm_create_rec(struct kvm_vcpu *vcpu)
> >> +{
> >> + struct user_pt_regs *vcpu_regs = vcpu_gp_regs(vcpu);
> >> + unsigned long mpidr = kvm_vcpu_get_mpidr_aff(vcpu);
> >> + struct realm *realm = &vcpu->kvm->arch.realm;
> >> + struct rec *rec = &vcpu->arch.rec;
> >> + unsigned long rec_page_phys;
> >> + struct rec_params *params;
> >> + int r, i;
> >> +
> >> + if (kvm_realm_state(vcpu->kvm) != REALM_STATE_NEW)
> >> + return -ENOENT;
> >> +
> >> + /*
> >> + * The RMM will report PSCI v1.0 to Realms and the KVM_ARM_VCPU_PSCI_0_2
> >> + * flag covers v0.2 and onwards.
> >> + */
> >> + if (!test_bit(KVM_ARM_VCPU_PSCI_0_2, vcpu->arch.features))
> >> + return -EINVAL;
> >> +
> >> + BUILD_BUG_ON(sizeof(*params) > PAGE_SIZE);
> >> + BUILD_BUG_ON(sizeof(*rec->run) > PAGE_SIZE);
> >> +
> >> + params = (struct rec_params *)get_zeroed_page(GFP_KERNEL);
> >> + rec->rec_page = (void *)__get_free_page(GFP_KERNEL);
> >> + rec->run = (void *)get_zeroed_page(GFP_KERNEL);
> >> + if (!params || !rec->rec_page || !rec->run) {
> >> + r = -ENOMEM;
> >> + goto out_free_pages;
> >> + }
> >> +
> >> + for (i = 0; i < ARRAY_SIZE(params->gprs); i++)
> >> + params->gprs[i] = vcpu_regs->regs[i];
> >> +
> >> + params->pc = vcpu_regs->pc;
> >> +
> >> + if (vcpu->vcpu_id == 0)
> >> + params->flags |= REC_PARAMS_FLAG_RUNNABLE;
> >> +
> >> + rec_page_phys = virt_to_phys(rec->rec_page);
> >> +
> >> + if (rmi_granule_delegate(rec_page_phys)) {
> >> + r = -ENXIO;
> >> + goto out_free_pages;
> >> + }
> >> +
> >
> > Wouldn't it be better to extend alloc_rec_aux() to allocate and delegate the
> > pages above? That way you could save some allocations and
> > rmi_granule_delegate() calls.
>
> I don't think it's really an improvement. There's only the one
> rmi_granule_delegate() call (for the REC page itself). The RecParams and
> RecRun pages are not delegated because they are shared with the host. It
> would also hide the structure setup within the new
> alloc_rec_aux_and_rec() function.
>

It would make things clearer.

I was thinking it would be nice to abstract the alloc + delegate sequence into
a common helper, so that alloc_rec_aux() and kvm_create_rec() could both use
it.
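
A minimal sketch of what I mean (the helper name and exact error handling are
just my guess, not something from the series):

static struct page *alloc_delegated_page(void)
{
        struct page *page = alloc_page(GFP_KERNEL);

        if (!page)
                return NULL;

        /* Hand the granule to the RMM; undo the allocation on failure */
        if (rmi_granule_delegate(page_to_phys(page))) {
                __free_page(page);
                return NULL;
        }

        return page;
}

alloc_rec_aux() could then call this in its loop, and the REC page setup could
use it directly instead of open-coding the delegate.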

> >> + r = alloc_rec_aux(rec->aux_pages, params->aux, realm->num_aux);
> >> + if (r)
> >> + goto out_undelegate_rmm_rec;
> >> +
> >> + params->num_rec_aux = realm->num_aux;
> >> + params->mpidr = mpidr;
> >> +
> >> + if (rmi_rec_create(rec_page_phys,
> >> + virt_to_phys(realm->rd),
> >> + virt_to_phys(params))) {
> >> + r = -ENXIO;
> >> + goto out_free_rec_aux;
> >> + }
> >> +
> >> + rec->mpidr = mpidr;
> >> +
> >> + free_page((unsigned long)params);
> >> + return 0;
> >> +
> >> +out_free_rec_aux:
> >> + free_rec_aux(rec->aux_pages, realm->num_aux);
> >> +out_undelegate_rmm_rec:
> >> + if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
> >> + rec->rec_page = NULL;
> >> +out_free_pages:
> >> + free_page((unsigned long)rec->run);
> >> + free_page((unsigned long)rec->rec_page);
> >> + free_page((unsigned long)params);
> >> + return r;
> >> +}
> >> +
> >> +void kvm_destroy_rec(struct kvm_vcpu *vcpu)
> >> +{
> >> + struct realm *realm = &vcpu->kvm->arch.realm;
> >> + struct rec *rec = &vcpu->arch.rec;
> >> + unsigned long rec_page_phys;
> >> +
> >> + if (!vcpu_is_rec(vcpu))
> >> + return;
> >> +
> >> + rec_page_phys = virt_to_phys(rec->rec_page);
> >> +
> >> + if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
> >> + return;
> >> + if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
> >> + return;
> >> +
> >
> > The two returns above feel off. What is the reason for skipping the page
> > undelegate calls below?
>
> The reason is the usual one: if we fail to undelegate then the pages have to
> be leaked. I'll add some comments. However, it does look like I've got the
> order wrong here: if the undelegate fails for rec_page_phys it's possible
> that we might still be able to free the rec_aux pages (although something
> has gone terribly wrong for that to be the case).
>
> I'll change the order to:
>
> /* If the REC destroy fails, leak all pages relating to the REC */
> if (WARN_ON(rmi_rec_destroy(rec_page_phys)))
> return;
>
> free_rec_aux(rec->aux_pages, realm->num_aux);
>
> /* If the undelegate fails then leak the REC page */
> if (WARN_ON(rmi_granule_undelegate(rec_page_phys)))
> return;
>
> free_page((unsigned long)rec->rec_page);
>
> If the rmi_rec_destroy() call has failed then the RMM should prevent the
> undelegate so there's little point trying.
>
> Steve
>
> >> + free_rec_aux(rec->aux_pages, realm->num_aux);
> >> + free_page((unsigned long)rec->rec_page);
> >> +}
> >> +
> >> int kvm_init_realm_vm(struct kvm *kvm)
> >> {
> >> struct realm_params *params;
> >
>


2023-03-06 15:40:39

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 15/28] KVM: arm64: Handle realm MMIO emulation

On Fri, 27 Jan 2023 11:29:19 +0000
Steven Price <[email protected]> wrote:

> MMIO emulation for a realm cannot be done directly with the VM's
> registers as they are protected from the host. However the RMM interface
> provides a structure member for providing the read/written value and

More details would be better for helping the review. I can only see that the
emulated MMIO value from the device model (kvmtool or kvm_io_bus) is put into
GPRS[0] of the RecEntry object, but the rest of the flow is missing.

I guess the RMM copies the value in RecEntry.GPRS[0] to the target GPR in the
guest context during RMI_REC_ENTER when it sees RMI_EMULATED_MMIO. This covers
the guest MMIO read path.

How about the MMIO write path? I don't see where RecExit.GPRS[0] is loaded
into a variable and returned to userspace.

> we can transfer this to the appropriate VCPU's register entry and then
> depend on the generic MMIO handling code in KVM.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/kvm/mmio.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> index 3dd38a151d2a..c4879fa3a8d3 100644
> --- a/arch/arm64/kvm/mmio.c
> +++ b/arch/arm64/kvm/mmio.c
> @@ -6,6 +6,7 @@
>
> #include <linux/kvm_host.h>
> #include <asm/kvm_emulate.h>
> +#include <asm/rmi_smc.h>
> #include <trace/events/kvm.h>
>
> #include "trace.h"
> @@ -109,6 +110,9 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
> &data);
> data = vcpu_data_host_to_guest(vcpu, data, len);
> vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
> +
> + if (vcpu_is_rec(vcpu))
> + vcpu->arch.rec.run->entry.gprs[0] = data;

I think the guest context is maintained by the RMM (while KVM can only touch
the Rec{Entry, Exit} objects) so that the guest context in the legacy VHE mode
is unused.

If yes, I guess this should be:

if (unlikely(vcpu_is_rec(vcpu)))
vcpu->arch.rec.run->entry.gprs[0] = data;
else
vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);

> }
>
> /*
> @@ -179,6 +183,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> run->mmio.len = len;
> vcpu->mmio_needed = 1;
>
> + if (vcpu_is_rec(vcpu))
> + vcpu->arch.rec.run->entry.flags |= RMI_EMULATED_MMIO;
> +

Wouldn't it be better to set this in kvm_handle_mmio_return(), where the MMIO
read emulation is known to have succeeded?

> if (!ret) {
> /* We handled the access successfully in the kernel. */
> if (!is_write)


2023-03-06 17:36:36

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 16/28] arm64: RME: Allow populating initial contents

On Fri, 27 Jan 2023 11:29:20 +0000
Steven Price <[email protected]> wrote:

> The VMM needs to populate the realm with some data before starting (e.g.
> a kernel and initrd). This is measured by the RMM and used as part of
> the attestation later on.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/kvm/rme.c | 366 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 366 insertions(+)
>
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index 16e0bfea98b1..3405b43e1421 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -4,6 +4,7 @@
> */
>
> #include <linux/kvm_host.h>
> +#include <linux/hugetlb.h>
>
> #include <asm/kvm_emulate.h>
> #include <asm/kvm_mmu.h>
> @@ -426,6 +427,359 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
> }
> }
>
> +static int realm_create_protected_data_page(struct realm *realm,
> + unsigned long ipa,
> + struct page *dst_page,
> + struct page *tmp_page)
> +{
> + phys_addr_t dst_phys, tmp_phys;
> + int ret;
> +
> + copy_page(page_address(tmp_page), page_address(dst_page));
> +
> + dst_phys = page_to_phys(dst_page);
> + tmp_phys = page_to_phys(tmp_page);
> +
> + if (rmi_granule_delegate(dst_phys))
> + return -ENXIO;
> +
> + ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa, tmp_phys,
> + RMI_MEASURE_CONTENT);
> +
> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + /* Create missing RTTs and retry */
> + int level = RMI_RETURN_INDEX(ret);
> +
> + ret = realm_create_rtt_levels(realm, ipa, level,
> + RME_RTT_MAX_LEVEL, NULL);
> + if (ret)
> + goto err;
> +
> + ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa,
> + tmp_phys, RMI_MEASURE_CONTENT);
> + }
> +
> + if (ret)
> + goto err;
> +
> + return 0;
> +
> +err:
> + if (WARN_ON(rmi_granule_undelegate(dst_phys))) {
> + /* Page can't be returned to NS world so is lost */
> + get_page(dst_page);
> + }
> + return -ENXIO;
> +}
> +
> +static int fold_rtt(phys_addr_t rd, unsigned long addr, int level,
> + struct realm *realm)
> +{
> + struct rtt_entry rtt;
> + phys_addr_t rtt_addr;
> +
> + if (rmi_rtt_read_entry(rd, addr, level, &rtt))
> + return -ENXIO;
> +
> + if (rtt.state != RMI_TABLE)
> + return -EINVAL;
> +
> + rtt_addr = rmi_rtt_get_phys(&rtt);
> + if (rmi_rtt_fold(rtt_addr, rd, addr, level + 1))
> + return -ENXIO;
> +
> + free_delegated_page(realm, rtt_addr);
> +
> + return 0;
> +}
> +
> +int realm_map_protected(struct realm *realm,
> + unsigned long hva,
> + unsigned long base_ipa,
> + struct page *dst_page,
> + unsigned long map_size,
> + struct kvm_mmu_memory_cache *memcache)
> +{
> + phys_addr_t dst_phys = page_to_phys(dst_page);
> + phys_addr_t rd = virt_to_phys(realm->rd);
> + unsigned long phys = dst_phys;
> + unsigned long ipa = base_ipa;
> + unsigned long size;
> + int map_level;
> + int ret = 0;
> +
> + if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
> + return -EINVAL;
> +
> + switch (map_size) {
> + case PAGE_SIZE:
> + map_level = 3;
> + break;
> + case RME_L2_BLOCK_SIZE:
> + map_level = 2;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + if (map_level < RME_RTT_MAX_LEVEL) {
> + /*
> + * A temporary RTT is needed during the map, precreate it,
> + * however if there is an error (e.g. missing parent tables)
> + * this will be handled below.
> + */
> + realm_create_rtt_levels(realm, ipa, map_level,
> + RME_RTT_MAX_LEVEL, memcache);
> + }
> +
> + for (size = 0; size < map_size; size += PAGE_SIZE) {
> + if (rmi_granule_delegate(phys)) {
> + struct rtt_entry rtt;
> +
> + /*
> + * It's possible we raced with another VCPU on the same
> + * fault. If the entry exists and matches then exit
> + * early and assume the other VCPU will handle the
> + * mapping.
> + */
> + if (rmi_rtt_read_entry(rd, ipa, RME_RTT_MAX_LEVEL, &rtt))
> + goto err;
> +
> + // FIXME: For a block mapping this could race at level
> + // 2 or 3...
> + if (WARN_ON((rtt.walk_level != RME_RTT_MAX_LEVEL ||
> + rtt.state != RMI_ASSIGNED ||
> + rtt.desc != phys))) {
> + goto err;
> + }
> +
> + return 0;
> + }
> +
> + ret = rmi_data_create_unknown(phys, rd, ipa);
> +
> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + /* Create missing RTTs and retry */
> + int level = RMI_RETURN_INDEX(ret);
> +
> + ret = realm_create_rtt_levels(realm, ipa, level,
> + RME_RTT_MAX_LEVEL,
> + memcache);
> + WARN_ON(ret);
> + if (ret)
> + goto err_undelegate;
> +
> + ret = rmi_data_create_unknown(phys, rd, ipa);
> + }
> + WARN_ON(ret);
> +
> + if (ret)
> + goto err_undelegate;
> +
> + phys += PAGE_SIZE;
> + ipa += PAGE_SIZE;
> + }
> +
> + if (map_size == RME_L2_BLOCK_SIZE)
> + ret = fold_rtt(rd, base_ipa, map_level, realm);
> + if (WARN_ON(ret))
> + goto err;
> +
> + return 0;
> +
> +err_undelegate:
> + if (WARN_ON(rmi_granule_undelegate(phys))) {
> + /* Page can't be returned to NS world so is lost */
> + get_page(phys_to_page(phys));
> + }
> +err:
> + while (size > 0) {
> + phys -= PAGE_SIZE;
> + size -= PAGE_SIZE;
> + ipa -= PAGE_SIZE;
> +
> + rmi_data_destroy(rd, ipa);
> +
> + if (WARN_ON(rmi_granule_undelegate(phys))) {
> + /* Page can't be returned to NS world so is lost */
> + get_page(phys_to_page(phys));
> + }
> + }
> + return -ENXIO;
> +}
> +

There seems to be no caller of the function above. It would be better to move
it to the related patch.

> +static int populate_par_region(struct kvm *kvm,
> + phys_addr_t ipa_base,
> + phys_addr_t ipa_end)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct kvm_memory_slot *memslot;
> + gfn_t base_gfn, end_gfn;
> + int idx;
> + phys_addr_t ipa;
> + int ret = 0;
> + struct page *tmp_page;
> + phys_addr_t rd = virt_to_phys(realm->rd);
> +
> + base_gfn = gpa_to_gfn(ipa_base);
> + end_gfn = gpa_to_gfn(ipa_end);
> +
> + idx = srcu_read_lock(&kvm->srcu);
> + memslot = gfn_to_memslot(kvm, base_gfn);
> + if (!memslot) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + /* We require the region to be contained within a single memslot */
> + if (memslot->base_gfn + memslot->npages < end_gfn) {
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + tmp_page = alloc_page(GFP_KERNEL);
> + if (!tmp_page) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + mmap_read_lock(current->mm);
> +
> + ipa = ipa_base;
> +
> + while (ipa < ipa_end) {
> + struct vm_area_struct *vma;
> + unsigned long map_size;
> + unsigned int vma_shift;
> + unsigned long offset;
> + unsigned long hva;
> + struct page *page;
> + kvm_pfn_t pfn;
> + int level;
> +
> + hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa));
> + vma = vma_lookup(current->mm, hva);
> + if (!vma) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + if (is_vm_hugetlb_page(vma))
> + vma_shift = huge_page_shift(hstate_vma(vma));
> + else
> + vma_shift = PAGE_SHIFT;
> +
> + map_size = 1 << vma_shift;
> +
> + /*
> + * FIXME: This causes over mapping, but there's no good
> + * solution here with the ABI as it stands
> + */
> + ipa = ALIGN_DOWN(ipa, map_size);
> +
> + switch (map_size) {
> + case RME_L2_BLOCK_SIZE:
> + level = 2;
> + break;
> + case PAGE_SIZE:
> + level = 3;
> + break;
> + default:
> + WARN_ONCE(1, "Unsupport vma_shift %d", vma_shift);
> + ret = -EFAULT;
> + break;
> + }
> +
> + pfn = gfn_to_pfn_memslot(memslot, gpa_to_gfn(ipa));
> +
> + if (is_error_pfn(pfn)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + ret = rmi_rtt_init_ripas(rd, ipa, level);
> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + ret = realm_create_rtt_levels(realm, ipa,
> + RMI_RETURN_INDEX(ret),
> + level, NULL);
> + if (ret)
> + break;
> + ret = rmi_rtt_init_ripas(rd, ipa, level);
> + if (ret) {
> + ret = -ENXIO;
> + break;
> + }
> + }
> +
> + if (level < RME_RTT_MAX_LEVEL) {
> + /*
> + * A temporary RTT is needed during the map, precreate
> + * it, however if there is an error (e.g. missing
> + * parent tables) this will be handled in the
> + * realm_create_protected_data_page() call.
> + */
> + realm_create_rtt_levels(realm, ipa, level,
> + RME_RTT_MAX_LEVEL, NULL);
> + }
> +
> + page = pfn_to_page(pfn);
> +
> + for (offset = 0; offset < map_size && !ret;
> + offset += PAGE_SIZE, page++) {
> + phys_addr_t page_ipa = ipa + offset;
> +
> + ret = realm_create_protected_data_page(realm, page_ipa,
> + page, tmp_page);
> + }
> + if (ret)
> + goto err_release_pfn;
> +
> + if (level == 2) {
> + ret = fold_rtt(rd, ipa, level, realm);
> + if (ret)
> + goto err_release_pfn;
> + }
> +
> + ipa += map_size;

> + kvm_set_pfn_accessed(pfn);
> + kvm_set_pfn_dirty(pfn);

kvm_release_pfn_dirty() already calls kvm_set_pfn_{accessed, dirty}(), so the
explicit calls above are redundant.

> + kvm_release_pfn_dirty(pfn);
> +err_release_pfn:
> + if (ret) {
> + kvm_release_pfn_clean(pfn);
> + break;
> + }
> + }
> +
> + mmap_read_unlock(current->mm);
> + __free_page(tmp_page);
> +
> +out:
> + srcu_read_unlock(&kvm->srcu, idx);
> + return ret;
> +}
> +
> +static int kvm_populate_realm(struct kvm *kvm,
> + struct kvm_cap_arm_rme_populate_realm_args *args)
> +{
> + phys_addr_t ipa_base, ipa_end;
> +

Check kvm_is_realm(kvm) here or in kvm_realm_enable_cap().

> + if (kvm_realm_state(kvm) != REALM_STATE_NEW)
> + return -EBUSY;

Maybe -EINVAL? The realm hasn't been created (RMI_REALM_CREATE has not been
called yet), so userspace shouldn't reach this path.
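
i.e. something like this at the top of kvm_populate_realm() (a sketch combining
the two suggestions above, not code from the series):

        if (!kvm_is_realm(kvm))
                return -EINVAL;

        if (kvm_realm_state(kvm) != REALM_STATE_NEW)
                return -EINVAL;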

> +
> + if (!IS_ALIGNED(args->populate_ipa_base, PAGE_SIZE) ||
> + !IS_ALIGNED(args->populate_ipa_size, PAGE_SIZE))
> + return -EINVAL;
> +
> + ipa_base = args->populate_ipa_base;
> + ipa_end = ipa_base + args->populate_ipa_size;
> +
> + if (ipa_end < ipa_base)
> + return -EINVAL;
> +
> + return populate_par_region(kvm, ipa_base, ipa_end);
> +}
> +
> static int set_ipa_state(struct kvm_vcpu *vcpu,
> unsigned long ipa,
> unsigned long end,
> @@ -748,6 +1102,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> r = kvm_init_ipa_range_realm(kvm, &args);
> break;
> }
> + case KVM_CAP_ARM_RME_POPULATE_REALM: {
> + struct kvm_cap_arm_rme_populate_realm_args args;
> + void __user *argp = u64_to_user_ptr(cap->args[1]);
> +
> + if (copy_from_user(&args, argp, sizeof(args))) {
> + r = -EFAULT;
> + break;
> + }
> +
> + r = kvm_populate_realm(kvm, &args);
> + break;
> + }
> default:
> r = -EINVAL;
> break;


2023-03-06 18:21:32

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 17/28] arm64: RME: Runtime faulting of memory

On Fri, 27 Jan 2023 11:29:21 +0000
Steven Price <[email protected]> wrote:

> At runtime if the realm guest accesses memory which hasn't yet been
> mapped then KVM needs to either populate the region or fault the guest.
>
> For memory in the lower (protected) region of IPA a fresh page is
> provided to the RMM which will zero the contents. For memory in the
> upper (shared) region of IPA, the memory from the memslot is mapped
> into the realm VM non secure.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_emulate.h | 10 +++++
> arch/arm64/include/asm/kvm_rme.h | 12 ++++++
> arch/arm64/kvm/mmu.c | 64 +++++++++++++++++++++++++---
> arch/arm64/kvm/rme.c | 48 +++++++++++++++++++++
> 4 files changed, 128 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 285e62914ca4..3a71b3d2e10a 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -502,6 +502,16 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> return READ_ONCE(kvm->arch.realm.state);
> }
>
> +static inline gpa_t kvm_gpa_stolen_bits(struct kvm *kvm)
> +{
> + if (kvm_is_realm(kvm)) {
> + struct realm *realm = &kvm->arch.realm;
> +
> + return BIT(realm->ia_bits - 1);
> + }
> + return 0;
> +}
> +

"stolen" seems a little bit vague. Maybe "shared" bit would be better as
SEV-SNP has C bit and TDX has shared bit. It would be nice to align with
the common knowledge.

Also, it would be nice to change the name of gpa_stolen_mask accordingly.
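
To illustrate (a rough sketch, on the assumption that the top IPA bit keeps
selecting the unprotected half of the guest address space):

static inline bool kvm_gpa_is_shared(struct realm *realm, gpa_t gpa)
{
        /* Bit (ia_bits - 1) selects the unprotected (shared) IPA alias */
        return !!(gpa & BIT(realm->ia_bits - 1));
}

Naming it around "shared" would line up with how the equivalent bit is talked
about for TDX and SEV-SNP.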

> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> {
> if (static_branch_unlikely(&kvm_rme_is_available))
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index 9d1583c44a99..303e4a5e5704 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -50,6 +50,18 @@ void kvm_destroy_rec(struct kvm_vcpu *vcpu);
> int kvm_rec_enter(struct kvm_vcpu *vcpu);
> int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>
> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size);
> +int realm_map_protected(struct realm *realm,
> + unsigned long hva,
> + unsigned long base_ipa,
> + struct page *dst_page,
> + unsigned long map_size,
> + struct kvm_mmu_memory_cache *memcache);
> +int realm_map_non_secure(struct realm *realm,
> + unsigned long ipa,
> + struct page *page,
> + unsigned long map_size,
> + struct kvm_mmu_memory_cache *memcache);
> int realm_set_ipa_state(struct kvm_vcpu *vcpu,
> unsigned long addr, unsigned long end,
> unsigned long ripas);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index f29558c5dcbc..5417c273861b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -235,8 +235,13 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
>
> lockdep_assert_held_write(&kvm->mmu_lock);
> WARN_ON(size & ~PAGE_MASK);
> - WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
> - may_block));
> +
> + if (kvm_is_realm(kvm))
> + kvm_realm_unmap_range(kvm, start, size);
> + else
> + WARN_ON(stage2_apply_range(kvm, start, end,
> + kvm_pgtable_stage2_unmap,
> + may_block));
> }
>
> static void unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
> @@ -250,7 +255,11 @@ static void stage2_flush_memslot(struct kvm *kvm,
> phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
> phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
>
> - stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_flush);
> + if (kvm_is_realm(kvm))
> + kvm_realm_unmap_range(kvm, addr, end - addr);
> + else
> + stage2_apply_range_resched(kvm, addr, end,
> + kvm_pgtable_stage2_flush);
> }
>
> /**
> @@ -818,6 +827,10 @@ void stage2_unmap_vm(struct kvm *kvm)
> struct kvm_memory_slot *memslot;
> int idx, bkt;
>
> + /* For realms this is handled by the RMM so nothing to do here */
> + if (kvm_is_realm(kvm))
> + return;
> +
> idx = srcu_read_lock(&kvm->srcu);
> mmap_read_lock(current->mm);
> write_lock(&kvm->mmu_lock);
> @@ -840,6 +853,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> pgt = mmu->pgt;
> if (kvm_is_realm(kvm) &&
> kvm_realm_state(kvm) != REALM_STATE_DYING) {
> + unmap_stage2_range(mmu, 0, (~0ULL) & PAGE_MASK);
> write_unlock(&kvm->mmu_lock);
> kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
> pgt->start_level);
> @@ -1190,6 +1204,24 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> return vma->vm_flags & VM_MTE_ALLOWED;
> }
>
> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa, unsigned long hva,
> + kvm_pfn_t pfn, unsigned long map_size,
> + enum kvm_pgtable_prot prot,
> + struct kvm_mmu_memory_cache *memcache)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct page *page = pfn_to_page(pfn);
> +
> + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> + return -EFAULT;
> +
> + if (!realm_is_addr_protected(realm, ipa))
> + return realm_map_non_secure(realm, ipa, page, map_size,
> + memcache);
> +
> + return realm_map_protected(realm, hva, ipa, page, map_size, memcache);
> +}
> +
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -1210,9 +1242,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> unsigned long vma_pagesize, fault_granule;
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> struct kvm_pgtable *pgt;
> + gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
>
> fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> write_fault = kvm_is_write_fault(vcpu);
> +
> + /* Realms cannot map read-only */

Out of curiosity, why? It would be nice to have more explanation in the
comment.

> + if (vcpu_is_rec(vcpu))
> + write_fault = true;
> +
> exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> VM_BUG_ON(write_fault && exec_fault);
>
> @@ -1272,7 +1310,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
> fault_ipa &= ~(vma_pagesize - 1);
>
> - gfn = fault_ipa >> PAGE_SHIFT;
> + gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> mmap_read_unlock(current->mm);
>
> /*
> @@ -1345,7 +1383,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> * If we are not forced to use page mapping, check if we are
> * backed by a THP and thus use block mapping if possible.
> */
> - if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
> + /* FIXME: We shouldn't need to disable this for realms */
> + if (vma_pagesize == PAGE_SIZE && !(force_pte || device || kvm_is_realm(kvm))) {

Why do we have to disable this temporarily?

> if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
> vma_pagesize = fault_granule;
> else
> @@ -1382,6 +1421,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> */
> if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
> ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> + else if (kvm_is_realm(kvm))
> + ret = realm_map_ipa(kvm, fault_ipa, hva, pfn, vma_pagesize,
> + prot, memcache);
> else
> ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
> __pfn_to_phys(pfn), prot,
> @@ -1437,6 +1479,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> struct kvm_memory_slot *memslot;
> unsigned long hva;
> bool is_iabt, write_fault, writable;
> + gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
> gfn_t gfn;
> int ret, idx;
>
> @@ -1491,7 +1534,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>
> idx = srcu_read_lock(&vcpu->kvm->srcu);
>
> - gfn = fault_ipa >> PAGE_SHIFT;
> + gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> memslot = gfn_to_memslot(vcpu->kvm, gfn);
> hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> write_fault = kvm_is_write_fault(vcpu);
> @@ -1536,6 +1579,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> * of the page size.
> */
> fault_ipa |= kvm_vcpu_get_hfar(vcpu) & ((1 << 12) - 1);
> + fault_ipa &= ~gpa_stolen_mask;
> ret = io_mem_abort(vcpu, fault_ipa);
> goto out_unlock;
> }
> @@ -1617,6 +1661,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> if (!kvm->arch.mmu.pgt)
> return false;
>

Does the unprotected (shared) region of a realm support aging?

> + /* We don't support aging for Realms */
> + if (kvm_is_realm(kvm))
> + return true;
> +
> WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
>
> kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
> @@ -1630,6 +1678,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> if (!kvm->arch.mmu.pgt)
> return false;
>
> + /* We don't support aging for Realms */
> + if (kvm_is_realm(kvm))
> + return true;
> +
> return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
> range->start << PAGE_SHIFT);
> }
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index 3405b43e1421..3d46191798e5 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -608,6 +608,54 @@ int realm_map_protected(struct realm *realm,
> return -ENXIO;
> }
>
> +int realm_map_non_secure(struct realm *realm,
> + unsigned long ipa,
> + struct page *page,
> + unsigned long map_size,
> + struct kvm_mmu_memory_cache *memcache)
> +{
> + phys_addr_t rd = virt_to_phys(realm->rd);
> + int map_level;
> + int ret = 0;
> + unsigned long desc = page_to_phys(page) |
> + PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) |
> + /* FIXME: Read+Write permissions for now */
Why can't we handle the prot passed in from realm_map_ipa()? Work in progress? :)
> + (3 << 6) |
> + PTE_SHARED;
> +
> + if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
> + return -EINVAL;
> +
> + switch (map_size) {
> + case PAGE_SIZE:
> + map_level = 3;
> + break;
> + case RME_L2_BLOCK_SIZE:
> + map_level = 2;
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
> +
> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> + /* Create missing RTTs and retry */
> + int level = RMI_RETURN_INDEX(ret);
> +
> + ret = realm_create_rtt_levels(realm, ipa, level, map_level,
> + memcache);
> + if (WARN_ON(ret))
> + return -ENXIO;
> +
> + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
> + }
> + if (WARN_ON(ret))
> + return -ENXIO;
> +
> + return 0;
> +}
> +
> static int populate_par_region(struct kvm *kvm,
> phys_addr_t ipa_base,
> phys_addr_t ipa_end)


2023-03-06 19:10:24

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On Fri, 27 Jan 2023 11:29:10 +0000
Steven Price <[email protected]> wrote:

> Add the KVM_CAP_ARM_RME_CREATE_FD ioctl to create a realm. This involves
> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
> the base level of the Realm Translation Tables (RTT). A VMID also need
> to be picked, since the RMM has a separate VMID address space a
> dedicated allocator is added for this purpose.
>
> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
> before it is created.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 14 ++
> arch/arm64/kvm/arm.c | 19 ++
> arch/arm64/kvm/mmu.c | 6 +
> arch/arm64/kvm/reset.c | 33 +++
> arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
> 5 files changed, 429 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index c26bc2c6770d..055a22accc08 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -6,6 +6,8 @@
> #ifndef __ASM_KVM_RME_H
> #define __ASM_KVM_RME_H
>
> +#include <uapi/linux/kvm.h>
> +
> enum realm_state {
> REALM_STATE_NONE,
> REALM_STATE_NEW,
> @@ -15,8 +17,20 @@ enum realm_state {
>
> struct realm {
> enum realm_state state;
> +
> + void *rd;
> + struct realm_params *params;
> +
> + unsigned long num_aux;
> + unsigned int vmid;
> + unsigned int ia_bits;
> };
>
> int kvm_init_rme(void);
> +u32 kvm_realm_ipa_limit(void);
> +
> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> +int kvm_init_realm_vm(struct kvm *kvm);
> +void kvm_destroy_realm(struct kvm *kvm);
>
> #endif
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index d97b39d042ab..50f54a63732a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -103,6 +103,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> r = 0;
> set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
> break;
> + case KVM_CAP_ARM_RME:
> + if (!static_branch_unlikely(&kvm_rme_is_available))
> + return -EINVAL;
> + mutex_lock(&kvm->lock);
> + r = kvm_realm_enable_cap(kvm, cap);
> + mutex_unlock(&kvm->lock);
> + break;
> default:
> r = -EINVAL;
> break;
> @@ -172,6 +179,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> */
> kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit();
>
> + /* Initialise the realm bits after the generic bits are enabled */
> + if (kvm_is_realm(kvm)) {
> + ret = kvm_init_realm_vm(kvm);
> + if (ret)
> + goto err_free_cpumask;
> + }
> +
> return 0;
>
> err_free_cpumask:
> @@ -204,6 +218,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> kvm_destroy_vcpus(kvm);
>
> kvm_unshare_hyp(kvm, kvm + 1);
> +
> + kvm_destroy_realm(kvm);
> }
>
> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> @@ -300,6 +316,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ARM_PTRAUTH_GENERIC:
> r = system_has_full_ptr_auth();
> break;
> + case KVM_CAP_ARM_RME:
> + r = static_key_enabled(&kvm_rme_is_available);
> + break;
> default:
> r = 0;
> }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..d0f707767d05 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -840,6 +840,12 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> struct kvm_pgtable *pgt = NULL;
>
> write_lock(&kvm->mmu_lock);
> + if (kvm_is_realm(kvm) &&
> + kvm_realm_state(kvm) != REALM_STATE_DYING) {
> + /* TODO: teardown rtts */
> + write_unlock(&kvm->mmu_lock);
> + return;
> + }
> pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index e0267f672b8a..c165df174737 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -395,3 +395,36 @@ int kvm_set_ipa_limit(void)
>
> return 0;
> }
> +

The function below doesn't have a user in this patch. Also, it looks like it
is partly copied from kvm_init_stage2_mmu() in arch/arm64/kvm/mmu.c.

> +int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
> +{
> + u64 mmfr0, mmfr1;
> + u32 phys_shift;
> + u32 ipa_limit = kvm_ipa_limit;
> +
> + if (kvm_is_realm(kvm))
> + ipa_limit = kvm_realm_ipa_limit();
> +
> + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> + return -EINVAL;
> +
> + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> + if (phys_shift) {
> + if (phys_shift > ipa_limit ||
> + phys_shift < ARM64_MIN_PARANGE_BITS)
> + return -EINVAL;
> + } else {
> + phys_shift = KVM_PHYS_SHIFT;
> + if (phys_shift > ipa_limit) {
> + pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n",
> + current->comm);
> + return -EINVAL;
> + }
> + }
> +
> + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> + kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
> +
> + return 0;
> +}
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index f6b587bc116e..9f8c5a91b8fc 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -5,9 +5,49 @@
>
> #include <linux/kvm_host.h>
>
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
> #include <asm/rmi_cmds.h>
> #include <asm/virt.h>
>
> +/************ FIXME: Copied from kvm/hyp/pgtable.c **********/
> +#include <asm/kvm_pgtable.h>
> +
> +struct kvm_pgtable_walk_data {
> + struct kvm_pgtable *pgt;
> + struct kvm_pgtable_walker *walker;
> +
> + u64 addr;
> + u64 end;
> +};
> +
> +static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
> +{
> + u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
> + u64 mask = BIT(pgt->ia_bits) - 1;
> +
> + return (addr & mask) >> shift;
> +}
> +
> +static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
> +{
> + struct kvm_pgtable pgt = {
> + .ia_bits = ia_bits,
> + .start_level = start_level,
> + };
> +
> + return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
> +}
> +
> +/******************/
> +
> +static unsigned long rmm_feat_reg0;
> +
> +static bool rme_supports(unsigned long feature)
> +{
> + return !!u64_get_bits(rmm_feat_reg0, feature);
> +}
> +
> static int rmi_check_version(void)
> {
> struct arm_smccc_res res;
> @@ -33,8 +73,319 @@ static int rmi_check_version(void)
> return 0;
> }
>
> +static unsigned long create_realm_feat_reg0(struct kvm *kvm)
> +{
> + unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> + u64 feat_reg0 = 0;
> +
> + int num_bps = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_NUM_BPS);
> + int num_wps = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_NUM_WPS);
> +
> + feat_reg0 |= u64_encode_bits(ia_bits, RMI_FEATURE_REGISTER_0_S2SZ);
> + feat_reg0 |= u64_encode_bits(num_bps, RMI_FEATURE_REGISTER_0_NUM_BPS);
> + feat_reg0 |= u64_encode_bits(num_wps, RMI_FEATURE_REGISTER_0_NUM_WPS);
> +
> + return feat_reg0;
> +}
> +
> +u32 kvm_realm_ipa_limit(void)
> +{
> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
> +}
> +
> +static u32 get_start_level(struct kvm *kvm)
> +{
> + long sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, kvm->arch.vtcr);
> +
> + return VTCR_EL2_TGRAN_SL0_BASE - sl0;
> +}
> +
> +static int realm_create_rd(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct realm_params *params = realm->params;
> + void *rd = NULL;
> + phys_addr_t rd_phys, params_phys;
> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + unsigned int pgd_sz;
> + int i, r;
> +
> + if (WARN_ON(realm->rd) || WARN_ON(!realm->params))
> + return -EEXIST;
> +
> + rd = (void *)__get_free_page(GFP_KERNEL);
> + if (!rd)
> + return -ENOMEM;
> +
> + rd_phys = virt_to_phys(rd);
> + if (rmi_granule_delegate(rd_phys)) {
> + r = -ENXIO;
> + goto out;
> + }
> +
> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> + for (i = 0; i < pgd_sz; i++) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + if (rmi_granule_delegate(pgd_phys)) {
> + r = -ENXIO;
> + goto out_undelegate_tables;
> + }
> + }
> +
> + params->rtt_level_start = get_start_level(kvm);
> + params->rtt_num_start = pgd_sz;
> + params->rtt_base = kvm->arch.mmu.pgd_phys;
> + params->vmid = realm->vmid;
> +
> + params_phys = virt_to_phys(params);
> +
> + if (rmi_realm_create(rd_phys, params_phys)) {
> + r = -ENXIO;
> + goto out_undelegate_tables;
> + }
> +
> + realm->rd = rd;
> + realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> +
> + if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
> + WARN_ON(rmi_realm_destroy(rd_phys));
> + goto out_undelegate_tables;
> + }
> +
> + return 0;
> +
> +out_undelegate_tables:
> + while (--i >= 0) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + WARN_ON(rmi_granule_undelegate(pgd_phys));
> + }
> + WARN_ON(rmi_granule_undelegate(rd_phys));
> +out:
> + free_page((unsigned long)rd);
> + return r;
> +}
> +
> +/* Protects access to rme_vmid_bitmap */
> +static DEFINE_SPINLOCK(rme_vmid_lock);
> +static unsigned long *rme_vmid_bitmap;
> +
> +static int rme_vmid_init(void)
> +{
> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> +
> + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
> + if (!rme_vmid_bitmap) {
> + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +static int rme_vmid_reserve(void)
> +{
> + int ret;
> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> +
> + spin_lock(&rme_vmid_lock);
> + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
> + spin_unlock(&rme_vmid_lock);
> +
> + return ret;
> +}
> +
> +static void rme_vmid_release(unsigned int vmid)
> +{
> + spin_lock(&rme_vmid_lock);
> + bitmap_release_region(rme_vmid_bitmap, vmid, 0);
> + spin_unlock(&rme_vmid_lock);
> +}
> +
> +static int kvm_create_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + int ret;
> +
> + if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EEXIST;
> +
> + ret = rme_vmid_reserve();
> + if (ret < 0)
> + return ret;
> + realm->vmid = ret;
> +
> + ret = realm_create_rd(kvm);
> + if (ret) {
> + rme_vmid_release(realm->vmid);
> + return ret;
> + }
> +
> + WRITE_ONCE(realm->state, REALM_STATE_NEW);
> +
> + /* The realm is up, free the parameters. */
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> +
> + return 0;
> +}
> +
> +static int config_realm_hash_algo(struct realm *realm,
> + struct kvm_cap_arm_rme_config_item *cfg)
> +{
> + switch (cfg->hash_algo) {
> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
> + return -EINVAL;
> + break;
> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + realm->params->measurement_algo = cfg->hash_algo;
> + return 0;
> +}
> +
> +static int config_realm_sve(struct realm *realm,
> + struct kvm_cap_arm_rme_config_item *cfg)
> +{
> + u64 features_0 = realm->params->features_0;
> + int max_sve_vq = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> +
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
> + return -EINVAL;
> +
> + if (cfg->sve_vq > max_sve_vq)
> + return -EINVAL;
> +
> + features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> + features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
> + features_0 |= u64_encode_bits(cfg->sve_vq,
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> +
> + realm->params->features_0 = features_0;
> + return 0;
> +}
> +
> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + struct kvm_cap_arm_rme_config_item cfg;
> + struct realm *realm = &kvm->arch.realm;
> + int r = 0;
> +
> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EBUSY;
> +
> + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
> + return -EFAULT;
> +
> + switch (cfg.cfg) {
> + case KVM_CAP_ARM_RME_CFG_RPV:
> + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
> + break;
> + case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
> + r = config_realm_hash_algo(realm, &cfg);
> + break;
> + case KVM_CAP_ARM_RME_CFG_SVE:
> + r = config_realm_sve(realm, &cfg);
> + break;
> + default:
> + r = -EINVAL;
> + }
> +
> + return r;
> +}
> +
> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + int r = 0;
> +
> + switch (cap->args[0]) {
> + case KVM_CAP_ARM_RME_CONFIG_REALM:
> + r = kvm_rme_config_realm(kvm, cap);
> + break;
> + case KVM_CAP_ARM_RME_CREATE_RD:
> + if (kvm->created_vcpus) {
> + r = -EBUSY;
> + break;
> + }
> +
> + r = kvm_create_realm(kvm);
> + break;
> + default:
> + r = -EINVAL;
> + break;
> + }
> +
> + return r;
> +}
> +
> +void kvm_destroy_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + unsigned int pgd_sz;
> + int i;
> +
> + if (realm->params) {
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> + }
> +
> + if (kvm_realm_state(kvm) == REALM_STATE_NONE)
> + return;
> +
> + WRITE_ONCE(realm->state, REALM_STATE_DYING);
> +
> + rme_vmid_release(realm->vmid);
> +
> + if (realm->rd) {
> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> +
> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
> + return;
> + if (WARN_ON(rmi_granule_undelegate(rd_phys)))
> + return;
> + free_page((unsigned long)realm->rd);
> + realm->rd = NULL;
> + }
> +
> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> + for (i = 0; i < pgd_sz; i++) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
> + return;
> + }
> +
> + kvm_free_stage2_pgd(&kvm->arch.mmu);
> +}
> +
> +int kvm_init_realm_vm(struct kvm *kvm)
> +{
> + struct realm_params *params;
> +
> + params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
> + if (!params)
> + return -ENOMEM;
> +
> + params->features_0 = create_realm_feat_reg0(kvm);
> + kvm->arch.realm.params = params;
> + return 0;
> +}
> +
> int kvm_init_rme(void)
> {
> + int ret;
> +
> if (PAGE_SIZE != SZ_4K)
> /* Only 4k page size on the host is supported */
> return 0;
> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
> /* Continue without realm support */
> return 0;
>
> + ret = rme_vmid_init();
> + if (ret)
> + return ret;
> +
> + WARN_ON(rmi_features(0, &rmm_feat_reg0));
> +
> /* Future patch will enable static branch kvm_rme_is_available */
>
> return 0;


2023-03-10 15:54:37

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 15/28] KVM: arm64: Handle realm MMIO emulation

On 06/03/2023 15:37, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:19 +0000
> Steven Price <[email protected]> wrote:
>
>> MMIO emulation for a realm cannot be done directly with the VM's
>> registers as they are protected from the host. However the RMM interface
>> provides a structure member for providing the read/written value and
>
> More details would be better for helping the review. I can only see that the
> emulated MMIO value from the device model (kvmtool or kvm_io_bus) is put into
> GPRS[0] of the RecEntry object, but the rest of the flow is missing.

The commit message is out of date (sorry about that). A previous version
of the spec had a dedicated member for the read/write value, but this
was changed to just use GPRS[0] as you've spotted. I'll update the text.

> I guess the RMM copies the value in RecEntry.GPRS[0] to the target GPR in the
> guest context during RMI_REC_ENTER when it sees RMI_EMULATED_MMIO. This covers
> the guest MMIO read path.

Yes, when entering the guest after an (emulatable) read data abort the
value in GPRS[0] is loaded from the RecEntry structure into the
appropriate register for the guest.

> How about the MMIO write path? I don't see where RecExit.GPRS[0] is loaded
> into a variable and returned to userspace.

The RMM will populate GPRS[0] with the written value in this case (even
if another register was actually used in the instruction). We then
transfer that to the usual VCPU structure so that the normal fault
handling logic works.
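
Roughly, something like this on the exit path (a sketch only — the helper and
the exact rec_run exit field layout are from memory, not a quote from the
series):

/*
 * On a data-abort exit for an emulatable MMIO write, propagate the value the
 * RMM placed in RecExit GPRS[0] into the normal vCPU register state so the
 * generic io_mem_abort()/kvm_handle_mmio_return() path can pass it on to
 * userspace.
 */
static void realm_sync_mmio_write_value(struct kvm_vcpu *vcpu)
{
        unsigned long data = vcpu->arch.rec.run->exit.gprs[0];

        vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
}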

>> we can transfer this to the appropriate VCPU's register entry and then
>> depend on the generic MMIO handling code in KVM.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/kvm/mmio.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
>> index 3dd38a151d2a..c4879fa3a8d3 100644
>> --- a/arch/arm64/kvm/mmio.c
>> +++ b/arch/arm64/kvm/mmio.c
>> @@ -6,6 +6,7 @@
>>
>> #include <linux/kvm_host.h>
>> #include <asm/kvm_emulate.h>
>> +#include <asm/rmi_smc.h>
>> #include <trace/events/kvm.h>
>>
>> #include "trace.h"
>> @@ -109,6 +110,9 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
>> &data);
>> data = vcpu_data_host_to_guest(vcpu, data, len);
>> vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>> +
>> + if (vcpu_is_rec(vcpu))
>> + vcpu->arch.rec.run->entry.gprs[0] = data;
>
> I think the guest context is maintained by the RMM (while KVM can only touch
> the Rec{Entry, Exit} objects) so that the guest context in the legacy VHE mode
> is unused.
>
> If yes, I guess this should be:
>
> if (unlikely(vcpu_is_rec(vcpu)))
> vcpu->arch.rec.run->entry.gprs[0] = data;
> else
> vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);

Correct. Although there's no harm in also updating with vcpu_set_reg(), I'll
make the change because it's clearer.

>> }
>>
>> /*
>> @@ -179,6 +183,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>> run->mmio.len = len;
>> vcpu->mmio_needed = 1;
>>
>> + if (vcpu_is_rec(vcpu))
>> + vcpu->arch.rec.run->entry.flags |= RMI_EMULATED_MMIO;
>> +
>
> Wouldn't it be better to set this in kvm_handle_mmio_return(), where the MMIO
> read emulation is known to have succeeded?

Yes, that makes sense - I'll move this.
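
Something along these lines (just a sketch of the intended change; the helper
name is made up):

/*
 * Called from kvm_handle_mmio_return() once the read has actually been
 * emulated, rather than setting the flag unconditionally in io_mem_abort().
 */
static void realm_record_emulated_mmio_read(struct kvm_vcpu *vcpu,
                                            unsigned long data)
{
        vcpu->arch.rec.run->entry.gprs[0] = data;
        vcpu->arch.rec.run->entry.flags |= RMI_EMULATED_MMIO;
}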

Thanks,

Steve

>> if (!ret) {
>> /* We handled the access successfully in the kernel. */
>> if (!is_write)
>


2023-03-10 15:55:21

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 16/28] arm64: RME: Allow populating initial contents

On 06/03/2023 17:34, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:20 +0000
> Steven Price <[email protected]> wrote:
>
>> The VMM needs to populate the realm with some data before starting (e.g.
>> a kernel and initrd). This is measured by the RMM and used as part of
>> the attestation later on.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/kvm/rme.c | 366 +++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 366 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 16e0bfea98b1..3405b43e1421 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -4,6 +4,7 @@
>> */
>>
>> #include <linux/kvm_host.h>
>> +#include <linux/hugetlb.h>
>>
>> #include <asm/kvm_emulate.h>
>> #include <asm/kvm_mmu.h>
>> @@ -426,6 +427,359 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
>> }
>> }
>>
>> +static int realm_create_protected_data_page(struct realm *realm,
>> + unsigned long ipa,
>> + struct page *dst_page,
>> + struct page *tmp_page)
>> +{
>> + phys_addr_t dst_phys, tmp_phys;
>> + int ret;
>> +
>> + copy_page(page_address(tmp_page), page_address(dst_page));
>> +
>> + dst_phys = page_to_phys(dst_page);
>> + tmp_phys = page_to_phys(tmp_page);
>> +
>> + if (rmi_granule_delegate(dst_phys))
>> + return -ENXIO;
>> +
>> + ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa, tmp_phys,
>> + RMI_MEASURE_CONTENT);
>> +
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + ret = realm_create_rtt_levels(realm, ipa, level,
>> + RME_RTT_MAX_LEVEL, NULL);
>> + if (ret)
>> + goto err;
>> +
>> + ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa,
>> + tmp_phys, RMI_MEASURE_CONTENT);
>> + }
>> +
>> + if (ret)
>> + goto err;
>> +
>> + return 0;
>> +
>> +err:
>> + if (WARN_ON(rmi_granule_undelegate(dst_phys))) {
>> + /* Page can't be returned to NS world so is lost */
>> + get_page(dst_page);
>> + }
>> + return -ENXIO;
>> +}
>> +
>> +static int fold_rtt(phys_addr_t rd, unsigned long addr, int level,
>> + struct realm *realm)
>> +{
>> + struct rtt_entry rtt;
>> + phys_addr_t rtt_addr;
>> +
>> + if (rmi_rtt_read_entry(rd, addr, level, &rtt))
>> + return -ENXIO;
>> +
>> + if (rtt.state != RMI_TABLE)
>> + return -EINVAL;
>> +
>> + rtt_addr = rmi_rtt_get_phys(&rtt);
>> + if (rmi_rtt_fold(rtt_addr, rd, addr, level + 1))
>> + return -ENXIO;
>> +
>> + free_delegated_page(realm, rtt_addr);
>> +
>> + return 0;
>> +}
>> +
>> +int realm_map_protected(struct realm *realm,
>> + unsigned long hva,
>> + unsigned long base_ipa,
>> + struct page *dst_page,
>> + unsigned long map_size,
>> + struct kvm_mmu_memory_cache *memcache)
>> +{
>> + phys_addr_t dst_phys = page_to_phys(dst_page);
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + unsigned long phys = dst_phys;
>> + unsigned long ipa = base_ipa;
>> + unsigned long size;
>> + int map_level;
>> + int ret = 0;
>> +
>> + if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
>> + return -EINVAL;
>> +
>> + switch (map_size) {
>> + case PAGE_SIZE:
>> + map_level = 3;
>> + break;
>> + case RME_L2_BLOCK_SIZE:
>> + map_level = 2;
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> +
>> + if (map_level < RME_RTT_MAX_LEVEL) {
>> + /*
>> + * A temporary RTT is needed during the map, precreate it,
>> + * however if there is an error (e.g. missing parent tables)
>> + * this will be handled below.
>> + */
>> + realm_create_rtt_levels(realm, ipa, map_level,
>> + RME_RTT_MAX_LEVEL, memcache);
>> + }
>> +
>> + for (size = 0; size < map_size; size += PAGE_SIZE) {
>> + if (rmi_granule_delegate(phys)) {
>> + struct rtt_entry rtt;
>> +
>> + /*
>> + * It's possible we raced with another VCPU on the same
>> + * fault. If the entry exists and matches then exit
>> + * early and assume the other VCPU will handle the
>> + * mapping.
>> + */
>> + if (rmi_rtt_read_entry(rd, ipa, RME_RTT_MAX_LEVEL, &rtt))
>> + goto err;
>> +
>> + // FIXME: For a block mapping this could race at level
>> + // 2 or 3...
>> + if (WARN_ON((rtt.walk_level != RME_RTT_MAX_LEVEL ||
>> + rtt.state != RMI_ASSIGNED ||
>> + rtt.desc != phys))) {
>> + goto err;
>> + }
>> +
>> + return 0;
>> + }
>> +
>> + ret = rmi_data_create_unknown(phys, rd, ipa);
>> +
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + ret = realm_create_rtt_levels(realm, ipa, level,
>> + RME_RTT_MAX_LEVEL,
>> + memcache);
>> + WARN_ON(ret);
>> + if (ret)
>> + goto err_undelegate;
>> +
>> + ret = rmi_data_create_unknown(phys, rd, ipa);
>> + }
>> + WARN_ON(ret);
>> +
>> + if (ret)
>> + goto err_undelegate;
>> +
>> + phys += PAGE_SIZE;
>> + ipa += PAGE_SIZE;
>> + }
>> +
>> + if (map_size == RME_L2_BLOCK_SIZE)
>> + ret = fold_rtt(rd, base_ipa, map_level, realm);
>> + if (WARN_ON(ret))
>> + goto err;
>> +
>> + return 0;
>> +
>> +err_undelegate:
>> + if (WARN_ON(rmi_granule_undelegate(phys))) {
>> + /* Page can't be returned to NS world so is lost */
>> + get_page(phys_to_page(phys));
>> + }
>> +err:
>> + while (size > 0) {
>> + phys -= PAGE_SIZE;
>> + size -= PAGE_SIZE;
>> + ipa -= PAGE_SIZE;
>> +
>> + rmi_data_destroy(rd, ipa);
>> +
>> + if (WARN_ON(rmi_granule_undelegate(phys))) {
>> + /* Page can't be returned to NS world so is lost */
>> + get_page(phys_to_page(phys));
>> + }
>> + }
>> + return -ENXIO;
>> +}
>> +
>
> There seems no caller to the function above. Better move it to the related
> patch.

Indeed this should really be in the next patch - will move as it's very
confusing having it in this patch (sorry about that).

>> +static int populate_par_region(struct kvm *kvm,
>> + phys_addr_t ipa_base,
>> + phys_addr_t ipa_end)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + struct kvm_memory_slot *memslot;
>> + gfn_t base_gfn, end_gfn;
>> + int idx;
>> + phys_addr_t ipa;
>> + int ret = 0;
>> + struct page *tmp_page;
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> +
>> + base_gfn = gpa_to_gfn(ipa_base);
>> + end_gfn = gpa_to_gfn(ipa_end);
>> +
>> + idx = srcu_read_lock(&kvm->srcu);
>> + memslot = gfn_to_memslot(kvm, base_gfn);
>> + if (!memslot) {
>> + ret = -EFAULT;
>> + goto out;
>> + }
>> +
>> + /* We require the region to be contained within a single memslot */
>> + if (memslot->base_gfn + memslot->npages < end_gfn) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> +
>> + tmp_page = alloc_page(GFP_KERNEL);
>> + if (!tmp_page) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + mmap_read_lock(current->mm);
>> +
>> + ipa = ipa_base;
>> +
>> + while (ipa < ipa_end) {
>> + struct vm_area_struct *vma;
>> + unsigned long map_size;
>> + unsigned int vma_shift;
>> + unsigned long offset;
>> + unsigned long hva;
>> + struct page *page;
>> + kvm_pfn_t pfn;
>> + int level;
>> +
>> + hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa));
>> + vma = vma_lookup(current->mm, hva);
>> + if (!vma) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + if (is_vm_hugetlb_page(vma))
>> + vma_shift = huge_page_shift(hstate_vma(vma));
>> + else
>> + vma_shift = PAGE_SHIFT;
>> +
>> + map_size = 1 << vma_shift;
>> +
>> + /*
>> + * FIXME: This causes over mapping, but there's no good
>> + * solution here with the ABI as it stands
>> + */
>> + ipa = ALIGN_DOWN(ipa, map_size);
>> +
>> + switch (map_size) {
>> + case RME_L2_BLOCK_SIZE:
>> + level = 2;
>> + break;
>> + case PAGE_SIZE:
>> + level = 3;
>> + break;
>> + default:
>> + WARN_ONCE(1, "Unsupport vma_shift %d", vma_shift);
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + pfn = gfn_to_pfn_memslot(memslot, gpa_to_gfn(ipa));
>> +
>> + if (is_error_pfn(pfn)) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + ret = rmi_rtt_init_ripas(rd, ipa, level);
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + ret = realm_create_rtt_levels(realm, ipa,
>> + RMI_RETURN_INDEX(ret),
>> + level, NULL);
>> + if (ret)
>> + break;
>> + ret = rmi_rtt_init_ripas(rd, ipa, level);
>> + if (ret) {
>> + ret = -ENXIO;
>> + break;
>> + }
>> + }
>> +
>> + if (level < RME_RTT_MAX_LEVEL) {
>> + /*
>> + * A temporary RTT is needed during the map, precreate
>> + * it, however if there is an error (e.g. missing
>> + * parent tables) this will be handled in the
>> + * realm_create_protected_data_page() call.
>> + */
>> + realm_create_rtt_levels(realm, ipa, level,
>> + RME_RTT_MAX_LEVEL, NULL);
>> + }
>> +
>> + page = pfn_to_page(pfn);
>> +
>> + for (offset = 0; offset < map_size && !ret;
>> + offset += PAGE_SIZE, page++) {
>> + phys_addr_t page_ipa = ipa + offset;
>> +
>> + ret = realm_create_protected_data_page(realm, page_ipa,
>> + page, tmp_page);
>> + }
>> + if (ret)
>> + goto err_release_pfn;
>> +
>> + if (level == 2) {
>> + ret = fold_rtt(rd, ipa, level, realm);
>> + if (ret)
>> + goto err_release_pfn;
>> + }
>> +
>> + ipa += map_size;
>
>> + kvm_set_pfn_accessed(pfn);
>> + kvm_set_pfn_dirty(pfn);
>
> kvm_release_pfn_dirty() has already called kvm_set_pfn_{accessed, dirty}().

Will remove those calls.

>> + kvm_release_pfn_dirty(pfn);
>> +err_release_pfn:
>> + if (ret) {
>> + kvm_release_pfn_clean(pfn);
>> + break;
>> + }
>> + }
>> +
>> + mmap_read_unlock(current->mm);
>> + __free_page(tmp_page);
>> +
>> +out:
>> + srcu_read_unlock(&kvm->srcu, idx);
>> + return ret;
>> +}
>> +
>> +static int kvm_populate_realm(struct kvm *kvm,
>> + struct kvm_cap_arm_rme_populate_realm_args *args)
>> +{
>> + phys_addr_t ipa_base, ipa_end;
>> +
>
> Check kvm_is_realm(kvm) here or in the kvm_realm_enable_cap().

I'm going to update kvm_vm_ioctl_enable_cap() to check kvm_is_realm() so
we won't get here.
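
Roughly (a sketch of the intended check, not the final code):

	case KVM_CAP_ARM_RME:
		if (!static_branch_unlikely(&kvm_rme_is_available))
			return -EINVAL;
		/* Reject the cap outright for non-realm VMs */
		if (!kvm_is_realm(kvm))
			return -EINVAL;
		mutex_lock(&kvm->lock);
		r = kvm_realm_enable_cap(kvm, cap);
		mutex_unlock(&kvm->lock);
		break;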

>> + if (kvm_realm_state(kvm) != REALM_STATE_NEW)
>> + return -EBUSY;
>
> Maybe -EINVAL? The realm hasn't been created (RMI_REALM_CREATE is not called
> yet). The userspace shouldn't reach this path.

Well user space can attempt to populate in the ACTIVE state - which is
where the idea of 'busy' comes from. Admittedly it's a little confusing
when RMI_REALM_CREATE hasn't been called.

I'm not particularly bothered about the return code, but it's useful to
have a different code to -EINVAL as it's not an invalid argument, but
calling at the wrong time. I can't immediately see a better error code
though.

Steve

>> +
>> + if (!IS_ALIGNED(args->populate_ipa_base, PAGE_SIZE) ||
>> + !IS_ALIGNED(args->populate_ipa_size, PAGE_SIZE))
>> + return -EINVAL;
>> +
>> + ipa_base = args->populate_ipa_base;
>> + ipa_end = ipa_base + args->populate_ipa_size;
>> +
>> + if (ipa_end < ipa_base)
>> + return -EINVAL;
>> +
>> + return populate_par_region(kvm, ipa_base, ipa_end);
>> +}
>> +
>> static int set_ipa_state(struct kvm_vcpu *vcpu,
>> unsigned long ipa,
>> unsigned long end,
>> @@ -748,6 +1102,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>> r = kvm_init_ipa_range_realm(kvm, &args);
>> break;
>> }
>> + case KVM_CAP_ARM_RME_POPULATE_REALM: {
>> + struct kvm_cap_arm_rme_populate_realm_args args;
>> + void __user *argp = u64_to_user_ptr(cap->args[1]);
>> +
>> + if (copy_from_user(&args, argp, sizeof(args))) {
>> + r = -EFAULT;
>> + break;
>> + }
>> +
>> + r = kvm_populate_realm(kvm, &args);
>> + break;
>> + }
>> default:
>> r = -EINVAL;
>> break;
>


2023-03-10 15:55:26

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

On 06/03/2023 19:10, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:10 +0000
> Steven Price <[email protected]> wrote:
>
>> Add the KVM_CAP_ARM_RME_CREATE_FD ioctl to create a realm. This involves
>> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
>> the base level of the Realm Translation Tables (RTT). A VMID also need
>> to be picked, since the RMM has a separate VMID address space a
>> dedicated allocator is added for this purpose.
>>
>> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
>> before it is created.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_rme.h | 14 ++
>> arch/arm64/kvm/arm.c | 19 ++
>> arch/arm64/kvm/mmu.c | 6 +
>> arch/arm64/kvm/reset.c | 33 +++
>> arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
>> 5 files changed, 429 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index c26bc2c6770d..055a22accc08 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -6,6 +6,8 @@
>> #ifndef __ASM_KVM_RME_H
>> #define __ASM_KVM_RME_H
>>
>> +#include <uapi/linux/kvm.h>
>> +
>> enum realm_state {
>> REALM_STATE_NONE,
>> REALM_STATE_NEW,
>> @@ -15,8 +17,20 @@ enum realm_state {
>>
>> struct realm {
>> enum realm_state state;
>> +
>> + void *rd;
>> + struct realm_params *params;
>> +
>> + unsigned long num_aux;
>> + unsigned int vmid;
>> + unsigned int ia_bits;
>> };
>>
>> int kvm_init_rme(void);
>> +u32 kvm_realm_ipa_limit(void);
>> +
>> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>> +int kvm_init_realm_vm(struct kvm *kvm);
>> +void kvm_destroy_realm(struct kvm *kvm);
>>
>> #endif
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index d97b39d042ab..50f54a63732a 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -103,6 +103,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>> r = 0;
>> set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
>> break;
>> + case KVM_CAP_ARM_RME:
>> + if (!static_branch_unlikely(&kvm_rme_is_available))
>> + return -EINVAL;
>> + mutex_lock(&kvm->lock);
>> + r = kvm_realm_enable_cap(kvm, cap);
>> + mutex_unlock(&kvm->lock);
>> + break;
>> default:
>> r = -EINVAL;
>> break;
>> @@ -172,6 +179,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>> */
>> kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit();
>>
>> + /* Initialise the realm bits after the generic bits are enabled */
>> + if (kvm_is_realm(kvm)) {
>> + ret = kvm_init_realm_vm(kvm);
>> + if (ret)
>> + goto err_free_cpumask;
>> + }
>> +
>> return 0;
>>
>> err_free_cpumask:
>> @@ -204,6 +218,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>> kvm_destroy_vcpus(kvm);
>>
>> kvm_unshare_hyp(kvm, kvm + 1);
>> +
>> + kvm_destroy_realm(kvm);
>> }
>>
>> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> @@ -300,6 +316,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>> case KVM_CAP_ARM_PTRAUTH_GENERIC:
>> r = system_has_full_ptr_auth();
>> break;
>> + case KVM_CAP_ARM_RME:
>> + r = static_key_enabled(&kvm_rme_is_available);
>> + break;
>> default:
>> r = 0;
>> }
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index 31d7fa4c7c14..d0f707767d05 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -840,6 +840,12 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> struct kvm_pgtable *pgt = NULL;
>>
>> write_lock(&kvm->mmu_lock);
>> + if (kvm_is_realm(kvm) &&
>> + kvm_realm_state(kvm) != REALM_STATE_DYING) {
>> + /* TODO: teardown rtts */
>> + write_unlock(&kvm->mmu_lock);
>> + return;
>> + }
>> pgt = mmu->pgt;
>> if (pgt) {
>> mmu->pgd_phys = 0;
>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>> index e0267f672b8a..c165df174737 100644
>> --- a/arch/arm64/kvm/reset.c
>> +++ b/arch/arm64/kvm/reset.c
>> @@ -395,3 +395,36 @@ int kvm_set_ipa_limit(void)
>>
>> return 0;
>> }
>> +
>
> The below function doesn't have an user in this patch. Also,
> it looks like a part of copy from kvm_init_stage2_mmu()
> in arch/arm64/kvm/mmu.c.

Good spot ;) Yes I discovered this, it should have been removed - it's
no longer used. I think this was an error when I was rebasing:
kvm_arm_setup_stage2() was removed in 315775ff7c6d ("KVM: arm64:
Consolidate stage-2 initialisation into a single function").

Steve

>> +int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
>> +{
>> + u64 mmfr0, mmfr1;
>> + u32 phys_shift;
>> + u32 ipa_limit = kvm_ipa_limit;
>> +
>> + if (kvm_is_realm(kvm))
>> + ipa_limit = kvm_realm_ipa_limit();
>> +
>> + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
>> + return -EINVAL;
>> +
>> + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>> + if (phys_shift) {
>> + if (phys_shift > ipa_limit ||
>> + phys_shift < ARM64_MIN_PARANGE_BITS)
>> + return -EINVAL;
>> + } else {
>> + phys_shift = KVM_PHYS_SHIFT;
>> + if (phys_shift > ipa_limit) {
>> + pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n",
>> + current->comm);
>> + return -EINVAL;
>> + }
>> + }
>> +
>> + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
>> + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
>> + kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
>> +
>> + return 0;
>> +}
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index f6b587bc116e..9f8c5a91b8fc 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -5,9 +5,49 @@
>>
>> #include <linux/kvm_host.h>
>>
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_mmu.h>
>> #include <asm/rmi_cmds.h>
>> #include <asm/virt.h>
>>
>> +/************ FIXME: Copied from kvm/hyp/pgtable.c **********/
>> +#include <asm/kvm_pgtable.h>
>> +
>> +struct kvm_pgtable_walk_data {
>> + struct kvm_pgtable *pgt;
>> + struct kvm_pgtable_walker *walker;
>> +
>> + u64 addr;
>> + u64 end;
>> +};
>> +
>> +static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
>> +{
>> + u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
>> + u64 mask = BIT(pgt->ia_bits) - 1;
>> +
>> + return (addr & mask) >> shift;
>> +}
>> +
>> +static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
>> +{
>> + struct kvm_pgtable pgt = {
>> + .ia_bits = ia_bits,
>> + .start_level = start_level,
>> + };
>> +
>> + return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
>> +}
>> +
>> +/******************/
>> +
>> +static unsigned long rmm_feat_reg0;
>> +
>> +static bool rme_supports(unsigned long feature)
>> +{
>> + return !!u64_get_bits(rmm_feat_reg0, feature);
>> +}
>> +
>> static int rmi_check_version(void)
>> {
>> struct arm_smccc_res res;
>> @@ -33,8 +73,319 @@ static int rmi_check_version(void)
>> return 0;
>> }
>>
>> +static unsigned long create_realm_feat_reg0(struct kvm *kvm)
>> +{
>> + unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
>> + u64 feat_reg0 = 0;
>> +
>> + int num_bps = u64_get_bits(rmm_feat_reg0,
>> + RMI_FEATURE_REGISTER_0_NUM_BPS);
>> + int num_wps = u64_get_bits(rmm_feat_reg0,
>> + RMI_FEATURE_REGISTER_0_NUM_WPS);
>> +
>> + feat_reg0 |= u64_encode_bits(ia_bits, RMI_FEATURE_REGISTER_0_S2SZ);
>> + feat_reg0 |= u64_encode_bits(num_bps, RMI_FEATURE_REGISTER_0_NUM_BPS);
>> + feat_reg0 |= u64_encode_bits(num_wps, RMI_FEATURE_REGISTER_0_NUM_WPS);
>> +
>> + return feat_reg0;
>> +}
>> +
>> +u32 kvm_realm_ipa_limit(void)
>> +{
>> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> +}
>> +
>> +static u32 get_start_level(struct kvm *kvm)
>> +{
>> + long sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, kvm->arch.vtcr);
>> +
>> + return VTCR_EL2_TGRAN_SL0_BASE - sl0;
>> +}
>> +
>> +static int realm_create_rd(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + struct realm_params *params = realm->params;
>> + void *rd = NULL;
>> + phys_addr_t rd_phys, params_phys;
>> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> + unsigned int pgd_sz;
>> + int i, r;
>> +
>> + if (WARN_ON(realm->rd) || WARN_ON(!realm->params))
>> + return -EEXIST;
>> +
>> + rd = (void *)__get_free_page(GFP_KERNEL);
>> + if (!rd)
>> + return -ENOMEM;
>> +
>> + rd_phys = virt_to_phys(rd);
>> + if (rmi_granule_delegate(rd_phys)) {
>> + r = -ENXIO;
>> + goto out;
>> + }
>> +
>> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
>> + for (i = 0; i < pgd_sz; i++) {
>> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
>> +
>> + if (rmi_granule_delegate(pgd_phys)) {
>> + r = -ENXIO;
>> + goto out_undelegate_tables;
>> + }
>> + }
>> +
>> + params->rtt_level_start = get_start_level(kvm);
>> + params->rtt_num_start = pgd_sz;
>> + params->rtt_base = kvm->arch.mmu.pgd_phys;
>> + params->vmid = realm->vmid;
>> +
>> + params_phys = virt_to_phys(params);
>> +
>> + if (rmi_realm_create(rd_phys, params_phys)) {
>> + r = -ENXIO;
>> + goto out_undelegate_tables;
>> + }
>> +
>> + realm->rd = rd;
>> + realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
>> +
>> + if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
>> + WARN_ON(rmi_realm_destroy(rd_phys));
>> + goto out_undelegate_tables;
>> + }
>> +
>> + return 0;
>> +
>> +out_undelegate_tables:
>> + while (--i >= 0) {
>> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
>> +
>> + WARN_ON(rmi_granule_undelegate(pgd_phys));
>> + }
>> + WARN_ON(rmi_granule_undelegate(rd_phys));
>> +out:
>> + free_page((unsigned long)rd);
>> + return r;
>> +}
>> +
>> +/* Protects access to rme_vmid_bitmap */
>> +static DEFINE_SPINLOCK(rme_vmid_lock);
>> +static unsigned long *rme_vmid_bitmap;
>> +
>> +static int rme_vmid_init(void)
>> +{
>> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
>> +
>> + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
>> + if (!rme_vmid_bitmap) {
>> + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
>> + return -ENOMEM;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int rme_vmid_reserve(void)
>> +{
>> + int ret;
>> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
>> +
>> + spin_lock(&rme_vmid_lock);
>> + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
>> + spin_unlock(&rme_vmid_lock);
>> +
>> + return ret;
>> +}
>> +
>> +static void rme_vmid_release(unsigned int vmid)
>> +{
>> + spin_lock(&rme_vmid_lock);
>> + bitmap_release_region(rme_vmid_bitmap, vmid, 0);
>> + spin_unlock(&rme_vmid_lock);
>> +}
>> +
>> +static int kvm_create_realm(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + int ret;
>> +
>> + if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
>> + return -EEXIST;
>> +
>> + ret = rme_vmid_reserve();
>> + if (ret < 0)
>> + return ret;
>> + realm->vmid = ret;
>> +
>> + ret = realm_create_rd(kvm);
>> + if (ret) {
>> + rme_vmid_release(realm->vmid);
>> + return ret;
>> + }
>> +
>> + WRITE_ONCE(realm->state, REALM_STATE_NEW);
>> +
>> + /* The realm is up, free the parameters. */
>> + free_page((unsigned long)realm->params);
>> + realm->params = NULL;
>> +
>> + return 0;
>> +}
>> +
>> +static int config_realm_hash_algo(struct realm *realm,
>> + struct kvm_cap_arm_rme_config_item *cfg)
>> +{
>> + switch (cfg->hash_algo) {
>> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
>> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
>> + return -EINVAL;
>> + break;
>> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
>> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
>> + return -EINVAL;
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + realm->params->measurement_algo = cfg->hash_algo;
>> + return 0;
>> +}
>> +
>> +static int config_realm_sve(struct realm *realm,
>> + struct kvm_cap_arm_rme_config_item *cfg)
>> +{
>> + u64 features_0 = realm->params->features_0;
>> + int max_sve_vq = u64_get_bits(rmm_feat_reg0,
>> + RMI_FEATURE_REGISTER_0_SVE_VL);
>> +
>> + if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
>> + return -EINVAL;
>> +
>> + if (cfg->sve_vq > max_sve_vq)
>> + return -EINVAL;
>> +
>> + features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
>> + RMI_FEATURE_REGISTER_0_SVE_VL);
>> + features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
>> + features_0 |= u64_encode_bits(cfg->sve_vq,
>> + RMI_FEATURE_REGISTER_0_SVE_VL);
>> +
>> + realm->params->features_0 = features_0;
>> + return 0;
>> +}
>> +
>> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
>> +{
>> + struct kvm_cap_arm_rme_config_item cfg;
>> + struct realm *realm = &kvm->arch.realm;
>> + int r = 0;
>> +
>> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
>> + return -EBUSY;
>> +
>> + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
>> + return -EFAULT;
>> +
>> + switch (cfg.cfg) {
>> + case KVM_CAP_ARM_RME_CFG_RPV:
>> + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
>> + break;
>> + case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
>> + r = config_realm_hash_algo(realm, &cfg);
>> + break;
>> + case KVM_CAP_ARM_RME_CFG_SVE:
>> + r = config_realm_sve(realm, &cfg);
>> + break;
>> + default:
>> + r = -EINVAL;
>> + }
>> +
>> + return r;
>> +}
>> +
>> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>> +{
>> + int r = 0;
>> +
>> + switch (cap->args[0]) {
>> + case KVM_CAP_ARM_RME_CONFIG_REALM:
>> + r = kvm_rme_config_realm(kvm, cap);
>> + break;
>> + case KVM_CAP_ARM_RME_CREATE_RD:
>> + if (kvm->created_vcpus) {
>> + r = -EBUSY;
>> + break;
>> + }
>> +
>> + r = kvm_create_realm(kvm);
>> + break;
>> + default:
>> + r = -EINVAL;
>> + break;
>> + }
>> +
>> + return r;
>> +}
>> +
>> +void kvm_destroy_realm(struct kvm *kvm)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
>> + unsigned int pgd_sz;
>> + int i;
>> +
>> + if (realm->params) {
>> + free_page((unsigned long)realm->params);
>> + realm->params = NULL;
>> + }
>> +
>> + if (kvm_realm_state(kvm) == REALM_STATE_NONE)
>> + return;
>> +
>> + WRITE_ONCE(realm->state, REALM_STATE_DYING);
>> +
>> + rme_vmid_release(realm->vmid);
>> +
>> + if (realm->rd) {
>> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
>> +
>> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
>> + return;
>> + if (WARN_ON(rmi_granule_undelegate(rd_phys)))
>> + return;
>> + free_page((unsigned long)realm->rd);
>> + realm->rd = NULL;
>> + }
>> +
>> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
>> + for (i = 0; i < pgd_sz; i++) {
>> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
>> +
>> + if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
>> + return;
>> + }
>> +
>> + kvm_free_stage2_pgd(&kvm->arch.mmu);
>> +}
>> +
>> +int kvm_init_realm_vm(struct kvm *kvm)
>> +{
>> + struct realm_params *params;
>> +
>> + params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
>> + if (!params)
>> + return -ENOMEM;
>> +
>> + params->features_0 = create_realm_feat_reg0(kvm);
>> + kvm->arch.realm.params = params;
>> + return 0;
>> +}
>> +
>> int kvm_init_rme(void)
>> {
>> + int ret;
>> +
>> if (PAGE_SIZE != SZ_4K)
>> /* Only 4k page size on the host is supported */
>> return 0;
>> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
>> /* Continue without realm support */
>> return 0;
>>
>> + ret = rme_vmid_init();
>> + if (ret)
>> + return ret;
>> +
>> + WARN_ON(rmi_features(0, &rmm_feat_reg0));
>> +
>> /* Future patch will enable static branch kvm_rme_is_available */
>>
>> return 0;
>


2023-03-10 15:56:55

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 17/28] arm64: RME: Runtime faulting of memory

On 06/03/2023 18:20, Zhi Wang wrote:
> On Fri, 27 Jan 2023 11:29:21 +0000
> Steven Price <[email protected]> wrote:
>
>> At runtime if the realm guest accesses memory which hasn't yet been
>> mapped then KVM needs to either populate the region or fault the guest.
>>
>> For memory in the lower (protected) region of IPA a fresh page is
>> provided to the RMM which will zero the contents. For memory in the
>> upper (shared) region of IPA, the memory from the memslot is mapped
>> into the realm VM non secure.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>> arch/arm64/include/asm/kvm_emulate.h | 10 +++++
>> arch/arm64/include/asm/kvm_rme.h | 12 ++++++
>> arch/arm64/kvm/mmu.c | 64 +++++++++++++++++++++++++---
>> arch/arm64/kvm/rme.c | 48 +++++++++++++++++++++
>> 4 files changed, 128 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 285e62914ca4..3a71b3d2e10a 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -502,6 +502,16 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>> return READ_ONCE(kvm->arch.realm.state);
>> }
>>
>> +static inline gpa_t kvm_gpa_stolen_bits(struct kvm *kvm)
>> +{
>> + if (kvm_is_realm(kvm)) {
>> + struct realm *realm = &kvm->arch.realm;
>> +
>> + return BIT(realm->ia_bits - 1);
>> + }
>> + return 0;
>> +}
>> +
>
> "stolen" seems a little bit vague. Maybe "shared" bit would be better as
> SEV-SNP has C bit and TDX has shared bit. It would be nice to align with
> the common knowledge.

The Arm CCA term is the "protected" bit[1] - although the bit is
backwards as it's cleared to indicate protected... so not ideal naming! ;)

But it's termed 'stolen' here as it's effectively removed from the set
of valid address bits. And this function is returning a mask of the bits
that are not available as address bits. The naming was meant to be
generic so that this could encompass other features that need to reserve
IPA bits.

But it's possible this is too generic and perhaps we should just deal
with a single bit rather than potential masks. Alternatively we could
invert this and return a set of valid bits:

static inline gpa_t kvm_gpa_valid_bits(struct kvm *kvm)
{
	if (kvm_is_realm(kvm)) {
		struct realm *realm = &kvm->arch.realm;

		return ~BIT(realm->ia_bits - 1);
	}
	return ~(gpa_t)0;
}

That would at least match the current usage where the inverse is what we
need.
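
For comparison, the existing call sites in the fault handling code
would then become roughly (hypothetical rewrite, just to show the mask
no longer needs inverting):

	gpa_t gpa_valid_bits = kvm_gpa_valid_bits(vcpu->kvm);

	gfn = (fault_ipa & gpa_valid_bits) >> PAGE_SHIFT;
	...
	fault_ipa &= gpa_valid_bits;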

So do SEV-SNP or TDX have a concept of a mask to apply to addresses from
the guest? Can we steal any existing terms?


[1] Technically the spec only states: "Software in a Realm should treat
the most significant bit of an IPA as a protection attribute." I don't
think the bit is directly referred to in the spec as anything other than
"the most significant bit". Although that in itself is confusing as it
is the most significant *active* bit (i.e. the configured IPA size
changes which bit is used).

> Also, it would be nice to change the name of gpa_stolen_mask accordingly.
>
>> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
>> {
>> if (static_branch_unlikely(&kvm_rme_is_available))
>> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
>> index 9d1583c44a99..303e4a5e5704 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -50,6 +50,18 @@ void kvm_destroy_rec(struct kvm_vcpu *vcpu);
>> int kvm_rec_enter(struct kvm_vcpu *vcpu);
>> int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>>
>> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size);
>> +int realm_map_protected(struct realm *realm,
>> + unsigned long hva,
>> + unsigned long base_ipa,
>> + struct page *dst_page,
>> + unsigned long map_size,
>> + struct kvm_mmu_memory_cache *memcache);
>> +int realm_map_non_secure(struct realm *realm,
>> + unsigned long ipa,
>> + struct page *page,
>> + unsigned long map_size,
>> + struct kvm_mmu_memory_cache *memcache);
>> int realm_set_ipa_state(struct kvm_vcpu *vcpu,
>> unsigned long addr, unsigned long end,
>> unsigned long ripas);
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index f29558c5dcbc..5417c273861b 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -235,8 +235,13 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
>>
>> lockdep_assert_held_write(&kvm->mmu_lock);
>> WARN_ON(size & ~PAGE_MASK);
>> - WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
>> - may_block));
>> +
>> + if (kvm_is_realm(kvm))
>> + kvm_realm_unmap_range(kvm, start, size);
>> + else
>> + WARN_ON(stage2_apply_range(kvm, start, end,
>> + kvm_pgtable_stage2_unmap,
>> + may_block));
>> }
>>
>> static void unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
>> @@ -250,7 +255,11 @@ static void stage2_flush_memslot(struct kvm *kvm,
>> phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
>> phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
>>
>> - stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_flush);
>> + if (kvm_is_realm(kvm))
>> + kvm_realm_unmap_range(kvm, addr, end - addr);
>> + else
>> + stage2_apply_range_resched(kvm, addr, end,
>> + kvm_pgtable_stage2_flush);
>> }
>>
>> /**
>> @@ -818,6 +827,10 @@ void stage2_unmap_vm(struct kvm *kvm)
>> struct kvm_memory_slot *memslot;
>> int idx, bkt;
>>
>> + /* For realms this is handled by the RMM so nothing to do here */
>> + if (kvm_is_realm(kvm))
>> + return;
>> +
>> idx = srcu_read_lock(&kvm->srcu);
>> mmap_read_lock(current->mm);
>> write_lock(&kvm->mmu_lock);
>> @@ -840,6 +853,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>> pgt = mmu->pgt;
>> if (kvm_is_realm(kvm) &&
>> kvm_realm_state(kvm) != REALM_STATE_DYING) {
>> + unmap_stage2_range(mmu, 0, (~0ULL) & PAGE_MASK);
>> write_unlock(&kvm->mmu_lock);
>> kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
>> pgt->start_level);
>> @@ -1190,6 +1204,24 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
>> return vma->vm_flags & VM_MTE_ALLOWED;
>> }
>>
>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa, unsigned long hva,
>> + kvm_pfn_t pfn, unsigned long map_size,
>> + enum kvm_pgtable_prot prot,
>> + struct kvm_mmu_memory_cache *memcache)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + struct page *page = pfn_to_page(pfn);
>> +
>> + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>> + return -EFAULT;
>> +
>> + if (!realm_is_addr_protected(realm, ipa))
>> + return realm_map_non_secure(realm, ipa, page, map_size,
>> + memcache);
>> +
>> + return realm_map_protected(realm, hva, ipa, page, map_size, memcache);
>> +}
>> +
>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> struct kvm_memory_slot *memslot, unsigned long hva,
>> unsigned long fault_status)
>> @@ -1210,9 +1242,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> unsigned long vma_pagesize, fault_granule;
>> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>> struct kvm_pgtable *pgt;
>> + gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
>>
>> fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
>> write_fault = kvm_is_write_fault(vcpu);
>> +
>> + /* Realms cannot map read-only */
>
> Out of curiosity, why? It would be nice to have more explanation in the
> comment.

The RMM specification doesn't support mapping protected memory read
only. I don't believe there is any reason why it couldn't, but equally I
don't think there are any use cases for a guest needing read-only pages so
this just isn't supported by the RMM. Since the page is necessarily
taken away from the host it's fairly irrelevant (from the host's
perspective) whether it is actually read only or not.

However, this is technically wrong for the case of unprotected (shared)
pages - it should be possible to map those read only. But I need to have
a think about how to fix that up.
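
One possible shape for that fix (a sketch only, reusing the
realm_is_addr_protected() helper from later in the series - nothing
settled yet):

	/*
	 * Only the protected half of the IPA space has to be mapped
	 * read/write; unprotected (shared) pages could keep the real
	 * permissions of the fault.
	 */
	if (vcpu_is_rec(vcpu) &&
	    realm_is_addr_protected(&vcpu->kvm->arch.realm, fault_ipa))
		write_fault = true;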

>> + if (vcpu_is_rec(vcpu))
>> + write_fault = true;
>> +
>> exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>> VM_BUG_ON(write_fault && exec_fault);
>>
>> @@ -1272,7 +1310,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>> fault_ipa &= ~(vma_pagesize - 1);
>>
>> - gfn = fault_ipa >> PAGE_SHIFT;
>> + gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
>> mmap_read_unlock(current->mm);
>>
>> /*
>> @@ -1345,7 +1383,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> * If we are not forced to use page mapping, check if we are
>> * backed by a THP and thus use block mapping if possible.
>> */
>> - if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
>> + /* FIXME: We shouldn't need to disable this for realms */
>> + if (vma_pagesize == PAGE_SIZE && !(force_pte || device || kvm_is_realm(kvm))) {
>
> Why do we have to disable this temporarily?

The current uABI (not using memfd) has some serious issues regarding
huge page support. KVM normally follows the user space mappings of the
memslot - so if user space has a huge page (transparent or hugetlbs)
then stage 2 for the guest also gets one.

However realms sometimes require that the stage 2 differs. The main
examples are:

* RIPAS - if part of a huge page is RIPAS_RAM and part RIPAS_EMPTY then
the huge page would have to be split.

* Initially populated memory: basically the same as above - if the
populated memory doesn't perfectly align with huge pages, then the
head/tail pages would need to be broken up.

Removing this hack allows huge pages to be created in stage 2, but it
then causes overmapping of the initial contents; later, when the VMM
(or guest) attempts to change the properties of the misaligned tail,
it gets an error because the pages are already present in stage 2.

The planned solution to all this is to stop following the user space
page tables and create huge pages opportunistically from the memfd that
backs the protected range. For now this hack exists to avoid things
"randomly" failing when e.g. the initial kernel image isn't huge page
aligned. In theory it should be possible to make this work with the
current uABI, but it's not worth it when we know we're replacing it.

>> if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
>> vma_pagesize = fault_granule;
>> else
>> @@ -1382,6 +1421,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> */
>> if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
>> ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
>> + else if (kvm_is_realm(kvm))
>> + ret = realm_map_ipa(kvm, fault_ipa, hva, pfn, vma_pagesize,
>> + prot, memcache);
>> else
>> ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
>> __pfn_to_phys(pfn), prot,
>> @@ -1437,6 +1479,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>> struct kvm_memory_slot *memslot;
>> unsigned long hva;
>> bool is_iabt, write_fault, writable;
>> + gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
>> gfn_t gfn;
>> int ret, idx;
>>
>> @@ -1491,7 +1534,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>
>> idx = srcu_read_lock(&vcpu->kvm->srcu);
>>
>> - gfn = fault_ipa >> PAGE_SHIFT;
>> + gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
>> memslot = gfn_to_memslot(vcpu->kvm, gfn);
>> hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>> write_fault = kvm_is_write_fault(vcpu);
>> @@ -1536,6 +1579,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>> * of the page size.
>> */
>> fault_ipa |= kvm_vcpu_get_hfar(vcpu) & ((1 << 12) - 1);
>> + fault_ipa &= ~gpa_stolen_mask;
>> ret = io_mem_abort(vcpu, fault_ipa);
>> goto out_unlock;
>> }
>> @@ -1617,6 +1661,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>> if (!kvm->arch.mmu.pgt)
>> return false;
>>
>
> Does the unprotected (shared) region of a realm support aging?

In theory this should be possible to support by unmapping the NS entry
and handling the fault. But the hardware access flag optimisation isn't
available with the RMM, and the overhead of RMI calls to unmap/map could
be significant.

For now this isn't something we've looked at, but I guess it might be
worth trying out when we have some real hardware to benchmark on.

>> + /* We don't support aging for Realms */
>> + if (kvm_is_realm(kvm))
>> + return true;
>> +
>> WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
>>
>> kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
>> @@ -1630,6 +1678,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>> if (!kvm->arch.mmu.pgt)
>> return false;
>>
>> + /* We don't support aging for Realms */
>> + if (kvm_is_realm(kvm))
>> + return true;
>> +
>> return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
>> range->start << PAGE_SHIFT);
>> }
>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> index 3405b43e1421..3d46191798e5 100644
>> --- a/arch/arm64/kvm/rme.c
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -608,6 +608,54 @@ int realm_map_protected(struct realm *realm,
>> return -ENXIO;
>> }
>>
>> +int realm_map_non_secure(struct realm *realm,
>> + unsigned long ipa,
>> + struct page *page,
>> + unsigned long map_size,
>> + struct kvm_mmu_memory_cache *memcache)
>> +{
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + int map_level;
>> + int ret = 0;
>> + unsigned long desc = page_to_phys(page) |
>> + PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) |
>> + /* FIXME: Read+Write permissions for now */
> Why can't we handle the prot from the realm_map_ipa()? Working in progress? :)

Yes, work in progress - this comes from the "Realms cannot map
read-only" in user_mem_abort() above. Since all faults are treated as
write faults we need to upgrade to read/write here too.

The prot in realm_map_ipa isn't actually useful currently because we
simply WARN_ON and return if it doesn't have PROT_W. Again this needs to
be fixed! It's on my todo list ;)
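
As a rough idea of what honouring prot could look like for the
unprotected mappings (a hypothetical helper - the S2AP bit positions
are the same ones the hard-coded (3 << 6) above encodes):

	static unsigned long realm_ns_desc_ap(enum kvm_pgtable_prot prot)
	{
		unsigned long s2ap = 0;

		if (prot & KVM_PGTABLE_PROT_R)
			s2ap |= BIT(6);		/* S2AP[0]: read */
		if (prot & KVM_PGTABLE_PROT_W)
			s2ap |= BIT(7);		/* S2AP[1]: write */

		return s2ap;
	}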

Steve

>> + (3 << 6) |
>> + PTE_SHARED;
>> +
>> + if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
>> + return -EINVAL;
>> +
>> + switch (map_size) {
>> + case PAGE_SIZE:
>> + map_level = 3;
>> + break;
>> + case RME_L2_BLOCK_SIZE:
>> + map_level = 2;
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> +
>> + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
>> +
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + ret = realm_create_rtt_levels(realm, ipa, level, map_level,
>> + memcache);
>> + if (WARN_ON(ret))
>> + return -ENXIO;
>> +
>> + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
>> + }
>> + if (WARN_ON(ret))
>> + return -ENXIO;
>> +
>> + return 0;
>> +}
>> +
>> static int populate_par_region(struct kvm *kvm,
>> phys_addr_t ipa_base,
>> phys_addr_t ipa_end)
>


2023-03-14 15:32:26

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 16/28] arm64: RME: Allow populating initial contents

On Fri, 10 Mar 2023 15:47:16 +0000
Steven Price <[email protected]> wrote:

> On 06/03/2023 17:34, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:20 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> The VMM needs to populate the realm with some data before starting (e.g.
> >> a kernel and initrd). This is measured by the RMM and used as part of
> >> the attestation later on.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/kvm/rme.c | 366 +++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 366 insertions(+)
> >>
> >> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> >> index 16e0bfea98b1..3405b43e1421 100644
> >> --- a/arch/arm64/kvm/rme.c
> >> +++ b/arch/arm64/kvm/rme.c
> >> @@ -4,6 +4,7 @@
> >> */
> >>
> >> #include <linux/kvm_host.h>
> >> +#include <linux/hugetlb.h>
> >>
> >> #include <asm/kvm_emulate.h>
> >> #include <asm/kvm_mmu.h>
> >> @@ -426,6 +427,359 @@ void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size)
> >> }
> >> }
> >>
> >> +static int realm_create_protected_data_page(struct realm *realm,
> >> + unsigned long ipa,
> >> + struct page *dst_page,
> >> + struct page *tmp_page)
> >> +{
> >> + phys_addr_t dst_phys, tmp_phys;
> >> + int ret;
> >> +
> >> + copy_page(page_address(tmp_page), page_address(dst_page));
> >> +
> >> + dst_phys = page_to_phys(dst_page);
> >> + tmp_phys = page_to_phys(tmp_page);
> >> +
> >> + if (rmi_granule_delegate(dst_phys))
> >> + return -ENXIO;
> >> +
> >> + ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa, tmp_phys,
> >> + RMI_MEASURE_CONTENT);
> >> +
> >> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> >> + /* Create missing RTTs and retry */
> >> + int level = RMI_RETURN_INDEX(ret);
> >> +
> >> + ret = realm_create_rtt_levels(realm, ipa, level,
> >> + RME_RTT_MAX_LEVEL, NULL);
> >> + if (ret)
> >> + goto err;
> >> +
> >> + ret = rmi_data_create(dst_phys, virt_to_phys(realm->rd), ipa,
> >> + tmp_phys, RMI_MEASURE_CONTENT);
> >> + }
> >> +
> >> + if (ret)
> >> + goto err;
> >> +
> >> + return 0;
> >> +
> >> +err:
> >> + if (WARN_ON(rmi_granule_undelegate(dst_phys))) {
> >> + /* Page can't be returned to NS world so is lost */
> >> + get_page(dst_page);
> >> + }
> >> + return -ENXIO;
> >> +}
> >> +
> >> +static int fold_rtt(phys_addr_t rd, unsigned long addr, int level,
> >> + struct realm *realm)
> >> +{
> >> + struct rtt_entry rtt;
> >> + phys_addr_t rtt_addr;
> >> +
> >> + if (rmi_rtt_read_entry(rd, addr, level, &rtt))
> >> + return -ENXIO;
> >> +
> >> + if (rtt.state != RMI_TABLE)
> >> + return -EINVAL;
> >> +
> >> + rtt_addr = rmi_rtt_get_phys(&rtt);
> >> + if (rmi_rtt_fold(rtt_addr, rd, addr, level + 1))
> >> + return -ENXIO;
> >> +
> >> + free_delegated_page(realm, rtt_addr);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +int realm_map_protected(struct realm *realm,
> >> + unsigned long hva,
> >> + unsigned long base_ipa,
> >> + struct page *dst_page,
> >> + unsigned long map_size,
> >> + struct kvm_mmu_memory_cache *memcache)
> >> +{
> >> + phys_addr_t dst_phys = page_to_phys(dst_page);
> >> + phys_addr_t rd = virt_to_phys(realm->rd);
> >> + unsigned long phys = dst_phys;
> >> + unsigned long ipa = base_ipa;
> >> + unsigned long size;
> >> + int map_level;
> >> + int ret = 0;
> >> +
> >> + if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
> >> + return -EINVAL;
> >> +
> >> + switch (map_size) {
> >> + case PAGE_SIZE:
> >> + map_level = 3;
> >> + break;
> >> + case RME_L2_BLOCK_SIZE:
> >> + map_level = 2;
> >> + break;
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> +
> >> + if (map_level < RME_RTT_MAX_LEVEL) {
> >> + /*
> >> + * A temporary RTT is needed during the map, precreate it,
> >> + * however if there is an error (e.g. missing parent tables)
> >> + * this will be handled below.
> >> + */
> >> + realm_create_rtt_levels(realm, ipa, map_level,
> >> + RME_RTT_MAX_LEVEL, memcache);
> >> + }
> >> +
> >> + for (size = 0; size < map_size; size += PAGE_SIZE) {
> >> + if (rmi_granule_delegate(phys)) {
> >> + struct rtt_entry rtt;
> >> +
> >> + /*
> >> + * It's possible we raced with another VCPU on the same
> >> + * fault. If the entry exists and matches then exit
> >> + * early and assume the other VCPU will handle the
> >> + * mapping.
> >> + */
> >> + if (rmi_rtt_read_entry(rd, ipa, RME_RTT_MAX_LEVEL, &rtt))
> >> + goto err;
> >> +
> >> + // FIXME: For a block mapping this could race at level
> >> + // 2 or 3...
> >> + if (WARN_ON((rtt.walk_level != RME_RTT_MAX_LEVEL ||
> >> + rtt.state != RMI_ASSIGNED ||
> >> + rtt.desc != phys))) {
> >> + goto err;
> >> + }
> >> +
> >> + return 0;
> >> + }
> >> +
> >> + ret = rmi_data_create_unknown(phys, rd, ipa);
> >> +
> >> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> >> + /* Create missing RTTs and retry */
> >> + int level = RMI_RETURN_INDEX(ret);
> >> +
> >> + ret = realm_create_rtt_levels(realm, ipa, level,
> >> + RME_RTT_MAX_LEVEL,
> >> + memcache);
> >> + WARN_ON(ret);
> >> + if (ret)
> >> + goto err_undelegate;
> >> +
> >> + ret = rmi_data_create_unknown(phys, rd, ipa);
> >> + }
> >> + WARN_ON(ret);
> >> +
> >> + if (ret)
> >> + goto err_undelegate;
> >> +
> >> + phys += PAGE_SIZE;
> >> + ipa += PAGE_SIZE;
> >> + }
> >> +
> >> + if (map_size == RME_L2_BLOCK_SIZE)
> >> + ret = fold_rtt(rd, base_ipa, map_level, realm);
> >> + if (WARN_ON(ret))
> >> + goto err;
> >> +
> >> + return 0;
> >> +
> >> +err_undelegate:
> >> + if (WARN_ON(rmi_granule_undelegate(phys))) {
> >> + /* Page can't be returned to NS world so is lost */
> >> + get_page(phys_to_page(phys));
> >> + }
> >> +err:
> >> + while (size > 0) {
> >> + phys -= PAGE_SIZE;
> >> + size -= PAGE_SIZE;
> >> + ipa -= PAGE_SIZE;
> >> +
> >> + rmi_data_destroy(rd, ipa);
> >> +
> >> + if (WARN_ON(rmi_granule_undelegate(phys))) {
> >> + /* Page can't be returned to NS world so is lost */
> >> + get_page(phys_to_page(phys));
> >> + }
> >> + }
> >> + return -ENXIO;
> >> +}
> >> +
> >
> > There seems no caller to the function above. Better move it to the related
> > patch.
>
> Indeed this should really be in the next patch - will move as it's very
> confusing having it in this patch (sorry about that).
>
> >> +static int populate_par_region(struct kvm *kvm,
> >> + phys_addr_t ipa_base,
> >> + phys_addr_t ipa_end)
> >> +{
> >> + struct realm *realm = &kvm->arch.realm;
> >> + struct kvm_memory_slot *memslot;
> >> + gfn_t base_gfn, end_gfn;
> >> + int idx;
> >> + phys_addr_t ipa;
> >> + int ret = 0;
> >> + struct page *tmp_page;
> >> + phys_addr_t rd = virt_to_phys(realm->rd);
> >> +
> >> + base_gfn = gpa_to_gfn(ipa_base);
> >> + end_gfn = gpa_to_gfn(ipa_end);
> >> +
> >> + idx = srcu_read_lock(&kvm->srcu);
> >> + memslot = gfn_to_memslot(kvm, base_gfn);
> >> + if (!memslot) {
> >> + ret = -EFAULT;
> >> + goto out;
> >> + }
> >> +
> >> + /* We require the region to be contained within a single memslot */
> >> + if (memslot->base_gfn + memslot->npages < end_gfn) {
> >> + ret = -EINVAL;
> >> + goto out;
> >> + }
> >> +
> >> + tmp_page = alloc_page(GFP_KERNEL);
> >> + if (!tmp_page) {
> >> + ret = -ENOMEM;
> >> + goto out;
> >> + }
> >> +
> >> + mmap_read_lock(current->mm);
> >> +
> >> + ipa = ipa_base;
> >> +
> >> + while (ipa < ipa_end) {
> >> + struct vm_area_struct *vma;
> >> + unsigned long map_size;
> >> + unsigned int vma_shift;
> >> + unsigned long offset;
> >> + unsigned long hva;
> >> + struct page *page;
> >> + kvm_pfn_t pfn;
> >> + int level;
> >> +
> >> + hva = gfn_to_hva_memslot(memslot, gpa_to_gfn(ipa));
> >> + vma = vma_lookup(current->mm, hva);
> >> + if (!vma) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + if (is_vm_hugetlb_page(vma))
> >> + vma_shift = huge_page_shift(hstate_vma(vma));
> >> + else
> >> + vma_shift = PAGE_SHIFT;
> >> +
> >> + map_size = 1 << vma_shift;
> >> +
> >> + /*
> >> + * FIXME: This causes over mapping, but there's no good
> >> + * solution here with the ABI as it stands
> >> + */
> >> + ipa = ALIGN_DOWN(ipa, map_size);
> >> +
> >> + switch (map_size) {
> >> + case RME_L2_BLOCK_SIZE:
> >> + level = 2;
> >> + break;
> >> + case PAGE_SIZE:
> >> + level = 3;
> >> + break;
> >> + default:
> >> + WARN_ONCE(1, "Unsupport vma_shift %d", vma_shift);
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + pfn = gfn_to_pfn_memslot(memslot, gpa_to_gfn(ipa));
> >> +
> >> + if (is_error_pfn(pfn)) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + ret = rmi_rtt_init_ripas(rd, ipa, level);
> >> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> >> + ret = realm_create_rtt_levels(realm, ipa,
> >> + RMI_RETURN_INDEX(ret),
> >> + level, NULL);
> >> + if (ret)
> >> + break;
> >> + ret = rmi_rtt_init_ripas(rd, ipa, level);
> >> + if (ret) {
> >> + ret = -ENXIO;
> >> + break;
> >> + }
> >> + }
> >> +
> >> + if (level < RME_RTT_MAX_LEVEL) {
> >> + /*
> >> + * A temporary RTT is needed during the map, precreate
> >> + * it, however if there is an error (e.g. missing
> >> + * parent tables) this will be handled in the
> >> + * realm_create_protected_data_page() call.
> >> + */
> >> + realm_create_rtt_levels(realm, ipa, level,
> >> + RME_RTT_MAX_LEVEL, NULL);
> >> + }
> >> +
> >> + page = pfn_to_page(pfn);
> >> +
> >> + for (offset = 0; offset < map_size && !ret;
> >> + offset += PAGE_SIZE, page++) {
> >> + phys_addr_t page_ipa = ipa + offset;
> >> +
> >> + ret = realm_create_protected_data_page(realm, page_ipa,
> >> + page, tmp_page);
> >> + }
> >> + if (ret)
> >> + goto err_release_pfn;
> >> +
> >> + if (level == 2) {
> >> + ret = fold_rtt(rd, ipa, level, realm);
> >> + if (ret)
> >> + goto err_release_pfn;
> >> + }
> >> +
> >> + ipa += map_size;
> >
> >> + kvm_set_pfn_accessed(pfn);
> >> + kvm_set_pfn_dirty(pfn);
> >
> > kvm_release_pfn_dirty() has already called kvm_set_pfn_{accessed, dirty}().
>
> Will remove those calls.
>
> >> + kvm_release_pfn_dirty(pfn);
> >> +err_release_pfn:
> >> + if (ret) {
> >> + kvm_release_pfn_clean(pfn);
> >> + break;
> >> + }
> >> + }
> >> +
> >> + mmap_read_unlock(current->mm);
> >> + __free_page(tmp_page);
> >> +
> >> +out:
> >> + srcu_read_unlock(&kvm->srcu, idx);
> >> + return ret;
> >> +}
> >> +
> >> +static int kvm_populate_realm(struct kvm *kvm,
> >> + struct kvm_cap_arm_rme_populate_realm_args *args)
> >> +{
> >> + phys_addr_t ipa_base, ipa_end;
> >> +
> >
> > Check kvm_is_realm(kvm) here or in the kvm_realm_enable_cap().
>
> I'm going to update kvm_vm_ioctl_enable_cap() to check kvm_is_realm() so
> we won't get here.
>
> >> + if (kvm_realm_state(kvm) != REALM_STATE_NEW)
> >> + return -EBUSY;
> >
> > Maybe -EINVAL? The realm hasn't been created (RMI_REALM_CREATE is not called
> > yet). The userspace shouldn't reach this path.
>
> Well user space can attempt to populate in the ACTIVE state - which is
> where the idea of 'busy' comes from. Admittedly it's a little confusing
> when RMI_REALM_CREATE hasn't been called.
>
> I'm not particularly bothered about the return code, but it's useful to
> have a different code to -EINVAL as it's not an invalid argument, but
> calling at the wrong time. I can't immediately see a better error code
> though.
>
The reason why I feel -EBUSY is a little bit off is that EBUSY usually
indicates something is already initialized and currently running, and then
another calling path wants to operate on it.

I took a look at the ioctls in arch/arm64/kvm/arm.c. It seems people have
different opinions on calling an execution path at the wrong time:

For example:

long kvm_arch_vcpu_ioctl()
...
	case KVM_GET_REG_LIST: {
		struct kvm_reg_list __user *user_list = argp;
		struct kvm_reg_list reg_list;
		unsigned n;

		r = -ENOEXEC;
		if (unlikely(!kvm_vcpu_initialized(vcpu)))
			break;

		r = -EPERM;
		if (!kvm_arm_vcpu_is_finalized(vcpu))
			break;

If we have to choose one, I prefer -ENOEXEC as -EPERM is stranger. But
personally my vote goes to -EINVAL.

> Steve
>
> >> +
> >> + if (!IS_ALIGNED(args->populate_ipa_base, PAGE_SIZE) ||
> >> + !IS_ALIGNED(args->populate_ipa_size, PAGE_SIZE))
> >> + return -EINVAL;
> >> +
> >> + ipa_base = args->populate_ipa_base;
> >> + ipa_end = ipa_base + args->populate_ipa_size;
> >> +
> >> + if (ipa_end < ipa_base)
> >> + return -EINVAL;
> >> +
> >> + return populate_par_region(kvm, ipa_base, ipa_end);
> >> +}
> >> +
> >> static int set_ipa_state(struct kvm_vcpu *vcpu,
> >> unsigned long ipa,
> >> unsigned long end,
> >> @@ -748,6 +1102,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> >> r = kvm_init_ipa_range_realm(kvm, &args);
> >> break;
> >> }
> >> + case KVM_CAP_ARM_RME_POPULATE_REALM: {
> >> + struct kvm_cap_arm_rme_populate_realm_args args;
> >> + void __user *argp = u64_to_user_ptr(cap->args[1]);
> >> +
> >> + if (copy_from_user(&args, argp, sizeof(args))) {
> >> + r = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + r = kvm_populate_realm(kvm, &args);
> >> + break;
> >> + }
> >> default:
> >> r = -EINVAL;
> >> break;
> >
>


2023-03-14 15:45:32

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 15/28] KVM: arm64: Handle realm MMIO emulation

On Fri, 10 Mar 2023 15:47:14 +0000
Steven Price <[email protected]> wrote:

> On 06/03/2023 15:37, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:19 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> MMIO emulation for a realm cannot be done directly with the VM's
> >> registers as they are protected from the host. However the RMM interface
> >> provides a structure member for providing the read/written value and
> >
> > More details would be better for helping the review. I can only see the
> > emulated mmio value from the device model (kvmtool or kvm_io_bus) is put into
> > the GPRS[0] of the RecEntry object. But the rest of the flow is missing.
>
> The commit message is out of date (sorry about that). A previous version
> of the spec had a dedicated member for the read/write value, but this
> was changed to just use GPRS[0] as you've spotted. I'll update the text.
>
> > I guess RMM copies the value in the RecEntry.GPRS[0] to the target GPR in the
> > guest context in RMI_REC_ENTER when seeing RMI_EMULATED_MMIO. This is for
> > the guest MMIO read path.
>
> Yes, when entering the guest after an (emulatable) read data abort the
> value in GPRS[0] is loaded from the RecEntry structure into the
> appropriate register for the guest.
>
> > How about the MMIO write path? I don't see where the RecExit.GPRS[0] is loaded
> > to a variable and returned to the userspace.
>

-----
> The RMM will populate GPRS[0] with the written value in this case (even
> if another register was actually used in the instruction). We then
> transfer that to the usual VCPU structure so that the normal fault
> handling logic works.
>
-----

Are these in this patch or another patch?

> >> we can transfer this to the appropriate VCPU's register entry and then
> >> depend on the generic MMIO handling code in KVM.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/kvm/mmio.c | 7 +++++++
> >> 1 file changed, 7 insertions(+)
> >>
> >> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> >> index 3dd38a151d2a..c4879fa3a8d3 100644
> >> --- a/arch/arm64/kvm/mmio.c
> >> +++ b/arch/arm64/kvm/mmio.c
> >> @@ -6,6 +6,7 @@
> >>
> >> #include <linux/kvm_host.h>
> >> #include <asm/kvm_emulate.h>
> >> +#include <asm/rmi_smc.h>
> >> #include <trace/events/kvm.h>
> >>
> >> #include "trace.h"
> >> @@ -109,6 +110,9 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
> >> &data);
> >> data = vcpu_data_host_to_guest(vcpu, data, len);
> >> vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
> >> +
> >> + if (vcpu_is_rec(vcpu))
> >> + vcpu->arch.rec.run->entry.gprs[0] = data;
> >
> > I think the guest context is maintained by RMM (while KVM can only touch
> > Rec{Entry, Exit} object) so that guest context in the legacy VHE mode is
> > unused.
> >
> > If yes, I guess here is should be:
> >
> > if (unlikely(vcpu_is_rec(vcpu)))
> > vcpu->arch.rec.run->entry.gprs[0] = data;
> > else
> > vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>
> Correct. Although there's no harm in updating with vcpu_set_reg(). But
> I'll make the change because it's clearer.
>
> >> }
> >>
> >> /*
> >> @@ -179,6 +183,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> >> run->mmio.len = len;
> >> vcpu->mmio_needed = 1;
> >>
> >> + if (vcpu_is_rec(vcpu))
> >> + vcpu->arch.rec.run->entry.flags |= RMI_EMULATED_MMIO;
> >> +
> >
> > Wouldn't it be better to set this in the kvm_handle_mmio_return where the MMIO
> > read emulation has been surely successful?
>
> Yes, that makes sense - I'll move this.
>
> Thanks,
>
> Steve
>
> >> if (!ret) {
> >> /* We handled the access successfully in the kernel. */
> >> if (!is_write)
> >
>


2023-03-14 16:41:35

by Zhi Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 17/28] arm64: RME: Runtime faulting of memory

On Fri, 10 Mar 2023 15:47:19 +0000
Steven Price <[email protected]> wrote:

> On 06/03/2023 18:20, Zhi Wang wrote:
> > On Fri, 27 Jan 2023 11:29:21 +0000
> > Steven Price <[email protected]> wrote:
> >
> >> At runtime if the realm guest accesses memory which hasn't yet been
> >> mapped then KVM needs to either populate the region or fault the guest.
> >>
> >> For memory in the lower (protected) region of IPA a fresh page is
> >> provided to the RMM which will zero the contents. For memory in the
> >> upper (shared) region of IPA, the memory from the memslot is mapped
> >> into the realm VM non secure.
> >>
> >> Signed-off-by: Steven Price <[email protected]>
> >> ---
> >> arch/arm64/include/asm/kvm_emulate.h | 10 +++++
> >> arch/arm64/include/asm/kvm_rme.h | 12 ++++++
> >> arch/arm64/kvm/mmu.c | 64 +++++++++++++++++++++++++---
> >> arch/arm64/kvm/rme.c | 48 +++++++++++++++++++++
> >> 4 files changed, 128 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> >> index 285e62914ca4..3a71b3d2e10a 100644
> >> --- a/arch/arm64/include/asm/kvm_emulate.h
> >> +++ b/arch/arm64/include/asm/kvm_emulate.h
> >> @@ -502,6 +502,16 @@ static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> >> return READ_ONCE(kvm->arch.realm.state);
> >> }
> >>
> >> +static inline gpa_t kvm_gpa_stolen_bits(struct kvm *kvm)
> >> +{
> >> + if (kvm_is_realm(kvm)) {
> >> + struct realm *realm = &kvm->arch.realm;
> >> +
> >> + return BIT(realm->ia_bits - 1);
> >> + }
> >> + return 0;
> >> +}
> >> +
> >
> > "stolen" seems a little bit vague. Maybe "shared" bit would be better as
> > SEV-SNP has C bit and TDX has shared bit. It would be nice to align with
> > the common knowledge.
>
> The Arm CCA term is the "protected" bit[1] - although the bit is
> backwards as it's cleared to indicate protected... so not ideal naming! ;)
>
> But it's termed 'stolen' here as it's effectively removed from the set
> > of value address bits. And this function is returning a mask of the bits
> that are not available as address bits. The naming was meant to be
> generic that this could encompass other features that need to reserve
> IPA bits.
>
> But it's possible this is too generic and perhaps we should just deal
> with a single bit rather than potential masks. Alternatively we could
> invert this and return a set of valid bits:
>
> static inline gpa_t kvm_gpa_valid_bits(struct kvm *kvm)
> {
> if (kvm_is_realm(kvm)) {
> struct realm *realm = &kvm->arch.realm;
>
> return ~BIT(realm->ia_bits - 1);
> }
> return ~(gpa_t)0;
> }
>
> That would at least match the current usage where the inverse is what we
> need.
>
> So do SEV-SNP or TDX have a concept of a mask to apply to addresses from
> the guest? Can we steal any existing terms?
>

At a general level, they are using "shared"/"private". TDX uses a
function kvm_gfn_shared_mask() to get the mask and three other macros to
apply the mask to a GPA (IPA)[1]. SEV-SNP reuses SME macros, e.g.
__sme_clr(), to apply the mask[2].

Guess we can take them as a reference: use an inline function to get
the protected bit mask, like kvm_ipa_protected_mask(), with the spec text
you pasted in a comment above the function. The name echoes the spec
description.

Then add the other necessary functions, like kvm_gpa_{is, to}_{shared, private},
which apply the mask to a GPA (IPA), to echo the terms used in generic
KVM. (Guess we can refine that with realm_is_addr_protected().)

[1] https://www.spinics.net/lists/kernel/msg4718104.html
[2] https://lore.kernel.org/lkml/[email protected]/
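
For illustration, a rough sketch of what such helpers could look like
(the names are only suggestions and none of this is code from the
series; the mask derivation just follows kvm_gpa_stolen_bits() above,
where a set top bit selects the unprotected/shared alias):

/*
 * Sketch only - hypothetical helper names, not defined in this series.
 */
static inline gpa_t kvm_ipa_protected_mask(struct kvm *kvm)
{
	/*
	 * "Software in a Realm should treat the most significant bit of
	 * an IPA as a protection attribute."
	 */
	if (kvm_is_realm(kvm))
		return BIT(kvm->arch.realm.ia_bits - 1);
	return 0;
}

static inline bool kvm_gpa_is_shared(struct kvm *kvm, gpa_t gpa)
{
	return !!(gpa & kvm_ipa_protected_mask(kvm));
}

static inline gpa_t kvm_gpa_to_private(struct kvm *kvm, gpa_t gpa)
{
	return gpa & ~kvm_ipa_protected_mask(kvm);
}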

>
> [1] Technically the spec only states: "Software in a Realm should treat
> the most significant bit of an IPA as a protection attribute." I don't
> think the bit is directly referred to in the spec as anything other than
> "the most significant bit". Although that in itself is confusing as it
> is the most significant *active* bit (i.e. the configured IPA size
> changes which bit is used).
>
> > Also, it would be nice to change the name of gpa_stolen_mask accordingly.
> >
> >> static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> >> {
> >> if (static_branch_unlikely(&kvm_rme_is_available))
> >> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> >> index 9d1583c44a99..303e4a5e5704 100644
> >> --- a/arch/arm64/include/asm/kvm_rme.h
> >> +++ b/arch/arm64/include/asm/kvm_rme.h
> >> @@ -50,6 +50,18 @@ void kvm_destroy_rec(struct kvm_vcpu *vcpu);
> >> int kvm_rec_enter(struct kvm_vcpu *vcpu);
> >> int handle_rme_exit(struct kvm_vcpu *vcpu, int rec_run_status);
> >>
> >> +void kvm_realm_unmap_range(struct kvm *kvm, unsigned long ipa, u64 size);
> >> +int realm_map_protected(struct realm *realm,
> >> + unsigned long hva,
> >> + unsigned long base_ipa,
> >> + struct page *dst_page,
> >> + unsigned long map_size,
> >> + struct kvm_mmu_memory_cache *memcache);
> >> +int realm_map_non_secure(struct realm *realm,
> >> + unsigned long ipa,
> >> + struct page *page,
> >> + unsigned long map_size,
> >> + struct kvm_mmu_memory_cache *memcache);
> >> int realm_set_ipa_state(struct kvm_vcpu *vcpu,
> >> unsigned long addr, unsigned long end,
> >> unsigned long ripas);
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index f29558c5dcbc..5417c273861b 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -235,8 +235,13 @@ static void __unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64
> >>
> >> lockdep_assert_held_write(&kvm->mmu_lock);
> >> WARN_ON(size & ~PAGE_MASK);
> >> - WARN_ON(stage2_apply_range(kvm, start, end, kvm_pgtable_stage2_unmap,
> >> - may_block));
> >> +
> >> + if (kvm_is_realm(kvm))
> >> + kvm_realm_unmap_range(kvm, start, size);
> >> + else
> >> + WARN_ON(stage2_apply_range(kvm, start, end,
> >> + kvm_pgtable_stage2_unmap,
> >> + may_block));
> >> }
> >>
> >> static void unmap_stage2_range(struct kvm_s2_mmu *mmu, phys_addr_t start, u64 size)
> >> @@ -250,7 +255,11 @@ static void stage2_flush_memslot(struct kvm *kvm,
> >> phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
> >> phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
> >>
> >> - stage2_apply_range_resched(kvm, addr, end, kvm_pgtable_stage2_flush);
> >> + if (kvm_is_realm(kvm))
> >> + kvm_realm_unmap_range(kvm, addr, end - addr);
> >> + else
> >> + stage2_apply_range_resched(kvm, addr, end,
> >> + kvm_pgtable_stage2_flush);
> >> }
> >>
> >> /**
> >> @@ -818,6 +827,10 @@ void stage2_unmap_vm(struct kvm *kvm)
> >> struct kvm_memory_slot *memslot;
> >> int idx, bkt;
> >>
> >> + /* For realms this is handled by the RMM so nothing to do here */
> >> + if (kvm_is_realm(kvm))
> >> + return;
> >> +
> >> idx = srcu_read_lock(&kvm->srcu);
> >> mmap_read_lock(current->mm);
> >> write_lock(&kvm->mmu_lock);
> >> @@ -840,6 +853,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> >> pgt = mmu->pgt;
> >> if (kvm_is_realm(kvm) &&
> >> kvm_realm_state(kvm) != REALM_STATE_DYING) {
> >> + unmap_stage2_range(mmu, 0, (~0ULL) & PAGE_MASK);
> >> write_unlock(&kvm->mmu_lock);
> >> kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
> >> pgt->start_level);
> >> @@ -1190,6 +1204,24 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
> >> return vma->vm_flags & VM_MTE_ALLOWED;
> >> }
> >>
> >> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa, unsigned long hva,
> >> + kvm_pfn_t pfn, unsigned long map_size,
> >> + enum kvm_pgtable_prot prot,
> >> + struct kvm_mmu_memory_cache *memcache)
> >> +{
> >> + struct realm *realm = &kvm->arch.realm;
> >> + struct page *page = pfn_to_page(pfn);
> >> +
> >> + if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
> >> + return -EFAULT;
> >> +
> >> + if (!realm_is_addr_protected(realm, ipa))
> >> + return realm_map_non_secure(realm, ipa, page, map_size,
> >> + memcache);
> >> +
> >> + return realm_map_protected(realm, hva, ipa, page, map_size, memcache);
> >> +}
> >> +
> >> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> struct kvm_memory_slot *memslot, unsigned long hva,
> >> unsigned long fault_status)
> >> @@ -1210,9 +1242,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> unsigned long vma_pagesize, fault_granule;
> >> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> >> struct kvm_pgtable *pgt;
> >> + gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
> >>
> >> fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
> >> write_fault = kvm_is_write_fault(vcpu);
> >> +
> >> + /* Realms cannot map read-only */
> >
> > Out of curiosity, why? It would be nice to have more explanation in the
> > comment.
>
> The RMM specification doesn't support mapping protected memory read
> only. I don't believe there is any reason why it couldn't, but equally I
> don't think there are any use cases for a guest needing read-only pages so
> this just isn't supported by the RMM. Since the page is necessarily
> taken away from the host it's fairly irrelevant (from the host's
> perspective) whether it is actually read only or not.
>
> However, this is technically wrong for the case of unprotected (shared)
> pages - it should be possible to map those read only. But I need to have
> a think about how to fix that up.

If the fault IPA carries the protected bit, can't we do something like:

if (vcpu_is_rec(vcpu) && fault_ipa_is_protected)
	write_fault = true;

Are there still other gaps?
>
> >> + if (vcpu_is_rec(vcpu))
> >> + write_fault = true;
> >> +
> >> exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> >> VM_BUG_ON(write_fault && exec_fault);
> >>
> >> @@ -1272,7 +1310,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
> >> fault_ipa &= ~(vma_pagesize - 1);
> >>
> >> - gfn = fault_ipa >> PAGE_SHIFT;
> >> + gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> >> mmap_read_unlock(current->mm);
> >>
> >> /*
> >> @@ -1345,7 +1383,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> * If we are not forced to use page mapping, check if we are
> >> * backed by a THP and thus use block mapping if possible.
> >> */
> >> - if (vma_pagesize == PAGE_SIZE && !(force_pte || device)) {
> >> + /* FIXME: We shouldn't need to disable this for realms */
> >> + if (vma_pagesize == PAGE_SIZE && !(force_pte || device || kvm_is_realm(kvm))) {
> >
> > Why do we have to disable this temporarily?
>
> The current uABI (not using memfd) has some serious issues regarding
> huge page support. KVM normally follows the user space mappings of the
> memslot - so if user space has a huge page (transparent or hugetlbs)
> then stage 2 for the guest also gets one.
>
> However realms sometimes require that the stage 2 differs. The main
> examples are:
>
> * RIPAS - if part of a huge page is RIPAS_RAM and part RIPAS_EMPTY then
> the huge page would have to be split.
>
> * Initially populated memory: basically the same as above - if the
> populated memory doesn't perfectly align with huge pages, then the
> head/tail pages would need to be broken up.
>
> Removing this hack allows the huge pages to be created in stage 2, but
> then causes overmapping of the initial contents; later on, when the
> VMM (or guest) attempts to change the properties of the misaligned tail,
> it gets an error because the pages are already present in stage 2.
>
> The planned solution to all this is to stop following the user space
> page tables and create huge pages opportunistically from the memfd that
> backs the protected range. For now this hack exists to avoid things
> "randomly" failing when e.g. the initial kernel image isn't huge page
> aligned. In theory it should be possible to make this work with the
> current uABI, but it's not worth it when we know we're replacing it.

I see. I will dig into it and see if any ideas come to mind.
>
> >> if (fault_status == FSC_PERM && fault_granule > PAGE_SIZE)
> >> vma_pagesize = fault_granule;
> >> else
> >> @@ -1382,6 +1421,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >> */
> >> if (fault_status == FSC_PERM && vma_pagesize == fault_granule)
> >> ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
> >> + else if (kvm_is_realm(kvm))
> >> + ret = realm_map_ipa(kvm, fault_ipa, hva, pfn, vma_pagesize,
> >> + prot, memcache);
> >> else
> >> ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
> >> __pfn_to_phys(pfn), prot,
> >> @@ -1437,6 +1479,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >> struct kvm_memory_slot *memslot;
> >> unsigned long hva;
> >> bool is_iabt, write_fault, writable;
> >> + gpa_t gpa_stolen_mask = kvm_gpa_stolen_bits(vcpu->kvm);
> >> gfn_t gfn;
> >> int ret, idx;
> >>
> >> @@ -1491,7 +1534,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >>
> >> idx = srcu_read_lock(&vcpu->kvm->srcu);
> >>
> >> - gfn = fault_ipa >> PAGE_SHIFT;
> >> + gfn = (fault_ipa & ~gpa_stolen_mask) >> PAGE_SHIFT;
> >> memslot = gfn_to_memslot(vcpu->kvm, gfn);
> >> hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
> >> write_fault = kvm_is_write_fault(vcpu);
> >> @@ -1536,6 +1579,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >> * of the page size.
> >> */
> >> fault_ipa |= kvm_vcpu_get_hfar(vcpu) & ((1 << 12) - 1);
> >> + fault_ipa &= ~gpa_stolen_mask;
> >> ret = io_mem_abort(vcpu, fault_ipa);
> >> goto out_unlock;
> >> }
> >> @@ -1617,6 +1661,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >> if (!kvm->arch.mmu.pgt)
> >> return false;
> >>
> >
> > Does the unprotected (shared) region of a realm support aging?
>
> In theory this should be possible to support by unmapping the NS entry
> and handling the fault. But the hardware access flag optimisation isn't
> available with the RMM, and the overhead of RMI calls to unmap/map could
> be significant.
>
> For now this isn't something we've looked at, but I guess it might be
> worth trying out when we have some real hardware to benchmark on.
>
> >> + /* We don't support aging for Realms */
> >> + if (kvm_is_realm(kvm))
> >> + return true;
> >> +
> >> WARN_ON(size != PAGE_SIZE && size != PMD_SIZE && size != PUD_SIZE);
> >>
> >> kpte = kvm_pgtable_stage2_mkold(kvm->arch.mmu.pgt,
> >> @@ -1630,6 +1678,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> >> if (!kvm->arch.mmu.pgt)
> >> return false;
> >>
> >> + /* We don't support aging for Realms */
> >> + if (kvm_is_realm(kvm))
> >> + return true;
> >> +
> >> return kvm_pgtable_stage2_is_young(kvm->arch.mmu.pgt,
> >> range->start << PAGE_SHIFT);
> >> }
> >> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> >> index 3405b43e1421..3d46191798e5 100644
> >> --- a/arch/arm64/kvm/rme.c
> >> +++ b/arch/arm64/kvm/rme.c
> >> @@ -608,6 +608,54 @@ int realm_map_protected(struct realm *realm,
> >> return -ENXIO;
> >> }
> >>
> >> +int realm_map_non_secure(struct realm *realm,
> >> + unsigned long ipa,
> >> + struct page *page,
> >> + unsigned long map_size,
> >> + struct kvm_mmu_memory_cache *memcache)
> >> +{
> >> + phys_addr_t rd = virt_to_phys(realm->rd);
> >> + int map_level;
> >> + int ret = 0;
> >> + unsigned long desc = page_to_phys(page) |
> >> + PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) |
> >> + /* FIXME: Read+Write permissions for now */
> > Why can't we handle the prot from realm_map_ipa()? Work in progress? :)
>
> Yes, work in progress - this comes from the "Realms cannot map
> read-only" in user_mem_abort() above. Since all faults are treated as
> write faults we need to upgrade to read/write here too.
>
> The prot in realm_map_ipa isn't actually useful currently because we
> simply WARN_ON and return if it doesn't have PROT_W. Again this needs to
> be fixed! It's on my todo list ;)
>
> Steve
>
> >> + (3 << 6) |
> >> + PTE_SHARED;
> >> +
> >> + if (WARN_ON(!IS_ALIGNED(ipa, map_size)))
> >> + return -EINVAL;
> >> +
> >> + switch (map_size) {
> >> + case PAGE_SIZE:
> >> + map_level = 3;
> >> + break;
> >> + case RME_L2_BLOCK_SIZE:
> >> + map_level = 2;
> >> + break;
> >> + default:
> >> + return -EINVAL;
> >> + }
> >> +
> >> + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
> >> +
> >> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
> >> + /* Create missing RTTs and retry */
> >> + int level = RMI_RETURN_INDEX(ret);
> >> +
> >> + ret = realm_create_rtt_levels(realm, ipa, level, map_level,
> >> + memcache);
> >> + if (WARN_ON(ret))
> >> + return -ENXIO;
> >> +
> >> + ret = rmi_rtt_map_unprotected(rd, ipa, map_level, desc);
> >> + }
> >> + if (WARN_ON(ret))
> >> + return -ENXIO;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> static int populate_par_region(struct kvm *kvm,
> >> phys_addr_t ipa_base,
> >> phys_addr_t ipa_end)
> >
>


2023-03-22 12:03:22

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 16/28] arm64: RME: Allow populating initial contents

On 14/03/2023 15:31, Zhi Wang wrote:
> On Fri, 10 Mar 2023 15:47:16 +0000
> Steven Price <[email protected]> wrote:
>
>> On 06/03/2023 17:34, Zhi Wang wrote:
>>> On Fri, 27 Jan 2023 11:29:20 +0000
>>> Steven Price <[email protected]> wrote:

<snip>

>>>> + if (kvm_realm_state(kvm) != REALM_STATE_NEW)
>>>> + return -EBUSY;
>>>
>>> Maybe -EINVAL? The realm hasn't been created (RMI_REALM_CREATE is not called
>>> yet). The userspace shouldn't reach this path.
>>
>> Well user space can attempt to populate in the ACTIVE state - which is
>> where the idea of 'busy' comes from. Admittedly it's a little confusing
>> when RMI_REALM_CREATE hasn't been called.
>>
>> I'm not particularly bothered about the return code, but it's useful to
>> have a different code to -EINVAL as it's not an invalid argument, but
>> calling at the wrong time. I can't immediately see a better error code
>> though.
>>
> The reason why I feel -EBUSY is a little bit off is that EBUSY usually
> indicates something is already initialized and currently running, and
> then another calling path wants to operate on it.
>
> I took a look at the ioctls in arch/arm64/kvm/arm.c. It seems people have
> different opinions on what to return when a path is called at the wrong time:
>
> For example:
>
> long kvm_arch_vcpu_ioctl()
> ...
> case KVM_GET_REG_LIST: {
> struct kvm_reg_list __user *user_list = argp;
> struct kvm_reg_list reg_list;
> unsigned n;
>
> r = -ENOEXEC;
> if (unlikely(!kvm_vcpu_initialized(vcpu)))
> break;
>
> r = -EPERM;
> if (!kvm_arm_vcpu_is_finalized(vcpu))
> break;
>
> If we have to choose one, I prefer -ENOEXEC as -EPERM is stranger. But
> personally my vote goes to -EINVAL.

Ok, I think you've convinced me - I'll change to -EINVAL. It is invalid
use of the API and none of the other error codes seem a great fit.

Although I do wish Linux had more descriptive error codes - I often end
up peppering the kernel with a few printks when using a new API to find
out what I'm doing wrong.

Steve

>> Steve
>>
>>>> +
>>>> + if (!IS_ALIGNED(args->populate_ipa_base, PAGE_SIZE) ||
>>>> + !IS_ALIGNED(args->populate_ipa_size, PAGE_SIZE))
>>>> + return -EINVAL;
>>>> +
>>>> + ipa_base = args->populate_ipa_base;
>>>> + ipa_end = ipa_base + args->populate_ipa_size;
>>>> +
>>>> + if (ipa_end < ipa_base)
>>>> + return -EINVAL;
>>>> +
>>>> + return populate_par_region(kvm, ipa_base, ipa_end);
>>>> +}
>>>> +
>>>> static int set_ipa_state(struct kvm_vcpu *vcpu,
>>>> unsigned long ipa,
>>>> unsigned long end,
>>>> @@ -748,6 +1102,18 @@ int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
>>>> r = kvm_init_ipa_range_realm(kvm, &args);
>>>> break;
>>>> }
>>>> + case KVM_CAP_ARM_RME_POPULATE_REALM: {
>>>> + struct kvm_cap_arm_rme_populate_realm_args args;
>>>> + void __user *argp = u64_to_user_ptr(cap->args[1]);
>>>> +
>>>> + if (copy_from_user(&args, argp, sizeof(args))) {
>>>> + r = -EFAULT;
>>>> + break;
>>>> + }
>>>> +
>>>> + r = kvm_populate_realm(kvm, &args);
>>>> + break;
>>>> + }
>>>> default:
>>>> r = -EINVAL;
>>>> break;
>>>
>>
>

2023-03-22 12:05:36

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 15/28] KVM: arm64: Handle realm MMIO emulation

On 14/03/2023 15:44, Zhi Wang wrote:
> On Fri, 10 Mar 2023 15:47:14 +0000
> Steven Price <[email protected]> wrote:
>
>> On 06/03/2023 15:37, Zhi Wang wrote:
>>> On Fri, 27 Jan 2023 11:29:19 +0000
>>> Steven Price <[email protected]> wrote:
>>>
>>>> MMIO emulation for a realm cannot be done directly with the VM's
>>>> registers as they are protected from the host. However the RMM interface
>>>> provides a structure member for providing the read/written value and
>>>
>>> More details would be better for helping the review. I can only see the
>>> emulated mmio value from the device model (kvmtool or kvm_io_bus) is put into
>>> the GPRS[0] of the RecEntry object. But the rest of the flow is missing.
>>
>> The commit message is out of date (sorry about that). A previous version
>> of the spec had a dedicated member for the read/write value, but this
>> was changed to just use GPRS[0] as you've spotted. I'll update the text.
>>
>>> I guess RMM copies the value in the RecEntry.GPRS[0] to the target GPR in the
>>> guest context in RMI_REC_ENTER when seeing RMI_EMULATED_MMIO. This is for
>>> the guest MMIO read path.
>>
>> Yes, when entering the guest after an (emulatable) read data abort the
>> value in GPRS[0] is loaded from the RecEntry structure into the
>> appropriate register for the guest.
>>
>>> How about the MMIO write path? I don't see where the RecExit.GPRS[0] is loaded
>>> to a variable and returned to the userspace.
>>
>
> -----
>> The RMM will populate GPRS[0] with the written value in this case (even
>> if another register was actually used in the instruction). We then
>> transfer that to the usual VCPU structure so that the normal fault
>> handling logic works.
>>
> -----
>
> Are these in this patch or another patch?

The RMM (not included in this particular series[1]) sets the first
element of the 'GPRS' array which is in memory shared with the host.

The Linux half to populate the vcpu structure is in the previous patch:

+static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
+{
+ struct rec *rec = &vcpu->arch.rec;
+
+ if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu))
+ vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu),
+ rec->run->exit.gprs[0]);
+
+ return kvm_handle_guest_abort(vcpu);
+}

I guess it might make sense to pull the 'if' statement out of the
previous patch and into this one to keep all the MMIO code together.

Steve

[1] This Linux code is written against the RMM specification and in
theory could work with any RMM implementation. But the TF RMM is open
source, so I can point you at the assignment in the latest commit here:
https://git.trustedfirmware.org/TF-RMM/tf-rmm.git/tree/runtime/core/exit.c?id=d294b1b05e8d234d32684a982552aa2a17fb9157#n264

>>>> we can transfer this to the appropriate VCPU's register entry and then
>>>> depend on the generic MMIO handling code in KVM.
>>>>
>>>> Signed-off-by: Steven Price <[email protected]>
>>>> ---
>>>> arch/arm64/kvm/mmio.c | 7 +++++++
>>>> 1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
>>>> index 3dd38a151d2a..c4879fa3a8d3 100644
>>>> --- a/arch/arm64/kvm/mmio.c
>>>> +++ b/arch/arm64/kvm/mmio.c
>>>> @@ -6,6 +6,7 @@
>>>>
>>>> #include <linux/kvm_host.h>
>>>> #include <asm/kvm_emulate.h>
>>>> +#include <asm/rmi_smc.h>
>>>> #include <trace/events/kvm.h>
>>>>
>>>> #include "trace.h"
>>>> @@ -109,6 +110,9 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
>>>> &data);
>>>> data = vcpu_data_host_to_guest(vcpu, data, len);
>>>> vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>>>> +
>>>> + if (vcpu_is_rec(vcpu))
>>>> + vcpu->arch.rec.run->entry.gprs[0] = data;
>>>
>>> I think the guest context is maintained by RMM (while KVM can only touch
>>> Rec{Entry, Exit} object) so that guest context in the legacy VHE mode is
>>> unused.
>>>
>>> If yes, I guess here is should be:
>>>
>>> if (unlikely(vcpu_is_rec(vcpu)))
>>> vcpu->arch.rec.run->entry.gprs[0] = data;
>>> else
>>> vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>>
>> Correct. Although there's no harm in updating with vcpu_set_reg(). But
>> I'll make the change because it's clearer.
>>
>>>> }
>>>>
>>>> /*
>>>> @@ -179,6 +183,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
>>>> run->mmio.len = len;
>>>> vcpu->mmio_needed = 1;
>>>>
>>>> + if (vcpu_is_rec(vcpu))
>>>> + vcpu->arch.rec.run->entry.flags |= RMI_EMULATED_MMIO;
>>>> +
>>>
>>> Wouldn't it be better to set this in the kvm_handle_mmio_return where the MMIO
>>> read emulation has been surely successful?
>>
>> Yes, that makes sense - I'll move this.
>>
>> Thanks,
>>
>> Steve
>>
>>>> if (!ret) {
>>>> /* We handled the access successfully in the kernel. */
>>>> if (!is_write)
>>>
>>
>

2024-03-18 07:04:10

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [RFC PATCH 03/28] arm64: RME: Add wrappers for RMI calls


Hi Steven,

On 27-01-2023 04:59 pm, Steven Price wrote:
> The wrappers make the call sites easier to read and deal with the
> boiler plate of handling the error codes from the RMM.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/rmi_cmds.h | 259 ++++++++++++++++++++++++++++++
> 1 file changed, 259 insertions(+)
> create mode 100644 arch/arm64/include/asm/rmi_cmds.h
>
> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
> new file mode 100644
> index 000000000000..d5468ee46f35
> --- /dev/null
> +++ b/arch/arm64/include/asm/rmi_cmds.h
> @@ -0,0 +1,259 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#ifndef __ASM_RMI_CMDS_H
> +#define __ASM_RMI_CMDS_H
> +
> +#include <linux/arm-smccc.h>
> +
> +#include <asm/rmi_smc.h>
> +
> +struct rtt_entry {
> + unsigned long walk_level;
> + unsigned long desc;
> + int state;
> + bool ripas;
> +};
> +
> +static inline int rmi_data_create(unsigned long data, unsigned long rd,
> + unsigned long map_addr, unsigned long src,
> + unsigned long flags)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE, data, rd, map_addr, src,
> + flags, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_data_create_unknown(unsigned long data,
> + unsigned long rd,
> + unsigned long map_addr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_DATA_CREATE_UNKNOWN, data, rd, map_addr,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_data_destroy(unsigned long rd, unsigned long map_addr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_DATA_DESTROY, rd, map_addr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_features(unsigned long index, unsigned long *out)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_FEATURES, index, &res);
> +
> + *out = res.a1;
> + return res.a0;
> +}
> +
> +static inline int rmi_granule_delegate(unsigned long phys)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_GRANULE_DELEGATE, phys, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_granule_undelegate(unsigned long phys)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_GRANULE_UNDELEGATE, phys, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_psci_complete(unsigned long calling_rec,
> + unsigned long target_rec)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_PSCI_COMPLETE, calling_rec, target_rec,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_realm_activate(unsigned long rd)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REALM_ACTIVATE, rd, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_realm_create(unsigned long rd, unsigned long params_ptr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REALM_CREATE, rd, params_ptr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_realm_destroy(unsigned long rd)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REALM_DESTROY, rd, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_aux_count(unsigned long rd, unsigned long *aux_count)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_AUX_COUNT, rd, &res);
> +
> + *aux_count = res.a1;
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_create(unsigned long rec, unsigned long rd,
> + unsigned long params_ptr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_CREATE, rec, rd, params_ptr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_destroy(unsigned long rec)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_DESTROY, rec, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rec_enter(unsigned long rec, unsigned long run_ptr)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_REC_ENTER, rec, run_ptr, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_create(unsigned long rtt, unsigned long rd,
> + unsigned long map_addr, unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_CREATE, rtt, rd, map_addr, level,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_destroy(unsigned long rtt, unsigned long rd,
> + unsigned long map_addr, unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_DESTROY, rtt, rd, map_addr, level,
> + &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_fold(unsigned long rtt, unsigned long rd,
> + unsigned long map_addr, unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_FOLD, rtt, rd, map_addr, level, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_init_ripas(unsigned long rd, unsigned long map_addr,
> + unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_INIT_RIPAS, rd, map_addr, level, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_map_unprotected(unsigned long rd,
> + unsigned long map_addr,
> + unsigned long level,
> + unsigned long desc)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_MAP_UNPROTECTED, rd, map_addr, level,
> + desc, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_read_entry(unsigned long rd, unsigned long map_addr,
> + unsigned long level, struct rtt_entry *rtt)
> +{
> + struct arm_smccc_1_2_regs regs = {
> + SMC_RMI_RTT_READ_ENTRY,
> + rd, map_addr, level
> + };
> +
> + arm_smccc_1_2_smc(&regs, &regs);
> +
> + rtt->walk_level = regs.a1;
> + rtt->state = regs.a2 & 0xFF;
> + rtt->desc = regs.a3;
> + rtt->ripas = regs.a4 & 1;
> +
> + return regs.a0;
> +}
> +
> +static inline int rmi_rtt_set_ripas(unsigned long rd, unsigned long rec,
> + unsigned long map_addr, unsigned long level,
> + unsigned long ripas)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_SET_RIPAS, rd, rec, map_addr, level,
> + ripas, &res);
> +
> + return res.a0;
> +}
> +
> +static inline int rmi_rtt_unmap_unprotected(unsigned long rd,
> + unsigned long map_addr,
> + unsigned long level)
> +{
> + struct arm_smccc_res res;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_RTT_UNMAP_UNPROTECTED, rd, map_addr,
> + level, &res);
> +
> + return res.a0;
> +}
> +
> +static inline phys_addr_t rmi_rtt_get_phys(struct rtt_entry *rtt)
> +{
> + return rtt->desc & GENMASK(47, 12);
> +}
> +
> +#endif

Can we please replace all occurrences of "unsigned long" with u64?
Also, as per the spec, the RTT level is an Int64; can we change it accordingly?
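
For example, the suggested signature change might look roughly like this
(a sketch only, not code from the series; the cast keeps the existing
arm_smccc_1_1_invoke() call unchanged):

/* Sketch of the suggested prototype, using u64/s64 as per the spec. */
static inline u64 rmi_rtt_create(u64 rtt, u64 rd, u64 map_addr, s64 level)
{
	struct arm_smccc_res res;

	arm_smccc_1_1_invoke(SMC_RMI_RTT_CREATE, rtt, rd, map_addr,
			     (unsigned long)level, &res);

	return res.a0;
}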

Please CC me on future CCA patch sets.
[email protected]


Thanks,
Ganapat

2024-03-18 07:17:56

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init



On 27-01-2023 04:59 pm, Steven Price wrote:
> Query the RMI version number and check if it is a compatible version. A
> static key is also provided to signal that a supported RMM is available.
>
> Functions are provided to query if a VM or VCPU is a realm (or rec)
> which currently will always return false.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
> arch/arm64/include/asm/kvm_host.h | 4 +++
> arch/arm64/include/asm/kvm_rme.h | 22 +++++++++++++
> arch/arm64/include/asm/virt.h | 1 +
> arch/arm64/kvm/Makefile | 3 +-
> arch/arm64/kvm/arm.c | 8 +++++
> arch/arm64/kvm/rme.c | 49 ++++++++++++++++++++++++++++
> 7 files changed, 103 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm64/include/asm/kvm_rme.h
> create mode 100644 arch/arm64/kvm/rme.c
>
> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
> index 9bdba47f7e14..5a2b7229e83f 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -490,4 +490,21 @@ static inline bool vcpu_has_feature(struct kvm_vcpu *vcpu, int feature)
> return test_bit(feature, vcpu->arch.features);
> }
>
> +static inline bool kvm_is_realm(struct kvm *kvm)
> +{
> + if (static_branch_unlikely(&kvm_rme_is_available))
> + return kvm->arch.is_realm;
> + return false;
> +}
> +
> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
> +{
> + return READ_ONCE(kvm->arch.realm.state);
> +}
> +
> +static inline bool vcpu_is_rec(struct kvm_vcpu *vcpu)
> +{
> + return false;
> +}
> +
> #endif /* __ARM64_KVM_EMULATE_H__ */
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 35a159d131b5..04347c3a8c6b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -26,6 +26,7 @@
> #include <asm/fpsimd.h>
> #include <asm/kvm.h>
> #include <asm/kvm_asm.h>
> +#include <asm/kvm_rme.h>
>
> #define __KVM_HAVE_ARCH_INTC_INITIALIZED
>
> @@ -240,6 +241,9 @@ struct kvm_arch {
> * the associated pKVM instance in the hypervisor.
> */
> struct kvm_protected_vm pkvm;
> +
> + bool is_realm;
> + struct realm realm;
> };
>
> struct kvm_vcpu_fault_info {
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> new file mode 100644
> index 000000000000..c26bc2c6770d
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -0,0 +1,22 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#ifndef __ASM_KVM_RME_H
> +#define __ASM_KVM_RME_H
> +
> +enum realm_state {
> + REALM_STATE_NONE,
> + REALM_STATE_NEW,
> + REALM_STATE_ACTIVE,
> + REALM_STATE_DYING
> +};
> +
> +struct realm {
> + enum realm_state state;
> +};
> +
> +int kvm_init_rme(void);
> +
> +#endif
> diff --git a/arch/arm64/include/asm/virt.h b/arch/arm64/include/asm/virt.h
> index 4eb601e7de50..be1383e26626 100644
> --- a/arch/arm64/include/asm/virt.h
> +++ b/arch/arm64/include/asm/virt.h
> @@ -80,6 +80,7 @@ void __hyp_set_vectors(phys_addr_t phys_vector_base);
> void __hyp_reset_vectors(void);
>
> DECLARE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> +DECLARE_STATIC_KEY_FALSE(kvm_rme_is_available);
>
> /* Reports the availability of HYP mode */
> static inline bool is_hyp_mode_available(void)
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 5e33c2d4645a..d2f0400c50da 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -20,7 +20,8 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-v3.o vgic/vgic-v4.o \
> vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> - vgic/vgic-its.o vgic/vgic-debug.o
> + vgic/vgic-its.o vgic/vgic-debug.o \
> + rme.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 9c5573bc4614..d97b39d042ab 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -38,6 +38,7 @@
> #include <asm/kvm_asm.h>
> #include <asm/kvm_mmu.h>
> #include <asm/kvm_pkvm.h>
> +#include <asm/kvm_rme.h>
> #include <asm/kvm_emulate.h>
> #include <asm/sections.h>
>
> @@ -47,6 +48,7 @@
>
> static enum kvm_mode kvm_mode = KVM_MODE_DEFAULT;
> DEFINE_STATIC_KEY_FALSE(kvm_protected_mode_initialized);
> +DEFINE_STATIC_KEY_FALSE(kvm_rme_is_available);
>
> DECLARE_KVM_HYP_PER_CPU(unsigned long, kvm_hyp_vector);
>
> @@ -2213,6 +2215,12 @@ int kvm_arch_init(void *opaque)
>
> in_hyp_mode = is_kernel_in_hyp_mode();
>
> + if (in_hyp_mode) {
> + err = kvm_init_rme();
> + if (err)
> + return err;
> + }
> +
> if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
> cpus_have_final_cap(ARM64_WORKAROUND_1508412))
> kvm_info("Guests without required CPU erratum workarounds can deadlock system!\n" \
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> new file mode 100644
> index 000000000000..f6b587bc116e
> --- /dev/null
> +++ b/arch/arm64/kvm/rme.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/kvm_host.h>
> +
> +#include <asm/rmi_cmds.h>
> +#include <asm/virt.h>
> +
> +static int rmi_check_version(void)
> +{
> + struct arm_smccc_res res;
> + int version_major, version_minor;
> +
> + arm_smccc_1_1_invoke(SMC_RMI_VERSION, &res);
> +
> + if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
> + return -ENXIO;
> +
> + version_major = RMI_ABI_VERSION_GET_MAJOR(res.a0);
> + version_minor = RMI_ABI_VERSION_GET_MINOR(res.a0);
> +
> + if (version_major != RMI_ABI_MAJOR_VERSION) {
> + kvm_err("Unsupported RMI ABI (version %d.%d) we support %d\n",

Can we please replace "we support" to host supports.
Also in the patch present in the repo, you are using variable
our_version, can this be changed to host_version?
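
i.e. something along these lines (a sketch of the suggested wording only):

	kvm_err("Unsupported RMI ABI (version %d.%d), host supports %d\n",
		version_major, version_minor, RMI_ABI_MAJOR_VERSION);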

> + version_major, version_minor,
> + RMI_ABI_MAJOR_VERSION);
> + return -ENXIO;
> + }
> +
> + kvm_info("RMI ABI version %d.%d\n", version_major, version_minor);
> +
> + return 0;
> +}
> +
> +int kvm_init_rme(void)
> +{
> + if (PAGE_SIZE != SZ_4K)
> + /* Only 4k page size on the host is supported */
> + return 0;
> +
> + if (rmi_check_version())
> + /* Continue without realm support */
> + return 0;
> +
> + /* Future patch will enable static branch kvm_rme_is_available */
> +
> + return 0;
> +}

Thanks,
Ganapat

2024-03-18 07:41:06

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms



On 27-01-2023 04:59 pm, Steven Price wrote:
> Add the KVM_CAP_ARM_RME_CREATE_FD ioctl to create a realm. This involves
> delegating pages to the RMM to hold the Realm Descriptor (RD) and for
> the base level of the Realm Translation Tables (RTT). A VMID also needs
> to be picked; since the RMM has a separate VMID address space, a
> dedicated allocator is added for this purpose.
>
> KVM_CAP_ARM_RME_CONFIG_REALM is provided to allow configuring the realm
> before it is created.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 14 ++
> arch/arm64/kvm/arm.c | 19 ++
> arch/arm64/kvm/mmu.c | 6 +
> arch/arm64/kvm/reset.c | 33 +++
> arch/arm64/kvm/rme.c | 357 +++++++++++++++++++++++++++++++
> 5 files changed, 429 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index c26bc2c6770d..055a22accc08 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -6,6 +6,8 @@
> #ifndef __ASM_KVM_RME_H
> #define __ASM_KVM_RME_H
>
> +#include <uapi/linux/kvm.h>
> +
> enum realm_state {
> REALM_STATE_NONE,
> REALM_STATE_NEW,
> @@ -15,8 +17,20 @@ enum realm_state {
>
> struct realm {
> enum realm_state state;
> +
> + void *rd;
> + struct realm_params *params;
> +
> + unsigned long num_aux;
> + unsigned int vmid;
> + unsigned int ia_bits;
> };
>
> int kvm_init_rme(void);
> +u32 kvm_realm_ipa_limit(void);
> +
> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> +int kvm_init_realm_vm(struct kvm *kvm);
> +void kvm_destroy_realm(struct kvm *kvm);
>
> #endif
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index d97b39d042ab..50f54a63732a 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -103,6 +103,13 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> r = 0;
> set_bit(KVM_ARCH_FLAG_SYSTEM_SUSPEND_ENABLED, &kvm->arch.flags);
> break;
> + case KVM_CAP_ARM_RME:
> + if (!static_branch_unlikely(&kvm_rme_is_available))
> + return -EINVAL;
> + mutex_lock(&kvm->lock);
> + r = kvm_realm_enable_cap(kvm, cap);
> + mutex_unlock(&kvm->lock);
> + break;
> default:
> r = -EINVAL;
> break;
> @@ -172,6 +179,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
> */
> kvm->arch.dfr0_pmuver.imp = kvm_arm_pmu_get_pmuver_limit();
>
> + /* Initialise the realm bits after the generic bits are enabled */
> + if (kvm_is_realm(kvm)) {
> + ret = kvm_init_realm_vm(kvm);
> + if (ret)
> + goto err_free_cpumask;
> + }
> +
> return 0;
>
> err_free_cpumask:
> @@ -204,6 +218,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> kvm_destroy_vcpus(kvm);
>
> kvm_unshare_hyp(kvm, kvm + 1);
> +
> + kvm_destroy_realm(kvm);
> }
>
> int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> @@ -300,6 +316,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ARM_PTRAUTH_GENERIC:
> r = system_has_full_ptr_auth();
> break;
> + case KVM_CAP_ARM_RME:
> + r = static_key_enabled(&kvm_rme_is_available);
> + break;
> default:
> r = 0;
> }
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 31d7fa4c7c14..d0f707767d05 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -840,6 +840,12 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> struct kvm_pgtable *pgt = NULL;
>
> write_lock(&kvm->mmu_lock);
> + if (kvm_is_realm(kvm) &&
> + kvm_realm_state(kvm) != REALM_STATE_DYING) {
> + /* TODO: teardown rtts */
> + write_unlock(&kvm->mmu_lock);
> + return;
> + }
> pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index e0267f672b8a..c165df174737 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -395,3 +395,36 @@ int kvm_set_ipa_limit(void)
>
> return 0;
> }
> +
> +int kvm_arm_setup_stage2(struct kvm *kvm, unsigned long type)
> +{
> + u64 mmfr0, mmfr1;
> + u32 phys_shift;
> + u32 ipa_limit = kvm_ipa_limit;
> +
> + if (kvm_is_realm(kvm))
> + ipa_limit = kvm_realm_ipa_limit();
> +
> + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> + return -EINVAL;
> +
> + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
> + if (phys_shift) {
> + if (phys_shift > ipa_limit ||
> + phys_shift < ARM64_MIN_PARANGE_BITS)
> + return -EINVAL;
> + } else {
> + phys_shift = KVM_PHYS_SHIFT;
> + if (phys_shift > ipa_limit) {
> + pr_warn_once("%s using unsupported default IPA limit, upgrade your VMM\n",
> + current->comm);
> + return -EINVAL;
> + }
> + }
> +
> + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> + kvm->arch.vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
> +
> + return 0;
> +}
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index f6b587bc116e..9f8c5a91b8fc 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -5,9 +5,49 @@
>
> #include <linux/kvm_host.h>
>
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
> #include <asm/rmi_cmds.h>
> #include <asm/virt.h>
>
> +/************ FIXME: Copied from kvm/hyp/pgtable.c **********/
> +#include <asm/kvm_pgtable.h>
> +
> +struct kvm_pgtable_walk_data {
> + struct kvm_pgtable *pgt;
> + struct kvm_pgtable_walker *walker;
> +
> + u64 addr;
> + u64 end;
> +};
> +
> +static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
> +{
> + u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
> + u64 mask = BIT(pgt->ia_bits) - 1;
> +
> + return (addr & mask) >> shift;
> +}
> +
> +static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
> +{
> + struct kvm_pgtable pgt = {
> + .ia_bits = ia_bits,
> + .start_level = start_level,
> + };
> +
> + return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
> +}
> +
> +/******************/
> +
> +static unsigned long rmm_feat_reg0;
> +
> +static bool rme_supports(unsigned long feature)
> +{
> + return !!u64_get_bits(rmm_feat_reg0, feature);
> +}
> +
> static int rmi_check_version(void)
> {
> struct arm_smccc_res res;
> @@ -33,8 +73,319 @@ static int rmi_check_version(void)
> return 0;
> }
>
> +static unsigned long create_realm_feat_reg0(struct kvm *kvm)
> +{
> + unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> + u64 feat_reg0 = 0;
> +
> + int num_bps = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_NUM_BPS);
> + int num_wps = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_NUM_WPS);
> +
> + feat_reg0 |= u64_encode_bits(ia_bits, RMI_FEATURE_REGISTER_0_S2SZ);
> + feat_reg0 |= u64_encode_bits(num_bps, RMI_FEATURE_REGISTER_0_NUM_BPS);
> + feat_reg0 |= u64_encode_bits(num_wps, RMI_FEATURE_REGISTER_0_NUM_WPS);
> +
> + return feat_reg0;
> +}
> +
> +u32 kvm_realm_ipa_limit(void)
> +{
> + return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
> +}
> +
> +static u32 get_start_level(struct kvm *kvm)
> +{
> + long sl0 = FIELD_GET(VTCR_EL2_SL0_MASK, kvm->arch.vtcr);
> +
> + return VTCR_EL2_TGRAN_SL0_BASE - sl0;
> +}
> +
> +static int realm_create_rd(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct realm_params *params = realm->params;
> + void *rd = NULL;
> + phys_addr_t rd_phys, params_phys;
> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + unsigned int pgd_sz;
> + int i, r;
> +
> + if (WARN_ON(realm->rd) || WARN_ON(!realm->params))
> + return -EEXIST;
> +
> + rd = (void *)__get_free_page(GFP_KERNEL);
> + if (!rd)
> + return -ENOMEM;
> +
> + rd_phys = virt_to_phys(rd);
> + if (rmi_granule_delegate(rd_phys)) {
> + r = -ENXIO;
> + goto out;
> + }
> +
> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> + for (i = 0; i < pgd_sz; i++) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + if (rmi_granule_delegate(pgd_phys)) {
> + r = -ENXIO;
> + goto out_undelegate_tables;
> + }
> + }
> +
> + params->rtt_level_start = get_start_level(kvm);
> + params->rtt_num_start = pgd_sz;
> + params->rtt_base = kvm->arch.mmu.pgd_phys;
> + params->vmid = realm->vmid;
> +
> + params_phys = virt_to_phys(params);
> +
> + if (rmi_realm_create(rd_phys, params_phys)) {
> + r = -ENXIO;
> + goto out_undelegate_tables;
> + }
> +
> + realm->rd = rd;
> + realm->ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> +
> + if (WARN_ON(rmi_rec_aux_count(rd_phys, &realm->num_aux))) {
> + WARN_ON(rmi_realm_destroy(rd_phys));
> + goto out_undelegate_tables;
> + }
> +
> + return 0;
> +
> +out_undelegate_tables:
> + while (--i >= 0) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + WARN_ON(rmi_granule_undelegate(pgd_phys));
> + }
> + WARN_ON(rmi_granule_undelegate(rd_phys));
> +out:
> + free_page((unsigned long)rd);
> + return r;
> +}
> +
> +/* Protects access to rme_vmid_bitmap */
> +static DEFINE_SPINLOCK(rme_vmid_lock);
> +static unsigned long *rme_vmid_bitmap;
> +
> +static int rme_vmid_init(void)
> +{
> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> +
> + rme_vmid_bitmap = bitmap_zalloc(vmid_count, GFP_KERNEL);
> + if (!rme_vmid_bitmap) {
> + kvm_err("%s: Couldn't allocate rme vmid bitmap\n", __func__);
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +
> +static int rme_vmid_reserve(void)
> +{
> + int ret;
> + unsigned int vmid_count = 1 << kvm_get_vmid_bits();
> +
> + spin_lock(&rme_vmid_lock);
> + ret = bitmap_find_free_region(rme_vmid_bitmap, vmid_count, 0);
> + spin_unlock(&rme_vmid_lock);
> +
> + return ret;
> +}
> +
> +static void rme_vmid_release(unsigned int vmid)
> +{
> + spin_lock(&rme_vmid_lock);
> + bitmap_release_region(rme_vmid_bitmap, vmid, 0);
> + spin_unlock(&rme_vmid_lock);
> +}
> +
> +static int kvm_create_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + int ret;
> +
> + if (!kvm_is_realm(kvm) || kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EEXIST;
> +
> + ret = rme_vmid_reserve();
> + if (ret < 0)
> + return ret;
> + realm->vmid = ret;
> +
> + ret = realm_create_rd(kvm);
> + if (ret) {
> + rme_vmid_release(realm->vmid);
> + return ret;
> + }
> +
> + WRITE_ONCE(realm->state, REALM_STATE_NEW);
> +
> + /* The realm is up, free the parameters. */
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> +
> + return 0;
> +}
> +
> +static int config_realm_hash_algo(struct realm *realm,
> + struct kvm_cap_arm_rme_config_item *cfg)
> +{
> + switch (cfg->hash_algo) {
> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA256:
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_256))
> + return -EINVAL;
> + break;
> + case KVM_CAP_ARM_RME_MEASUREMENT_ALGO_SHA512:
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_HASH_SHA_512))
> + return -EINVAL;
> + break;
> + default:
> + return -EINVAL;
> + }
> + realm->params->measurement_algo = cfg->hash_algo;
> + return 0;
> +}
> +
> +static int config_realm_sve(struct realm *realm,
> + struct kvm_cap_arm_rme_config_item *cfg)
> +{
> + u64 features_0 = realm->params->features_0;
> + int max_sve_vq = u64_get_bits(rmm_feat_reg0,
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> +
> + if (!rme_supports(RMI_FEATURE_REGISTER_0_SVE_EN))
> + return -EINVAL;
> +
> + if (cfg->sve_vq > max_sve_vq)
> + return -EINVAL;
> +
> + features_0 &= ~(RMI_FEATURE_REGISTER_0_SVE_EN |
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> + features_0 |= u64_encode_bits(1, RMI_FEATURE_REGISTER_0_SVE_EN);
> + features_0 |= u64_encode_bits(cfg->sve_vq,
> + RMI_FEATURE_REGISTER_0_SVE_VL);
> +
> + realm->params->features_0 = features_0;
> + return 0;
> +}
> +
> +static int kvm_rme_config_realm(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + struct kvm_cap_arm_rme_config_item cfg;
> + struct realm *realm = &kvm->arch.realm;
> + int r = 0;
> +
> + if (kvm_realm_state(kvm) != REALM_STATE_NONE)
> + return -EBUSY;
> +
> + if (copy_from_user(&cfg, (void __user *)cap->args[1], sizeof(cfg)))
> + return -EFAULT;
> +
> + switch (cfg.cfg) {
> + case KVM_CAP_ARM_RME_CFG_RPV:
> + memcpy(&realm->params->rpv, &cfg.rpv, sizeof(cfg.rpv));
> + break;
> + case KVM_CAP_ARM_RME_CFG_HASH_ALGO:
> + r = config_realm_hash_algo(realm, &cfg);
> + break;
> + case KVM_CAP_ARM_RME_CFG_SVE:
> + r = config_realm_sve(realm, &cfg);
> + break;
> + default:
> + r = -EINVAL;
> + }
> +
> + return r;
> +}
> +
> +int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> +{
> + int r = 0;
> +
> + switch (cap->args[0]) {
> + case KVM_CAP_ARM_RME_CONFIG_REALM:
> + r = kvm_rme_config_realm(kvm, cap);
> + break;
> + case KVM_CAP_ARM_RME_CREATE_RD:
> + if (kvm->created_vcpus) {
> + r = -EBUSY;
> + break;
> + }
> +
> + r = kvm_create_realm(kvm);
> + break;
> + default:
> + r = -EINVAL;
> + break;
> + }
> +
> + return r;
> +}
> +
> +void kvm_destroy_realm(struct kvm *kvm)
> +{
> + struct realm *realm = &kvm->arch.realm;
> + struct kvm_pgtable *pgt = kvm->arch.mmu.pgt;
> + unsigned int pgd_sz;
> + int i;
> +
> + if (realm->params) {
> + free_page((unsigned long)realm->params);
> + realm->params = NULL;
> + }
> +
> + if (kvm_realm_state(kvm) == REALM_STATE_NONE)
> + return;
> +
> + WRITE_ONCE(realm->state, REALM_STATE_DYING);
> +
> + rme_vmid_release(realm->vmid);
> +
> + if (realm->rd) {
> + phys_addr_t rd_phys = virt_to_phys(realm->rd);
> +
> + if (WARN_ON(rmi_realm_destroy(rd_phys)))
> + return;
> + if (WARN_ON(rmi_granule_undelegate(rd_phys)))
> + return;
> + free_page((unsigned long)realm->rd);
> + realm->rd = NULL;
> + }
> +
> + pgd_sz = kvm_pgd_pages(pgt->ia_bits, pgt->start_level);
> + for (i = 0; i < pgd_sz; i++) {
> + phys_addr_t pgd_phys = kvm->arch.mmu.pgd_phys + i * PAGE_SIZE;
> +
> + if (WARN_ON(rmi_granule_undelegate(pgd_phys)))
> + return;
> + }
> +
> + kvm_free_stage2_pgd(&kvm->arch.mmu);
> +}
> +
> +int kvm_init_realm_vm(struct kvm *kvm)
> +{
> + struct realm_params *params;
> +
> + params = (struct realm_params *)get_zeroed_page(GFP_KERNEL);
> + if (!params)
> + return -ENOMEM;
> +
> + params->features_0 = create_realm_feat_reg0(kvm);
> + kvm->arch.realm.params = params;
> + return 0;
> +}
> +
> int kvm_init_rme(void)
> {
> + int ret;
> +
> if (PAGE_SIZE != SZ_4K)
> /* Only 4k page size on the host is supported */
> return 0;
> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
> /* Continue without realm support */
> return 0;
>
> + ret = rme_vmid_init();
> + if (ret)
> + return ret;
> +
> + WARN_ON(rmi_features(0, &rmm_feat_reg0));

Why WARN_ON? Isn't it enough to print an err/info message and keep
"kvm_rme_is_available" disabled?

IMO, we should print a message when RME is enabled; otherwise it should
be a silent return.
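
Something along these lines, perhaps (a sketch of the suggestion only,
not code from the series):

	/* Sketch: no WARN_ON - just return quietly, leaving realm support off. */
	if (rmi_features(0, &rmm_feat_reg0))
		return 0;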

> +
> /* Future patch will enable static branch kvm_rme_is_available */
>
> return 0;

Thanks,
Ganapat

2024-03-18 11:02:13

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [RFC PATCH 09/28] arm64: RME: RTT handling


On 27-01-2023 04:59 pm, Steven Price wrote:
> The RMM owns the stage 2 page tables for a realm, and KVM must request
> that the RMM creates/destroys entries as necessary. The physical pages
> to store the page tables are delegated to the realm as required, and can
> be undelegated when no longer used.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/include/asm/kvm_rme.h | 19 +++++
> arch/arm64/kvm/mmu.c | 7 +-
> arch/arm64/kvm/rme.c | 139 +++++++++++++++++++++++++++++++
> 3 files changed, 162 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_rme.h b/arch/arm64/include/asm/kvm_rme.h
> index a6318af3ed11..eea5118dfa8a 100644
> --- a/arch/arm64/include/asm/kvm_rme.h
> +++ b/arch/arm64/include/asm/kvm_rme.h
> @@ -35,5 +35,24 @@ u32 kvm_realm_ipa_limit(void);
> int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
> int kvm_init_realm_vm(struct kvm *kvm);
> void kvm_destroy_realm(struct kvm *kvm);
> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level);
> +
> +#define RME_RTT_BLOCK_LEVEL 2
> +#define RME_RTT_MAX_LEVEL 3
> +
> +#define RME_PAGE_SHIFT 12
> +#define RME_PAGE_SIZE BIT(RME_PAGE_SHIFT)

Can we use PAGE_SIZE and PAGE_SHIFT instead of redefining them?
Maybe we can use them to define RME_PAGE_SIZE and RME_PAGE_SHIFT.
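
e.g. something like this (a sketch of the suggestion only; it relies on
the host using 4K pages, which kvm_init_rme() already enforces):

/* Sketch: derive the RMM granule size from the host definitions. */
#define RME_PAGE_SHIFT	PAGE_SHIFT
#define RME_PAGE_SIZE	PAGE_SIZE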

> +/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */
> +#define RME_RTT_LEVEL_SHIFT(l) \
> + ((RME_PAGE_SHIFT - 3) * (4 - (l)) + 3)

Instead of defining this again, can we define it in terms of
ARM64_HW_PGTABLE_LEVEL_SHIFT()?
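
i.e. (sketch only; again this assumes the host page size is 4K, so the
two shifts are identical):

/* Sketch: reuse the generic helper rather than open-coding the shift. */
#define RME_RTT_LEVEL_SHIFT(l)	ARM64_HW_PGTABLE_LEVEL_SHIFT(l)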

> +#define RME_L2_BLOCK_SIZE BIT(RME_RTT_LEVEL_SHIFT(2))
> +
> +static inline unsigned long rme_rtt_level_mapsize(int level)
> +{
> + if (WARN_ON(level > RME_RTT_MAX_LEVEL))
> + return RME_PAGE_SIZE;
> +
> + return (1UL << RME_RTT_LEVEL_SHIFT(level));
> +}
>
> #endif
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 22c00274884a..f29558c5dcbc 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -834,16 +834,17 @@ void stage2_unmap_vm(struct kvm *kvm)
> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> {
> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> - struct kvm_pgtable *pgt = NULL;
> + struct kvm_pgtable *pgt;
>
> write_lock(&kvm->mmu_lock);
> + pgt = mmu->pgt;
> if (kvm_is_realm(kvm) &&
> kvm_realm_state(kvm) != REALM_STATE_DYING) {
> - /* TODO: teardown rtts */
> write_unlock(&kvm->mmu_lock);
> + kvm_realm_destroy_rtts(&kvm->arch.realm, pgt->ia_bits,
> + pgt->start_level);
> return;
> }
> - pgt = mmu->pgt;
> if (pgt) {
> mmu->pgd_phys = 0;
> mmu->pgt = NULL;
> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
> index 0c9d70e4d9e6..f7b0e5a779f8 100644
> --- a/arch/arm64/kvm/rme.c
> +++ b/arch/arm64/kvm/rme.c
> @@ -73,6 +73,28 @@ static int rmi_check_version(void)
> return 0;
> }
>
> +static void realm_destroy_undelegate_range(struct realm *realm,
> + unsigned long ipa,
> + unsigned long addr,
> + ssize_t size)
> +{
> + unsigned long rd = virt_to_phys(realm->rd);
> + int ret;
> +
> + while (size > 0) {
> + ret = rmi_data_destroy(rd, ipa);
> + WARN_ON(ret);
> + ret = rmi_granule_undelegate(addr);
> +
> + if (ret)
> + get_page(phys_to_page(addr));
> +
> + addr += PAGE_SIZE;
> + ipa += PAGE_SIZE;
> + size -= PAGE_SIZE;
> + }
> +}
> +
> static unsigned long create_realm_feat_reg0(struct kvm *kvm)
> {
> unsigned long ia_bits = VTCR_EL2_IPA(kvm->arch.vtcr);
> @@ -170,6 +192,123 @@ static int realm_create_rd(struct kvm *kvm)
> return r;
> }
>
> +static int realm_rtt_destroy(struct realm *realm, unsigned long addr,
> + int level, phys_addr_t rtt_granule)
> +{
> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
> + return rmi_rtt_destroy(rtt_granule, virt_to_phys(realm->rd), addr,
> + level);
> +}
> +
> +static int realm_destroy_free_rtt(struct realm *realm, unsigned long addr,
> + int level, phys_addr_t rtt_granule)
> +{
> + if (realm_rtt_destroy(realm, addr, level, rtt_granule))
> + return -ENXIO;
> + if (!WARN_ON(rmi_granule_undelegate(rtt_granule)))
> + put_page(phys_to_page(rtt_granule));
> +
> + return 0;
> +}
> +
> +static int realm_rtt_create(struct realm *realm,
> + unsigned long addr,
> + int level,
> + phys_addr_t phys)
> +{
> + addr = ALIGN_DOWN(addr, rme_rtt_level_mapsize(level - 1));
> + return rmi_rtt_create(phys, virt_to_phys(realm->rd), addr, level);
> +}
> +
> +static int realm_tear_down_rtt_range(struct realm *realm, int level,
> + unsigned long start, unsigned long end)
> +{
> + phys_addr_t rd = virt_to_phys(realm->rd);
> + ssize_t map_size = rme_rtt_level_mapsize(level);
> + unsigned long addr, next_addr;
> + bool failed = false;
> +
> + for (addr = start; addr < end; addr = next_addr) {
> + phys_addr_t rtt_addr, tmp_rtt;
> + struct rtt_entry rtt;
> + unsigned long end_addr;
> +
> + next_addr = ALIGN(addr + 1, map_size);
> +
> + end_addr = min(next_addr, end);
> +
> + if (rmi_rtt_read_entry(rd, ALIGN_DOWN(addr, map_size),
> + level, &rtt)) {
> + failed = true;
> + continue;
> + }
> +
> + rtt_addr = rmi_rtt_get_phys(&rtt);
> + WARN_ON(level != rtt.walk_level);
> +
> + switch (rtt.state) {
> + case RMI_UNASSIGNED:
> + case RMI_DESTROYED:
> + break;
> + case RMI_TABLE:
> + if (realm_tear_down_rtt_range(realm, level + 1,
> + addr, end_addr)) {
> + failed = true;
> + break;
> + }
> + if (IS_ALIGNED(addr, map_size) &&
> + next_addr <= end &&
> + realm_destroy_free_rtt(realm, addr, level + 1,
> + rtt_addr))
> + failed = true;
> + break;
> + case RMI_ASSIGNED:
> + WARN_ON(!rtt_addr);
> + /*
> + * If there is a block mapping, break it now, using the
> + * spare_page. We are sure to have a valid delegated
> + * page at spare_page before we enter here, otherwise
> + * WARN once, which will be followed by further
> + * warnings.
> + */
> + tmp_rtt = realm->spare_page;
> + if (level == 2 &&
> + !WARN_ON_ONCE(tmp_rtt == PHYS_ADDR_MAX) &&
> + realm_rtt_create(realm, addr,
> + RME_RTT_MAX_LEVEL, tmp_rtt)) {
> + WARN_ON(1);
> + failed = true;
> + break;
> + }
> + realm_destroy_undelegate_range(realm, addr,
> + rtt_addr, map_size);
> + /*
> + * Collapse the last level table and make the spare page
> + * reusable again.
> + */
> + if (level == 2 &&
> + realm_rtt_destroy(realm, addr, RME_RTT_MAX_LEVEL,
> + tmp_rtt))
> + failed = true;
> + break;
> + case RMI_VALID_NS:
> + WARN_ON(rmi_rtt_unmap_unprotected(rd, addr, level));
> + break;
> + default:
> + WARN_ON(1);
> + failed = true;
> + break;
> + }
> + }
> +
> + return failed ? -EINVAL : 0;
> +}
> +
> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32 start_level)
> +{
> + realm_tear_down_rtt_range(realm, start_level, 0, (1UL << ia_bits));
> +}
> +
> /* Protects access to rme_vmid_bitmap */
> static DEFINE_SPINLOCK(rme_vmid_lock);
> static unsigned long *rme_vmid_bitmap;

Thanks,
Ganapat

2024-03-18 11:22:54

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 06/28] arm64: RME: ioctls to create and configure realms

Thanks for taking a look at this.

On 18/03/2024 07:40, Ganapatrao Kulkarni wrote:
> On 27-01-2023 04:59 pm, Steven Price wrote:
[...]
>>   int kvm_init_rme(void)
>>   {
>> +    int ret;
>> +
>>       if (PAGE_SIZE != SZ_4K)
>>           /* Only 4k page size on the host is supported */
>>           return 0;
>> @@ -43,6 +394,12 @@ int kvm_init_rme(void)
>>           /* Continue without realm support */
>>           return 0;
>>   +    ret = rme_vmid_init();
>> +    if (ret)
>> +        return ret;
>> +
>> +    WARN_ON(rmi_features(0, &rmm_feat_reg0));
>
> Why WARN_ON? Wouldn't it be better to print an err/info message and keep
> "kvm_rme_is_available" disabled?

Good point. RMI_FEATURES "does not have any failure conditions", so this
is very much a "should never happen" situation. Assuming the call fails
gracefully, rmm_feat_reg0 would remain 0, which would in practice stop
realms being created, but this is clearly non-ideal.

I'll fix this up in the next version to do the rmi_features() call
before rme_vmid_init(); that way we can just return early without
setting kvm_rme_is_available in this situation. I'll keep the WARN_ON
because something has gone very wrong if this call fails.
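
i.e. something roughly like this (a sketch only - the exact return value
on the "can't happen" path is my guess, not necessarily what the next
version will do):

int kvm_init_rme(void)
{
	int ret;

	if (PAGE_SIZE != SZ_4K)
		/* Only 4k page size on the host is supported */
		return 0;

	if (rmi_check_version())
		/* Continue without realm support */
		return 0;

	if (WARN_ON(rmi_features(0, &rmm_feat_reg0)))
		/* Seriously broken RMM: leave kvm_rme_is_available unset */
		return 0;

	ret = rme_vmid_init();
	if (ret)
		return ret;

	/* Future patch will enable static branch kvm_rme_is_available */

	return 0;
}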

> IMO, we should print a message when RME is enabled; otherwise it should
> return silently.

The rmi_check_version() call already outputs an "RMI ABI version %d.%d"
message - I don't want to be too noisy here. Other than the 'cannot
happen' situations, if you see the "RMI ABI" message then
kvm_rme_is_available will be set. And those 'cannot happen' routes will
print their own error message (and point to a seriously broken system).

And obviously, in the case of SMC_RMI_VERSION not being supported, we
silently return as this is taken to mean there isn't an RMM.

Thanks,

Steve


2024-03-18 11:23:12

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 04/28] arm64: RME: Check for RME support at KVM init

On 18/03/2024 07:17, Ganapatrao Kulkarni wrote:
>
>
> On 27-01-2023 04:59 pm, Steven Price wrote:
>> Query the RMI version number and check if it is a compatible version. A
>> static key is also provided to signal that a supported RMM is available.
>>
>> Functions are provided to query if a VM or VCPU is a realm (or rec)
>> which currently will always return false.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h | 17 ++++++++++
>>   arch/arm64/include/asm/kvm_host.h    |  4 +++
>>   arch/arm64/include/asm/kvm_rme.h     | 22 +++++++++++++
>>   arch/arm64/include/asm/virt.h        |  1 +
>>   arch/arm64/kvm/Makefile              |  3 +-
>>   arch/arm64/kvm/arm.c                 |  8 +++++
>>   arch/arm64/kvm/rme.c                 | 49 ++++++++++++++++++++++++++++
>>   7 files changed, 103 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/arm64/include/asm/kvm_rme.h
>>   create mode 100644 arch/arm64/kvm/rme.c
>>

[...]

>> diff --git a/arch/arm64/kvm/rme.c b/arch/arm64/kvm/rme.c
>> new file mode 100644
>> index 000000000000..f6b587bc116e
>> --- /dev/null
>> +++ b/arch/arm64/kvm/rme.c
>> @@ -0,0 +1,49 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/kvm_host.h>
>> +
>> +#include <asm/rmi_cmds.h>
>> +#include <asm/virt.h>
>> +
>> +static int rmi_check_version(void)
>> +{
>> +    struct arm_smccc_res res;
>> +    int version_major, version_minor;
>> +
>> +    arm_smccc_1_1_invoke(SMC_RMI_VERSION, &res);
>> +
>> +    if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
>> +        return -ENXIO;
>> +
>> +    version_major = RMI_ABI_VERSION_GET_MAJOR(res.a0);
>> +    version_minor = RMI_ABI_VERSION_GET_MINOR(res.a0);
>> +
>> +    if (version_major != RMI_ABI_MAJOR_VERSION) {
>> +        kvm_err("Unsupported RMI ABI (version %d.%d) we support %d\n",
>
> Can we please replace "we support" with "host supports"?
> Also, in the patch present in the repo you are using the variable
> our_version; can this be changed to host_version?

Sure, I do have a bad habit of using "we" - thanks for pointing it out.
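
Presumably it'd end up as something like this (my wording rather than
the final patch):

		kvm_err("Unsupported RMI ABI (version %d.%d) host supports %d\n",
			version_major, version_minor,
			RMI_ABI_MAJOR_VERSION);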

Steve

>> +            version_major, version_minor,
>> +            RMI_ABI_MAJOR_VERSION);
>> +        return -ENXIO;
>> +    }
>> +
>> +    kvm_info("RMI ABI version %d.%d\n", version_major, version_minor);
>> +
>> +    return 0;
>> +}
>> +
>> +int kvm_init_rme(void)
>> +{
>> +    if (PAGE_SIZE != SZ_4K)
>> +        /* Only 4k page size on the host is supported */
>> +        return 0;
>> +
>> +    if (rmi_check_version())
>> +        /* Continue without realm support */
>> +        return 0;
>> +
>> +    /* Future patch will enable static branch kvm_rme_is_available */
>> +
>> +    return 0;
>> +}
>
> Thanks,
> Ganapat


2024-03-18 11:23:30

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 03/28] arm64: RME: Add wrappers for RMI calls

On 18/03/2024 07:03, Ganapatrao Kulkarni wrote:
>
> Hi Steven,
>
> On 27-01-2023 04:59 pm, Steven Price wrote:
>> The wrappers make the call sites easier to read and deal with the
>> boiler plate of handling the error codes from the RMM.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>>   arch/arm64/include/asm/rmi_cmds.h | 259 ++++++++++++++++++++++++++++++
>>   1 file changed, 259 insertions(+)
>>   create mode 100644 arch/arm64/include/asm/rmi_cmds.h
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h
>> b/arch/arm64/include/asm/rmi_cmds.h
>> new file mode 100644
>> index 000000000000..d5468ee46f35
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/rmi_cmds.h

[...]

>> +static inline int rmi_rtt_read_entry(unsigned long rd, unsigned long
>> map_addr,
>> +                     unsigned long level, struct rtt_entry *rtt)
>> +{
>> +    struct arm_smccc_1_2_regs regs = {
>> +        SMC_RMI_RTT_READ_ENTRY,
>> +        rd, map_addr, level
>> +    };
>> +
>> +    arm_smccc_1_2_smc(&regs, &regs);
>> +
>> +    rtt->walk_level = regs.a1;
>> +    rtt->state = regs.a2 & 0xFF;
>> +    rtt->desc = regs.a3;
>> +    rtt->ripas = regs.a4 & 1;
>> +
>> +    return regs.a0;
>> +}
>> +
>> +static inline int rmi_rtt_set_ripas(unsigned long rd, unsigned long rec,
>> +                    unsigned long map_addr, unsigned long level,
>> +                    unsigned long ripas)
>> +{
>> +    struct arm_smccc_res res;
>> +
>> +    arm_smccc_1_1_invoke(SMC_RMI_RTT_SET_RIPAS, rd, rec, map_addr,
>> level,
>> +                 ripas, &res);
>> +
>> +    return res.a0;
>> +}
>> +
>> +static inline int rmi_rtt_unmap_unprotected(unsigned long rd,
>> +                        unsigned long map_addr,
>> +                        unsigned long level)
>> +{
>> +    struct arm_smccc_res res;
>> +
>> +    arm_smccc_1_1_invoke(SMC_RMI_RTT_UNMAP_UNPROTECTED, rd, map_addr,
>> +                 level, &res);
>> +
>> +    return res.a0;
>> +}
>> +
>> +static inline phys_addr_t rmi_rtt_get_phys(struct rtt_entry *rtt)
>> +{
>> +    return rtt->desc & GENMASK(47, 12);
>> +}
>> +
>> +#endif
>
> Can we please replace all occurrences of "unsigned long" with u64?

I'm conflicted here. On the one hand I agree with you - it would be
better to use types that are sized according to the RMM spec. However,
this file is a thin wrapper around the low-level SMC calls, and the
SMCCC interface is a bunch of "unsigned longs" (e.g. look at struct
arm_smccc_1_2_regs).

In particular it could be broken to use smaller types (e.g. char/u8) as
it would potentially permit the compiler to leave 'junk' in the top part
of the register.

So the question becomes whether to stick with the SMCCC interface sizes
(unsigned long) or use our knowledge that it must be a 64 bit platform
(the RMM isn't supported on 32 bit) and therefore use u64. My (mild)
preference is for unsigned long because it makes it obvious how this
relates to the SMCCC interface it's using. It also seems like it would
ease compatibility if (/when?) 128 bit registers become a thing.

> Also, as per the spec, the RTT level is Int64 - can we change it accordingly?

Here, however, I agree you've definitely got a point. level should be
signed as (at least in theory) it could be negative.
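
Something along these lines perhaps - only the level parameter changes
and the SMCCC plumbing stays as in the patch (a sketch of the direction,
not a final version):

static inline int rmi_rtt_read_entry(unsigned long rd, unsigned long map_addr,
				     long level, struct rtt_entry *rtt)
{
	struct arm_smccc_1_2_regs regs = {
		SMC_RMI_RTT_READ_ENTRY,
		rd, map_addr, level
	};

	arm_smccc_1_2_smc(&regs, &regs);

	rtt->walk_level = regs.a1;
	rtt->state = regs.a2 & 0xFF;
	rtt->desc = regs.a3;
	rtt->ripas = regs.a4 & 1;

	return regs.a0;
}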

> Please CC me in future cca patch-sets.
> [email protected]

I will do, thanks for the review.

Steve


2024-03-18 11:25:45

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 09/28] arm64: RME: RTT handling

On 18/03/2024 11:01, Ganapatrao Kulkarni wrote:
>
> On 27-01-2023 04:59 pm, Steven Price wrote:
>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>> that the RMM creates/destroys entries as necessary. The physical pages
>> to store the page tables are delegated to the realm as required, and can
>> be undelegated when no longer used.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>>   arch/arm64/include/asm/kvm_rme.h |  19 +++++
>>   arch/arm64/kvm/mmu.c             |   7 +-
>>   arch/arm64/kvm/rme.c             | 139 +++++++++++++++++++++++++++++++
>>   3 files changed, 162 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rme.h
>> b/arch/arm64/include/asm/kvm_rme.h
>> index a6318af3ed11..eea5118dfa8a 100644
>> --- a/arch/arm64/include/asm/kvm_rme.h
>> +++ b/arch/arm64/include/asm/kvm_rme.h
>> @@ -35,5 +35,24 @@ u32 kvm_realm_ipa_limit(void);
>>   int kvm_realm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap);
>>   int kvm_init_realm_vm(struct kvm *kvm);
>>   void kvm_destroy_realm(struct kvm *kvm);
>> +void kvm_realm_destroy_rtts(struct realm *realm, u32 ia_bits, u32
>> start_level);
>> +
>> +#define RME_RTT_BLOCK_LEVEL    2
>> +#define RME_RTT_MAX_LEVEL    3
>> +
>> +#define RME_PAGE_SHIFT        12
>> +#define RME_PAGE_SIZE        BIT(RME_PAGE_SHIFT)
>
> Can we use PAGE_SIZE and PAGE_SHIFT instead of redefining these?
> Maybe we can use them to define RME_PAGE_SIZE and RME_PAGE_SHIFT.

At the moment the code only supports the host page size matching the
RMM's. But I want to leave open the possibility of the host page size
being larger than the RMM's. In that case PAGE_SHIFT/PAGE_SIZE will not
equal RME_PAGE_SHIFT/RME_PAGE_SIZE, and the host will have to create
multiple RMM RTTs for each host page.
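
To illustrate the sort of thing that implies (purely hypothetical - the
current series requires the two sizes to match, and
realm_delegate_host_page() is a made-up name, assuming an
rmi_granule_delegate() wrapper mirroring rmi_granule_undelegate() from
this series): with e.g. a 64k host page size the host would have to walk
each host page in RME_PAGE_SIZE chunks:

/*
 * Hypothetical sketch: delegate one (larger) host page to the RMM as a
 * run of RME_PAGE_SIZE granules.
 */
static int realm_delegate_host_page(phys_addr_t phys)
{
	unsigned long off;

	for (off = 0; off < PAGE_SIZE; off += RME_PAGE_SIZE) {
		int ret = rmi_granule_delegate(phys + off);

		if (ret)
			return ret;
	}

	return 0;
}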

>> +/* See ARM64_HW_PGTABLE_LEVEL_SHIFT() */
>> +#define RME_RTT_LEVEL_SHIFT(l)    \
>> +    ((RME_PAGE_SHIFT - 3) * (4 - (l)) + 3)
>
> Instead of defining this again, can we reuse
> ARM64_HW_PGTABLE_LEVEL_SHIFT?

Same as above - ARM64_HW_PGTABLE_LEVEL_SHIFT uses PAGE_SHIFT, but we
want the same calculation using RME_PAGE_SHIFT which might be different.

Thanks,

Steve


2024-03-18 11:28:52

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [RFC PATCH 12/28] KVM: arm64: Support timers in realm RECs



On 27-01-2023 04:59 pm, Steven Price wrote:
> The RMM keeps track of the timer while the realm REC is running, but on
> exit to the normal world KVM is responsible for handling the timers.
>
> Signed-off-by: Steven Price <[email protected]>
> ---
> arch/arm64/kvm/arch_timer.c | 53 ++++++++++++++++++++++++++++++++----
> include/kvm/arm_arch_timer.h | 2 ++
> 2 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index bb24a76b4224..d4af9ee58550 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -130,6 +130,11 @@ static void timer_set_offset(struct arch_timer_context *ctxt, u64 offset)
> {
> struct kvm_vcpu *vcpu = ctxt->vcpu;
>
> + if (kvm_is_realm(vcpu->kvm)) {
> + WARN_ON(offset);
> + return;
> + }
> +
> switch(arch_timer_ctx_index(ctxt)) {
> case TIMER_VTIMER:
> __vcpu_sys_reg(vcpu, CNTVOFF_EL2) = offset;
> @@ -411,6 +416,21 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
> }
> }
>
> +void kvm_realm_timers_update(struct kvm_vcpu *vcpu)
> +{
> + struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu;
> + int i;
> +
> + for (i = 0; i < NR_KVM_TIMERS; i++) {

Do we need to check all the timers? Does the realm/RMM use the hyp timers?

> + struct arch_timer_context *timer = &arch_timer->timers[i];
> + bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT;
> + bool level = kvm_timer_irq_can_fire(timer) && status;
> +
> + if (level != timer->irq.level)
> + kvm_timer_update_irq(vcpu, level, timer);
> + }
> +}
> +
> /* Only called for a fully emulated timer */
> static void timer_emulate(struct arch_timer_context *ctx)
> {
> @@ -621,6 +641,11 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
> if (unlikely(!timer->enabled))
> return;
>
> + kvm_timer_unblocking(vcpu);
> +
> + if (vcpu_is_rec(vcpu))
> + return;
> +

For a realm, timer->enabled is not set, so load returns before this check.

> get_timer_map(vcpu, &map);
>
> if (static_branch_likely(&has_gic_active_state)) {
> @@ -633,8 +658,6 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
>
> set_cntvoff(timer_get_offset(map.direct_vtimer));
>
> - kvm_timer_unblocking(vcpu);
> -
> timer_restore_state(map.direct_vtimer);
> if (map.direct_ptimer)
> timer_restore_state(map.direct_ptimer);
> @@ -668,6 +691,9 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> if (unlikely(!timer->enabled))
> return;
>
> + if (vcpu_is_rec(vcpu))
> + goto out;
> +
> get_timer_map(vcpu, &map);
>
> timer_save_state(map.direct_vtimer);
> @@ -686,9 +712,6 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> if (map.emul_ptimer)
> soft_timer_cancel(&map.emul_ptimer->hrtimer);
>
> - if (kvm_vcpu_is_blocking(vcpu))
> - kvm_timer_blocking(vcpu);
> -
> /*
> * The kernel may decide to run userspace after calling vcpu_put, so
> * we reset cntvoff to 0 to ensure a consistent read between user
> @@ -697,6 +720,11 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
> * virtual offset of zero, so no need to zero CNTVOFF_EL2 register.
> */
> set_cntvoff(0);
> +
> +out:
> + if (kvm_vcpu_is_blocking(vcpu))
> + kvm_timer_blocking(vcpu);
> +
> }
>
> /*
> @@ -785,12 +813,18 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
> struct arch_timer_cpu *timer = vcpu_timer(vcpu);
> struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
> struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
> + u64 cntvoff;
>
> vtimer->vcpu = vcpu;
> ptimer->vcpu = vcpu;
>
> + if (kvm_is_realm(vcpu->kvm))
> + cntvoff = 0;
> + else
> + cntvoff = kvm_phys_timer_read();
> +
> /* Synchronize cntvoff across all vtimers of a VM. */
> - update_vtimer_cntvoff(vcpu, kvm_phys_timer_read());
> + update_vtimer_cntvoff(vcpu, cntvoff);
> timer_set_offset(ptimer, 0);
>
> hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
> @@ -1265,6 +1299,13 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
> return -EINVAL;
> }
>
> + /*
> + * We don't use mapped IRQs for Realms because the RMI doesn't allow
> + * us setting the LR.HW bit in the VGIC.
> + */
> + if (vcpu_is_rec(vcpu))
> + return 0;
> +
> get_timer_map(vcpu, &map);
>
> ret = kvm_vgic_map_phys_irq(vcpu,
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index cd6d8f260eab..158280e15a33 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -76,6 +76,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
> int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr);
>
> +void kvm_realm_timers_update(struct kvm_vcpu *vcpu);
> +
> u64 kvm_phys_timer_read(void);
>
> void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);

Thanks,
Ganapat

2024-03-18 14:14:59

by Steven Price

[permalink] [raw]
Subject: Re: [RFC PATCH 12/28] KVM: arm64: Support timers in realm RECs

On 18/03/2024 11:28, Ganapatrao Kulkarni wrote:
>
>
> On 27-01-2023 04:59 pm, Steven Price wrote:
>> The RMM keeps track of the timer while the realm REC is running, but on
>> exit to the normal world KVM is responsible for handling the timers.
>>
>> Signed-off-by: Steven Price <[email protected]>
>> ---
>>   arch/arm64/kvm/arch_timer.c  | 53 ++++++++++++++++++++++++++++++++----
>>   include/kvm/arm_arch_timer.h |  2 ++
>>   2 files changed, 49 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
>> index bb24a76b4224..d4af9ee58550 100644
>> --- a/arch/arm64/kvm/arch_timer.c
>> +++ b/arch/arm64/kvm/arch_timer.c
>> @@ -130,6 +130,11 @@ static void timer_set_offset(struct
>> arch_timer_context *ctxt, u64 offset)
>>   {
>>       struct kvm_vcpu *vcpu = ctxt->vcpu;
>>   +    if (kvm_is_realm(vcpu->kvm)) {
>> +        WARN_ON(offset);
>> +        return;
>> +    }
>> +
>>       switch(arch_timer_ctx_index(ctxt)) {
>>       case TIMER_VTIMER:
>>           __vcpu_sys_reg(vcpu, CNTVOFF_EL2) = offset;
>> @@ -411,6 +416,21 @@ static void kvm_timer_update_irq(struct kvm_vcpu
>> *vcpu, bool new_level,
>>       }
>>   }
>>   +void kvm_realm_timers_update(struct kvm_vcpu *vcpu)
>> +{
>> +    struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu;
>> +    int i;
>> +
>> +    for (i = 0; i < NR_KVM_TIMERS; i++) {
>
> Do we need to check all the timers? Does the realm/RMM use the hyp timers?

Good point, the realm guest can't use the hyp timers, so this should be
NR_KVM_EL0_TIMERS. The hyp timers are used by the host to interrupt the
guest execution. I think this code was written before NV support added
the extra timers.
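
So, roughly (keeping the rest of the loop as in this patch):

void kvm_realm_timers_update(struct kvm_vcpu *vcpu)
{
	struct arch_timer_cpu *arch_timer = &vcpu->arch.timer_cpu;
	int i;

	/* Realms only have the EL0 (virtual/physical) timers to deal with */
	for (i = 0; i < NR_KVM_EL0_TIMERS; i++) {
		struct arch_timer_context *timer = &arch_timer->timers[i];
		bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT;
		bool level = kvm_timer_irq_can_fire(timer) && status;

		if (level != timer->irq.level)
			kvm_timer_update_irq(vcpu, level, timer);
	}
}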

>> +        struct arch_timer_context *timer = &arch_timer->timers[i];
>> +        bool status = timer_get_ctl(timer) & ARCH_TIMER_CTRL_IT_STAT;
>> +        bool level = kvm_timer_irq_can_fire(timer) && status;
>> +
>> +        if (level != timer->irq.level)
>> +            kvm_timer_update_irq(vcpu, level, timer);
>> +    }
>> +}
>> +
>>   /* Only called for a fully emulated timer */
>>   static void timer_emulate(struct arch_timer_context *ctx)
>>   {
>> @@ -621,6 +641,11 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
>>       if (unlikely(!timer->enabled))
>>           return;
>>   +    kvm_timer_unblocking(vcpu);
>> +
>> +    if (vcpu_is_rec(vcpu))
>> +        return;
>> +
>
> For a realm, timer->enabled is not set, so load returns before this check.

True, this can be simplified. Thanks.

Steve

>>       get_timer_map(vcpu, &map);
>>         if (static_branch_likely(&has_gic_active_state)) {
>> @@ -633,8 +658,6 @@ void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
>>         set_cntvoff(timer_get_offset(map.direct_vtimer));
>>   -    kvm_timer_unblocking(vcpu);
>> -
>>       timer_restore_state(map.direct_vtimer);
>>       if (map.direct_ptimer)
>>           timer_restore_state(map.direct_ptimer);
>> @@ -668,6 +691,9 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>>       if (unlikely(!timer->enabled))
>>           return;
>>   +    if (vcpu_is_rec(vcpu))
>> +        goto out;
>> +
>>       get_timer_map(vcpu, &map);
>>         timer_save_state(map.direct_vtimer);
>> @@ -686,9 +712,6 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>>       if (map.emul_ptimer)
>>           soft_timer_cancel(&map.emul_ptimer->hrtimer);
>>   -    if (kvm_vcpu_is_blocking(vcpu))
>> -        kvm_timer_blocking(vcpu);
>> -
>>       /*
>>        * The kernel may decide to run userspace after calling
>> vcpu_put, so
>>        * we reset cntvoff to 0 to ensure a consistent read between user
>> @@ -697,6 +720,11 @@ void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>>        * virtual offset of zero, so no need to zero CNTVOFF_EL2 register.
>>        */
>>       set_cntvoff(0);
>> +
>> +out:
>> +    if (kvm_vcpu_is_blocking(vcpu))
>> +        kvm_timer_blocking(vcpu);
>> +
>>   }
>>     /*
>> @@ -785,12 +813,18 @@ void kvm_timer_vcpu_init(struct kvm_vcpu *vcpu)
>>       struct arch_timer_cpu *timer = vcpu_timer(vcpu);
>>       struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>>       struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
>> +    u64 cntvoff;
>>         vtimer->vcpu = vcpu;
>>       ptimer->vcpu = vcpu;
>>   +    if (kvm_is_realm(vcpu->kvm))
>> +        cntvoff = 0;
>> +    else
>> +        cntvoff = kvm_phys_timer_read();
>> +
>>       /* Synchronize cntvoff across all vtimers of a VM. */
>> -    update_vtimer_cntvoff(vcpu, kvm_phys_timer_read());
>> +    update_vtimer_cntvoff(vcpu, cntvoff);
>>       timer_set_offset(ptimer, 0);
>>         hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC,
>> HRTIMER_MODE_ABS_HARD);
>> @@ -1265,6 +1299,13 @@ int kvm_timer_enable(struct kvm_vcpu *vcpu)
>>           return -EINVAL;
>>       }
>>   +    /*
>> +     * We don't use mapped IRQs for Realms because the RMI doesn't allow
>> +     * us setting the LR.HW bit in the VGIC.
>> +     */
>> +    if (vcpu_is_rec(vcpu))
>> +        return 0;
>> +
>>       get_timer_map(vcpu, &map);
>>         ret = kvm_vgic_map_phys_irq(vcpu,
>> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
>> index cd6d8f260eab..158280e15a33 100644
>> --- a/include/kvm/arm_arch_timer.h
>> +++ b/include/kvm/arm_arch_timer.h
>> @@ -76,6 +76,8 @@ int kvm_arm_timer_set_attr(struct kvm_vcpu *vcpu,
>> struct kvm_device_attr *attr);
>>   int kvm_arm_timer_get_attr(struct kvm_vcpu *vcpu, struct
>> kvm_device_attr *attr);
>>   int kvm_arm_timer_has_attr(struct kvm_vcpu *vcpu, struct
>> kvm_device_attr *attr);
>>   +void kvm_realm_timers_update(struct kvm_vcpu *vcpu);
>> +
>>   u64 kvm_phys_timer_read(void);
>>     void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
>
> Thanks,
> Ganapat