2021-07-15 17:04:09

by Marc Zyngier

Subject: [PATCH 00/16] KVM: arm64: MMIO guard PV services

KVM/arm64 currently considers that any memory access outside of a
memslot is a MMIO access. This so far has served us very well, but
obviously relies on the guest trusting the host, and especially
userspace to do the right thing.

As we keep on hacking away at pKVM, it becomes obvious that this trust
model is not really fit for a confidential computing environment, and
that the guest would require some guarantees that emulation only
occurs on portions of the address space that have clearly been
identified for this purpose.

This series aims at providing the two sides of the above coin:

- a set of PV services (collectively called 'MMIO guard' -- better
name required!) where the guest can flag portions of its address
space that it considers as MMIO, with map/unmap semantics. Any
attempt to access a MMIO range outside of these regions will result
in an external abort being injected (a rough guest-side sketch of
the flow follows this list).

- a set of hooks into the ioremap code allowing a Linux guest to tell
KVM about things it wants to consider as MMIO. I definitely hate this
part of the series, as it feels clumsy and brittle.
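
For illustration, here is a rough guest-side sketch of the whole
dance. This is not part of the series (the function name is made up),
and the real plumbing lives in the ioremap hooks further down:

#include <linux/arm-smccc.h>

/* Purely illustrative -- see patches 13 and 14 for the actual code */
static void guest_mmio_guard_sketch(phys_addr_t phys_addr, pgprot_t prot)
{
	struct arm_smccc_res res;

	/* 1. Ask the hypervisor for its protection granule */
	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID,
			     &res);
	if (res.a0 != PAGE_SIZE)
		return;

	/* 2. Enroll: from now on, only 'mapped' ranges are treated as MMIO */
	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID,
			     &res);
	if (res.a0 != SMCCC_RET_SUCCESS)
		return;

	/* 3. Flag a page-sized IPA range as MMIO before touching it */
	arm_smccc_1_1_hvc(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID,
			  phys_addr, prot, &res);
}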

For now, the enrolment in this scheme is controlled by a guest kernel
command-line parameter, but it is expected that KVM will enforce this
for protected VMs.

Note that this crucially misses a save/restore interface for non
protected VMs, and I currently don't have a good solution for
that. Ideas welcome.

I also plan to use this series as a base for some other purposes,
namely to trick the guest into telling us how it maps things like
prefetchable BARs (see the discussion at [1]). That part is not
implemented yet, but there is already some provision to pass the MAIR
index across.

Patches on top of 5.14-rc1, branch pushed at the usual location.

[1] [email protected]

Marc Zyngier (16):
KVM: arm64: Generalise VM features into a set of flags
KVM: arm64: Don't issue CMOs when the physical address is invalid
KVM: arm64: Turn kvm_pgtable_stage2_set_owner into
kvm_pgtable_stage2_annotate
KVM: arm64: Add MMIO checking infrastructure
KVM: arm64: Plumb MMIO checking into the fault handling
KVM: arm64: Force a full unmap on vcpu reinit
KVM: arm64: Wire MMIO guard hypercalls
KVM: arm64: Add tracepoint for failed MMIO guard check
KVM: arm64: Advertise a capability for MMIO guard
KVM: arm64: Add some documentation for the MMIO guard feature
firmware/smccc: Call arch-specific hook on discovering KVM services
mm/ioremap: Add arch-specific callbacks on ioremap/iounmap calls
arm64: Implement ioremap/iounmap hooks calling into KVM's MMIO guard
arm64: Enroll into KVM's MMIO guard if required
arm64: Add a helper to retrieve the PTE of a fixmap
arm64: Register earlycon fixmap with the MMIO guard

.../admin-guide/kernel-parameters.txt | 3 +
Documentation/virt/kvm/arm/index.rst | 1 +
Documentation/virt/kvm/arm/mmio-guard.rst | 73 +++++++++++
arch/arm/include/asm/hypervisor.h | 1 +
arch/arm64/include/asm/fixmap.h | 2 +
arch/arm64/include/asm/hypervisor.h | 2 +
arch/arm64/include/asm/kvm_host.h | 14 ++-
arch/arm64/include/asm/kvm_mmu.h | 5 +
arch/arm64/include/asm/kvm_pgtable.h | 12 +-
arch/arm64/kernel/setup.c | 6 +
arch/arm64/kvm/arm.c | 14 ++-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 14 ++-
arch/arm64/kvm/hyp/pgtable.c | 36 +++---
arch/arm64/kvm/hypercalls.c | 20 +++
arch/arm64/kvm/mmio.c | 13 +-
arch/arm64/kvm/mmu.c | 117 ++++++++++++++++++
arch/arm64/kvm/trace_arm.h | 17 +++
arch/arm64/mm/ioremap.c | 107 ++++++++++++++++
arch/arm64/mm/mmu.c | 15 +++
drivers/firmware/smccc/kvm_guest.c | 4 +
include/linux/arm-smccc.h | 28 +++++
include/linux/io.h | 3 +
include/uapi/linux/kvm.h | 1 +
mm/ioremap.c | 13 +-
mm/vmalloc.c | 8 ++
25 files changed, 492 insertions(+), 37 deletions(-)
create mode 100644 Documentation/virt/kvm/arm/mmio-guard.rst

--
2.30.2


2021-07-15 17:04:16

by Marc Zyngier

Subject: [PATCH 05/16] KVM: arm64: Plumb MMIO checking into the fault handling

Plumb the MMIO checking code into the MMIO fault handling code.
Nothing allows a region to be registered yet, so there should be
no functional change either.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/kvm/mmio.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
index 3dd38a151d2a..fd5747279d27 100644
--- a/arch/arm64/kvm/mmio.c
+++ b/arch/arm64/kvm/mmio.c
@@ -6,6 +6,7 @@

#include <linux/kvm_host.h>
#include <asm/kvm_emulate.h>
+#include <asm/kvm_mmu.h>
#include <trace/events/kvm.h>

#include "trace.h"
@@ -130,6 +131,10 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
int len;
u8 data_buf[8];

+ /* Check failed? Return to the guest for debriefing... */
+ if (!kvm_check_ioguard_page(vcpu, fault_ipa))
+ return 1;
+
/*
* No valid syndrome? Ask userspace for help if it has
* volunteered to do so, and bail out otherwise.
@@ -156,6 +161,11 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
len = kvm_vcpu_dabt_get_as(vcpu);
rt = kvm_vcpu_dabt_get_rd(vcpu);

+ /* If we cross a page boundary, check that too... */
+ if (((fault_ipa + len - 1) & PAGE_MASK) != (fault_ipa & PAGE_MASK) &&
+ !kvm_check_ioguard_page(vcpu, fault_ipa + len - 1))
+ return 1;
+
if (is_write) {
data = vcpu_data_guest_to_host(vcpu, vcpu_get_reg(vcpu, rt),
len);
--
2.30.2

2021-07-15 17:04:58

by Marc Zyngier

Subject: [PATCH 07/16] KVM: arm64: Wire MMIO guard hypercalls

Plumb in the hypercall interface to allow a guest to discover,
enroll, map and unmap MMIO regions.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/kvm/hypercalls.c | 20 ++++++++++++++++++++
include/linux/arm-smccc.h | 28 ++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)

diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
index 30da78f72b3b..a3deeb907fdd 100644
--- a/arch/arm64/kvm/hypercalls.c
+++ b/arch/arm64/kvm/hypercalls.c
@@ -5,6 +5,7 @@
#include <linux/kvm_host.h>

#include <asm/kvm_emulate.h>
+#include <asm/kvm_mmu.h>

#include <kvm/arm_hypercalls.h>
#include <kvm/arm_psci.h>
@@ -129,10 +130,29 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP);
+ val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO);
+ val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL);
+ val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP);
+ val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP);
break;
case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
kvm_ptp_get_time(vcpu, val);
break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID:
+ val[0] = PAGE_SIZE;
+ break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID:
+ set_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
+ val[0] = SMCCC_RET_SUCCESS;
+ break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID:
+ if (kvm_install_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
+ val[0] = SMCCC_RET_SUCCESS;
+ break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID:
+ if (kvm_remove_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
+ val[0] = SMCCC_RET_SUCCESS;
+ break;
case ARM_SMCCC_TRNG_VERSION:
case ARM_SMCCC_TRNG_FEATURES:
case ARM_SMCCC_TRNG_GET_UUID:
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 7d1cabe15262..4aab2078d8d3 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -107,6 +107,10 @@
/* KVM "vendor specific" services */
#define ARM_SMCCC_KVM_FUNC_FEATURES 0
#define ARM_SMCCC_KVM_FUNC_PTP 1
+#define ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO 2
+#define ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL 3
+#define ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP 4
+#define ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP 5
#define ARM_SMCCC_KVM_FUNC_FEATURES_2 127
#define ARM_SMCCC_KVM_NUM_FUNCS 128

@@ -133,6 +137,30 @@
#define KVM_PTP_VIRT_COUNTER 0
#define KVM_PTP_PHYS_COUNTER 1

+#define ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO)
+
+#define ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL)
+
+#define ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP)
+
+#define ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_64, \
+ ARM_SMCCC_OWNER_VENDOR_HYP, \
+ ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP)
+
/* Paravirtualised time calls (defined by ARM DEN0057A) */
#define ARM_SMCCC_HV_PV_TIME_FEATURES \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
--
2.30.2

2021-07-15 17:25:47

by Marc Zyngier

Subject: [PATCH 13/16] arm64: Implement ioremap/iounmap hooks calling into KVM's MMIO guard

Implement the previously defined ioremap/iounmap hooks for arm64,
calling into KVM's MMIO guard if available.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/mm/ioremap.c | 56 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index b7c81dacabf0..0801fd92f0e3 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -9,13 +9,69 @@
* Copyright (C) 2012 ARM Ltd.
*/

+#define pr_fmt(fmt) "ioremap: " fmt
+
#include <linux/export.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/io.h>
+#include <linux/arm-smccc.h>

#include <asm/fixmap.h>
#include <asm/tlbflush.h>
+#include <asm/hypervisor.h>
+
+static DEFINE_STATIC_KEY_FALSE(ioremap_guard_key);
+
+void ioremap_page_range_hook(unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
+{
+ size_t size = end - addr;
+
+ if (!static_branch_unlikely(&ioremap_guard_key))
+ return;
+
+ if (pfn_valid(__phys_to_pfn(phys_addr)))
+ return;
+
+ while (size) {
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_hvc(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID,
+ phys_addr, prot, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn_ratelimited("Failed to register %llx\n",
+ phys_addr);
+ return;
+ }
+
+ size -= PAGE_SIZE;
+ phys_addr += PAGE_SIZE;
+ }
+}
+
+void iounmap_page_range_hook(phys_addr_t phys_addr, size_t size)
+{
+ if (!static_branch_unlikely(&ioremap_guard_key))
+ return;
+
+ VM_BUG_ON(phys_addr & ~PAGE_MASK || size & ~PAGE_MASK);
+
+ while (size) {
+ struct arm_smccc_res res;
+
+ arm_smccc_1_1_hvc(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID,
+ phys_addr, &res);
+ if (res.a0 != SMCCC_RET_SUCCESS) {
+ pr_warn_ratelimited("Failed to unregister %llx\n",
+ phys_addr);
+ return;
+ }
+
+ size -= PAGE_SIZE;
+ phys_addr += PAGE_SIZE;
+ }
+}

static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size,
pgprot_t prot, void *caller)
--
2.30.2

2021-07-15 17:26:07

by Marc Zyngier

Subject: [PATCH 11/16] firmware/smccc: Call arch-specific hook on discovering KVM services

arm64 will soon require its own callback to initialise services
that are only available on this architecture. Introduce a hook
that can be overloaded by the architecture.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm/include/asm/hypervisor.h | 1 +
arch/arm64/include/asm/hypervisor.h | 1 +
drivers/firmware/smccc/kvm_guest.c | 4 ++++
3 files changed, 6 insertions(+)

diff --git a/arch/arm/include/asm/hypervisor.h b/arch/arm/include/asm/hypervisor.h
index bd61502b9715..8133c8c81a35 100644
--- a/arch/arm/include/asm/hypervisor.h
+++ b/arch/arm/include/asm/hypervisor.h
@@ -6,5 +6,6 @@

void kvm_init_hyp_services(void);
bool kvm_arm_hyp_service_available(u32 func_id);
+void kvm_arm_init_hyp_services(void);

#endif
diff --git a/arch/arm64/include/asm/hypervisor.h b/arch/arm64/include/asm/hypervisor.h
index 0ae427f352c8..8e77f411903f 100644
--- a/arch/arm64/include/asm/hypervisor.h
+++ b/arch/arm64/include/asm/hypervisor.h
@@ -6,5 +6,6 @@

void kvm_init_hyp_services(void);
bool kvm_arm_hyp_service_available(u32 func_id);
+void kvm_arm_init_hyp_services(void);

#endif
diff --git a/drivers/firmware/smccc/kvm_guest.c b/drivers/firmware/smccc/kvm_guest.c
index 2d3e866decaa..56169e73252a 100644
--- a/drivers/firmware/smccc/kvm_guest.c
+++ b/drivers/firmware/smccc/kvm_guest.c
@@ -9,6 +9,8 @@

#include <asm/hypervisor.h>

+void __weak kvm_arm_init_hyp_services(void) {}
+
static DECLARE_BITMAP(__kvm_arm_hyp_services, ARM_SMCCC_KVM_NUM_FUNCS) __ro_after_init = { };

void __init kvm_init_hyp_services(void)
@@ -38,6 +40,8 @@ void __init kvm_init_hyp_services(void)

pr_info("hypervisor services detected (0x%08lx 0x%08lx 0x%08lx 0x%08lx)\n",
res.a3, res.a2, res.a1, res.a0);
+
+ kvm_arm_init_hyp_services();
}

bool kvm_arm_hyp_service_available(u32 func_id)
--
2.30.2

2021-07-15 17:26:08

by Marc Zyngier

Subject: [PATCH 14/16] arm64: Enroll into KVM's MMIO guard if required

Should a guest desire to enroll into the MMIO guard, allow it to
do so with a command-line option.

Signed-off-by: Marc Zyngier <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 3 ++
arch/arm64/include/asm/hypervisor.h | 1 +
arch/arm64/kernel/setup.c | 6 +++
arch/arm64/mm/ioremap.c | 38 +++++++++++++++++++
4 files changed, 48 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index bdb22006f713..a398585bed90 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2062,6 +2062,9 @@
1 - Bypass the IOMMU for DMA.
unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.

+ ioremap_guard [ARM64] enable the KVM MMIO guard functionality
+ if available.
+
io7= [HW] IO7 for Marvel-based Alpha systems
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
diff --git a/arch/arm64/include/asm/hypervisor.h b/arch/arm64/include/asm/hypervisor.h
index 8e77f411903f..b130c7b82eaa 100644
--- a/arch/arm64/include/asm/hypervisor.h
+++ b/arch/arm64/include/asm/hypervisor.h
@@ -7,5 +7,6 @@
void kvm_init_hyp_services(void);
bool kvm_arm_hyp_service_available(u32 func_id);
void kvm_arm_init_hyp_services(void);
+void kvm_init_ioremap_services(void);

#endif
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index be5f85b0a24d..c325647f675f 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -49,6 +49,7 @@
#include <asm/tlbflush.h>
#include <asm/traps.h>
#include <asm/efi.h>
+#include <asm/hypervisor.h>
#include <asm/xen/hypervisor.h>
#include <asm/mmu_context.h>

@@ -445,3 +446,8 @@ static int __init register_arm64_panic_block(void)
return 0;
}
device_initcall(register_arm64_panic_block);
+
+void kvm_arm_init_hyp_services(void)
+{
+ kvm_init_ioremap_services();
+}
diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index 0801fd92f0e3..d82b63bcc554 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -23,6 +23,44 @@

static DEFINE_STATIC_KEY_FALSE(ioremap_guard_key);

+static bool ioremap_guard;
+static int __init ioremap_guard_setup(char *str)
+{
+ ioremap_guard = true;
+
+ return 0;
+}
+early_param("ioremap_guard", ioremap_guard_setup);
+
+void kvm_init_ioremap_services(void)
+{
+ struct arm_smccc_res res;
+
+ if (!ioremap_guard)
+ return;
+
+ /* We need all the functions to be implemented */
+ if (!kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO) ||
+ !kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL) ||
+ !kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP) ||
+ !kvm_arm_hyp_service_available(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP))
+ return;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID,
+ &res);
+ if (res.a0 != PAGE_SIZE)
+ return;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID,
+ &res);
+ if (res.a0 == SMCCC_RET_SUCCESS) {
+ static_branch_enable(&ioremap_guard_key);
+ pr_info("Using KVM MMIO guard for ioremap\n");
+ } else {
+ pr_warn("KVM MMIO guard registration failed (%ld)\n", res.a0);
+ }
+}
+
void ioremap_page_range_hook(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot)
{
--
2.30.2

2021-07-15 18:17:14

by Marc Zyngier

Subject: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

kvm_pgtable_stage2_set_owner() could be generalised into a way
to store up to 63 bits in the page tables, as long as we don't
set bit 0.

Let's just do that.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 12 +++++++-----
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 14 ++++++++++++--
arch/arm64/kvm/hyp/pgtable.c | 20 ++++++--------------
3 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index f004c0115d89..9579e8c2793b 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -274,14 +274,16 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
void *mc);

/**
- * kvm_pgtable_stage2_set_owner() - Unmap and annotate pages in the IPA space to
- * track ownership.
+ * kvm_pgtable_stage2_annotate() - Unmap and annotate pages in the IPA space
+ * to track ownership (and more).
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Base intermediate physical address to annotate.
* @size: Size of the annotated range.
* @mc: Cache of pre-allocated and zeroed memory from which to allocate
* page-table pages.
- * @owner_id: Unique identifier for the owner of the page.
+ * @annotation: A 63 bit value that will be stored in the page tables.
+ * @annotation[0] must be 0, and @annotation[63:1] is stored
+ * in the page tables. @annotation as a whole must not be 0.
*
* By default, all page-tables are owned by identifier 0. This function can be
* used to mark portions of the IPA space as owned by other entities. When a
@@ -290,8 +292,8 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
- void *mc, u8 owner_id);
+int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ void *mc, kvm_pte_t annotation);

/**
* kvm_pgtable_stage2_unmap() - Remove a mapping from a guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index d938ce95d3bd..ffe482c3b818 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -245,6 +245,15 @@ static int host_stage2_idmap(u64 addr)
return ret;
}

+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
+#define KVM_MAX_OWNER_ID 1
+
+static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
+{
+ BUG_ON(owner_id > KVM_MAX_OWNER_ID);
+ return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
+}
+
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
{
int ret;
@@ -257,8 +266,9 @@ int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
return -EINVAL;

hyp_spin_lock(&host_kvm.lock);
- ret = kvm_pgtable_stage2_set_owner(&host_kvm.pgt, start, end - start,
- &host_s2_pool, pkvm_hyp_id);
+ ret = kvm_pgtable_stage2_annotate(&host_kvm.pgt, start, end - start,
+ &host_s2_pool,
+ kvm_init_invalid_leaf_owner(pkvm_hyp_id));
hyp_spin_unlock(&host_kvm.lock);

return ret != -EAGAIN ? ret : 0;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a5874ebd0354..a065f6d960af 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -50,9 +50,6 @@

#define KVM_PTE_LEAF_ATTR_S2_IGNORED GENMASK(58, 55)

-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
-#define KVM_MAX_OWNER_ID 1
-
struct kvm_pgtable_walk_data {
struct kvm_pgtable *pgt;
struct kvm_pgtable_walker *walker;
@@ -206,11 +203,6 @@ static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
return pte;
}

-static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
-{
- return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
-}
-
static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag)
@@ -466,7 +458,7 @@ void kvm_pgtable_hyp_destroy(struct kvm_pgtable *pgt)
struct stage2_map_data {
u64 phys;
kvm_pte_t attr;
- u8 owner_id;
+ u64 annotation;

kvm_pte_t *anchor;
kvm_pte_t *childp;
@@ -603,7 +595,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
if (kvm_phys_is_valid(phys))
new = kvm_init_valid_leaf_pte(phys, data->attr, level);
else
- new = kvm_init_invalid_leaf_owner(data->owner_id);
+ new = data->annotation;

if (stage2_pte_is_counted(old)) {
/*
@@ -796,8 +788,8 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
return ret;
}

-int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
- void *mc, u8 owner_id)
+int kvm_pgtable_stage2_annotate(struct kvm_pgtable *pgt, u64 addr, u64 size,
+ void *mc, kvm_pte_t annotation)
{
int ret;
struct stage2_map_data map_data = {
@@ -805,7 +797,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
.mmu = pgt->mmu,
.memcache = mc,
.mm_ops = pgt->mm_ops,
- .owner_id = owner_id,
+ .annotation = annotation,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
.arg = &map_data,
};

- if (owner_id > KVM_MAX_OWNER_ID)
+ if (!annotation || (annotation & PTE_VALID))
return -EINVAL;

ret = kvm_pgtable_walk(pgt, addr, size, &walker);
--
2.30.2

2021-07-15 18:17:31

by Marc Zyngier

Subject: [PATCH 01/16] KVM: arm64: Generalise VM features into a set of flags

We currently deal with a set of booleans for VM features,
while they could be better represented as a set of flags
contained in an unsigned long, similarly to what we are
doing on the CPU side.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/include/asm/kvm_host.h | 12 +++++++-----
arch/arm64/kvm/arm.c | 5 +++--
arch/arm64/kvm/mmio.c | 3 ++-
3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 41911585ae0c..4add6c27251f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -122,7 +122,10 @@ struct kvm_arch {
* should) opt in to this feature if KVM_CAP_ARM_NISV_TO_USER is
* supported.
*/
- bool return_nisv_io_abort_to_user;
+#define KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER 0
+ /* Memory Tagging Extension enabled for the guest */
+#define KVM_ARCH_FLAG_MTE_ENABLED 1
+ unsigned long flags;

/*
* VM-wide PMU filter, implemented as a bitmap and big enough for
@@ -133,9 +136,6 @@ struct kvm_arch {

u8 pfr0_csv2;
u8 pfr0_csv3;
-
- /* Memory Tagging Extension enabled for the guest */
- bool mte_enabled;
};

struct kvm_vcpu_fault_info {
@@ -777,7 +777,9 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
#define kvm_arm_vcpu_sve_finalized(vcpu) \
((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)

-#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
+#define kvm_has_mte(kvm) \
+ (system_supports_mte() && \
+ test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &(kvm)->arch.flags))
#define kvm_vcpu_has_pmu(vcpu) \
(test_bit(KVM_ARM_VCPU_PMU_V3, (vcpu)->arch.features))

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e9a2b8f27792..97ab1512c44f 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -91,13 +91,14 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
switch (cap->cap) {
case KVM_CAP_ARM_NISV_TO_USER:
r = 0;
- kvm->arch.return_nisv_io_abort_to_user = true;
+ set_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER,
+ &kvm->arch.flags);
break;
case KVM_CAP_ARM_MTE:
if (!system_supports_mte() || kvm->created_vcpus)
return -EINVAL;
r = 0;
- kvm->arch.mte_enabled = true;
+ set_bit(KVM_ARCH_FLAG_MTE_ENABLED, &kvm->arch.flags);
break;
default:
r = -EINVAL;
diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
index 3e2d8ba11a02..3dd38a151d2a 100644
--- a/arch/arm64/kvm/mmio.c
+++ b/arch/arm64/kvm/mmio.c
@@ -135,7 +135,8 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
* volunteered to do so, and bail out otherwise.
*/
if (!kvm_vcpu_dabt_isvalid(vcpu)) {
- if (vcpu->kvm->arch.return_nisv_io_abort_to_user) {
+ if (test_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER,
+ &vcpu->kvm->arch.flags)) {
run->exit_reason = KVM_EXIT_ARM_NISV;
run->arm_nisv.esr_iss = kvm_vcpu_dabt_iss_nisv_sanitized(vcpu);
run->arm_nisv.fault_ipa = fault_ipa;
--
2.30.2

2021-07-15 18:22:33

by Marc Zyngier

Subject: [PATCH 10/16] KVM: arm64: Add some documentation for the MMIO guard feature

Document the hypercalls used for the MMIO guard infrastructure.

Signed-off-by: Marc Zyngier <[email protected]>
---
Documentation/virt/kvm/arm/index.rst | 1 +
Documentation/virt/kvm/arm/mmio-guard.rst | 73 +++++++++++++++++++++++
2 files changed, 74 insertions(+)
create mode 100644 Documentation/virt/kvm/arm/mmio-guard.rst

diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
index 78a9b670aafe..e77a0ee2e2d4 100644
--- a/Documentation/virt/kvm/arm/index.rst
+++ b/Documentation/virt/kvm/arm/index.rst
@@ -11,3 +11,4 @@ ARM
psci
pvtime
ptp_kvm
+ mmio-guard
diff --git a/Documentation/virt/kvm/arm/mmio-guard.rst b/Documentation/virt/kvm/arm/mmio-guard.rst
new file mode 100644
index 000000000000..a5563a3e12cc
--- /dev/null
+++ b/Documentation/virt/kvm/arm/mmio-guard.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+KVM MMIO guard
+==============
+
+KVM implements device emulation by handling translation faults to any
+IPA range that is not contained a memory slot. Such translation fault
+is in most cases passed on to userspace (or in rare cases to the host
+kernel) with the address, size and possibly data of the access for
+emulation.
+
+Should the guest exit with an address that is not one that corresponds
+to an emulatable device, userspace may take measures that are not the
+most graceful as far as the guest is concerned (such as terminating it
+or delivering a fatal exception).
+
+There is also an element of trust: by forwarding the request to
+userspace, the kernel asumes that the guest trusts userspace to do the
+right thing.
+
+The KVM MMIO guard offers a way to mitigate this last point: a guest
+can request that only certainly regions of the IPA space are valid as
+MMIO. Only these regions will be handled as an MMIO, and any other
+will result in an exception being delivered to the guest.
+
+This relies on a set of hypercalls defined in the KVM-specific range,
+using the HVC64 calling convention.
+
+* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO
+
+ ============== ======== ================================
+ Function ID: (uint32) 0xC6000002
+ Arguments: none
+ Return Values: (int64) NOT_SUPPORTED(-1) on error, or
+ (uint64) Protection Granule (PG) size in
+ bytes (r0)
+ ============== ======== ================================
+
+* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL
+
+ ============== ======== ==============================
+ Function ID: (uint32) 0xC6000003
+ Arguments: none
+ Return Values: (int64) NOT_SUPPORTED(-1) on error, or
+ RET_SUCCESS(0) (r0)
+ ============== ======== ==============================
+
+* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP
+
+ ============== ======== ======================================
+ Function ID: (uint32) 0xC6000004
+ Arguments: (uint64) The base of the PG-sized IPA range
+ that is allowed to be accessed as
+ MMIO. Must aligned to the PG size (r1)
+ (uint64) Index in the MAIR_EL1 register
+ providing the memory attribute that
+ is used by the guest (r2)
+ Return Values: (int64) NOT_SUPPORTED(-1) on error, or
+ RET_SUCCESS(0) (r0)
+ ============== ======== ======================================
+
+* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP
+
+ ============== ======== ======================================
+ Function ID: (uint32) 0xC6000004
+ Arguments: (uint64) The base of the PG-sized IPA range
+ that is forbidden to be accessed as
+ MMIO. Must aligned to the PG size
+ and have been previously mapped (r1)
+ Return Values: (int64) NOT_SUPPORTED(-1) on error, or
+ RET_SUCCESS(0) (r0)
+ ============== ======== ======================================
--
2.30.2

2021-07-15 18:23:12

by Marc Zyngier

Subject: [PATCH 12/16] mm/ioremap: Add arch-specific callbacks on ioremap/iounmap calls

Add a pair of hooks (ioremap_page_range_hook/iounmap_page_range_hook)
that can be implemented by an architecture.

Signed-off-by: Marc Zyngier <[email protected]>
---
include/linux/io.h | 3 +++
mm/ioremap.c | 13 ++++++++++++-
mm/vmalloc.c | 8 ++++++++
3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/io.h b/include/linux/io.h
index 9595151d800d..0ffc265f114c 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -21,6 +21,9 @@ void __ioread32_copy(void *to, const void __iomem *from, size_t count);
void __iowrite64_copy(void __iomem *to, const void *from, size_t count);

#ifdef CONFIG_MMU
+void ioremap_page_range_hook(unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot);
+void iounmap_page_range_hook(phys_addr_t phys_addr, size_t size);
int ioremap_page_range(unsigned long addr, unsigned long end,
phys_addr_t phys_addr, pgprot_t prot);
#else
diff --git a/mm/ioremap.c b/mm/ioremap.c
index 8ee0136f8cb0..bd77a86088f2 100644
--- a/mm/ioremap.c
+++ b/mm/ioremap.c
@@ -28,10 +28,21 @@ early_param("nohugeiomap", set_nohugeiomap);
static const unsigned int iomap_max_page_shift = PAGE_SHIFT;
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */

+void __weak ioremap_page_range_hook(unsigned long addr, unsigned long end,
+ phys_addr_t phys_addr, pgprot_t prot)
+{
+}
+
int ioremap_page_range(unsigned long addr,
unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
- return vmap_range(addr, end, phys_addr, prot, iomap_max_page_shift);
+ int ret;
+
+ ret = vmap_range(addr, end, phys_addr, prot, iomap_max_page_shift);
+ if (!ret)
+ ioremap_page_range_hook(addr, end, phys_addr, prot);
+
+ return ret;
}

#ifdef CONFIG_GENERIC_IOREMAP
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d5cd52805149..af18a6141093 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -38,6 +38,7 @@
#include <linux/pgtable.h>
#include <linux/uaccess.h>
#include <linux/hugetlb.h>
+#include <linux/io.h>
#include <asm/tlbflush.h>
#include <asm/shmparam.h>

@@ -2551,6 +2552,10 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
set_area_direct_map(area, set_direct_map_default_noflush);
}

+void __weak iounmap_page_range_hook(phys_addr_t phys_addr, size_t size)
+{
+}
+
static void __vunmap(const void *addr, int deallocate_pages)
{
struct vm_struct *area;
@@ -2574,6 +2579,9 @@ static void __vunmap(const void *addr, int deallocate_pages)

kasan_poison_vmalloc(area->addr, get_vm_area_size(area));

+ if (area->flags & VM_IOREMAP)
+ iounmap_page_range_hook(area->phys_addr, get_vm_area_size(area));
+
vm_remove_mappings(area, deallocate_pages);

if (deallocate_pages) {
--
2.30.2

2021-07-15 18:23:55

by Marc Zyngier

Subject: [PATCH 16/16] arm64: Register earlycon fixmap with the MMIO guard

On initialising the MMIO guard infrastructure, register the
earlycon mapping if present.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/mm/ioremap.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c
index d82b63bcc554..a27b58e03c93 100644
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -32,6 +32,18 @@ static int __init ioremap_guard_setup(char *str)
}
early_param("ioremap_guard", ioremap_guard_setup);

+static void fixup_fixmap(void)
+{
+ unsigned long addr = __fix_to_virt(FIX_EARLYCON_MEM_BASE);
+ pte_t *ptep = __get_fixmap_pte(FIX_EARLYCON_MEM_BASE);
+
+ if (!ptep)
+ return;
+
+ ioremap_page_range_hook(addr, addr + PAGE_SIZE, __pte_to_phys(*ptep),
+ __pgprot(pte_val(*ptep) & PTE_ATTRINDX_MASK));
+}
+
void kvm_init_ioremap_services(void)
{
struct arm_smccc_res res;
@@ -55,6 +67,7 @@ void kvm_init_ioremap_services(void)
&res);
if (res.a0 == SMCCC_RET_SUCCESS) {
static_branch_enable(&ioremap_guard_key);
+ fixup_fixmap();
pr_info("Using KVM MMIO guard for ioremap\n");
} else {
pr_warn("KVM MMIO guard registration failed (%ld)\n", res.a0);
--
2.30.2

2021-07-15 18:24:06

by Marc Zyngier

Subject: [PATCH 15/16] arm64: Add a helper to retrieve the PTE of a fixmap

In order to transfer the early mapping state into KVM's MMIO
guard infrastructure, provide a small helper that will retrieve
the associated PTE.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/include/asm/fixmap.h | 2 ++
arch/arm64/mm/mmu.c | 15 +++++++++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 4335800201c9..1aae625b944f 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -105,6 +105,8 @@ void __init early_fixmap_init(void);

extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot);

+extern pte_t *__get_fixmap_pte(enum fixed_addresses idx);
+
#include <asm-generic/fixmap.h>

#endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d74586508448..f1b7abd04025 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1286,6 +1286,21 @@ void __set_fixmap(enum fixed_addresses idx,
}
}

+pte_t *__get_fixmap_pte(enum fixed_addresses idx)
+{
+ unsigned long addr = __fix_to_virt(idx);
+ pte_t *ptep;
+
+ BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
+
+ ptep = fixmap_pte(addr);
+
+ if (!pte_valid(*ptep))
+ return NULL;
+
+ return ptep;
+}
+
void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
{
const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
--
2.30.2

2021-07-15 21:01:14

by Marc Zyngier

Subject: [PATCH 06/16] KVM: arm64: Force a full unmap on vcpu reinit

As we now keep information in the S2PT, we must be careful not
to keep it across a VM reboot, which could otherwise lead to
interesting problems.

Make sure that the S2 is completely discarded on reset of
a vcpu, and remove the flag that enforces the MMIO check.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/kvm/arm.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 97ab1512c44f..b0d2225190d2 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1096,12 +1096,18 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
* ensuring that the data side is always coherent. We still
* need to invalidate the I-cache though, as FWB does *not*
* imply CTR_EL0.DIC.
+ *
+ * If the MMIO guard was enabled, we pay the price of a full
+ * unmap to get back to a sane state (and clear the flag).
*/
if (vcpu->arch.has_run_once) {
- if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
+ if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
+ test_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags))
stage2_unmap_vm(vcpu->kvm);
else
icache_inval_all_pou();
+
+ clear_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
}

vcpu_reset_hcr(vcpu);
--
2.30.2

2021-07-15 21:02:29

by Marc Zyngier

Subject: [PATCH 09/16] KVM: arm64: Advertise a capability for MMIO guard

In order for userspace to find out whether the MMIO guard is
exposed to a guest, expose a capability that says so.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/kvm/arm.c | 1 +
include/uapi/linux/kvm.h | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index b0d2225190d2..72ebad749b0c 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -214,6 +214,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SET_GUEST_DEBUG:
case KVM_CAP_VCPU_ATTRIBUTES:
case KVM_CAP_PTP_KVM:
+ case KVM_CAP_ARM_MMIO_GUARD:
r = 1;
break;
case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d9e4aabcb31a..d4a5715c5c8f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1112,6 +1112,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_BINARY_STATS_FD 203
#define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
#define KVM_CAP_ARM_MTE 205
+#define KVM_CAP_ARM_MMIO_GUARD 206

#ifdef KVM_CAP_IRQ_ROUTING

--
2.30.2

2021-07-15 21:02:30

by Marc Zyngier

Subject: [PATCH 08/16] KVM: arm64: Add tracepoint for failed MMIO guard check

In order to make debugging easier, expose a new trace point
that triggers when a MMIO check fails.

Signed-off-by: Marc Zyngier <[email protected]>
---
arch/arm64/kvm/mmu.c | 4 +++-
arch/arm64/kvm/trace_arm.h | 17 +++++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 638827c8842b..c2a23457552b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1229,8 +1229,10 @@ bool kvm_check_ioguard_page(struct kvm_vcpu *vcpu, gpa_t ipa)
ret = __check_ioguard_page(vcpu, ipa & PAGE_MASK);
spin_unlock(&vcpu->kvm->mmu_lock);

- if (!ret)
+ if (!ret) {
+ trace_kvm_failed_mmio_check(*vcpu_pc(vcpu), ipa);
kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
+ }

return ret;
}
diff --git a/arch/arm64/kvm/trace_arm.h b/arch/arm64/kvm/trace_arm.h
index 33e4e7dd2719..e40cfeb251ad 100644
--- a/arch/arm64/kvm/trace_arm.h
+++ b/arch/arm64/kvm/trace_arm.h
@@ -89,6 +89,23 @@ TRACE_EVENT(kvm_access_fault,
TP_printk("IPA: %lx", __entry->ipa)
);

+TRACE_EVENT(kvm_failed_mmio_check,
+ TP_PROTO(unsigned long vcpu_pc, unsigned long ipa),
+ TP_ARGS(vcpu_pc, ipa),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, vcpu_pc )
+ __field( unsigned long, ipa )
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_pc = vcpu_pc;
+ __entry->ipa = ipa;
+ ),
+
+ TP_printk("PC: %lx IPA: %lx", __entry->vcpu_pc, __entry->ipa)
+);
+
TRACE_EVENT(kvm_irq_line,
TP_PROTO(unsigned int type, int vcpu_idx, int irq_num, int level),
TP_ARGS(type, vcpu_idx, irq_num, level),
--
2.30.2

2021-07-20 10:11:01

by Quentin Perret

Subject: Re: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

On Thursday 15 Jul 2021 at 17:31:46 (+0100), Marc Zyngier wrote:
> @@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> .arg = &map_data,
> };
>
> - if (owner_id > KVM_MAX_OWNER_ID)
> + if (!annotation || (annotation & PTE_VALID))
> return -EINVAL;

Why do you consider annotation==0 invalid? The assumption so far has
been that the owner_id for the host is 0, so annotating a range with 0s
should be a valid operation -- this will be required when e.g.
transferring ownership of a page back to the host.

>
> ret = kvm_pgtable_walk(pgt, addr, size, &walker);
> --
> 2.30.2
>

2021-07-20 10:24:30

by Marc Zyngier

Subject: Re: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

On Tue, 20 Jul 2021 11:09:21 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Thursday 15 Jul 2021 at 17:31:46 (+0100), Marc Zyngier wrote:
> > @@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > .arg = &map_data,
> > };
> >
> > - if (owner_id > KVM_MAX_OWNER_ID)
> > + if (!annotation || (annotation & PTE_VALID))
> > return -EINVAL;
>
> Why do you consider annotation==0 invalid? The assumption so far has
> been that the owner_id for the host is 0, so annotating a range with 0s
> should be a valid operation -- this will be required when e.g.
> transferring ownership of a page back to the host.

How do you then distinguish it from an empty entry that doesn't map to
anything at all?

M.

--
Without deviation from the norm, progress is not possible.

2021-07-20 10:41:23

by Quentin Perret

Subject: Re: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

On Tuesday 20 Jul 2021 at 11:21:17 (+0100), Marc Zyngier wrote:
> On Tue, 20 Jul 2021 11:09:21 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > On Thursday 15 Jul 2021 at 17:31:46 (+0100), Marc Zyngier wrote:
> > > @@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > > .arg = &map_data,
> > > };
> > >
> > > - if (owner_id > KVM_MAX_OWNER_ID)
> > > + if (!annotation || (annotation & PTE_VALID))
> > > return -EINVAL;
> >
> > Why do you consider annotation==0 invalid? The assumption so far has
> > been that the owner_id for the host is 0, so annotating a range with 0s
> > should be a valid operation -- this will be required when e.g.
> > transferring ownership of a page back to the host.
>
> How do you then distinguish it from an empty entry that doesn't map to
> anything at all?

You don't, but that's beauty of it :)

The host starts with a PGD full of zeroes, which in terms of ownership
means that it owns the entire (I)PA space. And it loses ownership of a
page only when we explicitly annotate it with an owner id != 0.

2021-07-20 11:18:59

by Quentin Perret

Subject: Re: [PATCH 15/16] arm64: Add a helper to retrieve the PTE of a fixmap

On Thursday 15 Jul 2021 at 17:31:58 (+0100), Marc Zyngier wrote:
> In order to transfer the early mapping state into KVM's MMIO
> guard infrastucture, provide a small helper that will retrieve
> the associated PTE.
>
> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> arch/arm64/include/asm/fixmap.h | 2 ++
> arch/arm64/mm/mmu.c | 15 +++++++++++++++
> 2 files changed, 17 insertions(+)
>
> diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
> index 4335800201c9..1aae625b944f 100644
> --- a/arch/arm64/include/asm/fixmap.h
> +++ b/arch/arm64/include/asm/fixmap.h
> @@ -105,6 +105,8 @@ void __init early_fixmap_init(void);
>
> extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot);
>
> +extern pte_t *__get_fixmap_pte(enum fixed_addresses idx);
> +
> #include <asm-generic/fixmap.h>
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index d74586508448..f1b7abd04025 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1286,6 +1286,21 @@ void __set_fixmap(enum fixed_addresses idx,
> }
> }
>
> +pte_t *__get_fixmap_pte(enum fixed_addresses idx)
> +{
> + unsigned long addr = __fix_to_virt(idx);

Nit: odd spacing here.

> + pte_t *ptep;
> +
> + BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
> +
> + ptep = fixmap_pte(addr);
> +
> + if (!pte_valid(*ptep))
> + return NULL;
> +
> + return ptep;
> +}
> +
> void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot)
> {
> const u64 dt_virt_base = __fix_to_virt(FIX_FDT);
> --
> 2.30.2
>

2021-07-20 11:23:18

by Marc Zyngier

Subject: Re: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

On Tue, 20 Jul 2021 11:38:17 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Tuesday 20 Jul 2021 at 11:21:17 (+0100), Marc Zyngier wrote:
> > On Tue, 20 Jul 2021 11:09:21 +0100,
> > Quentin Perret <[email protected]> wrote:
> > >
> > > On Thursday 15 Jul 2021 at 17:31:46 (+0100), Marc Zyngier wrote:
> > > > @@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > > > .arg = &map_data,
> > > > };
> > > >
> > > > - if (owner_id > KVM_MAX_OWNER_ID)
> > > > + if (!annotation || (annotation & PTE_VALID))
> > > > return -EINVAL;
> > >
> > > Why do you consider annotation==0 invalid? The assumption so far has
> > > been that the owner_id for the host is 0, so annotating a range with 0s
> > > should be a valid operation -- this will be required when e.g.
> > > transferring ownership of a page back to the host.
> >
> > How do you then distinguish it from an empty entry that doesn't map to
> > anything at all?
>
> You don't, but that's beauty of it :)
>
> The host starts with a PGD full of zeroes, which in terms of ownership
> means that it owns the entire (I)PA space. And it loses ownership of a
> page only when we explicitly annotate it with an owner id != 0.

Right. But this scheme doesn't apply to the guests, does it? Don't we
need something that is non-null to preserve the table refcounting?

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-07-20 11:38:17

by Quentin Perret

Subject: Re: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

On Tuesday 20 Jul 2021 at 12:20:58 (+0100), Marc Zyngier wrote:
> On Tue, 20 Jul 2021 11:38:17 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > On Tuesday 20 Jul 2021 at 11:21:17 (+0100), Marc Zyngier wrote:
> > > On Tue, 20 Jul 2021 11:09:21 +0100,
> > > Quentin Perret <[email protected]> wrote:
> > > >
> > > > On Thursday 15 Jul 2021 at 17:31:46 (+0100), Marc Zyngier wrote:
> > > > > @@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > > > > .arg = &map_data,
> > > > > };
> > > > >
> > > > > - if (owner_id > KVM_MAX_OWNER_ID)
> > > > > + if (!annotation || (annotation & PTE_VALID))
> > > > > return -EINVAL;
> > > >
> > > > Why do you consider annotation==0 invalid? The assumption so far has
> > > > been that the owner_id for the host is 0, so annotating a range with 0s
> > > > should be a valid operation -- this will be required when e.g.
> > > > transferring ownership of a page back to the host.
> > >
> > > How do you then distinguish it from an empty entry that doesn't map to
> > > anything at all?
> >
> > You don't, but that's beauty of it :)
> >
> > The host starts with a PGD full of zeroes, which in terms of ownership
> > means that it owns the entire (I)PA space. And it loses ownership of a
> > page only when we explicitly annotate it with an owner id != 0.
>
> Right. But this scheme doesn't apply to the guests, does it?

Right, the meaning of a NULL PTE in guests will clearly be something
different, but I guess the interpretation of what invalid mappings mean
is up to the caller.

> Don't we
> need something that is non-null to preserve the table refcounting?

Sure, but do we care? If the table entry gets zeroed we're then
basically using an 'invalid block' mapping to annotate the entire block
range with '0', whatever that means. For guests it won't mean much, but
for the host that would mean sole ownership of the entire range.

2021-07-20 13:18:46

by Marc Zyngier

Subject: Re: [PATCH 03/16] KVM: arm64: Turn kvm_pgtable_stage2_set_owner into kvm_pgtable_stage2_annotate

On Tue, 20 Jul 2021 12:36:21 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Tuesday 20 Jul 2021 at 12:20:58 (+0100), Marc Zyngier wrote:
> > On Tue, 20 Jul 2021 11:38:17 +0100,
> > Quentin Perret <[email protected]> wrote:
> > >
> > > On Tuesday 20 Jul 2021 at 11:21:17 (+0100), Marc Zyngier wrote:
> > > > On Tue, 20 Jul 2021 11:09:21 +0100,
> > > > Quentin Perret <[email protected]> wrote:
> > > > >
> > > > > On Thursday 15 Jul 2021 at 17:31:46 (+0100), Marc Zyngier wrote:
> > > > > > @@ -815,7 +807,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > > > > > .arg = &map_data,
> > > > > > };
> > > > > >
> > > > > > - if (owner_id > KVM_MAX_OWNER_ID)
> > > > > > + if (!annotation || (annotation & PTE_VALID))
> > > > > > return -EINVAL;
> > > > >
> > > > > Why do you consider annotation==0 invalid? The assumption so far has
> > > > > been that the owner_id for the host is 0, so annotating a range with 0s
> > > > > should be a valid operation -- this will be required when e.g.
> > > > > transferring ownership of a page back to the host.
> > > >
> > > > How do you then distinguish it from an empty entry that doesn't map to
> > > > anything at all?
> > >
> > > You don't, but that's beauty of it :)
> > >
> > > The host starts with a PGD full of zeroes, which in terms of ownership
> > > means that it owns the entire (I)PA space. And it loses ownership of a
> > > page only when we explicitly annotate it with an owner id != 0.
> >
> > Right. But this scheme doesn't apply to the guests, does it?
>
> Right, the meaning of a NULL PTE in guests will clearly be something
> different, but I guess the interpretation of what invalid mappings mean
> is up to the caller.
>
> > Don't we
> > need something that is non-null to preserve the table refcounting?
>
> Sure, but do we care? If the table entry gets zeroed we're then
> basically using an 'invalid block' mapping to annotate the entire block
> range with '0', whatever that means. For guests it won't mean much, but
> for the host that would mean sole ownership of the entire range.

I see. You let the refcount drop to 0, unmap the table and let
the 0 annotation transfer one level up, covering the whole block.

I guess I'll revert back to allowing 0, but I'd like to make sure we
don't do that for guests unless we actually tear down the address
space (checking for KVM_PGTABLE_S2_IDMAP should work).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-07-21 21:19:07

by Andrew Jones

Subject: Re: [PATCH 10/16] KVM: arm64: Add some documentation for the MMIO guard feature

On Thu, Jul 15, 2021 at 05:31:53PM +0100, Marc Zyngier wrote:
> Document the hypercalls user for the MMIO guard infrastructure.
>
> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> Documentation/virt/kvm/arm/index.rst | 1 +
> Documentation/virt/kvm/arm/mmio-guard.rst | 73 +++++++++++++++++++++++
> 2 files changed, 74 insertions(+)
> create mode 100644 Documentation/virt/kvm/arm/mmio-guard.rst
>
> diff --git a/Documentation/virt/kvm/arm/index.rst b/Documentation/virt/kvm/arm/index.rst
> index 78a9b670aafe..e77a0ee2e2d4 100644
> --- a/Documentation/virt/kvm/arm/index.rst
> +++ b/Documentation/virt/kvm/arm/index.rst
> @@ -11,3 +11,4 @@ ARM
> psci
> pvtime
> ptp_kvm
> + mmio-guard
> diff --git a/Documentation/virt/kvm/arm/mmio-guard.rst b/Documentation/virt/kvm/arm/mmio-guard.rst
> new file mode 100644
> index 000000000000..a5563a3e12cc
> --- /dev/null
> +++ b/Documentation/virt/kvm/arm/mmio-guard.rst
> @@ -0,0 +1,73 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============
> +KVM MMIO guard
> +==============
> +
> +KVM implements device emulation by handling translation faults to any
> +IPA range that is not contained a memory slot. Such translation fault
^ in ^ a

> +is in most cases passed on to userspace (or in rare cases to the host
> +kernel) with the address, size and possibly data of the access for
> +emulation.
> +
> +Should the guest exit with an address that is not one that corresponds
> +to an emulatable device, userspace may take measures that are not the
> +most graceful as far as the guest is concerned (such as terminating it
> +or delivering a fatal exception).
> +
> +There is also an element of trust: by forwarding the request to
> +userspace, the kernel asumes that the guest trusts userspace to do the

assumes

> +right thing.
> +
> +The KVM MMIO guard offers a way to mitigate this last point: a guest
> +can request that only certainly regions of the IPA space are valid as

certain

> +MMIO. Only these regions will be handled as an MMIO, and any other
> +will result in an exception being delivered to the guest.
> +
> +This relies on a set of hypercalls defined in the KVM-specific range,
> +using the HVC64 calling convention.
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO
> +
> + ============== ======== ================================
> + Function ID: (uint32) 0xC6000002
> + Arguments: none
> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
> + (uint64) Protection Granule (PG) size in
> + bytes (r0)
> + ============== ======== ================================
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL
> +
> + ============== ======== ==============================
> + Function ID: (uint32) 0xC6000003
> + Arguments: none
> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
> + RET_SUCCESS(0) (r0)
> + ============== ======== ==============================
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP
> +
> + ============== ======== ======================================
> + Function ID: (uint32) 0xC6000004
> + Arguments: (uint64) The base of the PG-sized IPA range
> + that is allowed to be accessed as
> + MMIO. Must aligned to the PG size (r1)

align

> + (uint64) Index in the MAIR_EL1 register
> + providing the memory attribute that
> + is used by the guest (r2)
> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
> + RET_SUCCESS(0) (r0)
> + ============== ======== ======================================
> +
> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP
> +
> + ============== ======== ======================================
> + Function ID: (uint32) 0xC6000004

copy+paste error, should be 0xC6000005

> + Arguments: (uint64) The base of the PG-sized IPA range
> + that is forbidden to be accessed as

is now forbidden

or

was allowed

or just drop that part of the sentence because its covered by the "and
have been previously mapped" part. Something like

PG-sized IPA range aligned to the PG size which has been previously mapped
(r1)

> + MMIO. Must aligned to the PG size

align

> + and have been previously mapped (r1)
> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
> + RET_SUCCESS(0) (r0)
> + ============== ======== ======================================
> --
> 2.30.2
>
> _______________________________________________
> kvmarm mailing list
> [email protected]
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

Thanks,
drew

2021-07-21 21:46:16

by Andrew Jones

Subject: Re: [PATCH 00/16] KVM: arm64: MMIO guard PV services

On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> KVM/arm64 currently considers that any memory access outside of a
> memslot is a MMIO access. This so far has served us very well, but
> obviously relies on the guest trusting the host, and especially
> userspace to do the right thing.
>
> As we keep on hacking away at pKVM, it becomes obvious that this trust
> model is not really fit for a confidential computing environment, and
> that the guest would require some guarantees that emulation only
> occurs on portions of the address space that have clearly been
> identified for this purpose.

This trust model is hard for me to reason about. userspace is trusted to
control the life cycle of the VM, to prepare the memslots for the VM,
and [presumably] identify what MMIO ranges are valid, yet it's not
trusted to handle invalid MMIO accesses. I'd like to learn more about
this model and the userspace involved.

>
> This series aims at providing the two sides of the above coin:
>
> - a set of PV services (collectively called 'MMIO guard' -- better
> name required!) where the guest can flag portion of its address
> space that it considers as MMIO, with map/unmap semantics. Any
> attempt to access a MMIO range outside of these regions will result
> in an external abort being injected.
>
> - a set of hooks into the ioremap code allowing a Linux guest to tell
> KVM about things it want to consider as MMIO. I definitely hate this
> part of the series, as it feels clumsy and brittle.
>
> For now, the enrolment in this scheme is controlled by a guest kernel
> command-line parameters, but it is expected that KVM will enforce this
> for protected VMs.
>
> Note that this crucially misses a save/restore interface for non
> protected VMs, and I currently don't have a good solution for
> that. Ideas welcome.
>
> I also plan to use this series as a base for some other purposes,
> namely to trick the guest in telling us how it maps things like
> prefetchable BARs (see the discussion at [1]). That part is not
> implemented yet, but there is already some provision to pass the MAIR
> index across.
>
> Patches on top of 5.14-rc1, branch pushed at the usual location.
>
> [1] [email protected]

The fun never stops.

Thanks,
drew

2021-07-22 10:06:12

by Marc Zyngier

Subject: Re: [PATCH 00/16] KVM: arm64: MMIO guard PV services

On Wed, 21 Jul 2021 22:42:43 +0100,
Andrew Jones <[email protected]> wrote:
>
> On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> > KVM/arm64 currently considers that any memory access outside of a
> > memslot is a MMIO access. This so far has served us very well, but
> > obviously relies on the guest trusting the host, and especially
> > userspace to do the right thing.
> >
> > As we keep on hacking away at pKVM, it becomes obvious that this trust
> > model is not really fit for a confidential computing environment, and
> > that the guest would require some guarantees that emulation only
> > occurs on portions of the address space that have clearly been
> > identified for this purpose.
>
> This trust model is hard for me to reason about. userspace is trusted to
> control the life cycle of the VM, to prepare the memslots for the VM,
> and [presumably] identify what MMIO ranges are valid, yet it's not
> trusted to handle invalid MMIO accesses. I'd like to learn more about
> this model and the userspace involved.

Imagine the following scenario:

On top of the normal memory described as memslots (which pKVM will
ensure that userspace cannot access), a malicious userspace describes
to the guest another memory region in a firmware table and does not
back it with a memslot.

The hypervisor cannot validate this firmware description (imagine
doing ACPI and DT parsing at EL2...), so the guest starts using this
"memory" for something, and data slowly trickles all the way to EL0.
Not what you wanted.

To ensure that this doesn't happen, we reverse the problem: userspace
(and ultimately the EL1 kernel) doesn't get involved on a translation
fault outside of a memslot *unless* the guest has explicitly asked for
that page to be handled as a MMIO. With that, we have a full
description of the IPA space contained in the S2 page tables:

- memory described via a memslot,
- directly mapped device (GICv2, for example),
- MMIO exposed for emulation

and anything else is an invalid access that results in an abort.
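
Expressed as (very rough) pseudo-C, the fault-time decision boils down
to something like this -- sketch only, the helper names below are made
up for illustration and are not the actual KVM functions:

static int handle_stage2_fault_sketch(struct kvm_vcpu *vcpu, phys_addr_t ipa)
{
        /* 1. Memory described via a memslot: normal RAM handling */
        if (ipa_in_memslot(vcpu->kvm, ipa))
                return handle_as_memory(vcpu, ipa);

        /* 2. A page the guest registered with MMIO_GUARD_MAP: emulate it */
        if (ipa_is_registered_mmio(vcpu->kvm, ipa))
                return forward_mmio_for_emulation(vcpu, ipa);

        /* 3. Anything else: external abort straight back into the guest */
        inject_external_abort(vcpu);
        return 1;
}

(Directly mapped devices such as GICv2 never fault, so they don't show
up in this path.)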

Does this make sense to you?

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-07-22 13:26:42

by Andrew Jones

[permalink] [raw]
Subject: Re: [PATCH 00/16] KVM: arm64: MMIO guard PV services

On Thu, Jul 22, 2021 at 11:00:26AM +0100, Marc Zyngier wrote:
> On Wed, 21 Jul 2021 22:42:43 +0100,
> Andrew Jones <[email protected]> wrote:
> >
> > On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> > > KVM/arm64 currently considers that any memory access outside of a
> > > memslot is a MMIO access. This so far has served us very well, but
> > > obviously relies on the guest trusting the host, and especially
> > > userspace to do the right thing.
> > >
> > > As we keep on hacking away at pKVM, it becomes obvious that this trust
> > > model is not really fit for a confidential computing environment, and
> > > that the guest would require some guarantees that emulation only
> > > occurs on portions of the address space that have clearly been
> > > identified for this purpose.
> >
> > This trust model is hard for me to reason about. userspace is trusted to
> > control the life cycle of the VM, to prepare the memslots for the VM,
> > and [presumably] identify what MMIO ranges are valid, yet it's not
> > trusted to handle invalid MMIO accesses. I'd like to learn more about
> > this model and the userspace involved.
>
> Imagine the following scenario:
>
> On top of the normal memory described as memslots (which pKVM will
> ensure that userspace cannot access),

Ah, I didn't know that part.

> a malicious userspace describes
> to the guest another memory region in a firmware table and does not
> back it with a memslot.
>
> The hypervisor cannot validate this firmware description (imagine
> doing ACPI and DT parsing at EL2...), so the guest starts using this
> "memory" for something, and data slowly trickles all the way to EL0.
> Not what you wanted.

Yes, I see that now, in light of the above.

>
> To ensure that this doesn't happen, we reverse the problem: userspace
> (and ultimately the EL1 kernel) doesn't get involved on a translation
> fault outside of a memslot *unless* the guest has explicitly asked for
> that page to be handled as a MMIO. With that, we have a full
> description of the IPA space contained in the S2 page tables:
>
> - memory described via a memslot,
> - directly mapped device (GICv2, for example),
> - MMIO exposed for emulation
>
> and anything else is an invalid access that results in an abort.
>
> Does this make sense to you?

Now I understand better, but if we're worried about malicious userspaces,
then how do we protect the guest from "bad" MMIO devices that have been
described to it? The guest can request access to those using this new
mechanism.

Thanks,
drew

2021-07-22 15:33:23

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 00/16] KVM: arm64: MMIO guard PV services

On Thu, 22 Jul 2021 14:25:15 +0100,
Andrew Jones <[email protected]> wrote:
>
> On Thu, Jul 22, 2021 at 11:00:26AM +0100, Marc Zyngier wrote:
> > On Wed, 21 Jul 2021 22:42:43 +0100,
> > Andrew Jones <[email protected]> wrote:
> > >
> > > On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> > > > KVM/arm64 currently considers that any memory access outside of a
> > > > memslot is a MMIO access. This so far has served us very well, but
> > > > obviously relies on the guest trusting the host, and especially
> > > > userspace to do the right thing.
> > > >
> > > > As we keep on hacking away at pKVM, it becomes obvious that this trust
> > > > model is not really fit for a confidential computing environment, and
> > > > that the guest would require some guarantees that emulation only
> > > > occurs on portions of the address space that have clearly been
> > > > identified for this purpose.
> > >
> > > This trust model is hard for me to reason about. userspace is trusted to
> > > control the life cycle of the VM, to prepare the memslots for the VM,
> > > and [presumably] identify what MMIO ranges are valid, yet it's not
> > > trusted to handle invalid MMIO accesses. I'd like to learn more about
> > > this model and the userspace involved.
> >
> > Imagine the following scenario:
> >
> > On top of the normal memory described as memslots (which pKVM will
> > ensure that userspace cannot access),
>
> Ah, I didn't know that part.

Yeah, that's the crucial bit. By default, pKVM guests do not share any
memory with anyone, so the memslots are made inaccessible from both
the VMM and the host kernel. The guest has to explicitly change the
state of the memory it wants to share back with the host for things
like IO.

>
> > a malicious userspace describes
> > to the guest another memory region in a firmware table and does not
> > back it with a memslot.
> >
> > The hypervisor cannot validate this firmware description (imagine
> > doing ACPI and DT parsing at EL2...), so the guest starts using this
> > "memory" for something, and data slowly trickles all the way to EL0.
> > Not what you wanted.
>
> Yes, I see that now, in light of the above.
>
> >
> > To ensure that this doesn't happen, we reverse the problem: userspace
> > (and ultimately the EL1 kernel) doesn't get involved on a translation
> > fault outside of a memslot *unless* the guest has explicitly asked for
> > that page to be handled as a MMIO. With that, we have a full
> > description of the IPA space contained in the S2 page tables:
> >
> > - memory described via a memslot,
> > - directly mapped device (GICv2, for example),
> > - MMIO exposed for emulation
> >
> > and anything else is an invalid access that results in an abort.
> >
> > Does this make sense to you?
>
> Now I understand better, but if we're worried about malicious userspaces,
> then how do we protect the guest from "bad" MMIO devices that have been
> described to it? The guest can request access to those using this new
> mechanism.

We don't try to do anything about a malicious IO device. Any IO should
be considered as malicious, and you don't want to give it anything in
clear-text if it is supposed to be secret.

Eventually, you'd probably want directly assigned devices that can
attest to the guest that they are what they pretend to be, but that's
a long way away. For now, we only want to enable virtio with a reduced
level of trust (bounce buffering via shared pages for DMA, and reduced
MMIO exposure).

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2021-07-23 13:31:34

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 10/16] KVM: arm64: Add some documentation for the MMIO guard feature

On 2021-07-21 22:17, Andrew Jones wrote:
> On Thu, Jul 15, 2021 at 05:31:53PM +0100, Marc Zyngier wrote:
>> Document the hypercalls user for the MMIO guard infrastructure.
>>
>> Signed-off-by: Marc Zyngier <[email protected]>
>> ---
>> Documentation/virt/kvm/arm/index.rst | 1 +
>> Documentation/virt/kvm/arm/mmio-guard.rst | 73
>> +++++++++++++++++++++++
>> 2 files changed, 74 insertions(+)
>> create mode 100644 Documentation/virt/kvm/arm/mmio-guard.rst
>>
>> diff --git a/Documentation/virt/kvm/arm/index.rst
>> b/Documentation/virt/kvm/arm/index.rst
>> index 78a9b670aafe..e77a0ee2e2d4 100644
>> --- a/Documentation/virt/kvm/arm/index.rst
>> +++ b/Documentation/virt/kvm/arm/index.rst
>> @@ -11,3 +11,4 @@ ARM
>> psci
>> pvtime
>> ptp_kvm
>> + mmio-guard
>> diff --git a/Documentation/virt/kvm/arm/mmio-guard.rst
>> b/Documentation/virt/kvm/arm/mmio-guard.rst
>> new file mode 100644
>> index 000000000000..a5563a3e12cc
>> --- /dev/null
>> +++ b/Documentation/virt/kvm/arm/mmio-guard.rst
>> @@ -0,0 +1,73 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +==============
>> +KVM MMIO guard
>> +==============
>> +
>> +KVM implements device emulation by handling translation faults to any
>> +IPA range that is not contained a memory slot. Such translation fault
> ^ in ^ a
>
>> +is in most cases passed on to userspace (or in rare cases to the host
>> +kernel) with the address, size and possibly data of the access for
>> +emulation.
>> +
>> +Should the guest exit with an address that is not one that
>> corresponds
>> +to an emulatable device, userspace may take measures that are not the
>> +most graceful as far as the guest is concerned (such as terminating
>> it
>> +or delivering a fatal exception).
>> +
>> +There is also an element of trust: by forwarding the request to
>> +userspace, the kernel asumes that the guest trusts userspace to do
>> the
>
> assumes
>
>> +right thing.
>> +
>> +The KVM MMIO guard offers a way to mitigate this last point: a guest
>> +can request that only certainly regions of the IPA space are valid as
>
> certain

Thanks, all corrections applied.

>
>> +MMIO. Only these regions will be handled as an MMIO, and any other
>> +will result in an exception being delivered to the guest.
>> +
>> +This relies on a set of hypercalls defined in the KVM-specific range,
>> +using the HVC64 calling convention.
>> +
>> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO
>> +
>> + ============== ======== ================================
>> + Function ID: (uint32) 0xC6000002
>> + Arguments: none
>> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
>> + (uint64) Protection Granule (PG) size in
>> + bytes (r0)
>> + ============== ======== ================================
>> +
>> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL
>> +
>> + ============== ======== ==============================
>> + Function ID: (uint32) 0xC6000003
>> + Arguments: none
>> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
>> + RET_SUCCESS(0) (r0)
>> + ============== ======== ==============================
>> +
>> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP
>> +
>> + ============== ========
>> ======================================
>> + Function ID: (uint32) 0xC6000004
>> + Arguments: (uint64) The base of the PG-sized IPA range
>> + that is allowed to be accessed as
>> + MMIO. Must aligned to the PG size (r1)
>
> align

Hmmm. Ugly mix of tab and spaces. I have no idea what the norm
is here, so I'll just put spaces. I'm sure someone will let me
know if I'm wrong! ;-)

>
>> + (uint64) Index in the MAIR_EL1 register
>> + providing the memory attribute that
>> + is used by the guest (r2)
>> + Return Values: (int64) NOT_SUPPORTED(-1) on error, or
>> + RET_SUCCESS(0) (r0)
>> + ============== ========
>> ======================================
>> +
>> +* ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP
>> +
>> + ============== ========
>> ======================================
>> + Function ID: (uint32) 0xC6000004
>
> copy+paste error, should be 0xC6000005

Gah, well spotted.

>
>> + Arguments: (uint64) The base of the PG-sized IPA range
>> + that is forbidden to be accessed as
>
> is now forbidden
>
> or
>
> was allowed
>
> or just drop that part of the sentence because its covered by the "and
> have been previously mapped" part. Something like
>
> PG-sized IPA range aligned to the PG size which has been previously
> mapped
> (r1)

Picked the latter.

Thanks again,

M.
--
Jazz is not dead. It just smells funny...

2021-07-23 13:39:55

by Andrew Jones

[permalink] [raw]
Subject: Re: [PATCH 10/16] KVM: arm64: Add some documentation for the MMIO guard feature

On Fri, Jul 23, 2021 at 02:30:13PM +0100, Marc Zyngier wrote:
...
> > > +
> > > + ============== ========
> > > ======================================
> > > + Function ID: (uint32) 0xC6000004
> > > + Arguments: (uint64) The base of the PG-sized IPA range
> > > + that is allowed to be accessed as
> > > + MMIO. Must aligned to the PG size (r1)
> >
> > align
>
> Hmmm. Ugly mix of tab and spaces. I have no idea what the norm
> is here, so I'll just put spaces. I'm sure someone will let me
> know if I'm wrong! ;-)

Actually, my comment wasn't regarding the alignment of the text. I was
commenting that we should change 'aligned' to 'align' in the text. (Sorry,
that was indeed ambiguous.) Hmm, it might be better to just add 'be', i.e.
'be aligned'.

I'm not sure what to do about the tab/space mixing, but keeping it
consistent is good enough for me.

Thanks,
drew

2021-07-23 13:54:39

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 10/16] KVM: arm64: Add some documentation for the MMIO guard feature

On 2021-07-23 14:38, Andrew Jones wrote:
> On Fri, Jul 23, 2021 at 02:30:13PM +0100, Marc Zyngier wrote:
> ...
>> > > +
>> > > + ============== ========
>> > > ======================================
>> > > + Function ID: (uint32) 0xC6000004
>> > > + Arguments: (uint64) The base of the PG-sized IPA range
>> > > + that is allowed to be accessed as
>> > > + MMIO. Must aligned to the PG size (r1)
>> >
>> > align
>>
>> Hmmm. Ugly mix of tab and spaces. I have no idea what the norm
>> is here, so I'll just put spaces. I'm sure someone will let me
>> know if I'm wrong! ;-)
>
> Actually, my comment wasn't regarding the alignment of the text. I was
> commenting that we should change 'aligned' to 'align' in the text.
> (Sorry,
> that was indeed ambiguous.) Hmm, it might be better to just add 'be',
> i.e.
> 'be aligned'.

*blink*. duh, of course.

> I'm not sure what to do about the tab/space mixing, but keeping it
> consistent is good enough for me.

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2021-07-27 18:11:38

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 01/16] KVM: arm64: Generalise VM features into a set of flags

On Thu, Jul 15, 2021 at 05:31:44PM +0100, Marc Zyngier wrote:
> We currently deal with a set of booleans for VM features,
> while they could be better represented as a set of flags
> contained in an unsigned long, similarly to what we are
> doing on the CPU side.
>
> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> arch/arm64/include/asm/kvm_host.h | 12 +++++++-----
> arch/arm64/kvm/arm.c | 5 +++--
> arch/arm64/kvm/mmio.c | 3 ++-
> 3 files changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 41911585ae0c..4add6c27251f 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -122,7 +122,10 @@ struct kvm_arch {
> * should) opt in to this feature if KVM_CAP_ARM_NISV_TO_USER is
> * supported.
> */
> - bool return_nisv_io_abort_to_user;
> +#define KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER 0
> + /* Memory Tagging Extension enabled for the guest */
> +#define KVM_ARCH_FLAG_MTE_ENABLED 1
> + unsigned long flags;

One downside of packing all these together is that updating 'flags' now
requires an atomic rmw sequence (i.e. set_bit()). Then again, that's
probably for the best anyway given that kvm_vm_ioctl_enable_cap() looks
like it doesn't hold any locks.

> /*
> * VM-wide PMU filter, implemented as a bitmap and big enough for
> @@ -133,9 +136,6 @@ struct kvm_arch {
>
> u8 pfr0_csv2;
> u8 pfr0_csv3;
> -
> - /* Memory Tagging Extension enabled for the guest */
> - bool mte_enabled;
> };
>
> struct kvm_vcpu_fault_info {
> @@ -777,7 +777,9 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
> #define kvm_arm_vcpu_sve_finalized(vcpu) \
> ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>
> -#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
> +#define kvm_has_mte(kvm) \
> + (system_supports_mte() && \
> + test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &(kvm)->arch.flags))

Not an issue with this patch, but I just noticed that the
system_supports_mte() check is redundant here as we only allow the flag to
be set if that's already the case.

Will

2021-07-27 18:12:46

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 05/16] KVM: arm64: Plumb MMIO checking into the fault handling

On Thu, Jul 15, 2021 at 05:31:48PM +0100, Marc Zyngier wrote:
> Plumb the MMIO checking code into the MMIO fault handling code.
> Nothing allows a region to be registered yet, so there should be
> no funtional change either.

Typo: functional

> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> arch/arm64/kvm/mmio.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> index 3dd38a151d2a..fd5747279d27 100644
> --- a/arch/arm64/kvm/mmio.c
> +++ b/arch/arm64/kvm/mmio.c
> @@ -6,6 +6,7 @@
>
> #include <linux/kvm_host.h>
> #include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
> #include <trace/events/kvm.h>
>
> #include "trace.h"
> @@ -130,6 +131,10 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> int len;
> u8 data_buf[8];
>
> + /* Check failed? Return to the guest for debriefing... */
> + if (!kvm_check_ioguard_page(vcpu, fault_ipa))
> + return 1;
> +
> /*
> * No valid syndrome? Ask userspace for help if it has
> * volunteered to do so, and bail out otherwise.
> @@ -156,6 +161,11 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> len = kvm_vcpu_dabt_get_as(vcpu);
> rt = kvm_vcpu_dabt_get_rd(vcpu);
>
> + /* If we cross a page boundary, check that too... */
> + if (((fault_ipa + len - 1) & PAGE_MASK) != (fault_ipa & PAGE_MASK) &&
> + !kvm_check_ioguard_page(vcpu, fault_ipa + len - 1))
> + return 1;
> +

I find this a little odd as the checks straddle the invalid syndrome check,
meaning that the relative priorities of KVM_ARCH_FLAG_MMIO_GUARD and
KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER are unclear.

Will

2021-07-27 18:14:06

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 06/16] KVM: arm64: Force a full unmap on vpcu reinit

On Thu, Jul 15, 2021 at 05:31:49PM +0100, Marc Zyngier wrote:
> As we now keep information in the S2PT, we must be careful not
> to keep it across a VM reboot, which could otherwise lead to
> interesting problems.
>
> Make sure that the S2 is completely discarded on reset of
> a vcpu, and remove the flag that enforces the MMIO check.
>
> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> arch/arm64/kvm/arm.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 97ab1512c44f..b0d2225190d2 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1096,12 +1096,18 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
> * ensuring that the data side is always coherent. We still
> * need to invalidate the I-cache though, as FWB does *not*
> * imply CTR_EL0.DIC.
> + *
> + * If the MMIO guard was enabled, we pay the price of a full
> + * unmap to get back to a sane state (and clear the flag).
> */
> if (vcpu->arch.has_run_once) {
> - if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> + if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> + test_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags))
> stage2_unmap_vm(vcpu->kvm);
> else
> icache_inval_all_pou();
> +
> + clear_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);

What prevents this racing with another vCPU trying to set the bit?

Will

2021-07-27 18:14:32

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 07/16] KVM: arm64: Wire MMIO guard hypercalls

On Thu, Jul 15, 2021 at 05:31:50PM +0100, Marc Zyngier wrote:
> Plumb in the hypercall interface to allow a guest to discover,
> enroll, map and unmap MMIO regions.
>
> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> arch/arm64/kvm/hypercalls.c | 20 ++++++++++++++++++++
> include/linux/arm-smccc.h | 28 ++++++++++++++++++++++++++++
> 2 files changed, 48 insertions(+)
>
> diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> index 30da78f72b3b..a3deeb907fdd 100644
> --- a/arch/arm64/kvm/hypercalls.c
> +++ b/arch/arm64/kvm/hypercalls.c
> @@ -5,6 +5,7 @@
> #include <linux/kvm_host.h>
>
> #include <asm/kvm_emulate.h>
> +#include <asm/kvm_mmu.h>
>
> #include <kvm/arm_hypercalls.h>
> #include <kvm/arm_psci.h>
> @@ -129,10 +130,29 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
> val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP);
> + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO);
> + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL);
> + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP);
> + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP);
> break;
> case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
> kvm_ptp_get_time(vcpu, val);
> break;
> + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID:
> + val[0] = PAGE_SIZE;
> + break;

I get the nagging feeling that querying the stage-2 page-size outside of
MMIO guard is going to be useful once we start looking at memory sharing,
so perhaps rename this to something more generic?

> + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID:
> + set_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
> + val[0] = SMCCC_RET_SUCCESS;
> + break;
> + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID:
> + if (kvm_install_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> + val[0] = SMCCC_RET_SUCCESS;
> + break;
> + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID:
> + if (kvm_remove_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> + val[0] = SMCCC_RET_SUCCESS;
> + break;

I think there's a slight discrepancy between MAP and UNMAP here in that
calling UNMAP on something that hasn't been mapped will fail, whereas
calling MAP on something that's already been mapped will succeed. I think
that might mean you can't reason about the final state of the page if two
vCPUs race to call these functions in some cases (and both succeed).

Will

2021-07-27 18:14:41

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 12/16] mm/ioremap: Add arch-specific callbacks on ioremap/iounmap calls

On Thu, Jul 15, 2021 at 05:31:55PM +0100, Marc Zyngier wrote:
> Add a pair of hooks (ioremap_page_range_hook/iounmap_page_range_hook)
> that can be implemented by an architecture.
>
> Signed-off-by: Marc Zyngier <[email protected]>
> ---
> include/linux/io.h | 3 +++
> mm/ioremap.c | 13 ++++++++++++-
> mm/vmalloc.c | 8 ++++++++
> 3 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/io.h b/include/linux/io.h
> index 9595151d800d..0ffc265f114c 100644
> --- a/include/linux/io.h
> +++ b/include/linux/io.h
> @@ -21,6 +21,9 @@ void __ioread32_copy(void *to, const void __iomem *from, size_t count);
> void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
>
> #ifdef CONFIG_MMU
> +void ioremap_page_range_hook(unsigned long addr, unsigned long end,
> + phys_addr_t phys_addr, pgprot_t prot);
> +void iounmap_page_range_hook(phys_addr_t phys_addr, size_t size);
> int ioremap_page_range(unsigned long addr, unsigned long end,
> phys_addr_t phys_addr, pgprot_t prot);
> #else

Can we avoid these hooks by instead not registering the regions proactively
in the guest and moving that logic to a fault handler which runs off the
back of the injected data abort? From there, we could check if the faulting
IPA is a memory address and register it as MMIO if not.

Dunno, you've spent more time than me thinking about this, but just
wondering if you'd had a crack at doing it that way, as it _seems_ simpler
to my naive brain.

Will

2021-07-28 09:41:56

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 01/16] KVM: arm64: Generalise VM features into a set of flags

On Tue, 27 Jul 2021 19:10:27 +0100,
Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 15, 2021 at 05:31:44PM +0100, Marc Zyngier wrote:
> > We currently deal with a set of booleans for VM features,
> > while they could be better represented as set of flags
> > contained in an unsigned long, similarly to what we are
> > doing on the CPU side.
> >
> > Signed-off-by: Marc Zyngier <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 12 +++++++-----
> > arch/arm64/kvm/arm.c | 5 +++--
> > arch/arm64/kvm/mmio.c | 3 ++-
> > 3 files changed, 12 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 41911585ae0c..4add6c27251f 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -122,7 +122,10 @@ struct kvm_arch {
> > * should) opt in to this feature if KVM_CAP_ARM_NISV_TO_USER is
> > * supported.
> > */
> > - bool return_nisv_io_abort_to_user;
> > +#define KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER 0
> > + /* Memory Tagging Extension enabled for the guest */
> > +#define KVM_ARCH_FLAG_MTE_ENABLED 1
> > + unsigned long flags;
>
> One downside of packing all these together is that updating 'flags' now
> requires an atomic rmw sequence (i.e. set_bit()). Then again, that's
> probably for the best anyway given that kvm_vm_ioctl_enable_cap() looks
> like it doesn't hold any locks.

That, and these operations are supposed to be extremely rare anyway.

>
> > /*
> > * VM-wide PMU filter, implemented as a bitmap and big enough for
> > @@ -133,9 +136,6 @@ struct kvm_arch {
> >
> > u8 pfr0_csv2;
> > u8 pfr0_csv3;
> > -
> > - /* Memory Tagging Extension enabled for the guest */
> > - bool mte_enabled;
> > };
> >
> > struct kvm_vcpu_fault_info {
> > @@ -777,7 +777,9 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
> > #define kvm_arm_vcpu_sve_finalized(vcpu) \
> > ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
> >
> > -#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
> > +#define kvm_has_mte(kvm) \
> > + (system_supports_mte() && \
> > + test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &(kvm)->arch.flags))
>
> Not an issue with this patch, but I just noticed that the
> system_supports_mte() check is redundant here as we only allow the flag to
> be set if that's already the case.

It allows us to save a memory access if system_supports_mte() is false
(it is eventually implemented as a static key). On the other hand,
there is so much inlining due to it being a non-final cap that we
probably lose on that too...

M.

--
Without deviation from the norm, progress is not possible.

2021-07-28 10:25:07

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 05/16] KVM: arm64: Plumb MMIO checking into the fault handling

On Tue, 27 Jul 2021 19:11:21 +0100,
Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 15, 2021 at 05:31:48PM +0100, Marc Zyngier wrote:
> > Plumb the MMIO checking code into the MMIO fault handling code.
> > Nothing allows a region to be registered yet, so there should be
> > no funtional change either.
>
> Typo: functional
>
> > Signed-off-by: Marc Zyngier <[email protected]>
> > ---
> > arch/arm64/kvm/mmio.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> > index 3dd38a151d2a..fd5747279d27 100644
> > --- a/arch/arm64/kvm/mmio.c
> > +++ b/arch/arm64/kvm/mmio.c
> > @@ -6,6 +6,7 @@
> >
> > #include <linux/kvm_host.h>
> > #include <asm/kvm_emulate.h>
> > +#include <asm/kvm_mmu.h>
> > #include <trace/events/kvm.h>
> >
> > #include "trace.h"
> > @@ -130,6 +131,10 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > int len;
> > u8 data_buf[8];
> >
> > + /* Check failed? Return to the guest for debriefing... */
> > + if (!kvm_check_ioguard_page(vcpu, fault_ipa))
> > + return 1;
> > +
> > /*
> > * No valid syndrome? Ask userspace for help if it has
> > * volunteered to do so, and bail out otherwise.
> > @@ -156,6 +161,11 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > len = kvm_vcpu_dabt_get_as(vcpu);
> > rt = kvm_vcpu_dabt_get_rd(vcpu);
> >
> > + /* If we cross a page boundary, check that too... */
> > + if (((fault_ipa + len - 1) & PAGE_MASK) != (fault_ipa & PAGE_MASK) &&
> > + !kvm_check_ioguard_page(vcpu, fault_ipa + len - 1))
> > + return 1;
> > +
>
> I find this a little odd as the checks straddle the invalid syndrome check,
> meaning that the relative priorities of KVM_ARCH_FLAG_MMIO_GUARD and
> KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER are unclear.

Good point. And the combination of both flags on its own is odd. Maybe
KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER should be ignored or deemed
incompatible with the MMIO guard feature.

The lack of syndrome information means that we cannot really test for
the boundaries of the access (len is invalid), so I'd be tempted to
inject an abort in this case.

Thoughts?

M.

--
Without deviation from the norm, progress is not possible.

2021-07-28 10:39:54

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 06/16] KVM: arm64: Force a full unmap on vpcu reinit

On Tue, 27 Jul 2021 19:11:33 +0100,
Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 15, 2021 at 05:31:49PM +0100, Marc Zyngier wrote:
> > As we now keep information in the S2PT, we must be careful not
> > to keep it across a VM reboot, which could otherwise lead to
> > interesting problems.
> >
> > Make sure that the S2 is completely discarded on reset of
> > a vcpu, and remove the flag that enforces the MMIO check.
> >
> > Signed-off-by: Marc Zyngier <[email protected]>
> > ---
> > arch/arm64/kvm/arm.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 97ab1512c44f..b0d2225190d2 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -1096,12 +1096,18 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
> > * ensuring that the data side is always coherent. We still
> > * need to invalidate the I-cache though, as FWB does *not*
> > * imply CTR_EL0.DIC.
> > + *
> > + * If the MMIO guard was enabled, we pay the price of a full
> > + * unmap to get back to a sane state (and clear the flag).
> > */
> > if (vcpu->arch.has_run_once) {
> > - if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> > + if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> > + test_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags))
> > stage2_unmap_vm(vcpu->kvm);
> > else
> > icache_inval_all_pou();
> > +
> > + clear_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
>
> What prevents this racing with another vCPU trying to set the bit?

Not much. We could take the kvm lock on both ends to serialize it, but
that's pretty ugly. And should we care? What is the semantic of
resetting a vcpu while another is still running?

If we want to support this sort of behaviour, then our tracking is
totally bogus, because it is VM-wide. And you don't even have to play
with that bit from another vcpu: all the information is lost at the
point where we unmap the S2 PTs.

Maybe an alternative is to move this to the halt/reboot PSCI handlers,
making it clearer what we expect?

M.

--
Without deviation from the norm, progress is not possible.

2021-07-28 10:49:36

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 07/16] KVM: arm64: Wire MMIO guard hypercalls

On Tue, 27 Jul 2021 19:11:46 +0100,
Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 15, 2021 at 05:31:50PM +0100, Marc Zyngier wrote:
> > Plumb in the hypercall interface to allow a guest to discover,
> > enroll, map and unmap MMIO regions.
> >
> > Signed-off-by: Marc Zyngier <[email protected]>
> > ---
> > arch/arm64/kvm/hypercalls.c | 20 ++++++++++++++++++++
> > include/linux/arm-smccc.h | 28 ++++++++++++++++++++++++++++
> > 2 files changed, 48 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> > index 30da78f72b3b..a3deeb907fdd 100644
> > --- a/arch/arm64/kvm/hypercalls.c
> > +++ b/arch/arm64/kvm/hypercalls.c
> > @@ -5,6 +5,7 @@
> > #include <linux/kvm_host.h>
> >
> > #include <asm/kvm_emulate.h>
> > +#include <asm/kvm_mmu.h>
> >
> > #include <kvm/arm_hypercalls.h>
> > #include <kvm/arm_psci.h>
> > @@ -129,10 +130,29 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> > case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
> > val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> > val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP);
> > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO);
> > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL);
> > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP);
> > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP);
> > break;
> > case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
> > kvm_ptp_get_time(vcpu, val);
> > break;
> > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID:
> > + val[0] = PAGE_SIZE;
> > + break;
>
> I get the nagging feeling that querying the stage-2 page-size outside of
> MMIO guard is going to be useful once we start looking at memory sharing,
> so perhaps rename this to something more generic?

At this stage, why not follow the architecture and simply expose it as
ID_AA64MMFR0_EL1.TGran{4,64,16}_2? That's exactly what it is for, and
we already check for this in KVM itself.

>
> > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID:
> > + set_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
> > + val[0] = SMCCC_RET_SUCCESS;
> > + break;
> > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID:
> > + if (kvm_install_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> > + val[0] = SMCCC_RET_SUCCESS;
> > + break;
> > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID:
> > + if (kvm_remove_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> > + val[0] = SMCCC_RET_SUCCESS;
> > + break;
>
> I think there's a slight discrepancy between MAP and UNMAP here in that
> calling UNMAP on something that hasn't been mapped will fail, whereas
> calling MAP on something that's already been mapped will succeed. I think
> that might mean you can't reason about the final state of the page if two
> vCPUs race to call these functions in some cases (and both succeed).

I'm not sure that's the expected behaviour for ioremap(), for example
(you can ioremap two portions of the same page successfully).

I guess MAP could return something indicating that the page is already
mapped, but I wouldn't want to return a hard failure in this case.

M.

--
Without deviation from the norm, progress is not possible.

2021-07-28 11:03:41

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 12/16] mm/ioremap: Add arch-specific callbacks on ioremap/iounmap calls

On Tue, 27 Jul 2021 19:12:04 +0100,
Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 15, 2021 at 05:31:55PM +0100, Marc Zyngier wrote:
> > Add a pair of hooks (ioremap_page_range_hook/iounmap_page_range_hook)
> > that can be implemented by an architecture.
> >
> > Signed-off-by: Marc Zyngier <[email protected]>
> > ---
> > include/linux/io.h | 3 +++
> > mm/ioremap.c | 13 ++++++++++++-
> > mm/vmalloc.c | 8 ++++++++
> > 3 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/io.h b/include/linux/io.h
> > index 9595151d800d..0ffc265f114c 100644
> > --- a/include/linux/io.h
> > +++ b/include/linux/io.h
> > @@ -21,6 +21,9 @@ void __ioread32_copy(void *to, const void __iomem *from, size_t count);
> > void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
> >
> > #ifdef CONFIG_MMU
> > +void ioremap_page_range_hook(unsigned long addr, unsigned long end,
> > + phys_addr_t phys_addr, pgprot_t prot);
> > +void iounmap_page_range_hook(phys_addr_t phys_addr, size_t size);
> > int ioremap_page_range(unsigned long addr, unsigned long end,
> > phys_addr_t phys_addr, pgprot_t prot);
> > #else
>
> Can we avoid these hooks by instead not registering the regions proactively
> in the guest and moving that logic to a fault handler which runs off the
> back of the injected data abort? From there, we could check if the faulting
> IPA is a memory address and register it as MMIO if not.
>
> Dunno, you've spent more time than me thinking about this, but just
> wondering if you'd had a crack at doing it that way, as it _seems_ simpler
> to my naive brain.

I thought about it, but couldn't work out whether it was always
possible for the guest to handle these faults (first access in an
interrupt context, for example?).

Also, this changes the semantics of the protection this is supposed to
offer: any access out of the RAM space will generate an abort, and the
fault handler will grant MMIO forwarding for this page. Stray accesses
that would normally be properly handled as fatal would now succeed and
be forwarded to userspace, even if there were no emulated devices
there.

For this to work, we'd need to work out whether there is any existing
device mapping that actually points to this page. And whether it
actually is supposed to be forwarded to userspace. Do we have a rmap
for device mappings?

M.

--
Without deviation from the norm, progress is not possible.

2021-07-28 14:55:36

by Steven Price

[permalink] [raw]
Subject: Re: [PATCH 01/16] KVM: arm64: Generalise VM features into a set of flags

On 28/07/2021 10:41, Marc Zyngier wrote:
> On Tue, 27 Jul 2021 19:10:27 +0100,
> Will Deacon <[email protected]> wrote:
>>
>> On Thu, Jul 15, 2021 at 05:31:44PM +0100, Marc Zyngier wrote:
>>> We currently deal with a set of booleans for VM features,
>>> while they could be better represented as set of flags
>>> contained in an unsigned long, similarly to what we are
>>> doing on the CPU side.
>>>
>>> Signed-off-by: Marc Zyngier <[email protected]>
>>> ---
>>> arch/arm64/include/asm/kvm_host.h | 12 +++++++-----
>>> arch/arm64/kvm/arm.c | 5 +++--
>>> arch/arm64/kvm/mmio.c | 3 ++-
>>> 3 files changed, 12 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index 41911585ae0c..4add6c27251f 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -122,7 +122,10 @@ struct kvm_arch {
>>> * should) opt in to this feature if KVM_CAP_ARM_NISV_TO_USER is
>>> * supported.
>>> */
>>> - bool return_nisv_io_abort_to_user;
>>> +#define KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER 0
>>> + /* Memory Tagging Extension enabled for the guest */
>>> +#define KVM_ARCH_FLAG_MTE_ENABLED 1
>>> + unsigned long flags;
>>
>> One downside of packing all these together is that updating 'flags' now
>> requires an atomic rmw sequence (i.e. set_bit()). Then again, that's
>> probably for the best anyway given that kvm_vm_ioctl_enable_cap() looks
>> like it doesn't hold any locks.
>
> That, and these operations are supposed to be extremely rare anyway.
>
>>
>>> /*
>>> * VM-wide PMU filter, implemented as a bitmap and big enough for
>>> @@ -133,9 +136,6 @@ struct kvm_arch {
>>>
>>> u8 pfr0_csv2;
>>> u8 pfr0_csv3;
>>> -
>>> - /* Memory Tagging Extension enabled for the guest */
>>> - bool mte_enabled;
>>> };
>>>
>>> struct kvm_vcpu_fault_info {
>>> @@ -777,7 +777,9 @@ bool kvm_arm_vcpu_is_finalized(struct kvm_vcpu *vcpu);
>>> #define kvm_arm_vcpu_sve_finalized(vcpu) \
>>> ((vcpu)->arch.flags & KVM_ARM64_VCPU_SVE_FINALIZED)
>>>
>>> -#define kvm_has_mte(kvm) (system_supports_mte() && (kvm)->arch.mte_enabled)
>>> +#define kvm_has_mte(kvm) \
>>> + (system_supports_mte() && \
>>> + test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &(kvm)->arch.flags))
>>
>> Not an issue with this patch, but I just noticed that the
>> system_supports_mte() check is redundant here as we only allow the flag to
>> be set if that's already the case.
>
> It allows us to save a memory access if system_supports_mte() is false
> (it is eventually implemented as a static key). On the other hand,
> there is so much inlining due to it being a non-final cap that we
> probably lose on that too...

My original logic was that system_supports_mte() checks
IS_ENABLED(CONFIG_ARM64_MTE) - so this enables the code guarded with
kvm_has_mte() to be compiled out if CONFIG_ARM64_MTE is disabled.

Indeed it turns out we currently rely on this (with CONFIG_ARM64_MTE
disabled):

aarch64-linux-gnu-ld: arch/arm64/kvm/mmu.o: in function `sanitise_mte_tags':
/home/stepri01/work/linux/arch/arm64/kvm/mmu.c:887: undefined reference to `mte_clear_page_tags'
aarch64-linux-gnu-ld: arch/arm64/kvm/guest.o: in function `kvm_vm_ioctl_mte_copy_tags':
/home/stepri01/work/linux/arch/arm64/kvm/guest.c:1066: undefined reference to `mte_copy_tags_to_user'
aarch64-linux-gnu-ld: /home/stepri01/work/linux/arch/arm64/kvm/guest.c:1074: undefined reference to `mte_copy_tags_from_user'

Obviously we could pull just the IS_ENABLED() into kvm_has_mte() instead.
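
i.e. something along these lines (untested sketch):

#define kvm_has_mte(kvm)						\
	(IS_ENABLED(CONFIG_ARM64_MTE) &&				\
	 test_bit(KVM_ARCH_FLAG_MTE_ENABLED, &(kvm)->arch.flags))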

Steve

2021-07-30 12:39:20

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 05/16] KVM: arm64: Plumb MMIO checking into the fault handling

On Wed, Jul 28, 2021 at 11:21:52AM +0100, Marc Zyngier wrote:
> On Tue, 27 Jul 2021 19:11:21 +0100,
> Will Deacon <[email protected]> wrote:
> >
> > On Thu, Jul 15, 2021 at 05:31:48PM +0100, Marc Zyngier wrote:
> > > Plumb the MMIO checking code into the MMIO fault handling code.
> > > Nothing allows a region to be registered yet, so there should be
> > > no funtional change either.
> >
> > Typo: functional
> >
> > > Signed-off-by: Marc Zyngier <[email protected]>
> > > ---
> > > arch/arm64/kvm/mmio.c | 10 ++++++++++
> > > 1 file changed, 10 insertions(+)
> > >
> > > diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
> > > index 3dd38a151d2a..fd5747279d27 100644
> > > --- a/arch/arm64/kvm/mmio.c
> > > +++ b/arch/arm64/kvm/mmio.c
> > > @@ -6,6 +6,7 @@
> > >
> > > #include <linux/kvm_host.h>
> > > #include <asm/kvm_emulate.h>
> > > +#include <asm/kvm_mmu.h>
> > > #include <trace/events/kvm.h>
> > >
> > > #include "trace.h"
> > > @@ -130,6 +131,10 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > > int len;
> > > u8 data_buf[8];
> > >
> > > + /* Check failed? Return to the guest for debriefing... */
> > > + if (!kvm_check_ioguard_page(vcpu, fault_ipa))
> > > + return 1;
> > > +
> > > /*
> > > * No valid syndrome? Ask userspace for help if it has
> > > * volunteered to do so, and bail out otherwise.
> > > @@ -156,6 +161,11 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa)
> > > len = kvm_vcpu_dabt_get_as(vcpu);
> > > rt = kvm_vcpu_dabt_get_rd(vcpu);
> > >
> > > + /* If we cross a page boundary, check that too... */
> > > + if (((fault_ipa + len - 1) & PAGE_MASK) != (fault_ipa & PAGE_MASK) &&
> > > + !kvm_check_ioguard_page(vcpu, fault_ipa + len - 1))
> > > + return 1;
> > > +
> >
> > I find this a little odd as the checks straddle the invalid syndrome check,
> > meaning that the relative priorities of KVM_ARCH_FLAG_MMIO_GUARD and
> > KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER are unclear.
>
> Good point. And the combination of both flags on its own is odd. Maybe
> KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER should be ignored or deemed
> incompatible with the MMIO guard feature.
>
> The lack of syndrome information means that we cannot really test for
> the boundaries of the access (len is invalid), so I'd be tempted to
> inject an abort in this case.
>
> Thoughts?

I agree. Probably worth rejecting both flags anyway so the VMM knows what
it's getting, but injecting an abort into the guest if we don't have
sufficient syndrome information to triage it safely feels like the right
thing to do.
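
Something like the below, maybe, called from io_mem_abort() before the
existing NISV-to-userspace logic (just a sketch of the idea, not even
compile-tested):

static bool ioguard_reject_nisv(struct kvm_vcpu *vcpu)
{
        /*
         * No syndrome means we can't check the access against the
         * guarded page, so if the guest has enrolled into the MMIO
         * guard, throw an external abort straight back at it.
         */
        if (kvm_vcpu_dabt_isvalid(vcpu))
                return false;

        if (!test_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags))
                return false;

        kvm_inject_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
        return true;
}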

Will

2021-07-30 12:51:51

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 06/16] KVM: arm64: Force a full unmap on vpcu reinit

On Wed, Jul 28, 2021 at 11:38:35AM +0100, Marc Zyngier wrote:
> On Tue, 27 Jul 2021 19:11:33 +0100,
> Will Deacon <[email protected]> wrote:
> >
> > On Thu, Jul 15, 2021 at 05:31:49PM +0100, Marc Zyngier wrote:
> > > As we now keep information in the S2PT, we must be careful not
> > > to keep it across a VM reboot, which could otherwise lead to
> > > interesting problems.
> > >
> > > Make sure that the S2 is completely discarded on reset of
> > > a vcpu, and remove the flag that enforces the MMIO check.
> > >
> > > Signed-off-by: Marc Zyngier <[email protected]>
> > > ---
> > > arch/arm64/kvm/arm.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > > index 97ab1512c44f..b0d2225190d2 100644
> > > --- a/arch/arm64/kvm/arm.c
> > > +++ b/arch/arm64/kvm/arm.c
> > > @@ -1096,12 +1096,18 @@ static int kvm_arch_vcpu_ioctl_vcpu_init(struct kvm_vcpu *vcpu,
> > > * ensuring that the data side is always coherent. We still
> > > * need to invalidate the I-cache though, as FWB does *not*
> > > * imply CTR_EL0.DIC.
> > > + *
> > > + * If the MMIO guard was enabled, we pay the price of a full
> > > + * unmap to get back to a sane state (and clear the flag).
> > > */
> > > if (vcpu->arch.has_run_once) {
> > > - if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
> > > + if (!cpus_have_final_cap(ARM64_HAS_STAGE2_FWB) ||
> > > + test_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags))
> > > stage2_unmap_vm(vcpu->kvm);
> > > else
> > > icache_inval_all_pou();
> > > +
> > > + clear_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
> >
> > What prevents this racing with another vCPU trying to set the bit?
>
> Not much. We could take the kvm lock on both ends to serialize it, but
> that's pretty ugly. And should we care? What is the semantic of
> resetting a vcpu while another is still running?

It's definitely weird, but given that this is an attack vector I don't think
we can rule out attackers trying whacky stuff like this (although maybe
we end up forbidding vcpu reset in pKVM -- dunno).

> If we want to support this sort of behaviour, then our tracking is
> totally bogus, because it is VM-wide. And you don't even have to play
> with that bit from another vcpu: all the information is lost at the
> point where we unmap the S2 PTs.
>
> Maybe an alternative is to move this to the halt/reboot PSCI handlers,
> making it clearer what we expect?

I think that's probably worth looking at. The race is quite hard to reason
about otherwise, so if clearing the bit can be done on the teardown path
in single-threaded context then I think that's better. It looks like
kvm_prepare_system_event() has all the synchronisation we need there.

Will

2021-07-30 13:12:20

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 07/16] KVM: arm64: Wire MMIO guard hypercalls

On Wed, Jul 28, 2021 at 11:47:20AM +0100, Marc Zyngier wrote:
> On Tue, 27 Jul 2021 19:11:46 +0100,
> Will Deacon <[email protected]> wrote:
> >
> > On Thu, Jul 15, 2021 at 05:31:50PM +0100, Marc Zyngier wrote:
> > > Plumb in the hypercall interface to allow a guest to discover,
> > > enroll, map and unmap MMIO regions.
> > >
> > > Signed-off-by: Marc Zyngier <[email protected]>
> > > ---
> > > arch/arm64/kvm/hypercalls.c | 20 ++++++++++++++++++++
> > > include/linux/arm-smccc.h | 28 ++++++++++++++++++++++++++++
> > > 2 files changed, 48 insertions(+)
> > >
> > > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> > > index 30da78f72b3b..a3deeb907fdd 100644
> > > --- a/arch/arm64/kvm/hypercalls.c
> > > +++ b/arch/arm64/kvm/hypercalls.c
> > > @@ -5,6 +5,7 @@
> > > #include <linux/kvm_host.h>
> > >
> > > #include <asm/kvm_emulate.h>
> > > +#include <asm/kvm_mmu.h>
> > >
> > > #include <kvm/arm_hypercalls.h>
> > > #include <kvm/arm_psci.h>
> > > @@ -129,10 +130,29 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> > > case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
> > > val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> > > val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP);
> > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO);
> > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL);
> > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP);
> > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP);
> > > break;
> > > case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
> > > kvm_ptp_get_time(vcpu, val);
> > > break;
> > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID:
> > > + val[0] = PAGE_SIZE;
> > > + break;
> >
> > I get the nagging feeling that querying the stage-2 page-size outside of
> > MMIO guard is going to be useful once we start looking at memory sharing,
> > so perhaps rename this to something more generic?
>
> At this stage, why not follow the architecture and simply expose it as
> ID_AA64MMFR0_EL1.TGran{4,64,16}_2? That's exactly what it is for, and
> we already check for this in KVM itself.

Nice, I hadn't thought of that. On reflection, though, I don't agree that
it's "exactly what it is for" -- the ID register talks about the supported
stage-2 page-sizes, whereas we want to advertise the one page size that
we're currently using. In other words, it's important that we only ever
populate one of the fields and I wonder if that could bite us in future
somehow?

Up to you, you've definitely got a better feel for this than me.

> > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID:
> > > + set_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
> > > + val[0] = SMCCC_RET_SUCCESS;
> > > + break;
> > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID:
> > > + if (kvm_install_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> > > + val[0] = SMCCC_RET_SUCCESS;
> > > + break;
> > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID:
> > > + if (kvm_remove_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> > > + val[0] = SMCCC_RET_SUCCESS;
> > > + break;
> >
> > I think there's a slight discrepancy between MAP and UNMAP here in that
> > calling UNMAP on something that hasn't been mapped will fail, whereas
> > calling MAP on something that's already been mapped will succeed. I think
> > that might mean you can't reason about the final state of the page if two
> > vCPUs race to call these functions in some cases (and both succeed).
>
> I'm not sure that's the expected behaviour for ioremap(), for example
> (you can ioremap two portions of the same page successfully).

Hmm, good point. Does that mean we should be refcounting the stage-2?
Otherwise if we do something like:

foo = ioremap(page, 0x100);
bar = ioremap(page+0x100, 0x100);
iounmap(foo);

then bar will break. Or did I miss something in the series?

Will

2021-07-30 14:08:32

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 12/16] mm/ioremap: Add arch-specific callbacks on ioremap/iounmap calls

On Wed, Jul 28, 2021 at 12:01:53PM +0100, Marc Zyngier wrote:
> On Tue, 27 Jul 2021 19:12:04 +0100,
> Will Deacon <[email protected]> wrote:
> >
> > On Thu, Jul 15, 2021 at 05:31:55PM +0100, Marc Zyngier wrote:
> > > Add a pair of hooks (ioremap_page_range_hook/iounmap_page_range_hook)
> > > that can be implemented by an architecture.
> > >
> > > Signed-off-by: Marc Zyngier <[email protected]>
> > > ---
> > > include/linux/io.h | 3 +++
> > > mm/ioremap.c | 13 ++++++++++++-
> > > mm/vmalloc.c | 8 ++++++++
> > > 3 files changed, 23 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/io.h b/include/linux/io.h
> > > index 9595151d800d..0ffc265f114c 100644
> > > --- a/include/linux/io.h
> > > +++ b/include/linux/io.h
> > > @@ -21,6 +21,9 @@ void __ioread32_copy(void *to, const void __iomem *from, size_t count);
> > > void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
> > >
> > > #ifdef CONFIG_MMU
> > > +void ioremap_page_range_hook(unsigned long addr, unsigned long end,
> > > + phys_addr_t phys_addr, pgprot_t prot);
> > > +void iounmap_page_range_hook(phys_addr_t phys_addr, size_t size);
> > > int ioremap_page_range(unsigned long addr, unsigned long end,
> > > phys_addr_t phys_addr, pgprot_t prot);
> > > #else
> >
> > Can we avoid these hooks by instead not registering the regions proactively
> > in the guest and moving that logic to a fault handler which runs off the
> > back of the injected data abort? From there, we could check if the faulting
> > IPA is a memory address and register it as MMIO if not.
> >
> > Dunno, you've spent more time than me thinking about this, but just
> > wondering if you'd had a crack at doing it that way, as it _seems_ simpler
> > to my naive brain.
>
> I thought about it, but couldn't work out whether it was always
> possible for the guest to handle these faults (first access in an
> interrupt context, for example?).

If the check is a simple pfn_valid() I think it should be ok, but yes, we'd
definitely not want to do anything more involved given that this could run
in all sorts of horrible contexts.

> Also, this changes the semantics of the protection this is supposed to
> offer: any access out of the RAM space will generate an abort, and the
> fault handler will grant MMIO forwarding for this page. Stray accesses
> that would normally be properly handled as fatal would now succeed and
> be forwarded to userspace, even if there were no emulated devices
> there.

That's true, it would offer much weaker guarantees to the guest. It's more
like a guarantee that memory never traps to the VMM. It also then wouldn't
help with the write-combine fun. It would be simpler though, but with less
functionality.

> For this to work, we'd need to work out whether there is any existing
> device mapping that actually points to this page. And whether it
> actually is supposed to be forwarded to userspace. Do we have a rmap
> for device mappings?

I don't think this would be possible given your comments above.

So let's stick with the approach you've taken. It just feels like there
should be a way to do this without introducing new hooks into the core
code. If it wasn't for pci_remap_iospace(), we could simply hook our
definition of __ioremap_caller(). Another avenue to explore would be
looking at the IO resource instead; I see x86 already uses
IORES_MAP_ENCRYPTED and IORES_MAP_SYSTEM_RAM to drive pgprot...

Will

2021-08-01 11:26:19

by Marc Zyngier

[permalink] [raw]
Subject: Re: [PATCH 07/16] KVM: arm64: Wire MMIO guard hypercalls

On Fri, 30 Jul 2021 14:11:03 +0100,
Will Deacon <[email protected]> wrote:
>
> On Wed, Jul 28, 2021 at 11:47:20AM +0100, Marc Zyngier wrote:
> > On Tue, 27 Jul 2021 19:11:46 +0100,
> > Will Deacon <[email protected]> wrote:
> > >
> > > On Thu, Jul 15, 2021 at 05:31:50PM +0100, Marc Zyngier wrote:
> > > > Plumb in the hypercall interface to allow a guest to discover,
> > > > enroll, map and unmap MMIO regions.
> > > >
> > > > Signed-off-by: Marc Zyngier <[email protected]>
> > > > ---
> > > > arch/arm64/kvm/hypercalls.c | 20 ++++++++++++++++++++
> > > > include/linux/arm-smccc.h | 28 ++++++++++++++++++++++++++++
> > > > 2 files changed, 48 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> > > > index 30da78f72b3b..a3deeb907fdd 100644
> > > > --- a/arch/arm64/kvm/hypercalls.c
> > > > +++ b/arch/arm64/kvm/hypercalls.c
> > > > @@ -5,6 +5,7 @@
> > > > #include <linux/kvm_host.h>
> > > >
> > > > #include <asm/kvm_emulate.h>
> > > > +#include <asm/kvm_mmu.h>
> > > >
> > > > #include <kvm/arm_hypercalls.h>
> > > > #include <kvm/arm_psci.h>
> > > > @@ -129,10 +130,29 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> > > > case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
> > > > val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> > > > val[0] |= BIT(ARM_SMCCC_KVM_FUNC_PTP);
> > > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_INFO);
> > > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_ENROLL);
> > > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_MAP);
> > > > + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_MMIO_GUARD_UNMAP);
> > > > break;
> > > > case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
> > > > kvm_ptp_get_time(vcpu, val);
> > > > break;
> > > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_INFO_FUNC_ID:
> > > > + val[0] = PAGE_SIZE;
> > > > + break;
> > >
> > > I get the nagging feeling that querying the stage-2 page-size outside of
> > > MMIO guard is going to be useful once we start looking at memory sharing,
> > > so perhaps rename this to something more generic?
> >
> > At this stage, why not follow the architecture and simply expose it as
> > ID_AA64MMFR0_EL1.TGran{4,64,16}_2? That's exactly what it is for, and
> > we already check for this in KVM itself.
>
> Nice, I hadn't thought of that. On reflection, though, I don't agree that
> it's "exactly what it is for" -- the ID register talks about the supported
> stage-2 page-sizes, whereas we want to advertise the one page size that
> we're currently using. In other words, it's important that we only ever
> populate one of the fields and I wonder if that could bite us in future
> somehow?

Either that, or we expose all the page sizes >= that of the host
(using the fact that larger page sizes are multiples of the base one),
and use the guest's page size to work out the granularity. Which is
what NV does already.

> Up to you, you've definitely got a better feel for this than me.

I'll have a look. The "one size" version is dead easy.

>
> > > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_ENROLL_FUNC_ID:
> > > > + set_bit(KVM_ARCH_FLAG_MMIO_GUARD, &vcpu->kvm->arch.flags);
> > > > + val[0] = SMCCC_RET_SUCCESS;
> > > > + break;
> > > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID:
> > > > + if (kvm_install_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> > > > + val[0] = SMCCC_RET_SUCCESS;
> > > > + break;
> > > > + case ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID:
> > > > + if (kvm_remove_ioguard_page(vcpu, vcpu_get_reg(vcpu, 1)))
> > > > + val[0] = SMCCC_RET_SUCCESS;
> > > > + break;
> > >
> > > I think there's a slight discrepancy between MAP and UNMAP here in that
> > > calling UNMAP on something that hasn't been mapped will fail, whereas
> > > calling MAP on something that's already been mapped will succeed. I think
> > > that might mean you can't reason about the final state of the page if two
> > > vCPUs race to call these functions in some cases (and both succeed).
> >
> > I'm not sure that's the expected behaviour for ioremap(), for example
> > (you can ioremap two portions of the same page successfully).
>
> Hmm, good point. Does that mean we should be refcounting the stage-2?
> Otherwise if we do something like:
>
> foo = ioremap(page, 0x100);
> bar = ioremap(page+0x100, 0x100);
> iounmap(foo);
>
> then bar will break. Or did I miss something in the series?

No, I don't think you have. But I don't think we should implement this
refcounting on the hypervisor side. We really should do it guest side.

I'll have a look.
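
Something like the below on the guest side, maybe (pure sketch, not
even compile-tested -- the xarray and helper names are invented, the
MAIR index argument is ignored, and the lookup/update pair would need
some locking):

#include <linux/arm-smccc.h>
#include <linux/xarray.h>

/* One refcount per guarded guest page, indexed by PFN */
static DEFINE_XARRAY(ioguard_refcount);

static int ioguard_map_page(phys_addr_t phys)
{
        unsigned long pfn = phys >> PAGE_SHIFT;
        unsigned long cnt = (unsigned long)xa_load(&ioguard_refcount, pfn);
        struct arm_smccc_res res;

        if (!cnt) {
                /* First user of this page: ask the host to allow MMIO here */
                arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_MAP_FUNC_ID,
                                     phys & PAGE_MASK, &res);
                if (res.a0 != SMCCC_RET_SUCCESS)
                        return -EINVAL;
        }

        xa_store(&ioguard_refcount, pfn, (void *)(cnt + 1), GFP_KERNEL);
        return 0;
}

static void ioguard_unmap_page(phys_addr_t phys)
{
        unsigned long pfn = phys >> PAGE_SHIFT;
        unsigned long cnt = (unsigned long)xa_load(&ioguard_refcount, pfn);
        struct arm_smccc_res res;

        if (WARN_ON(!cnt))
                return;

        if (--cnt) {
                xa_store(&ioguard_refcount, pfn, (void *)cnt, GFP_KERNEL);
                return;
        }

        /* Last user gone: tell the host to tear the mapping down */
        xa_erase(&ioguard_refcount, pfn);
        arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_MMIO_GUARD_UNMAP_FUNC_ID,
                             phys & PAGE_MASK, &res);
}

The ioremap/iounmap hooks would then just iterate over the pages of the
range and call these.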

Thanks,

M.

--
Without deviation from the norm, progress is not possible.