Hi all,
This series aims to improve how the nVHE hypervisor tracks ownership of memory
pages when running in protected mode ("kvm-arm.mode=protected" on the kernel
command line).
The main issue with the existing ownership tracking code is that it is
completely binary: a page is either owned by an entity (e.g. the host) or not.
However, we'll need something smarter to track shared pages, which will
be needed for virtio or even just for host/hypervisor communication.
This series introduces a few changes to the kvm page-table library to allow
annotating shared pages in ignored bits (a.k.a. software bits) of leaf entries,
and makes use of that infrastructure to track all pages that are shared between
the host and the hypervisor. We will obviously want to apply the same treatment
to guest stage-2 page-tables, but that is not really possible to do until EL2
manages them directly, so I'll keep that for another series.
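To give a feel for the end result: the state of each page ends up encoded in
two of the ignored bits of its leaf entry. The snippet below is a condensed
sketch of what patch 08 introduces (see pte_ignored_bit_prot() there for the
real thing); only the names taken from the series are authoritative, the rest
is illustrative:

	/* Bits [58:55] of page and block descriptors are ignored by the MMU. */
	#define KVM_PTE_LEAF_ATTR_IGNORED	GENMASK(58, 55)

	/*
	 * Two of those bits encode the ownership state of the page:
	 *   00: owned solely by the page-table owner
	 *   01: owned by the page-table owner, but shared
	 *   10: shared with, but not owned by, the page-table owner
	 *   11: reserved for future use (lending)
	 */
	if (prot & KVM_PGTABLE_STATE_SHARED)
		ignored_bits |= (prot & KVM_PGTABLE_STATE_BORROWED) ? BIT(1) : BIT(0);
	attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);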
The series is split as follows:
- patches 01-04 are essentially cleanups and optimizations of existing code
paths that might be relevant on their own, but also prepare the ground for
the rest of the series;
- patches 05-08 introduce support in the page-table library for annotating
shared pages with software bits;
- patches 09-14 make use of the previously introduced infrastructure to
annotate all pages shared between the host and the hypervisor;
The series is based on the latest kvmarm/fixes branch, and has been tested
on AML-S905X-CC (Le Potato) and with various QEMU configurations.
Thanks!
Quentin
Quentin Perret (14):
KVM: arm64: Provide the host_stage2_try() helper macro
KVM: arm64: Optimize kvm_pgtable_stage2_find_range()
KVM: arm64: Continue stage-2 map when re-creating mappings
KVM: arm64: Rename KVM_PTE_LEAF_ATTR_S2_IGNORED
KVM: arm64: Don't overwrite ignored bits with owner id
KVM: arm64: Tolerate re-creating hyp mappings to set ignored bits
KVM: arm64: Enable forcing page-level stage-2 mappings
KVM: arm64: Add support for tagging shared pages in page-table
KVM: arm64: Mark host bss and rodata section as shared
KVM: arm64: Enable retrieving protections attributes of PTEs
KVM: arm64: Expose kvm_pte_valid() helper
KVM: arm64: Refactor pkvm_pgtable locking
KVM: arm64: Restrict hyp stage-1 manipulation in protected mode
KVM: arm64: Prevent late calls to __pkvm_create_private_mapping()
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/include/asm/kvm_pgtable.h | 109 ++++++---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 +
arch/arm64/kvm/hyp/include/nvhe/mm.h | 3 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 17 +-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 156 +++++++++++--
arch/arm64/kvm/hyp/nvhe/mm.c | 20 +-
arch/arm64/kvm/hyp/nvhe/setup.c | 52 ++++-
arch/arm64/kvm/hyp/pgtable.c | 219 +++++++++++++-----
arch/arm64/kvm/mmu.c | 14 +-
10 files changed, 454 insertions(+), 140 deletions(-)
--
2.32.0.402.g57bb445576-goog
We currently unmap all MMIO mappings from the host stage-2 to recycle
the pages whenever we run out. In order to make this pattern easy to
re-use from other places, factor the logic out into a dedicated macro.
While at it, apply the macro to the kvm_pgtable_stage2_set_owner()
calls. They're currently only called early on and are guaranteed to
succeed, but making them robust to the -ENOMEM case doesn't hurt and
will avoid painful debugging sessions later on.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 38 ++++++++++++++-------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index d938ce95d3bd..56f2117c877b 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -208,6 +208,23 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
prot, &host_s2_pool);
}
+/*
+ * The pool has been provided with enough pages to cover all of memory with
+ * page granularity, but it is difficult to know how much of the MMIO range
+ * we will need to cover upfront, so we may need to 'recycle' the pages if we
+ * run out.
+ */
+#define host_stage2_try(fn, ...) \
+ ({ \
+ int __ret = fn(__VA_ARGS__); \
+ if (__ret == -ENOMEM) { \
+ __ret = host_stage2_unmap_dev_all(); \
+ if (!__ret) \
+ __ret = fn(__VA_ARGS__); \
+ } \
+ __ret; \
+ })
+
static int host_stage2_idmap(u64 addr)
{
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
@@ -223,22 +240,7 @@ static int host_stage2_idmap(u64 addr)
if (ret)
goto unlock;
- ret = __host_stage2_idmap(range.start, range.end, prot);
- if (ret != -ENOMEM)
- goto unlock;
-
- /*
- * The pool has been provided with enough pages to cover all of memory
- * with page granularity, but it is difficult to know how much of the
- * MMIO range we will need to cover upfront, so we may need to 'recycle'
- * the pages if we run out.
- */
- ret = host_stage2_unmap_dev_all();
- if (ret)
- goto unlock;
-
- ret = __host_stage2_idmap(range.start, range.end, prot);
-
+ ret = host_stage2_try(__host_stage2_idmap, range.start, range.end, prot);
unlock:
hyp_spin_unlock(&host_kvm.lock);
@@ -257,8 +259,8 @@ int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
return -EINVAL;
hyp_spin_lock(&host_kvm.lock);
- ret = kvm_pgtable_stage2_set_owner(&host_kvm.pgt, start, end - start,
- &host_s2_pool, pkvm_hyp_id);
+ ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt,
+ start, end - start, &host_s2_pool, pkvm_hyp_id);
hyp_spin_unlock(&host_kvm.lock);
return ret != -EAGAIN ? ret : 0;
--
2.32.0.402.g57bb445576-goog
The stage-2 map walkers currently return -EAGAIN when re-creating
identical mappings or only changing access permissions. This allows to
optimize mapping pages for concurrent (v)CPUs faulting on the same
page.
While this works as expected when touching one page-table leaf at a
time, this can lead to difficult situations when mapping larger ranges.
Indeed, a large map operation can fail in the middle if an existing
mapping is found in the range, even if it has compatible attributes,
hence leaving only half of the range mapped.
To avoid having to deal with such failures in the caller, don't
interrupt the map operation when hitting existing PTEs, but make sure to
still return -EAGAIN so that user_mem_abort() can mark the page dirty
when needed.
Cc: Yanan Wang <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 2 +-
arch/arm64/kvm/hyp/pgtable.c | 21 +++++++++++++++++----
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d6649352c8b3..af62203d2f7a 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -258,7 +258,7 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
* If device attributes are not explicitly requested in @prot, then the
* mapping will be normal, cacheable.
*
- * Note that the update of a valid leaf PTE in this function will be aborted,
+ * Note that the update of a valid leaf PTE in this function will be skipped,
* if it's trying to recreate the exact same mapping or only change the access
* permissions. Instead, the vCPU will exit one more time from guest if still
* needed and then go through the path of relaxing permissions.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 978f341d02ca..bb73c5331b7c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -475,6 +475,8 @@ struct stage2_map_data {
void *memcache;
struct kvm_pgtable_mm_ops *mm_ops;
+
+ int ret;
};
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
@@ -612,8 +614,10 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
* the vCPU will exit one more time from guest if still needed
* and then go through the path of relaxing permissions.
*/
- if (!stage2_pte_needs_update(old, new))
- return -EAGAIN;
+ if (!stage2_pte_needs_update(old, new)) {
+ data->ret = -EAGAIN;
+ goto out;
+ }
stage2_put_pte(ptep, data->mmu, addr, level, mm_ops);
}
@@ -629,6 +633,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
smp_store_release(ptep, new);
if (stage2_pte_is_counted(new))
mm_ops->get_page(ptep);
+out:
if (kvm_phys_is_valid(phys))
data->phys += granule;
return 0;
@@ -771,6 +776,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
.mmu = pgt->mmu,
.memcache = mc,
.mm_ops = pgt->mm_ops,
+ .ret = 0,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -789,7 +795,10 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
dsb(ishst);
- return ret;
+ if (ret)
+ return ret;
+
+ return map_data.ret;
}
int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
@@ -802,6 +811,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
.memcache = mc,
.mm_ops = pgt->mm_ops,
.owner_id = owner_id,
+ .ret = 0,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -815,7 +825,10 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
return -EINVAL;
ret = kvm_pgtable_walk(pgt, addr, size, &walker);
- return ret;
+ if (ret)
+ return ret;
+
+ return map_data.ret;
}
static int stage2_unmap_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
--
2.32.0.402.g57bb445576-goog
The nVHE protected mode uses invalid mappings in the host stage-2
page-table to track the owner of each page in the system. In order to
allow the usage of ignored bits (a.k.a. software bits) in these
mappings, move the owner encoding away from the top bits.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a60653cbd8e5..a0ac8c2bc174 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -50,7 +50,7 @@
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
#define KVM_MAX_OWNER_ID 1
struct kvm_pgtable_walk_data {
--
2.32.0.402.g57bb445576-goog
The ignored bits for both stage-1 and stage-2 page and block
descriptors are in bits [58:55], so rename KVM_PTE_LEAF_ATTR_S2_IGNORED to
make it applicable to both.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bb73c5331b7c..a60653cbd8e5 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -40,6 +40,8 @@
#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
+#define KVM_PTE_LEAF_ATTR_IGNORED GENMASK(58, 55)
+
#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
@@ -48,8 +50,6 @@
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)
-#define KVM_PTE_LEAF_ATTR_S2_IGNORED GENMASK(58, 55)
-
#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
#define KVM_MAX_OWNER_ID 1
--
2.32.0.402.g57bb445576-goog
Much of the stage-2 manipulation logic relies on being able to destroy
block mappings if e.g. installing a smaller mapping in the range. The
rationale for this behaviour is that stage-2 mappings can always be
re-created lazily. However, this gets more complicated when the stage-2
page-table is used to store metadata about the underlying pages. In such
a case, destroying a block mapping may lead to losing part of that
state and confusing the user of the metadata (such as the hypervisor in
nVHE protected mode).
To fix this, introduce a callback function in the pgtable struct which
is called during all map operations to determine whether block mappings
can be used, or whether the mapping should be forced to page
granularity. This is used by the hypervisor when creating the host
stage-2 to force page-level mappings when using non-default protection
attributes.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 63 +++++++++++++++++----------
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 16 +++++--
arch/arm64/kvm/hyp/pgtable.c | 20 +++++++--
3 files changed, 69 insertions(+), 30 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index af62203d2f7a..dd72653314c7 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -75,25 +75,6 @@ enum kvm_pgtable_stage2_flags {
KVM_PGTABLE_S2_IDMAP = BIT(1),
};
-/**
- * struct kvm_pgtable - KVM page-table.
- * @ia_bits: Maximum input address size, in bits.
- * @start_level: Level at which the page-table walk starts.
- * @pgd: Pointer to the first top-level entry of the page-table.
- * @mm_ops: Memory management callbacks.
- * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
- */
-struct kvm_pgtable {
- u32 ia_bits;
- u32 start_level;
- kvm_pte_t *pgd;
- struct kvm_pgtable_mm_ops *mm_ops;
-
- /* Stage-2 only */
- struct kvm_s2_mmu *mmu;
- enum kvm_pgtable_stage2_flags flags;
-};
-
/**
* enum kvm_pgtable_prot - Page-table permissions and attributes.
* @KVM_PGTABLE_PROT_X: Execute permission.
@@ -109,11 +90,41 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_DEVICE = BIT(3),
};
-#define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
+#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
+#define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
+
+#define PAGE_HYP KVM_PGTABLE_PROT_RW
#define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
#define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
#define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
+typedef bool (*kvm_pgtable_want_pte_cb_t)(u64 addr, u64 end,
+ enum kvm_pgtable_prot prot);
+
+/**
+ * struct kvm_pgtable - KVM page-table.
+ * @ia_bits: Maximum input address size, in bits.
+ * @start_level: Level at which the page-table walk starts.
+ * @pgd: Pointer to the first top-level entry of the page-table.
+ * @mm_ops: Memory management callbacks.
+ * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
+ * @flags: Stage-2 page-table flags.
+ * @want_pte_cb: Callback function used during map operations to decide
+ * whether block mappings can be used to map the given IPA
+ * range.
+ */
+struct kvm_pgtable {
+ u32 ia_bits;
+ u32 start_level;
+ kvm_pte_t *pgd;
+ struct kvm_pgtable_mm_ops *mm_ops;
+
+ /* Stage-2 only */
+ struct kvm_s2_mmu *mmu;
+ enum kvm_pgtable_stage2_flags flags;
+ kvm_pgtable_want_pte_cb_t want_pte_cb;
+};
+
/**
* struct kvm_mem_range - Range of Intermediate Physical Addresses
* @start: Start of the range.
@@ -216,21 +227,25 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
/**
- * kvm_pgtable_stage2_init_flags() - Initialise a guest stage-2 page-table.
+ * kvm_pgtable_stage2_init_full() - Initialise a guest stage-2 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @arch: Arch-specific KVM structure representing the guest virtual
* machine.
* @mm_ops: Memory management callbacks.
* @flags: Stage-2 configuration flags.
+ * @want_pte_cb: Callback function used during map operations to decide
+ * whether block mappings can be used to map the given IPA
+ * range.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
+int kvm_pgtable_stage2_init_full(struct kvm_pgtable *pgt, struct kvm_arch *arch,
struct kvm_pgtable_mm_ops *mm_ops,
- enum kvm_pgtable_stage2_flags flags);
+ enum kvm_pgtable_stage2_flags flags,
+ kvm_pgtable_want_pte_cb_t want_pte_cb);
#define kvm_pgtable_stage2_init(pgt, arch, mm_ops) \
- kvm_pgtable_stage2_init_flags(pgt, arch, mm_ops, 0)
+ kvm_pgtable_stage2_init_full(pgt, arch, mm_ops, 0, NULL)
/**
* kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 58edc62be6f7..cdace80d3e28 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -89,6 +89,7 @@ static void prepare_host_vtcr(void)
id_aa64mmfr1_el1_sys_val, phys_shift);
}
+static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot);
int kvm_host_prepare_stage2(void *pgt_pool_base)
{
struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
@@ -101,8 +102,9 @@ int kvm_host_prepare_stage2(void *pgt_pool_base)
if (ret)
return ret;
- ret = kvm_pgtable_stage2_init_flags(&host_kvm.pgt, &host_kvm.arch,
- &host_kvm.mm_ops, KVM_HOST_S2_FLAGS);
+ ret = kvm_pgtable_stage2_init_full(&host_kvm.pgt, &host_kvm.arch,
+ &host_kvm.mm_ops, KVM_HOST_S2_FLAGS,
+ host_stage2_want_pte_cb);
if (ret)
return ret;
@@ -225,9 +227,17 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
__ret; \
})
+static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot)
+{
+ if (range_is_memory(addr, end))
+ return prot != KVM_PGTABLE_PROT_RWX;
+ else
+ return prot != KVM_PGTABLE_PROT_RW;
+}
+
static int host_stage2_idmap(u64 addr)
{
- enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RW;
struct kvm_mem_range range;
bool is_memory = find_mem_range(addr, &range);
int ret;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 34cf67997a82..5bdbe7a31551 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -452,6 +452,8 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels;
pgt->mm_ops = mm_ops;
pgt->mmu = NULL;
+ pgt->want_pte_cb = NULL;
+
return 0;
}
@@ -491,6 +493,7 @@ struct stage2_map_data {
struct kvm_pgtable_mm_ops *mm_ops;
int ret;
+ bool force_pte;
};
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
@@ -613,6 +616,9 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
struct kvm_pgtable *pgt = data->mmu->pgt;
struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
+ if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
+ return -E2BIG;
+
if (!kvm_block_mapping_supported(addr, end, phys, level))
return -E2BIG;
@@ -660,6 +666,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
if (data->anchor)
return 0;
+ if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
+ return 0;
+
if (!kvm_block_mapping_supported(addr, end, data->phys, level))
return 0;
@@ -791,6 +800,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
.memcache = mc,
.mm_ops = pgt->mm_ops,
.ret = 0,
+ .force_pte = pgt->want_pte_cb && pgt->want_pte_cb(addr, addr + size, prot),
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -826,6 +836,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
.mm_ops = pgt->mm_ops,
.owner_id = owner_id,
.ret = 0,
+ .force_pte = true,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -1070,9 +1081,11 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
return kvm_pgtable_walk(pgt, addr, size, &walker);
}
-int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
- struct kvm_pgtable_mm_ops *mm_ops,
- enum kvm_pgtable_stage2_flags flags)
+
+int kvm_pgtable_stage2_init_full(struct kvm_pgtable *pgt, struct kvm_arch *arch,
+ struct kvm_pgtable_mm_ops *mm_ops,
+ enum kvm_pgtable_stage2_flags flags,
+ kvm_pgtable_want_pte_cb_t want_pte_cb)
{
size_t pgd_sz;
u64 vtcr = arch->vtcr;
@@ -1090,6 +1103,7 @@ int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch
pgt->mm_ops = mm_ops;
pgt->mmu = &arch->mmu;
pgt->flags = flags;
+ pgt->want_pte_cb = want_pte_cb;
/* Ensure zeroed PGD pages are visible to the hardware walker */
dsb(ishst);
--
2.32.0.402.g57bb445576-goog
The hypervisor will soon be in charge of tracking ownership of all
memory pages in the system. The current page-tracking infrastructure at
EL2 only allows binary states: a page is either owned or not by an
entity. But a number of use-cases will require more complex states for
pages that are shared between two entities (host, hypervisor, or guests).
In preparation for supporting these use-cases, introduce infrastructure
in the KVM page-table library to tag shared pages using ignored bits
(a.k.a. software bits) in PTEs.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 5 +++++
arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index dd72653314c7..f6d3d5c8910d 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -81,6 +81,8 @@ enum kvm_pgtable_stage2_flags {
* @KVM_PGTABLE_PROT_W: Write permission.
* @KVM_PGTABLE_PROT_R: Read permission.
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
+ * @KVM_PGTABLE_STATE_SHARED: Page shared with another entity.
+ * @KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity.
*/
enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_X = BIT(0),
@@ -88,6 +90,9 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_R = BIT(2),
KVM_PGTABLE_PROT_DEVICE = BIT(3),
+
+ KVM_PGTABLE_STATE_SHARED = BIT(4),
+ KVM_PGTABLE_STATE_BORROWED = BIT(5),
};
#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 5bdbe7a31551..51598b79dafc 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -211,6 +211,29 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
}
+static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
+{
+ kvm_pte_t ignored_bits = 0;
+
+ /*
+ * Ignored bits 0 and 1 are reserved to track the memory ownership
+ * state of each page:
+ * 00: The page is owned solely by the page-table owner.
+ * 01: The page is owned by the page-table owner, but is shared
+ * with another entity.
+ * 10: The page is shared with, but not owned by the page-table owner.
+ * 11: Reserved for future use (lending).
+ */
+ if (prot & KVM_PGTABLE_STATE_SHARED) {
+ if (prot & KVM_PGTABLE_STATE_BORROWED)
+ ignored_bits |= BIT(1);
+ else
+ ignored_bits |= BIT(0);
+ }
+
+ return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
+}
+
static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag)
@@ -357,6 +380,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
+ attr |= pte_ignored_bit_prot(prot);
*ptep = attr;
return 0;
@@ -558,6 +582,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
+ attr |= pte_ignored_bit_prot(prot);
*ptep = attr;
return 0;
--
2.32.0.402.g57bb445576-goog
Refactor the hypervisor stage-1 locking in nVHE protected mode to expose
a new pkvm_create_mappings_locked() function. This will be used in later
patches to allow walking and changing the hypervisor stage-1 without
releasing the lock.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/nvhe/mm.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index 8ec3a5a7744b..c76d7136ed9b 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -23,6 +23,7 @@ int hyp_map_vectors(void);
int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
+int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
int __pkvm_create_mappings(unsigned long start, unsigned long size,
unsigned long phys, enum kvm_pgtable_prot prot);
unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index a8efdf0f9003..dde22e2a322a 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -67,7 +67,7 @@ unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
return addr;
}
-int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
+int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot)
{
unsigned long start = (unsigned long)from;
unsigned long end = (unsigned long)to;
@@ -81,7 +81,8 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
int err;
phys = hyp_virt_to_phys((void *)virt_addr);
- err = __pkvm_create_mappings(virt_addr, PAGE_SIZE, phys, prot);
+ err = kvm_pgtable_hyp_map(&pkvm_pgtable, virt_addr, PAGE_SIZE,
+ phys, prot);
if (err)
return err;
}
@@ -89,6 +90,17 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
return 0;
}
+int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
+{
+ int ret;
+
+ hyp_spin_lock(&pkvm_pgd_lock);
+ ret = pkvm_create_mappings_locked(from, to, prot);
+ hyp_spin_unlock(&pkvm_pgd_lock);
+
+ return ret;
+}
+
int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)
{
unsigned long start, end;
--
2.32.0.402.g57bb445576-goog
Introduce helper functions in the KVM stage-2 and stage-1 page-table
manipulation library to retrieve the enum kvm_pgtable_prot of a
PTE. This will be useful for implementing custom walkers outside of
pgtable.c.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 20 ++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 49 ++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index f6d3d5c8910d..1aa49d6aabb7 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -467,4 +467,24 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
*/
int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
int owner, struct kvm_mem_range *range);
+
+/**
+ * kvm_pgtable_stage2_pte_prot() - Retrieve the protection attributes of a
+ * stage-2 Page-Table Entry.
+ * @pte: Page-table entry
+ *
+ * Return: protection attributes of the page-table entry in the enum
+ * kvm_pgtable_prot format.
+ */
+enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
+
+/**
+ * kvm_pgtable_hyp_pte_prot() - Retrieve the protection attributes of a stage-1
+ * Page-Table Entry.
+ * @pte: Page-table entry
+ *
+ * Return: protection attributes of the page-table entry in the enum
+ * kvm_pgtable_prot format.
+ */
+enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
#endif /* __ARM64_KVM_PGTABLE_H__ */
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 51598b79dafc..c7120797404a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -234,6 +234,20 @@ static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
}
+static enum kvm_pgtable_prot pte_read_ignored_bits_prot(kvm_pte_t pte)
+{
+ enum kvm_pgtable_prot prot = 0;
+ kvm_pte_t ignored_bits = 0;
+
+ ignored_bits = FIELD_GET(KVM_PTE_LEAF_ATTR_IGNORED, pte);
+ if (ignored_bits & BIT(1))
+ prot |= KVM_PGTABLE_STATE_BORROWED | KVM_PGTABLE_STATE_SHARED;
+ if (ignored_bits & BIT(0))
+ prot |= KVM_PGTABLE_STATE_SHARED;
+
+ return prot;
+}
+
static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
u32 level, kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag)
@@ -386,6 +400,25 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
return 0;
}
+enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
+{
+ enum kvm_pgtable_prot prot = 0;
+ u32 ap;
+
+ if (!(pte & KVM_PTE_LEAF_ATTR_HI_S1_XN))
+ prot |= KVM_PGTABLE_PROT_X;
+
+ ap = FIELD_GET(KVM_PTE_LEAF_ATTR_LO_S1_AP, pte);
+ if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RO)
+ prot |= KVM_PGTABLE_PROT_R;
+ else if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RW)
+ prot |= KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+
+ prot |= pte_read_ignored_bits_prot(pte);
+
+ return prot;
+}
+
static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
{
if (old == new)
@@ -588,6 +621,22 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
return 0;
}
+enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte)
+{
+ enum kvm_pgtable_prot prot = 0;
+
+ if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R)
+ prot |= KVM_PGTABLE_PROT_R;
+ if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W)
+ prot |= KVM_PGTABLE_PROT_W;
+ if (!(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN))
+ prot |= KVM_PGTABLE_PROT_X;
+
+ prot |= pte_read_ignored_bits_prot(pte);
+
+ return prot;
+}
+
static bool stage2_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
{
if (!kvm_pte_valid(old) || !kvm_pte_valid(new))
--
2.32.0.402.g57bb445576-goog
The kvm_pgtable_stage2_find_range() function is used in the host memory
abort path to try to find the largest block mapping that can be
used to map the faulting address. In order to do so, the function
currently walks the stage-2 page-table and looks for existing
incompatible mappings within the range of the largest possible block.
If incompatible mappings are found, it tries the same procedure again,
but using a smaller block range, and repeats until a matching range is
found (potentially down to page granularity). While this approach has
benefits (mostly that it proactively coalesces host stage-2 mappings),
it can be slow if the ranges are fragmented, and it isn't optimized for
multiple CPUs faulting on the same IPA, as all of them will redo the
whole walk every time.
To avoid these issues, rework kvm_pgtable_stage2_find_range() by walking
the page-table only once to find the closest leaf to the input address,
and return its corresponding range if it is invalid and not owned by
another entity. If a valid leaf is found, return -EAGAIN similar to what
is done in the kvm_pgtable_stage2_map() path to optimize concurrent
faults.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 12 ++---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +-
arch/arm64/kvm/hyp/pgtable.c | 74 +++++++++++----------------
3 files changed, 34 insertions(+), 54 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index f004c0115d89..d6649352c8b3 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -434,21 +434,17 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
/**
* kvm_pgtable_stage2_find_range() - Find a range of Intermediate Physical
- * Addresses with compatible permission
- * attributes.
+ * Addresses with a given ownership.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
* @addr: Address that must be covered by the range.
- * @prot: Protection attributes that the range must be compatible with.
+ * @owner: Expected owner of the pages of the range.
* @range: Range structure used to limit the search space at call time and
* that will hold the result.
*
- * The offset of @addr within a page is ignored. An IPA is compatible with @prot
- * iff its corresponding stage-2 page-table entry has default ownership and, if
- * valid, is mapped with protection attributes identical to @prot.
+ * The offset of @addr within a page is ignored.
*
* Return: 0 on success, negative error code on failure.
*/
int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
- enum kvm_pgtable_prot prot,
- struct kvm_mem_range *range);
+ int owner, struct kvm_mem_range *range);
#endif /* __ARM64_KVM_PGTABLE_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 56f2117c877b..58edc62be6f7 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -236,7 +236,7 @@ static int host_stage2_idmap(u64 addr)
prot |= KVM_PGTABLE_PROT_X;
hyp_spin_lock(&host_kvm.lock);
- ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
+ ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, 0, &range);
if (ret)
goto unlock;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 05321f4165e3..978f341d02ca 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1103,76 +1103,60 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
pgt->pgd = NULL;
}
-#define KVM_PTE_LEAF_S2_COMPAT_MASK (KVM_PTE_LEAF_ATTR_S2_PERMS | \
- KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | \
- KVM_PTE_LEAF_ATTR_S2_IGNORED)
+struct stage2_check_permission_data {
+ u32 level;
+ u8 owner;
+};
static int stage2_check_permission_walker(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep,
enum kvm_pgtable_walk_flags flag,
void * const arg)
{
- kvm_pte_t old_attr, pte = *ptep, *new_attr = arg;
+ struct stage2_check_permission_data *data = arg;
+ kvm_pte_t pte = *ptep;
- /*
- * Compatible mappings are either invalid and owned by the page-table
- * owner (whose id is 0), or valid with matching permission attributes.
- */
- if (kvm_pte_valid(pte)) {
- old_attr = pte & KVM_PTE_LEAF_S2_COMPAT_MASK;
- if (old_attr != *new_attr)
- return -EEXIST;
- } else if (pte) {
+ if (kvm_pte_valid(pte))
+ return -EAGAIN;
+
+ if (FIELD_GET(KVM_INVALID_PTE_OWNER_MASK, pte) != data->owner)
return -EEXIST;
- }
+
+ data->level = level;
return 0;
}
int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
- enum kvm_pgtable_prot prot,
- struct kvm_mem_range *range)
+ int owner, struct kvm_mem_range *range)
{
- kvm_pte_t attr;
+ struct stage2_check_permission_data data = {
+ .level = pgt->start_level,
+ .owner = owner,
+ };
struct kvm_pgtable_walker check_perm_walker = {
.cb = stage2_check_permission_walker,
.flags = KVM_PGTABLE_WALK_LEAF,
- .arg = &attr,
+ .arg = &data,
};
- u64 granule, start, end;
- u32 level;
+ u64 granule, end, start = ALIGN_DOWN(addr, PAGE_SIZE);
int ret;
- ret = stage2_set_prot_attr(pgt, prot, &attr);
+ ret = kvm_pgtable_walk(pgt, start, PAGE_SIZE, &check_perm_walker);
if (ret)
return ret;
- attr &= KVM_PTE_LEAF_S2_COMPAT_MASK;
- for (level = pgt->start_level; level < KVM_PGTABLE_MAX_LEVELS; level++) {
- granule = kvm_granule_size(level);
+ do {
+ granule = kvm_granule_size(data.level);
start = ALIGN_DOWN(addr, granule);
end = start + granule;
+ data.level++;
+ } while ((data.level < KVM_PGTABLE_MAX_LEVELS) &&
+ (!kvm_level_supports_block_mapping(data.level) ||
+ start < range->start || range->end < end));
- if (!kvm_level_supports_block_mapping(level))
- continue;
-
- if (start < range->start || range->end < end)
- continue;
+ range->start = start;
+ range->end = end;
- /*
- * Check the presence of existing mappings with incompatible
- * permissions within the current block range, and try one level
- * deeper if one is found.
- */
- ret = kvm_pgtable_walk(pgt, start, granule, &check_perm_walker);
- if (ret != -EEXIST)
- break;
- }
-
- if (!ret) {
- range->start = start;
- range->end = end;
- }
-
- return ret;
+ return 0;
}
--
2.32.0.402.g57bb445576-goog
The current hypervisor stage-1 mapping code doesn't allow changing an
existing valid mapping. Relax this condition by allowing changes that
only target ignored bits, as that will soon be needed to annotate shared
pages.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index a0ac8c2bc174..34cf67997a82 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -362,6 +362,17 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
return 0;
}
+static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
+{
+ if (old == new)
+ return false;
+
+ if (!kvm_pte_valid(old))
+ return true;
+
+ return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_IGNORED);
+}
+
static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep, struct hyp_map_data *data)
{
@@ -371,9 +382,12 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
if (!kvm_block_mapping_supported(addr, end, phys, level))
return false;
- /* Tolerate KVM recreating the exact same mapping */
+ /*
+ * Tolerate KVM recreating the exact same mapping, or changing ignored
+ * bits.
+ */
new = kvm_init_valid_leaf_pte(phys, data->attr, level);
- if (old != new && !WARN_ON(kvm_pte_valid(old)))
+ if (hyp_pte_needs_update(old, new))
smp_store_release(ptep, new);
data->phys += granule;
--
2.32.0.402.g57bb445576-goog
The host kernel is currently able to change EL2 stage-1 mappings without
restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
world where the host is no longer part of the TCB, this clearly poses a
problem.
To fix this, introduce a new hypercall to allow the host to share a
range of physical memory with the hypervisor, and remove the
__pkvm_create_mappings() variant. The new hypercall implements
ownership and permission checks before allowing the sharing operation,
and it annotates the shared pages in the hypervisor stage-1 and host
stage-2 page-tables.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/include/nvhe/mm.h | 2 -
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 12 +-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 105 ++++++++++++++++++
arch/arm64/kvm/hyp/nvhe/mm.c | 4 +-
arch/arm64/kvm/mmu.c | 14 ++-
7 files changed, 124 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 9f0bf2109be7..78db818ae2c9 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -59,7 +59,7 @@
#define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
#define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
#define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
-#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
+#define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp 16
#define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
#define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
#define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index b39047463075..f37e4d3b831b 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -22,6 +22,7 @@ extern struct host_kvm host_kvm;
int __pkvm_prot_finalize(void);
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
+int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end);
int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
int kvm_host_prepare_stage2(void *pgt_pool_base);
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index c76d7136ed9b..c9a8f535212e 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -24,8 +24,6 @@ int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
-int __pkvm_create_mappings(unsigned long start, unsigned long size,
- unsigned long phys, enum kvm_pgtable_prot prot);
unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 1632f001f4ed..f05ecbd382d0 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -140,14 +140,12 @@ static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
}
-static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
+static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt)
{
- DECLARE_REG(unsigned long, start, host_ctxt, 1);
- DECLARE_REG(unsigned long, size, host_ctxt, 2);
- DECLARE_REG(unsigned long, phys, host_ctxt, 3);
- DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
+ DECLARE_REG(phys_addr_t, start, host_ctxt, 1);
+ DECLARE_REG(phys_addr_t, end, host_ctxt, 2);
- cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
+ cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(start, end);
}
static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
@@ -193,7 +191,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__vgic_v3_restore_aprs),
HANDLE_FUNC(__pkvm_init),
HANDLE_FUNC(__pkvm_cpu_set_vector),
- HANDLE_FUNC(__pkvm_create_mappings),
+ HANDLE_FUNC(__pkvm_host_share_hyp),
HANDLE_FUNC(__pkvm_create_private_mapping),
HANDLE_FUNC(__pkvm_prot_finalize),
HANDLE_FUNC(__pkvm_mark_hyp),
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 6f28edf58407..20b3cb3fdc67 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -262,6 +262,111 @@ static int host_stage2_idmap(u64 addr)
return ret;
}
+static int hyp_range_is_shared_walker(u64 addr, u64 end, u32 level,
+ kvm_pte_t *ptep,
+ enum kvm_pgtable_walk_flags flag,
+ void * const arg)
+{
+ enum kvm_pgtable_prot prot;
+ kvm_pte_t pte = *ptep;
+
+ if (!kvm_pte_valid(pte))
+ return -EPERM;
+
+ prot = kvm_pgtable_hyp_pte_prot(pte);
+ if (!prot)
+ return -EPERM;
+
+ /* Check that the page has been shared with the hypervisor before */
+ if (prot != (PAGE_HYP | KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED))
+ return -EPERM;
+
+ return 0;
+}
+
+static int hyp_range_is_shared(phys_addr_t start, phys_addr_t end)
+{
+ struct kvm_pgtable_walker walker = {
+ .cb = hyp_range_is_shared_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+
+ return kvm_pgtable_walk(&pkvm_pgtable, (u64)__hyp_va(start),
+ end - start, &walker);
+}
+
+static int check_host_share_hyp_walker(u64 addr, u64 end, u32 level,
+ kvm_pte_t *ptep,
+ enum kvm_pgtable_walk_flags flag,
+ void * const arg)
+{
+ enum kvm_pgtable_prot prot;
+ kvm_pte_t pte = *ptep;
+
+ /* If invalid, only allow to share pristine pages */
+ if (!kvm_pte_valid(pte))
+ return pte ? -EPERM : 0;
+
+ prot = kvm_pgtable_stage2_pte_prot(pte);
+ if (!prot)
+ return -EPERM;
+
+ /* Cannot share a page that is not owned */
+ if (prot & KVM_PGTABLE_STATE_BORROWED)
+ return -EPERM;
+
+ /* Cannot share a page with restricted access */
+ if ((prot & KVM_PGTABLE_PROT_RWX) ^ KVM_PGTABLE_PROT_RWX)
+ return -EPERM;
+
+ /* Allow double-sharing (requires cross-checking the hyp stage-1) */
+ if (prot & KVM_PGTABLE_STATE_SHARED)
+ return hyp_range_is_shared(addr, addr + 1);
+
+ return 0;
+}
+
+static int check_host_share_hyp(phys_addr_t start, phys_addr_t end)
+{
+ struct kvm_pgtable_walker walker = {
+ .cb = check_host_share_hyp_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+
+ return kvm_pgtable_walk(&host_kvm.pgt, start, end - start, &walker);
+}
+
+int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end)
+{
+ enum kvm_pgtable_prot prot;
+ int ret;
+
+ if (!range_is_memory(start, end))
+ return -EINVAL;
+
+ hyp_spin_lock(&host_kvm.lock);
+ hyp_spin_lock(&pkvm_pgd_lock);
+
+ ret = check_host_share_hyp(start, end);
+ if (ret)
+ goto unlock;
+
+ prot = KVM_PGTABLE_PROT_RWX | KVM_PGTABLE_STATE_SHARED;
+ ret = host_stage2_idmap_locked(start, end, prot);
+ if (ret && ret != -EAGAIN)
+ goto unlock;
+
+ prot = PAGE_HYP | KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED;
+ ret = pkvm_create_mappings_locked(__hyp_va(start), __hyp_va(end), prot);
+ /* XXX - undo host stage-2 changes if ret != 0 */
+
+unlock:
+ hyp_spin_unlock(&pkvm_pgd_lock);
+ hyp_spin_unlock(&host_kvm.lock);
+
+ return ret;
+}
+
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
{
int ret;
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index dde22e2a322a..95f6c34a38ec 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -23,8 +23,8 @@ u64 __io_map_base;
struct memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
unsigned int hyp_memblock_nr;
-int __pkvm_create_mappings(unsigned long start, unsigned long size,
- unsigned long phys, enum kvm_pgtable_prot prot)
+static int __pkvm_create_mappings(unsigned long start, unsigned long size,
+ unsigned long phys, enum kvm_pgtable_prot prot)
{
int err;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0625bf2353c2..2158d1e00acd 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -259,10 +259,8 @@ static int __create_hyp_mappings(unsigned long start, unsigned long size,
{
int err;
- if (!kvm_host_owns_hyp_mappings()) {
- return kvm_call_hyp_nvhe(__pkvm_create_mappings,
- start, size, phys, prot);
- }
+ if (WARN_ON(!kvm_host_owns_hyp_mappings()))
+ return -EINVAL;
mutex_lock(&kvm_hyp_pgd_mutex);
err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
@@ -302,6 +300,14 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
if (is_kernel_in_hyp_mode())
return 0;
+ if (!kvm_host_owns_hyp_mappings()) {
+ if (WARN_ON(prot != PAGE_HYP))
+ return -EPERM;
+ return kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+ kvm_kaddr_to_phys(from),
+ kvm_kaddr_to_phys(to));
+ }
+
start = start & PAGE_MASK;
end = PAGE_ALIGN(end);
--
2.32.0.402.g57bb445576-goog
As the hypervisor maps the host's .bss and .rodata sections in its
stage-1, make sure to tag them as shared in hyp and host page-tables.
But since the hypervisor relies on the presence of these mappings, we
cannot leave the host in complete control of the memory regions -- for
example, it must not unshare them or donate them to another entity. To
prevent this, let's transfer the ownership of those ranges to the
hypervisor itself, and share the pages back with the host.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 7 ++-
arch/arm64/kvm/hyp/nvhe/setup.c | 52 ++++++++++++++++---
3 files changed, 51 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 9c227d87c36d..b39047463075 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -23,6 +23,7 @@ extern struct host_kvm host_kvm;
int __pkvm_prot_finalize(void);
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
+int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
int kvm_host_prepare_stage2(void *pgt_pool_base);
void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index cdace80d3e28..6f28edf58407 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -235,6 +235,11 @@ static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot pro
return prot != KVM_PGTABLE_PROT_RW;
}
+int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot)
+{
+ return host_stage2_try(__host_stage2_idmap, start, end, prot);
+}
+
static int host_stage2_idmap(u64 addr)
{
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RW;
@@ -250,7 +255,7 @@ static int host_stage2_idmap(u64 addr)
if (ret)
goto unlock;
- ret = host_stage2_try(__host_stage2_idmap, range.start, range.end, prot);
+ ret = host_stage2_idmap_locked(range.start, range.end, prot);
unlock:
hyp_spin_unlock(&host_kvm.lock);
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 0b574d106519..74dce83a6fad 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -83,10 +83,6 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
if (ret)
return ret;
- ret = pkvm_create_mappings(__start_rodata, __end_rodata, PAGE_HYP_RO);
- if (ret)
- return ret;
-
ret = pkvm_create_mappings(__hyp_rodata_start, __hyp_rodata_end, PAGE_HYP_RO);
if (ret)
return ret;
@@ -95,10 +91,6 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
if (ret)
return ret;
- ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, PAGE_HYP_RO);
- if (ret)
- return ret;
-
ret = pkvm_create_mappings(virt, virt + size, PAGE_HYP);
if (ret)
return ret;
@@ -117,6 +109,25 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
return ret;
}
+ /*
+ * Map the host's .bss and .rodata sections RO in the hypervisor, but
+ * transfer the ownership from the host to the hypervisor itself to
+ * make sure they can't be donated or shared with another entity.
+ *
+ * The ownership transition requires matching changes in the host
+ * stage-2. This will be done later (see finalize_mappings()) once the
+ * hyp_vmemmap is addressable.
+ */
+ ret = pkvm_create_mappings(__start_rodata, __end_rodata,
+ PAGE_HYP_RO | KVM_PGTABLE_STATE_SHARED);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop,
+ PAGE_HYP_RO | KVM_PGTABLE_STATE_SHARED);
+ if (ret)
+ return ret;
+
return 0;
}
@@ -148,6 +159,27 @@ static void hpool_put_page(void *addr)
hyp_put_page(&hpool, addr);
}
+static int finalize_mappings(void)
+{
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX;
+ int ret;
+
+ /*
+ * The host's .bss and .rodata sections are now conceptually owned by
+ * the hypervisor, so mark them as 'borrowed' in the host stage-2. We
+ * can safely use host_stage2_idmap_locked() at this point since the
+ * host stage-2 has not been enabled yet.
+ */
+ prot |= KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED;
+ ret = host_stage2_idmap_locked(__hyp_pa(__start_rodata),
+ __hyp_pa(__end_rodata), prot);
+ if (ret)
+ return ret;
+
+ return host_stage2_idmap_locked(__hyp_pa(__hyp_bss_end),
+ __hyp_pa(__bss_stop), prot);
+}
+
void __noreturn __pkvm_init_finalise(void)
{
struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
@@ -167,6 +199,10 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;
+ ret = finalize_mappings();
+ if (ret)
+ goto out;
+
pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
.zalloc_page = hyp_zalloc_hyp_page,
.phys_to_virt = hyp_phys_to_virt,
--
2.32.0.402.g57bb445576-goog
The KVM pgtable API exposes the kvm_pgtable_walk() function to allow
the definition of walkers outside of pgtable.c. However, it is not easy
to implement any of those walkers without some of the low-level helpers,
such as kvm_pte_valid(). Make it static inline, and move it to the
header file to allow its re-use in other places.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 7 +++++++
arch/arm64/kvm/hyp/pgtable.c | 6 ------
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 1aa49d6aabb7..8240c881ae1e 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -25,6 +25,13 @@ static inline u64 kvm_get_parange(u64 mmfr0)
typedef u64 kvm_pte_t;
+#define KVM_PTE_VALID BIT(0)
+
+static inline bool kvm_pte_valid(kvm_pte_t pte)
+{
+ return pte & KVM_PTE_VALID;
+}
+
/**
* struct kvm_pgtable_mm_ops - Memory management callbacks.
* @zalloc_page: Allocate a single zeroed memory page.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index c7120797404a..e0ae57dca827 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -11,7 +11,6 @@
#include <asm/kvm_pgtable.h>
#include <asm/stage2_pgtable.h>
-#define KVM_PTE_VALID BIT(0)
#define KVM_PTE_TYPE BIT(1)
#define KVM_PTE_TYPE_BLOCK 0
@@ -135,11 +134,6 @@ static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
}
-static bool kvm_pte_valid(kvm_pte_t pte)
-{
- return pte & KVM_PTE_VALID;
-}
-
static bool kvm_pte_table(kvm_pte_t pte, u32 level)
{
if (level == KVM_PGTABLE_MAX_LEVELS - 1)
--
2.32.0.402.g57bb445576-goog
__pkvm_create_private_mapping() allows the host kernel to create
arbitrary mappings in the hypervisor's "private" range. However, this is
only needed early on, and there should be no good reason for the host
to need this past the point where the protected mode static key is set.
Make sure to stub the hypercall past this point to ensure it can't be
used by a malicious host.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index f05ecbd382d0..e1d12f8122a7 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -154,7 +154,10 @@ static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ct
DECLARE_REG(size_t, size, host_ctxt, 2);
DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 3);
- cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
+ if (static_branch_unlikely(&kvm_protected_mode_initialized))
+ cpu_reg(host_ctxt, 1) = -EPERM;
+ else
+ cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
}
static void handle___pkvm_prot_finalize(struct kvm_cpu_context *host_ctxt)
--
2.32.0.402.g57bb445576-goog
On Mon, 19 Jul 2021 11:47:24 +0100,
Quentin Perret <[email protected]> wrote:
>
> The stage-2 map walkers currently return -EAGAIN when re-creating
> identical mappings or only changing access permissions. This allows to
> optimize mapping pages for concurrent (v)CPUs faulting on the same
> page.
>
> While this works as expected when touching one page-table leaf at a
> time, this can lead to difficult situations when mapping larger ranges.
> Indeed, a large map operation can fail in the middle if an existing
> mapping is found in the range, even if it has compatible attributes,
> hence leaving only half of the range mapped.
I'm curious when this can happen. We normally map a single leaf at
a time, and we don't have a way to map multiple leaves at once: we
either use the VMA base size or try to upgrade it to a THP, but the
result is always a single leaf entry. What changed?
> To avoid having to deal with such failures in the caller, don't
> interrupt the map operation when hitting existing PTEs, but make sure to
> still return -EAGAIN so that user_mem_abort() can mark the page dirty
> when needed.
I don't follow you here: if you return -EAGAIN for a writable mapping,
we don't account for the page to be dirty on the assumption that
nothing has been mapped. But if there is a way to map more than a
single entry and to get -EAGAIN at the same time, then we're bound to
lose data on page eviction.
Can you shed some light on this?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, 19 Jul 2021 11:47:26 +0100,
Quentin Perret <[email protected]> wrote:
>
> The nVHE protected mode uses invalid mappings in the host stage-2
> page-table to track the owner of each page in the system. In order to
> allow the usage of ignored bits (a.k.a. software bits) in these
> mappings, move the owner encoding away from the top bits.
But that's exactly what the current situation is allowing: the use of
the SW bits. I am guessing that what you really mean is that you want
to *preserve* the SW bits from an existing mapping and use other bits
when the mapping is invalid?
Thanks,
M.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a60653cbd8e5..a0ac8c2bc174 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -50,7 +50,7 @@
> KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
> KVM_PTE_LEAF_ATTR_HI_S2_XN)
>
> -#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
> +#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
> #define KVM_MAX_OWNER_ID 1
>
> struct kvm_pgtable_walk_data {
> --
> 2.32.0.402.g57bb445576-goog
>
>
--
Without deviation from the norm, progress is not possible.
On Monday 19 Jul 2021 at 13:14:48 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 11:47:24 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > The stage-2 map walkers currently return -EAGAIN when re-creating
> > identical mappings or only changing access permissions. This allows to
> > optimize mapping pages for concurrent (v)CPUs faulting on the same
> > page.
> >
> > While this works as expected when touching one page-table leaf at a
> > time, this can lead to difficult situations when mapping larger ranges.
> > Indeed, a large map operation can fail in the middle if an existing
> > mapping is found in the range, even if it has compatible attributes,
> > hence leaving only half of the range mapped.
>
> I'm curious of when this can happen. We normally map a single leaf at
> a time, and we don't have a way to map multiple leaves at once: we
> either use the VMA base size or try to upgrade it to a THP, but the
> result is always a single leaf entry. What changed?
Nothing _yet_ :-)
The 'share' hypercall introduced near the end of the series allows to
share multiple physically contiguous pages in one go -- this is mostly
to allow sharing data-structures that are larger than a page.
So if one of the pages happens to be already mapped by the time the
hypercall is issued, mapping the range with the right SW bits becomes
difficult as kvm_pgtable_stage2_map() will fail halfway through, which
is tricky to handle.
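To make that concrete, the hypercall path ends up doing roughly the
following (just a sketch -- the SHARED annotation is the one introduced
later in the series, and the memcache argument is elided):
	/* Annotate a physically contiguous shared range in one go. */
	ret = kvm_pgtable_stage2_map(&host_kvm.pgt, start, size, start,
				     KVM_PGTABLE_PROT_RWX | KVM_PGTABLE_STATE_SHARED,
				     mc);
	/*
	 * Without this patch, hitting a pre-existing mapping anywhere in
	 * [start, start + size) makes the call bail out with -EAGAIN halfway
	 * through, leaving the range only partially annotated and forcing
	 * the caller to unwind by hand.
	 */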
This patch shouldn't change anything for existing users that only map
things that are nicely aligned at block/page granularity, but should
make the life of new users easier, so that seemed like a win.
> > To avoid having to deal with such failures in the caller, don't
> > interrupt the map operation when hitting existing PTEs, but make sure to
> > still return -EAGAIN so that user_mem_abort() can mark the page dirty
> > when needed.
>
> I don't follow you here: if you return -EAGAIN for a writable mapping,
> we don't account for the page to be dirty on the assumption that
> nothing has been mapped. But if there is a way to map more than a
> single entry and to get -EAGAIN at the same time, then we're bound to
> lose data on page eviction.
>
> Can you shed some light on this?
Sure. For guests, hitting the -EAGAIN case means we've lost the race
with another vCPU that faulted the same page. In this case the other
vCPU either mapped the page RO, which means that our vCPU will then get
a permission fault next time we run it which will lead to the page being
marked dirty, or the other vCPU mapped the page RW in which case it
already marked the page dirty for us and we can safely re-enter the
guest without doing anything else.
So what I meant by "still return -EAGAIN so that user_mem_abort() can
mark the page dirty when needed" is "make sure to mark the page dirty
only when necessary: if winning the race and marking the page RW, or
in the permission fault path". That is, by keeping the -EAGAIN I want to
make sure we don't mark the page dirty twice. (This might be fine, but this
would be new behaviour, and it was not clear that would scale well to
many vCPUs faulting the same page).
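Put differently, the caller-side logic I'm trying to preserve looks
roughly like this (simplified sketch, not the actual user_mem_abort()
code):
	ret = kvm_pgtable_stage2_map(pgt, fault_ipa, PAGE_SIZE,
				     pfn << PAGE_SHIFT, prot, memcache);
	/* Mark the page dirty only if we won the race and mapped it RW. */
	if (!ret && writable)
		mark_page_dirty(kvm, gfn);
	/*
	 * ret == -EAGAIN: another vCPU got there first. Either it mapped the
	 * page RW and already marked it dirty, or it mapped it RO and we'll
	 * mark it dirty from the permission fault path when this vCPU writes
	 * to the page.
	 */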
I see how this wording can be highly confusing though, I'll re-word it
for the next version.
Cheers,
Quentin
On Monday 19 Jul 2021 at 13:55:29 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 11:47:26 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > The nVHE protected mode uses invalid mappings in the host stage-2
> > page-table to track the owner of each page in the system. In order to
> > allow the usage of ignored bits (a.k.a. software bits) in these
> > mappings, move the owner encoding away from the top bits.
>
> But that's exactly what the current situation is allowing: the use of
> the SW bits. I am guessing that what you really mean is that you want
> to *preserve* the SW bits from an existing mapping and use other bits
> when the mapping is invalid?
Yes, this is really just forward looking, but I think it might be useful
to allow annotating invalid mappings with both an owner id _and_
additional flags for e.g. shared pages and such. And using bits [58-55]
to store those flags just like we do for valid mappings should make
things easier, but no big deal.
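To illustrate, an annotated invalid PTE could then be built along these
lines (rough sketch; only the owner mask below comes from this patch,
the flags mask and helper are hypothetical):
#include <linux/bits.h>
#include <linux/bitfield.h>
#define KVM_INVALID_PTE_OWNER_MASK	GENMASK(9, 2)	/* from this patch */
#define KVM_INVALID_PTE_FLAGS_MASK	GENMASK(58, 55)	/* hypothetical: mirrors the SW bits of valid PTEs */
static kvm_pte_t kvm_init_invalid_annotated_leaf(u8 owner_id, u8 flags)
{
	return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id) |
	       FIELD_PREP(KVM_INVALID_PTE_FLAGS_MASK, flags);
}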
I see how this is going to conflict with kvm_pgtable_stage2_annotate()
from your series though, so maybe I should just drop this patch and
leave the encoding 'issue' to the caller -- the rest of the series
doesn't depend on this anyway, this was just a small cleanup.
Thanks,
Quentin
On Mon, 19 Jul 2021 11:47:28 +0100,
Quentin Perret <[email protected]> wrote:
>
> Much of the stage-2 manipulation logic relies on being able to destroy
> block mappings if e.g. installing a smaller mapping in the range. The
> rationale for this behaviour is that stage-2 mappings can always be
> re-created lazily. However, this gets more complicated when the stage-2
> page-table is used to store metadata about the underlying pages. In such
> a case, destroying a block mapping may lead to losing part of the
> state, and confuse the user of those metadata (such as the hypervisor in
> nVHE protected mode).
>
> To fix this, introduce a callback function in the pgtable struct which
> is called during all map operations to determine whether the mappings
> can us blocks, or should be forced to page-granularity level. This is
nit: use?
> used by the hypervisor when creating the host stage-2 to force
> page-level mappings when using non-default protection attributes.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 63 +++++++++++++++++----------
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 16 +++++--
> arch/arm64/kvm/hyp/pgtable.c | 20 +++++++--
> 3 files changed, 69 insertions(+), 30 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index af62203d2f7a..dd72653314c7 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -75,25 +75,6 @@ enum kvm_pgtable_stage2_flags {
> KVM_PGTABLE_S2_IDMAP = BIT(1),
> };
>
> -/**
> - * struct kvm_pgtable - KVM page-table.
> - * @ia_bits: Maximum input address size, in bits.
> - * @start_level: Level at which the page-table walk starts.
> - * @pgd: Pointer to the first top-level entry of the page-table.
> - * @mm_ops: Memory management callbacks.
> - * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
> - */
> -struct kvm_pgtable {
> - u32 ia_bits;
> - u32 start_level;
> - kvm_pte_t *pgd;
> - struct kvm_pgtable_mm_ops *mm_ops;
> -
> - /* Stage-2 only */
> - struct kvm_s2_mmu *mmu;
> - enum kvm_pgtable_stage2_flags flags;
> -};
> -
> /**
> * enum kvm_pgtable_prot - Page-table permissions and attributes.
> * @KVM_PGTABLE_PROT_X: Execute permission.
> @@ -109,11 +90,41 @@ enum kvm_pgtable_prot {
> KVM_PGTABLE_PROT_DEVICE = BIT(3),
> };
>
> -#define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> +#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> +#define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
> +
> +#define PAGE_HYP KVM_PGTABLE_PROT_RW
> #define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
> #define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
> #define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
>
> +typedef bool (*kvm_pgtable_want_pte_cb_t)(u64 addr, u64 end,
> + enum kvm_pgtable_prot prot);
> +
> +/**
> + * struct kvm_pgtable - KVM page-table.
> + * @ia_bits: Maximum input address size, in bits.
> + * @start_level: Level at which the page-table walk starts.
> + * @pgd: Pointer to the first top-level entry of the page-table.
> + * @mm_ops: Memory management callbacks.
> + * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
> + * @flags: Stage-2 page-table flags.
> + * @want_pte_cb: Callback function used during map operations to decide
> + * whether block mappings can be used to map the given IPA
> + * range.
> + */
> +struct kvm_pgtable {
> + u32 ia_bits;
> + u32 start_level;
> + kvm_pte_t *pgd;
> + struct kvm_pgtable_mm_ops *mm_ops;
> +
> + /* Stage-2 only */
> + struct kvm_s2_mmu *mmu;
> + enum kvm_pgtable_stage2_flags flags;
> + kvm_pgtable_want_pte_cb_t want_pte_cb;
> +};
nit: does this whole definition really need to move around?
> +
> /**
> * struct kvm_mem_range - Range of Intermediate Physical Addresses
> * @start: Start of the range.
> @@ -216,21 +227,25 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
> u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
>
> /**
> - * kvm_pgtable_stage2_init_flags() - Initialise a guest stage-2 page-table.
> + * kvm_pgtable_stage2_init_full() - Initialise a guest stage-2 page-table.
> * @pgt: Uninitialised page-table structure to initialise.
> * @arch: Arch-specific KVM structure representing the guest virtual
> * machine.
> * @mm_ops: Memory management callbacks.
> * @flags: Stage-2 configuration flags.
> + * @want_pte_cb: Callback function used during map operations to decide
> + * whether block mappings can be used to map the given IPA
> + * range.
> *
> * Return: 0 on success, negative error code on failure.
> */
> -int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> +int kvm_pgtable_stage2_init_full(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> struct kvm_pgtable_mm_ops *mm_ops,
> - enum kvm_pgtable_stage2_flags flags);
> + enum kvm_pgtable_stage2_flags flags,
> + kvm_pgtable_want_pte_cb_t want_pte_cb);
>
> #define kvm_pgtable_stage2_init(pgt, arch, mm_ops) \
> - kvm_pgtable_stage2_init_flags(pgt, arch, mm_ops, 0)
> + kvm_pgtable_stage2_init_full(pgt, arch, mm_ops, 0, NULL)
nit: in general, we use __foo() as the primitive for foo(), rather
than foo_with_icing_on_top().
>
> /**
> * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 58edc62be6f7..cdace80d3e28 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -89,6 +89,7 @@ static void prepare_host_vtcr(void)
> id_aa64mmfr1_el1_sys_val, phys_shift);
> }
>
> +static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot);
> int kvm_host_prepare_stage2(void *pgt_pool_base)
> {
> struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
> @@ -101,8 +102,9 @@ int kvm_host_prepare_stage2(void *pgt_pool_base)
> if (ret)
> return ret;
>
> - ret = kvm_pgtable_stage2_init_flags(&host_kvm.pgt, &host_kvm.arch,
> - &host_kvm.mm_ops, KVM_HOST_S2_FLAGS);
> + ret = kvm_pgtable_stage2_init_full(&host_kvm.pgt, &host_kvm.arch,
> + &host_kvm.mm_ops, KVM_HOST_S2_FLAGS,
> + host_stage2_want_pte_cb);
> if (ret)
> return ret;
>
> @@ -225,9 +227,17 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
> __ret; \
> })
>
> +static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot)
> +{
> + if (range_is_memory(addr, end))
> + return prot != KVM_PGTABLE_PROT_RWX;
> + else
> + return prot != KVM_PGTABLE_PROT_RW;
> +}
This really deserves a comment about *why* you make such a decision. I
also find it a bit odd that you use the permissions to decide whether
to map a block or not. It feels like the permission is more of a
side effect than anything else.
> +
> static int host_stage2_idmap(u64 addr)
> {
> - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RW;
> struct kvm_mem_range range;
> bool is_memory = find_mem_range(addr, &range);
> int ret;
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 34cf67997a82..5bdbe7a31551 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -452,6 +452,8 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
> pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels;
> pgt->mm_ops = mm_ops;
> pgt->mmu = NULL;
> + pgt->want_pte_cb = NULL;
> +
> return 0;
> }
>
> @@ -491,6 +493,7 @@ struct stage2_map_data {
> struct kvm_pgtable_mm_ops *mm_ops;
>
> int ret;
> + bool force_pte;
OK, so you have *two* mechanisms here: one to decide if a range can
be mapped as a block or not, and another one to remember the result
while walking the S2 PTW. This probably deserves some documentation
and/or patch splitting.
> };
>
> u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> @@ -613,6 +616,9 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> struct kvm_pgtable *pgt = data->mmu->pgt;
> struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
>
> + if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
> + return -E2BIG;
> +
> if (!kvm_block_mapping_supported(addr, end, phys, level))
> return -E2BIG;
>
> @@ -660,6 +666,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> if (data->anchor)
> return 0;
>
> + if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
> + return 0;
> +
> if (!kvm_block_mapping_supported(addr, end, data->phys, level))
There is something in me screaming that kvm_block_mapping_supported()
should be the point where we check for these things... Or at least a
helper function that takes 'data' as a parameter.
> return 0;
>
> @@ -791,6 +800,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> .memcache = mc,
> .mm_ops = pgt->mm_ops,
> .ret = 0,
> + .force_pte = pgt->want_pte_cb && pgt->want_pte_cb(addr, addr + size, prot),
Reading this makes me want to rename want_pte_cb() to force_pte_cb()...
> };
> struct kvm_pgtable_walker walker = {
> .cb = stage2_map_walker,
> @@ -826,6 +836,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> .mm_ops = pgt->mm_ops,
> .owner_id = owner_id,
> .ret = 0,
> + .force_pte = true,
> };
> struct kvm_pgtable_walker walker = {
> .cb = stage2_map_walker,
> @@ -1070,9 +1081,11 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
> return kvm_pgtable_walk(pgt, addr, size, &walker);
> }
>
> -int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> - struct kvm_pgtable_mm_ops *mm_ops,
> - enum kvm_pgtable_stage2_flags flags)
> +
> +int kvm_pgtable_stage2_init_full(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> + struct kvm_pgtable_mm_ops *mm_ops,
> + enum kvm_pgtable_stage2_flags flags,
> + kvm_pgtable_want_pte_cb_t want_pte_cb)
> {
> size_t pgd_sz;
> u64 vtcr = arch->vtcr;
> @@ -1090,6 +1103,7 @@ int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch
> pgt->mm_ops = mm_ops;
> pgt->mmu = &arch->mmu;
> pgt->flags = flags;
> + pgt->want_pte_cb = want_pte_cb;
>
> /* Ensure zeroed PGD pages are visible to the hardware walker */
> dsb(ishst);
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, 19 Jul 2021 11:47:29 +0100,
Quentin Perret <[email protected]> wrote:
>
> The hypervisor will soon be in charge of tracking ownership of all
> memory pages in the system. The current page-tracking infrastructure at
> EL2 only allows binary states: a page is either owned or not by an
> entity. But a number of use-cases will require more complex states for
> pages that are shared between two entities (host, hypervisor, or guests).
>
> In preparation for supporting these use-cases, introduce in the KVM
> page-table library some infrastructure allowing to tag shared pages
> using ignored bits (a.k.a. software bits) in PTEs.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 5 +++++
> arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++++++++
> 2 files changed, 30 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index dd72653314c7..f6d3d5c8910d 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -81,6 +81,8 @@ enum kvm_pgtable_stage2_flags {
> * @KVM_PGTABLE_PROT_W: Write permission.
> * @KVM_PGTABLE_PROT_R: Read permission.
> * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
> + * @KVM_PGTABLE_STATE_SHARED: Page shared with another entity.
> + * @KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity.
> */
> enum kvm_pgtable_prot {
> KVM_PGTABLE_PROT_X = BIT(0),
> @@ -88,6 +90,9 @@ enum kvm_pgtable_prot {
> KVM_PGTABLE_PROT_R = BIT(2),
>
> KVM_PGTABLE_PROT_DEVICE = BIT(3),
> +
> + KVM_PGTABLE_STATE_SHARED = BIT(4),
> + KVM_PGTABLE_STATE_BORROWED = BIT(5),
I'd rather have some indirection here, as we have other potential
users for the SW bits outside of pKVM (see the NV series, which uses
some of these SW bits as the backend for TTL-based TLB invalidation).
Can we instead only describe the SW bit states in this enum, and let
the users map the semantic they require onto that state? See [1] for
what I carry in the NV branch.
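Roughly, something along these lines (purely illustrative, the actual
names and encoding are in [1]):
	/* The library only names the SW bits... */
	KVM_PGTABLE_PROT_SW0		= BIT(4),
	KVM_PGTABLE_PROT_SW1		= BIT(5),
	/* ...and pKVM maps its own semantics onto them, e.g.: */
	#define KVM_PGTABLE_STATE_SHARED	KVM_PGTABLE_PROT_SW0
	#define KVM_PGTABLE_STATE_BORROWED	KVM_PGTABLE_PROT_SW1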
> };
>
> #define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 5bdbe7a31551..51598b79dafc 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -211,6 +211,29 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
> return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
> }
>
> +static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
Can we call these sw rather than ignored?
> +{
> + kvm_pte_t ignored_bits = 0;
> +
> + /*
> + * Ignored bits 0 and 1 are reserved to track the memory ownership
> + * state of each page:
> + * 00: The page is owned solely by the page-table owner.
> + * 01: The page is owned by the page-table owner, but is shared
> + * with another entity.
> + * 10: The page is shared with, but not owned by the page-table owner.
> + * 11: Reserved for future use (lending).
> + */
> + if (prot & KVM_PGTABLE_STATE_SHARED) {
> + if (prot & KVM_PGTABLE_STATE_BORROWED)
> + ignored_bits |= BIT(1);
> + else
> + ignored_bits |= BIT(0);
> + }
> +
> + return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
> +}
> +
> static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> u32 level, kvm_pte_t *ptep,
> enum kvm_pgtable_walk_flags flag)
> @@ -357,6 +380,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
> attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
> attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
> + attr |= pte_ignored_bit_prot(prot);
> *ptep = attr;
>
> return 0;
> @@ -558,6 +582,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>
> attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
> attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
> + attr |= pte_ignored_bit_prot(prot);
> *ptep = attr;
>
> return 0;
How about kvm_pgtable_stage2_relax_perms()?
Thanks,
M.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=kvm-arm64/nv-5.13&id=5dea6d82de76cfcda59818ec2532fc34c615db39
--
Without deviation from the norm, progress is not possible.
On Mon, 19 Jul 2021 11:47:30 +0100,
Quentin Perret <[email protected]> wrote:
>
> As the hypervisor maps the host's .bss and .rodata sections in its
> stage-1, make sure to tag them as shared in hyp and host page-tables.
>
> But since the hypervisor relies on the presence of these mappings, we
> cannot let the host in complete control of the memory regions -- it
> must not unshare or donate them to another entity for example. To
> prevent this, let's transfer the ownership of those ranges to the
> hypervisor itself, and share the page back with the host.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 7 ++-
> arch/arm64/kvm/hyp/nvhe/setup.c | 52 ++++++++++++++++---
> 3 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 9c227d87c36d..b39047463075 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -23,6 +23,7 @@ extern struct host_kvm host_kvm;
> int __pkvm_prot_finalize(void);
> int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
>
> +int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
> int kvm_host_prepare_stage2(void *pgt_pool_base);
> void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index cdace80d3e28..6f28edf58407 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -235,6 +235,11 @@ static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot pro
> return prot != KVM_PGTABLE_PROT_RW;
> }
>
> +int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot)
> +{
> + return host_stage2_try(__host_stage2_idmap, start, end, prot);
> +}
> +
> static int host_stage2_idmap(u64 addr)
> {
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RW;
> @@ -250,7 +255,7 @@ static int host_stage2_idmap(u64 addr)
> if (ret)
> goto unlock;
>
> - ret = host_stage2_try(__host_stage2_idmap, range.start, range.end, prot);
> + ret = host_stage2_idmap_locked(range.start, range.end, prot);
> unlock:
> hyp_spin_unlock(&host_kvm.lock);
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
> index 0b574d106519..74dce83a6fad 100644
> --- a/arch/arm64/kvm/hyp/nvhe/setup.c
> +++ b/arch/arm64/kvm/hyp/nvhe/setup.c
> @@ -83,10 +83,6 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> if (ret)
> return ret;
>
> - ret = pkvm_create_mappings(__start_rodata, __end_rodata, PAGE_HYP_RO);
> - if (ret)
> - return ret;
> -
> ret = pkvm_create_mappings(__hyp_rodata_start, __hyp_rodata_end, PAGE_HYP_RO);
> if (ret)
> return ret;
> @@ -95,10 +91,6 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> if (ret)
> return ret;
>
> - ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, PAGE_HYP_RO);
> - if (ret)
> - return ret;
> -
> ret = pkvm_create_mappings(virt, virt + size, PAGE_HYP);
> if (ret)
> return ret;
> @@ -117,6 +109,25 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> return ret;
> }
>
> + /*
> + * Map the host's .bss and .rodata sections RO in the hypervisor, but
> + * transfer the ownership from the host to the hypervisor itself to
> + * make sure it can't be donated or shared with another entity.
> + *
> + * The ownership transition requires matching changes in the host
> + * stage-2. This will be done later (see finalize_mappings()) once the
> + * hyp_vmemmap is addressable.
> + */
> + ret = pkvm_create_mappings(__start_rodata, __end_rodata,
> + PAGE_HYP_RO | KVM_PGTABLE_STATE_SHARED);
> + if (ret)
> + return ret;
> +
> + ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop,
> + PAGE_HYP_RO | KVM_PGTABLE_STATE_SHARED);
> + if (ret)
> + return ret;
> +
> return 0;
> }
>
> @@ -148,6 +159,27 @@ static void hpool_put_page(void *addr)
> hyp_put_page(&hpool, addr);
> }
>
> +static int finalize_mappings(void)
> +{
> + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX;
> + int ret;
> +
> + /*
> + * The host's .bss and .rodata sections are now conceptually owned by
> + * the hypervisor, so mark them as 'borrowed' in the host stage-2. We
> + * can safely use host_stage2_idmap_locked() at this point since the
> + * host stage-2 has not been enabled yet.
> + */
> + prot |= KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED;
> + ret = host_stage2_idmap_locked(__hyp_pa(__start_rodata),
> + __hyp_pa(__end_rodata), prot);
Do we really want to map the rodata section as RWX?
> + if (ret)
> + return ret;
> +
> + return host_stage2_idmap_locked(__hyp_pa(__hyp_bss_end),
> + __hyp_pa(__bss_stop), prot);
If the 'locked' state implies SHARED+BORROWED, maybe consider moving
the ORRing of the prot into host_stage2_idmap_locked()?
> +}
> +
> void __noreturn __pkvm_init_finalise(void)
> {
> struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> @@ -167,6 +199,10 @@ void __noreturn __pkvm_init_finalise(void)
> if (ret)
> goto out;
>
> + ret = finalize_mappings();
> + if (ret)
> + goto out;
> +
> pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
> .zalloc_page = hyp_zalloc_hyp_page,
> .phys_to_virt = hyp_phys_to_virt,
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Monday 19 Jul 2021 at 15:24:32 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 11:47:28 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > Much of the stage-2 manipulation logic relies on being able to destroy
> > block mappings if e.g. installing a smaller mapping in the range. The
> > rationale for this behaviour is that stage-2 mappings can always be
> > re-created lazily. However, this gets more complicated when the stage-2
> > page-table is used to store metadata about the underlying pages. In such
> > a case, destroying a block mapping may lead to losing part of the
> > state, and confuse the user of those metadata (such as the hypervisor in
> > nVHE protected mode).
> >
> > To fix this, introduce a callback function in the pgtable struct which
> > is called during all map operations to determine whether the mappings
> > can us blocks, or should be forced to page-granularity level. This is
>
> nit: use?
Ack.
> > used by the hypervisor when creating the host stage-2 to force
> > page-level mappings when using non-default protection attributes.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 63 +++++++++++++++++----------
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 16 +++++--
> > arch/arm64/kvm/hyp/pgtable.c | 20 +++++++--
> > 3 files changed, 69 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index af62203d2f7a..dd72653314c7 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -75,25 +75,6 @@ enum kvm_pgtable_stage2_flags {
> > KVM_PGTABLE_S2_IDMAP = BIT(1),
> > };
> >
> > -/**
> > - * struct kvm_pgtable - KVM page-table.
> > - * @ia_bits: Maximum input address size, in bits.
> > - * @start_level: Level at which the page-table walk starts.
> > - * @pgd: Pointer to the first top-level entry of the page-table.
> > - * @mm_ops: Memory management callbacks.
> > - * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
> > - */
> > -struct kvm_pgtable {
> > - u32 ia_bits;
> > - u32 start_level;
> > - kvm_pte_t *pgd;
> > - struct kvm_pgtable_mm_ops *mm_ops;
> > -
> > - /* Stage-2 only */
> > - struct kvm_s2_mmu *mmu;
> > - enum kvm_pgtable_stage2_flags flags;
> > -};
> > -
> > /**
> > * enum kvm_pgtable_prot - Page-table permissions and attributes.
> > * @KVM_PGTABLE_PROT_X: Execute permission.
> > @@ -109,11 +90,41 @@ enum kvm_pgtable_prot {
> > KVM_PGTABLE_PROT_DEVICE = BIT(3),
> > };
> >
> > -#define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> > +#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> > +#define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
> > +
> > +#define PAGE_HYP KVM_PGTABLE_PROT_RW
> > #define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
> > #define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
> > #define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
> >
> > +typedef bool (*kvm_pgtable_want_pte_cb_t)(u64 addr, u64 end,
> > + enum kvm_pgtable_prot prot);
> > +
> > +/**
> > + * struct kvm_pgtable - KVM page-table.
> > + * @ia_bits: Maximum input address size, in bits.
> > + * @start_level: Level at which the page-table walk starts.
> > + * @pgd: Pointer to the first top-level entry of the page-table.
> > + * @mm_ops: Memory management callbacks.
> > + * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
> > + * @flags: Stage-2 page-table flags.
> > + * @want_pte_cb: Callback function used during map operations to decide
> > + * whether block mappings can be used to map the given IPA
> > + * range.
> > + */
> > +struct kvm_pgtable {
> > + u32 ia_bits;
> > + u32 start_level;
> > + kvm_pte_t *pgd;
> > + struct kvm_pgtable_mm_ops *mm_ops;
> > +
> > + /* Stage-2 only */
> > + struct kvm_s2_mmu *mmu;
> > + enum kvm_pgtable_stage2_flags flags;
> > + kvm_pgtable_want_pte_cb_t want_pte_cb;
> > +};
>
> nit: does this whole definition really need to move around?
The alternative is to move (or forward declare) enum kvm_pgtable_prot
higher up in the file, but I have no strong opinion, so whatever you
prefer will work for me.
> > +
> > /**
> > * struct kvm_mem_range - Range of Intermediate Physical Addresses
> > * @start: Start of the range.
> > @@ -216,21 +227,25 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
> > u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
> >
> > /**
> > - * kvm_pgtable_stage2_init_flags() - Initialise a guest stage-2 page-table.
> > + * kvm_pgtable_stage2_init_full() - Initialise a guest stage-2 page-table.
> > * @pgt: Uninitialised page-table structure to initialise.
> > * @arch: Arch-specific KVM structure representing the guest virtual
> > * machine.
> > * @mm_ops: Memory management callbacks.
> > * @flags: Stage-2 configuration flags.
> > + * @want_pte_cb: Callback function used during map operations to decide
> > + * whether block mappings can be used to map the given IPA
> > + * range.
> > *
> > * Return: 0 on success, negative error code on failure.
> > */
> > -int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> > +int kvm_pgtable_stage2_init_full(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> > struct kvm_pgtable_mm_ops *mm_ops,
> > - enum kvm_pgtable_stage2_flags flags);
> > + enum kvm_pgtable_stage2_flags flags,
> > + kvm_pgtable_want_pte_cb_t want_pte_cb);
> >
> > #define kvm_pgtable_stage2_init(pgt, arch, mm_ops) \
> > - kvm_pgtable_stage2_init_flags(pgt, arch, mm_ops, 0)
> > + kvm_pgtable_stage2_init_full(pgt, arch, mm_ops, 0, NULL)
>
> nit: in general, we use __foo() as the primitive for foo(), rather
> than foo_with_icing_on_top().
Sure.
> >
> > /**
> > * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 58edc62be6f7..cdace80d3e28 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -89,6 +89,7 @@ static void prepare_host_vtcr(void)
> > id_aa64mmfr1_el1_sys_val, phys_shift);
> > }
> >
> > +static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot);
> > int kvm_host_prepare_stage2(void *pgt_pool_base)
> > {
> > struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
> > @@ -101,8 +102,9 @@ int kvm_host_prepare_stage2(void *pgt_pool_base)
> > if (ret)
> > return ret;
> >
> > - ret = kvm_pgtable_stage2_init_flags(&host_kvm.pgt, &host_kvm.arch,
> > - &host_kvm.mm_ops, KVM_HOST_S2_FLAGS);
> > + ret = kvm_pgtable_stage2_init_full(&host_kvm.pgt, &host_kvm.arch,
> > + &host_kvm.mm_ops, KVM_HOST_S2_FLAGS,
> > + host_stage2_want_pte_cb);
> > if (ret)
> > return ret;
> >
> > @@ -225,9 +227,17 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
> > __ret; \
> > })
> >
> > +static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot)
> > +{
> > + if (range_is_memory(addr, end))
> > + return prot != KVM_PGTABLE_PROT_RWX;
> > + else
> > + return prot != KVM_PGTABLE_PROT_RW;
> > +}
>
> This really deserves a comment about *why* you make such a decision. I
> also find it a bit odd that you use the permissions to decide whether
> to map a block or not. It feels like the permission is more of a
> side effect than anything else.
The idea is to use page-level mappings for anything that we can't
rebuild lazily in the host stage-2. So the logic in this function
matches exactly what we do in host_stage2_idmap() just below.
And the protection does matter sadly. If for instance we map a large
portion of the host RO, and we use a block mapping for that, then any
subsequent map() call in the block range is likely to have very
undesirable side effects -- we'll forget that the rest of the block
range should be RO and we risk mapping it RW(X) in the mem abort path.
But yes, a comment is very much needed, I'll add something.
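For instance, something along these lines (sketch only, final wording
may differ):
static bool host_stage2_want_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot)
{
	/*
	 * Block mappings must only be used for ranges the host can rebuild
	 * lazily with the default protection in the mem abort path, i.e.
	 * RWX for memory and RW for MMIO. Anything mapped with non-default
	 * attributes (e.g. RO or shared annotations) must use page-level
	 * mappings, otherwise destroying the block would lose that state
	 * for the rest of the range.
	 */
	if (range_is_memory(addr, end))
		return prot != KVM_PGTABLE_PROT_RWX;
	else
		return prot != KVM_PGTABLE_PROT_RW;
}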
> > static int host_stage2_idmap(u64 addr)
> > {
> > - enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RW;
> > struct kvm_mem_range range;
> > bool is_memory = find_mem_range(addr, &range);
> > int ret;
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 34cf67997a82..5bdbe7a31551 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -452,6 +452,8 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
> > pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels;
> > pgt->mm_ops = mm_ops;
> > pgt->mmu = NULL;
> > + pgt->want_pte_cb = NULL;
> > +
> > return 0;
> > }
> >
> > @@ -491,6 +493,7 @@ struct stage2_map_data {
> > struct kvm_pgtable_mm_ops *mm_ops;
> >
> > int ret;
> > + bool force_pte;
>
> OK, so you have *two* mechanisms here: one to decide if a range can
> be mapped as a block or not, and another one to remember the result
> while walking the S2 PTW. This probably deserves some documentation
> and/or patch splitting.
Sure, I'll add a comment.
> > };
> >
> > u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> > @@ -613,6 +616,9 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> > struct kvm_pgtable *pgt = data->mmu->pgt;
> > struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
> >
> > + if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
> > + return -E2BIG;
> > +
> > if (!kvm_block_mapping_supported(addr, end, phys, level))
> > return -E2BIG;
> >
> > @@ -660,6 +666,9 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
> > if (data->anchor)
> > return 0;
> >
> > + if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
> > + return 0;
> > +
> > if (!kvm_block_mapping_supported(addr, end, data->phys, level))
>
> There is something in me screaming that kvm_block_mapping_supported()
> should be the point where we check for these things... Or at least a
> helper function that takes 'data' as a parameter.
I feel like kvm_block_mapping_supported() might be better as-is as a
purely architectural check that we can also use for e.g. hyp stage-1
stuff, but a new helper function should do.
>
> > return 0;
> >
> > @@ -791,6 +800,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > .memcache = mc,
> > .mm_ops = pgt->mm_ops,
> > .ret = 0,
> > + .force_pte = pgt->want_pte_cb && pgt->want_pte_cb(addr, addr + size, prot),
>
> Reading this makes me want to rename want_pte_cb() to force_pte_cb()...
No objection from me, I'll rename.
> > };
> > struct kvm_pgtable_walker walker = {
> > .cb = stage2_map_walker,
> > @@ -826,6 +836,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > .mm_ops = pgt->mm_ops,
> > .owner_id = owner_id,
> > .ret = 0,
> > + .force_pte = true,
> > };
> > struct kvm_pgtable_walker walker = {
> > .cb = stage2_map_walker,
> > @@ -1070,9 +1081,11 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
> > return kvm_pgtable_walk(pgt, addr, size, &walker);
> > }
> >
> > -int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> > - struct kvm_pgtable_mm_ops *mm_ops,
> > - enum kvm_pgtable_stage2_flags flags)
> > +
> > +int kvm_pgtable_stage2_init_full(struct kvm_pgtable *pgt, struct kvm_arch *arch,
> > + struct kvm_pgtable_mm_ops *mm_ops,
> > + enum kvm_pgtable_stage2_flags flags,
> > + kvm_pgtable_want_pte_cb_t want_pte_cb)
> > {
> > size_t pgd_sz;
> > u64 vtcr = arch->vtcr;
> > @@ -1090,6 +1103,7 @@ int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch
> > pgt->mm_ops = mm_ops;
> > pgt->mmu = &arch->mmu;
> > pgt->flags = flags;
> > + pgt->want_pte_cb = want_pte_cb;
> >
> > /* Ensure zeroed PGD pages are visible to the hardware walker */
> > dsb(ishst);
Cheers,
Quentin
On Monday 19 Jul 2021 at 15:43:34 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 11:47:29 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > The hypervisor will soon be in charge of tracking ownership of all
> > memory pages in the system. The current page-tracking infrastructure at
> > EL2 only allows binary states: a page is either owned or not by an
> > entity. But a number of use-cases will require more complex states for
> > pages that are shared between two entities (host, hypervisor, or guests).
> >
> > In preparation for supporting these use-cases, introduce in the KVM
> > page-table library some infrastructure allowing to tag shared pages
> > using ignored bits (a.k.a. software bits) in PTEs.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_pgtable.h | 5 +++++
> > arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++++++++
> > 2 files changed, 30 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index dd72653314c7..f6d3d5c8910d 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -81,6 +81,8 @@ enum kvm_pgtable_stage2_flags {
> > * @KVM_PGTABLE_PROT_W: Write permission.
> > * @KVM_PGTABLE_PROT_R: Read permission.
> > * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
> > + * @KVM_PGTABLE_STATE_SHARED: Page shared with another entity.
> > + * @KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity.
> > */
> > enum kvm_pgtable_prot {
> > KVM_PGTABLE_PROT_X = BIT(0),
> > @@ -88,6 +90,9 @@ enum kvm_pgtable_prot {
> > KVM_PGTABLE_PROT_R = BIT(2),
> >
> > KVM_PGTABLE_PROT_DEVICE = BIT(3),
> > +
> > + KVM_PGTABLE_STATE_SHARED = BIT(4),
> > + KVM_PGTABLE_STATE_BORROWED = BIT(5),
>
> I'd rather have some indirection here, as we have other potential
> users for the SW bits outside of pKVM (see the NV series, which uses
> some of these SW bits as the backend for TTL-based TLB invalidation).
>
> Can we instead only describe the SW bit states in this enum, and let
> the users map the semantic they require onto that state? See [1] for
> what I carry in the NV branch.
Works for me -- I just wanted to make sure we don't have users in
different places that use the same bits without knowing, but no strong
opinions, so happy to change.
> > };
> >
> > #define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index 5bdbe7a31551..51598b79dafc 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -211,6 +211,29 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
> > return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
> > }
> >
> > +static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
>
> Can we call these sw rather than ignored?
Sure.
> > +{
> > + kvm_pte_t ignored_bits = 0;
> > +
> > + /*
> > + * Ignored bits 0 and 1 are reserved to track the memory ownership
> > + * state of each page:
> > + * 00: The page is owned solely by the page-table owner.
> > + * 01: The page is owned by the page-table owner, but is shared
> > + * with another entity.
> > + * 10: The page is shared with, but not owned by the page-table owner.
> > + * 11: Reserved for future use (lending).
> > + */
> > + if (prot & KVM_PGTABLE_STATE_SHARED) {
> > + if (prot & KVM_PGTABLE_STATE_BORROWED)
> > + ignored_bits |= BIT(1);
> > + else
> > + ignored_bits |= BIT(0);
> > + }
> > +
> > + return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
> > +}
> > +
> > static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> > u32 level, kvm_pte_t *ptep,
> > enum kvm_pgtable_walk_flags flag)
> > @@ -357,6 +380,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
> > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
> > attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
> > + attr |= pte_ignored_bit_prot(prot);
> > *ptep = attr;
> >
> > return 0;
> > @@ -558,6 +582,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
> >
> > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
> > attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > + attr |= pte_ignored_bit_prot(prot);
> > *ptep = attr;
> >
> > return 0;
>
> How about kvm_pgtable_stage2_relax_perms()?
It should leave SW bits untouched, and it really felt like a path where
we want to change permissions and nothing else. What did you have in
mind?
Cheers,
Quentin
On Monday 19 Jul 2021 at 16:01:40 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 11:47:30 +0100,
> Quentin Perret <[email protected]> wrote:
> > +static int finalize_mappings(void)
> > +{
> > + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RWX;
> > + int ret;
> > +
> > + /*
> > + * The host's .bss and .rodata sections are now conceptually owned by
> > + * the hypervisor, so mark them as 'borrowed' in the host stage-2. We
> > + * can safely use host_stage2_idmap_locked() at this point since the
> > + * host stage-2 has not been enabled yet.
> > + */
> > + prot |= KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED;
> > + ret = host_stage2_idmap_locked(__hyp_pa(__start_rodata),
> > + __hyp_pa(__end_rodata), prot);
>
> Do we really want to map the rodata section as RWX?
I know, feels odd, but for now I think so. The host is obviously
welcome to restrict things in its stage-1, but for stage-2, this is
just 'memory' so far; the host is allowed to patch it if it wants to.
Eventually, yes, I think we should make it RO in the host stage-2, but
maybe that's for another series?
> > + if (ret)
> > + return ret;
> > +
> > + return host_stage2_idmap_locked(__hyp_pa(__hyp_bss_end),
> > + __hyp_pa(__bss_stop), prot);
>
> If the 'locked' state implies SHARED+BORROWED, maybe consider moving
> the ORRing of the prot into host_stage2_idmap_locked()?
Ah no, sorry for the confusion, but 'locked' means that we already hold
the pgtable lock. That is not actually true here, but this is a special
case as only the current CPU can be messing with it at this point in
time so taking the lock would just be wasted cycles.
> > +}
> > +
> > void __noreturn __pkvm_init_finalise(void)
> > {
> > struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> > @@ -167,6 +199,10 @@ void __noreturn __pkvm_init_finalise(void)
> > if (ret)
> > goto out;
> >
> > + ret = finalize_mappings();
> > + if (ret)
> > + goto out;
> > +
> > pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
> > .zalloc_page = hyp_zalloc_hyp_page,
> > .phys_to_virt = hyp_phys_to_virt,
Thanks,
Quentin
On Mon, 19 Jul 2021 14:32:10 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Monday 19 Jul 2021 at 13:14:48 (+0100), Marc Zyngier wrote:
> > On Mon, 19 Jul 2021 11:47:24 +0100,
> > Quentin Perret <[email protected]> wrote:
> > >
> > > The stage-2 map walkers currently return -EAGAIN when re-creating
> > > identical mappings or only changing access permissions. This allows to
> > > optimize mapping pages for concurrent (v)CPUs faulting on the same
> > > page.
> > >
> > > While this works as expected when touching one page-table leaf at a
> > > time, this can lead to difficult situations when mapping larger ranges.
> > > Indeed, a large map operation can fail in the middle if an existing
> > > mapping is found in the range, even if it has compatible attributes,
> > > hence leaving only half of the range mapped.
> >
> > I'm curious of when this can happen. We normally map a single leaf at
> > a time, and we don't have a way to map multiple leaves at once: we
> > either use the VMA base size or try to upgrade it to a THP, but the
> > result is always a single leaf entry. What changed?
>
> Nothing _yet_ :-)
>
> The 'share' hypercall introduced near the end of the series allows to
> share multiple physically contiguous pages in one go -- this is mostly
> to allow sharing data-structures that are larger than a page.
>
> So if one of the pages happens to be already mapped by the time the
> hypercall is issued, mapping the range with the right SW bits becomes
> difficult as kvm_pgtable_stage2_map() will fail halfway through, which
> is tricky to handle.
>
> This patch shouldn't change anything for existing users that only map
> things that are nicely aligned at block/page granularity, but should
> make the life of new users easier, so that seemed like a win.
Right, but this is on a different path, right? Guests can never fault
multiple mappings at once, and it takes you a host hypercall to
perform this "multiple leaves at once".
Is there any way we can restrict this to the hypercall? Or even
better, keep the hypercall as a "one page at a time" thing? I can't
imagine it being performance critical (it is a once-off, and only used
over a rather small region of memory). Then, the called doesn't have
to worry about the page already being mapped or not. This would also
match the behaviour of what I do on the MMIO side.
Or do you anticipate much gain from this being able to use block
mappings?
>
> > > To avoid having to deal with such failures in the caller, don't
> > > interrupt the map operation when hitting existing PTEs, but make sure to
> > > still return -EAGAIN so that user_mem_abort() can mark the page dirty
> > > when needed.
> >
> > I don't follow you here: if you return -EAGAIN for a writable mapping,
> > we don't account for the page to be dirty on the assumption that
> > nothing has been mapped. But if there is a way to map more than a
> > single entry and to get -EAGAIN at the same time, then we're bound to
> > lose data on page eviction.
> >
> > Can you shed some light on this?
>
> Sure. For guests, hitting the -EAGAIN case means we've lost the race
> with another vCPU that faulted the same page. In this case the other
> vCPU either mapped the page RO, which means that our vCPU will then get
> a permission fault next time we run it which will lead to the page being
> marked dirty, or the other vCPU mapped the page RW in which case it
> already marked the page dirty for us and we can safely re-enter the
> guest without doing anything else.
>
> So what I meant by "still return -EAGAIN so that user_mem_abort() can
> mark the page dirty when needed" is "make sure to mark the page dirty
> only when necessary: if winning the race and marking the page RW, or
> in the permission fault path". That is, by keeping the -EAGAIN I want to
> make sure we don't mark the page dirty twice. (This might be fine, but this
> would be new behaviour, and it was not clear that would scale well to
> many vCPUs faulting the same page).
>
> I see how this wording can be highly confusing though, I'll re-word it
> for the next version.
I indeed found it pretty confusing. A reword would be much appreciated.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, 19 Jul 2021 14:39:05 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Monday 19 Jul 2021 at 13:55:29 (+0100), Marc Zyngier wrote:
> > On Mon, 19 Jul 2021 11:47:26 +0100,
> > Quentin Perret <[email protected]> wrote:
> > >
> > > The nVHE protected mode uses invalid mappings in the host stage-2
> > > page-table to track the owner of each page in the system. In order to
> > > allow the usage of ignored bits (a.k.a. software bits) in these
> > > mappings, move the owner encoding away from the top bits.
> >
> > But that's exactly what the current situation is allowing: the use of
> > the SW bits. I am guessing that what you really mean is that you want
> > to *preserve* the SW bits from an existing mapping and use other bits
> > when the mapping is invalid?
>
> Yes, this is really just forward looking, but I think it might be useful
> to allow annotating invalid mappings with both an owner id _and_
> additional flags for e.g. shared pages and such. And using bits [58-55]
> to store those flags just like we do for valid mappings should make
> things easier, but no big deal.
Right, so maybe worth calling that out.
> I see how this is going to conflict with kvm_pgtable_stage2_annotate()
> from your series though, so maybe I should just drop this patch and
> leave the encoding 'issue' to the caller -- the rest of the series
> doesn't depend on this anyway, this was just a small cleanup.
I'm not too worried about that for now. We can always rewrite one in
terms of the other, but I wanted to understand exactly what this change was
about.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, 19 Jul 2021 16:49:13 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Monday 19 Jul 2021 at 15:43:34 (+0100), Marc Zyngier wrote:
> > On Mon, 19 Jul 2021 11:47:29 +0100,
> > Quentin Perret <[email protected]> wrote:
> > >
> > > The hypervisor will soon be in charge of tracking ownership of all
> > > memory pages in the system. The current page-tracking infrastructure at
> > > EL2 only allows binary states: a page is either owned or not by an
> > > entity. But a number of use-cases will require more complex states for
> > > pages that are shared between two entities (host, hypervisor, or guests).
> > >
> > > In preparation for supporting these use-cases, introduce in the KVM
> > > page-table library some infrastructure allowing to tag shared pages
> > > using ignored bits (a.k.a. software bits) in PTEs.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/include/asm/kvm_pgtable.h | 5 +++++
> > > arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++++++++
> > > 2 files changed, 30 insertions(+)
> > >
> > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > index dd72653314c7..f6d3d5c8910d 100644
> > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > @@ -81,6 +81,8 @@ enum kvm_pgtable_stage2_flags {
> > > * @KVM_PGTABLE_PROT_W: Write permission.
> > > * @KVM_PGTABLE_PROT_R: Read permission.
> > > * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
> > > + * @KVM_PGTABLE_STATE_SHARED: Page shared with another entity.
> > > + * @KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity.
> > > */
> > > enum kvm_pgtable_prot {
> > > KVM_PGTABLE_PROT_X = BIT(0),
> > > @@ -88,6 +90,9 @@ enum kvm_pgtable_prot {
> > > KVM_PGTABLE_PROT_R = BIT(2),
> > >
> > > KVM_PGTABLE_PROT_DEVICE = BIT(3),
> > > +
> > > + KVM_PGTABLE_STATE_SHARED = BIT(4),
> > > + KVM_PGTABLE_STATE_BORROWED = BIT(5),
> >
> > I'd rather have some indirection here, as we have other potential
> > users for the SW bits outside of pKVM (see the NV series, which uses
> > some of these SW bits as the backend for TTL-based TLB invalidation).
> >
> > Can we instead only describe the SW bit states in this enum, and let
> > the users map the semantic they require onto that state? See [1] for
> > what I carry in the NV branch.
>
> Works for me -- I just wanted to make sure we don't have users in
> different places that use the same bits without knowing, but no strong
> opinions, so happy to change.
>
> > > };
> > >
> > > #define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index 5bdbe7a31551..51598b79dafc 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -211,6 +211,29 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
> > > return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
> > > }
> > >
> > > +static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
> >
> > Can we call these sw rather than ignored?
>
> Sure.
>
> > > +{
> > > + kvm_pte_t ignored_bits = 0;
> > > +
> > > + /*
> > > + * Ignored bits 0 and 1 are reserved to track the memory ownership
> > > + * state of each page:
> > > + * 00: The page is owned solely by the page-table owner.
> > > + * 01: The page is owned by the page-table owner, but is shared
> > > + * with another entity.
> > > + * 10: The page is shared with, but not owned by the page-table owner.
> > > + * 11: Reserved for future use (lending).
> > > + */
> > > + if (prot & KVM_PGTABLE_STATE_SHARED) {
> > > + if (prot & KVM_PGTABLE_STATE_BORROWED)
> > > + ignored_bits |= BIT(1);
> > > + else
> > > + ignored_bits |= BIT(0);
> > > + }
> > > +
> > > + return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
> > > +}
> > > +
> > > static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> > > u32 level, kvm_pte_t *ptep,
> > > enum kvm_pgtable_walk_flags flag)
> > > @@ -357,6 +380,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> > > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
> > > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
> > > attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
> > > + attr |= pte_ignored_bit_prot(prot);
> > > *ptep = attr;
> > >
> > > return 0;
> > > @@ -558,6 +582,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
> > >
> > > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
> > > attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > > + attr |= pte_ignored_bit_prot(prot);
> > > *ptep = attr;
> > >
> > > return 0;
> >
> > How about kvm_pgtable_stage2_relax_perms()?
>
> It should leave SW bits untouched, and it really felt like a path where
> we want to change permissions and nothing else. What did you have in
> mind?
It isn't clear to me that it would not (cannot?) be used to change
other bits, given that it takes an arbitrary 'prot' set. If there is
such an intended restriction, we definitely should document it.
M.
--
Without deviation from the norm, progress is not possible.
Hi Quentin,
On Mon, Jul 19, 2021 at 11:47 AM Quentin Perret <[email protected]> wrote:
>
> The current hypervisor stage-1 mapping code doesn't allow changing an
> existing valid mapping. Relax this condition by allowing changes that
> only target ignored bits, as that will soon be needed to annotate shared
> pages.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index a0ac8c2bc174..34cf67997a82 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -362,6 +362,17 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> return 0;
> }
>
> +static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
> +{
> + if (old == new)
> + return false;
> +
> + if (!kvm_pte_valid(old))
> + return true;
> +
> + return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_IGNORED);
Wouldn't this return false if both ignored and non-ignored bits were
different, or is that not possible (judging by the WARN_ON)? If it is,
then it would need an update, wouldn't it?
Thanks,
/fuad
> +}
> +
> static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> kvm_pte_t *ptep, struct hyp_map_data *data)
> {
> @@ -371,9 +382,12 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
> if (!kvm_block_mapping_supported(addr, end, phys, level))
> return false;
>
> - /* Tolerate KVM recreating the exact same mapping */
> + /*
> + * Tolerate KVM recreating the exact same mapping, or changing ignored
> + * bits.
> + */
> new = kvm_init_valid_leaf_pte(phys, data->attr, level);
> - if (old != new && !WARN_ON(kvm_pte_valid(old)))
> + if (hyp_pte_needs_update(old, new))
> smp_store_release(ptep, new);
>
> data->phys += granule;
> --
> 2.32.0.402.g57bb445576-goog
>
Hi Fuad,
On Tuesday 20 Jul 2021 at 11:17:03 (+0100), Fuad Tabba wrote:
> Hi Quentin,
>
>
> On Mon, Jul 19, 2021 at 11:47 AM Quentin Perret <[email protected]> wrote:
> >
> > The current hypervisor stage-1 mapping code doesn't allow changing an
> > existing valid mapping. Relax this condition by allowing changes that
> > only target ignored bits, as that will soon be needed to annotate shared
> > pages.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++--
> > 1 file changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > index a0ac8c2bc174..34cf67997a82 100644
> > --- a/arch/arm64/kvm/hyp/pgtable.c
> > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > @@ -362,6 +362,17 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> > return 0;
> > }
> >
> > +static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
> > +{
> > + if (old == new)
> > + return false;
> > +
> > + if (!kvm_pte_valid(old))
> > + return true;
> > +
> > + return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_IGNORED);
>
> Wouldn't this return false if both ignored and non-ignored bits were
> different, or is that not possible (judging by the WARN_ON)?
Correct, but that is intentional, see below ;)
> If it is, then it would need an update, wouldn't it?
Maybe, but if you look at what the existing code does, we do skip the
update if the old mapping is valid and not equal to new. So I kept the
behaviour as close as possible to this -- if you change any bits outside
of SW bits you get a WARN and we skip the update, as we already do
today. But if you touch only SW bits and nothing else, then I let the
update go through.
That said, I don't think warning and then proceeding to update would be
terribly wrong, it's just that a change of behaviour felt a bit
unnecessary for this particular patch.
Thanks,
Quentin
Hi Quentin,
On Tue, Jul 20, 2021 at 11:30 AM 'Quentin Perret' via kernel-team
<[email protected]> wrote:
>
> Hi Fuad,
>
> On Tuesday 20 Jul 2021 at 11:17:03 (+0100), Fuad Tabba wrote:
> > Hi Quentin,
> >
> >
> > On Mon, Jul 19, 2021 at 11:47 AM Quentin Perret <[email protected]> wrote:
> > >
> > > The current hypervisor stage-1 mapping code doesn't allow changing an
> > > existing valid mapping. Relax this condition by allowing changes that
> > > only target ignored bits, as that will soon be needed to annotate shared
> > > pages.
> > >
> > > Signed-off-by: Quentin Perret <[email protected]>
> > > ---
> > > arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++--
> > > 1 file changed, 16 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > index a0ac8c2bc174..34cf67997a82 100644
> > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > @@ -362,6 +362,17 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> > > return 0;
> > > }
> > >
> > > +static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
> > > +{
> > > + if (old == new)
> > > + return false;
> > > +
> > > + if (!kvm_pte_valid(old))
> > > + return true;
> > > +
> > > + return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_IGNORED);
> >
> > Wouldn't this return false if both ignored and non-ignored bits were
> > different, or is that not possible (judging by the WARN_ON)?
>
> Correct, but that is intentional, see below ;)
>
> > If it is, then it would need an update, wouldn't it?
>
> Maybe, but if you look at what the existing code does, we do skip the
> update if the old mapping is valid and not equal to new. So I kept the
> behaviour as close as possible to this -- if you change any bits outside
> of SW bits you get a WARN and we skip the update, as we already do
> today. But if you touch only SW bits and nothing else, then I let the
> update go through.
>
> That said, I don't think warning and then proceeding to update would be
> terribly wrong, it's just that a change of behaviour felt a bit
> unnecessary for this particular patch.
Thanks for the clarification. It makes sense to preserve the existing
behavior, but I was wondering if it would be good to add a comment
describing what merits a "needs update"?
Cheers,
/fuad
> Thanks,
> Quentin
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>
On Tuesday 20 Jul 2021 at 11:59:27 (+0100), Fuad Tabba wrote:
> Thanks for the clarification. It makes sense to preserve the existing
> behavior, but I was wondering if it would be good to add a comment
> describing what merits a "needs update"?
Sure thing, I'll add something for v2.
Cheers,
Quentin
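For reference, a minimal sketch of what the commented helper could look
like in v2 -- the code is unchanged from the patch above, only the comment
is new, and its wording is just a suggestion:

/*
 * Decide whether a new hyp stage-1 leaf should be written back:
 *  - invalid -> valid transitions always need the update;
 *  - valid -> valid transitions are only tolerated when they touch
 *    ignored (SW) bits; changing anything else is a bug, so warn and
 *    skip the update, as the pre-existing code already did.
 */
static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
{
        if (old == new)
                return false;

        if (!kvm_pte_valid(old))
                return true;

        return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_IGNORED);
}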
On Tuesday 20 Jul 2021 at 11:13:31 (+0100), Marc Zyngier wrote:
> On Mon, 19 Jul 2021 16:49:13 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > On Monday 19 Jul 2021 at 15:43:34 (+0100), Marc Zyngier wrote:
> > > On Mon, 19 Jul 2021 11:47:29 +0100,
> > > Quentin Perret <[email protected]> wrote:
> > > >
> > > > The hypervisor will soon be in charge of tracking ownership of all
> > > > memory pages in the system. The current page-tracking infrastructure at
> > > > EL2 only allows binary states: a page is either owned or not by an
> > > > entity. But a number of use-cases will require more complex states for
> > > > pages that are shared between two entities (host, hypervisor, or guests).
> > > >
> > > > In preparation for supporting these use-cases, introduce in the KVM
> > > > page-table library some infrastructure allowing to tag shared pages
> > > > using ignored bits (a.k.a. software bits) in PTEs.
> > > >
> > > > Signed-off-by: Quentin Perret <[email protected]>
> > > > ---
> > > > arch/arm64/include/asm/kvm_pgtable.h | 5 +++++
> > > > arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++++++++
> > > > 2 files changed, 30 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > > index dd72653314c7..f6d3d5c8910d 100644
> > > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > > @@ -81,6 +81,8 @@ enum kvm_pgtable_stage2_flags {
> > > > * @KVM_PGTABLE_PROT_W: Write permission.
> > > > * @KVM_PGTABLE_PROT_R: Read permission.
> > > > * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
> > > > + * @KVM_PGTABLE_STATE_SHARED: Page shared with another entity.
> > > > + * @KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity.
> > > > */
> > > > enum kvm_pgtable_prot {
> > > > KVM_PGTABLE_PROT_X = BIT(0),
> > > > @@ -88,6 +90,9 @@ enum kvm_pgtable_prot {
> > > > KVM_PGTABLE_PROT_R = BIT(2),
> > > >
> > > > KVM_PGTABLE_PROT_DEVICE = BIT(3),
> > > > +
> > > > + KVM_PGTABLE_STATE_SHARED = BIT(4),
> > > > + KVM_PGTABLE_STATE_BORROWED = BIT(5),
> > >
> > > I'd rather have some indirection here, as we have other potential
> > > users for the SW bits outside of pKVM (see the NV series, which uses
> > > some of these SW bits as the backend for TTL-based TLB invalidation).
> > >
> > > Can we instead only describe the SW bit states in this enum, and let
> > > the users map the semantic they require onto that state? See [1] for
> > > what I carry in the NV branch.
> >
> > Works for me -- I just wanted to make sure we don't have users in
> > different places that use the same bits without knowing, but no strong
> > opinions, so happy to change.
> >
> > > > };
> > > >
> > > > #define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> > > > index 5bdbe7a31551..51598b79dafc 100644
> > > > --- a/arch/arm64/kvm/hyp/pgtable.c
> > > > +++ b/arch/arm64/kvm/hyp/pgtable.c
> > > > @@ -211,6 +211,29 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
> > > > return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
> > > > }
> > > >
> > > > +static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
> > >
> > > Can we call these sw rather than ignored?
> >
> > Sure.
> >
> > > > +{
> > > > + kvm_pte_t ignored_bits = 0;
> > > > +
> > > > + /*
> > > > + * Ignored bits 0 and 1 are reserved to track the memory ownership
> > > > + * state of each page:
> > > > + * 00: The page is owned solely by the page-table owner.
> > > > + * 01: The page is owned by the page-table owner, but is shared
> > > > + * with another entity.
> > > > + * 10: The page is shared with, but not owned by the page-table owner.
> > > > + * 11: Reserved for future use (lending).
> > > > + */
> > > > + if (prot & KVM_PGTABLE_STATE_SHARED) {
> > > > + if (prot & KVM_PGTABLE_STATE_BORROWED)
> > > > + ignored_bits |= BIT(1);
> > > > + else
> > > > + ignored_bits |= BIT(0);
> > > > + }
> > > > +
> > > > + return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
> > > > +}
> > > > +
> > > > static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> > > > u32 level, kvm_pte_t *ptep,
> > > > enum kvm_pgtable_walk_flags flag)
> > > > @@ -357,6 +380,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> > > > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
> > > > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
> > > > attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
> > > > + attr |= pte_ignored_bit_prot(prot);
> > > > *ptep = attr;
> > > >
> > > > return 0;
> > > > @@ -558,6 +582,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
> > > >
> > > > attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
> > > > attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
> > > > + attr |= pte_ignored_bit_prot(prot);
> > > > *ptep = attr;
> > > >
> > > > return 0;
> > >
> > > How about kvm_pgtable_stage2_relax_perms()?
> >
> > It should leave SW bits untouched, and it really felt like a path where
> > we want to change permissions and nothing else. What did you have in
> > mind?
>
> It isn't clear to me that it would not (cannot?) be used to change
> other bits, given that it takes an arbitrary 'prot' set.
Sure, though it already ignores KVM_PGTABLE_PROT_DEVICE.
I guess the thing I find hard to reason about is that
kvm_pgtable_stage2_relax_perms() is 'additive'. E.g. it can make a
mapping RW if it was RO, but not the other way around. With the current
patch-set it wasn't really clear how that should translate to
KVM_PGTABLE_STATE_SHARED and such.
> If there is
> such an intended restriction, we definitely should document it.
Ack, that's definitely missing. And in fact I should probably make
kvm_pgtable_stage2_relax_perms() return -EINVAL if we're passing prot
values it can't handle.
Cheers,
Quentin
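For illustration, the check mentioned above could be as simple as the
following at the top of kvm_pgtable_stage2_relax_perms() -- a sketch only,
not necessarily what v2 will do, with the R/W/X mask spelled out from the
existing prot flags:

int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
                                   enum kvm_pgtable_prot prot)
{
        /*
         * This path is strictly about relaxing permissions: reject prot
         * values that carry anything else (device attributes, SW-bit
         * annotations, ...) so callers can't change those by accident.
         */
        if (prot & ~(KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X))
                return -EINVAL;

        /* ... rest of the existing function unchanged ... */
}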
On Tuesday 20 Jul 2021 at 09:26:10 (+0100), Marc Zyngier wrote:
> Right, but this is on a different path, right? Guests can never fault
> multiple mappings at once, and it takes you a host hypercall to
> perform this "multiple leaves at once".
Right.
> Is there any way we can restrict this to the hypercall? Or even
> better, keep the hypercall as a "one page at a time" thing? I can't
> imagine it being performance critical (it is a once-off, and only used
> over a rather small region of memory). Then, the caller doesn't have
> to worry about the page already being mapped or not. This would also
> match the behaviour of what I do on the MMIO side.
>
> Or do you anticipate much gain from this being able to use block
> mappings?
Not really no, especially given that mappings of shared pages will be
forced to page granularity thanks to the other patch we discussed in
this series. I was just hoping to reduce the overhead a bit by reducing
the number of hypercalls. But as you said, this probably doesn't matter
all that much, so happy to rework that. I'll look into making the hcall
use one page at a time, and hopefully that'll simplify a few other
things in the check_host_share_hyp() path near the end of this series.
Thanks,
Quentin
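As a rough illustration of that rework (hypothetical: it assumes the
hypercall is changed to take a single pfn, and share_range_with_hyp() is a
made-up host-side helper name), the host would simply loop over the range:

/* Hypothetical host-side helper: one page shared per hypercall. */
static int share_range_with_hyp(phys_addr_t start, phys_addr_t end)
{
        phys_addr_t addr;
        int ret;

        for (addr = start; addr < end; addr += PAGE_SIZE) {
                ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
                                        addr >> PAGE_SHIFT);
                if (ret)
                        return ret;
        }

        return 0;
}

This keeps the EL2 side trivial, since each call only ever deals with a
single already-mapped-or-not page, at the cost of one HVC per page for a
once-off operation over a small region.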
Hi Quentin,
On Mon, Jul 19, 2021 at 11:47 AM Quentin Perret <[email protected]> wrote:
>
> The hypervisor will soon be in charge of tracking ownership of all
> memory pages in the system. The current page-tracking infrastructure at
> EL2 only allows binary states: a page is either owned or not by an
> entity. But a number of use-cases will require more complex states for
> pages that are shared between two entities (host, hypervisor, or guests).
>
> In preparation for supporting these use-cases, introduce in the KVM
> page-table library some infrastructure allowing to tag shared pages
> using ignored bits (a.k.a. software bits) in PTEs.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 5 +++++
> arch/arm64/kvm/hyp/pgtable.c | 25 +++++++++++++++++++++++++
> 2 files changed, 30 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index dd72653314c7..f6d3d5c8910d 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -81,6 +81,8 @@ enum kvm_pgtable_stage2_flags {
> * @KVM_PGTABLE_PROT_W: Write permission.
> * @KVM_PGTABLE_PROT_R: Read permission.
> * @KVM_PGTABLE_PROT_DEVICE: Device attributes.
> + * @KVM_PGTABLE_STATE_SHARED: Page shared with another entity.
> + * @KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity.
> */
> enum kvm_pgtable_prot {
> KVM_PGTABLE_PROT_X = BIT(0),
> @@ -88,6 +90,9 @@ enum kvm_pgtable_prot {
> KVM_PGTABLE_PROT_R = BIT(2),
>
> KVM_PGTABLE_PROT_DEVICE = BIT(3),
> +
> + KVM_PGTABLE_STATE_SHARED = BIT(4),
> + KVM_PGTABLE_STATE_BORROWED = BIT(5),
> };
>
> #define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index 5bdbe7a31551..51598b79dafc 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -211,6 +211,29 @@ static kvm_pte_t kvm_init_invalid_leaf_owner(u8 owner_id)
> return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id);
> }
>
> +static kvm_pte_t pte_ignored_bit_prot(enum kvm_pgtable_prot prot)
> +{
> + kvm_pte_t ignored_bits = 0;
> +
> + /*
> + * Ignored bits 0 and 1 are reserved to track the memory ownership
> + * state of each page:
> + * 00: The page is owned solely by the page-table owner.
> + * 01: The page is owned by the page-table owner, but is shared
> + * with another entity.
> + * 10: The page is shared with, but not owned by the page-table owner.
> + * 11: Reserved for future use (lending).
> + */
> + if (prot & KVM_PGTABLE_STATE_SHARED) {
> + if (prot & KVM_PGTABLE_STATE_BORROWED)
> + ignored_bits |= BIT(1);
> + else
> + ignored_bits |= BIT(0);
> + }
This might tie in to Marc's comments for using enums, but
consolidating the translation between prot and ignored/software bits
in one place would be good: thinking about patch 10 as well, where you
get the prot from the ignored bits. Even though you have documented
it, I'm finding the part where a field can be borrowed and shared as
opposed to being only shared not very intuitive, and I need to reread
the comment here to remember the difference while going through the
code.
You also mention lending as potentially reserved for the future, but I
think that lending is the other side of borrowing (depends on who's
doing the giving/taking). I wonder if in this case it would be clearer
to describe it in terms of whether it's exclusively owned vs owned but
shared (for the owner), and just shared for the sharer...
Thanks,
/fuad
> + return FIELD_PREP(KVM_PTE_LEAF_ATTR_IGNORED, ignored_bits);
> +}
> +
> static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
> u32 level, kvm_pte_t *ptep,
> enum kvm_pgtable_walk_flags flag)
> @@ -357,6 +380,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
> attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
> attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
> attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
> + attr |= pte_ignored_bit_prot(prot);
> *ptep = attr;
>
> return 0;
> @@ -558,6 +582,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
>
> attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
> attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
> + attr |= pte_ignored_bit_prot(prot);
> *ptep = attr;
>
> return 0;
> --
> 2.32.0.402.g57bb445576-goog
>
Hi Quentin,
On Mon, Jul 19, 2021 at 11:47 AM Quentin Perret <[email protected]> wrote:
>
> We currently unmap all MMIO mappings from the host stage-2 to recycle
> the pages whenever we run out. In order to make this pattern easy to
> re-use from other places, factor the logic out into a dedicated macro.
> While at it, apply the macro for the kvm_pgtable_stage2_set_owner()
> calls. They're currently only called early on and are guaranteed to
> succeed, but making them robust to the -ENOMEM case doesn't hurt and
> will avoid painful debugging sessions later on.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 38 ++++++++++++++-------------
> 1 file changed, 20 insertions(+), 18 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index d938ce95d3bd..56f2117c877b 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -208,6 +208,23 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
> prot, &host_s2_pool);
> }
>
> +/*
> + * The pool has been provided with enough pages to cover all of memory with
> + * page granularity, but it is difficult to know how much of the MMIO range
> + * we will need to cover upfront, so we may need to 'recycle' the pages if we
> + * run out.
> + */
> +#define host_stage2_try(fn, ...) \
> + ({ \
> + int __ret = fn(__VA_ARGS__); \
> + if (__ret == -ENOMEM) { \
> + __ret = host_stage2_unmap_dev_all(); \
> + if (!__ret) \
> + __ret = fn(__VA_ARGS__); \
> + } \
> + __ret; \
> + })
I think it might be good to document that this macro expects the
host_kvm.lock to be held.
Thanks,
/fuad
> static int host_stage2_idmap(u64 addr)
> {
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> @@ -223,22 +240,7 @@ static int host_stage2_idmap(u64 addr)
> if (ret)
> goto unlock;
>
> - ret = __host_stage2_idmap(range.start, range.end, prot);
> - if (ret != -ENOMEM)
> - goto unlock;
> -
> - /*
> - * The pool has been provided with enough pages to cover all of memory
> - * with page granularity, but it is difficult to know how much of the
> - * MMIO range we will need to cover upfront, so we may need to 'recycle'
> - * the pages if we run out.
> - */
> - ret = host_stage2_unmap_dev_all();
> - if (ret)
> - goto unlock;
> -
> - ret = __host_stage2_idmap(range.start, range.end, prot);
> -
> + ret = host_stage2_try(__host_stage2_idmap, range.start, range.end, prot);
> unlock:
> hyp_spin_unlock(&host_kvm.lock);
>
> @@ -257,8 +259,8 @@ int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
> return -EINVAL;
>
> hyp_spin_lock(&host_kvm.lock);
> - ret = kvm_pgtable_stage2_set_owner(&host_kvm.pgt, start, end - start,
> - &host_s2_pool, pkvm_hyp_id);
> + ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt,
> + start, end - start, &host_s2_pool, pkvm_hyp_id);
> hyp_spin_unlock(&host_kvm.lock);
>
> return ret != -EAGAIN ? ret : 0;
> --
> 2.32.0.402.g57bb445576-goog
>
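For what it's worth, that could be captured directly in the comment above
the macro -- a sketch, where the second paragraph of the comment is the
only addition:

/*
 * The pool has been provided with enough pages to cover all of memory
 * with page granularity, but it is difficult to know how much of the
 * MMIO range we will need to cover upfront, so we may need to 'recycle'
 * the pages if we run out.
 *
 * Must be used while holding host_kvm.lock, as both the wrapped call
 * and host_stage2_unmap_dev_all() operate on the host stage-2.
 */
#define host_stage2_try(fn, ...)                                        \
        ({                                                              \
                int __ret = fn(__VA_ARGS__);                            \
                if (__ret == -ENOMEM) {                                 \
                        __ret = host_stage2_unmap_dev_all();            \
                        if (!__ret)                                     \
                                __ret = fn(__VA_ARGS__);                \
                }                                                       \
                __ret;                                                  \
        })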
On Tuesday 20 Jul 2021 at 14:48:09 (+0100), Fuad Tabba wrote:
> This might tie in to Marc's comments for using enums, but
> consolidating the translation between prot and ignored/software bits
> in one place would be good: thinking about patch 10 as well, where you
> get the prot from the ignored bits. Even though you have documented
> it, I'm finding the part where a field can be borrowed and shared as
> opposed to being only shared not very intuitive, and I need to reread
> the comment here to remember the difference while going through the
> code.
>
> You also mention lending as potentially reserved for the future, but I
> think that lending is the other side of borrowing (depends on who's
> doing the giving/taking). I wonder if in this case it would be clearer
> to describe it in terms of whether it's exclusively owned vs owned but
> shared (for the owner), and just shared for the sharer...
Argh so I actually found the encoding pretty neat :/
The idea is the following:
- an entity that has a page mapped as SHARED in its PT means it
doesn't have exclusive access to the page;
- an entity that has a page mapped as BORROWED in its PT means it has
access to a page it doesn't own;
From that we can build the states we need:
- when an entity shares a page with another, the original owner gets a
SHARED mapping, and the recipient a SHARED+BORROWED mapping.
- and in the future when/if we implement lending (which means an
entity gives exclusive access to a page to another entity, but
retains ownership) we can map the page in the recipient as
'BORROWED' only, but not 'SHARED'. And the original owner will have
an invalid mapping with a new state 'LENT', which is encoded with
both SW bits set.
How does that sound? Did you have something else in mind?
Thanks,
Quentin
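For completeness, the decoding direction (what the patch-10
kvm_pgtable_hyp_pte_prot()/kvm_pgtable_stage2_pte_prot() helpers rely on)
follows from the same table; a minimal sketch keeping this patch's naming
-- pte_ignored_bit_state() is a made-up name, and this predates the 'sw'
rename Marc asked for:

static enum kvm_pgtable_prot pte_ignored_bit_state(kvm_pte_t pte)
{
        kvm_pte_t ignored_bits = FIELD_GET(KVM_PTE_LEAF_ATTR_IGNORED, pte);

        /* Inverse of pte_ignored_bit_prot(): 01 = shared, 10 = shared + borrowed. */
        if (ignored_bits == BIT(0))
                return KVM_PGTABLE_STATE_SHARED;
        if (ignored_bits == BIT(1))
                return KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED;

        /* 00 = owned and exclusive, 11 = reserved for lending. */
        return 0;
}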
Hi Quentin,
On Tue, Jul 20, 2021 at 3:06 PM Quentin Perret <[email protected]> wrote:
>
> On Tuesday 20 Jul 2021 at 14:48:09 (+0100), Fuad Tabba wrote:
> > This might tie in to Marc's comments for using enums, but
> > consolidating the translation between prot and ignored/software bits
> > in one place would be good: thinking about patch 10 as well, where you
> > get the prot from the ignored bits. Even though you have documented
> > it, I'm finding the part where a field can be borrowed and shared as
> > opposed to being only shared not very intuitive, and I need to reread
> > the comment here to remember the difference while going through the
> > code.
> >
> > You also mention lending as potentially reserved for the future, but I
> > think that lending is the other side of borrowing (depends on who's
> > doing the giving/taking). I wonder if in this case it would be clearer
> > to describe it in terms of whether it's exclusively owned vs owned but
> > shared (for the owner), and just shared for the sharer...
>
> Argh so I actually found the encoding pretty neat :/
> The idea is the following:
>
> - an entity that has a page mapped as SHARED in its PT means it
> doesn't have exclusive access to the page;
>
> - an entity that has a page mapped as BORROWED in its PT means it has
> access to a page it doesn't own;
>
> From that we can build the states we need:
>
> - when an entity shares a page with another, the original owner gets a
> SHARED mapping, and the recipient a SHARED+BORROWED mapping.
>
> - and in the future when/if we implement lending (which means an
> entity gives exclusive access to a page to another entity, but
> retains ownership) we can map the page in the recipient as
> 'BORROWED' only, but not 'SHARED'. And the original owner will have
> an invalid mapping with a new state 'LENT', which is encoded with
> both SW bits set.
>
> How does that sound? Did you have something else in mind?
The encoding is very neat by the way :D
I see where you're going with the lent state now, and I understand the
states as well as the possible transitions now that you've explained
it.
It's the terminology that confused me a bit (especially when you
mention lending, which seemed to imply it is something distinct from
borrowing as opposed to just the other side of it). What would help me
is documenting this, along with the possible combinations/legal states.
kvm_pgtable.h describes the prots a bit, but maybe you could expand
that similarly to what you've done in this email:
@KVM_PGTABLE_STATE_BORROWED: Page borrowed from another entity: has
access to the page but no ownership
Not sure if defining aliases for all legal combinations would also
help or just add to the confusion; thinking out loud, something along the
lines of a cache state taxonomy (e.g., Sweazey and Smith fig 3 [1]). You
already have that split in borrowed (as opposed to owned) and shared (as
opposed to exclusive). So aliases built on these:
#define KVM_PGTABLE_STATE_BORROWED_SHARED      (KVM_PGTABLE_STATE_SHARED | \
                                                 KVM_PGTABLE_STATE_BORROWED)
#define KVM_PGTABLE_STATE_BORROWED_EXCLUSIVE   (KVM_PGTABLE_STATE_BORROWED)
#define KVM_PGTABLE_STATE_OWNED_SHARED         (KVM_PGTABLE_STATE_SHARED)
#define KVM_PGTABLE_STATE_OWNED_EXCLUSIVE      (0ULL)
You have thought about this way more than I have. But I think that
having clear documentation, ideally in the code itself via
helpers/enums/aliases could help people like me who come to the code
later not shoot themselves in the foot.
Thanks!
/fuad
[1] https://www.cs.auckland.ac.nz/compsci703s1c/archive/2008/resources/Sweazey.pdf
> Thanks,
> Quentin
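To make the suggestion concrete, with the aliases above the two call sites
in __pkvm_host_share_hyp() (patch 13) would read roughly as follows -- the
alias names are the ones proposed in this email, not something the series
currently defines:

        /* Host stage-2: the host keeps ownership, but the page is now shared. */
        prot = KVM_PGTABLE_PROT_RWX | KVM_PGTABLE_STATE_OWNED_SHARED;
        ret = host_stage2_idmap_locked(start, end, prot);
        ...
        /* Hyp stage-1: the page is mapped, but borrowed from the host. */
        prot = PAGE_HYP | KVM_PGTABLE_STATE_BORROWED_SHARED;
        ret = pkvm_create_mappings_locked(__hyp_va(start), __hyp_va(end), prot);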
Hi Quentin,
On Mon, Jul 19, 2021 at 11:48 AM Quentin Perret <[email protected]> wrote:
>
> The KVM pgtable API exposes the kvm_pgtable_walk() function to allow
> the definition of walkers outside of pgtable.c. However, it is not easy
> to implement any of those walkers without some of the low-level helpers,
> such as kvm_pte_valid(). Make it static inline, and move it to the
> header file to allow its re-use in other places.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 7 +++++++
> arch/arm64/kvm/hyp/pgtable.c | 6 ------
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 1aa49d6aabb7..8240c881ae1e 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -25,6 +25,13 @@ static inline u64 kvm_get_parange(u64 mmfr0)
>
> typedef u64 kvm_pte_t;
>
> +#define KVM_PTE_VALID BIT(0)
> +
I don't know if there's a better solution for this, but having
KVM_PTE_VALID by itself here, with the rest remaining in pgtable.c,
might be confusing. I see that you probably don't want to move them
all here because they are internal to pgtable.c.
Thanks,
/fuad
> +static inline bool kvm_pte_valid(kvm_pte_t pte)
> +{
> + return pte & KVM_PTE_VALID;
> +}
> +
> /**
> * struct kvm_pgtable_mm_ops - Memory management callbacks.
> * @zalloc_page: Allocate a single zeroed memory page.
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index c7120797404a..e0ae57dca827 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -11,7 +11,6 @@
> #include <asm/kvm_pgtable.h>
> #include <asm/stage2_pgtable.h>
>
> -#define KVM_PTE_VALID BIT(0)
>
> #define KVM_PTE_TYPE BIT(1)
> #define KVM_PTE_TYPE_BLOCK 0
> @@ -135,11 +134,6 @@ static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
> return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
> }
>
> -static bool kvm_pte_valid(kvm_pte_t pte)
> -{
> - return pte & KVM_PTE_VALID;
> -}
> -
> static bool kvm_pte_table(kvm_pte_t pte, u32 level)
> {
> if (level == KVM_PGTABLE_MAX_LEVELS - 1)
> --
> 2.32.0.402.g57bb445576-goog
>
Hi Quentin,
On Mon, Jul 19, 2021 at 11:48 AM Quentin Perret <[email protected]> wrote:
>
> Refactor the hypervisor stage-1 locking in nVHE protected mode to expose
> a new pkvm_create_mappings_locked() function. This will be used in later
> patches to allow walking and changing the hypervisor stage-1 without
> releasing the lock.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
> arch/arm64/kvm/hyp/nvhe/mm.c | 16 ++++++++++++++--
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> index 8ec3a5a7744b..c76d7136ed9b 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -23,6 +23,7 @@ int hyp_map_vectors(void);
> int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> +int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
> int __pkvm_create_mappings(unsigned long start, unsigned long size,
> unsigned long phys, enum kvm_pgtable_prot prot);
> unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
The locking logic seems to be consistent, with pkvm_create_mappings()
holding the lock for the whole duration of the operation rather than
per-iteration.
It would be nice though to document which lock should be held for the
_locked versions.
Thanks,
/fuad
> diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> index a8efdf0f9003..dde22e2a322a 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mm.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> @@ -67,7 +67,7 @@ unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> return addr;
> }
>
> -int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> +int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot)
> {
> unsigned long start = (unsigned long)from;
> unsigned long end = (unsigned long)to;
> @@ -81,7 +81,8 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> int err;
>
> phys = hyp_virt_to_phys((void *)virt_addr);
> - err = __pkvm_create_mappings(virt_addr, PAGE_SIZE, phys, prot);
> + err = kvm_pgtable_hyp_map(&pkvm_pgtable, virt_addr, PAGE_SIZE,
> + phys, prot);
> if (err)
> return err;
> }
> @@ -89,6 +90,17 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> return 0;
> }
>
> +int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> +{
> + int ret;
> +
> + hyp_spin_lock(&pkvm_pgd_lock);
> + ret = pkvm_create_mappings_locked(from, to, prot);
> + hyp_spin_unlock(&pkvm_pgd_lock);
> +
> + return ret;
> +}
> +
> int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)
> {
> unsigned long start, end;
> --
> 2.32.0.402.g57bb445576-goog
>
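One way to document it, as a sketch in nvhe/mm.h (the comment wording is
only a suggestion):

/*
 * pkvm_create_mappings_locked() - map a range of kernel VAs in the hyp
 *                                 stage-1 page-table.
 *
 * Must be called with pkvm_pgd_lock held; pkvm_create_mappings() is the
 * wrapper that takes and releases the lock itself.
 */
int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);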
Hi Quentin,
On Mon, Jul 19, 2021 at 11:48 AM Quentin Perret <[email protected]> wrote:
>
> The host kernel is currently able to change EL2 stage-1 mappings without
> restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
> world where the host is no longer part of the TCB, this clearly poses a
> problem.
>
> To fix this, introduce a new hypercall to allow the host to share a
> range of physical memory with the hypervisor, and remove the
> __pkvm_create_mappings() variant. The new hypercall implements
> ownership and permission checks before allowing the sharing operation,
> and it annotates the shared pages in the hypervisor stage-1 and host
> stage-2 page-tables.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_asm.h | 2 +-
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> arch/arm64/kvm/hyp/include/nvhe/mm.h | 2 -
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 12 +-
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 105 ++++++++++++++++++
> arch/arm64/kvm/hyp/nvhe/mm.c | 4 +-
> arch/arm64/kvm/mmu.c | 14 ++-
> 7 files changed, 124 insertions(+), 16 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 9f0bf2109be7..78db818ae2c9 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -59,7 +59,7 @@
> #define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
> #define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
> #define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
> -#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
> +#define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp 16
> #define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
> #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
> #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index b39047463075..f37e4d3b831b 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -22,6 +22,7 @@ extern struct host_kvm host_kvm;
>
> int __pkvm_prot_finalize(void);
> int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
> +int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end);
>
> int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
> int kvm_host_prepare_stage2(void *pgt_pool_base);
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> index c76d7136ed9b..c9a8f535212e 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -24,8 +24,6 @@ int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
> -int __pkvm_create_mappings(unsigned long start, unsigned long size,
> - unsigned long phys, enum kvm_pgtable_prot prot);
> unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> enum kvm_pgtable_prot prot);
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 1632f001f4ed..f05ecbd382d0 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -140,14 +140,12 @@ static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> }
>
> -static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> +static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt)
> {
> - DECLARE_REG(unsigned long, start, host_ctxt, 1);
> - DECLARE_REG(unsigned long, size, host_ctxt, 2);
> - DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> - DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
> + DECLARE_REG(phys_addr_t, start, host_ctxt, 1);
> + DECLARE_REG(phys_addr_t, end, host_ctxt, 2);
>
> - cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> + cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(start, end);
> }
>
> static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> @@ -193,7 +191,7 @@ static const hcall_t host_hcall[] = {
> HANDLE_FUNC(__vgic_v3_restore_aprs),
> HANDLE_FUNC(__pkvm_init),
> HANDLE_FUNC(__pkvm_cpu_set_vector),
> - HANDLE_FUNC(__pkvm_create_mappings),
> + HANDLE_FUNC(__pkvm_host_share_hyp),
> HANDLE_FUNC(__pkvm_create_private_mapping),
> HANDLE_FUNC(__pkvm_prot_finalize),
> HANDLE_FUNC(__pkvm_mark_hyp),
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 6f28edf58407..20b3cb3fdc67 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -262,6 +262,111 @@ static int host_stage2_idmap(u64 addr)
> return ret;
> }
>
> +static int hyp_range_is_shared_walker(u64 addr, u64 end, u32 level,
> + kvm_pte_t *ptep,
> + enum kvm_pgtable_walk_flags flag,
> + void * const arg)
> +{
> + enum kvm_pgtable_prot prot;
> + kvm_pte_t pte = *ptep;
> +
> + if (!kvm_pte_valid(pte))
> + return -EPERM;
> +
> + prot = kvm_pgtable_hyp_pte_prot(pte);
> + if (!prot)
> + return -EPERM;
nit: is this check necessary?
> + /* Check that the page has been shared with the hypervisor before */
> + if (prot != (PAGE_HYP | KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED))
> + return -EPERM;
> +
> + return 0;
> +}
> +
> +static int hyp_range_is_shared(phys_addr_t start, phys_addr_t end)
> +{
> + struct kvm_pgtable_walker walker = {
> + .cb = hyp_range_is_shared_walker,
> + .flags = KVM_PGTABLE_WALK_LEAF,
> + };
> +
> + return kvm_pgtable_walk(&pkvm_pgtable, (u64)__hyp_va(start),
> + end - start, &walker);
> +}
> +
> +static int check_host_share_hyp_walker(u64 addr, u64 end, u32 level,
nit: It seems the convention is usually addr,size or start,end. Here
you're using addr,end.
> + kvm_pte_t *ptep,
> + enum kvm_pgtable_walk_flags flag,
> + void * const arg)
> +{
> + enum kvm_pgtable_prot prot;
> + kvm_pte_t pte = *ptep;
> +
> + /* If invalid, only allow to share pristine pages */
> + if (!kvm_pte_valid(pte))
> + return pte ? -EPERM : 0;
> +
> + prot = kvm_pgtable_stage2_pte_prot(pte);
> + if (!prot)
> + return -EPERM;
> +
> + /* Cannot share a page that is not owned */
> + if (prot & KVM_PGTABLE_STATE_BORROWED)
> + return -EPERM;
> +
> + /* Cannot share a page with restricted access */
> + if ((prot & KVM_PGTABLE_PROT_RWX) ^ KVM_PGTABLE_PROT_RWX)
nit: isn't this clearer as
if ((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RWX)
> + return -EPERM;
> +
> + /* Allow double-sharing (requires cross-checking the hyp stage-1) */
> + if (prot & KVM_PGTABLE_STATE_SHARED)
> + return hyp_range_is_shared(addr, addr + 1);
Why addr+1, rather than end?
> +
> + return 0;
> +}
> +
> +static int check_host_share_hyp(phys_addr_t start, phys_addr_t end)
> +{
> + struct kvm_pgtable_walker walker = {
> + .cb = check_host_share_hyp_walker,
> + .flags = KVM_PGTABLE_WALK_LEAF,
> + };
> +
> + return kvm_pgtable_walk(&host_kvm.pgt, start, end - start, &walker);
> +}
> +
> +int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end)
> +{
> + enum kvm_pgtable_prot prot;
> + int ret;
> +
> + if (!range_is_memory(start, end))
> + return -EINVAL;
> +
> + hyp_spin_lock(&host_kvm.lock);
> + hyp_spin_lock(&pkvm_pgd_lock);
> +
> + ret = check_host_share_hyp(start, end);
> + if (ret)
> + goto unlock;
> +
> + prot = KVM_PGTABLE_PROT_RWX | KVM_PGTABLE_STATE_SHARED;
> + ret = host_stage2_idmap_locked(start, end, prot);
Just for me to understand this better. The id mapping here, which
wasn't taking place before this patch, is for tracking, right?
Thanks,
/fuad
> + if (ret && ret != -EAGAIN)
> + goto unlock;
> +
> + prot = PAGE_HYP | KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED;
> + ret = pkvm_create_mappings_locked(__hyp_va(start), __hyp_va(end), prot);
> + /* XXX - undo host stage-2 changes if ret != 0 */
> +
> +unlock:
> + hyp_spin_unlock(&pkvm_pgd_lock);
> + hyp_spin_unlock(&host_kvm.lock);
> +
> + return ret;
> +}
> +
> int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
> {
> int ret;
> diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> index dde22e2a322a..95f6c34a38ec 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mm.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> @@ -23,8 +23,8 @@ u64 __io_map_base;
> struct memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
> unsigned int hyp_memblock_nr;
>
> -int __pkvm_create_mappings(unsigned long start, unsigned long size,
> - unsigned long phys, enum kvm_pgtable_prot prot)
> +static int __pkvm_create_mappings(unsigned long start, unsigned long size,
> + unsigned long phys, enum kvm_pgtable_prot prot)
> {
> int err;
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0625bf2353c2..2158d1e00acd 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -259,10 +259,8 @@ static int __create_hyp_mappings(unsigned long start, unsigned long size,
> {
> int err;
>
> - if (!kvm_host_owns_hyp_mappings()) {
> - return kvm_call_hyp_nvhe(__pkvm_create_mappings,
> - start, size, phys, prot);
> - }
> + if (WARN_ON(!kvm_host_owns_hyp_mappings()))
> + return -EINVAL;
>
> mutex_lock(&kvm_hyp_pgd_mutex);
> err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
> @@ -302,6 +300,14 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> if (is_kernel_in_hyp_mode())
> return 0;
>
> + if (!kvm_host_owns_hyp_mappings()) {
> + if (WARN_ON(prot != PAGE_HYP))
> + return -EPERM;
> + return kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> + kvm_kaddr_to_phys(from),
> + kvm_kaddr_to_phys(to));
> + }
> +
> start = start & PAGE_MASK;
> end = PAGE_ALIGN(end);
>
> --
> 2.32.0.402.g57bb445576-goog
>
Hi Fuad,
On Wednesday 21 Jul 2021 at 11:45:53 (+0100), Fuad Tabba wrote:
> > +static int hyp_range_is_shared_walker(u64 addr, u64 end, u32 level,
> > + kvm_pte_t *ptep,
> > + enum kvm_pgtable_walk_flags flag,
> > + void * const arg)
> > +{
> > + enum kvm_pgtable_prot prot;
> > + kvm_pte_t pte = *ptep;
> > +
> > + if (!kvm_pte_valid(pte))
> > + return -EPERM;
> > +
> > + prot = kvm_pgtable_hyp_pte_prot(pte);
> > + if (!prot)
> > + return -EPERM;
> nit: is this check necessary?
I guess not, the next one should do, I'll remove :)
> > + /* Check that the page has been shared with the hypervisor before */
> > + if (prot != (PAGE_HYP | KVM_PGTABLE_STATE_SHARED | KVM_PGTABLE_STATE_BORROWED))
> > + return -EPERM;
> > +
> > + return 0;
> > +}
> > +
> > +static int hyp_range_is_shared(phys_addr_t start, phys_addr_t end)
> > +{
> > + struct kvm_pgtable_walker walker = {
> > + .cb = hyp_range_is_shared_walker,
> > + .flags = KVM_PGTABLE_WALK_LEAF,
> > + };
> > +
> > + return kvm_pgtable_walk(&pkvm_pgtable, (u64)__hyp_va(start),
> > + end - start, &walker);
> > +}
> > +
> > +static int check_host_share_hyp_walker(u64 addr, u64 end, u32 level,
>
> nit: It seems the convention is usually addr,size or start,end. Here
> you're using addr,end.
Well for some reason all the walkers in pgtable.c follow the addr,end
convention, so I followed that. But in fact, as per [1] I might actually
get rid of this walker in v2, so hopefully that'll make the issue go
away.
[1] https://lore.kernel.org/kvmarm/[email protected]/
> > + kvm_pte_t *ptep,
> > + enum kvm_pgtable_walk_flags flag,
> > + void * const arg)
> > +{
> > + enum kvm_pgtable_prot prot;
> > + kvm_pte_t pte = *ptep;
> > +
> > + /* If invalid, only allow to share pristine pages */
> > + if (!kvm_pte_valid(pte))
> > + return pte ? -EPERM : 0;
> > +
> > + prot = kvm_pgtable_stage2_pte_prot(pte);
> > + if (!prot)
> > + return -EPERM;
> > +
> > + /* Cannot share a page that is not owned */
> > + if (prot & KVM_PGTABLE_STATE_BORROWED)
> > + return -EPERM;
> > +
> > + /* Cannot share a page with restricted access */
> > + if ((prot & KVM_PGTABLE_PROT_RWX) ^ KVM_PGTABLE_PROT_RWX)
> nit: isn't this clearer as
>
> if ((prot & KVM_PGTABLE_PROT_RWX) != KVM_PGTABLE_PROT_RWX)
I guess it would be, I'll fix it up.
> > + return -EPERM;
> > +
> > + /* Allow double-sharing (requires cross-checking the hyp stage-1) */
> > + if (prot & KVM_PGTABLE_STATE_SHARED)
> > + return hyp_range_is_shared(addr, addr + 1);
>
> Why addr+1, rather than end?
Because 'end' here is the 'end' that was passed to kvm_pgtable_walk()
IIRC. What I want to do here is check if the page I'm currently visiting
is already shared and if so, that it is shared with the hypervisor. But
it's possible that only one page in the range of pages passed to
__pkvm_host_share_hyp is already shared, so I need to check them one by
one.
Anyways, as per the discussion with Marc on [2], I'll probably switch to
only accept sharing one page at a time, so all these issues should just
go away as well!
[2] https://lore.kernel.org/kvmarm/[email protected]/
> > +
> > + return 0;
> > +}
> > +
> > +static int check_host_share_hyp(phys_addr_t start, phys_addr_t end)
> > +{
> > + struct kvm_pgtable_walker walker = {
> > + .cb = check_host_share_hyp_walker,
> > + .flags = KVM_PGTABLE_WALK_LEAF,
> > + };
> > +
> > + return kvm_pgtable_walk(&host_kvm.pgt, start, end - start, &walker);
> > +}
> > +
> > +int __pkvm_host_share_hyp(phys_addr_t start, phys_addr_t end)
> > +{
> > + enum kvm_pgtable_prot prot;
> > + int ret;
> > +
> > + if (!range_is_memory(start, end))
> > + return -EINVAL;
> > +
> > + hyp_spin_lock(&host_kvm.lock);
> > + hyp_spin_lock(&pkvm_pgd_lock);
> > +
> > + ret = check_host_share_hyp(start, end);
> > + if (ret)
> > + goto unlock;
> > +
> > + prot = KVM_PGTABLE_PROT_RWX | KVM_PGTABLE_STATE_SHARED;
> > + ret = host_stage2_idmap_locked(start, end, prot);
>
> Just for me to understand this better. The id mapping here, which
> wasn't taking place before this patch, is for tracking, right?
Yes, exactly, I want to make sure to mark the page as shared (and
borrowed) in the relevant page-tables to not forget about it :)
Cheers,
Quentin