Hi all,
This is v2 of the patch series first posted here:
https://lore.kernel.org/kvmarm/[email protected]/
This series aims to improve how the nVHE hypervisor tracks ownership of
memory pages when running in protected mode ("kvm-arm.mode=protected" on
the kernel command line).
The main issue with the existing ownership tracking code is that it is
completely binary: a page is either owned by an entity (e.g. the host)
or not. However, we'll need something smarter to track shared pages, as
is needed for virtio, or even just for host/hypervisor communication.
This series introduces a few changes to the kvm page-table library to
allow annotating shared pages in ignored bits (a.k.a. software bits) of
leaf entries, and makes use of that infrastructure to track all pages
that are shared between the host and the hypervisor. We will obviously
want to apply the same treatment to guest stage-2 page-tables, but that
is not really possible to do until EL2 manages them directly, so I'll
keep that for another series.
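To give a rough idea of what these annotations look like, here is a minimal
sketch (not part of the series itself; the names are introduced by the patches
below) of how a page state ends up being carried in the prot attributes of a
mapping:

/*
 * Illustrative sketch only: PKVM_PAGE_SHARED_BORROWED, pkvm_mkstate() and
 * pkvm_create_mappings() are all introduced by the patches in this series.
 * The page state lives in the software bits ([58:55]) of the leaf entry and
 * is simply folded into the protection attributes at map time.
 */
static int hyp_map_shared_page(void *va)
{
        enum kvm_pgtable_prot prot;

        prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
        return pkvm_create_mappings(va, va + PAGE_SIZE, prot);
}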
The series is based on the latest kvmarm/fixes branch, and has been
tested on AML-S905X-CC (Le Potato) and using various Qemu configurations.
Changes since v1:
- Changed the 'share' hypercall to accept a single page at a time;
- Dropped the patch allowing the stage-2 map walker to continue when
hitting the -EAGAIN case;
- Dropped some of the custom pgtable walkers and used Marc's get_leaf()
patch instead;
- Changed pgtable API to manipulate SW bits directly rather than
specifying shared pages;
- Added comments and documentation throughout;
- Cleanups and small refactoring.
Thanks,
Quentin
Marc Zyngier (1):
KVM: arm64: Introduce helper to retrieve a PTE and its level
Quentin Perret (15):
KVM: arm64: Provide the host_stage2_try() helper macro
KVM: arm64: Expose page-table helpers
KVM: arm64: Optimize host memory aborts
KVM: arm64: Rename KVM_PTE_LEAF_ATTR_S2_IGNORED
KVM: arm64: Don't overwrite software bits with owner id
KVM: arm64: Tolerate re-creating hyp mappings to set software bits
KVM: arm64: Enable forcing page-level stage-2 mappings
KVM: arm64: Allow populating software bits
KVM: arm64: Add helpers to tag shared pages in SW bits
KVM: arm64: Introduce and export host_stage2_idmap_locked()
KVM: arm64: Mark host bss and rodata section as shared
KVM: arm64: Enable retrieving protections attributes of PTEs
KVM: arm64: Refactor protected nVHE stage-1 locking
KVM: arm64: Restrict EL2 stage-1 changes in protected mode
KVM: arm64: Make __pkvm_create_mappings static
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/include/asm/kvm_pgtable.h | 153 ++++++++----
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 29 +++
arch/arm64/kvm/hyp/include/nvhe/mm.h | 3 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 188 ++++++++++++--
arch/arm64/kvm/hyp/nvhe/mm.c | 21 +-
arch/arm64/kvm/hyp/nvhe/setup.c | 52 +++-
arch/arm64/kvm/hyp/pgtable.c | 234 ++++++++++--------
arch/arm64/kvm/mmu.c | 28 ++-
10 files changed, 525 insertions(+), 196 deletions(-)
--
2.32.0.432.gabb21c7263-goog
From: Marc Zyngier <[email protected]>
It is becoming a common need to fetch the PTE for a given address
together with its level. Add such a helper.
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 19 ++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 39 ++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index f004c0115d89..082b9d65f40b 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -432,6 +432,25 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size);
int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
struct kvm_pgtable_walker *walker);
+/**
+ * kvm_pgtable_get_leaf() - Walk a page-table and retrieve the leaf entry
+ * with its level.
+ * @pgt: Page-table structure initialised by kvm_pgtable_*_init().
+ * @addr: Input address for the start of the walk.
+ * @ptep: Pointer to storage for the retrieved PTE.
+ * @level: Pointer to storage for the level of the retrieved PTE.
+ *
+ * The offset of @addr within a page is ignored.
+ *
+ * The walker will walk the page-table entries corresponding to the input
+ * address specified, retrieving the leaf corresponding to this address.
+ * Invalid entries are treated as leaf entries.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
+ kvm_pte_t *ptep, u32 *level);
+
/**
* kvm_pgtable_stage2_find_range() - Find a range of Intermediate Physical
* Addresses with compatible permission
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 05321f4165e3..78f36bd5df6c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -326,6 +326,45 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
return _kvm_pgtable_walk(&walk_data);
}
+struct leaf_walk_data {
+ kvm_pte_t pte;
+ u32 level;
+};
+
+static int leaf_walker(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
+ enum kvm_pgtable_walk_flags flag, void * const arg)
+{
+ struct leaf_walk_data *data = arg;
+
+ data->pte = *ptep;
+ data->level = level;
+
+ return 0;
+}
+
+int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
+ kvm_pte_t *ptep, u32 *level)
+{
+ struct leaf_walk_data data;
+ struct kvm_pgtable_walker walker = {
+ .cb = leaf_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ .arg = &data,
+ };
+ int ret;
+
+ ret = kvm_pgtable_walk(pgt, ALIGN_DOWN(addr, PAGE_SIZE),
+ PAGE_SIZE, &walker);
+ if (!ret) {
+ if (ptep)
+ *ptep = data.pte;
+ if (level)
+ *level = data.level;
+ }
+
+ return ret;
+}
+
struct hyp_map_data {
u64 phys;
kvm_pte_t attr;
--
2.32.0.432.gabb21c7263-goog
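As an aside, a minimal (hypothetical) example of how a caller outside
pgtable.c might use the new helper; kvm_pte_valid() and kvm_granule_size()
are only exposed to such callers by a later patch in this series:

/* Illustrative only: look up the entry covering @addr and report its size. */
static int inspect_leaf(struct kvm_pgtable *pgt, u64 addr, u64 *granule)
{
        kvm_pte_t pte;
        u32 level;
        int ret;

        ret = kvm_pgtable_get_leaf(pgt, addr, &pte, &level);
        if (ret)
                return ret;

        /* Invalid entries are returned as leaves too, so check validity. */
        if (!kvm_pte_valid(pte))
                return -ENOENT;

        *granule = kvm_granule_size(level);
        return 0;
}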
The ignored bits for both stage-1 and stage-2 page and block
descriptors are in bits [58:55], so rename KVM_PTE_LEAF_ATTR_S2_IGNORED to
make it applicable to both. And while at it, since these bits are more
commonly known as 'software' bits, rename accordingly.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 55199e579863..516c1b8ce6b8 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -39,6 +39,8 @@
#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 51)
+#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
+
#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
@@ -47,8 +49,6 @@
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)
-#define KVM_PTE_LEAF_ATTR_S2_IGNORED GENMASK(58, 55)
-
#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
#define KVM_MAX_OWNER_ID 1
--
2.32.0.432.gabb21c7263-goog
We will soon start annotating page-tables with new flags to track shared
pages and such, and we will do so in valid mappings using software bits
in the PTEs, as provided by the architecture. However, it is possible
that we will need to use those flags to annotate invalid mappings as
well in the future, similar to what we do to track page ownership in the
host stage-2.
In order to facilitate the annotation of invalid mappings with such
flags, it would be preferable to re-use the same bits as for valid
mappings (bits [58:55]), but these are currently used for ownership
encoding. Since we have plenty of bits left to use in invalid
mappings, move the ownership bits further down the PTE to avoid the
conflict.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 516c1b8ce6b8..b5ca21b44b6a 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -49,7 +49,7 @@
KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
KVM_PTE_LEAF_ATTR_HI_S2_XN)
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(63, 56)
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
#define KVM_MAX_OWNER_ID 1
struct kvm_pgtable_walk_data {
--
2.32.0.432.gabb21c7263-goog
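For context, a sketch of the invalid-PTE layout after this change, mirroring
the owner-id encoding already done in pgtable.c; the combined-annotation
helper below is hypothetical and only illustrates that the two bit ranges no
longer overlap:

/*
 * Sketch only: bit 0 is clear (invalid entry), the owner id now sits in
 * bits [9:2], and bits [58:55] are left free so invalid entries could
 * carry the same software annotations as valid ones in the future.
 */
static kvm_pte_t make_annotated_invalid_pte(u8 owner_id, kvm_pte_t sw_bits)
{
        return FIELD_PREP(KVM_INVALID_PTE_OWNER_MASK, owner_id) |
               (sw_bits & KVM_PTE_LEAF_ATTR_HI_SW);
}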
The kvm_pgtable_stage2_find_range() function is used in the host memory
abort path to try and look for the largest block mapping that can be
used to map the faulting address. In order to do so, the function
currently walks the stage-2 page-table and looks for existing
incompatible mappings within the range of the largest possible block.
If incompatible mappings are found, it tries the same procedure again,
but using a smaller block range, and repeats until a matching range is
found (potentially up to page granularity). While this approach has
benefits (mostly in the fact that it proactively coalesces host stage-2
mappings), it can be slow if the ranges are fragmented, and it isn't
optimized to deal with CPUs faulting on the same IPA as all of them will
do all the work every time.
To avoid these issues, remove kvm_pgtable_stage2_find_range(), and walk
the page-table only once in the host_mem_abort() path to find the
closest leaf to the input address. With this, use the corresponding
range if it is invalid and not owned by another entity. If a valid leaf
is found, return -EAGAIN similar to what is done in the
kvm_pgtable_stage2_map() path to optimize concurrent faults.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 30 -----------
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 39 +++++++++++++-
arch/arm64/kvm/hyp/pgtable.c | 74 ---------------------------
3 files changed, 38 insertions(+), 105 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 5a7a13bbd4a1..cec76a49f521 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -141,16 +141,6 @@ enum kvm_pgtable_prot {
#define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
#define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
-/**
- * struct kvm_mem_range - Range of Intermediate Physical Addresses
- * @start: Start of the range.
- * @end: End of the range.
- */
-struct kvm_mem_range {
- u64 start;
- u64 end;
-};
-
/**
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
@@ -477,24 +467,4 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
*/
int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
kvm_pte_t *ptep, u32 *level);
-
-/**
- * kvm_pgtable_stage2_find_range() - Find a range of Intermediate Physical
- * Addresses with compatible permission
- * attributes.
- * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
- * @addr: Address that must be covered by the range.
- * @prot: Protection attributes that the range must be compatible with.
- * @range: Range structure used to limit the search space at call time and
- * that will hold the result.
- *
- * The offset of @addr within a page is ignored. An IPA is compatible with @prot
- * iff its corresponding stage-2 page-table entry has default ownership and, if
- * valid, is mapped with protection attributes identical to @prot.
- *
- * Return: 0 on success, negative error code on failure.
- */
-int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
- enum kvm_pgtable_prot prot,
- struct kvm_mem_range *range);
#endif /* __ARM64_KVM_PGTABLE_H__ */
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 871149246f5f..01700a908bb7 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -159,6 +159,11 @@ static int host_stage2_unmap_dev_all(void)
return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr);
}
+struct kvm_mem_range {
+ u64 start;
+ u64 end;
+};
+
static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range)
{
int cur, left = 0, right = hyp_memblock_nr;
@@ -227,6 +232,38 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
__ret; \
})
+static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
+{
+ u64 granule, start, end;
+ kvm_pte_t pte;
+ u32 level;
+ int ret;
+
+ ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level);
+ if (ret)
+ return ret;
+
+ if (kvm_pte_valid(pte))
+ return -EAGAIN;
+
+ if (pte)
+ return -EPERM;
+
+ do {
+ granule = kvm_granule_size(level);
+ start = ALIGN_DOWN(addr, granule);
+ end = start + granule;
+ level++;
+ } while ((level < KVM_PGTABLE_MAX_LEVELS) &&
+ (!kvm_level_supports_block_mapping(level) ||
+ start < range->start || range->end < end));
+
+ range->start = start;
+ range->end = end;
+
+ return 0;
+}
+
static int host_stage2_idmap(u64 addr)
{
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
@@ -238,7 +275,7 @@ static int host_stage2_idmap(u64 addr)
prot |= KVM_PGTABLE_PROT_X;
hyp_spin_lock(&host_kvm.lock);
- ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
+ ret = host_stage2_find_range(addr, &range);
if (ret)
goto unlock;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 2c5d4d3e31cc..55199e579863 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1115,77 +1115,3 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
pgt->mm_ops->free_pages_exact(pgt->pgd, pgd_sz);
pgt->pgd = NULL;
}
-
-#define KVM_PTE_LEAF_S2_COMPAT_MASK (KVM_PTE_LEAF_ATTR_S2_PERMS | \
- KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | \
- KVM_PTE_LEAF_ATTR_S2_IGNORED)
-
-static int stage2_check_permission_walker(u64 addr, u64 end, u32 level,
- kvm_pte_t *ptep,
- enum kvm_pgtable_walk_flags flag,
- void * const arg)
-{
- kvm_pte_t old_attr, pte = *ptep, *new_attr = arg;
-
- /*
- * Compatible mappings are either invalid and owned by the page-table
- * owner (whose id is 0), or valid with matching permission attributes.
- */
- if (kvm_pte_valid(pte)) {
- old_attr = pte & KVM_PTE_LEAF_S2_COMPAT_MASK;
- if (old_attr != *new_attr)
- return -EEXIST;
- } else if (pte) {
- return -EEXIST;
- }
-
- return 0;
-}
-
-int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
- enum kvm_pgtable_prot prot,
- struct kvm_mem_range *range)
-{
- kvm_pte_t attr;
- struct kvm_pgtable_walker check_perm_walker = {
- .cb = stage2_check_permission_walker,
- .flags = KVM_PGTABLE_WALK_LEAF,
- .arg = &attr,
- };
- u64 granule, start, end;
- u32 level;
- int ret;
-
- ret = stage2_set_prot_attr(pgt, prot, &attr);
- if (ret)
- return ret;
- attr &= KVM_PTE_LEAF_S2_COMPAT_MASK;
-
- for (level = pgt->start_level; level < KVM_PGTABLE_MAX_LEVELS; level++) {
- granule = kvm_granule_size(level);
- start = ALIGN_DOWN(addr, granule);
- end = start + granule;
-
- if (!kvm_level_supports_block_mapping(level))
- continue;
-
- if (start < range->start || range->end < end)
- continue;
-
- /*
- * Check the presence of existing mappings with incompatible
- * permissions within the current block range, and try one level
- * deeper if one is found.
- */
- ret = kvm_pgtable_walk(pgt, start, granule, &check_perm_walker);
- if (ret != -EEXIST)
- break;
- }
-
- if (!ret) {
- range->start = start;
- range->end = end;
- }
-
- return ret;
-}
--
2.32.0.432.gabb21c7263-goog
The current hypervisor stage-1 mapping code doesn't allow changing an
existing valid mapping. Relax this condition by allowing changes that
only target software bits, as that will soon be needed to annotate shared
pages.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/pgtable.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index b5ca21b44b6a..93cc9de4d46c 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -375,6 +375,21 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
return 0;
}
+static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
+{
+ /*
+ * Tolerate KVM recreating the exact same mapping, or changing software
+ * bits if the existing mapping was valid.
+ */
+ if (old == new)
+ return false;
+
+ if (!kvm_pte_valid(old))
+ return true;
+
+ return !WARN_ON((old ^ new) & ~KVM_PTE_LEAF_ATTR_HI_SW);
+}
+
static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep, struct hyp_map_data *data)
{
@@ -384,9 +399,8 @@ static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
if (!kvm_block_mapping_supported(addr, end, phys, level))
return false;
- /* Tolerate KVM recreating the exact same mapping */
new = kvm_init_valid_leaf_pte(phys, data->attr, level);
- if (old != new && !WARN_ON(kvm_pte_valid(old)))
+ if (hyp_pte_needs_update(old, new))
smp_store_release(ptep, new);
data->phys += granule;
--
2.32.0.432.gabb21c7263-goog
Introduce infrastructure for manipulating software bits in stage-1
and stage-2 page-tables using additional entries in the kvm_pgtable_prot
enum.
This is heavily inspired by Marc's implementation of a similar feature
in the NV patch series, but adapted to allow stage-1 changes as well:
https://lore.kernel.org/kvmarm/[email protected]/
Suggested-by: Marc Zyngier <[email protected]>
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 12 +++++++++++-
arch/arm64/kvm/hyp/pgtable.c | 5 +++++
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 9246a27e2839..0be9f83974ad 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -108,6 +108,10 @@ enum kvm_pgtable_stage2_flags {
* @KVM_PGTABLE_PROT_W: Write permission.
* @KVM_PGTABLE_PROT_R: Read permission.
* @KVM_PGTABLE_PROT_DEVICE: Device attributes.
+ * @KVM_PGTABLE_PROT_SW0: Software bit 0.
+ * @KVM_PGTABLE_PROT_SW1: Software bit 1.
+ * @KVM_PGTABLE_PROT_SW2: Software bit 2.
+ * @KVM_PGTABLE_PROT_SW3: Software bit 3.
*/
enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_X = BIT(0),
@@ -115,6 +119,11 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_R = BIT(2),
KVM_PGTABLE_PROT_DEVICE = BIT(3),
+
+ KVM_PGTABLE_PROT_SW0 = BIT(55),
+ KVM_PGTABLE_PROT_SW1 = BIT(56),
+ KVM_PGTABLE_PROT_SW2 = BIT(57),
+ KVM_PGTABLE_PROT_SW3 = BIT(58),
};
#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
@@ -406,7 +415,8 @@ kvm_pte_t kvm_pgtable_stage2_mkold(struct kvm_pgtable *pgt, u64 addr);
* If there is a valid, leaf page-table entry used to translate @addr, then
* relax the permissions in that entry according to the read, write and
* execute permissions specified by @prot. No permissions are removed, and
- * TLB invalidation is performed after updating the entry.
+ * TLB invalidation is performed after updating the entry. Software bits cannot
+ * be set or cleared using kvm_pgtable_stage2_relax_perms().
*
* Return: 0 on success, negative error code on failure.
*/
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index e0cd748e4af6..bd409d524dea 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -370,6 +370,7 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_AP, ap);
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S1_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S1_AF;
+ attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
*ptep = attr;
return 0;
@@ -571,6 +572,7 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
attr |= FIELD_PREP(KVM_PTE_LEAF_ATTR_LO_S2_SH, sh);
attr |= KVM_PTE_LEAF_ATTR_LO_S2_AF;
+ attr |= prot & KVM_PTE_LEAF_ATTR_HI_SW;
*ptep = attr;
return 0;
@@ -1038,6 +1040,9 @@ int kvm_pgtable_stage2_relax_perms(struct kvm_pgtable *pgt, u64 addr,
u32 level;
kvm_pte_t set = 0, clr = 0;
+ if (prot & KVM_PTE_LEAF_ATTR_HI_SW)
+ return -EINVAL;
+
if (prot & KVM_PGTABLE_PROT_R)
set |= KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R;
--
2.32.0.432.gabb21c7263-goog
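As a quick, hypothetical illustration of the new API, software bits can now
be passed alongside regular permissions through the existing map calls:

/*
 * Illustrative only: create a read-only hyp stage-1 mapping and tag the
 * leaf entry with software bit 0 in the same operation. The bit lands in
 * the [58:55] software-bit range of the PTE, next to the regular attributes.
 */
static int hyp_map_tagged_ro(struct kvm_pgtable *pgt, u64 va, u64 phys)
{
        enum kvm_pgtable_prot prot = PAGE_HYP_RO | KVM_PGTABLE_PROT_SW0;

        return kvm_pgtable_hyp_map(pgt, va, PAGE_SIZE, phys, prot);
}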
We currently unmap all MMIO mappings from the host stage-2 to recycle
the pages whenever we run out. In order to make this pattern easy to
re-use from other places, factor the logic out into a dedicated macro.
While at it, apply the macro to the kvm_pgtable_stage2_set_owner()
calls. They're currently only called early on and are guaranteed to
succeed, but making them robust to the -ENOMEM case doesn't hurt and
will avoid painful debugging sessions later on.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 40 +++++++++++++++------------
1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index d938ce95d3bd..871149246f5f 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -208,6 +208,25 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
prot, &host_s2_pool);
}
+/*
+ * The pool has been provided with enough pages to cover all of memory with
+ * page granularity, but it is difficult to know how much of the MMIO range
+ * we will need to cover upfront, so we may need to 'recycle' the pages if we
+ * run out.
+ *
+ * Must be called with host_kvm.lock held.
+ */
+#define host_stage2_try(fn, ...) \
+ ({ \
+ int __ret = fn(__VA_ARGS__); \
+ if (__ret == -ENOMEM) { \
+ __ret = host_stage2_unmap_dev_all(); \
+ if (!__ret) \
+ __ret = fn(__VA_ARGS__); \
+ } \
+ __ret; \
+ })
+
static int host_stage2_idmap(u64 addr)
{
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
@@ -223,22 +242,7 @@ static int host_stage2_idmap(u64 addr)
if (ret)
goto unlock;
- ret = __host_stage2_idmap(range.start, range.end, prot);
- if (ret != -ENOMEM)
- goto unlock;
-
- /*
- * The pool has been provided with enough pages to cover all of memory
- * with page granularity, but it is difficult to know how much of the
- * MMIO range we will need to cover upfront, so we may need to 'recycle'
- * the pages if we run out.
- */
- ret = host_stage2_unmap_dev_all();
- if (ret)
- goto unlock;
-
- ret = __host_stage2_idmap(range.start, range.end, prot);
-
+ ret = host_stage2_try(__host_stage2_idmap, range.start, range.end, prot);
unlock:
hyp_spin_unlock(&host_kvm.lock);
@@ -257,8 +261,8 @@ int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
return -EINVAL;
hyp_spin_lock(&host_kvm.lock);
- ret = kvm_pgtable_stage2_set_owner(&host_kvm.pgt, start, end - start,
- &host_s2_pool, pkvm_hyp_id);
+ ret = host_stage2_try(kvm_pgtable_stage2_set_owner, &host_kvm.pgt,
+ start, end - start, &host_s2_pool, pkvm_hyp_id);
hyp_spin_unlock(&host_kvm.lock);
return ret != -EAGAIN ? ret : 0;
--
2.32.0.432.gabb21c7263-goog
We will soon start annotating shared pages in page-tables in nVHE
protected mode. Define all the states in which a page can be (owned,
shared and owned, shared and borrowed), and provide helpers to convert
this state into software-bit annotations carried in the matching prot
attributes.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 27 +++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 9c227d87c36d..facbd9a7ab99 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -12,6 +12,33 @@
#include <asm/virt.h>
#include <nvhe/spinlock.h>
+/*
+ * SW bits 0-1 are reserved to track the memory ownership state of each page:
+ * 00: The page is owned solely by the page-table owner.
+ * 01: The page is owned by the page-table owner, but is shared
+ * with another entity.
+ * 10: The page is shared with, but not owned by the page-table owner.
+ * 11: Reserved for future use (lending).
+ */
+enum pkvm_page_state
+{
+ PKVM_PAGE_OWNED = 0ULL,
+ PKVM_PAGE_SHARED_OWNED = KVM_PGTABLE_PROT_SW0,
+ PKVM_PAGE_SHARED_BORROWED = KVM_PGTABLE_PROT_SW1,
+};
+
+#define PKVM_PAGE_STATE_PROT_MASK (KVM_PGTABLE_PROT_SW0 | KVM_PGTABLE_PROT_SW1)
+static inline enum kvm_pgtable_prot pkvm_mkstate(enum kvm_pgtable_prot prot,
+ enum pkvm_page_state state)
+{
+ return (prot & ~PKVM_PAGE_STATE_PROT_MASK) | state;
+}
+
+static inline enum pkvm_page_state pkvm_getstate(enum kvm_pgtable_prot prot)
+{
+ return prot & PKVM_PAGE_STATE_PROT_MASK;
+}
+
struct host_kvm {
struct kvm_arch arch;
struct kvm_pgtable pgt;
--
2.32.0.432.gabb21c7263-goog
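To show how the two helpers are meant to compose, a short hypothetical
sketch; kvm_pgtable_hyp_pte_prot() is only introduced later in this series:

/* Build the prot used to map a page the hypervisor borrows from the host. */
static enum kvm_pgtable_prot borrowed_page_prot(void)
{
        return pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
}

/* And read the state back from an existing hyp stage-1 entry. */
static bool hyp_pte_is_borrowed(kvm_pte_t pte)
{
        enum kvm_pgtable_prot prot = kvm_pgtable_hyp_pte_prot(pte);

        return pkvm_getstate(prot) == PKVM_PAGE_SHARED_BORROWED;
}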
We will need to manipulate the host stage-2 page-table from outside
mem_protect.c soon. Introduce a function wrapping the host_stage2_try()
call and make it available to users of mem_protect.h.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index facbd9a7ab99..8e5725d032b2 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -50,6 +50,7 @@ extern struct host_kvm host_kvm;
int __pkvm_prot_finalize(void);
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
+int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
int kvm_host_prepare_stage2(void *pgt_pool_base);
void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt);
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 2d41d4fa8901..223c541a7051 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -266,6 +266,11 @@ static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
return 0;
}
+int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot)
+{
+ return host_stage2_try(__host_stage2_idmap, start, end, prot);
+}
+
static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot)
{
/*
@@ -303,7 +308,7 @@ static int host_stage2_idmap(u64 addr)
if (ret)
goto unlock;
- ret = host_stage2_try(__host_stage2_idmap, range.start, range.end, prot);
+ ret = host_stage2_idmap_locked(range.start, range.end, prot);
unlock:
hyp_spin_unlock(&host_kvm.lock);
--
2.32.0.432.gabb21c7263-goog
The KVM pgtable API exposes the kvm_pgtable_walk() function to allow
the definition of walkers outside of pgtable.c. However, it is not easy
to implement any of those walkers without some of the low-level helpers.
Move some of them to the header file so they can be re-used elsewhere.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 27 +++++++++++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 26 --------------------------
2 files changed, 27 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 082b9d65f40b..5a7a13bbd4a1 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -25,6 +25,33 @@ static inline u64 kvm_get_parange(u64 mmfr0)
typedef u64 kvm_pte_t;
+#define KVM_PTE_VALID BIT(0)
+
+static inline bool kvm_pte_valid(kvm_pte_t pte)
+{
+ return pte & KVM_PTE_VALID;
+}
+
+static inline u64 kvm_granule_shift(u32 level)
+{
+ /* Assumes KVM_PGTABLE_MAX_LEVELS is 4 */
+ return ARM64_HW_PGTABLE_LEVEL_SHIFT(level);
+}
+
+static inline u64 kvm_granule_size(u32 level)
+{
+ return BIT(kvm_granule_shift(level));
+}
+
+static inline bool kvm_level_supports_block_mapping(u32 level)
+{
+ /*
+ * Reject invalid block mappings and don't bother with 4TB mappings for
+ * 52-bit PAs.
+ */
+ return !(level == 0 || (PAGE_SIZE != SZ_4K && level == 1));
+}
+
/**
* struct kvm_pgtable_mm_ops - Memory management callbacks.
* @zalloc_page: Allocate a single zeroed memory page.
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78f36bd5df6c..2c5d4d3e31cc 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -11,7 +11,6 @@
#include <asm/kvm_pgtable.h>
#include <asm/stage2_pgtable.h>
-#define KVM_PTE_VALID BIT(0)
#define KVM_PTE_TYPE BIT(1)
#define KVM_PTE_TYPE_BLOCK 0
@@ -61,17 +60,6 @@ struct kvm_pgtable_walk_data {
u64 end;
};
-static u64 kvm_granule_shift(u32 level)
-{
- /* Assumes KVM_PGTABLE_MAX_LEVELS is 4 */
- return ARM64_HW_PGTABLE_LEVEL_SHIFT(level);
-}
-
-static u64 kvm_granule_size(u32 level)
-{
- return BIT(kvm_granule_shift(level));
-}
-
#define KVM_PHYS_INVALID (-1ULL)
static bool kvm_phys_is_valid(u64 phys)
@@ -79,15 +67,6 @@ static bool kvm_phys_is_valid(u64 phys)
return phys < BIT(id_aa64mmfr0_parange_to_phys_shift(ID_AA64MMFR0_PARANGE_MAX));
}
-static bool kvm_level_supports_block_mapping(u32 level)
-{
- /*
- * Reject invalid block mappings and don't bother with 4TB mappings for
- * 52-bit PAs.
- */
- return !(level == 0 || (PAGE_SIZE != SZ_4K && level == 1));
-}
-
static bool kvm_block_mapping_supported(u64 addr, u64 end, u64 phys, u32 level)
{
u64 granule = kvm_granule_size(level);
@@ -135,11 +114,6 @@ static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
}
-static bool kvm_pte_valid(kvm_pte_t pte)
-{
- return pte & KVM_PTE_VALID;
-}
-
static bool kvm_pte_table(kvm_pte_t pte, u32 level)
{
if (level == KVM_PGTABLE_MAX_LEVELS - 1)
--
2.32.0.432.gabb21c7263-goog
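For example, a walker like the one below (purely illustrative) can now be
written entirely outside pgtable.c using the exposed helpers:

/* Illustrative only: count the valid leaf entries covering a range. */
static int count_valid_cb(u64 addr, u64 end, u32 level, kvm_pte_t *ptep,
                          enum kvm_pgtable_walk_flags flag, void * const arg)
{
        u64 *count = arg;

        if (kvm_pte_valid(*ptep))
                (*count)++;

        return 0;
}

static u64 count_valid_leaves(struct kvm_pgtable *pgt, u64 addr, u64 size)
{
        u64 count = 0;
        struct kvm_pgtable_walker walker = {
                .cb     = count_valid_cb,
                .flags  = KVM_PGTABLE_WALK_LEAF,
                .arg    = &count,
        };

        /* Walk errors are ignored here for brevity. */
        kvm_pgtable_walk(pgt, addr, size, &walker);

        return count;
}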
As the hypervisor maps the host's .bss and .rodata sections in its
stage-1, make sure to tag them as shared in hyp and host page-tables.
But since the hypervisor relies on the presence of these mappings, we
cannot leave the host in complete control of the memory regions -- it
must not, for example, unshare or donate them to another entity. To
prevent this, let's transfer the ownership of those ranges to the
hypervisor itself, and share the pages back with the host.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/setup.c | 52 ++++++++++++++++++++++++++++-----
1 file changed, 44 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/setup.c b/arch/arm64/kvm/hyp/nvhe/setup.c
index 0b574d106519..285c8aea5065 100644
--- a/arch/arm64/kvm/hyp/nvhe/setup.c
+++ b/arch/arm64/kvm/hyp/nvhe/setup.c
@@ -58,6 +58,7 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
{
void *start, *end, *virt = hyp_phys_to_virt(phys);
unsigned long pgt_size = hyp_s1_pgtable_pages() << PAGE_SHIFT;
+ enum kvm_pgtable_prot prot;
int ret, i;
/* Recreate the hyp page-table using the early page allocator */
@@ -83,10 +84,6 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
if (ret)
return ret;
- ret = pkvm_create_mappings(__start_rodata, __end_rodata, PAGE_HYP_RO);
- if (ret)
- return ret;
-
ret = pkvm_create_mappings(__hyp_rodata_start, __hyp_rodata_end, PAGE_HYP_RO);
if (ret)
return ret;
@@ -95,10 +92,6 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
if (ret)
return ret;
- ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, PAGE_HYP_RO);
- if (ret)
- return ret;
-
ret = pkvm_create_mappings(virt, virt + size, PAGE_HYP);
if (ret)
return ret;
@@ -117,6 +110,24 @@ static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
return ret;
}
+ /*
+ * Map the host's .bss and .rodata sections RO in the hypervisor, but
+ * transfer the ownership from the host to the hypervisor itself to
+ * make sure they can't be donated or shared with another entity.
+ *
+ * The ownership transition requires matching changes in the host
+ * stage-2. This will be done later (see finalize_mappings()) once the
+ * hyp_vmemmap is addressable.
+ */
+ prot = pkvm_mkstate(PAGE_HYP_RO, PKVM_PAGE_SHARED_OWNED);
+ ret = pkvm_create_mappings(__start_rodata, __end_rodata, prot);
+ if (ret)
+ return ret;
+
+ ret = pkvm_create_mappings(__hyp_bss_end, __bss_stop, prot);
+ if (ret)
+ return ret;
+
return 0;
}
@@ -148,6 +159,27 @@ static void hpool_put_page(void *addr)
hyp_put_page(&hpool, addr);
}
+static int finalize_mappings(void)
+{
+ enum kvm_pgtable_prot prot;
+ int ret;
+
+ /*
+ * The host's .bss and .rodata sections are now conceptually owned by
+ * the hypervisor, so mark them as 'borrowed' in the host stage-2. We
+ * can safely use host_stage2_idmap_locked() at this point since the
+ * host stage-2 has not been enabled yet.
+ */
+ prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_BORROWED);
+ ret = host_stage2_idmap_locked(__hyp_pa(__start_rodata),
+ __hyp_pa(__end_rodata), prot);
+ if (ret)
+ return ret;
+
+ return host_stage2_idmap_locked(__hyp_pa(__hyp_bss_end),
+ __hyp_pa(__bss_stop), prot);
+}
+
void __noreturn __pkvm_init_finalise(void)
{
struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
@@ -167,6 +199,10 @@ void __noreturn __pkvm_init_finalise(void)
if (ret)
goto out;
+ ret = finalize_mappings();
+ if (ret)
+ goto out;
+
pkvm_pgtable_mm_ops = (struct kvm_pgtable_mm_ops) {
.zalloc_page = hyp_zalloc_hyp_page,
.phys_to_virt = hyp_phys_to_virt,
--
2.32.0.432.gabb21c7263-goog
Introduce helper functions in the KVM stage-2 and stage-1 page-table
manipulation library to retrieve the enum kvm_pgtable_prot of a PTE.
This will be useful for implementing custom walkers outside of
pgtable.c.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 20 +++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 37 ++++++++++++++++++++++++++++
2 files changed, 57 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 0be9f83974ad..f253fb5bcee2 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -492,4 +492,24 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
*/
int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
kvm_pte_t *ptep, u32 *level);
+
+/**
+ * kvm_pgtable_stage2_pte_prot() - Retrieve the protection attributes of a
+ * stage-2 Page-Table Entry.
+ * @pte: Page-table entry
+ *
+ * Return: protection attributes of the page-table entry in the enum
+ * kvm_pgtable_prot format.
+ */
+enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
+
+/**
+ * kvm_pgtable_hyp_pte_prot() - Retrieve the protection attributes of a stage-1
+ * Page-Table Entry.
+ * @pte: Page-table entry
+ *
+ * Return: protection attributes of the page-table entry in the enum
+ * kvm_pgtable_prot format.
+ */
+enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
#endif /* __ARM64_KVM_PGTABLE_H__ */
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index bd409d524dea..fb80d3019cff 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -376,6 +376,26 @@ static int hyp_set_prot_attr(enum kvm_pgtable_prot prot, kvm_pte_t *ptep)
return 0;
}
+enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte)
+{
+ enum kvm_pgtable_prot prot = pte & KVM_PTE_LEAF_ATTR_HI_SW;
+ u32 ap;
+
+ if (!kvm_pte_valid(pte))
+ return prot;
+
+ if (!(pte & KVM_PTE_LEAF_ATTR_HI_S1_XN))
+ prot |= KVM_PGTABLE_PROT_X;
+
+ ap = FIELD_GET(KVM_PTE_LEAF_ATTR_LO_S1_AP, pte);
+ if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RO)
+ prot |= KVM_PGTABLE_PROT_R;
+ else if (ap == KVM_PTE_LEAF_ATTR_LO_S1_AP_RW)
+ prot |= KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+
+ return prot;
+}
+
static bool hyp_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
{
/*
@@ -578,6 +598,23 @@ static int stage2_set_prot_attr(struct kvm_pgtable *pgt, enum kvm_pgtable_prot p
return 0;
}
+enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte)
+{
+ enum kvm_pgtable_prot prot = pte & KVM_PTE_LEAF_ATTR_HI_SW;
+
+ if (!kvm_pte_valid(pte))
+ return prot;
+
+ if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R)
+ prot |= KVM_PGTABLE_PROT_R;
+ if (pte & KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W)
+ prot |= KVM_PGTABLE_PROT_W;
+ if (!(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN))
+ prot |= KVM_PGTABLE_PROT_X;
+
+ return prot;
+}
+
static bool stage2_pte_needs_update(kvm_pte_t old, kvm_pte_t new)
{
if (!kvm_pte_valid(old) || !kvm_pte_valid(new))
--
2.32.0.432.gabb21c7263-goog
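A short, hypothetical example of the intended use, combining the new helper
with the page-state accessors from earlier in the series (the latter are only
available in the nVHE hyp object):

/* Illustrative only: is this stage-2 entry valid, writable and shared? */
static bool stage2_pte_is_shared_writable(kvm_pte_t pte)
{
        enum kvm_pgtable_prot prot;

        if (!kvm_pte_valid(pte))
                return false;

        prot = kvm_pgtable_stage2_pte_prot(pte);

        return (prot & KVM_PGTABLE_PROT_W) &&
               (pkvm_getstate(prot) != PKVM_PAGE_OWNED);
}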
The host kernel is currently able to change EL2 stage-1 mappings without
restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
world where the host is no longer part of the TCB, this clearly poses a
problem.
To fix this, introduce a new hypercall to allow the host to share a
physical memory page with the hypervisor, and remove the
__pkvm_create_mappings() variant. The new hypercall implements
ownership and permission checks before allowing the sharing operation,
and it annotates the shared page in the hypervisor stage-1 and host
stage-2 page-tables.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 2 +-
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +--
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 90 +++++++++++++++++++
arch/arm64/kvm/mmu.c | 28 +++++-
5 files changed, 120 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 9f0bf2109be7..78db818ae2c9 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -59,7 +59,7 @@
#define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
#define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
#define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
-#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
+#define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp 16
#define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
#define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
#define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 8e5725d032b2..1b8e59a9d065 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -49,6 +49,7 @@ extern struct host_kvm host_kvm;
int __pkvm_prot_finalize(void);
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
+int __pkvm_host_share_hyp(u64 pfn);
int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
int kvm_host_prepare_stage2(void *pgt_pool_base);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 1632f001f4ed..dd155646ec86 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -140,14 +140,11 @@ static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
}
-static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
+static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt)
{
- DECLARE_REG(unsigned long, start, host_ctxt, 1);
- DECLARE_REG(unsigned long, size, host_ctxt, 2);
- DECLARE_REG(unsigned long, phys, host_ctxt, 3);
- DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
+ DECLARE_REG(u64, pfn, host_ctxt, 1);
- cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
+ cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(pfn);
}
static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
@@ -193,7 +190,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__vgic_v3_restore_aprs),
HANDLE_FUNC(__pkvm_init),
HANDLE_FUNC(__pkvm_cpu_set_vector),
- HANDLE_FUNC(__pkvm_create_mappings),
+ HANDLE_FUNC(__pkvm_host_share_hyp),
HANDLE_FUNC(__pkvm_create_private_mapping),
HANDLE_FUNC(__pkvm_prot_finalize),
HANDLE_FUNC(__pkvm_mark_hyp),
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 223c541a7051..75273166d2c5 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -315,6 +315,96 @@ static int host_stage2_idmap(u64 addr)
return ret;
}
+static inline bool check_prot(enum kvm_pgtable_prot prot,
+ enum kvm_pgtable_prot required,
+ enum kvm_pgtable_prot denied)
+{
+ return (prot & (required | denied)) == required;
+}
+
+int __pkvm_host_share_hyp(u64 pfn)
+{
+ phys_addr_t addr = hyp_pfn_to_phys(pfn);
+ enum kvm_pgtable_prot prot, cur;
+ void *virt = __hyp_va(addr);
+ enum pkvm_page_state state;
+ kvm_pte_t pte;
+ u32 level;
+ int ret;
+
+ if (!range_is_memory(addr, addr + PAGE_SIZE))
+ return -EINVAL;
+
+ hyp_spin_lock(&host_kvm.lock);
+ hyp_spin_lock(&pkvm_pgd_lock);
+
+ ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level);
+ if (ret)
+ goto unlock;
+ if (!pte)
+ goto map_shared;
+
+ /*
+ * Check attributes in the host stage-2 PTE. We need the page to be:
+ * - mapped RWX as we're sharing memory;
+ * - not borrowed, as that implies absence of ownership.
+ * Otherwise, we can't let it go through.
+ */
+ cur = kvm_pgtable_stage2_pte_prot(pte);
+ prot = pkvm_mkstate(0, PKVM_PAGE_SHARED_BORROWED);
+ if (!check_prot(cur, KVM_PGTABLE_PROT_RWX, prot)) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
+ state = pkvm_getstate(cur);
+ if (state == PKVM_PAGE_OWNED)
+ goto map_shared;
+
+ /*
+ * Tolerate double-sharing the same page, but this requires
+ * cross-checking the hypervisor stage-1.
+ */
+ if (state != PKVM_PAGE_SHARED_OWNED) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
+ ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, &level);
+ if (ret)
+ goto unlock;
+
+ /*
+ * If the page has been shared with the hypervisor, it must be
+ * SHARED_BORROWED already.
+ */
+ cur = kvm_pgtable_hyp_pte_prot(pte);
+ prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
+ if (!check_prot(cur, prot, ~prot))
+ ret = -EPERM;
+ goto unlock;
+
+map_shared:
+ /*
+ * If the page is not yet shared, adjust mappings in both page-tables
+ * while both locks are held.
+ */
+ prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
+ ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, prot);
+ if (ret)
+ goto unlock;
+
+ prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED);
+ ret = host_stage2_idmap_locked(addr, addr + PAGE_SIZE, prot);
+ BUG_ON(ret);
+
+unlock:
+ hyp_spin_unlock(&pkvm_pgd_lock);
+ hyp_spin_unlock(&host_kvm.lock);
+
+ return ret;
+}
+
int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
{
int ret;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0625bf2353c2..cbab146cda6a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -259,10 +259,8 @@ static int __create_hyp_mappings(unsigned long start, unsigned long size,
{
int err;
- if (!kvm_host_owns_hyp_mappings()) {
- return kvm_call_hyp_nvhe(__pkvm_create_mappings,
- start, size, phys, prot);
- }
+ if (WARN_ON(!kvm_host_owns_hyp_mappings()))
+ return -EINVAL;
mutex_lock(&kvm_hyp_pgd_mutex);
err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
@@ -282,6 +280,21 @@ static phys_addr_t kvm_kaddr_to_phys(void *kaddr)
}
}
+static int pkvm_share_hyp(phys_addr_t start, phys_addr_t end)
+{
+ phys_addr_t addr;
+ int ret;
+
+ for (addr = ALIGN_DOWN(start, PAGE_SIZE); addr < end; addr += PAGE_SIZE) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+ __phys_to_pfn(addr));
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
/**
* create_hyp_mappings - duplicate a kernel virtual address range in Hyp mode
* @from: The virtual kernel start address of the range
@@ -302,6 +315,13 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
if (is_kernel_in_hyp_mode())
return 0;
+ if (!kvm_host_owns_hyp_mappings()) {
+ if (WARN_ON(prot != PAGE_HYP))
+ return -EPERM;
+ return pkvm_share_hyp(kvm_kaddr_to_phys(from),
+ kvm_kaddr_to_phys(to));
+ }
+
start = start & PAGE_MASK;
end = PAGE_ALIGN(end);
--
2.32.0.432.gabb21c7263-goog
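From the host's perspective nothing changes at the call sites: sharing a
kernel object with EL2 still goes through create_hyp_mappings(), which in
protected mode now issues one __pkvm_host_share_hyp hypercall per page. A
hypothetical caller:

/*
 * Illustrative only: share a kernel object with the hypervisor. In
 * protected mode this ends up calling __pkvm_host_share_hyp() once per
 * page; otherwise the usual hyp stage-1 mapping is created directly.
 */
static int share_obj_with_hyp(void *obj, size_t size)
{
        return create_hyp_mappings(obj, obj + size, PAGE_HYP);
}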
Much of the stage-2 manipulation logic relies on being able to destroy
block mappings if e.g. installing a smaller mapping in the range. The
rationale for this behaviour is that stage-2 mappings can always be
re-created lazily. However, this gets more complicated when the stage-2
page-table is used to store metadata about the underlying pages. In such
cases, destroying a block mapping may lead to losing part of the state,
and confuse the user of that metadata (such as the hypervisor in nVHE
protected mode).
To avoid this, introduce a callback function in the pgtable struct which
is called during all map operations to determine whether the mappings
can use blocks, or should be forced to page granularity. This is used by
the hypervisor when creating the host stage-2 to force page-level
mappings when using non-default protection attributes.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 65 ++++++++++++++++-----------
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 30 +++++++++++--
arch/arm64/kvm/hyp/pgtable.c | 29 +++++++++---
3 files changed, 91 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index cec76a49f521..9246a27e2839 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -102,25 +102,6 @@ enum kvm_pgtable_stage2_flags {
KVM_PGTABLE_S2_IDMAP = BIT(1),
};
-/**
- * struct kvm_pgtable - KVM page-table.
- * @ia_bits: Maximum input address size, in bits.
- * @start_level: Level at which the page-table walk starts.
- * @pgd: Pointer to the first top-level entry of the page-table.
- * @mm_ops: Memory management callbacks.
- * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
- */
-struct kvm_pgtable {
- u32 ia_bits;
- u32 start_level;
- kvm_pte_t *pgd;
- struct kvm_pgtable_mm_ops *mm_ops;
-
- /* Stage-2 only */
- struct kvm_s2_mmu *mmu;
- enum kvm_pgtable_stage2_flags flags;
-};
-
/**
* enum kvm_pgtable_prot - Page-table permissions and attributes.
* @KVM_PGTABLE_PROT_X: Execute permission.
@@ -136,11 +117,41 @@ enum kvm_pgtable_prot {
KVM_PGTABLE_PROT_DEVICE = BIT(3),
};
-#define PAGE_HYP (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
+#define KVM_PGTABLE_PROT_RW (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W)
+#define KVM_PGTABLE_PROT_RWX (KVM_PGTABLE_PROT_RW | KVM_PGTABLE_PROT_X)
+
+#define PAGE_HYP KVM_PGTABLE_PROT_RW
#define PAGE_HYP_EXEC (KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_X)
#define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
#define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
+typedef bool (*kvm_pgtable_force_pte_cb_t)(u64 addr, u64 end,
+ enum kvm_pgtable_prot prot);
+
+/**
+ * struct kvm_pgtable - KVM page-table.
+ * @ia_bits: Maximum input address size, in bits.
+ * @start_level: Level at which the page-table walk starts.
+ * @pgd: Pointer to the first top-level entry of the page-table.
+ * @mm_ops: Memory management callbacks.
+ * @mmu: Stage-2 KVM MMU struct. Unused for stage-1 page-tables.
+ * @flags: Stage-2 page-table flags.
+ * @force_pte_cb: Callback function used during map operations to decide
+ * whether block mappings can be used to map the given IPA
+ * range.
+ */
+struct kvm_pgtable {
+ u32 ia_bits;
+ u32 start_level;
+ kvm_pte_t *pgd;
+ struct kvm_pgtable_mm_ops *mm_ops;
+
+ /* Stage-2 only */
+ struct kvm_s2_mmu *mmu;
+ enum kvm_pgtable_stage2_flags flags;
+ kvm_pgtable_force_pte_cb_t force_pte_cb;
+};
+
/**
* enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
* @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
@@ -233,21 +244,25 @@ int kvm_pgtable_hyp_map(struct kvm_pgtable *pgt, u64 addr, u64 size, u64 phys,
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift);
/**
- * kvm_pgtable_stage2_init_flags() - Initialise a guest stage-2 page-table.
+ * __kvm_pgtable_stage2_init() - Initialise a guest stage-2 page-table.
* @pgt: Uninitialised page-table structure to initialise.
* @arch: Arch-specific KVM structure representing the guest virtual
* machine.
* @mm_ops: Memory management callbacks.
* @flags: Stage-2 configuration flags.
+ * @force_pte_cb: Callback function used during map operations to decide
+ * whether block mappings can be used to map the given IPA
+ * range.
*
* Return: 0 on success, negative error code on failure.
*/
-int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
- struct kvm_pgtable_mm_ops *mm_ops,
- enum kvm_pgtable_stage2_flags flags);
+int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_arch *arch,
+ struct kvm_pgtable_mm_ops *mm_ops,
+ enum kvm_pgtable_stage2_flags flags,
+ kvm_pgtable_force_pte_cb_t force_pte_cb);
#define kvm_pgtable_stage2_init(pgt, arch, mm_ops) \
- kvm_pgtable_stage2_init_flags(pgt, arch, mm_ops, 0)
+ __kvm_pgtable_stage2_init(pgt, arch, mm_ops, 0, NULL)
/**
* kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 01700a908bb7..2d41d4fa8901 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -89,6 +89,7 @@ static void prepare_host_vtcr(void)
id_aa64mmfr1_el1_sys_val, phys_shift);
}
+static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot);
int kvm_host_prepare_stage2(void *pgt_pool_base)
{
struct kvm_s2_mmu *mmu = &host_kvm.arch.mmu;
@@ -101,8 +102,9 @@ int kvm_host_prepare_stage2(void *pgt_pool_base)
if (ret)
return ret;
- ret = kvm_pgtable_stage2_init_flags(&host_kvm.pgt, &host_kvm.arch,
- &host_kvm.mm_ops, KVM_HOST_S2_FLAGS);
+ ret = __kvm_pgtable_stage2_init(&host_kvm.pgt, &host_kvm.arch,
+ &host_kvm.mm_ops, KVM_HOST_S2_FLAGS,
+ host_stage2_force_pte_cb);
if (ret)
return ret;
@@ -264,9 +266,31 @@ static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
return 0;
}
+static bool host_stage2_force_pte_cb(u64 addr, u64 end, enum kvm_pgtable_prot prot)
+{
+ /*
+ * Block mappings must be used with care in the host stage-2 as a
+ * kvm_pgtable_stage2_map() operation targeting a page in the range of
+ * an existing block will delete the block under the assumption that
+ * mappings in the rest of the block range can always be rebuilt lazily.
+ * That assumption is correct for the host stage-2 with RWX mappings
+ * targeting memory or RW mappings targeting MMIO ranges (see
+ * host_stage2_idmap() below which implements some of the host memory
+ * abort logic). However, this is not safe for any other mappings where
+ * the host stage-2 page-table is in fact the only place where this
+ * state is stored. In all those cases, it is safer to use page-level
+ * mappings, to avoid losing the state because of side-effects in
+ * kvm_pgtable_stage2_map().
+ */
+ if (range_is_memory(addr, end))
+ return prot != KVM_PGTABLE_PROT_RWX;
+ else
+ return prot != KVM_PGTABLE_PROT_RW;
+}
+
static int host_stage2_idmap(u64 addr)
{
- enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
+ enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_RW;
struct kvm_mem_range range;
bool is_memory = find_mem_range(addr, &range);
int ret;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 93cc9de4d46c..e0cd748e4af6 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -465,6 +465,8 @@ int kvm_pgtable_hyp_init(struct kvm_pgtable *pgt, u32 va_bits,
pgt->start_level = KVM_PGTABLE_MAX_LEVELS - levels;
pgt->mm_ops = mm_ops;
pgt->mmu = NULL;
+ pgt->force_pte_cb = NULL;
+
return 0;
}
@@ -502,6 +504,9 @@ struct stage2_map_data {
void *memcache;
struct kvm_pgtable_mm_ops *mm_ops;
+
+ /* Force mappings to page granularity */
+ bool force_pte;
};
u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
@@ -615,6 +620,15 @@ static bool stage2_pte_executable(kvm_pte_t pte)
return !(pte & KVM_PTE_LEAF_ATTR_HI_S2_XN);
}
+static bool stage2_block_mapping_allowed(u64 addr, u64 end, u32 level,
+ struct stage2_map_data *data)
+{
+ if (data->force_pte && (level < (KVM_PGTABLE_MAX_LEVELS - 1)))
+ return false;
+
+ return kvm_block_mapping_supported(addr, end, data->phys, level);
+}
+
static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
kvm_pte_t *ptep,
struct stage2_map_data *data)
@@ -624,7 +638,7 @@ static int stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
struct kvm_pgtable *pgt = data->mmu->pgt;
struct kvm_pgtable_mm_ops *mm_ops = data->mm_ops;
- if (!kvm_block_mapping_supported(addr, end, phys, level))
+ if (!stage2_block_mapping_allowed(addr, end, level, data))
return -E2BIG;
if (kvm_phys_is_valid(phys))
@@ -668,7 +682,7 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
if (data->anchor)
return 0;
- if (!kvm_block_mapping_supported(addr, end, data->phys, level))
+ if (!stage2_block_mapping_allowed(addr, end, level, data))
return 0;
data->childp = kvm_pte_follow(*ptep, data->mm_ops);
@@ -798,6 +812,7 @@ int kvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
.mmu = pgt->mmu,
.memcache = mc,
.mm_ops = pgt->mm_ops,
+ .force_pte = pgt->force_pte_cb && pgt->force_pte_cb(addr, addr + size, prot),
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -829,6 +844,7 @@ int kvm_pgtable_stage2_set_owner(struct kvm_pgtable *pgt, u64 addr, u64 size,
.memcache = mc,
.mm_ops = pgt->mm_ops,
.owner_id = owner_id,
+ .force_pte = true,
};
struct kvm_pgtable_walker walker = {
.cb = stage2_map_walker,
@@ -1070,9 +1086,11 @@ int kvm_pgtable_stage2_flush(struct kvm_pgtable *pgt, u64 addr, u64 size)
return kvm_pgtable_walk(pgt, addr, size, &walker);
}
-int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch,
- struct kvm_pgtable_mm_ops *mm_ops,
- enum kvm_pgtable_stage2_flags flags)
+
+int __kvm_pgtable_stage2_init(struct kvm_pgtable *pgt, struct kvm_arch *arch,
+ struct kvm_pgtable_mm_ops *mm_ops,
+ enum kvm_pgtable_stage2_flags flags,
+ kvm_pgtable_force_pte_cb_t force_pte_cb)
{
size_t pgd_sz;
u64 vtcr = arch->vtcr;
@@ -1090,6 +1108,7 @@ int kvm_pgtable_stage2_init_flags(struct kvm_pgtable *pgt, struct kvm_arch *arch
pgt->mm_ops = mm_ops;
pgt->mmu = &arch->mmu;
pgt->flags = flags;
+ pgt->force_pte_cb = force_pte_cb;
/* Ensure zeroed PGD pages are visible to the hardware walker */
dsb(ishst);
--
2.32.0.432.gabb21c7263-goog
Refactor the hypervisor stage-1 locking in nVHE protected mode to expose
a new pkvm_create_mappings_locked() function. This will be used in later
patches to allow walking and changing the hypervisor stage-1 without
releasing the lock.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/nvhe/mm.c | 17 +++++++++++++++--
2 files changed, 16 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index 8ec3a5a7744b..c76d7136ed9b 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -23,6 +23,7 @@ int hyp_map_vectors(void);
int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
+int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
int __pkvm_create_mappings(unsigned long start, unsigned long size,
unsigned long phys, enum kvm_pgtable_prot prot);
unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index a8efdf0f9003..256cbe5c0dca 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -67,7 +67,8 @@ unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
return addr;
}
-int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
+/* Must be called with the pkvm_pgd_lock held */
+int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot)
{
unsigned long start = (unsigned long)from;
unsigned long end = (unsigned long)to;
@@ -81,7 +82,8 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
int err;
phys = hyp_virt_to_phys((void *)virt_addr);
- err = __pkvm_create_mappings(virt_addr, PAGE_SIZE, phys, prot);
+ err = kvm_pgtable_hyp_map(&pkvm_pgtable, virt_addr, PAGE_SIZE,
+ phys, prot);
if (err)
return err;
}
@@ -89,6 +91,17 @@ int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
return 0;
}
+int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
+{
+ int ret;
+
+ hyp_spin_lock(&pkvm_pgd_lock);
+ ret = pkvm_create_mappings_locked(from, to, prot);
+ hyp_spin_unlock(&pkvm_pgd_lock);
+
+ return ret;
+}
+
int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back)
{
unsigned long start, end;
--
2.32.0.432.gabb21c7263-goog
The __pkvm_create_mappings() function is no longer used outside of
nvhe/mm.c, so make it static.
Signed-off-by: Quentin Perret <[email protected]>
---
arch/arm64/kvm/hyp/include/nvhe/mm.h | 2 --
arch/arm64/kvm/hyp/nvhe/mm.c | 4 ++--
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
index c76d7136ed9b..c9a8f535212e 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mm.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
@@ -24,8 +24,6 @@ int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
int pkvm_create_mappings_locked(void *from, void *to, enum kvm_pgtable_prot prot);
-int __pkvm_create_mappings(unsigned long start, unsigned long size,
- unsigned long phys, enum kvm_pgtable_prot prot);
unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
index 256cbe5c0dca..80303041f72f 100644
--- a/arch/arm64/kvm/hyp/nvhe/mm.c
+++ b/arch/arm64/kvm/hyp/nvhe/mm.c
@@ -23,8 +23,8 @@ u64 __io_map_base;
struct memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
unsigned int hyp_memblock_nr;
-int __pkvm_create_mappings(unsigned long start, unsigned long size,
- unsigned long phys, enum kvm_pgtable_prot prot)
+static int __pkvm_create_mappings(unsigned long start, unsigned long size,
+ unsigned long phys, enum kvm_pgtable_prot prot)
{
int err;
--
2.32.0.432.gabb21c7263-goog
On Mon, 26 Jul 2021 10:28:53 +0100,
Quentin Perret <[email protected]> wrote:
>
> The kvm_pgtable_stage2_find_range() function is used in the host memory
> abort path to try and look for the largest block mapping that can be
> used to map the faulting address. In order to do so, the function
> currently walks the stage-2 page-table and looks for existing
> incompatible mappings within the range of the largest possible block.
> If incompatible mappings are found, it tries the same procedure again,
> but using a smaller block range, and repeats until a matching range is
> found (potentially up to page granularity). While this approach has
> benefits (mostly in the fact that it proactively coalesces host stage-2
> mappings), it can be slow if the ranges are fragmented, and it isn't
> optimized to deal with CPUs faulting on the same IPA as all of them will
> do all the work every time.
>
> To avoid these issues, remove kvm_pgtable_stage2_find_range(), and walk
> the page-table only once in the host_mem_abort() path to find the
> closest leaf to the input address. With this, use the range covered by
> that leaf if it is invalid and not owned by another entity. If a valid
> leaf is found, return -EAGAIN, similar to what is done in the
> kvm_pgtable_stage2_map() path to optimize concurrent faults.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_pgtable.h | 30 -----------
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 39 +++++++++++++-
> arch/arm64/kvm/hyp/pgtable.c | 74 ---------------------------
> 3 files changed, 38 insertions(+), 105 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 5a7a13bbd4a1..cec76a49f521 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -141,16 +141,6 @@ enum kvm_pgtable_prot {
> #define PAGE_HYP_RO (KVM_PGTABLE_PROT_R)
> #define PAGE_HYP_DEVICE (PAGE_HYP | KVM_PGTABLE_PROT_DEVICE)
>
> -/**
> - * struct kvm_mem_range - Range of Intermediate Physical Addresses
> - * @start: Start of the range.
> - * @end: End of the range.
> - */
> -struct kvm_mem_range {
> - u64 start;
> - u64 end;
> -};
> -
> /**
> * enum kvm_pgtable_walk_flags - Flags to control a depth-first page-table walk.
> * @KVM_PGTABLE_WALK_LEAF: Visit leaf entries, including invalid
> @@ -477,24 +467,4 @@ int kvm_pgtable_walk(struct kvm_pgtable *pgt, u64 addr, u64 size,
> */
> int kvm_pgtable_get_leaf(struct kvm_pgtable *pgt, u64 addr,
> kvm_pte_t *ptep, u32 *level);
> -
> -/**
> - * kvm_pgtable_stage2_find_range() - Find a range of Intermediate Physical
> - * Addresses with compatible permission
> - * attributes.
> - * @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
> - * @addr: Address that must be covered by the range.
> - * @prot: Protection attributes that the range must be compatible with.
> - * @range: Range structure used to limit the search space at call time and
> - * that will hold the result.
> - *
> - * The offset of @addr within a page is ignored. An IPA is compatible with @prot
> - * iff its corresponding stage-2 page-table entry has default ownership and, if
> - * valid, is mapped with protection attributes identical to @prot.
> - *
> - * Return: 0 on success, negative error code on failure.
> - */
> -int kvm_pgtable_stage2_find_range(struct kvm_pgtable *pgt, u64 addr,
> - enum kvm_pgtable_prot prot,
> - struct kvm_mem_range *range);
> #endif /* __ARM64_KVM_PGTABLE_H__ */
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 871149246f5f..01700a908bb7 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -159,6 +159,11 @@ static int host_stage2_unmap_dev_all(void)
> return kvm_pgtable_stage2_unmap(pgt, addr, BIT(pgt->ia_bits) - addr);
> }
>
> +struct kvm_mem_range {
> + u64 start;
> + u64 end;
> +};
> +
> static bool find_mem_range(phys_addr_t addr, struct kvm_mem_range *range)
> {
> int cur, left = 0, right = hyp_memblock_nr;
> @@ -227,6 +232,38 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
> __ret; \
> })
>
> +static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
nit: I find 'find_range' a bit odd. We already have found a
range. We're just trying to narrow it down to something that fits in a
single block mapping. How about 'host_stage2_adjust_range'?
> +{
> + u64 granule, start, end;
> + kvm_pte_t pte;
> + u32 level;
> + int ret;
> +
> + ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level);
> + if (ret)
> + return ret;
> +
> + if (kvm_pte_valid(pte))
> + return -EAGAIN;
> +
> + if (pte)
> + return -EPERM;
> +
> + do {
> + granule = kvm_granule_size(level);
> + start = ALIGN_DOWN(addr, granule);
> + end = start + granule;
> + level++;
> + } while ((level < KVM_PGTABLE_MAX_LEVELS) &&
> + (!kvm_level_supports_block_mapping(level) ||
> + start < range->start || range->end < end));
> +
This expression does my head in. You are trying to find the largest
block mapping that entirely fits in range, right? Can we just express
that directly (with a global negation for the purpose of the loop)?
do {
[...]
} while (level < KVM_PGTABLE_MAX_LEVELS &&
!(kvm_level_supports_block_mapping(level) &&
start >= range->start &&
end <= range->end));
I personally find this much more readable, because it expresses the
condition we are looking for rather than a lot of conditions forcing
us to continue.
You could also use a kvm_mem_range for the iteration, and add a helper
that checks for the inclusion.
> + range->start = start;
> + range->end = end;
> +
> + return 0;
> +}
> +
> static int host_stage2_idmap(u64 addr)
> {
> enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W;
> @@ -238,7 +275,7 @@ static int host_stage2_idmap(u64 addr)
> prot |= KVM_PGTABLE_PROT_X;
>
> hyp_spin_lock(&host_kvm.lock);
> - ret = kvm_pgtable_stage2_find_range(&host_kvm.pgt, addr, prot, &range);
> + ret = host_stage2_find_range(addr, &range);
> if (ret)
> goto unlock;
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Mon, 26 Jul 2021 10:29:04 +0100,
Quentin Perret <[email protected]> wrote:
>
> The host kernel is currently able to change EL2 stage-1 mappings without
> restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
> world where the host is no longer part of the TCB, this clearly poses a
> problem.
>
> To fix this, introduce a new hypercall to allow the host to share a
> physical memory page with the hypervisor, and remove the
> __pkvm_create_mappings() variant. The new hypercall implements
> ownership and permission checks before allowing the sharing operation,
> and it annotates the shared page in the hypervisor stage-1 and host
> stage-2 page-tables.
>
> Signed-off-by: Quentin Perret <[email protected]>
> ---
> arch/arm64/include/asm/kvm_asm.h | 2 +-
> arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +--
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 90 +++++++++++++++++++
> arch/arm64/kvm/mmu.c | 28 +++++-
> 5 files changed, 120 insertions(+), 12 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 9f0bf2109be7..78db818ae2c9 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -59,7 +59,7 @@
> #define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
> #define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
> #define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
> -#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
> +#define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp 16
> #define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
> #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
> #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> index 8e5725d032b2..1b8e59a9d065 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> @@ -49,6 +49,7 @@ extern struct host_kvm host_kvm;
>
> int __pkvm_prot_finalize(void);
> int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
> +int __pkvm_host_share_hyp(u64 pfn);
>
> int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
> int kvm_host_prepare_stage2(void *pgt_pool_base);
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index 1632f001f4ed..dd155646ec86 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -140,14 +140,11 @@ static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> }
>
> -static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> +static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt)
> {
> - DECLARE_REG(unsigned long, start, host_ctxt, 1);
> - DECLARE_REG(unsigned long, size, host_ctxt, 2);
> - DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> - DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
> + DECLARE_REG(u64, pfn, host_ctxt, 1);
>
> - cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> + cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(pfn);
> }
>
> static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> @@ -193,7 +190,7 @@ static const hcall_t host_hcall[] = {
> HANDLE_FUNC(__vgic_v3_restore_aprs),
> HANDLE_FUNC(__pkvm_init),
> HANDLE_FUNC(__pkvm_cpu_set_vector),
> - HANDLE_FUNC(__pkvm_create_mappings),
> + HANDLE_FUNC(__pkvm_host_share_hyp),
> HANDLE_FUNC(__pkvm_create_private_mapping),
> HANDLE_FUNC(__pkvm_prot_finalize),
> HANDLE_FUNC(__pkvm_mark_hyp),
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 223c541a7051..75273166d2c5 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -315,6 +315,96 @@ static int host_stage2_idmap(u64 addr)
> return ret;
> }
>
> +static inline bool check_prot(enum kvm_pgtable_prot prot,
> + enum kvm_pgtable_prot required,
> + enum kvm_pgtable_prot denied)
> +{
> + return (prot & (required | denied)) == required;
> +}
> +
> +int __pkvm_host_share_hyp(u64 pfn)
> +{
> + phys_addr_t addr = hyp_pfn_to_phys(pfn);
> + enum kvm_pgtable_prot prot, cur;
> +	void *virt = __hyp_va(addr);
> + enum pkvm_page_state state;
> + kvm_pte_t pte;
> + u32 level;
> + int ret;
> +
> + if (!range_is_memory(addr, addr + PAGE_SIZE))
> + return -EINVAL;
> +
> + hyp_spin_lock(&host_kvm.lock);
> + hyp_spin_lock(&pkvm_pgd_lock);
> +
> + ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level);
> + if (ret)
> + goto unlock;
> + if (!pte)
> + goto map_shared;
> +
> + /*
> + * Check attributes in the host stage-2 PTE. We need the page to be:
> + * - mapped RWX as we're sharing memory;
> + * - not borrowed, as that implies absence of ownership.
> +	 * Otherwise, we can't let it go through.
> + */
> + cur = kvm_pgtable_stage2_pte_prot(pte);
> + prot = pkvm_mkstate(0, PKVM_PAGE_SHARED_BORROWED);
> + if (!check_prot(cur, KVM_PGTABLE_PROT_RWX, prot)) {
> + ret = -EPERM;
> + goto unlock;
> + }
> +
> + state = pkvm_getstate(cur);
> + if (state == PKVM_PAGE_OWNED)
> + goto map_shared;
> +
> + /*
> + * Tolerate double-sharing the same page, but this requires
> + * cross-checking the hypervisor stage-1.
> + */
> + if (state != PKVM_PAGE_SHARED_OWNED) {
> + ret = -EPERM;
> + goto unlock;
> + }
> +
> + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, &level);
> + if (ret)
> + goto unlock;
> +
> + /*
> + * If the page has been shared with the hypervisor, it must be
> + * SHARED_BORROWED already.
> + */
> + cur = kvm_pgtable_hyp_pte_prot(pte);
> + prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
> + if (!check_prot(cur, prot, ~prot))
> +		ret = -EPERM;
> + goto unlock;
> +
> +map_shared:
> + /*
> + * If the page is not yet shared, adjust mappings in both page-tables
> + * while both locks are held.
> + */
> + prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
> + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, prot);
> + if (ret)
> + goto unlock;
> +
> + prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED);
> + ret = host_stage2_idmap_locked(addr, addr + PAGE_SIZE, prot);
> + BUG_ON(ret);
Why the different treatment here? If we couldn't map the page in EL2
Stage-1, surely that's also a big problem, isn't it?
> +
> +unlock:
> + hyp_spin_unlock(&pkvm_pgd_lock);
> + hyp_spin_unlock(&host_kvm.lock);
> +
> + return ret;
> +}
> +
> int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
> {
> int ret;
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 0625bf2353c2..cbab146cda6a 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -259,10 +259,8 @@ static int __create_hyp_mappings(unsigned long start, unsigned long size,
> {
> int err;
>
> - if (!kvm_host_owns_hyp_mappings()) {
> - return kvm_call_hyp_nvhe(__pkvm_create_mappings,
> - start, size, phys, prot);
> - }
> + if (WARN_ON(!kvm_host_owns_hyp_mappings()))
> + return -EINVAL;
How come this doesn't generate a warning on a GICv2 system when
we try and map the CPU interface via create_hyp_io_mappings()?
>
> mutex_lock(&kvm_hyp_pgd_mutex);
> err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
> @@ -282,6 +280,21 @@ static phys_addr_t kvm_kaddr_to_phys(void *kaddr)
> }
> }
>
> +static int pkvm_share_hyp(phys_addr_t start, phys_addr_t end)
> +{
> + phys_addr_t addr;
> + int ret;
> +
> + for (addr = ALIGN_DOWN(start, PAGE_SIZE); addr < end; addr += PAGE_SIZE) {
> + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> + __phys_to_pfn(addr));
> + if (ret)
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> /**
> * create_hyp_mappings - duplicate a kernel virtual address range in Hyp mode
> * @from: The virtual kernel start address of the range
> @@ -302,6 +315,13 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> if (is_kernel_in_hyp_mode())
> return 0;
>
> + if (!kvm_host_owns_hyp_mappings()) {
> + if (WARN_ON(prot != PAGE_HYP))
> + return -EPERM;
> + return pkvm_share_hyp(kvm_kaddr_to_phys(from),
> + kvm_kaddr_to_phys(to));
> + }
> +
> start = start & PAGE_MASK;
> end = PAGE_ALIGN(end);
>
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
On Monday 26 Jul 2021 at 12:27:12 (+0100), Marc Zyngier wrote:
> On Mon, 26 Jul 2021 10:29:04 +0100,
> Quentin Perret <[email protected]> wrote:
> >
> > The host kernel is currently able to change EL2 stage-1 mappings without
> > restrictions thanks to the __pkvm_create_mappings() hypercall. But in a
> > world where the host is no longer part of the TCB, this clearly poses a
> > problem.
> >
> > To fix this, introduce a new hypercall to allow the host to share a
> > physical memory page with the hypervisor, and remove the
> > __pkvm_create_mappings() variant. The new hypercall implements
> > ownership and permission checks before allowing the sharing operation,
> > and it annotates the shared page in the hypervisor stage-1 and host
> > stage-2 page-tables.
> >
> > Signed-off-by: Quentin Perret <[email protected]>
> > ---
> > arch/arm64/include/asm/kvm_asm.h | 2 +-
> > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
> > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 11 +--
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 90 +++++++++++++++++++
> > arch/arm64/kvm/mmu.c | 28 +++++-
> > 5 files changed, 120 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> > index 9f0bf2109be7..78db818ae2c9 100644
> > --- a/arch/arm64/include/asm/kvm_asm.h
> > +++ b/arch/arm64/include/asm/kvm_asm.h
> > @@ -59,7 +59,7 @@
> > #define __KVM_HOST_SMCCC_FUNC___vgic_v3_save_aprs 13
> > #define __KVM_HOST_SMCCC_FUNC___vgic_v3_restore_aprs 14
> > #define __KVM_HOST_SMCCC_FUNC___pkvm_init 15
> > -#define __KVM_HOST_SMCCC_FUNC___pkvm_create_mappings 16
> > +#define __KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp 16
> > #define __KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping 17
> > #define __KVM_HOST_SMCCC_FUNC___pkvm_cpu_set_vector 18
> > #define __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize 19
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > index 8e5725d032b2..1b8e59a9d065 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > @@ -49,6 +49,7 @@ extern struct host_kvm host_kvm;
> >
> > int __pkvm_prot_finalize(void);
> > int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end);
> > +int __pkvm_host_share_hyp(u64 pfn);
> >
> > int host_stage2_idmap_locked(u64 start, u64 end, enum kvm_pgtable_prot prot);
> > int kvm_host_prepare_stage2(void *pgt_pool_base);
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > index 1632f001f4ed..dd155646ec86 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > @@ -140,14 +140,11 @@ static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> > cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> > }
> >
> > -static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> > +static void handle___pkvm_host_share_hyp(struct kvm_cpu_context *host_ctxt)
> > {
> > - DECLARE_REG(unsigned long, start, host_ctxt, 1);
> > - DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > - DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> > - DECLARE_REG(enum kvm_pgtable_prot, prot, host_ctxt, 4);
> > + DECLARE_REG(u64, pfn, host_ctxt, 1);
> >
> > - cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> > + cpu_reg(host_ctxt, 1) = __pkvm_host_share_hyp(pfn);
> > }
> >
> > static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> > @@ -193,7 +190,7 @@ static const hcall_t host_hcall[] = {
> > HANDLE_FUNC(__vgic_v3_restore_aprs),
> > HANDLE_FUNC(__pkvm_init),
> > HANDLE_FUNC(__pkvm_cpu_set_vector),
> > - HANDLE_FUNC(__pkvm_create_mappings),
> > + HANDLE_FUNC(__pkvm_host_share_hyp),
> > HANDLE_FUNC(__pkvm_create_private_mapping),
> > HANDLE_FUNC(__pkvm_prot_finalize),
> > HANDLE_FUNC(__pkvm_mark_hyp),
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 223c541a7051..75273166d2c5 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -315,6 +315,96 @@ static int host_stage2_idmap(u64 addr)
> > return ret;
> > }
> >
> > +static inline bool check_prot(enum kvm_pgtable_prot prot,
> > + enum kvm_pgtable_prot required,
> > + enum kvm_pgtable_prot denied)
> > +{
> > + return (prot & (required | denied)) == required;
> > +}
> > +
> > +int __pkvm_host_share_hyp(u64 pfn)
> > +{
> > + phys_addr_t addr = hyp_pfn_to_phys(pfn);
> > + enum kvm_pgtable_prot prot, cur;
> > +	void *virt = __hyp_va(addr);
> > + enum pkvm_page_state state;
> > + kvm_pte_t pte;
> > + u32 level;
> > + int ret;
> > +
> > + if (!range_is_memory(addr, addr + PAGE_SIZE))
> > + return -EINVAL;
> > +
> > + hyp_spin_lock(&host_kvm.lock);
> > + hyp_spin_lock(&pkvm_pgd_lock);
> > +
> > + ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level);
> > + if (ret)
> > + goto unlock;
> > + if (!pte)
> > + goto map_shared;
> > +
> > + /*
> > + * Check attributes in the host stage-2 PTE. We need the page to be:
> > + * - mapped RWX as we're sharing memory;
> > + * - not borrowed, as that implies absence of ownership.
> > +	 * Otherwise, we can't let it go through.
> > + */
> > + cur = kvm_pgtable_stage2_pte_prot(pte);
> > + prot = pkvm_mkstate(0, PKVM_PAGE_SHARED_BORROWED);
> > + if (!check_prot(cur, KVM_PGTABLE_PROT_RWX, prot)) {
> > + ret = -EPERM;
> > + goto unlock;
> > + }
> > +
> > + state = pkvm_getstate(cur);
> > + if (state == PKVM_PAGE_OWNED)
> > + goto map_shared;
> > +
> > + /*
> > + * Tolerate double-sharing the same page, but this requires
> > + * cross-checking the hypervisor stage-1.
> > + */
> > + if (state != PKVM_PAGE_SHARED_OWNED) {
> > + ret = -EPERM;
> > + goto unlock;
> > + }
> > +
> > + ret = kvm_pgtable_get_leaf(&pkvm_pgtable, (u64)virt, &pte, &level);
> > + if (ret)
> > + goto unlock;
> > +
> > + /*
> > + * If the page has been shared with the hypervisor, it must be
> > + * SHARED_BORROWED already.
> > + */
> > + cur = kvm_pgtable_hyp_pte_prot(pte);
> > + prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
> > + if (!check_prot(cur, prot, ~prot))
> > +		ret = -EPERM;
> > + goto unlock;
> > +
> > +map_shared:
> > + /*
> > + * If the page is not yet shared, adjust mappings in both page-tables
> > + * while both locks are held.
> > + */
> > + prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
> > + ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, prot);
> > + if (ret)
> > + goto unlock;
> > +
> > + prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED);
> > + ret = host_stage2_idmap_locked(addr, addr + PAGE_SIZE, prot);
> > + BUG_ON(ret);
>
> Why the different treatment here? If we couldn't map the page in EL2
> Stage-1, surely that's also a big problem, isn't it?
Right, the reasoning was that if the EL2 stage-1 map fails for any
reason, it will hopefully leave the page-table unmodified, which means
there is a chance we can bail out without a corrupted state. But if it
succeeds and the host stage-2 map doesn't, we're in trouble since we
can't unmap anything from the EL2 stage-1 (yet) to clean things up on
the error path.
But yes, failing either of them is a sign of a major problem, so I can
surely simplify and just BUG() on both. Alternatively, mapping the host
stage-2 first would actually be preferable, since I should be able to
unmap things from there if I need to ...
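i.e. roughly something along these lines (untested, just to illustrate
the ordering; identifiers as in the patch above):

	/* Annotate the host stage-2 first: it can be undone on failure. */
	prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED);
	ret = host_stage2_idmap_locked(addr, addr + PAGE_SIZE, prot);
	if (ret)
		goto unlock;

	prot = pkvm_mkstate(PAGE_HYP, PKVM_PAGE_SHARED_BORROWED);
	ret = pkvm_create_mappings_locked(virt, virt + PAGE_SIZE, prot);
	if (ret) {
		/* Restore the default (owned) host stage-2 annotation. */
		BUG_ON(host_stage2_idmap_locked(addr, addr + PAGE_SIZE,
						KVM_PGTABLE_PROT_RWX));
	}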
> > +
> > +unlock:
> > + hyp_spin_unlock(&pkvm_pgd_lock);
> > + hyp_spin_unlock(&host_kvm.lock);
> > +
> > + return ret;
> > +}
> > +
> > int __pkvm_mark_hyp(phys_addr_t start, phys_addr_t end)
> > {
> > int ret;
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 0625bf2353c2..cbab146cda6a 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -259,10 +259,8 @@ static int __create_hyp_mappings(unsigned long start, unsigned long size,
> > {
> > int err;
> >
> > - if (!kvm_host_owns_hyp_mappings()) {
> > - return kvm_call_hyp_nvhe(__pkvm_create_mappings,
> > - start, size, phys, prot);
> > - }
> > + if (WARN_ON(!kvm_host_owns_hyp_mappings()))
> > + return -EINVAL;
>
> How come this doesn't generate a warning on a GICv2 system when
> we try and map the CPU interface via create_hyp_io_mappings()?
Aha, so that's an interesting one. In short: we have a separate
hypercall in __create_hyp_private_mapping() to handle mapping things in
the 'private' range of the hypervisor, so we should be covered.
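On the kernel side that path boils down to a single hypercall, along
these lines (sketch from memory, not the exact mmu.c code):

	/*
	 * e.g. for the GICv2 CPU interface, create_hyp_io_mappings() ends
	 * up issuing (phys_addr/size being its arguments):
	 */
	haddr = kvm_call_hyp_nvhe(__pkvm_create_private_mapping,
				  phys_addr, size, PAGE_HYP_DEVICE);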
But note that we should consider this other hypercall privileged, and we
should allow the host to call it only while the pkvm static key is not
set. Same thing for a bunch of other calls (__pkvm_init and friends),
so this is something we should fix/consolidate at some point in the
future.
I guess it would make sense conceptually to mark those IO pages shared
as well, but given that this share operation must never happen when
we no longer trust the host, and since so far we're under the assumption
that non-memory pages cannot be shared with e.g. guests, I figured there
was no real benefit to it _yet_. And it would make
host_stage2_unmap_dev_all() and friends a bit harder to implement. We
currently assume that pages in the MMIO range cannot be 'pinned' as per
the comment above host_stage2_try().
Thanks,
Quentin
> >
> > mutex_lock(&kvm_hyp_pgd_mutex);
> > err = kvm_pgtable_hyp_map(hyp_pgtable, start, size, phys, prot);
> > @@ -282,6 +280,21 @@ static phys_addr_t kvm_kaddr_to_phys(void *kaddr)
> > }
> > }
> >
> > +static int pkvm_share_hyp(phys_addr_t start, phys_addr_t end)
> > +{
> > + phys_addr_t addr;
> > + int ret;
> > +
> > + for (addr = ALIGN_DOWN(start, PAGE_SIZE); addr < end; addr += PAGE_SIZE) {
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> > + __phys_to_pfn(addr));
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > /**
> > * create_hyp_mappings - duplicate a kernel virtual address range in Hyp mode
> > * @from: The virtual kernel start address of the range
> > @@ -302,6 +315,13 @@ int create_hyp_mappings(void *from, void *to, enum kvm_pgtable_prot prot)
> > if (is_kernel_in_hyp_mode())
> > return 0;
> >
> > + if (!kvm_host_owns_hyp_mappings()) {
> > + if (WARN_ON(prot != PAGE_HYP))
> > + return -EPERM;
> > + return pkvm_share_hyp(kvm_kaddr_to_phys(from),
> > + kvm_kaddr_to_phys(to));
> > + }
> > +
> > start = start & PAGE_MASK;
> > end = PAGE_ALIGN(end);
> >
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
On Monday 26 Jul 2021 at 11:35:10 (+0100), Marc Zyngier wrote:
> On Mon, 26 Jul 2021 10:28:53 +0100,
> Quentin Perret <[email protected]> wrote:
> > +static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
>
> nit: I find 'find_range' a bit odd. We already have found a
> range. We're just trying to narrow it down to something that fits in a
> single block mapping. How about 'host_stage2_adjust_range'?
Ack.
> > +{
> > + u64 granule, start, end;
> > + kvm_pte_t pte;
> > + u32 level;
> > + int ret;
> > +
> > + ret = kvm_pgtable_get_leaf(&host_kvm.pgt, addr, &pte, &level);
> > + if (ret)
> > + return ret;
> > +
> > + if (kvm_pte_valid(pte))
> > + return -EAGAIN;
> > +
> > + if (pte)
> > + return -EPERM;
> > +
> > + do {
> > + granule = kvm_granule_size(level);
> > + start = ALIGN_DOWN(addr, granule);
> > + end = start + granule;
> > + level++;
> > + } while ((level < KVM_PGTABLE_MAX_LEVELS) &&
> > + (!kvm_level_supports_block_mapping(level) ||
> > + start < range->start || range->end < end));
> > +
>
> This expression does my head in. You are trying to find the largest
> block mapping that entirely fits in range, right? Can we just express
> that directly (with a global negation for the purpose of the loop)?
>
> do {
> [...]
> } while (level < KVM_PGTABLE_MAX_LEVELS &&
> !(kvm_level_supports_block_mapping(level) &&
> start >= range->start &&
> end <= range->end));
>
> I personally find this much more readable, because it expresses the
> condition we are looking for rather than a lot of conditions forcing
> us to continue.
>
> You could also use a kvm_mem_range for the iteration, and add a helper
> that checks for the inclusion.
Something like this (untested)?
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 75273166d2c5..07d228163090 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -234,9 +234,15 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
__ret; \
})
+static inline bool range_included(struct kvm_mem_range *child,
+ struct kvm_mem_range *parent)
+{
+ return parent->start <= child->start && child->end <= parent->end;
+}
+
static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
{
- u64 granule, start, end;
+ struct kvm_mem_range cur;
kvm_pte_t pte;
u32 level;
int ret;
@@ -252,16 +258,15 @@ static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
return -EPERM;
do {
- granule = kvm_granule_size(level);
- start = ALIGN_DOWN(addr, granule);
- end = start + granule;
+ u64 granule = kvm_granule_size(level);
+ cur.start = ALIGN_DOWN(addr, granule);
+ cur.end = cur.start + granule;
level++;
} while ((level < KVM_PGTABLE_MAX_LEVELS) &&
- (!kvm_level_supports_block_mapping(level) ||
- start < range->start || range->end < end));
+ !(kvm_level_supports_block_mapping(level) &&
+		   range_included(&cur, range)));
- range->start = start;
- range->end = end;
+ *range = cur;
return 0;
}
On Mon, 26 Jul 2021 14:13:06 +0100,
Quentin Perret <[email protected]> wrote:
>
> On Monday 26 Jul 2021 at 11:35:10 (+0100), Marc Zyngier wrote:
[...]
> > You could also use a kvm_mem_range for the iteration, and add a helper
> > that checks for the inclusion.
>
> Something like this (untested)?
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 75273166d2c5..07d228163090 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -234,9 +234,15 @@ static inline int __host_stage2_idmap(u64 start, u64 end,
> __ret; \
> })
>
> +static inline bool range_included(struct kvm_mem_range *child,
> + struct kvm_mem_range *parent)
> +{
> + return parent->start <= child->start && child->end <= parent->end;
> +}
> +
> static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
> {
> - u64 granule, start, end;
> + struct kvm_mem_range cur;
> kvm_pte_t pte;
> u32 level;
> int ret;
> @@ -252,16 +258,15 @@ static int host_stage2_find_range(u64 addr, struct kvm_mem_range *range)
> return -EPERM;
>
> do {
> - granule = kvm_granule_size(level);
> - start = ALIGN_DOWN(addr, granule);
> - end = start + granule;
> + u64 granule = kvm_granule_size(level);
> + cur.start = ALIGN_DOWN(addr, granule);
> + cur.end = cur.start + granule;
> level++;
> } while ((level < KVM_PGTABLE_MAX_LEVELS) &&
> - (!kvm_level_supports_block_mapping(level) ||
> - start < range->start || range->end < end));
> + !(kvm_level_supports_block_mapping(level) &&
> +		   range_included(&cur, range)));
>
> - range->start = start;
> - range->end = end;
> + *range = cur;
>
> return 0;
> }
>
Beautiful.
M.
--
Without deviation from the norm, progress is not possible.
On Monday 26 Jul 2021 at 10:29:01 (+0100), Quentin Perret wrote:
> +static int finalize_mappings(void)
> +{
> + enum kvm_pgtable_prot prot;
> + int ret;
> +
> + /*
> + * The host's .bss and .rodata sections are now conceptually owned by
> + * the hypervisor, so mark them as 'borrowed' in the host stage-2. We
> + * can safely use host_stage2_idmap_locked() at this point since the
> + * host stage-2 has not been enabled yet.
> + */
> + prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_BORROWED);
> + ret = host_stage2_idmap_locked(__hyp_pa(__start_rodata),
> + __hyp_pa(__end_rodata), prot);
> + if (ret)
> + return ret;
> +
> + return host_stage2_idmap_locked(__hyp_pa(__hyp_bss_end),
> + __hyp_pa(__bss_stop), prot);
> +}
> +
> void __noreturn __pkvm_init_finalise(void)
> {
> struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> @@ -167,6 +199,10 @@ void __noreturn __pkvm_init_finalise(void)
> if (ret)
> goto out;
>
> + ret = finalize_mappings();
> + if (ret)
> + goto out;
While working on v3 of this series it occurred to me that we can
actually do vastly better than this. Specifically, the annotation of
shared pages currently happens in two places (recreate_hyp_mappings()
and finalize_mappings()) with nothing to guarantee they are in sync. At
the same time, the annotation of pages owned by the hypervisor is left
to the host itself using the __pkvm_mark_hyp hypercall. But clearly, by
the time we get to finalize_mappings() above, all the information I
need is already stored in the hyp pgtable. That is, it should be fairly
easy to walk the hyp stage-1, and for each valid mapping create a
matching annotation in the host stage-2 to mark the page shared or owned
by the hypervisor.
I'll have a go at implementing this in v3, which would guarantee
consistency across page-tables once the hypervisor is initialized, and
also allow us to get rid of __pkvm_mark_hyp entirely. But if anybody thinks
this is a bad idea in the meantime, please shout!
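Roughly what I have in mind, as an untested sketch (names made up, only
the 'shared' case is handled here, and pages owned outright by the
hypervisor would need a similar treatment):

	static int host_stage2_annotate_walker(u64 addr, u64 end, u32 level,
					       kvm_pte_t *ptep,
					       enum kvm_pgtable_walk_flags flag,
					       void * const arg)
	{
		u64 granule = kvm_granule_size(level);
		enum kvm_pgtable_prot prot;
		kvm_pte_t pte = *ptep;
		phys_addr_t phys;

		if (!kvm_pte_valid(pte))
			return 0;

		/* Treat every valid hyp stage-1 mapping as borrowed from the host. */
		phys = kvm_pte_to_phys(pte);
		prot = pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_SHARED_OWNED);

		return host_stage2_idmap_locked(phys, phys + granule, prot);
	}

	static int finalize_mappings(void)
	{
		struct kvm_pgtable_walker walker = {
			.cb	= host_stage2_annotate_walker,
			.flags	= KVM_PGTABLE_WALK_LEAF,
		};

		return kvm_pgtable_walk(&pkvm_pgtable, 0,
					BIT(pkvm_pgtable.ia_bits), &walker);
	}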
Thanks,
Quentin