2023-09-27 11:40:12

by Sebastian Ene

Subject: [PATCH 00/11] arm64: ptdump: View the host stage-2 page-tables

Hi,

This series can be used as a debugging tool for dumping the host stage-2
page-tables in a pKVM environment.

When CONFIG_NVHE_EL2_PTDUMP_DEBUGFS is enabled, ptdump registers the
'host_stage2_kernel_page_tables' entry with debugfs and this allows us
to dump the host stage-2 page-tables with the following command:
cat /sys/kernel/debug/host_stage2_kernel_page_tables

The output shows the entries in the following format:
<IPA range> <size> <descriptor type> <access permissions> <mem_attributes>
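
As a quick illustration of that format (not taken from the kernel sources;
the field names and the sample line are made up), such a line can be split
with a small parser:

```python
import re

# Hypothetical regex for the dump format described above. Field names are
# illustrative; "rest" collects permissions and memory attributes.
LINE_RE = re.compile(
    r"^(?P<start>0x[0-9a-f]+)-(?P<end>0x[0-9a-f]+)\s+"
    r"(?P<size>\d+)(?P<unit>[KMGTPE])\s+"
    r"(?P<desc>\S+)(?P<rest>.*)$"
)

def parse_dump_line(line):
    """Split one dump line into its fields, or return None on mismatch."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    fields = m.groupdict()
    fields["rest"] = fields["rest"].strip()
    return fields

entry = parse_dump_line("0x0000000040000000-0x0000000080000000 1G PGD R W AF")
```

This is only meant to show how the columns line up; the real output is
produced by note_page() in arch/arm64/mm/ptdump.c.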

The tool interprets the pKVM ownership annotations stored in the invalid
entries and dumps the ownership information to the console. To make the
host stage-2 page-tables accessible from the kernel, a new hypervisor
call is introduced which snapshots the page-tables into a host-provided
buffer. The hypervisor call is hidden behind CONFIG_NVHE_EL2_DEBUG as it
should only be used in a debugging environment.

I verified this series on QEMU and a Pixel 6, both with kvm-arm.mode=protected.

Thanks,

Sebastian Ene (11):
KVM: arm64: Add snapshotting of the host stage-2 pagetables
arm64: ptdump: Use the mask from the state structure
arm64: ptdump: Add the walker function to the ptdump info structure
KVM: arm64: Move pagetable definitions to common header
arm64: ptdump: Introduce stage-2 pagetables format description
arm64: ptdump: Register a debugfs entry for the host stage-2
page-tables
arm64: ptdump: Snapshot the host stage-2 pagetables
arm64: ptdump: Parse the host stage-2 page-tables from the snapshot
arm64: ptdump: Interpret memory attributes based on runtime
configuration
arm64: ptdump: Interpret pKVM ownership annotations
arm64: ptdump: Fix format output during stage-2 pagetable dumping

arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/include/asm/kvm_pgtable.h | 85 ++++
arch/arm64/include/asm/ptdump.h | 6 +
arch/arm64/kvm/Kconfig | 12 +
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 8 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 18 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 103 +++++
arch/arm64/kvm/hyp/pgtable.c | 98 +++--
arch/arm64/mm/ptdump.c | 405 +++++++++++++++++-
arch/arm64/mm/ptdump_debugfs.c | 37 +-
10 files changed, 716 insertions(+), 57 deletions(-)

--
2.42.0.515.g380fc7ccd1-goog


2023-09-27 11:43:26

by Sebastian Ene

Subject: [PATCH 07/11] arm64: ptdump: Snapshot the host stage-2 pagetables

Introduce callbacks invoked when the debugfs entry is accessed from
userspace and store them in the ptdump info structure. These functions
are called when the debugfs entry registered by ptdump is opened/closed.
On open, allocate memory for the host stage-2 snapshot, share it with
the hypervisor and issue the hvc to copy the page-tables into the shared
buffer. On close, release the associated memory resources and unshare
the memory from the hypervisor.
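
The open/close flow can be sketched as a scope with unwind-in-reverse
error handling; hyp_share(), hyp_unshare() and hyp_copy_stage2() below
are stand-ins for the real hypercalls (__pkvm_host_share_hyp,
__pkvm_host_unshare_hyp, __pkvm_copy_host_stage2), and the ordering
mirrors the error paths in stage2_ptdump_prepare_walk():

```python
# Toy model of the snapshot lifecycle: pages shared with the "hypervisor"
# are tracked in a set so the unwind ordering can be checked.
shared = set()

def hyp_share(page):
    shared.add(page)
    return 0

def hyp_unshare(page):
    shared.discard(page)
    return 0

def hyp_copy_stage2(snapshot):
    return 0

class Stage2Snapshot:
    """Share pages on entry, unshare them in reverse order on exit."""
    def __init__(self, pages):
        self.pages = pages

    def __enter__(self):
        for i, page in enumerate(self.pages):
            if hyp_share(page):
                # Unwind only the pages shared so far, most recent first.
                for done in reversed(self.pages[:i]):
                    hyp_unshare(done)
                raise MemoryError("share failed")
        hyp_copy_stage2(self)
        return self

    def __exit__(self, *exc):
        for page in reversed(self.pages):
            hyp_unshare(page)
        return False

with Stage2Snapshot(["snapshot_page", "pgd0", "pgd1"]) as snap:
    in_flight = len(shared)  # all pages shared while the entry is open
```

The point of the sketch is the invariant, not the API: every page shared
on the open path must be unshared exactly once on either the error path
or the close path.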

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 5 ++
arch/arm64/mm/ptdump.c | 143 +++++++++++++++++++++++++++++++-
arch/arm64/mm/ptdump_debugfs.c | 34 +++++++-
3 files changed, 179 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 1f6e0aabf16a..35b883524462 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -19,7 +19,12 @@ struct ptdump_info {
struct mm_struct *mm;
const struct addr_marker *markers;
unsigned long base_addr;
+ void (*ptdump_prepare_walk)(struct ptdump_info *info);
void (*ptdump_walk)(struct seq_file *s, struct ptdump_info *info);
+ void (*ptdump_end_walk)(struct ptdump_info *info);
+ struct mutex file_lock;
+ size_t mc_len;
+ void *priv;
};

void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 25c0640e82aa..7d57fa9be724 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -24,6 +24,7 @@
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>
#include <asm/ptdump.h>
+#include <asm/kvm_pkvm.h>
#include <asm/kvm_pgtable.h>


@@ -482,6 +483,139 @@ static struct mm_struct ipa_init_mm = {
.mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS,
ipa_init_mm.mmap_lock),
};
+
+static phys_addr_t ptdump_host_pa(void *addr)
+{
+ return __pa(addr);
+}
+
+static void *ptdump_host_va(phys_addr_t phys)
+{
+ return __va(phys);
+}
+
+static size_t stage2_get_pgd_len(void)
+{
+ u64 mmfr0, mmfr1, vtcr;
+ u32 phys_shift = get_kvm_ipa_limit();
+
+ mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
+
+ return kvm_pgtable_stage2_pgd_size(vtcr);
+}
+
+static void stage2_ptdump_prepare_walk(struct ptdump_info *info)
+{
+ struct kvm_pgtable_snapshot *snapshot;
+ int ret, pgd_index, mc_index, pgd_pages_sz;
+ void *page_hva;
+
+ snapshot = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
+ if (!snapshot)
+ return;
+
+ memset(snapshot, 0, PAGE_SIZE);
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp, virt_to_pfn(snapshot));
+ if (ret)
+ goto free_snapshot;
+
+ snapshot->pgd_len = stage2_get_pgd_len();
+ pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
+ snapshot->pgd_hva = alloc_pages_exact(snapshot->pgd_len,
+ GFP_KERNEL_ACCOUNT);
+ if (!snapshot->pgd_hva)
+ goto unshare_snapshot;
+
+ for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
+ page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+ virt_to_pfn(page_hva));
+ if (ret)
+ goto unshare_pgd_pages;
+ }
+
+ for (mc_index = 0; mc_index < info->mc_len; mc_index++) {
+ page_hva = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
+ if (!page_hva)
+ goto free_memcache_pages;
+
+ push_hyp_memcache(&snapshot->mc, page_hva, ptdump_host_pa);
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+ virt_to_pfn(page_hva));
+ if (ret) {
+ pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ free_pages_exact(page_hva, PAGE_SIZE);
+ goto free_memcache_pages;
+ }
+ }
+
+ ret = kvm_call_hyp_nvhe(__pkvm_copy_host_stage2, snapshot);
+ if (ret)
+ goto free_memcache_pages;
+
+ info->priv = snapshot;
+ return;
+
+free_memcache_pages:
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ while (page_hva) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ free_pages_exact(page_hva, PAGE_SIZE);
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ }
+unshare_pgd_pages:
+ pgd_index = pgd_index - 1;
+ for (; pgd_index >= 0; pgd_index--) {
+ page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ }
+ free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
+unshare_snapshot:
+ WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(snapshot)));
+free_snapshot:
+ free_pages_exact(snapshot, PAGE_SIZE);
+ info->priv = NULL;
+}
+
+static void stage2_ptdump_end_walk(struct ptdump_info *info)
+{
+ struct kvm_pgtable_snapshot *snapshot = info->priv;
+ void *page_hva;
+ int pgd_index, ret, pgd_pages_sz;
+
+ if (!snapshot)
+ return;
+
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ while (page_hva) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ free_pages_exact(page_hva, PAGE_SIZE);
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ }
+
+ pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
+ for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
+ page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ }
+
+ free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
+ WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(snapshot)));
+ free_pages_exact(snapshot, PAGE_SIZE);
+ info->priv = NULL;
+}
#endif /* CONFIG_NVHE_EL2_PTDUMP_DEBUGFS */

static int __init ptdump_init(void)
@@ -495,11 +629,16 @@ static int __init ptdump_init(void)

#ifdef CONFIG_NVHE_EL2_PTDUMP_DEBUGFS
stage2_kernel_ptdump_info = (struct ptdump_info) {
- .markers = ipa_address_markers,
- .mm = &ipa_init_mm,
+ .markers = ipa_address_markers,
+ .mm = &ipa_init_mm,
+ .mc_len = host_s2_pgtable_pages(),
+ .ptdump_prepare_walk = stage2_ptdump_prepare_walk,
+ .ptdump_end_walk = stage2_ptdump_end_walk,
};

init_rwsem(&ipa_init_mm.mmap_lock);
+ mutex_init(&stage2_kernel_ptdump_info.file_lock);
+
ptdump_debugfs_register(&stage2_kernel_ptdump_info,
"host_stage2_kernel_page_tables");
#endif
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 7564519db1e6..14619452dd8d 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -15,7 +15,39 @@ static int ptdump_show(struct seq_file *m, void *v)
put_online_mems();
return 0;
}
-DEFINE_SHOW_ATTRIBUTE(ptdump);
+
+static int ptdump_open(struct inode *inode, struct file *file)
+{
+ int ret;
+ struct ptdump_info *info = inode->i_private;
+
+ ret = single_open(file, ptdump_show, inode->i_private);
+ if (!ret && info->ptdump_prepare_walk) {
+ mutex_lock(&info->file_lock);
+ info->ptdump_prepare_walk(info);
+ }
+ return ret;
+}
+
+static int ptdump_release(struct inode *inode, struct file *file)
+{
+ struct ptdump_info *info = inode->i_private;
+
+ if (info->ptdump_end_walk) {
+ info->ptdump_end_walk(info);
+ mutex_unlock(&info->file_lock);
+ }
+
+ return single_release(inode, file);
+}
+
+static const struct file_operations ptdump_fops = {
+ .owner = THIS_MODULE,
+ .open = ptdump_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = ptdump_release,
+};

void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name)
{
--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 12:44:54

by Sebastian Ene

Subject: [PATCH 08/11] arm64: ptdump: Parse the host stage-2 page-tables from the snapshot

Add a walker function which configures ptdump to parse the page-tables
from the snapshot. Convert the physical address of the page-table's PGD
to a host virtual address and use the ptdump walker to parse the
page-table descriptors.
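
The number of concatenated PGD tables the walker has to cover follows
from the IPA size and the start level. A sketch of the arithmetic for a
4K granule (PAGE_SHIFT and the level-shift formula are assumptions
mirroring ARM64_HW_PGTABLE_LEVEL_SHIFT, not code from this series):

```python
PAGE_SHIFT = 12  # 4K pages assumed

def hw_pgtable_level_shift(n):
    # Each level resolves PAGE_SHIFT - 3 bits on top of the page offset,
    # matching ARM64_HW_PGTABLE_LEVEL_SHIFT(n) for a 4K granule.
    return (PAGE_SHIFT - 3) * (4 - n) + 3

def stage2_max_pgd_index(ipa_bits, start_level):
    # Index of the last PGD entry needed to map the whole IPA space.
    end_ipa = (1 << ipa_bits) - 1
    return end_ipa >> hw_pgtable_level_shift(start_level - 1)

# A 40-bit IPA space walked from level 1 spans two concatenated level-1
# tables, i.e. the last index is 1.
```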

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 53 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 7d57fa9be724..c0e7a80992f4 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -616,6 +616,58 @@ static void stage2_ptdump_end_walk(struct ptdump_info *info)
free_pages_exact(snapshot, PAGE_SIZE);
info->priv = NULL;
}
+
+static u32 stage2_get_max_pgd_index(u32 ipa_bits, u32 start_level)
+{
+ u64 end_ipa = BIT(ipa_bits) - 1;
+ u32 shift = ARM64_HW_PGTABLE_LEVEL_SHIFT(start_level - 1);
+
+ return end_ipa >> shift;
+}
+
+static void stage2_ptdump_walk(struct seq_file *s, struct ptdump_info *info)
+{
+ struct kvm_pgtable_snapshot *snapshot = info->priv;
+ struct pg_state st;
+ struct kvm_pgtable *pgtable;
+ u32 pgd_index, pgd_pages, pgd_shift;
+ u64 start_ipa = 0, end_ipa;
+ pgd_t *pgd;
+
+ if (snapshot == NULL || !snapshot->pgtable.pgd)
+ return;
+
+ pgtable = &snapshot->pgtable;
+ info->mm->pgd = phys_to_virt((phys_addr_t)pgtable->pgd);
+ pgd_shift = ARM64_HW_PGTABLE_LEVEL_SHIFT(pgtable->start_level - 1);
+ pgd_pages = stage2_get_max_pgd_index(pgtable->ia_bits,
+ pgtable->start_level);
+
+ for (pgd_index = 0; pgd_index <= pgd_pages; pgd_index++) {
+ end_ipa = start_ipa + (1UL << pgd_shift);
+
+ st = (struct pg_state) {
+ .seq = s,
+ .marker = info->markers,
+ .level = pgtable->start_level,
+ .pg_level = &stage2_pg_level[0],
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]) {
+ {start_ipa, end_ipa},
+ {0, 0},
+ },
+ },
+ };
+
+ ipa_address_markers[0].start_address = start_ipa;
+ ipa_address_markers[1].start_address = end_ipa;
+
+ pgd = &info->mm->pgd[pgd_index * PTRS_PER_PTE];
+ ptdump_walk_pgd(&st.ptdump, info->mm, pgd);
+ start_ipa = end_ipa;
+ }
+}
#endif /* CONFIG_NVHE_EL2_PTDUMP_DEBUGFS */

static int __init ptdump_init(void)
@@ -634,6 +686,7 @@ static int __init ptdump_init(void)
.mc_len = host_s2_pgtable_pages(),
.ptdump_prepare_walk = stage2_ptdump_prepare_walk,
.ptdump_end_walk = stage2_ptdump_end_walk,
+ .ptdump_walk = stage2_ptdump_walk,
};

init_rwsem(&ipa_init_mm.mmap_lock);
--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 13:56:57

by Sebastian Ene

Subject: [PATCH 11/11] arm64: ptdump: Fix format output during stage-2 pagetable dumping

Fix two issues where the address range printed via debugfs was shown
incorrectly when reading the
/sys/kernel/debug/host_stage2_kernel_page_tables entry.

The first issue was printing to debugfs the following:
0x0000010000000000-0x0000000000000000 16777215T PGD

If st->start_address was larger than the current addr, the delta
variable used to display the size of the address range wrapped around.
The second issue was printing the following wrong IPA range:
0x0000000000000000-0x0000000000000000 0E PGD

Validate the current address range before printing it from the debugfs
entry.
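
Both bogus lines can be reproduced by modelling note_page()'s size
computation with 64-bit wraparound arithmetic (a sketch; the
unit-promotion loop is paraphrased from the kernel code, not copied):

```python
U64 = (1 << 64) - 1

def range_size(start, end):
    # note_page() computes delta = (addr - st->start_address) >> 10 and
    # promotes it through the "KMGTPE" units while it stays a whole
    # multiple of 1024; in C the unsigned subtraction wraps around
    # instead of going negative.
    delta = ((end - start) & U64) >> 10
    units = "KMGTPE"
    i = 0
    while not (delta & 1023) and i < len(units) - 1:
        delta >>= 10
        i += 1
    return f"{delta}{units[i]}"
```

With start > end the subtraction wraps and produces the "16777215T" from
the first bogus line; a zero-length range degenerates to the "0E" of the
second one.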

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 45ff4ebae01a..2c21ba9b47d1 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -430,6 +430,9 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
const char *unit = units;
unsigned long delta;

+ if (st->start_address >= addr)
+ goto update_state;
+
if (st->current_prot) {
note_prot_uxn(st, addr);
note_prot_wx(st, addr);
@@ -455,6 +458,7 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
pt_dump_seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
}

+update_state:
st->start_address = addr;
st->current_prot = prot;
st->level = level;
--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 14:35:34

by Sebastian Ene

Subject: [PATCH 04/11] KVM: arm64: Move pagetable definitions to common header

In preparation for using the stage-2 definitions in ptdump, move some of
these macros to the common header.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 42 ++++++++++++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 42 ----------------------------
2 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index be615700f8ac..913f34d75b29 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -45,6 +45,48 @@ typedef u64 kvm_pte_t;

#define KVM_PHYS_INVALID (-1ULL)

+#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
+
+#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO \
+ ({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 2 : 3; })
+#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW \
+ ({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 0 : 1; })
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
+#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
+#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
+#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
+#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
+
+#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 50)
+
+#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
+
+#define KVM_PTE_LEAF_ATTR_HI_S1_GP BIT(50)
+
+#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
+ KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
+ KVM_PTE_LEAF_ATTR_HI_S2_XN)
+
+#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
+#define KVM_MAX_OWNER_ID 1
+
+/*
+ * Used to indicate a pte for which a 'break-before-make' sequence is in
+ * progress.
+ */
+#define KVM_INVALID_PTE_LOCKED BIT(10)
+
static inline bool kvm_pte_valid(kvm_pte_t pte)
{
return pte & KVM_PTE_VALID;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 256654b89c1e..67fa122c6028 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -17,48 +17,6 @@
#define KVM_PTE_TYPE_PAGE 1
#define KVM_PTE_TYPE_TABLE 1

-#define KVM_PTE_LEAF_ATTR_LO GENMASK(11, 2)
-
-#define KVM_PTE_LEAF_ATTR_LO_S1_ATTRIDX GENMASK(4, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP GENMASK(7, 6)
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RO \
- ({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 2 : 3; })
-#define KVM_PTE_LEAF_ATTR_LO_S1_AP_RW \
- ({ cpus_have_final_cap(ARM64_KVM_HVHE) ? 0 : 1; })
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S1_SH_IS 3
-#define KVM_PTE_LEAF_ATTR_LO_S1_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR GENMASK(5, 2)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R BIT(6)
-#define KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W BIT(7)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH GENMASK(9, 8)
-#define KVM_PTE_LEAF_ATTR_LO_S2_SH_IS 3
-#define KVM_PTE_LEAF_ATTR_LO_S2_AF BIT(10)
-
-#define KVM_PTE_LEAF_ATTR_HI GENMASK(63, 50)
-
-#define KVM_PTE_LEAF_ATTR_HI_SW GENMASK(58, 55)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S2_XN BIT(54)
-
-#define KVM_PTE_LEAF_ATTR_HI_S1_GP BIT(50)
-
-#define KVM_PTE_LEAF_ATTR_S2_PERMS (KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | \
- KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | \
- KVM_PTE_LEAF_ATTR_HI_S2_XN)
-
-#define KVM_INVALID_PTE_OWNER_MASK GENMASK(9, 2)
-#define KVM_MAX_OWNER_ID 1
-
-/*
- * Used to indicate a pte for which a 'break-before-make' sequence is in
- * progress.
- */
-#define KVM_INVALID_PTE_LOCKED BIT(10)
-
struct kvm_pgtable_walk_data {
struct kvm_pgtable_walker *walker;

--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 15:12:11

by Sebastian Ene

Subject: [PATCH 03/11] arm64: ptdump: Add the walker function to the ptdump info structure

Stage-2 needs a dedicated walk function to be able to parse the
concatenated pagetables. The ptdump info structure holds the different
configuration options for the walker. This structure is registered with
the debugfs entry and stored as the private data of the debugfs file.
Hence, in preparation for parsing the stage-2 pagetables, add the walk
function to this structure.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 1 +
arch/arm64/mm/ptdump.c | 1 +
arch/arm64/mm/ptdump_debugfs.c | 3 ++-
3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 581caac525b0..1f6e0aabf16a 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -19,6 +19,7 @@ struct ptdump_info {
struct mm_struct *mm;
const struct addr_marker *markers;
unsigned long base_addr;
+ void (*ptdump_walk)(struct seq_file *s, struct ptdump_info *info);
};

void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 8761a70f916f..d531e24ea0b2 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -346,6 +346,7 @@ static struct ptdump_info kernel_ptdump_info = {
.mm = &init_mm,
.markers = address_markers,
.base_addr = PAGE_OFFSET,
+ .ptdump_walk = &ptdump_walk,
};

void ptdump_check_wx(void)
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 68bf1a125502..7564519db1e6 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -10,7 +10,8 @@ static int ptdump_show(struct seq_file *m, void *v)
struct ptdump_info *info = m->private;

get_online_mems();
- ptdump_walk(m, info);
+ if (info->ptdump_walk)
+ info->ptdump_walk(m, info);
put_online_mems();
return 0;
}
--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 15:33:40

by Sebastian Ene

Subject: [PATCH 05/11] arm64: ptdump: Introduce stage-2 pagetables format description

Add an array which holds human-readable information about the format of
a stage-2 descriptor. The array is then used by the descriptor parser to
extract information about the memory attributes.
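
The mask/value pairs drive a decoder along the lines of dump_prot(); the
sketch below trims the table to a handful of entries, the bit positions
follow the descriptor layout in the patch, and the labels are
illustrative:

```python
# Simplified model of dump_prot() walking a stage2_pte_bits-style table.
BIT = lambda n: 1 << n

STAGE2_PTE_BITS = [
    # valid bit: print "F" (fault) when clear, nothing when set
    {"mask": BIT(0),  "val": BIT(0),  "set": "",   "clear": "F"},
    {"mask": BIT(54), "val": BIT(54), "set": "XN", "clear": ""},   # S2 XN
    {"mask": BIT(6),  "val": BIT(6),  "set": "R",  "clear": ""},   # S2AP_R
    {"mask": BIT(7),  "val": BIT(7),  "set": "W",  "clear": ""},   # S2AP_W
    {"mask": BIT(10), "val": BIT(10), "set": "AF", "clear": ""},   # access flag
]

def dump_prot(pte):
    """Collect the attribute labels matching the descriptor bits."""
    labels = []
    for b in STAGE2_PTE_BITS:
        s = b["set"] if (pte & b["mask"]) == b["val"] else b["clear"]
        if s:
            labels.append(s)
    return " ".join(labels)

# A valid, readable, writable, accessed page decodes to "R W AF".
```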

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 91 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 91 insertions(+)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index d531e24ea0b2..8c4f06ca622a 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -24,6 +24,7 @@
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>
#include <asm/ptdump.h>
+#include <asm/kvm_pgtable.h>


enum address_markers_idx {
@@ -171,6 +172,66 @@ static const struct prot_bits pte_bits[] = {
}
};

+static const struct prot_bits stage2_pte_bits[] = {
+ {
+ .mask = PTE_VALID,
+ .val = PTE_VALID,
+ .set = " ",
+ .clear = "F",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN,
+ .val = KVM_PTE_LEAF_ATTR_HI_S2_XN,
+ .set = "XN",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R,
+ .set = "R",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+ .set = "W",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_AF,
+ .set = "AF",
+ .clear = " ",
+ }, {
+ .mask = PTE_NG,
+ .val = PTE_NG,
+ .set = "FnXS",
+ .clear = " ",
+ }, {
+ .mask = PTE_CONT,
+ .val = PTE_CONT,
+ .set = "CON",
+ .clear = " ",
+ }, {
+ .mask = PTE_TABLE_BIT,
+ .val = PTE_TABLE_BIT,
+ .set = " ",
+ .clear = "BLK",
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW0,
+ .val = KVM_PGTABLE_PROT_SW0,
+ .set = "SW0", /* PKVM_PAGE_SHARED_OWNED */
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW1,
+ .val = KVM_PGTABLE_PROT_SW1,
+ .set = "SW1", /* PKVM_PAGE_SHARED_BORROWED */
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW2,
+ .val = KVM_PGTABLE_PROT_SW2,
+ .set = "SW2",
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW3,
+ .val = KVM_PGTABLE_PROT_SW3,
+ .set = "SW3",
+ },
+};
+
struct pg_level {
const struct prot_bits *bits;
const char *name;
@@ -202,6 +263,30 @@ static struct pg_level pg_level[] = {
},
};

+static struct pg_level stage2_pg_level[] = {
+ { /* pgd */
+ .name = "PGD",
+ .bits = stage2_pte_bits,
+ .num = ARRAY_SIZE(stage2_pte_bits),
+ }, { /* p4d */
+ .name = "P4D",
+ .bits = stage2_pte_bits,
+ .num = ARRAY_SIZE(stage2_pte_bits),
+ }, { /* pud */
+ .name = (CONFIG_PGTABLE_LEVELS > 3) ? "PUD" : "PGD",
+ .bits = stage2_pte_bits,
+ .num = ARRAY_SIZE(stage2_pte_bits),
+ }, { /* pmd */
+ .name = (CONFIG_PGTABLE_LEVELS > 2) ? "PMD" : "PGD",
+ .bits = stage2_pte_bits,
+ .num = ARRAY_SIZE(stage2_pte_bits),
+ }, { /* pte */
+ .name = "PTE",
+ .bits = stage2_pte_bits,
+ .num = ARRAY_SIZE(stage2_pte_bits),
+ },
+};
+
static void dump_prot(struct pg_state *st, const struct prot_bits *bits,
size_t num)
{
@@ -340,6 +425,12 @@ static void __init ptdump_initialize(void)
if (pg_level[i].bits)
for (j = 0; j < pg_level[i].num; j++)
pg_level[i].mask |= pg_level[i].bits[j].mask;
+
+ for (i = 0; i < ARRAY_SIZE(stage2_pg_level); i++)
+ if (stage2_pg_level[i].bits)
+ for (j = 0; j < stage2_pg_level[i].num; j++)
+ stage2_pg_level[i].mask |=
+ stage2_pg_level[i].bits[j].mask;
}

static struct ptdump_info kernel_ptdump_info = {
--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 15:49:20

by Sebastian Ene

Subject: [PATCH 02/11] arm64: ptdump: Use the mask from the state structure

Printing the descriptor attributes requires accessing a mask which has a
different set of attributes for stage-2. In preparation for adding
support for stage-2 pagetable dumping, use the mask from the local
context and not from the globally defined pg_level array. Store a
pointer to the pg_level array in the ptdump state structure; this allows
note_page() to extract the mask wrapped in the pg_level array and use it
for descriptor parsing.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index e305b6593c4e..8761a70f916f 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -75,6 +75,7 @@ static struct addr_marker address_markers[] = {
struct pg_state {
struct ptdump_state ptdump;
struct seq_file *seq;
+ struct pg_level *pg_level;
const struct addr_marker *marker;
unsigned long start_address;
int level;
@@ -252,11 +253,12 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
u64 val)
{
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
+ struct pg_level *pg_info = st->pg_level;
static const char units[] = "KMGTPE";
u64 prot = 0;

if (level >= 0)
- prot = val & pg_level[level].mask;
+ prot = val & pg_info[level].mask;

if (st->level == -1) {
st->level = level;
@@ -282,10 +284,10 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
unit++;
}
pt_dump_seq_printf(st->seq, "%9lu%c %s", delta, *unit,
- pg_level[st->level].name);
- if (st->current_prot && pg_level[st->level].bits)
- dump_prot(st, pg_level[st->level].bits,
- pg_level[st->level].num);
+ pg_info[st->level].name);
+ if (st->current_prot && pg_info[st->level].bits)
+ dump_prot(st, pg_info[st->level].bits,
+ pg_info[st->level].num);
pt_dump_seq_puts(st->seq, "\n");

if (addr >= st->marker[1].start_address) {
@@ -316,6 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
st = (struct pg_state){
.seq = s,
.marker = info->markers,
+ .pg_level = &pg_level[0],
.level = -1,
.ptdump = {
.note_page = note_page,
@@ -353,6 +356,7 @@ void ptdump_check_wx(void)
{ 0, NULL},
{ -1, NULL},
},
+ .pg_level = &pg_level[0],
.level = -1,
.check_wx = true,
.ptdump = {
--
2.42.0.515.g380fc7ccd1-goog

2023-09-27 16:21:39

by Sebastian Ene

Subject: [PATCH 01/11] KVM: arm64: Add snapshotting of the host stage-2 pagetables

Introduce a new HVC that allows the caller to snapshot the stage-2
pagetables under the nVHE debug configuration. The caller specifies the
location where the pagetables are copied and must ensure that the memory
is accessible by the hypervisor: the destination buffer has to be
allocated by the caller and shared with the hypervisor beforehand.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/include/asm/kvm_pgtable.h | 36 ++++++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 1 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 18 +++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 103 ++++++++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 56 ++++++++++
6 files changed, 215 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 24b5e6b23417..99145a24c0f6 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -81,6 +81,7 @@ enum __kvm_host_smccc_func {
__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+ __KVM_HOST_SMCCC_FUNC___pkvm_copy_host_stage2,
};

#define DECLARE_KVM_VHE_SYM(sym) extern char sym[]
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index d3e354bb8351..be615700f8ac 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -10,6 +10,7 @@
#include <linux/bits.h>
#include <linux/kvm_host.h>
#include <linux/types.h>
+#include <asm/kvm_host.h>

#define KVM_PGTABLE_MAX_LEVELS 4U

@@ -351,6 +352,21 @@ struct kvm_pgtable {
kvm_pgtable_force_pte_cb_t force_pte_cb;
};

+/**
+ * struct kvm_pgtable_snapshot - Snapshot page-table wrapper.
+ * @pgtable: The page-table configuration.
+ * @mc: Memcache used for pagetable pages allocation.
+ * @pgd_hva: Host virtual address of a physically contiguous buffer
+ * used for storing the PGD.
+ * @pgd_len: The size of the physically contiguous buffer in bytes.
+ */
+struct kvm_pgtable_snapshot {
+ struct kvm_pgtable pgtable;
+ struct kvm_hyp_memcache mc;
+ void *pgd_hva;
+ size_t pgd_len;
+};
+
/**
* kvm_pgtable_hyp_init() - Initialise a hypervisor stage-1 page-table.
* @pgt: Uninitialised page-table structure to initialise.
@@ -756,4 +772,24 @@ enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
*/
void kvm_tlb_flush_vmid_range(struct kvm_s2_mmu *mmu,
phys_addr_t addr, size_t size);
+
+#ifdef CONFIG_NVHE_EL2_DEBUG
+/**
+ * kvm_pgtable_stage2_copy() - Snapshot the pagetable
+ *
+ * @to_pgt: Destination pagetable
+ * @from_pgt: Source pagetable. The caller must lock the pagetables first
+ * @mc: The memcache where we allocate the destination pagetables from
+ */
+int kvm_pgtable_stage2_copy(struct kvm_pgtable *to_pgt,
+ const struct kvm_pgtable *from_pgt,
+ void *mc);
+#else
+static inline int kvm_pgtable_stage2_copy(struct kvm_pgtable *to_pgt,
+ const struct kvm_pgtable *from_pgt,
+ void *mc)
+{
+ return -EPERM;
+}
+#endif /* CONFIG_NVHE_EL2_DEBUG */
#endif /* __ARM64_KVM_PGTABLE_H__ */
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 0972faccc2af..9cfb35d68850 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -69,6 +69,7 @@ int __pkvm_host_donate_hyp(u64 pfn, u64 nr_pages);
int __pkvm_hyp_donate_host(u64 pfn, u64 nr_pages);
int __pkvm_host_share_ffa(u64 pfn, u64 nr_pages);
int __pkvm_host_unshare_ffa(u64 pfn, u64 nr_pages);
+int __pkvm_host_stage2_prepare_copy(struct kvm_pgtable_snapshot *snapshot);

bool addr_is_memory(phys_addr_t phys);
int host_stage2_idmap_locked(phys_addr_t addr, u64 size, enum kvm_pgtable_prot prot);
diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
index 2385fd03ed87..0d9b56c31cf2 100644
--- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
+++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
@@ -314,6 +314,23 @@ static void handle___pkvm_teardown_vm(struct kvm_cpu_context *host_ctxt)
cpu_reg(host_ctxt, 1) = __pkvm_teardown_vm(handle);
}

+static void handle___pkvm_copy_host_stage2(struct kvm_cpu_context *host_ctxt)
+{
+ int ret = -EPERM;
+#ifdef CONFIG_NVHE_EL2_DEBUG
+ DECLARE_REG(struct kvm_pgtable_snapshot *, snapshot, host_ctxt, 1);
+ kvm_pteref_t pgd;
+
+ snapshot = kern_hyp_va(snapshot);
+ ret = __pkvm_host_stage2_prepare_copy(snapshot);
+ if (!ret) {
+ pgd = snapshot->pgtable.pgd;
+ snapshot->pgtable.pgd = (kvm_pteref_t)__hyp_pa(pgd);
+ }
+#endif
+ cpu_reg(host_ctxt, 1) = ret;
+}
+
typedef void (*hcall_t)(struct kvm_cpu_context *);

#define HANDLE_FUNC(x) [__KVM_HOST_SMCCC_FUNC_##x] = (hcall_t)handle_##x
@@ -348,6 +365,7 @@ static const hcall_t host_hcall[] = {
HANDLE_FUNC(__pkvm_init_vm),
HANDLE_FUNC(__pkvm_init_vcpu),
HANDLE_FUNC(__pkvm_teardown_vm),
+ HANDLE_FUNC(__pkvm_copy_host_stage2),
};

static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 9d703441278b..fe1a6dbd6d31 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -266,6 +266,109 @@ int kvm_guest_prepare_stage2(struct pkvm_hyp_vm *vm, void *pgd)
return 0;
}

+#ifdef CONFIG_NVHE_EL2_DEBUG
+static struct hyp_pool snapshot_pool = {0};
+static DEFINE_HYP_SPINLOCK(snapshot_pool_lock);
+
+static void *snapshot_zalloc_pages_exact(size_t size)
+{
+ void *addr = hyp_alloc_pages(&snapshot_pool, get_order(size));
+
+ hyp_split_page(hyp_virt_to_page(addr));
+
+ /*
+ * The size of concatenated PGDs is always a power of two of PAGE_SIZE,
+ * so there should be no need to free any of the tail pages to make the
+ * allocation exact.
+ */
+ WARN_ON(size != (PAGE_SIZE << get_order(size)));
+
+ return addr;
+}
+
+static void snapshot_get_page(void *addr)
+{
+ hyp_get_page(&snapshot_pool, addr);
+}
+
+static void *snapshot_zalloc_page(void *mc)
+{
+ struct hyp_page *p;
+ void *addr;
+
+ addr = hyp_alloc_pages(&snapshot_pool, 0);
+ if (addr)
+ return addr;
+
+ addr = pop_hyp_memcache(mc, hyp_phys_to_virt);
+ if (!addr)
+ return addr;
+
+ memset(addr, 0, PAGE_SIZE);
+ p = hyp_virt_to_page(addr);
+ memset(p, 0, sizeof(*p));
+ p->refcount = 1;
+
+ return addr;
+}
+
+static void snapshot_s2_free_pages_exact(void *addr, unsigned long size)
+{
+ u8 order = get_order(size);
+ unsigned int i;
+ struct hyp_page *p;
+
+ for (i = 0; i < (1 << order); i++) {
+ p = hyp_virt_to_page(addr + (i * PAGE_SIZE));
+ hyp_page_ref_dec_and_test(p);
+ }
+}
+
+int __pkvm_host_stage2_prepare_copy(struct kvm_pgtable_snapshot *snapshot)
+{
+ size_t required_pgd_len;
+ struct kvm_pgtable_mm_ops mm_ops = {0};
+ struct kvm_pgtable *to_pgt, *from_pgt = &host_mmu.pgt;
+ struct kvm_hyp_memcache *memcache = &snapshot->mc;
+ int ret;
+ void *pgd;
+ u64 nr_pages;
+
+ required_pgd_len = kvm_pgtable_stage2_pgd_size(host_mmu.arch.vtcr);
+ if (snapshot->pgd_len < required_pgd_len)
+ return -ENOMEM;
+
+ to_pgt = &snapshot->pgtable;
+ nr_pages = snapshot->pgd_len / PAGE_SIZE;
+ pgd = kern_hyp_va(snapshot->pgd_hva);
+
+ hyp_spin_lock(&snapshot_pool_lock);
+ hyp_pool_init(&snapshot_pool, hyp_virt_to_pfn(pgd),
+ required_pgd_len / PAGE_SIZE, 0);
+
+ mm_ops.zalloc_pages_exact = snapshot_zalloc_pages_exact;
+ mm_ops.zalloc_page = snapshot_zalloc_page;
+ mm_ops.free_pages_exact = snapshot_s2_free_pages_exact;
+ mm_ops.get_page = snapshot_get_page;
+ mm_ops.phys_to_virt = hyp_phys_to_virt;
+ mm_ops.virt_to_phys = hyp_virt_to_phys;
+ mm_ops.page_count = hyp_page_count;
+
+ to_pgt->ia_bits = from_pgt->ia_bits;
+ to_pgt->start_level = from_pgt->start_level;
+ to_pgt->flags = from_pgt->flags;
+ to_pgt->mm_ops = &mm_ops;
+
+ host_lock_component();
+ ret = kvm_pgtable_stage2_copy(to_pgt, from_pgt, memcache);
+ host_unlock_component();
+
+ hyp_spin_unlock(&snapshot_pool_lock);
+
+ return ret;
+}
+#endif /* CONFIG_NVHE_EL2_DEBUG */
+
void reclaim_guest_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc)
{
void *addr;
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index f155b8c9e98c..256654b89c1e 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1598,3 +1598,59 @@ void kvm_pgtable_stage2_free_unlinked(struct kvm_pgtable_mm_ops *mm_ops, void *p
WARN_ON(mm_ops->page_count(pgtable) != 1);
mm_ops->put_page(pgtable);
}
+
+#ifdef CONFIG_NVHE_EL2_DEBUG
+static int stage2_copy_walker(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ struct kvm_pgtable_mm_ops *mm_ops = ctx->mm_ops;
+ void *copy_table, *original_addr;
+ kvm_pte_t new = ctx->old;
+
+ if (!stage2_pte_is_counted(ctx->old))
+ return 0;
+
+ if (kvm_pte_table(ctx->old, ctx->level)) {
+ copy_table = mm_ops->zalloc_page(ctx->arg);
+ if (!copy_table)
+ return -ENOMEM;
+
+ original_addr = kvm_pte_follow(ctx->old, mm_ops);
+
+ memcpy(copy_table, original_addr, PAGE_SIZE);
+ new = kvm_init_table_pte(copy_table, mm_ops);
+ }
+
+ *ctx->ptep = new;
+
+ return 0;
+}
+
+int kvm_pgtable_stage2_copy(struct kvm_pgtable *to_pgt,
+ const struct kvm_pgtable *from_pgt,
+ void *mc)
+{
+ int ret;
+ size_t pgd_sz;
+ struct kvm_pgtable_mm_ops *mm_ops = to_pgt->mm_ops;
+ struct kvm_pgtable_walker walker = {
+ .cb = stage2_copy_walker,
+ .flags = KVM_PGTABLE_WALK_LEAF |
+ KVM_PGTABLE_WALK_TABLE_PRE,
+ .arg = mc
+ };
+
+ pgd_sz = kvm_pgd_pages(to_pgt->ia_bits, to_pgt->start_level) *
+ PAGE_SIZE;
+ to_pgt->pgd = (kvm_pteref_t)mm_ops->zalloc_pages_exact(pgd_sz);
+ if (!to_pgt->pgd)
+ return -ENOMEM;
+
+ memcpy(to_pgt->pgd, from_pgt->pgd, pgd_sz);
+
+ ret = kvm_pgtable_walk(to_pgt, 0, BIT(to_pgt->ia_bits), &walker);
+ mm_ops->free_pages_exact(to_pgt->pgd, pgd_sz);
+
+ return ret;
+}
+#endif /* CONFIG_NVHE_EL2_DEBUG */
--
2.42.0.515.g380fc7ccd1-goog
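The pre-order walk in stage2_copy_walker() rewrites every table PTE so that the snapshot points at freshly allocated copies rather than the live tables. The idea can be sketched as a small userspace model; the toy two-level tree, entry layout, and all names below are illustrative, not the kernel's:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: a "table" entry points at a child array of 4 entries;
 * a "leaf" entry just holds a value. Purely illustrative. */
#define ENTRIES 4

struct entry {
	int is_table;
	union {
		struct entry *child;	/* valid when is_table is set */
		uint64_t leaf;
	};
};

/*
 * Pre-order copy: duplicate every table so the snapshot no longer
 * shares memory with the live tree -- the same idea as
 * stage2_copy_walker() replacing table PTEs with pointers to freshly
 * allocated copies during a TABLE_PRE walk.
 */
static struct entry *copy_table(const struct entry *src)
{
	struct entry *dst = calloc(ENTRIES, sizeof(*dst));

	if (!dst)
		return NULL;
	memcpy(dst, src, ENTRIES * sizeof(*dst));
	for (int i = 0; i < ENTRIES; i++) {
		if (dst[i].is_table) {
			dst[i].child = copy_table(src[i].child);
			if (!dst[i].child)
				return NULL; /* sketch: leak on error */
		}
	}
	return dst;
}
```

A tree copied this way keeps its leaf values even when the live tables are modified afterwards, which is what makes the snapshot safe to parse outside the hypervisor lock.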

2023-09-27 16:48:06

by Sebastian Ene

Subject: [PATCH 06/11] arm64: ptdump: Register a debugfs entry for the host stage-2 page-tables

Initialize the structures used to keep the state of the stage-2 ptdump
walker. To satisfy the ptdump API for parsing no-VMA regions, initialize
a memory structure. Since we are going to parse a snapshot of the host
stage-2 page-tables, we don't rely on the locking from this memory
structure.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/kvm/Kconfig | 12 ++++++++++++
arch/arm64/mm/ptdump.c | 26 ++++++++++++++++++++++++++
2 files changed, 38 insertions(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..2974bb5c4838 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -71,4 +71,16 @@ config PROTECTED_NVHE_STACKTRACE

If unsure, or not using protected nVHE (pKVM), say N.

+config NVHE_EL2_PTDUMP_DEBUGFS
+ bool "Present the stage-2 pagetables to debugfs"
+ depends on NVHE_EL2_DEBUG && PTDUMP_DEBUGFS
+ help
+ Say Y here if you want to show the pKVM host stage-2 kernel pagetable
+ layout in a debugfs file. This information is only useful for kernel developers
+ who are working in architecture specific areas of the kernel.
+ It is probably not a good idea to enable this feature in a production
+ kernel.
+
+ If in doubt, say N.
+
endif # VIRTUALIZATION
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 8c4f06ca622a..25c0640e82aa 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -469,6 +469,21 @@ void ptdump_check_wx(void)
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
}

+#ifdef CONFIG_NVHE_EL2_PTDUMP_DEBUGFS
+static struct ptdump_info stage2_kernel_ptdump_info;
+static struct addr_marker ipa_address_markers[] = {
+ { 0, "IPA start"},
+ { -1, "IPA end"},
+ { -1, NULL},
+};
+
+/* Initialize a memory structure used by ptdump to walk the no-VMA region */
+static struct mm_struct ipa_init_mm = {
+ .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS,
+ ipa_init_mm.mmap_lock),
+};
+#endif /* CONFIG_NVHE_EL2_PTDUMP_DEBUGFS */
+
static int __init ptdump_init(void)
{
address_markers[PAGE_END_NR].start_address = PAGE_END;
@@ -477,6 +492,17 @@ static int __init ptdump_init(void)
#endif
ptdump_initialize();
ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
+
+#ifdef CONFIG_NVHE_EL2_PTDUMP_DEBUGFS
+ stage2_kernel_ptdump_info = (struct ptdump_info) {
+ .markers = ipa_address_markers,
+ .mm = &ipa_init_mm,
+ };
+
+ init_rwsem(&ipa_init_mm.mmap_lock);
+ ptdump_debugfs_register(&stage2_kernel_ptdump_info,
+ "host_stage2_kernel_page_tables");
+#endif
return 0;
}
device_initcall(ptdump_init);
--
2.42.0.515.g380fc7ccd1-goog
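The ipa_address_markers[] table registered above follows the usual ptdump convention: a sorted list of {start, name} pairs terminated by a NULL name, with the end addresses patched in at runtime. How a walker might label the region an address falls in can be sketched as follows; this is a hedged userspace model, not the kernel's note_page() logic:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel's struct addr_marker. */
struct addr_marker {
	uint64_t start;
	const char *name;
};

/*
 * Return the name of the last marker whose start address does not
 * exceed @addr, i.e. the region the address falls in. The marker
 * table is assumed sorted and NULL-name terminated, like
 * ipa_address_markers[].
 */
static const char *region_name(const struct addr_marker *m, uint64_t addr)
{
	const char *name = NULL;

	for (; m->name; m++) {
		if (m->start > addr)
			break;
		name = m->name;
	}
	return name;
}
```

With markers at 0 ("IPA start") and at the IPA limit ("IPA end"), every dumped range lands under the marker printed most recently, which is how the debugfs output groups entries.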

2023-09-27 19:13:14

by Sebastian Ene

Subject: [PATCH 10/11] arm64: ptdump: Interpret pKVM ownership annotations

Add support for interpreting pKVM invalid stage-2 descriptors that hold
ownership information. We use these descriptors to keep track of the
memory donations from the host side.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 7 +++++++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 7 -------
arch/arm64/mm/ptdump.c | 10 ++++++++++
3 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 913f34d75b29..938baffa7d4d 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -87,6 +87,13 @@ typedef u64 kvm_pte_t;
*/
#define KVM_INVALID_PTE_LOCKED BIT(10)

+/* This corresponds to page-table locking order */
+enum pkvm_component_id {
+ PKVM_ID_HOST,
+ PKVM_ID_HYP,
+ PKVM_ID_FFA,
+};
+
static inline bool kvm_pte_valid(kvm_pte_t pte)
{
return pte & KVM_PTE_VALID;
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 9cfb35d68850..cc2c439ffe75 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -53,13 +53,6 @@ struct host_mmu {
};
extern struct host_mmu host_mmu;

-/* This corresponds to page-table locking order */
-enum pkvm_component_id {
- PKVM_ID_HOST,
- PKVM_ID_HYP,
- PKVM_ID_FFA,
-};
-
extern unsigned long hyp_nr_cpus;

int __pkvm_prot_finalize(void);
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 964758d5e76d..45ff4ebae01a 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -272,6 +272,16 @@ static const struct prot_bits stage2_pte_bits[] = {
.val = PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) | PTE_VALID,
.set = "MEM/NORMAL FWB",
.feature_on = is_fwb_enabled,
+ }, {
+ .mask = KVM_INVALID_PTE_OWNER_MASK | PTE_VALID,
+ .val = FIELD_PREP_CONST(KVM_INVALID_PTE_OWNER_MASK,
+ PKVM_ID_HYP),
+ .set = "HYP",
+ }, {
+ .mask = KVM_INVALID_PTE_OWNER_MASK | PTE_VALID,
+ .val = FIELD_PREP_CONST(KVM_INVALID_PTE_OWNER_MASK,
+ PKVM_ID_FFA),
+ .set = "FF-A",
}, {
.mask = KVM_PGTABLE_PROT_SW0,
.val = KVM_PGTABLE_PROT_SW0,
--
2.42.0.515.g380fc7ccd1-goog
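The prot_bits entries added above match ownership annotations carried in invalid descriptors (VALID bit clear, owner ID encoded in a PTE field). The encode/decode idea can be sketched in userspace; note the field position and width below are assumptions chosen for the illustration, not taken from the kernel's KVM_INVALID_PTE_OWNER_MASK definition:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed layout for the sketch: owner ID in bits [9:2] of an
 * invalid descriptor. Not the authoritative kernel encoding. */
#define PTE_VALID	(1ULL << 0)
#define OWNER_SHIFT	2
#define OWNER_MASK	(0xffULL << OWNER_SHIFT)

enum owner_id { ID_HOST, ID_HYP, ID_FFA };

/* Build an invalid descriptor annotated with an owner ID. */
static uint64_t make_owner_pte(enum owner_id id)
{
	return ((uint64_t)id << OWNER_SHIFT) & OWNER_MASK;
}

/* Interpret an annotation, mirroring the HYP/FF-A labels the ptdump
 * table prints; valid mappings carry no ownership information. */
static const char *pte_owner_name(uint64_t pte)
{
	if (pte & PTE_VALID)
		return NULL;
	switch ((pte & OWNER_MASK) >> OWNER_SHIFT) {
	case ID_HYP:
		return "HYP";
	case ID_FFA:
		return "FF-A";
	default:
		return "HOST";
	}
}
```

Because the mask in the real table includes PTE_VALID, a valid mapping can never be mistaken for a donation annotation, which the sketch reproduces with the early return.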

2023-09-27 19:24:04

by Sebastian Ene

Subject: [PATCH 09/11] arm64: ptdump: Interpret memory attributes based on runtime configuration

When FWB is used, the memory attributes stored in the descriptors have a
different bitfield layout. Introduce two callbacks that verify the
current runtime configuration before interpreting the attribute fields,
and add support for parsing the memory attribute fields from the
page-table descriptors.

Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 67 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index c0e7a80992f4..964758d5e76d 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -85,13 +85,22 @@ struct pg_state {
bool check_wx;
unsigned long wx_pages;
unsigned long uxn_pages;
+ struct ptdump_info *info;
};

+/*
+ * This callback checks the runtime configuration before interpreting the
+ * attributes defined in the prot_bits.
+ */
+typedef bool (*is_feature_cb)(const void *ctx);
+
struct prot_bits {
u64 mask;
u64 val;
const char *set;
const char *clear;
+ is_feature_cb feature_on; /* bit ignored if the callback returns false */
+ is_feature_cb feature_off; /* bit ignored if the callback returns true */
};

static const struct prot_bits pte_bits[] = {
@@ -173,6 +182,34 @@ static const struct prot_bits pte_bits[] = {
}
};

+static bool is_fwb_enabled(const void *ctx)
+{
+ const struct pg_state *st = ctx;
+ const struct ptdump_info *info = st->info;
+ struct kvm_pgtable_snapshot *snapshot = info->priv;
+ struct kvm_pgtable *pgtable = &snapshot->pgtable;
+
+ bool fwb_enabled = false;
+
+ if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
+ fwb_enabled = !(pgtable->flags & KVM_PGTABLE_S2_NOFWB);
+
+ return fwb_enabled;
+}
+
+static bool is_table_bit_ignored(const void *ctx)
+{
+ const struct pg_state *st = ctx;
+
+ if (!(st->current_prot & PTE_VALID))
+ return true;
+
+ if (st->level == CONFIG_PGTABLE_LEVELS)
+ return true;
+
+ return false;
+}
+
static const struct prot_bits stage2_pte_bits[] = {
{
.mask = PTE_VALID,
@@ -214,6 +251,27 @@ static const struct prot_bits stage2_pte_bits[] = {
.val = PTE_TABLE_BIT,
.set = " ",
.clear = "BLK",
+ .feature_off = is_table_bit_ignored,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_VALID,
+ .set = "DEVICE/nGnRE",
+ .feature_off = is_fwb_enabled,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_FWB_DEVICE_nGnRE) | PTE_VALID,
+ .set = "DEVICE/nGnRE FWB",
+ .feature_on = is_fwb_enabled,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_VALID,
+ .set = "MEM/NORMAL",
+ .feature_off = is_fwb_enabled,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) | PTE_VALID,
+ .set = "MEM/NORMAL FWB",
+ .feature_on = is_fwb_enabled,
}, {
.mask = KVM_PGTABLE_PROT_SW0,
.val = KVM_PGTABLE_PROT_SW0,
@@ -289,13 +347,19 @@ static struct pg_level stage2_pg_level[] = {
};

static void dump_prot(struct pg_state *st, const struct prot_bits *bits,
- size_t num)
+ size_t num)
{
unsigned i;

for (i = 0; i < num; i++, bits++) {
const char *s;

+ if (bits->feature_on && !bits->feature_on(st))
+ continue;
+
+ if (bits->feature_off && bits->feature_off(st))
+ continue;
+
if ((st->current_prot & bits->mask) == bits->val)
s = bits->set;
else
@@ -651,6 +715,7 @@ static void stage2_ptdump_walk(struct seq_file *s, struct ptdump_info *info)
.marker = info->markers,
.level = pgtable->start_level,
.pg_level = &stage2_pg_level[0],
+ .info = info,
.ptdump = {
.note_page = note_page,
.range = (struct ptdump_range[]) {
--
2.42.0.515.g380fc7ccd1-goog
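The feature_on/feature_off gating added to dump_prot() means a bit description is only consulted when its runtime predicate agrees, so the FWB and non-FWB attribute encodings can share one table. A minimal userspace model of that selection logic (all names illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-ins for pg_state and the gated prot_bits. */
struct state {
	uint64_t prot;
	int fwb;	/* models the FWB runtime configuration */
};

typedef int (*feature_cb)(const struct state *);

struct bit_desc {
	uint64_t mask, val;
	const char *set;
	feature_cb feature_on;	/* entry skipped unless this returns true */
	feature_cb feature_off;	/* entry skipped if this returns true */
};

static int fwb_enabled(const struct state *st)
{
	return st->fwb;
}

/*
 * Return the label of the first descriptor that both passes its
 * feature gate and matches the protection bits, mirroring the
 * continue statements added to dump_prot().
 */
static const char *match_prot(const struct state *st,
			      const struct bit_desc *bits, size_t n)
{
	for (size_t i = 0; i < n; i++, bits++) {
		if (bits->feature_on && !bits->feature_on(st))
			continue;
		if (bits->feature_off && bits->feature_off(st))
			continue;
		if ((st->prot & bits->mask) == bits->val)
			return bits->set;
	}
	return NULL;
}
```

With two entries that match the same descriptor bits but are gated on opposite FWB predicates, the printed label follows the runtime configuration rather than the raw bits alone.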

2023-09-28 19:55:39

by kernel test robot

Subject: Re: [PATCH 01/11] KVM: arm64: Add snap shooting the host stage-2 pagetables

Hi Sebastian,

kernel test robot noticed the following build warnings:

[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on kvmarm/next linus/master v6.6-rc3 next-20230928]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Ene/KVM-arm64-Add-snap-shooting-the-host-stage-2-pagetables/20230927-192734
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link: https://lore.kernel.org/r/20230927112517.2631674-2-sebastianene%40google.com
patch subject: [PATCH 01/11] KVM: arm64: Add snap shooting the host stage-2 pagetables
config: arm64-allyesconfig (https://download.01.org/0day-ci/archive/20230929/[email protected]/config)
compiler: aarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230929/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

arch/arm64/kvm/hyp/nvhe/mem_protect.c: In function '__pkvm_host_stage2_prepare_copy':
>> arch/arm64/kvm/hyp/nvhe/mem_protect.c:335:13: warning: variable 'nr_pages' set but not used [-Wunused-but-set-variable]
335 | u64 nr_pages;
| ^~~~~~~~


vim +/nr_pages +335 arch/arm64/kvm/hyp/nvhe/mem_protect.c

326
327 int __pkvm_host_stage2_prepare_copy(struct kvm_pgtable_snapshot *snapshot)
328 {
329 size_t required_pgd_len;
330 struct kvm_pgtable_mm_ops mm_ops = {0};
331 struct kvm_pgtable *to_pgt, *from_pgt = &host_mmu.pgt;
332 struct kvm_hyp_memcache *memcache = &snapshot->mc;
333 int ret;
334 void *pgd;
> 335 u64 nr_pages;
336
337 required_pgd_len = kvm_pgtable_stage2_pgd_size(host_mmu.arch.vtcr);
338 if (snapshot->pgd_len < required_pgd_len)
339 return -ENOMEM;
340
341 to_pgt = &snapshot->pgtable;
342 nr_pages = snapshot->pgd_len / PAGE_SIZE;
343 pgd = kern_hyp_va(snapshot->pgd_hva);
344
345 hyp_spin_lock(&snapshot_pool_lock);
346 hyp_pool_init(&snapshot_pool, hyp_virt_to_pfn(pgd),
347 required_pgd_len / PAGE_SIZE, 0);
348
349 mm_ops.zalloc_pages_exact = snapshot_zalloc_pages_exact;
350 mm_ops.zalloc_page = snapshot_zalloc_page;
351 mm_ops.free_pages_exact = snapshot_s2_free_pages_exact;
352 mm_ops.get_page = snapshot_get_page;
353 mm_ops.phys_to_virt = hyp_phys_to_virt;
354 mm_ops.virt_to_phys = hyp_virt_to_phys;
355 mm_ops.page_count = hyp_page_count;
356
357 to_pgt->ia_bits = from_pgt->ia_bits;
358 to_pgt->start_level = from_pgt->start_level;
359 to_pgt->flags = from_pgt->flags;
360 to_pgt->mm_ops = &mm_ops;
361
362 host_lock_component();
363 ret = kvm_pgtable_stage2_copy(to_pgt, from_pgt, memcache);
364 host_unlock_component();
365
366 hyp_spin_unlock(&snapshot_pool_lock);
367
368 return ret;
369 }
370 #endif /* CONFIG_NVHE_EL2_DEBUG */
371

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-09-29 13:14:07

by Marc Zyngier

Subject: Re: [PATCH 00/11] arm64: ptdump: View the host stage-2 page-tables

Hi Sebastian,

On Wed, 27 Sep 2023 12:25:06 +0100,
Sebastian Ene <[email protected]> wrote:
>
> Hi,
>
> This can be used as a debugging tool for dumping the host stage-2
> page-tables under pKVM envinronment.

Why only pKVM? Why only the host? Dumping page tables shouldn't be
reserved to this corner case. Especially considering that pKVM is still
really far away from being remotely useful upstream.

I'd really expect this sort of debugging information to be fully
available for both host and guest, for all modes (nVHE, VHE, hVHE,
protected, nested), without limitations other than the configuration
option.

Also, please Cc the relevant parties (I'm the only one Cc'd on the KVM
side...)

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2023-10-01 09:25:41

by Sebastian Ene

Subject: Re: [PATCH 00/11] arm64: ptdump: View the host stage-2 page-tables

On Fri, Sep 29, 2023 at 02:11:23PM +0100, Marc Zyngier wrote:

Hello Marc,

Thanks for having a look.

> Hi Sebastian,
>
> On Wed, 27 Sep 2023 12:25:06 +0100,
> Sebastian Ene <[email protected]> wrote:
> >
> > Hi,
> >
> > This can be used as a debugging tool for dumping the host stage-2
> > page-tables under pKVM envinronment.
>
> Why only pKVM? Why only the host? Dumping page tables shouldn't be
> reserved to this corner case. Specially considering that pKVM is still
> really far away from being remotely useful upstream.
>

I wanted to publish the initial series which adds support for the host
and then extend it to guest VMs.

> I'd really expect this sort of debugging information to be fully
> available for both host and guest, for all modes (nVHE, VHE, hVHE,
> protected, nested), without limitations other than the configuration
> option.

I agree, let me re-spin the series and add support for non-protected as
well.

>
> Also, please Cc the relevant parties (I'm the only one Cc'd on the KVM
> side...)
>

Thanks,

Sebastian

> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.