Hi,
This series can be used as a debugging tool for dumping the stage-2
page-tables.
When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump registers a
'/sys/kernel/debug/kvm/<guest_id>/stage2_page_tables' entry with debugfs
upon guest creation. This allows userspace tools (e.g. cat) to dump the
stage-2 pagetables by reading the registered file.
Reading the debugfs file shows the stage-2 memory ranges in the following format:
<IPA range> <size> <descriptor type> <access permissions> <mem_attributes>
Under the pKVM configuration (kvm-arm.mode=protected), ptdump registers an
entry for the host stage-2 pagetables at the following path:
/sys/kernel/debug/kvm/host_stage2_page_tables
The tool interprets the pKVM ownership annotations stored in the invalid
entries and dumps the ownership information to the console. To be able
to access the host stage-2 page-tables from the kernel, a new hypervisor
call was introduced which allows us to snapshot the page-tables into a
host-provided buffer. The hypervisor call is hidden behind
CONFIG_NVHE_EL2_DEBUG as it is only intended for debugging environments.
Link to the second version:
https://lore.kernel.org/all/[email protected]/#r
Link to the first version:
https://lore.kernel.org/all/[email protected]/
Changelog:
v2 -> v3:
* register the stage-2 debugfs entry for the host under
/sys/debug/kvm/host_stage2_page_tables and in
/sys/debug/kvm/<guest_id>/stage2_page_tables for guests.
* don't use a static array for parsing the attributes description,
generate it dynamically based on the number of pagetable levels
* remove the lock that was guarding the seq_file private inode data,
and keep the data private to the open file session.
* minor fixes & renaming of CONFIG_NVHE_EL2_PTDUMP_DEBUGFS to
CONFIG_PTDUMP_STAGE2_DEBUGFS
v1 -> v2:
* use the stage-2 pagetable walker for dumping descriptors instead of
the one provided by ptdump.
* support for dumping guest pagetables under non-protected VHE/nVHE
Thanks,
Sebastian Ene (10):
KVM: arm64: Add snap shooting the host stage-2 pagetables
arm64: ptdump: Use the mask from the state structure
arm64: ptdump: Add the walker function to the ptdump info structure
KVM: arm64: Move pagetable definitions to common header
arm64: ptdump: Add hooks on debugfs file operations
arm64: ptdump: Register a debugfs entry for the host stage-2 tables
arm64: ptdump: Parse the host stage-2 page-tables from the snapshot
arm64: ptdump: Interpret memory attributes based on runtime
configuration
arm64: ptdump: Interpret pKVM ownership annotations
arm64: ptdump: Add support for guest stage-2 pagetables dumping
arch/arm64/include/asm/kvm_asm.h | 1 +
arch/arm64/include/asm/kvm_pgtable.h | 85 +++
arch/arm64/include/asm/ptdump.h | 27 +
arch/arm64/kvm/Kconfig | 13 +
arch/arm64/kvm/arm.c | 2 +
arch/arm64/kvm/debug.c | 6 +
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 8 +-
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 20 +
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 102 ++++
arch/arm64/kvm/hyp/pgtable.c | 98 ++--
arch/arm64/kvm/mmu.c | 2 +
arch/arm64/mm/ptdump.c | 483 +++++++++++++++++-
arch/arm64/mm/ptdump_debugfs.c | 64 ++-
13 files changed, 852 insertions(+), 59 deletions(-)
--
2.43.0.rc0.421.g78406f8d94-goog
Stage-2 needs a dedicated walk function to be able to parse the
concatenated pagetables. The ptdump info structure is used to hold the
different configuration options for the walker. This structure is
registered with the debugfs entry and is stored in the private data of
the debugfs file. Hence, in preparation for parsing the stage-2
pagetables, add the walk function as a member of the ptdump info
structure.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 1 +
arch/arm64/mm/ptdump.c | 1 +
arch/arm64/mm/ptdump_debugfs.c | 3 ++-
3 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 581caac525b0..1f6e0aabf16a 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -19,6 +19,7 @@ struct ptdump_info {
struct mm_struct *mm;
const struct addr_marker *markers;
unsigned long base_addr;
+ void (*ptdump_walk)(struct seq_file *s, struct ptdump_info *info);
};
void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 8761a70f916f..d531e24ea0b2 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -346,6 +346,7 @@ static struct ptdump_info kernel_ptdump_info = {
.mm = &init_mm,
.markers = address_markers,
.base_addr = PAGE_OFFSET,
+ .ptdump_walk = &ptdump_walk,
};
void ptdump_check_wx(void)
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 68bf1a125502..7564519db1e6 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -10,7 +10,8 @@ static int ptdump_show(struct seq_file *m, void *v)
struct ptdump_info *info = m->private;
get_online_mems();
- ptdump_walk(m, info);
+ if (info->ptdump_walk)
+ info->ptdump_walk(m, info);
put_online_mems();
return 0;
}
--
2.43.0.rc0.421.g78406f8d94-goog
Register a debugfs file on guest creation to be able to view the guest's
stage-2 translation tables with ptdump. This assumes that the host is in
control of the guest stage-2 and has direct access to the pagetables.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 7 ++++
arch/arm64/kvm/debug.c | 6 +++
arch/arm64/kvm/mmu.c | 2 +
arch/arm64/mm/ptdump.c | 68 +++++++++++++++++++++++++++++++++
4 files changed, 83 insertions(+)
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index de5a5a0c0ecf..21b281715d27 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -5,6 +5,8 @@
#ifndef __ASM_PTDUMP_H
#define __ASM_PTDUMP_H
+#include <asm/kvm_pgtable.h>
+
#ifdef CONFIG_PTDUMP_CORE
#include <linux/mm_types.h>
@@ -23,6 +25,7 @@ struct ptdump_info {
int (*ptdump_prepare_walk)(void *file_priv);
void (*ptdump_end_walk)(void *file_priv);
size_t mc_len;
+ void *priv;
};
void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
@@ -48,8 +51,12 @@ void ptdump_check_wx(void);
#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
void ptdump_register_host_stage2(void);
+int ptdump_register_guest_stage2(struct kvm *kvm);
+void ptdump_unregister_guest_stage2(struct kvm_pgtable *pgt);
#else
static inline void ptdump_register_host_stage2(void) { }
+static inline int ptdump_register_guest_stage2(struct kvm *kvm) { return 0; }
+static inline void ptdump_unregister_guest_stage2(struct kvm_pgtable *pgt) { }
#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
#ifdef CONFIG_DEBUG_WX
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 8725291cb00a..555d022f8ad9 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -13,6 +13,7 @@
#include <asm/kvm_asm.h>
#include <asm/kvm_arm.h>
#include <asm/kvm_emulate.h>
+#include <asm/ptdump.h>
#include "trace.h"
@@ -342,3 +343,8 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
}
+
+int kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+ return ptdump_register_guest_stage2(kvm);
+}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d87c8fcc4c24..da45050596e6 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -11,6 +11,7 @@
#include <linux/sched/signal.h>
#include <trace/events/kvm.h>
#include <asm/pgalloc.h>
+#include <asm/ptdump.h>
#include <asm/cacheflush.h>
#include <asm/kvm_arm.h>
#include <asm/kvm_mmu.h>
@@ -1021,6 +1022,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
write_unlock(&kvm->mmu_lock);
if (pgt) {
+ ptdump_unregister_guest_stage2(pgt);
kvm_pgtable_stage2_destroy(pgt);
kfree(pgt);
}
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index ffb87237078b..741764cff105 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -27,6 +27,7 @@
#include <asm/kvm_pkvm.h>
#include <asm/kvm_pgtable.h>
#include <asm/kvm_host.h>
+#include <asm/kvm_mmu.h>
enum address_markers_idx {
@@ -519,6 +520,16 @@ void ptdump_check_wx(void)
#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
static struct ptdump_info stage2_kernel_ptdump_info;
+struct ptdump_registered_guest {
+ struct list_head reg_list;
+ struct ptdump_info info;
+ struct kvm_pgtable_snapshot snapshot;
+ rwlock_t *lock;
+};
+
+static LIST_HEAD(ptdump_guest_list);
+static DEFINE_MUTEX(ptdump_list_lock);
+
static phys_addr_t ptdump_host_pa(void *addr)
{
return __pa(addr);
@@ -757,6 +768,63 @@ static void stage2_ptdump_walk(struct seq_file *s, struct ptdump_info *info)
kvm_pgtable_walk(pgtable, start_ipa, end_ipa, &walker);
}
+static void guest_stage2_ptdump_walk(struct seq_file *s,
+ struct ptdump_info *info)
+{
+ struct ptdump_info_file_priv *f_priv =
+ container_of(info, struct ptdump_info_file_priv, info);
+ struct ptdump_registered_guest *guest = info->priv;
+
+ f_priv->file_priv = &guest->snapshot;
+
+ read_lock(guest->lock);
+ stage2_ptdump_walk(s, info);
+ read_unlock(guest->lock);
+}
+
+int ptdump_register_guest_stage2(struct kvm *kvm)
+{
+ struct ptdump_registered_guest *guest;
+ struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
+ struct kvm_pgtable *pgt = mmu->pgt;
+
+ guest = kzalloc(sizeof(struct ptdump_registered_guest), GFP_KERNEL);
+ if (!guest)
+ return -ENOMEM;
+
+ memcpy(&guest->snapshot.pgtable, pgt, sizeof(struct kvm_pgtable));
+ guest->info = (struct ptdump_info) {
+ .ptdump_walk = guest_stage2_ptdump_walk,
+ };
+
+ guest->info.priv = guest;
+ guest->lock = &kvm->mmu_lock;
+ mutex_lock(&ptdump_list_lock);
+
+ ptdump_debugfs_kvm_register(&guest->info, "stage2_page_tables",
+ kvm->debugfs_dentry);
+
+ list_add(&guest->reg_list, &ptdump_guest_list);
+ mutex_unlock(&ptdump_list_lock);
+
+ return 0;
+}
+
+void ptdump_unregister_guest_stage2(struct kvm_pgtable *pgt)
+{
+ struct ptdump_registered_guest *guest;
+
+ mutex_lock(&ptdump_list_lock);
+ list_for_each_entry(guest, &ptdump_guest_list, reg_list) {
+ if (guest->snapshot.pgtable.pgd == pgt->pgd) {
+ list_del(&guest->reg_list);
+ kfree(guest);
+ break;
+ }
+ }
+ mutex_unlock(&ptdump_list_lock);
+}
+
void ptdump_register_host_stage2(void)
{
if (!is_protected_kvm_enabled())
--
2.43.0.rc0.421.g78406f8d94-goog
Printing the descriptor attributes requires accessing a mask which has a
different set of attributes for stage-2. In preparation for adding support
for stage-2 pagetable dumping, use the mask from the local context
and not from the globally defined pg_level array. Store a pointer to
the pg_level array in the ptdump state structure. This will allow us to
extract the mask wrapped in the pg_level array and use it for
descriptor parsing in note_page().
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index e305b6593c4e..8761a70f916f 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -75,6 +75,7 @@ static struct addr_marker address_markers[] = {
struct pg_state {
struct ptdump_state ptdump;
struct seq_file *seq;
+ struct pg_level *pg_level;
const struct addr_marker *marker;
unsigned long start_address;
int level;
@@ -252,11 +253,12 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
u64 val)
{
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
+ struct pg_level *pg_info = st->pg_level;
static const char units[] = "KMGTPE";
u64 prot = 0;
if (level >= 0)
- prot = val & pg_level[level].mask;
+ prot = val & pg_info[level].mask;
if (st->level == -1) {
st->level = level;
@@ -282,10 +284,10 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
unit++;
}
pt_dump_seq_printf(st->seq, "%9lu%c %s", delta, *unit,
- pg_level[st->level].name);
- if (st->current_prot && pg_level[st->level].bits)
- dump_prot(st, pg_level[st->level].bits,
- pg_level[st->level].num);
+ pg_info[st->level].name);
+ if (st->current_prot && pg_info[st->level].bits)
+ dump_prot(st, pg_info[st->level].bits,
+ pg_info[st->level].num);
pt_dump_seq_puts(st->seq, "\n");
if (addr >= st->marker[1].start_address) {
@@ -316,6 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
st = (struct pg_state){
.seq = s,
.marker = info->markers,
+ .pg_level = &pg_level[0],
.level = -1,
.ptdump = {
.note_page = note_page,
@@ -353,6 +356,7 @@ void ptdump_check_wx(void)
{ 0, NULL},
{ -1, NULL},
},
+ .pg_level = &pg_level[0],
.level = -1,
.check_wx = true,
.ptdump = {
--
2.43.0.rc0.421.g78406f8d94-goog
Initialize the structures used to keep the state of the host stage-2
ptdump walker when pKVM is enabled. Create a new debugfs entry for the
host stage-2 pagetables and hook up the callbacks invoked when the entry
is accessed. When the debugfs file is opened, allocate memory resources
which will be shared with the hypervisor for saving the pagetable
snapshot. On close, release the associated memory and unshare it from
the hypervisor.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 12 +++
arch/arm64/kvm/Kconfig | 13 +++
arch/arm64/kvm/arm.c | 2 +
arch/arm64/mm/ptdump.c | 168 ++++++++++++++++++++++++++++++++
arch/arm64/mm/ptdump_debugfs.c | 8 +-
5 files changed, 202 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 9b2bebfcefbe..de5a5a0c0ecf 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -22,6 +22,7 @@ struct ptdump_info {
void (*ptdump_walk)(struct seq_file *s, struct ptdump_info *info);
int (*ptdump_prepare_walk)(void *file_priv);
void (*ptdump_end_walk)(void *file_priv);
+ size_t mc_len;
};
void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
@@ -33,13 +34,24 @@ struct ptdump_info_file_priv {
#ifdef CONFIG_PTDUMP_DEBUGFS
#define EFI_RUNTIME_MAP_END DEFAULT_MAP_WINDOW_64
void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name);
+void ptdump_debugfs_kvm_register(struct ptdump_info *info, const char *name,
+ struct dentry *d_entry);
#else
static inline void ptdump_debugfs_register(struct ptdump_info *info,
const char *name) { }
+static inline void ptdump_debugfs_kvm_register(struct ptdump_info *info,
+ const char *name,
+ struct dentry *d_entry) { }
#endif
void ptdump_check_wx(void);
#endif /* CONFIG_PTDUMP_CORE */
+#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
+void ptdump_register_host_stage2(void);
+#else
+static inline void ptdump_register_host_stage2(void) { }
+#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
+
#ifdef CONFIG_DEBUG_WX
#define debug_checkwx() ptdump_check_wx()
#else
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..cf5b7f06b152 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -71,4 +71,17 @@ config PROTECTED_NVHE_STACKTRACE
If unsure, or not using protected nVHE (pKVM), say N.
+config PTDUMP_STAGE2_DEBUGFS
+ bool "Present the stage-2 pagetables to debugfs"
+ depends on NVHE_EL2_DEBUG && PTDUMP_DEBUGFS && KVM
+ default n
+ help
+ Say Y here if you want to show the stage-2 kernel pagetables
+ layout in a debugfs file. This information is only useful for kernel developers
+ who are working in architecture specific areas of the kernel.
+ It is probably not a good idea to enable this feature in a production
+ kernel.
+
+ If in doubt, say N.
+
endif # VIRTUALIZATION
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e5f75f1f1085..987683650576 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -28,6 +28,7 @@
#include <linux/uaccess.h>
#include <asm/ptrace.h>
+#include <asm/ptdump.h>
#include <asm/mman.h>
#include <asm/tlbflush.h>
#include <asm/cacheflush.h>
@@ -2592,6 +2593,7 @@ static __init int kvm_arm_init(void)
if (err)
goto out_subs;
+ ptdump_register_host_stage2();
kvm_arm_initialised = true;
return 0;
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index d531e24ea0b2..0b4cb54e43ff 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -24,6 +24,9 @@
#include <asm/memory.h>
#include <asm/pgtable-hwdef.h>
#include <asm/ptdump.h>
+#include <asm/kvm_pkvm.h>
+#include <asm/kvm_pgtable.h>
+#include <asm/kvm_host.h>
enum address_markers_idx {
@@ -378,6 +381,170 @@ void ptdump_check_wx(void)
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
}
+#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
+static struct ptdump_info stage2_kernel_ptdump_info;
+
+static phys_addr_t ptdump_host_pa(void *addr)
+{
+ return __pa(addr);
+}
+
+static void *ptdump_host_va(phys_addr_t phys)
+{
+ return __va(phys);
+}
+
+static size_t stage2_get_pgd_len(void)
+{
+ u64 mmfr0, mmfr1, vtcr;
+ u32 phys_shift = get_kvm_ipa_limit();
+
+ mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+ mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
+ vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
+
+ return kvm_pgtable_stage2_pgd_size(vtcr);
+}
+
+static int stage2_ptdump_prepare_walk(void *file_priv)
+{
+ struct ptdump_info_file_priv *f_priv = file_priv;
+ struct ptdump_info *info = &f_priv->info;
+ struct kvm_pgtable_snapshot *snapshot;
+ int ret, pgd_index, mc_index, pgd_pages_sz;
+ void *page_hva;
+ phys_addr_t pgd;
+
+ snapshot = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
+ if (!snapshot)
+ return -ENOMEM;
+
+ memset(snapshot, 0, PAGE_SIZE);
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp, virt_to_pfn(snapshot));
+ if (ret)
+ goto free_snapshot;
+
+ snapshot->pgd_len = stage2_get_pgd_len();
+ pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
+ snapshot->pgd_hva = alloc_pages_exact(snapshot->pgd_len,
+ GFP_KERNEL_ACCOUNT);
+ if (!snapshot->pgd_hva) {
+ ret = -ENOMEM;
+ goto unshare_snapshot;
+ }
+
+ for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
+ page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+ virt_to_pfn(page_hva));
+ if (ret)
+ goto unshare_pgd_pages;
+ }
+
+ for (mc_index = 0; mc_index < info->mc_len; mc_index++) {
+ page_hva = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
+ if (!page_hva) {
+ ret = -ENOMEM;
+ goto free_memcache_pages;
+ }
+
+ push_hyp_memcache(&snapshot->mc, page_hva, ptdump_host_pa);
+ ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
+ virt_to_pfn(page_hva));
+ if (ret) {
+ pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ free_pages_exact(page_hva, PAGE_SIZE);
+ goto free_memcache_pages;
+ }
+ }
+
+ ret = kvm_call_hyp_nvhe(__pkvm_copy_host_stage2, snapshot);
+ if (ret)
+ goto free_memcache_pages;
+
+ pgd = (phys_addr_t)snapshot->pgtable.pgd;
+ snapshot->pgtable.pgd = phys_to_virt(pgd);
+ f_priv->file_priv = snapshot;
+ return 0;
+
+free_memcache_pages:
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ while (page_hva) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ free_pages_exact(page_hva, PAGE_SIZE);
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ }
+unshare_pgd_pages:
+ pgd_index = pgd_index - 1;
+ for (; pgd_index >= 0; pgd_index--) {
+ page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ }
+ free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
+unshare_snapshot:
+ WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(snapshot)));
+free_snapshot:
+ free_pages_exact(snapshot, PAGE_SIZE);
+ f_priv->file_priv = NULL;
+ return ret;
+}
+
+static void stage2_ptdump_end_walk(void *file_priv)
+{
+ struct ptdump_info_file_priv *f_priv = file_priv;
+ struct kvm_pgtable_snapshot *snapshot = f_priv->file_priv;
+ void *page_hva;
+ int pgd_index, ret, pgd_pages_sz;
+
+ if (!snapshot)
+ return;
+
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ while (page_hva) {
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ free_pages_exact(page_hva, PAGE_SIZE);
+ page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
+ }
+
+ pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
+ for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
+ page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
+ ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(page_hva));
+ WARN_ON(ret);
+ }
+
+ free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
+ WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
+ virt_to_pfn(snapshot)));
+ free_pages_exact(snapshot, PAGE_SIZE);
+ f_priv->file_priv = NULL;
+}
+
+void ptdump_register_host_stage2(void)
+{
+ if (!is_protected_kvm_enabled())
+ return;
+
+ stage2_kernel_ptdump_info = (struct ptdump_info) {
+ .mc_len = host_s2_pgtable_pages(),
+ .ptdump_prepare_walk = stage2_ptdump_prepare_walk,
+ .ptdump_end_walk = stage2_ptdump_end_walk,
+ };
+
+ ptdump_debugfs_kvm_register(&stage2_kernel_ptdump_info,
+ "host_stage2_page_tables",
+ kvm_debugfs_dir);
+}
+#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
+
static int __init ptdump_init(void)
{
address_markers[PAGE_END_NR].start_address = PAGE_END;
@@ -386,6 +553,7 @@ static int __init ptdump_init(void)
#endif
ptdump_initialize();
ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
+
return 0;
}
device_initcall(ptdump_init);
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 3bf5de51e8c3..4821dbef784c 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -68,5 +68,11 @@ static const struct file_operations ptdump_fops = {
void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name)
{
- debugfs_create_file(name, 0400, NULL, info, &ptdump_fops);
+ ptdump_debugfs_kvm_register(info, name, NULL);
+}
+
+void ptdump_debugfs_kvm_register(struct ptdump_info *info, const char *name,
+ struct dentry *d_entry)
+{
+ debugfs_create_file(name, 0400, d_entry, info, &ptdump_fops);
}
--
2.43.0.rc0.421.g78406f8d94-goog
When FWB is used, the memory attributes stored in the descriptors have a
different bitfield layout. Introduce two callbacks that check the current
runtime configuration before parsing the attribute fields.
Add support for parsing the memory attribute fields from the page-table
descriptors.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 65 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 9f88542d5312..ec7f6430f6d7 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -89,11 +89,19 @@ struct pg_state {
struct ptdump_info_file_priv *f_priv;
};
+/*
+ * This callback checks the runtime configuration before interpreting the
+ * attributes defined in the prot_bits.
+ */
+typedef bool (*is_feature_cb)(const void *ctx);
+
struct prot_bits {
u64 mask;
u64 val;
const char *set;
const char *clear;
+ is_feature_cb feature_on; /* bit ignored if the callback returns false */
+ is_feature_cb feature_off; /* bit ignored if the callback returns true */
};
static const struct prot_bits pte_bits[] = {
@@ -175,6 +183,34 @@ static const struct prot_bits pte_bits[] = {
}
};
+static bool is_fwb_enabled(const void *ctx)
+{
+ const struct pg_state *st = ctx;
+ struct ptdump_info_file_priv *f_priv = st->f_priv;
+ struct kvm_pgtable_snapshot *snapshot = f_priv->file_priv;
+ struct kvm_pgtable *pgtable = &snapshot->pgtable;
+
+ bool fwb_enabled = false;
+
+ if (cpus_have_final_cap(ARM64_HAS_STAGE2_FWB))
+ fwb_enabled = !(pgtable->flags & KVM_PGTABLE_S2_NOFWB);
+
+ return fwb_enabled;
+}
+
+static bool is_table_bit_ignored(const void *ctx)
+{
+ const struct pg_state *st = ctx;
+
+ if (!(st->current_prot & PTE_VALID))
+ return true;
+
+ if (st->level == CONFIG_PGTABLE_LEVELS)
+ return true;
+
+ return false;
+}
+
static const struct prot_bits stage2_pte_bits[] = {
{
.mask = PTE_VALID,
@@ -216,6 +252,27 @@ static const struct prot_bits stage2_pte_bits[] = {
.val = PTE_TABLE_BIT,
.set = " ",
.clear = "BLK",
+ .feature_off = is_table_bit_ignored,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_DEVICE_nGnRE) | PTE_VALID,
+ .set = "DEVICE/nGnRE",
+ .feature_off = is_fwb_enabled,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_FWB_DEVICE_nGnRE) | PTE_VALID,
+ .set = "DEVICE/nGnRE FWB",
+ .feature_on = is_fwb_enabled,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_VALID,
+ .set = "MEM/NORMAL",
+ .feature_off = is_fwb_enabled,
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_MEMATTR | PTE_VALID,
+ .val = PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) | PTE_VALID,
+ .set = "MEM/NORMAL FWB",
+ .feature_on = is_fwb_enabled,
}, {
.mask = KVM_PGTABLE_PROT_SW0,
.val = KVM_PGTABLE_PROT_SW0,
@@ -267,13 +324,19 @@ static struct pg_level pg_level[] = {
};
static void dump_prot(struct pg_state *st, const struct prot_bits *bits,
- size_t num)
+ size_t num)
{
unsigned i;
for (i = 0; i < num; i++, bits++) {
const char *s;
+ if (bits->feature_on && !bits->feature_on(st))
+ continue;
+
+ if (bits->feature_off && bits->feature_off(st))
+ continue;
+
if ((st->current_prot & bits->mask) == bits->val)
s = bits->set;
else
--
2.43.0.rc0.421.g78406f8d94-goog
Add support for interpreting pKVM invalid stage-2 descriptors that hold
ownership information. These descriptors are used to keep track of
memory donations from the host side.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/kvm_pgtable.h | 7 +++++++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 7 -------
arch/arm64/mm/ptdump.c | 10 ++++++++++
3 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 913f34d75b29..938baffa7d4d 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -87,6 +87,13 @@ typedef u64 kvm_pte_t;
*/
#define KVM_INVALID_PTE_LOCKED BIT(10)
+/* This corresponds to page-table locking order */
+enum pkvm_component_id {
+ PKVM_ID_HOST,
+ PKVM_ID_HYP,
+ PKVM_ID_FFA,
+};
+
static inline bool kvm_pte_valid(kvm_pte_t pte)
{
return pte & KVM_PTE_VALID;
diff --git a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
index 9cfb35d68850..cc2c439ffe75 100644
--- a/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
+++ b/arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
@@ -53,13 +53,6 @@ struct host_mmu {
};
extern struct host_mmu host_mmu;
-/* This corresponds to page-table locking order */
-enum pkvm_component_id {
- PKVM_ID_HOST,
- PKVM_ID_HYP,
- PKVM_ID_FFA,
-};
-
extern unsigned long hyp_nr_cpus;
int __pkvm_prot_finalize(void);
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index ec7f6430f6d7..ffb87237078b 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -273,6 +273,16 @@ static const struct prot_bits stage2_pte_bits[] = {
.val = PTE_S2_MEMATTR(MT_S2_FWB_NORMAL) | PTE_VALID,
.set = "MEM/NORMAL FWB",
.feature_on = is_fwb_enabled,
+ }, {
+ .mask = KVM_INVALID_PTE_OWNER_MASK | PTE_VALID,
+ .val = FIELD_PREP_CONST(KVM_INVALID_PTE_OWNER_MASK,
+ PKVM_ID_HYP),
+ .set = "HYP",
+ }, {
+ .mask = KVM_INVALID_PTE_OWNER_MASK | PTE_VALID,
+ .val = FIELD_PREP_CONST(KVM_INVALID_PTE_OWNER_MASK,
+ PKVM_ID_FFA),
+ .set = "FF-A",
}, {
.mask = KVM_PGTABLE_PROT_SW0,
.val = KVM_PGTABLE_PROT_SW0,
--
2.43.0.rc0.421.g78406f8d94-goog
Add a walker function which configures ptdump to parse the page-tables
from the snapshot. Define the attributes used by the stage-2 parser and
dynamically build an array which holds the name of each pagetable level.
Walk the entire address space configured in the pagetable and parse the
attribute descriptors.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/mm/ptdump.c | 157 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 157 insertions(+)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 0b4cb54e43ff..9f88542d5312 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -86,6 +86,7 @@ struct pg_state {
bool check_wx;
unsigned long wx_pages;
unsigned long uxn_pages;
+ struct ptdump_info_file_priv *f_priv;
};
struct prot_bits {
@@ -174,6 +175,66 @@ static const struct prot_bits pte_bits[] = {
}
};
+static const struct prot_bits stage2_pte_bits[] = {
+ {
+ .mask = PTE_VALID,
+ .val = PTE_VALID,
+ .set = " ",
+ .clear = "F",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN,
+ .val = KVM_PTE_LEAF_ATTR_HI_S2_XN,
+ .set = "XN",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R,
+ .set = "R",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
+ .set = "W",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_AF,
+ .set = "AF",
+ .clear = " ",
+ }, {
+ .mask = PTE_NG,
+ .val = PTE_NG,
+ .set = "FnXS",
+ .clear = " ",
+ }, {
+ .mask = PTE_CONT,
+ .val = PTE_CONT,
+ .set = "CON",
+ .clear = " ",
+ }, {
+ .mask = PTE_TABLE_BIT,
+ .val = PTE_TABLE_BIT,
+ .set = " ",
+ .clear = "BLK",
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW0,
+ .val = KVM_PGTABLE_PROT_SW0,
+ .set = "SW0", /* PKVM_PAGE_SHARED_OWNED */
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW1,
+ .val = KVM_PGTABLE_PROT_SW1,
+ .set = "SW1", /* PKVM_PAGE_SHARED_BORROWED */
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW2,
+ .val = KVM_PGTABLE_PROT_SW2,
+ .set = "SW2",
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW3,
+ .val = KVM_PGTABLE_PROT_SW3,
+ .set = "SW3",
+ },
+};
+
struct pg_level {
const struct prot_bits *bits;
const char *name;
@@ -286,6 +347,7 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
delta >>= 10;
unit++;
}
+
pt_dump_seq_printf(st->seq, "%9lu%c %s", delta, *unit,
pg_info[st->level].name);
if (st->current_prot && pg_info[st->level].bits)
@@ -394,6 +456,11 @@ static void *ptdump_host_va(phys_addr_t phys)
return __va(phys);
}
+static struct kvm_pgtable_mm_ops host_mmops = {
+ .phys_to_virt = ptdump_host_va,
+ .virt_to_phys = ptdump_host_pa,
+};
+
static size_t stage2_get_pgd_len(void)
{
u64 mmfr0, mmfr1, vtcr;
@@ -528,6 +595,95 @@ static void stage2_ptdump_end_walk(void *file_priv)
f_priv->file_priv = NULL;
}
+static int stage2_ptdump_visitor(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ struct pg_state *st = ctx->arg;
+ struct ptdump_state *pt_st = &st->ptdump;
+
+ pt_st->note_page(pt_st, ctx->addr, ctx->level, ctx->old);
+
+ return 0;
+}
+
+static void stage2_ptdump_build_levels(struct pg_level *level,
+ size_t num_levels,
+ unsigned int start_level)
+{
+ static const char * const lvl_names[] = {"PGD", "PUD", "PMD", "PTE"};
+ int i, j, name_index;
+
+ if (num_levels > KVM_PGTABLE_MAX_LEVELS && start_level > 2) {
+ pr_warn("invalid configuration %lu levels start_lvl %u\n",
+ num_levels, start_level);
+ return;
+ }
+
+ for (i = start_level; i < num_levels; i++) {
+ name_index = i - start_level;
+ name_index = name_index * start_level + name_index;
+
+ level[i].name = lvl_names[name_index];
+ level[i].num = ARRAY_SIZE(stage2_pte_bits);
+ level[i].bits = stage2_pte_bits;
+
+ for (j = 0; j < level[i].num; j++)
+ level[i].mask |= level[i].bits[j].mask;
+ }
+}
+
+static void stage2_ptdump_walk(struct seq_file *s, struct ptdump_info *info)
+{
+ struct ptdump_info_file_priv *f_priv =
+ container_of(info, struct ptdump_info_file_priv, info);
+ struct kvm_pgtable_snapshot *snapshot = f_priv->file_priv;
+ struct pg_state st;
+ struct kvm_pgtable *pgtable;
+ u64 start_ipa = 0, end_ipa;
+ struct addr_marker ipa_address_markers[3];
+ struct pg_level stage2_pg_level[KVM_PGTABLE_MAX_LEVELS] = {0};
+ struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) {
+ .cb = stage2_ptdump_visitor,
+ .arg = &st,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+
+ if (snapshot == NULL || !snapshot->pgtable.pgd)
+ return;
+
+ pgtable = &snapshot->pgtable;
+ pgtable->mm_ops = &host_mmops;
+ end_ipa = BIT(pgtable->ia_bits) - 1;
+
+ memset(&ipa_address_markers[0], 0, sizeof(ipa_address_markers));
+
+ ipa_address_markers[0].start_address = start_ipa;
+ ipa_address_markers[0].name = "IPA start";
+
+ ipa_address_markers[1].start_address = end_ipa;
+ ipa_address_markers[1].name = "IPA end";
+
+ stage2_ptdump_build_levels(stage2_pg_level, KVM_PGTABLE_MAX_LEVELS,
+ pgtable->start_level);
+
+ st = (struct pg_state) {
+ .seq = s,
+ .marker = &ipa_address_markers[0],
+ .level = -1,
+ .pg_level = &stage2_pg_level[0],
+ .f_priv = f_priv,
+ .ptdump = {
+ .note_page = note_page,
+ .range = (struct ptdump_range[]) {
+ {start_ipa, end_ipa},
+ {0, 0},
+ },
+ },
+ };
+
+ kvm_pgtable_walk(pgtable, start_ipa, end_ipa, &walker);
+}
+
void ptdump_register_host_stage2(void)
{
if (!is_protected_kvm_enabled())
@@ -537,6 +693,7 @@ void ptdump_register_host_stage2(void)
.mc_len = host_s2_pgtable_pages(),
.ptdump_prepare_walk = stage2_ptdump_prepare_walk,
.ptdump_end_walk = stage2_ptdump_end_walk,
+ .ptdump_walk = stage2_ptdump_walk,
};
ptdump_debugfs_kvm_register(&stage2_kernel_ptdump_info,
--
2.43.0.rc0.421.g78406f8d94-goog
Hi Sebastian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on kvmarm/next linus/master v6.7-rc1 next-20231115]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Ene/KVM-arm64-Add-snap-shooting-the-host-stage-2-pagetables/20231116-012017
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link: https://lore.kernel.org/r/20231115171639.2852644-9-sebastianene%40google.com
patch subject: [PATCH v3 07/10] arm64: ptdump: Parse the host stage-2 page-tables from the snapshot
config: arm64-randconfig-001-20231116 (https://download.01.org/0day-ci/archive/20231116/[email protected]/config)
compiler: aarch64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> arch/arm64/mm/ptdump.c:178:31: warning: 'stage2_pte_bits' defined but not used [-Wunused-const-variable=]
178 | static const struct prot_bits stage2_pte_bits[] = {
| ^~~~~~~~~~~~~~~
vim +/stage2_pte_bits +178 arch/arm64/mm/ptdump.c
177
> 178 static const struct prot_bits stage2_pte_bits[] = {
179 {
180 .mask = PTE_VALID,
181 .val = PTE_VALID,
182 .set = " ",
183 .clear = "F",
184 }, {
185 .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN,
186 .val = KVM_PTE_LEAF_ATTR_HI_S2_XN,
187 .set = "XN",
188 .clear = " ",
189 }, {
190 .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R,
191 .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R,
192 .set = "R",
193 .clear = " ",
194 }, {
195 .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
196 .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W,
197 .set = "W",
198 .clear = " ",
199 }, {
200 .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF,
201 .val = KVM_PTE_LEAF_ATTR_LO_S2_AF,
202 .set = "AF",
203 .clear = " ",
204 }, {
205 .mask = PTE_NG,
206 .val = PTE_NG,
207 .set = "FnXS",
208 .clear = " ",
209 }, {
210 .mask = PTE_CONT,
211 .val = PTE_CONT,
212 .set = "CON",
213 .clear = " ",
214 }, {
215 .mask = PTE_TABLE_BIT,
216 .val = PTE_TABLE_BIT,
217 .set = " ",
218 .clear = "BLK",
219 }, {
220 .mask = KVM_PGTABLE_PROT_SW0,
221 .val = KVM_PGTABLE_PROT_SW0,
222 .set = "SW0", /* PKVM_PAGE_SHARED_OWNED */
223 }, {
224 .mask = KVM_PGTABLE_PROT_SW1,
225 .val = KVM_PGTABLE_PROT_SW1,
226 .set = "SW1", /* PKVM_PAGE_SHARED_BORROWED */
227 }, {
228 .mask = KVM_PGTABLE_PROT_SW2,
229 .val = KVM_PGTABLE_PROT_SW2,
230 .set = "SW2",
231 }, {
232 .mask = KVM_PGTABLE_PROT_SW3,
233 .val = KVM_PGTABLE_PROT_SW3,
234 .set = "SW3",
235 },
236 };
237
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Sebastian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on arm64/for-next/core]
[also build test WARNING on kvmarm/next linus/master v6.7-rc1 next-20231117]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Ene/KVM-arm64-Add-snap-shooting-the-host-stage-2-pagetables/20231116-012017
base: https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
patch link: https://lore.kernel.org/r/20231115171639.2852644-9-sebastianene%40google.com
patch subject: [PATCH v3 07/10] arm64: ptdump: Parse the host stage-2 page-tables from the snapshot
config: arm64-randconfig-r071-20231119 (https://download.01.org/0day-ci/archive/20231119/[email protected]/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231119/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All warnings (new ones prefixed by >>):
>> arch/arm64/mm/ptdump.c:178:31: warning: unused variable 'stage2_pte_bits' [-Wunused-const-variable]
178 | static const struct prot_bits stage2_pte_bits[] = {
| ^
1 warning generated.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, Nov 15, 2023 at 05:16:36PM +0000, Sebastian Ene wrote:
> Initialize a structures used to keep the state of the host stage-2 ptdump
> walker when pKVM is enabled. Create a new debugfs entry for the host
> stage-2 pagetables and hook the callbacks invoked when the entry is
> accessed. When the debugfs file is opened, allocate memory resources which
> will be shared with the hypervisor for saving the pagetable snapshot.
> On close release the associated memory and we unshare it from the
> hypervisor.
>
> Signed-off-by: Sebastian Ene <[email protected]>
> ---
> arch/arm64/include/asm/ptdump.h | 12 +++
> arch/arm64/kvm/Kconfig | 13 +++
> arch/arm64/kvm/arm.c | 2 +
> arch/arm64/mm/ptdump.c | 168 ++++++++++++++++++++++++++++++++
> arch/arm64/mm/ptdump_debugfs.c | 8 +-
> 5 files changed, 202 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
> index 9b2bebfcefbe..de5a5a0c0ecf 100644
> --- a/arch/arm64/include/asm/ptdump.h
> +++ b/arch/arm64/include/asm/ptdump.h
> @@ -22,6 +22,7 @@ struct ptdump_info {
> void (*ptdump_walk)(struct seq_file *s, struct ptdump_info *info);
> int (*ptdump_prepare_walk)(void *file_priv);
> void (*ptdump_end_walk)(void *file_priv);
> + size_t mc_len;
> };
>
> void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
> @@ -33,13 +34,24 @@ struct ptdump_info_file_priv {
> #ifdef CONFIG_PTDUMP_DEBUGFS
> #define EFI_RUNTIME_MAP_END DEFAULT_MAP_WINDOW_64
> void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name);
> +void ptdump_debugfs_kvm_register(struct ptdump_info *info, const char *name,
> + struct dentry *d_entry);
> #else
> static inline void ptdump_debugfs_register(struct ptdump_info *info,
> const char *name) { }
> +static inline void ptdump_debugfs_kvm_register(struct ptdump_info *info,
> + const char *name,
> + struct dentry *d_entry) { }
> #endif
> void ptdump_check_wx(void);
> #endif /* CONFIG_PTDUMP_CORE */
>
> +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> +void ptdump_register_host_stage2(void);
> +#else
> +static inline void ptdump_register_host_stage2(void) { }
> +#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
> +
> #ifdef CONFIG_DEBUG_WX
> #define debug_checkwx() ptdump_check_wx()
> #else
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 83c1e09be42e..cf5b7f06b152 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -71,4 +71,17 @@ config PROTECTED_NVHE_STACKTRACE
>
> If unsure, or not using protected nVHE (pKVM), say N.
>
> +config PTDUMP_STAGE2_DEBUGFS
> + bool "Present the stage-2 pagetables to debugfs"
> + depends on NVHE_EL2_DEBUG && PTDUMP_DEBUGFS && KVM
> + default n
> + help
> + Say Y here if you want to show the stage-2 kernel pagetables
> + layout in a debugfs file. This information is only useful for kernel developers
> + who are working in architecture specific areas of the kernel.
> + It is probably not a good idea to enable this feature in a production
> + kernel.
> +
> + If in doubt, say N.
> +
> endif # VIRTUALIZATION
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e5f75f1f1085..987683650576 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -28,6 +28,7 @@
>
> #include <linux/uaccess.h>
> #include <asm/ptrace.h>
> +#include <asm/ptdump.h>
> #include <asm/mman.h>
> #include <asm/tlbflush.h>
> #include <asm/cacheflush.h>
> @@ -2592,6 +2593,7 @@ static __init int kvm_arm_init(void)
> if (err)
> goto out_subs;
>
> + ptdump_register_host_stage2();
> kvm_arm_initialised = true;
>
> return 0;
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index d531e24ea0b2..0b4cb54e43ff 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -24,6 +24,9 @@
> #include <asm/memory.h>
> #include <asm/pgtable-hwdef.h>
> #include <asm/ptdump.h>
> +#include <asm/kvm_pkvm.h>
> +#include <asm/kvm_pgtable.h>
> +#include <asm/kvm_host.h>
>
>
> enum address_markers_idx {
> @@ -378,6 +381,170 @@ void ptdump_check_wx(void)
> pr_info("Checked W+X mappings: passed, no W+X pages found\n");
> }
>
> +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> +static struct ptdump_info stage2_kernel_ptdump_info;
> +
> +static phys_addr_t ptdump_host_pa(void *addr)
> +{
> + return __pa(addr);
> +}
> +
> +static void *ptdump_host_va(phys_addr_t phys)
> +{
> + return __va(phys);
> +}
> +
> +static size_t stage2_get_pgd_len(void)
> +{
> + u64 mmfr0, mmfr1, vtcr;
> + u32 phys_shift = get_kvm_ipa_limit();
> +
> + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> + vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
> +
> + return kvm_pgtable_stage2_pgd_size(vtcr);
That's a lot of conversions, going from kvm_ipa_limit to VTCR and then from
VTCR back to ia_bits and the start level, but avoiding it would mean
rewriting pieces of pgtable.c here. :-\
> +}
> +
> +static int stage2_ptdump_prepare_walk(void *file_priv)
> +{
> + struct ptdump_info_file_priv *f_priv = file_priv;
> + struct ptdump_info *info = &f_priv->info;
> + struct kvm_pgtable_snapshot *snapshot;
> + int ret, pgd_index, mc_index, pgd_pages_sz;
> + void *page_hva;
> + phys_addr_t pgd;
> +
> + snapshot = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
> + if (!snapshot)
> + return -ENOMEM;
For a single page, __get_free_page is enough.
> +
> + memset(snapshot, 0, PAGE_SIZE);
> + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp, virt_to_pfn(snapshot));
> + if (ret)
> + goto free_snapshot;
It'd probably be better to not share anything here, and let the hypervisor do
host_donate_hyp() and hyp_donate_host() before returning from the HVC. This
way the hypervisor will protect itself.
> +
> + snapshot->pgd_len = stage2_get_pgd_len();
> + pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
> + snapshot->pgd_hva = alloc_pages_exact(snapshot->pgd_len,
> + GFP_KERNEL_ACCOUNT);
> + if (!snapshot->pgd_hva) {
> + ret = -ENOMEM;
> + goto unshare_snapshot;
> + }
> +
> + for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
> + page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
> + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> + virt_to_pfn(page_hva));
> + if (ret)
> + goto unshare_pgd_pages;
> + }
> +
> + for (mc_index = 0; mc_index < info->mc_len; mc_index++) {
> + page_hva = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
ditto.
> + if (!page_hva) {
> + ret = -ENOMEM;
> + goto free_memcache_pages;
> + }
> +
> + push_hyp_memcache(&snapshot->mc, page_hva, ptdump_host_pa);
> + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> + virt_to_pfn(page_hva));
> + if (ret) {
> + pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> + free_pages_exact(page_hva, PAGE_SIZE);
> + goto free_memcache_pages;
> + }
Maybe for the page-table pages, it'd be better to let the hyp do the
host_donate_hyp() / hyp_donate_host()? It might be easier than sharing + pin.
> + }
> +
> + ret = kvm_call_hyp_nvhe(__pkvm_copy_host_stage2, snapshot);
> + if (ret)
> + goto free_memcache_pages;
> +
> + pgd = (phys_addr_t)snapshot->pgtable.pgd;
> + snapshot->pgtable.pgd = phys_to_virt(pgd);
> + f_priv->file_priv = snapshot;
> + return 0;
> +
> +free_memcache_pages:
> + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> + while (page_hva) {
> + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> + virt_to_pfn(page_hva));
> + WARN_ON(ret);
> + free_pages_exact(page_hva, PAGE_SIZE);
> + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> + }
> +unshare_pgd_pages:
> + pgd_index = pgd_index - 1;
> + for (; pgd_index >= 0; pgd_index--) {
> + page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
> + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> + virt_to_pfn(page_hva));
> + WARN_ON(ret);
> + }
> + free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
> +unshare_snapshot:
> + WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> + virt_to_pfn(snapshot)));
> +free_snapshot:
> + free_pages_exact(snapshot, PAGE_SIZE);
> + f_priv->file_priv = NULL;
> + return ret;
Couldn't this path be merged with stage2_ptdump_end_walk()?
> +}
> +
> +static void stage2_ptdump_end_walk(void *file_priv)
> +{
> + struct ptdump_info_file_priv *f_priv = file_priv;
> + struct kvm_pgtable_snapshot *snapshot = f_priv->file_priv;
> + void *page_hva;
> + int pgd_index, ret, pgd_pages_sz;
> +
> + if (!snapshot)
> + return;
> +
> + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> + while (page_hva) {
> + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> + virt_to_pfn(page_hva));
> + WARN_ON(ret);
> + free_pages_exact(page_hva, PAGE_SIZE);
> + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> + }
> +
> + pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
> + for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
> + page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
> + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> + virt_to_pfn(page_hva));
> + WARN_ON(ret);
> + }
> +
> + free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
> + WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> + virt_to_pfn(snapshot)));
> + free_pages_exact(snapshot, PAGE_SIZE);
> + f_priv->file_priv = NULL;
> +}
> +
> +void ptdump_register_host_stage2(void)
> +{
> + if (!is_protected_kvm_enabled())
> + return;
> +
> + stage2_kernel_ptdump_info = (struct ptdump_info) {
> + .mc_len = host_s2_pgtable_pages(),
> + .ptdump_prepare_walk = stage2_ptdump_prepare_walk,
> + .ptdump_end_walk = stage2_ptdump_end_walk,
> + };
> +
> + ptdump_debugfs_kvm_register(&stage2_kernel_ptdump_info,
> + "host_stage2_page_tables",
> + kvm_debugfs_dir);
> +}
> +#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
> +
> static int __init ptdump_init(void)
> {
> address_markers[PAGE_END_NR].start_address = PAGE_END;
> @@ -386,6 +553,7 @@ static int __init ptdump_init(void)
> #endif
> ptdump_initialize();
> ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
> +
Not needed.
> return 0;
> }
> device_initcall(ptdump_init);
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 3bf5de51e8c3..4821dbef784c 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -68,5 +68,11 @@ static const struct file_operations ptdump_fops = {
>
> void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name)
> {
> - debugfs_create_file(name, 0400, NULL, info, &ptdump_fops);
> + ptdump_debugfs_kvm_register(info, name, NULL);
Not really related to KVM; the only difference is whether a dentry is passed.
How about a single (non-__init) function?
> +}
> +
> +void ptdump_debugfs_kvm_register(struct ptdump_info *info, const char *name,
> + struct dentry *d_entry)
> +{
> + debugfs_create_file(name, 0400, d_entry, info, &ptdump_fops);
> }
> --
> 2.43.0.rc0.421.g78406f8d94-goog
>
Hi Seb,
On Wed, Nov 15, 2023 at 05:16:30PM +0000, Sebastian Ene wrote:
> Hi,
>
> This can be used as a debugging tool for dumping the second stage
> page-tables.
>
> When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump registers
> '/sys/debug/kvm/<guest_id>/stage2_page_tables' entry with debugfs
> upon guest creation. This allows userspace tools (eg. cat) to dump the
> stage-2 pagetables by reading the registered file.
>
> Reading the debugfs file shows stage-2 memory ranges in following format:
> <IPA range> <size> <descriptor type> <access permissions> <mem_attributes>
>
> Under pKVM configuration(kvm-arm.mode=protected) ptdump registers an entry
> for the host stage-2 pagetables in the following path:
> /sys/debug/kvm/host_stage2_page_tables/
>
> The tool interprets the pKVM ownership annotation stored in the invalid
> entries and dumps to the console the ownership information. To be able
> to access the host stage-2 page-tables from the kernel, a new hypervisor
> call was introduced which allows us to snapshot the page-tables in a host
> provided buffer. The hypervisor call is hidden behind CONFIG_NVHE_EL2_DEBUG
> as this should be used under debugging environment.
While I think the value of the feature you're proposing is great, I'm
not a fan of the current shape of this series.
Reusing note_page() for the stage-2 dump is somewhat convenient, but the
series pulls a **massive** amount of KVM details outside of KVM:
- Open-coding the whole snapshotting interface with EL2 outside of KVM.
This is a complete non-starter for me; the kernel<->EL2 interface
needs to be owned by the EL1 portions of KVM.
- Building page-table walkers using the KVM pgtable library outside of
KVM.
- Copying (rather than directly calling) the logic responsible for
things like FWB and PGD concatenation.
- Hoisting the definition of _software bits_ outside of KVM. I'm less
concerned about hardware bits since they have an unambiguous meaning.
I think exporting the necessary stuff from ptdump into KVM will lead to
a much cleaner implementation.
--
Thanks,
Oliver
On Wed, Nov 15, 2023 at 05:16:40PM +0000, Sebastian Ene wrote:
> +struct ptdump_registered_guest {
> + struct list_head reg_list;
> + struct ptdump_info info;
> + struct kvm_pgtable_snapshot snapshot;
> + rwlock_t *lock;
> +};
Why can't we just store a pointer directly to struct kvm in ::private?
Also, shouldn't you take a reference on struct kvm when the file is
opened to protect against VM teardown?
> +static LIST_HEAD(ptdump_guest_list);
> +static DEFINE_MUTEX(ptdump_list_lock);
What is the list for?
> static phys_addr_t ptdump_host_pa(void *addr)
> {
> return __pa(addr);
> @@ -757,6 +768,63 @@ static void stage2_ptdump_walk(struct seq_file *s, struct ptdump_info *info)
> kvm_pgtable_walk(pgtable, start_ipa, end_ipa, &walker);
> }
>
> +static void guest_stage2_ptdump_walk(struct seq_file *s,
> + struct ptdump_info *info)
> +{
> + struct ptdump_info_file_priv *f_priv =
> + container_of(info, struct ptdump_info_file_priv, info);
> + struct ptdump_registered_guest *guest = info->priv;
> +
> + f_priv->file_priv = &guest->snapshot;
> +
> + read_lock(guest->lock);
> + stage2_ptdump_walk(s, info);
> + read_unlock(guest->lock);
Taking the mmu lock for read allows other table walkers to add new
mappings and adjust the granularity of existing ones. Should this
instead take the mmu lock for write?
> +}
> +
> +int ptdump_register_guest_stage2(struct kvm *kvm)
> +{
> + struct ptdump_registered_guest *guest;
> + struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> + struct kvm_pgtable *pgt = mmu->pgt;
> +
> + guest = kzalloc(sizeof(struct ptdump_registered_guest), GFP_KERNEL);
You want GFP_KERNEL_ACCOUNT here.
--
Thanks,
Oliver
On Wed, Nov 22, 2023 at 11:18:45PM +0000, Oliver Upton wrote:
Hi Oliver,
> Hi Seb,
>
> On Wed, Nov 15, 2023 at 05:16:30PM +0000, Sebastian Ene wrote:
> > Hi,
> >
> > This can be used as a debugging tool for dumping the second stage
> > page-tables.
> >
> > When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump registers
> > '/sys/debug/kvm/<guest_id>/stage2_page_tables' entry with debugfs
> > upon guest creation. This allows userspace tools (eg. cat) to dump the
> > stage-2 pagetables by reading the registered file.
> >
> > Reading the debugfs file shows stage-2 memory ranges in following format:
> > <IPA range> <size> <descriptor type> <access permissions> <mem_attributes>
> >
> > Under pKVM configuration(kvm-arm.mode=protected) ptdump registers an entry
> > for the host stage-2 pagetables in the following path:
> > /sys/debug/kvm/host_stage2_page_tables/
> >
> > The tool interprets the pKVM ownership annotation stored in the invalid
> > entries and dumps to the console the ownership information. To be able
> > to access the host stage-2 page-tables from the kernel, a new hypervisor
> > call was introduced which allows us to snapshot the page-tables in a host
> > provided buffer. The hypervisor call is hidden behind CONFIG_NVHE_EL2_DEBUG
> > as this should be used under debugging environment.
>
> While I think the value of the feature you're proposing is great, I'm
> not a fan of the current shape of this series.
>
> Reusing note_page() for the stage-2 dump is somewhat convenient, but the
> series pulls a **massive** amount of KVM details outside of KVM:
>
> - Open-coding the whole snapshotting interface with EL2 outside of KVM.
> This is a complete non-starter for me; the kernel<->EL2 interface
> needs to be owned by the EL1 portions of KVM.
>
> - Building page-table walkers using the KVM pgtable library outside of
> KVM.
>
> - Copying (rather than directly calling) the logic responsible for
> things like FWB and PGD concatenation.
>
> - Hoisting the definition of _software bits_ outside of KVM. I'm less
> concerned about hardware bits since they have an unambiguous meaning.
>
> I think exporting the necessary stuff from ptdump into KVM will lead to
> a much cleaner implementation.
>
Right, I had to import a lot of definitions from KVM, especially for the
prot_bits array and for the IPA size retrieval. I think it would be less
intrusive the other way around, to pull some ptdump hooks into kvm.
> --
> Thanks,
> Oliver
Thanks,
Seb
On Wed, Nov 22, 2023 at 11:35:57PM +0000, Oliver Upton wrote:
> On Wed, Nov 15, 2023 at 05:16:40PM +0000, Sebastian Ene wrote:
> > +struct ptdump_registered_guest {
> > + struct list_head reg_list;
> > + struct ptdump_info info;
> > + struct kvm_pgtable_snapshot snapshot;
> > + rwlock_t *lock;
> > +};
>
> Why can't we just store a pointer directly to struct kvm in ::private?
I don't think it will work unless stage2_ptdump_walk expects a struct
kvm_pgtable in the file_priv field. I think it is a good idea and will
simplify things a little by dropping kvm_pgtable_snapshot from here.
The current code has some fields that are redundant (the priv pointers)
because I also made this work with protected guests, where we can't
access the pagetables directly.
> Also, shouldn't you take a reference on struct kvm when the file is
> opened to protect against VM teardown?
>
I am not sure about the need; could you please elaborate a bit? On VM
teardown we expect ptdump_unregister_guest_stage2 to be invoked.
> > +static LIST_HEAD(ptdump_guest_list);
> > +static DEFINE_MUTEX(ptdump_list_lock);
>
> What is the list for?
>
I am keeping a list of registered guests with ptdump, and the lock should
protect the list against concurrent VM teardowns.
> > static phys_addr_t ptdump_host_pa(void *addr)
> > {
> > return __pa(addr);
> > @@ -757,6 +768,63 @@ static void stage2_ptdump_walk(struct seq_file *s, struct ptdump_info *info)
> > kvm_pgtable_walk(pgtable, start_ipa, end_ipa, &walker);
> > }
> >
> > +static void guest_stage2_ptdump_walk(struct seq_file *s,
> > + struct ptdump_info *info)
> > +{
> > + struct ptdump_info_file_priv *f_priv =
> > + container_of(info, struct ptdump_info_file_priv, info);
> > + struct ptdump_registered_guest *guest = info->priv;
> > +
> > + f_priv->file_priv = &guest->snapshot;
> > +
> > + read_lock(guest->lock);
> > + stage2_ptdump_walk(s, info);
> > + read_unlock(guest->lock);
>
> Taking the mmu lock for read allows other table walkers to add new
> mappings and adjust the granularity of existing ones. Should this
> instead take the mmu lock for write?
>
Thanks for pointing this out; this is exactly what I was trying to avoid,
so yes, I should use the write mmu lock in this case.
> > +}
> > +
> > +int ptdump_register_guest_stage2(struct kvm *kvm)
> > +{
> > + struct ptdump_registered_guest *guest;
> > + struct kvm_s2_mmu *mmu = &kvm->arch.mmu;
> > + struct kvm_pgtable *pgt = mmu->pgt;
> > +
> > + guest = kzalloc(sizeof(struct ptdump_registered_guest), GFP_KERNEL);
>
> You want GFP_KERNEL_ACCOUNT here.
>
Right, thanks, this is because it is an untrusted allocation triggered from
userspace.
> --
> Thanks,
> Oliver
Thank you,
Seb
On Tue, Nov 21, 2023 at 05:13:41PM +0000, Vincent Donnefort wrote:
> On Wed, Nov 15, 2023 at 05:16:36PM +0000, Sebastian Ene wrote:
Hi,
> > Initialize a structures used to keep the state of the host stage-2 ptdump
> > walker when pKVM is enabled. Create a new debugfs entry for the host
> > stage-2 pagetables and hook the callbacks invoked when the entry is
> > accessed. When the debugfs file is opened, allocate memory resources which
> > will be shared with the hypervisor for saving the pagetable snapshot.
> > On close release the associated memory and we unshare it from the
> > hypervisor.
> >
> > Signed-off-by: Sebastian Ene <[email protected]>
> > ---
> > arch/arm64/include/asm/ptdump.h | 12 +++
> > arch/arm64/kvm/Kconfig | 13 +++
> > arch/arm64/kvm/arm.c | 2 +
> > arch/arm64/mm/ptdump.c | 168 ++++++++++++++++++++++++++++++++
> > arch/arm64/mm/ptdump_debugfs.c | 8 +-
> > 5 files changed, 202 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
> > index 9b2bebfcefbe..de5a5a0c0ecf 100644
> > --- a/arch/arm64/include/asm/ptdump.h
> > +++ b/arch/arm64/include/asm/ptdump.h
> > @@ -22,6 +22,7 @@ struct ptdump_info {
> > void (*ptdump_walk)(struct seq_file *s, struct ptdump_info *info);
> > int (*ptdump_prepare_walk)(void *file_priv);
> > void (*ptdump_end_walk)(void *file_priv);
> > + size_t mc_len;
> > };
> >
> > void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
> > @@ -33,13 +34,24 @@ struct ptdump_info_file_priv {
> > #ifdef CONFIG_PTDUMP_DEBUGFS
> > #define EFI_RUNTIME_MAP_END DEFAULT_MAP_WINDOW_64
> > void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name);
> > +void ptdump_debugfs_kvm_register(struct ptdump_info *info, const char *name,
> > + struct dentry *d_entry);
> > #else
> > static inline void ptdump_debugfs_register(struct ptdump_info *info,
> > const char *name) { }
> > +static inline void ptdump_debugfs_kvm_register(struct ptdump_info *info,
> > + const char *name,
> > + struct dentry *d_entry) { }
> > #endif
> > void ptdump_check_wx(void);
> > #endif /* CONFIG_PTDUMP_CORE */
> >
> > +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> > +void ptdump_register_host_stage2(void);
> > +#else
> > +static inline void ptdump_register_host_stage2(void) { }
> > +#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
> > +
> > #ifdef CONFIG_DEBUG_WX
> > #define debug_checkwx() ptdump_check_wx()
> > #else
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 83c1e09be42e..cf5b7f06b152 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -71,4 +71,17 @@ config PROTECTED_NVHE_STACKTRACE
> >
> > If unsure, or not using protected nVHE (pKVM), say N.
> >
> > +config PTDUMP_STAGE2_DEBUGFS
> > + bool "Present the stage-2 pagetables to debugfs"
> > + depends on NVHE_EL2_DEBUG && PTDUMP_DEBUGFS && KVM
> > + default n
> > + help
> > + Say Y here if you want to show the stage-2 kernel pagetables
> > + layout in a debugfs file. This information is only useful for kernel developers
> > + who are working in architecture specific areas of the kernel.
> > + It is probably not a good idea to enable this feature in a production
> > + kernel.
> > +
> > + If in doubt, say N.
> > +
> > endif # VIRTUALIZATION
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index e5f75f1f1085..987683650576 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -28,6 +28,7 @@
> >
> > #include <linux/uaccess.h>
> > #include <asm/ptrace.h>
> > +#include <asm/ptdump.h>
> > #include <asm/mman.h>
> > #include <asm/tlbflush.h>
> > #include <asm/cacheflush.h>
> > @@ -2592,6 +2593,7 @@ static __init int kvm_arm_init(void)
> > if (err)
> > goto out_subs;
> >
> > + ptdump_register_host_stage2();
> > kvm_arm_initialised = true;
> >
> > return 0;
> > diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> > index d531e24ea0b2..0b4cb54e43ff 100644
> > --- a/arch/arm64/mm/ptdump.c
> > +++ b/arch/arm64/mm/ptdump.c
> > @@ -24,6 +24,9 @@
> > #include <asm/memory.h>
> > #include <asm/pgtable-hwdef.h>
> > #include <asm/ptdump.h>
> > +#include <asm/kvm_pkvm.h>
> > +#include <asm/kvm_pgtable.h>
> > +#include <asm/kvm_host.h>
> >
> >
> > enum address_markers_idx {
> > @@ -378,6 +381,170 @@ void ptdump_check_wx(void)
> > pr_info("Checked W+X mappings: passed, no W+X pages found\n");
> > }
> >
> > +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> > +static struct ptdump_info stage2_kernel_ptdump_info;
> > +
> > +static phys_addr_t ptdump_host_pa(void *addr)
> > +{
> > + return __pa(addr);
> > +}
> > +
> > +static void *ptdump_host_va(phys_addr_t phys)
> > +{
> > + return __va(phys);
> > +}
> > +
> > +static size_t stage2_get_pgd_len(void)
> > +{
> > + u64 mmfr0, mmfr1, vtcr;
> > + u32 phys_shift = get_kvm_ipa_limit();
> > +
> > + mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
> > + mmfr1 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1);
> > + vtcr = kvm_get_vtcr(mmfr0, mmfr1, phys_shift);
> > +
> > + return kvm_pgtable_stage2_pgd_size(vtcr);
>
> That's a lot of conversions, going from kvm_ipa_limit to VTCR and then from
> VTCR back to ia_bits and the start level, but avoiding it would mean
> rewriting pieces of pgtable.c here. :-\
>
Right, I think with Oliver's suggestion we will no longer have to move
these bits around and the code will be self-contained under /kvm.
> > +}
> > +
> > +static int stage2_ptdump_prepare_walk(void *file_priv)
> > +{
> > + struct ptdump_info_file_priv *f_priv = file_priv;
> > + struct ptdump_info *info = &f_priv->info;
> > + struct kvm_pgtable_snapshot *snapshot;
> > + int ret, pgd_index, mc_index, pgd_pages_sz;
> > + void *page_hva;
> > + phys_addr_t pgd;
> > +
> > + snapshot = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
> > + if (!snapshot)
> > + return -ENOMEM;
>
> For a single page, __get_free_page is enough.
>
I can use this, thanks.
> > +
> > + memset(snapshot, 0, PAGE_SIZE);
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp, virt_to_pfn(snapshot));
> > + if (ret)
> > + goto free_snapshot;
>
> It'd probably be better to not share anything here, and let the hypervisor do
> host_donate_hyp() and hyp_donate_host() before returning from the HVC. This
> way the hypervisor will protect itself.
>
Right, as we discussed offline, I will update this to use the *donate* API.
> > +
> > + snapshot->pgd_len = stage2_get_pgd_len();
> > + pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
> > + snapshot->pgd_hva = alloc_pages_exact(snapshot->pgd_len,
> > + GFP_KERNEL_ACCOUNT);
> > + if (!snapshot->pgd_hva) {
> > + ret = -ENOMEM;
> > + goto unshare_snapshot;
> > + }
> > +
> > + for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
> > + page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> > + virt_to_pfn(page_hva));
> > + if (ret)
> > + goto unshare_pgd_pages;
> > + }
> > +
> > + for (mc_index = 0; mc_index < info->mc_len; mc_index++) {
> > + page_hva = alloc_pages_exact(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
>
> ditto.
>
Ack.
> > + if (!page_hva) {
> > + ret = -ENOMEM;
> > + goto free_memcache_pages;
> > + }
> > +
> > + push_hyp_memcache(&snapshot->mc, page_hva, ptdump_host_pa);
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_share_hyp,
> > + virt_to_pfn(page_hva));
> > + if (ret) {
> > + pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> > + free_pages_exact(page_hva, PAGE_SIZE);
> > + goto free_memcache_pages;
> > + }
>
> Maybe for the page-table pages, it'd be better to let the hyp do the
> host_donate_hyp() / hyp_donate_host()? It might be easier than sharing + pinning.
>
> > + }
> > +
> > + ret = kvm_call_hyp_nvhe(__pkvm_copy_host_stage2, snapshot);
> > + if (ret)
> > + goto free_memcache_pages;
> > +
> > + pgd = (phys_addr_t)snapshot->pgtable.pgd;
> > + snapshot->pgtable.pgd = phys_to_virt(pgd);
> > + f_priv->file_priv = snapshot;
> > + return 0;
> > +
> > +free_memcache_pages:
> > + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> > + while (page_hva) {
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> > + virt_to_pfn(page_hva));
> > + WARN_ON(ret);
> > + free_pages_exact(page_hva, PAGE_SIZE);
> > + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> > + }
> > +unshare_pgd_pages:
> > + pgd_index = pgd_index - 1;
> > + for (; pgd_index >= 0; pgd_index--) {
> > + page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> > + virt_to_pfn(page_hva));
> > + WARN_ON(ret);
> > + }
> > + free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
> > +unshare_snapshot:
> > + WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> > + virt_to_pfn(snapshot)));
> > +free_snapshot:
> > + free_pages_exact(snapshot, PAGE_SIZE);
> > + f_priv->file_priv = NULL;
> > + return ret;
>
> Couldn't this path be merged with stage2_ptdump_end_walk()?
>
I think it should be doable.
> > +}
> > +
> > +static void stage2_ptdump_end_walk(void *file_priv)
> > +{
> > + struct ptdump_info_file_priv *f_priv = file_priv;
> > + struct kvm_pgtable_snapshot *snapshot = f_priv->file_priv;
> > + void *page_hva;
> > + int pgd_index, ret, pgd_pages_sz;
> > +
> > + if (!snapshot)
> > + return;
> > +
> > + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> > + while (page_hva) {
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> > + virt_to_pfn(page_hva));
> > + WARN_ON(ret);
> > + free_pages_exact(page_hva, PAGE_SIZE);
> > + page_hva = pop_hyp_memcache(&snapshot->mc, ptdump_host_va);
> > + }
> > +
> > + pgd_pages_sz = snapshot->pgd_len / PAGE_SIZE;
> > + for (pgd_index = 0; pgd_index < pgd_pages_sz; pgd_index++) {
> > + page_hva = snapshot->pgd_hva + pgd_index * PAGE_SIZE;
> > + ret = kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> > + virt_to_pfn(page_hva));
> > + WARN_ON(ret);
> > + }
> > +
> > + free_pages_exact(snapshot->pgd_hva, snapshot->pgd_len);
> > + WARN_ON(kvm_call_hyp_nvhe(__pkvm_host_unshare_hyp,
> > + virt_to_pfn(snapshot)));
> > + free_pages_exact(snapshot, PAGE_SIZE);
> > + f_priv->file_priv = NULL;
> > +}
> > +
> > +void ptdump_register_host_stage2(void)
> > +{
> > + if (!is_protected_kvm_enabled())
> > + return;
> > +
> > + stage2_kernel_ptdump_info = (struct ptdump_info) {
> > + .mc_len = host_s2_pgtable_pages(),
> > + .ptdump_prepare_walk = stage2_ptdump_prepare_walk,
> > + .ptdump_end_walk = stage2_ptdump_end_walk,
> > + };
> > +
> > + ptdump_debugfs_kvm_register(&stage2_kernel_ptdump_info,
> > + "host_stage2_page_tables",
> > + kvm_debugfs_dir);
> > +}
> > +#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
> > +
> > static int __init ptdump_init(void)
> > {
> > address_markers[PAGE_END_NR].start_address = PAGE_END;
> > @@ -386,6 +553,7 @@ static int __init ptdump_init(void)
> > #endif
> > ptdump_initialize();
> > ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
> > +
>
> Not needed.
>
Will remove this; checkpatch didn't seem to complain about it.
> > return 0;
> > }
> > device_initcall(ptdump_init);
> > diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> > index 3bf5de51e8c3..4821dbef784c 100644
> > --- a/arch/arm64/mm/ptdump_debugfs.c
> > +++ b/arch/arm64/mm/ptdump_debugfs.c
> > @@ -68,5 +68,11 @@ static const struct file_operations ptdump_fops = {
> >
> > void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name)
> > {
> > - debugfs_create_file(name, 0400, NULL, info, &ptdump_fops);
> > + ptdump_debugfs_kvm_register(info, name, NULL);
>
> Not really related to kvm, the only difference is passing or not a dentry.
>
> How about a single (non __init) function?
>
I don't think that works, because we have to keep the signature of the
original function: ptdump_debugfs_register() is also called from
non-arch driver code.
> > +}
> > +
> > +void ptdump_debugfs_kvm_register(struct ptdump_info *info, const char *name,
> > + struct dentry *d_entry)
> > +{
> > + debugfs_create_file(name, 0400, d_entry, info, &ptdump_fops);
> > }
> > --
> > 2.43.0.rc0.421.g78406f8d94-goog
> >