Hi,
This is the first part of the series which enables dumping of the guest
stage-2 pagetables. The support for dumping the host stage-2 pagetables
which is pKVM specific will be part of a follow-up series as per the
feedback received in v4.
When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump registers
'/sys/debug/kvm/<guest_id>/stage2_page_tables' entry with debugfs
upon guest creation. This allows userspace tools (eg. cat) to dump the
stage-2 pagetables by reading the registered file.
Reading the debugfs file shows stage-2 memory ranges in following format:
<IPA range> <size> <descriptor type> <access permissions> <mem_attributes>
Below is the output of a guest stage-2 pagetable mappings running under
Qemu:
---[ IPA bits 33 start lvl 2 ]---
0x0000000000000000-0x0000000080000000 2G PGD
0x0000000080000000-0x0000000080c00000 12M PGD R W AF BLK
0x0000000080c00000-0x0000000080e00000 2M PGD XN R W AF BLK
0x0000000080e00000-0x0000000081000000 2M PGD R W AF BLK
0x0000000081000000-0x0000000081400000 4M PGD XN R W AF BLK
0x0000000081400000-0x000000008fe00000 234M PGD
0x000000008fe00000-0x0000000090000000 2M PGD XN R W AF BLK
0x0000000090000000-0x00000000fa000000 1696M PGD
0x00000000fa000000-0x00000000fe000000 64M PGD XN R W AF BLK
0x00000000fe000000-0x0000000100000000 32M PGD
0x0000000100000000-0x0000000101c00000 28M PGD XN R W AF BLK
0x0000000101c00000-0x0000000102000000 4M PGD
0x0000000102000000-0x0000000102200000 2M PGD XN R W AF BLK
0x0000000102200000-0x000000017b000000 1934M PGD
0x000000017b000000-0x0000000180000000 80M PGD XN R W AF BLK
Link to v4:
https://lore.kernel.org/all/[email protected]/
Link to v3:
https://lore.kernel.org/all/[email protected]/
Changelog:
v4 -> current_version:
* refactorization: split the series into two parts as per the feedback
received from Oliver. Introduce the base support which allows dumping
of the guest stage-2 pagetables.
* removed the *ops* struct wrapper built on top of the file_ops and
simplify the ptdump interface access.
* keep the page table walker away from the ptdump specific code
v3 -> current_version:
* refactorization: moved all the **KVM** specific components under
kvm/ as suggested by Oliver. Introduced a new file
arm64/kvm/ptdump.c which handled the second stage translation.
re-used only the display portion from mm/ptdump.c
* pagetable snapshot creation now uses memory donated from the host.
The memory is no longer shared with the host as this can pose a security
risk if the host has access to manipulate the pagetable copy while
the hypervisor iterates it.
* fixed a memory leak: while memory was used from the memcache for
building the snapshot pagetable, it was no longer giving back the
pages to the host for freeing. A separate array was introduced to
keep track of the pages allocated from the memcache.
v2 -> v3:
* register the stage-2 debugfs entry for the host under
/sys/debug/kvm/host_stage2_page_tables and in
/sys/debug/kvm/<guest_id>/stage2_page_tables for guests.
* don't use a static array for parsing the attributes description,
generate it dynamically based on the number of pagetable levels
* remove the lock that was guarding the seq_file private inode data,
and keep the data private to the open file session.
* minor fixes & renaming of CONFIG_NVHE_EL2_PTDUMP_DEBUGFS to
CONFIG_PTDUMP_STAGE2_DEBUGFS
v1 -> v2:
* use the stage-2 pagetable walker for dumping descriptors instead of
the one provided by ptdump.
* support for guests pagetables dumping under VHE/nVHE non-protected
Thanks,
Sebastian Ene (4):
arm64: ptdump: Expose the attribute parsing functionality
arm64: ptdump: Use the mask from the state structure
KVM: arm64: Register ptdump with debugfs on guest creation
KVM: arm64: Initialize the ptdump parser with stage-2 attributes
arch/arm64/include/asm/ptdump.h | 42 +++++-
arch/arm64/kvm/Kconfig | 13 ++
arch/arm64/kvm/Makefile | 1 +
arch/arm64/kvm/debug.c | 7 +
arch/arm64/kvm/kvm_ptdump.h | 20 +++
arch/arm64/kvm/ptdump.c | 235 ++++++++++++++++++++++++++++++++
arch/arm64/mm/ptdump.c | 49 ++-----
7 files changed, 327 insertions(+), 40 deletions(-)
create mode 100644 arch/arm64/kvm/kvm_ptdump.h
create mode 100644 arch/arm64/kvm/ptdump.c
--
2.43.0.594.gd9cf4e227d-goog
To keep the same output format as the arch specific ptdump and for the
sake of reusability, move the parser's state tracking code out
of the arch specific.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 41 ++++++++++++++++++++++++++++++++-
arch/arm64/mm/ptdump.c | 36 ++---------------------------
2 files changed, 42 insertions(+), 35 deletions(-)
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 581caac525b0..23510be35084 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -9,6 +9,8 @@
#include <linux/mm_types.h>
#include <linux/seq_file.h>
+#include <linux/ptdump.h>
+
struct addr_marker {
unsigned long start_address;
@@ -21,15 +23,52 @@ struct ptdump_info {
unsigned long base_addr;
};
+struct prot_bits {
+ u64 mask;
+ u64 val;
+ const char *set;
+ const char *clear;
+};
+
+struct pg_level {
+ const struct prot_bits *bits;
+ const char *name;
+ size_t num;
+ u64 mask;
+};
+
+/*
+ * The page dumper groups page table entries of the same type into a single
+ * description. It uses pg_state to track the range information while
+ * iterating over the pte entries. When the continuity is broken it then
+ * dumps out a description of the range.
+ */
+struct pg_state {
+ struct ptdump_state ptdump;
+ struct seq_file *seq;
+ const struct addr_marker *marker;
+ unsigned long start_address;
+ int level;
+ u64 current_prot;
+ bool check_wx;
+ unsigned long wx_pages;
+ unsigned long uxn_pages;
+};
+
void ptdump_walk(struct seq_file *s, struct ptdump_info *info);
+void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ u64 val);
#ifdef CONFIG_PTDUMP_DEBUGFS
#define EFI_RUNTIME_MAP_END DEFAULT_MAP_WINDOW_64
void __init ptdump_debugfs_register(struct ptdump_info *info, const char *name);
#else
static inline void ptdump_debugfs_register(struct ptdump_info *info,
const char *name) { }
-#endif
+#endif /* CONFIG_PTDUMP_DEBUGFS */
void ptdump_check_wx(void);
+#else
+static inline void note_page(void *pt_st, unsigned long addr,
+ int level, u64 val) { }
#endif /* CONFIG_PTDUMP_CORE */
#ifdef CONFIG_DEBUG_WX
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index e305b6593c4e..64127c70b109 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -66,31 +66,6 @@ static struct addr_marker address_markers[] = {
seq_printf(m, fmt); \
})
-/*
- * The page dumper groups page table entries of the same type into a single
- * description. It uses pg_state to track the range information while
- * iterating over the pte entries. When the continuity is broken it then
- * dumps out a description of the range.
- */
-struct pg_state {
- struct ptdump_state ptdump;
- struct seq_file *seq;
- const struct addr_marker *marker;
- unsigned long start_address;
- int level;
- u64 current_prot;
- bool check_wx;
- unsigned long wx_pages;
- unsigned long uxn_pages;
-};
-
-struct prot_bits {
- u64 mask;
- u64 val;
- const char *set;
- const char *clear;
-};
-
static const struct prot_bits pte_bits[] = {
{
.mask = PTE_VALID,
@@ -170,13 +145,6 @@ static const struct prot_bits pte_bits[] = {
}
};
-struct pg_level {
- const struct prot_bits *bits;
- const char *name;
- size_t num;
- u64 mask;
-};
-
static struct pg_level pg_level[] = {
{ /* pgd */
.name = "PGD",
@@ -248,8 +216,8 @@ static void note_prot_wx(struct pg_state *st, unsigned long addr)
st->wx_pages += (addr - st->start_address) / PAGE_SIZE;
}
-static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
- u64 val)
+void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
+ u64 val)
{
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
static const char units[] = "KMGTPE";
--
2.43.0.594.gd9cf4e227d-goog
Printing the descriptor attributes requires accessing a mask which has a
different set of attributes for stage-2. In preparation for adding support
for the stage-2 pagetables dumping, use the mask from the local context
and not from the globally defined pg_level array. Store a pointer to
the pg_level array in the ptdump state structure. This will allow us to
extract the mask which is wrapped in the pg_level array and use it for
descriptor parsing in the note_page.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/include/asm/ptdump.h | 1 +
arch/arm64/mm/ptdump.c | 13 ++++++++-----
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
index 23510be35084..4e728d2a1f2c 100644
--- a/arch/arm64/include/asm/ptdump.h
+++ b/arch/arm64/include/asm/ptdump.h
@@ -45,6 +45,7 @@ struct pg_level {
*/
struct pg_state {
struct ptdump_state ptdump;
+ struct pg_level *pg_level;
struct seq_file *seq;
const struct addr_marker *marker;
unsigned long start_address;
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 64127c70b109..015ed65d3e9b 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -220,11 +220,12 @@ void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
u64 val)
{
struct pg_state *st = container_of(pt_st, struct pg_state, ptdump);
+ struct pg_level *pg_info = st->pg_level;
static const char units[] = "KMGTPE";
u64 prot = 0;
if (level >= 0)
- prot = val & pg_level[level].mask;
+ prot = val & pg_info[level].mask;
if (st->level == -1) {
st->level = level;
@@ -250,10 +251,10 @@ void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
unit++;
}
pt_dump_seq_printf(st->seq, "%9lu%c %s", delta, *unit,
- pg_level[st->level].name);
- if (st->current_prot && pg_level[st->level].bits)
- dump_prot(st, pg_level[st->level].bits,
- pg_level[st->level].num);
+ pg_info[st->level].name);
+ if (st->current_prot && pg_info[st->level].bits)
+ dump_prot(st, pg_info[st->level].bits,
+ pg_info[st->level].num);
pt_dump_seq_puts(st->seq, "\n");
if (addr >= st->marker[1].start_address) {
@@ -284,6 +285,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
st = (struct pg_state){
.seq = s,
.marker = info->markers,
+ .pg_level = &pg_level[0],
.level = -1,
.ptdump = {
.note_page = note_page,
@@ -321,6 +323,7 @@ void ptdump_check_wx(void)
{ 0, NULL},
{ -1, NULL},
},
+ .pg_level = &pg_level[0],
.level = -1,
.check_wx = true,
.ptdump = {
--
2.43.0.594.gd9cf4e227d-goog
While arch/*/mem/ptdump handles the kernel pagetable dumping code,
introduce KVM/ptdump which shows the guest stage-2 pagetables. The
separation is necessary because most of the definitions from the
stage-2 pagetable reside in the KVM path and we will be invoking
functionality **specific** to KVM.
When a guest is created, register a new file entry under the guest
debugfs dir which allows userspace to show the contents of the guest
stage-2 pagetables when accessed.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/kvm/Kconfig | 13 ++++++
arch/arm64/kvm/Makefile | 1 +
arch/arm64/kvm/debug.c | 7 ++++
arch/arm64/kvm/kvm_ptdump.h | 20 ++++++++++
arch/arm64/kvm/ptdump.c | 79 +++++++++++++++++++++++++++++++++++++
5 files changed, 120 insertions(+)
create mode 100644 arch/arm64/kvm/kvm_ptdump.h
create mode 100644 arch/arm64/kvm/ptdump.c
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 6c3c8ca73e7f..28097dd72174 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -68,4 +68,17 @@ config PROTECTED_NVHE_STACKTRACE
If unsure, or not using protected nVHE (pKVM), say N.
+config PTDUMP_STAGE2_DEBUGFS
+ bool "Present the stage-2 pagetables to debugfs"
+ depends on PTDUMP_DEBUGFS && KVM
+ default n
+ help
+ Say Y here if you want to show the stage-2 kernel pagetables
+ layout in a debugfs file. This information is only useful for kernel developers
+ who are working in architecture specific areas of the kernel.
+ It is probably not a good idea to enable this feature in a production
+ kernel.
+
+ If in doubt, say N.
+
endif # VIRTUALIZATION
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index c0c050e53157..190eac17538c 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -23,6 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
vgic/vgic-its.o vgic/vgic-debug.o
kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
+kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
always-y := hyp_constants.h hyp-constants.s
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 8725291cb00a..aef52836cd90 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -14,6 +14,8 @@
#include <asm/kvm_arm.h>
#include <asm/kvm_emulate.h>
+#include <kvm_ptdump.h>
+
#include "trace.h"
/* These are the bits of MDSCR_EL1 we may manipulate */
@@ -342,3 +344,8 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
}
+
+int kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+ return kvm_ptdump_guest_register(kvm);
+}
diff --git a/arch/arm64/kvm/kvm_ptdump.h b/arch/arm64/kvm/kvm_ptdump.h
new file mode 100644
index 000000000000..a7c00a28481b
--- /dev/null
+++ b/arch/arm64/kvm/kvm_ptdump.h
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) Google, 2023
+ * Author: Sebastian Ene <[email protected]>
+ */
+
+#ifndef __KVM_PTDUMP_H
+#define __KVM_PTDUMP_H
+
+#include <linux/kvm_host.h>
+#include <asm/ptdump.h>
+
+
+#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
+int kvm_ptdump_guest_register(struct kvm *kvm);
+#else
+static inline int kvm_ptdump_guest_register(struct kvm *kvm) { return 0; }
+#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
+
+#endif /* __KVM_PTDUMP_H */
diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c
new file mode 100644
index 000000000000..a4e984da8aa7
--- /dev/null
+++ b/arch/arm64/kvm/ptdump.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-only
+//
+// Debug helper used to dump the stage-2 pagetables of the system and their
+// associated permissions.
+//
+// Copyright (C) Google, 2023
+// Author: Sebastian Ene <[email protected]>
+
+#include <linux/debugfs.h>
+#include <linux/kvm_host.h>
+#include <linux/seq_file.h>
+
+#include <asm/kvm_pkvm.h>
+#include <kvm_ptdump.h>
+
+
+static int kvm_ptdump_guest_open(struct inode *inode, struct file *file);
+static int kvm_ptdump_guest_show(struct seq_file *m, void *);
+
+static const struct file_operations kvm_ptdump_guest_fops = {
+ .open = kvm_ptdump_guest_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int kvm_ptdump_guest_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, kvm_ptdump_guest_show, inode->i_private);
+}
+
+static int kvm_ptdump_visitor(const struct kvm_pgtable_visit_ctx *ctx,
+ enum kvm_pgtable_walk_flags visit)
+{
+ struct pg_state *st = ctx->arg;
+ struct ptdump_state *pt_st = &st->ptdump;
+
+ note_page(pt_st, ctx->addr, ctx->level, ctx->old);
+ return 0;
+}
+
+static int kvm_ptdump_show_common(struct seq_file *m,
+ struct kvm_pgtable *pgtable,
+ struct pg_state *parser_state)
+{
+ struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) {
+ .cb = kvm_ptdump_visitor,
+ .arg = parser_state,
+ .flags = KVM_PGTABLE_WALK_LEAF,
+ };
+
+ return kvm_pgtable_walk(pgtable, 0, BIT(pgtable->ia_bits), &walker);
+}
+
+static int kvm_ptdump_guest_show(struct seq_file *m, void *)
+{
+ struct kvm *guest_kvm = m->private;
+ struct kvm_s2_mmu *mmu = &guest_kvm->arch.mmu;
+ struct pg_state parser_state = {0};
+ int ret;
+
+ write_lock(&guest_kvm->mmu_lock);
+ ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
+ write_unlock(&guest_kvm->mmu_lock);
+
+ return ret;
+}
+
+int kvm_ptdump_guest_register(struct kvm *kvm)
+{
+ struct dentry *parent;
+
+ parent = debugfs_create_file("stage2_page_tables", 0400,
+ kvm->debugfs_dentry, kvm,
+ &kvm_ptdump_guest_fops);
+ if (IS_ERR(parent))
+ return PTR_ERR(parent);
+ return 0;
+}
--
2.43.0.594.gd9cf4e227d-goog
Define a set of attributes used by the ptdump parser to display the
properties of a guest memory region covered by a pagetable descriptor.
Build a description of the pagetable levels and initialize the parser
with this configuration.
Signed-off-by: Sebastian Ene <[email protected]>
---
arch/arm64/kvm/ptdump.c | 156 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)
diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c
index a4e984da8aa7..60725d46f17b 100644
--- a/arch/arm64/kvm/ptdump.c
+++ b/arch/arm64/kvm/ptdump.c
@@ -14,6 +14,69 @@
#include <kvm_ptdump.h>
+#define ADDR_MARKER_LEN (2)
+#define MARKER_MSG_LEN (32)
+
+static const struct prot_bits stage2_pte_bits[] = {
+ {
+ .mask = PTE_VALID,
+ .val = PTE_VALID,
+ .set = " ",
+ .clear = "F",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN | PTE_VALID,
+ .val = KVM_PTE_LEAF_ATTR_HI_S2_XN | PTE_VALID,
+ .set = "XN",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | PTE_VALID,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | PTE_VALID,
+ .set = "R",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | PTE_VALID,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | PTE_VALID,
+ .set = "W",
+ .clear = " ",
+ }, {
+ .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF | PTE_VALID,
+ .val = KVM_PTE_LEAF_ATTR_LO_S2_AF | PTE_VALID,
+ .set = "AF",
+ .clear = " ",
+ }, {
+ .mask = PTE_NG,
+ .val = PTE_NG,
+ .set = "FnXS",
+ .clear = " ",
+ }, {
+ .mask = PTE_CONT | PTE_VALID,
+ .val = PTE_CONT | PTE_VALID,
+ .set = "CON",
+ .clear = " ",
+ }, {
+ .mask = PTE_TABLE_BIT,
+ .val = PTE_TABLE_BIT,
+ .set = " ",
+ .clear = "BLK",
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW0,
+ .val = KVM_PGTABLE_PROT_SW0,
+ .set = "SW0", /* PKVM_PAGE_SHARED_OWNED */
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW1,
+ .val = KVM_PGTABLE_PROT_SW1,
+ .set = "SW1", /* PKVM_PAGE_SHARED_BORROWED */
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW2,
+ .val = KVM_PGTABLE_PROT_SW2,
+ .set = "SW2",
+ }, {
+ .mask = KVM_PGTABLE_PROT_SW3,
+ .val = KVM_PGTABLE_PROT_SW3,
+ .set = "SW3",
+ },
+};
+
static int kvm_ptdump_guest_open(struct inode *inode, struct file *file);
static int kvm_ptdump_guest_show(struct seq_file *m, void *);
@@ -52,6 +115,94 @@ static int kvm_ptdump_show_common(struct seq_file *m,
return kvm_pgtable_walk(pgtable, 0, BIT(pgtable->ia_bits), &walker);
}
+static void kvm_ptdump_build_levels(struct pg_level *level, u32 start_lvl)
+{
+ static const char * const level_names[] = {"PGD", "PUD", "PMD", "PTE"};
+ u32 i = 0;
+ u64 mask_lvl = 0;
+
+ if (start_lvl > 2) {
+ pr_err("invalid start_lvl %u\n", start_lvl);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(stage2_pte_bits); i++)
+ mask_lvl |= stage2_pte_bits[i].mask;
+
+ for (i = start_lvl; i <= KVM_PGTABLE_LAST_LEVEL; i++) {
+ level[i].name = level_names[i];
+ level[i].num = ARRAY_SIZE(stage2_pte_bits);
+ level[i].bits = stage2_pte_bits;
+ level[i].mask = mask_lvl;
+ }
+
+ if (start_lvl > 0)
+ level[start_lvl].name = level_names[0];
+}
+
+static int kvm_ptdump_parser_init(struct pg_state *st,
+ struct kvm_pgtable *pgtable,
+ struct seq_file *m)
+{
+ struct addr_marker *ipa_addr_marker;
+ char *marker_msg;
+ struct pg_level *level_descr;
+ struct ptdump_range *range;
+
+ ipa_addr_marker = kzalloc(sizeof(struct addr_marker) * ADDR_MARKER_LEN,
+ GFP_KERNEL_ACCOUNT);
+ if (!ipa_addr_marker)
+ return -ENOMEM;
+
+ marker_msg = kzalloc(MARKER_MSG_LEN, GFP_KERNEL_ACCOUNT);
+ if (!marker_msg)
+ goto free_with_marker;
+
+ level_descr = kzalloc(sizeof(struct pg_level) * (KVM_PGTABLE_LAST_LEVEL + 1),
+ GFP_KERNEL_ACCOUNT);
+ if (!level_descr)
+ goto free_with_msg;
+
+ range = kzalloc(sizeof(struct ptdump_range) * ADDR_MARKER_LEN,
+ GFP_KERNEL_ACCOUNT);
+ if (!range)
+ goto free_with_level;
+
+ kvm_ptdump_build_levels(level_descr, pgtable->start_level);
+
+ snprintf(marker_msg, MARKER_MSG_LEN, "IPA bits %2u start lvl %1d",
+ pgtable->ia_bits, pgtable->start_level);
+
+ ipa_addr_marker[0].name = marker_msg;
+ ipa_addr_marker[1].start_address = BIT(pgtable->ia_bits);
+ range[0].end = BIT(pgtable->ia_bits);
+
+ st->seq = m;
+ st->marker = ipa_addr_marker;
+ st->level = -1,
+ st->pg_level = level_descr,
+ st->ptdump.range = range;
+ return 0;
+
+free_with_level:
+ kfree(level_descr);
+free_with_msg:
+ kfree(marker_msg);
+free_with_marker:
+ kfree(ipa_addr_marker);
+ return -ENOMEM;
+}
+
+static void kvm_ptdump_parser_teardown(struct pg_state *st)
+{
+ const struct addr_marker *ipa_addr_marker = st->marker;
+
+ kfree(ipa_addr_marker[0].name);
+ kfree(ipa_addr_marker);
+ kfree(st->pg_level);
+ kfree(st->ptdump.range);
+}
+
static int kvm_ptdump_guest_show(struct seq_file *m, void *)
{
struct kvm *guest_kvm = m->private;
@@ -59,10 +210,15 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *)
struct pg_state parser_state = {0};
int ret;
+ ret = kvm_ptdump_parser_init(&parser_state, mmu->pgt, m);
+ if (ret)
+ return ret;
+
write_lock(&guest_kvm->mmu_lock);
ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
write_unlock(&guest_kvm->mmu_lock);
+ kvm_ptdump_parser_teardown(&parser_state);
return ret;
}
--
2.43.0.594.gd9cf4e227d-goog
On Wed, Feb 07, 2024 at 02:48:33PM +0000, Sebastian Ene wrote:
> Define a set of attributes used by the ptdump parser to display the
> properties of a guest memory region covered by a pagetable descriptor.
> Build a description of the pagetable levels and initialize the parser
> with this configuration.
>
> Signed-off-by: Sebastian Ene <[email protected]>
> ---
> arch/arm64/kvm/ptdump.c | 156 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 156 insertions(+)
>
> diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c
> index a4e984da8aa7..60725d46f17b 100644
> --- a/arch/arm64/kvm/ptdump.c
> +++ b/arch/arm64/kvm/ptdump.c
> @@ -14,6 +14,69 @@
> #include <kvm_ptdump.h>
>
>
> +#define ADDR_MARKER_LEN (2)
> +#define MARKER_MSG_LEN (32)
> +
> +static const struct prot_bits stage2_pte_bits[] = {
> + {
> + .mask = PTE_VALID,
> + .val = PTE_VALID,
> + .set = " ",
> + .clear = "F",
> + }, {
> + .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN | PTE_VALID,
> + .val = KVM_PTE_LEAF_ATTR_HI_S2_XN | PTE_VALID,
> + .set = "XN",
> + .clear = " ",
> + }, {
> + .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | PTE_VALID,
> + .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | PTE_VALID,
> + .set = "R",
> + .clear = " ",
> + }, {
> + .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | PTE_VALID,
> + .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | PTE_VALID,
> + .set = "W",
> + .clear = " ",
> + }, {
> + .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF | PTE_VALID,
> + .val = KVM_PTE_LEAF_ATTR_LO_S2_AF | PTE_VALID,
> + .set = "AF",
> + .clear = " ",
> + }, {
> + .mask = PTE_NG,
> + .val = PTE_NG,
> + .set = "FnXS",
> + .clear = " ",
> + }, {
> + .mask = PTE_CONT | PTE_VALID,
> + .val = PTE_CONT | PTE_VALID,
> + .set = "CON",
> + .clear = " ",
> + }, {
> + .mask = PTE_TABLE_BIT,
> + .val = PTE_TABLE_BIT,
> + .set = " ",
> + .clear = "BLK",
<snip>
> + }, {
> + .mask = KVM_PGTABLE_PROT_SW0,
> + .val = KVM_PGTABLE_PROT_SW0,
> + .set = "SW0", /* PKVM_PAGE_SHARED_OWNED */
> + }, {
> + .mask = KVM_PGTABLE_PROT_SW1,
> + .val = KVM_PGTABLE_PROT_SW1,
> + .set = "SW1", /* PKVM_PAGE_SHARED_BORROWED */
> + }, {
> + .mask = KVM_PGTABLE_PROT_SW2,
> + .val = KVM_PGTABLE_PROT_SW2,
> + .set = "SW2",
> + }, {
> + .mask = KVM_PGTABLE_PROT_SW3,
> + .val = KVM_PGTABLE_PROT_SW3,
> + .set = "SW3",
> + },
</snip>
These bits are never set in a 'normal' stage-2 PTE, does it make sense
to carry descriptors for them here? In contexts where the SW bits are
used it might be more useful if the ptdump used the specific meaning of
the bit (e.g. OWNED, BORROWED, etc) instead of the generic SW%d.
That can all wait for when the pKVM bits come into play though.
> +};
> +
> static int kvm_ptdump_guest_open(struct inode *inode, struct file *file);
> static int kvm_ptdump_guest_show(struct seq_file *m, void *);
>
> @@ -52,6 +115,94 @@ static int kvm_ptdump_show_common(struct seq_file *m,
> return kvm_pgtable_walk(pgtable, 0, BIT(pgtable->ia_bits), &walker);
> }
>
> +static void kvm_ptdump_build_levels(struct pg_level *level, u32 start_lvl)
> +{
> + static const char * const level_names[] = {"PGD", "PUD", "PMD", "PTE"};
> + u32 i = 0;
> + u64 mask_lvl = 0;
nit: _lvl adds nothing to this, and actually confused me for a sec as
to whether the mask changed per level.
> + if (start_lvl > 2) {
> + pr_err("invalid start_lvl %u\n", start_lvl);
> + return;
> + }
Can't we get something like -EINVAL out here and fail initialization?
Otherwise breadcrumbs like this pr_err() are hard to connect to a
specific failure.
> + for (i = 0; i < ARRAY_SIZE(stage2_pte_bits); i++)
> + mask_lvl |= stage2_pte_bits[i].mask;
> +
> + for (i = start_lvl; i <= KVM_PGTABLE_LAST_LEVEL; i++) {
> + level[i].name = level_names[i];
> + level[i].num = ARRAY_SIZE(stage2_pte_bits);
> + level[i].bits = stage2_pte_bits;
> + level[i].mask = mask_lvl;
> + }
> +
> + if (start_lvl > 0)
> + level[start_lvl].name = level_names[0];
> +}
> +
> +static int kvm_ptdump_parser_init(struct pg_state *st,
> + struct kvm_pgtable *pgtable,
> + struct seq_file *m)
> +{
> + struct addr_marker *ipa_addr_marker;
> + char *marker_msg;
> + struct pg_level *level_descr;
> + struct ptdump_range *range;
> +
> + ipa_addr_marker = kzalloc(sizeof(struct addr_marker) * ADDR_MARKER_LEN,
> + GFP_KERNEL_ACCOUNT);
> + if (!ipa_addr_marker)
> + return -ENOMEM;
> +
> + marker_msg = kzalloc(MARKER_MSG_LEN, GFP_KERNEL_ACCOUNT);
> + if (!marker_msg)
> + goto free_with_marker;
> +
> + level_descr = kzalloc(sizeof(struct pg_level) * (KVM_PGTABLE_LAST_LEVEL + 1),
> + GFP_KERNEL_ACCOUNT);
> + if (!level_descr)
> + goto free_with_msg;
> +
> + range = kzalloc(sizeof(struct ptdump_range) * ADDR_MARKER_LEN,
> + GFP_KERNEL_ACCOUNT);
> + if (!range)
> + goto free_with_level;
> +
> + kvm_ptdump_build_levels(level_descr, pgtable->start_level);
> +
> + snprintf(marker_msg, MARKER_MSG_LEN, "IPA bits %2u start lvl %1d",
> + pgtable->ia_bits, pgtable->start_level);
> +
> + ipa_addr_marker[0].name = marker_msg;
Is the dynamic name worth the added complexity? I see nothing wrong with
exposing additional debugfs files for simple attributes like the IPA
range and page table levels.
I know it isn't *that* much, just looking for every opportunity to
simplify further.
> + ipa_addr_marker[1].start_address = BIT(pgtable->ia_bits);
> + range[0].end = BIT(pgtable->ia_bits);
> +
> + st->seq = m;
> + st->marker = ipa_addr_marker;
> + st->level = -1,
> + st->pg_level = level_descr,
> + st->ptdump.range = range;
> + return 0;
> +
> +free_with_level:
> + kfree(level_descr);
> +free_with_msg:
> + kfree(marker_msg);
> +free_with_marker:
> + kfree(ipa_addr_marker);
> + return -ENOMEM;
> +}
> +
> +static void kvm_ptdump_parser_teardown(struct pg_state *st)
> +{
> + const struct addr_marker *ipa_addr_marker = st->marker;
> +
> + kfree(ipa_addr_marker[0].name);
> + kfree(ipa_addr_marker);
> + kfree(st->pg_level);
> + kfree(st->ptdump.range);
> +}
> +
> static int kvm_ptdump_guest_show(struct seq_file *m, void *)
> {
> struct kvm *guest_kvm = m->private;
> @@ -59,10 +210,15 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *)
> struct pg_state parser_state = {0};
> int ret;
>
> + ret = kvm_ptdump_parser_init(&parser_state, mmu->pgt, m);
> + if (ret)
> + return ret;
> +
Can this be done at open(), or am I missing something?
> write_lock(&guest_kvm->mmu_lock);
> ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
> write_unlock(&guest_kvm->mmu_lock);
>
> + kvm_ptdump_parser_teardown(&parser_state);
Same question here, can this happen at close()? I guess you'll need a
struct to encapsulate pg_state and a pointer to the VM at least.
Actually, come to think of it, if you embed all of the data you need for
the walker into a structure you can just do a single allocation for it
upfront.
--
Thanks,
Oliver
On Wed, Feb 07, 2024 at 02:48:32PM +0000, Sebastian Ene wrote:
> While arch/*/mem/ptdump handles the kernel pagetable dumping code,
> introduce KVM/ptdump which shows the guest stage-2 pagetables. The
> separation is necessary because most of the definitions from the
> stage-2 pagetable reside in the KVM path and we will be invoking
> functionality **specific** to KVM.
>
> When a guest is created, register a new file entry under the guest
> debugfs dir which allows userspace to show the contents of the guest
> stage-2 pagetables when accessed.
>
> Signed-off-by: Sebastian Ene <[email protected]>
> ---
> arch/arm64/kvm/Kconfig | 13 ++++++
> arch/arm64/kvm/Makefile | 1 +
> arch/arm64/kvm/debug.c | 7 ++++
> arch/arm64/kvm/kvm_ptdump.h | 20 ++++++++++
> arch/arm64/kvm/ptdump.c | 79 +++++++++++++++++++++++++++++++++++++
> 5 files changed, 120 insertions(+)
> create mode 100644 arch/arm64/kvm/kvm_ptdump.h
> create mode 100644 arch/arm64/kvm/ptdump.c
>
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 6c3c8ca73e7f..28097dd72174 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -68,4 +68,17 @@ config PROTECTED_NVHE_STACKTRACE
>
> If unsure, or not using protected nVHE (pKVM), say N.
>
> +config PTDUMP_STAGE2_DEBUGFS
> + bool "Present the stage-2 pagetables to debugfs"
> + depends on PTDUMP_DEBUGFS && KVM
> + default n
> + help
> + Say Y here if you want to show the stage-2 kernel pagetables
> + layout in a debugfs file. This information is only useful for kernel developers
> + who are working in architecture specific areas of the kernel.
> + It is probably not a good idea to enable this feature in a production
> + kernel.
> +
> + If in doubt, say N.
> +
> endif # VIRTUALIZATION
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index c0c050e53157..190eac17538c 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -23,6 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-its.o vgic/vgic-debug.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
> +kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
>
> always-y := hyp_constants.h hyp-constants.s
>
> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> index 8725291cb00a..aef52836cd90 100644
> --- a/arch/arm64/kvm/debug.c
> +++ b/arch/arm64/kvm/debug.c
> @@ -14,6 +14,8 @@
> #include <asm/kvm_arm.h>
> #include <asm/kvm_emulate.h>
>
> +#include <kvm_ptdump.h>
> +
> #include "trace.h"
>
> /* These are the bits of MDSCR_EL1 we may manipulate */
> @@ -342,3 +344,8 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
> }
> +
> +int kvm_arch_create_vm_debugfs(struct kvm *kvm)
> +{
> + return kvm_ptdump_guest_register(kvm);
> +}
> diff --git a/arch/arm64/kvm/kvm_ptdump.h b/arch/arm64/kvm/kvm_ptdump.h
> new file mode 100644
> index 000000000000..a7c00a28481b
> --- /dev/null
> +++ b/arch/arm64/kvm/kvm_ptdump.h
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) Google, 2023
> + * Author: Sebastian Ene <[email protected]>
> + */
> +
> +#ifndef __KVM_PTDUMP_H
> +#define __KVM_PTDUMP_H
> +
> +#include <linux/kvm_host.h>
> +#include <asm/ptdump.h>
> +
> +
> +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> +int kvm_ptdump_guest_register(struct kvm *kvm);
> +#else
> +static inline int kvm_ptdump_guest_register(struct kvm *kvm) { return 0; }
> +#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
> +
> +#endif /* __KVM_PTDUMP_H */
> diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c
> new file mode 100644
> index 000000000000..a4e984da8aa7
> --- /dev/null
> +++ b/arch/arm64/kvm/ptdump.c
> @@ -0,0 +1,79 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +//
> +// Debug helper used to dump the stage-2 pagetables of the system and their
> +// associated permissions.
> +//
> +// Copyright (C) Google, 2023
> +// Author: Sebastian Ene <[email protected]>
Same comment as last time about ... the comment :)
Should be of the form
/*
*
*/
> +#include <linux/debugfs.h>
> +#include <linux/kvm_host.h>
> +#include <linux/seq_file.h>
> +
> +#include <asm/kvm_pkvm.h>
is this needed?
> +#include <kvm_ptdump.h>
> +
> +
> +static int kvm_ptdump_guest_open(struct inode *inode, struct file *file);
> +static int kvm_ptdump_guest_show(struct seq_file *m, void *);
can you structure the file in a way to avoid forward declarations?
> +static const struct file_operations kvm_ptdump_guest_fops = {
> + .open = kvm_ptdump_guest_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> +static int kvm_ptdump_guest_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, kvm_ptdump_guest_show, inode->i_private);
> +}
> +
Shouldn't we take a reference on the KVM struct at open to avoid UAF?
struct kvm *kvm = inode->i_private;
if (!kvm_get_kvm_safe(kvm))
return -ENOENT;
Then you can do a put on it at close().
> +static int kvm_ptdump_visitor(const struct kvm_pgtable_visit_ctx *ctx,
> + enum kvm_pgtable_walk_flags visit)
> +{
> + struct pg_state *st = ctx->arg;
> + struct ptdump_state *pt_st = &st->ptdump;
> +
> + note_page(pt_st, ctx->addr, ctx->level, ctx->old);
> + return 0;
> +}
> +
> +static int kvm_ptdump_show_common(struct seq_file *m,
> + struct kvm_pgtable *pgtable,
> + struct pg_state *parser_state)
> +{
> + struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) {
> + .cb = kvm_ptdump_visitor,
> + .arg = parser_state,
> + .flags = KVM_PGTABLE_WALK_LEAF,
> + };
> +
> + return kvm_pgtable_walk(pgtable, 0, BIT(pgtable->ia_bits), &walker);
> +}
> +
> +static int kvm_ptdump_guest_show(struct seq_file *m, void *)
> +{
> + struct kvm *guest_kvm = m->private;
> + struct kvm_s2_mmu *mmu = &guest_kvm->arch.mmu;
> + struct pg_state parser_state = {0};
> + int ret;
> +
> + write_lock(&guest_kvm->mmu_lock);
> + ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
> + write_unlock(&guest_kvm->mmu_lock);
> +
> + return ret;
> +}
> +
> +int kvm_ptdump_guest_register(struct kvm *kvm)
> +{
> + struct dentry *parent;
> +
> + parent = debugfs_create_file("stage2_page_tables", 0400,
> + kvm->debugfs_dentry, kvm,
> + &kvm_ptdump_guest_fops);
> + if (IS_ERR(parent))
> + return PTR_ERR(parent);
This makes the otherwise benign debugfs failure into something fatal for
VM creation, no?
From the documentation on debugfs_create_file():
* NOTE: it's expected that most callers should _ignore_ the errors returned
* by this function. Other debugfs functions handle the fact that the "dentry"
* passed to them could be an error and they don't crash in that case.
* Drivers should generally work fine even if debugfs fails to init anyway.
The fact that kvm_arch_create_vm_debugfs() has a return value is a bit
of an anti-pattern to begin with.
--
Thanks,
Oliver
On Tue, Feb 13, 2024 at 12:56:20AM +0000, Oliver Upton wrote:
> On Wed, Feb 07, 2024 at 02:48:32PM +0000, Sebastian Ene wrote:
Hello Oliver,
> > While arch/*/mem/ptdump handles the kernel pagetable dumping code,
> > introduce KVM/ptdump which shows the guest stage-2 pagetables. The
> > separation is necessary because most of the definitions from the
> > stage-2 pagetable reside in the KVM path and we will be invoking
> > functionality **specific** to KVM.
> >
> > When a guest is created, register a new file entry under the guest
> > debugfs dir which allows userspace to show the contents of the guest
> > stage-2 pagetables when accessed.
> >
> > Signed-off-by: Sebastian Ene <[email protected]>
> > ---
> > arch/arm64/kvm/Kconfig | 13 ++++++
> > arch/arm64/kvm/Makefile | 1 +
> > arch/arm64/kvm/debug.c | 7 ++++
> > arch/arm64/kvm/kvm_ptdump.h | 20 ++++++++++
> > arch/arm64/kvm/ptdump.c | 79 +++++++++++++++++++++++++++++++++++++
> > 5 files changed, 120 insertions(+)
> > create mode 100644 arch/arm64/kvm/kvm_ptdump.h
> > create mode 100644 arch/arm64/kvm/ptdump.c
> >
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 6c3c8ca73e7f..28097dd72174 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -68,4 +68,17 @@ config PROTECTED_NVHE_STACKTRACE
> >
> > If unsure, or not using protected nVHE (pKVM), say N.
> >
> > +config PTDUMP_STAGE2_DEBUGFS
> > + bool "Present the stage-2 pagetables to debugfs"
> > + depends on PTDUMP_DEBUGFS && KVM
> > + default n
> > + help
> > + Say Y here if you want to show the stage-2 kernel pagetables
> > + layout in a debugfs file. This information is only useful for kernel developers
> > + who are working in architecture specific areas of the kernel.
> > + It is probably not a good idea to enable this feature in a production
> > + kernel.
> > +
> > + If in doubt, say N.
> > +
> > endif # VIRTUALIZATION
> > diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> > index c0c050e53157..190eac17538c 100644
> > --- a/arch/arm64/kvm/Makefile
> > +++ b/arch/arm64/kvm/Makefile
> > @@ -23,6 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> > vgic/vgic-its.o vgic/vgic-debug.o
> >
> > kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
> > +kvm-$(CONFIG_PTDUMP_STAGE2_DEBUGFS) += ptdump.o
> >
> > always-y := hyp_constants.h hyp-constants.s
> >
> > diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> > index 8725291cb00a..aef52836cd90 100644
> > --- a/arch/arm64/kvm/debug.c
> > +++ b/arch/arm64/kvm/debug.c
> > @@ -14,6 +14,8 @@
> > #include <asm/kvm_arm.h>
> > #include <asm/kvm_emulate.h>
> >
> > +#include <kvm_ptdump.h>
> > +
> > #include "trace.h"
> >
> > /* These are the bits of MDSCR_EL1 we may manipulate */
> > @@ -342,3 +344,8 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
> > vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
> > vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
> > }
> > +
> > +int kvm_arch_create_vm_debugfs(struct kvm *kvm)
> > +{
> > + return kvm_ptdump_guest_register(kvm);
> > +}
> > diff --git a/arch/arm64/kvm/kvm_ptdump.h b/arch/arm64/kvm/kvm_ptdump.h
> > new file mode 100644
> > index 000000000000..a7c00a28481b
> > --- /dev/null
> > +++ b/arch/arm64/kvm/kvm_ptdump.h
> > @@ -0,0 +1,20 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) Google, 2023
> > + * Author: Sebastian Ene <[email protected]>
> > + */
> > +
> > +#ifndef __KVM_PTDUMP_H
> > +#define __KVM_PTDUMP_H
> > +
> > +#include <linux/kvm_host.h>
> > +#include <asm/ptdump.h>
> > +
> > +
> > +#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> > +int kvm_ptdump_guest_register(struct kvm *kvm);
> > +#else
> > +static inline int kvm_ptdump_guest_register(struct kvm *kvm) { return 0; }
> > +#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
> > +
> > +#endif /* __KVM_PTDUMP_H */
> > diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c
> > new file mode 100644
> > index 000000000000..a4e984da8aa7
> > --- /dev/null
> > +++ b/arch/arm64/kvm/ptdump.c
> > @@ -0,0 +1,79 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +//
> > +// Debug helper used to dump the stage-2 pagetables of the system and their
> > +// associated permissions.
> > +//
> > +// Copyright (C) Google, 2023
> > +// Author: Sebastian Ene <[email protected]>
>
> Same comment as last time about ... the comment :)
>
> Should be of the form
>
> /*
> *
> */
>
I forgot to update this, sorry. Let me fix it.
> > +#include <linux/debugfs.h>
> > +#include <linux/kvm_host.h>
> > +#include <linux/seq_file.h>
> > +
> > +#include <asm/kvm_pkvm.h>
>
> is this needed?
>
We only need <asm/kvm_pgtable.h> so I will update this.
> > +#include <kvm_ptdump.h>
> > +
> > +
> > +static int kvm_ptdump_guest_open(struct inode *inode, struct file *file);
> > +static int kvm_ptdump_guest_show(struct seq_file *m, void *);
>
> can you structure the file in a way to avoid forward declarations?
>
Sounds good, let me drop those.
> > +static const struct file_operations kvm_ptdump_guest_fops = {
> > + .open = kvm_ptdump_guest_open,
> > + .read = seq_read,
> > + .llseek = seq_lseek,
> > + .release = single_release,
> > +};
> > +
> > +static int kvm_ptdump_guest_open(struct inode *inode, struct file *file)
> > +{
> > + return single_open(file, kvm_ptdump_guest_show, inode->i_private);
> > +}
> > +
>
> Shouldn't we take a reference on the KVM struct at open to avoid UAF?
>
> struct kvm *kvm = inode->i_private;
>
> if (!kvm_get_kvm_safe(kvm))
> return -ENOENT;
>
> Then you can do a put on it at close().
>
Thanks, I though that the kvm_destroy_vm_debugfs will keep spinning if
there are opened paths to the debugfs entry, but I guess nothing prevents
that from happening and the kvm struct can be removed behind our back.
> > +static int kvm_ptdump_visitor(const struct kvm_pgtable_visit_ctx *ctx,
> > + enum kvm_pgtable_walk_flags visit)
> > +{
> > + struct pg_state *st = ctx->arg;
> > + struct ptdump_state *pt_st = &st->ptdump;
> > +
> > + note_page(pt_st, ctx->addr, ctx->level, ctx->old);
> > + return 0;
> > +}
> > +
> > +static int kvm_ptdump_show_common(struct seq_file *m,
> > + struct kvm_pgtable *pgtable,
> > + struct pg_state *parser_state)
> > +{
> > + struct kvm_pgtable_walker walker = (struct kvm_pgtable_walker) {
> > + .cb = kvm_ptdump_visitor,
> > + .arg = parser_state,
> > + .flags = KVM_PGTABLE_WALK_LEAF,
> > + };
> > +
> > + return kvm_pgtable_walk(pgtable, 0, BIT(pgtable->ia_bits), &walker);
> > +}
> > +
> > +static int kvm_ptdump_guest_show(struct seq_file *m, void *)
> > +{
> > + struct kvm *guest_kvm = m->private;
> > + struct kvm_s2_mmu *mmu = &guest_kvm->arch.mmu;
> > + struct pg_state parser_state = {0};
> > + int ret;
> > +
> > + write_lock(&guest_kvm->mmu_lock);
> > + ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
> > + write_unlock(&guest_kvm->mmu_lock);
> > +
> > + return ret;
> > +}
> > +
> > +int kvm_ptdump_guest_register(struct kvm *kvm)
> > +{
> > + struct dentry *parent;
> > +
> > + parent = debugfs_create_file("stage2_page_tables", 0400,
> > + kvm->debugfs_dentry, kvm,
> > + &kvm_ptdump_guest_fops);
> > + if (IS_ERR(parent))
> > + return PTR_ERR(parent);
>
> This makes the otherwise benign debugfs failure into something fatal for
> VM creation, no?
>
> From the documentation on debugfs_create_file():
>
> * NOTE: it's expected that most callers should _ignore_ the errors returned
> * by this function. Other debugfs functions handle the fact that the "dentry"
> * passed to them could be an error and they don't crash in that case.
> * Drivers should generally work fine even if debugfs fails to init anyway.
>
> The fact that kvm_arch_create_vm_debugfs() has a return value is a bit
> of an anti-pattern to begin with.
>
Ack, I will ignore the retun code then.
> --
> Thanks,
> Oliver
Thanks,
Sebastian
On Tue, Feb 13, 2024 at 04:42:27PM +0000, Sebastian Ene wrote:
> On Tue, Feb 13, 2024 at 12:56:20AM +0000, Oliver Upton wrote:
> > On Wed, Feb 07, 2024 at 02:48:32PM +0000, Sebastian Ene wrote:
[...]
> > > +static int kvm_ptdump_guest_open(struct inode *inode, struct file *file)
> > > +{
> > > + return single_open(file, kvm_ptdump_guest_show, inode->i_private);
> > > +}
> > > +
> >
> > Shouldn't we take a reference on the KVM struct at open to avoid UAF?
> >
> > struct kvm *kvm = inode->i_private;
> >
> > if (!kvm_get_kvm_safe(kvm))
> > return -ENOENT;
> >
> > Then you can do a put on it at close().
> >
>
> Thanks, I though that the kvm_destroy_vm_debugfs will keep spinning if
> there are opened paths to the debugfs entry, but I guess nothing prevents
> that from happening and the kvm struct can be removed behind our back.
kvm_destroy_vm() will get called after the last put() on a kvm struct,
so all debugfs files should be closed too if we're consistent about
this.
--
Thanks,
Oliver
On Tue, Feb 13, 2024 at 12:42:42AM +0000, Oliver Upton wrote:
> On Wed, Feb 07, 2024 at 02:48:33PM +0000, Sebastian Ene wrote:
> > Define a set of attributes used by the ptdump parser to display the
> > properties of a guest memory region covered by a pagetable descriptor.
> > Build a description of the pagetable levels and initialize the parser
> > with this configuration.
> >
> > Signed-off-by: Sebastian Ene <[email protected]>
> > ---
> > arch/arm64/kvm/ptdump.c | 156 ++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 156 insertions(+)
> >
> > diff --git a/arch/arm64/kvm/ptdump.c b/arch/arm64/kvm/ptdump.c
> > index a4e984da8aa7..60725d46f17b 100644
> > --- a/arch/arm64/kvm/ptdump.c
> > +++ b/arch/arm64/kvm/ptdump.c
> > @@ -14,6 +14,69 @@
> > #include <kvm_ptdump.h>
> >
> >
> > +#define ADDR_MARKER_LEN (2)
> > +#define MARKER_MSG_LEN (32)
> > +
> > +static const struct prot_bits stage2_pte_bits[] = {
> > + {
> > + .mask = PTE_VALID,
> > + .val = PTE_VALID,
> > + .set = " ",
> > + .clear = "F",
> > + }, {
> > + .mask = KVM_PTE_LEAF_ATTR_HI_S2_XN | PTE_VALID,
> > + .val = KVM_PTE_LEAF_ATTR_HI_S2_XN | PTE_VALID,
> > + .set = "XN",
> > + .clear = " ",
> > + }, {
> > + .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | PTE_VALID,
> > + .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_R | PTE_VALID,
> > + .set = "R",
> > + .clear = " ",
> > + }, {
> > + .mask = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | PTE_VALID,
> > + .val = KVM_PTE_LEAF_ATTR_LO_S2_S2AP_W | PTE_VALID,
> > + .set = "W",
> > + .clear = " ",
> > + }, {
> > + .mask = KVM_PTE_LEAF_ATTR_LO_S2_AF | PTE_VALID,
> > + .val = KVM_PTE_LEAF_ATTR_LO_S2_AF | PTE_VALID,
> > + .set = "AF",
> > + .clear = " ",
> > + }, {
> > + .mask = PTE_NG,
> > + .val = PTE_NG,
> > + .set = "FnXS",
> > + .clear = " ",
> > + }, {
> > + .mask = PTE_CONT | PTE_VALID,
> > + .val = PTE_CONT | PTE_VALID,
> > + .set = "CON",
> > + .clear = " ",
> > + }, {
> > + .mask = PTE_TABLE_BIT,
> > + .val = PTE_TABLE_BIT,
> > + .set = " ",
> > + .clear = "BLK",
>
> <snip>
>
> > + }, {
> > + .mask = KVM_PGTABLE_PROT_SW0,
> > + .val = KVM_PGTABLE_PROT_SW0,
> > + .set = "SW0", /* PKVM_PAGE_SHARED_OWNED */
> > + }, {
> > + .mask = KVM_PGTABLE_PROT_SW1,
> > + .val = KVM_PGTABLE_PROT_SW1,
> > + .set = "SW1", /* PKVM_PAGE_SHARED_BORROWED */
> > + }, {
> > + .mask = KVM_PGTABLE_PROT_SW2,
> > + .val = KVM_PGTABLE_PROT_SW2,
> > + .set = "SW2",
> > + }, {
> > + .mask = KVM_PGTABLE_PROT_SW3,
> > + .val = KVM_PGTABLE_PROT_SW3,
> > + .set = "SW3",
> > + },
>
> </snip>
>
> These bits are never set in a 'normal' stage-2 PTE, does it make sense
> to carry descriptors for them here? In contexts where the SW bits are
> used it might be more useful if the ptdump used the specific meaning of
> the bit (e.g. OWNED, BORROWED, etc) instead of the generic SW%d.
>
> That can all wait for when the pKVM bits come into play though.
>
True, I guess we don't need these bits for now. We can insert them at a
later time when we will have the pKVM support and then we will have to
use their real maning for the state tracking.
> > +};
> > +
> > static int kvm_ptdump_guest_open(struct inode *inode, struct file *file);
> > static int kvm_ptdump_guest_show(struct seq_file *m, void *);
> >
> > @@ -52,6 +115,94 @@ static int kvm_ptdump_show_common(struct seq_file *m,
> > return kvm_pgtable_walk(pgtable, 0, BIT(pgtable->ia_bits), &walker);
> > }
> >
> > +static void kvm_ptdump_build_levels(struct pg_level *level, u32 start_lvl)
> > +{
> > + static const char * const level_names[] = {"PGD", "PUD", "PMD", "PTE"};
> > + u32 i = 0;
> > + u64 mask_lvl = 0;
>
> nit: _lvl adds nothing to this, and actually confused me for a sec as
> to whether the mask changed per level.
>
I will drop it from the name to avoid the confusion.
> > + if (start_lvl > 2) {
> > + pr_err("invalid start_lvl %u\n", start_lvl);
> > + return;
> > + }
>
> Can't we get something like -EINVAL out here and fail initialization?
> Otherwise breadcrumbs like this pr_err() are hard to connect to a
> specific failure.
>
Ok, I will add a return code for this function.
> > + for (i = 0; i < ARRAY_SIZE(stage2_pte_bits); i++)
> > + mask_lvl |= stage2_pte_bits[i].mask;
> > +
> > + for (i = start_lvl; i <= KVM_PGTABLE_LAST_LEVEL; i++) {
> > + level[i].name = level_names[i];
> > + level[i].num = ARRAY_SIZE(stage2_pte_bits);
> > + level[i].bits = stage2_pte_bits;
> > + level[i].mask = mask_lvl;
> > + }
> > +
> > + if (start_lvl > 0)
> > + level[start_lvl].name = level_names[0];
> > +}
> > +
> > +static int kvm_ptdump_parser_init(struct pg_state *st,
> > + struct kvm_pgtable *pgtable,
> > + struct seq_file *m)
> > +{
> > + struct addr_marker *ipa_addr_marker;
> > + char *marker_msg;
> > + struct pg_level *level_descr;
> > + struct ptdump_range *range;
> > +
> > + ipa_addr_marker = kzalloc(sizeof(struct addr_marker) * ADDR_MARKER_LEN,
> > + GFP_KERNEL_ACCOUNT);
> > + if (!ipa_addr_marker)
> > + return -ENOMEM;
> > +
> > + marker_msg = kzalloc(MARKER_MSG_LEN, GFP_KERNEL_ACCOUNT);
> > + if (!marker_msg)
> > + goto free_with_marker;
> > +
> > + level_descr = kzalloc(sizeof(struct pg_level) * (KVM_PGTABLE_LAST_LEVEL + 1),
> > + GFP_KERNEL_ACCOUNT);
> > + if (!level_descr)
> > + goto free_with_msg;
> > +
> > + range = kzalloc(sizeof(struct ptdump_range) * ADDR_MARKER_LEN,
> > + GFP_KERNEL_ACCOUNT);
> > + if (!range)
> > + goto free_with_level;
> > +
> > + kvm_ptdump_build_levels(level_descr, pgtable->start_level);
> > +
> > + snprintf(marker_msg, MARKER_MSG_LEN, "IPA bits %2u start lvl %1d",
> > + pgtable->ia_bits, pgtable->start_level);
> > +
> > + ipa_addr_marker[0].name = marker_msg;
>
> Is the dynamic name worth the added complexity? I see nothing wrong with
> exposing additional debugfs files for simple attributes like the IPA
> range and page table levels.
>
> I know it isn't *that* much, just looking for every opportunity to
> simplify further.
>
We can keep them separate, I have no strong opinion about this. I think
this was Vincent's, original suggestion to have them so I will check with
him as well.
> > + ipa_addr_marker[1].start_address = BIT(pgtable->ia_bits);
> > + range[0].end = BIT(pgtable->ia_bits);
> > +
> > + st->seq = m;
> > + st->marker = ipa_addr_marker;
> > + st->level = -1,
> > + st->pg_level = level_descr,
> > + st->ptdump.range = range;
> > + return 0;
> > +
> > +free_with_level:
> > + kfree(level_descr);
> > +free_with_msg:
> > + kfree(marker_msg);
> > +free_with_marker:
> > + kfree(ipa_addr_marker);
> > + return -ENOMEM;
> > +}
> > +
> > +static void kvm_ptdump_parser_teardown(struct pg_state *st)
> > +{
> > + const struct addr_marker *ipa_addr_marker = st->marker;
> > +
> > + kfree(ipa_addr_marker[0].name);
> > + kfree(ipa_addr_marker);
> > + kfree(st->pg_level);
> > + kfree(st->ptdump.range);
> > +}
> > +
> > static int kvm_ptdump_guest_show(struct seq_file *m, void *)
> > {
> > struct kvm *guest_kvm = m->private;
> > @@ -59,10 +210,15 @@ static int kvm_ptdump_guest_show(struct seq_file *m, void *)
> > struct pg_state parser_state = {0};
> > int ret;
> >
> > + ret = kvm_ptdump_parser_init(&parser_state, mmu->pgt, m);
> > + if (ret)
> > + return ret;
> > +
>
> Can this be done at open(), or am I missing something?
>
I guess we can do this in open() but then we will have to add again that
struct that wraps some ptdump specific state tracking. It seemed a bit cleaner in
this way. What do you think ?
> > write_lock(&guest_kvm->mmu_lock);
> > ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
> > write_unlock(&guest_kvm->mmu_lock);
> >
> > + kvm_ptdump_parser_teardown(&parser_state);
>
> Same question here, can this happen at close()? I guess you'll need a
> struct to encapsulate pg_state and a pointer to the VM at least.
>
Right, I tried to avoid using a separate struct as we discussed in v4.
> Actually, come to think of it, if you embed all of the data you need for
> the walker into a structure you can just do a single allocation for it
> upfront.
>
> --
> Thanks,
> Oliver
Thanks for the feedback,
Seb
On Tue, Feb 13, 2024 at 04:59:01PM +0000, Sebastian Ene wrote:
> On Tue, Feb 13, 2024 at 12:42:42AM +0000, Oliver Upton wrote:
> > On Wed, Feb 07, 2024 at 02:48:33PM +0000, Sebastian Ene wrote:
[...]
> > > +
> > > + snprintf(marker_msg, MARKER_MSG_LEN, "IPA bits %2u start lvl %1d",
> > > + pgtable->ia_bits, pgtable->start_level);
> > > +
> > > + ipa_addr_marker[0].name = marker_msg;
> >
> > Is the dynamic name worth the added complexity? I see nothing wrong with
> > exposing additional debugfs files for simple attributes like the IPA
> > range and page table levels.
> >
> > I know it isn't *that* much, just looking for every opportunity to
> > simplify further.
> >
>
> We can keep them separate, I have no strong opinion about this. I think
> this was Vincent's, original suggestion to have them so I will check with
> him as well.
Well, if we get to the place where there's a single struct containing
all of the required data upfront then this becomes less of an issue.
This is useful information still, so let's see if we can go about it the
other way.
> > > + ret = kvm_ptdump_parser_init(&parser_state, mmu->pgt, m);
> > > + if (ret)
> > > + return ret;
> > > +
> >
> > Can this be done at open(), or am I missing something?
> >
>
> I guess we can do this in open() but then we will have to add again that
> struct that wraps some ptdump specific state tracking. It seemed a bit cleaner in
> this way. What do you think ?
Allocating something that looks like an iterator end embedding it in
->private isn't too uncommon.
> > > write_lock(&guest_kvm->mmu_lock);
> > > ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
> > > write_unlock(&guest_kvm->mmu_lock);
> > >
> > > + kvm_ptdump_parser_teardown(&parser_state);
> >
> > Same question here, can this happen at close()? I guess you'll need a
> > struct to encapsulate pg_state and a pointer to the VM at least.
> >
>
> Right, I tried to avoid using a separate struct as we discussed in v4.
Sorry, I hope I didn't confuse you in my prior feedback.
What I had issue with was the multiple layers of function ptr / ops
structs for managing the file interface. I have zero concerns with
organizing the _data_ for the walk this way.
> > Actually, come to think of it, if you embed all of the data you need for
> > the walker into a structure you can just do a single allocation for it
> > upfront.
--
Thanks,
Oliver
On Tue, Feb 13, 2024 at 05:10:30PM +0000, Oliver Upton wrote:
> On Tue, Feb 13, 2024 at 04:59:01PM +0000, Sebastian Ene wrote:
> > On Tue, Feb 13, 2024 at 12:42:42AM +0000, Oliver Upton wrote:
> > > On Wed, Feb 07, 2024 at 02:48:33PM +0000, Sebastian Ene wrote:
>
> [...]
>
> > > > +
> > > > + snprintf(marker_msg, MARKER_MSG_LEN, "IPA bits %2u start lvl %1d",
> > > > + pgtable->ia_bits, pgtable->start_level);
> > > > +
> > > > + ipa_addr_marker[0].name = marker_msg;
> > >
> > > Is the dynamic name worth the added complexity? I see nothing wrong with
> > > exposing additional debugfs files for simple attributes like the IPA
> > > range and page table levels.
> > >
> > > I know it isn't *that* much, just looking for every opportunity to
> > > simplify further.
> > >
> >
> > We can keep them separate, I have no strong opinion about this. I think
> > this was Vincent's, original suggestion to have them so I will check with
> > him as well.
>
> Well, if we get to the place where there's a single struct containing
> all of the required data upfront then this becomes less of an issue.
> This is useful information still, so let's see if we can go about it the
> other way.
>
Allright, I will create separate entries for getting the ipa_space and the start
level which will be exposed in the vm debugfs dir.
> > > > + ret = kvm_ptdump_parser_init(&parser_state, mmu->pgt, m);
> > > > + if (ret)
> > > > + return ret;
> > > > +
> > >
> > > Can this be done at open(), or am I missing something?
> > >
> >
> > I guess we can do this in open() but then we will have to add again that
> > struct that wraps some ptdump specific state tracking. It seemed a bit cleaner in
> > this way. What do you think ?
>
> Allocating something that looks like an iterator end embedding it in
> ->private isn't too uncommon.
>
Ack, will stick to this approach in this case.
> > > > write_lock(&guest_kvm->mmu_lock);
> > > > ret = kvm_ptdump_show_common(m, mmu->pgt, &parser_state);
> > > > write_unlock(&guest_kvm->mmu_lock);
> > > >
> > > > + kvm_ptdump_parser_teardown(&parser_state);
> > >
> > > Same question here, can this happen at close()? I guess you'll need a
> > > struct to encapsulate pg_state and a pointer to the VM at least.
> > >
> >
> > Right, I tried to avoid using a separate struct as we discussed in v4.
>
> Sorry, I hope I didn't confuse you in my prior feedback.
>
> What I had issue with was the multiple layers of function ptr / ops
> structs for managing the file interface. I have zero concerns with
> organizing the _data_ for the walk this way.
>
Right, I see what you mean, thanks for the clarification.
> > > Actually, come to think of it, if you embed all of the data you need for
> > > the walker into a structure you can just do a single allocation for it
> > > upfront.
>
> --
> Thanks,
> Oliver
Thanks for the feedback,
Seb