2022-08-04 07:08:39

by lizhe.67

Subject: [RFC] x86/mm/dump_pagetables: Allow dumping pagetables by pid

From: Li Zhe <[email protected]>

In the current kernel, a user task's page table can only be
dumped by the task itself. Sometimes we need to inspect the
page table attributes of different user-space memory mappings
for development and debugging. This patch makes that easier.
It adds two files named 'pid' and 'pid_pgtable_show'. Write
the pid of the task to inspect to 'pid', then read its page
table info from 'pid_pgtable_show'.

User space can use the files 'pid' and 'pid_pgtable_show' as follows.
====
$ echo $pid > /sys/kernel/debug/page_tables/pid
$ cat /sys/kernel/debug/page_tables/pid_pgtable_show

Signed-off-by: Li Zhe <[email protected]>
---
arch/x86/mm/debug_pagetables.c | 82 ++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)

diff --git a/arch/x86/mm/debug_pagetables.c b/arch/x86/mm/debug_pagetables.c
index 092ea436c7e6..53a8ced44080 100644
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -4,6 +4,8 @@
#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/pgtable.h>
+#include <linux/slab.h>
+#include <linux/sched/mm.h>

static int ptdump_show(struct seq_file *m, void *v)
{
@@ -31,6 +33,84 @@ static int ptdump_curusr_show(struct seq_file *m, void *v)
}

DEFINE_SHOW_ATTRIBUTE(ptdump_curusr);
+
+static pid_t trace_pid;
+static int ptdump_pid_pgtable_show(struct seq_file *m, void *v)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ if (trace_pid == 0)
+ return 0;
+
+ rcu_read_lock();
+ task = find_task_by_vpid(trace_pid);
+ if (!task) {
+ rcu_read_unlock();
+ return -ESRCH;
+ }
+ mm = get_task_mm(task);
+ rcu_read_unlock();
+
+ if (mm && mm->pgd)
+ ptdump_walk_pgd_level_debugfs(m, mm, true);
+
+ if (mm)
+ mmput(mm);
+
+ return 0;
+}
+
+DEFINE_SHOW_ATTRIBUTE(ptdump_pid_pgtable);
+
+static ssize_t ptdump_pid_write(struct file *file, const char __user *buffer,
+ size_t count, loff_t *f_pos)
+{
+ pid_t pid;
+ int ret = -ENOMEM;
+ char *tmp = kzalloc(count, GFP_KERNEL);
+
+ if (!tmp)
+ return ret;
+
+ if (copy_from_user(tmp, buffer, count)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = kstrtoint(tmp, 0, &pid);
+ if (ret) {
+ ret = -EINVAL;
+ goto out;
+ }
+ kfree(tmp);
+ trace_pid = pid;
+ return count;
+
+out:
+ kfree(tmp);
+ return ret;
+}
+
+static int ptdump_show_pid(struct seq_file *m, void *v)
+{
+ seq_printf(m, "%d\n", trace_pid);
+ return 0;
+}
+
+static int ptdump_open_pid(struct inode *inode, struct file *filp)
+{
+ return single_open(filp, ptdump_show_pid, NULL);
+}
+
+static const struct file_operations ptdump_pid_fops = {
+ .owner = THIS_MODULE,
+ .open = ptdump_open_pid,
+ .write = ptdump_pid_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
#endif

#if defined(CONFIG_EFI) && defined(CONFIG_X86_64)
@@ -57,6 +137,8 @@ static int __init pt_dump_debug_init(void)
#ifdef CONFIG_PAGE_TABLE_ISOLATION
debugfs_create_file("current_user", 0400, dir, NULL,
&ptdump_curusr_fops);
+ debugfs_create_file("pid_pgtable_show", 0400, dir, NULL, &ptdump_pid_pgtable_fops);
+ debugfs_create_file("pid", 0400, dir, NULL, &ptdump_pid_fops);
#endif
#if defined(CONFIG_EFI) && defined(CONFIG_X86_64)
debugfs_create_file("efi", 0400, dir, NULL, &ptdump_efi_fops);
--
2.20.1
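
Note on ptdump_pid_write() in the hunk above: it copies exactly `count` bytes into a buffer of `count` bytes and then hands that buffer to kstrtoint(), which expects a NUL-terminated string. The following is a minimal user-space sketch of the bounded copy, terminate, then parse pattern, not the patch author's code. The name parse_pid() and the 16-byte limit are assumptions made for illustration, and strtol() stands in for kstrtoint(). In the kernel, kstrtoint_from_user() performs the copy-and-terminate step itself.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse a pid from `count` bytes that are NOT
 * guaranteed to be NUL-terminated (as with a write() payload).
 * Returns 0 on success or -EINVAL on malformed input.
 */
int parse_pid(const char *buf, size_t count, int *pid)
{
	char tmp[16];	/* assumed upper bound; plenty for any pid */
	char *end;
	long val;

	/* Leave room for the terminator we are about to add. */
	if (count == 0 || count >= sizeof(tmp))
		return -EINVAL;

	memcpy(tmp, buf, count);
	tmp[count] = '\0';

	errno = 0;
	val = strtol(tmp, &end, 0);
	/* Reject empty input, overflow, and negative values. */
	if (end == tmp || errno || val < 0 || val > INT_MAX)
		return -EINVAL;

	/* Allow the single trailing newline that echo appends. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*pid = (int)val;
	return 0;
}
```

The explicit terminator is the point of the sketch: without it, the parser may read past the end of the copied data.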



2022-09-10 23:30:53

by Dave Hansen

Subject: Re: [RFC] x86/mm/dump_pagetables: Allow dumping pagetables by pid

On 8/4/22 00:04, [email protected] wrote:
> In the current kernel, a user task's page table can only be
> dumped by the task itself. Sometimes we need to inspect the
> page table attributes of different user-space memory mappings
> for development and debugging. This patch makes that easier.
> It adds two files named 'pid' and 'pid_pgtable_show'. Write
> the pid of the task to inspect to 'pid', then read its page
> table info from 'pid_pgtable_show'.
>
> User space can use the files 'pid' and 'pid_pgtable_show' as follows.
> ====
> $ echo $pid > /sys/kernel/debug/page_tables/pid
> $ cat /sys/kernel/debug/page_tables/pid_pgtable_show

This seems a wee bit silly considering that we have /proc. It's also
impossible to have an ABI like this work if multiple processes are
trying to dump different pids.

Are there any other per-process things in debugfs where folks have done
something similar?
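
The interleaving behind the multiple-users objection can be modeled in a few lines of user-space C. This is only a sketch of the shared-state problem, not kernel code: the global below mirrors the patch's single `static pid_t trace_pid`, while select_pid(), dumped_pid(), and pid_seen_by_a() are hypothetical stand-ins for the echo and cat steps.

```c
#include <sys/types.h>

/* Models the one global slot from the patch (static pid_t trace_pid). */
static pid_t trace_pid;

/* Stand-in for: echo $pid > /sys/kernel/debug/page_tables/pid */
void select_pid(pid_t pid)
{
	trace_pid = pid;
}

/*
 * Stand-in for: cat /sys/kernel/debug/page_tables/pid_pgtable_show
 * Returns the pid whose page tables would be dumped.
 */
pid_t dumped_pid(void)
{
	return trace_pid;
}

/* User A selects pid 100, but user B selects pid 200 before A reads. */
pid_t pid_seen_by_a(void)
{
	select_pid(100);	/* user A */
	select_pid(200);	/* user B, racing with A */
	return dumped_pid();	/* user A reads, but gets B's selection */
}
```

With one slot shared by every opener, user A ends up dumping pid 200 rather than the pid it asked for.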

2022-09-12 08:38:13

by Peter Zijlstra

Subject: Re: [RFC] x86/mm/dump_pagetables: Allow dumping pagetables by pid

On Sat, Sep 10, 2022 at 04:09:55PM -0700, Dave Hansen wrote:
> On 8/4/22 00:04, [email protected] wrote:
> > In the current kernel, a user task's page table can only be
> > dumped by the task itself. Sometimes we need to inspect the
> > page table attributes of different user-space memory mappings
> > for development and debugging. This patch makes that easier.
> > It adds two files named 'pid' and 'pid_pgtable_show'. Write
> > the pid of the task to inspect to 'pid', then read its page
> > table info from 'pid_pgtable_show'.
> >
> > User space can use the files 'pid' and 'pid_pgtable_show' as follows.
> > ====
> > $ echo $pid > /sys/kernel/debug/page_tables/pid
> > $ cat /sys/kernel/debug/page_tables/pid_pgtable_show
>
> This seems a wee bit silly considering that we have /proc. It's also
> impossible to have an ABI like this work if multiple processes are
> trying to dump different pids.
>
> Are there any other per-process things in debugfs where folks have done
> something similar?

Not that I'm aware of; we can of course duplicate the whole process tree
in /debug once again, but that would suck.

Another option that sucks is writing and reading to the same filedesc;
something like:

exec 3<> /debug/page_tables/pid_user
echo $pid >&3
cat - <&3

that way you can have multiple concurrent users.
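
From C, the same write-then-read-on-one-descriptor idea might look roughly like the sketch below. It rests on assumptions: the file /debug/page_tables/pid_user is only a proposal in this thread and does not exist, and the sketch assumes such a file would keep the written pid per open file and accept a rewind to offset 0 before the read. The helper name dump_by_pid() is hypothetical as well.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `pid` to `path`, then read the reply back through the SAME
 * file descriptor, so the selection is private to this open file.
 * Returns the number of bytes read (out is NUL-terminated), or -1.
 * Assumption: the file keeps per-open state and can be rewound.
 */
ssize_t dump_by_pid(const char *path, pid_t pid, char *out, size_t outlen)
{
	char req[32];
	ssize_t n = -1;
	int len, fd;

	if (outlen == 0)
		return -1;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	/* Same payload as: echo $pid >&3 */
	len = snprintf(req, sizeof(req), "%d\n", (int)pid);
	if (write(fd, req, (size_t)len) != (ssize_t)len)
		goto out;

	/* Rewind so the read starts at the beginning of the reply. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		goto out;

	/* Single read for brevity; a real dump would need a loop. */
	n = read(fd, out, outlen - 1);
	if (n >= 0)
		out[n] = '\0';
out:
	close(fd);
	return n;
}
```

Because the pid travels with the descriptor rather than living in a global, two callers with separate descriptors would not overwrite each other's selection.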