2024-05-04 00:30:20

by Andrii Nakryiko

Subject: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

Implement a binary ioctl()-based interface to the /proc/<pid>/maps file to
allow applications to query VMA information more efficiently than through
textual processing of /proc/<pid>/maps contents. See patch #2 for the
context, justification, and nuances of the API design.

Patch #1 is a refactoring to keep VMA name determination logic in one place.
Patch #2 is the meat of kernel-side API.
Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
optionally use this new ioctl()-based API, if supported.
Patch #5 implements a simple C tool to demonstrate the intended efficient
use (for both textual and binary interfaces) and allows benchmarking them.
The patch itself also includes performance numbers from a test based on one
of the medium-sized internal applications taken from production.

This patch set is based on top of the next-20240503 tag in the linux-next
tree. I'm not sure what the target tree for this should be, so I'd
appreciate any guidance, thank you!

Andrii Nakryiko (5):
fs/procfs: extract logic for getting VMA name constituents
fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
tools: sync uapi/linux/fs.h header into tools subdir
selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

fs/proc/task_mmu.c | 290 +++++++++++---
include/uapi/linux/fs.h | 32 ++
.../perf/trace/beauty/include/uapi/linux/fs.h | 32 ++
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/procfs_query.c | 366 ++++++++++++++++++
tools/testing/selftests/bpf/test_progs.c | 3 +
tools/testing/selftests/bpf/test_progs.h | 2 +
tools/testing/selftests/bpf/trace_helpers.c | 105 ++++-
9 files changed, 763 insertions(+), 70 deletions(-)
create mode 100644 tools/testing/selftests/bpf/procfs_query.c

--
2.43.0



2024-05-04 00:30:42

by Andrii Nakryiko

Subject: [PATCH 1/5] fs/procfs: extract logic for getting VMA name constituents

Extract generic logic that fetches the relevant pieces of data describing
a VMA name. This could be just some string (either a special constant or
user-provided), a string with some formatted wrapping text (e.g.,
"[anon_shmem:<something>]"), or, commonly, a file path. The seq_file-based
logic has different methods to handle all three cases, but they are
currently mixed in with extracting the underlying sources of data.

This patch splits this into data fetching and data formatting, so that
data fetching can be reused later on.

There should be no functional changes.

Signed-off-by: Andrii Nakryiko <[email protected]>
---
fs/proc/task_mmu.c | 125 +++++++++++++++++++++++++--------------------
1 file changed, 71 insertions(+), 54 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e5a5f015ff03..8e503a1635b7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -239,6 +239,67 @@ static int do_maps_open(struct inode *inode, struct file *file,
sizeof(struct proc_maps_private));
}

+static void get_vma_name(struct vm_area_struct *vma,
+ const struct path **path,
+ const char **name,
+ const char **name_fmt)
+{
+ struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL;
+
+ *name = NULL;
+ *path = NULL;
+ *name_fmt = NULL;
+
+ /*
+ * Print the dentry name for named mappings, and a
+ * special [heap] marker for the heap:
+ */
+ if (vma->vm_file) {
+ /*
+ * If user named this anon shared memory via
+ * prctl(PR_SET_VMA ..., use the provided name.
+ */
+ if (anon_name) {
+ *name_fmt = "[anon_shmem:%s]";
+ *name = anon_name->name;
+ } else {
+ *path = file_user_path(vma->vm_file);
+ }
+ return;
+ }
+
+ if (vma->vm_ops && vma->vm_ops->name) {
+ *name = vma->vm_ops->name(vma);
+ if (*name)
+ return;
+ }
+
+ *name = arch_vma_name(vma);
+ if (*name)
+ return;
+
+ if (!vma->vm_mm) {
+ *name = "[vdso]";
+ return;
+ }
+
+ if (vma_is_initial_heap(vma)) {
+ *name = "[heap]";
+ return;
+ }
+
+ if (vma_is_initial_stack(vma)) {
+ *name = "[stack]";
+ return;
+ }
+
+ if (anon_name) {
+ *name_fmt = "[anon:%s]";
+ *name = anon_name->name;
+ return;
+ }
+}
+
static void show_vma_header_prefix(struct seq_file *m,
unsigned long start, unsigned long end,
vm_flags_t flags, unsigned long long pgoff,
@@ -262,17 +323,15 @@ static void show_vma_header_prefix(struct seq_file *m,
static void
show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
{
- struct anon_vma_name *anon_name = NULL;
- struct mm_struct *mm = vma->vm_mm;
- struct file *file = vma->vm_file;
+ const struct path *path;
+ const char *name_fmt, *name;
vm_flags_t flags = vma->vm_flags;
unsigned long ino = 0;
unsigned long long pgoff = 0;
unsigned long start, end;
dev_t dev = 0;
- const char *name = NULL;

- if (file) {
+ if (vma->vm_file) {
const struct inode *inode = file_user_inode(vma->vm_file);

dev = inode->i_sb->s_dev;
@@ -283,57 +342,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
start = vma->vm_start;
end = vma->vm_end;
show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
- if (mm)
- anon_name = anon_vma_name(vma);

- /*
- * Print the dentry name for named mappings, and a
- * special [heap] marker for the heap:
- */
- if (file) {
+ get_vma_name(vma, &path, &name, &name_fmt);
+ if (path) {
seq_pad(m, ' ');
- /*
- * If user named this anon shared memory via
- * prctl(PR_SET_VMA ..., use the provided name.
- */
- if (anon_name)
- seq_printf(m, "[anon_shmem:%s]", anon_name->name);
- else
- seq_path(m, file_user_path(file), "\n");
- goto done;
- }
-
- if (vma->vm_ops && vma->vm_ops->name) {
- name = vma->vm_ops->name(vma);
- if (name)
- goto done;
- }
-
- name = arch_vma_name(vma);
- if (!name) {
- if (!mm) {
- name = "[vdso]";
- goto done;
- }
-
- if (vma_is_initial_heap(vma)) {
- name = "[heap]";
- goto done;
- }
-
- if (vma_is_initial_stack(vma)) {
- name = "[stack]";
- goto done;
- }
-
- if (anon_name) {
- seq_pad(m, ' ');
- seq_printf(m, "[anon:%s]", anon_name->name);
- }
- }
-
-done:
- if (name) {
+ seq_path(m, path, "\n");
+ } else if (name_fmt) {
+ seq_pad(m, ' ');
+ seq_printf(m, name_fmt, name);
+ } else if (name) {
seq_pad(m, ' ');
seq_puts(m, name);
}
--
2.43.0


2024-05-04 00:31:04

by Andrii Nakryiko

Subject: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

The /proc/<pid>/maps file is extremely useful in practice for various
tasks involving figuring out process memory layout, what files are backing
any given memory range, etc. One important class of applications that
absolutely rely on this are profilers/stack symbolizers. They would
normally capture a stack trace containing absolute memory addresses of
some functions, and would then use the /proc/<pid>/maps file to find the
corresponding backing ELF files and file offsets within them, and then
continue from there to get yet more information (ELF symbols, DWARF
information) to produce human-readable symbolic information.

As such, there are both performance and correctness requirements
involved. This address-to-VMA translation has to be done as efficiently
as possible, but it also must not miss any VMA (especially in the case of
loading/unloading shared libraries).

Unfortunately, for all the /proc/<pid>/maps file's universality and
usefulness, it doesn't fit the above requirements 100%.

First, it's text-based, which makes its programmatic use from
applications and libraries unnecessarily cumbersome and slow due to the
need to do text parsing to get the necessary pieces of information.

Second, its main purpose is to emit all VMAs sequentially, but in
practice captured addresses fall into only a small subset of all of a
process' VMAs, mainly those containing executable text. Yet, a library
would need to parse most or all of the contents to find the needed VMAs,
as there is no way to skip VMAs that are of no use. An efficient library
can do a single linear pass and still be relatively fast, but it's
definitely overhead that can be avoided, if there is a way to do more
targeted querying of the relevant VMA information.

Another problem when writing a generic stack trace symbolization library
is an unfortunate performance-vs-correctness tradeoff that needs to be
made. The library has to decide whether to cache parsed contents of
/proc/<pid>/maps to serve future requests (if the application requests
symbolization of another set of addresses, captured at some later time,
which is typical for periodic/continuous profiling cases) or to re-open
and re-parse the file on each request. In the former case, more memory is
used for the cache and there is a risk of getting stale data if the
application loaded/unloaded shared libraries, or otherwise changed its
set of VMAs through additional mmap() calls (and other means of altering
the memory address space). In the latter case, it's the performance hit
that comes from re-opening the file and re-reading/re-parsing its
contents all over again.

This patch aims to solve this problem by providing a new API built on
top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
interface, avoiding the cost and awkwardness of textual representation
for programmatic use. It's designed to be extensible and
forward/backward compatible by including a user-specified struct size
and using the copy_struct_from_user() approach. But, most importantly,
it allows point queries for a specific single address, specified by the
user, and does so efficiently using the VMA iterator.

The user has a choice: either get the VMA that covers the provided
address, or -ENOENT if none is found (the exact, least surprising case).
Or, with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
get either the VMA that covers the address (if there is one), or the
closest next VMA (i.e., the VMA with the smallest vm_start > addr). The
latter allows more efficient use, but, given it could be surprising
behavior, requires an explicit opt-in.
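
A minimal usage sketch (illustrative only; error handling and the usual
headers like <sys/ioctl.h> and <linux/fs.h> are elided, fd is an
already-open /proc/<pid>/maps file descriptor, and addr is the address
to resolve):

	struct procfs_procmap_query q;
	char vma_name[PATH_MAX];

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
	q.query_addr = (__u64)(uintptr_t)addr;
	q.vma_name_addr = (__u64)(uintptr_t)vma_name; /* optional */
	q.vma_name_size = sizeof(vma_name);

	if (ioctl(fd, PROCFS_PROCMAP_QUERY, &q) == 0)
		printf("%llx-%llx off %llx %s\n",
		       q.vma_start, q.vma_end, q.vma_offset,
		       q.vma_name_size ? vma_name : "");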

Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
sense given it's querying the same set of VMA data. All the permission
checks performed on /proc/<pid>/maps opening fit here as well. The
ioctl-based implementation fetches the remembered mm_struct reference,
but otherwise doesn't interfere with the seq_file-based implementation of
the /proc/<pid>/maps textual interface, and so the two can be used
together or independently without paying any price for that.

There is one extra thing that /proc/<pid>/maps doesn't currently
provide, and that's the ability to fetch the ELF build ID, if present.
The user controls whether this piece of information is requested by
setting the build_id_size field either to zero (to skip it) or to the
non-zero maximum size of a buffer provided through the build_id_addr
field (which encodes a user pointer as a __u64).

The need to get the ELF build ID reliably is an important aspect when
dealing with profiling and stack trace symbolization, and the
/proc/<pid>/maps textual representation doesn't help with this,
requiring applications to open the underlying ELF binary through the
/proc/<pid>/map_files/<start>-<end> symlink, which adds extra
permission implications due to giving full access to the binary of a
(potentially) different process, while all the application is interested
in is the build ID. The ability to request just the build ID doesn't
introduce any additional security concerns on top of what
/proc/<pid>/maps is already concerned with, simplifying the overall
logic.

The kernel already implements build ID fetching, which is used by the
BPF subsystem. We are reusing this code here, but plan follow-up changes
to make it work better under the more relaxed assumption (compared to
what the existing code assumes) of being called from user process
context, in which page faults are allowed. The BPF-specific
implementation currently bails out if the necessary part of the ELF file
is not paged in, all due to extra BPF-specific restrictions (like the
need to fetch the build ID in restrictive contexts such as an NMI
handler).

Note also that fetching the VMA name (e.g., the backing file path, or a
special hard-coded or user-provided name) is optional, just like the
build ID. If the user sets vma_name_size to zero, the kernel won't
attempt to retrieve it, saving resources.
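
E.g., a caller interested only in the build ID would do (a sketch, with
the 20-byte buffer matching BUILD_ID_SIZE_MAX):

	char build_id[20];

	q.vma_name_size = 0;                /* skip VMA name fetching */
	q.build_id_addr = (__u64)(uintptr_t)build_id;
	q.build_id_size = sizeof(build_id); /* max buffer size, in/out */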

Signed-off-by: Andrii Nakryiko <[email protected]>
---
fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/fs.h | 32 ++++++++
2 files changed, 197 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8e503a1635b7..cb7b1ff1a144 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,6 +22,7 @@
#include <linux/pkeys.h>
#include <linux/minmax.h>
#include <linux/overflow.h>
+#include <linux/buildid.h>

#include <asm/elf.h>
#include <asm/tlb.h>
@@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
return do_maps_open(inode, file, &proc_pid_maps_op);
}

+static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
+{
+ struct procfs_procmap_query karg;
+ struct vma_iterator iter;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ const char *name = NULL;
+ char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
+ __u64 usize;
+ int err;
+
+ if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
+ return -EFAULT;
+ if (usize > PAGE_SIZE)
+ return -E2BIG;
+ if (usize < offsetofend(struct procfs_procmap_query, query_addr))
+ return -EINVAL;
+ err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
+ if (err)
+ return err;
+
+ if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
+ return -EINVAL;
+ if (!!karg.vma_name_size != !!karg.vma_name_addr)
+ return -EINVAL;
+ if (!!karg.build_id_size != !!karg.build_id_addr)
+ return -EINVAL;
+
+ mm = priv->mm;
+ if (!mm || !mmget_not_zero(mm))
+ return -ESRCH;
+ if (mmap_read_lock_killable(mm)) {
+ mmput(mm);
+ return -EINTR;
+ }
+
+ vma_iter_init(&iter, mm, karg.query_addr);
+ vma = vma_next(&iter);
+ if (!vma) {
+ err = -ENOENT;
+ goto out;
+ }
+ /* user wants covering VMA, not the closest next one */
+ if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
+ vma->vm_start > karg.query_addr) {
+ err = -ENOENT;
+ goto out;
+ }
+
+ karg.vma_start = vma->vm_start;
+ karg.vma_end = vma->vm_end;
+
+ if (vma->vm_file) {
+ const struct inode *inode = file_user_inode(vma->vm_file);
+
+ karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
+ karg.dev_major = MAJOR(inode->i_sb->s_dev);
+ karg.dev_minor = MINOR(inode->i_sb->s_dev);
+ karg.inode = inode->i_ino;
+ } else {
+ karg.vma_offset = 0;
+ karg.dev_major = 0;
+ karg.dev_minor = 0;
+ karg.inode = 0;
+ }
+
+ karg.vma_flags = 0;
+ if (vma->vm_flags & VM_READ)
+ karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
+ if (vma->vm_flags & VM_WRITE)
+ karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
+ if (vma->vm_flags & VM_EXEC)
+ karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
+ if (vma->vm_flags & VM_MAYSHARE)
+ karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
+
+ if (karg.build_id_size) {
+ __u32 build_id_sz = BUILD_ID_SIZE_MAX;
+
+ err = build_id_parse(vma, build_id_buf, &build_id_sz);
+ if (!err) {
+ if (karg.build_id_size < build_id_sz) {
+ err = -ENAMETOOLONG;
+ goto out;
+ }
+ karg.build_id_size = build_id_sz;
+ }
+ }
+
+ if (karg.vma_name_size) {
+ size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
+ const struct path *path;
+ const char *name_fmt;
+ size_t name_sz = 0;
+
+ get_vma_name(vma, &path, &name, &name_fmt);
+
+ if (path || name_fmt || name) {
+ name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
+ if (!name_buf) {
+ err = -ENOMEM;
+ goto out;
+ }
+ }
+ if (path) {
+ name = d_path(path, name_buf, name_buf_sz);
+ if (IS_ERR(name)) {
+ err = PTR_ERR(name);
+ goto out;
+ }
+ name_sz = name_buf + name_buf_sz - name;
+ } else if (name || name_fmt) {
+ name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
+ name = name_buf;
+ }
+ if (name_sz > name_buf_sz) {
+ err = -ENAMETOOLONG;
+ goto out;
+ }
+ karg.vma_name_size = name_sz;
+ }
+
+ /* unlock and put mm_struct before copying data to user */
+ mmap_read_unlock(mm);
+ mmput(mm);
+
+ if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
+ name, karg.vma_name_size)) {
+ kfree(name_buf);
+ return -EFAULT;
+ }
+ kfree(name_buf);
+
+ if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
+ build_id_buf, karg.build_id_size))
+ return -EFAULT;
+
+ if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
+ return -EFAULT;
+
+ return 0;
+
+out:
+ mmap_read_unlock(mm);
+ mmput(mm);
+ kfree(name_buf);
+ return err;
+}
+
+static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct seq_file *seq = file->private_data;
+ struct proc_maps_private *priv = seq->private;
+
+ switch (cmd) {
+ case PROCFS_PROCMAP_QUERY:
+ return do_procmap_query(priv, (void __user *)arg);
+ default:
+ return -ENOIOCTLCMD;
+ }
+}
+
const struct file_operations proc_pid_maps_operations = {
.open = pid_maps_open,
.read = seq_read,
.llseek = seq_lseek,
.release = proc_map_release,
+ .unlocked_ioctl = procfs_procmap_ioctl,
+ .compat_ioctl = procfs_procmap_ioctl,
};

/*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..fe8924a8d916 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -393,4 +393,36 @@ struct pm_scan_arg {
__u64 return_mask;
};

+/* /proc/<pid>/maps ioctl */
+#define PROCFS_IOCTL_MAGIC 0x9f
+#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
+
+enum procmap_query_flags {
+ PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
+};
+
+enum procmap_vma_flags {
+ PROCFS_PROCMAP_VMA_READABLE = 0x01,
+ PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
+ PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
+ PROCFS_PROCMAP_VMA_SHARED = 0x08,
+};
+
+struct procfs_procmap_query {
+ __u64 size;
+ __u64 query_flags; /* in */
+ __u64 query_addr; /* in */
+ __u64 vma_start; /* out */
+ __u64 vma_end; /* out */
+ __u64 vma_flags; /* out */
+ __u64 vma_offset; /* out */
+ __u64 inode; /* out */
+ __u32 dev_major; /* out */
+ __u32 dev_minor; /* out */
+ __u32 vma_name_size; /* in/out */
+ __u32 build_id_size; /* in/out */
+ __u64 vma_name_addr; /* in */
+ __u64 build_id_addr; /* in */
+};
+
#endif /* _UAPI_LINUX_FS_H */
--
2.43.0


2024-05-04 00:31:21

by Andrii Nakryiko

Subject: [PATCH 3/5] tools: sync uapi/linux/fs.h header into tools subdir

Keep them in sync for use from BPF selftests.

Signed-off-by: Andrii Nakryiko <[email protected]>
---
.../perf/trace/beauty/include/uapi/linux/fs.h | 32 +++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/tools/perf/trace/beauty/include/uapi/linux/fs.h b/tools/perf/trace/beauty/include/uapi/linux/fs.h
index 45e4e64fd664..fe8924a8d916 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/fs.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/fs.h
@@ -393,4 +393,36 @@ struct pm_scan_arg {
__u64 return_mask;
};

+/* /proc/<pid>/maps ioctl */
+#define PROCFS_IOCTL_MAGIC 0x9f
+#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
+
+enum procmap_query_flags {
+ PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
+};
+
+enum procmap_vma_flags {
+ PROCFS_PROCMAP_VMA_READABLE = 0x01,
+ PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
+ PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
+ PROCFS_PROCMAP_VMA_SHARED = 0x08,
+};
+
+struct procfs_procmap_query {
+ __u64 size;
+ __u64 query_flags; /* in */
+ __u64 query_addr; /* in */
+ __u64 vma_start; /* out */
+ __u64 vma_end; /* out */
+ __u64 vma_flags; /* out */
+ __u64 vma_offset; /* out */
+ __u64 inode; /* out */
+ __u32 dev_major; /* out */
+ __u32 dev_minor; /* out */
+ __u32 vma_name_size; /* in/out */
+ __u32 build_id_size; /* in/out */
+ __u64 vma_name_addr; /* in */
+ __u64 build_id_addr; /* in */
+};
+
#endif /* _UAPI_LINUX_FS_H */
--
2.43.0


2024-05-04 00:31:45

by Andrii Nakryiko

Subject: [PATCH 4/5] selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available

Instead of parsing the text-based /proc/<pid>/maps file, try to use the
PROCFS_PROCMAP_QUERY ioctl() to simplify and speed up data fetching.
This logic is used for uprobe file offset calculation, so any bugs in it
would manifest as failing uprobe BPF selftests.

This also serves as a simple demonstration of one of the intended uses.

Signed-off-by: Andrii Nakryiko <[email protected]>
---
tools/testing/selftests/bpf/test_progs.c | 3 +
tools/testing/selftests/bpf/test_progs.h | 2 +
tools/testing/selftests/bpf/trace_helpers.c | 105 +++++++++++++++++---
3 files changed, 95 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 89ff704e9dad..6a19970f2531 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -19,6 +19,8 @@
#include <bpf/btf.h>
#include "json_writer.h"

+int env_verbosity = 0;
+
static bool verbose(void)
{
return env.verbosity > VERBOSE_NONE;
@@ -848,6 +850,7 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
return -EINVAL;
}
}
+ env_verbosity = env->verbosity;

if (verbose()) {
if (setenv("SELFTESTS_VERBOSE", "1", 1) == -1) {
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index 0ba5a20b19ba..6eae7fdab0d7 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -95,6 +95,8 @@ struct test_state {
FILE *stdout;
};

+extern int env_verbosity;
+
struct test_env {
struct test_selector test_selector;
struct test_selector subtest_selector;
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 70e29f316fe7..8ac71e73d173 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -10,6 +10,8 @@
#include <pthread.h>
#include <unistd.h>
#include <linux/perf_event.h>
+#include <linux/fs.h>
+#include <sys/ioctl.h>
#include <sys/mman.h>
#include "trace_helpers.h"
#include <linux/limits.h>
@@ -233,29 +235,92 @@ int kallsyms_find(const char *sym, unsigned long long *addr)
return err;
}

+#ifdef PROCFS_PROCMAP_QUERY
+int env_verbosity __weak = 0;
+
+int procmap_query(int fd, const void *addr, size_t *start, size_t *offset, int *flags)
+{
+ char path_buf[PATH_MAX], build_id_buf[20];
+ struct procfs_procmap_query q;
+ int err;
+
+ memset(&q, 0, sizeof(q));
+ q.size = sizeof(q);
+ q.query_addr = (__u64)addr;
+ q.vma_name_addr = (__u64)path_buf;
+ q.vma_name_size = sizeof(path_buf);
+ q.build_id_addr = (__u64)build_id_buf;
+ q.build_id_size = sizeof(build_id_buf);
+
+ err = ioctl(fd, PROCFS_PROCMAP_QUERY, &q);
+ if (err < 0) {
+ err = -errno;
+ if (err == -ENOTTY)
+ return -EOPNOTSUPP; /* ioctl() not implemented yet */
+ if (err == -ENOENT)
+ return -ESRCH; /* vma not found */
+ return err;
+ }
+
+ if (env_verbosity >= 1) {
+ printf("VMA FOUND (addr %08lx): %08lx-%08lx %c%c%c%c %08lx %02x:%02x %ld %s (build ID: %s, %d bytes)\n",
+ (long)addr, (long)q.vma_start, (long)q.vma_end,
+ (q.vma_flags & PROCFS_PROCMAP_VMA_READABLE) ? 'r' : '-',
+ (q.vma_flags & PROCFS_PROCMAP_VMA_WRITABLE) ? 'w' : '-',
+ (q.vma_flags & PROCFS_PROCMAP_VMA_EXECUTABLE) ? 'x' : '-',
+ (q.vma_flags & PROCFS_PROCMAP_VMA_SHARED) ? 's' : 'p',
+ (long)q.vma_offset, q.dev_major, q.dev_minor, (long)q.inode,
+ q.vma_name_size ? path_buf : "",
+ q.build_id_size ? "YES" : "NO",
+ q.build_id_size);
+ }
+
+ *start = q.vma_start;
+ *offset = q.vma_offset;
+ *flags = q.vma_flags;
+ return 0;
+}
+#else
+int procmap_query(int fd, const void *addr, size_t *start, size_t *offset, int *flags)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
ssize_t get_uprobe_offset(const void *addr)
{
- size_t start, end, base;
- char buf[256];
- bool found = false;
+ size_t start, base, end;
FILE *f;
+ char buf[256];
+ int err, flags;

f = fopen("/proc/self/maps", "r");
if (!f)
return -errno;

- while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
- if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) {
- found = true;
- break;
+ err = procmap_query(fileno(f), addr, &start, &base, &flags);
+ if (err == 0) {
+ if (!(flags & PROCFS_PROCMAP_VMA_EXECUTABLE)) {
+ fclose(f); /* don't leak the maps FD on the error path */
+ return -ESRCH;
+ }
+ } else if (err != -EOPNOTSUPP) {
+ fclose(f);
+ return err;
+ } else if (err) {
+ bool found = false;
+
+ while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
+ if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) {
+ found = true;
+ break;
+ }
+ }
+ if (!found) {
+ fclose(f);
+ return -ESRCH;
}
}
-
fclose(f);

- if (!found)
- return -ESRCH;
-
#if defined(__powerpc64__) && defined(_CALL_ELF) && _CALL_ELF == 2

#define OP_RT_RA_MASK 0xffff0000UL
@@ -296,15 +361,25 @@ ssize_t get_rel_offset(uintptr_t addr)
size_t start, end, offset;
char buf[256];
FILE *f;
+ int err, flags;

f = fopen("/proc/self/maps", "r");
if (!f)
return -errno;

- while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &offset) == 4) {
- if (addr >= start && addr < end) {
- fclose(f);
- return (size_t)addr - start + offset;
+ err = procmap_query(fileno(f), (const void *)addr, &start, &offset, &flags);
+ if (err == 0) {
+ fclose(f);
+ return (size_t)addr - start + offset;
+ } else if (err != -EOPNOTSUPP) {
+ fclose(f);
+ return err;
+ } else if (err) {
+ while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &offset) == 4) {
+ if (addr >= start && addr < end) {
+ fclose(f);
+ return (size_t)addr - start + offset;
+ }
}
}

--
2.43.0


2024-05-04 00:33:17

by Andrii Nakryiko

Subject: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

Implement a simple tool/benchmark for comparing address "resolution"
logic based on textual /proc/<pid>/maps interface and new binary
ioctl-based PROCFS_PROCMAP_QUERY command.

The tool expects a file with a list of hex addresses and the relevant
PID, and then provides control over whether the textual or the binary
ioctl-based way of processing VMAs should be used.
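
For example, a hypothetical invocation that benchmarks the ioctl-based
mode over 10000 runs (the PID and address file are made up; option
letters are as defined in the tool's argp table) could look like:

	./procfs_query -p 1234 -f addrs.txt -Q -B 10000 -q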

The overall logic implements an efficient way to do batched processing
of a given set of (unsorted) addresses. We first sort them in increasing
order (remembering their original positions so the original order can be
restored, if necessary), and then process all VMAs from /proc/<pid>/maps,
matching addresses to VMAs and calculating file offsets for the matches.
For the ioctl-based approach the idea is similar, but it is implemented
even more efficiently, requesting only the VMAs that cover the given
addresses and skipping all the irrelevant VMAs altogether.
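
In simplified form (a sketch only, not the exact code from the patch;
read_next_vma() stands in for the actual text or ioctl fetching), the
matching is a linear merge over two address-sorted sequences:

	/* addrs[] is sorted by addr; VMAs arrive sorted by vma_start */
	while (read_next_vma(&vma_start, &vma_end, &vma_off)) {
		/* addrs before this VMA stay unresolved */
		while (i < addr_cnt && addrs[i].addr < vma_start)
			i++;
		/* addrs inside this VMA resolve to file offsets */
		while (i < addr_cnt && addrs[i].addr < vma_end) {
			resolved[addrs[i].idx].file_off =
				addrs[i].addr - vma_start + vma_off;
			i++;
		}
	}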

To be able to compare the efficiency of both APIs, the tool has a
"benchmark" mode. The user provides the number of processing runs to
execute in a tight loop, timing only the /proc/<pid>/maps parsing and
processing parts of the logic. Address sorting and re-sorting are
excluded. This gives a more direct way to compare the ioctl- vs
text-based APIs.

We used a medium-sized production application to do a representative
benchmark. A bunch of stack traces were captured, resulting in 4435
user-space addresses (699 unique ones, but we didn't deduplicate them).
The application itself had 702 VMAs reported in /proc/<pid>/maps.

Averaging the time taken to process all addresses over 10000 runs showed
that:
- the text-based approach took 380 microseconds *per one batch run*;
- the ioctl-based approach took 10 microseconds *per identical batch run*.

This gives a ~35x speedup for doing exactly the same amount of work
(build IDs were not fetched for the ioctl-based benchmark; fetching build
IDs resulted in a 2x slowdown compared to the no-build-ID case).

I also did an strace run of both cases. In the text-based one the tool
did 68 read() syscalls, fetching up to 4KB of data in one go. In
comparison, the ioctl-based implementation had to do only 6 ioctl()
calls to fetch all relevant VMAs.

It is projected that the savings from processing big production
applications would only widen the gap in favor of the binary ioctl-based
querying API, as bigger applications tend to have even more
non-executable VMA mappings relative to executable ones.

Signed-off-by: Andrii Nakryiko <[email protected]>
---
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/procfs_query.c | 366 +++++++++++++++++++++
3 files changed, 368 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/procfs_query.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index f1aebabfb017..7eaa8f417278 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -45,6 +45,7 @@ test_cpp
/veristat
/sign-file
/uprobe_multi
+/procfs_query
*.ko
*.tmp
xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index ba28d42b74db..07e17bb89767 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -131,7 +131,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
- xdp_features bpf_test_no_cfi.ko
+ xdp_features bpf_test_no_cfi.ko procfs_query

TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi

diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c
new file mode 100644
index 000000000000..8ca3978244ad
--- /dev/null
+++ b/tools/testing/selftests/bpf/procfs_query.c
@@ -0,0 +1,366 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#include <argp.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <time.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+
+static bool verbose;
+static bool quiet;
+static bool use_ioctl;
+static bool request_build_id;
+static char *addrs_path;
+static int pid;
+static int bench_runs;
+
+const char *argp_program_version = "procfs_query 0.0";
+const char *argp_program_bug_address = "<[email protected]>";
+
+static inline uint64_t get_time_ns(void)
+{
+ struct timespec t;
+
+ clock_gettime(CLOCK_MONOTONIC, &t);
+
+ return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static const struct argp_option opts[] = {
+ { "verbose", 'v', NULL, 0, "Verbose mode" },
+ { "quiet", 'q', NULL, 0, "Quiet mode (no output)" },
+ { "pid", 'p', "PID", 0, "PID of the process" },
+ { "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" },
+ { "benchmark", 'B', "RUNS", 0, "Benchmark mode" },
+ { "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" },
+ { "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" },
+ {},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+ switch (key) {
+ case 'v':
+ verbose = true;
+ break;
+ case 'q':
+ quiet = true;
+ break;
+ case 'Q': /* must match the 'Q' key of the "query" argp option above */
+ use_ioctl = true;
+ break;
+ case 'b':
+ request_build_id = true;
+ break;
+ case 'p':
+ pid = strtol(arg, NULL, 10);
+ break;
+ case 'f':
+ addrs_path = strdup(arg);
+ break;
+ case 'B':
+ bench_runs = strtol(arg, NULL, 10);
+ if (bench_runs <= 0) {
+ fprintf(stderr, "Invalid benchmark run count: %s\n", arg);
+ return -EINVAL;
+ }
+ break;
+ case ARGP_KEY_ARG:
+ argp_usage(state);
+ break;
+ default:
+ return ARGP_ERR_UNKNOWN;
+ }
+ return 0;
+}
+
+static const struct argp argp = {
+ .options = opts,
+ .parser = parse_arg,
+};
+
+struct addr {
+ unsigned long long addr;
+ int idx;
+};
+
+static struct addr *addrs;
+static size_t addr_cnt, addr_cap;
+
+struct resolved_addr {
+ unsigned long long file_off;
+ const char *vma_name;
+ int build_id_sz;
+ char build_id[20];
+};
+
+static struct resolved_addr *resolved;
+
+static int resolve_addrs_ioctl(void)
+{
+ char buf[32], build_id_buf[20], vma_name[PATH_MAX];
+ struct procfs_procmap_query q;
+ int fd, err, i;
+ struct addr *a = &addrs[0];
+ struct resolved_addr *r;
+
+ snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+ fd = open(buf, O_RDONLY);
+ if (fd < 0) {
+ err = -errno;
+ fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+ return err;
+ }
+
+ memset(&q, 0, sizeof(q));
+ q.size = sizeof(q);
+ q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
+ q.vma_name_addr = (__u64)vma_name;
+ if (request_build_id)
+ q.build_id_addr = (__u64)build_id_buf;
+
+ for (i = 0; i < addr_cnt; ) {
+ char *name = NULL;
+
+ q.query_addr = (__u64)a->addr;
+ q.vma_name_size = sizeof(vma_name);
+ if (request_build_id)
+ q.build_id_size = sizeof(build_id_buf);
+
+ err = ioctl(fd, PROCFS_PROCMAP_QUERY, &q);
+ if (err < 0 && errno == ENOTTY) {
+ close(fd);
+ fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() command is not supported on this kernel!\n");
+ return -EOPNOTSUPP; /* ioctl() not implemented yet */
+ }
+ if (err < 0 && errno == ENOENT) {
+ fprintf(stderr, "ENOENT\n");
+ i++;
+ a++;
+ continue; /* unresolved address */
+ }
+ if (err < 0) {
+ err = -errno;
+ close(fd);
+ fprintf(stderr, "PROCFS_PROCMAP_QUERY ioctl() returned error: %d\n", err);
+ return err;
+ }
+
+ /* skip addrs falling before current VMA */
+ for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) {
+ }
+ /* process addrs covered by current VMA */
+ for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) {
+ r = &resolved[a->idx];
+ r->file_off = a->addr - q.vma_start + q.vma_offset;
+
+ /* reuse name, if it was already strdup()'ed */
+ if (q.vma_name_size)
+ name = name ?: strdup(vma_name);
+ r->vma_name = name;
+
+ if (q.build_id_size) {
+ r->build_id_sz = q.build_id_size;
+ memcpy(r->build_id, build_id_buf, q.build_id_size);
+ }
+ }
+ }
+
+ close(fd);
+ return 0;
+}
+
+static int resolve_addrs_parse(void)
+{
+ size_t vma_start, vma_end, vma_offset, ino;
+ uint32_t dev_major, dev_minor;
+ char perms[4], buf[32], vma_name[PATH_MAX];
+ FILE *f;
+ int err, idx = 0;
+ struct addr *a = &addrs[idx];
+ struct resolved_addr *r;
+
+ snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+ f = fopen(buf, "r");
+ if (!f) {
+ err = -errno;
+ fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+ return err;
+ }
+
+ while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n",
+ &vma_start, &vma_end,
+ &perms[0], &perms[1], &perms[2], &perms[3],
+ &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) {
+ const char *name = NULL;
+
+ /* skip addrs before current vma, they stay unresolved */
+ for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) {
+ }
+
+ /* resolve all addrs within current vma now */
+ for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) {
+ r = &resolved[a->idx];
+ r->file_off = a->addr - vma_start + vma_offset;
+
+ /* reuse name, if it was already strdup()'ed */
+ if (err > 10)
+ name = name ?: strdup(vma_name);
+ else
+ name = NULL;
+ r->vma_name = name;
+ }
+
+ /* ran out of addrs to resolve, stop early */
+ if (idx >= addr_cnt)
+ break;
+ }
+
+ fclose(f);
+ return 0;
+}
+
+static int cmp_by_addr(const void *a, const void *b)
+{
+ const struct addr *x = a, *y = b;
+
+ if (x->addr != y->addr)
+ return x->addr < y->addr ? -1 : 1;
+ return x->idx < y->idx ? -1 : 1;
+}
+
+static int cmp_by_idx(const void *a, const void *b)
+{
+ const struct addr *x = a, *y = b;
+
+ return x->idx < y->idx ? -1 : 1;
+}
+
+int main(int argc, char **argv)
+{
+ FILE* f;
+ int err, i;
+ unsigned long long addr;
+ uint64_t start_ns;
+ double total_ns;
+
+ /* Parse command line arguments */
+ err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+ if (err)
+ return err;
+
+ if (pid <= 0 || !addrs_path) {
+ fprintf(stderr, "Please provide PID and file with addresses to process!\n");
+ exit(1);
+ }
+
+ if (verbose) {
+ fprintf(stderr, "PID: %d\n", pid);
+ fprintf(stderr, "PATH: %s\n", addrs_path);
+ }
+
+ f = fopen(addrs_path, "r");
+ if (!f) {
+ err = -errno;
+ fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err);
+ goto out;
+ }
+
+ while ((err = fscanf(f, "%llx\n", &addr)) == 1) {
+ if (addr_cnt == addr_cap) {
+ addr_cap = addr_cap == 0 ? 16 : (addr_cap * 3 / 2);
+ addrs = realloc(addrs, sizeof(*addrs) * addr_cap);
+ memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs));
+ }
+
+ addrs[addr_cnt].addr = addr;
+ addrs[addr_cnt].idx = addr_cnt;
+
+ addr_cnt++;
+ }
+ if (verbose)
+ fprintf(stderr, "READ %zu addrs!\n", addr_cnt);
+ if (!feof(f)) {
+ fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path);
+ err = -EINVAL;
+ fclose(f);
+ goto out;
+ }
+ fclose(f);
+ if (addr_cnt == 0) {
+ fprintf(stderr, "No addresses provided, bailing out!\n");
+ err = -ENOENT;
+ goto out;
+ }
+
+ resolved = calloc(addr_cnt, sizeof(*resolved));
+
+ qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr);
+ if (verbose) {
+ fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt);
+ for (i = 0; i < addr_cnt; i++) {
+ fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr);
+ }
+ }
+
+ start_ns = get_time_ns();
+ for (i = bench_runs ?: 1; i > 0; i--) {
+ if (use_ioctl) {
+ err = resolve_addrs_ioctl();
+ } else {
+ err = resolve_addrs_parse();
+ }
+ if (err) {
+ fprintf(stderr, "Failed to resolve addrs: %d!\n", err);
+ goto out;
+ }
+ }
+ total_ns = get_time_ns() - start_ns;
+
+ if (bench_runs) {
+ fprintf(stderr, "BENCHMARK MODE. RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n",
+ bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0,
+ total_ns / bench_runs / addr_cnt / 1000.0);
+ }
+
+ /* sort them back into the original order */
+ qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx);
+
+ if (!quiet) {
+ printf("RESOLVED ADDRS (%zu):\n", addr_cnt);
+ for (i = 0; i < addr_cnt; i++) {
+ const struct addr *a = &addrs[i];
+ const struct resolved_addr *r = &resolved[a->idx];
+
+ if (r->file_off) {
+ printf("RESOLVED #%d: %#llx -> OFF %#llx",
+ a->idx, a->addr, r->file_off);
+ if (r->vma_name)
+ printf(" NAME %s", r->vma_name);
+ if (r->build_id_sz) {
+ char build_id_str[41];
+ int j;
+
+ for (j = 0; j < r->build_id_sz; j++)
+ sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]);
+ printf(" BUILDID %s", build_id_str);
+ }
+ printf("\n");
+ } else {
+ printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr);
+ }
+ }
+ }
+out:
+ free(addrs);
+ free(addrs_path);
+ free(resolved);
+
+ return err < 0 ? -err : 0;
+}
--
2.43.0


2024-05-04 11:24:40

by Christian Brauner

Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than through textual
> processing of /proc/<pid>/maps contents. See patch #2 for the context,
> justification, and nuances of the API design.
>
> Patch #1 is a refactoring to keep VMA name determination logic in one place.
> Patch #2 is the meat of kernel-side API.
> Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> optionally use this new ioctl()-based API, if supported.
> Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> both textual and binary interfaces) and allows benchmarking them. Patch itself
> also has performance numbers of a test based on one of the medium-sized
> internal applications taken from production.

I don't have anything against adding a binary interface for this. But
it's somewhat odd to do ioctls based on /proc files. I wonder if there
isn't a more suitable place for this. prctl()? New vmstat() system call
using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

2024-05-04 15:28:57

by Greg Kroah-Hartman

Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers. They would
> normally capture stack trace containing absolute memory addresses of
> some functions, and would then use the /proc/<pid>/maps file to find
> corresponding backing ELF files, file offsets within them, and then
> continue from there to get yet more information (ELF symbols, DWARF
> information) to get human-readable symbolic information.
>
> As such, there are both performance and correctness requirements
> involved. This address to VMA information translation has to be done as
> efficiently as possible, but also not miss any VMA (especially in the
> case of loading/unloading shared libraries).
>
> Unfortunately, for all the /proc/<pid>/maps file universality and
> usefulness, it doesn't fit the above 100%.

Is this a new change or has it always been this way?

> First, it's text based, which makes its programmatic use from
> applications and libraries unnecessarily cumbersome and slow due to the
> need to do text parsing to get necessary pieces of information.

slow in what way? How has it never been noticed before as a problem?

And exact numbers are appreciated please, yes open/read/close seems
slower than open/ioctl/close, but is it really overall an issue in the
real world for anything?

Text apis are good as everyone can handle them, ioctls are harder for
obvious reasons.

> Second, its main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall only into a small subset of all
> process' VMAs, mainly containing executable text. Yet, library would
> need to parse most or all of the contents to find needed VMAs, as there
> is no way to skip VMAs that are of no use. Efficient library can do the
> linear pass and it is still relatively efficient, but it's definitely an
> overhead that can be avoided, if there was a way to do more targeted
> querying of the relevant VMA information.

I don't understand, is this a bug in the current files? If so, why not
just fix that up?

And again "efficient" need to be quantified.

> Another problem when writing a generic stack trace symbolization library
> is an unfortunate performance-vs-correctness tradeoff that needs to be
> made.

What requirement has caused a "generic stack trace symbolization
library" to be needed at all? What is the problem you are trying to
solve that is not already solved by existing tools?

> The library has to decide whether to cache parsed contents of
> /proc/<pid>/maps to serve future requests (if the application requests
> symbolization of another set of addresses, captured at some later time,
> which is typical for periodic/continuous profiling cases) or to re-open
> and re-parse the file on each request. In the former case, more memory
> is used for the cache and there is a risk of getting stale data if the
> application loaded/unloaded shared libraries, or otherwise changed its
> set of VMAs through additional mmap() calls (and other means of
> altering the memory address space). In the latter case, it's the
> performance hit that comes from re-opening the file and
> re-reading/re-parsing its contents all over again.

Again, "performance hit" needs to be justified, it shouldn't be much
overall.

> This patch aims to solve this problem by providing a new API built on
> top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> interface, avoiding the cost and awkwardness of textual representation
> for programmatic use.

Some people find text easier to handle for programmatic use :)

> It's designed to be extensible and
> forward/backward compatible by including user-specified field size and
> using the copy_struct_from_user() approach. But, most importantly, it
> allows point queries for a specific single address, specified by the user. And
> this is done efficiently using VMA iterator.

Ok, maybe this is the main issue, you only want one at a time?

> User has a choice to pick either getting VMA that covers provided
> address or -ENOENT if none is found (exact, least surprising, case). Or,
> with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> get either VMA that covers the address (if there is one), or the closest
> next VMA (i.e., the VMA with the smallest vm_start > addr). The latter allows
> more efficient use, but, given it could be a surprising behavior,
> requires an explicit opt-in.
>
> Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> sense given it's querying the same set of VMA data. All the permissions
> checks performed on /proc/<pid>/maps opening fit here as well.
> ioctl-based implementation is fetching remembered mm_struct reference,
> but otherwise doesn't interfere with seq_file-based implementation of
> /proc/<pid>/maps textual interface, and so could be used together or
> independently without paying any price for that.
>
> There is one extra thing that /proc/<pid>/maps doesn't currently
> provide, and that's an ability to fetch ELF build ID, if present. User
> has control over whether this piece of information is requested or not
> by either setting build_id_size field to zero or non-zero maximum buffer
> size they provided through build_id_addr field (which encodes user
> pointer as __u64 field).
>
> The need to get ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and
> /proc/<pid>/maps textual representation doesn't help with this,
> requiring applications to open underlying ELF binary through
> /proc/<pid>/map_files/<start>-<end> symlink, which adds extra
> permission implications due to giving full access to the binary of a
> (potentially) different process, while all the application is interested in is
> build ID. Giving an ability to request just build ID doesn't introduce
> any additional security concerns, on top of what /proc/<pid>/maps is
> already concerned with, simplifying the overall logic.
>
> Kernel already implements build ID fetching, which is used from BPF
> subsystem. We are reusing this code here, but plan follow-up changes
> to make it work better under more relaxed assumption (compared to what
> existing code assumes) of being called from user process context, in
> which page faults are allowed. BPF-specific implementation currently
> bails out if necessary part of ELF file is not paged in, all due to
> extra BPF-specific restrictions (like the need to fetch build ID in
> restrictive contexts such as NMI handler).
>
> Note also, that fetching VMA name (e.g., backing file path, or special
> hard-coded or user-provided names) is optional just like build ID. If
> user sets vma_name_size to zero, kernel code won't attempt to retrieve
> it, saving resources.
>
> Signed-off-by: Andrii Nakryiko <[email protected]>

Where is the userspace code that uses this new api you have created?

> ---
> fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fs.h | 32 ++++++++
> 2 files changed, 197 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8e503a1635b7..cb7b1ff1a144 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,6 +22,7 @@
> #include <linux/pkeys.h>
> #include <linux/minmax.h>
> #include <linux/overflow.h>
> +#include <linux/buildid.h>
>
> #include <asm/elf.h>
> #include <asm/tlb.h>
> @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> return do_maps_open(inode, file, &proc_pid_maps_op);
> }
>
> +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> +{
> + struct procfs_procmap_query karg;
> + struct vma_iterator iter;
> + struct vm_area_struct *vma;
> + struct mm_struct *mm;
> + const char *name = NULL;
> + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> + __u64 usize;
> + int err;
> +
> + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> + return -EFAULT;
> + if (usize > PAGE_SIZE)

Nice, where did you document that? And how is that portable given that
PAGE_SIZE can be different on different systems?

and why aren't you checking the actual structure size instead? You can
easily run off the end here without knowing it.

> + return -E2BIG;
> + if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> + return -EINVAL;

Ok, so you have two checks? How can the first one ever fail?


> + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> + if (err)
> + return err;
> +
> + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> + return -EINVAL;
> + if (!!karg.vma_name_size != !!karg.vma_name_addr)
> + return -EINVAL;
> + if (!!karg.build_id_size != !!karg.build_id_addr)
> + return -EINVAL;

So you want values to be set, right?

> +
> + mm = priv->mm;
> + if (!mm || !mmget_not_zero(mm))
> + return -ESRCH;

What is this error for? Where is this documented?

> + if (mmap_read_lock_killable(mm)) {
> + mmput(mm);
> + return -EINTR;
> + }
> +
> + vma_iter_init(&iter, mm, karg.query_addr);
> + vma = vma_next(&iter);
> + if (!vma) {
> + err = -ENOENT;
> + goto out;
> + }
> + /* user wants covering VMA, not the closest next one */
> + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> + vma->vm_start > karg.query_addr) {
> + err = -ENOENT;
> + goto out;
> + }
> +
> + karg.vma_start = vma->vm_start;
> + karg.vma_end = vma->vm_end;
> +
> + if (vma->vm_file) {
> + const struct inode *inode = file_user_inode(vma->vm_file);
> +
> + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> + karg.dev_major = MAJOR(inode->i_sb->s_dev);
> + karg.dev_minor = MINOR(inode->i_sb->s_dev);

So the major/minor is that of the file superblock? Why?

> + karg.inode = inode->i_ino;

What is userspace going to do with this?

> + } else {
> + karg.vma_offset = 0;
> + karg.dev_major = 0;
> + karg.dev_minor = 0;
> + karg.inode = 0;

Why not set everything to 0 up above at the beginning so you never miss
anything, and you don't miss any holes accidentally in the future.

> + }
> +
> + karg.vma_flags = 0;
> + if (vma->vm_flags & VM_READ)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> + if (vma->vm_flags & VM_WRITE)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> + if (vma->vm_flags & VM_EXEC)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> + if (vma->vm_flags & VM_MAYSHARE)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> +
> + if (karg.build_id_size) {
> + __u32 build_id_sz = BUILD_ID_SIZE_MAX;
> +
> + err = build_id_parse(vma, build_id_buf, &build_id_sz);
> + if (!err) {
> + if (karg.build_id_size < build_id_sz) {
> + err = -ENAMETOOLONG;
> + goto out;
> + }
> + karg.build_id_size = build_id_sz;
> + }
> + }
> +
> + if (karg.vma_name_size) {
> + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> + const struct path *path;
> + const char *name_fmt;
> + size_t name_sz = 0;
> +
> + get_vma_name(vma, &path, &name, &name_fmt);
> +
> + if (path || name_fmt || name) {
> + name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> + if (!name_buf) {
> + err = -ENOMEM;
> + goto out;
> + }
> + }
> + if (path) {
> + name = d_path(path, name_buf, name_buf_sz);
> + if (IS_ERR(name)) {
> + err = PTR_ERR(name);
> + goto out;
> + }
> + name_sz = name_buf + name_buf_sz - name;
> + } else if (name || name_fmt) {
> + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> + name = name_buf;
> + }
> + if (name_sz > name_buf_sz) {
> + err = -ENAMETOOLONG;
> + goto out;
> + }
> + karg.vma_name_size = name_sz;
> + }
> +
> + /* unlock and put mm_struct before copying data to user */
> + mmap_read_unlock(mm);
> + mmput(mm);
> +
> + if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
> + name, karg.vma_name_size)) {
> + kfree(name_buf);
> + return -EFAULT;
> + }
> + kfree(name_buf);
> +
> + if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
> + build_id_buf, karg.build_id_size))
> + return -EFAULT;
> +
> + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
> + return -EFAULT;
> +
> + return 0;
> +
> +out:
> + mmap_read_unlock(mm);
> + mmput(mm);
> + kfree(name_buf);
> + return err;
> +}
> +
> +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + struct seq_file *seq = file->private_data;
> + struct proc_maps_private *priv = seq->private;
> +
> + switch (cmd) {
> + case PROCFS_PROCMAP_QUERY:
> + return do_procmap_query(priv, (void __user *)arg);
> + default:
> + return -ENOIOCTLCMD;
> + }
> +}
> +
> const struct file_operations proc_pid_maps_operations = {
> .open = pid_maps_open,
> .read = seq_read,
> .llseek = seq_lseek,
> .release = proc_map_release,
> + .unlocked_ioctl = procfs_procmap_ioctl,
> + .compat_ioctl = procfs_procmap_ioctl,
> };
>
> /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 45e4e64fd664..fe8924a8d916 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -393,4 +393,36 @@ struct pm_scan_arg {
> __u64 return_mask;
> };
>
> +/* /proc/<pid>/maps ioctl */
> +#define PROCFS_IOCTL_MAGIC 0x9f

Don't you need to document this in the proper place?

> +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> +
> +enum procmap_query_flags {
> + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> +};
> +
> +enum procmap_vma_flags {
> + PROCFS_PROCMAP_VMA_READABLE = 0x01,
> + PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> + PROCFS_PROCMAP_VMA_SHARED = 0x08,

Are these bits? If so, please use the bit macro for it to make it
obvious.

> +};
> +
> +struct procfs_procmap_query {
> + __u64 size;
> + __u64 query_flags; /* in */

Does this map to the procmap_vma_flags enum? if so, please say so.

> + __u64 query_addr; /* in */
> + __u64 vma_start; /* out */
> + __u64 vma_end; /* out */
> + __u64 vma_flags; /* out */
> + __u64 vma_offset; /* out */
> + __u64 inode; /* out */

What is the inode for, you have an inode for the file already, why give
it another one?

> + __u32 dev_major; /* out */
> + __u32 dev_minor; /* out */

What is major/minor for?

> + __u32 vma_name_size; /* in/out */
> + __u32 build_id_size; /* in/out */
> + __u64 vma_name_addr; /* in */
> + __u64 build_id_addr; /* in */

Why not document this all using kerneldoc above the structure?

anyway, I don't like ioctls, but there is a place for them, you just
have to actually justify the use for them and not say "not efficient
enough" as that normally isn't an issue overall.

thanks,

greg k-h

2024-05-04 15:30:05

by Greg Kroah-Hartman

Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> Implement a simple tool/benchmark for comparing address "resolution"
> logic based on textual /proc/<pid>/maps interface and new binary
> ioctl-based PROCFS_PROCMAP_QUERY command.

Of course an artificial benchmark of "read a whole file" vs. "a tiny
ioctl" is going to be different, but step back and show how this is
going to be used in the real world overall. Pounding on this file is
not a normal operation, right?

thanks,

greg k-h

2024-05-04 15:33:11

by Greg Kroah-Hartman

Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> I also did an strace run of both cases. In text-based one the tool did
> 68 read() syscalls, fetching up to 4KB of data in one go.

Why not fetch more at once?

And I have a fun 'readfile()' syscall implementation around here that
needs justification to get merged (I try so every other year or so) that
can do the open/read/close loop in one call, with the buffer size set by
userspace if you really are saying this is a "hot path" that needs that
kind of speedup. But in the end, io_uring usually is the proper api for
that instead, why not use that here instead of slow open/read/close if
you care about speed?

> In comparison,
> ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> relevant VMAs.
>
> It is projected that the savings from processing big production
> applications would only widen the gap in favor of the binary ioctl-based querying API, as
> bigger applications will tend to have even more non-executable VMA
> mappings relative to executable ones.

Define "bigger applications" please. Is this some "large database
company workload" type of thing, or something else?

thanks,

greg k-h

2024-05-04 15:34:29

by Greg Kroah-Hartman

Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

On Sat, May 04, 2024 at 01:24:23PM +0200, Christian Brauner wrote:
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
>
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

See my objection to the ioctl api in the patch review itself.

Also, as this is a new user/kernel api, it needs loads of documentation
(there was none), and probably also cc: linux-api, right?

thanks,

greg k-h

2024-05-04 18:37:52

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

Hi, Greg.

We've discussed this earlier.

Breaking news: /proc is slow, /sys too. Always have been.

Each /sys file is kind of fast, but there are so many files that
lookups eat all the runtime.

/proc files are bigger and thus slower. There is no way to filter
information.

If someone posted /proc today and said it is 20-50-100 times
slower (which is true) than existing interfaces, linux-kernel would
not even laugh at him/her.

> slow in what way?

open/read/close is slow compared to an equivalent interface not involving
file descriptors and textual processing.

> Text apis are good as everyone can handle them,

Text APIs provoke inefficient software:

Any noob can write

for name in name_list:
    with open(f'/sys/kernel/slab/{name}/order') as f:
        slab_order = int(f.read().split()[0])

See the problem? It's inefficient.
No open("/sys", O_DIRECTORY|O_PATH);
No openat(sys_fd, "kernel/slab", O_DIRECTORY|O_PATH);
No openat(sys_kernel_slab, buf, O_RDONLY);

buf is allocated dynamically many times probably, it's Python after all.
buf is longer than necessary. pathname buf won't be reused for result.

split() conses a list, only to discard everything but the first element.
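
(For contrast, a sketch of the descriptor-reuse version being alluded to
above -- illustrative only, error handling omitted, with names/n_names
assumed as inputs; needs <fcntl.h>, <stdio.h>, <stdlib.h>, <unistd.h>:)

	/* Resolve the /sys path components once, then reuse the fds. */
	int sys_fd = open("/sys", O_DIRECTORY | O_PATH);
	int slab_fd = openat(sys_fd, "kernel/slab", O_DIRECTORY | O_PATH);
	char buf[64];	/* one small reusable buffer */

	for (size_t i = 0; i < n_names; i++) {
		snprintf(buf, sizeof(buf), "%s/order", names[i]);
		int fd = openat(slab_fd, buf, O_RDONLY);
		ssize_t n = read(fd, buf, sizeof(buf) - 1); /* buf reused for result */

		close(fd);
		if (n > 0) {
			buf[n] = '\0';
			int order = atoi(buf);
			/* ... use order ... */
		}
	}
	close(slab_fd);
	close(sys_fd);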

Internally, sysfs allocates 1 page, instead of putting 1 byte somewhere
in userspace memory. /proc too.

Lookup is done every time (I don't think sysfs caches dentries in dcache
but I may be mistaken, so lookup is even slower).

Multiply by many times monitoring daemons run this (potentially disturbing
other tasks).

> ioctls are harder for obvious reasons.

What? ioctl are hard now?

Text APIs are garbage. If it's some crap in debugfs then no one cares.
But /proc/*/maps is not in debugfs.

Specifically on /proc/*/maps:

* _very_ well-written software knows that unescaping needs to be done on the pathname

* (deleted) and (unreachable) junk: readlink and /proc/*/maps don't have
space for flags signalling unambiguous deleted/unreachable status that
doesn't eat into the pathname -- whoops


> I don't understand, is this a bug in the current files? If so, why not
> just fix that up?

open/read DO NOT accept file-specific flags, they are dumb like that.

In theory /proc/*/maps _could_ accept

pread(fd, buf, sizeof(buf), addr);

and return data for the VMA containing "addr", but it can't, because "addr"
is an offset into the textual file. Such an offset is not interesting at all.

> And again "efficient" need to be quantified.

* roll eyes *

> Some people find text easier to handle for programmatic use :)

Some people should be barred from writing software by Programming Supreme Court
or something like that.

2024-05-04 21:51:05

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
>
> On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > /proc/<pid>/maps file is extremely useful in practice for various tasks
> > involving figuring out process memory layout, what files are backing any
> > given memory range, etc. One important class of applications that
> > absolutely rely on this are profilers/stack symbolizers. They would
> > normally capture stack trace containing absolute memory addresses of
> > some functions, and would then use /proc/<pid>/maps file to find
> > corresponding backing ELF files, file offsets within them, and then
> > continue from there to get yet more information (ELF symbols, DWARF
> > information) to get human-readable symbolic information.
> >
> > As such, there are both performance and correctness requirements
> > involved. This address-to-VMA information translation has to be done as
> > efficiently as possible, but also not miss any VMA (especially in the
> > case of loading/unloading shared libraries).
> >
> > Unfortunately, for all of /proc/<pid>/maps' universality and
> > usefulness, it doesn't fit the above 100%.
>
> Is this a new change or has it always been this way?
>

Probably always has been this way. My first exposure to profiling and
stack symbolization was about 7 years ago, and already then
/proc/<pid>/maps was the only way to do this, and not a 100% fit even
then.

> > First, it's text based, which makes its programmatic use from
> > applications and libraries unnecessarily cumbersome and slow due to the
> > need to do text parsing to get necessary pieces of information.
>
> slow in what way? How has it never been noticed before as a problem?

It's just inherently slower to parse text to fish out a bunch of
integers (vma_start address, offset, inode+dev and file paths are
typical pieces needed to "normalize" captured stack trace addresses).
It's not too bad in terms of programming and performance for
scanf-like APIs, but without scanf, you are dealing with splitting by
whitespace and tons of unnecessary string allocations.
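
(For reference, the scanf-style approach looks roughly like this -- a
sketch assuming the usual /proc/<pid>/maps line layout of
"start-end perms offset dev inode path", with line being one fetched
text line:)

	unsigned long start, end, off, inode;
	unsigned int dev_major, dev_minor;
	char perms[5], path[4096];

	int n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %4095s",
		       &start, &end, perms, &off,
		       &dev_major, &dev_minor, &inode, path);
	/* n == 7: anonymous mapping, no path; n == 8: file-backed mapping.
	 * Note: %s stops at whitespace, so file names with spaces need the
	 * unescaping treatment mentioned elsewhere in this thread. */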

It was noticed; I think people using this for profiling/symbolization
are not necessarily well versed in kernel development, and they just
get by with what the kernel provides.

>
> And exact numbers are appreciated please, yes open/read/close seems
> slower than open/ioctl/close, but is it really overall an issue in the
> real world for anything?
>
> Text apis are good as everyone can handle them, ioctls are harder for
> obvious reasons.

Yes, and I acknowledged the usefulness of the text-based interface. But
it's my opinion (shared by other people I've talked with that had to
deal with these textual interfaces) that binary interfaces are far
superior when it comes to *programmatic* usage (i.e., from
C/C++/Rust/whatever languages directly). Textual is great for bash
scripts and human debugging, of course.

>
> > Second, its main purpose is to emit all VMAs sequentially, but in
> > practice captured addresses would fall only into a small subset of all
> > process' VMAs, mainly containing executable text. Yet, library would
> > need to parse most or all of the contents to find needed VMAs, as there
> > is no way to skip VMAs that are of no use. An efficient library can do the
> > linear pass and it is still relatively efficient, but it's definitely an
> > overhead that could be avoided if there were a way to do more targeted
> > querying of the relevant VMA information.
>
> I don't understand, is this a bug in the current files? If so, why not
> just fix that up?
>

It's not a bug; I think /proc/<pid>/maps was designed to describe the
*entire* address space, but for profiling and symbolization needs we
need to find only a small subset of relevant VMAs. There is nothing
wrong with existing implementation, it's just not a 100% fit for the
more specialized "let's find relevant VMAs for this set of addresses"
problem.

> And again "efficient" need to be quantified.

You probably saw patch #5 where I solve exactly the same problem in
two different ways. And the problem is typical for symbolization: you
are given a bunch of addresses within some process and need to find the
files they belong to and the file offsets they map to. This is
then used to, for example, match them to ELF symbols representing
functions.

>
> > Another problem when writing a generic stack trace symbolization library
> > is an unfortunate performance-vs-correctness tradeoff that needs to be
> > made.
>
> What requirement has caused a "generic stack trace symbolization
> library" to be needed at all? What is the problem you are trying to
> solve that is not already solved by existing tools?

Capturing stack traces is a very common operation, especially for BPF-based
tools and applications. E.g., bpftrace allows one to capture stack
traces for some "interesting events" (whatever that is, some kernel
function call, user function call, perf event, there is tons of
flexibility). Stack traces answer "how did we get here", but it's just
an array of addresses, which need to be translated to something that
humans can make sense of.

That's what the symbolization library is helping with. This process is
multi-step, quite involved, hard to get right with a good balance of
efficiency, correctness and fullness of information (there is always a
choice of doing simplistic symbolization using just ELF symbols, or
much more expensive but also fuller symbolization using DWARF
information, which also gives file name + line number information, can
symbolize inlined functions, etc.).

One such library is blazesym ([0], cc'ed Daniel, who's working on it),
which is developed by Meta for both internal use in our fleet-wide
profiler, and is also in the process of being integrated into bpftrace
(to improve bpftrace's current somewhat limited symbolization approach
based on BCC). There is also a non-Meta project (I believe Datadog)
that is using it for its own needs.

Symbolization is quite a common task that's highly non-trivial.

[0] https://github.com/libbpf/blazesym

>
> > The library has to make a choice: either cache the parsed contents of
> > /proc/<pid>/maps to service future requests (if the application asks to
> > symbolize another set of addresses, captured at some later time, which
> > is typical for periodic/continuous profiling cases), or re-open and
> > re-parse the file for each request. In the former case, more memory is
> > used for the cache and there is a risk of getting stale data if the
> > application loaded/unloaded shared libraries, or otherwise changed its
> > set of VMAs through additional mmap() calls (and other means of altering
> > the memory address space). In the latter case, it's the performance hit
> > that comes from re-opening the file and re-reading/re-parsing its
> > contents all over again.
>
> Again, "performance hit" needs to be justified, it shouldn't be much
> overall.

I'm not sure how to answer whether it's much or not. Can you be a bit
more specific on what you'd like to see?

But I want to say that sensitivity to any overhead differs a lot
depending on specifics. As a general rule, we try to minimize any
resource usage of the profiler/symbolizer itself on the host that is
being profiled, to minimize the disruption of the production workload.
So anything that can be done to optimize any part of the overall
profiling process is a benefit.

But while for big servers tolerance might be higher in terms of
re-opening and re-parsing a bunch of text files, we also have use
cases on much less powerful and very performance-sensitive Oculus VR
devices, for example. There, any extra piece of work is scrutinized,
so having to parse text on those relatively weak devices does add up.
Enough to spend effort to optimize text parsing in blazesym's Rust
code (see [1] for recent improvements).

[1] https://github.com/libbpf/blazesym/pull/643/commits/b89b91b42b994b135a0079bf04b2319c0054f745

>
> > This patch aims to solve this problem by providing a new API built on
> > top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> > interface, avoiding the cost and awkwardness of textual representation
> > for programmatic use.
>
> Some people find text easier to handle for programmatic use :)

I don't disagree, but pretty much everyone I've discussed text-based
kernel APIs with is pretty uniformly in favor of
binary-based interfaces, if they are available.

But note, I'm not proposing to deprecate or remove text-based
/proc/<pid>/maps. And the main point of this work is not so much
binary vs text as it is the "point-based" querying capability, as
opposed to the "iterate everything" approach of /proc/<pid>/maps.

>
> > It's designed to be extensible and
> > forward/backward compatible by including a user-specified struct size and
> > using the copy_struct_from_user() approach. But, most importantly, it
> > allows point queries for a single user-specified address. And
> > this is done efficiently using the VMA iterator.
>
> Ok, maybe this is the main issue, you only want one at a time?

Yes. More or less, I need "a few" that cover a captured set of addresses.

>
> > User has a choice to pick either getting VMA that covers provided
> > address or -ENOENT if none is found (exact, least surprising, case). Or,
> > with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they can
> > get either VMA that covers the address (if there is one), or the closest
> > next VMA (i.e., VMA with the smallest vm_start > addr). The latter allows
> > more efficient use, but, given it could be a surprising behavior,
> > requires an explicit opt-in.
> >
> > Basing this ioctl()-based API on top of /proc/<pid>/maps's FD makes
> > sense given it's querying the same set of VMA data. All the permissions
> > checks performed on /proc/<pid>/maps opening fit here as well.
> > The ioctl-based implementation fetches the remembered mm_struct reference,
> > but otherwise doesn't interfere with seq_file-based implementation of
> > /proc/<pid>/maps textual interface, and so could be used together or
> > independently without paying any price for that.
> >
> > There is one extra thing that /proc/<pid>/maps doesn't currently
> > provide, and that's an ability to fetch ELF build ID, if present. User
> > has control over whether this piece of information is requested or not
> > by setting the build_id_size field either to zero (not requested) or to
> > the non-zero maximum size of the buffer provided through the
> > build_id_addr field (which encodes a user pointer as a __u64 field).
> >
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this,
> > requiring applications to open underlying ELF binary through
> > /proc/<pid>/map_files/<start>-<end> symlink, which adds extra
> > permission implications due to giving full access to the binary from
> > (potentially) another process, while all the application is interested
> > in is the build ID. Giving an ability to request just the build ID
> > doesn't introduce any additional security concerns on top of what
> > /proc/<pid>/maps is already concerned with, and simplifies the overall
> > logic.
> >
> > The kernel already implements build ID fetching, which is used from the BPF
> > subsystem. We are reusing this code here, but plan follow-up changes
> > to make it work better under a more relaxed assumption (compared to what
> > existing code assumes) of being called from user process context, in
> > which page faults are allowed. BPF-specific implementation currently
> > bails out if necessary part of ELF file is not paged in, all due to
> > extra BPF-specific restrictions (like the need to fetch build ID in
> > restrictive contexts such as NMI handler).
> >
> > Note also, that fetching VMA name (e.g., backing file path, or special
> > hard-coded or user-provided names) is optional just like build ID. If
> > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > it, saving resources.
> >
> > Signed-off-by: Andrii Nakryiko <[email protected]>
>
> Where is the userspace code that uses this new api you have created?

So I added a faithful comparison of existing /proc/<pid>/maps vs new
ioctl() API to solve a common problem (as described above) in patch
#5. The plan is to put it in mentioned blazesym library at the very
least.

I'm sure perf would benefit from this as well (cc'ed Arnaldo and
linux-perf-user), as they need to do stack symbolization as well.

It will be up to other similar projects to adopt this, but we'll
definitely get this into blazesym as it is actually a problem for the
abovementioned Oculus use case. We already had to make a tradeoff (see
[2], this wasn't done just because we could, but it was requested by
Oculus customers) to cache the contents of /proc/<pid>/maps and run
the risk of missing some shared libraries that can be loaded later. It
would be great to not have to do this tradeoff, which this new API
would enable.

[2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf

>
> > ---
> > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/fs.h | 32 ++++++++
> > 2 files changed, 197 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 8e503a1635b7..cb7b1ff1a144 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -22,6 +22,7 @@
> > #include <linux/pkeys.h>
> > #include <linux/minmax.h>
> > #include <linux/overflow.h>
> > +#include <linux/buildid.h>
> >
> > #include <asm/elf.h>
> > #include <asm/tlb.h>
> > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > return do_maps_open(inode, file, &proc_pid_maps_op);
> > }
> >
> > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > +{
> > + struct procfs_procmap_query karg;
> > + struct vma_iterator iter;
> > + struct vm_area_struct *vma;
> > + struct mm_struct *mm;
> > + const char *name = NULL;
> > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > + __u64 usize;
> > + int err;
> > +
> > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > + return -EFAULT;
> > + if (usize > PAGE_SIZE)
>
> Nice, where did you document that? And how is that portable given that
> PAGE_SIZE can be different on different systems?

I'm happy to document everything, can you please help by pointing
where this documentation has to live?

This is mostly fool-proofing, though, because the user has to pass
sizeof(struct procfs_procmap_query), which I don't see ever getting
close to even 4KB (to say nothing of 64KB). This is just to
prevent copy_struct_from_user() below from doing too much zero-checking.

>
> and why aren't you checking the actual structure size instead? You can
> easily run off the end here without knowing it.

See copy_struct_from_user(), it does more checks. This is a helper
designed specifically to deal with use cases like this where kernel
struct size can change and user space might be newer or older.
copy_struct_from_user() has a nice documentation describing all these
nuances.

>
> > + return -E2BIG;
> > + if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > + return -EINVAL;
>
> Ok, so you have two checks? How can the first one ever fail?

Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
won't fail, but this one will fail.

The point of this check is that the user has to specify at least the
first three fields of procfs_procmap_query (size, query_flags, and
query_addr), because without those the query is meaningless.
>
>
> > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);

and this helper does more checks, validating that the user either has a
shorter struct (and then the rest of the kernel-side struct is zero-filled)
or has a longer one (and then the extra part has to be zero-filled). Do check
copy_struct_from_user() documentation, it's great.
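
(Paraphrasing the helper's documented behavior, for readers following
along:)

	/*
	 * copy_struct_from_user(dst, ksize, src, usize), roughly:
	 *   usize == ksize: plain copy_from_user();
	 *   usize <  ksize: copy usize bytes and zero-fill the kernel-side
	 *                   tail (older userspace, newer kernel);
	 *   usize >  ksize: copy ksize bytes, but fail with -E2BIG unless
	 *                   the trailing userspace bytes are all zero
	 *                   (newer userspace, older kernel).
	 */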

> > + if (err)
> > + return err;
> > +
> > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > + return -EINVAL;
> > + if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > + return -EINVAL;
> > + if (!!karg.build_id_size != !!karg.build_id_addr)
> > + return -EINVAL;
>
> So you want values to be set, right?

Either both should be set, or neither. It's ok for both size/addr
fields to be zero, in which case it indicates that the user doesn't
want this part of information (which is usually a bit more expensive
to get and might not be necessary for all the cases).

>
> > +
> > + mm = priv->mm;
> > + if (!mm || !mmget_not_zero(mm))
> > + return -ESRCH;
>
> What is this error for? Where is this documentned?

I copied it from existing /proc/<pid>/maps checks. I presume it's
guarding the case when mm might be already put. So if the process is
gone, but we have /proc/<pid>/maps file open?

>
> > + if (mmap_read_lock_killable(mm)) {
> > + mmput(mm);
> > + return -EINTR;
> > + }
> > +
> > + vma_iter_init(&iter, mm, karg.query_addr);
> > + vma = vma_next(&iter);
> > + if (!vma) {
> > + err = -ENOENT;
> > + goto out;
> > + }
> > + /* user wants covering VMA, not the closest next one */
> > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > + vma->vm_start > karg.query_addr) {
> > + err = -ENOENT;
> > + goto out;
> > + }
> > +
> > + karg.vma_start = vma->vm_start;
> > + karg.vma_end = vma->vm_end;
> > +
> > + if (vma->vm_file) {
> > + const struct inode *inode = file_user_inode(vma->vm_file);
> > +
> > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > + karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > + karg.dev_minor = MINOR(inode->i_sb->s_dev);
>
> So the major/minor is that of the file superblock? Why?

Because an inode number is unique only within a given superblock (and even
then it's more complicated, e.g., btrfs subvolumes add more headaches,
I believe). inode + dev maj/min is sometimes used as a key for caching/reuse
of per-binary information (e.g., pre-processed DWARF information, which
is *very* expensive to produce, so anything that allows avoiding that work
is helpful).
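
(Illustrative only -- the kind of cache key meant here:)

	/* Hypothetical cache key for per-binary data (e.g., parsed DWARF).
	 * An inode number alone is unique only within one superblock,
	 * hence the device major/minor is part of the key. */
	struct binary_cache_key {
		__u64 inode;
		__u32 dev_major;
		__u32 dev_minor;
	};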

>
> > + karg.inode = inode->i_ino;
>
> What is userspace going to do with this?
>

See above.

> > + } else {
> > + karg.vma_offset = 0;
> > + karg.dev_major = 0;
> > + karg.dev_minor = 0;
> > + karg.inode = 0;
>
> Why not set everything to 0 up above at the beginning so you never miss
> anything, and you don't miss any holes accidentally in the future.
>

Stylistic preference, I find this more explicit, but I don't care much
one way or another.

> > + }
> > +
> > + karg.vma_flags = 0;
> > + if (vma->vm_flags & VM_READ)
> > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > + if (vma->vm_flags & VM_WRITE)
> > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > + if (vma->vm_flags & VM_EXEC)
> > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > + if (vma->vm_flags & VM_MAYSHARE)
> > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > +

[...]

> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 45e4e64fd664..fe8924a8d916 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > __u64 return_mask;
> > };
> >
> > +/* /proc/<pid>/maps ioctl */
> > +#define PROCFS_IOCTL_MAGIC 0x9f
>
> Don't you need to document this in the proper place?

I probably do, but I'm asking for help in knowing where. procfs is not
a typical area of the kernel for me to work in, so any pointers are
highly appreciated.

>
> > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > +
> > +enum procmap_query_flags {
> > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > +};
> > +
> > +enum procmap_vma_flags {
> > + PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > + PROCFS_PROCMAP_VMA_SHARED = 0x08,
>
> Are these bits? If so, please use the bit macro for it to make it
> obvious.
>

Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
add any extra #includes to this UAPI header, but I can figure out the
necessary dependency and do BIT(), I just didn't feel like BIT() adds
much here, tbh.
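
(If it helps: UAPI headers can't pull in BIT() from linux/bits.h, but
_BITUL() from <linux/const.h> should be usable in uapi headers and would
make the bit nature explicit, e.g.:)

	#include <linux/const.h>

	enum procmap_vma_flags {
		PROCFS_PROCMAP_VMA_READABLE   = _BITUL(0),
		PROCFS_PROCMAP_VMA_WRITABLE   = _BITUL(1),
		PROCFS_PROCMAP_VMA_EXECUTABLE = _BITUL(2),
		PROCFS_PROCMAP_VMA_SHARED     = _BITUL(3),
	};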

> > +};
> > +
> > +struct procfs_procmap_query {
> > + __u64 size;
> > + __u64 query_flags; /* in */
>
> Does this map to the procmap_vma_flags enum? if so, please say so.

no, procmap_query_flags, and yes, I will

>
> > + __u64 query_addr; /* in */
> > + __u64 vma_start; /* out */
> > + __u64 vma_end; /* out */
> > + __u64 vma_flags; /* out */
> > + __u64 vma_offset; /* out */
> > + __u64 inode; /* out */
>
> What is the inode for, you have an inode for the file already, why give
> it another one?

This is the inode of the VMA's backing file, same as /proc/<pid>/maps'
inode column. What file inode do I already have here? You mean that of
/proc/<pid>/maps itself? It's useless for the intended purposes.

>
> > + __u32 dev_major; /* out */
> > + __u32 dev_minor; /* out */
>
> What is major/minor for?

This is the same information as emitted by /proc/<pid>/maps; it
identifies the superblock of the VMA's backing file. As I mentioned above,
it can be used for caching per-file (i.e., per-ELF-binary) information
(for example).

>
> > + __u32 vma_name_size; /* in/out */
> > + __u32 build_id_size; /* in/out */
> > + __u64 vma_name_addr; /* in */
> > + __u64 build_id_addr; /* in */
>
> Why not document this all using kerneldoc above the structure?

Yes, sorry, I slacked a bit on adding this upfront. I knew we'd be
figuring out the best place and approach, and so wanted to avoid
documentation churn.

Would something like what we have for pm_scan_arg and pagemap APIs
work? I see it added a few simple descriptions for pm_scan_arg struct,
and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
though)? Anyway, I'm hoping for pointers on where all this should be
documented. Thank you!

>
> anyway, I don't like ioctls, but there is a place for them, you just
> have to actually justify the use for them and not say "not efficient
> enough" as that normally isn't an issue overall.

I've written a demo tool in patch #5 which performs a real-world task:
mapping addresses to their VMAs (specifically calculating file offset,
finding the vma_start + vma_end range to further access files from
/proc/<pid>/map_files/<start>-<end>). I did the implementation
faithfully, doing it in the most optimal way for both APIs, for a
"typical" scenario (it's hard to specify what typical is, of course,
too many variables): data collected on a real server running a real
service, with 30 seconds of process-specific stack traces captured,
if I remember correctly. I showed that doing exactly the same amount
of work is ~35x slower with /proc/<pid>/maps.

Take another process, another set of addresses, another anything, and
the numbers will be different, but I think it gives the right idea.

But I think we are overpivoting on the text vs binary distinction here.
It's the more targeted querying of VMAs that's beneficial. This
allows applications to not cache anything and just re-query when doing
periodic or continuous profiling (where addresses are coming in not as
one batch, but as a sequence of batches spread out in time).

/proc/<pid>/maps, for all its usefulness, just can't provide this sort
of ability, as it wasn't designed to do that and is targeting
different use cases.

And then, a new ability to request a reliable build ID (it's not 100%
reliable today, I'm going to address that as a follow-up) is *crucial*
for some scenarios. In the mentioned Oculus use case, the need to fully
access the underlying ELF binary just to get the build ID is frowned
upon, and for a good reason: the profiler only needs the build ID, which
is no secret and not sensitive information. This new (and binary, yes)
API allows adding this without breaking any backwards compatibility.
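
To make the intended usage concrete, here is a minimal sketch of a
single query against an already-open /proc/<pid>/maps fd (maps_fd) for
an address addr, based on the UAPI proposed in patch #2; error handling
is trimmed, see patch #5 for the complete tool:

	struct procfs_procmap_query q;
	char vma_name[4096];

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);	/* for forward/backward compatibility */
	q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
	q.query_addr = addr;
	q.vma_name_addr = (__u64)(unsigned long)vma_name;
	q.vma_name_size = sizeof(vma_name);
	/* build_id_addr/build_id_size left at zero: build ID not requested */

	if (ioctl(maps_fd, PROCFS_PROCMAP_QUERY, &q) == 0)
		printf("%llx-%llx off %llx %s\n",
		       (unsigned long long)q.vma_start,
		       (unsigned long long)q.vma_end,
		       (unsigned long long)q.vma_offset,
		       q.vma_name_size ? vma_name : "<unnamed>");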

>
> thanks,
>
> greg k-h

2024-05-04 21:51:14

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

On Sat, May 4, 2024 at 8:34 AM Greg KH <[email protected]> wrote:
>
> On Sat, May 04, 2024 at 01:24:23PM +0200, Christian Brauner wrote:
> > On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > > applications to query VMA information more efficiently than through textual
> > > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > > justification, and nuances of the API design.
> > >
> > > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > > Patch #2 is the meat of kernel-side API.
> > > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > > optionally use this new ioctl()-based API, if supported.
> > > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > > also has performance numbers of a test based on one of the medium-sized
> > > internal applications taken from production.
> >
> > I don't have anything against adding a binary interface for this. But
> > it's somewhat odd to do ioctls based on /proc files. I wonder if there
> > isn't a more suitable place for this. prctl()? New vmstat() system call
> > using a pidfd/pid as reference? ioctl() on fs/pidfs.c?
>
> See my objection to the ioctl api in the patch review itself.

Will address them there.


>
> Also, as this is a new user/kernel api, it needs loads of documentation
> (there was none), and probably also cc: linux-api, right?

Will cc linux-api. And yes, I didn't want to invest too much time in
documentation upfront, as I knew that the API itself would be tweaked,
tuned, and moved to some other place (see Christian's pidfd suggestion).
But I'm happy to write it, I'd appreciate the pointers where exactly
this should live. Thanks!

>
> thanks,
>
> greg k-h

2024-05-04 21:51:37

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

On Sat, May 4, 2024 at 4:24 AM Christian Brauner <[email protected]> wrote:
>
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
>
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

I did ioctl() on /proc/<pid>/maps because that's the file that's used
for the same use cases and it can be opened from other processes for
any target PID. I'm open to any suggestions that make more sense; this
v1 is mostly to start the conversation.

prctl() probably doesn't make sense, as according to man page:

prctl() manipulates various aspects of the behavior of the
calling thread or process.

And this facility is most often used from another (profiler or
symbolizer) process.

A new syscall feels like overkill, but if that's the only way, so be it.

I do like the idea of ioctl() on top of pidfd (I assume that's what
you mean by "fs/pidfs.c", right?). This seems most promising. One
question/nuance: if I understand correctly, pidfd won't hold a
task_struct (and its mm_struct) reference, right? So if the process
exits, even if I have the pidfd, that task is gone and we won't be able
to query it. Is that right?

If yes, then it's still workable in a lot of situations, but it would
be nice to have an ability to query VMAs (at least for binary's own
text segments) even if the process exits. This is the case for
short-lived processes that profilers capture some stack traces from,
but by the time these stack traces are processed they are gone.

This might be a stupid idea and question, but what if an ioctl() on the
pidfd itself created another FD representing the mm_struct of that
process, and then we'd have an ioctl() on *that* sort-of-mm-struct-fd to
query VMAs. Would that work at all? This approach would allow a
long-running profiler application to open the pidfd and this other "mm
fd" once, cache them, and then just query. Meanwhile we can epoll() the pidfd
itself to know when the process exits so that these mm_structs are not
referenced for longer than necessary.

Is this pushing too far, or do you think that would work and be acceptable?

But in any case, I think ioctl() on top of pidfd makes total sense for
this, thanks.

2024-05-04 21:57:48

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
>
> On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > Implement a simple tool/benchmark for comparing address "resolution"
> > logic based on textual /proc/<pid>/maps interface and new binary
> > ioctl-based PROCFS_PROCMAP_QUERY command.
>
> Of course an artificial benchmark of "read a whole file" vs. "a tiny
> ioctl" is going to be different, but step back and show how this is
> going to be used in the real world overall. Pounding on this file is
> not a normal operation, right?
>

It's not artificial at all. It's *exactly* what, say, blazesym library
is doing (see [0], it's Rust and part of the overall library API, I
think C code in this patch is way easier to follow for someone not
familiar with implementation of blazesym, but both implementations are
doing exactly the same sequence of steps). You can do it even less
efficiently by parsing the whole file, building an in-memory lookup
table, then looking up addresses one by one. But that's even slower
and more memory-hungry. So I didn't even bother implementing that; it
would put /proc/<pid>/maps at an even greater disadvantage.

Other applications that deal with stack traces (including perf) would
be doing one of those two approaches, depending on circumstances and
level of sophistication of code (and sensitivity to performance).

[0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
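
For reference, the sequence of steps boils down to something like the
sketch below: addresses sorted in ascending order, query() being a
hypothetical wrapper around the PROCFS_PROCMAP_QUERY ioctl with
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA set, and resolve() standing in for
whatever per-address work the symbolizer does:

	__u64 vma_start = 0, vma_end = 0;	/* no VMA cached yet */

	for (size_t i = 0; i < n_addrs; i++) {
		if (addrs[i] >= vma_end &&	/* cached VMA can't cover it */
		    query(maps_fd, addrs[i], &vma_start, &vma_end))
			break;		/* no VMA at or above this address */
		if (addrs[i] >= vma_start)	/* covered by this VMA */
			resolve(addrs[i], vma_start, vma_end);
		/* else: address falls into a gap before the next VMA; skip */
	}

This is why the number of ioctl() calls tracks the number of distinct
relevant VMAs hit, not the number of addresses or the total VMA count.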

> thanks,
>
> greg k-h

2024-05-04 22:13:56

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Sat, May 4, 2024 at 8:32 AM Greg KH <[email protected]> wrote:
>
> On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > I also did an strace run of both cases. In text-based one the tool did
> > 68 read() syscalls, fetching up to 4KB of data in one go.
>
> Why not fetch more at once?
>

I didn't expect to be interrogated so much on the performance of the
text parsing front, sorry. :) You can probably tune this, but where is
the reasonable limit? 64KB? 256KB? 1MB? See below for some more
production numbers.

> And I have a fun 'readfile()' syscall implementation around here that
> needs justification to get merged (I try so every other year or so) that
> can do the open/read/close loop in one call, with the buffer size set by
> userspace if you really are saying this is a "hot path" that needs that
> kind of speedup. But in the end, io_uring usually is the proper api for
> that instead, why not use that here instead of slow open/read/close if
> you care about speed?
>

I'm not sure what I need to say here. I'm sure it will be useful, but
as I already explained, it's not about text vs binary as such, it's
about having to read too much information that's completely
irrelevant. Again, see below for another data point.

> > In comparison,
> > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > relevant VMAs.
> >
> > It is projected that savings from processing big production applications
> > would only widen the gap in favor of binary-based querying ioctl API, as
> > bigger applications will tend to have even more non-executable VMA
> > mappings relative to executable ones.
>
> Define "bigger applications" please. Is this some "large database
> company workload" type of thing, or something else?

I don't have a definition. But I had in mind, as one example, an
ads-serving service we use internally (it's a pretty large application
by pretty much any metric you can come up with). I just randomly
picked one of the production hosts, found one instance of that
service, and looked at its /proc/<pid>/maps file. Hopefully it will
satisfy your need for specifics.

# cat /proc/1126243/maps | wc -c
1570178
# cat /proc/1126243/maps | wc -l
28875
# cat /proc/1126243/maps | grep ' ..x. ' | wc -l
7347

You can see that maps file itself is about 1.5MB of text (which means
single-shot reading of its entire contents is a bit unrealistic,
though, sure, why not). The process contains 28875 VMAs, out of which
only 7347 are executable.

This means if we were to profile this process (and normally we profile
entire system, so it's almost never single /proc/<pid>/maps file that
needs to be open and processed), we'd need *at most* (absolute worst
case!) 7347/28875 = 25.5% of entries. In reality, most code will be
concentrated in a much smaller number of executable VMAs, of course.
But no, I don't have specific numbers at hand, sorry.

It matters less whether it's text or binary (though binary will
undoubtedly be faster; it's strange to even argue about this); the
ability to fetch only the relevant VMAs is the point here.

>
> thanks,
>
> greg k-h

2024-05-04 23:37:17

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

Hi Andrii,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20240503]
[also build test WARNING on v6.9-rc6]
[cannot apply to bpf-next/master bpf/master perf-tools-next/perf-tools-next tip/perf/core perf-tools/perf-tools brauner-vfs/vfs.all linus/master acme/perf/core v6.9-rc6 v6.9-rc5 v6.9-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Andrii-Nakryiko/fs-procfs-extract-logic-for-getting-VMA-name-constituents/20240504-083146
base: next-20240503
patch link: https://lore.kernel.org/r/20240504003006.3303334-3-andrii%40kernel.org
patch subject: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
config: openrisc-allnoconfig (https://download.01.org/0day-ci/archive/20240505/[email protected]/config)
compiler: or1k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240505/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

fs/proc/task_mmu.c: In function 'do_procmap_query':
>> fs/proc/task_mmu.c:505:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
505 | if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
| ^
fs/proc/task_mmu.c:512:48: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
512 | if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
| ^
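
(The usual fix for this class of warning is the u64_to_user_ptr()
helper, which casts through uintptr_t instead of casting a __u64
straight to a user pointer -- a suggestion, not part of the posted
patch:)

	if (karg.vma_name_size && copy_to_user(u64_to_user_ptr(karg.vma_name_addr),
					       name, karg.vma_name_size)) {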


vim +505 fs/proc/task_mmu.c

   378	
   379	static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
   380	{
   381		struct procfs_procmap_query karg;
   382		struct vma_iterator iter;
   383		struct vm_area_struct *vma;
   384		struct mm_struct *mm;
   385		const char *name = NULL;
   386		char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
   387		__u64 usize;
   388		int err;
   389	
   390		if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
   391			return -EFAULT;
   392		if (usize > PAGE_SIZE)
   393			return -E2BIG;
   394		if (usize < offsetofend(struct procfs_procmap_query, query_addr))
   395			return -EINVAL;
   396		err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
   397		if (err)
   398			return err;
   399	
   400		if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
   401			return -EINVAL;
   402		if (!!karg.vma_name_size != !!karg.vma_name_addr)
   403			return -EINVAL;
   404		if (!!karg.build_id_size != !!karg.build_id_addr)
   405			return -EINVAL;
   406	
   407		mm = priv->mm;
   408		if (!mm || !mmget_not_zero(mm))
   409			return -ESRCH;
   410		if (mmap_read_lock_killable(mm)) {
   411			mmput(mm);
   412			return -EINTR;
   413		}
   414	
   415		vma_iter_init(&iter, mm, karg.query_addr);
   416		vma = vma_next(&iter);
   417		if (!vma) {
   418			err = -ENOENT;
   419			goto out;
   420		}
   421		/* user wants covering VMA, not the closest next one */
   422		if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
   423		    vma->vm_start > karg.query_addr) {
   424			err = -ENOENT;
   425			goto out;
   426		}
   427	
   428		karg.vma_start = vma->vm_start;
   429		karg.vma_end = vma->vm_end;
   430	
   431		if (vma->vm_file) {
   432			const struct inode *inode = file_user_inode(vma->vm_file);
   433	
   434			karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
   435			karg.dev_major = MAJOR(inode->i_sb->s_dev);
   436			karg.dev_minor = MINOR(inode->i_sb->s_dev);
   437			karg.inode = inode->i_ino;
   438		} else {
   439			karg.vma_offset = 0;
   440			karg.dev_major = 0;
   441			karg.dev_minor = 0;
   442			karg.inode = 0;
   443		}
   444	
   445		karg.vma_flags = 0;
   446		if (vma->vm_flags & VM_READ)
   447			karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
   448		if (vma->vm_flags & VM_WRITE)
   449			karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
   450		if (vma->vm_flags & VM_EXEC)
   451			karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
   452		if (vma->vm_flags & VM_MAYSHARE)
   453			karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
   454	
   455		if (karg.build_id_size) {
   456			__u32 build_id_sz = BUILD_ID_SIZE_MAX;
   457	
   458			err = build_id_parse(vma, build_id_buf, &build_id_sz);
   459			if (!err) {
   460				if (karg.build_id_size < build_id_sz) {
   461					err = -ENAMETOOLONG;
   462					goto out;
   463				}
   464				karg.build_id_size = build_id_sz;
   465			}
   466		}
   467	
   468		if (karg.vma_name_size) {
   469			size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
   470			const struct path *path;
   471			const char *name_fmt;
   472			size_t name_sz = 0;
   473	
   474			get_vma_name(vma, &path, &name, &name_fmt);
   475	
   476			if (path || name_fmt || name) {
   477				name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
   478				if (!name_buf) {
   479					err = -ENOMEM;
   480					goto out;
   481				}
   482			}
   483			if (path) {
   484				name = d_path(path, name_buf, name_buf_sz);
   485				if (IS_ERR(name)) {
   486					err = PTR_ERR(name);
   487					goto out;
   488				}
   489				name_sz = name_buf + name_buf_sz - name;
   490			} else if (name || name_fmt) {
   491				name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
   492				name = name_buf;
   493			}
   494			if (name_sz > name_buf_sz) {
   495				err = -ENAMETOOLONG;
   496				goto out;
   497			}
   498			karg.vma_name_size = name_sz;
   499		}
   500	
   501		/* unlock and put mm_struct before copying data to user */
   502		mmap_read_unlock(mm);
   503		mmput(mm);
   504	
 > 505		if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
   506						       name, karg.vma_name_size)) {
   507			kfree(name_buf);
   508			return -EFAULT;
   509		}
   510		kfree(name_buf);
   511	
   512		if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
   513						       build_id_buf, karg.build_id_size))
   514			return -EFAULT;
   515	
   516		if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
   517			return -EFAULT;
   518	
   519		return 0;
   520	
   521	out:
   522		mmap_read_unlock(mm);
   523		mmput(mm);
   524		kfree(name_buf);
   525		return err;
   526	}
   527	

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-05-05 05:10:04

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
<[email protected]> wrote:
>
> On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> >
> > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > Implement a simple tool/benchmark for comparing address "resolution"
> > > logic based on textual /proc/<pid>/maps interface and new binary
> > > ioctl-based PROCFS_PROCMAP_QUERY command.
> >
> > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > ioctl" is going to be different, but step back and show how this is
> > going to be used in the real world overall. Pounding on this file is
> > not a normal operation, right?
> >
>
> It's not artificial at all. It's *exactly* what, say, blazesym library
> is doing (see [0], it's Rust and part of the overall library API, I
> think C code in this patch is way easier to follow for someone not
> familiar with implementation of blazesym, but both implementations are
> doing exactly the same sequence of steps). You can do it even less
> efficiently by parsing the whole file, building an in-memory lookup
> table, then looking up addresses one by one. But that's even slower
> and more memory-hungry. So I didn't even bother implementing that; it
> would put /proc/<pid>/maps at an even greater disadvantage.
>
> Other applications that deal with stack traces (including perf) would
> be doing one of those two approaches, depending on circumstances and
> level of sophistication of code (and sensitivity to performance).

The code in perf doing this is here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
The code is using the api/io.h code:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
Using perf to profile perf, it was observed that time was spent allocating
buffers and in locale-related activities when using stdio, so io.h is a
lighter-weight alternative, albeit with more verbose code than fscanf.
You could add this as an alternate /proc/<pid>/maps reader; we have a
similar benchmark in `perf bench internals synthesize`.

Thanks,
Ian

> [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
>
> > thanks,
> >
> > greg k-h
>

2024-05-05 05:26:38

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <[email protected]> wrote:
>
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than through textual
> processing of /proc/<pid>/maps contents. See patch #2 for the context,
> justification, and nuances of the API design.
>
> Patch #1 is a refactoring to keep VMA name logic determination in one place.
> Patch #2 is the meat of kernel-side API.
> Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> optionally use this new ioctl()-based API, if supported.
> Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> both textual and binary interfaces) and allows benchmarking them. Patch itself
> also has performance numbers of a test based on one of the medium-sized
> internal applications taken from production.
>
> This patch set was based on top of next-20240503 tag in linux-next tree.
> Not sure what should be the target tree for this, I'd appreciate any guidance,
> thank you!
>
> Andrii Nakryiko (5):
> fs/procfs: extract logic for getting VMA name constituents
> fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
> tools: sync uapi/linux/fs.h header into tools subdir
> selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
> selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

I'd love to see improvements like this for the Linux perf command.
Some thoughts:

- Could we do something scalability-wise better than a file
descriptor per pid? If a profiler is running in a container, the cost
of many file descriptors can be significant, and it's something that
increases as machines get larger. Could we have a /proc/maps for all
processes?

- Something that is broken in perf currently is that we can race
between reading /proc and opening events on the pids it contains. For
example, perf top supports a uid option that first scans to find all
processes owned by a user then tries to open an event on each process.
This fails if the process terminates between the scan and the open,
leading to a frequent:
```
$ sudo perf top -u `id -u`
The sys_perf_event_open() syscall returned with 3 (No such process)
for event (cycles:P).
```
It would be nice for the API to consider cgroups, uids and the like as
ways to get a subset of things to scan.

- Somewhat related, the mmap perf events give data after the mmap
call has happened. As VMAs get merged, this can lead to mmap perf
events looking like the memory overlaps (for JITs using anonymous
memory), and we lack munmap/mremap events.

Jiri Olsa has looked at improvements in this area in the past.

Thanks,
Ian

> fs/proc/task_mmu.c | 290 +++++++++++---
> include/uapi/linux/fs.h | 32 ++
> .../perf/trace/beauty/include/uapi/linux/fs.h | 32 ++
> tools/testing/selftests/bpf/.gitignore | 1 +
> tools/testing/selftests/bpf/Makefile | 2 +-
> tools/testing/selftests/bpf/procfs_query.c | 366 ++++++++++++++++++
> tools/testing/selftests/bpf/test_progs.c | 3 +
> tools/testing/selftests/bpf/test_progs.h | 2 +
> tools/testing/selftests/bpf/trace_helpers.c | 105 ++++-
> 9 files changed, 763 insertions(+), 70 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/procfs_query.c
>
> --
> 2.43.0
>
>

2024-05-06 13:59:06

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > hard-coded or user-provided names) is optional just like build ID. If
> > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > it, saving resources.

> > > Signed-off-by: Andrii Nakryiko <[email protected]>

> > Where is the userspace code that uses this new api you have created?

> So I added a faithful comparison of existing /proc/<pid>/maps vs new
> ioctl() API to solve a common problem (as described above) in patch
> #5. The plan is to put it in mentioned blazesym library at the very
> least.
>
> I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> linux-perf-user), as they need to do stack symbolization as well.

At some point, when BPF iterators became a thing, we thought about using
BPF to synthesize PERF_RECORD_MMAP2 records for pre-existing maps (IIRC
Jiri did some experimentation, but I lost track of it), the layout being
as in uapi/linux/perf_event.h:

/*
 * The MMAP2 records are an augmented version of MMAP, they add
 * maj, min, ino numbers to be used to uniquely identify each mapping
 *
 * struct {
 *	struct perf_event_header	header;
 *
 *	u32				pid, tid;
 *	u64				addr;
 *	u64				len;
 *	u64				pgoff;
 *	union {
 *		struct {
 *			u32		maj;
 *			u32		min;
 *			u64		ino;
 *			u64		ino_generation;
 *		};
 *		struct {
 *			u8		build_id_size;
 *			u8		__reserved_1;
 *			u16		__reserved_2;
 *			u8		build_id[20];
 *		};
 *	};
 *	u32				prot, flags;
 *	char				filename[];
 *	struct sample_id		sample_id;
 * };
 */
	PERF_RECORD_MMAP2 = 10,

* PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event

As perf.data files can be used for many purposes, we want them all, so we
set up a metadata perf file descriptor that keeps receiving the new mmaps
while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
it in parallel, etc.:

⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'

Usage: perf record [<options>] [<command>]
or: perf record [<options>] -- <command> [<options>]

--num-thread-synthesize <n>
number of threads to run for event synthesis
--synth <no|all|task|mmap|cgroup>
Fine-tune event synthesis: default=all

⬢[acme@toolbox perf-tools-next]$

For this specific initial synthesis of everything, the plan, as mentioned
above regarding Jiri's experiments, was to use a BPF iterator to just feed
the perf ring buffer with those events, that way userspace would just
receive the usual records it gets when a new mmap is put in place, the
BPF iterator would just feed the preexisting mmaps, as instructed via
the perf_event_attr for the perf_event_open syscall.

For people not wanting BPF, i.e. disabling it altogether in perf or
disabling just BPF skels, we would fall back to the current method,
or to the one being discussed here when it becomes available.

One thing to keep in mind is that this iterator must not generate duplicate
records for non-pre-existing mmaps, i.e., we would need some generation
number that would be bumped when asking for such pre-existing-maps
PERF_RECORD_MMAP2 dumps.

> It will be up to other similar projects to adopt this, but we'll
> definitely get this into blazesym as it is actually a problem for the

At some point looking at plugging blazesym somehow with perf may be
something to consider, indeed.

- Arnaldo

> abovementioned Oculus use case. We already had to make a tradeoff (see
> [2], this wasn't done just because we could, but it was requested by
> Oculus customers) to cache the contents of /proc/<pid>/maps and run
> the risk of missing some shared libraries that can be loaded later. It
> would be great to not have to do this tradeoff, which this new API
> would enable.
>
> [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
>
> >
> > > ---
> > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
> > > include/uapi/linux/fs.h | 32 ++++++++
> > > 2 files changed, 197 insertions(+)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -22,6 +22,7 @@
> > > #include <linux/pkeys.h>
> > > #include <linux/minmax.h>
> > > #include <linux/overflow.h>
> > > +#include <linux/buildid.h>
> > >
> > > #include <asm/elf.h>
> > > #include <asm/tlb.h>
> > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > return do_maps_open(inode, file, &proc_pid_maps_op);
> > > }
> > >
> > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > +{
> > > + struct procfs_procmap_query karg;
> > > + struct vma_iterator iter;
> > > + struct vm_area_struct *vma;
> > > + struct mm_struct *mm;
> > > + const char *name = NULL;
> > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > + __u64 usize;
> > > + int err;
> > > +
> > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > + return -EFAULT;
> > > + if (usize > PAGE_SIZE)
> >
> > Nice, where did you document that? And how is that portable given that
> > PAGE_SIZE can be different on different systems?
>
> I'm happy to document everything, can you please help by pointing
> where this documentation has to live?
>
> This is mostly fool-proofing, though, because the user has to pass
> sizeof(struct procfs_procmap_query), which I don't see ever getting
> close to even 4KB (to say nothing of 64KB). This is just to
> prevent copy_struct_from_user() below from doing too much zero-checking.
>
> >
> > and why aren't you checking the actual structure size instead? You can
> > easily run off the end here without knowing it.
>
> See copy_struct_from_user(), it does more checks. This is a helper
> designed specifically to deal with use cases like this where kernel
> struct size can change and user space might be newer or older.
> copy_struct_from_user() has a nice documentation describing all these
> nuances.
>
> >
> > > + return -E2BIG;
> > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > + return -EINVAL;
> >
> > Ok, so you have two checks? How can the first one ever fail?
>
> Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> won't fail, but this one will fail.
>
> The point of this check is that user has to specify at least first
> three fields of procfs_procmap_query (size, query_flags, and
> query_addr), because without those the query is meaningless.
> >
> >
> > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
>
> and this helper does more checks validating that the user either has a
> shorter struct (and then zero-fills the rest of kernel-side struct) or
> a longer one (and then the extra part has to be zero-filled). Do check
> copy_struct_from_user() documentation, it's great.
>
> > > + if (err)
> > > + return err;
> > > +
> > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > + return -EINVAL;
> > > + if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > + return -EINVAL;
> > > + if (!!karg.build_id_size != !!karg.build_id_addr)
> > > + return -EINVAL;
> >
> > So you want values to be set, right?
>
> Either both should be set, or neither. It's ok for both size/addr
> fields to be zero, in which case it indicates that the user doesn't
> want this part of information (which is usually a bit more expensive
> to get and might not be necessary for all the cases).
>
> >
> > > +
> > > + mm = priv->mm;
> > > + if (!mm || !mmget_not_zero(mm))
> > > + return -ESRCH;
> >
> > What is this error for? Where is this documented?
>
> I copied it from existing /proc/<pid>/maps checks. I presume it's
> guarding the case when mm might be already put. So if the process is
> gone, but we have /proc/<pid>/maps file open?
>
> >
> > > + if (mmap_read_lock_killable(mm)) {
> > > + mmput(mm);
> > > + return -EINTR;
> > > + }
> > > +
> > > + vma_iter_init(&iter, mm, karg.query_addr);
> > > + vma = vma_next(&iter);
> > > + if (!vma) {
> > > + err = -ENOENT;
> > > + goto out;
> > > + }
> > > + /* user wants covering VMA, not the closest next one */
> > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > + vma->vm_start > karg.query_addr) {
> > > + err = -ENOENT;
> > > + goto out;
> > > + }
> > > +
> > > + karg.vma_start = vma->vm_start;
> > > + karg.vma_end = vma->vm_end;
> > > +
> > > + if (vma->vm_file) {
> > > + const struct inode *inode = file_user_inode(vma->vm_file);
> > > +
> > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > + karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > + karg.dev_minor = MINOR(inode->i_sb->s_dev);
> >
> > So the major/minor is that of the file superblock? Why?
>
> Because an inode number is unique only within a given super block (and even
> then it's more complicated, e.g., btrfs subvolumes add more headaches,
> I believe). inode + dev maj/min is sometimes used for cache/reuse of
> per-binary information (e.g., pre-processed DWARF information, which
> is *very* expensive, so anything that allows avoiding this is
> helpful).
>
> >
> > > + karg.inode = inode->i_ino;
> >
> > What is userspace going to do with this?
> >
>
> See above.
>
> > > + } else {
> > > + karg.vma_offset = 0;
> > > + karg.dev_major = 0;
> > > + karg.dev_minor = 0;
> > > + karg.inode = 0;
> >
> > Why not set everything to 0 up above at the beginning so you never miss
> > anything, and you don't miss any holes accidentally in the future.
> >
>
> Stylistic preference, I find this more explicit, but I don't care much
> one way or another.
>
> > > + }
> > > +
> > > + karg.vma_flags = 0;
> > > + if (vma->vm_flags & VM_READ)
> > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > + if (vma->vm_flags & VM_WRITE)
> > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > + if (vma->vm_flags & VM_EXEC)
> > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > + if (vma->vm_flags & VM_MAYSHARE)
> > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > +
>
> [...]
>
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index 45e4e64fd664..fe8924a8d916 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > > __u64 return_mask;
> > > };
> > >
> > > +/* /proc/<pid>/maps ioctl */
> > > +#define PROCFS_IOCTL_MAGIC 0x9f
> >
> > Don't you need to document this in the proper place?
>
> I probably do, but I'm asking for help in knowing where. procfs is not
> a typical area of kernel I'm working with, so any pointers are highly
> appreciated.
>
> >
> > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > +
> > > +enum procmap_query_flags {
> > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > +};
> > > +
> > > +enum procmap_vma_flags {
> > > + PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > + PROCFS_PROCMAP_VMA_SHARED = 0x08,
> >
> > Are these bits? If so, please use the bit macro for it to make it
> > obvious.
> >
>
> Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> add any extra #includes to this UAPI header, but I can figure out the
> necessary dependency and do BIT(), I just didn't feel like BIT() adds
> much here, tbh.
>
> > > +};
> > > +
> > > +struct procfs_procmap_query {
> > > + __u64 size;
> > > + __u64 query_flags; /* in */
> >
> > Does this map to the procmap_vma_flags enum? if so, please say so.
>
> no, procmap_query_flags, and yes, I will
>
> >
> > > + __u64 query_addr; /* in */
> > > + __u64 vma_start; /* out */
> > > + __u64 vma_end; /* out */
> > > + __u64 vma_flags; /* out */
> > > + __u64 vma_offset; /* out */
> > > + __u64 inode; /* out */
> >
> > What is the inode for, you have an inode for the file already, why give
> > it another one?
>
> This is inode of vma's backing file, same as /proc/<pid>/maps' file
> column. What inode of file do I already have here? You mean of
> /proc/<pid>/maps itself? It's useless for the intended purposes.
>
> >
> > > + __u32 dev_major; /* out */
> > > + __u32 dev_minor; /* out */
> >
> > What is major/minor for?
>
> This is the same information as emitted by /proc/<pid>/maps,
> identifies superblock of vma's backing file. As I mentioned above, it
> can be used for caching per-file (i.e., per-ELF binary) information
> (for example).
>
> >
> > > + __u32 vma_name_size; /* in/out */
> > > + __u32 build_id_size; /* in/out */
> > > + __u64 vma_name_addr; /* in */
> > > + __u64 build_id_addr; /* in */
> >
> > Why not document this all using kerneldoc above the structure?
>
> Yes, sorry, I slacked a bit on adding this upfront. I knew we'd be
> figuring out the best place and approach, and so wanted to avoid
> documentation churn.
>
> Would something like what we have for pm_scan_arg and pagemap APIs
> work? I see it added a few simple descriptions for pm_scan_arg struct,
> and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> though)? Anyways, I'm hoping for pointers where all this should be
> documented. Thank you!
>
> >
> > anyway, I don't like ioctls, but there is a place for them, you just
> > have to actually justify the use for them and not say "not efficient
> > enough" as that normally isn't an issue overall.
>
> I've written a demo tool in patch #5 which performs a real-world task:
> mapping addresses to their VMAs (specifically calculating file offset,
> finding vma_start + vma_end range to further access files from
> /proc/<pid>/map_files/<start>-<end>). I did the implementation
> faithfully, doing it in the most optimal way for both APIs. I measured
> a "typical" scenario (it's hard to specify what typical is, of course,
> too many variables): data collected on a real server running a real
> service, with 30 seconds of process-specific stack traces captured, if
> I remember correctly. I showed that doing exactly the same amount of
> work is ~35x slower with /proc/<pid>/maps.
>
> Take another process, another set of addresses, another anything, and
> the numbers will be different, but I think it gives the right idea.
>
> But I think we are overpivoting on text vs binary distinction here.
> It's the more targeted querying of VMAs that's beneficial here. This
> allows applications to not cache anything and just re-query when doing
> periodic or continuous profiling (where addresses come in not as one
> batch, but as a sequence of batches spread out over time).
>
> /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> of ability, as it wasn't designed to do that and is targeting
> different use cases.
>
> And then, a new ability to request reliable (it's not 100% reliable
> today, I'm going to address that as a follow up) build ID is *crucial*
> for some scenarios. In the mentioned Oculus use case, the need to fully
> access underlying ELF binary just to get build ID is frowned upon. And
> for a good reason. Profiler only needs build ID, which is no secret
> and not sensitive information. This new (and binary, yes) API allows
> adding this without breaking any backwards compatibility.
>
> >
> > thanks,
> >
> > greg k-h

2024-05-06 18:13:11

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

Hello,

On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > it, saving resources.
>
> > > > Signed-off-by: Andrii Nakryiko <[email protected]>
>
> > > Where is the userspace code that uses this new api you have created?
>
> > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > ioctl() API to solve a common problem (as described above) in patch
> > #5. The plan is to put it in mentioned blazesym library at the very
> > least.
> >
> > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > linux-perf-user), as they need to do stack symbolization as well.

I think the general use case in perf is different. This ioctl API is great
for live tracing of a single (or a small number of) process(es). And
yes, perf tools have those tracing use cases too. But I think the
major use case of perf tools is system-wide profiling.

For system-wide profiling, you need to process samples of many
different processes at a high frequency. Now perf record doesn't
process them and just saves them for offline processing (well, it does
some processing at the end to find out build-IDs, but that can be omitted).

Doing it online is possible (like perf top) but it would add more
overhead during profiling. And we cannot move processing
or symbolization to the end of profiling because some (short-
lived) tasks can go away.

Also it should support perf report (offline) on data from a
different kernel or even a different machine.

So it saves the memory map of processes and symbolizes
the stack trace with it later. Of course it needs to be updated
as the memory map changes and that's why it tracks mmap
or similar syscalls with PERF_RECORD_MMAP[2] records.

A problem with this approach is getting the initial state of all
existing processes (or of a target process in non-system-wide mode).
We call this synthesizing, and we read /proc/PID/maps to generate
the mmap records, along the lines of the sketch below.
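
That textual side boils down to parsing lines of the form
"start-end perms offset dev ino path". A minimal sketch (perf's real
reader in synthetic-events.c is more careful and faster):

	#include <stdio.h>
	#include <string.h>

	/* Parse one /proc/<pid>/maps line into the fields a
	 * PERF_RECORD_MMAP2-style record needs. */
	static int parse_maps_line(char *line)
	{
		unsigned long start, end, off, ino;
		unsigned int maj, min;
		char perms[5], *path;
		int n = 0;

		if (sscanf(line, "%lx-%lx %4s %lx %x:%x %lu%n",
			   &start, &end, perms, &off, &maj, &min, &ino, &n) != 7)
			return -1;

		/* optional trailing pathname or [heap]/[stack]/... marker */
		path = line + n;
		path += strspn(path, " \t");
		path[strcspn(path, "\n")] = '\0';

		printf("%lx-%lx %s off=%lx dev=%x:%x ino=%lu %s\n",
		       start, end, perms, off, maj, min, ino,
		       *path ? path : "[anon]");
		return 0;
	}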

I think the comment below from Arnaldo talks about how
we can improve the synthesizing (which is sequential access
to proc maps) using BPF.

Thanks,
Namhyung


>
> At some point, when BPF iterators became a thing we thought about, IIRC
> Jiri did some experimentation, but I lost track, of using BPF to
> synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> as in uapi/linux/perf_event.h:
>
> /*
> * The MMAP2 records are an augmented version of MMAP, they add
> * maj, min, ino numbers to be used to uniquely identify each mapping
> *
> * struct {
> * struct perf_event_header header;
> *
> * u32 pid, tid;
> * u64 addr;
> * u64 len;
> * u64 pgoff;
> * union {
> * struct {
> * u32 maj;
> * u32 min;
> * u64 ino;
> * u64 ino_generation;
> * };
> * struct {
> * u8 build_id_size;
> * u8 __reserved_1;
> * u16 __reserved_2;
> * u8 build_id[20];
> * };
> * };
> * u32 prot, flags;
> * char filename[];
> * struct sample_id sample_id;
> * };
> */
> PERF_RECORD_MMAP2 = 10,
>
> * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
>
> As perf.data files can be used for many purposes we want them all, so we
> setup a meta data perf file descriptor to go on receiving the new mmaps
> while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> it in parallel, etc:
>
> ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
>
> Usage: perf record [<options>] [<command>]
> or: perf record [<options>] -- <command> [<options>]
>
> --num-thread-synthesize <n>
> number of threads to run for event synthesis
> --synth <no|all|task|mmap|cgroup>
> Fine-tune event synthesis: default=all
>
> ⬢[acme@toolbox perf-tools-next]$
>
> For this specific initial synthesis of everything the plan, as mentioned
> about Jiri's experiments, was to use a BPF iterator to just feed the
> perf ring buffer with those events, that way userspace would just
> receive the usual records it gets when a new mmap is put in place, the
> BPF iterator would just feed the preexisting mmaps, as instructed via
> the perf_event_attr for the perf_event_open syscall.
>
> For people not wanting BPF, i.e. disabling it altogether in perf or
> disabling just BPF skels, then we would fallback to the current method,
> or to the one being discussed here when it becomes available.
>
> One thing to have in mind is for this iterator not to generate duplicate
> records for non-pre-existing mmaps, i.e. we would need some generation
> number that would be bumped when asking for such pre-existing maps
> PERF_RECORD_MMAP2 dumps.
>
> > It will be up to other similar projects to adopt this, but we'll
> > definitely get this into blazesym as it is actually a problem for the
>
> At some point looking at plugging blazesym somehow with perf may be
> something to consider, indeed.
>
> - Arnaldo
>
> > abovementioned Oculus use case. We already had to make a tradeoff (see
> > [2], this wasn't done just because we could, but it was requested by
> > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > the risk of missing some shared libraries that can be loaded later. It
> > would be great to not have to do this tradeoff, which this new API
> > would enable.
> >
> > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> >
> > >
> > > > ---
> > > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
> > > > include/uapi/linux/fs.h | 32 ++++++++
> > > > 2 files changed, 197 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -22,6 +22,7 @@
> > > > #include <linux/pkeys.h>
> > > > #include <linux/minmax.h>
> > > > #include <linux/overflow.h>
> > > > +#include <linux/buildid.h>
> > > >
> > > > #include <asm/elf.h>
> > > > #include <asm/tlb.h>
> > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > return do_maps_open(inode, file, &proc_pid_maps_op);
> > > > }
> > > >
> > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > > +{
> > > > + struct procfs_procmap_query karg;
> > > > + struct vma_iterator iter;
> > > > + struct vm_area_struct *vma;
> > > > + struct mm_struct *mm;
> > > > + const char *name = NULL;
> > > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > > + __u64 usize;
> > > > + int err;
> > > > +
> > > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > > + return -EFAULT;
> > > > + if (usize > PAGE_SIZE)
> > >
> > > Nice, where did you document that? And how is that portable given that
> > > PAGE_SIZE can be different on different systems?
> >
> > I'm happy to document everything, can you please help by pointing
> > where this documentation has to live?
> >
> > This is mostly fool-proofing, though, because the user has to pass
> > sizeof(struct procfs_procmap_query), which I don't see ever getting
> > close to even 4KB (not even saying about 64KB). This is just to
> > prevent copy_struct_from_user() below from doing too much zero-checking.
> >
> > >
> > > and why aren't you checking the actual structure size instead? You can
> > > easily run off the end here without knowing it.
> >
> > See copy_struct_from_user(), it does more checks. This is a helper
> > designed specifically to deal with use cases like this where kernel
> > struct size can change and user space might be newer or older.
> > copy_struct_from_user() has a nice documentation describing all these
> > nuances.
> >
> > >
> > > > + return -E2BIG;
> > > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > > + return -EINVAL;
> > >
> > > Ok, so you have two checks? How can the first one ever fail?
> >
> > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> > won't fail, but this one will fail.
> >
> > The point of this check is that user has to specify at least first
> > three fields of procfs_procmap_query (size, query_flags, and
> > query_addr), because without those the query is meaningless.
> > >
> > >
> > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> >
> > and this helper does more checks validating that the user either has a
> > shorter struct (and then zero-fills the rest of kernel-side struct) or
> > a longer one (and then the extra part has to be zero-filled). Do check
> > copy_struct_from_user() documentation, it's great.
> >
> > > > + if (err)
> > > > + return err;
> > > > +
> > > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > > + return -EINVAL;
> > > > + if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > > + return -EINVAL;
> > > > + if (!!karg.build_id_size != !!karg.build_id_addr)
> > > > + return -EINVAL;
> > >
> > > So you want values to be set, right?
> >
> > Either both should be set, or neither. It's ok for both size/addr
> > fields to be zero, in which case it indicates that the user doesn't
> > want this part of information (which is usually a bit more expensive
> > to get and might not be necessary for all the cases).
> >
> > >
> > > > +
> > > > + mm = priv->mm;
> > > > + if (!mm || !mmget_not_zero(mm))
> > > > + return -ESRCH;
> > >
> > > What is this error for? Where is this documented?
> >
> > I copied it from existing /proc/<pid>/maps checks. I presume it's
> > guarding the case when mm might be already put. So if the process is
> > gone, but we have /proc/<pid>/maps file open?
> >
> > >
> > > > + if (mmap_read_lock_killable(mm)) {
> > > > + mmput(mm);
> > > > + return -EINTR;
> > > > + }
> > > > +
> > > > + vma_iter_init(&iter, mm, karg.query_addr);
> > > > + vma = vma_next(&iter);
> > > > + if (!vma) {
> > > > + err = -ENOENT;
> > > > + goto out;
> > > > + }
> > > > + /* user wants covering VMA, not the closest next one */
> > > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > > + vma->vm_start > karg.query_addr) {
> > > > + err = -ENOENT;
> > > > + goto out;
> > > > + }
> > > > +
> > > > + karg.vma_start = vma->vm_start;
> > > > + karg.vma_end = vma->vm_end;
> > > > +
> > > > + if (vma->vm_file) {
> > > > + const struct inode *inode = file_user_inode(vma->vm_file);
> > > > +
> > > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > > + karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > > + karg.dev_minor = MINOR(inode->i_sb->s_dev);
> > >
> > > So the major/minor is that of the file superblock? Why?
> >
> > Because an inode number is unique only within a given super block (and even
> > then it's more complicated, e.g., btrfs subvolumes add more headaches,
> > I believe). inode + dev maj/min is sometimes used for cache/reuse of
> > per-binary information (e.g., pre-processed DWARF information, which
> > is *very* expensive, so anything that allows avoiding this is
> > helpful).
> >
> > >
> > > > + karg.inode = inode->i_ino;
> > >
> > > What is userspace going to do with this?
> > >
> >
> > See above.
> >
> > > > + } else {
> > > > + karg.vma_offset = 0;
> > > > + karg.dev_major = 0;
> > > > + karg.dev_minor = 0;
> > > > + karg.inode = 0;
> > >
> > > Why not set everything to 0 up above at the beginning so you never miss
> > > anything, and you don't miss any holes accidentally in the future.
> > >
> >
> > Stylistic preference, I find this more explicit, but I don't care much
> > one way or another.
> >
> > > > + }
> > > > +
> > > > + karg.vma_flags = 0;
> > > > + if (vma->vm_flags & VM_READ)
> > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > > + if (vma->vm_flags & VM_WRITE)
> > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > > + if (vma->vm_flags & VM_EXEC)
> > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > > + if (vma->vm_flags & VM_MAYSHARE)
> > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > > +
> >
> > [...]
> >
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 45e4e64fd664..fe8924a8d916 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > > > __u64 return_mask;
> > > > };
> > > >
> > > > +/* /proc/<pid>/maps ioctl */
> > > > +#define PROCFS_IOCTL_MAGIC 0x9f
> > >
> > > Don't you need to document this in the proper place?
> >
> > I probably do, but I'm asking for help in knowing where. procfs is not
> > a typical area of kernel I'm working with, so any pointers are highly
> > appreciated.
> >
> > >
> > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > > +
> > > > +enum procmap_query_flags {
> > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > > +};
> > > > +
> > > > +enum procmap_vma_flags {
> > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08,
> > >
> > > Are these bits? If so, please use the bit macro for it to make it
> > > obvious.
> > >
> >
> > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> > add any extra #includes to this UAPI header, but I can figure out the
> > necessary dependency and do BIT(), I just didn't feel like BIT() adds
> > much here, tbh.
> >
> > > > +};
> > > > +
> > > > +struct procfs_procmap_query {
> > > > + __u64 size;
> > > > + __u64 query_flags; /* in */
> > >
> > > Does this map to the procmap_vma_flags enum? if so, please say so.
> >
> > no, procmap_query_flags, and yes, I will
> >
> > >
> > > > + __u64 query_addr; /* in */
> > > > + __u64 vma_start; /* out */
> > > > + __u64 vma_end; /* out */
> > > > + __u64 vma_flags; /* out */
> > > > + __u64 vma_offset; /* out */
> > > > + __u64 inode; /* out */
> > >
> > > What is the inode for, you have an inode for the file already, why give
> > > it another one?
> >
> > This is inode of vma's backing file, same as /proc/<pid>/maps' file
> > column. What inode of file do I already have here? You mean of
> > /proc/<pid>/maps itself? It's useless for the intended purposes.
> >
> > >
> > > > + __u32 dev_major; /* out */
> > > > + __u32 dev_minor; /* out */
> > >
> > > What is major/minor for?
> >
> > This is the same information as emitted by /proc/<pid>/maps,
> > identifies superblock of vma's backing file. As I mentioned above, it
> > can be used for caching per-file (i.e., per-ELF binary) information
> > (for example).
> >
> > >
> > > > + __u32 vma_name_size; /* in/out */
> > > > + __u32 build_id_size; /* in/out */
> > > > + __u64 vma_name_addr; /* in */
> > > > + __u64 build_id_addr; /* in */
> > >
> > > Why not document this all using kerneldoc above the structure?
> >
> > Yes, sorry, I slacked a bit on adding this upfront. I knew we'd be
> > figuring out the best place and approach, and so wanted to avoid
> > documentation churn.
> >
> > Would something like what we have for pm_scan_arg and pagemap APIs
> > work? I see it added a few simple descriptions for pm_scan_arg struct,
> > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> > though)? Anyways, I'm hoping for pointers where all this should be
> > documented. Thank you!
> >
> > >
> > > anyway, I don't like ioctls, but there is a place for them, you just
> > > have to actually justify the use for them and not say "not efficient
> > > enough" as that normally isn't an issue overall.
> >
> > I've written a demo tool in patch #5 which performs a real-world task:
> > mapping addresses to their VMAs (specifically calculating file offset,
> > finding vma_start + vma_end range to further access files from
> > /proc/<pid>/map_files/<start>-<end>). I did the implementation
> > faithfully, doing it in the most optimal way for both APIs. I measured
> > a "typical" scenario (it's hard to specify what typical is, of course,
> > too many variables): data collected on a real server running a real
> > service, with 30 seconds of process-specific stack traces captured, if
> > I remember correctly. I showed that doing exactly the same amount of
> > work is ~35x slower with /proc/<pid>/maps.
> >
> > Take another process, another set of addresses, another anything, and
> > the numbers will be different, but I think it gives the right idea.
> >
> > But I think we are overpivoting on text vs binary distinction here.
> > It's the more targeted querying of VMAs that's beneficial here. This
> > allows applications to not cache anything and just re-query when doing
> > periodic or continuous profiling (where addresses come in not as one
> > batch, but as a sequence of batches spread out over time).
> >
> > /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> > of ability, as it wasn't designed to do that and is targeting
> > different use cases.
> >
> > And then, a new ability to request reliable (it's not 100% reliable
> > today, I'm going to address that as a follow up) build ID is *crucial*
> > for some scenarios. In the mentioned Oculus use case, the need to fully
> > access underlying ELF binary just to get build ID is frowned upon. And
> > for a good reason. Profiler only needs build ID, which is no secret
> > and not sensitive information. This new (and binary, yes) API allows
> > adding this without breaking any backwards compatibility.
> >
> > >
> > > thanks,
> > >
> > > greg k-h
>

2024-05-06 18:33:02

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Sat, May 4, 2024 at 10:09 PM Ian Rogers <[email protected]> wrote:
>
> On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> > >
> > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > >
> > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > ioctl" is going to be different, but step back and show how this is
> > > going to be used in the real world overall. Pounding on this file is
> > > not a normal operation, right?
> > >
> >
> > It's not artificial at all. It's *exactly* what, say, blazesym library
> > is doing (see [0], it's Rust and part of the overall library API, I
> > think C code in this patch is way easier to follow for someone not
> > familiar with implementation of blazesym, but both implementations are
> > doing exactly the same sequence of steps). You can do it even less
> > efficiently by parsing the whole file, building an in-memory lookup
> > table, then looking up addresses one by one. But that's even slower
> > and more memory-hungry. So I didn't even bother implementing that, it
> > would put /proc/<pid>/maps at even more disadvantage.
> >
> > Other applications that deal with stack traces (including perf) would
> > be doing one of those two approaches, depending on circumstances and
> > level of sophistication of code (and sensitivity to performance).
>
> The code in perf doing this is here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> The code is using the api/io.h code:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> Using perf to profile perf it was observed time was spent allocating
> buffers and locale related activities when using stdio, so io is a
> lighter weight alternative, albeit with more verbose code than fscanf.
> You could add this as an alternate /proc/<pid>/maps reader, we have a
> similar benchmark in `perf bench internals synthesize`.
>

If I add a new implementation using this ioctl() into
perf_event__synthesize_mmap_events(), will it be tested from this
`perf bench internals synthesize`? I'm not too familiar with perf code
organization, sorry if it's a stupid question. If not, where exactly
is the code that would be triggered by the benchmark?

> Thanks,
> Ian
>
> > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> >
> > > thanks,
> > >
> > > greg k-h
> >

2024-05-06 18:42:14

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > it, saving resources.
>
> > > > Signed-off-by: Andrii Nakryiko <[email protected]>
>
> > > Where is the userspace code that uses this new api you have created?
>
> > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > ioctl() API to solve a common problem (as described above) in patch
> > #5. The plan is to put it in mentioned blazesym library at the very
> > least.
> >
> > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > linux-perf-user), as they need to do stack symbolization as well.
>
> At some point, when BPF iterators became a thing we thought about, IIRC
> Jiri did some experimentation, but I lost track, of using BPF to
> synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> as in uapi/linux/perf_event.h:
>
> /*
> * The MMAP2 records are an augmented version of MMAP, they add
> * maj, min, ino numbers to be used to uniquely identify each mapping
> *
> * struct {
> * struct perf_event_header header;
> *
> * u32 pid, tid;
> * u64 addr;
> * u64 len;
> * u64 pgoff;
> * union {
> * struct {
> * u32 maj;
> * u32 min;
> * u64 ino;
> * u64 ino_generation;
> * };
> * struct {
> * u8 build_id_size;
> * u8 __reserved_1;
> * u16 __reserved_2;
> * u8 build_id[20];
> * };
> * };
> * u32 prot, flags;
> * char filename[];
> * struct sample_id sample_id;
> * };
> */
> PERF_RECORD_MMAP2 = 10,
>
> * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
>
> As perf.data files can be used for many purposes we want them all, so we

Ok, so because you want them all and don't know which VMAs will be
useful or not, it's a different problem. BPF iterators will be faster
purely due to avoiding the binary -> text -> binary conversion path,
but other than that you'll still retrieve all VMAs.

You can still do the same full VMA iteration with this new API, of
course, but the advantages are probably smaller, as you'll be
retrieving the full set of VMAs regardless (though it would be
interesting to compare anyway; see the sketch below).
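
A minimal sketch of such a full iteration, assuming the UAPI additions
from patch #2 are available in <linux/fs.h> (error handling trimmed;
this is illustrative, not code from the patch set itself):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* Walk all VMAs of a process by querying for the VMA at or
	 * after addr, then restarting from the end of the returned one. */
	static int dump_all_vmas(int pid)
	{
		struct procfs_procmap_query q;
		char path[64], name[4096];
		__u64 addr = 0;
		int fd;

		snprintf(path, sizeof(path), "/proc/%d/maps", pid);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			return -1;

		for (;;) {
			memset(&q, 0, sizeof(q));
			q.size = sizeof(q);
			q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
			q.query_addr = addr;
			q.vma_name_addr = (__u64)(unsigned long)name;
			q.vma_name_size = sizeof(name);

			if (ioctl(fd, PROCFS_PROCMAP_QUERY, &q) < 0)
				break; /* e.g. -ENOENT: no VMA at or after addr */

			printf("%llx-%llx %s\n",
			       (unsigned long long)q.vma_start,
			       (unsigned long long)q.vma_end,
			       q.vma_name_size ? name : "");
			addr = q.vma_end; /* continue past this VMA */
		}
		close(fd);
		return 0;
	}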

> setup a meta data perf file descriptor to go on receiving the new mmaps
> while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> it in parallel, etc:
>
> ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
>
> Usage: perf record [<options>] [<command>]
> or: perf record [<options>] -- <command> [<options>]
>
> --num-thread-synthesize <n>
> number of threads to run for event synthesis
> --synth <no|all|task|mmap|cgroup>
> Fine-tune event synthesis: default=all
>
> ⬢[acme@toolbox perf-tools-next]$
>
> For this specific initial synthesis of everything the plan, as mentioned
> about Jiri's experiments, was to use a BPF iterator to just feed the
> perf ring buffer with those events, that way userspace would just
> receive the usual records it gets when a new mmap is put in place, the
> BPF iterator would just feed the preexisting mmaps, as instructed via
> the perf_event_attr for the perf_event_open syscall.
>
> For people not wanting BPF, i.e. disabling it altogether in perf or
> disabling just BPF skels, then we would fallback to the current method,
> or to the one being discussed here when it becomes available.
>
> One thing to have in mind is for this iterator not to generate duplicate
> records for non-pre-existing mmaps, i.e. we would need some generation
> number that would be bumped when asking for such pre-existing maps
> PERF_RECORD_MMAP2 dumps.

Looking briefly at struct vm_area_struct, it doesn't seem like the
kernel maintains any sort of generation (at least not at
vm_area_struct level), so this would be nice to have, I'm sure, but
isn't really related to adding this API. Once the kernel does have
this "VMA generation" counter, it can be trivially added to this
binary interface (which can't be said about /proc/<pid>/maps,
unfortunately).

>
> > It will be up to other similar projects to adopt this, but we'll
> > definitely get this into blazesym as it is actually a problem for the
>
> At some point looking at plugging blazesym somehow with perf may be
> something to consider, indeed.

In the above I meant direct use of this new API in perf code itself,
but yes, blazesym is a generic library for symbolization that handles
ELF/DWARF/GSYM (and I believe more formats), so it indeed might make
sense to use it.

>
> - Arnaldo
>
> > abovementioned Oculus use case. We already had to make a tradeoff (see
> > [2], this wasn't done just because we could, but it was requested by
> > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > the risk of missing some shared libraries that can be loaded later. It
> > would be great to not have to do this tradeoff, which this new API
> > would enable.
> >
> > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> >

[...]

2024-05-06 18:43:49

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Sat, May 4, 2024 at 10:09 PM Ian Rogers <[email protected]> wrote:
> >
> > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> > > >
> > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > >
> > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > ioctl" is going to be different, but step back and show how this is
> > > > going to be used in the real world overall. Pounding on this file is
> > > > not a normal operation, right?
> > > >
> > >
> > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > is doing (see [0], it's Rust and part of the overall library API, I
> > > think C code in this patch is way easier to follow for someone not
> > > familiar with implementation of blazesym, but both implementations are
> > > doing exactly the same sequence of steps). You can do it even less
> > > efficiently by parsing the whole file, building an in-memory lookup
> > > table, then looking up addresses one by one. But that's even slower
> > > and more memory-hungry. So I didn't even bother implementing that, it
> > > would put /proc/<pid>/maps at even more disadvantage.
> > >
> > > Other applications that deal with stack traces (including perf) would
> > > be doing one of those two approaches, depending on circumstances and
> > > level of sophistication of code (and sensitivity to performance).
> >
> > The code in perf doing this is here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > The code is using the api/io.h code:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > Using perf to profile perf it was observed time was spent allocating
> > buffers and locale related activities when using stdio, so io is a
> > lighter weight alternative, albeit with more verbose code than fscanf.
> > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > similar benchmark in `perf bench internals synthesize`.
> >
>
> If I add a new implementation using this ioctl() into
> perf_event__synthesize_mmap_events(), will it be tested from this
> `perf bench internals synthesize`? I'm not too familiar with perf code
> organization, sorry if it's a stupid question. If not, where exactly
> is the code that would be triggered from benchmark?

Yes it would be triggered :-)

Thanks,
Ian

> > Thanks,
> > Ian
> >
> > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > >
> > > > thanks,
> > > >
> > > > greg k-h
> > >

2024-05-06 18:53:59

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.

> > > > > Signed-off-by: Andrii Nakryiko <[email protected]>

> > > > Where is the userspace code that uses this new api you have created?

> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.

> I think the general use case in perf is different. This ioctl API is great
> for live tracing of a single (or a small number of) process(es). And
> yes, perf tools have those tracing use cases too. But I think the
> major use case of perf tools is system-wide profiling.

> For system-wide profiling, you need to process samples of many
> different processes at a high frequency. Now perf record doesn't
> process them and just save it for offline processing (well, it does
> at the end to find out build-ID but it can be omitted).

Since:

Author: Jiri Olsa <[email protected]>
Date: Mon Dec 14 11:54:49 2020 +0100
1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")

We don't need to process the events to find the build ids. I haven't
checked if we still do it to find out which DSOs had hits, but we
shouldn't need to do it for build-ids (unless they were not in memory
when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
haven't checked but IIRC is a possibility if that ELF part isn't in
memory at the time we want to copy it).

If we're still traversing it like that, I guess we can have a knob,
on by default, to not do that and instead create the perf.data
build ID header table with all the build-ids we got from
PERF_RECORD_MMAP2: a (slightly) bigger perf.data file, but no event
processing at the end of a 'perf record' session.

> Doing it online is possible (like perf top) but it would add more
> overhead during the profiling. And we cannot move processing

It comes in the PERF_RECORD_MMAP2, filled by the kernel.

> or symbolization to the end of profiling because some (short-
> lived) tasks can go away.

right

> Also it should support perf report (offline) on data from a
> different kernel or even a different machine.

right

> So it saves the memory map of processes and symbolizes
> the stack trace with it later. Of course it needs to be updated
> as the memory map changes and that's why it tracks mmap
> or similar syscalls with PERF_RECORD_MMAP[2] records.

> A problem with this approach is to get the initial state of all
> (or a target for non-system-wide mode) existing processes.
> We call it synthesizing, and read /proc/PID/maps to generate
> the mmap records.

> I think the below comment from Arnaldo talked about how
> we can improve the synthesizing (which is sequential access
> to proc maps) using BPF.

Yes, I wonder how far Jiri went, Jiri?

- Arnaldo

> Thanks,
> Namhyung
>
>
> >
> > At some point, when BPF iterators became a thing we thought about, IIRC
> > Jiri did some experimentation, but I lost track, of using BPF to
> > synthesize PERF_RECORD_MMAP2 records for pre-existing maps, the layout
> > as in uapi/linux/perf_event.h:
> >
> > /*
> > * The MMAP2 records are an augmented version of MMAP, they add
> > * maj, min, ino numbers to be used to uniquely identify each mapping
> > *
> > * struct {
> > * struct perf_event_header header;
> > *
> > * u32 pid, tid;
> > * u64 addr;
> > * u64 len;
> > * u64 pgoff;
> > * union {
> > * struct {
> > * u32 maj;
> > * u32 min;
> > * u64 ino;
> > * u64 ino_generation;
> > * };
> > * struct {
> > * u8 build_id_size;
> > * u8 __reserved_1;
> > * u16 __reserved_2;
> > * u8 build_id[20];
> > * };
> > * };
> > * u32 prot, flags;
> > * char filename[];
> > * struct sample_id sample_id;
> > * };
> > */
> > PERF_RECORD_MMAP2 = 10,
> >
> > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
> >
> > As perf.data files can be used for many purposes we want them all, so we
> > setup a meta data perf file descriptor to go on receiving the new mmaps
> > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > it in parallel, etc:
> >
> > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> >
> > Usage: perf record [<options>] [<command>]
> > or: perf record [<options>] -- <command> [<options>]
> >
> > --num-thread-synthesize <n>
> > number of threads to run for event synthesis
> > --synth <no|all|task|mmap|cgroup>
> > Fine-tune event synthesis: default=all
> >
> > ⬢[acme@toolbox perf-tools-next]$
> >
> > For this specific initial synthesis of everything the plan, as mentioned
> > about Jiri's experiments, was to use a BPF iterator to just feed the
> > perf ring buffer with those events, that way userspace would just
> > receive the usual records it gets when a new mmap is put in place, the
> > BPF iterator would just feed the preexisting mmaps, as instructed via
> > the perf_event_attr for the perf_event_open syscall.
> >
> > For people not wanting BPF, i.e. disabling it altogether in perf or
> > disabling just BPF skels, then we would fallback to the current method,
> > or to the one being discussed here when it becomes available.
> >
> > One thing to have in mind is for this iterator not to generate duplicate
> > records for non-pre-existing mmaps, i.e. we would need some generation
> > number that would be bumped when asking for such pre-existing maps
> > PERF_RECORD_MMAP2 dumps.
> >
> > > It will be up to other similar projects to adopt this, but we'll
> > > definitely get this into blazesym as it is actually a problem for the
> >
> > At some point looking at plugging blazesym somehow with perf may be
> > something to consider, indeed.
> >
> > - Arnaldo
> >
> > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > [2], this wasn't done just because we could, but it was requested by
> > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > the risk of missing some shared libraries that can be loaded later. It
> > > would be great to not have to do this tradeoff, which this new API
> > > would enable.
> > >
> > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > >
> > > >
> > > > > ---
> > > > > fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
> > > > > include/uapi/linux/fs.h | 32 ++++++++
> > > > > 2 files changed, 197 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 8e503a1635b7..cb7b1ff1a144 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -22,6 +22,7 @@
> > > > > #include <linux/pkeys.h>
> > > > > #include <linux/minmax.h>
> > > > > #include <linux/overflow.h>
> > > > > +#include <linux/buildid.h>
> > > > >
> > > > > #include <asm/elf.h>
> > > > > #include <asm/tlb.h>
> > > > > @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > > return do_maps_open(inode, file, &proc_pid_maps_op);
> > > > > }
> > > > >
> > > > > +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> > > > > +{
> > > > > + struct procfs_procmap_query karg;
> > > > > + struct vma_iterator iter;
> > > > > + struct vm_area_struct *vma;
> > > > > + struct mm_struct *mm;
> > > > > + const char *name = NULL;
> > > > > + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> > > > > + __u64 usize;
> > > > > + int err;
> > > > > +
> > > > > + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> > > > > + return -EFAULT;
> > > > > + if (usize > PAGE_SIZE)
> > > >
> > > > Nice, where did you document that? And how is that portable given that
> > > > PAGE_SIZE can be different on different systems?
> > >
> > > I'm happy to document everything, can you please help by pointing
> > > where this documentation has to live?
> > >
> > > This is mostly fool-proofing, though, because the user has to pass
> > > sizeof(struct procfs_procmap_query), which I don't see ever getting
> > > close to even 4KB (not even saying about 64KB). This is just to
> > > prevent copy_struct_from_user() below from doing too much zero-checking.
> > >
> > > >
> > > > and why aren't you checking the actual structure size instead? You can
> > > > easily run off the end here without knowing it.
> > >
> > > See copy_struct_from_user(), it does more checks. This is a helper
> > > designed specifically to deal with use cases like this where kernel
> > > struct size can change and user space might be newer or older.
> > > copy_struct_from_user() has a nice documentation describing all these
> > > nuances.
> > >
> > > >
> > > > > + return -E2BIG;
> > > > > + if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> > > > > + return -EINVAL;
> > > >
> > > > Ok, so you have two checks? How can the first one ever fail?
> > >
> > > Hmm.. If usize = 8, copy_from_user() won't fail, usize > PAGE_SIZE
> > > won't fail, but this one will fail.
> > >
> > > The point of this check is that user has to specify at least first
> > > three fields of procfs_procmap_query (size, query_flags, and
> > > query_addr), because without those the query is meaningless.
> > > >
> > > >
> > > > > + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> > >
> > > and this helper does more checks validating that the user either has a
> > > shorter struct (and then zero-fills the rest of kernel-side struct) or
> > > a longer one (and then the extra part has to be zero-filled). Do check
> > > copy_struct_from_user() documentation, it's great.
> > >
> > > > > + if (err)
> > > > > + return err;
> > > > > +
> > > > > + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> > > > > + return -EINVAL;
> > > > > + if (!!karg.vma_name_size != !!karg.vma_name_addr)
> > > > > + return -EINVAL;
> > > > > + if (!!karg.build_id_size != !!karg.build_id_addr)
> > > > > + return -EINVAL;
> > > >
> > > > So you want values to be set, right?
> > >
> > > Either both should be set, or neither. It's ok for both size/addr
> > > fields to be zero, in which case it indicates that the user doesn't
> > > want this part of information (which is usually a bit more expensive
> > > to get and might not be necessary for all the cases).
> > >
> > > >
> > > > > +
> > > > > + mm = priv->mm;
> > > > > + if (!mm || !mmget_not_zero(mm))
> > > > > + return -ESRCH;
> > > >
> > > > What is this error for? Where is this documented?
> > >
> > > I copied it from existing /proc/<pid>/maps checks. I presume it's
> > > guarding the case when mm might be already put. So if the process is
> > > gone, but we have /proc/<pid>/maps file open?
> > >
> > > >
> > > > > + if (mmap_read_lock_killable(mm)) {
> > > > > + mmput(mm);
> > > > > + return -EINTR;
> > > > > + }
> > > > > +
> > > > > + vma_iter_init(&iter, mm, karg.query_addr);
> > > > > + vma = vma_next(&iter);
> > > > > + if (!vma) {
> > > > > + err = -ENOENT;
> > > > > + goto out;
> > > > > + }
> > > > > + /* user wants covering VMA, not the closest next one */
> > > > > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > > > > + vma->vm_start > karg.query_addr) {
> > > > > + err = -ENOENT;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + karg.vma_start = vma->vm_start;
> > > > > + karg.vma_end = vma->vm_end;
> > > > > +
> > > > > + if (vma->vm_file) {
> > > > > + const struct inode *inode = file_user_inode(vma->vm_file);
> > > > > +
> > > > > + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> > > > > + karg.dev_major = MAJOR(inode->i_sb->s_dev);
> > > > > + karg.dev_minor = MINOR(inode->i_sb->s_dev);
> > > >
> > > > So the major/minor is that of the file superblock? Why?
> > >
> > > Because an inode number is unique only within a given super block (and even
> > > then it's more complicated, e.g., btrfs subvolumes add more headaches,
> > > I believe). inode + dev maj/min is sometimes used for cache/reuse of
> > > per-binary information (e.g., pre-processed DWARF information, which
> > > is *very* expensive, so anything that allows avoiding this is
> > > helpful).
> > >
> > > >
> > > > > + karg.inode = inode->i_ino;
> > > >
> > > > What is userspace going to do with this?
> > > >
> > >
> > > See above.
> > >
> > > > > + } else {
> > > > > + karg.vma_offset = 0;
> > > > > + karg.dev_major = 0;
> > > > > + karg.dev_minor = 0;
> > > > > + karg.inode = 0;
> > > >
> > > > Why not set everything to 0 up above at the beginning so you never miss
> > > > anything, and you don't miss any holes accidentally in the future.
> > > >
> > >
> > > Stylistic preference, I find this more explicit, but I don't care much
> > > one way or another.
> > >
> > > > > + }
> > > > > +
> > > > > + karg.vma_flags = 0;
> > > > > + if (vma->vm_flags & VM_READ)
> > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> > > > > + if (vma->vm_flags & VM_WRITE)
> > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> > > > > + if (vma->vm_flags & VM_EXEC)
> > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> > > > > + if (vma->vm_flags & VM_MAYSHARE)
> > > > > + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> > > > > +
> > >
> > > [...]
> > >
> > > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > > index 45e4e64fd664..fe8924a8d916 100644
> > > > > --- a/include/uapi/linux/fs.h
> > > > > +++ b/include/uapi/linux/fs.h
> > > > > @@ -393,4 +393,36 @@ struct pm_scan_arg {
> > > > > __u64 return_mask;
> > > > > };
> > > > >
> > > > > +/* /proc/<pid>/maps ioctl */
> > > > > +#define PROCFS_IOCTL_MAGIC 0x9f
> > > >
> > > > Don't you need to document this in the proper place?
> > >
> > > I probably do, but I'm asking for help in knowing where. procfs is not
> > > a typical area of kernel I'm working with, so any pointers are highly
> > > appreciated.
> > >
> > > >
> > > > > +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> > > > > +
> > > > > +enum procmap_query_flags {
> > > > > + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> > > > > +};
> > > > > +
> > > > > +enum procmap_vma_flags {
> > > > > + PROCFS_PROCMAP_VMA_READABLE = 0x01,
> > > > > + PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> > > > > + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> > > > > + PROCFS_PROCMAP_VMA_SHARED = 0x08,
> > > >
> > > > Are these bits? If so, please use the bit macro for it to make it
> > > > obvious.
> > > >
> > >
> > > Yes, they are. When I tried BIT(1), it didn't compile. I chose not to
> > > add any extra #includes to this UAPI header, but I can figure out the
> > > necessary dependency and do BIT(), I just didn't feel like BIT() adds
> > > much here, tbh.
> > >
> > > > > +};
> > > > > +
> > > > > +struct procfs_procmap_query {
> > > > > + __u64 size;
> > > > > + __u64 query_flags; /* in */
> > > >
> > > > Does this map to the procmap_vma_flags enum? if so, please say so.
> > >
> > > no, procmap_query_flags, and yes, I will
> > >
> > > >
> > > > > + __u64 query_addr; /* in */
> > > > > + __u64 vma_start; /* out */
> > > > > + __u64 vma_end; /* out */
> > > > > + __u64 vma_flags; /* out */
> > > > > + __u64 vma_offset; /* out */
> > > > > + __u64 inode; /* out */
> > > >
> > > > What is the inode for, you have an inode for the file already, why give
> > > > it another one?
> > >
> > > This is inode of vma's backing file, same as /proc/<pid>/maps' file
> > > column. What inode of file do I already have here? You mean of
> > > /proc/<pid>/maps itself? It's useless for the intended purposes.
> > >
> > > >
> > > > > + __u32 dev_major; /* out */
> > > > > + __u32 dev_minor; /* out */
> > > >
> > > > What is major/minor for?
> > >
> > > This is the same information as emitted by /proc/<pid>/maps,
> > > identifies superblock of vma's backing file. As I mentioned above, it
> > > can be used for caching per-file (i.e., per-ELF binary) information
> > > (for example).
> > >
> > > >
> > > > > + __u32 vma_name_size; /* in/out */
> > > > > + __u32 build_id_size; /* in/out */
> > > > > + __u64 vma_name_addr; /* in */
> > > > > + __u64 build_id_addr; /* in */
> > > >
> > > > Why not document this all using kerneldoc above the structure?
> > >
> > > Yes, sorry, I slacked a bit on adding this upfront. I knew we'd be
> > > figuring out the best place and approach, and so wanted to avoid
> > > documentation churn.
> > >
> > > Would something like what we have for pm_scan_arg and pagemap APIs
> > > work? I see it added a few simple descriptions for pm_scan_arg struct,
> > > and there is Documentation/admin-guide/mm/pagemap.rst. Should I add
> > > Documentation/admin-guide/mm/procmap.rst (admin-guide part feels off,
> > > though)? Anyways, I'm hoping for pointers where all this should be
> > > documented. Thank you!
> > >
> > > >
> > > > anyway, I don't like ioctls, but there is a place for them, you just
> > > > have to actually justify the use for them and not say "not efficient
> > > > enough" as that normally isn't an issue overall.
> > >
> > > I've written a demo tool in patch #5 which performs a real-world task:
> > > mapping addresses to their VMAs (specifically calculating file offset,
> > > finding vma_start + vma_end range to further access files from
> > > /proc/<pid>/map_files/<start>-<end>). I did the implementation
> > > faithfully, doing it in the most optimal way for both APIs. I showed
> > > that, for a "typical" scenario (it's hard to specify what typical is,
> > > of course, too many variables; the data was collected on a real server
> > > running a real service, with 30 seconds of process-specific stack
> > > traces captured, if I remember correctly), doing exactly the same
> > > amount of work is ~35x slower with /proc/<pid>/maps.
> > >
> > > Take another process, another set of addresses, another anything, and
> > > the numbers will be different, but I think it gives the right idea.
> > >
> > > But I think we are overpivoting on text vs binary distinction here.
> > > It's the more targeted querying of VMAs that's beneficial here. This
> > > allows applications to not cache anything and just re-query when doing
> > > periodic or continuous profiling (where addresses are coming in not as
> > > one batch, but as a sequence of batches spread out over time).
> > >
> > > /proc/<pid>/maps, for all its usefulness, just can't provide this sort
> > > of ability, as it wasn't designed to do that and is targeting
> > > different use cases.
> > >
> > > And then, a new ability to request reliable (it's not 100% reliable
> > > today, I'm going to address that as a follow up) build ID is *crucial*
> > > for some scenarios. In the mentioned Oculus use case, the need to fully
> > > access the underlying ELF binary just to get the build ID is frowned
> > > upon, and for a good reason: the profiler only needs the build ID, which
> > > is neither secret nor sensitive information. This new (and binary, yes)
> > > API allows adding this capability without breaking any backwards
> > > compatibility.
> > >
> > > >
> > > > thanks,
> > > >
> > > > greg k-h
> >
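
To make the intended usage concrete, here is a minimal sketch of a
single-address query (assuming the UAPI quoted above, as posted, ends up
in <linux/fs.h>; the convention of setting the size field to
sizeof(struct procfs_procmap_query) is my assumption, and error handling
is mostly elided):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Resolve one address to its containing (or next) VMA. */
static int query_one(int fd, unsigned long addr)
{
        struct procfs_procmap_query q;
        char name[256];

        memset(&q, 0, sizeof(q));
        q.size = sizeof(q);             /* assumed extensibility convention */
        q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
        q.query_addr = addr;
        q.vma_name_size = sizeof(name); /* zero would skip name fetching */
        q.vma_name_addr = (unsigned long)name;
        q.build_id_size = 0;            /* skip build ID in this example */

        if (ioctl(fd, PROCFS_PROCMAP_QUERY, &q) < 0)
                return -1;

        printf("%llx-%llx off %llx dev %u:%u ino %llu %s\n",
               q.vma_start, q.vma_end, q.vma_offset,
               q.dev_major, q.dev_minor, q.inode,
               q.vma_name_size ? name : "");
        return 0;
}

int main(void)
{
        int fd = open("/proc/self/maps", O_RDONLY);

        if (fd < 0)
                return 1;
        query_one(fd, (unsigned long)&main);    /* VMA holding our own code */
        close(fd);
        return 0;
}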

2024-05-06 18:55:36

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 6, 2024 at 11:05 AM Namhyung Kim <[email protected]> wrote:
>
> Hello,
>
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> >
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.
> >
> > > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> >
> > > > Where is the userspace code that uses this new api you have created?
> >
> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.
>
> I think the general use case in perf is different. This ioctl API is great
> for live tracing of a single (or a small number of) process(es). And
> yes, perf tools have those tracing use cases too. But I think the
> major use case of perf tools is system-wide profiling.

The intended use case is also system-wide profiling, but I haven't
heard that opening a file per process is a big bottleneck or a
limitation, tbh.

>
> For system-wide profiling, you need to process samples of many
> different processes at a high frequency. Now perf record doesn't
> process them and just save it for offline processing (well, it does
> at the end to find out build-ID but it can be omitted).
>
> Doing it online is possible (like perf top) but it would add more
> overhead during the profiling. And we cannot move processing
> or symbolization to the end of profiling because some (short-
> lived) tasks can go away.

We do have some setups where we install a BPF program that monitors
process exit and mmap() events and proactively emits VMA
information. It's not applicable everywhere, and in some setups (like
the Oculus case) we just accept that short-lived processes will be
missed, in exchange for less interruption and simpler, less privileged
"agents" doing profiling and address resolution logic.

So the problem space, as can be seen, is pretty vast and varied, and
there is no single API that would serve all the needs perfectly.
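
For the curious, the general shape of such a BPF-based setup is roughly
the following (a simplified sketch of the idea, not our production code;
the map name and event layout are made up for illustration):

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct proc_event {
        __u32 pid;
        __u32 kind;     /* made-up encoding: 0 = exit, 1 = mmap */
};

struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("tp/sched/sched_process_exit")
int handle_exit(void *ctx)
{
        struct proc_event *e;

        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
                return 0;
        e->pid = bpf_get_current_pid_tgid() >> 32;
        e->kind = 0;
        bpf_ringbuf_submit(e, 0);
        return 0;
}

/* mmap() events would be handled similarly (e.g., from a syscall exit
 * tracepoint), with VMA details filled into the emitted record. */

char LICENSE[] SEC("license") = "GPL";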

>
> Also it should support perf report (offline) on data from a
> different kernel or even a different machine.

We fetch build ID (and resolve file offset) and offload actual
symbolization to a dedicated fleet of servers, whenever possible. We
don't yet do it for kernel stack traces, but we are moving in this
direction (and /proc/kallsyms has problems of its own, being text-based,
listing everything, and pretty big all by itself; but that's a separate
topic).

>
> So it saves the memory map of processes and symbolizes
> the stack trace with it later. Of course it needs to be updated
> as the memory map changes and that's why it tracks mmap
> or similar syscalls with PERF_RECORD_MMAP[2] records.
>
> A problem with this approach is to get the initial state of all
> (or a target for non-system-wide mode) existing processes.
> We call it synthesizing, and read /proc/PID/maps to generate
> the mmap records.
>
> I think the below comment from Arnaldo talked about how
> we can improve the synthesizing (which is sequential access
> to proc maps) using BPF.

Yep. We can also benchmark using this new ioctl() to fetch a full set
of VMAs; it might still be good enough.

>
> Thanks,
> Namhyung
>

[...]

2024-05-06 18:58:59

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

On Sat, May 4, 2024 at 10:26 PM Ian Rogers <[email protected]> wrote:
>
> On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <[email protected]> wrote:
> >
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
> >
> > This patch set was based on top of next-20240503 tag in linux-next tree.
> > Not sure what should be the target tree for this, I'd appreciate any guidance,
> > thank you!
> >
> > Andrii Nakryiko (5):
> > fs/procfs: extract logic for getting VMA name constituents
> > fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
> > tools: sync uapi/linux/fs.h header into tools subdir
> > selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
> > selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
>
> I'd love to see improvements like this for the Linux perf command.
> Some thoughts:
>
> - Could we do something scalability wise better than a file
> descriptor per pid? If a profiler is running in a container the cost
> of many file descriptors can be significant, and something that
> increases as machines get larger. Could we have a /proc/maps for all
> processes?

It's probably not a question for me, as it seems like an entirely
different set of APIs. But it also seems a bit convoluted to mix
together information about many address spaces.

As for the cost of FDs, I haven't run into this limitation, and it
seems like the trend in Linux in general is towards "everything is a
file". Just look at pidfd, for example.

Also, having an fd that can be queried has an extra nice property. For
example, opening /proc/self/maps (i.e., the process' own maps file)
doesn't require any extra permissions, and then the fd can be
transferred to another trusted process that would do address
resolution/symbolization. In practice, right now it's unavoidable to
add extra caps/root permissions to the profiling process even if the
only thing it needs is the contents of /proc/<pid>/maps (and the use
case is as benign as symbol resolution). Not having an FD for this API
would make this use case unworkable.
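
For illustration, transferring the fd to such a trusted process is just
the standard SCM_RIGHTS dance over a unix domain socket; a minimal
sketch of the sending side (error handling elided):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send maps_fd over an already-connected unix domain socket. */
static int send_maps_fd(int sock, int maps_fd)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                struct cmsghdr align;
                char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &maps_fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The receiving process can then issue PROCFS_PROCMAP_QUERY ioctl()s on
the received fd without needing any access to the target process itself.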

>
> - Something that is broken in perf currently is that we can race
> between reading /proc and opening events on the pids it contains. For
> example, perf top supports a uid option that first scans to find all
> processes owned by a user then tries to open an event on each process.
> This fails if the process terminates between the scan and the open
> leading to a frequent:
> ```
> $ sudo perf top -u `id -u`
> The sys_perf_event_open() syscall returned with 3 (No such process)
> for event (cycles:P).
> ```
> It would be nice for the API to consider cgroups, uids and the like as
> ways to get a subset of things to scan.

This seems like putting too much into one API, tbh. It feels like
mapping cgroups/uids to their processes is its own problem, and if we
don't have efficient APIs to do this, we should add them. But conflating
it with "get VMAs from this process" seems wrong to me.

>
> - Some what related, the mmap perf events give data after the mmap
> call has happened. As VMAs get merged this can lead to mmap perf
> events looking like the memory overlaps (for jits using anonymous
> memory) and we lack munmap/mremap events.

Is this related to the "VMA generation" that Arnaldo mentioned? I'd
happily add it to the new API, as it's easily extensible, if the
kernel already maintains it. If not, then it should be a separate work
to discuss whether kernel *should* track this information.

>
> Jiri Olsa has looked at improvements in this area in the past.
>
> Thanks,
> Ian
>
> > fs/proc/task_mmu.c | 290 +++++++++++---
> > include/uapi/linux/fs.h | 32 ++
> > .../perf/trace/beauty/include/uapi/linux/fs.h | 32 ++
> > tools/testing/selftests/bpf/.gitignore | 1 +
> > tools/testing/selftests/bpf/Makefile | 2 +-
> > tools/testing/selftests/bpf/procfs_query.c | 366 ++++++++++++++++++
> > tools/testing/selftests/bpf/test_progs.c | 3 +
> > tools/testing/selftests/bpf/test_progs.h | 2 +
> > tools/testing/selftests/bpf/trace_helpers.c | 105 ++++-
> > 9 files changed, 763 insertions(+), 70 deletions(-)
> > create mode 100644 tools/testing/selftests/bpf/procfs_query.c
> >
> > --
> > 2.43.0
> >
> >

2024-05-06 19:16:49

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 06, 2024 at 03:53:40PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > > it, saving resources.
>
> > > > > > Signed-off-by: Andrii Nakryiko <[email protected]>
>
> > > > > Where is the userspace code that uses this new api you have created?
>
> > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > > ioctl() API to solve a common problem (as described above) in patch
> > > > #5. The plan is to put it in mentioned blazesym library at the very
> > > > least.
> > > >
> > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > > linux-perf-user), as they need to do stack symbolization as well.
>
> > I think the general use case in perf is different. This ioctl API is great
> > for live tracing of a single (or a small number of) process(es). And
> > yes, perf tools have those tracing use cases too. But I think the
> > major use case of perf tools is system-wide profiling.
>
> > For system-wide profiling, you need to process samples of many
> > different processes at a high frequency. Now perf record doesn't
> > process them and just save it for offline processing (well, it does
> > at the end to find out build-ID but it can be omitted).
>
> Since:
>
> Author: Jiri Olsa <[email protected]>
> Date: Mon Dec 14 11:54:49 2020 +0100
> 1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")
>
> We don't need to process the events to find the build ids. I haven't
> checked if we still do it to find out which DSOs had hits, but we
> shouldn't need to do it for build-ids (unless they were not in memory
> when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
> haven't checked but IIRC is a possibility if that ELF part isn't in
> memory at the time we want to copy it).

> If we're still traversing it like that I guess we can have a knob and
> make it the default to not do that and instead create the perf.data
> build ID header table with all the build-ids we got from
> PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event
> processing at the end of a 'perf record' session.

But then we don't process the PERF_RECORD_MMAP2 in 'perf record'; it
just goes on directly to the perf.data file :-\

Humm, perhaps the sideband thread...

- Arnaldo

2024-05-06 20:35:23

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 06, 2024 at 11:41:43AM -0700, Andrii Nakryiko wrote:
> On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> >
> > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > it, saving resources.
> >
> > > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> >
> > > > Where is the userspace code that uses this new api you have created?
> >
> > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > ioctl() API to solve a common problem (as described above) in patch
> > > #5. The plan is to put it in mentioned blazesym library at the very
> > > least.
> > >
> > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > linux-perf-user), as they need to do stack symbolization as well.
> >
> > At some point, when BPF iterators became a thing, we thought about
> > using BPF to synthesize PERF_RECORD_MMAP2 records for pre-existing
> > maps (IIRC Jiri did some experimentation, but I lost track), the layout
> > as in uapi/linux/perf_event.h:
> >
> > /*
> > * The MMAP2 records are an augmented version of MMAP, they add
> > * maj, min, ino numbers to be used to uniquely identify each mapping
> > *
> > * struct {
> > * struct perf_event_header header;
> > *
> > * u32 pid, tid;
> > * u64 addr;
> > * u64 len;
> > * u64 pgoff;
> > * union {
> > * struct {
> > * u32 maj;
> > * u32 min;
> > * u64 ino;
> > * u64 ino_generation;
> > * };
> > * struct {
> > * u8 build_id_size;
> > * u8 __reserved_1;
> > * u16 __reserved_2;
> > * u8 build_id[20];
> > * };
> > * };
> > * u32 prot, flags;
> > * char filename[];
> > * struct sample_id sample_id;
> > * };
> > */
> > PERF_RECORD_MMAP2 = 10,
> >
> > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
> >
> > As perf.data files can be used for many purposes we want them all, so we
>
> ok, so because you want them all and you don't know which VMAs will be
> useful or not, it's a different problem. BPF iterators will be faster
> purely due to avoiding binary -> text -> binary conversion path, but
> other than that you'll still retrieve all VMAs.

But not using tons of syscalls to parse text data from /proc.

> You can still do the same full VMA iteration with this new API, of
> course, but advantages are probably smaller as you'll be retrieving a
> full set of VMAs regardless (though it would be interesting to compare
> anyways).

sure, I can't see how it would be faster, but yeah, interesting to see
what is the difference.

> > set up a metadata perf file descriptor to go on receiving the new mmaps
> > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > it in parallel, etc:
> >
> > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> >
> > Usage: perf record [<options>] [<command>]
> > or: perf record [<options>] -- <command> [<options>]
> >
> > --num-thread-synthesize <n>
> > number of threads to run for event synthesis
> > --synth <no|all|task|mmap|cgroup>
> > Fine-tune event synthesis: default=all
> >
> > ⬢[acme@toolbox perf-tools-next]$
> >
> > For this specific initial synthesis of everything the plan, as mentioned
> > about Jiri's experiments, was to use a BPF iterator to just feed the
> > perf ring buffer with those events, that way userspace would just
> > receive the usual records it gets when a new mmap is put in place, the
> > BPF iterator would just feed the preexisting mmaps, as instructed via
> > the perf_event_attr for the perf_event_open syscall.
> >
> > For people not wanting BPF, i.e. disabling it altogether in perf or
> > disabling just BPF skels, then we would fallback to the current method,
> > or to the one being discussed here when it becomes available.
> >
> > One thing to have in mind is for this iterator not to generate duplicate
> > records for non-pre-existing mmaps, i.e. we would need some generation
> > number that would be bumped when asking for such pre-existing maps
> > PERF_RECORD_MMAP2 dumps.
>
> Looking briefly at struct vm_area_struct, it doesn't seem like the
> kernel maintains any sort of generation (at least not at
> vm_area_struct level), so this would be nice to have, I'm sure, but

Yeah, this would be something specific to the "retrieve me the list of
VMAs" bulky thing, i.e. the kernel perf code (or the BPF program that
would generate the PERF_RECORD_MMAP2 records by using a BPF vma
iterator) would bump the generation number and store it in the VMA in
perf_event_mmap(), so that the iterator doesn't consider it, as it is a
new mmap that is just being sent to whoever is listening, and the perf
tool that put in place the BPF program to iterate is listening.

> isn't really related to adding this API. Once the kernel does have

Well, perf wants to enumerate pre-existing mmaps _and_ after that
finishes to know about new mmaps, so we need a way to avoid
having the BPF program that enumerates pre-existing maps sending
PERF_RECORD_MMAP2 for maps perf already knows about via a regular
PERF_RECORD_MMAP2 sent when a new mmap is put in place.

So there is an overlap where perf (or any other tool wanting to
enumerate all pre-existing maps and new ones) can receive info for the
same map from the enumerator and from the existing mechanism generating
PERF_RECORD_MMAP2 records.

- Arnaldo

> this "VMA generation" counter, it can be trivially added to this
> binary interface (which can't be said about /proc/<pid>/maps,
> unfortunately).
>
> >
> > > It will be up to other similar projects to adopt this, but we'll
> > > definitely get this into blazesym as it is actually a problem for the
> >
> > At some point looking at plugging blazesym somehow with perf may be
> > something to consider, indeed.
>
> In the above I meant direct use of this new API in perf code itself,
> but yes, blazesym is a generic library for symbolization that handles
> ELF/DWARF/GSYM (and I believe more formats), so it indeed might make
> sense to use it.
>
> >
> > - Arnaldo
> >
> > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > [2], this wasn't done just because we could, but it was requested by
> > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > the risk of missing some shared libraries that can be loaded later. It
> > > would be great to not have to do this tradeoff, which this new API
> > > would enable.
> > >
> > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > >
>
> [...]

2024-05-07 05:07:24

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Mon, May 6, 2024 at 11:43 AM Ian Rogers <[email protected]> wrote:
>
> On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <[email protected]> wrote:
> > >
> > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > <[email protected]> wrote:
> > > >
> > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> > > > >
> > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > >
> > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > ioctl" is going to be different, but step back and show how this is
> > > > > going to be used in the real world overall. Pounding on this file is
> > > > > not a normal operation, right?
> > > > >
> > > >
> > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > think C code in this patch is way easier to follow for someone not
> > > > familiar with implementation of blazesym, but both implementations are
> > > > doing exactly the same sequence of steps). You can do it even less
> > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > table, then looking up addresses one by one. But that's even slower
> > > > and more memory-hungry. So I didn't even bother implementing that, it
> > > > would put /proc/<pid>/maps at even more disadvantage.
> > > >
> > > > Other applications that deal with stack traces (including perf) would
> > > > be doing one of those two approaches, depending on circumstances and
> > > > level of sophistication of code (and sensitivity to performance).
> > >
> > > The code in perf doing this is here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > The code is using the api/io.h code:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > Using perf to profile perf, it was observed that time was spent
> > > allocating buffers and in locale-related activities when using stdio,
> > > so io is a lighter-weight alternative, albeit with more verbose code
> > > than fscanf.
> > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > similar benchmark in `perf bench internals synthesize`.
> > >
> >
> > If I add a new implementation using this ioctl() into
> > perf_event__synthesize_mmap_events(), will it be tested from this
> > `perf bench internals synthesize`? I'm not too familiar with perf code
> > organization, sorry if it's a stupid question. If not, where exactly
> > is the code that would be triggered from benchmark?
>
> Yes it would be triggered :-)

Ok, I don't exactly know how to interpret the results (and what the
benchmark is doing), but numbers don't seem to be worse. They actually
seem to be a bit better.

I pushed my code that adds perf integration to [0]. That commit has
results, but I'll post them here (with invocation parameters).
perf-ioctl is the version with ioctl()-based implementation,
perf-parse is, logically, text-parsing version. Here are the results
(and see my notes below the results as well):

TEXT-BASED
==========

# ./perf-parse bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 80.311 usec (+- 0.077 usec)
Average num. events: 32.000 (+- 0.000)
Average time per event 2.510 usec
Average data synthesis took: 84.429 usec (+- 0.066 usec)
Average num. events: 179.000 (+- 0.000)
Average time per event 0.472 usec

# ./perf-parse bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 79.900 usec (+- 0.077 usec)
Average num. events: 32.000 (+- 0.000)
Average time per event 2.497 usec
Average data synthesis took: 84.832 usec (+- 0.074 usec)
Average num. events: 180.000 (+- 0.000)
Average time per event 0.471 usec

# ./perf-parse bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 36338.100 usec (+- 406.091 usec)
Average num. events: 14091.300 (+- 7.433)
Average time per event 2.579 usec
Number of synthesis threads: 2
Average synthesis took: 37071.200 usec (+- 746.498 usec)
Average num. events: 14085.900 (+- 1.900)
Average time per event 2.632 usec
Number of synthesis threads: 3
Average synthesis took: 33932.300 usec (+- 626.861 usec)
Average num. events: 14085.900 (+- 1.900)
Average time per event 2.409 usec
Number of synthesis threads: 4
Average synthesis took: 33822.700 usec (+- 506.290 usec)
Average num. events: 14099.200 (+- 8.761)
Average time per event 2.399 usec
Number of synthesis threads: 5
Average synthesis took: 33348.200 usec (+- 389.771 usec)
Average num. events: 14085.900 (+- 1.900)
Average time per event 2.367 usec
Number of synthesis threads: 6
Average synthesis took: 33269.600 usec (+- 350.341 usec)
Average num. events: 14084.000 (+- 0.000)
Average time per event 2.362 usec
Number of synthesis threads: 7
Average synthesis took: 32663.900 usec (+- 338.870 usec)
Average num. events: 14085.900 (+- 1.900)
Average time per event 2.319 usec
Number of synthesis threads: 8
Average synthesis took: 32748.400 usec (+- 285.450 usec)
Average num. events: 14085.900 (+- 1.900)
Average time per event 2.325 usec

IOCTL-BASED
===========
# ./perf-ioctl bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 72.996 usec (+- 0.076 usec)
Average num. events: 31.000 (+- 0.000)
Average time per event 2.355 usec
Average data synthesis took: 79.067 usec (+- 0.074 usec)
Average num. events: 178.000 (+- 0.000)
Average time per event 0.444 usec

# ./perf-ioctl bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 73.921 usec (+- 0.073 usec)
Average num. events: 31.000 (+- 0.000)
Average time per event 2.385 usec
Average data synthesis took: 80.545 usec (+- 0.070 usec)
Average num. events: 178.000 (+- 0.000)
Average time per event 0.453 usec

# ./perf-ioctl bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 35609.500 usec (+- 428.576 usec)
Average num. events: 14040.700 (+- 1.700)
Average time per event 2.536 usec
Number of synthesis threads: 2
Average synthesis took: 34293.800 usec (+- 453.811 usec)
Average num. events: 14040.700 (+- 1.700)
Average time per event 2.442 usec
Number of synthesis threads: 3
Average synthesis took: 32385.200 usec (+- 363.106 usec)
Average num. events: 14040.700 (+- 1.700)
Average time per event 2.307 usec
Number of synthesis threads: 4
Average synthesis took: 33113.100 usec (+- 553.931 usec)
Average num. events: 14054.500 (+- 11.469)
Average time per event 2.356 usec
Number of synthesis threads: 5
Average synthesis took: 31600.600 usec (+- 297.349 usec)
Average num. events: 14012.500 (+- 4.590)
Average time per event 2.255 usec
Number of synthesis threads: 6
Average synthesis took: 32309.900 usec (+- 472.225 usec)
Average num. events: 14004.000 (+- 0.000)
Average time per event 2.307 usec
Number of synthesis threads: 7
Average synthesis took: 31400.100 usec (+- 206.261 usec)
Average num. events: 14004.800 (+- 0.800)
Average time per event 2.242 usec
Number of synthesis threads: 8
Average synthesis took: 31601.400 usec (+- 303.350 usec)
Average num. events: 14005.700 (+- 1.700)
Average time per event 2.256 usec

I also double-checked (using strace) that it does what it is supposed
to do, and it seems like everything checks out. Here's the text-based
strace log:

openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
read(3, "", 8192) = 0
close(3) = 0


BTW, note how the kernel doesn't serve more than 4KB of data, even
though perf provides an 8KB buffer (that's to Greg's question about
optimizing using bigger buffers; I suspect without seq_file changes,
it won't work).

And here's an abbreviated log for the ioctl version; it has lots more
(but much faster) ioctl() syscalls, given it dumps everything:

openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0

... 195 ioctl() calls in total ...

ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
= -1 ENOENT (No such file or directory)
close(3) = 0


So, it's not the optimal usage of this API, and yet it's still better
(or at least not worse) than the text-based API.
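
The loop behind those ~195 ioctl() calls is essentially the following
sketch (it matches the strace above, including the terminating ENOENT;
consuming the results is elided, and the size-field convention is my
assumption):

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Walk all VMAs by repeatedly asking for the VMA at or after addr. */
static int dump_all_vmas(int fd)
{
        struct procfs_procmap_query q;
        unsigned long addr = 0;

        for (;;) {
                memset(&q, 0, sizeof(q));
                q.size = sizeof(q);
                q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;
                q.query_addr = addr;
                if (ioctl(fd, PROCFS_PROCMAP_QUERY, &q) < 0) {
                        if (errno == ENOENT)    /* no VMA at or after addr */
                                return 0;       /* done */
                        return -1;
                }
                /* ... consume q.vma_start / q.vma_end / q.vma_flags ... */
                addr = q.vma_end;       /* continue right past this VMA */
        }
}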

[0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35

>
> Thanks,
> Ian
>
> > > Thanks,
> > > Ian
> > >
> > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > >
> > > > > thanks,
> > > > >
> > > > > greg k-h
> > > >

2024-05-07 15:50:25

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

.. Adding Suren & Willy to the Cc

* Andrii Nakryiko <[email protected]> [240504 18:14]:
> On Sat, May 4, 2024 at 8:32 AM Greg KH <[email protected]> wrote:
> >
> > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > I also did an strace run of both cases. In text-based one the tool did
> > > 68 read() syscalls, fetching up to 4KB of data in one go.
> >
> > Why not fetch more at once?
> >
>
> I didn't expect to be interrogated so much on the performance of the
> text parsing front, sorry. :) You can probably tune this, but where is
> the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> production numbers.

The reason the file reads are limited to 4KB is that this file is
used for monitoring processes. We have a significant number of
organisations polling this file so frequently that the mmap lock
contention becomes an issue. (reading a file is free, right?) People
also tend to try to figure out why a process is slow by reading this
file - which amplifies the lock contention.

What happens today is that the lock is yielded after 4KB to allow time
for mmap writes to happen. This also means your data may be
inconsistent from one 4KB block to the next (the write may be around
this boundary).

This new interface also takes the lock in do_procmap_query() and does
the 4KB blocks as well. Extending this size means more time spent
blocking mmap writes, but a more consistent view of the world (less
"tearing" of the addresses).

We are working to reduce these issues by switching the /proc/<pid>/maps
file to use rcu lookup. I would recommend we do not proceed with this
interface using the old method and instead implement it using rcu from
the start - if it fits your use case (or we can make it fit your use
case).

At least, for most page faults, we can work around the lock contention
(since v6.6), but not all and not on all archs.

...

>
> > > In comparison,
> > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > > relevant VMAs.
> > >
> > > It is projected that savings from processing big production applications
> > > would only widen the gap in favor of binary-based querying ioctl API, as
> > > bigger applications will tend to have even more non-executable VMA
> > > mappings relative to executable ones.
> >
> > Define "bigger applications" please. Is this some "large database
> > company workload" type of thing, or something else?
>
> I don't have a definition. But I had in mind, as one example, an
> ads-serving service we use internally (it's a pretty large application
> by pretty much any metric you can come up with). I just randomly
> picked one of the production hosts, found one instance of that
> service, and looked at its /proc/<pid>/maps file. Hopefully it will
> satisfy your need for specifics.
>
> # cat /proc/1126243/maps | wc -c
> 1570178
> # cat /proc/1126243/maps | wc -l
> 28875
> # cat /proc/1126243/maps | grep ' ..x. ' | wc -l
> 7347

We have distributions increasing the map_count to an insane number to
allow games to work [1]. It is, unfortunately, only a matter of time until
this is regularly an issue, as it is being normalised and allowed by an
increasing number of distributions (Fedora, Arch, Ubuntu). So, despite
my email address, I am not talking about large database companies here.

Also, note that applications that use guard VMAs double the number for
the guards. Fun stuff.

We are really doing a lot in the VMA area to reduce the mmap locking
contention and it seems you have a use case for a new interface that can
leverage these changes.

We have at least two talks around this area at LSF if you are attending.

Thanks,
Liam

[1] https://lore.kernel.org/linux-mm/[email protected]/


2024-05-07 16:22:19

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

* Matthew Wilcox <[email protected]> [240507 12:10]:
> On Tue, May 07, 2024 at 11:48:44AM -0400, Liam R. Howlett wrote:
> > .. Adding Suren & Willy to the Cc
>
> I've been staying out of this disaster. I thought Steven Rostedt was
> going to do all of this in the kernel anyway. Wasn't there a session on
> that at LSFMM in Vancouver last year?

SFrames? The only other one that comes to mind is the one where he and
Kent were yelling at each other.

2024-05-07 16:33:30

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> .. Adding Suren & Willy to the Cc
>
> * Andrii Nakryiko <[email protected]> [240504 18:14]:
> > On Sat, May 4, 2024 at 8:32 AM Greg KH <[email protected]> wrote:
> > >
> > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > I also did an strace run of both cases. In text-based one the tool did
> > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > >
> > > Why not fetch more at once?
> > >
> >
> > I didn't expect to be interrogated so much on the performance of the
> > text parsing front, sorry. :) You can probably tune this, but where is
> > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > production numbers.
>
> The reason the file reads are limited to 4KB is that this file is
> used for monitoring processes. We have a significant number of
> organisations polling this file so frequently that the mmap lock
> contention becomes an issue. (reading a file is free, right?) People
> also tend to try to figure out why a process is slow by reading this
> file - which amplifies the lock contention.
>
> What happens today is that the lock is yielded after 4KB to allow time
> for mmap writes to happen. This also means your data may be
> inconsistent from one 4KB block to the next (the write may be around
> this boundary).
>
> This new interface also takes the lock in do_procmap_query() and does
> the 4KB blocks as well. Extending this size means more time spent
> blocking mmap writes, but a more consistent view of the world (less
> "tearing" of the addresses).

Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
does a single VMA lookup (presumably an O(log N) operation) using a
single vma_iter_init(addr) + vma_next() call on a vma_iterator.

As for the mmap_read_lock_killable() (is that what we are talking
about?), I'm happy to use anything else available; please give me a
pointer. But I suspect, given how fast and small this new API is, that
taking mmap_read_lock_killable() in it is not comparable to holding the
lock while producing /proc/<pid>/maps contents.
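
For reference, the lookup at the core of do_procmap_query() is
conceptually just the following (a simplified sketch of the idea, not
the actual patch code):

/* Find the VMA containing addr, or the next one after it. */
static struct vm_area_struct *query_vma(struct mm_struct *mm,
                                        unsigned long addr)
{
        struct vma_iterator vmi;

        /* caller has taken mmap_read_lock_killable(mm) */
        vma_iter_init(&vmi, mm, addr);
        return vma_next(&vmi);  /* O(log N) maple tree walk */
}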

>
> We are working to reduce these issues by switching the /proc/<pid>/maps
> file to use rcu lookup. I would recommend we do not proceed with this
> interface using the old method and instead implement it using rcu from
> the start - if it fits your use case (or we can make it fit your use
> case).
>
> At least, for most page faults, we can work around the lock contention
> (since v6.6), but not all and not on all archs.
>
> ...
>
> >
> > > > In comparison,
> > > > ioctl-based implementation had to do only 6 ioctl() calls to fetch all
> > > > relevant VMAs.
> > > >
> > > > It is projected that savings from processing big production applications
> > > > would only widen the gap in favor of binary-based querying ioctl API, as
> > > > bigger applications will tend to have even more non-executable VMA
> > > > mappings relative to executable ones.
> > >
> > > Define "bigger applications" please. Is this some "large database
> > > company workload" type of thing, or something else?
> >
> > I don't have a definition. But I had in mind, as one example, an
> > ads-serving service we use internally (it's a pretty large application
> > by pretty much any metric you can come up with). I just randomly
> > picked one of the production hosts, found one instance of that
> > service, and looked at its /proc/<pid>/maps file. Hopefully it will
> > satisfy your need for specifics.
> >
> > # cat /proc/1126243/maps | wc -c
> > 1570178
> > # cat /proc/1126243/maps | wc -l
> > 28875
> > # cat /proc/1126243/maps | grep ' ..x. ' | wc -l
> > 7347
>
> We have distributions increasing the map_count to an insane number to
> allow games to work [1]. It is, unfortunately, only a matter of time until
> this is regularly an issue, as it is being normalised and allowed by an
> increasing number of distributions (Fedora, Arch, Ubuntu). So, despite
> my email address, I am not talking about large database companies here.
>
> Also, note that applications that use guard VMAs double the number for
> the guards. Fun stuff.
>
> We are really doing a lot in the VMA area to reduce the mmap locking
> contention and it seems you have a use case for a new interface that can
> leverage these changes.
>
> We have at least two talks around this area at LSF if you are attending.

I am attending LSFMM, yes, I'll try to not miss them.

>
> Thanks,
> Liam
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>

2024-05-07 16:43:38

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 6, 2024 at 1:35 PM Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> On Mon, May 06, 2024 at 11:41:43AM -0700, Andrii Nakryiko wrote:
> > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > >
> > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > > it, saving resources.
> > >
> > > > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > >
> > > > > Where is the userspace code that uses this new api you have created?
> > >
> > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > > ioctl() API to solve a common problem (as described above) in patch
> > > > #5. The plan is to put it in mentioned blazesym library at the very
> > > > least.
> > > >
> > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > > linux-perf-user), as they need to do stack symbolization as well.
> > >
> > > At some point, when BPF iterators became a thing, we thought about
> > > using BPF to synthesize PERF_RECORD_MMAP2 records for pre-existing
> > > maps (IIRC Jiri did some experimentation, but I lost track), the layout
> > > as in uapi/linux/perf_event.h:
> > >
> > > /*
> > > * The MMAP2 records are an augmented version of MMAP, they add
> > > * maj, min, ino numbers to be used to uniquely identify each mapping
> > > *
> > > * struct {
> > > * struct perf_event_header header;
> > > *
> > > * u32 pid, tid;
> > > * u64 addr;
> > > * u64 len;
> > > * u64 pgoff;
> > > * union {
> > > * struct {
> > > * u32 maj;
> > > * u32 min;
> > > * u64 ino;
> > > * u64 ino_generation;
> > > * };
> > > * struct {
> > > * u8 build_id_size;
> > > * u8 __reserved_1;
> > > * u16 __reserved_2;
> > > * u8 build_id[20];
> > > * };
> > > * };
> > > * u32 prot, flags;
> > > * char filename[];
> > > * struct sample_id sample_id;
> > > * };
> > > */
> > > PERF_RECORD_MMAP2 = 10,
> > >
> > > * PERF_RECORD_MISC_MMAP_BUILD_ID - PERF_RECORD_MMAP2 event
> > >
> > > As perf.data files can be used for many purposes we want them all, so we
> >
> > ok, so because you want them all and you don't know which VMAs will be
> > useful or not, it's a different problem. BPF iterators will be faster
> > purely due to avoiding binary -> text -> binary conversion path, but
> > other than that you'll still retrieve all VMAs.
>
> But not using tons of syscalls to parse text data from /proc.

In terms of syscall *count*, you win with 4KB text reads: there are
fewer syscalls because of this 4KB-based batching. But the cost of a
syscall plus the amount of user-space processing is a different matter. My
benchmark in perf (see patch #5 discussion) suggests that even with
more ioctl() syscalls, perf would win here.

But I also realized that what you really need (I think, correct me if
I'm wrong) is only file-backed VMAs, because all the other ones are
not that useful for symbolization. So I'm adding a minimal change to
my code to allow the user to specify another query flag to only return
file-backed VMAs. I'm going to try it with perf code and see how that
helps. I'll post results in the patch #5 thread, once I have them.
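
Sketch of what I mean (the flag name is hypothetical at this point):

enum procmap_query_flags {
        PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
        /* hypothetical: restrict iteration to file-backed VMAs */
        PROCFS_PROCMAP_FILE_BACKED_VMA = 0x02,
};

Symbolizers would then pass both flags in query_flags, and the kernel
would skip anonymous VMAs during the next-VMA search, instead of
returning them just to be filtered out in user space.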

>
> > You can still do the same full VMA iteration with this new API, of
> > course, but advantages are probably smaller as you'll be retrieving a
> > full set of VMAs regardless (though it would be interesting to compare
> > anyways).
>
> sure, I can't see how it would be faster, but yeah, interesting to see
> what is the difference.

see patch #5 thread, seems like it's still a bit faster

>
> > > set up a metadata perf file descriptor to go on receiving the new mmaps
> > > while we read /proc/<pid>/maps, to reduce the chance of missing maps, do
> > > it in parallel, etc:
> > >
> > > ⬢[acme@toolbox perf-tools-next]$ perf record -h 'event synthesis'
> > >
> > > Usage: perf record [<options>] [<command>]
> > > or: perf record [<options>] -- <command> [<options>]
> > >
> > > --num-thread-synthesize <n>
> > > number of threads to run for event synthesis
> > > --synth <no|all|task|mmap|cgroup>
> > > Fine-tune event synthesis: default=all
> > >
> > > ⬢[acme@toolbox perf-tools-next]$
> > >
> > > For this specific initial synthesis of everything the plan, as mentioned
> > > about Jiri's experiments, was to use a BPF iterator to just feed the
> > > perf ring buffer with those events, that way userspace would just
> > > receive the usual records it gets when a new mmap is put in place, the
> > > BPF iterator would just feed the preexisting mmaps, as instructed via
> > > the perf_event_attr for the perf_event_open syscall.
> > >
> > > For people not wanting BPF, i.e. disabling it altogether in perf or
> > > disabling just BPF skels, then we would fallback to the current method,
> > > or to the one being discussed here when it becomes available.
> > >
> > > One thing to have in mind is for this iterator not to generate duplicate
> > > records for non-pre-existing mmaps, i.e. we would need some generation
> > > number that would be bumped when asking for such pre-existing maps
> > > PERF_RECORD_MMAP2 dumps.
> >
> > Looking briefly at struct vm_area_struct, it doesn't seem like the
> > kernel maintains any sort of generation (at least not at
> > vm_area_struct level), so this would be nice to have, I'm sure, but
>
> Yeah, this would be something specific to the "retrieve me the list of
> VMAs" bulky thing, i.e. the kernel perf code (or the BPF program that
> would generate the PERF_RECORD_MMAP2 records by using a BPF vma
> iterator) would bump the generation number and store it in the VMA in
> perf_event_mmap(), so that the iterator doesn't consider it, as it is a
> new mmap that is just being sent to whoever is listening, and the perf
> tool that put in place the BPF program to iterate is listening.

Ok, we went on *so many* tangents in emails on this patch set :) Seems
like there are a bunch of perf-specific improvements possible which
are completely irrelevant to the API I'm proposing. Let's please keep
them separate (and you, perf folks, should propose them upstream);
it's getting hard to see what this patch set is actually about with
all the tangential emails.

>
> > isn't really related to adding this API. Once the kernel does have
>
> Well, perf wants to enumerate pre-existing mmaps _and_ after that
> finishes to know about new mmaps, so we need a way to avoid
> having the BPF program that enumerates pre-existing maps sending
> PERF_RECORD_MMAP2 for maps perf already knows about via a regular
> PERF_RECORD_MMAP2 sent when a new mmap is put in place.
>
> So there is an overlap where perf (or any other tool wanting to
> enumerate all pre-existing maps and new ones) can receive info for the
> same map from the enumerator and from the existing mechanism generating
> PERF_RECORD_MMAP2 records.
>
> - Arnaldo
>
> > this "VMA generation" counter, it can be trivially added to this
> > binary interface (which can't be said about /proc/<pid>/maps,
> > unfortunately).
> >
> > >
> > > > It will be up to other similar projects to adopt this, but we'll
> > > > definitely get this into blazesym as it is actually a problem for the
> > >
> > > At some point looking at plugging blazesym somehow with perf may be
> > > something to consider, indeed.
> >
> > In the above I meant direct use of this new API in perf code itself,
> > but yes, blazesym is a generic library for symbolization that handles
> > ELF/DWARF/GSYM (and I believe more formats), so it indeed might make
> > sense to use it.
> >
> > >
> > > - Arnaldo
> > >
> > > > abovementioned Oculus use case. We already had to make a tradeoff (see
> > > > [2], this wasn't done just because we could, but it was requested by
> > > > Oculus customers) to cache the contents of /proc/<pid>/maps and run
> > > > the risk of missing some shared libraries that can be loaded later. It
> > > > would be great to not have to do this tradeoff, which this new API
> > > > would enable.
> > > >
> > > > [2] https://github.com/libbpf/blazesym/commit/6b521314126b3ae6f2add43e93234b59fed48ccf
> > > >
> >
> > [...]

2024-05-07 17:29:40

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko
<[email protected]> wrote:
>
> On Mon, May 6, 2024 at 11:43 AM Ian Rogers <[email protected]> wrote:
> >
> > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <[email protected]> wrote:
> > > >
> > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > > >
> > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > > ioctl" is going to be different, but step back and show how this is
> > > > > > going to be used in the real world overall. Pounding on this file is
> > > > > > not a normal operation, right?
> > > > > >
> > > > >
> > > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > > think C code in this patch is way easier to follow for someone not
> > > > > familiar with implementation of blazesym, but both implementations are
> > > > > doing exactly the same sequence of steps). You can do it even less
> > > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > > table, then looking up addresses one by one. But that's even slower
> > > > > and more memory-hungry. So I didn't even bother implementing that, it
> > > > > would put /proc/<pid>/maps at even more disadvantage.
> > > > >
> > > > > Other applications that deal with stack traces (including perf) would
> > > > > be doing one of those two approaches, depending on circumstances and
> > > > > level of sophistication of code (and sensitivity to performance).
> > > >
> > > > The code in perf doing this is here:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > > The code is using the api/io.h code:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > > Using perf to profile perf, it was observed that time was spent
> > > > allocating buffers and in locale-related activities when using stdio,
> > > > so io is a lighter-weight alternative, albeit with more verbose code
> > > > than fscanf.
> > > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > > similar benchmark in `perf bench internals synthesize`.
> > > >
> > >
> > > If I add a new implementation using this ioctl() into
> > > perf_event__synthesize_mmap_events(), will it be tested from this
> > > `perf bench internals synthesize`? I'm not too familiar with perf code
> > > organization, sorry if it's a stupid question. If not, where exactly
> > > is the code that would be triggered from benchmark?
> >
> > Yes it would be triggered :-)
>
> Ok, I don't exactly know how to interpret the results (and what the
> benchmark is doing), but numbers don't seem to be worse. They actually
> seem to be a bit better.
>
> I pushed my code that adds perf integration to [0]. That commit has
> results, but I'll post them here (with invocation parameters).
> perf-ioctl is the version with ioctl()-based implementation,
> perf-parse is, logically, text-parsing version. Here are the results
> (and see my notes below the results as well):
>
> TEXT-BASED
> ==========
>
> # ./perf-parse bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 80.311 usec (+- 0.077 usec)
> Average num. events: 32.000 (+- 0.000)
> Average time per event 2.510 usec
> Average data synthesis took: 84.429 usec (+- 0.066 usec)
> Average num. events: 179.000 (+- 0.000)
> Average time per event 0.472 usec
>
> # ./perf-parse bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 79.900 usec (+- 0.077 usec)
> Average num. events: 32.000 (+- 0.000)
> Average time per event 2.497 usec
> Average data synthesis took: 84.832 usec (+- 0.074 usec)
> Average num. events: 180.000 (+- 0.000)
> Average time per event 0.471 usec
>
> # ./perf-parse bench internals synthesize --mt -M 8
> # Running 'internals/synthesize' benchmark:
> Computing performance of multi threaded perf event synthesis by
> synthesizing events on CPU 0:
> Number of synthesis threads: 1
> Average synthesis took: 36338.100 usec (+- 406.091 usec)
> Average num. events: 14091.300 (+- 7.433)
> Average time per event 2.579 usec
> Number of synthesis threads: 2
> Average synthesis took: 37071.200 usec (+- 746.498 usec)
> Average num. events: 14085.900 (+- 1.900)
> Average time per event 2.632 usec
> Number of synthesis threads: 3
> Average synthesis took: 33932.300 usec (+- 626.861 usec)
> Average num. events: 14085.900 (+- 1.900)
> Average time per event 2.409 usec
> Number of synthesis threads: 4
> Average synthesis took: 33822.700 usec (+- 506.290 usec)
> Average num. events: 14099.200 (+- 8.761)
> Average time per event 2.399 usec
> Number of synthesis threads: 5
> Average synthesis took: 33348.200 usec (+- 389.771 usec)
> Average num. events: 14085.900 (+- 1.900)
> Average time per event 2.367 usec
> Number of synthesis threads: 6
> Average synthesis took: 33269.600 usec (+- 350.341 usec)
> Average num. events: 14084.000 (+- 0.000)
> Average time per event 2.362 usec
> Number of synthesis threads: 7
> Average synthesis took: 32663.900 usec (+- 338.870 usec)
> Average num. events: 14085.900 (+- 1.900)
> Average time per event 2.319 usec
> Number of synthesis threads: 8
> Average synthesis took: 32748.400 usec (+- 285.450 usec)
> Average num. events: 14085.900 (+- 1.900)
> Average time per event 2.325 usec
>
> IOCTL-BASED
> ===========
> # ./perf-ioctl bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 72.996 usec (+- 0.076 usec)
> Average num. events: 31.000 (+- 0.000)
> Average time per event 2.355 usec
> Average data synthesis took: 79.067 usec (+- 0.074 usec)
> Average num. events: 178.000 (+- 0.000)
> Average time per event 0.444 usec
>
> # ./perf-ioctl bench internals synthesize
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 73.921 usec (+- 0.073 usec)
> Average num. events: 31.000 (+- 0.000)
> Average time per event 2.385 usec
> Average data synthesis took: 80.545 usec (+- 0.070 usec)
> Average num. events: 178.000 (+- 0.000)
> Average time per event 0.453 usec
>
> # ./perf-ioctl bench internals synthesize --mt -M 8
> # Running 'internals/synthesize' benchmark:
> Computing performance of multi threaded perf event synthesis by
> synthesizing events on CPU 0:
> Number of synthesis threads: 1
> Average synthesis took: 35609.500 usec (+- 428.576 usec)
> Average num. events: 14040.700 (+- 1.700)
> Average time per event 2.536 usec
> Number of synthesis threads: 2
> Average synthesis took: 34293.800 usec (+- 453.811 usec)
> Average num. events: 14040.700 (+- 1.700)
> Average time per event 2.442 usec
> Number of synthesis threads: 3
> Average synthesis took: 32385.200 usec (+- 363.106 usec)
> Average num. events: 14040.700 (+- 1.700)
> Average time per event 2.307 usec
> Number of synthesis threads: 4
> Average synthesis took: 33113.100 usec (+- 553.931 usec)
> Average num. events: 14054.500 (+- 11.469)
> Average time per event 2.356 usec
> Number of synthesis threads: 5
> Average synthesis took: 31600.600 usec (+- 297.349 usec)
> Average num. events: 14012.500 (+- 4.590)
> Average time per event 2.255 usec
> Number of synthesis threads: 6
> Average synthesis took: 32309.900 usec (+- 472.225 usec)
> Average num. events: 14004.000 (+- 0.000)
> Average time per event 2.307 usec
> Number of synthesis threads: 7
> Average synthesis took: 31400.100 usec (+- 206.261 usec)
> Average num. events: 14004.800 (+- 0.800)
> Average time per event 2.242 usec
> Number of synthesis threads: 8
> Average synthesis took: 31601.400 usec (+- 303.350 usec)
> Average num. events: 14005.700 (+- 1.700)
> Average time per event 2.256 usec
>
> I also double-checked (using strace) that it does what it is supposed
> to do, and it seems like everything checks out. Here's text-based
> strace log:
>
> openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
> read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
> read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
> read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
> read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
> read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
> read(3, "", 8192) = 0
> close(3) = 0
>
>
> BTW, note how the kernel doesn't serve more than 4KB of data, even
> though perf provides an 8KB buffer (that's relevant to Greg's question
> about optimizing with bigger buffers; I suspect it won't work without
> seq_file changes).
>
> And here's an abbreviated log for the ioctl version; it has many more
> (but much faster) ioctl() syscalls, given it dumps everything:
>
> openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
>
> ... 195 ioctl() calls in total ...
>
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
> = -1 ENOENT (No such file or directory)
> close(3) = 0
>
>
> So, it's not the optimal usage of this API, and yet it's still better
> (or at least not worse) than text-based API.
>

In another reply to Arnaldo on patch #2 I mentioned the idea of
allowing iteration over only file-backed VMAs (as it seems like that's
all symbolizers care about, but I might be wrong here). So I tried
that quickly, given it's a trivial addition to my code. See results
below (it is slightly faster, but not by much, because most of the
VMAs in that benchmark seem to be file-backed anyway), just for
completeness. I'm not sure if that would be useful to/used by perf,
so please let me know.

As I mentioned above, it's not radically faster in this perf
benchmark, because we still request about 170 VMAs (vs ~195 if we
iterate *all* of them), so not a big change. The ratio will vary
depending on what the process is doing, of course. Anyway, just for
completeness, I'm not sure whether I should add this "filter" to the
actual implementation; a sketch of what it could look like follows.
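
A minimal kernel-side sketch of such a filter (the
PROCFS_PROCMAP_FILE_BACKED_VMA flag name is hypothetical and not part
of this series):

	/* in do_procmap_query(), replacing the single vma_next() call */
	vma = vma_next(&iter);
	/* optionally skip anonymous VMAs until a file-backed one is found */
	while (vma && (karg.query_flags & PROCFS_PROCMAP_FILE_BACKED_VMA) &&
	       !vma->vm_file)
		vma = vma_next(&iter);
	if (!vma) {
		err = -ENOENT;
		goto out;
	}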

# ./perf-filebacked bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 65.759 usec (+- 0.063 usec)
Average num. events: 30.000 (+- 0.000)
Average time per event 2.192 usec
Average data synthesis took: 73.840 usec (+- 0.080 usec)
Average num. events: 153.000 (+- 0.000)
Average time per event 0.483 usec

# ./perf-filebacked bench internals synthesize
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 66.245 usec (+- 0.059 usec)
Average num. events: 30.000 (+- 0.000)
Average time per event 2.208 usec
Average data synthesis took: 70.627 usec (+- 0.074 usec)
Average num. events: 153.000 (+- 0.000)
Average time per event 0.462 usec

# ./perf-filebacked bench internals synthesize --mt -M 8
# Running 'internals/synthesize' benchmark:
Computing performance of multi threaded perf event synthesis by
synthesizing events on CPU 0:
Number of synthesis threads: 1
Average synthesis took: 33477.500 usec (+- 556.102 usec)
Average num. events: 10125.700 (+- 1.620)
Average time per event 3.306 usec
Number of synthesis threads: 2
Average synthesis took: 30473.700 usec (+- 221.933 usec)
Average num. events: 10127.000 (+- 0.000)
Average time per event 3.009 usec
Number of synthesis threads: 3
Average synthesis took: 29775.200 usec (+- 315.212 usec)
Average num. events: 10128.700 (+- 0.667)
Average time per event 2.940 usec
Number of synthesis threads: 4
Average synthesis took: 29477.100 usec (+- 621.258 usec)
Average num. events: 10129.000 (+- 0.000)
Average time per event 2.910 usec
Number of synthesis threads: 5
Average synthesis took: 29777.900 usec (+- 294.710 usec)
Average num. events: 10144.700 (+- 11.597)
Average time per event 2.935 usec
Number of synthesis threads: 6
Average synthesis took: 27774.700 usec (+- 357.569 usec)
Average num. events: 10158.500 (+- 14.710)
Average time per event 2.734 usec
Number of synthesis threads: 7
Average synthesis took: 27437.200 usec (+- 233.626 usec)
Average num. events: 10135.700 (+- 2.700)
Average time per event 2.707 usec
Number of synthesis threads: 8
Average synthesis took: 28784.600 usec (+- 477.630 usec)
Average num. events: 10133.000 (+- 0.000)
Average time per event 2.841 usec

> [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35
>
> >
> > Thanks,
> > Ian
> >
> > > > Thanks,
> > > > Ian
> > > >
> > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > > >
> > > > > > thanks,
> > > > > >
> > > > > > greg k-h
> > > > >

2024-05-07 18:07:06

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

* Andrii Nakryiko <[email protected]> [240507 12:28]:
> On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <[email protected]> wrote:
> >
> > .. Adding Suren & Willy to the Cc
> >
> > * Andrii Nakryiko <[email protected]> [240504 18:14]:
> > > On Sat, May 4, 2024 at 8:32 AM Greg KH <[email protected]> wrote:
> > > >
> > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > I also did an strace run of both cases. In text-based one the tool did
> > > > > 68 read() syscalls, fetching up to 4KB of data in one go.
> > > >
> > > > Why not fetch more at once?
> > > >
> > >
> > > I didn't expect to be interrogated so much on the performance of the
> > > text parsing front, sorry. :) You can probably tune this, but where is
> > > the reasonable limit? 64KB? 256KB? 1MB? See below for some more
> > > production numbers.
> >
> > The reason the file reads are limited to 4KB is because this file is
> > used for monitoring processes. We have a significant number of
> > organisations polling this file so frequently that the mmap lock
> > contention becomes an issue. (reading a file is free, right?) People
> > also tend to try to figure out why a process is slow by reading this
> > file - which amplifies the lock contention.
> >
> > What happens today is that the lock is yielded after 4KB to allow time
> > for mmap writes to happen. This also means your data may be
> > inconsistent from one 4KB block to the next (the write may be around
> > this boundary).
> >
> > This new interface also takes the lock in do_procmap_query() and does
> > the 4kb blocks as well. Extending this size means more time spent
> > blocking mmap writes, but a more consistent view of the world (less
> > "tearing" of the addresses).
>
> Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> does a single VMA lookup (presumably an O(log N) operation) using a
> single vma_iter_init(addr) + vma_next() call on the vma_iterator.

Sorry, I read this:

+ if (usize > PAGE_SIZE)
+ return -E2BIG;

And thought you were going to return many vmas in that buffer. I see
now that you are doing one copy at a time.

>
> As for the mmap_read_lock_killable() (is that what we are talking
> about?), I'm happy to use anything else available; please give me a
> pointer. But I suspect, given how fast and small this new API is,
> mmap_read_lock_killable() in it is not comparable to holding it while
> producing the /proc/<pid>/maps contents.

Yes, mmap_read_lock_killable() is the mmap lock (formerly known as the
mmap sem).

You can see examples of avoiding the mmap lock by use of rcu in
mm/memory.c lock_vma_under_rcu() which is used in the fault path.
userfaultfd has an example as well. But again, remember that not all
archs have this functionality, so you'd need to fall back to full mmap
locking.

Certainly a single lookup and copy will be faster than a 4k buffer
filling copy, but you will be walking the tree O(n) times, where n is
the vma count. This isn't as efficient as multiple lookups in a row as
we will re-walk from the top of the tree. You will also need to contend
with the fact that the chance of the vmas changing between calls is much
higher here too - if that's an issue. Neither of these issues go away
with use of the rcu locking instead of the mmap lock, but we can be
quite certain that we won't cause locking contention.
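
For illustration, a minimal sketch of that fast-path/fallback pattern
(assuming CONFIG_PER_VMA_LOCK; note that lock_vma_under_rcu() only
finds a VMA covering the address, so the "next VMA" query mode would
still need the fallback):

	vma = lock_vma_under_rcu(mm, karg.query_addr);
	if (vma) {
		/* ... fill in karg from vma without the mmap lock ... */
		vma_end_read(vma);
	} else {
		/* arch lacks support, or the VMA is contended/absent */
		if (mmap_read_lock_killable(mm))
			return -EINTR;
		/* ... existing vma_iter_init() + vma_next() slow path ... */
		mmap_read_unlock(mm);
	}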

Thanks,
Liam


2024-05-07 18:40:15

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

* Andrii Nakryiko <[email protected]> [240503 20:30]:
> /proc/<pid>/maps file is extremely useful in practice for various tasks
> involving figuring out process memory layout, what files are backing any
> given memory range, etc. One important class of applications that
> absolutely rely on this are profilers/stack symbolizers. They would
> normally capture a stack trace containing absolute memory addresses of
> some functions, and would then use the /proc/<pid>/maps file to find
> the corresponding backing ELF files, file offsets within them, and then
> continue from there to get yet more information (ELF symbols, DWARF
> information) to produce human-readable symbolic output.
>
> As such, there are both performance and correctness requirements
> involved. This address-to-VMA-information translation has to be done as
> efficiently as possible, but also must not miss any VMA (especially in
> the case of loading/unloading shared libraries).
>
> Unfortunately, for all of /proc/<pid>/maps' universality and
> usefulness, it doesn't fit the above requirements 100%.
>
> First, it's text-based, which makes its programmatic use from
> applications and libraries unnecessarily cumbersome and slow due to the
> need to do text parsing to get the necessary pieces of information.
>
> Second, its main purpose is to emit all VMAs sequentially, but in
> practice captured addresses would fall into only a small subset of all
> of a process' VMAs, mainly those containing executable text. Yet, a
> library would need to parse most or all of the contents to find the
> needed VMAs, as there is no way to skip VMAs that are of no use. An
> efficient library can do a single linear pass, which is still relatively
> cheap, but it's definitely overhead that could be avoided if there
> were a way to do more targeted querying of the relevant VMA information.
>
> Another problem when writing a generic stack trace symbolization library
> is an unfortunate performance-vs-correctness tradeoff that needs to be
> made. The library has to decide whether to cache the parsed contents of
> /proc/<pid>/maps to serve future requests (if the application requests
> symbolization of another set of addresses, captured at some later time,
> which is typical for periodic/continuous profiling cases), avoiding the
> higher cost of re-parsing the file, or to re-read it every time. In the
> former case, more memory is used for the cache and there is a risk of
> getting stale data if the application loaded/unloaded shared libraries,
> or otherwise changed its set of VMAs through additional mmap() calls
> (and other means of altering the memory address space). In the latter
> case, there is the performance hit of re-opening the file and
> re-reading/re-parsing its contents all over again.
>
> This patch aims to solve this problem by providing a new API built on
> top of /proc/<pid>/maps. It is ioctl()-based and built as a binary
> interface, avoiding the cost and awkwardness of a textual representation
> for programmatic use. It's designed to be extensible and
> forward/backward compatible by including a user-specified field size and
> using the copy_struct_from_user() approach. But, most importantly, it
> allows doing point queries for a specific single address, specified by
> the user. And this is done efficiently using the VMA iterator.
>
> The user has a choice of either getting the VMA that covers the provided
> address or -ENOENT if none is found (the exact, least surprising, case).
> Or, with an extra query flag (PROCFS_PROCMAP_EXACT_OR_NEXT_VMA), they
> can get either the VMA that covers the address (if there is one), or the
> closest next VMA (i.e., the VMA with the smallest vm_start > addr). The
> latter allows more efficient use but, given it could be surprising
> behavior, requires an explicit opt-in.
>
> Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
> sense given it's querying the same set of VMA data. All the permission
> checks performed on opening /proc/<pid>/maps fit here as well. The
> ioctl-based implementation fetches the remembered mm_struct reference,
> but otherwise doesn't interfere with the seq_file-based implementation
> of the /proc/<pid>/maps textual interface, and so the two can be used
> together or independently without paying any price for that.
>
> There is one extra thing that /proc/<pid>/maps doesn't currently
> provide, and that's the ability to fetch the ELF build ID, if present.
> The user controls whether this piece of information is requested by
> setting the build_id_size field either to zero or to the non-zero
> maximum size of the buffer they provided through the build_id_addr
> field (which encodes a user pointer as a __u64 field).
>
> The need to get the ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and the
> /proc/<pid>/maps textual representation doesn't help with this,
> requiring applications to open the underlying ELF binary through the
> /proc/<pid>/map_files/<start>-<end> symlink, which has extra permission
> implications due to giving full access to the binary from (potentially)
> another process, while all the application is interested in is the
> build ID. Giving the ability to request just the build ID doesn't
> introduce any additional security concerns, on top of what
> /proc/<pid>/maps is already concerned with, simplifying the overall
> logic.
>
> The kernel already implements build ID fetching, which is used from the
> BPF subsystem. We are reusing this code here, but plan follow-up changes
> to make it work better under the more relaxed assumption (compared to
> what the existing code assumes) of being called from user process
> context, in which page faults are allowed. The BPF-specific
> implementation currently bails out if the necessary part of the ELF
> file is not paged in, all due to extra BPF-specific restrictions (like
> the need to fetch the build ID in restrictive contexts such as an NMI
> handler).
>
> Note also that fetching the VMA name (e.g., the backing file path, or
> special hard-coded or user-provided names) is optional, just like the
> build ID. If the user sets vma_name_size to zero, kernel code won't
> attempt to retrieve it, saving resources.
>
> Signed-off-by: Andrii Nakryiko <[email protected]>
> ---
> fs/proc/task_mmu.c | 165 ++++++++++++++++++++++++++++++++++++++++
> include/uapi/linux/fs.h | 32 ++++++++
> 2 files changed, 197 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8e503a1635b7..cb7b1ff1a144 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -22,6 +22,7 @@
> #include <linux/pkeys.h>
> #include <linux/minmax.h>
> #include <linux/overflow.h>
> +#include <linux/buildid.h>
>
> #include <asm/elf.h>
> #include <asm/tlb.h>
> @@ -375,11 +376,175 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> return do_maps_open(inode, file, &proc_pid_maps_op);
> }
>
> +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> +{
> + struct procfs_procmap_query karg;
> + struct vma_iterator iter;
> + struct vm_area_struct *vma;
> + struct mm_struct *mm;
> + const char *name = NULL;
> + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
> + __u64 usize;
> + int err;
> +
> + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize)))
> + return -EFAULT;
> + if (usize > PAGE_SIZE)
> + return -E2BIG;
> + if (usize < offsetofend(struct procfs_procmap_query, query_addr))
> + return -EINVAL;
> + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize);
> + if (err)
> + return err;
> +
> + if (karg.query_flags & ~PROCFS_PROCMAP_EXACT_OR_NEXT_VMA)
> + return -EINVAL;
> + if (!!karg.vma_name_size != !!karg.vma_name_addr)
> + return -EINVAL;
> + if (!!karg.build_id_size != !!karg.build_id_addr)
> + return -EINVAL;
> +
> + mm = priv->mm;
> + if (!mm || !mmget_not_zero(mm))
> + return -ESRCH;
> + if (mmap_read_lock_killable(mm)) {
> + mmput(mm);
> + return -EINTR;
> + }

Using the rcu lookup here will allow for a higher success rate with
less lock contention.

> +
> + vma_iter_init(&iter, mm, karg.query_addr);
> + vma = vma_next(&iter);
> + if (!vma) {
> + err = -ENOENT;
> + goto out;
> + }
> + /* user wants covering VMA, not the closest next one */
> + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> + vma->vm_start > karg.query_addr) {
> + err = -ENOENT;
> + goto out;
> + }

The interface you are using is a start address to search from to the end
of the address space, so this won't work as you intended with the
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag. I do not think the vma iterator
has the desired interface you want as the single address lookup doesn't
use the vma iterator. I'd just run the vma_next() and check the limits.
See find_exact_vma() for the limit checks.
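
For reference, the limit check in find_exact_vma() (include/linux/mm.h)
is roughly:

	struct vm_area_struct *vma = vma_lookup(mm, vm_start);

	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
		vma = NULL;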

> +
> + karg.vma_start = vma->vm_start;
> + karg.vma_end = vma->vm_end;
> +
> + if (vma->vm_file) {
> + const struct inode *inode = file_user_inode(vma->vm_file);
> +
> + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT;
> + karg.dev_major = MAJOR(inode->i_sb->s_dev);
> + karg.dev_minor = MINOR(inode->i_sb->s_dev);
> + karg.inode = inode->i_ino;
> + } else {
> + karg.vma_offset = 0;
> + karg.dev_major = 0;
> + karg.dev_minor = 0;
> + karg.inode = 0;
> + }
> +
> + karg.vma_flags = 0;
> + if (vma->vm_flags & VM_READ)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_READABLE;
> + if (vma->vm_flags & VM_WRITE)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_WRITABLE;
> + if (vma->vm_flags & VM_EXEC)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_EXECUTABLE;
> + if (vma->vm_flags & VM_MAYSHARE)
> + karg.vma_flags |= PROCFS_PROCMAP_VMA_SHARED;
> +
> + if (karg.build_id_size) {
> + __u32 build_id_sz = BUILD_ID_SIZE_MAX;
> +
> + err = build_id_parse(vma, build_id_buf, &build_id_sz);
> + if (!err) {
> + if (karg.build_id_size < build_id_sz) {
> + err = -ENAMETOOLONG;
> + goto out;
> + }
> + karg.build_id_size = build_id_sz;
> + }
> + }
> +
> + if (karg.vma_name_size) {
> + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
> + const struct path *path;
> + const char *name_fmt;
> + size_t name_sz = 0;
> +
> + get_vma_name(vma, &path, &name, &name_fmt);
> +
> + if (path || name_fmt || name) {
> + name_buf = kmalloc(name_buf_sz, GFP_KERNEL);
> + if (!name_buf) {
> + err = -ENOMEM;
> + goto out;
> + }
> + }
> + if (path) {
> + name = d_path(path, name_buf, name_buf_sz);
> + if (IS_ERR(name)) {
> + err = PTR_ERR(name);
> + goto out;
> + }
> + name_sz = name_buf + name_buf_sz - name;
> + } else if (name || name_fmt) {
> + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name);
> + name = name_buf;
> + }
> + if (name_sz > name_buf_sz) {
> + err = -ENAMETOOLONG;
> + goto out;
> + }
> + karg.vma_name_size = name_sz;
> + }
> +
> + /* unlock and put mm_struct before copying data to user */
> + mmap_read_unlock(mm);
> + mmput(mm);
> +
> + if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr,
> + name, karg.vma_name_size)) {
> + kfree(name_buf);
> + return -EFAULT;
> + }
> + kfree(name_buf);
> +
> + if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
> + build_id_buf, karg.build_id_size))
> + return -EFAULT;
> +
> + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
> + return -EFAULT;
> +
> + return 0;
> +
> +out:
> + mmap_read_unlock(mm);
> + mmput(mm);
> + kfree(name_buf);
> + return err;
> +}
> +
> +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + struct seq_file *seq = file->private_data;
> + struct proc_maps_private *priv = seq->private;
> +
> + switch (cmd) {
> + case PROCFS_PROCMAP_QUERY:
> + return do_procmap_query(priv, (void __user *)arg);
> + default:
> + return -ENOIOCTLCMD;
> + }
> +}
> +
> const struct file_operations proc_pid_maps_operations = {
> .open = pid_maps_open,
> .read = seq_read,
> .llseek = seq_lseek,
> .release = proc_map_release,
> + .unlocked_ioctl = procfs_procmap_ioctl,
> + .compat_ioctl = procfs_procmap_ioctl,
> };
>
> /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 45e4e64fd664..fe8924a8d916 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -393,4 +393,36 @@ struct pm_scan_arg {
> __u64 return_mask;
> };
>
> +/* /proc/<pid>/maps ioctl */
> +#define PROCFS_IOCTL_MAGIC 0x9f
> +#define PROCFS_PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 1, struct procfs_procmap_query)
> +
> +enum procmap_query_flags {
> + PROCFS_PROCMAP_EXACT_OR_NEXT_VMA = 0x01,
> +};
> +
> +enum procmap_vma_flags {
> + PROCFS_PROCMAP_VMA_READABLE = 0x01,
> + PROCFS_PROCMAP_VMA_WRITABLE = 0x02,
> + PROCFS_PROCMAP_VMA_EXECUTABLE = 0x04,
> + PROCFS_PROCMAP_VMA_SHARED = 0x08,
> +};
> +
> +struct procfs_procmap_query {
> + __u64 size;
> + __u64 query_flags; /* in */
> + __u64 query_addr; /* in */
> + __u64 vma_start; /* out */
> + __u64 vma_end; /* out */
> + __u64 vma_flags; /* out */
> + __u64 vma_offset; /* out */
> + __u64 inode; /* out */
> + __u32 dev_major; /* out */
> + __u32 dev_minor; /* out */
> + __u32 vma_name_size; /* in/out */
> + __u32 build_id_size; /* in/out */
> + __u64 vma_name_addr; /* in */
> + __u64 build_id_addr; /* in */
> +};
> +
> #endif /* _UAPI_LINUX_FS_H */
> --
> 2.43.0
>
>
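
For reference, a minimal sketch of the intended userspace usage of the
API quoted above (assuming the UAPI header is synced as in patch #3;
error handling is trimmed and the buffer sizes are arbitrary):

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	/* maps_fd is an open FD for /proc/<pid>/maps */
	static int query_vma(int maps_fd, unsigned long long addr)
	{
		struct procfs_procmap_query q;
		char vma_name[256];
		char build_id[20]; /* a GNU build ID is typically a 20-byte SHA-1 */

		memset(&q, 0, sizeof(q));
		q.size = sizeof(q);
		q.query_flags = 0; /* or PROCFS_PROCMAP_EXACT_OR_NEXT_VMA */
		q.query_addr = addr;
		q.vma_name_addr = (uintptr_t)vma_name;
		q.vma_name_size = sizeof(vma_name);
		q.build_id_addr = (uintptr_t)build_id;
		q.build_id_size = sizeof(build_id);

		if (ioctl(maps_fd, PROCFS_PROCMAP_QUERY, &q) < 0)
			return -1; /* errno is ENOENT when no VMA covers addr */

		printf("%llx-%llx off %llx %s\n", q.vma_start, q.vma_end,
		       q.vma_offset, q.vma_name_size ? vma_name : "");
		return 0;
	}

A symbolizer would open /proc/<pid>/maps once and call this per
(sorted) address, skipping addresses already covered by the previously
returned [vma_start, vma_end) range.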

2024-05-07 18:53:42

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Tue, May 7, 2024 at 11:10 AM Liam R. Howlett <[email protected]> wrote:
>
> * Andrii Nakryiko <[email protected]> [240503 20:30]:
> > [...]
> > +
> > + mm = priv->mm;
> > + if (!mm || !mmget_not_zero(mm))
> > + return -ESRCH;
> > + if (mmap_read_lock_killable(mm)) {
> > + mmput(mm);
> > + return -EINTR;
> > + }
>
> Using the rcu lookup here will allow for a higher success rate with
> less lock contention.
>

If you have any code pointers, I'd appreciate it. If not, I'll try to
find it myself, no worries.

> > +
> > + vma_iter_init(&iter, mm, karg.query_addr);
> > + vma = vma_next(&iter);
> > + if (!vma) {
> > + err = -ENOENT;
> > + goto out;
> > + }
> > + /* user wants covering VMA, not the closest next one */
> > + if (!(karg.query_flags & PROCFS_PROCMAP_EXACT_OR_NEXT_VMA) &&
> > + vma->vm_start > karg.query_addr) {
> > + err = -ENOENT;
> > + goto out;
> > + }
>
> The interface you are using is a start address to search from to the end
> of the address space, so this won't work as you intended with the
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA flag. I do not think the vma iterator

Maybe the name isn't the best; by "EXACT" here I meant "VMA that
exactly covers the provided address", so maybe "COVERING_OR_NEXT_VMA"
would be better wording.

With that out of the way, I think this API works exactly how I expect
it to work:

# cat /proc/3406/maps | grep -C1 7f42099fe000
7f42099fa000-7f42099fc000 rw-p 00000000 00:00 0
7f42099fc000-7f42099fe000 r--p 00000000 00:21 109331
/usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
7f42099fe000-7f4209a0e000 r-xp 00002000 00:21 109331
/usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
7f4209a0e000-7f4209a14000 r--p 00012000 00:21 109331
/usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8

# cat addrs.txt
0x7f42099fe010

# ./procfs_query -f addrs.txt -p 3406 -v -Q
PID: 3406
PATH: addrs.txt
READ 1 addrs!
SORTED ADDRS (1):
ADDR #0: 0x7f42099fe010
VMA FOUND (addr 7f42099fe010): 7f42099fe000-7f4209a0e000 r-xp 00002000
00:21 109331 /usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8
(build ID: NO, 0 bytes)
RESOLVED ADDRS (1):
RESOLVED #0: 0x7f42099fe010 -> OFF 0x2010 NAME
/usr/local/fbcode/platform010-compat/lib/libz.so.1.2.8

You can see above that for the requested 0x7f42099fe010 address we got
a VMA that starts before this address: 7f42099fe000-7f4209a0e000,
which is what we want.

Before submitting I ran the tool with /proc/<pid>/maps and ioctl to
"resolve" the exact same set of addresses and I compared results. They
were identical.


Note, there is a small bug in the tool I added in patch #5. I changed
the `-i` argument to `-Q` at the very last moment and didn't update the
code in one place. But other than that I didn't change anything. For
the above output, I added "VMA FOUND" verbose logging to show all the
details of the VMA, not just the resolved offset. I'll add that in v2.

> has the desired interface you want as the single address lookup doesn't
> use the vma iterator. I'd just run the vma_next() and check the limits.
> See find_exact_vma() for the limit checks.
>
> > +
> > + karg.vma_start = vma->vm_start;
> > + karg.vma_end = vma->vm_end;
> > +

[...]

2024-05-07 19:01:43

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <[email protected]> wrote:
>
> * Andrii Nakryiko <[email protected]> [240507 12:28]:
> > On Tue, May 7, 2024 at 8:49 AM Liam R. Howlett <[email protected]> wrote:
> > > [...]
> >
> > Hold on. There is no 4KB in the new ioctl-based API I'm adding. It
> > does a single VMA lookup (presumably an O(log N) operation) using a
> > single vma_iter_init(addr) + vma_next() call on the vma_iterator.
>
> Sorry, I read this:
>
> + if (usize > PAGE_SIZE)
> + return -E2BIG;
>
> And thought you were going to return many vmas in that buffer. I see
> now that you are doing one copy at a time.
>
> >
> > As for the mmap_read_lock_killable() (is that what we are talking
> > about?), I'm happy to use anything else available; please give me a
> > pointer. But I suspect, given how fast and small this new API is,
> > mmap_read_lock_killable() in it is not comparable to holding it while
> > producing the /proc/<pid>/maps contents.
>
> Yes, mmap_read_lock_killable() is the mmap lock (formerly known as the
> mmap sem).
>
> You can see examples of avoiding the mmap lock by use of rcu in
> mm/memory.c lock_vma_under_rcu() which is used in the fault path.
> userfaultfd has an example as well. But again, remember that not all
> archs have this functionality, so you'd need to fall back to full mmap
> locking.

Thanks for the pointer (didn't see email when replying on the other thread).

I looked at lock_vma_under_rcu() quickly, and it seems like it's
designed to find the VMA that covers a given address, but not the next
closest one. So it's a bit problematic for the API I'm adding, as
PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
COVERING_OR_NEXT_VMA, if necessary) is quite important for the use
cases we have. But maybe some variation of lock_vma_under_rcu() could
be added that would fit this case?

>
> Certainly a single lookup and copy will be faster than a 4k buffer
> filling copy, but you will be walking the tree O(n) times, where n is
> the vma count. This isn't as efficient as multiple lookups in a row as
> we will re-walk from the top of the tree. You will also need to contend
> with the fact that the chance of the vmas changing between calls is much
> higher here too - if that's an issue. Neither of these issues go away
> with use of the rcu locking instead of the mmap lock, but we can be
> quite certain that we won't cause locking contention.

You are right about O(n) times, but note that for the symbolization
cases I'm describing, this n will generally be *much* smaller than the
total number of VMAs within the process. It's a huge speedup in
practice. This is because we pre-sort addresses in user space, query
the VMA for the first address, and then quickly skip all the other
addresses that are already covered by that VMA, so the next request
queries a new VMA that covers another subset of addresses. This way we
touch the minimal number of VMAs that cover the captured addresses
(which in the case of stack traces would be a few VMAs belonging to
the executable sections of the process' binary plus a bunch of shared
libraries); see the loop sketch below.
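
A sketch of that user-space loop (query_one() is a hypothetical helper
wrapping the PROCFS_PROCMAP_QUERY ioctl that returns the covering
VMA's [start, end) range; all names are illustrative):

	#include <stdlib.h>

	/* hypothetical: returns 0 and fills *start/*end on success,
	 * non-zero when no VMA covers addr */
	int query_one(int maps_fd, unsigned long long addr,
		      unsigned long long *start, unsigned long long *end);

	static int cmp_u64(const void *a, const void *b)
	{
		unsigned long long x = *(const unsigned long long *)a;
		unsigned long long y = *(const unsigned long long *)b;
		return x < y ? -1 : x > y;
	}

	void resolve_addrs(int maps_fd, unsigned long long *addrs, size_t n)
	{
		unsigned long long start = 0, end = 0;
		size_t i;

		qsort(addrs, n, sizeof(*addrs), cmp_u64);
		for (i = 0; i < n; i++) {
			/* covered by the VMA found for an earlier address */
			if (addrs[i] >= start && addrs[i] < end)
				continue;
			/* one ioctl() per distinct covering VMA, not per address */
			if (query_one(maps_fd, addrs[i], &start, &end)) {
				start = end = 0; /* don't reuse a stale range */
				continue;
			}
		}
	}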

>
> Thanks,
> Liam
>

2024-05-07 19:10:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Tue, May 07, 2024 at 11:48:44AM -0400, Liam R. Howlett wrote:
> .. Adding Suren & Willy to the Cc

I've been staying out of this disaster. I thought Steven Rostedt was
going to do all of this in the kernel anyway. Wasn't there a session on
that at LSF/MM in Vancouver last year?

2024-05-07 21:58:45

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 2/5] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps

On Mon, May 6, 2024 at 12:16 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> On Mon, May 06, 2024 at 03:53:40PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Mon, May 06, 2024 at 11:05:17AM -0700, Namhyung Kim wrote:
> > > On Mon, May 6, 2024 at 6:58 AM Arnaldo Carvalho de Melo <[email protected]> wrote:
> > > > On Sat, May 04, 2024 at 02:50:31PM -0700, Andrii Nakryiko wrote:
> > > > > On Sat, May 4, 2024 at 8:28 AM Greg KH <[email protected]> wrote:
> > > > > > On Fri, May 03, 2024 at 05:30:03PM -0700, Andrii Nakryiko wrote:
> > > > > > > Note also, that fetching VMA name (e.g., backing file path, or special
> > > > > > > hard-coded or user-provided names) is optional just like build ID. If
> > > > > > > user sets vma_name_size to zero, kernel code won't attempt to retrieve
> > > > > > > it, saving resources.
> >
> > > > > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> >
> > > > > > Where is the userspace code that uses this new api you have created?
> >
> > > > > So I added a faithful comparison of existing /proc/<pid>/maps vs new
> > > > > ioctl() API to solve a common problem (as described above) in patch
> > > > > #5. The plan is to put it in mentioned blazesym library at the very
> > > > > least.
> > > > >
> > > > > I'm sure perf would benefit from this as well (cc'ed Arnaldo and
> > > > > linux-perf-user), as they need to do stack symbolization as well.
> >
> > > I think the general use case in perf is different. This ioctl API is great
> > > for live tracing of a single (or a small number of) process(es). And
> > > yes, perf tools have those tracing use cases too. But I think the
> > > major use case of perf tools is system-wide profiling.
> >
> > > For system-wide profiling, you need to process samples of many
> > > different processes at a high frequency. Now perf record doesn't
> > > process them and just save it for offline processing (well, it does
> > > at the end to find out build-ID but it can be omitted).
> >
> > Since:
> >
> > Author: Jiri Olsa <[email protected]>
> > Date: Mon Dec 14 11:54:49 2020 +0100
> > 1ca6e80254141d26 ("perf tools: Store build id when available in PERF_RECORD_MMAP2 metadata events")
> >
> > We don't need to process the events to find the build-ids. I haven't
> > checked if we still do it to find out which DSOs had hits, but we
> > shouldn't need to do it for build-ids (unless they were not in memory
> > when the kernel tried to stash them in the PERF_RECORD_MMAP2, which I
> > haven't checked but IIRC is a possibility if that ELF part isn't in
> > memory at the time we want to copy it).
>
> > If we're still traversing it like that I guess we can have a knob and
> > make it the default to not do that and instead create the perf.data
> > build ID header table with all the build-ids we got from
> > PERF_RECORD_MMAP2, a (slightly) bigger perf.data file but no event
> > processing at the end of a 'perf record' session.
>
> But then we don't process the PERF_RECORD_MMAP2 in 'perf record', it
> just goes on directly to the perf.data file :-\

Yep, we don't process build-IDs at the end if the --buildid-mmap
option is given. The perf.data file won't have the build-ID header
table, but it's not needed anymore, as perf report can get the
build-ID from MMAP2 events directly.
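
For example (the --buildid-mmap option already exists in perf record;
the exact command line here is just illustrative):

# perf record --buildid-mmap -a sleep 1
# perf report    # build-IDs come from PERF_RECORD_MMAP2 events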

Thanks,
Namhyung

2024-05-07 22:27:51

by Namhyung Kim

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <[email protected]> wrote:
> > >
> > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> > > <[email protected]> wrote:
> > > >
> > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <[email protected]> wrote:
> > > > >
> > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> > > > > > >
> > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > > > >
> > > > > > > Of course an artificial benchmark of "read a whole file" vs. "a tiny
> > > > > > > ioctl" is going to be different, but step back and show how this is
> > > > > > > going to be used in the real world overall. Pounding on this file is
> > > > > > > not a normal operation, right?
> > > > > > >
> > > > > >
> > > > > > It's not artificial at all. It's *exactly* what, say, blazesym library
> > > > > > is doing (see [0], it's Rust and part of the overall library API, I
> > > > > > think C code in this patch is way easier to follow for someone not
> > > > > > familiar with implementation of blazesym, but both implementations are
> > > > > > doing exactly the same sequence of steps). You can do it even less
> > > > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > > > table, then looking up addresses one by one. But that's even slower
> > > > > > and more memory-hungry. So I didn't even bother implementing that; it
> > > > > > would put /proc/<pid>/maps at an even greater disadvantage.
> > > > > >
> > > > > > Other applications that deal with stack traces (including perf) would
> > > > > > be doing one of those two approaches, depending on circumstances and
> > > > > > level of sophistication of code (and sensitivity to performance).
> > > > >
> > > > > The code in perf doing this is here:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > > > The code is using the api/io.h code:
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > > > Using perf to profile perf, it was observed that time was spent
> > > > > allocating buffers and in locale-related activities when using stdio,
> > > > > so io is a lighter-weight alternative, albeit with more verbose code
> > > > > than fscanf.
> > > > > You could add this as an alternate /proc/<pid>/maps reader, we have a
> > > > > similar benchmark in `perf bench internals synthesize`.
> > > > >
> > > >
> > > > If I add a new implementation using this ioctl() into
> > > > perf_event__synthesize_mmap_events(), will it be tested from this
> > > > `perf bench internals synthesize`? I'm not too familiar with perf code
> > > > organization, sorry if it's a stupid question. If not, where exactly
> > > > is the code that would be triggered from benchmark?
> > >
> > > Yes it would be triggered :-)
> >
> > Ok, I don't exactly know how to interpret the results (and what the
> > benchmark is doing), but the numbers don't seem to be worse. They
> > actually seem to be a bit better.
> >
> > I pushed my code that adds perf integration to [0]. That commit has
> > results, but I'll post them here (with invocation parameters).
> > perf-ioctl is the version with ioctl()-based implementation,
> > perf-parse is, logically, text-parsing version. Here are the results
> > (and see my notes below the results as well):
> >
> > [...]
> > So, it's not the optimal usage of this API, and yet it's still better
> > (or at least not worse) than text-based API.

It's surprising that more ioctl() calls end up cheaper than fewer
read() calls plus parsing.

> >
>
> In another reply to Arnaldo on patch #2 I mentioned the idea of
> allowing iteration over only file-backed VMAs (as it seems like that's
> all symbolizers care about, but I might be wrong here). So I

Yep, I think it's enough to get file-backed VMAs only.


> tried that quickly, given it's a trivial addition to my code. See
> results below (it is slightly faster, but not by much, because most of
> the VMAs in that benchmark seem to be file-backed anyway), just for
> completeness. I'm not sure if that would be useful to/used by perf,
> so please let me know.

Thanks for doing this. It'd be useful, as it provides better
synthesizing performance. The startup latency of perf record is a
problem; I need to take a look at whether it can be improved further.

Thanks,
Namhyung


> [...]
> > [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35
> >
> > >
> > > Thanks,
> > > Ian
> > >
> > > > > Thanks,
> > > > > Ian
> > > > >
> > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > > > >
> > > > > > > thanks,
> > > > > > >
> > > > > > > greg k-h
> > > > > >
>

2024-05-07 22:57:48

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Tue, May 7, 2024 at 3:27 PM Namhyung Kim <[email protected]> wrote:
>
> On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Mon, May 6, 2024 at 10:06 PM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Mon, May 6, 2024 at 11:43 AM Ian Rogers <[email protected]> wrote:
> > > >
> > > > On Mon, May 6, 2024 at 11:32 AM Andrii Nakryiko
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Sat, May 4, 2024 at 10:09 PM Ian Rogers <[email protected]> wrote:
> > > > > >
> > > > > > On Sat, May 4, 2024 at 2:57 PM Andrii Nakryiko
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Sat, May 4, 2024 at 8:29 AM Greg KH <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Fri, May 03, 2024 at 05:30:06PM -0700, Andrii Nakryiko wrote:
> > > > > > > > > Implement a simple tool/benchmark for comparing address "resolution"
> > > > > > > > > logic based on textual /proc/<pid>/maps interface and new binary
> > > > > > > > > ioctl-based PROCFS_PROCMAP_QUERY command.
> > > > > > > >
> > > > > > > > Of course an artificial benchmark of "read a whole file" vs "a tiny
> > > > > > > > ioctl" is going to be different, but step back and show how this is
> > > > > > > > going to be used in the real world overall. Pounding on this file is
> > > > > > > > not a normal operation, right?
> > > > > > > >
> > > > > > >
> > > > > > > It's not artificial at all. It's *exactly* what, say, the
> > > > > > > blazesym library is doing (see [0]; it's Rust and part of the
> > > > > > > overall library API. I think the C code in this patch is way
> > > > > > > easier to follow for someone not familiar with blazesym's
> > > > > > > implementation, but both implementations do exactly the same
> > > > > > > sequence of steps). You can do it even less
> > > > > > > efficiently by parsing the whole file, building an in-memory lookup
> > > > > > > table, then looking up addresses one by one. But that's even slower
> > > > > > > and more memory-hungry. So I didn't even bother implementing that;
> > > > > > > it would put /proc/<pid>/maps at an even greater disadvantage.
> > > > > > >
> > > > > > > Other applications that deal with stack traces (including perf)
> > > > > > > would take one of those two approaches, depending on circumstances
> > > > > > > and the level of sophistication of the code (and its sensitivity
> > > > > > > to performance).
> > > > > >
> > > > > > The code in perf doing this is here:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/synthetic-events.c#n440
> > > > > > The code is using the api/io.h code:
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/api/io.h
> > > > > > Using perf to profile perf, it was observed that time was spent
> > > > > > allocating buffers and in locale-related activities when using
> > > > > > stdio, so io is a lighter-weight alternative, albeit with more
> > > > > > verbose code than fscanf. You could add this as an alternate
> > > > > > /proc/<pid>/maps reader; we have a similar benchmark in
> > > > > > `perf bench internals synthesize`.
> > > > > >
> > > > >
> > > > > If I add a new implementation using this ioctl() into
> > > > > perf_event__synthesize_mmap_events(), will it be tested from this
> > > > > `perf bench internals synthesize`? I'm not too familiar with perf code
> > > > > organization, sorry if it's a stupid question. If not, where exactly
> > > > > is the code that would be triggered from the benchmark?
> > > >
> > > > Yes it would be triggered :-)
> > >
> > > Ok, I don't exactly know how to interpret the results (and what the
> > > benchmark is doing), but the numbers don't seem to be worse. They actually
> > > seem to be a bit better.
> > >
> > > I pushed my code that adds perf integration to [0]. That commit has
> > > results, but I'll post them here (with invocation parameters).
> > > perf-ioctl is the version with the ioctl()-based implementation;
> > > perf-parse is, logically, the text-parsing version. Here are the results
> > > (and see my notes below the results as well):
> > >
> > > TEXT-BASED
> > > ==========
> > >
> > > # ./perf-parse bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > > Average synthesis took: 80.311 usec (+- 0.077 usec)
> > > Average num. events: 32.000 (+- 0.000)
> > > Average time per event 2.510 usec
> > > Average data synthesis took: 84.429 usec (+- 0.066 usec)
> > > Average num. events: 179.000 (+- 0.000)
> > > Average time per event 0.472 usec
> > >
> > > # ./perf-parse bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > > Average synthesis took: 79.900 usec (+- 0.077 usec)
> > > Average num. events: 32.000 (+- 0.000)
> > > Average time per event 2.497 usec
> > > Average data synthesis took: 84.832 usec (+- 0.074 usec)
> > > Average num. events: 180.000 (+- 0.000)
> > > Average time per event 0.471 usec
> > >
> > > # ./perf-parse bench internals synthesize --mt -M 8
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of multi threaded perf event synthesis by
> > > synthesizing events on CPU 0:
> > > Number of synthesis threads: 1
> > > Average synthesis took: 36338.100 usec (+- 406.091 usec)
> > > Average num. events: 14091.300 (+- 7.433)
> > > Average time per event 2.579 usec
> > > Number of synthesis threads: 2
> > > Average synthesis took: 37071.200 usec (+- 746.498 usec)
> > > Average num. events: 14085.900 (+- 1.900)
> > > Average time per event 2.632 usec
> > > Number of synthesis threads: 3
> > > Average synthesis took: 33932.300 usec (+- 626.861 usec)
> > > Average num. events: 14085.900 (+- 1.900)
> > > Average time per event 2.409 usec
> > > Number of synthesis threads: 4
> > > Average synthesis took: 33822.700 usec (+- 506.290 usec)
> > > Average num. events: 14099.200 (+- 8.761)
> > > Average time per event 2.399 usec
> > > Number of synthesis threads: 5
> > > Average synthesis took: 33348.200 usec (+- 389.771 usec)
> > > Average num. events: 14085.900 (+- 1.900)
> > > Average time per event 2.367 usec
> > > Number of synthesis threads: 6
> > > Average synthesis took: 33269.600 usec (+- 350.341 usec)
> > > Average num. events: 14084.000 (+- 0.000)
> > > Average time per event 2.362 usec
> > > Number of synthesis threads: 7
> > > Average synthesis took: 32663.900 usec (+- 338.870 usec)
> > > Average num. events: 14085.900 (+- 1.900)
> > > Average time per event 2.319 usec
> > > Number of synthesis threads: 8
> > > Average synthesis took: 32748.400 usec (+- 285.450 usec)
> > > Average num. events: 14085.900 (+- 1.900)
> > > Average time per event 2.325 usec
> > >
> > > IOCTL-BASED
> > > ===========
> > > # ./perf-ioctl bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > > Average synthesis took: 72.996 usec (+- 0.076 usec)
> > > Average num. events: 31.000 (+- 0.000)
> > > Average time per event 2.355 usec
> > > Average data synthesis took: 79.067 usec (+- 0.074 usec)
> > > Average num. events: 178.000 (+- 0.000)
> > > Average time per event 0.444 usec
> > >
> > > # ./perf-ioctl bench internals synthesize
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of single threaded perf event synthesis by
> > > synthesizing events on the perf process itself:
> > > Average synthesis took: 73.921 usec (+- 0.073 usec)
> > > Average num. events: 31.000 (+- 0.000)
> > > Average time per event 2.385 usec
> > > Average data synthesis took: 80.545 usec (+- 0.070 usec)
> > > Average num. events: 178.000 (+- 0.000)
> > > Average time per event 0.453 usec
> > >
> > > # ./perf-ioctl bench internals synthesize --mt -M 8
> > > # Running 'internals/synthesize' benchmark:
> > > Computing performance of multi threaded perf event synthesis by
> > > synthesizing events on CPU 0:
> > > Number of synthesis threads: 1
> > > Average synthesis took: 35609.500 usec (+- 428.576 usec)
> > > Average num. events: 14040.700 (+- 1.700)
> > > Average time per event 2.536 usec
> > > Number of synthesis threads: 2
> > > Average synthesis took: 34293.800 usec (+- 453.811 usec)
> > > Average num. events: 14040.700 (+- 1.700)
> > > Average time per event 2.442 usec
> > > Number of synthesis threads: 3
> > > Average synthesis took: 32385.200 usec (+- 363.106 usec)
> > > Average num. events: 14040.700 (+- 1.700)
> > > Average time per event 2.307 usec
> > > Number of synthesis threads: 4
> > > Average synthesis took: 33113.100 usec (+- 553.931 usec)
> > > Average num. events: 14054.500 (+- 11.469)
> > > Average time per event 2.356 usec
> > > Number of synthesis threads: 5
> > > Average synthesis took: 31600.600 usec (+- 297.349 usec)
> > > Average num. events: 14012.500 (+- 4.590)
> > > Average time per event 2.255 usec
> > > Number of synthesis threads: 6
> > > Average synthesis took: 32309.900 usec (+- 472.225 usec)
> > > Average num. events: 14004.000 (+- 0.000)
> > > Average time per event 2.307 usec
> > > Number of synthesis threads: 7
> > > Average synthesis took: 31400.100 usec (+- 206.261 usec)
> > > Average num. events: 14004.800 (+- 0.800)
> > > Average time per event 2.242 usec
> > > Number of synthesis threads: 8
> > > Average synthesis took: 31601.400 usec (+- 303.350 usec)
> > > Average num. events: 14005.700 (+- 1.700)
> > > Average time per event 2.256 usec
> > >
> > > I also double-checked (using strace) that it does what it is supposed
> > > to do, and it seems like everything checks out. Here's the
> > > text-based strace log:
> > >
> > > openat(AT_FDCWD, "/proc/35876/task/35876/maps", O_RDONLY) = 3
> > > read(3, "00400000-0040c000 r--p 00000000 "..., 8192) = 3997
> > > read(3, "7f519d4d3000-7f519d516000 r--p 0"..., 8192) = 4025
> > > read(3, "7f519dc3d000-7f519dc44000 r-xp 0"..., 8192) = 4048
> > > read(3, "7f519dd2d000-7f519dd2f000 r--p 0"..., 8192) = 4017
> > > read(3, "7f519dff6000-7f519dff8000 r--p 0"..., 8192) = 2744
> > > read(3, "", 8192) = 0
> > > close(3) = 0
> > >
> > >
> > > BTW, note how the kernel doesn't serve more than 4KB of data, even
> > > though perf provides an 8KB buffer (that's to Greg's question about
> > > optimizing using bigger buffers; I suspect that without seq_file
> > > changes it won't work).
> > >
> > > And here's an abbreviated log for the ioctl version; it has many
> > > more (but much faster) ioctl() syscalls, given it dumps everything:
> > >
> > > openat(AT_FDCWD, "/proc/36380/task/36380/maps", O_RDONLY) = 3
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > >
> > > ... 195 ioctl() calls in total ...
> > >
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50) = 0
> > > ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x9f, 0x1, 0x60), 0x7fff6b603d50)
> > > = -1 ENOENT (No such file or directory)
> > > close(3) = 0
> > >
> > >
> > > So, it's not the optimal usage of this API, and yet it's still
> > > better (or at least not worse) than the text-based API.
>
> It's surprising that more ioctl() calls are cheaper than fewer read()
> calls plus parsing.

I encourage you to try this locally, just in case I missed something
([0]). But it does seem to be this way. I have mitigations and
retpoline off, so a syscall switch is pretty fast (under 0.5
microseconds).

[0] https://github.com/anakryiko/linux/tree/procfs-proc-maps-ioctl
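
For reference, the dump-everything loop behind the abbreviated strace
log above looks roughly like the sketch below. Caveat: the struct and
field names (procmap_query, size, query_addr, vma_start, vma_end) are
assumptions patterned on the patch description, not verbatim UAPI;
only PROCFS_PROCMAP_QUERY and PROCFS_PROCMAP_EXACT_OR_NEXT_VMA are
names used elsewhere in this thread.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int dump_all_vmas(int pid)
{
	struct procmap_query q;	/* hypothetical struct name */
	unsigned long next_addr = 0;
	char path[64];
	int fd, err = 0;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	for (;;) {
		memset(&q, 0, sizeof(q));
		q.size = sizeof(q);
		q.query_addr = next_addr;
		/* ask for the VMA covering next_addr, or the next one */
		q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;

		if (ioctl(fd, PROCFS_PROCMAP_QUERY, &q) < 0) {
			if (errno != ENOENT) /* ENOENT ends the walk */
				err = -errno;
			break;
		}
		/* consume [q.vma_start, q.vma_end) here */
		next_addr = q.vma_end;
	}
	close(fd);
	return err;
}
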
>
> > >
> >
> > In another reply to Arnaldo on patch #2 I mentioned the idea of
> > allowing iteration over only file-backed VMAs (as that seems to be
> > all that symbolizers care about, but I might be wrong here). So I
>
> Yep, I think it's enough to get file-backed VMAs only.
>

Ok, I guess I'll keep this functionality for v2 then; it's a pretty
trivial extension to the existing logic.
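
To illustrate just how trivial: the kernel-side filtering step could be
as small as the sketch below. The flag name is hypothetical, karg is an
assumed name for the kernel-side copy of the query argument, and the
surrounding VMA-iteration loop is elided; this is not the actual v2
change.

/* inside the VMA iteration loop; PROCFS_PROCMAP_FILE_BACKED_ONLY is
 * a hypothetical flag name
 */
if ((karg.query_flags & PROCFS_PROCMAP_FILE_BACKED_ONLY) &&
    !vma->vm_file)
	continue; /* skip anonymous VMAs and keep scanning */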

>
> > tried that quickly, given it's a trivial addition to my code. See
> > results below (it is slightly faster, but not by much, because most
> > of the VMAs in that benchmark seem to be file-backed anyway), just
> > for completeness. I'm not sure whether that would be useful to or
> > used by perf, so please let me know.
>
> Thanks for doing this. It'd be useful, as it provides better
> synthesizing performance. The startup latency of perf record is a
> problem; I need to take a look at whether it can be improved further.
>
> Thanks,
> Namhyung
>
>
> >
> > As I mentioned above, it's not radically faster in this perf
> > benchmark, because we still request about 170 VMAs (vs ~195 if we
> > iterate *all* of them), so not a big change. The ratio will vary
> > depending on what the process is doing, of course. Anyway, just for
> > completeness, I'm not sure whether I should add this "filter" to the
> > actual implementation.
> >
> > # ./perf-filebacked bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> > Average synthesis took: 65.759 usec (+- 0.063 usec)
> > Average num. events: 30.000 (+- 0.000)
> > Average time per event 2.192 usec
> > Average data synthesis took: 73.840 usec (+- 0.080 usec)
> > Average num. events: 153.000 (+- 0.000)
> > Average time per event 0.483 usec
> >
> > # ./perf-filebacked bench internals synthesize
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of single threaded perf event synthesis by
> > synthesizing events on the perf process itself:
> > Average synthesis took: 66.245 usec (+- 0.059 usec)
> > Average num. events: 30.000 (+- 0.000)
> > Average time per event 2.208 usec
> > Average data synthesis took: 70.627 usec (+- 0.074 usec)
> > Average num. events: 153.000 (+- 0.000)
> > Average time per event 0.462 usec
> >
> > # ./perf-filebacked bench internals synthesize --mt -M 8
> > # Running 'internals/synthesize' benchmark:
> > Computing performance of multi threaded perf event synthesis by
> > synthesizing events on CPU 0:
> > Number of synthesis threads: 1
> > Average synthesis took: 33477.500 usec (+- 556.102 usec)
> > Average num. events: 10125.700 (+- 1.620)
> > Average time per event 3.306 usec
> > Number of synthesis threads: 2
> > Average synthesis took: 30473.700 usec (+- 221.933 usec)
> > Average num. events: 10127.000 (+- 0.000)
> > Average time per event 3.009 usec
> > Number of synthesis threads: 3
> > Average synthesis took: 29775.200 usec (+- 315.212 usec)
> > Average num. events: 10128.700 (+- 0.667)
> > Average time per event 2.940 usec
> > Number of synthesis threads: 4
> > Average synthesis took: 29477.100 usec (+- 621.258 usec)
> > Average num. events: 10129.000 (+- 0.000)
> > Average time per event 2.910 usec
> > Number of synthesis threads: 5
> > Average synthesis took: 29777.900 usec (+- 294.710 usec)
> > Average num. events: 10144.700 (+- 11.597)
> > Average time per event 2.935 usec
> > Number of synthesis threads: 6
> > Average synthesis took: 27774.700 usec (+- 357.569 usec)
> > Average num. events: 10158.500 (+- 14.710)
> > Average time per event 2.734 usec
> > Number of synthesis threads: 7
> > Average synthesis took: 27437.200 usec (+- 233.626 usec)
> > Average num. events: 10135.700 (+- 2.700)
> > Average time per event 2.707 usec
> > Number of synthesis threads: 8
> > Average synthesis took: 28784.600 usec (+- 477.630 usec)
> > Average num. events: 10133.000 (+- 0.000)
> > Average time per event 2.841 usec
> >
> > > [0] https://github.com/anakryiko/linux/commit/0841fe675ed30f5605c5b228e18f5612ea253b35
> > >
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > > > > Thanks,
> > > > > > Ian
> > > > > >
> > > > > > > [0] https://github.com/libbpf/blazesym/blob/ee9b48a80c0b4499118a1e8e5d901cddb2b33ab1/src/normalize/user.rs#L193
> > > > > > >
> > > > > > > > thanks,
> > > > > > > >
> > > > > > > > greg k-h
> > > > > > >
> >

2024-05-08 00:36:30

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

On Tue, May 07, 2024 at 03:56:40PM -0700, Andrii Nakryiko wrote:
> On Tue, May 7, 2024 at 3:27 PM Namhyung Kim <[email protected]> wrote:
> > On Tue, May 7, 2024 at 10:29 AM Andrii Nakryiko <[email protected]> wrote:
> > > In another reply to Arnaldo on patch #2 I mentioned the idea of
> > > allowing iteration over only file-backed VMAs (as that seems to be
> > > all that symbolizers care about, but I might be wrong here). So I

> > Yep, I think it's enough to get file-backed VMAs only.

> Ok, I guess I'll keep this functionality for v2 then; it's a pretty
> trivial extension to the existing logic.

Maps for JITed code, for instance, aren't backed by files:

commit 578c03c86fadcc6fd7319ddf41dd4d1d88aab77a
Author: Namhyung Kim <[email protected]>
Date: Thu Jan 16 10:49:31 2014 +0900

perf symbols: Fix JIT symbol resolution on heap

Gaurav reported that perf cannot profile JIT program if it executes the
code on heap. This was because current map__new() only handle JIT on
anon mappings - extends it to handle no_dso (heap, stack) case too.

This patch assumes JIT profiling only provides dynamic function symbols
so check the mapping type to distinguish the case. It'd provide no
symbols for data mapping - if we need to support symbols on data
mappings later it should be changed.

Reported-by: Gaurav Jain <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
Tested-by: Gaurav Jain <[email protected]>

⬢[acme@toolbox perf-tools-next]$ git show 89365e6c9ad4c0e090e4c6a4b67a3ce319381d89
commit 89365e6c9ad4c0e090e4c6a4b67a3ce319381d89
Author: Andi Kleen <[email protected]>
Date: Wed Apr 24 17:03:02 2013 -0700

perf tools: Handle JITed code in shared memory

Need to check for /dev/zero.

Most likely more strings are missing too.

Signed-off-by: Andi Kleen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c
index 6fcb9de623401b8a..8bcdf9e54089acaf 100644
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -21,6 +21,7 @@ const char *map_type__name[MAP__NR_TYPES] = {
static inline int is_anon_memory(const char *filename)
{
return !strcmp(filename, "//anon") ||
+ !strcmp(filename, "/dev/zero (deleted)") ||
!strcmp(filename, "/anon_hugepage (deleted)");
}

etc.

- Arnaldo

2024-05-08 01:21:34

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 5/5] selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

* Andrii Nakryiko <[email protected]> [240507 15:01]:
> On Tue, May 7, 2024 at 11:06 AM Liam R. Howlett <[email protected]> wrote:
..
> > >
> > > As for the mmap_read_lock_killable() (is that what we are talking
> > > about?), I'm happy to use anything else available; please give me a
> > > pointer. But I suspect given how fast and small this new API is,
> > > mmap_read_lock_killable() in it is not comparable to holding it for
> > > producing /proc/<pid>/maps contents.
> >
> > Yes, mmap_read_lock_killable() is the mmap lock (formerly known as the
> > mmap sem).
> >
> > You can see examples of avoiding the mmap lock by use of RCU in
> > mm/memory.c lock_vma_under_rcu(), which is used in the fault path.
> > userfaultfd has an example as well. But again, remember that not all
> > archs have this functionality, so you'd need to fall back to full mmap
> > locking.
>
> Thanks for the pointer (didn't see email when replying on the other thread).
>
> I looked at lock_vma_under_rcu() quickly, and it seems like it's
> designed to find the VMA that covers a given address, but not the
> next closest one.
> So it's a bit problematic for the API I'm adding, as
> PROCFS_PROCMAP_EXACT_OR_NEXT_VMA (which I can rename to
> COVERING_OR_NEXT_VMA, if necessary) is quite important for the use
> cases we have. But maybe some variation of lock_vma_under_rcu() can be
> added that would fit this case?

Yes, as long as we have the RCU read lock, we can use the same
vma_next() calls you use today. We will have to be careful not to use
the vma while it's being altered, but per-vma locking should provide
that functionality for you.
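
Concretely, that shape might look like the rough sketch below. This is
not the patch: it assumes the CONFIG_PER_VMA_LOCK helpers
vma_start_read()/vma_end_read(), elides the mmap_lock fallback on
contention, and skips the re-validation a real implementation would
need after taking the per-VMA lock.

static struct vm_area_struct *
covering_or_next_vma(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	VMA_ITERATOR(vmi, mm, addr);

	rcu_read_lock();
	vma = vma_next(&vmi);	/* first VMA with vm_end > addr */
	if (vma && !vma_start_read(vma))
		vma = NULL;	/* contended; caller takes mmap_lock path */
	rcu_read_unlock();

	/* caller must vma_end_read(vma) once done with the VMA */
	return vma;
}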

>
> >
> > Certainly a single lookup and copy will be faster than filling and
> > copying a 4k buffer, but you will be walking the tree O(n) times,
> > where n is the vma count. This isn't as efficient as multiple lookups
> > in a row, as we will re-walk from the top of the tree. You will also
> > need to contend with the fact that the chance of the vmas changing
> > between calls is much higher here too - if that's an issue. Neither
> > of these issues goes away with use of RCU locking instead of the mmap
> > lock, but we can be quite certain that we won't cause locking
> > contention.
>
> You are right about O(n) times, but note that for the symbolization
> cases I'm describing, this n will generally be *much* smaller than the
> total number of VMAs within the process. It's a huge speedup in
> practice. This is because we pre-sort addresses in user-space, and
> then we query the VMA for the first address, but then we quickly skip
> all the other addresses that are already covered by this VMA, and so
> the next request will query a new VMA that covers another subset of
> addresses. This way we'll get the minimal number of VMAs that cover
> the captured addresses (which in the case of stack traces would be a
> few VMAs belonging to executable sections of the process' binary plus
> a bunch of shared libraries).

This also implies you won't have to worry about shifting addresses? I'd
think that the reference to the mm means none of these are going to be
changing at the point of the calls (not exiting).

Given your use case, I'm surprised you're looking for the next vma at
all.

Thanks,
Liam
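
For concreteness, here is a minimal user-space sketch of the batched
lookup described in the quoted paragraph above: sort the addresses
once, query one VMA for the first unresolved address, and let that
single answer consume every subsequent address the VMA covers. As with
the earlier dump-everything sketch, the struct and field names are
assumptions patterned on the patch description rather than verbatim
UAPI.

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

static int cmp_addr(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return x < y ? -1 : x > y;
}

/* resolve each sorted address against the minimal set of covering
 * VMAs; addresses that fall into unmapped gaps are simply skipped
 */
static int resolve_sorted(int maps_fd, unsigned long *addrs, size_t n)
{
	size_t i = 0;

	qsort(addrs, n, sizeof(*addrs), cmp_addr);

	while (i < n) {
		struct procmap_query q;	/* hypothetical struct name */

		memset(&q, 0, sizeof(q));
		q.size = sizeof(q);
		q.query_addr = addrs[i];
		q.query_flags = PROCFS_PROCMAP_EXACT_OR_NEXT_VMA;

		if (ioctl(maps_fd, PROCFS_PROCMAP_QUERY, &q) < 0)
			return -1; /* ENOENT: nothing at or after addrs[i] */

		/* one ioctl() answers every sorted address this VMA
		 * covers; addrs[i] < q.vma_start means an unmapped gap
		 */
		while (i < n && addrs[i] < q.vma_end)
			i++;
	}
	return 0;
}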