hi,
this patchset adds build id object pointer to struct file object.
We have several use cases for build id to be used in BPF programs
[2][3].
Having build id pointer stored directly in file object allows to get
build id reliably regardless of the execution context as described
in [3].
Previous iteration [1] stored build id pointer into inode, but it
became clear that struct file object is better place, because we read
the build id when the file is mmap-ed and as Dave Chinner said [4]:
> Yes, the problem being that we can cache hundreds of millions of
> inodes in memory, and only a very small subset of them are going to
> have open files associated with them. And an even smaller subset are
> going to be mmapped.
thanks,
jirka
v3 changes:
- moved build id back to file object (discussed in [4])
- drop patch 4, it's not needed [Andrii]
- added ack to patch 7 [Andrii]
- replaced BUILD_ID_SIZE_MAX macro with enum [Andrii]
- few re-formatting fixes [Andrii]
v2 changes:
- store build id under inode [Matthew Wilcox]
- use urandom_read and liburandom_read.so for test [Andrii]
- add libelf-based helper to fetch build ID from elf [Andrii]
- store build id or error we got when reading it [Andrii]
- use full name i_build_id [Andrii]
- several tests fixes [Andrii]
[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/CA+khW7juLEcrTOd7iKG3C_WY8L265XKNo0iLzV1fE=o-cyeHcQ@mail.gmail.com/
[3] https://lore.kernel.org/bpf/ZABf26mV0D0LS7r%2F@krava/
[4] https://lore.kernel.org/bpf/[email protected]/
---
Jiri Olsa (9):
mm: Store build id in file object
perf: Use file object build id in perf_event_mmap_event
bpf: Use file object build id in stackmap
bpf: Switch BUILD_ID_SIZE_MAX to enum
selftests/bpf: Add read_buildid function
selftests/bpf: Add err.h header
selftests/bpf: Replace extract_build_id with read_build_id
selftests/bpf: Add iter_task_vma_buildid test
selftests/bpf: Add file_build_id test
fs/file_table.c | 3 +++
include/linux/buildid.h | 21 ++++++++++++++++++-
include/linux/fs.h | 7 +++++++
kernel/bpf/stackmap.c | 24 +++++++++++++++++++++-
kernel/events/core.c | 43 ++++++++++++++++++++++++++++++++++----
lib/buildid.c | 42 +++++++++++++++++++++++++++++++++++++
mm/Kconfig | 9 ++++++++
mm/mmap.c | 18 ++++++++++++++++
tools/testing/selftests/bpf/Makefile | 7 ++++++-
tools/testing/selftests/bpf/no_build_id.c | 6 ++++++
tools/testing/selftests/bpf/prog_tests/bpf_iter.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/prog_tests/file_build_id.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c | 19 +++++++----------
tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c | 17 ++++++---------
tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/progs/err.h | 18 ++++++++++++++++
tools/testing/selftests/bpf/progs/file_build_id.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/progs/profiler.inc.h | 3 +--
tools/testing/selftests/bpf/test_progs.c | 25 ----------------------
tools/testing/selftests/bpf/test_progs.h | 11 +++++++++-
tools/testing/selftests/bpf/trace_helpers.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tools/testing/selftests/bpf/trace_helpers.h | 5 +++++
22 files changed, 608 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/bpf/no_build_id.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/file_build_id.c
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
create mode 100644 tools/testing/selftests/bpf/progs/err.h
create mode 100644 tools/testing/selftests/bpf/progs/file_build_id.c
Storing build id in file object for elf executable with build
id defined. The build id is stored when file is mmaped.
The build id object assignment to the file is locked with existing
file->f_mapping semaphore.
The f_build_id pointer points either build id object or carries
the error the build id retrieval failed on.
It's hidden behind new config option CONFIG_FILE_BUILD_ID.
Signed-off-by: Jiri Olsa <[email protected]>
---
fs/file_table.c | 3 +++
include/linux/buildid.h | 17 +++++++++++++++++
include/linux/fs.h | 7 +++++++
lib/buildid.c | 42 +++++++++++++++++++++++++++++++++++++++++
mm/Kconfig | 9 +++++++++
mm/mmap.c | 18 ++++++++++++++++++
6 files changed, 96 insertions(+)
diff --git a/fs/file_table.c b/fs/file_table.c
index 372653b92617..d72f72503268 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -29,6 +29,7 @@
#include <linux/ima.h>
#include <linux/swap.h>
#include <linux/kmemleak.h>
+#include <linux/buildid.h>
#include <linux/atomic.h>
@@ -48,6 +49,7 @@ static void file_free_rcu(struct rcu_head *head)
{
struct file *f = container_of(head, struct file, f_rcuhead);
+ file_build_id_free(f);
put_cred(f->f_cred);
kmem_cache_free(filp_cachep, f);
}
@@ -413,6 +415,7 @@ void __init files_init(void)
filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
percpu_counter_init(&nr_files, 0, GFP_KERNEL);
+ build_id_init();
}
/*
diff --git a/include/linux/buildid.h b/include/linux/buildid.h
index 3b7a0ff4642f..b8b2e00420d6 100644
--- a/include/linux/buildid.h
+++ b/include/linux/buildid.h
@@ -3,9 +3,15 @@
#define _LINUX_BUILDID_H
#include <linux/mm_types.h>
+#include <linux/slab.h>
#define BUILD_ID_SIZE_MAX 20
+struct build_id {
+ u32 sz;
+ char data[BUILD_ID_SIZE_MAX];
+};
+
int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
__u32 *size);
int build_id_parse_buf(const void *buf, unsigned char *build_id, u32 buf_size);
@@ -17,4 +23,15 @@ void init_vmlinux_build_id(void);
static inline void init_vmlinux_build_id(void) { }
#endif
+#ifdef CONFIG_FILE_BUILD_ID
+void __init build_id_init(void);
+void build_id_free(struct build_id *bid);
+void file_build_id_free(struct file *f);
+void vma_read_build_id(struct vm_area_struct *vma, struct build_id **bidp);
+#else
+static inline void __init build_id_init(void) { }
+static inline void build_id_free(struct build_id *bid) { }
+static inline void file_build_id_free(struct file *f) { }
+#endif /* CONFIG_FILE_BUILD_ID */
+
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..ce03fd965cdb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -977,6 +977,13 @@ struct file {
struct address_space *f_mapping;
errseq_t f_wb_err;
errseq_t f_sb_err; /* for syncfs */
+#ifdef CONFIG_FILE_BUILD_ID
+ /*
+ * Initialized when the file is mmaped (mmap_region),
+ * guarded by f_mapping lock.
+ */
+ struct build_id *f_build_id;
+#endif
} __randomize_layout
__attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
diff --git a/lib/buildid.c b/lib/buildid.c
index dfc62625cae4..04181c0b7c21 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/slab.h>
#define BUILD_ID 3
@@ -189,3 +190,44 @@ void __init init_vmlinux_build_id(void)
build_id_parse_buf(&__start_notes, vmlinux_build_id, size);
}
#endif
+
+#ifdef CONFIG_FILE_BUILD_ID
+
+/* SLAB cache for build_id structures */
+static struct kmem_cache *build_id_cachep;
+
+void vma_read_build_id(struct vm_area_struct *vma, struct build_id **bidp)
+{
+ struct build_id *bid = ERR_PTR(-ENOMEM);
+ int err;
+
+ bid = kmem_cache_alloc(build_id_cachep, GFP_KERNEL);
+ if (!bid)
+ goto out;
+ err = build_id_parse(vma, bid->data, &bid->sz);
+ if (err) {
+ build_id_free(bid);
+ bid = ERR_PTR(err);
+ }
+out:
+ *bidp = bid;
+}
+
+void file_build_id_free(struct file *f)
+{
+ build_id_free(f->f_build_id);
+}
+
+void build_id_free(struct build_id *bid)
+{
+ if (IS_ERR_OR_NULL(bid))
+ return;
+ kmem_cache_free(build_id_cachep, bid);
+}
+
+void __init build_id_init(void)
+{
+ build_id_cachep = kmem_cache_create("build_id", sizeof(struct build_id), 0,
+ SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
+}
+#endif /* CONFIG_FILE_BUILD_ID */
diff --git a/mm/Kconfig b/mm/Kconfig
index 4751031f3f05..5e9f48962703 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1202,6 +1202,15 @@ config LRU_GEN_STATS
This option has a per-memcg and per-node memory overhead.
# }
+config FILE_BUILD_ID
+ bool "Store build id in file object"
+ default n
+ help
+ Store build id in file object for elf executable with build id
+ defined. The build id is stored when file is mmaped.
+
+ It's typically used by eBPF programs and perf subsystem.
+
source "mm/damon/Kconfig"
endmenu
diff --git a/mm/mmap.c b/mm/mmap.c
index 20f21f0949dd..1c14e8c84d3a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2480,6 +2480,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
pgoff_t vm_pgoff;
int error;
VMA_ITERATOR(vmi, mm, addr);
+ struct build_id *bid = NULL;
/* Check against address space limit. */
if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
@@ -2575,6 +2576,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (error)
goto unmap_and_free_vma;
+#ifdef CONFIG_FILE_BUILD_ID
+ if (vma->vm_flags & VM_EXEC && !file->f_build_id)
+ vma_read_build_id(vma, &bid);
+#endif
/*
* Expansion is handled above, merging is handled below.
* Drivers should not alter the address of the VMA.
@@ -2647,6 +2652,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (vma->vm_flags & VM_SHARED)
mapping_allow_writable(vma->vm_file->f_mapping);
+#ifdef CONFIG_FILE_BUILD_ID
+ if (bid && !file->f_build_id) {
+ file->f_build_id = bid;
+ bid = NULL;
+ }
+#endif
flush_dcache_mmap_lock(vma->vm_file->f_mapping);
vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap);
flush_dcache_mmap_unlock(vma->vm_file->f_mapping);
@@ -2667,6 +2678,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
expanded:
perf_event_mmap(vma);
+ /*
+ * File can already have f_build_id assigned, so we need to release
+ * the build id we just read and did not assign.
+ */
+ build_id_free(bid);
+
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
@@ -2711,6 +2728,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
mapping_unmap_writable(file->f_mapping);
free_vma:
vm_area_free(vma);
+ build_id_free(bid);
unacct_error:
if (charged)
vm_unacct_memory(charged);
--
2.39.2
Use build id from file object when available for perf's MMAP2 event
build id data.
The file's f_build_id is available (for CONFIG_FILE_BUILD_ID option)
when the file is mmap-ed and before the perf's callback is executed,
so we can use it, instead of reading it again.
Signed-off-by: Jiri Olsa <[email protected]>
---
kernel/events/core.c | 43 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 39 insertions(+), 4 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f79fd8b87f75..3a5398dda6f6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8527,6 +8527,9 @@ struct perf_mmap_event {
u32 prot, flags;
u8 build_id[BUILD_ID_SIZE_MAX];
u32 build_id_size;
+#ifdef CONFIG_FILE_BUILD_ID
+ struct build_id *f_build_id;
+#endif
struct {
struct perf_event_header header;
@@ -8539,6 +8542,38 @@ struct perf_mmap_event {
} event_id;
};
+#ifdef CONFIG_FILE_BUILD_ID
+static void build_id_read(struct perf_mmap_event *mmap_event)
+{
+ struct vm_area_struct *vma = mmap_event->vma;
+
+ mmap_event->f_build_id = vma->vm_file ? vma->vm_file->f_build_id : NULL;
+}
+
+static bool has_build_id(struct perf_mmap_event *mmap_event)
+{
+ return !IS_ERR_OR_NULL(mmap_event->f_build_id);
+}
+
+#define build_id_data mmap_event->f_build_id->data
+#define build_id_size mmap_event->f_build_id->sz
+#else
+static void build_id_read(struct perf_mmap_event *mmap_event)
+{
+ struct vm_area_struct *vma = mmap_event->vma;
+
+ build_id_parse(vma, mmap_event->build_id, &mmap_event->build_id_size);
+}
+
+static bool has_build_id(struct perf_mmap_event *mmap_event)
+{
+ return mmap_event->build_id_size;
+}
+
+#define build_id_data mmap_event->build_id
+#define build_id_size mmap_event->build_id_size
+#endif
+
static int perf_event_mmap_match(struct perf_event *event,
void *data)
{
@@ -8583,7 +8618,7 @@ static void perf_event_mmap_output(struct perf_event *event,
mmap_event->event_id.pid = perf_event_pid(event, current);
mmap_event->event_id.tid = perf_event_tid(event, current);
- use_build_id = event->attr.build_id && mmap_event->build_id_size;
+ use_build_id = event->attr.build_id && has_build_id(mmap_event);
if (event->attr.mmap2 && use_build_id)
mmap_event->event_id.header.misc |= PERF_RECORD_MISC_MMAP_BUILD_ID;
@@ -8592,10 +8627,10 @@ static void perf_event_mmap_output(struct perf_event *event,
if (event->attr.mmap2) {
if (use_build_id) {
- u8 size[4] = { (u8) mmap_event->build_id_size, 0, 0, 0 };
+ u8 size[4] = { (u8) build_id_size, 0, 0, 0 };
__output_copy(&handle, size, 4);
- __output_copy(&handle, mmap_event->build_id, BUILD_ID_SIZE_MAX);
+ __output_copy(&handle, build_id_data, BUILD_ID_SIZE_MAX);
} else {
perf_output_put(&handle, mmap_event->maj);
perf_output_put(&handle, mmap_event->min);
@@ -8727,7 +8762,7 @@ static void perf_event_mmap_event(struct perf_mmap_event *mmap_event)
mmap_event->event_id.header.size = sizeof(mmap_event->event_id) + size;
if (atomic_read(&nr_build_id_events))
- build_id_parse(vma, mmap_event->build_id, &mmap_event->build_id_size);
+ build_id_read(mmap_event);
perf_iterate_sb(perf_event_mmap_output,
mmap_event,
--
2.39.2
Use build id from file object in stackmap if it's available.
The file's f_build_id is available (for CONFIG_FILE_BUILD_ID option)
when the file is mmap-ed, so it will be available (if present) when
used by stackmap.
Signed-off-by: Jiri Olsa <[email protected]>
---
kernel/bpf/stackmap.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 0f1d8dced933..14d27bd83081 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -124,6 +124,28 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
return ERR_PTR(err);
}
+#ifdef CONFIG_FILE_BUILD_ID
+static int vma_get_build_id(struct vm_area_struct *vma, unsigned char *build_id)
+{
+ struct build_id *bid;
+
+ if (!vma->vm_file)
+ return -EINVAL;
+ bid = vma->vm_file->f_build_id;
+ if (IS_ERR_OR_NULL(bid))
+ return bid ? PTR_ERR(bid) : -ENOENT;
+ if (bid->sz > BUILD_ID_SIZE_MAX)
+ return -EINVAL;
+ memcpy(build_id, bid->data, bid->sz);
+ return 0;
+}
+#else
+static int vma_get_build_id(struct vm_area_struct *vma, unsigned char *build_id)
+{
+ return build_id_parse(vma, build_id, NULL);
+}
+#endif
+
static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
u64 *ips, u32 trace_nr, bool user)
{
@@ -156,7 +178,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
goto build_id_valid;
}
vma = find_vma(current->mm, ips[i]);
- if (!vma || build_id_parse(vma, id_offs[i].build_id, NULL)) {
+ if (!vma || vma_get_build_id(vma, id_offs[i].build_id)) {
/* per entry fall back to ips */
id_offs[i].status = BPF_STACK_BUILD_ID_IP;
id_offs[i].ip = ips[i];
--
2.39.2
Switching BUILD_ID_SIZE_MAX to enum, so we expose it to BPF
programs through vmlinux.h.
Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
include/linux/buildid.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/buildid.h b/include/linux/buildid.h
index b8b2e00420d6..316971c634fe 100644
--- a/include/linux/buildid.h
+++ b/include/linux/buildid.h
@@ -5,7 +5,9 @@
#include <linux/mm_types.h>
#include <linux/slab.h>
-#define BUILD_ID_SIZE_MAX 20
+enum {
+ BUILD_ID_SIZE_MAX = 20
+};
struct build_id {
u32 sz;
--
2.39.2
Adding read_build_id function that parses out build id from
specified binary.
It will replace extract_build_id and also be used in following
changes.
Signed-off-by: Jiri Olsa <[email protected]>
---
tools/testing/selftests/bpf/trace_helpers.c | 86 +++++++++++++++++++++
tools/testing/selftests/bpf/trace_helpers.h | 5 ++
2 files changed, 91 insertions(+)
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 934bf28fc888..72b38a41f574 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -11,6 +11,9 @@
#include <linux/perf_event.h>
#include <sys/mman.h>
#include "trace_helpers.h"
+#include <linux/limits.h>
+#include <libelf.h>
+#include <gelf.h>
#define TRACEFS_PIPE "/sys/kernel/tracing/trace_pipe"
#define DEBUGFS_PIPE "/sys/kernel/debug/tracing/trace_pipe"
@@ -234,3 +237,86 @@ ssize_t get_rel_offset(uintptr_t addr)
fclose(f);
return -EINVAL;
}
+
+static int
+parse_build_id_buf(const void *note_start, Elf32_Word note_size,
+ char *build_id)
+{
+ Elf32_Word note_offs = 0, new_offs;
+
+ while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
+ Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
+
+ if (nhdr->n_type == 3 && nhdr->n_namesz == sizeof("GNU") &&
+ !strcmp((char *)(nhdr + 1), "GNU") && nhdr->n_descsz > 0 &&
+ nhdr->n_descsz <= BPF_BUILD_ID_SIZE) {
+ memcpy(build_id, note_start + note_offs +
+ ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr), nhdr->n_descsz);
+ memset(build_id + nhdr->n_descsz, 0, BPF_BUILD_ID_SIZE - nhdr->n_descsz);
+ return (int) nhdr->n_descsz;
+ }
+
+ new_offs = note_offs + sizeof(Elf32_Nhdr) +
+ ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
+ if (new_offs >= note_size)
+ break;
+ note_offs = new_offs;
+ }
+
+ return -EINVAL;
+}
+
+/* Reads binary from *path* file and returns it in the *build_id*
+ * which is expected to be at least BPF_BUILD_ID_SIZE bytes.
+ * Returns size of build id on success. On error the error value
+ * is returned.
+ */
+int read_build_id(const char *path, char *build_id)
+{
+ int fd, err = -EINVAL;
+ Elf *elf = NULL;
+ GElf_Ehdr ehdr;
+ size_t max, i;
+
+ fd = open(path, O_RDONLY | O_CLOEXEC);
+ if (fd < 0)
+ return -errno;
+
+ (void)elf_version(EV_CURRENT);
+
+ elf = elf_begin(fd, ELF_C_READ, NULL);
+ if (!elf)
+ goto out;
+ if (elf_kind(elf) != ELF_K_ELF)
+ goto out;
+ if (gelf_getehdr(elf, &ehdr) == NULL)
+ goto out;
+ if (ehdr.e_ident[EI_CLASS] != ELFCLASS64)
+ goto out;
+
+ for (i = 0; i < ehdr.e_phnum; i++) {
+ GElf_Phdr mem, *phdr;
+ char *data;
+
+ phdr = gelf_getphdr(elf, i, &mem);
+ if (!phdr)
+ goto out;
+ if (phdr->p_type != PT_NOTE)
+ continue;
+ data = elf_rawfile(elf, &max);
+ if (!data)
+ goto out;
+ if (phdr->p_offset >= max || (phdr->p_offset + phdr->p_memsz >= max))
+ goto out;
+ err = parse_build_id_buf(data + phdr->p_offset, phdr->p_memsz, build_id);
+ if (err > 0)
+ goto out;
+ err = -EINVAL;
+ }
+
+out:
+ if (elf)
+ elf_end(elf);
+ close(fd);
+ return err;
+}
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index 53efde0e2998..bc3b92057033 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -4,6 +4,9 @@
#include <bpf/libbpf.h>
+#define __ALIGN_MASK(x, mask) (((x)+(mask))&~(mask))
+#define ALIGN(x, a) __ALIGN_MASK(x, (typeof(x))(a)-1)
+
struct ksym {
long addr;
char *name;
@@ -23,4 +26,6 @@ void read_trace_pipe(void);
ssize_t get_uprobe_offset(const void *addr);
ssize_t get_rel_offset(uintptr_t addr);
+int read_build_id(const char *path, char *build_id);
+
#endif
--
2.39.2
Moving error macros from profiler.inc.h to new err.h header.
It will be used in following changes.
Also adding PTR_ERR macro that will be used in following changes.
Signed-off-by: Jiri Olsa <[email protected]>
---
tools/testing/selftests/bpf/progs/err.h | 18 ++++++++++++++++++
.../testing/selftests/bpf/progs/profiler.inc.h | 3 +--
2 files changed, 19 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/err.h
diff --git a/tools/testing/selftests/bpf/progs/err.h b/tools/testing/selftests/bpf/progs/err.h
new file mode 100644
index 000000000000..d66d283d9e59
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/err.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ERR_H__
+#define __ERR_H__
+
+#define MAX_ERRNO 4095
+#define IS_ERR_VALUE(x) (unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO
+
+static inline int IS_ERR_OR_NULL(const void *ptr)
+{
+ return !ptr || IS_ERR_VALUE((unsigned long)ptr);
+}
+
+static inline long PTR_ERR(const void *ptr)
+{
+ return (long) ptr;
+}
+
+#endif /* __ERR_H__ */
diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h b/tools/testing/selftests/bpf/progs/profiler.inc.h
index 875513866032..f799d87e8700 100644
--- a/tools/testing/selftests/bpf/progs/profiler.inc.h
+++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
@@ -6,6 +6,7 @@
#include <bpf/bpf_tracing.h>
#include "profiler.h"
+#include "err.h"
#ifndef NULL
#define NULL 0
@@ -16,7 +17,6 @@
#define O_DIRECTORY 00200000
#define __O_TMPFILE 020000000
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
-#define MAX_ERRNO 4095
#define S_IFMT 00170000
#define S_IFSOCK 0140000
#define S_IFLNK 0120000
@@ -34,7 +34,6 @@
#define S_ISBLK(m) (((m)&S_IFMT) == S_IFBLK)
#define S_ISFIFO(m) (((m)&S_IFMT) == S_IFIFO)
#define S_ISSOCK(m) (((m)&S_IFMT) == S_IFSOCK)
-#define IS_ERR_VALUE(x) (unsigned long)(void*)(x) >= (unsigned long)-MAX_ERRNO
#define KILL_DATA_ARRAY_SIZE 8
--
2.39.2
Replacing extract_build_id with read_build_id that parses out
build id directly from elf without using readelf tool.
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Jiri Olsa <[email protected]>
---
.../bpf/prog_tests/stacktrace_build_id.c | 19 ++++++--------
.../bpf/prog_tests/stacktrace_build_id_nmi.c | 17 +++++--------
tools/testing/selftests/bpf/test_progs.c | 25 -------------------
tools/testing/selftests/bpf/test_progs.h | 1 -
4 files changed, 13 insertions(+), 49 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c
index 9ad09a6c538a..a2e75a976f04 100644
--- a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c
+++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c
@@ -7,13 +7,12 @@ void test_stacktrace_build_id(void)
int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
struct test_stacktrace_build_id *skel;
- int err, stack_trace_len;
+ int err, stack_trace_len, build_id_size;
__u32 key, prev_key, val, duration = 0;
- char buf[256];
- int i, j;
+ char buf[BPF_BUILD_ID_SIZE];
struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH];
int build_id_matches = 0;
- int retry = 1;
+ int i, retry = 1;
retry:
skel = test_stacktrace_build_id__open_and_load();
@@ -52,9 +51,10 @@ void test_stacktrace_build_id(void)
"err %d errno %d\n", err, errno))
goto cleanup;
- err = extract_build_id(buf, 256);
+ build_id_size = read_build_id("urandom_read", buf);
+ err = build_id_size < 0 ? build_id_size : 0;
- if (CHECK(err, "get build_id with readelf",
+ if (CHECK(err, "read_build_id",
"err %d errno %d\n", err, errno))
goto cleanup;
@@ -64,8 +64,6 @@ void test_stacktrace_build_id(void)
goto cleanup;
do {
- char build_id[64];
-
err = bpf_map_lookup_elem(stackmap_fd, &key, id_offs);
if (CHECK(err, "lookup_elem from stackmap",
"err %d, errno %d\n", err, errno))
@@ -73,10 +71,7 @@ void test_stacktrace_build_id(void)
for (i = 0; i < PERF_MAX_STACK_DEPTH; ++i)
if (id_offs[i].status == BPF_STACK_BUILD_ID_VALID &&
id_offs[i].offset != 0) {
- for (j = 0; j < 20; ++j)
- sprintf(build_id + 2 * j, "%02x",
- id_offs[i].build_id[j] & 0xff);
- if (strstr(buf, build_id) != NULL)
+ if (memcmp(buf, id_offs[i].build_id, build_id_size) == 0)
build_id_matches = 1;
}
prev_key = key;
diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c
index f4ea1a215ce4..4a1c5a692730 100644
--- a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c
+++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c
@@ -28,11 +28,10 @@ void test_stacktrace_build_id_nmi(void)
.config = PERF_COUNT_HW_CPU_CYCLES,
};
__u32 key, prev_key, val, duration = 0;
- char buf[256];
- int i, j;
+ char buf[BPF_BUILD_ID_SIZE];
struct bpf_stack_build_id id_offs[PERF_MAX_STACK_DEPTH];
- int build_id_matches = 0;
- int retry = 1;
+ int build_id_matches = 0, build_id_size;
+ int i, retry = 1;
attr.sample_freq = read_perf_max_sample_freq();
@@ -94,7 +93,8 @@ void test_stacktrace_build_id_nmi(void)
"err %d errno %d\n", err, errno))
goto cleanup;
- err = extract_build_id(buf, 256);
+ build_id_size = read_build_id("urandom_read", buf);
+ err = build_id_size < 0 ? build_id_size : 0;
if (CHECK(err, "get build_id with readelf",
"err %d errno %d\n", err, errno))
@@ -106,8 +106,6 @@ void test_stacktrace_build_id_nmi(void)
goto cleanup;
do {
- char build_id[64];
-
err = bpf_map__lookup_elem(skel->maps.stackmap, &key, sizeof(key),
id_offs, sizeof(id_offs), 0);
if (CHECK(err, "lookup_elem from stackmap",
@@ -116,10 +114,7 @@ void test_stacktrace_build_id_nmi(void)
for (i = 0; i < PERF_MAX_STACK_DEPTH; ++i)
if (id_offs[i].status == BPF_STACK_BUILD_ID_VALID &&
id_offs[i].offset != 0) {
- for (j = 0; j < 20; ++j)
- sprintf(build_id + 2 * j, "%02x",
- id_offs[i].build_id[j] & 0xff);
- if (strstr(buf, build_id) != NULL)
+ if (memcmp(buf, id_offs[i].build_id, build_id_size) == 0)
build_id_matches = 1;
}
prev_key = key;
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 6d5e3022c75f..9813d53c4878 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -591,31 +591,6 @@ int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len)
return err;
}
-int extract_build_id(char *build_id, size_t size)
-{
- FILE *fp;
- char *line = NULL;
- size_t len = 0;
-
- fp = popen("readelf -n ./urandom_read | grep 'Build ID'", "r");
- if (fp == NULL)
- return -1;
-
- if (getline(&line, &len, fp) == -1)
- goto err;
- pclose(fp);
-
- if (len > size)
- len = size;
- memcpy(build_id, line, len);
- build_id[len] = '\0';
- free(line);
- return 0;
-err:
- pclose(fp);
- return -1;
-}
-
static int finit_module(int fd, const char *param_values, int flags)
{
return syscall(__NR_finit_module, fd, param_values, flags);
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index 3cbf005747ed..d91427bfe0d7 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -404,7 +404,6 @@ static inline void *u64_to_ptr(__u64 ptr)
int bpf_find_map(const char *test, struct bpf_object *obj, const char *name);
int compare_map_keys(int map1_fd, int map2_fd);
int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len);
-int extract_build_id(char *build_id, size_t size);
int kern_sync_rcu(void);
int trigger_module_test_read(int read_sz);
int trigger_module_test_write(int write_sz);
--
2.39.2
Testing iterator access to build id in vma->vm_file object by storing
each binary with build id into map and checking it against build id
retrieved in user space.
Signed-off-by: Jiri Olsa <[email protected]>
---
.../selftests/bpf/prog_tests/bpf_iter.c | 78 +++++++++++++++++++
.../bpf/progs/bpf_iter_task_vma_buildid.c | 56 +++++++++++++
2 files changed, 134 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
index 1f02168103dd..c7dd89e7cad0 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
@@ -33,6 +33,7 @@
#include "bpf_iter_bpf_link.skel.h"
#include "bpf_iter_ksym.skel.h"
#include "bpf_iter_sockmap.skel.h"
+#include "bpf_iter_task_vma_buildid.skel.h"
static int duration;
@@ -1536,6 +1537,81 @@ static void test_task_vma_dead_task(void)
bpf_iter_task_vma__destroy(skel);
}
+#define D_PATH_BUF_SIZE 1024
+
+struct build_id {
+ u32 sz;
+ char data[BPF_BUILD_ID_SIZE];
+};
+
+static void test_task_vma_buildid(void)
+{
+ int err, iter_fd = -1, proc_maps_fd = -1, sz;
+ struct bpf_iter_task_vma_buildid *skel;
+ char key[D_PATH_BUF_SIZE], *prev_key;
+ char build_id[BPF_BUILD_ID_SIZE];
+ int len, files_fd, cnt = 0;
+ struct build_id val;
+ char c;
+
+ skel = bpf_iter_task_vma_buildid__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vma_buildid__open_and_load"))
+ return;
+
+ skel->links.proc_maps = bpf_program__attach_iter(
+ skel->progs.proc_maps, NULL);
+
+ if (!ASSERT_OK_PTR(skel->links.proc_maps, "bpf_program__attach_iter")) {
+ skel->links.proc_maps = NULL;
+ goto out;
+ }
+
+ iter_fd = bpf_iter_create(bpf_link__fd(skel->links.proc_maps));
+ if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+ goto out;
+
+ /* trigger the iterator, there's no output, just map */
+ len = read(iter_fd, &c, 1);
+ ASSERT_EQ(len, 0, "len_check");
+
+ files_fd = bpf_map__fd(skel->maps.files);
+
+ prev_key = NULL;
+
+ while (true) {
+ err = bpf_map_get_next_key(files_fd, prev_key, &key);
+ if (err) {
+ if (errno == ENOENT)
+ err = 0;
+ break;
+ }
+ if (bpf_map_lookup_elem(files_fd, key, &val))
+ break;
+ if (!ASSERT_LE(val.sz, BPF_BUILD_ID_SIZE, "buildid_size"))
+ break;
+
+ sz = read_build_id(key, build_id);
+ /* If there's an error, the build id is not present or malformed, kernel
+ * should see the same result and bpf program pushed zero build id.
+ */
+ if (sz < 0) {
+ memset(build_id, 0x0, BPF_BUILD_ID_SIZE);
+ sz = BPF_BUILD_ID_SIZE;
+ }
+ ASSERT_EQ(val.sz, sz, "build_id_size");
+ ASSERT_MEMEQ(val.data, build_id, sz, "build_id_data");
+
+ prev_key = key;
+ cnt++;
+ }
+
+ printf("checked %d files\n", cnt);
+out:
+ close(proc_maps_fd);
+ close(iter_fd);
+ bpf_iter_task_vma_buildid__destroy(skel);
+}
+
void test_bpf_sockmap_map_iter_fd(void)
{
struct bpf_iter_sockmap *skel;
@@ -1659,6 +1735,8 @@ void test_bpf_iter(void)
test_task_vma();
if (test__start_subtest("task_vma_dead_task"))
test_task_vma_dead_task();
+ if (test__start_subtest("task_vma_buildid"))
+ test_task_vma_buildid();
if (test__start_subtest("task_btf"))
test_task_btf();
if (test__start_subtest("tcp4"))
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
new file mode 100644
index 000000000000..11a59c0f1aba
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "bpf_iter.h"
+#include "err.h"
+#include <bpf/bpf_helpers.h>
+#include <string.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define VM_EXEC 0x00000004
+#define D_PATH_BUF_SIZE 1024
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 10000);
+ __type(key, char[D_PATH_BUF_SIZE]);
+ __type(value, struct build_id);
+} files SEC(".maps");
+
+static char path[D_PATH_BUF_SIZE];
+static struct build_id build_id;
+
+SEC("iter/task_vma")
+int proc_maps(struct bpf_iter__task_vma *ctx)
+{
+ struct vm_area_struct *vma = ctx->vma;
+ struct task_struct *task = ctx->task;
+ struct file *file;
+
+ if (task == (void *)0 || vma == (void *)0)
+ return 0;
+
+ if (!(vma->vm_flags & VM_EXEC))
+ return 0;
+
+ file = vma->vm_file;
+ if (!file)
+ return 0;
+
+ __builtin_memset(path, 0x0, D_PATH_BUF_SIZE);
+ bpf_d_path(&file->f_path, (char *) &path, D_PATH_BUF_SIZE);
+
+ if (bpf_map_lookup_elem(&files, &path))
+ return 0;
+
+ if (IS_ERR_OR_NULL(file->f_build_id)) {
+ /* On error return empty build id. */
+ __builtin_memset(&build_id.data, 0x0, sizeof(build_id.data));
+ build_id.sz = BUILD_ID_SIZE_MAX;
+ } else {
+ __builtin_memcpy(&build_id, file->f_build_id, sizeof(*file->f_build_id));
+ }
+
+ bpf_map_update_elem(&files, &path, &build_id, 0);
+ return 0;
+}
--
2.39.2
The test attaches bpf program to sched_process_exec tracepoint
and gets build of executed file from bprm->file object.
We use urandom_read as the test program and in addition we also
attach uprobe to liburandom_read.so:urandlib_read_without_sema
and retrieve and check build id of that shared library.
Also executing the no_build_id binary to verify the bpf program
gets the error properly.
Signed-off-by: Jiri Olsa <[email protected]>
---
tools/testing/selftests/bpf/Makefile | 7 +-
tools/testing/selftests/bpf/no_build_id.c | 6 ++
.../selftests/bpf/prog_tests/file_build_id.c | 98 +++++++++++++++++++
.../selftests/bpf/progs/file_build_id.c | 70 +++++++++++++
tools/testing/selftests/bpf/test_progs.h | 10 ++
5 files changed, 190 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/no_build_id.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/file_build_id.c
create mode 100644 tools/testing/selftests/bpf/progs/file_build_id.c
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 55811c448eb7..cf93a23c7962 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -88,7 +88,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
xdp_features
-TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read $(OUTPUT)/sign-file
+TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read $(OUTPUT)/sign-file $(OUTPUT)/no_build_id
TEST_GEN_FILES += liburandom_read.so
# Emit succinct information message describing current building step
@@ -201,6 +201,10 @@ $(OUTPUT)/sign-file: ../../../../scripts/sign-file.c
$< -o $@ \
$(shell $(HOSTPKG_CONFIG) --libs libcrypto 2> /dev/null || echo -lcrypto)
+$(OUTPUT)/no_build_id: no_build_id.c
+ $(call msg,BINARY,,$@)
+ $(Q)$(CC) $^ -Wl,--build-id=none -o $@
+
$(OUTPUT)/bpf_testmod.ko: $(VMLINUX_BTF) $(wildcard bpf_testmod/Makefile bpf_testmod/*.[ch])
$(call msg,MOD,,$@)
$(Q)$(RM) bpf_testmod/bpf_testmod.ko # force re-compilation
@@ -564,6 +568,7 @@ TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko \
$(OUTPUT)/liburandom_read.so \
$(OUTPUT)/xdp_synproxy \
$(OUTPUT)/sign-file \
+ $(OUTPUT)/no_build_id \
ima_setup.sh \
verify_sig_setup.sh \
$(wildcard progs/btf_dump_test_case_*.c) \
diff --git a/tools/testing/selftests/bpf/no_build_id.c b/tools/testing/selftests/bpf/no_build_id.c
new file mode 100644
index 000000000000..a6f28f1c06d5
--- /dev/null
+++ b/tools/testing/selftests/bpf/no_build_id.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+
+int main(void)
+{
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/file_build_id.c b/tools/testing/selftests/bpf/prog_tests/file_build_id.c
new file mode 100644
index 000000000000..cb233cfd58cb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/file_build_id.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <unistd.h>
+#include <test_progs.h>
+#include "file_build_id.skel.h"
+#include "trace_helpers.h"
+
+static void
+test_build_id(const char *bin, const char *lib, long bin_err, long lib_err)
+{
+ int err, child_pid = 0, child_status, c = 1, sz;
+ char build_id[BPF_BUILD_ID_SIZE];
+ struct file_build_id *skel;
+ int go[2] = { -1, -1 };
+
+ if (!ASSERT_OK(pipe(go), "pipe"))
+ return;
+
+ skel = file_build_id__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "file_build_id__open_and_load"))
+ goto out;
+
+ child_pid = fork();
+ if (child_pid < 0)
+ goto out;
+
+ /* child */
+ if (child_pid == 0) {
+ close(go[1]);
+ /* wait for parent's pid update */
+ err = read(go[0], &c, 1);
+ if (!ASSERT_EQ(err, 1, "child_read_pipe"))
+ exit(err);
+
+ execle(bin, bin, NULL, NULL);
+ exit(errno);
+ }
+
+ /* parent, update child's pid and kick it */
+ skel->bss->pid = child_pid;
+
+ close(go[0]);
+
+ err = file_build_id__attach(skel);
+ if (!ASSERT_OK(err, "file_build_id__attach"))
+ goto out;
+
+ err = write(go[1], &c, 1);
+ if (!ASSERT_EQ(err, 1, "child_write_pipe"))
+ goto out;
+
+ /* wait for child to exit */
+ waitpid(child_pid, &child_status, 0);
+ child_pid = 0;
+ if (!ASSERT_EQ(WEXITSTATUS(child_status), 0, "child_exit_value"))
+ goto out;
+
+ /* test binary */
+ sz = read_build_id(bin, build_id);
+ err = sz > 0 ? 0 : sz;
+
+ ASSERT_EQ((long) err, bin_err, "read_build_id_bin_err");
+ ASSERT_EQ(skel->bss->build_id_bin_err, bin_err, "build_id_bin_err");
+
+ if (!err) {
+ ASSERT_EQ(skel->bss->build_id_bin_size, sz, "build_id_bin_size");
+ ASSERT_MEMEQ(skel->bss->build_id_bin, build_id, sz, "build_id_bin");
+ }
+
+ /* test library if present */
+ if (lib) {
+ sz = read_build_id(lib, build_id);
+ err = sz > 0 ? 0 : sz;
+
+ ASSERT_EQ((long) err, lib_err, "read_build_id_lib_err");
+ ASSERT_EQ(skel->bss->build_id_lib_err, lib_err, "build_id_lib_err");
+
+ if (!err) {
+ ASSERT_EQ(skel->bss->build_id_lib_size, sz, "build_id_lib_size");
+ ASSERT_MEMEQ(skel->bss->build_id_lib, build_id, sz, "build_id_lib");
+ }
+ }
+
+out:
+ close(go[1]);
+ close(go[0]);
+ if (child_pid)
+ waitpid(child_pid, &child_status, 0);
+ file_build_id__destroy(skel);
+}
+
+void test_file_build_id(void)
+{
+ if (test__start_subtest("present"))
+ test_build_id("./urandom_read", "./liburandom_read.so", 0, 0);
+ if (test__start_subtest("missing"))
+ test_build_id("./no_build_id", NULL, -EINVAL, 0);
+}
diff --git a/tools/testing/selftests/bpf/progs/file_build_id.c b/tools/testing/selftests/bpf/progs/file_build_id.c
new file mode 100644
index 000000000000..6dc10c8e17ac
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/file_build_id.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include "err.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <linux/string.h>
+
+char _license[] SEC("license") = "GPL";
+
+int pid;
+
+u32 build_id_bin_size;
+u32 build_id_lib_size;
+
+char build_id_bin[BUILD_ID_SIZE_MAX];
+char build_id_lib[BUILD_ID_SIZE_MAX];
+
+long build_id_bin_err;
+long build_id_lib_err;
+
+static int store_build_id(struct file *file, char *build_id, u32 *sz, long *err)
+{
+ struct build_id *bid;
+
+ bid = file->f_build_id;
+ if (IS_ERR_OR_NULL(bid)) {
+ *err = PTR_ERR(bid);
+ return 0;
+ }
+ *sz = bid->sz;
+ if (bid->sz > sizeof(bid->data)) {
+ *err = 1;
+ return 0;
+ }
+ __builtin_memcpy(build_id, bid->data, sizeof(bid->data));
+ *err = 0;
+ return 0;
+}
+
+SEC("tp_btf/sched_process_exec")
+int BPF_PROG(prog, struct task_struct *p, pid_t old_pid, struct linux_binprm *bprm)
+{
+ int cur_pid = bpf_get_current_pid_tgid() >> 32;
+
+ if (pid != cur_pid)
+ return 0;
+ if (!bprm->file)
+ return 0;
+ return store_build_id(bprm->file, build_id_bin, &build_id_bin_size, &build_id_bin_err);
+}
+
+static long check_vma(struct task_struct *task, struct vm_area_struct *vma,
+ void *data)
+{
+ if (!vma || !vma->vm_file || !vma->vm_file)
+ return 0;
+ return store_build_id(vma->vm_file, build_id_lib, &build_id_lib_size, &build_id_lib_err);
+}
+
+SEC("uprobe/./liburandom_read.so:urandlib_read_without_sema")
+int BPF_UPROBE(urandlib_read_without_sema)
+{
+ struct task_struct *task = bpf_get_current_task_btf();
+ int cur_pid = bpf_get_current_pid_tgid() >> 32;
+
+ if (pid != cur_pid)
+ return 0;
+ return bpf_find_vma(task, ctx->ip, check_vma, NULL, 0);
+}
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index d91427bfe0d7..285dd5d91426 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -310,6 +310,16 @@ int test__join_cgroup(const char *path);
___ok; \
})
+#define ASSERT_MEMEQ(actual, expected, sz, name) ({ \
+ static int duration = 0; \
+ const char *___act = actual; \
+ const char *___exp = expected; \
+ bool ___ok = memcmp(___act, ___exp, sz) == 0; \
+ CHECK(!___ok, (name), \
+ "unexpected %s does not match\n", (name)); \
+ ___ok; \
+})
+
#define ASSERT_STRNEQ(actual, expected, len, name) ({ \
static int duration = 0; \
const char *___act = actual; \
--
2.39.2
On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> hi,
> this patchset adds build id object pointer to struct file object.
>
> We have several use cases for build id to be used in BPF programs
> [2][3].
Yes, you have use cases, but you never answered the question I asked:
Is this going to be enabled by every distro kernel, or is it for special
use-cases where only people doing a very specialised thing who are
willing to build their own kernels will use it?
Saying "hubble/tetragon" doesn't answer that question. Maybe it does
to you, but I have no idea what that software is.
Put it another way: how does this make *MY* life better? Literally me.
How will it affect my life?
On Thu, Mar 16, 2023 at 10:35 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds build id object pointer to struct file object.
> >
> > We have several use cases for build id to be used in BPF programs
> > [2][3].
>
> Yes, you have use cases, but you never answered the question I asked:
>
> Is this going to be enabled by every distro kernel, or is it for special
> use-cases where only people doing a very specialised thing who are
> willing to build their own kernels will use it?
>
> Saying "hubble/tetragon" doesn't answer that question. Maybe it does
> to you, but I have no idea what that software is.
>
> Put it another way: how does this make *MY* life better? Literally me.
> How will it affect my life?
So at Google we use build IDs for all profiling, I believe Meta is the
same but obviously I can't speak for them. For BPF program stack
traces, using build ID + offset stack traces is preferable to perf's
whole system synthesis of mmap events based on data held in
/proc/pid/maps. Individual stack traces are larger, but you avoid the
ever growing problem of coming up with some initial virtual memory
state that will allow you to identify samples.
This doesn't answer the question about how this will help you, but I
expect over time you will see scalability issues and also want to use
tools assuming build IDs are present and cheap to access.
Thanks,
Ian
On 3/16/23 6:01 PM, Jiri Olsa wrote:
> The test attaches bpf program to sched_process_exec tracepoint
> and gets build of executed file from bprm->file object.
>
> We use urandom_read as the test program and in addition we also
> attach uprobe to liburandom_read.so:urandlib_read_without_sema
> and retrieve and check build id of that shared library.
>
> Also executing the no_build_id binary to verify the bpf program
> gets the error properly.
>
> Signed-off-by: Jiri Olsa <[email protected]>
[...]
> diff --git a/tools/testing/selftests/bpf/progs/file_build_id.c b/tools/testing/selftests/bpf/progs/file_build_id.c
> new file mode 100644
> index 000000000000..6dc10c8e17ac
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/file_build_id.c
> @@ -0,0 +1,70 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include "err.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +#include <linux/string.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +int pid;
> +
> +u32 build_id_bin_size;
> +u32 build_id_lib_size;
> +
> +char build_id_bin[BUILD_ID_SIZE_MAX];
> +char build_id_lib[BUILD_ID_SIZE_MAX];
> +
> +long build_id_bin_err;
> +long build_id_lib_err;
> +
> +static int store_build_id(struct file *file, char *build_id, u32 *sz, long *err)
> +{
> + struct build_id *bid;
> +
> + bid = file->f_build_id;
> + if (IS_ERR_OR_NULL(bid)) {
I think you might need to enable this also in the CI config:
[...]
CLNG-BPF [test_maps] btf_dump_test_case_multidim.bpf.o
CLNG-BPF [test_maps] ima.bpf.o
CLNG-BPF [test_maps] bpf_cubic.bpf.o
progs/bpf_iter_task_vma_buildid.c:46:27: error: no member named 'f_build_id' in 'struct file'
if (IS_ERR_OR_NULL(file->f_build_id)) {
~~~~ ^
progs/bpf_iter_task_vma_buildid.c:51:37: error: no member named 'f_build_id' in 'struct file'
__builtin_memcpy(&build_id, file->f_build_id, sizeof(*file->f_build_id));
~~~~ ^
progs/bpf_iter_task_vma_buildid.c:51:63: error: no member named 'f_build_id' in 'struct file'
__builtin_memcpy(&build_id, file->f_build_id, sizeof(*file->f_build_id));
~~~~ ^
3 errors generated.
make: *** [Makefile:578: /tmp/work/bpf/bpf/tools/testing/selftests/bpf/bpf_iter_task_vma_buildid.bpf.o] Error 1
make: *** Waiting for unfinished jobs....
make: Leaving directory '/tmp/work/bpf/bpf/tools/testing/selftests/bpf'
Error: Process completed with exit code 2.
Thanks,
Daniel
On Thu, Mar 16, 2023 at 10:50 AM Ian Rogers <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 10:35 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds build id object pointer to struct file object.
> > >
> > > We have several use cases for build id to be used in BPF programs
> > > [2][3].
> >
> > Yes, you have use cases, but you never answered the question I asked:
> >
> > Is this going to be enabled by every distro kernel, or is it for special
> > use-cases where only people doing a very specialised thing who are
> > willing to build their own kernels will use it?
> >
> > Saying "hubble/tetragon" doesn't answer that question. Maybe it does
> > to you, but I have no idea what that software is.
> >
> > Put it another way: how does this make *MY* life better? Literally me.
> > How will it affect my life?
>
> So at Google we use build IDs for all profiling, I believe Meta is the
> same but obviously I can't speak for them. For BPF program stack
Yep, Meta is also capturing stack traces with build ID as well, if
possible. Build IDs help with profiling short-lived processes which
exit before the profiling session is done and user-space tooling is
able to collect /proc/<pid>/maps contents (which is what Ian is
referring to here). But also build ID allows to offload more of the
expensive stack symbolization process (converting raw memory addresses
into human readable function+offset+file path+line numbers
information) to dedicated remote servers, by allowing to cache and
reuse preprocessed DWARF/ELF information based on build ID.
I believe perf tool is also using build ID, so any tool relying on
perf capturing full and complete profiling data for system-wide
performance analysis would benefit as well.
Generally speaking, there is a whole ecosystem built on top of
assumption that binaries have build ID and profiling tooling is able
to provide more value if those build IDs are more reliably collected.
Which ultimately benefits the entire open-source ecosystem by allowing
people to spot issues (not necessarily just performance, it could be
correctness issues as well) more reliably, fix them, and benefit every
user.
> traces, using build ID + offset stack traces is preferable to perf's
> whole system synthesis of mmap events based on data held in
> /proc/pid/maps. Individual stack traces are larger, but you avoid the
> ever growing problem of coming up with some initial virtual memory
> state that will allow you to identify samples.
>
> This doesn't answer the question about how this will help you, but I
> expect over time you will see scalability issues and also want to use
> tools assuming build IDs are present and cheap to access.
>
> Thanks,
> Ian
On Thu, Mar 16, 2023 at 10:02 AM Jiri Olsa <[email protected]> wrote:
>
> Use build id from file object in stackmap if it's available.
>
> The file's f_build_id is available (for CONFIG_FILE_BUILD_ID option)
> when the file is mmap-ed, so it will be available (if present) when
> used by stackmap.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
LGTM.
Acked-by: Andrii Nakryiko <[email protected]>
> kernel/bpf/stackmap.c | 24 +++++++++++++++++++++++-
> 1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 0f1d8dced933..14d27bd83081 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -124,6 +124,28 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
> return ERR_PTR(err);
> }
>
> +#ifdef CONFIG_FILE_BUILD_ID
> +static int vma_get_build_id(struct vm_area_struct *vma, unsigned char *build_id)
> +{
> + struct build_id *bid;
> +
> + if (!vma->vm_file)
> + return -EINVAL;
> + bid = vma->vm_file->f_build_id;
> + if (IS_ERR_OR_NULL(bid))
> + return bid ? PTR_ERR(bid) : -ENOENT;
> + if (bid->sz > BUILD_ID_SIZE_MAX)
> + return -EINVAL;
> + memcpy(build_id, bid->data, bid->sz);
> + return 0;
> +}
> +#else
> +static int vma_get_build_id(struct vm_area_struct *vma, unsigned char *build_id)
> +{
> + return build_id_parse(vma, build_id, NULL);
> +}
> +#endif
> +
> static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> u64 *ips, u32 trace_nr, bool user)
> {
> @@ -156,7 +178,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> goto build_id_valid;
> }
> vma = find_vma(current->mm, ips[i]);
> - if (!vma || build_id_parse(vma, id_offs[i].build_id, NULL)) {
> + if (!vma || vma_get_build_id(vma, id_offs[i].build_id)) {
> /* per entry fall back to ips */
> id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> id_offs[i].ip = ips[i];
> --
> 2.39.2
>
On Thu, Mar 16, 2023 at 10:02 AM Jiri Olsa <[email protected]> wrote:
>
> Storing build id in file object for elf executable with build
> id defined. The build id is stored when file is mmaped.
>
> The build id object assignment to the file is locked with existing
> file->f_mapping semaphore.
>
> The f_build_id pointer points either build id object or carries
> the error the build id retrieval failed on.
>
> It's hidden behind new config option CONFIG_FILE_BUILD_ID.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
> fs/file_table.c | 3 +++
> include/linux/buildid.h | 17 +++++++++++++++++
> include/linux/fs.h | 7 +++++++
> lib/buildid.c | 42 +++++++++++++++++++++++++++++++++++++++++
> mm/Kconfig | 9 +++++++++
> mm/mmap.c | 18 ++++++++++++++++++
> 6 files changed, 96 insertions(+)
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 372653b92617..d72f72503268 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -29,6 +29,7 @@
> #include <linux/ima.h>
> #include <linux/swap.h>
> #include <linux/kmemleak.h>
> +#include <linux/buildid.h>
>
> #include <linux/atomic.h>
>
> @@ -48,6 +49,7 @@ static void file_free_rcu(struct rcu_head *head)
> {
> struct file *f = container_of(head, struct file, f_rcuhead);
>
> + file_build_id_free(f);
> put_cred(f->f_cred);
> kmem_cache_free(filp_cachep, f);
> }
> @@ -413,6 +415,7 @@ void __init files_init(void)
> filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> + build_id_init();
> }
>
> /*
> diff --git a/include/linux/buildid.h b/include/linux/buildid.h
> index 3b7a0ff4642f..b8b2e00420d6 100644
> --- a/include/linux/buildid.h
> +++ b/include/linux/buildid.h
> @@ -3,9 +3,15 @@
> #define _LINUX_BUILDID_H
>
> #include <linux/mm_types.h>
> +#include <linux/slab.h>
>
> #define BUILD_ID_SIZE_MAX 20
>
> +struct build_id {
> + u32 sz;
> + char data[BUILD_ID_SIZE_MAX];
> +};
> +
> int build_id_parse(struct vm_area_struct *vma, unsigned char *build_id,
> __u32 *size);
> int build_id_parse_buf(const void *buf, unsigned char *build_id, u32 buf_size);
> @@ -17,4 +23,15 @@ void init_vmlinux_build_id(void);
> static inline void init_vmlinux_build_id(void) { }
> #endif
>
> +#ifdef CONFIG_FILE_BUILD_ID
> +void __init build_id_init(void);
> +void build_id_free(struct build_id *bid);
> +void file_build_id_free(struct file *f);
> +void vma_read_build_id(struct vm_area_struct *vma, struct build_id **bidp);
> +#else
> +static inline void __init build_id_init(void) { }
> +static inline void build_id_free(struct build_id *bid) { }
> +static inline void file_build_id_free(struct file *f) { }
> +#endif /* CONFIG_FILE_BUILD_ID */
> +
> #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c85916e9f7db..ce03fd965cdb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -977,6 +977,13 @@ struct file {
> struct address_space *f_mapping;
> errseq_t f_wb_err;
> errseq_t f_sb_err; /* for syncfs */
> +#ifdef CONFIG_FILE_BUILD_ID
> + /*
> + * Initialized when the file is mmaped (mmap_region),
> + * guarded by f_mapping lock.
> + */
> + struct build_id *f_build_id;
> +#endif
> } __randomize_layout
> __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
>
> diff --git a/lib/buildid.c b/lib/buildid.c
> index dfc62625cae4..04181c0b7c21 100644
> --- a/lib/buildid.c
> +++ b/lib/buildid.c
> @@ -5,6 +5,7 @@
> #include <linux/elf.h>
> #include <linux/kernel.h>
> #include <linux/pagemap.h>
> +#include <linux/slab.h>
>
> #define BUILD_ID 3
>
> @@ -189,3 +190,44 @@ void __init init_vmlinux_build_id(void)
> build_id_parse_buf(&__start_notes, vmlinux_build_id, size);
> }
> #endif
> +
> +#ifdef CONFIG_FILE_BUILD_ID
> +
> +/* SLAB cache for build_id structures */
> +static struct kmem_cache *build_id_cachep;
> +
> +void vma_read_build_id(struct vm_area_struct *vma, struct build_id **bidp)
this function clearly has a result to return, so why use void function
and out parameters instead of just returning `struct build_id *`?
> +{
> + struct build_id *bid = ERR_PTR(-ENOMEM);
> + int err;
> +
> + bid = kmem_cache_alloc(build_id_cachep, GFP_KERNEL);
> + if (!bid)
> + goto out;
> + err = build_id_parse(vma, bid->data, &bid->sz);
> + if (err) {
> + build_id_free(bid);
> + bid = ERR_PTR(err);
> + }
[...]
On Thu, Mar 16, 2023 at 10:02 AM Jiri Olsa <[email protected]> wrote:
>
> Switching BUILD_ID_SIZE_MAX to enum, so we expose it to BPF
> programs through vmlinux.h.
>
> Suggested-by: Andrii Nakryiko <[email protected]>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
Thanks!
Acked-by: Andrii Nakryiko <[email protected]>
> include/linux/buildid.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/buildid.h b/include/linux/buildid.h
> index b8b2e00420d6..316971c634fe 100644
> --- a/include/linux/buildid.h
> +++ b/include/linux/buildid.h
> @@ -5,7 +5,9 @@
> #include <linux/mm_types.h>
> #include <linux/slab.h>
>
> -#define BUILD_ID_SIZE_MAX 20
> +enum {
> + BUILD_ID_SIZE_MAX = 20
> +};
>
> struct build_id {
> u32 sz;
> --
> 2.39.2
>
On Thu, Mar 16, 2023 at 10:03 AM Jiri Olsa <[email protected]> wrote:
>
> Adding read_build_id function that parses out build id from
> specified binary.
>
> It will replace extract_build_id and also be used in following
> changes.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
> tools/testing/selftests/bpf/trace_helpers.c | 86 +++++++++++++++++++++
> tools/testing/selftests/bpf/trace_helpers.h | 5 ++
> 2 files changed, 91 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
> index 934bf28fc888..72b38a41f574 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.c
> +++ b/tools/testing/selftests/bpf/trace_helpers.c
> @@ -11,6 +11,9 @@
> #include <linux/perf_event.h>
> #include <sys/mman.h>
> #include "trace_helpers.h"
> +#include <linux/limits.h>
> +#include <libelf.h>
> +#include <gelf.h>
>
> #define TRACEFS_PIPE "/sys/kernel/tracing/trace_pipe"
> #define DEBUGFS_PIPE "/sys/kernel/debug/tracing/trace_pipe"
> @@ -234,3 +237,86 @@ ssize_t get_rel_offset(uintptr_t addr)
> fclose(f);
> return -EINVAL;
> }
> +
> +static int
> +parse_build_id_buf(const void *note_start, Elf32_Word note_size,
> + char *build_id)
nit: single line
should we pass buffer size instead of assuming at least BPF_BUILD_ID_SIZE below?
> +{
> + Elf32_Word note_offs = 0, new_offs;
> +
> + while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
> + Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
> +
> + if (nhdr->n_type == 3 && nhdr->n_namesz == sizeof("GNU") &&
> + !strcmp((char *)(nhdr + 1), "GNU") && nhdr->n_descsz > 0 &&
> + nhdr->n_descsz <= BPF_BUILD_ID_SIZE) {
> + memcpy(build_id, note_start + note_offs +
> + ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr), nhdr->n_descsz);
> + memset(build_id + nhdr->n_descsz, 0, BPF_BUILD_ID_SIZE - nhdr->n_descsz);
> + return (int) nhdr->n_descsz;
> + }
> +
> + new_offs = note_offs + sizeof(Elf32_Nhdr) +
> + ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
> + if (new_offs >= note_size)
> + break;
while condition() above would handle this, so this check appears not necessary?
so just assign note_offs directly?
> + note_offs = new_offs;
> + }
> +
> + return -EINVAL;
nit: -ENOENT or -ESRCH?
> +}
> +
> +/* Reads binary from *path* file and returns it in the *build_id*
> + * which is expected to be at least BPF_BUILD_ID_SIZE bytes.
> + * Returns size of build id on success. On error the error value
> + * is returned.
> + */
> +int read_build_id(const char *path, char *build_id)
> +{
> + int fd, err = -EINVAL;
> + Elf *elf = NULL;
> + GElf_Ehdr ehdr;
> + size_t max, i;
> +
> + fd = open(path, O_RDONLY | O_CLOEXEC);
> + if (fd < 0)
> + return -errno;
> +
> + (void)elf_version(EV_CURRENT);
> +
> + elf = elf_begin(fd, ELF_C_READ, NULL);
ELF_C_READ_MMAP ?
> + if (!elf)
> + goto out;
> + if (elf_kind(elf) != ELF_K_ELF)
> + goto out;
> + if (gelf_getehdr(elf, &ehdr) == NULL)
nit: !gelf_getehdr()
> + goto out;
> + if (ehdr.e_ident[EI_CLASS] != ELFCLASS64)
> + goto out;
does this have to be 64-bit specific?... you are using gelf stuff, you
can be bitness-agnostic here
> +
> + for (i = 0; i < ehdr.e_phnum; i++) {
> + GElf_Phdr mem, *phdr;
> + char *data;
> +
> + phdr = gelf_getphdr(elf, i, &mem);
> + if (!phdr)
> + goto out;
> + if (phdr->p_type != PT_NOTE)
> + continue;
I don't know where ELF + build ID spec is (if at all), but it seems to
always be in the ".note.gnu.build-id" section, so should we check the
name here?
> + data = elf_rawfile(elf, &max);
> + if (!data)
> + goto out;
> + if (phdr->p_offset >= max || (phdr->p_offset + phdr->p_memsz >= max))
`phdr->p_offset + phdr->p_memsz == max` would be fine, no?
> + goto out;
> + err = parse_build_id_buf(data + phdr->p_offset, phdr->p_memsz, build_id);
> + if (err > 0)
> + goto out;
> + err = -EINVAL;
> + }
> +
> +out:
> + if (elf)
> + elf_end(elf);
> + close(fd);
> + return err;
> +}
> diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
> index 53efde0e2998..bc3b92057033 100644
> --- a/tools/testing/selftests/bpf/trace_helpers.h
> +++ b/tools/testing/selftests/bpf/trace_helpers.h
> @@ -4,6 +4,9 @@
>
> #include <bpf/libbpf.h>
>
> +#define __ALIGN_MASK(x, mask) (((x)+(mask))&~(mask))
> +#define ALIGN(x, a) __ALIGN_MASK(x, (typeof(x))(a)-1)
> +
> struct ksym {
> long addr;
> char *name;
> @@ -23,4 +26,6 @@ void read_trace_pipe(void);
> ssize_t get_uprobe_offset(const void *addr);
> ssize_t get_rel_offset(uintptr_t addr);
>
> +int read_build_id(const char *path, char *build_id);
> +
> #endif
> --
> 2.39.2
>
On Thu, Mar 16, 2023 at 10:03 AM Jiri Olsa <[email protected]> wrote:
>
> Moving error macros from profiler.inc.h to new err.h header.
> It will be used in following changes.
>
> Also adding PTR_ERR macro that will be used in following changes.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
Acked-by: Andrii Nakryiko <[email protected]>
> tools/testing/selftests/bpf/progs/err.h | 18 ++++++++++++++++++
> .../testing/selftests/bpf/progs/profiler.inc.h | 3 +--
> 2 files changed, 19 insertions(+), 2 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/progs/err.h
>
[...]
On Thu, Mar 16, 2023 at 10:03 AM Jiri Olsa <[email protected]> wrote:
>
> Testing iterator access to build id in vma->vm_file object by storing
> each binary with build id into map and checking it against build id
> retrieved in user space.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
> .../selftests/bpf/prog_tests/bpf_iter.c | 78 +++++++++++++++++++
> .../bpf/progs/bpf_iter_task_vma_buildid.c | 56 +++++++++++++
> 2 files changed, 134 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> index 1f02168103dd..c7dd89e7cad0 100644
> --- a/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> +++ b/tools/testing/selftests/bpf/prog_tests/bpf_iter.c
> @@ -33,6 +33,7 @@
> #include "bpf_iter_bpf_link.skel.h"
> #include "bpf_iter_ksym.skel.h"
> #include "bpf_iter_sockmap.skel.h"
> +#include "bpf_iter_task_vma_buildid.skel.h"
>
> static int duration;
>
> @@ -1536,6 +1537,81 @@ static void test_task_vma_dead_task(void)
> bpf_iter_task_vma__destroy(skel);
> }
>
> +#define D_PATH_BUF_SIZE 1024
> +
> +struct build_id {
> + u32 sz;
> + char data[BPF_BUILD_ID_SIZE];
> +};
> +
> +static void test_task_vma_buildid(void)
> +{
> + int err, iter_fd = -1, proc_maps_fd = -1, sz;
> + struct bpf_iter_task_vma_buildid *skel;
> + char key[D_PATH_BUF_SIZE], *prev_key;
> + char build_id[BPF_BUILD_ID_SIZE];
> + int len, files_fd, cnt = 0;
> + struct build_id val;
> + char c;
> +
> + skel = bpf_iter_task_vma_buildid__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "bpf_iter_task_vma_buildid__open_and_load"))
> + return;
> +
> + skel->links.proc_maps = bpf_program__attach_iter(
> + skel->progs.proc_maps, NULL);
> +
> + if (!ASSERT_OK_PTR(skel->links.proc_maps, "bpf_program__attach_iter")) {
> + skel->links.proc_maps = NULL;
> + goto out;
> + }
> +
> + iter_fd = bpf_iter_create(bpf_link__fd(skel->links.proc_maps));
> + if (!ASSERT_GE(iter_fd, 0, "create_iter"))
> + goto out;
> +
> + /* trigger the iterator, there's no output, just map */
> + len = read(iter_fd, &c, 1);
> + ASSERT_EQ(len, 0, "len_check");
> +
> + files_fd = bpf_map__fd(skel->maps.files);
> +
> + prev_key = NULL;
> +
> + while (true) {
> + err = bpf_map_get_next_key(files_fd, prev_key, &key);
> + if (err) {
> + if (errno == ENOENT)
> + err = 0;
> + break;
> + }
> + if (bpf_map_lookup_elem(files_fd, key, &val))
> + break;
> + if (!ASSERT_LE(val.sz, BPF_BUILD_ID_SIZE, "buildid_size"))
> + break;
> +
> + sz = read_build_id(key, build_id);
> + /* If there's an error, the build id is not present or malformed, kernel
> + * should see the same result and bpf program pushed zero build id.
> + */
> + if (sz < 0) {
> + memset(build_id, 0x0, BPF_BUILD_ID_SIZE);
> + sz = BPF_BUILD_ID_SIZE;
> + }
> + ASSERT_EQ(val.sz, sz, "build_id_size");
> + ASSERT_MEMEQ(val.data, build_id, sz, "build_id_data");
> +
> + prev_key = key;
> + cnt++;
> + }
> +
> + printf("checked %d files\n", cnt);
ASSERT_GT(cnt, 0, "file_cnt");
better than printf, and tests no results condition
> +out:
> + close(proc_maps_fd);
> + close(iter_fd);
`if (proc_maps_fd >= 0)` and `if (iter_fd >= 0)` are missing
> + bpf_iter_task_vma_buildid__destroy(skel);
> +}
> +
> void test_bpf_sockmap_map_iter_fd(void)
> {
> struct bpf_iter_sockmap *skel;
> @@ -1659,6 +1735,8 @@ void test_bpf_iter(void)
> test_task_vma();
> if (test__start_subtest("task_vma_dead_task"))
> test_task_vma_dead_task();
> + if (test__start_subtest("task_vma_buildid"))
> + test_task_vma_buildid();
> if (test__start_subtest("task_btf"))
> test_task_btf();
> if (test__start_subtest("tcp4"))
> diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c b/tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
> new file mode 100644
> index 000000000000..11a59c0f1aba
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/bpf_iter_task_vma_buildid.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "bpf_iter.h"
> +#include "err.h"
> +#include <bpf/bpf_helpers.h>
> +#include <string.h>
do you need <string.h> include? I don't think BPF-side is supposed to include it
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define VM_EXEC 0x00000004
> +#define D_PATH_BUF_SIZE 1024
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __uint(max_entries, 10000);
> + __type(key, char[D_PATH_BUF_SIZE]);
> + __type(value, struct build_id);
> +} files SEC(".maps");
> +
> +static char path[D_PATH_BUF_SIZE];
> +static struct build_id build_id;
> +
> +SEC("iter/task_vma")
> +int proc_maps(struct bpf_iter__task_vma *ctx)
> +{
> + struct vm_area_struct *vma = ctx->vma;
> + struct task_struct *task = ctx->task;
> + struct file *file;
> +
> + if (task == (void *)0 || vma == (void *)0)
> + return 0;
> +
> + if (!(vma->vm_flags & VM_EXEC))
> + return 0;
> +
> + file = vma->vm_file;
> + if (!file)
> + return 0;
> +
> + __builtin_memset(path, 0x0, D_PATH_BUF_SIZE);
> + bpf_d_path(&file->f_path, (char *) &path, D_PATH_BUF_SIZE);
> +
> + if (bpf_map_lookup_elem(&files, &path))
> + return 0;
> +
> + if (IS_ERR_OR_NULL(file->f_build_id)) {
> + /* On error return empty build id. */
> + __builtin_memset(&build_id.data, 0x0, sizeof(build_id.data));
> + build_id.sz = BUILD_ID_SIZE_MAX;
> + } else {
> + __builtin_memcpy(&build_id, file->f_build_id, sizeof(*file->f_build_id));
> + }
> +
> + bpf_map_update_elem(&files, &path, &build_id, 0);
> + return 0;
> +}
> --
> 2.39.2
>
On Thu, Mar 16, 2023 at 10:03 AM Jiri Olsa <[email protected]> wrote:
>
> The test attaches bpf program to sched_process_exec tracepoint
> and gets build of executed file from bprm->file object.
typo: build ID
>
> We use urandom_read as the test program and in addition we also
> attach uprobe to liburandom_read.so:urandlib_read_without_sema
> and retrieve and check build id of that shared library.
>
> Also executing the no_build_id binary to verify the bpf program
> gets the error properly.
>
> Signed-off-by: Jiri Olsa <[email protected]>
> ---
> tools/testing/selftests/bpf/Makefile | 7 +-
> tools/testing/selftests/bpf/no_build_id.c | 6 ++
> .../selftests/bpf/prog_tests/file_build_id.c | 98 +++++++++++++++++++
> .../selftests/bpf/progs/file_build_id.c | 70 +++++++++++++
> tools/testing/selftests/bpf/test_progs.h | 10 ++
> 5 files changed, 190 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/bpf/no_build_id.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/file_build_id.c
> create mode 100644 tools/testing/selftests/bpf/progs/file_build_id.c
>
Looks good, but let's add CONFIG_FILE_BUILD_ID to
selftests/bpf/config, as Daniel mentioned.
Acked-by: Andrii Nakryiko <[email protected]>
[...]
On Thu, Mar 16, 2023 at 02:51:52PM -0700, Andrii Nakryiko wrote:
> Yep, Meta is also capturing stack traces with build ID as well, if
> possible. Build IDs help with profiling short-lived processes which
> exit before the profiling session is done and user-space tooling is
> able to collect /proc/<pid>/maps contents (which is what Ian is
> referring to here). But also build ID allows to offload more of the
> expensive stack symbolization process (converting raw memory addresses
> into human readable function+offset+file path+line numbers
> information) to dedicated remote servers, by allowing to cache and
> reuse preprocessed DWARF/ELF information based on build ID.
>
> I believe perf tool is also using build ID, so any tool relying on
> perf capturing full and complete profiling data for system-wide
> performance analysis would benefit as well.
>
> Generally speaking, there is a whole ecosystem built on top of
> assumption that binaries have build ID and profiling tooling is able
> to provide more value if those build IDs are more reliably collected.
> Which ultimately benefits the entire open-source ecosystem by allowing
> people to spot issues (not necessarily just performance, it could be
> correctness issues as well) more reliably, fix them, and benefit every
> user.
But build IDs are _generally_ available. The only problem (AIUI)
is when you're trying to examine the contents of one container from
another container. And to solve that problem, you're imposing a cost
on everybody else with (so far) pretty vague justifications. I really
don't like to see you growing struct file for this (nor struct inode,
nor struct vm_area_struct). It's all quite unsatisfactory and I don't
have a good suggestion.
On Thu, Mar 16, 2023 at 8:51 PM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Mar 16, 2023 at 02:51:52PM -0700, Andrii Nakryiko wrote:
> > Yep, Meta is also capturing stack traces with build ID as well, if
> > possible. Build IDs help with profiling short-lived processes which
> > exit before the profiling session is done and user-space tooling is
> > able to collect /proc/<pid>/maps contents (which is what Ian is
> > referring to here). But also build ID allows to offload more of the
> > expensive stack symbolization process (converting raw memory addresses
> > into human readable function+offset+file path+line numbers
> > information) to dedicated remote servers, by allowing to cache and
> > reuse preprocessed DWARF/ELF information based on build ID.
> >
> > I believe perf tool is also using build ID, so any tool relying on
> > perf capturing full and complete profiling data for system-wide
> > performance analysis would benefit as well.
> >
> > Generally speaking, there is a whole ecosystem built on top of
> > assumption that binaries have build ID and profiling tooling is able
> > to provide more value if those build IDs are more reliably collected.
> > Which ultimately benefits the entire open-source ecosystem by allowing
> > people to spot issues (not necessarily just performance, it could be
> > correctness issues as well) more reliably, fix them, and benefit every
> > user.
>
> But build IDs are _generally_ available. The only problem (AIUI)
> is when you're trying to examine the contents of one container from
> another container. And to solve that problem, you're imposing a cost
> on everybody else with (so far) pretty vague justifications. I really
> don't like to see you growing struct file for this (nor struct inode,
> nor struct vm_area_struct). It's all quite unsatisfactory and I don't
> have a good suggestion.
There is a lot of profiling, observability and debugging tooling built
using BPF. And when capturing stack traces from BPF programs, if the
build ID note is not physically present in memory, fetching it from
the BPF program might fail in NMI (and other non-faultable contexts).
This patch set is about making sure we always can fetch build ID, even
from most restrictive environments. It's guarded by Kconfig to avoid
adding 8 bytes of overhead to struct file for environment where this
might be unacceptable, giving users and distros a choice.
On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> > But build IDs are _generally_ available. The only problem (AIUI)
> > is when you're trying to examine the contents of one container from
> > another container. And to solve that problem, you're imposing a cost
> > on everybody else with (so far) pretty vague justifications. I really
> > don't like to see you growing struct file for this (nor struct inode,
> > nor struct vm_area_struct). It's all quite unsatisfactory and I don't
> > have a good suggestion.
>
> There is a lot of profiling, observability and debugging tooling built
> using BPF. And when capturing stack traces from BPF programs, if the
> build ID note is not physically present in memory, fetching it from
> the BPF program might fail in NMI (and other non-faultable contexts).
> This patch set is about making sure we always can fetch build ID, even
> from most restrictive environments. It's guarded by Kconfig to avoid
> adding 8 bytes of overhead to struct file for environment where this
> might be unacceptable, giving users and distros a choice.
Lovely. As an exercise you might want to collect the stats on the
number of struct file instances on the system vs. the number of files
that happen to be ELF objects and are currently mmapped anywhere.
That does depend upon the load, obviously, but it's not hard to collect -
you already have more than enough hooks inserted in the relevant places.
That might give a better appreciation of the reactions...
On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
>
> > > But build IDs are _generally_ available. The only problem (AIUI)
> > > is when you're trying to examine the contents of one container from
> > > another container. And to solve that problem, you're imposing a cost
> > > on everybody else with (so far) pretty vague justifications. I really
> > > don't like to see you growing struct file for this (nor struct inode,
> > > nor struct vm_area_struct). It's all quite unsatisfactory and I don't
> > > have a good suggestion.
> >
> > There is a lot of profiling, observability and debugging tooling built
> > using BPF. And when capturing stack traces from BPF programs, if the
> > build ID note is not physically present in memory, fetching it from
> > the BPF program might fail in NMI (and other non-faultable contexts).
> > This patch set is about making sure we always can fetch build ID, even
> > from most restrictive environments. It's guarded by Kconfig to avoid
> > adding 8 bytes of overhead to struct file for environment where this
> > might be unacceptable, giving users and distros a choice.
>
> Lovely. As an exercise you might want to collect the stats on the
> number of struct file instances on the system vs. the number of files
> that happen to be ELF objects and are currently mmapped anywhere.
> That does depend upon the load, obviously, but it's not hard to collect -
> you already have more than enough hooks inserted in the relevant places.
> That might give a better appreciation of the reactions...
One possibility would be a bit stolen from inode flags + hash keyed by
struct inode address (middle bits make for a decent hash function);
inode eviction would check that bit and kick the corresponding thing
from hash if the bit is set.
Associating that thing with inode => hash lookup/insert + set the bit.
On Fri, Mar 17, 2023 at 2:21 PM Al Viro <[email protected]> wrote:
>
> On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> > On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> >
> > > > But build IDs are _generally_ available. The only problem (AIUI)
> > > > is when you're trying to examine the contents of one container from
> > > > another container. And to solve that problem, you're imposing a cost
> > > > on everybody else with (so far) pretty vague justifications. I really
> > > > don't like to see you growing struct file for this (nor struct inode,
> > > > nor struct vm_area_struct). It's all quite unsatisfactory and I don't
> > > > have a good suggestion.
> > >
> > > There is a lot of profiling, observability and debugging tooling built
> > > using BPF. And when capturing stack traces from BPF programs, if the
> > > build ID note is not physically present in memory, fetching it from
> > > the BPF program might fail in NMI (and other non-faultable contexts).
> > > This patch set is about making sure we always can fetch build ID, even
> > > from most restrictive environments. It's guarded by Kconfig to avoid
> > > adding 8 bytes of overhead to struct file for environment where this
> > > might be unacceptable, giving users and distros a choice.
> >
> > Lovely. As an exercise you might want to collect the stats on the
> > number of struct file instances on the system vs. the number of files
> > that happen to be ELF objects and are currently mmapped anywhere.
That's a good suggestion. I wrote a simple script that uses the drgn
tool ([0]), it enables nice introspection of the state of the kernel
memory for the running kernel. The script is at the bottom ([1]) for
anyone to sanity check. I didn't try to figure out which file is
mmaped as executable and which didn't, so let's do worst case and
assume that none of the file is executable, and thus that 8 byte
pointer is a waste for all of them.
On my devserver I got:
task_cnt=15984 uniq_file_cnt=56780
On randomly chosen production host I got:
task_cnt=6387 uniq_file_cnt=22514
So it seems like my devserver is "busier" than the production host. :)
Above numbers suggest that my devserver's kernel has about 57000
*unique* `struct file *` instances. That's 450KB of overhead. That's
not much by any modern standard.
But let's say I'm way off, and we have 1 million struct files. That's
8MB overhead. I'd argue that those 8MB is not a big deal even on a
normal laptop, even less so on production servers. Especially if you
have 1 million active struct file instances created in the system, as
way more will be used for application-specific needs.
> > That does depend upon the load, obviously, but it's not hard to collect -
> > you already have more than enough hooks inserted in the relevant places.
> > That might give a better appreciation of the reactions...
>
> One possibility would be a bit stolen from inode flags + hash keyed by
> struct inode address (middle bits make for a decent hash function);
> inode eviction would check that bit and kick the corresponding thing
> from hash if the bit is set.
>
> Associating that thing with inode => hash lookup/insert + set the bit.
This is an interesting idea, but now we are running into a few
unnecessary problems. We need to have a global dynamically sized hash
map in the system. If we fix the number of buckets, we risk either
wasting memory on an underutilized system (if we oversize), or
performance problems due to collisions (if we undersize) if we have a
busy system with lots of executables mapped in memory. If we don't
pre-size, then we are talking about reallocations, rehashing, and
doing that under global lock or something like that. Further, we'd
have to take locks on buckets, which causes further problems for
looking up build ID from this hashmap in NMI context for perf events
and BPF programs, as locks can't be safely taken under those
conditions, and thus fetching build ID would still be unreliable
(though less so than it is today, of course).
All of this is solvable to some degree (but not perfectly and not with
simple and elegant approaches), but seems like an unnecessarily
overcomplication compared to the amount of memory that we hope to
save. It still feels like a Kconfig-guarded 8 byte field per struct
file is a reasonable price for gaining reliable build ID information
for profiling/tracing tools.
[0] https://drgn.readthedocs.io/en/latest/index.html
[1] Script I used:
from drgn.helpers.linux.pid import for_each_task
from drgn.helpers.linux.fs import for_each_file
task_cnt = 0
file_set = set()
for task in for_each_task(prog):
task_cnt += 1
try:
for (fd, file) in for_each_file(task):
file_set.add(file.value_())
except:
pass
uniq_file_cnt = len(file_set)
print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")
On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds build id object pointer to struct file object.
> >
> > We have several use cases for build id to be used in BPF programs
> > [2][3].
>
> Yes, you have use cases, but you never answered the question I asked:
>
> Is this going to be enabled by every distro kernel, or is it for special
> use-cases where only people doing a very specialised thing who are
> willing to build their own kernels will use it?
I hope so, but I guess only time tell.. given the response by Ian and Andrii
there are 3 big users already
>
> Saying "hubble/tetragon" doesn't answer that question. Maybe it does
> to you, but I have no idea what that software is.
>
> Put it another way: how does this make *MY* life better? Literally me.
> How will it affect my life?
sorry, I'm not sure how to answer this.. there're possible users of the
feature and the price seems to be reasonable to me, but of course I understand
file/inode objects are tricky to mess with
jirka
On Fri, Mar 17, 2023 at 11:08:44PM -0700, Andrii Nakryiko wrote:
SNIP
> > > That does depend upon the load, obviously, but it's not hard to collect -
> > > you already have more than enough hooks inserted in the relevant places.
> > > That might give a better appreciation of the reactions...
> >
> > One possibility would be a bit stolen from inode flags + hash keyed by
> > struct inode address (middle bits make for a decent hash function);
> > inode eviction would check that bit and kick the corresponding thing
> > from hash if the bit is set.
> >
> > Associating that thing with inode => hash lookup/insert + set the bit.
>
> This is an interesting idea, but now we are running into a few
> unnecessary problems. We need to have a global dynamically sized hash
> map in the system. If we fix the number of buckets, we risk either
> wasting memory on an underutilized system (if we oversize), or
> performance problems due to collisions (if we undersize) if we have a
> busy system with lots of executables mapped in memory. If we don't
> pre-size, then we are talking about reallocations, rehashing, and
> doing that under global lock or something like that. Further, we'd
> have to take locks on buckets, which causes further problems for
> looking up build ID from this hashmap in NMI context for perf events
> and BPF programs, as locks can't be safely taken under those
> conditions, and thus fetching build ID would still be unreliable
> (though less so than it is today, of course).
>
> All of this is solvable to some degree (but not perfectly and not with
> simple and elegant approaches), but seems like an unnecessarily
> overcomplication compared to the amount of memory that we hope to
> save. It still feels like a Kconfig-guarded 8 byte field per struct
> file is a reasonable price for gaining reliable build ID information
> for profiling/tracing tools.
>
>
> [0] https://drgn.readthedocs.io/en/latest/index.html
>
> [1] Script I used:
>
> from drgn.helpers.linux.pid import for_each_task
> from drgn.helpers.linux.fs import for_each_file
>
> task_cnt = 0
> file_set = set()
>
> for task in for_each_task(prog):
> task_cnt += 1
> try:
> for (fd, file) in for_each_file(task):
> file_set.add(file.value_())
> except:
> pass
>
> uniq_file_cnt = len(file_set)
> print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")
great you beat me to this, I wouldn't have thought of using drgn for this ;-)
I'll see if I can install it to some of our test servers
thanks,
jirka
On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds build id object pointer to struct file object.
> > >
> > > We have several use cases for build id to be used in BPF programs
> > > [2][3].
> >
> > Yes, you have use cases, but you never answered the question I asked:
> >
> > Is this going to be enabled by every distro kernel, or is it for special
> > use-cases where only people doing a very specialised thing who are
> > willing to build their own kernels will use it?
>
> I hope so, but I guess only time tell.. given the response by Ian and Andrii
> there are 3 big users already
So the whole "There's a config option to turn it off" shtick is just a
fig-leaf. I won't ever see it turned off. You're imposing the cost of
this on EVERYONE who runs a distro kernel. And almost nobody will see
any benefits from it. Thanks for admitting that.
On Sat, Mar 18, 2023 at 03:16:45PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> > On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds build id object pointer to struct file object.
> > > >
> > > > We have several use cases for build id to be used in BPF programs
> > > > [2][3].
> > >
> > > Yes, you have use cases, but you never answered the question I asked:
> > >
> > > Is this going to be enabled by every distro kernel, or is it for special
> > > use-cases where only people doing a very specialised thing who are
> > > willing to build their own kernels will use it?
> >
> > I hope so, but I guess only time tell.. given the response by Ian and Andrii
> > there are 3 big users already
>
> So the whole "There's a config option to turn it off" shtick is just a
> fig-leaf. I won't ever see it turned off. You're imposing the cost of
> this on EVERYONE who runs a distro kernel. And almost nobody will see
> any benefits from it. Thanks for admitting that.
>
sure, I understand that's legit way of looking at this
I can imagine distros would have that enabled for debugging version of
the kernel (like in fedora), and if that proves to be useful the standard
kernel might take it, but yes, there's price (for discussion as pointed
by Andrii) and it'd be for the distro maintainers to decide
jirka
Em Sat, Mar 18, 2023 at 03:16:45PM +0000, Matthew Wilcox escreveu:
> On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> > On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds build id object pointer to struct file object.
> > > >
> > > > We have several use cases for build id to be used in BPF programs
> > > > [2][3].
> > >
> > > Yes, you have use cases, but you never answered the question I asked:
> > >
> > > Is this going to be enabled by every distro kernel, or is it for special
> > > use-cases where only people doing a very specialised thing who are
> > > willing to build their own kernels will use it?
> >
> > I hope so, but I guess only time tell.. given the response by Ian and Andrii
> > there are 3 big users already
>
> So the whole "There's a config option to turn it off" shtick is just a
> fig-leaf. I won't ever see it turned off. You're imposing the cost of
> this on EVERYONE who runs a distro kernel. And almost nobody will see
> any benefits from it. Thanks for admitting that.
I agree that build-ids are not useful for all 'struct file' uses, just
for executable files and for people wanting to have better observability
capabilities.
Having said that, it seems there will be no extra memory overhead at
least for a fedora:36 x86_64 kernel:
void __init files_init(void)
{
filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
percpu_counter_init(&nr_files, 0, GFP_KERNEL);
}
[root@quaco ~]# pahole file | grep size: -A2
/* size: 232, cachelines: 4, members: 20 */
/* sum members: 228, holes: 1, sum holes: 4 */
/* last cacheline: 40 bytes */
[acme@quaco perf-tools]$ uname -a
Linux quaco 6.1.11-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 20:36:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@quaco ~]# head -2 /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
[root@quaco ~]# grep -w filp /proc/slabinfo
filp 12452 13056 256 32 2 : tunables 0 0 0 : slabdata 408 408 0
[root@quaco ~]#
so there are 24 bytes on the 4th cacheline that are not being used,
right?
One other observation is that maybe we could do it as the 'struct sock'
hierachy in networking, where we would have a 'struct exec_file' that
would be:
struct exec_file {
struct file file;
char build_id[20];
}
say, and then when we create the 'struct file' in __alloc_file() we
could check some bit in 'flags' like Al Viro suggested and pick a
different slab than 'filp_cachep', that has that extra space for the
build_id (and whatever else exec related state we may end up wanting, if
ever).
No core fs will need to know about that except when we go free it, to
free from the right slab cache.
In current distro configs, no overhead would take place if I read that
SLAB_HWCACHE_ALIGN thing right, no?
- Arnaldo
On Thu, Mar 16, 2023 at 03:23:03PM -0700, Andrii Nakryiko wrote:
> On Thu, Mar 16, 2023 at 10:03 AM Jiri Olsa <[email protected]> wrote:
> >
> > Adding read_build_id function that parses out build id from
> > specified binary.
> >
> > It will replace extract_build_id and also be used in following
> > changes.
> >
> > Signed-off-by: Jiri Olsa <[email protected]>
I'll send this separatelly as bpf/selftests fix so doesn't get lost
> > ---
> > tools/testing/selftests/bpf/trace_helpers.c | 86 +++++++++++++++++++++
> > tools/testing/selftests/bpf/trace_helpers.h | 5 ++
> > 2 files changed, 91 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
> > index 934bf28fc888..72b38a41f574 100644
> > --- a/tools/testing/selftests/bpf/trace_helpers.c
> > +++ b/tools/testing/selftests/bpf/trace_helpers.c
> > @@ -11,6 +11,9 @@
> > #include <linux/perf_event.h>
> > #include <sys/mman.h>
> > #include "trace_helpers.h"
> > +#include <linux/limits.h>
> > +#include <libelf.h>
> > +#include <gelf.h>
> >
> > #define TRACEFS_PIPE "/sys/kernel/tracing/trace_pipe"
> > #define DEBUGFS_PIPE "/sys/kernel/debug/tracing/trace_pipe"
> > @@ -234,3 +237,86 @@ ssize_t get_rel_offset(uintptr_t addr)
> > fclose(f);
> > return -EINVAL;
> > }
> > +
> > +static int
> > +parse_build_id_buf(const void *note_start, Elf32_Word note_size,
> > + char *build_id)
>
> nit: single line
ok
>
> should we pass buffer size instead of assuming at least BPF_BUILD_ID_SIZE below?
ok
>
> > +{
> > + Elf32_Word note_offs = 0, new_offs;
> > +
> > + while (note_offs + sizeof(Elf32_Nhdr) < note_size) {
> > + Elf32_Nhdr *nhdr = (Elf32_Nhdr *)(note_start + note_offs);
> > +
> > + if (nhdr->n_type == 3 && nhdr->n_namesz == sizeof("GNU") &&
> > + !strcmp((char *)(nhdr + 1), "GNU") && nhdr->n_descsz > 0 &&
> > + nhdr->n_descsz <= BPF_BUILD_ID_SIZE) {
> > + memcpy(build_id, note_start + note_offs +
> > + ALIGN(sizeof("GNU"), 4) + sizeof(Elf32_Nhdr), nhdr->n_descsz);
> > + memset(build_id + nhdr->n_descsz, 0, BPF_BUILD_ID_SIZE - nhdr->n_descsz);
> > + return (int) nhdr->n_descsz;
> > + }
> > +
> > + new_offs = note_offs + sizeof(Elf32_Nhdr) +
> > + ALIGN(nhdr->n_namesz, 4) + ALIGN(nhdr->n_descsz, 4);
> > + if (new_offs >= note_size)
> > + break;
>
> while condition() above would handle this, so this check appears not necessary?
>
> so just assign note_offs directly?
good idea, it will simplify that
>
>
> > + note_offs = new_offs;
> > + }
> > +
> > + return -EINVAL;
>
> nit: -ENOENT or -ESRCH?
I kept the same error as is in kernel, but ENOENT makes more sense
>
> > +}
> > +
> > +/* Reads binary from *path* file and returns it in the *build_id*
> > + * which is expected to be at least BPF_BUILD_ID_SIZE bytes.
> > + * Returns size of build id on success. On error the error value
> > + * is returned.
> > + */
> > +int read_build_id(const char *path, char *build_id)
> > +{
> > + int fd, err = -EINVAL;
> > + Elf *elf = NULL;
> > + GElf_Ehdr ehdr;
> > + size_t max, i;
> > +
> > + fd = open(path, O_RDONLY | O_CLOEXEC);
> > + if (fd < 0)
> > + return -errno;
> > +
> > + (void)elf_version(EV_CURRENT);
> > +
> > + elf = elf_begin(fd, ELF_C_READ, NULL);
>
> ELF_C_READ_MMAP ?
ok
>
> > + if (!elf)
> > + goto out;
> > + if (elf_kind(elf) != ELF_K_ELF)
> > + goto out;
> > + if (gelf_getehdr(elf, &ehdr) == NULL)
>
> nit: !gelf_getehdr()
ok
>
> > + goto out;
> > + if (ehdr.e_ident[EI_CLASS] != ELFCLASS64)
> > + goto out;
>
> does this have to be 64-bit specific?... you are using gelf stuff, you
> can be bitness-agnostic here
right, I don't think it's needed, will check
>
> > +
> > + for (i = 0; i < ehdr.e_phnum; i++) {
> > + GElf_Phdr mem, *phdr;
> > + char *data;
> > +
> > + phdr = gelf_getphdr(elf, i, &mem);
> > + if (!phdr)
> > + goto out;
> > + if (phdr->p_type != PT_NOTE)
> > + continue;
>
> I don't know where ELF + build ID spec is (if at all), but it seems to
> always be in the ".note.gnu.build-id" section, so should we check the
> name here?
this section name is not manadatory as stated in
https://fedoraproject.org/wiki/RolandMcGrath/BuildID
The new section is canonically called .note.gnu.build-id, but the name is not normative,
and the section can be merged with other SHT_NOTE sections. The ELF note headers give
name "GNU" and type 3 (NT_GNU_BUILD_ID) for a build ID note.
>
>
> > + data = elf_rawfile(elf, &max);
> > + if (!data)
> > + goto out;
> > + if (phdr->p_offset >= max || (phdr->p_offset + phdr->p_memsz >= max))
>
> `phdr->p_offset + phdr->p_memsz == max` would be fine, no?
right, will change
thanks,
jirka
>
> > + goto out;
> > + err = parse_build_id_buf(data + phdr->p_offset, phdr->p_memsz, build_id);
> > + if (err > 0)
> > + goto out;
> > + err = -EINVAL;
> > + }
> > +
> > +out:
> > + if (elf)
> > + elf_end(elf);
> > + close(fd);
> > + return err;
> > +}
> > diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
> > index 53efde0e2998..bc3b92057033 100644
> > --- a/tools/testing/selftests/bpf/trace_helpers.h
> > +++ b/tools/testing/selftests/bpf/trace_helpers.h
> > @@ -4,6 +4,9 @@
> >
> > #include <bpf/libbpf.h>
> >
> > +#define __ALIGN_MASK(x, mask) (((x)+(mask))&~(mask))
> > +#define ALIGN(x, a) __ALIGN_MASK(x, (typeof(x))(a)-1)
> > +
> > struct ksym {
> > long addr;
> > char *name;
> > @@ -23,4 +26,6 @@ void read_trace_pipe(void);
> > ssize_t get_uprobe_offset(const void *addr);
> > ssize_t get_rel_offset(uintptr_t addr);
> >
> > +int read_build_id(const char *path, char *build_id);
> > +
> > #endif
> > --
> > 2.39.2
> >
On Wed, Mar 22, 2023 at 8:45 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Sat, Mar 18, 2023 at 03:16:45PM +0000, Matthew Wilcox escreveu:
> > On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> > > On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > > > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > > > hi,
> > > > > this patchset adds build id object pointer to struct file object.
> > > > >
> > > > > We have several use cases for build id to be used in BPF programs
> > > > > [2][3].
> > > >
> > > > Yes, you have use cases, but you never answered the question I asked:
> > > >
> > > > Is this going to be enabled by every distro kernel, or is it for special
> > > > use-cases where only people doing a very specialised thing who are
> > > > willing to build their own kernels will use it?
> > >
> > > I hope so, but I guess only time tell.. given the response by Ian and Andrii
> > > there are 3 big users already
> >
> > So the whole "There's a config option to turn it off" shtick is just a
> > fig-leaf. I won't ever see it turned off. You're imposing the cost of
> > this on EVERYONE who runs a distro kernel. And almost nobody will see
> > any benefits from it. Thanks for admitting that.
>
> I agree that build-ids are not useful for all 'struct file' uses, just
> for executable files and for people wanting to have better observability
> capabilities.
>
> Having said that, it seems there will be no extra memory overhead at
> least for a fedora:36 x86_64 kernel:
>
> void __init files_init(void)
> {
> filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> }
>
> [root@quaco ~]# pahole file | grep size: -A2
> /* size: 232, cachelines: 4, members: 20 */
> /* sum members: 228, holes: 1, sum holes: 4 */
> /* last cacheline: 40 bytes */
> [acme@quaco perf-tools]$ uname -a
> Linux quaco 6.1.11-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 20:36:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> [root@quaco ~]# head -2 /proc/slabinfo
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> [root@quaco ~]# grep -w filp /proc/slabinfo
> filp 12452 13056 256 32 2 : tunables 0 0 0 : slabdata 408 408 0
> [root@quaco ~]#
>
> so there are 24 bytes on the 4th cacheline that are not being used,
> right?
Well, even better then!
>
> One other observation is that maybe we could do it as the 'struct sock'
> hierachy in networking, where we would have a 'struct exec_file' that
> would be:
>
> struct exec_file {
> struct file file;
> char build_id[20];
> }
>
> say, and then when we create the 'struct file' in __alloc_file() we
> could check some bit in 'flags' like Al Viro suggested and pick a
> different slab than 'filp_cachep', that has that extra space for the
> build_id (and whatever else exec related state we may end up wanting, if
> ever).
>
> No core fs will need to know about that except when we go free it, to
> free from the right slab cache.
>
> In current distro configs, no overhead would take place if I read that
> SLAB_HWCACHE_ALIGN thing right, no?
Makes sense to me as well. Whatever the solution, as long as it's
usable from NMI contexts would be fine for the purposes of fetching
build ID. It would be good to hear from folks that are opposing adding
a pointer field to struct file whether they prefer this way instead?
>
> - Arnaldo
On Fri, Mar 31, 2023 at 11:19:45AM -0700, Andrii Nakryiko wrote:
> On Wed, Mar 22, 2023 at 8:45 AM Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Having said that, it seems there will be no extra memory overhead at
> > least for a fedora:36 x86_64 kernel:
>
> Makes sense to me as well. Whatever the solution, as long as it's
> usable from NMI contexts would be fine for the purposes of fetching
> build ID. It would be good to hear from folks that are opposing adding
> a pointer field to struct file whether they prefer this way instead?
Still no. While it may not take up any room right now, this will
surely not be the last thing added to struct file. When something
which is genuinely useful needs to be added, that person should
not have to sort out your mess first,
NAK now, NAK tomorrow, NAK forever. Al told you how you could do it
without trampling on core data structures.
On Fri, Mar 31, 2023 at 11:36 AM Matthew Wilcox <[email protected]> wrote:
>
> On Fri, Mar 31, 2023 at 11:19:45AM -0700, Andrii Nakryiko wrote:
> > On Wed, Mar 22, 2023 at 8:45 AM Arnaldo Carvalho de Melo
> > <[email protected]> wrote:
> > > Having said that, it seems there will be no extra memory overhead at
> > > least for a fedora:36 x86_64 kernel:
> >
> > Makes sense to me as well. Whatever the solution, as long as it's
> > usable from NMI contexts would be fine for the purposes of fetching
> > build ID. It would be good to hear from folks that are opposing adding
> > a pointer field to struct file whether they prefer this way instead?
>
> Still no. While it may not take up any room right now, this will
> surely not be the last thing added to struct file. When something
> which is genuinely useful needs to be added, that person should
> not have to sort out your mess first,
So I assume you are talking about adding a pointer field to the struct
file, right? What about the alternative proposed by Arnaldo to have a
struct exec_file that extends a struct file?
>
> NAK now, NAK tomorrow, NAK forever. Al told you how you could do it
> without trampling on core data structures.
As I replied to Al, any solution that will have a lookup table on the
side isn't compatible with usage from NMI context due to locking. And
lots of tracing use cases are based on perf counters which are handled
in NMI context. And that's besides all the complexities with
right-sizing hash maps (if hashmaps are to be used).