2024-06-08 17:21:58

by Howard Chu

Subject: [PATCH] perf trace: Fix syscall untraceable bug

I found this bug while implementing pretty-printing for the
landlock_add_rule system call. I am sending this patch separately
because it is a serious bug that should be fixed quickly.

I wrote a test program that calls the landlock_add_rule syscall in a
loop, yet perf trace -e landlock_add_rule freezes, giving no output.

This bug was introduced by a misunderstanding of the variable "key"
below:
```
for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
struct syscall *sc = trace__syscall_info(trace, NULL, key);
...
}
```
The code above seems right at first glance, but looking at
syscalltbl.c, I found these lines:

```
for (i = 0; i <= syscalltbl_native_max_id; ++i)
if (syscalltbl_native[i])
++nr_entries;

entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
...

for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
if (syscalltbl_native[i]) {
entries[j].name = syscalltbl_native[i];
entries[j].id = i;
++j;
}
}
```

meaning that key is merely an index used to traverse the syscall table,
not the actual syscall id of that particular syscall.

So if one uses key in trace__syscall_info(trace, NULL, key), then,
because key only goes up to trace->sctbl->syscalls.nr_entries (373 on my
x86_64 machine), everything after `rseq` is neglected: the traversal
stops at 373, and `rseq` is the last syscall whose id is lower than 373.

As shown in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:
```
...
[334] = "rseq",
[424] = "pidfd_send_signal",
...
```

The reason the key is scrambled yet perf trace still works (for the
lower ids) is that key is used in trace__syscall_info(trace, NULL, key)
to do trace->syscalls.table[id]; this makes sure that the struct syscall
returned actually has an id equal to key, so the later bpf_prog
matching stays consistent.

After fixing this bug, I can run perf trace on 38 more syscalls, and
because more syscalls are visible, we get 8 more syscalls that can be
augmented.

before:

perf $ perf trace -vv --max-events=1 |& grep Reusing
Reusing "open" BPF sys_enter augmenter for "stat"
Reusing "open" BPF sys_enter augmenter for "lstat"
Reusing "open" BPF sys_enter augmenter for "access"
Reusing "connect" BPF sys_enter augmenter for "accept"
Reusing "sendto" BPF sys_enter augmenter for "recvfrom"
Reusing "connect" BPF sys_enter augmenter for "bind"
Reusing "connect" BPF sys_enter augmenter for "getsockname"
Reusing "connect" BPF sys_enter augmenter for "getpeername"
Reusing "open" BPF sys_enter augmenter for "execve"
Reusing "open" BPF sys_enter augmenter for "truncate"
Reusing "open" BPF sys_enter augmenter for "chdir"
Reusing "open" BPF sys_enter augmenter for "mkdir"
Reusing "open" BPF sys_enter augmenter for "rmdir"
Reusing "open" BPF sys_enter augmenter for "creat"
Reusing "open" BPF sys_enter augmenter for "link"
Reusing "open" BPF sys_enter augmenter for "unlink"
Reusing "open" BPF sys_enter augmenter for "symlink"
Reusing "open" BPF sys_enter augmenter for "readlink"
Reusing "open" BPF sys_enter augmenter for "chmod"
Reusing "open" BPF sys_enter augmenter for "chown"
Reusing "open" BPF sys_enter augmenter for "lchown"
Reusing "open" BPF sys_enter augmenter for "mknod"
Reusing "open" BPF sys_enter augmenter for "statfs"
Reusing "open" BPF sys_enter augmenter for "pivot_root"
Reusing "open" BPF sys_enter augmenter for "chroot"
Reusing "open" BPF sys_enter augmenter for "acct"
Reusing "open" BPF sys_enter augmenter for "swapon"
Reusing "open" BPF sys_enter augmenter for "swapoff"
Reusing "open" BPF sys_enter augmenter for "delete_module"
Reusing "open" BPF sys_enter augmenter for "setxattr"
Reusing "open" BPF sys_enter augmenter for "lsetxattr"
Reusing "openat" BPF sys_enter augmenter for "fsetxattr"
Reusing "open" BPF sys_enter augmenter for "getxattr"
Reusing "open" BPF sys_enter augmenter for "lgetxattr"
Reusing "openat" BPF sys_enter augmenter for "fgetxattr"
Reusing "open" BPF sys_enter augmenter for "listxattr"
Reusing "open" BPF sys_enter augmenter for "llistxattr"
Reusing "open" BPF sys_enter augmenter for "removexattr"
Reusing "open" BPF sys_enter augmenter for "lremovexattr"
Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr"
Reusing "open" BPF sys_enter augmenter for "mq_open"
Reusing "open" BPF sys_enter augmenter for "mq_unlink"
Reusing "fsetxattr" BPF sys_enter augmenter for "add_key"
Reusing "fremovexattr" BPF sys_enter augmenter for "request_key"
Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch"
Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat"
Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat"
Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat"
Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat"
Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat"
Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat"
Reusing "fremovexattr" BPF sys_enter augmenter for "linkat"
Reusing "open" BPF sys_enter augmenter for "symlinkat"
Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat"
Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat"
Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat"
Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat"
Reusing "connect" BPF sys_enter augmenter for "accept4"
Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at"
Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2"
Reusing "open" BPF sys_enter augmenter for "memfd_create"
Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
Reusing "fremovexattr" BPF sys_enter augmenter for "statx"

after:

perf $ perf trace -vv --max-events=1 |& grep Reusing
Reusing "open" BPF sys_enter augmenter for "stat"
Reusing "open" BPF sys_enter augmenter for "lstat"
Reusing "open" BPF sys_enter augmenter for "access"
Reusing "connect" BPF sys_enter augmenter for "accept"
Reusing "sendto" BPF sys_enter augmenter for "recvfrom"
Reusing "connect" BPF sys_enter augmenter for "bind"
Reusing "connect" BPF sys_enter augmenter for "getsockname"
Reusing "connect" BPF sys_enter augmenter for "getpeername"
Reusing "open" BPF sys_enter augmenter for "execve"
Reusing "open" BPF sys_enter augmenter for "truncate"
Reusing "open" BPF sys_enter augmenter for "chdir"
Reusing "open" BPF sys_enter augmenter for "mkdir"
Reusing "open" BPF sys_enter augmenter for "rmdir"
Reusing "open" BPF sys_enter augmenter for "creat"
Reusing "open" BPF sys_enter augmenter for "link"
Reusing "open" BPF sys_enter augmenter for "unlink"
Reusing "open" BPF sys_enter augmenter for "symlink"
Reusing "open" BPF sys_enter augmenter for "readlink"
Reusing "open" BPF sys_enter augmenter for "chmod"
Reusing "open" BPF sys_enter augmenter for "chown"
Reusing "open" BPF sys_enter augmenter for "lchown"
Reusing "open" BPF sys_enter augmenter for "mknod"
Reusing "open" BPF sys_enter augmenter for "statfs"
Reusing "open" BPF sys_enter augmenter for "pivot_root"
Reusing "open" BPF sys_enter augmenter for "chroot"
Reusing "open" BPF sys_enter augmenter for "acct"
Reusing "open" BPF sys_enter augmenter for "swapon"
Reusing "open" BPF sys_enter augmenter for "swapoff"
Reusing "open" BPF sys_enter augmenter for "delete_module"
Reusing "open" BPF sys_enter augmenter for "setxattr"
Reusing "open" BPF sys_enter augmenter for "lsetxattr"
Reusing "openat" BPF sys_enter augmenter for "fsetxattr"
Reusing "open" BPF sys_enter augmenter for "getxattr"
Reusing "open" BPF sys_enter augmenter for "lgetxattr"
Reusing "openat" BPF sys_enter augmenter for "fgetxattr"
Reusing "open" BPF sys_enter augmenter for "listxattr"
Reusing "open" BPF sys_enter augmenter for "llistxattr"
Reusing "open" BPF sys_enter augmenter for "removexattr"
Reusing "open" BPF sys_enter augmenter for "lremovexattr"
Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr"
Reusing "open" BPF sys_enter augmenter for "mq_open"
Reusing "open" BPF sys_enter augmenter for "mq_unlink"
Reusing "fsetxattr" BPF sys_enter augmenter for "add_key"
Reusing "fremovexattr" BPF sys_enter augmenter for "request_key"
Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch"
Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat"
Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat"
Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat"
Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat"
Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat"
Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat"
Reusing "fremovexattr" BPF sys_enter augmenter for "linkat"
Reusing "open" BPF sys_enter augmenter for "symlinkat"
Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat"
Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat"
Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat"
Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat"
Reusing "connect" BPF sys_enter augmenter for "accept4"
Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at"
Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2"
Reusing "open" BPF sys_enter augmenter for "memfd_create"
Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
Reusing "fremovexattr" BPF sys_enter augmenter for "statx"

TL;DR:

These are the new syscalls that can now be augmented:
Reusing "openat" BPF sys_enter augmenter for "open_tree"
Reusing "openat" BPF sys_enter augmenter for "openat2"
Reusing "openat" BPF sys_enter augmenter for "mount_setattr"
Reusing "openat" BPF sys_enter augmenter for "move_mount"
Reusing "open" BPF sys_enter augmenter for "fsopen"
Reusing "openat" BPF sys_enter augmenter for "fspick"
Reusing "openat" BPF sys_enter augmenter for "faccessat2"
Reusing "openat" BPF sys_enter augmenter for "fchmodat2"

As for the perf trace output:

before:

perf $ perf trace -e faccessat2 --max-events=1
[no output]

after:

perf $ ./perf trace -e faccessat2 --max-events=1
0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0

P.S. The reason this bug was not found in the past five years is
probably that it only affects the newer syscalls, whose ids are
greater; for instance faccessat2, with id 439, is not a syscall many
people care about when using perf trace.

Signed-off-by: Howard Chu <[email protected]>
---
tools/perf/builtin-trace.c | 32 +++++++++++++++++++++-----------
tools/perf/util/syscalltbl.c | 21 +++++++++------------
tools/perf/util/syscalltbl.h | 5 +++++
3 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index c42bc608954e..5cbe1748911d 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id)
static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc)
{
struct tep_format_field *field, *candidate_field;
- int id;
+ struct __syscall *scs = trace->sctbl->syscalls.entries;
+ int id, _id;

/*
* We're only interested in syscalls that have a pointer:
@@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace

try_to_find_pair:
for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) {
- struct syscall *pair = trace__syscall_info(trace, NULL, id);
+ struct syscall *pair;
struct bpf_program *pair_prog;
bool is_candidate = false;

+ _id = scs[id].id;
+ pair = trace__syscall_info(trace, NULL, _id);
+
if (pair == NULL || pair == sc ||
pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented)
continue;
@@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
{
int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter);
int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit);
- int err = 0, key;
+ int err = 0, key, id;
+ struct __syscall *scs = trace->sctbl->syscalls.entries;

for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
int prog_fd;

- if (!trace__syscall_enabled(trace, key))
+ id = scs[key].id;
+
+ if (!trace__syscall_enabled(trace, id))
continue;

- trace__init_syscall_bpf_progs(trace, key);
+ trace__init_syscall_bpf_progs(trace, id);

// It'll get at least the "!raw_syscalls:unaugmented"
- prog_fd = trace__bpf_prog_sys_enter_fd(trace, key);
- err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
+ prog_fd = trace__bpf_prog_sys_enter_fd(trace, id);
+ err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
if (err)
break;
- prog_fd = trace__bpf_prog_sys_exit_fd(trace, key);
- err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY);
+ prog_fd = trace__bpf_prog_sys_exit_fd(trace, id);
+ err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY);
if (err)
break;
}
@@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
* array tail call, then that one will be used.
*/
for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
- struct syscall *sc = trace__syscall_info(trace, NULL, key);
+ struct syscall *sc;
struct bpf_program *pair_prog;
int prog_fd;

+ id = scs[key].id;
+ sc = trace__syscall_info(trace, NULL, id);
+
if (sc == NULL || sc->bpf_prog.sys_enter == NULL)
continue;

@@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
* with the fd for the program we're reusing:
*/
prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter);
- err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
+ err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
if (err)
break;
}
diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
index 63be7b58761d..16aa886c40f0 100644
--- a/tools/perf/util/syscalltbl.c
+++ b/tools/perf/util/syscalltbl.c
@@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID;
static const char *const *syscalltbl_native = syscalltbl_loongarch;
#endif

-struct syscall {
- int id;
- const char *name;
-};
-
static int syscallcmpname(const void *vkey, const void *ventry)
{
const char *key = vkey;
- const struct syscall *entry = ventry;
+ const struct __syscall *entry = ventry;

return strcmp(key, entry->name);
}

static int syscallcmp(const void *va, const void *vb)
{
- const struct syscall *a = va, *b = vb;
+ const struct __syscall *a = va, *b = vb;

return strcmp(a->name, b->name);
}
@@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb)
static int syscalltbl__init_native(struct syscalltbl *tbl)
{
int nr_entries = 0, i, j;
- struct syscall *entries;
+ struct __syscall *entries;

for (i = 0; i <= syscalltbl_native_max_id; ++i)
if (syscalltbl_native[i])
++nr_entries;

- entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
+ entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) *
+ nr_entries);
if (tbl->syscalls.entries == NULL)
return -1;

@@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl)
}
}

- qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp);
+ qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall),
+ syscallcmp);
tbl->syscalls.nr_entries = nr_entries;
tbl->syscalls.max_id = syscalltbl_native_max_id;
return 0;
@@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id

int syscalltbl__id(struct syscalltbl *tbl, const char *name)
{
- struct syscall *sc = bsearch(name, tbl->syscalls.entries,
+ struct __syscall *sc = bsearch(name, tbl->syscalls.entries,
tbl->syscalls.nr_entries, sizeof(*sc),
syscallcmpname);

@@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx)
{
int i;
- struct syscall *syscalls = tbl->syscalls.entries;
+ struct __syscall *syscalls = tbl->syscalls.entries;

for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) {
if (strglobmatch(syscalls[i].name, syscall_glob)) {
diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h
index a41d2ca9e4ae..6e93a0874c40 100644
--- a/tools/perf/util/syscalltbl.h
+++ b/tools/perf/util/syscalltbl.h
@@ -2,6 +2,11 @@
#ifndef __PERF_SYSCALLTBL_H
#define __PERF_SYSCALLTBL_H

+struct __syscall {
+ int id;
+ const char *name;
+};
+
struct syscalltbl {
int audit_machine;
struct {
--
2.45.2



2024-06-09 19:11:21

by Markus Elfring

Subject: Re: [PATCH] perf trace: Fix syscall untraceable bug

> This is a bug found when implementing pretty-printing for the
> landlock_add_rule system call, I decided to send this patch separately
> because this is a serious bug that should be fixed fast.


Would you like to add the tag “Fixes” accordingly?

Regards,
Markus

2024-06-10 21:33:48

by Ian Rogers

Subject: Re: [PATCH] perf trace: Fix syscall untraceable bug

On Sat, Jun 8, 2024 at 10:21 AM Howard Chu <[email protected]> wrote:
>
> This is a bug found when implementing pretty-printing for the
> landlock_add_rule system call, I decided to send this patch separately
> because this is a serious bug that should be fixed fast.
>
> I wrote a test program to do landlock_add_rule syscall in a loop,
> yet perf trace -e landlock_add_rule freezes, giving no output.
>
> This bug is introduced by the false understanding of the variable "key"
> below:
> ```
> for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> struct syscall *sc = trace__syscall_info(trace, NULL, key);
> ...
> }
> ```
> The code above seems right at the beginning, but when looking at
> syscalltbl.c, I found these lines:
>
> ```
> for (i = 0; i <= syscalltbl_native_max_id; ++i)
> if (syscalltbl_native[i])
> ++nr_entries;
>
> entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> ...
>
> for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
> if (syscalltbl_native[i]) {
> entries[j].name = syscalltbl_native[i];
> entries[j].id = i;
> ++j;
> }
> }
> ```
>
> meaning the key is merely an index to traverse the syscall table,
> instead of the actual syscall id for this particular syscall.


Thanks Howard, I'm not following this. Doesn't it make sense to use
the syscall number as its id?

>
>
> So if one uses key to do trace__syscall_info(trace, NULL, key), because
> key only goes up to trace->sctbl->syscalls.nr_entries, for example, on
> my X86_64 machine, this number is 373, it will end up neglecting all
> the rest of the syscall, in my case, everything after `rseq`, because
> the traversal will stop at 373, and `rseq` is the last syscall whose id
> is lower than 373
>
> in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:
> ```
> ...
> [334] = "rseq",
> [424] = "pidfd_send_signal",
> ...
> ```
>
> The reason why the key is scrambled but perf trace works well is that
> key is used in trace__syscall_info(trace, NULL, key) to do
> trace->syscalls.table[id], this makes sure that the struct syscall returned
> actually has an id the same value as key, making the later bpf_prog
> matching all correct.


Could we create a test for this? We have tests that list all perf
events and then run a perf command on them. It wouldn't be possible to
guarantee the output, though.

>
> After fixing this bug, I can do perf trace on 38 more syscalls, and
> because more syscalls are visible, we get 8 more syscalls that can be
> augmented.
>
> before:
>
> perf $ perf trace -vv --max-events=1 |& grep Reusing
> [...]
>
> after
>
> perf $ perf trace -vv --max-events=1 |& grep Reusing
> [...]
>
> TL;DR:
>
> These are the new syscalls that can be augmented
> Reusing "openat" BPF sys_enter augmenter for "open_tree"
> Reusing "openat" BPF sys_enter augmenter for "openat2"
> Reusing "openat" BPF sys_enter augmenter for "mount_setattr"
> Reusing "openat" BPF sys_enter augmenter for "move_mount"
> Reusing "open" BPF sys_enter augmenter for "fsopen"
> Reusing "openat" BPF sys_enter augmenter for "fspick"
> Reusing "openat" BPF sys_enter augmenter for "faccessat2"
> Reusing "openat" BPF sys_enter augmenter for "fchmodat2"
>
> as for the perf trace output:
>
> before
>
> perf $ perf trace -e faccessat2 --max-events=1
> [no output]
>
> after
>
> perf $ ./perf trace -e faccessat2 --max-events=1
> 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0
>
> P.S. The reason why this bug was not found in the past five years is
> probably because it only happens to the newer syscalls whose id is
> greater, for instance, faccessat2 of id 439, which not a lot of people
> care about when using perf trace.
>
> Signed-off-by: Howard Chu <[email protected]>
> ---
> tools/perf/builtin-trace.c | 32 +++++++++++++++++++++-----------
> tools/perf/util/syscalltbl.c | 21 +++++++++------------
> tools/perf/util/syscalltbl.h | 5 +++++
> 3 files changed, 35 insertions(+), 23 deletions(-)
>
> diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> index c42bc608954e..5cbe1748911d 100644
> --- a/tools/perf/builtin-trace.c
> +++ b/tools/perf/builtin-trace.c
> @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id)
> static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc)
> {
> struct tep_format_field *field, *candidate_field;
> - int id;
> + struct __syscall *scs = trace->sctbl->syscalls.entries;
> + int id, _id;
>
> /*
> * We're only interested in syscalls that have a pointer:
> @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace
>
> try_to_find_pair:
> for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) {
> - struct syscall *pair = trace__syscall_info(trace, NULL, id);
> + struct syscall *pair;
> struct bpf_program *pair_prog;
> bool is_candidate = false;
>
> + _id = scs[id].id;
> + pair = trace__syscall_info(trace, NULL, _id);
> +
> if (pair == NULL || pair == sc ||
> pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented)
> continue;
> @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> {
> int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter);
> int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit);
> - int err = 0, key;
> + int err = 0, key, id;
> + struct __syscall *scs = trace->sctbl->syscalls.entries;
>
> for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> int prog_fd;
>
> - if (!trace__syscall_enabled(trace, key))
> + id = scs[key].id;
> +
> + if (!trace__syscall_enabled(trace, id))
> continue;
>
> - trace__init_syscall_bpf_progs(trace, key);
> + trace__init_syscall_bpf_progs(trace, id);
>
> // It'll get at least the "!raw_syscalls:unaugmented"
> - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key);
> - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
> + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id);
> + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
> if (err)
> break;
> - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key);
> - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY);
> + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id);
> + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY);
> if (err)
> break;
> }
> @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> * array tail call, then that one will be used.
> */
> for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> - struct syscall *sc = trace__syscall_info(trace, NULL, key);
> + struct syscall *sc;
> struct bpf_program *pair_prog;
> int prog_fd;
>
> + id = scs[key].id;
> + sc = trace__syscall_info(trace, NULL, id);
> +
> if (sc == NULL || sc->bpf_prog.sys_enter == NULL)
> continue;
>
> @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> * with the fd for the program we're reusing:
> */
> prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter);
> - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
> + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
> if (err)
> break;
> }
> diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
> index 63be7b58761d..16aa886c40f0 100644
> --- a/tools/perf/util/syscalltbl.c
> +++ b/tools/perf/util/syscalltbl.c
> @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID;
> static const char *const *syscalltbl_native = syscalltbl_loongarch;
> #endif
>
> -struct syscall {
> - int id;
> - const char *name;
> -};
> -
> static int syscallcmpname(const void *vkey, const void *ventry)
> {
> const char *key = vkey;
> - const struct syscall *entry = ventry;
> + const struct __syscall *entry = ventry;
>
> return strcmp(key, entry->name);
> }
>
> static int syscallcmp(const void *va, const void *vb)
> {
> - const struct syscall *a = va, *b = vb;
> + const struct __syscall *a = va, *b = vb;
>
> return strcmp(a->name, b->name);
> }
> @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb)
> static int syscalltbl__init_native(struct syscalltbl *tbl)
> {
> int nr_entries = 0, i, j;
> - struct syscall *entries;
> + struct __syscall *entries;
>
> for (i = 0; i <= syscalltbl_native_max_id; ++i)
> if (syscalltbl_native[i])
> ++nr_entries;
>
> - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) *
> + nr_entries);
> if (tbl->syscalls.entries == NULL)
> return -1;
>
> @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl)
> }
> }
>
> - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp);
> + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall),
> + syscallcmp);
> tbl->syscalls.nr_entries = nr_entries;
> tbl->syscalls.max_id = syscalltbl_native_max_id;
> return 0;
> @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id
>
> int syscalltbl__id(struct syscalltbl *tbl, const char *name)
> {
> - struct syscall *sc = bsearch(name, tbl->syscalls.entries,
> + struct __syscall *sc = bsearch(name, tbl->syscalls.entries,
> tbl->syscalls.nr_entries, sizeof(*sc),
> syscallcmpname);
>
> @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
> int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx)
> {
> int i;
> - struct syscall *syscalls = tbl->syscalls.entries;
> + struct __syscall *syscalls = tbl->syscalls.entries;
>
> for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) {
> if (strglobmatch(syscalls[i].name, syscall_glob)) {
> diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h
> index a41d2ca9e4ae..6e93a0874c40 100644
> --- a/tools/perf/util/syscalltbl.h
> +++ b/tools/perf/util/syscalltbl.h
> @@ -2,6 +2,11 @@
> #ifndef __PERF_SYSCALLTBL_H
> #define __PERF_SYSCALLTBL_H
>

It'd be nice to document the struct with examples that explain the
confusion that's happened and is fixed here.

Thanks,
Ian

> +struct __syscall {
> + int id;
> + const char *name;
> +};
> +
> struct syscalltbl {
> int audit_machine;
> struct {
> --
> 2.45.2
>

2024-06-11 10:23:19

by Howard Chu

Subject: Re: [PATCH] perf trace: Fix syscall untraceable bug

Hello Ian,

Thanks for reviewing this patch.

On Tue, Jun 11, 2024 at 5:33 AM Ian Rogers <[email protected]> wrote:
>
> On Sat, Jun 8, 2024 at 10:21 AM Howard Chu <[email protected]> wrote:
> >
> > This is a bug found when implementing pretty-printing for the
> > landlock_add_rule system call, I decided to send this patch separately
> > because this is a serious bug that should be fixed fast.
> >
> > I wrote a test program to do landlock_add_rule syscall in a loop,
> > yet perf trace -e landlock_add_rule freezes, giving no output.
> >
> > This bug is introduced by the false understanding of the variable "key"
> > below:
> > ```
> > ```
> > The code above seems right at the beginning, but when looking at
> > syscalltbl.c, I found these lines:
> >
> > ```
> >
> > entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> > ...
> >
> > ```
> >
> > meaning the key is merely an index to traverse the syscall table,
> > instead of the actual syscall id for this particular syscall.
>
>
> Thanks Howard, I'm not following this. Doesn't it make sense to use
> the syscall number as its id?

Yes, it makes perfect sense to use the syscall number as its id. But here:

> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> > struct syscall *sc = trace__syscall_info(trace, NULL, key);
> > ...
> > }

sctbl->syscalls.nr_entries is not the upper bound of the syscall
ids; it is the number of syscalls native to the system that we
collected. We collect them this way (in util/syscalltbl.c):

> > for (i = 0; i <= syscalltbl_native_max_id; ++i)
> > if (syscalltbl_native[i])
> > ++nr_entries;

> > for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
> > if (syscalltbl_native[i]) {
> > entries[j].name = syscalltbl_native[i];
> > entries[j].id = i;
> > ++j;
> > }
> > }

As shown above, we only collect syscalls that are native to the architecture.

```
#if defined(__x86_64__)
#include <asm/syscalls_64.c>
const int syscalltbl_native_max_id = SYSCALLTBL_x86_64_MAX_ID;
static const char *const *syscalltbl_native = syscalltbl_x86_64;
```

So when collecting these entries, we skip the system calls that are
not available on the system; only nr_entries of them are available.

But the real upper bound of the syscall id is syscalls.max_id.
```
tbl->syscalls.max_id = syscalltbl_native_max_id;
```

Here is one example (think of s1 as the syscall with id 1; syscall s2
is not available on the system):

s1 s2 s3 s4 (max_id: 4)
s1 __ s3 s4 (nr_entries: 3)

If one uses nr_entries as the upper bound, we end up traversing only
the ids of s1, s2 and s3 (of course, s2 doesn't exist, so with
`perf trace -v` you will see the error message for it), and then we
lose the precious and fully traceable s4.

What will happen is that in trace__init_syscalls_bpf_prog_array_maps(),
if you stop at nr_entries when traversing, you won't register bpf's
sys_enter program for some system calls, causing the bpf_tail_call to
fail. When the tail call fails, execution falls through and returns 0
(I have to admit that I am still trying to fully understand this part),
and returning 0 means no sample; but if the bpf_tail_call succeeds,
syscall_unaugmented() is called, which returns 1, and there will be
samples.

```
SEC("tp/raw_syscalls/sys_enter")
int syscall_unaugmented(struct syscall_enter_args *args)
{
return 1;
}
```

```
bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);

// If not found on the PROG_ARRAY syscalls map, then we're filtering it:
return 0;
```

>
> >
> >
> > So if one uses key to do trace__syscall_info(trace, NULL, key), because
> > key only goes up to trace->sctbl->syscalls.nr_entries, for example, on
> > my X86_64 machine, this number is 373, it will end up neglecting all
> > the rest of the syscall, in my case, everything after `rseq`, because
> > the traversal will stop at 373, and `rseq` is the last syscall whose id
> > is lower than 373
> >
> > in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:
> > ```
> > ...
> > [334] = "rseq",
> > [424] = "pidfd_send_signal",
> > ...
> > ```
> >
> > The reason why the key is scrambled but perf trace works well is that
> > key is used in trace__syscall_info(trace, NULL, key) to do
> > trace->syscalls.table[id], this makes sure that the struct syscall returned
> > actually has an id the same value as key, making the later bpf_prog
> > matching all correct.
>
>
> Could we create a test for this? We have tests that list all perf
> events and then running a perf command on them. It wouldn't be
> possible to guarantee output.

Sure, you can't guarantee the output because syscalls are architecture
and kernel version specific. I am running perf on kernel version
6.9.3, and my processor is x86_64. I tried to write the test script in
Perl and shell script but failed, so here is a quick test in C if you
want to try it out. It only does three system calls, though...

```
#define _GNU_SOURCE
#include <unistd.h>
#include <linux/landlock.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <stdlib.h>

void ll(void)
{
	syscall(SYS_landlock_add_rule, 0, LANDLOCK_RULE_NET_PORT, NULL, 0);
	syscall(SYS_landlock_add_rule, 0, LANDLOCK_RULE_PATH_BENEATH, NULL, 0);
}

void opat2(void)
{
	syscall(SYS_openat2, 114, "openat2 path", NULL, 0);
}

void faat2(void)
{
	syscall(SYS_faccessat2, 0, "faccessat2 path", 0, 0);
}

int main(void)
{
	printf("testing syscalls...\n");

	while (1) {
		ll();
		opat2();
		faat2();
		sleep(1);
	}

	return 0;
}
```

Please save it as test.c, compile it with
```
gcc test.c -o test
```

Please compile the patched and original perf with BUILD_BPF_SKEL=1.
The behaviour of BUILD_BPF_SKEL=0 is correct.

Now please run it with this command below:
```
perf trace -e faccessat2,openat2,landlock_add_rule -- ./test
```

Or just run ./test, open up a second shell, and do `perf trace -e
faccessat2,openat2,landlock_add_rule` system wide, or do a `ps aux |
grep test` and specify -p <pid-of-the-test-program>.

If I run the test program and trace it with the original perf, there
is not a single trace message for any of these three system calls,
because in the syscall table on my machine:
```
[437] = "openat2",
[439] = "faccessat2",
[445] = "landlock_add_rule",
```
these system calls' ids are bigger than syscalls.nr_entries (373 on my
machine), so they never got registered.

I'm not 100% sure that you will get the same output as mine, so please
tell me if there's any difference when testing. I'll write up a test
covering all system calls soon.

Here's another aspect of this: the simplest fix is just changing:

> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {

to

> > for (key = 0; key <= trace->sctbl->syscalls.max_id; ++key) {

no problem.

But when you run `perf trace -e landlock_add_rule -v`, you get:
```
perf $ ./perf trace -e landlock_add_rule -v
Using CPUID GenuineIntel-6-3D-4
Problems reading syscall 134: 2 (No such file or directory)(uselib) information
Problems reading syscall 156: 2 (No such file or directory)(_sysctl) information
Problems reading syscall 174: 2 (No such file or directory)(create_module) information
Problems reading syscall 177: 2 (No such file or directory)(get_kernel_syms) information
Problems reading syscall 178: 2 (No such file or directory)(query_module) information
Problems reading syscall 180: 2 (No such file or directory)(nfsservctl) information
Problems reading syscall 181: 2 (No such file or directory)(getpmsg) information
Problems reading syscall 182: 2 (No such file or directory)(putpmsg) information
Problems reading syscall 183: 2 (No such file or directory)(afs_syscall) information
Problems reading syscall 184: 2 (No such file or directory)(tuxcall) information
Problems reading syscall 185: 2 (No such file or directory)(security) information
Problems reading syscall 205: 2 (No such file or directory)(set_thread_area) information
Problems reading syscall 211: 2 (No such file or directory)(get_thread_area) information
Problems reading syscall 212: 2 (No such file or directory)(lookup_dcookie) information
Problems reading syscall 214: 2 (No such file or directory)(epoll_ctl_old) information
Problems reading syscall 215: 2 (No such file or directory)(epoll_wait_old) information
Problems reading syscall 236: 2 (No such file or directory)(vserver) information
Problems reading syscall 335: 17 (File exists) information
Problems reading syscall 336: 17 (File exists) information
Problems reading syscall 337: 17 (File exists) information
Problems reading syscall 338: 17 (File exists) information
Problems reading syscall 339: 17 (File exists) information
Problems reading syscall 340: 17 (File exists) information
Problems reading syscall 341: 17 (File exists) information
Problems reading syscall 342: 17 (File exists) information
Problems reading syscall 343: 17 (File exists) information
Problems reading syscall 344: 17 (File exists) information
Problems reading syscall 345: 17 (File exists) information
Problems reading syscall 346: 17 (File exists) information
Problems reading syscall 347: 17 (File exists) information
Problems reading syscall 348: 17 (File exists) information
Problems reading syscall 349: 17 (File exists) information
Problems reading syscall 350: 17 (File exists) information
Problems reading syscall 351: 17 (File exists) information
Problems reading syscall 352: 17 (File exists) information
Problems reading syscall 353: 17 (File exists) information
Problems reading syscall 354: 17 (File exists) information
Problems reading syscall 355: 17 (File exists) information
Problems reading syscall 356: 17 (File exists) information
Problems reading syscall 357: 17 (File exists) information
Problems reading syscall 358: 17 (File exists) information
Problems reading syscall 359: 17 (File exists) information
Problems reading syscall 360: 17 (File exists) information
Problems reading syscall 361: 17 (File exists) information
Problems reading syscall 362: 17 (File exists) information
Problems reading syscall 363: 17 (File exists) information
Problems reading syscall 364: 17 (File exists) information
Problems reading syscall 365: 17 (File exists) information
Problems reading syscall 366: 17 (File exists) information
Problems reading syscall 367: 17 (File exists) information
Problems reading syscall 368: 17 (File exists) information
Problems reading syscall 369: 17 (File exists) information
Problems reading syscall 370: 17 (File exists) information
Problems reading syscall 371: 17 (File exists) information
Problems reading syscall 372: 17 (File exists) information
event qualifier tracepoint filter: (common_pid != 1115781) && (id == 445)
mmap size 528384B
```

The registration of syscalls 335 to 372 is totally unnecessary.

reminder:
```
[334] = "rseq",
[424] = "pidfd_send_signal",
```

For example, the syscall with id 335 is not available on my x86_64
machine, so why even bother reading its info using
trace__syscall_info()?

So with this patch, the verbose message will become:

```
perf $ ./perf trace -e landlock_add_rule -v
Using CPUID GenuineIntel-6-3D-4
vmlinux BTF loaded
Problems reading syscall 156: 2 (No such file or directory)(_sysctl) information
Problems reading syscall 183: 2 (No such file or directory)(afs_syscall) information
Problems reading syscall 174: 2 (No such file or directory)(create_module) information
Problems reading syscall 214: 2 (No such file or directory)(epoll_ctl_old) information
Problems reading syscall 215: 2 (No such file or directory)(epoll_wait_old) information
Problems reading syscall 177: 2 (No such file or directory)(get_kernel_syms) information
Problems reading syscall 211: 2 (No such file or directory)(get_thread_area) information
Problems reading syscall 181: 2 (No such file or directory)(getpmsg) information
Problems reading syscall 156: 22 (Invalid argument)(_sysctl) information
Problems reading syscall 183: 22 (Invalid argument)(afs_syscall) information
Problems reading syscall 174: 22 (Invalid argument)(create_module) information
Problems reading syscall 214: 22 (Invalid argument)(epoll_ctl_old) information
Problems reading syscall 215: 22 (Invalid argument)(epoll_wait_old) information
Problems reading syscall 177: 22 (Invalid argument)(get_kernel_syms) information
Problems reading syscall 211: 22 (Invalid argument)(get_thread_area) information
Problems reading syscall 181: 22 (Invalid argument)(getpmsg) information
Problems reading syscall 212: 2 (No such file or directory)(lookup_dcookie) information
Problems reading syscall 180: 2 (No such file or directory)(nfsservctl) information
Problems reading syscall 182: 2 (No such file or directory)(putpmsg) information
Problems reading syscall 178: 2 (No such file or directory)(query_module) information
Problems reading syscall 185: 2 (No such file or directory)(security) information
Problems reading syscall 205: 2 (No such file or directory)(set_thread_area) information
Problems reading syscall 184: 2 (No such file or directory)(tuxcall) information
Problems reading syscall 134: 2 (No such file or directory)(uselib) information
Problems reading syscall 236: 2 (No such file or directory)(vserver) information
Problems reading syscall 212: 22 (Invalid argument)(lookup_dcookie) information
Problems reading syscall 180: 22 (Invalid argument)(nfsservctl) information
Problems reading syscall 182: 22 (Invalid argument)(putpmsg) information
Problems reading syscall 178: 22 (Invalid argument)(query_module) information
Problems reading syscall 185: 22 (Invalid argument)(security) information
Problems reading syscall 205: 22 (Invalid argument)(set_thread_area) information
Problems reading syscall 184: 22 (Invalid argument)(tuxcall) information
Problems reading syscall 134: 22 (Invalid argument)(uselib) information
Problems reading syscall 236: 22 (Invalid argument)(vserver) information
event qualifier tracepoint filter: (common_pid != 1129314) && (id == 445)
mmap size 528384B
```

Now an error message is printed only when there is a problem reading a
native syscall's information.

I think the desired behavior in
trace__init_syscalls_bpf_prog_array_maps() is that, when a system call
is not augmentable, you assign syscall_unaugmented() to it (via the
bpf prog array).

But what happens is that, when tracing landlock_add_rule, this program
never runs, because there is never a bpf program registration for
landlock_add_rule: the key never reaches that high. With this patch,
it does.

So in util/bpf_skel/augmented_raw_syscalls.bpf.c, bpf_tail_call()
falls through, the bpf program returns 0, and there will be no
samples. But if it successfully calls syscall_unaugmented(), which
returns 1, there will be samples.

Then again, if you build perf with BUILD_BPF_SKEL=0, using no bpf
functionality, everything is fine. But if you do want to use bpf
augmentation, you have to register the syscall, and the key has to go
up to max_id; stopping at nr_entries is probably not what we want.

>
> >
> > After fixing this bug, I can do perf trace on 38 more syscalls, and
> > because more syscalls are visible, we get 8 more syscalls that can be
> > augmented.
> >
> > before:
> >
> > perf $ perf trace -vv --max-events=1 |& grep Reusing
> > Reusing "open" BPF sys_enter augmenter for "stat"
> > Reusing "open" BPF sys_enter augmenter for "lstat"
> > Reusing "open" BPF sys_enter augmenter for "access"
> > Reusing "connect" BPF sys_enter augmenter for "accept"
> > Reusing "sendto" BPF sys_enter augmenter for "recvfrom"
> > Reusing "connect" BPF sys_enter augmenter for "bind"
> > Reusing "connect" BPF sys_enter augmenter for "getsockname"
> > Reusing "connect" BPF sys_enter augmenter for "getpeername"
> > Reusing "open" BPF sys_enter augmenter for "execve"
> > Reusing "open" BPF sys_enter augmenter for "truncate"
> > Reusing "open" BPF sys_enter augmenter for "chdir"
> > Reusing "open" BPF sys_enter augmenter for "mkdir"
> > Reusing "open" BPF sys_enter augmenter for "rmdir"
> > Reusing "open" BPF sys_enter augmenter for "creat"
> > Reusing "open" BPF sys_enter augmenter for "link"
> > Reusing "open" BPF sys_enter augmenter for "unlink"
> > Reusing "open" BPF sys_enter augmenter for "symlink"
> > Reusing "open" BPF sys_enter augmenter for "readlink"
> > Reusing "open" BPF sys_enter augmenter for "chmod"
> > Reusing "open" BPF sys_enter augmenter for "chown"
> > Reusing "open" BPF sys_enter augmenter for "lchown"
> > Reusing "open" BPF sys_enter augmenter for "mknod"
> > Reusing "open" BPF sys_enter augmenter for "statfs"
> > Reusing "open" BPF sys_enter augmenter for "pivot_root"
> > Reusing "open" BPF sys_enter augmenter for "chroot"
> > Reusing "open" BPF sys_enter augmenter for "acct"
> > Reusing "open" BPF sys_enter augmenter for "swapon"
> > Reusing "open" BPF sys_enter augmenter for "swapoff"
> > Reusing "open" BPF sys_enter augmenter for "delete_module"
> > Reusing "open" BPF sys_enter augmenter for "setxattr"
> > Reusing "open" BPF sys_enter augmenter for "lsetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fsetxattr"
> > Reusing "open" BPF sys_enter augmenter for "getxattr"
> > Reusing "open" BPF sys_enter augmenter for "lgetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fgetxattr"
> > Reusing "open" BPF sys_enter augmenter for "listxattr"
> > Reusing "open" BPF sys_enter augmenter for "llistxattr"
> > Reusing "open" BPF sys_enter augmenter for "removexattr"
> > Reusing "open" BPF sys_enter augmenter for "lremovexattr"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr"
> > Reusing "open" BPF sys_enter augmenter for "mq_open"
> > Reusing "open" BPF sys_enter augmenter for "mq_unlink"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat"
> > Reusing "open" BPF sys_enter augmenter for "symlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat"
> > Reusing "connect" BPF sys_enter augmenter for "accept4"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2"
> > Reusing "open" BPF sys_enter augmenter for "memfd_create"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "statx"
> >
> > after
> >
> > perf $ perf trace -vv --max-events=1 |& grep Reusing
> > Reusing "open" BPF sys_enter augmenter for "stat"
> > Reusing "open" BPF sys_enter augmenter for "lstat"
> > Reusing "open" BPF sys_enter augmenter for "access"
> > Reusing "connect" BPF sys_enter augmenter for "accept"
> > Reusing "sendto" BPF sys_enter augmenter for "recvfrom"
> > Reusing "connect" BPF sys_enter augmenter for "bind"
> > Reusing "connect" BPF sys_enter augmenter for "getsockname"
> > Reusing "connect" BPF sys_enter augmenter for "getpeername"
> > Reusing "open" BPF sys_enter augmenter for "execve"
> > Reusing "open" BPF sys_enter augmenter for "truncate"
> > Reusing "open" BPF sys_enter augmenter for "chdir"
> > Reusing "open" BPF sys_enter augmenter for "mkdir"
> > Reusing "open" BPF sys_enter augmenter for "rmdir"
> > Reusing "open" BPF sys_enter augmenter for "creat"
> > Reusing "open" BPF sys_enter augmenter for "link"
> > Reusing "open" BPF sys_enter augmenter for "unlink"
> > Reusing "open" BPF sys_enter augmenter for "symlink"
> > Reusing "open" BPF sys_enter augmenter for "readlink"
> > Reusing "open" BPF sys_enter augmenter for "chmod"
> > Reusing "open" BPF sys_enter augmenter for "chown"
> > Reusing "open" BPF sys_enter augmenter for "lchown"
> > Reusing "open" BPF sys_enter augmenter for "mknod"
> > Reusing "open" BPF sys_enter augmenter for "statfs"
> > Reusing "open" BPF sys_enter augmenter for "pivot_root"
> > Reusing "open" BPF sys_enter augmenter for "chroot"
> > Reusing "open" BPF sys_enter augmenter for "acct"
> > Reusing "open" BPF sys_enter augmenter for "swapon"
> > Reusing "open" BPF sys_enter augmenter for "swapoff"
> > Reusing "open" BPF sys_enter augmenter for "delete_module"
> > Reusing "open" BPF sys_enter augmenter for "setxattr"
> > Reusing "open" BPF sys_enter augmenter for "lsetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fsetxattr"
> > Reusing "open" BPF sys_enter augmenter for "getxattr"
> > Reusing "open" BPF sys_enter augmenter for "lgetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fgetxattr"
> > Reusing "open" BPF sys_enter augmenter for "listxattr"
> > Reusing "open" BPF sys_enter augmenter for "llistxattr"
> > Reusing "open" BPF sys_enter augmenter for "removexattr"
> > Reusing "open" BPF sys_enter augmenter for "lremovexattr"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr"
> > Reusing "open" BPF sys_enter augmenter for "mq_open"
> > Reusing "open" BPF sys_enter augmenter for "mq_unlink"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat"
> > Reusing "open" BPF sys_enter augmenter for "symlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat"
> > Reusing "connect" BPF sys_enter augmenter for "accept4"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2"
> > Reusing "open" BPF sys_enter augmenter for "memfd_create"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "statx"
> >
> > TL;DR:
> >
> > These are the new syscalls that can be augmented
> > Reusing "openat" BPF sys_enter augmenter for "open_tree"
> > Reusing "openat" BPF sys_enter augmenter for "openat2"
> > Reusing "openat" BPF sys_enter augmenter for "mount_setattr"
> > Reusing "openat" BPF sys_enter augmenter for "move_mount"
> > Reusing "open" BPF sys_enter augmenter for "fsopen"
> > Reusing "openat" BPF sys_enter augmenter for "fspick"
> > Reusing "openat" BPF sys_enter augmenter for "faccessat2"
> > Reusing "openat" BPF sys_enter augmenter for "fchmodat2"
> >
> > as for the perf trace output:
> >
> > before
> >
> > perf $ perf trace -e faccessat2 --max-events=1
> > [no output]
> >
> > after
> >
> > perf $ ./perf trace -e faccessat2 --max-events=1
> > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0
> >
> > P.S. The reason why this bug was not found in the past five years is
> > probably because it only happens to the newer syscalls whose id is
> > greater, for instance, faccessat2 of id 439, which not a lot of people
> > care about when using perf trace.
> >
> > Signed-off-by: Howard Chu <[email protected]>
> > ---
> > tools/perf/builtin-trace.c | 32 +++++++++++++++++++++-----------
> > tools/perf/util/syscalltbl.c | 21 +++++++++------------
> > tools/perf/util/syscalltbl.h | 5 +++++
> > 3 files changed, 35 insertions(+), 23 deletions(-)
> >
> > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> > index c42bc608954e..5cbe1748911d 100644
> > --- a/tools/perf/builtin-trace.c
> > +++ b/tools/perf/builtin-trace.c
> > @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id)
> > static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc)
> > {
> > struct tep_format_field *field, *candidate_field;
> > - int id;
> > + struct __syscall *scs = trace->sctbl->syscalls.entries;
> > + int id, _id;
> >
> > /*
> > * We're only interested in syscalls that have a pointer:
> > @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace
> >
> > try_to_find_pair:
> > for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) {
> > - struct syscall *pair = trace__syscall_info(trace, NULL, id);
> > + struct syscall *pair;
> > struct bpf_program *pair_prog;
> > bool is_candidate = false;
> >
> > + _id = scs[id].id;
> > + pair = trace__syscall_info(trace, NULL, _id);
> > +
> > if (pair == NULL || pair == sc ||
> > pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented)
> > continue;
> > @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> > {
> > int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter);
> > int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit);
> > - int err = 0, key;
> > + int err = 0, key, id;
> > + struct __syscall *scs = trace->sctbl->syscalls.entries;
> >
> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> > int prog_fd;
> >
> > - if (!trace__syscall_enabled(trace, key))
> > + id = scs[key].id;
> > +
> > + if (!trace__syscall_enabled(trace, id))
> > continue;
> >
> > - trace__init_syscall_bpf_progs(trace, key);
> > + trace__init_syscall_bpf_progs(trace, id);
> >
> > // It'll get at least the "!raw_syscalls:unaugmented"
> > - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key);
> > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
> > + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id);
> > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
> > if (err)
> > break;
> > - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key);
> > - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY);
> > + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id);
> > + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY);
> > if (err)
> > break;
> > }
> > @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> > * array tail call, then that one will be used.
> > */
> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> > - struct syscall *sc = trace__syscall_info(trace, NULL, key);
> > + struct syscall *sc;
> > struct bpf_program *pair_prog;
> > int prog_fd;
> >
> > + id = scs[key].id;
> > + sc = trace__syscall_info(trace, NULL, id);
> > +
> > if (sc == NULL || sc->bpf_prog.sys_enter == NULL)
> > continue;
> >
> > @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> > * with the fd for the program we're reusing:
> > */
> > prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter);
> > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
> > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
> > if (err)
> > break;
> > }
> > diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
> > index 63be7b58761d..16aa886c40f0 100644
> > --- a/tools/perf/util/syscalltbl.c
> > +++ b/tools/perf/util/syscalltbl.c
> > @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID;
> > static const char *const *syscalltbl_native = syscalltbl_loongarch;
> > #endif
> >
> > -struct syscall {
> > - int id;
> > - const char *name;
> > -};
> > -
> > static int syscallcmpname(const void *vkey, const void *ventry)
> > {
> > const char *key = vkey;
> > - const struct syscall *entry = ventry;
> > + const struct __syscall *entry = ventry;
> >
> > return strcmp(key, entry->name);
> > }
> >
> > static int syscallcmp(const void *va, const void *vb)
> > {
> > - const struct syscall *a = va, *b = vb;
> > + const struct __syscall *a = va, *b = vb;
> >
> > return strcmp(a->name, b->name);
> > }
> > @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb)
> > static int syscalltbl__init_native(struct syscalltbl *tbl)
> > {
> > int nr_entries = 0, i, j;
> > - struct syscall *entries;
> > + struct __syscall *entries;
> >
> > for (i = 0; i <= syscalltbl_native_max_id; ++i)
> > if (syscalltbl_native[i])
> > ++nr_entries;
> >
> > - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> > + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) *
> > + nr_entries);
> > if (tbl->syscalls.entries == NULL)
> > return -1;
> >
> > @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl)
> > }
> > }
> >
> > - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp);
> > + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall),
> > + syscallcmp);
> > tbl->syscalls.nr_entries = nr_entries;
> > tbl->syscalls.max_id = syscalltbl_native_max_id;
> > return 0;
> > @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id
> >
> > int syscalltbl__id(struct syscalltbl *tbl, const char *name)
> > {
> > - struct syscall *sc = bsearch(name, tbl->syscalls.entries,
> > + struct __syscall *sc = bsearch(name, tbl->syscalls.entries,
> > tbl->syscalls.nr_entries, sizeof(*sc),
> > syscallcmpname);
> >
> > @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
> > int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx)
> > {
> > int i;
> > - struct syscall *syscalls = tbl->syscalls.entries;
> > + struct __syscall *syscalls = tbl->syscalls.entries;
> >
> > for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) {
> > if (strglobmatch(syscalls[i].name, syscall_glob)) {
> > diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h
> > index a41d2ca9e4ae..6e93a0874c40 100644
> > --- a/tools/perf/util/syscalltbl.h
> > +++ b/tools/perf/util/syscalltbl.h
> > @@ -2,6 +2,11 @@
> > #ifndef __PERF_SYSCALLTBL_H
> > #define __PERF_SYSCALLTBL_H
> >
>
> It'd be nice to document the struct with examples that explain the
> confusion that's happened and is fixed here.

You mean the "struct __syscall"? That's just me trying to expose the
already existing "struct syscall" in syscalltbl.h (it was originally in
syscalltbl.c, so it was private), so that I can use it in
builtin-trace.c. It was originally named "struct syscall", but there is
already a "struct syscall" in builtin-trace.c, so I renamed it
to "struct __syscall". Please tell me if this is inappropriate.

Thanks,
Howard
>
> Thanks,
> Ian
>
> > +struct __syscall {
> > + int id;
> > + const char *name;
> > +};
> > +
> > struct syscalltbl {
> > int audit_machine;
> > struct {
> > --
> > 2.45.2
> >

On Tue, Jun 11, 2024 at 5:33 AM Ian Rogers <[email protected]> wrote:
>
> On Sat, Jun 8, 2024 at 10:21 AM Howard Chu <[email protected]> wrote:
> >
> > This is a bug found when implementing pretty-printing for the
> > landlock_add_rule system call. I decided to send this patch separately
> > because it is a serious bug that should be fixed quickly.
> >
> > I wrote a test program to do landlock_add_rule syscall in a loop,
> > yet perf trace -e landlock_add_rule freezes, giving no output.
> >
> > This bug was introduced by a misunderstanding of the variable "key"
> > below:
> > ```
> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> > struct syscall *sc = trace__syscall_info(trace, NULL, key);
> > ...
> > }
> > ```
> > The code above seems right at first glance, but when looking at
> > syscalltbl.c, I found these lines:
> >
> > ```
> > for (i = 0; i <= syscalltbl_native_max_id; ++i)
> > if (syscalltbl_native[i])
> > ++nr_entries;
> >
> > entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> > ...
> >
> > for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
> > if (syscalltbl_native[i]) {
> > entries[j].name = syscalltbl_native[i];
> > entries[j].id = i;
> > ++j;
> > }
> > }
> > ```
> >
> > meaning the key is merely an index to traverse the syscall table,
> > instead of the actual syscall id for this particular syscall.
>
>
> Thanks Howard, I'm not following this. Doesn't it make sense to use
> the syscall number as its id?
>
> >
> >
> > So if one uses key in trace__syscall_info(trace, NULL, key), then
> > because key only goes up to trace->sctbl->syscalls.nr_entries (373 on
> > my x86_64 machine), all the remaining syscalls end up neglected; in my
> > case, everything after `rseq`, because the traversal stops at 373 and
> > `rseq` is the last syscall whose id is lower than 373
> >
> > in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:
> > ```
> > ...
> > [334] = "rseq",
> > [424] = "pidfd_send_signal",
> > ...
> > ```
> >
> > The reason why key is conflated with the id yet perf trace still works
> > is that key is used in trace__syscall_info(trace, NULL, key) to index
> > trace->syscalls.table[id]; this ensures that the struct syscall
> > returned actually has an id equal to key, keeping the later bpf_prog
> > matching correct.
>
>
> Could we create a test for this? We have tests that list all perf
> events and then run a perf command on them. It wouldn't be
> possible to guarantee the output, though.
>
> >
> > After fixing this bug, I can do perf trace on 38 more syscalls, and
> > because more syscalls are visible, we get 8 more syscalls that can be
> > augmented.
> >
> > before:
> >
> > perf $ perf trace -vv --max-events=1 |& grep Reusing
> > Reusing "open" BPF sys_enter augmenter for "stat"
> > Reusing "open" BPF sys_enter augmenter for "lstat"
> > Reusing "open" BPF sys_enter augmenter for "access"
> > Reusing "connect" BPF sys_enter augmenter for "accept"
> > Reusing "sendto" BPF sys_enter augmenter for "recvfrom"
> > Reusing "connect" BPF sys_enter augmenter for "bind"
> > Reusing "connect" BPF sys_enter augmenter for "getsockname"
> > Reusing "connect" BPF sys_enter augmenter for "getpeername"
> > Reusing "open" BPF sys_enter augmenter for "execve"
> > Reusing "open" BPF sys_enter augmenter for "truncate"
> > Reusing "open" BPF sys_enter augmenter for "chdir"
> > Reusing "open" BPF sys_enter augmenter for "mkdir"
> > Reusing "open" BPF sys_enter augmenter for "rmdir"
> > Reusing "open" BPF sys_enter augmenter for "creat"
> > Reusing "open" BPF sys_enter augmenter for "link"
> > Reusing "open" BPF sys_enter augmenter for "unlink"
> > Reusing "open" BPF sys_enter augmenter for "symlink"
> > Reusing "open" BPF sys_enter augmenter for "readlink"
> > Reusing "open" BPF sys_enter augmenter for "chmod"
> > Reusing "open" BPF sys_enter augmenter for "chown"
> > Reusing "open" BPF sys_enter augmenter for "lchown"
> > Reusing "open" BPF sys_enter augmenter for "mknod"
> > Reusing "open" BPF sys_enter augmenter for "statfs"
> > Reusing "open" BPF sys_enter augmenter for "pivot_root"
> > Reusing "open" BPF sys_enter augmenter for "chroot"
> > Reusing "open" BPF sys_enter augmenter for "acct"
> > Reusing "open" BPF sys_enter augmenter for "swapon"
> > Reusing "open" BPF sys_enter augmenter for "swapoff"
> > Reusing "open" BPF sys_enter augmenter for "delete_module"
> > Reusing "open" BPF sys_enter augmenter for "setxattr"
> > Reusing "open" BPF sys_enter augmenter for "lsetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fsetxattr"
> > Reusing "open" BPF sys_enter augmenter for "getxattr"
> > Reusing "open" BPF sys_enter augmenter for "lgetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fgetxattr"
> > Reusing "open" BPF sys_enter augmenter for "listxattr"
> > Reusing "open" BPF sys_enter augmenter for "llistxattr"
> > Reusing "open" BPF sys_enter augmenter for "removexattr"
> > Reusing "open" BPF sys_enter augmenter for "lremovexattr"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr"
> > Reusing "open" BPF sys_enter augmenter for "mq_open"
> > Reusing "open" BPF sys_enter augmenter for "mq_unlink"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat"
> > Reusing "open" BPF sys_enter augmenter for "symlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat"
> > Reusing "connect" BPF sys_enter augmenter for "accept4"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2"
> > Reusing "open" BPF sys_enter augmenter for "memfd_create"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "statx"
> >
> > after
> >
> > perf $ perf trace -vv --max-events=1 |& grep Reusing
> > Reusing "open" BPF sys_enter augmenter for "stat"
> > Reusing "open" BPF sys_enter augmenter for "lstat"
> > Reusing "open" BPF sys_enter augmenter for "access"
> > Reusing "connect" BPF sys_enter augmenter for "accept"
> > Reusing "sendto" BPF sys_enter augmenter for "recvfrom"
> > Reusing "connect" BPF sys_enter augmenter for "bind"
> > Reusing "connect" BPF sys_enter augmenter for "getsockname"
> > Reusing "connect" BPF sys_enter augmenter for "getpeername"
> > Reusing "open" BPF sys_enter augmenter for "execve"
> > Reusing "open" BPF sys_enter augmenter for "truncate"
> > Reusing "open" BPF sys_enter augmenter for "chdir"
> > Reusing "open" BPF sys_enter augmenter for "mkdir"
> > Reusing "open" BPF sys_enter augmenter for "rmdir"
> > Reusing "open" BPF sys_enter augmenter for "creat"
> > Reusing "open" BPF sys_enter augmenter for "link"
> > Reusing "open" BPF sys_enter augmenter for "unlink"
> > Reusing "open" BPF sys_enter augmenter for "symlink"
> > Reusing "open" BPF sys_enter augmenter for "readlink"
> > Reusing "open" BPF sys_enter augmenter for "chmod"
> > Reusing "open" BPF sys_enter augmenter for "chown"
> > Reusing "open" BPF sys_enter augmenter for "lchown"
> > Reusing "open" BPF sys_enter augmenter for "mknod"
> > Reusing "open" BPF sys_enter augmenter for "statfs"
> > Reusing "open" BPF sys_enter augmenter for "pivot_root"
> > Reusing "open" BPF sys_enter augmenter for "chroot"
> > Reusing "open" BPF sys_enter augmenter for "acct"
> > Reusing "open" BPF sys_enter augmenter for "swapon"
> > Reusing "open" BPF sys_enter augmenter for "swapoff"
> > Reusing "open" BPF sys_enter augmenter for "delete_module"
> > Reusing "open" BPF sys_enter augmenter for "setxattr"
> > Reusing "open" BPF sys_enter augmenter for "lsetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fsetxattr"
> > Reusing "open" BPF sys_enter augmenter for "getxattr"
> > Reusing "open" BPF sys_enter augmenter for "lgetxattr"
> > Reusing "openat" BPF sys_enter augmenter for "fgetxattr"
> > Reusing "open" BPF sys_enter augmenter for "listxattr"
> > Reusing "open" BPF sys_enter augmenter for "llistxattr"
> > Reusing "open" BPF sys_enter augmenter for "removexattr"
> > Reusing "open" BPF sys_enter augmenter for "lremovexattr"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "fremovexattr"
> > Reusing "open" BPF sys_enter augmenter for "mq_open"
> > Reusing "open" BPF sys_enter augmenter for "mq_unlink"
> > Reusing "fsetxattr" BPF sys_enter augmenter for "add_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "request_key"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "inotify_add_watch"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mkdirat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "mknodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchownat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "futimesat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "newfstatat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "unlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "linkat"
> > Reusing "open" BPF sys_enter augmenter for "symlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "readlinkat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "fchmodat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "faccessat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "utimensat"
> > Reusing "connect" BPF sys_enter augmenter for "accept4"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "name_to_handle_at"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "renameat2"
> > Reusing "open" BPF sys_enter augmenter for "memfd_create"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
> > Reusing "fremovexattr" BPF sys_enter augmenter for "statx"
> >
> > TL;DR:
> >
> > These are the new syscalls that can be augmented
> > Reusing "openat" BPF sys_enter augmenter for "open_tree"
> > Reusing "openat" BPF sys_enter augmenter for "openat2"
> > Reusing "openat" BPF sys_enter augmenter for "mount_setattr"
> > Reusing "openat" BPF sys_enter augmenter for "move_mount"
> > Reusing "open" BPF sys_enter augmenter for "fsopen"
> > Reusing "openat" BPF sys_enter augmenter for "fspick"
> > Reusing "openat" BPF sys_enter augmenter for "faccessat2"
> > Reusing "openat" BPF sys_enter augmenter for "fchmodat2"
> >
> > as for the perf trace output:
> >
> > before
> >
> > perf $ perf trace -e faccessat2 --max-events=1
> > [no output]
> >
> > after
> >
> > perf $ ./perf trace -e faccessat2 --max-events=1
> > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0
> >
> > P.S. The reason this bug was not found in the past five years is
> > probably that it only affects the newer syscalls with greater ids, for
> > instance faccessat2 with id 439, which not many people care about
> > when using perf trace.
> >
> > Signed-off-by: Howard Chu <[email protected]>
> > ---
> > tools/perf/builtin-trace.c | 32 +++++++++++++++++++++-----------
> > tools/perf/util/syscalltbl.c | 21 +++++++++------------
> > tools/perf/util/syscalltbl.h | 5 +++++
> > 3 files changed, 35 insertions(+), 23 deletions(-)
> >
> > diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
> > index c42bc608954e..5cbe1748911d 100644
> > --- a/tools/perf/builtin-trace.c
> > +++ b/tools/perf/builtin-trace.c
> > @@ -3354,7 +3354,8 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id)
> > static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc)
> > {
> > struct tep_format_field *field, *candidate_field;
> > - int id;
> > + struct __syscall *scs = trace->sctbl->syscalls.entries;
> > + int id, _id;
> >
> > /*
> > * We're only interested in syscalls that have a pointer:
> > @@ -3368,10 +3369,13 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace
> >
> > try_to_find_pair:
> > for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) {
> > - struct syscall *pair = trace__syscall_info(trace, NULL, id);
> > + struct syscall *pair;
> > struct bpf_program *pair_prog;
> > bool is_candidate = false;
> >
> > + _id = scs[id].id;
> > + pair = trace__syscall_info(trace, NULL, _id);
> > +
> > if (pair == NULL || pair == sc ||
> > pair->bpf_prog.sys_enter == trace->skel->progs.syscall_unaugmented)
> > continue;
> > @@ -3456,23 +3460,26 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> > {
> > int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter);
> > int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit);
> > - int err = 0, key;
> > + int err = 0, key, id;
> > + struct __syscall *scs = trace->sctbl->syscalls.entries;
> >
> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> > int prog_fd;
> >
> > - if (!trace__syscall_enabled(trace, key))
> > + id = scs[key].id;
> > +
> > + if (!trace__syscall_enabled(trace, id))
> > continue;
> >
> > - trace__init_syscall_bpf_progs(trace, key);
> > + trace__init_syscall_bpf_progs(trace, id);
> >
> > // It'll get at least the "!raw_syscalls:unaugmented"
> > - prog_fd = trace__bpf_prog_sys_enter_fd(trace, key);
> > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
> > + prog_fd = trace__bpf_prog_sys_enter_fd(trace, id);
> > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
> > if (err)
> > break;
> > - prog_fd = trace__bpf_prog_sys_exit_fd(trace, key);
> > - err = bpf_map_update_elem(map_exit_fd, &key, &prog_fd, BPF_ANY);
> > + prog_fd = trace__bpf_prog_sys_exit_fd(trace, id);
> > + err = bpf_map_update_elem(map_exit_fd, &id, &prog_fd, BPF_ANY);
> > if (err)
> > break;
> > }
> > @@ -3506,10 +3513,13 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> > * array tail call, then that one will be used.
> > */
> > for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> > - struct syscall *sc = trace__syscall_info(trace, NULL, key);
> > + struct syscall *sc;
> > struct bpf_program *pair_prog;
> > int prog_fd;
> >
> > + id = scs[key].id;
> > + sc = trace__syscall_info(trace, NULL, id);
> > +
> > if (sc == NULL || sc->bpf_prog.sys_enter == NULL)
> > continue;
> >
> > @@ -3535,7 +3545,7 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
> > * with the fd for the program we're reusing:
> > */
> > prog_fd = bpf_program__fd(sc->bpf_prog.sys_enter);
> > - err = bpf_map_update_elem(map_enter_fd, &key, &prog_fd, BPF_ANY);
> > + err = bpf_map_update_elem(map_enter_fd, &id, &prog_fd, BPF_ANY);
> > if (err)
> > break;
> > }
> > diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
> > index 63be7b58761d..16aa886c40f0 100644
> > --- a/tools/perf/util/syscalltbl.c
> > +++ b/tools/perf/util/syscalltbl.c
> > @@ -44,22 +44,17 @@ const int syscalltbl_native_max_id = SYSCALLTBL_LOONGARCH_MAX_ID;
> > static const char *const *syscalltbl_native = syscalltbl_loongarch;
> > #endif
> >
> > -struct syscall {
> > - int id;
> > - const char *name;
> > -};
> > -
> > static int syscallcmpname(const void *vkey, const void *ventry)
> > {
> > const char *key = vkey;
> > - const struct syscall *entry = ventry;
> > + const struct __syscall *entry = ventry;
> >
> > return strcmp(key, entry->name);
> > }
> >
> > static int syscallcmp(const void *va, const void *vb)
> > {
> > - const struct syscall *a = va, *b = vb;
> > + const struct __syscall *a = va, *b = vb;
> >
> > return strcmp(a->name, b->name);
> > }
> > @@ -67,13 +62,14 @@ static int syscallcmp(const void *va, const void *vb)
> > static int syscalltbl__init_native(struct syscalltbl *tbl)
> > {
> > int nr_entries = 0, i, j;
> > - struct syscall *entries;
> > + struct __syscall *entries;
> >
> > for (i = 0; i <= syscalltbl_native_max_id; ++i)
> > if (syscalltbl_native[i])
> > ++nr_entries;
> >
> > - entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> > + entries = tbl->syscalls.entries = malloc(sizeof(struct __syscall) *
> > + nr_entries);
> > if (tbl->syscalls.entries == NULL)
> > return -1;
> >
> > @@ -85,7 +81,8 @@ static int syscalltbl__init_native(struct syscalltbl *tbl)
> > }
> > }
> >
> > - qsort(tbl->syscalls.entries, nr_entries, sizeof(struct syscall), syscallcmp);
> > + qsort(tbl->syscalls.entries, nr_entries, sizeof(struct __syscall),
> > + syscallcmp);
> > tbl->syscalls.nr_entries = nr_entries;
> > tbl->syscalls.max_id = syscalltbl_native_max_id;
> > return 0;
> > @@ -116,7 +113,7 @@ const char *syscalltbl__name(const struct syscalltbl *tbl __maybe_unused, int id
> >
> > int syscalltbl__id(struct syscalltbl *tbl, const char *name)
> > {
> > - struct syscall *sc = bsearch(name, tbl->syscalls.entries,
> > + struct __syscall *sc = bsearch(name, tbl->syscalls.entries,
> > tbl->syscalls.nr_entries, sizeof(*sc),
> > syscallcmpname);
> >
> > @@ -126,7 +123,7 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
> > int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx)
> > {
> > int i;
> > - struct syscall *syscalls = tbl->syscalls.entries;
> > + struct __syscall *syscalls = tbl->syscalls.entries;
> >
> > for (i = *idx + 1; i < tbl->syscalls.nr_entries; ++i) {
> > if (strglobmatch(syscalls[i].name, syscall_glob)) {
> > diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h
> > index a41d2ca9e4ae..6e93a0874c40 100644
> > --- a/tools/perf/util/syscalltbl.h
> > +++ b/tools/perf/util/syscalltbl.h
> > @@ -2,6 +2,11 @@
> > #ifndef __PERF_SYSCALLTBL_H
> > #define __PERF_SYSCALLTBL_H
> >
>
> It'd be nice to document the struct with examples that explain the
> confusion that's happened and is fixed here.
>
> Thanks,
> Ian
>
> > +struct __syscall {
> > + int id;
> > + const char *name;
> > +};
> > +
> > struct syscalltbl {
> > int audit_machine;
> > struct {
> > --
> > 2.45.2
> >

2024-06-11 14:54:10

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH] perf trace: Fix syscall untraceable bug

On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote:
> This is a bug found when implementing pretty-printing for the
> landlock_add_rule system call. I decided to send this patch separately
> because it is a serious bug that should be fixed quickly.

> I wrote a test program to do landlock_add_rule syscall in a loop,
> yet perf trace -e landlock_add_rule freezes, giving no output.

> This bug was introduced by a misunderstanding of the variable "key"
> below:
> ```
> for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
> struct syscall *sc = trace__syscall_info(trace, NULL, key);
> ...
> }
> ```
> The code above seems right at first glance, but when looking at
> syscalltbl.c, I found these lines:
>
> ```
> for (i = 0; i <= syscalltbl_native_max_id; ++i)
> if (syscalltbl_native[i])
> ++nr_entries;
>
> entries = tbl->syscalls.entries = malloc(sizeof(struct syscall) * nr_entries);
> ...
>
> for (i = 0, j = 0; i <= syscalltbl_native_max_id; ++i) {
> if (syscalltbl_native[i]) {
> entries[j].name = syscalltbl_native[i];
> entries[j].id = i;
> ++j;
> }
> }
> ```
>
> meaning the key is merely an index to traverse the syscall table,
> instead of the actual syscall id for this particular syscall.

> So if one uses key in trace__syscall_info(trace, NULL, key), then
> because key only goes up to trace->sctbl->syscalls.nr_entries (373 on
> my x86_64 machine), all the remaining syscalls end up neglected; in my
> case, everything after `rseq`, because the traversal stops at 373 and
> `rseq` is the last syscall whose id is lower than 373

> in tools/perf/arch/x86/include/generated/asm/syscalls_64.c:
> ```
> ...
> [334] = "rseq",
> [424] = "pidfd_send_signal",
> ...
> ```
>
> The reason why key is conflated with the id yet perf trace still works
> is that key is used in trace__syscall_info(trace, NULL, key) to index
> trace->syscalls.table[id]; this ensures that the struct syscall
> returned actually has an id equal to key, keeping the later bpf_prog
> matching correct.

Right, trace->syscalls.table holds info read from tracefs, while
trace->sctbl holds info created from per-arch syscall tables, usually
from:

⬢[acme@toolbox perf-tools-next]$ find arch -name "*.tbl"
arch/alpha/kernel/syscalls/syscall.tbl
arch/arm/tools/syscall.tbl
arch/m68k/kernel/syscalls/syscall.tbl
arch/microblaze/kernel/syscalls/syscall.tbl
arch/mips/kernel/syscalls/syscall_n32.tbl
arch/mips/kernel/syscalls/syscall_n64.tbl
arch/mips/kernel/syscalls/syscall_o32.tbl
arch/parisc/kernel/syscalls/syscall.tbl
arch/powerpc/kernel/syscalls/syscall.tbl
arch/s390/kernel/syscalls/syscall.tbl
arch/sh/kernel/syscalls/syscall.tbl
arch/sparc/kernel/syscalls/syscall.tbl
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl
arch/xtensa/kernel/syscalls/syscall.tbl
⬢[acme@toolbox perf-tools-next]$

trace->sctbl->syscalls.entries is sorted by name and has { syscall_id,
name } as contents.

But it is loaded at the time the reuse of BPF programs is attempted,
because initially we were using libaudit for mapping the id, which we get
from the tracepoint payload:

root@number:~# cat /sys/kernel/tracing/events/raw_syscalls/sys_enter/format
name: sys_enter
ID: 361
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:long id; offset:8; size:8; signed:1;
           ^^
field:unsigned long args[6]; offset:16; size:48; signed:0;

print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
root@number:~#

to a syscall name.

Your analysis is perfect, great! Please take a look at the attached
patch, that I'm testing now that is a simplification of your patch, that
avoids exposing that __syscall struct, reducing patch size by
introducing this instead:

+++ b/tools/perf/util/syscalltbl.c
@@ -123,6 +123,13 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
return sc ? sc->id : -1;
}

+int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx)
+{
+ struct syscall *syscalls = tbl->syscalls.entries;
+
+ return idx < tbl->syscalls.nr_entries ? syscalls[idx].id : -1;
+}

And then using it to go from the index to the syscall id in the loops
traversing the sorted trace->sctbl->

> After fixing this bug, I can do perf trace on 38 more syscalls, and
> because more syscalls are visible, we get 8 more syscalls that can be
> augmented.
>
> before:
>
> perf $ perf trace -vv --max-events=1 |& grep Reusing
> Reusing "open" BPF sys_enter augmenter for "stat"
> Reusing "open" BPF sys_enter augmenter for "lstat"
> Reusing "fremovexattr" BPF sys_enter augmenter for "execveat"
> Reusing "fremovexattr" BPF sys_enter augmenter for "statx"
>
> TL;DR:

I did the above and got:

root@number:~# wc -l before after
63 before
71 after
134 total
root@number:~#

> These are the new syscalls that can be augmented

Which matches the 8 more reused bpf programs.

> Reusing "openat" BPF sys_enter augmenter for "open_tree"
> Reusing "openat" BPF sys_enter augmenter for "openat2"
> Reusing "openat" BPF sys_enter augmenter for "mount_setattr"
> Reusing "openat" BPF sys_enter augmenter for "move_mount"
> Reusing "open" BPF sys_enter augmenter for "fsopen"
> Reusing "openat" BPF sys_enter augmenter for "fspick"
> Reusing "openat" BPF sys_enter augmenter for "faccessat2"
> Reusing "openat" BPF sys_enter augmenter for "fchmodat2"

But interestingly I get this:

root@number:~# grep -E 'for "(open_tree|openat2|mount_setattr|move_mount|fsopen|fspick|faccessat2|fchmodat2)' after
Reusing "faccessat" BPF sys_enter augmenter for "faccessat2"
Reusing "faccessat" BPF sys_enter augmenter for "fchmodat2"
Reusing "access" BPF sys_enter augmenter for "fsopen"
Reusing "faccessat" BPF sys_enter augmenter for "fspick"
Reusing "faccessat" BPF sys_enter augmenter for "mount_setattr"
Reusing "faccessat" BPF sys_enter augmenter for "move_mount"
Reusing "faccessat" BPF sys_enter augmenter for "open_tree"
Reusing "faccessat" BPF sys_enter augmenter for "openat2"
root@number:~#

Which matches my expectations, as the array we're traversing,
trace->sctbl->syscalls.entries, is sorted by name.

> as for the perf trace output:

> before
>
> perf $ perf trace -e faccessat2 --max-events=1
> [no output]
>
> after
>
> perf $ ./perf trace -e faccessat2 --max-events=1
> 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0
>
> P.S. The reason this bug was not found in the past five years is
> probably that it only affects the newer syscalls with greater ids, for
> instance faccessat2 with id 439, which not many people care about
> when using perf trace.

That and the fact that the BPF code was hidden before having to use -e,
that got changed kinda recently when we switched to using BPF skels for
augmenting syscalls in 'perf trace':

⬢[acme@toolbox perf-tools-next]$ git log --oneline tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
a9f4c6c999008c92 perf trace: Collect sys_nanosleep first argument
29d16de26df17e94 perf augmented_raw_syscalls.bpf: Move 'struct timespec64' to vmlinux.h
5069211e2f0b47e7 perf trace: Use the right bpf_probe_read(_str) variant for reading user data
33b725ce7b988756 perf trace: Avoid compile error wrt redefining bool
7d9642311b6d9d31 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(augmented_arg->value) is a power of two.
262b54b6c9396823 perf bpf augmented_raw_syscalls: Add an assert to make sure sizeof(saddr) is a power of two.
1836480429d173c0 perf bpf_skel augmented_raw_syscalls: Cap the socklen parameter using &= sizeof(saddr)
cd2cece61ac5f900 perf trace: Tidy comments related to BPF + syscall augmentation
5e6da6be3082f77b perf trace: Migrate BPF augmentation to use a skeleton
⬢[acme@toolbox perf-tools-next]$

⬢[acme@toolbox perf-tools-next]$ git show --oneline --pretty=reference 5e6da6be3082f77b | head -1
5e6da6be3082f77b (perf trace: Migrate BPF augmentation to use a skeleton, 2023-08-10)
⬢[acme@toolbox perf-tools-next]$

I.e. from August, 2023.

One had as well to ask for BUILD_BPF_SKEL=1, which now is default if all
it needs is available on the system.

I'll test it together with your btf_enum work.

- Arnaldo

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index c42bc608954ee08d..c4fa8191253d4880 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -3354,8 +3354,6 @@ static int trace__bpf_prog_sys_exit_fd(struct trace *trace, int id)
static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace, struct syscall *sc)
{
struct tep_format_field *field, *candidate_field;
- int id;
-
/*
* We're only interested in syscalls that have a pointer:
*/
@@ -3367,7 +3365,8 @@ static struct bpf_program *trace__find_usable_bpf_prog_entry(struct trace *trace
return NULL;

try_to_find_pair:
- for (id = 0; id < trace->sctbl->syscalls.nr_entries; ++id) {
+ for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) {
+ int id = syscalltbl__id_at_idx(trace->sctbl, i);
struct syscall *pair = trace__syscall_info(trace, NULL, id);
struct bpf_program *pair_prog;
bool is_candidate = false;
@@ -3456,10 +3455,10 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
{
int map_enter_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_enter);
int map_exit_fd = bpf_map__fd(trace->skel->maps.syscalls_sys_exit);
- int err = 0, key;
+ int err = 0;

- for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
- int prog_fd;
+ for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) {
+ int prog_fd, key = syscalltbl__id_at_idx(trace->sctbl, i);

if (!trace__syscall_enabled(trace, key))
continue;
@@ -3505,7 +3504,8 @@ static int trace__init_syscalls_bpf_prog_array_maps(struct trace *trace)
* first and second arg (this one on the raw_syscalls:sys_exit prog
* array tail call, then that one will be used.
*/
- for (key = 0; key < trace->sctbl->syscalls.nr_entries; ++key) {
+ for (int i = 0; i < trace->sctbl->syscalls.nr_entries; ++i) {
+ int key = syscalltbl__id_at_idx(trace->sctbl, i);
struct syscall *sc = trace__syscall_info(trace, NULL, key);
struct bpf_program *pair_prog;
int prog_fd;
diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
index 63be7b58761d2f33..0dd26b991b3fb513 100644
--- a/tools/perf/util/syscalltbl.c
+++ b/tools/perf/util/syscalltbl.c
@@ -123,6 +123,13 @@ int syscalltbl__id(struct syscalltbl *tbl, const char *name)
return sc ? sc->id : -1;
}

+int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx)
+{
+ struct syscall *syscalls = tbl->syscalls.entries;
+
+ return idx < tbl->syscalls.nr_entries ? syscalls[idx].id : -1;
+}
+
int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx)
{
int i;
diff --git a/tools/perf/util/syscalltbl.h b/tools/perf/util/syscalltbl.h
index a41d2ca9e4aebad5..2b53b7ed25a6affe 100644
--- a/tools/perf/util/syscalltbl.h
+++ b/tools/perf/util/syscalltbl.h
@@ -16,6 +16,7 @@ void syscalltbl__delete(struct syscalltbl *tbl);

 const char *syscalltbl__name(const struct syscalltbl *tbl, int id);
 int syscalltbl__id(struct syscalltbl *tbl, const char *name);
+int syscalltbl__id_at_idx(struct syscalltbl *tbl, int idx);
 
 int syscalltbl__strglobmatch_first(struct syscalltbl *tbl, const char *syscall_glob, int *idx);
 int syscalltbl__strglobmatch_next(struct syscalltbl *tbl, const char *syscall_glob, int *idx);


2024-06-11 15:29:01

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH] perf trace: Fix syscall untraceable bug

On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote:
> as for the perf trace output:
>
> before
>
> perf $ perf trace -e faccessat2 --max-events=1
> [no output]
>
> after
>
> perf $ ./perf trace -e faccessat2 --max-events=1
> 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0

Yeah, before there is no output, after, with the following test case:

⬢[acme@toolbox c]$ cat faccessat2.c
#include <fcntl.h>       /* Definition of AT_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
#include <stdio.h>

/* Provide own faccessat2 stub because glibc doesn't */
__attribute__((weak))
int faccessat2(int dirfd, const char *pathname, int mode, int flags)
{
	return syscall(SYS_faccessat2, dirfd, pathname, mode, flags);
}

int main(int argc, char *argv[])
{
	int err = faccessat2(123, argv[1], X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW);

	printf("faccessat2(123, %s, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = %d\n", argv[1], err);
	return err;
}
⬢[acme@toolbox c]$ make faccessat2
cc faccessat2.c -o faccessat2
⬢[acme@toolbox c]$ ./faccessat2 bla
faccessat2(123, bla, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = -1
⬢[acme@toolbox c]$

In the other terminal, as root:

root@number:~# perf trace --call-graph dwarf -e faccessat2 --max-events=1
0.000 ( 0.034 ms): bash/62004 faccessat2(dfd: 123, filename: "bla", mode: X, flags: EACCESS|SYMLINK_NOFOLLOW) = -1 EBADF (Bad file descriptor)
                                       syscall (/usr/lib64/libc.so.6)
                                       faccessat2 (/home/acme/c/faccessat2)
                                       main (/home/acme/c/faccessat2)
                                       __libc_start_call_main (/usr/lib64/libc.so.6)
                                       __libc_start_main@@GLIBC_2.34 (/usr/lib64/libc.so.6)
                                       _start (/home/acme/c/faccessat2)
root@number:~#

Now to write another test case, this time for the landlock syscall, to
test your btf_enum patch.

In the future please add the test case so that one can quickly reproduce
your testing steps.

- Arnaldo

2024-06-11 15:44:21

by Howard Chu

[permalink] [raw]
Subject: Re: [PATCH] perf trace: Fix syscall untraceable bug

Hello Arnaldo,

On Tue, Jun 11, 2024 at 11:25 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> On Sun, Jun 09, 2024 at 01:21:46AM +0800, Howard Chu wrote:
> > as for the perf trace output:
> >
> > before
> >
> > perf $ perf trace -e faccessat2 --max-events=1
> > [no output]
> >
> > after
> >
> > perf $ ./perf trace -e faccessat2 --max-events=1
> > 0.000 ( 0.037 ms): waybar/958 faccessat2(dfd: 40, filename: "uevent") = 0
>
> Yeah, before there is no output, after, with the following test case:
>
> ⬢[acme@toolbox c]$ cat faccessat2.c
> #include <fcntl.h>       /* Definition of AT_* constants */
> #include <sys/syscall.h> /* Definition of SYS_* constants */
> #include <unistd.h>
> #include <stdio.h>
>
> /* Provide own faccessat2 stub because glibc doesn't */
> __attribute__((weak))
> int faccessat2(int dirfd, const char *pathname, int mode, int flags)
> {
> 	return syscall(SYS_faccessat2, dirfd, pathname, mode, flags);
> }
>
> int main(int argc, char *argv[])
> {
> 	int err = faccessat2(123, argv[1], X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW);
>
> 	printf("faccessat2(123, %s, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = %d\n", argv[1], err);
> 	return err;
> }
> ⬢[acme@toolbox c]$ make faccessat2
> cc faccessat2.c -o faccessat2
> ⬢[acme@toolbox c]$ ./faccessat2 bla
> faccessat2(123, bla, X_OK, AT_EACCESS | AT_SYMLINK_NOFOLLOW) = -1
> ⬢[acme@toolbox c]$
>
> In the other terminal, as root:
>
> root@number:~# perf trace --call-graph dwarf -e faccessat2 --max-events=1
> 0.000 ( 0.034 ms): bash/62004 faccessat2(dfd: 123, filename: "bla", mode: X, flags: EACCESS|SYMLINK_NOFOLLOW) = -1 EBADF (Bad file descriptor)
> syscall (/usr/lib64/libc.so.6)
> faccessat2 (/home/acme/c/faccessat2)
> main (/home/acme/c/faccessat2)
> __libc_start_call_main (/usr/lib64/libc.so.6)
> __libc_start_main@@GLIBC_2.34 (/usr/lib64/libc.so.6)
> _start (/home/acme/c/faccessat2)
> root@number:~#
>
> Now to write another test case, this time for the landlock syscall, to
> test your btf_enum patch.
>
> In the future please add the test case so that one can quickly reproduce
> your testing steps.

Thank you for testing this patch, and sorry for the inconvenience; I
will attach test cases to speed up reproduction in the future. Also,
your simplification looks good, thank you so much.

Thanks,
Howard

>
> - Arnaldo