When a probe is registered at the `trace_sys_enter` tracepoint, and
that probe changes the system call number, the old system call still
gets executed on x86_64 (and potentially other architectures). This
is inconsistent with how ARM64 (and potentially other architectures)
handles this, and inconsistent with the tracepoint semantics prior to
change b6ec41346103 (core/entry: Report syscall correctly for trace
and audit).
With this patch, the semantics are restored to be the same as before
the aforementioned change (and thus made consistent with ARM64). The
change adds one line to re-read the system call number register into
the `syscall` variable. By reading twice, the benefits of the
aforementioned change b6ec41346103 are kept.
There should be no performance impact if no sys_enter tracepoints are
registered, since re-reading the system call number from `regs` is
only done conditonally if the tracepoint is in use. If a probe is
registered, the performance impact should still be minimal, since the
additional call to `syscall_get_nr` amounts to only an inlined read
of `regs->orig_ax` (on x86_64).
Signed-off-by: André Rösti <[email protected]>
---
@Thomas Gleixner: You may have received this e-mail twice. My apologies!
This is my first attempt to contribute, and I made a mistake using git
send-email. Thanks for your work maintaining this and sorry again.
---
kernel/entry/common.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 88cb3c88aaa5..89b14ba9ed14 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -57,8 +57,11 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
/* Either of the above might have changed the syscall number */
syscall = syscall_get_nr(current, regs);
- if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
+ if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) {
trace_sys_enter(regs, syscall);
+ /* Tracers may have changed system call number as well */
+ syscall = syscall_get_nr(current, regs);
+ }
syscall_enter_audit(regs, syscall);
base-commit: 221a164035fd8b554a44bd7c4bf8e7715a497561
--
2.34.1
André!
On Sat, Mar 09 2024 at 05:53, André Rösti wrote:
Nice finding!
Just a few nitpicks:
The subject line lacks a subsystem prefix. In this case is should be:
Subject: [PATCH] entry: Respect ......
Documentation/process/ has quite some information about change logs
along with the tip tree specific one:
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html
> When a probe is registered at the `trace_sys_enter` tracepoint, and
s/`trace_sys_enter`/trace_sys_enter()/
> that probe changes the system call number, the old system call still
> gets executed on x86_64 (and potentially other architectures). This
s/on x86_64 (and potentially other architectures//
Why? It's clear that this code is only executed on architectures which
utilize the generic entry code.
> is inconsistent with how ARM64 (and potentially other architectures)
> handles this, and inconsistent with the tracepoint semantics prior to
> change b6ec41346103 (core/entry: Report syscall correctly for trace
> and audit).
This worked correctly until commit b6ec41346103 ("core/entry: Report
syscall correctly for trace and audit") which removed the re-evaluation
of the syscall number after the trace point.
The ARM64 info is nice to have, but not really relevant because the real
issue is the commit which changed the semantics of the entry code.
> With this patch, the semantics are restored to be the same as before
Please search for "This patch" in the process documentation
> the aforementioned change (and thus made consistent with ARM64). The
> change adds one line to re-read the system call number register into
> the `syscall` variable. By reading twice, the benefits of the
> aforementioned change b6ec41346103 are kept.
My version for the above would be:
Restore the original semantics by re-evaluating the system call number
after trace_sys_enter().
That's the gist of the change, right?
It does not matter where the number comes from and which variable is it
written to. That's what is in the patch itself. It's neither interesting
that reading twice keeps some benefit because again, that's what can be
seen from the actual change. You'd need to mention it if your fix would
change it.
> There should be no performance impact if no sys_enter tracepoints are
> registered, since re-reading the system call number from `regs` is
> only done conditonally if the tracepoint is in use. If a probe is
> registered, the performance impact should still be minimal, since the
> additional call to `syscall_get_nr` amounts to only an inlined read
> of `regs->orig_ax` (on x86_64).
That's pretty x86 specific. What about something like this:
The performance impact of this re-evaluation is minimal because it is
only relevant when a trace point is active and compared to the actual
trace point overhead the read from a cache hot variable is
neglectible.
This lacks a:
Fixes: commit b6ec41346103 ("core/entry: Report syscall correctly for trace and audit")
tag.
> Signed-off-by: André Rösti <[email protected]>
> ---
> @Thomas Gleixner: You may have received this e-mail twice. My apologies!
> This is my first attempt to contribute, and I made a mistake using git
> send-email. Thanks for your work maintaining this and sorry again.
Don't worry. The first time submission is a learning experience. I
lively remember the healthy lesson I got a few decades ago and I still
get them today when I fail to see or describe something.
That said, I'm impressed by the detective work you did to put an
explanatory change log together on your first submission. Keep up the
good work!
I'm looking forward to the V2. Feel free to use my suggestions above as
guideline and rephrase it in your own words.
> ---
> kernel/entry/common.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 88cb3c88aaa5..89b14ba9ed14 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -57,8 +57,11 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
> /* Either of the above might have changed the syscall number */
> syscall = syscall_get_nr(current, regs);
>
> - if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
> + if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT)) {
> trace_sys_enter(regs, syscall);
> + /* Tracers may have changed system call number as well */
I'd rather say 'Probes or BPF hooks in the tracepoint ...'
Because a trace point per se does not change anything.
> + syscall = syscall_get_nr(current, regs);
> + }
>
> syscall_enter_audit(regs, syscall);
Thanks,
tglx