Hi,
I've hit an issue when tracing system calls on Linux. I
know that perf and ftrace ignore compat syscalls on x86
(see comment above kernel/trace/trace_syscalls.c:trace_get_syscall_nr()).
/*
 * Some architectures that allow for 32bit applications
 * to run on a 64bit kernel, do not map the syscalls for
 * the 32bit tasks the same as they do for 64bit tasks.
 *
 * *cough*x86*cough*
 *
 * In such a case, instead of reporting the wrong syscalls,
 * simply ignore them.
 */
Even though this comment states that those compat system calls
are ignored, there is a corner case with return from execve which
does not seem to be correctly handled when the task's TS_COMPAT
mode is flipped by execve.
I suspect that ftrace and perf suffer from this issue when a
32-bit compat program executes a 64-bit program: when returning
from execve, is_compat_task() returns false, but the system call
number recorded is that of the 32-bit execve, which then maps to
whichever system call has that number on the 64-bit arch.
This issue also affects LTTng.
In LTTng, rather than ignoring compat syscalls, we take a
different approach: we keep two syscall tables within the tracer:
one for syscalls, one for compat_syscalls. Whenever a syscall
tracing instrumentation is hit, we use is_compat_task() to map
to the correct syscall table.
We trace syscall entry and exit events into a different event
for each syscall, because we fetch input/output parameters
specific to each system call (e.g. strings) from user-space
before/after the system call. We also filter on a per-syscall
basis.
Unfortunately, there is an issue with the specific case
of execve: whenever a 64-bit execve syscall loads a 32-bit
compat executable, or when a 32-bit compat execve loads a
64-bit executable, the TS_COMPAT status is changed before
execve returns to userspace. However, the system call number
in the pt_regs stays the same. Unfortunately, this mixes up
the mapping between the syscall number and the syscall table
in the tracer.
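
To make the mix-up concrete, here is a minimal user-space sketch of the
table selection. It is not LTTng code: the two tables are tiny excerpts,
and tracer_syscall_name() is a hypothetical stand-in for picking a table
from the compat state seen when the event fires. The numbers used are the
x86 ones (execve is 59 on x86_64 and 11 on ia32; 11 is munmap on x86_64,
and ia32 slot 59 is handled by sys_olduname).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sc_entry {
	int nr;
	const char *name;
};

/* Excerpt of the native x86_64 numbering. */
static const struct sc_entry native_table[] = {
	{ 11, "munmap" },
	{ 59, "execve" },
};

/* Excerpt of the ia32 compat numbering (59 is handled by sys_olduname). */
static const struct sc_entry compat_table[] = {
	{ 11, "execve" },
	{ 59, "olduname" },
};

static const char *lookup(const struct sc_entry *table, size_t len, int nr)
{
	assert(table != NULL);
	for (size_t i = 0; i < len; i++) {
		if (table[i].nr == nr)
			return table[i].name;
	}
	return "unknown";
}

/*
 * Hypothetical stand-in for the tracer's dispatch: the table is chosen
 * from the compat state observed at the moment the event fires.
 */
static const char *tracer_syscall_name(bool compat_now, int nr)
{
	if (compat_now)
		return lookup(compat_table,
			      sizeof(compat_table) / sizeof(compat_table[0]), nr);
	return lookup(native_table,
		      sizeof(native_table) / sizeof(native_table[0]), nr);
}
```

For a 64-bit execve that loads a 32-bit executable, the entry event sees
tracer_syscall_name(false, 59), which is "execve", while the exit event
sees tracer_syscall_name(true, 59), which is "olduname". In the other
direction, number 11 is "execve" at entry and "munmap" at exit.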
I have a few ideas on how to overcome this, and would like your
feedback on the matter:
1) One possible approach would be to reserve an extra status flag
in struct thread_info to get the TS_COMPAT status at syscall
entry. It would _not_ be updated when the executable is loaded,
so the state at return from execve would match the state when
entering execve. This is a simple approach, but requires kernel
changes.
2) Keep the compat state at system call entry in a data structure
(e.g. hash table) indexed by thread number within each tracer.
This could work around this issue within each tracer.
3) Change the syscall number in the struct pt_regs whenever we
change the compat mode of a process. A 64-bit execve system
call number would be mapped to a 32-bit compat execve number,
or the opposite. This requires a kernel change, and seems to be
rather intrusive.
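
As a rough illustration of approach 2, the sketch below keeps the compat
state seen at syscall entry in a small open-addressing table keyed by
thread id. Everything in it is hypothetical (names, table size, hash),
and it leaves out what a real tracer would need: locking or per-CPU
handling, removal of entries for exited threads, and a policy for a full
table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 64u	/* power of two, sized arbitrarily for the sketch */

struct entry_state {
	bool used;
	int32_t tid;
	bool compat_at_entry;
};

static struct entry_state slots[NR_SLOTS];

static unsigned int first_slot(int32_t tid)
{
	/* Multiplicative hash, masked down to the table size. */
	return ((uint32_t)tid * 2654435761u) & (NR_SLOTS - 1u);
}

/* Syscall entry: remember the compat state for this thread. */
static bool record_entry(int32_t tid, bool compat_now)
{
	unsigned int start = first_slot(tid);

	for (unsigned int i = 0; i < NR_SLOTS; i++) {
		struct entry_state *e = &slots[(start + i) & (NR_SLOTS - 1u)];

		if (!e->used || e->tid == tid) {
			e->used = true;
			e->tid = tid;
			e->compat_at_entry = compat_now;
			return true;
		}
	}
	return false;	/* table full */
}

/* Syscall exit: fetch the state saved at entry, if there is one. */
static bool lookup_entry(int32_t tid, bool *compat_at_entry)
{
	unsigned int start = first_slot(tid);

	assert(compat_at_entry != NULL);
	for (unsigned int i = 0; i < NR_SLOTS; i++) {
		const struct entry_state *e = &slots[(start + i) & (NR_SLOTS - 1u)];

		if (!e->used)
			return false;	/* never recorded for this thread */
		if (e->tid == tid) {
			*compat_at_entry = e->compat_at_entry;
			return true;
		}
	}
	return false;
}
```

At syscall exit the tracer would select its table from the value returned
by lookup_entry() rather than from is_compat_task(). A false return means
the entry was never observed, for instance because tracing was enabled
while the thread was already inside the system call.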
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Sun, 8 Nov 2015 19:37:37 +0000 (UTC)
Mathieu Desnoyers <[email protected]> wrote:
> I have a few ideas on how to overcome this, and would like your
> feedback on the matter:
>
> 1) One possible approach would be to reserve an extra status flag
> in struct thread_info to get the TS_COMPAT status at syscall
> entry. It would _not_ be updated when the executable is loaded,
> so the state at return from execve would match the state when
> entering execve. This is a simple approach, but requires kernel
> changes.
Or add a flag TS_EXECVE that can be set by the tracepoint syscall
enter, and checked on exit. If set, we know that the exec happened.
>
> 2) Keep the compat state at system call entry in a data structure
> (e.g. hash table) indexed by thread number within each tracer.
> This could work around this issue within each tracer.
This is of course what you can do now, as it doesn't touch the kernel.
>
> 3) Change the syscall number in the struct pt_regs whenever we
> change the compat mode of a process. A 64-bit execve system
> call number would be mapped to a 32-bit compat execve number,
> or the opposite. This requires a kernel change, and seems to be
> rather intrusive.
>
This is a definite no.
I'm thinking the TS_EXECVE flag would be the least intrusive. Add a
comment that it is used by tracepoints to map between compat and
non-compat syscalls when execve switches the flag. This would not need
to touch any of the logic of the hotpaths within the system calls
themselves.
-- Steve
On 11/09/2015 08:05 AM, Steven Rostedt wrote:
> On Sun, 8 Nov 2015 19:37:37 +0000 (UTC)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> I have a few ideas on how to overcome this, and would like your
>> feedback on the matter:
>>
>> 1) One possible approach would be to reserve an extra status flag
>> in struct thread_info to get the TS_COMPAT status at syscall
>> entry. It would _not_ be updated when the executable is loaded,
>> so the state at return from execve would match the state when
>> entering execve. This is a simple approach, but requires kernel
>> changes.
>
> Or add a flag TS_EXECVE that can be set by the tracepoint syscall
> enter, and checked on exit. If set, we know that the exec happened.
>
>>
>> 2) Keep the compat state at system call entry in a data structure
>> (e.g. hash table) indexed by thread number within each tracer.
>> This could work around this issue within each tracer.
>
> This is of course what you can do now. As it doesn't touch the kernel.
>
>>
>> 3) Change the syscall number in the struct pt_regs whenever we
>> change the compat mode of a process. A 64-bit execve system
>> call number would be mapped to a 32-bit compat execve number,
>> or the opposite. This requires a kernel change, and seems to be
>> rather intrusive.
>>
>
> This is a definite no.
>
>
> I'm thinking the TS_EXECVE flag would be the least intrusive. Add a
> comment that it is used by tracepoints to map between compat and
> non-compat syscalls when execve switches the flag. This would not need
> to touch any of the logic of the hotpaths within the systemcalls
> themselves.
Let's make it really simple: add an 'unsigned int arch' to
syscall_return_slowpath. As of last week, Linus' tree sends all compat
returns, without exception (except brand new children, depending on your
point of view), through that path, and the caller always knows the
architecture.
But keep in mind that any games you play here are going to get
completely and utterly screwed up if anyone is playing with ptrace to
change syscall numbers. You're also going to have problems with syscall
restart, sigreturn, etc, so it would be nice to have an argument that
the putative solution solves the problem for real instead of just adding
complexity to paper it over.
Meanwhile, I'm trying to remove all of the magic from the handling of
execve, and I'm half-way there. Let's please not add more, especially
if that magic needs to touch asm code.
--Andy
On Mon, 9 Nov 2015 11:29:10 -0800
Andy Lutomirski <[email protected]> wrote:
> > I'm thinking the TS_EXECVE flag would be the least intrusive. Add a
> > comment that it is used by tracepoints to map between compat and
> > non-compat syscalls when execve switches the flag. This would not need
> > to touch any of the logic of the hotpaths within the systemcalls
> > themselves.
>
> Let's make it really simple: add an 'unsigned int arch' to
> syscall_return_slowpath. As of last week, Linus' tree sends all compat
> returns, without exception (except brand new children, depending on your
> point of view), through that path, and the caller always knows the
> architecture.
>
> But keep in mind that any games you play here are going to get
> completely and utterly screwed up if anyone is playing with ptrace to
> change syscall numbers. You'd also going to have problems with syscall
> restart, sigreturn, etc, so it would be nice to have an argument that
> the putative solution solves the problem for real instead of just adding
> complexity to paper it over.
>
> Meanwhile, I'm trying to remove all of the magic from the handling of
> execve, and I'm half-way there. Let's please not add more, especially
> if that magic needs to touch asm code.
The solution I suggested wouldn't touch any asm code. The only change
would be to reserve the TS_EXECVE flag. Actually, come to think of it,
we could have Mathieu's TS_ORIG_COMPAT flag, and still only have the
tracepoint syscall set it, such that the matching tracepoint syscall
exit would know that the initial call was COMPAT or not.
The goal is only to make sure that the system call exit tracepoint
matches the system call enter tracepoint.
The system call enter would set TS_ORIG_COMPAT if TS_COMPAT is set when
entering the system call and clear it otherwise, and the system call
exit would check that flag.
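
A minimal user-space sketch of that idea, assuming a status word like the
one in struct thread_info. TS_ORIG_COMPAT does not exist in the kernel,
the bit values are illustrative only, and the three helpers are
hypothetical stand-ins for the entry hook, for what execve does to the
task, and for the exit hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TS_COMPAT	0x0002u	/* illustrative bit value */
#define TS_ORIG_COMPAT	0x0100u	/* hypothetical flag, illustrative bit value */

struct thread_info_sketch {
	uint32_t status;
};

/* Entry hook: latch the compat state the syscall was entered with. */
static void on_syscall_enter(struct thread_info_sketch *ti)
{
	assert(ti != NULL);
	if (ti->status & TS_COMPAT)
		ti->status |= TS_ORIG_COMPAT;
	else
		ti->status &= ~TS_ORIG_COMPAT;
}

/* Stand-in for execve loading an executable of the given mode. */
static void execve_sets_mode(struct thread_info_sketch *ti, bool compat)
{
	assert(ti != NULL);
	if (compat)
		ti->status |= TS_COMPAT;
	else
		ti->status &= ~TS_COMPAT;
}

/* Exit hook: pick the table from the latched state, not from TS_COMPAT. */
static bool exit_uses_compat_table(const struct thread_info_sketch *ti)
{
	assert(ti != NULL);
	return (ti->status & TS_ORIG_COMPAT) != 0;
}
```

With this, a 64-bit task calling on_syscall_enter() and then having
execve_sets_mode(ti, true) applied still gets false from
exit_uses_compat_table(), so the exit event is decoded with the same
table as the entry event.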
-- Steve
On Mon, Nov 9, 2015 at 11:43 AM, Steven Rostedt <[email protected]> wrote:
> On Mon, 9 Nov 2015 11:29:10 -0800
> Andy Lutomirski <[email protected]> wrote:
>
>> > I'm thinking the TS_EXECVE flag would be the least intrusive. Add a
>> > comment that it is used by tracepoints to map between compat and
>> > non-compat syscalls when execve switches the flag. This would not need
>> > to touch any of the logic of the hotpaths within the systemcalls
>> > themselves.
>>
>> Let's make it really simple: add an 'unsigned int arch' to
>> syscall_return_slowpath. As of last week, Linus' tree sends all compat
>> returns, without exception (except brand new children, depending on your
>> point of view), through that path, and the caller always knows the
>> architecture.
>>
>> But keep in mind that any games you play here are going to get
>> completely and utterly screwed up if anyone is playing with ptrace to
>> change syscall numbers. You'd also going to have problems with syscall
>> restart, sigreturn, etc, so it would be nice to have an argument that
>> the putative solution solves the problem for real instead of just adding
>> complexity to paper it over.
>>
>> Meanwhile, I'm trying to remove all of the magic from the handling of
>> execve, and I'm half-way there. Let's please not add more, especially
>> if that magic needs to touch asm code.
>
> The solution I suggested wouldn't touch any asm code. The only change
> would be to reserve the TS_EXECVE flag. Actually, come to think of it,
> we could have Mathieu's TS_ORIG_COMPAT flag, and still only have the
> tracepoint syscall set it, such that the matching tracepoint syscall
> exit would know that the initial call was COMPAT or not.
Someone needs to clear TS_EXECVE, though.
>
> The goal is only to make sure that the system call exit tracepoint
> matches the system call enter tracepoint.
>
> The system call enter would set or clear the TS_ORIG_COMPAT if the
> TS_COMPAT is set when entering the system call, and it would check that
> flag when exiting the system call.
This seems a bit odd, though, since we aren't very good about
preserving the syscall nr or the args through syscall processing. In
any event, in the new improved x86 syscall code, we know what arch we
are just by following the control flow, so no flags should be needed.
Hence my suggestion of just adding an "unsigned int arch" to the
return slowpath.
--Andy
On Mon, 9 Nov 2015 12:57:06 -0800
Andy Lutomirski <[email protected]> wrote:
> > The solution I suggested wouldn't touch any asm code. The only change
> > would be to reserve the TS_EXECVE flag. Actually, come to think of it,
> > we could have Mathieu's TS_ORIG_COMPAT flag, and still only have the
> > tracepoint syscall set it, such that the matching tracepoint syscall
> > exit would know that the initial call was COMPAT or not.
>
> Someone needs to clear TS_EXECVE, though.
Well, it gets set and cleared by the syscall enter (same for
TS_ORIG_COMPAT), and exit for that matter.
It's trivial to have a tracepoint hook added when either system call
enter or exit tracepoints are enabled. Thus, the setting and clearing of
the flag can be done by another callback at those tracepoints.
>
> >
> > The goal is only to make sure that the system call exit tracepoint
> > matches the system call enter tracepoint.
> >
> > The system call enter would set or clear the TS_ORIG_COMPAT if the
> > TS_COMPAT is set when entering the system call, and it would check that
> > flag when exiting the system call.
>
> This seems a bit odd, though, since we aren't very good about
> preserving the syscall nr or the args through syscall processing. In
> any event, in the new improved x86 syscall code, we know what arch we
> are just by following the control flow, so no flags should be needed.
> Hence my suggestion of just adding an "unsigned int arch" to the
> return slowpath.
I guess I don't understand this "unsigned int arch".
When the execve system call is called, it's running in x86_64 mode, and
then the execve changes the state to ia32 mode. Then on return, the
system call exit tracepoint has the x86_64 system call number, but if
it checks to see what state the task is in, it will see ia32 state, and
then interpret the number as an ia32 one instead.
For example, in x86_64, execve is 59, and that number is passed to the
system call enter tracepoint. Now on return of the system call, the
system call exit tracepoint gets called with 59 as the system call as
well, but if that tracepoint checks the state, it will think it's
returning the "olduname" system call (that's 59 for ia32).
What change are you making to solve this?
-- Steve
----- On Nov 9, 2015, at 4:12 PM, rostedt [email protected] wrote:
> On Mon, 9 Nov 2015 12:57:06 -0800
> Andy Lutomirski <[email protected]> wrote:
>
>> > The solution I suggested wouldn't touch any asm code. The only change
>> > would be to reserve the TS_EXECVE flag. Actually, come to think of it,
>> > we could have Mathieu's TS_ORIG_COMPAT flag, and still only have the
>> > tracepoint syscall set it, such that the matching tracepoint syscall
>> > exit would know that the initial call was COMPAT or not.
>>
>> Someone needs to clear TS_EXECVE, though.
>
> Well, it gets set and cleared by the syscall enter (same for
> TS_ORIG_COMPAT), and exit for that matter.
>
> It's trivial to have a tracepoint hook added when either system call
> enter or exit tracepoints are enabled. Thus, the setting and clearing of
> the flag can be done by another callback at those tracepoints.
There is one issue with relying on the tracepoint hook on system call
enter to set the status flag (whichever of TS_EXECVE or TS_ORIG_COMPAT):
let's suppose a thread is preempted for a rather long time between
syscall enter and syscall exit, within an execve system call. At that
point, we enable syscall tracing. This means we may have missed setting
or clearing TS_ORIG_COMPAT, and we then hit the syscall exit tracepoint
with the flag uninitialized.
So if we go for this kind of flag solution, we have two choices:
1) We always set/clear the TS_ORIG_COMPAT flag on system call entry, not
just within a tracepoint which can be dynamically wired up at an arbitrary
point in time.
2) We set/clear the TS_ORIG_COMPAT flag within the syscall entry tracepoint,
but whenever we wire up that tracepoint, we iterate over all existing threads
to figure out if a thread is currently running or preempted within an
execve system call.
Option 2 seems rather more complicated, but has the upside of not setting
the flag when tracing is inactive. I'm really not sure that the tiny overhead
of setting a flag non-atomically is worth the trouble of doing option 2
though.
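
The window described above can be reproduced in a small user-space model.
This is only a sketch: struct task_sketch, tracing_enabled and the helper
names are hypothetical, and the scenario is fixed to a thread that enters
execve while tracing is off and is still inside it when tracing gets
enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_sketch {
	bool compat;		/* models TS_COMPAT */
	bool orig_compat;	/* models the hypothetical TS_ORIG_COMPAT */
};

static bool tracing_enabled;

/* Flag maintained only from the (dynamically attached) entry tracepoint. */
static void enter_tracepoint_only(struct task_sketch *t)
{
	if (tracing_enabled)
		t->orig_compat = t->compat;
}

/* Choice 1: flag maintained unconditionally on every syscall entry. */
static void enter_always(struct task_sketch *t)
{
	t->orig_compat = t->compat;
}

/*
 * Run the scenario with the given entry hook and report whether the flag
 * read at syscall exit matches the mode the syscall was entered in.
 * "stale" is whatever value the flag held before this syscall.
 */
static bool exit_flag_is_correct(void (*enter)(struct task_sketch *),
				 bool stale, bool entry_compat,
				 bool exit_compat)
{
	struct task_sketch t = { .compat = entry_compat, .orig_compat = stale };

	assert(enter != NULL);
	tracing_enabled = false;
	enter(&t);			/* thread enters execve, tracing off */
	t.compat = exit_compat;		/* execve flips the mode */
	tracing_enabled = true;		/* tracing enabled before the exit */
	return t.orig_compat == entry_compat;
}
```

With a stale flag value of true and a 64-bit to 32-bit execve,
exit_flag_is_correct(enter_always, true, false, true) holds, while the
same call with enter_tracepoint_only does not. Choice 2 would close the
gap by fixing up the flag for all threads when the tracepoint is wired up.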
>
>>
>> >
>> > The goal is only to make sure that the system call exit tracepoint
>> > matches the system call enter tracepoint.
>> >
>> > The system call enter would set or clear the TS_ORIG_COMPAT if the
>> > TS_COMPAT is set when entering the system call, and it would check that
>> > flag when exiting the system call.
>>
>> This seems a bit odd, though, since we aren't very good about
>> preserving the syscall nr or the args through syscall processing. In
>> any event, in the new improved x86 syscall code, we know what arch we
>> are just by following the control flow, so no flags should be needed.
>> Hence my suggestion of just adding an "unsigned int arch" to the
>> return slowpath.
>
> I guess I don't understand this "unsigned int arch".
>
> When the execve system call is called, it's running in x86_64 mode, and
> then the execve changes the state to ia32 bit mode. Then on return, the
> tracepoint system call exit, has the x86_64 system call number, but if
> it checks to see what state the task is in, it will see ia32 state, and
> then report the number for ia32 instead.
>
> For example, in x86_64, execve is 59, and that number is passed to the
> system call enter tracepoint. Now on return of the system call, the
> system call exit tracepoint gets called with 59 as the system call as
> well, but if that tracepoint checks the state, it will think its
> returning the "olduname" system call (that's 59 for ia32).
>
> What change are you making to solve this?
I share your concern that Andy's proposal does not appear to address the
issue at hand. But I may be missing something too. Our issue is not about
knowing the current architecture when returning from the execve system call;
we very well know that with is_compat_task(). The issue is the mismatch
between the system call number that led us there and the current arch
when returning from execve to userspace.
Thanks,
Mathieu
>
> -- Steve
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Mon, Nov 9, 2015 at 1:12 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, 9 Nov 2015 12:57:06 -0800
> Andy Lutomirski <[email protected]> wrote:
>
>> > The solution I suggested wouldn't touch any asm code. The only change
>> > would be to reserve the TS_EXECVE flag. Actually, come to think of it,
>> > we could have Mathieu's TS_ORIG_COMPAT flag, and still only have the
>> > tracepoint syscall set it, such that the matching tracepoint syscall
>> > exit would know that the initial call was COMPAT or not.
>>
>> Someone needs to clear TS_EXECVE, though.
>
> Well, it gets set and cleared by the syscall enter (same for
> TS_ORIG_COMPAT), and exit for that matter.
>
> It's trivial to have a tracepoint hook added when either system call
> enter or exit tracepoints are enabled. Thus, the setting and clearing of
> the flag can be done by another callback at those tracepoints.
>
>>
>> >
>> > The goal is only to make sure that the system call exit tracepoint
>> > matches the system call enter tracepoint.
>> >
>> > The system call enter would set or clear the TS_ORIG_COMPAT if the
>> > TS_COMPAT is set when entering the system call, and it would check that
>> > flag when exiting the system call.
>>
>> This seems a bit odd, though, since we aren't very good about
>> preserving the syscall nr or the args through syscall processing. In
>> any event, in the new improved x86 syscall code, we know what arch we
>> are just by following the control flow, so no flags should be needed.
>> Hence my suggestion of just adding an "unsigned int arch" to the
>> return slowpath.
>
> I guess I don't understand this "unsigned int arch".
>
> When the execve system call is called, it's running in x86_64 mode, and
> then the execve changes the state to ia32 bit mode. Then on return, the
> tracepoint system call exit, has the x86_64 system call number, but if
> it checks to see what state the task is in, it will see ia32 state, and
> then report the number for ia32 instead.
>
> For example, in x86_64, execve is 59, and that number is passed to the
> system call enter tracepoint. Now on return of the system call, the
> system call exit tracepoint gets called with 59 as the system call as
> well, but if that tracepoint checks the state, it will think its
> returning the "olduname" system call (that's 59 for ia32).
>
> What change are you making to solve this?
>
do_syscall_32_irqs_on would call syscall_return_slowpath(regs,
AUDIT_ARCH_I386). do_syscall_64 (which doesn't exist yet) would call
syscall_return_slowpath(regs, AUDIT_ARCH_X86_64).
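
Roughly, as a user-space sketch rather than real entry code: the *_sketch
functions and struct regs_sketch are hypothetical, the two constants
carry the same values as AUDIT_ARCH_I386 and AUDIT_ARCH_X86_64 from
<linux/audit.h> but are redefined locally so the example stands alone,
and storing the arch in the regs structure is only there to make the
result observable.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ARCH_I386	0x40000003u	/* value of AUDIT_ARCH_I386 */
#define SKETCH_ARCH_X86_64	0xC000003Eu	/* value of AUDIT_ARCH_X86_64 */

struct regs_sketch {
	long orig_nr;		/* syscall number the task entered with */
	unsigned int exit_arch;	/* filled in by the return path below */
};

static void syscall_return_slowpath_sketch(struct regs_sketch *regs,
					   unsigned int arch)
{
	assert(regs != NULL);
	/* An exit tracepoint placed here would get arch alongside regs. */
	regs->exit_arch = arch;
}

static void do_syscall_64_sketch(struct regs_sketch *regs)
{
	/* ... the syscall runs; execve may flip the task's compat state ... */
	syscall_return_slowpath_sketch(regs, SKETCH_ARCH_X86_64);
}

static void do_syscall_32_sketch(struct regs_sketch *regs)
{
	/* ... the syscall runs; execve may flip the task's compat state ... */
	syscall_return_slowpath_sketch(regs, SKETCH_ARCH_I386);
}

/* Decode the exit event from the arch of the entry path, not the task state. */
static const char *exit_event_name(const struct regs_sketch *regs)
{
	assert(regs != NULL);
	if (regs->exit_arch == SKETCH_ARCH_X86_64 && regs->orig_nr == 59)
		return "execve";
	if (regs->exit_arch == SKETCH_ARCH_I386 && regs->orig_nr == 11)
		return "execve";
	return "other";
}
```

The arch comes from which entry function is running, so it stays tied to
the numbering the syscall was entered with, whatever execve did to the
task in between.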
--Andy
On Mon, 9 Nov 2015 17:51:25 -0800
Andy Lutomirski <[email protected]> wrote:
> do_syscall_32_irqs_on would call syscall_return_slowpath(regs,
> AUDIT_ARCH_I386). do_syscall_64 (which doesn't exist yet) would call
> syscall_return_slowpath(regs, AUDIT_ARCH_X86_64).
>
OK, so you are saying that an execve that switches the current state
into ia32 will return from the do_syscall_64 regardless? Then we would
have to add tracepoints that would be for both ia32 and x86_64. But
that would solve the current issue at hand.
-- Steve
On Mon, Nov 9, 2015 at 6:31 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, 9 Nov 2015 17:51:25 -0800
> Andy Lutomirski <[email protected]> wrote:
>
>
>> do_syscall_32_irqs_on would call syscall_return_slowpath(regs,
>> AUDIT_ARCH_I386). do_syscall_64 (which doesn't exist yet) would call
>> syscall_return_slowpath(regs, AUDIT_ARCH_X86_64).
>>
>
> OK, so you are saying that a execve that switches the current state
> into ia32 will return from the do_syscall_64 regardless? Then we would
> have to add tracepoints that would be for both ia32 and x86_64. But
> that would solve the current issue at hand.
>
Indeed. Unlike fork/clone, execve is only magical insofar as it does
magical things to task_struct and it enters in the 64-bit native case
through a nasty asm path. The former has no effect on the entry code
(except most likely blocking opportunistic sysret because we're a bit
silly and it might break ABI to change that), and the latter barely
matters for this purpose. In any event, I'm planning on getting rid
of the asm stub for 4.5 if I can get the code written and tested in
time.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
----- On Nov 11, 2015, at 8:08 PM, Andy Lutomirski [email protected] wrote:
> On Mon, Nov 9, 2015 at 6:31 PM, Steven Rostedt <[email protected]> wrote:
>> On Mon, 9 Nov 2015 17:51:25 -0800
>> Andy Lutomirski <[email protected]> wrote:
>>
>>
>>> do_syscall_32_irqs_on would call syscall_return_slowpath(regs,
>>> AUDIT_ARCH_I386). do_syscall_64 (which doesn't exist yet) would call
>>> syscall_return_slowpath(regs, AUDIT_ARCH_X86_64).
>>>
>>
>> OK, so you are saying that a execve that switches the current state
>> into ia32 will return from the do_syscall_64 regardless? Then we would
>> have to add tracepoints that would be for both ia32 and x86_64. But
>> that would solve the current issue at hand.
>>
>
> Indeed. Unlike fork/clone, execve is only magical insofar as it does
> magical things to task_struct and it enters in the 64-bit native case
> through a nasty asm path. The former has no effect on the entry code
> (except most likely blocking opportunistic sysret because we're a bit
> silly and it might break ABI to change that), and the latter barely
> matters for this purpose. In any event, I'm planning on getting rid
> of the asm stub for 4.5 if I can get the code written and tested in
> time.
I guess there are no plans to do this kind of change to other
architectures in the near future ? If so, we might want to
investigate the thread status flag approach for other architectures,
and use the AUDIT_ARCH_* approach for x86.
Thoughts ?
Thanks,
Mathieu
>
> --Andy
>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com