LinuxLists.cc - [RFC] Full syscall argument decode in "perf trace"

2013-09-17 15:19:54

Subject: [RFC] Full syscall argument decode in "perf trace"

Hi,

I'm trying to figure out how to extend "perf trace".

Currently, it shows syscall names and arguments, and only them.
Meaning that syscalls such as open(2) are shown as:

open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3

The problem is, of course, that user wants to see the filename
per se, not the address of its first byte.

To improve that, we need to fetch the pointed-to data.
There are two approaches to this: extending
"raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
or selectively stopping the traced process when it reaches the thacepoint.

First solution is attractive performance-wise, but requires a lot
of new code: *ALL* syscalls will need to know which arguments are pointers,
how large their pointed-to data structures are, and (remember
readv and friends!) some of pointed-to structures themselves
contain pointers which reference even more data.

If we want to go this way, do we want to encode all this knowledge in kernel?
If yes, how? If no, in what form userspace (perf trace) would configure
the tracepoint wrt which syscalls' arguments to copy to trace buffer?

The second solution is to pause traced process, let "perf trace" to fetch
its data (e.g. via process_vm_readv(2)) and unpause it.

The dead-simple approach ("pause on every sys_{enter,exit}") would be
no faster than strace. To make any sense, as a minimum the pausing needs
to be conditional: there is no need to stop on syscalls which do not
have indirect data (e.g. close(2), dup2(2)...).

Optimizing further, we can choose a few typical syscalls such as [f]stat(2),
write(2), and apply solution #1 ("dump data to trace buffer and don't pause")
to them.
For example, fstat(fd, &statbuf) does not need to stop on sys_enter at all,
and needs to only copy the fixed number of bytes of statbuf to trace buffer
on exit to avoid the need to pause.

If we want to go this way, how do you guys think this should be implemented?

IIUC tracepoints weren't meant to be able to influence execution,
the "pause the current process when tracepoint
is triggered" is a new feature. Does it look acceptable?
How to go about implementing it? Something like an ad-hoc extension field in
struct perf_event_attr to enable it?
Specifically, a new field or flag can enable this:
perf_event_open -> perf_event_alloc(... overflow_handler_which_conditionally_stops_current ...)

The "pausing", what it should be, exactly? In the ancient times, strace
chose to simply use SIGSTOP for similar needs, and it ended up interfering
with tracing real SIGSTOPs. I guess we don't want to repeat that. Then,
how? More specifically: when "perf trace" will read trace buffer and see
"process FOO paused in sys_exit from readv", how it should kick
process FOO to unpause it?

**end of brain dump**

Comments? Suggestions?

2013-09-17 17:52:14

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
> I'm trying to figure out how to extend "perf trace".

> Currently, it shows syscall names and arguments, and only them.
> Meaning that syscalls such as open(2) are shown as:

> open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3

> The problem is, of course, that user wants to see the filename
> per se, not the address of its first byte.

> To improve that, we need to fetch the pointed-to data.
> There are two approaches to this: extending
> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
> or selectively stopping the traced process when it reaches the thacepoint.

We don't want to stop the process at all, this is one of the major
advantages of 'perf trace' over 'strace'.

Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
tracepoint to vfs_getname to use that.

> First solution is attractive performance-wise, but requires a lot
> of new code: *ALL* syscalls will need to know which arguments are pointers,
> how large their pointed-to data structures are, and (remember
> readv and friends!) some of pointed-to structures themselves
> contain pointers which reference even more data.

Well, we can look at DWARF to get the function signatures, types,
librarize 'perf probe' and insert probes in the syscalls we want
decoding.

That for the cases where we don't have a tracepoint or when adding a new
tracepoint is not an option.

And this all with what we have in the kernel right now.

Also for 'perf trace' look at my perf/core branch, where we have more
syscall arg beautifiers and the machinery that is getting in place to
allow that.

Longer term we could have something like dtrace's CTF to have a more
compact type only ELF section that always go with the kernel, like we
have CFI in binaries these days.

- Arnaldo

2013-09-17 19:06:20

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

2013-09-18 11:35:53

by Denys Vlasenko

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

On 09/17/2013 09:06 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
>> I'm trying to figure out how to extend "perf trace".
>
>> Currently, it shows syscall names and arguments, and only them.
>> Meaning that syscalls such as open(2) are shown as:
>
>> open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
>
>> The problem is, of course, that user wants to see the filename
>> per se, not the address of its first byte.
>
>> To improve that, we need to fetch the pointed-to data.
>> There are two approaches to this: extending
>> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
>> or selectively stopping the traced process when it reaches the thacepoint.
>
> We don't want to stop the process at all, this is one of the major
> advantages of 'perf trace' over 'strace'.

This is a worthy goal. strace is so slow exactly because it stops
traced process so often. strace developers do want to avoid
as many of these stops as possible.

I'm not sure that "not stopping ever" is achievable, though.
There are cases where stopping is necessary.

For example, after clone() call, depending on the tracer needs,
there may be operations which must be done on the new child
before it is allowed to run.

strace used to use hideous, unsafe workarounds to catch children,
until ptrace was augmented with features which made children stop
immediately.

Do you think you can work around that? I just don't see how.

> Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
> tracepoint to vfs_getname to use that.

I know that this is the way how to fetch syscall args without stopping,
yes.

The problem: ~100 more tracepoints need to be added merely to get
to the point where strace already is, wrt quality of syscall decoding.
strace has nearly 300 separate custom syscall formatting functions,
some of them quite complex.

If we need to add syscall stopping feature (which, as I said above,
will be necessary anyway IMO), then syscall decoding can be as good
as strace *already*. Then, gradually more tracepoints are added
to make it faster.

I am thinking about going into this direction.

Therefore my question should be restated as:

Would perf developers accept the "syscall pausing" feature,
or it won't be accepted?

--
vda

2013-09-18 12:47:04

by David Ahern

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

On 9/18/13 5:35 AM, Denys Vlasenko wrote:
> Therefore my question should be restated as:
>
> Would perf developers accept the "syscall pausing" feature,
> or it won't be accepted?

I have been using perf-trace a lot lately specifically because it is
effectively a 'passive' observer of the task (e.g., time-sensitive tasks
can be traced with perf but not with strace).

Also, your solution would not work if the raw_syscall events are written
to a file for later analysis where as using tracepoints to collect this
information would.

David

2013-09-18 13:35:38

by Ingo Molnar

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

* David Ahern <[email protected]> wrote:

> On 9/18/13 5:35 AM, Denys Vlasenko wrote:
> >Therefore my question should be restated as:
> >
> >Would perf developers accept the "syscall pausing" feature,
> >or it won't be accepted?
>
> I have been using perf-trace a lot lately specifically because it is
> effectively a 'passive' observer of the task (e.g., time-sensitive tasks
> can be traced with perf but not with strace).

Yes, this is not just an important but a primary design goal for all perf
utilities: if the tracing buffers used are large enough then it should be
a nearly zero-overhead observer that preserves all previous patterns of
behavior as much as possible.

Thanks,

Ingo

2013-09-18 14:34:03

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

Em Wed, Sep 18, 2013 at 01:35:13PM +0200, Denys Vlasenko escreveu:
> On 09/17/2013 09:06 PM, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Sep 17, 2013 at 05:10:55PM +0200, Denys Vlasenko escreveu:
> >> I'm trying to figure out how to extend "perf trace".
> >
> >> Currently, it shows syscall names and arguments, and only them.
> >> Meaning that syscalls such as open(2) are shown as:
> >
> >> open(filename: 140736118412184, flags: 0, mode: 140736118403776) = 3
> >
> >> The problem is, of course, that user wants to see the filename
> >> per se, not the address of its first byte.
> >
> >> To improve that, we need to fetch the pointed-to data.
> >> There are two approaches to this: extending
> >> "raw_syscalls:sys_{enter,exit}" tracepoint so that it returns this data,
> >> or selectively stopping the traced process when it reaches the thacepoint.
> >
> > We don't want to stop the process at all, this is one of the major
> > advantages of 'perf trace' over 'strace'.
>
> This is a worthy goal. strace is so slow exactly because it stops
> traced process so often. strace developers do want to avoid
> as many of these stops as possible.
>
> I'm not sure that "not stopping ever" is achievable, though.
> There are cases where stopping is necessary.

Can't we try first to achieve what is possible with existing
infrastructure so that we can, out of the combo 'perf trace' and
'strace' have something that is better than plain 'strace'?

> For example, after clone() call, depending on the tracer needs,
> there may be operations which must be done on the new child
> before it is allowed to run.
>
> strace used to use hideous, unsafe workarounds to catch children,
> until ptrace was augmented with features which made children stop
> immediately.
>
> Do you think you can work around that? I just don't see how.

I haven't even thought about it 8-)

> > Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
> > tracepoint to vfs_getname to use that.
>
> I know that this is the way how to fetch syscall args without stopping,
> yes.
>
> The problem: ~100 more tracepoints need to be added merely to get
> to the point where strace already is, wrt quality of syscall decoding.
> strace has nearly 300 separate custom syscall formatting functions,
> some of them quite complex.
>
> If we need to add syscall stopping feature (which, as I said above,
> will be necessary anyway IMO), then syscall decoding can be as good
> as strace *already*. Then, gradually more tracepoints are added
> to make it faster.
>
> I am thinking about going into this direction.
>
> Therefore my question should be restated as:
>
> Would perf developers accept the "syscall pausing" feature,
> or it won't be accepted?

Do you have some patch for us to try?

- Arnaldo

2013-09-26 07:41:50

by Denys Vlasenko

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

On Wed, Sep 18, 2013 at 4:33 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
>> > Look at the tmp.perf/trace2 branch in my git repo, tglx and Ingo added a
>> > tracepoint to vfs_getname to use that.
>>
>> I know that this is the way how to fetch syscall args without stopping,
>> yes.
>>
>> The problem: ~100 more tracepoints need to be added merely to get
>> to the point where strace already is, wrt quality of syscall decoding.
>> strace has nearly 300 separate custom syscall formatting functions,
>> some of them quite complex.
>>
>> If we need to add syscall stopping feature (which, as I said above,
>> will be necessary anyway IMO), then syscall decoding can be as good
>> as strace *already*. Then, gradually more tracepoints are added
>> to make it faster.
>>
>> I am thinking about going into this direction.
>>
>> Therefore my question should be restated as:
>>
>> Would perf developers accept the "syscall pausing" feature,
>> or it won't be accepted?
>
> Do you have some patch for us to try?

I have a patch which is a bit strace specific: it sidesteps
the question of the synchronization between traced process
and its tracer by using ptrace's existing method of reporting stops.

This works for strace, and is very easy to implement.
Naturally, other tracers (e.g. "perf trace" wouldn't
want to start using ptrace! Synchronization needs
to be done in some other way, not as a ptrace stop.

For one, the stopping flag needs to be a counter, so that
more than one tracer can use this feature concurrently.

But anyway, I am attaching it.

It adds a new flag, attr.sysexit_stop, which makes process stop
at next syscall exit when this tracepoint overflows.

--
vda

Attachments:

perf_trace_stop_RFC.diff (1.23 kB)

2013-09-30 11:34:21

by Denys Vlasenko

[permalink] [raw]

Subject: Re: [RFC] Full syscall argument decode in "perf trace"

On Thu, Sep 26, 2013 at 9:41 AM, Denys Vlasenko
<[email protected]> wrote:
> On Wed, Sep 18, 2013 at 4:33 PM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
>>> The problem: ~100 more tracepoints need to be added merely to get
>>> to the point where strace already is, wrt quality of syscall decoding.
>>> strace has nearly 300 separate custom syscall formatting functions,
>>> some of them quite complex.
>>>
>>> If we need to add syscall stopping feature (which, as I said above,
>>> will be necessary anyway IMO), then syscall decoding can be as good
>>> as strace *already*. Then, gradually more tracepoints are added
>>> to make it faster.
>>>
>>> I am thinking about going into this direction.
>>>
>>> Therefore my question should be restated as:
>>>
>>> Would perf developers accept the "syscall pausing" feature,
>>> or it won't be accepted?
>>
>> Do you have some patch for us to try?
>
> I have a patch which is a bit strace specific: it sidesteps
> the question of the synchronization between traced process
> and its tracer by using ptrace's existing method of reporting stops.
>
> This works for strace, and is very easy to implement.
> Naturally, other tracers (e.g. "perf trace" wouldn't
> want to start using ptrace! Synchronization needs
> to be done in some other way, not as a ptrace stop.
>
> For one, the stopping flag needs to be a counter, so that
> more than one tracer can use this feature concurrently.
>
> But anyway, I am attaching it.
>
> It adds a new flag, attr.sysexit_stop, which makes process stop
> at next syscall exit when this tracepoint overflows.

Here is the next iteration of the work in progress.

I added syscall masks.
This necessitated propagation of pointer to struct pt_regs
which points to userspace registers from sys_{enter,exit}
tracepoints to overflow handling functions, in order to get syscall#.
(Yes, I discovered that pt_regs which was already there wasn't
the *userspace* one).

The patch is tested: I have a modified version of strace
which decodes all syscalls properly and which avoids stopping
on all syscall entries and on a selected few syscall exits too.

As I see it, the next thing to tackle is the stopping method.
(The current patch still uses my old ptrace-specific hack).

How about the following: add a per-task "pause counter".
If it is <= 0, then task is not paused. If it is > 0, task is paused.

When an attached perf fd causes task to pause, the counter
is incremented, a marker is written into the perf buffer,
and task goes to sleep.

When tracer process sees the marker, it commands traced
process to "unpause", which decrements the counter.

Why this way?
* this allows traced process to be paused by several tracers
at once.
* this does not need heavy-weight notifications to be sent
to tracers (unlike my current hack, which invokes the
waitpid notification machinery, the source of much of strace's
slowness).
* it might work even if counter increment is reordered
relative to perf marker writing. if tracer sees the marker,
it can "unpause" - decrement counter and cause it to go -1.
The task is not paused (the rule is "<= 0", not "= 0").
Then kernel increments the counter, it's 0 now,
and task is still not paused. (I'm not sure whether
such property is useful, but if it is, we have it - good :)

The downside is, we'd need one new field in task struct.

Does this look sensible to you?

Attachments:

perf_trace_stop_RFC_v2.diff (25.26 kB)