2011-04-06 18:50:58

by Vaibhav Nagarnaik

[permalink] [raw]
Subject: [RFC] tracing: Adding cgroup aware tracing functionality

All

The cgroup functionality is being used widely in different scenarios. It is
also being integrated with other parts of the kernel to take advantage of its
features. One of the areas that is not yet aware of cgroup functionality is
the ftrace framework.

Although ftrace provides a way to filter based on the PIDs of the tasks to be
traced, it is restricted to specific tracers, like the function tracer. It also
becomes difficult to keep track of all PIDs in a dynamic environment where
processes are created and destroyed in a short amount of time.

An application that creates many processes/tasks is convenient to track and
control with cgroups, but it is difficult to track these processes for the
purposes of tracing. And if child processes are moved to another cgroup, it
makes sense to trace only the original cgroup.

This proposal is to create a file in the tracing directory called
set_trace_cgroup to which a user can write the path of an active cgroup, one
at a time. If no cgroups are specified, no filtering is done and all tasks are
traced. When a cgroup path is added in, it sets a boolean tracing_enabled for
the enabled cgroup in all the hierarchies, which enables tracing for all the
assigned tasks under the specified cgroup.

Though creating a new file in the tracing directory is not desirable, this
interface seems the most appropriate change required to implement the new
feature.

This tracing_enabled flag is also exported in the cgroupfs directory structure
which can be turned on/off for a specific hierarchy/cgroup combination. This
gives control to enable/disable tracing over a cgroup in a specific hierarchy
only.
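
For illustration, usage from a shell would look roughly like this. This is only
a sketch of the proposed interface: neither file exists today, and the debugfs
and cgroupfs mount points below are just assumptions.

# echo /apps/container_3 > /sys/kernel/debug/tracing/set_trace_cgroup
# echo 0 > /dev/cgroup/cpu/apps/container_3/tracing_enabled

The first command enables tracing for every task assigned to the cgroup, in all
hierarchies; the second one then turns it back off for the cpu hierarchy only,
using the exported per-cgroup flag.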

This gives more fine-grained control over the tasks being traced. I would like
to know your thoughts on this interface and the approach to make tracing
cgroup aware.


Thanks

Vaibhav Nagarnaik


2011-04-07 01:34:02

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Wed, Apr 06, 2011 at 11:50:21AM -0700, Vaibhav Nagarnaik wrote:
> All
> The cgroup functionality is being used widely in different scenarios. It also
> is being integrated with other parts of kernel to take advantage of its
> features. One of the areas that is not yet aware of cgroup functionality is
> the ftrace framework.
>
> Although ftrace provides a way to filter based on PIDs of tasks to be traced,
> it is restricted to specific tracers, like function tracer. Also it becomes
> difficult to keep track of all PIDs in a dynamic environment with processes
> being created and destroyed in a short amount of time.
>
> An application that creates many processes/tasks is convenient to track and
> control with cgroups, but it is difficult to track these processes for the
> purposes of tracing. And if child processes are moved to another cgroup, it
> makes sense to trace only the original cgroup.
>
> This proposal is to create a file in the tracing directory called
> set_trace_cgroup to which a user can write the path of an active cgroup, one
> at a time. If no cgroups are specified, no filtering is done and all tasks are
> traced. When a cgroup path is added in, it sets a boolean tracing_enabled for
> the enabled cgroup in all the hierarchies, which enables tracing for all the
> assigned tasks under the specified cgroup.
>
> Though creating a new file in the directory is not desirable, but this
> interface seems the most appropriate change required to implement the new
> feature.
>
> This tracing_enabled flag is also exported in the cgroupfs directory structure
> which can be turned on/off for a specific hierarchy/cgroup combination. This
> gives control to enable/disable tracing over a cgroup in a specific hierarchy
> only.
>
> This gives more fine-grained control over the tasks being traced. I would like
> to know your thoughts on this interface and the approach to make tracing
> cgroup aware.

So I have to ask, why can't you use perf events to do tracing limited to cgroups?
It has this cgroup context awareness.
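
For reference, a minimal sketch of that mode, assuming a kernel and perf tool
recent enough to carry the cgroup support (the -G/--cgroup option, which needs
system-wide mode and a mounted perf_event cgroup hierarchy; the mount point and
the cgroup name below are arbitrary):

        mkdir -p /dev/cgroup/perf_event
        mount -t cgroup -o perf_event none /dev/cgroup/perf_event
        mkdir /dev/cgroup/perf_event/mygroup
        echo $$ > /dev/cgroup/perf_event/mygroup/tasks
        perf stat -e cycles -a -G mygroup sleep 10

The echo moves the current shell (and its future children) into the cgroup,
and the counters then only run while tasks of "mygroup" are on the CPU.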

2011-04-07 03:18:11

by Vaibhav Nagarnaik

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Wed, Apr 6, 2011 at 6:33 PM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Apr 06, 2011 at 11:50:21AM -0700, Vaibhav Nagarnaik wrote:
>> All
>> The cgroup functionality is being used widely in different scenarios. It also
>> is being integrated with other parts of kernel to take advantage of its
>> features. One of the areas that is not yet aware of cgroup functionality is
>> the ftrace framework.
>>
>> Although ftrace provides a way to filter based on PIDs of tasks to be traced,
>> it is restricted to specific tracers, like function tracer. Also it becomes
>> difficult to keep track of all PIDs in a dynamic environment with processes
>> being created and destroyed in a short amount of time.
>>
>> An application that creates many processes/tasks is convenient to track and
>> control with cgroups, but it is difficult to track these processes for the
>> purposes of tracing. And if child processes are moved to another cgroup, it
>> makes sense to trace only the original cgroup.
>>
>> This proposal is to create a file in the tracing directory called
>> set_trace_cgroup to which a user can write the path of an active cgroup, one
>> at a time. If no cgroups are specified, no filtering is done and all tasks are
>> traced. When a cgroup path is added in, it sets a boolean tracing_enabled for
>> the enabled cgroup in all the hierarchies, which enables tracing for all the
>> assigned tasks under the specified cgroup.
>>
>> Though creating a new file in the directory is not desirable, but this
>> interface seems the most appropriate change required to implement the new
>> feature.
>>
>> This tracing_enabled flag is also exported in the cgroupfs directory structure
>> which can be turned on/off for a specific hierarchy/cgroup combination. This
>> gives control to enable/disable tracing over a cgroup in a specific hierarchy
>> only.
>>
>> This gives more fine-grained control over the tasks being traced. I would like
>> to know your thoughts on this interface and the approach to make tracing
>> cgroup aware.
>
> So I have to ask, why can't you use perf events to do tracing limited on cgroups?
> It has this cgroup context awareness.
>

The perf event cgroup awareness comes from creating a different hierarchy for
perf events. When the events and the current task's cgroup match, the events
are logged. So the changes are pretty specific to the perf events.

Even in the case where changes are made to handle trace events, the interface
files are still needed. The interface used to specify perf events uses the
perf_event syscall which isn't available to specify trace events.

This is based on my limited understanding of the perf_events cgroup awareness
patch. Please correct me if I am missing anything.

Thanks

Vaibhav Nagarnaik

2011-04-07 12:06:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Wed, Apr 06, 2011 at 08:17:33PM -0700, Vaibhav Nagarnaik wrote:
> On Wed, Apr 6, 2011 at 6:33 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Wed, Apr 06, 2011 at 11:50:21AM -0700, Vaibhav Nagarnaik wrote:
> >> All
> >> The cgroup functionality is being used widely in different scenarios. It also
> >> is being integrated with other parts of kernel to take advantage of its
> >> features. One of the areas that is not yet aware of cgroup functionality is
> >> the ftrace framework.
> >>
> >> Although ftrace provides a way to filter based on PIDs of tasks to be traced,
> >> it is restricted to specific tracers, like function tracer. Also it becomes
> >> difficult to keep track of all PIDs in a dynamic environment with processes
> >> being created and destroyed in a short amount of time.
> >>
> >> An application that creates many processes/tasks is convenient to track and
> >> control with cgroups, but it is difficult to track these processes for the
> >> purposes of tracing. And if child processes are moved to another cgroup, it
> >> makes sense to trace only the original cgroup.
> >>
> >> This proposal is to create a file in the tracing directory called
> >> set_trace_cgroup to which a user can write the path of an active cgroup, one
> >> at a time. If no cgroups are specified, no filtering is done and all tasks are
> >> traced. When a cgroup path is added in, it sets a boolean tracing_enabled for
> >> the enabled cgroup in all the hierarchies, which enables tracing for all the
> >> assigned tasks under the specified cgroup.
> >>
> >> Though creating a new file in the directory is not desirable, but this
> >> interface seems the most appropriate change required to implement the new
> >> feature.
> >>
> >> This tracing_enabled flag is also exported in the cgroupfs directory structure
> >> which can be turned on/off for a specific hierarchy/cgroup combination. This
> >> gives control to enable/disable tracing over a cgroup in a specific hierarchy
> >> only.
> >>
> >> This gives more fine-grained control over the tasks being traced. I would like
> >> to know your thoughts on this interface and the approach to make tracing
> >> cgroup aware.
> >
> > So I have to ask, why can't you use perf events to do tracing limited on cgroups?
> > It has this cgroup context awareness.
> >
>
> The perf event cgroup awareness comes from creating a different hierarchy for
> perf events. When the events and the current task's cgroup match, the events
> are logged. So the changes are pretty specific to the perf events.
>
> Even in the case where changes are made to handle trace events, the interface
> files are still needed. The interface used to specify perf events uses the
> perf_event syscall which isn't available to specify trace events.
>
> This is based on my limited understanding of the perf_events cgroup awareness
> patch. Please correct me if I am missing anything.


Ah but perf events can do much more than counting and sampling
hardware events. Trace events can be used as perf events too.

List the events:

        perf list -e tracepoints

List of pre-defined events (to be used in -e):

  skb:kfree_skb                              [Tracepoint event]
  skb:consume_skb                            [Tracepoint event]
  skb:skb_copy_datagram_iovec                [Tracepoint event]
  net:net_dev_xmit                           [Tracepoint event]
  net:net_dev_queue                          [Tracepoint event]
  net:netif_receive_skb                      [Tracepoint event]
  net:netif_rx                               [Tracepoint event]
  napi:napi_poll                             [Tracepoint event]
  scsi:scsi_dispatch_cmd_start               [Tracepoint event]
  scsi:scsi_dispatch_cmd_error               [Tracepoint event]
  scsi:scsi_dispatch_cmd_done                [Tracepoint event]
  scsi:scsi_dispatch_cmd_timeout             [Tracepoint event]
  scsi:scsi_eh_wakeup                        [Tracepoint event]
  drm:drm_vblank_event                       [Tracepoint event]
  drm:drm_vblank_event_queued                [Tracepoint event]
  drm:drm_vblank_event_delivered             [Tracepoint event]
  block:block_rq_abort                       [Tracepoint event]
  block:block_rq_requeue                     [Tracepoint event]
  block:block_rq_complete                    [Tracepoint event]
  block:block_rq_insert                      [Tracepoint event]
  etc...


Trace sched switch events:

        perf record -e sched:sched_switch -a
        ^C


Print them:

        perf script

         swapper     0 [000]  1132.964598: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm
     kworker/0:1  4358 [000]  1132.964641: sched_switch: prev_comm=kworker/0:1 prev_pid=4358 prev_prio=120 prev_state=S ==> ne
         syslogd  2703 [000]  1132.964720: sched_switch: prev_comm=syslogd prev_pid=2703 prev_prio=120 prev_state=D ==> next_c
         swapper     0 [000]  1132.965100: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm
            perf  4725 [001]  1132.965178: sched_switch: prev_comm=perf prev_pid=4725 prev_prio=120 prev_state=D ==> next_comm
         swapper     0 [001]  1132.965227: sched_switch: prev_comm=kworker/0:0 prev_pid=0 prev_prio=120 prev_state=R ==> next_
            perf  4725 [001]  1132.965246: sched_switch: prev_comm=perf prev_pid=4725 prev_prio=120 prev_state=D ==> next_comm
        etc...
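
And, assuming the perf_event cgroup hierarchy is mounted and a cgroup
"mygroup" has been created and populated (the name is just a placeholder),
the same recording can be limited to that cgroup:

        perf record -e sched:sched_switch -a -G mygroup
        ^C
        perf script

Only the sched_switch events that occurred while a task of "mygroup" was
running end up in perf.data.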

2011-04-07 20:22:54

by David Sharp

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Thu, Apr 7, 2011 at 5:06 AM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Apr 06, 2011 at 08:17:33PM -0700, Vaibhav Nagarnaik wrote:
>> On Wed, Apr 6, 2011 at 6:33 PM, Frederic Weisbecker <[email protected]> wrote:
>> > On Wed, Apr 06, 2011 at 11:50:21AM -0700, Vaibhav Nagarnaik wrote:
>> >> All
>> >> The cgroup functionality is being used widely in different scenarios. It also
>> >> is being integrated with other parts of kernel to take advantage of its
>> >> features. One of the areas that is not yet aware of cgroup functionality is
>> >> the ftrace framework.
>> >>
>> >> Although ftrace provides a way to filter based on PIDs of tasks to be traced,
>> >> it is restricted to specific tracers, like function tracer. Also it becomes
>> >> difficult to keep track of all PIDs in a dynamic environment with processes
>> >> being created and destroyed in a short amount of time.
>> >>
>> >> An application that creates many processes/tasks is convenient to track and
>> >> control with cgroups, but it is difficult to track these processes for the
>> >> purposes of tracing. And if child processes are moved to another cgroup, it
>> >> makes sense to trace only the original cgroup.
>> >>
>> >> This proposal is to create a file in the tracing directory called
>> >> set_trace_cgroup to which a user can write the path of an active cgroup, one
>> >> at a time. If no cgroups are specified, no filtering is done and all tasks are
>> >> traced. When a cgroup path is added in, it sets a boolean tracing_enabled for
>> >> the enabled cgroup in all the hierarchies, which enables tracing for all the
>> >> assigned tasks under the specified cgroup.
>> >>
>> >> Though creating a new file in the directory is not desirable, but this
>> >> interface seems the most appropriate change required to implement the new
>> >> feature.
>> >>
>> >> This tracing_enabled flag is also exported in the cgroupfs directory structure
>> >> which can be turned on/off for a specific hierarchy/cgroup combination. This
>> >> gives control to enable/disable tracing over a cgroup in a specific hierarchy
>> >> only.
>> >>
>> >> This gives more fine-grained control over the tasks being traced. I would like
>> >> to know your thoughts on this interface and the approach to make tracing
>> >> cgroup aware.
>> >
>> > So I have to ask, why can't you use perf events to do tracing limited on cgroups?
>> > It has this cgroup context awareness.

Perf doesn't have the same latency characteristics as ftrace. It costs
a full microsecond for every trace event.

https://lkml.org/lkml/2010/10/28/261

It's possible these results need to be updated. Has any effort been
made to improve the tracing latency of perf?

>> The perf event cgroup awareness comes from creating a different hierarchy for
>> perf events. When the events and the current task's cgroup match, the events
>> are logged. So the changes are pretty specific to the perf events.
>>
>> Even in the case where changes are made to handle trace events, the interface
>> files are still needed. The interface used to specify perf events uses the
>> perf_event syscall which isn't available to specify trace events.
>>
>> This is based on my limited understanding of the perf_events cgroup awareness
>> patch. Please correct me if I am missing anything.
>
>
> Ah but perf events can do much more than counting and sampling
> hardware events. Trace events can be used as perf events too.
>
> List the events:
>
>        perf list -e tracepoints
>
> List of pre-defined events (to be used in -e):
>
>  skb:kfree_skb                              [Tracepoint event]
>  skb:consume_skb                            [Tracepoint event]
>  skb:skb_copy_datagram_iovec                [Tracepoint event]
>  net:net_dev_xmit                           [Tracepoint event]
>  net:net_dev_queue                          [Tracepoint event]
>  net:netif_receive_skb                      [Tracepoint event]
>  net:netif_rx                               [Tracepoint event]
>  napi:napi_poll                             [Tracepoint event]
>  scsi:scsi_dispatch_cmd_start               [Tracepoint event]
>  scsi:scsi_dispatch_cmd_error               [Tracepoint event]
>  scsi:scsi_dispatch_cmd_done                [Tracepoint event]
>  scsi:scsi_dispatch_cmd_timeout             [Tracepoint event]
>  scsi:scsi_eh_wakeup                        [Tracepoint event]
>  drm:drm_vblank_event                       [Tracepoint event]
>  drm:drm_vblank_event_queued                [Tracepoint event]
>  drm:drm_vblank_event_delivered             [Tracepoint event]
>  block:block_rq_abort                       [Tracepoint event]
>  block:block_rq_requeue                     [Tracepoint event]
>  block:block_rq_complete                    [Tracepoint event]
>  block:block_rq_insert                      [Tracepoint event]
>  etc...
>
>
> Trace sched switch events:
>
>        perf record -e sched:sched_switch -a
>        ^C
>
>
> Print them:
>
>        perf script
>
>         swapper     0 [000]  1132.964598: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm
>     kworker/0:1  4358 [000]  1132.964641: sched_switch: prev_comm=kworker/0:1 prev_pid=4358 prev_prio=120 prev_state=S ==> ne
>         syslogd  2703 [000]  1132.964720: sched_switch: prev_comm=syslogd prev_pid=2703 prev_prio=120 prev_state=D ==> next_c
>         swapper     0 [000]  1132.965100: sched_switch: prev_comm=swapper prev_pid=0 prev_prio=120 prev_state=R ==> next_comm
>            perf  4725 [001]  1132.965178: sched_switch: prev_comm=perf prev_pid=4725 prev_prio=120 prev_state=D ==> next_comm
>         swapper     0 [001]  1132.965227: sched_switch: prev_comm=kworker/0:0 prev_pid=0 prev_prio=120 prev_state=R ==> next_
>            perf  4725 [001]  1132.965246: sched_switch: prev_comm=perf prev_pid=4725 prev_prio=120 prev_state=D ==> next_comm
>        etc...
>
>

2011-04-07 21:32:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Thu, Apr 07, 2011 at 01:22:30PM -0700, David Sharp wrote:
> On Thu, Apr 7, 2011 at 5:06 AM, Frederic Weisbecker <[email protected]> wrote:
> Perf doesn't have the same latency characteristics as ftrace. It costs
> a full microsecond for every trace event.
>
> https://lkml.org/lkml/2010/10/28/261
>
> It's possible these results need to be updated. Has any effort been
> made to improve the tracing latency of perf?

Nothing significant since then, I believe. But the hotspots are known
and some are relatively low-hanging fruit if you want to get closer to
ftrace throughput:

* When an event triggers, we do a double copy: a first one into a temporary
buffer and a second one from the temporary buffer to the event's own buffer.
This is because we don't have the same discard feature as the ftrace
buffer. We need to first filter on the temporary buffer and give up if the filter
matched, instead of copying to the main buffer.

As a short term solution: have a fast path tracing for the case where we
don't have a filter: directly copy to the main buffer.

In the longer term I think we want to filter on the tracepoint parameters
rather than on the resulting trace entry.

* We save more things in perf, because we have the perf headers. So we
save the pid twice: once in the trace event headers and a second time in the
perf headers. We need to drop the one from the trace event.
Also, in the case of pure tracing, we don't need to save the ip in the perf
headers.

* We have lots of conditionals in the fast path, due to some exclusion options,
overflow count tracking, etc... We probably want a fast-path tracing function
for the high-volume tracing case, something that goes straight to saving into
the buffer.

And there are things common to ftrace and perf that we probably want to have:
like tracking pids using the sched switch event if one is running, instead
of saving the pid in each trace. And getting rid of the preempt_count in the
trace event headers, or at least having the possibility to choose whether we
want it.


Any help in any of these tasks would be very welcome.

2011-04-07 22:42:32

by David Sharp

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Thu, Apr 7, 2011 at 2:32 PM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Apr 07, 2011 at 01:22:30PM -0700, David Sharp wrote:
>> On Thu, Apr 7, 2011 at 5:06 AM, Frederic Weisbecker <[email protected]> wrote:
>> Perf doesn't have the same latency characteristics as ftrace. It costs
>> a full microsecond for every trace event.
>>
>> https://lkml.org/lkml/2010/10/28/261
>>
>> It's possible these results need to be updated. Has any effort been
>> made to improve the tracing latency of perf?
>
> Nothing significant since then, I believe. But the hotspots are known
> and some are relatively low hanging fruits if you want to get closer to
> ftrace throughput:
>
> * When an event triggers, we do a double copy. A first one in a temporary
> buffer and a second one from the temporary buffer to the event'ss one.
> This is because we don't have the same discard feature than in ftrace
> buffer. We need to first filter on the temporary buffer and give up if the filter
> matched instead of copying to the main buffer.
>
> As a short term solution: have a fast path tracing for the case where we
> don't have a filter: directly copy to the main buffer.
>
> In the longer term I think we want to filter on tracepoint parameters
> rather than in the ending trace.
>
> * We save more things in perf, because we have the perf headers. So we
> save the pid twice: once in trace event headers, second in perf headers.
> We need to drop the one from the trace event.
> Also in the case of pure tracing, we don't need to save the ip in the perf
> headers.
>
> * We have lots of conditionals in the fast path, due to some exclusion options,
> overflow count tracking, etc... We probably want a fastpath tracing function
> for the high volume tracing case, something that goes quickly to the buffer
> saving.
>
> And there are things common to ftrace and perf that we probably want to have:
> like tracking of pids using sched switch event if one is running, instead
> of saving the pid on each traces. And get rid of the preempt_count in the
> trace event headers, at least have the possibility to choose whether we want
> it.
>
>
> Any help in any of these tasks would be very welcome.
>

This is all very interesting, but doesn't really help us. I'd prefer
to focus on the proposal itself rather than discuss the merits of perf and
ftrace. We're using ftrace for the foreseeable future, and afaik, it's
still a maintained part of the kernel. If perf improves its
performance for tracing, then we can consider switching to it. We
could invest time improving perf, and that might be worthwhile, but
ftrace is here now.

So with that in mind, are there any suggestions regarding cgroup
functionality in ftrace?

2011-04-08 00:28:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Thu, Apr 07, 2011 at 03:42:08PM -0700, David Sharp wrote:
> On Thu, Apr 7, 2011 at 2:32 PM, Frederic Weisbecker <[email protected]> wrote:
> > Nothing significant since then, I believe. But the hotspots are known
> > and some are relatively low hanging fruits if you want to get closer to
> > ftrace throughput:
> >
> > * When an event triggers, we do a double copy. A first one in a temporary
> > buffer and a second one from the temporary buffer to the event'ss one.
> > This is because we don't have the same discard feature than in ftrace
> > buffer. We need to first filter on the temporary buffer and give up if the filter
> > matched instead of copying to the main buffer.
> >
> > As a short term solution: have a fast path tracing for the case where we
> > don't have a filter: directly copy to the main buffer.
> >
> > In the longer term I think we want to filter on tracepoint parameters
> > rather than in the ending trace.
> >
> > * We save more things in perf, because we have the perf headers. So we
> > save the pid twice: once in trace event headers, second in perf headers.
> > We need to drop the one from the trace event.
> > Also in the case of pure tracing, we don't need to save the ip in the perf
> > headers.
> >
> > * We have lots of conditionals in the fast path, due to some exclusion options,
> > overflow count tracking, etc... We probably want a fastpath tracing function
> > for the high volume tracing case, something that goes quickly to the buffer
> > saving.
> >
> > And there are things common to ftrace and perf that we probably want to have:
> > like tracking of pids using sched switch event if one is running, instead
> > of saving the pid on each traces. And get rid of the preempt_count in the
> > trace event headers, at least have the possibility to choose whether we want
> > it.
> >
> >
> > Any help in any of these tasks would be very welcome.
> >
>
> This is all very interesting, but doesn't really help us. I'd prefer
> to focus on the proposal itself than discuss the merits of perf and
> ftrace. We're using ftrace for the foreseeable future, and afaik, it's
> still a maintained part of the kernel. If perf improves its
> performance for tracing, then we can consider switching to it. We
> could invest time improving perf, and that might be worthwhile, but
> ftrace is here now.

You are investing upstream for your tracing needs. And that's really
a nice step that I appreciate, as IIRC, Google had its own internal tracing
(ktrace?) before. Nonetheless you can't be such a significant
user/developer of the upstream kernel tracing and at the same time ignore some
key problems of the actual big picture of it.

You need to be aware that we are not going anywhere if we duplicate
every feature between perf and ftrace. We want to merge the common
pieces, keep the best of them and not expand the two-tier tracing of today.

I wish people would stop thinking about perf and ftrace as
competitors. Developers could probably start having a saner view
once both have comparable performance, and then we can start
thinking about a common backend (a buffer abstraction, whose development
can be iterated incrementally, usable with a syscall) and eliminate the
overlapping pieces.

I'm not asking you to unify the kernel tracing all alone. But you need to
start to enlarge your view.

I tend to think perf is more suitable for finegrained context definition
in general.

2011-04-08 07:38:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, 2011-04-08 at 02:28 +0200, Frederic Weisbecker wrote:

> > This is all very interesting, but doesn't really help us. I'd prefer
> > to focus on the proposal itself than discuss the merits of perf and
> > ftrace. We're using ftrace for the foreseeable future, and afaik, it's
> > still a maintained part of the kernel. If perf improves its
> > performance for tracing, then we can consider switching to it. We
> > could invest time improving perf, and that might be worthwhile, but
> > ftrace is here now.
>
> You are investing upstream for your tracing needs. And that's really
> a nice step that I appreciate, as IIRC, Google had its own internal tracing
> (ktrace?) before. Nonetheless you can't be such a significant
> user/developer of the upstream kernel tracing and at the same time ignore some
> key problems of the actual big picture of it.
>
> You need to be aware that we are not going anywhere if we duplicate
> every features between perf and ftrace. We want to merge the common
> pieces, keep the best of them and not expand the two tier tracing of today.


I agree that it would be great if we can start to merge the two. But
until we have a road map that we can all agree upon, I don't see that
happening in the near future. But I may be wrong ;)

>
> I wish people stop thinking about perf and ftrace as
> competitors.

I don't think this is about perf and ftrace as competitors, but they are
currently two different infrastructures that exist in the kernel.
They are currently optimized for different purposes: ftrace is optimized
for system tracing (persistent buffers and such) whereas perf is
optimized for user tracing. Both can do both, but each is less
efficient at the other's specialty.

As you said perf has a lot of overhead due to data that it saves per
event. How easy is it to modify that without breaking the ABI?

> Probably developers could start having a sane view
> once both will have close performances and then we can start
> thinking about a common backend (a buffer abstraction, which development
> can be iterated incrementally, usable with a syscall) and eliminate the
> overlapping pieces.

I wouldn't say eliminate, but at least merge the overlapping pieces. I'm
still totally against stripping out the debugfs code, and as tools have
been made to depend on it, I'm not sure we can rip it out. But I do not
see any harm in supporting a debugfs interface alongside a syscall
interface. I'm willing to do the legwork here to keep it.

>
> I'm not asking you to unify the kernel tracing all alone. But you need to
> start to enlarge your view.

You might want to be a bit more specific by what you mean here.

>
> I tend to think perf is more suitable for finegrained context definition
> in general.

I actually agree, as perf is more focused on per process (or group) than
ftrace. But that said, I guess the issue is also, if they have a simple
solution that is not invasive and suits their needs, what's the harm in
accepting it?

-- Steve

2011-04-08 13:45:52

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, Apr 08, 2011 at 03:37:48AM -0400, Steven Rostedt wrote:
> On Fri, 2011-04-08 at 02:28 +0200, Frederic Weisbecker wrote:
>
> > > This is all very interesting, but doesn't really help us. I'd prefer
> > > to focus on the proposal itself than discuss the merits of perf and
> > > ftrace. We're using ftrace for the foreseeable future, and afaik, it's
> > > still a maintained part of the kernel. If perf improves its
> > > performance for tracing, then we can consider switching to it. We
> > > could invest time improving perf, and that might be worthwhile, but
> > > ftrace is here now.
> >
> > You are investing upstream for your tracing needs. And that's really
> > a nice step that I appreciate, as IIRC, Google had its own internal tracing
> > (ktrace?) before. Nonetheless you can't be such a significant
> > user/developer of the upstream kernel tracing and at the same time ignore some
> > key problems of the actual big picture of it.
> >
> > You need to be aware that we are not going anywhere if we duplicate
> > every features between perf and ftrace. We want to merge the common
> > pieces, keep the best of them and not expand the two tier tracing of today.
>
>
> I agree that it would be great if we can start to merge the two. But
> until we have a road map that we can all agree upon, I don't see that
> happening in the near future. But I may be wrong ;)

Nah, I don't think it's necessary to have a roadmap, just a kind
of general direction. Other than that everything can be done piecewise
without thinking too far ahead for every single patch.

If we were to create a buffer abstraction, something we can create with an
fd, we can have a shared buffer implementation. And this buffer may be
able to accept different modes in the future.

Then one could attach the fd to a perf event, which would override the
default perf event buffer settings.

And ftrace can use that same buffer internally.

I bet this idea is not controversial. What has yet to be solved is
the debate on writers that can run in overwriting mode at the same
time as we have readers. Which comes along with debates on using subbuffers, etc..

But I guess we can solve that along the way?

There are many other things we want to do to unify even more: have the
function tracers usable as trace events, same for trace_printk, etc...
Those parts are pretty uncontroversial.

I don't consider the tracing merge as a one block thing, it's actually
many standalone pieces that require incremental changes.

> >
> > I wish people stop thinking about perf and ftrace as
> > competitors.
>
> I don't think this is about perf and ftrace as competitors, but they are
> currently two different infrastructures that are existing in the kernel.
> They are currently optimized for different purposes. ftrace is optimized
> for system tracing (persistent buffers and such) where as perf is
> optimized for user tracing. But the two can do both but the other
> feature is not as efficient as the other tool.

But that's an accidental two-tier optimization. At least, yeah, the ftrace
goal is per-cpu tracing, thus it is optimized this way.
But perf should work well in both cases.

> As you said perf has a lot of overhead due to data that it saves per
> event. How easy is it to modify that without breaking the ABI?

It doesn't need to break the ABI. We can add a field in the perf event
attr to drop the ftrace headers. We can even remove the support for these
headers in the pretty long term.

> > Probably developers could start having a sane view
> > once both will have close performances and then we can start
> > thinking about a common backend (a buffer abstraction, which development
> > can be iterated incrementally, usable with a syscall) and eliminate the
> > overlapping pieces.
>
> I wouldn't say eliminate, but at least merge the overlapping pieces. I'm
> still totally against stripping out the debugfs code, and as tools have
> been made to depend on it, I'm not sure we can rip it out. But I do not
> see any harm in supporting both a debugfs feature along with a syscall
> interface. I'm willing to do the leg work here to keep it.

I have no strong problem with that either. We can keep the debugfs interface
or part of it, while merging the overlapping pieces.

I think it's actually not even in question currently. We are far from a state
where we can remove the debugfs interface.

> >
> > I'm not asking you to unify the kernel tracing all alone. But you need to
> > start to enlarge your view.
>
> You might want to be a bit more specific by what you mean here.

I was just grumpy :)

But instead of moaning against others I guess I should rather try to
start to work on it.

> >
> > I tend to think perf is more suitable for finegrained context definition
> > in general.
>
> I actually agree, as perf is more focused on per process (or group) than
> ftrace. But that said, I guess the issue is also, if they have a simple
> solution that is not invasive and suits their needs, what's the harm in
> accepting it?

To go into more detail, perf and ftrace have different ways of dealing
with tracing contexts.

ftrace would have a check at every exclusion point in the fast path
(pid, cgroups, etc...) while perf actually schedules events on
top of these criteria, so that there should be only one check to know
if we are running in the fast path.

In practice we have many more checks in the fast path, but that again
waits for more optimizations.

So that's the reason why I think perf is more suitable when it's about
dealing with contexts. Adding a cgroup check in the ftrace fastpath
is automatically going to be invasive, this is one more check in any
trace event fast path. As you said ftrace is more optimized for global
tracing, which makes it the wrong place for that IMHO.

I won't oppose, and maybe they even have a non-invasive solution to
propose that I haven't thought about. Until then I think they are
investing in the wrong place.

2011-04-08 14:08:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, 2011-04-08 at 15:45 +0200, Frederic Weisbecker wrote:
>
> I bet this idea is not controversial. What has yet to be solved is
> the debate on the writers that can run in overwriting mode at the same
> time we have readers. Which comes along debates on using subbuffers,
> etc..
>
> But I guess we can solve that along the way?

Dunno, but the much larger point is how ftrace is going to want to do
multiple sessions and keep session context etc (what events to where).
I've only heard some vague hand waving there but never heard anything
concrete.

2011-04-08 14:08:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, 2011-04-08 at 03:37 -0400, Steven Rostedt wrote:
> I don't think this is about perf and ftrace as competitors, but they are
> currently two different infrastructures that are existing in the kernel.
> They are currently optimized for different purposes. ftrace is optimized
> for system tracing (persistent buffers and such) where as perf is
> optimized for user tracing.

That's complete nonsense, perf isn't built for tracing at all, it's just
that tracing is a special case of the larger problem set of
sampling/profiling it is built for.

Nor is perf in any way, shape or form better suited for user than for
kernel space, it really doesn't care; if we were only interested in user
crap we'd never have done NMI sampling, for instance.

> But the two can do both but the other
> feature is not as efficient as the other tool.

Well, neither can do user-space tracing at all, simply because we don't
have hooks into userspace, although the uprobes stuff looks to cure some
of that.

> As you said perf has a lot of overhead due to data that it saves per
> event.

Someday you should actually read the perf code before you say something.

2011-04-08 17:02:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, 2011-04-08 at 16:07 +0200, Peter Zijlstra wrote:

> > As you said perf has a lot of overhead due to data that it saves per
> > event.
>
> Someday you should actually read the perf code before you say something.

I have looked at the code, although not as much recently, but I do plan
on looking at it again in much more detail. But you are correct that I
did not base this comment on the code itself, but on looking at the
data:

I ran:

# perf record -a -e sched:sched_switch sleep 10
# mv perf.data perf.data.10
# perf record -a -e sched:sched_switch sleep 20
# mv perf.data perf.data.20
# ls -l perf.data.*
-rw-------. 1 root root 4480655 2011-04-08 12:36 perf.data.10
-rw-------. 1 root root 5532431 2011-04-08 12:37 perf.data.20
# perf script -i perf.data.10 | wc -l
9909
# perf script -i perf.data.20 | wc -l
18675

Then I did some deltas to figure out the size per event:

5532431-4480655 = 1051776
18675-9909 = 8766
1051776/8766 = 119

Which shows that the sched switch event takes up 119 bytes per event.

Then I looked at what ftrace does:

# trace-cmd record -e sched_switch -o trace.dat.10 sleep 10
# trace-cmd record -e sched_switch -o trace.dat.20 sleep 20
# trace-cmd report trace.dat.10 | wc -l
38856
# trace-cmd report trace.dat.20 | wc -l
77124
# ls -l trace.dat.*
-rw-r--r--. 1 root root 5832704 2011-04-08 12:41 trace.dat.10
-rw-r--r--. 1 root root 8790016 2011-04-08 12:41 trace.dat.20

8790016-5832704 = 2957312
77124-38856 = 38268
2957312/38268 = 77
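
(The same delta can be scripted; this is just a sketch of the arithmetic
above, using the files from these runs:

# a=$(stat -c %s trace.dat.10); b=$(stat -c %s trace.dat.20)
# x=$(trace-cmd report trace.dat.10 | wc -l); y=$(trace-cmd report trace.dat.20 | wc -l)
# echo $(( (b - a) / (y - x) ))
77

and likewise with the perf.data files and "perf script -i".)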

As you stated, I need to look more into the perf code (which I plan on
doing), but it seems that perf adds 42 bytes more per event. Perhaps
this is something we can fix. I'd love to make both perf and ftrace
able to limit their headers. There's no reason to record the pid for every
event if we don't need to, nor the preempt count and interrupt
status. But these are legacy from the latency tracer code from -rt.

I think there's a lot of work that we can do to make tracing in perf more
compatible with the tracing features of ftrace. I did say the ugly word
"roadmap" but perhaps it's just direction that we need. I feel we are
all a bunch of cooks with their own taste, and we don't all like the
spices used by each other.

-- Steve

2011-04-08 18:33:24

by Michael Rubin

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

Minor history: Google wrote ktrace, which we started to push upstream,
and then jumped on the ftrace train. ftrace was a very good drop-in
replacement for us. ftrace has a lot of features ktrace did not. We
use perf also in other contexts.

We have devoted a lot of time and energy from many engineers outside
of the kernel team to using ftrace. It's not just the kernel code: we have
built a lot of software on ftrace as we examine systems in the cloud. It was
not easy to make this migration since it had a lot of user impact. It
was less the low-level API than the fact that the tracepoint semantics
had changed. Google is in the process of phasing out ktrace.

Requirements for tracing for us seem to be:
1) Around 250ns latencies when enabled
2) Small events - We have worked on shrinking the ftrace event size.
Since we are constrained on the amount of memory we can use in the
ring buffer, every byte counts. A nicely sized event is on the order of
<= 8 bytes. Some of the 2.6.34 ftrace events, like irq_enter and exit,
were too big, around 20 bytes. sched_switch is around 60 bytes.

I am not sure that perf meets these requirements. We need to use
ftrace for tracing _now_. Google also needs to focus on improving
ftrace upstream so we can contribute our changes to the community. We
are submitting patches and working with upstream to improve ftrace.
Currently we depend on ftrace and on it being around for a long time.

Frederic wrote: "Nonetheless you can't be such a significant
user/developer of the upstream kernel tracing and at the same time
ignore some key problems of the actual big picture of it."

I see less of a big picture issue and more of an awkward two pictures.
From my view it's a question of whether ftrace should continue to be
supported and improved or whether perf will do everything as well and
when. We depend on ftrace today and as such we have invested our time
there.

mrubin

2011-04-08 19:01:02

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, Apr 08, 2011 at 03:37:48AM -0400, Steven Rostedt wrote:
> I actually agree, as perf is more focused on per process (or group) than
> ftrace. But that said, I guess the issue is also, if they have a simple
> solution that is not invasive and suits their needs, what's the harm in
> accepting it?

What about a kind of cgroup_of(path) operator that we can use on
filters?

        common_pid cgroup_of(path)
or
        common_pid __cgroup_of__ path

That way you don't bloat the tracing fast path?
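
As a purely hypothetical sketch, such an operator would be set through the
existing per-event filter files. Only the paths below exist today; the
operator itself does not:

        echo 'common_pid __cgroup_of__ "/apps/container_3"' > \
                /sys/kernel/debug/tracing/events/sched/sched_switch/filter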

2011-04-08 20:27:25

by Justin TerAvest

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, Apr 8, 2011 at 11:32 AM, Michael Rubin <[email protected]> wrote:
> Minor history, Google wrote ktrace which we started to push upstream
> and then jumped on the ftrace train. ftrace was a very good drop in
> replacement for us. ftrace has a lot of features ktrace did not. We
> use perf also in other contexts.
>
> We have devoted a lot of time and energy from many engineers outside
> of kernel to use ftrace. It's not just the kernel code, but we have
> built a lot of sw on ftrace as we examine systems in the cloud. It was
> not easy to make this migration since it had a lot of user impact. It
> was less the low level API but the fact that the tracepoints semantics
> had changed. Google is in the process of no longer using ktrace.
>
> Requirements for tracing for us seem to be:
> 1) Around 250ns latencies when enabled

It's worth pointing out that this is a goal for us on ftrace.

We were able to do this well with ktrace, and have devoted effort to
making ftrace faster, but we're not at 250ns yet. We know that is
where we want to be.

> 2) Small events - We have worked on shrinking the ftrace event size.
> Since we are constrained on the amount of memory we can use in the
> ring buffer, every byte counts. A nice sized event is on the order of
> <= 8 bytes. Some of the 2.6.34 ftrace events like irq_enter and exit
> were too big around 20 bytes. sched_switch is around 60 bytes.
>
> I am not sure that perf meets these requirements. We need to use
> ftrace for tracing _now_. Google also needs to focus on improving
> ftrace upstream so we can contribute our changes to the community. We
> are submitting patches and working with upstream to improve ftrace.
> Currently we depend on ftrace and being around for a long time.
>
> Fredric wrote:" Nonetheless you can't be such a significant
> user/developer of the upstream kernel tracing and at the same time
> ignore some key problems of the actual big picture of it."
>
> I see less of a big picture issue and more of an awkward two pictures.
> From my view it's a question of whether ftrace should continue to be
> supported and improved or whether perf will do everything as well and
> when. We depend on ftrace today and as such we have invested our time
> there.
>
> mrubin

2011-04-08 20:38:35

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, Apr 08, 2011 at 09:00:56PM +0200, Frederic Weisbecker wrote:
> On Fri, Apr 08, 2011 at 03:37:48AM -0400, Steven Rostedt wrote:
> > I actually agree, as perf is more focused on per process (or group) than
> > ftrace. But that said, I guess the issue is also, if they have a simple
> > solution that is not invasive and suits their needs, what's the harm in
> > accepting it?
>
> What about a kind of cgroup_of(path) operator that we can use on
> filters?
>
> common_pid cgroup_of(path)
> or
> common_pid __cgroup_of__ path
>
> That way you don't bloat the tracing fast path?

Note in this example, we would simply ignore the common_pid
value and assume that the pid is the one of current. This saves
a pid -> task resolution step.

2011-04-08 21:42:09

by David Sharp

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, Apr 8, 2011 at 1:38 PM, Frederic Weisbecker <[email protected]> wrote:
> On Fri, Apr 08, 2011 at 09:00:56PM +0200, Frederic Weisbecker wrote:
>> On Fri, Apr 08, 2011 at 03:37:48AM -0400, Steven Rostedt wrote:
>> > I actually agree, as perf is more focused on per process (or group) than
>> > ftrace. But that said, I guess the issue is also, if they have a simple
>> > solution that is not invasive and suits their needs, what's the harm in
>> > accepting it?
>>
>> What about a kind of cgroup_of(path) operator that we can use on
>> filters?
>>
>>       common_pid cgroup_of(path)
>> or
>>       common_pid __cgroup_of__ path
>>
>> That way you don't bloat the tracing fast path?
>
> Note in this example, we would simply ignore the common_pid
> value and assume that pid is the one of current. This economizes
> a step to pid -> task resolution.
>

This is a decent idea, but I'm worried about the complexity of using
filters like this. Filters are written to *every* event that you want
the filter to apply to (if you set the top-level filter, it just
copies the filter to all applicable events), and this is a filter you
would mostly only want to apply to *all* events at once. Furthermore,
filters work by discarding the event *after* the event has already
been written, so all tasks will be incurring full tracing overhead.
With cgroup filtering up front, we can avoid ~90% [0] of the overhead
for untraced cgroups.

I'm also thinking that cgroups could be a way to expose tracing to
non-root users. Making it a filter doesn't work for that.


Hmm.. Maybe ftrace needs a "global filters" feature. cgroup and pid
would be prime candidates for this, perhaps there are others. These
would be an optional list of filters applied *before* writing the
event or reserving buffer space, so they could not use the event
fields. Mostly I'm thinking they would use things accessible from the
current task_struct.


If we could work all that out, then I would change a couple things:
one of my grand plans for tracing is to remove pid from every event,
and replace it with a tiny "pid_changed" event (unless "sched_switch"
et al is enabled). So I wouldn't want to attach it to common_pid at
all. Instead, I would make it a unary operator.

It also doesn't work with multiple hierarchies. When you refer to a
cgroup path of "/apps/container_3", are we talking about the cgroup
for cpu, or mem, or blkio, or all, or a subset? This is what the
"tracing_enabled" files in the cgroup filesystem in Vaibhav's proposal
were for. Maybe this could be an optional argument to the unary
operator.

So, the operator becomes:
cgroup_of(/path) means any subsystem,
cgroup_of(/path, cpu, mem) means cpu or mem.
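
Again purely hypothetical, but written against the existing filter interface
that might look like (only the filter file itself exists today; the operator
and its subsystem arguments do not):

        echo 'cgroup_of(/apps/container_3, cpu, mem)' > \
                /sys/kernel/debug/tracing/events/sched/sched_switch/filter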

d#

[0] This figure is made up. Like most statistics. ;)

2011-04-09 11:09:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, 2011-04-08 at 13:02 -0400, Steven Rostedt wrote:
> On Fri, 2011-04-08 at 16:07 +0200, Peter Zijlstra wrote:
>
> > > As you said perf has a lot of overhead due to data that it saves per
> > > event.
> >
> > Someday you should actually read the perf code before you say something.
>
> I have looked at the code although not as much recently, but I do plan
> on looking at it again in much more detail. But you are correct that I
> did not make this comment on the code itself, but on looking at the
> data:

> 1051776/8766 = 119

> 2957312/38268 = 77
>
> As you stated, I need to look more into the perf code (which I plan on
> doing), but it seems that perf adds 42 bytes more per event. Perhaps
> this is something we can fix.

Aside from the 8-byte header, everything else is configurable with
PERF_SAMPLE_flags and probably has some overlap with stuff we also have
in the tracepoint data we then get through PERF_SAMPLE_RAW.

> I'd love to make both perf and ftrace be
> able to limit its header. There's no reason to record the pid for every
> event if we don't need to. As well as the preempt count and interrupt
> status. But these are legacy from the latency tracer code from -rt.

Right.

> I think there's a lot of work that we can do make tracing in perf more
> compatible with the tracing features of ftrace. I did say the ugly word
> "roadmap" but perhaps it's just direction that we need. I feel we are
> all a bunch of cooks with their own taste, and we don't all like the
> spices used by each other.

Partly yeah, but there are also real functional differences. The last time
I profiled perf with perf (yay for recursion) we spent a lot of time in
conditionals. Due to the fact that perf is mainly sampling-based (with
tracing being samples with period==1) and all that output
configurability, there's a true forest of if statements to pass through,
and I'm fairly sure we totally trash the branch predictor on that.

2011-04-12 21:38:14

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC] tracing: Adding cgroup aware tracing functionality

On Fri, Apr 08, 2011 at 02:41:43PM -0700, David Sharp wrote:
> On Fri, Apr 8, 2011 at 1:38 PM, Frederic Weisbecker <[email protected]> wrote:
> > On Fri, Apr 08, 2011 at 09:00:56PM +0200, Frederic Weisbecker wrote:
> >> On Fri, Apr 08, 2011 at 03:37:48AM -0400, Steven Rostedt wrote:
> >> > I actually agree, as perf is more focused on per process (or group) than
> >> > ftrace. But that said, I guess the issue is also, if they have a simple
> >> > solution that is not invasive and suits their needs, what's the harm in
> >> > accepting it?
> >>
> >> What about a kind of cgroup_of(path) operator that we can use on
> >> filters?
> >>
> >>       common_pid cgroup_of(path)
> >> or
> >>       common_pid __cgroup_of__ path
> >>
> >> That way you don't bloat the tracing fast path?
> >
> > Note in this example, we would simply ignore the common_pid
> > value and assume that pid is the one of current. This economizes
> > a step to pid -> task resolution.
> >
>
> This is a decent idea, but I'm worried about the complexity of using
> filters like this. Filters are written to *every* event that you want
> the filter to apply to (if you set the top-level filter, it just
> copies the filter to all applicable events), and this is a filter you
> would mostly only want to apply to *all* events at once.

Hmm, but this complexity doesn't happen at tracing time. It happens beforehand,
and only once. So I'm not sure there is any real harm there. Besides, the whole
infrastructure for that is already in place.

You only need a global effect because your workflow only involves that.
But someone else may come along with a more complicated use case.


> Furthermore,
> filters work by discarding the event *after* the event has already
> been written, so all tasks will be incurring full tracing overhead.
> With cgroup filtering up front, we can avoid ~90% [0] of the overhead
> for untraced cgroups.

In fact we want pre-record-time filtering for every filter, or most
of them.

No strong idea about how we can fix that though. Perhaps we can start
by dividing filtering into two parts: pre-tracing and post-tracing.

> I'm also thinking that cgroups could be a way to expose tracing to
> non-root users. Making it a filter doesn't work for that.

Hmm, as an example, perf doesn't expose trace events to non-root users
because that can leak internal kernel information to users,
although this permission can be tweaked through a sysfs file.
Ah we make an exception for syscall events, in the context of a process.
We may provide more exceptions in the future, like page faults.
Anyway we need to enable this non-root tracing mode case by case.

Look at TRACE_EVENT_FL_CAP_ANY. I don't know if we want ftrace
to support non-root users in the future, but if we do this
should be done on top of this flag.

> Hmm.. Maybe ftrace needs a "global filters" feature. cgroup and pid
> would be prime candidates for this, perhaps there are others. These
> would be an optional list of filters applied *before* writing the
> event or reserving buffer space, so they could not use the event
> fields. Mostly I'm thinking they would use things accessible from the
> current task_struct.

I still don't understand why such a filter really needs to be global. But
we could provide pre-record-time filters and put in them any filter that
is built on a unary operator (cgroup_of "path" can be unary and refer
to current).

Apart from supporting unary operators, the debugfs interface doesn't
need to change to support that. Only ftrace has to sort out pre vs. post
filtering internally, depending on the nature of the operator.

And maybe in the future we can pull more filtering into pre-record time.

> If we could work all that out, then I would change a couple things:
> one of my grand plans for tracing is to remove pid from every event,
> and replace it with a tiny "pid_changed" event (unless "sched_switch"
> et al is enabled). So I wouldn't want to attach it to common_pid at
> all. Instead, I would make it a unary operator.

pid_changed is basically a sched switch event. But otherwise, agreed.

> It also doesn't work with multiple hieranchies. When you refer to a
> cgroup path of "/apps/container_3", are we talking about the cgroup
> for cpu, or mem, or blkio, or all, or a subset? This is what the
> "tracing_enabled" files in the cgroup filesystem in Vaibhav's proposal
> were for. Maybe this could be an optional argument to the unary
> operator.
>
> So, the operator becomes:
> cgroup_of(/path) means any subsystem,
> cgroup_of(/path, cpu, mem) means cpu or mem.

Yeah, why not.