2010-11-04 15:40:29

by Frederic Weisbecker

Subject: [RFD] Perf generic context based exclusion/inclusion (was Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting)

On 24 August 2010 10:14, Ingo Molnar <[email protected]> wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Thu, 2010-07-22 at 19:12 -0700, Venkatesh Pallipadi wrote:
>> > >
>> > > Well, the task and cgroup information is there but what does it really
>> > > tell me? As long as the irq & softirq time can be caused by any other
>> > > process I don't see the value of this incorrect data point.
>> > >
>> >
>> > Data point will be correct. How it gets used is a different qn. This
>> > interface will be useful for Alert/Paranoid/Annoyed user/admin who
>> > sees that the job exec_time is high but it is not doing any useful
>> > work.
>>
>> I'm very sympathetic with Martin's POV. irq/softirq times per task
>> don't really make sense. In the case you provide above the solution
>> would be to subtract these times from the task execution time, not
>> break it out. In that case he would see his task not do much, and end
>> up with the same action list.
>
> Right, and this connects to something Frederic sent a few RFC patches for
> some time ago: fine-grained irq/softirq perf stat support. If we do
> something in this area we need a facility that enables both types of
> statistics gathering.
>
> Frederic's model is based on exclusion - so you could do a perf stat run
> that excluded softirq and hardirq execution from a workload's runtime.
> It's nifty, as it allows the reduction of measurement noise. (IRQ and
> softirq execution can be regarded as random noise added (or not added)
> to execution times)
>
> Thanks,
>
> Ingo
>


(Answering a thousand years later)

Concerning the softirq/hardirq filtering in perf, this is still
something I want to do, but I now think we should do it differently.
In particular, we should extend the idea of exclusion to a generic level.

A "context" is a generic idea: it is something that starts and ends
at specific events. That means it can be expressed with perf events,
for example:

- a context of "lock X held" starts when X is acquired and stops when
X is released
- a context of "irq" starts when we enter irq and ends when we exit irq.

There are tons of other examples. Considering how much we can already
tune any perf event (think about filters) and the variety of event
flavours we have (static tracepoints, breakpoints, dynamic probes), we
can define very precise contexts and count whatever we want inside them:

- count cycles while we hold rq lock

If you consider that the events that delimit contexts can themselves run
under exclusion/inclusion contexts, you can do complex things, as in
this scenario:

- create an enter_irq event and an exit_irq event
- create a lock_acquired and a lock_release event, and make them
count/sample only under the context defined by the enter_irq/exit_irq
events above
- attach a filter to these lock events so that they only trigger if X is
the lock name
- create a cycles counting event and make it run under the context
defined by the lock_acquired/lock_release events above

The result is that you will only count cycles when we hold X under irq.
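
To make the scenario above concrete, here is a minimal user-space sketch in Python. It is only a toy model of the idea, not perf code: the `Event` and `Context` classes, the trace format, and the `cycles` records carrying a cycle delta are all assumptions made for illustration, and no such interface exists in perf.

```python
# Toy user-space model of context-gated events. Nothing here is a real perf
# API: Event, Context and the trace tuples are hypothetical.

class Context:
    """Active between a firing of `start` and the next firing of `stop`."""

    def __init__(self, start, stop):
        self.start, self.stop = start, stop
        self.active = False

    def feed(self, name, arg):
        if self.start.fires(name, arg):
            self.active = True
        elif self.stop.fires(name, arg):
            self.active = False


class Event:
    """Fires on a matching record, but only inside its context (if any)."""

    def __init__(self, name, lock=None, context=None):
        self.name, self.lock, self.context = name, lock, context

    def fires(self, name, arg):
        if name != self.name:
            return False
        if self.lock is not None and arg != self.lock:
            return False  # lock-name filter did not match
        return self.context is None or self.context.active


def count_cycles_holding_lock_in_irq(trace, lock):
    irq = Context(Event("irq_enter"), Event("irq_exit"))
    held = Context(Event("lock_acquired", lock, irq),
                   Event("lock_release", lock, irq))
    cycles = Event("cycles", context=held)
    total = 0
    for name, arg in trace:
        irq.feed(name, arg)    # outer context first
        held.feed(name, arg)   # inner context sees the updated outer state
        if cycles.fires(name, arg):
            total += arg       # for "cycles" records, arg is a cycle delta
    return total


TRACE = [
    ("lock_acquired", "X"), ("cycles", 100), ("lock_release", "X"),  # X, no irq
    ("irq_enter", None),
    ("cycles", 10),                                                  # irq, no lock
    ("lock_acquired", "Y"), ("cycles", 20), ("lock_release", "Y"),   # wrong lock
    ("lock_acquired", "X"), ("cycles", 30), ("cycles", 5),
    ("lock_release", "X"),
    ("cycles", 7),
    ("irq_exit", None),
]

print(count_cycles_holding_lock_in_irq(TRACE, "X"))  # prints 35
```

Only the two `cycles` records seen while X is held inside the irq are counted. The model deliberately ignores corner cases, such as a context whose stop event is itself gated off before it can fire.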

I think this is definitely the direction we need to take. When the
function tracers become available as trace events, this could become
extremely powerful (counting cycles inside some functions only, or
while you hold lock X under function Y in softirq, and so on).

I'm just not sure yet about the interface. Perhaps an ioctl to attach
an event to another one through their fds, telling whether we want the
event to enable or disable the counting/sampling on the other.
We could have as many "enablers" or "disablers" as we want, or only one
of each, not sure yet.
Or maybe we want to create an abstraction of "contexts", using fds
for them. Not sure.

We probably also want an attr->enable_on_schedule.

Anyway, I'll certainly work on that after the dwarf unwinding is good enough.


2010-11-04 19:46:52

by Venkatesh Pallipadi

Subject: Re: [RFD] Perf generic context based exclusion/inclusion (was Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting)

On Thu, Nov 4, 2010 at 8:40 AM, Frederic Weisbecker <[email protected]> wrote:
> On 24 August 2010 10:14, Ingo Molnar <[email protected]> wrote:
>>
>> * Peter Zijlstra <[email protected]> wrote:
>>
>>> On Thu, 2010-07-22 at 19:12 -0700, Venkatesh Pallipadi wrote:
>>> > >
>>> > > Well, the task and cgroup information is there but what does it really
>>> > > tell me? As long as the irq & softirq time can be caused by any other
>>> > > process I don't see the value of this incorrect data point.
>>> > >
>>> >
>>> > Data point will be correct. How it gets used is a different qn. This
>>> > interface will be useful for Alert/Paranoid/Annoyed user/admin who
>>> > sees that the job exec_time is high but it is not doing any useful
>>> > work.
>>>
>>> I'm very sympathetic with Martin's POV. irq/softirq times per task
>>> don't really make sense. In the case you provide above the solution
>>> would be to subtract these times from the task execution time, not
>>> break it out. In that case he would see his task not do much, and end
>>> up with the same action list.
>>
>> Right, and this connects to something Frederic sent a few RFC patches for
>> some time ago: fine-grained irq/softirq perf stat support. If we do
>> something in this area we need a facility that enables both types of
>> statistics gathering.
>>
>> Frederic's model is based on exclusion - so you could do a perf stat run
>> that excluded softirq and hardirq execution from a workload's runtime.
>> It's nifty, as it allows the reduction of measurement noise. (IRQ and
>> softirq execution can be regarded as random noise added (or not added)
>> to execution times)
>>
>> Thanks,
>>
>> Ingo
>>
>
>
> (Answering thousand years later)
>
> Concerning the softirq/hardirq filtering in perf, this is still
> something I want to do,
> but now I think we should do it differently, especially we should
> extend the idea of exclusion to the generic level.
>
> A "context" is a generic idea: this is something that starts and ends
> at specific events. It means this can be expressed with
> perf events, for example:
>
> - a context of "lock X held" starts when X is acquired and stops when
> X is released
> - a context of "irq" starts when we enter irq and ends when we exit irq.
>

I think this will be a useful abstraction to have, mostly beyond
just irq/softirq. A couple of comments:
- For locks, we may want to track both "wait context" and "hold context"
- This may be a bit odd, and there is probably a better way of doing
it, but one other context we may want to track is sleeping or waiting
at certain points. What I have in mind is something like "how long do
we wait in this kmalloc while holding this mutex". Maybe it is best to
do this by treating sleep in kmalloc as a context.
- A few other examples where this is useful: counting events only when
two or more locks are held together, or measuring how long we waited
on one spinlock while holding another spinlock.

Thanks,
Venki

> There are tons of other examples. And considering how much we can tune
> any perf event already (think about
> filters) and the variety of events flavour we have (static
> tracepoints, breakpoints, dyn probes), we can define very
> precise contexts and count whatever inside:
>
> - count cycles while we hold rq lock
>
> If you consider that events that delimit contexts can, themselves, run
> under exclusion/inclusion contexts, you can do
> complex things like in this scenario:
>
> - create an enter_irq event and an exit_irq event
> - create a lock_acquired and a lock_release event, make them
> counting/sampling only under enter_irq --- exit_irq above perf events
> based defined context
> - attach filter to these lock events, to only trigger if X is the lock name
> - create a cycles counting event, make it running under the
> lock_acquired -- lock_released above perf events based defined context
>
> The result is that you will only count cycles when we hold X under irq.
>
> I think this is definitely the direction we need to take. When the
> function tracers will be available as
> trace events, this could become intensely powerful (counting cycles
> inside some functions only, or if you hold lock X
> under function Y in softirq and.....).
>
> I'm just not sure yet about the interface, perhaps an ioctl to attach
> an event to another one
> through their fds and tell whether we want the event to enable or
> disable the counting/sampling
> on the other.
> We could have as much "enabler" or "disabler" as we want, or only one
> each, not sure yet.
> Or may be we want to create the abstraction of "contexts" using fds
> for them. Not sure.
>
> We probably also want an attr->enable_on_schedule.
>
> Anyway, I'll certainly work on that after the dwarf unwinding is good enough.
>

2010-11-05 03:51:26

by Frederic Weisbecker

Subject: Re: [RFD] Perf generic context based exclusion/inclusion (was Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting)

On Thu, Nov 04, 2010 at 12:46:42PM -0700, Venkatesh Pallipadi wrote:
> On Thu, Nov 4, 2010 at 8:40 AM, Frederic Weisbecker <[email protected]> wrote:
> > (Answering thousand years later)
> >
> > Concerning the softirq/hardirq filtering in perf, this is still
> > something I want to do,
> > but now I think we should do it differently, especially we should
> > extend the idea of exclusion to the generic level.
> >
> > A "context" is a generic idea: this is something that starts and ends
> > at specific events. It means this can be expressed with
> > perf events, for example:
> >
> > - a context of "lock X held" starts when X is acquired and stops when
> > X is released
> > - a context of "irq" starts when we enter irq and ends when we exit irq.
> >
>
> I think this will be a useful abstraction to have, mostly beyond
> just irq/softirq. A couple of comments:
> - For locks, we may want to track both "wait context" and "hold context"



Sure. In that case you want to play with both the lock_acquire and the
lock_acquired events (note the trailing "d").

lock_acquire is the "I request this lock" event.
lock_acquired is the "I have grabbed this lock" event.

Well, there might be some time between the two events due to various
things, mainly lockdep checks I guess, because those events are tightly
tied to lockdep internals (or lockstat). If you want to know whether
there has been some waiting for the lock, you need the lock_contended
event.

So, you can play some games here with those events.
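
As a rough illustration of the kind of game that can be played, here is a minimal Python sketch that splits "wait" time from "hold" time for one lock. The event names match the lock events mentioned above, but the `(timestamp, event, lock_name)` record layout, the `lock_times` helper, and the single-task ordered trace are assumptions made for the example.

```python
# Hypothetical post-processing of an ordered, single-task lock event trace.
# Each record is (timestamp_ns, event_name, lock_name); this layout is an
# assumption for the sketch, not an actual perf output format.

def lock_times(trace, lock):
    """Return (wait_ns, hold_ns) totals for `lock`."""
    wait = hold = 0
    contended_at = acquired_at = None
    for ts, event, name in trace:
        if name != lock:
            continue
        if event == "lock_contended":
            contended_at = ts               # we started waiting here
        elif event == "lock_acquired":
            if contended_at is not None:
                wait += ts - contended_at   # wait context ends
                contended_at = None
            acquired_at = ts                # hold context starts
        elif event == "lock_release" and acquired_at is not None:
            hold += ts - acquired_at        # hold context ends
            acquired_at = None
        # "lock_acquire" (the request) is ignored: without contention the
        # gap up to "lock_acquired" is bookkeeping, not waiting.
    return wait, hold


TRACE = [
    (100, "lock_acquire", "X"),
    (105, "lock_acquired", "X"),    # uncontended acquisition
    (150, "lock_release", "X"),
    (200, "lock_acquire", "X"),
    (210, "lock_contended", "X"),   # someone else holds X
    (260, "lock_acquired", "X"),
    (300, "lock_release", "X"),
    (310, "lock_acquire", "Y"),
    (312, "lock_acquired", "Y"),
    (320, "lock_release", "Y"),
]

print(lock_times(TRACE, "X"))  # prints (50, 85)
```

In context terms, lock_contended/lock_acquired delimit the "wait context" and lock_acquired/lock_release delimit the "hold context".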


> - This may be a bit odd and probably there is some other way of doing
> this better. But, one other context we may want to track is the sleep
> or wait at certain points. What I am thinking is something like how
> long are we waiting on this kmalloc when we are holding this mutex
> kind of info. May be it is best to do this as having sleep in kmalloc
> as a context.



If I understand you correctly, you want to define a context made of
sleeping time?

The problem is that it means defining an empty context:
nothing happens in task X while it is sleeping (by definition), so
capturing any events, or counting whatever counter in this slice
would never report anything.

OTOH you can capture trace events and get the time between a
sleep and a wake up event, then you can compute the difference
from a post processing script in perf tools.
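
A minimal sketch of such a post-processing step, in Python. The `(timestamp, pid, event)` records and the generic "sleep"/"wakeup" event names are placeholders chosen for the example; in a real trace they would more likely be derived from scheduler tracepoints such as sched_switch and sched_wakeup.

```python
from collections import defaultdict

# Hypothetical records: (timestamp_ns, pid, event). The "sleep" and "wakeup"
# names are placeholders, not actual tracepoint names.

def sleep_time_per_pid(trace):
    """Sum the time between each sleep event and the next wakeup, per pid."""
    asleep_since = {}            # pid -> timestamp of the pending sleep
    total = defaultdict(int)     # pid -> accumulated sleep time
    for ts, pid, event in trace:
        if event == "sleep":
            asleep_since[pid] = ts
        elif event == "wakeup" and pid in asleep_since:
            total[pid] += ts - asleep_since.pop(pid)
        # a wakeup with no recorded sleep is ignored
    return dict(total)


TRACE = [
    (10, 1, "sleep"),
    (15, 2, "sleep"),
    (40, 1, "wakeup"),
    (50, 1, "sleep"),
    (55, 2, "wakeup"),
    (70, 1, "wakeup"),
    (80, 3, "wakeup"),   # no matching sleep in the trace
]

print(sleep_time_per_pid(TRACE))  # prints {1: 50, 2: 40}
```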



> - Few other examples of this being useful is to count events only when
> these two or more locks are held together or how long we were waiting
> on one spinlock while we are holding one spinlock.


Agreed. For that you can combine two levels of contexts. Let's say you
want to count instructions while we hold lock Y and also hold lock X
(the lock dependency order being X -> Y), but you don't want to count
when you hold Y without X:

- create two perf events: lock_acquire and lock_release, apply filter on lock name X
- create two perf events: lock_acquire and lock_release, apply filter on lock name Y
- apply the first pair of lock events as an inclusive context for the second pair
(which means the second pair must only count/sample in the context delimited by the
first pair)
- create a perf event that counts instructions, and apply the second pair above as an
inclusive context.
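
The steps above can be modelled in a few lines of Python. This is a simulation of the gating logic only: the `count_instructions` helper, the `(event, arg)` trace records, and the "instructions" records carrying an instruction delta are assumptions for the sketch, not an existing perf interface.

```python
# Toy model of two nested inclusive contexts. The trace format is
# hypothetical: lock events carry a lock name, "instructions" records
# carry an instruction-count delta.

def count_instructions(trace, outer="X", inner="Y"):
    in_outer = False   # context delimited by the first pair (lock X)
    in_inner = False   # context delimited by the second pair (lock Y)
    total = 0
    for event, arg in trace:
        if event == "instructions":
            if in_inner:               # counter runs under the Y context
                total += arg
        elif event in ("lock_acquire", "lock_release"):
            acquired = event == "lock_acquire"
            if arg == outer:
                in_outer = acquired
            elif arg == inner and in_outer:
                # Y events are only seen inside the X context
                in_inner = acquired
    return total


TRACE = [
    ("lock_acquire", "Y"), ("instructions", 50), ("lock_release", "Y"),  # Y without X
    ("lock_acquire", "X"), ("instructions", 10),                         # X only
    ("lock_acquire", "Y"), ("instructions", 25), ("instructions", 5),
    ("lock_release", "Y"),
    ("instructions", 3),
    ("lock_release", "X"),
    ("instructions", 8),
]

print(count_instructions(TRACE))  # prints 30
```

Instructions retired while holding Y alone are never counted, because the Y acquire event is itself filtered out when X is not held.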

2010-11-05 20:09:09

by Venkatesh Pallipadi

Subject: Re: [RFD] Perf generic context based exclusion/inclusion (was Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting)

On Thu, Nov 4, 2010 at 8:51 PM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Nov 04, 2010 at 12:46:42PM -0700, Venkatesh Pallipadi wrote:
>> On Thu, Nov 4, 2010 at 8:40 AM, Frederic Weisbecker <[email protected]> wrote:
<snip>
>
>
>> - This may be a bit odd and probably there is some other way of doing
>> this better. But, one other context we may want to track is the sleep
>> or wait at certain points. What I am thinking is something like how
>> long are we waiting on this kmalloc when we are holding this mutex
>> kind of info. May be it is best to do this as having sleep in kmalloc
>> as a context.
>
>
>
> If I understand you well, you want to define a context made of a
> sleeping time?
>
> The problem is that it means defining an empty context:
> nothing happens in task X while it is sleeping (by definition), so
> capturing any events, or counting whatever counter in this slice
> would never report anything.
>
> OTOH you can capture trace events and get the time between a
> sleep and a wake up event, then you can compute the difference
> from a post processing script in perf tools.
>

Agreed. A sleeping context won't work in general. I was just thinking
aloud about whether there is a way to tie a sleep trace event to a pid
and a held mutex, so that something like "how long did this thread
wait in this kmalloc while in some critical section" can be
captured. So post-processing trace events is the better option here.
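
For what it's worth, a post-processing pass along those lines could look like the Python sketch below. The `(timestamp, pid, event, arg)` records, the "sleep"/"wakeup" placeholder names, and the rule that a sleep is attributed to the mutex if it starts while the mutex is held are all assumptions for the example. It does not try to identify kmalloc as the sleeping site.

```python
# Hypothetical records: (timestamp_ns, pid, event, arg). Lock events carry
# the mutex name in arg; "sleep"/"wakeup" are placeholder event names.

def sleep_while_holding(trace, mutex):
    """Per pid, total time slept in sleeps that began while holding `mutex`."""
    holding = set()      # pids currently holding `mutex`
    asleep_since = {}    # pid -> start of a sleep that began under the mutex
    total = {}
    for ts, pid, event, arg in trace:
        if event == "lock_acquired" and arg == mutex:
            holding.add(pid)
        elif event == "lock_release" and arg == mutex:
            holding.discard(pid)
        elif event == "sleep" and pid in holding:
            asleep_since[pid] = ts
        elif event == "wakeup" and pid in asleep_since:
            total[pid] = total.get(pid, 0) + ts - asleep_since.pop(pid)
    return total


TRACE = [
    (0, 7, "sleep", None), (5, 7, "wakeup", None),   # no mutex held
    (10, 7, "lock_acquired", "M"),
    (12, 7, "sleep", None),
    (20, 8, "lock_acquired", "N"),                   # different mutex
    (22, 8, "sleep", None), (28, 8, "wakeup", None),
    (30, 7, "wakeup", None),
    (35, 7, "lock_release", "M"),
    (40, 7, "sleep", None), (60, 7, "wakeup", None), # after release
]

print(sleep_while_holding(TRACE, "M"))  # prints {7: 18}
```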

Thanks,
Venki

>> - Few other examples of this being useful is to count events only when
>> these two or more locks are held together or how long we were waiting
>> on one spinlock while we are holding one spinlock.
>
>
> Agreed. For that you can combine two levels of contexts, let's say you want
> to count instructions when we hold lock Y when we also hold lock X
> (the lock dependency order being X -> Y), but you don't want to count
> when you hold Y without X:
>
> - create two perf events: lock_acquire and lock_release, apply filter on lock name X
> - create two perf events: lock_acquire and lock_release, apply filter on lock name Y
> - apply the first pair of lock events as an inclusive context for the second pair
>  (which means the second pair must only count/sample on the context delimited by the
>  first pair)
> - create a perf event that count instructions, apply the second above pair as an
>  inclusive context.
>
>
>