2010-07-24 17:01:54

by Hitoshi Mitake

Subject: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

Recording every lock event produces too much overhead,
both in CPU time and in memory space.
This overhead is a big problem that should be solved,
because locks are an important factor today.

So I'd like to add a new lock event, "lock_acquired_waittime",
for contention analysis. lock_acquired() calculates the time spent
waiting for a lock from held_lock->waittime_stamp, which is set by lock_contended().
If this value is 0, no contention occurred; if it is not 0,
contention did occur.
So I inserted a branch on this value into lock_acquired():
the new lock_acquired() calls trace_lock_acquired() when waittime is 0,
and calls trace_lock_acquired_waittime(), the function for the new event,
when waittime is not 0.
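
To make this concrete, here is a minimal user-space sketch of the idea
(hypothetical names and simplified types; this is not the real lockdep
code, just an illustration of how waittime_stamp becomes the waittime
that selects between the two events):

	#define _POSIX_C_SOURCE 199309L
	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	struct held_lock_sketch {
		uint64_t waittime_stamp;	/* set by the contended path, 0 otherwise */
		uint64_t holdtime_stamp;
	};

	static uint64_t now_ns(void)		/* stand-in for lockstat_clock() */
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
	}

	static void sketch_lock_contended(struct held_lock_sketch *hl)
	{
		hl->waittime_stamp = now_ns();	/* we had to wait for this lock */
	}

	static void sketch_lock_acquired(struct held_lock_sketch *hl)
	{
		uint64_t waittime = 0;

		if (hl->waittime_stamp) {
			uint64_t now = now_ns();
			waittime = now - hl->waittime_stamp;
			hl->holdtime_stamp = now;
			hl->waittime_stamp = 0;
		}

		if (waittime)	/* contended: the proposed low-frequency event */
			printf("lock_acquired_waittime: %llu ns\n",
			       (unsigned long long)waittime);
		else		/* uncontended: the existing high-frequency event */
			printf("lock_acquired\n");
	}

	int main(void)
	{
		struct held_lock_sketch hl = { 0, 0 };

		sketch_lock_acquired(&hl);	/* no contention, plain event */
		sketch_lock_contended(&hl);
		sketch_lock_acquired(&hl);	/* contention, waittime event */
		return 0;
	}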

The traditional event, lock_acquired, is a very high-frequency event.
But lock_acquired_waittime does not fire nearly as often,
so the overhead produced by tracing it is not so large.

This is an example:
| mitake@X201i:~/linux/.../tools/perf% ./perf bench sched messaging
| # Running sched/messaging benchmark...
| # 20 sender and receiver processes per group
| # 10 groups == 400 processes run
|
| Total time: 1.712 [sec]
| mitake@X201i:~/linux/.../tools/perf% sudo ./perf record -R -f -m 1024 -c 1 -e lock:lock_acquired_waittime:r ./perf bench sched messaging
| # Running sched/messaging benchmark...
| # 20 sender and receiver processes per group
| # 10 groups == 400 processes run
|
| Total time: 1.754 [sec]
| [ perf record: Woken up 1 times to write data ]
| [ perf record: Captured and wrote 0.958 MB perf.data (~41837 samples) ]

"perf bench sched messaging" is a kernel bound program, but this result
shows that overhead is, at least on my Core i3 laptop, small enough for discarding
especially from CPU time perspective.

This new event is not enough for analyzing complete lock behaviour,
but measuring the amount of time consumed by lock contention becomes possible.

So I think this patch is useful for real-world lock analysis.
I'd like to write a new analyzer in Python which uses this event.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: Xiao Guangrong <[email protected]>
Signed-off-by: Hitoshi Mitake <[email protected]>

---
 include/trace/events/lock.h |   23 +++++++++++++++++++++++
 kernel/lockdep.c            |    5 ++++-
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/lock.h b/include/trace/events/lock.h
index 2821b86..40164f2 100644
--- a/include/trace/events/lock.h
+++ b/include/trace/events/lock.h
@@ -77,6 +77,29 @@ DEFINE_EVENT(lock, lock_acquired,
 	TP_ARGS(lock, ip)
 );
 
+TRACE_EVENT(lock_acquired_waittime,
+
+	TP_PROTO(struct lockdep_map *lock, unsigned long ip,
+		 u64 waittime),
+
+	TP_ARGS(lock, ip, waittime),
+
+	TP_STRUCT__entry(
+		__string(name, lock->name)
+		__field(void *, lockdep_addr)
+		__field(u64, waittime)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, lock->name);
+		__entry->lockdep_addr = lock;
+		__entry->waittime = waittime;
+	),
+
+	TP_printk("%p %s %llu [ns]", __entry->lockdep_addr,
+		  __get_str(name), __entry->waittime)
+);
+
 #endif
 #endif

diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 5428679..2a002e9 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -3390,7 +3390,10 @@ found_it:
 		hlock->holdtime_stamp = now;
 	}
 
-	trace_lock_acquired(lock, ip);
+	if (waittime)
+		trace_lock_acquired_waittime(lock, ip, waittime);
+	else
+		trace_lock_acquired(lock, ip);
 
 	stats = get_lock_stats(hlock_class(hlock));
 	if (waittime) {
--
1.7.1.1


2010-07-28 16:39:56

by Frederic Weisbecker

Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

On Sun, Jul 25, 2010 at 01:55:29AM +0900, Hitoshi Mitake wrote:
> Recording every lock events produces too large overhead
> both in CPU time and memory space.
> This overhead is a big problem which should be solved,
> but locks are important factor today.
>
> So I'd like to add a new lock event "lock_acquired_waittime"
> for contention analysis. lock_acquired() calculates time consumed
> for waiting the locks with held_lock->waittime_stamp set by lock_contended().
> If this value is 0, no contention occured. If this value is not 0,
> there is occurence of contention.
> So I inserted the branch based on this value into lock_acquired().
> New lock_acquired() calls trace_lock_acquired() when waittime is 0,
> and calls trace_lock_acquired_waittime(), function for the new event,
> when waittime is not 0.
>
> The traditional event, lock_acquired, is too high frequency event.
> But the event lock_acquired_waittime is not so high frequency,
> so overhead produced by tracing this event is not so large.
>
> This is an example:
> | mitake@X201i:~/linux/.../tools/perf% ./perf bench sched messaging
> | # Running sched/messaging benchmark...
> | # 20 sender and receiver processes per group
> | # 10 groups == 400 processes run
> |
> | Total time: 1.712 [sec]
> | mitake@X201i:~/linux/.../tools/perf% sudo ./perf record -R -f -m 1024 -c 1 -e lock:lock_acquired_waittime:r ./perf bench sched messaging
> | # Running sched/messaging benchmark...
> | # 20 sender and receiver processes per group
> | # 10 groups == 400 processes run
> |
> | Total time: 1.754 [sec]
> | [ perf record: Woken up 1 times to write data ]
> | [ perf record: Captured and wrote 0.958 MB perf.data (~41837 samples) ]
>
> "perf bench sched messaging" is a kernel bound program, but this result
> shows that overhead is, at least on my Core i3 laptop, small enough for discarding
> especially from CPU time perspective.
>
> This new event is not enough for analyzing complete lock behaviour.
> But measuring quantity of time consumed for lock contention will be possible.
>
> So I think this patch is useful for real world lock analysis.
> I'd like to write in new analyzer which use this event in Python.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jason Baron <[email protected]>
> Cc: Xiao Guangrong <[email protected]>
> Signed-off-by: Hitoshi Mitake <[email protected]>
>
> ---



But those who only want to trace the time waited can just trace the
lock_acquire and lock_acquired events. It seems that would be enough
to reduce the lock tracing overhead a bit.

Also, there is a problem with:

	if (cond)
		trace_x()
	else
		trace_y()

It adds a new condition, only used by tracing, that will always be
checked whether tracing is on or off. So we lose a cool feature
of tracepoints here: any tracing code is hidden behind a boolean wall,
so that the tracing-off path is optimized down to a single branch check:

	static inline void trace_##name(proto)				\
	{								\
		if (unlikely(__tracepoint_##name.state))		\
			__DO_TRACE(&__tracepoint_##name,		\
				TP_PROTO(data_proto),			\
				TP_ARGS(data_args));			\
	}								\


When tracing is off, i.e. __tracepoint_##name.state == 0, the only
extra code executed is the check of __tracepoint_##name.state.

With your new tracepoint, you've added another branch that is
always checked (unless !CONFIG_TRACEPOINTS, due to compiler
optimizations).
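
As a toy illustration of that point (purely hypothetical names, plain
user-space C rather than the real tracepoint machinery): when tracing is
off the wrapper costs one predictable state check, while the patched call
site evaluates the waittime condition no matter what:

	#include <stdbool.h>
	#include <stdio.h>

	static bool tp_lock_acquired_on;		/* models __tracepoint_...state */
	static bool tp_lock_acquired_waittime_on;

	static inline void trace_lock_acquired_toy(void)
	{
		if (__builtin_expect(tp_lock_acquired_on, 0))	/* only cost when off */
			puts("lock_acquired");
	}

	static inline void trace_lock_acquired_waittime_toy(unsigned long long wt)
	{
		if (__builtin_expect(tp_lock_acquired_waittime_on, 0))
			printf("lock_acquired_waittime: %llu ns\n", wt);
	}

	static void lock_acquired_callsite(unsigned long long waittime)
	{
		if (waittime)		/* extra branch, evaluated even with tracing off */
			trace_lock_acquired_waittime_toy(waittime);
		else
			trace_lock_acquired_toy();
	}

	int main(void)
	{
		lock_acquired_callsite(0);	/* tracing off: both paths are no-ops */
		lock_acquired_callsite(1234);
		return 0;
	}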

But well, that's not very important. Lockdep is already debugging
code, not enabled in production kernels (at least it's not supposed to be).

The more important point is: do we really need this? Have you observed
lost events with only lock_acquire + lock_acquired events?

You can check lost events with: perf trace -d
You can also tune the size of the buffers with the -m option in
perf record. The default size of perf record is not enough for lock
events.


Optimizing the size of the records is definitely something we want. But
I'm not sure this is the right way.

Something like tuning the fields we want in a record seems to me a
more generic and wide-ranging tracing optimization, a gain that would be
propagated to every tracepoint and not only to a specific and precise
tracing workload on specific tracepoints.

Look at this example:

fweisbec@nowhere:~$ cat /debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 29
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_lock_depth; offset:8; size:4; signed:1;

field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
field:pid_t prev_pid; offset:28; size:4; signed:1;
field:int prev_prio; offset:32; size:4; signed:1;
field:long prev_state; offset:40; size:8; signed:1;
field:char next_comm[TASK_COMM_LEN]; offset:48; size:16; signed:1;
field:pid_t next_pid; offset:64; size:4; signed:1;
field:int next_prio; offset:68; size:4; signed:1;

print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> next_comm=%s next_pid=%d next_prio=%d",
REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S"} , {
2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" }) : "R", REC->next_comm,
REC->next_pid, REC->next_prio


In typical scheduler tracing workloads, we don't need:
- common_flags (irq flags)
- common_preempt_count (if preempt_count != 0 on sched switch, it's a bug)
- common_pid (prev_pid == common_pid, maybe prev_pid should be dropped in fact)
- common_lock_depth (not used yet in perf tools, and the BKL is dying)
- prio (not used yet in perf sched)
- comm (we already have some comm resolution in perf)

I'm just talking about some tracing workloads; typically perf sched won't use the
above. But other kinds of instrumentation may need the prio, for example.

I'll let you imagine how much space we can save if we can bring some choice of the
fields we want, depending on what the user wants and so on.

If someone is willing to work on that, it would be definitely welcome.
I think we could use a kind of mask of fields we want.

Note I think that's easy for the common_* fields, as they are always the same
and filled from tracing_generic_entry_update(); for the others it doesn't
look easy, as we would need to add more macros inside TP_fast_assign().
But having the support for the common_* fields would already be a good
thing.
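
To sketch what such a mask might look like (all of these names are made
up purely for illustration; nothing like this exists in the kernel today):

	#include <stdint.h>
	#include <stdio.h>

	/* Hypothetical per-event opt-out mask for the common_* fields. */
	enum common_field_bits {
		COMMON_FIELD_TYPE		= 1 << 0,	/* common_type */
		COMMON_FIELD_FLAGS		= 1 << 1,	/* common_flags (irq flags) */
		COMMON_FIELD_PREEMPT_COUNT	= 1 << 2,	/* common_preempt_count */
		COMMON_FIELD_PID		= 1 << 3,	/* common_pid */
		COMMON_FIELD_LOCK_DEPTH		= 1 << 4,	/* common_lock_depth */
	};

	int main(void)
	{
		/* A perf-sched-like workload could then ask only for what it needs. */
		uint32_t sched_switch_wanted = COMMON_FIELD_TYPE | COMMON_FIELD_PID;

		printf("requested common field mask: 0x%x\n",
		       (unsigned int)sched_switch_wanted);
		return 0;
	}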

2010-08-01 05:36:47

by Hitoshi Mitake

Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

On 07/29/10 01:39, Frederic Weisbecker wrote:
> On Sun, Jul 25, 2010 at 01:55:29AM +0900, Hitoshi Mitake wrote:
>> [...]
>
>
>
> But those who only want to trace the time waited can just trace
> lock_acquire and lock_acquired event. It seems it would be enough
> to reduce the lock tracing overhead a bit.
>
> Also the problem with:
>
> if (cond)
> trace_x()
> else
> trace_y()
>
> It adds a new condition, only used by tracing, that will always be
> checked, whenever the tracing is on or off. So we lose a cool feature
> of tracepoint here: any tracing code is hidden behind a boolean wall,
> so that the tracing off path is optimized with a single branch check:
>
> static inline void trace_##name(proto) \
> { \
> if (unlikely(__tracepoint_##name.state)) \
> __DO_TRACE(&__tracepoint_##name, \
> TP_PROTO(data_proto), \
> TP_ARGS(data_args)); \
> } \
>
>
> When tracing is off, ie: __tracepoint_##name.state == 0, the only
> extra code executed is the check of __tracepoint_##name.state.
>
> With your new tracepoint, you've added another branch that is
> always checked (unless !CONFIG_TRACEPOINTS, due to compiler
> optimizations).

Sorry, this branch is not a good choice.
But for my purpose, the tracing enable/disable feature is not enough,
because I wanted to suppress the frequency of lock events for contention
analysis, so a branch for judging the occurrence of contention had to be added.

>
> But well, that's not very important. Lockdep is already debugging
> code not enabled in production kernel (at least not supposed to).
>

I want to make lockdep a production kernel feature with dynamic patching.
Current lock event tracing and the in-kernel lock statistics (/proc/lock_stat)
depend on lockdep. The overhead of these features must be divided into
individual parts.

# BTW, I had completely misunderstood the principle of lockdep;
# I think I finally understood it recently :( So I can work on it now.

> The more important point is: do we really need this? Have you observed
> lost events with only lock_acquire + lock_acquired events?
>
> You can check lost events with: perf trace -d
> You can also tune the size of the buffers with the -m option in
> perf record. The default size of perf record is not enough for lock
> events.

The purpose of this patch is suppressing the frequency of events;
loss of events is out of the scope of this patch.

>
>
> Optimizing the size of the records is definetly something we want. But
> I'm not sure this is the good way.
>
> Something like tuning the fields we want in a record seems to me a
> more generic and wide tracing optimization, a gain that would be
> propagated in every tracepoints and not only a specific and precise
> tracing workload on specific tracepoints.
>
> Look at this example:
>
> fweisbec@nowhere:~$ cat /debug/tracing/events/sched/sched_switch/format
> name: sched_switch
> ID: 29
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
> field:int common_lock_depth; offset:8; size:4; signed:1;
>
> field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
> field:pid_t prev_pid; offset:28; size:4; signed:1;
> field:int prev_prio; offset:32; size:4; signed:1;
> field:long prev_state; offset:40; size:8; signed:1;
> field:char next_comm[TASK_COMM_LEN]; offset:48; size:16; signed:1;
> field:pid_t next_pid; offset:64; size:4; signed:1;
> field:int next_prio; offset:68; size:4; signed:1;
>
> print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==>
> next_comm=%s next_pid=%d next_prio=%d",
> REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state ?
> __print_flags(REC->prev_state, "|", { 1, "S"} , {
> 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x"
> }, { 128, "W" }) : "R", REC->next_comm,
> REC->next_pid, REC->next_prio
>
>
> In typical scheduler tracing workloads, we don't need:
> - common_flags (irq flags)
> - common_preempt_count (if preempt_count != 0 on sched switch, it's a bug)
> - common_pid (prev_pid == common_pid, may be prev_pid should be dropped in fact)
> - common_lock_depth (not used yet in perf tools, and the BKL is dying)
> - prio (not used yet in perf sched)
> - comm (we already have some comm resolution in perf)
>
> I just talking about some tracing workloads, typically perf sched won't use the
> aboves. But other kinds of instrumentation may need the prio for example.
>
> I let you imagine how much space we can save if we can bring some choice on the
> fields we want, depending on what the user wants and so.

Yeah, this will contribute to reducing the overhead of tracing.

>
> If someone is willing to work on that, it would be definitely welcome.
> I think we could use a kind of mask of fields we want.
>
> Note I think that's easy for common_* fields as they are always the same
> and filled from tracing_generic_entry_update(), for the others it doesn't
> looks easy, we would need to add more macros inside TP_fast_assign().
> But having the support for common_* fields would be already a good
> thing.
>
>

After I finish my current pending work, like lockdep enabling/disabling,
and if nobody else has done it, I'd like to work on it,
because tracing overhead is a really critical problem for lock events.
I want to try any method for reducing overhead.
# But I don't have enough capacity, so this will be very slow :(

2010-08-01 07:56:29

by Hitoshi Mitake

Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

On 2010-08-01 14:36, Hitoshi Mitake wrote:
> [...]

Oops, I received a mail from the mailer daemon of oracle.com saying that
"[email protected]" is an invalid address.
I have added the email address "[email protected]" instead; sorry, Jens.

2010-08-02 12:24:20

by Peter Zijlstra

Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

On Sun, 2010-08-01 at 14:36 +0900, Hitoshi Mitake wrote:
> I want to make lockdep production kernel feature with dynamic
> patching.

Not really feasible. Note that enabling lockdep grows the size of
spinlock_t (and others) quite significantly.
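
For illustration, a self-contained sketch with stand-in types (not the
kernel's actual spinlock_types.h / lockdep.h definitions, and the sizes
are not authoritative) of why embedding per-lock lockdep state makes
every lock bigger:

	#include <stdio.h>

	struct lockdep_map_sketch {	/* rough stand-in for struct lockdep_map */
		void *key;		/* lock class key */
		void *class_cache;	/* cached lock class */
		const char *name;
		int cpu;		/* lock_stat-style extras */
		unsigned long ip;
	};

	struct spinlock_plain {		/* roughly: just the arch lock word */
		unsigned int raw_lock;
	};

	struct spinlock_with_lockdep {	/* same, plus per-lock lockdep state */
		unsigned int raw_lock;
		struct lockdep_map_sketch dep_map;
	};

	int main(void)
	{
		printf("spinlock without lockdep: %zu bytes\n",
		       sizeof(struct spinlock_plain));
		printf("spinlock with lockdep   : %zu bytes\n",
		       sizeof(struct spinlock_with_lockdep));
		return 0;
	}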

Then there is the problem that you need to pass all tasks through some
quiescent state in order to enable lockdep (there must be a guarantee of
no locks held).

Now, we could possibly make it all work, but I'm not at all convinced we
want to pay the price, which is a much _much_ more complex
infrastructure.

2010-08-03 07:37:08

by Hitoshi Mitake

Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

On 08/02/10 21:23, Peter Zijlstra wrote:
> On Sun, 2010-08-01 at 14:36 +0900, Hitoshi Mitake wrote:
>> I want to make lockdep production kernel feature with dynamic
>> patching.
>
> Not really feasible. Note that enabling lockdep grows the size of
> spinlock_t (and others) quite significantly.
>
> Then there is the problem that you need to pass all tasks through some
> quiesent state in order to enable lockdep (there must be a guarantee of
> no locks held).
>
> Now, we could possibly make it all work, but I'm not at all convinced we
> want to pay the price, which is a much _much_ more complex
> infrastructure.
>
>

I have to agree about the problem of the huge cost you mentioned above.

But I believe that
- lock validation
- lock tracing
- in-kernel lightweight lock usage statistics
should be divided into individual features, at least.
And tracing and statistics should ideally be available in production kernels.

What do you think about this point?

2010-08-03 08:01:05

by Peter Zijlstra

Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis

On Tue, 2010-08-03 at 16:37 +0900, Hitoshi Mitake wrote:
> But, I believe that
> - lock validation
> - lock tracing
> - in kernel light weight lock usage statistics
> should be divided into individual features, at least.
> And tracing and statistics should be available for production kernel,
> ideally.
>
> How do you think about this point?

Both the lock usage statistics and the lock validation use the lock classification
scheme and thus require the bloated lock structures, and hence I don't
think it's at all reasonable to do that at runtime.

Lock tracing could possibly be done by patching all lock sites, but we'd
have to be somewhat careful there too not to make the whole spinlock
code even worse than it already is.