Subject: Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime" for contention analysis
From: Hitoshi Mitake
Date: Sun, 01 Aug 2010 16:56:22 +0900
To: Frederic Weisbecker, Jens Axboe
Cc: Steven Rostedt, linux-kernel@vger.kernel.org, h.mitake@gmail.com, Ingo Molnar, Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo, Jason Baron, Xiao Guangrong
In-Reply-To: <4C5507F2.7090503@dcl.info.waseda.ac.jp>

On 2010-08-01 14:36, Hitoshi Mitake wrote:
> On 07/29/10 01:39, Frederic Weisbecker wrote:
> > On Sun, Jul 25, 2010 at 01:55:29AM +0900, Hitoshi Mitake wrote:
> >> Recording every lock event produces too much overhead, both in CPU
> >> time and in memory space. This overhead is a big problem which should
> >> be solved eventually, but locks are an important factor today.
> >>
> >> So I'd like to add a new lock event, "lock_acquired_waittime", for
> >> contention analysis. lock_acquired() calculates the time spent waiting
> >> for the lock from held_lock->waittime_stamp, which is set by
> >> lock_contended(). If this value is 0, no contention occurred; if it is
> >> not 0, contention occurred. So I inserted a branch on this value into
> >> lock_acquired(): the new lock_acquired() calls trace_lock_acquired()
> >> when the waittime is 0, and calls trace_lock_acquired_waittime(), the
> >> function for the new event, when the waittime is not 0.
> >>
> >> The traditional event, lock_acquired, is a very high frequency event,
> >> but lock_acquired_waittime is not, so the overhead produced by tracing
> >> it is not so large.
> >>
> >> This is an example:
> >> | mitake@X201i:~/linux/.../tools/perf% ./perf bench sched messaging
> >> | # Running sched/messaging benchmark...
> >> | # 20 sender and receiver processes per group
> >> | # 10 groups == 400 processes run
> >> |
> >> | Total time: 1.712 [sec]
> >> | mitake@X201i:~/linux/.../tools/perf% sudo ./perf record -R -f -m 1024 -c 1 -e lock:lock_acquired_waittime:r ./perf bench sched messaging
> >> | # Running sched/messaging benchmark...
> >> | # 20 sender and receiver processes per group
> >> | # 10 groups == 400 processes run
> >> |
> >> | Total time: 1.754 [sec]
> >> | [ perf record: Woken up 1 times to write data ]
> >> | [ perf record: Captured and wrote 0.958 MB perf.data (~41837 samples) ]
> >>
> >> "perf bench sched messaging" is a kernel-bound program, but this
> >> result shows that the overhead is, at least on my Core i3 laptop,
> >> small enough to disregard, especially from a CPU time perspective.
> >>
> >> This new event is not enough for analyzing complete lock behaviour,
> >> but it makes it possible to measure the quantity of time consumed by
> >> lock contention.
> >>
> >> So I think this patch is useful for real world lock analysis. I'd
> >> like to write a new analyzer in Python which uses this event.
> >>
> >> Cc: Ingo Molnar
> >> Cc: Peter Zijlstra
> >> Cc: Paul Mackerras
> >> Cc: Arnaldo Carvalho de Melo
> >> Cc: Jens Axboe
> >> Cc: Jason Baron
> >> Cc: Xiao Guangrong
> >> Signed-off-by: Hitoshi Mitake
> >>
> >> ---
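(To make the proposed control flow concrete for this thread, here is a
simplified sketch of the patched lock_acquired() path. It is not the
actual diff: the hlock parameter, the lockstat_clock() call and the
tracepoint argument lists are illustrative assumptions, not the exact
in-tree signatures.)

	/* Sketch only: contended acquisitions hit the new, low-frequency
	 * event; uncontended ones keep using the existing event. */
	static void lock_acquired_sketch(struct lockdep_map *lock,
					 struct held_lock *hlock,
					 unsigned long ip)
	{
		u64 waittime = 0;

		/* waittime_stamp was set by lock_contended() if we had to wait */
		if (hlock->waittime_stamp)
			waittime = lockstat_clock() - hlock->waittime_stamp;

		if (waittime)
			/* contention happened: rare, so tracing it is cheap */
			trace_lock_acquired_waittime(lock, ip, waittime);
		else
			/* no contention: the very high frequency common case */
			trace_lock_acquired(lock, ip, waittime);
	}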
> >
> > But those who only want to trace the time waited can just trace the
> > lock_acquire and lock_acquired events. It seems that would be enough
> > to reduce the lock tracing overhead a bit.
> >
> > There is also the problem with:
> >
> >	if (cond)
> >		trace_x()
> >	else
> >		trace_y()
> >
> > It adds a new condition, only used by tracing, that will always be
> > checked, whether tracing is on or off. So we lose a cool feature of
> > tracepoints here: any tracing code is hidden behind a boolean wall, so
> > that the tracing-off path is optimized to a single branch check:
> >
> >	static inline void trace_##name(proto)				\
> >	{								\
> >		if (unlikely(__tracepoint_##name.state))		\
> >			__DO_TRACE(&__tracepoint_##name,		\
> >				   TP_PROTO(data_proto),		\
> >				   TP_ARGS(data_args));			\
> >	}								\
> >
> > When tracing is off, i.e. __tracepoint_##name.state == 0, the only
> > extra code executed is the check of __tracepoint_##name.state.
> >
> > With your new tracepoint, you've added another branch that is always
> > checked (unless !CONFIG_TRACEPOINTS, due to compiler optimizations).
>
> Sorry, this branch is not a good choice. But for my purpose the tracing
> enable/disable feature is not enough: I wanted to suppress the
> frequency of lock events for contention analysis, so a branch judging
> the occurrence of contention had to be added.
>
> > But well, that's not very important. Lockdep is already debugging
> > code, not enabled in production kernels (at least it's not supposed
> > to be).
>
> I want to make lockdep a production kernel feature with dynamic
> patching. The current lock event tracing and the in-kernel lock
> statistics (/proc/lock_stat) depend on lockdep, so the overhead of
> these features must be divided into individual parts.
>
> # BTW, I had completely misunderstood the principle of lockdep;
> # I think I finally understood it recently :( So I can work on it now.
>
> > The more important point is: do we really need this? Have you
> > observed lost events with only the lock_acquire + lock_acquired
> > events?
> >
> > You can check lost events with: perf trace -d
> > You can also tune the size of the buffers with the -m option of perf
> > record. The default size used by perf record is not enough for lock
> > events.
>
> The purpose of this patch is suppressing the frequency of events; loss
> of events is out of the scope of this patch.
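(For reference, the check Frederic suggests above would look something
like the following; the -m value and the event list here are just an
illustration, not a setup I have measured for this thread:)

	$ sudo ./perf record -R -f -m 4096 -c 1 \
		-e lock:lock_acquire:r -e lock:lock_acquired:r \
		./perf bench sched messaging
	$ sudo ./perf trace -d	# check for lost events, as suggested above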
> > Optimizing the size of the records is definitely something we want.
> > But I'm not sure this is the good way.
> >
> > Something like tuning the fields we want in a record seems to me a
> > more generic and wide tracing optimization, a gain that would be
> > propagated to every tracepoint and not only to a specific and precise
> > tracing workload on specific tracepoints.
> >
> > Look at this example:
> >
> > fweisbec@nowhere:~$ cat /debug/tracing/events/sched/sched_switch/format
> > name: sched_switch
> > ID: 29
> > format:
> >	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
> >	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
> >	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
> >	field:int common_pid;	offset:4;	size:4;	signed:1;
> >	field:int common_lock_depth;	offset:8;	size:4;	signed:1;
> >
> >	field:char prev_comm[TASK_COMM_LEN];	offset:12;	size:16;	signed:1;
> >	field:pid_t prev_pid;	offset:28;	size:4;	signed:1;
> >	field:int prev_prio;	offset:32;	size:4;	signed:1;
> >	field:long prev_state;	offset:40;	size:8;	signed:1;
> >	field:char next_comm[TASK_COMM_LEN];	offset:48;	size:16;	signed:1;
> >	field:pid_t next_pid;	offset:64;	size:4;	signed:1;
> >	field:int next_prio;	offset:68;	size:4;	signed:1;
> >
> > print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> next_comm=%s next_pid=%d next_prio=%d",
> >	REC->prev_comm, REC->prev_pid, REC->prev_prio,
> >	REC->prev_state ? __print_flags(REC->prev_state, "|",
> >		{ 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" },
> >		{ 32, "X" }, { 64, "x" }, { 128, "W" }) : "R",
> >	REC->next_comm, REC->next_pid, REC->next_prio
> >
> > In typical scheduler tracing workloads, we don't need:
> >  - common_flags (irq flags)
> >  - common_preempt_count (if preempt_count != 0 on sched switch, it's a bug)
> >  - common_pid (prev_pid == common_pid; maybe prev_pid should be dropped in fact)
> >  - common_lock_depth (not used yet in perf tools, and the BKL is dying)
> >  - prio (not used yet in perf sched)
> >  - comm (we already have some comm resolution in perf)
> >
> > I'm just talking about some tracing workloads; typically perf sched
> > won't use the above, but other kinds of instrumentation may need the
> > prio, for example.
> >
> > I let you imagine how much space we can save if we can bring some
> > choice of the fields we want, depending on what the user wants and so
> > on.
>
> Yeah, this will contribute to reducing the overhead of tracing.
>
> > If someone is willing to work on that, it would definitely be
> > welcome. I think we could use a kind of mask of the fields we want.
> >
> > Note I think that's easy for the common_* fields, as they are always
> > the same and filled from tracing_generic_entry_update(); for the
> > others it doesn't look easy, we would need to add more macros inside
> > TP_fast_assign(). But having support for the common_* fields would
> > already be a good thing.
>
> After I finish my current pending work like lockdep
> enabling/disabling, and if nobody else does it, I'd like to work on
> it, because tracing overhead is a really critical problem for lock
> events. I want to try any method for reducing the overhead.
> # But I don't have enough power, so this will be very slow :(

Oops, I received a mail from the mailer daemon of oracle.com saying that
"jens.axboe@oracle.com" is invalid. I have added the email address
"jaxboe@fusionio.com"; sorry, Jens.
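P.S. To put a rough number on the space Frederic mentions, using only
the field sizes quoted in the sched_switch format above and ignoring
alignment padding and the ring buffer's per-event header: one record
payload is 72 bytes, and dropping common_flags (1) + common_preempt_count
(1) + common_pid (4) + common_lock_depth (4) + prev_comm/next_comm
(16 + 16) + prev_prio/next_prio (4 + 4) removes 50 of those 72 bytes,
so a field mask could shrink such a record to roughly a third of its
current size.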