Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758940AbbEFEOM (ORCPT ); Wed, 6 May 2015 00:14:12 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:13794 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758909AbbEFEOG (ORCPT ); Wed, 6 May 2015 00:14:06 -0400 Message-ID: <554994F0.7010406@huawei.com> Date: Wed, 6 May 2015 12:13:36 +0800 From: Yunlong Song User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Rabin Vincent , David Ahern CC: , , , "Arnaldo Carvalho de Melo" , , Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process References: <553A45CA.8020808@huawei.com> <553A4C18.3030609@gmail.com> <553B9F30.1040100@huawei.com> <553BB895.20800@gmail.com> <20150505215328.GA3397@debian> In-Reply-To: <20150505215328.GA3397@debian> Content-Type: text/plain; charset="windows-1252" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.111.74.205] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4243 Lines: 99 On 2015/5/6 5:53, Rabin Vincent wrote: > On Sat, Apr 25, 2015 at 09:53:57AM -0600, David Ahern wrote: >> On 4/25/15 8:05 AM, Yunlong Song wrote: >>> But this only shows the system call like strace, but we want the call >>> stack of kernel functions in fact. >>> >> We haven't added the callchain option yet; on the to-do list. >> >> perf trace record -g -- iozone ... >> perf trace -i perf.data -s >> --> summary of system calls, max/min/average times >> >> perf trace -i perf.data --duration 10.0 -T >> --> note the timestamp where the write took a "long" time >> >> perf script >> --> search down to *around* the time of interest; you want the syscall >> entry; timestamp is for exit > > Now if I understood the use case right, what Yulong Song wants to know > is what the iozone process is doing in the kernel (i.e. the stacktrace of why > exactly it goes to sleep / what it's waiting for) during these > sys_writes which take a long time. > > The commands above will identify the sys_write which takes time but only > provide the stacktrace at the entry and exit of the syscall, but this do > not show why the process blocked or what it did inside the system call. > > So a way to get what is required for the use case would be to make the > following changes to the above sequence: > > (1) include the sched:* events when perf trace record is run > > (2) around the time of interest, look at the kernel stack st the sched:switch > events between the entry and the exit. This will show what the process was > waiting for when it when it blocked. The stacktraces at the > stat_runtime events in the process may also be useful to understand what > was going on. > > Example: > > $ perf trace record -g -e sched:* -- dd if=/dev/zero of=/x bs=10M count=100 conv=fsync > $ perf trace -i perf.data -s > > dd (147), 364 events, 94.3%, 0.000 msec > > syscall calls min avg max stddev > --------------- -------- --------- --------- --------- ------ > write 63 266.019 316.896 963.413 4.69% > ... > > $ perf trace -i perf.data --duration 960 -T > 91916.354 (963.413 ms): dd/147 write(arg0: 1, arg1: 139729327423488, arg2: 10485760, arg3: 582, arg4: 100, arg5: 72340172838076673) = 10485760 > > $ perf script > ... > > dd 147 [002] 90.952941: raw_syscalls:sys_enter: NR 1 (1, 7f1544ed0000, a00000, 246, 64, 101010101010101) > 211a00 syscall_trace_enter_phase2 ([kernel.kallsyms]) > 7d79c3 tracesys_phase2 ([kernel.kallsyms]) > ed20 [unknown] (/lib/libpthread-2.16.so) > > (... trimmed everything except the longest block: ...) > > dd 147 [002] 91.204723: sched:sched_switch: prev_comm=dd prev_pid=147 prev_prio=120 prev_state=D ==> next_comm=swapper/2 next_pid=0 next_prio=120 > 7d1121 __schedule ([kernel.kallsyms]) > 7d1817 schedule ([kernel.kallsyms]) > 4578a5 jbd2_log_wait_commit ([kernel.kallsyms]) > 4546ff jbd2_log_do_checkpoint ([kernel.kallsyms]) > 45489f __jbd2_log_wait_for_space ([kernel.kallsyms]) > 44c2d8 start_this_handle ([kernel.kallsyms]) > 44c7b6 jbd2__journal_start ([kernel.kallsyms]) > 42fc3b __ext4_journal_start_sb ([kernel.kallsyms]) > 4012e9 ext4_write_begin ([kernel.kallsyms]) > 33795f generic_perform_write ([kernel.kallsyms]) > 338b68 __generic_file_write_iter ([kernel.kallsyms]) > 3f4371 ext4_file_write_iter ([kernel.kallsyms]) > 381ae1 __vfs_write ([kernel.kallsyms]) > 381f07 vfs_write ([kernel.kallsyms]) > 3825dd sys_write ([kernel.kallsyms]) > 7d7a1a tracesys_phase2 ([kernel.kallsyms]) > ed20 [unknown] (/lib/libpthread-2.16.so) > > swapper 0 [002] 91.875573: sched:sched_wakeup: comm=dd pid=147 prio=120 success=1 target_cpu=002 > > . > Good, it's clearer with "sched:*" event. -- Thanks, Yunlong Song -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/