Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759462AbbDYOES (ORCPT ); Sat, 25 Apr 2015 10:04:18 -0400 Received: from szxga03-in.huawei.com ([119.145.14.66]:10308 "EHLO szxga03-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751472AbbDYOEP (ORCPT ); Sat, 25 Apr 2015 10:04:15 -0400 Message-ID: <553B9EC0.9060001@huawei.com> Date: Sat, 25 Apr 2015 22:03:44 +0800 From: Yunlong Song User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: , , , "Arnaldo Carvalho de Melo" CC: , , Subject: Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process References: <553A45CA.8020808@huawei.com> <553A49CF.1020905@huawei.com> In-Reply-To: <553A49CF.1020905@huawei.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.111.74.205] X-CFilter-Loop: Reflected X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A020201.553B9ED1.021D,ss=1,re=0.000,recu=0.000,reip=0.000,cl=1,cld=1,fgs=0, ip=0.0.0.0, so=2013-05-26 15:14:31, dmn=2013-03-21 17:37:32 X-Mirapoint-Loop-Id: 74c813c4f1ec380f59569ad0e845b2e4 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6593 Lines: 175 On 2015/4/24 21:49, Yunlong Song wrote: > On 2015/4/24 21:31, Yunlong Song wrote: >> [Profiling Background] >> Now we are profiling the performance of ext4 and f2fs on an eMMC card with iozone, >> we find a case that ext4 is better than f2fs in random write under the test of >> "iozone -s 262144 -r 64 -i 0 -i 2". We want to analyze the I/O delay of the two >> file systems. We have got a conclusion that 1% of sys_write takes up 60% time of >> the overall sys_write (262144/64=4096). We want to find out the call stack during >> this specific 1% sys_write. Our idea is to record the stack in a certain time period >> and since the specific 1% case takes up 60% time, the total number of records of its >> stack should also takes up 60% of the total records, then we can recognize those stacks >> and figure out what the I/O stack of f2fs is doing in the 1% case. >> >> [Profiling Problem] >> >> Although perf can record the events (with call stack) of a specified pid, e.g. using >> "perf record -g iozone -s 262144 -r 64 -i 0 -i 2". But we find iozone is interrupted >> and the CPU is scheduled to other process. As a result, perf will not record any events >> of iozone until iozone's context is recovered and the CPU is scheduled to continue >> processing the sys_write of iozone. > >> This obeys our initial idea which is described in [Profiling Background], > > This "disobeys" our initial idea which is described in [Profiling Background], > > since we cannot recognize the call stacks of the specific 1% case >> by using the ratio of the record number. >> >> [Alternative Solution without Perf] >> We can use /proc/#pid/stack to record the stack in a certain period (e.g. 1ms) of iozone, >> no matter whether iozone is interrupted or not. However, we have not taken a deep sight >> into this, since we want to use perf to do this kind of thing. >> >> [Question about Perf] >> So we have a question that "How does perf still record the stack of a specified pid even >> when that process is interrupted and CPU is scheduled to other process?" >> > > Here is another example: $ cat test.c void func_loop(void) { int i, sum = 0; for (i = 0; i < 100000; i++) sum += 1; } void func_sleep(void) { sleep(1); } int main(void) { while (1) { func_loop(); func_sleep(); } return 0; } $ gcc test.c -o test -O0 $ ./test $ ps aux | grep test Yunlong+ 133798 0.0 0.0 4056 348 pts/7 S+ 21:52 0:00 ./test $ perf record -g -p 133798 sleep 5 $ perf report -g --stdio # To display the perf.data header info, please use --header/--header-only options. # # Samples: 16 of event 'cycles' # Event count (approx.): 2439331 # # Children Self Command Shared Object Symbol # ........ ........ ....... ................. ............................... # 79.46% 76.48% test test [.] func_loop | ---func_loop | |--97.07%-- main | __libc_start_main --2.93%-- [...] 77.14% 0.00% test libc-2.19.so [.] __libc_start_main | ---__libc_start_main 77.14% 0.00% test test [.] main | ---main __libc_start_main 20.54% 0.00% test [unknown] [.] 0000000000000000 | ---0 20.54% 0.00% test libc-2.19.so [.] __GI___libc_nanosleep | ---__GI___libc_nanosleep 0 20.54% 0.00% test [unknown] [k] 0x00007f2fa07ef6b0 | ---0x7f2fa07ef6b0 __GI___libc_nanosleep 0 20.54% 0.00% test [kernel.kallsyms] [k] system_call_fastpath | ---system_call_fastpath 0x7f2fa07ef6b0 __GI___libc_nanosleep As shown above, the func_loop is regarded as the main operation and its call stack is showed on the top of perf report. However, we know that func_sleep is really the main operation from the view of time duration indeed. And if we want to its actual call stack in the real time, we can see the result as following: $ while true ; do cat /proc/133798/stack ; sleep 0.1 ; done [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff [] hrtimer_nanosleep+0xa3/0x160 [] SyS_nanosleep+0x5d/0x80 [] system_call_fastpath+0x16/0x1b [<00007f2fa07ef6b0>] 0x7f2fa07ef6b0 [] 0xffffffffffffffff As shown above, func_sleep is regarded as the main operation from the view of time and its call stack is printed in real time. -- Thanks, Yunlong Song -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/