From: Will Hawkins
Date: Tue, 13 Jun 2017 14:02:08 -0400
Subject: Re: Ftrace vs perf user page fault statistics differences
To: Steven Rostedt
Cc: LKML

On Tue, Jun 13, 2017 at 9:06 AM, Steven Rostedt wrote:
> On Mon, 12 Jun 2017 22:05:05 -0400
> Will Hawkins wrote:
>
>> > Can you supply this program. So I can see exactly what it does?
>>
>> I have attached the binary to this email. I also provided the source
>> so you can reproduce the situation directly. You can use the makefile
>> provided as long as you have nasm installed.
>>
>> Although this is a response to a later question, I won't bury the
>> lede: The program was written by hand in asm, compiled with nasm to
>> object code and then run through gcc to get the output binary. Here is
>> how that works:
>>
>> nasm -felf64 page.s && gcc -nostdlib page.o -o page
>>
>> The result, as far as ldd is concerned:
>>
>> hawkinsw@moormans:~/code/one_page_play$ ldd page
>>         not a dynamic executable
>
> OK, that makes sense.
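(An aside, in case the attachment gets scrubbed by the list: page.s is
essentially shaped like the sketch below. This is a minimal
reconstruction rather than the attached source itself, so details may
differ; the property that matters is that _start and target sit on
separate pages, so executing each one takes its own fault.)

; page.s: minimal sketch of the test program (not the attached source)
section .text
global _start

_start:
        call target             ; first touch of the page holding target
        mov rax, 60             ; sys_exit
        xor rdi, rdi            ; exit status 0
        syscall

        times 4096 db 0x90      ; a full page of NOPs pushes target
                                ; past the next page boundary

target:
        ret

It builds with the nasm/gcc line quoted above and, like the attached
binary, is not a dynamic executable.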
>
>> >> I looked at the results:
>> >>
>> >> perf report
>> >>
>> >> and the results were as I expected. There were two page faults for
>> >> loading the code into memory and a page fault in
>> >> copy_user_enhanced_fast_string invoked by execve's implementation when
>> >> loading the binary.
>> >
>> > What does perf script show you?
>>
>> Here is what perf report --stdio shows:
>>
>> # Overhead  Command  Shared Object      Symbol
>> # ........  .......  .................  ..................................
>> #
>>     33.33%  page  page               [.] _start
>>             |
>>             --- _start
>>
>>     33.33%  page  page               [.] target
>>             |
>>             --- target
>>
>>     33.33%  page  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>>             |
>>             --- copy_user_enhanced_fast_string
>>                 load_elf_binary
>>                 search_binary_handler
>>                 do_execve_common.isra.23
>>                 sys_execve
>>                 stub_execve
>>                 0x7f51698a31e7
>>
>> Here is what perf script shows:
>>
>> page 28787 113626.587935: page-faults:
>>     ffffffff81376609 copy_user_enhanced_fast_string ([kernel.kallsyms])
>>     ffffffff812197bf load_elf_binary ([kernel.kallsyms])
>>     ffffffff811c76df search_binary_handler ([kernel.kallsyms])
>>     ffffffff811c8c8e do_execve_common.isra.23 ([kernel.kallsyms])
>>     ffffffff811c9176 sys_execve ([kernel.kallsyms])
>>     ffffffff8173afc9 stub_execve ([kernel.kallsyms])
>>         7f51698a31e7 [unknown] ([unknown])
>>
>> page 28787 113626.587961: page-faults:
>>       4000e0 _start (/home/hawkinsw/code/one_page_play/page)
>>
>> page 28787 113626.587968: page-faults:
>>       401473 target (/home/hawkinsw/code/one_page_play/page)
>>
>> Again, this seems like exactly what I would expect.
>>
>> >> I decided to run the application under ftrace just for fun. I wanted
>> >> an excuse to learn more about it and this seemed like the perfect
>> >> chance. I used the incredible trace-cmd suite for the actual
>> >> incantation of ftrace. I won't include the actual incantations here
>> >> because I used many of them while digging around.
>> >
>> > what events did you enable?
>>
>> Sorry for not including this in my first email. I enabled
>> exceptions:page_fault_user. I ran trace-cmd record like this:
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>> handle_mm_fault ../one_page_play/page
>
> See if things change if you add -F. Perf by default only traces the
> application you specify. trace-cmd by default traces everything.
>
> That is, "trace-cmd record -F" is similar to "perf record". Whereas
> "trace-cmd record" is similar to "perf record -a".
>

Thank you for pointing this out. I had been using -F for exactly the
reason that you mentioned. I failed to include it in the command that I
sent along. Very sorry for the confusion. Here is an updated version of
the command that I issued:

sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
handle_mm_fault -F ../one_page_play/page

I then generated a report:

./trace-cmd report --profile

and I see the following (among some other output):

  Event: page_fault_user:0x7f094f7dd104 (1)
  Event: page_fault_user:0x4000e0 (1)
  Event: page_fault_user:0x7f094f7eae4a (1)
  Event: page_fault_user:0x7f094f860d40 (1)
  Event: page_fault_user:0x7f094f7db560 (1)
  Event: page_fault_user:0x4040cb (1)
  Event: page_fault_user:0x401825 (1)
  Event: page_fault_user:0x401473 (1)
  Event: page_fault_user:0x7f094f7e64c4 (1)
  Event: page_fault_user:0x7f094f7f1212 (1)

That output comes from under the "task: page-" heading, so it seems
like those faults are being attributed to the page task.

This command seems to show something interesting:

sudo ./trace-cmd record -e exceptions:page_fault_user -p function_graph
-g __do_fault -F ../one_page_play/page

and the relevant output from ./trace-cmd report --profile is

task: page-4032
  Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max: 2398(ts:170150.060916) Min: 855(ts:170150.054713)
  Event: page_fault_user:0x7ffad3143d40 (1)
  Event: page_fault_user:0x4000e0 (1)
  Event: page_fault_user:0x401473 (1)
  Event: page_fault_user:0x7ffad30c94c4 (1)

This is closer to what I would expect. The first of the two 0x4...
addresses is the entry point and the second is the target. Basically,
that is exactly what I expect. The other two are the "suspicious"
entries. Neither matches the copy_user_enhanced_fast_string symbol
location, and neither is loaded in the binary (according to gdb).
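(For completeness, the gdb check was along these lines, using the
addresses from the profile above; inside the image, info symbol
resolves 0x401473 to target, while the 0x7ffa... addresses match
nothing in the on-disk binary:

gdb -batch ./page -ex 'info symbol 0x401473' -ex 'info symbol 0x7ffad3143d40'

The exact incantation may have differed, but that is the substance of
the check.)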
It is odd to me that the output from the previous command includes
information about the trace-cmd process since I specified the -F
option.

But, back to exactly what you asked. Here is the result of running
perf again with the -a option. I ran this command:

sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page

and here is what perf report shows:

    50.41%  page  ld-2.19.so         [.] do_lookup_x
            |
            --- do_lookup_x

    44.21%  perf  [kernel.kallsyms]  [k] iov_iter_fault_in_readable
            |
            --- iov_iter_fault_in_readable
                generic_file_buffered_write
                __generic_file_aio_write
                generic_file_aio_write
                ext4_file_write
                do_sync_write
                vfs_write
                sys_write
                system_call_fastpath
                __write_nocancel
                0x4081a5
                0x407a40
                __libc_start_main

     4.13%  perf  perf               [.] 0x0000000000015b54
            |
            --- 0x415b54
                0x4081a5
                0x407a40
                __libc_start_main

     0.41%  page  page               [.] _start
            |
            --- _start

     0.41%  page  page               [.] target
            |
            --- target

     0.41%  page  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
            |
            --- copy_user_enhanced_fast_string
                load_elf_binary
                search_binary_handler
                do_execve_common.isra.23
                sys_execve
                stub_execve
                __execve

What is interesting is that the output differs based on whether I've
dropped the kernel caches before I run the perf record. When I do
that, there are no page faults attributed to the entry point or the
target of the program. I would imagine that, after dropping caches,
the readahead handler picks up those pages when the binary is loaded
and negates the need for a page fault.

That, indeed, seems to be the case. I can see that when I run perf
with an additional block:block_rq_issue event. Immediately after
dropping the caches, there is a block request event. On subsequent
executions, there is no such event, but there are the page faults
that I expect.

What I did notice is that some of the page faults come from the task
before execve is called. From what I've seen, this looks like the
kernel reclaiming pages from the spawning process before it is
replaced with the new binary (during the call to execve). After the
execve, there seem to be two page faults:

 page-4613 [006] 171795.748310: funcgraph_entry:   0.151 us | mutex_unlock();
 page-4613 [006] 171795.748313: funcgraph_entry:   0.166 us | __fsnotify_parent();
 page-4613 [006] 171795.748313: funcgraph_entry:   0.321 us | fsnotify();
 page-4613 [006] 171795.748314: funcgraph_entry:   0.090 us | __sb_end_write();
 page-4613 [006] 171795.748317: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [006] 171795.748317: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x4
 page-4613 [006] 171795.748319: funcgraph_exit:    2.254 us | }
 page-4613 [006] 171795.748321: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [006] 171795.748322: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x14
 page-4613 [006] 171795.748323: funcgraph_exit:    1.144 us | }

NOTICE THIS:

 page-4613 [006] 171795.748324: funcgraph_entry:            | sys_execve() {
 page-4613 [007] 171795.748391: block_rq_issue:   8,0 R 0 () 764812912 + 16 [trace-cmd]
 page-4613 [005] 171795.759476: funcgraph_exit:  # 11152.111 us | }

 page-4613 [005] 171795.759477: funcgraph_entry:   3.745 us | do_notify_resume();
 page-4613 [005] 171795.759481: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [005] 171795.759482: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x14
 page-4613 [005] 171795.759487: funcgraph_exit:    5.833 us | }
 page-4613 [005] 171795.759488: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [005] 171795.759489: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x14
 page-4613 [005] 171795.759490: funcgraph_exit:    2.003 us | }
 page-4613 [005] 171795.759492: funcgraph_entry:            | sys_exit() {

I wish that I could get something "better" than
"address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those
page faults. That would really tell me more about whether this is
the "correct" behavior.

As ever, thank you very much for your help! Using these tools has
been an incredible learning experience. I still think that I am just
missing something stupid, but I really appreciate your patience.

Thanks again!
Will

> -- Steve