From: Will Hawkins
Date: Tue, 13 Jun 2017 14:02:08 -0400
Subject: Re: Ftrace vs perf user page fault statistics differences
To: Steven Rostedt
Cc: LKML

On Tue, Jun 13, 2017 at 9:06 AM, Steven Rostedt wrote:
> On Mon, 12 Jun 2017 22:05:05 -0400
> Will Hawkins wrote:
>
>> > Can you supply this program. So I can see exactly what it does?
>>
>> I have attached the binary to this email. I also provided the source
>> so you can reproduce the situation directly. You can use the makefile
>> provided as long as you have nasm installed.
>>
>> Although this is a response to a later question, I won't bury the
>> lede: The program was written by hand in asm, compiled with nasm to
>> object code and then run through gcc to get the output binary. Here is
>> how that works:
>>
>> nasm -felf64 page.s && gcc -nostdlib page.o -o page
>>
>> The result, as far as ldd is concerned:
>>
>> hawkinsw@moormans:~/code/one_page_play$ ldd page
>>         not a dynamic executable
>
> OK, that makes sense.
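(An aside, in case the attachment gets scrubbed by the list: page.s is
essentially shaped like the sketch below. This is a minimal
reconstruction rather than the attached source itself, so details may
differ; the property that matters is that _start and target sit on
separate pages, so executing each one takes its own fault.)

; page.s: minimal sketch of the test program (not the attached source)
section .text
global _start

_start:
        call target             ; first touch of the page holding target
        mov rax, 60             ; sys_exit
        xor rdi, rdi            ; exit status 0
        syscall

        times 4096 db 0x90      ; a full page of NOPs pushes target
                                ; past the next page boundary

target:
        ret

It builds with the nasm/gcc line quoted above and, like the attached
binary, is not a dynamic executable.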
>
>> >> I looked at the results:
>> >>
>> >> perf report
>> >>
>> >> and the results were as I expected. There were two page faults for
>> >> loading the code into memory and a page fault in
>> >> copy_user_enhanced_fast_string invoked by execve's implementation when
>> >> loading the binary.
>> >
>> > What does perf script show you?
>>
>> Here is what perf report --stdio shows:
>>
>> # Overhead  Command  Shared Object      Symbol
>> # ........  .......  .................  ..................................
>> #
>>     33.33%  page  page               [.] _start
>>             |
>>             --- _start
>>
>>     33.33%  page  page               [.] target
>>             |
>>             --- target
>>
>>     33.33%  page  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>>             |
>>             --- copy_user_enhanced_fast_string
>>                 load_elf_binary
>>                 search_binary_handler
>>                 do_execve_common.isra.23
>>                 sys_execve
>>                 stub_execve
>>                 0x7f51698a31e7
>>
>> Here is what perf script shows:
>>
>> page 28787 113626.587935: page-faults:
>>     ffffffff81376609 copy_user_enhanced_fast_string ([kernel.kallsyms])
>>     ffffffff812197bf load_elf_binary ([kernel.kallsyms])
>>     ffffffff811c76df search_binary_handler ([kernel.kallsyms])
>>     ffffffff811c8c8e do_execve_common.isra.23 ([kernel.kallsyms])
>>     ffffffff811c9176 sys_execve ([kernel.kallsyms])
>>     ffffffff8173afc9 stub_execve ([kernel.kallsyms])
>>         7f51698a31e7 [unknown] ([unknown])
>>
>> page 28787 113626.587961: page-faults:
>>       4000e0 _start (/home/hawkinsw/code/one_page_play/page)
>>
>> page 28787 113626.587968: page-faults:
>>       401473 target (/home/hawkinsw/code/one_page_play/page)
>>
>> Again, this seems like exactly what I would expect.
>>
>> >> I decided to run the application under ftrace just for fun. I wanted
>> >> an excuse to learn more about it and this seemed like the perfect
>> >> chance. I used the incredible trace-cmd suite for the actual
>> >> incantation of ftrace. I won't include the actual incantations here
>> >> because I used many of them while digging around.
>> >
>> > what events did you enable?
>>
>> Sorry for not including this in my first email. I enabled
>> exceptions:page_fault_user. I ran trace-cmd record like this:
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>> handle_mm_fault ../one_page_play/page
>
> See if things change if you add -F. Perf by default only traces the
> application you specify. trace-cmd by default traces everything.
>
> That is, "trace-cmd record -F" is similar to "perf record". Whereas
> "trace-cmd record" is similar to "perf record -a".
>

Thank you for pointing this out. I had been using -F for exactly the
reason that you mentioned. I failed to include it in the command that I
sent along. Very sorry for the confusion. Here is an updated version of
the command that I issued:

sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
handle_mm_fault -F ../one_page_play/page

I then generated a report:

./trace-cmd report --profile

and I see the following (among some other output):

  Event: page_fault_user:0x7f094f7dd104 (1)
  Event: page_fault_user:0x4000e0 (1)
  Event: page_fault_user:0x7f094f7eae4a (1)
  Event: page_fault_user:0x7f094f860d40 (1)
  Event: page_fault_user:0x7f094f7db560 (1)
  Event: page_fault_user:0x4040cb (1)
  Event: page_fault_user:0x401825 (1)
  Event: page_fault_user:0x401473 (1)
  Event: page_fault_user:0x7f094f7e64c4 (1)
  Event: page_fault_user:0x7f094f7f1212 (1)

That output comes from under the "task: page-" heading, so it seems
like those faults are being attributed to the page task.

This command seems to show something interesting:

sudo ./trace-cmd record -e exceptions:page_fault_user -p function_graph
-g __do_fault -F ../one_page_play/page

and the relevant output from ./trace-cmd report --profile is

task: page-4032
  Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max: 2398(ts:170150.060916) Min: 855(ts:170150.054713)
  Event: page_fault_user:0x7ffad3143d40 (1)
  Event: page_fault_user:0x4000e0 (1)
  Event: page_fault_user:0x401473 (1)
  Event: page_fault_user:0x7ffad30c94c4 (1)

This is closer to what I would expect. The first of the two 0x4...
addresses is the entry point and the second is the target. Basically,
that is exactly what I expect. The other two are the "suspicious"
entries. Neither matches the copy_user_enhanced_fast_string symbol
location, and neither is loaded in the binary (according to gdb).
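(For completeness, the gdb check was along these lines, using the
addresses from the profile above; inside the image, info symbol
resolves 0x401473 to target, while the 0x7ffa... addresses match
nothing in the on-disk binary:

gdb -batch ./page -ex 'info symbol 0x401473' -ex 'info symbol 0x7ffad3143d40'

The exact incantation may have differed, but that is the substance of
the check.)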
It is odd to me that the output from the previous command includes
information about the trace-cmd process since I specified the -F
option.

But, back to exactly what you asked. Here is the result of running
perf again with the -a option. I ran this command:

sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page

and here is what perf report shows:

    50.41%  page  ld-2.19.so         [.] do_lookup_x
            |
            --- do_lookup_x

    44.21%  perf  [kernel.kallsyms]  [k] iov_iter_fault_in_readable
            |
            --- iov_iter_fault_in_readable
                generic_file_buffered_write
                __generic_file_aio_write
                generic_file_aio_write
                ext4_file_write
                do_sync_write
                vfs_write
                sys_write
                system_call_fastpath
                __write_nocancel
                0x4081a5
                0x407a40
                __libc_start_main

     4.13%  perf  perf               [.] 0x0000000000015b54
            |
            --- 0x415b54
                0x4081a5
                0x407a40
                __libc_start_main

     0.41%  page  page               [.] _start
            |
            --- _start

     0.41%  page  page               [.] target
            |
            --- target

     0.41%  page  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
            |
            --- copy_user_enhanced_fast_string
                load_elf_binary
                search_binary_handler
                do_execve_common.isra.23
                sys_execve
                stub_execve
                __execve

What is interesting is that the output differs based on whether I've
dropped the kernel caches before I run the perf record. When I do
that, there are no page faults attributed to the entry point or the
target of the program. I would imagine that, after dropping caches,
the readahead handler picks up those pages when the binary is loaded
and negates the need for a page fault.

That, indeed, seems to be the case. I can see that when I run perf
with an additional block:block_rq_issue event. Immediately after
dropping the caches, there is a block request event. On subsequent
executions, there is no such event, but there are the page faults
that I expect.

What I did notice is that some of the page faults come from the task
before execve is called. From what I've seen, this looks like the
kernel reclaiming pages from the spawning process before it is
replaced with the new binary (during the call to execve). After the
execve, there seem to be two page faults:

 page-4613 [006] 171795.748310: funcgraph_entry:   0.151 us | mutex_unlock();
 page-4613 [006] 171795.748313: funcgraph_entry:   0.166 us | __fsnotify_parent();
 page-4613 [006] 171795.748313: funcgraph_entry:   0.321 us | fsnotify();
 page-4613 [006] 171795.748314: funcgraph_entry:   0.090 us | __sb_end_write();
 page-4613 [006] 171795.748317: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [006] 171795.748317: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x4
 page-4613 [006] 171795.748319: funcgraph_exit:    2.254 us | }
 page-4613 [006] 171795.748321: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [006] 171795.748322: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x14
 page-4613 [006] 171795.748323: funcgraph_exit:    1.144 us | }

NOTICE THIS:

 page-4613 [006] 171795.748324: funcgraph_entry:            | sys_execve() {
 page-4613 [007] 171795.748391: block_rq_issue:   8,0 R 0 () 764812912 + 16 [trace-cmd]
 page-4613 [005] 171795.759476: funcgraph_exit:  # 11152.111 us | }

 page-4613 [005] 171795.759477: funcgraph_entry:   3.745 us | do_notify_resume();
 page-4613 [005] 171795.759481: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [005] 171795.759482: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x14
 page-4613 [005] 171795.759487: funcgraph_exit:    5.833 us | }
 page-4613 [005] 171795.759488: funcgraph_entry:            | trace_do_page_fault() {
 page-4613 [005] 171795.759489: page_fault_user:  address=__per_cpu_end ip=__per_cpu_end error_code=0x14
 page-4613 [005] 171795.759490: funcgraph_exit:    2.003 us | }
 page-4613 [005] 171795.759492: funcgraph_entry:            | sys_exit() {

I wish that I could get something "better" than
"address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those
page faults. That would really tell me more about whether this is
the "correct" behavior.

As ever, thank you very much for your help! Using these tools has
been an incredible learning experience. I still think that I am just
missing something stupid, but I really appreciate your patience.

Thanks again!
Will

> -- Steve