Dear Mr. Rostedt and Kernel community,
I hope that this is an appropriate place to ask this question. Please
forgive me for wasting your time if it is not. I've searched for
answers to this question on LKML and the "net" before asking. I
apologize if I already missed the question and answer somewhere else.
I was doing a gut check of my understanding of how the Kernel pages in
a binary program for execution. I created a very small program that is
two pages. The entry point to the program is on the first page (a)
and the code simply jumps to the second page (b), where it terminates.
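For concreteness, a minimal C rendition of such a program might look
like the sketch below. This is only an illustration of the layout (my
actual program, which I share later in this thread, is hand-written
asm); the aligned attribute is what pushes target onto its own page:

/* page.c -- hypothetical sketch of the two-page test program.
 * _start sits on one text page; the aligned attribute forces target
 * onto a different page, so executing each function touches a
 * distinct page of code. */
__attribute__((aligned(4096), noinline))
static void target(void)
{
        /* exit(0) directly via the sys_exit syscall; no libc here */
        __asm__ volatile("mov $60, %rax\n\t"
                         "xor %rdi, %rdi\n\t"
                         "syscall");
}

void _start(void)
{
        target();       /* jump from the entry page to the second page */
}

/* build: gcc -nostdlib page.c -o page
 * (newer toolchains may also want -no-pie to match the 0x400000 load
 * address seen in the perf output below) */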
I compiled the program and dropped the kernel's memory caches [1].
Then I ran the program under perf:
perf record --call-graph fp -e page-faults ../one_page_play/page
I looked at the results:
perf report
and the results were as I expected. There were two page faults for
loading the code into memory and a page fault to
copy_user_enhanced_fast_string invoked by execve's implementation when
loading the binary.
I decided to run the application under ftrace just for fun. I wanted
an excuse to learn more about it and this seemed like the perfect
chance. I used the incredible trace-cmd suite for the actual
incantation of ftrace. I won't include the actual incantations here
because I used many of them while digging around.
The results are both expected and unexpected. I see output like this:
Event: page_fault_user:0x4000e0
which indicates that there is a page fault at the program's entry
point (and matches what I saw with the perf output). I have another
similar entry that confirms the other expected page fault when loading
the second page of the test application.
However, I also see entries like this:
Event: page_fault_user:0x7f4f590934c4 (1)
The addresses of the faults that match that pattern do not fall
within the application binary. What I discovered as I
investigated, is that those page faults seem to occur when the kernel
is attempting to record the output of stack traces, etc.
After thinking through this, I came up with the following hypothesis
which is the crux of this email:
The act of recording the traces I requested into ftrace's ring
buffer generated page faults of its own. These page faults are
generated on behalf of the traced program and get reported in the
results.
If that is correct/reasonable, it explains the differences between
what perf is reporting and what ftrace is reporting and I am happy.
If, however, that is a bogus conclusion, please help me understand
what is going on.
I know that everyone who is on this email is incredibly busy and has
much to do. I hope that I've included enough information to make it
possible for you experts to advise, but not included too much to waste
your time.
If you have the time or interest in answering, I would love to hear
your responses. Please CC me directly on all responses.
Thanks again for your time!
Will
[1] I used echo 3 > /proc/sys/vm/drop_caches to accomplish this and
issued it between every run. It may have been overkill, but I did it
anyway.
On Mon, 12 Jun 2017 20:20:42 -0400
Will Hawkins <[email protected]> wrote:
> Dear Mr. Rostedt and Kernel community,
>
> I hope that this is an appropriate place to ask this question. Please
> forgive me for wasting your time if it is not. I've searched for
> answers to this question on LKML and the "net" before asking. I
> apologize if I already missed the question and answer somewhere else.
No no, this is the correct forum.
>
> I was doing a gut check of my understanding of how the Kernel pages in
> a binary program for execution. I created a very small program that is
> two pages. The entry point to the program is on page a and the code
> simply jumps to the second page (b) for its termination.
>
> I compiled the program and dropped the kernel's memory caches [1].
> Then I ran the program under perf:
>
> perf record --call-graph fp -e page-faults ../one_page_play/page
Can you supply this program, so I can see exactly what it does?
>
> I looked at the results:
>
> perf report
>
> and the results were as I expected. There were two page faults for
> loading the code into memory and a page fault to
> copy_user_enhanced_fast_string invoked by execve's implementation when
> loading the binary.
What does perf script show you?
>
> I decided to run the application under ftrace just for fun. I wanted
> an excuse to learn more about it and this seemed like the perfect
> chance. I used the incredible trace-cmd suite for the actual
> incantation of ftrace. I won't include the actual incantations here
> because I used many of them while digging around.
what events did you enable?
>
> The results are both expected and unexpected. I see output like this:
>
> Event: page_fault_user:0x4000e0
>
> which indicates that there is a page fault at the program's entry
> point (and matches what I saw with the perf output). I have another
> similar entry that confirms the other expected page fault when loading
> the second page of the test application.
>
> However, I also see entries like this:
>
> Event: page_fault_user:0x7f4f590934c4 (1)
This could very well be the dynamic linker.
>
> The addresses of the faults I see that match that pattern are not
> loaded into the application binary. What I discovered as I
> investigated, is that those page faults seem to occur when the kernel
> is attempting to record the output of stack traces, etc.
Are you sure? What does ldd give you for your program?
>
> After thinking through this, I came up with the following hypothesis
> which is the crux of this email:
>
> Ftrace's act of recording the traces that I requested to its ring
> buffer generated page faults of their own. These page faults are
> generated on behalf of the traced program and get reported in the
> results.
There are no page faults generated by the ftrace ring buffer that
would be associated with the program.
>
> If that is correct/reasonable, it explains the differences between
> what perf is reporting and what ftrace is reporting and I am happy.
> If, however, that is a bogus conclusion, please help me understand
> what is going on.
I'll need to know more about exactly what you are doing to help out.
>
> I know that everyone who is on this email is incredibly busy and has
> much to do. I hope that I've included enough information to make it
> possible for you experts to advise, but not included too much to waste
> your time.
>
> If you have the time or interest in answering, I would love to hear
> your responses. Please CC me directly on all responses.
No worries about Ccing you directly. LKML gets over 600 emails a day.
Nobody reads it all. If someone isn't Cc'd directly, they will most
likely never respond.
>
> Thanks again for your time!
No problem.
-- Steve
>
> Will
>
> [1] I used echo 3 > /proc/sys/vm/drop_caches to accomplish this and
> issued it between every run. It may have been overkill, but I did it
> anyway.
Thank you for your response. I have included responses to your questions inline.
On Mon, Jun 12, 2017 at 9:20 PM, Steven Rostedt <[email protected]> wrote:
> On Mon, 12 Jun 2017 20:20:42 -0400
> Will Hawkins <[email protected]> wrote:
>
>> Dear Mr. Rostedt and Kernel community,
>>
>> I hope that this is an appropriate place to ask this question. Please
>> forgive me for wasting your time if it is not. I've searched for
>> answers to this question on LKML and the "net" before asking. I
>> apologize if I already missed the question and answer somewhere else.
>
> No no, this is the correct forum.
Whew.
>
>>
>> I was doing a gut check of my understanding of how the Kernel pages in
>> a binary program for execution. I created a very small program that is
>> two pages. The entry point to the program is on page a and the code
>> simply jumps to the second page (b) for its termination.
>>
>> I compiled the program and dropped the kernel's memory caches [1].
>> Then I ran the program under perf:
>>
>> perf record --call-graph fp -e page-faults ../one_page_play/page
>
> Can you supply this program, so I can see exactly what it does?
I have attached the binary to this email. I also provided the source
so you can reproduce the situation directly. You can use the makefile
provided as long as you have nasm installed.
Although this is a response to a later question, I won't bury the
lede: The program was written by hand in asm, compiled with nasm to
object code and then run through gcc to get the output binary. Here is
how that works:
nasm -felf64 page.s && gcc -nostdlib page.o -o page
The result, as far as ldd is concerned:
hawkinsw@moormans:~/code/one_page_play$ ldd page
not a dynamic executable
>
>>
>> I looked at the results:
>>
>> perf report
>>
>> and the results were as I expected. There were two page faults for
>> loading the code into memory and a page fault to
>> copy_user_enhanced_fast_string invoked by execve's implementation when
>> loading the binary.
>
> What does perf script show you?
Here is what perf report --stdio shows:
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..................................
#
33.33% page page [.] _start
|
--- _start
33.33% page page [.] target
|
--- target
33.33% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
|
--- copy_user_enhanced_fast_string
load_elf_binary
search_binary_handler
do_execve_common.isra.23
sys_execve
stub_execve
0x7f51698a31e7
Here is what perf script shows:
page 28787 113626.587935: page-faults:
ffffffff81376609 copy_user_enhanced_fast_string ([kernel.kallsyms])
ffffffff812197bf load_elf_binary ([kernel.kallsyms])
ffffffff811c76df search_binary_handler ([kernel.kallsyms])
ffffffff811c8c8e do_execve_common.isra.23 ([kernel.kallsyms])
ffffffff811c9176 sys_execve ([kernel.kallsyms])
ffffffff8173afc9 stub_execve ([kernel.kallsyms])
7f51698a31e7 [unknown] ([unknown])
page 28787 113626.587961: page-faults:
4000e0 _start (/home/hawkinsw/code/one_page_play/page)
page 28787 113626.587968: page-faults:
401473 target (/home/hawkinsw/code/one_page_play/page)
Again, this seems like exactly what I would expect.
>
>>
>> I decided to run the application under ftrace just for fun. I wanted
>> an excuse to learn more about it and this seemed like the perfect
>> chance. I used the incredible trace-cmd suite for the actual
>> incantation of ftrace. I won't include the actual incantations here
>> because I used many of them while digging around.
>
> what events did you enable?
Sorry for not including this in my first email. I enabled
exceptions:page_fault_user. I ran the trace-cmd record like this:
sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
handle_mm_fault ../one_page_play/page
and generated output like this:
./trace-cmd report --profile
>
>>
>> The results are both expected and unexpected. I see output like this:
>>
>> Event: page_fault_user:0x4000e0
>>
>> which indicates that there is a page fault at the program's entry
>> point (and matches what I saw with the perf output). I have another
>> similar entry that confirms the other expected page fault when loading
>> the second page of the test application.
>>
>> However, I also see entries like this:
>>
>> Event: page_fault_user:0x7f4f590934c4 (1)
>
> This could very well be the dynamic linker.
As I mentioned earlier, the program is statically linked. I did that
for this very reason. I wanted to isolate the page faults for loading
the program itself and not have to worry about interference from the
dynamic linker. I was *trying* to be intentional about getting these
results. I appear to be missing the boat somewhere.
>
>>
>> The addresses of the faults I see that match that pattern are not
>> loaded into the application binary. What I discovered as I
>> investigated, is that those page faults seem to occur when the kernel
>> is attempting to record the output of stack traces, etc.
>
> Are you sure? What does ldd give you for your program?
I posted this output above, a little out of order. I will repeat here:
hawkinsw@moormans:~/code/one_page_play$ ldd page
not a dynamic executable
>
>>
>> After thinking through this, I came up with the following hypothesis
>> which is the crux of this email:
>>
>> Ftrace's act of recording the traces that I requested to its ring
>> buffer generated page faults of their own. These page faults are
>> generated on behalf of the traced program and get reported in the
>> results.
>
> There are no page faults generated by the ftrace ring buffer that
> would be associated with the program.
This is what I would have expected, but it was/is my best guess.
>
>>
>> If that is correct/reasonable, it explains the differences between
>> what perf is reporting and what ftrace is reporting and I am happy.
>> If, however, that is a bogus conclusion, please help me understand
>> what is going on.
>
> I'll need to know more about exactly what you are doing to help out.
I hope that the information I posted above was helpful. Please let me
know what additional information I can provide.
Thanks again for taking the time to investigate this with me. I'm sure
it's something simple that I am missing.
Will
>
>>
>> I know that everyone who is on this email is incredibly busy and has
>> much to do. I hope that I've included enough information to make it
>> possible for you experts to advise, but not included too much to waste
>> your time.
>>
>> If you have the time or interest in answering, I would love to hear
>> your responses. Please CC me directly on all responses.
>
> No worries about Ccing you directly. LKML gets over 600 emails a day.
> Nobody reads it all. If someone isn't Cc'd directly, they will most
> likely never respond.
>
>>
>> Thanks again for your time!
>
> No problem.
>
> -- Steve
>
>>
>> Will
>>
>> [1] I used echo 3 > /proc/sys/vm/drop_caches to accomplish this and
>> issued it between every run. It may have been overkill, but I did it
>> anyway.
>
On Mon, 12 Jun 2017 22:05:05 -0400
Will Hawkins <[email protected]> wrote:
> > Can you supply this program, so I can see exactly what it does?
>
> I have attached the binary to this email. I also provided the source
> so you can reproduce the situation directly. You can use the makefile
> provided as long as you have nasm installed.
>
> Although this is a response to a later question, I won't bury the
> lede: The program was written by hand in asm, compiled with nasm to
> object code and then run through gcc to get the output binary. Here is
> how that works:
>
> nasm -felf64 page.s && gcc -nostdlib page.o -o page
>
> The result, as far as ldd is concerned:
>
> hawkinsw@moormans:~/code/one_page_play$ ldd page
> not a dynamic executable
OK, that makes sense.
>
>
> >
> >>
> >> I looked at the results:
> >>
> >> perf report
> >>
> >> and the results were as I expected. There were two page faults for
> >> loading the code into memory and a page fault to
> >> copy_user_enhanced_fast_string invoked by execve's implementation when
> >> loading the binary.
> >
> > What does perf script show you?
>
> Here is what perf report --stdio shows:
>
> # Overhead Command Shared Object Symbol
> # ........ ....... ................. ..................................
> #
> 33.33% page page [.] _start
> |
> --- _start
>
> 33.33% page page [.] target
> |
> --- target
>
> 33.33% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> |
> --- copy_user_enhanced_fast_string
> load_elf_binary
> search_binary_handler
> do_execve_common.isra.23
> sys_execve
> stub_execve
> 0x7f51698a31e7
>
>
>
> Here is what perf script shows:
>
> page 28787 113626.587935: page-faults:
> ffffffff81376609 copy_user_enhanced_fast_string ([kernel.kallsyms])
> ffffffff812197bf load_elf_binary ([kernel.kallsyms])
> ffffffff811c76df search_binary_handler ([kernel.kallsyms])
> ffffffff811c8c8e do_execve_common.isra.23 ([kernel.kallsyms])
> ffffffff811c9176 sys_execve ([kernel.kallsyms])
> ffffffff8173afc9 stub_execve ([kernel.kallsyms])
> 7f51698a31e7 [unknown] ([unknown])
>
> page 28787 113626.587961: page-faults:
> 4000e0 _start (/home/hawkinsw/code/one_page_play/page)
>
> page 28787 113626.587968: page-faults:
> 401473 target (/home/hawkinsw/code/one_page_play/page)
>
> Again, this seems like exactly what I would expect.
>
> >
> >>
> >> I decided to run the application under ftrace just for fun. I wanted
> >> an excuse to learn more about it and this seemed like the perfect
> >> chance. I used the incredible trace-cmd suite for the actual
> >> incantation of ftrace. I won't include the actual incantations here
> >> because I used many of them while digging around.
> >
> > what events did you enable?
>
> Sorry for not including this in my first email. I enabled
> exceptions:page_fault_user. I ran the trace-cmd record like this:
>
> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
> handle_mm_fault ../one_page_play/page
See if things change if you add -F. Perf by default only traces the
application you specify. trace-cmd by default traces everything.
That is, "trace-cmd record -F" is similar to "perf record". Where as
"trace-cmd record" is similar to "perf record -a".
-- Steve
On Tue, Jun 13, 2017 at 9:06 AM, Steven Rostedt <[email protected]> wrote:
> On Mon, 12 Jun 2017 22:05:05 -0400
> Will Hawkins <[email protected]> wrote:
>
>> > Can you supply this program, so I can see exactly what it does?
>>
>> I have attached the binary to this email. I also provided the source
>> so you can reproduce the situation directly. You can use the makefile
>> provided as long as you have nasm installed.
>>
>> Although this is a response to a later question, I won't bury the
>> lede: The program was written by hand in asm, compiled with nasm to
>> object code and then run through gcc to get the output binary. Here is
>> how that works:
>>
>> nasm -felf64 page.s && gcc -nostdlib page.o -o page
>>
>> The result, as far as ldd is concerned:
>>
>> hawkinsw@moormans:~/code/one_page_play$ ldd page
>> not a dynamic executable
>
> OK, that makes sense.
>
>>
>>
>> >
>> >>
>> >> I looked at the results:
>> >>
>> >> perf report
>> >>
>> >> and the results were as I expected. There were two page faults for
>> >> loading the code into memory and a page fault to
>> >> copy_user_enhanced_fast_string invoked by execve's implementation when
>> >> loading the binary.
>> >
>> > What does perf script show you?
>>
>> Here is what perf report --stdio shows:
>>
>> # Overhead Command Shared Object Symbol
>> # ........ ....... ................. ..................................
>> #
>> 33.33% page page [.] _start
>> |
>> --- _start
>>
>> 33.33% page page [.] target
>> |
>> --- target
>>
>> 33.33% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
>> |
>> --- copy_user_enhanced_fast_string
>> load_elf_binary
>> search_binary_handler
>> do_execve_common.isra.23
>> sys_execve
>> stub_execve
>> 0x7f51698a31e7
>>
>>
>>
>> Here is what perf script shows:
>>
>> page 28787 113626.587935: page-faults:
>> ffffffff81376609 copy_user_enhanced_fast_string ([kernel.kallsyms])
>> ffffffff812197bf load_elf_binary ([kernel.kallsyms])
>> ffffffff811c76df search_binary_handler ([kernel.kallsyms])
>> ffffffff811c8c8e do_execve_common.isra.23 ([kernel.kallsyms])
>> ffffffff811c9176 sys_execve ([kernel.kallsyms])
>> ffffffff8173afc9 stub_execve ([kernel.kallsyms])
>> 7f51698a31e7 [unknown] ([unknown])
>>
>> page 28787 113626.587961: page-faults:
>> 4000e0 _start (/home/hawkinsw/code/one_page_play/page)
>>
>> page 28787 113626.587968: page-faults:
>> 401473 target (/home/hawkinsw/code/one_page_play/page)
>>
>> Again, this seems like exactly what I would expect.
>>
>> >
>> >>
>> >> I decided to run the application under ftrace just for fun. I wanted
>> >> an excuse to learn more about it and this seemed like the perfect
>> >> chance. I used the incredible trace-cmd suite for the actual
>> >> incantation of ftrace. I won't include the actual incantations here
>> >> because I used many of them while digging around.
>> >
>> > what events did you enable?
>>
>> Sorry for not including this in my first email. I enabled
>> exceptions:page_fault_user. I ran the trace-cmd record like this:
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>> handle_mm_fault ../one_page_play/page
>
> See if things change if you add -F. Perf by default only traces the
> application you specify. trace-cmd by default traces everything.
>
> That is, "trace-cmd record -F" is similar to "perf record". Whereas
> "trace-cmd record" is similar to "perf record -a".
>
Thank you for pointing this out. I had been using -F for exactly the
reason that you mentioned. I failed to include it in the command that
I sent along. Very sorry for the confusion. Here is an updated version
of the command that I issued:
sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
handle_mm_fault -F ../one_page_play/page
and I generated output like
./trace-cmd report --profile
and I see the following (among some other output):
Event: page_fault_user:0x7f094f7dd104 (1)
Event: page_fault_user:0x4000e0 (1)
Event: page_fault_user:0x7f094f7eae4a (1)
Event: page_fault_user:0x7f094f860d40 (1)
Event: page_fault_user:0x7f094f7db560 (1)
Event: page_fault_user:0x4040cb (1)
Event: page_fault_user:0x401825 (1)
Event: page_fault_user:0x401473 (1)
Event: page_fault_user:0x7f094f7e64c4 (1)
Event: page_fault_user:0x7f094f7f1212 (1)
That output comes from under the task: page-<pid> heading, so it seems
like those faults are being attributed to the page task.
This command seems to show something interesting:
sudo ./trace-cmd record -e exceptions:page_fault_user -p
function_graph -g __do_fault -F ../one_page_play/page
and the relevant output from
./trace-cmd report --profile
is
task: page-4032
Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
2398(ts:170150.060916) Min:855(ts:170150.054713)
Event: page_fault_user:0x7ffad3143d40 (1)
Event: page_fault_user:0x4000e0 (1)
Event: page_fault_user:0x401473 (1)
Event: page_fault_user:0x7ffad30c94c4 (1)
This is closer to what I would expect. The first of the two 0x4...
addresses is the entry point and the second is the target. Basically,
that is exactly what I expect. The other two are the "suspicious"
entries. Neither matches the copy_user_enhanced_fast_string symbol
location, and neither is loaded in the binary (according to gdb).
It is odd to me that the output from the previous command includes
information about the trace-cmd process since I specified the -F
option.
But, back to exactly what you asked. Here is the result of running
perf again with the -a option. I ran this command:
sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page
50.41% page ld-2.19.so [.] do_lookup_x
|
--- do_lookup_x
44.21% perf [kernel.kallsyms] [k] iov_iter_fault_in_readable
|
--- iov_iter_fault_in_readable
generic_file_buffered_write
__generic_file_aio_write
generic_file_aio_write
ext4_file_write
do_sync_write
vfs_write
sys_write
system_call_fastpath
__write_nocancel
0x4081a5
0x407a40
__libc_start_main
4.13% perf perf [.] 0x0000000000015b54
|
--- 0x415b54
0x4081a5
0x407a40
__libc_start_main
0.41% page page [.] _start
|
--- _start
0.41% page page [.] target
|
--- target
0.41% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
|
--- copy_user_enhanced_fast_string
load_elf_binary
search_binary_handler
do_execve_common.isra.23
sys_execve
stub_execve
__execve
What is interesting is that the output differs based on whether I've
dropped the kernel caches before I run the perf record. When I do
that, there are no page faults attributed to the entry point or the
target of the program. I would imagine that, after dropping caches,
the readahead handler picks up those pages when the binary is loaded
and negates the need for a page fault. That, indeed, seems to be the
case. I can see that when I run perf with an additional
block:block_rq_issue event. Immediately after dropping the caches,
there is a block request event. On subsequent executions, there is no
such event but there are the page faults that I expect.
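The invocation with the extra event was along these lines (an
approximation; I did not save the exact command):

sudo perf record --call-graph fp -e page-faults -e block:block_rq_issue ../one_page_play/page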
What I did notice is that some of the page faults come from the task
before execve is called. From what I've seen, this looks like the
kernel reclaiming pages from the spawning process before it is
replaced with the new binary (during the call to execve). After the
execve, there seem to be two page faults:
page-4613 [006] 171795.748310: funcgraph_entry:
0.151 us | mutex_unlock();
page-4613 [006] 171795.748313: funcgraph_entry:
0.166 us | __fsnotify_parent();
page-4613 [006] 171795.748313: funcgraph_entry:
0.321 us | fsnotify();
page-4613 [006] 171795.748314: funcgraph_entry:
0.090 us | __sb_end_write();
page-4613 [006] 171795.748317: funcgraph_entry:
| trace_do_page_fault() {
page-4613 [006] 171795.748317: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
page-4613 [006] 171795.748319: funcgraph_exit:
2.254 us | }
page-4613 [006] 171795.748321: funcgraph_entry:
| trace_do_page_fault() {
page-4613 [006] 171795.748322: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
page-4613 [006] 171795.748323: funcgraph_exit:
1.144 us | }
NOTICE THIS:
page-4613 [006] 171795.748324: funcgraph_entry:
| sys_execve() {
page-4613 [007] 171795.748391: block_rq_issue: 8,0
R 0 () 764812912 + 16 [trace-cmd]
page-4613 [005] 171795.759476: funcgraph_exit: #
11152.111 us | }
page-4613 [005] 171795.759477: funcgraph_entry:
3.745 us | do_notify_resume();
page-4613 [005] 171795.759481: funcgraph_entry:
| trace_do_page_fault() {
page-4613 [005] 171795.759482: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
page-4613 [005] 171795.759487: funcgraph_exit:
5.833 us | }
page-4613 [005] 171795.759488: funcgraph_entry:
| trace_do_page_fault() {
page-4613 [005] 171795.759489: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
page-4613 [005] 171795.759490: funcgraph_exit:
2.003 us | }
page-4613 [005] 171795.759492: funcgraph_entry:
| sys_exit() {
I wish that I could get something "better" than "
address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those page
faults. That would really tell me more about whether this is the
"correct" behavior.
As ever, thank you very much for your help! Using these tools has been
an incredible learning experience. I still think that I am just
missing something stupid, but I really appreciate your patience.
Thanks again!
Will
> -- Steve
On Tue, 13 Jun 2017 14:02:08 -0400
Will Hawkins <[email protected]> wrote:
> Thank you for pointing this out. I had been using -F for exactly the
> reason that you mentioned. I failed to include it in the command that
> I sent along. Very sorry for the confusion. Here is an updated version
> of the command that I issued:
>
>
> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
> handle_mm_fault -F ../one_page_play/page
>
> and I generated output like
>
> ./trace-cmd report --profile
>
> and I see the following (among some other output):
>
> Event: page_fault_user:0x7f094f7dd104 (1)
> Event: page_fault_user:0x4000e0 (1)
> Event: page_fault_user:0x7f094f7eae4a (1)
> Event: page_fault_user:0x7f094f860d40 (1)
> Event: page_fault_user:0x7f094f7db560 (1)
> Event: page_fault_user:0x4040cb (1)
> Event: page_fault_user:0x401825 (1)
> Event: page_fault_user:0x401473 (1)
> Event: page_fault_user:0x7f094f7e64c4 (1)
> Event: page_fault_user:0x7f094f7f1212 (1)
>
> That output comes from under the task: page-<pid> heading, so it seems
> like those faults are being attributed to the page task.
>
> This command seems to show something interesting:
>
> sudo ./trace-cmd record -e exceptions:page_fault_user -p
> function_graph -g __do_fault -F ../one_page_play/page
>
> and the relevant output from
>
> ./trace-cmd report --profile
>
> is
>
> task: page-4032
> Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
> 2398(ts:170150.060916) Min:855(ts:170150.054713)
> Event: page_fault_user:0x7ffad3143d40 (1)
> Event: page_fault_user:0x4000e0 (1)
> Event: page_fault_user:0x401473 (1)
> Event: page_fault_user:0x7ffad30c94c4 (1)
>
> This is closer to what I would expect. The first of the two 0x4...
> addresses is the entry point and the second is the target. Basically,
> that is exactly what I expect. The other two are the "suspicious"
> entries. Neither matches the copy_user_enhanced_fast_string symbol
> location, and neither is loaded in the binary (according to gdb).
As you state below, there are faults recorded before the exec. That
is true with trace-cmd (not sure about perf), as trace-cmd does some
work after tracing is started and before the exec is called.
>
> It is odd to me that the output from the previous command includes
> information about the trace-cmd process since I specified the -F
> option.
Did the trace-cmd process have the same pid as the page program? The -F
makes it follow the pid, which can be trace-cmd before the fork.
Oh, and what versions of the kernel and trace-cmd are you running?
It's best to always get the latest trace-cmd. Older versions
did have some bugs.
Also, function_graph can have some residual traces from a sched_switch.
That is, the tracing of a task specified by -F is enabled and disabled
per cpu at sched_switch. The sched_switch tracepoint is before the
actual context switch. When the next task is the task specified by -F,
tracing begins. But the context switch can start tracing the previous
task right after the sched switch tracepoint. Here, in
kernel/sched/core.c in __schedule():
trace_sched_switch(preempt, prev, next);
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
The context_switch() function will be traced even when the previous
task isn't the task from -F.
>
> But, back to exactly what you asked. Here is the result of running
> perf again with the -a option. I ran this command:
>
> sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page
>
> 50.41% page ld-2.19.so [.] do_lookup_x
> |
> --- do_lookup_x
>
> 44.21% perf [kernel.kallsyms] [k] iov_iter_fault_in_readable
> |
> --- iov_iter_fault_in_readable
> generic_file_buffered_write
> __generic_file_aio_write
> generic_file_aio_write
> ext4_file_write
> do_sync_write
> vfs_write
> sys_write
> system_call_fastpath
> __write_nocancel
> 0x4081a5
> 0x407a40
> __libc_start_main
>
> 4.13% perf perf [.] 0x0000000000015b54
> |
> --- 0x415b54
> 0x4081a5
> 0x407a40
> __libc_start_main
>
> 0.41% page page [.] _start
> |
> --- _start
>
> 0.41% page page [.] target
> |
> --- target
>
> 0.41% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
> |
> --- copy_user_enhanced_fast_string
> load_elf_binary
> search_binary_handler
> do_execve_common.isra.23
> sys_execve
> stub_execve
> __execve
>
>
> What is interesting is that the output differs based on whether I've
> dropped the kernel caches before I run the perf record. When I do
> that, there are no page faults attributed to the entry point or the
> target of the program. I would imagine that, after dropping caches,
> the readahead handler picks up those pages when the binary is loaded
> and negates the need for a page fault. That, indeed, seems to be the
> case. I can see that when I run perf with an additional
> block:block_rq_issue event. Immediately after dropping the caches,
> there is a block request event. On subsequent executions, there is no
> such event but there are the page faults that I expect.
>
> What I did notice is that some of the page faults come from the task
> before execve is called. From what I've seen, this looks like the
> kernel reclaiming pages from the spawning process before it is
> replaced with the new binary (during the call to execve). After the
> execve, there seem to be two page faults:
>
>
> page-4613 [006] 171795.748310: funcgraph_entry:
> 0.151 us | mutex_unlock();
> page-4613 [006] 171795.748313: funcgraph_entry:
> 0.166 us | __fsnotify_parent();
> page-4613 [006] 171795.748313: funcgraph_entry:
> 0.321 us | fsnotify();
> page-4613 [006] 171795.748314: funcgraph_entry:
> 0.090 us | __sb_end_write();
> page-4613 [006] 171795.748317: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [006] 171795.748317: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x4
> page-4613 [006] 171795.748319: funcgraph_exit:
> 2.254 us | }
> page-4613 [006] 171795.748321: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [006] 171795.748322: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> page-4613 [006] 171795.748323: funcgraph_exit:
> 1.144 us | }
>
> NOTICE THIS:
> page-4613 [006] 171795.748324: funcgraph_entry:
> | sys_execve() {
> page-4613 [007] 171795.748391: block_rq_issue: 8,0
> R 0 () 764812912 + 16 [trace-cmd]
> page-4613 [005] 171795.759476: funcgraph_exit: #
> 11152.111 us | }
> page-4613 [005] 171795.759477: funcgraph_entry:
> 3.745 us | do_notify_resume();
> page-4613 [005] 171795.759481: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [005] 171795.759482: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> page-4613 [005] 171795.759487: funcgraph_exit:
> 5.833 us | }
> page-4613 [005] 171795.759488: funcgraph_entry:
> | trace_do_page_fault() {
> page-4613 [005] 171795.759489: page_fault_user:
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
> page-4613 [005] 171795.759490: funcgraph_exit:
> 2.003 us | }
> page-4613 [005] 171795.759492: funcgraph_entry:
> | sys_exit() {
>
> I wish that I could get something "better" than "
> address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those page
> faults. That would really tell me more about whether this is the
> "correct" behavior.
Also, you might want to do a trace-cmd report -l, which lets you see if
the event happened in interrupt context.
>
> As ever, thank you very much for your help! Using these tools has been
> an incredible learning experience. I still think that I am just
> missing something stupid, but I really appreciate your patience.
No no, you got me curious as well. Although, the only page faults I see
are from before the exec and the actual program. Could you run the -F
on your program with the other options you have and report a full:
trace-cmd report -l
And then you can ask about what you don't understand being there.
-- Steve
On Tue, Jun 13, 2017 at 4:04 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 13 Jun 2017 14:02:08 -0400
> Will Hawkins <[email protected]> wrote:
>
>> Thank you for pointing this out. I had been using -F for exactly the
>> reason that you mentioned. I failed to include it in the command that
>> I sent along. Very sorry for the confusion. Here is an updated version
>> of the command that I issued:
>>
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>> handle_mm_fault -F ../one_page_play/page
>>
>> and I generated output like
>>
>> ./trace-cmd report --profile
>>
>> and I see the following (among some other output):
>>
>> Event: page_fault_user:0x7f094f7dd104 (1)
>> Event: page_fault_user:0x4000e0 (1)
>> Event: page_fault_user:0x7f094f7eae4a (1)
>> Event: page_fault_user:0x7f094f860d40 (1)
>> Event: page_fault_user:0x7f094f7db560 (1)
>> Event: page_fault_user:0x4040cb (1)
>> Event: page_fault_user:0x401825 (1)
>> Event: page_fault_user:0x401473 (1)
>> Event: page_fault_user:0x7f094f7e64c4 (1)
>> Event: page_fault_user:0x7f094f7f1212 (1)
>>
>> That output comes from under the task: page-<pid> heading, so it seems
>> like those faults are being attributed to the page task.
>>
>> This command seems to show something interesting:
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -p
>> function_graph -g __do_fault -F ../one_page_play/page
>>
>> and the relevant output from
>>
>> ./trace-cmd report --profile
>>
>> is
>>
>> task: page-4032
>> Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
>> 2398(ts:170150.060916) Min:855(ts:170150.054713)
>> Event: page_fault_user:0x7ffad3143d40 (1)
>> Event: page_fault_user:0x4000e0 (1)
>> Event: page_fault_user:0x401473 (1)
>> Event: page_fault_user:0x7ffad30c94c4 (1)
>>
>> This is closer to what I would expect. The first of the two 0x4...
>> addresses is the entry point and the second is the target. Basically,
>> that is exactly what I expect. The other two are the "suspicious"
>> entries. Neither matches the copy_user_enhanced_fast_string symbol
>> location, and neither is loaded in the binary (according to gdb).
>
> As you state below, there are faults recorded before the exec. That
> is true with trace-cmd (not sure about perf), as trace-cmd does some
> work after tracing is started and before the exec is called.
>
>>
>> It is odd to me that the output from the previous command includes
>> information about the trace-cmd process since I specified the -F
>> option.
>
> Did the trace-cmd process have the same pid as the page program? The -F
> makes it follow the pid, which can be trace-cmd before the fork.
>
> Oh, and what versions of the kernel and trace-cmd are you running?
> It's best to always get the latest trace-cmd. Older versions
> did have some bugs.
I am building from source from commit
1785022e12cf0079889c519ffc7157f2143326ea. I haven't pulled in a few
days, but it seems fairly recent. I hope that I didn't miss an update.
I can pull if you think that might be beneficial.
This is on a relatively old Kernel from Ubuntu's LTS 14.04: 3.13.0-93-generic
>
> Also, function_graph can have some residual traces from a sched_switch.
>
> That is, the tracing of a task specified by -F is enabled and disabled
> per cpu at sched_switch. The sched_switch tracepoint is before the
> actual context switch. When the next task is the task specified by -F,
> tracing begins. But the context switch can start tracing the previous
> task right after the sched switch tracepoint. Here, in
> kernel/sched/core.c in __schedule():
>
> trace_sched_switch(preempt, prev, next);
>
> /* Also unlocks the rq: */
> rq = context_switch(rq, prev, next, &rf);
>
>
> The context_switch() function will be traced even when the previous
> task isn't the task from -F.
>
>
This seems to make very good sense and does seem to match what was
starting to emerge based on my investigations from earlier today. I
don't consider this a problem at all, just not something that I would
have naively expected. Now that I understand the system better, it
makes perfect sense.
>>
>> But, back to exactly what you asked. Here is the result of running
>> perf again with the -a option. I ran this command:
>>
>> sudo perf record --call-graph fp -e page-faults -a ../one_page_play/page
>>
>> 50.41% page ld-2.19.so [.] do_lookup_x
>> |
>> --- do_lookup_x
>>
>> 44.21% perf [kernel.kallsyms] [k] iov_iter_fault_in_readable
>> |
>> --- iov_iter_fault_in_readable
>> generic_file_buffered_write
>> __generic_file_aio_write
>> generic_file_aio_write
>> ext4_file_write
>> do_sync_write
>> vfs_write
>> sys_write
>> system_call_fastpath
>> __write_nocancel
>> 0x4081a5
>> 0x407a40
>> __libc_start_main
>>
>> 4.13% perf perf [.] 0x0000000000015b54
>> |
>> --- 0x415b54
>> 0x4081a5
>> 0x407a40
>> __libc_start_main
>>
>> 0.41% page page [.] _start
>> |
>> --- _start
>>
>> 0.41% page page [.] target
>> |
>> --- target
>>
>> 0.41% page [kernel.kallsyms] [k] copy_user_enhanced_fast_string
>> |
>> --- copy_user_enhanced_fast_string
>> load_elf_binary
>> search_binary_handler
>> do_execve_common.isra.23
>> sys_execve
>> stub_execve
>> __execve
>>
>>
>> What is interesting is that the output differs based on whether I've
>> dropped the kernel caches before I run the perf record. When I do
>> that, there are no page faults attributed to the entry point or the
>> target of the program. I would imagine that, after dropping caches,
>> the readahead handler picks up those pages when the binary is loaded
>> and negates the need for a page fault. That, indeed, seems to be the
>> case. I can see that when I run perf with an additional
>> block:block_rq_issue event. Immediately after dropping the caches,
>> there is a block request event. On subsequent executions, there is no
>> such event but there are the page faults that I expect.
>>
>> What I did notice is that some of the page faults come from the task
>> before execve is called. From what I've seen, this looks like the
>> kernel reclaiming pages from the spawning process before it is
>> replaced with the new binary (during the call to execve). After the
>> execve, there seem to be two page faults:
>>
>>
>> page-4613 [006] 171795.748310: funcgraph_entry:
>> 0.151 us | mutex_unlock();
>> page-4613 [006] 171795.748313: funcgraph_entry:
>> 0.166 us | __fsnotify_parent();
>> page-4613 [006] 171795.748313: funcgraph_entry:
>> 0.321 us | fsnotify();
>> page-4613 [006] 171795.748314: funcgraph_entry:
>> 0.090 us | __sb_end_write();
>> page-4613 [006] 171795.748317: funcgraph_entry:
>> | trace_do_page_fault() {
>> page-4613 [006] 171795.748317: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x4
>> page-4613 [006] 171795.748319: funcgraph_exit:
>> 2.254 us | }
>> page-4613 [006] 171795.748321: funcgraph_entry:
>> | trace_do_page_fault() {
>> page-4613 [006] 171795.748322: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> page-4613 [006] 171795.748323: funcgraph_exit:
>> 1.144 us | }
>>
>> NOTICE THIS:
>> page-4613 [006] 171795.748324: funcgraph_entry:
>> | sys_execve() {
>> page-4613 [007] 171795.748391: block_rq_issue: 8,0
>> R 0 () 764812912 + 16 [trace-cmd]
>> page-4613 [005] 171795.759476: funcgraph_exit: #
>> 11152.111 us | }
>> page-4613 [005] 171795.759477: funcgraph_entry:
>> 3.745 us | do_notify_resume();
>> page-4613 [005] 171795.759481: funcgraph_entry:
>> | trace_do_page_fault() {
>> page-4613 [005] 171795.759482: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> page-4613 [005] 171795.759487: funcgraph_exit:
>> 5.833 us | }
>> page-4613 [005] 171795.759488: funcgraph_entry:
>> | trace_do_page_fault() {
>> page-4613 [005] 171795.759489: page_fault_user:
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14
>> page-4613 [005] 171795.759490: funcgraph_exit:
>> 2.003 us | }
>> page-4613 [005] 171795.759492: funcgraph_entry:
>> | sys_exit() {
>>
>> I wish that I could get something "better" than "
>> address=__per_cpu_end ip=__per_cpu_end error_code=0x14" for those page
>> faults. That would really tell me more about whether this is the
>> "correct" behavior.
>
> Also, you might want to do a trace-cmd report -l, which lets you see if
> the event happened in interrupt context.
>
>>
>> As ever, thank you very much for your help! Using these tools has been
>> an incredible learning experience. I still think that I am just
>> missing something stupid, but I really appreciate your patience.
>
> No no, you got me curious as well. Although, the only page faults I see
> are from before the exec and the actual program. Could you run the -F
> on your program with the other options you have and report a full:
>
> trace-cmd report -l
>
> And then you can ask about what you don't understand being there.
>
> -- Steve
I have a slightly refined record command that I think has convinced me
that I see what is going on. Here's the record command:
sudo ./trace-cmd record -p function_graph -T --func-stack -l
do_page_fault -l do_execve --profile ../one_page_play/page
and here's the report command:
./trace-cmd report -l --comm="page"
Here's the (snipped) output:
...
page-7717 7d... 198178.664966: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664968: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664970: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664971: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664973: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664974: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664975: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x4
...
page-7717 7d... 198178.664977: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
...
page-7717 7dN.. 198178.664985: sched_wakeup:
comm=migration/7 pid=58 prio=0 success=1 target_cpu=007
page-7717 7dN.. 198178.664986: kernel_stack: <stack trace>
=> try_to_wake_up (ffffffff8109d2a2)
=> wake_up_process (ffffffff8109d3a5)
...
=> sched_exec (ffffffff8109bcd0)
=> do_execve_common.isra.23 (ffffffff811c88ce)
=> SyS_execve (ffffffff811c9176)
=> stub_execve (ffffffff8173afc9)
<idle>-0 4d... 198178.664993: sched_switch:
prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==>
next_comm=trace-cmd next_pid=7717 next_prio=120
page-7717 4.... 198178.665053: sched_process_exec:
filename=../one_page_play/page pid=7717 old_pid=7717
...
=> stub_execve (ffffffff8173afc9)
...
page-7717 4d... 198178.665056: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
...
page-7717 4d... 198178.665059: page_fault_user:
address=__per_cpu_end ip=__per_cpu_end error_code=0x14
...
Here's how I read this:
There are a "bunch" of page faults that occur as the spawning process
quits. Then the execve() system call starts. After that, there are two
page faults.
The two page faults that occur after the exec() are exactly the number
that I would expect.
It seems like this is exactly the answer!
But, while I have your ear, I was wondering if you could direct me to
where the page_fault_user gathers its information (when tracing) and
where that information is formatted for printing (in trace-cmd). I'd
really like to investigate modifying that code so that it provides
"better" information than
address=__per_cpu_end ip=__per_cpu_end
Now that I'm into this, I want to really dig in. If you can give me a
pointer, that would really help me get started. After that I can
attempt to make some patches and see where it leads.
For the x86 architecture, the relevant files seem to be
mm/fault.c (with the trace_page_fault_entries function)
and
include/asm/trace/exceptions.h (where the exception class is built)
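From my (possibly wrong) reading of that header, the event class looks
roughly like this -- I am paraphrasing from memory, so treat the field
details as approximate:

DECLARE_EVENT_CLASS(x86_exceptions,

	TP_PROTO(unsigned long address, struct pt_regs *regs,
		 unsigned long error_code),

	TP_ARGS(address, regs, error_code),

	TP_STRUCT__entry(
		__field(unsigned long, address)
		__field(unsigned long, ip)
		__field(unsigned long, error_code)
	),

	TP_fast_assign(
		__entry->address = address;
		__entry->ip = regs->ip;
		__entry->error_code = error_code;
	),

	/* %pf is what renders the raw addresses as symbol names */
	TP_printk("address=%pf ip=%pf error_code=0x%lx",
		  (void *)__entry->address, (void *)__entry->ip,
		  __entry->error_code)
);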
What seems odd there is that the exception *should* be responding with
"better" information -- address is set to the value of the cr2
register which should contain the address of the faulting instruction.
Perhaps I just don't understand what the symbol __per_cpu_end means.
Any information you can shed on this would be really great.
Thanks again for your patience with me! I think that I now understand
what is going on.
Will
>
>
> -- Steve
On Tue, 13 Jun 2017 21:27:20 -0400
Will Hawkins <[email protected]> wrote:
> But, while I have your ear, I was wondering if you could direct me to
> where the page_fault_user gathers its information (when tracing) and
> where that information is formatted for printing (in trace-cmd). I'd
> really like to investigate modifying that code so that it provides
> "better" information than
>
> address=__per_cpu_end ip=__per_cpu_end
>
> Now that I'm into this, I want to really dig in. If you can give me a
> pointer, that would really help me get started. After that I can
> attempt to make some patches and see where it leads.
>
> For the x86 architecture, the relevant files seem to be
>
> mm/fault.c (with the trace_page_fault_entries function)
>
> and
>
> include/asm/trace/exceptions.h (where the exception class is built)
Correct.
>
> What seems odd there is that the exception *should* be responding with
> "better" information -- address is set to the value of the cr2
> register, which should contain the faulting address.
> Perhaps I just don't understand what the symbol __per_cpu_end means.
>
> Any information you can shed on this would be really great.
hmm, actually that looks to be translating the ip address into
functions. It shouldn't be doing that, and it doesn't do it for me with
the latest kernel and trace-cmd.
What does it give you in trace-cmd report -R ? The -R will not parse
the printk-fmt of the event format files, and just show raw numbers.
Can you give me the contents of:
cat /sys/kernel/debug/tracing/exceptions/page_fault_user/format
?
That's how trace-cmd parses it.
>
> Thanks again for your patience with me! I think that I now understand
> what is going on.
No problem, it's good to be curious.
-- Steve
Hello,
On Wed, Jun 14, 2017 at 5:04 AM, Steven Rostedt <[email protected]> wrote:
> On Tue, 13 Jun 2017 14:02:08 -0400
> Will Hawkins <[email protected]> wrote:
>
>> Thank you for pointing this out. I had been using -F for exactly the
>> reason that you mentioned. I failed to include it in the command that
>> I sent along. Very sorry for the confusion. Here is an updated version
>> of the command that I issued:
>>
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>> handle_mm_fault -F ../one_page_play/page
>>
>> and I generated output like
>>
>> ./trace-cmd report --profile
>>
>> and I see the following (among some other output):
>>
>> Event: page_fault_user:0x7f094f7dd104 (1)
>> Event: page_fault_user:0x4000e0 (1)
>> Event: page_fault_user:0x7f094f7eae4a (1)
>> Event: page_fault_user:0x7f094f860d40 (1)
>> Event: page_fault_user:0x7f094f7db560 (1)
>> Event: page_fault_user:0x4040cb (1)
>> Event: page_fault_user:0x401825 (1)
>> Event: page_fault_user:0x401473 (1)
>> Event: page_fault_user:0x7f094f7e64c4 (1)
>> Event: page_fault_user:0x7f094f7f1212 (1)
>>
>> That output comes from under the task: page-<pid> heading, so it seems
>> like those faults are being attributed to the page task.
>>
>> This command seems to show something interesting:
>>
>> sudo ./trace-cmd record -e exceptions:page_fault_user -p
>> function_graph -g __do_fault -F ../one_page_play/page
>>
>> and the relevant output from
>>
>> ./trace-cmd report --profile
>>
>> is
>>
>> task: page-4032
>> Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
>> 2398(ts:170150.060916) Min:855(ts:170150.054713)
>> Event: page_fault_user:0x7ffad3143d40 (1)
>> Event: page_fault_user:0x4000e0 (1)
>> Event: page_fault_user:0x401473 (1)
>> Event: page_fault_user:0x7ffad30c94c4 (1)
>>
>> This is closer to what I would expect. The first of the two 0x4...
>> addresses is the entry point and the second is the target. Basically,
>> that is exactly what I expect. The other two are the "suspicious"
>> entries. Neither matches the copy_user_enhanced_fast_string symbol
>> location, and neither is loaded in the binary (according to gdb).
>
> As you state below, there are faults recorded before the exec. That
> is true with trace-cmd (not sure about perf), as trace-cmd does some
> work after tracing is started and before the exec is called.
When perf profiles a program started from its command line, it
disables the events by default and enables them during exec. Please
see linux/tools/perf/util/evsel.c:perf_evsel__config().
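In perf_event_attr terms, that corresponds to roughly the following
(a paraphrase of the relevant lines, not the verbatim evsel.c code):

#include <linux/perf_event.h>

/* When perf itself forks the workload, the event is created disabled
 * and the kernel flips it on at exec time, so the fork-to-exec window
 * is not counted. */
static void config_for_forked_workload(struct perf_event_attr *attr)
{
	attr->disabled = 1;
	attr->enable_on_exec = 1;
}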
Thanks,
Namhyung
On Tue, Jun 13, 2017 at 11:04 PM, Namhyung Kim <[email protected]> wrote:
> Hello,
>
> On Wed, Jun 14, 2017 at 5:04 AM, Steven Rostedt <[email protected]> wrote:
>> On Tue, 13 Jun 2017 14:02:08 -0400
>> Will Hawkins <[email protected]> wrote:
>>
>>> Thank you for pointing this out. I had been using -F for exactly the
>>> reason that you mentioned. I failed to include it in the command that
>>> I sent along. Very sorry for the confusion. Here is an updated version
>>> of the command that I issued:
>>>
>>>
>>> sudo ./trace-cmd record -e exceptions:page_fault_user -T --profile -l
>>> handle_mm_fault -F ../one_page_play/page
>>>
>>> and I generated output like
>>>
>>> ./trace-cmd report --profile
>>>
>>> and I see the following (among some other output):
>>>
>>> Event: page_fault_user:0x7f094f7dd104 (1)
>>> Event: page_fault_user:0x4000e0 (1)
>>> Event: page_fault_user:0x7f094f7eae4a (1)
>>> Event: page_fault_user:0x7f094f860d40 (1)
>>> Event: page_fault_user:0x7f094f7db560 (1)
>>> Event: page_fault_user:0x4040cb (1)
>>> Event: page_fault_user:0x401825 (1)
>>> Event: page_fault_user:0x401473 (1)
>>> Event: page_fault_user:0x7f094f7e64c4 (1)
>>> Event: page_fault_user:0x7f094f7f1212 (1)
>>>
>>> That output comes from under the task: page-<pid> heading, so it seems
>>> like those faults are being attributed to the page task.
>>>
>>> This command seems to show something interesting:
>>>
>>> sudo ./trace-cmd record -e exceptions:page_fault_user -p
>>> function_graph -g __do_fault -F ../one_page_play/page
>>>
>>> and the relevant output from
>>>
>>> ./trace-cmd report --profile
>>>
>>> is
>>>
>>> task: page-4032
>>> Event: func: __do_fault() (4) Total: 6685 Avg: 1671 Max:
>>> 2398(ts:170150.060916) Min:855(ts:170150.054713)
>>> Event: page_fault_user:0x7ffad3143d40 (1)
>>> Event: page_fault_user:0x4000e0 (1)
>>> Event: page_fault_user:0x401473 (1)
>>> Event: page_fault_user:0x7ffad30c94c4 (1)
>>>
>>> This is closer to what I would expect. The first of the two 0x4...
>>> addresses is the entry point and the second is the target. Basically,
>>> that is exactly what I expect. The other two are the "suspicious"
>>> entries. Neither matches the copy_user_enhanced_fast_string symbol
>>> location, and neither is loaded in the binary (according to gdb).
>>
>> As you state below, there are faults recorded before the exec. That
>> is true with trace-cmd (not sure about perf), as trace-cmd does some
>> work after tracing is started and before the exec is called.
>
> When perf profiles a program started from its command line, it
> disables the events by default and enables them during exec. Please
> see linux/tools/perf/util/evsel.c:perf_evsel__config().
>
> Thanks,
> Namhyung
Namhyung,
I think that this answers a very important question! Thanks for chiming in!
Will
On Wed, 14 Jun 2017 13:30:59 -0400
Will Hawkins <[email protected]> wrote:
> > When perf profiles a program started from its command line, it
> > disables the events by default and enables them during exec. Please
> > see linux/tools/perf/util/evsel.c:perf_evsel__config().
> >
> > Thanks,
> > Namhyung
>
> Namhyung,
>
> I think that this answers a very important question! Thanks for chiming in!
Yes. One difference between the design of ftrace and the design of
perf is that I avoided inserting callbacks throughout the kernel.
Perf has a few function calls in the exec code. Just grep "perf" in
fs/exec.c. There are a few scattered around there, causing a slight
overhead even when perf is not in use.
Hmm, I really should remove all perf injections and make them either a
tracepoint or generic jump labels that anything may attach to. Then
ftrace could have the same features. And lttng for that matter.
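Conceptually, something like this (hypothetical names, not actual
kernel code):

#include <linux/jump_label.h>
#include <linux/binfmts.h>

/* Hypothetical sketch: a static key compiles to a patchable nop, so a
 * hook point like this costs nothing until a tracer attaches. */
DEFINE_STATIC_KEY_FALSE(exec_hook_enabled);

/* Hypothetical; whatever ftrace/perf/lttng registered would go here. */
extern void run_registered_exec_hooks(struct linux_binprm *bprm);

void exec_hook_point(struct linux_binprm *bprm)
{
	if (static_branch_unlikely(&exec_hook_enabled))
		run_registered_exec_hooks(bprm);
}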
-- Steve
On Tue, Jun 13, 2017 at 10:13 PM, Steven Rostedt <[email protected]> wrote:
> On Tue, 13 Jun 2017 21:27:20 -0400
> Will Hawkins <[email protected]> wrote:
>
>> But, while I have your ear, I was wondering if you could direct me to
>> where the page_fault_user gathers its information (when tracing) and
>> where that information is formatted for printing (in trace-cmd). I'd
>> really like to investigate modifying that code so that it provides
>> "better" information than
>>
>> address=__per_cpu_end ip=__per_cpu_end
>>
>> Now that I'm into this, I want to really dig in. If you can give me a
>> pointer, that would really help me get started. After that I can
>> attempt to make some patches and see where it leads.
>>
>> For the x86 architecture, the relevant files seem to be
>>
>> mm/fault.c (with the trace_page_fault_entries function)
>>
>> and
>>
>> include/asm/trace/exceptions.h (where the exception class is built)
>
> Correct.
>
>>
>> What seems odd there is that the exception *should* be responding with
>> "better" information -- address is set to the value of the cr2
>> register, which should contain the faulting address.
>> Perhaps I just don't understand what the symbol __per_cpu_end means.
>>
>> Any information you can shed on this would be really great.
>
> hmm, actually that looks to be translating the ip address into
> functions. It shouldn't be doing that, and it doesn't do it for me with
> the latest kernel and trace-cmd.
>
> What does it give you in trace-cmd report -R ? The -R will not parse
> the printk-fmt of the event format files, and just show raw numbers.
Brilliant advice. And, guess what? It gives me exactly what I expected:
page-7783 6d... 198598.003464: page_fault_user:
address=0x4000e0 ip=0x4000e0 error_code=0x14
...
page-7783 6d... 198598.003466: page_fault_user:
address=0x401473 ip=0x401473 error_code=0x14
In case you aren't following the details as closely as I am: those
are exactly the addresses that I expected to see. I think
that is the final confirmation of the hypothesis!
>
> Can you give me the contents of:
>
> cat /sys/kernel/debug/tracing/exceptions/page_fault_user/format
>
> ?
>
> That's how trace-cmd parses it.
In the kernel version that I am running (again, pretty old) I do not
have this file. I do, however, have
/sys/kernel/debug/tracing/events/exceptions/page_fault_user/format
and the contents are:
name: page_fault_user
ID: 79
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

field:unsigned long address; offset:8; size:8; signed:0;
field:unsigned long ip; offset:16; size:8; signed:0;
field:unsigned long error_code; offset:24; size:8; signed:0;

print fmt: "address=%pf ip=%pf error_code=0x%lx", (void *)REC->address, (void *)REC->ip, REC->error_code
Again, this looks exactly like what I would expect, since address holds
the cr2 value in that function. Plus, we know that the raw value is
correct. I suppose that the "symbolification" of that value is done in
trace-cmd, right? So, perhaps that is where I should start looking for
the problem?
I definitely want to follow up on this and help where I can. That
said, I think I am satisfied with "our" (really, your) answer to the
original problem.
Thank you so much!
Will
>
>
>
>>
>> Thanks again for your patience with me! I think that I now understand
>> what is going on.
>
>
> No problem, it's good to be curious.
>
> -- Steve
On Wed, 14 Jun 2017 13:47:17 -0400
Will Hawkins <[email protected]> wrote:
> > That's how trace-cmd parses it.
>
> In the kernel version that I am running (again, pretty old) I do not
> have this file. I do, however, have
It may be due to the kernel version. It gets the functions
from /proc/kallsyms. That could have an issue. Although, I have
this too:
# grep per_cpu_start /proc/kallsyms
0000000000000000 A __per_cpu_start
But I don't see it being converted in my report.
Hmm, it's not saved. Interesting:
trace-cmd report -f
to see the list of saved functions. I need to figure out why it does this
for you, but not for me.
>
> /sys/kernel/debug/tracing/events/exceptions/page_fault_user/format
>
> and the contents are:
>
> name: page_fault_user
> ID: 79
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
>
> field:unsigned long address; offset:8; size:8; signed:0;
> field:unsigned long ip; offset:16; size:8; signed:0;
> field:unsigned long error_code; offset:24; size:8; signed:0;
>
> print fmt: "address=%pf ip=%pf error_code=0x%lx", (void
> *)REC->address, (void *)REC->ip, REC->error_code
>
> Again, this looks exactly like what I would expect, since address holds
> the cr2 value in that function. Plus, we know that the raw value is
> correct. I suppose that the "symbolification" of that value is done in
> trace-cmd, right? So, perhaps that is where I should start looking for
> the problem?
>
> I definitely want to follow up on this and help where I can. That
> said, I think I am satisfied with "our" (really, your) answer to the
> original problem.
It has something to do with the reading of the function names
in /proc/kallsyms.
Ah, I bet it's a change in the kernel. A recent update to trace-cmd was
to ignore symbols in kallsyms of type "A" (which you can see is what
I have above).
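The parsing amounts to something like this (a sketch of the idea with a
hypothetical save() callback, not the actual trace-cmd source):

#include <stdio.h>

/* Read /proc/kallsyms, skipping type 'A' (absolute) symbols such as
 * __per_cpu_start so they can never win a nearest-symbol match. */
static void load_kallsyms(void (*save)(unsigned long addr, const char *name))
{
	FILE *fp = fopen("/proc/kallsyms", "r");
	char line[512], type, name[256];
	unsigned long addr;

	if (!fp)
		return;
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "%lx %c %255s", &addr, &type, name) != 3)
			continue;
		if (type == 'A' || type == 'a')
			continue;	/* absolute symbol: ignore it */
		save(addr, name);
	}
	fclose(fp);
}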
What do you have when you do:
sudo grep per_cpu_start /proc/kallsyms
You may or may not need sudo, depending on whether the kernel lets
non-root users read kallsyms.
-- Steve
On Wed, Jun 14, 2017 at 4:01 PM, Steven Rostedt <[email protected]> wrote:
> On Wed, 14 Jun 2017 13:47:17 -0400
> Will Hawkins <[email protected]> wrote:
>
>
>> > That's how trace-cmd parses it.
>>
>> In the kernel version that I am running (again, pretty old) I do not
>> have this file. I do, however, have
>
> It may be due to the kernel version. It gets the functions
> from /proc/kallsyms. That could have an issue. Although, I have
> this too:
>
> # grep per_cpu_start /proc/kallsyms
> 0000000000000000 A __per_cpu_start
Here's my result of a similar command:
# cat /proc/kallsyms | grep per_cpu_start
0000000000000000 D __per_cpu_start
The only difference is in (what I think is) the symbol type flag in the
second column.
>
> But I don't see it being converted in my report.
>
> Hmm, it's not saved. Interesting:
>
> trace-cmd report -f
>
> to see the list of saved functions. I need to figure out why it does this
> for you, but not for me.
sudo ./trace-cmd report -f | grep per_cpu_start
0000000000000000 __per_cpu_start
>
>>
>> /sys/kernel/debug/tracing/events/exceptions/page_fault_user/format
>>
>> and the contents are:
>>
>> name: page_fault_user
>> ID: 79
>> format:
>> field:unsigned short common_type; offset:0; size:2; signed:0;
>> field:unsigned char common_flags; offset:2; size:1; signed:0;
>> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
>> field:int common_pid; offset:4; size:4; signed:1;
>>
>> field:unsigned long address; offset:8; size:8; signed:0;
>> field:unsigned long ip; offset:16; size:8; signed:0;
>> field:unsigned long error_code; offset:24; size:8; signed:0;
>>
>> print fmt: "address=%pf ip=%pf error_code=0x%lx", (void
>> *)REC->address, (void *)REC->ip, REC->error_code
>>
>> Again, this looks exactly like what I would expect, since address holds
>> the cr2 value in that function. Plus, we know that the raw value is
>> correct. I suppose that the "symbolification" of that value is done in
>> trace-cmd, right? So, perhaps that is where I should start looking for
>> the problem?
>>
>> I definitely want to follow up on this and help where I can. That
>> said, I think I am satisfied with "our" (really, your) answer to the
>> original problem.
>
> It has something to do with the reading of the function names
> in /proc/kallsyms.
>
> Ah, I bet it's a change in the kernel. A recent update to trace-cmd was
> to ignore symbols in kallsyms of type "A" (which you can see is what
> I have above).
And my output (above) shows __per_cpu_start with type D instead of A.
Perhaps that is why we are seeing differences in our reports? It's
still curious that it would match 0x00...0 with something that is
clearly, well, not 0x00...0.
Let me know if there is a spot in the code where it decides whether to
symbolize. Or, I can work backward from the recent change you
mentioned above to find that spot if you will tell me the hash of that
commit.
Thanks again for working through this with me!
I hope you had a good weekend,
Will
>
> What do you have when you do:
>
> sudo grep per_cpu_start /proc/kallsyms
>
> You may or may not need sudo, depending on whether the kernel lets
> non-root users read kallsyms.
>
>
> -- Steve
On Mon, 19 Jun 2017 13:13:27 -0400
Will Hawkins <[email protected]> wrote:
> On Wed, Jun 14, 2017 at 4:01 PM, Steven Rostedt <[email protected]> wrote:
> > On Wed, 14 Jun 2017 13:47:17 -0400
> > Will Hawkins <[email protected]> wrote:
> >
> >
> >> > That's how trace-cmd parses it.
> >>
> >> In the kernel version that I am running (again, pretty old) I do not
> >> have this file. I do, however, have
> >
> > It may be due to the kernel version. It gets the functions
> > from /proc/kallsyms. That could have an issue. Although, I have
> > this too:
> >
> > # grep per_cpu_start /proc/kallsyms
> > 0000000000000000 A __per_cpu_start
>
> Here's my result of a similar command:
>
> # cat /proc/kallsyms | grep per_cpu_start
> 0000000000000000 D __per_cpu_start
Right, your kernel doesn't have the latest change, so the symbol isn't
marked as type "A" and trace-cmd doesn't ignore it.
>
> The only difference is in (what I think is) the symbol type flag in the
> second column.
>
> >
> > But I don't see it being converted in my report.
> >
> > Hmm, it's not saved. Interesting:
> >
> > trace-cmd report -f
> >
> > to see the list of saved functions. I need to figure out why it does this
> > for you, but not for me.
>
> sudo ./trace-cmd report -f | grep per_cpu_start
> 0000000000000000 __per_cpu_start
Yep, it took it. It only saves the address and the name.
> > Ah, I bet it's a change in the kernel. A recent update to trace-cmd was
> > to ignore symbols in kallsyms of type "A" (which you can see is what
> > I have above).
>
> And my output (above) shows __per_cpu_start with type D instead of A.
> Perhaps that is why we are seeing differences in our reports? It's
> still curious that it would match 0x00...0 with something that is
> clearly, well, not 0x00...0.
Because function ips are seldom at the start of a function (well,
other than for fentry code), it grabs the closest function whose
address falls before the ip. Here, that would be 0x0000.
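The lookup is basically this (a sketch, not the exact trace-cmd code):

struct func_sym {
	unsigned long addr;
	const char *name;
};

/* Binary-search a table sorted by address for the last symbol whose
 * address is <= ip. With __per_cpu_start saved at address 0, any ip
 * below the first real kernel symbol resolves to it. */
static const struct func_sym *find_sym(const struct func_sym *table,
				       int count, unsigned long ip)
{
	int lo = 0, hi = count - 1, best = -1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (table[mid].addr <= ip) {
			best = mid;	/* candidate; look for a later one */
			lo = mid + 1;
		} else {
			hi = mid - 1;
		}
	}
	return best >= 0 ? &table[best] : NULL;
}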
>
> Let me know if there is a spot in the code where it decides whether to
> symbolize. Or, I can work backward from the recent change you
> mentioned above to find that spot if you will tell me the hash of that
> commit.
If you run with trace-cmd report -O offset, you should see the offsets
from the start of the functions. I'm not sure it works with the
function graph tracer, but it should at least work with function
tracing.
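For example, a record that previously printed the bare mismatched
symbol, like:

  address=__per_cpu_end ip=__per_cpu_end

should then show a suspiciously large offset (hypothetical rendering;
the exact formatting depends on the trace-cmd version):

  address=__per_cpu_end+0x4000e0 ip=__per_cpu_end+0x4000e0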
-- Steve