LinuxLists.cc - Ftrace vs perf user page fault statistics differences

2017-06-12 20:08:51

Subject: Ftrace vs perf user page fault statistics differences

Dear Mr. Rostedt and Kernel community,

I hope that this is an appropriate place to ask this question, Please
forgive me for wasting your time if it is not. I've searched for
answers to this question on LKML and the "net" before asking. I
apologize if I already missed the question and answer somewhere else.

I was doing a gut check of my understanding of how the Kernel pages in
a binary program for execution. I created a very small program that is
two pages. The entry point to the program is on page a and the code
simply jumps to the second page (b) for its termination.

I compiled the program and dropped the kernel's memory caches [1].
Then I ran the program under perf:

perf record --call-graph fp -e page-faults ../one_page_play/page

I looked at the results:

perf report

and the results were as I expected. There were two page faults for
loading the code into memory and a page fault to
copy_user_enhanced_fast_string invoked by execve's implementation when
loading the binary.

I decided to run the application under ftrace just for fun. I wanted
an excuse to learn more about it and this seemed like the perfect
chance. I used the incredible trace-cmd suite for the actual
incantation of ftrace. I won't include the actual incantations here
because I used many of them while digging around.

The results are both expected and unexpected. I see output like this:

Event: page_fault_user:0x4000e0

which indicates that there is a page fault at the program's entry
point (and matches what I saw with the perf output). I have another
similar entry that confirms the other expected page fault when loading
the second page of the test application.

However, I also see entries like this:

Event: page_fault_user:0x7f4f590934c4 (1)

The addresses of the faults I see that match that pattern are not
loaded into the application binary. What I discovered as I
investigated, is that those page faults seem to occur when the kernel
is attempting to record the output of stack traces, etc.

After thinking through this, I came up with the following hypothesis
which is the crux of this email:

Ftrace's act of recording the traces that I requested to its ring
buffer generated page faults of their own. These page faults are
generated on behalf of the traced program and get reported in the
results.

If that is correct/reasonable, it explains the differences between
what perf is reporting and what ftrace is reporting and I am happy.
If, however, that is a bogus conclusion, please help me understand
what is going on.

I know that everyone who is on this email is incredibly busy and has
much to do. I hope that I've included enough information to make it
possible for you experts to advise, but not included too much to waste
your time.

If you have the time or interest in answering, I would love to hear
your responses. Please CC me directly on all responses.

Thanks again for your time!

Will

[1] I used echo 3 > /proc/sys/vm/drop_caches to accomplish this and
issued it between every run. It may have been overkill, but I did it
anyway.

2017-06-13 01:20:34

by Steven Rostedt

[permalink] [raw]

Subject: Re: Ftrace vs perf user page fault statistics differences

On Mon, 12 Jun 2017 20:20:42 -0400
Will Hawkins <[email protected]> wrote:

> Dear Mr. Rostedt and Kernel community,
>
> I hope that this is an appropriate place to ask this question, Please
> forgive me for wasting your time if it is not. I've searched for
> answers to this question on LKML and the "net" before asking. I
> apologize if I already missed the question and answer somewhere else.

No no, this is the correct forum.

>
> I was doing a gut check of my understanding of how the Kernel pages in
> a binary program for execution. I created a very small program that is
> two pages. The entry point to the program is on page a and the code
> simply jumps to the second page (b) for its termination.
>
> I compiled the program and dropped the kernel's memory caches [1].
> Then I ran the program under perf:
>
> perf record --call-graph fp -e page-faults ../one_page_play/page

Can you supply this program. So I can see exactly what it does?

>
> I looked at the results:
>
> perf report
>
> and the results were as I expected. There were two page faults for
> loading the code into memory and a page fault to
> copy_user_enhanced_fast_string invoked by execve's implementation when
> loading the binary.

What does perf script show you?

>
> I decided to run the application under ftrace just for fun. I wanted
> an excuse to learn more about it and this seemed like the perfect
> chance. I used the incredible trace-cmd suite for the actual
> incantation of ftrace. I won't include the actual incantations here
> because I used many of them while digging around.

what events did you enable?

>
> The results are both expected and unexpected. I see output like this:
>
> Event: page_fault_user:0x4000e0
>
> which indicates that there is a page fault at the program's entry
> point (and matches what I saw with the perf output). I have another
> similar entry that confirms the other expected page fault when loading
> the second page of the test application.
>
> However, I also see entries like this:
>
> Event: page_fault_user:0x7f4f590934c4 (1)

This could very well be the dynamic linker.

>
> The addresses of the faults I see that match that pattern are not
> loaded into the application binary. What I discovered as I
> investigated, is that those page faults seem to occur when the kernel
> is attempting to record the output of stack traces, etc.

Are you sure? What does ldd give you of your program?

>
> After thinking through this, I came up with the following hypothesis
> which is the crux of this email:
>
> Ftrace's act of recording the traces that I requested to its ring
> buffer generated page faults of their own. These page faults are
> generated on behalf of the traced program and get reported in the
> results.

There are no page faults that happen by the ftrace ring buffer that
would be associated with the program.

>
> If that is correct/reasonable, it explains the differences between
> what perf is reporting and what ftrace is reporting and I am happy.
> If, however, that is a bogus conclusion, please help me understand
> what is going on.

I'll need to know more about exactly what you are doing to help out.

>
> I know that everyone who is on this email is incredibly busy and has
> much to do. I hope that I've included enough information to make it
> possible for you experts to advise, but not included too much to waste
> your time.
>
> If you have the time or interest in answering, I would love to hear
> your responses. Please CC me directly on all responses.

No worries about Ccing you directly. LKML gets over 600 emails a day.
Nobody reads it all. If someone isn't Cc'd directly, they will most
likely never respond.

>
> Thanks again for your time!

No problem.

-- Steve

>
> Will
>
> [1] I used echo 3 > /proc/sys/vm/drop_caches to accomplish this and
> issued it between every run. It may have been overkill, but I did it
> anyway.

2017-06-13 02:05:09

On Wed, Jun 14, 2017 at 4:01 PM, Steven Rostedt <[email protected]> wrote:
> On Wed, 14 Jun 2017 13:47:17 -0400
> Will Hawkins <[email protected]> wrote:
>
>
>> > That's how trace-cmd parses it.
>>
>> In the kernel version that I am running (again, pretty old) I do not
>> have this file. I do, however, have
>
> It may be due to the kernel version. It gets the functions
> from /proc/kallsyms. That could have have an issue. Although, I have
> this too:
>
> # grep per_cpu_start /proc/kallsyms
> 0000000000000000 A __per_cpu_start

Here's my result of a similar command:

# cat /proc/kallsyms | grep per_cpu_start
0000000000000000 D __per_cpu_start

There's only the difference between (what I think is) the flag value
there in the second column.

>
> But I don't see it being converted in my report.
>
> Hmm, it's not saved. Interesting:
>
> trace-cmd report -f
>
> to see the list of saved functions. I need to figure out why it does
> for you, but not for me.

sudo ./trace-cmd report -f | grep per_cpu_start
0000000000000000 __per_cpu_start

>
>>
>> /sys/kernel/debug/tracing/events/exceptions/page_fault_user/format
>>
>> and the contents are:
>>
>> name: page_fault_user
>> ID: 79
>> format:
>> field:unsigned short common_type; offset:0; size:2; signed:0;
>> field:unsigned char common_flags; offset:2; size:1; signed:0;
>> field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
>> field:int common_pid; offset:4; size:4; signed:1;
>>
>> field:unsigned long address; offset:8; size:8; signed:0;
>> field:unsigned long ip; offset:16; size:8; signed:0;
>> field:unsigned long error_code; offset:24; size:8; signed:0;
>>
>> print fmt: "address=%pf ip=%pf error_code=0x%lx", (void
>> *)REC->address, (void *)REC->ip, REC->error_code
>>
>> Again, this looks like exactly what I would expect since address has
>> the cr2 value in that function. Plus, we know that the raw value is
>> correct. I suppose that the "symbolification" of that value is done in
>> trace-cmd, right? So, perhaps that is where I should start looking for
>> the problem?
>>
>> I definitely want to follow up on this and help where I can. That
>> said, I think I am satisfied with "our" (really, your) answer to the
>> original problem.
>
> It has something to do with the reading of the function names
> in /proc/kallsyms.
>
> Ah, I bet it's a change in the kernel. A recent update to trace-cmd was
> to ignore functions in kallsyms of type "A" (which you can see is what
> I have above).

And my output (above) seems to show that mine shows the per_cpu_start
as a D instead of an A. Perhaps that is why we are seeing differences
in our report? It's still curious that it would match 0x000..0 with
something that is clearly, well, not 0x00...0.

Let me know if there is a spot in the code where it decides whether to
symbolize. Or, I can work backward from the recent change you
mentioned above to find that spot if you will tell me the hash of that
commit.

Thanks again for working through this with me!

I hope you had a good weekend,
Will

>
> What do you have when you do:
>
> sudo grep per_cpu_start /proc/kallsyms
>
> May or may not need sudo depending on if the kernel lets non root have
> access to kallsyms.
>
>
> -- Steve

2017-06-19 18:52:55

by Steven Rostedt

[permalink] [raw]

Subject: Re: Ftrace vs perf user page fault statistics differences

On Mon, 19 Jun 2017 13:13:27 -0400
Will Hawkins <[email protected]> wrote:

> On Wed, Jun 14, 2017 at 4:01 PM, Steven Rostedt <[email protected]> wrote:
> > On Wed, 14 Jun 2017 13:47:17 -0400
> > Will Hawkins <[email protected]> wrote:
> >
> >
> >> > That's how trace-cmd parses it.
> >>
> >> In the kernel version that I am running (again, pretty old) I do not
> >> have this file. I do, however, have
> >
> > It may be due to the kernel version. It gets the functions
> > from /proc/kallsyms. That could have have an issue. Although, I have
> > this too:
> >
> > # grep per_cpu_start /proc/kallsyms
> > 0000000000000000 A __per_cpu_start
>
> Here's my result of a similar command:
>
> # cat /proc/kallsyms | grep per_cpu_start
> 0000000000000000 D __per_cpu_start

Right, that it doesn't have the latest code to say to ignore it in your
kernel.

>
> There's only the difference between (what I think is) the flag value
> there in the second column.
>
> >
> > But I don't see it being converted in my report.
> >
> > Hmm, it's not saved. Interesting:
> >
> > trace-cmd report -f
> >
> > to see the list of saved functions. I need to figure out why it does
> > for you, but not for me.
>
> sudo ./trace-cmd report -f | grep per_cpu_start
> 0000000000000000 __per_cpu_start

Yep, it took it. It only saves the address an the name.

> > Ah, I bet it's a change in the kernel. A recent update to trace-cmd was
> > to ignore functions in kallsyms of type "A" (which you can see is what
> > I have above).
>
> And my output (above) seems to show that mine shows the per_cpu_start
> as a D instead of an A. Perhaps that is why we are seeing differences
> in our report? It's still curious that it would match 0x000..0 with
> something that is clearly, well, not 0x00...0.

Because function ips are seldom at the start of the function (well,
other than fentry code), it grabs the first function with the address
before the ip. Which would be 0x0000.

>
> Let me know if there is a spot in the code where it decides whether to
> symbolize. Or, I can work backward from the recent change you
> mentioned above to find that spot if you will tell me the hash of that
> commit.

If you run with trace-cmd report -O offset, you should see the offset
off of functions. Not sure it works with function graph tracer, but
should at least work with function tracing.

-- Steve