2015-12-09 04:13:49

by Yunlong Song

Subject: [Questions] perf c2c: What's the current status of perf c2c?

Hi, Don,
I am interested in the perf c2c tool, which was introduced in: http://lwn.net/Articles/588866/
However, I found that this tool has not been merged into the mainline perf tree. Why not? It was first
introduced in Feb. 2014. What's its current status now? Does it have a new version or a repository
somewhere else? And does it support Haswell?

--
Thanks,
Yunlong Song


2015-12-09 08:04:48

by Jiri Olsa

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On Wed, Dec 09, 2015 at 12:06:44PM +0800, Yunlong Song wrote:
> Hi, Don,
> I am interested in the perf c2c tool, which was introduced in: http://lwn.net/Articles/588866/
> However, I found that this tool has not been merged into the mainline perf tree. Why not? It was first
> introduced in Feb. 2014. What's its current status now? Does it have a new version or a repository
> somewhere else? And does it support Haswell?

hi,
not sure Don made any progress in this area, but I'm having
his c2c sources rebased on current perf sources ATM.

I changed the tool a little to run over new DATALA events
added in Haswell (in addition to ldlat events) and it seems
to work.

the plan for me is to use it some more to prove it's useful
and kick it to be merged with perf at some point

jirka

2015-12-09 08:15:47

by Wang Nan

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?



On 2015/12/9 16:04, Jiri Olsa wrote:
> On Wed, Dec 09, 2015 at 12:06:44PM +0800, Yunlong Song wrote:
>> Hi, Don,
>> I am interested in the perf c2c tool, which was introduced in: http://lwn.net/Articles/588866/
>> However, I found that this tool has not been merged into the mainline perf tree. Why not? It was first
>> introduced in Feb. 2014. What's its current status now? Does it have a new version or a repository
>> somewhere else? And does it support Haswell?
> hi,
> not sure Don made any progress in this area, but I'm having
> his c2c sources rebased on current perf sources ATM.

Do you have a git repository so we can fetch the
code of it?

Thank you.

2015-12-09 09:11:50

by Jiri Olsa

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On Wed, Dec 09, 2015 at 04:12:36PM +0800, Wangnan (F) wrote:
>
>
> On 2015/12/9 16:04, Jiri Olsa wrote:
> >On Wed, Dec 09, 2015 at 12:06:44PM +0800, Yunlong Song wrote:
> >>Hi, Don,
> >> I am interested in the perf c2c tool, which was introduced in: http://lwn.net/Articles/588866/
> >>However, I found that this tool has not been merged into the mainline perf tree. Why not? It was first
> >>introduced in Feb. 2014. What's its current status now? Does it have a new version or a repository
> >>somewhere else? And does it support Haswell?
> >hi,
> >not sure Don made any progress in this area, but I'm having
> >his c2c sources rebased on current perf sources ATM.
>
> Do you have a git repository so we can fetch the
> code of it?

yes, but it makes my eyes bleed ;-) I have some hacks on
top of Don's changes which I'm ashamed to share ATM

let me kick it into some reasonable shape first

jirka

2015-12-09 09:34:11

by Peter Zijlstra

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On Wed, Dec 09, 2015 at 09:04:40AM +0100, Jiri Olsa wrote:
> On Wed, Dec 09, 2015 at 12:06:44PM +0800, Yunlong Song wrote:
> > Hi, Don,
> > I am interested in the perf c2c tool, which was introduced in: http://lwn.net/Articles/588866/
> > However, I found that this tool has not been merged into the mainline perf tree. Why not? It was first
> > introduced in Feb. 2014. What's its current status now? Does it have a new version or a repository
> > somewhere else? And does it support Haswell?
>
> hi,
> not sure Don made any progress in this area, but I'm having
> his c2c sources rebased on current perf sources ATM.
>
> I changed the tool a little to run over new DATALA events
> added in Haswell (in addition to ldlat events) and it seems
> to work.
>
> the plan for me is to use it some more to prove it's useful
> and kick it to be merged with perf at some point

So I never really liked the c2c tool because it was so narrowly
focussed, it only works on NUMA thingies IIRC.

I would much rather see a tool that uses PEBS events and does a dwarf
decode of the exact instruction's data reference -- without relying on
data address bits.

That is, suppose we measure LLC_MISS: even if we have a
data address, as soon as it's inside a dynamically allocated object,
you're lost.

However, since we have the exact instruction we can simply look at that.
Imagine something like:

struct foo {
	int blah;
	int val;
	int array[];
};

struct bar {
	struct foo *foo;
};

int foobar(struct bar *bar)
{
	return bar->foo->val;
}

Which we can imagine could result in code like:

foobar:
	mov    (%rax), %rax	# load bar::foo
	mov    4(%rax), %eax	# load foo::val


And DWARFs should know this, so by knowing the instruction we can know
which load missed the cache.

Once you have this information, you can use pahole like structure output
and heat colour them or whatnot. Bright red if you miss lots etc..

Now currently this is possible but a bit of work because the DWARF
annotations are not exactly following these data types, that is you
might need to decode previous instructions and infer some bits.

I think Stephane was working with GCC people to allow more/better DWARF
annotations and allow easier retrieval of this information.


Note: the proposed scheme still has some holes in it; imagine trying to
load an array[] member like:

mov 8(%rax, %rcx, 4), %rcx

This would load the array element indexed by RCX into RCX, thereby
destroying the index. In this case knowing the data address you can
still compute the index if you also know RAX (which you get from the
PEBS register dump).
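
As a rough sketch of that recovery (assuming the PEBS record exposes both the
sampled data linear address and the general purpose registers, and that the
displacement and scale have already been recovered by decoding the instruction;
the names below are purely illustrative):

	/*
	 * For:  mov 8(%rax, %rcx, 4), %rcx
	 * the effective address is rax + 8 + index * 4, so given the data
	 * linear address (dla) from PEBS and the RAX value from the PEBS
	 * register dump, the clobbered index can be recomputed.
	 */
	#include <stdint.h>

	static uint64_t recover_index(uint64_t dla, uint64_t rax,
				      uint64_t disp, uint64_t scale)
	{
		return (dla - rax - disp) / scale;
	}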

2015-12-09 10:58:15

by Peter Zijlstra

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On Wed, Dec 09, 2015 at 10:34:02AM +0100, Peter Zijlstra wrote:
> Which we can imagine could result in code like:
>
> foobar:
> 	mov    (%rax), %rax	# load bar::foo
> 	mov    4(%rax), %eax	# load foo::val
>
>
> And DWARFs should know this, so by knowing the instruction we can know
> which load missed the cache.
>
> Once you have this information, you can use pahole like structure output
> and heat colour them or whatnot. Bright red if you miss lots etc..
>
> Now currently this is possible but a bit of work because the DWARF
> annotations are not exactly following these data types, that is you
> might need to decode previous instructions and infer some bits.

To clarify, current DWARFs might only know the argument to foobar is of
type struct bar *, and we'll have to infer the rest.

> I think Stephane was working with GCC people to allow more/better DWARF
> annotations and allow easier retrieval of this information.

And even if that gets sorted, it might still make sense to implement the
hard case as per the above, because it'll take a long time before
everything is built with the fancy new GCC/dwarf output.

2015-12-09 11:09:39

by Joe Mario

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

[RESEND - this time w/o html junk]

On 12/09/2015 04:34 AM, Peter Zijlstra wrote:
> On Wed, Dec 09, 2015 at 09:04:40AM +0100, Jiri Olsa wrote:
>> On Wed, Dec 09, 2015 at 12:06:44PM +0800, Yunlong Song wrote:
>>> Hi, Don,
>>> I am interested in the perf c2c tool, which was introduced in: http://lwn.net/Articles/588866/
>>> However, I found that this tool has not been merged into the mainline perf tree. Why not? It was first
>>> introduced in Feb. 2014. What's its current status now? Does it have a new version or a repository
>>> somewhere else? And does it support Haswell?
>>
>> hi,
>> not sure Don made any progress in this area, but I'm having
>> his c2c sources rebased on current perf sources ATM.
>>

<snip>

> So I never really liked the c2c tool because it was so narrowly
> focussed, it only works on NUMA thingies IIRC.
>
> I would much rather see a tool that uses PEBS events and does a dwarf
> decode of the exact instruction's data reference -- without relying on
> data address bits.

Peter:
Yes, that would be a great enhancement, but is that any reason to hold up the current implementation?

I've been using "perf c2c" heavily with customers over the past two years and after they see what it can do, their first question is why it hasn't been checked in upstream yet.

Joe

2015-12-09 14:04:05

by Peter Zijlstra

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On Wed, Dec 09, 2015 at 06:02:44AM -0500, Joe Mario wrote:
> Yes, that would be a great enhancement,

This is hardly new though; I've outlined the very same approach the first time
the c2c thing got mentioned.

> but is that any reason to hold up the current implementation?

I just wonder how much of c2c is still useful once we get it done
properly. And once such a tool is out there, it's hard to kill, leaving us
with a maintenance burden we could do without.

2015-12-09 16:58:12

by Andi Kleen

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

> > the plan for me is to use it some more to prove it's useful
> > and kick it to be merged with perf at some point
>
> So I never really liked the c2c tool because it was so narrowly
> focussed, it only works on NUMA thingies IIRC.

It should work on all systems with an Intel Core (not Atom)

However it was never clear to me if the tool was any better
than simply sampling for

mem_load_uops_l3_hit_retired.xsnp_hitm:pp (local socket)
mem_load_uops_l3_miss_retired.remote_hitm:pp (remote socket)

which gives you instructions that reference bouncing cache lines.
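
For illustration, a minimal sampling session along those lines might look like
this (a sketch; it assumes the symbolic event names are known to your perf
binary, e.g. via the Intel event lists or ocperf, otherwise raw event codes
are needed):

	perf record -e mem_load_uops_l3_hit_retired.xsnp_hitm:pp \
	            -e mem_load_uops_l3_miss_retired.remote_hitm:pp \
	            -a -- sleep 10
	perf report --sort=symbol,dso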

-Andi

2015-12-09 17:15:09

by Stephane Eranian

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

Hi,

On Wed, Dec 9, 2015 at 8:58 AM, Andi Kleen <[email protected]> wrote:
>> > the plan for me is to use it some more to prove it's useful
>> > and kick it to be merged with perf at some point
>>
>> So I never really liked the c2c tool because it was so narrowly
>> focussed, it only works on NUMA thingies IIRC.
>
> It should work on all systems with an Intel Core (not Atom)
>
> However it was never clear to me if the tool was any better
> than simply sampling for
>
> mem_load_uops_l3_hit_retired.xsnp_hitm:pp (local socket)
> mem_load_uops_l3_miss_retired.remote_hitm:pp (remote socket)
>
> which gives you instructions that reference bouncing cache lines.
>
If I recall correctly, the c2c tool gives you more than the bouncing line. It shows you
the offset inside the line and the participating CPUs.

2015-12-09 17:21:49

by Andi Kleen

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

> If I recall correctly, the c2c tool gives you more than the bouncing line. It shows you
> the offset inside the line and the participating CPUs.

On Haswell and later you could get the same with the normal address
reporting. The events above support DLA.
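
As a sketch, one way to combine those events with address reporting would be
something like the following (-d asks perf record to capture the sampled data
addresses, and --mem-mode lets the report use data-address sort keys such as
symbol_daddr; exact availability depends on the perf version):

	perf record -d -e mem_load_uops_l3_miss_retired.remote_hitm:pp -a -- sleep 10
	perf report --mem-mode --sort=symbol_daddr,symbol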

-Andi

--
[email protected] -- Speaking for myself only.

2015-12-09 19:48:24

by Stephane Eranian

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On Wed, Dec 9, 2015 at 9:21 AM, Andi Kleen <[email protected]> wrote:
>> If I recall correctly, the c2c tool gives you more than the bouncing line. It shows you
>> the offset inside the line and the participating CPUs.
>
> On Haswell and later you could get the same with the normal address
> reporting. The events above support DLA.
>
I know the events track the condition better than just with the
latency threshold.
I think what it boils down to is not so much the PMU side but rather
the tool side.

2015-12-09 20:41:47

by Joe Mario

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On 12/09/2015 12:15 PM, Stephane Eranian wrote:
> If I recall correctly, the c2c tool gives you more than the bouncing line. It shows you
> the offset inside the line and the participating CPUs.

Correct. It shows much more than the bouncing line.

Appended below is the output for running "perf c2c" on a 4-node system
running a multi-thread version of linpack. I've annotated it to describe
what some of the fields mean.

Note, your screen output has to be set pretty wide to read it. For those
not wanting to read it in their mailer, grab it from:
http://people.redhat.com/jmario/perf_c2c/perf_c2c_annotated_output.txt

Let me know of any questions. My annotations begin with "// ".

Joe
--------------------------------------------------------------------
// Perf c2c output from a linpack run.

// Set screen wide to view this.

// Here's a breakdown of all loads and stores sampled.
// It shows where they hit and missed.

=================================================
Trace Event Information
=================================================
Total records : 3229269
Locked Load/Store Operations : 64420
Load Operations : 1153827
Loads - uncacheable : 11
Loads - IO : 0
Loads - Miss : 11002
Loads - no mapping : 200
Load Fill Buffer Hit : 355942
Load L1D hit : 361303
Load L2D hit : 46792
Load LLC hit : 274265
Load Local HITM : 18647
Load Remote HITM : 55225
Load Remote HIT : 8917
Load Local DRAM : 11895
Load Remote DRAM : 28275
Load MESI State Exclusive : 40170
Load MESI State Shared : 0
Load LLC Misses : 104312
LLC Misses to Local DRAM : 11.4%
LLC Misses to Remote DRAM : 27.1%
LLC Misses to Remote cache (HIT) : 8.5%
LLC Misses to Remote cache (HITM) : 52.9%
Store Operations : 2069610
Store - uncacheable : 0
Store - no mapping : 3538
Store L1D Hit : 1889168
Store L1D Miss : 176904
No Page Map Rejects : 102146
Unable to parse data source : 5832


// Table showing activity on the shared cache lines.

=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 14213
Load HITs on shared lines : 426101
Fill Buffer Hits on shared lines : 159230
L1D hits on shared lines : 77377
L2D hits on shared lines : 28897
LLC hits on shared lines : 94709
Locked Access on shared lines : 45704
Store HITs on shared lines : 198050
Store L1D hits on shared lines : 188136
Total Merged records : 271778


// In the next table, for each of the 10 hottest cache lines, break out all the activity for that line.
// The sorting is done by remote hitm percentage, defined as loads that hit in a remote node's modified cacheline.

================================================================================================================================================================================================================
Shared Data Cache Line Table

Total %All Total ---- Core Load Hit ---- -- LLC Load Hit -- ----- LLC Load Hitm ----- -- Load Dram -- LLC ---- Store Reference ----
Index Phys Adrs Records Ld Miss %hitm Loads FB L1D L2D Lcl Rmt Total Lcl Rmt Lcl Rmt Ld Miss Total L1Hit L1Miss
================================================================================================================================================================================================================
0 0x7f9d8833bf80 12402 1.87% 3.54% 11748 6582 2259 72 24 20 2788 835 1953 2 1 1976 654 648 6
1 0x85fd0f0c5c0 5593 0.35% 0.65% 3813 797 2131 153 171 32 457 97 360 20 52 464 1780 1780 0
2 0x81fd0ddfd00 2174 0.21% 0.40% 1352 318 558 50 75 13 315 94 221 9 14 257 822 822 0
3 0x85fd0f0c8c0 1854 0.21% 0.39% 1436 184 655 153 92 28 279 65 214 13 32 287 418 418 0
4 0x83fffb17280 735 0.16% 0.31% 543 3 287 12 15 2 216 44 172 2 6 182 192 192 0
5 0x7fff81db1e00 3641 0.16% 0.30% 3331 1111 849 624 384 83 228 62 166 15 37 301 310 310 0
6 0x83fff857280 755 0.15% 0.29% 563 3 316 13 24 0 201 43 158 3 3 164 192 192 0
7 0x85fff917280 801 0.15% 0.28% 648 1 409 19 19 1 198 44 154 1 0 156 153 153 0
8 0x85fffc17280 767 0.14% 0.27% 664 4 403 27 15 2 205 54 151 2 6 161 103 103 0
9 0x85fffbf7280 718 0.14% 0.27% 495 5 262 15 14 0 193 42 151 1 5 157 223 223 0


======================================================================================================================================================================================================================================================================

Shared Cache Line Distribution Pareto

---- All ---- -- Shared -- ---- HITM ---- Load Inst Execute Latency Shared Data Participants
Data Misses Data Misses Remote Local -- Store Refs --
---- cycles ---- cpu
Num %dist %cumm %dist %cumm LLCmiss LLChit L1 hit L1 Miss Data Address Pid Tid Inst Address median mean CV cnt Symbol Object Node{cpus %hitms %stores} Node{cpus %hitms %stores} ...
======================================================================================================================================================================================================================================================================
-----------------------------------------------------------------------------------------------
0 1.9% 1.9% 3.5% 3.5% 1953 835 648 6 0x7f9d8833bf80 146560
-----------------------------------------------------------------------------------------------
0.3% 0.4% 97.2% 0.0% 0x10 146560 *** 0x9969fa 2036 2265 12.4% 80 __kmp_acquire_queuing_lock_wi xlinpack_xeon64 0{20 16.7% 25.2%} 1{21 33.3% 24.0%} 2{20 33.3% 25.7%} 3{19 16.7% 25.1%}
88.9% 89.1% 0.0% 0.0% 0x10 146560 *** 0x996dfc 304 349 0.4% 79 __kmp_release_queuing_lock_wi xlinpack_xeon64 0{20 23.7% n/a} 1{21 24.5% n/a} 2{19 26.8% n/a} 3{19 25.0% n/a}
0.1% 0.0% 0.0% 0.0% 0x14 146560 146608 0x9969ac 410 410 0.0% 1 __kmp_acquire_queuing_lock_wi xlinpack_xeon64 2{ 1 100.0% n/a}
10.7% 10.5% 0.0% 0.0% 0x14 146560 *** 0x996def 303 352 1.2% 69 __kmp_release_queuing_lock_wi xlinpack_xeon64 0{18 21.5% n/a} 1{18 27.8% n/a} 2{18 31.6% n/a} 3{15 19.1% n/a}
0.0% 0.0% 2.8% 100.0% 0x14 146560 *** 0x996e54 n/a n/a n/a 19 __kmp_release_queuing_lock_wi xlinpack_xeon64 0{ 4 n/a 25.0%} 1{ 6 n/a 29.2%} 2{ 6 n/a 33.3%} 3{ 3 n/a 12.5%}

// Here's where the data gets interesting. Look at cacheline 0 above. The cacheline at data address 0x7f9d8833bf80 had the most contention.
//
// There were 1953 loads to that cacheline that hit in "remote-node modified cachelines". That's why this is the hottest cacheline.
// There were 835 loads that hit in a local modified cacheline. Also noted are the number of stores that both hit and missed the L1 cache.
// All accesses to that cacheline occurred at offsets 0x10 and 0x14.
// All accesses were done by one pid (146560), which was the Pid for linpack.
// I chose to display "***" when multiple thread ids (Tids) were involved in the same entry. Individual Tids can be displayed, but it makes for a long output.
// The instruction address of the load/store is displayed, along with the function and object name.
// Where loads are involved, the median and mean load latency cycles are displayed. Dumping the c2c raw records shows you individual worst offenders. It's not uncommon to see loads taking tens of thousands of cycles to complete when heavy contention is involved.
// The "cpu cnt" column shows how many individual cpus had samples contributing to a row of data. In the first row above, there were samples from 80 cpus for that row.
// The "Shared Data Participants" columns show the nodes the samples occured on, the number of cpus in each node that samples came from for that row, and for each node, the percentage of hitms and stores that came from that node.
//
// The above data shows how that hot cacheline is being concurrently read and written from cpus across all four nodes on the system. It's then easy to disassemble the binary (with line info) to identify the source code line numbers causing the false sharing.
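//
// As a sketch of that last step, the hottest instruction addresses from the Pareto entry above
// can be mapped back to source lines with something like the commands below (this assumes the
// binary was built with debug info; addr2line and objdump are just two ways to do the mapping):
//
//   addr2line -f -C -e xlinpack_xeon64 0x9969fa 0x996dfc
//   objdump -d -l xlinpack_xeon64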

-----------------------------------------------------------------------------------------------
1 0.3% 2.2% 0.7% 4.2% 360 97 1780 0 0xffff885fd0f0a5c0 ***
-----------------------------------------------------------------------------------------------
21.1% 29.9% 0.0% 0.0% 0x18 146560 *** 0xffffffff81196f40 1193 2404 13.9% 44 handle_mm_fault [kernel.kallsyms] 0{ 4 9.2% n/a} 1{12 27.6% n/a} 2{17 40.8% n/a} 3{11 22.4% n/a}
25.0% 21.6% 0.0% 0.0% 0x18 146560 *** 0xffffffff811a16d9 377 754 16.8% 43 mm_find_pmd [kernel.kallsyms] 0{11 23.3% n/a} 1{12 16.7% n/a} 2{11 41.1% n/a} 3{ 9 18.9% n/a}
17.2% 12.4% 0.0% 0.0% 0x18 0 0 0xffffffff8163accb 1994 2467 11.0% 58 __schedule [kernel.kallsyms] 0{16 24.2% n/a} 1{13 21.0% n/a} 2{13 21.0% n/a} 3{16 33.9% n/a}
3.9% 2.1% 24.1% 0.0% 0x24 *** *** 0xffffffff810b41ce 2066 3398 30.0% 136 finish_task_switch [kernel.kallsyms] 0{32 35.7% 23.5%} 1{36 14.3% 26.8%} 2{36 42.9% 26.1%} 3{32 7.1% 23.5%}
13.9% 11.3% 29.6% 0.0% 0x24 146560 *** 0xffffffff8163aa37 7230 7026 7.8% 139 __schedule [kernel.kallsyms] 0{34 26.0% 18.8%} 1{36 24.0% 28.5%} 2{36 34.0% 27.9%} 3{33 16.0% 24.7%}
0.0% 0.0% 0.1% 0.0% 0x24 146560 146665 0xffffffff8163aa3c n/a n/a n/a 1 __schedule [kernel.kallsyms] 2{ 1 n/a 100.0%}
5.3% 3.1% 0.0% 0.0% 0x38 146560 *** 0xffffffff810aa959 406 971 22.2% 20 down_read_trylock [kernel.kallsyms] 0{ 3 15.8% n/a} 1{ 9 52.6% n/a} 2{ 4 15.8% n/a} 3{ 4 15.8% n/a}
12.5% 15.5% 18.0% 0.0% 0x38 146560 *** 0xffffffff810aa965 2342 3845 13.9% 82 down_read_trylock [kernel.kallsyms] 0{21 15.6% 16.6%} 1{22 24.4% 31.6%} 2{19 35.6% 27.2%} 3{20 24.4% 24.7%}
1.1% 4.1% 28.3% 0.0% 0x38 146560 *** 0xffffffff810aa9c3 1376 1462 20.1% 87 up_read [kernel.kallsyms] 0{21 50.0% 16.7%} 1{23 50.0% 31.3%} 2{23 0.0% 28.6%} 3{20 0.0% 23.4%}

-----------------------------------------------------------------------------------------------
2 0.2% 2.4% 0.4% 4.6% 221 94 822 0 0xffff881fd0dddd00 146560
-----------------------------------------------------------------------------------------------
2.7% 2.1% 0.0% 0.0% 0x00 146560 *** 0xffffffff811a31ff 328 550 25.9% 8 page_lock_anon_vma_read [kernel.kallsyms] 0{ 2 16.7% n/a} 1{ 1 16.7% n/a} 2{ 3 33.3% n/a} 3{ 2 33.3% n/a}
9.5% 9.6% 0.0% 0.0% 0x00 146560 *** 0xffffffff811a36f5 429 814 19.2% 24 try_to_unmap_anon [kernel.kallsyms] 0{ 5 19.0% n/a} 1{ 9 42.9% n/a} 2{ 5 23.8% n/a} 3{ 5 14.3% n/a}
30.8% 27.7% 0.0% 0.0% 0x00 146560 *** 0xffffffff811a39f0 373 541 9.4% 37 rmap_walk [kernel.kallsyms] 0{12 32.4% n/a} 1{ 9 20.6% n/a} 2{ 8 25.0% n/a} 3{ 8 22.1% n/a}
24.0% 20.2% 0.0% 0.0% 0x00 146560 *** 0xffffffff811a3a6a 424 692 11.0% 35 rmap_walk [kernel.kallsyms] 0{ 9 26.4% n/a} 1{10 24.5% n/a} 2{ 9 32.1% n/a} 3{ 7 17.0% n/a}
0.9% 0.0% 0.0% 0.0% 0x08 146560 *** 0xffffffff810aa959 295 295 4.4% 2 down_read_trylock [kernel.kallsyms] 0{ 2 100.0% n/a}
2.3% 2.1% 38.3% 0.0% 0x08 146560 *** 0xffffffff810aa965 537 821 25.9% 48 down_read_trylock [kernel.kallsyms] 0{15 40.0% 25.4%} 1{12 0.0% 28.6%} 2{11 0.0% 26.7%} 3{10 60.0% 19.4%}
3.2% 4.3% 15.0% 0.0% 0x08 146560 *** 0xffffffff810aa9c3 746 834 17.1% 44 up_read [kernel.kallsyms] 0{13 14.3% 23.6%} 1{12 14.3% 23.6%} 2{10 28.6% 26.0%} 3{ 9 42.9% 26.8%}
4.1% 3.2% 22.9% 0.0% 0x08 146560 *** 0xffffffff8163a0a5 950 934 15.7% 46 down_read [kernel.kallsyms] 0{14 44.4% 31.9%} 1{13 22.2% 22.3%} 2{11 22.2% 25.5%} 3{ 8 11.1% 20.2%}
12.7% 17.0% 0.0% 0.0% 0x28 146560 *** 0xffffffff811a3140 337 827 26.1% 28 page_get_anon_vma [kernel.kallsyms] 0{ 7 21.4% n/a} 1{ 9 42.9% n/a} 2{ 6 10.7% n/a} 3{ 6 25.0% n/a}
2.3% 2.1% 17.0% 0.0% 0x28 146560 *** 0xffffffff811a3151 1288 1870 29.9% 43 page_get_anon_vma [kernel.kallsyms] 0{13 40.0% 33.6%} 1{11 20.0% 19.3%} 2{10 0.0% 27.1%} 3{ 9 40.0% 20.0%}
0.0% 0.0% 0.1% 0.0% 0x28 146560 146602 0xffffffff811a31aa n/a n/a n/a 1 page_get_anon_vma [kernel.kallsyms] 1{ 1 n/a 100.0%}
2.7% 3.2% 6.7% 0.0% 0x28 146560 *** 0xffffffff811c77b4 843 1950 43.3% 35 migrate_pages [kernel.kallsyms] 0{10 16.7% 36.4%} 1{ 6 16.7% 18.2%} 2{10 16.7% 21.8%} 3{ 9 50.0% 23.6%}
5.0% 8.5% 0.0% 0.0% 0x30 146560 *** 0xffffffff81190b15 336 561 19.6% 15 anon_vma_interval_tree_iter_f [kernel.kallsyms] 0{ 5 36.4% n/a} 1{ 4 9.1% n/a} 2{ 3 36.4% n/a} 3{ 3 18.2% n/a}

-----------------------------------------------------------------------------------------------
3 0.2% 2.6% 0.4% 5.0% 214 65 418 0 0xffff885fd0f0a8c0 ***
-----------------------------------------------------------------------------------------------
7.9% 13.8% 0.0% 0.0% 0x08 0 0 0xffffffff81065c26 430 724 17.2% 20 leave_mm [kernel.kallsyms] 0{10 58.8% n/a} 1{10 41.2% n/a}
0.5% 0.0% 12.7% 0.0% 0x08 0 0 0xffffffff81065c38 1343 1343 0.0% 25 leave_mm [kernel.kallsyms] 0{15 0.0% 54.7%} 1{10 100.0% 45.3%}
1.9% 1.5% 5.5% 0.0% 0x08 0 0 0xffffffff810b0c74 1902 3074 32.9% 18 cpumask_set_cpu [kernel.kallsyms] 0{ 9 25.0% 60.9%} 1{ 9 75.0% 39.1%}
11.2% 9.2% 0.0% 0.0% 0x08 *** *** 0xffffffff8163acb5 401 554 15.9% 20 __schedule [kernel.kallsyms] 0{13 58.3% n/a} 1{ 7 41.7% n/a}
6.5% 6.2% 0.0% 0.0% 0x0c 0 0 0xffffffff81065c26 310 540 22.2% 14 leave_mm [kernel.kallsyms] 1{ 2 14.3% n/a} 2{10 78.6% n/a} 3{ 2 7.1% n/a}
0.5% 0.0% 11.2% 0.0% 0x0c 0 0 0xffffffff81065c38 3638 3638 0.0% 25 leave_mm [kernel.kallsyms] 1{ 1 0.0% 4.3%} 2{16 100.0% 68.1%} 3{ 8 0.0% 27.7%}
2.8% 4.6% 3.8% 0.0% 0x0c 0 0 0xffffffff810b0c74 1774 2290 25.7% 20 cpumask_set_cpu [kernel.kallsyms] 1{ 3 0.0% 18.8%} 2{12 100.0% 50.0%} 3{ 5 0.0% 31.2%}
15.4% 9.2% 0.0% 0.0% 0x0c *** *** 0xffffffff8163acb5 415 768 23.1% 21 __schedule [kernel.kallsyms] 1{ 2 3.0% n/a} 2{12 63.6% n/a} 3{ 7 33.3% n/a}
10.7% 15.4% 0.0% 0.0% 0x10 0 0 0xffffffff81065c26 428 619 14.2% 18 leave_mm [kernel.kallsyms] 0{13 65.2% n/a} 1{ 3 13.0% n/a} 3{ 2 21.7% n/a}
0.0% 0.0% 14.1% 0.0% 0x10 0 0 0xffffffff81065c38 n/a n/a n/a 27 leave_mm [kernel.kallsyms] 0{15 n/a 55.9%} 1{ 5 n/a 18.6%} 3{ 7 n/a 25.4%}
0.0% 4.6% 6.9% 0.0% 0x10 0 0 0xffffffff810b0c74 n/a n/a n/a 18 cpumask_set_cpu [kernel.kallsyms] 0{10 n/a 41.4%} 1{ 5 n/a 41.4%} 3{ 3 n/a 17.2%}
9.3% 7.7% 0.0% 0.0% 0x10 *** *** 0xffffffff8163acb5 404 640 25.5% 18 __schedule [kernel.kallsyms] 0{ 6 35.0% n/a} 1{ 5 25.0% n/a} 3{ 7 40.0% n/a}
13.1% 9.2% 0.0% 0.0% 0x14 0 0 0xffffffff81065c26 340 458 9.8% 21 leave_mm [kernel.kallsyms] 1{ 7 32.1% n/a} 2{13 64.3% n/a} 3{ 1 3.6% n/a}
0.5% 0.0% 20.3% 0.0% 0x14 0 0 0xffffffff81065c38 2662 2662 0.0% 27 leave_mm [kernel.kallsyms] 1{11 0.0% 32.9%} 2{14 100.0% 57.6%} 3{ 2 0.0% 9.4%}
1.9% 0.0% 8.9% 0.0% 0x14 0 0 0xffffffff810b0c74 2892 3086 28.5% 20 cpumask_set_cpu [kernel.kallsyms] 1{ 6 0.0% 32.4%} 2{13 100.0% 64.9%} 3{ 1 0.0% 2.7%}
0.0% 0.0% 0.2% 0.0% 0x14 0 0 0xffffffff810b0c78 n/a n/a n/a 1 cpumask_set_cpu [kernel.kallsyms] 2{ 1 n/a 100.0%}
8.9% 4.6% 0.0% 0.0% 0x14 0 0 0xffffffff8163acb5 437 1041 20.5% 16 __schedule [kernel.kallsyms] 1{ 8 42.1% n/a} 2{ 7 52.6% n/a} 3{ 1 5.3% n/a}
3.3% 7.7% 0.0% 0.0% 0x18 0 0 0xffffffff81065c26 407 450 15.8% 8 leave_mm [kernel.kallsyms] 3{ 8 100.0% n/a}
0.0% 0.0% 12.4% 0.0% 0x18 0 0 0xffffffff81065c38 n/a n/a n/a 16 leave_mm [kernel.kallsyms] 3{16 n/a 100.0%}
0.9% 1.5% 3.8% 0.0% 0x18 0 0 0xffffffff810b0c74 1007 1230 18.2% 12 cpumask_set_cpu [kernel.kallsyms] 3{12 100.0% 100.0%}
4.7% 4.6% 0.0% 0.0% 0x18 *** *** 0xffffffff8163acb5 342 629 25.1% 10 __schedule [kernel.kallsyms] 3{10 100.0% n/a}

-----------------------------------------------------------------------------------------------
4 0.2% 2.8% 0.3% 5.3% 172 44 192 0 0xffff883fffb15280 146560
-----------------------------------------------------------------------------------------------
0.6% 0.0% 85.4% 0.0% 0x00 146560 *** 0xffffffff810e6e13 298 298 0.0% 1 flush_smp_call_function_queue [kernel.kallsyms] 1{ 1 100.0% 100.0%}
98.8% 97.7% 0.0% 0.0% 0x00 146560 *** 0xffffffff813067d8 291 363 4.9% 44 llist_add_batch [kernel.kallsyms] 0{13 27.1% n/a} 1{10 13.5% n/a} 2{12 29.4% n/a} 3{ 9 30.0% n/a}
0.6% 2.3% 14.6% 0.0% 0x00 146560 *** 0xffffffff813067e1 892 1255 0.0% 21 llist_add_batch [kernel.kallsyms] 0{ 7 0.0% 35.7%} 1{ 2 0.0% 7.1%} 2{ 6 100.0% 28.6%} 3{ 6 0.0% 28.6%}

-----------------------------------------------------------------------------------------------
5 0.2% 3.0% 0.3% 5.6% 166 62 310 0 0xffffffff81dafe00 ***
-----------------------------------------------------------------------------------------------
0.6% 0.0% 0.0% 0.0% 0x00 146560 146696 0xffffffff810bcc6c 1363 1363 0.0% 1 update_cfs_rq_h_load [kernel.kallsyms] 1{ 1 100.0% n/a}
22.3% 21.0% 0.0% 0.0% 0x00 146560 *** 0xffffffff810bdf56 1421 2209 15.8% 38 select_task_rq_fair [kernel.kallsyms] 0{13 32.4% n/a} 1{ 7 16.2% n/a} 2{12 35.1% n/a} 3{ 6 16.2% n/a}
4.2% 6.5% 0.0% 0.0% 0x00 146560 *** 0xffffffff810bdfaf 1500 5357 64.4% 10 select_task_rq_fair [kernel.kallsyms] 0{ 1 0.0% n/a} 1{ 4 71.4% n/a} 2{ 2 0.0% n/a} 3{ 3 28.6% n/a}
6.0% 11.3% 0.0% 0.0% 0x00 *** *** 0xffffffff810befcf 3172 2506 25.8% 15 update_blocked_averages [kernel.kallsyms] 0{ 2 10.0% n/a} 1{ 4 20.0% n/a} 2{ 4 30.0% n/a} 3{ 5 40.0% n/a}
54.8% 54.8% 0.0% 0.0% 0x00 *** *** 0xffffffff810c1963 1884 2815 13.1% 81 update_cfs_shares [kernel.kallsyms] 0{16 23.1% n/a} 1{21 24.2% n/a} 2{23 27.5% n/a} 3{21 25.3% n/a}
1.2% 0.0% 0.0% 0.0% 0x08 146560 *** 0xffffffff810b5fb7 1365 1365 51.4% 2 set_task_cpu [kernel.kallsyms] 0{ 1 50.0% n/a} 1{ 1 50.0% n/a}
3.0% 3.2% 0.0% 0.0% 0x08 146560 *** 0xffffffff810bf901 815 1186 45.9% 6 can_migrate_task [kernel.kallsyms] 0{ 2 20.0% n/a} 1{ 2 40.0% n/a} 2{ 2 40.0% n/a}
3.6% 0.0% 44.8% 0.0% 0x20 *** *** 0xffffffff810bcf6f 5364 6967 41.5% 94 update_cfs_rq_blocked_load [kernel.kallsyms] 0{22 33.3% 17.3%} 1{27 0.0% 36.0%} 2{23 16.7% 23.0%} 3{22 50.0% 23.7%}
0.0% 0.0% 1.0% 0.0% 0x28 *** *** 0xffffffff810bf3a9 n/a n/a n/a 3 update_blocked_averages [kernel.kallsyms] 0{ 1 n/a 33.3%} 3{ 2 n/a 66.7%}
0.0% 0.0% 0.3% 0.0% 0x28 0 0 0xffffffff810c102a n/a n/a n/a 1 idle_enter_fair [kernel.kallsyms] 1{ 1 n/a 100.0%}
0.6% 3.2% 29.7% 0.0% 0x28 146560 *** 0xffffffff810c2455 1647 1647 0.0% 68 dequeue_task_fair [kernel.kallsyms] 0{23 100.0% 35.9%} 1{18 0.0% 26.1%} 2{19 0.0% 25.0%} 3{ 8 0.0% 13.0%}
0.0% 0.0% 0.6% 0.0% 0x28 146560 *** 0xffffffff810c2ba0 n/a n/a n/a 2 task_tick_fair [kernel.kallsyms] 2{ 1 n/a 50.0%} 3{ 1 n/a 50.0%}
3.0% 0.0% 23.5% 0.0% 0x28 *** *** 0xffffffff810c4203 2085 2958 31.8% 61 enqueue_task_fair [kernel.kallsyms] 0{14 20.0% 30.1%} 1{14 0.0% 21.9%} 2{13 20.0% 21.9%} 3{20 60.0% 26.0%}
0.6% 0.0% 0.0% 0.0% 0x30 146560 146593 0xffffffff810b5fe7 3070 3070 0.0% 1 set_task_cpu [kernel.kallsyms] 1{ 1 100.0% n/a}

-----------------------------------------------------------------------------------------------
6 0.2% 3.1% 0.3% 5.9% 158 43 192 0 0xffff883fff855280 146560
-----------------------------------------------------------------------------------------------
0.0% 0.0% 87.5% 0.0% 0x00 146560 146569 0xffffffff810e6e13 n/a n/a n/a 1 flush_smp_call_function_queue [kernel.kallsyms] 1{ 1 n/a 100.0%}
98.7% 100.0% 0.0% 0.0% 0x00 146560 *** 0xffffffff813067d8 317 406 5.9% 44 llist_add_batch [kernel.kallsyms] 0{13 32.1% n/a} 1{10 11.5% n/a} 2{11 30.1% n/a} 3{10 26.3% n/a}
1.3% 0.0% 12.5% 0.0% 0x00 146560 *** 0xffffffff813067e1 1040 1040 23.1% 20 llist_add_batch [kernel.kallsyms] 0{ 6 50.0% 29.2%} 1{ 5 0.0% 20.8%} 2{ 6 50.0% 25.0%} 3{ 3 0.0% 25.0%}

-----------------------------------------------------------------------------------------------
7 0.1% 3.3% 0.3% 6.2% 154 44 153 0 0xffff885fff915280 146560
-----------------------------------------------------------------------------------------------
0.6% 0.0% 86.9% 0.0% 0x00 146560 146624 0xffffffff810e6e13 1341 1341 0.0% 1 flush_smp_call_function_queue [kernel.kallsyms] 2{ 1 100.0% 100.0%}
98.7% 100.0% 0.0% 0.0% 0x00 146560 *** 0xffffffff813067d8 287 377 9.2% 43 llist_add_batch [kernel.kallsyms] 0{12 36.8% n/a} 1{11 24.3% n/a} 2{10 11.8% n/a} 3{10 27.0% n/a}
0.6% 0.0% 13.1% 0.0% 0x00 146560 *** 0xffffffff813067e1 539 539 0.0% 17 llist_add_batch [kernel.kallsyms] 0{ 6 0.0% 30.0%} 1{ 6 0.0% 40.0%} 2{ 1 0.0% 5.0%} 3{ 4 100.0% 25.0%}

-----------------------------------------------------------------------------------------------
8 0.1% 3.4% 0.3% 6.4% 151 54 103 0 0xffff885fffc15280 146560
-----------------------------------------------------------------------------------------------
0.7% 0.0% 69.9% 0.0% 0x00 146560 146607 0xffffffff810e6e13 902 902 0.0% 1 flush_smp_call_function_queue [kernel.kallsyms] 2{ 1 100.0% 100.0%}
99.3% 98.1% 0.0% 0.0% 0x00 146560 *** 0xffffffff813067d8 278 377 5.3% 45 llist_add_batch [kernel.kallsyms] 0{13 34.7% n/a} 1{11 24.7% n/a} 2{11 10.7% n/a} 3{10 30.0% n/a}
0.0% 1.9% 30.1% 0.0% 0x00 146560 *** 0xffffffff813067e1 n/a n/a n/a 23 llist_add_batch [kernel.kallsyms] 0{ 6 n/a 22.6%} 1{ 7 n/a 32.3%} 2{ 6 n/a 25.8%} 3{ 4 n/a 19.4%}

-----------------------------------------------------------------------------------------------
9 0.1% 3.5% 0.3% 6.7% 151 42 223 0 0xffff885fffbf5280 146560
-----------------------------------------------------------------------------------------------
0.0% 0.0% 87.4% 0.0% 0x00 146560 *** 0xffffffff810e6e13 n/a n/a n/a 1 flush_smp_call_function_queue [kernel.kallsyms] 2{ 1 n/a 100.0%}
99.3% 97.6% 0.0% 0.0% 0x00 146560 *** 0xffffffff813067d8 309 346 2.5% 44 llist_add_batch [kernel.kallsyms] 0{11 28.0% n/a} 1{11 23.3% n/a} 2{11 13.3% n/a} 3{11 35.3% n/a}
0.7% 2.4% 12.6% 0.0% 0x00 146560 *** 0xffffffff813067e1 1028 1696 0.0% 20 llist_add_batch [kernel.kallsyms] 0{ 5 0.0% 25.0%} 1{ 2 0.0% 7.1%} 2{ 6 100.0% 35.7%} 3{ 7 0.0% 32.1%}



=====================================================================================================================================
Object Name, Path & Reference Counts

Index Records Object Name Object Path
=====================================================================================================================================
0 2032703 xlinpack_xeon64 /home/joe/linpack/xlinpack_xeon64
1 1059580 [kernel.kallsyms] /proc/kcore
2 21352 libpthread-2.17.so /usr/lib64/libpthread-2.17.so
3 3882 perf /home/root/git/rhel7/tools/perf/perf
4 26 libc-2.17.so /usr/lib64/libc-2.17.so
5 5 libpython2.7.so.1.0 /usr/lib64/libpython2.7.so.1.0
6 3 ld-2.17.so /usr/lib64/ld-2.17.so
7 1 sendmail.sendmail /usr/sbin/sendmail.sendmail
8 1 irqbalance /usr/sbin/irqbalance



2015-12-10 02:39:07

by Yunlong Song

Subject: Re: [Questions] perf c2c: What's the current status of perf c2c?

On 2015/12/10 4:41, Joe Mario wrote:
> Appended below is the output for running "perf c2c" on a 4-node system
> running a multi-thread version of linpack. I've annotated it to describe
> what some of the fields mean.
>
> Note, your screen output has to be set pretty wide to read it. For those
> not wanting to read it in their mailer, grab it from:
> http://people.redhat.com/jmario/perf_c2c/perf_c2c_annotated_output.txt
>
> Let me know of any questions. My annotations begin with "// ".
>
> Joe
> --------------------------------------------------------------------
> // Perf c2c output from a linpack run.

Hi, Joe,
Got these details, thanks a lot for your interpretation. -_-

--
Thanks,
Yunlong Song