2016-10-01 13:44:11

by Joe Mario

Subject: Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems

On 09/29/2016 05:19 AM, Peter Zijlstra wrote:

>
> What I want is a tool that maps memop events (any PEBS memops) back to a
> 'type::member' form and sorts on that. That doesn't rely on the PEBS
> 'Data Linear Address' field, as that is useless for dynamically
> allocated bits. Instead it would use the IP and Dwarf information to
> deduce the 'type::member' of the memop.
>
> I want pahole like output, showing me where the hits (green) and misses
> (red) are in a structure.

I agree that would give valuable insight, but it needs to be
in addition to what c2c provides today, not a replacement for it.

Ten years ago Robert Hundt created that pahole-style output as a developer option
to the HP-UX compiler. It used compiler feedback to compute every struct
accessed by the application, with exact counts for all reads and writes to
every struct member. It even had affinity information to show how often
field members were accessed together in time.

He and I ran it on numerous large applications. It was awesome, but it
did fall short in a few places that Jiri's c2c patches provide, such as
being able to:

- distinguish where the concurrent cacheline accesses came from (e.g., which
cores, and which nodes).

- see where the loads got resolved from (local cache, local memory, remote
cache, remote memory).

- see if the hot structs were cacheline aligned or not.

- see if more than one hot struct shares a cacheline.

- see how costly, via load latencies, the contention is.

- see, among all the accesses to a cacheline, which thread or process is
causing the most harm.

- see how many other threads/processes are contending for a
cacheline (and who they are).

For everyone who has used the "perf c2c" prototype, the above info has been
critical to understanding how best to tackle the contention it uncovered.
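
To make the alignment and sharing points concrete, here is a minimal sketch
of the layout questions involved. Everything in it is hypothetical and for
illustration only: the struct, the helper names, and the assumed 64-byte
cacheline size are not taken from the c2c patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines (common on x86, not stated in this thread). */
#define CACHELINE_SIZE 64u

/* Hypothetical struct: write-hot members packed next to read-mostly data. */
struct hot_stats {
    uint64_t lock;    /* written by every thread */
    uint64_t hits;    /* written frequently */
    uint64_t config;  /* read-mostly, yet on the same line as the writers */
    /* Force the next member onto the start of its own cacheline. */
    _Alignas(64) uint64_t misses;
};

/* Cacheline index of a byte offset within a cacheline-aligned object. */
static inline size_t cacheline_index(size_t offset)
{
    return offset / CACHELINE_SIZE;
}

/* Nonzero if two member offsets land on the same cacheline. */
static inline int same_cacheline(size_t a, size_t b)
{
    return cacheline_index(a) == cacheline_index(b);
}

/* Nonzero if an address starts on a cacheline boundary. */
static inline int is_cacheline_aligned(const void *p)
{
    return ((uintptr_t)p % CACHELINE_SIZE) == 0;
}
```

With this layout, lock and config share a line, so writers to lock can slow
down readers of config, while misses sits on a line of its own. A static
check like this says nothing about which cores or nodes touch the line, or
how costly the contention is.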

So yes, the pahole-style addition would be a plus and it would make it easier
to map contention back to the struct, but make sure to preserve what the current
"perf c2c" provides that the pahole-style output will not.

Joe