Subject: Re: [PATCHv4 00/57] perf c2c: Add new tool to analyze cacheline contention on NUMA systems
From: Joe Mario
To: Peter Zijlstra, Jiri Olsa
Cc: Arnaldo Carvalho de Melo, Michael Trapp, "Long, Wai Man", Stanislav Ievlev, Kim Phillips, lkml, Don Zickus, Ingo Molnar, Namhyung Kim, David Ahern, Andi Kleen, Stephane Eranian, Robert Hundt
Date: Sat, 1 Oct 2016 09:44:06 -0400
Message-ID: <36cb0bb2-0a58-c5bc-98f5-d9b25b84439d@redhat.com>
In-Reply-To: <20160929091912.GV5012@twins.programming.kicks-ass.net>
References: <1474558645-19956-1-git-send-email-jolsa@kernel.org> <20160929091912.GV5012@twins.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On 09/29/2016 05:19 AM, Peter Zijlstra wrote:
>
> What I want is a tool that maps memop events (any PEBS memops) back to a
> 'type::member' form and sorts on that. That doesn't rely on the PEBS
> 'Data Linear Address' field, as that is useless for dynamically
> allocated bits. Instead it would use the IP and Dwarf information to
> deduce the 'type::member' of the memop.
>
> I want pahole like output, showing me where the hits (green) and misses
> (red) are in a structure.

I agree that would give valuable insight, but it needs to be in addition to
what this c2c provides today, and not a replacement for it.

Ten years ago Robert Hundt created that pahole-style output as a developer
option to the HP-UX compiler. It used compiler feedback to compute every
struct accessed by the application, with exact counts for all reads and
writes to every struct member. It even had affinity information to show how
often field members were accessed together in time.

He and I ran it on numerous large applications. It was awesome, but it fell
short in a few areas that Jiri's c2c patches cover, such as being able to:

 - distinguish where the concurrent cacheline accesses came from
   (e.g., which cores and which nodes).
 - see where the loads got resolved from (local cache, local memory,
   remote cache, remote memory).
 - see if the hot structs were cacheline aligned or not.
 - see if more than one hot struct shares a cacheline.
 - see how costly, via load latencies, the contention is.
 - see, among all the accesses to a cacheline, which thread or process
   is causing the most harm.
 - gain insight into how many other threads/processes are contending for
   a cacheline (and who they are).

For all those who have used the "perf c2c" prototype, the above info has
been critical to understanding how best to tackle the contention it
uncovered.

So yes, the pahole-style addition would be a plus, and it would make it
easier to map the contention back to the struct, but make sure to preserve
what the current "perf c2c" provides that the pahole-style output will not.

Joe
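
To make two of the points in the list above concrete (whether a hot struct is cacheline aligned, and whether two hot structs share a cacheline), here is a minimal sketch of the underlying arithmetic. It is not how "perf c2c" or the HP-UX tool computes anything. The 64-byte line size, the HotCounters struct, and the helper names are all assumptions invented for this illustration.

```python
import ctypes

# Assumed cacheline size in bytes; common on x86-64, not stated in the thread.
CACHELINE = 64


class HotCounters(ctypes.Structure):
    # Hypothetical struct, invented purely for illustration.
    _fields_ = [
        ("lock", ctypes.c_uint32),
        ("readers", ctypes.c_uint32),
        ("bytes_in", ctypes.c_uint64),
        ("pad", ctypes.c_uint8 * 48),
        ("bytes_out", ctypes.c_uint64),
    ]


def member_lines(struct_type, base_addr=0):
    """Return (member, offset, size, first_line, last_line) per field,
    a rough pahole-like view of which cacheline each member lands on."""
    rows = []
    for name, *_ in struct_type._fields_:
        desc = getattr(struct_type, name)
        start = base_addr + desc.offset
        end = start + desc.size - 1
        rows.append((name, desc.offset, desc.size,
                     start // CACHELINE, end // CACHELINE))
    return rows


def is_line_aligned(base_addr):
    """True if an object starting at base_addr begins on a line boundary."""
    return base_addr % CACHELINE == 0


def shares_line(addr_a, size_a, addr_b, size_b):
    """True if two objects touch at least one common cacheline."""
    lines_a = set(range(addr_a // CACHELINE,
                        (addr_a + size_a - 1) // CACHELINE + 1))
    lines_b = set(range(addr_b // CACHELINE,
                        (addr_b + size_b - 1) // CACHELINE + 1))
    return bool(lines_a & lines_b)


if __name__ == "__main__":
    for name, off, size, first, last in member_lines(HotCounters):
        print(f"{name:10s} offset={off:3d} size={size:3d} lines={first}..{last}")
```

A layout view like this only says where members sit. It says nothing about which cores or nodes touched a line, where loads were resolved from, or how costly the contention was, which is the information the list above attributes to c2c.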