Date: Sat, 23 Apr 2011 14:49:19 +0200
From: Ingo Molnar
To: Stephane Eranian
Cc: Arnaldo Carvalho de Melo, linux-kernel@vger.kernel.org, Andi Kleen,
	Peter Zijlstra, Lin Ming, Thomas Gleixner, eranian@gmail.com,
	Arun Sharma, Linus Torvalds, Andrew Morton
Subject: Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2
Message-ID: <20110423124919.GB5147@elte.hu>

* Stephane Eranian wrote:

> On Fri, Apr 22, 2011 at 10:47 PM, Ingo Molnar wrote:
> >
> > * Stephane Eranian wrote:
> >
> >> On Fri, Apr 22, 2011 at 3:18 PM, Ingo Molnar wrote:
> >> >
> >> > * Stephane Eranian wrote:
> >> >
> >> > > > Say i'm a developer and i have an app with such code:
> >> > > >
> >> > > > #define THOUSAND 1000
> >> > > >
> >> > > > static char array[THOUSAND][THOUSAND];
> >> > > >
> >> > > > int init_array(void)
> >> > > > {
> >> > > >         int i, j;
> >> > > >
> >> > > >         for (i = 0; i < THOUSAND; i++) {
> >> > > >                 for (j = 0; j < THOUSAND; j++) {
> >> > > >                         array[j][i]++;
> >> > > >                 }
> >> > > >         }
> >> > > >
> >> > > >         return 0;
> >> > > > }
> >> > > >
> >> > > > Pretty common stuff, right?
> >> > > >
> >> > > > Using the generalized cache events i can run:
> >> > > >
> >> > > >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >> > > >
> >> > > >  Performance counter stats for './array' (10 runs):
> >> > > >
> >> > > >          6,719,130 cycles:u                   ( +-   0.662% )
> >> > > >          5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >> > > >          1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >> > > >          1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> >> > > >
> >> > > >         0.003802098  seconds time elapsed   ( +-  13.395% )
> >> > > >
> >> > > > I consider that this is 'bad', because for almost every dcache-load there's a
> >> > > > dcache-miss - a ~97% L1 cache miss rate!
> >> > > >
> >> > > > Then i think a bit, notice something, apply this performance optimization:
> >> > >
> >> > > I don't think this example is really representative of the kind of problems
> >> > > people face, it is just too small and obvious. [...]
> >> >
> >> > Well, the overwhelming majority of performance problems are 'small and obvious'
> >>
> >> Problems are not simple. Most serious applications these days are huge,
> >> hundreds of MB of text, if not GB.
> >>
> >> In your artificial example, you knew the answer before you started the
> >> measurement.
> >>
> >> Most of the time, applications are assembled out of hundreds of libraries, so
> >> no single developer knows all the code. Thus, the performance analyst is
> >> faced with a black box most of the time.
> >
> > I isolated out an example and assumed that you'd agree that identifying hot
> > spots is trivial with generic cache events.
> >
> > My assumption was wrong so let me show you how trivial it really is.
> >
> > Here's an example with *two* problematic functions (but it could have hundreds,
> > it does not matter):
> >
> > -------------------------------->
> > #define THOUSAND 1000
> >
> > static char array1[THOUSAND][THOUSAND];
> >
> > static char array2[THOUSAND][THOUSAND];
> >
> > void func1(void)
> > {
> >         int i, j;
> >
> >         for (i = 0; i < THOUSAND; i++)
> >                 for (j = 0; j < THOUSAND; j++)
> >                         array1[i][j]++;
> > }
> >
> > void func2(void)
> > {
> >         int i, j;
> >
> >         for (i = 0; i < THOUSAND; i++)
> >                 for (j = 0; j < THOUSAND; j++)
> >                         array2[j][i]++;
> > }
> >
> > int main(void)
> > {
> >         for (;;) {
> >                 func1();
> >                 func2();
> >         }
> >
> >         return 0;
> > }
> > <--------------------------------
> >
> > We do not know which one has the cache-misses problem, func1() or func2(), it's
> > all a black box, right?
> >
> > Using generic cache events you simply type this:
> >
> >  $ perf top -e l1-dcache-load-misses -e l1-dcache-loads
> >
> > And you get such output:
> >
> >   PerfTop:    1923 irqs/sec  kernel: 0.0%  exact:  0.0% [l1-dcache-load-misses:u/l1-dcache-loads:u],  (all, 16 CPUs)
> > -------------------------------------------------------------------------------------------------------
> >
> >   weight    samples  pcnt funct DSO
> >   ______    _______ _____ _____ ______________________
> >
> >      1.9       6184 98.8% func2 /home/mingo/opt/array2
> >      0.0         69  1.1% func1 /home/mingo/opt/array2
> >
> > It has pinpointed the problem in func2 *very* precisely.
> >
> > Obviously this can be used to analyze larger apps as well, with thousands
> > of functions, to pinpoint cachemiss problems in specific functions.
>
> No, it does not.

The thing is, you will need to come up with more convincing and concrete
arguments than a blanket, unsupported "No, it does not" claim.

I *just showed* you an example which you claimed just two mails ago was
impossible to analyze.

I showed an example with two functions and claimed that the same thing works
with 3 or more functions as well: perf top will happily display the ones with
the highest cachemiss ratio, regardless of how many there are.

> As I said before, your example is just too trivial to be representative. You
> keep thinking that what you see in the profile pinpoints exactly the
> instruction or even the function where the problem always occurs. This is not
> always the case. There is skid, and it can be very big; the IP you get may
> not even be in the same function where the load was issued.

So now you claim a narrow special case (most of the hot-spot overhead skidding
out of a function) as a counter-proof?

Sometimes skid causes problems - in practice it rarely does, and i do a lot of
profiling.

Also, i'd expect PEBS to be extended in the future to more and more events -
including cachemiss events. That will solve this kind of skidding in a pretty
natural way.

Also, let's analyze your narrow special case: if a function is indeed
"invisible" to profiling because most of its overhead skids out of it, then
there's little you can do with raw events to begin with ...

You really need to specifically demonstrate how raw events help your example.

Thanks,

	Ingo