Date: Mon, 17 Nov 2014 17:01:26 +0100
From: Jiri Olsa <jolsa@redhat.com>
To: kan.liang@intel.com
Cc: acme@kernel.org, a.p.zijlstra@chello.nl, eranian@google.com,
        linux-kernel@vger.kernel.org, mingo@redhat.com, paulus@samba.org,
        ak@linux.intel.com
Subject: Re: [PATCH V3 0/3] perf tool: Haswell LBR call stack support (user)
Message-ID: <20141117160125.GD21532@krava.brq.redhat.com>
References: <1415972652-17310-1-git-send-email-kan.liang@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1415972652-17310-1-git-send-email-kan.liang@intel.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org

On Fri, Nov 14, 2014 at 08:44:09AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> This is the user space patch for Haswell LBR call stack support.
> For many profiling tasks we need the callgraph. For example we often
> need to see the caller of a lock or the caller of a memcpy or other
> library function to actually tune the program. Frame pointer unwinding
> is efficient and works well. But frame pointers are off by default on
> 64bit code (and on modern 32bit gccs), so there are many binaries around
> that do not use frame pointers. Profiling unchanged production code is
> very useful in practice. On some CPUs frame pointers also have a high
> cost. Dwarf2 unwinding also does not always work and is extremely slow
> (up to 20% overhead).
> 
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function
> calls will be collected as normal, but as return instructions are
> executed the last captured branch record is popped from the on-chip LBR
> registers. The LBR call stack facility provides an alternative way to
> get the callgraph. It has some limitations too, but should work in most cases
> and is significantly faster than dwarf. Frame pointer unwinding is still
> the best default, but LBR call stack is a good alternative when nothing
> else works.
> 
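As an aside, the push-on-call / pop-on-return behaviour described above
can be sketched as a toy model. This is only an illustration, not how the
hardware or perf implements it: the class name LbrCallStack, its methods,
the record() helper and the default depth of 16 entries are all assumptions
made for the example, and symbol names stand in for branch addresses.

```python
from collections import deque


class LbrCallStack:
    """Toy model of LBR call stack mode (hypothetical, for illustration)."""

    def __init__(self, depth=16):
        # depth is an assumed register count; a bounded deque silently
        # drops the oldest entry once it is full.
        self.entries = deque(maxlen=depth)

    def call(self, caller, callee):
        # A call records a (from, to) branch pair.
        self.entries.append((caller, callee))

    def ret(self):
        # A return pops the most recently recorded call, if any.
        if self.entries:
            self.entries.pop()

    def chain(self):
        # Innermost function first, then the caller of each recorded call.
        if not self.entries:
            return []
        newest_callee = self.entries[-1][1]
        return [newest_callee] + [c for c, _ in reversed(self.entries)]


def record(path, depth=16):
    """Replay nested calls along `path` (outermost first) into a new stack."""
    lbr = LbrCallStack(depth)
    for caller, callee in zip(path, path[1:]):
        lbr.call(caller, callee)
    return lbr


path = ["_start", "__libc_start_main", "main", "yyparse",
        "run_code", "execute", "bc_divide"]

lbr = record(path)
lbr.call("bc_divide", "_one_mult")   # nested call is pushed ...
lbr.ret()                            # ... and popped again on return
print(lbr.chain())

# With fewer entries than the call depth, only the most recent calls survive.
print(record(path, depth=3).chain())
```

The first chain has the same shape as the bc_divide entry in the report
quoted below; the second shows what is left when the call stack is deeper
than the modelled LBR.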


---
> A new call chain recording option "lbr" is introduced into perf tool for
> LBR call stack. The user can use --call-graph lbr to get the call stack
> information from hardware.
> 
> When profiling bc(1) on Fedora 19:
> echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph lbr bc -l < cmd
> With LBR enabled, perf report output looks like:
>     50.36%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
>     33.66%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>                      bc_divide
>                      execute
>                      run_code
>                      yyparse
>                      main
>                      __libc_start_main
>                      _start
>      7.62%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                     |
>                     |--99.89%-- 0x2000186a8
>                      --0.11%-- [...]
>      6.83%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>                     |
>                     |--99.94%-- bc_add
>                     |          execute
>                     |          run_code
>                     |          yyparse
>                     |          main
>                     |          __libc_start_main
>                     |          _start
>                      --0.06%-- [...]
>      0.46%       bc  libc-2.17.so       [.] __memset_sse2
>                  |
>                  --- __memset_sse2
>                     |
>                     |--54.13%-- bc_new_num
>                     |          |
>                     |          |--51.00%-- bc_divide
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |          |--30.46%-- _bc_do_sub
>                     |          |          bc_add
>                     |          |          execute
>                     |          |          run_code
>                     |          |          yyparse
>                     |          |          main
>                     |          |          __libc_start_main
>                     |          |          _start
>                     |          |
>                     |           --18.55%-- _bc_do_add
>                     |                     bc_add
>                     |                     execute
>                     |                     run_code
>                     |                     yyparse
>                     |                     main
>                     |                     __libc_start_main
>                     |                     _start
>                     |
>                      --45.87%-- bc_divide
>                                execute
>                                run_code
>                                yyparse
>                                main
>                                __libc_start_main
>                                _start
> If using FP, perf report output looks like:
> echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
>     50.49%       bc  bc                 [.] bc_divide
>                  |
>                  --- bc_divide
>     33.57%       bc  bc                 [.] _one_mult
>                  |
>                  --- _one_mult
>      7.61%       bc  bc                 [.] _bc_do_add
>                  |
>                  --- _bc_do_add
>                      0x2000186a8
>      6.88%       bc  bc                 [.] _bc_do_sub
>                  |
>                  --- _bc_do_sub
>      0.42%       bc  libc-2.17.so       [.] __memcpy_ssse3_back
>                  |
>                  --- __memcpy_ssse3_back
> 
> If using LBR, perf report -D output looks like:
> 11739295893248 0x4d0 [0xe0]: PERF_RECORD_SAMPLE(IP, 0x2): 10505/10505:
> 0x40054d period: 39255 addr: 0
> ... LBR call chain: nr:7
> .....  0: fffffffffffffe00
> .....  1: 0000000000400540
> .....  2: 0000000000400587
> .....  3: 00000000004005b3
> .....  4: 00000000004005ef
> .....  5: 0000003d1cc21b43
> .....  6: 0000000000400474
> ... FP chain: nr:6
> .....  0: fffffffffffffe00
> .....  1: 000000000040054d
> .....  2: 000000000040058c
> .....  3: 00000000004005b8
> .....  4: 00000000004005f4
> .....  5: 0000003d1cc21b45
>  ... thread: a.out:10505
>  ...... dso: /home/lk/a.out
> 
> 
> The LBR call stack has the following known limitations:
>  - Zero length calls are not filtered out by hardware
>  - Exception handling such as setjmp/longjmp will leave calls/returns
>    unmatched
>  - Pushing a different return address onto the stack will leave
>    calls/returns unmatched
>  - If the callstack is deeper than the LBR, only the last entries are
>    captured
---

also could you please add all above ^^^ as an additional text
for patch 3/3 changelog (perf tools: Construct LBR call chain)?

looks too nice to lose it ;-)

thanks,
jirka