Date: Thu, 26 Nov 2015 06:03:57 +0900
From: Namhyung Kim
To: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker, Ingo Molnar, linux-kernel@vger.kernel.org,
    Andi Kleen, David Ahern, Jiri Olsa, Kan Liang, Peter Zijlstra
Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains
Message-ID: <20151125210357.GA13940@danjae.kornet>
In-Reply-To: <20151125021025.GM18140@kernel.org>

Hi Arnaldo,

On Tue, Nov 24, 2015 at 11:10:25PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Nov 24, 2015 at 10:34:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Wed, Nov 25, 2015 at 10:26:08AM +0900, Namhyung Kim escreveu:
> > > On Tue, Nov 24, 2015 at 12:45:51PM -0200, Arnaldo Carvalho de Melo wrote:
> > > > Em Tue, Nov 24, 2015 at 02:27:08PM +0900, Namhyung Kim escreveu:
> > > > > On Mon, Nov 23, 2015 at 04:16:48PM +0100, Frederic Weisbecker wrote:
> > > > > > On Thu, Nov 19, 2015 at 02:53:20PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > > > > From: Namhyung Kim
> > > > > > [...]
> > > > > Thus I simply copied the callchain lists in the parents to the leaf
> > > > > nodes.  Yes, it will consume some memory, but it simplifies the code.
> > > >
> > > > I haven't done any measuring, but I'm noticing that 'perf top -g' is
> > > > showing more warnings about not being able to process events fast
> > > > enough and so ends up losing events.  I tried with --max-stack 16 and
> > > > it helped; this is just a heads up.
> > >
> > > OK, but it seems that it's not related to this patch, since this patch
> > > only affects the flat and folded callchain modes.
> >
> > Well, doesn't this patch make some of the involved data structures
> > larger, thus putting more pressure on the L1 cache, etc.?  It may well
> > be related, but we need to measure.
> >
> > > > Perhaps my workstation workloads are getting deeper callchains over
> > > > time, or perhaps it is the cost of processing callchains that is
> > > > increasing; I need to stop and try to quantify this.
> > > >
> > > > We really need to look at reducing the overhead of processing
> > > > callchains.
> > >
> > > Right, but with my multi-thread work I realized that perf has been
> > > getting heavier recently.  I guess it's mostly due to the atomic
> > > refcount work.  I need to get back to the multi-thread work..
> >
> > We really need to measure this ;-)
>
> So, something strange, if I use:
>
> [acme@zoo linux]$ cat ~/bin/allmod
> rm -rf ../build/allmodconfig/ ; mkdir ../build/allmodconfig/ ; make O=../build/allmodconfig/ allmodconfig ; make -j32 O=../build/allmodconfig
> [acme@zoo linux]$
>
> To generate background load, I don't see this that much:
>
> +   8.55%  8.49%  libc-2┌─Warning!──────────────────────────────────────────────┐ ▒
> +   7.08%  6.98%  perf  │Events are being lost, check IO/CPU overload!           │ ▒
> +   6.84%  0.04%  perf  │                                                        │ ▒
> +   6.01%  0.09%  perf  │You may want to run 'perf' using a RT scheduler policy: │ ▒
> +   5.26%  0.13%  [kerne│                                                        │t▒
> +   4.96%  1.50%  perf  │    perf top -r 80                                      │ ▒
> +   4.76%  3.58%  perf  │                                                        │ ▒
> +   4.69%  0.05%  perf  │Or reduce the sampling frequency.                       │ ▒
>
> It's with a low loadavg:
>
> [acme@zoo linux]$ cat /proc/loadavg
> 0.75 0.79 0.64 3/549 21259
>
> that it pops up :-\
>
> If I take a snapshot with 'P':
>
> # head -40 perf.hist.0
> +   21.43%  21.09%  libc-2.20.so        [.] _int_malloc
> +   19.49%   0.00%  perf                [.] cmd_top
> +   19.46%   0.02%  perf                [.] perf_top__mmap_read_idx
> +   19.03%   0.06%  perf                [.] hist_entry_iter__add
> +   16.46%   1.85%  perf                [.] iter_add_next_cumulative_entry
> +   12.09%  11.98%  libc-2.20.so        [.] free
> +   10.68%  10.61%  libc-2.20.so        [.] __libc_calloc
> +    9.61%   0.09%  perf                [.] hists__decay_entries
> +    8.92%   8.85%  libc-2.20.so        [.] malloc_consolidate
> +    8.17%   6.33%  perf                [.] free_callchain_node
> +    7.94%   0.09%  perf                [.] hist_entry__delete
> +    6.22%   0.03%  perf                [.] callchain_append
> +    6.20%   6.11%  perf                [.] append_chain_children
> +    5.44%   1.50%  perf                [.] __hists__add_entry
> +    4.34%   0.14%  [kernel]            [k] entry_SYSCALL_64_fastpath
> +    3.69%   3.67%  perf                [.] sort__dso_cmp
> +    3.12%   0.20%  perf                [.] hists__output_resort
> +    2.88%   0.00%  [unknown]           [.] 0x6d86258d4c544155
> +    2.88%   0.00%  libc-2.20.so        [.] __libc_start_main
> +    2.88%   0.00%  perf                [.] main
> +    2.88%   0.00%  perf                [.] run_builtin
> +    2.66%   0.00%  libpthread-2.20.so  [.] start_thread
> +    2.66%   0.00%  perf                [.] display_thread_tui
> +    2.66%   0.00%  perf                [.] perf_evlist__tui_browse_hists
> +    2.66%   0.00%  perf                [.] perf_evsel__hists_browse
> +    2.49%   0.07%  [kernel]            [k] sys_futex
> +    2.42%   0.06%  [kernel]            [k] do_futex
>      2.24%   0.00%  perf                [.] perf_top__sort_new_samples
> +    1.92%   0.51%  perf                [.] hists__collapse_resort
> +    1.87%   1.86%  perf                [.] hpp__sort_overhead_acc
> +    1.69%   0.09%  libc-2.20.so        [.] __lll_unlock_wake_private
> +    1.45%   1.44%  perf                [.] hpp__nop_cmp
>      1.45%   1.43%  perf                [.] rb_erase
> +    1.44%   1.42%  perf                [.] __sort__hpp_cmp
>      1.31%   0.16%  libc-2.20.so        [.] __lll_lock_wait_private
>      1.18%   0.19%  [kernel]            [k] futex_wake
> +    1.13%   1.13%  perf                [.] sort__sym_cmp
>      1.11%   0.02%  [kernel]            [k] schedule
>      1.09%   0.22%  [kernel]            [k] __schedule
>      0.99%   0.08%  [kernel]            [k] futex_wait
>
> So it's quite a lot of mallocs, i.e. just do a 'perf top -g' and wait a
> bit: malloc goes on bubbling up to the top, and about the same time it
> starts showing that popup screen telling that we're losing events.
>
> If I use --no-children, to see if there is a difference, using either
> --call-graph caller or callee, it doesn't get more than about 1%.
>
> Ok, now I tried with "perf top --call-graph caller", i.e. with
> --children, and looked at the _int_malloc callchains: I get really long,
> bogus callchains, see below.  That explains why I don't lose events when
> I use --max-stack.
>
> I'll have to stop now, and I put the full perf.hist.1 at
> http://vger.kernel.org/~acme/perf/perf.hist.1.xz
>
> - Arnaldo
>
> [root@zoo ~]# head -60 perf.hist.1
> -   17.92%  17.10%  libc-2.20.so        [.] _int_malloc
> +      112.80% _int_malloc
>           11.14% 0x41bf5118
>                  0x41bf5068
>                  0x41bf5118
>                  0x41bf5068
>                  0x41bf5118
>                  0x41bf5068
>                  0x41bf5118
>                  0x41bf5068
>                  [... 0x41bf5118 and 0x41bf5068 keep alternating for the
>                   rest of the head -60 output ...]

Hmm.. we should cut off the loop in the broken callchains.  And it'd be
good to apply --hide-unresolved to callchains as well.  I'll take a look
at it.

Thanks,
Namhyung
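
A minimal, self-contained sketch of those two ideas, assuming invented
names (resolve_sym(), sanitize_callchain()) and an arbitrary address
range; this is not the actual perf code, just an illustration of cutting
the chain at the first repeated address and of dropping frames that never
resolved to a symbol, in the spirit of --hide-unresolved:

/* Illustrative only -- not taken from tools/perf. */
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_STACK 127

/* Stand-in for a DSO/symbol lookup; the range is arbitrary for the demo. */
static bool resolve_sym(uint64_t ip)
{
	return ip >= 0x400000 && ip < 0x500000;
}

/*
 * Copy at most max_stack frames from ips[] into out[], stopping as soon
 * as an address repeats and optionally skipping unresolved frames.  Note
 * that a plain "seen before" check also truncates legitimate recursion;
 * a real cutoff would need to be smarter (e.g. only break on a repeating
 * pair/period, or combine the check with --max-stack).
 */
static int sanitize_callchain(const uint64_t *ips, int nr, uint64_t *out,
			      int max_stack, bool hide_unresolved)
{
	int n = 0;

	for (int i = 0; i < nr && n < max_stack; i++) {
		if (hide_unresolved && !resolve_sym(ips[i]))
			continue;

		/* O(n^2) scan is fine for stacks of ~127 entries. */
		for (int j = 0; j < n; j++) {
			if (out[j] == ips[i])
				return n;	/* loop detected: cut the chain here */
		}
		out[n++] = ips[i];
	}
	return n;
}

int main(void)
{
	/* A broken chain like the one above: the same two frames repeating. */
	const uint64_t ips[] = {
		0x401234, 0x41bf5118, 0x41bf5068, 0x41bf5118, 0x41bf5068,
		0x41bf5118, 0x41bf5068,
	};
	uint64_t out[MAX_STACK];
	int n = sanitize_callchain(ips, (int)(sizeof(ips) / sizeof(ips[0])),
				   out, MAX_STACK, false);

	for (int i = 0; i < n; i++)
		printf("%#" PRIx64 "\n", out[i]);
	return 0;
}

With the loop check, only 0x401234, 0x41bf5118 and 0x41bf5068 survive from
the broken chain; with hide_unresolved set, the two unresolved addresses
would be dropped as well.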
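
And for the flat callchain support itself, a rough sketch of the approach
mentioned at the top of the thread (copying the parents' callchain lists
to the leaf nodes), with made-up structures and names rather than perf's
real callchain_node/callchain_list types:

/* Illustrative only -- not the perf implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
	struct node *parent;
	const char **frames;	/* frames covered by this node only */
	int nr_frames;
	const char **flat;	/* frames copied from all ancestors + self */
	int nr_flat;
};

/* Build the leaf's flat list: the ancestors' frames (root first), then ours. */
static int make_flat_list(struct node *leaf)
{
	int total = 0;

	for (struct node *n = leaf; n; n = n->parent)
		total += n->nr_frames;

	leaf->flat = calloc(total, sizeof(*leaf->flat));
	if (!leaf->flat)
		return -1;

	/* Fill from the end so that the root's frames come first. */
	int pos = total;
	for (struct node *n = leaf; n; n = n->parent) {
		pos -= n->nr_frames;
		memcpy(leaf->flat + pos, n->frames,
		       n->nr_frames * sizeof(*n->frames));
	}
	leaf->nr_flat = total;
	return 0;
}

int main(void)
{
	const char *root_frames[] = { "main", "cmd_top" };
	const char *leaf_frames[] = { "hist_entry_iter__add", "callchain_append" };

	struct node root = { .frames = root_frames, .nr_frames = 2 };
	struct node leaf = { .parent = &root, .frames = leaf_frames, .nr_frames = 2 };

	if (make_flat_list(&leaf))
		return 1;

	/* Flat mode: one line per frame, the full chain shown at the leaf. */
	for (int i = 0; i < leaf.nr_flat; i++)
		printf("%s\n", leaf.flat[i]);

	free(leaf.flat);
	return 0;
}

Whether the copied ancestor frames end up before or after the leaf's own
entries depends on the callchain order (caller vs. callee); this sketch
uses caller order, root first.  The memory cost is exactly the duplication
mentioned above: each leaf carries a copy of its whole path.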