by Namhyung Kim

[permalink] [raw]

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Hi Arnaldo,

On Tue, Nov 24, 2015 at 11:10:25PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Nov 24, 2015 at 10:34:18PM -0300, Arnaldo Carvalho de Melo escreveu:
> > Em Wed, Nov 25, 2015 at 10:26:08AM +0900, Namhyung Kim escreveu:
> > > On Tue, Nov 24, 2015 at 12:45:51PM -0200, Arnaldo Carvalho de Melo wrote:
> > > > Em Tue, Nov 24, 2015 at 02:27:08PM +0900, Namhyung Kim escreveu:
> > > > > On Mon, Nov 23, 2015 at 04:16:48PM +0100, Frederic Weisbecker wrote:
> > > > > > On Thu, Nov 19, 2015 at 02:53:20PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > > > > From: Namhyung Kim <[email protected]>
> > > > > > [...]
> > > > > Thus I simply copied callchain lists in parents to leaf nodes. Yes,
> > > > > it will consume some memory but can simplify the code.
> > > >
> > > > I haven't done any measuring, but I'm noticing that 'perf top -g' is
> > > > showing more warnings about not being able to process events fast enough
> > > > and so ends up losing events, I tried with --max-stack 16 and it helped,
> > > > this is just a heads up.
> > >
> > > OK, but it seems that it's not related to this patch since this patch
> > > only affects flat or folded callchain mode.
> >
> > Well, doesn't this patch makes some of the involved data structures
> > larger, thus putting more pressure on the L1 cache, etc? It may well be
> > related, but we need to measure.
> >
> > > > Perhaps my workstation workloads are gettning deeper callchains over
> > > > time, but perhaps this is the cost of processing callchains that is
> > > > increasing, I need to stop and try to quantify this.
> > > >
> > > > We really need to look at reducing the overhead of processing
> > > > callchains.
> > >
> > > Right, but with my multi-thread work, I realized that perf is getting
> > > heavier recently. I guess it's mostly due to the atomic refcount
> > > work. I need to get back to the multi-thread work..
> >
> > We really need to measure this ;-)
>
> So, something strange, if I use:
>
> [acme@zoo linux]$ cat ~/bin/allmod
> rm -rf ../build/allmodconfig/ ; mkdir ../build/allmodconfig/ ; make O=../build/allmodconfig/ allmodconfig ; make -j32 O=../build/allmodconfig
> [acme@zoo linux]$
>
> To generate background load, I don't see that much this:
>
> + 8.55% 8.49% libc-2┌─Warning!──────────────────────────────────────────────┐ ▒
> + 7.08% 6.98% perf │Events are being lost, check IO/CPU overload! │ ▒
> + 6.84% 0.04% perf │ │ ▒
> + 6.01% 0.09% perf │You may want to run 'perf' using a RT scheduler policy:│ ▒
> + 5.26% 0.13% [kerne│ │t▒
> + 4.96% 1.50% perf │ perf top -r 80 │ ▒
> + 4.76% 3.58% perf │ │ ▒
> + 4.69% 0.05% perf │Or reduce the sampling frequency. │ ▒
>
> Its with a low loadavg:
>
> [acme@zoo linux]$ cat /proc/loadavg
> 0.75 0.79 0.64 3/549 21259
>
> That it pops up :-\
>
> If I take a snapshot with 'P'
>
> # head -40 perf.hist.0
> + 21.43% 21.09% libc-2.20.so [.] _int_malloc
> + 19.49% 0.00% perf [.] cmd_top
> + 19.46% 0.02% perf [.] perf_top__mmap_read_idx
> + 19.03% 0.06% perf [.] hist_entry_iter__add
> + 16.46% 1.85% perf [.] iter_add_next_cumulative_entry
> + 12.09% 11.98% libc-2.20.so [.] free
> + 10.68% 10.61% libc-2.20.so [.] __libc_calloc
> + 9.61% 0.09% perf [.] hists__decay_entries
> + 8.92% 8.85% libc-2.20.so [.] malloc_consolidate
> + 8.17% 6.33% perf [.] free_callchain_node
> + 7.94% 0.09% perf [.] hist_entry__delete
> + 6.22% 0.03% perf [.] callchain_append
> + 6.20% 6.11% perf [.] append_chain_children
> + 5.44% 1.50% perf [.] __hists__add_entry
> + 4.34% 0.14% [kernel] [k] entry_SYSCALL_64_fastpath
> + 3.69% 3.67% perf [.] sort__dso_cmp
> + 3.12% 0.20% perf [.] hists__output_resort
> + 2.88% 0.00% [unknown] [.] 0x6d86258d4c544155
> + 2.88% 0.00% libc-2.20.so [.] __libc_start_main
> + 2.88% 0.00% perf [.] main
> + 2.88% 0.00% perf [.] run_builtin
> + 2.66% 0.00% libpthread-2.20.so [.] start_thread
> + 2.66% 0.00% perf [.] display_thread_tui
> + 2.66% 0.00% perf [.] perf_evlist__tui_browse_hists
> + 2.66% 0.00% perf [.] perf_evsel__hists_browse
> + 2.49% 0.07% [kernel] [k] sys_futex
> + 2.42% 0.06% [kernel] [k] do_futex
> 2.24% 0.00% perf [.] perf_top__sort_new_samples
> + 1.92% 0.51% perf [.] hists__collapse_resort
> + 1.87% 1.86% perf [.] hpp__sort_overhead_acc
> + 1.69% 0.09% libc-2.20.so [.] __lll_unlock_wake_private
> + 1.45% 1.44% perf [.] hpp__nop_cmp
> 1.45% 1.43% perf [.] rb_erase
> + 1.44% 1.42% perf [.] __sort__hpp_cmp
> 1.31% 0.16% libc-2.20.so [.] __lll_lock_wait_private
> 1.18% 0.19% [kernel] [k] futex_wake
> + 1.13% 1.13% perf [.] sort__sym_cmp
> 1.11% 0.02% [kernel] [k] schedule
> 1.09% 0.22% [kernel] [k] __schedule
> 0.99% 0.08% [kernel] [k] futex_wait
>
> So its quite a lot of mallocs, i.e. just do a 'perf top -g' and wait a
> bit, malloc goes on bubbling up to the top, about the same time it
> starts showing that popup screen telling that we're losing events.
>
> If I use --no-children, to see if there is a difference, using either
> --call-graph caller or callee, it doesn't get more than about 1%.
>
> Ok, now I tried with "perf top --call-graph caller" i.e. with
> --children, and looked at the _int_malloc callchains I get really long,
> bogus callchains, see below. That explains why I don't lose events when
> I use --max-stack.
>
> I'll have to stop now, and I put the full perf.hist.1 at
> http://vger.kernel.org/~acme/perf/perf.hist.1.xz
>
> - Arnaldo
>
> [root@zoo ~]# head -60 perf.hist.1
> - 17.92% 17.10% libc-2.20.so [.] _int_malloc
> + 112.80% _int_malloc
> 11.14% 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068
> 0x41bf5118
> 0x41bf5068

Hmm.. we should cut off the loop in the broken callchains. And it'd
be good to apply --hide-unresolved to callchains as well. I'll take a
look at it.

Thanks,
Namhyung

Subject: [GIT PULL 00/37] perf/core improvements and fixes

Subject: [PATCH 01/37] perf test: Fix build of BPF and LLVM on older glibc libraries

Subject: [PATCH 02/37] tools: Fix selftests_install Makefile rule

Subject: [PATCH 03/37] tools: Adopt memdup() from tools/perf, moving it to tools/lib/string.c

Subject: [PATCH 04/37] tools: Clone the kernel's strtobool function

Subject: [PATCH 05/37] bpf tools: Load a program with different instances using preprocessor

Subject: [PATCH 06/37] perf bpf: Add BPF_PROLOGUE config options for further patches

Subject: [PATCH 07/37] perf bpf: Compile dwarf-regs.c if CONFIG_BPF_PROLOGUE is on

Subject: [PATCH 08/37] perf bpf: Allow BPF program attach to uprobe events

Subject: [PATCH 09/37] perf bpf: Allow attaching BPF programs to modules symbols

Subject: [PATCH 10/37] perf bpf: Allow BPF program config probing options

Subject: [PATCH 11/37] perf bpf: Add prologue for BPF programs for fetching arguments

Subject: [PATCH 12/37] perf bpf: Generate prologue for BPF programs

Subject: [PATCH 13/37] perf test: Test the BPF prologue adding infrastructure

Subject: [PATCH 14/37] perf test: Fix 'perf test BPF' when it fails to find a suitable vmlinux

Subject: [PATCH 15/37] perf bpf: Use same BPF program if arguments are identical

Subject: [PATCH 16/37] perf tests: Pass the subtest index to each test routine

Subject: [PATCH 17/37] perf test: Print result for each LLVM subtest

Subject: [PATCH 18/37] perf test: Print result for each BPF subtest

Subject: [PATCH 19/37] perf test: Mute test cases error messages if verbose == 0

Subject: [PATCH 20/37] perf probe: Fix to free temporal Dwarf_Frame

Subject: [PATCH 21/37] perf machine: Fix machine__findnew_module_map to put registered map

Subject: [PATCH 22/37] perf machine: Fix machine__destroy_kernel_maps to drop vmlinux_maps references

Subject: [PATCH 23/37] perf machine: Fix to destroy kernel maps when machine exits

Subject: [PATCH 24/37] perf tools: Make perf_exec_path() always return malloc'd string

Subject: [PATCH 25/37] perf tools: Fix to put new map after inserting to map_groups in dso__load_sym

Subject: [PATCH 26/37] perf tools: Fix __dsos__addnew to put dso after adding it to the list

Subject: [PATCH 27/37] perf tools: Fix machine__create_kernel_maps to put kernel dso refcount

Subject: [PATCH 28/37] perf machine: Fix machine__findnew_module_map to put dso

Subject: [PATCH 29/37] perf report: Support folded callchain mode on --stdio

Subject: [PATCH 30/37] perf callchain: Abstract callchain print function

Subject: [PATCH 31/37] perf callchain: Add count fields to struct callchain_node

Subject: [PATCH 32/37] perf report: Add callchain value option

Subject: [PATCH 33/37] perf hists browser: Factor out hist_browser__show_callchain_list()

Subject: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: [PATCH 35/37] perf hists browser: Support folded callchains

Subject: [PATCH 36/37] perf ui/gtk: Support flat callchains

Subject: [PATCH 37/37] perf ui/gtk: Support folded callchains

Subject: RE: [GIT PULL 00/37] perf/core improvements and fixes

Subject: Re: [GIT PULL 00/37] perf/core improvements and fixes

Subject: RE: [GIT PULL 00/37] perf/core improvements and fixes

Subject: Re: [GIT PULL 00/37] perf/core improvements and fixes

Subject: Re: [PATCH 31/37] perf callchain: Add count fields to struct callchain_node

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: Re: [PATCH 31/37] perf callchain: Add count fields to struct callchain_node

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: Re: [PATCH 34/37] perf hists browser: Support flat callchains

Subject: [PATCH 26/37] perf tools: Fix dsosaddnew to put dso after adding it to the list