From: Namhyung Kim
Date: Sat, 13 Mar 2021 11:47:51 +0900
Subject: Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
To: Song Liu
Cc: linux-kernel, Kernel Team, Arnaldo Carvalho de Melo, Jiri Olsa
In-Reply-To: <4B3CF1B3-5EED-4882-BC99-AD676D4E3429@fb.com>
References: <20210312020257.197137-1-songliubraving@fb.com> <4B3CF1B3-5EED-4882-BC99-AD676D4E3429@fb.com>

On Sat, Mar 13, 2021 at 12:38 AM Song Liu wrote:
>
>
> > On Mar 12, 2021, at 12:36 AM, Namhyung Kim wrote:
> >
> > Hi,
> >
> > On Fri, Mar 12, 2021 at 11:03 AM Song Liu wrote:
> >>
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>
> >> Modern data center systems use these PMCs in many different ways:
> >> system-level monitoring, (maybe nested) container-level monitoring,
> >> per-process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >>
> >> On the other hand, many monitoring tools count the common metrics
> >> (cycles, instructions). It is a waste to have multiple tools create
> >> multiple perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> bperf tries to reduce such waste by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate the readings into
> >> BPF maps. The perf-stat session(s) then read the values from these
> >> BPF maps.
> >>
> >> Please refer to the comment before the definition of bperf_ops for a
> >> description of the bperf architecture.
> >
> > Interesting! Actually I thought about something similar before,
> > but my BPF knowledge is outdated. So I need to catch up but
> > failed to have some time for it so far. ;-)
> >
> >>
> >> bperf is off by default. To enable it, pass the --use-bpf option to
> >> perf-stat. bperf uses a BPF hashmap to share information about the BPF
> >> programs and maps used by bperf. This map is pinned to bpffs. The
> >> default path is /sys/fs/bpf/bperf_attr_map. The user can change the
> >> path with the --attr-map option.
> >>
> >> ---
> >> Known limitations:
> >> 1. Does not support per-cgroup events;
> >> 2. Does not support monitoring of BPF programs (perf-stat -b);
> >> 3. Does not support event groups.
> >
> > In my case, per-cgroup event counting is very important.
> > And I'd like to do that with lots of cpus and cgroups.
>
> We can easily extend this approach to support cgroup events. I didn't
> implement it to keep the first version simple.

OK.

> > So I'm working on an in-kernel solution (without BPF),
> > I hope to share it soon.
>
> This is interesting! I cannot wait to see what it looks like. I spent
> quite some time trying to enable in-kernel sharing (not just cgroup
> events), but finally decided to try the BPF approach.

Well, I found it hard to support generic event sharing that works
for all use cases. So I'm focusing on the per-cgroup case only.
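To make the "leader program" idea described above a bit more concrete, here is a
minimal sketch of a leader-style BPF program: it reads the shared hardware event
through a perf event array and publishes the per-cpu reading in a map that the
perf-stat sessions (the followers) can consume. This is only an illustration; the
section name, trigger point, and map layout are my assumptions, not necessarily
what the patch itself does.

// leader_sketch.bpf.c -- illustrative only, not the patch's bperf code
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* one shared hardware event per cpu; user space sizes and fills this */
struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(int));
} events SEC(".maps");

/* latest reading of the shared event, one slot per cpu */
struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct bpf_perf_event_value);
} readings SEC(".maps");

SEC("raw_tp/sched_switch")
int on_switch(void *ctx)
{
        __u32 zero = 0;
        struct bpf_perf_event_value *val;

        val = bpf_map_lookup_elem(&readings, &zero);
        if (!val)
                return 0;

        /* counter value plus enabled/running time for this cpu */
        bpf_perf_event_read_value(&events, BPF_F_CURRENT_CPU,
                                  val, sizeof(*val));
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

The follower side (not shown) would then fold such readings into per-target sums
(system-wide, a cpu list, a pid, ...); see the patch itself for how the real
leader and follower programs divide that work.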
> >
> > And for event groups, it seems the current implementation
> > cannot handle more than one event (not even in a group).
> > That could be a serious limitation..
>
> It supports multiple events. Multiple events are independent, i.e.,
> "cycles" and "instructions" would use two independent leader programs.

OK, then do you need multiple bperf_attr_maps? Does it work for an
arbitrary number of events?

> >
> >>
> >> The following commands have been tested:
> >>
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> >
> > Hmm... so it loads both leader and follower programs if needed, right?
> > Does it support multiple followers with different targets at the same time?
>
> Yes, the whole idea is to have one leader program and multiple follower
> programs. If we only run one of these commands at a time, it will load
> one leader and one follower. If we run multiple of them in parallel,
> they will share the same leader program and load multiple follower
> programs.
>
> I actually tested more than the commands above. The list really means
> we support -a, -C, -p, and -t.
>
> Currently, this works for multiple events and for different parallel
> perf-stat sessions. The two commands below will work well in parallel:
>
>   perf stat --use-bpf -e ref-cycles,instructions -a
>   perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5
>
> Note the use of ref-cycles, which can only use one counter on Intel CPUs.
> With this approach, the above two commands will not do time multiplexing
> on ref-cycles.

Awesome!

Thanks,
Namhyung
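P.S. To picture the consumer side of the sharing, a perf-stat-like session
essentially just sums per-cpu values out of a BPF map and scales them. A rough
user-space sketch follows; the pin path "/sys/fs/bpf/bperf_readings" and the map
layout are invented here to match the earlier leader sketch, not taken from the
patch, which coordinates everything through the bperf_attr_map instead.

// reader_sketch.c -- illustrative only; build with -lbpf
#include <stdio.h>
#include <stdlib.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <linux/bpf.h>

int main(void)
{
        int ncpus = libbpf_num_possible_cpus();
        struct bpf_perf_event_value *vals;
        unsigned long long counter = 0, enabled = 0, running = 0;
        __u32 key = 0;
        int fd, cpu;

        if (ncpus < 0)
                return 1;

        vals = calloc(ncpus, sizeof(*vals));
        if (!vals)
                return 1;

        /* hypothetical pin path for the per-cpu readings map */
        fd = bpf_obj_get("/sys/fs/bpf/bperf_readings");
        if (fd < 0) {
                perror("bpf_obj_get");
                return 1;
        }

        /* a per-cpu array lookup returns one value per possible cpu */
        if (bpf_map_lookup_elem(fd, &key, vals)) {
                perror("bpf_map_lookup_elem");
                return 1;
        }

        for (cpu = 0; cpu < ncpus; cpu++) {
                counter += vals[cpu].counter;
                enabled += vals[cpu].enabled;
                running += vals[cpu].running;
        }

        /* scale the count the way perf does for multiplexed events */
        if (running)
                counter = counter * enabled / running;

        printf("count: %llu\n", counter);
        free(vals);
        return 0;
}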