Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Mon, 15 Mar 2021 15:09:03 +0100
From:   Jiri Olsa <jolsa@redhat.com>
To:     Song Liu <songliubraving@fb.com>
Cc:     linux-kernel <linux-kernel@vger.kernel.org>,
        Kernel Team <Kernel-team@fb.com>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        "acme@redhat.com" <acme@redhat.com>,
        "namhyung@kernel.org" <namhyung@kernel.org>,
        "jolsa@kernel.org" <jolsa@kernel.org>
Subject: Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
Message-ID: <YE9qf8vcS8svnFG7@krava>
References: <20210312020257.197137-1-songliubraving@fb.com>
 <YE6Sq78CRmr/JsHl@krava>
 <95CDB411-4F73-4C56-8CA5-48C002F03ACF@fb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <95CDB411-4F73-4C56-8CA5-48C002F03ACF@fb.com>
Precedence: bulk

On Mon, Mar 15, 2021 at 07:51:11AM +0000, Song Liu wrote:

SNIP

> >> Known limitations:
> >> 1. Do not support per cgroup events;
> > 
> > isn't that another filter via bpf_get_current_cgroup_id ?
> 
> This is tricky with nested cgroups. We also have bpf_get_current_ancestor_cgroup_id,
> but that's not the exact same we need either. We may need a new helper. 
> 
> Also, we are limited by 38 follower programs for the leader program, which 
> might be too few for Namhyung's use case. We can use some logic in the 
> follower program to count events for many cgroups in one fexit program. 

ok, I see

> 
> > 
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> > 
> > we'd need to call the leadr on fentry/fexit of the monitored bpf
> > program, right?
> 
> My current plan is to let perf-stat -b share the same perf_event map with
> bperf, but not the leader program. 

ok

> 
> > 
> >> 3. Do not support event groups.
> > 
> > I guess for group support you'll need to load 'leaders' for each group member
> 
> I am not sure how this would work. Say the user started the following in 
> parallel:
> 
>     #1  perf stat --bpf-counters -e cycles -a &
>     #2  perf stat --bpf-counters -e instructions -C 1,2,3 &
>     #3  perf stat --bpf-counters -e {cycles,instructions} -p 123 
> 
> Event "cycles" is loaded by #1; while event "instruction" is loaded by #2.
> If #3 reuses these events, it is tricky (or impossible?) to make sure the
> event group works as expected, right?

the reason for group is to force kernel to all events if possible
or none..  so with your change I don't see a problem, you just
provide values for those counters.. but let's see when you get
there ;-)

> 
> > 
> >> 
> >> The following commands have been tested:
> >> 
> >>   perf stat --use-bpf -e cycles -a
> >>   perf stat --use-bpf -e cycles -C 1,3,4
> >>   perf stat --use-bpf -e cycles -p 123
> >>   perf stat --use-bpf -e cycles -t 100,101
> > 
> > works good with that file you sent.. I'll check/test more,
> > so far some quick comments below
> > 
> > thanks,
> > jirka
> > 
> > 
> > 
> > SNIP
> > 
> >> @@ -1146,6 +1156,10 @@ static struct option stat_options[] = {
> >> #ifdef HAVE_BPF_SKEL
> >> 	OPT_STRING('b', "bpf-prog", &target.bpf_str, "bpf-prog-id",
> >> 		   "stat events on existing bpf program id"),
> >> +	OPT_BOOLEAN(0, "use-bpf", &target.use_bpf,
> >> +		    "use bpf program to count events"),
> >> +	OPT_STRING(0, "attr-map", &target.attr_map, "attr-map-path",
> >> +		   "path to perf_event_attr map"),
> > 
> > what's the point of allowing another name? just debug purpose?
> 
> It is mostly to cover corner cases, like something else used the same 
> name. 

about that.. we just take the object fd assuming it's map,
should we test it somehow?

  map_fd = bpf_obj_get(path);

if it's not the map we expect, I think we should generate
another name without forcing user to run again with --attr-map

but still warn, so other perf session can use the new name

SNIP

> >> +static int bperf_sync_counters(struct evsel *evsel)
> >> +{
> >> +	struct perf_cpu_map *all_cpus = perf_cpu_map__new(NULL);
> >> +	int num_cpu, i, cpu;
> >> +
> >> +	if (!all_cpus)
> >> +		return -1;
> >> +
> >> +	num_cpu = all_cpus->nr;
> >> +	for (i = 0; i < num_cpu; i++) {
> >> +		cpu = all_cpus->map[i];
> >> +		bperf_trigger_reading(evsel->bperf_leader_prog_fd, cpu);
> >> +	}
> >> +	return 0;
> >> +}
> >> +
> >> +static int bperf__enable(struct evsel *evsel)
> >> +{
> >> +	struct bperf_follower_bpf *skel = evsel->follower_skel;
> >> +	__u32 num_cpu_bpf = libbpf_num_possible_cpus();
> > 
> > we have cpu__max_cpu for that
> 
> libbpf calls for percpu array use libbpf_num_possible_cpus. So I guess it 
> is better to use the same here. The two are identical at the moment though.

then in the bperf__read you take that array and update
perf_counts, which is based on perf's cpus, so they mix
anyway

I'd like to keep perf code using perf's cpus api.. could
we just check at the begining that libbpf_num_possible_cpus
returns same number as cpu__max_cpu (if not, we have a
problem anyway) and use perf's cpu api

thanks,
jirka