Subject: Re: [RFC PATCH 0/6] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
From: Alexei Starovoitov
To: kaixu xia, davem@davemloft.net, acme@kernel.org, mingo@redhat.com, a.p.zijlstra@chello.nl, masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org
Cc: wangnan0@huawei.com, linux-kernel@vger.kernel.org, pi3orama@163.com, hekuang@huawei.com
Date: Fri, 17 Jul 2015 15:56:08 -0700
Message-ID: <55A98808.9010307@plumgrid.com>

On 7/17/15 3:43 AM, kaixu xia wrote:
> There are many useful PMUs provided by X86 and other architectures. By
> combining PMU, kprobe and eBPF program together, many interesting things
> can be done. For example, by probing at sched:sched_switch we can
> measure IPC changing between different processes by watching 'cycle' PMU
> counter; by probing at entry and exit points of a kernel function we are
> able to compute cache miss rate for a function by collecting
> 'cache-misses' counter and see the differences. In summary, we can
> define the begin and end points of a procedure, insert kprobes on them,
> attach two BPF programs and let them collect specific PMU counter.

That would definitely be a useful feature.

As far as the overall design goes, I think it should be done slightly
differently. The addition of 'flags' to all maps is a bit hacky and it
seems to have a few holes. It's better to reuse the 'store fds into maps'
code that prog_array is doing: add a new map type
BPF_MAP_TYPE_PERF_EVENT_ARRAY and reuse most of the arraymap.c code.

The program also wouldn't need to do lookup + read_pmu, so instead of:

  r0 = 0 (the chosen key: CPU-0)
  *(u32 *)(fp - 4) = r0
  value = bpf_map_lookup_elem(map_fd, fp - 4);
  count = bpf_read_pmu(value);

you will be able to do:

  count = bpf_perf_event_read(perf_event_array_map_fd, index)

which will be faster. Note, I'd prefer the name 'bpf_perf_event_read' for
the helper.

Then inside the helper we really cannot take a mutex, sleep or do an
smp_call, but since programs are always executed with preemption disabled
and never from NMI, I think something like the following should work:

  u64 bpf_perf_event_read(u64 r1, u64 index, ...)
  {
          struct bpf_perf_event_array *array = (void *) (long) r1;
          struct perf_event *event;

          /* bounds check against the map size */
          if (unlikely(index >= array->map.max_entries))
                  return -EINVAL;

          event = array->events[index];

          /* only read counters that are active and owned by this CPU */
          if (event->state != PERF_EVENT_STATE_ACTIVE)
                  return -EINVAL;
          if (event->oncpu != raw_smp_processor_id())
                  return -EINVAL;

          /* refresh the count from hardware, then return it */
          __perf_event_read(event);
          return perf_event_count(event);
  }

Not sure whether we need to disable IRQs around __perf_event_read(); I
think it should be ok without. Also, during the store of an FD into the
perf_event_array you'd need to filter out all crazy events. I would limit
it to a few basic types first.
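To make the intended usage more concrete, here is a rough sketch of how
such a program could look when written in C. This is only a sketch: the
map type, the helper and its wrapper are the proposed additions from
above, not an existing API, and the bpf_map_def / SEC() boilerplate is the
usual samples/bpf style:

  /* sketch: BPF_MAP_TYPE_PERF_EVENT_ARRAY and bpf_perf_event_read()
   * do not exist yet; they are the additions discussed above.
   */
  #include <linux/ptrace.h>
  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"      /* SEC() and struct bpf_map_def */

  /* samples/bpf-style wrapper for the proposed helper; the helper id
   * BPF_FUNC_perf_event_read would need to be added to uapi/linux/bpf.h
   */
  static long long (*bpf_perf_event_read)(void *map, int index) =
          (void *) BPF_FUNC_perf_event_read;

  /* user space creates the perf events (e.g. a 'cycles' counter per CPU)
   * and stores their FDs into the map slots before attaching the program
   */
  struct bpf_map_def SEC("maps") pmu_counters = {
          .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, /* proposed map type */
          .key_size = sizeof(int),
          .value_size = sizeof(unsigned int),     /* the event fd */
          .max_entries = 64,
  };

  SEC("kprobe/sys_write")
  int count_cycles(struct pt_regs *ctx)
  {
          long long count;

          /* same as the asm above: read the counter in slot 0 (CPU-0);
           * the C program references the map object directly and the
           * loader resolves it to the map fd, as it does for prog_array
           */
          count = bpf_perf_event_read(&pmu_counters, 0);
          if (count < 0)
                  return 0;     /* event not active on this cpu */

          /* ... store 'count' in another map, compute deltas between
           * entry and exit probes, etc. ...
           */
          return 0;
  }

  char _license[] SEC("license") = "GPL";

User space would open the events with perf_event_open() and store the FDs
into the map slots, similar to how prog_array FDs are stored today.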
Btw, make sure you do your tests with lockdep and the other debug options
enabled. And for the sample code please use C for the bpf program; not
many people can read bpf asm ;)