Subject: Re: [RFC PATCH 0/6] bpf: Introduce the new ability of eBPF programs to access hardware PMU counter
From: pi3orama
X-Mailer: iPhone Mail (12H143)
In-Reply-To: <55A98808.9010307@plumgrid.com>
Date: Sat, 18 Jul 2015 07:27:43 +0800
Cc: kaixu xia, davem@davemloft.net, acme@kernel.org, mingo@redhat.com,
 a.p.zijlstra@chello.nl, masami.hiramatsu.pt@hitachi.com, jolsa@kernel.org,
 wangnan0@huawei.com, linux-kernel@vger.kernel.org, hekuang@huawei.com
References: <1437129816-13176-1-git-send-email-xiakaixu@huawei.com> <55A98808.9010307@plumgrid.com>
To: Alexei Starovoitov

Sent from my iPhone

> On Jul 18, 2015, at 6:56 AM, Alexei Starovoitov wrote:
>
>> On 7/17/15 3:43 AM, kaixu xia wrote:
>> There are many useful PMUs provided by X86 and other architectures. By
>> combining PMU, kprobe and eBPF program together, many interesting things
>> can be done. For example, by probing at sched:sched_switch we can
>> measure IPC changing between different processes by watching 'cycle' PMU
>> counter; by probing at entry and exit points of a kernel function we are
>> able to compute cache miss rate for a function by collecting
>> 'cache-misses' counter and see the differences. In summary, we can
>> define the begin and end points of a procedure, insert kprobes on them,
>> attach two BPF programs and let them collect specific PMU counter.
>
> that would be definitely a useful feature.
> As far as overall design I think it should be done slightly differently.
> The addition of 'flags' to all maps is a bit hacky and it seems has few
> holes. It's better to reuse 'store fds into maps' code that prog_array
> is doing. You can add new map type BPF_MAP_TYPE_PERF_EVENT_ARRAY
> and reuse most of the arraymap.c code.

Then we would also need another BPF_MAP_TYPE_PERF_EVENT_HASHMAP for
events in task context.

> The program also wouldn't need to do lookup+read_pmu, so instead of:
> r0 = 0 (the chosen key: CPU-0)
> *(u32 *)(fp - 4) = r0
> value = bpf_map_lookup_elem(map_fd, fp - 4);
> count = bpf_read_pmu(value);
> you will be able to do:
> count = bpf_perf_event_read(perf_event_array_map_fd, index)
> which will be faster.
> note, I'd prefer 'bpf_perf_event_read' name for the helper.

I thought about that. It makes things much simpler, since we won't need
to care about the verifier too much. I chose the current implementation
because I think we may need perf events that are not wrapped in a map in
the future (for example, tracepoints). With the design you suggested, in
that case we would have to create a map with only one element in it.
Although that is acceptable, we could make it look better. However,
let's think about that in the future.
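To make the comparison concrete, below is roughly how such a program
could look in C once the proposed map type and helper are in place. This
is only a sketch in the samples/bpf style; the BPF_FUNC_perf_event_read
number, the helper declaration and the probe point are placeholders based
on your suggestion, not code from this series:

/*
 * Sketch only: assumes the proposed BPF_MAP_TYPE_PERF_EVENT_ARRAY map
 * type and bpf_perf_event_read(map, index) helper land as discussed.
 */
#include <uapi/linux/bpf.h>
#include <linux/ptrace.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") cycles_map = {
	.type		= BPF_MAP_TYPE_PERF_EVENT_ARRAY, /* proposed map type */
	.key_size	= sizeof(int),
	.value_size	= sizeof(__u32),
	.max_entries	= 64,	/* one slot per possible CPU */
};

/* proposed helper; not in bpf_helpers.h yet, declared here for the sketch */
static __u64 (*bpf_perf_event_read)(void *map, __u64 index) =
	(void *) BPF_FUNC_perf_event_read;

SEC("kprobe/example_begin_func")	/* placeholder probe point */
int count_cycles(struct pt_regs *ctx)
{
	__u32 cpu = bpf_get_smp_processor_id();
	__u64 count;

	/* one helper call instead of bpf_map_lookup_elem() + bpf_read_pmu() */
	count = bpf_perf_event_read(&cycles_map, cpu);

	/*
	 * a real program would stash 'count' somewhere (for example in a
	 * hash map keyed by pid) so the matching end probe can compute
	 * the delta
	 */
	return 0;
}

char _license[] SEC("license") = "GPL";

A begin/end pair as in the cover letter would then simply be two such
programs sharing that hash map.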
> Then inside helper we really cannot do mutex, sleep or smp_call,
> but since programs are always executed in preempt disabled
> and never from NMI, I think something like the following should work:
> u64 bpf_perf_event_read(u64 r1, u64 index,...)
> {
>         struct bpf_perf_event_array *array = (void *) (long) r1;
>         struct perf_event *event;
>
>         if (unlikely(index >= array->map.max_entries))
>                 return -EINVAL;
>         event = array->events[index];
>         if (event->state != PERF_EVENT_STATE_ACTIVE)
>                 return -EINVAL;
>         if (event->oncpu != raw_smp_processor_id())
>                 return -EINVAL;
>         __perf_event_read(event);
>         return perf_event_count(event);
> }
> not sure whether we need to disable irq around __perf_event_read,
> I think it should be ok without.
> Also during store of FD into perf_event_array you'd need
> to filter out all crazy events. I would limit it to few
> basic types first.
>
> btw, make sure you do your tests with lockdep and other debugs on.
> and for the sample code please use C for the bpf program. Not many
> people can read bpf asm ;)

We still need some perf side code to make a C program work: perf should
create the perf events and put them into the map. We posted this kernel
side code first to get your opinion on the basic design principle we
chose. Although the implementation has some bugs, from your and Peter's
responses I think the user interface we chose is not too much of a
problem, so we can start the perf side coding now. Right? Once the perf
side code is done, you won't see asm code in the example.

Thank you.
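For reference, the perf side flow we have in mind is roughly the
following: open one hardware 'cycles' event per CPU with
perf_event_open() and store each fd into the BPF_MAP_TYPE_PERF_EVENT_ARRAY
map through bpf(BPF_MAP_UPDATE_ELEM), the same way prog_array stores
program fds. Again this is only a sketch under the proposed design;
populate_cycles_map() and the raw syscall wrappers are illustrative
placeholders, not code from this series:

/*
 * Hypothetical user-space sketch: one hardware 'cycles' counter per CPU,
 * each fd stored into the proposed BPF_MAP_TYPE_PERF_EVENT_ARRAY map.
 * 'map_fd' is assumed to come from a prior bpf(BPF_MAP_CREATE, ...) call.
 */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>
#include <linux/perf_event.h>

static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
			       int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int bpf_map_update(int map_fd, void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (unsigned long) key;
	attr.value = (unsigned long) value;
	/* flags left as 0 (BPF_ANY) */

	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

int populate_cycles_map(int map_fd, int nr_cpus)
{
	struct perf_event_attr attr;
	int cpu;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;		/* one of the "basic" types */
	attr.config = PERF_COUNT_HW_CPU_CYCLES;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		/* system-wide counter pinned to this CPU */
		int ev_fd = sys_perf_event_open(&attr, -1, cpu, -1, 0);

		if (ev_fd < 0)
			return -1;
		/* store the event fd at index 'cpu' in the array map */
		if (bpf_map_update(map_fd, &cpu, &ev_fd))
			return -1;
	}

	return 0;
}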