Subject: Re: [PATCH 4/4] perf core: Add backward attribute to perf event
To: Peter Zijlstra <peterz@infradead.org>
References: <1459147292-239310-1-git-send-email-wangnan0@huawei.com>
 <1459147292-239310-5-git-send-email-wangnan0@huawei.com>
 <20160329140439.GK3408@twins.programming.kicks-ass.net>
 <56FB39E9.7020006@huawei.com> <56FB3C3D.3050400@huawei.com>
CC: Alexei Starovoitov <ast@kernel.org>,
        Arnaldo Carvalho de Melo <acme@redhat.com>,
        <linux-kernel@vger.kernel.org>,
        Brendan Gregg <brendan.d.gregg@gmail.com>,
        He Kuang <hekuang@huawei.com>, Jiri Olsa <jolsa@kernel.org>,
        Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
        Namhyung Kim <namhyung@kernel.org>, <pi3orama@163.com>,
        Zefan Li <lizefan@huawei.com>
From: "Wangnan (F)" <wangnan0@huawei.com>
Message-ID: <5703C612.1080608@huawei.com>
Date: Tue, 5 Apr 2016 22:05:06 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101
 Thunderbird/38.5.0
MIME-Version: 1.0
In-Reply-To: <56FB3C3D.3050400@huawei.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3289
Lines: 120


On 2016/3/30 10:38, Wangnan (F) wrote:
>
>
> On 2016/3/30 10:28, Wangnan (F) wrote:
>>
>>
>> On 2016/3/29 22:04, Peter Zijlstra wrote:
>>> On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
>>>
>>> Could you maybe write a perf/tests thingy for this so that _some_
>>> userspace exists that exercises this new code?
>>>
>>>
>>>>   int perf_output_begin(struct perf_output_handle *handle,
>>>>                 struct perf_event *event, unsigned int size)
>>>>   {
>>>> +    if (unlikely(is_write_backward(event)))
>>>> +        return __perf_output_begin(handle, event, size, true);
>>>>       return __perf_output_begin(handle, event, size, false);
>>>>   }
>>> Would something like:
>>>
>>> int perf_output_begin(...)
>>> {
>>>     if (unlikely(is_write_backward(event))
>>>         return perf_output_begin_backward(...);
>>>     return perf_output_begin_forward(...);
>>> }
>>>
>>> make sense; I'm not sure how much is still using this, but it seems
>>> somewhat excessive to inline two copies of that thing into a single
>>> function.
>>
>>

[SNIP]

>
> Sorry. Your second suggestion seems also good:
>
> My implementation makes a big perf_output_begin(), but introduces only 
> one load and one branch.
>
> Your first suggestion introduces one load, one branch and one function 
> call.
>
> Your second suggestion introduces one load, and at least one (and at 
> most three) branches.
>
> I need some benchmarking result.
>
> Thank you.

No obviously performance divergence among all 3 implementations.

Here are some numbers:

I tested the cost of generating PERF_RECORD_COMM event using prctl with
following code:

         ...
         gettimeofday(&tv1, NULL);
         for (i = 0; i < 1000 * 1000 * 3; i++) {
                 char proc_name[10];

                 snprintf(proc_name, sizeof(proc_name), "p:%d\n", i);
                 prctl(PR_SET_NAME, proc_name);
         }
         gettimeofday(&tv2, NULL);
         us1 = tv1.tv_sec * 1000000 + tv1.tv_usec;
         us2 = tv2.tv_sec * 1000000 + tv2.tv_usec;
         printf("%ld\n", us2 - us1);
         ...

Run this benchmark 100 time in each experiment. Bind benchmark to core 2
and perf to core 1 to ensure they are on a same CPU.

Result:

BASE    : execute without perf
4.5     : pure v4.5
TIP     : with only patch 1-3/4 in this patch set applied
BIGFUNC : the implementation in my original patch
FUNCCALL: the implememtation in Peter's first suggestion:
    int perf_output_begin(...)
    {
        if (unlikely(is_write_backward(event))
            return perf_output_begin_backward(...);
        return perf_output_begin_forward(...);
    }
BRANCH : the implememtation in Peter's second suggestion:
     int perf_output_begin(...)
     {
         return __perf_output_begin(..., unlikely(event->attr.backwards));
     }


'perf' is executed using:
  # perf record -o /dev/null --no-buildid-cache -e 
syscalls:sys_enter_read ...


Results:

              MEAN       STDVAR
BASE    : 1122968.85   33492.52
4.5     : 2714200.70   26231.69
TIP     : 2646260.46   32610.56
BIGFUNC : 2661308.46   52707.47
FUNCCALL: 2636061.10   52607.80
BRANCH  : 2651335.74   34910.04


Considering the stdvar, the performance result is nearly identical.

I'd like to choose 'BRANCH' because its code looks better.

Thank you.