Message-ID: <557025A2.6090903@huawei.com>
Date: Thu, 4 Jun 2015 18:17:06 +0800
From: "Wangnan (F)" <wangnan0@huawei.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: Ingo Molnar <mingo@kernel.org>, Arnaldo Carvalho de Melo <acme@kernel.org>
CC: <linux-kernel@vger.kernel.org>, Adrian Hunter <adrian.hunter@intel.com>,
        Alexei Starovoitov <ast@plumgrid.com>,
        Brendan Gregg <brendan.d.gregg@gmail.com>,
        Daniel Borkmann <daniel@iogearbox.net>,
        "David Ahern" <dsahern@gmail.com>, He Kuang <hekuang@huawei.com>,
        Jiri Olsa <jolsa@redhat.com>, Kaixu Xia <xiakaixu@huawei.com>,
        Madhavan Srinivasan <maddy@linux.vnet.ibm.com>,
        Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
        Michael Ellerman <mpe@ellerman.id.au>,
        Namhyung Kim <namhyung@kernel.org>,
        Peter Zijlstra <peterz@infradead.org>, <pi3orama@163.com>,
        Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>,
        "Zefan Li" <lizefan@huawei.com>,
        Arnaldo Carvalho de Melo <acme@redhat.com>,
        Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subject: [EXPERIENCE] My experience on using perf record BPF filter on a real
 usecase
References: <1433371238-14572-1-git-send-email-acme@kernel.org> <20150604054854.GA10969@gmail.com> <556FEB32.8040909@huawei.com>
In-Reply-To: <556FEB32.8040909@huawei.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7456
Lines: 203

Hi all,

I'd like to share my exprience on using 'perf record' BPF filter in a
real usecase to show the power and shortcome in my patch series:

  https://lkml.kernel.org/r/1433144296-74992-1-git-send-email-wangnan0@huawei.com

and other works on eBPF.

My usecase shows that such filter is useful. Also, I hope it can help us
to find way to further improve it.

My task is to find the reason why iozone test result is bad on some
specific cases. The development environment is a x86_64 server, the target
machine is a smartphone with Android. By previous analysis I have
already got some useful information:

  1. iozone computes bandwidth by averaging time of each sys_write.

  2. In our case, 1% sys_write takes 75% of total time, so what I need
     to do now should be finding the reason why those sys_write take so
     long.

  3. By sampling call stack on sched:sched_switch, I find that those
     sys_write calls lock_page() and blocks on it.

I decide to use BPF filter to find the other side of this locking
contention. The idea is simple:

  1. For all calls of lock_page(), probe at entry and exit points of
     it. Measure the execution time of the lock_page() call. If it takes
     too long (longer than 0.1 second) then there should have a lock
     contention. Take the sample at exit point.

  2. For all calls of unlock_page(), if the page is acquiring by other
     on at least 0.1 second before, take a sample at this point.

Currently making the above idea work is possible but not very
straightforward. One problem I can identify is:

     Doesn't like ftrace, there is no way for eBPF program to access call
     stack information. Without extra information, eBPF programs are
     unable to match lock_page events and corresponding lock_page%return
     events. Currently the only way for passing information between
     programs are maps. To simulate call stack matching, I create a
     BPF_FUNC_git_tid() which returns current->pid, and a
     proc_locking_page_map map which records the acquired page and time
     of calling lock_page.

     Another problem is: at the entry of lock_page() and
     unlock_page(), for fetching the page pointer I have to directly
     use 'ctx->regs[0]' (I am on aarch64). Which is not protable.

The final program I used is attached at the bottom of this email. It
costs more than 100 lines of code. I have to do some debugging to
ensure it works correctly on a virtual machine.

It is compiled using:

  # $CLANG ${INCLUDE} -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign -O2 \
        -emit-llvm -c lock_page.c -o - | $LLC -march=bpf -filetype=obj -o  \
        lock_page.o

Then the lock_page.o is transfered onto target system.

I loaded it using following command:

  # perf record -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
          -e lock_page.o -a iozone ...

Here is another inconvenience. Currently I only concern on write
syscall issued by iozone. However, without '-a' I'm unable to collect
information of the locker. If I want to filter sys_{enter,exit}_write
belong to iozone out using eBPF, I need to implement another function
like BPF_FUNC_git_comm. Another method is to use perf '--filter' after
the two events. However it looks strange to use two filter mechanisms
together. This time I choose to do filtering offline using perf script.

The result is resonable. Finaly I found the two side of lock contention.
It shows the way to improve. I'm sorry I can't share the call stack in
this list.

One inconvenience in this stage is: the information is
printed into ring buffer while the samples are stored into perf.data.
By analysing perf.data without ftrace ring buffer I don't know how long
the lock_page() cost becasue I don't sample at the entry of
lock_page().

The final part is the BPF program I used. I think there should have
better way to do it. If any know how to make it shorter please let me
know.

Thank you.

/* ------------- START OF BPF PROGRAM ------------- */
     /* __lock_page pass to unlock_page, key is pid */
     struct proc_locking_page {
         unsigned long page;
         unsigned long time;
     };

     struct bpf_map_def SEC("maps") proc_locking_page_map = {
         .type = BPF_MAP_TYPE_HASH,
         .key_size = sizeof(unsigned long),
         .value_size = sizeof(struct proc_locking_page),
         .max_entries = 1000000,
     };

     /* from page get pid */
     struct page_being_locked_by_proc {
         unsigned long tid;
         unsigned long time;
     };

     struct bpf_map_def SEC("maps") page_being_locked_by_proc_map = {
         .type = BPF_MAP_TYPE_HASH,
         .key_size = sizeof(unsigned long),
         .value_size = sizeof(struct page_being_locked_by_proc),
         .max_entries = 1000000,
     };

     SEC("lock_page=__lock_page")
     int lock_page_recorder(struct pt_regs *ctx)
     {
         unsigned long tid = bpf_get_tid();
         unsigned long page = ctx->regs[0];
         unsigned long curr_ns = bpf_ktime_get_ns();

         struct proc_locking_page locking_page;
         struct page_being_locked_by_proc being_locked;

         locking_page.page = page;
         locking_page.time = curr_ns;
         being_locked.tid = tid;
         being_locked.time = curr_ns;

         bpf_map_update_elem(&proc_locking_page_map, &tid, 
&locking_page, BPF_ANY);
         bpf_map_update_elem(&page_being_locked_by_proc_map, &page, 
&being_locked, BPF_ANY);
         return 0;
     }

     SEC("lock_page_ret=__lock_page%return")
     int lock_page_return_recorder(struct pt_regs *ctx)
     {
         unsigned long tid = bpf_get_tid();
         unsigned long curr_ns = bpf_ktime_get_ns();
         unsigned long page;
         unsigned long diff_time;
         struct proc_locking_page *p_locking_page;


         p_locking_page = bpf_map_lookup_elem(&proc_locking_page_map, &tid);


         /* BAD!! */
         if (!p_locking_page)
                 return 0;

         page = p_locking_page->page;
         diff_time = curr_ns - p_locking_page->time;
         bpf_map_delete_elem(&proc_locking_page_map, &tid);
         bpf_map_delete_elem(&page_being_locked_by_proc_map, &page);

         if (diff_time > 10000000) {
                 char fmt[] = "tid %d get page %lx using %d ns\n";
                 bpf_trace_printk(fmt, sizeof(fmt), tid, page, diff_time);
                 return 1;
         }

         return 0;
     }

     SEC("unlock_page=unlock_page")
     int unlock_page_recorder(struct pt_regs *ctx)
     {
         unsigned long tid = bpf_get_tid();
         unsigned long page = ctx->regs[0];
         unsigned long time = bpf_ktime_get_ns();
         unsigned long diff_time;
         struct page_being_locked_by_proc *p_being_locked;
         char fmt[] = "%d vs %d, %d ns\n";

         p_being_locked = 
bpf_map_lookup_elem(&page_being_locked_by_proc_map, &page);
         if (!p_being_locked)
                 return 0;
         diff_time = time - p_being_locked->time;
         if (diff_time > 10000000) {
                 bpf_trace_printk(fmt, sizeof(fmt), tid, 
p_being_locked->tid, diff_time);
                 return 1;
         }
         return 0;
     }
/* ------------- END OF BPF PROGRAM ------------- */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/