Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752838AbbFDKVO (ORCPT ); Thu, 4 Jun 2015 06:21:14 -0400 Received: from szxga02-in.huawei.com ([119.145.14.65]:1876 "EHLO szxga02-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752374AbbFDKUm (ORCPT ); Thu, 4 Jun 2015 06:20:42 -0400 Message-ID: <557025A2.6090903@huawei.com> Date: Thu, 4 Jun 2015 18:17:06 +0800 From: "Wangnan (F)" User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Ingo Molnar , Arnaldo Carvalho de Melo CC: , Adrian Hunter , Alexei Starovoitov , Brendan Gregg , Daniel Borkmann , "David Ahern" , He Kuang , Jiri Olsa , Kaixu Xia , Madhavan Srinivasan , Masami Hiramatsu , Michael Ellerman , Namhyung Kim , Peter Zijlstra , , Sukadev Bhattiprolu , "Zefan Li" , Arnaldo Carvalho de Melo , Alexei Starovoitov Subject: [EXPERIENCE] My experience on using perf record BPF filter on a real usecase References: <1433371238-14572-1-git-send-email-acme@kernel.org> <20150604054854.GA10969@gmail.com> <556FEB32.8040909@huawei.com> In-Reply-To: <556FEB32.8040909@huawei.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.111.66.109] X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7456 Lines: 203 Hi all, I'd like to share my exprience on using 'perf record' BPF filter in a real usecase to show the power and shortcome in my patch series: https://lkml.kernel.org/r/1433144296-74992-1-git-send-email-wangnan0@huawei.com and other works on eBPF. My usecase shows that such filter is useful. Also, I hope it can help us to find way to further improve it. My task is to find the reason why iozone test result is bad on some specific cases. The development environment is a x86_64 server, the target machine is a smartphone with Android. By previous analysis I have already got some useful information: 1. iozone computes bandwidth by averaging time of each sys_write. 2. In our case, 1% sys_write takes 75% of total time, so what I need to do now should be finding the reason why those sys_write take so long. 3. By sampling call stack on sched:sched_switch, I find that those sys_write calls lock_page() and blocks on it. I decide to use BPF filter to find the other side of this locking contention. The idea is simple: 1. For all calls of lock_page(), probe at entry and exit points of it. Measure the execution time of the lock_page() call. If it takes too long (longer than 0.1 second) then there should have a lock contention. Take the sample at exit point. 2. For all calls of unlock_page(), if the page is acquiring by other on at least 0.1 second before, take a sample at this point. Currently making the above idea work is possible but not very straightforward. One problem I can identify is: Doesn't like ftrace, there is no way for eBPF program to access call stack information. Without extra information, eBPF programs are unable to match lock_page events and corresponding lock_page%return events. Currently the only way for passing information between programs are maps. To simulate call stack matching, I create a BPF_FUNC_git_tid() which returns current->pid, and a proc_locking_page_map map which records the acquired page and time of calling lock_page. Another problem is: at the entry of lock_page() and unlock_page(), for fetching the page pointer I have to directly use 'ctx->regs[0]' (I am on aarch64). Which is not protable. The final program I used is attached at the bottom of this email. It costs more than 100 lines of code. I have to do some debugging to ensure it works correctly on a virtual machine. It is compiled using: # $CLANG ${INCLUDE} -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign -O2 \ -emit-llvm -c lock_page.c -o - | $LLC -march=bpf -filetype=obj -o \ lock_page.o Then the lock_page.o is transfered onto target system. I loaded it using following command: # perf record -e syscalls:sys_enter_write -e syscalls:sys_exit_write \ -e lock_page.o -a iozone ... Here is another inconvenience. Currently I only concern on write syscall issued by iozone. However, without '-a' I'm unable to collect information of the locker. If I want to filter sys_{enter,exit}_write belong to iozone out using eBPF, I need to implement another function like BPF_FUNC_git_comm. Another method is to use perf '--filter' after the two events. However it looks strange to use two filter mechanisms together. This time I choose to do filtering offline using perf script. The result is resonable. Finaly I found the two side of lock contention. It shows the way to improve. I'm sorry I can't share the call stack in this list. One inconvenience in this stage is: the information is printed into ring buffer while the samples are stored into perf.data. By analysing perf.data without ftrace ring buffer I don't know how long the lock_page() cost becasue I don't sample at the entry of lock_page(). The final part is the BPF program I used. I think there should have better way to do it. If any know how to make it shorter please let me know. Thank you. /* ------------- START OF BPF PROGRAM ------------- */ /* __lock_page pass to unlock_page, key is pid */ struct proc_locking_page { unsigned long page; unsigned long time; }; struct bpf_map_def SEC("maps") proc_locking_page_map = { .type = BPF_MAP_TYPE_HASH, .key_size = sizeof(unsigned long), .value_size = sizeof(struct proc_locking_page), .max_entries = 1000000, }; /* from page get pid */ struct page_being_locked_by_proc { unsigned long tid; unsigned long time; }; struct bpf_map_def SEC("maps") page_being_locked_by_proc_map = { .type = BPF_MAP_TYPE_HASH, .key_size = sizeof(unsigned long), .value_size = sizeof(struct page_being_locked_by_proc), .max_entries = 1000000, }; SEC("lock_page=__lock_page") int lock_page_recorder(struct pt_regs *ctx) { unsigned long tid = bpf_get_tid(); unsigned long page = ctx->regs[0]; unsigned long curr_ns = bpf_ktime_get_ns(); struct proc_locking_page locking_page; struct page_being_locked_by_proc being_locked; locking_page.page = page; locking_page.time = curr_ns; being_locked.tid = tid; being_locked.time = curr_ns; bpf_map_update_elem(&proc_locking_page_map, &tid, &locking_page, BPF_ANY); bpf_map_update_elem(&page_being_locked_by_proc_map, &page, &being_locked, BPF_ANY); return 0; } SEC("lock_page_ret=__lock_page%return") int lock_page_return_recorder(struct pt_regs *ctx) { unsigned long tid = bpf_get_tid(); unsigned long curr_ns = bpf_ktime_get_ns(); unsigned long page; unsigned long diff_time; struct proc_locking_page *p_locking_page; p_locking_page = bpf_map_lookup_elem(&proc_locking_page_map, &tid); /* BAD!! */ if (!p_locking_page) return 0; page = p_locking_page->page; diff_time = curr_ns - p_locking_page->time; bpf_map_delete_elem(&proc_locking_page_map, &tid); bpf_map_delete_elem(&page_being_locked_by_proc_map, &page); if (diff_time > 10000000) { char fmt[] = "tid %d get page %lx using %d ns\n"; bpf_trace_printk(fmt, sizeof(fmt), tid, page, diff_time); return 1; } return 0; } SEC("unlock_page=unlock_page") int unlock_page_recorder(struct pt_regs *ctx) { unsigned long tid = bpf_get_tid(); unsigned long page = ctx->regs[0]; unsigned long time = bpf_ktime_get_ns(); unsigned long diff_time; struct page_being_locked_by_proc *p_being_locked; char fmt[] = "%d vs %d, %d ns\n"; p_being_locked = bpf_map_lookup_elem(&page_being_locked_by_proc_map, &page); if (!p_being_locked) return 0; diff_time = time - p_being_locked->time; if (diff_time > 10000000) { bpf_trace_printk(fmt, sizeof(fmt), tid, p_being_locked->tid, diff_time); return 1; } return 0; } /* ------------- END OF BPF PROGRAM ------------- */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/