From: He Kuang <hekuang@huawei.com>
To: linux-kernel@vger.kernel.org
Subject: [RFC 0/2] Experience on full stack profiling using eBPF.
Date: Fri, 6 Feb 2015 14:56:03 +0800
Message-ID: <1423205765-2289-1-git-send-email-hekuang@huawei.com>
X-Mailer: git-send-email 2.2.0.33.gc18b867

Hi Alexei,

We are working on full-stack system profiling tools and found that your
eBPF work is an amazing starting point for us. However, while designing
and implementing our prototype we also ran into some of its limitations.
I'd like to share our experience with you on this mailing list, and I
hope you and others can provide comments and suggestions.

Our goal is to profile production systems by analyzing end-to-end traces
of particular procedures: a full IO request from the sys_write() system
call down to the underlying disk driver, a full network operation from
sys_sendto() down to the NIC driver, and even a full database query from
client initiation through server processing to client finalization.

To avoid heavy tracing overhead and overwhelming amounts of data, we
randomly sample a few procedures out of the huge total, trace only the
events they cause at each layer, and let all other procedures run
normally. As a result, we record as few trace events as possible for our
profiling, with minimal effect on system performance.

This RFC series provides a very rough implementation which traces an IO
process at different layers in a QEMU guest. Patch 1/2 adds new
tracepoints at each layer of the IO stack; patch 2/2 attaches eBPF
programs to them. The eBPF programs select events according to their
trigger conditions and propagate those conditions to the eBPF programs
at lower layers. We use hash tables to record the trigger conditions.

Our prototype is able to sample one sys_write() call among all the
writes issued by a dd command. The result is as follows:

  dd-985    [000] .... 54878.066046: iov_iter_copy_from_user_atomic: Node1 iov=00000000011c0010 page=ffffea0001eebb40
  ...
  kworker/  [000] .... 54886.746406: ext4_bio_write_page: Node2 page=ffffea0001eebb40 bio=ffff88007c249540
  kworker/  [000] .... 54886.750054: bio_attempt_back_merge: Node3 bio=ffff88007c249540 rq=ffff88007c3d4158
  ...
  kworker/  [000] d... 54886.828480: virtscsi_add_cmd: Node4 req=ffff88007c3d4158 index=2
  ...
  tracex5-9 [000] .Ns. 54886.840415: bio_endio: Node5 bio=ffff88007c249540

The resulting trace log is fewer than 20 lines; without sampling and
trigger conditions it would be more than 10,000 lines. We can easily
get a graph like this from the above trace (S1/2/3 are samples):

                   S1S2      S3
                   |  |      |
  (vfs)pagecache   -----------------------------------------> timeline

                   S1S2      S3
                   |  |      |
  (block)request   -----------------------------------------> timeline

                   S1S2      S3
                   |  |      |
  bio_end          -----------------------------------------> timeline
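To make the mechanism concrete, here is a heavily simplified sketch of
the eBPF side. The map layout follows what we described above (hash
tables holding trigger conditions), but the section names, the context
structure and the convention that a non-zero return value lets the
event through to the trace pipe are illustrative; the real
tracex5_kern.c in patch 2/2 differs in detail:

  #include <uapi/linux/bpf.h>
  #include <linux/types.h>
  #include "bpf_helpers.h"

  /* counter used for sampling at the top layer */
  struct bpf_map_def SEC("maps") sample_cnt = {
  	.type = BPF_MAP_TYPE_ARRAY,
  	.key_size = sizeof(__u32),
  	.value_size = sizeof(long),
  	.max_entries = 1,
  };

  /* trigger conditions recorded at the pagecache layer, keyed by page */
  struct bpf_map_def SEC("maps") page_map = {
  	.type = BPF_MAP_TYPE_HASH,
  	.key_size = sizeof(long),
  	.value_size = sizeof(long),
  	.max_entries = 1024,
  };

  /* conditions propagated down to the block layer, keyed by bio */
  struct bpf_map_def SEC("maps") bio_map = {
  	.type = BPF_MAP_TYPE_HASH,
  	.key_size = sizeof(long),
  	.value_size = sizeof(long),
  	.max_entries = 1024,
  };

  /* illustrative context: how tracepoint arguments reach the program
   * depends on the eBPF-tracing infrastructure this RFC builds on */
  struct tp_ctx {
  	long page;
  	long bio;
  };

  /* Node1: sample one of every 64 pagecache writes.  eBPF has no
   * div/mod operator (limitation 1 below), so we fall back to a
   * power-of-two mask instead of a real modulo. */
  SEC("events/filemap/iov_iter_copy_from_user_atomic")
  int sample_write(struct tp_ctx *ctx)
  {
  	__u32 key = 0;
  	long one = 1, *cnt;

  	cnt = bpf_map_lookup_elem(&sample_cnt, &key);
  	if (!cnt)
  		return 0;
  	(*cnt)++;			/* racy, but fine for a sketch */
  	if (*cnt & 0x3f)
  		return 0;		/* not sampled: no output */

  	/* remember the page as the trigger condition for lower layers */
  	bpf_map_update_elem(&page_map, &ctx->page, &one, BPF_ANY);
  	return 1;			/* emit the Node1 event */
  }

  /* Node2: runs for every page submitted by ext4, but only reports the
   * pages sampled above, translating the condition from page to bio
   * for the block layer. */
  SEC("events/ext4/ext4_bio_write_page")
  int propagate_page_to_bio(struct tp_ctx *ctx)
  {
  	long *found = bpf_map_lookup_elem(&page_map, &ctx->page);

  	if (!found)
  		return 0;		/* filtered out */
  	bpf_map_update_elem(&bio_map, &ctx->bio, found, BPF_ANY);
  	return 1;			/* emit the Node2 event */
  }

  char _license[] SEC("license") = "GPL";

The same page-to-bio pattern repeats at each boundary further down the
stack (bio to request, request to virtio-scsi command), which is how a
single sampled sys_write() stays visible all the way to bio_endio().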
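The user-space half can stay small. Again an illustrative sketch rather
than the real tracex5_user.c, assuming the samples/bpf loader
(bpf_load.c) knows how to attach the SEC("events/...") programs to
their tracepoints:

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include "bpf_load.h"

  int main(void)
  {
  	char buf[4096];
  	ssize_t n;
  	int fd;

  	/* load tracex5_kern.o; bpf_load.c creates the maps and
  	 * attaches each program to its tracepoint */
  	if (load_bpf_file("tracex5_kern.o")) {
  		printf("%s", bpf_log_buf);	/* verifier log */
  		return 1;
  	}

  	/* stream the filtered events out of the trace pipe */
  	fd = open("/sys/kernel/debug/tracing/trace_pipe", O_RDONLY);
  	if (fd < 0)
  		return 1;
  	while ((n = read(fd, buf, sizeof(buf))) > 0)
  		write(STDOUT_FILENO, buf, n);
  	return 0;
  }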
During this work we found several places where eBPF could be improved
to make our life easier:

1. BPF programs cannot use the div/mod operators, which are useful for
   sampling (the sketch above falls back to a power-of-two mask for
   this reason).

2. BPF acts as a filter and its output goes to the trace pipe, but
   sometimes we need structured trace output, like perf.data, so that
   the filter function can record more. A BPF array table is a good
   place to record such data, but its index is restricted to 4 bytes.

3. Even though most events are filtered out, the eBPF programs attached
   to the tracepoints still run on every hit. We should find a way to
   minimize the overhead for the events that end up filtered out.

4. Provide a more user-friendly interface.

Thanks for your work.

Signed-off-by: He Kuang <hekuang@huawei.com>

He Kuang (2):
  tracing: add additional IO tracepoints
  samples: bpf: IO profiling example

 block/bio.c                    |   1 +
 block/blk-core.c               |   4 +
 drivers/scsi/virtio_scsi.c     |   3 +
 fs/ext4/page-io.c              |   4 +
 include/trace/events/block.h   |  81 +++++++++++++++++
 include/trace/events/ext4.h    |  21 +++++
 include/trace/events/filemap.h |  25 ++++++
 include/trace/events/scsi.h    |  21 +++++
 mm/filemap.c                   |   2 +
 samples/bpf/Makefile           |   4 +
 samples/bpf/tracex5_kern.c     | 195 +++++++++++++++++++++++++++++++++++++++
 samples/bpf/tracex5_user.c     |  56 ++++++++++++
 12 files changed, 417 insertions(+)
 create mode 100644 samples/bpf/tracex5_kern.c
 create mode 100644 samples/bpf/tracex5_user.c
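P.S. For readers who want the flavor of patch 1/2 without reading it:
the new tracepoints are plain TRACE_EVENT() definitions. An
illustrative sketch of the ext4 one, with fields matching the Node2
line above (the actual patch may record more):

  /* as it would appear in include/trace/events/ext4.h; struct page
   * and struct bio come from headers that file already pulls in */
  TRACE_EVENT(ext4_bio_write_page,

  	TP_PROTO(struct page *page, struct bio *bio),

  	TP_ARGS(page, bio),

  	TP_STRUCT__entry(
  		__field(struct page *,	page)
  		__field(struct bio *,	bio)
  	),

  	TP_fast_assign(
  		__entry->page = page;
  		__entry->bio  = bio;
  	),

  	TP_printk("page=%p bio=%p", __entry->page, __entry->bio)
  );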