Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755288AbYLCCwy (ORCPT ); Tue, 2 Dec 2008 21:52:54 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754199AbYLCCwo (ORCPT ); Tue, 2 Dec 2008 21:52:44 -0500 Received: from mx2.redhat.com ([66.187.237.31]:38343 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754119AbYLCCwn (ORCPT ); Tue, 2 Dec 2008 21:52:43 -0500 Subject: Re: [RFC PATCH 0/3] A couple of feature requests to the unified trace buffer From: Steven Rostedt To: Jiaying Zhang Cc: linux-kernel@vger.kernel.org, Martin Bligh , Michael Rubin , Michael Davidson , Andrew Morton , Ingo Molnar In-Reply-To: <5df78e1d0812021652w74c968bboba5e6457059b569b@mail.gmail.com> References: <5df78e1d0812021652w74c968bboba5e6457059b569b@mail.gmail.com> Content-Type: text/plain Organization: Red Hat Date: Tue, 02 Dec 2008 21:52:11 -0500 Message-Id: <1228272731.4886.26.camel@localhost.localdomain> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5834 Lines: 124 Hi Jiaying, On Tue, 2008-12-02 at 16:52 -0800, Jiaying Zhang wrote: > Hi all, > > We have taken Steve's unified trace buffer patches while implementing our > kernel tracing prototype. The unified trace buffer has many features we need > and has simplified our implementation a lot. However, our use of the unified > trace buffer is slightly different from ftrace's cases. We need to trace large > amounts of data in a production environment, so it's critical that we have as > little impact on performance as possible. > > We are particularly concerned about two performance requirements: > > 1. read large amounts of data from the unified trace buffer at little > performance cost. > 2. write to the unified trace buffer with little performance overhead. > > We are considering two approaches to satisfy the first requirement - > mmap vs splice. Right now, we only implemented mmap support, so I am > going to talk about that approach in detail. For the second requirement, > our performance tests suggest the use of inline versions of buffer write > functions for high-throughput kernel events, as described below. If you check out my git tree: git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git branch: tip/devel You will see that I started writing code to handle "splice". I have a retarded version that works right now (using the term "works" loosely here). Also, most of our development is done in the Ingo's tip tree: git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git branch: master I see in the patches that you based this on 2.6.26.2? Not sure which version of the ring buffer you are using, but there has been lots of improvements to it already. > > The patches of our changes are sent following this email. We haven't tested > these patches much, and they are probably very buggy. However, we would > like to get some early feedback and see if we are headed in the right direction. > Here is the summary of the changes we made. > > A major feature we would like to include in the unified trace buffer is the > mmap support. Instead of reading one event out of the buffer at a time, we > implemented a debugfs interface that allows a user-level program to mmap the > unified trace buffer and directly copy the trace data to a disk file or socket > descriptor. Compared with reading a single event at a time, this approach > saves a copy from kernel to user-space. It also allows a trace reader to read > bulk data in a single system call. We haven't yet conducted a performance > comparison, but we think the mmap approach will have a significant performance > advantage over the current read interface. To make mmap work, I made several > changes to the unified trace buffer code. The changes include an API that maps > a page offset to a physical page in a trace buffer, APIs that export the offset > of the current produced/consumed data, and an API to advance the consumed > data pointer. The patch in the next email, "adding mmap support to the unified > trace buffer", contains the described changes. We notice that there are other > alternatives to save kernel to user-space copy, like splice_read. We haven't > explored those approaches yet. Implementations of those interfaces would be > very welcome. As I said above, I'm working on a splice version. I'm very new to the splice_read code so I'm learning as I go along. I can post my retarded patch if you would like. Actually, I'll just push it to my git tree in the branch tip/splice. > > Other helpful features from the unified trace buffer are the inline versions of > buffer write functions, i.e., ring_buffer_lock_reserve and > ring_buffer_unlock_commit. We found that function calls can add noticeable > overhead while tracing high-throughput events. According to our measurement, > the function calls from trace buffer writing alone add roughly 3% overhead > under the tbench benchmark. To see how we can eliminate this overhead, > I tried the inline versions of lock_reserve and unlock_commit (see the second > patch "adding inline buffer write functions to the unified trace buffer"). The > inline functions basically copy the existing code from ring_buffer_lock_reserve > and ring_buffer_unlock_commit. But to make it compile, I also need to move > certain functions to the header file. The side effect is that the interface > becomes less clean. This will be something that we'll need to work with. I'm finding it a bit hard that a function call is that expensive. > > We may need more fundamental changes to implement inline versions of > ring buffer write functions while still keep the clean layered design. > Our goal is to enable certain kernel event tracing by default, so we > consider 3% a big saving and we think it is important to make sure that > entering an event into a cpu buffer is really fast. > > I attached our current kernel tracing prototype in the third email to give > you an idea on how we intend to use the unified trace buffer. We will post > the trace prototype alone later, but would love to hear any early feedback. Sure, I'll try to take some time out to look at them. Just a few points on a real post: - base it on tip/master or at least my tree. - do not use "Refer to previous email" in any of the patches. The patches are brought in individually, and the change log must be able to stand on its own without needing to refer to any other patch, website or commit. - Please include a "Signed-off-by: Full Name " in every patch. Other than that, thanks for the work. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/