Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753868AbbKYMpA (ORCPT ); Wed, 25 Nov 2015 07:45:00 -0500 Received: from szxga01-in.huawei.com ([58.251.152.64]:37005 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753265AbbKYMo6 (ORCPT ); Wed, 25 Nov 2015 07:44:58 -0500 Message-ID: <5655AD2C.9050506@huawei.com> Date: Wed, 25 Nov 2015 20:44:28 +0800 From: Yunlong Song User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Arnaldo Carvalho de Melo CC: , , , , , , , , , , , , , , , , Subject: Re: [PATCH] perf record: Add snapshot mode support for perf's regular events References: <1448373632-8806-1-git-send-email-yunlong.song@huawei.com> <1448373632-8806-2-git-send-email-yunlong.song@huawei.com> <20151124143032.GB18140@kernel.org> In-Reply-To: <20151124143032.GB18140@kernel.org> Content-Type: text/plain; charset="windows-1252" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.111.74.205] X-CFilter-Loop: Reflected X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A020202.5655AD3D.007B,ss=1,re=0.000,recu=0.000,reip=0.000,cl=1,cld=1,fgs=0, ip=0.0.0.0, so=2013-06-18 04:22:30, dmn=2013-03-21 17:37:32 X-Mirapoint-Loop-Id: eecc8d3f4324afdf4aa3b94fd1c059b8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6433 Lines: 120 On 2015/11/24 22:30, Arnaldo Carvalho de Melo wrote: > Looks like an interesting feature, if for no other reason, to match what > we have for aux area tracing. > > But perhaps having an event in the stream itself that would signal when > to dump the snapshot would be a better approach? > > One that perhaps you create with 'perf probe' or a tracepoint, both > could use the in-kernel filtering capabilities to specify when to > trigger that snapshot? > > If this snapshotting could happen in the kernel, that would be even > better, i.e. no need for two buffers (kernel rb + userspace rb) for > achieving this? Anyway, see below some comments. Hi, Arnaldo, Thanks for your suggestions and comments. As for the suggestion of snapshot mode implemented in kernel ring buffer rather than user space ring buffer, I have thought about that before I made this patch, the problems are: 1. Memory Handle If snapshot mode implemented in kernel space ring buffer, what's proper size of that ring buffer? If the size is small (such as 512k), it may lose the older part of target info which is really related with the reason causing that snapshot. For example, a temporary process launches and runs for a while, and it cannot terminate properly in the end. The real reason is that something is wrong during the launching stage, however, we do not know this prior knowledge and has to record all the information from the beginning to the end of this process's liftime. By analyzing the log of perf.data, we can finally figure out that the problem exists in the launching stage. Here if kernel space ring buffer is not big enough, we will lose the older part of the logs. In this case, we may have to set the size of kernel space ring buffer big enough, but can we kmalloc or vmalloc such big size easily in kernel space in any case? There may be some limitations to do this. Since allocating memory in the kernel is not as simple as allocating memory in user space. From this point of view, I prefer to use user space ring buffer to store the records. It's more flexible and feasible to handle memory. 2. Overwrite Overhead Without overwrite feature, snapshot mode implemented in kernel space ring buffer will also lose target info. For example, the mmap page of kernel space ring buffer is not full at time A. The target info is filled to this mmap page and then it is full at time B, then it wakes up perf to mmap_read all its records. In snapshot mode, all the records are totally dropped without writing to perf.data. Later at time C, a signal triggers the perf to dump, but the older part of target info was lost at time B. This is because kernel space ring buffer does not overwrite itself, and dump all its content once a page is full. So, we need "overwrite" support, but it needs further fixes and the corresponding patch is not applied to perf now probably due to its executive overhead (in the critical path, not very sure yet). Thus instead of incurring any probable overhead in the kernel space and thus may cause any problems, I prefer to implement the "overwrite" feature in the user space ring buffer. 3. Metadata Handle Even if problem 2 above is solved, i.e., we finally fix the overwrite feature of kernel space ring buffer. What about metadata handling (mmap_event, comm_event, fork_event, exit_event, etc.)? When it's time to overwrite this info during the wraparound procedure, this metadata events have to be handled separately to wake up perf (trigger the poll) to write to perf.data. This may increase complexity in the kernel space. However, as for the user space ring buffer version, it directly writes to perf.data once it notices the metadata event, no action of waking up perf by poll at all. Since the user space ring buffer design has to parse each event to figure out its size to skip when overwriting, it can easily figure out the event type by the way (which event is perf_record_sample, and which event is metadata). However, for this problem, there is a under-discussing solution. As what Wang Nan replies, "For such tracking events (PERF_RECORD_FORK...), we have dummy event so it is possible for us to receive tracking events from a separated channel, therefore we don't have to parse every events to pick those events out." 4. Trigger Semantic As what you mentioned, snapshot mode implemented in kernel space ring buffer "can" use the in-kernel filtering capabilities to specify when to trigger that snapshot. In my personal opinion, I regard the in-kernel filtering capabilities as something all about controlling, or say, programming with the tracepoints/trace events, which does not relate to the semantic of snapshot. You can use the in-kernel filtering capabilities (such as eBPF) to enable/disable tracing, filter tracing, aggregate tracing, but all this tracing info should be written to perf.data normally. However, we can decide when to really do this writing action via snapshot mode. Snapshot mode is just a semantic in the user space from a high level view. Without snapshot mode, those "in-kernel filtering capabilities" can still work for the regular perf. In the production case, apps can send SIGUSR2 signal to perf to easily record the most useful info which is strongly related with the apps performance problem, no redundant info any more (only collect the trace at the time any problem happens). > How about: snapshot_ring_buffer? If that is deemed too big, perhaps > snapshot_rb? > > Please continue following the existing convention and name this: > > static int record__write_rb() > > As the buffer you're writing is a ring buffer, thus needs checking about > wrap around, like you do here. > > This really looks hackish, using a global to change the behaviour of > some existing function, and a volatile at that, ouch, please change the > prototypes in some way to avoid globals like this. Thanks for your careful review and helpful comments, my patch stands for certain kind of design for snapshot mode. And other people may have suggestions and different designs, let's talk and compare different designs first. If the design of this patch is taken finally, I will rewrite it then. -- Thanks, Yunlong Song -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/