Date: Mon, 28 Mar 2016 17:28:56 -0700
From: Alexei Starovoitov
To: Wang Nan
Cc: Alexei Starovoitov, Arnaldo Carvalho de Melo, Peter Zijlstra,
	linux-kernel@vger.kernel.org, Brendan Gregg, He Kuang, Jiri Olsa,
	Masami Hiramatsu, Namhyung Kim, pi3orama@163.com, Zefan Li
Subject: Re: [PATCH 4/4] perf core: Add backward attribute to perf event
Message-ID: <20160329002855.GC31198@ast-mbp.thefacebook.com>
References: <1459147292-239310-1-git-send-email-wangnan0@huawei.com>
 <1459147292-239310-5-git-send-email-wangnan0@huawei.com>
In-Reply-To: <1459147292-239310-5-git-send-email-wangnan0@huawei.com>
User-Agent: Mutt/1.5.24 (2015-08-30)

On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
> This patch introduces a 'write_backward' bit to perf_event_attr, which
> controls the direction of a ring buffer. When set, the corresponding
> ring buffer is written from end to beginning. This feature is designed
> to support reading from overwritable ring buffers.
>
> A ring buffer can be created by mapping a perf event fd. The kernel
> puts event records into the ring buffer, and user programs like perf
> fetch them from the address returned by mmap(). To prevent races
> between the kernel and perf, the two sides communicate through the
> 'head' and 'tail' pointers. The kernel maintains the 'head' pointer,
> pointing it at the next free area (the end of the last record). Perf
> maintains the 'tail' pointer, pointing it at the end of the last
> consumed record (one that has already been fetched). From these two
> pointers the kernel determines the available space in the ring buffer
> and avoids overwriting unfetched records.
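>
> For reference, here is a minimal sketch of how a user-space reader
> consumes a normal (read-write mapped) ring buffer with this protocol,
> following the perf_event_mmap_page ABI. read_forward() and
> process_record() are illustrative names, the data_offset/data_size
> fields assume a reasonably recent kernel, and records that wrap past
> the end of the buffer are ignored for brevity (a real reader such as
> perf copies them out first):
>
>     #include <linux/perf_event.h>
>     #include <stdint.h>
>
>     void read_forward(struct perf_event_mmap_page *pg,
>                       void (*process_record)(struct perf_event_header *))
>     {
>             char *data = (char *)pg + pg->data_offset;
>             uint64_t size = pg->data_size;  /* power of two */
>             uint64_t tail = pg->data_tail;
>             /* Acquire pairs with the kernel's update of data_head. */
>             uint64_t head = __atomic_load_n(&pg->data_head,
>                                             __ATOMIC_ACQUIRE);
>
>             /* head/tail are free-running counters; mask for offset. */
>             while (tail < head) {
>                     struct perf_event_header *hdr =
>                             (void *)(data + (tail % size));
>
>                     process_record(hdr);            /* oldest first */
>                     tail += hdr->size;
>             }
>
>             /* Release: tell the kernel this space may be reused. */
>             __atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
>     }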
>
> Mapping without 'PROT_WRITE' creates an overwritable ring buffer.
> Unlike a normal ring buffer, perf is unable to maintain the 'tail'
> pointer for it, because writing to the mapping is forbidden.
> Therefore, for this type of ring buffer, the kernel overwrites old
> records unconditionally, working like a flight recorder. This feature
> would be useful if reading from an overwritable ring buffer were as
> easy as reading from a normal ring buffer. However, there's an
> obscure problem.
>
> The following figure demonstrates the state of an overwritable ring
> buffer which is nearly full. In this figure, the 'head' pointer points
> to the end of the last record, and a long record 'E' is pending. For a
> normal ring buffer, a 'tail' pointer would have pointed to position
> (X), so the kernel would know there's no more space in the ring
> buffer. However, for an overwritable ring buffer, the kernel doesn't
> care about the 'tail' pointer.
>
>     (X)                              head
>      .                                |
>      .                                V
>    +------+-------+----------+------+---+
>    |A....A|B.....B|C........C|D....D|   |
>    +------+-------+----------+------+---+
>
> After writing record 'E', record 'A' is overwritten.
>
>       head
>        |
>        V
>    +--+---+-------+----------+------+---+
>    |.E|..A|B.....B|C........C|D....D|E..|
>    +--+---+-------+----------+------+---+
>
> Now perf decides to read from this ring buffer. However, neither of
> the two natural positions, 'head' and the start of the ring buffer,
> points to the head of a record. Even though perf can read the full
> ring buffer, it is unable to find a position to start decoding.
>
> The first attempt to solve this problem that I'm aware of can be
> found at [1]. It makes the kernel maintain the 'tail' pointer,
> updating it when the ring buffer is half full. However, this approach
> introduces overhead to the fast path: test results show a 1% overhead
> [2]. In addition, this method utilizes no more than 50% of the
> records.
>
> Another attempt can be found at [3], which puts the size of an event
> at the end of each record. This approach allows perf to find records
> in a backward manner from the 'head' pointer, by reading the size of
> a record from its tail. However, because of alignment requirements,
> it needs 8 bytes to record the size of a record, which is a
> significant waste. Its performance is also not good, because more
> data needs to be written. This approach also introduces some extra
> branch instructions to the fast path.
>
> 'write_backward' is a better solution to this problem.
>
> The following figure demonstrates the state of the overwritable ring
> buffer when 'write_backward' is set, before overwriting:
>
>        head
>         |
>         V
>    +---+------+----------+-------+------+
>    |   |D....D|C........C|B.....B|A....A|
>    +---+------+----------+-------+------+
>
> and after overwriting:
>
>                                      head
>                                       |
>                                       V
>    +---+------+----------+-------+---+--+
>    |..E|D....D|C........C|B.....B|A..|E.|
>    +---+------+----------+-------+---+--+
>
> In each situation, 'head' points to the beginning of the newest
> record. From this record, perf can iterate over the full ring buffer,
> fetching as many records as possible one by one.
>
> The only limitation that needs to be considered is back-to-back
> reading. Because user programs are non-deterministic, it is
> impossible to ensure the ring buffer stays stable during reading.
> Consider an extreme situation: perf is scheduled out after reading
> record 'D', then a burst of events comes and eats up the whole ring
> buffer (one or more rounds), but the 'head' pointer happens to be at
> the same position when perf comes back. Continuing to read after 'D'
> would now be incorrect.
>
> To prevent this problem, we need a way to ensure the ring buffer is
> stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
> suggested because its overhead is lower than that of
> ioctl(PERF_EVENT_IOC_ENABLE).
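>
> For illustration, a sketch of the resulting read loop, under the same
> assumptions as the sketch above (read_backward() and process_record()
> are illustrative names; a record wrapping past the end of the buffer,
> like 'E' in the figures, would need to be copied out before decoding):
>
>     #include <linux/perf_event.h>
>     #include <stdint.h>
>     #include <sys/ioctl.h>
>
>     void read_backward(int fd, struct perf_event_mmap_page *pg,
>                        void (*process_record)(struct perf_event_header *))
>     {
>             char *data = (char *)pg + pg->data_offset;
>             uint64_t size = pg->data_size;  /* power of two */
>             uint64_t head, off, walked = 0;
>
>             /* Freeze writers so records can't change under us. */
>             ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1);
>
>             head = __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
>             off = head % size;      /* the newest record starts here */
>
>             while (walked < size) {
>                     struct perf_event_header *hdr =
>                             (void *)(data + off);
>
>                     if (hdr->size == 0 || walked + hdr->size > size)
>                             break;  /* unused gap or partial record */
>                     process_record(hdr);    /* newest to oldest */
>                     walked += hdr->size;
>                     off = (off + hdr->size) % size;
>             }
>
>             /* Resume overwriting. */
>             ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);
>     }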
>
> This patch utilizes the event's default overflow_handler introduced
> previously: perf_event_output_backward() is created as the default
> overflow handler for backward ring buffers. To avoid adding overhead
> to the fast path, the original perf_event_output() becomes
> __perf_event_output() and is marked '__always_inline'. In theory, no
> extra overhead is introduced to the fast path.
>
> Performance result:
>
> Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
> duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
> the system calls. Results are in ns.
>
> Testing environment:
>
>   CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>   Kernel : v4.5.0
>
>            MEAN         STDVAR
>   BASE      800214.950   2853.083
>   PRE1     2253846.700   9997.014
>   PRE2     2257495.540   8516.293
>   POST     2250896.100   8933.921
>
> Where 'BASE' is the performance without capturing, 'PRE1' is the
> result on an unmodified v4.5.0 kernel, 'PRE2' is the result before
> this patch, and 'POST' is the result after this patch. See [4] for
> the detailed experimental setup.
>
> Considering the stdvar, this patch doesn't introduce performance
> overhead to the fast path.
>
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
> [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
>
> Signed-off-by: Wang Nan
> Cc: He Kuang
> Cc: Alexei Starovoitov
> Cc: Arnaldo Carvalho de Melo
> Cc: Brendan Gregg
> Cc: Jiri Olsa
> Cc: Masami Hiramatsu
> Cc: Namhyung Kim
> Cc: Peter Zijlstra
> Cc: Zefan Li
> Cc: pi3orama@163.com
> ---
>  include/linux/perf_event.h      | 28 +++++++++++++++++++++---
>  include/uapi/linux/perf_event.h |  3 ++-
>  kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
>  kernel/events/ring_buffer.c     | 14 ++++++++++++
>  4 files changed, 84 insertions(+), 9 deletions(-)

Very useful feature. Looks good.

Acked-by: Alexei Starovoitov