Subject: Re: [PATCH tip] perf core: Add backward attribute to perf event
From: "Wangnan (F)"
Date: Wed, 13 Apr 2016 10:32:58 +0800
Message-ID: <570DAFDA.2010503@huawei.com>
In-Reply-To: <1459865478-53413-1-git-send-email-wangnan0@huawei.com>
References: <1459865478-53413-1-git-send-email-wangnan0@huawei.com>
Cc: He Kuang, Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
    Masami Hiramatsu, Namhyung Kim, Zefan Li

Hi Peter,

When would you consider collecting this patch? You have collected all
of its dependencies; this is the last patch needed to enable the
backward attribute. I have already posted a 'perf test' testcase for
it, so there will be some usable user-space code. I will start posting
my perf-side code again after perf/core supports the backward
attribute. There are about 34 patches now.

Thank you.

On 2016/4/5 22:11, Wang Nan wrote:
> This patch introduces a 'write_backward' bit to perf_event_attr, which
> controls the direction of a ring buffer. When this bit is set, the
> corresponding ring buffer is written from end to beginning. This
> feature is designed to support reading from an overwritable ring
> buffer.
>
> A ring buffer can be created by mapping a perf event fd. The kernel
> puts event records into the ring buffer, and user programs like perf
> fetch them from the address returned by mmap(). To prevent racing
> between kernel and perf, they communicate with each other through the
> 'head' and 'tail' pointers. The kernel maintains the 'head' pointer,
> pointing it to the next free area (the tail of the last record). Perf
> maintains the 'tail' pointer, pointing it to the tail of the last
> consumed record (one that has already been fetched). The kernel
> determines the available space in the ring buffer using these two
> pointers, to avoid overwriting unfetched records.
>
> By mapping without 'PROT_WRITE', an overwritable ring buffer is
> created. Different from a normal ring buffer, perf is unable to
> maintain the 'tail' pointer because writing is forbidden. Therefore,
> for this type of ring buffer, the kernel overwrites old records
> unconditionally, working like a flight recorder. This feature would
> be useful if reading from an overwritable ring buffer were as easy as
> reading from a normal ring buffer. However, there's an obscure
> problem.
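To make the intended usage concrete, here is a minimal user-space
sketch. It assumes the patched uapi header (for the new
'write_backward' bit); the dummy software event and the sample flags
are arbitrary examples, and error handling is kept minimal:

    /*
     * Sketch: open an event with write_backward set, then map its ring
     * buffer read-only so the kernel treats it as overwritable.
     */
    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_backward_ring_buffer(void **base, size_t pages)
    {
            struct perf_event_attr attr;
            size_t len = (pages + 1) * sysconf(_SC_PAGESIZE);
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_SOFTWARE;
            attr.config = PERF_COUNT_SW_DUMMY;  /* example event */
            attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
            attr.write_backward = 1;            /* the new bit */

            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            if (fd < 0)
                    return -1;

            /* PROT_READ only: no 'tail', kernel overwrites freely */
            *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
            if (*base == MAP_FAILED) {
                    close(fd);
                    return -1;
            }
            return fd;
    }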
> The following figure demonstrates a full overwritable ring buffer. In
> this figure, the 'head' pointer points to the end of the last record,
> and a long record 'E' is pending. For a normal ring buffer, the
> 'tail' pointer would have pointed to position (X), so the kernel
> would know there's no more space in the ring buffer. However, for an
> overwritable ring buffer, the kernel ignores the 'tail' pointer.
>
>     (X)                              head
>      .                                |
>      .                                V
>     +------+-------+----------+------+---+
>     |A....A|B.....B|C........C|D....D|   |
>     +------+-------+----------+------+---+
>
> Record 'A' is overwritten by event 'E':
>
>      head
>        |
>        V
>     +--+---+-------+----------+------+---+
>     |.E|..A|B.....B|C........C|D....D|E..|
>     +--+---+-------+----------+------+---+
>
> Now perf decides to read from this ring buffer. However, neither of
> the two natural positions, 'head' and the start of this ring buffer,
> points to the head of a record. Even though the full ring buffer can
> be accessed by perf, perf is unable to find a position from which to
> start decoding.
>
> The first attempt to solve this problem that I know of can be found
> in [1]. It makes the kernel maintain the 'tail' pointer, updating it
> when the ring buffer is half full. However, this approach introduces
> overhead to the fast path: test results show a 1% overhead [2]. In
> addition, this method utilizes no more than 50% of the records.
>
> Another attempt can be found in [3], which puts the size of an event
> at the end of each record. This approach allows perf to find records
> in a backward manner from the 'head' pointer, by reading the size of
> a record from its tail. However, because of alignment requirements,
> it needs 8 bytes to record the size of a record, which is a huge
> waste. Its performance is also not good, because more data needs to
> be written. This approach also introduces some extra branch
> instructions to the fast path.
>
> 'write_backward' is a better solution to this problem.
>
> The following figure demonstrates the state of the overwritable ring
> buffer when 'write_backward' is set, before overwriting:
>
>       head
>         |
>         V
>     +---+------+----------+-------+------+
>     |   |D....D|C........C|B.....B|A....A|
>     +---+------+----------+-------+------+
>
> and after overwriting:
>
>                                      head
>                                        |
>                                        V
>     +---+------+----------+-------+---+--+
>     |..E|D....D|C........C|B.....B|A..|E.|
>     +---+------+----------+-------+---+--+
>
> In each situation, 'head' points to the beginning of the newest
> record. From this record, perf can iterate over the full ring buffer
> and fetch records one by one.
>
> The only limitation that needs to be considered is back-to-back
> reading. Due to the non-determinism of user programs, it is
> impossible to ensure the ring buffer stays stable during reading.
> Consider an extreme situation: perf is scheduled out after reading
> record 'D', then a burst of events comes and eats up the whole ring
> buffer (one or multiple rounds). When perf comes back, reading after
> 'D' is no longer correct.
>
> To prevent this problem, we need a way to ensure the ring buffer is
> stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
> suggested because its overhead is lower than that of
> ioctl(PERF_EVENT_IOC_ENABLE).
>
> By carefully verifying the 'head' pointer, a reader can avoid pausing
> the ring buffer. For example:
>
>     /* A union of all possible events */
>     union perf_event event;
>
>     p = head = perf_mmap__read_head();
>     while (true) {
>         /* copy header of next event */
>         fetch(&event.header, p, sizeof(event.header));
>
>         /* read 'head' pointer */
>         head = perf_mmap__read_head();
>
>         /* check overwritten: is the header good? */
>         if (!verify(sizeof(event.header), p, head))
>             break;
>
>         /* copy the whole event */
>         fetch(&event, p, event.header.size);
>
>         /* read 'head' pointer again */
>         head = perf_mmap__read_head();
>
>         /* is the whole event good? */
>         if (!verify(event.header.size, p, head))
>             break;
>
>         p += event.header.size;
>     }
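The fetch() and verify() helpers above are placeholders. One plausible
shape for them is sketched below; the wrap-around copy and the
validity window (at any instant, at most one buffer's worth of bytes
starting at the current, decreasing 'head' is intact) are my
assumptions for illustration, not code from this patch:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    static char *rb_base;     /* start of the mmap'ed data area */
    static uint64_t rb_size;  /* its length, a power of two */

    /* copy len bytes at free-running position p, handling wrap-around */
    static void fetch(void *dst, uint64_t p, size_t len)
    {
            uint64_t off = p & (rb_size - 1);

            if (off + len <= rb_size) {
                    memcpy(dst, rb_base + off, len);
            } else {
                    /* the record wraps at the end of the data area */
                    memcpy(dst, rb_base + off, rb_size - off);
                    memcpy((char *)dst + (rb_size - off),
                           rb_base, len - (rb_size - off));
            }
    }

    /*
     * Bytes [p, p + len) are still intact only if they lie within one
     * buffer length of the current 'head' (which moves downward).
     */
    static bool verify(size_t len, uint64_t p, uint64_t head)
    {
            return p + len <= head + rb_size;
    }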
> However, the overhead is high because:
>
>  a) In-place decoding is not safe; copying-verifying-decoding is
>     required.
>
>  b) Fetching the 'head' pointer requires additional synchronization.
>
> (From Alexei Starovoitov:
>
> Even if this trick works, pausing is needed for more than stability
> of reading. When we collect events into an overwrite buffer we're
> waiting for some other trigger (like an all-CPU utilization spike, or
> just one CPU running while all others are idle), and when it happens
> the buffer has valuable info from the past. At this point new events
> are no longer interesting and the buffer should be paused, the events
> read, and the buffer unpaused until the next trigger comes.)
>
> This patch utilizes the event's default overflow_handler introduced
> previously: perf_event_output_backward() is created as the default
> overflow handler for backward ring buffers. To avoid adding extra
> overhead to the fast path, the original perf_event_output() becomes
> __perf_event_output(), marked '__always_inline'. In theory, no extra
> overhead is introduced to the fast path.
>
> Performance result:
>
> Call 'close(-1)' 3000000 times and use gettimeofday() to check the
> duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
> system calls. Results are in ns.
>
> Testing environment:
>
>   CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>   Kernel : v4.5.0
>
>               MEAN         STDVAR
>   BASE      800214.950    2853.083
>   PRE1     2253846.700    9997.014
>   PRE2     2257495.540    8516.293
>   POST     2250896.100    8933.921
>
> Where 'BASE' is the pure performance without capturing, 'PRE1' is the
> result on a pure v4.5.0 kernel, 'PRE2' is the result before this
> patch, and 'POST' is the result after this patch. See [4] for the
> detailed experimental setup.
>
> Considering the stdvar, this patch doesn't introduce measurable
> performance overhead to the fast path.
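In the trigger-driven model Alexei describes, the reading side reduces
to a small loop. A rough sketch only: PERF_EVENT_IOC_PAUSE_OUTPUT
comes from the already-collected dependency patches, while
wait_for_trigger() and read_backward_records() are hypothetical
placeholders for the surrounding application logic:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>

    extern void wait_for_trigger(void);          /* hypothetical */
    extern void read_backward_records(void *rb); /* hypothetical */

    static void drain_on_trigger(int perf_fd, void *rb_base)
    {
            for (;;) {
                    wait_for_trigger(); /* e.g. a CPU utilization spike */

                    /* freeze the buffer so the flight-recorder data
                     * from the past survives */
                    ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1);

                    /* iterate records forward from 'head', as above */
                    read_backward_records(rb_base);

                    /* resume collecting until the next trigger */
                    ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);
            }
    }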
>
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
> [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
>
> Signed-off-by: Wang Nan
> Acked-by: Alexei Starovoitov
> Cc: He Kuang
> Cc: Arnaldo Carvalho de Melo
> Cc: Brendan Gregg
> Cc: Jiri Olsa
> Cc: Masami Hiramatsu
> Cc: Namhyung Kim
> Cc: Peter Zijlstra
> Cc: Zefan Li
> Cc: pi3orama@163.com
> ---
>  include/linux/perf_event.h      | 28 +++++++++++++++++++++---
>  include/uapi/linux/perf_event.h |  3 ++-
>  kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
>  kernel/events/ring_buffer.c     | 16 +++++++++++++-
>  4 files changed, 85 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 4065ca2..0cc36ad 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -834,14 +834,24 @@ extern int perf_event_overflow(struct perf_event *event,
>  				 struct perf_sample_data *data,
>  				 struct pt_regs *regs);
>  
> +extern void perf_event_output_forward(struct perf_event *event,
> +				      struct perf_sample_data *data,
> +				      struct pt_regs *regs);
> +extern void perf_event_output_backward(struct perf_event *event,
> +				       struct perf_sample_data *data,
> +				       struct pt_regs *regs);
>  extern void perf_event_output(struct perf_event *event,
> -			       struct perf_sample_data *data,
> -			       struct pt_regs *regs);
> +			      struct perf_sample_data *data,
> +			      struct pt_regs *regs);
>  
>  static inline bool
>  is_default_overflow_handler(struct perf_event *event)
>  {
> -	return (event->overflow_handler == perf_event_output);
> +	if (likely(event->overflow_handler == perf_event_output_forward))
> +		return true;
> +	if (unlikely(event->overflow_handler == perf_event_output_backward))
> +		return true;
> +	return false;
>  }
>  
>  extern void
> @@ -1042,8 +1052,20 @@ static inline bool has_aux(struct perf_event *event)
>  	return event->pmu->setup_aux;
>  }
>  
> +static inline bool is_write_backward(struct perf_event *event)
> +{
> +	return !!event->attr.write_backward;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>  			     struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_forward(struct perf_output_handle *handle,
> +				     struct perf_event *event,
> +				     unsigned int size);
> +extern int perf_output_begin_backward(struct perf_output_handle *handle,
> +				      struct perf_event *event,
> +				      unsigned int size);
> +
>  extern void perf_output_end(struct perf_output_handle *handle);
>  extern unsigned int perf_output_copy(struct perf_output_handle *handle,
>  				     const void *buf, unsigned int len);
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index a3c1903..43fc8d2 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -340,7 +340,8 @@ struct perf_event_attr {
>  				comm_exec      :  1, /* flag comm events that are due to an exec */
>  				use_clockid    :  1, /* use @clockid for time fields */
>  				context_switch :  1, /* context switch data */
> -				__reserved_1   : 37;
> +				write_backward :  1, /* Write ring buffer from end to beginning */
> +				__reserved_1   : 36;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8c3b35f..263a9d8 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5693,9 +5693,13 @@ void perf_prepare_sample(struct perf_event_header *header,
>  	}
>  }
>  
> -void perf_event_output(struct perf_event *event,
> -		       struct perf_sample_data *data,
> -		       struct pt_regs *regs)
> +static void __always_inline
> +__perf_event_output(struct perf_event *event,
> +		    struct perf_sample_data *data,
> +		    struct pt_regs *regs,
> +		    int (*output_begin)(struct perf_output_handle *,
> +					struct perf_event *,
> +					unsigned int))
>  {
>  	struct perf_output_handle handle;
>  	struct perf_event_header header;
> @@ -5705,7 +5709,7 @@ void perf_event_output(struct perf_event *event,
>  
>  	perf_prepare_sample(&header, data, event, regs);
>  
> -	if (perf_output_begin(&handle, event, header.size))
> +	if (output_begin(&handle, event, header.size))
>  		goto exit;
>  
>  	perf_output_sample(&handle, &header, data, event);
> @@ -5716,6 +5720,30 @@ exit:
>  	rcu_read_unlock();
>  }
>  
> +void
> +perf_event_output_forward(struct perf_event *event,
> +			  struct perf_sample_data *data,
> +			  struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin_forward);
> +}
> +
> +void
> +perf_event_output_backward(struct perf_event *event,
> +			   struct perf_sample_data *data,
> +			   struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin_backward);
> +}
> +
> +void
> +perf_event_output(struct perf_event *event,
> +		  struct perf_sample_data *data,
> +		  struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
>  /*
>   * read event_id
>   */
> @@ -8152,8 +8180,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  	if (overflow_handler) {
>  		event->overflow_handler = overflow_handler;
>  		event->overflow_handler_context = context;
> +	} else if (is_write_backward(event)) {
> +		event->overflow_handler = perf_event_output_backward;
> +		event->overflow_handler_context = NULL;
>  	} else {
> -		event->overflow_handler = perf_event_output;
> +		event->overflow_handler = perf_event_output_forward;
>  		event->overflow_handler_context = NULL;
>  	}
>  
> @@ -8388,6 +8419,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
>  		goto out;
>  
>  	/*
> +	 * Either writing ring buffer from beginning or from end.
> +	 * Mixing is not allowed.
> +	 */
> +	if (is_write_backward(output_event) != is_write_backward(event))
> +		goto out;
> +
> +	/*
>  	 * If both events generate aux data, they must be on the same PMU
>  	 */
>  	if (has_aux(event) && has_aux(output_event) &&
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 60be55a..c49bab4 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -230,10 +230,24 @@ out:
>  	return -ENOSPC;
>  }
>  
> +int perf_output_begin_forward(struct perf_output_handle *handle,
> +			      struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_backward(struct perf_output_handle *handle,
> +			       struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, true);
> +}
> +
>  int perf_output_begin(struct perf_output_handle *handle,
>  		      struct perf_event *event, unsigned int size)
>  {
> -	return __perf_output_begin(handle, event, size, false);
> +
> +	return __perf_output_begin(handle, event, size,
> +				   unlikely(is_write_backward(event)));
>  }
>  
>  unsigned int perf_output_copy(struct perf_output_handle *handle,