Subject: Re: [PATCH tip] perf core: Add backward attribute to perf event
From: "Wangnan (F)"
Date: Wed, 13 Apr 2016 10:32:58 +0800
Message-ID: <570DAFDA.2010503@huawei.com>
In-Reply-To: <1459865478-53413-1-git-send-email-wangnan0@huawei.com>
References: <1459865478-53413-1-git-send-email-wangnan0@huawei.com>
Cc: He Kuang, Arnaldo Carvalho de Melo, Brendan Gregg, Jiri Olsa,
    Masami Hiramatsu, Namhyung Kim, Zefan Li

Hi Peter,

When would you consider collecting this patch? You have collected all
of its dependencies; this is the last patch needed to enable the
backward attribute. I have already posted a 'perf test' testcase for
it, so there will be some usable user-space code. I will start posting
my perf-side code again after perf/core supports the backward
attribute. There are about 34 patches now.

Thank you.

On 2016/4/5 22:11, Wang Nan wrote:
> This patch introduces a 'write_backward' bit to perf_event_attr, which
> controls the direction of a ring buffer. When this bit is set, the
> corresponding ring buffer is written from end to beginning. This
> feature is designed to support reading from an overwritable ring
> buffer.
>
> A ring buffer can be created by mapping a perf event fd. The kernel
> puts event records into the ring buffer, and user programs like perf
> fetch them from the address returned by mmap(). To prevent racing
> between kernel and perf, they communicate with each other through the
> 'head' and 'tail' pointers. The kernel maintains the 'head' pointer,
> pointing it to the next free area (the tail of the last record). Perf
> maintains the 'tail' pointer, pointing it to the tail of the last
> consumed record (one that has already been fetched). The kernel
> determines the available space in the ring buffer using these two
> pointers, to avoid overwriting unfetched records.
>
> By mapping without 'PROT_WRITE', an overwritable ring buffer is
> created. Different from a normal ring buffer, perf is unable to
> maintain the 'tail' pointer because writing is forbidden. Therefore,
> for this type of ring buffer, the kernel overwrites old records
> unconditionally, working like a flight recorder. This feature would
> be useful if reading from an overwritable ring buffer were as easy as
> reading from a normal ring buffer. However, there's an obscure
> problem.
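To make the intended usage concrete, here is a minimal user-space
sketch. It assumes the patched uapi header (for the new
'write_backward' bit); the dummy software event and the sample flags
are arbitrary examples, and error handling is kept minimal:

    /*
     * Sketch: open an event with write_backward set, then map its ring
     * buffer read-only so the kernel treats it as overwritable.
     */
    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_backward_ring_buffer(void **base, size_t pages)
    {
            struct perf_event_attr attr;
            size_t len = (pages + 1) * sysconf(_SC_PAGESIZE);
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_SOFTWARE;
            attr.config = PERF_COUNT_SW_DUMMY;  /* example event */
            attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
            attr.write_backward = 1;            /* the new bit */

            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            if (fd < 0)
                    return -1;

            /* PROT_READ only: no 'tail', kernel overwrites freely */
            *base = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
            if (*base == MAP_FAILED) {
                    close(fd);
                    return -1;
            }
            return fd;
    }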
> The following figure demonstrates a full overwritable ring buffer. In
> this figure, the 'head' pointer points to the end of the last record,
> and a long record 'E' is pending. For a normal ring buffer, the
> 'tail' pointer would have pointed to position (X), so the kernel
> would know there's no more space in the ring buffer. However, for an
> overwritable ring buffer, the kernel ignores the 'tail' pointer.
>
>     (X)                              head
>      .                                |
>      .                                V
>     +------+-------+----------+------+---+
>     |A....A|B.....B|C........C|D....D|   |
>     +------+-------+----------+------+---+
>
> Record 'A' is overwritten by event 'E':
>
>      head
>        |
>        V
>     +--+---+-------+----------+------+---+
>     |.E|..A|B.....B|C........C|D....D|E..|
>     +--+---+-------+----------+------+---+
>
> Now perf decides to read from this ring buffer. However, neither of
> the two natural positions, 'head' and the start of this ring buffer,
> points to the head of a record. Even though the full ring buffer can
> be accessed by perf, perf is unable to find a position from which to
> start decoding.
>
> The first attempt to solve this problem that I know of can be found
> in [1]. It makes the kernel maintain the 'tail' pointer, updating it
> when the ring buffer is half full. However, this approach introduces
> overhead to the fast path: test results show a 1% overhead [2]. In
> addition, this method utilizes no more than 50% of the records.
>
> Another attempt can be found in [3], which puts the size of an event
> at the end of each record. This approach allows perf to find records
> in a backward manner from the 'head' pointer, by reading the size of
> a record from its tail. However, because of alignment requirements,
> it needs 8 bytes to record the size of a record, which is a huge
> waste. Its performance is also not good, because more data needs to
> be written. This approach also introduces some extra branch
> instructions to the fast path.
>
> 'write_backward' is a better solution to this problem.
>
> The following figure demonstrates the state of the overwritable ring
> buffer when 'write_backward' is set, before overwriting:
>
>       head
>         |
>         V
>     +---+------+----------+-------+------+
>     |   |D....D|C........C|B.....B|A....A|
>     +---+------+----------+-------+------+
>
> and after overwriting:
>
>                                      head
>                                        |
>                                        V
>     +---+------+----------+-------+---+--+
>     |..E|D....D|C........C|B.....B|A..|E.|
>     +---+------+----------+-------+---+--+
>
> In each situation, 'head' points to the beginning of the newest
> record. From this record, perf can iterate over the full ring buffer
> and fetch records one by one.
>
> The only limitation that needs to be considered is back-to-back
> reading. Due to the non-determinism of user programs, it is
> impossible to ensure the ring buffer stays stable during reading.
> Consider an extreme situation: perf is scheduled out after reading
> record 'D', then a burst of events comes and eats up the whole ring
> buffer (one or multiple rounds). When perf comes back, reading after
> 'D' is no longer correct.
>
> To prevent this problem, we need a way to ensure the ring buffer is
> stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
> suggested because its overhead is lower than that of
> ioctl(PERF_EVENT_IOC_ENABLE).
>
> By carefully verifying the 'head' pointer, a reader can avoid pausing
> the ring buffer. For example:
>
>     /* A union of all possible events */
>     union perf_event event;
>
>     p = head = perf_mmap__read_head();
>     while (true) {
>         /* copy header of next event */
>         fetch(&event.header, p, sizeof(event.header));
>
>         /* read 'head' pointer */
>         head = perf_mmap__read_head();
>
>         /* check overwritten: is the header good? */
>         if (!verify(sizeof(event.header), p, head))
>             break;
>
>         /* copy the whole event */
>         fetch(&event, p, event.header.size);
>
>         /* read 'head' pointer again */
>         head = perf_mmap__read_head();
>
>         /* is the whole event good? */
>         if (!verify(event.header.size, p, head))
>             break;
>
>         p += event.header.size;
>     }
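The fetch() and verify() helpers above are placeholders. One plausible
shape for them is sketched below; the wrap-around copy and the
validity window (at any instant, at most one buffer's worth of bytes
starting at the current, decreasing 'head' is intact) are my
assumptions for illustration, not code from this patch:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    static char *rb_base;     /* start of the mmap'ed data area */
    static uint64_t rb_size;  /* its length, a power of two */

    /* copy len bytes at free-running position p, handling wrap-around */
    static void fetch(void *dst, uint64_t p, size_t len)
    {
            uint64_t off = p & (rb_size - 1);

            if (off + len <= rb_size) {
                    memcpy(dst, rb_base + off, len);
            } else {
                    /* the record wraps at the end of the data area */
                    memcpy(dst, rb_base + off, rb_size - off);
                    memcpy((char *)dst + (rb_size - off),
                           rb_base, len - (rb_size - off));
            }
    }

    /*
     * Bytes [p, p + len) are still intact only if they lie within one
     * buffer length of the current 'head' (which moves downward).
     */
    static bool verify(size_t len, uint64_t p, uint64_t head)
    {
            return p + len <= head + rb_size;
    }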
> However, the overhead is high because:
>
>  a) In-place decoding is not safe; copying-verifying-decoding is
>     required.
>
>  b) Fetching the 'head' pointer requires additional synchronization.
>
> (From Alexei Starovoitov:
>
> Even if this trick works, pausing is needed for more than stability
> of reading. When we collect events into an overwrite buffer we're
> waiting for some other trigger (like an all-CPU utilization spike, or
> just one CPU running while all others are idle), and when it happens
> the buffer has valuable info from the past. At this point new events
> are no longer interesting and the buffer should be paused, the events
> read, and the buffer unpaused until the next trigger comes.)
>
> This patch utilizes the event's default overflow_handler introduced
> previously: perf_event_output_backward() is created as the default
> overflow handler for backward ring buffers. To avoid adding extra
> overhead to the fast path, the original perf_event_output() becomes
> __perf_event_output(), marked '__always_inline'. In theory, no extra
> overhead is introduced to the fast path.
>
> Performance result:
>
> Call 'close(-1)' 3000000 times and use gettimeofday() to check the
> duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
> system calls. Results are in ns.
>
> Testing environment:
>
>   CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
>   Kernel : v4.5.0
>
>               MEAN         STDVAR
>   BASE      800214.950    2853.083
>   PRE1     2253846.700    9997.014
>   PRE2     2257495.540    8516.293
>   POST     2250896.100    8933.921
>
> Where 'BASE' is the pure performance without capturing, 'PRE1' is the
> result on a pure v4.5.0 kernel, 'PRE2' is the result before this
> patch, and 'POST' is the result after this patch. See [4] for the
> detailed experimental setup.
>
> Considering the stdvar, this patch doesn't introduce measurable
> performance overhead to the fast path.
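In the trigger-driven model Alexei describes, the reading side reduces
to a small loop. A rough sketch only: PERF_EVENT_IOC_PAUSE_OUTPUT
comes from the already-collected dependency patches, while
wait_for_trigger() and read_backward_records() are hypothetical
placeholders for the surrounding application logic:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>

    extern void wait_for_trigger(void);          /* hypothetical */
    extern void read_backward_records(void *rb); /* hypothetical */

    static void drain_on_trigger(int perf_fd, void *rb_base)
    {
            for (;;) {
                    wait_for_trigger(); /* e.g. a CPU utilization spike */

                    /* freeze the buffer so the flight-recorder data
                     * from the past survives */
                    ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1);

                    /* iterate records forward from 'head', as above */
                    read_backward_records(rb_base);

                    /* resume collecting until the next trigger */
                    ioctl(perf_fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);
            }
    }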
>
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
> [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com
>
> Signed-off-by: Wang Nan
> Acked-by: Alexei Starovoitov
> Cc: He Kuang
> Cc: Arnaldo Carvalho de Melo
> Cc: Brendan Gregg
> Cc: Jiri Olsa
> Cc: Masami Hiramatsu
> Cc: Namhyung Kim
> Cc: Peter Zijlstra
> Cc: Zefan Li
> Cc: pi3orama@163.com
> ---
>  include/linux/perf_event.h      | 28 +++++++++++++++++++++---
>  include/uapi/linux/perf_event.h |  3 ++-
>  kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
>  kernel/events/ring_buffer.c     | 16 +++++++++++++-
>  4 files changed, 85 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 4065ca2..0cc36ad 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -834,14 +834,24 @@ extern int perf_event_overflow(struct perf_event *event,
>  				 struct perf_sample_data *data,
>  				 struct pt_regs *regs);
>  
> +extern void perf_event_output_forward(struct perf_event *event,
> +				      struct perf_sample_data *data,
> +				      struct pt_regs *regs);
> +extern void perf_event_output_backward(struct perf_event *event,
> +				       struct perf_sample_data *data,
> +				       struct pt_regs *regs);
>  extern void perf_event_output(struct perf_event *event,
> -			       struct perf_sample_data *data,
> -			       struct pt_regs *regs);
> +			      struct perf_sample_data *data,
> +			      struct pt_regs *regs);
>  
>  static inline bool
>  is_default_overflow_handler(struct perf_event *event)
>  {
> -	return (event->overflow_handler == perf_event_output);
> +	if (likely(event->overflow_handler == perf_event_output_forward))
> +		return true;
> +	if (unlikely(event->overflow_handler == perf_event_output_backward))
> +		return true;
> +	return false;
>  }
>  
>  extern void
> @@ -1042,8 +1052,20 @@ static inline bool has_aux(struct perf_event *event)
>  	return event->pmu->setup_aux;
>  }
>  
> +static inline bool is_write_backward(struct perf_event *event)
> +{
> +	return !!event->attr.write_backward;
> +}
> +
>  extern int perf_output_begin(struct perf_output_handle *handle,
>  			     struct perf_event *event, unsigned int size);
> +extern int perf_output_begin_forward(struct perf_output_handle *handle,
> +				     struct perf_event *event,
> +				     unsigned int size);
> +extern int perf_output_begin_backward(struct perf_output_handle *handle,
> +				      struct perf_event *event,
> +				      unsigned int size);
> +
>  extern void perf_output_end(struct perf_output_handle *handle);
>  extern unsigned int perf_output_copy(struct perf_output_handle *handle,
>  				     const void *buf, unsigned int len);
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index a3c1903..43fc8d2 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -340,7 +340,8 @@ struct perf_event_attr {
>  				comm_exec      :  1, /* flag comm events that are due to an exec */
>  				use_clockid    :  1, /* use @clockid for time fields */
>  				context_switch :  1, /* context switch data */
> -				__reserved_1   : 37;
> +				write_backward :  1, /* Write ring buffer from end to beginning */
> +				__reserved_1   : 36;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8c3b35f..263a9d8 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5693,9 +5693,13 @@ void perf_prepare_sample(struct perf_event_header *header,
>  	}
>  }
>  
> -void perf_event_output(struct perf_event *event,
> -		       struct perf_sample_data *data,
> -		       struct pt_regs *regs)
> +static void __always_inline
> +__perf_event_output(struct perf_event *event,
> +		    struct perf_sample_data *data,
> +		    struct pt_regs *regs,
> +		    int (*output_begin)(struct perf_output_handle *,
> +					struct perf_event *,
> +					unsigned int))
>  {
>  	struct perf_output_handle handle;
>  	struct perf_event_header header;
> @@ -5705,7 +5709,7 @@ void perf_event_output(struct perf_event *event,
>  
>  	perf_prepare_sample(&header, data, event, regs);
>  
> -	if (perf_output_begin(&handle, event, header.size))
> +	if (output_begin(&handle, event, header.size))
>  		goto exit;
>  
>  	perf_output_sample(&handle, &header, data, event);
> @@ -5716,6 +5720,30 @@ exit:
>  	rcu_read_unlock();
>  }
>  
> +void
> +perf_event_output_forward(struct perf_event *event,
> +			  struct perf_sample_data *data,
> +			  struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin_forward);
> +}
> +
> +void
> +perf_event_output_backward(struct perf_event *event,
> +			   struct perf_sample_data *data,
> +			   struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin_backward);
> +}
> +
> +void
> +perf_event_output(struct perf_event *event,
> +		  struct perf_sample_data *data,
> +		  struct pt_regs *regs)
> +{
> +	__perf_event_output(event, data, regs, perf_output_begin);
> +}
> +
>  /*
>   * read event_id
>   */
> @@ -8152,8 +8180,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>  	if (overflow_handler) {
>  		event->overflow_handler = overflow_handler;
>  		event->overflow_handler_context = context;
> +	} else if (is_write_backward(event)) {
> +		event->overflow_handler = perf_event_output_backward;
> +		event->overflow_handler_context = NULL;
>  	} else {
> -		event->overflow_handler = perf_event_output;
> +		event->overflow_handler = perf_event_output_forward;
>  		event->overflow_handler_context = NULL;
>  	}
>  
> @@ -8388,6 +8419,13 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
>  		goto out;
>  
>  	/*
> +	 * Either writing ring buffer from beginning or from end.
> +	 * Mixing is not allowed.
> +	 */
> +	if (is_write_backward(output_event) != is_write_backward(event))
> +		goto out;
> +
> +	/*
>  	 * If both events generate aux data, they must be on the same PMU
>  	 */
>  	if (has_aux(event) && has_aux(output_event) &&
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 60be55a..c49bab4 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -230,10 +230,24 @@ out:
>  	return -ENOSPC;
>  }
>  
> +int perf_output_begin_forward(struct perf_output_handle *handle,
> +			      struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, false);
> +}
> +
> +int perf_output_begin_backward(struct perf_output_handle *handle,
> +			       struct perf_event *event, unsigned int size)
> +{
> +	return __perf_output_begin(handle, event, size, true);
> +}
> +
>  int perf_output_begin(struct perf_output_handle *handle,
>  		      struct perf_event *event, unsigned int size)
>  {
> -	return __perf_output_begin(handle, event, size, false);
> +
> +	return __perf_output_begin(handle, event, size,
> +				   unlikely(is_write_backward(event)));
>  }
>  
>  unsigned int perf_output_copy(struct perf_output_handle *handle,