Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <f26c9130-71f1-9849-36bf-4da7d9394fe1@linux.ibm.com>
Date:   Thu, 3 Aug 2023 12:36:07 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.12.0
Subject: Re: [PATCH v2] perf doc: Document ring buffer mechanism
To:     Leo Yan <leo.yan@linaro.org>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Ian Rogers <irogers@google.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@redhat.com>,
        Mark Rutland <mark.rutland@arm.com>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>,
        Jiri Olsa <jolsa@kernel.org>,
        Namhyung Kim <namhyung@kernel.org>,
        Adrian Hunter <adrian.hunter@intel.com>,
        linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org
References: <20230803035037.1750340-1-leo.yan@linaro.org>
Content-Language: en-US
From:   Thomas Richter <tmricht@linux.ibm.com>
Organization: IBM
In-Reply-To: <20230803035037.1750340-1-leo.yan@linaro.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk

On 8/3/23 05:50, Leo Yan wrote:
> In the Linux perf tool, the ring buffer serves not only as a medium for
> transferring PMU event data but also as a vital mechanism for hardware
> tracing using technologies like Intel PT and Arm CoreSight, etc.
> 
> Consequently, the ring buffer mechanism plays a crucial role by ensuring
> high throughput for data transfer between the kernel and user space
> while avoiding excessive overhead caused by the ring buffer itself.
> 
> This commit documents the ring buffer mechanism in detail.  It provides
> an in-depth explanation of the implementation of both the generic ring
> buffer and the AUX ring buffer.  Additionally, it covers how these ring
> buffers support various tracing modes and explains the synchronization
> with memory barriers.
> 
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
> 
> Changes from v1:
> - Addressed Ian's comments and suggestions (Ian Rogers).
> 
>  tools/perf/Documentation/perf-ring-buffer.txt | 762 ++++++++++++++++++
>  1 file changed, 762 insertions(+)
>  create mode 100644 tools/perf/Documentation/perf-ring-buffer.txt
> 
> diff --git a/tools/perf/Documentation/perf-ring-buffer.txt b/tools/perf/Documentation/perf-ring-buffer.txt
> new file mode 100644
> index 000000000000..2380d54a1068
> --- /dev/null
> +++ b/tools/perf/Documentation/perf-ring-buffer.txt
> @@ -0,0 +1,762 @@
> +perf-ring-buffer(1)
> +===================
> +
> +NAME
> +----
> +perf-ring-buffer - Introduction to the perf ring buffer mechanism
> +
> +Introduction
> +------------
> +The ring buffer is a fundamental mechanism for data transfer.  perf uses
> +ring buffers to transfer event data from kernel to user space, another
> +kind of ring buffer which is so called auxiliary (AUX) ring buffer also
> +plays an important role for hardware tracing with Intel PT, Arm
> +CoreSight, etc.
> +
> +The ring buffer implementation is critical but it's also a very
> +challenging work.  On the one hand, the kernel and perf tool in the user
> +space use the ring buffer to exchange data and stores data into data
> +file, thus the ring buffer needs to transfer data with high throughput;
> +on the other hand, the ring buffer management should avoid significant
> +overload to distract profiling results.
> +
> +This documentation dives into the details for perf ring buffer with two
> +parts: firstly it explains the perf ring buffer implementation, then in
> +the second part discusses the AUX ring buffer mechanism.
> +
> +Ring buffer implementation
> +--------------------------
> +
> +Basic algorithm
> +~~~~~~~~~~~~~~~
> +
> +That said, a typical ring buffer is managed by a head pointer and a tail
> +pointer; the head pointer is manipulated by a writer and the tail
> +pointer is updated by a reader respectively.
> +
> +            +---------------------------+
> +            |   |   |***|***|***|   |   |
> +            +---------------------------+
> +                  `-> Tail    `-> Head
> +
> +            * : the data is filled by the writer.
> +	    Figure 1: Ring buffer
> +
> +Perf uses the same way to manage its ring buffer.  In the implementation
> +there are two key data structures held together in a set of consecutive
> +pages, the control structure and then the ring buffer itself.  The page
> +with the control structure in is known as the "user page".  Being held
> +in continuous virtual addresses simplifies locating the ring buffer
> +address, it is in the pages after the page with the user page.
> +
> +The control structure is named as `perf_event_mmap_page`, it contains a
> +head pointer `data_head` and a tail pointer `data_tail`.  When the
> +kernel starts to fill records into the ring buffer, it updates the head
> +pointer to reserve the memory so later it can safely store events into
> +the buffer; on the other side, the perf tool updates the tail pointer
> +after consuming data from the ring buffer.
> +
> +        user page                          ring buffer
> +  +---------+---------+   +---------------------------------------+
> +  |data_head|data_tail|...|   |   |***|***|***|***|***|   |   |   |
> +  +---------+---------+   +---------------------------------------+
> +      `          `--------------^                   ^
> +       `--------------------------------------------|
> +
> +            * : the data is filled by the writer.
> +	    Figure 2: Perf ring buffer
> +
> +When using the 'perf record' tool, we can specify the ring buffer size
> +with option '-m' or '--mmap-pages=', the given size will be rounded up
> +to a power of two that is a multiple of a page size.  Though the kernel
> +allocates at once for all memory pages, it's deferred to map the pages
> +to VMA area until the perf tool accesses the buffer from the user space.
> +In other words, at the first time accesses the buffer's page from user
> +space in the perf tool, a data abort exception for page fault is taken
> +and the kernel uses this occasion to map the page into process VMA, thus
> +the perf tool can continue to access the page after returning from the
> +exception.
> +
> +The function perf_mmap_fault() is for handling the page fault, which
> +invokes perf_mmap_to_page() to figure out which page should be mapped.
> +The structure 'vm_fault' has a field 'pgoff' to indicate which page
> +should be mapped, if 'pgoff' is zero it maps the ring buffer's user
> +page, otherwise, the ring buffer's page is mapped with index 'pgoff-1'
> +(since the first page in VMA is for user page, so we need to decrease 1
> +to get the ring buffer's page index).
> +
> +Ring buffer for different tracing modes
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Perf profiles programs with different modes: default mode, per thread
> +mode, per cpu mode, and system wide mode.  This section describes what's
                                          remove this word (what's)  ^^^^^^
> +these modes and how the ring buffer meets requirements for them.  At
> +last we will review the race conditions caused by these modes.
> +
> +Default mode
> +
> +Usually we execute `perf record` command followed by a profiling program
> +name, like below command:
> +
> +	perf record test_program
> +
> +This command doesn't specify any options related with ring buffer mode,
> +it's called default mode.
> +
> +As shown below, the perf tool allocates individual ring buffers for each
> +CPU, but it only enables events for the profiled program rather than for
> +all threads in the system.  The T1 thread represents the thread context
> +of the `test_program`, whereas T2 and T3 are irrelevant threads in the
> +system.   The perf samples are exclusively collected for the T1 thread
> +and stored in the ring buffer associated with the CPU on which the T1
> +thread is running.
> +
> +
> +            T1                      T2                 T1
> +          +----+              +-----------+          +----+
> +  CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
> +          +----+--------------+-----------+----------+----+-------->
> +            |                                          |
> +            v                                          v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 0                      |
> +          +-----------------------------------------------------+
> +
> +                 T1
> +               +-----+
> +  CPU1         |xxxxx|
> +          -----+-----+--------------------------------------------->
> +                  |
> +                  v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 1                      |
> +          +-----------------------------------------------------+
> +
> +                                      T1              T3
> +                                    +----+        +-------+
> +  CPU2                              |xxxx|        |xxxxxxx|
> +          --------------------------+----+--------+-------+-------->
> +                                      |
> +                                      v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 2                      |
> +          +-----------------------------------------------------+
> +
> +                            T1
> +                     +--------------+
> +  CPU3               |xxxxxxxxxxxxxx|
> +          -----------+--------------+------------------------------>
> +                            |
> +                            v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 3                      |
> +          +-----------------------------------------------------+
> +
> +	    T1: Thread 1; T2: Thread 2; T3: Thread 3
> +	    x: Thread is in running state
> +	    Figure 3: Ring buffer for default mode
> +
> +Per-thread mode
> +
> +By specifying option '--per-thread' in perf command, the ring buffer is
> +allocated for every profiled thread.  An example command is:
> +
> +	perf record --per-thread test_program
> +
> +In this mode, a profiled thread is scheduled on a CPU, the events on
> +that CPU will be enabled; and if the thread is scheduled out from the
> +CPU, the events on the CPU will be disabled.  When the thread is
> +migrated from one CPU to another, the events will be disabled on the
> +previous CPU and enabled on the next CPU correspondingly.
> +
> +            T1                      T2                 T1
> +          +----+              +-----------+          +----+
> +  CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
> +          +----+--------------+-----------+----------+----+-------->
> +            |                                           |
> +            |    T1                                     |
> +            |  +-----+                                  |
> +  CPU1      |  |xxxxx|                                  |
> +          --|--+-----+----------------------------------|---------->
> +            |     |                                     |
> +            |     |                   T1            T3  |
> +            |     |                 +----+        +---+ |
> +  CPU2      |     |                 |xxxx|        |xxx| |
> +          --|-----|-----------------+----+--------+---+-|---------->
> +            |     |                   |                 |
> +            |     |         T1        |                 |
> +            |     |  +--------------+ |                 |
> +  CPU3      |     |  |xxxxxxxxxxxxxx| |                 |
> +          --|-----|--+--------------+-|-----------------|---------->
> +            |     |         |         |                 |
> +            v     v         v         v                 v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer                        |
> +          +-----------------------------------------------------+
> +
> +	    T1: Thread 1
> +	    x: Thread is in running state
> +	    Figure 4: Ring buffer for per-thread mode
> +
> +When perf runs in per-thread mode, a ring buffer is allocated for the
> +profiled thread T1.  The ring buffer is dedicated for thread T1, if the
> +thread T1 is running, the perf events will be recorded into the ring
> +buffer; when the thread is sleeping, all associated events will be
> +disabled, thus no trace data will be recorded into the ring buffer.
> +
> +Per-CPU mode
> +
> +The option '-C' is used to collect samples on the list of CPUs, the ring
> +buffers are allocated for the specified CPUs.  For the example in below
> +command, the perf command receives option '-C 0,2', as the result, two
> +ring buffers serve CPU0 and CPU2 separately:
> +
> +	perf record -C 0,2 test_program
> +
> +In this example, even there have tasks running on CPU1 and CPU3, since
> +the ring buffer is absent for them, any activities on these two CPUs
> +will be ignored.  A usage case is to combine the options for per-thread
> +mode and per-CPU mode, e.g. the options '–C 0,2' and '––per–thread' are
> +specified together, the samples are recorded only when the profiled
> +thread is scheduled on any of the listed CPUs.
> +
> +            T1                      T2                 T1
> +          +----+              +-----------+          +----+
> +  CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
> +          +----+--------------+-----------+----------+----+-------->
> +            |                       |                  |
> +            v                       v                  v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 0                      |
> +          +-----------------------------------------------------+
> +
> +                 T1
> +               +-----+
> +  CPU1         |xxxxx|
> +          -----+-----+--------------------------------------------->
> +
> +                                      T1              T3
> +                                    +----+        +-------+
> +  CPU2                              |xxxx|        |xxxxxxx|
> +          --------------------------+----+--------+-------+-------->
> +                                      |               |
> +                                      v               v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 1                      |
> +          +-----------------------------------------------------+
> +
> +                            T1
> +                     +--------------+
> +  CPU3               |xxxxxxxxxxxxxx|
> +          -----------+--------------+------------------------------>
> +
> +	    T1: Thread 1; T2: Thread 2; T3: Thread 3
> +	    x: Thread is in running state
> +	    Figure 5: Ring buffer for per-CPU mode
> +
> +System wide mode
> +
> +By using option '–a' or '––all–cpus', perf collects samples on all CPUs
> +for all tasks, we call it as the system wide mode, the command is:
> +
> +	perf record -a test_program
> +
> +In the system wide mode, every CPU has its own ring buffer, all threads
> +are monitored during the running state and the samples are recorded into
> +the ring buffer belonging to the CPU which the events occurred on.
> +
> +            T1                      T2                 T1
> +          +----+              +-----------+          +----+
> +  CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
> +          +----+--------------+-----------+----------+----+-------->
> +            |                       |                  |
> +            v                       v                  v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 0                      |
> +          +-----------------------------------------------------+
> +
> +                 T1
> +               +-----+
> +  CPU1         |xxxxx|
> +          -----+-----+--------------------------------------------->
> +                  |
> +                  v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 1                      |
> +          +-----------------------------------------------------+
> +
> +                                      T1              T3
> +                                    +----+        +-------+
> +  CPU2                              |xxxx|        |xxxxxxx|
> +          --------------------------+----+--------+-------+-------->
> +                                      |               |
> +                                      v               v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 2                      |
> +          +-----------------------------------------------------+
> +
> +                            T1
> +                     +--------------+
> +  CPU3               |xxxxxxxxxxxxxx|
> +          -----------+--------------+------------------------------>
> +                            |
> +                            v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 3                      |
> +          +-----------------------------------------------------+
> +
> +	    T1: Thread 1; T2: Thread 2; T3: Thread 3
> +	    x: Thread is in running state
> +	    Figure 6: Ring buffer for system wide mode
> +
> +
> +Accessing buffer
> +~~~~~~~~~~~~~~~~
> +
> +Based on the understanding of how the ring buffer is allocated in
> +various modes, this section will explain the accessing the ring buffer.
      better .... this section explains access  the ring buffer.
> +
> +Producer-consumer model
> +
> +In the Linux kernel, the PMU events can produce samples which are stored
> +into the ring buffer; the perf command in user space consumes the
> +samples by reading out data from the ring buffer and finally saves the
> +data into the file for post analysis.  It’s a typical producer-consumer
> +model for using the ring buffer.
> +
> +The perf process polls on the PMU events and sleeps when no events are
> +incoming.  To prevent frequent exchanges between the kernel and user
> +space, the kernel event core layer introduces a watermark, which is
> +stored in the perf_buffer::watermark.  When a sample is recorded into
> +the ring buffer, and if the used buffer exceeds the watermark, the
> +kernel wakes up the perf process to read samples from the ring buffer.
> +
> +                     Perf
> +                     / | Read samples
> +           Polling  /  `--------------|               Ring buffer
> +                   v                  v    ;-------------------v
> +  +----------------+     +---------+---------+   +-------------------+
> +  |Event wait queue|     |data_head|data_tail|   |***|***|   |   |***|
> +  +----------------+     +---------+---------+   +-------------------+
> +           ^                  ^ `----------------------^
> +           | Wake up tasks    | Store samples
> +        +-----------------------------+
> +	|  Kernel event core layer    |
> +        +-----------------------------+
> +
> +            * : the data is filled by the writer.
> +	    Figure 7: Writing and reading the ring buffer
> +
> +When the kernel event core layer notifies the user space, because
> +multiple events might share the same ring buffer for recording samples,
> +the core layer iterates every event associated with the ring buffer and
> +wakes up tasks waiting on the event.  This is fulfilled by the kernel
> +function ring_buffer_wakeup().
> +
> +After the perf process is woken up, it starts to check the ring buffers
> +one by one, if it finds any ring buffer contains samples it will read
                                  s/contains/containing/
> +out the samples for statistics or saving into the data file.  Given the
> +perf process is able to run on any CPU, this leads to the ring buffer
> +potentially being accessed from multiple CPUs simultaneously, which
> +causes race conditions.  The race condition handling is described in the
> +section "Memory synchronization".
> +
> +Writing samples into buffer
> +
> +When a hardware event counter overflows, a sample will be taken and
> +saved into the ring buffer; the function __perf_event_output() is used
> +to fill samples into the ring buffer, it calls the below sub functions:
> +
> +- The sub function perf_prepare_sample() prepares sample fields based on
> +  the sample type;
> +- output_begin() is a function pointer, it’s passed dynamically via the
> +  argument for different writing directions, its purpose is to prepare
> +  the info for writing ring buffer, when the function returns back the
> +  ring buffer info is stored in structure perf_output_handle;
> +- perf_output_sample() outputs the sample into the ring buffer;
> +- perf_output_end() updates the head pointer for user page so perf tool
> +  can see the latest value.
> +
> +Let’s examine output_begin() in detail.  As the ring buffer allows
> +writing in two directions: backward or forward, the function pointer for
> +output_begin() is assigned according to the writing type of the buffer,
> +it can be perf_output_begin_forward() or perf_output_begin_backward().
> +
> +In the case of the backward ring buffer, where the user page is mapped
> +without ’PROT_WRITE’, the tool in user space is unable to update the
> +tail pointer.  As a result, only the head pointer is accessed in this
> +scenario, and the tail pointer is not used in perf tool.  The head
> +pointer indicates the beginning of a sample, perf tool can read out the
> +samples one by one based on sample’s event size.
> +
> +Alternatively, the forward ring buffer uses both head pointer and tail
> +pointer for the buffer management.  This method is more commonly used in
> +perf tool, to simplify the description, the following explanation
> +focuses on the forward ring buffer.
> +
> +    struct perf_output_handle       /---->  struct perf_buffer
> +  +---------------------------+     |     +--------------------+
> +  |           *rb;            |-----|     |   local_t head;    |
> +  +---------------------------+           +--------------------+
> +  |        int page;          |           |    *user_page;     |
> +  +---------------------------+           +--------------------+
> +  |       void *addr;         |                      |
> +  +---------------------------+                      v
> +  |   unsigned long size;     |         struct perf_event_mmap_page
> +  +---------------------------+           +--------------------+
> +                                          |   __u64 data_head; |
> +                                          +--------------------+
> +                                          |   __u64 data_tail; |
> +                                          +--------------------+
> +
> +	    Figure 8: Data structures for writing ring buffer
> +
> +In Linux kernel, the event core layer uses the structure perf_buffer to
> +track the buffer’s latest header, and it keeps the information for
> +buffer pages.  The structure perf_buffer manages ring buffer during its
> +life cycle, it is allocated once the ring buffer is created and released
> +when the ring buffer is destroyed.
> +
> +It’s possible for multiple events to write buffer concurrently.  For
> +instance, a software event and a hardware PMU event both are enabled for
> +profiling, when the software event is in the middle of sampling, the
> +hardware event maybe be overflow and its interrupt is triggered in this
> +case.  This leads to the race condition for updating perf_buffer::head.
> +In __perf_output_begin(), Linux kernel uses compare-and-swap atomicity
> +local_cmpxchg() to implement the lockless algorithm for protecting
> +perf_buffer::head.
> +
> +The structure perf_output_handle serves as a temporary context for
> +tracking the information related to the buffer.  For instance, the
> +perf_output_handle::rb field points to the global perf_buffer structure.
> +Additionally, the perf_output_handle::addr field, based on the lockless
> +algorithm, specifies the destination address where the sample data is to
> +be stored.
> +
> +The advantages of the perf_output_handle structure is that it enables
> +concurrent writing to the buffer by different events.  For the previous
> +example, two instances of perf_output_handle serve as separate contexts
> +for software events and hardware events.  This allows each event to
> +reserve its own memory space within the out_begin() function, and
> +perf_output_handle::addr is used for populating the specific event.
> +
> +Once the sample data has been successfully stored in the buffer, the
> +header of the ring buffer is synced from perf_buffer::head to
> +perf_event_mmap_page::data_head, which is fulfilled in the function
> +perf_output_end().  This synchronization indicates to the perf tool that
> +it is now safe to read the newly added samples from the user space.
> +
> +Reading samples from buffer
> +
> +In the user space, the perf tool utilizes the perf_event_mmap_page
> +structure to handle the head and tail of the buffer.  It also uses
> +perf_mmap structure to keep track of a context for the ring buffer, this
> +context includes information about the buffer's starting and ending
> +addresses.  Additionally, the mask value can be utilized to compute the
> +circular buffer pointer even for an overflow.
> +
> +Similar to the kernel, the perf tool in the user space firstly reads
                                                      s/firstly/first/
> +out the recorded data from the ring buffer, and then updates the
> +buffer's tail pointer perf_event_mmap_page::data_tail.
> +
> +Memory synchronization
> +
> +The modern CPUs with relaxed memory model cannot promise the memory
> +ordering, this means it’s possible to access the ring buffer and the
> +perf_event_mmap_page structure out of order.  To assure the specific
> +sequence for memory accessing perf ring buffer, memory barriers are
> +used to assure the data dependency.  The rationale for the memory
> +synchronization is as below:
> +
> +  Kernel                          User space
> +
> +  if (LOAD ->data_tail) {         LOAD ->data_head
> +                   (A)            smp_rmb()        (C)
> +    STORE $data                   LOAD $data
> +    smp_wmb()      (B)            smp_mb()         (D)
> +    STORE ->data_head             STORE ->data_tail
> +  }
> +
> +The comments in tools/include/linux/ring_buffer.h gives nice description
> +for why and how to use memory barriers, here we will just provide an
> +alternative explanation:
> +
> +(A) is a control dependency so that CPU assures order between checking
> +pointer perf_event_mmap_page::data_tail and filling sample into ring
> +buffer;
> +
> +(D) pairs with (A).  (D) separates the ring buffer data reading from
> +writing the pointer data_tail, perf tool firstly consumes samples and
                                         s/firstly/first/
> +then tells the kernel that the data chunk has been released.  Since
> +a reading operation is followed by a writing operation, thus (D) is a
> +full memory barrier.
> +
> +(B) is a writing barrier in the middle of two writing operations, which
> +makes sure that recording a sample must be prior to updating the head
> +pointer.
> +
> +(C) pairs with (B).  (C) is a read memory barrier to ensure the head
> +pointer is fetched before reading samples.
> +
> +To implement the above algorithm, the perf_output_put_handle() function
> +in the kernel and two helpers ring_buffer_read_head() and
> +ring_buffer_write_tail() in the user space are introduced, they rely
> +on memory barriers as described above to ensure the data dependency.
> +
> +Some architectures support one-way permeable barrier with load-acquire
> +and store-release operations, these barriers are more relaxed with less
> +performance penalty, so (C) and (D) can be optimized to use barriers
> +smp_load_acquire() and smp_store_release() respectively.
> +
> +If an architecture doesn’t support load-acquire and store-release in its
> +memory model, it will roll back to the old fashion of memory barrier
> +operations.  In this case, smp_load_acquire() encapsulates READ_ONCE() +
> +smp_mb(), since smp_mb() is costly, ring_buffer_read_head() doesn't
> +invoke smp_load_acquire() and it uses the barriers READ_ONCE() +
> +smp_rmb() instead.
> +
> +The mechanism of AUX ring buffer
> +--------------------------------
> +
> +In this chapter, we will explain the implementation of the AUX ring
> +buffer.  In the first part it will discuss the connection between the
> +AUX ring buffer and the regular ring buffer, then the second part will
> +examine how the AUX ring buffer co-works with the regular ring buffer,
> +as well as the additional features introduced by the AUX ring buffer for
> +the sampling mechanism.
> +
> +The relationship between AUX and regular ring buffers
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Generally, the AUX ring buffer is an auxiliary for the regular ring
> +buffer.  The regular ring buffer is primarily used to store the event
> +samples and every event format complies with the definition in the
> +union perf_event; the AUX ring buffer is for recording the hardware
> +trace data and the trace data format is hardware IP dependent.
> +
> +The general use and advantage of the AUX ring buffer is that it is
> +written directly by hardware rather than by the kernel.  For example,
> +regular profile samples that write to the regular ring buffer cause an
> +interrupt.  Tracing execution requires a high number of samples and
> +using interrupts would be overwhelming for the regular ring buffer
> +mechanism.  Having an AUX buffer allows for a region of memory more
> +decoupled from the kernel and written to directly by hardware tracing.
> +
> +The AUX ring buffer reuses the same algorithm with the regular ring
> +buffer for the buffer management.  The control structure
> +perf_event_mmap_page extends the new fields aux_head and aux_tail for
> +the head and tail pointers of the AUX ring buffer.
> +
> +During the initialisation phase, besides the mmap()-ed regular ring
> +buffer, the perf tool invokes a second syscall in the
> +auxtrace_mmap__mmap() function for the mmap of the AUX buffer;
> +rb_alloc_aux() in the kernel allocates pages, these pages will be
> +deferred to map into VMA when handling the page fault, which is the same
> +lazy mechanism with the regular ring buffer.
> +
> +AUX events and AUX trace data are two different things.  Let's see an
> +example:
> +
> +	perf record -a -e cycles -e cs_etm/@tmc_etr0/ -- sleep 2
> +
> +The above command enables two events: one is the event 'cycles' from PMU
> +and another is the AUX event 'cs_etm' from Arm CoreSight, both are saved
> +into the regular ring buffer while the CoreSight's AUX trace data is
> +stored in the AUX ring buffer.
> +
> +As a result, we can see the regular ring buffer and the AUX ring buffer
> +are allocated in pairs.  The perf in default mode allocates the regular
> +ring buffer and the AUX ring buffer per CPU-wise, which is the same as
> +the system wide mode, however, the default mode records samples only for
> +the profiled program, whereas the latter mode profiles for all programs
> +in the system.  For per-thread mode, the perf tool allocates only one
> +regular ring buffer and one AUX ring buffer for the whole session.  For
> +the per-CPU mode, the perf allocates two kinds of ring buffers for CPUs
> +specified by the option '-C'.
> +
> +The below figure demonstrates the buffers' layout in the system wide
> +mode; if there are any activities on one CPU, the AUX event samples and
> +the hardware trace data will be recorded into the dedicated buffers for
> +the CPU.
> +
> +            T1                      T2                 T1
> +          +----+              +-----------+          +----+
> +  CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
> +          +----+--------------+-----------+----------+----+-------->
> +            |                       |                  |
> +            v                       v                  v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 0                      |
> +          +-----------------------------------------------------+
> +            |                       |                  |
> +            v                       v                  v
> +          +-----------------------------------------------------+
> +          |               AUX Ring buffer 0                     |
> +          +-----------------------------------------------------+
> +
> +                 T1
> +               +-----+
> +  CPU1         |xxxxx|
> +          -----+-----+--------------------------------------------->
> +                  |
> +                  v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 1                      |
> +          +-----------------------------------------------------+
> +                  |
> +                  v
> +          +-----------------------------------------------------+
> +          |               AUX Ring buffer 1                     |
> +          +-----------------------------------------------------+
> +
> +                                      T1              T3
> +                                    +----+        +-------+
> +  CPU2                              |xxxx|        |xxxxxxx|
> +          --------------------------+----+--------+-------+-------->
> +                                      |               |
> +                                      v               v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 2                      |
> +          +-----------------------------------------------------+
> +                                      |               |
> +                                      v               v
> +          +-----------------------------------------------------+
> +          |               AUX Ring buffer 2                     |
> +          +-----------------------------------------------------+
> +
> +                            T1
> +                     +--------------+
> +  CPU3               |xxxxxxxxxxxxxx|
> +          -----------+--------------+------------------------------>
> +                            |
> +                            v
> +          +-----------------------------------------------------+
> +          |                  Ring buffer 3                      |
> +          +-----------------------------------------------------+
> +                            |
> +                            v
> +          +-----------------------------------------------------+
> +          |               AUX Ring buffer 3                     |
> +          +-----------------------------------------------------+
> +
> +	    T1: Thread 1; T2: Thread 2; T3: Thread 3
> +	    x: Thread is in running state
> +	    Figure 9: AUX ring buffer for system wide mode
> +
> +AUX events
> +~~~~~~~~~~
> +
> +Similar to perf_output_begin() and perf_output_end()'s working for the
> +regular ring buffer, perf_aux_output_begin() and perf_aux_output_end()
> +serve for the AUX ring buffer for processing the hardware trace data.
> +The structure perf_output_handle is used as a context to track the AUX
> +buffer’s info.
> +
> +perf_aux_output_begin() initializes the structure perf_output_handle.
> +It fetches the AUX head pointer and assigns to perf_output_handle::head,
> +afterwards, the low level driver uses perf_output_handle::head as the
> +start address for storing hardware trace data.
> +
> +Once the hardware trace data is stored into the AUX ring buffer, the PMU
> +driver will stop hardware tracing by calling the pmu::stop() callback.
> +Similar to the regular ring buffer, the AUX ring buffer needs to apply
> +the memory synchronization mechanism as discussed in the section "Memory
> +synchronization".  Since the AUX ring buffer is managed by the PMU
> +driver, the barrier (B), which is a writing barrier to ensure the trace
> +data is externally visible prior to updating the head pointer, is asked
> +to be implemented in the PMU driver.
> +
> +Then pmu::stop() can safely call the perf_aux_output_end() function to
> +finish two things:
> +
> +- It fills an event PERF_RECORD_AUX into the regular ring buffer, this
> +event delivers the information of the start address and data size for a
> +chunk of hardware trace data has been stored into the AUX ring buffer;
> +
> +- Since the hardware trace driver has stored new trace data into the AUX
> +ring buffer, the argument 'size' indicates how many bytes have been
> +consumed by the hardware tracing, thus perf_aux_output_end() updates the
> +header pointer perf_buffer::aux_head to reflect the latest buffer usage.
> +
> +At the end, the PMU driver will restart hardware tracing.  During this
> +temporary suspending period, it will lose hardware trace data, which
> +will introduce a discontinuity during decoding phase.
> +
> +The event PERF_RECORD_AUX presents an AUX event which is handled in the
> +kernel, but it lacks the information for saving the AUX trace data in
> +the perf file.  When the perf tool copies the trace data from AUX ring
> +buffer to the perf data file, it synthesizes a PERF_RECORD_AUXTRACE
> +event which includes the offest and size of the AUX trace data in the
> +perf file.  Afterwards, the perf tool reads out the AUX trace data from
> +the perf file based on the PERF_RECORD_AUXTRACE events, and the
> +PERF_RECORD_AUX event is used to decode a chunk of data by correlating
> +with time order.
> +
> +Snapshot mode
> +~~~~~~~~~~~~~
> +
> +Perf supports snapshot mode for AUX ring buffer, in this mode, users
> +only record AUX trace data at a specific time point which users are
> +interested in.  E.g. below gives an example of how to take snapshots
> +with 1 second interval with Arm CoreSight:
> +
> +  perf record -e cs_etm/@tmc_etr0/u -S -a program &
> +  PERFPID=$!
> +  while true; do
> +      kill -USR2 $PERFPID
> +      sleep 1
> +  done
> +
> +The main flow for snapshot mode is:
> +
> +- Before a snapshot is taken, the AUX ring buffer acts in free run mode.
> +During free run mode the perf doesn't record any of the AUX events and
> +trace data;
> +
> +- Once the perf tool receives the USR2 signal, it triggers the callback
> +function auxtrace_record::snapshot_start() to deactivate hardware
> +tracing.  The kernel driver then populates the AUX ring buffer with the
> +hardware trace data, and the event PERF_RECORD_AUX is stored in the
> +regular ring buffer;
> +
> +- Then perf tool takes a snapshot, record__read_auxtrace_snapshot()
> +reads out the hardware trace data from the AUX ring buffer and saves it
> +into perf data file;
> +
> +- After the snapshot is finished, auxtrace_record::snapshot_finish()
> +restarts the PMU event for AUX tracing.
> +
> +The perf only accesses the head pointer perf_event_mmap_page::aux_head
> +in snapshot mode and doesn’t touch tail pointer aux_tail, this is
> +because the AUX ring buffer can overflow in free run mode, the tail
> +pointer is useless in this case.  Alternatively, the callback
> +auxtrace_record::find_snapshot() is introduced for making the decision
> +of whether the AUX ring buffer has been wrapped around or not, at the
> +end it fixes up the AUX buffer's head which are used to calculate the
> +trace data size.
> +
> +As we know, the buffers' deployment can be per-thread mode, per-CPU
> +mode, or system wide mode, and the snapshot can be applied to any of
> +these modes.  Below is an example of taking snapshot with system wide
> +mode.
> +
> +                                         Snapshot is taken
> +                                                 |
> +                                                 v
> +                        +------------------------+
> +                        |  AUX Ring buffer 0     | <- aux_head
> +                        +------------------------+
> +                                                 v
> +                +--------------------------------+
> +                |          AUX Ring buffer 1     | <- aux_head
> +                +--------------------------------+
> +                                                 v
> +    +--------------------------------------------+
> +    |                      AUX Ring buffer 2     | <- aux_head
> +    +--------------------------------------------+
> +                                                 v
> +         +---------------------------------------+
> +         |                 AUX Ring buffer 3     | <- aux_head
> +         +---------------------------------------+
> +
> +	    Figure 10: Snapshot with system wide mode

Great work, just some minor wording issues.
-- 
Thomas Richter, Dept 3303, IBM s390 Linux Development, Boeblingen, Germany
--
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschäftsführung: David Faller
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294