From: Ian Rogers
Date: Wed, 2 Aug 2023 15:01:57 -0700
Subject: Re: [PATCH v1] perf doc: Document ring buffer mechanism
To: Leo Yan
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Adrian Hunter, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Jiri Olsa,
    linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org
In-Reply-To: <20230802170038.GA1306442@leoy-huanghe.lan>
References: <20230715035343.46737-1-leo.yan@linaro.org>
            <20230802170038.GA1306442@leoy-huanghe.lan>

On Wed, Aug 2, 2023 at 10:00 AM Leo Yan wrote:
>
> Hi Ian,
>
> On Mon, Jul 17, 2023 at 02:27:11PM -0700, Ian Rogers wrote:
>
> [...]
>
> > > Thanks Ian's suggestion for upstreaming this documentation into Linux
> > > source code tree at the Jan of this year, also thanks my son Rowan for
> > > correcting me some grammer errors in this summer holiday. Finally I
> > > heavily refactored this documentation and I think it's ready for
> > > reviewing.
> >
> > This is awesome, thanks Leo for taking the time to do this! I've done
> > some readability nits.
>
> Thanks a lot for the review, Ian.
>
> I agreed with most of your comments. Below I just keep the comments for
> further discussion or give supplements; otherwise, I removed the rest of
> the comments and will reflect them in my new patch.
>
> [...]
>
> > > +Perf uses the same way to manage its ring buffer. In implementation
> > > +there have two actors: a page contains a control structure and ring
> > > +buffer; the page containing the control structure is named as "user
> > > +page", the user page and the ring buffer are mapped into user space
> > > +with the continuous virtual address, since the user page and the ring
> > > +buffer are consecutive, from the control structure pointer it's
> > > +easily to know the ring buffer address.
> >
> > nit: Perhaps reword a little as:
> > Perf uses the same way to manage its ring buffer. In the
> > implementation there are two key data structures held together in a
> > set of consecutive pages, the control structure and then the ring
> > buffer itself. The page with the control structure in is known as the
> > "user page". Being held in continuous virtual addresses simplifies
> > locating the ring buffer address, it is in the pages after the page
> > with the user page.
> >
> > Off-topic: seems wasteful to allocate a full page for this.
>
> Arm CPUs support not only a 4KiB page size but also 16KiB/64KiB page
> sizes, so it does waste a bit of memory to use a whole page for the
> control structure. But given that the control structure needs to be
> mapped into user space in page-size units, there seems to be no room to
> optimize this.
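
For illustration, here is a minimal user-space sketch (not from the patch,
and assuming the default layout where the control structure occupies
exactly one page) of locating the ring buffer behind the user page and of
how much unread data it holds:

#include <linux/perf_event.h>
#include <stdint.h>
#include <unistd.h>

/* The first mapped page is the "user page" holding the control
 * structure; because the mapping is virtually contiguous, the ring
 * buffer data follows it directly. */
static void *rb_data(struct perf_event_mmap_page *up)
{
	/* control structure occupies one page, the data pages come next */
	return (uint8_t *)up + sysconf(_SC_PAGESIZE);
}

static uint64_t rb_unread(struct perf_event_mmap_page *up)
{
	/* data_head advances as the kernel writes samples,
	 * data_tail as the user-space reader consumes them */
	return up->data_head - up->data_tail;
}

Newer kernels also publish data_offset and data_size in the user page,
which avoids hard-coding the one-page assumption.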
>
> [...]
>
> > > +          user page                      ring buffer
> > > + +---------+---------+     +---------------------------------------+
> > > + |data_head|data_tail|...| |   |***|***|***|***|***|   |   |   |   |
> > > + +---------+---------+     +---------------------------------------+
> > > +      `         `--------------^               ^
> > > +       `--------------------------------------------|
> > > +
> > > + * : the data is filled by the writer.
> > > +   Figure 2: Perf ring buffer
> > > +
> > > +When using 'perf record' tool, we can specify the ring buffer size with
> >
> > nit: s/using/using the/
> >
> > > +option '-m' or '--mmap-pages=', finally the size will be rounded up to
> >
> > nit: s/finally the size/the given size/
> >
> > > +power of two with page unit.  Though the kernel allocates at once for
> >
> > nit: s/power of two with page unit/a power of two that is a multiple
> > of a page size/
> >
> > Off-topic-ish: Perhaps it is worth motivating why the page size must
> > be a power of 2. I'm guessing it is because this means that when
> > moving the pointers/indices they can be masked to cause wrapping,
> > rather than using division/modulus.
>
> Correct. The comment in kernel/events/core.c verifies this guess:
>
>   "If we have rb pages ensure they're a power-of-two number, so we
>    can do bitmasks instead of modulo."
>
> > Fwiw, I think this could also be
> > solved with a comparison, and the performance overhead could be
> > insignificant compared to the memory savings.
>
> I am not sure a comparison is sufficient. As you said, if the page
> number is not a power-of-two number and the page index increases
> monotonically, we need to compute a modulus to get the offset.
>
> Agreed with you, we can consider removing the limitation that the page
> number must be a power of two for memory saving and check what the
> performance penalty is.
>
> > > +all memory pages, it's deferred to map the pages to VMA area until
> > > +the perf tool accesses the buffer from the user space. In other words,
> > > +at the first time accesses the buffer's page from user space in the perf
> > > +tool, a data abort exception for page fault is taken and the kernel
> > > +uses this occasion to map the page into process VMA, thus the perf tool
> > > +can continue to access the page after returning from exception.
> >
> > nit: s/exception/the exception/
> >
> > Off topic: Should the perf tool use something like the MAP_POPULATE
> > flag to reduce page faults, given the pages were allocated already in
> > the kernel? Tweaking tools/lib/perf/mmap.c to do this and running
> > "time perf record -a sleep 1" shows the minor page fault count going
> > from 7700 to 9700, so it seems like a bad idea.
>
> Hmm ... I got the opposite result; by observing the page fault counts, I
> can see an improvement from adding the MAP_POPULATE flag.
>
> Before adding the MAP_POPULATE flag:
>
>   # ./perf stat -- ./perf record -a sleep 1
>
>     5,359   page-faults   #   2.409 K/sec
>     5,353   page-faults   #   2.415 K/sec
>     5,353   page-faults   #   2.425 K/sec
>
>   # ./perf stat -- ./perf record -e cs_etm/@tmc_etr0/ -a sleep 1
>
>     2,122   page-faults   #   2.038 K/sec
>     2,121   page-faults   #   2.001 K/sec
>     2,121   page-faults   #   2.015 K/sec
>
> After adding the MAP_POPULATE flag:
>
>   # ./perf stat -- ./perf record -a sleep 1
>
>     5,004   page-faults   #   2.260 K/sec
>     5,002   page-faults   #   2.253 K/sec
>     5,003   page-faults   #   2.239 K/sec
>
>   # ./perf stat -- ./perf record -e cs_etm/@tmc_etr0/ -a sleep 1
>
>     1,082   page-faults   #   856.589 /sec
>     1,082   page-faults   #   1.002 K/sec
>     1,080   page-faults   #   1.013 K/sec
>
> [...]

Interesting, I wonder if it is an ARM vs x86 difference?
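
For reference, the MAP_POPULATE experiment under discussion amounts to
something like the sketch below (illustration only, not the actual
tools/lib/perf/mmap.c code; len would cover the user page plus the 2^n
data pages):

#include <stddef.h>
#include <sys/mman.h>

/* Map the perf ring buffer with MAP_POPULATE so that the pages the
 * kernel has already allocated are faulted in at mmap() time instead
 * of on the tool's first access. */
static void *map_ring_buffer(int perf_fd, size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, perf_fd, 0);
}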
> > > +System wide mode
> > > +
> > > +By default if without specifying mode, or explicitly using option '–a'
> > > +or '––all–cpus', perf collects samples on all CPUs in the system wide
> > > +mode.
> > > +
> > > +In this mode, every CPU has its own ring buffer; all threads are
> > > +monitored during the running state and the samples are recorded into the
> >
> > This doesn't cover the case of:
> > $ perf record benchmark
> > Where there will be a ring buffer on every CPU but events/sampling
> > will only be enabled for benchmark's threads, ie not all threads.
>
> Correct.  I wrongly thought it's the system-wide mode when we don't
> specify option '-a'; I will add a 'default mode' to address the above
> case.
>
> [...]
>
> > > When a sample is recorded into the ring buffer, the kernel event
> > > +core layer will wake up the perf process to read samples from the ring
> > > +buffer.
> >
> > It isn't always necessary to wake the perf tool process. There is a
> > little bit more detail on different polling modes in the
> > perf_event_open man page in the section on "Overflow handling":
> > https://github.com/mkerrisk/man-pages/blob/master/man2/perf_event_open.2#L3062
>
> Exactly, the perf ring buffer uses the watermark as a threshold, and only
> when crossing the threshold does the kernel notify user space. I confirmed
> this in the function __perf_output_begin().
>
> So I will rephrase it as:
>
> "When a sample is recorded into the ring buffer, and the number of
> samples crosses the threshold, the kernel event core layer will wake up
> the perf process to read samples from the ring buffer."

Sounds good.
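
As a concrete illustration of the watermark threshold (field names are
from the perf_event_attr ABI; the 32KiB value is arbitrary and only for
the sketch):

#include <linux/perf_event.h>
#include <string.h>

/* Request that poll()/epoll on the event fd only wakes up once at
 * least wakeup_watermark bytes of sample data are pending, rather
 * than on every sample. */
static void setup_watermark(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size             = sizeof(*attr);
	attr->watermark        = 1;         /* interpret wakeup_watermark as bytes */
	attr->wakeup_watermark = 32 * 1024;
}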
> > > +
> > > +                           Perf
> > > +                         /     | Read samples
> > > +                Polling /      `--------------|       Ring buffer
> > > +                       v                      v  ;-------------------v
> > > + +----------------+   +---------+---------+      +-------------------+
> > > + |Event wait queue|   |data_head|data_tail|      |***|***|   |   |***|
> > > + +----------------+   +---------+---------+      +-------------------+
> > > +        ^                  ^                     `----------------------^
> > > +        | Wake up tasks    | Store samples
> > > +    +-----------------------------+
> > > +    |  Kernel event core layer    |
> > > +    +-----------------------------+
> > > +
> > > + * : the data is filled by the writer.
> > > +   Figure 6: Writing and reading the ring buffer
> > > +
> > > +Because multiple events might share the same ring buffer for recording
> > > +samples, when any event sample is stored into the ring buffer, the
> > > +kernel event core layer iterates every event associated to the ring
> > > +buffer and wake up tasks on the wait queue of the events. This is
> > > +fulfilled by the kernel function ring_buffer_wakeup().
> >
> > I'm not sure about the use of "event" here. If you do:
> > $ perf stat -e "cycles,instructions"
> > Then cycles and instructions will each have a ring buffer, I'm not
> > sure it is possible to get them to share a ring buffer. I think here
> > you may be referring to side band events, like mmap2.
>
> Seems to me, this is incorrect. Since 'perf stat' only reads counters
> (via the kernel function perf_read()), it doesn't allocate a ring buffer
> at all for events. By using GDB, I can confirm the function
> perf_mmap__mmap() is never called for the 'perf stat' command.
>
> Just to clarify, 'perf stat' still mmaps the 'user page' for the control
> structure and for timer counter accessing, but this is not the same
> thing as the ring buffer.

Sorry, I meant record not stat.

> [...]
>
> > > +In Linux kernel, the event core layer uses the structure perf_buffer to
> > > +track the buffer's latest header, and it keeps the information for
> > > +buffer pages. The structure perf_buffer manages ring buffer during its
> > > +life cycle, it is allocated once the ring buffer is created and released
> > > +when the ring buffer is destroyed.
> > > +
> > > +It's possible for multiple events to write buffer concurrently. For
> > > +instance, a software event and a hardware PMU event both are enabled for
> > > +profiling, when the software event is in the middle of sampling, the
> > > +hardware event maybe be overflow and its interrupt is triggered in this
> > > +case.  This leads to the race condition for updating perf_buffer::head.
> > > +In __perf_output_begin(), Linux kernel uses compare-and-swap atomicity
> > > +local_cmpxchg() to implement the lockless algorithm for protecting
> > > +perf_buffer::head.
> >
> > In the case of:
> > $ perf record -e "instructions,faults"
> > faults is a software event but it won't share the ring buffer with
> > instructions. I think the atomicity exists for the side band events.
>
> I verified on my Arm64 board and confirmed that hardware events
> (instructions, cycles) share the same ring buffer with a software event
> (faults). Below is the captured log (truncated to be more readable):
>
>   ...
>   => __perf_sw_event
>   => do_page_fault
>   => do_translation_fault
>   ...
>   ls-884  [003] ...2.  253.737217: perf_output_begin_forward: __perf_output_begin: rb=00000000720bac97 page=0 offset=1688 addr=0000000019392296 size=2408
>
>   ...
>   => __perf_event_overflow
>   => perf_event_overflow
>   => armv8pmu_handle_irq
>   ...
>   ls-884  [003] d.h1.  253.737247: perf_output_begin_forward: __perf_output_begin: rb=00000000720bac97 page=0 offset=1736 addr=000000009e259b5a size=2360
>
> We can see that both the software event and the Armv8 PMU event use the
> same rb pointer (0x00000000720bac97), which means the software and
> hardware events store samples into the same ring buffer. Thus the
> above description is valid.
>
> P.s. maybe the side band events are not a relevant topic, but I can
> see the side band events have a dedicated ring buffer.

You're right, the code doing this is using ioctls, in particular
PERF_EVENT_IOC_SET_OUTPUT:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/perf/evlist.c?h=perf-tools-next#n517

The man page mentions "PERF_FLAG_FD_OUTPUT (broken since Linux 2.6.35)"
which had confused me.

Thanks,
Ian

> Thanks,
> Leo
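
As an aside, the PERF_EVENT_IOC_SET_OUTPUT redirection referenced above
boils down to something like this sketch (hypothetical file descriptors,
error handling omitted; the real logic lives in tools/lib/perf/evlist.c):

#include <linux/perf_event.h>
#include <sys/ioctl.h>

/* Redirect fd_other's samples into fd_leader's already-mmap'ed ring
 * buffer, so two events opened on the same CPU share one buffer. */
static int share_ring_buffer(int fd_leader, int fd_other)
{
	return ioctl(fd_other, PERF_EVENT_IOC_SET_OUTPUT, fd_leader);
}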