2013-08-30 17:48:43

by Vince Weaver

Subject: perf_event: rdpmc self-monitoring overhead issue

Hello,

I've finally found time to track down why perf_event/rdpmc self-monitoring
overhead was so bad.

To summarize, a test which does:

perf_event_open()
ioctl(PERF_EVENT_IOC_ENABLE)
read() /* either via syscall or the rdpmc code listed in
include/uapi/linux/perf_event.h */
ioctl(PERF_EVENT_IOC_DISABLE)

is done, and the number of cycles for each routine is taken using
rdtsc().

On a Core2 Processor the results look something like this for read:

                             | read time for 1 event
                             | median of 1024 runs
                             | (cycles)
-----------------------------|-------------------------
2.6.32-perfctr (rdpmc)       |  133
2.6.30-perfmon2              | 1264
3.10                         | 1482
3.10 (rdpmc)                 | 3062

As you can see, using the userspace-only rdpmc code is twice as slow as
just using the read() syscall.


I've tracked down the cause of this, and apparently it's due to
the first access to the event's struct perf_event_mmap_page. If,
outside of the timed code, I first do an unrelated read of the mmap()
page to fault it in, then the result is much more believable:

3.10 (rdpmc)                 |  123

So the question is: why do I have to explicitly fault the page in
ahead of time? Is there a way to force this to happen automatically?

The perfctr code, as far as I can tell, doesn't touch its mmap page
in advance. It uses vm_insert_page() to insert the page rather than
the rb tree stuff that perf_event uses.

I know part of this overhead is due to the construction of my benchmark
and in theory would be mitigated if you were doing a large number
of measurements in a program, but at the same time this is also a common
pattern when self-monitoring: putting calipers around one chunk of code
and taking one measurement (often in a timing-critical area where
overhead matters).

Vince


2013-08-30 18:05:09

by Stephane Eranian

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

On Fri, Aug 30, 2013 at 7:55 PM, Vince Weaver <[email protected]> wrote:
> Hello,
>
> I've finally found time to track down why perf_event/rdpmc self-monitoring
> overhead was so bad.
>
> To summarize, a test which does:
>
> perf_event_open()
> ioctl(PERF_EVENT_IOC_ENABLE)
> read() /* either via syscall or the rdpmc code listed in
> include/uapi/linux/perf_event.h */
> ioctl(PERF_EVENT_IOC_DISABLE)
>
> is done, and the number of cycles for each routine is taken using
> rdtsc().
>
> On a Core2 Processor the results look something like this for read:
>
>                              | read time for 1 event
>                              | median of 1024 runs
>                              | (cycles)
> -----------------------------|-------------------------
> 2.6.32-perfctr (rdpmc)       |  133
> 2.6.30-perfmon2              | 1264
> 3.10                         | 1482
> 3.10 (rdpmc)                 | 3062
>
> As you can see, using the userspace-only rdpmc code is twice as slow as
> just using the read() syscall.
>
>
> I've tracked down the cause of this, and apparently it's due to
> the first access to the event's struct perf_event_mmap_page. If
> outside of the read timing code I do an unrelated read of the mmap() page
> to fault it in, then the result is much more believable:
>
> 3.10 (rdpmc)                 |  123
>
You mean that the high cost in your first example comes from the fact
that you are averaging over all the iterations rather than over n-1
(where 1 is the first)? I don't see a flag in mmap() to fault the page
in immediately. But why not simply document that programs should touch
the page once before starting any timing measurements?

> So the question is, why do I have to explicitly in advance fault the
> page in? Is there a way to force this to happen automatically?
>
> The perfctr code as far as I can tell doesn't touch its mmap page in
> advance.
> It uses vm_insert_page() to insert the page rather than the
> rb tree stuff that perf_event uses.
>
> I know part of this overhead is due to the construction of my benchmark
> and in theory would be mitigated if you were doing a large number
> of measurements in a program, but at the same time this is also a common
> pattern when self-monitoring: putting calipers around one chunk of code
> and taking one measurement (often in a timing-critical area where
> overhead matters).
>
> Vince

2013-08-30 18:28:43

by Vince Weaver

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

On Fri, 30 Aug 2013, Stephane Eranian wrote:
> >
> You mean that the high cost in your first example comes from the fact
> that you are averaging over all the iterations and not n-1 (where 1 is
> the first). I don't see a flag in mmap() to fault it in immediately. But
> why not document, that programs should touch the page once before
> starting any timing measurements.

I was just curious why perfctr didn't need to do this, but possibly
I'm just missing the mmap page being touched. Though the code is pretty
small and I'm not seeing any such access.

With perf_event you have one mmap page per event so touching all the pages
starts racking up some overhead if you have multiple events, but yes
probably lost in the noise if you do it at open() rather than at read().

It's also hard to have a dummy read of a page w/o the compiler
optimizing it away...

Vince Weaver
[email protected]
http://www.eece.maine.edu/~vweaver/

2013-08-31 10:07:06

by Mikael Pettersson

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

Vince Weaver writes:
> On Fri, 30 Aug 2013, Stephane Eranian wrote:
> > >
> > You mean that the high cost in your first example comes from the fact
> > that you are averaging over all the iterations and not n-1 (where 1 is
> > the first). I don't see a flag in mmap() to fault it in immediately. But
> > why not document, that programs should touch the page once before
> > starting any timing measurements.
>
> I was just curious why perfctr didn't need to do this, but possibly
> I'm just missing the mmap page being touched. Though the code is pretty
> small and I'm not seeing any such access.

perfctr's ->mmap() actively inserts the counters page into the calling
process' mm, so it's available immediately. The other model is to wait
for the first page fault on that page before inserting it; apparently
the other performance counter frameworks do that instead.

2013-09-02 02:50:42

by Andi Kleen

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

Stephane Eranian <[email protected]> writes:

> I don't see a flag in mmap() to fault it in immediately.

MAP_PRESENT

-Andi

--
[email protected] -- Speaking for myself only

2013-09-02 08:24:16

by Stephane Eranian

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

On Mon, Sep 2, 2013 at 4:50 AM, Andi Kleen <[email protected]> wrote:
> Stephane Eranian <[email protected]> writes:
>
>> I don't see a flag in mmap() to fault it in immediately.
>
> MAP_PRESENT
>
I could not find this constant defined anywhere in the kernel source tree
nor in /usr/include. Are you sure of the name?

2013-09-02 09:23:19

by Andi Kleen

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

On Mon, Sep 02, 2013 at 10:24:13AM +0200, Stephane Eranian wrote:
> On Mon, Sep 2, 2013 at 4:50 AM, Andi Kleen <[email protected]> wrote:
> > Stephane Eranian <[email protected]> writes:
> >
> >> I don't see a flag in mmap() to fault it in immediately.
> >
> > MAP_PRESENT
> >
> I could not find this constant defined anywhere in the kernel source tree
> nor in /usr/include. Are you sure of the name?

MAP_POPULATE sorry

-Andi

2013-09-02 13:09:12

by Vince Weaver

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

On Mon, 2 Sep 2013, Stephane Eranian wrote:

> On Mon, Sep 2, 2013 at 4:50 AM, Andi Kleen <[email protected]> wrote:
> > Stephane Eranian <[email protected]> writes:
> >
> >> I don't see a flag in mmap() to fault it in immediately.
> >
> > MAP_PRESENT
> >
> I could not find this constant defined anywhere in the kernel source tree
> nor in /usr/include. Are you sure of the name?

I assume he means MAP_POPULATE

which does improve things, from ~3000 cycles down to ~219, but that's
still more overhead than the ~130 or so you get by manually touching
the page first.

Vince

2013-09-02 17:27:04

by Andi Kleen

Subject: Re: perf_event: rdpmc self-monitoring overhead issue

> I assume he means MAP_POPULATE

Yes.

>
> which does improve things, from ~3000 cycles to ~219 cycles but that's
> still more overhead than the ~130 or so you get by manually touching the
> page first.

That seems odd. It should be the same.

Can you do a trace-cmd function trace and compare the two cases?

trace-cmd record -p function_graph ...
trace-cmd report

(as usual for tracing perf itself, first remove the useless -pg removal
for the perf objects in kernel/events/Makefile and
arch/x86/kernel/cpu/Makefile, so those functions are traceable)

-Andi