2017-08-10 18:49:06

by Vince Weaver

Subject: perf: multiple mmap of fd behavior on x86/ARM


So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
get ARM64 rdpmc support working, but apparently those patches never made
it upstream?)

anyway one test was failing due to an x86/arm difference, which is
possibly only tangentially perf related.

On x86 you can mmap() a perf_event_open() file descriptor multiple times
and it works.

On ARM/ARM64 you can only mmap() it once, any other attempts fail.

Is this expected behavior?

You can run the
tests/record_sample/mmap_multiple
test in the current git of my perf_event_tests testsuite for a testcase.

Vince


2017-08-11 10:02:38

by Mark Rutland

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
>
> So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> get ARM64 rdpmc support working, but apparently those patches never made
> it upstream?)

IIUC by 'rdpmc' you mean direct userspace counter access?

Patches for that never made it upstream. Last I saw, there were no
patches in a suitable state for review.

There are also difficulties (e.g. big.LITTLE systems where the number of
counters can differ across CPUs) which have yet to be solved.

> anyway one test was failing due to an x86/arm difference, which is
> possibly only tangentially perf related.
>
> On x86 you can mmap() a perf_event_open() file descriptor multiple times
> and it works.
>
> On ARM/ARM64 you can only mmap() it once, any other attempts fail.

Interesting. Which platform(s) are you testing on, with which kernel
version(s)?

> Is this expected behavior?

I'm not sure, but it sounds surprising.

> You can run the
> tests/record_sample/mmap_multiple
> test in the current git of my perf_event_tests testsuite for a testcase.

This appears to work for me:

nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ ./mmap_multiple
Trying to mmap same perf_event fd multiple times... PASSED

nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ git log --oneline HEAD~1..
c82c4dd tests: huge_grou_start: add info that this was fixed in Linux 4.3
nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ uname -a
Linux ribbensteg 4.13.0-rc4-00010-g2ce1491 #229 SMP PREEMPT Thu Aug 10 17:06:56 BST 2017 aarch64 aarch64 aarch64 GNU/Linux

nanook@ribbensteg:~/src/perf_event_tests/tests/record_sample$ strace ./mmap_multiple
execve("./mmap_multiple", ["./mmap_multiple"], [/* 18 vars */]) = 0
brk(0) = 0x2d9aa000
faccessat(AT_FDCWD, "/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff9d10e000
faccessat(AT_FDCWD, "/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=42361, ...}) = 0
mmap(NULL, 42361, PROT_READ, MAP_PRIVATE, 3, 0) = 0xffff9d103000
close(3) = 0
faccessat(AT_FDCWD, "/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/aarch64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\267\0\1\0\0\0(\17\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1283776, ...}) = 0
mmap(NULL, 1356664, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xffff9cf9b000
mprotect(0xffff9d0ce000, 61440, PROT_NONE) = 0
mmap(0xffff9d0dd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x132000) = 0xffff9d0dd000
mmap(0xffff9d0e3000, 13176, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffff9d0e3000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff9cf9a000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff9cf99000
mprotect(0xffff9d0dd000, 16384, PROT_READ) = 0
mprotect(0x412000, 4096, PROT_READ) = 0
mprotect(0xffff9d112000, 4096, PROT_READ) = 0
munmap(0xffff9d103000, 42361) = 0
perf_event_open(0xfffffbff0310, 0, -1, -1, 0) = 3
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0xffff9d105000
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0xffff9cf90000
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff9cf80000
write(1, "Trying to mmap same perf_event f"..., 77Trying to mmap same perf_event fd multiple times... PASSED
) = 77
exit_group(0) = ?
+++ exited with 0 +++

Thanks,
Mark.

2017-08-11 10:53:00

by Peter Zijlstra

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> >
> > So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> > get ARM64 rdpmc support working, but apparently those patches never made
> > it upstream?)
>
> IIUC by 'rdpmc' you mean direct userspace counter access?
>
> Patches for that never made it upstream. Last I saw, there were no
> patches in a suitable state for review.
>
> There are also difficulties (e.g. big.LITTLE systems where the number of
> counters can differ across CPUs) which have yet to be solved.

How would that be a problem? The API gives an explicit index to use with
the 'rdpmc' instruction.

2017-08-11 11:07:49

by Mark Rutland

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, Aug 11, 2017 at 12:52:52PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> > On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> > >
> > > So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> > > get ARM64 rdpmc support working, but apparently those patches never made
> > > it upstream?)
> >
> > IIUC by 'rdpmc' you mean direct userspace counter access?
> >
> > Patches for that never made it upstream. Last I saw, there were no
> > patches in a suitable state for review.
> >
> > There are also difficulties (e.g. big.LITTLE systems where the number of
> > counters can differ across CPUs) which have yet to be solved.
>
> How would that be a problem? The API gives an explicit index to use with
> the 'rdpmc' instruction.

It's a problem because access to unimplemented counters trap. So if a
task gets migrated from a CPU with N counters to one with N-1, accessing
counter N would be problematic.

So we'd need to account for that somehow, in addition to the usual
sequence counter fun to verify the index was valid when the access was
performed.
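
For illustration, the sequence counter check can be sketched roughly as below. This is a minimal sketch under stated assumptions, not kernel or perf_event_tests code: struct demo_page is a hypothetical stand-in for the lock, index and offset fields of struct perf_event_mmap_page, and read_hw / fake_counter are made-up placeholders for the real counter-read instruction. The counter access sits inside the loop, ahead of the sequence check, so a stale index would already have been used by the time the check notices the change.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few perf_event_mmap_page fields used here. */
struct demo_page {
    volatile uint32_t lock;   /* sequence counter, changed around updates */
    volatile uint32_t index;  /* hardware counter index + 1, 0 if unusable */
    volatile int64_t offset;  /* value to add to the raw hardware count */
};

/* Placeholder for the architecture-specific userspace counter read. */
typedef uint64_t (*read_counter_fn)(uint32_t hw_index);

/* Made-up counter source so the sketch can run without a PMU. */
static uint64_t fake_counter(uint32_t hw_index)
{
    return 100u + hw_index;
}

/*
 * Read a consistent count from the page.
 * Returns 0 and stores the count in *out on success, or -1 when the
 * page reports no usable index (a caller would fall back to read()).
 */
static int read_count(const struct demo_page *pc, read_counter_fn read_hw,
                      int64_t *out)
{
    uint32_t seq, idx;
    int64_t count;

    do {
        seq = pc->lock;
        __sync_synchronize();            /* order the loads below after lock */
        idx = pc->index;
        count = pc->offset;
        if (idx)
            count += (int64_t)read_hw(idx - 1);  /* access happens here ... */
        __sync_synchronize();
    } while (pc->lock != seq);           /* ... validity is only known here */

    if (!idx)
        return -1;
    *out = count;
    return 0;
}
```

Calling read_count(&page, fake_counter, &value) with a page whose index is 0 returns -1, which is where a real reader would use the read() syscall instead.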

Thanks,
Mark.

2017-08-11 14:53:43

by Peter Zijlstra

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, Aug 11, 2017 at 12:06:39PM +0100, Mark Rutland wrote:
> On Fri, Aug 11, 2017 at 12:52:52PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> > > On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> > > >
> > > > So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> > > > get ARM64 rdpmc support working, but apparently those patches never made
> > > > it upstream?)
> > >
> > > IIUC by 'rdpmc' you mean direct userspace counter access?
> > >
> > > Patches for that never made it upstream. Last I saw, there were no
> > > patches in a suitable state for review.
> > >
> > > There are also difficulties (e.g. big.LITTLE systems where the number of
> > > counters can differ across CPUs) which have yet to be solved.
> >
> > How would that be a problem? The API gives an explicit index to use with
> > the 'rdpmc' instruction.
>
> It's a problem because access to unimplemented counters trap. So if a
> task gets migrated from a CPU with N counters to one with N-1, accessing
> counter N would be problematic.
>
> So we'd need to account for that somehow, in addition to the usual
> sequence counter fun to verify the index was valid when the access was
> performed.

Aah, you need restartable-sequences :-)

2017-08-11 15:25:56

by Vince Weaver

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, 11 Aug 2017, Mark Rutland wrote:

> IIUC by 'rdpmc' you mean direct userspace counter access?
>
> Patches for that never made it upstream. Last I saw, there were no
> patches in a suitable state for review.

yes, someone from Linaro sent me some code a while back that implemented
the userspace side and claimed the kernel patches would appear at some
point. I should try to dig up that e-mail.

The "rdpmc" code looked something like this
    if (counter == PERF_COUNT_HW_CPU_CYCLES)
        asm volatile("mrs %0, pmccntr_el0" : "=r" (ret));
    else {
        asm volatile("msr pmselr_el0, %0" : : "r" ((counter-1)));
        asm volatile("mrs %0, pmxevcntr_el0" : "=r" (ret));
    }


> > On ARM/ARM64 you can only mmap() it once, any other attempts fail.
>
> Interesting. Which platform(s) are you testing on, with which kernel
> version(s)?

This is on a Dragonboard 401c running a vendor 64-bit 4.4 kernel,
a Nvidia Jetson TX-1 board running a 64-bit 3.10 vendor kernel,
as well as a Raspberry Pi 3B running a 32-bit 4.9 pi foundation kernel.

It's a pain getting a recent-git kernel on these boards but I'm most of
the way to getting one booting on the Pi 3B. (got distracted by the fact
that Linpack still reliably crashes the Pi-3b even with a heatsink).

Here's strace from the Dragonboard:
perf_event_open(0x7fc649e900, 0, -1, -1, 0) = 3
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0x7f7e1b1000
mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = -1 EINVAL (Invalid argument)

Vince

2017-08-11 16:24:33

by Mark Rutland

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, Aug 11, 2017 at 11:25:37AM -0400, Vince Weaver wrote:
> On Fri, 11 Aug 2017, Mark Rutland wrote:
>
> > IIUC by 'rdpmc' you mean direct userspace counter access?
> >
> > Patches for that never made it upstream. Last I saw, there were no
> > patches in a suitable state for review.
>
> yes, someone from Linaro sent me some code a while back that implemented
> the userspace side and claimed the kernel patches would appear at some
> point. I should try to dig up that e-mail.

IIRC, patches were sent back in 2014, but as I mentioned above, those
were far from suitable for upstream, even ignoring cases like
big.LITTLE. Said patches were never reworked and reposted.

> > > On ARM/ARM64 you can only mmap() it once, any other attempts fail.
> >
> > Interesting. Which platform(s) are you testing on, with which kernel
> > version(s)?
>
> This is on a Dragonboard 401c running a vendor 64-bit 4.4 kernel,
> a Nvidia Jetson TX-1 board running a 64-bit 3.10 vendor kernel,

Just to check, how does x86 behave on each of those kernel releases?

Many things have changed since v4.4.

> as well as a Raspberry Pi 3B running a 32-bit 4.9 pi foundation kernel.

Hmm. On 32-bit this might be down to some arch/arm/mm cache aliasing
code, or it might be down to something that's changed since v4.9.

> It's a pain getting a recent-git kernel on these boards but I'm most of
> the way to getting one booting on the Pi 3B. (got distracted by the fact
> that Linpack still reliably crashes the Pi-3b even with a heatsink).

IIUC, were you to modify this test to use SW events, you could test it
on an aarch64 kernel running under QEMU. To the best of my knowledge,
the code paths for HW and SW PMU are identical for mmap.
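
A software-event variant of the check could look roughly like the sketch below. It is an assumption-laden illustration, not the actual mmap_multiple test: the function name mmap_sw_event_twice, the choice of PERF_COUNT_SW_CPU_CLOCK, and the sample settings are all made up here. The buffer size of 1 + 8 pages matches the 36864-byte mappings in the strace output on a 4K-page system.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Open a software perf event on the current task and mmap() its fd twice.
 * Returns 1 if both mmap() calls succeed, 0 if only the first succeeds,
 * and -1 if the event cannot be opened or the first mmap() fails.
 */
static int mmap_sw_event_twice(void)
{
    struct perf_event_attr attr;
    /* One metadata page plus eight data pages. */
    size_t len = (size_t)(1 + 8) * (size_t)sysconf(_SC_PAGESIZE);
    void *first, *second;
    int fd, result;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_SOFTWARE;          /* no hardware PMU required */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.sample_period = 100000;
    attr.sample_type = PERF_SAMPLE_IP;
    attr.disabled = 1;                        /* never started in this sketch */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid = 0 (this task), cpu = -1 (any), no group, no flags. */
    fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    first = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (first == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The second mapping of the same fd is the behaviour in question. */
    second = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    result = (second != MAP_FAILED);

    if (second != MAP_FAILED)
        munmap(second, len);
    munmap(first, len);
    close(fd);
    return result;
}
```

A return value of 0 would correspond to the EINVAL seen on the second mmap() in the Dragonboard strace; -1 usually means perf_event_open() itself was refused, for example by perf_event_paranoid or a seccomp filter.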

Otherwise, you might have more luck using a foundation model, which has
a PMU.

Thanks,
Mark.

2017-08-11 16:51:21

by Vince Weaver

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, 11 Aug 2017, Mark Rutland wrote:

> IIRC, patches were sent back in 2014, but as I mentioned above, those
> were far from suitable for upstream, even ignoring cases like
> big.LITTLE. Said patches were never reworked and reposted.

Here's the commit message in the perf_event_tests tree, having trouble
finding the original e-mail that went with it.

commit 2cc2e21e349243889ba59408527cc1a97dd0dc44
Author: Yogesh Tillu <[email protected]>
Date: Tue Mar 1 14:18:22 2016 +0530

Add support for RDPMC test with mmap way

This test adds support for reading perf hw counter from userspace.
Method (2)
rdpmc_comparision_mmap:
Test read perf hw counter in userspace using open/mmap syscall.
It requires kernel with perf mmap patchset and
echo 1 > /sys/bus/platform/drivers/armv8-pmu/rdpmc

Above Method Tested On:(X86/ARM)
It is tested with perf mmap patchset on kernel v4.5.0-rc5+
With above Tests, we can benchmark access of perf hw counters in
userspace with syscall vs perf_event_mmap_page way.

Signed-off-by: Yogesh Tillu <[email protected]>



> Just to check, how does x86 behave on each of those kernel releases?
>
> Many things have changed since v4.4.

I'm fairly sure this test (well, the equivalent code in
tests/record_sample/record_mmap that I based the test on) has been passing
on all of my x86 test machines since ~3.10 or so, or else I would have noticed.

If I can get a custom kernel to boot on one of my machines I can start
digging in and see if I can find where the EINVAL comes from.

This isn't some key thing that needs to be fixed, I was just curious about
the behavior difference between x86 and ARM. There are a few other minor
x86/ARM differences, especially relating to perf_event_open() error
returns, that I had to special case in a few of my tests.

Vince

2017-08-11 17:10:32

by Mark Rutland

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, Aug 11, 2017 at 12:51:12PM -0400, Vince Weaver wrote:
> On Fri, 11 Aug 2017, Mark Rutland wrote:
> > Just to check, how does x86 behave on each of those kernel releases?
> >
> > Many things have changed since v4.4.
>
> I'm fairly sure this test (well, the equivalent code in
> tests/record_sample/record_mmap that I based the test on) has been passing
> on all of my x86 test machines since ~3.10 or so, or else I would have noticed.

Ok.

> If I can get a custom kernel to boot on one of my machines I can start
> digging in and see if I can find where the EINVAL comes from.

From a quick scan, I can't spot anything obvious that would affect the
arm64 perf mmap behaviour, that has changed since v4.9.

> This isn't some key thing that needs to be fixed, I was just curious about
> the behavior difference between x86 and ARM.

Sure; likewise I'm curious.

Thanks,
Mark.

2017-08-11 19:01:51

by Vince Weaver

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, 11 Aug 2017, Mark Rutland wrote:

> > This isn't some key thing that needs to be fixed, I was just curious about
> > the behavior difference between x86 and ARM.
>
> Sure; likewise I'm curious.

well I finally got a current git 64-bit kernel booted on the pi3.

Challenge: USB known to be broken currently, so no keyboard or ethernet.
Extra challenge: had the RX/TX lines switched on the serial connector.
Bonus challenge: the bcm2837 dts file doesn't enable armv8 PMU

I got through all of that, only to find:

$ uname -a
Linux pi3-git 4.13.0-rc4-00152-g2627393 #2 SMP PREEMPT Fri Aug 11 13:58:42 EDT 2017 aarch64 GNU/Linux

$ ./mmap_multiple
Trying to mmap same perf_event fd multiple times... PASSED

So maybe the issue was fixed between 4.9 and current?

Vince

2017-08-14 10:57:32

by Will Deacon

Subject: Re: perf: multiple mmap of fd behavior on x86/ARM

On Fri, Aug 11, 2017 at 04:53:30PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 11, 2017 at 12:06:39PM +0100, Mark Rutland wrote:
> > On Fri, Aug 11, 2017 at 12:52:52PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 11, 2017 at 11:01:27AM +0100, Mark Rutland wrote:
> > > > On Thu, Aug 10, 2017 at 02:48:52PM -0400, Vince Weaver wrote:
> > > > >
> > > > > So I was working on my perf_event_tests on ARM/ARM64 (the end goal was to
> > > > > get ARM64 rdpmc support working, but apparently those patches never made
> > > > > it upstream?)
> > > >
> > > > IIUC by 'rdpmc' you mean direct userspace counter access?
> > > >
> > > > Patches for that never made it upstream. Last I saw, there were no
> > > > patches in a suitable state for review.
> > > >
> > > > There are also difficulties (e.g. big.LITTLE systems where the number of
> > > > counters can differ across CPUs) which have yet to be solved.
> > >
> > > How would that be a problem? The API gives an explicit index to use with
> > > the 'rdpmc' instruction.
> >
> > It's a problem because access to unimplemented counters trap. So if a
> > task gets migrated from a CPU with N counters to one with N-1, accessing
> > counter N would be problematic.
> >
> > So we'd need to account for that somehow, in addition to the usual
> > sequence counter fun to verify the index was valid when the access was
> > performed.
>
> Aah, you need restartable-sequences :-)

Or, in the absence of those, I wouldn't mind only supporting this for
non-big/little platforms initially.

Will