2023-11-21 12:09:07

by Hector Martin

Subject: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

Perf broke on all Apple ARM64 systems (tested almost everything), and
according to maz also on Juno (so, probably all big.LITTLE) since v6.5.

Test command:

sudo taskset -c 0 ./perf stat -e apple_icestorm_pmu/cycles/ \
    -e apple_firestorm_pmu/cycles/ -e cycles ls

Since this is taskset to CPU #0 (LITTLE core, icestorm), only events for
icestorm are expected.
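
A minimal sketch of how one might check which PMU covers a given CPU, assuming each CPU PMU exposes a `cpus` file in cpulist format under /sys/bus/event_source/devices/ (the ARM PMU driver does this on the systems I know of, but treat it as an assumption for any other machine). The helper names and the cpulist values in the comments are hypothetical, not taken from the report:

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-1,4" into a set of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def pmus_covering_cpu(cpu, root="/sys/bus/event_source/devices"):
    """Return the sorted names of PMUs whose `cpus` file lists the given CPU."""
    names = []
    for device in Path(root).iterdir():
        cpus_file = device / "cpus"
        # PMUs without a `cpus` file (e.g. software PMUs) are skipped.
        if cpus_file.is_file() and cpu in parse_cpulist(cpus_file.read_text()):
            names.append(device.name)
    return sorted(names)


if __name__ == "__main__":
    # Illustrative only: on a machine where the LITTLE PMU lists "0-1",
    # CPU 0 would be expected to map to that PMU and nothing else.
    print(pmus_covering_cpu(0))
```

With a check like this, a counter that reports events for a PMU that does not cover the pinned CPU is immediately suspicious.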

I bisected the breakage to two distinct points:

5ea8f2ccffb is the first bad commit. With its parent, the output is as
expected (same as v6.4):

3,297,462 apple_icestorm_pmu/cycles/
<not counted> apple_firestorm_pmu/cycles/ (0.00%)
<not counted> cycles (0.00%)

With 5ea8f2ccffb everything breaks:

<not supported> apple_icestorm_pmu/cycles/
<not supported> apple_firestorm_pmu/cycles/
<not counted> cycles (0.00%)

Somewhere along the way to 82fe2e45cdb00 things get even worse (didn't
bother bisecting this range). With its parent:

<not supported> apple_icestorm_pmu/cycles/

<not supported> apple_firestorm_pmu/cycles/

<not supported> apple_icestorm_pmu/cycles/

<not supported> apple_firestorm_pmu/cycles/

Then 82fe2e45cdb00 leads to the current v6.5 behavior:

<not counted> apple_icestorm_pmu/cycles/ (0.00%)
<not counted> apple_firestorm_pmu/cycles/ (0.00%)
<not counted> cycles (0.00%)

If I taskset the task to CPU#2 (big core, firestorm), I get events:

1,454,858 apple_icestorm_pmu/cycles/

1,454,760 apple_firestorm_pmu/cycles/

1,454,384 cycles


So the current behavior is that all output seems to come from the
firestorm PMU event counter, regardless of requested event.

This is all unchanged and still broken in v6.7-rc2.

- Hector


2023-11-21 13:40:52

by Marc Zyngier

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

[Adding key people on Cc]

On Tue, 21 Nov 2023 12:08:48 +0000,
Hector Martin <[email protected]> wrote:
>
> Perf broke on all Apple ARM64 systems (tested almost everything), and
> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.

I can confirm that at least on 6.7-rc2, perf is pretty busted on any
asymmetric ARM platform. It isn't clear what criteria are used to pick
the PMU, but nothing works anymore.

The saving grace in my case is that Debian still ships a 6.1 perftool
package, but that's obviously not going to last.

I'm happy to test potential fixes.

M.

--
Without deviation from the norm, progress is not possible.

2023-11-21 15:26:24

by Marc Zyngier

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, 21 Nov 2023 13:40:31 +0000,
Marc Zyngier <[email protected]> wrote:
>
> [Adding key people on Cc]
>
> On Tue, 21 Nov 2023 12:08:48 +0000,
> Hector Martin <[email protected]> wrote:
> >
> > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
>
> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> asymmetric ARM platform. It isn't clear what criteria are used to pick
> the PMU, but nothing works anymore.
>
> The saving grace in my case is that Debian still ships a 6.1 perftool
> package, but that's obviously not going to last.
>
> I'm happy to test potential fixes.

At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
-vvv. And it is quite entertaining (this is taskset to an 'icestorm'
CPU):

<quote>
maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
apple_firestorm_pmu/cycles/ -e cycles ls
Using CPUID 0x00000000612f0280
Attempt to add: apple_icestorm_pmu/cycles=0/
..after resolving event: apple_icestorm_pmu/cycles=0/
Opening: unknown-hardware:HG
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0xb00000000
disabled 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
sys_perf_event_open failed, error -95
Attempt to add: apple_firestorm_pmu/cycles=0/
..after resolving event: apple_firestorm_pmu/cycles=0/
Control descriptor is not initialized
Opening: apple_icestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
Opening: apple_firestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
Opening: cycles
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
bench builtin-evlist.c builtin-probe.c CREDITS perf.h
Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
builtin-daemon.o builtin-list.c builtin-version.c perf ui
builtin-data.c builtin-list.o builtin-version.o perf-archive util
builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
builtin-diff.c builtin-mem.c command-list.txt perf.c
apple_icestorm_pmu/cycles/: -1: 0 873709 0
apple_firestorm_pmu/cycles/: -1: 0 873709 0
cycles: -1: 0 873709 0
apple_icestorm_pmu/cycles/: 0 873709 0
apple_firestorm_pmu/cycles/: 0 873709 0
cycles: 0 873709 0

Performance counter stats for 'ls':

<not counted> apple_icestorm_pmu/cycles/ (0.00%)
<not counted> apple_firestorm_pmu/cycles/ (0.00%)
<not counted> cycles (0.00%)

0.000002250 seconds time elapsed

0.000000000 seconds user
0.000000000 seconds sys
</quote>

If I run the same thing on another CPU cluster (firestorm), I get
this:

<quote>
maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
apple_firestorm_pmu/cycles/ -e cycles ls
Using CPUID 0x00000000612f0280
Attempt to add: apple_icestorm_pmu/cycles=0/
..after resolving event: apple_icestorm_pmu/cycles=0/
Opening: unknown-hardware:HG
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0xb00000000
disabled 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
sys_perf_event_open failed, error -95
Attempt to add: apple_firestorm_pmu/cycles=0/
..after resolving event: apple_firestorm_pmu/cycles=0/
Control descriptor is not initialized
Opening: apple_icestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
Opening: apple_firestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
Opening: cycles
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
bench builtin-evlist.c builtin-probe.c CREDITS perf.h
Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
builtin-daemon.o builtin-list.c builtin-version.c perf ui
builtin-data.c builtin-list.o builtin-version.o perf-archive util
builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
builtin-diff.c builtin-mem.c command-list.txt perf.c
apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
cycles: -1: 1034653 469125 469125
apple_icestorm_pmu/cycles/: 1035101 469125 469125
apple_firestorm_pmu/cycles/: 1035035 469125 469125
cycles: 1034653 469125 469125

Performance counter stats for 'ls':

1,035,101 apple_icestorm_pmu/cycles/
1,035,035 apple_firestorm_pmu/cycles/
1,034,653 cycles

0.000001333 seconds time elapsed

0.000000000 seconds user
0.000000000 seconds sys
</quote>

which doesn't make any sense either. I really don't understand what
this PERF_TYPE_HARDWARE does here (the *real* types are 10 and 11),
nor what this 'cycles=0' stuff is.

/puzzled

M.

--
Without deviation from the norm, progress is not possible.

2023-11-21 15:40:42

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> On Tue, 21 Nov 2023 13:40:31 +0000,
> Marc Zyngier <[email protected]> wrote:
> >
> > [Adding key people on Cc]
> >
> > On Tue, 21 Nov 2023 12:08:48 +0000,
> > Hector Martin <[email protected]> wrote:
> > >
> > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> >
> > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > asymmetric ARM platform. It isn't clear what criteria are used to pick
> > the PMU, but nothing works anymore.
> >
> > The saving grace in my case is that Debian still ships a 6.1 perftool
> > package, but that's obviously not going to last.
> >
> > I'm happy to test potential fixes.
>
> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> CPU):

IIUC the tool is doing the wrong thing here and overriding explicit
${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
that ${pmu}'s type and event namespace.

Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
targeted to a specific PMU, it's semantically wrong to rewrite events like
this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
PERF_COUNT_HW_${EVENT}.
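
To make the distinction concrete, here is a minimal sketch of resolving ${pmu}/${event}/ in the PMU's own namespace, assuming the usual sysfs layout where /sys/bus/event_source/devices/${pmu}/type holds the dynamic PMU type and events/${event} holds the PMU-specific terms. The function names are hypothetical and this is not how the perf tool itself is structured:

```python
from pathlib import Path

SYSFS_PMU_ROOT = "/sys/bus/event_source/devices"


def read_pmu_type(pmu, root=SYSFS_PMU_ROOT):
    """Return the dynamic perf_event_attr.type the kernel assigned to this PMU."""
    return int((Path(root) / pmu / "type").read_text().strip())


def read_event_terms(pmu, event, root=SYSFS_PMU_ROOT):
    """Return the PMU-specific terms of a named event, e.g. {"event": 2}."""
    text = (Path(root) / pmu / "events" / event).read_text().strip()
    terms = {}
    for term in text.split(","):
        key, _, value = term.partition("=")
        # A bare term without "=value" is treated as a flag set to 1.
        terms[key] = int(value, 0) if value else 1
    return terms


def resolve_named_event(pmu, event, root=SYSFS_PMU_ROOT):
    """Describe the event in the PMU's own type and event namespace.

    Nothing here substitutes PERF_TYPE_HARDWARE (type 0); the event keeps
    the type and the terms that the PMU itself advertises.
    """
    return {
        "type": read_pmu_type(pmu, root),
        "terms": read_event_terms(pmu, event, root),
    }
```

Turning the terms into a final config value additionally needs the PMU's format/ files, which this sketch leaves out.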

Mark.

> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95
> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 0 873709 0
> apple_firestorm_pmu/cycles/: -1: 0 873709 0
> cycles: -1: 0 873709 0
> apple_icestorm_pmu/cycles/: 0 873709 0
> apple_firestorm_pmu/cycles/: 0 873709 0
> cycles: 0 873709 0
>
> Performance counter stats for 'ls':
>
> <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> <not counted> cycles (0.00%)
>
> 0.000002250 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>
>
> If I run the same thing on another CPU cluster (firestorm), I get
> this:
>
> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95
> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> cycles: -1: 1034653 469125 469125
> apple_icestorm_pmu/cycles/: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: 1035035 469125 469125
> cycles: 1034653 469125 469125
>
> Performance counter stats for 'ls':
>
> 1,035,101 apple_icestorm_pmu/cycles/
> 1,035,035 apple_firestorm_pmu/cycles/
> 1,034,653 cycles
>
> 0.000001333 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>
>
> which doesn't make any sense either. I really don't understand what
> this PERF_TYPE_HARDWARE does here (the *real* types are 10 and 11),
> nor what this 'cycle=0' stuff is.
>
> /puzzled
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

2023-11-21 15:44:24

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 7:24 AM Marc Zyngier <[email protected]> wrote:
>
> On Tue, 21 Nov 2023 13:40:31 +0000,
> Marc Zyngier <[email protected]> wrote:
> >
> > [Adding key people on Cc]
> >
> > On Tue, 21 Nov 2023 12:08:48 +0000,
> > Hector Martin <[email protected]> wrote:
> > >
> > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> >
> > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > asymmetric ARM platform. It isn't clear what criteria are used to pick
> > the PMU, but nothing works anymore.
> >
> > The saving grace in my case is that Debian still ships a 6.1 perftool
> > package, but that's obviously not going to last.
> >
> > I'm happy to test potential fixes.
>
> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> CPU):
>
> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95
> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 0 873709 0
> apple_firestorm_pmu/cycles/: -1: 0 873709 0
> cycles: -1: 0 873709 0
> apple_icestorm_pmu/cycles/: 0 873709 0
> apple_firestorm_pmu/cycles/: 0 873709 0
> cycles: 0 873709 0
>
> Performance counter stats for 'ls':
>
> <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> <not counted> cycles (0.00%)
>
> 0.000002250 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>
>
> If I run the same thing on another CPU cluster (firestorm), I get
> this:
>
> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95
> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> cycles: -1: 1034653 469125 469125
> apple_icestorm_pmu/cycles/: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: 1035035 469125 469125
> cycles: 1034653 469125 469125
>
> Performance counter stats for 'ls':
>
> 1,035,101 apple_icestorm_pmu/cycles/
> 1,035,035 apple_firestorm_pmu/cycles/
> 1,034,653 cycles
>
> 0.000001333 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>
>
> which doesn't make any sense either. I really don't understand what
> this PERF_TYPE_HARDWARE does here (the *real* types are 10 and 11),
> nor what this 'cycle=0' stuff is.

Hi Marc,

I'm unclear if you are running a newer perf tool on an older kernel or
not. In any case I'll assume the kernel and perf tool versions match.
In Linux 6.6 this patch was added to the ARM PMU:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/perf/arm_pmu.c?id=5c816728651ae425954542fed64d21d40cb75a9f

My guess is that the apple_icestorm_pmu requires a similar patch. The
perf tool is supposed to not use extended types when they aren't
supported:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
So I share your confusion as to why something broke.

PERF_TYPE_HARDWARE is a legacy type where there are hardcoded type and
config values that correspond to an event. The PMU driver turns legacy
events into the real types. On big.LITTLE systems, if the legacy events
are monitoring a task, a different event is needed for each PMU (i.e. >1
event). In your example you are monitoring 'ls', a task, and so
different cycles events are necessary. The PMU is identified in the
high 32 bits of the config (the extended type).
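
As an illustration only, the extended-type layout described above can be
sketched in Python. The shift and mask mirror PERF_PMU_TYPE_SHIFT and
PERF_HW_EVENT_MASK from the uapi perf_event.h header; the helper names are
made up for this sketch, and PMU type 11 is simply read off the
0xb00000000 config in the -vvv trace.

```python
from __future__ import annotations

# Sketch of the extended-type layout for PERF_TYPE_HARDWARE configs.
PERF_PMU_TYPE_SHIFT = 32          # PMU type lives in the high 32 bits
PERF_HW_EVENT_MASK = 0xFFFFFFFF   # legacy event id lives in the low 32 bits
PERF_COUNT_HW_CPU_CYCLES = 0


def encode_extended(pmu_type: int, hw_event: int) -> int:
    """Direct a legacy hardware event at one PMU (hypothetical helper)."""
    return (pmu_type << PERF_PMU_TYPE_SHIFT) | (hw_event & PERF_HW_EVENT_MASK)


def decode_extended(config: int) -> tuple[int, int]:
    """Split a config into (pmu_type, hw_event); pmu_type 0 means undirected."""
    return config >> PERF_PMU_TYPE_SHIFT, config & PERF_HW_EVENT_MASK


if __name__ == "__main__":
    # The probe in the trace used config 0xb00000000: cycles directed at PMU 11.
    print(hex(encode_extended(11, PERF_COUNT_HW_CPU_CYCLES)))
    print(decode_extended(0xB00000000))
```

Read this way, a config of plain 0 with type PERF_TYPE_HARDWARE names no
PMU at all, so the kernel is free to pick whichever core PMU the task is
running on.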

Thanks for reporting the issue,
Ian

> /puzzled
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.

2023-11-21 15:47:37

by Ian Rogers

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > On Tue, 21 Nov 2023 13:40:31 +0000,
> > Marc Zyngier <[email protected]> wrote:
> > >
> > > [Adding key people on Cc]
> > >
> > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > Hector Martin <[email protected]> wrote:
> > > >
> > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >
> > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > the PMU, but nothing works anymore.
> > >
> > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > package, but that's obviously not going to last.
> > >
> > > I'm happy to test potential fixes.
> >
> > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > CPU):
>
> IIUC the tool is doing the wrong thing here and overriding explicit
> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> that ${pmu}'s type and event namespace.
>
> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> targeted to a specific PMU, it's semantically wrong to rewrite events like
> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> PERF_COUNT_HW_${EVENT}.

If you name a PMU and an event then the event should only be opened on
that PMU, 100% agree. There's a bunch of output, but when the legacy
cycles event is opened it appears to be because it was explicitly
requested.

Thanks,
Ian

> Mark.
>
> > <quote>
> > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > apple_firestorm_pmu/cycles/ -e cycles ls
> > Using CPUID 0x00000000612f0280
> > Attempt to add: apple_icestorm_pmu/cycles=0/
> > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > Opening: unknown-hardware:HG
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > config 0xb00000000
> > disabled 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > sys_perf_event_open failed, error -95
> > Attempt to add: apple_firestorm_pmu/cycles=0/
> > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > Control descriptor is not initialized
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > apple_icestorm_pmu/cycles/: -1: 0 873709 0
> > apple_firestorm_pmu/cycles/: -1: 0 873709 0
> > cycles: -1: 0 873709 0
> > apple_icestorm_pmu/cycles/: 0 873709 0
> > apple_firestorm_pmu/cycles/: 0 873709 0
> > cycles: 0 873709 0
> >
> > Performance counter stats for 'ls':
> >
> > <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> > <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> > <not counted> cycles (0.00%)
> >
> > 0.000002250 seconds time elapsed
> >
> > 0.000000000 seconds user
> > 0.000000000 seconds sys
> > </quote>
> >
> > If I run the same thing on another CPU cluster (firestorm), I get
> > this:
> >
> > <quote>
> > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > apple_firestorm_pmu/cycles/ -e cycles ls
> > Using CPUID 0x00000000612f0280
> > Attempt to add: apple_icestorm_pmu/cycles=0/
> > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > Opening: unknown-hardware:HG
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > config 0xb00000000
> > disabled 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > sys_perf_event_open failed, error -95
> > Attempt to add: apple_firestorm_pmu/cycles=0/
> > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > Control descriptor is not initialized
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> > apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> > cycles: -1: 1034653 469125 469125
> > apple_icestorm_pmu/cycles/: 1035101 469125 469125
> > apple_firestorm_pmu/cycles/: 1035035 469125 469125
> > cycles: 1034653 469125 469125
> >
> > Performance counter stats for 'ls':
> >
> > 1,035,101 apple_icestorm_pmu/cycles/
> > 1,035,035 apple_firestorm_pmu/cycles/
> > 1,034,653 cycles
> >
> > 0.000001333 seconds time elapsed
> >
> > 0.000000000 seconds user
> > 0.000000000 seconds sys
> > </quote>
> >
> > which doesn't make any sense either. I really don't understand what
> > this PERF_TYPE_HARDWARE does here (the *real* types are 10 and 11),
> > nor what this 'cycle=0' stuff is.
> >
> > /puzzled
> >
> > M.
> >
> > --
> > Without deviation from the norm, progress is not possible.

2023-11-21 15:57:11

by Mark Rutland

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 07:41:17AM -0800, Ian Rogers wrote:
> Hi Marc,

Hi Ian,

> I'm unclear if you are running a newer perf tool on an older kernel or
> not. In any case I'll assume the kernel and perf tool versions match.
> In Linux 6.6 this patch was added to the ARM PMU:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/perf/arm_pmu.c?id=5c816728651ae425954542fed64d21d40cb75a9f
>
> My guess is that the apple_icestorm_pmu requires a similar patch.

The apple_icestorm_pmu PMU driver uses the arm_pmu framework, so it's using
that code (since v6.6).

> The perf tool is supposed to not use extended types when they aren't
> supported:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532

How does that is_event_supported() check actually work? I suspect that's giving
the wrong answer.

Regardless, I think the tool is doing something semantically wrong, see below.

> So I share your confusion as to why something broke.
>
> PERF_TYPE_HARDWARE is a legacy type where there are hardcoded type and
> config values that correspond to an event. The PMU driver turns legacy
> events into the real types. On big.LITTLE systems if the legacy events
> are monitoring a task a different event is needed for each PMU (ie >1
> event). In your example you are monitoring 'ls', a task, and so
> different cycles events are necessary. In the high 32-bits (the
> extended type) the PMU is identified.

I think the interesting thing here is that the tool is mapping events with an
explicit PMU into legacy PERF_TYPE_HARDWARE events, which is the opposite
direction to the one intended. Regardless of whether PERF_TYPE_HARDWARE events
can be targeted to a specific PMU, if the user has requested a specific PMU we
should be using that PMU and its related event namespace.

Marc's command line was:

sudo taskset -c 0 ./perf stat -vvv \
-e apple_icestorm_pmu/cycles/ \
-e apple_firestorm_pmu/cycles/ \
-e cycles \
ls

... and so the apple_*_pmu events should target their respective PMUs, and the
plain 'cycles' event could legitimately be opened as a single
PERF_TYPE_HARDWARE event, or split into two directed PERF_TYPE_HARDWARE events
targeting the two PMUs.

However, the tool opens three (undirected?) PERF_TYPE_HARDWARE events:

Opening: apple_icestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
Opening: apple_firestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
Opening: cycles
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------

Mark.

2023-11-21 16:03:28

by Mark Rutland

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > Marc Zyngier <[email protected]> wrote:
> > > >
> > > > [Adding key people on Cc]
> > > >
> > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > Hector Martin <[email protected]> wrote:
> > > > >
> > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > >
> > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > the PMU, but nothing works anymore.
> > > >
> > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > package, but that's obviously not going to last.
> > > >
> > > > I'm happy to test potential fixes.
> > >
> > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > CPU):
> >
> > IIUC the tool is doing the wrong thing here and overriding explicit
> > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > that ${pmu}'s type and event namespace.
> >
> > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > targeted to a specific PMU, it's semantically wrong to rewrite events like
> > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > PERF_COUNT_HW_${EVENT}.
>
> If you name a PMU and an event then the event should only be opened on
> that PMU, 100% agree. There's a bunch of output, but when the legacy
> cycles event is opened it appears to be because it was explicitly
> requested.

I think you've missed that the named PMU events are being erroneously transformed
into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.

Opening: apple_firestorm_pmu/cycles/
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4

... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
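
To make the check above mechanical, here is a rough Python sketch that
scans `perf stat -vvv` output and flags every ${pmu}/${event}/ event that
was opened as an undirected PERF_TYPE_HARDWARE event. It assumes the trace
layout shown in this thread ("Opening:" followed by "type" and "config"
lines); the function name is hypothetical and this is not part of perf.

```python
from __future__ import annotations

import re

PERF_TYPE_HARDWARE = 0
PERF_PMU_TYPE_SHIFT = 32  # extended type (PMU id) sits in the high 32 bits


def find_rewritten_events(trace: str) -> list[str]:
    """Return named-PMU events that were opened as undirected legacy events."""
    suspicious = []
    name = None       # event named on the most recent "Opening:" line
    attr_type = None  # perf_event_attr.type seen for that event
    for raw in trace.splitlines():
        line = raw.lstrip("> ").strip()  # tolerate mail quoting prefixes
        if line.startswith("Opening:"):
            name = line.split(":", 1)[1].strip()
            attr_type = None
        elif name and (m := re.match(r"type\s+(\d+)", line)):
            attr_type = int(m.group(1))
        elif name and (m := re.match(r"config\s+(0x[0-9a-fA-F]+|\d+)", line)):
            config = int(m.group(1), 0)
            named_pmu = "/" in name
            undirected = (config >> PERF_PMU_TYPE_SHIFT) == 0
            if named_pmu and attr_type == PERF_TYPE_HARDWARE and undirected:
                suspicious.append(name)
            name = None  # done with this attr dump
    return suspicious
```

Fed the trace from Marc's run, this would be expected to report both
apple_icestorm_pmu/cycles/ and apple_firestorm_pmu/cycles/, but not the
plain cycles event, which is allowed to be a legacy event.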

Marc said that he bisected the issue down to commit:

5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")

... so it looks like something is going wrong when the events are being parsed,
e.g. losing the HW PMU information?

Thanks,
Mark.


2023-11-21 16:04:10

by Ian Rogers

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 7:56 AM Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 07:41:17AM -0800, Ian Rogers wrote:
> > Hi Marc,
>
> Hi Ian,
>
> > I'm unclear if you are running a newer perf tool on an older kernel or
> > not. In any case I'll assume the kernel and perf tool versions match.
> > In Linux 6.6 this patch was added to the ARM PMU:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/perf/arm_pmu.c?id=5c816728651ae425954542fed64d21d40cb75a9f
> >
> > My guess is that the apple_icestorm_pmu requires a similar patch.
>
> The apple_icestorm_pmu PMU driver uses the arm_pmu framework, so it's using
> that code (since v6.6).
>
> > The perf tool is supposed to not use extended types when they aren't
> > supported:
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
>
> How does that is_event_supported() check actually work? I suspect that's giving
> the wrong answer.

Maybe. The implementation checks by calling perf_event_open:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n232

This is recycling logic from perf list where many legacy cache events
are elided due to a lack of support.

> Regardless, I think the tool is doing something semantically wrong, see below.
>
> > So I share your confusion as to why something broke.
> >
> > PERF_TYPE_HARDWARE is a legacy type where there are hardcoded type and
> > config values that correspond to an event. The PMU driver turns legacy
> > events into the real types. On big.LITTLE systems if the legacy events
> > are monitoring a task a different event is needed for each PMU (ie >1
> > event). In your example you are monitoring 'ls', a task, and so
> > different cycles events are necessary. In the high 32-bits (the
> > extended type) the PMU is identified.
>
> I think the interesting thing here is that the tool is mapping events with an
> explicit PMU into legacy PERF_TYPE_HARDWARE events, which is the opposite
> direction than intended. Regardless of whether PERF_TYPE_HARDWARE events can be
> targeted to a specific PMU, if the user has requested to use a specific PMU we
> should be using that PMU and related event namespace.
>
> Marc's command line was:
>
> sudo taskset -c 0 ./perf stat -vvv \
> -e apple_icestorm_pmu/cycles/ \
> -e apple_firestorm_pmu/cycles/ \
> -e cycles \

-e cycles here is a direct request for the legacy cycles event. It
will match in the parser here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n301

which goes to:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.y?h=perf-tools-next#n397

and as this is a hardware event there is wildcard expansion on each core PMU.
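
As a rough sketch of what that expansion is meant to produce (this is not
perf's actual code), a plain legacy event becomes one directed
PERF_TYPE_HARDWARE event per core PMU. The function name is hypothetical,
and which Apple PMU has type 10 and which has type 11 is an assumption
here; the real values come from the PMU's sysfs "type" file.

```python
from __future__ import annotations

PERF_TYPE_HARDWARE = 0
PERF_PMU_TYPE_SHIFT = 32
PERF_COUNT_HW_CPU_CYCLES = 0


def expand_legacy_event(hw_event: int, core_pmus: dict[str, int]) -> list[dict]:
    """Expand one legacy event into a directed attr per core PMU (sketch)."""
    return [
        {
            "pmu": name,
            "type": PERF_TYPE_HARDWARE,
            # PMU type in the high 32 bits, legacy event id in the low 32 bits.
            "config": (pmu_type << PERF_PMU_TYPE_SHIFT) | hw_event,
        }
        for name, pmu_type in core_pmus.items()
    ]


if __name__ == "__main__":
    # Assumed type numbers; only "10 and 11" is known from the thread.
    pmus = {"apple_icestorm_pmu": 10, "apple_firestorm_pmu": 11}
    for attr in expand_legacy_event(PERF_COUNT_HW_CPU_CYCLES, pmus):
        print(attr["pmu"], attr["type"], hex(attr["config"]))
```

With extended types working, none of the resulting configs would be a
bare 0, which is what distinguishes them from the attrs in the trace.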

Thanks,
Ian

> ls

2023-11-21 16:09:11

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 08:03:11AM -0800, Ian Rogers wrote:
> On Tue, Nov 21, 2023 at 7:56 AM Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Nov 21, 2023 at 07:41:17AM -0800, Ian Rogers wrote:
> > > Hi Marc,
> >
> > Hi Ian,
> >
> > > I'm unclear if you are running a newer perf tool on an older kernel or
> > > not. In any case I'll assume the kernel and perf tool versions match.
> > > In Linux 6.6 this patch was added to the ARM PMU:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/perf/arm_pmu.c?id=5c816728651ae425954542fed64d21d40cb75a9f
> > >
> > > My guess is that the apple_icestorm_pmu requires a similar patch.
> >
> > The apple_icestorm_pmu PMU driver uses the arm_pmu framework, so it's using
> > that code (since v6.6).
> >
> > > The perf tool is supposed to not use extended types when they aren't
> > > supported:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
> >
> > How does that is_event_supported() check actually work? I suspect that's giving
> > the wrong answer.
>
> Maybe, the implementation is to check using perf_event_open:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n232
>
> This is recycling logic from perf list where many legacy cache events
> are elided due to a lack of support.
>
> > Regardless, I think the tool is doing something semantically wrong, see below.
> >
> > > So I share your confusion as to why something broke.
> > >
> > > PERF_TYPE_HARDWARE is a legacy type where there are hardcoded type and
> > > config values that correspond to an event. The PMU driver turns legacy
> > > events into the real types. On BIG.little systems if the legacy events
> > > are monitoring a task a different event is needed for each PMU (ie >1
> > > event). In your example you are monitoring 'ls', a task, and so
> > > different cycles events are necessary. In the high 32-bits (the
> > > extended type) the PMU is identified.
> >
> > I think the interesting thing here is that the tool is mapping events with an
> > explicit PMU into legacy PERF_TYPE_HARDWARE events, which is the opposite
> > direction than intended. Regardless of whether PERF_TYPE_HARDWARE events can be
> > targeted to a specific PMU, if the user has requested to use a specific PMU we
> > should be using that PMU and related event namespace.
> >
> > Marc's command line was:
> >
> > sudo taskset -c 0 ./perf stat -vvv \
> > -e apple_icestorm_pmu/cycles/ \
> > -e apple_firestorm_pmu/cycles/ \
> > -e cycles \
>
> -e cycles here is a direct request for the legacy cycles event. It
> will match in the parser here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n301
>
> which goes to:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.y?h=perf-tools-next#n397
>
> and as this is a hardware event there is wildcard expansion on each core PMU.

Please read the rest of my message, which was talking about the other two
events.

Mark.

>
> Thanks,
> Ian
>
> > ls
> >
> > ... and so the apple_*_pmu events should target their respective PMUs, and the
> > plain 'cycles' event could legitimately be opened as a single
> > PERF_TYPE_HARDWARE event, or split into two directed PERF_TYPE_HARDWARE events
> > targeting the two PMUs.
> >
> > However, the tool opens three (undirected?) PERF_TYPE_HARDWARE events:
> >
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> >
> > Mark.

2023-11-21 16:12:35

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > >
> > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > Marc Zyngier <[email protected]> wrote:
> > > > >
> > > > > [Adding key people on Cc]
> > > > >
> > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > Hector Martin <[email protected]> wrote:
> > > > > >
> > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > >
> > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > the PMU, but nothing works anymore.
> > > > >
> > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > package, but that's obviously not going to last.
> > > > >
> > > > > I'm happy to test potential fixes.
> > > >
> > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > CPU):
> > >
> > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > that ${pmu}'s type and event namespace.
> > >
> > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > targeted to a specific PMU, it's semantically wrong to rewrite events like
> > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > PERF_COUNT_HW_${EVENT}.
> >
> > If you name a PMU and an event then the event should only be opened on
> > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > cycles event is opened it appears to be because it was explicitly
> > requested.
>
> I think you've missed that the named PMU events are being erroneously transformed
> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
>
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
>
> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
>
> Marc said that he bisected the issue down to commit:
>
> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
>
> ... so it looks like something is going wrong when the events are being parsed,
> e.g. losing the HW PMU information?

Ok, I think I'm getting confused by other things. This looks like the issue.

I think it may be working as intended, but not how you intended :-) If
a core PMU is listed and then a legacy event, the legacy event should
be opened on the core PMU as a legacy event with the extended type
set. This is to allow things like legacy cache events to be opened on
a specified PMU. Legacy event names match with a higher priority than
those in sysfs or json as they are hard coded. Presumably the
expectation was that advertising a cycles event (presumably in sysfs)
would mean that this is what gets matched.
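
The priority rule described above can be sketched as follows. This is a toy model, not the parser's real logic: LEGACY_HW_EVENTS stands in for the hard-coded legacy names, pmu_events for whatever a PMU advertises in sysfs or json, and the event string "event=0x02" used in the note below is a made-up value.

```python
# Hypothetical stand-in for the tool's built-in legacy name table.
# The values are the PERF_COUNT_HW_* ids for these two events.
LEGACY_HW_EVENTS = {"cycles": 0, "instructions": 1}


def resolve(pmu_events: dict[str, str], name: str, legacy_first: bool) -> tuple[str, object]:
    """Resolve the event name inside pmu/name/ under one of two lookup orders."""
    lookups = [
        ("legacy", LEGACY_HW_EVENTS),
        ("pmu", pmu_events),
    ]
    if not legacy_first:
        # Prefer what the PMU itself advertises; fall back to legacy names.
        lookups.reverse()
    for source, table in lookups:
        if name in table:
            return source, table[name]
    raise KeyError(name)
```

For a PMU advertising {"cycles": "event=0x02"}, legacy_first=True resolves "cycles" to the legacy id, which is the behaviour seen in the -vvv output. legacy_first=False picks the PMU's own definition, while a name the PMU does not advertise still falls back to the legacy table.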

Thanks,
Ian

> Thanks,
> Mark.
>
> >
> >
> > Thanks,
> > Ian
> >
> > > Mark.
> > >
> > > > <quote>
> > > > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > > > apple_firestorm_pmu/cycles/ -e cycles ls
> > > > Using CPUID 0x00000000612f0280
> > > > Attempt to add: apple_icestorm_pmu/cycles=0/
> > > > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > > > Opening: unknown-hardware:HG
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > config 0xb00000000
> > > > disabled 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > > > sys_perf_event_open failed, error -95
> > > > Attempt to add: apple_firestorm_pmu/cycles=0/
> > > > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > > > Control descriptor is not initialized
> > > > Opening: apple_icestorm_pmu/cycles/
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> > > > Opening: apple_firestorm_pmu/cycles/
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > > Opening: cycles
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> > > > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > > > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > > > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > > > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > > > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > > > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > > > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > > > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > > > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > > > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > > > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > > > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > > > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > > > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > > > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > > > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > > > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > > > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > > > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > > > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > > > apple_icestorm_pmu/cycles/: -1: 0 873709 0
> > > > apple_firestorm_pmu/cycles/: -1: 0 873709 0
> > > > cycles: -1: 0 873709 0
> > > > apple_icestorm_pmu/cycles/: 0 873709 0
> > > > apple_firestorm_pmu/cycles/: 0 873709 0
> > > > cycles: 0 873709 0
> > > >
> > > > Performance counter stats for 'ls':
> > > >
> > > > <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> > > > <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> > > > <not counted> cycles (0.00%)
> > > >
> > > > 0.000002250 seconds time elapsed
> > > >
> > > > 0.000000000 seconds user
> > > > 0.000000000 seconds sys
> > > > </quote>
> > > >
> > > > If I run the same thing on another CPU cluster (firestorm), I get
> > > > this:
> > > >
> > > > <quote>
> > > > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > > > apple_firestorm_pmu/cycles/ -e cycles ls
> > > > Using CPUID 0x00000000612f0280
> > > > Attempt to add: apple_icestorm_pmu/cycles=0/
> > > > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > > > Opening: unknown-hardware:HG
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > config 0xb00000000
> > > > disabled 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > > > sys_perf_event_open failed, error -95
> > > > Attempt to add: apple_firestorm_pmu/cycles=0/
> > > > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > > > Control descriptor is not initialized
> > > > Opening: apple_icestorm_pmu/cycles/
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> > > > Opening: apple_firestorm_pmu/cycles/
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> > > > Opening: cycles
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> > > > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > > > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > > > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > > > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > > > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > > > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > > > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > > > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > > > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > > > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > > > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > > > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > > > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > > > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > > > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > > > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > > > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > > > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > > > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > > > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > > > apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> > > > apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> > > > cycles: -1: 1034653 469125 469125
> > > > apple_icestorm_pmu/cycles/: 1035101 469125 469125
> > > > apple_firestorm_pmu/cycles/: 1035035 469125 469125
> > > > cycles: 1034653 469125 469125
> > > >
> > > > Performance counter stats for 'ls':
> > > >
> > > > 1,035,101 apple_icestorm_pmu/cycles/
> > > > 1,035,035 apple_firestorm_pmu/cycles/
> > > > 1,034,653 cycles
> > > >
> > > > 0.000001333 seconds time elapsed
> > > >
> > > > 0.000000000 seconds user
> > > > 0.000000000 seconds sys
> > > > </quote>
> > > >
> > > > which doesn't make any sense either. I really don't understand what
> > > > this PERF_TYPE_HARDWARE does here (the *real* types are 10 and 11),
> > > > nor what this 'cycle=0' stuff is.
> > > >
> > > > /puzzled
> > > >
> > > > M.
> > > >
> > > > --
> > > > Without deviation from the norm, progress is not possible.

2023-11-21 16:17:55

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > >
> > > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > Marc Zyngier <[email protected]> wrote:
> > > > > >
> > > > > > [Adding key people on Cc]
> > > > > >
> > > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > > Hector Martin <[email protected]> wrote:
> > > > > > >
> > > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > >
> > > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > > the PMU, but nothing works anymore.
> > > > > >
> > > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > > package, but that's obviously not going to last.
> > > > > >
> > > > > > I'm happy to test potential fixes.
> > > > >
> > > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > CPU):
> > > >
> > > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > that ${pmu}'s type and event namespace.
> > > >
> > > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > targeted to a specific PMU, it's semantically wrong to rewrite events like
> > > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > PERF_COUNT_HW_${EVENT}.
> > >
> > > If you name a PMU and an event then the event should only be opened on
> > > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > cycles event is opened it appears to be because it was explicitly
> > > requested.
> >
> > I think you've missed that the named PMU events are being erroneously transformed
> > into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> >
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> >
> > ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> >
> > Marc said that he bisected the issue down to commit:
> >
> > 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> >
> > ... so it looks like something is going wrong when the events are being parsed,
> > e.g. losing the HW PMU information?
>
> Ok, I think I'm getting confused by other things. This looks like the issue.
>
> I think it may be working as intended, but not how you intended :-) If
> a core PMU is listed and then a legacy event, the legacy event should
> be opened on the core PMU as a legacy event with the extended type
> set. This is to allow things like legacy cache events to be opened on
> a specified PMU. Legacy event names match with a higher priority than
> those in sysfs or json as they are hard coded.

That has never been the case previously, so this is user-visible breakage, and
it prevents users from being able to do the right thing, so I think that's a
broken design.

> Presumably the expectation was that by advertising a cycles event, presumably
> in sysfs, then this is what would be matched.

I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
*in that PMU's namespace* is used. Overriding that breaks long-established
practice and provides users with no recourse to get the behaviour they expect
(and previously had).

I do think that (regardless of whether this was the semantic you intended)
silently overriding events with legacy events is a bug, and one we should fix.
As I mentioned in another reply, just because the events have the same name
does not mean that they are semantically the same, so we're liable to give
people the wrong numbers anyhow.
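
A minimal sketch of the lookup being asked for, assuming the usual sysfs layout under /sys/bus/event_source/devices (a `type` file plus an `events/` directory per PMU). The helper names are hypothetical, and real event strings can contain more than this parser handles.

```python
from pathlib import Path


def parse_terms(text: str) -> dict[str, int]:
    """Parse a sysfs event string such as 'event=0x02,umask=0x1' into terms."""
    terms = {}
    for term in text.strip().split(","):
        key, _, value = term.partition("=")
        # Assumption: a bare term with no '=value' is treated as 1.
        terms[key] = int(value, 0) if value else 1
    return terms


def resolve_named_event(sysfs_root: Path, pmu: str, event: str) -> dict:
    """Resolve pmu/event/ from the PMU's own directory, never a legacy table."""
    pmu_dir = sysfs_root / pmu
    pmu_type = int((pmu_dir / "type").read_text())
    terms = parse_terms((pmu_dir / "events" / event).read_text())
    return {"type": pmu_type, "terms": terms}
```

The point of the sketch is that the returned type is the PMU's own dynamic type, not PERF_TYPE_HARDWARE, and the event encoding comes from that PMU's namespace.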

Can we fix this?

Mark.

2023-11-21 16:39:09

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > >
> > > On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > >
> > > > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > > Marc Zyngier <[email protected]> wrote:
> > > > > > >
> > > > > > > [Adding key people on Cc]
> > > > > > >
> > > > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > > > Hector Martin <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > > >
> > > > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > > > the PMU, but nothing works anymore.
> > > > > > >
> > > > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > > > package, but that's obviously not going to last.
> > > > > > >
> > > > > > > I'm happy to test potential fixes.
> > > > > >
> > > > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > > CPU):
> > > > >
> > > > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > > that ${pmu}'s type and event namespace.
> > > > >
> > > > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > > targeted to a specific PMU, it's semantically wrong to rewrite events like
> > > > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > > PERF_COUNT_HW_${EVENT}.
> > > >
> > > > If you name a PMU and an event then the event should only be opened on
> > > > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > cycles event is opened it appears to be because it was explicitly
> > > > requested.
> > >
> > > I think you've missed that the named PMU events are being erroneously transformed
> > > into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > >
> > > Opening: apple_firestorm_pmu/cycles/
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > >
> > > ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > >
> > > Marc said that he bisected the issue down to commit:
> > >
> > > 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > >
> > > ... so it looks like something is going wrong when the events are being parsed,
> > > e.g. losing the HW PMU information?
> >
> > Ok, I think I'm getting confused by other things. This looks like the issue.
> >
> > I think it may be working as intended, but not how you intended :-) If
> > a core PMU is listed and then a legacy event, the legacy event should
> > be opened on the core PMU as a legacy event with the extended type
> > set. This is to allow things like legacy cache events to be opened on
> > a specified PMU. Legacy event names match with a higher priority than
> > those in sysfs or json as they are hard coded.
>
> That has never been the case previously, so this is user-visible breakage, and
> it prevents users from being able to do the right thing, so I think that's a
> broken design.

So the problem was caused by ARM and Intel doing two different things.
Intel did at least contribute to the perf tool in support for their
BIG.little/hybrid, so that's why the semantics match their approach.

> > Presumably the expectation was that by advertising a cycles event, presumably
> > in sysfs, then this is what would be matched.
>
> I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
> *in that PMU's namespace* is used. Overriding that breaks long-established
> practice and provides users with no recourse to get the behaviour they expect
> (and previously had).

On ARM but not Intel.

> I do think that (regardless of whether this was the semantic you intended)
> silently overriding events with legacy events is a bug, and one we should fix.
> As I mentioned in another reply, just because the events have the same name
> does not mean that they are semantically the same, so we're liable to give
> people the wrong numbers anyhow.
>
> Can we fix this?

So I'd like to fix this. Some things from various conversations:

1) we lack testing. Our testing relies on the sysfs of the machine
being run on, which is better than nothing. I think ideally we'd have
a collection of zipped up sysfs directories and then we could have a
test that asserts on ARM you get the behavior you want.

2) for RISC-V they want to make the legacy event matching something in
user land to simplify the PMU driver.

3) I'd like to get rid of the PMU json interface. My idea is to
convert json events/metrics into sysfs style files, zip these up and
then link them into the perf binary. On Intel the json is 70% of the
binary (7MB out of 10MB) and we may get this down to 3MB with this
approach. The json lookup would need to incorporate the cpuid matching
that currently exists. When we look up an event I'd like the approach
to be like unionfs with a specified but configurable order. Users
could provide directories of their own events/metrics for various
PMUs, and then this approach could be used to help with (1).
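
The unionfs-like lookup in (3) might look roughly like this. It is only a sketch of the idea: the directory layout (<layer>/<pmu>/events/<event>) and the function name are assumptions, not an existing perf interface.

```python
from pathlib import Path
from typing import Optional


def lookup_event(layers: list[Path], pmu: str, event: str) -> Optional[str]:
    """Return the first definition of pmu/event found, searching layers in order."""
    for layer in layers:
        candidate = layer / pmu / "events" / event
        if candidate.is_file():
            return candidate.read_text().strip()
    return None
```

Putting a user-supplied directory ahead of the built-in one lets it override individual events. A test for (1) could pass a captured sysfs tree as the only layer.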

Those proposals are not something to add as a -rc fix, so what I think
you're asking for here is an "if ARM" fix somewhere in the event
parsing. That's of course possible but it will cause problems if you
did say:

perf stat -e arm_pmu/LLC-load-misses/ ...

as I doubt the PMU driver is advertising this legacy event in sysfs
and the "if ARM" logic would presumably be trying to disable legacy
events in the term list for the ARM PMU.

Given all of this, is anything actually broken and needing a fix for 6.7?

Thanks,
Ian

> Mark.

2023-11-21 23:43:44

by Bagas Sanjaya

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 09:08:48PM +0900, Hector Martin wrote:
> Perf broke on all Apple ARM64 systems (tested almost everything), and
> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
>
> Test command:
>
> sudo taskset -c 0 ./perf stat -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
>
> Since this is taskset to CPU #0 (LITTLE core, icestorm), only events for
> icestorm are expected.
>
> I bisected the breakage to two distinct points:
>
> 5ea8f2ccffb is the first bad commit. With its parent, the output is as
> expected (same as v6.4):
>
> 3,297,462 apple_icestorm_pmu/cycles/
>
> <not counted> apple_firestorm_pmu/cycles/
> (0.00%)
> <not counted> cycles
> (0.00%)
>
> With 5ea8f2ccffb everything breaks:
>
> <not supported> apple_icestorm_pmu/cycles/
>
> <not supported> apple_firestorm_pmu/cycles/
>
> <not counted> cycles
> (0.00%)
>
> Somewhere along the way to 82fe2e45cdb00 things get even worse (didn't
> bother bisecting this range). With its parent:
>
> <not supported> apple_icestorm_pmu/cycles/
>
> <not supported> apple_firestorm_pmu/cycles/
>
> <not supported> apple_icestorm_pmu/cycles/
>
> <not supported> apple_firestorm_pmu/cycles/
>
> Then 82fe2e45cdb00 leads to the current v6.5 behavior:
>
> <not counted> apple_icestorm_pmu/cycles/
> (0.00%)
> <not counted> apple_firestorm_pmu/cycles/
> (0.00%)
> <not counted> cycles
> (0.00%)
>
> If I taskset the task to CPU#2 (big core, firestorm), I get events:
>
> 1,454,858 apple_icestorm_pmu/cycles/
>
> 1,454,760 apple_firestorm_pmu/cycles/
>
> 1,454,384 cycles
>
>
> So the current behavior is that all output seems to come from the
> firestorm PMU event counter, regardless of requested event.
>
> This is all unchanged and still broken in v6.7-rc2.
>

Thanks for the regression report (and it has been handled well already).
I'm adding it to regzbot for tracking:

#regzbot ^introduced: 5ea8f2ccffb239

--
An old man doll... just what I always wanted! - Clara



2023-11-22 03:24:16

by Hector Martin

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5



On 2023/11/22 1:38, Ian Rogers wrote:
> On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
>>
>> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
>>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
>>>>
>>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
>>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
>>>>>>
>>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
>>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
>>>>>>> Marc Zyngier <[email protected]> wrote:
>>>>>>>>
>>>>>>>> [Adding key people on Cc]
>>>>>>>>
>>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
>>>>>>>> Hector Martin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
>>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
>>>>>>>>
>>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
>>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
>>>>>>>> the PMU, but nothing works anymore.
>>>>>>>>
>>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
>>>>>>>> package, but that's obviously not going to last.
>>>>>>>>
>>>>>>>> I'm happy to test potential fixes.
>>>>>>>
>>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
>>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
>>>>>>> CPU):
>>>>>>
>>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
>>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
>>>>>> that ${pmu}'s type and event namespace.
>>>>>>
>>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
>>>>>> targetted to a specific PMU, it's semantically wrong to rewrite events like
>>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
>>>>>> PERF_COUNT_HW_${EVENT}.
>>>>>
>>>>> If you name a PMU and an event then the event should only be opened on
>>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
>>>>> cycles event is opened it appears to be because it was explicitly
>>>>> requested.
>>>>
>>>> I think you've missed that the named PMU events are being erroneously transformed
>>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
>>>>
>>>> Opening: apple_firestorm_pmu/cycles/
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
>>>>
>>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
>>>>
>>>> Marc said that he bisected the issue down to commit:
>>>>
>>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
>>>>
>>>> ... so it looks like something is going wrong when the events are being parsed,
>>>> e.g. losing the HW PMU information?
>>>
>>> Ok, I think I'm getting confused by other things. This looks like the issue.
>>>
>>> I think it may be working as intended, but not how you intended :-) If
>>> a core PMU is listed and then a legacy event, the legacy event should
>>> be opened on the core PMU as a legacy event with the extended type
>>> set. This is to allow things like legacy cache events to be opened on
>>> a specified PMU. Legacy event names match with a higher priority than
>>> those in sysfs or json as they are hard coded.
>>
>> That has never been the case previously, so this is user-visible breakage, and
>> it prevents users from being able to do the right thing, so I think that's a
>> broken design.
>
> So the problem was caused by ARM and Intel doing two different things.
> Intel did at least contribute to the perf tool in support for their
> BIG.little/hybrid, so that's why the semantics match their approach.
>
>>> Presumably the expectation was that by advertising a cycles event, presumably
>>> in sysfs, then this is what would be matched.
>>
>> I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
>> *in that PMU's namespace* is used. Overriding that breaks long-established
>> practice and provides users with no recourse to get the behaviour they expect
>> (and previously had).
>
> On ARM but not Intel.
>
>> I do think that (regardless of whether this was the semantic you intended)
>> silently overriding events with legacy events is a bug, and one we should fix.
>> As I mentioned in another reply, just because the events have the same name
>> does not mean that they are semantically the same, so we're liable to give
>> people the wrong numbers anyhow.
>>
>> Can we fix this?
>
> So I'd like to fix this, some things from various conversations:
>
> 1) we lack testing. Our testing relies on the sysfs of the machine
> being run on, which is better than nothing. I think ideally we'd have
> a collection of zipped up sysfs directories and then we could have a
> test that asserts on ARM you get the behavior you want.
>
> 2) for RISC-V they want to make the legacy event matching something in
> user land to simplify the PMU driver.
>
> 3) I'd like to get rid of the PMU json interface. My idea is to
> convert json events/metrics into sysfs style files, zip these up and
> then link them into the perf binary. On Intel the json is 70% of the
> binary (7MB out of 10MB) and we may get this down to 3MB with this
> approach. The json lookup would need to incorporate the cpuid matching
> that currently exists. When we look up an event I'd like the approach
> to be like unionfs with a specified but configurable order. Users
> could provide directories of their own events/metrics for various
> PMUs, and then this approach could be used to help with (1).
>
> Those proposals are not something to add as a -rc fix, so what I think
> you're asking for here is a "if ARM" fix somewhere in the event
> parsing. That's of course possible but it will cause problems if you
> did say:
>
> perf stat -e arm_pmu/LLC-load-misses/ ...
>
> as I doubt the PMU driver is advertising this legacy event in sysfs
> and the "if ARM" logic would presumably be trying to disable legacy
> events in the term list for the ARM PMU.
>
> Given all of this, is anything actually broken and needing a fix for 6.7?

You literally cannot use perf correctly on ARM big.LITTLE systems since
6.5, while it worked fine on 6.4. So, yes, it's broken and it needs
fixing. This is a major regression.

$ taskset -c 0 perf stat -e apple_icestorm_pmu/cycles/ echo


Performance counter stats for 'echo':

<not counted> apple_icestorm_pmu/cycles/u
(0.00%)

0.001385544 seconds time elapsed

0.001375000 seconds user
0.000000000 seconds sys


$ taskset -c 2 perf stat -e apple_firestorm_pmu/cycles/ echo


Performance counter stats for 'echo':

169,965 apple_firestorm_pmu/cycles/u


0.000466667 seconds time elapsed

0.000475000 seconds user
0.000000000 seconds sys


Both of those should return counts. One does not, and it doesn't even
seem to be predictable which one you get. *On my particular system, it
is currently impossible to get any performance counter data from the E
cores, as far as I can tell, no matter how you invoke perf*.

Feel free to argue semantics as to what went wrong or how it should be
fixed, but there is no question that this is a regression that requires
a fix. Perf is currently simply broken here, where it wasn't in 6.4.

- Hector

2023-11-22 13:04:30

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 08:38:45AM -0800, Ian Rogers wrote:
> On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > >
> > > > On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > > > Marc Zyngier <[email protected]> wrote:
> > > > > > > >
> > > > > > > > [Adding key people on Cc]
> > > > > > > >
> > > > > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > > > > Hector Martin <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > > > >
> > > > > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > > > > the PMU, but nothing works anymore.
> > > > > > > >
> > > > > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > > > > package, but that's obviously not going to last.
> > > > > > > >
> > > > > > > > I'm happy to test potential fixes.
> > > > > > >
> > > > > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > > > CPU):
> > > > > >
> > > > > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > > > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > > > that ${pmu}'s type and event namespace.
> > > > > >
> > > > > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > > > targetted to a specific PMU, it's semantically wrong to rewrite events like
> > > > > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > > > PERF_COUNT_HW_${EVENT}.
> > > > >
> > > > > If you name a PMU and an event then the event should only be opened on
> > > > > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > > cycles event is opened it appears to be because it was explicitly
> > > > > requested.
> > > >
> > > > I think you've missed that the named PMU events are being erroneously transformed
> > > > into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > >
> > > > Opening: apple_firestorm_pmu/cycles/
> > > > ------------------------------------------------------------
> > > > perf_event_attr:
> > > > type 0 (PERF_TYPE_HARDWARE)
> > > > size 136
> > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > sample_type IDENTIFIER
> > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > disabled 1
> > > > inherit 1
> > > > enable_on_exec 1
> > > > exclude_guest 1
> > > > ------------------------------------------------------------
> > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > >
> > > > ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > >
> > > > Marc said that he bisected the issue down to commit:
> > > >
> > > > 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > >
> > > > ... so it looks like something is going wrong when the events are being parsed,
> > > > e.g. losing the HW PMU information?
> > >
> > > Ok, I think I'm getting confused by other things. This looks like the issue.
> > >
> > > I think it may be working as intended, but not how you intended :-) If
> > > a core PMU is listed and then a legacy event, the legacy event should
> > > be opened on the core PMU as a legacy event with the extended type
> > > set. This is to allow things like legacy cache events to be opened on
> > > a specified PMU. Legacy event names match with a higher priority than
> > > those in sysfs or json as they are hard coded.
> >
> > That has never been the case previously, so this is user-visible breakage, and
> > it prevents users from being able to do the right thing, so I think that's a
> > broken design.
>
> So the problem was caused by ARM and Intel doing two different things.
> Intel did at least contribute to the perf tool in support for their
> BIG.little/hybrid, so that's why the semantics match their approach.

I appreciate that, and I agree that from the Arm side we haven't been as
engaged with userspace on this front (please understand I'm the messenger here,
this is something I've repeatedly asked for within Arm).

Regardless, I don't think that changes the substance of the bug, which is that
we're converting named-pmu events into entirely different PERF_TYPE_HARDWARE
events.

I agree that expanding plain legacy event names to a set of PMU-targetted legacy
events makes sense (and even for Arm, that's the right thing to do, IMO). If
I ask for 'cycles' and that gets expanded to multiple legacy cycles events that
target specific CPU PMUs, that's good.
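
A minimal sketch of that expansion, assuming the extended-type encoding
from the perf_event_open UAPI (PMU type in the upper 32 bits of config).
The function name and the PMU type numbers are made up for illustration;
on a real system the types come from /sys/devices/<pmu>/type.

```python
# Constants mirror include/uapi/linux/perf_event.h.
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_PMU_TYPE_SHIFT = 32


def expand_legacy_event(hw_event_id, core_pmus):
    """Expand one bare legacy event into one attr per core PMU.

    core_pmus maps PMU name -> dynamic type number (hypothetical values
    here; normally read from /sys/devices/<pmu>/type).
    """
    attrs = []
    for name, pmu_type in sorted(core_pmus.items()):
        attrs.append({
            "pmu": name,
            "type": PERF_TYPE_HARDWARE,
            # Extended type: PMU type in the high 32 bits, event id below.
            "config": (pmu_type << PERF_PMU_TYPE_SHIFT) | hw_event_id,
        })
    return attrs


# Hypothetical type numbers for a two-PMU system.
pmus = {"apple_icestorm_pmu": 10, "apple_firestorm_pmu": 11}
for attr in expand_legacy_event(PERF_COUNT_HW_CPU_CYCLES, pmus):
    print(attr["pmu"], hex(attr["config"]))
```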

The thing that doesn't make sense here is converting named-pmu events into
legacy events. If I ask for 'apple_firestorm_pmu/cycles/', that should be the
'cycles' event in the apple_firestorm_pmu's event namespace, and *shouldn't* be
converted to a (potentially semantically different) PERF_TYPE_HARDWARE event,
even if that's targetted towards the apple_firestorm_pmu. I think that should
be true for *any* PMU, whether that's an arm/x86/whatever CPU PMU or a system
PMU.

> > > Presumably the expectation was that by advertising a cycles event, presumably
> > > in sysfs, then this is what would be matched.

Yes. That's how this has always worked prior to the changes Marc referenced.
Note that this can *also* be expanded to events from json databases, but was
*never* previously silently converted to a PERF_TYPE_HARDWARE event.

Please note that the events in sysfs are *namespaced* to the PMU (specifically,
when using that PMU's dynamic type); they are not necessarily the same as
legacy events (though they may have similar or matching
names in some cases), they may be semantically distinct from the legacy events
even if the names match, and it is incorrect to conflate the two.
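
To make the difference concrete, here is a simplified model of the two
lookup orders being discussed. It is only a sketch: the function, the
dict layout and the example PMU values are hypothetical, and the real
parser in tools/perf is considerably more involved.

```python
# Legacy names hard coded in the tool (only a few shown).
LEGACY_HW_EVENTS = {"cycles": 0, "cpu-cycles": 0, "instructions": 1}


def resolve_named_event(pmu, event, legacy_first):
    """Resolve ${pmu}/${event}/ to a (type, config) pair.

    pmu is a dict with the PMU's dynamic "type" and its sysfs "events"
    (name -> config). legacy_first=True models legacy names taking
    priority; legacy_first=False models a lookup that stays inside the
    PMU's own namespace.
    """
    if legacy_first and event in LEGACY_HW_EVENTS:
        # PERF_TYPE_HARDWARE (0) with the PMU type in the upper 32 bits.
        return 0, (pmu["type"] << 32) | LEGACY_HW_EVENTS[event]
    if event in pmu["events"]:
        # The PMU's own namespace: dynamic type plus the sysfs encoding.
        return pmu["type"], pmu["events"][event]
    raise KeyError(f"{event} not found in PMU namespace")


# Hypothetical PMU: type number and event encoding are illustrative.
firestorm = {"type": 10, "events": {"cycles": 0x02}}
print(resolve_named_event(firestorm, "cycles", legacy_first=False))
print(resolve_named_event(firestorm, "cycles", legacy_first=True))
```

The two calls return different (type, config) pairs for the same event
string, which is the conflation described above.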

> > I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
> > *in that PMU's namespace* is used. Overriding that breaks long-established
> > practice and provides users with no recourse to get the behaviour they expect
> > (and previously had).
>
> On ARM but not Intel.

As above, I don't think the CPU architecture matters here for the case that I'm
saying is broken. I think that regardless of CPU architecture (or for any
non-CPU PMU) it is semantically incorrect to convert a named-pmu event to a
legacy event.

> > I do think that (regardless of whether this was the semantic you intended)
> > silently overriding events with legacy events is a bug, and one we should fix.
> > As I mentioned in another reply, just because the events have the same name
> > does not mean that they are semantically the same, so we're liable to give
> > people the wrong numbers anyhow.
> >
> > Can we fix this?
>
> So I'd like to fix this, some things from various conversations:
>
> 1) we lack testing. Our testing relies on the sysfs of the machine
> being run on, which is better than nothing. I think ideally we'd have
> a collection of zipped up sysfs directories and then we could have a
> test that asserts on ARM you get the behavior you want.

I agree we lack testing, and I'd be happy to help here going forwards, though I
don't think this is a prerequisite for fixing this issue.
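
As a rough illustration of what a test driven by a captured sysfs tree
could look like (a sketch only; the helper names are hypothetical and
this is not how the existing perf tests are structured), one can build
a fake /sys/devices layout in a temporary directory and assert on it:

```python
import os
import tempfile


def make_fake_sysfs(root, pmus):
    """Write a minimal /sys/devices-like tree under root.

    pmus maps PMU name -> (type number, {event name: encoding string}).
    """
    for name, (pmu_type, events) in pmus.items():
        events_dir = os.path.join(root, name, "events")
        os.makedirs(events_dir)
        with open(os.path.join(root, name, "type"), "w") as f:
            f.write(f"{pmu_type}\n")
        for event, encoding in events.items():
            with open(os.path.join(events_dir, event), "w") as f:
                f.write(encoding + "\n")


def list_pmu_events(root, pmu):
    """Return the sorted event names a PMU advertises in the tree."""
    return sorted(os.listdir(os.path.join(root, pmu, "events")))


with tempfile.TemporaryDirectory() as root:
    # Example values only; a real snapshot would be captured from hardware.
    make_fake_sysfs(root, {"armv8_cortex_a53": (7, {"cpu_cycles": "event=0x0011"})})
    events = list_pmu_events(root, "armv8_cortex_a53")
    # The kind of assertion a parser test could build on.
    assert "cycles" not in events
    print(events)
```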

> 2) for RISC-V they want to make the legacy event matching something in
> user land to simplify the PMU driver.

Ok; I see how this might be related, but it doesn't sound like a prerequisite
for fixing this issue -- there are plenty of people in this thread who can
test.

> 3) I'd like to get rid of the PMU json interface. My idea is to
> convert json events/metrics into sysfs style files, zip these up and
> then link them into the perf binary. On Intel the json is 70% of the
> binary (7MB out of 10MB) and we may get this down to 3MB with this
> approach. The json lookup would need to incorporate the cpuid matching
> that currently exists. When we look up an event I'd like the approach
> to be like unionfs with a specified but configurable order. Users
> could provide directories of their own events/metrics for various
> PMUs, and then this approach could be used to help with (1).

I can see how that might interact with whatever changes we make to fix this
issue, but this seems like a future aspiration, and not a prerequisite for
fixing the existing functional regression.

> Those proposals are not something to add as a -rc fix, so what I think
> you're asking for here is a "if ARM" fix somewhere in the event
> parsing. That's of course possible but it will cause problems if you
> did say:
>
> perf stat -e arm_pmu/LLC-load-misses/ ...

As above, I do not think this is an arm-specific issue, we're just the canary
in the coalmine.

Please note that:

perf stat -e arm_pmu/LLC-load-misses/ ...

... would never have worked previously. No arm_pmu instances have a
"LLC-load-misses" event in their event namespaces, and we don't have any
userspace file mapping that event.

That said, if I really wanted that legacy event, I'd have asked for it bare,
e.g.

perf stat -e LLC-load-misses

... and we're in agreement that it's sensible to expand this to multiple
PERF_TYPE_HARDWARE events targeting the individual CPU PMUs.

So I see no need to do anything to have magic for 'arm_pmu/LLC-load-misses/'.

> as I doubt the PMU driver is advertising this legacy event in sysfs
> and the "if ARM" logic would presumably be trying to disable legacy
> events in the term list for the ARM PMU.
>
> Given all of this, is anything actually broken and needing a fix for 6.7?

There is absolutely a bug that needs to be fixed here (and needs to be
backported to stable so that it gets picked up by distributions).

Thanks,
Mark.

2023-11-22 13:06:53

by Arnaldo Carvalho de Melo

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> On 2023/11/22 1:38, Ian Rogers wrote:
> > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> >>>>>>> Marc Zyngier <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> [Adding key people on Cc]
> >>>>>>>>
> >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> >>>>>>>> Hector Martin <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> >>>>>>>>
> >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> >>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
> >>>>>>>> the PMU, but nothing works anymore.
> >>>>>>>>
> >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> >>>>>>>> package, but that's obviously not going to last.
> >>>>>>>>
> >>>>>>>> I'm happy to test potential fixes.
> >>>>>>>
> >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> >>>>>>> CPU):
> >>>>>>
> >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> >>>>>> that ${pmu}'s type and event namespace.
> >>>>>>
> >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> >>>>>> targetted to a specific PMU, it's semantically wrong to rewrite events like
> >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> >>>>>> PERF_COUNT_HW_${EVENT}.
> >>>>>
> >>>>> If you name a PMU and an event then the event should only be opened on
> >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> >>>>> cycles event is opened it appears to be because it was explicitly
> >>>>> requested.
> >>>>
> >>>> I think you've missed that the named PMU events are being erroneously transformed
> >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> >>>>
> >>>> Opening: apple_firestorm_pmu/cycles/
> >>>> ------------------------------------------------------------
> >>>> perf_event_attr:
> >>>> type 0 (PERF_TYPE_HARDWARE)
> >>>> size 136
> >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> >>>> sample_type IDENTIFIER
> >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> >>>> disabled 1
> >>>> inherit 1
> >>>> enable_on_exec 1
> >>>> exclude_guest 1
> >>>> ------------------------------------------------------------
> >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> >>>>
> >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> >>>>
> >>>> Marc said that he bisected the issue down to commit:
> >>>>
> >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> >>>>
> >>>> ... so it looks like something is going wrong when the events are being parsed,
> >>>> e.g. losing the HW PMU information?
> >>>
> >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> >>>
> >>> I think it may be working as intended, but not how you intended :-) If
> >>> a core PMU is listed and then a legacy event, the legacy event should

The point is that "cycles", when prefixed with "pmu/", shouldn't be
treated as the legacy "cycles" (HW/0); in that setting it is "cycles"
for that PMU. (But we only have "cpu_cycles" on at least the a53 and a72
PMUs I have access to, on a Libre Computer Rockchip 3399-pc hybrid
board; if we use that name we get what we want/had before, see below.)

And there is an attempt at using the specified PMU, see the first
perf_event_open:

root@roc-rk3399-pc:~# strace -e perf_event_open perf stat -vv -e cycles,armv8_cortex_a53/cycles/,armv8_cortex_a72/cycles/ echo
Using CPUID 0x00000000410fd082
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
config 0x700000000
disabled 1
------------------------------------------------------------
sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)

//// HERE: it tries config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES taking into
//account the PMU number 0x7

root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/type
7
root@roc-rk3399-pc:~#
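
For reference, a small sketch of how that config value splits into the
PMU type and the legacy event id. The shift and mask mirror
PERF_PMU_TYPE_SHIFT and PERF_HW_EVENT_MASK from the perf_event UAPI
header; the helper name is made up for illustration.

```python
# Constants mirror include/uapi/linux/perf_event.h.
PERF_PMU_TYPE_SHIFT = 32
PERF_HW_EVENT_MASK = 0xFFFFFFFF


def split_extended_config(config):
    """Split an extended PERF_TYPE_HARDWARE config into (pmu_type, event_id)."""
    return config >> PERF_PMU_TYPE_SHIFT, config & PERF_HW_EVENT_MASK


pmu_type, event_id = split_extended_config(0x700000000)
print(pmu_type, event_id)  # prints: 7 0
```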

But then we don't have "cycles" in that PMU:

root@roc-rk3399-pc:~# ls -la /sys/devices/armv8_cortex_a53/events/cycles
ls: cannot access '/sys/devices/armv8_cortex_a53/events/cycles': No such file or directory
root@roc-rk3399-pc:~#

Maybe:

root@roc-rk3399-pc:~# taskset -c 5,6 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
Using CPUID 0x00000000410fd034
Control descriptor is not initialized

armv8_cortex_a53/cpu_cycles/: 0 2079000 0
armv8_cortex_a72/cpu_cycles/: 2488961 2079000 2079000

Performance counter stats for 'echo':

<not counted> armv8_cortex_a53/cpu_cycles/ (0.00%)
2488961 armv8_cortex_a72/cpu_cycles/

0.003449266 seconds time elapsed

0.003502000 seconds user
0.000000000 seconds sys


root@roc-rk3399-pc:~# taskset -c 0,1,2,3,4 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
Using CPUID 0x00000000410fd034
Control descriptor is not initialized

armv8_cortex_a53/cpu_cycles/: 2986601 6999416 6999416
armv8_cortex_a72/cpu_cycles/: 0 6999416 0

Performance counter stats for 'echo':

2986601 armv8_cortex_a53/cpu_cycles/
<not counted> armv8_cortex_a72/cpu_cycles/ (0.00%)

0.011434508 seconds time elapsed

0.003911000 seconds user
0.007454000 seconds sys


root@roc-rk3399-pc:~#

root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/events/cpu_cycles
event=0x0011
root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a72/events/cpu_cycles
event=0x0011
root@roc-rk3399-pc:~#

And the syscalls seem sane:

root@roc-rk3399-pc:~# strace -e perf_event_open taskset -c 0,1,2,3,4 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
Using CPUID 0x00000000410fd034
Control descriptor is not initialized
perf_event_open({type=0x7 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */, config=0x11, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 14573, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */, config=0x11, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 14573, -1, -1, PERF_FLAG_FD_CLOEXEC) = 4

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14573, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
armv8_cortex_a53/cpu_cycles/: 3227098 4480875 4480875
armv8_cortex_a72/cpu_cycles/: 0 4480875 0

Performance counter stats for 'echo':

3227098 armv8_cortex_a53/cpu_cycles/
<not counted> armv8_cortex_a72/cpu_cycles/ (0.00%)

0.008381759 seconds time elapsed

0.004064000 seconds user
0.004121000 seconds sys


--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=14572, si_uid=0} ---
+++ exited with 0 +++
root@roc-rk3399-pc:~#

As:

root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/type
7
root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a72/type
8
root@roc-rk3399-pc:~#

See the type=0x7 and type=0x8.
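
A minimal sketch of how those attr values follow from the sysfs files
shown above. The helper names are hypothetical, and it assumes the
"event" field occupies the low bits of config, which matches the
syscalls here but has to be checked against <pmu>/format/ in general.

```python
def parse_event_terms(encoding):
    """Parse a sysfs event string such as "event=0x0011" into a dict."""
    terms = {}
    for term in encoding.strip().split(","):
        name, _, value = term.partition("=")
        # A term with no "=value" part is treated as a flag set to 1.
        terms[name] = int(value, 0) if value else 1
    return terms


def named_pmu_attr(pmu_type, encoding):
    """Build the type/config pair for a named-PMU event.

    Assumption: the PMU's "event" format field maps directly onto the
    low bits of config; real code must consult <pmu>/format/.
    """
    return {"type": pmu_type, "config": parse_event_terms(encoding)["event"]}


print(named_pmu_attr(7, "event=0x0011"))  # armv8_cortex_a53/cpu_cycles/
print(named_pmu_attr(8, "event=0x0011"))  # armv8_cortex_a72/cpu_cycles/
```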

So what we need here seems to be to translate the generic term "cycles"
to "cpu_cycles" when a PMU is explicitly passed in the event name and
it doesn't have "cycles" and then just retry.
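
Roughly, as a sketch (the alias table and function name are
hypothetical, not existing perf code):

```python
# Hypothetical alias table: generic name -> PMU-specific names to try.
GENERIC_ALIASES = {"cycles": ["cpu_cycles"]}


def pick_event_name(requested, pmu_events):
    """Pick the sysfs event to use for ${pmu}/${requested}/.

    pmu_events is the set of names under /sys/devices/<pmu>/events.
    Returns None when neither the name nor any alias exists.
    """
    if requested in pmu_events:
        return requested
    # The PMU lacks the requested name: retry with its known aliases.
    for alias in GENERIC_ALIASES.get(requested, []):
        if alias in pmu_events:
            return alias
    return None


print(pick_event_name("cycles", {"cpu_cycles", "inst_retired"}))
```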

- Arnaldo

2023-11-22 15:30:54

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 5:04 AM Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 08:38:45AM -0800, Ian Rogers wrote:
> > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > >
> > > On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > > On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > > >
> > > > > On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > > > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > > > > Marc Zyngier <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > [Adding key people on Cc]
> > > > > > > > >
> > > > > > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > > > > > Hector Martin <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > > > > >
> > > > > > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > > > > > the PMU, but nothing works anymore.
> > > > > > > > >
> > > > > > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > > > > > package, but that's obviously not going to last.
> > > > > > > > >
> > > > > > > > > I'm happy to test potential fixes.
> > > > > > > >
> > > > > > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > > > > CPU):
> > > > > > >
> > > > > > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > > > > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > > > > that ${pmu}'s type and event namespace.
> > > > > > >
> > > > > > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > > > > targetted to a specific PMU, it's semantically wrong to rewrite events like
> > > > > > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > > > > PERF_COUNT_HW_${EVENT}.
> > > > > >
> > > > > > If you name a PMU and an event then the event should only be opened on
> > > > > > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > > > cycles event is opened it appears to be because it was explicitly
> > > > > > requested.
> > > > >
> > > > > > I think you've missed that the named PMU events are being erroneously transformed
> > > > > into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > > >
> > > > > Opening: apple_firestorm_pmu/cycles/
> > > > > ------------------------------------------------------------
> > > > > perf_event_attr:
> > > > > type 0 (PERF_TYPE_HARDWARE)
> > > > > size 136
> > > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > > sample_type IDENTIFIER
> > > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > > disabled 1
> > > > > inherit 1
> > > > > enable_on_exec 1
> > > > > exclude_guest 1
> > > > > ------------------------------------------------------------
> > > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > > >
> > > > > ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > > >
> > > > > Marc said that he bisected the issue down to commit:
> > > > >
> > > > > 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > > >
> > > > > ... so it looks like something is going wrong when the events are being parsed,
> > > > > e.g. losing the HW PMU information?
> > > >
> > > > Ok, I think I'm getting confused by other things. This looks like the issue.
> > > >
> > > > I think it may be working as intended, but not how you intended :-) If
> > > > a core PMU is listed and then a legacy event, the legacy event should
> > > > be opened on the core PMU as a legacy event with the extended type
> > > > set. This is to allow things like legacy cache events to be opened on
> > > > a specified PMU. Legacy event names match with a higher priority than
> > > > those in sysfs or json as they are hard coded.
> > >
> > > That has never been the case previously, so this is user-visible breakage, and
> > > it prevents users from being able to do the right thing, so I think that's a
> > > broken design.
> >
> > So the problem was caused by ARM and Intel doing two different things.
> > Intel did at least contribute to the perf tool in support for their
> > BIG.little/hybrid, so that's why the semantics match their approach.
>
> I appreciate that, and I agree that from the Arm side we haven't been as
> engaged with userspace on this front (please understand I'm the messenger here,
> this is something I've repeatedly asked for within Arm).
>
> Regardless, I don't think that changes the substance of the bug, which is that
> we're converting named-pmu events into entirely different PERF_TYPE_HARDWARE
> events.
>
> I agree that expanding plain legacy event names to a set of PMU-targeted legacy
> events makes sense (and even for Arm, that's the right thing to do, IMO). If
> I ask for 'cycles' and that gets expanded to multiple legacy cycles events that
> target specific CPU PMUs, that's good.
>
> The thing that doesn't make sense here is converting named-pmu events into
> legacy events. If I ask for 'apple_firestorm_pmu/cycles/', that should be the
> 'cycles' event in the apple_firestorm_pmu's event namespace, and *shouldn't* be
> converted to a (potentially semantically different) PERF_TYPE_HARDWARE event,
> even if that's targeted towards the apple_firestorm_pmu. I think that should
> be true for *any* PMU, whether that's an arm/x86/whatever CPU PMU or a system
> PMU.

This is saying that legacy events have lower priority than sysfs events. We
haven't done this historically, as it requires extra PMU setup. On an
Intel Tigerlake:

```
$ ls /sys/devices/cpu/events
branch-instructions  cache-misses      instructions  ref-cycles        topdown-be-bound
branch-misses        cache-references  mem-loads     slots             topdown-fe-bound
bus-cycles           cpu-cycles        mem-stores    topdown-bad-spec  topdown-retiring
```
Here (at least) branch-misses, bus-cycles, cache-references,
cpu-cycles and instructions overlap with legacy event names:
```
$ perf --version
perf version 6.5.6
$ perf stat -vv -e branch-misses,bus-cycles,cache-references,cpu-cycles,instructions true
Using CPUID GenuineIntel-6-8D-1
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0x5 (PERF_COUNT_HW_BRANCH_MISSES)
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0x6 (PERF_COUNT_HW_BUS_CYCLES)
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0x2 (PERF_COUNT_HW_CACHE_REFERENCES)
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
...
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0x1 (PERF_COUNT_HW_INSTRUCTIONS)
...
branch-misses: -1: 6571 826226 826226
bus-cycles: -1: 31411 826226 826226
cache-references: -1: 19507 826226 826226
cpu-cycles: -1: 1127215 826226 826226
instructions: -1: 1301583 826226 826226
branch-misses: 6571 826226 826226
bus-cycles: 31411 826226 826226
cache-references: 19507 826226 826226
cpu-cycles: 1127215 826226 826226
instructions: 1301583 826226 826226

Performance counter stats for 'true':
...
```
I.e., with perf 6.5, for all of these events, even though sysfs has events
with the same names, we're opening them with PERF_TYPE_HARDWARE.

> > > > Presumably the expectation was that by advertising a cycles event, presumably
> > > > in sysfs, then this is what would be matched.
>
> Yes. That's how this has always worked prior to the changes Marc referenced.
> Note that this can *also* be expanded to events from json databases, but was
> *never* previously silently converted to a PERF_TYPE_HARDWARE event.
>
> Please note that the events in sysfs are *namespaced* to the PMU (specifically,
> when using that PMU's dynamic type); they are not necessarily the same as
> legacy events (though they may have similar or matching
> names in some cases), they may be semantically distinct from the legacy events
> even if the names match, and it is incorrect to conflate the two.

This was a behavior added by Intel so that say cpu_atom/legacy-event/
would only open as a hardware event on that PMU. The point of the
blamed change is to make that behavior consistent for all core PMUs.
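
For reference, a rough sketch of what "a hardware event on that PMU"
means at the perf_event_attr level, assuming the extended-type encoding
where the PMU's sysfs type number goes in the upper 32 bits of config.
The helper names are made up for illustration, and the type value 7 is
only an example; real values come from /sys/devices/<pmu>/type.

```python
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_PMU_TYPE_SHIFT = 32        # PMU type sits in the upper 32 bits of config
PERF_HW_EVENT_MASK = 0xFFFFFFFF  # legacy event id sits in the lower 32 bits


def extended_hw_config(pmu_type, hw_id):
    """Build a PERF_TYPE_HARDWARE config aimed at one specific PMU."""
    return (pmu_type << PERF_PMU_TYPE_SHIFT) | hw_id


def split_extended_hw_config(config):
    """Return (pmu_type, hw_id) from an extended hardware config."""
    return config >> PERF_PMU_TYPE_SHIFT, config & PERF_HW_EVENT_MASK


# Example only: 7 stands in for whatever the PMU's sysfs "type" file says.
config = extended_hw_config(7, PERF_COUNT_HW_CPU_CYCLES)
print(hex(config))
```

The attr type stays PERF_TYPE_HARDWARE in this scheme; only config
carries the PMU selection.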

> > > I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
> > > *in that PMU's namespace* is used. Overriding that breaks long-established
> > > practice and provides users with no recourse to get the behaviour they expect
> > > (and previously had).
> >
> > On ARM but not Intel.
>
> As above, I don't think the CPU architecture matters here for the case that I'm
> saying is broken. I think that regardless of CPU architecture (or for any
> non-CPU PMU) it is semantically incorrect to convert a named-pmu event to a
> legacy event.

So perf's behavior has always been that legacy events take priority
over sysfs and json events. The distinction here is that a core PMU
is explicitly listed, and it doesn't seem unreasonable to use core PMU
names with legacy events, which is the behavior Intel added.
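
A toy model of that lookup order, not the tool's actual parser: the
function name, the dictionaries and the trimmed legacy table are all
assumptions for illustration, and the Apple sysfs encoding is a
placeholder string.

```python
# Trimmed, assumed table of legacy names -> PERF_COUNT_HW_* ids.
LEGACY_HW = {"cycles": 0, "cpu-cycles": 0, "instructions": 1}


def resolve_legacy_first(pmu, name, sysfs_events, json_events):
    """Resolve pmu/name/ with legacy names checked before sysfs and json."""
    if name in LEGACY_HW:
        return ("legacy", LEGACY_HW[name])
    if name in sysfs_events.get(pmu, {}):
        return ("sysfs", sysfs_events[pmu][name])
    if name in json_events.get(pmu, {}):
        return ("json", json_events[pmu][name])
    raise LookupError(f"unknown event {pmu}/{name}/")


sysfs = {
    # Placeholder value: the real encoding is whatever the driver exports.
    "apple_firestorm_pmu": {"cycles": "<pmu-specific encoding>"},
    "armv8_cortex_a53": {"cpu_cycles": "event=0x0011"},
}

# "cycles" aliases a legacy name, so the sysfs entry is never consulted.
print(resolve_legacy_first("apple_firestorm_pmu", "cycles", sysfs, {}))
# "cpu_cycles" (underscore) is not a legacy name, so sysfs wins.
print(resolve_legacy_first("armv8_cortex_a53", "cpu_cycles", sysfs, {}))
```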

> > > I do think that (regardless of whether this was the semantic you intended)
> > > silently overriding events with legacy events is a bug, and one we should fix.
> > > As I mentioned in another reply, just because the events have the same name
> > > does not mean that they are semantically the same, so we're liable to give
> > > people the wrong numbers anyhow.
> > >
> > > Can we fix this?
> >
> > So I'd like to fix this, some things from various conversations:
> >
> > 1) we lack testing. Our testing relies on the sysfs of the machine
> > being run on, which is better than nothing. I think ideally we'd have
> > a collection of zipped up sysfs directories and then we could have a
> > test that asserts on ARM you get the behavior you want.
>
> I agree we lack testing, and I'd be happy to help here going forwards, though I
> don't think this is a prerequisite for fixing this issue.
>
> > 2) for RISC-V they want to make the legacy event matching something in
> > user land to simplify the PMU driver.
>
> Ok; I see how this might be related, but it doesn't sound like a prerequisite
> for fixing this issue -- there are plenty of people in this thread who can
> test.
>
> > 3) I'd like to get rid of the PMU json interface. My idea is to
> > convert json events/metrics into sysfs style files, zip these up and
> > then link them into the perf binary. On Intel the json is 70% of the
> > binary (7MB out of 10MB) and we may get this down to 3MB with this
> > approach. The json lookup would need to incorporate the cpuid matching
> > that currently exists. When we look up an event I'd like the approach
> > to be like unionfs with a specified but configurable order. Users
> > could provide directories of their own events/metrics for various
> > PMUs, and then this approach could be used to help with (1).
>
> I can see how that might interact with whatever changes we make to fix this
> issue, but this seems like a future aspiration, and not a prerequisite for
> fixing the existing functional regression.
>
> > Those proposals are not something to add as a -rc fix, so what I think
> > you're asking for here is a "if ARM" fix somewhere in the event
> > parsing. That's of course possible but it will cause problems if you
> > did say:
> >
> > perf stat -e arm_pmu/LLC-load-misses/ ...
>
> As above, I do not think this is an arm-specific issue, we're just the canary
> in the coalmine.

Disagree, see comments above. A behavior change here would impact Intel.

> Please note that:
>
> perf stat -e arm_pmu/LLC-load-misses/ ...
>
> ... would never have worked previously. No arm_pmu instances have a
> "LLC-load-misses" event in their event namespaces, and we don't have any
> userspace file mapping that event.

This event was only meant as an example; perf list will show you the
events that work. The point is that a legacy event may not be
available on both BIG.little PMU types, so being able to designate the
PMU there is helpful.

> That said, If I really wanted that legacy event, I'd have asked for it bare,
> e.g.
>
> perf stat -e LLC-load-misses
>
> ... and we're in agreement that it's sensible to expand this to multiple
> PERF_TYPE_HARDWARE events targeting the individual CPU PMUs.
>
> So I see no need to do anything to have magic for 'arm_pmu/LLC-load-misses/'.
>
> > as I doubt the PMU driver is advertising this legacy event in sysfs
> > and the "if ARM" logic would presumably be trying to disable legacy
> > events in the term list for the ARM PMU.
> >
> > Given all of this, is anything actually broken and needing a fix for 6.7?
>
> There is absolutely a bug that needs to be fixed here (and needs to be
> backported to stable so that it gets picked up by distributions).

I'm not seeing this. The behavior is consistent with Intel, and this has
gone 2 releases without being spotted. It was triggered by a PMU event
name aliasing a legacy event name, and the behavior has always been that
legacy event names have higher priority than sysfs and json events.

Whilst I'm seeing a lot of complaining, I've not seen a proposal of
what behavior you want. Isn't it a PMU bug if the legacy event
specifying the PMU doesn't get opened by the core PMU? Fixing the PMU
driver appears to be the right fix and means there is consistency on
core events across architectures.

Thanks,
Ian

> Thanks,
> Mark.

2023-11-22 15:35:28

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 5:06 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> > On 2023/11/22 1:38, Ian Rogers wrote:
> > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> > >>>>>>> Marc Zyngier <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> [Adding key people on Cc]
> > >>>>>>>>
> > >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> > >>>>>>>> Hector Martin <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> > >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >>>>>>>>
> > >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > >>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
> > >>>>>>>> the PMU, but nothing works anymore.
> > >>>>>>>>
> > >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> > >>>>>>>> package, but that's obviously not going to last.
> > >>>>>>>>
> > >>>>>>>> I'm happy to test potential fixes.
> > >>>>>>>
> > >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > >>>>>>> CPU):
> > >>>>>>
> > >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> > >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > >>>>>> that ${pmu}'s type and event namespace.
> > >>>>>>
> > >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > >>>>>> targeted to a specific PMU, it's semantically wrong to rewrite events like
> > >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > >>>>>> PERF_COUNT_HW_${EVENT}.
> > >>>>>
> > >>>>> If you name a PMU and an event then the event should only be opened on
> > >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> > >>>>> cycles event is opened it appears to be because it was explicitly
> > >>>>> requested.
> > >>>>
> > >>>> I think you've missed that the named PMU events are being erroneously transformed
> > >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > >>>>
> > >>>> Opening: apple_firestorm_pmu/cycles/
> > >>>> ------------------------------------------------------------
> > >>>> perf_event_attr:
> > >>>> type 0 (PERF_TYPE_HARDWARE)
> > >>>> size 136
> > >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > >>>> sample_type IDENTIFIER
> > >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > >>>> disabled 1
> > >>>> inherit 1
> > >>>> enable_on_exec 1
> > >>>> exclude_guest 1
> > >>>> ------------------------------------------------------------
> > >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > >>>>
> > >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > >>>>
> > >>>> Marc said that he bisected the issue down to commit:
> > >>>>
> > >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > >>>>
> > >>>> ... so it looks like something is going wrong when the events are being parsed,
> > >>>> e.g. losing the HW PMU information?
> > >>>
> > >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> > >>>
> > >>> I think it may be working as intended, but not how you intended :-) If
> > >>> a core PMU is listed and then a legacy event, the legacy event should
>
> The point is that "cycles" when prefixed with "pmu/" shouldn't be
> considered "cycles" as HW/0, in that setting it is "cycles" for that
> PMU. (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> have access in a Libre Computer rockchip 3399-pc hybrid board, if we use
> it, then we get what we want/had before, see below):
>
> And there is an attempt at using the specified PMU, see the first
> perf_event_open:
>
> root@roc-rk3399-pc:~# strace -e perf_event_open perf stat -vv -e cycles,armv8_cortex_a53/cycles/,armv8_cortex_a72/cycles/ echo
> Using CPUID 0x00000000410fd082
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0x700000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)
>
> //// HERE: it tries config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES taking into
> //account the PMU number 0x7
>
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/type
> 7
> root@roc-rk3399-pc:~#
>
> But then we don't have "cycles" in that PMU:
>
> root@roc-rk3399-pc:~# ls -la /sys/devices/armv8_cortex_a53/events/cycles
> ls: cannot access '/sys/devices/armv8_cortex_a53/events/cycles': No such file or directory
> root@roc-rk3399-pc:~#
>
> Maybe:
>
> root@roc-rk3399-pc:~# taskset -c 5,6 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
> Using CPUID 0x00000000410fd034
> Control descriptor is not initialized
>
> armv8_cortex_a53/cpu_cycles/: 0 2079000 0
> armv8_cortex_a72/cpu_cycles/: 2488961 2079000 2079000
>
> Performance counter stats for 'echo':
>
> <not counted> armv8_cortex_a53/cpu_cycles/ (0.00%)
> 2488961 armv8_cortex_a72/cpu_cycles/
>
> 0.003449266 seconds time elapsed
>
> 0.003502000 seconds user
> 0.000000000 seconds sys
>
>
> root@roc-rk3399-pc:~# taskset -c 0,1,2,3,4 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
> Using CPUID 0x00000000410fd034
> Control descriptor is not initialized
>
> armv8_cortex_a53/cpu_cycles/: 2986601 6999416 6999416
> armv8_cortex_a72/cpu_cycles/: 0 6999416 0
>
> Performance counter stats for 'echo':
>
> 2986601 armv8_cortex_a53/cpu_cycles/
> <not counted> armv8_cortex_a72/cpu_cycles/ (0.00%)
>
> 0.011434508 seconds time elapsed
>
> 0.003911000 seconds user
> 0.007454000 seconds sys
>
>
> root@roc-rk3399-pc:~#
>
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/events/cpu_cycles
> event=0x0011
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a72/events/cpu_cycles
> event=0x0011
> root@roc-rk3399-pc:~#
>
> And the syscalls seem sane:
>
> root@roc-rk3399-pc:~# strace -e perf_event_open taskset -c 0,1,2,3,4 perf stat -v -e armv8_cortex_a53/cpu_cycles/,armv8_cortex_a72/cpu_cycles/ echo
> Using CPUID 0x00000000410fd034
> Control descriptor is not initialized
> perf_event_open({type=0x7 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */, config=0x11, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 14573, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
> perf_event_open({type=0x8 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */, config=0x11, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 14573, -1, -1, PERF_FLAG_FD_CLOEXEC) = 4
>
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14573, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
> armv8_cortex_a53/cpu_cycles/: 3227098 4480875 4480875
> armv8_cortex_a72/cpu_cycles/: 0 4480875 0
>
> Performance counter stats for 'echo':
>
> 3227098 armv8_cortex_a53/cpu_cycles/
> <not counted> armv8_cortex_a72/cpu_cycles/ (0.00%)
>
> 0.008381759 seconds time elapsed
>
> 0.004064000 seconds user
> 0.004121000 seconds sys
>
>
> --- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=14572, si_uid=0} ---
> +++ exited with 0 +++
> root@roc-rk3399-pc:~#
>
> As:
>
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a53/type
> 7
> root@roc-rk3399-pc:~# cat /sys/devices/armv8_cortex_a72/type
> 8
> root@roc-rk3399-pc:~#
>
> See the type=0x7 and type=0x8.
>
> So what we need here seems to be to translate the generic term "cycles"
> to "cpu_cycles" when a PMU is explicitly passed in the event name and
> it doesn't have "cycles" and then just retry.

The PMU driver does the legacy-to-raw encoding translation; this is an
assumption the tool makes about core PMUs. You can see ARM's PMU driver
doing the mapping here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_pmuv3.c#n40

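Roughly, and only as an illustration of the idea: the table below is a
hand-written two-entry excerpt with hypothetical Python names, not the
driver's code (see the link above for the real map). The 0x11 value
matches the cpu_cycles sysfs entry shown earlier in this thread.

```python
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
PERF_COUNT_HW_CACHE_MISSES = 3  # not in this excerpt; the real map has more

# Legacy PERF_COUNT_HW_* id -> raw PMUv3 event number (excerpt only).
PMUV3_LEGACY_MAP = {
    PERF_COUNT_HW_CPU_CYCLES: 0x11,    # CPU_CYCLES
    PERF_COUNT_HW_INSTRUCTIONS: 0x08,  # INST_RETIRED
}


def map_legacy_event(config):
    """Translate a PERF_TYPE_HARDWARE config into a raw event number.

    The upper 32 bits (the optional PMU type) are masked off first.
    Returns None when the id is not in the table; the kernel reports an
    error to the caller in that case.
    """
    hw_id = config & 0xFFFFFFFF
    return PMUV3_LEGACY_MAP.get(hw_id)


print(hex(map_legacy_event(0x700000000)))  # cycles, directed at PMU type 7
print(map_legacy_event(PERF_COUNT_HW_CACHE_MISSES))
```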
Thanks,
Ian


> - Arnaldo

2023-11-22 15:51:39

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 10:06:23AM -0300, Arnaldo Carvalho de Melo wrote:
> Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> > On 2023/11/22 1:38, Ian Rogers wrote:
> > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> > >>>>>>> Marc Zyngier <[email protected]> wrote:
> > >>>>>>>>
> > >>>>>>>> [Adding key people on Cc]
> > >>>>>>>>
> > >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> > >>>>>>>> Hector Martin <[email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> > >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >>>>>>>>
> > >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > >>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
> > >>>>>>>> the PMU, but nothing works anymore.
> > >>>>>>>>
> > >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> > >>>>>>>> package, but that's obviously not going to last.
> > >>>>>>>>
> > >>>>>>>> I'm happy to test potential fixes.
> > >>>>>>>
> > >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > >>>>>>> CPU):
> > >>>>>>
> > >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> > >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > >>>>>> that ${pmu}'s type and event namespace.
> > >>>>>>
> > >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > >>>>>> targeted to a specific PMU, it's semantically wrong to rewrite events like
> > >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > >>>>>> PERF_COUNT_HW_${EVENT}.
> > >>>>>
> > >>>>> If you name a PMU and an event then the event should only be opened on
> > >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> > >>>>> cycles event is opened it appears to be because it was explicitly
> > >>>>> requested.
> > >>>>
> > >>>> I think you've missed that the named PMU events are being erroneously transformed
> > >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > >>>>
> > >>>> Opening: apple_firestorm_pmu/cycles/
> > >>>> ------------------------------------------------------------
> > >>>> perf_event_attr:
> > >>>> type 0 (PERF_TYPE_HARDWARE)
> > >>>> size 136
> > >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > >>>> sample_type IDENTIFIER
> > >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > >>>> disabled 1
> > >>>> inherit 1
> > >>>> enable_on_exec 1
> > >>>> exclude_guest 1
> > >>>> ------------------------------------------------------------
> > >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > >>>>
> > >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > >>>>
> > >>>> Marc said that he bisected the issue down to commit:
> > >>>>
> > >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > >>>>
> > >>>> ... so it looks like something is going wrong when the events are being parsed,
> > >>>> e.g. losing the HW PMU information?
> > >>>
> > >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> > >>>
> > >>> I think it may be working as intended, but not how you intended :-) If
> > >>> a core PMU is listed and then a legacy event, the legacy event should
>
> The point is that "cycles" when prefixed with "pmu/" shouldn't be
> considered "cycles" as HW/0, in that setting it is "cycles" for that
> PMU.

Exactly.

> (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> have access in a Libre Computer rockchip 3399-pc hybrid board, if we use
> it, then we get what we want/had before, see below):

Both Cortex-A53 and Cortex-A72 have the common PMUv3 events, so they have
"cpu_cycles" and "bus_cycles".

The Apple PMUs that Hector and Marc are using don't follow the PMUv3
architecture, and just have a "cycles" event.

[...]

> So what we need here seems to be to translate the generic term "cycles"
> to "cpu_cycles" when a PMU is explicitly passed in the event name and
> it doesn't have "cycles" and then just retry.

I'm not sure we need to map that.

My thinking is:

* If the user asks for "cycles" without a PMU name, that should use the
PERF_TYPE_HARDWARE cycles event. The ARM PMUs handle that correctly when the
event is directed to them.

* If the user asks for "${pmu}/cycles/", that should only use the "cycles"
event in that PMU's namespace, not PERF_TYPE_HARDWARE.

* If we need a way to say "use the PERF_TYPE_HARDWARE cycles event on ${pmu}",
then we should have a new syntax for that (e.g. as we have for raw events),
e.g. it would be possible to have "pmu/hw:cycles/" or something like that.

That way there's no ambiguity.
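
As a sketch of those three rules (hypothetical function and return
values; the "hw:" prefix is only the strawman syntax suggested above,
not something perf implements):

```python
def resolve_proposed(spec, sysfs_events):
    """Resolve an event spec under the three rules above.

    Returns (kind, pmu, name), where kind is "legacy" for a
    PERF_TYPE_HARDWARE event and "pmu" for an event from the PMU's own
    namespace. sysfs_events maps a PMU name to its set of event names.
    """
    if "/" not in spec:
        # Rule 1: bare name -> legacy event, no specific PMU requested.
        return ("legacy", None, spec)

    pmu, term = spec.rstrip("/").split("/", 1)

    if term.startswith("hw:"):
        # Rule 3: explicit (strawman) request for a legacy event on a PMU.
        return ("legacy", pmu, term[len("hw:"):])

    # Rule 2: pmu/name/ only ever means the PMU's own event.
    if term in sysfs_events.get(pmu, set()):
        return ("pmu", pmu, term)
    raise LookupError(f"{pmu} has no event named {term!r}")


events = {"apple_firestorm_pmu": {"cycles"}}
print(resolve_proposed("cycles", events))
print(resolve_proposed("apple_firestorm_pmu/cycles/", events))
print(resolve_proposed("apple_firestorm_pmu/hw:cycles/", events))
```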

Mark.

2023-11-22 16:05:06

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 7:49 AM Mark Rutland <[email protected]> wrote:
>
> On Wed, Nov 22, 2023 at 10:06:23AM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> > > On 2023/11/22 1:38, Ian Rogers wrote:
> > > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > > >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> > > >>>>>>> Marc Zyngier <[email protected]> wrote:
> > > >>>>>>>>
> > > >>>>>>>> [Adding key people on Cc]
> > > >>>>>>>>
> > > >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> > > >>>>>>>> Hector Martin <[email protected]> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > >>>>>>>>
> > > >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > >>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > >>>>>>>> the PMU, but nothing works anymore.
> > > >>>>>>>>
> > > >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> > > >>>>>>>> package, but that's obviously not going to last.
> > > >>>>>>>>
> > > >>>>>>>> I'm happy to test potential fixes.
> > > >>>>>>>
> > > >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > >>>>>>> CPU):
> > > >>>>>>
> > > >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> > > >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > >>>>>> that ${pmu}'s type and event namespace.
> > > >>>>>>
> > > >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > >>>>>> targeted to a specific PMU, it's semantically wrong to rewrite events like
> > > >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > >>>>>> PERF_COUNT_HW_${EVENT}.
> > > >>>>>
> > > >>>>> If you name a PMU and an event then the event should only be opened on
> > > >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > >>>>> cycles event is opened it appears to be because it was explicitly
> > > >>>>> requested.
> > > >>>>
> > > >>>> I think you've missed that the named PMU events are being erroneously transformed
> > > >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > >>>>
> > > >>>> Opening: apple_firestorm_pmu/cycles/
> > > >>>> ------------------------------------------------------------
> > > >>>> perf_event_attr:
> > > >>>> type 0 (PERF_TYPE_HARDWARE)
> > > >>>> size 136
> > > >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > >>>> sample_type IDENTIFIER
> > > >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > >>>> disabled 1
> > > >>>> inherit 1
> > > >>>> enable_on_exec 1
> > > >>>> exclude_guest 1
> > > >>>> ------------------------------------------------------------
> > > >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > >>>>
> > > >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > >>>>
> > > >>>> Marc said that he bisected the issue down to commit:
> > > >>>>
> > > >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > >>>>
> > > >>>> ... so it looks like something is going wrong when the events are being parsed,
> > > >>>> e.g. losing the HW PMU information?
> > > >>>
> > > >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> > > >>>
> > > >>> I think it may be working as intended, but not how you intended :-) If
> > > >>> a core PMU is listed and then a legacy event, the legacy event should
> >
> > The point is that "cycles" when prefixed with "pmu/" shouldn't be
> > considered "cycles" as HW/0, in that setting it is "cycles" for that
> > PMU.
>
> Exactly.
>
> > (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> > have access in a Libre Computer rockchip 3399-pc hybrid board, if we use
> > it, then we get what we want/had before, see below):
>
> Both Cortex-A53 and Cortex-A72 have the common PMUv3 events, so they have
> "cpu_cycles" and "bus_cycles".
>
> The Apple PMUs that Hector and Marc are using don't follow the PMUv3
> architecture, and just have a "cycles" event.
>
> [...]
>
> > So what we need here seems to be to translate the generic term "cycles"
> > to "cpu_cycles" when a PMU is explicitly passed in the event name and
> > it doesn't have "cycles" and then just retry.
>
> I'm not sure we need to map that.
>
> My thinking is:
>
> * If the user asks for "cycles" without a PMU name, that should use the
> PERF_TYPE_HARDWARE cycles event. The ARM PMUs handle that correctly when the
> event is directed to them.
>
> * If the user asks for "${pmu}/cycles/", that should only use the "cycles"
> event in that PMU's namespace, not PERF_TYPE_HARDWARE.
>
> * If we need a way to say "use the PERF_TYPE_HARDWARE cycles event on ${pmu}",
> then we should have a new syntax for that (e.g. as we have for raw events),
> e.g. it would be possible to have "pmu/hw:cycles/" or something like that.
>
> That way there's no ambiguity.

This would break cpu_core/LLC-load-misses/ on Intel hybrid as the
LLC-load-misses event is legacy and not advertised in either sysfs or
in json.

Thanks,
Ian

> Mark.

2023-11-22 16:09:16

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 07:29:34AM -0800, Ian Rogers wrote:
> On Wed, Nov 22, 2023 at 5:04 AM Mark Rutland <[email protected]> wrote:
> > On Tue, Nov 21, 2023 at 08:38:45AM -0800, Ian Rogers wrote:
> > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > > > On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > > > On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > > > > On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > > > > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > > > > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > > > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > > > > > Marc Zyngier <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > [Adding key people on Cc]
> > > > > > > > > >
> > > > > > > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > > > > > > Hector Martin <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > > > > > >
> > > > > > > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > > > > > > the PMU, but nothing works anymore.
> > > > > > > > > >
> > > > > > > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > > > > > > package, but that's obviously not going to last.
> > > > > > > > > >
> > > > > > > > > > I'm happy to test potential fixes.
> > > > > > > > >
> > > > > > > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > > > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > > > > > CPU):
> > > > > > > >
> > > > > > > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > > > > > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > > > > > that ${pmu}'s type and event namespace.
> > > > > > > >
> > > > > > > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > > > > > targetted to a specific PMU, it's semantically wrong to rewrite events like
> > > > > > > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > > > > > PERF_COUNT_HW_${EVENT}.
> > > > > > >
> > > > > > > If you name a PMU and an event then the event should only be opened on
> > > > > > > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > > > > cycles event is opened it appears to be because it was explicitly
> > > > > > > requested.
> > > > > >
> > > > > > I think you've missed that the named PMU events are being erroneously transformed
> > > > > > into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > > > >
> > > > > > Opening: apple_firestorm_pmu/cycles/
> > > > > > ------------------------------------------------------------
> > > > > > perf_event_attr:
> > > > > > type 0 (PERF_TYPE_HARDWARE)
> > > > > > size 136
> > > > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > > > sample_type IDENTIFIER
> > > > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > > > disabled 1
> > > > > > inherit 1
> > > > > > enable_on_exec 1
> > > > > > exclude_guest 1
> > > > > > ------------------------------------------------------------
> > > > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > > > >
> > > > > > ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > > > >
> > > > > > Marc said that he bisected the issue down to commit:
> > > > > >
> > > > > > 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > > > >
> > > > > > ... so it looks like something is going wrong when the events are being parsed,
> > > > > > e.g. losing the HW PMU information?
> > > > >
> > > > > Ok, I think I'm getting confused by other things. This looks like the issue.
> > > > >
> > > > > I think it may be working as intended, but not how you intended :-) If
> > > > > a core PMU is listed and then a legacy event, the legacy event should
> > > > > be opened on the core PMU as a legacy event with the extended type
> > > > > set. This is to allow things like legacy cache events to be opened on
> > > > > a specified PMU. Legacy event names match with a higher priority than
> > > > > those in sysfs or json as they are hard coded.
> > > >
> > > > That has never been the case previously, so this is user-visible breakage, and
> > > > it prevents users from being able to do the right thing, so I think that's a
> > > > broken design.
> > >
> > > So the problem was caused by ARM and Intel doing two different things.
> > > Intel did at least contribute to the perf tool in support for their
> > > BIG.little/hybrid, so that's why the semantics match their approach.
> >
> > I appreciate that, and I agree that from the Arm side we haven't been as
> > engaged with userspace on this front (please understand I'm the messenger here,
> > this is something I've repeatedly asked for within Arm).
> >
> > Regardless, I don't think that changes the substance of the bug, which is that
> > we're converting named-pmu events into entirely different PERF_TYPE_HARDWARE
> > events.
> >
> > I agree that expanding plain legacy event names to a set of PMU-targeted legacy
> > events makes sense (and even for Arm, that's the right thing to do, IMO). If
> > I ask for 'cycles' and that gets expanded to multiple legacy cycles events that
> > target specific CPU PMUs, that's good.
> >
> > The thing that doesn't make sense here is converting named-pmu events into
> > legacy events. If I ask for 'apple_firestorm_pmu/cycles/', that should be the
> > 'cycles' event in the apple_firestorm_pmu's event namespace, and *shouldn't* be
> > converted to a (potentially semantically different) PERF_TYPE_HARDWARE event,
> > even if that's targetted towards the apple_firestorm_pmu. I think that should
> > be true for *any* PMU, whether that's an arm/x86/whatever CPU PMU or a system
> > PMU.
>
> This is saying that legacy events are lower than system events. We
> don't do this historically and as it requires extra PMU set up. On an
> Intel Tigerlake:
>
> ```
> $ ls /sys/devices/cpu/events
> branch-instructions cache-misses instructions ref-cycles
> topdown-be-bound
> branch-misses cache-references mem-loads slots
> topdown-fe-bound
> bus-cycles cpu-cycles mem-stores topdown-bad-spec
> topdown-retiring
> ```
> here (at least) branch-misses, bus-cycles, cache-references,
> cpu-cycles and instructions overlap with legacy event names
> ```
> $ perf --version
> perf version 6.5.6
> $ perf stat -vv -e branch-misses,bus-cycles,cache-references,cpu-cycles,instructions true

Here you *aren't* using a named PMU. As I said before, using the
PERF_TYPE_HARDWARE events in this case is entirely fine, it's just the
${pmu}/${eventname}/ case that I'm saying should use the PMU's namespace,
which was historically the case, and is what users are depending upon.

i.e.

perf stat -e cycles ./workload

... can/should use PERF_TYPE_HARDWARE events, as it used to

However:

perf stat -e ${pmu}/cycles/ ./workload

... should use the PMU's namespaced events, as it used to

> Using CPUID GenuineIntel-6-8D-1
> intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> Control descriptor is not initialized
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0x5 (PERF_COUNT_HW_BRANCH_MISSES)
> ...
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0x6 (PERF_COUNT_HW_BUS_CYCLES)
> ...
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0x2 (PERF_COUNT_HW_CACHE_REFERENCES)
> ...
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> ...
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0x1 (PERF_COUNT_HW_INSTRUCTIONS)
> ...
> branch-misses: -1: 6571 826226 826226
> bus-cycles: -1: 31411 826226 826226
> cache-references: -1: 19507 826226 826226
> cpu-cycles: -1: 1127215 826226 826226
> instructions: -1: 1301583 826226 826226
> branch-misses: 6571 826226 826226
> bus-cycles: 31411 826226 826226
> cache-references: 19507 826226 826226
> cpu-cycles: 1127215 826226 826226
> instructions: 1301583 826226 826226
>
> Performance counter stats for 'true':
> ...
> ```
> i.e. with perf 6.5, for all of these events, even though sysfs has matching
> events, we're opening them with PERF_TYPE_HARDWARE.

As above, this is a different case.

>
> > > > > Presumably the expectation was that by advertising a cycles event, presumably
> > > > > in sysfs, then this is what would be matched.
> >
> > Yes. That's how this has always worked prior to the changes Marc referenced.
> > Note that this can *also* be expanded to events from json databases, but was
> > *never* previously silently converted to a PERF_TYPE_HARDWARE event.
> >
> > Please note that the events in sysfs are *namespaced* to the PMU (specifically,
> > when using that PMU's dynamic type); they are not necessarily the same as
> > legacy events (though they may have similar or matching
> > names in some cases), they may be semantically distinct from the legacy events
> > even if the names match, and it is incorrect to conflate the two.
>
> This was a behavior added by Intel so that say cpu_atom/legacy-event/
> would only open as a hardware event on that PMU. The point of the
> blamed change is to make that behavior consistent for all core PMUs.

Ok, so Intel has an intel-specific behaviour change, which was ok for them.

That was made generic, but caused a functional regression on arm (and possibly
other architectures if anyone else cares about the namespaced events).

Why can't this be returned to being x86 specific?

> > > > I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
> > > > *in that PMU's namespace* is used. Overriding that breaks long-established
> > > > practice and provides users with no recourse to get the behaviour they expect
> > > > (and previously had).
> > >
> > > On ARM but not Intel.
> >
> > As above, I don't think the CPU architecture matters here for the case that I'm
> > saying is broken. I think that regardless of CPU architecture (or for any
> > non-CPU PMU) it is semantically incorrect to convert a named-pmu event to a
> > legacy event.
>
> So perf's behavior has always been that legacy event priority is
> greater-than sysfs and json. The distinction here is that a core PMU
> is explicitly listed and it doesn't seem unreasonable to use core PMU
> names with legacy events, the behavior Intel added.

That may be ok for Intel, but given it *is* causing functional problems for
others, why must it remain generic?

> > > > I do think that (regardless of whether this was the semantic you intended)
> > > > silently overriding events with legacy events is a bug, and one we should fix.
> > > > As I mentioned in another reply, just because the events have the same name
> > > > does not mean that they are semantically the same, so we're liable to give
> > > > people the wrong numbers anyhow.
> > > >
> > > > Can we fix this?
> > >
> > > So I'd like to fix this, some things from various conversations:
> > >
> > > 1) we lack testing. Our testing relies on the sysfs of the machine
> > > being run on, which is better than nothing. I think ideally we'd have
> > > a collection of zipped up sysfs directories and then we could have a
> > > test that asserts on ARM you get the behavior you want.
> >
> > I agree we lack testing, and I'd be happy to help here going forwards, though I
> > don't think this is a prerequisite for fixing this issue.
> >
> > > 2) for RISC-V they want to make the legacy event matching something in
> > > user land to simplify the PMU driver.
> >
> > Ok; I see how this might be related, but it doesn't sound like a prerequisite
> > for fixing this issue -- there are plenty of people in this thread who can
> > test.
> >
> > > 3) I'd like to get rid of the PMU json interface. My idea is to
> > > convert json events/metrics into sysfs style files, zip these up and
> > > then link them into the perf binary. On Intel the json is 70% of the
> > > binary (7MB out of 10MB) and we may get this down to 3MB with this
> > > approach. The json lookup would need to incorporate the cpuid matching
> > > that currently exists. When we look up an event I'd like the approach
> > > to be like unionfs with a specified but configurable order. Users
> > > could provide directories of their own events/metrics for various
> > > PMUs, and then this approach could be used to help with (1).
> >
> > I can see how that might interact with whatever changes we make to fix this
> > issue, but this seems like a future aspiration, and not a prerequisite for
> > fixing the existing functional regression.
> >
> > > Those proposals are not something to add as a -rc fix, so what I think
> > > you're asking for here is a "if ARM" fix somewhere in the event
> > > parsing. That's of course possible but it will cause problems if you
> > > did say:
> > >
> > > perf stat -e arm_pmu/LLC-load-misses/ ...
> >
> > As above, I do not think this is an arm-specific issue, we're just the canary
> > in the coalmine.
>
> Disagree, see comments above. A behavior change here would impact Intel.

Ok, so have Intel keep the Intel behaviour?

> > Please note that:
> >
> > perf stat -e arm_pmu/LLC-load-misses/ ...
> >
> > ... would never have worked previously. No arm_pmu instances have a
> > "LLC-load-misses" event in their event namespaces, and we don't have any
> > userspace file mapping that event.
>
> This event was for the purpose of giving an example, perf list will
> show you events that work. The point is that a legacy event may not be
> available on both BIG.little PMU types so being able to designate the
> PMU there is helpful.

Sure, but (as per my reply to Arnaldo), it's possible to add an unambiguous way
to specify that, e.g. a 'hw:' prefix like:

some_arm_pmu/hw:LLC-load-misses/

... which wouldn't clash and cause the regression that users are seeing.

> > That said, If I really wanted that legacy event, I'd have asked for it bare,
> > e.g.
> >
> > perf stat -e LLC-load-misses
> >
> > ... and we're in agreement that it's sensible to expand this to multiple
> > PERF_TYPE_HARDWARE events targeting the individual CPU PMUs.
> >
> > So I see no need to do anything to have magic for 'arm_pmu/LLC-load-misses/'.
> >
> > > as I doubt the PMU driver is advertising this legacy event in sysfs
> > > and the "if ARM" logic would presumably be trying to disable legacy
> > > events in the term list for the ARM PMU.
> > >
> > > Given all of this, is anything actually broken and needing a fix for 6.7?
> >
> > There is absolutely a bug that needs to be fixed here (and needs to be
> > backported to stable so that it gets picked up by distributions).
>
> I'm not seeing this. The behavior is consistent with Intel, this has
> gone 2 releases without being spotted,

This has gone two releases because people have only just updated their tools. The
prior behaviour for Arm has been there for most of a decade.

> it was triggered by a PMU event
> name aliasing a legacy event name and the behavior has always been
> legacy event names have higher priority than sysfs and json events.

That has been the case for plain events without a PMU name. That was never the
case for events with a PMU name, or there would not have been any difference in
behaviour.

> Whilst I'm seeing a lot of complaining, I've not seen a proposal of
> what behavior you want.

As per my initial reply, the behaviour we want is that:

pmu/eventname/

... opens 'eventname' in that PMU's event namespace, rather than converting the
event into a PERF_TYPE_HARDWARE event. That was the prior behaviour, which
people have been using for most of a decade.

I understand that there was some Intel-specific behaviour, and that may need to
be kept for Intel. Making that behaviour generic broke other existing users.

If we need a mechanism to target a legacy event to a specific PMU, we can add
an unambiguous way of describing that (e.g. the 'hw:' prefix I've suggested a
few times).
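
To make the three cases concrete, here is a minimal Python sketch of the
resolution rules described above. It is not perf's parser: resolve(), the
tables standing in for sysfs, the type ids and the config values are all
made up for illustration, and the 'hw:' prefix is only the suggestion from
this thread, not existing syntax.

```python
# Toy model of the requested event-name resolution; nothing here is perf code.
LEGACY_HW = {"cycles": 0, "instructions": 1}  # PERF_COUNT_HW_* ids

# Hypothetical snapshot of each PMU's dynamic type and sysfs events directory.
PMUS = {
    "apple_icestorm_pmu": {"type": 10, "events": {"cycles": 0x02}},
    "armv8_cortex_a53": {"type": 8, "events": {"cpu_cycles": 0x11}},
}


def resolve(spec):
    """Return (kind, pmu, config) for an event spec, or raise LookupError."""
    if "/" not in spec:
        # Bare name: legacy PERF_TYPE_HARDWARE event (may be expanded per CPU PMU).
        return ("legacy", None, LEGACY_HW[spec])
    pmu, name = spec.rstrip("/").split("/", 1)
    if name.startswith("hw:"):
        # Explicit request for a legacy event targeted at one PMU.
        return ("legacy", pmu, LEGACY_HW[name[3:]])
    events = PMUS[pmu]["events"]
    if name not in events:
        # No silent conversion to a legacy event: the lookup simply fails.
        raise LookupError(f"{pmu} has no event named {name!r}")
    return ("pmu", pmu, events[name])
```

In this model "apple_icestorm_pmu/cycles/" resolves in the PMU's own
namespace, "apple_icestorm_pmu/hw:cycles/" asks for the legacy event on that
PMU, and "armv8_cortex_a53/cycles/" fails because that PMU only has
"cpu_cycles".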


> Isn't it a PMU bug if the legacy event specifying the PMU doesn't get opened
> by the core PMU?

No?

Prior to that mechanism being added to the kernel, there was no way to do that.

When the mechanism was added to x86 specifically, it wasn't a generic feature.

> Fixing the PMU driver appears to be the right fix and means there is
> consistency on core events across architectures.

I think that's orthogonal.

Adding support to the PMU drivers (which has already been done, per the commit
you quoted before) is good so that userspace can do the right thing for:

perf stat -e some_generic_event ./workload

... but that should not be necessary to retain the existing behaviour for:

perf stat -e pmu/some_similarly_named_event/ ./workload

Thanks,
Mark.

2023-11-22 16:19:59

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

Em Wed, Nov 22, 2023 at 03:49:18PM +0000, Mark Rutland escreveu:
> On Wed, Nov 22, 2023 at 10:06:23AM -0300, Arnaldo Carvalho de Melo wrote:
> > The point is that "cycles" when prefixed with "pmu/" shouldn't be
> > considered "cycles" as HW/0, in that setting it is "cycles" for that
> > PMU.

> Exactly.

> > (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> > have access in a Libre Computer rockchip 3399-pc hybrid board, if we use
> > it, then we get what we want/had before, see below):

> Both Cortex-A53 and Cortex-A72 have the common PMUv3 events, so they have
> "cpu_cycles" and "bus_cycles".

root@roc-rk3399-pc:~# ls -la /sys/devices/*/events/*cycles
-r--r--r-- 1 root root 4096 Nov 22 12:35 /sys/devices/armv8_cortex_a53/events/bus_cycles
-r--r--r-- 1 root root 4096 Nov 22 12:35 /sys/devices/armv8_cortex_a53/events/cpu_cycles
-r--r--r-- 1 root root 4096 Nov 22 12:35 /sys/devices/armv8_cortex_a72/events/bus_cycles
-r--r--r-- 1 root root 4096 Nov 22 12:35 /sys/devices/armv8_cortex_a72/events/cpu_cycles
root@roc-rk3399-pc:~#

But on x86, on an AMD machine:

⬢[acme@toolbox ~]$ ls -la /sys/devices/*/events/*cycles
-r--r--r--. 1 nobody nobody 4096 Nov 22 12:48 /sys/devices/cpu/events/cpu-cycles
⬢[acme@toolbox ~]$

And an Intel:

[acme@quaco asahi]$ ls -la /sys/devices/*/events/*cycles
-r--r--r--. 1 root root 4096 Nov 22 13:11 /sys/devices/cpu/events/bus-cycles
-r--r--r--. 1 root root 4096 Nov 22 13:11 /sys/devices/cpu/events/cpu-cycles
-r--r--r--. 1 root root 4096 Nov 22 13:11 /sys/devices/cpu/events/ref-cycles
[acme@quaco asahi]$

Slight difference with those - and _.

> The Apple PMUs that Hector and Marc are using don't follow the PMUv3
> architecture, and just have a "cycles" event.

I see, and even when prefixed with the PMU name, as
"apple_icestorm_pmu/cycles/", the legacy name ends up trumping that, turning
the event into (PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES) instead of
(/sys/devices/apple_icestorm_pmu/type,
/sys/devices/apple_icestorm_pmu/events/cycles) as I noticed with:

sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)

I.e.:

type=PERF_TYPE_HARDWARE, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES

It should be:

type=/sys/devices/apple_icestorm_pmu/type, config=/sys/devices/apple_icestorm_pmu/events/cycles

That is the minimal patch to address the reported regression, even if it
uses some kludge to buy time for a longer-term, more elegant solution.
Ian?
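
For reference, the config value in the strace output above packs two fields:
with the extended-type ABI, the upper 32 bits of a PERF_TYPE_HARDWARE config
carry the target PMU's type id and the lower 32 bits carry the legacy event
id. A small Python sketch of that packing follows. The helper names are made
up for illustration; the shift and mask are meant to mirror
PERF_PMU_TYPE_SHIFT and PERF_HW_EVENT_MASK from the kernel's uapi
perf_event.h.

```python
PERF_PMU_TYPE_SHIFT = 32         # upper half of config: PMU type id
PERF_HW_EVENT_MASK = 0xFFFFFFFF  # lower half of config: legacy event id
PERF_COUNT_HW_CPU_CYCLES = 0


def pack_extended(pmu_type, hw_id):
    """Build a PERF_TYPE_HARDWARE config that targets one PMU."""
    return (pmu_type << PERF_PMU_TYPE_SHIFT) | (hw_id & PERF_HW_EVENT_MASK)


def unpack_extended(config):
    """Split a config into (pmu_type, legacy_event_id)."""
    return config >> PERF_PMU_TYPE_SHIFT, config & PERF_HW_EVENT_MASK


config = pack_extended(0x7, PERF_COUNT_HW_CPU_CYCLES)
print(hex(config))              # 0x700000000
print(unpack_extended(config))  # (7, 0)
```

So the failing call is still a legacy cycles event, only steered at PMU type
7, rather than an event taken from that PMU's own namespace.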

> [...]

> > So what we need here seems to be to translate the generic term "cycles"
> > to "cpu_cycles" when a PMU is explicitly passed in the event name and
> > it doesn't have "cycles" and then just retry.
>
> I'm not sure we need to map that.
>
> My thinking is:
>
> * If the user asks for "cycles" without a PMU name, that should use the
> PERF_TYPE_HARDWARE cycles event. The ARM PMUs handle that correctly when the
> event is directed to them.
>
> * If the user asks for "${pmu}/cycles/", that should only use the "cycles"
> event in that PMU's namespace, not PERF_TYPE_HARDWARE.

And thus, armv8_cortex_a53/cycles/ and armv8_cortex_a72/cycles/ should
just fail as there is no "cycles" for that PMU, no fallback.

> * If we need a way to say "use the PERF_TYPE_HARDWARE cycles event on ${pmu}",
> then we should have a new syntax for that (e.g. as we have for raw events),
> e.g. it would be possible to have "pmu/hw:cycles/" or something like that.
>
> That way there's no ambiguity.

- Arnaldo

2023-11-22 16:26:56

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

Em Wed, Nov 22, 2023 at 08:04:26AM -0800, Ian Rogers escreveu:
> On Wed, Nov 22, 2023 at 7:49 AM Mark Rutland <[email protected]> wrote:
> >
> > On Wed, Nov 22, 2023 at 10:06:23AM -0300, Arnaldo Carvalho de Melo wrote:
> > > Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> > > > On 2023/11/22 1:38, Ian Rogers wrote:
> > > > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > > > >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > > >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > > >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > >>>>>>> Marc Zyngier <[email protected]> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> [Adding key people on Cc]
> > > > >>>>>>>>
> > > > >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > >>>>>>>> Hector Martin <[email protected]> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > >>>>>>>>
> > > > >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > >>>>>>>> asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > >>>>>>>> the PMU, but nothing works anymore.
> > > > >>>>>>>>
> > > > >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > >>>>>>>> package, but that's obviously not going to last.
> > > > >>>>>>>>
> > > > >>>>>>>> I'm happy to test potential fixes.
> > > > >>>>>>>
> > > > >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > >>>>>>> CPU):
> > > > >>>>>>
> > > > >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> > > > >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > >>>>>> that ${pmu}'s type and event namespace.
> > > > >>>>>>
> > > > >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > >>>>>> targetted to a specific PMU, it's semantically wrong to rewrite events like
> > > > >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > >>>>>> PERF_COUNT_HW_${EVENT}.
> > > > >>>>>
> > > > >>>>> If you name a PMU and an event then the event should only be opened on
> > > > >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > >>>>> cycles event is opened it appears to be because it was explicitly
> > > > >>>>> requested.
> > > > >>>>
> > > > >>>> I think you've missed that the named PMU events are being erroneously transformed
> > > > >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > > >>>>
> > > > >>>> Opening: apple_firestorm_pmu/cycles/
> > > > >>>> ------------------------------------------------------------
> > > > >>>> perf_event_attr:
> > > > >>>> type 0 (PERF_TYPE_HARDWARE)
> > > > >>>> size 136
> > > > >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > >>>> sample_type IDENTIFIER
> > > > >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > >>>> disabled 1
> > > > >>>> inherit 1
> > > > >>>> enable_on_exec 1
> > > > >>>> exclude_guest 1
> > > > >>>> ------------------------------------------------------------
> > > > >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > > >>>>
> > > > >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > > >>>>
> > > > >>>> Marc said that he bisected the issue down to commit:
> > > > >>>>
> > > > >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > > >>>>
> > > > >>>> ... so it looks like something is going wrong when the events are being parsed,
> > > > >>>> e.g. losing the HW PMU information?
> > > > >>>
> > > > >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> > > > >>>
> > > > >>> I think it may be working as intended, but not how you intended :-) If
> > > > >>> a core PMU is listed and then a legacy event, the legacy event should
> > >
> > > The point is that "cycles" when prefixed with "pmu/" shouldn't be
> > > considered "cycles" as HW/0, in that setting it is "cycles" for that
> > > PMU.
> >
> > Exactly.
> >
> > > (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> > > have access in a Libre Computer rockchip 3399-pc hybrid board, if we use
> > > it, then we get what we want/had before, see below):
> >
> > Both Cortex-A53 and Cortex-A72 have the common PMUv3 events, so they have
> > "cpu_cycles" and "bus_cycles".
> >
> > The Apple PMUs that Hector and Marc are using don't follow the PMUv3
> > architecture, and just have a "cycles" event.
> >
> > [...]
> >
> > > So what we need here seems to be to translate the generic term "cycles"
> > > to "cpu_cycles" when a PMU is explicitly passed in the event name and
> > > it doesn't have "cycles" and then just retry.
> >
> > I'm not sure we need to map that.
> >
> > My thinking is:
> >
> > * If the user asks for "cycles" without a PMU name, that should use the
> > PERF_TYPE_HARDWARE cycles event. The ARM PMUs handle that correctly when the
> > event is directed to them.
> >
> > * If the user asks for "${pmu}/cycles/", that should only use the "cycles"
> > event in that PMU's namespace, not PERF_TYPE_HARDWARE.
> >
> > * If we need a way to say "use the PERF_TYPE_HARDWARE cycles event on ${pmu}",
> > then we should have a new syntax for that (e.g. as we have for raw events),
> > e.g. it would be possible to have "pmu/hw:cycles/" or something like that.
> >
> > That way there's no ambiguity.
>
> This would break cpu_core/LLC-load-misses/ on Intel hybrid as the
> LLC-load-misses event is legacy and not advertised in either sysfs or
> in json.

Indeed:

[root@quaco ~]# ls /sys/devices/cpu/events/
branch-instructions bus-cycles cache-references instructions mem-stores topdown-fetch-bubbles topdown-recovery-bubbles.scale topdown-slots-retired topdown-total-slots.scale
branch-misses cache-misses cpu-cycles mem-loads ref-cycles topdown-recovery-bubbles topdown-slots-issued topdown-total-slots
[root@quaco ~]# strace -e perf_event_open perf stat -e cpu/LLC-load-misses/ echo
perf_event_open({type=PERF_TYPE_HW_CACHE, size=0x88 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_CACHE_RESULT_MISS<<16|PERF_COUNT_HW_CACHE_OP_READ<<8|PERF_COUNT_HW_CACHE_LL, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 41467, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=41467, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---

Performance counter stats for 'echo':

1,015 cpu/LLC-load-misses/

0.005167119 seconds time elapsed

0.000821000 seconds user
0.004105000 seconds sys


--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=41466, si_uid=0} ---
+++ exited with 0 +++
[root@quaco ~]#

Would it be difficult, before doing the current expansion to
PERF_TYPE_HARDWARE/PERF_COUNT_HW_CPU_CYCLES, to first check whether the
specified PMU has an event with the specified name and, if it does, use that?
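
A rough Python sketch of that ordering, just to show what is being asked
for. This is not perf code: lookup(), the sysfs_root parameter and the
LEGACY table are hypothetical, and the only thing assumed about sysfs is the
<pmu>/events/<name> layout shown in the listings in this thread.

```python
import os
import tempfile

LEGACY = {"cycles", "LLC-load-misses"}  # stand-in for perf's hard-coded legacy names


def lookup(pmu, name, sysfs_root="/sys/devices"):
    """Prefer the PMU's own event; only then fall back to a legacy event."""
    if os.path.exists(os.path.join(sysfs_root, pmu, "events", name)):
        return ("pmu-namespace", pmu, name)
    if name in LEGACY:
        return ("legacy-on-pmu", pmu, name)
    raise LookupError(f"{pmu}/{name}/ not found")


# Demo against a fake sysfs tree so the sketch runs anywhere.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "apple_icestorm_pmu", "events"))
open(os.path.join(root, "apple_icestorm_pmu", "events", "cycles"), "w").close()
os.makedirs(os.path.join(root, "cpu", "events"))

print(lookup("apple_icestorm_pmu", "cycles", root))
print(lookup("cpu", "LLC-load-misses", root))
```

With this order, apple_icestorm_pmu/cycles/ resolves to the PMU's own event,
while cpu/LLC-load-misses/ (not present in sysfs) still opens as a legacy
event on the named PMU.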

- Arnaldo

2023-11-22 16:32:19

by Ian Rogers

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 8:08 AM Mark Rutland <[email protected]> wrote:
>
> On Wed, Nov 22, 2023 at 07:29:34AM -0800, Ian Rogers wrote:
> > On Wed, Nov 22, 2023 at 5:04 AM Mark Rutland <[email protected]> wrote:
> > > On Tue, Nov 21, 2023 at 08:38:45AM -0800, Ian Rogers wrote:
> > > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > > > > On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > > > > On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > > > > > On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > > > > > On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > > > > > > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > > > > > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > > > > > > Marc Zyngier <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > [Adding key people on Cc]
> > > > > > > > > > >
> > > > > > > > > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > > > > > > > Hector Martin <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > > > > > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > > > > > > >
> > > > > > > > > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > > > > > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > > > > > > > > > the PMU, but nothing works anymore.
> > > > > > > > > > >
> > > > > > > > > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > > > > > > > package, but that's obviously not going to last.
> > > > > > > > > > >
> > > > > > > > > > > I'm happy to test potential fixes.
> > > > > > > > > >
> > > > > > > > > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > > > > > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > > > > > > CPU):
> > > > > > > > >
> > > > > > > > > IIUC the tool is doing the wrong thing here and overriding explicit
> > > > > > > > > ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > > > > > > that ${pmu}'s type and event namespace.
> > > > > > > > >
> > > > > > > > > Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > > > > > > targetted to a specific PMU, it's semantically wrong to rewrite events like
> > > > > > > > > this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > > > > > > PERF_COUNT_HW_${EVENT}.
> > > > > > > >
> > > > > > > > If you name a PMU and an event then the event should only be opened on
> > > > > > > > that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > > > > > cycles event is opened it appears to be because it was explicitly
> > > > > > > > requested.
> > > > > > >
> > > > > > > I think you've missed that the named PMU events are being erroneously transformed
> > > > > > > into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > > > > >
> > > > > > > Opening: apple_firestorm_pmu/cycles/
> > > > > > > ------------------------------------------------------------
> > > > > > > perf_event_attr:
> > > > > > > type 0 (PERF_TYPE_HARDWARE)
> > > > > > > size 136
> > > > > > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > > > > sample_type IDENTIFIER
> > > > > > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > > > > disabled 1
> > > > > > > inherit 1
> > > > > > > enable_on_exec 1
> > > > > > > exclude_guest 1
> > > > > > > ------------------------------------------------------------
> > > > > > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > > > > >
> > > > > > > ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > > > > >
> > > > > > > Marc said that he bisected the issue down to commit:
> > > > > > >
> > > > > > > 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > > > > >
> > > > > > > ... so it looks like something is going wrong when the events are being parsed,
> > > > > > > e.g. losing the HW PMU information?
> > > > > >
> > > > > > Ok, I think I'm getting confused by other things. This looks like the issue.
> > > > > >
> > > > > > I think it may be working as intended, but not how you intended :-) If
> > > > > > a core PMU is listed and then a legacy event, the legacy event should
> > > > > > be opened on the core PMU as a legacy event with the extended type
> > > > > > set. This is to allow things like legacy cache events to be opened on
> > > > > > a specified PMU. Legacy event names match with a higher priority than
> > > > > > those in sysfs or json as they are hard coded.
> > > > >
> > > > > That has never been the case previously, so this is user-visible breakage, and
> > > > > it prevents users from being able to do the right thing, so I think that's a
> > > > > broken design.
> > > >
> > > > So the problem was caused by ARM and Intel doing two different things.
> > > > Intel did at least contribute to the perf tool in support for their
> > > > BIG.little/hybrid, so that's why the semantics match their approach.
> > >
> > > I appreciate that, and I agree that from the Arm side we haven't been as
> > > engaged with userspace on this front (please understand I'm the messenger here,
> > > this is something I've repeatedly asked for within Arm).
> > >
> > > Regardless, I don't think that changes the substance of the bug, which is that
> > > we're converting named-pmu events into entirely different PERF_TYPE_HARDWARE
> > > events.
> > >
> > > > I agree that expanding plain legacy event names to a set of PMU-targeted legacy
> > > events makes sense (and even for Arm, that's the right thing to do, IMO). If
> > > I ask for 'cycles' and that gets expanded to multiple legacy cycles events that
> > > target specific CPU PMUs, that's good.
> > >
> > > The thing that doesn't make sense here is converting named-pmu events into
> > > > legacy events. If I ask for 'apple_firestorm_pmu/cycles/', that should be the
> > > > 'cycles' event in the apple_firestorm_pmu's event namespace, and *shouldn't* be
> > > > converted to a (potentially semantically different) PERF_TYPE_HARDWARE event,
> > > > even if that's targeted towards the apple_firestorm_pmu. I think that should
> > > > be true for *any* PMU, whether that's an arm/x86/whatever CPU PMU or a system
> > > PMU.
> >
> > > This is saying that legacy events are lower priority than system events. We
> > > haven't done this historically, as it requires extra PMU setup. On an
> > Intel Tigerlake:
> >
> > ```
> > $ ls /sys/devices/cpu/events
> > > branch-instructions cache-misses instructions ref-cycles topdown-be-bound
> > > branch-misses cache-references mem-loads slots topdown-fe-bound
> > > bus-cycles cpu-cycles mem-stores topdown-bad-spec topdown-retiring
> > ```
> > here (at least) branch-misses, bus-cycles, cache-references,
> > cpu-cycles and instructions overlap with legacy event names
> > ```
> > $ perf --version
> > perf version 6.5.6
> > > $ perf stat -vv -e branch-misses,bus-cycles,cache-references,cpu-cycles,instructions true
>
> > Here you *aren't* using a named PMU. As I said before, using the
> > PERF_TYPE_HARDWARE events in this case is entirely fine, it's just the
> ${pmu}/${eventname}/ case that I'm saying should use the PMU's namespace,
> which was historically the case, and is what users are depending upon.
>
> i.e.
>
> perf stat -e cycles ./workload
>
> ... can/should use PERF_TYPE_HARDWARE events, as it used to
>
> However:
>
> > perf stat -e ${pmu}/cycles/ ./workload
>
> ... should use the PMU's namespaced events, as it used to
>
> > Using CPUID GenuineIntel-6-8D-1
> > intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
> > Control descriptor is not initialized
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0x5 (PERF_COUNT_HW_BRANCH_MISSES)
> > ...
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0x6 (PERF_COUNT_HW_BUS_CYCLES)
> > ...
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0x2 (PERF_COUNT_HW_CACHE_REFERENCES)
> > ...
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > ...
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0x1 (PERF_COUNT_HW_INSTRUCTIONS)
> > ...
> > branch-misses: -1: 6571 826226 826226
> > bus-cycles: -1: 31411 826226 826226
> > cache-references: -1: 19507 826226 826226
> > cpu-cycles: -1: 1127215 826226 826226
> > instructions: -1: 1301583 826226 826226
> > branch-misses: 6571 826226 826226
> > bus-cycles: 31411 826226 826226
> > cache-references: 19507 826226 826226
> > cpu-cycles: 1127215 826226 826226
> > instructions: 1301583 826226 826226
> >
> > Performance counter stats for 'true':
> > ...
> > ```
> > > i.e. with perf 6.5, even though sysfs has all of these events, we're opening
> > > them with PERF_TYPE_HARDWARE.
>
> As above, this is a different case.
>
> >
> > > > > > Presumably the expectation was that by advertising a cycles event, presumably
> > > > > > in sysfs, then this is what would be matched.
> > >
> > > Yes. That's how this has always worked prior to the changes Marc referenced.
> > > > Note that this can *also* be expanded to events from json databases, but was
> > > *never* previously silently converted to a PERF_TYPE_HARDWARE event.
> > >
> > > Please note that the events in sysfs are *namespaced* to the PMU (specifically,
> > > when using that PMU's dynamic type); they are not necessarily the same as
> > > legacy events (though they may have similar or matching
> > > names in some cases), they may be semantically distinct from the legacy events
> > > even if the names match, and it is incorrect to conflate the two.
> >
> > This was a behavior added by Intel so that say cpu_atom/legacy-event/
> > would only open as a hardware event on that PMU. The point of the
> > blamed change is to make that behavior consistent for all core PMUs.
>
> Ok, so Intel has an intel-specific behaviour change, which was ok for them.
>
> > That was made generic, but caused a functional regression on arm (and possibly
> > other architectures if anyone else cares about the namespaced events).
> >
> > Why can't this be returned to being x86 specific?
>
> > > > > I expect that if I ask for ${pmu}/${event}/, that PMU is used, and the event
> > > > > *in that PMU's namespace* is used. Overriding that breaks long-established
> > > > > > practice and provides users with no recourse to get the behaviour they expect
> > > > > > (and previously had).
> > > >
> > > > On ARM but not Intel.
> > >
> > > As above, I don't think the CPU architecture matters here for the case that I'm
> > > saying is broken. I think that regardless of CPU architecture (or for any
> > > non-CPU PMU) it is semantically incorrect to convert a named-pmu event to a
> > > legacy event.
> >
> > So perf's behavior has always been that legacy event priority is
> > greater-than sysfs and json. The distinction here is that a core PMU
> > is explicitly listed and it doesn't seem unreasonable to use core PMU
> > names with legacy events, the behavior Intel added.
>
> > That may be ok for Intel, but given it *is* causing functional problems for
> others, why must it remain generic?
>
> > > > > > I do think that (regardless of whether this was the semantic you intended)
> > > > > silently overriding events with legacy events is a bug, and one we should fix.
> > > > > As I mentioned in another reply, just because the events have the same name
> > > > > does not mean that they are semantically the same, so we're liable to give
> > > > > people the wrong numbers anyhow.
> > > > >
> > > > > Can we fix this?
> > > >
> > > > So I'd like to fix this, some things from various conversations:
> > > >
> > > > 1) we lack testing. Our testing relies on the sysfs of the machine
> > > > being run on, which is better than nothing. I think ideally we'd have
> > > > a collection of zipped up sysfs directories and then we could have a
> > > > test that asserts on ARM you get the behavior you want.
> > >
> > > I agree we lack testing, and I'd be happy to help here going forwards, though I
> > > don't think this is a prerequisite for fixing this issue.
> > >
> > > > 2) for RISC-V they want to make the legacy event matching something in
> > > > user land to simplify the PMU driver.
> > >
> > > Ok; I see how this might be related, but it doesn't sound like a prerequisite
> > > for fixing this issue -- there are plenty of people in this thread who can
> > > test.
> > >
> > > > 3) I'd like to get rid of the PMU json interface. My idea is to
> > > > convert json events/metrics into sysfs style files, zip these up and
> > > > then link them into the perf binary. On Intel the json is 70% of the
> > > > binary (7MB out of 10MB) and we may get this down to 3MB with this
> > > > approach. The json lookup would need to incorporate the cpuid matching
> > > > that currently exists. When we look up an event I'd like the approach
> > > > to be like unionfs with a specified but configurable order. Users
> > > > could provide directories of their own events/metrics for various
> > > > PMUs, and then this approach could be used to help with (1).
> > >
> > > I can see how that might interact with whatever changes we make to fix this
> > > issue, but this seems like a future aspiration, and not a prerequisite for
> > > fixing the existing functional regression.
> > >
> > > > Those proposals are not something to add as a -rc fix, so what I think
> > > > you're asking for here is a "if ARM" fix somewhere in the event
> > > > parsing. That's of course possible but it will cause problems if you
> > > > did say:
> > > >
> > > > perf stat -e arm_pmu/LLC-load-misses/ ...
> > >
> > > As above, I do not think this is an arm-specific issue, we're just the canary
> > > in the coalmine.
> >
> > Disagree, see comments above. A behavior change here would impact Intel.
>
> Ok, so have Intel keep the Intel behaviour?
>
> > > Please note that:
> > >
> > > perf stat -e arm_pmu/LLC-load-misses/ ...
> > >
> > > ... would never have worked previously. No arm_pmu instances have a
> > > "LLC-load-misses" event in their event namespaces, and we don't have any
> > > userspace file mapping that event.
> >
> > This event was for the purpose of giving an example, perf list will
> > show you events that work. The point is that a legacy event may not be
> > available on both BIG.little PMU types so being able to designate the
> > PMU there is helpful.
>
> > Sure, but (as per my reply to Arnaldo), it's possible to add an unambiguous way
> > to specify that, e.g. a 'hw:' prefix like:
> >
> > some_arm_pmu/hw:LLC-load-misses/
> >
> > ... which wouldn't clash and cause the regression that users are seeing.
>
> > > That said, If I really wanted that legacy event, I'd have asked for it bare,
> > > e.g.
> > >
> > > perf stat -e LLC-load-misses
> > >
> > > ... and we're in agreement that it's sensible to expand this to multiple
> > > PERF_TYPE_HARDWARE events targeting the individual CPU PMUs.
> > >
> > > So I see no need to do anything to have magic for 'arm_pmu/LLC-load-misses/'.
> > >
> > > > as I doubt the PMU driver is advertising this legacy event in sysfs
> > > > and the "if ARM" logic would presumably be trying to disable legacy
> > > > events in the term list for the ARM PMU.
> > > >
> > > > Given all of this, is anything actually broken and needing a fix for 6.7?
> > >
> > > There is absolutely a bug that needs to be fixed here (and needs to be
> > > backported to stable so that it gets picked up by distributions).
> >
> > I'm not seeing this. The behavior is consistent with Intel, this has
> > gone 2 releases without being spotted,
>
> > This has gone two releases because people have only just updated their tools. The
> prior behaviour for Arm has been there for most of a decade.
>
> > it was triggered by a PMU event
> > name aliasing a legacy event name and the behavior has always been
> > legacy event names have higher priority than sysfs and json events.
>
> That has been the case for plain events without a PMU name. That was never the
> case for events with a PMU name, or there would not have been any difference in
> behaviour.
>
> > Whilst I'm seeing a lot of complaining, I've not seen a proposal of
> > what behavior you want.
>
> > As per my initial reply, the behaviour we want is that:
>
> pmu/eventname/
>
> ... opens 'eventname' in that PMU's event namespace, rather than converting the
> event into a PERF_TYPE_HARDWARE event. That was the prior behaviour, which
> people have been using for most of a decade.
>
> I understand that there was some Intel-specific behaviour, and that may need to
> be kept for Intel. Making that behaviour generic broke other existing users.
>
> If we need a mechanism to target a legacy event to a specific PMU, we can add
> > an unambiguous way of describing that (e.g. the 'hw:' prefix I've suggested a
> few times).
>
>
> > Isn't it a PMU bug if the legacy event specifying the PMU doesn't get opened
> > by the core PMU?
>
> No?
>
> Prior to that mechanism being added to the kernel, there was no way to do that.
>
> When the mechanism was added to x86 specifically, it wasn't a generic feature.
>
> > Fixing the PMU driver appears to be the right fix and means there is
> > consistency on core events across architectures.
>
> I think that's orthogonal.
>
> Adding support to the PMU drivers (which has already been done, per the commit
> you quoted before) is good so that userspace can do the right thing for:
>
> perf stat -e some_generic_event ./workload
>
> ... but that should not be necessary to retain the existing behaviour for:
>
> perf stat -e pmu/some_similarly_named_event/ ./workload
>
> Thanks,
> Mark.

Given the PMU mapping exists, what is the difficulty in the case of
this PMU? I could explain what I see on ARMv8 devices and the broken
PMU landscape from the last 10 years but that hardly feels
constructive here. I'm not understanding the difficulty of
translating:

struct perf_event_attr {
    ...
    .type = PERF_TYPE_HARDWARE,
    .config = <pmu's type> << 32 | PERF_COUNT_HW_CPU_CYCLES,
    ...
}

to the event called "cycles" that the PMU is advertising? Given the
mapping already has to exist for every core PMU driver.
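
To make the encoding above concrete, here is a minimal sketch, assuming the extended-type layout where the PMU's dynamic type sits in the upper 32 bits of config and the legacy event ID in the lower 32 bits. The sketch_* names are hypothetical, and the constants are local copies rather than the ones from the uapi header, so that the snippet builds without kernel headers:

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for PERF_PMU_TYPE_SHIFT / PERF_HW_EVENT_MASK (assumed values). */
#define SKETCH_PMU_TYPE_SHIFT      32
#define SKETCH_HW_EVENT_MASK       0xffffffffULL
/* Local stand-in for PERF_COUNT_HW_CPU_CYCLES. */
#define SKETCH_COUNT_HW_CPU_CYCLES 0ULL

/* Pack a PMU's dynamic type into the upper 32 bits of a legacy config. */
static uint64_t sketch_extended_config(uint32_t pmu_type, uint64_t hw_event)
{
	/* The legacy event ID must fit in the lower 32 bits. */
	assert((hw_event & ~SKETCH_HW_EVENT_MASK) == 0);
	return ((uint64_t)pmu_type << SKETCH_PMU_TYPE_SHIFT) | hw_event;
}

/* Recover the PMU type a legacy config is directed at (0 means no specific PMU). */
static uint32_t sketch_config_pmu_type(uint64_t config)
{
	return (uint32_t)(config >> SKETCH_PMU_TYPE_SHIFT);
}

/* Recover the legacy event ID from a config value. */
static uint64_t sketch_config_hw_event(uint64_t config)
{
	return config & SKETCH_HW_EVENT_MASK;
}
```

For a PMU whose dynamic type is 7, sketch_extended_config(7, SKETCH_COUNT_HW_CPU_CYCLES) gives 0x700000000, i.e. a legacy cycles event directed at that PMU.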

I can look at doing an event parser change like:

```
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index aa2f5c6fc7fc..9a18fda525d2 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
err_str,
/*help=*/NULL);
return -EINVAL;
}
- if (perf_pmu__supports_legacy_cache(pmu)) {
+ if (perf_pmu__supports_legacy_cache(pmu) &&
+ !perf_pmu__have_event(pmu, term->val.str)) {
attr->type = PERF_TYPE_HW_CACHE;
return
parse_events__decode_legacy_cache(term->config, pmu->type,
&attr->config);
@@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
err_str,
/*help=*/NULL);
return -EINVAL;
}
- attr->type = PERF_TYPE_HARDWARE;
- attr->config = term->val.num;
- if (perf_pmus__supports_extended_type())
- attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
+ if (perf_pmu__have_event(pmu, term->val.str)) {
+ /* If the PMU has a sysfs or json event prefer
it over legacy. ARM requires this. */
+ term->term_type = PARSE_EVENTS__TERM_TYPE_USER;
+ } else {
+ attr->type = PERF_TYPE_HARDWARE;
+ attr->config = term->val.num;
+ if (perf_pmus__supports_extended_type())
+ attr->config |= (__u64)pmu->type <<
PERF_PMU_TYPE_SHIFT;
+ }
return 0;
}
if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||
```
(note: this is incomplete as term->val.str isn't populated for
PARSE_EVENTS__TERM_TYPE_HARDWARE)

but this is a behavioral change on Intel and shouldn't therefore come
in as an rc fix.
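
Stripped of the parser details, the lookup order the diff is aiming for can be sketched as below. This is only an illustration under assumptions: struct sketch_pmu, sketch_pmu_has_event() and sketch_resolve_term() are hypothetical stand-ins, not the perf tool's actual types or functions, and the advertised-event list stands in for whatever sysfs or json provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* How a term that matched a legacy event name ends up being treated. */
enum sketch_term_kind {
	SKETCH_TERM_USER,     /* look the name up in the PMU's own namespace */
	SKETCH_TERM_HARDWARE, /* open as a legacy PERF_TYPE_HARDWARE event */
};

/* Hypothetical PMU description: a name plus the events it advertises. */
struct sketch_pmu {
	const char *name;
	const char *const *events;
	size_t nr_events;
};

/* True if the PMU advertises an event with exactly this name. */
static bool sketch_pmu_has_event(const struct sketch_pmu *pmu, const char *name)
{
	assert(pmu && name);
	for (size_t i = 0; i < pmu->nr_events; i++) {
		if (strcmp(pmu->events[i], name) == 0)
			return true;
	}
	return false;
}

/*
 * Called only for names that already matched a legacy event: prefer the
 * PMU's own event if it has one, otherwise fall back to the legacy event.
 */
static enum sketch_term_kind sketch_resolve_term(const struct sketch_pmu *pmu,
						 const char *name)
{
	return sketch_pmu_has_event(pmu, name) ? SKETCH_TERM_USER
					       : SKETCH_TERM_HARDWARE;
}
```

With this ordering, a PMU that advertises "cycles" gets its own event for pmu/cycles/, while a PMU that only advertises "cpu_cycles" still falls back to the legacy cycles event.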

Thanks,
Ian

2023-11-22 16:33:55

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 8:26 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Wed, Nov 22, 2023 at 08:04:26AM -0800, Ian Rogers escreveu:
> > On Wed, Nov 22, 2023 at 7:49 AM Mark Rutland <[email protected]> wrote:
> > >
> > > On Wed, Nov 22, 2023 at 10:06:23AM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Em Wed, Nov 22, 2023 at 12:23:27PM +0900, Hector Martin escreveu:
> > > > > On 2023/11/22 1:38, Ian Rogers wrote:
> > > > > > On Tue, Nov 21, 2023 at 8:15 AM Mark Rutland <[email protected]> wrote:
> > > > > >> On Tue, Nov 21, 2023 at 08:09:37AM -0800, Ian Rogers wrote:
> > > > > >>> On Tue, Nov 21, 2023 at 8:03 AM Mark Rutland <[email protected]> wrote:
> > > > > >>>> On Tue, Nov 21, 2023 at 07:46:57AM -0800, Ian Rogers wrote:
> > > > > >>>>> On Tue, Nov 21, 2023 at 7:40 AM Mark Rutland <[email protected]> wrote:
> > > > > >>>>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > > > >>>>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
> > > > > >>>>>>> Marc Zyngier <[email protected]> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> [Adding key people on Cc]
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > > >>>>>>>> Hector Martin <[email protected]> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > >>>>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > > > >>>>>>>> asymmetric ARM platform. It isn't clear what criteria are used to pick
> > > > > >>>>>>>> the PMU, but nothing works anymore.
> > > > > >>>>>>>>
> > > > > >>>>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > > >>>>>>>> package, but that's obviously not going to last.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I'm happy to test potential fixes.
> > > > > >>>>>>>
> > > > > >>>>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > > > >>>>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > > > >>>>>>> CPU):
> > > > > >>>>>>
> > > > > >>>>>> IIUC the tool is doing the wrong thing here and overriding explicit
> > > > > >>>>>> ${pmu}/${event}/ events with PERF_TYPE_HARDWARE events rather than events using
> > > > > >>>>>> that ${pmu}'s type and event namespace.
> > > > > >>>>>>
> > > > > >>>>>> Regardless of the *new* ABI that allows PERF_TYPE_HARDWARE events to be
> > > > > > >>>>>> targeted to a specific PMU, it's semantically wrong to rewrite events like
> > > > > >>>>>> this since ${pmu}/${event}/ is not necessarily equivalent to a similarly-named
> > > > > >>>>>> PERF_COUNT_HW_${EVENT}.
> > > > > >>>>>
> > > > > >>>>> If you name a PMU and an event then the event should only be opened on
> > > > > >>>>> that PMU, 100% agree. There's a bunch of output, but when the legacy
> > > > > >>>>> cycles event is opened it appears to be because it was explicitly
> > > > > >>>>> requested.
> > > > > >>>>
> > > > > > >>>> I think you've missed that the named PMU events are being erroneously transformed
> > > > > >>>> into PERF_TYPE_HARDWARE events. Look at the -vvv output, e.g.
> > > > > >>>>
> > > > > >>>> Opening: apple_firestorm_pmu/cycles/
> > > > > >>>> ------------------------------------------------------------
> > > > > >>>> perf_event_attr:
> > > > > >>>> type 0 (PERF_TYPE_HARDWARE)
> > > > > >>>> size 136
> > > > > >>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > > > >>>> sample_type IDENTIFIER
> > > > > >>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > > > >>>> disabled 1
> > > > > >>>> inherit 1
> > > > > >>>> enable_on_exec 1
> > > > > >>>> exclude_guest 1
> > > > > >>>> ------------------------------------------------------------
> > > > > >>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > > > >>>>
> > > > > >>>> ... which should not be PERF_TYPE_HARDWARE && PERF_COUNT_HW_CPU_CYCLES.
> > > > > >>>>
> > > > > >>>> Marc said that he bisected the issue down to commit:
> > > > > >>>>
> > > > > >>>> 5ea8f2ccffb23983 ("perf parse-events: Support hardware events as terms")
> > > > > >>>>
> > > > > >>>> ... so it looks like something is going wrong when the events are being parsed,
> > > > > >>>> e.g. losing the HW PMU information?
> > > > > >>>
> > > > > >>> Ok, I think I'm getting confused by other things. This looks like the issue.
> > > > > >>>
> > > > > >>> I think it may be working as intended, but not how you intended :-) If
> > > > > >>> a core PMU is listed and then a legacy event, the legacy event should
> > > >
> > > > The point is that "cycles" when prefixed with "pmu/" shouldn't be
> > > > considered "cycles" as HW/0, in that setting it is "cycles" for that
> > > > PMU.
> > >
> > > Exactly.
> > >
> > > > > (but we only have "cpu_cycles" for at least the a53 and a72 PMUs I
> > > > > have access to on a Libre Computer rockchip 3399-pc hybrid board; if we use
> > > > > it, then we get what we want/had before, see below):
> > >
> > > Both Cortex-A53 and Cortex-A72 have the common PMUv3 events, so they have
> > > "cpu_cycles" and "bus_cycles".
> > >
> > > > The Apple PMUs that Hector and Marc are using don't follow the PMUv3
> > > architecture, and just have a "cycles" event.
> > >
> > > [...]
> > >
> > > > > So what we need here seems to be to translate the generic term "cycles"
> > > > > to "cpu_cycles" when a PMU is explicitly passed in the event name and
> > > > it doesn't have "cycles" and then just retry.
> > >
> > > I'm not sure we need to map that.
> > >
> > > My thinking is:
> > >
> > > * If the user asks for "cycles" without a PMU name, that should use the
> > > PERF_TYPE_HARDWARE cycles event. The ARM PMUs handle that correctly when the
> > > event is directed to them.
> > >
> > > * If the user asks for "${pmu}/cycles/", that should only use the "cycles"
> > > event in that PMU's namespace, not PERF_TYPE_HARDWARE.
> > >
> > > > * If we need a way to say "use the PERF_TYPE_HARDWARE cycles event on ${pmu}",
> > > then we should have a new syntax for that (e.g. as we have for raw events),
> > > e.g. it would be possible to have "pmu/hw:cycles/" or something like that.
> > >
> > > That way there's no ambiguity.
> >
> > This would break cpu_core/LLC-load-misses/ on Intel hybrid as the
> > LLC-load-misses event is legacy and not advertised in either sysfs or
> > in json.
>
> Indeed:
>
> [root@quaco ~]# ls /sys/devices/cpu/events/
> branch-instructions bus-cycles cache-references instructions mem-stores topdown-fetch-bubbles topdown-recovery-bubbles.scale topdown-slots-retired topdown-total-slots.scale
> branch-misses cache-misses cpu-cycles mem-loads ref-cycles topdown-recovery-bubbles topdown-slots-issued topdown-total-slots
> [root@quaco ~]# strace -e perf_event_open perf stat -e cpu/LLC-load-misses/ echo
> perf_event_open({type=PERF_TYPE_HW_CACHE, size=0x88 /* PERF_ATTR_SIZE_??? */, config=PERF_COUNT_HW_CACHE_RESULT_MISS<<16|PERF_COUNT_HW_CACHE_OP_READ<<8|PERF_COUNT_HW_CACHE_LL, sample_period=0, sample_type=PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=1, enable_on_exec=1, precise_ip=0 /* arbitrary skid */, exclude_guest=1, ...}, 41467, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
>
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=41467, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
>
> Performance counter stats for 'echo':
>
> 1,015 cpu/LLC-load-misses/
>
> 0.005167119 seconds time elapsed
>
> 0.000821000 seconds user
> 0.004105000 seconds sys
>
>
> --- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=41466, si_uid=0} ---
> +++ exited with 0 +++
> [root@quaco ~]#
>
> Is it difficult, before doing the current expansion to
> PERF_TYPE_HARDWARE/PERF_HW_CPU_CYCLES, to just check if there is an event
> with the specified name in the specified PMU and, if there is, use that?

Agreed and I've sent an early cut of this. The issue is that then we
end up changing the encoding on Intel. I also don't see why ARM
doesn't just fix their PMU.

Thanks,
Ian

> - Arnaldo

2023-11-22 16:56:30

by Arnaldo Carvalho de Melo

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

Em Wed, Nov 22, 2023 at 08:29:58AM -0800, Ian Rogers escreveu:
> I can look at doing an event parser change like:
>
> ```
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index aa2f5c6fc7fc..9a18fda525d2 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
> err_str,
> /*help=*/NULL);
> return -EINVAL;
> }
> - if (perf_pmu__supports_legacy_cache(pmu)) {
> + if (perf_pmu__supports_legacy_cache(pmu) &&
> + !perf_pmu__have_event(pmu, term->val.str)) {
> attr->type = PERF_TYPE_HW_CACHE;
> return
> parse_events__decode_legacy_cache(term->config, pmu->type,
> &attr->config);
> @@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
> err_str,
> /*help=*/NULL);
> return -EINVAL;
> }
> - attr->type = PERF_TYPE_HARDWARE;
> - attr->config = term->val.num;
> - if (perf_pmus__supports_extended_type())
> - attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> + if (perf_pmu__have_event(pmu, term->val.str)) {
> + /* If the PMU has a sysfs or json event prefer
> it over legacy. ARM requires this. */
> + term->term_type = PARSE_EVENTS__TERM_TYPE_USER;
> + } else {
> + attr->type = PERF_TYPE_HARDWARE;
> + attr->config = term->val.num;
> + if (perf_pmus__supports_extended_type())
> + attr->config |= (__u64)pmu->type <<
> PERF_PMU_TYPE_SHIFT;
> + }
> return 0;
> }
> if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||
> ```
> (note: this is incomplete as term->val.str isn't populated for
> PARSE_EVENTS__TERM_TYPE_HARDWARE)

Yeah, I had to apply it manually as your MUA mangled it. Then it didn't
build: I had to remove some consts, and there was a struct member
mistake. With all that fixed I get to the patch below, but it now segfaults,
probably because of what you mention...

root@roc-rk3399-pc:~# strace -e perf_event_open taskset -c 4,5 perf stat -v -e cycles,armv8_cortex_a53/cycles/,armv8_cortex_a72/cycles/ echo
Using CPUID 0x00000000410fd082
perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
+++ killed by SIGSEGV +++
Segmentation fault
root@roc-rk3399-pc:~#

- Arnaldo

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index aa2f5c6fc7fc..1e648454cc49 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -976,7 +976,7 @@ static int config_term_pmu(struct perf_event_attr *attr,
struct parse_events_error *err)
{
if (term->type_term == PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE) {
- const struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
+ struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);

if (!pmu) {
char *err_str;
@@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
err_str, /*help=*/NULL);
return -EINVAL;
}
- if (perf_pmu__supports_legacy_cache(pmu)) {
+ if (perf_pmu__supports_legacy_cache(pmu) &&
+ !perf_pmu__have_event(pmu, term->val.str)) {
attr->type = PERF_TYPE_HW_CACHE;
return parse_events__decode_legacy_cache(term->config, pmu->type,
&attr->config);
@@ -994,7 +995,7 @@ static int config_term_pmu(struct perf_event_attr *attr,
term->type_term = PARSE_EVENTS__TERM_TYPE_USER;
}
if (term->type_term == PARSE_EVENTS__TERM_TYPE_HARDWARE) {
- const struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
+ struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);

if (!pmu) {
char *err_str;
@@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
err_str, /*help=*/NULL);
return -EINVAL;
}
- attr->type = PERF_TYPE_HARDWARE;
- attr->config = term->val.num;
- if (perf_pmus__supports_extended_type())
- attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
+ if (perf_pmu__have_event(pmu, term->val.str)) {
+ /* If the PMU has a sysfs or JSON event prefer it over legacy. ARM requires this. */
+ term->type_term = PARSE_EVENTS__TERM_TYPE_USER;
+ } else {
+ attr->type = PERF_TYPE_HARDWARE;
+ attr->config = term->val.num;
+ if (perf_pmus__supports_extended_type())
+ attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
+ }
return 0;
}
if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||

2023-11-22 17:02:13

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 8:55 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> Em Wed, Nov 22, 2023 at 08:29:58AM -0800, Ian Rogers escreveu:
> > I can look at doing an event parser change like:
> >
> > ```
> > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> > index aa2f5c6fc7fc..9a18fda525d2 100644
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > err_str,
> > /*help=*/NULL);
> > return -EINVAL;
> > }
> > - if (perf_pmu__supports_legacy_cache(pmu)) {
> > + if (perf_pmu__supports_legacy_cache(pmu) &&
> > + !perf_pmu__have_event(pmu, term->val.str)) {
> > attr->type = PERF_TYPE_HW_CACHE;
> > return
> > parse_events__decode_legacy_cache(term->config, pmu->type,
> > &attr->config);
> > @@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > err_str,
> > /*help=*/NULL);
> > return -EINVAL;
> > }
> > - attr->type = PERF_TYPE_HARDWARE;
> > - attr->config = term->val.num;
> > - if (perf_pmus__supports_extended_type())
> > - attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> > + if (perf_pmu__have_event(pmu, term->val.str)) {
> > + /* If the PMU has a sysfs or json event prefer
> > it over legacy. ARM requires this. */
> > + term->term_type = PARSE_EVENTS__TERM_TYPE_USER;
> > + } else {
> > + attr->type = PERF_TYPE_HARDWARE;
> > + attr->config = term->val.num;
> > + if (perf_pmus__supports_extended_type())
> > + attr->config |= (__u64)pmu->type <<
> > PERF_PMU_TYPE_SHIFT;
> > + }
> > return 0;
> > }
> > if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||
> > ```
> > (note: this is incomplete as term->val.str isn't populated for
> > PARSE_EVENTS__TERM_TYPE_HARDWARE)
>
> Yeah, I had to apply manually as your MUA mangled it, then it didn't
> build, had to remove some consts, then there was a struct member
> mistake, after all fixed I get to the patch below, but it now segfaults,
> probably what you mention...
>
> root@roc-rk3399-pc:~# strace -e perf_event_open taskset -c 4,5 perf stat -v -e cycles,armv8_cortex_a53/cycles/,armv8_cortex_a72/cycles/ echo
> Using CPUID 0x00000000410fd082
> perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)
> --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
> +++ killed by SIGSEGV +++
> Segmentation fault
> root@roc-rk3399-pc:~#

Right, I have something further along that fails tests. I'll try to
send out an RFC today, but given the Intel behavior change ¯\_(ツ)_/¯
Intel doesn't appear to have an issue with two things both being called,
for example, cycles and both being a cycles event, so they may not
care. It is only ARM's PMUs that appear broken in this way.

Thanks,
Ian

> - Arnaldo
>
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index aa2f5c6fc7fc..1e648454cc49 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -976,7 +976,7 @@ static int config_term_pmu(struct perf_event_attr *attr,
> struct parse_events_error *err)
> {
> if (term->type_term == PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE) {
> - const struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
> + struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
>
> if (!pmu) {
> char *err_str;
> @@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
> err_str, /*help=*/NULL);
> return -EINVAL;
> }
> - if (perf_pmu__supports_legacy_cache(pmu)) {
> + if (perf_pmu__supports_legacy_cache(pmu) &&
> + !perf_pmu__have_event(pmu, term->val.str)) {
> attr->type = PERF_TYPE_HW_CACHE;
> return parse_events__decode_legacy_cache(term->config, pmu->type,
> &attr->config);
> @@ -994,7 +995,7 @@ static int config_term_pmu(struct perf_event_attr *attr,
> term->type_term = PARSE_EVENTS__TERM_TYPE_USER;
> }
> if (term->type_term == PARSE_EVENTS__TERM_TYPE_HARDWARE) {
> - const struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
> + struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
>
> if (!pmu) {
> char *err_str;
> @@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
> err_str, /*help=*/NULL);
> return -EINVAL;
> }
> - attr->type = PERF_TYPE_HARDWARE;
> - attr->config = term->val.num;
> - if (perf_pmus__supports_extended_type())
> - attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> + if (perf_pmu__have_event(pmu, term->val.str)) {
> + /* If the PMU has a sysfs or JSON event prefer it over legacy. ARM requires this. */
> + term->type_term = PARSE_EVENTS__TERM_TYPE_USER;
> + } else {
> + attr->type = PERF_TYPE_HARDWARE;
> + attr->config = term->val.num;
> + if (perf_pmus__supports_extended_type())
> + attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> + }
> return 0;
> }
> if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||

2023-11-23 04:33:32

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Wed, Nov 22, 2023 at 8:59 AM Ian Rogers <[email protected]> wrote:
>
> On Wed, Nov 22, 2023 at 8:55 AM Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> >
> > Em Wed, Nov 22, 2023 at 08:29:58AM -0800, Ian Rogers escreveu:
> > > I can look at doing an event parser change like:
> > >
> > > ```
> > > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> > > index aa2f5c6fc7fc..9a18fda525d2 100644
> > > --- a/tools/perf/util/parse-events.c
> > > +++ b/tools/perf/util/parse-events.c
> > > @@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > > err_str,
> > > /*help=*/NULL);
> > > return -EINVAL;
> > > }
> > > - if (perf_pmu__supports_legacy_cache(pmu)) {
> > > + if (perf_pmu__supports_legacy_cache(pmu) &&
> > > + !perf_pmu__have_event(pmu, term->val.str)) {
> > > attr->type = PERF_TYPE_HW_CACHE;
> > > return
> > > parse_events__decode_legacy_cache(term->config, pmu->type,
> > > &attr->config);
> > > @@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > > err_str,
> > > /*help=*/NULL);
> > > return -EINVAL;
> > > }
> > > - attr->type = PERF_TYPE_HARDWARE;
> > > - attr->config = term->val.num;
> > > - if (perf_pmus__supports_extended_type())
> > > - attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> > > + if (perf_pmu__have_event(pmu, term->val.str)) {
> > > + /* If the PMU has a sysfs or json event prefer
> > > it over legacy. ARM requires this. */
> > > + term->term_type = PARSE_EVENTS__TERM_TYPE_USER;
> > > + } else {
> > > + attr->type = PERF_TYPE_HARDWARE;
> > > + attr->config = term->val.num;
> > > + if (perf_pmus__supports_extended_type())
> > > + attr->config |= (__u64)pmu->type <<
> > > PERF_PMU_TYPE_SHIFT;
> > > + }
> > > return 0;
> > > }
> > > if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||
> > > ```
> > > (note: this is incomplete as term->val.str isn't populated for
> > > PARSE_EVENTS__TERM_TYPE_HARDWARE)
> >
> > Yeah, I had to apply manually as your MUA mangled it, then it didn't
> > build, had to remove some consts, then there was a struct member
> > mistake, after all fixed I get to the patch below, but it now segfaults,
> > probably what you mention...
> >
> > root@roc-rk3399-pc:~# strace -e perf_event_open taskset -c 4,5 perf stat -v -e cycles,armv8_cortex_a53/cycles/,armv8_cortex_a72/cycles/ echo
> > Using CPUID 0x00000000410fd082
> > perf_event_open({type=PERF_TYPE_HARDWARE, size=0 /* PERF_ATTR_SIZE_??? */, config=0x7<<32|PERF_COUNT_HW_CPU_CYCLES, sample_period=0, sample_type=0, read_format=0, disabled=1, precise_ip=0 /* arbitrary skid */, ...}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT (No such file or directory)
> > --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
> > +++ killed by SIGSEGV +++
> > Segmentation fault
> > root@roc-rk3399-pc:~#
>
> Right, I have something further along that fails tests. I'll try to
> send out an RFC today, but given the Intel behavior change ¯\_(ツ)_/¯
> But Intel don't appear to have an issue having two things called, for
> example, cycles and them both being a cycles event so they may not
> care. It is only ARM's PMUs that appear broken in this way.

A workaround for the PMU bug has been posted:
https://lore.kernel.org/lkml/[email protected]/

Thanks,
Ian

> Thanks,
> Ian
>
> > - Arnaldo
> >
> > diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> > index aa2f5c6fc7fc..1e648454cc49 100644
> > --- a/tools/perf/util/parse-events.c
> > +++ b/tools/perf/util/parse-events.c
> > @@ -976,7 +976,7 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > struct parse_events_error *err)
> > {
> > if (term->type_term == PARSE_EVENTS__TERM_TYPE_LEGACY_CACHE) {
> > - const struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
> > + struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
> >
> > if (!pmu) {
> > char *err_str;
> > @@ -986,7 +986,8 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > err_str, /*help=*/NULL);
> > return -EINVAL;
> > }
> > - if (perf_pmu__supports_legacy_cache(pmu)) {
> > + if (perf_pmu__supports_legacy_cache(pmu) &&
> > + !perf_pmu__have_event(pmu, term->val.str)) {
> > attr->type = PERF_TYPE_HW_CACHE;
> > return parse_events__decode_legacy_cache(term->config, pmu->type,
> > &attr->config);
> > @@ -994,7 +995,7 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > term->type_term = PARSE_EVENTS__TERM_TYPE_USER;
> > }
> > if (term->type_term == PARSE_EVENTS__TERM_TYPE_HARDWARE) {
> > - const struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
> > + struct perf_pmu *pmu = perf_pmus__find_by_type(attr->type);
> >
> > if (!pmu) {
> > char *err_str;
> > @@ -1004,10 +1005,15 @@ static int config_term_pmu(struct perf_event_attr *attr,
> > err_str, /*help=*/NULL);
> > return -EINVAL;
> > }
> > - attr->type = PERF_TYPE_HARDWARE;
> > - attr->config = term->val.num;
> > - if (perf_pmus__supports_extended_type())
> > - attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> > + if (perf_pmu__have_event(pmu, term->val.str)) {
> > + /* If the PMU has a sysfs or JSON event prefer it over legacy. ARM requires this. */
> > + term->type_term = PARSE_EVENTS__TERM_TYPE_USER;
> > + } else {
> > + attr->type = PERF_TYPE_HARDWARE;
> > + attr->config = term->val.num;
> > + if (perf_pmus__supports_extended_type())
> > + attr->config |= (__u64)pmu->type << PERF_PMU_TYPE_SHIFT;
> > + }
> > return 0;
> > }
> > if (term->type_term == PARSE_EVENTS__TERM_TYPE_USER ||

2023-11-23 14:23:56

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> On Tue, 21 Nov 2023 13:40:31 +0000,
> Marc Zyngier <[email protected]> wrote:
> >
> > [Adding key people on Cc]
> >
> > On Tue, 21 Nov 2023 12:08:48 +0000,
> > Hector Martin <[email protected]> wrote:
> > >
> > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> >
> > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > the PMU, but nothing works anymore.
> >
> > The saving grace in my case is that Debian still ships a 6.1 perftool
> > package, but that's obviously not going to last.
> >
> > I'm happy to test potential fixes.
>
> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> CPU):

Looking at this with fresh(er) eyes, I think there's a userspace bug here,
regardless of whether one believes it's correct to convert a named-pmu event to
a PERF_TYPE_HARDWARE event directed at that PMU.

It looks like the userspace tool is dropping the extended type ID after an
initial probe, and requests events with plain PERF_TYPE_HARDWARE (without an
extended type ID), which explains why we seem to get events from one PMU only.

More detail below...

Marc, if you have time, could you run the same commands (on the same kernel)
with a perf tool build from v6.4?

> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------

Here config[31:0] is 0 (PERF_COUNT_HW_CPU_CYCLES), and config[63:32] is 0xb,
which is presumably the PMU ID for the apple_icestorm_pmu.

The attr doesn't contain exclude_guest=1, so this will be rejected by the PMU
driver due to its mode exclusion requirements.

> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95

... which is what we see here (this is EOPNOTSUPP, which __hw_perf_event_init()
in drivers/perf/arm_pmu.c returns when the requested mode exclusion options
aren't supported).

So far, so good...

> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------

... but here, the extended type ID has been dropped, and this event is no
longer directed towards the apple_icestorm_pmu PMU, so the kernel can direct
this to *any* CPU PMU...

> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3

... and *some* PMU accepts it.

> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------

Likewise here, no extended type ID...

> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------

Likewise here, no extended type ID...

> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 0 873709 0
> apple_firestorm_pmu/cycles/: -1: 0 873709 0
> cycles: -1: 0 873709 0
> apple_icestorm_pmu/cycles/: 0 873709 0
> apple_firestorm_pmu/cycles/: 0 873709 0
> cycles: 0 873709 0
>
> Performance counter stats for 'ls':
>
> <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> <not counted> cycles (0.00%)
>
> 0.000002250 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys

So it looks like the tool has expanded the requested
'apple_icestorm_pmu/cycles/' event into three cycles events, each opened
without an extended type ID.

AFAICT, the kernel has done exactly what it has always done for
PERF_TYPE_HARDWARE/PERF_COUNT_HW_CPU_CYCLES events: pick the first PMU which
said it can handle them.

> If I run the same thing on another CPU cluster (firestorm), I get
> this:
>
> <quote>
> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> apple_firestorm_pmu/cycles/ -e cycles ls
> Using CPUID 0x00000000612f0280
> Attempt to add: apple_icestorm_pmu/cycles=0/
> ..after resolving event: apple_icestorm_pmu/cycles=0/
> Opening: unknown-hardware:HG
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> config 0xb00000000
> disabled 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> sys_perf_event_open failed, error -95

Again, we see one request with an extended type ID, which fails due to mode exclusion requirements...

> Attempt to add: apple_firestorm_pmu/cycles=0/
> ..after resolving event: apple_firestorm_pmu/cycles=0/
> Control descriptor is not initialized
> Opening: apple_icestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> Opening: apple_firestorm_pmu/cycles/
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> Opening: cycles
> ------------------------------------------------------------
> perf_event_attr:
> type 0 (PERF_TYPE_HARDWARE)
> size 136
> config 0 (PERF_COUNT_HW_CPU_CYCLES)
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------

... but all subsequent requests do not have an extended type ID, and the kernel
directs these to whichever PMU accepts the event first...

> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> builtin-daemon.o builtin-list.c builtin-version.c perf ui
> builtin-data.c builtin-list.o builtin-version.o perf-archive util
> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> builtin-diff.c builtin-mem.c command-list.txt perf.c
> apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> cycles: -1: 1034653 469125 469125
> apple_icestorm_pmu/cycles/: 1035101 469125 469125
> apple_firestorm_pmu/cycles/: 1035035 469125 469125
> cycles: 1034653 469125 469125
>
> Performance counter stats for 'ls':
>
> 1,035,101 apple_icestorm_pmu/cycles/
> 1,035,035 apple_firestorm_pmu/cycles/
> 1,034,653 cycles
>
> 0.000001333 seconds time elapsed
>
> 0.000000000 seconds user
> 0.000000000 seconds sys
> </quote>

... and in this case the workload was run on a CPU affine to that arbitrary
PMU, hence we managed to count.

So AFAICT, this is a userspace bug, maybe related to the way we probe for
supported PMU features?

Thanks,
Mark.

2023-11-23 14:47:36

by Marc Zyngier

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Thu, 23 Nov 2023 14:23:10 +0000,
Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > On Tue, 21 Nov 2023 13:40:31 +0000,
> > Marc Zyngier <[email protected]> wrote:
> > >
> > > [Adding key people on Cc]
> > >
> > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > Hector Martin <[email protected]> wrote:
> > > >
> > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >
> > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > the PMU, but nothing works anymore.
> > >
> > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > package, but that's obviously not going to last.
> > >
> > > I'm happy to test potential fixes.
> >
> > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > CPU):
>
> Looking at this with fresh(er) eyes, I think there's a userspace bug here,
> regardless of whether one believes it's correct to convert a named-pmu event to
> a PERF_TYPE_HARDWARE event directed at that PMU.
>
> It looks like the userspace tool is dropping the extended type ID after an
> initial probe, and requests events with plain PERF_TYPE_HARDWARE (without an
> extended type ID), which explains why we seem to get events from one PMU only.
>
> More detail below...
>
> Marc, if you have time, could you run the same commands (on the same kernel)
> with a perf tool build from v6.4?

Here you go:

<quote>
$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e apple_firestorm_pmu/cycles/ -e cycles ls >/dev/null
Using CPUID 0x00000000610f0280
Attempting to add event pmu 'apple_icestorm_pmu' with 'cycles,' that may result in non-fatal errors
After aliases, add event pmu 'apple_icestorm_pmu' with 'event,' that may result in non-fatal errors
Attempting to add event pmu 'apple_firestorm_pmu' with 'cycles,' that may result in non-fatal errors
After aliases, add event pmu 'apple_firestorm_pmu' with 'event,' that may result in non-fatal errors
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
type 10
size 136
config 0x2
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1624462 cpu -1 group_fd -1 flags 0x8 = 3
------------------------------------------------------------
perf_event_attr:
type 11
size 136
config 0x2
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1624462 cpu -1 group_fd -1 flags 0x8 = 4
------------------------------------------------------------
perf_event_attr:
size 136
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1624462 cpu -1 group_fd -1 flags 0x8 = 5
apple_icestorm_pmu/cycles/: -1: 1492180 724333 724333
apple_firestorm_pmu/cycles/: -1: 0 724333 0
cycles: -1: 0 724333 0
apple_icestorm_pmu/cycles/: 1492180 724333 724333
apple_firestorm_pmu/cycles/: 0 724333 0
cycles: 0 724333 0

Performance counter stats for 'ls':

1,492,180 apple_icestorm_pmu/cycles/
<not counted> apple_firestorm_pmu/cycles/ (0.00%)
<not counted> cycles (0.00%)

0.000001917 seconds time elapsed

0.000000000 seconds user
0.000000000 seconds sys
</quote>

and on the other cluster:

<quote>
$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e apple_firestorm_pmu/cycles/ -e cycles ls >/dev/null
Using CPUID 0x00000000610f0280
Attempting to add event pmu 'apple_icestorm_pmu' with 'cycles,' that may result in non-fatal errors
After aliases, add event pmu 'apple_icestorm_pmu' with 'event,' that may result in non-fatal errors
Attempting to add event pmu 'apple_firestorm_pmu' with 'cycles,' that may result in non-fatal errors
After aliases, add event pmu 'apple_firestorm_pmu' with 'event,' that may result in non-fatal errors
Control descriptor is not initialized
------------------------------------------------------------
perf_event_attr:
type 10
size 136
config 0x2
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1624466 cpu -1 group_fd -1 flags 0x8 = 3
------------------------------------------------------------
perf_event_attr:
type 11
size 136
config 0x2
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1624466 cpu -1 group_fd -1 flags 0x8 = 4
------------------------------------------------------------
perf_event_attr:
size 136
sample_type IDENTIFIER
read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
disabled 1
inherit 1
enable_on_exec 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid 1624466 cpu -1 group_fd -1 flags 0x8 = 5
apple_icestorm_pmu/cycles/: -1: 0 593209 0
apple_firestorm_pmu/cycles/: -1: 1038247 593209 593209
cycles: -1: 1037870 593209 593209
apple_icestorm_pmu/cycles/: 0 593209 0
apple_firestorm_pmu/cycles/: 1038247 593209 593209
cycles: 1037870 593209 593209

Performance counter stats for 'ls':

<not counted> apple_icestorm_pmu/cycles/ (0.00%)
1,038,247 apple_firestorm_pmu/cycles/
1,037,870 cycles

0.000001500 seconds time elapsed

0.000000000 seconds user
0.000000000 seconds sys
</quote>

For the record, this is on a 6.6-rc6 kernel, userspace perf as of v6.4.0.

M.

--
Without deviation from the norm, progress is not possible.

2023-11-23 15:15:53

by Ian Rogers

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Thu, Nov 23, 2023 at 6:23 AM Mark Rutland <[email protected]> wrote:
>
> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > On Tue, 21 Nov 2023 13:40:31 +0000,
> > Marc Zyngier <[email protected]> wrote:
> > >
> > > [Adding key people on Cc]
> > >
> > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > Hector Martin <[email protected]> wrote:
> > > >
> > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > >
> > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > asymmetric ARM platform. It isn't clear what criteria is used to pick
> > > the PMU, but nothing works anymore.
> > >
> > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > package, but that's obviously not going to last.
> > >
> > > I'm happy to test potential fixes.
> >
> > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > CPU):
>
> Looking at this with fresh(er) eyes, I think there's a userspace bug here,
> regardless of whether one believes it's correct to convert a named-pmu event to
> a PERF_TYPE_HARDWARE event directed at that PMU.
>
> It looks like the userspace tool is dropping the extended type ID after an
> initial probe, and requests events with plain PERF_TYPE_HARDWARE (without an
> extended type ID), which explains why we seem to get events from one PMU only.
>
> More detail below...
>
> Marc, if you have time, could you run the same commands (on the same kernel)
> with a perf tool build from v6.4?
>
> > <quote>
> > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > apple_firestorm_pmu/cycles/ -e cycles ls
> > Using CPUID 0x00000000612f0280
> > Attempt to add: apple_icestorm_pmu/cycles=0/
> > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > Opening: unknown-hardware:HG
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > config 0xb00000000
> > disabled 1
> > ------------------------------------------------------------
>
> Here config[31:0] is 0 (PERF_COUNT_HW_CPU_CYCLES), and config[63:32] is 0xb,
> which is presumably the PMU ID for the apple_icestorm_pmu.
>
> The attr doesn't contain exclude_guest=1, so this will be rejected by the PMU
> driver due to its mode exclusion requirements.
>
> > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > sys_perf_event_open failed, error -95
>
> ... which is what we see here (this is EOPNOTSUPP, which __hw_perf_event_init()
> in drivers/perf/arm_pmu.c returns when the requested mode exclusion
> options aren't supported).
>
> So far, so good...
>
> > Attempt to add: apple_firestorm_pmu/cycles=0/
> > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > Control descriptor is not initialized
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> ... but here, the extended type ID has been dropped, and this event is no
> longer directed towards the apple_icestorm_pmu PMU, so the kernel can direct
> this to *any* CPU PMU...
>
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
>
> ... and *some* PMU accepts it.
>
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> Likewise here, no extended type ID...
>
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> Likewise here, no extended type ID...
>
> > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > apple_icestorm_pmu/cycles/: -1: 0 873709 0
> > apple_firestorm_pmu/cycles/: -1: 0 873709 0
> > cycles: -1: 0 873709 0
> > apple_icestorm_pmu/cycles/: 0 873709 0
> > apple_firestorm_pmu/cycles/: 0 873709 0
> > cycles: 0 873709 0
> >
> > Performance counter stats for 'ls':
> >
> > <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> > <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> > <not counted> cycles (0.00%)
> >
> > 0.000002250 seconds time elapsed
> >
> > 0.000000000 seconds user
> > 0.000000000 seconds sys
>
> So it looks like the tool has expanded the requested
> 'apple_icestorm_pmu/cycles/' event into three cycles events, each opened
> without an extended type ID.
>
> AFAICT, the kernel has done exactly what it has always done for
> PERF_TYPE_HARDWARE/PERF_COUNT_HW_CPU_CYCLES events: pick the first PMU which
> said it can handle them.
>
> > If I run the same thing on another CPU cluster (firestorm), I get
> > this:
> >
> > <quote>
> > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > apple_firestorm_pmu/cycles/ -e cycles ls
> > Using CPUID 0x00000000612f0280
> > Attempt to add: apple_icestorm_pmu/cycles=0/
> > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > Opening: unknown-hardware:HG
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > config 0xb00000000
> > disabled 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > sys_perf_event_open failed, error -95
>
> Again, we see one request with an extended type ID, which fails due to mode exclusion requirements...
>
> > Attempt to add: apple_firestorm_pmu/cycles=0/
> > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > Control descriptor is not initialized
> > Opening: apple_icestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> > Opening: apple_firestorm_pmu/cycles/
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> > Opening: cycles
> > ------------------------------------------------------------
> > perf_event_attr:
> > type 0 (PERF_TYPE_HARDWARE)
> > size 136
> > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > sample_type IDENTIFIER
> > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > disabled 1
> > inherit 1
> > enable_on_exec 1
> > exclude_guest 1
> > ------------------------------------------------------------
>
> ... but all subsequent requests do not have an extended type ID, and the kernel
> directs these to whichever PMU accepts the event first...
>
> > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> > apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> > cycles: -1: 1034653 469125 469125
> > apple_icestorm_pmu/cycles/: 1035101 469125 469125
> > apple_firestorm_pmu/cycles/: 1035035 469125 469125
> > cycles: 1034653 469125 469125
> >
> > Performance counter stats for 'ls':
> >
> > 1,035,101 apple_icestorm_pmu/cycles/
> > 1,035,035 apple_firestorm_pmu/cycles/
> > 1,034,653 cycles
> >
> > 0.000001333 seconds time elapsed
> >
> > 0.000000000 seconds user
> > 0.000000000 seconds sys
> > </quote>
>
> ... and in this case the workload was run on a CPU affine to that arbitrary
> PMU, hence we managed to count.
>
> So AFAICT, this is a userspace bug, maybe related to the way we probe for
> supported PMU features?

Probing PMU features is done by trying to perf_event_open events. For
extended types it is a cycles event on each core PMU:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532

The is_event_supported logic is here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n232

There is the following comment:

    if (open_return == -EACCES) {
        /*
         * This happens if the paranoid value
         * /proc/sys/kernel/perf_event_paranoid is set to 2
         * Re-run with exclude_kernel set; we don't do that
         * by default as some ARM machines do not support it.
         *
         */
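
The retry that comment describes can be sketched roughly as below. This is only an illustration and not the actual is_event_supported() code: struct retry_attr, retry_open_fn and the two sim_* functions are hypothetical stand-ins (for struct perf_event_attr, perf_event_open() and the kernel's response respectively), so that the control flow can be followed without a real PMU.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct perf_event_attr. */
struct retry_attr {
	bool exclude_kernel;
};

/*
 * Hypothetical stand-in for perf_event_open(): returns an fd >= 0 on
 * success, or a negative errno on failure.
 */
typedef int (*retry_open_fn)(const struct retry_attr *attr);

/*
 * Simulated kernel with perf_event_paranoid set to 2 and an unprivileged
 * caller: counting kernel mode is refused with -EACCES.
 */
static int sim_paranoid_open(const struct retry_attr *attr)
{
	return attr->exclude_kernel ? 3 : -EACCES;
}

/* Simulated PMU that rejects the event whatever the exclusion bits are. */
static int sim_unsupported_open(const struct retry_attr *attr)
{
	(void)attr;
	return -EOPNOTSUPP;
}

/*
 * Try the event as-is first; only on -EACCES retry with exclude_kernel
 * set. Any other error is reported as "not supported".
 */
static bool event_supported(retry_open_fn opener)
{
	struct retry_attr attr = { .exclude_kernel = false };
	int ret = opener(&attr);

	if (ret == -EACCES) {
		attr.exclude_kernel = true;
		ret = opener(&attr);
	}
	return ret >= 0;
}
```

With these stand-ins, event_supported(sim_paranoid_open) succeeds on the second attempt, while event_supported(sim_unsupported_open) fails without a retry, since only -EACCES triggers one.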

Thanks,
Ian

> Thanks,
> Mark.

2023-11-23 16:48:48

by Mark Rutland

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Thu, Nov 23, 2023 at 07:14:21AM -0800, Ian Rogers wrote:
> On Thu, Nov 23, 2023 at 6:23 AM Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
> > > On Tue, 21 Nov 2023 13:40:31 +0000,
> > > Marc Zyngier <[email protected]> wrote:
> > > >
> > > > [Adding key people on Cc]
> > > >
> > > > On Tue, 21 Nov 2023 12:08:48 +0000,
> > > > Hector Martin <[email protected]> wrote:
> > > > >
> > > > > Perf broke on all Apple ARM64 systems (tested almost everything), and
> > > > > according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
> > > >
> > > > I can confirm that at least on 6.7-rc2, perf is pretty busted on any
> > > > asymmetric ARM platform. It isn't clear what criteria are used to pick
> > > > the PMU, but nothing works anymore.
> > > >
> > > > The saving grace in my case is that Debian still ships a 6.1 perftool
> > > > package, but that's obviously not going to last.
> > > >
> > > > I'm happy to test potential fixes.
> > >
> > > At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
> > > -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
> > > CPU):
> >
> > Looking at this with fresh(er) eyes, I think there's a userspace bug here,
> > regardless of whether one believes it's correct to convert a named-pmu event to
> > a PERF_TYPE_HARDWARE event directed at that PMU.
> >
> > It looks like the userspace tool is dropping the extended type ID after an
> > initial probe, and requests events with plain PERF_TYPE_HARDWARE (without an
> > extended type ID), which explains why we seem to get events from one PMU only.
> >
> > More detail below...
> >
> > Marc, if you have time, could you run the same commands (on the same kernel)
> > > with a perf tool built from v6.4?
> >
> > > <quote>
> > > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > > apple_firestorm_pmu/cycles/ -e cycles ls
> > > Using CPUID 0x00000000612f0280
> > > Attempt to add: apple_icestorm_pmu/cycles=0/
> > > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > > Opening: unknown-hardware:HG
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > config 0xb00000000
> > > disabled 1
> > > ------------------------------------------------------------
> >
> > Here config[31:0] is 0 (PERF_COUNT_HW_CPU_CYCLES), and config[63:32] is 0xb,
> > which is presumably the PMU ID for the apple_icestorm_pmu.
> >
> > The attr doesn't contain exclude_guest=1, so this will be rejected by the PMU
> > driver due to its mode exclusion requirements.
> >
> > > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > > sys_perf_event_open failed, error -95
> >
> > ... which is what we see here (this is EOPNOTSUPP, which __hw_perf_event_init()
> > > in drivers/perf/arm_pmu.c returns when the requested mode exclusion
> > options aren't supported).
> >
> > So far, so good...
> >
> > > Attempt to add: apple_firestorm_pmu/cycles=0/
> > > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > > Control descriptor is not initialized
> > > Opening: apple_icestorm_pmu/cycles/
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> >
> > ... but here, the extended type ID has been dropped, and this event is no
> > longer directed towards the apple_firestorm_pmu PMU, so the kernel can direct
> > this to *any* CPU PMU...
> >
> > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
> >
> > ... and *some* PMU accepts it.
> >
> > > Opening: apple_firestorm_pmu/cycles/
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> >
> > Likewise here, no extended type ID...
> >
> > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
> > > Opening: cycles
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> >
> > Likewise here, no extended type ID...
> >
> > > sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
> > > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > > apple_icestorm_pmu/cycles/: -1: 0 873709 0
> > > apple_firestorm_pmu/cycles/: -1: 0 873709 0
> > > cycles: -1: 0 873709 0
> > > apple_icestorm_pmu/cycles/: 0 873709 0
> > > apple_firestorm_pmu/cycles/: 0 873709 0
> > > cycles: 0 873709 0
> > >
> > > Performance counter stats for 'ls':
> > >
> > > <not counted> apple_icestorm_pmu/cycles/ (0.00%)
> > > <not counted> apple_firestorm_pmu/cycles/ (0.00%)
> > > <not counted> cycles (0.00%)
> > >
> > > 0.000002250 seconds time elapsed
> > >
> > > 0.000000000 seconds user
> > > 0.000000000 seconds sys
> >
> > So it looks like the tool has expanded the requested
> > 'apple_icestorm_pmu/cycles/' event into three cycles events, each opened
> > without an extended type ID.
> >
> > AFAICT, the kernel has done exactly what it has always done for
> > PERF_TYPE_HARDWARE/PERF_COUNT_HW_CPU_CYCLES events: pick the first PMU which
> > said it can handle them.
> >
> > > If I run the same thing on another CPU cluster (firestorm), I get
> > > this:
> > >
> > > <quote>
> > > maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
> > > apple_firestorm_pmu/cycles/ -e cycles ls
> > > Using CPUID 0x00000000612f0280
> > > Attempt to add: apple_icestorm_pmu/cycles=0/
> > > ..after resolving event: apple_icestorm_pmu/cycles=0/
> > > Opening: unknown-hardware:HG
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > config 0xb00000000
> > > disabled 1
> > > ------------------------------------------------------------
> > > sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
> > > sys_perf_event_open failed, error -95
> >
> > Again, we see one request with an extended type ID, which fails due to mode exclusion requirements...
> >
> > > Attempt to add: apple_firestorm_pmu/cycles=0/
> > > ..after resolving event: apple_firestorm_pmu/cycles=0/
> > > Control descriptor is not initialized
> > > Opening: apple_icestorm_pmu/cycles/
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> > > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
> > > Opening: apple_firestorm_pmu/cycles/
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> > > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
> > > Opening: cycles
> > > ------------------------------------------------------------
> > > perf_event_attr:
> > > type 0 (PERF_TYPE_HARDWARE)
> > > size 136
> > > config 0 (PERF_COUNT_HW_CPU_CYCLES)
> > > sample_type IDENTIFIER
> > > read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> > > disabled 1
> > > inherit 1
> > > enable_on_exec 1
> > > exclude_guest 1
> > > ------------------------------------------------------------
> >
> > ... but all subsequent requests do not have an extended type ID, and the kernel
> > directs these to whichever PMU accepts the event first...
> >
> > > sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
> > > arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
> > > bench builtin-evlist.c builtin-probe.c CREDITS perf.h
> > > Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
> > > builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
> > > builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
> > > builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
> > > builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
> > > builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
> > > builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
> > > builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
> > > builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
> > > builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
> > > builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
> > > builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
> > > builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
> > > builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
> > > builtin-daemon.o builtin-list.c builtin-version.c perf ui
> > > builtin-data.c builtin-list.o builtin-version.o perf-archive util
> > > builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
> > > builtin-diff.c builtin-mem.c command-list.txt perf.c
> > > apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
> > > apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
> > > cycles: -1: 1034653 469125 469125
> > > apple_icestorm_pmu/cycles/: 1035101 469125 469125
> > > apple_firestorm_pmu/cycles/: 1035035 469125 469125
> > > cycles: 1034653 469125 469125
> > >
> > > Performance counter stats for 'ls':
> > >
> > > 1,035,101 apple_icestorm_pmu/cycles/
> > > 1,035,035 apple_firestorm_pmu/cycles/
> > > 1,034,653 cycles
> > >
> > > 0.000001333 seconds time elapsed
> > >
> > > 0.000000000 seconds user
> > > 0.000000000 seconds sys
> > > </quote>
> >
> > ... and in this case the workload was run on a CPU affine to that arbitrary
> > PMU, hence we managed to count.
> >
> > So AFAICT, this is a userspace bug, maybe related to the way we probe for
> > supported PMU features?
>
> Probing PMU features is done by trying to perf_event_open events. For
> extended types it is a cycles event on each core PMU:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
>
> The is_event_supported logic is here:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n232

Ah, so IIUC what's happening is:

1) Userspace tries to detect extended type support, with a cycles event
   directed to one of the CPU PMUs. The attr for this does not have
   exclude_guest set.

2) In the kernel, the core perf code sees the extended hw type id, and directs
   this towards the correct PMU (apple_icestorm_pmu).

3) The PMU driver looks at the attr, sees exclude_guest is not set, and returns
   -EOPNOTSUPP, exactly as it would regardless of whether the extended hw type
   is used.

   Note: this happens to be a difference between x86 PMUs and the apple_* PMUs,
   but this is a legitimate part of the perf ABI, not an arm-specific quirk or
   bug.

4) Userspace receives -EOPNOTSUPP, and so decides the extended hw_type is not
   supported (even though the kernel does support the extended hw type id, and
   the event was rejected for orthogonal reasons).

5) Userspace avoids the extended hw type, but still uses
   PERF_TYPE_HARDWARE events for named-pmu events.

Does that sound plausible to you, or have I misunderstood?
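
For reference, the extended type ID layout that steps 1 and 2 rely on can be sketched in C as below. This is only an illustration: the EXT_* macros and the helper names are made up for this example. As far as I can tell they mirror PERF_PMU_TYPE_SHIFT and PERF_HW_EVENT_MASK from the uapi perf_event.h, and the 0xb PMU type is simply the value seen in Marc's log, not something to rely on.

```c
#include <stdint.h>

/*
 * Illustrative names only; assumed to match the uapi layout where the
 * PMU type ID lives in config[63:32] and the generic hardware event ID
 * in config[31:0].
 */
#define EXT_PMU_TYPE_SHIFT	32
#define EXT_HW_EVENT_MASK	0xffffffffULL

/* Build a PERF_TYPE_HARDWARE config directed at one specific PMU. */
static uint64_t ext_hw_config(uint32_t pmu_type, uint32_t hw_event)
{
	return ((uint64_t)pmu_type << EXT_PMU_TYPE_SHIFT) | hw_event;
}

/* Extract the PMU type ID; 0 means "no extended type, any CPU PMU". */
static uint32_t ext_pmu_type(uint64_t config)
{
	return (uint32_t)(config >> EXT_PMU_TYPE_SHIFT);
}

/* Extract the generic hardware event ID (0 is PERF_COUNT_HW_CPU_CYCLES). */
static uint32_t ext_hw_event(uint64_t config)
{
	return (uint32_t)(config & EXT_HW_EVENT_MASK);
}
```

With this layout, the probe's config of 0xb00000000 splits into PMU type 0xb and event 0, whereas the later events with config 0 carry no PMU type at all, which is why the kernel is free to pick any CPU PMU for them.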

From Marc's reply at:

https://lore.kernel.org/lkml/[email protected]/

... with perf built from v6.4, the perf tool can open named pmu events without
issue, and sets exclude_guest in the attr. So it seems like there's a mismatch
between regular opening of events and probing for extended hw type that causes
that to differ.

AFAICT, the kernel is doing the right thing here, but the userspace detection
of extended type id support happens to differ from regular event opening, and
mis-interprets -EOPNOTSUPP as "the kernel doesn't support extended type IDs"
rather than "The kernel was able to consume the extended type ID, but the
specific PMU targeted said it doesn't support this attr".

IIUC that means this'll be broken on older kernels (those before the extended
hw type id support was introduced), too?

It sounds like we need to make (4) more robust? I'm not immediately sure how,
given the rat's nest of returns in perf_event_open(), but I'm happy to try to
help with that.
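
As a strawman for (4), one option might be to have the probe use the same mode exclusion bits as regular event opening. The sketch below only models that idea: struct probe_attr, probe_open_fn and sim_pmu_open() are hypothetical stand-ins for struct perf_event_attr, perf_event_open() and a PMU driver that behaves as described in step 3. It is not the perf tool's code, and it does not cover every error path.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct perf_event_attr. */
struct probe_attr {
	uint32_t type;		/* 0 is PERF_TYPE_HARDWARE */
	uint64_t config;	/* extended type ID in bits 63:32 */
	bool exclude_guest;
};

/*
 * Hypothetical stand-in for perf_event_open(): returns an fd >= 0 on
 * success, or a negative errno on failure.
 */
typedef int (*probe_open_fn)(const struct probe_attr *attr);

/*
 * Simulated PMU driver modelled on step 3: it cannot count guest mode,
 * so it rejects any attr without exclude_guest, whether or not an
 * extended type ID is present.
 */
static int sim_pmu_open(const struct probe_attr *attr)
{
	if (!attr->exclude_guest)
		return -EOPNOTSUPP;
	return 3; /* pretend file descriptor */
}

/* Probe as in step 1: exclude_guest is left clear. */
static bool probe_naive(probe_open_fn opener, uint32_t pmu_type)
{
	struct probe_attr attr = {
		.type = 0,
		.config = (uint64_t)pmu_type << 32,
	};

	return opener(&attr) >= 0;
}

/* Probe with the same exclusion bit that regular event opening sets. */
static bool probe_matching(probe_open_fn opener, uint32_t pmu_type)
{
	struct probe_attr attr = {
		.type = 0,
		.config = (uint64_t)pmu_type << 32,
		.exclude_guest = true,
	};

	return opener(&attr) >= 0;
}
```

Against the simulated driver, probe_naive() reports the extended type as unsupported while probe_matching() succeeds, which is the mismatch described above. Whether matching the attrs is sufficient on real hardware and on older kernels is an open question.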

It also seems like (5) is a problem regardless. If the user asks for a named
PMU event on an older kernel (before the extended hw type id was a thing), and
the tool converts that to a plain PERF_TYPE_HARDWARE event, it's liable
to be handled by a different PMU than the one the user asked for.

Thanks,
Mark.

2023-11-23 17:09:14

by James Clark

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5



On 23/11/2023 16:48, Mark Rutland wrote:
> On Thu, Nov 23, 2023 at 07:14:21AM -0800, Ian Rogers wrote:
>> On Thu, Nov 23, 2023 at 6:23 AM Mark Rutland <[email protected]> wrote:
>>>
>>> On Tue, Nov 21, 2023 at 03:24:25PM +0000, Marc Zyngier wrote:
>>>> On Tue, 21 Nov 2023 13:40:31 +0000,
>>>> Marc Zyngier <[email protected]> wrote:
>>>>>
>>>>> [Adding key people on Cc]
>>>>>
>>>>> On Tue, 21 Nov 2023 12:08:48 +0000,
>>>>> Hector Martin <[email protected]> wrote:
>>>>>>
>>>>>> Perf broke on all Apple ARM64 systems (tested almost everything), and
>>>>>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.
>>>>>
>>>>> I can confirm that at least on 6.7-rc2, perf is pretty busted on any
>>>>> asymmetric ARM platform. It isn't clear what criteria are used to pick
>>>>> the PMU, but nothing works anymore.
>>>>>
>>>>> The saving grace in my case is that Debian still ships a 6.1 perftool
>>>>> package, but that's obviously not going to last.
>>>>>
>>>>> I'm happy to test potential fixes.
>>>>
>>>> At Mark's request, I've dumped a couple of perf (as of -rc2) runs with
>>>> -vvv. And it is quite entertaining (this is taskset to an 'icestorm'
>>>> CPU):
>>>
>>> Looking at this with fresh(er) eyes, I think there's a userspace bug here,
>>> regardless of whether one believes it's correct to convert a named-pmu event to
>>> a PERF_TYPE_HARDWARE event directed at that PMU.
>>>
>>> It looks like the userspace tool is dropping the extended type ID after an
>>> initial probe, and requests events with plain PERF_TYPE_HARDWARE (without an
>>> extended type ID), which explains why we seem to get events from one PMU only.
>>>
>>> More detail below...
>>>
>>> Marc, if you have time, could you run the same commands (on the same kernel)
>>> with a perf tool built from v6.4?
>>>
>>>> <quote>
>>>> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 0 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
>>>> apple_firestorm_pmu/cycles/ -e cycles ls
>>>> Using CPUID 0x00000000612f0280
>>>> Attempt to add: apple_icestorm_pmu/cycles=0/
>>>> ..after resolving event: apple_icestorm_pmu/cycles=0/
>>>> Opening: unknown-hardware:HG
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> config 0xb00000000
>>>> disabled 1
>>>> ------------------------------------------------------------
>>>
>>> Here config[31:0] is 0 (PERF_COUNT_HW_CPU_CYCLES), and config[63:32] is 0xb,
>>> which is presumably the PMU ID for the apple_icestorm_pmu.
>>>
>>> The attr doesn't contain exclude_guest=1, so this will be rejected by the PMU
>>> driver due to its mode exclusion requirements.
>>>
>>>> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
>>>> sys_perf_event_open failed, error -95
>>>
>>> ... which is what we see here (this is EOPNOTSUPP, which __hw_perf_event_init()
>>> in drivers/perf/arm_pmu.c returns when the requested mode exclusion
>>> options aren't supported).
>>>
>>> So far, so good...
>>>
>>>> Attempt to add: apple_firestorm_pmu/cycles=0/
>>>> ..after resolving event: apple_firestorm_pmu/cycles=0/
>>>> Control descriptor is not initialized
>>>> Opening: apple_icestorm_pmu/cycles/
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>
>>> ... but here, the extended type ID has been dropped, and this event is no
>>> longer directed towards the apple_firestorm_pmu PMU, so the kernel can direct
>>> this to *any* CPU PMU...
>>>
>>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 3
>>>
>>> ... and *some* PMU accepts it.
>>>
>>>> Opening: apple_firestorm_pmu/cycles/
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>
>>> Likewise here, no extended type ID...
>>>
>>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 4
>>>> Opening: cycles
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>
>>> Likewise here, no extended type ID...
>>>
>>>> sys_perf_event_open: pid 1045843 cpu -1 group_fd -1 flags 0x8 = 5
>>>> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
>>>> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
>>>> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
>>>> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
>>>> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
>>>> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
>>>> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
>>>> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
>>>> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
>>>> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
>>>> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
>>>> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
>>>> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
>>>> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
>>>> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
>>>> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
>>>> builtin-daemon.o builtin-list.c builtin-version.c perf ui
>>>> builtin-data.c builtin-list.o builtin-version.o perf-archive util
>>>> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
>>>> builtin-diff.c builtin-mem.c command-list.txt perf.c
>>>> apple_icestorm_pmu/cycles/: -1: 0 873709 0
>>>> apple_firestorm_pmu/cycles/: -1: 0 873709 0
>>>> cycles: -1: 0 873709 0
>>>> apple_icestorm_pmu/cycles/: 0 873709 0
>>>> apple_firestorm_pmu/cycles/: 0 873709 0
>>>> cycles: 0 873709 0
>>>>
>>>> Performance counter stats for 'ls':
>>>>
>>>> <not counted> apple_icestorm_pmu/cycles/ (0.00%)
>>>> <not counted> apple_firestorm_pmu/cycles/ (0.00%)
>>>> <not counted> cycles (0.00%)
>>>>
>>>> 0.000002250 seconds time elapsed
>>>>
>>>> 0.000000000 seconds user
>>>> 0.000000000 seconds sys
>>>
>>> So it looks like the tool has expanded the requested
>>> 'apple_icestorm_pmu/cycles/' event into three cycles events, each opened
>>> without an extended type ID.
>>>
>>> AFAICT, the kernel has done exactly what it has always done for
>>> PERF_TYPE_HARDWARE/PERF_COUNT_HW_CPU_CYCLES events: pick the first PMU which
>>> said it can handle them.
>>>
>>>> If I run the same thing on another CPU cluster (firestorm), I get
>>>> this:
>>>>
>>>> <quote>
>>>> maz@valley-girl:~/hot-poop/arm-platforms/tools/perf$ sudo taskset -c 2 ./perf stat -vvv -e apple_icestorm_pmu/cycles/ -e
>>>> apple_firestorm_pmu/cycles/ -e cycles ls
>>>> Using CPUID 0x00000000612f0280
>>>> Attempt to add: apple_icestorm_pmu/cycles=0/
>>>> ..after resolving event: apple_icestorm_pmu/cycles=0/
>>>> Opening: unknown-hardware:HG
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> config 0xb00000000
>>>> disabled 1
>>>> ------------------------------------------------------------
>>>> sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8
>>>> sys_perf_event_open failed, error -95
>>>
>>> Again, we see one request with an extended type ID, which fails due to mode exclusion requirements...
>>>
>>>> Attempt to add: apple_firestorm_pmu/cycles=0/
>>>> ..after resolving event: apple_firestorm_pmu/cycles=0/
>>>> Control descriptor is not initialized
>>>> Opening: apple_icestorm_pmu/cycles/
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 3
>>>> Opening: apple_firestorm_pmu/cycles/
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 4
>>>> Opening: cycles
>>>> ------------------------------------------------------------
>>>> perf_event_attr:
>>>> type 0 (PERF_TYPE_HARDWARE)
>>>> size 136
>>>> config 0 (PERF_COUNT_HW_CPU_CYCLES)
>>>> sample_type IDENTIFIER
>>>> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>>>> disabled 1
>>>> inherit 1
>>>> enable_on_exec 1
>>>> exclude_guest 1
>>>> ------------------------------------------------------------
>>>
>>> ... but all subsequent requests do not have an extended type ID, and the kernel
>>> directs these to whichever PMU accepts the event first...
>>>
>>>> sys_perf_event_open: pid 1045925 cpu -1 group_fd -1 flags 0x8 = 5
>>>> arch builtin-diff.o builtin-mem.o common-cmds.h perf-completion.sh
>>>> bench builtin-evlist.c builtin-probe.c CREDITS perf.h
>>>> Build builtin-evlist.o builtin-probe.o design.txt perf-in.o
>>>> builtin-annotate.c builtin-ftrace.c builtin-record.c dlfilters perf-iostat
>>>> builtin-annotate.o builtin-ftrace.o builtin-record.o Documentation perf-iostat.sh
>>>> builtin-bench.c builtin.h builtin-report.c FEATURE-DUMP perf.o
>>>> builtin-bench.o builtin-help.c builtin-report.o include perf-read-vdso.c
>>>> builtin-buildid-cache.c builtin-help.o builtin-sched.c jvmti perf-sys.h
>>>> builtin-buildid-cache.o builtin-inject.c builtin-script.c libapi PERF-VERSION-FILE
>>>> builtin-buildid-list.c builtin-inject.o builtin-script.o libperf perf-with-kcore
>>>> builtin-buildid-list.o builtin-kallsyms.c builtin-stat.c libsubcmd pmu-events
>>>> builtin-c2c.c builtin-kallsyms.o builtin-stat.o libsymbol python
>>>> builtin-c2c.o builtin-kmem.c builtin-timechart.c Makefile python_ext_build
>>>> builtin-config.c builtin-kvm.c builtin-top.c Makefile.config scripts
>>>> builtin-config.o builtin-kvm.o builtin-top.o Makefile.perf tests
>>>> builtin-daemon.c builtin-kwork.c builtin-trace.c MANIFEST trace
>>>> builtin-daemon.o builtin-list.c builtin-version.c perf ui
>>>> builtin-data.c builtin-list.o builtin-version.o perf-archive util
>>>> builtin-data.o builtin-lock.c check-headers.sh perf-archive.sh
>>>> builtin-diff.c builtin-mem.c command-list.txt perf.c
>>>> apple_icestorm_pmu/cycles/: -1: 1035101 469125 469125
>>>> apple_firestorm_pmu/cycles/: -1: 1035035 469125 469125
>>>> cycles: -1: 1034653 469125 469125
>>>> apple_icestorm_pmu/cycles/: 1035101 469125 469125
>>>> apple_firestorm_pmu/cycles/: 1035035 469125 469125
>>>> cycles: 1034653 469125 469125
>>>>
>>>> Performance counter stats for 'ls':
>>>>
>>>> 1,035,101 apple_icestorm_pmu/cycles/
>>>> 1,035,035 apple_firestorm_pmu/cycles/
>>>> 1,034,653 cycles
>>>>
>>>> 0.000001333 seconds time elapsed
>>>>
>>>> 0.000000000 seconds user
>>>> 0.000000000 seconds sys
>>>> </quote>
>>>
>>> ... and in this case the workload was run on a CPU affine to that arbitrary
>>> PMU, hence we managed to count.
>>>
>>> So AFAICT, this is a userspace bug, maybe related to the way we probe for
>>> supported PMU features?
>>
>> Probing PMU features is done by trying to perf_event_open events. For
>> extended types it is a cycles event on each core PMU:
>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmus.c?h=perf-tools-next#n532
>>
>> The is_event_supported logic is here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n232
>
> Ah, so IIUC what's happening is:
>
> 1) Userspace tries to detect extended type support, with a cycles event
> directed to one of the CPU PMUs. The attr for this does not have
> exclude_guest set.
>
> 2) In the kernel, the core perf code sees the extended hw type id, and directs
> this towards the correct PMU (apple_icestorm_pmu).
>
> 3) The PMU driver looks at the attr, sees exclude_guest is not set, and returns
> -EOPNOTSUPP, exactly as it would regardless of whether the extended hw type
> is used.
>
> Note: this happens to be a difference between x86 PMUs and the apple_* PMUs,
> but this is a legitimate part of the perf ABI, not an arm-specific quirk or
> bug.
>
> 4) Userspace receives -EOPNOTSUPP, and so decides the extended hw_type is not
> supported (even though the kernel does support the extended hw type id, and
> the event was rejected for orthogonal reasons).
>
> 5) Userspace avoids the extended hw type, but still uses
> PERF_TYPE_HARDWARE events for named-pmu events.
>
> Does that sound plausible to you, or have I misunderstood?
>
> From Marc's reply at:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> ... with perf built from v6.4, the perf tool can open named pmu events without
> issue, and sets exclude_guest in the attr. So it seems like there's a mismatch
> between regular opening of events and probing for extended hw type that causes
> that to differ.
>
> AFAICT, the kernel is doing the right thing here, but the userspace detection
> of extended type id support happens to differ from regular event opening, and
> misinterprets -EOPNOTSUPP as "the kernel doesn't support extended type IDs"
> rather than "The kernel was able to consume the extended type ID, but the
> specific PMU targeted said it doesn't support this attr".
>
> IIUC that means this'll be broken on older kernels (those before the extended
> hw type id support was introduced), too?
>
> It sounds like we need to make (4) more robust? I'm not immediately sure how,
> given the rat's nest of returns in perf_event_open(), but I'm happy to try to
> help with that.

It might be worth reporting extended HW ID support in the caps folder of
the PMU so that Perf can look there instead of trying to open the event.
It's something that we know will always be on or always be off so it
doesn't make sense to try to discover it by opening an event.
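
A minimal sketch of what such a caps-based check could look like from the
tool side, in Python for brevity. The cap file name `extended_hw_type` is
hypothetical (nothing in this thread confirms such a file exists), and
`pmu_has_cap` is an illustrative helper, not perf code. The sysfs root is a
parameter so the check can be exercised against a fake directory tree.

```python
import os

# Default location of per-PMU directories in sysfs.
SYSFS_PMU_ROOT = "/sys/bus/event_source/devices"


def pmu_has_cap(pmu_name, cap_name, sysfs_root=SYSFS_PMU_ROOT):
    """Return True if <sysfs_root>/<pmu_name>/caps/<cap_name> exists.

    cap_name is an assumption for illustration: a hypothetical file the
    kernel would expose to advertise extended HW type ID support.
    """
    return os.path.isfile(os.path.join(sysfs_root, pmu_name, "caps", cap_name))
```

With a cap like this the tool would not need to open a probe event at all,
so an unrelated -EOPNOTSUPP (for example, one caused by exclude_guest not
being set) could not be mistaken for missing extended type support.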

>
> It also seems like (5) is a problem regardless. If the user asks for a named
> PMU event on an older kernel (before the extended hw type id was a thing), and
> the tool converts that to a plain PERF_TYPE_HARDWARE event, it's liable
> to be handled by a different PMU than the one the user asked for.
>
> Thanks,
> Mark.

2023-11-23 17:16:18

by Mark Rutland

[permalink] [raw]
Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

On Thu, Nov 23, 2023 at 05:08:43PM +0000, James Clark wrote:
> On 23/11/2023 16:48, Mark Rutland wrote:
> > Ah, so IIUC what's happening is:
> >
> > 1) Userspace tries to detect extended type support, with a cycles event
> > directed to one of the CPU PMUs. The attr for this does not have
> > exclude_guest set.
> >
> > 2) In the kernel, the core perf code sees the extended hw type id, and directs
> > this towards the correct PMU (apple_icestorm_pmu).
> >
> > 3) The PMU driver looks at the attr, sees exclude_guest is not set, and returns
> > -EOPNOTSUPP, exactly as it would regardless of whether the extended hw type
> > is used.
> >
> > Note: this happens to be a difference between x86 PMUs and the apple_* PMUs,
> > but this is a legitimate part of the perf ABI, not an arm-specific quirk or
> > bug.
> >
> > 4) Userspace receives -EOPNOTSUPP, and so decides the extended hw_type is not
> > supported (even though the kernel does support the extended hw type id, and
> > the event was rejected for orthogonal reasons).

> > It sounds like we need to make (4) more robust? I'm not immediately sure how,
> > given the rat's nest of returns in perf_event_open(), but I'm happy to try to
> > help with that.
>
> It might be worth reporting extended HW ID support in the caps folder of
> the PMU so that Perf can look there instead of trying to open the event.
> It's something that we know will always be on or always be off so it
> doesn't make sense to try to discover it by opening an event.

Yep, I'm open to that idea. I'm more than happy to expose something that
indicates "this PMU supports the extended HW ID" and/or "this kernel supports
the extended HW ID".
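
For reference, a rough sketch of what the "extended HW ID" amounts to, in
Python for brevity. For a PERF_TYPE_HARDWARE event, the PMU's type id goes in
the upper 32 bits of attr.config and the generic event id stays in the lower
32 bits. The constants mirror the uapi perf_event.h definitions; the helper
names and the PMU type id used in the usage note are made up for illustration.

```python
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_CPU_CYCLES = 0
PERF_PMU_TYPE_SHIFT = 32         # PMU type id lives above this bit
PERF_HW_EVENT_MASK = 0xFFFFFFFF  # generic hardware event id lives below it


def encode_extended_hw_config(pmu_type, hw_event):
    """Build attr.config for a hardware event aimed at one specific PMU."""
    return (pmu_type << PERF_PMU_TYPE_SHIFT) | (hw_event & PERF_HW_EVENT_MASK)


def decode_extended_hw_config(config):
    """Split attr.config into (pmu_type, hw_event).

    A pmu_type of 0 means no PMU was named in the request.
    """
    return config >> PERF_PMU_TYPE_SHIFT, config & PERF_HW_EVENT_MASK
```

For example, with a hypothetical PMU type id of 10 (as would be read from the
PMU's sysfs "type" file), encode_extended_hw_config(10,
PERF_COUNT_HW_CPU_CYCLES) gives 10 << 32. A plain "config 0", as in the debug
output quoted earlier in the thread, decodes to (0, 0), i.e. no PMU is named.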

Given that the actual PMU drivers don't see the extended cap, and that's
handled by the core, I'd like to make the core logic unconditional and remove
the kernel-internal PERF_PMU_CAP_EXTENDED_HW_TYPE cap. So I'd lean towards the
"this kernel supports the extended HW ID" option.

Thanks,
Mark.

Subject: Re: [REGRESSION] Perf (userspace) broken on big.LITTLE systems since v6.5

[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 22.11.23 00:43, Bagas Sanjaya wrote:
> On Tue, Nov 21, 2023 at 09:08:48PM +0900, Hector Martin wrote:
>> Perf broke on all Apple ARM64 systems (tested almost everything), and
>> according to maz also on Juno (so, probably all big.LITTLE) since v6.5.

#regzbot fix: perf parse-events: Make legacy events lower priority than
sysfs/JSON
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.