LinuxLists.cc - [PATCH v1] perf evlist: Force adding default events only to core PMUs

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Sat, 25 May 2024 at 08:30, Ian Rogers <[email protected]> wrote:
>
> PMUs other than core PMUs may have a 'cycles' event. Default opening a
> non-core cycles event with perf record can lead to perf_event_open
> failure. Avoid this by only opening the default 'cycles' event on core
> PMUs.
>
> Closes: https://lore.kernel.org/lkml/CAHk-=wiWvtFyedDNpoV7a8Fq_FpbB+F5KmWK2xPY3QoYseOf_A@mail.gmail.com/
> Fixes: 617824a7f0f7 ("perf parse-events: Prefer sysfs/JSON hardware events over legacy")
> Signed-off-by: Ian Rogers <[email protected]>

This makes 'perf record' work for me again.

Tested-by: Linus Torvalds <[email protected]>

Thanks,

Linus

2024-05-25 21:14:54

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Sat, 25 May 2024 at 09:43, Linus Torvalds
<[email protected]> wrote:
>
> This makes 'perf record' work for me again.

Oh, wait, no it doesn't.

It makes just the plain "perf record" without any arguments work,
which was what I was testing because I was lazy.

So now

$ perf record sleep 1

works fine. But

$ perf record -e cycles:pp sleep 1

is still completely broken (with or without ":p" and ":pp").

So no. That still needs to be fixed, or the whole "prefer sysfs/JSON
by default" needs to be reverted.

Linus

2024-05-27 10:59:01

by Leo Yan

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
> On Sat, 25 May 2024 at 09:43, Linus Torvalds
> <[email protected]> wrote:
> >
> > This makes 'perf record' work for me again.
>
> Oh, wait, no it doesn't.
>
> It makes just the plain "perf record" without any arguments work,
> which was what I was testing because I was lazy.
>
> So now
>
> $ perf record sleep 1
>
> works fine. But
>
> $ perf record -e cycles:pp sleep 1
>
> is still completely broken (with or without ":p" and ":pp").

Seems to me that this patch fails to check if a PMU is a core-attached
PMU that can support common hardware events. Therefore, we should
consider adding the following check.

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 30f958069076..bc1822c2f3e3 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
while ((pmu = perf_pmus__scan(pmu)) != NULL) {
bool auto_merge_stats;

+ if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
+ continue;
+
if (parse_events__filter_pmu(parse_state, pmu))
continue;

To be clear, I only compiled this change but I have no chance to test
it. @Ian, could you confirm this?

Thanks,
Leo

> So no. That still needs to be fixed, or the whole "prefer sysfs/JSON
> by default" needs to be reverted.

2024-05-28 05:37:18

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
>
> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
> > On Sat, 25 May 2024 at 09:43, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > This makes 'perf record' work for me again.
> >
> > Oh, wait, no it doesn't.
> >
> > It makes just the plain "perf record" without any arguments work,
> > which was what I was testing because I was lazy.
> >
> > So now
> >
> > $ perf record sleep 1
> >
> > works fine. But
> >
> > $ perf record -e cycles:pp sleep 1
> >
> > is still completely broken (with or without ":p" and ":pp").
>
> Seems to me that this patch fails to check if a PMU is a core-attached
> PMU that can support common hardware events. Therefore, we should
> consider adding the following check.
>
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 30f958069076..bc1822c2f3e3 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
> while ((pmu = perf_pmus__scan(pmu)) != NULL) {
> bool auto_merge_stats;
>
> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
> + continue;
> +
> if (parse_events__filter_pmu(parse_state, pmu))
> continue;
>
> To be clear, I only compiled this change but I have no chance to test
> it. @Ian, could you confirm this?

Hi Leo,

so the code is working as intended. I believe it also agrees with what
Arnaldo thinks.

If you do:

$ perf stat -e cycles ...

and you have

/sys/devices/pmu1/events/cycles
/sys/devices/pmu2/events/cycles

The output of perf stat should contain counts for pmu1 and pmu2. Were
the event 'data_read' or 'inst_retired.any' we wouldn't be having the
question as this has always been perf's behavior - changing that
behavior would be a regression and why we can't just limit wildcard
searches to core PMUs.

The issue is about legacy events. There are more potential legacy
names than those in 'enum perf_hw_id' as there are all the hardware
cache events. So ignoring "hw_config != PERF_COUNT_HW_MAX" when adding
an event to >1 PMU is only a partial solution, but I (and I think
others agree) don't think legacy names should be special in this way.
If you ask for an event and multiple PMUs advertise it then the
behavior should be to count the event.

Linus' trouble stems from cycles being used as a default value without
restricting it to core PMUs. This patch solves that. There is then the
issue that if an event that doesn't support sampling is wild card
matched by `perf record` then it should be ignored. I'm in less
agreement here, such events failing could be regarded as a feature.
The workaround is to add the PMU name when specifying the event. If we
do allow such events to be skipped, should the skipping be accompanied
by a warning? Presumably if no events can be opened then things should
terminate. Restricting everything to the core PMU I think is
short-sighted as knowing what's happening on accelerators is
increasingly important.

For the sake of fixing TMA metrics and things like default attributes
for ARM I added a notion that evsels in perf stat can be skippable.
The plumbing for this with perf record is more tricky as the logic has
half migrated into libperf - parsing and skippable are in perf, the
event opening and mmap logic is all in libperf, and we need to add
ignoring skipped evsels when adding the events to the mmap-ed ring
buffer. I have a patch that's WIP for this, but I also think we could
also agree that when >1 PMU advertises an event, perf's behavior when
matching should be to open all such events. You avoid this by
specifying a PMU name.

Thanks,
Ian

> Thanks,
> Leo
>
>
> > So no. That still needs to be fixed, or the whole "prefer sysfs/JSON
> > by default" needs to be reverted.
>
>

2024-05-28 17:06:49

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Mon, 27 May 2024 at 22:37, Ian Rogers <[email protected]> wrote:
>
> If you do:
>
> $ perf stat -e cycles ...

You always talk about "perf stat", because you want to ignore a big
part of the issueL

> The issue is about legacy events.

No.

The issue isn't "perf stat".

That works. Even with the broken version.

> I have a patch that's WIP for this, but I also think we could
> also agree that when >1 PMU advertises an event, perf's behavior when
> matching should be to open all such events. You avoid this by
> specifying a PMU name.

Christ. You're still ignoring the elephant in the room.

Stop using "perf stat" as an example. It's bogus. You're ignoring the issue.

Lookie here:

$ perf stat make
...
Performance counter stats for 'make':

5,262.78 msec task-clock #
1.195 CPUs utilized
46,891 context-switches #
8.910 K/sec
6 cpu-migrations #
1.140 /sec
198,402 page-faults #
37.699 K/sec
10,238,763,227 cycles #
1.946 GHz
16,280,644,955 instructions # 1.59
insn per cycle
3,252,743,314 branches #
618.066 M/sec
83,702,386 branch-misses #
2.57% of all branches

4.405792739 seconds time elapsed

4.287784000 seconds user
0.921132000 seconds sys

but then

$ perf record make
Error:
cycles:P: PMU Hardware doesn't support
sampling/overflow-interrupts. Try 'perf stat'

because that broken thing (a) picked a different cycles than before
and (b) your argument that it should pick both IS WRONG BECAUSE ONE OF
THEM DOESN'T EVEN WORK.

Why is this so hard to just accept? Why do you keep mis-stating the problem?

How hard is it to realize that I DO NOT WANT "perf stat"? The perf
error message is bogus crap. If I ask for a "perf record", it
shouldn't pick the wrong PMU that can't do it, and then say "do you
want to do something else"?

I don't care a whit for "legacy events". I care about the fact that
you changed the code to pick the WRONG event, and then you are blaming
anything but that.

If perf would go "Oh, this one doesn't support 'record', let's see if
another one does", it would all be good.

If perf would go "Oh, let's prioritize core events", it would all be good.

But no. Your patch actively picked a bad event, and then you try to
blame some "legacy" thing.

Yes, the legacy thing picked the right event, but it's not even about
legacy. You could have picked the right event any number of other
ways.

It's about "it damn well worked when you didn't go out of your way to
pick the wrong event".

In other words, this isn't about "legacy" and "new".

This is about "right" and "wrong". The old code picked right - for
whatever reasons. The new code picked wrong - also for whatever
reasons.

Don't try to make it be anything else. Just admit that the new code
picked the wrong PMU, instead of trying to make excuses for it.

Linus

2024-05-28 17:40:15

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, May 28, 2024 at 10:01 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, 27 May 2024 at 22:37, Ian Rogers <[email protected]> wrote:
> >
> > If you do:
> >
> > $ perf stat -e cycles ...
>
> You always talk about "perf stat", because you want to ignore a big
> part of the issueL
>
> > The issue is about legacy events.
>
> No.
>
> The issue isn't "perf stat".
>
> That works. Even with the broken version.
>
> > I have a patch that's WIP for this, but I also think we could
> > also agree that when >1 PMU advertises an event, perf's behavior when
> > matching should be to open all such events. You avoid this by
> > specifying a PMU name.
>
> Christ. You're still ignoring the elephant in the room.
>
> Stop using "perf stat" as an example. It's bogus. You're ignoring the issue.
>
> Lookie here:
>
> $ perf stat make
> ...
> Performance counter stats for 'make':
>
> 5,262.78 msec task-clock #
> 1.195 CPUs utilized
> 46,891 context-switches #
> 8.910 K/sec
> 6 cpu-migrations #
> 1.140 /sec
> 198,402 page-faults #
> 37.699 K/sec
> 10,238,763,227 cycles #
> 1.946 GHz
> 16,280,644,955 instructions # 1.59
> insn per cycle
> 3,252,743,314 branches #
> 618.066 M/sec
> 83,702,386 branch-misses #
> 2.57% of all branches
>
> 4.405792739 seconds time elapsed
>
> 4.287784000 seconds user
> 0.921132000 seconds sys
>
> but then
>
> $ perf record make
> Error:
> cycles:P: PMU Hardware doesn't support
> sampling/overflow-interrupts. Try 'perf stat'
>
> because that broken thing (a) picked a different cycles than before
> and (b) your argument that it should pick both IS WRONG BECAUSE ONE OF
> THEM DOESN'T EVEN WORK.

By default it picked "cycles:P" with this patch it now picks "<core
pmu>/cycles/P". As your Tested-by attested this does work.

> Why is this so hard to just accept? Why do you keep mis-stating the problem?

I'm not. What your saying is that the arm_dsu_0 PMU advertising cycles
doesn't work if you use it for perf record. I agree. This change fixed
the issue. What to do if someone writes "perf record -e cycles" well
there are two advertised cycles events isn't it reasonable perf record
try to open them both?

> How hard is it to realize that I DO NOT WANT "perf stat"? The perf
> error message is bogus crap. If I ask for a "perf record", it
> shouldn't pick the wrong PMU that can't do it, and then say "do you
> want to do something else"?

It's not about perf record having some kind of intelligence and
picking a PMU. perf record either opens an event on the PMU you ask
for it it wild cards across them all. For legacy events they used to
be handled differently and I'm trying to fix this. In part so I can
fix the PMU behavior on the Apple M1 and later CPUs that fail to
implement legacy events properly in their PMU driver. You've said you
used to use Apple CPUs for ARM testing, I'm trying to fix a problem
that will help you.

> I don't care a whit for "legacy events". I care about the fact that
> you changed the code to pick the WRONG event, and then you are blaming
> anything but that.

Sure, I added "if user == linus then pick wrong PMU". The code was
reviewed by IBM and Intel. ARM were on the CC list. The change baked
on linux-next for a good long while. All of this points to my problem
that I'm often fixing problems for ARM with a complete lack of
testing/reviewing/acking... by them.

> If perf would go "Oh, this one doesn't support 'record', let's see if
> another one does", it would all be good.
>
> If perf would go "Oh, let's prioritize core events", it would all be good.

But as I've pointed out it wouldn't because then you'd break the
behavior of doing things like gathering memory bandwidth from uncore
PMUs. An example:
```
$ sudo perf stat -e data_read -a sleep 1

Performance counter stats for 'system wide':

4,447.10 MiB data_read

1.001748581 seconds time elapsed
```
By making the wildcard only work for core PMUs the best case is you'd make this:
```
$ sudo perf stat -e 'imc_free_running/data_read/' -a sleep 1

Performance counter stats for 'system wide':

4,454.56 MiB imc_free_running/data_read/

1.001865448 seconds time elapsed
```
But that wouldn't work again as ARM decided to mess up the naming
convention, my unmerged fix for that is here:
https://lore.kernel.org/lkml/[email protected]/
it says Marvell but Marvell followed ARM's lead.

> But no. Your patch actively picked a bad event, and then you try to
> blame some "legacy" thing.
>
> Yes, the legacy thing picked the right event, but it's not even about
> legacy. You could have picked the right event any number of other
> ways.
>
> It's about "it damn well worked when you didn't go out of your way to
> pick the wrong event".
>
> In other words, this isn't about "legacy" and "new".

I'm not clear what you think is "new". All the events are being picked
in your case from sysfs, the way this has all worked is years if not
decades old. What is new is that because of an event name the behavior
should be uniform, motivated initially by fixing your other ARM test
platform of Apple.

> This is about "right" and "wrong". The old code picked right - for
> whatever reasons. The new code picked wrong - also for whatever
> reasons.
>
> Don't try to make it be anything else. Just admit that the new code
> picked the wrong PMU, instead of trying to make excuses for it.

I agree it picked the wrong PMU for default events. This was a problem
on no systems that anybody was bothering to test with. Having been
made aware of the issue I fixed it in this patch, you're welcome.

What is still not clear from this is what should the behavior be of:

$ perf record -e cycles ...

Should it wildcard all events and open them on all PMUs potentially
failing? Well this has always been perf's behavior were the event:

$ perf record -e inst_retired.any ...

where inst_retired.any could be an event advertised on an accelerator
or device where sampling doesn't work. If inst_retired.any doesn't
work for you as an example, pick another event that does. A GPU has
instructions and cycles so the likelihood of naming conflicts is high.

We can make perf record ignore opening events on PMUs that don't
support sampling, it's an invasive and non-trivial change not suited
to landing in 6.10. It is also a behavior change, see this thread for
how popular those are.

So 6.10 is now in a mess. We likely fail tests, reverting this change
has a bunch of consequences and presumably I'm expected to dig
through, figure those out and then provide fixes. Thanks!

For 6.11 I currently suggest we revert the revert and apply this
patch. This would also I think be the best thing to do for 6.10. I
appreciate my opinions are worth much less than others. I don't see
why the priority should be to fix things on an ARM system that nobody
is actively testing on rather than say Apple devices fixed by the
reverted change, RISC-V system, etc. Anyway...

Thanks,
Ian

> Linus
>

2024-05-28 18:13:41

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, 28 May 2024 at 10:40, Ian Rogers <[email protected]> wrote:
>
> I agree it picked the wrong PMU for default events. This was a problem
> on no systems that anybody was bothering to test with. Having been
> made aware of the issue I fixed it in this patch, you're welcome.

You didn't just pick it for default events. You also picked it for
when the user explicitly asks for "profile for cycles"

> What is still not clear from this is what should the behavior be of:
>
> $ perf record -e cycles ...

Why do you claim that?

I've already told you that CLEARLY it's wrong to pick a cycles event
that doesn't support 'record'.

I've also suggested that you might look at core only PMUs.

But more importantly, you should look at documented and historical behavior.

So what is your argument? Because from where I'm sitting, you keep
making irrelevant arguments about *other* events, not about "cycles".

It used to work. It doesn't any more.

> Should it wildcard all events and open them on all PMUs potentially
> failing? Well this has always been perf's behavior were the event:
>
> $ perf record -e inst_retired.any ...

You keep making up irrelevant arguments.

Lookie here: I do "perf list" to just see the events, and what do I
get? Let me quote that for you:

List of pre-defined events (to be used in -e or -M):
...
cpu-cycles OR cycles [Hardware event]

and then later on in the list I get

general:
cpu_cycles
[Cycle. Unit: armv8_pmuv3_0]

and dammit, your patch broke the DOCUMENTED way to get the most
obvious profiling data: cycles.

So stop making shit up. All your arguments have been bogus garbage
that have been talking about entirely different things than the one
thing I reported was broken.

And you *keep* doing that. Days into this, you keep making shit up
that isn't about this very simple thing.

Every single time I tell you what the problem is, you try to twist to
be about something entirely different. Either a different 'perf'
command entirely, or about a different event that is ENTIRELY
IRRELEVANT.

What the hell is your problem? Why can't you just admit that you
f*cked up, and fix the thing I told you was broken, and that is very
clearly broken and there is no "what about" issues AT ALL.

So stop the idiocy already. Face the actual issue. Don't make up other things.

Dammit, if I wanted "arm_dsu_58/cycles/", I would SAY so. I didn't. I
said "cycles", which is the thing that has always worked across
architectures, that is DOCUMENTED to be the same as "cpu-cycles", and
that used to work just fine.

It's literally RIGHT THERE in "perf list". Using "-e cycles" is
literally also what the man-pages say. This is not me doing something
odd.

And yes, I use an explicit "-e cycles:pp" because the default is not
that. and "cycles:pp" does better than the default.

Again, this is all documented, with "man perf-record" literally
talking about "-e cycles" and then pointing to "man perf-list" for the
modifier details, which detail that 'pp' as meaning "use maximum
detected precise level". Which is exactly what I want (even if on this
machine, it turns out that "p" and "pp" are the same because the armv8
pmuv3 doesn't have that "correct for instruction drift" that Intel
does on x86).

Why is this simple thing so hard for you to understand?

The fact is, if you make "cycles" mean ANYTHING ELSE than the
long-documented actual obvious thing, you have broken perf. It's that
simple.

So stop the excuses already. Stop making up other stuff that isn't
relevant. Stop bringing up events or PMU's that are simply not the
issue.

Face your bug head on, instead of making me have to tell you the same
thing over and over and over again.

Linus

2024-05-28 18:59:32

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, May 28, 2024 at 11:13 AM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 28 May 2024 at 10:40, Ian Rogers <[email protected]> wrote:
> >
> > I agree it picked the wrong PMU for default events. This was a problem
> > on no systems that anybody was bothering to test with. Having been
> > made aware of the issue I fixed it in this patch, you're welcome.
>
> You didn't just pick it for default events. You also picked it for
> when the user explicitly asks for "profile for cycles"
>
> > What is still not clear from this is what should the behavior be of:
> >
> > $ perf record -e cycles ...
>
> Why do you claim that?
>
> I've already told you that CLEARLY it's wrong to pick a cycles event
> that doesn't support 'record'.
>
> I've also suggested that you might look at core only PMUs.
>
> But more importantly, you should look at documented and historical behavior.
>
> So what is your argument? Because from where I'm sitting, you keep
> making irrelevant arguments about *other* events, not about "cycles".
>
> It used to work. It doesn't any more.
>
> > Should it wildcard all events and open them on all PMUs potentially
> > failing? Well this has always been perf's behavior were the event:
> >
> > $ perf record -e inst_retired.any ...
>
> You keep making up irrelevant arguments.
>
> Lookie here: I do "perf list" to just see the events, and what do I
> get? Let me quote that for you:
>
> List of pre-defined events (to be used in -e or -M):
> ...
> cpu-cycles OR cycles [Hardware event]
>
> and then later on in the list I get
>
> general:
> cpu_cycles
> [Cycle. Unit: armv8_pmuv3_0]
>
> and dammit, your patch broke the DOCUMENTED way to get the most
> obvious profiling data: cycles.
>
> So stop making shit up. All your arguments have been bogus garbage
> that have been talking about entirely different things than the one
> thing I reported was broken.
>
> And you *keep* doing that. Days into this, you keep making shit up
> that isn't about this very simple thing.
>
> Every single time I tell you what the problem is, you try to twist to
> be about something entirely different. Either a different 'perf'
> command entirely, or about a different event that is ENTIRELY
> IRRELEVANT.
>
> What the hell is your problem? Why can't you just admit that you
> f*cked up, and fix the thing I told you was broken, and that is very
> clearly broken and there is no "what about" issues AT ALL.
>
> So stop the idiocy already. Face the actual issue. Don't make up other things.
>
> Dammit, if I wanted "arm_dsu_58/cycles/", I would SAY so. I didn't. I
> said "cycles", which is the thing that has always worked across
> architectures, that is DOCUMENTED to be the same as "cpu-cycles", and
> that used to work just fine.
>
> It's literally RIGHT THERE in "perf list". Using "-e cycles" is
> literally also what the man-pages say. This is not me doing something
> odd.

But nobody else ever reported the issue, even ARM who maintain the PMU
driver whose event name conflicts. This hasn't been a problem for
anybody else.

> And yes, I use an explicit "-e cycles:pp" because the default is not
> that. and "cycles:pp" does better than the default.

This is wrong. cycles:P means use the maximum precision, so if the PMU
supports it it will use cycles:ppp and on x86 this is often the case.
The number of 'p's used after the event with :P is read from sysfs:
```
$ cat /sys/devices/cpu/caps/max_precise
3
```
The default using cycles:P is intentional and better than cycles:pp.
If `/sys/devices/cpu/caps/max_precise` is wrong for your PMU driver
then that should get fixed.

> Again, this is all documented, with "man perf-record" literally
> talking about "-e cycles" and then pointing to "man perf-list" for the
> modifier details, which detail that 'pp' as meaning "use maximum
> detected precise level". Which is exactly what I want (even if on this
> machine, it turns out that "p" and "pp" are the same because the armv8
> pmuv3 doesn't have that "correct for instruction drift" that Intel
> does on x86).
>
> Why is this simple thing so hard for you to understand?
>
> The fact is, if you make "cycles" mean ANYTHING ELSE than the
> long-documented actual obvious thing, you have broken perf. It's that
> simple.
>
> So stop the excuses already. Stop making up other stuff that isn't
> relevant. Stop bringing up events or PMU's that are simply not the
> issue.
>
> Face your bug head on, instead of making me have to tell you the same
> thing over and over and over again.

I keep saying the same thing as I don't agree with you, you have
broken perf in 6.10 and are presumably looking to me to pick up the
pieces. To work around the naming conflict on systems with
arm_dsu_*/events/cycles/ I don't think it is unreasonable to specify
armv8_pmuv3_0/cycles/, the blame for this lies with ARM's event name
within the arm_dsu_* PMU driver, admittedly they didn't know this
would be an issue given perf's non-uniform handling of legacy events.
When ARM requested that legacy events be a lower priority than
sysfs/json then the ball was set in motion for this problem. This was
done to work around an ARM PMU problem on Apple ARM CPUs.

Let's say an Apple CPU has a PMU called armv8_pmuv3_0. If you try to
program a legacy event on it then it will be broken - I lack a system
to test this but I'm reliably informed (user bugs and by ARM) and heck
it was a bunch of work to get this working if it was for nobody. To
fix this perf must read event data for armv8_pmuv3_0/cycles/ from
sysfs. If you don't specify a PMU then perf will try to program a
legacy event. This is the behavior in Linux 6.9. Oh, but legacy events
are broken on my Apple ARM CPU. The change made it so when you don't
specify a PMU you will use the sysfs/json one first instead of the
legacy one. Viola, your favorite `perf record -e cycles:pp ..` should
be fixed on Apple ARM CPUs.

So I think the revert is a real regression for a larger user base.
There is a testing issue here, not least I don't possess an Apple ARM
machine. All of these issues revolve around ARM and yet they do
minimal to try to fix them, beside complain that I should fix the
legacy/sysfs/json issue which is how we got here.

There's of course the alternate universe view that I need to admit I'm
wrong and I should stop going out of my way to break people. Hello
alternate universe, I admit it, I'm wrong and a terrible person please
accept my most sincere apologies. Please alternate universe, can you
tell me how to wild card event names and make them work across PMUs
without the behavior being specific to certain perf comands or event
names (which I think is worse than having to specify which PMU you
think the event should apply)?

Thanks,
Ian

> Linus
>

2024-05-28 19:45:33

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, 28 May 2024 at 11:59, Ian Rogers <[email protected]> wrote:
>
> But nobody else ever reported the issue, even ARM who maintain the PMU
> driver whose event name conflicts. This hasn't been a problem for
> anybody else.

I'm not blaming you for having had a bug.

I'm blaming you for NOT DEALING WITH THE BUG APPROPRIATELY.

I reported the bug and bisected it four days ago.

Taking some time to fix the bug is fine.

But that's not what you've been doing.

Since then, pretty much ALL you have done is argue about irrelevant
thingas that weren't about the regression in question.

The fact that you still don't agree, having broken documented
behavior, and still argue against just having it fixed, I can't do
anything about.

> So I think the revert is a real regression for a larger user base.

I didn't have much choice, did I? You refuse to even acknowledge the
bug I hit. I'd have been happy if you had just fixed the bug. You
didn't. You just argued.

> There is a testing issue here, not least I don't possess an Apple ARM
> machine.

This is not an Apple ARM machine. I have one of those too, but this
isn't it. It's an Ampere Computing system, based on an ARM Neoverse N1
(and the ARM PMU's both for the core and for the interconnect).

But that is pretty much irrelevant by now.

The issue is that you don't fix bugs you leave behind, forcing the revert.

I'm happy to test any patches. But I'm done arguing. The "cycles"
thing needs to work. This is not a "pretty please".

This is a "if you can't understand that and acknowledge that without
arguing, just work on something else, ok?"

Linus

2024-05-28 20:00:43

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, 28 May 2024 at 12:44, Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> For 'perf record' we're asking for sampling, if the event has the name
> specified and can't be sampled, skip it, warn the user and even so
> only if verbose mode is asked, something like:

Yes. I think that's the right rule in general.

However, the more I have looked at this case, the more I am also
convinced that "cycles" as a name is special.

It's literally documented to be an alias for cpu-cycles, both in
examples and in "perf list" output, and that's what the usage is.

So even if you were to have some other PMU in the system that had a
"cycles" thing, if it's not a core event but some other cycles
("uncore", bus cycles, bicycles, whatever), it shouldn't be used even
if it could be used for profiling.

You'd have to use the full PMU name and actually list it out if you
want to use a non-core counter named "cycles".

And yes, we even have some documentation that says exactly that:

"e.g usage::

perf stat -a -e arm_dsu_0/cycles/"

So this isn't even anything new or ambiguous. This is just how things
*ARE*, and absolutely have to be.

Linus

2024-05-28 20:06:03

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote:
> > On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
> > > On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
> > > > On Sat, 25 May 2024 at 09:43, Linus Torvalds <[email protected]> wrote:
>
> > > > > This makes 'perf record' work for me again.
>
> > > > Oh, wait, no it doesn't.
>
> > > > It makes just the plain "perf record" without any arguments work,
> > > > which was what I was testing because I was lazy.
>
> > > > So now
>
> > > > $ perf record sleep 1
>
> > > > works fine. But
>
> > > > $ perf record -e cycles:pp sleep 1
>
> > > > is still completely broken (with or without ":p" and ":pp").
>
> > > Seems to me that this patch fails to check if a PMU is a core-attached
> > > PMU that can support common hardware events. Therefore, we should
> > > consider adding the following check.
>
> > > +++ b/tools/perf/util/parse-events.c
> > > @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
> > > while ((pmu = perf_pmus__scan(pmu)) != NULL) {
> > > bool auto_merge_stats;
> > >
> > > + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
> > > + continue;
> > > +
> > > if (parse_events__filter_pmu(parse_state, pmu))
> > > continue;
>
> > > To be clear, I only compiled this change but I have no chance to test
> > > it. @Ian, could you confirm this?
>
> > Hi Leo,
>
> > so the code is working as intended. I believe it also agrees with what
> > Arnaldo thinks.
>
> > If you do:
>
> > $ perf stat -e cycles ...
>
> > and you have
>
> > /sys/devices/pmu1/events/cycles
> > /sys/devices/pmu2/events/cycles
>
> > The output of perf stat should contain counts for pmu1 and pmu2. Were
> > the event 'data_read' or 'inst_retired.any' we wouldn't be having the
>
> Sure, what is being asked is to count events and if those two events in
> those two PMUs can count, then do what the user asked.
>
> For 'perf record' we're asking for sampling, if the event has the name
> specified and can't be sampled, skip it, warn the user and even so
> only if verbose mode is asked, something like:
>
> root@x1:~# perf record -e cycles -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
> root@x1:~# perf evlist
> cpu_atom/cycles/
> cpu_core/cycles/
> dummy:u
> root@x1:~#
>
> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom',
> another in a 'cpu_core' one, both can be sampled, my workload may
> run/use resources on then, I'm interested, sample both.
>
> But if we had some other PMU, to use a name Jiri uses in tests/fake
> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/'
> and for some reason it doesn't support sampling, skip it, then the
> result should be the same as above.
>
> If the user finds it strange after looking at sysfs that 'krava/cycles/'
> isn't being sampled, the usual workflow is to ask perf for more
> verbosity, using -v (or multiple 'v' letters to get increasing levels of
> verbosity), in which case the user would see:
>
> root@x1:~# perf record -v -e cycles -a sleep 1
> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling.
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
> root@x1:~# perf evlist

The problem here is that we're hiding a problem rather than reporting
it. Typically we report the issue and more than that we ask the user
to work around the issue. That would be analogous to wanting the user
to specify what PMU they want the event to apply to, which has always
been perf's behavior.

Thanks,
Ian

> - Arnaldo

2024-05-28 20:30:33

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote:
> On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
> > On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
> > > On Sat, 25 May 2024 at 09:43, Linus Torvalds <[email protected]> wrote:

> > > > This makes 'perf record' work for me again.

> > > Oh, wait, no it doesn't.

> > > It makes just the plain "perf record" without any arguments work,
> > > which was what I was testing because I was lazy.

> > > So now

> > > $ perf record sleep 1

> > > works fine. But

> > > $ perf record -e cycles:pp sleep 1

> > > is still completely broken (with or without ":p" and ":pp").

> > Seems to me that this patch fails to check if a PMU is a core-attached
> > PMU that can support common hardware events. Therefore, we should
> > consider adding the following check.

> > +++ b/tools/perf/util/parse-events.c
> > @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
> > while ((pmu = perf_pmus__scan(pmu)) != NULL) {
> > bool auto_merge_stats;
> >
> > + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
> > + continue;
> > +
> > if (parse_events__filter_pmu(parse_state, pmu))
> > continue;

> > To be clear, I only compiled this change but I have no chance to test
> > it. @Ian, could you confirm this?

> Hi Leo,

> so the code is working as intended. I believe it also agrees with what
> Arnaldo thinks.

> If you do:

> $ perf stat -e cycles ...

> and you have

> /sys/devices/pmu1/events/cycles
> /sys/devices/pmu2/events/cycles

> The output of perf stat should contain counts for pmu1 and pmu2. Were
> the event 'data_read' or 'inst_retired.any' we wouldn't be having the

Sure, what is being asked is to count events and if those two events in
those two PMUs can count, then do what the user asked.

For 'perf record' we're asking for sampling, if the event has the name
specified and can't be sampled, skip it, warn the user and even so
only if verbose mode is asked, something like:

root@x1:~# perf record -e cycles -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
root@x1:~# perf evlist
cpu_atom/cycles/
cpu_core/cycles/
dummy:u
root@x1:~#

Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom',
another in a 'cpu_core' one, both can be sampled, my workload may
run/use resources on then, I'm interested, sample both.

But if we had some other PMU, to use a name Jiri uses in tests/fake
PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/'
and for some reason it doesn't support sampling, skip it, then the
result should be the same as above.

If the user finds it strange after looking at sysfs that 'krava/cycles/'
isn't being sampled, the usual workflow is to ask perf for more
verbosity, using -v (or multiple 'v' letters to get increasing levels of
verbosity), in which case the user would see:

root@x1:~# perf record -v -e cycles -a sleep 1
WARNING: skipping 'krava/cycles/' event, it doesn't support sampling.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
root@x1:~# perf evlist

- Arnaldo

2024-05-28 20:34:05

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, 28 May 2024 at 13:03, Ian Rogers <[email protected]> wrote:
>
> But you've traded a fix for one set of users with a fix for another.

No.

You don't seem to understand what "regression" means.

The "no regressions" rule is that we do not introduce NEW bugs.

It *literally* came about because we had an endless dance of "fix two
bugs, introduce one new one", and that then resulted in a system that
you cannot TRUST.

We had years of this with suspend/resume in particular, where the
argument was always exactly "this fixed many more systems" when
something else broke.

But because random other things kept breaking, it meant that people
couldn't upgrade the kernel and feel safe about it.

Your old laptop might no longer work, because somebody had deemed that
all those *new* laptops were more important.

So I introduced that "no regressions" rule something like two decades
ago, because people need to be able to update their kernel without
fear of something they relied on suddenly stopping to work.

The fact that the "suddenly stopped working" may be a minority DOES
NOT MATTER ONE WHIT.

Stability matters.

It's MUCH MUCH better to have legacy bad behavior that you never dealt
with correctly, than to introduce *new* bugs that hit something that
used to work.

So something that "perf" has never done correctly is simply not an issue.

You deal with that by saying "that has never worked properly before either".

You might even document it along with (hopefully) possible workarounds.

The whole "one step forward, two steps back" is absolutely fine if you
are line dancing.

But we're not line dancing.

We take it slow and steady, and if you can't fix something without
breaking something else, then that thing simply does not get fixed.

And there are always exceptions. Sometimes something may need to be
broken because it's an acute security issue.

And if it takes a year for somebody to find a regression, and in the
meantime others have started relying on the new behavior, then at some
point it's a "yes, that's a regression, but it wasn't reported in a
timely enough manner".

And sometimes the use-case is basically a museum machine and we retire
support for it because such a machine should use museum software too.

So it's not like it's some long-term absolute guarantee. But it
absolutely *is* the #1 rule in the kernel.

A fix that breaks something else is not a fix at all.

And in this case, when the regression was noted within days, and
within the merge window, there's simply no discussion about it.

Linus

2024-05-28 20:36:15

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, May 28, 2024 at 12:45 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, 28 May 2024 at 11:59, Ian Rogers <[email protected]> wrote:
> >
> > But nobody else ever reported the issue, even ARM who maintain the PMU
> > driver whose event name conflicts. This hasn't been a problem for
> > anybody else.
>
> I'm not blaming you for having had a bug.
>
> I'm blaming you for NOT DEALING WITH THE BUG APPROPRIATELY.
>
> Taking some time to fix the bug is fine.
>
> But that's not what you've been doing.

On LKML:

Issue reported:
2024-05-25 1:31 ` Linus Torvalds [this message]..
Fix posted:
2024-05-25 15:32 ` Ian Rogers [this message]

The only thing I've tried to clear up is the ambiguity of when an
event doesn't have a PMU what does it mean? Perf's metrics don't
specify PMUs and have uncore events. We can't restrict non-PMU
specifying events to just core events without rewriting them even if
it best matches your mental model, perf has never worked this way.

> Since then, pretty much ALL you have done is argue about irrelevant
> thingas that weren't about the regression in question.
>
> The fact that you still don't agree, having broken documented
> behavior, and still argue against just having it fixed, I can't do
> anything about.
>
> > So I think the revert is a real regression for a larger user base.
>
> I didn't have much choice, did I? You refuse to even acknowledge the
> bug I hit. I'd have been happy if you had just fixed the bug. You
> didn't. You just argued.
>
> > There is a testing issue here, not least I don't possess an Apple ARM
> > machine.
>
> This is not an Apple ARM machine. I have one of those too, but this
> isn't it. It's an Ampere Computing system, based on an ARM Neoverse N1
> (and the ARM PMU's both for the core and for the interconnect).
>
> But that is pretty much irrelevant by now.
>
> The issue is that you don't fix bugs you leave behind, forcing the revert.

But you've traded a fix for one set of users with a fix for another. I
suspect the number of ARM neoverse N1 users of the PMU are small, not
least as these devices tend to be in the cloud where PMU support is
deliberately limited. As test expectations were for the patch applied,
I think things are further regressed. I'm glad you're happy.

Thanks,
Ian

> I'm happy to test any patches. But I'm done arguing. The "cycles"
> thing needs to work. This is not a "pretty please".
>
> This is a "if you can't understand that and acknowledge that without
> arguing, just work on something else, ok?"
>
> Linus
>

2024-05-28 21:56:33

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Tue, 28 May 2024 at 14:37, Ian Rogers <[email protected]> wrote:
>
> I think stability in the context of this problem is an abstract term
> especially given the low amount of ARM support for PMUs.

Just stop arguing.

The 'cycles' thing is documented. It's CP:U cycles. This is just a *fact*.

I have now blocked you. You will get no more replies from me, because
it's not worth my time.

I simply have better things to do that deal with people who cannot
accept documented and existing behavior, and spend days arguing that
their broken fix was a "fix".

Just go play in any of the other open source projects that (sadly)
don't have regression rules.

Linus

2024-05-29 14:53:22

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On 28/05/2024 20:51, Ian Rogers wrote:
> On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo
> <[email protected]> wrote:
>>
>> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote:
>>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
>>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
>>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <[email protected]> wrote:
>>
>>>>>> This makes 'perf record' work for me again.
>>
>>>>> Oh, wait, no it doesn't.
>>
>>>>> It makes just the plain "perf record" without any arguments work,
>>>>> which was what I was testing because I was lazy.
>>
>>>>> So now
>>
>>>>> $ perf record sleep 1
>>
>>>>> works fine. But
>>
>>>>> $ perf record -e cycles:pp sleep 1
>>
>>>>> is still completely broken (with or without ":p" and ":pp").
>>
>>>> Seems to me that this patch fails to check if a PMU is a core-attached
>>>> PMU that can support common hardware events. Therefore, we should
>>>> consider adding the following check.
>>
>>>> +++ b/tools/perf/util/parse-events.c
>>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
>>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) {
>>>> bool auto_merge_stats;
>>>>
>>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
>>>> + continue;
>>>> +
>>>> if (parse_events__filter_pmu(parse_state, pmu))
>>>> continue;
>>
>>>> To be clear, I only compiled this change but I have no chance to test
>>>> it. @Ian, could you confirm this?
>>
>>> Hi Leo,
>>
>>> so the code is working as intended. I believe it also agrees with what
>>> Arnaldo thinks.
>>
>>> If you do:
>>
>>> $ perf stat -e cycles ...
>>
>>> and you have
>>
>>> /sys/devices/pmu1/events/cycles
>>> /sys/devices/pmu2/events/cycles
>>
>>> The output of perf stat should contain counts for pmu1 and pmu2. Were
>>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the
>>
>> Sure, what is being asked is to count events and if those two events in
>> those two PMUs can count, then do what the user asked.
>>
>> For 'perf record' we're asking for sampling, if the event has the name
>> specified and can't be sampled, skip it, warn the user and even so
>> only if verbose mode is asked, something like:
>>
>> root@x1:~# perf record -e cycles -a sleep 1
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
>> root@x1:~# perf evlist
>> cpu_atom/cycles/
>> cpu_core/cycles/
>> dummy:u
>> root@x1:~#
>>
>> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom',
>> another in a 'cpu_core' one, both can be sampled, my workload may
>> run/use resources on then, I'm interested, sample both.
>>
>> But if we had some other PMU, to use a name Jiri uses in tests/fake
>> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/'
>> and for some reason it doesn't support sampling, skip it, then the
>> result should be the same as above.
>>
>> If the user finds it strange after looking at sysfs that 'krava/cycles/'
>> isn't being sampled, the usual workflow is to ask perf for more
>> verbosity, using -v (or multiple 'v' letters to get increasing levels of
>> verbosity), in which case the user would see:
>>
>> root@x1:~# perf record -v -e cycles -a sleep 1
>> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling.
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
>> root@x1:~# perf evlist

This makes sense to me. I like keeping the old apparent behavior unless
-v is used and it will feel like the tool "just works".

In the context of the commit summary "perf parse-events: Prefer
sysfs/JSON hardware events over legacy":

I don't follow why that should be "Prefer, even if it's an event that
can't be opened, sysfs/JSON...".

Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then
use legacy". If all events can be opened, sure go and open them all. If
only core events can be opened, do that too. If only uncore events can
be opened... etc.

I'm happy to help with the implementation or testing of that on my
big.LITTLE system.

FWIW the hybrid stuff with Perf stat is already working well for me on
Arm since we added PERF_PMU_CAP_EXTENDED_HW_TYPE which Ian suggested.

>
> The problem here is that we're hiding a problem rather than reporting
> it. Typically we report the issue and more than that we ask the user
> to work around the issue. That would be analogous to wanting the user
> to specify what PMU they want the event to apply to, which has always
> been perf's behavior.
>

Is the problem you are referring to that there are multiple PMUs with
'cycles' events? Surely that's only a problem in the context of the new
proposed behavior, otherwise it's not really a problem. It's just
something that happens to exist.

Because the user could always use the defaults (no argument) or -e
cycles and historically Perf correctly picked the one that could be
opened. Or if they want the DSU one they could specify it. That can all
still work _and_ we can support "prefer sysfs/JSON" as long as we don't
prefer it when opening the event doesn't work.

Thanks
James

> Thanks,
> Ian
>
>> - Arnaldo

2024-05-29 17:34:14

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, May 29, 2024 at 7:50 AM James Clark <[email protected]> wrote:
>
> On 28/05/2024 20:51, Ian Rogers wrote:
> > On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo
> > <[email protected]> wrote:
> >>
> >> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote:
> >>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
> >>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
> >>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <[email protected]> wrote:
> >>
> >>>>>> This makes 'perf record' work for me again.
> >>
> >>>>> Oh, wait, no it doesn't.
> >>
> >>>>> It makes just the plain "perf record" without any arguments work,
> >>>>> which was what I was testing because I was lazy.
> >>
> >>>>> So now
> >>
> >>>>> $ perf record sleep 1
> >>
> >>>>> works fine. But
> >>
> >>>>> $ perf record -e cycles:pp sleep 1
> >>
> >>>>> is still completely broken (with or without ":p" and ":pp").
> >>
> >>>> Seems to me that this patch fails to check if a PMU is a core-attached
> >>>> PMU that can support common hardware events. Therefore, we should
> >>>> consider adding the following check.
> >>
> >>>> +++ b/tools/perf/util/parse-events.c
> >>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
> >>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) {
> >>>> bool auto_merge_stats;
> >>>>
> >>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
> >>>> + continue;
> >>>> +
> >>>> if (parse_events__filter_pmu(parse_state, pmu))
> >>>> continue;
> >>
> >>>> To be clear, I only compiled this change but I have no chance to test
> >>>> it. @Ian, could you confirm this?
> >>
> >>> Hi Leo,
> >>
> >>> so the code is working as intended. I believe it also agrees with what
> >>> Arnaldo thinks.
> >>
> >>> If you do:
> >>
> >>> $ perf stat -e cycles ...
> >>
> >>> and you have
> >>
> >>> /sys/devices/pmu1/events/cycles
> >>> /sys/devices/pmu2/events/cycles
> >>
> >>> The output of perf stat should contain counts for pmu1 and pmu2. Were
> >>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the
> >>
> >> Sure, what is being asked is to count events and if those two events in
> >> those two PMUs can count, then do what the user asked.
> >>
> >> For 'perf record' we're asking for sampling, if the event has the name
> >> specified and can't be sampled, skip it, warn the user and even so
> >> only if verbose mode is asked, something like:
> >>
> >> root@x1:~# perf record -e cycles -a sleep 1
> >> [ perf record: Woken up 1 times to write data ]
> >> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
> >> root@x1:~# perf evlist
> >> cpu_atom/cycles/
> >> cpu_core/cycles/
> >> dummy:u
> >> root@x1:~#
> >>
> >> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom',
> >> another in a 'cpu_core' one, both can be sampled, my workload may
> >> run/use resources on then, I'm interested, sample both.
> >>
> >> But if we had some other PMU, to use a name Jiri uses in tests/fake
> >> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/'
> >> and for some reason it doesn't support sampling, skip it, then the
> >> result should be the same as above.
> >>
> >> If the user finds it strange after looking at sysfs that 'krava/cycles/'
> >> isn't being sampled, the usual workflow is to ask perf for more
> >> verbosity, using -v (or multiple 'v' letters to get increasing levels of
> >> verbosity), in which case the user would see:
> >>
> >> root@x1:~# perf record -v -e cycles -a sleep 1
> >> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling.
> >> [ perf record: Woken up 1 times to write data ]
> >> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
> >> root@x1:~# perf evlist
>
> This makes sense to me. I like keeping the old apparent behavior unless
> -v is used and it will feel like the tool "just works".
>
> In the context of the commit summary "perf parse-events: Prefer
> sysfs/JSON hardware events over legacy":
>
> I don't follow why that should be "Prefer, even if it's an event that
> can't be opened, sysfs/JSON...".
>
> Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then
> use legacy". If all events can be opened, sure go and open them all. If
> only core events can be opened, do that too. If only uncore events can
> be opened... etc.

Currently event parsing is not the same as event opening. Event
parsing I hope is about generating a set of `struct perf_event_attr`,
I'd like to see it migrate to libperf and it needn't pull in evsels
and evlists that are a construct on top of these. There's always more
work to do.

This whole can't the tool do blah, blah, blah.. well fine, but that's
not actually describing the tool. It is describing a different tool
and the question is how to get from what the tool is to what you
describe. A lot of people talk to me about what the tool should do,
they are a lot more hesitant to actually do the work.

A command line with an event like:
$ perf record -e inst_retired.any ...
seems simple enough. But how many PMUs support inst_retired.any, what
should the set of 'struct perf_event_attr' look like? Oh that's
obvious Ian, you're talking about an Intel event so there should be
exactly one and it should be on the PMU 'cpu' (note, until recently we
didn't even really track PMUs associated with an event). But then what
about hybrid? Ah yes, you want two 'struct perf_event_attr' one set up
for cpu_core and one set up for cpu_atom. Okay, so how in the tool do
I "know" this? Well the tool should scan PMUs, look for the event and
then create a 'struct perf_event_attr' for each one. Okay that's where
we are today, although there are some places where core and non-core
PMUs are treated differently and I'd like that not to be the case as
it has caused conditions where we will program events differently, and
in particular I believe this breaks ARM Apple M? PMUs. It also aligned
with RISC-V plans for how to support events.

When a legacy event occurs there is a whole bunch of other processing.
The legacy event names are hard coded into the parse events parser:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l?h=perf-tools-next#n332
An example of a legacy event name is cycles. The parsing of cycles
used to create exactly one 'struct perf_event_attr' with the PMU type
set to the special PERF_TYPE_HARDWARE and the config set to
PERF_COUNT_HW_CPU_CYCLES. But then what about hybrid? Well then we
need an event for each PMU type otherwise when a process gets
rescheduled between CPU types it will suddenly become invisible. This
is something we fixed in Linux 6.x.

Ah, Ian did you know on Apple M? PMUs the legacy events are broken and
we need to always use sysfs or json events? The fact that ARM's PMU
didn't look like a core PMU meant we wouldn't try to program legacy
events on them and now your code to fix Intel's hybrid has exposed a
latent ARM PMU bug? Oh and by the way this has been broken for
multiple Linux releases, go fix it by yesterday.

My expectation on fixing this was that Intel would just say to go
away. But no, Kan was cool with it and I went ahead inverted the
priorities and now sysfs and json events have priority - the largest
part of the work is making the test expectations align with what the
code does. So what this means for our cycles event, well the logic to
scan all PMUs for inst_retired.any is now the default logic but in the
case of a legacy event we also remember a legacy encoding is possible.
So if the event isn't found in sysfs or json for a core PMU then we
use the legacy encoding. Note, I'm just talking about making 'struct
perf_event_attr' not opening events.

Great, so why did you need a patch in 6.10? The process was
incomplete. Let's say on a Apple M? system you went:
$ perf record -e 'armv8_pmuv3_0/cycles/' ...
then you would indeed get a sysfs encoded event. But what about:
$ perf record -e 'cycles' ...
I may want the event on the BIG and little cores. Well in that case
the sysfs and json overriding legacy event logic hadn't been applied.
This means that the legacy encoding would always be used, but as I
already mentioned the legacy encoding is broken on a Apple M? system.
The user is supposed to somehow know the fix is to add the PMU name
and specify multiple events.

With the now reverted change the event parsing logic would scan for
the cycles event on every PMU and create a struct perf_event_attr for
it (the code makes evsels, but they are just wrappers on top of a
struct perf_event_attr). But why scan all PMUs? Well that was done to
make hybrid work and because perf's behavior has always been either to
handle an event as legacy or handle it as a potential event on all
PMUs. We're making legacy second class, given the desire to prefer
sysfs and json, and we only use that encoding when the scan fails to
find the event on a core PMU.

So great, ignoring the revert, that fixed everything? Well no. The
tool in places was hard coding 'struct perf_event_attr' which is of
course broken were things to be hybrid or BIG.little. So the fix for
that was to not hard code things. We need a set of 'struct
perf_event_attr', ah I know a way to get that let's just use our event
parsing logic. So a 'struct perf_event_attr' hard coding type to
PERF_TYPE_HARDWARE, the config to PERF_COUNT_HW_CPU_CYCLES and also
setting the precision to maximum was changed into parsing the string
"cycles:P". Sounds good, no? Well no. Somebody decided to create an
ARM event called cycles (Intel's name to avoid conflicts is
clockticks) and now that event was getting added to our set. Although
the patch sat for weeks (months?) on linux-next nobody had discovered
a fairly obvious breakage.

Oh Ian this is your crazy json Ian wants to pick a PMU and wants to
break the world not caring about regressions mentality, admit you hate
everybody and let me throw some expletives at you. Well I hope not. We
were parsing "cycles:P" to generate our set of perf_event_attr, let's
avoid the problem of getting uncore PMU events by making it so the
events parsed are only on core PMUs - hey, this is at least better
than the broken on hybrid hard coded alternative. Which is where this
patch sits on top of everything.

Well we hate it. Cycles is clearly only something that CPUs have
(ignoring the fact ARM have explicitly called their L3 cache event
cycles, cough, cough..) make everything go back. Go back to where? We
want legacy events to have priority again? But we have legacy event
names like l1-i-speculative-read-prefetches, we want to hard code that
nonsense and make it modify the regular scan all PMUs logic? Clearly
no, and that's where Arnaldo's proposal is.

Under Arnaldo's proposal, as I understand it, the parsing will
generate the set of 'struct perf_event_attr' in the list of evsels.
When we go to open the evsels a failure needn't be fatal. Notice the
important distinction here. You are talking about opening during
parsing, the proposal is to postpone that.

Great, go and code. I did - I even did it on the weekend. The events
we want to ignore I reuse the idea of skippable that was introduced
for when a hard coded 'struct perf_event_attr' could fail to open but
we'd moved the code to use event parsing. The problem is that nothing
in the perf record logic, that is split between perf and libperf,
expects an evsel not to open. To add the workarounds was making a
change large enough that it was unreasonable to think it could land in
v6.10 as it may well break something else.

So where does that leave us:
1) we have Arnaldo's proposal, it sounds good but it has clear
behavior changes. My rough implementation showed it to be too big for
v6.10.
2) v6.10 has reverted a patch fixing the event encodings for the PMU
on ARM Apple M? CPUs.
3) as the reverted patch has undone something fundamental our tests
are now out of sync with what event parsing does (hey ARM never made
the event parsing tests work anyway so why care about this one,
right?)
4) if we revert the revert it fixes the Apple M? issue but breaks
Neoverse if the arm_dsu PMU is enabled. Well then apply this fix on
top of the revert and you're good? No, what has explicitly been said
is that "cycles" shouldn't follow any of the patterns of event parsing
so far described. But why? Well because even though the PMUs are
advertising two cycles events, cycles needs to be special and mean
core only if no PMU is given. So 1 cycles event then? Well no, we
still want PMU scanning because there's hybrid to care about. Okay,
and is it just cycles that are special? We have potentially >200
legacy events where a similar argument could be made about making them
core only, what about instructions, what about branches... The reply I
got was, go away and I will no longer speak to you.

I think the best plan we have for moving forward is Arnaldo's
proposal, Linus has explicitly stated that it is wrong and there
should be special event names. We can't land this in v6.10 which is
why I've said that it doesn't seem so terrible that on a Neoverse if
you are specifying events by hand and you don't specify a PMU and the
event you specify happens to be cycles, make it clear you want the
armv8_pmuv3_0 one by adding the PMU name. This has always been the way
to clear up the ambiguity.

> I'm happy to help with the implementation or testing of that on my
> big.LITTLE system.
>
> FWIW the hybrid stuff with Perf stat is already working well for me on
> Arm since we added PERF_PMU_CAP_EXTENDED_HW_TYPE which Ian suggested.
>
> >
> > The problem here is that we're hiding a problem rather than reporting
> > it. Typically we report the issue and more than that we ask the user
> > to work around the issue. That would be analogous to wanting the user
> > to specify what PMU they want the event to apply to, which has always
> > been perf's behavior.
> >
>
> Is the problem you are referring to that there are multiple PMUs with
> 'cycles' events? Surely that's only a problem in the context of the new
> proposed behavior, otherwise it's not really a problem. It's just
> something that happens to exist.

So perf will try to open events, when failing it will fall back, if an
event fails to open then we report an error. This is why none of the
code currently handles an evsel in an evlist where the event has
failed to open.

Failure to open could be a behavior someone relies upon. Let's say you
want to probe the behavior of a PMU, you could be relying on failures
to show you the event won't work. This is Hyrum's law:

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

It isn't new behavior for perf to scan all PMUs, it always has, the
new behavior is around legacy events. We want multiple PMU scanning
for hybrid, we want all PMU scanning for uncore. The legacy changes
happened because of the Apple M? PMU with me being complained at by
folks at ARM who have now created this mess by their arm_dsu event
name. Shouldn't it be a 1 liner fix to change "DSU_EVENT_ATTR(cycles,
0x11)," to "DSU_EVENT_ATTR(clockticks, 0x11),":
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_dsu_pmu.c#n177
that's up to ARM but it would make sense to me.

> Because the user could always use the defaults (no argument) or -e
> cycles and historically Perf correctly picked the one that could be
> opened. Or if they want the DSU one they could specify it. That can all
> still work _and_ we can support "prefer sysfs/JSON" as long as we don't
> prefer it when opening the event doesn't work.

Hopefully this is all explained above.

Thanks,
Ian

> Thanks
> James
>
> > Thanks,
> > Ian
> >
> >> - Arnaldo

2024-05-29 18:44:23

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, May 29, 2024 at 03:50:53PM +0100, James Clark wrote:
> Is the problem you are referring to that there are multiple PMUs with
> 'cycles' events? Surely that's only a problem in the context of the new
> proposed behavior, otherwise it's not really a problem. It's just
> something that happens to exist.

> Because the user could always use the defaults (no argument) or -e
> cycles and historically Perf correctly picked the one that could be

See below to see if mixing up "cycles" for efficiency and performance
cores is something sane or if I am missing something.

> opened. Or if they want the DSU one they could specify it. That can all
> still work _and_ we can support "prefer sysfs/JSON" as long as we don't
> prefer it when opening the event doesn't work.

Yeah, getting all the events in all the PMUs that match a string (after
it is normalized to cover historical artifacts, as in the case of
"cycles", "cpu_cycles" and "cpu-cycles", all of which should mean
"cycles" the special, default event) and that can sample if that is what
is being asked seems to be a sane outcome from this discussion.

But lemme do try to show the differences from my Lenovo Intel Hybrid
system (13th gen) and a Libre Computer Rockchip ARM64 hybrid system:

There are some differences on how ARM64 supports hybrid that we may find
interesting to fix or at least to (better) document, for instance:

root@roc-rk3399-pc:~# dmidecode -H 1
# dmidecode 3.4
Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.
7 structures occupying 283 bytes.
Table at 0xEAE7A020.

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: libre-computer
Product Name: roc-rk3399-pc
Version: Not Specified
Serial Number: b03c01a7179278b7
UUID: 63333062-3130-3761-3137-393237386237
Wake-up Type: Reserved
SKU Number: Not Specified
Family: Not Specified

root@roc-rk3399-pc:~#

This is a hybrid architecture:

root@roc-rk3399-pc:~# ls -la /sys/devices/*/events/cpu_cycles
-r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a53/events/cpu_cycles
-r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a72/events/cpu_cycles
root@roc-rk3399-pc:~#

In an intel hybrid system we instead have:

root@number:~# ls -la /sys/devices/*/events/cpu-cycles
-r--r--r--. 1 root root 4096 May 29 13:59 /sys/devices/cpu_atom/events/cpu-cycles
-r--r--r--. 1 root root 4096 May 29 14:00 /sys/devices/cpu_core/events/cpu-cycles
root@number:~#

Small difference, a - versus a _, but then both hybrid, efficiency
cores (armv8_cortex_a53 vs cpu_atom) and performance ones
(armv8_cortex_a72 vs cpu_core).

On the Intel Hybrid system:

root@number:~# perf record -e cycles -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.709 MB perf.data (46911 samples) ]
root@number:~# perf evlist
cpu_atom/cycles/
cpu_core/cycles/
dummy:u
root@number:~#

root@number:~# perf evlist -v
cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
root@number:~#

So it is recording CPU cycles in all the CPUs in the system, performance
and efficiency ones and that gets clear on a per-sample base:

root@number:~# perf script
perf 2465078 [000] 73716.379947: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms])
perf 2465078 [001] 73716.379966: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms])
<SNIP more cpu_core/cycles/ samples>
gnome-shell 2608 [018] 73716.380704: 6721618 cpu_atom/cycles/: ffffffffc0b8419c fw_domains_get_with_fallback+0xfc ([kernel.kallsyms])
podman 688107 [017] 73716.380706: 6695621 cpu_atom/cycles/: 564fc6110da0 [unknown] (/usr/bin/podman)
podman 687246 [000] 73716.380842: 8844997 cpu_core/cycles/: ffffffffb515150c _raw_spin_lock_irqsave+0xc ([kernel.kallsyms])
podman 688108 [016] 73716.380913: 6737580 cpu_atom/cycles/: ffffffffb515205c native_queued_spin_lock_slowpath+0x28c ([kernel.kallsyms])
swapper 0 [004] 73716.380932: 2090132 cpu_core/cycles/: ffffffffb513ad49 poll_idle+0x59 ([kernel.kallsyms])
<SNIP>

But on the ARM hybrid system, without Ian's patch, i.e. with what is in
torvalds/master right now (plus some header copies updates I'm working
on that are unrelated):

root@roc-rk3399-pc:~# perf record -e cycles -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.135 MB perf.data (359 samples) ]
root@roc-rk3399-pc:~# perf evlist
cycles
dummy:u
root@roc-rk3399-pc:~#

It records just one "event" even tho it is recording for all CPUs, both
efficiency and performance:

root@number:~# perf script
<SNIP>
kworker/2:1-eve 10124 [002] 9687.302790: 60674 cycles: ffffc4c65bdd7380 vmap_small_pages_range_noflush+0x190 ([kernel.kallsyms])
kworker/2:1-eve 10124 [002] 9687.302957: 66040 cycles: ffffc4c65bdd7438 vmap_small_pages_range_noflush+0x248 ([kernel.kallsyms])
kworker/2:1-eve 10124 [002] 9687.303139: 71011 cycles: ffffc4c65cde0210 ww_mutex_lock+0x60 ([kernel.kallsyms])
swapper 0 [002] 9687.303342: 75390 cycles: ffffc4c65bbc31c8 update_blocked_averages+0x188 ([kernel.kallsyms])
swapper 0 [000] 9687.309276: 45496 cycles: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms])
<SNIP>

Everything appears as "cycles" but we're getting samples for all CPUs,
again, performance and efficiency ones, different kinds of processors,
right?

root@roc-rk3399-pc:~# perf report --stdio --sort cpu
# To display the perf.data header info, please use --header/--header-only options.
#
# Total Lost Samples: 0
#
# Samples: 359 of event 'cycles'
# Event count (approx.): 23873034
#
# Overhead CPU
# ........ ...
#
31.34% 003
22.44% 004
19.30% 000
12.94% 002
9.14% 001
4.84% 005

root@roc-rk3399-pc:~#

If we try, instead with cpu-cycles:

root@roc-rk3399-pc:~# perf record -e cpu-cycles -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.135 MB perf.data (346 samples) ]
root@roc-rk3399-pc:~# perf evlist
cpu-cycles
dummy:u
root@roc-rk3399-pc:~#
root@roc-rk3399-pc:~# perf evlist -v
cpu-cycles: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
root@roc-rk3399-pc:~#

Both 'cycles' and 'cpu-cycles' end up the same as type: 0
(PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES).

But if we use something equivalent but with that - replaced with a _ we
get a behaviour that is closer to the Intel one:

root@roc-rk3399-pc:~# perf record -e cpu_cycles -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.137 MB perf.data (390 samples) ]
root@roc-rk3399-pc:~#
root@roc-rk3399-pc:~# perf evlist
armv8_cortex_a53/cpu_cycles/
armv8_cortex_a72/cpu_cycles/
dummy:u
root@roc-rk3399-pc:~#

root@roc-rk3399-pc:~# perf evlist -v
armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
root@roc-rk3399-pc:~#

That doesn't mixes up CPU cycles for different CPU types:

root@roc-rk3399-pc:~# perf script
<SNIP>
perf 16726 [005] 12632.206216: 3798 armv8_cortex_a72/cpu_cycles/: ffffc4c65be618d8 do_vfs_ioctl+0x424 ([kernel.kallsyms])
swapper 0 [000] 12632.206235: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms])
perf 16726 [005] 12632.206272: 20279 armv8_cortex_a72/cpu_cycles/: ffffc4c65be113b4 kmem_cache_alloc+0x44 ([kernel.kallsyms])
sugov:4 166 [004] 12632.206409: 52979 armv8_cortex_a72/cpu_cycles/: ffffc4c65cde5de8 _raw_spin_unlock_irqrestore+0x14 ([kernel.kallsyms])
perf 16726 [005] 12632.206443: 67123 armv8_cortex_a72/cpu_cycles/: ffffc4c65be26bbc arch_local_irq_restore+0x8 ([kernel.kallsyms])
perf 16726 [005] 12632.206690: 96987 armv8_cortex_a72/cpu_cycles/: ffffc4c65bdb4a84 fault_in_readable+0xe4 ([kernel.kallsyms])
perf-exec 16727 [004] 12632.206836: 84199 armv8_cortex_a72/cpu_cycles/: ffffc4c65bd6c3b4 next_uptodate_page+0x264 ([kernel.kallsyms])
perf 16726 [005] 12632.206950: 102567 armv8_cortex_a72/cpu_cycles/: ffffc4c65bbe2aa4 up_write+0xa4 ([kernel.kallsyms])
swapper 0 [000] 12632.207030: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms])
perf-exec 16727 [004] 12632.207037: 79507 armv8_cortex_a72/cpu_cycles/: ffffc4c65c48b89c strnlen_user+0x16c ([kernel.kallsyms])
<SNIP>

So from the point of view of the user its important to differentiate
samples for each type of CPU, so grouping everything into the same
basket as ARM did in its big.LITTLE seems strange/"wrong".

The way that in Intel when it "does the right thing" (I think no quotes
are needed here, but I may be missing something) and at the tool level
translates the special event name "cycles" into what the Intel PMU
kernel drivers advertises to the world via sysfs as
/sys/devices/cpu_{atom,core}/events/cpu-cycles (with that -) and ARM
advertises as /sys/devices/armv8_cortex_a{53,72}/events/cpu_cycles (note
the _) but gets translated in terms of 'struct perf_event_attr' as

Intel:

cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000
cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000

Versus ARM as:

armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles)
armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles)

can possibly be made more consistent in a way that doesn't break any
user experience using the perf command line.

I would propose that 'cycles' explicitely asked or as the default,
translates into armv8_cortex_a{53,72}/cpu_cycles/ on ARM and on
/sys/devices/cpu_{atom,core}/events/cpu-cycles on Intel, and that
whatever other architecture that comes to this party tries to learn from
this botched experience.

- Arnaldo

2024-05-29 19:26:01

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, May 29, 2024 at 11:44 AM Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
> On Wed, May 29, 2024 at 03:50:53PM +0100, James Clark wrote:
> > Is the problem you are referring to that there are multiple PMUs with
> > 'cycles' events? Surely that's only a problem in the context of the new
> > proposed behavior, otherwise it's not really a problem. It's just
> > something that happens to exist.
>
> > Because the user could always use the defaults (no argument) or -e
> > cycles and historically Perf correctly picked the one that could be
>
> See below to see if mixing up "cycles" for efficiency and performance
> cores is something sane or if I am missing something.
>
> > opened. Or if they want the DSU one they could specify it. That can all
> > still work _and_ we can support "prefer sysfs/JSON" as long as we don't
> > prefer it when opening the event doesn't work.
>
> Yeah, getting all the events in all the PMUs that match a string (after
> it is normalized to cover historical artifacts, as in the case of
> "cycles", "cpu_cycles" and "cpu-cycles", all of which should mean
> "cycles" the special, default event) and that can sample if that is what
> is being asked seems to be a sane outcome from this discussion.
>
> But lemme do try to show the differences from my Lenovo Intel Hybrid
> system (13th gen) and a Libre Computer Rockchip ARM64 hybrid system:
>
> There are some differences on how ARM64 supports hybrid that we may find
> interesting to fix or at least to (better) document, for instance:
>
> root@roc-rk3399-pc:~# dmidecode -H 1
> # dmidecode 3.4
> Getting SMBIOS data from sysfs.
> SMBIOS 3.0 present.
> 7 structures occupying 283 bytes.
> Table at 0xEAE7A020.
>
> Handle 0x0001, DMI type 1, 27 bytes
> System Information
> Manufacturer: libre-computer
> Product Name: roc-rk3399-pc
> Version: Not Specified
> Serial Number: b03c01a7179278b7
> UUID: 63333062-3130-3761-3137-393237386237
> Wake-up Type: Reserved
> SKU Number: Not Specified
> Family: Not Specified
>
> root@roc-rk3399-pc:~#
>
> This is a hybrid architecture:
>
> root@roc-rk3399-pc:~# ls -la /sys/devices/*/events/cpu_cycles
> -r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a53/events/cpu_cycles
> -r--r--r-- 1 root root 4096 May 29 16:27 /sys/devices/armv8_cortex_a72/events/cpu_cycles
> root@roc-rk3399-pc:~#
>
> In an intel hybrid system we instead have:
>
> root@number:~# ls -la /sys/devices/*/events/cpu-cycles
> -r--r--r--. 1 root root 4096 May 29 13:59 /sys/devices/cpu_atom/events/cpu-cycles
> -r--r--r--. 1 root root 4096 May 29 14:00 /sys/devices/cpu_core/events/cpu-cycles
> root@number:~#
>
> Small difference, a - versus a _, but then both hybrid, efficiency
> cores (armv8_cortex_a53 vs cpu_atom) and performance ones
> (armv8_cortex_a72 vs cpu_core).
>
> On the Intel Hybrid system:
>
> root@number:~# perf record -e cycles -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 4.709 MB perf.data (46911 samples) ]
> root@number:~# perf evlist
> cpu_atom/cycles/
> cpu_core/cycles/
> dummy:u
> root@number:~#
>
> root@number:~# perf evlist -v
> cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
> cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
> dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
> root@number:~#
>
> So it is recording CPU cycles in all the CPUs in the system, performance
> and efficiency ones and that gets clear on a per-sample base:
>
> root@number:~# perf script
> perf 2465078 [000] 73716.379947: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms])
> perf 2465078 [001] 73716.379966: 1 cpu_core/cycles/: ffffffffb40a55fa native_write_msr+0xa ([kernel.kallsyms])
> <SNIP more cpu_core/cycles/ samples>
> gnome-shell 2608 [018] 73716.380704: 6721618 cpu_atom/cycles/: ffffffffc0b8419c fw_domains_get_with_fallback+0xfc ([kernel.kallsyms])
> podman 688107 [017] 73716.380706: 6695621 cpu_atom/cycles/: 564fc6110da0 [unknown] (/usr/bin/podman)
> podman 687246 [000] 73716.380842: 8844997 cpu_core/cycles/: ffffffffb515150c _raw_spin_lock_irqsave+0xc ([kernel.kallsyms])
> podman 688108 [016] 73716.380913: 6737580 cpu_atom/cycles/: ffffffffb515205c native_queued_spin_lock_slowpath+0x28c ([kernel.kallsyms])
> swapper 0 [004] 73716.380932: 2090132 cpu_core/cycles/: ffffffffb513ad49 poll_idle+0x59 ([kernel.kallsyms])
> <SNIP>
>
> But on the ARM hybrid system, without Ian's patch, i.e. with what is in
> torvalds/master right now (plus some header copies updates I'm working
> on that are unrelated):
>
> root@roc-rk3399-pc:~# perf record -e cycles -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.135 MB perf.data (359 samples) ]
> root@roc-rk3399-pc:~# perf evlist
> cycles
> dummy:u
> root@roc-rk3399-pc:~#
>
> It records just one "event" even tho it is recording for all CPUs, both
> efficiency and performance:
>
> root@number:~# perf script
> <SNIP>
> kworker/2:1-eve 10124 [002] 9687.302790: 60674 cycles: ffffc4c65bdd7380 vmap_small_pages_range_noflush+0x190 ([kernel.kallsyms])
> kworker/2:1-eve 10124 [002] 9687.302957: 66040 cycles: ffffc4c65bdd7438 vmap_small_pages_range_noflush+0x248 ([kernel.kallsyms])
> kworker/2:1-eve 10124 [002] 9687.303139: 71011 cycles: ffffc4c65cde0210 ww_mutex_lock+0x60 ([kernel.kallsyms])
> swapper 0 [002] 9687.303342: 75390 cycles: ffffc4c65bbc31c8 update_blocked_averages+0x188 ([kernel.kallsyms])
> swapper 0 [000] 9687.309276: 45496 cycles: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms])
> <SNIP>
>
> Everything appears as "cycles" but we're getting samples for all CPUs,
> again, performance and efficiency ones, different kinds of processors,
> right?
>
> root@roc-rk3399-pc:~# perf report --stdio --sort cpu
> # To display the perf.data header info, please use --header/--header-only options.
> #
> # Total Lost Samples: 0
> #
> # Samples: 359 of event 'cycles'
> # Event count (approx.): 23873034
> #
> # Overhead CPU
> # ........ ...
> #
> 31.34% 003
> 22.44% 004
> 19.30% 000
> 12.94% 002
> 9.14% 001
> 4.84% 005
>
> root@roc-rk3399-pc:~#
>
> If we try, instead with cpu-cycles:
>
> root@roc-rk3399-pc:~# perf record -e cpu-cycles -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.135 MB perf.data (346 samples) ]
> root@roc-rk3399-pc:~# perf evlist
> cpu-cycles
> dummy:u
> root@roc-rk3399-pc:~#
> root@roc-rk3399-pc:~# perf evlist -v
> cpu-cycles: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
> dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
> root@roc-rk3399-pc:~#
>
> Both 'cycles' and 'cpu-cycles' end up the same as type: 0
> (PERF_TYPE_HARDWARE), size: 136, config: 0 (PERF_COUNT_HW_CPU_CYCLES).
>
> But if we use something equivalent but with that - replaced with a _ we
> get a behaviour that is closer to the Intel one:
>
> root@roc-rk3399-pc:~# perf record -e cpu_cycles -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.137 MB perf.data (390 samples) ]
> root@roc-rk3399-pc:~#
> root@roc-rk3399-pc:~# perf evlist
> armv8_cortex_a53/cpu_cycles/
> armv8_cortex_a72/cpu_cycles/
> dummy:u
> root@roc-rk3399-pc:~#
>
> root@roc-rk3399-pc:~# perf evlist -v
> armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
> armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles), { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|CPU|PERIOD|IDENTIFIER, read_format: ID|LOST, disabled: 1, inherit: 1, freq: 1, sample_id_all: 1, exclude_guest: 1
> dummy:u: type: 1 (software), size: 136, config: 0x9 (PERF_COUNT_SW_DUMMY), { sample_period, sample_freq }: 1, sample_type: IP|TID|TIME|CPU|IDENTIFIER, read_format: ID|LOST, inherit: 1, exclude_kernel: 1, exclude_hv: 1, mmap: 1, comm: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1, ksymbol: 1, bpf_event: 1
> root@roc-rk3399-pc:~#
>
> That doesn't mixes up CPU cycles for different CPU types:
>
> root@roc-rk3399-pc:~# perf script
> <SNIP>
> perf 16726 [005] 12632.206216: 3798 armv8_cortex_a72/cpu_cycles/: ffffc4c65be618d8 do_vfs_ioctl+0x424 ([kernel.kallsyms])
> swapper 0 [000] 12632.206235: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms])
> perf 16726 [005] 12632.206272: 20279 armv8_cortex_a72/cpu_cycles/: ffffc4c65be113b4 kmem_cache_alloc+0x44 ([kernel.kallsyms])
> sugov:4 166 [004] 12632.206409: 52979 armv8_cortex_a72/cpu_cycles/: ffffc4c65cde5de8 _raw_spin_unlock_irqrestore+0x14 ([kernel.kallsyms])
> perf 16726 [005] 12632.206443: 67123 armv8_cortex_a72/cpu_cycles/: ffffc4c65be26bbc arch_local_irq_restore+0x8 ([kernel.kallsyms])
> perf 16726 [005] 12632.206690: 96987 armv8_cortex_a72/cpu_cycles/: ffffc4c65bdb4a84 fault_in_readable+0xe4 ([kernel.kallsyms])
> perf-exec 16727 [004] 12632.206836: 84199 armv8_cortex_a72/cpu_cycles/: ffffc4c65bd6c3b4 next_uptodate_page+0x264 ([kernel.kallsyms])
> perf 16726 [005] 12632.206950: 102567 armv8_cortex_a72/cpu_cycles/: ffffc4c65bbe2aa4 up_write+0xa4 ([kernel.kallsyms])
> swapper 0 [000] 12632.207030: 78413 armv8_cortex_a53/cpu_cycles/: ffffc4c65ca38978 cpuidle_enter_state+0xc8 ([kernel.kallsyms])
> perf-exec 16727 [004] 12632.207037: 79507 armv8_cortex_a72/cpu_cycles/: ffffc4c65c48b89c strnlen_user+0x16c ([kernel.kallsyms])
> <SNIP>
>
> So from the point of view of the user its important to differentiate
> samples for each type of CPU, so grouping everything into the same
> basket as ARM did in its big.LITTLE seems strange/"wrong".
>
> The way that in Intel when it "does the right thing" (I think no quotes
> are needed here, but I may be missing something) and at the tool level
> translates the special event name "cycles" into what the Intel PMU
> kernel drivers advertises to the world via sysfs as
> /sys/devices/cpu_{atom,core}/events/cpu-cycles (with that -) and ARM
> advertises as /sys/devices/armv8_cortex_a{53,72}/events/cpu_cycles (note
> the _) but gets translated in terms of 'struct perf_event_attr' as
>
> Intel:
>
> cpu_atom/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0xa00000000
> cpu_core/cycles/: type: 0 (PERF_TYPE_HARDWARE), size: 136, config: 0x400000000

Here there is a legacy encoding of config 0 meaning
PERF_COUNT_HW_CPU_CYCLES the high bits are (from perf_event.h):
```
* PERF_TYPE_HARDWARE: 0xEEEEEEEE000000AA
* AA: hardware event ID
* EEEEEEEE: PMU type ID
```
Intel has a cpu-cycles sysfs event and depending on whether you do or
don't have the reverted patch the priority of that over a legacy event
is going to vary.

> Versus ARM as:
>
> armv8_cortex_a53/cpu_cycles/: type: 7 (armv8_cortex_a53), size: 136, config: 0x11 (cpu_cycles)
> armv8_cortex_a72/cpu_cycles/: type: 8 (armv8_cortex_a72), size: 136, config: 0x11 (cpu_cycles)

Here cpu_cycles isn't a legacy event which use hyphens and not underscores:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/parse-events.l#n335
The two sysfs events were programmed. Were another PMU advertising
cpu_cycles we'd also try to program that.

> can possibly be made more consistent in a way that doesn't break any
> user experience using the perf command line.
>
> I would propose that 'cycles' explicitely asked or as the default,
> translates into armv8_cortex_a{53,72}/cpu_cycles/ on ARM and on
> /sys/devices/cpu_{atom,core}/events/cpu-cycles on Intel, and that
> whatever other architecture that comes to this party tries to learn from
> this botched experience.

I don't understand this paragraph. My expectation is that sysfs and
json are priority over legacy, we transitioned to this to fix ARM,
this does mean reverting the revert. We use legacy for core PMUs when
the sysfs/json lookup fails and there is a legacy name. We may or may
not have a notion of special events where special events are only
programmed on core PMUs. My preference is no notion, as what flavor of
cycles, .. go in it is just madness. Linus disagrees and won't talk
about it to clear up the ambiguity. If we do have special events then
can we keep scanning all PMUs by having test logic that complains when
a PMU uses a special name for one of their events or must we put
special case logic in the event parser?

We can fix the arm_dsu bug by renaming cycles there. If that's too
hard to land, clearing up ambiguity by adding a PMU name has always
been the way to do this. My preference for v6.10 is revert the revert,
then add either a rename of the arm_dsu event and/or the change here.

We can make perf record tolerant and ignore opening events on PMUs
that don't support sampling, but I think it is too big a thing to do
for v6.10.

I don't know what state perf test is in, and I expect the answer
varies greatly between perf-next and linus/master. There is stuff
queued up to go in perf-next further building on the expectations of
the reverted patch.

Could someone at ARM please set up a build/test server on linux-next
for whatever machine is Linus' flavor of the day so that we don't see
this happen in a PR again. I build test ARM on a Raspberry Pi and the
BIG.little devices are similarly at the low end of the
price/performance spectrum. In all likelihood they all lack precise
samples, Linus' command lines show he cares about this.

Thanks,
Ian

2024-05-30 05:35:35

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, May 29, 2024 at 12:25 PM Ian Rogers <[email protected]> wrote:
> We can fix the arm_dsu bug by renaming cycles there. If that's too
> hard to land, clearing up ambiguity by adding a PMU name has always
> been the way to do this. My preference for v6.10 is revert the revert,
> then add either a rename of the arm_dsu event and/or the change here.
>
> We can make perf record tolerant and ignore opening events on PMUs
> that don't support sampling, but I think it is too big a thing to do
> for v6.10.

How about adding a flag to parse_event_option_args so that we
can check if it's for sampling events. And then we might skip
uncore PMUs easily (assuming arm_dsu PMU is uncore).

It might not be a perfect solution but it could be a simple one.
Ideally I think it'd be nice if the kernel exports more information
about the PMUs like sampling and exclude capabilities.

Thanks,
Namhyung

2024-05-30 12:48:57

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On 30/05/2024 06:35, Namhyung Kim wrote:
> On Wed, May 29, 2024 at 12:25 PM Ian Rogers <[email protected]> wrote:
>> We can fix the arm_dsu bug by renaming cycles there. If that's too
>> hard to land, clearing up ambiguity by adding a PMU name has always
>> been the way to do this. My preference for v6.10 is revert the revert,
>> then add either a rename of the arm_dsu event and/or the change here.
>>
>> We can make perf record tolerant and ignore opening events on PMUs
>> that don't support sampling, but I think it is too big a thing to do
>> for v6.10.
>
> How about adding a flag to parse_event_option_args so that we
> can check if it's for sampling events. And then we might skip
> uncore PMUs easily (assuming arm_dsu PMU is uncore).

It's uncore yes.

Couldn't we theoretically have a core PMU that still doesn't support
sampling though? And then we'd end up in the same situation. Attempting
to open the event is the only sure way of knowing, rather than trying to
guess with some heuristic in userspace?

Maybe a bit too hypothetical but still worth considering.

>
> It might not be a perfect solution but it could be a simple one.
> Ideally I think it'd be nice if the kernel exports more information
> about the PMUs like sampling and exclude capabilities.
> > Thanks,
> Namhyung

That seems like a much better suggestion. Especially with the ever
expanding retry/fallback mechanism that can never really take into
account every combination of event attributes that can fail.

James

2024-05-30 13:57:50

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
>
> On 30/05/2024 06:35, Namhyung Kim wrote:
> > On Wed, May 29, 2024 at 12:25 PM Ian Rogers <[email protected]> wrote:
> >> We can fix the arm_dsu bug by renaming cycles there. If that's too
> >> hard to land, clearing up ambiguity by adding a PMU name has always
> >> been the way to do this. My preference for v6.10 is revert the revert,
> >> then add either a rename of the arm_dsu event and/or the change here.
> >>
> >> We can make perf record tolerant and ignore opening events on PMUs
> >> that don't support sampling, but I think it is too big a thing to do
> >> for v6.10.
> >
> > How about adding a flag to parse_event_option_args so that we
> > can check if it's for sampling events. And then we might skip
> > uncore PMUs easily (assuming arm_dsu PMU is uncore).
>
> It's uncore yes.
>
> Couldn't we theoretically have a core PMU that still doesn't support
> sampling though? And then we'd end up in the same situation. Attempting
> to open the event is the only sure way of knowing, rather than trying to
> guess with some heuristic in userspace?
>
> Maybe a bit too hypothetical but still worth considering.
>
> >
> > It might not be a perfect solution but it could be a simple one.
> > Ideally I think it'd be nice if the kernel exports more information
> > about the PMUs like sampling and exclude capabilities.
> > > Thanks,
> > Namhyung
>
> That seems like a much better suggestion. Especially with the ever
> expanding retry/fallback mechanism that can never really take into
> account every combination of event attributes that can fail.

I think this approach can work but we may break PMUs.

Rather than use `is_core` on `struct pmu` we could have say a
`supports_sampling` and we pass to parse_events an option to exclude
any PMU that doesn't have that flag. Now obviously more than just core
PMUs support sampling. All software PMUs, tracepoints, probes. We have
an imprecise list of these in perf_pmu__is_software. So we can set
supports_sampling for perf_pmu__is_software and is_core.

I think the problem comes for things like the AMD IBS PMUs, intel_bts
and intel_pt. Often these only support sampling but aren't core. There
may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
make a list of all these PMU names then we can use that to set
supports_sampling and not break event parsing for these PMUs.

The name list sounds somewhat impractical, let's say we lazily compute
the supports_sampling on a PMU. We need the sampling equivalent of
is_event_supported:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
is_event_supported has had bugs, look at the exclude_guest workaround
for Apple PMUs. It also isn't clear to me how we choose the event
config that we're going to probe to determine whether sampling works.
The perf_event_open may reject the test because of a bad config and
not because sampling isn't supported.

So I think we can make the approach work if we had either:
1) a list of PMUs that support sampling,
2) a reliable "is_sampling_supported" test.

I'm not sure of the advantages of doing (2) rather than just creating
the set of evsels and ignoring those that fail to open. Ignoring
evsels that fail to open seems more unlikely to break anything as the
user is giving the events/config values for the PMUs they care about.

Thanks,
Ian

2024-05-30 15:37:24

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On 29/05/2024 18:33, Ian Rogers wrote:
> On Wed, May 29, 2024 at 7:50 AM James Clark <[email protected]> wrote:
>>
>> On 28/05/2024 20:51, Ian Rogers wrote:
>>> On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo
>>> <[email protected]> wrote:
>>>>
>>>> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote:
>>>>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
>>>>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
>>>>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <[email protected]> wrote:
>>>>
>>>>>>>> This makes 'perf record' work for me again.
>>>>
>>>>>>> Oh, wait, no it doesn't.
>>>>
>>>>>>> It makes just the plain "perf record" without any arguments work,
>>>>>>> which was what I was testing because I was lazy.
>>>>
>>>>>>> So now
>>>>
>>>>>>> $ perf record sleep 1
>>>>
>>>>>>> works fine. But
>>>>
>>>>>>> $ perf record -e cycles:pp sleep 1
>>>>
>>>>>>> is still completely broken (with or without ":p" and ":pp").
>>>>
>>>>>> Seems to me that this patch fails to check if a PMU is a core-attached
>>>>>> PMU that can support common hardware events. Therefore, we should
>>>>>> consider adding the following check.
>>>>
>>>>>> +++ b/tools/perf/util/parse-events.c
>>>>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
>>>>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) {
>>>>>> bool auto_merge_stats;
>>>>>>
>>>>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
>>>>>> + continue;
>>>>>> +
>>>>>> if (parse_events__filter_pmu(parse_state, pmu))
>>>>>> continue;
>>>>
>>>>>> To be clear, I only compiled this change but I have no chance to test
>>>>>> it. @Ian, could you confirm this?
>>>>
>>>>> Hi Leo,
>>>>
>>>>> so the code is working as intended. I believe it also agrees with what
>>>>> Arnaldo thinks.
>>>>
>>>>> If you do:
>>>>
>>>>> $ perf stat -e cycles ...
>>>>
>>>>> and you have
>>>>
>>>>> /sys/devices/pmu1/events/cycles
>>>>> /sys/devices/pmu2/events/cycles
>>>>
>>>>> The output of perf stat should contain counts for pmu1 and pmu2. Were
>>>>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the
>>>>
>>>> Sure, what is being asked is to count events and if those two events in
>>>> those two PMUs can count, then do what the user asked.
>>>>
>>>> For 'perf record' we're asking for sampling, if the event has the name
>>>> specified and can't be sampled, skip it, warn the user and even so
>>>> only if verbose mode is asked, something like:
>>>>
>>>> root@x1:~# perf record -e cycles -a sleep 1
>>>> [ perf record: Woken up 1 times to write data ]
>>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
>>>> root@x1:~# perf evlist
>>>> cpu_atom/cycles/
>>>> cpu_core/cycles/
>>>> dummy:u
>>>> root@x1:~#
>>>>
>>>> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom',
>>>> another in a 'cpu_core' one, both can be sampled, my workload may
>>>> run/use resources on then, I'm interested, sample both.
>>>>
>>>> But if we had some other PMU, to use a name Jiri uses in tests/fake
>>>> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/'
>>>> and for some reason it doesn't support sampling, skip it, then the
>>>> result should be the same as above.
>>>>
>>>> If the user finds it strange after looking at sysfs that 'krava/cycles/'
>>>> isn't being sampled, the usual workflow is to ask perf for more
>>>> verbosity, using -v (or multiple 'v' letters to get increasing levels of
>>>> verbosity), in which case the user would see:
>>>>
>>>> root@x1:~# perf record -v -e cycles -a sleep 1
>>>> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling.
>>>> [ perf record: Woken up 1 times to write data ]
>>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
>>>> root@x1:~# perf evlist
>>
>> This makes sense to me. I like keeping the old apparent behavior unless
>> -v is used and it will feel like the tool "just works".
>>
>> In the context of the commit summary "perf parse-events: Prefer
>> sysfs/JSON hardware events over legacy":
>>
>> I don't follow why that should be "Prefer, even if it's an event that
>> can't be opened, sysfs/JSON...".
>>
>> Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then
>> use legacy". If all events can be opened, sure go and open them all. If
>> only core events can be opened, do that too. If only uncore events can
>> be opened... etc.
>

[...]

> So great, ignoring the revert, that fixed everything? Well no. The
> tool in places was hard coding 'struct perf_event_attr' which is of
> course broken were things to be hybrid or BIG.little. So the fix for
> that was to not hard code things. We need a set of 'struct
> perf_event_attr', ah I know a way to get that let's just use our event
> parsing logic. So a 'struct perf_event_attr' hard coding type to
> PERF_TYPE_HARDWARE, the config to PERF_COUNT_HW_CPU_CYCLES and also
> setting the precision to maximum was changed into parsing the string
> "cycles:P". Sounds good, no? Well no. Somebody decided to create an
> ARM event called cycles (Intel's name to avoid conflicts is
> clockticks) and now that event was getting added to our set. Although
> the patch sat for weeks (months?) on linux-next nobody had discovered
> a fairly obvious breakage.
>

We did see the test failure on our Ampere test machine 7 days ago, but
for some reason only on mainline (I was also on holiday at the same
time). I'm checking if that machine is running all the branches and will
make sure it does from now on.

We are running perf-tools-next on other machines and I try to look at
all the test failures. Just this one had a bit of an obscure combination
of needing the DSU PMU.

One thing we don't have in CI is any Apple M hardware. I can look into
it but I wouldn't have high hopes for anything soon.

[...]

> It isn't new behavior for perf to scan all PMUs, it always has, the
> new behavior is around legacy events. We want multiple PMU scanning
> for hybrid, we want all PMU scanning for uncore. The legacy changes
> happened because of the Apple M? PMU with me being complained at by
> folks at ARM who have now created this mess by their arm_dsu event
> name. Shouldn't it be a 1 liner fix to change "DSU_EVENT_ATTR(cycles,
> 0x11)," to "DSU_EVENT_ATTR(clockticks, 0x11),":
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_dsu_pmu.c#n177
> that's up to ARM but it would make sense to me.
>

Not sure about that one, that would break anyone's scripts or tools that
are looking at DSU cycles. And it wouldn't fix the issue in the future
if there were other reasons the event doesn't open (like non sampling
core events, or someone's brand new uncore PMU that also has a cycles
event).

It seems like we're converging one something that works though in the
other threads, but I'm still digesting the problems a bit.

>> Because the user could always use the defaults (no argument) or -e
>> cycles and historically Perf correctly picked the one that could be
>> opened. Or if they want the DSU one they could specify it. That can all
>> still work _and_ we can support "prefer sysfs/JSON" as long as we don't
>> prefer it when opening the event doesn't work.
>
> Hopefully this is all explained above.
>
> Thanks,
> Ian
>
>> Thanks
>> James
>>
>>> Thanks,
>>> Ian
>>>
>>>> - Arnaldo

2024-05-30 16:15:00

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Thu, May 30, 2024 at 8:37 AM James Clark <[email protected]> wrote:
>
>
>
> On 29/05/2024 18:33, Ian Rogers wrote:
> > On Wed, May 29, 2024 at 7:50 AM James Clark <[email protected]> wrote:
> >>
> >> On 28/05/2024 20:51, Ian Rogers wrote:
> >>> On Tue, May 28, 2024 at 12:44 PM Arnaldo Carvalho de Melo
> >>> <[email protected]> wrote:
> >>>>
> >>>> On Mon, May 27, 2024 at 10:36:45PM -0700, Ian Rogers wrote:
> >>>>> On Mon, May 27, 2024 at 3:58 AM Leo Yan <[email protected]> wrote:
> >>>>>> On Sat, May 25, 2024 at 02:14:26PM -0700, Linus Torvalds wrote:
> >>>>>>> On Sat, 25 May 2024 at 09:43, Linus Torvalds <[email protected]> wrote:
> >>>>
> >>>>>>>> This makes 'perf record' work for me again.
> >>>>
> >>>>>>> Oh, wait, no it doesn't.
> >>>>
> >>>>>>> It makes just the plain "perf record" without any arguments work,
> >>>>>>> which was what I was testing because I was lazy.
> >>>>
> >>>>>>> So now
> >>>>
> >>>>>>> $ perf record sleep 1
> >>>>
> >>>>>>> works fine. But
> >>>>
> >>>>>>> $ perf record -e cycles:pp sleep 1
> >>>>
> >>>>>>> is still completely broken (with or without ":p" and ":pp").
> >>>>
> >>>>>> Seems to me that this patch fails to check if a PMU is a core-attached
> >>>>>> PMU that can support common hardware events. Therefore, we should
> >>>>>> consider adding the following check.
> >>>>
> >>>>>> +++ b/tools/perf/util/parse-events.c
> >>>>>> @@ -1594,6 +1594,9 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state,
> >>>>>> while ((pmu = perf_pmus__scan(pmu)) != NULL) {
> >>>>>> bool auto_merge_stats;
> >>>>>>
> >>>>>> + if (hw_config != PERF_COUNT_HW_MAX && !pmu->is_core)
> >>>>>> + continue;
> >>>>>> +
> >>>>>> if (parse_events__filter_pmu(parse_state, pmu))
> >>>>>> continue;
> >>>>
> >>>>>> To be clear, I only compiled this change but I have no chance to test
> >>>>>> it. @Ian, could you confirm this?
> >>>>
> >>>>> Hi Leo,
> >>>>
> >>>>> so the code is working as intended. I believe it also agrees with what
> >>>>> Arnaldo thinks.
> >>>>
> >>>>> If you do:
> >>>>
> >>>>> $ perf stat -e cycles ...
> >>>>
> >>>>> and you have
> >>>>
> >>>>> /sys/devices/pmu1/events/cycles
> >>>>> /sys/devices/pmu2/events/cycles
> >>>>
> >>>>> The output of perf stat should contain counts for pmu1 and pmu2. Were
> >>>>> the event 'data_read' or 'inst_retired.any' we wouldn't be having the
> >>>>
> >>>> Sure, what is being asked is to count events and if those two events in
> >>>> those two PMUs can count, then do what the user asked.
> >>>>
> >>>> For 'perf record' we're asking for sampling, if the event has the name
> >>>> specified and can't be sampled, skip it, warn the user and even so
> >>>> only if verbose mode is asked, something like:
> >>>>
> >>>> root@x1:~# perf record -e cycles -a sleep 1
> >>>> [ perf record: Woken up 1 times to write data ]
> >>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
> >>>> root@x1:~# perf evlist
> >>>> cpu_atom/cycles/
> >>>> cpu_core/cycles/
> >>>> dummy:u
> >>>> root@x1:~#
> >>>>
> >>>> Cool, there are two 'cycles' events, one in a PMU named 'cpu_atom',
> >>>> another in a 'cpu_core' one, both can be sampled, my workload may
> >>>> run/use resources on then, I'm interested, sample both.
> >>>>
> >>>> But if we had some other PMU, to use a name Jiri uses in tests/fake
> >>>> PMUs, the 'krava' PMU and it has a 'cycles' event, so 'krava/cycles/'
> >>>> and for some reason it doesn't support sampling, skip it, then the
> >>>> result should be the same as above.
> >>>>
> >>>> If the user finds it strange after looking at sysfs that 'krava/cycles/'
> >>>> isn't being sampled, the usual workflow is to ask perf for more
> >>>> verbosity, using -v (or multiple 'v' letters to get increasing levels of
> >>>> verbosity), in which case the user would see:
> >>>>
> >>>> root@x1:~# perf record -v -e cycles -a sleep 1
> >>>> WARNING: skipping 'krava/cycles/' event, it doesn't support sampling.
> >>>> [ perf record: Woken up 1 times to write data ]
> >>>> [ perf record: Captured and wrote 1.998 MB perf.data (4472 samples) ]
> >>>> root@x1:~# perf evlist
> >>
> >> This makes sense to me. I like keeping the old apparent behavior unless
> >> -v is used and it will feel like the tool "just works".
> >>
> >> In the context of the commit summary "perf parse-events: Prefer
> >> sysfs/JSON hardware events over legacy":
> >>
> >> I don't follow why that should be "Prefer, even if it's an event that
> >> can't be opened, sysfs/JSON...".
> >>
> >> Surely it should be "Prefer sysfs/JSON, unless it can't be opened, then
> >> use legacy". If all events can be opened, sure go and open them all. If
> >> only core events can be opened, do that too. If only uncore events can
> >> be opened... etc.
> >
>
> [...]
>
> > So great, ignoring the revert, that fixed everything? Well no. The
> > tool in places was hard coding 'struct perf_event_attr' which is of
> > course broken were things to be hybrid or BIG.little. So the fix for
> > that was to not hard code things. We need a set of 'struct
> > perf_event_attr', ah I know a way to get that let's just use our event
> > parsing logic. So a 'struct perf_event_attr' hard coding type to
> > PERF_TYPE_HARDWARE, the config to PERF_COUNT_HW_CPU_CYCLES and also
> > setting the precision to maximum was changed into parsing the string
> > "cycles:P". Sounds good, no? Well no. Somebody decided to create an
> > ARM event called cycles (Intel's name to avoid conflicts is
> > clockticks) and now that event was getting added to our set. Although
> > the patch sat for weeks (months?) on linux-next nobody had discovered
> > a fairly obvious breakage.
> >
>
> We did see the test failure on our Ampere test machine 7 days ago, but
> for some reason only on mainline (I was also on holiday at the same
> time). I'm checking if that machine is running all the branches and will
> make sure it does from now on.
>
> We are running perf-tools-next on other machines and I try to look at
> all the test failures. Just this one had a bit of an obscure combination
> of needing the DSU PMU.

Thanks, I'm truly appreciative of greater testing and I appreciate it
doesn't happen for free. It also has ongoing costs. Thank you!

> One thing we don't have in CI is any Apple M hardware. I can look into
> it but I wouldn't have high hopes for anything soon.

As mentioned in these threads it is also knowingly broken - what the
reverted patch was trying to address. Perhaps one could be captured on
the way to e-waste if the screen,... don't work. We don't need much.
Apple M is the root cause of much special behavior in the perf tool
and the testing situation on it is sad. Maybe the Linux Foundation
could get one?

> [...]
>
> > It isn't new behavior for perf to scan all PMUs, it always has, the
> > new behavior is around legacy events. We want multiple PMU scanning
> > for hybrid, we want all PMU scanning for uncore. The legacy changes
> > happened because of the Apple M? PMU with me being complained at by
> > folks at ARM who have now created this mess by their arm_dsu event
> > name. Shouldn't it be a 1 liner fix to change "DSU_EVENT_ATTR(cycles,
> > 0x11)," to "DSU_EVENT_ATTR(clockticks, 0x11),":
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/perf/arm_dsu_pmu.c#n177
> > that's up to ARM but it would make sense to me.
> >
>
> Not sure about that one, that would break anyone's scripts or tools that
> are looking at DSU cycles. And it wouldn't fix the issue in the future
> if there were other reasons the event doesn't open (like non sampling
> core events, or someone's brand new uncore PMU that also has a cycles
> event).

Right, but should the resolution there be to specify which PMU you
want to resolve the ambiguity. The tool telling you a PMU doesn't
support sampling is signal. We'd expect the tool to fail if it didn't
support an event, is it really unreasonable to fail on a mode of an
event?

I'm further confused by the DSU driver. The names:

DSU_EVENT_ATTR(cycles, 0x11),
DSU_EVENT_ATTR(bus_access, 0x19),
DSU_EVENT_ATTR(memory_error, 0x1a),
DSU_EVENT_ATTR(bus_cycles, 0x1d),

The last 3 seem unambiguous, but cycles, couldn't it be read as also
possibly meaning bus_cycles? Wouldn't cpu_cycles be in keeping with
ARM's other names and an objectively better name? Getting that 1 liner
in v6.10 would resolve a lot of problems. I also think llc or l3 may
be more "intention revealing" names for the device than dsu, but hey I
don't want to start a naming war.

> It seems like we're converging one something that works though in the
> other threads, but I'm still digesting the problems a bit.

I think the major blocker is that some people, although I can only
name 1 and they've stopped listening, think event names should somehow
carry special meaning and in that special case it implies use only
core PMUs. The only known example of a special event is cycles, but if
you look for equivalent event names in perf's code there are things as
banal as branches and instructions, and as wild as
dTLB-speculative-read-misses. I'd really like not to carry around
notions of special event names but if we merge patches that ignore
that and then they get reverted, justified through hand written tests
looking to poke at things like the dsu PMU, I don't know where we
are. It has also been threatened that the perf code could be removed
from the kernel tree, so there's really no fun in poking a bear.

Thanks,
Ian

> >> Because the user could always use the defaults (no argument) or -e
> >> cycles and historically Perf correctly picked the one that could be
> >> opened. Or if they want the DSU one they could specify it. That can all
> >> still work _and_ we can support "prefer sysfs/JSON" as long as we don't
> >> prefer it when opening the event doesn't work.
> >
> > Hopefully this is all explained above.
> >
> > Thanks,
> > Ian
> >
> >> Thanks
> >> James
> >>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> - Arnaldo

2024-05-30 22:51:53

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote:
> On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
> >
> > On 30/05/2024 06:35, Namhyung Kim wrote:
> > > On Wed, May 29, 2024 at 12:25 PM Ian Rogers <[email protected]> wrote:
> > >> We can fix the arm_dsu bug by renaming cycles there. If that's too
> > >> hard to land, clearing up ambiguity by adding a PMU name has always
> > >> been the way to do this. My preference for v6.10 is revert the revert,
> > >> then add either a rename of the arm_dsu event and/or the change here.
> > >>
> > >> We can make perf record tolerant and ignore opening events on PMUs
> > >> that don't support sampling, but I think it is too big a thing to do
> > >> for v6.10.
> > >
> > > How about adding a flag to parse_event_option_args so that we
> > > can check if it's for sampling events. And then we might skip
> > > uncore PMUs easily (assuming arm_dsu PMU is uncore).
> >
> > It's uncore yes.
> >
> > Couldn't we theoretically have a core PMU that still doesn't support
> > sampling though? And then we'd end up in the same situation. Attempting
> > to open the event is the only sure way of knowing, rather than trying to
> > guess with some heuristic in userspace?
> >
> > Maybe a bit too hypothetical but still worth considering.

Then I think it's a real problem and perf should report it like we do
now.

> >
> > >
> > > It might not be a perfect solution but it could be a simple one.
> > > Ideally I think it'd be nice if the kernel exports more information
> > > about the PMUs like sampling and exclude capabilities.
> > > > Thanks,
> > > Namhyung
> >
> > That seems like a much better suggestion. Especially with the ever
> > expanding retry/fallback mechanism that can never really take into
> > account every combination of event attributes that can fail.
>
> I think this approach can work but we may break PMUs.
>
> Rather than use `is_core` on `struct pmu` we could have say a
> `supports_sampling` and we pass to parse_events an option to exclude
> any PMU that doesn't have that flag. Now obviously more than just core
> PMUs support sampling. All software PMUs, tracepoints, probes. We have
> an imprecise list of these in perf_pmu__is_software. So we can set
> supports_sampling for perf_pmu__is_software and is_core.

Yep, we can do that if the kernel provides the info. But before that
I think it's practical to skip uncore PMUs and hope other PMUs don't
have event aliases clashing with the legacy names. :)

>
> I think the problem comes for things like the AMD IBS PMUs, intel_bts
> and intel_pt. Often these only support sampling but aren't core. There
> may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
> make a list of all these PMU names then we can use that to set
> supports_sampling and not break event parsing for these PMUs.
>
> The name list sounds somewhat impractical, let's say we lazily compute
> the supports_sampling on a PMU. We need the sampling equivalent of
> is_event_supported:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
> is_event_supported has had bugs, look at the exclude_guest workaround
> for Apple PMUs. It also isn't clear to me how we choose the event
> config that we're going to probe to determine whether sampling works.
> The perf_event_open may reject the test because of a bad config and
> not because sampling isn't supported.
>
> So I think we can make the approach work if we had either:
> 1) a list of PMUs that support sampling,
> 2) a reliable "is_sampling_supported" test.
>
> I'm not sure of the advantages of doing (2) rather than just creating
> the set of evsels and ignoring those that fail to open. Ignoring
> evsels that fail to open seems more unlikely to break anything as the
> user is giving the events/config values for the PMUs they care about.

Yep, that's also possible. I'm ok if you want to go that direction.

Thanks,
Namhyung

2024-06-05 20:30:32

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <[email protected]> wrote:
>
> On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote:
> > On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
> > >
> > > On 30/05/2024 06:35, Namhyung Kim wrote:
> > > > It might not be a perfect solution but it could be a simple one.
> > > > Ideally I think it'd be nice if the kernel exports more information
> > > > about the PMUs like sampling and exclude capabilities.
> > > > > Thanks,
> > > > Namhyung
> > >
> > > That seems like a much better suggestion. Especially with the ever
> > > expanding retry/fallback mechanism that can never really take into
> > > account every combination of event attributes that can fail.
> >
> > I think this approach can work but we may break PMUs.
> >
> > Rather than use `is_core` on `struct pmu` we could have say a
> > `supports_sampling` and we pass to parse_events an option to exclude
> > any PMU that doesn't have that flag. Now obviously more than just core
> > PMUs support sampling. All software PMUs, tracepoints, probes. We have
> > an imprecise list of these in perf_pmu__is_software. So we can set
> > supports_sampling for perf_pmu__is_software and is_core.
>
> Yep, we can do that if the kernel provides the info. But before that
> I think it's practical to skip uncore PMUs and hope other PMUs don't
> have event aliases clashing with the legacy names. :)
>
> >
> > I think the problem comes for things like the AMD IBS PMUs, intel_bts
> > and intel_pt. Often these only support sampling but aren't core. There
> > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
> > make a list of all these PMU names then we can use that to set
> > supports_sampling and not break event parsing for these PMUs.
> >
> > The name list sounds somewhat impractical, let's say we lazily compute
> > the supports_sampling on a PMU. We need the sampling equivalent of
> > is_event_supported:
> > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
> > is_event_supported has had bugs, look at the exclude_guest workaround
> > for Apple PMUs. It also isn't clear to me how we choose the event
> > config that we're going to probe to determine whether sampling works.
> > The perf_event_open may reject the test because of a bad config and
> > not because sampling isn't supported.
> >
> > So I think we can make the approach work if we had either:
> > 1) a list of PMUs that support sampling,
> > 2) a reliable "is_sampling_supported" test.
> >
> > I'm not sure of the advantages of doing (2) rather than just creating
> > the set of evsels and ignoring those that fail to open. Ignoring
> > evsels that fail to open seems more unlikely to break anything as the
> > user is giving the events/config values for the PMUs they care about.
>
> Yep, that's also possible. I'm ok if you want to go that direction.

Hmm.. I thought about this again. But it can be a problem if we ignore
any failures as it can be a real error due to other reason - e.g. not
supported configuration or other user mistakes.

Thanks,
Namhyung

2024-06-05 23:02:28

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <[email protected]> wrote:
>
> On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <[email protected]> wrote:
> >
> > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote:
> > > On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
> > > >
> > > > On 30/05/2024 06:35, Namhyung Kim wrote:
> > > > > It might not be a perfect solution but it could be a simple one.
> > > > > Ideally I think it'd be nice if the kernel exports more information
> > > > > about the PMUs like sampling and exclude capabilities.
> > > > > > Thanks,
> > > > > Namhyung
> > > >
> > > > That seems like a much better suggestion. Especially with the ever
> > > > expanding retry/fallback mechanism that can never really take into
> > > > account every combination of event attributes that can fail.
> > >
> > > I think this approach can work but we may break PMUs.
> > >
> > > Rather than use `is_core` on `struct pmu` we could have say a
> > > `supports_sampling` and we pass to parse_events an option to exclude
> > > any PMU that doesn't have that flag. Now obviously more than just core
> > > PMUs support sampling. All software PMUs, tracepoints, probes. We have
> > > an imprecise list of these in perf_pmu__is_software. So we can set
> > > supports_sampling for perf_pmu__is_software and is_core.
> >
> > Yep, we can do that if the kernel provides the info. But before that
> > I think it's practical to skip uncore PMUs and hope other PMUs don't
> > have event aliases clashing with the legacy names. :)
> >
> > >
> > > I think the problem comes for things like the AMD IBS PMUs, intel_bts
> > > and intel_pt. Often these only support sampling but aren't core. There
> > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
> > > make a list of all these PMU names then we can use that to set
> > > supports_sampling and not break event parsing for these PMUs.
> > >
> > > The name list sounds somewhat impractical, let's say we lazily compute
> > > the supports_sampling on a PMU. We need the sampling equivalent of
> > > is_event_supported:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
> > > is_event_supported has had bugs, look at the exclude_guest workaround
> > > for Apple PMUs. It also isn't clear to me how we choose the event
> > > config that we're going to probe to determine whether sampling works.
> > > The perf_event_open may reject the test because of a bad config and
> > > not because sampling isn't supported.
> > >
> > > So I think we can make the approach work if we had either:
> > > 1) a list of PMUs that support sampling,
> > > 2) a reliable "is_sampling_supported" test.
> > >
> > > I'm not sure of the advantages of doing (2) rather than just creating
> > > the set of evsels and ignoring those that fail to open. Ignoring
> > > evsels that fail to open seems more unlikely to break anything as the
> > > user is giving the events/config values for the PMUs they care about.
> >
> > Yep, that's also possible. I'm ok if you want to go that direction.
>
> Hmm.. I thought about this again. But it can be a problem if we ignore
> any failures as it can be a real error due to other reason - e.g. not
> supported configuration or other user mistakes.

Right, we have two not good choices:

1) Try to detect whether sampling is supported, but any test doing
this needs to guess at a configuration and we'll need to deflake this
on off platforms like those that don't allow things like exclude
guest.
2) Ignore failures, possibly hiding user errors.

I would prefer for (2) the errors were pr_err rather than pr_debug,
something the user can clean up by getting rid of warned about PMUs.
This will avoid hiding the error, but then on Neoverse cycles will
warn about the arm_dsu PMU's cycles event for exactly Linus' test
case. My understanding is that this is deemed a regression, hence
Arnaldo proposing pr_debug to hide it.

Thanks,
Ian

2024-06-06 07:09:57

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <[email protected]> wrote:
> >
> > On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <[email protected]> wrote:
> > >
> > > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote:
> > > > On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
> > > > >
> > > > > On 30/05/2024 06:35, Namhyung Kim wrote:
> > > > > > It might not be a perfect solution but it could be a simple one.
> > > > > > Ideally I think it'd be nice if the kernel exports more information
> > > > > > about the PMUs like sampling and exclude capabilities.
> > > > > > > Thanks,
> > > > > > Namhyung
> > > > >
> > > > > That seems like a much better suggestion. Especially with the ever
> > > > > expanding retry/fallback mechanism that can never really take into
> > > > > account every combination of event attributes that can fail.
> > > >
> > > > I think this approach can work but we may break PMUs.
> > > >
> > > > Rather than use `is_core` on `struct pmu` we could have say a
> > > > `supports_sampling` and we pass to parse_events an option to exclude
> > > > any PMU that doesn't have that flag. Now obviously more than just core
> > > > PMUs support sampling. All software PMUs, tracepoints, probes. We have
> > > > an imprecise list of these in perf_pmu__is_software. So we can set
> > > > supports_sampling for perf_pmu__is_software and is_core.
> > >
> > > Yep, we can do that if the kernel provides the info. But before that
> > > I think it's practical to skip uncore PMUs and hope other PMUs don't
> > > have event aliases clashing with the legacy names. :)
> > >
> > > >
> > > > I think the problem comes for things like the AMD IBS PMUs, intel_bts
> > > > and intel_pt. Often these only support sampling but aren't core. There
> > > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
> > > > make a list of all these PMU names then we can use that to set
> > > > supports_sampling and not break event parsing for these PMUs.
> > > >
> > > > The name list sounds somewhat impractical, let's say we lazily compute
> > > > the supports_sampling on a PMU. We need the sampling equivalent of
> > > > is_event_supported:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
> > > > is_event_supported has had bugs, look at the exclude_guest workaround
> > > > for Apple PMUs. It also isn't clear to me how we choose the event
> > > > config that we're going to probe to determine whether sampling works.
> > > > The perf_event_open may reject the test because of a bad config and
> > > > not because sampling isn't supported.
> > > >
> > > > So I think we can make the approach work if we had either:
> > > > 1) a list of PMUs that support sampling,
> > > > 2) a reliable "is_sampling_supported" test.
> > > >
> > > > I'm not sure of the advantages of doing (2) rather than just creating
> > > > the set of evsels and ignoring those that fail to open. Ignoring
> > > > evsels that fail to open seems more unlikely to break anything as the
> > > > user is giving the events/config values for the PMUs they care about.
> > >
> > > Yep, that's also possible. I'm ok if you want to go that direction.
> >
> > Hmm.. I thought about this again. But it can be a problem if we ignore
> > any failures as it can be a real error due to other reason - e.g. not
> > supported configuration or other user mistakes.
>
> Right, we have two not good choices:
>
> 1) Try to detect whether sampling is supported, but any test doing
> this needs to guess at a configuration and we'll need to deflake this
> on off platforms like those that don't allow things like exclude
> guest.

I believe we don't need to try so hard to detect if sampling is
supported or not. I hope we will eventually add that to the
kernel. Also this is just an additional defense line, it should
work without it in most cases. It'll just protect from a few edge
cases like uncore PMUs having events of legacy name. For
other events or PMUs, I think it's ok to fail.

> 2) Ignore failures, possibly hiding user errors.
>
> I would prefer for (2) the errors were pr_err rather than pr_debug,
> something the user can clean up by getting rid of warned about PMUs.
> This will avoid hiding the error, but then on Neoverse cycles will
> warn about the arm_dsu PMU's cycles event for exactly Linus' test
> case. My understanding is that this is deemed a regression, hence
> Arnaldo proposing pr_debug to hide it.

Right, if we use pr_err() then users will complain. If we use
pr_debug() then errors will be hidden silently.

Thanks,
Namhyung

2024-06-06 09:43:36

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On 06/06/2024 08:09, Namhyung Kim wrote:
> On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <[email protected]> wrote:
>>
>> On Wed, Jun 5, 2024 at 1:29 PM Namhyung Kim <[email protected]> wrote:
>>>
>>> On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <[email protected]> wrote:
>>>>
>>>> On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote:
>>>>> On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
>>>>>>
>>>>>> On 30/05/2024 06:35, Namhyung Kim wrote:
>>>>>>> It might not be a perfect solution but it could be a simple one.
>>>>>>> Ideally I think it'd be nice if the kernel exports more information
>>>>>>> about the PMUs like sampling and exclude capabilities.
>>>>>>>> Thanks,
>>>>>>> Namhyung
>>>>>>
>>>>>> That seems like a much better suggestion. Especially with the ever
>>>>>> expanding retry/fallback mechanism that can never really take into
>>>>>> account every combination of event attributes that can fail.
>>>>>
>>>>> I think this approach can work but we may break PMUs.
>>>>>
>>>>> Rather than use `is_core` on `struct pmu` we could have say a
>>>>> `supports_sampling` and we pass to parse_events an option to exclude
>>>>> any PMU that doesn't have that flag. Now obviously more than just core
>>>>> PMUs support sampling. All software PMUs, tracepoints, probes. We have
>>>>> an imprecise list of these in perf_pmu__is_software. So we can set
>>>>> supports_sampling for perf_pmu__is_software and is_core.
>>>>
>>>> Yep, we can do that if the kernel provides the info. But before that
>>>> I think it's practical to skip uncore PMUs and hope other PMUs don't
>>>> have event aliases clashing with the legacy names. :)
>>>>
>>>>>
>>>>> I think the problem comes for things like the AMD IBS PMUs, intel_bts
>>>>> and intel_pt. Often these only support sampling but aren't core. There
>>>>> may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
>>>>> make a list of all these PMU names then we can use that to set
>>>>> supports_sampling and not break event parsing for these PMUs.
>>>>>
>>>>> The name list sounds somewhat impractical, let's say we lazily compute
>>>>> the supports_sampling on a PMU. We need the sampling equivalent of
>>>>> is_event_supported:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
>>>>> is_event_supported has had bugs, look at the exclude_guest workaround
>>>>> for Apple PMUs. It also isn't clear to me how we choose the event
>>>>> config that we're going to probe to determine whether sampling works.
>>>>> The perf_event_open may reject the test because of a bad config and
>>>>> not because sampling isn't supported.
>>>>>
>>>>> So I think we can make the approach work if we had either:
>>>>> 1) a list of PMUs that support sampling,
>>>>> 2) a reliable "is_sampling_supported" test.
>>>>>
>>>>> I'm not sure of the advantages of doing (2) rather than just creating
>>>>> the set of evsels and ignoring those that fail to open. Ignoring
>>>>> evsels that fail to open seems more unlikely to break anything as the
>>>>> user is giving the events/config values for the PMUs they care about.
>>>>
>>>> Yep, that's also possible. I'm ok if you want to go that direction.
>>>
>>> Hmm.. I thought about this again. But it can be a problem if we ignore
>>> any failures as it can be a real error due to other reason - e.g. not
>>> supported configuration or other user mistakes.
>>
>> Right, we have two not good choices:
>>
>> 1) Try to detect whether sampling is supported, but any test doing
>> this needs to guess at a configuration and we'll need to deflake this
>> on off platforms like those that don't allow things like exclude
>> guest.
>
> I believe we don't need to try so hard to detect if sampling is
> supported or not. I hope we will eventually add that to the
> kernel. Also this is just an additional defense line, it should
> work without it in most cases. It'll just protect from a few edge
> cases like uncore PMUs having events of legacy name. For
> other events or PMUs, I think it's ok to fail.
>
>
>> 2) Ignore failures, possibly hiding user errors.
>>
>> I would prefer for (2) the errors were pr_err rather than pr_debug,
>> something the user can clean up by getting rid of warned about PMUs.
>> This will avoid hiding the error, but then on Neoverse cycles will
>> warn about the arm_dsu PMU's cycles event for exactly Linus' test
>> case. My understanding is that this is deemed a regression, hence
>> Arnaldo proposing pr_debug to hide it.
>
> Right, if we use pr_err() then users will complain. If we use
> pr_debug() then errors will be hidden silently.
>
> Thanks,
> Namhyung

I'm not sure if anyone would really complain about warnings for
attempting to open but not succeeding, as long as the event that they
really wanted is there. I'm imagining output like this:

$ perf record -e cycles -- ls

Warning: skipped arm_dsu/cycles/ event(s), recording on
armv8_pmuv3_0/cycles/, armv8_pmuv3_1/cycles/

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (30 samples) ]

You only really need to worry when no events can be opened, but
presumably that was a warning anyway.

And in stat mode I wouldn't expect any warnings.

2024-06-06 13:48:49

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Wed, Jun 05, 2024 at 01:29:02PM -0700, Namhyung Kim wrote:
> On Thu, May 30, 2024 at 3:52 PM Namhyung Kim <[email protected]> wrote:
> > On Thu, May 30, 2024 at 06:46:08AM -0700, Ian Rogers wrote:
> > > On Thu, May 30, 2024 at 5:48 AM James Clark <[email protected]> wrote:
> > > > On 30/05/2024 06:35, Namhyung Kim wrote:
> > > > > It might not be a perfect solution but it could be a simple one.
> > > > > Ideally I think it'd be nice if the kernel exports more information
> > > > > about the PMUs like sampling and exclude capabilities.

> > > > That seems like a much better suggestion. Especially with the ever
> > > > expanding retry/fallback mechanism that can never really take into
> > > > account every combination of event attributes that can fail.

> > > I think this approach can work but we may break PMUs.

> > > Rather than use `is_core` on `struct pmu` we could have say a
> > > `supports_sampling` and we pass to parse_events an option to exclude
> > > any PMU that doesn't have that flag. Now obviously more than just core
> > > PMUs support sampling. All software PMUs, tracepoints, probes. We have
> > > an imprecise list of these in perf_pmu__is_software. So we can set
> > > supports_sampling for perf_pmu__is_software and is_core.

> > Yep, we can do that if the kernel provides the info. But before that
> > I think it's practical to skip uncore PMUs and hope other PMUs don't
> > have event aliases clashing with the legacy names. :)

> > > I think the problem comes for things like the AMD IBS PMUs, intel_bts
> > > and intel_pt. Often these only support sampling but aren't core. There
> > > may be IBM S390 PMUs or other vendor PMUs that are similar. If we can
> > > make a list of all these PMU names then we can use that to set
> > > supports_sampling and not break event parsing for these PMUs.

> > > The name list sounds somewhat impractical, let's say we lazily compute
> > > the supports_sampling on a PMU. We need the sampling equivalent of
> > > is_event_supported:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/print-events.c?h=perf-tools-next#n242
> > > is_event_supported has had bugs, look at the exclude_guest workaround
> > > for Apple PMUs. It also isn't clear to me how we choose the event
> > > config that we're going to probe to determine whether sampling works.
> > > The perf_event_open may reject the test because of a bad config and
> > > not because sampling isn't supported.

> > > So I think we can make the approach work if we had either:
> > > 1) a list of PMUs that support sampling,
> > > 2) a reliable "is_sampling_supported" test.

> > > I'm not sure of the advantages of doing (2) rather than just creating
> > > the set of evsels and ignoring those that fail to open. Ignoring
> > > evsels that fail to open seems more unlikely to break anything as the
> > > user is giving the events/config values for the PMUs they care about.

> > Yep, that's also possible. I'm ok if you want to go that direction.

> Hmm.. I thought about this again. But it can be a problem if we ignore
> any failures as it can be a real error due to other reason - e.g. not
> supported configuration or other user mistakes.

Well, we should not ignore any failures, just look at
evsel__open_strerror(), we get some error from the kernel, we know we're
doing something, we put it into context, and then we try to provide the
most helpful message to the user.

The question here is how to figure out if what we want to do (sample)
makes sense for this event that has the name we used, since we're using
a "wildcard", a "well known event name" that should be translated to
events in all PMUs where it "matches", if those actual per-PMU events
can provide what the user is asking.

My suggestion about using pr_debug() is a compromise, one that will not
show warnings all the time for such a common case (cycles) while
allowing the use to follow a common, in tools/perf, way of diagnosing
when things don't look sane, which is to ask for verbose mode.

The best thing would be for us to be able to query, using existing
facilities, if what we want is possible, i.e. interpret the error coming
from the kernel in the context of what we are asking.

If this is not possible because the error is too generic (EINVAL) and
can map to multiple other things, then we can try to have a new kernel
facility for us to get this info in a authoritative way and use a
pr_warning() knowing that that warning will go away as the kernel is
updated.

Or we could use a big hammer, special handling some PMUs/events we know
can't sample and avoiding those when they fail, ugly, but should work,
and this is just for this "well known event names" people have been
using since forever. Those wanting specific events to be used should
know better and specify it precisely in the command line.

Does this sum up all the discussion so far?

Cheers,

- Arnaldo

2024-06-06 13:52:30

[permalink] [raw]

Subject: Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

On Thu, Jun 06, 2024 at 10:42:33AM +0100, James Clark wrote:
> On 06/06/2024 08:09, Namhyung Kim wrote:
> > On Wed, Jun 5, 2024 at 4:02 PM Ian Rogers <[email protected]> wrote:
> >> 2) Ignore failures, possibly hiding user errors.

> >> I would prefer for (2) the errors were pr_err rather than pr_debug,
> >> something the user can clean up by getting rid of warned about PMUs.
> >> This will avoid hiding the error, but then on Neoverse cycles will
> >> warn about the arm_dsu PMU's cycles event for exactly Linus' test
> >> case. My understanding is that this is deemed a regression, hence
> >> Arnaldo proposing pr_debug to hide it.

> > Right, if we use pr_err() then users will complain. If we use
> > pr_debug() then errors will be hidden silently.

> I'm not sure if anyone would really complain about warnings for
> attempting to open but not succeeding, as long as the event that they
> really wanted is there. I'm imagining output like this:

> $ perf record -e cycles -- ls

> Warning: skipped arm_dsu/cycles/ event(s), recording on
> armv8_pmuv3_0/cycles/, armv8_pmuv3_1/cycles/

> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.008 MB perf.data (30 samples) ]

> You only really need to worry when no events can be opened, but
> presumably that was a warning anyway.

Agreed, while we don't find a way, old or new to autoritatively skip the
event, when that pr_warning() gets turned into a pr_debug() so that
people expecting that those skipped events were included get a message
telling why they were not.

> And in stat mode I wouldn't expect any warnings.

Right.

- Arnaldo

2024-06-07 06:10:29