2023-02-23 00:29:36

by Atish Patra

Subject: Perf event to counter mapping question

Hi All,
We are trying to figure out what is the best approach to define the
perf event to programmable counter mappings in RISC-V.
Until recently, all the programmable counter/event selector registers
were writable in M-mode (highest privilege mode) only. The firmware
residing in M-mode discovers the mapping from the device tree[1] and
the perf driver relies on the SBI PMU[2] interface to discover the
mapping between events and counters.

There are new ISA extensions being proposed to make the counter/event
selector registers writable in supervisor mode as well. Thus, a PMU
driver can directly program the event selectors without relying on
firmware.
However, the kernel needs to be aware of counter mapping to do that.

AFAIK, ARM64 allows all-to-all mapping in pmuv3[3]. That makes life
much easier. It just needs to pick the next available counter.
On the other hand, x86 allows selective counter mapping which is
discovered from the json file and maintained in per event
constraints[4].
There may be some legacy reasons why it was done in x86 this way[5].
Please correct me if I am wrong in my understanding/assumption.
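
To make the two models above concrete, here is a minimal sketch in
Python (not kernel code). It assumes a hypothetical 4-counter PMU and
represents counters as bits in a mask. The function name and masks are
illustrative only. With all-to-all mapping the allowed mask is simply
"every counter"; with per-event constraints each event carries its own
allowed mask.

```python
def alloc_counter(used_mask, allowed_mask):
    """Return the lowest free counter index permitted by allowed_mask,
    or None if every permitted counter is already in use."""
    free = allowed_mask & ~used_mask
    if free == 0:
        return None
    # free & -free isolates the lowest set bit.
    return (free & -free).bit_length() - 1


NUM_COUNTERS = 4                          # hypothetical PMU size
ALL_COUNTERS = (1 << NUM_COUNTERS) - 1    # all-to-all: any counter will do

used = 0b0011                             # counters 0 and 1 are busy

# All-to-all: just take the next available counter.
unconstrained = alloc_counter(used, ALL_COUNTERS)

# Constrained: this event may only use counters 0 or 1, both busy.
constrained = alloc_counter(used, 0b0011)
```

In the unconstrained case the caller never needs per-event data; in the
constrained case the allocation can fail even though free counters
exist.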

Here are a few approaches that can be used to solve it in RISC-V.

1. Continue to use device tree bindings
Cons: We have to define similar entries for ACPI. It makes
virtualization difficult as the VMM has to discover and update the
device tree/ACPI as well.

2. Mandate all-to-all mapping similar to ARM64.
Note: This is only for programmable counters. If the platform supports
any fixed counters (i.e. counters that can monitor only a specific
event), those need to be provisioned via some other method. IIRC the
fixed counters (apart from cycle) in ARM64 are part of AMU, not PMU.

3. All platforms need to define which subset of events can be
monitored using a subset of counters. The platform specific perf json
file can specify that.
This approach provides more flexibility but makes the code path a bit
more complex as the counter mask constraint needs to be maintained on
a per-event basis.
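
As a rough illustration of approach 3, the sketch below (Python, not
kernel code) turns a JSON event list into per-event counter masks. The
"Counter" field name, the event names and the event codes are
assumptions made up for this example; no RISC-V JSON schema for
counter constraints is defined in this thread.

```python
import json


def counter_mask(entry):
    """Build a bitmask from a comma-separated counter list.
    Returns None when the entry carries no constraint."""
    field = entry.get("Counter")      # hypothetical field name
    if field is None:
        return None
    mask = 0
    for idx in field.split(","):
        mask |= 1 << int(idx)
    return mask


# Hypothetical platform-specific event file contents.
EVENTS_JSON = """
[
  {"EventName": "EXAMPLE_EVENT_A", "EventCode": "0x01", "Counter": "3,4"},
  {"EventName": "EXAMPLE_EVENT_B", "EventCode": "0x02"}
]
"""

constraints = {
    e["EventName"]: counter_mask(e) for e in json.loads(EVENTS_JSON)
}
```

The extra state the driver has to carry is the `constraints` table:
one mask per event, consulted on every counter allocation.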

4. Any other approach ?

Any thoughts on what would be the best approach for RISC-V? It would
be great to avoid repeating any past mistakes in RISC-V by learning
from the community's experience.

[1] https://lore.kernel.org/lkml/Y6tS959TaY2EBAdn@spud/T/
[2] https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/riscv-sbi.adoc#function-find-and-configure-a-matching-counter-fid-2
[3] https://elixir.bootlin.com/linux/v6.2/source/arch/arm64/kernel/perf_event.c#L899
[4] https://elixir.bootlin.com/linux/latest/source/arch/x86/events/core.c#L876
[5] https://www.mail-archive.com/[email protected]/msg1978937.html
--
Regards,
Atish


2023-02-23 02:55:50

by Anup Patel

Subject: Re: Perf event to counter mapping question

On Thu, Feb 23, 2023 at 5:58 AM Atish Patra <[email protected]> wrote:
>
> Hi All,
> We are trying to figure out what is the best approach to define the
> perf event to programmable counter mappings in RISC-V.
> Until recently, all the programmable counter/event selector registers
> were writable in M-mode (highest privilege mode) only. The firmware
> residing in M-mode
> would discover the mapping from device tree[1] and the perf driver
> relies on SBI PMU[2] interface to discover the mapping between event &
> counters.
>
> There are new ISA extensions being proposed to make counters /event
> selector register in supervisor mode as well. Thus, a pmu driver
> can directly program the event selectors without relying on firmware.
> However, the kernel needs to be aware of counter mapping to do that.
>
> AFAIK, ARM64 allows all-to-all mapping in pmuv3[1]. That makes life
> much easier. It just needs to pick the next available counter.
> On the other hand, x86 allows selective counter mapping which is
> discovered from the json file and maintained in per event
> constraints[4].
> There may be some legacy reasons why it was done in x86 this way[5].
> Please correct me if I am wrong in my understanding/assumption.
>
> Here are a few approaches that can be used to solve it in RISC-V.
>
> 1. Continue to use device tree bindings
> Cons: We have to define similar entries for ACPI. It makes
> virtualization difficult as the VMM has to discover and update the
> device tree/ACPI as well.
>
> 2. Mandate all-to-all mapping similar to ARM64.
> Note: This is only for programmable counters. If the platform supports
> any fixed counters (i.e. can monitor
> only a specific event), that needs to be provisioned via some other
> method. IIRC the fixed counters(apart from cycle) in ARM64 are part of
> AMU not PMU.
>
> 3. All platforms need to define which subset of events can be
> monitored using a subset of counters. The platform specific perf json
> file can specify that.
> This approach provides more flexibility but makes the code path a bit
> more complex as the counter mask constraint needs to be maintained per
> event basis.
>
> 4. Any other approach ?

I suggest a 4th approach where by default the kernel assumes all-to-all
mappings and optionally perf json file can be used to override mappings
for certain counters. This approach is more like a hybrid approach between
approach #2 and #3. It works fine with KVM RISC-V as well because Guest/VM
will assume all-to-all mapping for logical HW counters whereas Host can have
specific counter mappings.

>
> Any thoughts on what would be the best approach for RISC-V. It would
> be great to repeat any past mistakes in RISC-V by learning from
> experience from the community.
>
> [1] https://lore.kernel.org/lkml/Y6tS959TaY2EBAdn@spud/T/
> [2] https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/riscv-sbi.adoc#function-find-and-configure-a-matching-counter-fid-2
> [3] https://elixir.bootlin.com/linux/v6.2/source/arch/arm64/kernel/perf_event.c#L899
> [4] https://elixir.bootlin.com/linux/latest/source/arch/x86/events/core.c#L876
> [5] https://www.mail-archive.com/[email protected]/msg1978937.html
> --
> Regards,
> Atish

Regards,
Anup

2023-02-23 08:27:20

by Peter Zijlstra

Subject: Re: Perf event to counter mapping question

On Wed, Feb 22, 2023 at 04:28:36PM -0800, Atish Patra wrote:

> AFAIK, ARM64 allows all-to-all mapping in pmuv3[1]. That makes life
> much easier. It just needs to pick the next available counter.
> On the other hand, x86 allows selective counter mapping which is
> discovered from the json file and maintained in per event
> constraints[4].

All the constraint management is done in kernel, and yes, it's a giant
pain in the rear side.

From what I understand the reason for these constraints is complexity of
implementation, fewer constraints means more 'wires' in the hardware.

With PMU use being ever more popular, we're seeing the x86 PMU move
towards fewer constraints -- although I don't think we'll ever get rid of
them :/

> 2. Mandate all-to-all mapping similar to ARM64.

If at all possible, I would strongly recommend taking this route. Yes,
the hardware people will complain, but newer x86 hardware having fewer,
or simpler, constraints might be sufficient to convince them.

(and if you do have to do constraints, please take a lesson from x86 and
*never* allow overlapping constraints as AMD had, solving those
constraints is not fun)
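
A small sketch of why overlapping constraints are painful, written in
Python and not taken from the kernel scheduler. The masks are
illustrative: one event may use counters 0 or 3, three others may use
counters 0-2. A simple one-pass allocator hands counter 0 to the first
event and then runs out of room, although a valid assignment exists;
finding it requires undoing an earlier choice.

```python
def greedy(masks):
    """Give each event the lowest free allowed counter, never revisiting
    an earlier choice. Returns a list of indices or None on failure."""
    used, out = 0, []
    for m in masks:
        free = m & ~used
        if free == 0:
            return None
        bit = free & -free            # lowest free allowed counter
        used |= bit
        out.append(bit.bit_length() - 1)
    return out


def backtrack(masks, used=0):
    """Try every allowed counter per event, undoing dead-end choices.
    Worst case is exponential in the number of events."""
    if not masks:
        return []
    free = masks[0] & ~used
    while free:
        bit = free & -free
        rest = backtrack(masks[1:], used | bit)
        if rest is not None:
            return [bit.bit_length() - 1] + rest
        free &= ~bit                  # that choice failed, try the next
    return None


# Illustrative masks: {0,3} overlaps {0,1,2} without being a subset.
EVENTS = [0b1001, 0b0111, 0b0111, 0b0111]
```

Here `greedy(EVENTS)` fails while `backtrack(EVENTS)` succeeds by
moving the first event to counter 3. Note that the most constrained
event is already first, so sorting by number of allowed counters does
not rescue the one-pass approach.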

As you note, this is *much* simpler to program and virtualize.

> Note: This is only for programmable counters. If the platform supports
> any fixed counters (i.e. can monitor
> only a specific event), that needs to be provisioned via some other
> method. IIRC the fixed counters(apart from cycle) in ARM64 are part of
> AMU not PMU.

So free running counters are ideal and fairly simple to multiplex/use.

The moment you start adding overflow interrupts / filters and any other
complexities to fixed function counters it becomes a mess (look at the
x86 PMU again).

2023-02-23 18:47:26

by Beeman Strong

Subject: Re: Perf event to counter mapping question

Trying again:
Hi Peter, thanks for the feedback. Can you say more about AMD's
overlapping constraints?


On Thu, Feb 23, 2023 at 12:27 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Feb 22, 2023 at 04:28:36PM -0800, Atish Patra wrote:
>
> > AFAIK, ARM64 allows all-to-all mapping in pmuv3[1]. That makes life
> > much easier. It just needs to pick the next available counter.
> > On the other hand, x86 allows selective counter mapping which is
> > discovered from the json file and maintained in per event
> > constraints[4].
>
> All the contraint management is done in kernel, and yes, it's a giant
> pain in the rear side.
>
> From what I understand the reason for these contraints is complexity of
> implementation, less constraints is more 'wires' in the hardware.
>
> With PMU use being ever more popular, we're seeing the x86 PMU move
> towards less constraints -- although I don't think we'll ever get rid of
> them :/
>
> > 2. Mandate all-to-all mapping similar to ARM64.
>
> If at all possible, I would strongly recommend taking this route. Yes,
> the hardware people will complain, but newer x86 hardware having less,
> or simpler, constraints might be sufficient to convince them.
>
> (and if you do have to do contraints, please take a lesson from x86 and
> *never* allow overlapping contraints as AMD had, solving those
> constraints is not fun)
>
> As you note, this is *much* simpler to program and virtualize.
>
> > Note: This is only for programmable counters. If the platform supports
> > any fixed counters (i.e. can monitor
> > only a specific event), that needs to be provisioned via some other
> > method. IIRC the fixed counters(apart from cycle) in ARM64 are part of
> > AMU not PMU.
>
> So free running counters are ideal and fairly simple to multiplex/use.
>
> The moment you start adding overflow interrupts / filters and any other
> complexities to fixed function counters it becomes a mess (look at the
> x86 PMU again).

2023-02-24 02:33:12

by Atish Patra

Subject: Re: Perf event to counter mapping question

On Thu, Feb 23, 2023 at 12:27 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Feb 22, 2023 at 04:28:36PM -0800, Atish Patra wrote:
>
> > AFAIK, ARM64 allows all-to-all mapping in pmuv3[1]. That makes life
> > much easier. It just needs to pick the next available counter.
> > On the other hand, x86 allows selective counter mapping which is
> > discovered from the json file and maintained in per event
> > constraints[4].
>
> All the contraint management is done in kernel, and yes, it's a giant
> pain in the rear side.
>
> From what I understand the reason for these contraints is complexity of
> implementation, less constraints is more 'wires' in the hardware.
>
> With PMU use being ever more popular, we're seeing the x86 PMU move
> towards less constraints -- although I don't think we'll ever get rid of
> them :/
>
> > 2. Mandate all-to-all mapping similar to ARM64.
>
> If at all possible, I would strongly recommend taking this route. Yes,
> the hardware people will complain, but newer x86 hardware having less,
> or simpler, constraints might be sufficient to convince them.
>

Yeah. That's where folks want to go in order to provide flexibility
for future platform vendors by allowing constraints.

Can you provide some examples or some pointers that describe these
simpler constraints ?

Finding a middle path would certainly keep everyone happy :). Thanks a
lot for your input.

> (and if you do have to do contraints, please take a lesson from x86 and
> *never* allow overlapping contraints as AMD had, solving those
> constraints is not fun)
>
> As you note, this is *much* simpler to program and virtualize.
>
> > Note: This is only for programmable counters. If the platform supports
> > any fixed counters (i.e. can monitor
> > only a specific event), that needs to be provisioned via some other
> > method. IIRC the fixed counters(apart from cycle) in ARM64 are part of
> > AMU not PMU.
>
> So free running counters are ideal and fairly simple to multiplex/use.
>
> The moment you start adding overflow interrupts / filters and any other
> complexities to fixed function counters it becomes a mess (look at the
> x86 PMU again).



--
Regards,
Atish

2023-02-24 02:38:39

by Atish Patra

Subject: Re: Perf event to counter mapping question

On Wed, Feb 22, 2023 at 6:55 PM Anup Patel <[email protected]> wrote:
>
> On Thu, Feb 23, 2023 at 5:58 AM Atish Patra <[email protected]> wrote:
> >
> > Hi All,
> > We are trying to figure out what is the best approach to define the
> > perf event to programmable counter mappings in RISC-V.
> > Until recently, all the programmable counter/event selector registers
> > were writable in M-mode (highest privilege mode) only. The firmware
> > residing in M-mode
> > would discover the mapping from device tree[1] and the perf driver
> > relies on SBI PMU[2] interface to discover the mapping between event &
> > counters.
> >
> > There are new ISA extensions being proposed to make counters /event
> > selector register in supervisor mode as well. Thus, a pmu driver
> > can directly program the event selectors without relying on firmware.
> > However, the kernel needs to be aware of counter mapping to do that.
> >
> > AFAIK, ARM64 allows all-to-all mapping in pmuv3[1]. That makes life
> > much easier. It just needs to pick the next available counter.
> > On the other hand, x86 allows selective counter mapping which is
> > discovered from the json file and maintained in per event
> > constraints[4].
> > There may be some legacy reasons why it was done in x86 this way[5].
> > Please correct me if I am wrong in my understanding/assumption.
> >
> > Here are a few approaches that can be used to solve it in RISC-V.
> >
> > 1. Continue to use device tree bindings
> > Cons: We have to define similar entries for ACPI. It makes
> > virtualization difficult as the VMM has to discover and update the
> > device tree/ACPI as well.
> >
> > 2. Mandate all-to-all mapping similar to ARM64.
> > Note: This is only for programmable counters. If the platform supports
> > any fixed counters (i.e. can monitor
> > only a specific event), that needs to be provisioned via some other
> > method. IIRC the fixed counters(apart from cycle) in ARM64 are part of
> > AMU not PMU.
> >
> > 3. All platforms need to define which subset of events can be
> > monitored using a subset of counters. The platform specific perf json
> > file can specify that.
> > This approach provides more flexibility but makes the code path a bit
> > more complex as the counter mask constraint needs to be maintained per
> > event basis.
> >
> > 4. Any other approach ?
>
> I suggest a 4th approach where by default the kernel assumes all-to-all
> mappings and optionally perf json file can be used to override mappings
> for certain counters. This approach is more like a hybrid approach between

Do you mean override counter mapping for certain events? I don't
think we should split the counter space as platforms may want to
choose random counters between hpmcounter3-hpmcounter31.

We can always do this. If the counter mask is not set in the event
attributes, assume that the event can be monitored by all the
counters. x86 does something similar as well.
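
A minimal sketch of that default, in Python rather than kernel C. It
assumes the programmable counters are hpmcounter3-hpmcounter31 as
mentioned above and that an unset mask is represented as None or 0;
the function and constant names are made up for illustration.

```python
FIRST_HPM, LAST_HPM = 3, 31

# Bits 3..31 set: every programmable counter.
ALL_HPM_MASK = sum(1 << i for i in range(FIRST_HPM, LAST_HPM + 1))


def effective_mask(event_mask=None):
    """Return the counters an event may use.
    An unset mask (None or 0) means 'any programmable counter'."""
    if not event_mask:
        return ALL_HPM_MASK
    # Ignore bits that do not name a programmable counter.
    return event_mask & ALL_HPM_MASK


default = effective_mask()            # no constraint given
limited = effective_mask(0b11000)     # only hpmcounter3 and hpmcounter4
```

With this shape, platforms that are fully all-to-all never have to
supply a mask, and only the constrained events need an entry.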

> approach #2 and #3. It work fine with KVM RISC-V as well because Guest/VM
> will assume all-to-all mapping for logical HW counters whereas Host can have
> specific counter mappings.
>
> >
> > Any thoughts on what would be the best approach for RISC-V. It would
> > be great to repeat any past mistakes in RISC-V by learning from
> > experience from the community.
> >
> > [1] https://lore.kernel.org/lkml/Y6tS959TaY2EBAdn@spud/T/
> > [2] https://github.com/riscv-non-isa/riscv-sbi-doc/blob/master/riscv-sbi.adoc#function-find-and-configure-a-matching-counter-fid-2
> > [3] https://elixir.bootlin.com/linux/v6.2/source/arch/arm64/kernel/perf_event.c#L899
> > [4] https://elixir.bootlin.com/linux/latest/source/arch/x86/events/core.c#L876
> > [5] https://www.mail-archive.com/[email protected]/msg1978937.html
> > --
> > Regards,
> > Atish
>
> Regards,
> Anup



--
Regards,
Atish

2023-02-25 16:43:13

by Peter Zijlstra

Subject: Re: Perf event to counter mapping question

On Thu, Feb 23, 2023 at 09:48:35AM -0800, Beeman Strong wrote:
> Hi Peter, thanks for the feedback. Can you say more about AMD's
> overlapping constraints?

bc1738f6ee83 ("perf, x86: Fix event scheduler for constraints with overlapping counters")