2023-12-07 19:24:04

by Liang, Kan

Subject: [PATCH V2 0/5] Clean up perf mem

From: Kan Liang <[email protected]>

Changes since V1:
- Fix strcmp of PMU name checking (Ravi)
- Fix "/," typo (Ian)
- Rename several functions with perf_pmu__mem_events prefix. (Ian)
- Fold the header removal patch into the patch where the cleanups are made.
(Arnaldo)
- Add reviewed-by and tested-by from Ian and Ravi

As discussed in the thread below, this patch set cleans up perf mem.
https://lore.kernel.org/lkml/[email protected]/

Introduce generic functions perf_mem_events__ptr(),
perf_mem_events__name(), and is_mem_loads_aux_event() to replace the
ARCH-specific ones.
Simplify perf_mem_event__supported().
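
For reference, here is the generic interface as it stands at the end of
the series (a summary collected from the mem-events.h hunks in patches
2-5, not a new hunk):

    int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str);
    int perf_pmu__mem_events_init(struct perf_pmu *pmu);
    struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i);
    struct perf_pmu *perf_mem_events_find_pmu(void);
    void perf_pmu__mem_events_list(struct perf_pmu *pmu);
    bool is_mem_loads_aux_event(struct evsel *leader);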

Only the ARCH-specific perf_mem_events array is kept in the
corresponding mem-events.c for each ARCH.

There is no functional change.

The patch set touches almost all the ARCHs: Intel, AMD, ARM, Power,
etc. But I can only test it on two Intel platforms.
Please give it a try if you have machines with other ARCHs.

Here are the test results:
Intel hybrid machine:

$perf mem record -e list
ldlat-loads : available
ldlat-stores : available

$perf mem record -e ldlat-loads -v --ldlat 50
calling: record -e cpu_atom/mem-loads,ldlat=50/P -e cpu_core/mem-loads,ldlat=50/P

$perf mem record -v
calling: record -e cpu_atom/mem-loads,ldlat=30/P -e cpu_atom/mem-stores/P -e cpu_core/mem-loads,ldlat=30/P -e cpu_core/mem-stores/P

$perf mem record -t store -v
calling: record -e cpu_atom/mem-stores/P -e cpu_core/mem-stores/P


Intel SPR:
$perf mem record -e list
ldlat-loads : available
ldlat-stores : available

$perf mem record -e ldlat-loads -v --ldlat 50
calling: record -e {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=50/}:P

$perf mem record -v
calling: record -e {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:P -e cpu/mem-stores/P

$perf mem record -t store -v
calling: record -e cpu/mem-stores/P

Kan Liang (5):
perf mem: Add mem_events into the supported perf_pmu
perf mem: Clean up perf_mem_events__ptr()
perf mem: Clean up perf_mem_events__name()
perf mem: Clean up perf_mem_event__supported()
perf mem: Clean up is_mem_loads_aux_event()

tools/perf/arch/arm64/util/mem-events.c | 36 +----
tools/perf/arch/arm64/util/pmu.c | 6 +
tools/perf/arch/powerpc/util/mem-events.c | 13 +-
tools/perf/arch/powerpc/util/mem-events.h | 7 +
tools/perf/arch/powerpc/util/pmu.c | 11 ++
tools/perf/arch/s390/util/pmu.c | 3 +
tools/perf/arch/x86/util/mem-events.c | 99 ++----------
tools/perf/arch/x86/util/pmu.c | 11 ++
tools/perf/builtin-c2c.c | 28 +++-
tools/perf/builtin-mem.c | 28 +++-
tools/perf/util/mem-events.c | 181 +++++++++++++---------
tools/perf/util/mem-events.h | 15 +-
tools/perf/util/pmu.c | 4 +-
tools/perf/util/pmu.h | 7 +
14 files changed, 233 insertions(+), 216 deletions(-)
create mode 100644 tools/perf/arch/powerpc/util/mem-events.h
create mode 100644 tools/perf/arch/powerpc/util/pmu.c

--
2.35.1


2023-12-07 19:24:06

by Liang, Kan

Subject: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu

From: Kan Liang <[email protected]>

With the mem_events, perf doesn't need to read sysfs for each PMU to
find the mem-events-supported PMU. The patch also makes it possible to
clean up the related __weak functions later.

This patch only adds the mem_events into the perf_pmu for all ARCHs.
It will be used in the later cleanup patches.
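
As an illustration (not part of the patch): once perf_pmu__arch_init()
has populated the pointer, mem-events support becomes a pointer check
instead of a sysfs read. The helper name below is made up:

    static bool pmu_has_mem_events(const struct perf_pmu *pmu)
    {
            return pmu && pmu->mem_events;
    }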

Reviewed-by: Ian Rogers <[email protected]>
Tested-by: Ravi Bangoria <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/arch/arm64/util/mem-events.c | 4 ++--
tools/perf/arch/arm64/util/mem-events.h | 7 +++++++
tools/perf/arch/arm64/util/pmu.c | 6 ++++++
tools/perf/arch/s390/util/pmu.c | 3 +++
tools/perf/arch/x86/util/mem-events.c | 4 ++--
tools/perf/arch/x86/util/mem-events.h | 9 +++++++++
tools/perf/arch/x86/util/pmu.c | 7 +++++++
tools/perf/util/mem-events.c | 2 +-
tools/perf/util/mem-events.h | 1 +
tools/perf/util/pmu.c | 4 +++-
tools/perf/util/pmu.h | 7 +++++++
11 files changed, 48 insertions(+), 6 deletions(-)
create mode 100644 tools/perf/arch/arm64/util/mem-events.h
create mode 100644 tools/perf/arch/x86/util/mem-events.h

diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
index 3bcc5c7035c2..aaa4804922b4 100644
--- a/tools/perf/arch/arm64/util/mem-events.c
+++ b/tools/perf/arch/arm64/util/mem-events.c
@@ -4,7 +4,7 @@

#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }

-static struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
+struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
@@ -17,7 +17,7 @@ struct perf_mem_event *perf_mem_events__ptr(int i)
if (i >= PERF_MEM_EVENTS__MAX)
return NULL;

- return &perf_mem_events[i];
+ return &perf_mem_events_arm[i];
}

const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
diff --git a/tools/perf/arch/arm64/util/mem-events.h b/tools/perf/arch/arm64/util/mem-events.h
new file mode 100644
index 000000000000..5fc50be4be38
--- /dev/null
+++ b/tools/perf/arch/arm64/util/mem-events.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARM64_MEM_EVENTS_H
+#define _ARM64_MEM_EVENTS_H
+
+extern struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX];
+
+#endif /* _ARM64_MEM_EVENTS_H */
diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c
index 2a4eab2d160e..06ec9b838807 100644
--- a/tools/perf/arch/arm64/util/pmu.c
+++ b/tools/perf/arch/arm64/util/pmu.c
@@ -8,6 +8,12 @@
#include <api/fs/fs.h>
#include <math.h>

+void perf_pmu__arch_init(struct perf_pmu *pmu)
+{
+ if (!strcmp(pmu->name, "arm_spe_0"))
+ pmu->mem_events = perf_mem_events_arm;
+}
+
const struct pmu_metrics_table *pmu_metrics_table__find(void)
{
struct perf_pmu *pmu;
diff --git a/tools/perf/arch/s390/util/pmu.c b/tools/perf/arch/s390/util/pmu.c
index 886c30e001fa..225d7dc2379c 100644
--- a/tools/perf/arch/s390/util/pmu.c
+++ b/tools/perf/arch/s390/util/pmu.c
@@ -19,4 +19,7 @@ void perf_pmu__arch_init(struct perf_pmu *pmu)
!strcmp(pmu->name, S390_PMUPAI_EXT) ||
!strcmp(pmu->name, S390_PMUCPUM_CF))
pmu->selectable = true;
+
+ if (pmu->is_core)
+ pmu->mem_events = perf_mem_events;
}
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index 191b372f9a2d..2b81d229982c 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -16,13 +16,13 @@ static char mem_stores_name[100];

#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }

-static struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX] = {
+struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX] = {
E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "%s/events/mem-loads"),
E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores"),
E(NULL, NULL, NULL),
};

-static struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
+struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
E(NULL, NULL, NULL),
E(NULL, NULL, NULL),
E("mem-ldst", "ibs_op//", "ibs_op"),
diff --git a/tools/perf/arch/x86/util/mem-events.h b/tools/perf/arch/x86/util/mem-events.h
new file mode 100644
index 000000000000..3959e427f482
--- /dev/null
+++ b/tools/perf/arch/x86/util/mem-events.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_MEM_EVENTS_H
+#define _X86_MEM_EVENTS_H
+
+extern struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX];
+
+extern struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX];
+
+#endif /* _X86_MEM_EVENTS_H */
diff --git a/tools/perf/arch/x86/util/pmu.c b/tools/perf/arch/x86/util/pmu.c
index 469555ae9b3c..cd22e80e5657 100644
--- a/tools/perf/arch/x86/util/pmu.c
+++ b/tools/perf/arch/x86/util/pmu.c
@@ -15,6 +15,7 @@
#include "../../../util/pmu.h"
#include "../../../util/fncache.h"
#include "../../../util/pmus.h"
+#include "mem-events.h"
#include "env.h"

void perf_pmu__arch_init(struct perf_pmu *pmu __maybe_unused)
@@ -30,6 +31,12 @@ void perf_pmu__arch_init(struct perf_pmu *pmu __maybe_unused)
pmu->selectable = true;
}
#endif
+
+ if (x86__is_amd_cpu()) {
+ if (!strcmp(pmu->name, "ibs_op"))
+ pmu->mem_events = perf_mem_events_amd;
+ } else if (pmu->is_core)
+ pmu->mem_events = perf_mem_events_intel;
}

int perf_pmus__num_mem_pmus(void)
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 3a2e3687878c..0a8f415f5efe 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -19,7 +19,7 @@ unsigned int perf_mem_events__loads_ldlat = 30;

#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }

-static struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
+struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
E("ldlat-loads", "cpu/mem-loads,ldlat=%u/P", "cpu/events/mem-loads"),
E("ldlat-stores", "cpu/mem-stores/P", "cpu/events/mem-stores"),
E(NULL, NULL, NULL),
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index b40ad6ea93fc..8c5694b2d0b0 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -34,6 +34,7 @@ enum {
};

extern unsigned int perf_mem_events__loads_ldlat;
+extern struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX];

int perf_mem_events__parse(const char *str);
int perf_mem_events__init(void);
diff --git a/tools/perf/util/pmu.c b/tools/perf/util/pmu.c
index 3c9609944a2f..3d4373b8ab63 100644
--- a/tools/perf/util/pmu.c
+++ b/tools/perf/util/pmu.c
@@ -986,8 +986,10 @@ static int pmu_max_precise(int dirfd, struct perf_pmu *pmu)
}

void __weak
-perf_pmu__arch_init(struct perf_pmu *pmu __maybe_unused)
+perf_pmu__arch_init(struct perf_pmu *pmu)
{
+ if (pmu->is_core)
+ pmu->mem_events = perf_mem_events;
}

struct perf_pmu *perf_pmu__lookup(struct list_head *pmus, int dirfd, const char *name)
diff --git a/tools/perf/util/pmu.h b/tools/perf/util/pmu.h
index 424c3fee0949..e35d985206db 100644
--- a/tools/perf/util/pmu.h
+++ b/tools/perf/util/pmu.h
@@ -10,6 +10,8 @@
#include <stdio.h>
#include "parse-events.h"
#include "pmu-events/pmu-events.h"
+#include "map_symbol.h"
+#include "mem-events.h"

struct evsel_config_term;
struct perf_cpu_map;
@@ -162,6 +164,11 @@ struct perf_pmu {
*/
bool exclude_guest;
} missing_features;
+
+ /**
+ * @mem_events: List of the supported mem events
+ */
+ struct perf_mem_event *mem_events;
};

/** @perf_pmu__fake: A special global PMU used for testing. */
--
2.35.1

2023-12-07 19:24:14

by Liang, Kan

Subject: [PATCH V2 4/5] perf mem: Clean up perf_mem_event__supported()

From: Kan Liang <[email protected]>

On some ARCHs, e.g., ARM and AMD, perf determines the availability of
the mem-events by checking the existence of a specific PMU. On the
other ARCHs, e.g., Intel and Power, perf has to check the existence of
some specific events.

Perf now only iterates over the mem-events-supported PMUs, so checking
the existence of a specific PMU is no longer required.

Rename sysfs_name to event_name, which stores the name of the specific
mem-event. Perf only needs to check the existence of those events to
determine the availability of the mem-events.

Rename perf_mem_event__supported() to perf_pmu__mem_events_supported().
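
To spell out the resulting check (from the mem-events.c hunk below; the
PMU name "cpu" is just an example): with sysfs mounted at /sys, the
availability of the Intel load event reduces to a single

    stat("/sys/devices/cpu/events/mem-loads", &st)

while a NULL event_name (ARM SPE, AMD IBS) means the existence of the
PMU itself already implies support, so the check returns true.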

Reviewed-by: Ian Rogers <[email protected]>
Tested-by: Ravi Bangoria <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/arch/arm64/util/mem-events.c | 8 ++++----
tools/perf/arch/powerpc/util/mem-events.c | 8 ++++----
tools/perf/arch/x86/util/mem-events.c | 20 ++++++++++----------
tools/perf/util/mem-events.c | 22 ++++++++++++----------
tools/perf/util/mem-events.h | 2 +-
5 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
index eb2ef84f0fc8..590dddd6b0ab 100644
--- a/tools/perf/arch/arm64/util/mem-events.c
+++ b/tools/perf/arch/arm64/util/mem-events.c
@@ -2,10 +2,10 @@
#include "map_symbol.h"
#include "mem-events.h"

-#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .event_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
- E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0", true, 0),
- E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0", false, 0),
- E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0", true, 0),
+ E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", NULL, true, 0),
+ E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", NULL, false, 0),
+ E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", NULL, true, 0),
};
diff --git a/tools/perf/arch/powerpc/util/mem-events.c b/tools/perf/arch/powerpc/util/mem-events.c
index b7883e38950f..72a6ac2b52f5 100644
--- a/tools/perf/arch/powerpc/util/mem-events.c
+++ b/tools/perf/arch/powerpc/util/mem-events.c
@@ -2,10 +2,10 @@
#include "map_symbol.h"
#include "mem-events.h"

-#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .event_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events_power[PERF_MEM_EVENTS__MAX] = {
- E("ldlat-loads", "%s/mem-loads/", "cpu/events/mem-loads", false, 0),
- E("ldlat-stores", "%s/mem-stores/", "cpu/events/mem-stores", false, 0),
- E(NULL, NULL, NULL, false, 0),
+ E("ldlat-loads", "%s/mem-loads/", "mem-loads", false, 0),
+ E("ldlat-stores", "%s/mem-stores/", "mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
};
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index f0e66a0151a0..b776d849fc64 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -9,24 +9,24 @@

#define MEM_LOADS_AUX 0x8203

-#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .event_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX] = {
- E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "%s/events/mem-loads", true, 0),
- E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores", false, 0),
- E(NULL, NULL, NULL, false, 0),
+ E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "mem-loads", true, 0),
+ E("ldlat-stores", "%s/mem-stores/P", "mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
};

struct perf_mem_event perf_mem_events_intel_aux[PERF_MEM_EVENTS__MAX] = {
- E("ldlat-loads", "{%s/mem-loads-aux/,%s/mem-loads,ldlat=%u/}:P", "%s/events/mem-loads", true, MEM_LOADS_AUX),
- E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores", false, 0),
- E(NULL, NULL, NULL, false, 0),
+ E("ldlat-loads", "{%s/mem-loads-aux/,%s/mem-loads,ldlat=%u/}:P", "mem-loads", true, MEM_LOADS_AUX),
+ E("ldlat-stores", "%s/mem-stores/P", "mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
};

struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
- E(NULL, NULL, NULL, false, 0),
- E(NULL, NULL, NULL, false, 0),
- E("mem-ldst", "%s//", "ibs_op", false, 0),
+ E(NULL, NULL, NULL, false, 0),
+ E(NULL, NULL, NULL, false, 0),
+ E("mem-ldst", "%s//", NULL, false, 0),
};

bool is_mem_loads_aux_event(struct evsel *leader)
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index c9a40b64e538..0d174f161034 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -17,12 +17,12 @@

unsigned int perf_mem_events__loads_ldlat = 30;

-#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .event_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
- E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "cpu/events/mem-loads", true, 0),
- E("ldlat-stores", "%s/mem-stores/P", "cpu/events/mem-stores", false, 0),
- E(NULL, NULL, NULL, false, 0),
+ E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "mem-loads", true, 0),
+ E("ldlat-stores", "%s/mem-stores/P", "mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
};
#undef E

@@ -147,15 +147,17 @@ int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str)
return -1;
}

-static bool perf_mem_event__supported(const char *mnt, struct perf_pmu *pmu,
+static bool perf_pmu__mem_events_supported(const char *mnt, struct perf_pmu *pmu,
struct perf_mem_event *e)
{
- char sysfs_name[100];
char path[PATH_MAX];
struct stat st;

- scnprintf(sysfs_name, sizeof(sysfs_name), e->sysfs_name, pmu->name);
- scnprintf(path, PATH_MAX, "%s/devices/%s", mnt, sysfs_name);
+ if (!e->event_name)
+ return true;
+
+ scnprintf(path, PATH_MAX, "%s/devices/%s/events/%s", mnt, pmu->name, e->event_name);
+
return !stat(path, &st);
}

@@ -178,7 +180,7 @@ int perf_pmu__mem_events_init(struct perf_pmu *pmu)
if (!e->tag)
continue;

- e->supported |= perf_mem_event__supported(mnt, pmu, e);
+ e->supported |= perf_pmu__mem_events_supported(mnt, pmu, e);
if (e->supported)
found = true;
}
@@ -230,7 +232,7 @@ int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,
} else {
const char *s = perf_pmu__mem_events_name(j, pmu);

- if (!perf_mem_event__supported(mnt, pmu, e))
+ if (!perf_pmu__mem_events_supported(mnt, pmu, e))
continue;

rec_argv[i++] = "-e";
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 79d342768d12..f817a507b106 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -18,7 +18,7 @@ struct perf_mem_event {
u32 aux_event;
const char *tag;
const char *name;
- const char *sysfs_name;
+ const char *event_name;
};

struct mem_info {
--
2.35.1

2023-12-07 19:24:25

by Liang, Kan

Subject: [PATCH V2 5/5] perf mem: Clean up is_mem_loads_aux_event()

From: Kan Liang <[email protected]>

The aux_event can be retrieved from the perf_pmu now. Implement
generic support.
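
As an illustration, the core of the now data-driven check (mirroring
the mem-events.c hunk below; pmu is leader->pmu):

    /*
     * Only Intel stores a non-zero aux_event (MEM_LOADS_AUX, 0x8203)
     * in its load event. The other ARCHs store 0, so the generic
     * helper returns false for them without any arch-specific code.
     */
    struct perf_mem_event *e = &pmu->mem_events[PERF_MEM_EVENTS__LOAD];

    bool is_aux = e->aux_event && leader->core.attr.config == e->aux_event;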

Reviewed-by: Ian Rogers <[email protected]>
Tested-by: Ravi Bangoria <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/arch/x86/util/mem-events.c | 23 ++++-------------------
tools/perf/util/mem-events.c | 14 ++++++++++++--
2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index b776d849fc64..62df03e91c7e 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -1,11 +1,9 @@
// SPDX-License-Identifier: GPL-2.0
-#include "util/pmu.h"
-#include "util/pmus.h"
-#include "util/env.h"
-#include "map_symbol.h"
-#include "mem-events.h"
#include "linux/string.h"
-#include "env.h"
+#include "util/map_symbol.h"
+#include "util/mem-events.h"
+#include "mem-events.h"
+

#define MEM_LOADS_AUX 0x8203

@@ -28,16 +26,3 @@ struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
E(NULL, NULL, NULL, false, 0),
E("mem-ldst", "%s//", NULL, false, 0),
};
-
-bool is_mem_loads_aux_event(struct evsel *leader)
-{
- struct perf_pmu *pmu = perf_pmus__find("cpu");
-
- if (!pmu)
- pmu = perf_pmus__find("cpu_core");
-
- if (pmu && !perf_pmu__have_event(pmu, "mem-loads-aux"))
- return false;
-
- return leader->core.attr.config == MEM_LOADS_AUX;
-}
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 0d174f161034..d418320e52e3 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -103,9 +103,19 @@ static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
return NULL;
}

-__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
+bool is_mem_loads_aux_event(struct evsel *leader)
{
- return false;
+ struct perf_pmu *pmu = leader->pmu;
+ struct perf_mem_event *e;
+
+ if (!pmu || !pmu->mem_events)
+ return false;
+
+ e = &pmu->mem_events[PERF_MEM_EVENTS__LOAD];
+ if (!e->aux_event)
+ return false;
+
+ return leader->core.attr.config == e->aux_event;
}

int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str)
--
2.35.1

2023-12-07 19:24:35

by Liang, Kan

Subject: [PATCH V2 2/5] perf mem: Clean up perf_mem_events__ptr()

From: Kan Liang <[email protected]>

The mem_events can be retrieved from struct perf_pmu now. The
ARCH-specific perf_mem_events__ptr() implementations are not required
anymore. Remove all of them.

Intel hybrid has multiple mem-events-supported PMUs, but they share
the same mem_events. The other ARCHs only have one mem-events-supported
PMU, so for the configuration it's good enough to configure the
mem_events of only one PMU. Add perf_mem_events_find_pmu(), which
returns the first mem-events-supported PMU.

In perf_mem_events__init(), perf_pmus__scan() is not required anymore,
which avoids checking sysfs for every PMU on the system.

Make the perf_mem_events__record_args() more generic. Remove the
perf_mem_events__print_unsupport_hybrid().

Since pmu is added as a new parameter, rename perf_mem_events__ptr()
to perf_pmu__mem_events_ptr(). Several other functions are renamed
similarly.
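
For reference, a minimal sketch of the resulting call sequence in a
tool (mirroring the builtin-mem.c hunk below):

    struct perf_pmu *pmu = perf_mem_events_find_pmu();
    struct perf_mem_event *e;

    if (!pmu)
            return -1;      /* no PMU supports the memory events */
    if (perf_pmu__mem_events_init(pmu))
            return -1;      /* memory events not supported */

    e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD);
    e->record = true;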

Reviewed-by: Ian Rogers <[email protected]>
Tested-by: Ravi Bangoria <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/arch/arm64/util/mem-events.c | 10 +--
tools/perf/arch/x86/util/mem-events.c | 18 ++---
tools/perf/builtin-c2c.c | 28 +++++--
tools/perf/builtin-mem.c | 28 +++++--
tools/perf/util/mem-events.c | 103 ++++++++++++------------
tools/perf/util/mem-events.h | 9 ++-
6 files changed, 104 insertions(+), 92 deletions(-)

diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
index aaa4804922b4..2602e8688727 100644
--- a/tools/perf/arch/arm64/util/mem-events.c
+++ b/tools/perf/arch/arm64/util/mem-events.c
@@ -12,17 +12,9 @@ struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {

static char mem_ev_name[100];

-struct perf_mem_event *perf_mem_events__ptr(int i)
-{
- if (i >= PERF_MEM_EVENTS__MAX)
- return NULL;
-
- return &perf_mem_events_arm[i];
-}
-
const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
{
- struct perf_mem_event *e = perf_mem_events__ptr(i);
+ struct perf_mem_event *e = &perf_mem_events_arm[i];

if (i >= PERF_MEM_EVENTS__MAX)
return NULL;
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index 2b81d229982c..5fb41d50118d 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -28,17 +28,6 @@ struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
E("mem-ldst", "ibs_op//", "ibs_op"),
};

-struct perf_mem_event *perf_mem_events__ptr(int i)
-{
- if (i >= PERF_MEM_EVENTS__MAX)
- return NULL;
-
- if (x86__is_amd_cpu())
- return &perf_mem_events_amd[i];
-
- return &perf_mem_events_intel[i];
-}
-
bool is_mem_loads_aux_event(struct evsel *leader)
{
struct perf_pmu *pmu = perf_pmus__find("cpu");
@@ -54,7 +43,12 @@ bool is_mem_loads_aux_event(struct evsel *leader)

const char *perf_mem_events__name(int i, const char *pmu_name)
{
- struct perf_mem_event *e = perf_mem_events__ptr(i);
+ struct perf_mem_event *e;
+
+ if (x86__is_amd_cpu())
+ e = &perf_mem_events_amd[i];
+ else
+ e = &perf_mem_events_intel[i];

if (!e)
return NULL;
diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
index a4cf9de7a7b5..e5b7dc7a80e3 100644
--- a/tools/perf/builtin-c2c.c
+++ b/tools/perf/builtin-c2c.c
@@ -3215,12 +3215,19 @@ static int parse_record_events(const struct option *opt,
const char *str, int unset __maybe_unused)
{
bool *event_set = (bool *) opt->value;
+ struct perf_pmu *pmu;
+
+ pmu = perf_mem_events_find_pmu();
+ if (!pmu) {
+ pr_err("failed: there is no PMU that supports perf c2c\n");
+ exit(-1);
+ }

if (!strcmp(str, "list")) {
- perf_mem_events__list();
+ perf_pmu__mem_events_list(pmu);
exit(0);
}
- if (perf_mem_events__parse(str))
+ if (perf_pmu__mem_events_parse(pmu, str))
exit(-1);

*event_set = true;
@@ -3245,6 +3252,7 @@ static int perf_c2c__record(int argc, const char **argv)
bool all_user = false, all_kernel = false;
bool event_set = false;
struct perf_mem_event *e;
+ struct perf_pmu *pmu;
struct option options[] = {
OPT_CALLBACK('e', "event", &event_set, "event",
"event selector. Use 'perf c2c record -e list' to list available events",
@@ -3256,7 +3264,13 @@ static int perf_c2c__record(int argc, const char **argv)
OPT_END()
};

- if (perf_mem_events__init()) {
+ pmu = perf_mem_events_find_pmu();
+ if (!pmu) {
+ pr_err("failed: no PMU supports the memory events\n");
+ return -1;
+ }
+
+ if (perf_pmu__mem_events_init(pmu)) {
pr_err("failed: memory events not supported\n");
return -1;
}
@@ -3280,7 +3294,7 @@ static int perf_c2c__record(int argc, const char **argv)
rec_argv[i++] = "record";

if (!event_set) {
- e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD_STORE);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD_STORE);
/*
* The load and store operations are required, use the event
* PERF_MEM_EVENTS__LOAD_STORE if it is supported.
@@ -3289,15 +3303,15 @@ static int perf_c2c__record(int argc, const char **argv)
e->record = true;
rec_argv[i++] = "-W";
} else {
- e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD);
e->record = true;

- e = perf_mem_events__ptr(PERF_MEM_EVENTS__STORE);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__STORE);
e->record = true;
}
}

- e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD);
if (e->record)
rec_argv[i++] = "-W";

diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index 51499c20da01..ef64bae77ca7 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -43,12 +43,19 @@ static int parse_record_events(const struct option *opt,
const char *str, int unset __maybe_unused)
{
struct perf_mem *mem = *(struct perf_mem **)opt->value;
+ struct perf_pmu *pmu;
+
+ pmu = perf_mem_events_find_pmu();
+ if (!pmu) {
+ pr_err("failed: there is no PMU that supports perf mem\n");
+ exit(-1);
+ }

if (!strcmp(str, "list")) {
- perf_mem_events__list();
+ perf_pmu__mem_events_list(pmu);
exit(0);
}
- if (perf_mem_events__parse(str))
+ if (perf_pmu__mem_events_parse(pmu, str))
exit(-1);

mem->operation = 0;
@@ -72,6 +79,7 @@ static int __cmd_record(int argc, const char **argv, struct perf_mem *mem)
int ret;
bool all_user = false, all_kernel = false;
struct perf_mem_event *e;
+ struct perf_pmu *pmu;
struct option options[] = {
OPT_CALLBACK('e', "event", &mem, "event",
"event selector. use 'perf mem record -e list' to list available events",
@@ -84,7 +92,13 @@ static int __cmd_record(int argc, const char **argv, struct perf_mem *mem)
OPT_END()
};

- if (perf_mem_events__init()) {
+ pmu = perf_mem_events_find_pmu();
+ if (!pmu) {
+ pr_err("failed: no PMU supports the memory events\n");
+ return -1;
+ }
+
+ if (perf_pmu__mem_events_init(pmu)) {
pr_err("failed: memory events not supported\n");
return -1;
}
@@ -113,7 +127,7 @@ static int __cmd_record(int argc, const char **argv, struct perf_mem *mem)

rec_argv[i++] = "record";

- e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD_STORE);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD_STORE);

/*
* The load and store operations are required, use the event
@@ -126,17 +140,17 @@ static int __cmd_record(int argc, const char **argv, struct perf_mem *mem)
rec_argv[i++] = "-W";
} else {
if (mem->operation & MEM_OPERATION_LOAD) {
- e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD);
e->record = true;
}

if (mem->operation & MEM_OPERATION_STORE) {
- e = perf_mem_events__ptr(PERF_MEM_EVENTS__STORE);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__STORE);
e->record = true;
}
}

- e = perf_mem_events__ptr(PERF_MEM_EVENTS__LOAD);
+ e = perf_pmu__mem_events_ptr(pmu, PERF_MEM_EVENTS__LOAD);
if (e->record)
rec_argv[i++] = "-W";

diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 0a8f415f5efe..27a33dc44964 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -29,17 +29,42 @@ struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
static char mem_loads_name[100];
static bool mem_loads_name__init;

-struct perf_mem_event * __weak perf_mem_events__ptr(int i)
+struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i)
{
- if (i >= PERF_MEM_EVENTS__MAX)
+ if (i >= PERF_MEM_EVENTS__MAX || !pmu)
return NULL;

- return &perf_mem_events[i];
+ return &pmu->mem_events[i];
+}
+
+static struct perf_pmu *perf_pmus__scan_mem(struct perf_pmu *pmu)
+{
+ while ((pmu = perf_pmus__scan(pmu)) != NULL) {
+ if (pmu->mem_events)
+ return pmu;
+ }
+ return NULL;
+}
+
+struct perf_pmu *perf_mem_events_find_pmu(void)
+{
+ /*
+ * The current perf mem doesn't support per-PMU configuration.
+ * The exact same configuration is applied to all the
+ * mem_events supported PMUs.
+ * Return the first mem_events supported PMU.
+ *
+ * Note: The only case with multiple mem_events-supported PMUs
+ * is Intel hybrid, and those PMUs share the exact same
+ * mem_events. So configuring only the first PMU is good
+ * enough as well.
+ */
+ return perf_pmus__scan_mem(NULL);
}

const char * __weak perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
{
- struct perf_mem_event *e = perf_mem_events__ptr(i);
+ struct perf_mem_event *e = &perf_mem_events[i];

if (!e)
return NULL;
@@ -61,7 +86,7 @@ __weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
return false;
}

-int perf_mem_events__parse(const char *str)
+int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str)
{
char *tok, *saveptr = NULL;
bool found = false;
@@ -79,7 +104,7 @@ int perf_mem_events__parse(const char *str)

while (tok) {
for (j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
- struct perf_mem_event *e = perf_mem_events__ptr(j);
+ struct perf_mem_event *e = perf_pmu__mem_events_ptr(pmu, j);

if (!e->tag)
continue;
@@ -112,7 +137,7 @@ static bool perf_mem_event__supported(const char *mnt, struct perf_pmu *pmu,
return !stat(path, &st);
}

-int perf_mem_events__init(void)
+int perf_pmu__mem_events_init(struct perf_pmu *pmu)
{
const char *mnt = sysfs__mount();
bool found = false;
@@ -122,8 +147,7 @@ int perf_mem_events__init(void)
return -ENOENT;

for (j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
- struct perf_mem_event *e = perf_mem_events__ptr(j);
- struct perf_pmu *pmu = NULL;
+ struct perf_mem_event *e = perf_pmu__mem_events_ptr(pmu, j);

/*
* If the event entry isn't valid, skip initialization
@@ -132,29 +156,20 @@ int perf_mem_events__init(void)
if (!e->tag)
continue;

- /*
- * Scan all PMUs not just core ones, since perf mem/c2c on
- * platforms like AMD uses IBS OP PMU which is independent
- * of core PMU.
- */
- while ((pmu = perf_pmus__scan(pmu)) != NULL) {
- e->supported |= perf_mem_event__supported(mnt, pmu, e);
- if (e->supported) {
- found = true;
- break;
- }
- }
+ e->supported |= perf_mem_event__supported(mnt, pmu, e);
+ if (e->supported)
+ found = true;
}

return found ? 0 : -ENOENT;
}

-void perf_mem_events__list(void)
+void perf_pmu__mem_events_list(struct perf_pmu *pmu)
{
int j;

for (j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
- struct perf_mem_event *e = perf_mem_events__ptr(j);
+ struct perf_mem_event *e = perf_pmu__mem_events_ptr(pmu, j);

fprintf(stderr, "%-*s%-*s%s",
e->tag ? 13 : 0,
@@ -165,50 +180,32 @@ void perf_mem_events__list(void)
}
}

-static void perf_mem_events__print_unsupport_hybrid(struct perf_mem_event *e,
- int idx)
-{
- const char *mnt = sysfs__mount();
- struct perf_pmu *pmu = NULL;
-
- while ((pmu = perf_pmus__scan(pmu)) != NULL) {
- if (!perf_mem_event__supported(mnt, pmu, e)) {
- pr_err("failed: event '%s' not supported\n",
- perf_mem_events__name(idx, pmu->name));
- }
- }
-}
-
int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,
char **rec_tmp, int *tmp_nr)
{
const char *mnt = sysfs__mount();
+ struct perf_pmu *pmu = NULL;
int i = *argv_nr, k = 0;
struct perf_mem_event *e;

- for (int j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
- e = perf_mem_events__ptr(j);
- if (!e->record)
- continue;

- if (perf_pmus__num_mem_pmus() == 1) {
- if (!e->supported) {
- pr_err("failed: event '%s' not supported\n",
- perf_mem_events__name(j, NULL));
- return -1;
- }
+ while ((pmu = perf_pmus__scan_mem(pmu)) != NULL) {
+ for (int j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
+ e = perf_pmu__mem_events_ptr(pmu, j);

- rec_argv[i++] = "-e";
- rec_argv[i++] = perf_mem_events__name(j, NULL);
- } else {
- struct perf_pmu *pmu = NULL;
+ if (!e->record)
+ continue;

if (!e->supported) {
- perf_mem_events__print_unsupport_hybrid(e, j);
+ pr_err("failed: event '%s' not supported\n",
+ perf_mem_events__name(j, pmu->name));
return -1;
}

- while ((pmu = perf_pmus__scan(pmu)) != NULL) {
+ if (perf_pmus__num_mem_pmus() == 1) {
+ rec_argv[i++] = "-e";
+ rec_argv[i++] = perf_mem_events__name(j, NULL);
+ } else {
const char *s = perf_mem_events__name(j, pmu->name);

if (!perf_mem_event__supported(mnt, pmu, e))
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 8c5694b2d0b0..0ad301a2e424 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -36,14 +36,15 @@ enum {
extern unsigned int perf_mem_events__loads_ldlat;
extern struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX];

-int perf_mem_events__parse(const char *str);
-int perf_mem_events__init(void);
+int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str);
+int perf_pmu__mem_events_init(struct perf_pmu *pmu);

const char *perf_mem_events__name(int i, const char *pmu_name);
-struct perf_mem_event *perf_mem_events__ptr(int i);
+struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i);
+struct perf_pmu *perf_mem_events_find_pmu(void);
bool is_mem_loads_aux_event(struct evsel *leader);

-void perf_mem_events__list(void);
+void perf_pmu__mem_events_list(struct perf_pmu *pmu);
int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,
char **rec_tmp, int *tmp_nr);

--
2.35.1

2023-12-07 19:24:42

by Liang, Kan

Subject: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()

From: Kan Liang <[email protected]>

Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
one.

The mem_load events may have different formats. Add ldlat and
aux_event fields to struct perf_mem_event to indicate the event format
and the extra aux event.

Add perf_mem_events_intel_aux[] to support the extra mem-loads-aux event.

Rename perf_mem_events__name to perf_pmu__mem_events_name.
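
As a worked example (matching the SPR output in the cover letter): the
Intel aux variant of the load event has ldlat == true and a non-zero
aux_event, so its name template

    {%s/mem-loads-aux/,%s/mem-loads,ldlat=%u/}:P

expands, for PMU "cpu" and the default ldlat of 30, to

    {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:P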

Tested-by: Ravi Bangoria <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/arch/arm64/util/mem-events.c | 26 ++-------
tools/perf/arch/powerpc/util/mem-events.c | 13 ++---
tools/perf/arch/powerpc/util/mem-events.h | 7 +++
tools/perf/arch/powerpc/util/pmu.c | 11 ++++
tools/perf/arch/x86/util/mem-events.c | 70 +++++------------------
tools/perf/arch/x86/util/mem-events.h | 1 +
tools/perf/arch/x86/util/pmu.c | 8 ++-
tools/perf/util/mem-events.c | 56 ++++++++++++------
tools/perf/util/mem-events.h | 3 +-
9 files changed, 89 insertions(+), 106 deletions(-)
create mode 100644 tools/perf/arch/powerpc/util/mem-events.h
create mode 100644 tools/perf/arch/powerpc/util/pmu.c

diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
index 2602e8688727..eb2ef84f0fc8 100644
--- a/tools/perf/arch/arm64/util/mem-events.c
+++ b/tools/perf/arch/arm64/util/mem-events.c
@@ -2,28 +2,10 @@
#include "map_symbol.h"
#include "mem-events.h"

-#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
- E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
- E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
- E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
+ E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0", true, 0),
+ E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0", false, 0),
+ E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0", true, 0),
};
-
-static char mem_ev_name[100];
-
-const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
-{
- struct perf_mem_event *e = &perf_mem_events_arm[i];
-
- if (i >= PERF_MEM_EVENTS__MAX)
- return NULL;
-
- if (i == PERF_MEM_EVENTS__LOAD || i == PERF_MEM_EVENTS__LOAD_STORE)
- scnprintf(mem_ev_name, sizeof(mem_ev_name),
- e->name, perf_mem_events__loads_ldlat);
- else /* PERF_MEM_EVENTS__STORE */
- scnprintf(mem_ev_name, sizeof(mem_ev_name), e->name);
-
- return mem_ev_name;
-}
diff --git a/tools/perf/arch/powerpc/util/mem-events.c b/tools/perf/arch/powerpc/util/mem-events.c
index 78b986e5268d..b7883e38950f 100644
--- a/tools/perf/arch/powerpc/util/mem-events.c
+++ b/tools/perf/arch/powerpc/util/mem-events.c
@@ -2,11 +2,10 @@
#include "map_symbol.h"
#include "mem-events.h"

-/* PowerPC does not support 'ldlat' parameter. */
-const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
-{
- if (i == PERF_MEM_EVENTS__LOAD)
- return "cpu/mem-loads/";
+#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }

- return "cpu/mem-stores/";
-}
+struct perf_mem_event perf_mem_events_power[PERF_MEM_EVENTS__MAX] = {
+ E("ldlat-loads", "%s/mem-loads/", "cpu/events/mem-loads", false, 0),
+ E("ldlat-stores", "%s/mem-stores/", "cpu/events/mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
+};
diff --git a/tools/perf/arch/powerpc/util/mem-events.h b/tools/perf/arch/powerpc/util/mem-events.h
new file mode 100644
index 000000000000..6acc3d1b6873
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/mem-events.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _POWER_MEM_EVENTS_H
+#define _POWER_MEM_EVENTS_H
+
+extern struct perf_mem_event perf_mem_events_power[PERF_MEM_EVENTS__MAX];
+
+#endif /* _POWER_MEM_EVENTS_H */
diff --git a/tools/perf/arch/powerpc/util/pmu.c b/tools/perf/arch/powerpc/util/pmu.c
new file mode 100644
index 000000000000..168173f88ddb
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/pmu.c
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <string.h>
+
+#include "../../../util/pmu.h"
+
+void perf_pmu__arch_init(struct perf_pmu *pmu)
+{
+ if (pmu->is_core)
+ pmu->mem_events = perf_mem_events_power;
+}
diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
index 5fb41d50118d..f0e66a0151a0 100644
--- a/tools/perf/arch/x86/util/mem-events.c
+++ b/tools/perf/arch/x86/util/mem-events.c
@@ -7,25 +7,26 @@
#include "linux/string.h"
#include "env.h"

-static char mem_loads_name[100];
-static bool mem_loads_name__init;
-static char mem_stores_name[100];
-
#define MEM_LOADS_AUX 0x8203
-#define MEM_LOADS_AUX_NAME "{%s/mem-loads-aux/,%s/mem-loads,ldlat=%u/}:P"

-#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX] = {
- E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "%s/events/mem-loads"),
- E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores"),
- E(NULL, NULL, NULL),
+ E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "%s/events/mem-loads", true, 0),
+ E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
+};
+
+struct perf_mem_event perf_mem_events_intel_aux[PERF_MEM_EVENTS__MAX] = {
+ E("ldlat-loads", "{%s/mem-loads-aux/,%s/mem-loads,ldlat=%u/}:P", "%s/events/mem-loads", true, MEM_LOADS_AUX),
+ E("ldlat-stores", "%s/mem-stores/P", "%s/events/mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
};

struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
- E(NULL, NULL, NULL),
- E(NULL, NULL, NULL),
- E("mem-ldst", "ibs_op//", "ibs_op"),
+ E(NULL, NULL, NULL, false, 0),
+ E(NULL, NULL, NULL, false, 0),
+ E("mem-ldst", "%s//", "ibs_op", false, 0),
};

bool is_mem_loads_aux_event(struct evsel *leader)
@@ -40,48 +41,3 @@ bool is_mem_loads_aux_event(struct evsel *leader)

return leader->core.attr.config == MEM_LOADS_AUX;
}
-
-const char *perf_mem_events__name(int i, const char *pmu_name)
-{
- struct perf_mem_event *e;
-
- if (x86__is_amd_cpu())
- e = &perf_mem_events_amd[i];
- else
- e = &perf_mem_events_intel[i];
-
- if (!e)
- return NULL;
-
- if (i == PERF_MEM_EVENTS__LOAD) {
- if (mem_loads_name__init && !pmu_name)
- return mem_loads_name;
-
- if (!pmu_name) {
- mem_loads_name__init = true;
- pmu_name = "cpu";
- }
-
- if (perf_pmus__have_event(pmu_name, "mem-loads-aux")) {
- scnprintf(mem_loads_name, sizeof(mem_loads_name),
- MEM_LOADS_AUX_NAME, pmu_name, pmu_name,
- perf_mem_events__loads_ldlat);
- } else {
- scnprintf(mem_loads_name, sizeof(mem_loads_name),
- e->name, pmu_name,
- perf_mem_events__loads_ldlat);
- }
- return mem_loads_name;
- }
-
- if (i == PERF_MEM_EVENTS__STORE) {
- if (!pmu_name)
- pmu_name = "cpu";
-
- scnprintf(mem_stores_name, sizeof(mem_stores_name),
- e->name, pmu_name);
- return mem_stores_name;
- }
-
- return e->name;
-}
diff --git a/tools/perf/arch/x86/util/mem-events.h b/tools/perf/arch/x86/util/mem-events.h
index 3959e427f482..f55c8d3b7d59 100644
--- a/tools/perf/arch/x86/util/mem-events.h
+++ b/tools/perf/arch/x86/util/mem-events.h
@@ -3,6 +3,7 @@
#define _X86_MEM_EVENTS_H

extern struct perf_mem_event perf_mem_events_intel[PERF_MEM_EVENTS__MAX];
+extern struct perf_mem_event perf_mem_events_intel_aux[PERF_MEM_EVENTS__MAX];

extern struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX];

diff --git a/tools/perf/arch/x86/util/pmu.c b/tools/perf/arch/x86/util/pmu.c
index cd22e80e5657..0f49ff13cfe2 100644
--- a/tools/perf/arch/x86/util/pmu.c
+++ b/tools/perf/arch/x86/util/pmu.c
@@ -35,8 +35,12 @@ void perf_pmu__arch_init(struct perf_pmu *pmu __maybe_unused)
if (x86__is_amd_cpu()) {
if (!strcmp(pmu->name, "ibs_op"))
pmu->mem_events = perf_mem_events_amd;
- } else if (pmu->is_core)
- pmu->mem_events = perf_mem_events_intel;
+ } else if (pmu->is_core) {
+ if (perf_pmu__have_event(pmu, "mem-loads-aux"))
+ pmu->mem_events = perf_mem_events_intel_aux;
+ else
+ pmu->mem_events = perf_mem_events_intel;
+ }
}

int perf_pmus__num_mem_pmus(void)
diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
index 27a33dc44964..c9a40b64e538 100644
--- a/tools/perf/util/mem-events.c
+++ b/tools/perf/util/mem-events.c
@@ -17,17 +17,17 @@

unsigned int perf_mem_events__loads_ldlat = 30;

-#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
+#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }

struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
- E("ldlat-loads", "cpu/mem-loads,ldlat=%u/P", "cpu/events/mem-loads"),
- E("ldlat-stores", "cpu/mem-stores/P", "cpu/events/mem-stores"),
- E(NULL, NULL, NULL),
+ E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "cpu/events/mem-loads", true, 0),
+ E("ldlat-stores", "%s/mem-stores/P", "cpu/events/mem-stores", false, 0),
+ E(NULL, NULL, NULL, false, 0),
};
#undef E

static char mem_loads_name[100];
-static bool mem_loads_name__init;
+static char mem_stores_name[100];

struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i)
{
@@ -62,23 +62,45 @@ struct perf_pmu *perf_mem_events_find_pmu(void)
return perf_pmus__scan_mem(NULL);
}

-const char * __weak perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
+static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
{
- struct perf_mem_event *e = &perf_mem_events[i];
+ struct perf_mem_event *e = &pmu->mem_events[i];

if (!e)
return NULL;

- if (i == PERF_MEM_EVENTS__LOAD) {
- if (!mem_loads_name__init) {
- mem_loads_name__init = true;
- scnprintf(mem_loads_name, sizeof(mem_loads_name),
- e->name, perf_mem_events__loads_ldlat);
+ if (i == PERF_MEM_EVENTS__LOAD || i == PERF_MEM_EVENTS__LOAD_STORE) {
+ if (e->ldlat) {
+ if (!e->aux_event) {
+ /* ARM and Most of Intel */
+ scnprintf(mem_loads_name, sizeof(mem_loads_name),
+ e->name, pmu->name,
+ perf_mem_events__loads_ldlat);
+ } else {
+ /* Intel with mem-loads-aux event */
+ scnprintf(mem_loads_name, sizeof(mem_loads_name),
+ e->name, pmu->name, pmu->name,
+ perf_mem_events__loads_ldlat);
+ }
+ } else {
+ if (!e->aux_event) {
+ /* AMD and POWER */
+ scnprintf(mem_loads_name, sizeof(mem_loads_name),
+ e->name, pmu->name);
+ } else
+ return NULL;
}
+
return mem_loads_name;
}

- return e->name;
+ if (i == PERF_MEM_EVENTS__STORE) {
+ scnprintf(mem_stores_name, sizeof(mem_stores_name),
+ e->name, pmu->name);
+ return mem_stores_name;
+ }
+
+ return NULL;
}

__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
@@ -175,7 +197,7 @@ void perf_pmu__mem_events_list(struct perf_pmu *pmu)
e->tag ? 13 : 0,
e->tag ? : "",
e->tag && verbose > 0 ? 25 : 0,
- e->tag && verbose > 0 ? perf_mem_events__name(j, NULL) : "",
+ e->tag && verbose > 0 ? perf_pmu__mem_events_name(j, pmu) : "",
e->supported ? ": available\n" : "");
}
}
@@ -198,15 +220,15 @@ int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,

if (!e->supported) {
pr_err("failed: event '%s' not supported\n",
- perf_mem_events__name(j, pmu->name));
+ perf_pmu__mem_events_name(j, pmu));
return -1;
}

if (perf_pmus__num_mem_pmus() == 1) {
rec_argv[i++] = "-e";
- rec_argv[i++] = perf_mem_events__name(j, NULL);
+ rec_argv[i++] = perf_pmu__mem_events_name(j, pmu);
} else {
- const char *s = perf_mem_events__name(j, pmu->name);
+ const char *s = perf_pmu__mem_events_name(j, pmu);

if (!perf_mem_event__supported(mnt, pmu, e))
continue;
diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
index 0ad301a2e424..79d342768d12 100644
--- a/tools/perf/util/mem-events.h
+++ b/tools/perf/util/mem-events.h
@@ -14,6 +14,8 @@
struct perf_mem_event {
bool record;
bool supported;
+ bool ldlat;
+ u32 aux_event;
const char *tag;
const char *name;
const char *sysfs_name;
@@ -39,7 +41,6 @@ extern struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX];
int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str);
int perf_pmu__mem_events_init(struct perf_pmu *pmu);

-const char *perf_mem_events__name(int i, const char *pmu_name);
struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i);
struct perf_pmu *perf_mem_events_find_pmu(void);
bool is_mem_loads_aux_event(struct evsel *leader);
--
2.35.1

2023-12-07 20:32:16

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH V2 0/5] Clean up perf mem

On Thu, Dec 07, 2023 at 11:23:33AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Changes since V1:
> - Fix strcmp of PMU name checking (Ravi)
> - Fix "/," typo (Ian)
> - Rename several functions with perf_pmu__mem_events prefix. (Ian)
> - Fold the header removal patch into the patch where the cleanups made.
> (Arnaldo)
> - Add reviewed-by and tested-by from Ian and Ravi

It would be good to have a Tested-by from people working on all the
affected architectures, like we got from Ravi for AMD. Can we get those?

I'm applying it locally for test building, will push to
perf-tools-next/tmp.perf-tools-next for a while, so there is some time
to test.

ARM64 (Leo?) and ppc, for PPC... humm Ravi did it, who could test it now?

- Arnaldo


--

- Arnaldo

2023-12-08 00:01:40

by Ian Rogers

Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()

On Thu, Dec 7, 2023 at 11:24 AM <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
> one.
>
> The mem_load events may have a different format. Add ldlat and aux_event
> in the struct perf_mem_event to indicate the format and the extra aux
> event.
>
> Add perf_mem_events_intel_aux[] to support the extra mem_load_aux event.
>
> Rename perf_mem_events__name to perf_pmu__mem_events_name.
>
> Tested-by: Ravi Bangoria <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>

Reviewed-by: Ian Rogers <[email protected]>

Thanks,
Ian

> - E(NULL, NULL, NULL),
> + E("ldlat-loads", "%s/mem-loads,ldlat=%u/P", "cpu/events/mem-loads", true, 0),
> + E("ldlat-stores", "%s/mem-stores/P", "cpu/events/mem-stores", false, 0),
> + E(NULL, NULL, NULL, false, 0),
> };
> #undef E
>
> static char mem_loads_name[100];
> -static bool mem_loads_name__init;
> +static char mem_stores_name[100];
>
> struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i)
> {
> @@ -62,23 +62,45 @@ struct perf_pmu *perf_mem_events_find_pmu(void)
> return perf_pmus__scan_mem(NULL);
> }
>
> -const char * __weak perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
> +static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
> {
> - struct perf_mem_event *e = &perf_mem_events[i];
> + struct perf_mem_event *e = &pmu->mem_events[i];
>
> if (!e)
> return NULL;
>
> - if (i == PERF_MEM_EVENTS__LOAD) {
> - if (!mem_loads_name__init) {
> - mem_loads_name__init = true;
> - scnprintf(mem_loads_name, sizeof(mem_loads_name),
> - e->name, perf_mem_events__loads_ldlat);
> + if (i == PERF_MEM_EVENTS__LOAD || i == PERF_MEM_EVENTS__LOAD_STORE) {
> + if (e->ldlat) {
> + if (!e->aux_event) {
> + /* ARM and Most of Intel */
> + scnprintf(mem_loads_name, sizeof(mem_loads_name),
> + e->name, pmu->name,
> + perf_mem_events__loads_ldlat);
> + } else {
> + /* Intel with mem-loads-aux event */
> + scnprintf(mem_loads_name, sizeof(mem_loads_name),
> + e->name, pmu->name, pmu->name,
> + perf_mem_events__loads_ldlat);
> + }
> + } else {
> + if (!e->aux_event) {
> + /* AMD and POWER */
> + scnprintf(mem_loads_name, sizeof(mem_loads_name),
> + e->name, pmu->name);
> + } else
> + return NULL;
> }
> +
> return mem_loads_name;
> }
>
> - return e->name;
> + if (i == PERF_MEM_EVENTS__STORE) {
> + scnprintf(mem_stores_name, sizeof(mem_stores_name),
> + e->name, pmu->name);
> + return mem_stores_name;
> + }
> +
> + return NULL;
> }
>
> __weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
> @@ -175,7 +197,7 @@ void perf_pmu__mem_events_list(struct perf_pmu *pmu)
> e->tag ? 13 : 0,
> e->tag ? : "",
> e->tag && verbose > 0 ? 25 : 0,
> - e->tag && verbose > 0 ? perf_mem_events__name(j, NULL) : "",
> + e->tag && verbose > 0 ? perf_pmu__mem_events_name(j, pmu) : "",
> e->supported ? ": available\n" : "");
> }
> }
> @@ -198,15 +220,15 @@ int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,
>
> if (!e->supported) {
> pr_err("failed: event '%s' not supported\n",
> - perf_mem_events__name(j, pmu->name));
> + perf_pmu__mem_events_name(j, pmu));
> return -1;
> }
>
> if (perf_pmus__num_mem_pmus() == 1) {
> rec_argv[i++] = "-e";
> - rec_argv[i++] = perf_mem_events__name(j, NULL);
> + rec_argv[i++] = perf_pmu__mem_events_name(j, pmu);
> } else {
> - const char *s = perf_mem_events__name(j, pmu->name);
> + const char *s = perf_pmu__mem_events_name(j, pmu);
>
> if (!perf_mem_event__supported(mnt, pmu, e))
> continue;
> diff --git a/tools/perf/util/mem-events.h b/tools/perf/util/mem-events.h
> index 0ad301a2e424..79d342768d12 100644
> --- a/tools/perf/util/mem-events.h
> +++ b/tools/perf/util/mem-events.h
> @@ -14,6 +14,8 @@
> struct perf_mem_event {
> bool record;
> bool supported;
> + bool ldlat;
> + u32 aux_event;
> const char *tag;
> const char *name;
> const char *sysfs_name;
> @@ -39,7 +41,6 @@ extern struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX];
> int perf_pmu__mem_events_parse(struct perf_pmu *pmu, const char *str);
> int perf_pmu__mem_events_init(struct perf_pmu *pmu);
>
> -const char *perf_mem_events__name(int i, const char *pmu_name);
> struct perf_mem_event *perf_pmu__mem_events_ptr(struct perf_pmu *pmu, int i);
> struct perf_pmu *perf_mem_events_find_pmu(void);
> bool is_mem_loads_aux_event(struct evsel *leader);
> --
> 2.35.1
>

2023-12-08 10:31:33

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu

On Thu, Dec 07, 2023 at 11:23:34AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> With the mem_events, perf doesn't need to read sysfs for each PMU to
> find the mem-events-supported PMU. The patch also makes it possible to
> clean up the related __weak functions later.
>
> The patch is only to add the mem_events into the perf_pmu for all ARCHs.
> It will be used in the later cleanup patches.
>
> Reviewed-by: Ian Rogers <[email protected]>
> Tested-by: Ravi Bangoria <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/arch/arm64/util/mem-events.c | 4 ++--
> tools/perf/arch/arm64/util/mem-events.h | 7 +++++++
> tools/perf/arch/arm64/util/pmu.c | 6 ++++++
> tools/perf/arch/s390/util/pmu.c | 3 +++
> tools/perf/arch/x86/util/mem-events.c | 4 ++--
> tools/perf/arch/x86/util/mem-events.h | 9 +++++++++
> tools/perf/arch/x86/util/pmu.c | 7 +++++++
> tools/perf/util/mem-events.c | 2 +-
> tools/perf/util/mem-events.h | 1 +
> tools/perf/util/pmu.c | 4 +++-
> tools/perf/util/pmu.h | 7 +++++++
> 11 files changed, 48 insertions(+), 6 deletions(-)
> create mode 100644 tools/perf/arch/arm64/util/mem-events.h
> create mode 100644 tools/perf/arch/x86/util/mem-events.h
>
> diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
> index 3bcc5c7035c2..aaa4804922b4 100644
> --- a/tools/perf/arch/arm64/util/mem-events.c
> +++ b/tools/perf/arch/arm64/util/mem-events.c
> @@ -4,7 +4,7 @@
>
> #define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
>
> -static struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
> +struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
> E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
> E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
> E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
> @@ -17,7 +17,7 @@ struct perf_mem_event *perf_mem_events__ptr(int i)
> if (i >= PERF_MEM_EVENTS__MAX)
> return NULL;
>
> - return &perf_mem_events[i];
> + return &perf_mem_events_arm[i];

I recognize that "arm_spe_0" is hard-coded, which might break if the
system registers different Arm SPE groups. But this is not an issue
introduced by this patch; we might need to consider fixing it later.

> }
>
> const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
> diff --git a/tools/perf/arch/arm64/util/mem-events.h b/tools/perf/arch/arm64/util/mem-events.h
> new file mode 100644
> index 000000000000..5fc50be4be38
> --- /dev/null
> +++ b/tools/perf/arch/arm64/util/mem-events.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARM64_MEM_EVENTS_H
> +#define _ARM64_MEM_EVENTS_H
> +
> +extern struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX];
> +
> +#endif /* _ARM64_MEM_EVENTS_H */
> diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c
> index 2a4eab2d160e..06ec9b838807 100644
> --- a/tools/perf/arch/arm64/util/pmu.c
> +++ b/tools/perf/arch/arm64/util/pmu.c
> @@ -8,6 +8,12 @@
> #include <api/fs/fs.h>
> #include <math.h>
>
> +void perf_pmu__arch_init(struct perf_pmu *pmu)
> +{
> + if (!strcmp(pmu->name, "arm_spe_0"))
> + pmu->mem_events = perf_mem_events_arm;

This is not right and it should cause a build failure on aarch64.

aarch64 reuses aarch32's file arch/arm/util/pmu.c, and that file
already defines perf_pmu__arch_init(); you should add the above change
in arch/arm/util/pmu.c instead.

Right now I cannot access a machine for testing Arm SPE, but I will
play with this patch set a bit to ensure it can pass compilation. After
that, I will seek the Arm maintainers'/reviewers' help for the test.

Thanks,
Leo

2023-12-08 18:15:49

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu



On 2023-12-08 5:29 a.m., Leo Yan wrote:
> On Thu, Dec 07, 2023 at 11:23:34AM -0800, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> With the mem_events, perf doesn't need to read sysfs for each PMU to
>> find the mem-events-supported PMU. The patch also makes it possible to
>> clean up the related __weak functions later.
>>
>> The patch is only to add the mem_events into the perf_pmu for all ARCHs.
>> It will be used in the later cleanup patches.
>>
>> Reviewed-by: Ian Rogers <[email protected]>
>> Tested-by: Ravi Bangoria <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> tools/perf/arch/arm64/util/mem-events.c | 4 ++--
>> tools/perf/arch/arm64/util/mem-events.h | 7 +++++++
>> tools/perf/arch/arm64/util/pmu.c | 6 ++++++
>> tools/perf/arch/s390/util/pmu.c | 3 +++
>> tools/perf/arch/x86/util/mem-events.c | 4 ++--
>> tools/perf/arch/x86/util/mem-events.h | 9 +++++++++
>> tools/perf/arch/x86/util/pmu.c | 7 +++++++
>> tools/perf/util/mem-events.c | 2 +-
>> tools/perf/util/mem-events.h | 1 +
>> tools/perf/util/pmu.c | 4 +++-
>> tools/perf/util/pmu.h | 7 +++++++
>> 11 files changed, 48 insertions(+), 6 deletions(-)
>> create mode 100644 tools/perf/arch/arm64/util/mem-events.h
>> create mode 100644 tools/perf/arch/x86/util/mem-events.h
>>
>> diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
>> index 3bcc5c7035c2..aaa4804922b4 100644
>> --- a/tools/perf/arch/arm64/util/mem-events.c
>> +++ b/tools/perf/arch/arm64/util/mem-events.c
>> @@ -4,7 +4,7 @@
>>
>> #define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
>>
>> -static struct perf_mem_event perf_mem_events[PERF_MEM_EVENTS__MAX] = {
>> +struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
>> E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
>> E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
>> E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
>> @@ -17,7 +17,7 @@ struct perf_mem_event *perf_mem_events__ptr(int i)
>> if (i >= PERF_MEM_EVENTS__MAX)
>> return NULL;
>>
>> - return &perf_mem_events[i];
>> + return &perf_mem_events_arm[i];
>
> I recognize that "arm_spe_0" is hard-coded, which might break if the
> system registers different Arm SPE groups. But this is not an issue
> introduced by this patch; we might need to consider fixing it later.
>
>> }
>>
>> const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
>> diff --git a/tools/perf/arch/arm64/util/mem-events.h b/tools/perf/arch/arm64/util/mem-events.h
>> new file mode 100644
>> index 000000000000..5fc50be4be38
>> --- /dev/null
>> +++ b/tools/perf/arch/arm64/util/mem-events.h
>> @@ -0,0 +1,7 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _ARM64_MEM_EVENTS_H
>> +#define _ARM64_MEM_EVENTS_H
>> +
>> +extern struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX];
>> +
>> +#endif /* _ARM64_MEM_EVENTS_H */
>> diff --git a/tools/perf/arch/arm64/util/pmu.c b/tools/perf/arch/arm64/util/pmu.c
>> index 2a4eab2d160e..06ec9b838807 100644
>> --- a/tools/perf/arch/arm64/util/pmu.c
>> +++ b/tools/perf/arch/arm64/util/pmu.c
>> @@ -8,6 +8,12 @@
>> #include <api/fs/fs.h>
>> #include <math.h>
>>
>> +void perf_pmu__arch_init(struct perf_pmu *pmu)
>> +{
>> + if (!strcmp(pmu->name, "arm_spe_0"))
>> + pmu->mem_events = perf_mem_events_arm;
>
> This is not right and it should cause a build failure on aarch64.
>
> aarch64 reuses aarch32's file arch/arm/util/pmu.c, and that file
> already defines perf_pmu__arch_init(); you should add the above change
> in arch/arm/util/pmu.c instead.
>

Sure.

> Right now I cannot access a machine for testing Arm SPE, but I will
> play with this patch set a bit to ensure it can pass compilation. After
> that, I will seek the Arm maintainers'/reviewers' help for the test.
>

Thanks. I guess I will hold the v3 until the test is done in case there
are other issues found in ARM.

Thanks,
Kan

2023-12-09 04:32:43

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 2/5] perf mem: Clean up perf_mem_events__ptr()

On Thu, Dec 07, 2023 at 11:23:35AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> The mem_events can be retrieved from the struct perf_pmu now. An ARCH
> specific perf_mem_events__ptr() is not required anymore. Remove all of
> them.
>
> The Intel hybrid has multiple mem-events-supported PMUs. But they share
> the same mem_events. Other ARCHs only support one mem-events-supported
> PMU. In the configuration, it's good enough to only configure the
> mem_events for one PMU. Add perf_mem_events_find_pmu() which returns the
> first mem-events-supported PMU.
>
> In the perf_mem_events__init(), the perf_pmus__scan() is not required
> anymore. It avoids checking the sysfs for every PMU on the system.
>
> Make the perf_mem_events__record_args() more generic. Remove the
> perf_mem_events__print_unsupport_hybrid().
>
> Since pmu is added as a new parameter, rename perf_mem_events__ptr() to
> perf_pmu__mem_events_ptr(). Several other functions also do a similar
> rename.
>
> Reviewed-by: Ian Rogers <[email protected]>
> Tested-by: Ravi Bangoria <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/arch/arm64/util/mem-events.c | 10 +--
> tools/perf/arch/x86/util/mem-events.c | 18 ++---
> tools/perf/builtin-c2c.c | 28 +++++--
> tools/perf/builtin-mem.c | 28 +++++--
> tools/perf/util/mem-events.c | 103 ++++++++++++------------
> tools/perf/util/mem-events.h | 9 ++-
> 6 files changed, 104 insertions(+), 92 deletions(-)
>
> diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
> index aaa4804922b4..2602e8688727 100644
> --- a/tools/perf/arch/arm64/util/mem-events.c
> +++ b/tools/perf/arch/arm64/util/mem-events.c
> @@ -12,17 +12,9 @@ struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
>
> static char mem_ev_name[100];
>
> -struct perf_mem_event *perf_mem_events__ptr(int i)
> -{
> - if (i >= PERF_MEM_EVENTS__MAX)
> - return NULL;
> -
> - return &perf_mem_events_arm[i];
> -}
> -
> const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
> {
> - struct perf_mem_event *e = perf_mem_events__ptr(i);
> + struct perf_mem_event *e = &perf_mem_events_arm[i];
>
> if (i >= PERF_MEM_EVENTS__MAX)
> return NULL;

Nitpick: it's good to check if 'i' is a valid value and then access the
array with a valid index.

	if (i >= PERF_MEM_EVENTS__MAX)
		return NULL;

	e = &perf_mem_events_arm[i];

> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
> index 2b81d229982c..5fb41d50118d 100644
> --- a/tools/perf/arch/x86/util/mem-events.c
> +++ b/tools/perf/arch/x86/util/mem-events.c
> @@ -28,17 +28,6 @@ struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
> E("mem-ldst", "ibs_op//", "ibs_op"),
> };
>
> -struct perf_mem_event *perf_mem_events__ptr(int i)
> -{
> - if (i >= PERF_MEM_EVENTS__MAX)
> - return NULL;
> -
> - if (x86__is_amd_cpu())
> - return &perf_mem_events_amd[i];
> -
> - return &perf_mem_events_intel[i];
> -}
> -
> bool is_mem_loads_aux_event(struct evsel *leader)
> {
> struct perf_pmu *pmu = perf_pmus__find("cpu");
> @@ -54,7 +43,12 @@ bool is_mem_loads_aux_event(struct evsel *leader)
>
> const char *perf_mem_events__name(int i, const char *pmu_name)
> {
> - struct perf_mem_event *e = perf_mem_events__ptr(i);
> + struct perf_mem_event *e;

A nitpick as well:

Given perf's mem/c2c, callers will almost always pass a valid index via
the argument 'i', but I still think this is the best place to return a
NULL pointer for an invalid index rather than returning a wild pointer.

Thus I suggest applying the check for x86 and the other archs:

	if (i >= PERF_MEM_EVENTS__MAX)
		return NULL;

> +
> + if (x86__is_amd_cpu())
> + e = &perf_mem_events_amd[i];
> + else
> + e = &perf_mem_events_intel[i];
>
> if (!e)
> return NULL;

[...]

> int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,
> char **rec_tmp, int *tmp_nr)
> {
> const char *mnt = sysfs__mount();
> + struct perf_pmu *pmu = NULL;
> int i = *argv_nr, k = 0;
> struct perf_mem_event *e;
>
> - for (int j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
> - e = perf_mem_events__ptr(j);
> - if (!e->record)
> - continue;
>
> - if (perf_pmus__num_mem_pmus() == 1) {
> - if (!e->supported) {
> - pr_err("failed: event '%s' not supported\n",
> - perf_mem_events__name(j, NULL));
> - return -1;
> - }
> + while ((pmu = perf_pmus__scan_mem(pmu)) != NULL) {
> + for (int j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
> + e = perf_pmu__mem_events_ptr(pmu, j);
>
> - rec_argv[i++] = "-e";
> - rec_argv[i++] = perf_mem_events__name(j, NULL);
> - } else {
> - struct perf_pmu *pmu = NULL;
> + if (!e->record)
> + continue;
>
> if (!e->supported) {
> - perf_mem_events__print_unsupport_hybrid(e, j);
> + pr_err("failed: event '%s' not supported\n",
> + perf_mem_events__name(j, pmu->name));
> return -1;
> }
>
> - while ((pmu = perf_pmus__scan(pmu)) != NULL) {
> + if (perf_pmus__num_mem_pmus() == 1) {
> + rec_argv[i++] = "-e";
> + rec_argv[i++] = perf_mem_events__name(j, NULL);
> + } else {
> const char *s = perf_mem_events__name(j, pmu->name);
>
> if (!perf_mem_event__supported(mnt, pmu, e))

I think we can improve this part a bit.

The current implementation uses perf_pmus__num_mem_pmus() to decide
whether the system has only one memory PMU or multiple PMUs; with
multiple PMUs, the tool iterates all memory PMUs to synthesize the event
options.

In this patch, it has changed to iterate all memory PMUs, no matter
whether the system has one memory PMU or multiple. Thus, I don't see the
point of the condition check for "perf_pmus__num_mem_pmus() == 1".
We can consolidate into unified code like:

	char *copy;
	const char *s = perf_pmu__mem_events_name(j, pmu);

	if (!s)
		continue;

	if (!perf_pmu__mem_events_supported(mnt, pmu, e))
		continue;

	copy = strdup(s);
	if (!copy)
		return -1;

	rec_argv[i++] = "-e";
	rec_argv[i++] = copy;
	rec_tmp[k++] = copy;

Thanks,
Leo

2023-12-09 05:48:49

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()

On Thu, Dec 07, 2023 at 11:23:36AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
> one.

Now memory event naming is arch dependent, because every arch has
different memory PMU names and appends specific configurations
(e.g. aarch64 appends 'ts_enable=1,...,min_latency=%u', and x86_64
appends 'ldlat=%u', etc). This is not friendly for extension, e.g. it's
impossible for users to specify any extra attributes for memory events.

This patch tries to consolidate the generation of memory event names in
a central place; its approach is to add more flags to meet special usage
cases, which means the code gets more complex (and more difficult to
maintain later).

I think we need to split the event naming into two levels: we can
consider util/mem-events.c a common layer, and maintain the common
options in it, e.g. 'latency', 'all-user', 'all-kernel',
'timestamp', 'physical_address', etc. Every arch's mem-events.c
file then converts the common options to arch-specific configurations.

In the end, this can avoid adding more and more flags into the
structure perf_mem_event.

Anyway, I also have a question about this patch itself; please see the
inline comment below.

[...]

> diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
> index 2602e8688727..eb2ef84f0fc8 100644
> --- a/tools/perf/arch/arm64/util/mem-events.c
> +++ b/tools/perf/arch/arm64/util/mem-events.c
> @@ -2,28 +2,10 @@
> #include "map_symbol.h"
> #include "mem-events.h"
>
> -#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
> +#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
>
> struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
> - E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
> - E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
> - E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
> + E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0", true, 0),
> + E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0", false, 0),
> + E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0", true, 0),

Arm SPE is an AUX event, so it should set '1' in the field '.aux_event'.

I am a bit suspicious about whether we really need the field '.aux_event';
the '.aux_event' field is only used for generating the event string.

So in the below function perf_pmu__mem_events_name(), I would prefer to
call an arch-specific function, e.g. arch_memory_event_name(event_id, cfg),
where the parameter 'event_id' passes the memory event ID for load, store,
or load-store, and the parameter 'cfg' is a pointer to the common
memory options (as mentioned above: latency, all-user or all-kernel
mode, timestamp tracing, etc).

In the end, the common layer just passes the common options down to the
arch layer and gets an event string for recording.

Thanks,
Leo

2023-12-09 06:17:51

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 4/5] perf mem: Clean up perf_mem_event__supported()

On Thu, Dec 07, 2023 at 11:23:37AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> For some ARCHs, e.g., ARM and AMD, to get the availability of the
> mem-events, perf checks the existence of a specific PMU. For the other
> ARCHs, e.g., Intel and Power, perf has to check the existence of some
> specific events.
>
> The current perf only iterates the mem-events-supported PMUs. It's not
> required to check the existence of a specific PMU anymore.

With this change, both the Arm and AMD archs have no chance to detect
whether the hardware (or the device driver) is supported, and the tool
will always assume the memory events exist on the system, right?

Thanks,
Leo

2023-12-09 06:27:37

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 5/5] perf mem: Clean up is_mem_loads_aux_event()

On Thu, Dec 07, 2023 at 11:23:38AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> The aux_event can be retrieved from the perf_pmu now. Implement a
> generic support.
>
> Reviewed-by: Ian Rogers <[email protected]>
> Tested-by: Ravi Bangoria <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/arch/x86/util/mem-events.c | 23 ++++-------------------
> tools/perf/util/mem-events.c | 14 ++++++++++++--
> 2 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
> index b776d849fc64..62df03e91c7e 100644
> --- a/tools/perf/arch/x86/util/mem-events.c
> +++ b/tools/perf/arch/x86/util/mem-events.c
> @@ -1,11 +1,9 @@
> // SPDX-License-Identifier: GPL-2.0
> -#include "util/pmu.h"
> -#include "util/pmus.h"
> -#include "util/env.h"
> -#include "map_symbol.h"
> -#include "mem-events.h"
> #include "linux/string.h"
> -#include "env.h"
> +#include "util/map_symbol.h"
> +#include "util/mem-events.h"
> +#include "mem-events.h"
> +
>
> #define MEM_LOADS_AUX 0x8203
>
> @@ -28,16 +26,3 @@ struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
> E(NULL, NULL, NULL, false, 0),
> E("mem-ldst", "%s//", NULL, false, 0),
> };
> -
> -bool is_mem_loads_aux_event(struct evsel *leader)
> -{
> - struct perf_pmu *pmu = perf_pmus__find("cpu");
> -
> - if (!pmu)
> - pmu = perf_pmus__find("cpu_core");
> -
> - if (pmu && !perf_pmu__have_event(pmu, "mem-loads-aux"))
> - return false;
> -
> - return leader->core.attr.config == MEM_LOADS_AUX;
> -}
> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
> index 0d174f161034..d418320e52e3 100644
> --- a/tools/perf/util/mem-events.c
> +++ b/tools/perf/util/mem-events.c
> @@ -103,9 +103,19 @@ static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
> return NULL;
> }
>
> -__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
> +bool is_mem_loads_aux_event(struct evsel *leader)
> {
> - return false;
> + struct perf_pmu *pmu = leader->pmu;
> + struct perf_mem_event *e;
> +
> + if (!pmu || !pmu->mem_events)
> + return false;
> +
> + e = &pmu->mem_events[PERF_MEM_EVENTS__LOAD];
> + if (!e->aux_event)
> + return false;
> +
> + return leader->core.attr.config == e->aux_event;
> }

I am wondering if we need to set the field 'aux_event' for Arm SPE.

So, a question that is maybe not actually relevant to this patch: we can
see is_mem_loads_aux_event() is invoked in the file util/record.c:

static struct evsel *evsel__read_sampler(struct evsel *evsel, struct evlist *evlist)
{
	struct evsel *leader = evsel__leader(evsel);

	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader) ||
	    is_mem_loads_aux_event(leader)) {
		...
	}

	return leader;
}

Has evsel__is_aux_event() covered the memory load aux event? If it has,
then is_mem_loads_aux_event() is not needed anymore.

Thanks,
Leo

2023-12-09 06:42:26

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu

Hi Kan,

On Fri, Dec 08, 2023 at 01:14:28PM -0500, Liang, Kan wrote:

[...]

> > Right now I cannot access a machine for testing Arm SPE, but I will
> > play with this patch set a bit to ensure it can pass compilation. After
> > that, I will seek the Arm maintainers'/reviewers' help for the test.
>
> Thanks. I guess I will hold the v3 until the test is done in case there
> are other issues found in ARM.

I will hold off on the test for a bit until this patch set addresses
the concern about the breakage issues on Arm64. Please check my reviews
in the other replies.

Thanks,
Leo

2023-12-11 18:10:00

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 2/5] perf mem: Clean up perf_mem_events__ptr()



On 2023-12-08 11:31 p.m., Leo Yan wrote:
> On Thu, Dec 07, 2023 at 11:23:35AM -0800, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> The mem_events can be retrieved from the struct perf_pmu now. An ARCH
>> specific perf_mem_events__ptr() is not required anymore. Remove all of
>> them.
>>
>> The Intel hybrid has multiple mem-events-supported PMUs. But they share
>> the same mem_events. Other ARCHs only support one mem-events-supported
>> PMU. In the configuration, it's good enough to only configure the
>> mem_events for one PMU. Add perf_mem_events_find_pmu() which returns the
>> first mem-events-supported PMU.
>>
>> In the perf_mem_events__init(), the perf_pmus__scan() is not required
>> anymore. It avoids checking the sysfs for every PMU on the system.
>>
>> Make the perf_mem_events__record_args() more generic. Remove the
>> perf_mem_events__print_unsupport_hybrid().
>>
>> Since pmu is added as a new parameter, rename perf_mem_events__ptr() to
>> perf_pmu__mem_events_ptr(). Several other functions also do a similar
>> rename.
>>
>> Reviewed-by: Ian Rogers <[email protected]>
>> Tested-by: Ravi Bangoria <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> tools/perf/arch/arm64/util/mem-events.c | 10 +--
>> tools/perf/arch/x86/util/mem-events.c | 18 ++---
>> tools/perf/builtin-c2c.c | 28 +++++--
>> tools/perf/builtin-mem.c | 28 +++++--
>> tools/perf/util/mem-events.c | 103 ++++++++++++------------
>> tools/perf/util/mem-events.h | 9 ++-
>> 6 files changed, 104 insertions(+), 92 deletions(-)
>>
>> diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
>> index aaa4804922b4..2602e8688727 100644
>> --- a/tools/perf/arch/arm64/util/mem-events.c
>> +++ b/tools/perf/arch/arm64/util/mem-events.c
>> @@ -12,17 +12,9 @@ struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
>>
>> static char mem_ev_name[100];
>>
>> -struct perf_mem_event *perf_mem_events__ptr(int i)
>> -{
>> - if (i >= PERF_MEM_EVENTS__MAX)
>> - return NULL;
>> -
>> - return &perf_mem_events_arm[i];
>> -}
>> -
>> const char *perf_mem_events__name(int i, const char *pmu_name __maybe_unused)
>> {
>> - struct perf_mem_event *e = perf_mem_events__ptr(i);
>> + struct perf_mem_event *e = &perf_mem_events_arm[i];
>>
>> if (i >= PERF_MEM_EVENTS__MAX)
>> return NULL;
>
> Nitpick: it's good to check if 'i' is a valid value and then access the
> array with a valid index.
>
> 	if (i >= PERF_MEM_EVENTS__MAX)
> 		return NULL;
>
> 	e = &perf_mem_events_arm[i];
>
>> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
>> index 2b81d229982c..5fb41d50118d 100644
>> --- a/tools/perf/arch/x86/util/mem-events.c
>> +++ b/tools/perf/arch/x86/util/mem-events.c
>> @@ -28,17 +28,6 @@ struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
>> E("mem-ldst", "ibs_op//", "ibs_op"),
>> };
>>
>> -struct perf_mem_event *perf_mem_events__ptr(int i)
>> -{
>> - if (i >= PERF_MEM_EVENTS__MAX)
>> - return NULL;
>> -
>> - if (x86__is_amd_cpu())
>> - return &perf_mem_events_amd[i];
>> -
>> - return &perf_mem_events_intel[i];
>> -}
>> -
>> bool is_mem_loads_aux_event(struct evsel *leader)
>> {
>> struct perf_pmu *pmu = perf_pmus__find("cpu");
>> @@ -54,7 +43,12 @@ bool is_mem_loads_aux_event(struct evsel *leader)
>>
>> const char *perf_mem_events__name(int i, const char *pmu_name)
>> {
>> - struct perf_mem_event *e = perf_mem_events__ptr(i);
>> + struct perf_mem_event *e;
>
> A nitpick as well:
>
> Given perf's mem/c2c, callers will almost always pass a valid index via
> the argument 'i', but I still think this is the best place to return a
> NULL pointer for an invalid index rather than returning a wild pointer.
>
> Thus I suggest applying the check for x86 and the other archs:
>
> 	if (i >= PERF_MEM_EVENTS__MAX)
> 		return NULL;
>
>> +
>> + if (x86__is_amd_cpu())
>> + e = &perf_mem_events_amd[i];
>> + else
>> + e = &perf_mem_events_intel[i];
>>
>> if (!e)
>> return NULL;
>
> [...]
>
>> int perf_mem_events__record_args(const char **rec_argv, int *argv_nr,
>> char **rec_tmp, int *tmp_nr)
>> {
>> const char *mnt = sysfs__mount();
>> + struct perf_pmu *pmu = NULL;
>> int i = *argv_nr, k = 0;
>> struct perf_mem_event *e;
>>
>> - for (int j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
>> - e = perf_mem_events__ptr(j);
>> - if (!e->record)
>> - continue;
>>
>> - if (perf_pmus__num_mem_pmus() == 1) {
>> - if (!e->supported) {
>> - pr_err("failed: event '%s' not supported\n",
>> - perf_mem_events__name(j, NULL));
>> - return -1;
>> - }
>> + while ((pmu = perf_pmus__scan_mem(pmu)) != NULL) {
>> + for (int j = 0; j < PERF_MEM_EVENTS__MAX; j++) {
>> + e = perf_pmu__mem_events_ptr(pmu, j);
>>
>> - rec_argv[i++] = "-e";
>> - rec_argv[i++] = perf_mem_events__name(j, NULL);
>> - } else {
>> - struct perf_pmu *pmu = NULL;
>> + if (!e->record)
>> + continue;
>>
>> if (!e->supported) {
>> - perf_mem_events__print_unsupport_hybrid(e, j);
>> + pr_err("failed: event '%s' not supported\n",
>> + perf_mem_events__name(j, pmu->name));
>> return -1;
>> }
>>
>> - while ((pmu = perf_pmus__scan(pmu)) != NULL) {
>> + if (perf_pmus__num_mem_pmus() == 1) {
>> + rec_argv[i++] = "-e";
>> + rec_argv[i++] = perf_mem_events__name(j, NULL);
>> + } else {
>> const char *s = perf_mem_events__name(j, pmu->name);
>>
>> if (!perf_mem_event__supported(mnt, pmu, e))
>
> I think we can improve this part a bit.
>
> The current implementation uses perf_pmus__num_mem_pmus() to decide
> whether the system has only one memory PMU or multiple PMUs; with
> multiple PMUs, the tool iterates all memory PMUs to synthesize the event
> options.
>
> In this patch, it has changed to iterate all memory PMUs, no matter
> whether the system has one memory PMU or multiple. Thus, I don't see the
> point of the condition check for "perf_pmus__num_mem_pmus() == 1".
> We can consolidate into unified code like:

Yep, I think it's doable. Also, it seems we can further clean up the
perf_pmus__num_mem_pmus(), which is a __weak function now.

It seems we just need to change the perf_mem_events_find_pmu() a little
bit and let it give both the first mem_events_pmu and the number of
mem_pmus.
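
Something like this, just a sketch of the idea (untested):

	/* Return the first mem-events-supported PMU and count them all */
	struct perf_pmu *perf_mem_events_find_pmu(int *num_mem_pmus)
	{
		struct perf_pmu *pmu = NULL, *first = NULL;
		int num = 0;

		while ((pmu = perf_pmus__scan_mem(pmu)) != NULL) {
			if (!first)
				first = pmu;
			num++;
		}

		*num_mem_pmus = num;
		return first;
	}

Then the __weak perf_pmus__num_mem_pmus() could probably go away too.
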
>
> 	char *copy;
> 	const char *s = perf_pmu__mem_events_name(j, pmu);
>
> 	if (!s)
> 		continue;
>
> 	if (!perf_pmu__mem_events_supported(mnt, pmu, e))
> 		continue;
>
> 	copy = strdup(s);
> 	if (!copy)
> 		return -1;
>
> 	rec_argv[i++] = "-e";
> 	rec_argv[i++] = copy;
> 	rec_tmp[k++] = copy;

Not sure what rec_tmp is for. It seems no one uses it.
I will check whether it can be removed.

Thanks,
Kan

>
> Thanks,
> Leo

2023-12-11 18:39:56

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()



On 2023-12-09 12:48 a.m., Leo Yan wrote:
> On Thu, Dec 07, 2023 at 11:23:36AM -0800, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
>> one.
>
> Now memory event naming is arch dependent, because every arch has
> different memory PMU names and appends specific configurations
> (e.g. aarch64 appends 'ts_enable=1,...,min_latency=%u', and x86_64
> appends 'ldlat=%u', etc). This is not friendly for extension, e.g. it's
> impossible for users to specify any extra attributes for memory events.
>
> This patch tries to consolidate the generation of memory event names in
> a central place; its approach is to add more flags to meet special usage
> cases, which means the code gets more complex (and more difficult to
> maintain later).
>
> I think we need to split the event naming into two levels: we can
> consider util/mem-events.c a common layer, and maintain the common
> options in it, e.g. 'latency', 'all-user', 'all-kernel',
> 'timestamp', 'physical_address', etc. Every arch's mem-events.c
> file then converts the common options to arch-specific configurations.
>
> In the end, this can avoid adding more and more flags into the
> structure perf_mem_event.

The purpose of this cleanup patch set is to remove the unnecessary
__weak functions, and try to make the code more generic.
The two flags have already covered all the current usage.
For now, it's good enough.

I agree that if there were more flags, this would not be a perfect
solution. But we don't have a third flag right now. Could we clean
it up later, e.g., when introducing the third flag?

I just don't want to make the patch bigger and bigger. Also, I don't
think we usually implement something just for future possibilities.


>
> Anyway, I also have a question about this patch itself; please see the
> inline comment below.
>
> [...]
>
>> diff --git a/tools/perf/arch/arm64/util/mem-events.c b/tools/perf/arch/arm64/util/mem-events.c
>> index 2602e8688727..eb2ef84f0fc8 100644
>> --- a/tools/perf/arch/arm64/util/mem-events.c
>> +++ b/tools/perf/arch/arm64/util/mem-events.c
>> @@ -2,28 +2,10 @@
>> #include "map_symbol.h"
>> #include "mem-events.h"
>>
>> -#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
>> +#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
>>
>> struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
>> - E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
>> - E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
>> - E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
>> + E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0", true, 0),
>> + E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0", false, 0),
>> + E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0", true, 0),
>
> Arm SPE is an AUX event, so it should set '1' in the field '.aux_event'.

It actually means whether an extra event is required with a mem_loads
event. I guess for ARM the answer is no.

>
> I am a bit suspicious about whether we really need the field '.aux_event';
> the '.aux_event' field is only used for generating the event string.

No, it stores the event encoding for the extra event.
ARM doesn't need it, so it's 0.
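
For example, in the perf_mem_events_intel_aux[] table in this series,
the load entry stores the encoding of the extra event:

	#define MEM_LOADS_AUX 0x8203
	...
	E("ldlat-loads", "{%s/mem-loads-aux/,%s/mem-loads,ldlat=%u/}:P", "%s/events/mem-loads", true, MEM_LOADS_AUX),

and the generic is_mem_loads_aux_event() in patch 5 just compares
leader->core.attr.config against that stored value.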

>
> So in the below function perf_pmu__mem_events_name(), I would prefer to
> call an arch-specific function, e.g. arch_memory_event_name(event_id, cfg),
> where the parameter 'event_id' passes the memory event ID for load, store,
> or load-store, and the parameter 'cfg' is a pointer to the common
> memory options (as mentioned above: latency, all-user or all-kernel
> mode, timestamp tracing, etc).

If I understand correctly, we try to avoid __weak functions in the perf
tool. If there are more parameters later, a mask may be used for them.
But, again, I think that should be an improvement for later.

Thanks,
Kan
>
> In the end, the common layer just passes the common options down to the
> arch layer and gets an event string for recording.
>
> Thanks,
> Leo

2023-12-11 18:45:30

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 4/5] perf mem: Clean up perf_mem_event__supported()



On 2023-12-09 1:17 a.m., Leo Yan wrote:
> On Thu, Dec 07, 2023 at 11:23:37AM -0800, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> For some ARCHs, e.g., ARM and AMD, to get the availability of the
>> mem-events, perf checks the existence of a specific PMU. For the other
>> ARCHs, e.g., Intel and Power, perf has to check the existence of some
>> specific events.
>>
>> The current perf only iterates the mem-events-supported PMUs. It's not
>> required to check the existence of a specific PMU anymore.
>
> With this change, both the Arm and AMD archs have no chance to detect
> whether the hardware (or the device driver) is supported, and the tool
> will always assume the memory events exist on the system, right?

Currently, Arm and AMD only check for the specific PMU. If the PMU is
detected, the memory events are supported. The patch set doesn't change
that. It just moves the check to perf_pmu__arch_init(). When the specific
PMU is initialized, the mem_events is assigned. You don't need to do a
runtime sysfs check. It should be an improvement for ARM and AMD.

Thanks,
Kan

>
> Thanks,
> Leo
>

2023-12-11 18:45:41

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 5/5] perf mem: Clean up is_mem_loads_aux_event()



On 2023-12-09 1:27 a.m., Leo Yan wrote:
> On Thu, Dec 07, 2023 at 11:23:38AM -0800, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> The aux_event can be retrieved from the perf_pmu now. Implement a
>> generic support.
>>
>> Reviewed-by: Ian Rogers <[email protected]>
>> Tested-by: Ravi Bangoria <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> tools/perf/arch/x86/util/mem-events.c | 23 ++++-------------------
>> tools/perf/util/mem-events.c | 14 ++++++++++++--
>> 2 files changed, 16 insertions(+), 21 deletions(-)
>>
>> diff --git a/tools/perf/arch/x86/util/mem-events.c b/tools/perf/arch/x86/util/mem-events.c
>> index b776d849fc64..62df03e91c7e 100644
>> --- a/tools/perf/arch/x86/util/mem-events.c
>> +++ b/tools/perf/arch/x86/util/mem-events.c
>> @@ -1,11 +1,9 @@
>> // SPDX-License-Identifier: GPL-2.0
>> -#include "util/pmu.h"
>> -#include "util/pmus.h"
>> -#include "util/env.h"
>> -#include "map_symbol.h"
>> -#include "mem-events.h"
>> #include "linux/string.h"
>> -#include "env.h"
>> +#include "util/map_symbol.h"
>> +#include "util/mem-events.h"
>> +#include "mem-events.h"
>> +
>>
>> #define MEM_LOADS_AUX 0x8203
>>
>> @@ -28,16 +26,3 @@ struct perf_mem_event perf_mem_events_amd[PERF_MEM_EVENTS__MAX] = {
>> E(NULL, NULL, NULL, false, 0),
>> E("mem-ldst", "%s//", NULL, false, 0),
>> };
>> -
>> -bool is_mem_loads_aux_event(struct evsel *leader)
>> -{
>> - struct perf_pmu *pmu = perf_pmus__find("cpu");
>> -
>> - if (!pmu)
>> - pmu = perf_pmus__find("cpu_core");
>> -
>> - if (pmu && !perf_pmu__have_event(pmu, "mem-loads-aux"))
>> - return false;
>> -
>> - return leader->core.attr.config == MEM_LOADS_AUX;
>> -}
>> diff --git a/tools/perf/util/mem-events.c b/tools/perf/util/mem-events.c
>> index 0d174f161034..d418320e52e3 100644
>> --- a/tools/perf/util/mem-events.c
>> +++ b/tools/perf/util/mem-events.c
>> @@ -103,9 +103,19 @@ static const char *perf_pmu__mem_events_name(int i, struct perf_pmu *pmu)
>> return NULL;
>> }
>>
>> -__weak bool is_mem_loads_aux_event(struct evsel *leader __maybe_unused)
>> +bool is_mem_loads_aux_event(struct evsel *leader)
>> {
>> - return false;
>> + struct perf_pmu *pmu = leader->pmu;
>> + struct perf_mem_event *e;
>> +
>> + if (!pmu || !pmu->mem_events)
>> + return false;
>> +
>> + e = &pmu->mem_events[PERF_MEM_EVENTS__LOAD];
>> + if (!e->aux_event)
>> + return false;
>> +
>> + return leader->core.attr.config == e->aux_event;
>> }
>
> I am wondering if we need to set the field 'aux_event' for Arm SPE.
>
> So, a question that is maybe not actually relevant to this patch: we can
> see is_mem_loads_aux_event() is invoked in the file util/record.c:
>
> static struct evsel *evsel__read_sampler(struct evsel *evsel, struct evlist *evlist)
> {
> 	struct evsel *leader = evsel__leader(evsel);
>
> 	if (evsel__is_aux_event(leader) || arch_topdown_sample_read(leader) ||
> 	    is_mem_loads_aux_event(leader)) {
> 		...
> 	}
>
> 	return leader;
> }
>
> Has evsel__is_aux_event() covered the memory load aux event? If it has,
> then is_mem_loads_aux_event() is not needed anymore.

They are two different things. evsel__is_aux_event() should mean an
event that requires the AUX area, like intel_pt.

The aux event for the mem_loads event, on the other hand, is an extra
event which has to be scheduled together with the mem_loads event when
sampling. It's only available on some Intel platforms, e.g., SPR.
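
As a concrete example, with the perf_mem_events_intel_aux[] entry from
this series and the default ldlat of 30, the generated group on SPR
looks like:

	{cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:P

The cpu/mem-loads-aux/ event must lead the group, but the cpu/mem-loads
member is the one we actually want to sample, which is why the sampling
event has to be switched.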

Thanks,
Kan
>
> Thanks,
> Leo

2023-12-11 19:02:15

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu



On 2023-12-09 1:34 a.m., Leo Yan wrote:
> Hi Kan,
>
> On Fri, Dec 08, 2023 at 01:14:28PM -0500, Liang, Kan wrote:
>
> [...]
>
>>> Right now I cannot access a machine for testing Arm SPE, but I will
>>> play with this patch set a bit to ensure it can pass compilation. After
>>> that, I will seek the Arm maintainers'/reviewers' help for the test.
>>
>> Thanks. I guess I will hold the v3 until the test is done in case there
>> are other issues found in ARM.
>
> I will hold off on the test for a bit until this patch set addresses
> the concern about the breakage issues on Arm64. Please check my reviews
> in the other replies.

The reviews in the other replies don't look like they break any current
usage on Arm64. I think the breakage issue is the one you described in
this patch, right?

If we move the check of "arm_spe_0" to arch/arm/util/pmu.c, it seems we
have to move the perf_mem_events_arm[] into arch/arm/util/mem-events.c
as well. Is it OK?

I'm not familiar with ARM and have no idea how those events are
organized under arm64 and arm. Could you please send a fix for the
build failure on aarch64? I will fold it into the V3.

Thanks,
Kan
>
> Thanks,
> Leo
>

2023-12-13 09:52:53

by Athira Rajeev

[permalink] [raw]
Subject: Re: [PATCH V2 0/5] Clean up perf mem



> On 08-Dec-2023, at 2:01 AM, Arnaldo Carvalho de Melo <[email protected]> wrote:
>
> Em Thu, Dec 07, 2023 at 11:23:33AM -0800, [email protected] escreveu:
>> From: Kan Liang <[email protected]>
>>
>> Changes since V1:
>> - Fix strcmp of PMU name checking (Ravi)
>> - Fix "/," typo (Ian)
>> - Rename several functions with perf_pmu__mem_events prefix. (Ian)
>> - Fold the header removal patch into the patch where the cleanups made.
>> (Arnaldo)
>> - Add reviewed-by and tested-by from Ian and Ravi
>
> It would be good to have a Tested-by from people working in all the
> architectures affected, like we got from Ravi for AMD, can we get those?
>
> I'm applying it locally for test building, will push to
> perf-tools-next/tmp.perf-tools-next for a while, so there is some time
> to test.
>
> ARM64 (Leo?) and ppc, for PPC... humm Ravi did it, who could test it now?
Hi Arnaldo, Ravi

Looking into this for testing on powerpc. Will update back.

Thanks
Athira
>
> - Arnaldo
>
>> As discussed in the below thread, the patch set is to clean up perf mem.
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Introduce generic functions perf_mem_events__ptr(),
>> perf_mem_events__name() ,and is_mem_loads_aux_event() to replace the
>> ARCH specific ones.
>> Simplify the perf_mem_event__supported().
>>
>> Only keeps the ARCH-specific perf_mem_events array in the corresponding
>> mem-events.c for each ARCH.
>>
>> There is no functional change.
>>
>> The patch set touches almost all the ARCHs, Intel, AMD, ARM, Power and
>> etc. But I can only test it on two Intel platforms.
>> Please give it try, if you have machines with other ARCHs.
>>
>> Here are the test results:
>> Intel hybrid machine:
>>
>> $perf mem record -e list
>> ldlat-loads : available
>> ldlat-stores : available
>>
>> $perf mem record -e ldlat-loads -v --ldlat 50
>> calling: record -e cpu_atom/mem-loads,ldlat=50/P -e cpu_core/mem-loads,ldlat=50/P
>>
>> $perf mem record -v
>> calling: record -e cpu_atom/mem-loads,ldlat=30/P -e cpu_atom/mem-stores/P -e cpu_core/mem-loads,ldlat=30/P -e cpu_core/mem-stores/P
>>
>> $perf mem record -t store -v
>> calling: record -e cpu_atom/mem-stores/P -e cpu_core/mem-stores/P
>>
>>
>> Intel SPR:
>> $perf mem record -e list
>> ldlat-loads : available
>> ldlat-stores : available
>>
>> $perf mem record -e ldlat-loads -v --ldlat 50
>> calling: record -e {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=50/}:P
>>
>> $perf mem record -v
>> calling: record -e {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:P -e cpu/mem-stores/P
>>
>> $perf mem record -t store -v
>> calling: record -e cpu/mem-stores/P
>>
>> Kan Liang (5):
>> perf mem: Add mem_events into the supported perf_pmu
>> perf mem: Clean up perf_mem_events__ptr()
>> perf mem: Clean up perf_mem_events__name()
>> perf mem: Clean up perf_mem_event__supported()
>> perf mem: Clean up is_mem_loads_aux_event()
>>
>> tools/perf/arch/arm64/util/mem-events.c | 36 +----
>> tools/perf/arch/arm64/util/pmu.c | 6 +
>> tools/perf/arch/powerpc/util/mem-events.c | 13 +-
>> tools/perf/arch/powerpc/util/mem-events.h | 7 +
>> tools/perf/arch/powerpc/util/pmu.c | 11 ++
>> tools/perf/arch/s390/util/pmu.c | 3 +
>> tools/perf/arch/x86/util/mem-events.c | 99 ++----------
>> tools/perf/arch/x86/util/pmu.c | 11 ++
>> tools/perf/builtin-c2c.c | 28 +++-
>> tools/perf/builtin-mem.c | 28 +++-
>> tools/perf/util/mem-events.c | 181 +++++++++++++---------
>> tools/perf/util/mem-events.h | 15 +-
>> tools/perf/util/pmu.c | 4 +-
>> tools/perf/util/pmu.h | 7 +
>> 14 files changed, 233 insertions(+), 216 deletions(-)
>> create mode 100644 tools/perf/arch/powerpc/util/mem-events.h
>> create mode 100644 tools/perf/arch/powerpc/util/pmu.c
>>
>> --
>> 2.35.1
>>
>
> --
>
> - Arnaldo


2023-12-13 13:34:10

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()

On Mon, Dec 11, 2023 at 01:39:36PM -0500, Liang, Kan wrote:
> On 2023-12-09 12:48 a.m., Leo Yan wrote:
> > On Thu, Dec 07, 2023 at 11:23:36AM -0800, [email protected] wrote:
> >> From: Kan Liang <[email protected]>
> >>
> >> Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
> >> one.
> >
> > Now memory event naming is arch dependent, because every arch has
> > different memory PMU names and appends specific configurations
> > (e.g. aarch64 appends 'ts_enable=1,...,min_latency=%u', and x86_64
> > appends 'ldlat=%u', etc). This is not friendly for extension, e.g. it's
> > impossible for users to specify any extra attributes for memory events.
> >
> > This patch tries to consolidate the generation of memory event names in
> > a central place; its approach is to add more flags to meet special usage
> > cases, which means the code gets more complex (and more difficult to
> > maintain later).
> >
> > I think we need to split the event naming into two levels: we can
> > consider util/mem-events.c a common layer, and maintain the common
> > options in it, e.g. 'latency', 'all-user', 'all-kernel',
> > 'timestamp', 'physical_address', etc. Every arch's mem-events.c
> > file then converts the common options to arch-specific configurations.
> >
> > In the end, this can avoid adding more and more flags into the
> > structure perf_mem_event.
>
> The purpose of this cleanup patch set is to remove the unnecessary
> __weak functions, and try to make the code more generic.

I understand weak functions are not very good practice. The point is
that weak functions are used when some archs have implemented a feature
but other archs have not.

I can think of a potential case for using a central place to maintain
the code for all archs - when we want to support cross analysis. But
since this patch series is not for supporting cross analysis, to be
honest, I still don't see the benefit of this series, though I know you
might be trying to support hybrid modes.

> The two flags have already covered all the current usage.
> For now, it's good enough.
>
> I agree that if there were more flags, this would not be a perfect
> solution. But we don't have a third flag right now. Could we clean
> it up later, e.g., when introducing the third flag?
>
> I just don't want to make the patch bigger and bigger. Also, I don't
> think we usually implement something just for future possibilities.

Fine for me, but please address two main concerns in the next spin:

- Fix the build failure in patch 01;
- Fix the regression concern on Arm64/AMD archs in patch 04. I will
give a bit more explanation in another reply.


[...]

> >> -#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
> >> +#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
> >>
> >> struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
> >> - E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
> >> - E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
> >> - E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
> >> + E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0", true, 0),
> >> + E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0", false, 0),
> >> + E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0", true, 0),
> >
> > Arm SPE is an AUX event, so it should set '1' in the field '.aux_event'.
>
> It actually means whether an extra event is required with a mem_loads
> event. I guess for ARM the answer is no.

Thanks for confirmation.

> > I am a bit suspicious about whether we really need the field '.aux_event';
> > the '.aux_event' field is only used for generating the event string.
>
> No, it stores the event encoding for the extra event.
> ARM doesn't need it, so it's 0.

I searched a bit and confirmed '.aux_event' is only used in
util/mem-events.c and for 'perf record'.

I failed to connect the code with "it stores the event encoding for the
extra event". Could you elaborate a bit on this?

Thanks,
Leo

2023-12-13 13:54:52

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 4/5] perf mem: Clean up perf_mem_event__supported()

On Mon, Dec 11, 2023 at 01:44:54PM -0500, Liang, Kan wrote:
>
>
> On 2023-12-09 1:17 a.m., Leo Yan wrote:
> > On Thu, Dec 07, 2023 at 11:23:37AM -0800, [email protected] wrote:
> >> From: Kan Liang <[email protected]>
> >>
> >> For some ARCHs, e.g., ARM and AMD, to get the availability of the
> >> mem-events, perf checks the existence of a specific PMU. For the other
> >> ARCHs, e.g., Intel and Power, perf has to check the existence of some
> >> specific events.
> >>
> >> The current perf only iterates the mem-events-supported PMUs. It's not
> >> required to check the existence of a specific PMU anymore.
> >
> > With this change, both the Arm and AMD archs have no chance to detect
> > whether the hardware (or the device driver) is supported, and the tool
> > will always assume the memory events exist on the system, right?
>
> Currently, Arm and AMD only check for the specific PMU. If the PMU is
> detected, the memory events are supported. The patch set doesn't change
> that. It just moves the check to perf_pmu__arch_init(). When the specific
> PMU is initialized, the mem_events is assigned. You don't need to do a
> runtime sysfs check. It should be an improvement for ARM and AMD.

Okay, I understand now. Arm SPE has a dedicated PMU, so if the
PMU is detected, then we can assume the memory events are supported.

For the other cases, perf needs to further check whether the PMU
supports the specific memory events.

This patch is fine with me; you can ignore my comment about the
regression caused by this patch.

Thanks,
Leo

2023-12-13 13:57:58

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [PATCH V2 4/5] perf mem: Clean up perf_mem_event__supported()

>>>> For some ARCHs, e.g., ARM and AMD, to get the availability of the
>>>> mem-events, perf checks the existence of a specific PMU. For the other
>>>> ARCHs, e.g., Intel and Power, perf has to check the existence of some
>>>> specific events.
>>>>
>>>> The current perf only iterates the mem-events-supported PMUs. It's not
>>>> required to check the existence of a specific PMU anymore.
>>>
>>> With this change, both Arm and AMD archs have no chance to detect if the
>>> hardware (or the device driver) is supported, and the tool will always
>>> assume the memory events exist on the system, right?
>>
>> Currently, Arm and AMD only check for the specific PMU. If the PMU is
>> detected, the memory events are supported. The patch set doesn't change
>> that. It just moves the check to perf_pmu__arch_init(). When the specific
>> PMU is initialized, the mem_events is assigned. You don't need to do a
>> runtime sysfs check. It should be an improvement for ARM and AMD.
>
> Okay, I understand now. Arm SPE has a dedicated PMU, so if the PMU is
> detected, we can assume the memory events are supported.

Same for AMD. If ibs_op// pmu is present, the mem event is supported.
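
That is, roughly the AMD counterpart of the same hook - a sketch only;
the helper and table names here are my assumptions:

	/* In the x86 perf_pmu__arch_init() (sketch) */
	if (x86__is_amd_cpu() && !strcmp(pmu->name, "ibs_op"))
		pmu->mem_events = perf_mem_events_amd;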

Thanks,
Ravi

2023-12-13 14:24:33

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu

On Mon, Dec 11, 2023 at 02:01:37PM -0500, Liang, Kan wrote:

[...]

> > I will hold off a bit on testing until this patch set addresses the
> > concern about the breakage issues on Arm64. Please check my review in
> > other replies.
>
> The reviews in the other replies don't look like they break any current
> usage on Arm64. I think the breakage issue is what you described in this
> patch, right?

I mentioned the breakage in patch 04, but I think that concern has been
dismissed.

> If we move the check of "arm_spe_0" to arch/arm/util/pmu.c, it seems we
> have to move the perf_mem_events_arm[] into arch/arm/util/mem-events.c
> as well. Is it OK?

No. For fixing the Arm64 build, please refer to:

https://termbin.com/0dkn

> I'm not familiar with ARM and have no idea how those events are
> organized under arm64 and arm. Could you please send a fix for the
> build failure on aarch64? I will fold it into the V3.

After applying the change in the above link on top of your patch set,
it builds successfully on my side. Hope it's helpful.

Thanks,
Leo


>
> Thanks,
> Kan
> >
> > Thanks,
> > Leo
> >

2023-12-13 16:18:04

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()



On 2023-12-13 8:33 a.m., Leo Yan wrote:
>>> I am a bit suspicious about whether we really need the field '.aux_event';
>>> it is only used for generating the event string.
>> No, it stores the event encoding for the extra event.
>> ARM doesn't need it, so it's 0.
> I searched a bit and confirmed '.aux_event' is only used in
> util/mem-events.c and for 'perf record'.
>
> I failed to connect the code with "it stores the event encoding for the
> extra event". Could you elaborate a bit on this?

The details of why the mem_load aux event was introduced can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=61b985e3e775a3a75fda04ce7ef1b1aefc4758bc

A mem_load_aux event is a new requirement for SPR. For the other Intel
platforms, a single mem_load event is good enough to collect the data
source information. But for SPR, we have to group both the mem_load
event and the mem_load_aux event when collecting the data source
information. In the group, the mem_load_aux event must be the leader
event. But for the sample-read case, only sampling the mem_load event
makes sense. So is_mem_loads_aux_event() was introduced to switch the
sampling event to the mem_load event. Here is the perf tool patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a57d40832dc8366bc517bcbbfdb1d7fb583735b

The .aux_event field stores the event encoding of the mem_load_aux event.
If it is the leader of a sample-read group, we should use the second
event as the sampling event.
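
In code, the check can be as simple as comparing the leader's config
against the stored encoding - a minimal sketch (the exact helper in the
series may differ):

	/* tools/perf/util/mem-events.c (sketch) */
	bool is_mem_loads_aux_event(struct evsel *leader)
	{
		struct perf_pmu *pmu = leader->pmu;
		struct perf_mem_event *e;

		if (!pmu || !pmu->mem_events)
			return false;

		e = &pmu->mem_events[PERF_MEM_EVENTS__LOAD];
		if (!e->aux_event)
			return false;

		/* The group leader is the mem-loads-aux event, so
		 * sampling should be taken from the second event.
		 */
		return leader->core.attr.config == e->aux_event;
	}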

Thanks,
Kan

2023-12-13 16:21:14

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 1/5] perf mem: Add mem_events into the supported perf_pmu



On 2023-12-13 9:24 a.m., Leo Yan wrote:
> On Mon, Dec 11, 2023 at 02:01:37PM -0500, Liang, Kan wrote:
>
> [...]
>
>>> I will hold off a bit on testing until this patch set addresses the
>>> concern about the breakage issues on Arm64. Please check my review in
>>> other replies.
>>
>> The reviews in the other replies don't look like they break any current
>> usage on Arm64. I think the breakage issue is what you described in this
>> patch, right?
>
> I mentioned the breakage in patch 04, but I think that concern has been
> dismissed.
>
>> If we move the check of "arm_spe_0" to arch/arm/util/pmu.c, it seems we
>> have to move the perf_mem_events_arm[] into arch/arm/util/mem-events.c
>> as well. Is it OK?
>
> No. For fixing the Arm64 build, please refer to:
>
> https://termbin.com/0dkn


That's great! Thanks a lot!

>
>> I'm not familiar with ARM and have no idea how those events are
>> organized under arm64 and arm. Could you please send a fix for the
>> build failure on aarch64? I will fold it into the V3.
>
> After applying the change in the above link on top of your patch set,
> it builds successfully on my side. Hope it's helpful.
>

Yes, that helps a lot. I will send the V3 shortly.

Thanks,
Kan

2023-12-13 17:33:53

by Ian Rogers

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()

On Wed, Dec 13, 2023 at 5:33 AM Leo Yan <[email protected]> wrote:
>
> On Mon, Dec 11, 2023 at 01:39:36PM -0500, Liang, Kan wrote:
> > On 2023-12-09 12:48 a.m., Leo Yan wrote:
> > > On Thu, Dec 07, 2023 at 11:23:36AM -0800, [email protected] wrote:
> > >> From: Kan Liang <[email protected]>
> > >>
> > >> Introduce a generic perf_mem_events__name(). Remove the ARCH-specific
> > >> one.
> > >
> > > Now memory event naming is arch dependent, this is because every arch
> > > has different memory PMU names, and it appends specific configurations
> > > (e.g. aarch64 appends 'ts_enable=1,...,min_latency=%u', and x86_64
> > > appends 'ldlat=%u', etc). This is not friendly for extension, e.g. it's
> > > impossible for users to specify any extra attributes for memory events.
> > >
> > > This patch tries to consolidate in a central place for generating memory
> > > event names, its approach is to add more flags to meet special usage
> > > cases, which means the code gets more complex (and more difficult to
> > > maintain later).
> > >
> > > I think we need to split the event naming into two levels: in
> > > util/mem-events.c, we can consider it as a common layer, and we maintain
> > > common options in it, e.g. 'latency', 'all-user', 'all-kernel',
> > > 'timestamp', 'physical_address', etc. In every arch's mem-events.c
> > > file, it converts the common options to arch specific configurations.
> > >
> > > In the end, this can avoid to add more and more flags into the
> > > structure perf_mem_event.
> >
> > The purpose of this cleanup patch set is to remove the unnecessary
> > __weak functions, and try to make the code more generic.
>
> I understand weak functions are not very good practice. The point is
> weak functions are used when some archs have implemented a feature but
> other archs have not.
>
> I can think of a potential case for using a central place to maintain
> the code for all archs - when we want to support cross analysis. But this
> patch series is not for supporting cross analysis; to be honest, I still
> don't see the benefit of this series, though I know you might try to
> support hybrid modes.

So thanks to Kan for doing this series and trying to clean the code
up. My complaint about the code is that it was overly hard wired.
Heck, we have event strings to parse that hard code PMU and event
names. In fixing hybrid my complaint was that we were adding yet more
complexity and as a lot of this was resting on printf format strings
it was hard to check that these were being used correctly. The
direction of this patch series I agree with.

Imo we should avoid weak definitions. Weak definitions are outside of
the C specification and have introduced bugs into the code base -
specifically a weak const array was having its size propagated into
code but then the linker later replaced that weak array. An
architecture #ifdef is better than a weak definition as the behavior
is defined and things blow up early rather than suffering from subtle
corruptions.
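
A tiny illustration of that failure mode (hypothetical files, just to
show the mechanism):

	/* a.c: weak definition; sizeof() is resolved at compile time */
	const int tbl[1] __attribute__((weak)) = { 0 };
	const unsigned int tbl_entries = sizeof(tbl) / sizeof(tbl[0]); /* 1 */

	/* b.c: strong definition that the linker actually keeps */
	const int tbl[4] = { 1, 2, 3, 4 };

	/* After linking, 'tbl' has 4 entries but 'tbl_entries' is
	 * still 1, so any loop bounded by it silently skips elements.
	 */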

The Linux kernel device driver is abstracting what the hardware is
doing and since the more recent changes the PMU abstraction in the
perf tool is trying to match that. If we need to interface with PMUs
differently on each architecture then something is wrong. It happens
that things are wrong and we need to work with that. For example, on
intel there are topdown events that introduce ordering issues. We have
default initialization functions for different PMUs. The perf_pmu
struct is created in perf_pmu__lookup and that always calls an
perf_pmu__arch_init where the name of the PMU is already set. In the
weak perf_pmu__arch_init we tweak the perf_pmu struct so that it will
behave correctly elsewhere in the code. These changes are paralleling
that. That said, the Intel perf_pmu__arch_init does strcmps against
"intel_pt" and "intel_bts"; does it really need to be arch specific
given it is already PMU specific? A situation we see in testing is
people trying to run user-space ISA emulators like qemu, say ARM on
x86; should the PMU setup for intel_pt and intel_bts be broken in
this environment? I think that as the names are very specific I'd
prefer if the code were outside of the tools/perf/arch directory.
There are cases with PMU names like "cpu" where we're probably going
to need ifdefs or runtime is_intel() calls.

Anyway, my point is that I think we should be moving away from arch
specific stuff, as this patch series tries, and that way we have the
best chance of changes benefitting users regardless of hardware. It
may be that to make all of this work properly we need to modify PMUs,
but that all seems good, such as adding the extended type support for
legacy events on ARM PMUs so that legacy events can work without a
fixed CPU. We haven't got core PMUs working properly, see the recent
perf top hybrid problem. There are obviously issues with uncore as,
for example, most memory controllers are replicated PMUs but some
kernel perf drivers select a memory controller via a config value. We
either need to change the drivers for some kind of uniformity or do
some kind of abstracting in the perf tool. I think we'll probably need
to do it in the tool and when doing that we really shouldn't be doing
it in an arch specific or weak function way.

Thanks,
Ian

> > The two flags have already covered all the current usage.
> > For now, it's good enough.
> >
> > I agree that if there were more flags, this would not be a perfect
> > solution. But we don't have a third flag right now. Could we clean it
> > up later, e.g., when introducing the third flag?
> >
> > I just don't want to make the patch bigger and bigger. Also, I don't
> > think we usually implement something just for future possibilities.
>
> Fine by me, but please address two main concerns in the next spin:
>
> - Fix the build failure in patch 01;
> - Fix the suspected regression on Arm64/AMD archs in patch 04. I will
> give a bit more explanation in another reply.
>
>
> [...]
>
> > >> -#define E(t, n, s) { .tag = t, .name = n, .sysfs_name = s }
> > >> +#define E(t, n, s, l, a) { .tag = t, .name = n, .sysfs_name = s, .ldlat = l, .aux_event = a }
> > >>
> > >> struct perf_mem_event perf_mem_events_arm[PERF_MEM_EVENTS__MAX] = {
> > >> - E("spe-load", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0"),
> > >> - E("spe-store", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0"),
> > >> - E("spe-ldst", "arm_spe_0/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0"),
> > >> + E("spe-load", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=0,min_latency=%u/", "arm_spe_0", true, 0),
> > >> + E("spe-store", "%s/ts_enable=1,pa_enable=1,load_filter=0,store_filter=1/", "arm_spe_0", false, 0),
> > >> + E("spe-ldst", "%s/ts_enable=1,pa_enable=1,load_filter=1,store_filter=1,min_latency=%u/", "arm_spe_0", true, 0),
> > >
> > > Arm SPE is an AUX event; '1' should be set in the field '.aux_event'.
> >
> > It actually indicates whether an extra event is required alongside a
> > mem_loads event. I guess for ARM the answer is no.
>
> Thanks for confirmation.
>
> > I am a bit suspicious about whether we really need the field '.aux_event';
> > it is only used for generating the event string.
> >
> > No, it stores the event encoding for the extra event.
> > ARM doesn't need it, so it's 0.
>
> I searched a bit and confirmed '.aux_event' is only used in
> util/mem-events.c and for 'perf record'.
>
> I failed to connect the code with "it stores the event encoding for the
> extra event". Could you elaborate a bit on this?
>
> Thanks,
> Leo

2023-12-13 19:56:41

by Liang, Kan

[permalink] [raw]
Subject: Re: [PATCH V2 0/5] Clean up perf mem



On 2023-12-13 4:51 a.m., Athira Rajeev wrote:
>
>
>> On 08-Dec-2023, at 2:01 AM, Arnaldo Carvalho de Melo <[email protected]> wrote:
>>
>> Em Thu, Dec 07, 2023 at 11:23:33AM -0800, [email protected] escreveu:
>>> From: Kan Liang <[email protected]>
>>>
>>> Changes since V1:
>>> - Fix strcmp of PMU name checking (Ravi)
>>> - Fix "/," typo (Ian)
>>> - Rename several functions with perf_pmu__mem_events prefix. (Ian)
>>> - Fold the header removal patch into the patch where the cleanups made.
>>> (Arnaldo)
>>> - Add reviewed-by and tested-by from Ian and Ravi
>>
>> It would be good to have a Tested-by from people working in all the
>> architectures affected, like we got from Ravi for AMD. Can we get those?
>>
>> I'm applying it locally for test building, will push to
>> perf-tools-next/tmp.perf-tools-next for a while, so there is some time
>> to test.
>>
>> ARM64 (Leo?) and ppc, for PPC... humm Ravi did it, who could test it now?
> Hi Arnaldo, Ravi
>
> Looking into this for testing on powerpc. Will update back.
>

Thanks Athira. I've sent out the latest V3. Please give it a try.

https://lore.kernel.org/lkml/[email protected]/

Thanks,
Kan

> Thanks
> Athira
>>
>> - Arnaldo
>>
>>> As discussed in the below thread, the patch set is to clean up perf mem.
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Introduce generic functions perf_mem_events__ptr(),
>>> perf_mem_events__name() ,and is_mem_loads_aux_event() to replace the
>>> ARCH specific ones.
>>> Simplify the perf_mem_event__supported().
>>>
>>> Only keeps the ARCH-specific perf_mem_events array in the corresponding
>>> mem-events.c for each ARCH.
>>>
>>> There is no functional change.
>>>
>>> The patch set touches almost all the ARCHs, Intel, AMD, ARM, Power and
>>> etc. But I can only test it on two Intel platforms.
>>> Please give it try, if you have machines with other ARCHs.
>>>
>>> Here are the test results:
>>> Intel hybrid machine:
>>>
>>> $perf mem record -e list
>>> ldlat-loads : available
>>> ldlat-stores : available
>>>
>>> $perf mem record -e ldlat-loads -v --ldlat 50
>>> calling: record -e cpu_atom/mem-loads,ldlat=50/P -e cpu_core/mem-loads,ldlat=50/P
>>>
>>> $perf mem record -v
>>> calling: record -e cpu_atom/mem-loads,ldlat=30/P -e cpu_atom/mem-stores/P -e cpu_core/mem-loads,ldlat=30/P -e cpu_core/mem-stores/P
>>>
>>> $perf mem record -t store -v
>>> calling: record -e cpu_atom/mem-stores/P -e cpu_core/mem-stores/P
>>>
>>>
>>> Intel SPR:
>>> $perf mem record -e list
>>> ldlat-loads : available
>>> ldlat-stores : available
>>>
>>> $perf mem record -e ldlat-loads -v --ldlat 50
>>> calling: record -e {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=50/}:P
>>>
>>> $perf mem record -v
>>> calling: record -e {cpu/mem-loads-aux/,cpu/mem-loads,ldlat=30/}:P -e cpu/mem-stores/P
>>>
>>> $perf mem record -t store -v
>>> calling: record -e cpu/mem-stores/P
>>>
>>> Kan Liang (5):
>>> perf mem: Add mem_events into the supported perf_pmu
>>> perf mem: Clean up perf_mem_events__ptr()
>>> perf mem: Clean up perf_mem_events__name()
>>> perf mem: Clean up perf_mem_event__supported()
>>> perf mem: Clean up is_mem_loads_aux_event()
>>>
>>> tools/perf/arch/arm64/util/mem-events.c | 36 +----
>>> tools/perf/arch/arm64/util/pmu.c | 6 +
>>> tools/perf/arch/powerpc/util/mem-events.c | 13 +-
>>> tools/perf/arch/powerpc/util/mem-events.h | 7 +
>>> tools/perf/arch/powerpc/util/pmu.c | 11 ++
>>> tools/perf/arch/s390/util/pmu.c | 3 +
>>> tools/perf/arch/x86/util/mem-events.c | 99 ++----------
>>> tools/perf/arch/x86/util/pmu.c | 11 ++
>>> tools/perf/builtin-c2c.c | 28 +++-
>>> tools/perf/builtin-mem.c | 28 +++-
>>> tools/perf/util/mem-events.c | 181 +++++++++++++---------
>>> tools/perf/util/mem-events.h | 15 +-
>>> tools/perf/util/pmu.c | 4 +-
>>> tools/perf/util/pmu.h | 7 +
>>> 14 files changed, 233 insertions(+), 216 deletions(-)
>>> create mode 100644 tools/perf/arch/powerpc/util/mem-events.h
>>> create mode 100644 tools/perf/arch/powerpc/util/pmu.c
>>>
>>> --
>>> 2.35.1
>>>
>>
>> --
>>
>> - Arnaldo
>
>
>

2023-12-18 03:21:22

by Leo Yan

[permalink] [raw]
Subject: Re: [PATCH V2 3/5] perf mem: Clean up perf_mem_events__name()

Hi Ian,

On Wed, Dec 13, 2023 at 09:33:24AM -0800, Ian Rogers wrote:

Sorry for the late response; I was on leave at the end of last week.

[...]

> > > The purpose of this cleanup patch set is to remove the unnecessary
> > > __weak functions, and try to make the code more generic.
> >
> > I understand weak functions are not very good practice. The point is
> > weak functions are used when some archs have implemented a feature but
> > other archs have not.
> >
> > I can think of a potential case for using a central place to maintain
> > the code for all archs - when we want to support cross analysis. But this
> > patch series is not for supporting cross analysis; to be honest, I still
> > don't see the benefit of this series, though I know you might try to
> > support hybrid modes.
>
> So thanks to Kan for doing this series and trying to clean the code
> up. My complaint about the code is that it was overly hard wired.
> Heck, we have event strings to parse that hard code PMU and event
> names. In fixing hybrid my complaint was that we were adding yet more
> complexity and as a lot of this was resting on printf format strings
> it was hard to check that these were being used correctly. The
> direction of this patch series I agree with.

I agreed as well ;)

> Imo we should avoid weak definitions. Weak definitions are outside of
> the C specification and have introduced bugs into the code base -
> specifically a weak const array was having its size propagated into
> code but then the linker later replaced that weak array. An
> architecture #ifdef is better than a weak definition as the behavior
> is defined and things blow up early rather than suffering from subtle
> corruptions.

Thanks a lot for sharing.

> The Linux kernel device driver is abstracting what the hardware is
> doing and since the more recent changes the PMU abstraction in the
> perf tool is trying to match that. If we need to interface with PMUs
> differently on each architecture then something is wrong. It happens
> that things are wrong and we need to work with that. For example, on
> intel there are topdown events that introduce ordering issues. We have
> default initialization functions for different PMUs. The perf_pmu
> struct is created in perf_pmu__lookup and that always calls an
> perf_pmu__arch_init where the name of the PMU is already set. In the
> weak perf_pmu__arch_init we tweak the perf_pmu struct so that it will
> behave correctly elsewhere in the code. These changes are paralleling
> that. That said, the Intel perf_pmu__arch_init does strcmps against
> "intel_pt" and "intel_bts"; does it really need to be arch specific
> given it is already PMU specific?

To be clear, I don't object to refactoring; I am just wondering if there
is a better approach.

Your above question is a good point. I admit the implementation in the
arch's perf_pmu__arch_init() is not good practice. IIUC, you are seeking
a general approach for dynamically probing PMUs (and the associated
events).

What I can think of is using a statically linked PMU descriptor + init
callback function, which is widely used in the Linux kernel for machine
initialization (refer to [1]).

Likewise, we can define a descriptor for every PMU and link the
descriptor into a data section, e.g.:

	static const struct perf_pmu __intel_pt_pmu
		__section(".pmu.info.init") = {
		.name = "intel_pt",
		.auxtrace = true,
		.selectable = true,
		.perf_event_attr_init_default = intel_pt_pmu_default_config,
		.mem_events = NULL,
		...
	};

As a result, perf_pmu__lookup() just iterates the descriptor array
stored in the data section ".pmu.info.init". To support more complex
cases, we can even add a callback pointer to the structure perf_pmu to
invoke PMU-specific initialization.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/include/asm/mach/arch.h#n78
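
A sketch of how the lookup side could walk such a table. One caveat:
the linker only auto-generates __start_*/__stop_* symbols when the
section name is a valid C identifier, so it would need to be e.g.
"pmu_info_init" rather than ".pmu.info.init":

	extern const struct perf_pmu __start_pmu_info_init[];
	extern const struct perf_pmu __stop_pmu_info_init[];

	static const struct perf_pmu *pmu_descriptor_find(const char *name)
	{
		const struct perf_pmu *p;

		/* Walk every descriptor placed in the section */
		for (p = __start_pmu_info_init; p < __stop_pmu_info_init; p++) {
			if (!strcmp(p->name, name))
				return p;
		}
		return NULL;
	}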

> A situation we see in testing is
> people trying to run user-space ISA emulators like qemu, say ARM on
> x86; should the PMU setup for intel_pt and intel_bts be broken in
> this environment?

This is a good case, and also a complex one.

> I think that as the names are very specific I'd
> prefer if the code were outside of the tools/perf/arch directory.

I am not sure I understand your meaning.

Anyway, let me expand a bit on this patch series. For instance, the
perf_pmu__mem_events_name() function, as a central place, generates
memory event names for different archs (and even with different
attributes). This means architecture-specific things are maintained in
the perf core layer.

Rather than adding a data pointer '.mem_events' into struct perf_pmu,
I'd like to add a new callback (say .mem_event_init()) to struct
perf_pmu; this function can return the memory event string.

In the end, we can decouple the perf core layer from arch-specific
code - and if we really want to support the cross-runtime case (e.g. an
Arm binary on an x86 machine), we can hook in via the linked descriptors
mentioned above.
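
As a rough shape - the callback name and signature below are only
illustrative, not a concrete proposal:

	struct perf_pmu {
		...
		/* PMU-specific hook: build the event string for one
		 * memory event, applying any extra attributes.
		 */
		const char *(*mem_event_name)(struct perf_pmu *pmu,
					      int mem_event_idx,
					      unsigned int ldlat);
	};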

> There are cases with PMU names like "cpu" where we're probably going
> to need ifdefs or runtime is_intel() calls.
>
> Anyway, my point is that I think we should be moving away from arch
> specific stuff, as this patch series tries, and that way we have the
> best chance of changes benefitting users regardless of hardware.

To be clear, I agree it would be great if we could build all archs into
a single perf binary to support cross runtime.

On the other hand, I don't think it's a good idea to totally remove arch
folders.

> It may be that to make all of this work properly we need to modify PMUs,
> but that all seems good, such as adding the extended type support for
> legacy events on ARM PMUs so that legacy events can work without a
> fixed CPU. We haven't got core PMUs working properly, see the recent
> perf top hybrid problem. There are obviously issues with uncore as,
> for example, most memory controllers are replicated PMUs but some
> kernel perf drivers select a memory controller via a config value. We
> either need to change the drivers for some kind of uniformity or do
> some kind of abstracting in the perf tool. I think we'll probably need
> to do it in the tool and when doing that we really shouldn't be doing
> it in an arch specific or weak function way.

Thanks for bringing this up. Now I understand your concerns a bit better.

Leo