LinuxLists.cc - Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

2023-10-02 19:18:13

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On Mon, Sep 11, 2023 at 08:48:17AM -0700, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> Currently, the additional information of a branch entry is stored in a
> u64 space. With more and more information added, the space is running
> out. For example, the information of occurrences of events will be added
> for each branch.
>
> Add a new branch sample type, PERF_SAMPLE_BRANCH_EXTRA, to indicate
> whether to support an extra space.
>
> Two places were suggested to append the extra space.
> https://lore.kernel.org/lkml/[email protected]/
> One place is right after the flags of each branch entry. It changes the
> existing struct perf_branch_entry. In the later Intel-specific
> implementation, two separate spaces have to be created in the
> struct cpu_hw_events to store different branch entry structures. That
> duplicates space.

Well, something like so:

- struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
+
+ union {
+ struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
+ struct perf_branch_entry_ext lbr_entries_ext[MAX_LBR_ENTRIES];
+ };

would just do... you just have to be really careful to consistently pick
the right one.

Something that might help would be to do make perf_branch_stack::entries
a 'void *' and use:

struct perf_branch_entry_ext *
perf_get_branch_entry(struct perf_sample_data *data, int idx)
{
if (data->sample_flags & PERF_SAMPLE_BRANCH_EXTRA)
return (struct perf_branch_entry_ext *)data->br_stack->entries + idx;

return (struct perf_branch_entry *)data->br_stack->entries + idx;
}

> The other place is right after the entire struct perf_branch_stack.
> Only adding the new extra space in the struct cpu_hw_event is necessary.
> The disadvantage is that the pointer of the extra space has to be
> recorded. The common interface perf_sample_save_brstack() has to be
> updated as well.

Right.. probably easier.

> The latter requires less space and is much straight forward. It is
> implemented in the patch.

Same amount of space either way around. 'n*x+n*y == n*(x+y)' and all that.

> Also, add a new branch sample type, PERF_SAMPLE_BRANCH_EVT_CNTRS, to
> indicate whether include occurrences of events in branch info. The
> information will be stored in the extra space.

This... why do we need two flags?

Also, I can't find this in the SDM, how wide are these counter deltas?
ISTR they're saturating, but not how wide they are.

2023-10-02 20:57:49

by Liang, Kan

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On 2023-10-02 11:45 a.m., Peter Zijlstra wrote:
> On Mon, Sep 11, 2023 at 08:48:17AM -0700, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> Currently, the additional information of a branch entry is stored in a
>> u64 space. With more and more information added, the space is running
>> out. For example, the information of occurrences of events will be added
>> for each branch.
>>
>> Add a new branch sample type, PERF_SAMPLE_BRANCH_EXTRA, to indicate
>> whether to support an extra space.
>>
>> Two places were suggested to append the extra space.
>> https://lore.kernel.org/lkml/[email protected]/
>> One place is right after the flags of each branch entry. It changes the
>> existing struct perf_branch_entry. In the later Intel-specific
>> implementation, two separate spaces have to be created in the
>> struct cpu_hw_events to store different branch entry structures. That
>> duplicates space.
>
> Well, something like so:
>
> - struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
> +
> + union {
> + struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
> + struct perf_branch_entry_ext lbr_entries_ext[MAX_LBR_ENTRIES];
> + };
>
> would just do... you just have to be really careful to consistently pick
> the right one.
>
> Something that might help would be to do make perf_branch_stack::entries
> a 'void *' and use:
>
> struct perf_branch_entry_ext *
> perf_get_branch_entry(struct perf_sample_data *data, int idx)
> {
> if (data->sample_flags & PERF_SAMPLE_BRANCH_EXTRA)
> return (struct perf_branch_entry_ext *)data->br_stack->entries + idx;
>
> return (struct perf_branch_entry *)data->br_stack->entries + idx;
> }

I tried to avoid the above extra calculation (although it should be
tiny), since it's in a NMI handler. So I once planned to add an extra
struct perf_branch_entry_ext lbr_entries_ext[MAX_LBR_ENTRIES]; which
doubles the space.
But yes, it should be doable.

>
>> The other place is right after the entire struct perf_branch_stack.
>> Only adding the new extra space in the struct cpu_hw_event is necessary.
>> The disadvantage is that the pointer of the extra space has to be
>> recorded. The common interface perf_sample_save_brstack() has to be
>> updated as well.
>
> Right.. probably easier.

I don't see big drawbacks to it. Easier to understand and implement, so
should be easier to maintain as well.
I guess I will still use the latter, if no objection.

>
>> The latter requires less space and is much straight forward. It is
>> implemented in the patch.
>
> Same amount of space either way around. 'n*x+n*y == n*(x+y)' and all that.
>
>> Also, add a new branch sample type, PERF_SAMPLE_BRANCH_EVT_CNTRS, to
>> indicate whether include occurrences of events in branch info. The
>> information will be stored in the extra space.
>
> This... why do we need two flags?

Users may only collect the occurrences of some events in a group. The
EVT_CNTRS flag is used to indicate those events. E.g.,
perf record -e "{cpu/branch-instructions,branch_type=call/,
cpu/branch-misses,branch_type=event/}"

Only the occurrences of the branch-misses event is collected in LBR and
finally dumped into the extra buffer.

While the first flag, PERF_SAMPLE_BRANCH_EXTRA, only tells that the
extra space is required.

>
> Also, I can't find this in the SDM, how wide are these counter deltas?
> ISTR they're saturating, but not how wide they are.

Now, it's documented in the Intel® Architecture Instruction Set
Extensions and Future Features, Chapter 8, 8.6 LBR ENHANCEMENTS. It
should be moved to SDM later.
https://cdrdv2.intel.com/v1/dl/getContent/671368

Only 2 bits for each counter. Saturating at a value of 3.

Thanks,
Kan

2023-10-02 22:05:54

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On Mon, Oct 02, 2023 at 03:19:04PM -0400, Liang, Kan wrote:

> >> Also, add a new branch sample type, PERF_SAMPLE_BRANCH_EVT_CNTRS, to
> >> indicate whether include occurrences of events in branch info. The
> >> information will be stored in the extra space.
> >
> > This... why do we need two flags?
>
> Users may only collect the occurrences of some events in a group. The
> EVT_CNTRS flag is used to indicate those events. E.g.,
> perf record -e "{cpu/branch-instructions,branch_type=call/,
> cpu/branch-misses,branch_type=event/}"
>
> Only the occurrences of the branch-misses event is collected in LBR and
> finally dumped into the extra buffer.
>
> While the first flag, PERF_SAMPLE_BRANCH_EXTRA, only tells that the
> extra space is required.

Or have it implicit, I reallt don't see the point of having two bits
here.

> > Also, I can't find this in the SDM, how wide are these counter deltas?
> > ISTR they're saturating, but not how wide they are.
>
> Now, it's documented in the Intel? Architecture Instruction Set
> Extensions and Future Features, Chapter 8, 8.6 LBR ENHANCEMENTS. It
> should be moved to SDM later.
> https://cdrdv2.intel.com/v1/dl/getContent/671368
>
> Only 2 bits for each counter. Saturating at a value of 3.

Urgh, this ISE document is shite, that thing don't say how many
IA32_LBR_INFO.PMCx_CNT fields there are, I think your later patch says
4, right? And is this for arch LBR or the other thing?

(Also, what is IA32_LER_x_INFO ?)

This is then a grant total of 8 bits.

And we still have 31 spare bits in perf_branch_entry.

Why again do we need the extra u64 ?!?

More specifically, this interface is pretty crap -- suppose the next
generation of things feels that 2 bits aint' enough and goes and gives
us 4. Then what do we do?

Did I already say that the ISE document raises more questions than it
provides answers?

2023-10-03 00:58:08

by Liang, Kan

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On 2023-10-02 5:37 p.m., Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 03:19:04PM -0400, Liang, Kan wrote:
>
>>>> Also, add a new branch sample type, PERF_SAMPLE_BRANCH_EVT_CNTRS, to
>>>> indicate whether include occurrences of events in branch info. The
>>>> information will be stored in the extra space.
>>>
>>> This... why do we need two flags?
>>
>> Users may only collect the occurrences of some events in a group. The
>> EVT_CNTRS flag is used to indicate those events. E.g.,
>> perf record -e "{cpu/branch-instructions,branch_type=call/,
>> cpu/branch-misses,branch_type=event/}"
>>
>> Only the occurrences of the branch-misses event is collected in LBR and
>> finally dumped into the extra buffer.
>>
>> While the first flag, PERF_SAMPLE_BRANCH_EXTRA, only tells that the
>> extra space is required.
>
> Or have it implicit, I reallt don't see the point of having two bits
> here.

Perf has to traverse the whole group to decide whether using the extra
space. But It should be possible to use an internal flag to avoid the
traverse every time. Let me have a try.

>
>>> Also, I can't find this in the SDM, how wide are these counter deltas?
>>> ISTR they're saturating, but not how wide they are.
>>
>> Now, it's documented in the Intel® Architecture Instruction Set
>> Extensions and Future Features, Chapter 8, 8.6 LBR ENHANCEMENTS. It
>> should be moved to SDM later.
>> https://cdrdv2.intel.com/v1/dl/getContent/671368
>>
>> Only 2 bits for each counter. Saturating at a value of 3.
>
> Urgh, this ISE document is shite, that thing don't say how many
> IA32_LBR_INFO.PMCx_CNT fields there are, I think your later patch says
> 4, right? And is this for arch LBR or the other thing?
>

It's for Arch LBR. Yes, the current CPUID enumeration implies that only
4 counters.

"Per-counter support for LBR Event Logging is indicated by the “Event
Logging Supported” bitmap in CPUID.(EAX=01CH, ECX=0).ECX[19:16]"

> (Also, what is IA32_LER_x_INFO ?)

Last Event Record (LER). It records the last taken branch preceding the
last exception, hardware interrupt, or software interrupt.
Linux doesn't have it supported.

>
> This is then a grant total of 8 bits.
>
> And we still have 31 spare bits in perf_branch_entry.
>
> Why again do we need the extra u64 ?!?
>

The first version utilizes the spare bits in perf_branch_entry.
https://lore.kernel.org/lkml/[email protected]/

To address the similar concern (what should we do if more counters and a
wider bits are added later), I changed it to the extra space method
since V2.

Another consideration is that the 'events' field in the
perf_branch_entry from V1 is Intel specific. The u64 extra space is more
generic. Other ARCHs can utilize it to store other extra information if
they want.

Please let me know if I'm overthinking. I can switch back to the
'events' field of V1.

> More specifically, this interface is pretty crap -- suppose the next
> generation of things feels that 2 bits aint' enough and goes and gives
> us 4. Then what do we do?
>

The current LBR is an architectural feature. The existed fields of 2
bits 4 counters should not be changed.
But yes, it's possible to add more bits and counters into the reserved
bits. The reserved bits of the IA32_LBR_x_INFO are only 31 now. The u64
extra space should be good enough.
If more information is introduced later (e.g., a brand new
LBR_x_INFO_2), then we can add a extra_2 space.

But I don't see there is a plan to extend the IA32_LBR_x_INFO again in
the near future.

> Did I already say that the ISE document raises more questions than it
> provides answers?

Yes. Would an improved CPUID enumeration can address the questions? For
example, the CPUID enumeration can give the maximum number of counters
and supported width? I think we can discuss it with the architect.

Thanks,
Kan

2023-10-03 10:28:06

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On Mon, Oct 02, 2023 at 08:57:57PM -0400, Liang, Kan wrote:

> > Did I already say that the ISE document raises more questions than it
> > provides answers?
>
> Yes. Would an improved CPUID enumeration can address the questions? For
> example, the CPUID enumeration can give the maximum number of counters
> and supported width? I think we can discuss it with the architect.

So.. no. Suppose another arch goes and does the same, but with a
different number and width of counters. They won't have CPUID.

I'm thinking we should do something like expose branch_counter_nr and
branch_counter_width in the sysfs node, and then rename this extra field
to counters.

Then userspace can do something like:

for (i = 0; i < branch_counter_nr; i++) {
counter[i] = counters & ((1 << branch_counter_width) - 1);
counters >>= branch_counter_width;
}

to extract the actual counter values.

So then we end up with:

* { u64 nr;
* { u64 hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX
* { u64 from, to, flags } lbr[nr];
+ * { u64 counters; } cntr[nr] && PERF_SAMPLE_BRANCH_COUNTERS
* } && PERF_SAMPLE_BRANCH_STACK

Have it explicitly named counters, have only the one flag and have sysfs
files describe how to decode it.

Then for this Intel thing we have 4 counters of 2 bits, but if someone
else were to do something different (both Power and ARM64 have this
branch stack stuff now) they can describe it.

It is a bit wasteful on bits... but at least its clear I suppose.

2023-10-03 12:57:44

by Liang, Kan

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On 2023-10-03 6:27 a.m., Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 08:57:57PM -0400, Liang, Kan wrote:
>
>>> Did I already say that the ISE document raises more questions than it
>>> provides answers?
>>
>> Yes. Would an improved CPUID enumeration can address the questions? For
>> example, the CPUID enumeration can give the maximum number of counters
>> and supported width? I think we can discuss it with the architect.
>
> So.. no. Suppose another arch goes and does the same, but with a
> different number and width of counters. They won't have CPUID.
>
> I'm thinking we should do something like expose branch_counter_nr and
> branch_counter_width in the sysfs node, and then rename this extra field
> to counters.

Sure. I will expose them under the "caps" folder.

>
> Then userspace can do something like:
>
> for (i = 0; i < branch_counter_nr; i++) {
> counter[i] = counters & ((1 << branch_counter_width) - 1);
> counters >>= branch_counter_width;
> }
>
> to extract the actual counter values.
>
>
> So then we end up with:
>
> * { u64 nr;
> * { u64 hw_idx; } && PERF_SAMPLE_BRANCH_HW_INDEX
> * { u64 from, to, flags } lbr[nr];
> + * { u64 counters; } cntr[nr] && PERF_SAMPLE_BRANCH_COUNTERS
> * } && PERF_SAMPLE_BRANCH_STACK
>
> Have it explicitly named counters, have only the one flag and have sysfs
> files describe how to decode it.
>
> Then for this Intel thing we have 4 counters of 2 bits, but if someone
> else were to do something different (both Power and ARM64 have this
> branch stack stuff now) they can describe it.
>
> It is a bit wasteful on bits... but at least its clear I suppose.

Yes. I will send a V4 with the above suggestions.

Thanks,
Kan

2023-10-03 15:09:30

by Andi Kleen

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

> I'm thinking we should do something like expose branch_counter_nr and
> branch_counter_width in the sysfs node, and then rename this extra field
> to counters.
>
> Then userspace can do something like:
>
> for (i = 0; i < branch_counter_nr; i++) {
> counter[i] = counters & ((1 << branch_counter_width) - 1);
> counters >>= branch_counter_width;
> }
>
> to extract the actual counter values.

perf script/report won't necessarily have access to the sysfs
values if they run on a different system

It would need extra PT style metadata written by perf record to
perf.data and read by the user tools.

Seems complicated. It would be better if it just parsed on its own.

-Andi

2023-10-03 15:58:49

by Liang, Kan

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On 2023-10-03 11:06 a.m., Andi Kleen wrote:
>> I'm thinking we should do something like expose branch_counter_nr and
>> branch_counter_width in the sysfs node, and then rename this extra field
>> to counters.
>>
>> Then userspace can do something like:
>>
>> for (i = 0; i < branch_counter_nr; i++) {
>> counter[i] = counters & ((1 << branch_counter_width) - 1);
>> counters >>= branch_counter_width;
>> }
>>
>> to extract the actual counter values.
>
> perf script/report won't necessarily have access to the sysfs
> values if they run on a different system
>
> It would need extra PT style metadata written by perf record to
> perf.data and read by the user tools.
>
> Seems complicated. It would be better if it just parsed on its own.
>

If so, perf probably have to save the information in the
perf_branch_entry. At least, perf has to be able to support 64 counters
and 64 width. That will use at least 12 bits of the info field of the
perf_branch_entry. (We only have 31 spare bits now. Almost half.) That
seems a waste.

Perf tool already supports "branches" in the caps folder. It has been
saved in the header. We can use a similar way to support
"branch_counter_nr" and "branch_counter_width". It doesn't seem very
complex. I think I will try the sysfs method first.

Thanks,
Kan

2023-10-03 16:33:23

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On Tue, Oct 03, 2023 at 08:06:59AM -0700, Andi Kleen wrote:
> > I'm thinking we should do something like expose branch_counter_nr and
> > branch_counter_width in the sysfs node, and then rename this extra field
> > to counters.
> >
> > Then userspace can do something like:
> >
> > for (i = 0; i < branch_counter_nr; i++) {
> > counter[i] = counters & ((1 << branch_counter_width) - 1);
> > counters >>= branch_counter_width;
> > }
> >
> > to extract the actual counter values.
>
> perf script/report won't necessarily have access to the sysfs
> values if they run on a different system
>
> It would need extra PT style metadata written by perf record to
> perf.data and read by the user tools.
>
> Seems complicated. It would be better if it just parsed on its own.

Well, you really don't want to repeat the 4,2 thing in every event, that
seems daft (and a waste of space, because how large do we want those
fields to be etc..).

We don't really have a better place for these sorts of things. And yeah,
the PT thing sets precedent here.

2023-10-03 16:57:59

by Andi Kleen

[permalink] [raw]

Subject: Re: [RESEND PATCH V3 1/6] perf: Add branch stack extra

On Tue, Oct 03, 2023 at 06:33:00PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 03, 2023 at 08:06:59AM -0700, Andi Kleen wrote:
> > > I'm thinking we should do something like expose branch_counter_nr and
> > > branch_counter_width in the sysfs node, and then rename this extra field
> > > to counters.
> > >
> > > Then userspace can do something like:
> > >
> > > for (i = 0; i < branch_counter_nr; i++) {
> > > counter[i] = counters & ((1 << branch_counter_width) - 1);
> > > counters >>= branch_counter_width;
> > > }
> > >
> > > to extract the actual counter values.
> >
> > perf script/report won't necessarily have access to the sysfs
> > values if they run on a different system
> >
> > It would need extra PT style metadata written by perf record to
> > perf.data and read by the user tools.
> >
> > Seems complicated. It would be better if it just parsed on its own.
>
> Well, you really don't want to repeat the 4,2 thing in every event, that
> seems daft (and a waste of space, because how large do we want those
> fields to be etc..).

It's just a few bits? It could be an extra 16bit field or so per event.

There are probably other self describing encodings for the numbers
(e.g. some variant of LEB128 on a sub byte level), but that might be more
expensive to store it.

What would worry me is that various users would just hard code and then
fail later. There are lots of non perf tools perf.data parsers around
these days.

-Andi