2022-08-05 10:57:14

by Stephane Eranian

[permalink] [raw]
Subject: [BUG] perf/x86/intel/pebs: PEBS timestamps overwritten

Hi,

I was alerted by an internal user that the PEBS TSC-based timestamps
do not appear
correctly in the final perf.data output file from perf record.

After some investigation, I came to the conclusion that indeed the
data->time field setup
by PEBS in the setup_pebs_fixed_sample_data() is later overwritten by
perf_events generic
code in perf_prepare_sample(). There is an ordering problem here.

Looking around we found that this problem had been uncovered back in
May 2020 and a
patch had been posted then:
https://lore.kernel.org/lkml/[email protected]/T/

However this patch was never commented upon or committed.

The problem is still present in the upstream code today.

1. perf_sample_data_init()
2. setup_pebs_fixed_sample_data(): data->time =
native_sched_clock_from_tsc(pebs->tsc);
3. perf_prepare_sample(): data->time = perf_event_clock(event);

The patch from 2020 (Andreas Kogler) fixes the problem by making the
assignment in 3.
conditioned to the value of data->time being 0. Andreas also suggested
an alternative which
would break up the call to perf_event_ouput() like this is done in the
BTS code allowing
the prepare_sample() call to be made before PEBS samples are
extracted. That would
generate some code duplication. Although this approach appears more
robust, the one
issue I see is that prepare_sample may need data that would be filled
by PEBS and
therefore it would need to be called afterwards.

Any better ideas?
Thanks.


2022-08-05 13:46:15

by Liang, Kan

[permalink] [raw]
Subject: Re: [BUG] perf/x86/intel/pebs: PEBS timestamps overwritten



On 2022-08-05 6:49 a.m., Stephane Eranian wrote:
> Hi,
>
> I was alerted by an internal user that the PEBS TSC-based timestamps
> do not appear
> correctly in the final perf.data output file from perf record.
>
> After some investigation, I came to the conclusion that indeed the
> data->time field setup
> by PEBS in the setup_pebs_fixed_sample_data() is later overwritten by
> perf_events generic
> code in perf_prepare_sample(). There is an ordering problem here.
>
> Looking around we found that this problem had been uncovered back in
> May 2020 and a
> patch had been posted then:
> https://lore.kernel.org/lkml/[email protected]/T/
>
> However this patch was never commented upon or committed.
>
> The problem is still present in the upstream code today.
>
> 1. perf_sample_data_init()
> 2. setup_pebs_fixed_sample_data(): data->time =
> native_sched_clock_from_tsc(pebs->tsc);
> 3. perf_prepare_sample(): data->time = perf_event_clock(event);
>
> The patch from 2020 (Andreas Kogler) fixes the problem by making the
> assignment in 3.
> conditioned to the value of data->time being 0. Andreas also suggested
> an alternative which
> would break up the call to perf_event_ouput() like this is done in the
> BTS code allowing
> the prepare_sample() call to be made before PEBS samples are
> extracted. That would
> generate some code duplication. Although this approach appears more
> robust, the one
> issue I see is that prepare_sample may need data that would be filled
> by PEBS and
> therefore it would need to be called afterwards.
>
> Any better ideas?

I think Andreas's patch is the most straightforward and simplest patch
to fix the issue. But, if I recall correctly, Peter prefers to minimize
the cachelines touched by the perf_sample_data_init(). So initializing
the data->time in the perf_sample_data_init() seems break the rule.

I think HW will provide more and more such kind of precise information.
Maybe we can use a flag variable to track whether the information is
already provided to avoid the overwritten.

Here are the two patches (Not test yet). The first one is to fix the
timestamp issue. The second one is to use the flag to replace
data->br_stack = NULL; as an examples. I think we can further move other
variables out of the perf_sample_data_init() to minimize the cachelines
touched by the perf_sample_data_init().


From a46c2e17597717090013d390b6d64f51a69b4ea0 Mon Sep 17 00:00:00 2001
From: Kan Liang <[email protected]>
Date: Fri, 5 Aug 2022 06:01:15 -0700
Subject: [PATCH 1/2] perf: Avoid overwriting precise timestamp

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/ds.c | 8 ++++++--
include/linux/perf_event.h | 3 +++
kernel/events/core.c | 2 +-
3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 34be3bc5151a..a2c26eaeb0d9 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1687,8 +1687,10 @@ static void setup_pebs_fixed_sample_data(struct
perf_event *event,
* We can only do this for the default trace clock.
*/
if (x86_pmu.intel_cap.pebs_format >= 3 &&
- event->attr.use_clockid == 0)
+ event->attr.use_clockid == 0) {
data->time = native_sched_clock_from_tsc(pebs->tsc);
+ data->flags |= PERF_SAMPLE_DATA_TIME;
+ }

if (has_branch_stack(event))
data->br_stack = &cpuc->lbr_stack;
@@ -1750,8 +1752,10 @@ static void
setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_sample_data_init(data, 0, event->hw.last_period);
data->period = event->hw.last_period;

- if (event->attr.use_clockid == 0)
+ if (event->attr.use_clockid == 0) {
data->time = native_sched_clock_from_tsc(basic->tsc);
+ data->flags |= PERF_SAMPLE_DATA_TIME;
+ }

/*
* We must however always use iregs for the unwinder to stay sane; the
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index da759560eec5..33054bf31fc1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -999,6 +999,7 @@ int perf_event_read_local(struct perf_event *event,
u64 *value,
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);

+#define PERF_SAMPLE_DATA_TIME 0x1

struct perf_sample_data {
/*
@@ -1012,6 +1013,7 @@ struct perf_sample_data {
union perf_sample_weight weight;
u64 txn;
union perf_mem_data_src data_src;
+ u64 flags;

/*
* The other fields, optionally {set,used} by
@@ -1061,6 +1063,7 @@ static inline void perf_sample_data_init(struct
perf_sample_data *data,
data->weight.full = 0;
data->data_src.val = PERF_MEM_NA;
data->txn = 0;
+ data->flags = 0;
}

/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 65e0bcba2e21..057c197ae106 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6799,7 +6799,7 @@ static void __perf_event_header__init_id(struct
perf_event_header *header,
data->tid_entry.tid = perf_event_tid(event, current);
}

- if (sample_type & PERF_SAMPLE_TIME)
+ if ((sample_type & PERF_SAMPLE_TIME) && !(data->flags &
PERF_SAMPLE_DATA_TIME))
data->time = perf_event_clock(event);

if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER))
--
2.35.1




The patch as below use a new flag to track the availability of the br_stack.



From c384353e17e200781145557cd5395c3f8adde6f9 Mon Sep 17 00:00:00 2001
From: Kan Liang <[email protected]>
Date: Fri, 5 Aug 2022 06:12:41 -0700
Subject: [PATCH 2/2] perf: Add perf sample data flag for br_stack

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/ds.c | 5 ++++-
include/linux/perf_event.h | 5 +++--
kernel/events/core.c | 4 ++--
3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index a2c26eaeb0d9..27c440c13bfd 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1692,8 +1692,10 @@ static void setup_pebs_fixed_sample_data(struct
perf_event *event,
data->flags |= PERF_SAMPLE_DATA_TIME;
}

- if (has_branch_stack(event))
+ if (has_branch_stack(event)) {
data->br_stack = &cpuc->lbr_stack;
+ data->flags |= PERF_SAMPLE_DATA_BR_STACK;
+ }
}

static void adaptive_pebs_save_regs(struct pt_regs *regs,
@@ -1847,6 +1849,7 @@ static void setup_pebs_adaptive_sample_data(struct
perf_event *event,
if (has_branch_stack(event)) {
intel_pmu_store_pebs_lbrs(lbr);
data->br_stack = &cpuc->lbr_stack;
+ data->flags |= PERF_SAMPLE_DATA_BR_STACK;
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 33054bf31fc1..8c4a2913ec58 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1000,6 +1000,7 @@ extern u64 perf_event_read_value(struct perf_event
*event,
u64 *enabled, u64 *running);

#define PERF_SAMPLE_DATA_TIME 0x1
+#define PERF_SAMPLE_DATA_BR_STACK 0x2

struct perf_sample_data {
/*
@@ -1008,7 +1009,6 @@ struct perf_sample_data {
*/
u64 addr;
struct perf_raw_record *raw;
- struct perf_branch_stack *br_stack;
u64 period;
union perf_sample_weight weight;
u64 txn;
@@ -1019,6 +1019,8 @@ struct perf_sample_data {
* The other fields, optionally {set,used} by
* perf_{prepare,output}_sample().
*/
+ struct perf_branch_stack *br_stack;
+
u64 type;
u64 ip;
struct {
@@ -1058,7 +1060,6 @@ static inline void perf_sample_data_init(struct
perf_sample_data *data,
/* remaining struct members initialized in perf_prepare_sample() */
data->addr = addr;
data->raw = NULL;
- data->br_stack = NULL;
data->period = period;
data->weight.full = 0;
data->data_src.val = PERF_MEM_NA;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 057c197ae106..cd19ce85ef59 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7039,7 +7039,7 @@ void perf_output_sample(struct perf_output_handle
*handle,
}

if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
- if (data->br_stack) {
+ if (data->flags & PERF_SAMPLE_DATA_BR_STACK) {
size_t size;

size = data->br_stack->nr
@@ -7339,7 +7339,7 @@ void perf_prepare_sample(struct perf_event_header
*header,

if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
int size = sizeof(u64); /* nr */
- if (data->br_stack) {
+ if (data->flags & PERF_SAMPLE_DATA_BR_STACK) {
if (perf_sample_save_hw_index(event))
size += sizeof(u64);

--
2.35.1


Thanks,
Kan

2022-08-08 09:47:31

by Ravi Bangoria

[permalink] [raw]
Subject: Re: [BUG] perf/x86/intel/pebs: PEBS timestamps overwritten

On 05-Aug-22 7:06 PM, Liang, Kan wrote:
>
>
> On 2022-08-05 6:49 a.m., Stephane Eranian wrote:
>> Hi,
>>
>> I was alerted by an internal user that the PEBS TSC-based timestamps
>> do not appear
>> correctly in the final perf.data output file from perf record.
>>
>> After some investigation, I came to the conclusion that indeed the
>> data->time field setup
>> by PEBS in the setup_pebs_fixed_sample_data() is later overwritten by
>> perf_events generic
>> code in perf_prepare_sample(). There is an ordering problem here.
>>
>> Looking around we found that this problem had been uncovered back in
>> May 2020 and a
>> patch had been posted then:
>> https://lore.kernel.org/lkml/[email protected]/T/
>>
>> However this patch was never commented upon or committed.
>>
>> The problem is still present in the upstream code today.
>>
>> 1. perf_sample_data_init()
>> 2. setup_pebs_fixed_sample_data(): data->time =
>> native_sched_clock_from_tsc(pebs->tsc);
>> 3. perf_prepare_sample(): data->time = perf_event_clock(event);
>>
>> The patch from 2020 (Andreas Kogler) fixes the problem by making the
>> assignment in 3.
>> conditioned to the value of data->time being 0. Andreas also suggested
>> an alternative which
>> would break up the call to perf_event_ouput() like this is done in the
>> BTS code allowing
>> the prepare_sample() call to be made before PEBS samples are
>> extracted. That would
>> generate some code duplication. Although this approach appears more
>> robust, the one
>> issue I see is that prepare_sample may need data that would be filled
>> by PEBS and
>> therefore it would need to be called afterwards.
>>
>> Any better ideas?
>
> I think Andreas's patch is the most straightforward and simplest patch
> to fix the issue. But, if I recall correctly, Peter prefers to minimize
> the cachelines touched by the perf_sample_data_init(). So initializing
> the data->time in the perf_sample_data_init() seems break the rule.
>
> I think HW will provide more and more such kind of precise information.
> Maybe we can use a flag variable to track whether the information is
> already provided to avoid the overwritten.

fwiw, we had similar quirks in the past. For ex: __PERF_SAMPLE_CALLCHAIN_EARLY

Thanks,
Ravi

2022-08-24 10:08:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] perf/x86/intel/pebs: PEBS timestamps overwritten


Should be 3 patches at the very least I think, the first one introducing
the new field and then follow up patches making use of it.

And yes as Ravi mentions there's the CALLCHAIN_EARLY hack that could be
cleaned up as well, making it 4 or something.

On Fri, Aug 05, 2022 at 09:36:37AM -0400, Liang, Kan wrote:

> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 34be3bc5151a..a2c26eaeb0d9 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1687,8 +1687,10 @@ static void setup_pebs_fixed_sample_data(struct
> perf_event *event,
> * We can only do this for the default trace clock.
> */
> if (x86_pmu.intel_cap.pebs_format >= 3 &&
> - event->attr.use_clockid == 0)
> + event->attr.use_clockid == 0) {

Indent fail; please add: 'set cino=(0:0' to your .vimrc or figure out
the equivalent for your editor of choice.

> data->time = native_sched_clock_from_tsc(pebs->tsc);
> + data->flags |= PERF_SAMPLE_DATA_TIME;
> + }
>
> if (has_branch_stack(event))
> data->br_stack = &cpuc->lbr_stack;

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index da759560eec5..33054bf31fc1 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -999,6 +999,7 @@ int perf_event_read_local(struct perf_event *event,
> u64 *value,
> extern u64 perf_event_read_value(struct perf_event *event,
> u64 *enabled, u64 *running);
>
> +#define PERF_SAMPLE_DATA_TIME 0x1
>
> struct perf_sample_data {
> /*
> @@ -1012,6 +1013,7 @@ struct perf_sample_data {
> union perf_sample_weight weight;
> u64 txn;
> union perf_mem_data_src data_src;
> + u64 flags;
>
> /*
> * The other fields, optionally {set,used} by

How about we call that 'sample_flags' instead and use PERF_SAMPLE_* as
we already have, something like so:


diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ee8b9ecdc03b..b0ebbb1377b9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1007,6 +1007,7 @@ struct perf_sample_data {
* Fields set by perf_sample_data_init(), group so as to
* minimize the cachelines touched.
*/
+ u64 sample_flags;
u64 addr;
struct perf_raw_record *raw;
struct perf_branch_stack *br_stack;
@@ -1056,6 +1057,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
u64 addr, u64 period)
{
/* remaining struct members initialized in perf_prepare_sample() */
+ data->sample_flags = 0;
data->addr = addr;
data->raw = NULL;
data->br_stack = NULL;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2621fd24ad26..fed447f59024 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6792,15 +6792,21 @@ static void perf_aux_sample_output(struct perf_event *event,
ring_buffer_put(rb);
}

-static void __perf_event_header__init_id(struct perf_event_header *header,
- struct perf_sample_data *data,
- struct perf_event *event)
+static u64 __perf_event_header__init_id(struct perf_event_header *header,
+ struct perf_sample_data *data,
+ struct perf_event *event)
{
u64 sample_type = event->attr.sample_type;

data->type = sample_type;
header->size += event->id_header_size;

+ /*
+ * Clear the sample flags that have already been done by the
+ * PMU driver.
+ */
+ sample_type &= ~data->sample_flags;
+
if (sample_type & PERF_SAMPLE_TID) {
/* namespace issues */
data->tid_entry.pid = perf_event_pid(event, current);
@@ -6820,6 +6826,8 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
data->cpu_entry.cpu = raw_smp_processor_id();
data->cpu_entry.reserved = 0;
}
+
+ return sample_type;
}

void perf_event_header__init_id(struct perf_event_header *header,
@@ -7302,7 +7310,7 @@ void perf_prepare_sample(struct perf_event_header *header,
struct perf_event *event,
struct pt_regs *regs)
{
- u64 sample_type = event->attr.sample_type;
+ u64 sample_type;

header->type = PERF_RECORD_SAMPLE;
header->size = sizeof(*header) + event->header_size;
@@ -7310,7 +7318,7 @@ void perf_prepare_sample(struct perf_event_header *header,
header->misc = 0;
header->misc |= perf_misc_flags(regs);

- __perf_event_header__init_id(header, data, event);
+ sample_type = __perf_event_header__init_id(header, data, event);

if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
data->ip = perf_instruction_pointer(regs);

2022-08-24 19:46:51

by Liang, Kan

[permalink] [raw]
Subject: Re: [BUG] perf/x86/intel/pebs: PEBS timestamps overwritten



On 2022-08-24 5:27 a.m., Peter Zijlstra wrote:
>
> Should be 3 patches at the very least I think, the first one introducing
> the new field and then follow up patches making use of it.
>
> And yes as Ravi mentions there's the CALLCHAIN_EARLY hack that could be
> cleaned up as well, making it 4 or something.
>
> On Fri, Aug 05, 2022 at 09:36:37AM -0400, Liang, Kan wrote:
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 34be3bc5151a..a2c26eaeb0d9 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1687,8 +1687,10 @@ static void setup_pebs_fixed_sample_data(struct
>> perf_event *event,
>> * We can only do this for the default trace clock.
>> */
>> if (x86_pmu.intel_cap.pebs_format >= 3 &&
>> - event->attr.use_clockid == 0)
>> + event->attr.use_clockid == 0) {
>
> Indent fail; please add: 'set cino=(0:0' to your .vimrc or figure out
> the equivalent for your editor of choice.
>
>> data->time = native_sched_clock_from_tsc(pebs->tsc);
>> + data->flags |= PERF_SAMPLE_DATA_TIME;
>> + }
>>
>> if (has_branch_stack(event))
>> data->br_stack = &cpuc->lbr_stack;
>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index da759560eec5..33054bf31fc1 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -999,6 +999,7 @@ int perf_event_read_local(struct perf_event *event,
>> u64 *value,
>> extern u64 perf_event_read_value(struct perf_event *event,
>> u64 *enabled, u64 *running);
>>
>> +#define PERF_SAMPLE_DATA_TIME 0x1
>>
>> struct perf_sample_data {
>> /*
>> @@ -1012,6 +1013,7 @@ struct perf_sample_data {
>> union perf_sample_weight weight;
>> u64 txn;
>> union perf_mem_data_src data_src;
>> + u64 flags;
>>
>> /*
>> * The other fields, optionally {set,used} by
>
> How about we call that 'sample_flags' instead and use PERF_SAMPLE_* as
> we already have, something like so:

True, I think we can use PERF_SAMPLE_* and avoid adding more flags.

I will implement some patches based on the suggestion.

Thanks,
Kan

>
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index ee8b9ecdc03b..b0ebbb1377b9 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1007,6 +1007,7 @@ struct perf_sample_data {
> * Fields set by perf_sample_data_init(), group so as to
> * minimize the cachelines touched.
> */
> + u64 sample_flags;
> u64 addr;
> struct perf_raw_record *raw;
> struct perf_branch_stack *br_stack;
> @@ -1056,6 +1057,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> u64 addr, u64 period)
> {
> /* remaining struct members initialized in perf_prepare_sample() */
> + data->sample_flags = 0;
> data->addr = addr;
> data->raw = NULL;
> data->br_stack = NULL;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 2621fd24ad26..fed447f59024 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6792,15 +6792,21 @@ static void perf_aux_sample_output(struct perf_event *event,
> ring_buffer_put(rb);
> }
>
> -static void __perf_event_header__init_id(struct perf_event_header *header,
> - struct perf_sample_data *data,
> - struct perf_event *event)
> +static u64 __perf_event_header__init_id(struct perf_event_header *header,
> + struct perf_sample_data *data,
> + struct perf_event *event)
> {
> u64 sample_type = event->attr.sample_type;
>
> data->type = sample_type;
> header->size += event->id_header_size;
>
> + /*
> + * Clear the sample flags that have already been done by the
> + * PMU driver.
> + */
> + sample_type &= ~data->sample_flags;
> +
> if (sample_type & PERF_SAMPLE_TID) {
> /* namespace issues */
> data->tid_entry.pid = perf_event_pid(event, current);
> @@ -6820,6 +6826,8 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> data->cpu_entry.cpu = raw_smp_processor_id();
> data->cpu_entry.reserved = 0;
> }
> +
> + return sample_type;
> }
>
> void perf_event_header__init_id(struct perf_event_header *header,
> @@ -7302,7 +7310,7 @@ void perf_prepare_sample(struct perf_event_header *header,
> struct perf_event *event,
> struct pt_regs *regs)
> {
> - u64 sample_type = event->attr.sample_type;
> + u64 sample_type;
>
> header->type = PERF_RECORD_SAMPLE;
> header->size = sizeof(*header) + event->header_size;
> @@ -7310,7 +7318,7 @@ void perf_prepare_sample(struct perf_event_header *header,
> header->misc = 0;
> header->misc |= perf_misc_flags(regs);
>
> - __perf_event_header__init_id(header, data, event);
> + sample_type = __perf_event_header__init_id(header, data, event);
>
> if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
> data->ip = perf_instruction_pointer(regs);

2022-08-24 21:07:23

by Stephane Eranian

[permalink] [raw]
Subject: Re: [BUG] perf/x86/intel/pebs: PEBS timestamps overwritten

On Wed, Aug 24, 2022 at 12:14 PM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2022-08-24 5:27 a.m., Peter Zijlstra wrote:
> >
> > Should be 3 patches at the very least I think, the first one introducing
> > the new field and then follow up patches making use of it.
> >
> > And yes as Ravi mentions there's the CALLCHAIN_EARLY hack that could be
> > cleaned up as well, making it 4 or something.
> >
> > On Fri, Aug 05, 2022 at 09:36:37AM -0400, Liang, Kan wrote:
> >
> >> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> >> index 34be3bc5151a..a2c26eaeb0d9 100644
> >> --- a/arch/x86/events/intel/ds.c
> >> +++ b/arch/x86/events/intel/ds.c
> >> @@ -1687,8 +1687,10 @@ static void setup_pebs_fixed_sample_data(struct
> >> perf_event *event,
> >> * We can only do this for the default trace clock.
> >> */
> >> if (x86_pmu.intel_cap.pebs_format >= 3 &&
> >> - event->attr.use_clockid == 0)
> >> + event->attr.use_clockid == 0) {
> >
> > Indent fail; please add: 'set cino=(0:0' to your .vimrc or figure out
> > the equivalent for your editor of choice.
> >
> >> data->time = native_sched_clock_from_tsc(pebs->tsc);
> >> + data->flags |= PERF_SAMPLE_DATA_TIME;
> >> + }
> >>
> >> if (has_branch_stack(event))
> >> data->br_stack = &cpuc->lbr_stack;
> >
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index da759560eec5..33054bf31fc1 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -999,6 +999,7 @@ int perf_event_read_local(struct perf_event *event,
> >> u64 *value,
> >> extern u64 perf_event_read_value(struct perf_event *event,
> >> u64 *enabled, u64 *running);
> >>
> >> +#define PERF_SAMPLE_DATA_TIME 0x1
> >>
> >> struct perf_sample_data {
> >> /*
> >> @@ -1012,6 +1013,7 @@ struct perf_sample_data {
> >> union perf_sample_weight weight;
> >> u64 txn;
> >> union perf_mem_data_src data_src;
> >> + u64 flags;
> >>
> >> /*
> >> * The other fields, optionally {set,used} by
> >
> > How about we call that 'sample_flags' instead and use PERF_SAMPLE_* as
> > we already have, something like so:
>
> True, I think we can use PERF_SAMPLE_* and avoid adding more flags.
>
> I will implement some patches based on the suggestion.
>
I agree with the approach as well. We reuse the PERF_SAMPLE. That
means it automatically
adjusts as we add more PERF_SAMPLE*.

> Thanks,
> Kan
>
> >
> >
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index ee8b9ecdc03b..b0ebbb1377b9 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -1007,6 +1007,7 @@ struct perf_sample_data {
> > * Fields set by perf_sample_data_init(), group so as to
> > * minimize the cachelines touched.
> > */
> > + u64 sample_flags;
> > u64 addr;
> > struct perf_raw_record *raw;
> > struct perf_branch_stack *br_stack;
> > @@ -1056,6 +1057,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
> > u64 addr, u64 period)
> > {
> > /* remaining struct members initialized in perf_prepare_sample() */
> > + data->sample_flags = 0;
> > data->addr = addr;
> > data->raw = NULL;
> > data->br_stack = NULL;
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 2621fd24ad26..fed447f59024 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6792,15 +6792,21 @@ static void perf_aux_sample_output(struct perf_event *event,
> > ring_buffer_put(rb);
> > }
> >
> > -static void __perf_event_header__init_id(struct perf_event_header *header,
> > - struct perf_sample_data *data,
> > - struct perf_event *event)
> > +static u64 __perf_event_header__init_id(struct perf_event_header *header,
> > + struct perf_sample_data *data,
> > + struct perf_event *event)
> > {
> > u64 sample_type = event->attr.sample_type;
> >
> > data->type = sample_type;
> > header->size += event->id_header_size;
> >
> > + /*
> > + * Clear the sample flags that have already been done by the
> > + * PMU driver.
> > + */
> > + sample_type &= ~data->sample_flags;
> > +
> > if (sample_type & PERF_SAMPLE_TID) {
> > /* namespace issues */
> > data->tid_entry.pid = perf_event_pid(event, current);
> > @@ -6820,6 +6826,8 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> > data->cpu_entry.cpu = raw_smp_processor_id();
> > data->cpu_entry.reserved = 0;
> > }
> > +
> > + return sample_type;
> > }
> >
> > void perf_event_header__init_id(struct perf_event_header *header,
> > @@ -7302,7 +7310,7 @@ void perf_prepare_sample(struct perf_event_header *header,
> > struct perf_event *event,
> > struct pt_regs *regs)
> > {
> > - u64 sample_type = event->attr.sample_type;
> > + u64 sample_type;
> >
> > header->type = PERF_RECORD_SAMPLE;
> > header->size = sizeof(*header) + event->header_size;
> > @@ -7310,7 +7318,7 @@ void perf_prepare_sample(struct perf_event_header *header,
> > header->misc = 0;
> > header->misc |= perf_misc_flags(regs);
> >
> > - __perf_event_header__init_id(header, data, event);
> > + sample_type = __perf_event_header__init_id(header, data, event);
> >
> > if (sample_type & (PERF_SAMPLE_IP | PERF_SAMPLE_CODE_PAGE_SIZE))
> > data->ip = perf_instruction_pointer(regs);