From: Kan Liang <[email protected]>
------
Just a ping to check the status of this patch series.
Unfortunately something went wrong with my mailbox and all the emails for
the V2 patches were lost, so I have to resend the V2 series.
Sorry for the noise.
Changes since V1:
- Check PERF_X86_EVENT_AUTO_RELOAD before calling
intel_pmu_save_and_restart()
- Introduce a special-purpose intel_pmu_save_and_restart_reload()
just for AUTO_RELOAD.
- New patch to disable userspace RDPMC usage if large PEBS is enabled.
------
There is a bug when reading event->count with large PEBS enabled.
Here is an example.
#./read_count
0x71f0
0x122c0
0x1000000001c54
0x100000001257d
0x200000000bdc5
The bug is caused by two issues:
- In x86_perf_event_update(), the calculation of event->count does not
take the auto-reload values into account.
- In x86_pmu_read(), the values still sitting undrained in the large PEBS
buffer are not counted.
These issues were introduced together with the auto-reload mechanism by
commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload
mechanism when possible")
Also, userspace RDPMC is broken for large PEBS.
That issue was introduced when large PEBS was enabled by
commit b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
handling (large PEBS interrupt threshold)")
The source code of read_count is below.
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

typedef uint64_t u64;

static long pagesize;

struct cpu {
	int fd;
	struct perf_event_mmap_page *buf;
};

int perf_open(struct cpu *ctx, int cpu)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_HARDWARE,
		.size = sizeof(struct perf_event_attr),
		.sample_period = 100000,
		.config = 0,		/* PERF_COUNT_HW_CPU_CYCLES */
		.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
			       PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
		.precise_ip = 3,	/* enables PEBS */
		.mmap = 1,
		.comm = 1,
		.task = 1,
		.mmap2 = 1,
		.sample_id_all = 1,
		.comm_exec = 1,
	};

	ctx->buf = NULL;
	ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	if (ctx->fd < 0) {
		perror("perf_event_open");
		return -1;
	}
	return 0;
}

void perf_close(struct cpu *ctx)
{
	close(ctx->fd);
	if (ctx->buf)
		munmap(ctx->buf, pagesize);
}

int main(int ac, char **av)
{
	struct cpu ctx;
	u64 count;

	pagesize = sysconf(_SC_PAGESIZE);
	if (perf_open(&ctx, 0))
		return 1;
	while (1) {
		sleep(5);
		if (read(ctx.fd, &count, 8) != 8) {
			perror("counter read");
			break;
		}
		printf("0x%llx\n", (unsigned long long)count);
	}
	perf_close(&ctx);
	return 0;
}
Kan Liang (4):
perf/x86/intel: fix event update for auto-reload
perf/x86: introduce read function for x86_pmu
perf/x86/intel: drain PEBS buffer in event read
perf/x86: disable userspace RDPMC usage for large PEBS
arch/x86/events/core.c | 5 ++-
arch/x86/events/intel/core.c | 9 +++++
arch/x86/events/intel/ds.c | 79 ++++++++++++++++++++++++++++++++++++++++++--
arch/x86/events/perf_event.h | 3 ++
4 files changed, 93 insertions(+), 3 deletions(-)
--
2.7.4
From: Kan Liang <[email protected]>
Large PEBS needs to be handled specially when reading the event count.
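For context, this is how the new hook is reached, as I read the generic
code (an illustrative, simplified call chain, not part of the diff):

  perf_event_read()                      /* generic perf core */
    -> pmu->read(event)                  /* x86_pmu_read() */
         -> x86_pmu.read(event)          /* new optional callback, if set */
         -> x86_perf_event_update(event) /* otherwise, as before */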
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 2 ++
arch/x86/events/perf_event.h | 1 +
2 files changed, 3 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 140d332..acd7ffc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1884,6 +1884,8 @@ early_initcall(init_hw_perf_events);
static inline void x86_pmu_read(struct perf_event *event)
{
+ if (x86_pmu.read)
+ return x86_pmu.read(event);
x86_perf_event_update(event);
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f7aaadf..67426e51 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -534,6 +534,7 @@ struct x86_pmu {
void (*disable)(struct perf_event *);
void (*add)(struct perf_event *);
void (*del)(struct perf_event *);
+ void (*read)(struct perf_event *event);
int (*hw_config)(struct perf_event *event);
int (*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
unsigned eventsel;
--
2.7.4
From: Kan Liang <[email protected]>
There is a bug when reading event->count with large PEBS enabled.
Here is an example.
#./read_count
0x71f0
0x122c0
0x1000000001c54
0x100000001257d
0x200000000bdc5
The auto-reload mechanism is enabled for PEBS events in fixed-period
mode, but the calculation of event->count does not take the auto-reload
values into account. Anyone who reads event->count gets a wrong result,
e.g. x86_pmu_read(). The calculation of hwc->period_left is wrong as
well, which impacts the accuracy of the period of the first record when
a PEBS buffer holds multiple records.
The issue was introduced together with the auto-reload mechanism by
commit 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism
when possible")
Introduce intel_pmu_save_and_restart_reload(), used only for
auto-reload.
The formula to calculate event->count is as below:

  event->count = period left from last time +
                 (reload_times - 1) * reload_val +
                 latency of the PMI handler

prev_count is the last observed hardware counter value. Just as in the
non-auto-reload case, its absolute value is the period of the first
record. It must not be updated on each reload, because the software does
not 'observe' the hardware counter for each auto-reload.
For the second and later records, the period is exactly the reload
value, so it is enough to add (reload_times - 1) * reload_val to
event->count.
The calculation of the latency of the PMI handler differs slightly from
the non-auto-reload case: because the start point of counting is
-reload_val, it needs to be adjusted by adding reload_val.
period_left needs the same adjustment.
Nothing needs to be done in x86_perf_event_set_period(), because the
period is fixed and period_left has already been adjusted.
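As an illustration of the formula (all numbers invented): suppose
reload_val = 100000, the counter started this period at
prev_count = -60000 (i.e. 60000 left from the last period), three
records were generated (reload_times = 3), and the counter reads
new_raw_count = -99500 when the PMI is handled. Then:

  period left from last time = -prev_count                    =  60000
  completed middle periods   = (reload_times - 1) * reload_val = 200000
  PMI handler latency        = reload_val + new_raw_count      =    500

  event->count increment     = 60000 + 200000 + 500            = 260500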
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b..cc1f373 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
return NULL;
}
+/*
+ * A specialized intel_pmu_save_and_restart() for auto-reload.
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+ u64 reload_val,
+ int reload_times)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ int shift = 64 - x86_pmu.cntval_bits;
+ u64 prev_raw_count, new_raw_count;
+ u64 delta;
+
+ if ((reload_times == 0) || (reload_val == 0))
+ return intel_pmu_save_and_restart(event);
+
+ /*
+ * Careful: an NMI might modify the previous event value.
+ *
+ * Our tactic to handle this is to first atomically read and
+ * exchange a new raw count - then add that new-prev delta
+ * count to the generic event atomically:
+ */
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count)
+ goto again;
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count.
+ *
+ * event->count = period left from last time +
+ * (reload_times - 1) * reload_val +
+ * latency of PMI handler
+ * The period left from last time can be obtained from -prev_count.
+ * The start point of counting is always -reload_val.
+ * So the real latency of the PMI handler is reload_val + new_raw_count.
+ */
+ delta = (reload_val << shift) + (new_raw_count << shift) -
+ (prev_raw_count << shift);
+ delta >>= shift;
+
+ local64_add(reload_val * (reload_times - 1), &event->count);
+ local64_add(delta, &event->count);
+ local64_sub(delta, &hwc->period_left);
+
+ return x86_perf_event_set_period(event);
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs,
void *base, void *top,
int bit, int count)
{
+ struct hw_perf_event *hwc = &event->hw;
struct perf_sample_data data;
struct pt_regs regs;
void *at = get_next_pebs_record_by_bit(base, top, bit);
- if (!intel_pmu_save_and_restart(event) &&
- !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ /*
+ * For now, auto-reload is only enabled in fixed-period mode.
+ * The reload value is always hwc->sample_period.
+ * This may need to change if auto-reload is enabled in
+ * freq mode later.
+ */
+ intel_pmu_save_and_restart_reload(event, hwc->sample_period,
+ count);
+ } else if (!intel_pmu_save_and_restart(event))
return;
while (count > 1) {
--
2.7.4
From: Kan Liang <[email protected]>
When the PEBS interrupt threshold is larger than one, there is no way to
get the exact number of auto-reloads and the values needed for the event
update without flushing the PEBS buffer.
Drain the PEBS buffer in event read when large PEBS is enabled.
When the threshold is one, no special handling is needed even if
auto-reload is enabled, because auto-reload only takes effect on event
overflow, and there is no overflow in event read.
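As a hedged illustration (numbers invented): with sample_period = 100000
and a PEBS interrupt threshold of 10, up to 9 completed auto-reloads,
i.e. up to 900000 events, can sit undrained in the buffer at read time.
A read that only did x86_perf_event_update() would miss all of them.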
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 9 +++++++++
arch/x86/events/intel/ds.c | 10 ++++++++++
arch/x86/events/perf_event.h | 2 ++
3 files changed, 21 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 09c26a4..bdc35f8 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static void intel_pmu_read_event(struct perf_event *event)
+{
+ if (event->attr.precise_ip)
+ return intel_pmu_pebs_read(event);
+
+ x86_perf_event_update(event);
+}
+
static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
{
int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
@@ -3495,6 +3503,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.disable = intel_pmu_disable_event,
.add = intel_pmu_add_event,
.del = intel_pmu_del_event,
+ .read = intel_pmu_read_event,
.hw_config = intel_pmu_hw_config,
.schedule_events = x86_schedule_events,
.eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index cc1f373..2027560 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -926,6 +926,16 @@ void intel_pmu_pebs_del(struct perf_event *event)
pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
}
+void intel_pmu_pebs_read(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ if (pebs_needs_sched_cb(cpuc))
+ return intel_pmu_drain_pebs_buffer();
+
+ x86_perf_event_update(event);
+}
+
void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 67426e51..93ec3b4 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -928,6 +928,8 @@ void intel_pmu_pebs_add(struct perf_event *event);
void intel_pmu_pebs_del(struct perf_event *event);
+void intel_pmu_pebs_read(struct perf_event *event);
+
void intel_pmu_pebs_enable(struct perf_event *event);
void intel_pmu_pebs_disable(struct perf_event *event);
--
2.7.4
From: Kan Liang <[email protected]>
Userspace RDPMC has never worked for large PEBS since large PEBS was
introduced by
commit b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt
handling (large PEBS interrupt threshold)")
When the PEBS interrupt threshold is larger than one, there is no way for
userspace RDPMC to get the exact number of auto-reloads and the values.
Disable userspace RDPMC when large PEBS is enabled.
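For reference, a typical userspace RDPMC read loop following the
documented perf_event_mmap_page protocol looks roughly like the sketch
below (mmap_read_self() is an illustrative name, not part of this patch;
pc is assumed to be the mmap'ed first page of the event fd, and the
rdpmc builtin is GCC/x86-specific). Without this patch, pc->index stays
nonzero for a large-PEBS event, so such a loop returns values missing
the undrained auto-reloads.

#include <stdint.h>
#include <linux/perf_event.h>

/* Illustrative sketch of the documented mmap/RDPMC read protocol. */
static uint64_t mmap_read_self(volatile struct perf_event_mmap_page *pc)
{
	uint32_t seq, idx, width;
	uint64_t count, pmc;

	do {
		seq = pc->lock;		/* seqlock: retry if it changes */
		__sync_synchronize();
		idx = pc->index;	/* zero means RDPMC is not allowed */
		count = pc->offset;	/* kernel-accumulated part */
		if (pc->cap_user_rdpmc && idx) {
			pmc = __builtin_ia32_rdpmc(idx - 1);
			/* sign-extend the raw counter value */
			width = pc->pmc_width;
			pmc <<= 64 - width;
			pmc = (uint64_t)((int64_t)pmc >> (64 - width));
			count += pmc;
		}
		__sync_synchronize();
	} while (pc->lock != seq);

	return count;
}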
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index acd7ffc..703bd81 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2120,7 +2120,8 @@ static int x86_pmu_event_init(struct perf_event *event)
event->destroy(event);
}
- if (READ_ONCE(x86_pmu.attr_rdpmc))
+ if (READ_ONCE(x86_pmu.attr_rdpmc) &&
+ !(event->hw.flags & PERF_X86_EVENT_FREERUNNING))
event->hw.flags |= PERF_X86_EVENT_RDPMC_ALLOWED;
return err;
--
2.7.4
On Mon, Jan 08, 2018 at 07:15:13AM -0800, [email protected] wrote:
SNIP
> There is nothing need to do in x86_perf_event_set_period(). Because it
> is fixed period. The period_left is already adjusted.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 3674a4b..cc1f373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
> return NULL;
> }
>
> +/*
> + * Specific intel_pmu_save_and_restart() for auto-reload.
> + */
> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
> + u64 reload_val,
> + int reload_times)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 prev_raw_count, new_raw_count;
> + u64 delta;
> +
> + if ((reload_times == 0) || (reload_val == 0))
> + return intel_pmu_save_and_restart(event);
why is this check needed? AFAICS __intel_pmu_pebs_event is
called only if reload_times != 0 and reload_val is always
non zero for sampling
jirka
On Mon, Jan 08, 2018 at 07:15:15AM -0800, [email protected] wrote:
> From: Kan Liang <[email protected]>
>
> When the PEBS interrupt threshold is larger than one, there is no way to
> get exact auto-reload times and value needed for event update unless
> flush the PEBS buffer.
>
> Drain the PEBS buffer in event read when large PEBS is enabled.
>
> For the threshold is one, even auto-reload is enabled, it doesn't need
> to be specially handled. Because auto-reload is only effect when event
> overflow. There is no overflow in event read.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/intel/core.c | 9 +++++++++
> arch/x86/events/intel/ds.c | 10 ++++++++++
> arch/x86/events/perf_event.h | 2 ++
> 3 files changed, 21 insertions(+)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 09c26a4..bdc35f8 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
> intel_pmu_pebs_del(event);
> }
>
> +static void intel_pmu_read_event(struct perf_event *event)
> +{
> + if (event->attr.precise_ip)
> + return intel_pmu_pebs_read(event);
check for (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
would be more accurate?
jirka
On 1/10/2018 5:22 AM, Jiri Olsa wrote:
> On Mon, Jan 08, 2018 at 07:15:13AM -0800, [email protected] wrote:
>
> SNIP
>
>> There is nothing need to do in x86_perf_event_set_period(). Because it
>> is fixed period. The period_left is already adjusted.
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 67 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 3674a4b..cc1f373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
>> return NULL;
>> }
>>
>> +/*
>> + * Specific intel_pmu_save_and_restart() for auto-reload.
>> + */
>> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
>> + u64 reload_val,
>> + int reload_times)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + int shift = 64 - x86_pmu.cntval_bits;
>> + u64 prev_raw_count, new_raw_count;
>> + u64 delta;
>> +
>> + if ((reload_times == 0) || (reload_val == 0))
>> + return intel_pmu_save_and_restart(event);
>
> why is this check needed? AFAICS __intel_pmu_pebs_event is
> called only if reload_times != 0 and reload_val is always
> non zero for sampling
>
It is a sanity check for reload_times and reload_val.
Right, usually they are non-zero.
I don't think the check brings any issues, right?
If so, may we keep it?
Thanks,
Kan
On 1/10/2018 5:39 AM, Jiri Olsa wrote:
> On Mon, Jan 08, 2018 at 07:15:15AM -0800, [email protected] wrote:
>> From: Kan Liang <[email protected]>
>>
>> When the PEBS interrupt threshold is larger than one, there is no way to
>> get exact auto-reload times and value needed for event update unless
>> flush the PEBS buffer.
>>
>> Drain the PEBS buffer in event read when large PEBS is enabled.
>>
>> For the threshold is one, even auto-reload is enabled, it doesn't need
>> to be specially handled. Because auto-reload is only effect when event
>> overflow. There is no overflow in event read.
>>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> arch/x86/events/intel/core.c | 9 +++++++++
>> arch/x86/events/intel/ds.c | 10 ++++++++++
>> arch/x86/events/perf_event.h | 2 ++
>> 3 files changed, 21 insertions(+)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 09c26a4..bdc35f8 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
>> intel_pmu_pebs_del(event);
>> }
>>
>> +static void intel_pmu_read_event(struct perf_event *event)
>> +{
>> + if (event->attr.precise_ip)
>> + return intel_pmu_pebs_read(event);
>
> check for (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
> would be more accurate?
>
It would narrow down the events.
But considering readability, I think it is better to use precise_ip.
The exposed functions in ds.c should be generic functions for all PEBS
events, not for a specific case.
I think _AUTO_RELOAD looks too specific.
Thanks,
Kan
On Wed, Jan 10, 2018 at 09:31:01AM -0500, Liang, Kan wrote:
>
>
> On 1/10/2018 5:22 AM, Jiri Olsa wrote:
> > On Mon, Jan 08, 2018 at 07:15:13AM -0800, [email protected] wrote:
> >
> > SNIP
> >
> > > There is nothing need to do in x86_perf_event_set_period(). Because it
> > > is fixed period. The period_left is already adjusted.
> > >
> > > Signed-off-by: Kan Liang <[email protected]>
> > > ---
> > > arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
> > > 1 file changed, 67 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> > > index 3674a4b..cc1f373 100644
> > > --- a/arch/x86/events/intel/ds.c
> > > +++ b/arch/x86/events/intel/ds.c
> > > @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
> > > return NULL;
> > > }
> > > +/*
> > > + * Specific intel_pmu_save_and_restart() for auto-reload.
> > > + */
> > > +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
> > > + u64 reload_val,
> > > + int reload_times)
> > > +{
> > > + struct hw_perf_event *hwc = &event->hw;
> > > + int shift = 64 - x86_pmu.cntval_bits;
> > > + u64 prev_raw_count, new_raw_count;
> > > + u64 delta;
> > > +
> > > + if ((reload_times == 0) || (reload_val == 0))
> > > + return intel_pmu_save_and_restart(event);
> >
> > why is this check needed? AFAICS __intel_pmu_pebs_event is
> > called only if reload_times != 0 and reload_val is always
> > non zero for sampling
> >
>
> Here is a sanity check for reload_times and reload_val.
> Right, usually they are non zero.
> I think it should not bring any issues. Right?
> If so, I think we may still keep it?
sure, no big deal.. I just don't see the reason ;-)
jirka
On Wed, Jan 10, 2018 at 09:31:56AM -0500, Liang, Kan wrote:
>
>
> On 1/10/2018 5:39 AM, Jiri Olsa wrote:
> > On Mon, Jan 08, 2018 at 07:15:15AM -0800, [email protected] wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > When the PEBS interrupt threshold is larger than one, there is no way to
> > > get exact auto-reload times and value needed for event update unless
> > > flush the PEBS buffer.
> > >
> > > Drain the PEBS buffer in event read when large PEBS is enabled.
> > >
> > > For the threshold is one, even auto-reload is enabled, it doesn't need
> > > to be specially handled. Because auto-reload is only effect when event
> > > overflow. There is no overflow in event read.
> > >
> > > Signed-off-by: Kan Liang <[email protected]>
> > > ---
> > > arch/x86/events/intel/core.c | 9 +++++++++
> > > arch/x86/events/intel/ds.c | 10 ++++++++++
> > > arch/x86/events/perf_event.h | 2 ++
> > > 3 files changed, 21 insertions(+)
> > >
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index 09c26a4..bdc35f8 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
> > > intel_pmu_pebs_del(event);
> > > }
> > > +static void intel_pmu_read_event(struct perf_event *event)
> > > +{
> > > + if (event->attr.precise_ip)
> > > + return intel_pmu_pebs_read(event);
> >
> > check for (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
> > would be more accurate?
> >
>
> It will narrow down the events.
> But considering the readability, I think it would be better to use
> precise_ip. The exposed functions in ds.c should be generic functions for all
> PEBS events, not specific case.
> I think _AUTO_RELOAD looks too specific.
hum, but the PEBS drain is specific just for
PERF_X86_EVENT_AUTO_RELOAD events, right?
wrt readability maybe you could add a function like:
bool pebs_drain_before_read(struct perf_event *event)
{
return event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD;
}
thanks,
jirka
On 1/11/2018 6:10 AM, Jiri Olsa wrote:
> On Wed, Jan 10, 2018 at 09:31:56AM -0500, Liang, Kan wrote:
>>
>>
>> On 1/10/2018 5:39 AM, Jiri Olsa wrote:
>>> On Mon, Jan 08, 2018 at 07:15:15AM -0800, [email protected] wrote:
>>>> From: Kan Liang <[email protected]>
>>>>
>>>> When the PEBS interrupt threshold is larger than one, there is no way to
>>>> get exact auto-reload times and value needed for event update unless
>>>> flush the PEBS buffer.
>>>>
>>>> Drain the PEBS buffer in event read when large PEBS is enabled.
>>>>
>>>> For the threshold is one, even auto-reload is enabled, it doesn't need
>>>> to be specially handled. Because auto-reload is only effect when event
>>>> overflow. There is no overflow in event read.
>>>>
>>>> Signed-off-by: Kan Liang <[email protected]>
>>>> ---
>>>> arch/x86/events/intel/core.c | 9 +++++++++
>>>> arch/x86/events/intel/ds.c | 10 ++++++++++
>>>> arch/x86/events/perf_event.h | 2 ++
>>>> 3 files changed, 21 insertions(+)
>>>>
>>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>>> index 09c26a4..bdc35f8 100644
>>>> --- a/arch/x86/events/intel/core.c
>>>> +++ b/arch/x86/events/intel/core.c
>>>> @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
>>>> intel_pmu_pebs_del(event);
>>>> }
>>>> +static void intel_pmu_read_event(struct perf_event *event)
>>>> +{
>>>> + if (event->attr.precise_ip)
>>>> + return intel_pmu_pebs_read(event);
>>>
>>> check for (event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD)
>>> would be more accurate?
>>>
>>
>> It will narrow down the events.
>> But considering the readability, I think it would be better to use
>> precise_ip. The exposed functions in ds.c should be generic functions for all
>> PEBS events, not specific case.
>> I think _AUTO_RELOAD looks too specific.
>
> hum, but the PEBS drain is specific just for
> PERF_X86_EVENT_AUTO_RELOAD events, right?
To be accurate, the PEBS drain is specific to PERF_X86_EVENT_FREERUNNING here.
A PERF_X86_EVENT_FREERUNNING event must be an _AUTO_RELOAD event,
but an _AUTO_RELOAD event is not necessarily a _FREERUNNING event.
Only an event which is both _FREERUNNING and _AUTO_RELOAD needs to do the
PEBS drain in _read().
So it does the check in intel_pmu_pebs_read()
+ if (pebs_needs_sched_cb(cpuc))
+ return intel_pmu_drain_pebs_buffer();
>
> wrt readability maybe you could add function like:
The existing function pebs_needs_sched_cb() can do the check.
We just need to expose it, and also the intel_pmu_drain_pebs_buffer().
But to be honest, I still cannot see a reason for that.
It could save a call into intel_pmu_pebs_read(), but _read() is not a
critical path, so it doesn't save much.
Also, it would have to expose two PEBS-specific inline functions. I think
exposing one generic PEBS function is better here.
Thanks,
Kan
> bool pebs_drain_before_read(struct perf_event *event)
> {
> return event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD;
> }
>
> thanks,
> jirka
>
On Thu, Jan 11, 2018 at 10:21:25AM -0500, Liang, Kan wrote:
SNIP
> >
> > hum, but the PEBS drain is specific just for
> > PERF_X86_EVENT_AUTO_RELOAD events, right?
>
> Accurately, PEBS drain is specific for PERF_X86_EVENT_FREERUNNING here.
> PERF_X86_EVENT_FREERUNNING event must be _AUTO_RELOAD event.
> But in some cases, _AUTO_RELOAD event cannot be _FREERUNNING event.
>
> Only the event which is both _FREERUNNING and _AUTO_RELOAD need to do PEBS
> drain in _read().
>
> So it does the check in intel_pmu_pebs_read()
> + if (pebs_needs_sched_cb(cpuc))
> + return intel_pmu_drain_pebs_buffer();
>
> >
> > wrt readability maybe you could add function like:
>
> The existing function pebs_needs_sched_cb() can do the check.
> We just need to expose it, and also the intel_pmu_drain_pebs_buffer().
>
> But to be honest, I still cannot see a reason for that.
> It could save a call to intel_pmu_pebs_read(), but _read() is not critical
> path. It doesn't save much.
hum, pmu->read is also called for PERF_SAMPLE_READ for sample,
check perf_output_read
for non sampling event you shouldn't be able to create PEBS
event, there's check in x86_pmu_hw_config
I agree it does not save much, it just confused me while
I was reading the code, like why is this needed for all
events with precise_ip
jirka
On 1/11/2018 10:45 AM, Jiri Olsa wrote:
> On Thu, Jan 11, 2018 at 10:21:25AM -0500, Liang, Kan wrote:
>
> SNIP
>
>>>
>>> hum, but the PEBS drain is specific just for
>>> PERF_X86_EVENT_AUTO_RELOAD events, right?
>>
>> Accurately, PEBS drain is specific for PERF_X86_EVENT_FREERUNNING here.
>> PERF_X86_EVENT_FREERUNNING event must be _AUTO_RELOAD event.
>> But in some cases, _AUTO_RELOAD event cannot be _FREERUNNING event.
>>
>> Only the event which is both _FREERUNNING and _AUTO_RELOAD need to do PEBS
>> drain in _read().
>>
>> So it does the check in intel_pmu_pebs_read()
>> + if (pebs_needs_sched_cb(cpuc))
>> + return intel_pmu_drain_pebs_buffer();
>>
>>>
>>> wrt readability maybe you could add function like:
>>
>> The existing function pebs_needs_sched_cb() can do the check.
>> We just need to expose it, and also the intel_pmu_drain_pebs_buffer().
>>
>> But to be honest, I still cannot see a reason for that.
>> It could save a call to intel_pmu_pebs_read(), but _read() is not critical
>> path. It doesn't save much.
>
> hum, pmu->read is also called for PERF_SAMPLE_READ for sample,
> check perf_output_read
>
> for non sampling event you shouldn't be able to create PEBS
> event, there's check in x86_pmu_hw_config
>
> I agree it does not save much, it just confused me while
> I was reading the code, like why is this needed for all
> events with precise_ip
>
Sorry for the late response.
How about the patch below?
It will be split into two patches in V3: one to introduce
intel_pmu_large_pebs_read(), the other to introduce intel_pmu_read_event().
Thanks,
Kan
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 731153a..1610a9d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
intel_pmu_pebs_del(event);
}
+static void intel_pmu_read_event(struct perf_event *event)
+{
+ if (intel_pmu_large_pebs_read(event))
+ return;
+
+ x86_perf_event_update(event);
+}
+
static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
{
int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
@@ -3495,6 +3503,7 @@ static __initconst const struct x86_pmu intel_pmu = {
.disable = intel_pmu_disable_event,
.add = intel_pmu_add_event,
.del = intel_pmu_del_event,
+ .read = intel_pmu_read_event,
.hw_config = intel_pmu_hw_config,
.schedule_events = x86_schedule_events,
.eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index a1edd91..3f11ede 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -978,6 +978,19 @@ void intel_pmu_pebs_del(struct perf_event *event)
pebs_update_state(needed_cb, cpuc, event->ctx->pmu);
}
+int intel_pmu_large_pebs_read(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /* Check if large PEBS is enabled for the event */
+ if (!pebs_needs_sched_cb(cpuc))
+ return 0;
+
+ intel_pmu_drain_pebs_buffer();
+
+ return 1;
+}
+
void intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 805400b..7d3cd32 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -923,6 +923,8 @@ void intel_pmu_pebs_disable_all(void);
void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+int intel_pmu_large_pebs_read(struct perf_event *event);
+
void intel_ds_init(void);
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
On Tue, Jan 16, 2018 at 01:49:13PM -0500, Liang, Kan wrote:
>
>
> On 1/11/2018 10:45 AM, Jiri Olsa wrote:
> > On Thu, Jan 11, 2018 at 10:21:25AM -0500, Liang, Kan wrote:
> >
> > SNIP
> >
> > > >
> > > > hum, but the PEBS drain is specific just for
> > > > PERF_X86_EVENT_AUTO_RELOAD events, right?
> > >
> > > Accurately, PEBS drain is specific for PERF_X86_EVENT_FREERUNNING here.
> > > PERF_X86_EVENT_FREERUNNING event must be _AUTO_RELOAD event.
> > > But in some cases, _AUTO_RELOAD event cannot be _FREERUNNING event.
> > >
> > > Only the event which is both _FREERUNNING and _AUTO_RELOAD need to do PEBS
> > > drain in _read().
> > >
> > > So it does the check in intel_pmu_pebs_read()
> > > + if (pebs_needs_sched_cb(cpuc))
> > > + return intel_pmu_drain_pebs_buffer();
> > >
> > > >
> > > > wrt readability maybe you could add function like:
> > >
> > > The existing function pebs_needs_sched_cb() can do the check.
> > > We just need to expose it, and also the intel_pmu_drain_pebs_buffer().
> > >
> > > But to be honest, I still cannot see a reason for that.
> > > It could save a call to intel_pmu_pebs_read(), but _read() is not critical
> > > path. It doesn't save much.
> >
> > hum, pmu->read is also called for PERF_SAMPLE_READ for sample,
> > check perf_output_read
> >
> > for non sampling event you shouldn't be able to create PEBS
> > event, there's check in x86_pmu_hw_config
> >
> > I agree it does not save much, it just confused me while
> > I was reading the code, like why is this needed for all
> > events with precise_ip
> >
>
>
> Sorry for the late response.
>
> How about the patch as below?
> The patch will be split into two patches in V3. One is to introduce
> intel_pmu_large_pebs_read, the other is to introduce intel_pmu_read_event.
>
> Thanks,
> Kan
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 731153a..1610a9d 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event
> *event)
> intel_pmu_pebs_del(event);
> }
>
> +static void intel_pmu_read_event(struct perf_event *event)
> +{
> + if (intel_pmu_large_pebs_read(event))
> + return;
should this be 'if (!intel_pmu_large_pebs_read(event))'
but it looks better to me without the precise_ip check
thanks,
jirka
On 1/18/2018 4:49 AM, Jiri Olsa wrote:
> On Tue, Jan 16, 2018 at 01:49:13PM -0500, Liang, Kan wrote:
>>
>>
>> On 1/11/2018 10:45 AM, Jiri Olsa wrote:
>>> On Thu, Jan 11, 2018 at 10:21:25AM -0500, Liang, Kan wrote:
>>>
>>> SNIP
>>>
>>>>>
>>>>> hum, but the PEBS drain is specific just for
>>>>> PERF_X86_EVENT_AUTO_RELOAD events, right?
>>>>
>>>> Accurately, PEBS drain is specific for PERF_X86_EVENT_FREERUNNING here.
>>>> PERF_X86_EVENT_FREERUNNING event must be _AUTO_RELOAD event.
>>>> But in some cases, _AUTO_RELOAD event cannot be _FREERUNNING event.
>>>>
>>>> Only the event which is both _FREERUNNING and _AUTO_RELOAD need to do PEBS
>>>> drain in _read().
>>>>
>>>> So it does the check in intel_pmu_pebs_read()
>>>> + if (pebs_needs_sched_cb(cpuc))
>>>> + return intel_pmu_drain_pebs_buffer();
>>>>
>>>>>
>>>>> wrt readability maybe you could add function like:
>>>>
>>>> The existing function pebs_needs_sched_cb() can do the check.
>>>> We just need to expose it, and also the intel_pmu_drain_pebs_buffer().
>>>>
>>>> But to be honest, I still cannot see a reason for that.
>>>> It could save a call to intel_pmu_pebs_read(), but _read() is not critical
>>>> path. It doesn't save much.
>>>
>>> hum, pmu->read is also called for PERF_SAMPLE_READ for sample,
>>> check perf_output_read
>>>
>>> for non sampling event you shouldn't be able to create PEBS
>>> event, there's check in x86_pmu_hw_config
>>>
>>> I agree it does not save much, it just confused me while
>>> I was reading the code, like why is this needed for all
>>> events with precise_ip
>>>
>>
>>
>> Sorry for the late response.
>>
>> How about the patch as below?
>> The patch will be split into two patches in V3. One is to introduce
>> intel_pmu_large_pebs_read, the other is to introduce intel_pmu_read_event.
>>
>> Thanks,
>> Kan
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 731153a..1610a9d 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event
>> *event)
>> intel_pmu_pebs_del(event);
>> }
>>
>> +static void intel_pmu_read_event(struct perf_event *event)
>> +{
>> + if (intel_pmu_large_pebs_read(event))
>> + return;
>
> should this be 'if (!intel_pmu_large_pebs_read(event))'
>
No. For large PEBS, event->count has already been updated in
drain_pebs(), so there is no need to do x86_perf_event_update() again.
Thanks,
Kan
> but looks better for me without the precise_ip check
>
> thanks,
> jirka
>
On Thu, Jan 18, 2018 at 08:30:30AM -0500, Liang, Kan wrote:
>
>
> On 1/18/2018 4:49 AM, Jiri Olsa wrote:
> > On Tue, Jan 16, 2018 at 01:49:13PM -0500, Liang, Kan wrote:
> > >
> > >
> > > On 1/11/2018 10:45 AM, Jiri Olsa wrote:
> > > > On Thu, Jan 11, 2018 at 10:21:25AM -0500, Liang, Kan wrote:
> > > >
> > > > SNIP
> > > >
> > > > > >
> > > > > > hum, but the PEBS drain is specific just for
> > > > > > PERF_X86_EVENT_AUTO_RELOAD events, right?
> > > > >
> > > > > Accurately, PEBS drain is specific for PERF_X86_EVENT_FREERUNNING here.
> > > > > PERF_X86_EVENT_FREERUNNING event must be _AUTO_RELOAD event.
> > > > > But in some cases, _AUTO_RELOAD event cannot be _FREERUNNING event.
> > > > >
> > > > > Only the event which is both _FREERUNNING and _AUTO_RELOAD need to do PEBS
> > > > > drain in _read().
> > > > >
> > > > > So it does the check in intel_pmu_pebs_read()
> > > > > + if (pebs_needs_sched_cb(cpuc))
> > > > > + return intel_pmu_drain_pebs_buffer();
> > > > >
> > > > > >
> > > > > > wrt readability maybe you could add function like:
> > > > >
> > > > > The existing function pebs_needs_sched_cb() can do the check.
> > > > > We just need to expose it, and also the intel_pmu_drain_pebs_buffer().
> > > > >
> > > > > But to be honest, I still cannot see a reason for that.
> > > > > It could save a call to intel_pmu_pebs_read(), but _read() is not critical
> > > > > path. It doesn't save much.
> > > >
> > > > hum, pmu->read is also called for PERF_SAMPLE_READ for sample,
> > > > check perf_output_read
> > > >
> > > > for non sampling event you shouldn't be able to create PEBS
> > > > event, there's check in x86_pmu_hw_config
> > > >
> > > > I agree it does not save much, it just confused me while
> > > > I was reading the code, like why is this needed for all
> > > > events with precise_ip
> > > >
> > >
> > >
> > > Sorry for the late response.
> > >
> > > How about the patch as below?
> > > The patch will be split into two patches in V3. One is to introduce
> > > intel_pmu_large_pebs_read, the other is to introduce intel_pmu_read_event.
> > >
> > > Thanks,
> > > Kan
> > >
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index 731153a..1610a9d 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event
> > > *event)
> > > intel_pmu_pebs_del(event);
> > > }
> > >
> > > +static void intel_pmu_read_event(struct perf_event *event)
> > > +{
> > > + if (intel_pmu_large_pebs_read(event))
> > > + return;
> >
> > should this be 'if (!intel_pmu_large_pebs_read(event))'
> >
>
> NO. For large pebs, the event->count has been updated in drain_pebs(). So it
> doesn't need to do x86_perf_event_update() again.
>
ok, thanks
jirka
On Mon, Jan 08, 2018 at 07:15:13AM -0800, [email protected] wrote:
> The formula to calculate the event->count is as below:
> event->count = period left from last time +
> (reload_times - 1) * reload_val +
> latency of PMI handler
>
> prev_count is the last observed hardware counter value. Just the same as
> non-auto-reload, its absolute value is the period of the first record.
> It should not update with each reload. Because it doesn't 'observe' the
> hardware counter for each auto-reload.
>
> For the second and later records, the period is exactly the reload
> value. Just need to simply add (reload_times - 1) * reload_val to
> event->count.
>
> The calculation of the latency of PMI handler is a little bit different
> as non-auto-reload. Because the start point is -reload_value. It needs
> to be adjusted by adding reload_value.
> The period_left needs to do the same adjustment.
What's this about the PMI latency, we don't care about that in any other
situation, right? Sure the PMI takes a bit of time, but we're not
correcting for that anywhere, so why start now?
> There is nothing need to do in x86_perf_event_set_period(). Because it
> is fixed period. The period_left is already adjusted.
Fixes tag is missing.
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 3674a4b..cc1f373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
> return NULL;
> }
>
> +/*
> + * Specific intel_pmu_save_and_restart() for auto-reload.
> + */
> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
> + u64 reload_val,
> + int reload_times)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 prev_raw_count, new_raw_count;
> + u64 delta;
> +
> + if ((reload_times == 0) || (reload_val == 0))
> + return intel_pmu_save_and_restart(event);
Like Jiri, I find this confusing at best. If we need to call that one,
you shouldn't have called this function to begin with.
At best, have a WARN here or something.
> +
> + /*
> + * Careful: an NMI might modify the previous event value.
> + *
> + * Our tactic to handle this is to first atomically read and
> + * exchange a new raw count - then add that new-prev delta
> + * count to the generic event atomically:
> + */
For now this seems to only get called from *drain_pebs* which afaict
only happens when we've disabled the PMU (either from sched_task or
PMI).
Now, you want to put this in the pmu::read() path, and that does not
disable the PMU, but I don't think we can drain the PEBS buffer while
it's active, that's too full of races, so even there you'll have to
disable stuff.
So I don't think this is correct/desired for this case.
> +again:
> + prev_raw_count = local64_read(&hwc->prev_count);
> + rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +
> + if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> + new_raw_count) != prev_raw_count)
> + goto again;
> +
> + /*
> + * Now we have the new raw value and have updated the prev
> + * timestamp already. We can now calculate the elapsed delta
> + * (event-)time and add that to the generic event.
> + *
> + * Careful, not all hw sign-extends above the physical width
> + * of the count.
> + *
> + * event->count = period left from last time +
> + * (reload_times - 1) * reload_val +
> + * latency of PMI handler
*
> + * The period left from last time can be got from -prev_count.
> + * The start points of counting is always -reload_val.
> + * So the real latency of PMI handler is reload_val + new_raw_count.
> + */
That is very confused, the PMI latency is utterly unrelated to anything
you do here.
> + delta = (reload_val << shift) + (new_raw_count << shift) -
> + (prev_raw_count << shift);
> + delta >>= shift;
> +
> + local64_add(reload_val * (reload_times - 1), &event->count);
> + local64_add(delta, &event->count);
And this is still wrong I think. Consider the case where !reload_times.
We can easily call pmu::read() twice in one period. In that case we
should increment count with (new - prev).
Only once we get a new sample and are known to have wrapped, do we need
to consider that wrap.
> + local64_sub(delta, &hwc->period_left);
> +
> + return x86_perf_event_set_period(event);
> +}
> +
> static void __intel_pmu_pebs_event(struct perf_event *event,
> struct pt_regs *iregs,
> void *base, void *top,
> int bit, int count)
> {
> + struct hw_perf_event *hwc = &event->hw;
> struct perf_sample_data data;
> struct pt_regs regs;
> void *at = get_next_pebs_record_by_bit(base, top, bit);
>
> - if (!intel_pmu_save_and_restart(event) &&
> - !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
> + if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
> + /*
> + * Now, auto-reload is only enabled in fixed period mode.
> + * The reload value is always hwc->sample_period.
> + * May need to change it, if auto-reload is enabled in
> + * freq mode later.
> + */
> + intel_pmu_save_and_restart_reload(event, hwc->sample_period,
> + count);
Since you pass in @event, hwc->sample_period is already available to it,
no need to pass that in as well.
> + } else if (!intel_pmu_save_and_restart(event))
> return;
>
> while (count > 1) {
On 1/24/2018 7:26 AM, Peter Zijlstra wrote:
> On Mon, Jan 08, 2018 at 07:15:13AM -0800, [email protected] wrote:
>
>> The formula to calculate the event->count is as below:
>
>> event->count = period left from last time +
>> (reload_times - 1) * reload_val +
>> latency of PMI handler
>>
>> prev_count is the last observed hardware counter value. Just the same as
>> non-auto-reload, its absolute value is the period of the first record.
>> It should not update with each reload. Because it doesn't 'observe' the
>> hardware counter for each auto-reload.
>>
>> For the second and later records, the period is exactly the reload
>> value. Just need to simply add (reload_times - 1) * reload_val to
>> event->count.
>>
>> The calculation of the latency of PMI handler is a little bit different
>> as non-auto-reload. Because the start point is -reload_value. It needs
>> to be adjusted by adding reload_value.
>> The period_left needs to do the same adjustment.
>
> What's this about the PMI latency, we don't care about that in any other
> situation, right? Sure the PMI takes a bit of time, but we're not
> correcting for that anywhere, so why start now?
The 'latency' here is the gap between when the PEBS hardware is armed
and when the NMI is handled.
Sorry for the misleading description.
I will rewrite the description in V3.
>
>> There is nothing need to do in x86_perf_event_set_period(). Because it
>> is fixed period. The period_left is already adjusted.
>
> Fixes tag is missing.
Will add it in V3.
>
>> Signed-off-by: Kan Liang <[email protected]>
>> ---
>> arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 67 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 3674a4b..cc1f373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1251,17 +1251,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
>> return NULL;
>> }
>>
>> +/*
>> + * Specific intel_pmu_save_and_restart() for auto-reload.
>> + */
>> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
>> + u64 reload_val,
>> + int reload_times)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + int shift = 64 - x86_pmu.cntval_bits;
>> + u64 prev_raw_count, new_raw_count;
>> + u64 delta;
>> +
>> + if ((reload_times == 0) || (reload_val == 0))
>> + return intel_pmu_save_and_restart(event);
>
> Like Jiri, I find this confusing at best. If we need to call that one,
> you shouldn't have called this function to begin with.
>
> At best, have a WARN here or something.
>
Will add a WARN in V3.
>> +
>> + /*
>> + * Careful: an NMI might modify the previous event value.
>> + *
>> + * Our tactic to handle this is to first atomically read and
>> + * exchange a new raw count - then add that new-prev delta
>> + * count to the generic event atomically:
>> + */
>
> For now this seems to only get called from *drain_pebs* which afaict
> only happens when we've disabled the PMU (either from sched_task or
> PMI).
>
> Now, you want to put this in the pmu::read() path, and that does not
> disable the PMU, but I don't think we can drain the PEBS buffer while
> its active, that's too full of races, so even there you'll have to
> disable stuff.
>
> So I don't think this is correct/desired for this case.
>
Could we do something like the below in the pmu::read() path? (not tested yet)

	if (pebs_needs_sched_cb(cpuc)) {
		perf_pmu_disable(event->pmu);
		intel_pmu_drain_pebs_buffer();
		x86_perf_event_update(event);
		perf_pmu_enable(event->pmu);
		return;
	}

The x86_perf_event_update() is there to handle the !reload_times case you
mention below. Because the PMU is disabled, nothing will change for the
other cases.
>> +again:
>> + prev_raw_count = local64_read(&hwc->prev_count);
>> + rdpmcl(hwc->event_base_rdpmc, new_raw_count);
>> +
>> + if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
>> + new_raw_count) != prev_raw_count)
>> + goto again;
>> +
>> + /*
>> + * Now we have the new raw value and have updated the prev
>> + * timestamp already. We can now calculate the elapsed delta
>> + * (event-)time and add that to the generic event.
>> + *
>> + * Careful, not all hw sign-extends above the physical width
>> + * of the count.
>> + *
>> + * event->count = period left from last time +
>> + * (reload_times - 1) * reload_val +
>> + * latency of PMI handler
> *
>> + * The period left from last time can be got from -prev_count.
>> + * The start points of counting is always -reload_val.
>> + * So the real latency of PMI handler is reload_val + new_raw_count.
>> + */
>
> That is very confused, the PMI latency is utterly unrelated to anything
> you do here.
Will fix the confusing comments in V3.
>
>> + delta = (reload_val << shift) + (new_raw_count << shift) -
>> + (prev_raw_count << shift);
>> + delta >>= shift;
>> +
>> + local64_add(reload_val * (reload_times - 1), &event->count);
>> + local64_add(delta, &event->count);
>
> And this is still wrong I think. Consider the case where !reload_times.
>
> We can easily call pmu::read() twice in one period. In that case we
> should increment count with (new - prev).
>
> Only once we get a new sample and are known to have wrapped, do we need
> to consider that wrap.
The code above should fix this issue.
>
>> + local64_sub(delta, &hwc->period_left);
>> +
>> + return x86_perf_event_set_period(event);
>> +}
>> +
>> static void __intel_pmu_pebs_event(struct perf_event *event,
>> struct pt_regs *iregs,
>> void *base, void *top,
>> int bit, int count)
>> {
>> + struct hw_perf_event *hwc = &event->hw;
>> struct perf_sample_data data;
>> struct pt_regs regs;
>> void *at = get_next_pebs_record_by_bit(base, top, bit);
>>
>> - if (!intel_pmu_save_and_restart(event) &&
>> - !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
>> + if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
>> + /*
>> + * Now, auto-reload is only enabled in fixed period mode.
>> + * The reload value is always hwc->sample_period.
>> + * May need to change it, if auto-reload is enabled in
>> + * freq mode later.
>> + */
>> + intel_pmu_save_and_restart_reload(event, hwc->sample_period,
>> + count);
>
> Since you pass in @event, hwc->sample_period is already available to it,
> no need to pass that in as well.
OK. I will change it.
Thanks,
Kan
>
>> + } else if (!intel_pmu_save_and_restart(event))
>> return;
>>
>> while (count > 1) {