From: Kan Liang <[email protected]>
------
Changes since V2:
- Refined the changelog
- Introduced a specific read function for large PEBS.
  The previous generic PEBS read function was confusing.
  Disabled the PMU in the pmu::read() path for large PEBS.
  Handled the corner case when reload_times == 0.
- Modified the parameter of intel_pmu_save_and_restart_reload().
  Discarded local64_cmpxchg.
- Added Fixes tags
- Added a WARN to handle reload_times == 0 || reload_val == 0

Changes since V1:
- Check PERF_X86_EVENT_AUTO_RELOAD before calling
  intel_pmu_save_and_restart()
- Introduce a special-purpose intel_pmu_save_and_restart()
  just for AUTO_RELOAD.
- New patch to disable userspace RDPMC usage if large PEBS is enabled.
------
There is a bug when reading event->count with large PEBS enabled.
Here is an example:
#./read_count
0x71f0
0x122c0
0x1000000001c54
0x100000001257d
0x200000000bdc5
The bug is caused by two issues:
- In x86_perf_event_update, the calculation of event->count does not
  take the auto-reload values into account.
- In x86_pmu_read, the undrained values in the large PEBS buffer are
  not counted.

The first issue was introduced with the auto-reload mechanism in commit
851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism when
possible").

Patch 1 fixes the issue in x86_perf_event_update.

The second issue was introduced in commit b8241d20699e
("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
interrupt threshold)").

Patches 2-4 fix the issue in x86_pmu_read.

Besides these two issues, userspace RDPMC is broken for large PEBS as
well. The RDPMC issue was also introduced in commit b8241d20699e
("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS
interrupt threshold)").

Patch 5 fixes the RDPMC issue.
The source code of read_count is as below.
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

typedef unsigned long long u64;

/* page size for munmap(); set in main() */
static long pagesize;

struct cpu {
	int fd;
	struct perf_event_mmap_page *buf;
};

int perf_open(struct cpu *ctx, int cpu)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_HARDWARE,
		.size = sizeof(struct perf_event_attr),
		.sample_period = 100000,
		.config = 0,
		.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
			       PERF_SAMPLE_TIME | PERF_SAMPLE_CPU,
		.precise_ip = 3,
		.mmap = 1,
		.comm = 1,
		.task = 1,
		.mmap2 = 1,
		.sample_id_all = 1,
		.comm_exec = 1,
	};

	ctx->buf = NULL;
	ctx->fd = syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	if (ctx->fd < 0) {
		perror("perf_event_open");
		return -1;
	}
	return 0;
}

void perf_close(struct cpu *ctx)
{
	close(ctx->fd);
	if (ctx->buf)
		munmap(ctx->buf, pagesize);
}

int main(int ac, char **av)
{
	struct cpu ctx;
	u64 count;

	pagesize = sysconf(_SC_PAGESIZE);
	perf_open(&ctx, 0);

	while (1) {
		sleep(5);

		if (read(ctx.fd, &count, 8) != 8) {
			perror("counter read");
			break;
		}
		printf("0x%llx\n", count);
	}
	perf_close(&ctx);
	return 0;
}
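
For reference, below is a minimal sketch (not part of the original
reproducer) of the userspace RDPMC read sequence that patch 5 disables
for large PEBS. It follows the self-monitoring protocol documented for
perf_event_mmap_page, assumes ctx->buf has been mmap'd from ctx->fd,
and reuses the u64 typedef from above.

	/*
	 * Sketch of a userspace RDPMC read, following the documented
	 * perf_event_mmap_page seqlock protocol (x86 only).
	 */
	u64 rdpmc_read(struct perf_event_mmap_page *pc)
	{
		u64 count, pmc;
		unsigned int seq, idx, width;

		do {
			seq = pc->lock;
			__sync_synchronize();	/* pairs with the kernel's updates */

			idx = pc->index;	/* 0 means RDPMC is not allowed */
			count = pc->offset;	/* kernel-maintained base count */
			width = pc->pmc_width;

			if (pc->cap_user_rdpmc && idx) {
				unsigned int lo, hi;

				/* read hardware counter idx - 1 directly */
				asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx - 1));
				pmc = lo | ((u64)hi << 32);

				/* sign-extend from the counter width */
				pmc = (u64)((long long)(pmc << (64 - width)) >> (64 - width));
				count += pmc;
			}

			__sync_synchronize();
		} while (pc->lock != seq);	/* retry if the kernel raced with us */

		return count;
	}

With auto-reload, the counter is reloaded behind userspace's back, so a
read like this cannot account for the undrained reloads; that is why
patch 5 clears PERF_X86_EVENT_RDPMC_ALLOWED when large PEBS is enabled.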
Kan Liang (5):
perf/x86/intel: fix event update for auto-reload
perf/x86: introduce read function for x86_pmu
perf/x86/intel/ds: introduce read function for large pebs
perf/x86/intel: introduce read function for intel_pmu
perf/x86: fix: disable userspace RDPMC usage for large PEBS
arch/x86/events/core.c | 5 ++-
arch/x86/events/intel/core.c | 9 +++++
arch/x86/events/intel/ds.c | 85 ++++++++++++++++++++++++++++++++++++++++++--
arch/x86/events/perf_event.h | 3 ++
4 files changed, 99 insertions(+), 3 deletions(-)
--
2.7.4
From: Kan Liang <[email protected]>
There is a bug when reading event->count with large PEBS enabled.
Here is an example:
#./read_count
0x71f0
0x122c0
0x1000000001c54
0x100000001257d
0x200000000bdc5
In fixed period mode, the auto-reload mechanism could be enabled for
PEBS events, but the calculation of event->count does not take the
auto-reload values into account. Anyone who reads event->count will get
the wrong result, e.g. x86_pmu_read. The calculation of
hwc->period_left is wrong as well, which will impact the calculation of
the period for the first record when PEBS returns multiple records.

The issue was introduced with the auto-reload mechanism in commit
851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism when
possible").
Introduce intel_pmu_save_and_restart_reload to calculate event->count
for auto-reload only.

There is a small gap between 'PEBS hardware is armed' and 'the NMI is
handled'. Because of the gap, the first record also needs to be
specially handled.

The formula to calculate the increments of event->count is as below
(a worked example follows):

  The increments = the period for the first record +
                   (reload_times - 1) * reload_val +
                   the gap

- 'The period for the first record' is the period left from the last
  PMI, which can be obtained from the previous event value.
- For the second and later records, the period is exactly the reload
  value. Simply add (reload_times - 1) * reload_val.
- Because of the auto-reload, the start point of counting is always
  (-reload_val), so the calculation of 'the gap' needs to be corrected
  by adding reload_val.

The period_left needs the same adjustment as well.

Nothing needs to be done in x86_perf_event_set_period(), because this
is fixed-period mode and period_left is already adjusted.
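
For illustration, here is the worked example as a small standalone
program (a sketch; the numbers are made up, not from a real trace).
Assume 48-bit counters, reload_val = 100000, three records in the PEBS
buffer (reload_times = 3), 20000 left from the last PMI
(prev_raw_count = -20000) and 30000 counted since the last reload
(new_raw_count = -70000); the shift discards the bits above the
counter width:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		int shift = 64 - 48;			/* 48-bit counters assumed */
		uint64_t reload_val = 100000;		/* the fixed sample period */
		int reload_times = 3;			/* records in the PEBS buffer */
		uint64_t prev_raw_count = -20000ULL;	/* 20000 left from the last PMI */
		uint64_t new_raw_count = -70000ULL;	/* 30000 counted since last reload */
		uint64_t delta;

		/* the period for the first record (20000) + the gap (30000) */
		delta = (reload_val << shift) + (new_raw_count << shift) -
			(prev_raw_count << shift);
		delta >>= shift;

		/* 20000 + 2 * 100000 + 30000 = 250000 */
		printf("increment = %llu\n",
		       (unsigned long long)(reload_val * (reload_times - 1) + delta));
		return 0;
	}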
Fixes: 851559e35fd5 ("perf/x86/intel: Use the PEBS auto reload mechanism when possible")
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8156e47..6533426 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1303,17 +1303,82 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
	return NULL;
}

+/*
+ * Specific intel_pmu_save_and_restart() for auto-reload.
+ * It is only called from drain_pebs().
+ */
+static int intel_pmu_save_and_restart_reload(struct perf_event *event,
+					     int reload_times)
+{
+	struct hw_perf_event *hwc = &event->hw;
+	int shift = 64 - x86_pmu.cntval_bits;
+	u64 reload_val = hwc->sample_period;
+	u64 prev_raw_count, new_raw_count;
+	u64 delta;
+
+	WARN_ON((reload_times == 0) || (reload_val == 0));
+
+	/*
+	 * drain_pebs() only happens when the PMU is disabled.
+	 * It does not need to handle the previous event value specially.
+	 * The hwc->prev_count will be updated in x86_perf_event_set_period().
+	 */
+	prev_raw_count = local64_read(&hwc->prev_count);
+	rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+
+	/*
+	 * Now we have the new raw value and have updated the prev
+	 * timestamp already. We can now calculate the elapsed delta
+	 * (event-)time and add that to the generic event.
+	 *
+	 * Careful, not all hw sign-extends above the physical width
+	 * of the count.
+	 *
+	 * There is a small gap between 'PEBS hardware is armed' and 'the NMI
+	 * is handled'. Because of the gap, the first record also needs to be
+	 * specially handled.
+	 * The formula to calculate the increments of event->count is:
+	 *   the increments = the period for the first record +
+	 *                    (reload_times - 1) * reload_val +
+	 *                    the gap
+	 * 'The period for the first record' can be obtained from
+	 * -prev_raw_count.
+	 *
+	 * 'The gap' = new_raw_count + reload_val, because the start point of
+	 * counting is always -reload_val for auto-reload.
+	 *
+	 * The period_left needs the same adjustment by adding reload_val.
+	 */
+	delta = (reload_val << shift) + (new_raw_count << shift) -
+		(prev_raw_count << shift);
+	delta >>= shift;
+
+	local64_add(reload_val * (reload_times - 1), &event->count);
+	local64_add(delta, &event->count);
+	local64_sub(delta, &hwc->period_left);
+
+	return x86_perf_event_set_period(event);
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
				   struct pt_regs *iregs,
				   void *base, void *top,
				   int bit, int count)
{
+	struct hw_perf_event *hwc = &event->hw;
	struct perf_sample_data data;
	struct pt_regs regs;
	void *at = get_next_pebs_record_by_bit(base, top, bit);
-	if (!intel_pmu_save_and_restart(event) &&
-	    !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+	if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+		/*
+		 * Now, auto-reload is only enabled in fixed period mode.
+		 * The reload value is always hwc->sample_period.
+		 * This may need to change if auto-reload is enabled in
+		 * freq mode later.
+		 */
+		intel_pmu_save_and_restart_reload(event, count);
+	} else if (!intel_pmu_save_and_restart(event))
		return;
	while (count > 1) {
--
2.7.4
From: Kan Liang <[email protected]>
Userspace RDPMC has never worked with large PEBS since large PEBS was
introduced in commit b8241d20699e ("perf/x86/intel: Implement batched
PEBS interrupt handling (large PEBS interrupt threshold)").

When the PEBS interrupt threshold is larger than one, there is no way
for userspace RDPMC to get the exact auto-reload times and value.

Disable userspace RDPMC when large PEBS is enabled.
Fixes: b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)")
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index acd7ffc..703bd81 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2120,7 +2120,8 @@ static int x86_pmu_event_init(struct perf_event *event)
		event->destroy(event);
	}

-	if (READ_ONCE(x86_pmu.attr_rdpmc))
+	if (READ_ONCE(x86_pmu.attr_rdpmc) &&
+	    !(event->hw.flags & PERF_X86_EVENT_FREERUNNING))
		event->hw.flags |= PERF_X86_EVENT_RDPMC_ALLOWED;

	return err;
--
2.7.4
From: Kan Liang <[email protected]>
When the PEBS interrupt threshold is larger than one, there is no way
to get the exact auto-reload times and value, which are needed for the
event update, unless the PEBS buffer is flushed.

Introduce intel_pmu_large_pebs_read() to drain the PEBS buffer on event
read when large PEBS is enabled.

To prevent races, drain_pebs() is only called when the PMU is disabled.

Unconditionally call x86_perf_event_update() for large PEBS:
- It is easy to call pmu::read() twice in a short period, in which case
  there may be no samples in the PEBS buffer; x86_perf_event_update()
  is needed to update the count.
- It is harmless to call x86_perf_event_update() in the other cases.
- It is safe: there is no need to worry about auto-reload, because the
  PMU is disabled.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/ds.c | 16 ++++++++++++++++
arch/x86/events/perf_event.h | 2 ++
2 files changed, 18 insertions(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 6533426..1c11fa2 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1303,6 +1303,22 @@ get_next_pebs_record_by_bit(void *base, void *top, int bit)
	return NULL;
}

+int intel_pmu_large_pebs_read(struct perf_event *event)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	/* Check if the event has large pebs */
+	if (!pebs_needs_sched_cb(cpuc))
+		return 0;
+
+	perf_pmu_disable(event->pmu);
+	intel_pmu_drain_pebs_buffer();
+	x86_perf_event_update(event);
+	perf_pmu_enable(event->pmu);
+
+	return 1;
+}
+
/*
 * Specific intel_pmu_save_and_restart() for auto-reload.
 * It is only called from drain_pebs().
 */
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 805400b..7d3cd32 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -923,6 +923,8 @@ void intel_pmu_pebs_disable_all(void);
void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+int intel_pmu_large_pebs_read(struct perf_event *event);
+
void intel_ds_init(void);
void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
--
2.7.4
From: Kan Liang <[email protected]>
Large PEBS needs to be specially handled in the event count read path,
and it is only available for intel_pmu.

Only large PEBS needs the special handling. When the threshold is one,
even if auto-reload is enabled, no special handling is needed, because:
- auto-reload only takes effect when the event overflows, and once
  there is an overflow the NMI handler will automatically drain_pebs().
- For the other cases, x86_perf_event_update() is good enough.

Fixes: b8241d20699e ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)")
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/intel/core.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 731153a..1610a9d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2060,6 +2060,14 @@ static void intel_pmu_del_event(struct perf_event *event)
	intel_pmu_pebs_del(event);
}

+static void intel_pmu_read_event(struct perf_event *event)
+{
+	if (intel_pmu_large_pebs_read(event))
+		return;
+
+	x86_perf_event_update(event);
+}
+
static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
{
	int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
@@ -3495,6 +3503,7 @@ static __initconst const struct x86_pmu intel_pmu = {
	.disable		= intel_pmu_disable_event,
	.add			= intel_pmu_add_event,
	.del			= intel_pmu_del_event,
+	.read			= intel_pmu_read_event,
	.hw_config		= intel_pmu_hw_config,
	.schedule_events	= x86_schedule_events,
	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
--
2.7.4
From: Kan Liang <[email protected]>
Large PEBS needs to be specially handled in the event count read path.
Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/events/core.c | 2 ++
arch/x86/events/perf_event.h | 1 +
2 files changed, 3 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 140d332..acd7ffc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1884,6 +1884,8 @@ early_initcall(init_hw_perf_events);
static inline void x86_pmu_read(struct perf_event *event)
{
+	if (x86_pmu.read)
+		return x86_pmu.read(event);
	x86_perf_event_update(event);
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8e4ea143..805400b 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -519,6 +519,7 @@ struct x86_pmu {
	void		(*disable)(struct perf_event *);
	void		(*add)(struct perf_event *);
	void		(*del)(struct perf_event *);
+	void		(*read)(struct perf_event *event);
	int		(*hw_config)(struct perf_event *event);
	int		(*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
	unsigned	eventsel;
--
2.7.4
Hi,
On Mon, Jan 29, 2018 at 8:29 AM, <[email protected]> wrote:
>
> From: Kan Liang <[email protected]>
>
> ------
>
> Changes since V2:
> - Refined the changelog
> - Introduced specific read function for large PEBS.
> The previous generic PEBS read function is confusing.
> Disabled PMU in pmu::read() path for large PEBS.
> Handled the corner case when reload_times == 0.
> - Modified the parameter of intel_pmu_save_and_restart_reload()
> Discarded local64_cmpxchg
> - Added fixes tag
> - Added WARN to handle reload_times == 0 || reload_val == 0
>
> Changes since V1:
> - Check PERF_X86_EVENT_AUTO_RELOAD before call
> intel_pmu_save_and_restore()
It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed with
large PEBS. Large PEBS requires a fixed period, so the kernel could
make up the period from the event and store it in the sampling buffer.

I tried using large PEBS recently, and despite trying different option
combinations of perf record, I was not able to get it to work:

$ perf record -c 1 -e cpu/event=0xd1,umask=0x10,period=19936/pp
  --no-timestamp --no-period -a -C 0

But I was able to make this work with a much older kernel.

Another annoyance I ran into is that perf record requires '-c period'
in order not to set PERF_SAMPLE_PERIOD in the event.
If I do:
perf record -c 1 -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0
I get
perf_event_attr:
type 4
size 112
config 0x10d1
{ sample_period, sample_freq } 199936
sample_type IP|TID|CPU
But if I do:
perf record -e cpu/event=0xd1,umask=0x10,period=19936/pp
--no-timestamp --no-period -a -C 0
I get
perf_event_attr:
type 4
size 112
config 0x10d1
{ sample_period, sample_freq } 199936
sample_type IP|TID|CPU|PERIOD
Perf should check if all events have a period=; then it should not
pass PERF_SAMPLE_PERIOD, even more so when only one event is defined.

Also it does not seem to honor --no-period.
All of this makes it hard to use large PEBS in general.
So I think your series should also address this part.
On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
SNIP
> Perf should check if all events have a period=; then it should not
> pass PERF_SAMPLE_PERIOD, even more so when only one event is defined.
>
> Also it does not seem to honor --no-period.
yep, there's a bug in the period=x term handling;
we did not re/set the sample_type based on that

attached patch fixes that for me, and also takes into account
the --no/-period options
jirka
---
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f251e824edac..907267206973 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -1566,7 +1566,8 @@ static struct option __record_options[] = {
	OPT_BOOLEAN_SET('T', "timestamp", &record.opts.sample_time,
			&record.opts.sample_time_set,
			"Record the sample timestamps"),
-	OPT_BOOLEAN('P', "period", &record.opts.period, "Record the sample period"),
+	OPT_BOOLEAN_SET('P', "period", &record.opts.period, &record.opts.period_set,
+			"Record the sample period"),
	OPT_BOOLEAN('n', "no-samples", &record.opts.no_samples,
		    "don't sample"),
	OPT_BOOLEAN_SET('N', "no-buildid-cache", &record.no_buildid_cache,
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index 2357f4ccc9c7..cfe46236a5e5 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -50,6 +50,7 @@ struct record_opts {
	bool	     sample_time_set;
	bool	     sample_cpu;
	bool	     period;
+	bool	     period_set;
	bool	     running_time;
	bool	     full_auxtrace;
	bool	     auxtrace_snapshot_mode;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 66fa45198a11..ff359c9ece2e 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel *evsel,
			if (!(term->weak && opts->user_interval != ULLONG_MAX)) {
				attr->sample_period = term->val.period;
				attr->freq = 0;
+				perf_evsel__reset_sample_bit(evsel, PERIOD);
			}
			break;
		case PERF_EVSEL__CONFIG_TERM_FREQ:
			if (!(term->weak && opts->user_freq != UINT_MAX)) {
				attr->sample_freq = term->val.freq;
				attr->freq = 1;
+				perf_evsel__set_sample_bit(evsel, PERIOD);
			}
			break;
		case PERF_EVSEL__CONFIG_TERM_TIME:
@@ -969,9 +971,6 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
	if (target__has_cpu(&opts->target) || opts->sample_cpu)
		perf_evsel__set_sample_bit(evsel, CPU);

-	if (opts->period)
-		perf_evsel__set_sample_bit(evsel, PERIOD);
-
	/*
	 * When the user explicitly disabled time don't force it here.
	 */
@@ -1073,6 +1072,14 @@ void perf_evsel__config(struct perf_evsel *evsel, struct record_opts *opts,
	apply_config_terms(evsel, opts, track);

	evsel->ignore_missing_thread = opts->ignore_missing_thread;
+
+	/* The --period option takes precedence. */
+	if (opts->period_set) {
+		if (opts->period)
+			perf_evsel__set_sample_bit(evsel, PERIOD);
+		else
+			perf_evsel__reset_sample_bit(evsel, PERIOD);
+	}
}
static int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
On 1/30/2018 4:16 AM, Stephane Eranian wrote:
SNIP
> It is not yet clear to me why PERF_SAMPLE_PERIOD is not allowed with
> large PEBS. Large PEBS requires a fixed period, so the kernel could
> make up the period from the event and store it in the sampling buffer.
PERF_SAMPLE_PERIOD is implicitly set in freq mode.
For now, large PEBS doesn't support freq mode.
>
> I tried using large PEBS recently, and despite trying different option
> combinations of perf record, I was not able to get it to work:
>
> $ perf record -c 1 -e cpu/event=0xd1,umask=0x10,period=19936/pp
> --no-timestamp --no-period -a -C 0
>
> But I was able to make this work with a much older kernel.
Is there any error message?
>
> Another annoyance I ran into is that perf record requires '-c period'
> in order not to set PERF_SAMPLE_PERIOD in the event.
SNIP
> Perf should check if all events have a period=; then it should not
> pass PERF_SAMPLE_PERIOD, even more so when only one event is defined.
To my understanding, "period=" is a per-event term. It should not
change the global setting.
I think it's better to still use "--no-period" to control the PERIOD
bit.
>
> Also it does not seem to honor --no-period.
>
Right, the perf tool doesn't clear PERIOD for --no-period:

	if (opts->period)
		perf_evsel__set_sample_bit(evsel, PERIOD);

I will submit a patch to fix it.
> All of this makes it hard to use large PEBS in general.
>
> So I think your series should also address this part.
The issue looks like a perf tool problem. I will take a look at it,
but will probably address it in a separate patch set.
Thanks,
Kan
On 1/30/2018 8:39 AM, Jiri Olsa wrote:
> On Tue, Jan 30, 2018 at 01:16:39AM -0800, Stephane Eranian wrote:
SNIP
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 66fa45198a11..ff359c9ece2e 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -745,12 +745,14 @@ static void apply_config_terms(struct perf_evsel *evsel,
> if (!(term->weak && opts->user_interval != ULLONG_MAX)) {
> attr->sample_period = term->val.period;
> attr->freq = 0;
> + perf_evsel__reset_sample_bit(evsel, PERIOD);
> }
> break;
> case PERF_EVSEL__CONFIG_TERM_FREQ:
> if (!(term->weak && opts->user_freq != UINT_MAX)) {
> attr->sample_freq = term->val.freq;
> attr->freq = 1;
> + perf_evsel__set_sample_bit(evsel, PERIOD);
> }
If we do so, some events could be in fixed mode without PERIOD set,
while other events could be in freq mode with PERIOD set.
That could be a problem for large PEBS. The PEBS buffer is shared
among events, and it doesn't support freq mode yet.
Thanks,
Kan
On Tue, Jan 30, 2018 at 09:59:15AM -0500, Liang, Kan wrote:
SNIP
> If we do so, some events could be in fixed mode without PERIOD set,
> while other events could be in freq mode with PERIOD set.
it also sets the attr->freq, so it's not in fixed mode
jirka
On 1/30/2018 10:04 AM, Jiri Olsa wrote:
SNIP
>> If we do so, some events could be in fixed mode without PERIOD set,
>> while other events could be in freq mode with PERIOD set.
>
> it also sets the attr->freq, so it's not in fixed mode
I mean TERM_PERIOD. It probably enables large PEBS:
>>> attr->freq = 0;
>>> + perf_evsel__reset_sample_bit(evsel, PERIOD);
Events in fixed mode could enable large PEBS. Events in freq mode
should not enable large PEBS.
I think that could be a problem if some events try to enable large
PEBS while others do not.
Thanks,
Kan
On Tue, Jan 30, 2018 at 7:25 AM, Liang, Kan <[email protected]> wrote:
SNIP
> Events in fixed mode could enable large PEBS. Events in freq mode
> should not enable large PEBS.
> I think that could be a problem if some events try to enable large
> PEBS while others do not.
>
You only enable large PEBS if 100% of the events use fixed periods,
either via -c period or because they all use an individual period=p.
The --no-period option could also be used to remove the period for
measurements where the period is not needed.
On 1/30/2018 11:36 AM, Stephane Eranian wrote:
SNIP
> You only enable large PEBS if 100% of the events use fixed periods,
> either via -c period or because they all use an individual period=p.
> The --no-period option could also be used to remove the period for
> measurements where the period is not needed.
Oh, right, the kernel has already guaranteed that.
	if (cpuc->n_pebs == cpuc->n_large_pebs) {
		threshold = ds->pebs_absolute_maximum -
			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
	} else {
Sorry for the noise.
jirka's patch looks good to me.
Thanks,
Kan
On Tue, Jan 30, 2018 at 11:48:18AM -0500, Liang, Kan wrote:
SNIP
> Oh, right, the kernel has already guaranteed that.
>
> 	if (cpuc->n_pebs == cpuc->n_large_pebs) {
> 		threshold = ds->pebs_absolute_maximum -
> 			x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> 	} else {
>
> Sorry for the noise.
>
> jirka's patch looks good to me.
cool, I'll post it later this week
thanks,
jirka
On Tue, Jan 30, 2018 at 10:52 AM, Jiri Olsa <[email protected]> wrote:
>
> On Tue, Jan 30, 2018 at 11:48:18AM -0500, Liang, Kan wrote:
>
> SNIP
>
> > Sorry for the noise.
> >
> > jirka's patch looks good to me.
>
> cool, I'll post it later this week
>
Still, the part I am missing here is why asking for
PERF_SAMPLE_PERIOD voids large PEBS.
> Still, the part I am missing here is why asking for
> PERF_SAMPLE_PERIOD voids large PEBS.
I think it was disabled together with frequency mode
(which we could support too, but it's a bit more work).
But yes, PERIOD could probably be allowed.
-Andi
On Tue, Jan 30, 2018 at 07:59:41PM -0800, Andi Kleen wrote:
> > Still, the part I am missing here is why asking for
> > PERF_SAMPLE_PERIOD voids large PEBS.
>
> I think it was disabled together with frequency mode
> (which we could support too, but it's a bit more work).
>
> But yes, PERIOD could probably be allowed.
looks like it's just a matter of adding PERF_SAMPLE_PERIOD
to PEBS_FREERUNNING_FLAGS; we already populate the period
in setup_pebs_sample_data
jirka
---
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 8e4ea143ed96..78f91ec1056e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -93,7 +93,8 @@ struct amd_nb {
PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
- PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
+ PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER | \
+ PERF_SAMPLE_PERIOD)
#define PEBS_REGS \
(PERF_REG_X86_AX | \
On Wed, Jan 31, 2018 at 10:15:39AM +0100, Jiri Olsa wrote:
> SNIP
>
> looks like it's just a matter of adding PERF_SAMPLE_PERIOD
> to PEBS_FREERUNNING_FLAGS; we already populate the period
> in setup_pebs_sample_data
>
> jirka
>
>
> SNIP
seems to work, getting a large PEBS event for the following command line:
perf record -e cycles:P -c 100 --no-timestamp -C 0 --period
jirka
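For anyone reproducing this, a rough user-space equivalent of that command
line, assuming the PEBS_FREERUNNING_FLAGS change above is applied (a sketch,
untested):

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/*
	 * Fixed period, PEBS, and only sample_type bits covered by
	 * PEBS_FREERUNNING_FLAGS once PERF_SAMPLE_PERIOD is added,
	 * mirroring: perf record -e cycles:P -c 100 --no-timestamp --period
	 */
	static int open_large_pebs_event(int cpu)
	{
		struct perf_event_attr attr = {
			.type = PERF_TYPE_HARDWARE,
			.size = sizeof(attr),
			.config = PERF_COUNT_HW_CPU_CYCLES,
			.sample_period = 100,	/* fixed period, freq == 0 */
			.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_PERIOD,
			.precise_ip = 3,	/* PEBS */
		};

		return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	}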
On 1/31/2018 8:15 AM, Jiri Olsa wrote:
> On Wed, Jan 31, 2018 at 10:15:39AM +0100, Jiri Olsa wrote:
>> SNIP
>
> seems to work, getting a large PEBS event for the following command line:
>
> perf record -e cycles:P -c 100 --no-timestamp -C 0 --period
>
>
Yes, I tried the patch. The large PEBS can be enabled with the PERIOD
flag. Everything looks good.
Jirka, could you please post the patch as well?
Thanks,
Kan
On Wed, Jan 31, 2018 at 10:15:33AM -0500, Liang, Kan wrote:
>
>
> SNIP
> Yes, I tried the patch. The large PEBS can be enabled with the PERIOD flag.
> Everything looks good.
>
> Jirka, could you please post the patch as well?
yes, will post that one as well
jirka
On Mon, Jan 29, 2018 at 08:29:29AM -0800, [email protected] wrote:
> +/*
> + * Specific intel_pmu_save_and_restart() for auto-reload.
> + * It should only be called from drain_pebs().
> + */
> +static int intel_pmu_save_and_restart_reload(struct perf_event *event,
> + int reload_times)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 reload_val = hwc->sample_period;
> + u64 prev_raw_count, new_raw_count;
> + u64 delta;
> +
> + WARN_ON((reload_times == 0) || (reload_val == 0));
I'm struggling to see why !count is a problem. You can call read() twice
without having extra PEBS samples, right?
Which also seems to suggest we have a problem in intel_pmu_drain_pebs*()
because those will simply not call us when there aren't any PEBS events
in the buffer, even though the count value can have changed. Your patch
doesn't appear to address that.
> +
> + /*
> + * drain_pebs() only happens when the PMU is disabled.
> + * It does not need to specially handle the previous event value.
> + * The hwc->prev_count will be updated in x86_perf_event_set_period().
That's completely buggered. You shouldn't need to call
x86_perf_event_set_period() _at_all_.
> + */
WARN_ON(this_cpu_read(cpu_hw_events.enabled));
> + prev_raw_count = local64_read(&hwc->prev_count);
> + rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> +
> + /*
> + * Now we have the new raw value and have updated the prev
> + * timestamp already. We can now calculate the elapsed delta
> + * (event-)time and add that to the generic event.
> + *
> + * Careful, not all hw sign-extends above the physical width
> + * of the count.
> + *
> + * There is a small gap between 'PEBS hardware is armed' and 'the NMI
> + * is handled'. Because of the gap, the first record also needs to be
> + * specially handled.
True, but not at all relevant.
> + * The formula to calculate the increments of event->count is as below.
> + * The increments = the period for first record +
> + * (reload_times - 1) * reload_val +
> + * the gap
> + * 'The period for first record' can be obtained from -prev_raw_count.
> + *
> + * 'The gap' = new_raw_count + reload_val, because the start point of
> + * counting is always -reload_val for auto-reload.
That is still very much confused; what you seem to want is something like:
Since the counter increments a negative counter value and
overflows on the sign switch, giving the interval:
[-period, 0]
the difference between two consecutive reads is:
A) value2 - value1;
when no overflows have happened in between,
B) (0 - value1) + (value2 - (-period));
when one overflow happened in between,
C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
when @n overflows happened in between.
Here A) is the obvious difference, B) is the extension to the
discrete interval, where the first term is to the top of the
interval and the second term is from the bottom of the next
interval, and C) the extension to multiple intervals, where the
middle term is the whole intervals covered.
An equivalent of C, by reduction, is:
value2 - value1 + n * period
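As a quick sanity check of that reduction, with arbitrarily picked numbers
(period = 100000, two overflows, both counter values in [-period, 0]):

	#include <assert.h>
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		int64_t period = 100000, n = 2;
		int64_t value1 = -30000, value2 = -80000;

		/* case C, spelled out */
		int64_t full = (0 - value1) + (n - 1) * period +
			       (value2 - (-period));
		/* the reduced form */
		int64_t reduced = value2 - value1 + n * period;

		assert(full == reduced);
		printf("%lld\n", (long long)full);	/* 150000 */
		return 0;
	}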
> + *
> + * The period_left needs to do the same adjustment by adding
> + * reload_val.
> + */
> + delta = (reload_val << shift) + (new_raw_count << shift) -
> + (prev_raw_count << shift);
> + delta >>= shift;
> +
> + local64_add(reload_val * (reload_times - 1), &event->count);
> + local64_add(delta, &event->count);
> + local64_sub(delta, &hwc->period_left);
Nobody cares about period_left, since AUTO_RELOAD already limits the
periods to what the hardware supports, right?
> +
> + return x86_perf_event_set_period(event);
> +}
With the exception of handling 'empty' buffers, I ended up with the
below. Please try again.
---
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1156,16 +1156,13 @@ int x86_perf_event_set_period(struct per
per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
- if (!(hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) ||
- local64_read(&hwc->prev_count) != (u64)-left) {
- /*
- * The hw event starts counting from this event offset,
- * mark it to be able to extract future deltas:
- */
- local64_set(&hwc->prev_count, (u64)-left);
+ /*
+ * The hw event starts counting from this event offset,
+ * mark it to be able to extract future deltas:
+ */
+ local64_set(&hwc->prev_count, (u64)-left);
- wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
- }
+ wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
/*
* Due to erratum on certain cpu we need
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1306,17 +1306,89 @@ get_next_pebs_record_by_bit(void *base,
return NULL;
}
+/*
+ * Special variant of intel_pmu_save_and_restart() for auto-reload.
+ */
+static int
+intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ int shift = 64 - x86_pmu.cntval_bits;
+ u64 period = hwc->sample_period;
+ u64 prev_raw_count, new_raw_count;
+ u64 delta;
+
+ WARN_ON(!period);
+
+ /*
+ * drain_pebs() only happens when the PMU is disabled.
+ */
+ WARN_ON(this_cpu_read(cpu_hw_events.enabled));
+
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdpmcl(hwc->event_base_rdpmc, new_raw_count);
+ local64_set(&hwx->prev_count, new_raw_count);
+
+ /*
+ * Careful, not all hw sign-extends above the physical width
+ * of the counter.
+ */
+ delta = (new_raw_count << shift) - (prev_raw_count << shift);
+ delta >>= shift;
+
+ /*
+ * Since the counter increments a negative counter value and
+ * overflows on the sign switch, giving the interval:
+ *
+ * [-period, 0]
+ *
+ * the difference between two consecutive reads is:
+ *
+ * A) value2 - value1;
+ * when no overflows have happened in between,
+ *
+ * B) (0 - value1) + (value2 - (-period));
+ * when one overflow happened in between,
+ *
+ * C) (0 - value1) + (n - 1) * (period) + (value2 - (-period));
+ * when @n overflows happened in between.
+ *
+ * Here A) is the obvious difference, B) is the extension to the
+ * discrete interval, where the first term is to the top of the
+ * interval and the second term is from the bottom of the next
+ * interval, and C) the extension to multiple intervals, where the
+ * middle term is the whole intervals covered.
+ *
+ * An equivalent of C, by reduction, is:
+ *
+ * value2 - value1 + n * period
+ */
+ local64_add(delta + period * count, &event->count);
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs,
void *base, void *top,
int bit, int count)
{
+ struct hw_perf_event *hwc = &event->hw;
struct perf_sample_data data;
struct pt_regs regs;
void *at = get_next_pebs_record_by_bit(base, top, bit);
- if (!intel_pmu_save_and_restart(event) &&
- !(event->hw.flags & PERF_X86_EVENT_AUTO_RELOAD))
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
+ /*
+ * Now, auto-reload is only enabled in fixed period mode.
+ * The reload value is always hwc->sample_period.
+ * This may need to change if auto-reload is enabled in
+ * freq mode later.
+ */
+ intel_pmu_save_and_restart_reload(event, count);
+ } else if (!intel_pmu_save_and_restart(event))
return;
while (count > 1) {
> With the exception of handling 'empty' buffers, I ended up with the
> below. Please try again.
>
There are two small errors. After fixing them, the patch works well.
> SNIP
>
> +/*
> + * Special variant of intel_pmu_save_and_restart() for auto-reload.
> + */
> +static int
> +intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + int shift = 64 - x86_pmu.cntval_bits;
> + u64 period = hwc->sample_period;
> + u64 prev_raw_count, new_raw_count;
> + u64 delta;
> +
> + WARN_ON(!period);
> +
> + /*
> + * drain_pebs() only happens when the PMU is disabled.
> + */
> + WARN_ON(this_cpu_read(cpu_hw_events.enabled));
> +
> + prev_raw_count = local64_read(&hwc->prev_count);
> + rdpmcl(hwc->event_base_rdpmc, new_raw_count);
> + local64_set(&hwx->prev_count, new_raw_count);
Here is a typo. It should be &hwc->prev_count.
> +
> + /*
> + * Careful, not all hw sign-extends above the physical width
> + * of the counter.
> + */
> + delta = (new_raw_count << shift) - (prev_raw_count << shift);
> + delta >>= shift;
new_raw_count could be smaller than prev_raw_count, so the sign bit
will be set and the logical right shift of delta could be wrong.
I think we can add a period here to prevent it.
+ delta = (period << shift) + (new_raw_count << shift) -
+ (prev_raw_count << shift);
+ delta >>= shift;
......
+ local64_add(delta + period * (count - 1), &event->count);
Thanks,
Kan
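A stand-alone illustration of the unsigned-shift problem Kan describes, with
a hypothetical 48-bit counter (shift = 16); a sign-extending s64 variant,
like the one Peter suggests later in the thread, preserves the sign:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		int shift = 16;			/* 64 - 48 counter bits */
		uint64_t prev = 100, new = 99;	/* new_raw_count < prev_raw_count */

		uint64_t delta = (new << shift) - (prev << shift);
		delta >>= shift;		/* logical shift: huge bogus value */

		int64_t good = ((int64_t)(new << shift) >> shift) -
			       ((int64_t)(prev << shift) >> shift);	/* -1 */

		printf("u64: %llu  s64: %lld\n",
		       (unsigned long long)delta, (long long)good);
		return 0;
	}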
On Tue, Feb 06, 2018 at 12:58:23PM -0500, Liang, Kan wrote:
>
>
> > With the exception of handling 'empty' buffers, I ended up with the
> > below. Please try again.
> >
>
> There are two small errors. After fixing them, the patch works well.
Well, it still doesn't do A, two read()s without a PEBS record in between.
So that needs fixing. What 3/5 does, calling x86_perf_event_update() after
drain_pebs(), is actively wrong after this patch.
> > +
> > + /*
> > + * Careful, not all hw sign-extends above the physical width
> > + * of the counter.
> > + */
> > + delta = (new_raw_count << shift) - (prev_raw_count << shift);
> > + delta >>= shift;
>
> new_raw_count could be smaller than prev_raw_count, so the sign bit
> will be set and the logical right shift of delta could be wrong.
>
> I think we can add a period here to prevent it.
> + delta = (period << shift) + (new_raw_count << shift) -
> + (prev_raw_count << shift);
> + delta >>= shift;
> ......
> + local64_add(delta + period * (count - 1), &event->count);
>
Right, it does, but that wrecks case A again, because then we get here
with !@count.
Maybe something like:
s64 new, old;
new = ((s64)(new_raw_count << shift) >> shift);
old = ((s64)(old_raw_count << shift) >> shift);
local64_add(new - old + count * period, &event->count);
And then make intel_pmu_drain_pebs_*() call this function even when !n.
On 2/9/2018 9:09 AM, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 12:58:23PM -0500, Liang, Kan wrote:
>>
>>
>>> With the exception of handling 'empty' buffers, I ended up with the
>>> below. Please try again.
>>>
>>
>> There are two small errors. After fixing them, the patch works well.
>
> Well, it still doesn't do A, two read()s without a PEBS record in between.
> So that needs fixing. What 3/5 does, calling x86_perf_event_update() after
> drain_pebs(), is actively wrong after this patch.
>
To my understanding, for case A, drain_pebs() will return immediately;
it cannot reach this code, because there is no PEBS record ready, so
ds->pebs_index should be the same as ds->pebs_buffer_base.
3/5 is to handle case A.
Thanks,
Kan
>>> +
>>> + /*
>>> + * Careful, not all hw sign-extends above the physical width
>>> + * of the counter.
>>> + */
>>> + delta = (new_raw_count << shift) - (prev_raw_count << shift);
>>> + delta >>= shift;
>>
>> new_raw_count could be smaller than prev_raw_count.
>> The sign bit will be set. The delta>> could be wrong.
>>
>> I think we can add a period here to prevent it.
>> + delta = (period << shift) + (new_raw_count << shift) -
>> + (prev_raw_count << shift);
>> + delta >>= shift;
>> ......
>> + local64_add(delta + period * (count - 1), &event->count);
>>
>
> Right it does, but that wrecks case A again, because then we get here
> with !@count.
>
> Maybe something like:
>
>
> s64 new, old;
>
> new = ((s64)(new_raw_count << shift) >> shift);
> old = ((s64)(old_raw_count << shift) >> shift);
>
> local64_add(new - old + count * period, &event->count);
>
>
> And then make intel_pmu_drain_pebs_*(), call this function even when !n.
>
On Fri, Feb 09, 2018 at 10:49:35AM -0500, Liang, Kan wrote:
>
>
> On 2/9/2018 9:09 AM, Peter Zijlstra wrote:
> > SNIP
> >
> > Well, it still doesn't do A, two read()s without a PEBS record in between.
> > So that needs fixing. What 3/5 does, calling x86_perf_event_update() after
> > drain_pebs(), is actively wrong after this patch.
> >
>
> To my understanding, for case A, drain_pebs() will return immediately;
> it cannot reach this code, because there is no PEBS record ready, so
> ds->pebs_index should be the same as ds->pebs_buffer_base.
Right, so fix that.
> 3/5 is to handle case A.
3/5 is terminally broken; you should not call x86_perf_event_update()
on an auto-reload event _ever_ after my patch.