2021-11-05 07:26:51

by Nadav Amit

[permalink] [raw]
Subject: Using perf_event_open() to sample multiple events of a process

Hello Ken, Peter,

I would appreciate some help regarding the use of perf_event_open()
to have multiple samples getting into the same mmap’d memory when they
are both attached to the same process.

I am doing so (using both PERF_FLAG_FD_NO_GROUP and PERF_FLAG_FD_OUTPUT),
but it results in -EINVAL. Debugging the code shows that
perf_event_set_output() fails due to the following check:

/*
* If its not a per-cpu rb, it must be the same task.
*/
if (output_event->cpu == -1 && output_event->ctx != event->ctx)
goto out;

However, it appears that at this point, event->ctx is not initialized
(it is null) so the test fails and the whole perf_event_open() syscall
fails.

Am I missing something? If not, I am unsure, unfortunately, what the
proper way to fix it is…

I include a small test that fails on my system. The second
perf_event_open fails due to the check in perf_event_set_output():




#define _GNU_SOURCE 1

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

long perf_event_open(struct perf_event_attr* event_attr, pid_t pid, int cpu,
int group_fd, unsigned long flags)
{
return syscall(__NR_perf_event_open, event_attr, pid, cpu, group_fd, flags);
}

int main(void)
{
pid_t pid = getpid();
int group_fd, fd;
void *p;
struct perf_event_attr pe = {
.type = 4,
.size = sizeof(struct perf_event_attr),
.config = 0x11d0,
.sample_type = 0x8,
.sample_period = 1000,
.precise_ip = 2,
};

group_fd = perf_event_open(&pe, pid, -1, -1, PERF_FLAG_FD_CLOEXEC |
PERF_FLAG_FD_NO_GROUP |
PERF_FLAG_FD_OUTPUT);

if (group_fd < 0) {
perror("first perf_event_open");
exit(-1);
}

p = mmap(NULL, 3 * 4096, PROT_READ|PROT_WRITE, MAP_SHARED, group_fd, 0);

if (p == MAP_FAILED) {
perror("MAP_FAILED");
exit(-1);
}

pe.config = 0x12d0;

fd = perf_event_open(&pe, pid, -1, group_fd, PERF_FLAG_FD_CLOEXEC |
PERF_FLAG_FD_NO_GROUP |
PERF_FLAG_FD_OUTPUT);

if (fd < 0) {
perror("second perf_event_open");
exit(-1);
}

printf("success\n");
}


2021-11-06 02:12:11

by Nadav Amit

[permalink] [raw]
Subject: Re: Using perf_event_open() to sample multiple events of a process



> On Nov 5, 2021, at 5:45 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Nov 04, 2021 at 10:57:50PM -0700, Nadav Amit wrote:
>> Hello Ken, Peter,
>>
>> I would appreciate some help regarding the use of perf_event_open()
>> to have multiple samples getting into the same mmap’d memory when they
>> are both attached to the same process.
>>
>> I am doing so (using both PERF_FLAG_FD_NO_GROUP and PERF_FLAG_FD_OUTPUT),
>> but it results in -EINVAL. Debugging the code shows that
>> perf_event_set_output() fails due to the following check:
>>
>> /*
>> * If its not a per-cpu rb, it must be the same task.
>> */
>> if (output_event->cpu == -1 && output_event->ctx != event->ctx)
>> goto out;
>>
>> However, it appears that at this point, event->ctx is not initialized
>> (it is null) so the test fails and the whole perf_event_open() syscall
>> fails.
>>
>> Am I missing something? If not, I am unsure, unfortunately, what the
>> proper way to fix it is…
>>
>> I include a small test that fails on my system. The second
>> perf_event_open fails due to the check in perf_event_set_output():
>>
>
> Works when you use the SET_OUTPUT ioctl()...
>
> I think something went sideways in the syscall path and things went out
> of order :/ I'll try and have a look.

Highly appreciated.

While at it, I would note that there is a mistake in Intel SDM 31.4.2.27,
Table 31-47 “Block Item Packet Definition”. The ID[5:0] takes 6 bits,
when in fact I presume it is only 5 bits (according to the table).

Perhaps you can forward it to the relevant people.

Thanks again,
Nadav

2021-11-06 02:12:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Using perf_event_open() to sample multiple events of a process

On Sat, Nov 06, 2021 at 01:45:57AM +0100, Peter Zijlstra wrote:
> On Thu, Nov 04, 2021 at 10:57:50PM -0700, Nadav Amit wrote:
> > Hello Ken, Peter,
> >
> > I would appreciate some help regarding the use of perf_event_open()
> > to have multiple samples getting into the same mmap’d memory when they
> > are both attached to the same process.
> >
> > I am doing so (using both PERF_FLAG_FD_NO_GROUP and PERF_FLAG_FD_OUTPUT),
> > but it results in -EINVAL. Debugging the code shows that
> > perf_event_set_output() fails due to the following check:
> >
> > /*
> > * If its not a per-cpu rb, it must be the same task.
> > */
> > if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> > goto out;
> >
> > However, it appears that at this point, event->ctx is not initialized
> > (it is null) so the test fails and the whole perf_event_open() syscall
> > fails.
> >
> > Am I missing something? If not, I am unsure, unfortunately, what the
> > proper way to fix it is…
> >
> > I include a small test that fails on my system. The second
> > perf_event_open fails due to the check in perf_event_set_output():
> >
>
> Works when you use the SET_OUTPUT ioctl()...
>
> I think something went sideways in the syscall path and things went out
> of order :/ I'll try and have a look.

The problem seems to be that we call perf_event_set_output() before we
set event->ctx, which is a bit of a problem.

Now, afaict it's been broken since c3f00c70276d ("perf: Separate
find_get_context() from event initialization"), which is ages ago :/

It's waaay too late to try and fix it; I'll be likely to make an even
bigger mess if I tried. Perhaps tomorrow.

Clearly FD_OUTPUT isn't much used :-(

2021-11-06 04:52:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Using perf_event_open() to sample multiple events of a process

On Thu, Nov 04, 2021 at 10:57:50PM -0700, Nadav Amit wrote:
> Hello Ken, Peter,
>
> I would appreciate some help regarding the use of perf_event_open()
> to have multiple samples getting into the same mmap’d memory when they
> are both attached to the same process.
>
> I am doing so (using both PERF_FLAG_FD_NO_GROUP and PERF_FLAG_FD_OUTPUT),
> but it results in -EINVAL. Debugging the code shows that
> perf_event_set_output() fails due to the following check:
>
> /*
> * If its not a per-cpu rb, it must be the same task.
> */
> if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> goto out;
>
> However, it appears that at this point, event->ctx is not initialized
> (it is null) so the test fails and the whole perf_event_open() syscall
> fails.
>
> Am I missing something? If not, I am unsure, unfortunately, what the
> proper way to fix it is…
>
> I include a small test that fails on my system. The second
> perf_event_open fails due to the check in perf_event_set_output():
>

Works when you use the SET_OUTPUT ioctl()...

I think something went sideways in the syscall path and things went out
of order :/ I'll try and have a look.


---
#define _GNU_SOURCE 1

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

long perf_event_open(struct perf_event_attr* event_attr, pid_t pid, int cpu,
int group_fd, unsigned long flags)
{
return syscall(__NR_perf_event_open, event_attr, pid, cpu, group_fd, flags);
}

int main(void)
{
pid_t pid = getpid();
int group_fd, fd, err;
void *p;
struct perf_event_attr pe = {
.type = 4,
.size = sizeof(struct perf_event_attr),
.config = 0x11d0,
.sample_type = 0x8,
.sample_period = 1000,
.precise_ip = 2,
};

group_fd = perf_event_open(&pe, pid, -1, -1, PERF_FLAG_FD_CLOEXEC |
PERF_FLAG_FD_NO_GROUP |
PERF_FLAG_FD_OUTPUT);

if (group_fd < 0) {
perror("first perf_event_open");
exit(-1);
}

p = mmap(NULL, 3 * 4096, PROT_READ|PROT_WRITE, MAP_SHARED, group_fd, 0);

if (p == MAP_FAILED) {
perror("MAP_FAILED");
exit(-1);
}

pe.config = 0x12d0;

fd = perf_event_open(&pe, pid, -1, -1, PERF_FLAG_FD_CLOEXEC |
PERF_FLAG_FD_NO_GROUP |
PERF_FLAG_FD_OUTPUT);

if (fd < 0) {
perror("second perf_event_open");
exit(-1);
}

err = ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, group_fd);
if (err < 0) {
perror("ioctl");
exit(-1);
}

printf("success\n");
}

2021-11-08 17:43:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Using perf_event_open() to sample multiple events of a process

On Sat, Nov 06, 2021 at 01:57:23AM +0100, Peter Zijlstra wrote:

> The problem seems to be that we call perf_event_set_output() before we
> set event->ctx, which is a bit of a problem.
>
> Now, afaict it's been broken since c3f00c70276d ("perf: Separate
> find_get_context() from event initialization"), which is ages ago :/
>
> It's waaay too late to try and fix it; I'll be likely to make an even
> bigger mess if I tried. Perhaps tomorrow.
>
> Clearly FD_OUTPUT isn't much used :-(

The below seems to fix, it's a bit of a hack, but I couldn't really come
up with anything saner.

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f2253ea729a2..dbe766663733 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5595,6 +5595,7 @@ static inline int perf_fget_light(int fd, struct fd *p)
}

static int perf_event_set_output(struct perf_event *event,
+ struct perf_event_context *ctx,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
static int perf_copy_attr(struct perf_event_attr __user *uattr,
@@ -5647,10 +5648,10 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
if (ret)
return ret;
output_event = output.file->private_data;
- ret = perf_event_set_output(event, output_event);
+ ret = perf_event_set_output(event, event->ctx, output_event);
fdput(output);
} else {
- ret = perf_event_set_output(event, NULL);
+ ret = perf_event_set_output(event, event->ctx, NULL);
}
return ret;
}
@@ -11830,7 +11831,9 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
}

static int
-perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
+perf_event_set_output(struct perf_event *event,
+ struct perf_event_context *event_ctx,
+ struct perf_event *output_event)
{
struct perf_buffer *rb = NULL;
int ret = -EINVAL;
@@ -11851,7 +11854,7 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
/*
* If its not a per-cpu rb, it must be the same task.
*/
- if (output_event->cpu == -1 && output_event->ctx != event->ctx)
+ if (output_event->cpu == -1 && output_event->ctx != event_ctx)
goto out;

/*
@@ -12232,7 +12235,7 @@ SYSCALL_DEFINE5(perf_event_open,
}

if (output_event) {
- err = perf_event_set_output(event, output_event);
+ err = perf_event_set_output(event, ctx, output_event);
if (err)
goto err_context;
}

2021-11-09 01:03:01

by Nadav Amit

[permalink] [raw]
Subject: Re: Using perf_event_open() to sample multiple events of a process


> On Nov 8, 2021, at 7:24 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sat, Nov 06, 2021 at 01:57:23AM +0100, Peter Zijlstra wrote:
>
>> The problem seems to be that we call perf_event_set_output() before we
>> set event->ctx, which is a bit of a problem.
>>
>> Now, afaict it's been broken since c3f00c70276d ("perf: Separate
>> find_get_context() from event initialization"), which is ages ago :/
>>
>> It's waaay too late to try and fix it; I'll be likely to make an even
>> bigger mess if I tried. Perhaps tomorrow.
>>
>> Clearly FD_OUTPUT isn't much used :-(
>
> The below seems to fix, it's a bit of a hack, but I couldn't really come
> up with anything saner.

I originally considered doing a similar hack. I assume it should work,
but I moved to using the ioctl workaround that you suggested.

Clearly nobody is using this feature if it is broken for 11 years.
There is always the option to deprecate it if there is an alternative.