2020-04-30 01:38:44

by Jin Yao

Subject: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

A metric may consist of system-wide events and non-system-wide events.
The event group leader may be the system-wide event.

For example, the metric "C2_Pkg_Residency" consists of
"cstate_pkg/c2-residency" and "msr/tsc". The former counts on the first
CPU of each socket (and is tagged system-wide), while the latter counts
on every CPU.

But "C2_Pkg_Residency" hits an assertion failure on cascadelakex.

# perf stat -M "C2_Pkg_Residency" -a -- sleep 1
perf: util/evsel.c:1464: get_group_fd: Assertion `!(fd == -1)' failed.
Aborted

get_group_fd(evsel, cpu, thread)
{
	leader = evsel->leader;
	fd = FD(leader, cpu, thread);
	BUG_ON(fd == -1);
}

Consider this case: the leader is "cstate_pkg/c2-residency", the evsel is
"msr/tsc" and cpu is 1. Because "cstate_pkg/c2-residency" is a system-wide
event that is processed on CPU0, FD(leader, 1, thread) returns an
invalid fd and the BUG_ON() is triggered.

This patch gets the group fd from CPU0 for a system-wide event if
FD(leader, cpu, thread) returns an invalid fd.

With this patch,

# perf stat -M "C2_Pkg_Residency" -a -- sleep 1

Performance counter stats for 'system wide':

1000850802 cstate_pkg/c2-residency/ # 0.5 C2_Pkg_Residency
201446161592 msr/tsc/

1.010637051 seconds time elapsed

Fixes: 6a4bb04caacc ("perf tools: Enable grouping logic for parsed events")
Signed-off-by: Jin Yao <[email protected]>
---
tools/perf/util/evsel.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 6a571d322bb2..cd6470f63d6f 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1461,6 +1461,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
 	BUG_ON(!leader->core.fd);
 
 	fd = FD(leader, cpu, thread);
+	if (fd == -1 && leader->core.system_wide)
+		fd = FD(leader, 0, thread);
+
 	BUG_ON(fd == -1);
 
 	return fd;
--
2.17.1


2020-05-01 10:25:45

by Jiri Olsa

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

On Thu, Apr 30, 2020 at 09:34:51AM +0800, Jin Yao wrote:
> A metric may consist of system-wide events and non-system-wide events.
> The event group leader may be the system-wide event.
>
> For example, the metric "C2_Pkg_Residency" consists of
> "cstate_pkg/c2-residency" and "msr/tsc". The former counts on the first
> CPU of each socket (and is tagged system-wide), while the latter counts
> on every CPU.
>
> But "C2_Pkg_Residency" hits an assertion failure on cascadelakex.
>
> # perf stat -M "C2_Pkg_Residency" -a -- sleep 1
> perf: util/evsel.c:1464: get_group_fd: Assertion `!(fd == -1)' failed.
> Aborted
>
> get_group_fd(evsel, cpu, thread)
> {
> leader = evsel->leader;
> fd = FD(leader, cpu, thread);
> BUG_ON(fd == -1);
> }
>
> Consider this case: the leader is "cstate_pkg/c2-residency", the evsel is
> "msr/tsc" and cpu is 1. Because "cstate_pkg/c2-residency" is a system-wide
> event that is processed on CPU0, FD(leader, 1, thread) returns an
> invalid fd and the BUG_ON() is triggered.
>
> This patch gets the group fd from CPU0 for a system-wide event if
> FD(leader, cpu, thread) returns an invalid fd.
>
> With this patch,
>
> # perf stat -M "C2_Pkg_Residency" -a -- sleep 1
>
> Performance counter stats for 'system wide':
>
> 1000850802 cstate_pkg/c2-residency/ # 0.5 C2_Pkg_Residency
> 201446161592 msr/tsc/
>
> 1.010637051 seconds time elapsed
>
> Fixes: 6a4bb04caacc ("perf tools: Enable grouping logic for parsed events")
> Signed-off-by: Jin Yao <[email protected]>
> ---
> tools/perf/util/evsel.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 6a571d322bb2..cd6470f63d6f 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1461,6 +1461,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
> BUG_ON(!leader->core.fd);
>
> fd = FD(leader, cpu, thread);
> + if (fd == -1 && leader->core.system_wide)

fd does not need to be -1 in here.. in my setup cstate_pkg/c2-residency/
has cpumask 0, so the other cpus never get opened and their fds stay 0,
and the whole thing ends up with:

sys_perf_event_open: pid -1 cpu 1 group_fd 0 flags 0
sys_perf_event_open failed, error -9

I actually thought we put -1 into the fd array but couldn't find it..
perhaps we should do that
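
For reference, a minimal sketch of what "put -1 into the fd array" could
look like in libperf's perf_evsel__alloc_fd() (tools/lib/perf/evsel.c);
the exact shape here is an assumption, not a tested patch:

static int perf_evsel__alloc_fd(struct perf_evsel *evsel, int ncpus, int nthreads)
{
	evsel->fd = xyarray__new(ncpus, nthreads, sizeof(int));

	if (evsel->fd) {
		int cpu, thread;

		/* mark every slot as "not opened" instead of relying on
		 * the zero fill from allocation */
		for (cpu = 0; cpu < ncpus; cpu++) {
			for (thread = 0; thread < nthreads; thread++)
				FD(evsel, cpu, thread) = -1;
		}
	}

	return evsel->fd != NULL ? 0 : -ENOMEM;
}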


> + fd = FD(leader, 0, thread);
> +

so how do we group the following events?

cstate_pkg/c2-residency/ - cpumask 0
msr/tsc/                 - all cpus

cpu 0 is fine.. the rest I have no idea ;-)

that's why metrics use the :W modifier, which disables grouping on failure
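
(As an aside, a weak group can also be requested directly on the command
line; the event list below is only an illustration and assumes both PMUs
exist on the test machine:

# perf stat -e '{cstate_pkg/c2-residency/,msr/tsc/}:W' -a -- sleep 1

If opening the group fails, perf falls back to opening the events
ungrouped.)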

jirka

> BUG_ON(fd == -1);
>
> return fd;
> --
> 2.17.1
>

2020-05-02 02:38:56

by Jin Yao

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

Hi Jiri,

On 5/1/2020 6:23 PM, Jiri Olsa wrote:
> On Thu, Apr 30, 2020 at 09:34:51AM +0800, Jin Yao wrote:
>> A metric may consist of system-wide events and non-system-wide events.
>> The event group leader may be the system-wide event.
>>
>> For example, the metric "C2_Pkg_Residency" consists of
>> "cstate_pkg/c2-residency" and "msr/tsc". The former counts on the first
>> CPU of each socket (and is tagged system-wide), while the latter counts
>> on every CPU.
>>
>> But "C2_Pkg_Residency" hits an assertion failure on cascadelakex.
>>
>> # perf stat -M "C2_Pkg_Residency" -a -- sleep 1
>> perf: util/evsel.c:1464: get_group_fd: Assertion `!(fd == -1)' failed.
>> Aborted
>>
>> get_group_fd(evsel, cpu, thread)
>> {
>> leader = evsel->leader;
>> fd = FD(leader, cpu, thread);
>> BUG_ON(fd == -1);
>> }
>>
>> Consider this case: the leader is "cstate_pkg/c2-residency", the evsel is
>> "msr/tsc" and cpu is 1. Because "cstate_pkg/c2-residency" is a system-wide
>> event that is processed on CPU0, FD(leader, 1, thread) returns an
>> invalid fd and the BUG_ON() is triggered.
>>
>> This patch gets the group fd from CPU0 for a system-wide event if
>> FD(leader, cpu, thread) returns an invalid fd.
>>
>> With this patch,
>>
>> # perf stat -M "C2_Pkg_Residency" -a -- sleep 1
>>
>> Performance counter stats for 'system wide':
>>
>> 1000850802 cstate_pkg/c2-residency/ # 0.5 C2_Pkg_Residency
>> 201446161592 msr/tsc/
>>
>> 1.010637051 seconds time elapsed
>>
>> Fixes: 6a4bb04caacc ("perf tools: Enable grouping logic for parsed events")
>> Signed-off-by: Jin Yao <[email protected]>
>> ---
>> tools/perf/util/evsel.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 6a571d322bb2..cd6470f63d6f 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1461,6 +1461,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
>> BUG_ON(!leader->core.fd);
>>
>> fd = FD(leader, cpu, thread);
>> + if (fd == -1 && leader->core.system_wide)
>
> fd does not need to be -1 in here.. in my setup cstate_pkg/c2-residency/
> has cpumask 0, so the other cpus never get opened and their fds stay 0,
> and the whole thing ends up with:
>
> sys_perf_event_open: pid -1 cpu 1 group_fd 0 flags 0
> sys_perf_event_open failed, error -9
>
> I actually thought we put -1 into the fd array but couldn't find it..
> perhaps we should do that
>
>

I have tested on two platforms. On a KBL desktop the fd is 0 for this case, but
on a cascadelakex server the fd is -1, so the BUG_ON(fd == -1) is triggered.

>> + fd = FD(leader, 0, thread);
>> +
>
> so how do we group the following events?
>
> cstate_pkg/c2-residency/ - cpumask 0
> msr/tsc/                 - all cpus
>

Not sure if it's enough to only use cpumask 0 because cstate_pkg/c2-residency/
should be per-socket.

> cpu 0 is fine.. the rest I have no idea ;-)
>

Perhaps we should directly remove the BUG_ON(fd == -1) assertion?

Thanks
Jin Yao

> that's why metrics use the :W modifier, which disables grouping on failure
>
> jirka
>
>> BUG_ON(fd == -1);
>>
>> return fd;
>> --
>> 2.17.1
>>
>

2020-05-05 00:06:45

by Jiri Olsa

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

On Sat, May 02, 2020 at 10:33:59AM +0800, Jin, Yao wrote:

SNIP

> > > @@ -1461,6 +1461,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
> > > BUG_ON(!leader->core.fd);
> > > fd = FD(leader, cpu, thread);
> > > + if (fd == -1 && leader->core.system_wide)
> >
> > fd does not need to be -1 in here.. in my setup cstate_pkg/c2-residency/
> > has cpumask 0, so the other cpus never get opened and their fds stay 0,
> > and the whole thing ends up with:
> >
> > sys_perf_event_open: pid -1 cpu 1 group_fd 0 flags 0
> > sys_perf_event_open failed, error -9
> >
> > I actually thought we put -1 into the fd array but couldn't find it..
> > perhaps we should do that
> >
> >
>
> I have tested on two platforms. On a KBL desktop the fd is 0 for this case, but
> on a cascadelakex server the fd is -1, so the BUG_ON(fd == -1) is triggered.
>
> > > + fd = FD(leader, 0, thread);
> > > +
> >
> > so how do we group the following events?
> >
> > cstate_pkg/c2-residency/ - cpumask 0
> > msr/tsc/                 - all cpus
> >
>
> Not sure if it's enough to only use cpumask 0 because
> cstate_pkg/c2-residency/ should be per-socket.
>
> > cpu 0 is fine.. the rest I have no idea ;-)
> >
>
> Perhaps we should directly remove the BUG_ON(fd == -1) assertion?

I think we need to make it clear how to deal with grouping over
events that come from different cpus

so how do we group the following events?

cstate_pkg/c2-residency/ - cpumask 0
msr/tsc/                 - all cpus


what's the reason/expected output of groups with the above events?
it seems to make sense only if we limit msr/tsc/ to cpumask 0 as well

jirka

2020-05-09 07:39:13

by Jin Yao

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

Hi Jiri,

On 5/5/2020 8:03 AM, Jiri Olsa wrote:
> On Sat, May 02, 2020 at 10:33:59AM +0800, Jin, Yao wrote:
>
> SNIP
>
>>>> @@ -1461,6 +1461,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
>>>> BUG_ON(!leader->core.fd);
>>>> fd = FD(leader, cpu, thread);
>>>> + if (fd == -1 && leader->core.system_wide)
>>>
>>> fd does not need to be -1 in here.. in my setup cstate_pkg/c2-residency/
>>> has cpumask 0, so the other cpus never get opened and their fds stay 0,
>>> and the whole thing ends up with:
>>>
>>> sys_perf_event_open: pid -1 cpu 1 group_fd 0 flags 0
>>> sys_perf_event_open failed, error -9
>>>
>>> I actually thought we put -1 into the fd array but couldn't find it..
>>> perhaps we should do that
>>>
>>>
>>
>> I have tested on two platforms. On a KBL desktop the fd is 0 for this case, but
>> on a cascadelakex server the fd is -1, so the BUG_ON(fd == -1) is triggered.
>>
>>>> + fd = FD(leader, 0, thread);
>>>> +
>>>
>>> so how do we group the following events?
>>>
>>> cstate_pkg/c2-residency/ - cpumask 0
>>> msr/tsc/                 - all cpus
>>>
>>
>> Not sure if it's enough to only use cpumask 0 because
>> cstate_pkg/c2-residency/ should be per-socket.
>>
>>> cpu 0 is fine.. the rest I have no idea ;-)
>>>
>>
>> Perhaps we should directly remove the BUG_ON(fd == -1) assertion?
>
> I think we need to make it clear how to deal with grouping over
> events that come from different cpus
>
> so how do we group the following events?
>
> cstate_pkg/c2-residency/ - cpumask 0
> msr/tsc/                 - all cpus
>
>
> what's the reason/expected output of groups with the above events?
> it seems to make sense only if we limit msr/tsc/ to cpumask 0 as well
>
> jirka
>

On a 2-socket machine (e.g. cascadelakex), "cstate_pkg/c2-residency/" is a per-socket
event and the cpumask is 0 and 24.

root@lkp-csl-2sp5 /sys/devices/cstate_pkg# cat cpumask
0,24

We can't limit it to cpumask 0. It should be programmed on CPU0 and CPU24 (the
first CPU on each socket).

The "msr/tsc" are per-cpu event, it should be programmed on all cpus. So I don't
think we can limit msr/tsc to cpumask 0.

The issue is how we deal with get_group_fd().

static int get_group_fd(struct evsel *evsel, int cpu, int thread)
{
	struct evsel *leader = evsel->leader;
	int fd;

	if (evsel__is_group_leader(evsel))
		return -1;

	/*
	 * Leader must be already processed/open,
	 * if not it's a bug.
	 */
	BUG_ON(!leader->core.fd);

	fd = FD(leader, cpu, thread);
	BUG_ON(fd == -1);

	return fd;
}

When evsel is "msr/tsc/",

FD(leader, 0, 0) is 3 (3 is the fd of "cstate_pkg/c2-residency/" on CPU0)
FD(leader, 1, 0) is -1
and the BUG_ON() is triggered.

If we just return a group_fd of -1 for "msr/tsc", it looks like it's not a problem,
is it?
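
A minimal sketch of that idea, keeping the rest of get_group_fd as it is
(this is only the proposal above, not a tested patch):

	fd = FD(leader, cpu, thread);
	/* the system-wide leader was never opened on this cpu:
	 * fall back to "no group" instead of asserting */
	if (fd == -1 && leader->core.system_wide)
		return -1;

	BUG_ON(fd == -1);

	return fd;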

Thanks
Jin Yao

2020-05-15 06:08:37

by Jin Yao

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

Hi Jiri,

On 5/9/2020 3:37 PM, Jin, Yao wrote:
> Hi Jiri,
>
> On 5/5/2020 8:03 AM, Jiri Olsa wrote:
>> On Sat, May 02, 2020 at 10:33:59AM +0800, Jin, Yao wrote:
>>
>> SNIP
>>
>>>>> @@ -1461,6 +1461,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
>>>>>        BUG_ON(!leader->core.fd);
>>>>>        fd = FD(leader, cpu, thread);
>>>>> +    if (fd == -1 && leader->core.system_wide)
>>>>
>>>> fd does not need to be -1 in here.. in my setup cstate_pkg/c2-residency/
>>>> has cpumask 0, so the other cpus never get opened and their fds stay 0,
>>>> and the whole thing ends up with:
>>>>
>>>>     sys_perf_event_open: pid -1  cpu 1  group_fd 0  flags 0
>>>>     sys_perf_event_open failed, error -9
>>>>
>>>> I actually thought we put -1 into the fd array but couldn't find it..
>>>> perhaps we should do that
>>>>
>>>>
>>>
>>> I have tested on two platforms. On a KBL desktop the fd is 0 for this case, but
>>> on a cascadelakex server the fd is -1, so the BUG_ON(fd == -1) is triggered.
>>>
>>>>> +        fd = FD(leader, 0, thread);
>>>>> +
>>>>
>>>> so how do we group the following events?
>>>>
>>>>     cstate_pkg/c2-residency/ - cpumask 0
>>>>     msr/tsc/                 - all cpus
>>>>
>>>
>>> Not sure if it's enough to only use cpumask 0 because
>>> cstate_pkg/c2-residency/ should be per-socket.
>>>
>>>> cpu 0 is fine.. the rest I have no idea ;-)
>>>>
>>>
>>> Perhaps we should directly remove the BUG_ON(fd == -1) assertion?
>>
>> I think we need to make it clear how to deal with grouping over
>> events that come from different cpus
>>
>>     so how do we group the following events?
>>
>>        cstate_pkg/c2-residency/ - cpumask 0
>>        msr/tsc/                 - all cpus
>>
>>
>> what's the reason/expected output of groups with the above events?
>> it seems to make sense only if we limit msr/tsc/ to cpumask 0 as well
>>
>> jirka
>>
>
> On a 2-socket machine (e.g. cascadelakex), "cstate_pkg/c2-residency/" is a per-socket event and the
> cpumask is 0 and 24.
>
> root@lkp-csl-2sp5 /sys/devices/cstate_pkg# cat cpumask
> 0,24
>
> We can't limit it to cpumask 0. It should be programmed on CPU0 and CPU24 (the first CPU on each
> socket).
>
> The "msr/tsc" are per-cpu event, it should be programmed on all cpus. So I don't think we can limit
> msr/tsc to cpumask 0.
>
> The issue is how we deal with get_group_fd().
>
> static int get_group_fd(struct evsel *evsel, int cpu, int thread)
> {
>         struct evsel *leader = evsel->leader;
>         int fd;
>
>         if (evsel__is_group_leader(evsel))
>                 return -1;
>
>         /*
>          * Leader must be already processed/open,
>          * if not it's a bug.
>          */
>         BUG_ON(!leader->core.fd);
>
>         fd = FD(leader, cpu, thread);
>         BUG_ON(fd == -1);
>
>         return fd;
> }
>
> When evsel is "msr/tsc/",
>
> FD(leader, 0, 0) is 3 (3 is the fd of "cstate_pkg/c2-residency/" on CPU0)
> FD(leader, 1, 0) is -1
> and the BUG_ON() is triggered.
>
> If we just return a group_fd of -1 for "msr/tsc", it looks like it's not a problem, is it?
>
> Thanks
> Jin Yao

I think I've found the root cause. There is a serious bug in get_group_fd: an access violation!

For a group that mixes a system-wide event with per-core events, where the group leader is the
system-wide event, an access violation will happen.

perf_evsel__alloc_fd allocates one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).

But for a per-core event, perf_evsel__alloc_fd allocates N FD members (N = ncpus). For example, if
ncpus is 8, FD(evsel, 0, 0) to FD(evsel, 7, 0) are valid.

get_group_fd(struct evsel *evsel, int cpu, int thread)
{
	struct evsel *leader = evsel->leader;

	fd = FD(leader, cpu, thread);	/* access violation may happen here */
}

If the leader is a system-wide event, only FD(leader, 0, 0) is valid.

When get_group_fd accesses FD(leader, 1, 0), an access violation happens.
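
The accessor itself does no bounds checking, which is why this reads
out-of-bounds memory instead of failing cleanly. Roughly, from
tools/lib/perf's xyarray (quoted from memory, so treat the details as an
assumption):

static inline void *xyarray__entry(struct xyarray *xy, int x, int y)
{
	/* no check of x/y against the allocated dimensions: indexing
	 * row 1 of a one-row array computes an out-of-bounds address */
	return &xy->contents[x * xy->row_size + y * xy->entry_size];
}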

My fix is:

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 28683b0eb738..db05b8a1e1a8 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
 	if (evsel__is_group_leader(evsel))
 		return -1;
 
+	if (leader->core.system_wide && !evsel->core.system_wide)
+		return -2;
+
 	/*
 	 * Leader must be already processed/open,
 	 * if not it's a bug.
@@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 				pid = perf_thread_map__pid(threads, thread);
 
 			group_fd = get_group_fd(evsel, cpu, thread);
+			if (group_fd == -2) {
+				errno = EINVAL;
+				err = -EINVAL;
+				goto out_close;
+			}
 retry_open:
 			test_attr__ready();

It enables perf_evlist__reset_weak_group, and in the second_pass (in
__run_perf_stat) the events will be opened successfully.
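
For context, the fallback path this relies on looks roughly like the
following in builtin-stat.c (paraphrased, so treat the exact conditions
as an assumption):

	/* a weak group member failed to open: break the group up
	 * and retry each event on its own */
	if ((errno == EINVAL || errno == EBADF) &&
	    counter->leader != counter &&
	    counter->weak_group) {
		counter = perf_evlist__reset_weak_group(evsel_list, counter);
		goto try_again;
	}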

I have tested this fix on cascadelakex, and it works OK.

Thanks
Jin Yao

2020-05-15 08:37:44

by Jiri Olsa

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

On Fri, May 15, 2020 at 02:04:57PM +0800, Jin, Yao wrote:

SNIP

> I think I've found the root cause. There is a serious bug in get_group_fd: an access violation!
>
> For a group that mixes a system-wide event with per-core events, where the group
> leader is the system-wide event, an access violation will happen.
>
> perf_evsel__alloc_fd allocates one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).
>
> But for a per-core event, perf_evsel__alloc_fd allocates N FD members (N =
> ncpus). For example, if ncpus is 8, FD(evsel, 0, 0) to FD(evsel, 7, 0) are
> valid.
>
> get_group_fd(struct evsel *evsel, int cpu, int thread)
> {
> struct evsel *leader = evsel->leader;
>
> fd = FD(leader, cpu, thread); /* access violation may happen here */
> }
>
> If the leader is a system-wide event, only FD(leader, 0, 0) is valid.
>
> When get_group_fd accesses FD(leader, 1, 0), an access violation happens.
>
> My fix is:
>
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index 28683b0eb738..db05b8a1e1a8 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
> if (evsel__is_group_leader(evsel))
> return -1;
>
> + if (leader->core.system_wide && !evsel->core.system_wide)
> + return -2;

so this effectively stops grouping system_wide events with others,
and I think it's correct.. how about events that differ in cpumask?

should we perhaps ensure this before we call open? go through all
groups and check they are on the same cpus?
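
A hedged sketch of such a pre-open check (the helper name is
hypothetical, and it assumes the perf_cpu_map layout of that time, with
nr/map members):

static bool evsel__group_cpus_match(struct evsel *evsel)
{
	struct perf_cpu_map *cpus = evsel->core.cpus;
	struct perf_cpu_map *leader_cpus = evsel->leader->core.cpus;
	int i;

	/* a group can only be opened if every member is scheduled on
	 * exactly the same cpus as its leader */
	if (cpus->nr != leader_cpus->nr)
		return false;

	for (i = 0; i < cpus->nr; i++) {
		if (cpus->map[i] != leader_cpus->map[i])
			return false;
	}

	return true;
}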

thanks,
jirka


> +
> /*
> * Leader must be already processed/open,
> * if not it's a bug.
> @@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> pid = perf_thread_map__pid(threads, thread);
>
> group_fd = get_group_fd(evsel, cpu, thread);
> + if (group_fd == -2) {
> + errno = EINVAL;
> + err = -EINVAL;
> + goto out_close;
> + }
> retry_open:
> test_attr__ready();
>
> It enables perf_evlist__reset_weak_group, and in the second_pass (in
> __run_perf_stat) the events will be opened successfully.
>
> I have tested this fix on cascadelakex, and it works OK.
>
> Thanks
> Jin Yao
>

2020-05-18 03:43:37

by Jin Yao

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

Hi Jiri,

On 5/15/2020 4:33 PM, Jiri Olsa wrote:
> On Fri, May 15, 2020 at 02:04:57PM +0800, Jin, Yao wrote:
>
> SNIP
>
>> I think I've found the root cause. There is a serious bug in get_group_fd: an access violation!
>>
>> For a group that mixes a system-wide event with per-core events, where the group
>> leader is the system-wide event, an access violation will happen.
>>
>> perf_evsel__alloc_fd allocates one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).
>>
>> But for a per-core event, perf_evsel__alloc_fd allocates N FD members (N =
>> ncpus). For example, if ncpus is 8, FD(evsel, 0, 0) to FD(evsel, 7, 0) are
>> valid.
>>
>> get_group_fd(struct evsel *evsel, int cpu, int thread)
>> {
>> struct evsel *leader = evsel->leader;
>>
>> fd = FD(leader, cpu, thread); /* access violation may happen here */
>> }
>>
>> If the leader is a system-wide event, only FD(leader, 0, 0) is valid.
>>
>> When get_group_fd accesses FD(leader, 1, 0), an access violation happens.
>>
>> My fix is:
>>
>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>> index 28683b0eb738..db05b8a1e1a8 100644
>> --- a/tools/perf/util/evsel.c
>> +++ b/tools/perf/util/evsel.c
>> @@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
>> if (evsel__is_group_leader(evsel))
>> return -1;
>>
>> + if (leader->core.system_wide && !evsel->core.system_wide)
>> + return -2;
>
> so this effectively stops grouping system_wide events with others,
> and I think it's correct.. how about events that differ in cpumask?
>

My understanding for the events that differ in cpumask is: if the leader's cpumask does not fully
match the evsel's cpumask, then we stop the grouping. Is this understanding correct?

I have done some tests and reached some conclusions:

1. If the group is mixed with core and uncore events, the system_wide checking can distinguish them.

2. If the group is mixed with core and uncore events and "-a" is specified, the system_wide for the core
event is also false, so the system_wide checking can distinguish them too.

3. In my test, the issue only occurs when we collect a metric that mixes an uncore event and a
core event, so maybe checking system_wide is OK.

> should we perhaps ensure this before we call open? go through all
> groups and check they are on the same cpus?
>

The issue doesn't happen most of the time (only for a metric consisting of an uncore event and a
core event), so falling back to non-grouped events when the group open fails looks reasonable.

Thanks
Jin Yao

> thanks,
> jirka
>
>
>> +
>> /*
>> * Leader must be already processed/open,
>> * if not it's a bug.
>> @@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>> pid = perf_thread_map__pid(threads, thread);
>>
>> group_fd = get_group_fd(evsel, cpu, thread);
>> + if (group_fd == -2) {
>> + errno = EINVAL;
>> + err = -EINVAL;
>> + goto out_close;
>> + }
>> retry_open:
>> test_attr__ready();
>>
>> It enables perf_evlist__reset_weak_group, and in the second_pass (in
>> __run_perf_stat) the events will be opened successfully.
>>
>> I have tested this fix on cascadelakex, and it works OK.
>>
>> Thanks
>> Jin Yao
>>
>

2020-05-20 05:40:44

by Jin Yao

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

Hi Jiri,

On 5/18/2020 11:28 AM, Jin, Yao wrote:
> Hi Jiri,
>
> On 5/15/2020 4:33 PM, Jiri Olsa wrote:
>> On Fri, May 15, 2020 at 02:04:57PM +0800, Jin, Yao wrote:
>>
>> SNIP
>>
>>> I think I've found the root cause. There is a serious bug in get_group_fd: an access violation!
>>>
>>> For a group that mixes a system-wide event with per-core events, where the group
>>> leader is the system-wide event, an access violation will happen.
>>>
>>> perf_evsel__alloc_fd allocates one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).
>>>
>>> But for a per-core event, perf_evsel__alloc_fd allocates N FD members (N =
>>> ncpus). For example, if ncpus is 8, FD(evsel, 0, 0) to FD(evsel, 7, 0) are
>>> valid.
>>>
>>> get_group_fd(struct evsel *evsel, int cpu, int thread)
>>> {
>>>      struct evsel *leader = evsel->leader;
>>>
>>>      fd = FD(leader, cpu, thread);    /* access violation may happen here */
>>> }
>>>
>>> If the leader is a system-wide event, only FD(leader, 0, 0) is valid.
>>>
>>> When get_group_fd accesses FD(leader, 1, 0), an access violation happens.
>>>
>>> My fix is:
>>>
>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>> index 28683b0eb738..db05b8a1e1a8 100644
>>> --- a/tools/perf/util/evsel.c
>>> +++ b/tools/perf/util/evsel.c
>>> @@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
>>>          if (evsel__is_group_leader(evsel))
>>>                  return -1;
>>>
>>> +       if (leader->core.system_wide && !evsel->core.system_wide)
>>> +               return -2;
>>
>> so this effectively stops grouping system_wide events with others,
>> and I think it's correct.. how about events that differ in cpumask?
>>
>
> My understanding for the events that differ in cpumask is: if the leader's cpumask does not
> fully match the evsel's cpumask, then we stop the grouping. Is this understanding correct?
>
> I have done some tests and reached some conclusions:
>
> 1. If the group is mixed with core and uncore events, the system_wide checking can distinguish them.
>
> 2. If the group is mixed with core and uncore events and "-a" is specified, the system_wide for the core
> event is also false, so the system_wide checking can distinguish them too.
>
> 3. In my test, the issue only occurs when we collect a metric that mixes an uncore event and a
> core event, so maybe checking system_wide is OK.
>
>> should we perhaps ensure this before we call open? go through all
>> groups and check they are on the same cpus?
>>
>
> The issue doesn't happen most of the time (only for a metric consisting of an uncore event and a
> core event), so falling back to non-grouped events when the group open fails looks reasonable.
>
> Thanks
> Jin Yao
>
>> thanks,
>> jirka
>>
>>
>>> +
>>>          /*
>>>           * Leader must be already processed/open,
>>>           * if not it's a bug.
>>> @@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>>>                                  pid = perf_thread_map__pid(threads, thread);
>>>
>>>                          group_fd = get_group_fd(evsel, cpu, thread);
>>> +                       if (group_fd == -2) {
>>> +                               errno = EINVAL;
>>> +                               err = -EINVAL;
>>> +                               goto out_close;
>>> +                       }
>>>   retry_open:
>>>                          test_attr__ready();
>>>
>>> It enables perf_evlist__reset_weak_group, and in the second_pass (in
>>> __run_perf_stat) the events will be opened successfully.
>>>
>>> I have tested this fix on cascadelakex, and it works OK.
>>>
>>> Thanks
>>> Jin Yao
>>>
>>

Is this fix OK?

Another thing: do you think we need to rename "evsel->core.system_wide" to
"evsel->core.has_cpumask"?

The name "system_wide" may be misleading.

evsel->core.system_wide = pmu ? pmu->is_uncore : false;

"pmu->is_uncore" is true if PMU has a "cpumask". But it's not just uncore PMU which has cpumask.
Some other PMUs, e.g. cstate_pkg, also have cpumask. So for this case, "has_cpumask" should be better.

But I'm not sure if the change is OK for other case, e.g. PT, which also uses
"evsel->core.system_wide".

Thanks
Jin Yao

2020-05-20 07:54:25

by Jiri Olsa

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

On Wed, May 20, 2020 at 01:36:40PM +0800, Jin, Yao wrote:
> Hi Jiri,
>
> On 5/18/2020 11:28 AM, Jin, Yao wrote:
> > Hi Jiri,
> >
> > On 5/15/2020 4:33 PM, Jiri Olsa wrote:
> > > On Fri, May 15, 2020 at 02:04:57PM +0800, Jin, Yao wrote:
> > >
> > > SNIP
> > >
> > > > I think I've found the root cause. There is a serious bug in get_group_fd: an access violation!
> > > >
> > > > For a group that mixes a system-wide event with per-core events, where the group
> > > > leader is the system-wide event, an access violation will happen.
> > > >
> > > > perf_evsel__alloc_fd allocates one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).
> > > >
> > > > But for a per-core event, perf_evsel__alloc_fd allocates N FD members (N =
> > > > ncpus). For example, if ncpus is 8, FD(evsel, 0, 0) to FD(evsel, 7, 0) are
> > > > valid.
> > > >
> > > > get_group_fd(struct evsel *evsel, int cpu, int thread)
> > > > {
> > > > 	struct evsel *leader = evsel->leader;
> > > >
> > > > 	fd = FD(leader, cpu, thread);	/* access violation may happen here */
> > > > }
> > > >
> > > > If the leader is a system-wide event, only FD(leader, 0, 0) is valid.
> > > >
> > > > When get_group_fd accesses FD(leader, 1, 0), an access violation happens.
> > > >
> > > > My fix is:
> > > >
> > > > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > > > index 28683b0eb738..db05b8a1e1a8 100644
> > > > --- a/tools/perf/util/evsel.c
> > > > +++ b/tools/perf/util/evsel.c
> > > > @@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
> > > > 	if (evsel__is_group_leader(evsel))
> > > > 		return -1;
> > > >
> > > > +	if (leader->core.system_wide && !evsel->core.system_wide)
> > > > +		return -2;
> > >
> > > so this effectively stops grouping system_wide events with others,
> > > and I think it's correct.. how about events that differ in cpumask?
> > >
> >
> > My understanding for the events that differ in cpumask is: if the
> > leader's cpumask does not fully match the evsel's cpumask, then we
> > stop the grouping. Is this understanding correct?
> >
> > I have done some tests and reached some conclusions:
> >
> > 1. If the group is mixed with core and uncore events, the system_wide checking can distinguish them.
> >
> > 2. If the group is mixed with core and uncore events and "-a" is
> > specified, the system_wide for the core event is also false, so the
> > system_wide checking can distinguish them too.
> >
> > 3. In my test, the issue only occurs when we collect a metric that
> > mixes an uncore event and a core event, so maybe checking
> > system_wide is OK.
> >
> > > should we perhaps ensure this before we call open? go through all
> > > groups and check they are on the same cpus?
> > >
> >
> > The issue doesn't happen most of the time (only for a metric
> > consisting of an uncore event and a core event), so falling back to
> > non-grouped events when the group open fails looks reasonable.
> >
> > Thanks
> > Jin Yao
> >
> > > thanks,
> > > jirka
> > >
> > >
> > > > +
> > > > 	/*
> > > > 	 * Leader must be already processed/open,
> > > > 	 * if not it's a bug.
> > > > @@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> > > > 				pid = perf_thread_map__pid(threads, thread);
> > > >
> > > > 			group_fd = get_group_fd(evsel, cpu, thread);
> > > > +			if (group_fd == -2) {
> > > > +				errno = EINVAL;
> > > > +				err = -EINVAL;
> > > > +				goto out_close;
> > > > +			}
> > > >  retry_open:
> > > > 			test_attr__ready();
> > > >
> > > > It enables perf_evlist__reset_weak_group, and in the second_pass (in
> > > > __run_perf_stat) the events will be opened successfully.
> > > >
> > > > I have tested this fix on cascadelakex, and it works OK.
> > > >
> > > > Thanks
> > > > Jin Yao
> > > >
> > >
>
> Is this fix OK?
>
> Another thing: do you think we need to rename
> "evsel->core.system_wide" to "evsel->core.has_cpumask"?
>
> The name "system_wide" may be misleading.
>
> evsel->core.system_wide = pmu ? pmu->is_uncore : false;
>
> "pmu->is_uncore" is true if the PMU has a "cpumask". But it's not just uncore
> PMUs that have a cpumask; some other PMUs, e.g. cstate_pkg, also have a
> cpumask. So for this case, "has_cpumask" would be better.

so those flags are checked in many places in the code so I don't
think it's wise to mess with them

what I meant before was that the cpumask could be different for
different events, so even when both events are 'system_wide' the
leader 'fd' might not exist for the grouped events and vice versa

so maybe we should ensure that we are grouping events with the same
cpu maps before we go for open, so that get_group_fd stays simple
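
Wiring that up could look roughly like this, run over the evlist before
any sys_perf_event_open call (evsel__group_cpus_match is the
hypothetical helper sketched earlier in the thread):

	struct evsel *evsel;

	/* reject groups whose members are not on the same cpu map as
	 * their leader, so get_group_fd never sees a missing leader fd */
	evlist__for_each_entry(evlist, evsel) {
		if (evsel->leader != evsel &&
		    !evsel__group_cpus_match(evsel))
			return -EINVAL;
	}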

>
> But I'm not sure if the change is OK for other cases, e.g. PT, which also
> uses "evsel->core.system_wide".

plz CC Adrian Hunter <[email protected]> on next patches
if you are touching this

thanks,
jirka

2020-05-21 04:40:22

by Jin Yao

Subject: Re: [PATCH] perf evsel: Get group fd from CPU0 for system wide event

Hi Jiri,

On 5/20/2020 3:50 PM, Jiri Olsa wrote:
> On Wed, May 20, 2020 at 01:36:40PM +0800, Jin, Yao wrote:
>> Hi Jiri,
>>
>> On 5/18/2020 11:28 AM, Jin, Yao wrote:
>>> Hi Jiri,
>>>
>>> On 5/15/2020 4:33 PM, Jiri Olsa wrote:
>>>> On Fri, May 15, 2020 at 02:04:57PM +0800, Jin, Yao wrote:
>>>>
>>>> SNIP
>>>>
>>>>> I think I've found the root cause. There is a serious bug in get_group_fd: an access violation!
>>>>>
>>>>> For a group that mixes a system-wide event with per-core events, where the group
>>>>> leader is the system-wide event, an access violation will happen.
>>>>>
>>>>> perf_evsel__alloc_fd allocates one FD member for a system-wide event (only FD(evsel, 0, 0) is valid).
>>>>>
>>>>> But for a per-core event, perf_evsel__alloc_fd allocates N FD members (N =
>>>>> ncpus). For example, if ncpus is 8, FD(evsel, 0, 0) to FD(evsel, 7, 0) are
>>>>> valid.
>>>>>
>>>>> get_group_fd(struct evsel *evsel, int cpu, int thread)
>>>>> {
>>>>>      struct evsel *leader = evsel->leader;
>>>>>
>>>>>      fd = FD(leader, cpu, thread);    /* access violation may happen here */
>>>>> }
>>>>>
>>>>> If the leader is a system-wide event, only FD(leader, 0, 0) is valid.
>>>>>
>>>>> When get_group_fd accesses FD(leader, 1, 0), an access violation happens.
>>>>>
>>>>> My fix is:
>>>>>
>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>> index 28683b0eb738..db05b8a1e1a8 100644
>>>>> --- a/tools/perf/util/evsel.c
>>>>> +++ b/tools/perf/util/evsel.c
>>>>> @@ -1440,6 +1440,9 @@ static int get_group_fd(struct evsel *evsel, int cpu, int thread)
>>>>>          if (evsel__is_group_leader(evsel))
>>>>>                  return -1;
>>>>>
>>>>> +       if (leader->core.system_wide && !evsel->core.system_wide)
>>>>> +               return -2;
>>>>
>>>> so this effectively stops grouping system_wide events with others,
>>>> and I think it's correct.. how about events that differ in cpumask?
>>>>
>>>
>>> My understanding for the events that differ in cpumask is: if the
>>> leader's cpumask does not fully match the evsel's cpumask, then we
>>> stop the grouping. Is this understanding correct?
>>>
>>> I have done some tests and reached some conclusions:
>>>
>>> 1. If the group is mixed with core and uncore events, the system_wide checking can distinguish them.
>>>
>>> 2. If the group is mixed with core and uncore events and "-a" is
>>> specified, the system_wide for the core event is also false, so the
>>> system_wide checking can distinguish them too.
>>>
>>> 3. In my test, the issue only occurs when we collect a metric that
>>> mixes an uncore event and a core event, so maybe checking
>>> system_wide is OK.
>>>
>>>> should we perhaps ensure this before we call open? go through all
>>>> groups and check they are on the same cpus?
>>>>
>>>
>>> The issue doesn't happen most of the time (only for a metric
>>> consisting of an uncore event and a core event), so falling back to
>>> non-grouped events when the group open fails looks reasonable.
>>>
>>> Thanks
>>> Jin Yao
>>>
>>>> thanks,
>>>> jirka
>>>>
>>>>
>>>>> +
>>>>>          /*
>>>>>           * Leader must be already processed/open,
>>>>>           * if not it's a bug.
>>>>> @@ -1665,6 +1668,11 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>>>>>                                  pid = perf_thread_map__pid(threads, thread);
>>>>>
>>>>>                          group_fd = get_group_fd(evsel, cpu, thread);
>>>>> +                       if (group_fd == -2) {
>>>>> +                               errno = EINVAL;
>>>>> +                               err = -EINVAL;
>>>>> +                               goto out_close;
>>>>> +                       }
>>>>>   retry_open:
>>>>>                          test_attr__ready();
>>>>>
>>>>> It enables perf_evlist__reset_weak_group, and in the second_pass (in
>>>>> __run_perf_stat) the events will be opened successfully.
>>>>>
>>>>> I have tested this fix on cascadelakex, and it works OK.
>>>>>
>>>>> Thanks
>>>>> Jin Yao
>>>>>
>>>>
>>
>> Is this fix OK?
>>
>> Another thing: do you think we need to rename
>> "evsel->core.system_wide" to "evsel->core.has_cpumask"?
>>
>> The name "system_wide" may be misleading.
>>
>> evsel->core.system_wide = pmu ? pmu->is_uncore : false;
>>
>> "pmu->is_uncore" is true if the PMU has a "cpumask". But it's not just uncore
>> PMUs that have a cpumask; some other PMUs, e.g. cstate_pkg, also have a
>> cpumask. So for this case, "has_cpumask" would be better.
>
> so those flags are checked in many places in the code so I don't
> think it's wise to mess with them
>
> what I meant before was that the cpumask could be different for
> different events, so even when both events are 'system_wide' the
> leader 'fd' might not exist for the grouped events and vice versa
>
> so maybe we should ensure that we are grouping events with the same
> cpu maps before we go for open, so that get_group_fd stays simple
>

Thanks for the comments. I'm preparing the patch according to this idea.

>>
>> But I'm not sure if the change is OK for other cases, e.g. PT, which also
>> uses "evsel->core.system_wide".
>
> plz CC Adrian Hunter <[email protected]> on next patches
> if you are touching this
>

I will not touch "evsel->core.system_wide" in the new patch.

Thanks
Jin Yao

> thanks,
> jirka
>