2018-10-20 04:06:52

by David Miller

[permalink] [raw]
Subject: perf overlapping maps...


Symbols aren't exactly right all the time on sparc and even the owner
of a sample is set to "unknown" from time to time so I turned on some
debugging to investigate.

One thing that stands out is that we get overlapping maps all the
time.

So I tried to narrow down how this happens. Here is one case, we get
a new thread fork event for emacs-gtk before the MMAP events so we go:

thread__fork(thread, parent, timestamp)
{
...
thread__clone_map_groups(thread, parent)
{
...
map_groups__clone(thread, parent->mg)

Dumping this map_groups__clone() operation I see:

map_groups__clone: parent 0x10000425420 --> 0x10000418fb0
map_groups__clone: new [0000010000000000:0000010000110000] /bin/bash
map_groups__clone: new [0000010000212000:000001000021e000] /bin/bash
map_groups__clone: new [000001000021e000:00000100002a0000] /tmp/perf-1309.map
map_groups__clone: new [fff0000100000000:fff0000100024000] /lib/sparc64-linux-gnu/ld-2.27.so
map_groups__clone: new [fff0000100124000:fff0000100126000] /lib/sparc64-linux-gnu/ld-2.27.so
map_groups__clone: new [fff0000100128000:fff0000100152000] /lib/sparc64-linux-gnu/libtinfo.so.6.1
map_groups__clone: new [fff0000100254000:fff0000100256000] /lib/sparc64-linux-gnu/libtinfo.so.6.1
map_groups__clone: new [fff0000100258000:fff000010025c000] /lib/sparc64-linux-gnu/libdl-2.27.so
map_groups__clone: new [fff000010035c000:fff000010035e000] /lib/sparc64-linux-gnu/libdl-2.27.so
map_groups__clone: new [fff000010046a000:fff000010046c000] [vdso]
map_groups__clone: new [fff000010046c000:fff00001005cc000] /lib/sparc64-linux-gnu/libc-2.27.so
map_groups__clone: new [fff00001006d0000:fff00001006d4000] /lib/sparc64-linux-gnu/libc-2.27.so
map_groups__clone: new [fff00001006d4000:fff00001006d6000] /tmp/perf-1309.map
map_groups__clone: new [fff0000100874000:fff000010087e000] /lib/sparc64-linux-gnu/libnss_files-2.27.so
map_groups__clone: new [fff000010097e000:fff0000100980000] /lib/sparc64-linux-gnu/libnss_files-2.27.so
map_groups__clone: new [fff0000100980000:fff0000100986000] /tmp/perf-1309.map

It's inheriting maps for the parent bash shell that invoked emacs-gtk, which
makes no sense at all.

We proceed to process the MMAP events which have the proper mappings
for emacs-gtk, and eventually we happen to hit a mapping that overlaps
with one of the address ranges of the parent bash shell.

For the stuff that doesn't overlap, we have bogus parent bash shell
process mappings left in the emacs-gtk thread map group.

The above trace is simply from "./perf record 2>x.log", nothing fancy.

What we are doing above can't be right.

Yes, when processing real perf events from the kernel for a fork
event, we should do that inheritance stuff. But if we are
synthesizing the fork to build threads and maps for already running
processes, we absolutely should not perform the map groups clone.

One solution I've come up with is:

1) When synthesizing a fork event, set PERF_RECORD_MISC_COMM_EXEC in
header->misc.

2) Use this to elide the map groups clone in
thread__clone_map_groups().

Comments?


2018-10-20 04:45:47

by David Miller

[permalink] [raw]
Subject: Re: perf overlapping maps...

From: David Miller <[email protected]>
Date: Fri, 19 Oct 2018 21:05:49 -0700 (PDT)

> One solution I've come up with is:
>
> 1) When synthesizing a fork event, set PERF_RECORD_MISC_COMM_EXEC in
> header->misc.
>
> 2) Use this to elide the map groups clone in
> thread__clone_map_groups().

Looking into code history, I notice:

commit 363b785f3805a2632eb09a8b430842461c21a640
Author: Don Zickus <[email protected]>
Date: Fri Mar 14 10:43:44 2014 -0400

perf tools: Speed up thread map generation

and the subsequent:

commit 4aa5f4f7bb8bc41cba15bcd0d80c4fb085027d6b
Author: Arnaldo Carvalho de Melo <[email protected]>
Date: Fri Feb 27 19:52:10 2015 -0300

perf tools: Fix FORK after COMM when synthesizing records for pre-existing threads

If Don wanted to have the map cloning to happen for processes without
CLONE_VM, I'm not sure that's right.

For real threads, we just take a reference to the map group from
the parent.

Don, a quick summary. If we synthesize a fork event, let's say for an
emacs process. perf will clone the map groups of the parent bash
shell which invoked emacs. Via:

thread__fork(thread, parent, timestamp)
{
...
thread__clone_map_groups(thread, parent)
{
...
map_groups__clone(thread, parent->mg)

Which is completely bogus. It brings all of the bash process maps
into the emacs thread map group. Then we process the emacs mmap2
events, which overlap the bash process maps already cloned into the
emacs map group. And this make all kinds of erroneous things happen.

I'm suggesting to elide the map groups clone in this situation where
we are synthesizing the fork.

Thanks.

2018-10-22 14:50:32

by Don Zickus

[permalink] [raw]
Subject: Re: perf overlapping maps...

(adding Jiri)

On Fri, Oct 19, 2018 at 09:44:01PM -0700, David Miller wrote:
> From: David Miller <[email protected]>
> Date: Fri, 19 Oct 2018 21:05:49 -0700 (PDT)
>
> > One solution I've come up with is:
> >
> > 1) When synthesizing a fork event, set PERF_RECORD_MISC_COMM_EXEC in
> > header->misc.
> >
> > 2) Use this to elide the map groups clone in
> > thread__clone_map_groups().
>
> Looking into code history, I notice:
>
> commit 363b785f3805a2632eb09a8b430842461c21a640
> Author: Don Zickus <[email protected]>
> Date: Fri Mar 14 10:43:44 2014 -0400
>
> perf tools: Speed up thread map generation
>
> and the subsequent:
>
> commit 4aa5f4f7bb8bc41cba15bcd0d80c4fb085027d6b
> Author: Arnaldo Carvalho de Melo <[email protected]>
> Date: Fri Feb 27 19:52:10 2015 -0300
>
> perf tools: Fix FORK after COMM when synthesizing records for pre-existing threads
>
> If Don wanted to have the map cloning to happen for processes without
> CLONE_VM, I'm not sure that's right.
>
> For real threads, we just take a reference to the map group from
> the parent.
>
> Don, a quick summary. If we synthesize a fork event, let's say for an
> emacs process. perf will clone the map groups of the parent bash
> shell which invoked emacs. Via:
>
> thread__fork(thread, parent, timestamp)
> {
> ...
> thread__clone_map_groups(thread, parent)
> {
> ...
> map_groups__clone(thread, parent->mg)
>
> Which is completely bogus. It brings all of the bash process maps
> into the emacs thread map group. Then we process the emacs mmap2
> events, which overlap the bash process maps already cloned into the
> emacs map group. And this make all kinds of erroneous things happen.
>
> I'm suggesting to elide the map groups clone in this situation where
> we are synthesizing the fork.

Hi David,

Honestly, I remember very little of this change other than we ran specjbb
which created thousands of threads and we wanted a better way to handle that
situation (waiting 15 minutes seemed wrong).

Jiri Olsa is probably more knowledgable about this then I am these days and
can work with Joe to re-do the test to verify any suggested changes does not
break our intended use case.

Thinking about it more, I am wondering if we did this because we ran the
test and it takes about 20 minutes to 'warm up' then we attached perf to the
test. This implies we had to handle the situation where 10K threads already
existed hence our optimization. But I can be wrong.

Your suggestion is probably right and I am sure we can reproduce the
scenario to verify things didn't regress.

Cheers,
Don

2018-10-22 16:21:46

by Jiri Olsa

[permalink] [raw]
Subject: Re: perf overlapping maps...

On Mon, Oct 22, 2018 at 10:07:38AM -0400, Don Zickus wrote:
> (adding Jiri)
>
> On Fri, Oct 19, 2018 at 09:44:01PM -0700, David Miller wrote:
> > From: David Miller <[email protected]>
> > Date: Fri, 19 Oct 2018 21:05:49 -0700 (PDT)
> >
> > > One solution I've come up with is:
> > >
> > > 1) When synthesizing a fork event, set PERF_RECORD_MISC_COMM_EXEC in
> > > header->misc.
> > >
> > > 2) Use this to elide the map groups clone in
> > > thread__clone_map_groups().
> >
> > Looking into code history, I notice:
> >
> > commit 363b785f3805a2632eb09a8b430842461c21a640
> > Author: Don Zickus <[email protected]>
> > Date: Fri Mar 14 10:43:44 2014 -0400
> >
> > perf tools: Speed up thread map generation
> >
> > and the subsequent:
> >
> > commit 4aa5f4f7bb8bc41cba15bcd0d80c4fb085027d6b
> > Author: Arnaldo Carvalho de Melo <[email protected]>
> > Date: Fri Feb 27 19:52:10 2015 -0300
> >
> > perf tools: Fix FORK after COMM when synthesizing records for pre-existing threads
> >
> > If Don wanted to have the map cloning to happen for processes without
> > CLONE_VM, I'm not sure that's right.
> >
> > For real threads, we just take a reference to the map group from
> > the parent.
> >
> > Don, a quick summary. If we synthesize a fork event, let's say for an
> > emacs process. perf will clone the map groups of the parent bash
> > shell which invoked emacs. Via:
> >
> > thread__fork(thread, parent, timestamp)
> > {
> > ...
> > thread__clone_map_groups(thread, parent)
> > {
> > ...
> > map_groups__clone(thread, parent->mg)
> >
> > Which is completely bogus. It brings all of the bash process maps
> > into the emacs thread map group. Then we process the emacs mmap2
> > events, which overlap the bash process maps already cloned into the
> > emacs map group. And this make all kinds of erroneous things happen.
> >
> > I'm suggesting to elide the map groups clone in this situation where
> > we are synthesizing the fork.

right, this seems correct.. we should only clone parent maps
for kernel fork event, not when we synthesize.. I'll check
the solution you proposed and try to come with a patch

>
> Hi David,
>
> Honestly, I remember very little of this change other than we ran specjbb
> which created thousands of threads and we wanted a better way to handle that
> situation (waiting 15 minutes seemed wrong).
>
> Jiri Olsa is probably more knowledgable about this then I am these days and
> can work with Joe to re-do the test to verify any suggested changes does not
> break our intended use case.
>
> Thinking about it more, I am wondering if we did this because we ran the
> test and it takes about 20 minutes to 'warm up' then we attached perf to the
> test. This implies we had to handle the situation where 10K threads already
> existed hence our optimization. But I can be wrong.
>
> Your suggestion is probably right and I am sure we can reproduce the
> scenario to verify things didn't regress.

I think the fix might actualy speed things up,
but yes, there could be other report regressions

thanks,
jirka

2018-10-22 17:10:44

by Don Zickus

[permalink] [raw]
Subject: Re: perf overlapping maps...

On Mon, Oct 22, 2018 at 06:16:13PM +0200, Jiri Olsa wrote:
> On Mon, Oct 22, 2018 at 10:07:38AM -0400, Don Zickus wrote:
> > (adding Jiri)
> >
> > On Fri, Oct 19, 2018 at 09:44:01PM -0700, David Miller wrote:
> > > From: David Miller <[email protected]>
> > > Date: Fri, 19 Oct 2018 21:05:49 -0700 (PDT)
> > >
> > > > One solution I've come up with is:
> > > >
> > > > 1) When synthesizing a fork event, set PERF_RECORD_MISC_COMM_EXEC in
> > > > header->misc.
> > > >
> > > > 2) Use this to elide the map groups clone in
> > > > thread__clone_map_groups().
> > >
> > > Looking into code history, I notice:
> > >
> > > commit 363b785f3805a2632eb09a8b430842461c21a640
> > > Author: Don Zickus <[email protected]>
> > > Date: Fri Mar 14 10:43:44 2014 -0400
> > >
> > > perf tools: Speed up thread map generation
> > >
> > > and the subsequent:
> > >
> > > commit 4aa5f4f7bb8bc41cba15bcd0d80c4fb085027d6b
> > > Author: Arnaldo Carvalho de Melo <[email protected]>
> > > Date: Fri Feb 27 19:52:10 2015 -0300
> > >
> > > perf tools: Fix FORK after COMM when synthesizing records for pre-existing threads
> > >
> > > If Don wanted to have the map cloning to happen for processes without
> > > CLONE_VM, I'm not sure that's right.
> > >
> > > For real threads, we just take a reference to the map group from
> > > the parent.
> > >
> > > Don, a quick summary. If we synthesize a fork event, let's say for an
> > > emacs process. perf will clone the map groups of the parent bash
> > > shell which invoked emacs. Via:
> > >
> > > thread__fork(thread, parent, timestamp)
> > > {
> > > ...
> > > thread__clone_map_groups(thread, parent)
> > > {
> > > ...
> > > map_groups__clone(thread, parent->mg)
> > >
> > > Which is completely bogus. It brings all of the bash process maps
> > > into the emacs thread map group. Then we process the emacs mmap2
> > > events, which overlap the bash process maps already cloned into the
> > > emacs map group. And this make all kinds of erroneous things happen.
> > >
> > > I'm suggesting to elide the map groups clone in this situation where
> > > we are synthesizing the fork.
>
> right, this seems correct.. we should only clone parent maps
> for kernel fork event, not when we synthesize.. I'll check
> the solution you proposed and try to come with a patch

Thanks Jiri for helping me out on this!

Cheers,
Don

>
> >
> > Hi David,
> >
> > Honestly, I remember very little of this change other than we ran specjbb
> > which created thousands of threads and we wanted a better way to handle that
> > situation (waiting 15 minutes seemed wrong).
> >
> > Jiri Olsa is probably more knowledgable about this then I am these days and
> > can work with Joe to re-do the test to verify any suggested changes does not
> > break our intended use case.
> >
> > Thinking about it more, I am wondering if we did this because we ran the
> > test and it takes about 20 minutes to 'warm up' then we attached perf to the
> > test. This implies we had to handle the situation where 10K threads already
> > existed hence our optimization. But I can be wrong.
> >
> > Your suggestion is probably right and I am sure we can reproduce the
> > scenario to verify things didn't regress.
>
> I think the fix might actualy speed things up,
> but yes, there could be other report regressions
>
> thanks,
> jirka

2018-10-22 18:03:20

by David Miller

[permalink] [raw]
Subject: Re: perf overlapping maps...

From: Jiri Olsa <[email protected]>
Date: Mon, 22 Oct 2018 18:16:13 +0200

> I think the fix might actualy speed things up,
> but yes, there could be other report regressions

I was about to say the same thing, it could actually speed things up.
In the best case, less work is done (clone avoided, and overlapping
maps don't have to be handled). In the worst case, nothing changes.

Here is what I've been using, to give you an idea. There may be some
file offset fuzz in these patches.

diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 0cd42150f712..e5a442313f9d 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -308,6 +308,7 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
event->fork.pid = tgid;
event->fork.tid = pid;
event->fork.header.type = PERF_RECORD_FORK;
+ event->fork.header.misc = PERF_RECORD_MISC_COMM_EXEC;

event->fork.header.size = (sizeof(event->fork) + machine->id_hdr_size);

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 111ae858cbcb..dc06f1fc2ed5 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1708,5 +1720,6 @@ int machine__process_fork_event(struct machine *machine, union perf_event *event
struct thread *parent = machine__findnew_thread(machine,
event->fork.ppid,
event->fork.ptid);
+ int do_maps_clone = 1;
int err = 0;

@@ -1737,8 +1754,11 @@ int machine__process_fork_event(struct machine *machine, union perf_event *event
thread = machine__findnew_thread(machine, event->fork.pid,
event->fork.tid);

+ if (event->fork.header.misc & PERF_RECORD_MISC_COMM_EXEC)
+ do_maps_clone = 0;
+
if (thread == NULL || parent == NULL ||
- thread__fork(thread, parent, sample->time) < 0) {
+ thread__fork(thread, parent, sample->time, do_maps_clone) < 0) {
dump_printf("problem processing PERF_RECORD_FORK, skipping event.\n");
err = -1;
}
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index 2048d393ece6..7f2858edf221 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -330,7 +330,8 @@ static int thread__prepare_access(struct thread *thread)
}

static int thread__clone_map_groups(struct thread *thread,
- struct thread *parent)
+ struct thread *parent,
+ int do_maps_clone)
{
/* This is new thread, we share map groups for process. */
if (thread->pid_ == parent->pid_)
@@ -341,15 +342,14 @@ static int thread__clone_map_groups(struct thread *thread,
thread->pid_, thread->tid, parent->pid_, parent->tid);
return 0;
}
-
/* But this one is new process, copy maps. */
- if (map_groups__clone(thread, parent->mg) < 0)
+ if (do_maps_clone &&
+ map_groups__clone(thread, parent->mg) < 0)
return -ENOMEM;
-
return 0;
}

-int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp)
+int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp, int do_maps_clone)
{
if (parent->comm_set) {
const char *comm = thread__comm_str(parent);
@@ -362,7 +362,7 @@ int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp)
}

thread->ppid = parent->tid;
- return thread__clone_map_groups(thread, parent);
+ return thread__clone_map_groups(thread, parent, do_maps_clone);
}

void thread__find_cpumode_addr_location(struct thread *thread, u64 addr,
diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index 07606aa6998d..8e4ca1ede01f 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -87,7 +87,7 @@ struct comm *thread__comm(const struct thread *thread);
struct comm *thread__exec_comm(const struct thread *thread);
const char *thread__comm_str(const struct thread *thread);
int thread__insert_map(struct thread *thread, struct map *map);
-int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp);
+int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp, int do_maps_clone);
size_t thread__fprintf(struct thread *thread, FILE *fp);

struct thread *thread__main_thread(struct machine *machine, struct thread *thread);

2018-10-23 06:35:36

by Jiri Olsa

[permalink] [raw]
Subject: Re: perf overlapping maps...

On Mon, Oct 22, 2018 at 10:58:42AM -0700, David Miller wrote:
> From: Jiri Olsa <[email protected]>
> Date: Mon, 22 Oct 2018 18:16:13 +0200
>
> > I think the fix might actualy speed things up,
> > but yes, there could be other report regressions
>
> I was about to say the same thing, it could actually speed things up.
> In the best case, less work is done (clone avoided, and overlapping
> maps don't have to be handled). In the worst case, nothing changes.
>
> Here is what I've been using, to give you an idea. There may be some
> file offset fuzz in these patches.

I'm not sure about using the misc field bit defined/used by userland,
in case there's some new one comming in future for fork event..

but the only other way I can think of now is adding new 'user' event
for that, but that ended up as a bigger change (attached)

I think if we make some 'big enough' comment about the bit usage,
your change is better.. will you post or should I?

thanks,
jirka


---
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index bc646185f8d9..06e28cef5579 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -60,6 +60,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
[PERF_RECORD_HEADER_FEATURE] = "FEATURE",
+ [PERF_RECORD_FORK_USER] = "FORK_USER",
};

static const char *perf_ns__names[] = {
@@ -307,7 +308,7 @@ static int perf_event__synthesize_fork(struct perf_tool *tool,
}
event->fork.pid = tgid;
event->fork.tid = pid;
- event->fork.header.type = PERF_RECORD_FORK;
+ event->fork.header.type = PERF_RECORD_FORK_USER;

event->fork.header.size = (sizeof(event->fork) + machine->id_hdr_size);

diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index bfa60bcafbde..e7252ae61529 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -249,6 +249,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
PERF_RECORD_HEADER_FEATURE = 80,
+ PERF_RECORD_FORK_USER = 81,
PERF_RECORD_HEADER_MAX
};

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 111ae858cbcb..b7d8aec18707 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1702,6 +1702,7 @@ void machine__remove_thread(struct machine *machine, struct thread *th)
int machine__process_fork_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample)
{
+ bool user = event->header.type == PERF_RECORD_FORK_USER;
struct thread *thread = machine__find_thread(machine,
event->fork.pid,
event->fork.tid);
@@ -1738,7 +1739,7 @@ int machine__process_fork_event(struct machine *machine, union perf_event *event
event->fork.tid);

if (thread == NULL || parent == NULL ||
- thread__fork(thread, parent, sample->time) < 0) {
+ thread__fork(thread, parent, sample->time, user) < 0) {
dump_printf("problem processing PERF_RECORD_FORK, skipping event.\n");
err = -1;
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index df5e12b22905..a219dd119d95 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -801,6 +801,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_STAT_ROUND] = perf_event__stat_round_swap,
[PERF_RECORD_EVENT_UPDATE] = perf_event__event_update_swap,
[PERF_RECORD_TIME_CONV] = perf_event__all64_swap,
+ [PERF_RECORD_FORK_USER] = perf_event__task_swap,
[PERF_RECORD_HEADER_MAX] = NULL,
};

@@ -1269,6 +1270,7 @@ static int machines__deliver_event(struct machines *machines,
case PERF_RECORD_NAMESPACES:
return tool->namespaces(tool, event, sample, machine);
case PERF_RECORD_FORK:
+ case PERF_RECORD_FORK_USER:
return tool->fork(tool, event, sample, machine);
case PERF_RECORD_EXIT:
return tool->exit(tool, event, sample, machine);
@@ -1398,6 +1400,12 @@ static s64 perf_session__process_user_event(struct perf_session *session,
}
}

+static bool process_user(union perf_event *event)
+{
+ return event->header.type >= PERF_RECORD_USER_TYPE_START &&
+ event->header.type != PERF_RECORD_FORK_USER;
+}
+
int perf_session__deliver_synth_event(struct perf_session *session,
union perf_event *event,
struct perf_sample *sample)
@@ -1407,7 +1415,7 @@ int perf_session__deliver_synth_event(struct perf_session *session,

events_stats__inc(&evlist->stats, event->header.type);

- if (event->header.type >= PERF_RECORD_USER_TYPE_START)
+ if (process_user(event))
return perf_session__process_user_event(session, event, 0);

return machines__deliver_event(&session->machines, evlist, event, sample, tool, 0);
@@ -1492,7 +1500,7 @@ static s64 perf_session__process_event(struct perf_session *session,

events_stats__inc(&evlist->stats, event->header.type);

- if (event->header.type >= PERF_RECORD_USER_TYPE_START)
+ if (process_user(event))
return perf_session__process_user_event(session, event, file_offset);

if (tool->ordered_events) {
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index 2048d393ece6..2782d2626f09 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -332,10 +332,6 @@ static int thread__prepare_access(struct thread *thread)
static int thread__clone_map_groups(struct thread *thread,
struct thread *parent)
{
- /* This is new thread, we share map groups for process. */
- if (thread->pid_ == parent->pid_)
- return thread__prepare_access(thread);
-
if (thread->mg == parent->mg) {
pr_debug("broken map groups on thread %d/%d parent %d/%d\n",
thread->pid_, thread->tid, parent->pid_, parent->tid);
@@ -343,13 +339,11 @@ static int thread__clone_map_groups(struct thread *thread,
}

/* But this one is new process, copy maps. */
- if (map_groups__clone(thread, parent->mg) < 0)
- return -ENOMEM;
-
- return 0;
+ return map_groups__clone(thread, parent->mg);
}

-int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp)
+int thread__fork(struct thread *thread, struct thread *parent,
+ u64 timestamp, bool user)
{
if (parent->comm_set) {
const char *comm = thread__comm_str(parent);
@@ -362,6 +356,15 @@ int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp)
}

thread->ppid = parent->tid;
+
+ /* This is new thread, we share map groups for process. */
+ if (thread->pid_ == parent->pid_)
+ return thread__prepare_access(thread);
+
+ /* Do not clone maps for user space fork event. */
+ if (user)
+ return 0;
+
return thread__clone_map_groups(thread, parent);
}

diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index 07606aa6998d..827727300100 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -87,7 +87,8 @@ struct comm *thread__comm(const struct thread *thread);
struct comm *thread__exec_comm(const struct thread *thread);
const char *thread__comm_str(const struct thread *thread);
int thread__insert_map(struct thread *thread, struct map *map);
-int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp);
+int thread__fork(struct thread *thread, struct thread *parent,
+ u64 timestamp, bool user);
size_t thread__fprintf(struct thread *thread, FILE *fp);

struct thread *thread__main_thread(struct machine *machine, struct thread *thread);

2018-10-23 18:02:50

by David Miller

[permalink] [raw]
Subject: Re: perf overlapping maps...

From: Jiri Olsa <[email protected]>
Date: Tue, 23 Oct 2018 08:34:52 +0200

> I'm not sure about using the misc field bit defined/used by userland,
> in case there's some new one comming in future for fork event..
>
> but the only other way I can think of now is adding new 'user' event
> for that, but that ended up as a bigger change (attached)
>
> I think if we make some 'big enough' comment about the bit usage,
> your change is better.. will you post or should I?

There might be something else we can do to implement this, and I think
making a whole new event for what is an application internal problem
is overkill.

What is kind of silly about how all of the synthetic events work is
that we throw away a lot of information by tossing the events over to
the generic event processing engine of the perf tool.

So we generate the events knowing the thread, context, PID, cpu, etc.
and then we lose all of that information, and the event processing
engine has to look all of it up again.

This is also, BTW, the reason we have dependencies on synthetic event
emission ordering. F.e. this comes up wrt. COMM and FORK events.

I understand that this design allows the perf tool types to define a
private function to dispatch the events, as is appropriate for what
the tool is doing.

But the side effect of this design is that it means it is hard to pass
internal state around, outside of the event object itself.

Anyways, I'll look into this and see if there is a better way to
implement this.

Thanks!

2018-10-23 18:07:26

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: perf overlapping maps...

Em Tue, Oct 23, 2018 at 10:54:05AM -0700, David Miller escreveu:
> From: Jiri Olsa <[email protected]>
> Date: Tue, 23 Oct 2018 08:34:52 +0200
>
> > I'm not sure about using the misc field bit defined/used by userland,
> > in case there's some new one comming in future for fork event..
> >
> > but the only other way I can think of now is adding new 'user' event
> > for that, but that ended up as a bigger change (attached)
> >
> > I think if we make some 'big enough' comment about the bit usage,
> > your change is better.. will you post or should I?
>
> There might be something else we can do to implement this, and I think
> making a whole new event for what is an application internal problem
> is overkill.

agreed, I saw this earlier today and thought about "use cpumode" but got
sidetracked with processing other patches :-\ see below.

> What is kind of silly about how all of the synthetic events work is
> that we throw away a lot of information by tossing the events over to
> the generic event processing engine of the perf tool.
>
> So we generate the events knowing the thread, context, PID, cpu, etc.
> and then we lose all of that information, and the event processing
> engine has to look all of it up again.
>
> This is also, BTW, the reason we have dependencies on synthetic event
> emission ordering. F.e. this comes up wrt. COMM and FORK events.
>
> I understand that this design allows the perf tool types to define a
> private function to dispatch the events, as is appropriate for what
> the tool is doing.
>
> But the side effect of this design is that it means it is hard to pass
> internal state around, outside of the event object itself.
>
> Anyways, I'll look into this and see if there is a better way to
> implement this.

IIRC this was first done for 'perf record', where we have to stash those
events in the perf.data file, to then, later, 'perf report' to process
those, so when working on 'perf top', it just reuses that machinery.

Sure, with some love and care 'perf top' could do better and update all
the data structures directly :-)

Anyway, have you guys considered tweaking using event->header.misc |=
PERF_RECORD_MISC_USER? The kernel leaves that as zero for the
PERF_RECORD_FORK it emits:

static void perf_event_task(struct task_struct *task,
struct perf_event_context *task_ctx,
int new)
{
struct perf_task_event task_event;

if (!atomic_read(&nr_comm_events) &&
!atomic_read(&nr_mmap_events) &&
!atomic_read(&nr_task_events))
return;

task_event = (struct perf_task_event){
.task = task,
.task_ctx = task_ctx,
.event_id = {
.header = {
.type = new ? PERF_RECORD_FORK : PERF_RECORD_EXIT,
.misc = 0,
.size = sizeof(task_event.event_id),
},
<SNIP>

void perf_event_fork(struct task_struct *task)
{
perf_event_task(task, NULL, 1);
perf_event_namespaces(task);
}

#define PERF_RECORD_MISC_CPUMODE_UNKNOWN (0 << 0)
#define PERF_RECORD_MISC_KERNEL (1 << 0)
#define PERF_RECORD_MISC_USER (2 << 0)

- Arnaldo

2018-10-23 18:15:52

by David Miller

[permalink] [raw]
Subject: Re: perf overlapping maps...

From: Arnaldo Carvalho de Melo <[email protected]>
Date: Tue, 23 Oct 2018 15:05:03 -0300

> IIRC this was first done for 'perf record', where we have to stash those
> events in the perf.data file, to then, later, 'perf report' to process
> those, so when working on 'perf top', it just reuses that machinery.
>
> Sure, with some love and care 'perf top' could do better and update all
> the data structures directly :-)

Thanks for the history, it is useful information :)

> Anyway, have you guys considered tweaking using event->header.misc |=
> PERF_RECORD_MISC_USER? The kernel leaves that as zero for the
> PERF_RECORD_FORK it emits:

I really would like to steer the approach away from using UAPI
perf_event fields in an internal way.

I am really very sorry for suggesting such a scheme myself in the
first place. It really was a bad idea upon much consideration.

The synthetic fork is not really a fork, it's more like a "create".

And this fundamental semantic difference is why we have all of these
issues wrt. handling COMM and parent map inheritance.

There is also a bunch of non-trivial code to deal with whether we
synthetically create the child or the parent first, wrt. finding
thread leaders and parent threads.

What I'm trying to say is that there is a clean design based solution
hiding somewhere in here and I'd like to find it :-)

2018-10-23 19:28:42

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: perf overlapping maps...

Em Tue, Oct 23, 2018 at 11:15:03AM -0700, David Miller escreveu:
> From: Arnaldo Carvalho de Melo <[email protected]>
> Date: Tue, 23 Oct 2018 15:05:03 -0300
> > IIRC this was first done for 'perf record', where we have to stash those
> > events in the perf.data file, to then, later, 'perf report' to process
> > those, so when working on 'perf top', it just reuses that machinery.

> > Sure, with some love and care 'perf top' could do better and update all
> > the data structures directly :-)

> Thanks for the history, it is useful information :)

> > Anyway, have you guys considered tweaking using event->header.misc |=
> > PERF_RECORD_MISC_USER? The kernel leaves that as zero for the
> > PERF_RECORD_FORK it emits:

> I really would like to steer the approach away from using UAPI
> perf_event fields in an internal way.

> I am really very sorry for suggesting such a scheme myself in the
> first place. It really was a bad idea upon much consideration.

> The synthetic fork is not really a fork, it's more like a "create".

> And this fundamental semantic difference is why we have all of these
> issues wrt. handling COMM and parent map inheritance.

> There is also a bunch of non-trivial code to deal with whether we
> synthetically create the child or the parent first, wrt. finding
> thread leaders and parent threads.

> What I'm trying to say is that there is a clean design based solution
> hiding somewhere in here and I'd like to find it :-)

So, this is all because we're trying to recreate things that happened in
the past, for threads we're interested in but couldn't catch the
PERF_RECORD_{FORK,COMM,MMAP} when they originally take place.

Ideally we would recreate them in the exact same order and with the
exact same details, which was kinda what was intended, but as you're
seeing is failing at that in various cases.

Also if we keep using this abstraction, i.e. synthesize in userspace
what the kernel does, generating PERF_RECORD__{FORK,COMM,MMAP}, then
older tools will continue working with perf.data files generated by
a new, fixed up 'perf record'.

And nowadays there are other tools that read perf.data files:

http://code.qt.io/cgit/qt-creator/perfparser.git/
https://doc.qt.io/qtcreator/creator-cpu-usage-analyzer.html

Or do you think we should introduce new record types that deal better
with pre-existing threads/maps?

- Arnaldo

2018-10-24 11:34:58

by Jiri Olsa

[permalink] [raw]
Subject: Re: perf overlapping maps...

On Tue, Oct 23, 2018 at 11:15:03AM -0700, David Miller wrote:
> From: Arnaldo Carvalho de Melo <[email protected]>
> Date: Tue, 23 Oct 2018 15:05:03 -0300
>
> > IIRC this was first done for 'perf record', where we have to stash those
> > events in the perf.data file, to then, later, 'perf report' to process
> > those, so when working on 'perf top', it just reuses that machinery.
> >
> > Sure, with some love and care 'perf top' could do better and update all
> > the data structures directly :-)
>
> Thanks for the history, it is useful information :)
>
> > Anyway, have you guys considered tweaking using event->header.misc |=
> > PERF_RECORD_MISC_USER? The kernel leaves that as zero for the
> > PERF_RECORD_FORK it emits:
>
> I really would like to steer the approach away from using UAPI
> perf_event fields in an internal way.
>
> I am really very sorry for suggesting such a scheme myself in the
> first place. It really was a bad idea upon much consideration.
>
> The synthetic fork is not really a fork, it's more like a "create".
>
> And this fundamental semantic difference is why we have all of these
> issues wrt. handling COMM and parent map inheritance.
>
> There is also a bunch of non-trivial code to deal with whether we
> synthetically create the child or the parent first, wrt. finding
> thread leaders and parent threads.
>
> What I'm trying to say is that there is a clean design based solution
> hiding somewhere in here and I'd like to find it :-)

how about adding a data file marker/event when the synthesized
portion of data is over

attached patch adds an 'SYNTHESIZE_END' event and prevents
parent's maps cloning on fork until that event is found

we would need more code to stay backward compatible, which
I did not include.. just to cleanly outline the solution

jirka


---
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 10cf889c6d75..328b75711272 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -746,6 +746,11 @@ static const struct perf_event_mmap_page *record__pick_pc(struct record *rec)
return NULL;
}

+static struct perf_event_header synthesize_end_event = {
+ .size = sizeof(struct perf_event_header),
+ .type = PERF_RECORD_SYNTHESIZE_END,
+};
+
static int record__synthesize(struct record *rec, bool tail)
{
struct perf_session *session = rec->session;
@@ -853,6 +858,12 @@ static int record__synthesize(struct record *rec, bool tail)
err = __machine__synthesize_threads(machine, tool, &opts->target, rec->evlist->threads,
process_synthesized_event, opts->sample_address,
opts->proc_map_timeout, 1);
+
+ if (err < 0)
+ return err;
+
+ err = record__write(rec, NULL, &synthesize_end_event,
+ sizeof(synthesize_end_event));
out:
return err;
}
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index bc646185f8d9..d7aee86faa60 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -60,6 +60,7 @@ static const char *perf_event__names[] = {
[PERF_RECORD_EVENT_UPDATE] = "EVENT_UPDATE",
[PERF_RECORD_TIME_CONV] = "TIME_CONV",
[PERF_RECORD_HEADER_FEATURE] = "FEATURE",
+ [PERF_RECORD_SYNTHESIZE_END] = "SYNTHESIZE_END",
};

static const char *perf_ns__names[] = {
@@ -1413,12 +1414,12 @@ size_t perf_event__fprintf_task(union perf_event *event, FILE *fp)
event->fork.ppid, event->fork.ptid);
}

-int perf_event__process_fork(struct perf_tool *tool __maybe_unused,
+int perf_event__process_fork(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
struct machine *machine)
{
- return machine__process_fork_event(machine, event, sample);
+ return machine__process_fork_event(machine, event, sample, tool->synth_end);
}

int perf_event__process_exit(struct perf_tool *tool __maybe_unused,
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index bfa60bcafbde..51eb996f2696 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -249,6 +249,7 @@ enum perf_user_event_type { /* above any possible kernel type */
PERF_RECORD_EVENT_UPDATE = 78,
PERF_RECORD_TIME_CONV = 79,
PERF_RECORD_HEADER_FEATURE = 80,
+ PERF_RECORD_SYNTHESIZE_END = 81,
PERF_RECORD_HEADER_MAX
};

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 111ae858cbcb..f5e08ccb6032 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1700,7 +1700,7 @@ void machine__remove_thread(struct machine *machine, struct thread *th)
}

int machine__process_fork_event(struct machine *machine, union perf_event *event,
- struct perf_sample *sample)
+ struct perf_sample *sample, bool clone)
{
struct thread *thread = machine__find_thread(machine,
event->fork.pid,
@@ -1738,7 +1738,7 @@ int machine__process_fork_event(struct machine *machine, union perf_event *event
event->fork.tid);

if (thread == NULL || parent == NULL ||
- thread__fork(thread, parent, sample->time) < 0) {
+ thread__fork(thread, parent, sample->time, clone) < 0) {
dump_printf("problem processing PERF_RECORD_FORK, skipping event.\n");
err = -1;
}
@@ -1781,7 +1781,7 @@ int machine__process_event(struct machine *machine, union perf_event *event,
case PERF_RECORD_MMAP2:
ret = machine__process_mmap2_event(machine, event, sample); break;
case PERF_RECORD_FORK:
- ret = machine__process_fork_event(machine, event, sample); break;
+ ret = machine__process_fork_event(machine, event, sample, false); break;
case PERF_RECORD_EXIT:
ret = machine__process_exit_event(machine, event, sample); break;
case PERF_RECORD_LOST:
diff --git a/tools/perf/util/machine.h b/tools/perf/util/machine.h
index d856b85862e2..c4a343809273 100644
--- a/tools/perf/util/machine.h
+++ b/tools/perf/util/machine.h
@@ -109,7 +109,7 @@ int machine__process_comm_event(struct machine *machine, union perf_event *event
int machine__process_exit_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
int machine__process_fork_event(struct machine *machine, union perf_event *event,
- struct perf_sample *sample);
+ struct perf_sample *sample, bool clone);
int machine__process_lost_event(struct machine *machine, union perf_event *event,
struct perf_sample *sample);
int machine__process_lost_samples_event(struct machine *machine, union perf_event *event,
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index df5e12b22905..e3c6661a3936 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -1393,6 +1393,9 @@ static s64 perf_session__process_user_event(struct perf_session *session,
return tool->time_conv(session, event);
case PERF_RECORD_HEADER_FEATURE:
return tool->feature(session, event);
+ case PERF_RECORD_SYNTHESIZE_END:
+ tool->synth_end = true;
+ return 0;
default:
return -EINVAL;
}
diff --git a/tools/perf/util/thread.c b/tools/perf/util/thread.c
index 2048d393ece6..e8c09bc60ba9 100644
--- a/tools/perf/util/thread.c
+++ b/tools/perf/util/thread.c
@@ -332,10 +332,6 @@ static int thread__prepare_access(struct thread *thread)
static int thread__clone_map_groups(struct thread *thread,
struct thread *parent)
{
- /* This is new thread, we share map groups for process. */
- if (thread->pid_ == parent->pid_)
- return thread__prepare_access(thread);
-
if (thread->mg == parent->mg) {
pr_debug("broken map groups on thread %d/%d parent %d/%d\n",
thread->pid_, thread->tid, parent->pid_, parent->tid);
@@ -343,13 +339,11 @@ static int thread__clone_map_groups(struct thread *thread,
}

/* But this one is new process, copy maps. */
- if (map_groups__clone(thread, parent->mg) < 0)
- return -ENOMEM;
-
- return 0;
+ return map_groups__clone(thread, parent->mg);
}

-int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp)
+int thread__fork(struct thread *thread, struct thread *parent,
+ u64 timestamp, bool clone)
{
if (parent->comm_set) {
const char *comm = thread__comm_str(parent);
@@ -362,6 +356,13 @@ int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp)
}

thread->ppid = parent->tid;
+
+ fprintf(stderr, "fork %d, clone %d\n", thread->tid, clone);
+
+ /* This is new thread, we share map groups for process. */
+ if (!clone || thread->pid_ == parent->pid_)
+ return thread__prepare_access(thread);
+
return thread__clone_map_groups(thread, parent);
}

diff --git a/tools/perf/util/thread.h b/tools/perf/util/thread.h
index 07606aa6998d..dd6716cd0fd6 100644
--- a/tools/perf/util/thread.h
+++ b/tools/perf/util/thread.h
@@ -87,7 +87,8 @@ struct comm *thread__comm(const struct thread *thread);
struct comm *thread__exec_comm(const struct thread *thread);
const char *thread__comm_str(const struct thread *thread);
int thread__insert_map(struct thread *thread, struct map *map);
-int thread__fork(struct thread *thread, struct thread *parent, u64 timestamp);
+int thread__fork(struct thread *thread, struct thread *parent,
+ u64 timestamp, bool clone);
size_t thread__fprintf(struct thread *thread, FILE *fp);

struct thread *thread__main_thread(struct machine *machine, struct thread *thread);
diff --git a/tools/perf/util/tool.h b/tools/perf/util/tool.h
index 56e4ca54020a..25463d7d9647 100644
--- a/tools/perf/util/tool.h
+++ b/tools/perf/util/tool.h
@@ -74,6 +74,7 @@ struct perf_tool {
bool ordering_requires_timestamps;
bool namespace_events;
bool no_warn;
+ bool synth_end;
enum show_feature_header show_feat_hdr;
};


2018-10-24 21:31:23

by David Miller

[permalink] [raw]
Subject: Re: perf overlapping maps...

From: Jiri Olsa <[email protected]>
Date: Wed, 24 Oct 2018 13:34:16 +0200

> On Tue, Oct 23, 2018 at 11:15:03AM -0700, David Miller wrote:
>> From: Arnaldo Carvalho de Melo <[email protected]>
>> Date: Tue, 23 Oct 2018 15:05:03 -0300
>>
>> > IIRC this was first done for 'perf record', where we have to stash those
>> > events in the perf.data file, to then, later, 'perf report' to process
>> > those, so when working on 'perf top', it just reuses that machinery.
>> >
>> > Sure, with some love and care 'perf top' could do better and update all
>> > the data structures directly :-)
>>
>> Thanks for the history, it is useful information :)
>>
>> > Anyway, have you guys considered tweaking using event->header.misc |=
>> > PERF_RECORD_MISC_USER? The kernel leaves that as zero for the
>> > PERF_RECORD_FORK it emits:
>>
>> I really would like to steer the approach away from using UAPI
>> perf_event fields in an internal way.
>>
>> I am really very sorry for suggesting such a scheme myself in the
>> first place. It really was a bad idea upon much consideration.
>>
>> The synthetic fork is not really a fork, it's more like a "create".
>>
>> And this fundamental semantic difference is why we have all of these
>> issues wrt. handling COMM and parent map inheritance.
>>
>> There is also a bunch of non-trivial code to deal with whether we
>> synthetically create the child or the parent first, wrt. finding
>> thread leaders and parent threads.
>>
>> What I'm trying to say is that there is a clean design based solution
>> hiding somewhere in here and I'd like to find it :-)
>
> how about adding a data file marker/event when the synthesized
> portion of data is over
>
> attached patch adds an 'SYNTHESIZE_END' event and prevents
> parent's maps cloning on fork until that event is found
>
> we would need more code to stay backward compatible, which
> I did not include.. just to cleanly outline the solution

I appreciate everyone reiterating the ABI issue. Wherein, perf record
stashes these events into a file and later analysis tools need to
interpret them.

Considering this specifically, it is very tempting to make a new event
representing this thread "CREATE" operation.

But, if we make a new event then existing tools will break because
they will see this unrecognizable event and not create the threads.

From this perspective setting a misc flag in the FORK event might in
fact be the best compromise.

Existing tools will still create the thread, in exactly the same
existing way. This keeps things working as well as they do currently.

So my vote returns to emitting FORK with the special COMM flag set in
the misc flags. And we document this properly of course.

Any objections? I'll cook up a formal patch meanwhile in case we do
agree.