2013-09-06 19:37:09

by David Ahern

[permalink] [raw]
Subject: [PATCH] perf session: Add option to copy events when queueing

When processing events the session code has an ordered samples queue which is
used to time-sort events coming in across multiple mmaps. At a later point in
time samples on the queue are flushed up to some timestamp at which point the
event is actually processed.

When analyzing events live (ie., record/analysis path in the same command)
there is a race that leads to corrupted events and parse errors which cause
perf to terminate. The problem is that when the event is placed in the ordered
samples queue it is only a reference to the event which is really sitting in
the mmap buffer. Even though the event is queued for later processing the mmap
tail pointer is updated which indicates to the kernel that the event has been
processed. The race is flushing the event from the queue before it gets
overwritten by some other event. For commands trying to process events live
(versus just writing to a file) and processing a high rate of events this leads
to parse failures and perf terminates.

Examples hitting this problem are 'perf kvm stat live', especially with nested
VMs which generate 100,000+ traces per second, and a command processing
scheduling events with a high rate of context switching -- e.g., running
'perf bench sched pipe'.

This patch offers live commands an option to copy the event when it is placed in
the ordered samples queue.

Signed-off-by: David Ahern <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
---
tools/perf/util/session.c | 17 ++++++++++++++---
tools/perf/util/session.h | 1 +
2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 1b185ca..71f16db 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -483,6 +483,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)

sq = list_entry(os->to_free.next, struct sample_queue, list);
list_del(&sq->list);
+ if (session->copy_on_queue)
+ free(sq->event);
free(sq);
}
}
@@ -513,11 +515,15 @@ static int flush_sample_queue(struct perf_session *s,
break;

ret = perf_evlist__parse_sample(s->evlist, iter->event, &sample);
- if (ret)
+ if (ret) {
pr_err("Can't parse sample, err = %d\n", ret);
- else {
+ if (s->copy_on_queue)
+ free(iter->event);
+ } else {
ret = perf_session_deliver_event(s, iter->event, &sample, tool,
iter->file_offset);
+ if (s->copy_on_queue)
+ free(iter->event);
if (ret)
return ret;
}
@@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,

new->timestamp = timestamp;
new->file_offset = file_offset;
- new->event = event;
+
+ if (s->copy_on_queue) {
+ new->event = malloc(event->header.size);
+ memcpy(new->event, event, event->header.size);
+ } else
+ new->event = event;

__queue_event(new, s);

diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
index 3aa75fb..4adfcbb 100644
--- a/tools/perf/util/session.h
+++ b/tools/perf/util/session.h
@@ -38,6 +38,7 @@ struct perf_session {
bool fd_pipe;
bool repipe;
struct ordered_samples ordered_samples;
+ bool copy_on_queue;
char filename[1];
};

--
1.7.10.1


2013-09-14 16:16:32

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
> When processing events the session code has an ordered samples queue which is
> used to time-sort events coming in across multiple mmaps. At a later point in
> time samples on the queue are flushed up to some timestamp at which point the
> event is actually processed.
>
> When analyzing events live (ie., record/analysis path in the same command)
> there is a race that leads to corrupted events and parse errors which cause
> perf to terminate. The problem is that when the event is placed in the ordered
> samples queue it is only a reference to the event which is really sitting in
> the mmap buffer. Even though the event is queued for later processing the mmap
> tail pointer is updated which indicates to the kernel that the event has been
> processed. The race is flushing the event from the queue before it gets
> overwritten by some other event. For commands trying to process events live
> (versus just writing to a file) and processing a high rate of events this leads
> to parse failures and perf terminates.
>
> Examples hitting this problem are 'perf kvm stat live', especially with nested
> VMs which generate 100,000+ traces per second, and a command processing
> scheduling events with a high rate of context switching -- e.g., running
> 'perf bench sched pipe'.
>
> This patch offers live commands an option to copy the event when it is placed in
> the ordered samples queue.
>
> Signed-off-by: David Ahern <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jiri Olsa <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Namhyung Kim <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Stephane Eranian <[email protected]>
> ---
> tools/perf/util/session.c | 17 ++++++++++++++---
> tools/perf/util/session.h | 1 +
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 1b185ca..71f16db 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -483,6 +483,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>
> sq = list_entry(os->to_free.next, struct sample_queue, list);
> list_del(&sq->list);
> + if (session->copy_on_queue)
> + free(sq->event);
> free(sq);
> }
> }
> @@ -513,11 +515,15 @@ static int flush_sample_queue(struct perf_session *s,
> break;
>
> ret = perf_evlist__parse_sample(s->evlist, iter->event, &sample);
> - if (ret)
> + if (ret) {
> pr_err("Can't parse sample, err = %d\n", ret);
> - else {
> + if (s->copy_on_queue)
> + free(iter->event);
> + } else {
> ret = perf_session_deliver_event(s, iter->event, &sample, tool,
> iter->file_offset);
> + if (s->copy_on_queue)
> + free(iter->event);
> if (ret)
> return ret;
> }
> @@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
>
> new->timestamp = timestamp;
> new->file_offset = file_offset;
> - new->event = event;
> +
> + if (s->copy_on_queue) {
> + new->event = malloc(event->header.size);
> + memcpy(new->event, event, event->header.size);
> + } else
> + new->event = event;
>
> __queue_event(new, s);
>
> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
> index 3aa75fb..4adfcbb 100644
> --- a/tools/perf/util/session.h
> +++ b/tools/perf/util/session.h
> @@ -38,6 +38,7 @@ struct perf_session {
> bool fd_pipe;
> bool repipe;
> struct ordered_samples ordered_samples;
> + bool copy_on_queue;

So do you think it should stay optional? This looks like a global problem, I mean
the event can be unmapped anytime for any builtin tool mapping it, right?

Also we already allocate the sample list node (struct sample_queue) from os->sample
buffer. ie: we have our own allocator there.

Probably we should reuse that and include the copied event space in "struct sample_queue"?

Also looking at it now, it seems we have a bug on the existing code:


if (!list_empty(sc)) {
new = list_entry(sc->next, struct sample_queue, list);
list_del(&new->list);
} else if (os->sample_buffer) {
new = os->sample_buffer + os->sample_buffer_idx;
if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
os->sample_buffer = NULL;
} else {
os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
if (!os->sample_buffer)
return -ENOMEM;
list_add(&os->sample_buffer->list, &os->to_free);
os->sample_buffer_idx = 2;
new = os->sample_buffer + 1;
}

If we actually run out of buffer rooms, we should realloc right after and not
wait for the next entry, otherwise we loose an event:

if (!list_empty(sc)) {
new = list_entry(sc->next, struct sample_queue, list);
list_del(&new->list);
} else {
if (os->sample_buffer) {
new = os->sample_buffer + os->sample_buffer_idx;
if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
os->sample_buffer = NULL;
}

if (!os->sample_buffer) {
os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
if (!os->sample_buffer)
return -ENOMEM;
list_add(&os->sample_buffer->list, &os->to_free);
os->sample_buffer_idx = 2;
new = os->sample_buffer + 1;
}


Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
a function. But the idea is there.

2013-09-14 17:25:49

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On 9/14/13 10:16 AM, Frederic Weisbecker wrote:
>> @@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
>>
>> new->timestamp = timestamp;
>> new->file_offset = file_offset;
>> - new->event = event;
>> +
>> + if (s->copy_on_queue) {
>> + new->event = malloc(event->header.size);
>> + memcpy(new->event, event, event->header.size);
>> + } else
>> + new->event = event;

---8<---

> So do you think it should stay optional? This looks like a global problem, I mean
> the event can be unmapped anytime for any builtin tool mapping it, right?

Yes. I could make it the default behavior; just overhead in doing that
(malloc/copy for each event).

>
> Also we already allocate the sample list node (struct sample_queue) from os->sample
> buffer. ie: we have our own allocator there.
>
> Probably we should reuse that and include the copied event space in "struct sample_queue"?


Right, that's where I put the malloc and copy - I kept the relevant
change above. I take it you are thinking of something different but I am
not following you. You definitely do NOT want to change struct
sample_queue to include an event - like this:

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 51f5edf..866944a 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -491,7 +491,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
struct sample_queue {
u64 timestamp;
u64 file_offset;
- union perf_event *event;
+ union perf_event event;
struct list_head list;
};

size of event is determined by mmap_event (mmap2_event in latest code)
which is > 4096 because of the filename argument. Including the event
directly in sample_queue would balloon memory usage (learned this the
hard way!).

>
> Also looking at it now, it seems we have a bug on the existing code:
>
>
> if (!list_empty(sc)) {
> new = list_entry(sc->next, struct sample_queue, list);
> list_del(&new->list);
> } else if (os->sample_buffer) {
> new = os->sample_buffer + os->sample_buffer_idx;
> if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> os->sample_buffer = NULL;
> } else {
> os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> if (!os->sample_buffer)
> return -ENOMEM;
> list_add(&os->sample_buffer->list, &os->to_free);
> os->sample_buffer_idx = 2;
> new = os->sample_buffer + 1;
> }
>
> If we actually run out of buffer rooms, we should realloc right after and not
> wait for the next entry, otherwise we loose an event:
>
> if (!list_empty(sc)) {
> new = list_entry(sc->next, struct sample_queue, list);
> list_del(&new->list);
> } else {
> if (os->sample_buffer) {
> new = os->sample_buffer + os->sample_buffer_idx;
> if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> os->sample_buffer = NULL;
> }
>
> if (!os->sample_buffer) {
> os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> if (!os->sample_buffer)
> return -ENOMEM;
> list_add(&os->sample_buffer->list, &os->to_free);
> os->sample_buffer_idx = 2;
> new = os->sample_buffer + 1;
> }
>
>
> Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
> a function. But the idea is there.

Ok. That should be a separate patch. Are you going to submit that one?

David

2013-09-16 16:40:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On Sat, Sep 14, 2013 at 11:25:40AM -0600, David Ahern wrote:
> On 9/14/13 10:16 AM, Frederic Weisbecker wrote:
> >>@@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
> >>
> >> new->timestamp = timestamp;
> >> new->file_offset = file_offset;
> >>- new->event = event;
> >>+
> >>+ if (s->copy_on_queue) {
> >>+ new->event = malloc(event->header.size);
> >>+ memcpy(new->event, event, event->header.size);
> >>+ } else
> >>+ new->event = event;
>
> ---8<---
>
> >So do you think it should stay optional? This looks like a global problem, I mean
> >the event can be unmapped anytime for any builtin tool mapping it, right?
>
> Yes. I could make it the default behavior; just overhead in doing
> that (malloc/copy for each event).

Are there any tool that don't suffer from this bug somehow? If not then it must
be applied unconditionally.

>
> >
> >Also we already allocate the sample list node (struct sample_queue) from os->sample
> >buffer. ie: we have our own allocator there.
> >
> >Probably we should reuse that and include the copied event space in "struct sample_queue"?
>
>
> Right, that's where I put the malloc and copy - I kept the relevant
> change above. I take it you are thinking of something different but
> I am not following you. You definitely do NOT want to change struct
> sample_queue to include an event - like this:
>
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 51f5edf..866944a 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -491,7 +491,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
> struct sample_queue {
> u64 timestamp;
> u64 file_offset;
> - union perf_event *event;
> + union perf_event event;

Right that's roughly what I thought.

> struct list_head list;
> };
>
> size of event is determined by mmap_event (mmap2_event in latest
> code) which is > 4096 because of the filename argument. Including
> the event directly in sample_queue would balloon memory usage
> (learned this the hard way!).

Ah then perhaps we can allocate with the dynamic size of the event?

>
> >
> >Also looking at it now, it seems we have a bug on the existing code:
> >
> >
> > if (!list_empty(sc)) {
> > new = list_entry(sc->next, struct sample_queue, list);
> > list_del(&new->list);
> > } else if (os->sample_buffer) {
> > new = os->sample_buffer + os->sample_buffer_idx;
> > if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> > os->sample_buffer = NULL;
> > } else {
> > os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> > if (!os->sample_buffer)
> > return -ENOMEM;
> > list_add(&os->sample_buffer->list, &os->to_free);
> > os->sample_buffer_idx = 2;
> > new = os->sample_buffer + 1;
> > }
> >
> >If we actually run out of buffer rooms, we should realloc right after and not
> >wait for the next entry, otherwise we loose an event:
> >
> > if (!list_empty(sc)) {
> > new = list_entry(sc->next, struct sample_queue, list);
> > list_del(&new->list);
> > } else {
> > if (os->sample_buffer) {
> > new = os->sample_buffer + os->sample_buffer_idx;
> > if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> > os->sample_buffer = NULL;
> > }
> >
> > if (!os->sample_buffer) {
> > os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> > if (!os->sample_buffer)
> > return -ENOMEM;
> > list_add(&os->sample_buffer->list, &os->to_free);
> > os->sample_buffer_idx = 2;
> > new = os->sample_buffer + 1;
> > }
> >
> >
> >Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
> >a function. But the idea is there.
>
> Ok. That should be a separate patch. Are you going to submit that one?

Yeah, unless you beat me at it :)

2013-09-17 04:14:14

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On 9/16/13 10:40 AM, Frederic Weisbecker wrote:
>> Yes. I could make it the default behavior; just overhead in doing
>> that (malloc/copy for each event).
>
> Are there any tool that don't suffer from this bug somehow? If not then it must
> be applied unconditionally.

I believe only 'live' commands would be affected: top, kvm stat live, my
local scheduling daemon. perf-trace bypasses session and time ordering
(though I question that decision).

---8<---

>> size of event is determined by mmap_event (mmap2_event in latest
>> code) which is > 4096 because of the filename argument. Including
>> the event directly in sample_queue would balloon memory usage
>> (learned this the hard way!).
>
> Ah then perhaps we can allocate with the dynamic size of the event?

Yes, that's how I have it my patch -- allocate memory based on header
size and copy in the queue_event function. I think we agree on this part
now.

---8<---

>>> Although the mirrored os->sample_buffer condition check is a bit ugly and should move to
>>> a function. But the idea is there.
>>
>> Ok. That should be a separate patch. Are you going to submit that one?
>
> Yeah, unless you beat me at it :)

That's not going to happen on my end due to a recent time constraint
(jury duty - minimum 1 week trial). At best I can do a patch in early
October.

David

2013-10-02 12:19:01

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
> When processing events the session code has an ordered samples queue which is
> used to time-sort events coming in across multiple mmaps. At a later point in
> time samples on the queue are flushed up to some timestamp at which point the
> event is actually processed.
>
> When analyzing events live (ie., record/analysis path in the same command)
> there is a race that leads to corrupted events and parse errors which cause
> perf to terminate. The problem is that when the event is placed in the ordered
> samples queue it is only a reference to the event which is really sitting in
> the mmap buffer. Even though the event is queued for later processing the mmap
> tail pointer is updated which indicates to the kernel that the event has been
> processed. The race is flushing the event from the queue before it gets
> overwritten by some other event. For commands trying to process events live
> (versus just writing to a file) and processing a high rate of events this leads
> to parse failures and perf terminates.

hi,
'perf top -G dwarf' hit this due to longer callchains
processing your patch fixed that, thanks! ;-)

>
> Examples hitting this problem are 'perf kvm stat live', especially with nested
> VMs which generate 100,000+ traces per second, and a command processing
> scheduling events with a high rate of context switching -- e.g., running
> 'perf bench sched pipe'.
>
> This patch offers live commands an option to copy the event when it is placed in
> the ordered samples queue.
>

SNIP

>
> new->timestamp = timestamp;
> new->file_offset = file_offset;
> - new->event = event;
> +
> + if (s->copy_on_queue) {
> + new->event = malloc(event->header.size);
> + memcpy(new->event, event, event->header.size);

missing allocation error check

jirka

2013-10-02 12:38:47

by Jiri Olsa

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On Wed, Oct 02, 2013 at 02:18:21PM +0200, Jiri Olsa wrote:
> On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
> > When processing events the session code has an ordered samples queue which is
> > used to time-sort events coming in across multiple mmaps. At a later point in
> > time samples on the queue are flushed up to some timestamp at which point the
> > event is actually processed.
> >
> > When analyzing events live (ie., record/analysis path in the same command)
> > there is a race that leads to corrupted events and parse errors which cause
> > perf to terminate. The problem is that when the event is placed in the ordered
> > samples queue it is only a reference to the event which is really sitting in
> > the mmap buffer. Even though the event is queued for later processing the mmap
> > tail pointer is updated which indicates to the kernel that the event has been
> > processed. The race is flushing the event from the queue before it gets
> > overwritten by some other event. For commands trying to process events live
> > (versus just writing to a file) and processing a high rate of events this leads
> > to parse failures and perf terminates.
>
> hi,
> 'perf top -G dwarf' hit this due to longer callchains
> processing your patch fixed that, thanks! ;-)

ugh... now looking closer on top processing, it does not
seem to use queues ;-\ must be something else then..


>
> >
> > Examples hitting this problem are 'perf kvm stat live', especially with nested
> > VMs which generate 100,000+ traces per second, and a command processing
> > scheduling events with a high rate of context switching -- e.g., running
> > 'perf bench sched pipe'.
> >
> > This patch offers live commands an option to copy the event when it is
> > placed in
> > the ordered samples queue.

So I guess you have some other patch that actually sets
session::copy_on_queue?

jirka

2013-10-02 12:53:21

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On 10/2/13 6:38 AM, Jiri Olsa wrote:
>>> Examples hitting this problem are 'perf kvm stat live', especially with nested
>>> VMs which generate 100,000+ traces per second, and a command processing
>>> scheduling events with a high rate of context switching -- e.g., running
>>> 'perf bench sched pipe'.
>>>
>>> This patch offers live commands an option to copy the event when it is
>>> placed in
>>> the ordered samples queue.
>
> So I guess you have some other patch that actually sets
> session::copy_on_queue?

Yes, with this patch commands have to set session->copy_on_queue to true.

My latest perf-sched-daemon code actually solves this another way:
rather than asking perf-session to copy the events, the command itself
does and passes the copied event to the perf-session processing code.

I like this design better because the command itself controls the
allocation and free which the daemon needs because the events are added
to another queue so the flow is:

read from mmmap --> copy event --> pass to session code for time
ordering --> process time ordered events

The daemon puts the time ordered events into a time-limited queue and
only processes the event when requested.

If that is too confusing take a look at:

https://github.com/dsahern/linux/blob/perf-sched-timehist-3.11/tools/perf/schedmon.c

process_event is the handler for sample events coming out of the mmaps.
It allocates memory, copies the event and then calls
perf_session_queue_event on the copy.

David

2013-10-24 09:30:43

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

2013/9/14 Frederic Weisbecker <[email protected]>:
> On Fri, Sep 06, 2013 at 01:37:01PM -0600, David Ahern wrote:
>> When processing events the session code has an ordered samples queue which is
>> used to time-sort events coming in across multiple mmaps. At a later point in
>> time samples on the queue are flushed up to some timestamp at which point the
>> event is actually processed.
>>
>> When analyzing events live (ie., record/analysis path in the same command)
>> there is a race that leads to corrupted events and parse errors which cause
>> perf to terminate. The problem is that when the event is placed in the ordered
>> samples queue it is only a reference to the event which is really sitting in
>> the mmap buffer. Even though the event is queued for later processing the mmap
>> tail pointer is updated which indicates to the kernel that the event has been
>> processed. The race is flushing the event from the queue before it gets
>> overwritten by some other event. For commands trying to process events live
>> (versus just writing to a file) and processing a high rate of events this leads
>> to parse failures and perf terminates.
>>
>> Examples hitting this problem are 'perf kvm stat live', especially with nested
>> VMs which generate 100,000+ traces per second, and a command processing
>> scheduling events with a high rate of context switching -- e.g., running
>> 'perf bench sched pipe'.
>>
>> This patch offers live commands an option to copy the event when it is placed in
>> the ordered samples queue.
>>
>> Signed-off-by: David Ahern <[email protected]>
>> Cc: Frederic Weisbecker <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Jiri Olsa <[email protected]>
>> Cc: Mike Galbraith <[email protected]>
>> Cc: Namhyung Kim <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Stephane Eranian <[email protected]>
>> ---
>> tools/perf/util/session.c | 17 ++++++++++++++---
>> tools/perf/util/session.h | 1 +
>> 2 files changed, 15 insertions(+), 3 deletions(-)
>>
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 1b185ca..71f16db 100644
>> --- a/tools/perf/util/session.c
>> +++ b/tools/perf/util/session.c
>> @@ -483,6 +483,8 @@ static void perf_session_free_sample_buffers(struct perf_session *session)
>>
>> sq = list_entry(os->to_free.next, struct sample_queue, list);
>> list_del(&sq->list);
>> + if (session->copy_on_queue)
>> + free(sq->event);
>> free(sq);
>> }
>> }
>> @@ -513,11 +515,15 @@ static int flush_sample_queue(struct perf_session *s,
>> break;
>>
>> ret = perf_evlist__parse_sample(s->evlist, iter->event, &sample);
>> - if (ret)
>> + if (ret) {
>> pr_err("Can't parse sample, err = %d\n", ret);
>> - else {
>> + if (s->copy_on_queue)
>> + free(iter->event);
>> + } else {
>> ret = perf_session_deliver_event(s, iter->event, &sample, tool,
>> iter->file_offset);
>> + if (s->copy_on_queue)
>> + free(iter->event);
>> if (ret)
>> return ret;
>> }
>> @@ -676,7 +682,12 @@ int perf_session_queue_event(struct perf_session *s, union perf_event *event,
>>
>> new->timestamp = timestamp;
>> new->file_offset = file_offset;
>> - new->event = event;
>> +
>> + if (s->copy_on_queue) {
>> + new->event = malloc(event->header.size);
>> + memcpy(new->event, event, event->header.size);
>> + } else
>> + new->event = event;
>>
>> __queue_event(new, s);
>>
>> diff --git a/tools/perf/util/session.h b/tools/perf/util/session.h
>> index 3aa75fb..4adfcbb 100644
>> --- a/tools/perf/util/session.h
>> +++ b/tools/perf/util/session.h
>> @@ -38,6 +38,7 @@ struct perf_session {
>> bool fd_pipe;
>> bool repipe;
>> struct ordered_samples ordered_samples;
>> + bool copy_on_queue;
>
> So do you think it should stay optional? This looks like a global problem, I mean
> the event can be unmapped anytime for any builtin tool mapping it, right?
>
> Also we already allocate the sample list node (struct sample_queue) from os->sample
> buffer. ie: we have our own allocator there.
>
> Probably we should reuse that and include the copied event space in "struct sample_queue"?
>
> Also looking at it now, it seems we have a bug on the existing code:
>
>
> if (!list_empty(sc)) {
> new = list_entry(sc->next, struct sample_queue, list);
> list_del(&new->list);
> } else if (os->sample_buffer) {
> new = os->sample_buffer + os->sample_buffer_idx;
> if (++os->sample_buffer_idx == MAX_SAMPLE_BUFFER)
> os->sample_buffer = NULL;
> } else {
> os->sample_buffer = malloc(MAX_SAMPLE_BUFFER * sizeof(*new));
> if (!os->sample_buffer)
> return -ENOMEM;
> list_add(&os->sample_buffer->list, &os->to_free);
> os->sample_buffer_idx = 2;
> new = os->sample_buffer + 1;
> }
>
> If we actually run out of buffer rooms, we should realloc right after and not
> wait for the next entry, otherwise we loose an event:

Bah, checking that again, there don't seem to be a bug there. Actually
the sample buffer is reset after we pick the last entry. So it looks
all fine. I got confused as usual. Nevermind.

2013-10-24 10:23:42

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On 10/24/13 10:30 AM, Frederic Weisbecker wrote:
> Bah, checking that again, there don't seem to be a bug there. Actually
> the sample buffer is reset after we pick the last entry. So it looks
> all fine. I got confused as usual. Nevermind.

Ok. I had not come back to this thread since I decided on a different
route for the event copying. I'll take it out of my to-do list.

David

2013-10-24 12:27:55

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

Em Thu, Oct 24, 2013 at 11:23:32AM +0100, David Ahern escreveu:
> On 10/24/13 10:30 AM, Frederic Weisbecker wrote:
> >Bah, checking that again, there don't seem to be a bug there. Actually
> >the sample buffer is reset after we pick the last entry. So it looks
> >all fine. I got confused as usual. Nevermind.
>
> Ok. I had not come back to this thread since I decided on a
> different route for the event copying. I'll take it out of my to-do
> list.

Can you elaborate on that?

I had this feeling that perhaps we could defer copying the event till it
would be overwritten, something like making a range read only and then
when the event would be _really_ consumed the tooling would mark it as
so... Have to think about it more tho :-\

- Arnaldo

2013-10-24 13:12:33

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On 10/24/13 1:27 PM, Arnaldo Melo wrote:
> Em Thu, Oct 24, 2013 at 11:23:32AM +0100, David Ahern escreveu:
>> On 10/24/13 10:30 AM, Frederic Weisbecker wrote:
>>> Bah, checking that again, there don't seem to be a bug there. Actually
>>> the sample buffer is reset after we pick the last entry. So it looks
>>> all fine. I got confused as usual. Nevermind.
>>
>> Ok. I had not come back to this thread since I decided on a
>> different route for the event copying. I'll take it out of my to-do
>> list.
>
> Can you elaborate on that?

The driving use case is my perf-daemon:
https://github.com/dsahern/linux/blob/perf-sched-timehist-3.11/tools/perf/schedmon.c,
line 271.

Rather than have the perf infrastructure manage the allocation and
copies I decided to have the daemon do it. The session infrastructure is
only used for time sorting.

This works better because the event that pops out of the session
ordering is put onto another list and it gives the daemon control of
when memory is allocated and freed and the event only has to be copied once.

>
> I had this feeling that perhaps we could defer copying the event till it
> would be overwritten, something like making a range read only and then
> when the event would be _really_ consumed the tooling would mark it as
> so... Have to think about it more tho :-\

The above does not solve the problem for live tools where the event
actually lies in the ring buffer and could be overwritten while it sits
in the session ordered samples queue. The kvm-stat-live tool for nested
virtualization is a stress test for this case. I need to come back to
that problem, it just has not bubbled up to the top.

David

2013-10-24 14:07:39

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

Em Thu, Oct 24, 2013 at 02:12:29PM +0100, David Ahern escreveu:
> On 10/24/13 1:27 PM, Arnaldo Melo wrote:

> The driving use case is my perf-daemon:
> https://github.com/dsahern/linux/blob/perf-sched-timehist-3.11/tools/perf/schedmon.c,
> line 271.
>
> Rather than have the perf infrastructure manage the allocation and
> copies I decided to have the daemon do it. The session
> infrastructure is only used for time sorting.
>
> This works better because the event that pops out of the session
> ordering is put onto another list and it gives the daemon control of
> when memory is allocated and freed and the event only has to be
> copied once.

Ok, I see, but that could be useful for other tools, so at some point
what ends up being the best for your tool can be submitted upstream 8-)

> >
> >I had this feeling that perhaps we could defer copying the event till it
> >would be overwritten, something like making a range read only and then
> >when the event would be _really_ consumed the tooling would mark it as
> >so... Have to think about it more tho :-\
>
> The above does not solve the problem for live tools where the event
> actually lies in the ring buffer and could be overwritten while it
> sits in the session ordered samples queue. The kvm-stat-live tool
> for nested virtualization is a stress test for this case. I need to
> come back to that problem, it just has not bubbled up to the top.

What I thought was since we have to allocate memory _somewhere_ we could
size the ring buffer in such a way that it would be big enough for us to
leave those referenced entries there, not consumed, till we actually
consume them when reordering the events, and if we run out of space, the
kernel would just use LOST records, and not overwrite anything.

- Arnaldo

2013-10-25 16:04:19

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH] perf session: Add option to copy events when queueing

On 10/24/13 8:07 AM, Arnaldo Melo wrote:
> What I thought was since we have to allocate memory _somewhere_ we could
> size the ring buffer in such a way that it would be big enough for us to
> leave those referenced entries there, not consumed, till we actually
> consume them when reordering the events, and if we run out of space, the
> kernel would just use LOST records, and not overwrite anything.

The tail pointer should only be moved up once all the events have been
flushed from the ordered samples queue. Right now the tail is moved up
when the event is read from the mmap (the patch in the other email
thread withstanding). The problem is that we do not have any linkage
between the event and the mmap that it came from; that information is
lost when the event is added to the ordered samples queue.

David