We parse the event while it's read from mmap buffer in
write_event(), but sometimes the event will straddles the
mmap boundary, we should handle this case
And if we record events(such as perf kmem/sched) for long
times, Ctrl + C can't interrupt it just for this reason
Signed-off-by: Xiao Guangrong <[email protected]>
---
tools/perf/builtin-record.c | 62 +++++++++++++++++++++++++++++++++---------
1 files changed, 48 insertions(+), 14 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 7bb9ca1..97573b7 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -111,28 +111,62 @@ static void write_output(void *buf, size_t size)
}
}
-static void write_event(event_t *buf, size_t size)
+static void write_event(void *buf, size_t size)
{
- size_t processed_size = buf->header.size;
- event_t *ev = buf;
+ event_t *ev;
+ struct mmap_event mmap_ev;
+ static void *rec_buf;
+ static unsigned long rec_len;
- do {
- /*
- * Add it to the list of DSOs, so that when we finish this
- * record session we can pick the available build-ids.
- */
+ write_output(buf, size);
+
+ while (size > 0) {
+ if (rec_buf) {
+ int len;
+
+ memcpy(&mmap_ev, rec_buf, rec_len);
+ len = sizeof(mmap_ev.header) - rec_len;
+ if (len > 0) {
+ memcpy(&mmap_ev + rec_len, buf, len);
+ buf += len;
+ size -= len;
+ rec_len = sizeof((mmap_ev.header));
+ }
+
+ len = mmap_ev.header.size - rec_len;
+ if (mmap_ev.header.type == PERF_RECORD_MMAP)
+ memcpy(&mmap_ev + rec_len, buf, len);
+
+ buf += len;
+ size -= len;
+
+ rec_buf = NULL;
+ rec_len = 0;
+
+ ev = (event_t *)&mmap_ev;
+ goto handle_mmap;
+ }
+
+ ev = (event_t *)buf;
+ if (size < sizeof(struct perf_event_header) ||
+ ev->header.size > size) {
+ rec_buf = buf;
+ rec_len = size;
+ break;
+ }
+
+ buf += ev->header.size;
+ size -= ev->header.size;
+
+handle_mmap:
if (ev->header.type == PERF_RECORD_MMAP) {
struct list_head *head = &dsos__user;
if (ev->header.misc == 1)
head = &dsos__kernel;
__dsos__findnew(head, ev->mmap.filename);
}
+ }
- ev = ((void *)ev) + ev->header.size;
- processed_size += ev->header.size;
- } while (processed_size < size);
-
- write_output(buf, size);
}
static int process_synthesized_event(event_t *event,
--
1.6.1.2
On Wed, Jan 20, 2010 at 09:34:15PM +0800, Xiao Guangrong wrote:
> We parse the event while it's read from mmap buffer in
> write_event(), but sometimes the event will straddles the
> mmap boundary, we should handle this case
>
> And if we record events(such as perf kmem/sched) for long
> times, Ctrl + C can't interrupt it just for this reason
>
> Signed-off-by: Xiao Guangrong <[email protected]>
I've seen this bug recently, perf record was hanging because of
events of size 0. Must be the same bug. Thanks!
> ---
> tools/perf/builtin-record.c | 62 +++++++++++++++++++++++++++++++++---------
> 1 files changed, 48 insertions(+), 14 deletions(-)
>
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 7bb9ca1..97573b7 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -111,28 +111,62 @@ static void write_output(void *buf, size_t size)
> }
> }
>
> -static void write_event(event_t *buf, size_t size)
> +static void write_event(void *buf, size_t size)
> {
> - size_t processed_size = buf->header.size;
> - event_t *ev = buf;
> + event_t *ev;
> + struct mmap_event mmap_ev;
> + static void *rec_buf;
> + static unsigned long rec_len;
>
> - do {
> - /*
> - * Add it to the list of DSOs, so that when we finish this
> - * record session we can pick the available build-ids.
> - */
> + write_output(buf, size);
> +
> + while (size > 0) {
> + if (rec_buf) {
> + int len;
> +
> + memcpy(&mmap_ev, rec_buf, rec_len);
> + len = sizeof(mmap_ev.header) - rec_len;
> + if (len > 0) {
> + memcpy(&mmap_ev + rec_len, buf, len);
> + buf += len;
> + size -= len;
> + rec_len = sizeof((mmap_ev.header));
> + }
> +
> + len = mmap_ev.header.size - rec_len;
> + if (mmap_ev.header.type == PERF_RECORD_MMAP)
> + memcpy(&mmap_ev + rec_len, buf, len);
> +
> + buf += len;
> + size -= len;
> +
> + rec_buf = NULL;
> + rec_len = 0;
> +
> + ev = (event_t *)&mmap_ev;
> + goto handle_mmap;
> + }
> +
> + ev = (event_t *)buf;
> + if (size < sizeof(struct perf_event_header) ||
> + ev->header.size > size) {
> + rec_buf = buf;
> + rec_len = size;
> + break;
> + }
> +
> + buf += ev->header.size;
> + size -= ev->header.size;
> +
> +handle_mmap:
> if (ev->header.type == PERF_RECORD_MMAP) {
> struct list_head *head = &dsos__user;
> if (ev->header.misc == 1)
> head = &dsos__kernel;
> __dsos__findnew(head, ev->mmap.filename);
> }
> + }
>
> - ev = ((void *)ev) + ev->header.size;
> - processed_size += ev->header.size;
> - } while (processed_size < size);
> -
> - write_output(buf, size);
> }
>
> static int process_synthesized_event(event_t *event,
> --
> 1.6.1.2
>
On Wed, 2010-01-20 at 21:34 +0800, Xiao Guangrong wrote:
> We parse the event while it's read from mmap buffer in
> write_event(), but sometimes the event will straddles the
> mmap boundary, we should handle this case
>
> And if we record events(such as perf kmem/sched) for long
> times, Ctrl + C can't interrupt it just for this reason
Uhm, how why? it didn't used to know about events and just copied the
data.
On Wed, 2010-01-20 at 14:41 +0100, Peter Zijlstra wrote:
> On Wed, 2010-01-20 at 21:34 +0800, Xiao Guangrong wrote:
> > We parse the event while it's read from mmap buffer in
> > write_event(), but sometimes the event will straddles the
> > mmap boundary, we should handle this case
> >
> > And if we record events(such as perf kmem/sched) for long
> > times, Ctrl + C can't interrupt it just for this reason
>
> Uhm, how why? it didn't used to know about events and just copied the
> data.
looks like acme wrecked it in f5a2c3dc.. anyway the fix is wrong, record
should not know or care about the actual events and simply write data
out.
Em Wed, Jan 20, 2010 at 02:44:56PM +0100, Peter Zijlstra escreveu:
> On Wed, 2010-01-20 at 14:41 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-01-20 at 21:34 +0800, Xiao Guangrong wrote:
> > > We parse the event while it's read from mmap buffer in
> > > write_event(), but sometimes the event will straddles the
> > > mmap boundary, we should handle this case
> > >
> > > And if we record events(such as perf kmem/sched) for long
> > > times, Ctrl + C can't interrupt it just for this reason
> >
> > Uhm, how why? it didn't used to know about events and just copied the
> > data.
>
> looks like acme wrecked it in f5a2c3dc.. anyway the fix is wrong, record
> should not know or care about the actual events and simply write data
> out.
Oh well, I guess then we should do that after record finishes,
reprocessing all the data in the file.
- Arnaldo
On Wed, 2010-01-20 at 12:09 -0200, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jan 20, 2010 at 02:44:56PM +0100, Peter Zijlstra escreveu:
> > On Wed, 2010-01-20 at 14:41 +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-01-20 at 21:34 +0800, Xiao Guangrong wrote:
> > > > We parse the event while it's read from mmap buffer in
> > > > write_event(), but sometimes the event will straddles the
> > > > mmap boundary, we should handle this case
> > > >
> > > > And if we record events(such as perf kmem/sched) for long
> > > > times, Ctrl + C can't interrupt it just for this reason
> > >
> > > Uhm, how why? it didn't used to know about events and just copied the
> > > data.
> >
> > looks like acme wrecked it in f5a2c3dc.. anyway the fix is wrong, record
> > should not know or care about the actual events and simply write data
> > out.
>
> Oh well, I guess then we should do that after record finishes,
> reprocessing all the data in the file.
Why do we need it at all?
Can't report/archive/etc.. sort all this out?
Em Wed, Jan 20, 2010 at 03:14:57PM +0100, Peter Zijlstra escreveu:
> On Wed, 2010-01-20 at 12:09 -0200, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Jan 20, 2010 at 02:44:56PM +0100, Peter Zijlstra escreveu:
> > > On Wed, 2010-01-20 at 14:41 +0100, Peter Zijlstra wrote:
> > > > Uhm, how why? it didn't used to know about events and just copied the
> > > > data.
> > >
> > > looks like acme wrecked it in f5a2c3dc.. anyway the fix is wrong, record
> > > should not know or care about the actual events and simply write data
> > > out.
> >
> > Oh well, I guess then we should do that after record finishes,
> > reprocessing all the data in the file.
>
> Why do we need it at all?
>
> Can't report/archive/etc.. sort all this out?
We need to record the buildids at record time, preferrably at
PERF_RECORD_MMAP time, from the kernel, for long running, low frequency
sessions were DSOs may be updated.
Or for when a developer is updating his binary and doing perf record
runs to then use perf diff:
perf record ./myapp
vi myapp.c
make
perf record ./myapp
perf diff
Works now because the first 'perf record' saved a copy of myapp so that
'perf diff', finding the right build-id in ~/.debug/.build-id/ can do
the right thing.
But till we send the buildid in an extended PERF_RECORD_MMAP event sent
from the kernel, I try to reduce the window by doing this at record
time.
The current way of intercepting events at write_event() is deemed
broken, so I can do it atexit(), when we go to write the header, but
this will hurt because we'll have to reprocess it all.
Yes, since we'd still have the window, we could do it on perf archive,
but it would be an extra explicit step and since I think the right place
to do this is at perf record (with the kernel injecting the buildid, if
present in the binary being loaded), I'd prefer to keep it that way.
Then work on having the buildid stored in some suitable kernel data
structure and injected in the PERF_RECORD_MMAP event.
- Arnaldo
On Wed, 2010-01-20 at 12:26 -0200, Arnaldo Carvalho de Melo wrote:
> > Can't report/archive/etc.. sort all this out?
>
> We need to record the buildids at record time, preferrably at
> PERF_RECORD_MMAP time, from the kernel, for long running, low frequency
> sessions were DSOs may be updated.
>
> Or for when a developer is updating his binary and doing perf record
> runs to then use perf diff:
>
> perf record ./myapp
> vi myapp.c
> make
> perf record ./myapp
> perf diff
> Works now because the first 'perf record' saved a copy of myapp so that
> 'perf diff', finding the right build-id in ~/.debug/.build-id/ can do
> the right thing.
>
> But till we send the buildid in an extended PERF_RECORD_MMAP event sent
> from the kernel, I try to reduce the window by doing this at record
> time.
>
> The current way of intercepting events at write_event() is deemed
> broken, so I can do it atexit(), when we go to write the header, but
> this will hurt because we'll have to reprocess it all.
>
> Yes, since we'd still have the window, we could do it on perf archive,
> but it would be an extra explicit step and since I think the right place
> to do this is at perf record (with the kernel injecting the buildid, if
> present in the binary being loaded), I'd prefer to keep it that way.
>
> Then work on having the buildid stored in some suitable kernel data
> structure and injected in the PERF_RECORD_MMAP event.
OK, so my typical workflow is something like:
perf record -o baseline ./foo
# again:
vi foo.c
make
perf record ./foo
perf diff -i baseline -i perf.data
# if good enough, quit, else goto again
I don't find intermediate diff's all that interesting.
So we could create a perf snapshot like command that collects everything
(and maybe allow something like: perf record --snapshot), and have perf
diff compare against that by default, instead of perf.data~ or whatever
that was called again.
As to buildids from kernel space, I don't think the kernel actually
knows or cares enough about executable maps to be able to extract those.
The initial ELF loader could (maybe) for the main executable, but all
DSOs and esp stuff like dlopen() would become mightly tricky.
On Wed, 2010-01-20 at 15:14 +0100, Peter Zijlstra wrote:
> > > > Uhm, how why? it didn't used to know about events and just copied the
> > > > data.
> > >
> > > looks like acme wrecked it in f5a2c3dc.. anyway the fix is wrong, record
> > > should not know or care about the actual events and simply write data
> > > out.
> >
> > Oh well, I guess then we should do that after record finishes,
> > reprocessing all the data in the file.
>
> Why do we need it at all?
To clarify, we want to keep the record cycle as small/fast as possible
in order to minimize permutation of the system we're recording.
Em Wed, Jan 20, 2010 at 03:43:30PM +0100, Peter Zijlstra escreveu:
> On Wed, 2010-01-20 at 15:14 +0100, Peter Zijlstra wrote:
>
> > > > > Uhm, how why? it didn't used to know about events and just copied the
> > > > > data.
> > > >
> > > > looks like acme wrecked it in f5a2c3dc.. anyway the fix is wrong, record
> > > > should not know or care about the actual events and simply write data
> > > > out.
> > >
> > > Oh well, I guess then we should do that after record finishes,
> > > reprocessing all the data in the file.
> >
> > Why do we need it at all?
>
> To clarify, we want to keep the record cycle as small/fast as possible
> in order to minimize permutation of the system we're recording.
This is understood, its the whole reason for us to just record the
build-ids and not try and go reading all the symtabs that appeared on
perf.data.
nstead we just record 20 byte cookies that later will be looked up
somewhere, be it the cache that perf record creates or -debuginfo
packages, or the binary on the fs, whatever matches the buildid and thus
guarantees that we're not doing bogus symbol lookups.
I.e. always trying to defer everything that is possible to post
processing time, at perf report.
- Arnaldo
Em Wed, Jan 20, 2010 at 03:37:52PM +0100, Peter Zijlstra escreveu:
> On Wed, 2010-01-20 at 12:26 -0200, Arnaldo Carvalho de Melo wrote:
>
> > > Can't report/archive/etc.. sort all this out?
> >
> > We need to record the buildids at record time, preferrably at
> > PERF_RECORD_MMAP time, from the kernel, for long running, low frequency
> > sessions were DSOs may be updated.
> >
> > Or for when a developer is updating his binary and doing perf record
> > runs to then use perf diff:
> >
> > perf record ./myapp
> > vi myapp.c
> > make
> > perf record ./myapp
> > perf diff
>
> > Works now because the first 'perf record' saved a copy of myapp so that
> > 'perf diff', finding the right build-id in ~/.debug/.build-id/ can do
> > the right thing.
> >
> > But till we send the buildid in an extended PERF_RECORD_MMAP event sent
> > from the kernel, I try to reduce the window by doing this at record
> > time.
> >
> > The current way of intercepting events at write_event() is deemed
> > broken, so I can do it atexit(), when we go to write the header, but
> > this will hurt because we'll have to reprocess it all.
> >
> > Yes, since we'd still have the window, we could do it on perf archive,
> > but it would be an extra explicit step and since I think the right place
> > to do this is at perf record (with the kernel injecting the buildid, if
> > present in the binary being loaded), I'd prefer to keep it that way.
> >
> > Then work on having the buildid stored in some suitable kernel data
> > structure and injected in the PERF_RECORD_MMAP event.
>
> OK, so my typical workflow is something like:
>
> perf record -o baseline ./foo
>
> # again:
>
> vi foo.c
> make
> perf record ./foo
>
> perf diff -i baseline -i perf.data
>
> # if good enough, quit, else goto again
>
> I don't find intermediate diff's all that interesting.
Well, I plan to write 'perf diff --plot N' and have it look at the N
last perf.data files (perf.data, perf.data.N, perf.data.(N - 1)), that
may be per tag (v2.6.30, v2.6.31, v2.6.32, v2.6.33) or per
changeset to show if we are regressing over time.
Like people do from time to time to find regressions on subsystems they
don't pay that much attention.
> So we could create a perf snapshot like command that collects everything
> (and maybe allow something like: perf record --snapshot), and have perf
> diff compare against that by default, instead of perf.data~ or whatever
> that was called again.
Oh, --baseline for 'perf record' to set a baseline file that then would
by default be used and compared against the latest perf.data file
collected seems useful, that way it would be just:
perf record --baseline ./myapp
repeat:
vi myapp.c
make
perf record ./myapp
perf diff
goto repeat;
> As to buildids from kernel space, I don't think the kernel actually
> knows or cares enough about executable maps to be able to extract those.
> The initial ELF loader could (maybe) for the main executable, but all
> DSOs and esp stuff like dlopen() would become mightly tricky.
Having a way to unequivocally connect the DSO that appeared on a
PERF_RECORD_MMAP to a file where a symtab is still seems a requirement
for me.
I'll try to figure out a way to do it cheaply and non-intrusively if at
all possible.
- Arnaldo
* Xiao Guangrong <[email protected]> wrote:
> We parse the event while it's read from mmap buffer in
> write_event(), but sometimes the event will straddles the
> mmap boundary, we should handle this case
>
> And if we record events(such as perf kmem/sched) for long
> times, Ctrl + C can't interrupt it just for this reason
>
> Signed-off-by: Xiao Guangrong <[email protected]>
> ---
> tools/perf/builtin-record.c | 62 +++++++++++++++++++++++++++++++++---------
Note, i've reverted the original instead, via:
a8e6f73: Revert "perf record: Intercept all events"
As Peter said it, it is fundamentally wrong for perf record to touch the data
stream in the common case. Arnaldo, cannot we use some other method for perf
archive's needs? Worst-case we can do an optional parsing of the data stream
although we should first try hard to find a generic method ...
Ingo
On Sun, 2010-01-31 at 09:32 +0100, Ingo Molnar wrote:
> * Xiao Guangrong <[email protected]> wrote:
>
> > We parse the event while it's read from mmap buffer in
> > write_event(), but sometimes the event will straddles the
> > mmap boundary, we should handle this case
> >
> > And if we record events(such as perf kmem/sched) for long
> > times, Ctrl + C can't interrupt it just for this reason
> >
> > Signed-off-by: Xiao Guangrong <[email protected]>
> > ---
> > tools/perf/builtin-record.c | 62 +++++++++++++++++++++++++++++++++---------
>
> Note, i've reverted the original instead, via:
>
> a8e6f73: Revert "perf record: Intercept all events"
>
> As Peter said it, it is fundamentally wrong for perf record to touch the data
> stream in the common case. Arnaldo, cannot we use some other method for perf
> archive's needs? Worst-case we can do an optional parsing of the data stream
> although we should first try hard to find a generic method ...
Arnaldo posted patches for this last friday somewhere, it might turn out
you already found them (reading mail by date order here).