2010-12-21 02:44:03

by David Sharp

Subject: ftrace: trace_pipe_raw interface broken

trace_pipe_raw doesn't work correctly, I think because
tracing_buffers_read() uses ring_buffer_read_page() incorrectly.

In particular, if you read less than one page at a time, then you get
much more data than expected from the ring buffer:

$ ftrace=/sys/kernel/debug/tracing
$ echo 0 > $ftrace/tracing_enabled
$ echo 1 > $ftrace/events/enable
$ echo > $ftrace/trace # reset
$ echo 100 > $ftrace/buffer_size_kb
$ # generate lots of events:
$ echo 1 > $ftrace/tracing_enabled && ls -laR /proc/ > /dev/null 2>&1
&& echo 0 > $ftrace/tracing_enabled
$ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
675840

That's significantly more data than the buffer could possibly contain.
Further shrinking bs increases the amount of data output even more.

Even when reading >=4096 bytes, something is funny:

$ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
118784

That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
pages, or 104kb-worth of pages. 116kB is about 28 pages.

Looking at the code for tracing_buffers_read() and
ring_buffer_read_page(), there seems to be a disconnect.

ring_buffer_read_page can consume less than one page if it was in the
middle of the buffer, such as would happen after userspace only asks
for 1k. It returns the offset of the start of what it consumed,
and the end is put in the returned bpage->commit. However,
ring_buffer_data_page is not exposed outside of ring_buffer.c.
Furthermore, tracing_buffers_read ignores the return value, except to
check for error.

Basically, I think tracing_buffers_read was forgotten when the ring
buffer got headers in the buffer pages. It looks like it still expects
whole pages to have valid tracing data.

I was working on a fix, but, how is trace_pipe_raw supposed to work?
I'm not sure if it's supposed to output full pages including the page
headers, or just data. I'm guessing the latter?

Thanks,
d#


2010-12-21 03:52:53

by Steven Rostedt

Subject: Re: ftrace: trace_pipe_raw interface broken

Hi David,

Thanks for the report.

On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
> trace_pipe_raw doesn't work correctly, I think because
> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>
> In particular, if you read less than one page at a time, then you get
> much more data than expected from the ring buffer:
>
> $ ftrace=/sys/kernel/debug/tracing
> $ echo 0 > $ftrace/tracing_enabled
> $ echo 1 > $ftrace/events/enable
> $ echo > $ftrace/trace # reset
> $ echo 100 > $ftrace/buffer_size_kb
> $ # generate lots of events:
> $ echo 1 > $ftrace/tracing_enabled && ls -laR /proc/ > /dev/null 2>&1
> && echo 0 > $ftrace/tracing_enabled
> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
> 675840
>
> That's significantly more data than the buffer could possibly contain.
> Further shrinking bs increases the amount of data output even more.

Yep, that's a side effect. What happens is that we always read a "full
page" even if we do not read in page-size chunks. If we read 100 bytes,
it will read a header and all the events that fit inside those 100
bytes, and then return a full page: the header, the 100 bytes minus the
header size, and the rest of the page filled with zeros.

Do a:

hexdump -C outputfile

and you'll see a lot of "zeros". The reason for this is to keep all
reads in page-size chunks. If you read only 100 bytes of data, I still
need to keep that page aligned in the output. The reason I do this is in
case there are two readers. I don't want one to read part of a page that
includes the header, and then have the other read the rest of the page.
The header tells you how much data is on that page.
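
If it helps, here's a rough, untested userspace sketch of what a reader
of trace_pipe_raw ends up doing with those page-size chunks (it assumes
a 64-bit kernel, so a 16-byte page header; it's only an illustration,
not what the kernel does internally):

/* Untested sketch: read page-size chunks from trace_pipe_raw and print
 * how much event data each page holds according to its header
 * (8-byte timestamp followed by an 8-byte commit/length word).
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096

int main(void)
{
	unsigned char page[PAGE_SZ];
	FILE *f = fopen("/sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe_raw", "rb");

	if (!f)
		return 1;
	while (fread(page, 1, PAGE_SZ, f) == PAGE_SZ) {
		uint64_t ts, commit;

		memcpy(&ts, page, sizeof(ts));             /* page timestamp  */
		memcpy(&commit, page + 8, sizeof(commit)); /* bytes of data   */
		printf("ts=%llu commit=%llu\n",
		       (unsigned long long)ts, (unsigned long long)commit);
	}
	fclose(f);
	return 0;
}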

I guess I could have had the info->spare just read the full page, and
then other readers would just read the next page.



>
> Even when reading >=4096 bytes, something is funny:
>
> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
> 118784
>
> That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
> pages, or 104kb-worth of pages. 116kB is about 28 pages.

Hmm, isn't the page data 4080, not 4088, since it has two 8-byte items
in it: the timestamp and the atomic commit count (on 64-bit).
4096 - 16 = 4080. But still, the numbers are not much different.
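
For reference, the header at the front of each returned page is, from
memory, just this (struct buffer_data_page in kernel/trace/ring_buffer.c;
on 64-bit it takes 16 bytes before the data):

struct buffer_data_page {
	u64		time_stamp;	/* page time stamp           */
	local_t		commit;		/* bytes of data on the page */
	unsigned char	data[];		/* events start here         */
};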

Note, when I did it, I got:

$ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
27+0 records in
27+0 records out
110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s

$ cat /debug/tracing/buffer_size_kb
100

Ah, running hexdump -C, I got this at the end:

00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW.."Ñ..ÿÿÿÿvÑ..|
00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |Ø2¶~íS..$.......|
0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |š6......Ê...ÿÿÿÿ|
0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....jÑ..ÿÿÿÿh&..|
0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0001b000

I don't know why it added that last page. I'll have to look into this.



> Looking at the code for tracing_buffers_read() and
> ring_buffer_read_page(), there seems to be a disconnect.
>
> ring_buffer_read_page can consume less than one page if it was in the
> middle of the buffer, such as when would happen after userspace only
> asks for 1k. It returns the offset of the start of what it cosumed,

Actually, we don't do that anymore. We either return < 0 on error or
zero. We always add to the 0 offset. The comment needs to be fixed.

> and the end is put in the returned bpage->commit. However,
> ring_buffer_data_page is not exposed outside of ring_buffer.c.
> Furthermore, tracing_buffers_read ignores the return value, except to
> check for error.
>
> Basically, I think tracing_buffers_read was forgotten when the ring
> buffer got headers in the buffer pages. It looks like it still expects
> whole pages to have valid tracing data.

Yep, it does, and ring_buffer_read_page() zero-fills the rest of the
page.


>
> I was working on a fix, but, how is trace_pipe_raw supposed to work?
> I'm not sure if it's supposed to output full pages including the page
> headers, or just data. I'm guessing the latter?

Nope, the former, because we need it to be full pages.

-- Steve

2010-12-21 06:00:13

by David Sharp

Subject: Re: ftrace: trace_pipe_raw interface broken

On Mon, Dec 20, 2010 at 7:52 PM, Steven Rostedt <[email protected]> wrote:
> Hi David,
>
> Thanks for the report.
>
> On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
>> trace_pipe_raw doesn't work correctly, I think because
>> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>>
>> In particular, if you read less than one page at a time, then you get
>> much more data than expected from the ring buffer:
>>
>> $ ftrace=/sys/kernel/debug/tracing
>> $ echo 0 > $ftrace/tracing_enabled
>> $ echo 1 > $ftrace/events/enable
>> $ echo > $ftrace/trace  # reset
>> $ echo 100 > $ftrace/buffer_size_kb
>> $  # generate lots of events:
>> $ echo 1 > $ftrace/tracing_enabled &&  ls -laR /proc/ > /dev/null 2>&1
>> && echo 0 > $ftrace/tracing_enabled
>> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>> 675840
>>
>> That's significantly more data than the buffer could possibly contain.
>> Further shrinking bs increases the amount of data output even more.
>
> Yep that's a side effect. What happens is that we always read a "full
> page" even if we do not read in page size chunks. If we read 100 bytes,
> it will read a header, and all the events that fit inside that 100
> bytes, and then return a full page with the header, the 100 bytes -
> header size, and the rest of the page filled with zeros.
>
> Do a:
>
>        hexdump -C outputfile
>
> and you'll see a lot of "zeros". The reason for this is to keep all
> reads in pagesize chunks. If you read only 100 bytes of data, I still
> need to keep that page aligned in the output. The reason I do this is in
> case there's two readers. I don't want one to read part of a page that
> includes the header, and then the other reads the rest of the page. The
> header tells you how much data is on that page.

aha, I see. My assumption that the page headers weren't a needed part
of the trace output was throwing off my analysis.

> I guess I could have had the info->spare just read the full page, and
> then other readers would just read the next page.

I don't think multiple readers are a use case we're concerned about, so
either way is fine as far as I'm concerned.

>> Even when reading >=4096 bytes, something is funny:
>>
>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>> 118784
>>
>> That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
>> pages, or 104kb-worth of pages. 116kB is about 28 pages.
>
> Hmm, isn't page data 4080, not 4088, since it has 2, 8 byte items in it.
> Timestamp and atomic (on 64bit). 4096 - 16 = 4080. But still the numbers
> are not much different.

oops, yeah. I rounded 2*8 = 8 ;)

> Note, when I did it, I got:
>
> $ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
> 27+0 records in
> 27+0 records out
> 110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s
>
> $ cat /debug/tracing/buffer_size_kb
> 100
>
> Ah, running hexdump -C, I got this at the end:
>
> 00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW.."Ñ..ÿÿÿÿvÑ..|
> 00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
> 00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |Ø2¶~íS..$.......|
> 0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |š6......Ê...ÿÿÿÿ|
> 0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....jÑ..ÿÿÿÿh&..|
> 0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
> 0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 0001b000
>
> I don't know why it added that last page. I'll have to look into this.

Thanks for taking a look!

I'm actually seeing up to 3 extra pages (29 records) frequently,
although it varies. I even just got 30 pages.

aha, check this out though, looking at all the page headers:

$ hexdump -C /tmp/trace_raw | grep '^[0-9a-f]*000 '
00000000 7d 9a 19 9a fd 01 00 00 74 0d 00 00 00 00 00 00 |}...�...t.......|
00001000 12 22 4a 71 01 02 00 00 38 00 00 00 00 00 00 00 |."Jq....8.......|
00002000 62 99 60 d7 01 02 00 00 b8 0f 00 c0 ff ff ff ff |b.`�....�..�����|
# negative commit count? or, notice this is the third page.
# Maybe there's garbage at the beginning, rather than the end.
00003000 e4 cb e2 dc 01 02 00 00 f0 0f 00 00 00 00 00 00 |����....�.......|
# <snip>
00019000 7a 79 01 ef 01 02 00 00 dc 0f 00 00 00 00 00 00 |zy.�....�.......|
0001a000 c7 51 05 ef 01 02 00 00 c0 0f 00 00 00 00 00 00 |�Q.�....�.......|
0001b000 af fe 1b ef 01 02 00 00 34 08 00 00 00 00 00 00 |��.�....4.......|
0001c000 49 f3 1f ef 01 02 00 00 2c 00 00 00 00 00 00 00 |I�.�....,.......|

Interestingly, they all have valid-looking timestamps (in that they
are monotonic), even before the negative commit count.

Here's the one where I got 30 pages:

00000000 1d f5 1f ef 01 02 00 00 68 07 00 00 00 00 00 00 |.�.�....h.......|
00000770 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 bf 04 59 1f 4a 02 00 00 28 00 00 00 00 00 00 00 |�.Y.J...(.......|
00001030 00 00 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 |....����........|
00001040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00002000 6b 0b 59 1f 4a 02 00 00 28 00 00 00 00 00 00 00 |k.Y.J...(.......|
00002030 ff ff ff ff 1d 00 00 00 00 00 00 00 00 00 00 00 |����............|
00002040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003000 df b5 6d 68 4a 02 00 00 d4 0f 00 c0 ff ff ff ff |ߵmhJ...�..�����|
# negative commit count in the 4th page this time. I see a pattern here...
# also, lots of zeroes in the first 3 pages.
00004000 94 9b be 6c 4a 02 00 00 d4 0f 00 00 00 00 00 00 |..�lJ...�.......|
# <snip>
00019000 88 69 4f 91 4a 02 00 00 cc 0f 00 00 00 00 00 00 |.iO.J...�.......|
0001a000 6f 4d ab 97 4a 02 00 00 e4 0f 00 00 00 00 00 00 |oM�.J...�.......|
0001b000 4c 98 2d 9e 4a 02 00 00 d4 0f 00 00 00 00 00 00 |L.-.J...�.......|
0001c000 f0 92 57 a3 4a 02 00 00 6c 0e 00 00 00 00 00 00 |�.W�J...l.......|
0001d000 6e f9 24 a4 4a 02 00 00 2c 00 00 00 00 00 00 00 |n�$�J...,.......|

I don't know what this means yet. Is a negative commit count ever
expected? ring_buffer_read_page does add 1<<30 and 1<<31 when there
are events missing, but I don't think that would result in these bit
patterns.
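
Just to be concrete, here's the throwaway decoder I've been using to
eyeball these commit words (userspace only; it simply assumes the
length is in the low bits and the flags are bits 30/31, the sample
values are taken from the dumps above):

/* Split a raw 64-bit "commit" word from a page header into the data
 * length and the two missed-events flag bits (30 and 31).
 */
#include <stdio.h>
#include <stdint.h>

static void decode(uint64_t commit)
{
	printf("raw=%#018llx len=%#llx missed=%d stored=%d\n",
	       (unsigned long long)commit,
	       (unsigned long long)(commit & ((1ULL << 30) - 1)),
	       !!(commit & (1ULL << 31)),
	       !!(commit & (1ULL << 30)));
}

int main(void)
{
	decode(0x0000000000000d74ULL);	/* the first page header above   */
	decode(0xffffffffc0000fb8ULL);	/* the "negative" one above      */
	return 0;
}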

That's enough for today. I'll investigate more tomorrow.

>> Looking at the code for tracing_buffers_read() and
>> ring_buffer_read_page(), there seems to be a disconnect.
>>
>> ring_buffer_read_page can consume less than one page if it was in the
>> middle of the buffer, such as when would happen after userspace only
>> asks for 1k. It returns the offset of the start of what it cosumed,
>
> Actually, we don't do that anymore. We either return < 0 on error or
> zero. We always add to the 0 offset. The comment needs to be fixed.
>
>> and the end is put in the returned bpage->commit. However,
>> ring_buffer_data_page is not exposed outside of ring_buffer.c.
>> Furthermore, tracing_buffers_read ignores the return value, except to
>> check for error.
>>
>> Basically, I think tracing_buffers_read was forgotten when the ring
>> buffer got headers in the buffer pages. It looks like it still expects
>> whole pages to have valid tracing data.
>
> Yep, it does, and ring_buffer_read_page() adds zero to the rest of the
> page.
>
>
>>
>> I was working on a fix, but, how is trace_pipe_raw supposed to work?
>> I'm not sure if it's supposed to output full pages including the page
>> headers, or just data. I'm guessing the latter?
>
> Nope, the former, because we need it to be full pages.
>
> -- Steve
>
>
>

2010-12-22 23:23:53

by David Sharp

Subject: Re: ftrace: trace_pipe_raw interface broken

On Mon, Dec 20, 2010 at 9:59 PM, David Sharp <[email protected]> wrote:
> On Mon, Dec 20, 2010 at 7:52 PM, Steven Rostedt <[email protected]> wrote:
>> Hi David,
>>
>> Thanks for the report.
>>
>> On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
>>> trace_pipe_raw doesn't work correctly, I think because
>>> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>>>
>>> In particular, if you read less than one page at a time, then you get
>>> much more data than expected from the ring buffer:
>>>
>>> $ ftrace=/sys/kernel/debug/tracing
>>> $ echo 0 > $ftrace/tracing_enabled
>>> $ echo 1 > $ftrace/events/enable
>>> $ echo > $ftrace/trace  # reset
>>> $ echo 100 > $ftrace/buffer_size_kb
>>> $  # generate lots of events:
>>> $ echo 1 > $ftrace/tracing_enabled &&  ls -laR /proc/ > /dev/null 2>&1
>>> && echo 0 > $ftrace/tracing_enabled
>>> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>> 675840
>>>
>>> That's significantly more data than the buffer could possibly contain.
>>> Further shrinking bs increases the amount of data output even more.
>>
>> Yep that's a side effect. What happens is that we always read a "full
>> page" even if we do not read in page size chunks. If we read 100 bytes,
>> it will read a header, and all the events that fit inside that 100
>> bytes, and then return a full page with the header, the 100 bytes -
>> header size, and the rest of the page filled with zeros.
>>
>> Do a:
>>
>>        hexdump -C outputfile
>>
>> and you'll see a lot of "zeros". The reason for this is to keep all
>> reads in pagesize chunks. If you read only 100 bytes of data, I still
>> need to keep that page aligned in the output. The reason I do this is in
>> case there's two readers. I don't want one to read part of a page that
>> includes the header, and then the other reads the rest of the page. The
>> header tells you how much data is on that page.
>
> aha, I see. My assumption that the page headers weren't a needed part
> of the trace output was throwing off my analysis.
>
>> I guess I could have had the info->spare just read the full page, and
>> then other readers would just read the next page.
>
> I don't think multiple readers a use case we're concerned about, so
> either way as far as I'm concerned.
>
>>> Even when reading >=4096 bytes, something is funny:
>>>
>>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>> 118784
>>>
>>> That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
>>> pages, or 104kb-worth of pages. 116kB is about 28 pages.
>>
>> Hmm, isn't page data 4080, not 4088, since it has 2, 8 byte items in it.
>> Timestamp and atomic (on 64bit). 4096 - 16 = 4080. But still the numbers
>> are not much different.
>
> oops, yeah. I rounded 2*8 = 8  ;)
>
>> Note, when I did it, I got:
>>
>> $ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
>> 27+0 records in
>> 27+0 records out
>> 110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s
>>
>> $ cat /debug/tracing/buffer_size_kb
>> 100
>>
>> Ah, running hexdump -C, I got this at the end:
>>
>> 00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW.."Ñ..ÿÿÿÿvÑ..|
>> 00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
>> 00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |Ø2¶~íS..$.......|
>> 0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |š6......Ê...ÿÿÿÿ|
>> 0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....jÑ..ÿÿÿÿh&..|
>> 0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
>> 0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 0001b000
>>
>> I don't know why it added that last page. I'll have to look into this.
>
> Thanks for taking a look!
>
> I'm actually seeing up to 3 extra pages (29 records) frequently,
> although it varies. I even just got 30 pages.
>
> aha, check this out though, looking at all the page headers:
>
> $ hexdump -C /tmp/trace_raw | grep '^[0-9a-f]*000 '
> 00000000  7d 9a 19 9a fd 01 00 00  74 0d 00 00 00 00 00 00  |}...�...t.......|
> 00001000  12 22 4a 71 01 02 00 00  38 00 00 00 00 00 00 00  |."Jq....8.......|
> 00002000  62 99 60 d7 01 02 00 00  b8 0f 00 c0 ff ff ff ff  |b.`�....�..�����|
> # negative commit count? or, notice this is the third page.
> # Maybe there's garbage at the beginning, rather than the end.
> 00003000  e4 cb e2 dc 01 02 00 00  f0 0f 00 00 00 00 00 00  |����....�.......|
> # <snip>
> 00019000  7a 79 01 ef 01 02 00 00  dc 0f 00 00 00 00 00 00  |zy.�....�.......|
> 0001a000  c7 51 05 ef 01 02 00 00  c0 0f 00 00 00 00 00 00  |�Q.�....�.......|
> 0001b000  af fe 1b ef 01 02 00 00  34 08 00 00 00 00 00 00  |��.�....4.......|
> 0001c000  49 f3 1f ef 01 02 00 00  2c 00 00 00 00 00 00 00  |I�.�....,.......|
>
> Interestingly, they all have valid-looking timestamps (in that they
> are monotonic), even before the negative commit count.
>
> Here's the one where I got 30 pages:
>
> 00000000  1d f5 1f ef 01 02 00 00  68 07 00 00 00 00 00 00  |.�.�....h.......|
> 00000770  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00001000  bf 04 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |�.Y.J...(.......|
> 00001030  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |....����........|
> 00001040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00002000  6b 0b 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |k.Y.J...(.......|
> 00002030  ff ff ff ff 1d 00 00 00  00 00 00 00 00 00 00 00  |����............|
> 00002040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00003000  df b5 6d 68 4a 02 00 00  d4 0f 00 c0 ff ff ff ff  |ߵmhJ...�..�����|
> # negative commit count in the 4th page this time. I see a pattern here...
> # also, lots of zeroes in the first 3 pages.
> 00004000  94 9b be 6c 4a 02 00 00  d4 0f 00 00 00 00 00 00  |..�lJ...�.......|
> # <snip>
> 00019000  88 69 4f 91 4a 02 00 00  cc 0f 00 00 00 00 00 00  |.iO.J...�.......|
> 0001a000  6f 4d ab 97 4a 02 00 00  e4 0f 00 00 00 00 00 00  |oM�.J...�.......|
> 0001b000  4c 98 2d 9e 4a 02 00 00  d4 0f 00 00 00 00 00 00  |L.-.J...�.......|
> 0001c000  f0 92 57 a3 4a 02 00 00  6c 0e 00 00 00 00 00 00  |�.W�J...l.......|
> 0001d000  6e f9 24 a4 4a 02 00 00  2c 00 00 00 00 00 00 00  |n�$�J...,.......|
>
> I don't know what this means yet. Is a negative commit count ever
> expected? ring_buffer_read_page does add 1<<30 and 1<<31 when there
> are events missing, but I don't think that would result in these bit
> patterns.
>
> That's enough for today. I'll investigate more tomorrow.

I did a test that yielded some interesting insights:

$ # write the current time to trace_mark continuously from cpu0:
$ taskset -c 0 sh -c 'while true ; do echo "time: $(date +%s.%N)" \
> /sys/kernel/debug/tracing/trace_marker ; done' &
$ # Let the buffer fill, and note the start and end times:
$ echo "start: $(date +%s.%N)" ; echo 1 > $ftrace/tracing_enabled ; sleep 12 ; \
echo 0 > $ftrace/tracing_enabled ; echo "end: $(date +%s.%N)"
start: 1293051826.630452621
end: 1293051838.634052658
$ # gather result
$ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw of=/tmp/trace_raw
30+0 records in
30+0 records out
122880 bytes (123 kB) copied, 0.000300114 s, 409 MB/s

Okay, 30 pages. 4 too many, but I've seen that before.

Now, the data. Page 1:

00000000 61 17 63 55 ab 82 00 00 ac 0f 00 00 00 00 00 00 |a.cU�...�.......|
00000010 1e 6c b4 f7 04 01 00 00 0c 00 00 00 05 00 00 00 |.l��............|
00000020 66 44 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 |fD..����........|
00000030 74 69 6d 65 3a 20 31 32 39 33 30 35 31 38 32 36 |time: 1293051826|
00000040 2e 36 33 31 32 34 31 31 36 39 0a 00 0c 00 00 00 |.631241169......|

okay, a small time lag (789 microsec) before the trace starts. (If
you're asking why you're seeing the beginning of the trace and how
there could have been no overflow, you're in the advanced class. I
didn't realize this until Page 4...)

btw, the trace_marks are occurring on the order of 1ms apart.

Page 1 ends at 0fbc, which fits the 0fac value for "commit" in the page header.

Now, the whole of Page 2:

00001000 70 25 e0 81 b3 82 00 00 34 00 00 00 00 00 00 00 |p%�.�...4.......|
00001010 6c df ff 01 05 00 00 00 66 44 00 00 ff ff ff ff |l��.....fD..����|
00001020 00 00 00 00 00 00 00 00 74 69 6d 65 3a 20 31 32 |........time: 12|
00001030 39 33 30 35 31 38 32 36 2e 37 31 32 31 31 34 34 |93051826.7121144|
00001040 36 30 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 |60..............|

Only a single event; nothing remarkable about it other than that it is
on a separate page.

Page 3 has only a padding event in it:

00002000 6b 24 f0 81 b3 82 00 00 10 00 00 00 00 00 00 00 |k$�.�...........|
00002010 3d 00 00 00 0c 00 00 00 00 00 00 00 00 00 00 00 |=...............|

So 0x0fac + 0x34 + 0x10 = 0xff0, plus 16 bytes of header makes one
page. So, it seems that was the real first page of the ring buffer.
The question is, why did it get split up?

Here's where it gets more interesting. Here's Page 4:

00003000 25 56 1a cb b5 82 00 00 d8 0f 00 c0 ff ff ff ff |%V.˵...�..�����|
00003010 0c 00 00 00 05 00 00 00 66 44 00 00 ff ff ff ff |........fD..����|
00003020 00 00 00 00 00 00 00 00 74 69 6d 65 3a 20 31 32 |........time: 12|
00003030 39 33 30 35 31 38 33 36 2e 35 34 34 36 31 30 31 |93051836.5446101|
00003040 37 39 0a 00 8c c7 0f 02 05 00 00 00 66 44 00 00 |79...�......fD..|

Remember the last event on page 2 was at 1293051826.712114460. There
are 10 seconds missing!

Also, still seeing that strange value for "commit" (0xffffffffc0000fd8).

Between the first event on Page 4 and the last event, the events seem
to be contiguous, very roughly 1msec apart.

The last event of the trace is the only event on the last page:

0001d000 50 dc 41 47 b6 82 00 00 34 00 00 00 00 00 00 00 |P�AG�...4.......|
0001d010 2c b7 1a 02 05 00 00 00 66 44 00 00 ff ff ff ff |,�......fD..����|
0001d020 00 00 00 00 00 00 00 00 74 69 6d 65 3a 20 31 32 |........time: 12|
0001d030 39 33 30 35 31 38 33 38 2e 36 33 31 38 37 38 37 |93051838.6318787|
0001d040 39 34 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 |94..............|

That's at 1293051838.631878794, compared to the recorded "end" time
1293051838.634052658 (diff is 2,173 microsec). So, that very likely is
the real end of the trace.

This leads to a theory that this has something to do with overflow:

$ cat per_cpu/cpu0/stats
entries: 0
overrun: 16770
commit overrun: 0

16770 events, each on the order of 1msec apart, roughly accounts for
the 10 seconds of missing time.

So, it seems that the ring buffer started out writing to what
eventually became the initial reader page. But then as overflow
occurs, that page does not get overwritten.

There are three outstanding questions in my mind:
1) Why are we seeing the "first" page, which either should have been
overwritten, or should not be seen at all?
2) As before: Why the strange "commit" value in the page header for
the first contiguous-event page (Page 4)?
3) Why is that first page getting split up?

I think I have an idea about (3):

In ring_buffer_read_page(), there are three conditions for reading out
a page one event at a time. One of them is "reader_page ==
commit_page". That seems plausible for a full ring buffer, but I
haven't fully grokked that.

But within that path, it seems it should still copy out the whole
page. However, the loop condition for copying events is "while (len >
size)", where len is the remaining available space in the
caller-supplied page, and size is the size of the next event in the
ring buffer. So, that's off-by-one: if len == size, then there's just
enough space.

Does this fit? 0x1000 - 0xfbc = 68 bytes for len, but the event in
Page 2 is 52 bytes... So no. hmm, another question.

But, size is set to "rb_event_ts_length(event)", to ensure space for
time extend events to be paired with the event after it. Then in the
next loop, "size" bytes are memcpy'd to the caller-supplied page.
HOWEVER, the reader is then advanced with rb_advance_reader(), which
*does not* advance twice for time-extend events. So, the memcpy will
have copied two events, and "pos" is advanced by "size", and "len" is
decremented by "size", but then the reader is advanced by *only* one
event! This means that time-extend events should be followed by
duplicate events. Sure enough, the first event on Page 1 is a time
extend, and it is followed by two identical events (I only show the
first word of the duplicate above, though).

A patch for these two problems is incoming.

Sorry for these long messages. If this is too much detail, I can scale it back.

2010-12-22 23:45:51

by David Sharp

Subject: [PATCH] ring_buffer: off-by-one and duplicate events in ring_buffer_read_page

Fix two related problems in the event-copying loop of
ring_buffer_read_page.

The loop condition for copying events is off-by-one.
"len" is the remaining space in the caller-supplied page.
"size" is the size of the next event (or two events).
If len == size, then there is just enough space for the next event.

size was set to rb_event_ts_length, which may include the size of two
events if the first event is a time-extend, in order to assure time-
extends are kept together with the event after it. However,
rb_advance_reader always advances by one event. This would result in the
event after any time-extend being duplicated. Instead, get the size of
a single event for the memcpy, but use rb_event_ts_length for the loop
condition.

Signed-off-by: David Sharp <[email protected]>
CC: Steven Rostedt <[email protected]>
LKML-Reference: <[email protected]>
---
kernel/trace/ring_buffer.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..bd1c35a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3853,6 +3853,13 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 
 		/* Need to copy one event at a time */
 		do {
+			/* We need the size of one event, because
+			 * rb_advance_reader only advances by one event,
+			 * whereas rb_event_ts_length may include the size of
+			 * one or two events.
+			 * We have already ensured there's enough space if this
+			 * is a time extend. */
+			size = rb_event_length(event);
 			memcpy(bpage->data + pos, rpage->data + rpos, size);
 
 			len -= size;
@@ -3867,7 +3874,7 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 			event = rb_reader_event(cpu_buffer);
 			/* Always keep the time extend and data together */
 			size = rb_event_ts_length(event);
-		} while (len > size);
+		} while (len >= size);
 
 		/* update bpage */
 		local_set(&bpage->commit, pos);
--
1.7.3.1

2010-12-22 23:47:49

by David Sharp

Subject: Re: ftrace: trace_pipe_raw interface broken

On Wed, Dec 22, 2010 at 3:23 PM, David Sharp <[email protected]> wrote:
> On Mon, Dec 20, 2010 at 9:59 PM, David Sharp <[email protected]> wrote:
>> On Mon, Dec 20, 2010 at 7:52 PM, Steven Rostedt <[email protected]> wrote:
>>> Hi David,
>>>
>>> Thanks for the report.
>>>
>>> On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
>>>> trace_pipe_raw doesn't work correctly, I think because
>>>> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>>>>
>>>> In particular, if you read less than one page at a time, then you get
>>>> much more data than expected from the ring buffer:
>>>>
>>>> $ ftrace=/sys/kernel/debug/tracing
>>>> $ echo 0 > $ftrace/tracing_enabled
>>>> $ echo 1 > $ftrace/events/enable
>>>> $ echo > $ftrace/trace  # reset
>>>> $ echo 100 > $ftrace/buffer_size_kb
>>>> $  # generate lots of events:
>>>> $ echo 1 > $ftrace/tracing_enabled &&  ls -laR /proc/ > /dev/null 2>&1
>>>> && echo 0 > $ftrace/tracing_enabled
>>>> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>> 675840
>>>>
>>>> That's significantly more data than the buffer could possibly contain.
>>>> Further shrinking bs increases the amount of data output even more.
>>>
>>> Yep that's a side effect. What happens is that we always read a "full
>>> page" even if we do not read in page size chunks. If we read 100 bytes,
>>> it will read a header, and all the events that fit inside that 100
>>> bytes, and then return a full page with the header, the 100 bytes -
>>> header size, and the rest of the page filled with zeros.
>>>
>>> Do a:
>>>
>>>        hexdump -C outputfile
>>>
>>> and you'll see a lot of "zeros". The reason for this is to keep all
>>> reads in pagesize chunks. If you read only 100 bytes of data, I still
>>> need to keep that page aligned in the output. The reason I do this is in
>>> case there's two readers. I don't want one to read part of a page that
>>> includes the header, and then the other reads the rest of the page. The
>>> header tells you how much data is on that page.
>>
>> aha, I see. My assumption that the page headers weren't a needed part
>> of the trace output was throwing off my analysis.
>>
>>> I guess I could have had the info->spare just read the full page, and
>>> then other readers would just read the next page.
>>
>> I don't think multiple readers a use case we're concerned about, so
>> either way as far as I'm concerned.
>>
>>>> Even when reading >=4096 bytes, something is funny:
>>>>
>>>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>> 118784
>>>>
>>>> That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
>>>> pages, or 104kb-worth of pages. 116kB is about 28 pages.
>>>
>>> Hmm, isn't page data 4080, not 4088, since it has 2, 8 byte items in it.
>>> Timestamp and atomic (on 64bit). 4096 - 16 = 4080. But still the numbers
>>> are not much different.
>>
>> oops, yeah. I rounded 2*8 = 8  ;)
>>
>>> Note, when I did it, I got:
>>>
>>> $ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
>>> 27+0 records in
>>> 27+0 records out
>>> 110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s
>>>
>>> $ cat /debug/tracing/buffer_size_kb
>>> 100
>>>
>>> Ah, running hexdump -C, I got this at the end:
>>>
>>> 00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW.."Ñ..ÿÿÿÿvÑ..|
>>> 00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
>>> 00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |Ø2¶~íS..$.......|
>>> 0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |š6......Ê...ÿÿÿÿ|
>>> 0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....jÑ..ÿÿÿÿh&..|
>>> 0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
>>> 0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 0001b000
>>>
>>> I don't know why it added that last page. I'll have to look into this.
>>
>> Thanks for taking a look!
>>
>> I'm actually seeing up to 3 extra pages (29 records) frequently,
>> although it varies. I even just got 30 pages.
>>
>> aha, check this out though, looking at all the page headers:
>>
>> $ hexdump -C /tmp/trace_raw | grep '^[0-9a-f]*000 '
>> 00000000  7d 9a 19 9a fd 01 00 00  74 0d 00 00 00 00 00 00  |}...�...t.......|
>> 00001000  12 22 4a 71 01 02 00 00  38 00 00 00 00 00 00 00  |."Jq....8.......|
>> 00002000  62 99 60 d7 01 02 00 00  b8 0f 00 c0 ff ff ff ff  |b.`�....�..�����|
>> # negative commit count? or, notice this is the third page.
>> # Maybe there's garbage at the beginning, rather than the end.
>> 00003000  e4 cb e2 dc 01 02 00 00  f0 0f 00 00 00 00 00 00  |����....�.......|
>> # <snip>
>> 00019000  7a 79 01 ef 01 02 00 00  dc 0f 00 00 00 00 00 00  |zy.�....�.......|
>> 0001a000  c7 51 05 ef 01 02 00 00  c0 0f 00 00 00 00 00 00  |�Q.�....�.......|
>> 0001b000  af fe 1b ef 01 02 00 00  34 08 00 00 00 00 00 00  |��.�....4.......|
>> 0001c000  49 f3 1f ef 01 02 00 00  2c 00 00 00 00 00 00 00  |I�.�....,.......|
>>
>> Interestingly, they all have valid-looking timestamps (in that they
>> are monotonic), even before the negative commit count.
>>
>> Here's the one where I got 30 pages:
>>
>> 00000000  1d f5 1f ef 01 02 00 00  68 07 00 00 00 00 00 00  |.�.�....h.......|
>> 00000770  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00001000  bf 04 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |�.Y.J...(.......|
>> 00001030  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |....����........|
>> 00001040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00002000  6b 0b 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |k.Y.J...(.......|
>> 00002030  ff ff ff ff 1d 00 00 00  00 00 00 00 00 00 00 00  |����............|
>> 00002040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 00003000  df b5 6d 68 4a 02 00 00  d4 0f 00 c0 ff ff ff ff  |ߵmhJ...�..�����|
>> # negative commit count in the 4th page this time. I see a pattern here...
>> # also, lots of zeroes in the first 3 pages.
>> 00004000  94 9b be 6c 4a 02 00 00  d4 0f 00 00 00 00 00 00  |..�lJ...�.......|
>> # <snip>
>> 00019000  88 69 4f 91 4a 02 00 00  cc 0f 00 00 00 00 00 00  |.iO.J...�.......|
>> 0001a000  6f 4d ab 97 4a 02 00 00  e4 0f 00 00 00 00 00 00  |oM�.J...�.......|
>> 0001b000  4c 98 2d 9e 4a 02 00 00  d4 0f 00 00 00 00 00 00  |L.-.J...�.......|
>> 0001c000  f0 92 57 a3 4a 02 00 00  6c 0e 00 00 00 00 00 00  |�.W�J...l.......|
>> 0001d000  6e f9 24 a4 4a 02 00 00  2c 00 00 00 00 00 00 00  |n�$�J...,.......|
>>
>> I don't know what this means yet. Is a negative commit count ever
>> expected? ring_buffer_read_page does add 1<<30 and 1<<31 when there
>> are events missing, but I don't think that would result in these bit
>> patterns.
>>
>> That's enough for today. I'll investigate more tomorrow.
>
> I did a test that yielded some interesting insights:
>
> $ # write the current time to trace_mark continuously from cpu0:
> $ taskset -c 0 sh -c 'while true ; do echo "time: $(date +%s.%N)" \
>  > /sys/kernel/debug/tracing/trace_marker ; done' &
> $ # Let the buffer fill, and note the start and end times:
> $ echo "start: $(date +%s.%N)" ; echo 1 > $ftrace/tracing_enabled ; sleep 12 ; \
>  echo 0 > $ftrace/tracing_enabled ; echo "end:   $(date +%s.%N)"
> start: 1293051826.630452621
> end:   1293051838.634052658
> $ # gather result
> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw of=/tmp/trace_raw
> 30+0 records in
> 30+0 records out
> 122880 bytes (123 kB) copied, 0.000300114 s, 409 MB/s
>
> Okay, 30 pages. 4 too many, but I've seen that before.
>
> Now, the data. Page 1:
>
> 00000000  61 17 63 55 ab 82 00 00  ac 0f 00 00 00 00 00 00  |a.cU�...�.......|
> 00000010  1e 6c b4 f7 04 01 00 00  0c 00 00 00 05 00 00 00  |.l��............|
> 00000020  66 44 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |fD..����........|
> 00000030  74 69 6d 65 3a 20 31 32  39 33 30 35 31 38 32 36  |time: 1293051826|
> 00000040  2e 36 33 31 32 34 31 31  36 39 0a 00 0c 00 00 00  |.631241169......|
>
> okay, a small time lag (789 microsec) before the trace starts. (If
> you're asking why you're seeing the beginning of the trace and how
> there could have been no overflow, you're in the advanced class. I
> didn't realize this until Page 4...)
>
> btw, the trace_marks are occurring on the order of 1ms apart.
>
> Page 1 ends at 0fbc, which fits the 0fac value for "commit" in the page header.
>
> Now, the whole of Page 2:
>
> 00001000  70 25 e0 81 b3 82 00 00  34 00 00 00 00 00 00 00  |p%�.�...4.......|
> 00001010  6c df ff 01 05 00 00 00  66 44 00 00 ff ff ff ff  |l��.....fD..����|
> 00001020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
> 00001030  39 33 30 35 31 38 32 36  2e 37 31 32 31 31 34 34  |93051826.7121144|
> 00001040  36 30 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |60..............|
>
> Only a single event, nothing remarkable about it other than that it is
> on a separate page
>
> Page 3 has only a padding event in it:
>
> 00002000  6b 24 f0 81 b3 82 00 00  10 00 00 00 00 00 00 00  |k$�.�...........|
> 00002010  3d 00 00 00 0c 00 00 00  00 00 00 00 00 00 00 00  |=...............|
>
> So 0x0fac + 0x34 + 0x10 = 0xff0, plus 16 bytes of header makes one
> page. So, it seems that was the real first page of the ring buffer.
> The question is, why did it get split up?
>
> Here's where it gets more interesting. Here's Page 4:
>
> 00003000  25 56 1a cb b5 82 00 00  d8 0f 00 c0 ff ff ff ff  |%V.˵...�..�����|
> 00003010  0c 00 00 00 05 00 00 00  66 44 00 00 ff ff ff ff  |........fD..����|
> 00003020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
> 00003030  39 33 30 35 31 38 33 36  2e 35 34 34 36 31 30 31  |93051836.5446101|
> 00003040  37 39 0a 00 8c c7 0f 02  05 00 00 00 66 44 00 00  |79...�......fD..|
>
> Remember the last event on page 2 was at 1293051826.712114460. There
> are 10 seconds missing!
>
> Also, still seeing that strange value for "commit" (0xffffffff0c000fd8).
>
> Between the first event on Page4 and the last event, the events seem
> to be contiguous, very roughly 1msec apart.
>
> The last event of the trace is the only event on the last page:
>
> 0001d000  50 dc 41 47 b6 82 00 00  34 00 00 00 00 00 00 00  |P�AG�...4.......|
> 0001d010  2c b7 1a 02 05 00 00 00  66 44 00 00 ff ff ff ff  |,�......fD..����|
> 0001d020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
> 0001d030  39 33 30 35 31 38 33 38  2e 36 33 31 38 37 38 37  |93051838.6318787|
> 0001d040  39 34 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |94..............|
>
> That's at 1293051838.631878794, compared to the recorded "end" time
> 1293051838.634052658 (diff is 2,173 microsec). So, that very likely is
> the real end of the trace.
>
> This leads to a theory that this has something to do with overflow:
>
> $ cat per_cpu/cpu0/stats
> entries: 0
> overrun: 16770
> commit overrun: 0
>
> 16770 events where each event is on the order of 1msec apart roughly
> accounts for 10 seconds of missing time.
>
> So, it seems that the ring buffer started out writing to what
> eventually became the initial reader page. But then as overflow
> occurs, that page does not get overwritten.
>
> There are three outstanding questions in my mind:
> 1) Why are we seeing the "first" page, which either should have been
> overwritten, or should not be seen at all?
> 2) As before: Why the strange "commit" value in the page header for
> the first contiguous-event page (Page 4)?
> 3) Why is that first page getting split up?
>
> I think I have an idea about (3):
>
> in ring_buffer_read_page(), there are three conditions for reading out
> a page on event at a time. One of them is "reader_page ==
> commit_page". That seems plausible for a full ring buffer, but I
> haven't fully groked that.
>
> But, within that path, it seems it should still copy out the whole
> page. But, the loop condition for copying events is "while (len >
> size)" where len is the remaining available size in the
> caller-supplied page, and size is the size of the next event in the
> ring buffer. So, that's off-by-one. if len == size, then there's just
> enough space.
>
> Does this fit? 0x1000 - 0xfbc = 68 bytes for len, but the event in
> Page 2 is 52 bytes... So no. hmm, another question.
>
> But, size is set to "rb_event_ts_length(event)", to ensure space for
> time extend events to be paired with the event after it. Then in the
> next loop, "size" bytes are memcpy'd to the caller-supplied page.
> HOWEVER, the reader is then advanced with rb_advance_reader(), which
> *does not* advance twice for time-extend events. So, the memcpy will
> have copied two events, and "pos" is advanced by "size", and "len" is
> decremented by "size", but then the reader is advanced by *only* one
> event! This means that time-extend events should be followed by
> duplicate events. Sure enough, the first event on the Page 1 is a time
> extend, and it is followed by two identical events (I only show the
> first word of the duplicate above, though).
>
> A patch for these two problems is incoming.
>
> Sorry for these long messages. If this is too much detail, I can scale it back.
>

Okay. Now I've tested with the patch (sent separately).

- Extra pages are only seen when I don't reset the buffer before
running the test again.
- So far, when I've seen extra pages, I've only seen one extra page.
- When there's an extra page, it's still full of pre-overflow data, as
opposed to, say, data from the previous test.

2010-12-23 00:37:35

by David Sharp

Subject: Re: ftrace: trace_pipe_raw interface broken

On Wed, Dec 22, 2010 at 3:47 PM, David Sharp <[email protected]> wrote:
> On Wed, Dec 22, 2010 at 3:23 PM, David Sharp <[email protected]> wrote:
>> On Mon, Dec 20, 2010 at 9:59 PM, David Sharp <[email protected]> wrote:
>>> On Mon, Dec 20, 2010 at 7:52 PM, Steven Rostedt <[email protected]> wrote:
>>>> Hi David,
>>>>
>>>> Thanks for the report.
>>>>
>>>> On Mon, 2010-12-20 at 18:43 -0800, David Sharp wrote:
>>>>> trace_pipe_raw doesn't work correctly, I think because
>>>>> tracing_buffers_read() uses ring_buffer_read_page() incorrectly.
>>>>>
>>>>> In particular, if you read less than one page at a time, then you get
>>>>> much more data than expected from the ring buffer:
>>>>>
>>>>> $ ftrace=/sys/kernel/debug/tracing
>>>>> $ echo 0 > $ftrace/tracing_enabled
>>>>> $ echo 1 > $ftrace/events/enable
>>>>> $ echo > $ftrace/trace  # reset
>>>>> $ echo 100 > $ftrace/buffer_size_kb
>>>>> $  # generate lots of events:
>>>>> $ echo 1 > $ftrace/tracing_enabled &&  ls -laR /proc/ > /dev/null 2>&1
>>>>> && echo 0 > $ftrace/tracing_enabled
>>>>> $ dd bs=1024 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>>> 675840
>>>>>
>>>>> That's significantly more data than the buffer could possibly contain.
>>>>> Further shrinking bs increases the amount of data output even more.
>>>>
>>>> Yep that's a side effect. What happens is that we always read a "full
>>>> page" even if we do not read in page size chunks. If we read 100 bytes,
>>>> it will read a header, and all the events that fit inside that 100
>>>> bytes, and then return a full page with the header, the 100 bytes -
>>>> header size, and the rest of the page filled with zeros.
>>>>
>>>> Do a:
>>>>
>>>>        hexdump -C outputfile
>>>>
>>>> and you'll see a lot of "zeros". The reason for this is to keep all
>>>> reads in pagesize chunks. If you read only 100 bytes of data, I still
>>>> need to keep that page aligned in the output. The reason I do this is in
>>>> case there's two readers. I don't want one to read part of a page that
>>>> includes the header, and then the other reads the rest of the page. The
>>>> header tells you how much data is on that page.
>>>
>>> aha, I see. My assumption that the page headers weren't a needed part
>>> of the trace output was throwing off my analysis.
>>>
>>>> I guess I could have had the info->spare just read the full page, and
>>>> then other readers would just read the next page.
>>>
>>> I don't think multiple readers a use case we're concerned about, so
>>> either way as far as I'm concerned.
>>>
>>>>> Even when reading >=4096 bytes, something is funny:
>>>>>
>>>>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw 2>/dev/null | wc -c
>>>>> 118784
>>>>>
>>>>> That's 116kB. I expect 100kb to allocate (102400+4087)/4088 = 26
>>>>> pages, or 104kb-worth of pages. 116kB is about 28 pages.
>>>>
>>>> Hmm, isn't page data 4080, not 4088, since it has 2, 8 byte items in it.
>>>> Timestamp and atomic (on 64bit). 4096 - 16 = 4080. But still the numbers
>>>> are not much different.
>>>
>>> oops, yeah. I rounded 2*8 = 8  ;)
>>>
>>>> Note, when I did it, I got:
>>>>
>>>> $ dd if=/debug/tracing/per_cpu/cpu0/trace_pipe_raw of=/tmp/dat bs=4096
>>>> 27+0 records in
>>>> 27+0 records out
>>>> 110592 bytes (111 kB) copied, 0.00302891 s, 36.5 MB/s
>>>>
>>>> $ cat /debug/tracing/buffer_size_kb
>>>> 100
>>>>
>>>> Ah, running hexdump -C, I got this at the end:
>>>>
>>>> 00019c80  62 57 00 00 22 d1 09 81  ff ff ff ff 76 d1 09 81  |bW.."Ñ..ÿÿÿÿvÑ..|
>>>> 00019c90  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
>>>> 00019ca0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>>> *
>>>> 0001a000  d8 32 b6 7e ed 53 05 00  24 00 00 00 00 00 00 00  |Ø2¶~íS..$.......|
>>>> 0001a010  a8 36 00 00 01 00 00 00  ca 0b 00 00 ff ff ff ff  |š6......Ê...ÿÿÿÿ|
>>>> 0001a020  00 00 00 00 6a d1 09 81  ff ff ff ff 68 26 16 81  |....jÑ..ÿÿÿÿh&..|
>>>> 0001a030  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |ÿÿÿÿ............|
>>>> 0001a040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>>> *
>>>> 0001b000
>>>>
>>>> I don't know why it added that last page. I'll have to look into this.
>>>
>>> Thanks for taking a look!
>>>
>>> I'm actually seeing up to 3 extra pages (29 records) frequently,
>>> although it varies. I even just got 30 pages.
>>>
>>> aha, check this out though, looking at all the page headers:
>>>
>>> $ hexdump -C /tmp/trace_raw | grep '^[0-9a-f]*000 '
>>> 00000000  7d 9a 19 9a fd 01 00 00  74 0d 00 00 00 00 00 00  |}...�...t.......|
>>> 00001000  12 22 4a 71 01 02 00 00  38 00 00 00 00 00 00 00  |."Jq....8.......|
>>> 00002000  62 99 60 d7 01 02 00 00  b8 0f 00 c0 ff ff ff ff  |b.`�....�..�����|
>>> # negative commit count? or, notice this is the third page.
>>> # Maybe there's garbage at the beginning, rather than the end.
>>> 00003000  e4 cb e2 dc 01 02 00 00  f0 0f 00 00 00 00 00 00  |����....�.......|
>>> # <snip>
>>> 00019000  7a 79 01 ef 01 02 00 00  dc 0f 00 00 00 00 00 00  |zy.�....�.......|
>>> 0001a000  c7 51 05 ef 01 02 00 00  c0 0f 00 00 00 00 00 00  |�Q.�....�.......|
>>> 0001b000  af fe 1b ef 01 02 00 00  34 08 00 00 00 00 00 00  |��.�....4.......|
>>> 0001c000  49 f3 1f ef 01 02 00 00  2c 00 00 00 00 00 00 00  |I�.�....,.......|
>>>
>>> Interestingly, they all have valid-looking timestamps (in that they
>>> are monotonic), even before the negative commit count.
>>>
>>> Here's the one where I got 30 pages:
>>>
>>> 00000000  1d f5 1f ef 01 02 00 00  68 07 00 00 00 00 00 00  |.�.�....h.......|
>>> 00000770  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 00001000  bf 04 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |�.Y.J...(.......|
>>> 00001030  00 00 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |....����........|
>>> 00001040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 00002000  6b 0b 59 1f 4a 02 00 00  28 00 00 00 00 00 00 00  |k.Y.J...(.......|
>>> 00002030  ff ff ff ff 1d 00 00 00  00 00 00 00 00 00 00 00  |����............|
>>> 00002040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>>> *
>>> 00003000  df b5 6d 68 4a 02 00 00  d4 0f 00 c0 ff ff ff ff  |ߵmhJ...�..�����|
>>> # negative commit count in the 4th page this time. I see a pattern here...
>>> # also, lots of zeroes in the first 3 pages.
>>> 00004000  94 9b be 6c 4a 02 00 00  d4 0f 00 00 00 00 00 00  |..�lJ...�.......|
>>> # <snip>
>>> 00019000  88 69 4f 91 4a 02 00 00  cc 0f 00 00 00 00 00 00  |.iO.J...�.......|
>>> 0001a000  6f 4d ab 97 4a 02 00 00  e4 0f 00 00 00 00 00 00  |oM�.J...�.......|
>>> 0001b000  4c 98 2d 9e 4a 02 00 00  d4 0f 00 00 00 00 00 00  |L.-.J...�.......|
>>> 0001c000  f0 92 57 a3 4a 02 00 00  6c 0e 00 00 00 00 00 00  |�.W�J...l.......|
>>> 0001d000  6e f9 24 a4 4a 02 00 00  2c 00 00 00 00 00 00 00  |n�$�J...,.......|
>>>
>>> I don't know what this means yet. Is a negative commit count ever
>>> expected? ring_buffer_read_page does add 1<<30 and 1<<31 when there
>>> are events missing, but I don't think that would result in these bit
>>> patterns.
>>>
>>> That's enough for today. I'll investigate more tomorrow.
>>
>> I did a test that yielded some interesting insights:
>>
>> $ # write the current time to trace_mark continuously from cpu0:
>> $ taskset -c 0 sh -c 'while true ; do echo "time: $(date +%s.%N)" \
>>  > /sys/kernel/debug/tracing/trace_marker ; done' &
>> $ # Let the buffer fill, and note the start and end times:
>> $ echo "start: $(date +%s.%N)" ; echo 1 > $ftrace/tracing_enabled ; sleep 12 ; \
>>  echo 0 > $ftrace/tracing_enabled ; echo "end:   $(date +%s.%N)"
>> start: 1293051826.630452621
>> end:   1293051838.634052658
>> $ # gather result
>> $ dd bs=4096 if=$ftrace/per_cpu/cpu0/trace_pipe_raw of=/tmp/trace_raw
>> 30+0 records in
>> 30+0 records out
>> 122880 bytes (123 kB) copied, 0.000300114 s, 409 MB/s
>>
>> Okay, 30 pages. 4 too many, but I've seen that before.
>>
>> Now, the data. Page 1:
>>
>> 00000000  61 17 63 55 ab 82 00 00  ac 0f 00 00 00 00 00 00  |a.cU�...�.......|
>> 00000010  1e 6c b4 f7 04 01 00 00  0c 00 00 00 05 00 00 00  |.l��............|
>> 00000020  66 44 00 00 ff ff ff ff  00 00 00 00 00 00 00 00  |fD..����........|
>> 00000030  74 69 6d 65 3a 20 31 32  39 33 30 35 31 38 32 36  |time: 1293051826|
>> 00000040  2e 36 33 31 32 34 31 31  36 39 0a 00 0c 00 00 00  |.631241169......|
>>
>> okay, a small time lag (789 microsec) before the trace starts. (If
>> you're asking why you're seeing the beginning of the trace and how
>> there could have been no overflow, you're in the advanced class. I
>> didn't realize this until Page 4...)
>>
>> btw, the trace_marks are occurring on the order of 1ms apart.
>>
>> Page 1 ends at 0fbc, which fits the 0fac value for "commit" in the page header.
>>
>> Now, the whole of Page 2:
>>
>> 00001000  70 25 e0 81 b3 82 00 00  34 00 00 00 00 00 00 00  |p%�.�...4.......|
>> 00001010  6c df ff 01 05 00 00 00  66 44 00 00 ff ff ff ff  |l��.....fD..����|
>> 00001020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
>> 00001030  39 33 30 35 31 38 32 36  2e 37 31 32 31 31 34 34  |93051826.7121144|
>> 00001040  36 30 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |60..............|
>>
>> Only a single event, nothing remarkable about it other than that it is
>> on a separate page
>>
>> Page 3 has only a padding event in it:
>>
>> 00002000  6b 24 f0 81 b3 82 00 00  10 00 00 00 00 00 00 00  |k$�.�...........|
>> 00002010  3d 00 00 00 0c 00 00 00  00 00 00 00 00 00 00 00  |=...............|
>>
>> So 0x0fac + 0x34 + 0x10 = 0xff0, plus 16 bytes of header makes one
>> page. So, it seems that was the real first page of the ring buffer.
>> The question is, why did it get split up?
>>
>> Here's where it gets more interesting. Here's Page 4:
>>
>> 00003000  25 56 1a cb b5 82 00 00  d8 0f 00 c0 ff ff ff ff  |%V.˵...�..�����|
>> 00003010  0c 00 00 00 05 00 00 00  66 44 00 00 ff ff ff ff  |........fD..����|
>> 00003020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
>> 00003030  39 33 30 35 31 38 33 36  2e 35 34 34 36 31 30 31  |93051836.5446101|
>> 00003040  37 39 0a 00 8c c7 0f 02  05 00 00 00 66 44 00 00  |79...�......fD..|
>>
>> Remember the last event on page 2 was at 1293051826.712114460. There
>> are 10 seconds missing!
>>
>> Also, still seeing that strange value for "commit" (0xffffffff0c000fd8).
>>
>> Between the first event on Page4 and the last event, the events seem
>> to be contiguous, very roughly 1msec apart.
>>
>> The last event of the trace is the only event on the last page:
>>
>> 0001d000  50 dc 41 47 b6 82 00 00  34 00 00 00 00 00 00 00  |P�AG�...4.......|
>> 0001d010  2c b7 1a 02 05 00 00 00  66 44 00 00 ff ff ff ff  |,�......fD..����|
>> 0001d020  00 00 00 00 00 00 00 00  74 69 6d 65 3a 20 31 32  |........time: 12|
>> 0001d030  39 33 30 35 31 38 33 38  2e 36 33 31 38 37 38 37  |93051838.6318787|
>> 0001d040  39 34 0a 00 00 00 00 00  00 00 00 00 00 00 00 00  |94..............|
>>
>> That's at 1293051838.631878794, compared to the recorded "end" time
>> 1293051838.634052658 (diff is 2,173 microsec). So, that very likely is
>> the real end of the trace.
>>
>> This leads to a theory that this has something to do with overflow:
>>
>> $ cat per_cpu/cpu0/stats
>> entries: 0
>> overrun: 16770
>> commit overrun: 0
>>
>> 16770 events where each event is on the order of 1msec apart roughly
>> accounts for 10 seconds of missing time.
>>
>> So, it seems that the ring buffer started out writing to what
>> eventually became the initial reader page. But then as overflow
>> occurs, that page does not get overwritten.
>>
>> There are three outstanding questions in my mind:
>> 1) Why are we seeing the "first" page, which either should have been
>> overwritten, or should not be seen at all?
>> 2) As before: Why the strange "commit" value in the page header for
>> the first contiguous-event page (Page 4)?
>> 3) Why is that first page getting split up?
>>
>> I think I have an idea about (3):
>>
>> in ring_buffer_read_page(), there are three conditions for reading out
>> a page on event at a time. One of them is "reader_page ==
>> commit_page". That seems plausible for a full ring buffer, but I
>> haven't fully groked that.
>>
>> But, within that path, it seems it should still copy out the whole
>> page. But, the loop condition for copying events is "while (len >
>> size)" where len is the remaining available size in the
>> caller-supplied page, and size is the size of the next event in the
>> ring buffer. So, that's off-by-one. if len == size, then there's just
>> enough space.
>>
>> Does this fit? 0x1000 - 0xfbc = 68 bytes for len, but the event in
>> Page 2 is 52 bytes... So no. hmm, another question.
>>
>> But, size is set to "rb_event_ts_length(event)", to ensure space for
>> time extend events to be paired with the event after it. Then in the
>> next loop, "size" bytes are memcpy'd to the caller-supplied page.
>> HOWEVER, the reader is then advanced with rb_advance_reader(), which
>> *does not* advance twice for time-extend events. So, the memcpy will
>> have copied two events, and "pos" is advanced by "size", and "len" is
>> decremented by "size", but then the reader is advanced by *only* one
>> event! This means that time-extend events should be followed by
>> duplicate events. Sure enough, the first event on the Page 1 is a time
>> extend, and it is followed by two identical events (I only show the
>> first word of the duplicate above, though).
>>
>> A patch for these two problems is incoming.
>>
>> Sorry for these long messages. If this is too much detail, I can scale it back.
>>
>
> Okay. Now I've tested with the patch (sent separately).
>
> - Extra pages are only seen when I don't reset the buffer before
> running the test again.
> - So far, when I've seen extra pages, I've only seen one extra page.
> - When there's an extra page, it's still full of pre-overflow data, as
> opposed to, say, data from the previous test.
>

Regarding the strange "commit" values:

I noticed just now that it appears the low 16-bits are the correct
value for "commit", bits 16-31 are c000, and bits 32-63 are ffffffff.

ring_buffer_read_page can store the number of missing events at the
end of the page if there is room. It signals it has done so with two
bits in commit, bits 30 and 31. That's c0000000. This points to the
ffffffff being a signed math problem, because these bits are added
using "local_add".

Here's how the bits are defined:

/* Flag when events were overwritten */
#define RB_MISSED_EVENTS (1 << 31)
/* Missed count stored at end */
#define RB_MISSED_STORED (1 << 30)

well, those would come out as signed int, and 1<<31 is 0x80000000, aka
INT_MIN. When passed to local_add, which takes signed long, that would
be sign-extended to 0xffffffff80000000.
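
To make that concrete, here is a minimal userspace sketch of the sign
extension (my own example, not kernel code; the 0xfd8 commit offset is
made up, and it assumes an LP64 machine):

#include <stdio.h>

/* Same shape as the ring_buffer.c definitions: plain signed ints. */
#define RB_MISSED_EVENTS (1 << 31)
#define RB_MISSED_STORED (1 << 30)

int main(void)
{
	long commit = 0xfd8;	/* made-up page commit offset */

	/* The int flags get sign-extended to long before the add,
	 * just as they would be inside local_add(). */
	commit += RB_MISSED_EVENTS | RB_MISSED_STORED;

	/* Prints 0xffffffffc0000fd8 -- note the ffffffff upper half. */
	printf("%#lx\n", (unsigned long)commit);
	return 0;
}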

Well, mystery solved at least. Now, how should it be fixed? Or is this
intended behavior?

By my count that leaves only one mystery: why are we seeing this extra
page at the beginning with pre-overflow data?

2010-12-23 00:38:45

by David Sharp

[permalink] [raw]
Subject: [PATCH] ring_buffer: off-by-one and duplicate events in ring_buffer_read_page

Fix two related problems in the event-copying loop of
ring_buffer_read_page.

The loop condition for copying events is off-by-one.
"len" is the remaining space in the caller-supplied page.
"size" is the size of the next event (or two events).
If len == size, then there is just enough space for the next event.

size was set to rb_event_ts_length, which may include the size of two
events if the first event is a time-extend, in order to assure time-
extends are kept together with the event after it. However,
rb_advance_reader always advances by one event. This would result in the
event after any time-extend being duplicated. Instead, get the size of
a single event for the memcpy, but use rb_event_ts_length for the loop
condition.

Signed-off-by: David Sharp <[email protected]>
CC: Steven Rostedt <[email protected]>
LKML-Reference: <[email protected]>
---
kernel/trace/ring_buffer.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..bd1c35a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3853,6 +3853,13 @@ int ring_buffer_read_page(struct ring_buffer *buffer,

/* Need to copy one event at a time */
do {
+ /* We need the size of one event, because
+ * rb_advance_reader only advances by one event,
+ * whereas rb_event_ts_length may include the size of
+ * one or two events.
+ * We have already ensured there's enough space if this
+ * is a time extend. */
+ size = rb_event_length(event);
memcpy(bpage->data + pos, rpage->data + rpos, size);

len -= size;
@@ -3867,7 +3874,7 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
event = rb_reader_event(cpu_buffer);
/* Always keep the time extend and data together */
size = rb_event_ts_length(event);
- } while (len > size);
+ } while (len >= size);

/* update bpage */
local_set(&bpage->commit, pos);
--
1.7.3.1

2010-12-23 15:57:36

by Steven Rostedt

[permalink] [raw]
Subject: Re: ftrace: trace_pipe_raw interface broken

Hi David,

Thanks for looking deeper into this.

On Wed, 2010-12-22 at 16:37 -0800, David Sharp wrote:

> Regarding the strange "commit" values:
>
> I noticed just now that it appears the low 16-bits are the correct
> value for "commit", bits 16-31 are c000, and bits 32-63 are ffffffff.

Ah yeah, I use 32 bits so it would be the same on both 32-bit and 64-bit.

>
> ring_buffer_read_page can store the number of missing events at the
> end of the page if there is room. It signals it has done so with two
> bits in commit, bits 30 and 31. That's c0000000. This points to the
> ffffffff being a signed math problem, because these bits are added
> using "local_add".
>
> Here's how the bits are defined:
>
> /* Flag when events were overwritten */
> #define RB_MISSED_EVENTS (1 << 31)
> /* Missed count stored at end */
> #define RB_MISSED_STORED (1 << 30)
>
> well, those would come out as signed int, and 1<<31 is 0x80000000, aka
> INT_MIN. When passed to local_add, which takes signed long, that would
> be sign-extended to 0xffffffff80000000.
>
> Well, mystery solved at least. Now, how should it be fixed? Or is this
> intended behavior?

Not quite intended, but not something to worry about either. We mask off
the flag bits (30 and 31) to determine the size.
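
Roughly, the reading side does something like this (a sketch only; the
struct and the RB_COMMIT_MASK name are simplified stand-ins, not the
real definitions):

/* Simplified stand-in for the header at the start of each raw page. */
struct raw_page_header {
	unsigned long long	time_stamp;
	unsigned long		commit;	/* length in low bits, flags above */
};

/* Keep only the length bits. */
#define RB_COMMIT_MASK	((1UL << 30) - 1)

static unsigned long page_data_len(const struct raw_page_header *hdr)
{
	/* Drop the RB_MISSED_* flags (bits 30/31) and whatever the sign
	 * extension smeared into bits 32-63 on a 64-bit kernel. */
	return hdr->commit & RB_COMMIT_MASK;
}

So the odd-looking values are harmless to a reader that masks before
using the length.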

>
> By my count that leaves only one mystery: why are we seeing this extra
> page at the beginning with pre-overflow data?

If you did not reset the buffer, there's a chance that the writer is on
the reader page. The reader page is always outside the ring buffer, but
it points into the ring buffer. If this occurs, then you will get the
reader page data, plus the rest of the ring buffer (which is the full
size you asked for).

-- Steve

2010-12-23 23:06:43

by David Sharp

[permalink] [raw]
Subject: Re: ftrace: trace_pipe_raw interface broken

On Thu, Dec 23, 2010 at 7:57 AM, Steven Rostedt <[email protected]> wrote:
> Hi David,
>
> Thanks for looking deeper into this.
>
> On Wed, 2010-12-22 at 16:37 -0800, David Sharp wrote:
>
>> Regarding the strange "commit" values:
>>
>> I noticed just now that it appears the low 16-bits are the correct
>> value for "commit", bits 16-31 are c000, and bits 32-63 are ffffffff.
>
> Ah yeah, I use 32 bits so it would be the same on both 32-bit and 64-bit.
>
>>
>> ring_buffer_read_page can store the number of missing events at the
>> end of the page if there is room. It signals it has done so with two
>> bits in commit, bits 30 and 31. That's c0000000. This points to the
>> ffffffff being a signed math problem, because these bits are added
>> using "local_add".
>>
>> Here's how the bits are defined:
>>
>> /* Flag when events were overwritten */
>> #define RB_MISSED_EVENTS      (1 << 31)
>> /* Missed count stored at end */
>> #define RB_MISSED_STORED      (1 << 30)
>>
>> well, those would come out as signed int, and 1<<31 is 0x80000000, aka
>> INT_MIN. When passed to local_add, which takes signed long, that would
>> be sign-extended to 0xffffffff80000000.
>>
>> Well, mystery solved at least. Now, how should it be fixed? Or is this
>> intended behavior?
>
> Not quite intended, but not something to worry about either. We mask off
> the flag bits (30 and 31) to determine the size.

In trace-cmd, you mean?

>> By my count that leaves only one mystery: why are we seeing this extra
>> page at the beginning with pre-overflow data?
>
> If you did not reset the buffer, there's a chance that the writer is on
> the reader page. The reader page is always outside the ring buffer, but
> it points into the ring buffer. If this occurs, then you will get the
> reader page data, plus the rest of the ring buffer (which is the full
> size you asked for).

Okay, that's a surprising behavior to me, but resetting the buffer
seems like a fine workaround.

btw, corroborating this is that reading the formatted output of
"trace" also shows the page of pre-overflow data.

Thanks for your help,
d#

2010-12-24 04:35:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: ftrace: trace_pipe_raw interface broken

On Thu, 2010-12-23 at 15:06 -0800, David Sharp wrote:
> On Thu, Dec 23, 2010 at 7:57 AM, Steven Rostedt <[email protected]> wrote:

> > Not quite intended, but not something to worry about either. We mask off
> > the flag bits (30 and 31) to determine the size.
>
> In trace-cmd, you mean?

Yeah, but if you find that strange, I guess we could mask it off too.

>
> >> By my count that leaves only one mystery: why are we seeing this extra
> >> page at the beginning with pre-overflow data?
> >
> > If you did not reset the buffer, there's a chance that the writer is on
> > the reader page. The reader page is always outside the ring buffer, but
> > it points into the ring buffer. If this occurs, then you will get the
> > reader page data, plus the rest of the ring buffer (which is the full
> > size you asked for).
>
> Okay, that's a surprising behavior to me, but resetting the buffer
> seems like a fine workaround.

Getting a little more trace than you asked for seems like a good thing
to me ;-)

>
> btw, corroborating this is that reading the formatted output of
> "trace" also shows the page of pre-overflow data.

Yep, that's already known.

-- Steve

2010-12-25 08:59:30

by David Sharp

[permalink] [raw]
Subject: [tip:perf/urgent] ring_buffer: Off-by-one and duplicate events in ring_buffer_read_page

Commit-ID: e1e359273576ee8fe27021356b064c772ed29af3
Gitweb: http://git.kernel.org/tip/e1e359273576ee8fe27021356b064c772ed29af3
Author: David Sharp <[email protected]>
AuthorDate: Wed, 22 Dec 2010 16:38:24 -0800
Committer: Steven Rostedt <[email protected]>
CommitDate: Thu, 23 Dec 2010 12:09:30 -0500

ring_buffer: Off-by-one and duplicate events in ring_buffer_read_page

Fix two related problems in the event-copying loop of
ring_buffer_read_page.

The loop condition for copying events is off-by-one.
"len" is the remaining space in the caller-supplied page.
"size" is the size of the next event (or two events).
If len == size, then there is just enough space for the next event.

size was set to rb_event_ts_length, which may include the size of two
events if the first event is a time-extend, in order to assure time-
extends are kept together with the event after it. However,
rb_advance_reader always advances by one event. This would result in the
event after any time-extend being duplicated. Instead, get the size of
a single event for the memcpy, but use rb_event_ts_length for the loop
condition.

Signed-off-by: David Sharp <[email protected]>
LKML-Reference: <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
kernel/trace/ring_buffer.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..bd1c35a 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3853,6 +3853,13 @@ int ring_buffer_read_page(struct ring_buffer *buffer,

/* Need to copy one event at a time */
do {
+ /* We need the size of one event, because
+ * rb_advance_reader only advances by one event,
+ * whereas rb_event_ts_length may include the size of
+ * one or two events.
+ * We have already ensured there's enough space if this
+ * is a time extend. */
+ size = rb_event_length(event);
memcpy(bpage->data + pos, rpage->data + rpos, size);

len -= size;
@@ -3867,7 +3874,7 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
event = rb_reader_event(cpu_buffer);
/* Always keep the time extend and data together */
size = rb_event_ts_length(event);
- } while (len > size);
+ } while (len >= size);

/* update bpage */
local_set(&bpage->commit, pos);