Hi Steve,
I am doing some load testing with our kernel tracing prototype
that uses the unified trace buffer for managing its data. I sometimes
saw a kernel stack dump caused by the following check in
the function __rb_reserve_next:
if (unlikely(next_page == cpu_buffer->commit_page)) {
WARN_ON_ONCE(1);
goto out_unlock;
}
The comments above the code say the problem is caused by
"an interrupt storm that made it all the way around the buffer".
But I think there is a race here where a single interrupt can cause
the check to fail. Suppose this is what happens:
An event is traced and calls __rb_reserve_next. Right after it
gets the current tail_page (line tail_page = cpu_buffer->tail_page;),
an interrupt happens that is also traced. The interrupt also takes
the same tail_page. The interrupt event moves the tail_page
forward if the tail_page is full. Note that the interrupt event gets
the old 'write' value because the first event has not updated that yet.
So the interrupt event may also update the commit_page if it is
the same as the tail_page. As a result, the above check would
fail after the interrupt finishes and the first event resumes its execution.
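To make the interleaving concrete, here is a toy user-space model of
the race (purely illustrative: made-up names, pages reduced to array
indices, and all of the real locking and commit bookkeeping dropped):

#include <stdio.h>

#define NR_PAGES      4
#define BUF_PAGE_SIZE 64

static int tail_page;           /* models cpu_buffer->tail_page   */
static int commit_page;         /* models cpu_buffer->commit_page */
static int write_off[NR_PAGES]; /* models each page's 'write'     */

/* One reservation against a tail page the caller has already read,
 * mirroring the tail-advance path of __rb_reserve_next. */
static void reserve_on(int tail_idx, int length, const char *who)
{
    int write = (write_off[tail_idx] += length);

    if (write > BUF_PAGE_SIZE) {
        int next = (tail_idx + 1) % NR_PAGES;

        if (next == commit_page) {      /* the check in question */
            printf("%s: next_page == commit_page -> WARN\n", who);
            return;
        }
        if (tail_idx == commit_page)    /* the writer here is the commit
                                         * event, so committing it also
                                         * advances commit_page */
            commit_page = next;
        tail_page = next;
        write_off[next] = 0;
    }
}

int main(void)
{
    int my_tail;

    write_off[0] = BUF_PAGE_SIZE - 8;   /* tail page nearly full */

    my_tail = tail_page;   /* event A: tail_page = cpu_buffer->tail_page; */

    /* An interrupt fires here, before event A adds to 'write'. It sees
     * the same nearly-full page, overflows it, and advances both
     * tail_page and commit_page to page 1. */
    reserve_on(tail_page, 16, "interrupt");

    /* Event A resumes with its stale tail page: 'next' (page 1) now
     * equals commit_page, so a single interrupt trips the warning, and
     * page 0 is also left with write > BUF_PAGE_SIZE. */
    reserve_on(my_tail, 16, "event A");
    return 0;
}

It prints the WARN line for event A even though only one interrupt fired.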
I have seen the problem happen frequently under heavy load
on a multi-core machine. Interestingly, I also saw instances of the above
warning that might actually have been caused by an interrupt storm.
I was using a 64k buffer size and am not sure whether it is possible
for so many interrupts to happen in such a short time window.
I think we can use the time_stamp to distinguish the two cases.
Also, in either case, it seems bad to leave the tail_page->write with
an invalid value because it can cause problems when a reader
reads the page. Here is my proposed fix for the problem:
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7f69cfe..1500f78 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -982,8 +982,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
* it all the way around the buffer, bail, and warn
* about it.
*/
- if (unlikely(next_page == cpu_buffer->commit_page)) {
+ if (unlikely(next_page == cpu_buffer->commit_page) &&
+ tail_page->time_stamp > next_page->time_stamp) {
WARN_ON_ONCE(1);
+ if (tail <= BUF_PAGE_SIZE)
+ local_set(&tail_page->write, tail);
goto out_unlock;
}
Jiaying
Hi Jiaying,
Sorry for taking so long to reply.
On Fri, 2008-12-12 at 16:26 -0800, Jiaying Zhang wrote:
> Hi Steve,
>
> I am doing some load testing with our kernel tracing prototype
> that uses the unified trace buffer for managing its data. I sometimes
> saw a kernel stack dump caused by the following check in
> the function __rb_reserve_next:
> if (unlikely(next_page == cpu_buffer->commit_page)) {
> WARN_ON_ONCE(1);
> goto out_unlock;
> }
> The comments above the code say the problem is caused by
> "an interrupt storm that made it all the way around the buffer".
> But I think there is a race here where a single interrupt can cause
> the check to fail. Suppose this is what happens:
> An event is traced and calls __rb_reserve_next. Right after it
> gets the current tail_page (line tail_page = cpu_buffer->tail_page;),
> an interrupt happens that is also traced. The interrupt also takes
> the same tail_page. The interrupt event moves the tail_page
> forward if the tail_page is full. Note that the interrupt event gets
> the old 'write' value because the first event has not updated that yet.
Good catch.
> So the interrupt event may also update the commit_page if it is
> the same as the tail_page. As a result, the above check would
> fail after the interrupt finishes and the first event resumes its execution.
>
> I have seen the problem happen frequently under heavy load
> on a multi-core machine. Interestingly, I also saw instances of the above
> warning that might actually have been caused by an interrupt storm.
> I was using a 64k buffer size and am not sure whether it is possible
> for so many interrupts to happen in such a short time window.
>
> I think we can use the time_stamp to distinguish the two cases.
> Also, in either case, it seems bad to leave the tail_page->write with
> an invalid value because it can cause problems when a reader
> reads the page. Here is my proposed fix for the problem:
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 7f69cfe..1500f78 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -982,8 +982,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
> * it all the way around the buffer, bail, and warn
> * about it.
> */
> - if (unlikely(next_page == cpu_buffer->commit_page)) {
> + if (unlikely(next_page == cpu_buffer->commit_page) &&
> + tail_page->time_stamp > next_page->time_stamp) {
> WARN_ON_ONCE(1);
> + if (tail <= BUF_PAGE_SIZE)
> + local_set(&tail_page->write, tail);
Actually what we probably should do instead is simply record the commit
page first:
+ commit_page = cpu_buffer->commit_page;
+ barrier();
tail_page = cpu_buffer->tail_page;
write = local_add_return(length, &tail_page->write);
And then we could test next_page == commit_page instead.
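Roughly, the top of __rb_reserve_next would then look like this (just a
sketch of the idea, not a tested patch):

    struct buffer_page *tail_page, *commit_page, *next_page;
    unsigned long write;

    /*
     * Snapshot the commit page before reading the tail page. If an
     * interrupt comes in afterwards and advances
     * cpu_buffer->commit_page, we still compare next_page against the
     * snapshot, so a single nested interrupt can no longer make
     * next_page "catch" the commit; only a real wrap of the whole
     * buffer can.
     */
    commit_page = cpu_buffer->commit_page;
    /* make sure the commit page is read before the tail page */
    barrier();
    tail_page = cpu_buffer->tail_page;
    write = local_add_return(length, &tail_page->write);

    ...

    if (unlikely(next_page == commit_page)) {
        WARN_ON_ONCE(1);
        goto out_unlock;
    }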
-- Steve
Hi Jiaying,
Again, sorry for taking so long to get back to you on this.
On Wed, 2008-12-17 at 16:00 -0800, Jiaying Zhang wrote:
> Hi Steve,
>
> I mentioned in my last email that I saw the warning about the trace buffer
> becoming full because of an interrupt storm even with my posted patch applied.
> After adding more debugging messages, I found that the problem was actually
> caused by another race in the code.
Actually, the other email did find a bug.
> I saw the events in the buffer were not just interrupts but included
> other kernel events as well. It looks like the commit page failed to
> advance because of a race between the update of tail_page in
> __rb_reserve_next_event and the following lines in rb_set_commit_to_write:
> while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
> cpu_buffer->commit_page->page->commit =
> cpu_buffer->commit_page->write;
> rb_inc_page(cpu_buffer, &cpu_buffer->commit_page);
> ...
> }
You are saying an interrupt happens here that pushes the tail page forward one.
> while (rb_commit_index(cpu_buffer) !=
> rb_page_write(cpu_buffer->commit_page)) {
> cpu_buffer->commit_page->page->commit =
> cpu_buffer->commit_page->write;
> barrier();
> }
Now here we are left with the tail page moved one forward but the commit
page left back one, and thus we do not increment the commit.
>
> The problem is that an interrupt can happen right after a kernel event finishes
> the condition check "cpu_buffer->commit_page != cpu_buffer->tail_page" but
> before it updates the commit value of the commit_page. If we were at the tail
> of the tail_page and the commit_page was the same as the tail_page when
> the first kernel event checked that, neither of the events would advance the
> commit_page pointer: the interrupt event was NOT the commit event (the
> kernel event had not yet updated the commit pointer), and the kernel event
> did NOT see the change of the tail_page made by the interrupt event. Once we
> got into this situation, the trace buffer would soon become full and reject
> any further reservation requests.
>
> A possible fix, I think, is to update the commit pointer of the commit_page
> both before and after updating the commit_page in rb_set_commit_to_write().
> Here is the proposed fix. Please let me know if my analysis makes sense
> to you. Thanks a lot!
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 7f69cfe..b345ba7 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -839,6 +839,12 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
> * back to us). This allows us to do a simple loop to
> * assign the commit to the tail.
> */
> + while (rb_commit_index(cpu_buffer) !=
> + rb_page_write(cpu_buffer->commit_page)) {
> + cpu_buffer->commit_page->page->commit =
> + cpu_buffer->commit_page->write;
> + barrier();
> + }
I do not think this solves it for us, because the interrupt comes in
after the next check, and that is where we are left with the dangling
commit.
> while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
> cpu_buffer->commit_page->page->commit =
> cpu_buffer->commit_page->write;
What we need is to add a do-while around the entire code:
do {
[...]
} while (cpu_buffer->commit_page != cpu_buffer->tail_page);
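Something like this (just a sketch, not a tested patch; the timestamp
handling in the real function is elided):

static void rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
{
    /*
     * Loop until commit_page catches up with tail_page. If an
     * interrupt moves tail_page while we are in here, the outer
     * do-while sees the new tail and goes around again instead of
     * leaving the commit dangling one page behind.
     */
    do {
        while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
            cpu_buffer->commit_page->page->commit =
                cpu_buffer->commit_page->write;
            rb_inc_page(cpu_buffer, &cpu_buffer->commit_page);
            /* ... timestamp updates elided ... */
        }
        while (rb_commit_index(cpu_buffer) !=
               rb_page_write(cpu_buffer->commit_page)) {
            cpu_buffer->commit_page->page->commit =
                cpu_buffer->commit_page->write;
            barrier();
        }
    } while (cpu_buffer->commit_page != cpu_buffer->tail_page);
}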
Thanks!
-- Steve
Hi Steve,
On Mon, Dec 22, 2008 at 3:02 PM, Steven Rostedt <[email protected]> wrote:
> Hi Jiaying,
>
> Sorry for taking so long to reply.
No problem. Thanks for looking at it!
>
> On Fri, 2008-12-12 at 16:26 -0800, Jiaying Zhang wrote:
>> Hi Steve,
>>
>> I am doing some load testing with our kernel tracing prototype
>> that uses the unified trace buffer for managing its data. I sometimes
>> saw a kernel stack dump caused by the following check in
>> the function __rb_reserve_next:
>> if (unlikely(next_page == cpu_buffer->commit_page)) {
>> WARN_ON_ONCE(1);
>> goto out_unlock;
>> }
>> The comments above the code say the problem is caused by
>> "an interrupt storm that made it all the way around the buffer".
>> But I think there is a race here where a single interrupt can cause
>> the check to fail. Suppose this is what happens:
>> An event is traced and calls __rb_reserve_next. Right after it
>> gets the current tail_page (line tail_page = cpu_buffer->tail_page;),
>> an interrupt happens that is also traced. The interrupt also takes
>> the same tail_page. The interrupt event moves the tail_page
>> forward if the tail_page is full. Note that the interrupt event gets
>> the old 'write' value because the first event has not updated that yet.
>
> Good catch.
>
>> So the interrupt event may also update the commit_page if it is
>> the same as the tail_page. As a result, the above check would
>> fail after the interrupt finishes and the first event resumes its execution.
>>
>> I have seen the problem happen frequently under heavy load
>> on a multi-core machine. Interestingly, I also saw instances of the above
>> warning that might actually have been caused by an interrupt storm.
>> I was using a 64k buffer size and am not sure whether it is possible
>> for so many interrupts to happen in such a short time window.
>>
>> I think we can use the time_stamp to distinguish the two cases.
>> Also, in either case, it seems bad to leave the tail_page->write with
>> an invalid value because it can cause problems when a reader
>> reads the page. Here is my proposed fix for the problem:
>>
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index 7f69cfe..1500f78 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -982,8 +982,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
>> * it all the way around the buffer, bail, and warn
>> * about it.
>> */
>> - if (unlikely(next_page == cpu_buffer->commit_page)) {
>> + if (unlikely(next_page == cpu_buffer->commit_page) &&
>> + tail_page->time_stamp > next_page->time_stamp) {
>> WARN_ON_ONCE(1);
>> + if (tail <= BUF_PAGE_SIZE)
>> + local_set(&tail_page->write, tail);
>
> Actually what we probably should do instead is simply record the commit
> page first:
>
> + commit_page = cpu_buffer->commit_page;
> + barrier();
> tail_page = cpu_buffer->tail_page;
> write = local_add_return(length, &tail_page->write);
>
>
> And then we could test next_page == commit_page instead.
Yes. This is clearer and simpler ;).
Jiaying
>
> -- Steve
>
>
>
On Mon, Dec 22, 2008 at 3:42 PM, Steven Rostedt <[email protected]> wrote:
> Hi Jiaying,
>
> Again, sorry for taking so long to get back to you on this.
Again, thank you for looking at the problem.
>
> On Wed, 2008-12-17 at 16:00 -0800, Jiaying Zhang wrote:
>> Hi Steve,
>>
>> I mentioned in my last email that I saw the warning about the trace buffer
>> becoming full because of an interrupt storm even with my posted patch applied.
>> After adding more debugging messages, I found that the problem was actually
>> caused by another race in the code.
>
> Actually, the other email did find a bug.
>
>
>> I saw the events in the buffer were not just interrupts but included
>> other kernel events as well. It looks like the commit page failed to
>> advance because of a race between the update of tail_page in
>> __rb_reserve_next_event and the following lines in rb_set_commit_to_write:
>
>> while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
>> cpu_buffer->commit_page->page->commit =
>> cpu_buffer->commit_page->write;
>> rb_inc_page(cpu_buffer, &cpu_buffer->commit_page);
>> ...
>> }
>
> You are saying an interrupt happens here that pushes the tail page forward one.
>
>> while (rb_commit_index(cpu_buffer) !=
>> rb_page_write(cpu_buffer->commit_page)) {
>> cpu_buffer->commit_page->page->commit =
>> cpu_buffer->commit_page->write;
>> barrier();
>> }
>
> Now here we are left with the tail page moved one forward but the commit
> page left back one, and thus we do not increment the commit.
Actually, I was saying that an interrupt happens after the last loop,
and we are left with the tail page moved one forward but the 'commit page'
left back one. The function still updates the 'commit' pointer correctly after
the interrupt because it is the commit event, but it does not update the
'commit page' since it has already jumped out of the last loop.
>
>
>
>>
>> The problem is that an interrupt can happen right after a kernel event finishes
>> the condition check "cpu_buffer->commit_page != cpu_buffer->tail_page" but
>> before it updates the commit value of the commit_page. If we were at the tail
>> of the tail_page and the commit_page was the same as the tail_page when
>> the first kernel event checked that, neither of the events would advance the
>> commit_page pointer: the interrupt event was NOT the commit event (the
>> kernel event had not yet updated the commit pointer), and the kernel event
>> did NOT see the change of the tail_page made by the interrupt event. Once we
>> got into this situation, the trace buffer would soon become full and reject
>> any further reservation requests.
>>
>> A possible fix, I think, is to update the commit pointer of the commit_page
>> both before and after updating the commit_page in rb_set_commit_to_write().
>> Here is the proposed fix. Please let me know if my analysis makes sense
>> to you. Thanks a lot!
>>
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index 7f69cfe..b345ba7 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -839,6 +839,12 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
>> * back to us). This allows us to do a simple loop to
>> * assign the commit to the tail.
>> */
>> + while (rb_commit_index(cpu_buffer) !=
>> + rb_page_write(cpu_buffer->commit_page)) {
>> + cpu_buffer->commit_page->page->commit =
>> + cpu_buffer->commit_page->write;
>> + barrier();
>> + }
>
> I do not think this solves it for us, because the interrupt comes in
> after the next check, and that is where we are left with the dangling
> commit.
>
>> while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
>> cpu_buffer->commit_page->page->commit =
>> cpu_buffer->commit_page->write;
>
> What we need is to add a do-while around the entire code:
>
> do {
> [...]
> } while (cpu_buffer->commit_page != cpu_buffer->tail_page);
>
I was thinking of using this kind of loop but have not come up with
one that looks clean and nice. Could you post the full patch?
Thanks!
Jiaying
> Thanks!
>
> -- Steve
>
>
>
Hi Steve,
I mentioned in my last email that I saw the warning about the trace buffer
becoming full because of an interrupt storm even with my posted patch applied.
After adding more debugging messages, I found that the problem was actually
caused by another race in the code. I saw the events in the buffer were
not just interrupts but included other kernel events as well. It looks like
the commit page failed to advance because of a race between the update of
tail_page in __rb_reserve_next_event and the following lines in
rb_set_commit_to_write:
while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
cpu_buffer->commit_page->page->commit =
cpu_buffer->commit_page->write;
rb_inc_page(cpu_buffer, &cpu_buffer->commit_page);
...
}
while (rb_commit_index(cpu_buffer) !=
rb_page_write(cpu_buffer->commit_page)) {
cpu_buffer->commit_page->page->commit =
cpu_buffer->commit_page->write;
barrier();
}
The problem is that an interrupt can happen right after a kernel event finishes
the condition check "cpu_buffer->commit_page != cpu_buffer->tail_page" but
before it updates the commit value of the commit_page. If we were at the tail
of the tail_page and the commit_page was the same as the tail_page when
the first kernel event checked that, neither of the events would advance the
commit_page pointer: the interrupt event was NOT the commit event (the
kernel event had not yet updated the commit pointer), and the kernel event
did NOT see the change of the tail_page made by the interrupt event. Once we
got into this situation, the trace buffer would soon become full and reject
any further reservation requests.
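To double-check the analysis, here is a toy user-space model of this
interleaving (purely illustrative: made-up names, no locking, and only
the two loops of rb_set_commit_to_write modeled):

#include <stdio.h>

#define NR_PAGES      4
#define BUF_PAGE_SIZE 64

static int tail_page, commit_page;
static int write_off[NR_PAGES];   /* models page->write  */
static int commit_off[NR_PAGES];  /* models page->commit */

/* A traced interrupt that fills the current tail page and moves it on.
 * It is nested inside the outer event's commit, so it is NOT the commit
 * event and must not touch commit_page itself. */
static void nested_interrupt(void)
{
    write_off[tail_page] = BUF_PAGE_SIZE;
    tail_page = (tail_page + 1) % NR_PAGES;
    write_off[tail_page] = 8;             /* the interrupt's own event */
}

int main(void)
{
    write_off[0] = 40;     /* the outer (commit) event wrote on page 0 */

    /* rb_set_commit_to_write as run by the outer kernel event, with the
     * interrupt injected at the vulnerable instant: */
    while (commit_page != tail_page) {    /* check sees 0 == 0, exits  */
        commit_off[commit_page] = write_off[commit_page];
        commit_page = (commit_page + 1) % NR_PAGES;
    }
    nested_interrupt();                   /* moves tail_page to page 1 */
    while (commit_off[commit_page] != write_off[commit_page])
        commit_off[commit_page] = write_off[commit_page];

    /* commit_page now lags tail_page for good: the 'commit' value was
     * updated, but no later event can land on the old commit page, so
     * nothing is ever the commit event again and the buffer soon
     * reports itself full. */
    printf("tail_page=%d commit_page=%d\n", tail_page, commit_page);
    return 0;
}

The model ends with tail_page one page ahead of commit_page, which is the
stuck state described above.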
A possible fix, I think, is to update the commit pointer of the commit_page
both before and after updating the commit_page in rb_set_commit_to_write().
Here is the proposed fix. Please let me know if my analysis makes sense
to you. Thanks a lot!
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7f69cfe..b345ba7 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -839,6 +839,12 @@ rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
* back to us). This allows us to do a simple loop to
* assign the commit to the tail.
*/
+ while (rb_commit_index(cpu_buffer) !=
+ rb_page_write(cpu_buffer->commit_page)) {
+ cpu_buffer->commit_page->page->commit =
+ cpu_buffer->commit_page->write;
+ barrier();
+ }
while (cpu_buffer->commit_page != cpu_buffer->tail_page) {
cpu_buffer->commit_page->page->commit =
cpu_buffer->commit_page->write;
Jiaying
On Fri, Dec 12, 2008 at 4:26 PM, Jiaying Zhang <[email protected]> wrote:
> Hi Steve,
>
> I am doing some load testing with our kernel tracing prototype
> that uses the unified trace buffer for managing its data. I sometimes
> saw a kernel stack dump caused by the following check in
> the function __rb_reserve_next:
> if (unlikely(next_page == cpu_buffer->commit_page)) {
> WARN_ON_ONCE(1);
> goto out_unlock;
> }
> The comments above the code say the problem is caused by
> "an interrupt storm that made it all the way around the buffer".
> But I think there is a race here where a single interrupt can cause
> the check to fail. Suppose this is what happens:
> An event is traced and calls __rb_reserve_next. Right after it
> gets the current tail_page (line tail_page = cpu_buffer->tail_page;),
> an interrupt happens that is also traced. The interrupt also takes
> the same tail_page. The interrupt event moves the tail_page
> forward if the tail_page is full. Note that the interrupt event gets
> the old 'write' value because the first event has not updated that yet.
> So the interrupt event may also update the commit_page if it is
> the same as the tail_page. As a result, the above check would
> fail after the interrupt finishes and the first event resumes its execution.
>
> I have seen the problem happen frequently under heavy load
> on a multi-core machine. Interestingly, I also saw instances of the above
> warning that might actually have been caused by an interrupt storm.
> I was using a 64k buffer size and am not sure whether it is possible
> for so many interrupts to happen in such a short time window.
>
> I think we can use the time_stamp to distinguish the two cases.
> Also, in either case, it seems bad to leave the tail_page->write with
> an invalid value because it can cause problems when a reader
> reads the page. Here is my proposed fix for the problem:
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 7f69cfe..1500f78 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -982,8 +982,11 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
> * it all the way around the buffer, bail, and warn
> * about it.
> */
> - if (unlikely(next_page == cpu_buffer->commit_page)) {
> + if (unlikely(next_page == cpu_buffer->commit_page) &&
> + tail_page->time_stamp > next_page->time_stamp) {
> WARN_ON_ONCE(1);
> + if (tail <= BUF_PAGE_SIZE)
> + local_set(&tail_page->write, tail);
> goto out_unlock;
> }
>
>
> Jiaying
>