2020-07-01 17:08:07

by Raghavendra Rao Ananta

Subject: Tracing: rb_head_page_deactivate() caught in an infinite loop

Hi Steven and Mingo,

While trying to adjust the buffer size (echo <size> >
/sys/kernel/debug/tracing/buffer_size_kb), we see that the kernel gets
caught in an infinite loop while traversing the "cpu_buffer->pages"
list in rb_head_page_deactivate().

It looks like the last node of the list may be uninitialized, leading
to endless traversal. From the data we captured:
000|rb_head_page_deactivate(inline)
| cpu_buffer = 0xFFFFFF8000671600 =
kernel_size_le_lo32+0xFFFFFF652F6EE600 -> (
...
| pages = 0xFFFFFF80A909D980 =
kernel_size_le_lo32+0xFFFFFF65D811A980 -> (
| next = 0xFFFFFF80A909D200 =
kernel_size_le_lo32+0xFFFFFF65D811A200 -> (
| next = 0xFFFFFF80A909D580 =
kernel_size_le_lo32+0xFFFFFF65D811A580 -> (
| next = 0xFFFFFF8138D1CD00 =
kernel_size_le_lo32+0xFFFFFF6667D99D00 -> (
| next = 0xFFFFFF80006716F0 =
kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
| next = 0xFFFFFF80006716F0 =
kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
| next = 0xFFFFFF80006716F0 =
kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
| next = 0xFFFFFF80006716F0 =
kernel_size_le_lo32+0xFFFFFF652F6EE6F0,

I wanted to check with you whether there is any scenario that could
lead us into this state.

Test details:
-- Arch: arm64
-- Kernel version 5.4.30; running on Android
-- Test case: Running the following set of commands across reboots
leads us to the scenario

atrace --async_start -z -c -b 120000 sched audio irq idle freq
< Run any workload here >
atrace --async_dump -z -c -b 1200000 sched audio irq idle freq >
mytrace.trace
atrace --async_stop > /dev/null
echo 150000 > /sys/kernel/debug/tracing/buffer_size_kb
echo 200000 > /sys/kernel/debug/tracing/buffer_size_kb
reboot

Repeating the above lines across reboots would reproduce the issue.
The "atrace" or "echo" would just get stuck while resizing the buffer
size.
I'll try to reproduce the issue without atrace as well, but I'm
wondering what could be the reason for ending up in this state.

Thank you.
Raghavendra


2020-07-02 02:05:56

by Steven Rostedt

Subject: Re: Tracing: rb_head_page_deactivate() caught in an infinite loop

On Wed, 01 Jul 2020 10:07:06 -0700
[email protected] wrote:

> Hi Steven and Mingo,
>

Hi Raghavendra,


> While trying to adjust the buffer size (echo <size> >
> /sys/kernel/debug/tracing/buffer_size_kb), we see that the kernel gets
> caught up in an infinite loop
> while traversing the "cpu_buffer->pages" list in
> rb_head_page_deactivate().
>
> Looks like the last node of the list could be uninitialized, thus
> leading to infinite traversal. From the data that we captured:
> 000|rb_head_page_deactivate(inline)
> | cpu_buffer = 0xFFFFFF8000671600 =
> kernel_size_le_lo32+0xFFFFFF652F6EE600 -> (
> ...
> | pages = 0xFFFFFF80A909D980 =
> kernel_size_le_lo32+0xFFFFFF65D811A980 -> (
> | next = 0xFFFFFF80A909D200 =
> kernel_size_le_lo32+0xFFFFFF65D811A200 -> (
> | next = 0xFFFFFF80A909D580 =
> kernel_size_le_lo32+0xFFFFFF65D811A580 -> (
> | next = 0xFFFFFF8138D1CD00 =
> kernel_size_le_lo32+0xFFFFFF6667D99D00 -> (
> | next = 0xFFFFFF80006716F0 =
> kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
> | next = 0xFFFFFF80006716F0 =
> kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
> | next = 0xFFFFFF80006716F0 =
> kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
> | next = 0xFFFFFF80006716F0 =
> kernel_size_le_lo32+0xFFFFFF652F6EE6F0,
>
> Wanted to check with you if there's any scenario that could lead us into
> this state.
>
> Test details:
> -- Arch: arm64
> -- Kernel version 5.4.30; running on Android
> -- Test case: Running the following set of commands across reboot will
> lead us to the scenario
>
> atrace --async_start -z -c -b 120000 sched audio irq idle freq
> < Run any workload here >
> atrace --async_dump -z -c -b 1200000 sched audio irq idle freq >
> mytrace.trace
> atrace --async_stop > /dev/null
> echo 150000 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 200000 > /sys/kernel/debug/tracing/buffer_size_kb
> reboot
>
> Repeating the above lines across reboots would reproduce the issue.
> The "atrace" or "echo" would just get stuck while resizing the buffer
> size.

What do you mean by repeat across reboots? If it doesn't happen it
won't ever happen, but if you reboot it may happen again?

> I'll try to reproduce the issue without atrace as well, but wondering
> what could be the reason for leading us to this state.

I haven't used arm lately, and I'm unfamiliar with atrace. So I don't
really know what is going on. If you can reproduce this with just a
shell script accessing the ftrace files, that would be much more useful.

Thanks,

-- Steve

2020-07-02 19:53:39

by Raghavendra Rao Ananta

Subject: Re: Tracing: rb_head_page_deactivate() caught in an infinite loop

Hi Steven,

Thank you for responding.

On 2020-07-01 19:03, Steven Rostedt wrote:
> On Wed, 01 Jul 2020 10:07:06 -0700
> [email protected] wrote:
>
>> Hi Steven and Mingo,
>>
>
> Hi Raghavendra,
>
>
>> While trying to adjust the buffer size (echo <size> >
>> /sys/kernel/debug/tracing/buffer_size_kb), we see that the kernel gets
>> caught up in an infinite loop
>> while traversing the "cpu_buffer->pages" list in
>> rb_head_page_deactivate().
>>
>> Looks like the last node of the list could be uninitialized, thus
>> leading to infinite traversal. From the data that we captured:
>> 000|rb_head_page_deactivate(inline)
>> | cpu_buffer = 0xFFFFFF8000671600 =
>> kernel_size_le_lo32+0xFFFFFF652F6EE600 -> (
>> ...
>> | pages = 0xFFFFFF80A909D980 =
>> kernel_size_le_lo32+0xFFFFFF65D811A980 -> (
>> | next = 0xFFFFFF80A909D200 =
>> kernel_size_le_lo32+0xFFFFFF65D811A200 -> (
>> | next = 0xFFFFFF80A909D580 =
>> kernel_size_le_lo32+0xFFFFFF65D811A580 -> (
>> | next = 0xFFFFFF8138D1CD00 =
>> kernel_size_le_lo32+0xFFFFFF6667D99D00 -> (
>> | next = 0xFFFFFF80006716F0 =
>> kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
>> | next = 0xFFFFFF80006716F0 =
>> kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
>> | next = 0xFFFFFF80006716F0 =
>> kernel_size_le_lo32+0xFFFFFF652F6EE6F0 -> (
>> | next = 0xFFFFFF80006716F0 =
>> kernel_size_le_lo32+0xFFFFFF652F6EE6F0,
>>
>> Wanted to check with you if there's any scenario that could lead us
>> into
>> this state.
>>
>> Test details:
>> -- Arch: arm64
>> -- Kernel version 5.4.30; running on Android
>> -- Test case: Running the following set of commands across reboot will
>> lead us to the scenario
>>
>> atrace --async_start -z -c -b 120000 sched audio irq idle freq
>> < Run any workload here >
>> atrace --async_dump -z -c -b 1200000 sched audio irq idle freq >
>> mytrace.trace
>> atrace --async_stop > /dev/null
>> echo 150000 > /sys/kernel/debug/tracing/buffer_size_kb
>> echo 200000 > /sys/kernel/debug/tracing/buffer_size_kb
>> reboot
>>
>> Repeating the above lines across reboots would reproduce the issue.
>> The "atrace" or "echo" would just get stuck while resizing the buffer
>> size.
>
> What do you mean repeat across reboots? If it doesn't happen it won't
> ever happen, but if you reboot it may have it happen again?
>
Actually, ignore the reboot part. I was able to reproduce it without
the reboot.
>> I'll try to reproduce the issue without atrace as well, but wondering
>> what could be the reason for leading us to this state.
>
> I haven't used arm lately, and I'm unfamiliar with atrace. So I don't
> really know what is going on. If you can reproduce this with just a
> shell script accessing the ftrace files, that would be much more
> useful.
>
I tried to reproduce it without atrace, but couldn't. It's a simple
wrapper utility to collect ftrace output. You can find the source here:
https://android.googlesource.com/platform/system/extras/+/jb-mr1-dev-plus-aosp/atrace/atrace.c
> Thanks,
>
> -- Steve