2021-07-22 22:45:31

by Stefan Metzmacher

Subject: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)


Hi Steve,

After some days of training:
https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
I was able to get much closer to the problem :-)

In order to reproduce it and get reliable kexec crash dumps,
I needed to give the VM at least 3 cores.

While running './io_uring-cp-forever link-cp.c file' (from:
https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
in one window, the following simple sequence triggered the problem in most cases:

echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
echo 1 > /sys/kernel/tracing/set_event_pid

It triggered the following:

> [ 192.924023] ------------[ cut here ]------------
> [ 192.924026] WARNING: CPU: 2 PID: 1696 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x33/0xc0
> [ 192.924034] Modules linked in: vboxsf intel_rapl_msr intel_rapl_common crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl vboxvideo drm_vram_helper drm_ttm_helper ttm snd_intel8x0 input_leds snd_ac97_codec joydev ac97_bus serio_raw snd_pcm drm_kms_helper cec mac_hid vboxguest snd_timer rc_core snd fb_sys_fops syscopyarea sysfillrect soundcore sysimgblt kvm_intel kvm sch_fq_codel drm sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul psmouse ahci libahci e1000 i2c_piix4 pata_acpi video
> [ 192.924068] CPU: 2 PID: 1696 Comm: io_uring-cp-for Kdump: loaded Not tainted 5.13.0-1007-oem #7-Ubuntu
> [ 192.924071] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 192.924072] RIP: 0010:kfence_protect_page+0x33/0xc0
> [ 192.924075] Code: 53 89 f3 48 8d 75 e4 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 e8 98 1f da ff 48 85 c0 74 06 83 7d e4 01 74 06 <0f> 0b 31 c0 eb 39 48 8b 38 48 89 c2 84 db 75 47 48 89 f8 0f 1f 40
> [ 192.924077] RSP: 0018:ffff980c0077f918 EFLAGS: 00010046
> [ 192.924079] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffbcc10000
> [ 192.924080] RDX: ffff980c0077f91c RSI: 0000000000032000 RDI: 0000000000000000
> [ 192.924082] RBP: ffff980c0077f938 R08: 0000000000000000 R09: 0000000000000000
> [ 192.924083] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000032000
> [ 192.924084] R13: 0000000000000000 R14: ffff980c0077fb58 R15: 0000000000000000
> [ 192.924085] FS: 00007f2491207540(0000) GS:ffff8ccb5bd00000(0000) knlGS:0000000000000000
> [ 192.924087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 192.924088] CR2: 00000000000325f8 CR3: 0000000102572005 CR4: 00000000000706e0
> [ 192.924091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 192.924092] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 192.924093] Call Trace:
> [ 192.924096] kfence_unprotect+0x17/0x30
> [ 192.924099] kfence_handle_page_fault+0x97/0x250
> [ 192.924102] ? cmp_ex_sort+0x30/0x30
> [ 192.924104] page_fault_oops+0xa0/0x2a0
> [ 192.924106] ? trace_event_buffer_reserve+0x22/0xb0
> [ 192.924110] ? search_exception_tables+0x5b/0x60
> [ 192.924113] kernelmode_fixup_or_oops+0x92/0xf0
> [ 192.924115] __bad_area_nosemaphore+0x14d/0x190
> [ 192.924116] __bad_area+0x5f/0x80
> [ 192.924118] bad_area+0x16/0x20
> [ 192.924119] do_user_addr_fault+0x368/0x640
> [ 192.924121] ? aa_file_perm+0x11d/0x470
> [ 192.924123] exc_page_fault+0x7d/0x170
> [ 192.924127] asm_exc_page_fault+0x1e/0x30
> [ 192.924129] RIP: 0010:trace_event_buffer_reserve+0x22/0xb0
> [ 192.924132] Code: 00 00 00 00 0f 1f 40 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 4c 8b 76 10 f6 46 49 02 74 29 48 8b 46 28 <48> 8b 88 b8 00 00 00 48 8b 90 c0 00 00 00 48 09 d1 74 12 48 8b 40
> [ 192.924134] RSP: 0018:ffff980c0077fc00 EFLAGS: 00010002
> [ 192.924135] RAX: 0000000000032540 RBX: ffff980c0077fc38 RCX: 0000000000000002
> [ 192.924136] RDX: 0000000000000028 RSI: ffffffffbcdb7e80 RDI: ffff980c0077fc38
> [ 192.924137] RBP: ffff980c0077fc20 R08: ffff8ccb46814cb8 R09: 0000000000000010
> [ 192.924138] R10: 000000007ffff000 R11: 0000000000000000 R12: ffffffffbcdb7e80
> [ 192.924140] R13: 0000000000000028 R14: 0000000000000000 R15: ffff8ccb49b6ed0c
> [ 192.924142] trace_event_raw_event_sched_wakeup_template+0x63/0xf0
> [ 192.924146] try_to_wake_up+0x285/0x570
> [ 192.924148] wake_up_process+0x15/0x20
> [ 192.924149] io_wqe_activate_free_worker+0x5b/0x70
> [ 192.924152] io_wqe_enqueue+0xfb/0x190
> [ 192.924154] io_wq_enqueue+0x1e/0x20
> [ 192.924156] io_queue_async_work+0xa0/0x120
> [ 192.924158] __io_queue_sqe+0x17e/0x360
> [ 192.924160] ? io_req_free_batch_finish+0x8d/0xf0
> [ 192.924162] io_queue_sqe+0x199/0x2a0
> [ 192.924164] io_submit_sqes+0x70d/0x7f0
> [ 192.924166] __x64_sys_io_uring_enter+0x1b8/0x3d0
> [ 192.924168] do_syscall_64+0x40/0xb0
> [ 192.924170] entry_SYSCALL_64_after_hwframe+0x44/0xae
> [ 192.924172] RIP: 0033:0x7f249112f89d
> [ 192.924174] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> [ 192.924179] RSP: 002b:00007fff56c491c8 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> [ 192.924180] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f249112f89d
> [ 192.924182] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000005
> [ 192.924182] RBP: 00007fff56c49200 R08: 0000000000000000 R09: 0000000000000008
> [ 192.924183] R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
> [ 192.924184] R13: 00007fff56c49380 R14: 0000000000000dcf R15: 000055c533b6f2a0
> [ 192.924186] ---[ end trace d1211902aae73d20 ]---

The problem seems to happen in this line of trace_event_ignore_this_pid():

pid_list = rcu_dereference_raw(tr->filtered_pids);

It seems it got inlined within trace_event_buffer_reserve()
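
For context, the surrounding function looks roughly like this (condensed
from my reading of the 5.13 sources, so the details may differ slightly);
the bad dereference is of 'tr', which comes in via the trace_event_file:

bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
{
	struct trace_array *tr = trace_file->tr;	/* tr taken from the file */
	struct trace_array_cpu *data;
	struct trace_pid_list *no_pid_list;
	struct trace_pid_list *pid_list;

	pid_list = rcu_dereference_raw(tr->filtered_pids);	/* <-- faults here */
	no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);

	if (!pid_list && !no_pid_list)
		return false;

	data = this_cpu_ptr(tr->array_buffer.data);

	return data->ignore_pid;
}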

The strangest thing I found so far is this:

crash> sym global_trace
ffffffffbcdb7e80 (d) global_trace
crash> list ftrace_trace_arrays
ffffffffbcdb7e70
ffffffffbcdb7e80

trace_array has size 7672, but ffffffffbcdb7e70 is only 16 bytes away from
ffffffffbcdb7e80.

Also this:

crash> struct trace_array.events -o global_trace
struct trace_array {
[ffffffffbcdb9be0] struct list_head events;
}
crash> list -s trace_event_file.tr -o trace_event_file.list ffffffffbcdb9be0
ffffffffbcdb9be0
tr = 0xffffffffbcdb7b20
ffff8ccb45cdfb00
tr = 0xffffffffbcdb7e80
ffff8ccb45cdf580
tr = 0xffffffffbcdb7e80
ffff8ccb45cdfe18
tr = 0xffffffffbcdb7e80
...
tr = 0xffffffffbcdb7e80

The first one 0xffffffffbcdb7b20 is only 864 bytes away from 0xffffffffbcdb7e80

Additional information can be found here:
https://www.samba.org/~metze/202107221802.reproduce-freeze-05-first-time.v3-pid1/
-rw-r--r-- 1 metze metze 37250108 Jul 22 18:02 dump.202107221802
-rw-r--r-- 1 metze metze 63075 Jul 22 18:02 dmesg.202107221802
-rw-r--r-- 1 metze metze 8820 Jul 22 18:13 crash-bt-a-s.txt
-rw-r--r-- 1 metze metze 36586 Jul 22 18:14 crash-bt-a-l-FF-s.txt
-rw-r--r-- 1 metze metze 4798 Jul 22 23:49 crash-bt-p.txt
-rw-r--r-- 1 metze metze 946 Jul 23 00:13 strange-things.txt
-rw-r--r-- 1 metze metze 482 Jul 23 00:24 code-and-vmlinux.txt

The code can be found here:
https://kernel.ubuntu.com/git/kernel-ppa/mirror/ubuntu-oem-5.10-focal.git/tag/?h=Ubuntu-oem-5.13-5.13.0-1007.7

And the /usr/lib/debug/boot/vmlinux-5.13.0-1007-oem file can be extracted from here:
http://ddebs.ubuntu.com/pool/main/l/linux-oem-5.13/linux-image-unsigned-5.13.0-1007-oem-dbgsym_5.13.0-1007.7_amd64.ddeb

I also needed the latest code from https://github.com/crash-utility/crash.git
(at commit f53b73e8380bca054cebd2b61ff118c46609429b)

It would be really great if you (or anyone else on the lists) could have a closer look
in order to get the problem fixed :-)

I've learned a lot this week and tried hard to find the problem myself,
but I have to move back to other work for now.

Thanks!
metze


Attachments:
OpenPGP_signature (849.00 B)
OpenPGP digital signature

2021-07-23 01:42:40

by Steven Rostedt

Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)

On Fri, 23 Jul 2021 00:43:13 +0200
Stefan Metzmacher <[email protected]> wrote:

> Hi Steve,

Hi Stefan,

>
> After some days of training:
> https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
> I was able to get much closer to the problem :-)
>
> In order to reproduce it and get reliable kexec crash dumps,
> I needed to give the VM at least 3 cores.
>
> While running './io-uring_cp-forever link-cp.c file' (from:
> https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
> in one window, the following simple sequence triggered the problem in most cases:
>
> echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
> echo 1 > /sys/kernel/tracing/set_event_pid

I was able to reproduce it by running hackbench in a while loop and,
in another terminal, executing the above two lines.

I think I found the bug. Can you test this patch?

Thanks,

-- Steve

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 976bf8ce8039..fc32821f8240 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -299,8 +299,8 @@ static int tracepoint_add_func(struct tracepoint *tp,
 	 * a pointer to it. This array is referenced by __DO_TRACE from
 	 * include/linux/tracepoint.h using rcu_dereference_sched().
 	 */
-	rcu_assign_pointer(tp->funcs, tp_funcs);
 	tracepoint_update_call(tp, tp_funcs, false);
+	rcu_assign_pointer(tp->funcs, tp_funcs);
 	static_key_enable(&tp->key);
 
 	release_probes(old);

2021-07-23 02:55:07

by Steven Rostedt

Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)

On Fri, 23 Jul 2021 00:43:13 +0200
Stefan Metzmacher <[email protected]> wrote:

> Hi Steve,
>
> After some days of training:
> https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
> I was able to get much closer to the problem :-)

Well, knowing what to look for was not going to be easy. And I'm sure
you were shocked to see what I posted as a fix ;-)

Assuming this does fix your issue, I sent out a real patch with the
explanation of what happened in the change log, so that you can see why
that change was your issue.

>
> In order to reproduce it and get reliable kexec crash dumps,
> I needed to give the VM at least 3 cores.

Yes, it did require running this on multiple CPUs to have the race
condition trigger, and with only two cores it would have been hard to
hit. I ran it on 8 cores and it triggered rather easily.

>
> While running './io-uring_cp-forever link-cp.c file' (from:
> https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
> in one window, the following simple sequence triggered the problem in most cases:
>
> echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
> echo 1 > /sys/kernel/tracing/set_event_pid

All it took was something busy that did a lot of wakeups while setting
the set_event_pid, to be able to hit the race easily. As I stated, I
triggered it with running hackbench instead of the io-uring code. In
fact, this bug had nothing to do with io-uring, and you were only
triggering it because you were making enough of a load on the system to
make the race happen often.

>
> It triggered the following:
>
> > [ 192.924023] ------------[ cut here ]------------
> > [ 192.924026] WARNING: CPU: 2 PID: 1696 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x33/0xc0
> > [ 192.924034] Modules linked in: vboxsf intel_rapl_msr intel_rapl_common crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl vboxvideo drm_vram_helper drm_ttm_helper ttm snd_intel8x0 input_leds snd_ac97_codec joydev ac97_bus serio_raw snd_pcm drm_kms_helper cec mac_hid vboxguest snd_timer rc_core snd fb_sys_fops syscopyarea sysfillrect soundcore sysimgblt kvm_intel kvm sch_fq_codel drm sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul psmouse ahci libahci e1000 i2c_piix4 pata_acpi video
> > [ 192.924068] CPU: 2 PID: 1696 Comm: io_uring-cp-for Kdump: loaded Not tainted 5.13.0-1007-oem #7-Ubuntu
> > [ 192.924071] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> > [ 192.924072] RIP: 0010:kfence_protect_page+0x33/0xc0
> > [ 192.924075] Code: 53 89 f3 48 8d 75 e4 48 83 ec 10 65 48 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 e8 98 1f da ff 48 85 c0 74 06 83 7d e4 01 74 06 <0f> 0b 31 c0 eb 39 48 8b 38 48 89 c2 84 db 75 47 48 89 f8 0f 1f 40
> > [ 192.924077] RSP: 0018:ffff980c0077f918 EFLAGS: 00010046
> > [ 192.924079] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffbcc10000
> > [ 192.924080] RDX: ffff980c0077f91c RSI: 0000000000032000 RDI: 0000000000000000
> > [ 192.924082] RBP: ffff980c0077f938 R08: 0000000000000000 R09: 0000000000000000
> > [ 192.924083] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000032000
> > [ 192.924084] R13: 0000000000000000 R14: ffff980c0077fb58 R15: 0000000000000000
> > [ 192.924085] FS: 00007f2491207540(0000) GS:ffff8ccb5bd00000(0000) knlGS:0000000000000000
> > [ 192.924087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 192.924088] CR2: 00000000000325f8 CR3: 0000000102572005 CR4: 00000000000706e0
> > [ 192.924091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 192.924092] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 192.924093] Call Trace:
> > [ 192.924096] kfence_unprotect+0x17/0x30
> > [ 192.924099] kfence_handle_page_fault+0x97/0x250
> > [ 192.924102] ? cmp_ex_sort+0x30/0x30
> > [ 192.924104] page_fault_oops+0xa0/0x2a0
> > [ 192.924106] ? trace_event_buffer_reserve+0x22/0xb0
> > [ 192.924110] ? search_exception_tables+0x5b/0x60
> > [ 192.924113] kernelmode_fixup_or_oops+0x92/0xf0
> > [ 192.924115] __bad_area_nosemaphore+0x14d/0x190
> > [ 192.924116] __bad_area+0x5f/0x80
> > [ 192.924118] bad_area+0x16/0x20
> > [ 192.924119] do_user_addr_fault+0x368/0x640
> > [ 192.924121] ? aa_file_perm+0x11d/0x470
> > [ 192.924123] exc_page_fault+0x7d/0x170
> > [ 192.924127] asm_exc_page_fault+0x1e/0x30
> > [ 192.924129] RIP: 0010:trace_event_buffer_reserve+0x22/0xb0
> > [ 192.924132] Code: 00 00 00 00 0f 1f 40 00 55 48 89 e5 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 89 fb 4c 8b 76 10 f6 46 49 02 74 29 48 8b 46 28 <48> 8b 88 b8 00 00 00 48 8b 90 c0 00 00 00 48 09 d1 74 12 48 8b 40
> > [ 192.924134] RSP: 0018:ffff980c0077fc00 EFLAGS: 00010002
> > [ 192.924135] RAX: 0000000000032540 RBX: ffff980c0077fc38 RCX: 0000000000000002
> > [ 192.924136] RDX: 0000000000000028 RSI: ffffffffbcdb7e80 RDI: ffff980c0077fc38
> > [ 192.924137] RBP: ffff980c0077fc20 R08: ffff8ccb46814cb8 R09: 0000000000000010
> > [ 192.924138] R10: 000000007ffff000 R11: 0000000000000000 R12: ffffffffbcdb7e80
> > [ 192.924140] R13: 0000000000000028 R14: 0000000000000000 R15: ffff8ccb49b6ed0c
> > [ 192.924142] trace_event_raw_event_sched_wakeup_template+0x63/0xf0
> > [ 192.924146] try_to_wake_up+0x285/0x570
> > [ 192.924148] wake_up_process+0x15/0x20
> > [ 192.924149] io_wqe_activate_free_worker+0x5b/0x70
> > [ 192.924152] io_wqe_enqueue+0xfb/0x190
> > [ 192.924154] io_wq_enqueue+0x1e/0x20
> > [ 192.924156] io_queue_async_work+0xa0/0x120
> > [ 192.924158] __io_queue_sqe+0x17e/0x360
> > [ 192.924160] ? io_req_free_batch_finish+0x8d/0xf0
> > [ 192.924162] io_queue_sqe+0x199/0x2a0
> > [ 192.924164] io_submit_sqes+0x70d/0x7f0
> > [ 192.924166] __x64_sys_io_uring_enter+0x1b8/0x3d0
> > [ 192.924168] do_syscall_64+0x40/0xb0
> > [ 192.924170] entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [ 192.924172] RIP: 0033:0x7f249112f89d
> > [ 192.924174] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c3 f5 0c 00 f7 d8 64 89 01 48
> > [ 192.924179] RSP: 002b:00007fff56c491c8 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> > [ 192.924180] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f249112f89d
> > [ 192.924182] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000005
> > [ 192.924182] RBP: 00007fff56c49200 R08: 0000000000000000 R09: 0000000000000008
> > [ 192.924183] R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
> > [ 192.924184] R13: 00007fff56c49380 R14: 0000000000000dcf R15: 000055c533b6f2a0
> > [ 192.924186] ---[ end trace d1211902aae73d20 ]---
>
> The problem seems to happen in this line of trace_event_ignore_this_pid():
>
> pid_list = rcu_dereference_raw(tr->filtered_pids);

That "tr" comes from the trace_event_file that is passed in by the
"data" field of the callback. Hence, this callback got the data field
of the event_filter_pid_sched_wakeup_probe_pre() callback that is
called before all events when the set_event_pid file is set. That
means, the "tr" being dereferened was not the "tr" you were looking for.

>
> It seems it got inlined within trace_event_buffer_reserve()
>
> The strangest thing I found so far is this:
>
> crash> sym global_trace
> ffffffffbcdb7e80 (d) global_trace
> crash> list ftrace_trace_arrays
> ffffffffbcdb7e70
> ffffffffbcdb7e80
>
> trace_array has size 7672, but ffffffffbcdb7e70 is only 16 bytes away from
> ffffffffbcdb7e80.

ftrace_trace_arrays is a list_head, and I doubt you created any
instances, thus the list has only one entry, and that is
global_trace. Hence, it points to global_trace and itself. It just so
happens that a list_head is 16 bytes.
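
As a rough illustration (plain user-space C, not the kernel's list
implementation; the struct and variable names here are made up), this is
essentially what the un-flagged list walk sees:

/* A list_head is just two pointers (16 bytes on x86-64), and walking the
 * circular list starting at the head itself (what "crash> list" does
 * without -H) reports the head's own address as if it were an entry. */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

struct entry {
	struct list_head list;
	const char *name;
};

static struct list_head head;			/* stand-in for ftrace_trace_arrays */
static struct entry global = {			/* stand-in for global_trace */
	.list = { .next = &head, .prev = &head },
	.name = "global_trace",
};
static struct list_head head = { &global.list, &global.list };

int main(void)
{
	const struct list_head *p = &head;

	printf("sizeof(struct list_head) = %zu\n", sizeof(head));

	do {	/* walk the circle once, starting at the head itself */
		printf("node at %p%s\n", (const void *)p,
		       p == &head ? "  <- the head, not a real entry" : "");
		p = p->next;
	} while (p != &head);

	return 0;
}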


>
> Also this:
>
> crash> struct trace_array.events -o global_trace
> struct trace_array {
> [ffffffffbcdb9be0] struct list_head events;
> }
> crash> list -s trace_event_file.tr -o trace_event_file.list ffffffffbcdb9be0
> ffffffffbcdb9be0
> tr = 0xffffffffbcdb7b20
> ffff8ccb45cdfb00
> tr = 0xffffffffbcdb7e80
> ffff8ccb45cdf580
> tr = 0xffffffffbcdb7e80
> ffff8ccb45cdfe18
> tr = 0xffffffffbcdb7e80
> ...
> tr = 0xffffffffbcdb7e80
>
> The first one 0xffffffffbcdb7b20 is only 864 bytes away from 0xffffffffbcdb7e80

I'm thinking it is confused by hitting the ftrace_trace_arrays
list_head itself.

>
> Additional information can be found here:
> https://www.samba.org/~metze/202107221802.reproduce-freeze-05-first-time.v3-pid1/
> -rw-r--r-- 1 metze metze 37250108 Jul 22 18:02 dump.202107221802
> -rw-r--r-- 1 metze metze 63075 Jul 22 18:02 dmesg.202107221802
> -rw-r--r-- 1 metze metze 8820 Jul 22 18:13 crash-bt-a-s.txt
> -rw-r--r-- 1 metze metze 36586 Jul 22 18:14 crash-bt-a-l-FF-s.txt
> -rw-r--r-- 1 metze metze 4798 Jul 22 23:49 crash-bt-p.txt
> -rw-r--r-- 1 metze metze 946 Jul 23 00:13 strange-things.txt
> -rw-r--r-- 1 metze metze 482 Jul 23 00:24 code-and-vmlinux.txt
>
> The code can be found here:
> https://kernel.ubuntu.com/git/kernel-ppa/mirror/ubuntu-oem-5.10-focal.git/tag/?h=Ubuntu-oem-5.13-5.13.0-1007.7
>
> And the /usr/lib/debug/boot/vmlinux-5.13.0-1007-oem file can be extracted from here:
> http://ddebs.ubuntu.com/pool/main/l/linux-oem-5.13/linux-image-unsigned-5.13.0-1007-oem-dbgsym_5.13.0-1007.7_amd64.ddeb
>
> I also needed the latest code from https://github.com/crash-utility/crash.git
> (at commit f53b73e8380bca054cebd2b61ff118c46609429b)
>
> It would be really great if you (or anyone else on the lists) could have a closer look
> in order to get the problem fixed :-)

Once I triggered it and started looking at what was happening, it
didn't take me too long to figure out where the culprit was.

>
> I've learned a lot this week and tried hard to find the problem myself,
> but I have to move back to other work for now.
>

I'm glad you learned a lot. There's a lot of complex code in there, and
it's getting more complex, as you can see with the static_calls.

This is because tracing tries hard to avoid the heisenbug effect. You
know, you see a bug, turn on tracing, and then that bug goes away!

Thus it pulls out all the tricks it can to be as unintrusive on the
system as it can be. And that causes things to get complex really quickly.

Cheers, and thanks for keeping up on this bug!

-- Steve

2021-07-23 10:36:55

by Stefan Metzmacher

Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)


Hi Steve,
>> After some days of training:
>> https://training.linuxfoundation.org/training/linux-kernel-debugging-and-security/
>> I was able to get much closer to the problem :-)
>
> Well, knowing what to look for was not going to be easy.
>
> And I'm sure you were shocked to see what I posted as a fix ;-)

Not really, such problems typically just have one-line fixes :-)

> Assuming this does fix your issue, I sent out a real patch with the
> explanation of what happened in the change log, so that you can see why
> that change was your issue.

Yes, it does the trick, thanks very much!

>> In order to reproduce it and get reliable kexec crash dumps,
>> I needed to give the VM at least 3 cores.
>
> Yes, it did require having this run on multiple CPUs to have a race
> condition trigger, and two cores would have been hard to hit it. I ran
> it on 8 cores and it triggered rather easily.

I mean I was able to trigger the problem with 2 cores, but then
the machine was completely stuck, without triggering the kexec reboot
to get the crash dump. I don't understand why, as the additional CPU
is not really involved at all...

Any idea why kexec could be so unreliable?
(even with 3 or 6 CPUs it sometimes just didn't trigger)

>> While running './io-uring_cp-forever link-cp.c file' (from:
>> https://github.com/metze-samba/liburing/commits/io_uring-cp-forever )
>> in one window, the following simple sequence triggered the problem in most cases:
>>
>> echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
>> echo 1 > /sys/kernel/tracing/set_event_pid
>
> All it took was something busy that did a lot of wakeups while setting
> the set_event_pid, to be able to hit the race easily. As I stated, I
> triggered it with running hackbench instead of the io-uring code. In
> fact, this bug had nothing to do with io-uring, and you were only
> triggering it because you were making enough of a load on the system to
> make the race happen often.

Yes, it was just the example that made it most reliable to trigger for me.

>> pid_list = rcu_dereference_raw(tr->filtered_pids);
>
> That "tr" comes from the trace_event_file that is passed in by the
> "data" field of the callback. Hence, this callback got the data field
> of the event_filter_pid_sched_wakeup_probe_pre() callback that is
> called before all events when the set_event_pid file is set. That
> means, the "tr" being dereferened was not the "tr" you were looking for.

Yes, that's what I assumed, but didn't find the reason for.

>>
>> It seems it got inlined within trace_event_buffer_reserve()
>>
>> The strangest thing I found so far is this:
>>
>> crash> sym global_trace
>> ffffffffbcdb7e80 (d) global_trace
>> crash> list ftrace_trace_arrays
>> ffffffffbcdb7e70
>> ffffffffbcdb7e80
>>
>> trace_array has size 7672, but ffffffffbcdb7e70 is only 16 bytes away from
>> ffffffffbcdb7e80.
>
> ftrace_trace_arrays is a list_head, and I doubt you created any
> instances, thus the list head has only one instance, and that is
> global_trace. Hence, it points to global_trace and itself. It just so
> happens that a list_head is 16 bytes.

Ah, that explains a lot:

crash> struct list_head -o ftrace_trace_arrays
struct list_head {
[ffffffffbcdb7e70] struct list_head *next;
[ffffffffbcdb7e78] struct list_head *prev;
}
SIZE: 16
crash> list ftrace_trace_arrays
ffffffffbcdb7e70
ffffffffbcdb7e80

I guess this is what I want:

crash> list -H ftrace_trace_arrays
ffffffffbcdb7e80

>> Also this:
>>
>> crash> struct trace_array.events -o global_trace
>> struct trace_array {
>> [ffffffffbcdb9be0] struct list_head events;
>> }
>> crash> list -s trace_event_file.tr -o trace_event_file.list ffffffffbcdb9be0
>> ffffffffbcdb9be0
>> tr = 0xffffffffbcdb7b20
>> ffff8ccb45cdfb00
>> tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdf580
>> tr = 0xffffffffbcdb7e80
>> ffff8ccb45cdfe18
>> tr = 0xffffffffbcdb7e80
>> ...
>> tr = 0xffffffffbcdb7e80
>>
>> The first one 0xffffffffbcdb7b20 is only 864 bytes away from 0xffffffffbcdb7e80
>
> I'm thinking it is confused by hitting the ftrace_trace_arrays
> list_head itself.

Yes, I needed the -H here too:

list -s trace_event_file.tr -o trace_event_file.list -H ffffffffbcdb9be0


>> It would be really great if you (or anyone else on the lists) could have a closer look
>> in order to get the problem fixed :-)
>
> Once I triggered it and started looking at what was happening, it
> didn't take me too long to figure out where the culprit was.

Great!

>>
>> I've learned a lot this week and tried hard to find the problem myself,
>> but I have to move back to other work for now.
>>
>
> I'm glad you learned a lot. There's a lot of complex code in there, and
> it's getting more complex, as you can see with the static_calls.
>
> This is because tracing tries hard to avoid the heisenbug effect. You
> know, you see a bug, turn on tracing, and then that bug goes away!
>
> Thus it pulls out all the tricks it can to be as least intrusive on the
> system as it can be. And that causes things to get complex really quick.

I'll have a look; I may need something similar for my smbdirect socket driver
in the future.

> Cheers, and thanks for keeping up on this bug!

Thanks for fixing the bug!

Now I can finally use:

trace-cmd record -e all -P $(pidof io_uring-cp-forever)

But that doesn't include the iou-wrk-* threads
and the '-c' option seems to only work with forking.

Is there a way to specify "trace *all* threads of the given pid"?
(Note the threads are coming and going, so it's not possible to
specify -P more than once)

Thanks!
metze


Attachments:
OpenPGP_signature (849.00 B)
OpenPGP digital signature

2021-07-23 11:32:46

by Steven Rostedt

Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)

On Fri, 23 Jul 2021 12:35:09 +0200
Stefan Metzmacher <[email protected]> wrote:

> > Assuming this does fix your issue, I sent out a real patch with the
> > explanation of what happened in the change log, so that you can see why
> > that change was your issue.
>
> Yes, it does the trick, thanks very much!

Can I get a "Tested-by" from you?

> Now I can finally use:
>
> trace-cmd record -e all -P $(pidof io_uring-cp-forever)
>
> But that doesn't include the iou-wrk-* threads
> and the '-c' option seems to only work with forking.

Have you tried it? It should work for threads as well. It hooks to the
sched_process_fork tracepoint, which should be triggered even when a new
thread is created.

Or do you mean that you want that process and all the threads it
already has running? I could probably have it try to add them via the
/proc file system in that case.

Can you start the task via trace-cmd?

trace-cmd record -e all -F -c io_uring-cp-forever ...


>
> Is there a way to specify "trace *all* threads of the given pid"?
> (Note the threads are coming and going, so it's not possible to
> specify -P more than once)

Right, although you could append tasks manually to the set_event_pid file
from another terminal after starting trace-cmd. Once a pid is added to that
file, all children it makes will also be added. That could be a workaround
until we have trace-cmd do it.
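
Something like this would do it as a stopgap (an untested sketch; the
helper name and error handling are mine, this is not trace-cmd code):

/* Untested sketch: append every thread id currently listed under
 * /proc/<pid>/task/ to set_event_pid.  Children/threads created after
 * this point get added by the kernel's fork hook. */
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>

int seed_set_event_pid(pid_t pid)
{
	char path[64];
	struct dirent *de;
	DIR *dir;
	FILE *f = fopen("/sys/kernel/tracing/set_event_pid", "a");

	if (!f)
		return -1;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	dir = opendir(path);
	if (!dir) {
		fclose(f);
		return -1;
	}

	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;		/* skip "." and ".." */
		fprintf(f, "%s ", de->d_name);	/* file takes a space-separated pid list */
	}

	closedir(dir);
	fclose(f);
	return 0;
}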

Care to write a bugzilla report for this feature?

https://bugzilla.kernel.org/buglist.cgi?component=Trace-cmd%2FKernelshark&list_id=1088173

-- Steve

2021-07-23 11:55:50

by Stefan Metzmacher

Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)


Hi Steve,

>>> Assuming this does fix your issue, I sent out a real patch with the
>>> explanation of what happened in the change log, so that you can see why
>>> that change was your issue.
>>
>> Yes, it does the trick, thanks very much!
>
> Can I get a "Tested-by" from you?

Sure!

Can you check whether the static_key_disable() and static_key_enable()
calls are correctly ordered with respect to rcu_assign_pointer(), and
explain why they are? I don't really understand how that case differs
from the tracepoint_update_call() vs. rcu_assign_pointer() ordering.

>> Now I can finally use:
>>
>> trace-cmd record -e all -P $(pidof io_uring-cp-forever)
>>
>> But that doesn't include the iou-wrk-* threads
>> and the '-c' option seems to only work with forking.
>
> Have you tried it? It should work for threads as well. It hooks to the
> sched_process_fork tracepoint, which should be triggered even when a new
> thread is created.
>
> Or do you mean that you want that process and all its threads too that are
> already running? I could probably have it try to add it via the /proc file
> system in that case.
>
> Can you start the task via trace-cmd?
>
> trace-cmd record -e all -F -c io_uring-cp-forever ...

I think that would work, but I typically want to analyze a process
that is already running.

>> Is there a way to specify "trace *all* threads of the given pid"?
>> (Note the threads are coming and going, so it's not possible to
>> specify -P more than once)
>
> Right, although, you could append tasks manually to the set_event_pid file
> from another terminal after starting trace-cmd. Once a pid is added to that
> file, all children it makes will also be added. That could be a work around
> until we have trace-cmd do it.

Sure, but it will always be racy.

With children, do new threads also count as children, or only fork() children?

> Care to write a bugzilla report for this feature?
>
> https://bugzilla.kernel.org/buglist.cgi?component=Trace-cmd%2FKernelshark&list_id=1088173

I may do later...

Thanks!
metze


Attachments:
OpenPGP_signature (849.00 B)
OpenPGP digital signature

2021-07-23 12:45:16

by Steven Rostedt

Subject: Re: sched_waking vs. set_event_pid crash (Re: Tracing busy processes/threads freezes/stalls the whole machine)

On Fri, 23 Jul 2021 13:53:41 +0200
Stefan Metzmacher <[email protected]> wrote:

> Hi Steve,
>
> >>> Assuming this does fix your issue, I sent out a real patch with the
> >>> explanation of what happened in the change log, so that you can see why
> >>> that change was your issue.
> >>
> >> Yes, it does the trick, thanks very much!
> >
> > Can I get a "Tested-by" from you?
>
> Sure!

Thanks.

>
> Can you check if the static_key_disable() and static_key_enable()
> calls are correctly ordered compared to rcu_assign_pointer()
> and explain why they are, as I not really understand that it's different
> from tracepoint_update_call vs. rcu_assign_pointer

The order doesn't even matter. I'm assuming you are talking about the
static_key_disable/enable with respect to enabling the tracepoint.

The reason it doesn't matter is because the rcu_assign_pointer is
updating an array of elements that hold both the function to call and
the data it needs. Inside the tracepoint loop where the callbacks are
called, in the iterator code (not the static call), you will see:

it_func_ptr =							\
	rcu_dereference_raw((&__tracepoint_##_name)->funcs);	\
if (it_func_ptr) {						\
	do {							\
		it_func = READ_ONCE((it_func_ptr)->func);	\
		__data = (it_func_ptr)->data;			\
		((void(*)(void *, proto))(it_func))(__data, args); \
	} while ((++it_func_ptr)->func);			\
}

What that does is get either the old array or the new array and place
that array into it_func_ptr. Each element of this array contains
the callback (in .func) and the callback's data (in .data).

The enabling or disabling of the tracepoint doesn't really make a
difference with respect to the order of updating the funcs array.
That's because the users of this will either see the old array, the new
array, or NULL, in that it_func_ptr. This is how RCU works.

The bug we hit was because the static_call was updated separately from
the array. That makes it more imperative that you get the order
correct. As my email stated, with static_calls we have this:

it_func_ptr =							\
	rcu_dereference_raw((&__tracepoint_##name)->funcs);	\
if (it_func_ptr) {						\
	__data = (it_func_ptr)->data;				\
	static_call(tp_func_##name)(__data, args);		\
}

Where the issue is that the static_call, which chooses which callback
to make, is updated asynchronously from the update of the array.
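
To make the failure mode concrete, here is a little user-space toy model
(made-up names, not the kernel code): the iterator always takes func and
data from the same array element, while the split update can pair the
still-installed old function with data from the already-published new
array:

#include <stdio.h>

struct tp_func {
	void (*func)(void *data);
	void *data;
};

static void event_probe(void *data)   { printf("event probe,   data=%s\n", (char *)data); }
static void pid_pre_probe(void *data) { printf("pid pre probe, data=%s\n", (char *)data); }

static char event_data[]   = "data meant for the event probe";
static char pid_pre_data[] = "data meant for the pid filter probe";

/* Old array: just the event probe.  New array: the pid filter probe runs first. */
static struct tp_func funcs_old[] = { { event_probe, event_data }, { NULL, NULL } };
static struct tp_func funcs_new[] = { { pid_pre_probe, pid_pre_data },
				      { event_probe, event_data }, { NULL, NULL } };

int main(void)
{
	/* Iterator path: func and data come from the same element, so the
	 * pairing is consistent no matter which array a reader sees. */
	for (struct tp_func *it = funcs_old; it->func; it++)
		it->func(it->data);
	for (struct tp_func *it = funcs_new; it->func; it++)
		it->func(it->data);

	/* Buggy split update: the new funcs array is already published, but
	 * the static_call target still points at the old single probe, so
	 * the old function runs with the new array's first data pointer. */
	void (*static_call_target)(void *) = event_probe;	/* not yet updated */
	static_call_target(funcs_new[0].data);			/* mismatched pair */

	return 0;
}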

>
> >> Now I can finally use:
> >>
> >> trace-cmd record -e all -P $(pidof io_uring-cp-forever)
> >>
> >> But that doesn't include the iou-wrk-* threads
> >> and the '-c' option seems to only work with forking.
> >
> > Have you tried it? It should work for threads as well. It hooks to the
> > sched_process_fork tracepoint, which should be triggered even when a new
> > thread is created.
> >
> > Or do you mean that you want that process and all its threads too that are
> > already running? I could probably have it try to add it via the /proc file
> > system in that case.
> >
> > Can you start the task via trace-cmd?
> >
> > trace-cmd record -e all -F -c io_uring-cp-forever ...
>
> I think that would work, but I typically want to analyze a process
> that is already running.
>
> >> Is there a way to specify "trace *all* threads of the given pid"?
> >> (Note the threads are coming and going, so it's not possible to
> >> specify -P more than once)
> >
> > Right, although, you could append tasks manually to the set_event_pid file
> > from another terminal after starting trace-cmd. Once a pid is added to that
> > file, all children it makes will also be added. That could be a work around
> > until we have trace-cmd do it.
>
> Sure, but it will always be racy.

Not really. It matters what you mean by "racy". You won't be able to
"instantaneously" enable a process and all its threads, but you can
capture all of them after a given point. As you are attaching to a
process that is already running, you have already missed events because
you were not yet tracing. But once you have a thread, you will always have it.

>
> With children, do new threads also count as children, or only fork() children?

New threads. It's the kernel's point of view of a task, which does not
differentiate processes from threads. Everything that can be scheduled
on a CPU is called a "task". How a task interacts with other tasks is
what defines whether it is a "thread" or a "process".
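
A quick way to see that from user space (just an illustration; the output
values depend on your system):

/* A thread made with pthread_create() is its own task: it gets its own
 * tid and shows up as an extra entry under /proc/self/task/, which is
 * why thread creation looks just like fork to the tracing hooks. */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void)arg;
	sleep(1);	/* stay alive long enough to be listed below */
	return NULL;
}

int main(void)
{
	pthread_t t;
	struct dirent *de;
	DIR *dir;

	pthread_create(&t, NULL, worker, NULL);

	dir = opendir("/proc/self/task");	/* one directory entry per task */
	if (!dir)
		return 1;
	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] != '.')
			printf("task (tid) %s\n", de->d_name);
	}
	closedir(dir);

	pthread_join(t, NULL);
	return 0;
}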

>
> > Care to write a bugzilla report for this feature?
> >
> > https://bugzilla.kernel.org/buglist.cgi?component=Trace-cmd%2FKernelshark&list_id=1088173
>
> I may do later...
>

Thanks,

-- Steve