2023-11-22 18:44:12

by Paul E. McKenney

Subject: [BUG] RCU CPU stall warning (heads up)

Hello!

Just FYI for the moment.

I hit the following three times out of 15 ten-hour TREE03 rcutorture
runs on next-20231121, which suggests an MTBF of about 50 hours. This is
new over the past week or so.
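
For reference, these runs boil down to something like the following
kvm.sh invocation, where the flags below approximate, rather than
reproduce exactly, the command lines used for the runs above:

    # From the top level of the next-20231121 source tree: run 15
    # instances of the TREE03 scenario for ten hours each, spreading
    # the guest VMs across all available CPUs.
    tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
        --configs "15*TREE03" --duration 10h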

The symptom is that the RCU grace-period kthread is marked as running
("R"), but remains stuck in schedule() for the remainder of the run.

My next steps will be to retry on today's -next, and if that reproduces
the bug, attempt to bisect.
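
Should it reproduce, the bisection would follow the usual manual
procedure, roughly as sketched below; the good/bad endpoints shown
here are illustrative guesses rather than anything established yet:

    # Rough bisection sketch; the endpoints are assumptions.
    git bisect start
    git bisect bad next-20231121    # first tree observed to fail
    git bisect good v6.7-rc1        # assumed-good earlier baseline
    # At each step: build, run enough TREE03 time to trust the
    # verdict, then mark the commit accordingly.
    git bisect good                 # or "git bisect bad"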

But I figured that I should send this out on the off-chance that it is
a known problem.

Thanx, Paul

[ 456.859071] rcu-torture: rtc: 0000000002666dc0 ver: 14234 tfle: 0 rta: 14235 rtaf: 0 rtf: 14225 rtmbe: 0 rtmbkf: 0/10105 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 39 nt: 792049 onoff: 324/324:329/330 18,4182:5,9107 84652:154698 (HZ=1000) barrier: 1987/1988:0 read-exits: 416 nocb-toggles: 0:0
[ 456.865422] rcu-torture: Reader Pipe: 1075314187 71402 0 0 0 0 0 0 0 0 0
[ 456.867182] rcu-torture: Reader Batch: 1075090167 295422 0 0 0 0 0 0 0 0 0
[ 456.868930] rcu-torture: Free-Block Circulation: 14234 14233 14232 14231 14230 14229 14228 14227 14226 14225 0
[ 464.018060] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 464.019191] rcu: 11-...!: (1 GPs behind) idle=b41c/0/0x1 softirq=0/0 fqs=88
[ 464.020489] rcu: (t=21003 jiffies g=302741 q=71371 ncpus=11)
[ 464.021549] rcu: rcu_preempt kthread starved for 20595 jiffies! g302741 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 464.023457] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 464.025117] rcu: RCU grace-period kthread stack dump:
[ 464.026030] task:rcu_preempt state:R running task stack:14896 pid:14 tgid:14 ppid:2 flags:0x00004000
[ 464.028032] Call Trace:
[ 464.028507] <TASK>
[ 464.028900] __schedule+0x306/0x880
[ 464.029559] schedule+0x2b/0xb0
[ 464.030139] schedule_timeout+0x86/0x160
[ 464.030857] ? __pfx_process_timeout+0x10/0x10
[ 464.031673] rcu_gp_fqs_loop+0x12a/0x5f0
[ 464.032406] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 464.033207] rcu_gp_kthread+0x191/0x240
[ 464.033905] kthread+0xd3/0x100
[ 464.034494] ? __pfx_kthread+0x10/0x10
[ 464.035189] ret_from_fork+0x2f/0x50
[ 464.035837] ? __pfx_kthread+0x10/0x10
[ 464.036534] ret_from_fork_asm+0x1a/0x30
[ 464.037264] </TASK>
[ 464.037665] rcu: Stack dump where RCU GP kthread last ran:
[ 464.038671] Sending NMI from CPU 11 to CPUs 0:
[ 464.039507] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
[ 464.040505] CPU: 11 PID: 0 Comm: swapper/11 Not tainted 6.7.0-rc2-next-20231121 #7188
[ 464.042806] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 464.044842] RIP: 0010:default_idle+0xf/0x20
[ 464.045618] Code: 4c 01 c7 4c 29 c2 e9 72 ff ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 03 a8 36 00 fb f4 <fa> c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[ 464.048992] RSP: 0000:ffffac70400fbef0 EFLAGS: 00000202
[ 464.049952] RAX: ffff9d381f4c0000 RBX: ffff9d380124c740 RCX: 0000000000000000
[ 464.051264] RDX: 4000000000000000 RSI: ffffffffb4afa83b RDI: 000000000c18b414
[ 464.052557] RBP: 000000000000000b R08: 000000000c18b414 R09: 0000000000000000
[ 464.053856] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 464.055169] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 464.056463] FS: 0000000000000000(0000) GS:ffff9d381f4c0000(0000) knlGS:0000000000000000
[ 464.057936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 464.058988] CR2: 0000000000000000 CR3: 000000000204e000 CR4: 00000000000006f0
[ 464.060314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 464.061607] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 464.062907] Call Trace:
[ 464.063378] <IRQ>
[ 464.063754] ? rcu_dump_cpu_stacks+0xf1/0x150
[ 464.064629] ? rcu_sched_clock_irq+0x56e/0xee0
[ 464.065450] ? __kmem_cache_alloc_node+0x44/0x1e0
[ 464.066316] ? rcu_torture_timer+0x4e/0x80
[ 464.067063] ? __pfx_rcu_torture_timer+0x10/0x10
[ 464.067910] ? kvm_sched_clock_read+0x11/0x20
[ 464.068713] ? sched_clock+0x10/0x30
[ 464.069395] ? sched_clock_cpu+0xf/0x190
[ 464.070116] ? update_process_times+0x45/0x70
[ 464.070907] ? tick_nohz_highres_handler+0xb6/0xe0
[ 464.071790] ? __pfx_tick_nohz_highres_handler+0x10/0x10
[ 464.072772] ? __hrtimer_run_queues+0x10d/0x2a0
[ 464.073606] ? hrtimer_interrupt+0xfe/0x240
[ 464.074386] ? __sysvec_apic_timer_interrupt+0x53/0x140
[ 464.075353] ? sysvec_apic_timer_interrupt+0x6b/0x80
[ 464.076270] </IRQ>
[ 464.076655] <TASK>
[ 464.077040] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 464.078021] ? default_idle+0xf/0x20
[ 464.078692] default_idle_call+0x2d/0xe0
[ 464.079420] do_idle+0x180/0x1c0
[ 464.080021] cpu_startup_entry+0x24/0x30
[ 464.080756] start_secondary+0xf7/0x100
[ 464.081479] secondary_startup_64_no_verify+0x178/0x17b
[ 464.082445] </TASK>


2023-12-04 17:11:22

by Paul E. McKenney

Subject: Re: [BUG] RCU CPU stall warning (bisected)

On Wed, Nov 22, 2023 at 10:43:52AM -0800, Paul E. McKenney wrote:
> Hello!
>
> Just FYI for the moment.
>
> I hit the following three times out of 15 ten-hour TREE03 rcutorture
> runs on next-20231121, which suggests an MTBF of about 50 hours. This is
> new over the past week or so.
>
> The symptom is that the RCU grace-period kthread is marked as running
> ("R"), but remains stuck in schedule() for the remainder of the run.
>
> My next steps will be to retry on today's -next, and if that reproduces
> the bug, attempt to bisect.
>
> But I figured that I should send this out on the off-chance that it is
> a known problem.

And the bisection fingered this commit, maybe even rightfully so:

5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")

Next step is to revert this on top of v6.7-rc3 and retest. I will let
you know what happens when the test completes. And the error rate did
drop off during the bisection, so it is possible that there are multiple
causes of this bug. :-/
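
For completeness, the revert-and-retest step is nothing fancy, roughly
as follows (the branch name and the reuse of the earlier kvm.sh flags
are illustrative):

    # Revert the fingered commit on top of v6.7-rc3 (branch name is
    # arbitrary) and rerun the same TREE03 workload as before.
    git checkout -b rcu-stall-revert v6.7-rc3
    git revert 5c0930ccaad5    # hrtimers: Push pending hrtimers away ...
    tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
        --configs "15*TREE03" --duration 10h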

In the meantime, is this a known problem? ("Did I just waste a week
bisecting something that has already been fixed?")

Thanx, Paul